Codec2 — an Open Source, Low-Bandwidth Voice Codec
Bruce Perens writes "Codec2 is an Open Source digital voice codec for low-bandwidth applications, in its first Alpha release. Currently it can encode 3.75 seconds of clear speech in 1050 bytes, and there are opportunities to code in additional compression that will further reduce its bandwidth. The main developer is David Rowe, who also worked on Speex. Originally designed for Amateur Radio, both via sound-card software modems on HF radio and as an alternative to the proprietary voice codec presently used in D-STAR, the codec is probably also useful for telephony at a fraction of current bandwidths. The algorithm is based on papers from the 1980s, and is intended to be unencumbered by valid unexpired patent claims. The license is LGPL2. The project is seeking developers for testing in applications, algorithmic improvement, conversion to fixed-point, and coding to be more suitable for embedded systems."
I'll be presenting on Codec2 at the ARRL/TAPR Digital Communications Conference this weekend in Vancouver, Washington, near Portland. I'll try to get the video online.
Bruce Perens.
The original rationale for Codec2 is at Codec2.org. I've been promoting this issue for about four years, as I was bothered by the proprietary nature of the AMBE codec in D-STAR. But I didn't have the math, etc., to do the work myself. It was really fortunate that David became motivated to do the work without charge. He has a Ph.D. in voice coding. By the way, look over his web site rowetel.com for the other work he's done: two really nice Open Hardware projects - a PBX and a mesh telephony device, an Open Source echo canceler for digital telephony, used in Asterisk and elsewhere, and his own electric car conversion. He'd be my nomination for the MacArthur grant.
Bruce Perens.
What is the compression ratio for more interactive communication, e.g. 20 ms sampling time instead of 3-4 seconds?
Speex: Speex is based on CELP and is designed to compress voice at bitrates ranging from 2 to 44 kbps. Some of Speex's features include: Narrowband (8 kHz), wideband (16 kHz), and ultra-wideband (32 kHz) compression in the same bitstream
BLACK KNIGHT SECURITY SYSTEMS
We'll bite your legs off
I assume it's acceptable... but it angers me that someone thought it was relevant to give the exact number of bytes for a seemingly arbitrary 3.75 seconds of audio, but failed to say how long it takes to encode that 3.75 seconds of audio, or what average latency can be expected after buffer conditions are met.
As a newly licensed ham in an area where D-STAR repeaters are everywhere (VK) and a free software advocate, I have recently become aware of the issues with D-STAR and have been reading about this work, so it's quite surreal to have it pop up on /. in the week where I get my licence.
I haven't had a chance to read the D-STAR specifications, but I am wondering if the voice codec is flagged in the D-STAR digital stream, and if it would be possible to create translating repeaters.
That is, dual-output repeaters with differently coded data streams. It'd take more spectrum but would also allow for a migration path (at least for repeater users?)
Jean-Marc Valin is on the project mailing list and David is another Speex developer and the person Jean-Marc recommended to me. We are trying for an improvement over Speex at low rates.
Bruce Perens.
>3.75 seconds of clear speech in 1050 bytes
That's 2240 bps, 2.19 kbps, quite impressive. Maybe one day they can beat MELP (down to 600 bps in its lowest-rate variant) and remain open.
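The arithmetic behind that figure is easy to check. What follows is only a sketch: the function name is made up for illustration, and the inputs are the two numbers quoted from the summary.

```python
def bitrate_bps(n_bytes, seconds):
    """Average bitrate in bits per second for n_bytes sent over `seconds`."""
    return n_bytes * 8 / seconds

rate = bitrate_bps(1050, 3.75)   # figures quoted in the summary
print(rate)                      # 2240.0
print(rate / 1024)               # 2.1875, i.e. "2.19 kbps" if 1 k = 1024
print(rate / 1000)               # 2.24 if 1 k = 1000
```

Whether it comes out as 2.19 or 2.24 kbps depends only on which "k" is used.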
Excellent work.
I use digital almost exclusively and have wondered when a suitable open source voice project would emerge. I look forward to seeing it developed further. Tim VK4YEH
I hope this takes off. It would be great to have a good OSS voice codec for amateur radio.
I didn't see it mentioned when quickly scanning TFA, but how does this codec handle packet loss?
It is all well and good to develop a codec that crams as much speech as possible into as few bits as possible, but in this case, one lost packet could mean a gap of several seconds. The success of a low-bandwidth codec, at least when it comes to IP telephony, also depends on how well it can handle lost packets. Low-bandwidth codecs are usually used on low-bandwidth networks, such as the internet, and that is where packet loss is highest.
Same goes for delay and jitter, by the way. If a stream of packets is delayed, and more voice is crammed in fewer bits, then the delays in the voice stream will get longer too.
It encoded those 3.75 seconds in 0.06 seconds and decoded in 0.04 seconds on my AMD Phenom 9750 2.4 GHz, one core only, compiled with GCC and the -O3 switch. That's all of the overhead of the program starting and exiting, too. It's using floating, not fixed point.
This, it seems, bodes well for low latency of the final implementation on a DSP chip.
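For anyone who wants those timings as a ratio, here is a rough sketch. The helper names are made up, and the 20 ms frame length is an assumption taken from elsewhere in this discussion, not something measured here.

```python
def realtime_factor(audio_seconds, processing_seconds):
    """How many seconds of audio are handled per second of wall-clock time."""
    return audio_seconds / processing_seconds

def per_frame_ms(audio_seconds, processing_seconds, frame_ms=20.0):
    """Average processing time per frame; frame_ms is an assumed frame length."""
    frames = audio_seconds * 1000.0 / frame_ms
    return processing_seconds * 1000.0 / frames

print(round(realtime_factor(3.75, 0.06), 1))  # encode speed relative to real time
print(round(realtime_factor(3.75, 0.04), 2))  # decode speed relative to real time
print(round(per_frame_ms(3.75, 0.06), 2))     # encode cost per assumed 20 ms frame
```

That works out to roughly 62 times real time for encoding and 94 times for decoding on that machine, with program startup and exit included in the measured times.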
Bruce Perens.
At such high compression rates, one could wonder if the optimizations to transmit clear speech make assumptions about the language used. Does it work well with French? Arabic? Chinese?
The Wise adapts himself to the world. The Fool adapts the world to himself. Therefore, all progress depends on the Fool.
"This is Snake, I am in front of the disposal site..."
One of the fastest ways to ensure its testing and distribution is to use it in Mumble - the low latency voice chat software (with an iPhone client as well).
Mumble is typically used by gaming clans for their chat rooms, so Codec2 would be tested in real-life conditions.
1050 bytes for 3.75 seconds of speech is the equivalent of 2240 bits per second, good enough that an old-school 2400 baud modem would be able to transfer speech in real time. Impressive. But I seem to recall that the speech synthesizer of the TI-99 stored voice audio in as little as 1200 bits per second. It was well-documented enough that TI emulators emulate the speech synthesizer as well. But the sound quality left something to be desired, which is probably one area where Codec2 shines. I've listened to the example files and the sound quality seems fine; I can't tell the difference in audio quality between source and target files. Partially this may be because the source material already seems to be bandwidth-limited, probably low-pass filtered for an 8 kHz sample rate as is common for telephony applications.
Visit http://ringbreak.dnd.utwente.nl/~mrjb/growingbettersoftware to download your free copy of the book
I didn't dig completely into your site, but was just wondering if groups are doing work on speech recognition algorithms on your compressed bitstream? Is this an active area of research?
codec2, the... second codec.
What's up with the source code repository? I keep getting a variety of "repository temporarily relocated" messages when I try checking out the source.
the vocal tract is more or less the same for every human
Different languages use different parts of the vocal tract. If a language distorts clicks, it won't pass Zulu or the Bushmen languages. Languages also make different distinctions on the parts of the vocal tract they do use. If a codec distorts pitches, it won't pass intelligible Cantonese, Yoruba, or Mandarin.
There used to be differences in the range of sounds used by each language, but today that is not the case, thanks to advances in communications. E.g., Japanese speakers have incorporated sounds from other languages, and those sounds are no longer alien.
The only foreign sounds that have been fully assimilated into the phonology of Japanese are the 'y' compounds (e.g. "kyo", "hya", "chu" (phonemically "tyu")), borrowed a long time ago from a Chinese language. Otherwise, there's still a lot of rounding-to-the-nearest-phoneme that goes on: Merry "Kurisumasu". (If we killed Santa, would it be "Kurisumashita"?)
This post, and most of the comments, flew right over my head. I'm just wondering, is there any way someone could use this to make a ST:TNG-style commbadge? Or just some very long-range walkie-talkies?
Bruce, have you guys done any testing of performance in the presence of background noise? I know that in the PMR area, there are a lot of firemen who are very unhappy with what happens to AMBE when background noise (e.g. saws, Personal Alert Safety System, fire) gets into the mike. While AMBE does OK at encoding just speech, throw the noise of a saw in the background and all you get is garbage.
While the initial application of CODEC2 is hams in their shacks with their noise-canceling mikes, It Would Be Nice If the vocoder didn't curl up its toes and die in a noisy environment.
See "Urgent Communications", September 10th edition, page 10, "Round 2 of digital radio fireground tests held", and the test plan.
www.eFax.com are spammers
Who wants to be the first to make a web service based on this codec and 3.75-second messages? :-)
Dear Slashdot: next time you want to mess with the site, add a rich-text editor for comments.
Actually - it runs at 2520 baud! (1050 bytes / 3.75 sec) * (9bits) = 2520 bps [One stop - no parity]
Does anybody know if this has any application in voice recognition?
I have been working on mod_codec2.c for FreeSWITCH, which is committed in a WIP module. The library for codec2 isn't a library at all just yet. I'm working with David and Bruce to make sure we can get a working libcodec2 in place ASAP so we have a real VoIP demo that people can compile, call and test against. /b
I fail to see the point of this for realtime telephony applications.
With the best generally available Speex codecs today, the per-packet payload of actual information is not much bigger than the protocol header (on the order of 60-70 bytes total size per packet). It can't get any bigger without introducing unacceptable latency from RT delays. (Computers have to wait for enough voice information to put it in a PACKET and send it out over the wire.)
Throw in IPv6 with current codecs and the IP protocol header actually becomes larger than the payload itself. It sounds great on paper, and for some applications (streaming, circuit-switched / compatible carriers) it can be quite useful... for voice applications it's kind of pointless without aggressive header compression. It seems to be just too far over the edge in terms of diminishing returns, where the IP header rather than usable signal ALREADY dominates.
I had the same reaction, given the slashdot summary, but if you read the actual web page it's 20ms samples. You still have the problem of how to wrap it in IP packets, if you're going to do that, which gets much more annoying on low bit rate codecs. Take your 51-bit sample, pad it to 7 bytes, add 20 bytes of UDP RTP headers, 20 bytes of IP headers, maybe some IPSEC for fun, etc., maybe some Ethernet headers.... Obviously if you're actually trying to run over a slow transmission system, you're more likely to just run it as raw bits, or at least raw bytes, and maybe use CSLIP or something.
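To put numbers on that overhead, here is a back-of-the-envelope sketch. The function name is invented for illustration; the frame size, frame length and header sizes are the ones quoted in this comment (51 bits, 20 ms, 20 bytes of UDP plus RTP, 20 bytes of IP), and IPSEC and Ethernet framing are left out.

```python
import math

def on_wire_bps(frame_bits, frame_ms, header_bytes, frames_per_packet=1):
    """Bits per second on the wire when frames are wrapped in packet headers.

    Frames bundled into one packet are packed bit-wise and the payload is
    rounded up to a whole number of bytes.
    """
    payload_bytes = math.ceil(frame_bits * frames_per_packet / 8)
    packet_bytes = payload_bytes + header_bytes
    packets_per_second = 1000 / (frame_ms * frames_per_packet)
    return packet_bytes * 8 * packets_per_second

codec_bps = 51 * 1000 / 20                        # 2550.0 bps of codec data
one_frame = on_wire_bps(51, 20, header_bytes=40)  # 47-byte packets, 50 per second
five_frames = on_wire_bps(51, 20, header_bytes=40, frames_per_packet=5)

print(codec_bps, one_frame, five_frames)          # 2550.0 18800.0 5760.0
```

With one frame per packet the headers push the stream to more than seven times the codec rate. Bundling five frames per packet brings it down to 5760 bps, at the price of waiting 100 ms to fill each packet.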
Bill Stewart
New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
20ms samples, 51 bits, 2550 bits/sec. Are you sure about the 40 frames/sec vs. 50? Maybe he's doing that to get it under 2400 bps?
Bill Stewart
FreeSWITCH now has CODEC2 support just checked in. Please check it out! http://www.freeswitch.org/ /b
Perhaps we need a "timing" section on the website? I tend to be more interested in low-power machines.
---
1.5 GHz single-core C7 CPU (while some other stuff is running)
time ./c2enc ../raw/hts1a.raw hts1a_c2.bit
real 0m0.566s
user 0m0.560s
sys 0m0.004s
time ./c2dec hts1a_c2.bit hts1a_c2.raw
real 0m0.495s
user 0m0.328s
sys 0m0.004s
---
And here's an Acer Aspire with an Atom N450 (1.66 GHz single-core CPU, 512 KB cache, some other stuff is running)
time ./c2enc ../raw/hts1a.raw hts1a_c2.bit
real 0m0.451s
user 0m0.420s
sys 0m0.004s
time ./c2dec hts1a_c2.bit hts1a_c2.raw
real 0m0.251s
user 0m0.244s
sys 0m0.004s
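If a "timing" section does get added, something like the following could collect comparable numbers. It is a sketch under assumptions: the helper names are made up, and the duration calculation assumes the .raw input is 16-bit mono at 8 kHz, which is not stated in this thread. The c2enc command line in the comment is the one used above.

```python
import subprocess
import sys
import time

def raw_duration_seconds(n_bytes, sample_rate=8000, bytes_per_sample=2):
    """Playing time of headerless PCM (assumed 16-bit mono at 8 kHz)."""
    return n_bytes / (sample_rate * bytes_per_sample)

def time_command(argv):
    """Run a command and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(argv, check=True)
    return time.perf_counter() - start

if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere; for a real measurement use
    # e.g. ["./c2enc", "../raw/hts1a.raw", "hts1a_c2.bit"] as in the runs above.
    elapsed = time_command([sys.executable, "-c", "pass"])
    print(f"{elapsed:.3f} s wall clock")
```

Dividing raw_duration_seconds(os.path.getsize(path)) by the measured time would give a speed relative to real time, which is easier to compare across machines than raw seconds.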
I wrote to Bruce Perens and David Rowe the following email early July 2010. David Rowe responded, understandably, that his primary area of interest and knowledge is really just in the voice codec itself, and he had no specific comment regarding my proposals for modulation and FEC. Bruce Perens never replied to me, which was a disappointment. Perhaps my note never made it past his spam filter.
So, since there is active discussion going on here (with many folks who know much more about signal engineering than me), I am posting the email in its entirety, and I sincerely invite comment about my proposals regarding modulation and FEC (particularly to point out anything which is factually incorrect).
Perhaps Bruce would be willing to comment as well.
====== email below ======
Gentlemen,
I am a newly licensed amateur radio operator, and have read with interest in the past weeks about different modes of digital radio. Having read about D-Star, I recognized the need for an open alternative to AMBE, and then was pleasantly surprised to have run across both of your codec2 project sites.
I wanted to share some of my thoughts regarding this project, especially with respect to the longer-term design goals.
-----
VOICE CODEC
I read an old post by David, either on his blog or in a listserv archive, where he asked whether it would be better to (1) release an alpha version of the codec NOW and risk turning people off with poor audio quality or (2) wait for better audio quality before the first release but risk people losing interest and drifting away. I favor early release because there would be more time for (parallel) development of other software around the codec, time to flush out bugs in the protocol, on-air tests could be conducted, etc. Hardware could even be started in the meantime; the final codec version could later be loaded onto an FPGA.
-----
PROTOCOL DESIGN
Although I certainly see the allure of keeping the existing modem and plugging in a daughterboard to swap AMBE for codec2 (i.e., modding an existing D*Star HT or mobile), I suggest that a redesign of the protocol altogether may be far more beneficial in the long run.
Presently, as I understand it, D-Star DV allocates its 4800b/s as follows:
+ 2400 AMBE voice
+ 1200 of 2/3 FEC ( actually the voice + FEC are included together in a single frame)
+ 1200 data (no FEC!)
(obviously header/overhead/sync is a part of the 1200 data)
= GMSK modulated it occupies ca 6.25kHz.
I would like to see it done this way instead:
+ header with id, routing?, and _specified division of voice and data_
+ (3200 voice) OR (3200 data) OR ( 2400 voice + 1200 data)
+ 1600 FEC for all of above
= QPSK modulated it could occupy as little as 3kHz bandwidth (at eff 1.6 ; theoretical max for QPSK is 2 = 2.4kHz bw).
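The bandwidth figures above follow from dividing the bit rate by the spectral efficiency. A small sketch of that arithmetic, using made-up names and only the numbers from this email, is below. It also adds up the three proposed payload splits. As written, the voice + data split carries 3600 bps of payload, so with 1600 bps of FEC it totals 5200 bps rather than 4800; the email does not say how that is meant to be reconciled.

```python
def occupied_bandwidth_hz(bitrate_bps, bits_per_second_per_hz):
    """Bandwidth needed for a bit rate at a given spectral efficiency."""
    return bitrate_bps / bits_per_second_per_hz

print(f"{occupied_bandwidth_hz(4800, 1.6):.0f} Hz")  # 3000 Hz at efficiency 1.6
print(f"{occupied_bandwidth_hz(4800, 2.0):.0f} Hz")  # 2400 Hz at efficiency 2

FEC_BPS = 1600
payload_options = {
    "voice only": 3200,
    "data only": 3200,
    "voice + data": 2400 + 1200,
}
totals = {name: bps + FEC_BPS for name, bps in payload_options.items()}
for name, total in totals.items():
    print(f"{name}: {total} bps including FEC")
```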
-----
OTHER THOUGHTS
1. If we used QPSK modulation with spectral efficiency of 1.6, we could increase the total data rate to 9600bps while maintaining the same occupied bandwidth of ca. 6kHz.
2. My proposed header specifies that the voice rate is [3200] or [2400] (or, if we bump the total rate to 9600, we could have a maximum of 6400 bps voice data rate with 2/3 FEC) ; the receiver would implement the correct decoding. I don't know how tolerant the current algorithms are of changing the codec bitrate midstream (Speex has VBR though, so I suppose it could work).
3. No. 2, above, is important as data could be streamed intermittently, irrespective of whether or not a voice transmission is taking place.
4. With good filters, 9600 bps could still be GMSK modulated and occupy a standard FM channel width of 12.5kHz but I suppose this is dodgy.
5. I am not as familiar with the tradeoffs of GMSK v. QPSK (power efficiency, SNR, complexity?) and I'm sure there is a good reason GMSK was chosen for D-Star.
6. FEC.
a. D-Star includes no FEC on the header or data frames! The FEC is an integral part of AMBE.
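The "2/3" in this email reads as the usual code rate, meaning payload bits divided by payload plus FEC bits. A short sketch with invented helper names, using only the rates quoted above:

```python
from fractions import Fraction

def code_rate(payload_bps, fec_bps):
    """Fraction of the protected stream that is payload."""
    return Fraction(payload_bps, payload_bps + fec_bps)

def payload_at_rate(total_bps, rate):
    """Payload bit rate left over once FEC at the given code rate is applied."""
    return total_bps * rate

print(code_rate(2400, 1200))                  # 2/3, D-Star voice plus its FEC
print(code_rate(3200, 1600))                  # 2/3, the proposed 3200 + 1600 split
print(payload_at_rate(9600, Fraction(2, 3)))  # 6400, the 9600 bps case in item 2
```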
It looks like an Awesome addition to Asterisk. Of course, no other platforms will support it (ATAs, IP phones or carriers), but for Asterisk-to-Asterisk traffic, it would fucking rock. IAX2 trunking + this codec = WIN.
WTF am I doing replying to an AC at 5 A.M on a Friday night?
I think you are not using the definition of latency that most in the field would use.
Latency is how long it takes to process the data. It's a computer science type of thing. If you understand Knuth and his tape drive sorting examples, this is pretty obvious...
For example, here's a nice, simple, hopelessly useless codec that has almost exactly 100 ms of latency:
I've had great difficulty explaining this even to folks who write and maintain codecs, which is a fairly odd thing to encounter. There are fundamental design decisions about how you buffer, error correct and frame the data which can have a huge impact on latency. For example, a framing format which likes to bundle packets into pages will multiply latency by that amount. Using CRC/ECC creates a sequence point where all data up to there must be transmitted/received before it can be trusted to be decoded. (That last point is one of my biggest issues with Ogg's design, for example, which should really have separate CRC per packet, not page)
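The multiplication described above can be sketched in a couple of lines. This is only an illustration with made-up names: it counts the time spent waiting for audio to fill a page and ignores processing and transmission time. The 20 ms frame length is just an example value.

```python
def buffering_latency_ms(frame_ms, frames_per_packet=1, packets_per_page=1):
    """Audio that must be captured before a complete page can be sent."""
    return frame_ms * frames_per_packet * packets_per_page

print(buffering_latency_ms(20))                      # 20: one frame, sent at once
print(buffering_latency_ms(20, packets_per_page=5))  # 100: same codec, paged framing
```

The codec is identical in both cases; only the framing changes, and the floor on latency goes up fivefold.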
The most fundamental of all design decisions is trading off latency vs quality. The most obvious case is buffering an entire stream before deciding on the coefficients to most efficiently compress it. At the other end, you encounter issues with short bursts that need higher bitrate: if you don't buffer at all, you're stuck with fixed frame size regardless of content.
In fact, this is one of the reasons why broadcast video tends to be worse quality than stored/web content. They can't buffer up enough data to get better prediction.
I do like the example of the entire-message PTT and I will use that to illustrate the concept in future :)
To pick an older one that I'm familiar with, consider the SHARC. Floating point is plenty fast. Assuming you don't take advantage of weird features available only to an assembly language programmer, floating point will be faster than fixed point. (there is special fixed-point hardware that the C compiler will not take advantage of)
The PowerPC "G4" is nearly a DSP, especially if you ignore the MMU. There again, floating point is fast. You get a throughput of 4 floating-point operations (even fused-multiply-add) per cycle.
The danger with floating point is that lots of DSP chips do badly with denormalized floating point numbers. These numbers map to zero or are slower. You can find that code suddenly runs slow when the audio is nearly silent.
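A quick way to see where this bites: let a filter state decay toward silence and count how long it lingers in the denormal range before reaching zero. The sketch below uses Python doubles and made-up names. It will not show the slowdown itself, which depends on the chip, only when the values become denormal and how a flush-to-zero check (one common workaround) avoids them.

```python
import sys

MIN_NORMAL = sys.float_info.min  # smallest positive normal double

def is_subnormal(x):
    """True for nonzero values too small to be stored as a normal double."""
    return x != 0.0 and abs(x) < MIN_NORMAL

def decay_steps(flush_to_zero=False):
    """Halve a state from 1.0 until it hits 0.0; count denormal values seen."""
    state, subnormal_steps = 1.0, 0
    while state != 0.0:
        state *= 0.5
        if flush_to_zero and abs(state) < MIN_NORMAL:
            state = 0.0  # snap tiny values straight to zero
        if is_subnormal(state):
            subnormal_steps += 1
    return subnormal_steps

print(decay_steps())                    # 52 denormal values on the way down
print(decay_steps(flush_to_zero=True))  # 0
```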
Is there a GStreamer plug-in available? I had to use GStreamer to stream radio from my house to a museum where we were doing an Amateur Radio demonstration (noise floor at the museum was -70dBm, with interfering carriers near the HF bands of over -30dBm), and was using Speex to do it, but I had to really fight with it to get a low enough latency. If Codec2 were to be a GStreamer plug-in by January I'd be a happy guy (yes, I'd offer to do the work, but I am drowning at work right now...)