Voice Over IP for Linux Games?
fathom asks: "A few friends and I are attempting to move all of our gaming from the Windows platform to whatever distro of Linux we like. For the most part we've all had great success: just about everything we play is fine under Linux. However, there is one major drawback: we don't know of any software programs for Linux to do Voice Over IP like BattleCom, RogerWilco, and the GameVoice. Are there any programs out there like this for Linux?" Why limit to Linux? What Voice Over IP software can be used for any Unix that's flexible enough to work for other applications as well as games? We did a similar question about a year ago; has anything changed since then?
sipc
Video Conferencing for Linux
Voice over IP technologies are the same as those used for video conferencing, but with audio codecs only. The two VoIP/VideoConf standards for call setup and control are H.323 and SIP.
Hi All,
I wrote a trivial test app using the Java Sound API to make a VoIP program. It didn't implement any kind of standard, and it was completely insecure, but it worked after a relatively small amount of effort and it performed really well.
Java Sound passes just about everything through to the card, so Java vs C didn't really come into play much. All I did was decide that one machine was going to play server, and then everyone who connected to that machine got their byte streams mixed using the Java Mixer and then sent back the mixed stream.
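The mixing step described above can be sketched in a few lines. This is not the poster's Java code; it is a minimal Python illustration that assumes every client sends signed 16-bit PCM at the same sample rate, and the name `mix_pcm16` is hypothetical.

```python
import array

def mix_pcm16(streams):
    """Mix several streams of signed 16-bit samples by summing and clipping."""
    if not streams:
        return array.array("h")
    # Only mix as many samples as the shortest stream provides.
    length = min(len(s) for s in streams)
    mixed = array.array("h", [0] * length)
    for i in range(length):
        total = sum(s[i] for s in streams)
        # Clip to the 16-bit range so loud moments saturate instead of wrapping.
        mixed[i] = max(-32768, min(32767, total))
    return mixed

alice = array.array("h", [1000, -2000, 30000])
bob = array.array("h", [500, 500, 10000])
print(list(mix_pcm16([alice, bob])))  # [1500, -1500, 32767]
```

A real server would probably leave each listener's own stream out of the mix it sends back to them, so nobody hears their own voice with network delay.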
I'm up to my neck in projects right now, but if someone wanted to head it up, I'd submit code and experience. Then we wouldn't have to worry about platform at all.
Jason
Loki had the same sort of problems when they ported Tribes2. They switched over to a freely available GSM encoding (from a university in Germany). It worked so well they're adding the code to the Windows version so you can chat between versions.
Guys, I wasn't trying to be anal, only educate people, because I know a lot of people are not aware there is an actual protocol called VoIP.
The comparison to ftp is entirely accurate.
I never said anyone was wrong, only that we should avoid confusion.
VOIP, or Voice Over IP, has come to mean a specific suite of protocols for providing telco integration, call setup/teardown, etc. We shouldn't use it as a generic term for 'speech over internet' anymore..
ssh has the -C flag that uses the same compression algorithm as gzip. That might make the above scheme barely practical. I know, I know, it was a joke..
I will point out that ssh compression works wonders on a vnc session if the server is running sshd. There is a nice howto on the vnc page for tunneling the connection over ssh. Secure and faster....bonus!
Well, for the year 2001 you may want to use `ssh' instead of `rsh', and /dev/dsp instead of /dev/audio for Linux, but the idea is still the same ...
Almost as much fun as making the Sun next to you belch while the newbie is using it!
GSM, the standard used in cell-phones, works fine with less than 1/5th of that bandwidth (bidirectional in 2.4-2.8kbps), will do realtime compression on a 486, and there are freely available GSM compressors.
There are also public domain encoders for the military voice standards, LPC and LPC10. Those are usable in as little as .96kbps, but they sound horrible in comparison.
.sig: Now legally binding!
How about Speak Freely for Unix?
I have played with it a bit, and it seemed to work, but I haven't actually used it for gaming yet.  It didn't seem as simple to configure and use as some of the windoze voice comm programs, though.
I know when I was looking for tools like Net2Phone and Dialpad.com for linux I found a few SDKs on Freshmeat for VoIP. However, none of them were geared toward gaming such as Rogerwilco and Battlecom. I would love to find a good cross platform "game-geared" VoIP system. Btw - only RWBS is for Linux - no linux clients that I know of
The ultimate network admin tool needs HELP!
I'll step in and answer some of those questions for you :)
First, before we can go into the FFT, we need to discuss the DFT (Discrete Fourier Transform). Of course, the purpose of the DFT is to break down a signal into component waveforms - but how? Well, picture that you have a waveform that is just a sine, let's say, 5 hz across the area you're looking at. Now, picture multiplying that waveform, at every location, times a sine waveform of zero hz, summing those together, and normalizing. What do you get? You get 0. :)
Well, try it with a 1hz signal. You still get 0. Keep trying till you get to multiplying it times a 5hz signal. All of a sudden, you'll notice that wherever the signal is positive, it gets multiplied by a positive value, and wherever it is negative, it gets multiplied by a negative value (making it positive) - so you no longer get 0 as your result. For everything over 5 hz, you'll get 0 again. You just broke down a simple waveform into a sinusoidal component. :)
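The multiply-and-sum experiment with the 5 hz sine is easy to try numerically. The sketch below assumes a one-second window sampled 1000 times; the function name `correlate` and the sample count are illustrative choices, not anything from the thread.

```python
import math

def correlate(signal_hz, probe_hz, samples=1000):
    """Multiply two sines point by point over one second, sum, and normalize."""
    total = 0.0
    for n in range(samples):
        t = n / samples
        signal = math.sin(2 * math.pi * signal_hz * t)
        probe = math.sin(2 * math.pi * probe_hz * t)
        total += signal * probe
    return total / samples

# Probe a 5 hz sine with sines of 0..8 hz.
for probe in range(9):
    print(probe, "hz ->", round(correlate(5, probe), 6))
```

Only the 5 hz probe should give a result that is clearly non-zero (about 0.5 with this normalization); every other probe sums to roughly zero.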
Now, as was discussed in another reply on this thread, to represent *any* signal, you can't just use a sine at even Hz boundaries and steps. For that, you have to use the sum of a sine and cosine. Due to a trick you can use involving imaginary exponents, you can get one part of the data to come out "real" and the other part "imaginary", and correspond to the individual components of your signal. But, since you get twice as much data back (real and imaginary components), it is a poor choice for compression. DCTs and MDCTs are briefly discussed in the other reply as well. The main difference between a DCT and MDCT is for use in block transforms. Also, the difference between a DFT and an FFT is that, in a DFT, you'll find that you're doing a lot of the same calculations multiple times, due to various properties of sines and cosines. FFTs are basically a re-ordering of the calculations so that redundant ones are done less often (optimally, once). There are several different FFT algorithms.
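To make the DFT/FFT distinction concrete, here is a small sketch with a direct DFT next to a recursive radix-2 FFT, which is only one of the several FFT algorithms. It assumes the input length is a power of two; the function names are illustrative.

```python
import cmath

def dft(x):
    """Direct DFT: every output bin recomputes every product, O(N^2)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def fft(x):
    """Radix-2 FFT: same result, but shared sub-results are computed once."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])  # DFT of the even-indexed samples
    odd = fft(x[1::2])   # DFT of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

samples = [1.0, 2.0, 0.5, -1.0, 0.0, 3.0, -2.0, 0.25]
largest_gap = max(abs(a - b) for a, b in zip(dft(samples), fft(samples)))
print("largest difference between DFT and FFT:", largest_gap)
```

The two functions should agree to within floating-point error; the only difference is how much repeated work is avoided.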
Block transforms are to get rid of a nasty side effect of transformed data. If a signal exists in one part of the data, but not another, frequency decomposition has trouble dealing with this. It generally causes "ripples" of energy to appear (the goal of doing transforms, compression-wise, is to concentrate the energy in certain frequencies, and then store them - so, ripples are bad). If you look at a very large sample, many frequencies will start and stop. So, you break it down into blocks - if there's a start or stop of a frequency, it only causes ripples in that section.
This works fine until you start throwing away data on a DCT. Because different data will get thrown away in different blocks, while they'll have the same overall level of quality, there will still be discontinuities between the blocks. The MDCT effectively halves the block size and vastly reduces discontinuities by including an overlapped area in its calculations.
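The overlap idea can be shown with a toy MDCT round trip. This is a sketch under stated assumptions, not code from any real codec: it uses the textbook MDCT formula, a sine window applied both before the transform and after the inverse, blocks of 2N samples that advance by N, and a 2/N scale on the inverse so that overlapping halves add back to the input.

```python
import math

def sine_window(length):
    """Sine window; satisfies w[n]^2 + w[n+N]^2 = 1 for length 2N."""
    return [math.sin(math.pi * (n + 0.5) / length) for n in range(length)]

def mdct(block):
    """Transform 2N (windowed) samples into N coefficients."""
    half = len(block) // 2
    return [
        sum(block[n] * math.cos(math.pi / half * (n + 0.5 + half / 2) * (k + 0.5))
            for n in range(2 * half))
        for k in range(half)
    ]

def imdct(coeffs):
    """Turn N coefficients back into 2N time-aliased samples."""
    half = len(coeffs)
    return [
        (2.0 / half) * sum(
            coeffs[k] * math.cos(math.pi / half * (n + 0.5 + half / 2) * (k + 0.5))
            for k in range(half))
        for n in range(2 * half)
    ]

def roundtrip(signal, half):
    """Window, transform, invert, window again, and overlap-add the blocks."""
    window = sine_window(2 * half)
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - 2 * half + 1, half):
        block = [signal[start + n] * window[n] for n in range(2 * half)]
        rebuilt = imdct(mdct(block))
        for n in range(2 * half):
            out[start + n] += rebuilt[n] * window[n]
    return out

signal = [math.sin(0.3 * i) + 0.5 * math.cos(1.1 * i) for i in range(40)]
rebuilt = roundtrip(signal, half=8)
# Samples 8..31 are covered by two overlapping blocks and should match the input.
print(max(abs(rebuilt[i] - signal[i]) for i in range(8, 32)))
```

Each block on its own comes back with aliasing in it; the aliasing cancels only when neighbouring blocks are added together, which is why errors do not pile up at a hard block edge. The first and last N samples are covered by a single block here and are not reconstructed.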
Before I can discuss quantization, you first have to understand thresholding and the principles of compressing a transformed signal, which were briefly discussed in my original post. After you transform the signal, ideally, your energy is concentrated in specific frequencies. The effect is something like a starburst in the upper left-hand corner of each block that was transformed. Generally, you will still have *some* energy in weak frequencies, but not much. So, you kill them off - generally with a threshold that varies over the human hearing range, in the case of audio. Also in audio, you generally want to take masking effects into account when killing off weak signals. Once your energy is left in strong signals, you need to store how strong. However, while your input signal might have been composed of 8 or 16 bit integers, your output data will generally be high decimal-resolution floating point values. You need to get it back into integers. This is known as quantization. Some schemes simply convert the data back linearly. Some create a table of arbitrary endpoints for what-converts-to-what. Some use a smooth function. There is a lot of debate over what is the best method. I personally recommend, in this case, after seeing the tiny gains made by various other quantization methods, for a huge cpu/complexity cost, using linear quantization.
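Thresholding followed by linear quantization is short enough to sketch directly. The coefficient values, the single flat threshold, and the step size below are made-up numbers for illustration; a real encoder would vary the threshold with frequency as described above.

```python
def threshold(coeffs, floor):
    """Zero out coefficients whose magnitude falls below the floor."""
    return [c if abs(c) >= floor else 0.0 for c in coeffs]

def quantize(coeffs, step):
    """Linear quantization: map each float to the nearest multiple of step."""
    return [int(round(c / step)) for c in coeffs]

def dequantize(levels, step):
    """Recover approximate coefficient values from the integer levels."""
    return [q * step for q in levels]

coeffs = [12.7, -0.3, 4.1, 0.05, -7.9, 0.2]   # hypothetical transform output
kept = threshold(coeffs, floor=0.5)
levels = quantize(kept, step=0.5)
print(levels)                        # [25, 0, 8, 0, -16, 0]
print(dequantize(levels, step=0.5))  # [12.5, 0.0, 4.0, 0.0, -8.0, 0.0]
```

The integers in `levels` are what would be handed to the lossless coder; the error on each surviving coefficient is at most half a step.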
Huffman encoding is typically used to losslessly encode the quantized data. Huffman encoding has proved attractive because they already know what sort of tree they can build to compress the data well. However, I feel that using arithmetic encoding can give *huge* advantages, via frequency prediction. Because, not only do you know the signal density for a given location for an arbitrary signal, you know what it has been like for this *particular* signal in the past, and can scale your probabilities appropriately. (oh, btw, if you want info on how to use huffman or arithmetic encoding, just ask).
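For readers who have not seen it, this is roughly what the Huffman step looks like when the code is built from the symbol counts of one block. It covers only the Huffman side; the adaptive arithmetic coding idea from the post is not implemented here. The input levels are hypothetical quantizer output.

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a prefix code (symbol -> bit string) from symbol frequencies."""
    counts = Counter(symbols)
    if len(counts) == 1:
        return {next(iter(counts)): "0"}
    # Heap entries: (count, tie-breaker, partial code table for that subtree).
    heap = [(count, index, {sym: ""})
            for index, (sym, count) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        count_a, _, table_a = heapq.heappop(heap)
        count_b, _, table_b = heapq.heappop(heap)
        # Merge the two rarest subtrees, prefixing their codes with 0 and 1.
        merged = {sym: "0" + code for sym, code in table_a.items()}
        merged.update({sym: "1" + code for sym, code in table_b.items()})
        heapq.heappush(heap, (count_a + count_b, next_id, merged))
        next_id += 1
    return heap[0][2]

levels = [0, 0, 0, 0, 1, 0, -1, 0, 0, 2, 0, 0]  # mostly zeros after thresholding
code = huffman_code(levels)
bits = "".join(code[s] for s in levels)
print(code)
print(len(bits), "bits instead of", 2 * len(levels), "with a fixed 2-bit code")
```

Because thresholded data is dominated by zeros, the zero symbol ends up with a one-bit code and the block shrinks from 24 bits to 17.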
Anyways, I better get back to work. Ciao!
- Rei
You know when it's okay to shout fire in a crowded theatre? When it's on fire.
I just did a quick check on human voice ranges. Vowel sounds contain notable power at frequencies as low as 50 Hz. The sibilants and fricatives, s and f, contain significant amounts of power at frequencies as high as 8,000 Hz. Frequencies above 1,000 Hz contribute the most intelligibility to speech, even though only 16% of the energy lies above that frequency (Chapanis, 1996). Using such an arbitrary method of thresholding as truncation is *far* from optimal in signal compression. Perhaps it has some advantages in analog signal transmission - I don't deal much with that, and that seems to be what you deal with, so perhaps there are applications for it there - but when it comes to encoding digital signals for compression with your goal being the clarity vs. size ratio, that is a very poor method. A quick look at human hearing research shows that humans have vastly different sensitivities over different ranges, and just "throwing away" data outside of certain ranges regardless of signal intensity, and keeping data within those ranges regardless of how weak it is, will ruin your clarity vs. size ratio. That's why you won't see a single, good encoder, whose target is speech or music, that does this. Please, if you can present an example to the contrary, please do so :).
- Rei
P.S. - for those who care, the cpu cost of having a varying threshold over various ranges, instead of a constant one, is negligible compared to the time it takes to do the MDCT, quantize, encode, etc.
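A varying threshold of the kind argued for here can be as simple as a piecewise-linear curve looked up per frequency bin. The sketch below compares it with a hard 500-3000 Hz cut. The curve points and the bin magnitudes are invented for illustration and are not measured hearing data; all names are hypothetical.

```python
def interpolate(points, x):
    """Piecewise-linear lookup through sorted (x, y) points, clamped at the ends."""
    if x <= points[0][0]:
        return points[0][1]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return points[-1][1]

# Hypothetical curve: (frequency in Hz, minimum magnitude worth keeping).
CURVE = [(50, 8.0), (500, 2.0), (3000, 1.0), (8000, 4.0)]

def curve_threshold(bins):
    """Keep a bin only if it is strong enough for its own frequency."""
    return [(f, m) for f, m in bins if m >= interpolate(CURVE, f)]

def band_truncate(bins, low=500, high=3000):
    """Keep everything inside the band and nothing outside it."""
    return [(f, m) for f, m in bins if low <= f <= high]

bins = [(120, 9.0), (700, 0.4), (1500, 3.0), (6000, 5.0)]  # (Hz, magnitude)
print(curve_threshold(bins))  # keeps 120, 1500 and 6000 Hz
print(band_truncate(bins))    # keeps 700 and 1500 Hz
```

With these numbers the hard cut throws away the strong 120 Hz and 6000 Hz components and keeps the weak 700 Hz one, while the curve does the opposite. The per-bin cost is one interpolation, which is small next to the transform itself.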
P.P.S - Any specific URLs from those organizations I should check out? I'm always looking for a good distraction from work ;)
Actually, that isn't quite true. "All signals that come out of my microphone are real in the time domain so the frequency domain spectrum is symmetric". If you'll look at a raw audio file, you'll notice that it is indeed not symmetric. I'm not sure how familiar you are with signal transforms, so I'll back up a bit. "real" and "imaginary" components are really just mathematical tricks, based on the property that e^(X*i) = cos(X) + i*sin(X) (or is it the other way around? :) ). The 'i' merely acts as a placeholder, it doesn't actually mean that the frequencies themselves are imaginary. By using exponential math, we can simply add to multiply.
A sine which is contained completely in a certain frequency range, like a cosine, cannot store phase information - it requires both of them. Now, of course, you can extend the waveform in question so that it isn't completely contained in a certain frequency range - but that is no longer an FFT, but a DCT.
FFTs are useful because they evenly separate signals, and are quite fast. By computing the magnitude of a certain frequency's complex component, you can do windowing quite nicely to tell where your signals are. But, this magnitude alone is not enough to accurately reproduce the original signal with phase information. And, without phase information, cancellation effects can be very bad in the worst case, in fact, to the point of completely messing up your block.
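The point that magnitude alone cannot reproduce the signal is easy to check on a tiny block. This sketch uses a direct DFT on an eight-sample block holding a single click; the block and the function names are illustrative only.

```python
import cmath

def dft(x):
    """Direct DFT of a short block."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse DFT, returning the real part of each output sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]

signal = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]   # a click at position 1
spectrum = dft(signal)
magnitudes = [abs(c) for c in spectrum]              # phase thrown away

rebuilt_full = idft(spectrum)
rebuilt_no_phase = idft(magnitudes)
print([round(v, 6) for v in rebuilt_full])
print([round(v, 6) for v in rebuilt_no_phase])
```

With the full complex spectrum the click comes back at position 1. With magnitudes only, every bin has the same strength as before but the click lands at position 0, because where it sits in the block was carried entirely by the phase.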
Your example was really a DCT, but using sines instead of cosines.
- Rei
Sorry, some of us have lives on the weekends ;) My humblest apologies for having better things to do on a Saturday night than to debate gun control with someone who loves to continue to dodge the simple question, "Do you think the US legal system is wrong 50 times as often as it is right?", by making pseudo-statistical arguments without overcoming the sheer numbers, and nitpick choices of examples in arguments without hitting the core of them (I.e., "just because something else is worse doesn't justify something bad"). But, regardless, this thread is about audio compression :)
When you do a block DCT or MDCT on an audio signal, you're not looking at a whole page of text's worth. You generally look at a fraction of a second. Speech has little redundancy at this level. However, that isn't what I was referring to. Do you have any background in audio compression? There are two keys in compressing audio using current methods: Frequency masking and signal response. Frequency masking is the fact that when the human ear hears a strong signal, weak signals that are near it in frequency seem to "disappear" or "merge" with the stronger signal. Signal response (hearing response, frequency sensitivity, etc), is how good, overall, the human mind/ears are at hearing weak signals at various frequencies across the spectrum. By a careful knowledge of these, in music or voice, you can kill off many more frequencies than without it. However, it also is a big CPU consumer to do it very carefully. Cutting out some of the analysis can save you a good bit of CPU - and, in the case of human voice, which tends to be in a very audible range with few masking effects, won't affect your compression rate much.
Second, please, if you can create a good sounding speech synth - especially one that can give inflections, emotion, etc - please, please share it with us. Until then, good luck having something like this work (simply neglecting CPU issues) without sounding like a 50s robot that messes up once a second.
Oh, and to answer your theory about masking out frequencies below 500hz and above 3000hz: No. That will sound so unbelievably awful. First off, let's neglect the fact that someone with a voice like Barry White would be inaudible, and that you'd never hear a 't' or 's'. Ignoring that, that's a silly way to do it. You need a simple curve, even just a simple line graph. It takes little CPU time, and will actually be able to reproduce the original sound well. Arbitrary truncation points are unbearably bad.
Next, you seem to be of the notion that MP3 encoders are "tweaked towards music". MP3 encoding is a fairly arbitrary term. MP3 is a specific format for encoding streamable, quantized, transformed data. You can use any truncation scheme you want - even the silly one you proposed. Most encoders you'll find are tweaked towards the human hearing range - an optimal choice for both voice and music (especially voice, though! Voice compresses very well, because, compared to music, it has most of its energy concentrated in a few signals at any given time).
Next, why use "logarithmic encoding" for compression? Logarithmic encoding is a (poor) way to store raw (uncompressed) audio data - it sacrifices low-level clarity for the ability to represent very loud signals - something seldom of use in normal audio compression applications (have you ever noticed how quiet signals on an 8-bit sound card are very crackly, but the loud ones are clear? That's the sort of effect logarithmic encoding gives to sound). It is useful in efficient Pulse Code Modulation (PCM) of data for maximizing the number of transmissions over a small number of physical channels, but doesn't even begin to apply as far as storing quantized data is concerned (that would be like using a bubble sort to compute Pi or something ;) ). Do you mean "arithmetic encoding", perhaps? I have a neat theory for using that, with scaled probabilities, to create an optimal compression ratio (predictive) for thresholded, quantized data, that I came up with after the last slashdot conversation on compression (it was video compression then).
Please... if you're qualified to discuss audio compression, how about the basics? Do you know how to compute an FFT? Do you know why you wouldn't use an FFT for audio or video compression? What about a DCT? MDCT? What do you know about quantization schemes? The advantages/disadvantages to storing quantized data with huffman encoding vs. arithmetic encoding? Have you ever written a single signal processing function? (I've written a whole library). Do you know anything about the subject at all?
If you don't know what you're talking about, please don't be suggesting encoding schemes. There are enough bad ones out there already ;)
- Rei
P.S. - sorry if I seem a bit bitchy. For some reason, they decided to leave us without air conditioning today at work ;)
That's why there is SDL. It uses DirectX on Windows and DRI on X (as well as many other graphics layers / OS's).
I think the problem with Windows developers in general is that they don't think of coding crossplatform in the first place. It's easy to understand why: they are taught DirectX and MFC, and Windows has a huge percent of the desktop market. Also, some games are coded so horribly (compare the duct-tape-that-is-EverQuest to any Blizzard product) that porting them looks like it would be a nightmare.
On the other hand, I think Linux developers are more trained to code portably. With all the unix flavors out there, source portability is already a must. It also seems that these developers care about porting to Windows. Many apps for X are available on Windows (like a lot of the Gtk stuff), but not the other way around.
So Linux developers actually care about portability, but Windows developers do not. Maybe we can convince them to change their ways?
Surely the Windows developers out there don't thoroughly enjoy Windows-only programming, do they? I've used DirectX, and it was ludicrous. It isn't direct at all (Come on, DirectMusic? DirectPlay? Direct is just a buzzword..) and the classes are a mess. I haven't heard much good about MFC either, but I've heard only good about Qt (and I've used both).
Qt works on Windows. There's no reason to use MFC. Yes it does cost money, but aren't we talking about real game companies here? SDL works on Windows. There's no reason to use DirectX "directly" (whatever that means). You know how long it would take to port Windows apps/games to Linux that were all written in Qt and SDL? All of a recompile.
It's targeted at game programmers, to be integrated in-game, as a cross-platform alternative to Microsoft's DirectPlay and DirectPlay voice, but could be used to do a stand-alone VOIP app as well (though I am not aware of any currently).
Most modern computer games use Direct3D and DirectX a great deal. These libraries are not portable, and they are what most developers have experience with. You are asking, in essence, for developers to either have two devteams working in parallel, or one team programming both versions. In either case, the game company can either completely rewrite the game engine for each OS, or it can create one highly portable engine. The problem with the latter option is that DirectX in particular really is the backbone of Windows gaming. It would be very hard to convince developers to give it up, and I'm not sure "DirectX free" Windows games could match the performance of their Windows-only counterparts.
Also, have you considered the expense of training all these developers in Linux? Remember, most of them do not have Linux experience.
Finally, when you consider that Windows controls 90something percent of the desktop gamer market, it just doesn't make sense for a company to pour massive resources into developing Linux and Windows games simultaneously that only a relatively small number of people would buy. At least a dedicated porting company like Loki doesn't have to worry about graphic artists, level designers, story writers, or game design as a whole.
I'm the stranger...posting to