Voice Is the Next Big Platform, But Amazon Already Owns It (backchannel.com)

Posted by ryuzaki0 on Saturday December 24, 2016 @08:34PM from the Alexa-crush-the-competition dept.

Six million homes already have an Amazon device with its Alexa voice assistant -- about 5% of all households. But Backchannel argues that Amazon is already dominating the race to become the operating system for future voice-activated devices, with Forrester tech analyst James McQuivey pointing out that "having microphones in your environment is a lot more convenient than pulling out your phone." The Alexa-enabled Echo is a true unicorn, one of those rare products that arrives every few years and fundamentally changes the way we live... After years of false starts, voice interface will finally creep into the mainstream as more people purchase voice-enabled speakers and other gadgets, and as the tech that powers voice starts to improve.
Despite competition from Google Home, and a rumored "Home Hub" from Microsoft, Amazon "has a two-year jump on its competition, having first introduced the Echo speaker in November 2014," notes the article, adding that Amazon also "opened its platform early to third-party developers." (Alexa now has more than 5,000 "skills".) They argue that Amazon is already winning the war of the operating systems by familiarizing consumers with "a new computing interface -- a voice devoid of a screen -- that will eventually grow to be more ubiquitous and more useful than our smartphones... Soon, you'll speak your wants into the air -- anywhere -- and a woman's warm voice with a mid-Atlantic accent will talk back to you, ready to fulfill your commands."

3 of 229 comments

Six million Alexa installs... compared to? by tlambert · 2016-12-24 21:08 · Score: 5, Informative

Six million Alexa installs... compared to?
A billion Apple devices with Siri... http://www.theverge.com/2016/1...
Uh, who owns it again?
Re:Oh yeah, just what I need. by Anonymous Coward · 2016-12-24 21:57 · Score: 5, Informative

It is always listening (if you do not turn the mic off), it does notify you when it accesses the cloud service and... it does process locally for its activation word. It's supposedly not always streaming your voice to the cloud (since you'd wind up noticing that on your internet usage anyway). Processing voice isn't the problem; it's the decision engine behind it that figures out what you wanted from the words you said and then serves it up to you. On your device, this would be painfully slow, since it requires an AI of its own. So... stream voice to the cloud, use eleventy-bajillion processors to do it all, and the device is dumb and therefore cheap on the customer end. All the Echo, Dot, Siri, and Cortana really do is record, stream, and play back your data... that's called an MP3 player and should be as cheap as one. That should be your complaint: that these things cost more than 20 bucks when they are the definition of a dumb terminal.
From https://www.amazon.com/gp/help/customer/display.html?pop-up=1&nodeId=201602230#Question10
"Amazon Echo and Echo Dot FAQs
Amazon Echo and Echo Dot are far-field Alexa-enabled devices.
1. How do Amazon Echo and Echo Dot recognize the wake word?
Amazon Echo and Echo Dot use on-device keyword spotting to detect the wake word. When these devices detect the wake word, they stream audio to the Cloud, including a fraction of a second of audio before the wake word.
2. How do I know when Amazon Echo or Echo Dot are streaming my voice to the Cloud?
When Amazon Echo or Echo Dot detect the wake word, when you press the action button on top of the devices, or when you press and hold your remote's microphone button, the light ring around the top of your Amazon Echo turns blue, to indicate that Amazon Echo is streaming audio to the Cloud. When you use the wake word, the audio stream includes a fraction of a second of audio before the wake word, and closes once your question or request has been processed. Within Sounds settings in the Alexa App (Settings > [Your Device Name] > Sounds), you can enable a 'wake up sound,' a short audible tone that plays after the wake word is recognized to indicate that the device is streaming audio. You can also enable an 'end of request sound' that will play a short audible tone at the end of your request, to indicate that the connection has closed and the device is no longer streaming audio.
3. Can I turn off the microphone on Amazon Echo and Echo Dot?
Yes, you can turn Amazon Echo or Echo Dot's microphone off by pushing the microphone on/off button on the top of your device. When the microphone on/off button turns red, the microphone is off. The device will not respond to the wake word, nor respond to the action button, until you reactivate the microphone by pushing the microphone on/off button again. Even when the device’s microphone is off, Amazon Echo or Echo Dot will still respond to requests you make through your remote."
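The FAQ's description of on-device keyword spotting that streams "a fraction of a second of audio before the wake word" implies the device keeps a short rolling buffer of recent audio locally. Below is a minimal Python sketch of that idea, assuming audio arrives as discrete frames (strings here, for readability). The function name, the `is_wake_word` / `is_end_of_request` callbacks (stand-ins for the local keyword spotter and end-of-request detection) and the pre-roll length are all hypothetical; this is not Amazon's implementation.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Iterator, List, Optional


def stream_after_wake_word(
    frames: Iterable[str],
    is_wake_word: Callable[[str], bool],       # hypothetical local keyword spotter
    is_end_of_request: Callable[[str], bool],  # hypothetical end-of-request detector
    preroll_frames: int = 3,                   # hypothetical pre-roll length
) -> Iterator[List[str]]:
    """Yield one list of frames per request: pre-roll, wake word, then the request."""
    preroll: Deque[str] = deque(maxlen=preroll_frames)  # oldest frames fall off automatically
    session: Optional[List[str]] = None                 # None means "not streaming"
    for frame in frames:
        if session is None:
            if is_wake_word(frame):
                # Open a session seeded with the audio heard just before the wake word.
                session = list(preroll) + [frame]
                preroll.clear()
            else:
                preroll.append(frame)  # stays on the device and is eventually discarded
        else:
            session.append(frame)
            if is_end_of_request(frame):
                yield session  # stands in for "send this audio to the cloud"
                session = None


frames = ["a", "b", "c", "d", "WAKE", "what", "time", "END", "e", "f"]
for request in stream_after_wake_word(frames, lambda f: f == "WAKE", lambda f: f == "END", preroll_frames=2):
    print(request)  # prints ['c', 'd', 'WAKE', 'what', 'time', 'END']
```

Frames that are not followed by a wake word simply drop out of the fixed-length `deque` and are never yielded; only the pre-roll, the wake word and the request itself leave the loop.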
All that aside, yes, they could change this with a firmware update and you'd never know. And that is why I will not be buying or using one. To find out whether they did it, I'd have to commit a federal felony (thanks DMCA!)
Scaling things up by DrYak · 2016-12-25 03:14 · Score: 3, Informative

I don't understand why none of this stuff operates locally. It's always some remote server in the cloud. I remember having IBM ViaVoice (back then I think it was called "Voice Type Dictation" or "SimplySpeaking") on my goddam 75 MHz Pentium computer. After about an hour of training, it would nail almost everything I said. I find it incredibly difficult to believe that we don't have the hardware resources necessary to perform local speech-to-text and text processing inside your house without ever touching the internet.
The problems are scaling it up and the finer details.
Regarding speech:
Modern offline speech-to-text technology can reach about 95% accuracy. (It can use past context to tell which homophone makes more sense, etc.)
- Which is damn cool already (only 1 word in 20 needs to be fixed! Fucking impressive!!!)
- And it's pretty useful for dictating thoughts, for those people who speak faster than they type (i.e., most random Joe Six-packs outside /. and especially outside of steno communities): they can mostly speak what they want and only fix things here and there (a single word every 20, or about one word every 2-3 sentences).
- But that's completely useless at the scale required for a Siri- / Alexa- / Cortana- / whatever-type constant stream of spoken commands. The point is to do away with the keyboard and mouse completely, not to have to pull out a keyboard (or pull your smartphone out of your pocket) to correct every third sentence you speak to your home assistant.
The only practical application would be speaking in rigid, robotic sentences: "military-type radio speak" rigidity.
(Strict word ordering: "[name], [order: [verb] [noun]]". Fixed protocols: the AI should acknowledge what it understood and ask for confirmation, "[user], you asked me to [verb] the [noun]?", and the user should confirm or correct: "Yes, do it" [= fixed sentence] / "No" [= fixed sentence], [followed by a new order].)
This kind of speech protocol leaves very little ambiguity and risk of error (that's why it's used by the military, law enforcement, disaster responders, or simply people working outdoors in very noisy radio conditions; in my personal experience, ski instructors in a club spread across the mountains).
That could work nearly flawlessly with modern tech.
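The rigid order / read-back / confirm protocol described above is essentially a small state machine. Here is a hedged Python sketch of it, assuming a hypothetical call sign ("computer"), a tiny made-up closed vocabulary, and literal confirmation phrases; none of this comes from any real assistant's API.

```python
import re
from typing import Optional, Tuple

ASSISTANT_NAME = "computer"                   # hypothetical call sign
VERBS = {"open", "close", "start", "stop"}    # hypothetical closed vocabulary
NOUNS = {"door", "window", "heater", "music"}
ORDER = re.compile(r"^(\w+),\s+(\w+)\s+(\w+)$")  # "[name], [verb] [noun]", nothing else


def parse_order(utterance: str) -> Optional[Tuple[str, str]]:
    """Return (verb, noun) for a well-formed order, or None for anything else."""
    m = ORDER.match(utterance.strip().lower())
    if m is None:
        return None
    name, verb, noun = m.groups()
    if name != ASSISTANT_NAME or verb not in VERBS or noun not in NOUNS:
        return None
    return verb, noun


class RadioProtocol:
    """Order, read-back, explicit confirmation: nothing runs until the user says so."""

    def __init__(self, user: str) -> None:
        self.user = user
        self.pending: Optional[Tuple[str, str]] = None

    def hear(self, utterance: str) -> str:
        text = utterance.strip()
        if self.pending is None:
            order = parse_order(text)
            if order is None:
                return "Say again."
            self.pending = order
            verb, noun = order
            return f"{self.user}, you asked me to {verb} the {noun}?"
        # Awaiting confirmation: only the fixed sentences are accepted.
        reply = text.lower()
        if reply == "yes, do it":
            verb, noun = self.pending
            self.pending = None
            return f"Executing: {verb} {noun}."
        if reply == "no":
            self.pending = None
            return "Cancelled."
        if reply.startswith("no,"):
            self.pending = None
            return self.hear(text[3:])  # the correction is parsed as a brand-new order
        return "Confirm with 'Yes, do it' or correct with 'No, <new order>'."


radio = RadioProtocol("Alice")
print(radio.hear("Computer, open door"))         # Alice, you asked me to open the door?
print(radio.hear("No, computer, close window"))  # Alice, you asked me to close the window?
print(radio.hear("Yes, do it"))                  # Executing: close window.
```

Anything outside the grammar gets "Say again." rather than a guess, which is what keeps ambiguity low; the price is exactly the robotic phrasing the comment describes.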
But it is very far from the "having a casual discussion with your assistant" experience that most companies want to sell.
To reach that level of fluent conversation, current experience (fully autonomous real-time subtitling, fully autonomous real-time translation, etc.) shows that you need nearly two orders of magnitude fewer errors (think 99.9% accuracy: only one missed word in every thousand, or in practice an error every day or so). And due to the law of diminishing returns, that means fuck-tons more processing power. Several data centers' worth of processing in your basement.
(Don't believe me? Look at YouTube's auto-generated subtitles. And Google certainly throws more processing power at them than a mere desktop computer.)
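The accuracy figures turn into error frequencies with one line of arithmetic: at word accuracy a, an error occurs on average every 1/(1 - a) words. The Python sketch below assumes, purely for illustration, about 8 words per sentence and about 1,000 words spoken to the assistant per day; both numbers are hypothetical, not from the comment.

```python
import math

# Hypothetical usage figures, for illustration only.
WORDS_PER_SENTENCE = 8
WORDS_PER_DAY = 1000


def words_per_error(accuracy: float) -> float:
    """Average number of words between recognition errors (accuracy in [0, 1))."""
    return 1.0 / (1.0 - accuracy)


def summarize(accuracy: float) -> dict:
    wpe = words_per_error(accuracy)
    return {
        "words_per_error": round(wpe),
        "sentences_per_error": round(wpe / WORDS_PER_SENTENCE, 1),
        "errors_per_day": round(WORDS_PER_DAY / wpe, 1),
    }


for acc in (0.95, 0.999):
    print(acc, summarize(acc))

# Factor by which the error rate must shrink to go from 95% to 99.9% accuracy.
ratio = (1 - 0.95) / (1 - 0.999)
print(round(ratio), round(math.log10(ratio), 2))
```

Under those assumptions, 95% means one error every 20 words (every 2.5 sentences, about 50 a day), while 99.9% means one every 1,000 words (about one a day). The error rate has to shrink by a factor of about 50, roughly 1.7 orders of magnitude.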
And all of the above is only about *parsing* the speech (i.e., getting the speech-to-text accurate enough). Then you need to make *sense* of the speech.
Again, with modern technology, making the system react to a bunch of preset commands is trivial (the kind where you write a plug-in to get new commands supported) and could probably be handled by a Raspberry Pi.
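A preset-command system of the plug-in kind mentioned above needs little more than a lookup table from recognized phrases to handlers. A minimal Python sketch, with a hypothetical `command` decorator and two made-up commands; it assumes speech-to-text has already produced a transcript, and it rejects anything not registered.

```python
from datetime import datetime
from typing import Callable, Dict

# Hypothetical plug-in registry: each plug-in maps one exact phrase to a handler.
COMMANDS: Dict[str, Callable[[], str]] = {}


def command(phrase: str) -> Callable[[Callable[[], str]], Callable[[], str]]:
    """Decorator that registers a handler for a preset phrase."""
    def register(handler: Callable[[], str]) -> Callable[[], str]:
        COMMANDS[phrase.lower()] = handler
        return handler
    return register


@command("lights on")
def lights_on() -> str:
    return "Turning the lights on."  # a real plug-in would drive actual hardware here


@command("what time is it")
def tell_time() -> str:
    return datetime.now().strftime("It is %H:%M.")


def dispatch(transcript: str) -> str:
    """Exact lookup of the recognized text; anything not preset is rejected."""
    key = transcript.strip().lower().rstrip("?.!")
    handler = COMMANDS.get(key)
    return handler() if handler else "Sorry, I don't know that command."


print(dispatch("Lights on"))       # Turning the lights on.
print(dispatch("Tell me a joke"))  # Sorry, I don't know that command.
```

Adding a command means registering one more phrase. Nothing here understands language, which is why it fits on small hardware and also why it is far from a "natural conversation".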
But again, what these companies are trying to sell to random users is much more complex: "having a natural conversation with your assistant".
That requires three things:
- tons more processing (goodbye, Raspberry Pi)
- tons more reference data (much more than the few custom commands the user has pre-configured)
- a fuck-ton of data gathering... (recording every command spoken by every user)
- ...coupled with analysis of rejected / misinterpreted commands... (most probably by huma

--
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]