What's interesting to me reading this is the expectation that voice search should do things that text search currently doesn't do. If I ask Google about Bill Murray running for president, I'm presumably looking for articles that mention the words "Bill", "Murray", and "President". I would expect it to return articles that best match the query, and I would never expect Google to be able to tell me the difference between real and fake articles. For a server to answer the author's question, it has to understand the question so well that it can change it into "What is the list of people who have run for U.S. president?" and then check whether Bill Murray is in that list. That's a tall order.
We are moving past expecting the most word matches into expecting the server to understand what we really want. Voice search is more like the "I'm Feeling Lucky" button: because it takes longer, you only have time for one answer, and that first answer has to be right. It comes without the expectation that you're lucky if the answer happens to be right; now we need the first result to be the rightest result there is.
So the glass is half empty. I personally prefer to see it half-full, but it's also true and the critique is more valuable and interesting than optimism.
It seems like there's something akin to the uncanny valley effect going on here where a voice UI invites people to think about the other end of the conversation as a person and then be disappointed when they hit the edges of what it's designed to do.
There's a really interesting discussion to be had about how UI decisions can make that process smoother — I really liked https://bigmedium.com/speaking/design-in-the-era-of-the-algo... as a call for making the failure modes of the system more graceful. I think a lot of the success in the next decade or so is going to come from the places that figure out how to build a system that doesn't seem to promise more than it can deliver.
> a voice UI invites people to think about the other end of the conversation as a person
It's not the voice UI that does this, it's the marketing.
If they sold it as "speak your google search terms", it'd work a lot better. It'd also be a lot less sexy, but that's still mostly what it is IMHO. Not to say that it isn't impressive stuff, it is! But it's highly oversold, still.
Yes! Voice is sort-of "tactile" if you will. The process of speaking instead of typing may well cause us to expect a more human interaction. I bet they're already studying this effect and changing search results for voice searches accordingly, but it will be fun to watch how it unfolds.
Very nice article, I only skimmed so far, but I think I agree with all of it. It has a definite pro-consumer bent that I wish would come true, but it seems like the trends are in the other direction. I suspect there's too much money in search and in improving query understanding at scale for the companies that get there to be as transparent, open, and sharing as this author is asking.
The fact that we interact by voice leads toward a sort of inadvertent theory-of-mind about the other party, which makes the pulling away of the curtain with so many of the answers much more jarring. Voice interaction seems to recruit a much deeper evolutionary expectation than the much more recent phenomenon of typing and reading.
Yes! I wonder whether that's an argument in favor of things like deliberately using quasi-robotic styles to help people recognize the limitations faster. It'll be interesting to see what product designers come up with and how the market adapts to this.
Because it was sold as something better. When Siri first came out it was billed as something revolutionary that would continuously improve as more people used it. Did that even happen? Outside of the "happy path" sort of questions, I find it rather disappointing, with pre-canned responses or just showing me search results most of the time. Now I really only use it for setting reminders or alarms.
Totally. Me too, my use of Google Voice and Siri and Alexa has declined because I don't usually get what I want the first time. Reminders and alarms it always gets right, and it's faster to set a reminder by voice than by typing. But I think you and I are illustrating how we expect more from voice search than text search. My use of text search hasn't declined like my use of voice search, and text search is just as fundamentally bad as voice search. I think it's because I can see & sift many results, and because I can easily iterate on my query when it's not quite right. Voice search can't do either easily.
Perhaps Amazon, Apple, Google and Microsoft all initially thought that the revolutionary part was being able to speak a query and have the query match what you said, and that the search part was already good enough.
>I think it's because I can see & sift many results, and because I can easily iterate on my query
Absolutely. In fact sometimes I'm searching for something, whether in a search engine, at an ecommerce site, or whatever, and I'm not getting what I want immediately. If I'm on a phone or tablet, I'll often grab a nearby laptop because it's just faster and easier to do a lot of typing and clicking. (Less true with more recent tablets, but my basic point is that there's sometimes a lot of fast iteration when I'm trying to find the answer to something non-obvious.)
> But I think you and I are illustrating how we expect more from voice search than text search
I think @seiferteric's point was that the expectation may not be there because it is a voice search. That expectation is there because that's how it was marketed.
If the marketing for these things was: "Ask a question, and get search results by voice" I don't think I'd have the expectation that it find and deliver the correct answer to me.
But the marketing for all of these devices is: "It's a personal assistant! Ask it a question and you'll get an answer!"
I'm personally not convinced that the high expectations are because it's a voice interaction, but rather that the technology simply can't live up to the marketing pitch.
I think part of it is desire. Many people would love to have an assistant like that. There is a vague memory of the days of personal secretaries, a girl (those were sexist days) who could look things up for you so that you could spend your efforts on other tasks. There are a lot of times when everybody could use help, but they don't have it.
Try a reminder including the word "play". The choice to play an album or open an app dominates, so it tells you it doesn't have an app with some nonsense name.
List decoding was invented by 1955. This is a set of hard problems, but very well studied ones.
I do expect voice to do something text doesn't do, because text allows for interactions that voice doesn't. I can't skim a page of search results via voice. I'm not going to sit there and have Alexa read 10 page titles and URLs to me. I ask a question, I expect a concise voice response. Anything else is a complete failure of the UI.
Currently there is one way for a computer to interact with humans via voice: direct and unquestioning answers.
> I can't skim a page of search results via voice. I'm not going to sit there and have Alexa read 10 page titles and URLs to me.
I think you can. Alexa can read you the titles and you can ask it for more information about a specific title. It's like asking the waiter which desserts they have, then interrupting him because you don't know what a panna cotta is.
That's certainly true for most people sitting in front of a computer but I really liked the inclusive design guidelines from Microsoft[1] reminding us that there are many people for whom any particular assumption is untrue, often only temporarily or in a specific situation:
As an example, a coworker mentioned that his use of Alexa went from casual to heavy when they had a child and the ability to do things while carrying a baby suddenly became really important. I suspect there are more situations like that than we might think at first.
1. I know, 90s me is still getting used to saying that too
And I know a couple with a relatively young child and they love their Echo. All the tell a joke and other things along those lines that are kinda dumb to me. Or questions that I'd just as soon type on my phone or a computer but which are more natural to just speak in a family conversational setting.
I think a good model to imagine is that you don't have any computers on you and you're talking to an assistant over the phone who has access to Google and other online resources. The types of responses you'd expect from that at least modestly intelligent assistant are probably not all that different from what you'd like to hear from a digital assistant.
If I were to ask a question that had a long list of potential responses, I'd expect them to ask me to clarify or narrow down what I'm looking for or at least explicitly ask me if I really wanted them to read the whole list.
I'd expect my assistant to know a fair amount about me and the current context. Using those clues a human can pick out what I really care about, at least in most cases. Even when there is a list of responses I'd expect an assistant to give a better summary when asking for clarification.
Certainly learning my preferences is an important component of a personal assistant. e.g. I almost always go to the airport using a particular service.
That said, for those of us who don't have personal admins, there's a lot of opportunity for digital services that fall between purely self-service travel booking as it exists today and having an assistant.
I find it interesting that Google couldn't get the Bill Murray one. No matter what combination of "did bill murray run for president" I search for, I get a rich snippet on the results page from snopes.com which says
"Claim: Comedian Bill Murray is running for president and proclaimed religion to be "the worst enemy of mankind."
Claimed by: Internet
Fact check by Snopes.com: FALSE"
It would seem they are reasonably close but this is more of a product integration failure than a recognition failure.
Great point! I get the same from Google. Looking back in the article, he only criticized Siri and Cortana on this question. He claimed none of them got it right, but didn't say specifically what Google did with it, and it's entirely possible it was a different answer before now.
Bigger picture though, Bill Murray is famous, which makes it easier to answer questions like this. In general, does the wording of the author's question truly imply he's searching for a fact check, and do you expect Google to know that even if there are no articles that match the wording of the question? The Snopes article does contain the terms "did", "Bill Murray", and "run for president", so we don't have any evidence that Google understands the question; we just have some content that matches the query.
The issue I see is that the computational question of search has long been trying to measure relevance by matching the query against the corpus. This Bill Murray question is an example of how that can break down. I might actually want the fake articles... and I might not. There's no way for the search engine to know without making an inference, and the expectation that mass market search engines make inferences seems pretty new to me - and I don't expect that when I do text searching. I guess I just expect voice search to push the need for question understanding and inference making even faster than text search has.
My first Hacker News inclusion. I feel like there should be some rite of passage. Well, other than the sudden and unanticipated login attempts.
My testing device for Google was the Google Home speaker, which appears to have a different tolerance for reading search results. I've had it rattle off several sentences from web pages for other keywords in the list (see, for example, the boiling point of water), but for the Bill Murray question there seems to be some kind of limiter. I just re-checked, using the exact phrasing I had before, and it still says that it doesn't know, but it's learning all the time.
I'm guessing there is some kind of a relevance check for the speaker version compared to the phone version. The phone is probably happier to return any result (a la Siri), whereas the speaker appears to be making some attempt to understand what I'm asking for before reading search results.
This particular question appears to trigger the speaker not to read the search results. We can only speculate as to why: does it not find it relevant enough? Is there a reserved path for "Did xyz" questions when sent to Google Home? Am I unknowingly in the A/B testing group that doesn't get the answer? There are few ways of knowing from the outside, black-box, without massive data testing, but it is curious.
Welcome! I don't know of any rite of passage, but maybe I gave you your first upvote? ;)
> I'm guessing there is some kind of a relevance check for the speaker version compared to the phone version.
I would bet on that & expect it too... I'm sure all these voice search products are experimenting with how voice search needs to be tuned differently than text search.
What does "Google Home" speaker usually do when there is no clear result? On the phone google assistant just displays google search results in such a case, obviously that can't work on the "Google Home" speaker.
Sometimes it'll read a page, e.g. it read a passage from Wikipedia a few times. But if it really can't decide, it'll say something like "I'm sorry, I don't know that one".
You see this issue with Google Map searches. Over time it seems to have relied more and more on structured data and less on algorithmic results. But it still returns bad results when the software obviously lacks the data. Better to just say "nothing found" sometimes.
That's the thing. Google can do a lot of those crazy things with their little popup boxes. I've seen cases where I ask a really weird question and it summarizes an entire Stack Overflow thread into a neat paragraph that answers my question. Click through, and the exact answer that Google showed me isn't anywhere on the page.
Also the query "did Bill Murray run for president" returns a Snopes article debunking the myth as the first result. This should totally be something a Siri thing could parse and tell you about.
If I'm in front of a computer, I can find out information like whether Bill Murray is running for President. What I need to know if I am away from my computer -- which basically means I am walking somewhere or going somewhere in a car -- is not about what objects I can see from space, but whether a certain store is still open and what time will it close.
That is, the things you need from a digital assistant -- you really need, and you need them right now. Otherwise you wouldn't be using a digital assistant to look them up, you'd look them up on your laptop like everyone else. I know, I know, in Africa no one has a laptop and their only computing device is a phone, etc, but in the first world, access to fully powered machines with full keyboard interfaces is ubiquitous, people prefer to use these types of interfaces for general research rather than their phones, and we turn to voice assistants when we are in the middle of doing some task and need specific information to assist us in completing that task. Therefore the expectations are pretty high. The frustration level of, say, getting bad information about a store being open that you are on your way to is way higher than getting bad information about the height of the Eiffel tower.
To this day, I can't find out the hours of the stores I am going to. The whole experience is really frustrating:
"Siri! When does cup-o-Java close?"
shows directions to random stores
"Siri! What are the hours of cup-o-Java"
shows directions to other random stores
"Siri! Is Cup-o-Java closed or open?"
shows directions to more random stores
On the other hand, I was stunned that I could walk the Byzantine alleys of Venice and get precise information about turning left in 100 feet to get to my favorite campo. These are tiny mazes of back alleys, some wide enough for just a single person to walk down, and in which no cars are allowed, but they're all precisely mapped out so that I'm never lost any more. You can drop me pretty much anywhere and Google Maps will guide me out of there. But it can't tell me whether the museum at my destination is open today, or how much admission costs today, or whether they accept credit cards.
Voice search will not be actually useful until it can respond naturally to naturally voiced questions. The whole point of voice UI is to make things more natural for human interaction. If it's just a matter of how well computers can interpret speech, well, that's a fun parlor game, but it's not what is implicitly promised by a voice UI, and it's definitely not what's explicitly promised by the marketing for these services. And ultimately until these things can interact naturally, they are doomed to being a novelty.
Since the voice interface can only really give one Answer, it needs to be more certain that the Answer is a good or common Answer rather than the best Answer. Variation in quality needs to be reduced, rather than just optimizing PageRank.
It's a bit like pressing I'm Feeling Lucky for your result. I'd hope that it was more optimized for always good results rather than often great, but occasionally lousy.
As much fun as this test is, it dodges the most interesting question of all: "Are these machines supposed to be talking search engines?"
I'm increasingly believing that the answer is: "No." These machines (especially Alexa) are rapidly gaining popularity while still providing pretty ragged answers to search queries. So we should start asking: "Are they taking on a different function that didn't match our early expectations?"
In a word, yeah. Alexa is a really nifty jukebox for those of us that don't have the good sense to create formal playlists. It's a handy kitchen timer, especially if you've got multiple pots doing different things. It's a better alarm clock and a better purveyor of soothing bedtime sounds. (If you're asking: Good god, how many people really want or need that, think: Fussing infants.)
Smartphones already provide pretty excellent search results on the fly. I'm not sure voice-powered assistants will re-solve that problem with great success. But there are a surprising number of rudimentary needs around the house for which a voice-enabled device becomes quite handy.
Interesting - maybe it means that voice interfaces are better for tasks of a certain shape: Those that are typically multi-step, specific, and "deep" in an app. Things like setting a kitchen timer, saving a reminder for yourself, replying to a text, or setting a travel destination.
Tasks that require a high amount of breadth, like search, don't scale well to a voice interface.
These digital assistants are just begging for an app store. Search is just the first app, jokes and weather are other useful apps. These could easily follow a similar product life cycle pattern as smartphones.
Well, that's kinda what skills are in the case of Alexa. Part of the issue though is discoverability. I forget what I've installed or I forget what the right wizard's incantation is to access some skill/app.
I expect there are a lot of questions that lend themselves to concise answers. BUT if sensible informally phrased questions don't get answered properly a decent percentage of the time, we learn not to bother.
I agree that voice interfaces aren't good for a lot of things. How do I cook XYZ? probably isn't suited. But overall performance just isn't that great.
That is perfectly suited to voice, if voice worked. When I call my mom for a cake recipe, it would be a whole lot easier if she could say "beat the eggs for 1 minute", listen for the beater to start, and then say stop after one minute. My mom has better things to do with her time than walk me through the recipe, but an assistant should be able to do this.
Of course I have just transformed the problem into something that technology isn't able to do. However the problem isn't with the voice interface it is our AI isn't yet up to all that. (poor AI, every time they do something useful we rename it and move the goal posts)
Fair enough. I was thinking of it as a one time answer. But you're absolutely right that a good interface could maybe show you a recipe on a screen somewhere and then walk you through the process step by step.
This is a really interesting point - particularly the kitchen timer thing. Right now, I'm the kitchen timer for my wife. She'll say, "set a timer for 8 minutes." I'll interrupt what I'm doing to comply. I may buy Alexa just for that...
Because it doesn't have the local intelligence to understand your unique waveforms that are saying something along the lines of "set a timer for five minutes." Now I'm sure someone could design a specialized device that could act as a voice activated timer--I suspect such exists--but Alexa is a lot more general purpose.
I wonder if it's less that Google Home can't have the local intelligence due to lack of storage/memory/processing, and more that they want to keep it away from their competitors?
Could Google reasonably put their current voice recognition in a small device?
Google is actually working on this. At this year's IO they announced that they are working with silicon manufacturers to include hardware acceleration for their TensorFlow Lite framework. This makes it possible to do on-device speech recognition and natural language processing while keeping the power consumption at acceptable levels.
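To make "on-device" concrete, here's a minimal sketch of running a small keyword-spotting model with the TensorFlow Lite interpreter in Python. The model file name and its input shape are made-up assumptions for illustration; this is not how Google Assistant actually works, just the general pattern of local inference without a network round trip:

    import numpy as np
    import tensorflow as tf  # on a small device, tflite_runtime works too

    # Hypothetical keyword-spotting model; the file name and input shape
    # are assumptions for illustration, not anything Google has shipped.
    interpreter = tf.lite.Interpreter(model_path="keyword_spotting.tflite")
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]

    def classify(audio_features):
        # One inference pass, entirely on-device (no network round trip).
        x = np.asarray(audio_features, dtype=np.float32).reshape(inp["shape"])
        interpreter.set_tensor(inp["index"], x)
        interpreter.invoke()
        scores = interpreter.get_tensor(out["index"])
        return int(np.argmax(scores))  # index of the most likely keyword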
The evidence would suggest it would be hard. I've played around with various dictation software over the years and it's always been pretty awful. It's only recently with the cloud-based services that it's started to approach usable.
Software that limits itself to specific keywords can absolutely work well. After all, Google and Alexa do it with their wake words. A voice activated timer could be built fairly easily if it hasn't been done already.
General purpose is a lot harder--and then you need the Internet connection for a lot of the queries anyway so there's no real reason to build in local voice recognition if you then can't really do anything useful with it.
I was actually looking at that after I saw this question. PiAUISuite/voicecommand (looks like it can be used with APIs but doesn't need to be) seems to be one and Jasper another. I don't have personal experience with either (yet) but I suspect they're not nearly as good as Alexa/Google Home, especially with a basic microphone. I'm thinking I may play around though.
Thank you for the links. It seems like PiAUISuite uses Google's speech to text (look for the curl call [1]), while Jasper allows choosing between different engines [2]. It has install instructions for PocketSphinx [3] and Julius [4]. Those two seem close to what I have been looking for, although Julius apparently lacks a good model for English at the moment.
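For the curious, here's a rough sketch of the keyword-limited, fully offline approach mentioned upthread (the voice-activated timer), using the older pocketsphinx Python bindings. The keyphrase threshold is a tuning guess, newer 5.x releases changed this API, and a real device would also need to parse the spoken duration:

    import threading
    # Older pocketsphinx Python bindings; newer 5.x releases changed this API.
    from pocketsphinx import LiveSpeech

    def start_timer(minutes):
        # Fire a bell after the given number of minutes.
        threading.Timer(minutes * 60, lambda: print("\aTimer done!")).start()
        print("Timer set for", minutes, "minute(s)")

    # Keyword spotting: ignore everything except the phrase we care about.
    # kws_threshold here is a tuning guess, not a known-good value.
    speech = LiveSpeech(lm=False, keyphrase="set a timer", kws_threshold=1e-20)
    for phrase in speech:
        print("Heard:", phrase)
        start_timer(5)  # a real device would listen for the duration next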
I tried the conversational weather one on google assistant on my phone.
"Should I take an umbrella tomorrow?"
"No..." and shows me tomorrows forecast.
"What about the day after?"
"No..." and shows a forecast that when I look more closely at I notice is for today.
Neither of these also spotted that I'm heading to another city tomorrow, which is in my calendar. If I changed it to "do I need to take an umbrella for my trip tomorrow" it just searches google and gives me a search result suggesting I take a small folding umbrella... for a trip to Thailand.
I use google assistant for controlling smart lights sometimes with "ok google, turn on/off the lights". At one point I tried "Ok google, turn off the lights in ten minutes" and it just searched it. That seems super simple and like something it shouldn't have had any trouble with, but here we are :/
It might be so, but remember that experts were predicting it would take 10 years for a computer to beat top humans in Go. Maybe next month there will be a breakthrough with NLP. The amount of research and compute going into this problem is amazing, and we don't know what's around the corner.
Also, it might be possible to create a much better assistant today, but it would be too expensive to offer to the public for free. What if it requires 100 TPUs to run?
The thing for me in that example is it's not a particularly complex set of options. It knows I'm asking for the weather, it knows that it should give me the weather in a particular location, and google knows where I'll be tomorrow.
The other main mistake was (despite it getting the conversational part right) thinking "the day after" means today.
I don't think these things require TPUs or new NLP.
Yes, they managed assistants badly - Google, Apple and Amazon. They could have added more APIs, recognized many more patterns of user requests, and it's just plain old programming, so what's keeping them?
For example, Google Assistant doesn't play nice with YouTube. I guess it isn't in Google's interest to have a free agent serve a free music site when the same kind of service makes money for Amazon (Alexa). They wanted to make their agent profitable, so they didn't let it be extended for free. Just my guess.
Google Home with YouTube is amazing. I say "play Gwen Bottle" and get Gwen Stefani singing Message in a Bottle. Or say "play One Love Peas" and get Black Eyed Peas at the One Love concert. What are you talking about?
This particular NLP problem has been kinda solved¹ for more than 10 years already.
What you are seeing here is a much simpler CS problem, but much harder social problem. It is "why can't my applications talk to each other?". I really doubt it will be solved in 10 years.
1 - Humans don't have a perfect solution for it, and machines are still worse, but not that much worse.
This isn't an NLP problem, it's a coder problem. These solutions already exist, Google/ other assistant providers just need to dedicate the man hours to make it happen.
Perhaps, these just feel like such common things to ask and do that I don't get why they're not planned for. The conversational side works for the weather, the remaining problems are:
1. There is an in-built assumption I am always where I currently am. This part isn't anything to do with the NLP.
2. "the day after" is translated to "today", possibly. [edit - see lower, it is in only one case]
3. This is more of an NLP one: it understands that the context of "weather" carries from one question to the next, but not the timing. So asking for the weather tomorrow and then "one day later" gets tomorrow's as well. [edit - more complex than this, it's actually working in some areas and not in others]
I'd like to see what user-stories it's trying to solve, because apart from setting timers and alarms it's been massively hit and miss for me.
I tried to repeat what I'd put in and this time I had:
"What's the weather the day after" - translated to tomorrow, with no context, that makes sense.
"What's the weather today?" - weather today, followed by "What about the day after?" which gave me results about the film.
"What's the weather tomorrow?" - weather tomorrow, followed by "what about the day after?" which then worked.
So it works just fine for "weather" but not for asking if I need to take an umbrella. And it doesn't work if I ask for today then the day after, but does for tomorrow and the day after.
Why does the context get passed on for the day correctly for "tomorrow" but not for "today"? Why can it get that "the day after" means tomorrow, unless I've asked about an umbrella in which case it means today? At the core of my question is how is it this inconsistent?
I agree. Hardcoding might not get me a great general purpose assistant, but then the generic solution is failing at that too. Hardcoding can get me to a useful subset of the functionality now.
I found it curious that the author didn't like that Google was quoting webpages. I actually liked that Google was not taking credit for the info and was telling me where the info was coming from, so I can decide if I want to trust that info or not.
Of course if I ask for a fact like "How many inches in a meter" I just want the answer. But if I ask "what's the weather going to be tomorrow" I might prefer an answer like "badweather.com says it's going to rain tomorrow" so I can then think (ugh, badweather.com is always wrong) and ask "What does goodweather.com say about tomorrow's weather". Ideally I could ask the assistant to use a particular site by default. This is especially true for me because Siri's default doesn't seem very accurate to me, given that I live on the other side of the world from the offices of the company they use for weather info.
I'm sorry for the confusion. My issue with quoting webpages is not that it quotes them -- that's fine -- but that it does so in a very verbose manner. This leads to information overload. For example:
Me: What is the boiling point of water at an altitude of 1km?
Google: At sea level, water boils at 212 °F. With each 500-feet increase in elevation, the boiling point of water is lowered by just under 1 °F. At 7,500 feet, for example, water boils at about 198 °F. Because water boils at a lower temperature at higher elevations, foods that are prepared by boiling or simmering will cook at
The problem is with the vast quantity of information, and the fact that some is both irrelevant and truncated. The last sentence is incomplete and cut off, yet as a listener I have no way of knowing this. I will thus try to remember it, at the expense of the facts that came previously.
When reading a webpage, the important part is to read the specific parts of interest, and not overload the user. If it can't do that, it risks providing irrelevant or, quite frankly, confusing data (such as the odd answer to how much a Dreamliner weighs). I don't know if that's better than not providing an answer at all.
Wouldn't it have been a lot more magical if it had simply asked, "Would you like me to play it?" (knowing that you have a Spotify subscription) at the end of answering your question?
Or, at least, for you to be able to say "Play it!" rather than the unnatural "Okay Google, play the album Kintsugi on Spotify"...
The ability to function as a virtual assistant, even at the level of a not-so-sharp intern [1], would be a killer app.
Give it some parameters for a trip you're taking. It comes back with some options and follow-up questions. We are a long way from that point. Even that not-so-sharp intern has a huge amount of internalized knowledge about general preferences, cities, airports, etc. and probably knows questions to ask to narrow things down.
I strongly suspect there are other domains where a lot of people are assuming we're 90% there and we're not.
[1] Not to insult interns or any other group. I just mean you don't need to be at experienced executive assistant level to be really useful.
Agreed, determining human intent is really, really, really hard. There are so many context clues we use in everyday life that the voice interface will never have access to, like where I'm standing, what my expression is, etc.
I notice some different answers on my devices. For example on the "Are tomatoes vegetables?" question my Google Home states that they are definitely a fruit. (quoting Oxford Dictionary)
Edit: And "What's the height in meters of the Empire State Building?" gets me "381 meters, 443 meters to tip"
I'm surprised by a number of the failures, such that I did wonder if the failure might be happening on the speech recognition side rather than the response generation side.
Both Google Home and Alexa have very little problem recognising my speech, whilst friends often struggle even when they seem to say the precise same phrase, to the point that it's mildly entertaining. With Google I suspect they've tailored to my voice (I've used voice commands extensively for several years) but I've only had a Dot briefly and it worked well from the start. Another surprise is that they cope well with my peculiarly English English phrasing and pronunciation, but I'm sure there are lots of less widely spoken dialects that would throw them.
Alexa is the first thing I've owned that really does quite a solid job in the voice recognition department (whatever its failings to return something useful based on that recognition). Siri on my phone is rather hit or miss by contrast.
I suspect that the microphone array has a lot to do with it. Anecdotally, I've read pieces by people saying that homebrew "Echos" together with the Alexa APIs aren't as good as an actual Alexa.
I keep coming back to the idea that progress towards AGI might be made by someone working on a "coordinator" agent. We might have several narrowly focused agents with deep knowledge in particular domains: a mathematician, a fact-checker, a botanist, a structural engineer, etc.; then have an agent that broadly understands how to route requests to the right vertical. Maybe that's already descriptive of the underlying architecture for some of these agents. The alternative might be that we interface with several different conversational agents, and like interfacing with people, we use our judgement to decide which specialist to ask.
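A toy version of that coordinator idea, purely illustrative (the agents, keyword scoring, and confidence threshold below are all made up): each vertical reports how confident it is that it can handle the utterance, and the coordinator routes to the most confident one or admits defeat.

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Agent:
        name: str
        can_handle: Callable[[str], float]  # confidence in [0, 1]
        answer: Callable[[str], str]

    # Made-up vertical agents with naive keyword-based confidence.
    agents = [
        Agent("weather",
              lambda q: 0.9 if "weather" in q or "umbrella" in q else 0.0,
              lambda q: "Checking the forecast..."),
        Agent("math",
              lambda q: 0.8 if any(c.isdigit() for c in q) else 0.0,
              lambda q: "Crunching the numbers..."),
    ]

    def coordinate(query: str) -> str:
        confidence, best = max(((a.can_handle(query), a) for a in agents),
                               key=lambda pair: pair[0])
        if confidence < 0.5:  # nobody is confident enough to own the question
            return "Sorry, I'm not sure who should answer that."
        return best.answer(query)

    print(coordinate("should I take an umbrella tomorrow"))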
I'm curious about how accurate the assistants were at listening. That used to be the most pressing issue with voice commands that had relegated the technology to a running joke. It appears that's understandably what the companies have been focusing on so far, but there's still work to be done to get to 100% accuracy of listening, especially when you take into account exotic names and interchanging between languages and slang.
Might be interesting to test with Wolfram Alpha as well. It looks like some of the questions wouldn't fit the WA API, but I'm curious how it would score.
Forget about questions, I cannot even get Siri on Apple TV to recognize what I am saying. I have often wanted to keep a kind of journal like this poster but I suspect it would recognize the correct words about 30% of the time.
My wife who, unlike me, is not a native English speaker has probably a 10% success rate. This is why any kind of forthcoming voice-response Apple device is completely a nonstarter to me.
The parent poster is dubious that improvements in features will coincide with improvements in recognizing his voice. Voice assist functionality could be 200% more awesome but if it specifically doesn't seem good at just recognizing what he has said such functionality is useless to him. This isn't terribly strange at all.
> Siri took the crown on factual questions, but surprisingly did poorly on reasoning (“Queries”) where I expected the Wolfram Alpha-backed service to get flying colours.
Didn't Apple ditch the WolframAlpha integration pretty quickly after Siri was released? I remember a lot of the Wolfram type queries stopped working shortly after release.
So I decided to ask Siri some of the questions he listed as getting a Bing search answer that I felt Wolfram would have answered correctly for Siri. In my case I did get the correct answer, not a Bing result.
Where does the Jackfruit grow?
What is the boiling point of water at an altitude of 1km?
i wonder if there would be any use in services that don't respond in real-time.
I think these digital assistants are nerfed by the real-time response requirement. I'd be happy to ask some of those questions, and get a pop up in a few minutes. And they could be of much higher quality as they can be processed and better researched.
I’d be interested to see how Siri performs on different devices. The Siri on your phone is not the same Siri that you have on your Apple TV. The Apple TV version is much better at giving information about tv shows, movies and music, but terrible in comparison for everything else.
Wolfram Alpha is slow. Even if it's right, it's only good for knowledge questions out of its database - things like public figures ("how many children does Barack Obama have"), physical statistics ("what is the melting point of tungsten"), and so on work fine. However, topical ("what about aluminium?"), temporal ("what's the weather?"), and location-based ("show me restaurants nearby") questions are outside the scope of Wolfram Alpha entirely - so a given app must aggregate.
Why don't apps aggregate? The "can you handle this?" API endpoint frequently returns false positives, and the proper API is really slow (multiple seconds) for negatives. If we get a false positive, or something hard to detect as a negative, that's the only answer we can show. And since a voice assistant is expected to return one answer quickly, this is straight out.
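To make that latency trade-off concrete, here's a rough sketch of the fan-out pattern this implies: send the query to every backend in parallel, wait only as long as a voice UI can afford, and keep the most confident answer that made it back in time. The backends below are stand-ins, not real assistant APIs.

    import concurrent.futures as cf
    import random, time

    # Stand-in backends; each returns (confidence, answer).
    def slow_backend(query):
        time.sleep(random.uniform(0.2, 3.0))  # simulates a multi-second API
        return (0.9, "answer from slow_backend")

    def fast_backend(query):
        time.sleep(0.1)
        return (0.6, "answer from fast_backend")

    BACKENDS = [slow_backend, fast_backend]
    DEADLINE = 1.5  # seconds a voice UI can plausibly afford to wait

    def aggregate(query):
        best = (0.0, "Sorry, I don't know that one.")
        pool = cf.ThreadPoolExecutor(max_workers=len(BACKENDS))
        futures = [pool.submit(backend, query) for backend in BACKENDS]
        done, _ = cf.wait(futures, timeout=DEADLINE)
        for future in done:
            confidence, answer = future.result()
            if confidence > best[0]:
                best = (confidence, answer)
        pool.shutdown(wait=False)  # don't block on laggards past the deadline
        return best[1]

    print(aggregate("did Bill Murray run for president"))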
One problem is that the computing power dedicated to each user is minuscule. If you could dedicate a super computer for processing every input, you could have a much more sophisticated system that could easily deal with all of those queries.
It's interesting that while many fail, there is still one that wins: that is, if you combine the efforts of these 4 digital assistants, you'll get a much smarter one. Do they have an API? Can you query Siri, Cortana, etc.?
There are APIs for both Amazon and Google's voice assistant services. Not surprisingly Siri doesn't expose a public one, I've no idea about Cortana. I've messed around a little with them on the Raspberry Pi.
This idea, while simple in principle, might be kinda annoying in practice. You're still left with similar issues - how do you decide which talking cylinder service answered the question best? Do you play all of the answers? For me I'm fairly sure listening to all of them in a row would frustrate me even further - just waiting for Alexa to finish telling me the news headlines is sometimes kinda annoying, especially when that information in visual form can be grokked almost instantly. Many of these devices, especially the Google one, are getting better at context based followup questions - managing who to send your follow up question to could be kinda crappy as well. I suppose you could do one device that could ask each service individually ("Alexa...", "Ok Google..."), but in my experience as soon as I get one bad answer, I inevitably just use google.com to find what I need rather than risk wasting my time on another failed conversation.
The main part that I've found hard to do in home rolled voice assistants is microphone arrays. Almost all these devices use pretty sophisticated microphone technologies for things like noise cancelling, subject isolation etc, which so far has been non-trivial to do to a similar standard in homemade versions of them. It also certainly used to be the case that creating your own "hotword" system to call the Alexa API was technically against the ToS (it allowed you to use a button press to call Alexa instead), as naturally Amazon would rather you buy a real Echo. No idea if this is still the case, and at any rate Amazon can't really enforce this either, but worth mentioning.
Man asks Amazon digital assistant about the price of Lay's chips, is disappointed when it "has little interest in having a conversation about it" and wants to sell it to him instead. o_0
English is best known for being simpler than average, not more complex. It's one of the flagships (along with Latin / Mandarin Chinese / Swahili) for the theory "languages which are widely learned by adults become simplified over time".
My girlfriend has the name Taryn and Siri really struggles with it, usually correcting it to Karen or Terell (both names in my address book). Alexa does better but probably because it doesn't know about the other names.
Not that it excuses Siri's shortcomings or helps if you're referring to her by name mid-sentence while texting someone else, but you can assign nicknames in your Contacts app. It might make it less frustrating to dictate texts or start calls. So instead of saying "Call Taryn" and getting it misheard you could say "Call my girlfriend".
It's far less than half for me. Maybe less than 5%. Examples of queries that I frequently need that won't work:
- OK Google, please download an offline maps area for Yosemite National Park and about 80 kilometers around it.
- OK Google, navigate to Yosemite National Park. Highway 120 is closed so please remember that when we get to an area without reception, do not route me via 120 in the offline maps.
- OK Google, let me know when we reach a place along our route where I can buy an SD card.
- OK Google, let me know when we reach the last Trader Joe's or Safeway along our route that is still open.
- OK Google, when is the next Caltrain arriving in Palo Alto that stops in San Bruno?
- OK Google, if I miss that train, when is the next train?
- OK Google, get me an UberPool for 2 people to Castro St. in Mountain View as long as it's under $10 and would arrive within 30 minutes.
- OK Google, what is the name of the driver and what is the license plate?
- OK Google, navigate to my friend's party on Facebook.
- OK Google, call my friend 5 minutes before we arrive. Their number is on the wall of the event page for my friend's party.
- OK Google, find me a restaurant that has non-americanized Chinese food, recommended with higher frequency on Chinese language websites than English websites, and has vegetarian options.
- OK Google, connect to my Nest thermostat. My username is XXX and my password is YYY. Turn on the air conditioner 30 minutes before we arrive home based on the navigation.
- OK Google, close that Java error that popped up and is covering the navigation.
- OK Google, please zoom out the map slightly so I can see how far we are from the destination.
- OK Google, please go back to the normal navigation view.
- OK Google, navigate to my next calendar event's location.
- OK Google, install Facebook Messenger, login as XXX with password YYY, and message ZZZ saying that I'll be late by however much the Uber app estimates.
- OK Google, turn off my alarm clock whenever I am biking or driving.
- OK Google, let me know if I get an e-mail from XXX in the next 2 hours.
- OK Google, please block all calls except from XXX for the next 1 hour. XXX's phone number is in their e-mail signature.
Yeah. We're a LONG way from assistants being useful. The pieces are all there. It's not a machine learning problem anymore. It's just that there are way too many walled gardens between the various parties that hold the data necessary to be useful.
> there are just way too many walled gardens between the various parties that hold the data necessary to implement any of the above.
This is changing. If Google had their way, they would be assigning IPV6 addresses to bits of dust lying around in your house and trying to assign semantic meaning to them. If they had their way.