A Microphone, a Speaker, and an Internet Connection Walk into a Bar...




Both AI (Artificial Intelligence) and voice interfaces were hot topics at this year’s SXSW Interactive conference. While the technology still falls short of the independently cognitive vision familiar to science fiction fans, progress has been rapid, and opportunity is expanding. Two presentations from this year’s conference, Crafting Conversations: Design in the Age of AI and The Role of Voice in Music Discovery, captured notably contrasting approaches to designing for voice-enabled interfaces.

4.9 billion devices running Voice AI in 2016 will grow to a projected 21 billion by 2020.

In Crafting Conversations, Google™ Conversation Design Lead Daniel Padgett summarized the foundations of design for voice and highlighted the priorities that guide design practice for Google Home. “I teach robots to talk...” said Padgett, positioning the current state of voice interfaces as a stark contrast to earlier stages of computing in which “we had to learn to speak to the computer in its native language.” Indeed, to anyone who worked with early technologies such as punch cards and command-line inputs, today’s voice-enabled devices must seem almost miraculous.


Padgett’s views on the growth of Voice AI reveal much about Google’s broader strategy. He stressed the speed and simplicity of voice queries—comparing them to the number of “taps” necessary for text input for even simple searches—as well as the ubiquity of the service. These map well to Google’s core brand values, exemplified by its clean white search landing page. He also cited statistics illustrating the category’s growth: 400 million-plus devices running Google Assistant alone, a sales volume of one Google voice-enabled device per second from October 2017–January 2018, and a platform supporting 22 languages (also a core competency for Google).

In The Role of Voice in Music Discovery, SoundHound™ Inc. VP and General Manager Katie McMahon offered a contrasting view of the state of Voice AI. While Padgett emphasized the evolution of the technology, McMahon framed the current point in Voice AI development as a generational one. She stated that while the year 2000 defined the “Touch-Tap-Swipe generation,” 2015 marked “Gen V: the voice-first generation.” She also identified 2017 as a tipping point in the development of voice-enabled AI, much as 2007 marked the takeoff for mobile-first UX/UI strategies. She noted the coming growth of the category as well, from approximately 4.9 billion devices running Voice AI in 2016 to a projected 21 billion by 2020.


Google’s approach to designing for Voice AI focuses on the cognitive dynamics of human conversation. Padgett outlined four broad considerations that explain his team’s design approach for voice.

The first involves modeling conversation. Central to this is the Cooperative Principle developed by Paul Grice in 1975. Grice emphasized four “maxims” that facilitate effective conversation by building cooperation between speakers: quality (truth), quantity (the right amount), relevance (the topic at hand), and manner (clarity). Other linguistic cues, such as turn-taking, questions, silence, and even gesture, also inform our conversations. Indeed, Padgett emphasized the need for clear and concise inquiries to make the best use of Voice AI.


Google has optimized language processing to a word error rate of 4.9%, though interpreting how something is said remains unsolved.

The second consideration is knowing the two speakers involved in a Voice AI conversation. The first is the human side. Padgett described the human in the conversation as a “hands-busy/eyes-busy multitasker.” Google’s user personas identify them as “instant experts” with high standards and low tolerance for error in their use of Voice AI. They are happiest when acting within what they instinctively know and do in conversation. The flip side is the Voice AI itself. Padgett astutely describes this as literally “the voice of your brand.” As such, it deserves a specific role and even its own backstory to establish it as a core brand channel.

Google’s third consideration is the toolkit for Voice AI—addressing the nature of the voice signal itself, and the ability to recognize and understand speech. The spoken word, according to Padgett, is always moving and by nature ephemeral: always fading (best illustrated by a game of Pass It On). Google constructs its Voice AI responses around this ephemerality: answering the primary query, then adding prompts to explore further information. And while Google has optimized language processing to a word error rate of 4.9%, work remains to reconcile what someone says with the intuitive interpretation of how they said it.
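For context on the 4.9% figure cited above: word error rate (WER) is the standard metric for speech recognition accuracy, computed as the word-level edit distance between the recognized transcript and a reference transcript, divided by the reference length. The sketch below shows the conventional calculation; the sample sentences are invented for illustration and have nothing to do with Google’s internal benchmarks.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Classic dynamic-programming edit distance, applied to words instead of characters.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word out of five -> WER of 0.2 (20%).
print(wer("turn on the kitchen lights", "turn on the kitten lights"))  # 0.2
```

A 4.9% WER means roughly one word in twenty is inserted, deleted, or substituted—which is why the harder remaining problem is interpretation, not transcription.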

Lastly, Google considers the expanding ecosystem of technology and communication. It aspires to “design for the overlaps” between voice-only, voice-forward, intermodal, and visual communication and function.


SoundHound appears to take a more holistic, and perhaps more innovative, approach to the use of Voice AI. From its position as a leader in music discovery, it has developed a self-contained ecosystem for Voice AI–enabled devices and applications. SoundHound continues to focus on music but has integrated two additional applications: Houndify™, a Voice AI platform, and Hound™, an AI-enabled voice assistant.

The Houndify AI offers two strategic technology advances to voice queries relative to other platforms such as Amazon Echo™/Alexa™, Apple’s Siri™, or OK Google.

The first of these is a different model of query, described by McMahon as “compound/complex.” While Padgett stressed the ideal of simple, concise questions, that ideal still limits utility and remains at the level of “speaking in the computer’s language.” The Houndify AI can handle queries with both inclusions and exclusions, for example: “OK Hound, find me a restaurant within 3 miles but not a pizza place,” or “find me a flight next week to Chicago but not on United Airlines.” The answers generated by Houndify, while lengthier and more detailed than the Google Assistant’s, are also more specific. This is also a more intuitive manner of voice search for people: people often know more about what they aren’t looking for when they’re in browsing mode.
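The restaurant example above can be thought of as one query carrying both an inclusion constraint (within 3 miles) and an exclusion constraint (not pizza). The toy sketch below illustrates only that idea; it is not Houndify’s API, and the data, names, and function are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Place:
    name: str
    category: str
    miles: float

# Hypothetical sample data; names and fields are illustrative only.
places = [
    Place("Luigi's", "pizza", 1.2),
    Place("Taco Haven", "mexican", 2.5),
    Place("Crust & Co.", "pizza", 0.8),
    Place("Green Bowl", "salad", 4.1),
]

def compound_query(results, within_miles, exclude_category):
    """Apply an inclusion (distance) and an exclusion (category) in a single query."""
    return [p for p in results
            if p.miles <= within_miles and p.category != exclude_category]

# "Find me a restaurant within 3 miles but not a pizza place."
for p in compound_query(places, within_miles=3.0, exclude_category="pizza"):
    print(p.name)  # Taco Haven
```

The point of the compound/complex model is that both constraints are parsed from one natural utterance, rather than forcing the user into a sequence of simple, single-constraint questions.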


SoundHound users can search, discover and play music using voice commands instead of clicking, texting, tapping or swiping.

The second tech innovation for Houndify involves what McMahon called “Speech to Meaning.” This involves integrating the two primary Machine Learning aspects of Voice AI: Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU). By making these data sets interoperable, Houndify makes the interactions between human and AI more seamless and organic.

SoundHound displays its own penchant for innovation in its use of voice technology. Through UX research, SoundHound discovered that the number-one reason users abandoned its platform was frustration with navigating the app. Rather than taking a visual UX/UI approach to the problem, SoundHound pursued voice-driven navigation as a better, more user-focused solution. Now, SoundHound users can search, discover, and play music using voice commands instead of clicking, texting, tapping, or swiping. The company’s willingness to go beyond an incremental move exemplifies the innovative DNA of a company comfortable applying the principles of design thinking and agile development.


McMahon echoed Padgett’s endorsement of brand as an important dimension of Voice AI. Here, Houndify also diverges from Google’s strategy. While Google enjoys the strength of its brand and Android™ ecosystem, Houndify can be adapted by an independent brand to create and integrate its own Voice AI functionality and applications. Because Houndify and Hound don’t own a brand of device or system, they become an enabling platform, McMahon continued. This makes Houndify a potentially valuable partner for brands that prefer to amplify their own voice through the technology. This openness offers additional flexibility by being adaptable across devices.


Houndify is a potentially invaluable partner for brands that prefer to amplify their own voice.

Houndify’s flexibility gives designers and companies an additional dimension of choice to consider when integrating Voice AI. Companies may prefer the brand halo of Alexa, Google, or Siri as an amplifying feature. Or they may see a potential competitive advantage in creating their own Voice AI presence, one that’s unique to their brand.


Padgett indicated that Google’s strategy addresses both the static placement of in-home smart speakers and mobile devices. Each has unique operating conditions, levels of privacy, and utility for the user. Google’s expansion into smart displays (a category also being developed by Amazon™, Panasonic™, and others) also tips its hand. It’s clear that Google sees an integration of voice and visual browsing, particularly in the home environment.

Padgett also emphasized the need for better use of linguistics, creative writing, and scriptwriting as part of the UX toolkit for voice. McMahon countered that “with little or no UI, systems need to become smarter.” It is clear that this portends an advantage for the systems best able to automate Machine Learning and expand AI capabilities.

These are still early days for Voice AI, and while there are early leaders, it seems that there is still ample time to develop best practices and claim leadership in multiple markets.


: : Contact Tom Berno directly at tb.idea21@gmail.com for more information