To Commercialize, Voice Tech Must First Solve Its ‘Cocktail Party Problem’

Ken Sutton

Sep 14, 2021

Our voice technologies have not been engineered to confront the messiness of the real world or the cacophony of our actual lives.

On average, men and women speak roughly 15,000 words per day. We call our friends and family, log in to Zoom for meetings with our colleagues, discuss our days with our loved ones or, if you’re like me, argue with the ref about a bad call in the playoffs.

Hospitality, travel, IoT and the auto industry are all on the cusp of leveling-up voice assistant adoption and the monetization of voice. The global voice and speech recognition market is expected to grow at a CAGR of 17.2% from 2019 to reach $26.8 billion by 2025, according to Meticulous Research. Companies like Amazon and Apple will accelerate this growth as they leverage ambient computing capabilities, which will continue to push voice interfaces forward as a primary interface.

As voice technologies become ubiquitous, companies are turning their focus to the value of the data latent in these new channels. Microsoft’s recent acquisition of Nuance is not just about achieving better NLP or voice assistant technology; it’s also about the data.

Google has monetized every click of your mouse, and the same thing is now happening with voice. Advertisers have found that speak-through conversion rates are higher than click-through conversion rates. Brands need to begin developing voice strategies to reach customers, or risk being left behind.

Voice tech adoption was already on the rise, but with most of the world under lockdown during the COVID-19 pandemic, adoption is set to skyrocket. Nearly 40% of internet users in the U.S. used smart speakers at least monthly in 2020, according to Insider Intelligence.

Yet, there are several fundamental technology barriers keeping us from reaching the full potential of the technology.

The Steep Climb to Commercializing Voice

By the end of 2020, worldwide shipments of wearable devices had risen 27.2% from a year earlier to 153.5 million. Yet despite all the progress made in voice technologies and their integration into a plethora of end-user devices, these technologies are still largely limited to simple tasks. That is finally starting to change as consumers demand more from these interactions and voice becomes a more essential interface.

In 2018, in-car shoppers spent $230 billion to order food, coffee, groceries or items to pick up at a store. The auto industry is one of the earliest adopters of voice AI, but in order to really capture voice technology’s true potential, it needs to become a more seamless, truly hands-free experience. Ambient car noise still muddies the signal enough that it keeps users tethered to using their phones.

Simply selling more voice-enabled devices won’t magically solve the limitations of voice technology. There are two main challenges confronting the evolution of voice technologies: understanding intent and emotion, and overcoming issues associated with signal-to-noise ratios (SNR) in high-noise or crowded environments.
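
For readers who have not worked with SNR, it compares the average power of the signal you want with the average power of everything else, usually expressed in decibels. The sketch below is only an illustration under stated assumptions: the `snr_db` helper, the 440 Hz tone standing in for speech, and the 60 Hz hum standing in for background noise are all hypothetical and are not drawn from any particular voice product.

```python
import math


def snr_db(signal, noise):
    """Return the signal-to-noise ratio in decibels.

    Both arguments are sequences of audio samples; power is taken as
    the mean of the squared samples.
    """
    signal_power = sum(s * s for s in signal) / len(signal)
    noise_power = sum(n * n for n in noise) / len(noise)
    return 10.0 * math.log10(signal_power / noise_power)


# Hypothetical one-second clips sampled at 8 kHz (assumed values).
rate = 8000
# A full-amplitude 440 Hz tone stands in for the talker.
speech = [math.sin(2 * math.pi * 440 * t / rate) for t in range(rate)]
# A 60 Hz hum at one tenth the amplitude stands in for background noise.
hum = [0.1 * math.sin(2 * math.pi * 60 * t / rate) for t in range(rate)]

# One tenth the amplitude means one hundredth the power, or about 20 dB.
print(round(snr_db(speech, hum), 1))
```

In a quiet room the ratio is high, as in this toy case. In a moving car or a crowded restaurant the noise power climbs toward, or past, the power of the voice, and that is where recognition tends to break down.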

Do You Understand the Words Coming Out of My Mouth?

Intent has been a core, and improving, focus of most NLP technologies. Swaths of data have been collected to help voice assistants better understand intent. While voice tech has advanced in certain areas, such as customer service channels, it still faces major challenges when confronted with understanding the myriad signals from the real world.

We have grown our ability to understand signals of intent in closed channels that require specific understanding. That is valuable for doing simple tasks, knowing when to escalate a customer’s problem to a human agent, or seamlessly directing customers through a limited set of options. For the tech to be viable in real-world situations, however, it must understand a much wider variety of situations and inputs.

Voice technologies currently work in conjunction with other data points from wearables, and as we gain more signals that we can correlate, we can begin to provide more agile and robust context for greater understanding in voice technologies.

Using Human Tools to Solve Human Problems

The background noise and chatter challenge has been a difficult one for voice technologies to overcome. Much like intent and emotion, we have not engineered our voice technologies to parse real-world cacophony. This “cocktail party problem” is one of the greatest barriers to voice technologies reaching a level of understanding comparable to humans. Exacerbating this challenge is the fact that we simply can’t achieve adequate testing for this effect in a traditional lab environment.

The growing adoption of voice in devices, and the quality and quantity of data that adoption now gives us, offers the prospect of finally overcoming the cocktail party problem. Overcoming it will be necessary for the technology to reach its full usefulness.

Solving these problems requires voice tech to meet the human standard for voice and match the complexities of the human auditory system. Yes, you need really good NLP and conversational AI, but this goes deeper — you have to be able to extract clean and complete signals.

When we develop voice strategies that account and solve for these challenges, the business proposition for voice becomes unavoidable. The underlying data takes on enormous value overnight. When you have a clean signal, you have access to contextual data that brands desperately need for quality customer engagements.

Such data will let you understand what type of purchasing decisions happen when a person is energetic or tired. It allows us to know what types of music should be played based on the mood. It allows us to identify speakers accurately and correlate behaviors to individuals in a household.

Better contextualization and understanding need to be a priority so these technologies can develop past their current limitations. To unlock that real-world potential, we need to focus on real-world situations.

About the Author

Ken Sutton is CEO and co-founder of Yobe, a software company that uses edge-based AI to unlock the potential of voice technologies for modern brands.
