As deepfakes proliferate, OpenAI is refining the tech used to clone voices — but the company insists it’s doing so responsibly.
Today marks the preview debut of OpenAI’s Voice Engine, an expansion of the company’s existing text-to-speech API. Under development for about two years, Voice Engine allows users to upload any 15-second voice sample to generate a synthetic copy of that voice. But there’s no date for public availability yet, giving the company time to respond to how the model is used and abused.
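Voice Engine itself has no public API yet, but the text-to-speech endpoint it extends is already open to developers. As a rough illustration of that baseline (not of Voice Engine's unreleased custom-voice capability), here is a minimal sketch using OpenAI's published Python SDK with one of its stock preset voices; the model name, voice name, and output path are ordinary public values chosen for the example.

    # Minimal sketch of OpenAI's existing text-to-speech API, which Voice Engine extends.
    # This uses a stock preset voice ("alloy"); Voice Engine's custom voices are not public.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    speech = client.audio.speech.create(
        model="tts-1",   # current public text-to-speech model
        voice="alloy",   # one of the built-in preset voices
        input="Voice Engine remains in a limited preview for now.",
    )

    # Write the returned audio bytes (MP3 by default) to disk.
    with open("speech.mp3", "wb") as f:
        f.write(speech.read())

The point of the comparison is simply that the request shape (a model, a voice, and input text) already exists today; what the Voice Engine preview adds is deriving the voice from a short recorded sample rather than picking a preset.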
“We want to make sure that everyone feels good about how it’s being deployed — that we understand the landscape of where this tech is dangerous and we have mitigations in place for that,” Jeff Harris, a member of the product staff at OpenAI, told TechCrunch in an interview.
The early results from that test, a feature that can read words aloud in a convincing human voice, highlight a new frontier for artificial intelligence while raising the specter of deepfake risks. OpenAI is sharing demos and use cases from a small-scale preview of Voice Engine that has reached about 10 developers so far, a spokesperson said; the company briefed reporters on the feature earlier this month but decided against a wider rollout.
The Voice Engine: A Sneak Peek
OpenAI’s Voice Engine is no ordinary text-to-speech model. Unlike predecessors that churned out robotic or monotonous audio, Voice Engine crafts speech that sounds eerily human, capturing the nuances of individual voices: their cadence, intonation, and idiosyncrasies. All it requires is 15 seconds of recorded audio from the person whose voice it aims to mimic.
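To make that 15-second requirement concrete: before any cloning happens, a developer would presumably want to verify that a reference recording is long enough. The sketch below does only that local sanity check with Python's standard wave module; the file name, the WAV format, and the idea of enforcing a minimum length are illustrative assumptions, since OpenAI has not published intake requirements. Only the 15-second figure comes from its description of Voice Engine.

    # Illustrative pre-check on a reference clip, assuming a PCM WAV file.
    # Only the 15-second figure comes from OpenAI's description; the rest is assumed.
    import wave

    def clip_duration_seconds(path: str) -> float:
        """Return the duration of a WAV file in seconds."""
        with wave.open(path, "rb") as wav:
            return wav.getnframes() / float(wav.getframerate())

    REQUIRED_SECONDS = 15.0
    sample_path = "reference_voice.wav"  # hypothetical recording of the target speaker

    duration = clip_duration_seconds(sample_path)
    if duration < REQUIRED_SECONDS:
        raise ValueError(f"Clip is {duration:.1f}s; Voice Engine reportedly needs about 15s")
    print(f"Reference clip looks long enough: {duration:.1f}s of audio")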
During a recent demonstration, OpenAI CEO Sam Altman briefly explained the technology using a synthetic voice that was indistinguishable from his actual speech. Imagine a world where AI-generated voices are virtually impossible to tell apart from real ones. It’s a leap forward that both fascinates and unnerves.
The Safety Dance
But tread carefully, for this path is fraught with safety considerations. The ability to mimic human speech with such precision raises ethical questions: what happens if the technology falls into the wrong hands? Imagine political leaders, celebrities, or even your neighbor being impersonated by an AI-generated voice. The implications are staggering.
OpenAI acknowledges these risks. In a blog post, they state, “We recognize that generating speech that resembles people’s voices has serious risks, which are especially top of mind in an election year.” They’re actively engaging with stakeholders—policymakers, industry experts, educators, and creatives—to ensure responsible development.
The Deepfake Dilemma
Deepfakes have already infiltrated our digital landscape. In January, a realistic-sounding robocall purporting to be from President Joe Biden urged New Hampshire residents not to vote in the state’s presidential primary. The AI-generated voice was convincing enough to cause a stir. Now, imagine that sophistication applied to any voice, anywhere.
Use Cases and Partnerships
Among the select group of developers with early access is the Norman Prince Neurosciences Institute at the not-for-profit health system Lifespan, which is using Voice Engine to help patients recover their lost voices. It is a noble application indeed.
The Verdict
Voice Engine is a technical marvel, capable of producing human-caliber speech. But as we venture into this uncharted territory, let’s proceed with caution. The power to mimic voices is both awe-inspiring and unnerving. As OpenAI continues to refine this tool, we must collectively shape its responsible use.
Stay tuned for more updates as Voice Engine evolves. The future of audio just got a whole lot more intriguing—and a tad unsettling.