Vox technica: How Siri gets its voice
In early October, CNN revealed that veteran voice actor Susan Bennett was the voice behind Siri until Apple changed it in iOS 7. Her utterances, she revealed in an interview, were being used by the tech giant (and its likely voice synthesis partner Nuance) to generate the digital assistant’s own words.
Of course, even a company as technologically sophisticated as Apple is unlikely to have figured out a way to clone Ms. Bennett and place tiny copies of her inside every iPad and iPhone. Which makes for a question more fascinating than that of Siri’s identity: How exactly is a person’s voice transformed into a software program that can synthesize any text thrown at it?
My voice is my passport
In Sneakers, a much underrated movie that seems oddly appropriate in today’s era of government spying on its own citizens, Robert Redford’s ragtag team of hackers manages to bypass a sophisticated voice-based security system by splicing together individual words taped from an unsuspecting employee.
The process of giving voice to iOS’s digital assistant may not be all that different, although it is far more thorough. “For a large and dynamic synthesis application, the voice talent (one or more actors) will be needed in the recording studio for anywhere from several weeks to a number of months,” says veteran voice actor Scott Reyns, who is based in San Francisco. “They’ll end up reading from thousands to tens of thousands of sentences so that a good amount of coverage is recorded for phrasing and intonation.”
The complexity of this process varies from language to language, because languages differ in how much meaning rides on pitch and intonation. Pronouncing English with the wrong intonation (like, say, not inflecting a question) results in a voice that sounds unnatural but doesn’t necessarily alter the meaning of the words that are spoken.
That’s not always the case, according to Arash Zafarnia, director for consulting firm Handsome, based in Austin, Texas: “Compare that to Chinese, where tone and intonations are essential to distinguishing words that have the same vowels and consonants,” and you end up with a whole new level of difficulty. For this reason, consistency is key in obtaining a good voice sample: “The same words and phrases have to be repeated dozens of times. The voice of the actor should not change at all—it must stay consistent through all the period of recordings in order to produce the best result possible,” says Zafarnia.
Slice and dice
Once the initial voice data has been collected, it must be broken down into small components that can then be reassembled into new words. Think of it as a high-tech version of cutting and splicing different lengths of tape together—a process that music producers (and would-be spies) were very familiar with before the advent of digital editing.
In order to produce high-quality output, individual words have to be broken down into phonemes, which are the building blocks of every spoken language. For example, the word Macintosh can be broken down into eight different phonemes, which are then classified according to the universally recognized International Phonetic Alphabet. That reduces the word to its basic sounds, represented in the IPA by the symbols m·æ·k·ɨ·n·t·ɒ·ʃ.
Each sound is categorized, with multiple copies stored in a database to provide variety. Common phonetic combinations are also extracted from the source material and stored alongside the individual phonemes to produce a more natural-sounding output. In extreme cases, entire phrases are manually assembled by voice specialists to produce the highest-quality output when synthesizing many common expressions.
The amount of work that goes into this phase is staggering, with many hundreds or thousands of individual snippets of sound extracted and saved, and it is critical to the ultimate quality of the speech a synthesizer produces. “The difference might be in intonation, stress, pitch,” says Zafarnia. “There might be dozens and hundreds of versions of the same vowel or consonant.”
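To make the idea of a phonetic database more concrete, here is a minimal Python sketch of one way such a store could be organized. It is an illustration only, not a description of what Apple or Nuance actually ship: the `Unit` fields, the `UnitDatabase` class, and the `rec_…` clip identifiers are all hypothetical. Each entry is keyed by the IPA symbols it covers, so a single phoneme and a common two-sound combination can sit side by side, each with several recorded variants.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One recorded snippet of sound. All fields are illustrative assumptions."""
    symbols: tuple    # IPA symbols covered, e.g. ("m",) or ("m", "æ")
    pitch_hz: float   # average pitch of the snippet
    stressed: bool    # whether it was cut from a stressed syllable
    clip_id: str      # hypothetical pointer to the stored audio


class UnitDatabase:
    """Maps a sequence of IPA symbols to every recorded variant of it."""

    def __init__(self):
        self._units = defaultdict(list)

    def add(self, unit):
        self._units[unit.symbols].append(unit)

    def variants(self, *symbols):
        # Returns an empty list when nothing was recorded for these symbols.
        return list(self._units.get(tuple(symbols), []))


def build_demo_database():
    """A toy database with made-up entries, just to show the shape."""
    db = UnitDatabase()
    db.add(Unit(("m",), 110.0, False, "rec_0001"))
    db.add(Unit(("m",), 140.0, True, "rec_0417"))     # second copy of the same sound
    db.add(Unit(("æ",), 120.0, True, "rec_0002"))
    db.add(Unit(("m", "æ"), 125.0, True, "rec_0933"))  # a common combination
    return db
```

Calling `build_demo_database().variants("m")` returns both recordings of that sound, while `variants("m", "æ")` returns the pre-cut combination. A real system would hold far more attributes per snippet and vastly more entries.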
Once the phonetic database is complete, it ships with the final product, installed either on servers that provide voice synthesis remotely across the Internet, as in Siri’s case, or directly on a device, as with the VoiceOver software included in both OS X and iOS.
When asked to transform a sentence into speech, the synthesis engine will first look for a predefined entry in its database. If it doesn’t find one, it will then try to make sense of the input’s linguistic makeup, so that it can assign the proper intonation to all the words. Next, it will break the sentence down into combinations of phonemes and look for the most appropriate candidate sounds in its database.
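The first steps of that sequence can be sketched in a few lines of Python. This is a deliberately crude illustration under stated assumptions, not the engine Siri uses: the `PREDEFINED` phrase table and its clip name are invented, the `LEXICON` holds only the Macintosh breakdown given earlier, and checking for a question mark stands in for real linguistic analysis.

```python
# Hypothetical table of hand-assembled phrases (clip name is made up).
PREDEFINED = {"good morning": ["clip:good_morning"]}

# Tiny pronunciation lexicon; the only entry is the article's own example.
LEXICON = {"macintosh": ["m", "æ", "k", "ɨ", "n", "t", "ɒ", "ʃ"]}


def plan_utterance(text):
    """Decide how a sentence would be voiced, without producing any audio."""
    key = text.lower().strip(" ?!.")

    # Step 1: use a predefined entry if one exists.
    if key in PREDEFINED:
        return {"source": "predefined", "clips": PREDEFINED[key]}

    # Step 2: a crude stand-in for linguistic analysis. Real engines look
    # at far more than the final punctuation mark.
    intonation = "rising" if text.rstrip().endswith("?") else "falling"

    # Step 3: break each word into phonemes; flag words the lexicon lacks.
    phonemes = []
    for word in key.split():
        phonemes.extend(LEXICON.get(word, ["<unknown:%s>" % word]))

    return {"source": "synthesized", "intonation": intonation, "phonemes": phonemes}
```

With these toy tables, `plan_utterance("Good morning!")` short-circuits to the predefined clip, while `plan_utterance("Macintosh?")` yields the eight phonemes tagged with a rising intonation. Words missing from the lexicon come back as `<unknown:…>` markers, which is where a real synthesizer would have to guess at a pronunciation.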
In an ideal scenario, the engine’s database would contain every possible combination of sounds that can be produced by a human voice—a goal that would be nearly impossible to achieve. Instead, the software looks for a series of best matches, stringing them together into a final audio stream. In some cases, such as with nonstandard or foreign words, this may be very hard to do, leading to incorrect results. “There are always things that the synthesizer has to actually synthesize—for example numbers or rarely used words,” says Handsome’s Zafarnia. “The former are not too difficult, [but the latter] are more difficult and have to be created [artificially],” often resulting in unusual or incorrect pronunciation.
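One textbook way to find such a series of best matches is to score every candidate snippet on how well it fits the desired sound and how smoothly it joins its neighbor, then pick the cheapest chain with dynamic programming. The Python sketch below shows that general idea only. Nothing in the interviews confirms that Siri works this way, and the pitch-based costs, the `select_units` function, and the clip names are all simplifying assumptions.

```python
def select_units(candidates, target_pitch):
    """Pick one snippet per phoneme so the whole chain costs the least.

    candidates: one list per phoneme, each holding (clip_id, pitch_hz) pairs.
    The cost model is an assumption: distance from the target pitch, plus
    the pitch jump between consecutive snippets.
    """
    if not candidates or any(not slot for slot in candidates):
        raise ValueError("every phoneme needs at least one candidate")

    # best[i][j] = (cheapest total cost ending at candidate j of slot i,
    #               index of the chosen candidate in slot i - 1)
    best = [[(abs(pitch - target_pitch), None) for _, pitch in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for _, pitch in candidates[i]:
            target_cost = abs(pitch - target_pitch)
            cost, back = min(
                (best[i - 1][k][0] + abs(pitch - prev_pitch), k)
                for k, (_, prev_pitch) in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost, back))
        best.append(row)

    # Walk back from the cheapest final candidate to recover the chain.
    j = min(range(len(best[-1])), key=lambda idx: best[-1][idx][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j][0])
        j = best[i][j][1]
    path.reverse()
    return path, total


demo_candidates = [
    [("m_a", 110.0), ("m_b", 140.0)],    # two recordings of the first sound
    [("ae_a", 120.0), ("ae_b", 180.0)],  # two recordings of the second
    [("k_a", 125.0)],                    # only one recording of the third
]
chosen, cost = select_units(demo_candidates, target_pitch=120.0)
```

Production systems weigh many more features than pitch, but the shape of the problem is the same: the engine settles for the least-bad chain of snippets because it can never store every possible combination.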
Almost like the real thing
Making Siri talk requires the contribution of many different experts, from actors to engineers to voice specialists. And even with the best technology currently available, the occasional slurred word or mispronounced name is inevitable.
Despite their ever-increasing accuracy, then, synthesized voices are still no substitute for the real thing. “The human voice is the most dynamic instrument we know of, so one doesn’t have to listen very closely to hear a lack of characteristic inflection and other qualities,” stresses actor Scott Reyns, adding that “when emotion, engaging and compelling an audience, telling a story, or getting a message across that sells counts, companies hire the real thing: actual humans.”
Updated at 1:17 p.m. Pacific to correct Handsome's location from San Francisco to Austin, Texas.