Yamaha Announces American English VOCALOID4 Library "CYBER DIVA"

Apr 20, 2015

At the NAMM 2015 trade show, Yamaha announced the upcoming release of a female American English VOCALOID4 library called CYBER DIVA. As Yamaha’s first direct release of an English VOCALOID sound bank, CYBER DIVA is touted as being able to reproduce the natural pronunciation of a native English speaker. It is expected to be available as a download to customers in the US during the first quarter of this year for 149.99 USD, and a packaged version is expected to be released in Japan in early February. There are currently five demo songs available on YouTube (1,2,3,4,5): two created by UtataP and three made by a collaboration between CircusP and CrusherP. The recommended pitch range is G2 to C4, and the recommended tempo range is 60 to 180 beats per minute. The library will have only a single voice bank (and thus is not cross-synthesizable), but it is touted as being able to reproduce “rough” as well as “harsh” vocals through VOCALOID4’s growl feature. The official website also posted an interview detailing the development process and the many problems encountered over nearly two years of development.

According to the interview with developer Michael Wilson, the journey began in late 2012 with Yamaha sitting down and analyzing existing English voice banks to determine why English didn’t sound very good with VOCALOID. Three conclusions were reached: the phonetic dictionary used to convert written words into phonemes contained a mix of British and American English, the audio sometimes didn’t match the phoneme it was supposed to represent, and the presence of noise deteriorated synthesis quality.
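To illustrate the first of those conclusions, here is a minimal Python sketch of how a word-to-phoneme dictionary with mixed dialects can produce an inconsistent accent. The dictionary entries, the dialect tags, the ARPAbet-style symbols, and the function names are all assumptions made for this example; none of them come from VOCALOID's actual dictionary or phoneme set.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy dictionary: word -> (dialect tag, ARPAbet-style phonemes).
# These entries are illustrative only, not taken from any VOCALOID dictionary.
MIXED_DICT: Dict[str, Tuple[str, List[str]]] = {
    "dance": ("US", ["D", "AE", "N", "S"]),
    "bath": ("UK", ["B", "AA", "TH"]),  # British-style broad "a"
    "last": ("US", ["L", "AE", "S", "T"]),
}


def to_phonemes(lyrics: str, dictionary: Dict[str, Tuple[str, List[str]]]) -> List[List[str]]:
    """Look up each word of the lyrics and return its phoneme list."""
    return [dictionary[word][1] for word in lyrics.lower().split()]


def dialects_used(lyrics: str, dictionary: Dict[str, Tuple[str, List[str]]]) -> Set[str]:
    """Return the set of dialect tags the lyrics would draw on."""
    return {dictionary[word][0] for word in lyrics.lower().split()}


# A line whose words come from differently tagged entries ends up
# mixing accents, even though each individual lookup is valid.
print(dialects_used("last dance", MIXED_DICT))
print(dialects_used("bath dance", MIXED_DICT))
```

In this toy setup the first line stays within one dialect while the second mixes two, which is the kind of inconsistency a single-dialect dictionary avoids.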

In initial tests, the English project team discovered many problems that stood in the way of a high-quality voice bank, the most egregious being the mislabeling of phonemes. Michael attempted to fix this by designing a new recording script (VOCALOID libraries are created from recordings in which the voice provider sings a special script containing all the necessary phonemes). The new script made the resulting voice easier to understand, thanks to higher consistency and less mislabeling. However, this came at the cost of expressiveness, because the shorter script was harder to sing.

The recording process went through four singers in total: two for the initial test before the script changes and two afterwards. Eventually the team settled on coaching one singer from the second pair to sound more like the other, and then spent several hours recording multiple passes through the script. Michael describes how they recorded three different pitches for diphones and triphones (sounds consisting of two or three phonemes, respectively, e.g. the diphone “he” and the triphone “ello” in “hello”) and six different pitches for “stationaries” (phonemes that can be prolonged, like vowel sounds). Six pitches for stationaries is far more than in any previous English VOCALOID library. The different pitches are integral to the VOCALOID synthesis engine and allow for a more natural result, since a singer will often sing the same syllable differently depending on where it falls in their pitch range. The library also incorporates 231 triphones per pitch, the most in any English sound bank to date.
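As a rough illustration of the diphone and triphone idea, the Python sketch below slides a window over a phoneme sequence and then works out the triphone recording count implied by the figures above. The one-letter phoneme spelling of “hello” is a simplification chosen to match the article's example, and the `ngrams` helper is hypothetical; neither reflects VOCALOID's real phoneme symbols or tooling.

```python
from typing import List, Tuple


def ngrams(phonemes: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return every run of n consecutive phonemes in the sequence."""
    return [tuple(phonemes[i:i + n]) for i in range(len(phonemes) - n + 1)]


# Simplified, hypothetical phoneme spelling of "hello"
# (not VOCALOID's actual symbol set).
hello = ["h", "e", "l", "o"]

diphones = ngrams(hello, 2)   # includes ("h", "e"), the article's "he"
triphones = ngrams(hello, 3)  # includes ("e", "l", "o"), the article's "ello"

# Figures quoted from the interview: 231 triphones per pitch, 3 pitches.
TRIPHONES_PER_PITCH = 231
TRIPHONE_PITCHES = 3
triphone_recordings = TRIPHONES_PER_PITCH * TRIPHONE_PITCHES

print(diphones)
print(triphones)
print(triphone_recordings)
```

Multiplying the two quoted figures gives 693 triphone samples in total, assuming every triphone was recorded at every one of the three pitches.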

Turning the audio into a library and fixing pronunciation issues brought its own share of problems. As the only native English speaker on the team, Michael often found his concerns about particular pronunciation problems tabled by the rest of the team, so he recruited two other native speakers to tip the scales. With more people involved, team member Baba developed a bug tracker to keep track of all the pronunciation issues that needed to be fixed. Another problem was that the synthesis would sometimes yield noisy results; this was fixed by tweaking the excitation-plus-resonance (EpR) model in a fashion similar to that employed by the Music Technology Group at Universitat Pompeu Fabra, birthplace of the technology behind VOCALOID. Lastly, there were deficiencies in the standard English pronunciation dictionary that came with VOCALOID, so the team painstakingly developed a new custom dictionary whose 10,000+ entries were personally vetted by the native speakers on the team. Incidentally, this dictionary revealed an inefficiency in the VOCALOID software, which had to be fixed before the dictionary could be used.

Eventually, the product was “finalized” in March of 2014, but with the announcement of VOCALOID4 and its growl feature, some extra samples had to be recorded to include in the final release.