Phoneme Modification for Procedural Name Generation

25 Nov 2016

I started this mini-project primarily to play with neural networks. After reading Michael Nielsen’s “Neural Networks and Deep Learning”^[1] and Andrej Karpathy’s “Unreasonable Effectiveness of Recurrent Neural Networks”^[2] and lots of neat tweets from Alex Champandard^[3], I decided to spend a few weeks learning a bit more about them.

I read about convolutional^[4] neural networks {1}^[5] - {2}^[6] - {3}^[7] - {4}^[8], which group nearby inputs together, primarily for image processing, but can also be used for text processing. I then read about recurrent^[9] neural networks {1}^[10] - {2}^[11] - {3}^[12], which handle sequences of inputs and outputs. I also read about deep neural networks, which stack many layers together to learn at different levels.

I usually need a mini project to learn a topic, so I decided I’d use procedural name generation to guide my learning.

1 Motivating Example

In Asimov’s stories, the name of the robot Daneel Olivaw is similar to Daniel Oliver. Can I generate names that are similar to existing names, but with minor changes?

Start with	`d a n i e l`	`o l i v e r`
Pronounce as	`D AE N Y AH L`	`AA L AH V ER`
Modify to	`D AE N IY1 L`	`AA L AH V AO`
Spelled as	`d a n e e l`	`o l i v a w`

This seems like a potentially useful way to procedurally generate names that are similar to existing names, but spelled or pronounced differently. My plan was to use one neural network to learn how to pronounce words and another neural network to learn how to spell words given a pronunciation.

2 Coding

I spent a few days reading hundreds of articles, but I learn best by alternating reading and playing, so I had to code for a bit. I installed TensorFlow and tflearn and went through the computer vision example in the tutorial^[13], then turned to recurrent neural networks, which seemed like a good fit for procedural name generation. I used the TensorFlow example for city name generation and ran it on the U.S. baby name list^[14]. Sometimes it worked well (Lonnie Kristopherk Karo Jecerel Recel Jorshon) but other times it didn’t (Breltrie gg nneiee iaEiutFuA hko). It’s not that different from Markov chain name generation. It’s not using the pronunciation at all; it’s just predicting what letters fit together.
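For comparison, a character-level Markov chain generator fits in a few lines. This is a minimal sketch, not the TensorFlow example mentioned above; the toy name list, the `order` parameter, and the `^`/`$` padding markers are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def build_model(names, order=2):
    """Map each `order`-letter context to the letters that followed it."""
    model = defaultdict(list)
    for name in names:
        # "^" pads the start, "$" marks the end of a name (illustrative markers).
        padded = "^" * order + name.lower() + "$"
        for i in range(len(padded) - order):
            model[padded[i:i + order]].append(padded[i + order])
    return model

def generate(model, order=2, rng=random, max_len=12):
    """Walk the chain from the start context until "$" or max_len letters."""
    context, out = "^" * order, []
    while len(out) < max_len:
        letter = rng.choice(model[context])
        if letter == "$":
            break
        out.append(letter)
        context = context[1:] + letter
    return "".join(out)

names = ["daniel", "oliver", "michael", "carrie", "kerry"]  # toy data
model = build_model(names)
print(generate(model, rng=random.Random(0)).capitalize())
```

Like the recurrent network, this only learns which letters tend to follow which; it knows nothing about pronunciation.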

For the procedural name generation I wanted to try sequence to sequence^[15] models {1}^[16]. These are recurrent neural networks but they wait until the end of the input string before writing out the output. This means they need to store everything in memory. It seems like a bad idea to me, but it seems to work in many situations, including Google Translate^[17]. TensorFlow has an example program to translate English into French.

The spelling daniel is represented as graphemes d a n i e l. The pronunciation is phonemes D AE N Y AH L. The idea for pronunciation is to “translate” the sequence of graphemes into a sequence of phonemes. Then I modify the phonemes in some way. Then I translate the phonemes back into graphemes.

3 Pronunciation (grapheme→phoneme)

It turns out the CMU Sphinx project has already implemented translation from graphemes to phonemes, trained on the CMU Pronouncing Dictionary^[18], which has an ASCII (not IPA) representation of pronunciations for over 100,000 English words. I downloaded their g2p-seq2seq code^[19]. They also have a pre-trained neural network model. Convenient! I ran their code on their data:

git clone https://github.com/cmusphinx/cmudict
git clone https://github.com/cmusphinx/g2p-seq2seq.git
cd g2p-seq2seq
curl -L -o g2p-seq2seq-cmudict.tar.gz https://sourceforge.net/projects/cmusphinx/files/G2P%20Models/g2p-seq2seq-cmudict.tar.gz/download
tar xf g2p-seq2seq-cmudict.tar.gz

PYTHONPATH=. python3 g2p_seq2seq/app.py --interactive --model g2p-seq2seq-cmudict

I typed in names like michael and got back pronunciations like M AY K AH L. Good, it works.

Since I also wanted to learn how to run this stuff myself, I decided to train a small model and make sure it worked. The first step was to clean up the CMU dictionary. The original version has comments (lines starting with “;;;” or lines with “# ...”) and stress annotations (michael is M AY1 K AH0 L instead of M AY K AH L). The easiest way to clean it up was to remove all digits and then drop every line that contains anything other than letters and whitespace:

perl -ne 's/\d//g; print if /^[a-zA-Z\s]+$/;' <../cmudict/cmudict.dict >cmudict-basic.txt
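The same cleanup can be sketched in Python. This is an assumed equivalent of the Perl one-liner, not code from the project; the function name `clean_lines` and the sample lines are made up, written in the dictionary format described above.

```python
import re

# A line survives only if, after digit removal, it is nothing but
# ASCII letters and whitespace (same test as the Perl regex).
LETTERS_AND_SPACE = re.compile(r"[a-zA-Z\s]+", re.ASCII)

def clean_lines(lines):
    """Strip stress digits, then keep only letters-and-whitespace lines."""
    for line in lines:
        line = re.sub(r"[0-9]", "", line)
        if LETTERS_AND_SPACE.fullmatch(line):
            yield line

sample = [
    ";;; comment line\n",
    "michael M AY1 K AH0 L\n",
    "a(2) EY1\n",        # alternate-pronunciation marker: dropped
    "'bout B AW1 T\n",   # apostrophe: dropped
    "daniel D AE1 N Y AH0 L\n",
]
for line in clean_lines(sample):
    print(line, end="")
```

To process the real file, pass an open file handle to `clean_lines` and write the results to `cmudict-basic.txt`.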

The next step was to train a small model:

PYTHONPATH=. python3 g2p_seq2seq/app.py --model my-g2p --train cmudict-basic.txt --size 32

I let this run for a while and realized I didn’t know if it would ever stop. Fortunately, it saves checkpoints along the way. I stopped it after half an hour. I then tried using the model:

PYTHONPATH=. python3 g2p_seq2seq/app.py --model my-g2p --interactive

I typed in names like michael and got back pronunciations like M IH CH EY L (pronouncing the ch like “cheese” and not like “kite”). Ok, not the right answer, but an understandable mistake. Would it be better if I trained a bigger model for longer? Yes! I let it run for 6 hours, and it produced much better results. Great! I had grapheme to phoneme working. (Note: I didn’t actually use the results of this step; I only did it to make sure I could run it, before changing the code.)

4 Spelling (phoneme→grapheme)

With pronunciation there are several projects that do what I want, including some using neural networks. For spelling I looked for projects that convert phonemes to graphemes but I didn’t find anything. So instead I read the g2p-seq2seq code and modified it to work on phonemes to graphemes (phoneme2grapheme.py).

python3 phoneme2grapheme.py --model my-g2p --train cmudict-basic.txt --size 32 --max_steps 5000
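Conceptually, the phoneme→grapheme direction uses the same dictionary with source and target swapped. The sketch below shows only that data-preparation idea; it is not the contents of phoneme2grapheme.py, and the helper name `load_pairs` is hypothetical.

```python
def load_pairs(lines):
    """Yield (phonemes, letters) pairs from cleaned dictionary lines.

    Each input line looks like "michael M AY K AH L": the word first,
    then its phonemes. For spelling, phonemes are the source sequence
    and the word's letters are the target sequence.
    """
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        word, phonemes = fields[0], fields[1:]
        yield phonemes, list(word)

pairs = list(load_pairs(["michael M AY K AH L\n", "\n"]))
print(pairs)
```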

It took a few more iterations of fixing my bugs before I got it to run. Yay, it now works! M AY K AH L produces michel. I trained it for around 16 hours, hoping for better results:

python3 phoneme2grapheme.py --model my-g2p --train cmudict-basic.txt --size 256 --max_steps 40000

Yay, it works! M AY K AH L produces michael. But does it work on new made-up names, or is it just memorizing the spelling of existing names (overfitting)?

5 Pronunciation Changes

I wrote a program phoneme-change.py to modify pronunciations in some way. For michael → M AY K AH L what happens if I change the initial M sound to get F AY K AH L or V AY K AH L? The neural network outputs feichel and vical. What happens if I change the first vowel sound (AY) to get M EH K AH L or M UH K AH L? The neural network outputs meckle and mookle. What happens if I add a prefix or suffix^[20], giving D IH M AY K AH L or M AY K AH L S AH N? The neural network outputs demichal and michelson. (See the “Output” appendix for more.)
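The three kinds of change described here (substitute a phoneme, add a prefix, add a suffix) are simple list operations. This is a minimal sketch of the idea, not the actual phoneme-change.py; the function names are assumptions.

```python
def substitute(phonemes, index, new):
    """Return a copy with the phoneme at `index` replaced by `new`."""
    return phonemes[:index] + [new] + phonemes[index + 1:]

def add_prefix(phonemes, prefix):
    return prefix + phonemes

def add_suffix(phonemes, suffix):
    return phonemes + suffix

michael = "M AY K AH L".split()
variants = [
    substitute(michael, 0, "F"),
    substitute(michael, 0, "V"),
    substitute(michael, 1, "EH"),
    substitute(michael, 1, "UH"),
    add_prefix(michael, ["D", "IH"]),
    add_suffix(michael, ["S", "AH", "N"]),
]
for variant in variants:
    print(" ".join(variant))
```

Each resulting phoneme sequence would then be fed to the spelling model to get a candidate name.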

The neural network successfully generates reasonable spellings.

Notice though that it didn’t keep the chae part of michael. There are a lot of different ways to spell that sound; it’s ichae in michael, icu in bicuspid, ica in formica, iche in lichen, yca in lycan, yco in glycogen, yche in psychedelic, yc in recycle. There isn’t one “right” way. If it had preserved the spelling, it might have made feichael, vichael, mechael, moochael, demichael, michaelson. Is this important? I don’t know. (See the “Alignment” appendix)

If I feed those spellings back into Apple’s speech synthesis, I don’t get the pronunciations I was trying to produce, so in that sense, even though the spellings are reasonable, they may not match what I want people to hear in their heads when they read the words.

Here’s where I finished this project.

6 Conclusion

Neural networks are neat, and there are more ways to use them than I had known about before.

I had played with neural networks a long time ago. Things are much easier to try now than they were twenty years ago:

You can get code: tools (that all seem to start with T) include Torch, Theano, TensorFlow.
You can get data: MNIST for handwritten digit recognition, CMUdict for pronunciation, web archives, image databases, Wikipedia, IMDb, OpenStreetMap, and lots more.
You can get compute resources: Azure, Amazon EC2, and Google Compute Engine give you easy access to lots of computing power (including GPUs).

When I was in college, I had access to 1 megaFLOPs and if I was lucky I’d get access to 50 megaFLOPs. Now, I can get 50 teraFLOPs for $10/hour. That’s one million times as much computing power as I had on a lucky day. And I can easily get even more.

As for name generation, I’m glad I picked it as a motivating example, as it was fun to play with and did motivate me to learn a bit. However, I don’t know what I’d actually want from a name generator like this, and I don’t think I’d use neural networks for it. If making a name generator had been my primary goal, I should’ve started by making these changes by hand, then deciding what algorithms would let me implement such a system. Instead, I started with the technology in search of a problem. That’s ok in this case, as my goal was to learn something about neural networks.

As cool as neural networks are, they’re still a “black box” that doesn’t offer me a lot of control over the output. For game design, including procedural generation, I think I want the designer to have more of a say in what comes out, and for that I will continue to use simpler systems unless there’s some compelling reason to use neural networks.

[Update <2020-05-21 Thu>: Allison Parrish’s video^[21] is worth a watch! Allison also uses the CMU dictionary and has built a Python library^[22] that handles both pronunciation (letters→phonemes) and spelling (phonemes→letters) using sequence-to-sequence neural networks. This library is much easier to use than what I did, and it makes me want to try playing with this topic again. Skip to 23 minutes in if you want to see what the library can do.]

[Update <2026-02-01 Sun>: I tried this in 2016. If I were to try it again in 2026, there are newer techniques I could use, such as microgpt^[23], picogpt^[24], or eemicrogpt^[25].]

7 Appendix: Phoneme Representation

The standard phoneme representation is IPA, the International Phonetic Alphabet^[26], but it’s hard to type. CMUdict uses ARPAbet^[27], which is ASCII-only and focused on English. Apple uses its own syntax^[28], which also seems focused on English; it is similar to ARPAbet for vowels and different for consonants. The Web Speech API supposedly supports SSML^[29], but I’m not sure if any browsers actually support this; Chrome on Mac supports Apple’s syntax instead (!?). The Festival^[30] speech library uses SABLE^[31].

I used ARPAbet because I was using CMUdict. However, I also wrote an ARPAbet to Apple converter arpabet-to-apple.py so that I could use Apple’s speech synthesis to hear some of the new words I was generating. I had hoped to use the Web Speech API but it seems pretty limited at this point. I later found this program^[32] to convert formats, but didn’t try it because I had already written my own.

It turns out the stress annotations I had removed to keep things simple actually make a big difference in pronunciation. When I wanted the system to say daneel, D AE N IY1 L is the closest match, but I’m working with phonemes D AE N IY L, which sound a bit different. Pronunciation is hard! It’s a lot more than just the phonemes. Run these on a Mac:

say -v Alex '[[inpt PHON]] dAEnIYl'
say -v Alex '[[inpt PHON]] dAEn1IYl'
say -v Alex 'daneel'

say -v Alex '[[inpt PHON]] mAYkUXl'
say -v Alex 'michael'

Note that the [[inpt PHON]] syntax doesn’t work with the voices added in newer versions of macOS, according to this Stack Exchange question^[33].
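The `say` commands above can be generated from ARPAbet with a small lookup table. This is a partial sketch, not the actual arpabet-to-apple.py: the table only covers the symbols that appear in the examples above, and dropping stress digits other than 1 is an assumption.

```python
# Partial ARPAbet → Apple mapping, limited to symbols used in the
# examples above. A full converter needs the whole phoneme table.
ARPABET_TO_APPLE = {
    "D": "d", "N": "n", "L": "l", "M": "m", "K": "k",
    "AE": "AE", "IY": "IY", "AY": "AY", "AH": "UX",
}

def to_apple(phonemes):
    """Convert ARPAbet phonemes (with optional stress digits) to Apple syntax."""
    out = []
    for p in phonemes:
        stress = p[-1] if p[-1].isdigit() else ""
        base = p.rstrip("0123456789")
        symbol = ARPABET_TO_APPLE[base]  # KeyError if outside the partial table
        # Primary stress is written as "1" before the vowel, as in dAEn1IYl.
        # Other stress levels are dropped here (an assumption).
        out.append(("1" if stress == "1" else "") + symbol)
    return "".join(out)

def say_command(phonemes, voice="Alex"):
    """Build the shell command shown above for a phoneme list."""
    return f"say -v {voice} '[[inpt PHON]] {to_apple(phonemes)}'"

print(say_command("D AE N IY1 L".split()))
print(say_command("M AY K AH L".split()))
```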

8 Appendix: Beam Search

One thing I had hoped to get was multiple potential spellings for a given pronunciation. There’s information I want to use that the neural network doesn’t have when it’s picking a single spelling. For example, carrie and kerry have the same pronunciation K EH R IY, so if I wanted to modify the pronunciation to K EH R IY AE N, I’d only get a single output, kariann. If I wanted separate carrie → carrianne and kerry → kerrianne spellings it’d be helpful if the neural network gave me a list of possible spellings, and then I could pick one closer to the original spelling.
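If a list of candidate spellings were available, picking the one closest to the original could be a plain string-similarity comparison. The sketch below uses `difflib.SequenceMatcher` from the Python standard library; the candidate list is hypothetical, not real model output, and other similarity measures would work too.

```python
import difflib

def closest_spelling(original, candidates):
    """Pick the candidate whose letters best match the original spelling."""
    def similarity(candidate):
        return difflib.SequenceMatcher(None, original, candidate).ratio()
    return max(candidates, key=similarity)

# Hypothetical candidate spellings for K EH R IY AE N.
candidates = ["kariann", "carrianne", "kerrianne"]

print(closest_spelling("carrie", candidates))
print(closest_spelling("kerry", candidates))
```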

To get multiple outputs from a sequence to sequence neural network I need to use beam search^[34], but the sample code in TensorFlow doesn’t include this feature^[35], and the CMU Sphinx code doesn’t either^[36]. It’s been asked for on other projects too^[37]. Although there’s sample code out there, I don’t understand how any of this works well enough to integrate it, and it’s not a priority for me so I didn’t pursue it.
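The core idea of beam search is small: at each step, extend every partial output with every possible next token, then keep only the best few. The sketch below is a generic illustration with a toy scoring table, not an integration with TensorFlow or g2p-seq2seq. The probabilities are made up, and unlike a real decoder the toy table looks only at the position, not at the letters chosen so far.

```python
import math

def beam_search(next_options, is_done, beam_width=3, max_steps=10):
    """Return up to beam_width finished sequences, best total log-probability first.

    next_options(seq) -> list of (token, probability) continuations.
    is_done(seq)      -> True when seq is a complete output.
    """
    beams, finished = [(0.0, [])], []
    for _ in range(max_steps):
        candidates = [
            (score + math.log(prob), seq + [token])
            for score, seq in beams
            for token, prob in next_options(seq)
        ]
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, seq in candidates[:beam_width]:  # prune to the beam
            (finished if is_done(seq) else beams).append((score, seq))
        if not beams:
            break
    finished.sort(key=lambda c: c[0], reverse=True)
    return finished[:beam_width]

# Toy stand-in for a model: made-up spelling probabilities per phoneme.
PHONEMES = ["K", "EH", "R", "IY"]
SPELLINGS = {
    "K": [("k", 0.6), ("c", 0.4)],
    "EH": [("e", 0.65), ("a", 0.35)],
    "R": [("rr", 0.8), ("r", 0.2)],
    "IY": [("y", 0.7), ("ie", 0.3)],
}

results = beam_search(
    next_options=lambda seq: SPELLINGS[PHONEMES[len(seq)]],
    is_done=lambda seq: len(seq) == len(PHONEMES),
)
for score, seq in results:
    print("".join(seq), round(math.exp(score), 4))
```

With a real sequence to sequence model, `next_options` would run one decoder step on the prefix and return the most likely next letters.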

9 Appendix: Alignment

If the goal is to modify michael to zichael then it’d be helpful to know which letters correspond to which sounds. This is called alignment:

m	i	ch	ae	l
M	AY	K	AH	L

There are a bunch of techniques for alignment. Take a look at this paper^[38].

With alignment, we can not only change the m to z without affecting the rest of the word, but we can also change how some sounds are spelled. Maybe some civilization in your game writes the AH sound as 'ō and the L sound as ll and we want Michael to be written Mich'ōll. The neural network may be good at capturing English spelling rules for existing words, but I think alignment would offer more possibilities for the designer or procedural generator to control how generated names work.
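Once a word is aligned, both kinds of change are easy to express. The sketch below hard-codes the michael alignment from the table above rather than computing it; the function names and the rule format are assumptions for illustration.

```python
# Hand-written alignment from the table above: (letters, phoneme) pairs.
michael = [("m", "M"), ("i", "AY"), ("ch", "K"), ("ae", "AH"), ("l", "L")]

def swap_sound(aligned, index, new_phoneme, new_letters):
    """Replace one aligned unit, leaving the rest of the word untouched."""
    return aligned[:index] + [(new_letters, new_phoneme)] + aligned[index + 1:]

def respell(aligned, spelling_rules):
    """Spell the word, overriding the letters for any phoneme in the rules."""
    return "".join(spelling_rules.get(phoneme, letters)
                   for letters, phoneme in aligned)

# Change only the first sound: michael → zichael.
print(respell(swap_sound(michael, 0, "Z", "z"), {}))

# A fictional civilization's spelling rules for AH and L.
print(respell(michael, {"AH": "'ō", "L": "ll"}).capitalize())
```

The first call keeps the chae spelling intact, which the neural network did not do.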

I wrote this page in 2016. In 2018 I played more with alignment in a project to procedurally modify spelling, and I made an interactive demo where you can put in your own words.

10 Appendix: Output

I took 5000 names from the U.S. Baby Name list^[39], changed a few phonemes at a time, and asked for the spellings of the new names.

Feel free to use these in your projects!

You can also download the full list of 800k+ unique names (this raw data includes duplicates).