CMU pronunciation rhyme dictionary

I am looking for a free or open-source rhyme database.

I found the CMU pronunciation "database" and a range of applications built on it, but I can't make sense of them or work out where the data comes from.

A simple text file with a word and its phonemes is all I need.

Does anyone know where I can find such a list, or how I would start building one from the CMU files?

+6
3 answers

cmudict

cmudict is a plain text file with a very simple format. Each line starts with the word, followed by two spaces; everything after the two spaces is the pronunciation. If a word has more than one pronunciation, you will see several entries for it, such as:

 word
 word(1)

The beginning of the file lists symbols and punctuation marks. Each symbol is followed by the English spelling of its name, with no space between them, then the two-space delimiter and the ARPAbet code. Since you are only looking for rhymes, you can ignore the symbol section, because you will never look up rhymes for ...ELLIPSIS

ARPAbet

For how ARPAbet codes map to IPA, see Wikipedia: http://en.wikipedia.org/wiki/Arpabet . Each mapping there comes with sample words. It is easy to see how the two relate to each other, which can help you read ARPAbet codes if you already know IPA.

Summary

Basically, if you have already found cmudict, then you already have what you asked for: a database of words and their pronunciations. To find words that rhyme, parse the flat file into a table and run a query for words whose ARPAbet codes end the same way.

General theory of the behavior of things

Part: Material

  • create a new database
  • create a table in a database with three fields: index, word, arpabet
  • read the cmudict file line by line
  • split each line into two parts at the point where two consecutive spaces are found
  • increment the index, then insert the index number, word, and ARPAbet code
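The steps above could be sketched roughly as follows in Python with the standard sqlite3 module. The table name cmudict and the column name idx are assumptions made for this illustration (index is a reserved word in SQL), and skipping lines that start with ;;; assumes the file uses that prefix for comments, as the 0.7b release does.

```python
import sqlite3

def load_cmudict(lines, conn):
    """Parse cmudict-style lines into a table (idx, word, arpabet)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cmudict "
        "(idx INTEGER PRIMARY KEY, word TEXT, arpabet TEXT)"
    )
    idx = 0
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and ';;;' comment lines (assumed comment prefix).
        if not line.strip() or line.startswith(";;;"):
            continue
        # Split at the first run of two consecutive spaces.
        word, sep, arpabet = line.partition("  ")
        if not sep:
            continue  # not a "word<two spaces>pronunciation" line
        idx += 1
        conn.execute(
            "INSERT INTO cmudict (idx, word, arpabet) VALUES (?, ?, ?)",
            (idx, word, arpabet.strip()),
        )
    conn.commit()
    return idx

# Hand-written sample lines in the cmudict layout, for illustration only.
sample = [
    ";;; a comment line",
    "LOVE  L AH1 V",
    "DOVE  D AH1 V",
    "ABOVE  AH0 B AH1 V",
    "DENEUVE(1)  D IY0 N AH1 V",
]
conn = sqlite3.connect(":memory:")
count = load_cmudict(sample, conn)
print(count, "entries loaded")
```

To load the real file, pass an open file object instead of the sample list. Which text encoding to open it with depends on the release you downloaded, so check that before relying on this.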

Then Umm ...

Once you have the data in a database of your choice, you can use it to find relationships between ARPAbet codes. You could find rhymes, consonance, and other mnemonic devices. It would go something like this:

Part: Thing

  • get the word you want to find the rhyme for
  • query the database for that word's ARPAbet code
  • split the ARPAbet code into pieces, breaking it wherever there is a space
  • take the last piece of the code and query the database for words whose ARPAbet codes end with that piece
  • come up with rhyming poems
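A minimal sketch of that lookup, again in Python with sqlite3. It assumes a table named cmudict with word and arpabet columns (names chosen for this example, not taken from any particular tool), and the sample rows reuse pronunciations quoted elsewhere in this thread.

```python
import sqlite3

def words_sharing_last_phoneme(conn, word):
    """Return other words whose ARPAbet code ends with the same last piece."""
    row = conn.execute(
        "SELECT arpabet FROM cmudict WHERE word = ?", (word.upper(),)
    ).fetchone()
    if row is None:
        return []  # word not in the table
    # Split the code on spaces and keep the last piece.
    last = row[0].split(" ")[-1]
    found = conn.execute(
        "SELECT word FROM cmudict "
        "WHERE word <> ? AND (arpabet = ? OR arpabet LIKE ?) "
        "ORDER BY word",
        # '% X' matches codes ending in the piece X on a phoneme boundary.
        (word.upper(), last, "% " + last),
    ).fetchall()
    return [w for (w,) in found]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cmudict (idx INTEGER PRIMARY KEY, word TEXT, arpabet TEXT)"
)
conn.executemany(
    "INSERT INTO cmudict (word, arpabet) VALUES (?, ?)",
    [
        ("LOVE", "L AH1 V"),
        ("DOVE", "D AH1 V"),
        ("ABOVE", "AH0 B AH1 V"),
        ("GOV", "G AH1 V"),
        ("AIKIN", "EY0 K IH0 N"),
    ],
)
matches = words_sharing_last_phoneme(conn, "love")
print(matches)
```

Matching on the last piece alone is a very loose notion of rhyme, since it only compares the final phoneme. Extending the comparison to several trailing pieces would tighten it.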

Labels and Spoilers

I got bored and wrote a Node.js module that covers the "Part: Material" listed above. If Node.js is installed on your computer, you can get the module by running npm install cmudict-to-sqlite . See https://npmjs.org/package/cmudict-to-sqlite for the README, or just look inside the module for the documentation.

+5

Rhyme Logic Using the CMU Pronouncing Dictionary

OK, suppose you want to use the CMU Pronouncing Dictionary data (example file: cmudict-0.7b) to build a list of all words that rhyme with "LOVE".

Here is how you can do it:

First, look up the pronunciation of LOVE. You will find this line in the dictionary, where "LOVE" and "L AH1 V" are separated by two spaces:

 LOVE  L AH1 V

This tells you that the word LOVE is pronounced L AH1 V.

Then find the vowel phoneme with primary stress. In other words, look for the number "1" in the pronunciation. The text immediately to the left of the 1 is the vowel that carries primary stress ( AH ). That text and everything to the right of it are your “rhyme phonemes” (for lack of a better term). So the rhyme phonemes for LOVE are AH1 V.
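As a rough sketch, extracting the rhyme phonemes could look like this in Python. The function name rhyme_phonemes is made up for this example, and it simply takes the first phoneme marked with a 1.

```python
def rhyme_phonemes(pronunciation):
    """Return the primary-stressed vowel and everything after it, or None."""
    phonemes = pronunciation.split()
    for i, phoneme in enumerate(phonemes):
        if phoneme.endswith("1"):  # first phoneme carrying primary stress
            return " ".join(phonemes[i:])
    return None  # no primary stress marked in this pronunciation

print(rhyme_phonemes("L AH1 V"))      # AH1 V
print(rhyme_phonemes("AH0 B AH1 V"))  # AH1 V
```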

We're halfway there! Now we just need to find other words whose pronunciations end in AH1 V. If you are following along in Notepad++, try a "Find All in Current Document" for the pattern AH1 V$ using the Regular Expression search mode. This will match lines like:

 Line 392: ABOVE  AH0 B AH1 V
 Line 10266: BELOVE  B IH0 L AH1 V
 Line 30204: DENEUVE  D IH0 N AH1 V
 Line 30205: DENEUVE(1)  D IY0 N AH1 V
 Line 34064: DOVE  D AH1 V
 Line 48177: GLOVE  G L AH1 V
 Line 49053: GOV  G AH1 V
 ... etc

Rhyme woooooords!

There are many ways to implement this and many corner cases, but this is roughly the approach that many online rhyming dictionaries seem to use when searching for perfect rhymes.
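For anyone who would rather script the search than use a text editor, here is a small Python sketch of the same scan. The function name and the sample lines are illustrative; the pattern mirrors the AH1 V$ search above but also requires the match to start on a phoneme boundary.

```python
import re

def find_rhyming_lines(lines, rhyme):
    """Return (word, pronunciation) pairs whose pronunciation ends with rhyme."""
    # Same idea as searching for "AH1 V$", anchored to a phoneme boundary.
    pattern = re.compile(r"(?:^| )" + re.escape(rhyme) + r"$")
    matches = []
    for line in lines:
        # Word and pronunciation are separated by two spaces.
        word, sep, pron = line.rstrip("\n").partition("  ")
        if sep and pattern.search(pron.strip()):
            matches.append((word, pron.strip()))
    return matches

# A few dictionary-style lines for illustration.
sample = [
    "ABOVE  AH0 B AH1 V",
    "DENEUVE(1)  D IY0 N AH1 V",
    "DOVE  D AH1 V",
    "GLOVE  G L AH1 V",
    "AIKIN  EY0 K IH0 N",
]
for word, pron in find_rhyming_lines(sample, "AH1 V"):
    print(word, pron)
```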

A hypothetical SQL approach for storing rhyme data

Obviously, performance will be a problem if you scan the whole dictionary every time someone wants a rhyme. If this is a concern, you can try storing or indexing the data in different ways.

Although it is not the most space-efficient option on disk, I have had good experiences storing this data in an SQL table with indexed columns.

For a simple conceptual example, you can compute the "rhyme phonemes" of all words in the dictionary, and then insert them into the "Rhymes" table, whose columns are {WordText, RhymePhonemes}. For example, you can see entries such as:

 {"ABOVE", "AH1 V"}
 {"DOVE", "AH1 V"}
 {"OUTLIVE", "IH1 V"}
 {"GRADUATE", "AE1 JH AH0 W AH0 T"}
 {"GRADUATE", "AE1 JH AH0 W EY2 T"}

... etc.

Then, to find the rhymes, you would issue a query such as:

 SELECT OTHER.WordText
 FROM Rhymes INPUT
 INNER JOIN Rhymes OTHER ON OTHER.RhymePhonemes = INPUT.RhymePhonemes
 WHERE INPUT.WordText = 'LOVE'
   AND OTHER.WordText <> INPUT.WordText
 ORDER BY OTHER.WordText

This is also useful if you plan to print a dictionary where all similar words are grouped together.
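To try the idea out without a database server, the table and query can be sketched with Python's built-in sqlite3 module. The index names and the aliases inp and other are choices made for this sketch, and the rows are the example entries above plus LOVE itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Rhymes (WordText TEXT, RhymePhonemes TEXT)")
# Index both columns so the self-join does not scan the whole table.
conn.execute("CREATE INDEX idx_rhymes_word ON Rhymes (WordText)")
conn.execute("CREATE INDEX idx_rhymes_phonemes ON Rhymes (RhymePhonemes)")
conn.executemany(
    "INSERT INTO Rhymes VALUES (?, ?)",
    [
        ("LOVE", "AH1 V"),
        ("ABOVE", "AH1 V"),
        ("DOVE", "AH1 V"),
        ("OUTLIVE", "IH1 V"),
        ("GRADUATE", "AE1 JH AH0 W AH0 T"),
        ("GRADUATE", "AE1 JH AH0 W EY2 T"),
    ],
)

def rhymes_for(conn, word):
    """Return the words that share rhyme phonemes with the given word."""
    rows = conn.execute(
        """
        SELECT DISTINCT other.WordText
        FROM Rhymes AS inp
        INNER JOIN Rhymes AS other
                ON other.RhymePhonemes = inp.RhymePhonemes
        WHERE inp.WordText = ?
          AND other.WordText <> inp.WordText
        ORDER BY other.WordText
        """,
        (word.upper(),),  # dictionary words are stored in upper case
    ).fetchall()
    return [w for (w,) in rows]

print(rhymes_for(conn, "love"))
```

DISTINCT is there because a word with several pronunciations has several rows and could otherwise show up more than once.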

There are, of course, many other ways of storing / retrieving data for various trade-offs, but hopefully this helps you get started.

I have also had good luck storing the raw pronunciations in the database in various “full” formats (forward and reversed pronunciation strings, with and without stress markers, etc.), rather than “sliced” into specific parts such as a rhyme-phonemes column.
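A small Python sketch of what those “full” formats might look like for one pronunciation. The function and key names are made up for this example. One possible reason to keep a reversed string, which is my own reading rather than something stated above, is that an "ends with" search becomes a "starts with" search, which ordinary indexes usually handle well.

```python
import re

def pronunciation_forms(pronunciation):
    """Build forward/reversed strings, with and without stress digits."""
    phonemes = pronunciation.split()
    # Drop the stress digits (0, 1, 2) from each phoneme.
    unstressed = [re.sub(r"\d", "", p) for p in phonemes]
    return {
        "forward": " ".join(phonemes),
        "reversed": " ".join(reversed(phonemes)),
        "forward_unstressed": " ".join(unstressed),
        "reversed_unstressed": " ".join(reversed(unstressed)),
    }

forms = pronunciation_forms("AH0 B AH1 V")
print(forms["reversed"])             # V AH1 B AH0
print(forms["forward_unstressed"])   # AH B AH V
```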

Gotchas

Again, the walkthrough with “love” above will get you started with rhyming. However, along the way you are likely to run into other issues that need consideration. Here's the heads-up:

  • Some words have several pronunciations. In the CMU dictionary, alternative pronunciations are marked with text such as (1) , (2) , etc. following the word, as in GRADUATE(2) . If someone wants to rhyme one of these words, you need to decide between showing the rhymes for ALL accepted pronunciations and letting the user choose which pronunciation they actually mean.
  • What do you do when a pronunciation has two or more "1s"? Choose the first one? Choose the last one? If you choose the last one, you will find more rhymes, but it may not be the most natural choice of stress.
  • What do you do when the pronunciation does not have a "1"? This does not happen often, but it does happen, for example: ACCREDIT AH0 K R EH2 D AH0 T and AIKIN EY0 K IH0 N . In this case, I would choose the next best stress (for example, select 2, 3, 4, etc., if 1 is missing). If they are all 0, I do not have good advice.
  • Some words are missing. The dictionary is a great start, but it doesn’t have all the words or spellings you might need. It favors American spellings over British spellings.
  • Some pronunciations are not what you would expect, and you may want to prune them. For example, there is a pronunciation of “or” that sounds like “er.”
  • You may want to compare "rhyme phonemes" with the stress markers removed. This matters only for words whose primary stress is not on the last vowel (which is why you do not see the problem in the “love” example).
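A few of these gotchas can be handled with small helpers. The Python sketch below is one possible set of choices, not a recommendation: the function names and the use_last and keep_stress flags are invented for this example, and the fallback assumes the only stress digits in the file are 0, 1, and 2.

```python
import re

def base_word(entry):
    """Strip an alternative-pronunciation marker: 'GRADUATE(2)' -> 'GRADUATE'."""
    return re.sub(r"\(\d+\)$", "", entry)

def rhyme_key(pronunciation, use_last=False, keep_stress=False):
    """Rhyme phonemes with a fallback from primary (1) to secondary (2) stress."""
    phonemes = pronunciation.split()
    for stress in ("1", "2"):  # prefer primary stress, then secondary
        positions = [i for i, p in enumerate(phonemes) if p.endswith(stress)]
        if positions:
            # With several candidates, take the first or the last one.
            start = positions[-1] if use_last else positions[0]
            tail = phonemes[start:]
            if not keep_stress:
                # Remove stress digits so comparisons ignore stress.
                tail = [re.sub(r"\d", "", p) for p in tail]
            return " ".join(tail)
    return None  # every vowel is unstressed (0); no good answer

print(base_word("GRADUATE(2)"))          # GRADUATE
print(rhyme_key("L AH1 V"))              # AH V
print(rhyme_key("AH0 K R EH2 D AH0 T"))  # EH D AH T
```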
+2

You can always use http://www.rhymezone.com/ to search for a word and then put its rhyme matches in a text file, if you only need a small demo subset. If you need a complete database of words, you could feed a dictionary into ZombieJS UI automation, screen-scrape the resulting words, and put them in your own database. This would let you build your own rhyme database. Although, frankly, this is complete overkill for your original request.

-1
