Python: how to add the string "ub" before each pronounced vowel in a string?

Question

Python: how to add the string "ub" before each pronounced vowel in a string?

Example: Speak → Spubeak (more details here).

Please don't give me a full solution; just point me in the right direction or tell me which Python library I could use. I am thinking about regex, since I need to find a vowel, but then what method could I use to insert 'ub' before the vowel?

+7

python string regex nlp

Sahat yalkabov Feb 29 '12 at 19:57


3 answers

You can use regular expressions for substitutions. See re.sub.

Example:

    >>> import re
    >>> re.sub(r'(e)', r'ub\1', 'speak')
    'spubeak'

You will need to read the documentation on regex groups and related features. You will also need to figure out how to match every vowel, not just the one shown in the example.
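As a rough sketch of how the example might be generalized, a character class can match any vowel letter and a capture group can carry the match into the replacement. The name `add_ub` is hypothetical, and treating a run of adjacent vowel letters such as "ea" as one vowel sound is an assumption that will not hold for every word.

```python
import re

# One or more consecutive vowel letters, captured as group 1.
# Assumption: a run such as "ea" counts as a single vowel sound.
VOWEL_RUN = re.compile(r'([aeiou]+)', re.IGNORECASE)


def add_ub(text):
    """Insert 'ub' before every run of vowel letters in text."""
    # \1 in the replacement refers back to the captured vowel run.
    return VOWEL_RUN.sub(r'ub\1', text)


print(add_ub('speak'))  # spubeak
print(add_ub('Speak'))  # Spubeak
```

This works on spelling only, so it cannot tell pronounced vowels from silent ones.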

For some great ideas (and code) on using regular expressions in Python with a pronunciation dictionary, see this link, which is one of the design pages for Cainteoir: http://rhdunn.github.com/cainteoir/rules.html

The Cainteoir text-to-speech framework described there (which is not yet fully implemented) uses regular expressions. See also "Pronunciation Dictionary and Regular Expressions", another Cainteoir article.

+3

Steven T. Snyder Feb 29 '12 at 19:58


Regular expressions are really the best route. If you don't know how to proceed, read up on how capture groups work and how you can refer to them in your substitutions.

+1

mgibsonbr Feb 29 '12 at 20:00


jfs · Accepted Answer · 2012-02-29T20:10:06+0000

This is more complicated than a simple regular expression can handle. For example:

"Hi, how are you?" → "Hubi, hubow ubare yubou?"

A simple regular expression will not catch that the e in "are" is not pronounced.
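To make the limitation concrete, here is a minimal sketch of a spelling-based rule applied to the same sentence. The function name `naive_ub` is hypothetical and the rule (one "ub" per run of vowel letters) is an assumption, not something the question prescribes.

```python
import re


def naive_ub(text):
    """Spelling-based rule: put 'ub' before each run of vowel letters."""
    return re.sub(r'([aeiou]+)', r'ub\1', text, flags=re.IGNORECASE)


print(naive_ub("Hi, how are you?"))
# Hubi, hubow ubarube yubou?
```

The word "are" comes out as "ubarube": the silent e gets its own "ub" because the pattern sees letters, not sounds.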

You need a library that offers a pronunciation dictionary, such as nltk.corpus.cmudict:

    from nltk.corpus import cmudict  # $ pip install nltk
    # $ python -c "import nltk; nltk.download('cmudict')"

    def spubeak(word, pronunciations=cmudict.dict()):
        istitle = word.istitle()  # remember, to preserve titlecase
        w = word.lower()  # note: ignore Unicode case-folding
        for syllables in pronunciations.get(w, []):
            parts = []
            for syl in syllables:
                if syl[:1] == syl[1:2]:
                    syl = syl[1:]  # remove duplicate
                isvowel = syl[-1].isdigit()
                # pronounce the word
                parts.append('ub' + syl[:-1] if isvowel else syl)
            result = ''.join(map(str.lower, parts))
            return result.title() if istitle else result
        return word  # word not found in the dictionary
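The function above relies on the shape of the dictionary entries: each word maps to a list of pronunciations, each pronunciation is a list of ARPAbet phonemes, and vowel phonemes end in a stress digit. The sketch below illustrates that convention without requiring NLTK. `SAMPLE` is a hand-typed stand-in and `vowel_phonemes` is a hypothetical helper; the two entries are meant to mirror the corpus format and should be checked against the real data.

```python
# Hand-typed sample (an assumption, not loaded from the corpus):
# word -> list of pronunciations, each a list of ARPAbet phonemes.
# Vowel phonemes end in a stress digit (0, 1 or 2).
SAMPLE = {
    "speak": [["S", "P", "IY1", "K"]],
    "are": [["AA1", "R"]],
}


def vowel_phonemes(word, pronunciations=SAMPLE):
    """Return the vowel phonemes of the first listed pronunciation."""
    for phonemes in pronunciations.get(word.lower(), []):
        return [p for p in phonemes if p[-1].isdigit()]
    return []  # word not in the dictionary


print(vowel_phonemes("speak"))  # ['IY1']
print(vowel_phonemes("are"))    # ['AA1']
```

In this sample "are" has a single vowel phoneme, so the silent e never appears and gets no "ub".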

Example:

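    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import re

    sent = "Hi, how are you?"
    # Split each whitespace-separated chunk into words and punctuation;
    # the raw string avoids an invalid-escape warning for \W.
    subent = " ".join(["".join(map(spubeak, re.split(r"(\W+)", nonblank)))
                       for nonblank in sent.split()])
    print('"{}" → "{}"'.format(sent, subent))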

Output:

    "Hi, how are you?" → "Hubay, hubaw ubar yubuw?"

Note: this differs from the first example because each word is replaced by its syllables as spelled in the pronunciation dictionary.
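The example keeps punctuation because `re.split` returns the separators as well when the pattern contains a capturing group. A small sketch of that behavior follows; `tokenize` is a hypothetical helper name introduced only for illustration.

```python
import re


def tokenize(chunk):
    """Split chunk into words and punctuation, keeping the punctuation."""
    # The parentheses make re.split return the separators too.
    return re.split(r"(\W+)", chunk)


print(tokenize("Hi,"))  # ['Hi', ',', '']
print(tokenize("how"))  # ['how']
```

The punctuation and empty strings are not dictionary words, so a lookup-based function like the one above is expected to return them unchanged, and joining the pieces restores the original punctuation.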
