This is more complicated than a simple regular expression can handle. Consider:
"Hi, how are you?" → "Hubi, hubow ubare yubou?"
A simple regular expression will not catch that the e in "are" is not pronounced.
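To make the problem concrete, here is a minimal sketch of a purely spelling-based attempt. The function name `naive_ub` is hypothetical and not part of the answer's code; it simply puts "ub" before every run of written vowels, so it also marks the silent e.

```python
import re

def naive_ub(text):
    # Insert "ub" before every run of written vowels.
    # This works on spelling only and knows nothing about pronunciation.
    return re.sub(r"[aeiou]+", lambda m: "ub" + m.group(0), text,
                  flags=re.IGNORECASE)

print(naive_ub("Hi, how are you?"))
# Hubi, hubow ubarube yubou?
```

Note the "ubarube": the regex has no way of knowing that the final e in "are" is silent.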
You need a library that offers a pronunciation dictionary, such as nltk.corpus.cmudict:

    from nltk.corpus import cmudict # $ pip install nltk
    # $ python -c "import nltk; nltk.download('cmudict')"

    def spubeak(word, pronunciations=cmudict.dict()):
        istitle = word.istitle() # remember, to preserve titlecase
        w = word.lower() # note: ignore Unicode case-folding
        for syllables in pronunciations.get(w, []):
            parts = []
            for syl in syllables:
                if syl[:1] == syl[1:2]:
                    syl = syl[1:] # remove duplicate
                isvowel = syl[-1].isdigit()
                # pronounce the word
                parts.append('ub'+syl[:-1] if isvowel else syl)
            result = ''.join(map(str.lower, parts))
            return result.title() if istitle else result
        return word # word not found in the dictionary
Example:
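    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import re

    sent = "Hi, how are you?"
    # split each whitespace-separated chunk into words and punctuation,
    # keeping the punctuation thanks to the capturing group
    subent = " ".join(["".join(map(spubeak, re.split(r"(\W+)", nonblank)))
                       for nonblank in sent.split()])
    print('"{}" → "{}"'.format(sent, subent))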
Output:

    "Hi, how are you?" → "Hubay, hubaw ubar yubuw?"
Note: this differs from the first example because each word is replaced by its pronunciation (the sounds listed in the dictionary) rather than by its original spelling.
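The per-sound step can be tried without installing nltk. The sketch below assumes a small hand-written table, `SAMPLE`, in the same ARPAbet format that cmudict uses; both the table and the helper name `ub_phonemes` are hypothetical stand-ins for illustration, not part of nltk.

```python
# Hand-written sample in CMUdict's ARPAbet format (an assumption for
# illustration; the real data comes from nltk.corpus.cmudict).
SAMPLE = {
    "hi": ["HH", "AY1"],
    "how": ["HH", "AW1"],
    "are": ["AA1", "R"],
    "you": ["Y", "UW1"],
}

def ub_phonemes(phonemes):
    parts = []
    for p in phonemes:
        if p[:1] == p[1:2]:      # collapse a doubled letter: "HH" -> "H"
            p = p[1:]
        if p[-1].isdigit():      # a trailing stress digit marks a vowel sound
            parts.append("ub" + p[:-1])
        else:                    # consonant sounds pass through unchanged
            parts.append(p)
    return "".join(parts).lower()

for word, phonemes in SAMPLE.items():
    print(word, "->", ub_phonemes(phonemes))
```

Because "are" is listed as the sounds AA1 and R, there is no silent e left to trip over, which is why the result is "ubar".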
jfs