Regex to get a list of all words with special letters (unicode graphemes)

Question

I am writing a Python script for a FOSS language-learning initiative. Say I have an XML file (or, to keep it simple, a Python list) containing words in a specific language (in my case Tamil, which uses a script derived from Brahmi).

I need to pick out the subset of those words that can be written using only a given set of letters.

Example in English:

    words = ["cat", "dog", "tack", "coat"]
    get_words(['o', 'c', 'a', 't'])  # should return ["cat", "coat"]
    get_words(['k', 'c', 't', 'a'])  # should return ["cat", "tack"]

Tamil example:

    words = [u"மரம்", u"மடம்", u"படம்", u"பாடம்"]
    get_words([u'ம', u'ப', u'ட', u'ம்'])  # should return [u"மடம்", u"படம்"]
    get_words([u'ப', u'ம்', u'ட'])  # should return [u"படம்"]

Neither the order of the returned words nor the order of the input letters should matter.

Although I understand the difference between Unicode code points and graphemes, I'm not sure how they are handled in regular expressions.

In this case, I want to match only those words that are made up of the specific graphemes in the input list and nothing else (that is, a mark that follows a letter must follow only that letter, but the graphemes themselves can occur in any order).
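To make the code point versus grapheme distinction concrete, here is a minimal sketch (not part of the original question, and assuming Python 3) showing that the single Tamil letter ம் is stored as two code points. The variable names are illustrative only.

```python
import unicodedata

# One visible letter, written with escapes so the code points are explicit:
# U+0BAE (MA) followed by U+0BCD (the virama, or pulli).
letter = "\u0bae\u0bcd"  # ம்

# Python strings are sequences of code points, so len() counts two here.
code_point_names = [unicodedata.name(ch) for ch in letter]
print(len(letter), code_point_names)

# The word படம் looks like three letters but holds four code points.
word = "\u0baa\u0b9f\u0bae\u0bcd"  # படம்
print(len(word), [hex(ord(ch)) for ch in word])
```

A regular expression that works one code point at a time can therefore split ம் into ம and the pulli unless the pattern treats the pair as a unit.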

+6

python regex unicode tamil indic

Ashwin balamohan Jan 27 '13 at 3:17

4 answers

EDIT: Well, don't use any of the solutions in this answer. I wrote them all thinking that Python regular expressions had no word boundary marker, and I tried to work around that lack. Then @Mark Tolonen added a comment that Python has \b as a word boundary marker! So I posted another answer, short and simple, using \b . I will leave this one here in case someone is interested in solutions that work around the lack of \b , but I really don't expect anyone to be.

It is easy to make a regular expression that matches only strings made from a specific set of characters. What you need is a "character class" containing only the characters you want to match.

I will do this example in English.

[ocat] This is a character class that matches one character from the set [o, c, a, t] . The order of the characters does not matter.

[ocat]+ Putting + at the end makes it match one or more characters from the set. But this alone is not enough; if you had the word "coach", it would match and return "coac".
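A quick sketch of that behaviour (the sample strings are only illustrative): without any boundary check, the character class happily matches the leading part of a longer word.

```python
import re

pat = re.compile(r"[ocat]+")

# "coach" is not made only of o, c, a, t, yet its prefix still matches.
partial = pat.findall("coach")
print(partial)  # ['coac']

# Words made entirely of the allowed letters are found as expected.
whole = pat.findall("a cat coat")
print(whole)  # ['a', 'cat', 'coat']
```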

Unfortunately, there is no regular expression feature for a word boundary. [EDIT: This turns out to be wrong, as I said in the first paragraph.] We need to build our own. There are two possible beginnings of a word: the beginning of a line, or white space separating our word from the previous word. Similarly, there are two possible ends of a word: the end of a line, or white space separating our word from the next word.

Since we will be matching some extra things that we don't want, we can put parentheses around the part of the pattern that we do want.

To match two alternatives, we can make a group in parentheses and separate the alternatives with a vertical bar. Python regular expressions have a special notation for a group whose contents we do not want to capture: (?:)

So, here is the pattern matching the beginning of a word. Beginning of a line or white space: (?:^|\s)

Here is the pattern matching the end of a word. White space or end of line: (?:\s|$)

Putting it all together, here is our final pattern:

 (?:^|\s)([ocat]+)(?:\s|$)

You can build it dynamically. You do not need to hard-code it all.

    import re

    s_pat_start = r'(?:^|\s)(['
    s_pat_end = r']+)(?:\s|$)'

    # In real code, get these characters from wherever you like.
    set_of_chars = "ocat"

    s_pat = s_pat_start + set_of_chars + s_pat_end
    pat = re.compile(s_pat)

Note that this in no way checks that the matches are real words. If you have the following text:

 This is sensible. This not: occo cttc

The pattern I showed you will match occo and cttc , and these are not words. They are strings made only from the letters [ocat] , though.

So just do the same thing with Unicode strings. (If you are using Python 3.x, all strings are Unicode strings, so you are all set.) Put the Tamil characters in the character class and you are good to go.

This has a confusing problem: re.findall() does not return all possible matches.

EDIT: Okay, I figured out what was bothering me.

We want our pattern to work with re.findall() so that you can collect all the words. But re.findall() only finds non-overlapping matches. In my example, re.findall() returned only ['occo'] , not ['occo', 'cttc'] as I expected ... but this is because my pattern matched the white space after occo . The capture group did not include the space, but it was matched anyway, and since re.findall() does not allow matches to overlap, the space was "used up" and was not available to start the match for cttc .
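The effect is easy to reproduce. The sketch below, using the same sample text, shows that the full match swallows the surrounding spaces even though the capture group does not include them.

```python
import re

text = "This is sensible. This not: occo cttc"
consuming = re.compile(r"(?:^|\s)([ocat]+)(?:\s|$)")

# Only the first word is reported: the space after "occo" is consumed by
# that match, so it cannot also serve as the space before "cttc".
found = consuming.findall(text)
print(found)  # ['occo']

# group(0) is the whole match, including both spaces.
first = consuming.search(text)
print(repr(first.group(0)))  # ' occo '
```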

The solution is to use a Python regular expression feature that I have never used before: a special syntax that says "must not be preceded by" or "must not be followed by" (negative lookbehind and lookahead). The \S sequence matches any non-whitespace character, so we could use that. But punctuation is not white space, and I think we want punctuation to delimit a word. There is also a special syntax for "must be preceded by" or "must be followed by" (positive lookbehind and lookahead). So I think this is the best we can do:

Build a pattern that means "match when a character class string is at the beginning of the string and is followed by white space, or when a character class string is preceded by white space and followed by white space, or when a character class string is preceded by white space and followed by the end of the string, or when a character class string is preceded by the beginning of the string and followed by the end of the string."

Here is this pattern using ocat :

 r'(?:^([ocat]+)(?=\s)|(?<=\s)([ocat]+)(?=\s)|(?<=\s)([ocat]+)$|^([ocat]+)$)'

I'm sorry, but I really think this is the best we can do and still work with re.findall() !

Actually this is less confusing in Python code:

    import re

    NMGROUP_BEGIN = r'(?:'    # begin non-capturing group
    NMGROUP_END = r')'        # end non-capturing group
    WS_BEFORE = r'(?<=\s)'    # require white space before
    WS_AFTER = r'(?=\s)'      # require white space after
    BOL = r'^'                # beginning of line
    EOL = r'$'                # end of line
    CCS_BEGIN = r'(['         # begin a character class string
    CCS_END = r']+)'          # end a character class string
    PAT_OR = r'|'

    # In real code, get these characters from wherever you like.
    set_of_chars = "ocat"

    # build up character class string pattern
    CCS = CCS_BEGIN + set_of_chars + CCS_END

    s_pat = (NMGROUP_BEGIN +
             BOL + CCS + WS_AFTER + PAT_OR +
             WS_BEFORE + CCS + WS_AFTER + PAT_OR +
             WS_BEFORE + CCS + EOL + PAT_OR +
             BOL + CCS + EOL +
             NMGROUP_END)
    pat = re.compile(s_pat)

    text = "This is sensible. This not: occo cttc"
    pat.findall(text)
    # returns: [('', 'occo', '', ''), ('', '', 'cttc', '')]

So the crazy thing is that when we have alternative patterns that can match, re.findall() returns an empty string for each alternative that did not match. So we just need to filter the empty strings out of our results:

    import itertools as it

    raw_results = pat.findall(text)
    results = [s for s in it.chain(*raw_results) if s]
    # results set to: ['occo', 'cttc']
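For comparison, here is a sketch of a more compact variant. It is an assumption on my part rather than the pattern from this answer: the negative lookarounds (?<!\S) and (?!\S) mean "not preceded by a non-whitespace character" and "not followed by one", which also hold at the start and end of the string, so a single alternative with no capture groups is enough.

```python
import re

text = "This is sensible. This not: occo cttc"

# The four-alternative pattern from the answer, with its empty groups filtered.
four_way = re.compile(
    r"(?:^([ocat]+)(?=\s)|(?<=\s)([ocat]+)(?=\s)|(?<=\s)([ocat]+)$|^([ocat]+)$)"
)
four_way_words = [s for groups in four_way.findall(text) for s in groups if s]

# Hypothetical compact variant using negative lookarounds.
compact = re.compile(r"(?<!\S)[ocat]+(?!\S)")
compact_words = compact.findall(text)

print(four_way_words, compact_words)

# Both variants treat punctuation as part of the word, so "cat," and
# "coat." are rejected here and only the bare "a" is found.
with_punctuation = compact.findall("a cat, coat.")
print(with_punctuation)
```

Neither variant lets punctuation delimit a word, which is the limitation the word-splitting approach further down addresses.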

I suppose it would be a little less confusing to just build four different patterns, run re.findall() on each, and combine the results.

EDIT: Okay, here is the code that builds four patterns and tries each of them. I think this is an improvement.

    import re

    WS_BEFORE = r'(?<=\s)'    # require white space before
    WS_AFTER = r'(?=\s)'      # require white space after
    BOL = r'^'                # beginning of line
    EOL = r'$'                # end of line
    CCS_BEGIN = r'(['         # begin a character class string
    CCS_END = r']+)'          # end a character class string

    # In real code, get these characters from wherever you like.
    set_of_chars = "ocat"

    # build up character class string pattern
    CCS = CCS_BEGIN + set_of_chars + CCS_END

    lst_s_pat = [
        BOL + CCS + WS_AFTER,
        WS_BEFORE + CCS + WS_AFTER,
        WS_BEFORE + CCS + EOL,
        BOL + CCS + EOL,    # the whole text is a single word
    ]
    lst_pat = [re.compile(s) for s in lst_s_pat]

    text = "This is sensible. This not: occo cttc"

    result = []
    for pat in lst_pat:
        result.extend(pat.findall(text))
    # result set to: ['occo', 'cttc']

EDIT: Well, here's a completely different approach. I like it best.

First, we match all the words in the text. A word is defined as one or more characters that are neither punctuation nor white space.

Then we filter that list of words; we keep only the words that are made solely of the characters we want.

    import re
    import string

    # Create a pattern that matches all characters not part of a word.
    #
    # Note that '-' has a special meaning inside a character class, but it
    # is valid punctuation that we want to match, so put a backslash in
    # front of it to disable the special meaning and just match it.
    #
    # Use '^' which negates all the chars following.  So, a word is a series
    # of characters that are all not whitespace and not punctuation.
    WORD_BOUNDARY = string.whitespace + string.punctuation.replace('-', r'\-')
    WORD = r'[^' + WORD_BOUNDARY + r']+'

    # Create a pattern that matches only the words we want.
    # In real code, get these characters from wherever you like.
    set_of_chars = "ocat"

    # Build up the character class string pattern.  The trailing '$' makes
    # pat.match() require that the whole word is made of these characters.
    CCS = r'[' + set_of_chars + r']+$'

    pat_word = re.compile(WORD)
    pat = re.compile(CCS)

    text = "This is sensible.  This not: occo cttc"

    # This makes it clear how we are doing this.
    all_words = pat_word.findall(text)
    result = [s for s in all_words if pat.match(s)]

    # "lazy" generator expression that yields up good results when iterated
    # May be better for very large texts.
    result_genexp = (s for s in (m.group(0) for m in pat_word.finditer(text))
                     if pat.match(s))

    # force the expression to expand out to a list
    result = list(result_genexp)
    # result set to: ['occo', 'cttc']

EDIT: now I don't like any of the above solutions; see another answer that uses \b for a better solution in Python.

+3

steveha Jan 27 '13 at 5:45

It is easy to make a regular expression that matches only strings made from a specific set of characters. What you need is a "character class" containing only the characters you want to match.

I will do this example in English.

[ocat] This is a character class that matches one character from the set [o, c, a, t] . The order of the characters does not matter.

[ocat]+ Putting + at the end makes it match one or more characters from the set. But this alone is not enough; if you had the word "coach" , it would match and return "coac" .

\b[ocat]+\b Now it only matches on word boundaries. (Thank you very much @Mark Tolonen for educating me about \b .)

So, just build a pattern like the one above, using the desired character set at runtime, and there you go. You can use this pattern with re.findall() or re.finditer() .

    import re

    words = ["cat", "dog", "tack", "coat"]

    def get_words(chars_seq, words_seq=words):
        # Build a pattern such as r'\b[ocat]+\b' from the given characters.
        s_chars = ''.join(chars_seq)
        s_pat = r'\b[' + s_chars + r']+\b'
        pat = re.compile(s_pat)
        return [word for word in words_seq if pat.match(word)]

    assert get_words(['o', 'c', 'a', 't']) == ["cat", "coat"]
    assert get_words(['k', 'c', 't', 'a']) == ["cat", "tack"]
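As a small sketch of the same pattern applied to running text with re.findall() (the sample sentences are only illustrative), note that \b lets punctuation end a word, which the white-space-based patterns did not.

```python
import re

pat = re.compile(r"\b[ocat]+\b")

text = "This is sensible. This not: occo cttc"
from_text = pat.findall(text)
print(from_text)  # ['occo', 'cttc']

# A comma or full stop is a non-word character, so a word boundary
# sits right before it and the word is still found.
from_punctuated = pat.findall("a cat, coat.")
print(from_punctuated)  # ['a', 'cat', 'coat']

# "coach" is rejected: there is no boundary between "coac" and "h".
from_coach = pat.findall("coach")
print(from_coach)  # []
```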

+3

steveha Jan 27 '13 at 10:41

I would not use regular expressions to solve this problem. I would rather use collections.Counter like this:

    >>> from collections import Counter
    >>> def get_words(word_list, letter_string):
    ...     return [word for word in word_list
    ...             if Counter(word) & Counter(letter_string) == Counter(word)]
    ...
    >>> words = ["cat", "dog", "tack", "coat"]
    >>> letters = 'ocat'
    >>> get_words(words, letters)
    ['cat', 'coat']
    >>> letters = 'kcta'
    >>> get_words(words, letters)
    ['cat', 'tack']

This solution should work in other languages too. Counter(word) & Counter(letter_string) computes the intersection of the two counters, i.e. min(c[x], d[x]) for each element x. If that intersection equals the counter for your word, the word is a match and is returned.
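One consequence of the min() rule is worth spelling out with a short sketch: a letter may be used only as many times as it appears in the letter string. The helper names below are hypothetical, and the set-based variant is an alternative I am assuming for the case where letters may repeat; it is not part of this answer.

```python
from collections import Counter

def fits_with_counts(word, letters):
    # Intersection keeps min(count in word, count in letters) per letter,
    # so it equals Counter(word) only if no letter is overused.
    return Counter(word) & Counter(letters) == Counter(word)

def fits_ignoring_counts(word, letters):
    # Hypothetical alternative: only asks whether every letter is allowed.
    return set(word) <= set(letters)

print(fits_with_counts("coat", "ocat"))      # True
print(fits_with_counts("tact", "ocat"))      # False: "t" is needed twice
print(fits_ignoring_counts("tact", "ocat"))  # True
```

Both helpers count individual code points, so for Tamil a letter such as ம் is still split into its base letter and the following mark.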

+2

πόδας ὠκύς Jan 27 '13 at 3:59

jfs · Accepted Answer · 2013-01-28T06:23:20+0000

To support characters that can span multiple Unicode code points:

    # -*- coding: utf-8 -*-
    import re
    import unicodedata
    from functools import partial

    NFKD = partial(unicodedata.normalize, 'NFKD')

    def match(word, letters):
        word, letters = NFKD(word), map(NFKD, letters)  # normalize
        # Each letter is one alternative, so a multi-code-point letter
        # is matched as a unit.
        return re.match(r"(?:%s)+$" % "|".join(map(re.escape, letters)), word)

    words = [u"மரம்", u"மடம்", u"படம்", u"பாடம்"]
    get_words = lambda letters: [w for w in words if match(w, letters)]

    print(" ".join(get_words([u'ம', u'ப', u'ட', u'ம்'])))
    # -> மடம் படம்
    print(" ".join(get_words([u'ப', u'ம்', u'ட'])))
    # -> படம்

This assumes that the same character may be used in a word zero or more times.
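To see why the pattern uses alternation rather than a character class, here is a minimal sketch (assuming Python 3; the constant names are illustrative). A character class breaks ம் into two independent code points, so it wrongly accepts a word containing a bare ம.

```python
import re

MA, PULLI, PA, TA = "\u0bae", "\u0bcd", "\u0baa", "\u0b9f"

letters = [PA, MA + PULLI, TA]   # ப, ம், ட
madam = MA + TA + MA + PULLI     # மடம் (starts with a bare ம)
padam = PA + TA + MA + PULLI     # படம்

escaped = [re.escape(letter) for letter in letters]

# Character class: ம and the pulli become separate, freely reusable members.
char_class = re.compile("[" + "".join(escaped) + "]+$")

# Alternation: each letter, including the two-code-point ம், is one unit.
alternation = re.compile("(?:" + "|".join(escaped) + ")+$")

print(bool(char_class.match(madam)))   # True  (unwanted)
print(bool(alternation.match(madam)))  # False
print(bool(alternation.match(padam)))  # True
```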

If you only need words containing exactly the given characters:

    import regex  # $ pip install regex

    chars = regex.compile(r"\X").findall  # get all characters

    def match(word, letters):
        return sorted(chars(word)) == sorted(letters)

    words = ["cat", "dog", "tack", "coat"]
    get_words = lambda letters: [w for w in words if match(w, letters)]

    print(" ".join(get_words(['o', 'c', 'a', 't'])))
    # -> coat
    print(" ".join(get_words(['k', 'c', 't', 'a'])))
    # -> tack

Note: cat is not in the output in this case, because cat does not use all of the given characters.

What does normalize mean? And could you please explain the re.match() regex syntax?

    >>> import re
    >>> re.escape('.')
    '\\.'
    >>> c = u'\u00c7'
    >>> cc = u'\u0043\u0327'
    >>> cc == c
    False
    >>> re.match(r'%s$' % (c,), cc)  # does not match
    >>> import unicodedata
    >>> norm = lambda s: unicodedata.normalize('NFKD', s)
    >>> re.match(r'%s$' % (norm(c),), norm(cc))  # does match
    <_sre.SRE_Match object at 0x1364648>
    >>> print c, cc
    Ç Ç

Without normalization, c and cc do not match. The characters are taken from the unicodedata.normalize() docs.
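The same issue arises in Tamil. As a sketch (assuming Python 3, with illustrative variable names), the syllable கொ can be stored either with the single vowel sign O (U+0BCA) or with the two signs E and AA (U+0BC6 U+0BBE); the strings look identical but compare unequal until both are normalized.

```python
import unicodedata

KA = "\u0b95"
one_sign = KA + "\u0bca"         # கொ using the single vowel sign O
two_signs = KA + "\u0bc6\u0bbe"  # கொ using vowel sign E + vowel sign AA

def nfkd(s):
    # Decompose every character into its constituent code points.
    return unicodedata.normalize("NFKD", s)

print(one_sign == two_signs)              # False
print(nfkd(one_sign) == nfkd(two_signs))  # True
print(len(one_sign), len(nfkd(one_sign)))  # 2 3
```

Normalizing both the word and the letters before matching, as the answer's match() does, removes this source of mismatches.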
