EDIT: Well, don't use any of the answers here. I wrote them all thinking that Python regular expressions do not have a word boundary marker, and I tried to work around that flaw. Then @Mark Tolonen added a comment that Python has \b as a word boundary marker! So I posted another answer, short and simple, using \b. I will leave this here in case someone is interested in solutions that work around the lack of \b, but I really don't expect anyone to be.
It is easy to make a regular expression that matches only strings made from a specific set of characters. What you need is a “character class” containing only the characters you want to match.
I will do this example in English.
[ocat] This is a character class that will match one character from the set [o, c, a, t]. The order of the characters does not matter.
[ocat]+ Putting + at the end makes it match one or more characters from the set. But this alone is not enough; if you had the word “coach,” this would match and return “coac.”
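A quick way to see the partial match described above (a minimal sketch; the sample word is only an illustration):

```python
import re

# [ocat]+ matches the longest run of allowed letters, wherever it occurs,
# so it happily matches the first four letters of "coach".
match = re.search(r'[ocat]+', 'coach')
partial = match.group(0)
print(partial)  # coac
```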
Unfortunately, there is no regular expression feature for a word boundary. [EDIT: This turns out to be wrong, as I said in the first paragraph.] We need to make one of our own. There are two possible beginnings of a word: the beginning of a line, or white space separating our word from the previous word. Similarly, there are two possible ends of a word: the end of a line, or white space separating our word from the next word.
Since we will be matching some additional things that we don’t need, we can put parentheses around the part of the pattern that we want.
To match the two alternatives, we can create a group in parentheses and separate the alternatives with a vertical bar. Python regular expressions have a special syntax for creating a group whose contents we do not want to save: (?:)
So, here is the pattern corresponding to the beginning of the word. Beginning of a line or space: (?:^|\s)
Here is the pattern corresponding to the end of a word. White space or end of line: (?:\s|$)
Putting it all together, here is our final pattern:
(?:^|\s)([ocat]+)(?:\s|$)
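As a small illustration of how the capturing parentheses are used, here is a sketch that applies the pattern with re.search() and reads back only group 1. The helper name first_word and the sample strings are made up for this example.

```python
import re

pat = re.compile(r'(?:^|\s)([ocat]+)(?:\s|$)')

def first_word(text):
    """Return the first whole word made only of o, c, a, t, or None."""
    m = pat.search(text)
    # group(1) is the part inside the capturing parentheses;
    # the surrounding white space is matched but not returned.
    return m.group(1) if m else None

print(first_word('the cat sat'))  # cat
print(first_word('my coach'))     # None
```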
You can build it dynamically. You do not need to hard code it all.
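import re
s_pat_start = r'(?:^|\s)(['
s_pat_end = r']+)(?:\s|$)'
set_of_chars = get_the_chars_from_somewhere_I_do_not_care_where()
# join the pieces around the character set and compile the result
pat = re.compile(s_pat_start + set_of_chars + s_pat_end)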
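Here is a runnable sketch of the dynamic construction. The function name build_pattern is hypothetical, and passing the characters through re.escape() is my own precaution (not part of the pattern above) so that characters such as ], ^, - or \ stay literal inside the character class.

```python
import re

S_PAT_START = r'(?:^|\s)(['
S_PAT_END = r']+)(?:\s|$)'

def build_pattern(set_of_chars):
    """Compile a pattern for whole words made only of the given characters."""
    # re.escape keeps class metacharacters (']', '^', '-', '\') literal.
    return re.compile(S_PAT_START + re.escape(set_of_chars) + S_PAT_END)

pat = build_pattern('ocat')
print(pat.search('my cat').group(1))  # cat

# Without escaping, 'a-c' would be read as the range a..c and also match 'b'.
hyphen_pat = build_pattern('a-c')
print(hyphen_pat.search('b'))  # None
```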
Note that this in no way checks that the matches are real words. If you have the following text:
This is sensible. This not: occo cttc
The pattern I showed you will match occo and cttc, and these are not words. They are strings made only from the letters [ocat], though.
So just do the same thing with Unicode strings. (If you are using Python 3.x, all strings are Unicode strings, so you are all set.) Put the Tamil characters in the character class and you should be good to go.
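A minimal sketch of the same pattern with Tamil characters, assuming Python 3. The chosen code points and the two-word sample text are arbitrary illustrations, written as escapes so the code points are unambiguous. Note that combining signs (the pulli and the vowel signs) are separate code points and have to be listed in the character class too.

```python
import re

# Allowed code points (an illustrative choice): TAMIL LETTER A (U+0B85),
# TAMIL LETTER MA (U+0BAE), TAMIL SIGN VIRAMA (U+0BCD),
# TAMIL VOWEL SIGN AA (U+0BBE).
tamil_chars = '\u0b85\u0bae\u0bcd\u0bbe'
pat = re.compile(r'(?:^|\s)([' + tamil_chars + r']+)(?:\s|$)')

# Two words separated by a space. The first uses only the allowed code points;
# the second contains TAMIL LETTER VA (U+0BB5), which is not in the set.
text = '\u0b85\u0bae\u0bcd\u0bae\u0bbe \u0bb5\u0bbe'

m = pat.search(text)
word = m.group(1)
print(word == '\u0b85\u0bae\u0bcd\u0bae\u0bbe')  # True
```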
This has a confusing problem: re.findall() does not return all possible matches.
EDIT: Okay, I figured out what was confusing me.
We want our pattern to work with re.findall() so that you can collect all the words. But re.findall() only finds non-overlapping matches. In my example, re.findall() returned only ['occo'], not ['occo', 'cttc'] as I expected ... but this is because my pattern matched the white space after occo. The match group did not capture the white space, but it was matched anyway, and since re.findall() does not allow matches to overlap, the white space was "used up" and was not available to start the match for cttc.
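The behavior can be reproduced with a short sketch using the example text from above. The full match (group 0) shows the white space that gets consumed along with occo.

```python
import re

pat = re.compile(r'(?:^|\s)([ocat]+)(?:\s|$)')
text = 'This is sensible. This not: occo cttc'

found = pat.findall(text)
print(found)  # ['occo']

# The whole match includes the white space on both sides of occo,
# so the space that cttc needs in front of it is already used up.
whole = pat.search(text).group(0)
print(repr(whole))  # ' occo '
```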
The solution is to use a Python regular expression feature that I have never used before: special syntax that says "must not be preceded by" or "must not be followed by." The \S sequence matches any non-whitespace character, so we could use that. But punctuation is non-whitespace too, and I think we do want punctuation to delimit a word. There is also special syntax for "must be preceded by" or "must be followed by." So, I think this is the best we can do:
Build a string that means "match when a character class string is at the beginning of the string and is followed by white space, or when a character class string is preceded by white space and followed by white space, or when a character class string is preceded by white space and followed by the end of the string, or when a character class string is preceded by the beginning of the string and followed by the end of the string."
Here is this pattern using ocat :
r'(?:^([ocat]+)(?=\s)|(?<=\s)([ocat]+)(?=\s)|(?<=\s)([ocat]+)$|^([ocat]+)$)'
I'm sorry, but I really think this is the best we can do and still work with re.findall() !
Actually this is less confusing in Python code:
import re
NMGROUP_BEGIN = r'(?:'
So the crazy thing is that when we have alternative patterns that can match, re.findall() seems to return an empty string for each alternative that did not match. So we just need to filter the zero-length strings out of our results:
import itertools as it
raw_results = pat.findall(text)
results = [s for s in it.chain(*raw_results) if s]
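Putting the four-alternative pattern and the filtering step together, here is a self-contained sketch. The sample text is a variation of the earlier example, changed so that one word also sits at the very start of the string.

```python
import itertools as it
import re

pat = re.compile(
    r'(?:^([ocat]+)(?=\s)|(?<=\s)([ocat]+)(?=\s)|(?<=\s)([ocat]+)$|^([ocat]+)$)'
)
text = 'taco is sensible. This not: occo cttc'

# With four capturing groups, findall() returns one 4-tuple per match;
# groups belonging to alternatives that did not match come back as ''.
raw_results = pat.findall(text)
print(raw_results)
# [('taco', '', '', ''), ('', 'occo', '', ''), ('', '', 'cttc', '')]

# Flatten the tuples and drop the empty strings.
results = [s for s in it.chain(*raw_results) if s]
print(results)  # ['taco', 'occo', 'cttc']
```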
I suppose it would be a little less confusing to just build four different patterns, run re.findall() on each, and combine the results.
EDIT: Okay, here is the code to build four patterns and try each of them. I think this is an improvement.
import re
WS_BEFORE = r'(?<=\s)'
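A minimal sketch of the four-pattern approach, under my own assumptions: apart from WS_BEFORE, every name here (WS_AFTER, BOL, EOL, find_words) is hypothetical, and re.escape() is an added precaution. Since none of the four patterns has a capturing group, re.findall() returns plain strings and no filtering of empty strings is needed.

```python
import re

WS_BEFORE = r'(?<=\s)'   # must be preceded by white space
WS_AFTER = r'(?=\s)'     # must be followed by white space (hypothetical name)
BOL = r'^'               # beginning of the string (hypothetical name)
EOL = r'$'               # end of the string (hypothetical name)

def find_words(chars, text):
    """Collect whole words made only of `chars` using four separate patterns."""
    core = '[' + re.escape(chars) + ']+'
    patterns = [
        BOL + core + WS_AFTER,        # word at the start, more text after it
        WS_BEFORE + core + WS_AFTER,  # word in the middle
        WS_BEFORE + core + EOL,       # word at the end
        BOL + core + EOL,             # the word is the whole string
    ]
    results = []
    for p in patterns:
        results.extend(re.findall(p, text))
    return results

print(find_words('ocat', 'taco is sensible. This not: occo cttc'))
# ['taco', 'occo', 'cttc']
```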
EDIT: Well, here's a completely different approach. I like it best.
First, we match all the words in the text. A word is defined as one or more characters that are not punctuation and are not white space.
Then we use a filter to remove words from that list; we keep only the words that are made only of the characters we want.
import re
import string
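One way the match-then-filter approach could look, as a sketch under stated assumptions: the names pat_word and words_made_of are hypothetical, and "punctuation" is taken to mean string.punctuation, which covers ASCII punctuation only.

```python
import re
import string

# A "word" is a run of characters that are neither white space
# nor ASCII punctuation.
pat_word = re.compile(r'[^\s' + re.escape(string.punctuation) + r']+')

def words_made_of(chars, text):
    """Return the words in `text` that use only characters from `chars`."""
    allowed = set(chars)
    # Keep a word only if its set of characters is a subset of `allowed`.
    return [w for w in pat_word.findall(text) if set(w) <= allowed]

print(words_made_of('ocat', 'This is sensible. This not: occo cttc'))
# ['occo', 'cttc']
print(words_made_of('ocat', 'taco, cat!'))
# ['taco', 'cat']
```

Unlike the white-space-only patterns above, this version lets punctuation delimit a word, so taco is found even when a comma follows it.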
EDIT: now I don't like any of the above solutions; see another answer that uses \b for a better solution in Python.