Extracting whole words

Question

I have a large body of real-world text that I need to pull words out of for input into a spell checker. I would like to extract as many meaningful words as possible without too much noise. I know there are plenty of regex ninjas around here, so hopefully someone can help me out.

I am currently extracting all alphabetical sequences using '[a-z]+'. This is a decent approximation, but it drags in a lot of garbage.

Ideally, I would like a regular expression (it doesn't have to be pretty or efficient) that extracts all alphabetical sequences delimited by natural word separators (for example, [/-_,.: ] , etc.) and ignores any alphabetical sequences with illegal boundaries.

However, I would be happy to just get all the alphabetical sequences that are NOT adjacent to a number. So, for example, 'pie21' would NOT yield 'pie', but 'http://foo.com' would yield ['http', 'foo', 'com'].

I tried lookahead and lookbehind assertions, but they were applied per character (so, for example, re.findall('(?<!\d)[a-z]+(?!\d)', 'pie21') returns 'pi' when I want it to return nothing). I tried wrapping the alpha part as a group ( (?:[a-z]+) ), but that didn't help.

More detail: the data is an email database, so it is mostly plain English with normal numbers, but occasionally there are garbage strings like GIHQ4NWL0S5SCGBDD40ZXE5IDP13TYNEA and AC7A21C0 that I would like to ignore completely. I am assuming that any alphabetical sequence with a number in it is garbage.

+8

python regex word text-extraction alphabetical

orlade Apr 19 '11 at 14:22

4 answers

Are you familiar with word boundaries? ( \b ). You can extract the word using \b around the sequence and matching the alphabet inside:

 \b([a-zA-Z]+)\b

This will capture whole words, but stop at tokens such as hyphens, periods, semicolons, etc.

You can read about \b and the other anchors in the Python manual.

EDIT Also, if you want to rule out a number that follows or precedes the match, you can use a negative lookahead / lookbehind:

 (?!\d)   # negative look-ahead for numbers
 (?<!\d)  # negative look-behind for numbers
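The question reports that plain digit lookarounds return 'pi' for 'pie21'. The sketch below reproduces that and shows one possible workaround: also forbidding letters in both lookarounds, so that a match has to cover the whole run of letters. The names `naive` and `strict` are made up for illustration, and the stricter pattern is an assumption rather than part of the answer above.

```python
import re

# Pattern from the question: the engine backtracks from "pie" to "pi",
# because "pi" is followed by "e", which is not a digit.
naive = re.compile(r"(?<!\d)[a-z]+(?!\d)")

# Assumed fix: also reject letters on either side, so a match cannot
# start or end in the middle of a letter run.
strict = re.compile(r"(?<![a-z\d])[a-z]+(?![a-z\d])", re.IGNORECASE)

print(naive.findall("pie21"))            # ['pi']
print(strict.findall("pie21"))           # []
print(strict.findall("http://foo.com"))  # ['http', 'foo', 'com']
```

Unlike \b, this variant still treats an underscore as a delimiter, which may or may not be what the data calls for.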

+3

Brad Christie Apr 19 '11 at 14:26

What about:

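 import re
 yourString = "pie 42 http://foo.com GIHQ4NWL0S5SCGBDD40ZXE5IDP13TYNEA pie42"
 # split on delimiters, drop duplicates, keep tokens made only of letters
 # (list() is needed on Python 3, where filter returns an iterator)
 list(filter(lambda x: re.match("^[a-zA-Z]+$", x), set(re.split(r"[\s:/,.]", yourString))))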

Note that:

split breaks your string into potential candidates => returns a list of "potential words"
set does uniqueness filtering => converts the list into a set, thus removing entries that appear more than once. This step is optional.
filter reduces the number of candidates: it takes a list, applies a test function to each element, and returns the elements that pass the test. In our case, the test function is anonymous.
lambda: an anonymous function that takes an element and checks whether it is a word (upper or lower case letters only)
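The same split, de-duplicate, filter steps can be sketched as a small Python 3 function. The function name `letter_only_tokens` and the delimiter set are assumptions made for illustration; the delimiter class would need extending for real data.

```python
import re

def letter_only_tokens(text):
    """Split on delimiters, then keep tokens made only of ASCII letters."""
    # Assumed delimiter set: whitespace, colon, slash, comma, period.
    candidates = re.split(r"[\s:/,.]+", text)
    # The set comprehension removes duplicates; sorted() gives a stable order.
    return sorted({tok for tok in candidates if re.fullmatch(r"[a-zA-Z]+", tok)})

sample = "pie 42 http://foo.com GIHQ4NWL0S5SCGBDD40ZXE5IDP13TYNEA pie42"
print(letter_only_tokens(sample))  # ['com', 'foo', 'http', 'pie']
```

Using re.fullmatch avoids the explicit ^ and $ anchors, and sorting makes the result reproducible, since a set has no defined order.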

EDIT: some clarification added

+2

Bruce Apr 19 '11 at 14:32

Code example

 print re.search(ur'(?u)\b', ur'')
 print re.search(ur'(?u)\b\b', ur'')

or

 s = ur"abcd "
 import re
 rx1 = re.compile(ur"(?u)")
 rx2 = re.compile(ur"(?u)\b")
 rx3 = re.compile(ur"(?u)\b\b")
 print rx1.findall(s)
 print rx2.findall(s)
 print rx3.findall(s)

0

Alexander Lubyagin Dec 6 '17 at 10:44

Tim Pietzcker · Accepted Answer · 2011-04-19T14:25:35+0000

If you are restricting yourself to ASCII letters, use (with the re.I option)

 \b[a-z]+\b

\b is the word boundary anchor, matching only at the start and end of alphanumeric words. Therefore, \b[a-z]+\b matches pie , but not pie21 or 21pie .
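As a quick check of that behaviour, here is a minimal sketch using re.findall with re.IGNORECASE (the long form of re.I). The sample sentence and the name `word_re` are invented for illustration.

```python
import re

# ASCII letters only; re.IGNORECASE lets [a-z] match upper case too.
word_re = re.compile(r"\b[a-z]+\b", re.IGNORECASE)

text = "Pie costs 21pie or pie21, see http://foo.com ref AC7A21C0"
print(word_re.findall(text))
# ['Pie', 'costs', 'or', 'see', 'http', 'foo', 'com', 'ref']

# Caveat: the underscore counts as a word character, so it is not a boundary.
print(word_re.findall("foo_bar"))  # []
```

The second call matters if underscores should act as delimiters, as the question suggests: \b treats foo_bar as a single word, so neither half is extracted.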

To also allow non-ASCII letters, you can use something like this:

 \b[^\W\d_]+\b

which also matches accented characters, etc. You may need to set the re.UNICODE flag, especially when using Python 2, so that the \w shorthand matches non-ASCII letters.

[^\W\d_] , as a negated character class, allows any alphanumeric character except digits and the underscore.
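A minimal sketch of the Unicode-aware pattern, assuming Python 3, where str patterns are Unicode-aware by default and no extra flag is needed. The sample text and the name `letters_re` are made up for illustration.

```python
import re

# [^\W\d_] = word characters minus digits and underscore, i.e. letters.
letters_re = re.compile(r"\b[^\W\d_]+\b")

text = "café naïve über 21pie pie21 foo_bar"
print(letters_re.findall(text))  # ['café', 'naïve', 'über']
```

On Python 2 the same pattern would need unicode strings and the re.UNICODE flag, as noted above.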
