Using grep to filter words from a stop word file

Question

I want to use grep together with a stopwords file to filter common English words out of another file. The file "somefile" contains one word per line.

cat somefile | grep -v -f stopwords

The problem with this approach is that it checks whether a word in stopwords occurs in somefile, but I want the opposite: to check whether a word in somefile occurs in stopwords.

How to do it?

Example

somefile contains the following:

 hello
 o
 orange

stopwords contains the following:

 o

I want to filter out only the word "o" from somefile, not "hello" and "orange".
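The behaviour described in the question can be reproduced with a small sketch. It assumes stopwords holds the single line "o"; the scratch directory and the variable names are hypothetical and only serve the demonstration.

```shell
# Scratch directory so the demo does not touch real files (hypothetical paths).
demo_dir=$(mktemp -d)
printf '%s\n' hello o orange > "$demo_dir/somefile"
printf '%s\n' o > "$demo_dir/stopwords"   # assumed content of the stopwords file

# Every line of stopwords is used as a pattern, and a pattern may match
# anywhere inside a line. "o" also occurs inside "hello" and "orange",
# so -v discards all three lines.
# "|| true" keeps the script going: grep exits with status 1 when it selects nothing.
substring_result=$(grep -v -f "$demo_dir/stopwords" "$demo_dir/somefile" || true)

printf 'remaining words: [%s]\n' "$substring_result"
rm -r "$demo_dir"
```

Nothing is left between the brackets, which is why plain substring matching is too aggressive for this task.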

+7

linux grep stop-words

Pimin konstantin kefaloukos Sep 7 '11 at 10:59

2 answers

Assuming you have a stop words file /tmp/words:

 in
 the

you can create a sed script from it:

 sed 's|^|s/\\<|; s|$|\\>/[CENSORED]/g;|' /tmp/words > /tmp/words.sed

this way you get /tmp/words.sed:

 s/\<in\>/[CENSORED]/g;
 s/\<the\>/[CENSORED]/g;

and then use it to censor any text file:

 sed -E -f /tmp/words.sed /input/file/to/filter.txt > /censored/output.txt

The -E switch (also spelled -r in GNU sed) enables extended regular expressions; the \< and \> word-boundary anchors are GNU extensions. Of course, you can change [CENSORED] to any other string, or to an empty string, if you want.

This solution handles files with many words per line as well as files with one word per line.
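An end-to-end run of this approach might look like the sketch below. It assumes GNU sed, since \< and \> are GNU extensions. The scratch directory, the sample sentences and the variable names are made up for illustration.

```shell
# Assumes GNU sed. All paths below live in a throwaway directory.
work_dir=$(mktemp -d)
printf '%s\n' in the > "$work_dir/words"
printf '%s\n' 'the cat sat in the tin' 'then nothing' > "$work_dir/input.txt"

# Wrap each stop word into a substitution command, e.g. s/\<in\>/[CENSORED]/g;
sed 's|^|s/\\<|; s|$|\\>/[CENSORED]/g;|' "$work_dir/words" > "$work_dir/words.sed"

# Run the generated script over the input; only whole words are replaced,
# so "tin" and "then" are left alone.
censored=$(sed -E -f "$work_dir/words.sed" "$work_dir/input.txt")

printf '%s\n' "$censored"
rm -r "$work_dir"
```

Stop words containing characters that are special to sed, such as / or ., would need escaping before being turned into commands; the sketch does not handle that.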

+5

Michał Šrajer Sep 7 '11 at 11:23

Pimin konstantin kefaloukos · Accepted Answer · 2011-09-07T11:16:05+0000

I thought about it some more and found a solution.

Use grep's -w switch to match whole words only:

 grep -v -w -f stopwords somefile
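A quick check against the example from the question, as a minimal sketch: it assumes stopwords holds the single line "o", and the scratch directory and variable names are hypothetical.

```shell
# Recreate the example files from the question in a throwaway directory.
demo_dir=$(mktemp -d)
printf '%s\n' hello o orange > "$demo_dir/somefile"
printf '%s\n' o > "$demo_dir/stopwords"   # assumed content of the stopwords file

# -f reads one pattern per line, -w requires each match to be a whole word,
# -v keeps only the lines that match none of the patterns.
kept_words=$(grep -v -w -f "$demo_dir/stopwords" "$demo_dir/somefile")

printf '%s\n' "$kept_words"
rm -r "$demo_dir"
```

Only "hello" and "orange" remain. Two optional variations, not part of the answer above: -F treats the stop words as fixed strings rather than regular expressions, and -x matches whole lines instead of whole words, which is stricter for a one-word-per-line file (with -w, the stop word "o" would also drop a line such as "o-ring").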
