Using grep to filter words from a stop word file

I want to use grep along with stop words to filter common English words from another file. The file "somefile" contains one word per line.

cat somefile | grep -v -f stopwords 

The problem with this approach is that it checks whether a word in stopwords occurs in somefile, but I want the opposite: to check whether a word in somefile occurs in stopwords.

How can I do this?

Example

somefile contains the following:

 hello
 o
 orange

stopwords contains the following:

 o 

I want to filter out only the word "o" from somefile, not "hello" and "orange".

2 answers

I thought about it some more and found a solution.

Use grep's -w option to match whole words:

 grep -v -w -f stopwords somefile 
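
To see the difference on the example from the question, here is a minimal sketch. The scratch directory and variable names are made up for illustration. The last command (-x -F) is an additional variant, not part of the answer above: since somefile has one word per line, matching whole lines as fixed strings is a stricter alternative.

```shell
# Recreate the example files from the question in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' hello o orange > "$workdir/somefile"
printf '%s\n' o > "$workdir/stopwords"

# Substring matching: every line containing "o" is dropped, so nothing is left.
# "|| true" is needed because grep exits with status 1 when it selects no lines.
substring_result=$(grep -v -f "$workdir/stopwords" "$workdir/somefile" || true)

# Whole-word matching: only the line "o" is dropped.
word_result=$(grep -v -w -f "$workdir/stopwords" "$workdir/somefile")

# Whole-line, fixed-string matching: a stricter variant for one word per line.
line_result=$(grep -v -x -F -f "$workdir/stopwords" "$workdir/somefile")

printf 'substring: [%s]\n' "$substring_result"
printf 'whole word:\n%s\n' "$word_result"
printf 'whole line:\n%s\n' "$line_result"

rm -rf "$workdir"
```

With -w, a stop word such as "in" would still remove a line like "built-in", because the hyphen counts as a word boundary. -x avoids that, and -F keeps characters such as "." in the stop words file from being read as regex syntax.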

Assuming you have a stop words file /tmp/words:

 in
 the

you can create a sed script from it:

 sed 's|^|s/\\<|; s|$|\\>/[CENSORED]/g;|' /tmp/words > /tmp/words.sed 

This way you get /tmp/words.sed:

 s/\<in\>/[CENSORED]/g;
 s/\<the\>/[CENSORED]/g;

and then use it to censor any text file:

 sed -f /tmp/words.sed /input/file/to/filter.txt > /censored/output.txt 

The \< and \> word-boundary anchors are GNU sed extensions, so this needs GNU sed; no extra option is required for them. Of course, you can change [CENSORED] to any other string, or to an empty string, if you want.

This solution handles files with many words per line as well as files with one word per line.
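
The steps above can be tried end to end with a short sketch like the following. It assumes GNU sed; the scratch directory, the sample sentence, and the variable name are made up for illustration.

```shell
# Build a small stop words file (one word per line) in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' in the > "$workdir/words"

# Turn each stop word into a substitution command, e.g. s/\<in\>/[CENSORED]/g;
sed 's|^|s/\\<|; s|$|\\>/[CENSORED]/g;|' "$workdir/words" > "$workdir/words.sed"

# Apply the generated script to a line containing several words.
# \< and \> only match at word boundaries, so "inn" is left alone.
censored=$(printf '%s\n' 'the cat sat in the inn' | sed -f "$workdir/words.sed")
printf '%s\n' "$censored"

rm -rf "$workdir"
```

Note that stop words containing characters special to sed, such as "/" or ".", would need escaping before being turned into a script this way.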
