Remove lines containing non-English (non-ASCII) characters from a file

I have a text file with characters from different languages (Chinese, Latin, etc.).

I want to delete all lines containing these non-English characters. Lines should be kept only if they consist of English letters (a-z), digits (0-9), and punctuation.

How can I do this using Unix tools like awk or sed?

+8
unix regex grep awk sed
4 answers

Perl supports the character class [:ascii:].

 perl -nle 'print if m{^[[:ascii:]]+$}' inputfile 
+15

You can use egrep -v to return only lines that do not match a pattern, with something like [^ a-zA-Z0-9.,;:'"?!-] as the pattern (add more punctuation if necessary). The literal hyphen goes last in the bracket expression; written between : and ' it would be read as a range.

Hm, thinking about this, the double negation (-v plus the negated character class) is probably not that good. Another way is to drop -v and match whole lines positively with ^[ a-zA-Z0-9.,;:'"?!-]*$ .
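
As a rough illustration of the positive whole-line match, the sketch below runs a whitelist pattern over made-up input. The `sample` function and its contents are hypothetical. The hyphen is placed last in the bracket expression so it is taken literally, and `LC_ALL=C` is assumed so that `a-z` and `A-Z` cover only ASCII letters.

```shell
# Hypothetical sample: two ASCII lines and one line with accented letters (UTF-8 bytes as octal).
sample() { printf 'Hello, world!\nna\303\257ve caf\303\251\nwell-known: "yes"\n'; }

# Keep lines made up only of the whitelisted characters.
# Inside double quotes, \" is a literal double quote and \$ a literal dollar sign for grep.
allowed=$(sample | LC_ALL=C grep -E "^[ a-zA-Z0-9.,;:'\"?!-]*\$")

printf '%s\n' "$allowed"
```

Any punctuation missing from the bracket expression (parentheses, slashes, and so on) causes an otherwise ASCII line to be dropped, so the list needs to match the data.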

You can also just filter for ASCII:

 egrep -v "[^ -~]" foo.txt 
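
A small sketch of the range filter on made-up data (the `sample` function is hypothetical). It assumes the C locale is forced with `LC_ALL=C` so that the range from space to tilde means bytes 0x20 to 0x7E. Note that a tab lies outside that range, so lines containing tabs are removed unless the tab is added to the bracket expression.

```shell
# Hypothetical sample: an ASCII line, a line starting with UTF-8 Chinese text, a tab-separated line.
sample() { printf 'plain ascii line\n\344\275\240\345\245\275 world\ncol1\tcol2\n'; }

# Keep lines consisting only of printable ASCII (space through tilde).
range_only=$(sample | LC_ALL=C grep -v '[^ -~]')

# Variant that also tolerates tab characters inside a line.
tab=$(printf '\t')
range_and_tab=$(sample | LC_ALL=C grep -v "[^ -~$tab]")

printf 'range only:\n%s\n' "$range_only"
printf 'range plus tab:\n%s\n' "$range_and_tab"
```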
+2

You can use Awk if you force the C locale:

 LC_CTYPE=C awk '! /[^[:alnum:][:space:][:punct:]]/' my_file 

Setting the environment variable LC_CTYPE=C (or LC_ALL=C ) forces the C locale for character classification. This changes the meaning of the character classes ( [:alnum:] , [:space:] , etc.) so that they match only ASCII characters.

The regex /[^[:alnum:][:space:][:punct:]]/ matches any line containing a character outside those three classes, which in the C locale includes every non-ASCII character. The ! before the regex inverts the condition, so only lines without non-ASCII characters match. Since no action is specified, awk applies the default action ( print ) to the matching lines.

EDIT: this can also be done with grep:

 LC_CTYPE=C grep -v '[^[:alnum:][:space:][:punct:]]' my_file 
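
Before deleting anything, it can help to preview which lines the filter would drop. The sketch below does that on made-up input (the `sample` function and the `pattern` variable are illustrative names). It uses `LC_ALL=C` rather than `LC_CTYPE=C` on the assumption that `LC_ALL` may already be set in the environment, in which case it takes precedence over `LC_CTYPE`.

```shell
# Hypothetical sample: ASCII line, a line of UTF-8 Cyrillic text (octal escapes), ASCII line.
sample() { printf 'abc 123 !?\n\320\237\321\200\320\270\320\262\320\265\321\202\nend.\n'; }

pattern='[^[:alnum:][:space:][:punct:]]'

# Without -v, grep -n lists the offending lines; cut keeps just their line numbers.
dropped_line_numbers=$(sample | LC_ALL=C grep -n "$pattern" | cut -d: -f1)

# With -v, the same pattern keeps only the lines that are free of such characters.
kept=$(sample | LC_ALL=C grep -v "$pattern")

echo "would drop line(s): $dropped_line_numbers"
printf '%s\n' "$kept"
```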
+1

With GNU grep, which supports Perl-compatible regular expressions through the -P option, you can use:

 grep -P '^[[:ascii:]]+$' file 
+1