Remove lines containing non-English (Ascii) characters from file

Question

I have a text file with characters from different languages, such as Chinese, Latin, etc.

I want to delete all lines containing these non-English characters. I want to keep lines made up only of English letters (a-z, A-Z), digits (0-9) and punctuation.

How can I do this using Unix tools like awk or sed?

+8

unix regex grep awk sed

Sudar Jul 20 '12 at 10:42

4 answers

You can use egrep -v to return only the lines that do not match a pattern, with something like [^ a-zA-Z0-9.,;:'"?!-] as the pattern (add more punctuation if necessary). The hyphen goes last in the bracket expression so that it is taken literally instead of forming a range.

Hm, thinking about this, the double negation (-v plus the inverted character class) is probably not that nice. Another way would be to match whole lines with ^[ a-zA-Z0-9.,;:'"?!-]*$ and drop the -v.

You can also just filter for ASCII:

 egrep -v "[^ -~]" foo.txt
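
A minimal sketch of the ASCII filter on a throwaway sample, assuming a POSIX shell and a reasonably recent grep. The file name sample.txt and the helper name ascii_lines are hypothetical. LC_ALL=C is an addition that is not part of the command above; it is there so the range from space to tilde is read by byte value (0x20 to 0x7E) rather than by the collation order of the current locale.

```shell
# Hypothetical helper: print only the lines of "$1" that consist
# entirely of printable ASCII (space through tilde).
ascii_lines() {
    # In the C locale every byte is its own character, so each byte of a
    # multibyte UTF-8 character falls outside ' -~' and the line is dropped.
    LC_ALL=C grep -v '[^ -~]' "$1"
}

# Sample input: two pure-ASCII lines, one line with an accented letter
# and one line of Chinese text (written as octal UTF-8 bytes).
printf 'plain text, 123!\ncaf\303\251 au lait\n\344\270\255\346\226\207\nlast line?\n' > sample.txt

ascii_lines sample.txt
```

On this sample only the first and last lines should be printed. Note that the range excludes the tab character, so lines containing a tab are dropped as well.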

+2

Joey Jul 20 '12 at 10:44

You can use Awk if you force the C locale:

 LC_CTYPE=C awk '! /[^[:alnum:][:space:][:punct:]]/' my_file

The environment variable LC_CTYPE=C (or LC_ALL=C) forces the C locale for character classification. This changes the meaning of the character classes ([:alnum:], [:space:], etc.) so that they match only ASCII characters.

The regex /[^[:alnum:][:space:][:punct:]]/ matches lines containing any non-ASCII character. The ! before the regex inverts the condition, so only lines without non-ASCII characters match. Then, since no action is specified, the default action (print) is applied to the matching lines.

EDIT: this can also be done with grep:

 LC_CTYPE=C grep -v '[^[:alnum:][:space:][:punct:]]' my_file
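
Since the question asks to remove the lines from the file itself, here is a hedged sketch that wraps the grep command above and writes the result back. The helper name strip_non_ascii_lines is hypothetical, and mktemp is assumed to be available (it is common but not required by POSIX). LC_ALL=C is used instead of LC_CTYPE=C so that a LC_ALL value already set in the environment cannot override it.

```shell
# Hypothetical helper: rewrite file "$1" so that only pure-ASCII lines remain.
strip_non_ascii_lines() {
    tmp=$(mktemp) || return 1
    status=0
    # grep exits with status 1 when no line survives the filter; that is not
    # an error here, so only a status above 1 is treated as a failure.
    LC_ALL=C grep -v '[^[:alnum:][:space:][:punct:]]' "$1" > "$tmp" || status=$?
    if [ "$status" -gt 1 ]; then
        rm -f "$tmp"
        return "$status"
    fi
    # Copy the content back instead of moving the temp file, so the
    # original file keeps its permissions.
    cat "$tmp" > "$1"
    rm -f "$tmp"
}

# Sample input: a line with a tab, a line with a non-ASCII letter, a plain line.
printf 'tab\there is fine\nna\303\257ve\nend.\n' > my_file

strip_non_ascii_lines my_file
cat my_file
```

Because [:space:] includes the tab character, the first sample line is kept here, unlike with a bare space-to-tilde range.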

+1

Ael ombreglace Jul 20 '12 at 14:14

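With GNU grep, which supports Perl-compatible regular expressions via -P, you can use the following. Note that + requires at least one character, so empty lines are dropped as well; use * instead to keep them.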

 grep -P '^[[:ascii:]]+$' file

+1

hek2mgl Sep 08 '17 at 7:16

Dennis Williamson · Accepted Answer · 2012-07-20T11:10:31+0000

Perl supports the character class [:ascii:] .

 perl -nle 'print if m{^[[:ascii:]]+$}' inputfile
