Perl - File Encoding and Word Comparison

Question

I have a file with one phrase or term per line, which I read in Perl from STDIN. I also have a list of stop words (for example, "á", "são", "é"), and I want to compare each stop word with each term and delete the term if they are equal. The problem is that I'm not sure what encoding the file uses.

This is what the file command reports:

words.txt: Non-ISO extended-ASCII English text

My Linux terminal uses UTF-8, and it displays some words correctly but not others. Here are some examples:

condi<E3>
conte<FA>dos
ajuda, mas não resolve
mo<E7>ambique
pedagógico são fenómenos

You can see that the 3rd and 5th lines display accented and special characters correctly, while the others do not. The correct output for the other lines should be: condiã, conteúdos and moçambique.

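If I use binmode(STDOUT, utf8), the "incorrect" lines are displayed correctly, but the other lines are not. For example, the 3rd line becomes:

ajuda, mas nÃ£o resolve

What should I do?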

+5

perl unicode character-encoding

Barata 05 '11 17:13

2 Answers

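The input appears to mix encodings, so the first step is to normalize all of it to UTF-8. Suppose you currently open the file like this: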

open(INPUT, "< badstuff.txt") || die "open failed: $!";

Instead, pipe the file through a filter that repairs the encoding, then read the result as UTF-8:

open(INPUT, "fixit < badstuff.txt |") || die "open failed: $!";

binmode(INPUT, ":encoding(UTF-8)") || die "binmode failed";

Here, fixit is the following script:

use strict;
use warnings;
use Encode qw(decode FB_CROAK);

binmode(STDIN,  ":raw")  || die "can't binmode STDIN";
binmode(STDOUT, ":utf8") || die "can't binmode STDOUT";

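while (my $octets = <STDIN>) {
    # Decode a copy: decode() with a CHECK value may modify its argument,
    # and a failed eval returns undef, so the raw octets must be kept.
    my $copy = $octets;
    my $line = eval { decode("UTF-8", $copy, FB_CROAK()) };
    if ($@) {
        $line = decode("CP1252", $octets, FB_CROAK()); # no eval{}!
    }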
    $line =~ s/\R\z/\n/;  # fix raw mode reads
    print STDOUT $line;    
}

close(STDIN)  || die "can't close STDIN: $!";
close(STDOUT) || die "can't close STDOUT: $!";
exit 0;
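
To see the fallback in isolation, here is a minimal sketch applied to two of the words from the question. The helper name decode_mixed is hypothetical, and the sketch assumes, as the script above does, that anything that is not valid UTF-8 is CP1252.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: decode one string of octets, trying strict UTF-8 first
# and falling back to CP1252 (an assumption about the legacy encoding).
sub decode_mixed {
    my ($octets) = @_;
    my $copy = $octets;    # decode() may modify its argument when CHECK is set
    my $text = eval { decode("UTF-8", $copy, FB_CROAK) };
    $text = decode("CP1252", $octets, FB_CROAK) unless defined $text;
    return $text;
}

# "conte<FA>dos" from the question: the byte 0xFA never occurs in valid UTF-8.
my $legacy = decode_mixed("conte\xFAdos");
# "não" stored as UTF-8: the bytes 0xC3 0xA3 encode U+00E3.
my $utf8   = decode_mixed("n\xC3\xA3o");

binmode(STDOUT, ":encoding(UTF-8)");
printf "%s (%d characters)\n", $legacy, length $legacy;
printf "%s (%d characters)\n", $utf8,   length $utf8;
```

Both strings come out as proper character strings, so they can be compared with stop words regardless of how each line was originally encoded.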

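Note that the script above works as a filter on STDIN; it could be adapted to process the files named in @ARGV instead.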

+4

tchrist 05 '11 19:23

Lumi · Accepted Answer · 2011-05-05T18:21:49+0000

Here is a demonstration:

C:\Dev\Perl :: chcp
Aktive Codepage: 1252.

C:\Dev\Perl :: type mixed-encoding.txt
eins zwei drei KÃ¤se vier fÃ¼nf Wurst
eins zwei drei Käse vier fünf Wurst

C:\Dev\Perl :: perl mixed-encoding.pl < mixed-encoding.txt
eins zwei drei vier fünf
eins zwei drei vier fünf

Here is mixed-encoding.pl:

use strict;
use warnings;
use utf8; # source in UTF-8
use Encode 'decode_utf8';
use List::MoreUtils 'any';

my @stopwords = qw( Käse Wurst );

while ( <> ) { # read octets
    chomp;
    my @tokens;
    for ( split /\s+/ ) {
        # Try UTF-8 first. If that fails, assume legacy Latin-1.
        my $token = eval { decode_utf8 $_, Encode::FB_CROAK };
        $token = $_ if $@;
        push @tokens, $token unless any { $token eq $_ } @stopwords;
    }
    print "@tokens\n";
}

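Note that the script above is saved as UTF-8. If you copy the script, make sure that the file encoding and the pragma agree: with use utf8 the source must be saved as UTF-8, otherwise the stop words will not match.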
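
The effect of a mismatch between the source encoding and use utf8 can be sketched without depending on how the snippet itself is saved. The escapes below stand in for what Perl sees when a UTF-8 source file contains the literal Käse with and without the pragma; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# What the literal "Käse" looks like to Perl in a UTF-8 source file:
my $with_pragma    = "K\x{E4}se";     # under "use utf8": 4 characters
my $without_pragma = "K\xC3\xA4se";   # without it: 5 raw bytes

# A token read from the input as UTF-8 octets and then decoded.
my $token = decode("UTF-8", "K\xC3\xA4se");

print "with use utf8:    ", ($token eq $with_pragma    ? "match" : "no match"), "\n";
print "without use utf8: ", ($token eq $without_pragma ? "match" : "no match"), "\n";
```

Only the first comparison matches, which is why a stop word list typed into a UTF-8 source file without the pragma would silently fail to filter decoded tokens.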

Here is a variant along the lines of tchrist's answer, with the source saved as Latin-1:

use strict;
use warnings;
# source in Latin1
use Encode 'decode';
use List::MoreUtils 'any';

my @stopwords = qw( Käse Wurst );

while ( <> ) { # read octets
        chomp;
        my @tokens;
        for ( split /\s+/ ) {
                # Try UTF-8 first. If that fails, assume 8-bit encoding.
                my $token = eval { decode utf8 => $_, Encode::FB_CROAK };
                $token    = decode Windows1252 => $_, Encode::FB_CROAK if $@;
                push @tokens, uc $token unless any { $token eq $_ } @stopwords;
        }
        print "@tokens\n";
}
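
With a longer stop list, a hash lookup is a common alternative to scanning the list with any for every token. The following is a minimal sketch under that assumption, using only core modules. The function name filter_line is hypothetical, the stop words are the ones from the question, and the legacy encoding is again assumed to be CP1252.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Stop words from the question ("á", "são", "é"), written with escapes so the
# result does not depend on the encoding of this source file.
my %is_stopword = map { $_ => 1 } ("\x{E1}", "s\x{E3}o", "\x{E9}");

# Hypothetical helper: decode one line of octets and drop stop words.
sub filter_line {
    my ($octets) = @_;
    my $copy = $octets;    # decode() may modify its argument when CHECK is set
    my $line = eval { decode("UTF-8", $copy, FB_CROAK) };
    $line = decode("CP1252", $octets, FB_CROAK) unless defined $line;
    return join " ", grep { !$is_stopword{$_} } split " ", $line;
}

binmode(STDOUT, ":encoding(UTF-8)");
# One UTF-8 line and one CP1252 line, as in a mixed input file.
print filter_line("pedag\xC3\xB3gico s\xC3\xA3o fen\xC3\xB3menos"), "\n";
print filter_line("conte\xFAdos s\xE3o"), "\n";
```

Because every line is decoded before the lookup, the same stop word is removed whether it arrived as UTF-8 or as CP1252.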
