I have a file with one phrase or term per line, which I read from STDIN in Perl. I also have a list of stop words (for example, "á", "são", "é"), and I want to compare each stop word with each term and delete the term if they are equal. The problem is that I'm not sure what encoding the file uses.
This is what the file command reports:
words.txt: Non-ISO extended-ASCII English text
My Linux terminal uses UTF-8, and it displays some words correctly but not others. Here are some of them:
condi<E3>
conte<FA>dos
ajuda, mas não resolve
mo<E7>ambique
pedagógico são fenómenos
You can see that the 3rd and 5th lines display accented and special characters correctly, while the others do not. The correct output for the other lines should be: condiã, conteúdos and moçambique.
If I use binmode(STDOUT, ":utf8"), the lines that were wrong are displayed correctly, but the 3rd and 5th lines are not:
ajuda, mas nÃ£o resolve
How can I read this file so that every line is decoded correctly?
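To make the situation concrete, here is a minimal sketch of one approach that might work. It assumes each line is entirely either valid UTF-8 or ISO-8859-1 (Latin-1). That is only a guess based on the sample lines above; the file command does not confirm it. The helper name decode_line is hypothetical, and the stop-word list contains only the three examples mentioned earlier.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode one raw line: try strict UTF-8 first, fall back to Latin-1.
# Assumption: every line is entirely in one of these two encodings.
sub decode_line {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() with a CHECK value may modify its input
    my $text = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
    return defined $text ? $text : decode('ISO-8859-1', $bytes);
}

# Stop words from the examples above, written as code points
# so the script itself does not depend on its own file encoding.
my %stop = map { $_ => 1 } ("\x{E1}", "s\x{E3}o", "\x{E9}");

binmode(STDIN);                          # read raw bytes, no decoding layer
binmode(STDOUT, ':encoding(UTF-8)');     # encode characters on output

while (my $line = <STDIN>) {
    chomp $line;
    my $term = decode_line($line);
    print "$term\n" unless $stop{$term}; # drop terms equal to a stop word
}
```

Run as, for example, perl script.pl < words.txt (script.pl is a placeholder name). Because a Latin-1 byte such as E3 or FA followed by an ASCII letter is not valid UTF-8, the strict decode fails on those lines and the Latin-1 fallback is used. A Latin-1 line whose bytes happen to form valid UTF-8 would be misread, so this is a heuristic rather than a guarantee.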