I recently ran into this with the clipboard and Microsoft Excel.
With the ever-growing amount of multilingual content used in data science, there is simply no safe way to assume UTF-8 anymore (in my case, Excel produced UTF-16 because most of my data was in Traditional Chinese).
According to Microsoft Docs, the following byte order mark (BOM) signatures are used on Windows:
|----------------------|-------------|-----------------------|
| Encoding             | BOM         | Python encoding kwarg |
|----------------------|-------------|-----------------------|
| UTF-8                | EF BB BF    | 'utf-8'               |
| UTF-16 big-endian    | FE FF       | 'utf-16-be'           |
| UTF-16 little-endian | FF FE       | 'utf-16-le'           |
| UTF-32 big-endian    | 00 00 FE FF | 'utf-32-be'           |
| UTF-32 little-endian | FF FE 00 00 | 'utf-32-le'           |
|----------------------|-------------|-----------------------|
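You can see these signatures for yourself with a quick sanity check in the interpreter: Python's BOM-adding codecs prepend exactly the bytes from the table, with `'utf-16'`/`'utf-32'` using the machine's native byte order.

```python
# Quick sanity check of the table above: the BOM-adding codecs prepend the
# signature bytes; 'utf-16' and 'utf-32' use the machine's native byte order
# (little-endian on most PCs), while the explicit '-le'/'-be' codecs add no BOM.
for enc in ('utf-8-sig', 'utf-16', 'utf-32'):
    print(enc, 'A'.encode(enc)[:4])
```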
I came up with the following approach, which seems to work well for detecting the encoding from a byte order mark at the beginning of the file:
```python
def guess_encoding_from_bom(filename, default='utf-8'):
    msboms = dict((bom['sig'], bom) for bom in (
        {'name': 'UTF-8', 'sig': b'\xEF\xBB\xBF', 'encoding': 'utf-8'},
        {'name': 'UTF-16 big-endian', 'sig': b'\xFE\xFF', 'encoding': 'utf-16-be'},
        {'name': 'UTF-16 little-endian', 'sig': b'\xFF\xFE', 'encoding': 'utf-16-le'},
        {'name': 'UTF-32 big-endian', 'sig': b'\x00\x00\xFE\xFF', 'encoding': 'utf-32-be'},
        {'name': 'UTF-32 little-endian', 'sig': b'\xFF\xFE\x00\x00', 'encoding': 'utf-32-le'}))

    with open(filename, 'rb') as f:
        sig = f.read(4)

    # Try the longest signatures first so the 4-byte UTF-32 BOMs are not
    # mistaken for their 2-byte UTF-16 prefixes.
    for sl in range(4, 0, -1):
        if sig[0:sl] in msboms:
            return msboms[sig[0:sl]]['encoding']
    return default
```
I realize this means opening the file twice (once in binary mode to sniff the BOM, and once as encoded text), but the API does not really make it easy to do otherwise in this particular case.
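For completeness, a minimal usage sketch of that two-open pattern (the file name here is just a placeholder): sniff the BOM first, then reopen the file as text with the detected encoding.

```python
# Minimal usage sketch; 'clipboard_dump.tsv' is a hypothetical file
# exported from Excel.
filename = 'clipboard_dump.tsv'
encoding = guess_encoding_from_bom(filename)
with open(filename, encoding=encoding) as f:
    text = f.read()

# The explicit-endian codecs ('utf-16-le', etc.) do not strip the BOM,
# so it decodes to a leading U+FEFF character; drop it if present.
text = text.lstrip('\ufeff')
```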
Anyway, I think this is a little more reliable than just assuming UTF-8, even though it obviously falls short of full automatic encoding detection...