Invalid input causes read.csv to cut off data

I am trying to read a csv file in R, but it keeps breaking. I think this may be due to the encoding of the file, but I'm not sure.

Here is the code I ran:

read.csv('crunchbase_companies_2.csv', fileEncoding="UTF-8", quote="")

Then I get this warning message: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : invalid input found on input connection

R reads the data, but only until it hits a special character, and then it stops. So I end up with only partial data in R. I pasted the data I got here: http://pastebin.com/EQLnXz2W . Note that the read cuts off when it hits characters like "Ì", so those characters are not in the sample data.

I also checked the encoding in the terminal with file. It returns Non-ISO extended-ASCII English text, with CR line terminators.

What do I need to do to read the entire data set?

+4
2 answers

So, although I do not quite understand why, what ended up working was changing fileEncoding to latin1 when calling the read.csv function.

This was mentioned in another answer here. For some reason I had not tried it ...
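
For illustration, here is a minimal sketch of that fix. The temporary file and its two rows are made up for the example; the only part taken from this thread is passing fileEncoding="latin1" to read.csv. It also assumes the R session runs in a locale that can represent the accented characters (for example a UTF-8 locale).

```r
# Build a tiny Latin-1 encoded CSV in a temporary file (hypothetical data).
# 0xE9 is "é" and 0xFC is "ü" in Latin-1; both are invalid as lone UTF-8 bytes.
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeBin(c(charToRaw("name,city\nCaf"), as.raw(0xE9),
           charToRaw(",Paris\nAcme,Z"), as.raw(0xFC),
           charToRaw("rich\n")), con)
close(con)

# Declare the encoding the file is actually in, so that R converts
# the non-ASCII bytes instead of treating them as invalid input.
companies <- read.csv(tmp, fileEncoding = "latin1", stringsAsFactors = FALSE)
nrow(companies)
```

If the file really is Latin-1, all rows should come through. With fileEncoding="UTF-8" the same file would be expected to produce the "invalid input found on input connection" warning from the question.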

+6

Today I ran into a similar problem and spent hours on it. I tried changing encoding / fileEncoding, setting the locale, and a couple of other things found here. But none of them worked for me.

In the end, I found a post that was not in English (those people probably have more experience with this), and it had this trick: change the open mode from "r" to "rb".

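In my case, I use readLines, so it looks like this:

# Open the file in binary mode ("rb") instead of text mode ("r")
fileIn <- file("userinfo.csv", open = "rb", encoding = "UTF-8")
# rowPerRead (the number of lines to read per call) is defined elsewhere
lines <- readLines(fileIn, n = rowPerRead, warn = FALSE)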
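
As a follow-up, here is a rough sketch of how one might check which lines contain the problem bytes after reading in binary mode, and convert them. The temporary file and its contents are invented for the example, and treating the bad lines as Latin-1 is an assumption (it matches the other answer, but your file may use a different encoding). validUTF8 requires R 3.3.0 or later.

```r
# Write a small file whose last line contains a Latin-1 byte (hypothetical data).
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeBin(c(charToRaw("id,name\n1,plain\n2,caf"), as.raw(0xE9),
           charToRaw("\n")), con)
close(con)

# Read the raw lines in binary mode, without asking R to re-encode them.
fileIn <- file(tmp, open = "rb")
lines <- readLines(fileIn, warn = FALSE)
close(fileIn)

# Flag lines that are not valid UTF-8.
bad <- !validUTF8(lines)
which(bad)

# Assumption: the offending lines are Latin-1, so convert them to UTF-8.
fixed <- lines
fixed[bad] <- iconv(lines[bad], from = "latin1", to = "UTF-8")
all(validUTF8(fixed))
```

The repaired lines could then be parsed as a table, for example with read.csv(text = fixed).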


+1
