Combine multiple CSV files and remove duplicates in R

I have almost 3,000 CSV files (containing tweets) in the same format, and I want to merge them into one new file and delete the duplicate tweets. I came across various topics discussing similar issues, but they usually deal with a small number of files. I hope you can help me write R code that does the job efficiently and effectively.

CSV files have the following format:

CSV image: Example CSV files

I changed the usernames (on Twitter) in columns 2 and 3 to A-E and the "actual names" to A1-E1.

Raw text file:

"tweet";"author";"local.time" "1";"2012-06-05 00:01:45 @A (A1): Cruijff z'n met-zwart-shirt-zijn-ze-onzichtbaar logica is even mooi ontkracht in #bureausport.";"A (A1)";"2012-06-05 00:01:45" "2";"2012-06-05 00:01:41 @B (B1): Welterusten #BureauSport";"B (B1)";"2012-06-05 00:01:41" "3";"2012-06-05 00:01:38 @C (C1): Echt ..... eindelijk een origineel sportprogramma #bureausport";"C (C1)";"2012-06-05 00:01:38" "4";"2012-06-05 00:01:38 @D (D1): LOL. \"Na onderzoek op de Fontys Hogeschool durven wij te stellen dat..\" Want Fontys staat zo hoog aangeschreven? #bureausport";"D (D1)";"2012-06-05 00:01:38" "5";"2012-06-05 00:00:27 @E (E1): Ik kijk Bureau sport op Nederland 3. #bureausport #kijkes";"E (E1)";"2012-06-05 00:00:27" 

Somehow my headers got messed up; they obviously should move one column to the right. Each CSV file contains up to 1,500 tweets. I would like to remove duplicates by checking the second column (containing the tweets), simply because tweets must be unique, whereas the author columns may repeat (for example, when one author posts multiple tweets).

Is it possible to combine the merging and the duplicate removal in one step, or does that cause problems and should the two processes be kept separate? As a starting point, I included links to two blog posts by Hayward Godwin that discuss three approaches to combining CSV files.

http://psychwire.wordpress.com/2011/06/03/merge-all-files-in-a-directory-using-r-into-a-single-dataframe/

http://psychwire.wordpress.com/2011/06/05/testing-different-methods-for-merging-a-set-of-files-into-a-dataframe/

Obviously, there are some topics related to my question on this site (for example, Combining several CSV files in R), but I did not find anything that discusses both merging and deleting duplicates. I really hope you can help me, with my limited knowledge of R, do this!

I tried some code found on the Internet, but it did not actually produce an output file. Roughly 3,000 CSV files have the format described above. I tried the following code (for the merge part):

 filenames <- list.files(path = "~/")
 do.call("rbind", lapply(filenames, read.csv, header = TRUE))

This results in the following error:

 Error in file(file, "rt") : cannot open the connection
 In addition: Warning message:
 In file(file, "rt") : cannot open file '..': No such file or directory

Update

I tried the following code:

 # grab our list of filenames
 filenames <- list.files(path = ".", pattern='^.*\\.csv$')

 # write a special little read.csv function to do exactly what we want
 my.read.csv <- function(fnam) {
   read.csv(fnam, header = FALSE, skip = 1, sep = ';',
            col.names = c('ID', 'tweet', 'author', 'local.time'),
            colClasses = rep('character', 4))
 }

 # read in all those files into one giant data.frame
 my.df <- do.call("rbind", lapply(filenames, my.read.csv))

 # remove the duplicate tweets
 my.new.df <- my.df[!duplicated(my.df$tweet), ]

But I encountered the following errors:

After the third statement (the do.call() line), I get:

 Error in read.table(file = file, header = header, sep = sep, quote = quote, :
   more columns than column names

After the fourth statement (the duplicated() line), I get:

  Error: object 'my.df' not found 

I suspect that these errors are caused by mistakes made when the CSV files were written: in some cases the author / local.time values end up in the wrong column, either one position to the left or to the right of where they should be, which produces an extra column. I manually fixed 5 files and tested the code on those; I got no errors, but nothing seemed to happen either. I got no result from R at all.
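To see what the raw rows of a suspect file actually look like, readLines() prints them unparsed (here just the first five lines of the first file, assuming filenames is defined as above):

 readLines(filenames[1], n = 5)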

To solve the problem with the extra column, I slightly adjusted the code:

 # grab our list of filenames
 filenames <- list.files(path = ".", pattern='^.*\\.csv$')

 # write a special little read.csv function to do exactly what we want
 my.read.csv <- function(fnam) {
   read.csv(fnam, header = FALSE, skip = 1, sep = ';',
            col.names = c('ID', 'tweet', 'author', 'local.time', 'extra'),
            colClasses = rep('character', 5))
 }

 # read in all those files into one giant data.frame
 my.df <- do.call("rbind", lapply(filenames, my.read.csv))

 # remove the duplicate tweets
 my.new.df <- my.df[!duplicated(my.df$tweet), ]

I ran this code on all the files. R clearly started processing, but in the end I got the following errors:

 Error in read.table(file = file, header = header, sep = sep, quote = quote, :
   more columns than column names
 In addition: Warning messages:
 1: In read.table(file = file, header = header, sep = sep, quote = quote, :
   incomplete final line found by readTableHeader on 'Twitts - di mei 29 19_22_30 2012 .csv'
 2: In read.table(file = file, header = header, sep = sep, quote = quote, :
   incomplete final line found by readTableHeader on 'Twitts - di mei 29 19_24_31 2012 .csv'
 Error: object 'my.df' not found

What have I done wrong?

1 answer

First, simplify the situation by working in the folder where the files are located, and set a pattern so that only files whose names end in ".csv" are read, so something like

 filenames <- list.files(path = ".", pattern='^.*\\.csv$')
 my.df <- do.call("rbind", lapply(filenames, read.csv, header = TRUE))
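Incidentally, the "cannot open the connection" error you saw earlier is probably because list.files() returns bare file names by default, so read.csv() cannot find them unless you are in that directory. Asking for full paths avoids that:

 # full.names = TRUE returns paths that read.csv can open from anywhere
 filenames <- list.files(path = "~/", pattern = '^.*\\.csv$', full.names = TRUE)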

This should give you a data.frame containing the contents of all the tweets.

A separate issue is the headers in the CSV files. Fortunately, you know that all the files are identical, so I would handle the headers something like this:

 read.csv('fred.csv', header = FALSE, skip = 1, sep = ';',
          col.names = c('ID', 'tweet', 'author', 'local.time'),
          colClasses = rep('character', 4))

N.B. this is changed so that all columns are read as character, and the separator is ';'.

I would parse the time into a proper date-time class later, if that turns out to be necessary...
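If you do end up needing it, something like this should work (a sketch, assuming every local.time value follows the 'YYYY-MM-DD HH:MM:SS' format in your sample):

 # convert the character timestamps into POSIXct date-times
 my.df$local.time <- as.POSIXct(my.df$local.time, format = '%Y-%m-%d %H:%M:%S')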

Another separate issue is the uniqueness of the tweets within the data.frame - it is not clear whether you want them to be unique per author or globally unique. For globally unique tweets, something like

 my.new.df <- my.df[!duplicated(my.df$tweet),] 

For uniqueness within author, I would paste the two fields together - it is hard to tell what works without the real data!

 my.new.df <- my.df[!duplicated(paste(my.df$tweet, my.df$author)),] 
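(A small caveat: paste() could in principle glue two different tweet/author pairs into the same string. Applying duplicated() directly to the two columns sidesteps that:)

 my.new.df <- my.df[!duplicated(my.df[, c('tweet', 'author')]), ]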

So, putting it all together, and assuming a few things along the way...

 # grab our list of filenames
 filenames <- list.files(path = ".", pattern='^.*\\.csv$')

 # write a special little read.csv function to do exactly what we want
 my.read.csv <- function(fnam) {
   read.csv(fnam, header = FALSE, skip = 1, sep = ';',
            col.names = c('ID', 'tweet', 'author', 'local.time'),
            colClasses = rep('character', 4))
 }

 # read in all those files into one giant data.frame
 my.df <- do.call("rbind", lapply(filenames, my.read.csv))

 # remove the duplicate tweets
 my.new.df <- my.df[!duplicated(my.df$tweet), ]
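A quick sanity check after running that, just to confirm something was actually read:

 dim(my.df)                    # total rows and columns read
 length(unique(my.df$tweet))   # number of distinct tweets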

Based on the revised warnings after the third statement, this is a problem with files that have differing numbers of columns. That is not easy to fix in general, except, as you suggested, by having an extra column in the specification. If you drop the column specification entirely, you will run into problems when you try to rbind() the data.frames together...
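One rough way to find the offending files up front is count.fields(), which counts the ';'-separated fields on each line without building a data.frame (a sketch, reusing filenames from above):

 for (fnam in filenames) {
   nf <- count.fields(fnam, sep = ';')
   if (length(unique(nf)) > 1)
     cat(fnam, 'has rows with', paste(unique(nf), collapse = '/'), 'fields\n')
 }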

Here is some code using a for() loop and some cat() debugging statements to make it more explicit which files are broken, so that you can fix them:

 filenames <- list.files(path = ".", pattern='^.*\\.csv$')
 n.files.processed <- 0   # how many files did we process?
 for (fnam in filenames) {
   cat('about to read from file:', fnam, '\n')
   if (exists('tmp.df')) rm(tmp.df)
   tmp.df <- read.csv(fnam, header = FALSE, skip = 1, sep = ';',
                      col.names = c('ID', 'tweet', 'author', 'local.time', 'extra'),
                      colClasses = rep('character', 5))
   if (exists('tmp.df') & (nrow(tmp.df) > 0)) {
     cat('  successfully read:', nrow(tmp.df), ' rows from ', fnam, '\n')
     n.files.processed <- n.files.processed + 1   # count the file as processed
     # now let's append a column containing the originating file name
     # so that debugging the file contents is easier
     tmp.df$fnam <- fnam
     # now let's rbind everything together
     if (exists('my.df')) {
       my.df <- rbind(my.df, tmp.df)
     } else {
       my.df <- tmp.df
     }
   } else {
     cat('  read NO rows from ', fnam, '\n')
   }
 }
 cat('processed ', n.files.processed, ' files\n')
 my.new.df <- my.df[!duplicated(my.df$tweet), ]
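Finally, once my.new.df looks right, you can write the merged result back out; write.csv2() uses the same ';' separator as your input files (the file name here is just a placeholder):

 write.csv2(my.new.df, 'all_tweets.csv', row.names = FALSE)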