I have almost 3,000 CSV files (containing tweets), all in the same format. I want to merge them into one new file and remove duplicate tweets. I have come across various topics discussing similar issues, but they usually deal with only a small number of files. I hope you can help me write R code that does this job efficiently and effectively.
CSV files have the following format:
In columns 2 and 3, I replaced the Twitter usernames with A-E and the "actual names" with A1-E1.
Raw text file:
"tweet";"author";"local.time"
"1";"2012-06-05 00:01:45 @A (A1): Cruijff z'n met-zwart-shirt-zijn-ze-onzichtbaar logica is even mooi ontkracht in #bureausport.";"A (A1)";"2012-06-05 00:01:45"
"2";"2012-06-05 00:01:41 @B (B1): Welterusten #BureauSport";"B (B1)";"2012-06-05 00:01:41"
"3";"2012-06-05 00:01:38 @C (C1): Echt ..... eindelijk een origineel sportprogramma #bureausport";"C (C1)";"2012-06-05 00:01:38"
"4";"2012-06-05 00:01:38 @D (D1): LOL. \"Na onderzoek op de Fontys Hogeschool durven wij te stellen dat..\" Want Fontys staat zo hoog aangeschreven? #bureausport";"D (D1)";"2012-06-05 00:01:38"
"5";"2012-06-05 00:00:27 @E (E1): Ik kijk Bureau sport op Nederland 3. #bureausport #kijkes";"E (E1)";"2012-06-05 00:00:27"
Somehow my headers got messed up; they should obviously be shifted one column to the right. Each CSV file contains up to 1,500 tweets. I would like to remove duplicates by checking the second column (containing the tweets), simply because these should be unique, whereas the author column may contain repeated values (for example, one author posting multiple tweets).
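To make the goal concrete, here is a minimal sketch of the intended merge-and-deduplicate step. It is only an illustration under stated assumptions: the two demo files are made-up data written to a temporary directory, `read_tweets` is a hypothetical helper, and "id" is an assumed name for the unnamed first column. The header row is skipped and four column names are supplied by hand, since the real header has only three entries.

```r
# Create two tiny demo files in a temporary directory (illustrative data only).
demo_dir <- file.path(tempdir(), "tweets_demo")
dir.create(demo_dir, showWarnings = FALSE)
header <- '"tweet";"author";"local.time"'
writeLines(c(header,
             '"1";"first tweet #tag";"A (A1)";"2012-06-05 00:01:45"',
             '"2";"second tweet";"B (B1)";"2012-06-05 00:01:41"'),
           file.path(demo_dir, "a.csv"))
writeLines(c(header,
             '"1";"second tweet";"B (B1)";"2012-06-05 00:01:41"',
             '"2";"third tweet";"A (A1)";"2012-06-05 00:00:27"'),
           file.path(demo_dir, "b.csv"))

# Hypothetical helper: skip the misaligned header and name all four columns.
# "id" is an assumed name for the first column.
read_tweets <- function(path) {
  read.csv(path, sep = ";", header = FALSE, skip = 1,
           col.names = c("id", "tweet", "author", "local.time"),
           colClasses = "character")
}

# Read every CSV file in the directory and stack the rows.
files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
merged <- do.call(rbind, lapply(files, read_tweets))

# Keep only the first occurrence of each tweet text.
unique_tweets <- merged[!duplicated(merged$tweet), ]

# Write the result outside demo_dir so a rerun does not read it back in.
out <- tempfile(fileext = ".csv")
write.table(unique_tweets, out, sep = ";", row.names = FALSE)
```

Merging first and deduplicating afterwards keeps the two steps easy to check separately; whether this scales comfortably to all 3,000 files is untested here.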
Is it possible to combine merging the files and removing duplicates in one step, or is that asking for problems, and should the two processes be kept separate? As a starting point, I have included links to two blog posts by Hayward Godwin that discuss three approaches to merging CSV files.
http://psychwire.wordpress.com/2011/06/03/merge-all-files-in-a-directory-using-r-into-a-single-dataframe/
http://psychwire.wordpress.com/2011/06/05/testing-different-methods-for-merging-a-set-of-files-into-a-dataframe/
Obviously, there are some topics related to my question on this site (for example, Combining several CSV files in R), but I have not found anything that discusses both merging and removing duplicates. I really hope you can help me, with my limited knowledge of R, to do this!
Although I tried some code found on the Internet, none of it actually produced an output file. The approximately 3,000 CSV files have the format described above. I tried the following code (for the merge part):
filenames <- list.files(path = "~/")
do.call("rbind", lapply(filenames, read.csv, header = TRUE))
This results in the following error:
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") : cannot open file '..': No such file or directory
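One possible cause of a "cannot open file" error like this, offered as an assumption rather than a confirmed diagnosis: `list.files()` returns bare file names relative to `path` unless `full.names = TRUE` is set, and without a `pattern` it also returns entries that are not CSV files. The sketch below uses a made-up temporary directory to show the difference.

```r
# Demo directory with one CSV file and one unrelated file (illustrative only).
demo_dir <- file.path(tempdir(), "paths_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c('"tweet";"author";"local.time"',
             '"1";"hello";"A (A1)";"2012-06-05 00:01:45"'),
           file.path(demo_dir, "one.csv"))
writeLines("not a csv", file.path(demo_dir, "notes.txt"))

# Bare names: only valid if the working directory happens to be demo_dir.
bare <- list.files(path = demo_dir)

# Full paths restricted to .csv files: usable from any working directory.
full <- list.files(path = demo_dir, pattern = "\\.csv$", full.names = TRUE)

print(bare)
print(file.exists(full))
```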
Update
I tried the following code:
# grab our list of filenames
filenames <- list.files(path = ".", pattern='^.*\\.csv$')
But I encountered the following errors:
After the third line, I get:
Error in read.table(file = file, header = header, sep = sep, quote = quote, : more columns than column names
After the 4th line, I get:
Error: object 'my.df' not found
I suspect that these errors are caused by mistakes made when the CSV files were written, as there are some cases where author / local.time is in the wrong column, either to the left or to the right of where it is supposed to be, which results in an extra column. I manually corrected 5 files and tested the code on them, and I did not get any errors. However, nothing seemed to happen: I did not get any output from R.
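Rather than checking files by hand, the suspicion about shifted columns could be tested with `count.fields()`, which reports the number of fields on each line. The following is a sketch only: the demo files are made up, `has_bad_rows` is a hypothetical helper, and the expected count of 4 fields per data row is an assumption based on the sample shown earlier.

```r
# Demo files: one well-formed, one with an extra field (illustrative data only).
demo_dir <- file.path(tempdir(), "fields_demo")
dir.create(demo_dir, showWarnings = FALSE)
header <- '"tweet";"author";"local.time"'
writeLines(c(header, '"1";"ok tweet";"A (A1)";"2012-06-05 00:01:45"'),
           file.path(demo_dir, "good.csv"))
writeLines(c(header, '"1";"shifted";"x";"B (B1)";"2012-06-05 00:01:41"'),
           file.path(demo_dir, "bad.csv"))

# Hypothetical helper: TRUE if any data row does not have exactly 4 fields.
# skip = 1 ignores the header; NA counts (unbalanced quotes) are also flagged.
has_bad_rows <- function(path, expected = 4) {
  n <- count.fields(path, sep = ";", quote = "\"", skip = 1, comment.char = "")
  any(is.na(n) | n != expected)
}

files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
suspect <- files[vapply(files, has_bad_rows, logical(1))]
print(basename(suspect))
```

Setting `comment.char = ""` matters for this data because the tweets contain `#` hashtags.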
To solve the problem with the extra column, I slightly adjusted the code:
# grab our list of filenames
filenames <- list.files(path = ".", pattern='^.*\\.csv$')
I tried this code on all the files. Although R clearly started processing, in the end I got the following errors:
Error in read.table(file = file, header = header, sep = sep, quote = quote, : more columns than column names
In addition: Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote, : incomplete final line found by readTableHeader on 'Twitts - di mei 29 19_22_30 2012 .csv'
2: In read.table(file = file, header = header, sep = sep, quote = quote, : incomplete final line found by readTableHeader on 'Twitts - di mei 29 19_24_31 2012 .csv'
Error: object 'my.df' not found
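The final "object 'my.df' not found" message is most likely just a consequence of the first error: when the read step stops, the assignment never happens. To find out which file triggers the failure, each read could be wrapped in `tryCatch()`. This is a sketch with made-up demo files; `safe_read` is a hypothetical helper, and reading with `header = TRUE` is assumed here only to reproduce the error.

```r
# Demo files: one readable, one with more fields than the header allows.
demo_dir <- file.path(tempdir(), "errors_demo")
dir.create(demo_dir, showWarnings = FALSE)
header <- '"tweet";"author";"local.time"'
writeLines(c(header, '"1";"fine tweet";"A (A1)";"2012-06-05 00:01:45"'),
           file.path(demo_dir, "ok.csv"))
writeLines(c(header, '"1";"broken";"x";"B (B1)";"2012-06-05 00:01:41"'),
           file.path(demo_dir, "broken.csv"))

files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Hypothetical helper: on error, report the file name and return NULL
# instead of aborting the whole run.
safe_read <- function(path) {
  tryCatch(
    read.csv(path, sep = ";", header = TRUE),
    error = function(e) {
      message("Skipping ", basename(path), ": ", conditionMessage(e))
      NULL
    }
  )
}

results <- lapply(files, safe_read)

# Files that could not be read, and the merge of those that could.
failed <- files[vapply(results, is.null, logical(1))]
my.df <- do.call(rbind, Filter(Negate(is.null), results))
```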
What have I done wrong?