Read a list of file names from a network in R

I am trying to read a lot of CSV files into R from a website. They are daily files (business days only) covering a long period, and all of them have the same data structure. I can successfully read one file using the following logic:

# enter user credentials
user <- "JohnDoe"
password <- "SecretPassword"
credentials <- paste(user, ":", password, "@", sep = "")
web.site <- "downloads.theice.com/Settlement_Reports_CSV/Power/"

# construct path to data
path <- paste("https://", credentials, web.site, sep = "")

# read data for 4/10/2013
file <- "icecleared_power_2013_04_10"
fname <- paste(path, file, ".dat", sep = "")
df <- read.csv(fname, header = TRUE, sep = "|", as.is = TRUE)

However, I am looking for tips on how to read all the files in the directory at once. I suppose I could generate a sequence of dates, build the file name in a loop, and use rbind to append each file, but that seems cumbersome. It would also run into problems on weekends and holidays, where no files exist.
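To illustrate, here is a rough sketch of the looping approach I have in mind (reusing path from the snippet above; try() is used so that weekends and holidays with no file are simply skipped rather than stopping the loop):

# sketch of the brute-force loop I would like to avoid
dates <- format(seq(as.Date("2013-04-01"), as.Date("2013-04-30"), by = "day"), "%Y_%m_%d")
pieces <- list()
for (d in dates) {
  fname <- paste(path, "icecleared_power_", d, ".dat", sep = "")
  res <- try(read.csv(fname, header = TRUE, sep = "|", as.is = TRUE), silent = TRUE)
  if (!inherits(res, "try-error")) pieces[[d]] <- res   # skip days with no file
}
df <- do.call(rbind, pieces)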

The following shows part of the file list as it appears in a web browser:

[screenshot: file list in browser, part 1]

...

[screenshot: file list in browser, part 2]

Is there a way to scan the path (above) to get a list of all the file names in the directory that match certain criteria (i.e., start with "icecleared_power_", since there are also files in this location with different starting names that I don't want to read), then run read.csv over that list and use rbind to combine the results?

Any recommendations would be greatly appreciated.

+4
3 answers

First, I would try to simply scrape the links to the relevant data files and use that information to construct the full download path, including user logins and so on. As others have suggested, lapply would be convenient for batch downloading.

Here is an easy way to extract the urls. Obviously, modify the example to suit your actual scenario.

Here we are going to use the XML package to identify all the links available in the CRAN archives for the Amelia package ( http://cran.r-project.org/src/contrib/Archive/Amelia/ ).

> library(XML)
> url <- "http://cran.r-project.org/src/contrib/Archive/Amelia/"
> doc <- htmlParse(url)
> links <- xpathSApply(doc, "//a/@href")
> free(doc)
> links
                  href                   href                   href 
            "?C=N;O=D"             "?C=M;O=A"             "?C=S;O=A" 
                  href                   href                   href 
            "?C=D;O=A" "/src/contrib/Archive/" "Amelia_1.1-23.tar.gz" 
                  href                   href                   href 
"Amelia_1.1-29.tar.gz" "Amelia_1.1-30.tar.gz" "Amelia_1.1-32.tar.gz" 
                  href                   href                   href 
"Amelia_1.1-33.tar.gz"  "Amelia_1.2-0.tar.gz"  "Amelia_1.2-1.tar.gz" 
                  href                   href                   href 
 "Amelia_1.2-2.tar.gz"  "Amelia_1.2-9.tar.gz" "Amelia_1.2-12.tar.gz" 
                  href                   href                   href 
"Amelia_1.2-13.tar.gz" "Amelia_1.2-14.tar.gz" "Amelia_1.2-15.tar.gz" 
                  href                   href                   href 
"Amelia_1.2-16.tar.gz" "Amelia_1.2-17.tar.gz" "Amelia_1.2-18.tar.gz" 
                  href                   href                   href 
 "Amelia_1.5-4.tar.gz"  "Amelia_1.5-5.tar.gz"  "Amelia_1.6.1.tar.gz" 
                  href                   href                   href 
 "Amelia_1.6.3.tar.gz"  "Amelia_1.6.4.tar.gz"    "Amelia_1.7.tar.gz" 

To demonstrate, imagine that in the end we only need links for version 1.2 of the package.

> wanted <- links[grepl("Amelia_1\\.2.*", links)]
> wanted
                  href                   href                   href 
 "Amelia_1.2-0.tar.gz"  "Amelia_1.2-1.tar.gz"  "Amelia_1.2-2.tar.gz" 
                  href                   href                   href 
 "Amelia_1.2-9.tar.gz" "Amelia_1.2-12.tar.gz" "Amelia_1.2-13.tar.gz" 
                  href                   href                   href 
"Amelia_1.2-14.tar.gz" "Amelia_1.2-15.tar.gz" "Amelia_1.2-16.tar.gz" 
                  href                   href 
"Amelia_1.2-17.tar.gz" "Amelia_1.2-18.tar.gz" 

Now you can use this vector as follows:

wanted <- links[grepl("Amelia_1\\.2.*", links)]
GetMe <- paste(url, wanted, sep = "")
lapply(seq_along(GetMe),
       function(x) download.file(GetMe[x], wanted[x], mode = "wb"))

Update (to address your question in the comments)

The last step in the example above downloads the specified files to your current working directory (use getwd() to check where that is). If, instead, you know for sure that read.csv will work on the data, you can also try modifying the anonymous function to read the files directly:

lapply(seq_along(GetMe),
       function(x) read.csv(GetMe[x], header = TRUE, sep = "|", as.is = TRUE))

However, I think a safer approach might be to download all the files into a single directory first, and then use read.delim or read.csv or whatever works for reading the data in, similar to what @Andreas suggested. I say safer because it gives you more flexibility in case some files do not download completely: instead of having to restart everything, you would only need to re-download the files that failed.
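As a rough sketch of that two-step idea (assuming GetMe and wanted now point at your pipe-delimited .dat files rather than the Amelia tarballs used for illustration, and using a made-up local folder "ice_data"):

# 1. download everything to a local folder first
dir.create("ice_data", showWarnings = FALSE)
dest <- file.path("ice_data", wanted)
lapply(seq_along(GetMe), function(x) download.file(GetMe[x], dest[x], mode = "wb"))

# 2. read the local copies and stack them
local.files <- list.files("ice_data", full.names = TRUE)
df <- do.call(rbind,
              lapply(local.files, read.csv, header = TRUE, sep = "|", as.is = TRUE))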

+3

You can try using the "download.file" command.

### set up the path and destination
path <- "url where file is located"
dest <- "where on your hard disk you want the file saved"

### Ask R to try really hard to download your ".csv"
try(download.file(path, dest))

The trick to this is figuring out how the URL or "path" changes systematically between files. Often web pages are built so that the URLs are systematic. In that case, you can create a vector or data frame of URL info to iterate over inside an apply function.

All of this can be wrapped inside lapply. The data object is simply whatever we are iterating over; it can be a vector of URLs, or a data frame of year and month observations, which can then be used to build the URLs inside the lapply function.

 ### "dl" will apply a function to every element in our vector "data" # It will also help keep track of files which have no download data dl <- lapply(data, function(x) { path <- 'url' dest <- './data_intermediate/...' try(download.file(path, dest)) }) ### Assign element names to your list "dl" names(dl) <- unique(data$name) index <- sapply(dl, is.null) ### Figure out which downloads returned nothing no.download <- names(dl)[index] 

Then you can use "list.files ()" to merge all the data together, assuming they belong to the same data.frame

### Create a list of files you want to merge together
files <- list.files()

### Create a list of data.frames by reading each file into memory
data <- lapply(files, read.csv)

### Stack data together
data <- do.call(rbind, data)

Sometimes you will notice that a file comes back corrupted after downloading. In that case, pay attention to the mode argument of download.file(): you can set mode = "w", or mode = "wb" if the file is stored in a binary format.
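For example, a quick sketch reusing the path/dest placeholders from above:

### Re-download in binary mode if the plain-text download came back corrupted
try(download.file(path, dest, mode = "wb"))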

+1

@MikeTP, if all reports start with "icecleared_power_" followed by a business date, the "timeDate" package offers an easy way to create a vector of business days, for example:

require(timeDate)
tSeq <- timeSequence("2012-01-01", "2012-12-31")  # vector of days
tBiz <- tSeq[isBizday(tSeq)]                      # vector of business days

and

 paste0("icecleared_power_",as.character.Date(tBiz)) 

gives you the concatenated file names.

If the website follows a different logic regarding file names, we need more information, as Ananda Mahto observed.

Keep in mind that when creating a date vector with timeDate you can get much more sophisticated than my simple example: you can take holiday calendars, stock exchange calendars, and so on into account.
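For instance, here is a sketch using the NYSE holiday calendar that ships with timeDate (which may or may not match the exchange calendar relevant to your files):

require(timeDate)
tSeq <- timeSequence("2012-01-01", "2012-12-31")
# business days excluding NYSE holidays -- swap in whichever calendar applies
tBiz <- tSeq[isBizday(tSeq, holidays = holidayNYSE(2012))]
paste0("icecleared_power_", as.character.Date(tBiz))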

+1
