Flatten a list of records to a data frame

Question


Edit: This question is outdated. The jsonlite package now flattens nested records automatically.

I deal with online data streams that use a record-based encoding, usually JSON. The structure of the objects (i.e., the names in the JSON) is known from the API documentation. However, most values are optional and are not present in every record. Lists may contain nested lists, and the structure is sometimes quite deep. Here is a pretty simple example of some GPS data: http://pastebin.com/raw.php?i=yz6z9t25 . Please note that the object "l" is missing in the bottom lines due to the lack of a GPS signal.

I am looking for an elegant way to flatten these objects into a data frame. I am currently using something like this:

 library(RJSONIO)
 library(plyr)
 obj <- fromJSON("http://pastebin.com/raw.php?i=yz6z9t25", simplifyWithNames=FALSE, simplify=FALSE)
 flatdata <- lapply(obj$data, as.data.frame)
 mydf <- rbind.fill(flatdata)

This does the job, but it is slow and error prone. The problem with this approach is that I do not use my knowledge of the structure (the object names) of the data; instead, it is inferred from the data. This leads to problems when a certain property happens to be absent from every record. In that case it does not appear in the data frame at all, rather than appearing as a column of NA values. This can lead to downstream problems. For example, I need to process the timestamp of a location:

 mydf$l.t <- structure(mydf$l.t/1000, class="POSIXct")

However, this results in an error for a data set in which the object l$t does not exist. Also, as.data.frame and rbind.fill make things pretty slow, even though the example data set is relatively small. Any suggestions for a better implementation? A reliable solution would always yield a data frame with the same columns in the same order, where only the number of rows varies.

Edit: Below is a data set with lots of metadata. It is larger and more deeply nested:
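
To make the failure mode concrete, here is a minimal sketch using two made-up records (hypothetical values, not taken from the pastebin data) in which no record contains the l object. Because every record has the same fields, base rbind is enough here and plyr is not needed:

```r
# Two hypothetical records, both without the "l" object (no GPS fix).
records <- list(
  list(id = "a", ts = 1340656860000),
  list(id = "b", ts = 1340656920000)
)

flat <- lapply(records, as.data.frame, stringsAsFactors = FALSE)
mydf <- do.call(rbind, flat)

# The l.t column was never seen, so it is absent instead of being all NA.
names(mydf)

# The downstream timestamp conversion then fails: the replacement has 0 rows.
result <- tryCatch({
  mydf$l.t <- structure(mydf$l.t / 1000, class = "POSIXct")
  "ok"
}, error = function(e) "error")
result
```

With a fixed, known set of columns, the same conversion would simply yield a column of NA timestamps.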

 obj <- fromJSON("http://www.stat.ucla.edu/~jeroen/files/output.json", simplifyWithNames=FALSE, simplify=FALSE)

+8

r

Jeroen Jun 25 '12 at 20:41


3 answers

Here is a solution that lets you take advantage of your prior knowledge of the data field names and classes. In addition, by avoiding the repeated calls to as.data.frame and the single call to plyr's rbind.fill() (both of which are time-intensive), it runs about 60 times faster on your example data.

 cols <- c("id", "ls", "ts", "l.lo", "l.tz", "l.t", "l.ac", "l.la", "l.pr", "m")
 numcols <- c("l.lo", "l.t", "l.ac", "l.la")
 ## Flatten each top-level list element, converting it to a character vector.
 x <- lapply(obj$data, unlist)
 ## Extract fields that might be present in each record (returning NA if absent).
 y <- sapply(x, function(X) X[cols])
 ## Convert to a data.frame with columns of desired classes.
 z <- as.data.frame(t(y), stringsAsFactors=FALSE)
 z[numcols] <- lapply(numcols, function(X) as.numeric(as.character(z[[X]])))
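
For readers without access to the example URL, the same idea can be tried on inline data. This is only a sketch: the two records and their values are made up, and it swaps sapply() for vapply() plus explicit row names so that the column names do not depend on which fields the first record happens to contain.

```r
# Hypothetical records; the second one lacks the "l" object entirely.
records <- list(
  list(id = "a", l = list(t = 1340656860000, lo = 4.9)),
  list(id = "b")
)
cols    <- c("id", "l.t", "l.lo")
numcols <- c("l.t", "l.lo")

# unlist() joins nested names with "." and coerces all values to character.
x <- lapply(records, unlist)

# Indexing by name yields NA for absent fields; vapply() always returns a matrix.
y <- vapply(x, function(X) as.character(X[cols]), character(length(cols)))
rownames(y) <- cols

# One row per record, then restore the numeric columns.
z <- as.data.frame(t(y), stringsAsFactors = FALSE)
z[numcols] <- lapply(z[numcols], as.numeric)
str(z)
```

The result has the same columns in the same order regardless of which fields are present, with NA where a record has no value.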

Edit: To confirm that my approach yields results identical to those in the original question, I performed the following test. (Note that in both cases I set stringsAsFactors=FALSE to avoid meaningless differences in factor level orders.)

 flatdata <- lapply(obj$data, as.data.frame, stringsAsFactors=FALSE)
 mydf <- rbind.fill(flatdata)
 identical(z, mydf)
 # [1] TRUE

Further edit:

Just for the record, here is an alternative version of the above that additionally, and automatically:

finds the names of all data fields
determines their class / type
coerces the columns of the final data.frame to the correct class

 dat <- obj$data
 ## Find the names and classes of all fields
 fields <- unlist(lapply(dat, function(X) rapply(X, class, how="unlist")))
 fields <- fields[unique(names(fields))]
 cols <- names(fields)
 ## Flatten each top-level list element, converting it to a character vector.
 x <- lapply(dat, unlist)
 ## Extract fields that might be present in each record (returning NA if absent).
 y <- sapply(x, function(X) X[cols])
 ## Convert to a data.frame with columns of desired classes.
 z <- as.data.frame(t(y), stringsAsFactors=FALSE)
 ## Coerce columns of z (all currently character) back to their original type
 z[] <- lapply(seq_along(fields), function(i) as(z[[cols[i]]], fields[i]))
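
The field-discovery step can be looked at in isolation. The following is a small sketch on made-up records (the field names n and ok are hypothetical) showing what rapply(..., class, how="unlist") returns and how as() uses the recorded class name to convert a string back:

```r
library(methods)  # provides as(); attached by default in most R sessions

# Hypothetical records with different sets of fields.
records <- list(
  list(id = "a", n = 3L),
  list(id = "b", l = list(t = 1340656860000, ok = TRUE))
)

# Class of every leaf, named by its flattened path ("l.t", "l.ok", ...).
fields <- unlist(lapply(records, function(X) rapply(X, class, how = "unlist")))

# Keep the first occurrence of each field name.
fields <- fields[unique(names(fields))]
print(fields)

# Convert a character value back to the class recorded for field "n".
n_value <- as("3", fields[["n"]])
class(n_value)
```

Note that the class is taken from the first record in which a field occurs, so this assumes each field has a consistent type across records.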

+5

Josh O'Brien Jun 25 '12 at 21:16


Here is an attempt that makes no assumptions about the data types. It is slightly slower than @JoshOBrien's solution, but much faster than the OP's original solution.

 Joshua <- function(x) {
   un <- lapply(x, unlist, recursive=FALSE)
   ns <- unique(unlist(lapply(un, names)))
   un <- lapply(un, function(x) {
     y <- as.list(x)[ns]
     names(y) <- ns
     lapply(y, function(z) if(is.null(z)) NA else z)})
   s <- lapply(ns, function(x) sapply(un, "[[", x))
   names(s) <- ns
   data.frame(s, stringsAsFactors=FALSE)
 }
 Josh <- function(x) {
   cols <- c("id", "ls", "ts", "l.lo", "l.tz", "l.t", "l.ac", "l.la", "l.pr", "m")
   numcols <- c("l.lo", "l.t", "l.ac", "l.la")
   ## Flatten each top-level list element, converting it to a character vector.
   ## (Uses the argument x rather than the global obj$data.)
   x <- lapply(x, unlist)
   ## Extract fields that might be present in each record (returning NA if absent).
   y <- sapply(x, function(X) X[cols])
   ## Convert to a data.frame with columns of desired classes.
   z <- as.data.frame(t(y))
   z[numcols] <- lapply(numcols, function(X) as.numeric(as.character(z[[X]])))
   z
 }
 Jeroen <- function(x) {
   flatdata <- lapply(x, as.data.frame)
   rbind.fill(flatdata)
 }
 library(rbenchmark)
 benchmark(Josh=Josh(obj$data), Joshua=Joshua(obj$data),
           Jeroen=Jeroen(obj$data), replications=5, order="relative")
 #     test replications elapsed  relative user.self sys.self user.child sys.child
 # 1   Josh            5    0.24  1.000000      0.24        0         NA        NA
 # 2 Joshua            5    0.31  1.291667      0.32        0         NA        NA
 # 3 Jeroen            5   12.97 54.041667     12.87        0         NA        NA
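
The reason the Joshua() version needs no type information is the recursive=FALSE argument. A short sketch on one made-up record (hypothetical values) shows the difference from a plain unlist():

```r
# One hypothetical record with a nested "l" object.
record <- list(id = "a", l = list(t = 1340656860000, lo = 4.9))

# Default unlist(): an atomic vector, so everything is coerced to character.
chr <- unlist(record)
class(chr)

# recursive = FALSE: still a list, so each element keeps its own class.
flat <- unlist(record, recursive = FALSE)
sapply(flat, class)
```

Keep in mind that recursive=FALSE removes only one level of nesting, so records nested more deeply than this example would still contain inner lists.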

+2

Joshua Ulrich Jun 25 '12 at 23:20


Jeroen · Accepted Answer · 2012-06-26T22:28:05+0000

Just for completeness, I am adding a combination of Josh's and Joshua's solutions, which is the best I have come up with so far.

 flatlist <- function(mylist){
   lapply(rapply(mylist, enquote, how="unlist"), eval)
 }
 records2df <- function(recordlist, columns) {
   if(length(recordlist)==0 && !missing(columns)){
     return(as.data.frame(matrix(ncol=length(columns), nrow=0, dimnames=list(NULL,columns))))
   }
   un <- lapply(recordlist, flatlist)
   if(!missing(columns)){
     ns <- columns
   } else {
     ns <- unique(unlist(lapply(un, names)))
   }
   un <- lapply(un, function(x) {
     y <- as.list(x)[ns]
     names(y) <- ns
     lapply(y, function(z) if(is.null(z)) NA else z)})
   s <- lapply(ns, function(x) sapply(un, "[[", x))
   names(s) <- ns
   data.frame(s, stringsAsFactors=FALSE)
 }

The function is reasonably fast. I still think it should be possible to speed it up, though:

 obj <- fromJSON("http://www.stat.ucla.edu/~jeroen/files/output.json", simplifyWithNames=FALSE, simplify=FALSE)
 flatdata <- records2df(obj$data)

It also allows you to force specific columns, although this does not result in too much speedup:

 flatdata <- records2df(obj$data, columns=c("m", "doesnotexist"))
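
The flatlist() helper above is the part that handles arbitrarily deep nesting while keeping value types. A minimal sketch on one made-up record (the nested field names m and ok are hypothetical) illustrates what it returns:

```r
# Same helper as in the answer: wrap every leaf in quote() so that unlist()
# cannot coerce the values, then evaluate each quoted leaf again.
flatlist <- function(mylist) {
  lapply(rapply(mylist, enquote, how = "unlist"), eval)
}

# Hypothetical record nested three levels deep.
record <- list(id = "a", l = list(t = 1340656860000, m = list(ok = TRUE)))

flat <- flatlist(record)
names(flat)
sapply(flat, class)
```

The result is a flat list whose names are the dot-separated paths and whose elements keep their original classes, which is what records2df() then lines up by name.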
