How to change data form using reoccurring columns?

I'm new to data analysis with R. Recently, I got a pre-formatted dataset of an environmental observing model, a subset of which is shown below:

date site obs mod site obs mod 2000-09-01 00:00:00 campus NA 61.63 city centre 66 56.69 2000-09-01 01:00:00 campus 52 62.55 city centre NA 54.75 2000-09-01 02:00:00 campus 52 63.52 city centre 56 54.65 

Basically, the data includes time series of hourly observed and simulated pollutant concentrations at different sites in “repeating columns”, i.e. obsmod site (in the example I showed only 2 of all 75 sites). I read this "wide" dataset as a data frame and wanted to change it in a "narrower" format as:

 date site obs mod 2000-09-01 00:00:00 campus NA 61.63 2000-09-01 01:00:00 campus 52 62.55 2000-09-01 02:00:00 campus 52 63.52 2000-09-01 00:00:00 city centre 66 56.69 2000-09-01 01:00:00 city centre NA 54.75 2000-09-01 02:00:00 city centre 56 54.65 

I figured I should use the "reshape2" package for this. First I tried to melt and then dcast the data set:

 test.melt <- melt(test.data, id.vars = "date", measure.vars = c("site", "obs", "mod")) 

However, he returned only half of the data, i.e. all entries of the site (s) ("city center") after the first ("campus") were cut off:

 date variable value 2001-01-01 00:00:00 site campus 2001-01-01 01:00:00 site campus 2001-01-01 02:00:00 site campus 2001-01-01 00:00:00 obs NA 2001-01-01 01:00:00 obs 52 2001-01-01 02:00:00 obs 52 2001-01-01 00:00:00 mod 61.63 2001-01-01 01:00:00 mod 62.55 2001-01-01 02:00:00 mod 63.52 

Then I tried to redo:

 test.recast <- recast(test.data, date ~ site + obs + mod) 

However, he returned with an error message:

 Error in eval(expr, envir, enclos) : object 'site' not found 

I tried to find the previous questions, but could not find similar scenarios (correct me if I am wrong). Can someone please help me with this?

Thank you very much in advance!

+6
source share
2 answers

You may be able to use the basic R-form after doing some cleanup of the variable name.

Here are your details.

 test <- read.table(header = TRUE, stringsAsFactors=FALSE, text = "date site obs mod site obs mod '2000-09-01 00:00:00' campus NA 61.63 'city centre' 66 56.69 '2000-09-01 01:00:00' campus 52 62.55 'city centre' NA 54.75 '2000-09-01 02:00:00' campus 52 63.52 'city centre' 56 54.65") test # date site obs mod site.1 obs.1 mod.1 # 1 2000-09-01 00:00:00 campus NA 61.63 city centre 66 56.69 # 2 2000-09-01 01:00:00 campus 52 62.55 city centre NA 54.75 # 3 2000-09-01 02:00:00 campus 52 63.52 city centre 56 54.65 

If you did it right, you should get names like I got: since @chase mentions in your answer , “duplicate column names are a little weird and this is not normal R behavior” - so we have to fix this.

Note. Both of these parameters generate a time variable, which you can continue and discard. You might want to save it just in case you want to reformat it back to a wide format.

  • Option 1: If you have names like I did (what you need), the solution is simple. For the first site, simply add "0" to the site name and use the R reshape base:

     names(test)[2:4] <- paste(names(test)[2:4], "0", sep=".") test <- reshape(test, direction = "long", idvar = "date", varying = 2:ncol(test)) rownames(test) <- NULL # reshape makes UGLY rownames test # date time site obs mod # 1 2000-09-01 00:00:00 0 campus NA 61.63 # 2 2000-09-01 01:00:00 0 campus 52 62.55 # 3 2000-09-01 02:00:00 0 campus 52 63.52 # 4 2000-09-01 00:00:00 1 city centre 66 56.69 # 5 2000-09-01 01:00:00 1 city centre NA 54.75 # 6 2000-09-01 02:00:00 1 city centre 56 54.65 
  • Option 2: If you really have duplicate column names , the fix is ​​still simple and follows the same logic. First, create more convenient column names (easy to use with rep() ), and then use reshape() as described above.

     names(test)[-1] <- paste(names(test)[-1], rep(1:((ncol(test)-1)/3), each = 3), sep = ".") test <- reshape(test, direction = "long", idvar = "date", varying = 2:ncol(test)) rownames(test) <- NULL ### Or, more convenient: # names(test) <- make.unique(names(test)) # names(test)[2:4] <- paste(names(test)[2:4], "0", sep=".") # test <- reshape(test, direction = "long", # idvar = "date", varying = 2:ncol(test)) # rownames(test) <- NULL 
  • An additional step: the data in this form is still not completely "long". If necessary, all that is required is another step:

     require(reshape2) melt(test, id.vars = c("date", "site", "time")) # date site time variable value # 1 2000-09-01 00:00:00 campus 0 obs NA # 2 2000-09-01 01:00:00 campus 0 obs 52.00 # 3 2000-09-01 02:00:00 campus 0 obs 52.00 # 4 2000-09-01 00:00:00 city centre 1 obs 66.00 # 5 2000-09-01 01:00:00 city centre 1 obs NA # 6 2000-09-01 02:00:00 city centre 1 obs 56.00 # 7 2000-09-01 00:00:00 campus 0 mod 61.63 # 8 2000-09-01 01:00:00 campus 0 mod 62.55 # 9 2000-09-01 02:00:00 campus 0 mod 63.52 # 10 2000-09-01 00:00:00 city centre 1 mod 56.69 # 11 2000-09-01 01:00:00 city centre 1 mod 54.75 # 12 2000-09-01 02:00:00 city centre 1 mod 54.65 

Refresh (to try to answer some questions from the comments)

  • The reshape() documentation is rather confusing. It is best to work through a few examples to understand how this works. In particular, “time” should not refer to time (“date” in your problem), but more, for example, for panel data, where records are collected at different times for the same identifier. In your case, the only "id" in the source data is the "date" column. Another potential id is the site, but not in the way the data is organized.

    Imagine for a moment if your data looked like this:

     test1 <- structure(list(date = structure(1:3, .Label = c("2000-09-01 00:00:00", "2000-09-01 01:00:00", "2000-09-01 02:00:00"), class = "factor"), obs.campus = c(NA, 52L, 52L), mod.campus = c(61.63, 62.55, 63.52), obs.cityCentre = c(66L, NA, 56L), mod.cityCentre = c(56.69, 54.75, 54.65)), .Names = c("date", "obs.campus", "mod.campus", "obs.cityCentre", "mod.cityCentre"), class = "data.frame", row.names = c(NA, -3L)) test1 # date obs.campus mod.campus obs.cityCentre mod.cityCentre # 1 2000-09-01 00:00:00 NA 61.63 66 56.69 # 2 2000-09-01 01:00:00 52 62.55 NA 54.75 # 3 2000-09-01 02:00:00 52 63.52 56 54.65 

    Now try reshape(test1, direction = "long", idvar = "date", varying = 2:ncol(test1)) . You will see that reshape() sees the site names as "time" (this can be redefined by adding " timevar = "site" " to your reshape ).

    When direction = "long" , you must indicate which columns change with time. In your case, these are all columns except the first, so my use of 2:ncol(test) for " varying ".

  • test2 ? Where is it?

  • Question in response to @Chase: I think you misunderstood how melt() should work. Basically, he tries to get you a “skin” form of your data. In this case, the “optional step” described above will be the most difficult form, since date + site will be the minimum necessary to include a unique identifier variable. (I would say that " time " can be safely removed.)

    Once your data is in the format described in the “optional step” (we assume that the output has been saved as “ test.melt ”, you can always expand the table in different ways easily. As a demonstration of what I mean, try the following and see what they do.

     dcast(test.melt, date + site ~ variable) dcast(test.melt, date ~ variable + site) dcast(test.melt, variable + site ~ date) dcast(test.melt, variable + date ~ site) 

    It’s not easy to have this flexibility if you stop at “Option 1” or “Option 2”.


Update (after a few years)

melt from "data.table" can now "melt" multiple columns in a similar way than reshape . It should work if column names are duplicated.

You can try the following:

 measure <- c("site", "obs", "mod") melt(as.data.table(test), measure.vars = patterns(measure), value.name = measure) # date variable site obs mod # 1: 2000-09-01 00:00:00 1 campus NA 61.63 # 2: 2000-09-01 01:00:00 1 campus 52 62.55 # 3: 2000-09-01 02:00:00 1 campus 52 63.52 # 4: 2000-09-01 00:00:00 2 city centre 66 56.69 # 5: 2000-09-01 01:00:00 2 city centre NA 54.75 # 6: 2000-09-01 02:00:00 2 city centre 56 54.65 
+10
source

The fact that you have duplicate column names is a bit strange and is not normal behavior of R. In most cases, R forces you to have valid names using the make.names() function. Despite everything, I can duplicate your problem. Note. I made my own example, since yours is not reproducible, but the logic is the same.

 #Do not force unique names s <- data.frame(id = 1:3, x = runif(3), x = runif(3), check.names = FALSE) #----- id xx 1 1 0.6845270 0.5218344 2 2 0.7662200 0.6179444 3 3 0.4110043 0.1104774 #Now try to melt, note that 1/2 of your x-values are missing! melt(s, id.vars = 1) #----- id variable value 1 1 x 0.6845270 2 2 x 0.7662200 3 3 x 0.4110043 

The solution is to make the column names unique. As I said, R does this by default in most cases. However, you can do this after the fact via make.unique()

 names(s) <- make.unique(names(s)) #----- [1] "id" "x" "x.1" 

Note that the second column x now has 1 padding to it. Now melt() works as you expected:

 melt(s, id.vars = 1) #----- id variable value 1 1 x 0.6845270 2 2 x 0.7662200 3 3 x 0.4110043 4 1 x.1 0.5218344 5 2 x.1 0.6179444 6 3 x.1 0.1104774 

At this point, if you want to treat x and x.1 as the same variable, I think a little gsub() or another regular expression function to get rid of offensive characters. This is a workflow that I use quite often.

+6
source

Source: https://habr.com/ru/post/926396/


All Articles