Splitting a string in R with complex divisions

Question


I have a data frame (day.df) with a vial column that I am trying to split into four new columns: treatment, gender, line and block. day.df also contains the response and explanatory columns, which should be kept.

At the moment day.df looks like this (first 4 of 31,000 rows):

    vial  response explanatory
    Xm1.1        0           4
    Xm2.1        0           4
    Xm3.1        0           4
    Xm4.1        0           4
    .            .           .
    .            .           .
    .            .           .

Entries in the vial column currently look like this: Xm1.2

The first character (shown as X in the example) can be X or A: this is treatment.
The second character (shown as m in the example) can be m or f: this is gender.
The third part (shown as 1) is a number ranging from 1 to 40, so it can be one or two digits: this is line.
The last character is block and ranges from 1 to 4.
The "." needs to be dropped.

Thus, the new day.df should look something like this (I use four "random" rows to illustrate the variation in each new column):

    vial  response explanatory treatment gender line block
    Xm1.1        0           4         X      m    1     1
    Am1.1        0           4         A      m    1     1
    Xf3.2        0           4         X      f    3     2
    Xm4.2        0           4         X      m    4     2
    .            .           .         .      .    .     .

I searched online for how to do this, and this is the closest I got. I tried splitting the vial column as follows:

    > a=strsplit(day.df$vial,"")
    > a[1]
    "Xm1.2"

but there were problems when the "line" part of the string went above 9, because it then has two characters. For example, for a row where vial is Af20.2:

    > a[300]
    [[1]]
    [1] "A" "f" "2" "0" "." "2"

It should read:

    > a[300]
    [[1]]
    [1] "A" "f" "20" "." "2"

So the steps I need help with are:

Handle the line part of the string when it is greater than 9.
Add the split pieces to day.df as the four required columns.

+7

string r

Ell Jul 05 '13 at 11:53


3 answers

Read the data:

    Lines <- "vial response explanatory
    Xm1.1 0 4
    Xm2.1 0 4
    Xm3.1 0 4
    Xm4.1 0 4
    "
    day.df <- read.table(text = Lines, header = TRUE, as.is = TRUE)

1) Then process it with strapplyc. (We used as.is=TRUE so that day.df$vial is character; if it is a factor in your data frame, replace day.df$vial with as.character(day.df$vial).) This approach does the parsing in a single short line of code:

    library(gsubfn)
    s <- strapplyc(day.df$vial, "(.)(.)(\\d+)[.](.)", simplify = rbind)

    # we can now cbind it to the original data frame
    colnames(s) <- c("treatment", "gender", "line", "block")
    cbind(day.df, s)

which gives:

       vial response explanatory treatment gender line block
    1 Xm1.1        0           4         X      m    1     1
    2 Xm2.1        0           4         X      m    2     1
    3 Xm3.1        0           4         X      m    3     1
    4 Xm4.1        0           4         X      m    4     1
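For comparison, the same capture-group idea can be sketched in base R with regexec() and regmatches(). This is not part of the answer above: the sample vector and the stricter, anchored pattern (character classes taken from the question's description of the vial codes) are assumptions for illustration.

```r
# Hypothetical sample values, including a two-digit line number (Af20.2)
vial <- c("Xm1.1", "Af20.2", "Xf3.2")

# regexec() locates the whole match and each capture group;
# regmatches() extracts the corresponding substrings as a list
m <- regmatches(vial, regexec("^([XA])([mf])([0-9]+)[.]([1-4])$", vial))

# Each list element is c(full_match, group1, ..., group4); drop the full match
parts <- do.call(rbind, lapply(m, function(x) x[-1]))
colnames(parts) <- c("treatment", "gender", "line", "block")

# parts is a character matrix with one row per vial
parts
```

Because the digits are captured with [0-9]+, a two-digit line such as 20 stays in one piece.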

2) Here is a different approach. It uses no packages, is relatively simple (no regular expressions at all) and consists of a single R statement, including the cbind'ing:

    transform(day.df,
       treatment = substring(vial, 1, 1),         # 1st char
       gender = substring(vial, 2, 2),            # 2nd char
       line = substring(vial, 3, nchar(vial)-2),  # 3rd through 2 prior to last char
       block = substring(vial, nchar(vial)))      # last char

The result is the same as before.

UPDATE: Added the second approach.

UPDATE: Some simplifications.

+4

G. grothendieck Jul 05 '13 at 12:14


An alternative that does not require regular expressions is to use substr(), taking advantage of the fact that the last part of your vial code is a numeric value.

Let's say your data is this:

    d1 <- read.table(header=TRUE,text="
    vial response explanatory
    Xm1.1 0 4
    Xm2.1 0 4
    Xm3.2 0 4
    Xm44.1 0 4")

Then the result can be obtained as follows:

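    # numeric part, e.g. "1.1" or "44.1"; the integer part is the line
    d1$line <- as.integer(substr(x=d1$vial,3,6))
    # the fractional part times 10 is the block; round() removes floating-point error
    d1$block <- round((as.numeric(substr(x=d1$vial,3,6)) %% 1)*10)
    d1$treatment <- substr(x=d1$vial,1,1)
    d1$gender <- substr(x=d1$vial,2,2)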

The numerical part always starts after exactly two characters, regardless of the number of digits. We extract this part and store the digits before the decimal point in the first statement (line) and the digit after the decimal point in the second (block). Extracting treatment and gender is straightforward.
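A quick worked check of this decimal trick on two-digit lines, as a sketch. The sample values are hypothetical, and the round() call is an assumption added here because a fractional part such as 0.4 is not represented exactly in floating point, so truncating without rounding could give the wrong block.

```r
# Hypothetical vial codes with one- and two-digit line numbers
vial <- c("Xm1.1", "Af20.2", "Xm40.4")

# Everything from the 3rd character onward parses as a decimal number
num <- as.numeric(substr(vial, 3, nchar(vial)))

# Integer part is the line; as.integer() truncates toward zero
line <- as.integer(num)

# Fractional part times 10 is the block; round before converting to integer
block <- as.integer(round((num %% 1) * 10))

data.frame(vial, line, block)
```

Note that this relies on block being a single digit, which holds for the 1 to 4 range described in the question.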

+1

Maxim.K Jul 05 '13 at 12:22


agstudy · Accepted Answer · 2013-07-05T12:06:07+0000

Use gsub and strsplit as follows:

    v <- c('Xm1.1','Xf3.2')
    h <- gsub('(X|A)(m|f)([0-9]{1,2})[.]([1-4])','\\1|\\2|\\3|\\4',v)
    do.call(rbind,strsplit(h,'[|]'))
         [,1] [,2] [,3] [,4]
    [1,] "X"  "m"  "1"  "1"
    [2,] "X"  "f"  "3"  "2"

The result is a character matrix, which you can cbind to the original data.frame.

EDIT (@GriffinEvo): applied and tested code:

    a = gsub('(X|A)(m|f)([0-9]{1,2})[.]([1-4])', '\\1|\\2|\\3|\\4', day.df$vial)
    do.call(rbind, strsplit(a, '[|]'))
    day.df = cbind(day.df, do.call(rbind, strsplit(a, '[|]')))
    colnames(day.df)[4:7] = c("treatment", "gender", "line", "block")
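The new columns come out of strsplit as text, so line and block are not yet numeric. A minimal sketch of converting them afterwards, assuming a small hypothetical day.df in place of the real 31,000-row data:

```r
# Hypothetical stand-in for the real data frame
day.df <- data.frame(vial = c("Xm1.1", "Af20.2"),
                     response = 0, explanatory = 4,
                     stringsAsFactors = FALSE)

# Same gsub/strsplit steps as in the answer
a <- gsub('(X|A)(m|f)([0-9]{1,2})[.]([1-4])', '\\1|\\2|\\3|\\4', day.df$vial)
day.df <- cbind(day.df, do.call(rbind, strsplit(a, '[|]')))
colnames(day.df)[4:7] <- c("treatment", "gender", "line", "block")

# Go through as.character() first so the conversion is also correct
# if the columns were created as factors (the default in older R versions)
day.df$line  <- as.integer(as.character(day.df$line))
day.df$block <- as.integer(as.character(day.df$block))

str(day.df)
```

Converting a factor directly with as.integer() would return the internal level codes rather than the printed numbers, which is why the as.character() step is included.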
