Splitting strings from integers in R

Question

I recently encountered an interesting problem when trying to create a user database.

My lines are of the form:

183746IGH
105928759UBS

etc. (so basically a whole number followed by a string, both of varying length). I'm trying to put the whole number in column 1 and everything else (the letters) in column 2. How can this be done? I tried strsplit, but it does not seem to offer this functionality.

Thanks for any help.

+5

r statistics

sdgaw erzswer May 01, '15 at 22:18

5 answers

Another way, with base R and regular expressions:

 all <- c(' 183746IGH', '105928759UBS')
 numeric <- sapply(all, function(x) sub('[[:alpha:]]+', '', x))
 alphabetic <- sapply(all, function(x) sub('[[:digit:]]+', '', x))
 > data.frame(all, alphabetic, numeric)
                        all alphabetic   numeric
  183746IGH       183746IGH        IGH    183746
 105928759UBS  105928759UBS        UBS 105928759

Or, as per @rawr's comment below:

 > read.table(text = gsub('(\\d)(\\D)', '\\1 \\2', all))
          V1  V2
 1    183746 IGH
 2 105928759 UBS

Or the first version wrapped in a function:

 get_alphanum <- function(x, type) {
   type <- switch(type,
                  alpha = '[[:digit:]]+',
                  digit = '[[:alpha:]]+')
   sub(type, '', x)
 }
 get_alphanum <- Vectorize(get_alphanum)

It can then be applied directly to the vector:

 > get_alphanum(all, type='alpha')
    183746IGH 105928759UBS
       " IGH"        "UBS"
 > get_alphanum(all, type='digit')
    183746IGH 105928759UBS
    " 183746"  "105928759"

which can also be used to create a data.frame:

 > data.frame(all, alpha=get_alphanum(all, type='alpha'), numeric=get_alphanum(all, type='digit'))
                        all alpha   numeric
  183746IGH       183746IGH   IGH    183746
 105928759UBS  105928759UBS   UBS 105928759
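The quoted results above (" IGH", " 183746") carry over the leading space from the first input element. As a minimal sketch, assuming that space is unwanted, base R's trimws() can strip it first. Since sub() is already vectorised, the sapply() wrapper is not strictly needed either. The names cleaned and numeric_part are illustrative, not from the answer.

```r
# Sample input from the answer above, including the leading space
all <- c(" 183746IGH", "105928759UBS")

# Strip surrounding whitespace before separating the parts
cleaned <- trimws(all)

# sub() works on whole vectors, so it can be called directly
alphabetic   <- sub("[[:digit:]]+", "", cleaned)              # drop the digits
numeric_part <- as.numeric(sub("[[:alpha:]]+", "", cleaned))  # drop the letters

result <- data.frame(all = cleaned, alphabetic, numeric = numeric_part)
print(result)
```

Converting with as.numeric() is an extra step not in the original answer; leave it out if the number should stay a string.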

+5

LyzandeR May 01 '15 at 10:38 PM

Other options include tstrsplit from the development version of data.table:

 library(data.table)  # v1.9.5+
 setDT(df)[, tstrsplit(V1, '(?<=\\d)(?=\\D)', perl=TRUE, type.convert=TRUE)]
 #          V1      V2
 # 1:   131341    adad
 # 2:    45365  adadar
 # 3:      425 cavsbsb
 # 4: 46567567 daadvsv

If in some elements the non-numeric part comes first and the numeric part comes last, we can use a more general regular expression pattern that splits at either kind of boundary:

  setDT(df)[,tstrsplit(V1, "(?<=\\d)(?=\\D)|(?<=\\D)(?=\\d)", perl = TRUE)]
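To see what the two alternatives in that pattern do, here is a rough sketch that swaps in base R's strsplit for tstrsplit, so it runs without data.table. The sample vector x and the name boundary are made up for illustration. One element is digits-first, the other letters-first.

```r
# Hypothetical mixed-order input: digits first, then letters first
x <- c("131341adad", "adadar45365")

# Zero-width split point: digit followed by non-digit, OR non-digit followed by digit
boundary <- "(?<=\\d)(?=\\D)|(?<=\\D)(?=\\d)"

parts <- strsplit(x, boundary, perl = TRUE)
print(parts)
# [[1]]
# [1] "131341" "adad"
#
# [[2]]
# [1] "adadar" "45365"

# Bind the pieces into a two-column character matrix
m <- do.call(rbind, parts)
```

The pieces come back in their original order, so for letters-first elements the text ends up in the first column. A further step would be needed if digits must always land in the same column.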

Or using extract from tidyr:

 library(tidyr)
 extract(df, V1, into=c('V1', 'V2'), '(\\d+)(\\D+)', convert=TRUE)
 #         V1      V2
 # 1   131341    adad
 # 2    45365  adadar
 # 3      425 cavsbsb
 # 4 46567567 daadvsv

If you need to keep the source column too:

 extract(df, V1, into=c('V2', 'V3'), '(\\d+)(\\D+)', convert=TRUE, remove=FALSE)
 #                V1       V2      V3
 # 1      131341adad   131341    adad
 # 2     45365adadar    45365  adadar
 # 3      425cavsbsb      425 cavsbsb
 # 4 46567567daadvsv 46567567 daadvsv

For data.table we can use := to create new columns so that existing columns remain in the output, i.e.

 setDT(df)[, paste0('V', 2:3) := tstrsplit(V1, '(?<=\\d)(?=\\D)', perl=TRUE, type.convert=TRUE)]
 #                 V1       V2      V3
 # 1:      131341adad   131341    adad
 # 2:     45365adadar    45365  adadar
 # 3:      425cavsbsb      425 cavsbsb
 # 4: 46567567daadvsv 46567567 daadvsv

NOTE: Both solutions can convert the class of the split columns (type.convert / convert).
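As a small illustration of what that conversion amounts to, the sketch below calls base R's type.convert() directly on already-split pieces. This is only an approximation of the effect of the options named in the note. The vectors digits and words are made up for the example.

```r
# Pieces as they come out of a split: everything is character
digits <- c("131341", "45365", "425", "46567567")
words  <- c("adad", "adadar", "cavsbsb", "daadvsv")

# as.is = TRUE keeps text as character instead of turning it into a factor
digits_conv <- type.convert(digits, as.is = TRUE)
words_conv  <- type.convert(words, as.is = TRUE)

class(digits_conv)  # "integer"
class(words_conv)   # "character"
```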

data

 df <- data.frame(V1 = c("131341adad", "45365adadar", "425cavsbsb", "46567567daadvsv"))

+5

akrun May 02, '15 at 4:48

read.pattern in the gsubfn package can do this. Each parenthesized group of the regular expression given in the pattern argument is read into a separate column:

 x <- c("183746IGH", "105928759UBS")
 library(gsubfn)
 read.pattern(text = x, pattern = "(\\d+)(\\D+)")

giving:

          V1  V2
 1    183746 IGH
 2 105928759 UBS
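The same capture-group idea can be sketched in base R without gsubfn, using regexec() and regmatches(). This is a swapped-in technique, not what read.pattern does internally. The anchors ^ and $ are an added assumption that each string consists only of digits followed by non-digits.

```r
x <- c("183746IGH", "105928759UBS")

# regexec() records the full match plus one entry per parenthesized group
m <- regmatches(x, regexec("^(\\d+)(\\D+)$", x))

# Each list element is c(full match, group 1, group 2); stack them as rows
parts <- do.call(rbind, m)

result <- data.frame(V1 = as.numeric(parts[, 2]),
                     V2 = parts[, 3],
                     stringsAsFactors = FALSE)
print(result)
```

Strings that do not fit the pattern yield a zero-length match, so rbind() would no longer line up. Real data may need a check for that first.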

+3

G. grothendieck May 02, '15 at 2:15

strsplit works if you give it the right regular expression to split on.

In this case, you need something like:

 strsplit(String, split = "(?<=[a-zA-Z])(?=[0-9])", perl = TRUE)

Here it is applied to @Steven's example data:

 strsplit(as.character(df$V1), split = "(?<=[a-zA-Z])(?=[0-9])", perl = TRUE)
 # [[1]]
 # [1] "adad"   "131341"
 #
 # [[2]]
 # [1] "adadar" "45365"
 #
 # [[3]]
 # [1] "cavsbsb" "425"
 #
 # [[4]]
 # [1] "daadvsv"  "46567567"

Some time ago I wrote a function to do this, since I honestly don't think in regular expressions very often. The function looks like this:

 SplitMe <- function(string, alphaFirst = TRUE, bind = FALSE) {
   if (!is.character(string)) string <- as.character(string)
   Pattern <- ifelse(isTRUE(alphaFirst),
                     "(?<=[a-zA-Z])(?=[0-9])",
                     "(?<=[0-9])(?=[a-zA-Z])")
   out <- strsplit(string, split = Pattern, perl = TRUE)
   if (isTRUE(bind)) {
     require(data.table)
     as.data.table(do.call(rbind, out))
   } else {
     out
   }
 }

Intended use was something like:

 library(data.table)
 as.data.table(df)[, c("char", "num") := SplitMe(V1, bind = TRUE)][]
 #                 V1    char      num
 # 1:      adad131341    adad   131341
 # 2:     adadar45365  adadar    45365
 # 3:      cavsbsb425 cavsbsb      425
 # 4: daadvsv46567567 daadvsv 46567567

Once you recognize this pattern, you can use it in other places that rely on strsplit, for example separate from "tidyr", which conveniently splits the values into columns:

 library(dplyr)
 library(tidyr)
 df %>%
   separate(V1, into = c("char", "num"),
            sep = "(?<=[a-zA-Z])(?=[0-9])", perl = TRUE)
 #      char      num
 # 1    adad   131341
 # 2  adadar    45365
 # 3 cavsbsb      425
 # 4 daadvsv 46567567

+3

A5C1D2H2I1M1N2O1R2T1 May 02, '15 at 4:58

Steven beaupré · Accepted Answer · 2015-05-01T22:31:06+0000

You can do:

 df <- data.frame(V1 = c("adad131341", "adadar45365", "cavsbsb425", "daadvsv46567567"))
 library(dplyr)
 library(stringr)
 df %>%
   mutate(V2 = str_extract(V1, "[0-9]+"),
          V3 = str_extract(V1, "[a-zA-Z]+"))

Which gives:

 #                V1       V2      V3
 # 1      adad131341   131341    adad
 # 2     adadar45365    45365  adadar
 # 3      cavsbsb425      425 cavsbsb
 # 4 daadvsv46567567 46567567 daadvsv
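For comparison, a base R approximation of the same extract-the-first-match idea, without dplyr or stringr, might look like the sketch below. It swaps in regexpr() and regmatches() for str_extract. One difference to keep in mind: regmatches() drops elements that have no match, whereas str_extract returns NA for them, so this only lines up when every string contains both parts.

```r
V1 <- c("adad131341", "adadar45365", "cavsbsb425", "daadvsv46567567")

# regexpr() locates the first match; regmatches() pulls the matched text out
V2 <- regmatches(V1, regexpr("[0-9]+", V1))     # first run of digits
V3 <- regmatches(V1, regexpr("[a-zA-Z]+", V1))  # first run of letters

result <- data.frame(V1, V2, V3, stringsAsFactors = FALSE)
print(result)
```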
