How to split a string only at the first number

I have a dataset of street addresses that are formatted in very different ways. For example:

d <- c("street1234", "Street 423", "Long Street 12-14", "Road 18A", "Road 12 - 15", "Road 1/2") 

From this I want to create two columns: X, with the street name, and Y, with the number plus everything that follows it. Like this:

 X            Y
 Street       1234
 Street       423
 Long Street  12-14
 Road         18A
 Road         12 - 15
 Road         1/2

So far I have tried strsplit and have looked through some similar questions, for example: strsplit(d, split = "(?<=[a-zA-Z])(?=[0-9])", perl = TRUE). I just can't find the correct regular expression.

Any help is greatly appreciated. Thank you in advance!

4 answers

There may be a space between the letter and the number, so add \s* (zero or more whitespace characters) between the lookbehind and the lookahead:

 > strsplit(d, split = "(?<=[a-zA-Z])\\s*(?=[0-9])", perl = TRUE)
 [[1]]
 [1] "street" "1234"

 [[2]]
 [1] "Street" "423"

 [[3]]
 [1] "Long Street" "12-14"

 [[4]]
 [1] "Road" "18A"

 [[5]]
 [1] "Road"    "12 - 15"

 [[6]]
 [1] "Road" "1/2"
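To see what the added \s* changes, here is a small comparison sketch in base R. The variable names (strict, loose, n_strict, n_loose) are made up for illustration. The pattern from the question requires the letter and the digit to be directly adjacent, so it should only split strings like "street1234"; the pattern above also allows whitespace in between.

```r
d <- c("street1234", "Street 423", "Long Street 12-14",
       "Road 18A", "Road 12 - 15", "Road 1/2")

# Pattern from the question: a letter immediately followed by a digit
strict <- "(?<=[a-zA-Z])(?=[0-9])"
# Pattern from the answer: optional whitespace allowed between them
loose  <- "(?<=[a-zA-Z])\\s*(?=[0-9])"

# Count how many pieces each string is split into under each pattern
n_strict <- lengths(strsplit(d, strict, perl = TRUE))
n_loose  <- lengths(strsplit(d, loose,  perl = TRUE))

print(data.frame(d, n_strict, n_loose))
```

A count of 1 means the string was not split at all. Note that the looser pattern can match more than once in a string, so an address such as "Road 18A 2" would presumably be split into more than two pieces.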

And if you want to create columns based on this, you can use separate from the tidyr package:

 > library(tidyr)
 > separate(data.frame(A = d), col = "A", into = c("X", "Y"), sep = "(?<=[a-zA-Z])\\s*(?=[0-9])")
             X       Y
 1      street    1234
 2      Street     423
 3 Long Street   12-14
 4        Road     18A
 5        Road 12 - 15
 6        Road     1/2

Another approach is to use str_locate from stringr to find the position of the first digit in each string and then split at that position, i.e.

 library(stringr)
 # start position of the first run of digits in each string
 ind <- str_locate(d, '[0-9]+')[,1]
 # cut each string just before that position, trim whitespace, keep the first two pieces
 setNames(data.frame(do.call(rbind, Map(function(x, y)
     trimws(substring(x, seq(1, nchar(x), y-1), seq(y-1, nchar(x), nchar(x)-y+1))),
     d, ind)))[,1:2], c('X', 'Y'))
 #            X       Y
 #1      street    1234
 #2      Street     423
 #3 Long Street   12-14
 #4        Road     18A
 #5        Road 12 - 15
 #6        Road     1/2

Note that you will get a (harmless) warning. It comes from the split of the string "Road 12 - 15", which yields three pieces instead of two: [1] "Road" "12 - 15" ""
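The same idea (find the first digit, then cut there) can also be sketched in base R with regexpr and substr, which avoids the uneven split. This is only an illustration: split_at_first_digit is a hypothetical helper, not part of stringr or of the answer above, and returning NA in Y for strings without any digit is an assumption about the desired behaviour.

```r
d <- c("street1234", "Street 423", "Long Street 12-14",
       "Road 18A", "Road 12 - 15", "Road 1/2")

# Hypothetical helper: split each string at the position of its first digit
split_at_first_digit <- function(x) {
  # Position of the first digit, or -1 if the string has none
  pos <- as.integer(regexpr("[0-9]", x))
  has_num <- pos > 0
  # Everything before the first digit (the whole string if there is no digit)
  street <- ifelse(has_num, substr(x, 1, pos - 1), x)
  # The first digit and everything after it (NA if there is no digit)
  number <- ifelse(has_num, substring(x, pos), NA_character_)
  data.frame(X = trimws(street), Y = number, stringsAsFactors = FALSE)
}

res <- split_at_first_digit(c(d, "No Number Lane"))
print(res)
```

The extra "No Number Lane" entry is there only to show how a string without a number is handled.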


This will also work:

 do.call(rbind, strsplit(sub('([[:alpha:]]+)\\s*([[:digit:]]+)', '\\1$\\2', d), split = '\\$'))
 #     [,1]          [,2]
 #[1,] "street"      "1234"
 #[2,] "Street"      "423"
 #[3,] "Long Street" "12-14"
 #[4,] "Road"        "18A"
 #[5,] "Road"        "12 - 15"
 #[6,] "Road"        "1/2"
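The approach above relies on "$" not occurring anywhere in the addresses. If that cannot be guaranteed, a possible variation is to skip the temporary separator and extract each part with its own sub call. This is a minimal sketch, not the answerer's method; the names X, Y and res are chosen for illustration.

```r
d <- c("street1234", "Street 423", "Long Street 12-14",
       "Road 18A", "Road 12 - 15", "Road 1/2")

# Street part: remove optional whitespace, the first digit, and everything after it
X <- sub("\\s*[0-9].*$", "", d)
# Number part: remove everything before the first digit
Y <- sub("^[^0-9]*", "", d)

res <- cbind(X, Y)
print(res)
```

For a string that contains no digit at all, X would be the unchanged string and Y an empty string.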

We can use read.csv together with sub from base R:

 read.csv(text = sub("^([A-Za-z ]+)\\s*([0-9]+.*)", "\\1,\\2", d),
          header = FALSE, col.names = c("X", "Y"), stringsAsFactors = FALSE)
 #            X       Y
 #1      street    1234
 #2      Street     423
 #3 Long Street   12-14
 #4        Road     18A
 #5        Road 12 - 15
 #6        Road     1/2
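One detail worth checking with this approach: the first capture group, [A-Za-z ]+, also matches spaces, so the space before the number ends up inside X (for example "Street " rather than "Street"). The sketch below compares the call as written with a variant that passes strip.white = TRUE, a standard read.csv argument that trims leading and trailing whitespace from unquoted fields. The names csv_lines, raw and clean are made up for illustration.

```r
d <- c("street1234", "Street 423", "Long Street 12-14",
       "Road 18A", "Road 12 - 15", "Road 1/2")

# Insert a comma between the street part and the number part
csv_lines <- sub("^([A-Za-z ]+)\\s*([0-9]+.*)", "\\1,\\2", d)

# As in the answer above: whitespace inside the fields is kept
raw <- read.csv(text = csv_lines, header = FALSE,
                col.names = c("X", "Y"), stringsAsFactors = FALSE)

# Variant: trim leading and trailing whitespace from each field
clean <- read.csv(text = csv_lines, header = FALSE,
                  col.names = c("X", "Y"), stringsAsFactors = FALSE,
                  strip.white = TRUE)

# Compare the lengths of the street names in both results
print(data.frame(raw = nchar(raw$X), clean = nchar(clean$X)))
```

Like any comma-separated approach, this assumes the addresses themselves contain no commas.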