Split a string on a comma following a specific word

I have a vector with names, for example:

names <- "Jansen, A., Karel, A., Jong, A. de, Pietersen, K." 

I want to break it down into individual names. In this case, I need to split the string on ., and on a comma that follows de (that name will be A. de Jong, which is typical in Dutch).

Now I am doing:

  strsplit(names,split="\\.\\,|\\<de\\>,") 

But it also removes de from the name:

 [[1]]
 [1] "Jansen, A" " Karel, A" " Jong, A. " " Pietersen, K."

How can I get the following result instead?

 [[1]]
 [1] "Jansen, A" " Karel, A" " Jong, A. de" " Pietersen, K."
3 answers

polishchuk's regex needs two modifications to work in R.

First, the backslash must be escaped. Second, the call to strsplit needs the argument perl = TRUE to enable lookbehind.

 strsplit(names, split = "\\.,|(?<=de),", perl = TRUE)

gives the result Sasha asked for.

Note that this still leaves a dot in the name de Jong, and it does not extend to alternatives such as van, der, etc. I suggest the following alternative.

 names <- "Jansen, A., Karel, A., Jong, A. de, Pietersen, K., Helsing, A. van"
 # split on every comma
 first_last <- strsplit(names, split = ",")[[1]]
 # rearrange into a matrix with the first column representing last names,
 # and the second column representing initials
 first_last <- matrix(first_last, byrow = TRUE, ncol = 2)
 # clean up: remove leading spaces and dots
 first_last <- gsub("^ ", "", first_last)
 first_last <- gsub("\\.", "", first_last)
 # combine columns again
 apply(first_last, 1, paste, collapse = ", ")
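
If you would rather keep the single-regex approach, the lookbehind can be extended to several particles. The following is only a minimal sketch: the particle list and the sample input are illustrative assumptions, and the \b word boundary is an addition so that a surname merely ending in de (such as Linde) does not trigger a split. It relies on PCRE accepting top-level lookbehind alternatives of different fixed lengths. trimws() is also an addition, used to drop the leading spaces.

```r
# Illustrative input; "Linde" ends in "de" but is not a particle
names <- "Jansen, A., Berg, B. van der, Jong, A. de, Linde, C."

# Split on ".," or on a comma preceded by a whole-word particle.
# Each lookbehind alternative has a fixed length, as PCRE requires.
pattern <- "\\.,|(?<=\\bvan|\\bder|\\bde|\\bten|\\bte),"

# perl = TRUE is needed for the lookbehind; trimws() removes leading spaces
result <- trimws(strsplit(names, split = pattern, perl = TRUE)[[1]])
print(result)
```

Any particle missing from the pattern would still be treated as an ordinary comma, so the list has to be maintained by hand.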

Try this regex, which uses a look-behind: \.,|(?<=de),

It will match the separators shown in brackets:

Jansen, A[.,] Karel, A[.,] Jong, A. de[,] Pietersen, K.
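
To check what the pattern matches from within R, one option is to extract the matches rather than split on them. This is a small sketch using base R's gregexpr() and regmatches(); the variable names pattern and hits are only illustrative. The backslash is doubled because it sits inside an R string, and perl = TRUE is required for the look-behind.

```r
names <- "Jansen, A., Karel, A., Jong, A. de, Pietersen, K."

# The regex above, written as an R string literal
pattern <- "\\.,|(?<=de),"

# Extract every match: two ".," separators and the comma after "de"
hits <- regmatches(names, gregexpr(pattern, names, perl = TRUE))[[1]]
print(hits)
```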


I just figured out a very simple way to solve this problem, which I am posting here for reference. Use gsub first to turn the special case into something that is easier to split on:

 names <- "Jansen, A., Karel, A., Jong, A. de, Pietersen, K."
 names <- gsub("\\<de\\>,", "de.,", names)
 strsplit(names, split = "\\.\\,")
 [[1]]
 [1] "Jansen, A" " Karel, A" " Jong, A. de" " Pietersen, K."

I think this requires a separate gsub() call for each particle that can occur (in Dutch you have van, der, de, te, ten and more), so it is not perfect, but it does the job.
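
The separate calls could probably be folded into one by using alternation and a backreference. A minimal sketch, assuming an illustrative and incomplete particle list; the names particles, pattern, marked and result are hypothetical, and trimws() is an addition to remove the leading spaces:

```r
# Hypothetical example input with two different particles
names <- "Jansen, A., Berg, B. van der, Jong, A. de, Pietersen, K."

# Particles to protect; an illustrative, incomplete list
particles <- c("van", "der", "de", "te", "ten")

# Builds "\\<(van|der|de|te|ten)\\>," so any whole-word particle
# followed by a comma is captured in group 1
pattern <- paste0("\\<(", paste(particles, collapse = "|"), ")\\>,")

# Put the captured particle back, followed by ".," as the split marker
marked <- gsub(pattern, "\\1.,", names)

result <- trimws(strsplit(marked, split = "\\.,")[[1]])
print(result)
```

The list still has to be extended by hand for every particle that can occur, but only in one place.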
