How to replace values ​​in a data frame with a string in R?

short version: how to replace values ​​in a data frame with a string found in another data frame?

longer version: I am a biologist working with many types of bees. I have a dataset with many thousands of bees. Each line has a unique bee identification number along with all the necessary information about this sample (capture data, GPS location, etc.). Species information for each bee has not been entered because it takes a long time to identify them. When IDing, I get boxes of hundreds of bees, all of the same species. I enter them in a separate data frame. I am trying to write code that will update the original data file with information about species (family, gender, species, sex, etc.), as I ID bees. Currently, the type information in the source data file is empty and interpreted as NA inside R. I want R to find all the unique identifiers of the bees and fill in the species information, but it's hard for me to figure out how to replace the NA Values ​​with a string (for example, "Andrenidae ")

Here is a simple example of what I'm trying to do:

rawData<-data.frame(beeID=c(1:20),family=rep(NA,20)) speciesInfo<-data.frame(beeID=seq(1,20,3),family=rep("Andrenidae",7)) rawData[rawData$beeID == 4,"family"] <- speciesInfo[speciesInfo$beeID == 4,"family"] 

So, I replace things as I want, but with a number, not a surname (string). In the end, I would like to write a small loop to add all the data in the view, for example:

 for (i in speciesInfo$beeID){ rawData[rawData$beeID == i,"family"] <- speciesInfo[speciesInfo$beeID == i,"family"] } 

Thanks in advance for any advice!

Greetings

Zach

EDIT:

I just noticed that the first two methods below each time add a new column, which can cause problems if I need to add view information several times (which I usually do). For example:

 rawData<-data.frame(beeID=c(1:20),family=rep(NA,20)) Andrenidae<-data.frame(beeID=seq(1,20,3),family=rep("Andrenidae",7)) Halictidae<-data.frame(beeID=seq(1,20,3)+1,family=rep("Halictidae",7)) # using join library(plyr) rawData <- join(rawData, Andrenidae, by = "beeID", type = "left") rawData <- join(rawData, Halictidae, by = "beeID", type = "left") # using merge rawData <- merge(x=rawData,y=Andrenidae,by='beeID',all.x=T,all.y=F) rawData <- merge(x=rawData,y=Halictidae,by='beeID',all.x=T,all.y=F) 

Is there a way to collapse columns so that I have one single data frame? Or a way to update rawData rather than adding a new column every time? Thanks in advance!

+8
string replace r na
source share
4 answers

Here is a function, I think, will work for you. This uses match to find and index the values ​​in your annotation data frame, and then replace the values ​​in rawData.

 replaceID <- function(to,from,mergeBy,values){ x <- match(from[,mergeBy],to[,mergeBy]) to[,values][x] <- as.character(from[,values]) return(to) } > rawData <- replaceID(rawData,Halictidae,"beeID","family") > rawData beeID family 1 1 <NA> 2 2 Halictidae 3 3 <NA> 4 4 <NA> 5 5 Halictidae 6 6 <NA> 7 7 <NA> 8 8 Halictidae 9 9 <NA> 10 10 <NA> 11 11 Halictidae 12 12 <NA> 13 13 <NA> 14 14 Halictidae 15 15 <NA> 16 16 <NA> 17 17 Halictidae 18 18 <NA> 19 19 <NA> 20 20 Halictidae 
+4
source share

Another option is to use ?join in the plyr package

  library(plyr) #Adding family ahead of time was unnecessary so I'll remove it alongside the join. join(rawData, speciesInfo, by = "beeID", type = "left")[,-2] beeID family 1 1 Andrenidae 2 2 <NA> 3 3 <NA> 4 4 Andrenidae 5 5 <NA> 6 6 <NA> 7 7 Andrenidae 8 8 <NA> 9 9 <NA> 10 10 Andrenidae 11 11 <NA> 12 12 <NA> 13 13 Andrenidae 14 14 <NA> 15 15 <NA> 16 16 Andrenidae 17 17 <NA> 18 18 <NA> 19 19 Andrenidae 20 20 <NA> 

Update

 # If you anticipate adding new species over time, # simply rbind those into a single reference data.frame to merge with your rawData. # Like so: library(plyr) rawData <- join(rawData, rbind(Andrenidae, Halictidae), by = "beeID", type = "left") # To keep you code clean, you could do this step ahead of time species_list <- rbind(Andrenidae, Halictidae) rawData <- join(rawData, species_list, by = "beeID", type = "left") 
+3
source share

You can use the merge function, for example.

 rawData <- data.frame(beeID=c(1:20),family=rep(NA,20)) speciesInfo <- data.frame(beeID=seq(1,20,3), family=c(rep('Halictidae',4), rep("Andrenidae",3))) merged <- merge(x=rawData,y=speciesInfo,by='beeID',all.x=T,all.y=F) merged$family.x <- NULL # remove the family.x column names(merged) <- c('beeID','family') # rename the columns 

NB

There is no need to initialize rawData family column.
The merge function will add it automatically, for example.

 rawData <- data.frame(beeID=c(1:20)) speciesInfo <- data.frame(beeID=seq(1,20,3), family=c(rep('Halictidae',4), rep("Andrenidae",3))) merged <- merge(x=rawData,y=speciesInfo,by='beeID',all.x=T,all.y=F) > merged beeID family 1 1 Halictidae 2 2 <NA> 3 3 <NA> 4 4 Halictidae 5 5 <NA> 6 6 <NA> 7 7 Halictidae 8 8 <NA> 9 9 <NA> 10 10 Halictidae 11 11 <NA> 12 12 <NA> 13 13 Andrenidae 14 14 <NA> 15 15 <NA> 16 16 Andrenidae 17 17 <NA> 18 18 <NA> 19 19 Andrenidae 20 20 <NA> 
+2
source share

A data.table solution that will work efficiently with memory and time.

  • Note that you need to have stringsAsFactors = F for rbindlist (super-fast version do.call (rbind, list) / rbind)
  • I added another column as dummy data to the rawData object and the remote family.

Create data -

 rawData <- data.frame(beeID = c(1:20), other_stuff = sample(letters, 20), stringsAsFactors = F) Andrenidae <- data.frame(beeID = seq(1, 20, 3), family = rep("Andrenidae", 7), stringsAsFactors = F) Halictidae <- data.frame(beeID = seq(1, 20 , 3)+ 1, family = rep("Halictidae", 7), stringsAsFactors = F) library(data.table) # convert to data.table rawDT <- as.data.table(rawData) # combine the list of Species-specific data.frames into a large data.table speciesInfo <- rbindlist(list(Andrenidae, Halictidae)) # set the keys, to allow efficient use of data.table and its merging # abilities. The keys are the same for both setkeyv(rawDT, 'beeID') setkeyv(speciesInfo, 'beeID') # merge by key speciesInfo[rawDT, nomatch = NA] ## beeID family other_stuff ## 1: 1 Andrenidae s ## 2: 2 Halictidae x ## 3: 3 NA i ## 4: 4 Andrenidae e ## 5: 5 Halictidae v ## 6: 6 NA q ## 7: 7 Andrenidae w ## 8: 8 Halictidae c ## 9: 9 NA u ## 10: 10 Andrenidae z ## 11: 11 Halictidae y ## 12: 12 NA a ## 13: 13 Andrenidae l ## 14: 14 Halictidae r ## 15: 15 NA h ## 16: 16 Andrenidae o ## 17: 17 Halictidae n ## 18: 18 NA g ## 19: 19 Andrenidae p ## 20: 20 Halictidae m 

or

 rawDT[speciesInfo] ## beeID other_stuff family ## 1: 1 s Andrenidae ## 2: 2 x Halictidae ## 3: 4 e Andrenidae ## 4: 5 v Halictidae ## 5: 7 w Andrenidae ## 6: 8 c Halictidae ## 7: 10 z Andrenidae ## 8: 11 y Halictidae ## 9: 13 l Andrenidae ## 10: 14 r Halictidae ## 11: 16 o Andrenidae ## 12: 17 n Halictidae ## 13: 19 p Andrenidae ## 14: 20 m Halictidae 

What data are you interested in

+2
source share

All Articles