Subset data / data extraction based on the first 7 letters

I have a huge dataset with genotypic information from different populations. I would like to sort the data by population, but I do not know how to do it.

I want to sort by "pedigree_dhl". I used the following code, but I kept getting error messages.

newdata <- project[pedigree_dhl == CCB133$*1,  ]

Another problem is that the "pedigree_dhl" column contains the names of the individual genotypes. Only the first 7 letters of each value in that column are the name of the population. In this example: CCB133. How can I tell R that I want to extract all rows whose pedigree_dhl starts with CCB133?

  Allele1 Allele2      SNP_name gs_entry pedigree_dhl
1       T       T ZM011407_0151      656    CCB133$*1
2       T       T ZM009374_0354      656    CCB133$*1
3       C       C ZM003499_0591      656    CCB133$*1
4       A       A ZM003898_0594      656    CCB133$*1
5       C       C ZM004887_0313      656    CCB133$*1
6       G       G ZM000583_1096      656    CCB133$*1
1 answer

You can do this with grep-style matching and a regular expression in R. For example:

df <- read.table(text="  Allele1 Allele2      SNP_name gs_entry pedigree_dhl
1       T       T ZM011407_0151      656    CCB133$*1
2       T       T ZM009374_0354      656    CCB133$*1
3       C       C ZM003499_0591      656    CCB133$*1
4       A       A ZM003898_0594      656    CCB133$*1
5       C       C ZM004887_0313      656    CCB133$*1
6       G       G ZM000583_1096      656    CCB133$*1", header=TRUE)

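# Put into df1 all rows where pedigree_dhl starts with CCB133$.
# "^" anchors the match at the start of the string. "$" is a regex
# metacharacter (end of string), so it is escaped to match literally.
p1 <- '^CCB133\\$'
df1 <- subset(df, grepl(p1, pedigree_dhl))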
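
If regex metacharacters such as "$" get in the way, the same prefix match can be sketched without a regular expression. This assumes R 3.3.0 or later (for startsWith()) and that pedigree_dhl is a character column rather than a factor; the small data frame below, including the second population CCB134$, is made up for illustration.

```r
# Toy data shaped like the question's table; "CCB134$" is a hypothetical
# second population added only so the filter has something to exclude.
df <- data.frame(
  SNP_name     = c("ZM011407_0151", "ZM009374_0354", "ZM003499_0591"),
  pedigree_dhl = c("CCB133$*1", "CCB133$*2", "CCB134$*1"),
  stringsAsFactors = FALSE
)

pop <- "CCB133$"

# startsWith() compares literal characters, so "$" needs no escaping
by_prefix <- df[startsWith(df$pedigree_dhl, pop), ]

# Equivalent: compare the first nchar(pop) characters directly
by_substr <- df[substr(df$pedigree_dhl, 1, nchar(pop)) == pop, ]
```

If the column was read in as a factor, wrapping it in as.character() before calling startsWith() should be enough.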

# If you want to create a new column based
# on the first seven letters of SNP_name (or any other variable)

df$SNP_7 <- substr(df$SNP_name, start=1, stop=7)

# If you want to order by pedigree_dhl
# then you don't need to select out the rows into a new dataframe

df <- df[order(df$pedigree_dhl), ]
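
Since the full dataset holds several populations, it may be convenient to split it into one data frame per population in a single step. The following is a minimal sketch, assuming the population code is always the first 7 characters of pedigree_dhl; the column name population and the example values are hypothetical.

```r
# Toy data with two made-up populations
df <- data.frame(
  Allele1      = c("T", "C", "A", "G"),
  pedigree_dhl = c("CCB133$*1", "CCB134$*1", "CCB133$*2", "CCB134$*2"),
  stringsAsFactors = FALSE
)

# Population code = first 7 characters of pedigree_dhl (assumed fixed width)
df$population <- substr(df$pedigree_dhl, 1, 7)

# One data frame per population, returned as a named list
by_pop <- split(df, df$population)

names(by_pop)         # population codes found in the data
by_pop[["CCB133$"]]   # rows belonging to one population
```

Each element of by_pop can then be analysed or written out separately, for example with lapply().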
