Combining vectors of unequal length and non-unique values

Question

I would like to do the following:

Combine into a data frame two vectors that:

have different lengths
contain sequences that are also found in the other vector
contain sequences that are not found in the other vector
have sequences missing from the other vector that are never longer than three elements
always have the same first element

Equal sequences should be aligned across the two columns of the data frame, with NA in a column wherever its vector does not contain a sequence that is present in the other vector.

For instance:

    vector 1  vector 2          vector 1  vector 2
    1         1                 a         a
    2         2                 g         g
    3         3                 b         b
    4         1          or     h         a
    1         2                 a         g
    2         3                 g         b
    5         4                 c         h
              5                           c

should be combined into a data frame

    1         1                 a         a
    2         2                 g         g
    3         3                 b         b
    4         NA                h         NA
    1         1          or     a         a
    2         2                 g         g
    NA        3                 NA        b
    NA        4                 NA        h
    5         5                 c         c

I searched for examples of merging, combining, cbind, and plyr, but could not find a solution. I am afraid I will have to write a function with nested loops to solve this problem.

+4

r sequence dataframe missing-data

Dmitrii I. Dec 15 '12 at 20:18

2 answers

Note: this was posted as an answer to the first version of the question. The question has been changed since then, but in my opinion the problem is still not well defined.

Here is a solution that works with your integer example and will also work with numeric vectors. I also assume that:

both vectors contain the same number of sequences
a new sequence begins wherever value[i+1] <= value[i]
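
The second assumption can be tried on its own. The following is a minimal sketch using the numeric vector from the question; `split_runs` is a hypothetical helper name introduced only for illustration.

```r
# Hypothetical helper: split a numeric vector into runs, starting a new
# run wherever a value is not larger than the one before it.
split_runs <- function(v) {
  # diff(v) <= 0 flags positions where the next value fails to increase;
  # the leading TRUE opens the first run, and cumsum() numbers the runs.
  run_id <- cumsum(c(TRUE, diff(v) <= 0))
  split(v, run_id)
}

v1 <- c(1, 2, 3, 4, 1, 2, 5)
runs <- split_runs(v1)
print(runs)
```

For this vector the rule yields two runs, `1 2 3 4` and `1 2 5`. It only makes sense for values that can be ordered, which is why character vectors need a different way to mark where a sequence starts.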

If your vectors are not numeric or if one of my assumptions does not match your problem, you will have to clarify.

    v1 <- c(1,2,3,4,1,2,5)
    v2 <- c(1,2,3,1,2,3,4,5)

    v1.sequences <- split(v1, cumsum(c(TRUE, diff(v1) <= 0)))
    v2.sequences <- split(v2, cumsum(c(TRUE, diff(v2) <= 0)))

    align.fun <- function(s1, s2) { # aligns two sequences
        s12 <- sort(unique(c(s1, s2)))
        cbind(ifelse(s12 %in% s1, s12, NA),
              ifelse(s12 %in% s2, s12, NA))
    }

    # SIMPLIFY = FALSE keeps the result a list even when every pair of
    # sequences aligns to the same number of rows
    do.call(rbind, mapply(align.fun, v1.sequences, v2.sequences, SIMPLIFY = FALSE))

    #      [,1] [,2]
    # [1,]    1    1
    # [2,]    2    2
    # [3,]    3    3
    # [4,]    4   NA
    # [5,]    1    1
    # [6,]    2    2
    # [7,]   NA    3
    # [8,]   NA    4
    # [9,]    5    5

+6

flodel Dec 15 '12 at 21:47

flodel · Accepted Answer · 2012-12-16T04:15:21+0000

I claim that your problem can be solved in terms of the shortest common supersequence. This assumes that each of the two vectors represents one sequence. Please give the code below a try.

If it still does not solve your problem, you will need to explain exactly what you mean by "my vector contains not one, but many sequences": define what you mean by a sequence and tell us how sequences can be identified by scanning through the two vectors.

Part I: Given two sequences, find the longest common subsequence

    LongestCommonSubsequence <- function(X, Y) {
        m <- length(X)
        n <- length(Y)
        # C[i + 1, j + 1] holds the LCS length of X[1:i] and Y[1:j]
        C <- matrix(0, 1 + m, 1 + n)
        for (i in seq_len(m)) {
            for (j in seq_len(n)) {
                if (X[i] == Y[j]) {
                    C[i + 1, j + 1] <- C[i, j] + 1
                } else {
                    C[i + 1, j + 1] <- max(C[i + 1, j], C[i, j + 1])
                }
            }
        }

        # Walk back through C to recover the matched elements and their
        # positions I (in X) and J (in Y)
        backtrack <- function(C, X, Y, i, j) {
            if (i == 1 || j == 1) {
                return(data.frame(I = c(), J = c(), LCS = c()))
            } else if (X[i - 1] == Y[j - 1]) {
                return(rbind(backtrack(C, X, Y, i - 1, j - 1),
                             data.frame(LCS = X[i - 1], I = i - 1, J = j - 1)))
            } else if (C[i, j - 1] > C[i - 1, j]) {
                return(backtrack(C, X, Y, i, j - 1))
            } else {
                return(backtrack(C, X, Y, i - 1, j))
            }
        }

        return(backtrack(C, X, Y, m + 1, n + 1))
    }
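
To see what the table `C` holds, here is a minimal sketch that fills the same table and returns only the length of the longest common subsequence, without the backtracking step. The name `lcs_length` is hypothetical and introduced only for illustration; the sketch assumes neither vector contains NA.

```r
# Hypothetical helper: length of the longest common subsequence of X and Y.
lcs_length <- function(X, Y) {
  m <- length(X)
  n <- length(Y)
  # C[i + 1, j + 1] is the LCS length of X[1:i] and Y[1:j];
  # row 1 and column 1 stand for empty prefixes and stay 0.
  C <- matrix(0, 1 + m, 1 + n)
  for (i in seq_len(m)) {
    for (j in seq_len(n)) {
      if (X[i] == Y[j]) {
        # Matching elements extend the LCS of the two shorter prefixes.
        C[i + 1, j + 1] <- C[i, j] + 1
      } else {
        # Otherwise carry over the better of dropping X[i] or Y[j].
        C[i + 1, j + 1] <- max(C[i + 1, j], C[i, j + 1])
      }
    }
  }
  C[m + 1, n + 1]
}

X <- c("a", "g", "b", "h", "a", "g", "c")
Y <- c("a", "g", "b", "a", "g", "b", "h", "c")
len <- lcs_length(X, Y)
print(len)
```

For the letter vectors from the question the length is 6, so a shortest common supersequence has `length(X) + length(Y) - len`, that is 9, elements.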

Part II: Given two sequences, find the shortest common supersequence

    ShortestCommonSupersequence <- function(X, Y) {
        # Positions of the matched elements in X (I) and in Y (J)
        LCS <- LongestCommonSubsequence(X, Y)[c("I", "J")]
        X.df <- data.frame(X = X, I = seq_along(X), stringsAsFactors = FALSE)
        Y.df <- data.frame(Y = Y, J = seq_along(Y), stringsAsFactors = FALSE)
        # Outer joins keep the unmatched elements of both vectors
        ALL <- merge(LCS, X.df, by = "I", all = TRUE)
        ALL <- merge(ALL, Y.df, by = "J", all = TRUE)
        # Order rows by the larger of the two positions, treating a
        # missing index as 0
        ALL <- ALL[order(pmax(ifelse(is.na(ALL$I), 0, ALL$I),
                              ifelse(is.na(ALL$J), 0, ALL$J))), ]
        ALL$SCS <- ifelse(is.na(ALL$X), ALL$Y, ALL$X)
        ALL
    }

Your example:

    ShortestCommonSupersequence(X = c("a","g","b","h","a","g","c"),
                                Y = c("a","g","b","a","g","b","h","c"))

    #    J  I    X    Y SCS
    # 1  1  1    a    a   a
    # 2  2  2    g    g   g
    # 3  3  3    b    b   b
    # 9 NA  4    h <NA>   h
    # 4  4  5    a    a   a
    # 5  5  6    g    g   g
    # 6  6 NA <NA>    b   b
    # 7  7 NA <NA>    h   h
    # 8  8  7    c    c   c

(where the two updated vectors are in columns X and Y)
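
If only the two aligned columns are wanted, the same idea can be sketched in a single self-contained function. This is not the code from the answer: it is a variant that walks back through the table with a loop instead of recursion and `merge`, and the name `align_vectors` is hypothetical. It assumes neither vector contains NA.

```r
# Hypothetical helper: align X and Y along a longest common subsequence,
# padding with NA where one vector has no matching element.
align_vectors <- function(X, Y) {
  m <- length(X)
  n <- length(Y)
  # Same dynamic-programming table as in Part I.
  C <- matrix(0, m + 1, n + 1)
  for (i in seq_len(m)) {
    for (j in seq_len(n)) {
      C[i + 1, j + 1] <- if (X[i] == Y[j]) {
        C[i, j] + 1
      } else {
        max(C[i + 1, j], C[i, j + 1])
      }
    }
  }
  # Walk back from the last elements, prepending one output row per step.
  x_out <- X[0]  # empty vector of the same type as X
  y_out <- Y[0]
  i <- m
  j <- n
  while (i > 0 || j > 0) {
    if (i > 0 && j > 0 && X[i] == Y[j]) {
      # Matched pair: both columns get the element.
      x_out <- c(X[i], x_out)
      y_out <- c(Y[j], y_out)
      i <- i - 1
      j <- j - 1
    } else if (j > 0 && (i == 0 || C[i + 1, j] > C[i, j + 1])) {
      # Element only in Y: the X column gets NA.
      x_out <- c(NA, x_out)
      y_out <- c(Y[j], y_out)
      j <- j - 1
    } else {
      # Element only in X: the Y column gets NA.
      x_out <- c(X[i], x_out)
      y_out <- c(NA, y_out)
      i <- i - 1
    }
  }
  data.frame(X = x_out, Y = y_out, stringsAsFactors = FALSE)
}

aligned <- align_vectors(c("a", "g", "b", "h", "a", "g", "c"),
                         c("a", "g", "b", "a", "g", "b", "h", "c"))
print(aligned)
```

Ties in the table are broken the same way as in the `backtrack` function above, so for the question's letter example the rows come out in the order the question asks for. Prepending to a vector on every step is quadratic in the output length, which is acceptable for short vectors but would need preallocation for long ones.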
