Strange number of subsequences?

I have a sequence object created as follows:

subsequences <- function(data){
  slmax <- max(data$time)
  sequences.seqe <- seqecreate(data)
  sequences.sts <- seqformat(data, from="SPELL", to="DSS", begin="time", end="end", id="id", status="event", limit=slmax)
  sequences.sts <- seqdef(sequences.sts, right = "DEL", left = "DEL", gaps = "DEL")
  (sequences.sts)
}

data <- subsequences(data)

head(data)

Which gives the result:

    Sequence                                                                     
[1] discussed-subscribed-*-discussed-*-discussed-*-discussed-*-discussed-*-closed
[2] *-opened-*-reviewed-*-discussed-*-discussed-*-discussed-*-merged             
[3] *-discussed-*-discussed-*-discussed-*-discussed                              
[4] *-opened-*-discussed-merged-discussed                                        
[5] *-discussed-*-referenced-discussed-closed-discussed-referenced-discussed     
[6] *-referenced-*-referenced-*-referenced-assigned-*-closed

But when I calculate the subsequences, I get seemingly ridiculous answers:

seqsubsn(head(data))
 [!] found missing state in the sequence(s), adding missing state to the alphabet
    Subseq.
[1]    1036
[2]    1248
[3]      88
[4]      49
[5]     294
[6]     240

How will the number of subsequences be much larger than the number of events in each sequence?

A 'dput ()' dataset can be found here . The problem is that the source data has timestamps in seconds. However, I used the function below to change the timestamps just to be consistent:

read_seqdata <- function(data, startdate, stopdate){
  data <- read.table(data, sep = ",", header = TRUE)
  data <- subset(data, select = c("pull_req_id", "action", "created_at"))
  colnames(data) <- c("id", "event", "time")
  data <- sqldf(paste0("SELECT * FROM data WHERE strftime('%Y-%m-%d', time,
    'unixepoch', 'localtime') >= '",startdate,"' AND strftime('%Y-%m-%d', time,
    'unixepoch', 'localtime') <= '",stopdate,"'"))
  data$end <- data$time
  data <- data[with(data, order(time)), ]
  data$time <- match( data$time , unique( data$time ) )
      data$end <- match( data$end , unique( data$end ) )
  slmax <- max(data$time)
  (data)
}

This allows you to create appropriate measures for entropy, sequence length, etc., but the number of subsequences is still problematic.

+3
1

. "", "".

$x = (x_1, x_2,..., x_3) $ $y $, $x_i $ $y $ , $y $, , A-B-A C-A-D-B-C-D-A-D.

, `mvad ' TraMineR.

library(TraMineR)
data(mvad)
mvad.scodes <- c("EM", "FE", "HE", "JL", "SC", "TR")
mvad.seq <- seqdef(mvad, 17:86, states = mvad.scodes)
print(mvad.seq[1:3,], format="SPS")

##    Sequence                      
##[1] (EM,4)-(TR,2)-(EM,64)         
##[2] (FE,36)-(HE,34)               
##[3] (TR,24)-(FE,34)-(EM,10)-(JL,2)

seqsubsn(mvad.seq)[1:3]

##[1]  7  4 16

seqsubsn (DSS). DSS , , EM-TR-EM. EM-TR-EM:

  • , : EM TR
  • : EM-TR, EM-EM, TR-EM
  • : EM-TR-EM

, , ( DSS)

*-opened-*-discussed-merged-discussed

49 , :

*-open, *-discussed, *-merged, opened-*, opened-discussed, opened-merged, discussed-merged, discussed-discussed, merged-discussed

,

+2

All Articles