I have a sequence object created as follows:
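    library(TraMineR)

    subsequences <- function(data){
        # Latest time index, used as the upper limit for the conversion
        slmax <- max(data$time)
        # Event-sequence object (not used further below)
        sequences.seqe <- seqecreate(data)
        # Convert from SPELL format to distinct-successive-state (DSS) format
        sequences.sts <- seqformat(data, from="SPELL", to="DSS", begin="time", end="end", id="id", status="event", limit=slmax)
        # Build the state-sequence object with left, right and gap missing values set to "DEL"
        sequences.sts <- seqdef(sequences.sts, right = "DEL", left = "DEL", gaps = "DEL")
        sequences.sts
    }

    data <- subsequences(data)
    head(data)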
This gives the following result:
        Sequence
    [1] discussed-subscribed-*-discussed-*-discussed-*-discussed-*-discussed-*-closed
    [2] *-opened-*-reviewed-*-discussed-*-discussed-*-discussed-*-merged
    [3] *-discussed-*-discussed-*-discussed-*-discussed
    [4] *-opened-*-discussed-merged-discussed
    [5] *-discussed-*-referenced-discussed-closed-discussed-referenced-discussed
    [6] *-referenced-*-referenced-*-referenced-assigned-*-closed
But when I count the subsequences, I get numbers that seem implausibly large:
    seqsubsn(head(data))
    [!] found missing state in the sequence(s), adding missing state to the alphabet
        Subseq.
    [1]    1036
    [2]    1248
    [3]      88
    [4]      49
    [5]     294
    [6]     240
How can the number of subsequences be so much larger than the number of events in each sequence?
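For context, one way counts of this size can arise is if every distinct subsequence is counted, that is, every selection of elements kept in their original order, including the empty one. The following is a minimal base-R sketch under that assumption. `count_distinct_subseq` is a hypothetical helper written for illustration, not a TraMineR function, and the `*` (missing) state is treated as an ordinary symbol.

```r
# Hypothetical helper (not part of TraMineR): counts the distinct
# subsequences of a vector, including the empty subsequence.
count_distinct_subseq <- function(x) {
  total <- 1            # only the empty subsequence so far
  last <- numeric(0)    # per symbol: the total just before its previous occurrence
  for (s in as.character(x)) {
    previous <- total
    total <- 2 * total  # each known subsequence, with and without s appended
    if (s %in% names(last)) {
      # remove subsequences already produced by the earlier occurrence of s
      total <- total - last[[s]]
    }
    last[s] <- previous
  }
  total
}

# Rows [4] and [3] of the printed sequences above, typed in by hand
seq4 <- c("*", "opened", "*", "discussed", "merged", "discussed")
seq3 <- rep(c("*", "discussed"), 4)

count_distinct_subseq(seq4)  # 49
count_distinct_subseq(seq3)  # 88
```

Under this definition the count can roughly double with each added element, so a sequence of only six elements already yields 49, which matches the values reported for rows [4] and [3].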
A dput() of the dataset can be found here. One complication is that the source data has timestamps in seconds, so I used the function below to replace them with consecutive integer indices:
    library(sqldf)

    read_seqdata <- function(data, startdate, stopdate){
        data <- read.table(data, sep = ",", header = TRUE)
        data <- subset(data, select = c("pull_req_id", "action", "created_at"))
        colnames(data) <- c("id", "event", "time")
        # Keep only events whose local date lies between startdate and stopdate
        data <- sqldf(paste0(
            "SELECT * FROM data ",
            "WHERE strftime('%Y-%m-%d', time, 'unixepoch', 'localtime') >= '", startdate, "' ",
            "AND strftime('%Y-%m-%d', time, 'unixepoch', 'localtime') <= '", stopdate, "'"))
        # Events have no duration here: end equals start
        data$end <- data$time
        data <- data[with(data, order(time)), ]
        # Replace timestamps in seconds with consecutive integer indices
        data$time <- match(data$time, unique(data$time))
        data$end <- match(data$end, unique(data$end))
        data
    }
With this re-indexing, measures such as entropy and sequence length come out sensible, but the number of subsequences still looks wrong.
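To make the re-indexing step in read_seqdata concrete, here is a small base-R sketch on made-up timestamps (the values are hypothetical, not taken from the dataset). It shows that match() against unique() turns sorted timestamps into consecutive integers, with tied timestamps sharing an index.

```r
# Hypothetical timestamps in seconds, already sorted as in read_seqdata
time <- c(1357000000, 1357000000, 1357000450, 1357090000)

# Position of each timestamp among the distinct values
idx <- match(time, unique(time))
idx  # 1 1 2 3
```

Note that the real function applies this over the whole data frame, so the indices are shared across all ids rather than restarted for each one, and a gap of any length in seconds becomes a step of one.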