Get a unique string from a vector of similar strings

Question


I am not quite sure how to phrase this question. I just started working with a bunch of tweets. After some basic cleaning, some of them look like this:

x <- c("stackoverflow is a great site",
"stackoverflow is a great si",
"stackoverflow is a great",
"omg it is friday and so sunny",
"omg it is friday and so",
"arggh how annoying")

Basically, I want to remove repetitions by checking whether the beginnings of the lines match, and keep only the longest string of each group. In this case, my result should be:

[1] "stackoverflow is a great site"
[2] "omg it is friday and so sunny"
[3] "arggh how annoying"

because all the others are truncated repetitions of the strings above. I tried unique(), but it does not return the result I want, because it compares the strings over their entire length. Any pointers, please?

I am using R version 3.1.1 on Mac OSX 10.7 ...

Thanks!

+4

string r unique

maryam Aug 22 '14 at 12:41


3 Answers

library(stringr)
x[!sapply(seq_along(x), function(i) any(str_detect(x[-i], x[i])))]
[1] "stackoverflow is a great site" "omg it is friday and so sunny" "arggh how annoying"
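One caveat with the one-liner above: str_detect() interprets x[i] as a regular expression and matches it anywhere in the other strings, not only at the start. A minimal base-R sketch of a prefix-only check follows, assuming that "truncated" means "is a proper prefix of a longer string". The helper name is_truncated is hypothetical and not part of any package.

```r
x <- c("stackoverflow is a great site",
       "stackoverflow is a great si",
       "stackoverflow is a great",
       "omg it is friday and so sunny",
       "omg it is friday and so",
       "arggh how annoying")

# Hypothetical helper: TRUE if `s` is a proper prefix of any element of `others`.
is_truncated <- function(s, others) {
  # Compare only the first nchar(s) characters of each longer string with s.
  any(nchar(others) > nchar(s) & substr(others, 1, nchar(s)) == s)
}

# Keep every string that is not a truncated copy of another one.
keep <- !vapply(seq_along(x), function(i) is_truncated(x[i], x[-i]), logical(1))
result <- x[keep]
print(result)
```

Because the comparison uses == on plain substrings, characters such as ?, ( or . in a tweet are taken literally. Exact duplicates are not removed by this check, so calling unique(x) first may be useful.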


0

tonytonov Aug 22 '14 at 12:52

@Tonytonov's solution is good, but I recommend using the stringi package :)

library(stringi)
library(stringr)
library(microbenchmark)

# Fixed-string matching with stringi
stringi <- function(x){
  x[!sapply(seq_along(x), function(i) any(stri_detect_fixed(x[-i], x[i])))]
}

# Regex matching with stringr
stringr <- function(x){
  x[!sapply(seq_along(x), function(i) any(str_detect(x[-i], x[i])))]
}

microbenchmark(stringi(x), stringr(x))
Unit: microseconds
       expr     min       lq   median       uq      max neval
 stringi(x)  52.482  58.1760  64.3275  71.9630  120.374   100
 stringr(x) 538.482 551.0485 564.3445 602.7095 1736.601   100

0

bartektartanus Aug 23 '14 at 22:20


Matthew Plourde · Accepted Answer · 2014-08-22T12:56:47+0000


x <- c("stackoverflow is a great site",
"stackoverflow is a great si",
"stackoverflow is a great",
"stackoverflow is an OK site",
"omg it is friday and so sunny",
"omg it is friday and so",
"arggh how annoying")

Filter(function(y) {
    x2 <- sapply(setdiff(x, y), substr, start=1, stop=nchar(y))
    ! duplicated(c(y, x2), fromLast=TRUE)[1]
}, x)


# [1] "stackoverflow is a great site" "stackoverflow is an OK site"   "omg it is friday and so sunny"
# [4] "arggh how annoying"
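To see how the Filter() predicate above decides, it can help to run the same logic by hand for individual candidates. The sketch below is only an illustration of that logic. The wrapper name is_truncation is hypothetical and does not appear in the answer.

```r
x <- c("stackoverflow is a great site",
       "stackoverflow is a great si",
       "stackoverflow is a great",
       "stackoverflow is an OK site",
       "omg it is friday and so sunny",
       "omg it is friday and so",
       "arggh how annoying")

# Hypothetical wrapper around the predicate's logic for a single candidate y.
is_truncation <- function(y, x) {
  # Cut every other string down to the length of y ...
  cut <- substr(setdiff(x, y), 1, nchar(y))
  # ... then ask whether y (in position 1) reappears among the cut strings.
  duplicated(c(y, cut), fromLast = TRUE)[1]
}

short <- is_truncation("stackoverflow is a great", x)     # prefix of a longer tweet
full  <- is_truncation("stackoverflow is an OK site", x)  # not a prefix of any other tweet
print(c(short = short, full = full))
```

Filter() keeps the strings for which this check is FALSE. Note that setdiff() drops every copy of y, so exact duplicates would all survive the filter; applying unique(x) beforehand is one way to handle that case.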
