Extract hashtags in multiple tweets using R

Question

I want to extract the hashtags from a collection of tweets in R. For example:

    [[1]]
    [1] "RddzAlejandra: RT @NiallOfficial: What a day for @johnJoeNevin ! Sooo proud t have been there to see him at #London2012 and here in mgar #MullingarShuffle"

    [[2]]
    [1] "BPOInsight: RT @atos: Atos completes delivery of key IT systems for London 2012 Olympic Games http://t.co/Modkyo2R #london2012"

    [[3]]
    [1] "BloombergWest: The #Olympics sets a ratings record for #NBC, with 219M viewers tuning in. http://t.co/scGzIXBp #london2012 #tech"

How can I parse this to get a list of the hashtags in all of the tweets? The solutions I have tried so far only return the hashtags in the first tweet, and I get these error messages when I run the code:

    > string <-"MonicaSarkar: RT @saultracey: Sun kissed #olmpicrings at #towerbridge #london2012 @ Tower Bridge http://t.co/wgIutHUl"
    >
    > [[2]]
    Error: unexpected '[[' in "[["
    > [1] "ccrews467: RT @BBCNews: England manager Roy Hodgson calls #London2012 a \"wake-up call\": footballers and fans should emulate spirit of #Olympics http://t.co/wLD2VA1K"
    Error: unexpected '[' in "["
    > hashtag.regex <- perl("(?<=^|\\s)#\\S+")
    > hashtags <- str_extract_all(string, hashtag.regex)
    > print(hashtags)
    [[1]]
    [1] "#olmpicrings" "#towerbridge" "#london2012"

r

Adedoyin-olowe mariam Aug 14 '12 at 19:01

2 answers

Sacha epskamp · Answer 1 · 2012-08-14T20:45:15+0000

Using regmatches and gregexpr, you get a list with the hashtags of each tweet, assuming a hashtag is a # followed by any number of letters or digits (I'm not familiar with Twitter):

    foo <- c(
      "RddzAlejandra: RT @NiallOfficial: What a day for @johnJoeNevin ! Sooo proud t have been there to see him at #London2012 and here in mgar #MullingarShuffle",
      "BPOInsight: RT @atos: Atos completes delivery of key IT systems for London 2012 Olympic Games http://t.co/Modkyo2R #london2012",
      "BloombergWest: The #Olympics sets a ratings record for #NBC, with 219M viewers tuning in. http://t.co/scGzIXBp #london2012 #tech"
    )
    regmatches(foo, gregexpr("#(\\d|\\w)+", foo))

This returns:

    [[1]]
    [1] "#London2012"       "#MullingarShuffle"

    [[2]]
    [1] "#london2012"

    [[3]]
    [1] "#Olympics"   "#NBC"        "#london2012" "#tech"
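If you want the hashtag words across all tweets rather than one vector per tweet, the per-tweet list can be flattened and tabulated. This is only a minimal sketch in base R: the tweets are shortened versions of the ones in the question, the names tweets, per_tweet, all_tags and tag_counts are made up for illustration, and lower-casing the tags is my own assumption about how you want #London2012 and #london2012 treated.

```r
# Example tweets, shortened from the ones in the question
tweets <- c(
  "What a day! #London2012 and here in mgar #MullingarShuffle",
  "Atos completes delivery of key IT systems http://t.co/Modkyo2R #london2012",
  "The #Olympics sets a ratings record for #NBC, with 219M viewers. #london2012 #tech"
)

# One character vector of hashtags per tweet
per_tweet <- regmatches(tweets, gregexpr("#\\w+", tweets))

# Flatten into a single vector; lower-case so that #London2012
# and #london2012 count as the same tag (an assumption, drop if unwanted)
all_tags <- tolower(unlist(per_tweet))

# Count how often each hashtag occurs, most frequent first
tag_counts <- sort(table(all_tags), decreasing = TRUE)
print(tag_counts)
```

Here \\w already covers digits, so "#\\w+" should match the same strings as "#(\\d|\\w)+" above.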

Justin · Answer 2 · 2012-08-14T19:11:40+0000

What about a version using strsplit and grep:

    > lapply(strsplit(x, ' '), function(w) grep('#', w, value=TRUE))
    [[1]]
    [1] "#London2012"       "#MullingarShuffle"

    [[2]]
    [1] "#london2012"

    [[3]]
    [1] "#Olympics"   "#NBC,"       "#london2012" "#tech"

I could not figure out how to return multiple results from each row without first splitting, but I'm sure there is a way!
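Note that splitting on spaces keeps trailing punctuation, as in "#NBC," above. One possible way to tidy that up is sketched below. It assumes the tweets are stored as a list of strings, as the printed output in the question suggests; x here is a shortened stand-in for that list, and extract_tags is a hypothetical helper name. Unlike the one-liner above, it only keeps words that start with #.

```r
# Assumption: tweets are stored as a list of strings, as in the question
x <- list(
  "The #Olympics sets a ratings record for #NBC, with 219M viewers. #london2012 #tech",
  "Atos completes delivery of key IT systems #london2012"
)

# Hypothetical helper: split one tweet into words and keep the hashtags
extract_tags <- function(tweet) {
  words <- strsplit(tweet, " ", fixed = TRUE)[[1]]
  tags <- grep("^#", words, value = TRUE)  # words that start with '#'
  sub("[[:punct:]]+$", "", tags)           # drop trailing punctuation
}

hashtags <- lapply(x, extract_tags)
print(hashtags)
```

Because lapply walks over the list one tweet at a time, there is no need to paste the printed [[1]], [[2]] output back into the console, which is probably where the "unexpected '[['" errors in the question come from.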
