Tokenizing Twitter Posts in Lucene

My question in a nutshell: Does anyone know of a TwitterAnalyzer or TwitterTokenizer for Lucene?

More detailed version:

I want to index tweets in Lucene and keep terms like @user and #hashtag intact. StandardTokenizer does not work because it discards the punctuation (although it does other useful things, such as keeping domain names and email addresses together and recognizing acronyms). How can I get an analyzer that does everything StandardTokenizer does, but does not touch terms like @user and #hashtag?

My current solution is to pre-process the tweet text before feeding it to the analyzer, replacing those characters with other alphanumeric strings. For example:

String newText = tweetText.replaceAll("#", "hashtag");
newText = newText.replaceAll("@", "addresstag");

Unfortunately, this method also mangles legitimate email addresses, but I can live with that. Does this approach make sense?
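A refinement I am considering (just an untested sketch): anchor the replacement so that # and @ are only rewritten when they begin a whitespace-delimited token, which leaves addresses like user@example.com alone:

    // Rewrite # and @ only at the start of a token, so "user@example.com"
    // keeps its @ while "@user" still becomes "addresstaguser".
    String newText = tweetText.replaceAll("(^|\\s)#", "$1hashtag")
                              .replaceAll("(^|\\s)@", "$1addresstag");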

Thanks in advance!

AMAC

6 answers

StandardAnalyzer basically passes your input through StandardTokenizer, then StandardFilter (which strips various kinds of characters from the tokens, e.g. the possessive 's at the ends of words), then LowerCaseFilter (to lowercase the words), and finally StopFilter. That last one removes insignificant words like "as", "in", "for", etc.

What you could easily do for starters is implement your own analyzer that does the same thing as StandardAnalyzer but uses WhitespaceTokenizer as the first component processing the input stream. Because whitespace tokenization leaves @ and # attached to the words that follow them, @user and #hashtag survive as single tokens.
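A minimal sketch of such an analyzer, written against the Lucene 7.x API (package names and constructors move around between major versions, so treat this as illustrative rather than definitive):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;

    public class TwitterAnalyzer extends Analyzer {
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            // Split on whitespace only, so "@user" and "#hashtag" remain whole tokens.
            Tokenizer source = new WhitespaceTokenizer();
            // Reproduce the rest of StandardAnalyzer's chain: lowercasing and stop words.
            TokenStream stream = new LowerCaseFilter(source);
            stream = new StopFilter(stream, EnglishAnalyzer.ENGLISH_STOP_WORDS_SET);
            return new TokenStreamComponents(source, stream);
        }
    }

An instance of this can then be handed to IndexWriterConfig in place of StandardAnalyzer. The trade-off: WhitespaceTokenizer does not strip trailing punctuation, so a token like "great!" keeps its exclamation mark.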

For more information about the inner workings of analyzers, see here.


A tutorial on a Twitter-specific tokenizer, which is a modified version of the ark-tweet-nlp API, can be found at http://preciselyconcise.com/apis_and_installations/tweet_pos_tagger.php. This API is able to identify emoticons, hashtags, interjections, etc. present in tweets.


The Twitter API can be told to return all Tweets, Bios, etc. with the "entities" (hashtags, user IDs, URLs, etc.) already parsed out of the content into collections.

https://dev.twitter.com/docs/entities
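For example, a hashtag arrives already extracted in the JSON payload. A rough sketch of reading one with the org.json library (the parser is my choice for illustration; the payload shape follows the v1.1 entities documentation):

    import org.json.JSONArray;
    import org.json.JSONObject;

    public class EntitiesDemo {
        public static void main(String[] args) {
            // Abbreviated v1.1 Tweet payload: the hashtag is already parsed
            // out into the "entities" collection, no tokenizing required.
            String payload = "{\"text\": \"Indexing with #lucene\","
                    + " \"entities\": {\"hashtags\": [{\"text\": \"lucene\", \"indices\": [14, 21]}]}}";

            JSONArray hashtags = new JSONObject(payload)
                    .getJSONObject("entities")
                    .getJSONArray("hashtags");
            for (int i = 0; i < hashtags.length(); i++) {
                System.out.println("#" + hashtags.getJSONObject(i).getString("text"));
            }
        }
    }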

So it seems like you are just looking for a way to re-do what people at Twitter have already done for you?


Twitter has open-sourced its text-processing library, which implements token handlers for hashtags and the like.

e.g. HashtagExtractor https://github.com/twitter/commons/blob/master/src/java/com/twitter/common/text/extractor/HashtagExtractor.java

It is built on Lucene's TokenStream.


It's cleaner to use a custom tokenizer that handles Twitter usernames. I made one here: https://github.com/wetneb/lucene-twitter

This tokenizer recognizes Twitter usernames and hashtags, and a companion filter can be used to lowercase them (given that they are case-insensitive):

    <fieldType name="text_twitter" class="solr.TextField" positionIncrementGap="100" multiValued="true">
      <analyzer type="index">
        <tokenizer class="org.opentapioca.analysis.twitter.TwitterTokenizerFactory" />
        <filter class="org.opentapioca.analysis.twitter.TwitterLowercaseFilterFactory" />
      </analyzer>
      <analyzer type="query">
        <tokenizer class="org.opentapioca.analysis.twitter.TwitterTokenizerFactory" />
        <filter class="org.opentapioca.analysis.twitter.TwitterLowercaseFilterFactory" />
      </analyzer>
    </fieldType>