My question in a nutshell: Does anyone know of a TwitterAnalyzer or TwitterTokenizer for Lucene?
More detailed version:
I want to index a number of tweets in Lucene and keep terms such as @user and #hashtag intact. StandardTokenizer does not work for this because it strips punctuation (although it does other useful things, such as keeping domain names and email addresses together and recognizing abbreviations). How can I get an analyzer that does everything StandardTokenizer does, but leaves terms like @user and #hashtag untouched?
My current solution is to pre-process the tweet text before passing it to the analyzer, replacing these characters with alphanumeric strings. For example,
// tweetText holds the raw tweet (variable name assumed here)
String newText = tweetText.replaceAll("#", "hashtag");
newText = newText.replaceAll("@", "addresstag");
Unfortunately, this method mangles legitimate email addresses, because the @ inside them is replaced as well, but I can live with that. Does this approach make sense?
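To make the idea concrete, here is a minimal sketch of a more careful variant of the same pre-processing, using only java.util.regex and no Lucene classes. It rewrites # and @ only when they start a token, so the @ in an email address is left alone. The class name TweetPreprocessor, the method name protect, and the exact patterns are my own assumptions for illustration, and I have not checked how the result interacts with any particular Lucene version.

```java
import java.util.regex.Pattern;

public class TweetPreprocessor {
    // Placeholder strings taken from the snippet above; any alphanumeric
    // marker that is unlikely to occur in real text would do.
    private static final String HASH_MARK = "hashtag";
    private static final String AT_MARK = "addresstag";

    // '#' only when it is not preceded by a word character
    // and is followed by one (so "a#b" is left alone).
    private static final Pattern HASHTAG = Pattern.compile("(?<!\\w)#(?=\\w)");

    // '@' only when it is not preceded by a character that can end the
    // local part of an email address, and is followed by a word character.
    private static final Pattern MENTION = Pattern.compile("(?<![\\w.+-])@(?=\\w)");

    /** Replaces leading '#' and '@' markers with alphanumeric placeholders. */
    public static String protect(String tweet) {
        String out = HASHTAG.matcher(tweet).replaceAll(HASH_MARK);
        return MENTION.matcher(out).replaceAll(AT_MARK);
    }

    public static void main(String[] args) {
        // prints: addresstaguser see hashtaglucene, mail me at joe@example.com
        System.out.println(protect("@user see #lucene, mail me at joe@example.com"));
    }
}
```

This still has the weakness of the original approach: once indexed, a term like hashtaglucene cannot be told apart from a literal word spelled that way, and query text has to go through the same rewriting.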
Thanks in advance!
AMAC
tokenize twitter lucene
Ruggiero spearman