How to get a Token from a Lucene TokenStream?

Question

I am trying to use Apache Lucene for tokenization, and I am confused about the process of getting tokens from a TokenStream.

The worst part is that I am looking at the comments in the JavaDocs that address my question:

http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/analysis/TokenStream.html#incrementToken%28%29

Somehow, an AttributeSource is supposed to be used rather than Tokens. I am completely at a loss.

Can someone explain how to get token-like information from a TokenStream?

+63

java tokenize attributes lucene token

Eric Wilson Apr 14 '10 at 14:30

3 answers

Here's how it should be done (a cleaned-up version of Adam's answer):

    TokenStream stream = analyzer.tokenStream(null, new StringReader(text));
    CharTermAttribute cattr = stream.addAttribute(CharTermAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {
        System.out.println(cattr.toString());
    }
    stream.end();
    stream.close();

+31

yegor256 Sep 23

The OP's question contains two variations:

1. What is "the process of getting tokens with TokenStream"?
2. "Can someone explain how to get token-like information from TokenStream?"

Recent versions of the Lucene documentation for Token say (emphasis mine):

NOTE: As of 2.9 ... it is not necessary to use Token anymore, with the new TokenStream API it can be used as convenience class that implements all Attributes, which is especially useful to easily switch from the old to the new TokenStream API.

And the TokenStream documentation says of its API:

... has moved from being Token-based to Attribute-based ... the preferred way to store the information of a Token is to use AttributeImpls.

The other answers to this question cover #2 above: how to get token-like information from a TokenStream in the "new", recommended way, using attributes. According to the documentation, the Lucene developers made this change in part to reduce the number of individual objects created at a time.

But, as some people pointed out in the comments on those answers, they do not directly answer #1: how do you get a Token if you really want or need that type?

With the same API change that makes TokenStream an AttributeSource, Token now implements Attribute and can be used with TokenStream.addAttribute, just as the other answers show for CharTermAttribute and OffsetAttribute. So they really did answer that part of the original question; they simply did not show it.

It is important to note that, although this approach lets you access a Token while looping, it is still only a single object, no matter how many logical tokens are in the stream. Each incrementToken() call changes the state of the Token returned from addAttribute. So if your goal is to build a collection of distinct Token objects for use outside the loop, you will need to do extra work to make a new Token object as a (deep?) copy on each iteration.
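The reuse pitfall described above can be sketched without Lucene at all. The following is a toy model, not Lucene code: `ReusedToken` and `FakeStream` are hypothetical stand-ins for the single object returned by addAttribute and for a stream whose incrementToken() overwrites that object. Under those assumptions, it shows that storing the shared reference leaves you with the last token repeated, while copying the state on each iteration keeps every token.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenReuseSketch {

    // Hypothetical stand-in for the one attribute object a stream hands out.
    public static final class ReusedToken {
        public String term;
        public int startOffset;
        public int endOffset;

        // Copy the current state into a fresh, independent object.
        public ReusedToken copy() {
            ReusedToken c = new ReusedToken();
            c.term = term;
            c.startOffset = startOffset;
            c.endOffset = endOffset;
            return c;
        }
    }

    // Hypothetical stand-in for a TokenStream: splits on spaces and
    // overwrites the same ReusedToken on every incrementToken() call.
    public static final class FakeStream {
        private final String text;
        private final ReusedToken shared = new ReusedToken();
        private int pos = 0;

        public FakeStream(String text) {
            this.text = text;
        }

        public ReusedToken token() {
            return shared;
        }

        public boolean incrementToken() {
            while (pos < text.length() && text.charAt(pos) == ' ') {
                pos++; // skip separators
            }
            if (pos >= text.length()) {
                return false; // stream exhausted
            }
            int start = pos;
            while (pos < text.length() && text.charAt(pos) != ' ') {
                pos++;
            }
            shared.term = text.substring(start, pos);
            shared.startOffset = start;
            shared.endOffset = pos;
            return true;
        }
    }

    // Pitfall: every list entry is the same object, so all entries
    // end up showing the state of the last token.
    public static List<ReusedToken> collectShared(String text) {
        FakeStream stream = new FakeStream(text);
        ReusedToken token = stream.token();
        List<ReusedToken> out = new ArrayList<>();
        while (stream.incrementToken()) {
            out.add(token);
        }
        return out;
    }

    // Fix: snapshot the state on each iteration.
    public static List<ReusedToken> collectCopies(String text) {
        FakeStream stream = new FakeStream(text);
        ReusedToken token = stream.token();
        List<ReusedToken> out = new ArrayList<>();
        while (stream.incrementToken()) {
            out.add(token.copy());
        }
        return out;
    }

    public static void main(String[] args) {
        for (ReusedToken t : collectShared("quick brown fox")) {
            System.out.println("shared: " + t.term);
        }
        for (ReusedToken t : collectCopies("quick brown fox")) {
            System.out.println("copied: " + t.term + " [" + t.startOffset + "," + t.endOffset + ")");
        }
    }
}
```

With Lucene itself, the analogous step would be to read the values you need inside the loop (for example the term text and offsets, as the other answers do) into objects of your own, rather than keeping a reference to the attribute instance.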

+1

William Price Apr 18 '14 at 15:06

Adam Paynter · Accepted Answer · 2010-04-14 14:37

Yes, it is a little convoluted (compared to the good old way), but this should do it:

    TokenStream tokenStream = analyzer.tokenStream(fieldName, reader);
    OffsetAttribute offsetAttribute = tokenStream.getAttribute(OffsetAttribute.class);
    TermAttribute termAttribute = tokenStream.getAttribute(TermAttribute.class);

    while (tokenStream.incrementToken()) {
        int startOffset = offsetAttribute.startOffset();
        int endOffset = offsetAttribute.endOffset();
        String term = termAttribute.term();
    }

Edit: New way

According to Donotello, TermAttribute has been deprecated in favor of CharTermAttribute. According to jpountz (and the Lucene documentation), addAttribute is preferable to getAttribute.

    TokenStream tokenStream = analyzer.tokenStream(fieldName, reader);
    OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
    CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);

    tokenStream.reset();
    while (tokenStream.incrementToken()) {
        int startOffset = offsetAttribute.startOffset();
        int endOffset = offsetAttribute.endOffset();
        String term = charTermAttribute.toString();
    }
