There are two variations in the OP's question:
1. What is "the process of getting tokens from a TokenStream"?
2. "Can someone explain how to get token-like information from TokenStream?"
Recent versions of the Lucene documentation for Token say (emphasis mine):
NOTE: As of 2.9 ... it is not necessary to use Token anymore, with the new TokenStream API it can be used as convenience class that implements all Attributes, which is especially useful to easily switch from the old to the new TokenStream API.
And the TokenStream documentation says of its API:
... has moved from being Token-based to Attribute-based ... the preferred way to store the information of a Token is to use AttributeImpls.
The other answers to this question cover #2 above: how to get token-like information from a TokenStream in the "new" recommended way, using attributes. Reading through the documentation, the Lucene developers suggest that this change was made, in part, to reduce the number of individual objects created at a time.
But, as some people pointed out in the comments on those answers, they do not directly answer #1: how do you get a Token if you really want or need that type?
With the same API change that makes TokenStream an AttributeSource, Token now implements Attribute and can be used with TokenStream.addAttribute, just as the other answers show for CharTermAttribute and OffsetAttribute. So they really did answer that part of the original question; they simply did not show it.
It is important to note that, although this approach lets you access a Token while you are looping, it is still only a single object, no matter how many logical tokens are in the stream. Every call to incrementToken() changes the state of the Token returned from addAttribute. So if your goal is to build a collection of distinct Token objects to be used outside the loop, you will need to do extra work to make a new Token object as a (deep?) copy.
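To illustrate the single-object pitfall without depending on any particular Lucene version, here is a minimal, self-contained sketch. `MutableToken` and `WhitespaceStream` are hypothetical stand-ins invented for this example, not Lucene classes; they only mimic the pattern of a stream that overwrites one shared token instance on every `incrementToken()` call.

```java
import java.util.ArrayList;
import java.util.List;

public class ReusedTokenDemo {

    // Hypothetical stand-in for a mutable token; NOT Lucene's Token class.
    static final class MutableToken {
        String term = "";
        int startOffset;
        int endOffset;

        // Copies the current state into a new, independent object.
        MutableToken copy() {
            MutableToken t = new MutableToken();
            t.term = term;
            t.startOffset = startOffset;
            t.endOffset = endOffset;
            return t;
        }
    }

    // Hypothetical stand-in for a stream that reuses a single token instance.
    static final class WhitespaceStream {
        private final String text;
        private int pos = 0;
        final MutableToken token = new MutableToken(); // the one shared instance

        WhitespaceStream(String text) {
            this.text = text;
        }

        // Advances to the next word, overwriting the shared token's state.
        boolean incrementToken() {
            while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) pos++;
            if (pos >= text.length()) return false;
            int start = pos;
            while (pos < text.length() && !Character.isWhitespace(text.charAt(pos))) pos++;
            token.term = text.substring(start, pos);
            token.startOffset = start;
            token.endOffset = pos;
            return true;
        }
    }

    // Pitfall: every list element is the same object, so all show the last state.
    static List<String> collectByReference(String text) {
        WhitespaceStream stream = new WhitespaceStream(text);
        List<MutableToken> tokens = new ArrayList<>();
        while (stream.incrementToken()) tokens.add(stream.token);
        return terms(tokens);
    }

    // Fix: copy the state on each iteration so every element is independent.
    static List<String> collectByCopy(String text) {
        WhitespaceStream stream = new WhitespaceStream(text);
        List<MutableToken> tokens = new ArrayList<>();
        while (stream.incrementToken()) tokens.add(stream.token.copy());
        return terms(tokens);
    }

    private static List<String> terms(List<MutableToken> tokens) {
        List<String> out = new ArrayList<>();
        for (MutableToken t : tokens) out.add(t.term);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(collectByReference("quick brown fox")); // [fox, fox, fox]
        System.out.println(collectByCopy("quick brown fox"));      // [quick, brown, fox]
    }
}
```

With a real TokenStream, the copy step would be whatever your Lucene version offers for duplicating a Token's state (possibly `clone()`, but check the API documentation for the version you use).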
William Price, Apr 18 '14 at 15:06