Is there a library for splitting a sentence into a list of words in Python?

I looked at NLTK for Python, but it breaks (tokenizes) won't into ['wo', "n't"]. Are there libraries that do this more reliably?

I know that I could write some kind of regular expression to solve this problem, but I am looking for a library/tool because it would be a more robust approach. For example, after trying a basic regular expression on periods and commas, I realized that words like 'Mr.' will break it.

(@Artsiom)

If the sentence is "you won't?", split() will give me ["you", "won't?"]. So there is an extra "?" that I have to deal with. I am looking for a tried-and-tested method that handles quirks like the one above, as well as the many exceptions that I am sure exist. Of course, I will fall back to split() plus a regex if I don't find one.
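
For illustration, plain str.split() leaves the question mark attached to the last word:

>>> "you won't?".split()
['you', "won't?"]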

+5
5 answers

The Natural Language Toolkit (NLTK) is probably what you need.

>>> from nltk.tokenize import word_tokenize
>>> word_tokenize("'Hello. This is a test.  It works!")
["'Hello", '.', 'This', 'is', 'a', 'test', '.', 'It', 'works', '!']
>>> word_tokenize("I won't fix your computer")
['I', 'wo', "n't", 'fix', 'your', 'computer']

nltk.tokenize.word_tokenize uses the TreebankWordTokenizer by default, which tokenizes sentences according to the Penn Treebank conventions.

Note that this tokenizer assumes the text has already been segmented into sentences.
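
For example, here is a minimal sketch (my own, not from the original answer) that segments the text first and then tokenizes each sentence; recent NLTK versions may need nltk.download('punkt') beforehand:

>>> from nltk.tokenize import sent_tokenize, word_tokenize
>>> text = "Mr. Smith won't come. He is busy!"
>>> [word_tokenize(s) for s in sent_tokenize(text)]
[['Mr.', 'Smith', 'wo', "n't", 'come', '.'], ['He', 'is', 'busy', '!']]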

If that does not suit you, NLTK provides a number of other tokenizers (e.g. WordPunctTokenizer, WhitespaceTokenizer, ...).

+9

Despite what you say, NLTK is by far your best bet. You will not find a more "tried and tested" method than the tokenizers it ships (some of them are based on classifiers trained especially for this task). You just need to choose the right tokenizer for your needs. Take the following input:

I am a happy teapot that won't do stuff?

And run it through each NLTK tokenizer:

TreebankWordTokenizer

I am a happy teapot that wo n't do stuff ?

WordPunctTokenizer

I am a happy teapot that won ' t do stuff ?

PunktWordTokenizer

I am a happy teapot that won 't do stuff ?

WhitespaceTokenizer

I am a happy teapot that won't do stuff?

Your best bet may be a combination of approaches. For example, you could first use the PunktSentenceTokenizer to split the text into sentences, since it tends to be quite accurate. Then, for each sentence, strip the trailing punctuation ("?" in your case). Finally, run the WhitespaceTokenizer on what is left: that way you avoid word-plus-punctuation tokens like stuff?, since you have already stripped the final punctuation from each sentence, while words like won't stay intact.
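
A rough sketch of that combination, using sent_tokenize (which wraps a pre-trained Punkt model) in place of a hand-built PunktSentenceTokenizer; the sample text and the rstrip call are my own:

from nltk.tokenize import sent_tokenize, WhitespaceTokenizer
import string

text = "I am a happy teapot that won't do stuff? I doubt it."
tokenizer = WhitespaceTokenizer()

words = []
for sentence in sent_tokenize(text):
    # Strip the trailing punctuation from each sentence ('?', '.', ...)
    sentence = sentence.rstrip(string.punctuation)
    words.extend(tokenizer.tokenize(sentence))

print(words)
# ['I', 'am', 'a', 'happy', 'teapot', 'that', "won't", 'do', 'stuff', 'I', 'doubt', 'it']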

+5

@Karthick, here is a simple algorithm I once used for this task:

  • Loop through the text one character at a time.
  • If the character is in the alphabet, append it to the current word. Else, add the word built so far (if any) to the word list and start a new word.

# Include the apostrophe so contractions like "won't" stay intact
alphabet = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'")
text = "I won't answer this question!"

word = ''
wordlist = []

for c in text:
    if c in alphabet:
        word += c
    else:
        if len(word) > 0:
            wordlist.append(word)
        word = ''
if word:  # flush the last word if the text does not end in punctuation
    wordlist.append(word)

print(wordlist)
['I', "won't", 'answer', 'this', 'question']

You can extend this basic algorithm in any direction you like :)

+3

NLTK comes with a number of different tokenizers, and you can try a demo of each one online. For your case, it looks like WhitespaceTokenizer is the best fit; it is essentially the same as string.split().
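
A quick check of that claim (my own snippet):

>>> from nltk.tokenize import WhitespaceTokenizer
>>> WhitespaceTokenizer().tokenize("you won't?")
['you', "won't?"]
>>> "you won't?".split()
['you', "won't?"]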

+1

You can try the following:

op = []
string_big = "One of Python coolest features is the string format operator  This operator is unique to strings"
position_start = 0
while position_start < len(string_big):
    if ' ' in string_big:
        # Take everything up to the next space as a word
        space_found = string_big.index(' ')
        op.append(string_big[position_start:space_found])
        # Drop the consumed word and the space, keep scanning the rest
        string_big = string_big[space_found + 1:]
    else:
        # No spaces left: the remainder is the final word
        op.append(string_big[position_start:])
        break

# Note: consecutive spaces yield empty strings, just like string_big.split(' ')
print(op)
0
