Modify python nltk.word_tokenize to exclude "#" as delimiter

I use the Python NLTK library to tokenize my sentences.

If I run this code

import nltk

text = "C# billion dollars; we don't own an ounce C++"
print(nltk.word_tokenize(text))

I get this as my output

['C', '#', 'billion', 'dollars', ';', 'we', 'do', "n't", 'own', 'an', 'ounce', 'C++']

The symbols ;, ., and # are treated as delimiters. Is there a way to remove # from the set of delimiters, the same way + is not a delimiter and C++ is therefore kept as a single token?

I want my output to be

['C#', 'billion', 'dollars', ';', 'we', 'do', "n't", 'own', 'an', 'ounce', 'C++']

I want C# to be treated as a single token.

2 answers

One option is to post-process the token list and merge each standalone "#" back into the token that precedes it:

from nltk.tokenize import word_tokenize

txt = "C# billion dollars; we don't own an ounce C++"
tokens = word_tokenize(txt)

# Whenever a '#' token follows another token, merge the two back together.
# i_offset compensates for the list shrinking after each merge.
i_offset = 0
for i, t in enumerate(tokens):
    i -= i_offset
    if t == '#' and i > 0:
        left = tokens[:i - 1]
        joined = [tokens[i - 1] + t]
        right = tokens[i + 1:]
        tokens = left + joined + right
        i_offset += 1

>>> tokens
['C#', 'billion', 'dollars', ';', 'we', 'do', "n't", 'own', 'an', 'ounce', 'C++']
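The same rejoin can also be written without the index-offset bookkeeping by building a new list in a single pass. A minimal sketch of the idea (the helper name `merge_hash` is hypothetical, not part of NLTK):

```python
def merge_hash(tokens):
    """Merge a standalone '#' token into the token before it ('C', '#' -> 'C#')."""
    merged = []
    for t in tokens:
        if t == '#' and merged:
            merged[-1] += t  # attach '#' to the previous token
        else:
            merged.append(t)
    return merged

tokens = ['C', '#', 'billion', 'dollars', ';', 'C++']
print(merge_hash(tokens))  # ['C#', 'billion', 'dollars', ';', 'C++']
```

Because it never mutates the list while indexing into it, this version is easier to verify and handles multiple '#' tokens in one sentence without extra state.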

Alternatively, NLTK lets you tokenize with a regular expression of your own via regexp_tokenize, instead of relying on word_tokenize's built-in rules.

If you only want to split on whitespace and a few punctuation characters (., ,, ;, etc.), you can do:

>>> from nltk.tokenize import regexp_tokenize
>>> txt = "C# billion dollars; we don't own an ounce C++"
>>> regexp_tokenize(txt, pattern=r"\s|[\.,;']", gaps=True)
['C#', 'billion', 'dollars', 'we', 'don', 't', 'own', 'an', 'ounce', 'C++']
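Note that including the apostrophe in the delimiter class splits "don't" into "don" and "t". The same gap-style splitting can be reproduced with the standard-library `re` module, and dropping the apostrophe keeps contractions intact. This is a sketch of the splitting idea, not the NLTK tokenizer itself:

```python
import re

txt = "C# billion dollars; we don't own an ounce C++"

# Split on runs of whitespace or on the delimiters . , ; and drop the
# empty strings that re.split produces when delimiters are adjacent.
tokens = [t for t in re.split(r"\s+|[.,;]", txt) if t]
print(tokens)  # ['C#', 'billion', 'dollars', 'we', "don't", 'own', 'an', 'ounce', 'C++']
```

The trade-off versus word_tokenize is that the delimiter characters themselves (like the ;) are discarded rather than kept as tokens.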
