I use the Python NLTK library to tokenize my sentences.
If my code is
import nltk
text = "C# billion dollars; we don't own an ounce C++"
print(nltk.word_tokenize(text))
I get this as my output
['C', '#', 'billion', 'dollars', ';', 'we', 'do', "n't", 'own', 'an', 'ounce', 'C++']
The symbols ;, . and # are treated as delimiters. Is there a way to remove # from the set of delimiters, in the same way that + is not a delimiter and C++ therefore appears as a single token?
I want my output to be
['C#', 'billion', 'dollars', ';', 'we', 'do', "n't", 'own', 'an', 'ounce', 'C++']
I want C# to be considered one token.
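To make the desired result concrete, here is a minimal sketch of a post-processing workaround. It does not change NLTK's delimiter set; it only glues a standalone # back onto the token before it. The helper name merge_suffix is hypothetical and not part of NLTK, and the token list is copied from the output above rather than produced by calling the tokenizer.

```python
def merge_suffix(tokens, suffix="#"):
    """Glue each standalone `suffix` token onto the token that precedes it.

    Hypothetical helper, not an NLTK function.
    """
    merged = []
    for tok in tokens:
        if tok == suffix and merged:
            # Attach the suffix to the previous token, e.g. 'C' + '#' -> 'C#'.
            merged[-1] += tok
        else:
            merged.append(tok)
    return merged


# Token list as returned by nltk.word_tokenize(text) in the question.
tokens = ['C', '#', 'billion', 'dollars', ';', 'we', 'do', "n't",
          'own', 'an', 'ounce', 'C++']
result = merge_suffix(tokens)
print(result)
```

This merges every standalone # with whatever precedes it, so it would also turn 'item', '#' into 'item#'. That is why I would still prefer a way to tell the tokenizer itself not to split on #.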