Best way to parse a text document

I am trying to parse a simple text document in PHP, but have no idea how to do this correctly. I want to separate the words, assign each one an identifier, and save the result in JSON format.

Sample text:

"Hello, how are you (today)" 

This is what I am doing at the moment:

 $document_array = explode(' ', $document_text);
 echo json_encode($document_array);

The JSON I get:

 [["Hello,"],["how"],["are"],["you"],["(today)"]] 

How can I make sure that the spaces are kept in place and that punctuation characters are split off from the words, like this?

 [["Hello"],[", "],["how"],[" "],["are"],[" "],["you"],[" ("],["today"],[")"]] 

I'm sure some kind of regular expression is required, but I don't know which pattern would cover all cases. Any suggestions, guys?

+7
2 answers

This is actually a really complex problem, and one that has been the subject of a fair amount of academic research. It sounds so simple (just split on whitespace, perhaps with a few rules for punctuation), but you quickly run into problems. Is "isn't" one word or two? What about hyphenated words? Some may be one word, some may be two. What about multiple consecutive punctuation characters? Apostrophes versus quotation marks? And so on. Even determining the end of a sentence is non-trivial. (It's just a full stop, right?)

This problem is called tokenization, and it is a topic that search engines take very seriously. Honestly, you should look for an existing tokenizer in your language of choice.
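
To make the ambiguity concrete, here is a minimal sketch of a hand-rolled tokenizer. The function name tokenize and its rules are assumptions made for illustration only: it keeps an apostrophe or hyphen that sits between letters inside the word, and emits every other punctuation character as its own token. A real tokenizer would need many more rules.

```php
<?php
// Hypothetical helper for illustration, not a library function.
// Splits text into word, whitespace, and punctuation tokens.
function tokenize(string $text): array
{
    // Alternative 1: letters/digits, optionally joined by ' or -
    //                (so "isn't" and "well-known" stay in one piece).
    // Alternative 2: a run of whitespace.
    // Alternative 3: any single character that is neither of the above.
    $pattern = '/[\p{L}\p{N}]+(?:[\'-][\p{L}\p{N}]+)*|\s+|[^\p{L}\p{N}\s]/u';

    // preg_match_all() returns false on error (for example invalid UTF-8).
    if (preg_match_all($pattern, $text, $matches) === false) {
        return [];
    }
    return $matches[0];
}

$tokens = tokenize("Isn't it well-known (today)?");
echo json_encode($tokens), PHP_EOL;
```

Whether "isn't" should be one token or two is exactly the kind of decision mentioned above; this sketch simply picks one answer, and changing the first alternative of the pattern picks the other.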

+4

Maybe this?

 array_filter(preg_split('/\b/', $document_text))

"array_filter" removes the empty values at the first and/or last index of the resulting array, which appear if your string starts or ends at a word boundary (\b: see http://php.net/manual/en/regexp.reference.escape.php).

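One caveat with the array_filter approach: without a callback it also drops the token "0", and it keeps the original keys, so once index 0 is removed json_encode emits a JSON object instead of an array. A possible variant, sketched here on the sample text from the question, lets preg_split drop the empty pieces itself through the PREG_SPLIT_NO_EMPTY flag:

```php
<?php
// Sample text from the question.
$document_text = 'Hello, how are you (today)';

// Split at every word boundary. PREG_SPLIT_NO_EMPTY discards empty pieces,
// and the result is a list with consecutive keys starting at 0.
$tokens = preg_split('/\b/', $document_text, -1, PREG_SPLIT_NO_EMPTY);

// Consecutive keys make json_encode() emit a JSON array rather than an object.
$json = json_encode($tokens);
echo $json, PHP_EOL;
```

For the sample text this should print ["Hello",", ","how"," ","are"," ","you"," (","today",")"], which is a flat version of the list asked for in the question.
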
+2
