I am developing a web application that reads the Twitter stream, and I am trying to build the natural language processing for it myself.
Since my data comes from Twitter (limited to 140 characters), many words are shortened or, as in this case, written without a space between them.
For instance:
"Hi, my name is Bob. I m 19yo and 170cm tall"
Should be tokenized as:
- hi
- my
- name
- bob
- i
- 19
- yo
- 170
- cm
- tall
Note that the 19 and the yo in 19yo are split into separate tokens. I use this mainly to extract numbers together with their units.
What I need is a way to "explode" every token that contains a number into separate chunks of digits and letters.
'123abc' will be ['123', 'abc']
'abc123' will be ['abc', '123']
'abc123xyz' will be ['abc', '123', 'xyz']
etc.
How can I do this in PHP?
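For reference, here is a minimal sketch of one way this might be done with preg_match_all, matching runs of digits and runs of letters separately. The function names explodeAlnum and tokenize are hypothetical, made up for illustration. The sketch assumes ASCII letters only, drops punctuation, and does not filter out any words, so it returns every word in the sentence.

```php
<?php
// Hypothetical helper: split one token into its runs of digits and runs of letters.
function explodeAlnum(string $token): array
{
    // Match either a run of digits or a run of ASCII letters (case-insensitive).
    // Anything else, such as punctuation, is skipped.
    preg_match_all('/\d+|[a-z]+/i', $token, $matches);
    return $matches[0];
}

// Hypothetical helper: lowercase the text, split on whitespace, explode each word.
function tokenize(string $text): array
{
    $tokens = [];
    $words = preg_split('/\s+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        foreach (explodeAlnum($word) as $part) {
            $tokens[] = $part;
        }
    }
    return $tokens;
}

print_r(explodeAlnum('abc123xyz'));
print_r(tokenize('I m 19yo and 170cm tall'));
```

An alternative assumption would be preg_split with a lookaround pattern such as '/(?<=\d)(?=[a-z])|(?<=[a-z])(?=\d)/i', which keeps punctuation attached to the pieces instead of dropping it.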