>>> import re
>>> myre = re.compile(r"\w{4,}")
>>> myre.findall('Lorem, ipsum! dolor sit? amet...')
['Lorem', 'ipsum', 'dolor', 'amet']
Note that in Python 3, where all strings are Unicode, this will also find words that contain non-ASCII letters:
>>> import re
>>> myre = re.compile(r"\w{4,}")
>>> myre.findall('Lorem, ipsum! dolör sit? amet...')
['Lorem', 'ipsum', 'dolör', 'amet']
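To make the Unicode behaviour visible, here is a small sketch that runs the same pattern twice in Python 3, once with the default matching and once with the standard re.ASCII flag, which restricts \w to [a-zA-Z0-9_]. The variable names are only illustrative.

```python
import re

text = 'Lorem, ipsum! dolör sit? amet...'

# Default for str patterns in Python 3: \w is Unicode-aware,
# so "ö" counts as a word character.
unicode_words = re.findall(r"\w{4,}", text)

# With re.ASCII, \w only matches [a-zA-Z0-9_]. "dolör" is then broken
# into "dol" and "r", and both pieces are shorter than four characters.
ascii_words = re.findall(r"\w{4,}", text, re.ASCII)

print(unicode_words)
print(ascii_words)
```

With re.ASCII the word containing "ö" drops out of the result entirely, so the default behaviour is usually what you want when the text is not pure ASCII.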
In Python 2 you will need to pass the re.UNICODE flag and search a Unicode string:
>>> myre = re.compile(r"\w{4,}", re.UNICODE)
>>> myre.findall(u'Lorem, ipsum! dolör sit? amet...')
[u'Lorem', u'ipsum', u'dol\xf6r', u'amet']
Tim Pietzcker