NLTK error 'unknown url type'

I am trying to run a Python script that uses NLTK tokenization. Here is the part of the script that initializes NLTK:

    class NLTKTagger:
        '''
        class that supplies part of speech tags using NLTK
        note: avoids the NLTK downloader (see __init__ method)
        '''
        def __init__(self):
            import nltk
            from nltk.tag import PerceptronTagger
            from nltk.tokenize import TreebankWordTokenizer
            tokenizer_fn = os.path.abspath(resource_filename('phrasemachine.data', 'punkt.english.pickle'))
            tagger_fn = os.path.abspath(resource_filename('phrasemachine.data', 'averaged_perceptron_tagger.pickle'))
            # Load the tagger
            self.tagger = PerceptronTagger(load=False)
            self.tagger.load(tagger_fn)

            # note: nltk.word_tokenize calls the TreebankWordTokenizer, but uses the downloader.
            # Calling the TreebankWordTokenizer like this allows skipping the downloader.
            # It seems the TreebankWordTokenizer uses PTB tokenization = regexes. ie no downloads
            # https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L25
            self.tokenize = TreebankWordTokenizer().tokenize
            self.sent_detector = nltk.data.load(tokenizer_fn)

I get the following error:

    Traceback (most recent call last):
      File "C:\Users\Uzair\Desktop\phrasemachine_test.py", line 3, in <module>
        phrasemachine.get_phrases(text)
      File "C:\Program Files\Python36-32\lib\site-packages\phrasemachine\phrasemachine.py", line 260, in get_phrases
        tagger = TAGGER_NAMES[tagger]()
      File "C:\Program Files\Python36-32\lib\site-packages\phrasemachine\phrasemachine.py", line 173, in get_stdeng_nltk_tagger
        tagger = NLTKTagger()
      File "C:\Program Files\Python36-32\lib\site-packages\phrasemachine\phrasemachine.py", line 140, in __init__
        self.tagger.load(tagger_fn)
      File "C:\Program Files\Python36-32\lib\site-packages\nltk\tag\perceptron.py", line 209, in load
        self.model.weights, self.tagdict, self.classes = load(loc)
      File "C:\Program Files\Python36-32\lib\site-packages\nltk\data.py", line 801, in load
        opened_resource = _open(resource_url)
      File "C:\Program Files\Python36-32\lib\site-packages\nltk\data.py", line 924, in _open
        return urlopen(resource_url)
      File "C:\Program Files\Python36-32\lib\urllib\request.py", line 223, in urlopen
        return opener.open(url, data, timeout)
      File "C:\Program Files\Python36-32\lib\urllib\request.py", line 526, in open
        response = self._open(req, data)
      File "C:\Program Files\Python36-32\lib\urllib\request.py", line 549, in _open
        'unknown_open', req)
      File "C:\Program Files\Python36-32\lib\urllib\request.py", line 504, in _call_chain
        result = func(*args)
      File "C:\Program Files\Python36-32\lib\urllib\request.py", line 1388, in unknown_open
        raise URLError('unknown url type: %s' % type)
    urllib.error.URLError: <urlopen error unknown url type: c>

I am using Python 3.6 on Windows 7 with NLTK 3.2.1. I tried the solutions mentioned here and here, but none of them worked. Is there another solution?

python nltk
1 answer

The data loader mistakes the C: prefix in your path for a protocol name, like http:. I thought this had already been fixed... To work around the problem, add the file:// protocol to the beginning of your path. For example:

 self.tagger.load("file://"+tagger_fn) 

(There are more efficient ways to structure your code, but it is up to you.)

Technically this is not a bug, since nltk.data.load() expects a URL, not a file system path. Still, it really should be fixed; handling Windows paths is not that hard...
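To see where the "c" in the error message comes from, here is a minimal sketch that uses only the standard library's urllib.parse, not NLTK's own loader code. The helper url_scheme and the path are hypothetical and serve only as illustration; the point is that a Windows path is read as a URL whose scheme is the drive letter, while the same path with a file:// prefix gets the file scheme.

```python
from urllib.parse import urlparse


def url_scheme(location):
    """Return the scheme that generic URL parsing sees in `location`.

    Hypothetical helper for illustration; it is not part of NLTK.
    """
    return urlparse(location).scheme


# Hypothetical path, shaped like the one in the traceback.
windows_path = r"C:\Users\Uzair\averaged_perceptron_tagger.pickle"

# Everything before the first colon is taken as the scheme, so the
# drive letter is treated like "http" or "ftp" would be.
print(url_scheme(windows_path))

# With the prefix, the scheme is unambiguous.
print(url_scheme("file://" + windows_path))
```

The first call reports the drive letter as the scheme, which matches the "unknown url type: c" in the traceback.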
