How to use Python and lxml to parse a local HTML file?

I am working with a local HTML file in Python and trying to parse it with lxml. For some reason I can't load the file correctly, and I'm not sure whether that is because I have no HTTP server installed on my local machine, because of how I use etree, or something else.

I based this code on the following guide: http://docs.python-guide.org/en/latest/scenarios/scrape/

This could be a related problem: Requests: no connection adapters found, error in Python3

Here is my code:

    from lxml import html
    import requests

    page = requests.get('C:\Users\...\sites\site_1.html')
    tree = html.fromstring(page.text)
    test = tree.xpath('//html/body/form/div[3]/div[3]/div[2]/div[2]/div/div[2]/div[2]/p[1]/strong/text()')
    print test

The traceback I get:

    C:\Python27\python.exe "C:/Users/.../extract_html/extract.py"
    Traceback (most recent call last):
      File "C:/Users/.../extract_html/extract.py", line 4, in <module>
        page = requests.get('C:\Users\...\sites\site_1.html')
      File "C:\Python27\lib\site-packages\requests\api.py", line 69, in get
        return request('get', url, params=params, **kwargs)
      File "C:\Python27\lib\site-packages\requests\api.py", line 50, in request
        response = session.request(method=method, url=url, **kwargs)
      File "C:\Python27\lib\site-packages\requests\sessions.py", line 465, in request
        resp = self.send(prep, **send_kwargs)
      File "C:\Python27\lib\site-packages\requests\sessions.py", line 567, in send
        adapter = self.get_adapter(url=request.url)
      File "C:\Python27\lib\site-packages\requests\sessions.py", line 641, in get_adapter
        raise InvalidSchema("No connection adapters were found for '%s'" % url)
    requests.exceptions.InvalidSchema: No connection adapters were found for 'C:\Users\...\sites\site_1.html'

    Process finished with exit code 1

You can see that this has something to do with the "connection adapter," but I'm not sure what that means.

2 answers

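If the file is local, you should not use requests; just open the file and read it. requests expects to talk to a web server: it picks a connection adapter based on the start of the URL (http:// or https://), and a Windows path such as C:\Users\... does not begin with either, hence the InvalidSchema error.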

    from lxml import html

    # Read the file from disk and hand its contents to lxml.
    with open(r'C:\Users\...site_1.html', "r") as f:
        page = f.read()
    tree = html.fromstring(page)
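
To make the "connection adapter" message less mysterious, here is a rough sketch of the idea using only the standard library. By default requests registers adapters for the http:// and https:// prefixes; the `has_adapter` helper and `SUPPORTED_PREFIXES` below are hypothetical stand-ins for that lookup, not requests' actual code, and the Windows path is made up for illustration.

```python
from urllib.parse import urlsplit

# Simplified stand-in for the prefix lookup; this is NOT requests' real code.
SUPPORTED_PREFIXES = ("http://", "https://")


def has_adapter(url):
    """Return True if `url` starts with a prefix an adapter would handle."""
    return url.lower().startswith(SUPPORTED_PREFIXES)


local_path = r"C:\data\page.html"  # hypothetical local path
web_url = "http://docs.python-guide.org/en/latest/scenarios/scrape/"

# The drive letter is what a URL parser sees as the "scheme" of a Windows path.
print(urlsplit(local_path).scheme)
print(has_adapter(local_path))
print(has_adapter(web_url))
```

A path on disk never matches either prefix, so there is nothing for requests to hand the request to. Reading the file with open(), as above, sidesteps the problem entirely.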

There is a better way to do this: use the parse function instead of fromstring. parse takes a filename and reads the file itself, so you don't need open() at all.

    from lxml import html

    # A raw string keeps the backslashes in the Windows path literal.
    tree = html.parse(r"C:\Users\...site_1.html")
    print(html.tostring(tree))
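
If you want to try this without touching your own file, here is a minimal self-contained sketch. It assumes lxml is installed; the sample markup, the temporary file name, and the `strong_texts` helper are all made up for illustration. Note that parse returns an ElementTree (call getroot() for the root element), whereas fromstring returns an element directly; both support xpath.

```python
import os
import tempfile

from lxml import html

# Hypothetical sample page, written to a temp file so the sketch runs anywhere.
SAMPLE = "<html><body><p><strong>Hello</strong> world</p></body></html>"


def strong_texts(path):
    """Parse the HTML file at `path` and return the text of every <strong>."""
    tree = html.parse(path)  # reads straight from disk, no HTTP involved
    return [str(text) for text in tree.xpath("//strong/text()")]


with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.html")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write(SAMPLE)
    result = strong_texts(sample_path)

print(result)
```

A short relative XPath like //strong is used here only to keep the sketch readable; the long absolute path from the question works the same way as long as the page structure matches it.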