Traversing the HTML DOM in Python

I want to write a Python script (using Python 3.4.3) that fetches an HTML page from a URL and walks the DOM to find a specific element.

I currently have this:

#!/usr/bin/env python
import urllib.request

def getSite(url):
    return urllib.request.urlopen(url)

if __name__ == '__main__':
    content = getSite('http://www.google.com').read()
    print(content)

When I print the content, it prints the entire HTML page, which is close to what I want ... although ideally I'd like to navigate the DOM rather than treating it as one giant string.

I'm still pretty new to Python, but have experience working with several other languages (mainly Java, C#, C++, C, PHP, and JS). I've already done something similar in Java, but wanted to try it in Python.

Any help is appreciated. Cheers!


Use an HTML parsing library to navigate the DOM; lxml and BeautifulSoup are the usual choices.

lxml:

import urllib.request
import lxml.html

mysite = urllib.request.urlopen('http://www.google.com').read()
lxml_mysite = lxml.html.fromstring(mysite)

description = lxml_mysite.xpath("//meta[@name='description']")[0]  # the <meta name="description"> tag
text = description.get('content')  # its content attribute

>>> print(text)
Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for.

BeautifulSoup:

import urllib.request
from bs4 import BeautifulSoup

mysite = urllib.request.urlopen('http://www.google.com').read()
soup_mysite = BeautifulSoup(mysite, "html.parser")

description = soup_mysite.find("meta", {"name": "description"})  # the <meta name="description"> tag
text = description['content']  # its content attribute

>>> print(text)
Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for.

In short, BeautifulSoup is easier to get started with, while lxml is faster. Each has its pros and cons, so the choice comes down to your needs and preferences.


Take a look at BeautifulSoup:

from bs4 import BeautifulSoup
import urllib.request

# Fetch the page and hand the raw HTML to BeautifulSoup.
soup = BeautifulSoup(urllib.request.urlopen("http://google.com").read(), "html.parser")

for link in soup.find_all('a'):  # every <a> tag on the page
    print(link.get('href'))  # its href attribute, or None if absent
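
If you need one specific element rather than every link, bs4 also understands CSS selectors via select_one(); the selector below is a made-up example:

# select_one() returns the first element matching a CSS selector, or None.
heading = soup.select_one("div#content h1")  # hypothetical selector
if heading is not None:
    print(heading.get_text())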
