BeautifulSoup: extract text and links from an unordered list in a div (<ul>, <li>) (Scandinavian characters)

Question


I am trying to extract the city names on the left side of this web page ( http://www.silvan.dk/butikker ). Ultimately I need the physical address of each store, which is on the page that each link points to; in the meantime I have started by extracting the city names. More precisely, I want them from this container. However, since I have only just started with Python and BeautifulSoup, I have not been able to extract the information I need.

The result should give me: city, city link.

So far I have:

import codecs
import urllib2
import sys, locale, os, re
import lxml.etree
from bs4 import BeautifulSoup

def cp65001(name):
    # Treat the Windows console code page name 'cp65001' as UTF-8
    if name.lower() == 'cp65001':
        return codecs.lookup('utf-8')

html_page = urllib2.urlopen("http://www.silvan.dk/butikker")
soup = BeautifulSoup(html_page)
li = soup.select("ul > li > a")
for link in li:
    print link.get('href')

Which gives me the following result:

#1
#2
#3
#4    
#5
#6
#7
#8
#9    
#10
#11
#12
#13
#14    
#15
#16
#17
#18

I would really appreciate it if someone could point me to a solution. I also tried using

div = soup.find('div', id='leftContent')
lis = div.find_all('li')
num_lis = len(lis)

but I do not know how to proceed from there.

+4

python html extract web-scraping beautifulsoup

Philip, Oct 9 '13 at 8:49

1

Foo Bar User · Answer 1 · 2013-10-09T09:07:42+0000

Replace:

li = soup.select("ul > li > a")

with:

li = soup.select(".subMenu li a")

The result is then:

http://www.silvan.dk/butikker/ballerup
http://www.silvan.dk/butikker/birkeroed
http://www.silvan.dk/butikker/city2
http://www.silvan.dk/butikker/esbjerg
http://www.silvan.dk/butikker/fisketorvet
http://www.silvan.dk/butikker/fredericia
http://www.silvan.dk/butikker/frederikshavn
etc
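To get both the city name and its link, as the question asks, the same `.subMenu li a` selector can be combined with the link text. The following is only a minimal sketch: it is written for Python 3 (the question uses Python 2), and the inline `SAMPLE_HTML`, the `BASE_URL` constant, and the `extract_city_links` helper are assumptions made for illustration, not the real markup of the page. Only the `.subMenu` class name is taken from the answer above.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumed base URL, taken from the question.
BASE_URL = "http://www.silvan.dk/butikker"

# Hypothetical markup: the real page may be structured differently.
# Only the "subMenu" class comes from the answer above.
SAMPLE_HTML = """
<div id="leftContent">
  <ul class="subMenu">
    <li><a href="/butikker/ballerup">Ballerup</a></li>
    <li><a href="/butikker/birkeroed">Birkerød</a></li>
    <li><a href="/butikker/esbjerg">Esbjerg</a></li>
  </ul>
</div>
<ul>
  <li><a href="#1">1</a></li>
</ul>
"""


def extract_city_links(html, base_url=BASE_URL):
    """Return (city name, absolute URL) pairs for links inside .subMenu."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for a in soup.select(".subMenu li a"):
        href = a.get("href")
        if not href:
            continue  # skip anchors without a target
        # get_text() returns a Unicode string, so "ø" and similar
        # characters are preserved; urljoin leaves absolute URLs unchanged.
        pairs.append((a.get_text(strip=True), urljoin(base_url, href)))
    return pairs


if __name__ == "__main__":
    for city, url in extract_city_links(SAMPLE_HTML):
        print(city, url)
```

Restricting the selector to `.subMenu` is what excludes the `#1`, `#2`, ... anchors that the broader `ul > li > a` selector picked up. For the live page, the HTML would be fetched first and passed to the same function.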
