BeautifulSoup: extracting text and links from an unordered list (ul > li) with Scandinavian characters
I am trying to extract the city names on the left side of this web page ( http://www.silvan.dk/butikker ). Ultimately I need the physical address of each store, which is on the page each city link points to, so as a first step I am extracting the city names and their links. More precisely, I want them from the container on the left side of the page. However, since I have only just started with Python and BeautifulSoup, I have not been able to extract the information I need.
The result should give me: city name, city link.
This is what I have so far:
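import codecs
import urllib2
import sys, locale, os, re
import lxml.etree
from bs4 import BeautifulSoup

# Map the Windows console code page name 'cp65001' to the UTF-8 codec
def cp65001(name):
    if name.lower() == 'cp65001':
        return codecs.lookup('utf-8')
# Register the lookup function; without this call it is never used
codecs.register(cp65001)

# Fetch the page; a second positional argument would be sent as POST data
html_page = urllib2.urlopen("http://www.silvan.dk/butikker")
soup = BeautifulSoup(html_page)
# Every <a> that is a direct child of an <li> directly inside any <ul>
li = soup.select("ul > li > a")
for link in li:
    print link.get('href')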
Which gives me the following result:
#1
#2
#3
#4
#5
#6
#7
#8
#9
#10
#11
#12
#13
#14
#15
#16
#17
#18
I would really appreciate it if someone could point me to a solution. I have also tried using:
div = soup.find('div', id='leftContent')
lis = div.find_all('li')
num_lis = len(lis)
Replace this line:
li = soup.select("ul > li > a")
with:
li = soup.select(".subMenu li a")
Output:
http://www.silvan.dk/butikker/ballerup
http://www.silvan.dk/butikker/birkeroed
http://www.silvan.dk/butikker/city2
http://www.silvan.dk/butikker/esbjerg
http://www.silvan.dk/butikker/fisketorvet
http://www.silvan.dk/butikker/fredericia
http://www.silvan.dk/butikker/frederikshavn
etc
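To get both the city name and its link, as the question asks, the same selector can be combined with each anchor's text. The following is only a minimal sketch under stated assumptions: it uses Python 3 syntax rather than the Python 2 of the question, `SAMPLE_HTML` is a made-up stand-in for the real page (whose actual markup may differ), and `extract_cities` is a hypothetical helper name.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page; the actual markup may differ.
SAMPLE_HTML = """
<div id="leftContent">
  <ul class="subMenu">
    <li><a href="/butikker/ballerup">Ballerup</a></li>
    <li><a href="/butikker/birkeroed">Birkerød</a></li>
    <li><a href="/butikker/city2">City2</a></li>
  </ul>
</div>
<ul><li><a href="#1">1</a></li></ul>
"""

BASE_URL = "http://www.silvan.dk/butikker"


def extract_cities(html, base_url=BASE_URL):
    """Return (city name, absolute link) pairs from the .subMenu list."""
    # The built-in "html.parser" avoids depending on lxml being installed.
    soup = BeautifulSoup(html, "html.parser")
    cities = []
    for a in soup.select(".subMenu li a"):
        name = a.get_text(strip=True)  # link text, as a Unicode string
        href = a.get("href")
        if href:  # skip anchors that have no href attribute
            # urljoin leaves absolute URLs unchanged and resolves relative ones
            cities.append((name, urljoin(base_url, href)))
    return cities


for name, link in extract_cities(SAMPLE_HTML):
    print("{}, {}".format(name, link))
```

For the live site, the bytes returned by the HTTP request could be passed to `extract_cities` in place of `SAMPLE_HTML`. BeautifulSoup converts the document to Unicode while parsing, so names containing characters such as ø should come through intact as long as the output console can display them.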