Python BeautifulSoup is looking for a tag

Question

This is my first post here. I am trying to find all the tags with a particular class in this HTML page, but I can't extract them. This is the code:

from bs4 import BeautifulSoup
from urllib import urlopen

url = "http://www.jutarnji.hr"
html_doc = urlopen(url).read()
soup = BeautifulSoup(html_doc)
soup.prettify()
soup.find_all("a", {"class":"black"})

find_all returns [], but I can see that the HTML contains tags with the class "black". Did I miss something?

Thank you, Vedran

python beautifulsoup

onoxo Mar 30 '12 at 17:47

4 answers

Rik Poggi · Answer 1 · 2012-03-30T18:51:36+0000

It works for me, so I would say that the problem is with your HTML document.

I tried to run the following:

from bs4 import BeautifulSoup

html_doc = """<html>
 <body>
  <a class="black">
   <b>
    text1
   </b>
   <c>
    text2
   </c>
  </a>
  <a class="micio">
  </a>
  <a class="black">
  </a>
 </body>
</html>"""
soup = BeautifulSoup(html_doc)
soup.prettify()
print(soup.find_all("a", {"class":"black"}))

And as a result, I got:

[<a class="black">
<b>
    text1
   </b>
<c>
    text2
   </c>
</a>, <a class="black">
</a>]

Edit: As @Puneet pointed out, the problem may be a missing space between the attributes in the HTML that you are fetching.

I tried, for example, changing the above example to something like:

html_doc = """<html>
 <body>
  <aclass="black">

# etc.. as before

And I got an empty list: [].
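
To see why the missing space matters, here is a minimal sketch that lists the start tags a parser actually reports. It uses Python 3's standard-library html.parser rather than BeautifulSoup itself, so it only illustrates the idea; the parser BeautifulSoup uses in your setup may behave differently. The names StartTagCollector, collect_start_tags and count_black_links are made up for this illustration.

```python
from html.parser import HTMLParser


class StartTagCollector(HTMLParser):
    """Record the name and attributes of every start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.start_tags.append((tag, dict(attrs)))


def collect_start_tags(html):
    """Feed the markup to the parser and return the start tags it saw."""
    parser = StartTagCollector()
    parser.feed(html)
    parser.close()
    return parser.start_tags


def count_black_links(html):
    """Count <a> start tags whose class attribute is exactly "black"."""
    return sum(
        1
        for tag, attrs in collect_start_tags(html)
        if tag == "a" and attrs.get("class") == "black"
    )


well_formed = '<html><body><a class="black">text</a></body></html>'
missing_space = '<html><body><aclass="black">text</a></body></html>'

print(count_black_links(well_formed))    # 1
print(count_black_links(missing_space))  # 0
```

Without the space, the characters after "a" are read as part of the tag name, so the parser never reports a tag named "a" and a search for "a" tags has nothing to match.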

Puneet · Answer 2 · 2012-03-30T19:24:16+0000

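If you look at the page source, the attributes aren't separated by a space: class follows href directly. BeautifulSoup, it seems, cannot parse such a tag.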

>>> BeautifulSoup.BeautifulSoup('<a href="http://www.jutarnji.hr/crkva-se-ogradila-od--cjenika--don-mikica--osim-krizme--sve-druge-financijske-obveze-su-neprihvatljive/1018314/" class="black">').prettify()
'<a href="http://www.jutarnji.hr/crkva-se-ogradila-od--cjenika--don-mikica--osim-krizme--sve-druge-financijske-obveze-su-neprihvatljive/1018314/" class="black">\n</a>'
>>> BeautifulSoup.BeautifulSoup('<a href="http://www.jutarnji.hr/crkva-se-ogradila-od--cjenika--don-mikica--osim-krizme--sve-druge-financijske-obveze-su-neprihvatljive/1018314/"class="black">').prettify()
''

Froyo · Answer 3 · 2012-03-31T16:42:15+0000

I also had the same problem.

Try

soup.findAll("a",{"class":"black"})

instead of

soup.find_all("a",{"class":"black"})

soup.findAll() works well for me.

onoxo · Answer 4 · 2012-03-31T13:55:25+0000

It turns out that using lxml as the parser solves the problem:

from bs4 import BeautifulSoup
import lxml
from urllib import urlopen

url = "http://www.jutarnji.hr"
html_doc = urlopen(url).read()
soup = BeautifulSoup(html_doc, "lxml")
soup.prettify()

soup.find_all("a", {"class":"black"})
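
The code above is Python 2 (urlopen moved to urllib.request in Python 3). A rough Python 3 sketch of the same idea follows, with the parser passed explicitly so the result does not depend on which parser happens to be installed. The helper names find_links_by_class and fetch_html and the sample markup are my own, not from the original page. It defaults to the built-in "html.parser"; passing "lxml" assumes the lxml package is installed.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def find_links_by_class(html, css_class, parser="html.parser"):
    """Return all <a> tags carrying css_class, parsed with the named parser."""
    soup = BeautifulSoup(html, parser)
    return soup.find_all("a", class_=css_class)


def fetch_html(url):
    """Download a page and return its raw bytes (Python 3 urllib)."""
    with urlopen(url) as response:
        return response.read()


# Small stand-in document so the example runs without network access.
sample = """<html><body>
<a class="black" href="/one">first</a>
<a class="micio" href="/two">second</a>
<a class="black" href="/three">third</a>
</body></html>"""

for link in find_links_by_class(sample, "black"):
    print(link["href"], link.get_text())
```

For the live page, something like find_links_by_class(fetch_html("http://www.jutarnji.hr"), "black", parser="lxml") should be the equivalent call. I have not checked whether the site still serves links with that class.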
