Beautiful Soup: extracting a div and its contents by ID

soup.find("tagName", { "id" : "articlebody" }) 

Why does this NOT return the <div id="articlebody"> ... </div> tag and everything in between? It returns nothing, and I know the div exists because I am looking right at it in the output of

 soup.prettify() 

soup.find("div", { "id" : "articlebody" }) also does not work.

Edit: There is no answer to this post - how to delete it? I found that BeautifulSoup is not parsing correctly, which probably actually means that the page I'm trying to parse is incorrectly formatted in SGML or something else.

+82
python beautifulsoup
Jan 25
10 answers

You should post your sample document because the code works fine:

 >>> import BeautifulSoup
 >>> soup = BeautifulSoup.BeautifulSoup('<html><body><div id="articlebody"> ... </div></body></html>')
 >>> soup.find("div", {"id": "articlebody"})
 <div id="articlebody"> ... </div>

Finding a <div> inside a <div> also works:

 >>> soup = BeautifulSoup.BeautifulSoup('<html><body><div><div id="articlebody"> ... </div></div></body></html>')
 >>> soup.find("div", {"id": "articlebody"})
 <div id="articlebody"> ... </div>
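For readers on Beautiful Soup 4, here is a minimal sketch of the same lookup. It assumes the beautifulsoup4 package is installed (imported as bs4) and uses Python's built-in "html.parser"; neither assumption comes from the answer above, which uses the older BeautifulSoup 3 module.

```python
from bs4 import BeautifulSoup  # assumption: beautifulsoup4 is installed

html = '<html><body><div><div id="articlebody"> ... </div></div></body></html>'

# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(html, "html.parser")

# Same call shape as in the answer: tag name plus an attribute dictionary.
article = soup.find("div", {"id": "articlebody"})

article_html = str(article)                    # the tag with its contents
article_text = article.get_text(strip=True)    # only the text inside the tag
print(article_html)
```

If this works on a small sample but not on the real page, the markup or the parser is the more likely culprit than the find() call.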
+113
Jan 25 '10 at 22:55

To find an element by id :

 div = soup.find(id="articlebody") 
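Since find() returns None when nothing matches, it may help to guard the result before using it. The following is only a sketch assuming Beautiful Soup 4 (bs4); the helper name contents_by_id and the sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup  # assumption: beautifulsoup4 is installed

html = '<div id="header">Top</div><div id="articlebody"><p>Hello</p></div>'
soup = BeautifulSoup(html, "html.parser")


def contents_by_id(soup, element_id):
    """Hypothetical helper: inner HTML of the element with this id, or None."""
    element = soup.find(id=element_id)
    if element is None:
        # find() returns None rather than raising when there is no match.
        return None
    # decode_contents() renders the children without the enclosing tag.
    return element.decode_contents()


body = contents_by_id(soup, "articlebody")
missing = contents_by_id(soup, "no-such-id")
print(body, missing)
```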
+31
Mar 14 '14 at 16:17

I think there is a problem when the "div" tags are too deeply nested. I am trying to parse some contacts from a Facebook HTML file, and Beautiful Soup cannot find the "div" tags with the "fcontent" class.

This also happens with other classes. When I search for divs in general, it returns only those that are not so deeply nested.

The HTML source can be any Facebook page showing the friend list of one of your friends (not your own friend list). If someone can check this out and give some advice, I would really appreciate it.

This is my code where I am just trying to print the number of "div" tags with the "fcontent" class:

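 from BeautifulSoup import BeautifulSoup
 # Parse the saved page and count the <div> tags whose class is "fcontent".
 f = open('/Users/myUserName/Desktop/contacts.html')
 soup = BeautifulSoup(f)
 divs = soup.findAll('div', attrs={'class': 'fcontent'})  # renamed from "list" to avoid shadowing the builtin
 print len(divs)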
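As a point of comparison, here is a sketch of the same class lookup in Beautiful Soup 4 on synthetic, deeply nested markup. It does not reproduce the Facebook page; the nesting depth and the extra class name are invented for the example, and it assumes bs4 with the built-in "html.parser".

```python
from bs4 import BeautifulSoup  # assumption: beautifulsoup4 is installed

# Synthetic markup: one target div buried under 30 wrapper divs.
depth = 30
html = "<div>" * depth + '<div class="fcontent extra">deep</div>' + "</div>" * depth

soup = BeautifulSoup(html, "html.parser")

# class_ matches any one of the element's classes, so "fcontent extra" qualifies.
by_keyword = soup.find_all("div", class_="fcontent")

# The equivalent CSS selector.
by_css = soup.select("div.fcontent")

print(len(by_keyword), len(by_css))
```

If a lookup like this succeeds on well-formed input but fails on the saved page, that points toward malformed markup or the parser rather than nesting depth alone.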
+9
Mar 04

Most likely the problem is caused by Beautiful Soup's default parser. Switch to a different parser, for example "lxml", and try again.
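One way to test this is to run the same lookup under each parser and compare. This is a sketch assuming Beautiful Soup 4; the helper name find_with_each_parser and the sloppy sample markup are illustrative, and "lxml" and "html5lib" are optional packages that may not be installed, which the code tolerates.

```python
from bs4 import BeautifulSoup, FeatureNotFound  # assumption: beautifulsoup4 is installed

# Deliberately sloppy markup: unclosed <p> and an unclosed outer <div>.
html = '<div><div id="articlebody"><p>text</div>'


def find_with_each_parser(markup, element_id,
                          parsers=("html.parser", "lxml", "html5lib")):
    """Hypothetical helper: report whether each parser lets find() see the id."""
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # Raised by bs4 when the requested parser library is not installed.
            results[name] = "parser not installed"
            continue
        results[name] = soup.find(id=element_id) is not None
    return results


results = find_with_each_parser(html, "articlebody")
for parser_name, outcome in results.items():
    print(parser_name, outcome)
```

Parsers repair broken markup differently, so a page that hides an element under one parser may expose it under another.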

+7
Jan 29 '13 at 16:20

In the Beautiful Soup source, this line allows divs to be nested inside divs, so the concern you raised in your comment to lukas does not apply.

 NESTABLE_BLOCK_TAGS = ['blockquote', 'div', 'fieldset', 'ins', 'del'] 

I think you need to specify the attrs you need, such as

 source.find('div', attrs={'id':'articlebody'}) 
+5
Jan 25 '10 at 23:05

This also happened to me while trying to scrape Google.
I ended up using pyquery.
Installation:

 pip install pyquery 

Usage:

 from pyquery import PyQuery
 pq = PyQuery('<html><body><div id="articlebody"> ... </div></body></html>')
 tag = pq('div#articlebody')
+3
Apr 30 '15 at 5:34

Beautiful Soup 4 supports most CSS selectors with .select(), so you can use the id selector, for example:

 soup.select('#articlebody') 

If you need to specify an element type, you can add a type selector before the id selector:

 soup.select('div#articlebody') 

The .select() method returns a collection of elements, which means that it will return the same results as the following .find_all() method example:

 soup.find_all('div', id="articlebody")
 # or
 soup.find_all(id="articlebody")

If you want to select only one element, you can simply use the .find() method:

 soup.find('div', id="articlebody")
 # or
 soup.find(id="articlebody")
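The calls above can be compared side by side in a short sketch. The sample markup is invented for illustration, and select_one() is the Beautiful Soup 4 counterpart of find() for CSS selectors; it is not mentioned in the answer itself.

```python
from bs4 import BeautifulSoup  # assumption: beautifulsoup4 is installed

html = '<div id="articlebody"><p>One</p></div><span id="other">Two</span>'
soup = BeautifulSoup(html, "html.parser")

# Both of these return a list-like collection of matching tags.
selected = soup.select("div#articlebody")
found_all = soup.find_all("div", id="articlebody")

# Both of these return a single tag, or None when nothing matches.
first_css = soup.select_one("#articlebody")
first_find = soup.find(id="articlebody")

print(list(selected) == list(found_all), first_css == first_find)
```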
+3
Feb 20 '17 at 5:42

Have you tried soup.findAll("div", {"id": "articlebody"}) ?

It sounds crazy, but if you are scraping material from the wild, you cannot rule out multiple divs ...
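To illustrate the point, here is a sketch assuming the page repeats the same id on more than one div, which is invalid HTML but does occur on real sites. The markup is made up, and Beautiful Soup 4 with "html.parser" is assumed.

```python
from bs4 import BeautifulSoup  # assumption: beautifulsoup4 is installed

# Hypothetical page where the id is reused: an empty placeholder, then the real div.
html = (
    '<div id="articlebody"></div>'
    '<div id="articlebody"><p>Real content</p></div>'
)
soup = BeautifulSoup(html, "html.parser")

# find() would stop at the first (empty) div; find_all() returns every match.
matches = soup.find_all("div", id="articlebody")

# Keep only the matches that actually contain text.
non_empty = [div for div in matches if div.get_text(strip=True)]

print(len(matches), len(non_empty))
```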

+2
Jan 25 '10 at 23:00

I used:

 soup.findAll('tag', attrs={'attrname':"attrvalue"}) 

That is my syntax for find/findAll. That said, unless there are other optional parameters between the tag and the attribute list, this should not behave any differently.

+2
Jan 25

Here is a snippet of code

 # Pass an open file handle; a bare "index.html" string would be parsed as markup.
 soup = BeautifulSoup(open("index.html"))
 titleList = soup.findAll('title')
 divList = soup.findAll('div', attrs={ "class" : "article story"})

As you can see, I find all the <title> tags, and then I find all the <div> tags whose class is "article story".

+1
Jan 25 '10 at 23:03


