Beautiful soup and extracting div and its contents by ID

Question


soup.find("tagName", { "id" : "articlebody" })

Why does this NOT return the <div id="articlebody"> ... </div> tag and everything in between? It returns nothing, and I know the element exists because I can see it in the output of

 soup.prettify()

soup.find("div", { "id" : "articlebody" }) also does not work.

Edit: This post has no answer. How do I delete it? I found that BeautifulSoup is not parsing the page correctly, which probably means that the page I'm trying to parse is not well-formed SGML (or whatever format it is supposed to be).

+82

python beautifulsoup

sepiroth Jan 25


10 answers

To find an element by its id:

 div = soup.find(id="articlebody")
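A minimal sketch of the lookup above, assuming Beautiful Soup 4 (the bs4 package) and a made-up HTML snippet. It also shows that find() returns None rather than raising an error when no element matches, which is the "returns nothing" symptom described in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration.
html = '<html><body><div id="articlebody"><p>Hello</p></div></body></html>'
soup = BeautifulSoup(html, "html.parser")

div = soup.find(id="articlebody")      # a Tag when the id exists
missing = soup.find(id="no-such-id")   # None when nothing matches

if div is not None:
    print(div.get_text())
```

Checking the result against None before using it avoids an AttributeError further down when the element is absent from the parsed tree.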

+31

jfs Mar 14 '14 at 16:17


I think there is a problem when the "div" tags are nested too deeply. I am trying to parse some contacts from a Facebook HTML file, and BeautifulSoup cannot find the "div" tags with the class "fcontent".

This happens with other classes as well. When I search for divs in general, it returns only those that are not deeply nested.

The HTML source can be any Facebook page showing the friend list of one of your friends (not your own friend list). If someone can test this and give some advice, I would really appreciate it.

This is my code, where I just try to print the number of "div" tags with the class "fcontent":

 from BeautifulSoup import BeautifulSoup
 f = open('/Users/myUserName/Desktop/contacts.html')
 soup = BeautifulSoup(f)
 list = soup.findAll('div', attrs={'class':'fcontent'})
 print len(list)
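For comparison, here is a rough Beautiful Soup 4 version of the same count. It is only a sketch: the markup is invented (it is not real Facebook HTML), and the class_ keyword is the bs4 spelling of the class filter, not something used in the code above. With well-formed input, nesting depth alone does not hide a div.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one "fcontent" div at the top level, one nested four levels deep.
html = """
<div class="fcontent">top</div>
<div><div><div><div class="fcontent extra">deep</div></div></div></div>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ matches any div whose class list contains "fcontent".
contacts = soup.find_all("div", class_="fcontent")
print(len(contacts))
```

If a count like this comes back lower than expected on a real page, the markup or the parser is a more likely cause than the nesting itself.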

+9

omar Mar 04


Most likely the problem is caused by Beautiful Soup's default parser. Switch to a different parser, for example "lxml", and try again.
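One way to try this, sketched under the assumption that Beautiful Soup 4 is in use: parse the same markup with several parsers and compare what each one finds. The helper name find_with_parsers and the sample markup are hypothetical. "lxml" and "html5lib" are optional third-party parsers, so the sketch skips any that are not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical, slightly sloppy markup (the <p> is never closed).
html = '<html><body><p>intro<div id="articlebody">text</div></body></html>'

def find_with_parsers(markup, element_id, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: matching Tag or None}, skipping parsers that are not installed."""
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            continue  # this parser library is not available here
        results[name] = soup.find(id=element_id)
    return results

results = find_with_parsers(html, "articlebody")
for name, tag in results.items():
    print(name, "->", "found" if tag is not None else "not found")
```

If one parser finds the element and another does not, the page markup is probably malformed and the more lenient parser is the one to keep.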

+7

liang Jan 29 '13 at 16:20


In the Beautiful Soup source, this line allows divs to be nested inside divs, so the concern you raised in your comment on lukas's answer should not apply.

 NESTABLE_BLOCK_TAGS = ['blockquote', 'div', 'fieldset', 'ins', 'del']

I think you need to specify the attrs you need, such as

 source.find('div', attrs={'id':'articlebody'})

+5

dagoof Jan 25 '10 at 23:05


This also happened to me while trying to scrape Google.
I ended up using pyquery.
Installation:

 pip install pyquery

Usage:

 from pyquery import PyQuery
 pq = PyQuery('<html><body><div id="articlebody"> ... </div></body></html>')
 tag = pq('div#articlebody')

+3

Shoham Apr 30 '15 at 5:34


Beautiful Soup 4 supports most CSS selectors through .select(), so you can use the id selector, for example:

 soup.select('#articlebody')

If you need to specify an element type, you can add a type selector before the id selector:

 soup.select('div#articlebody')

The .select() method returns a collection of elements, which means that it will return the same results as the following .find_all() method example:

 soup.find_all('div', id="articlebody")
 # or
 soup.find_all(id="articlebody")

If you want to select only one element, you can simply use the .find() method:

 soup.find('div', id="articlebody")
 # or
 soup.find(id="articlebody")
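The equivalence described above can be checked with a short sketch. It assumes Beautiful Soup 4 with its CSS selector support available, and the markup is made up. select_one() is not mentioned above; it is the bs4 counterpart of .select() that returns a single element or None.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two divs, only one of which has the target id.
html = """
<div id="sidebar">menu</div>
<div id="articlebody"><p>First paragraph.</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

selected = soup.select("div#articlebody")           # collection of matches
found_all = soup.find_all("div", id="articlebody")  # collection of matches
first = soup.select_one("div#articlebody")          # single Tag or None
found = soup.find("div", id="articlebody")          # single Tag or None

print(len(selected), len(found_all))
print(first == found)
```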

+3

Josh Crozier Feb 20 '17 at 5:42


Have you tried soup.findAll("div", {"id": "articlebody"})?

It sounds crazy, but if you are scraping material from the wild, you cannot rule out multiple divs with the same id...
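A small sketch of that situation, assuming Beautiful Soup 4 (where findAll is spelled find_all) and invented markup that repeats an id. find() only ever returns the first match, while find_all() returns every one.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that (invalidly) reuses the same id twice.
html = """
<div id="articlebody">teaser</div>
<div id="articlebody">full text</div>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find("div", {"id": "articlebody"})
matches = soup.find_all("div", {"id": "articlebody"})

print(len(matches))
for tag in matches:
    print(tag.get_text())
```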

+2

user106514 Jan 25 '10 at 23:00


I used:

 soup.findAll('tag', attrs={'attrname':"attrvalue"})

as my syntax for find/findAll. That said, unless there are other optional parameters between the tag name and the attribute list, this should not behave any differently.
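To illustrate that the spellings are interchangeable, here is a minimal sketch assuming Beautiful Soup 4 and made-up markup. It compares the positional dictionary, the attrs keyword, and the plain keyword argument.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration.
html = '<div id="articlebody">body</div><div id="other">x</div>'
soup = BeautifulSoup(html, "html.parser")

by_position = soup.find_all("div", {"id": "articlebody"})       # dict as second argument
by_keyword = soup.find_all("div", attrs={"id": "articlebody"})  # explicit attrs keyword
by_kwarg = soup.find_all("div", id="articlebody")               # attribute as keyword

print(by_position == by_keyword == by_kwarg)
```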

+2

user257111 Jan 25


Here is a code snippet:

 soup = BeautifulSoup(open("index.html"))
 titleList = soup.findAll('title')
 divList = soup.findAll('div', attrs={ "class" : "article story"})

As you can see, I find all the <title> tags, and then all the <div> tags with class="article story".

+1

Recursion Jan 25 '10 at 23:03


Lukáš Lalinský · Accepted Answer · 2010-01-25 22:55

You should post your sample document, because the code works fine:

 >>> import BeautifulSoup
 >>> soup = BeautifulSoup.BeautifulSoup('<html><body><div id="articlebody"> ... </div></body></html>')
 >>> soup.find("div", {"id": "articlebody"})
 <div id="articlebody"> ... </div>

Finding a <div> inside a <div> also works:

 >>> soup = BeautifulSoup.BeautifulSoup('<html><body><div><div id="articlebody"> ... </div></div></body></html>')
 >>> soup.find("div", {"id": "articlebody"})
 <div id="articlebody"> ... </div>
