BeautifulSoup: how do I include an encoding declaration in the output?

I would like to include an XML declaration with an encoding in a document built with BeautifulSoup.BeautifulStoneSoup, but I'm not sure how to do this.

    <?xml version="1.0" encoding="UTF-8"?>
    <mytag>stuff</mytag>

It outputs the declaration when I read a document that already has one, but not when I build a new soup from scratch.

Thanks!

Edit: I will give an example of what I am doing now.

    from BeautifulSoup import BeautifulStoneSoup, Tag

    soup = BeautifulStoneSoup()
    mytag = Tag(soup, 'mytag')
    soup.append(mytag)

    str(soup)
    # '<mytag></mytag>'

    soup.prettify()  # No encoding given
    # '<mytag>\n</mytag>'

    soup.prettify(encoding='UTF-8')
    # '<mytag>\n</mytag>'  # Where is the encoding?

Even if I create the soup with BeautifulStoneSoup(fromEncoding='UTF-8'), there is still no <?xml?> declaration.

Is there a way to get this declaration without building it myself and passing it in as a string, or is that the only way?

1 answer

Do you mean something like this?

    from BeautifulSoup import BeautifulStoneSoup

    soup = BeautifulStoneSoup('<?xml version="1.0" encoding="UTF-8"?>')
    # make some more soup

Or,

    soup = BeautifulStoneSoup()
    # make some more soup
    soup.insert(0, '<?xml version="1.0" encoding="UTF-8"?>')
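If all you need is the declaration in the final output, the same idea can be applied after serialization, independent of the BeautifulSoup version. The following is only a sketch in Python 3: `with_xml_declaration` is a hypothetical helper, not part of BeautifulSoup, and it assumes you already have the serialized markup as a text string.

```python
def with_xml_declaration(body, encoding="UTF-8"):
    """Prepend an XML declaration to already-serialized markup.

    `body` is a text string (hypothetically, the serialized soup).
    Returns bytes encoded with `encoding`. An existing declaration is
    left untouched and is not checked against `encoding`.
    """
    if body.lstrip().startswith("<?xml"):
        # A declaration is already present; only encode.
        return body.encode(encoding)
    declaration = '<?xml version="1.0" encoding="%s"?>\n' % encoding
    return (declaration + body).encode(encoding)


print(with_xml_declaration("<mytag>stuff</mytag>").decode("UTF-8"))
```

Returning bytes rather than text keeps the declared encoding and the actual byte encoding in step for newly added declarations.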

From the BeautifulSoup documentation:

Beautiful Soup tries the following encodings, in order of priority, to turn your document into Unicode:

  • An encoding you pass in as the fromEncoding argument to the soup constructor.
  • An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document.
  • An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII.
  • An encoding sniffed by the chardet library, if you have it installed.
  • UTF-8
  • Windows-1252

Beautiful Soup will almost always guess right if it can make a guess at all. But for documents with no declarations and in strange encodings, it will often not be able to guess.

NB item #2, which I read as: BeautifulSoup will automatically use the encoding in the XML declaration unless you explicitly pass one via the fromEncoding argument. YMMV.

The documentation referenced above has other potentially useful Unicode examples.
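To make the priority order concrete, here is a rough Python 3 sketch of such a cascade using only the standard library. It is not BeautifulSoup's implementation: `candidate_encodings`, `to_unicode` and `from_encoding` are hypothetical names, the byte sniffing only looks at UTF-8 and UTF-16 byte order marks, and the chardet and EBCDIC steps are left out.

```python
import codecs
import re

# encoding="..." inside a leading XML declaration (ASCII-compatible input only)
_DECLARED = re.compile(rb'^\s*<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']')

# Byte order marks checked in the "first few bytes" step of this sketch
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def candidate_encodings(data, from_encoding=None):
    """Yield encodings to try, in the priority order quoted above."""
    if from_encoding:
        yield from_encoding                   # 1. explicit argument
    match = _DECLARED.match(data)
    if match:
        yield match.group(1).decode("ascii")  # 2. declared in the document
    for bom, name in _BOMS:
        if data.startswith(bom):
            yield name                        # 3. sniffed from leading bytes
    # 4. (chardet) is skipped to keep this sketch stdlib-only
    yield "utf-8"                             # 5. first fallback
    yield "windows-1252"                      # 6. last resort


def to_unicode(data, from_encoding=None):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in candidate_encodings(data, from_encoding):
        try:
            return data.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            continue
    raise ValueError("no candidate encoding could decode the data")


print(to_unicode(b"\xe9"))  # ('é', 'windows-1252')
```

Because the first successful decode wins, an explicit `from_encoding` that works shadows whatever the document declares, which matches the reading of item #2 above.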


Edit: @TorelTwiddler, if there is another way to add an XML declaration using BeautifulSoup without passing it in as a string, I don't know of it.

However, consider the following:

    from BeautifulSoup import BeautifulStoneSoup, Tag

    soup = BeautifulStoneSoup('<?xml version="1.0" encoding=""?>')  # <- no encoding
    mytag = Tag(soup, 'mytag')
    soup.append(mytag)

    print str(soup)
    # "<?xml version='1.0' encoding='utf-8'?><mytag></mytag>"  # Wha!? :)

    print soup.prettify(encoding='euc-jp')
    # <?xml version='1.0' encoding='euc-jp'?>
    # <mytag>
    # </mytag>

Perhaps this will help you get where you want to go.
