How to prettify HTML so that tag attributes stay on a single line?

I have this little code snippet:

    text = """<html><head></head><body>
        <h1 style="
        text-align: center;
    ">Main site</h1>
        <div>
            <p style="
        color: blue;
        text-align: center;
    ">text1
            </p>
            <p style="
        color: blueviolet;
        text-align: center;
    ">text2
            </p>
        </div>
        <div>
            <p style="text-align:center">
                <img src="./foo/test.jpg" alt="Testing static images" style="
    ">
            </p>
        </div>
    </body></html>
    """

    import sys
    import re
    import bs4


    def prettify(soup, indent_width=4):
        r = re.compile(r'^(\s*)', re.MULTILINE)
        return r.sub(r'\1' * indent_width, soup.prettify())

    soup = bs4.BeautifulSoup(text, "html.parser")
    print(prettify(soup))

The output of the above snippet right now:

    <html>
        <head>
        </head>
        <body>
            <h1 style="
                    text-align: center;
    ">
                Main site
            </h1>
            <div>
                <p style="
                    color: blue;
                    text-align: center;
    ">
                    text1
                </p>
                <p style="
                    color: blueviolet;
                    text-align: center;
    ">
                    text2
                </p>
            </div>
            <div>
                <p style="text-align:center">
                    <img alt="Testing static images" src="./foo/test.jpg" style="
    "/>
                </p>
            </div>
        </body>
    </html>

I would like to figure out how to format the output so that it becomes the following:

    <html>
        <head>
        </head>
        <body>
            <h1 style="text-align: center;">
                Main site
            </h1>
            <div>
                <p style="color: blue;text-align: center;">
                    text1
                </p>
                <p style="color: blueviolet;text-align: center;">
                    text2
                </p>
            </div>
            <div>
                <p style="text-align:center">
                    <img alt="Testing static images" src="./foo/test.jpg" style=""/>
                </p>
            </div>
        </body>
    </html>

In other words, I would like to keep HTML tags such as <tag attrib1=value1 attrib2=value2 ... attribn=valuen> on one line if possible. By "if possible" I mean without corrupting the attribute values themselves (value1, value2, ..., valuen).

Can this be done with beautifulsoup4? From what I have read in the docs, it seems you can use a custom formatter, but I don't know how to write one that fulfills the requirements described above.

EDIT:

@alecxe's solution is quite simple, but unfortunately it fails in more complex cases such as the one below:

    test1 = """
    <div id="dialer-capmaign-console" class="fill-vertically" style="flex: 1 1 auto;">
        <div id="sessionsGrid" data-columns="[
            { field: 'dialerSession.startTime', format:'{0:G}', title:'Start time', width:122 },
            { field: 'dialerSession.endTime', format:'{0:G}', title:'End time', width:122, attributes: {class:'tooltip-column'}},
            { field: 'conversationStartTime', template: cty.ui.gct.duration_dialerSession_conversationStartTime_endTime, title:'Duration', width:80},
            { field: 'dialerSession.caller.lastName',template: cty.ui.gct.person_dialerSession_caller_link, title:'Caller', width:160 },
            { field: 'noteType',template:cty.ui.gct.nameDescription_noteType, title:'Note type', width:150, attributes: {class:'tooltip-column'}},
            { field: 'note', title:'Note'}
        ]">
    </div>
    </div>
    """

    from bs4 import BeautifulSoup
    import re


    def prettify(soup, indent_width=4, single_lines=True):
        if single_lines:
            for tag in soup():
                for attr in tag.attrs:
                    print(tag.attrs[attr], tag.attrs[attr].__class__)
                    tag.attrs[attr] = " ".join(
                        tag.attrs[attr].replace("\n", " ").split())

        r = re.compile(r'^(\s*)', re.MULTILINE)
        return r.sub(r'\1' * indent_width, soup.prettify())


    def html_beautify(text):
        soup = BeautifulSoup(text, "html.parser")
        return prettify(soup)

    print(html_beautify(test1))

Traceback:

    dialer-capmaign-console <class 'str'>
    ['fill-vertically'] <class 'list'>
    Traceback (most recent call last):
      File "d:\mcve\x.py", line 35, in <module>
        print(html_beautify(test1))
      File "d:\mcve\x.py", line 33, in html_beautify
        return prettify(soup)
      File "d:\mcve\x.py", line 25, in prettify
        tag.attrs[attr].replace("\n", " ").split())
    AttributeError: 'list' object has no attribute 'replace'
2 answers

BeautifulSoup is trying to preserve the newlines and multiple spaces that you had in the attribute values in the input HTML.

One workaround is to iterate over each element's attributes and clean them up before prettifying, removing newlines and replacing runs of consecutive spaces with a single space:

    for tag in soup():
        for attr in tag.attrs:
            tag.attrs[attr] = " ".join(tag.attrs[attr].replace("\n", " ").split())

    print(soup.prettify())

Prints:

    <html>
     <head>
     </head>
     <body>
      <h1 style="text-align: center;">
       Main site
      </h1>
      <div>
       <p style="color: blue; text-align: center;">
        text1
       </p>
       <p style="color: blueviolet; text-align: center;">
        text2
       </p>
      </div>
      <div>
       <p style="text-align:center">
        <img alt="Testing static images" src="./foo/test.jpg" style=""/>
       </p>
      </div>
     </body>
    </html>
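
The same cleanup can probably also be applied at output time through the custom formatter that the question mentions. The following is only a sketch under that assumption: collapse_spaces and prettify_on_single_lines are hypothetical names, not part of Beautiful Soup. It relies on prettify() accepting a callable as formatter and applying it to each text string and attribute value.

```python
def collapse_spaces(value: str) -> str:
    """Replace each run of whitespace (newlines included) with one space."""
    return " ".join(value.split())


def prettify_on_single_lines(html: str) -> str:
    """Prettify html, collapsing whitespace in text and attribute values."""
    # Imported lazily so collapse_spaces can be tried without bs4 installed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    # A callable formatter is applied to each string and attribute value.
    return soup.prettify(formatter=collapse_spaces)


print(collapse_spaces("\n    color: blue;\n    text-align: center;\n"))
# prints: color: blue; text-align: center;
```

One caveat: as far as I can tell, a callable formatter takes the place of Beautiful Soup's default entity substitution, so characters such as & in text would be written out unescaped. Check the output before relying on this.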

Update (to handle multi-valued attributes such as class):

You just need a slight modification that adds special handling for the case when the attribute value is a list:

    for tag in soup():
        tag.attrs = {
            attr: [" ".join(attr_value.replace("\n", " ").split()) for attr_value in value]
                  if isinstance(value, list)
                  else " ".join(value.replace("\n", " ").split())
            for attr, value in tag.attrs.items()
        }
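
If the comprehension above is hard to read, the same logic can be split into small helpers. This is only a sketch: normalize_value and clean_attrs are hypothetical names, and a plain dict stands in for tag.attrs so the example runs without Beautiful Soup. Note that str.split() with no arguments already splits on newlines, so the extra replace("\n", " ") is not needed here.

```python
def normalize_value(value):
    """Collapse whitespace in a str, or in every item of a list of str."""
    if isinstance(value, list):
        # Multi-valued attributes such as class arrive as a list of strings.
        return [" ".join(item.split()) for item in value]
    return " ".join(value.split())


def clean_attrs(attrs):
    """Return a copy of an attribute dict with whitespace collapsed."""
    return {name: normalize_value(value) for name, value in attrs.items()}


# Plain dict standing in for tag.attrs (values are illustrative).
example = {
    "class": ["fill-vertically"],
    "style": "\n    color: blue;\n    text-align: center;\n",
}
print(clean_attrs(example))
# prints: {'class': ['fill-vertically'], 'style': 'color: blue; text-align: center;'}
```

With a real soup, the loop body would then be tag.attrs = clean_attrs(tag.attrs).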

While BeautifulSoup is more commonly used, HTML Tidy may be a better choice if you are dealing with quirky markup and have more specific requirements.

After installing the Python bindings (pip install pytidylib), try the following code:

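    from tidylib import Tidy

    tidy = Tidy()

    # assign the HTML string to `text` before running this

    config = {
        "doctype": "omit",
        # "show-body-only": True
    }

    # tidy_document returns (document, errors); print only the document
    print(tidy.tidy_document(text, options=config)[0])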

tidy.tidy_document returns a tuple containing the HTML and any errors that occurred. This code outputs:

    <html>
      <head>
        <title></title>
      </head>
      <body>
        <h1 style="text-align: center;">
          Main site
        </h1>
        <div>
          <p style="color: blue; text-align: center;">
            text1
          </p>
          <p style="color: blueviolet; text-align: center;">
            text2
          </p>
        </div>
        <div>
          <p style="text-align:center">
            <img src="./foo/test.jpg" alt="Testing static images" style="">
          </p>
        </div>
      </body>
    </html>

Uncommenting "show-body-only": True and running it on the second sample gives:

    <div id="dialer-capmaign-console" class="fill-vertically" style="flex: 1 1 auto;">
      <div id="sessionsGrid" data-columns="[ { field: 'dialerSession.startTime', format:'{0:G}', title:'Start time', width:122 }, { field: 'dialerSession.endTime', format:'{0:G}', title:'End time', width:122, attributes: {class:'tooltip-column'}}, { field: 'conversationStartTime', template: cty.ui.gct.duration_dialerSession_conversationStartTime_endTime, title:'Duration', width:80}, { field: 'dialerSession.caller.lastName',template: cty.ui.gct.person_dialerSession_caller_link, title:'Caller', width:160 }, { field: 'noteType',template:cty.ui.gct.nameDescription_noteType, title:'Note type', width:150, attributes: {class:'tooltip-column'}}, { field: 'note', title:'Note'} ]"></div>
    </div>

For further options and customization, see the HTML Tidy configuration documentation. There are wrapping options specific to attributes that may help. As you can see, empty elements occupy only one line, and HTML Tidy automatically tries to add things like the DOCTYPE, head and title tags.
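
As an illustration of the kind of options meant here, the dictionary below is a hypothetical configuration, not the one used to produce the output above. The option names come from the HTML Tidy option reference, but the values are assumptions to adjust for your own markup. "wrap": 0 turns line wrapping off, and "wrap-attributes" controls whether Tidy may line-wrap attribute values.

```python
# Hypothetical option set; pass it as tidy.tidy_document(text, options=tidy_options).
tidy_options = {
    "doctype": "omit",        # as in the answer's config: emit no DOCTYPE
    "indent": "auto",         # let Tidy decide which elements to indent
    "indent-spaces": 4,       # assumed indentation width
    "wrap": 0,                # 0 disables line wrapping
    "wrap-attributes": "no",  # do not line-wrap attribute values
}

for name, value in tidy_options.items():
    print(f"{name}: {value}")
```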
