How can I prettify HTML so that tag attributes stay on a single line?
I got this little code snippet:
    text = """<html><head></head><body>
        <h1 style="
        text-align: center;
    ">Main site</h1>
        <div>
            <p style="
        color: blue;
        text-align: center;
    ">text1
            </p>
            <p style="
        color: blueviolet;
        text-align: center;
    ">text2
            </p>
        </div>
        <div>
            <p style="text-align:center">
                <img src="./foo/test.jpg" alt="Testing static images" style="
    ">
            </p>
        </div>
    </body></html>
    """

    import sys
    import re
    import bs4


    def prettify(soup, indent_width=4):
        r = re.compile(r'^(\s*)', re.MULTILINE)
        return r.sub(r'\1' * indent_width, soup.prettify())

    soup = bs4.BeautifulSoup(text, "html.parser")
    print(prettify(soup))

The current output of the snippet above:
    <html>
        <head>
        </head>
        <body>
            <h1 style="
                    text-align: center;
    ">
                Main site
            </h1>
            <div>
                <p style="
                    color: blue;
                    text-align: center;
    ">
                    text1
                </p>
                <p style="
                    color: blueviolet;
                    text-align: center;
    ">
                    text2
                </p>
            </div>
            <div>
                <p style="text-align:center">
                    <img alt="Testing static images" src="./foo/test.jpg" style="
    "/>
                </p>
            </div>
        </body>
    </html>

I would like to figure out how to format the output so that it becomes the following:
    <html>
        <head>
        </head>
        <body>
            <h1 style="text-align: center;">
                Main site
            </h1>
            <div>
                <p style="color: blue;text-align: center;">
                    text1
                </p>
                <p style="color: blueviolet;text-align: center;">
                    text2
                </p>
            </div>
            <div>
                <p style="text-align:center">
                    <img alt="Testing static images" src="./foo/test.jpg" style=""/>
                </p>
            </div>
        </body>
    </html>

In other words, I would like to keep HTML expressions such as <tag attrib1=value1 attrib2=value2 ... attribn=valuen> on one line, if possible. By "if possible" I mean without corrupting the values of the attributes themselves (value1, value2, ..., valuen).
Can this be done with beautifulsoup4? As far as I have read in the docs, it seems you can use a custom formatter, but I don't know how to write one that fulfills the requirements described above.
EDIT:
@alecxe's solution is quite simple; unfortunately, it fails in more complex cases, such as the one below:
    test1 = """
    <div id="dialer-capmaign-console" class="fill-vertically" style="flex: 1 1 auto;">
        <div id="sessionsGrid" data-columns="[
            { field: 'dialerSession.startTime', format:'{0:G}', title:'Start time', width:122 },
            { field: 'dialerSession.endTime', format:'{0:G}', title:'End time', width:122, attributes: {class:'tooltip-column'}},
            { field: 'conversationStartTime', template: cty.ui.gct.duration_dialerSession_conversationStartTime_endTime, title:'Duration', width:80},
            { field: 'dialerSession.caller.lastName',template: cty.ui.gct.person_dialerSession_caller_link, title:'Caller', width:160 },
            { field: 'noteType',template:cty.ui.gct.nameDescription_noteType, title:'Note type', width:150, attributes: {class:'tooltip-column'}},
            { field: 'note', title:'Note'}
            ]">
    </div>
    </div>
    """

    from bs4 import BeautifulSoup
    import re


    def prettify(soup, indent_width=4, single_lines=True):
        if single_lines:
            for tag in soup():
                for attr in tag.attrs:
                    print(tag.attrs[attr], tag.attrs[attr].__class__)
                    tag.attrs[attr] = " ".join(
                        tag.attrs[attr].replace("\n", " ").split())

        r = re.compile(r'^(\s*)', re.MULTILINE)
        return r.sub(r'\1' * indent_width, soup.prettify())


    def html_beautify(text):
        soup = BeautifulSoup(text, "html.parser")
        return prettify(soup)

    print(html_beautify(test1))

The output and traceback:
    dialer-capmaign-console <class 'str'>
    ['fill-vertically'] <class 'list'>
    Traceback (most recent call last):
      File "d:\mcve\x.py", line 35, in <module>
        print(html_beautify(test1))
      File "d:\mcve\x.py", line 33, in html_beautify
        return prettify(soup)
      File "d:\mcve\x.py", line 25, in prettify
        tag.attrs[attr].replace("\n", " ").split())
    AttributeError: 'list' object has no attribute 'replace'

BeautifulSoup tried to preserve the newlines and multiple spaces that you had in the attribute values in the input HTML.
One way to work around this is to iterate over the attributes of each element and clean them up before prettifying, removing newlines and replacing runs of consecutive spaces with a single space:
    for tag in soup():
        for attr in tag.attrs:
            tag.attrs[attr] = " ".join(tag.attrs[attr].replace("\n", " ").split())

    print(soup.prettify())

This prints:

    <html>
     <head>
     </head>
     <body>
      <h1 style="text-align: center;">
       Main site
      </h1>
      <div>
       <p style="color: blue; text-align: center;">
        text1
       </p>
       <p style="color: blueviolet; text-align: center;">
        text2
       </p>
      </div>
      <div>
       <p style="text-align:center">
        <img alt="Testing static images" src="./foo/test.jpg" style=""/>
       </p>
      </div>
     </body>
    </html>

Update (to handle multi-valued attributes such as class):
You only need a small modification: special handling for the case where the attribute value is a list:
    for tag in soup():
        tag.attrs = {
            attr: [" ".join(attr_value.replace("\n", " ").split()) for attr_value in value]
                  if isinstance(value, list)
                  else " ".join(value.replace("\n", " ").split())
            for attr, value in tag.attrs.items()
        }

While BeautifulSoup is more commonly used, HTML Tidy may be a better choice if you are dealing with quirks and have more specific requirements.
After installing the library for Python (pip install pytidylib), try the following code:
    from tidylib import Tidy

    tidy = Tidy()

    # `text` must already hold the HTML string to clean up
    config = {
        "doctype": "omit",
        # "show-body-only": True
    }

    # tidy_document returns (html, errors); print only the HTML
    print(tidy.tidy_document(text, options=config)[0])

tidy.tidy_document returns a tuple containing the HTML and any errors that may have occurred. This code will output:
    <html>
      <head>
        <title></title>
      </head>
      <body>
        <h1 style="text-align: center;">
          Main site
        </h1>
        <div>
          <p style="color: blue; text-align: center;">
            text1
          </p>
          <p style="color: blueviolet; text-align: center;">
            text2
          </p>
        </div>
        <div>
          <p style="text-align:center">
            <img src="./foo/test.jpg" alt="Testing static images" style="">
          </p>
        </div>
      </body>
    </html>

With "show-body-only": True uncommented, the second sample produces:
    <div id="dialer-capmaign-console" class="fill-vertically" style="flex: 1 1 auto;">
      <div id="sessionsGrid" data-columns="[ { field: 'dialerSession.startTime', format:'{0:G}', title:'Start time', width:122 }, { field: 'dialerSession.endTime', format:'{0:G}', title:'End time', width:122, attributes: {class:'tooltip-column'}}, { field: 'conversationStartTime', template: cty.ui.gct.duration_dialerSession_conversationStartTime_endTime, title:'Duration', width:80}, { field: 'dialerSession.caller.lastName',template: cty.ui.gct.person_dialerSession_caller_link, title:'Caller', width:160 }, { field: 'noteType',template:cty.ui.gct.nameDescription_noteType, title:'Note type', width:150, attributes: {class:'tooltip-column'}}, { field: 'note', title:'Note'} ]"></div>
    </div>

For more on setup and configuration, see the full list of Tidy options. There are wrapping options specific to attributes that may help. As you can see, empty elements occupy only one line, and HTML Tidy automatically tries to add things like the DOCTYPE, head and title tags.
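Returning to the BeautifulSoup approach from the earlier answer: the attribute clean-up can be isolated into a small helper that is easy to test without any third-party library. The following is only a minimal sketch. The names `collapse_whitespace` and `normalize_attrs` are hypothetical, and a plain dict stands in for a BeautifulSoup `tag.attrs` mapping. Since `str.split()` with no arguments already splits on newlines, the explicit `replace("\n", " ")` is not needed here.

```python
def collapse_whitespace(value):
    """Collapse newlines and runs of whitespace into single spaces."""
    return " ".join(value.split())


def normalize_attrs(attrs):
    """Return a copy of an attribute mapping with whitespace-normalized values.

    String values are collapsed directly; list values (the form used for
    multi-valued attributes such as ``class``) are collapsed item by item.
    """
    return {
        name: [collapse_whitespace(item) for item in value]
        if isinstance(value, list)
        else collapse_whitespace(value)
        for name, value in attrs.items()
    }


# A plain dict standing in for a BeautifulSoup ``tag.attrs`` mapping (assumption).
sample_attrs = {
    "class": ["fill-vertically"],
    "style": "\n    color: blue;\n    text-align: center;\n",
}
print(normalize_attrs(sample_attrs))
# {'class': ['fill-vertically'], 'style': 'color: blue; text-align: center;'}
```

With BeautifulSoup, this would presumably be applied as `tag.attrs = normalize_attrs(tag.attrs)` for each `tag in soup()` before calling `prettify()`.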