Convert html to text using Python
I am trying to convert an html block to text using Python.
Input:
    <div class="body"><p><strong></strong></p>
    <p><strong></strong>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
    <p>Consectetuer adipiscing elit. <a href="http://example.com/" target="_blank" class="source">Some Link</a> Aenean commodo ligula eget dolor. Aenean massa</p>
    <p>Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
    <p>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
    <p>Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p></div>

Desired output:
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Some Link Aenean commodo ligula eget dolor. Aenean massa
Aenean massa. Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
I tried using the html2text module without much success (I'm pretty new to Python :))
Here is what I tried:
    #!/usr/bin/env python

    import urllib2
    import html2text
    from BeautifulSoup import BeautifulSoup

    soup = BeautifulSoup(urllib2.urlopen('http://example.com/page.html').read())

    txt = soup.find('div', {'class' : 'body'})

    print html2text.html2text(txt)

The txt object contains the HTML block above. I would like to convert it to text and print it on the screen.
Any help with a piece of code would be greatly appreciated.
What am I missing? soup.get_text() gives exactly the result you wanted:

    from bs4 import BeautifulSoup

    # html holds the block from the question
    soup = BeautifulSoup(html, "html.parser")
    print(soup.get_text())

Output:

    Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
    Consectetuer adipiscing elit. Some Link Aenean commodo ligula eget dolor. Aenean massa
    Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
    Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
    Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa

PS: To match your example exactly, you can replace each newline with a double newline:

    soup.get_text().replace('\n', '\n\n')

You can use regex, but it is not recommended.
The following code simply removes all the HTML tags in your data, giving you text.
    import re

    data = """<div class="body"><p><strong></strong></p>
    <p><strong></strong>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
    <p>Consectetuer adipiscing elit. <a href="http://example.com/" target="_blank" class="source">Some Link</a> Aenean commodo ligula eget dolor. Aenean massa</p>
    <p>Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
    <p>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
    <p>Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p></div>"""

    # Non-greedy match: strip everything between each '<' and the next '>'
    data = re.sub(r'<.*?>', '', data)

    print(data)

Output:

    Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
    Consectetuer adipiscing elit. Some Link Aenean commodo ligula eget dolor. Aenean massa
    Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
    Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
    Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa

Passing '\n' as the separator to get_text() puts a newline between paragraphs:
    from bs4 import BeautifulSoup

    # text holds the HTML to convert
    soup = BeautifulSoup(text, "html.parser")
    print(soup.get_text('\n'))

I needed a way to do this on the client system without loading additional libraries. I did not find a good solution, so I wrote my own. Feel free to use it if you want.
    import urllib

    def html2text(strText):
        str1 = strText
        # keep only the part between <body ...> and </body>, if present
        int2 = str1.lower().find("<body")
        if int2>0:
            str1 = str1[int2:]
        int2 = str1.lower().find("</body>")
        if int2>0:
            str1 = str1[:int2]
        # tag fragments in list1 are replaced by the matching character in list2
        list1 = ['<br>', '<tr', '<td', '</p>', 'span>', 'li>', '</h', 'div>' ]
        list2 = [chr(13), chr(13), chr(9), chr(13), chr(13), chr(13), chr(13), chr(13)]
        bolFlag1 = True   # False while inside <script>, <noscript> or <style>
        bolFlag2 = True   # False while inside a tag
        strReturn = ""
        for int1 in range(len(str1)):
            str2 = str1[int1]
            for int2 in range(len(list1)):
                if str1[int1:int1+len(list1[int2])].lower() == list1[int2]:
                    strReturn = strReturn + list2[int2]
            if str1[int1:int1+7].lower() == '<script' or str1[int1:int1+9].lower() == '<noscript':
                bolFlag1 = False
            if str1[int1:int1+6].lower() == '<style':
                bolFlag1 = False
            if str1[int1:int1+7].lower() == '</style':
                bolFlag1 = True
            if str1[int1:int1+9].lower() == '</script>' or str1[int1:int1+11].lower() == '</noscript>':
                bolFlag1 = True
            if str2 == '<':
                bolFlag2 = False
            if bolFlag1 and bolFlag2 and (ord(str2) != 10) :
                strReturn = strReturn + str2
            if str2 == '>':
                bolFlag2 = True
            if bolFlag1 and bolFlag2:
                # collapse whitespace around the line-break markers
                strReturn = strReturn.replace(chr(32)+chr(13), chr(13))
                strReturn = strReturn.replace(chr(9)+chr(13), chr(13))
                strReturn = strReturn.replace(chr(13)+chr(32), chr(13))
                strReturn = strReturn.replace(chr(13)+chr(9), chr(13))
                strReturn = strReturn.replace(chr(13)+chr(13), chr(13))
        strReturn = strReturn.replace(chr(13), '\n')
        return strReturn

    url = "http://www.theguardian.com/world/2014/sep/25/us-air-strikes-islamic-state-oil-isis"
    # urllib.urlopen is Python 2; on Python 3 use urllib.request.urlopen and decode the bytes
    html = urllib.urlopen(url).read()
    print(html2text(html))

If you need this without any libraries, I coded this today:
https://github.com/iFA88/python-html-to-text
Feel free to use or modify it.
Works with Python 2.x; Python 3 has not been tested. No libraries are required.
Usage example:
    def htmlToText(html):
        def _getElement(subhtml, name, end=None):
            ename = "<"+name+">"
            a = subhtml.lower().find(ename)
            if a == -1:
                ename = "<"+name+" "
                a = subhtml.lower().find(ename)
            if a == -1:
                return
            if end == None:
                end = "</"+name+">"
            b = subhtml.lower()[a+len(ename):].find(end)+a+len(end)+len(ename)
            if b-a-len(end)-len(ename) == -1:
                b = subhtml[a+len(ename):].find('>')+a+len('>')+len(ename)
            return subhtml[a:b]

        def _getElementAttribute(element, name):
            a = element.lower().find(name+'="')+len(name+'="')
            if a == -1:
                return
            b = element[a:].find('"')+a
            return element[a:b]

        def _getElementContent(element):
            a = element.find(">")+len(">")
            if a == -1:
                return
            b = len(element)-element[::-1].find('<')-1
            return element[a:b]

        ret = ""

        # get the title, if you want it
        headElement = _getElement(html, 'head')
        if headElement:
            titleElement = _getElement(headElement, 'title')
            if titleElement:
                titleContent = _getElementContent(titleElement)
                if titleContent:
                    ret += titleContent+"\n\n"

        # get body content
        bodyElement = _getElement(html, 'body')
        if bodyElement:
            bodyContent = _getElementContent(bodyElement)
            if bodyContent:
                ret += bodyContent

        # remove javascript
        while True:
            scriptElement = _getElement(ret, 'script')
            if not scriptElement:
                scriptElement = _getElement(ret, 'script', '</noscript>')
            if not scriptElement:
                break
            ret = ret.replace(scriptElement, '')

        # remove style
        while True:
            styleElement = _getElement(ret, 'style')
            if not styleElement:
                break
            ret = ret.replace(styleElement, '')

        # replace links
        while True:
            linkElement = _getElement(ret, 'a')
            if not linkElement:
                break
            linkElementContent = _getElementContent(linkElement)
            if linkElementContent:
                # this will replace: '<a href="some.site">text</a>' -> 'text'
                # ret = ret.replace(linkElement, linkElementContent)

                # this will replace: '<a href="some.site">link</a>' -> 'some.site'
                # linkElementHref = _getElementAttribute(linkElement, 'href')
                # if linkElementHref:
                #     ret = ret.replace(linkElement, linkElementHref)

                # this will replace: '<a href="some.site">link</a>' -> 'text ( some.site )'
                linkElementHref = _getElementAttribute(linkElement, 'href')
                if linkElementHref:
                    ret = ret.replace(linkElement, linkElementContent+' ( '+linkElementHref+' )')

        # replace paragraphs
        while True:
            paragraphElement = _getElement(ret, 'p')
            if not paragraphElement:
                break
            paragraphElementContent = _getElementContent(paragraphElement)
            if paragraphElementContent:
                ret = ret.replace(paragraphElement, '\n\n'+paragraphElementContent+'\n\n')
            else:
                ret = ret.replace(paragraphElement, '')

        # replace line breaks
        ret = ret.replace('<br>', '\n')
        ret = ret.replace('<br/>', '\n')

        # replace bolds
        while True:
            boldElement = _getElement(ret, 'b')
            if not boldElement:
                break
            boldElementContent = _getElementContent(boldElement)
            if boldElementContent:
                ret = ret.replace(boldElement, boldElementContent.upper())
            else:
                ret = ret.replace(boldElement, '')

        # replace images
        while True:
            imgElement = _getElement(ret, 'img')
            if not imgElement:
                break
            imgElementSrc = _getElementAttribute(imgElement, 'src')
            if imgElementSrc:
                ret = ret.replace(imgElement, '[IMG] '+imgElementSrc+' [IMG]')
            else:
                ret = ret.replace(imgElement, '')

        # remove the remaining elements
        while True:
            a = ret.find("<")
            if a == -1:
                break
            b = ret[a:].find(">")+a
            if b-a == -1:
                break
            b2 = ret[b:].find(">")+b
            if b2-b == -1:
                break
            element = _getElement(ret, ret[a+1:b2])
            if element:
                elementContent = _getElementContent(element)
                if elementContent:
                    ret = ret.replace(element, elementContent)
                else:
                    ret = ret.replace(element, '')
        return ret

    html = """
    <html>
    <head>
    <meta charset="UTF-8">
    <title>I'm a nice website title</title>
    <script src='script.js'></script>
    <link rel="icon" type="image/x-icon" href="favicon.ico">
    <style>
    body {
    display: inline-block;
    font-family: Verdana;
    margin: 0;
    overflow-x: hidden;
    padding-bottom: 20px;
    padding-top: 20px;
    text-align: center;
    width: 850px;
    }
    </style>
    </head>
    <body>
    <style>
    p {
    font-size: 1em;
    }
    </style>
    <script>
    document.write('Yes, i\'m a javascript!');
    </script>
    <p>
    Im a text with <b>bold</b> and <i>italic</i> content.<br>If you <span style="font-size:2em">like</span> this visit my <a href="ethereumlottery.net" target="_blank">site</a>.
    </p>
    Here is a image: <img src="veryniceimage"/><br>
    Here is a image with other format: <img src="veryniceimage"><br>
    Here is a image with link: <a href="ethereumlottery.net"><img src="veryniceimage"/></a><br>
    </body>
    </html>""".replace('\n','').replace('\t','')

    print(htmlToText(html))

Result:
    I'm a nice website title

    Im a text with BOLD and italic content.
    If you like this visit my site ( ethereumlottery.net ).

    Here is a image: [IMG] veryniceimage [IMG]
    Here is a image with other format: [IMG] veryniceimage [IMG]
    Here is a image with link: [IMG] veryniceimage [IMG] ( ethereumlottery.net )

You can use BeautifulSoup to remove unwanted scripts, etc., although you may need to experiment with several different sites to make sure that you cover the different types of things you want to exclude. Try the following:
    from requests import get
    from bs4 import BeautifulSoup as BS

    response = get('http://news.bbc.co.uk/2/hi/health/2284783.stm')
    soup = BS(response.content, "html.parser")

    # iterate over a copy, because decompose() modifies the tree
    for child in list(soup.body.children):
        if child.name == 'script':
            child.decompose()

    print(soup.body.get_text())
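
For anyone who, like two of the answers above, wants to avoid third-party libraries: on Python 3 the standard library's html.parser module can do the tag stripping. The following is only a minimal sketch, not a replacement for BeautifulSoup. The names TextExtractor and html_to_text are made up for this example, and the set of tags treated as line breaks is an assumption you would probably need to tune for real pages.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}
    # Closing one of these tags ends a line (an assumed, incomplete list)
    BLOCK = {"p", "div", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Strip each line and drop the empty ones
    lines = (line.strip() for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


sample = ('<div class="body"><p><strong></strong></p> <p>Lorem ipsum</p> '
          '<p>Consectetuer <a href="http://example.com/">Some Link</a> Aenean</p>'
          '<script>var x = 1;</script></div>')
print(html_to_text(sample))
```

This prints one line per paragraph and keeps the link text while dropping the script body. To get a blank line between paragraphs, as in the question, you could join with '\n\n' instead of '\n'.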