Convert HTML to plain text and maintain structure/formatting, with Ruby

I'd like to convert HTML to plain text. I don't want to just strip the tags; I'd like to intelligently preserve as much formatting as possible: insert line breaks for <br> tags, detect paragraphs and format them as such, and so on.

The input is pretty simple, usually well-formed HTML (not entire documents, just chunks of content, usually with no anchors or images).

I could hack together a couple of regexps that get me 80% of the way there, but I figured there might be existing solutions with more intelligence.

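First, don't try to use regex for this. Odds are good that you'll end up with a fragile/brittle solution, one that breaks whenever the HTML changes.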

You can get part of the way there quickly by using Nokogiri to parse the HTML and extract the text:

require 'nokogiri'

html = '
<html>
<body>
  <p>This is
  some text.</p>
  <p>This is some more text.</p>
  <pre>
  This is
  preformatted
  text.
  </pre>
</body>
</html>
'

doc = Nokogiri::HTML(html)
puts doc.text

>>  This is
>>  some text.
>>  This is some more text.
>>  
>>  This is
>>  preformatted
>>  text.

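This works because Nokogiri returns the text nodes, which are basically the whitespace surrounding the tags along with the text contained in the tags. If you pre-clean the HTML with tidy, you can sometimes get much nicer output.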

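The problem comes when you compare the output of a parser, or of any other means of reading the HTML, with what a browser displays. A browser is concerned with presenting the HTML as pleasingly as possible, even when the HTML is horribly malformed. A parser is not designed to do that.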

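You can massage the HTML before extracting the content: remove extraneous line breaks such as "\n" and "\r", then replace <br> tags with line breaks. There are many questions on SO explaining how to replace tags with something else. I think the Nokogiri site also covers that in one of its tutorials.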

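If you really want to do it right, you'll also need to decide what to do with <li> tags inside <ul> and <ol> tags, along with tables.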

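An alternative approach is to capture the output of a text browser such as lynx. Several years ago I needed to do text processing on websites that didn't use meta keyword tags, and I found a text browser that let me grab the rendered output that way. I don't have the source available, so I can't check which one it was.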
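For the text-browser route, something along these lines might work. It is only a sketch: it assumes lynx is installed and on the PATH, and the helper name `lynx_dump` is invented here. The flags are standard lynx options: -stdin reads the page from standard input, -dump prints the rendered text and exits, -nolist leaves out the list of links, and -force_html treats the input as HTML.

```ruby
require 'open3'

# Hypothetical helper: pipe HTML through lynx and return the rendered text.
# Returns nil if lynx is not installed or exits with an error.
def lynx_dump(html)
  out, status = Open3.capture2(
    'lynx', '-stdin', '-dump', '-nolist', '-force_html',
    stdin_data: html
  )
  status.success? ? out : nil
rescue Errno::ENOENT
  # Raised when the lynx executable cannot be found.
  nil
end

text = lynx_dump('<p>This is some text.</p><ul><li>One</li><li>Two</li></ul>')
puts text || 'lynx is not available'
```

The upside is that lynx already makes the layout decisions for paragraphs, lists and tables; the downside is an external dependency and a process spawn per document.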
