For a blog-like project, I want to take the first few paragraphs, headings, lists, or other elements, up to a given number of characters, from the markup of the generated HTML fragment and display them as a summary.
So, if I have:
<h1>hello world</h1> <p>Lets say these are 100 chars</p> <ul> <li>some bla bla, 40 chars</li> </ul> <p>some other text</p>
And suppose I want a summary of roughly the first 150 characters (it does not need to be too precise; I could just take the first 150 characters, including tags, and work from there, but that would probably leave artifacts at the tail, which could be more complicated to clean up ...), it should give me the h1, the p, and the ul, but not the final p (which would be truncated). If the first element alone has more than 150 characters, I would take the full first element.
How can I get this? Using XPath or regex? I am a bit out of ideas on this ...
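To make the requirement concrete, here is a minimal sketch of the selection rule described above. It is not taken from any answer in this thread. It assumes the fragment is well-formed XML (every tag closed, no HTML-only entities) so that Ruby's REXML can parse it, and it counts the length of each element's markup, tags included. The name `summarize_html` is hypothetical.

```ruby
require "rexml/document"

# Hypothetical helper: returns the leading top-level elements of a fragment
# whose combined markup fits within `limit` characters.
# Assumption: the fragment is well-formed XML, so REXML can parse it.
def summarize_html(fragment, limit)
  # Wrap the fragment in a throwaway root so it parses as a single document.
  doc = REXML::Document.new("<root>#{fragment}</root>")
  summary = +""
  doc.root.elements.each do |element|
    markup = element.to_s
    # Always keep the first element; stop before a later one that would overflow.
    break if !summary.empty? && summary.length + markup.length > limit
    summary << markup
  end
  summary
end

html = "<h1>hello world</h1> <p>first paragraph</p> <ul><li>an item</li></ul> <p>some other text</p>"
puts summarize_html(html, 60)
```

Whitespace between top-level elements is dropped because only element nodes are visited. A real HTML fragment that is not valid XML would need an HTML parser instead.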
Edit
First, I want to say THANKS to everyone who answered!
While there were really excellent answers in this thread, it turned out to be much easier for me to hook in before the Markdown interpreter runs, take the first n text blocks separated by \r\n\r\n, and just pass those on to the Markdown generation.
class String
  def summarize_md(length)
    arr = self.split(/\r\n\r\n/)
    sum = ""
    arr.each do |ea|
      break if sum.length + ea.length > length
      sum = sum + "#{ea}\r\n\r\n"
    end
    sum
  end
end
Although this code could probably be reduced to one line, it is still much simpler and more CPU-friendly than any of the proposed solutions. In any case, since my question could be read as having HTML as the starting point (rather than the Markdown text), I will just accept the first answerer's answer ... I hope that is fine ...
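For comparison, a more compact variant might look like the sketch below. It differs from the method above in two ways, both assumptions made for illustration: it always keeps the first block even when that block alone exceeds the limit (as the question asked for), and it counts only the block text, not the separators. The name `summarize_blocks` is hypothetical.

```ruby
# Hypothetical variant: keeps leading blocks (separated by \r\n\r\n) while
# their combined text length stays within `limit`; the first block is
# always kept, even if it is longer than `limit` on its own.
def summarize_blocks(text, limit)
  total = 0
  kept = text.split(/\r\n\r\n/).each_with_index.take_while do |block, index|
    total += block.length
    index.zero? || total <= limit
  end
  # take_while yields [block, index] pairs; keep only the blocks.
  kept.map(&:first).join("\r\n\r\n")
end

md = "# Title\r\n\r\nFirst paragraph.\r\n\r\nSecond paragraph, longer."
puts summarize_blocks(md, 30)
```

The result can be passed to the Markdown generator as before.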