Parsing HTML "visually"

OK, I'm at a loss for what to call this question. I have some HTML files, probably written by Lord Lucifer himself, that I need to parse. They consist of many segments like this, among other HTML tags:

<p>HeadingNumber</p> <p style="text-indent:number;margin-top:neg_num ">Heading Text</p> <p>Body</p> 

Please note that the heading number and the heading text are in separate p tags, aligned horizontally with CSS. The CSS can be anything Lucifer wants: a mixture of indentation, padding, margins and positioning.

However, that line is a single object in my business model and should be stored as such. So, how can I determine whether two p elements are visually on the same line and process them accordingly? The HTML files are well-formed, if that helps.

html c# parsing
3 answers

You did not specify what you are working in, but this is possible in jQuery, since you can determine the offset position of any element relative to the top of the document.

The code:

    $(function () {
        // True if the top edges of the two elements are within `tolerance` pixels.
        function sameHorizon(obj1, obj2, tolerance) {
            tolerance = tolerance || 0;
            var obj1top = obj1.offset().top;
            var obj2top = obj2.offset().top;
            return Math.abs(obj1top - obj2top) <= tolerance;
        }

        $('p').each(function (i, obj) {
            // A negative margin-top suggests this <p> is pulled up beside the previous one.
            if (parseFloat($(obj).css('margin-top')) < 0) {
                var p1 = $(obj).prev('p');
                var p2 = $(obj);
                var pTol = 4; // pixel tolerance within which elements are considered aligned
                // Skip paragraphs that have no preceding <p> sibling.
                if (p1.length && sameHorizon(p1, p2, pTol)) {
                    // Put what you want to do with these objects here.
                    // I just highlighted them for example...
                    p1.css('background', '#cc0');
                    p2.css('background', '#c0c');
                    // ...but you can manipulate their contents.
                    console.log(p1.html(), p2.html());
                }
            }
        });
    });

This code is based on the assumption that if a <p> has a negative margin-top, it is being pulled up to align with the previous <p>. If you know jQuery, it should be obvious how to change it to meet different criteria.
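As one example of a different criterion, here is a minimal sketch that ignores margins entirely and simply groups consecutive elements whose top offsets fall within a tolerance. It works on plain objects so it can run outside a browser. The function name groupIntoLines, the { text, top } shape and the sample offsets are all made up for illustration; in a real page the top values would have to be measured first, for instance with offset().top as above.

```javascript
// Group consecutive boxes into visual lines by comparing their top offsets.
// Each box is a plain object { text, top }. Assumption: the offsets were
// collected beforehand (e.g. from jQuery's offset().top in a browser).
function groupIntoLines(boxes, tolerance) {
    tolerance = tolerance || 0;
    var lines = [];
    boxes.forEach(function (box) {
        var current = lines[lines.length - 1];
        // Compare against the first box of the current line so that small
        // differences cannot accumulate across a long run of elements.
        if (current && Math.abs(current[0].top - box.top) <= tolerance) {
            current.push(box);
        } else {
            lines.push([box]);
        }
    });
    return lines;
}

// Hypothetical measurements for the three paragraphs in the question.
var sample = [
    { text: 'HeadingNumber', top: 100 },
    { text: 'Heading Text', top: 102 },
    { text: 'Body', top: 130 }
];

var merged = groupIntoLines(sample, 4).map(function (line) {
    return line.map(function (box) { return box.text; }).join(' ');
});
console.log(merged);
```

With these sample values the heading number and heading text end up in one group and the body in another. Only neighbouring elements are compared, so this assumes that elements sharing a line are also adjacent in document order.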

If you cannot use jQuery for your problem, then I hope this is useful to someone else who can, or that you can rig something up in jQuery to analyze the pages and output new markup.


You can try the IRobotSoft web scraper and test whether it handles this:

  • Open the page in a browser window.
  • Select and mark the line.
  • Use the menu: Design → Practice HTQL and see if it can extract the line.

I do not have much experience with this, but if the HTML is well-formed, and depending on the format you need your parsed data in, you could treat it as an XML document and use XQuery to extract your data.

Also, open the HTML in Firefox and see if you can work out with Firebug which CSS styles are applied. This may give you a clearer idea of how the HTML is lined up, although it seems like it is done using "margin-top: negative_number". If that is the case, I think XQuery should be able to find the elements with this particular style.
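To illustrate the "find elements with this particular style" idea, here is a rough sketch in JavaScript rather than XQuery. It only inspects the text of an inline style attribute, as in the question's sample markup, and reports whether margin-top is negative. The helper names parseInlineStyle and hasNegativeMarginTop are invented for this example, and the check would miss styles coming from a stylesheet or from the margin shorthand.

```javascript
// Parse an inline style string such as "text-indent:20px;margin-top:-18px"
// into an object of lower-cased property names and trimmed values.
// Limitation: values that themselves contain ';' are not handled.
function parseInlineStyle(styleText) {
    var declarations = {};
    (styleText || '').split(';').forEach(function (part) {
        var colon = part.indexOf(':');
        if (colon === -1) { return; } // skip empty or malformed pieces
        var name = part.slice(0, colon).trim().toLowerCase();
        var value = part.slice(colon + 1).trim();
        if (name) { declarations[name] = value; }
    });
    return declarations;
}

// True only when an explicit margin-top declaration has a negative number.
function hasNegativeMarginTop(styleText) {
    var value = parseInlineStyle(styleText)['margin-top'];
    return value !== undefined && parseFloat(value) < 0;
}

// Hypothetical style values modelled on the question's sample markup.
console.log(hasNegativeMarginTop('text-indent:20px;margin-top:-18px '));
console.log(hasNegativeMarginTop('text-indent:20px'));
```

A paragraph that passes this check is only a candidate for sharing a line with the previous one; whether it really lines up still depends on the rendered layout.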
