Html div nesting? using google fetchurl

Question

I am trying to grab a table from the following web page

http://www.bloomberg.com/markets/companies/country/hong-kong/

I have an example code that was kindly provided by Phil Bozak here:

grab table from html using google script

which captures the table for this website:

http://www.airchina.com.cn/www/en/html/index/ir/traffic/

As you can see from Phil's code, there are a lot of "getElement()" calls in it. If I look at the HTML source of the Air China page, it looks like the table is nested four levels deep. Is that why there is a string of .getElement() calls?

Now I look at the source of the Bloomberg page, and it is loaded with "div" elements...

The question is, can someone show me how to grab a table from this Bloomberg page?

A brief explanation of the theory would also be helpful. Thanks a lot.

dom html web-scraping google-apps-script

jason May 31 '13 at 13:42

1 answer

Mogsdad · Accepted Answer · 2013-05-31T15:16:27+0000

Let me turn your question upside down and start with the theory. "Methodology" may be a better word for it.

You want to get at something specific on a structured page. To do this, you either need a way to zero in on the element (which is possible if it is labeled in a unique way that we can access), or you need to navigate the structure more or less manually. You already know how to look at the source of the page, so you are familiar with that step. Here is a screenshot of the Firefox Inspector, highlighting the element of interest to us.

We can see the hierarchy of elements that leads to the table: html, body, div, div, div.ticker, table.ticker_data. We also see the source:

<table class="ticker_data">

Oh good, it's labeled! Unfortunately, this class information is dropped when our script processes the HTML. Bummer. If it were id="ticker_data" instead, we could use the getElementByVal() utility from this answer to reach it, and give ourselves some immunity from future page restructuring. Put a pin in that; we will come back to it.

It can help to visualize the document in the debugger. Here is a utility script for that. Run it in debug mode, and you will have your HTML document available to explore:

 /**
  * Debug-run this in the editor to be able to explore the structure of web pages.
  *
  * Set target to the page you're interested in.
  */
 function pageExplorer() {
   var target = "http://www.bloomberg.com/markets/companies/country/hong-kong/";
   var pageTxt = UrlFetchApp.fetch(target).getContentText();
   var pageDoc = Xml.parse(pageTxt,true);
   debugger;  // Pause in debugger - explore pageDoc
 }

Here's what our page looks like in the debugger:

You might be wondering what the numbered elements are, because you do not see them in the source. When an XML document contains multiple elements of the same type at the same level, the parser presents them as an array numbered 0..n. Thus, when we see 0 under div in the debugger, that tells us that there are several <div> tags at this level of the HTML source, and we can refer to them as an array, for example .div[0].

OK, with the theory behind us, let's go ahead and see how we can access the table by brute force.

Knowing the hierarchy, including the div arrays shown in the debugger, we could do this, à la Phil's previous answer. I will use some weird indentation to illustrate the structure of the document:
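To make the numbering concrete, here is a minimal sketch in plain JavaScript that models the behaviour described above: a single child is stored directly under its tag name, while repeated children of the same type become an array. This is not the Xml service itself; the function name groupChildren and the input shape are assumptions made purely for illustration.

```javascript
// Hypothetical model (not the Xml service): group child nodes by tag name,
// turning repeated names into arrays indexed 0..n.
function groupChildren(children) {
  var grouped = {};
  children.forEach(function (child) {
    var name = child.name;
    if (!Object.prototype.hasOwnProperty.call(grouped, name)) {
      grouped[name] = child;                   // first of its kind: stored directly
    } else if (Array.isArray(grouped[name])) {
      grouped[name].push(child);               // third or later: append to the array
    } else {
      grouped[name] = [grouped[name], child];  // second: promote to an array
    }
  });
  return grouped;
}

// Made-up children of a <body>: two divs and one table.
var body = groupChildren([
  { name: "div", id: "header" },
  { name: "table" },
  { name: "div", id: "content" }
]);
// body.div is now an array, so body.div[1] is the second <div>;
// body.table is a single object because there is only one <table>.
```

In this model, an index appears only where a tag name repeats, which matches a path that mixes indexed steps such as .div[0] with plain ones such as .table.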

 ...
 var target = "http://www.bloomberg.com/markets/companies/country/hong-kong/";
 var pageTxt = UrlFetchApp.fetch(target).getContentText();
 var pageDoc = Xml.parse(pageTxt,true);
 var table = pageDoc.getElement()
               .getElement("body")
                 .getElements("div")[0]      // 0-th div under body, shown in debugger
                   .getElements("div")[5]    // 5-th div under there
                     .getElement("div")      // another div
                       .getElement("table"); // finally, our table

As a much more compact alternative to all of these .getElement() calls, we can navigate using dot notation.

 var table = pageDoc.getElement().body.div[0].div[5].div.table;
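A hard-coded path like this is brittle: if the page layout changes, some step in the chain no longer resolves. As a sketch of a more defensive variant, the hypothetical helper below follows a list of property names and indexes and reports which step failed. It runs against a plain object standing in for the parsed document; walkPath and the mock tree are assumptions for illustration and are not part of the Xml service.

```javascript
// Hypothetical helper: follow a path of keys/indexes through nested objects
// and fail with a message naming the step that could not be resolved.
function walkPath(root, path) {
  var node = root;
  for (var i = 0; i < path.length; i++) {
    var next = (node === null || node === undefined) ? undefined : node[path[i]];
    if (next === undefined || next === null) {
      throw new Error("Path broke at step " + i + " (" + path[i] + ")");
    }
    node = next;
  }
  return node;
}

// Mock tree standing in for pageDoc.getElement(); the shape is made up.
var mockRoot = {
  body: {
    div: [
      { div: [{}, {}, {}, {}, {}, { id: "content", div: { table: { rows: 3 } } }] }
    ]
  }
};

// Same route as body.div[0].div[5].div.table, expressed as data.
var table = walkPath(mockRoot, ["body", "div", 0, "div", 5, "div", "table"]);
```

Expressing the route as an array keeps it in one place, so it is easy to adjust when the debugger shows that the page structure has shifted.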

That's it.

Let's get back to the idea we put a pin in. In the debugger, we see that various attributes are attached to the elements. In particular, there is an "id" on that div[5], which contains a div that contains the table. Remember that in the source we saw class attributes, but note that they do not make it this far.

However, the fact that a kind programmer put this "id" in place means that we can do this, using getDivById() from this earlier question:

 var contentDiv = getDivById( pageDoc.getElement().body, 'content' );
 var table = contentDiv.div.table;

If they move things around, we can still find this table without changing our code, as long as the "content" div and what is inside it stay the same.
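The getDivById() helper from the linked question is not reproduced here. As a rough illustration of the idea only, the sketch below does a depth-first search for an id over a plain-object tree. The name findById, the `id` property and both mock layouts are assumptions; they do not reflect the actual helper or the Xml service API.

```javascript
// Hypothetical depth-first search for a node whose `id` property matches.
function findById(node, id) {
  if (node === null || typeof node !== "object") return null;
  if (!Array.isArray(node) && node.id === id) return node;
  var keys = Object.keys(node);
  for (var i = 0; i < keys.length; i++) {
    var found = findById(node[keys[i]], id);
    if (found) return found;
  }
  return null;
}

// Two made-up layouts: the "content" div sits at different depths.
var layoutA = { div: [{ div: [{}, { id: "content", div: { table: "A" } }] }] };
var layoutB = { div: { id: "content", div: { table: "B" } } };

// The same lookup works for both, because it keys on the id, not the position.
var tableA = findById(layoutA, "content").div.table;
var tableB = findById(layoutB, "content").div.table;
```

The search still depends on the structure below the matched element (here .div.table), so only the path leading to the id is shielded from layout changes.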

You already know what to do when you have a table element, so we are done here!
