How to parse an HTML string in Google Apps Script without using XmlService?

I want to create a scraper for Google Spreadsheets using Google Apps Script. I know this is possible, and I have seen some tutorials and threads about it.

The basic idea is to use:

var html = UrlFetchApp.fetch('http://en.wikipedia.org/wiki/Document_Object_Model').getContentText();
var doc = XmlService.parse(html);
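And then get and work with the elements, roughly like this (a minimal sketch with illustrative child names, assuming the parse succeeds):

// Minimal sketch of navigating a parsed document with XmlService.
// The child names below are illustrative; pages served as XHTML may also
// require an explicit namespace argument on getChild()/getChildren().
var root = doc.getRootElement();
var body = root.getChild('body');
var paragraphs = body.getChildren('p');
Logger.log(paragraphs[0].getText());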

However, XmlService.parse() does not work for some pages. For example, if I try:

function test(){
  var html = UrlFetchApp.fetch("https://www.nespresso.com/br/pt/product/maquina-de-cafe-espresso-pixie-clips-preto-lima-neon-c60-220v").getContentText();
  var parse = XmlService.parse(html);
}

I get the following error:

 Error on line 225: The entity name must immediately follow the '&' in the entity reference. (line 3, file "") 

I tried using string.replace() to eliminate the characters that seem to cause the error, but that does not work either; other errors just appear instead. The following code, for example:

function test(){
  var html = UrlFetchApp.fetch("https://www.nespresso.com/br/pt/product/maquina-de-cafe-espresso-pixie-clips-preto-lima-neon-c60-220v").getContentText();
  var regExp = new RegExp("&", "gi");
  html = html.replace(regExp, "");
  var parse = XmlService.parse(html);
}

Gives me the following error:

 Error on line 358: The content of elements must consist of well-formed character data or markup. (line 6, file "") 

I believe this is a problem with the XmlService.parse() method.

I read in these threads (Google App Script parsing table from messed html and What is the best way to parse html in google apps script) that you can use a deprecated method called Xml.parse(), which takes a second parameter that allows it to parse HTML. However, as I mentioned, it is deprecated, and I cannot find its documentation anywhere. Xml.parse() does seem to parse the string, but I am having trouble working with the resulting elements because of the missing documentation. It is also not a safe long-term solution, since it may be shut down in the near future.

So, how can I parse this HTML in Google Apps Script?

I also tried:

function test(){
  var html = UrlFetchApp.fetch("https://www.nespresso.com/br/pt/product/maquina-de-cafe-espresso-pixie-clips-preto-lima-neon-c60-220v").getContentText();
  var htmlOutput = HtmlService.createHtmlOutput(html).getContent();
  var parse = XmlService.parse(htmlOutput);
}

But this does not work either; I get this error:

Not enough HTML content:

I was also thinking about using an open-source library to parse the HTML, but I could not find one.

My ultimate goal is to get some pieces of information from a set of pages, such as "Price", "Link", "Product Name", etc. I managed to do this using a series of RegExes:

var ss = SpreadsheetApp.getActiveSpreadsheet();
var linksSheet = ss.getSheetByName("Links");
var resultadosSheet = ss.getSheetByName("Resultados");

function scrapyLoco(){
  var links = linksSheet.getRange(1, 1, linksSheet.getLastRow(), 1).getValues();
  var arrayGrandao = [];
  for (var row = 0, len = links.length; row < len; row++){
    var link = links[row];
    var arrayDeResultados = pegarAsCoisas(link[0]);
    Logger.log(arrayDeResultados);
    arrayGrandao.push(arrayDeResultados);
  }
  resultadosSheet.getRange(2, 1, arrayGrandao.length, arrayGrandao[0].length).setValues(arrayGrandao);
}

function pegarAsCoisas(linkDoProduto) {
  var resultadoArray = [];
  var html = UrlFetchApp.fetch(linkDoProduto).getContentText();

  var regExp = new RegExp("<h1([^]*)h1>", "gi");
  var h1Html = regExp.exec(html);
  var h1Parse = XmlService.parse(h1Html[0]);
  var h1Output = h1Parse.getRootElement().getText();
  h1Output = h1Output.replace(/(\r\n|\n|\r|(^( )*))/gm, "");

  regExp = new RegExp("Ref.: ([^(])*", "gi");
  var codeHtml = regExp.exec(html);
  var codeOutput = codeHtml[0].replace("Ref.: ", "").replace(" ", "");

  regExp = new RegExp("margin-top: 5px; margin-bottom: 5px; padding: 5px; background-color: #699D15; color: #fff; text-align: center;([^]*)/div>", "gi");
  var descriptionHtml = regExp.exec(html);
  var regExp = new RegExp("<p([^]*)p>", "gi");
  var descriptionHtml = regExp.exec(descriptionHtml);
  var regExp = new RegExp("^[^.]*", "gi");
  var descriptionHtml = regExp.exec(descriptionHtml);
  var descriptionOutput = descriptionHtml[0].replace("<p>", "");
  descriptionOutput = descriptionOutput + ".";

  regExp = new RegExp("ecom(.+?)Main.png", "gi");
  var imageHtml = regExp.exec(html);
  var comecoDaURL = "https://www.nespresso.com/";
  var imageOutput = comecoDaURL + imageHtml[0];

  var regExp = new RegExp("nes_l-float nes_big-price nes_big-price-with-out([^]*)p>", "gi");
  var precoHtml = regExp.exec(html);
  var regExp = new RegExp("[0-9]*,", "gi");
  precoHtml = regExp.exec(precoHtml);
  var precoOutput = "BRL " + precoHtml[0].replace(",", "");

  resultadoArray = [codeOutput, h1Output, descriptionOutput,
    "Home & Garden > Kitchen & Dining > Kitchen Appliances > Coffee Makers & Espresso Machines",
    "Máquina", linkDoProduto, imageOutput, "new", "in stock", precoOutput,
    "", "", "", "Nespresso", codeOutput];

  return resultadoArray;
}

But this takes a lot of time to program, it is very hard to change dynamically, and it is not very reliable.

I need a way to parse this HTML code and easily access its elements. This is not really an add-on, but a simple Google Apps Script application.

+11
javascript parsing html-parsing google-spreadsheet google-apps-script google-sheets
7 answers

I did this in vanilla JS. It is not real HTML parsing, just an attempt to get some content out of the string fetched from the URL:

function getLKKBTC() {
  var url = 'https://www.lykke.com/exchange';
  var html = UrlFetchApp.fetch(url).getContentText();
  var searchstring = '<td class="ask_BTCLKK">';
  var index = html.search(searchstring);
  if (index >= 0) {
    var pos = index + searchstring.length;
    var rate = html.substring(pos, pos + 6);
    rate = parseFloat(rate);
    rate = 1 / rate;
    return parseFloat(rate);
  }
  throw "Failed to fetch/parse data from " + url;
}
+6

This has been discussed previously. See here: What is the best way to parse html in google apps script

Unlike the Xml service, XmlService is not very forgiving of malformed HTML. The trick in Justin Bicknell's answer does the job. Even though the Xml service is deprecated, it still continues to work.
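The trick referenced there, as far as I can reconstruct it, boils down to something like the sketch below. Note that it relies on the deprecated Xml service, so it could stop working at any time:

function parseLenient(url) {
  var html = UrlFetchApp.fetch(url).getContentText();
  // Lenient parsing with the deprecated Xml service tolerates malformed HTML.
  var xmlDoc = Xml.parse(html, true);
  // Re-serialize the <body> and feed it to the strict XmlService parser.
  var bodyXml = xmlDoc.html.body.toXmlString();
  var doc = XmlService.parse(bodyXml);
  return doc.getRootElement(); // navigate with getChild()/getChildren() from here
}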

+5

I published cheeriogs for your problem. It runs on GAS like cheerio, which has a jQuery-like API. You can use it like this:

const content = UrlFetchApp.fetch('https://example.co/').getContentText();
const $ = Cheerio.load(content);
Logger.log($('p .blah').first().text()); // blah blah blah ...

See also https://github.com/asciian/cheeriogs

+3

Could you use client-side JavaScript to parse the HTML? If your Google Apps Script retrieves the HTML as a string and then passes it to a client-side JavaScript function, it seems you could parse it just fine outside of Apps Script. Any tags you want to scrape could then be sent back to a dedicated Apps Script function that saves the content.

Perhaps you could do this more easily with jQuery.
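To make that concrete, here is a rough sketch of the idea with one server-side file and one client-side page; the function names, the 'Parser' file name, the URL, and the h1 selector are all made up for illustration:

// Code.gs (server side)
function showParser() {
  SpreadsheetApp.getUi().showSidebar(HtmlService.createHtmlOutputFromFile('Parser'));
}

function fetchHtml(url) {
  return UrlFetchApp.fetch(url).getContentText();
}

function saveTitle(title) {
  SpreadsheetApp.getActiveSheet().appendRow([title]);
}

<!-- Parser.html (client side, runs in the browser) -->
<script>
  google.script.run
    .withSuccessHandler(function(html) {
      // The browser's DOMParser is forgiving of malformed HTML.
      var doc = new DOMParser().parseFromString(html, 'text/html');
      var title = doc.querySelector('h1').textContent;
      google.script.run.saveTitle(title);
    })
    .fetchHtml('https://en.wikipedia.org/wiki/Document_Object_Model');
</script>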

+1

Keep in mind that some websites do not allow automated scraping of their content, so please read their terms of service before using Apps Script to extract it.

XmlService only works on valid XML documents, and most HTML (especially HTML5) is not valid XML. A previous version of XmlService, simply called Xml, allowed "lenient" parsing, which let it parse HTML as well. That service was sunset in 2013, but for the time being it still works. The reference docs are no longer available, but this old tutorial shows how to use it.

Another alternative is to use the Kimono service, which handles the scraping and parsing parts and provides a simple API that you can call via UrlFetchApp to retrieve the structured data.
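Calling such a hosted API from Apps Script generally looks something like this (a generic sketch; the URL and apikey parameter are placeholders, not Kimono's actual endpoint):

function getStructuredData() {
  // Placeholder endpoint; substitute the real API URL and key.
  var url = 'https://www.example.com/api/my-scraped-page?apikey=YOUR_KEY';
  var response = UrlFetchApp.fetch(url);
  var data = JSON.parse(response.getContentText());
  Logger.log(data);
  return data;
}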

+1

I found a very neat alternative for scraping with Google Apps Script. It is called PhantomJS Cloud. You can use UrlFetchApp to access its API. This allows you to execute jQuery code against the pages, which makes life a lot easier.
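A call to that service could look roughly like the sketch below; I am writing the endpoint path and payload fields from memory, so treat them as assumptions and check the PhantomJS Cloud documentation for the real request format:

function fetchRenderedPage(targetUrl) {
  // NOTE: the endpoint path and payload fields are assumptions; verify them
  // against the PhantomJS Cloud documentation before relying on this.
  var apiKey = 'YOUR_API_KEY';
  var endpoint = 'https://phantomjscloud.com/api/browser/v2/' + apiKey + '/';
  var options = {
    method: 'post',
    contentType: 'application/json',
    payload: JSON.stringify({ url: targetUrl, renderType: 'html' })
  };
  // Returns the server-rendered HTML of the page.
  return UrlFetchApp.fetch(endpoint, options).getContentText();
}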

0

It may not be the cleanest approach, but straightforward string handling also does the job without XmlService:

var url = 'https://somewebsite.com/?q=00:11:22:33:44:55';
var html = UrlFetchApp.fetch(url).getContentText();

// We only want the link text displayed in this fragment:
// <td><a href="/company/ubiquiti-networks-inc">Ubiquiti Networks Inc.</a></td>
var string1 = html.split('<td><a href="/company/')[1]; // everything after '<td><a href="/company/'
var string2 = string1.split('</a></td>')[0];           // everything before '</a></td>'
var string3 = string2.split('>')[1];                   // everything after '>'

Logger.log('link text: ' + string3);                   // string3 => "Ubiquiti Networks Inc."
0
