Google Chrome Web Scraper (JavaScript + Chrome API)

What are the best options for scraping web pages from within a Google Chrome extension, using JavaScript and any other available technology? Other JavaScript libraries are also accepted.

It is important that the scrape looks like a regular web request: there must be no signs of AJAX or XMLHttpRequest, such as the X-Requested-With: XMLHttpRequest or Origin headers.

The scraped content should be accessible from JavaScript for further processing and presentation inside the extension, most likely as a string.

Are there any hooks in the WebKit/Chrome APIs that can be used to make a regular web request and get the results back for manipulation?

    var pageContent = getPageContent(url); // TODO: Implement
    var items = $(pageContent).find('.item');
    // Display items with further selections

Bonus points for making this work from a local file on disk, for initial debugging. But if that is the only thing standing in the way of a solution, then disregard the bonus points.

+65
javascript google-chrome google-chrome-extension web-scraping
Jun 28 2018-11-11T00:
7 answers

Try using XHR2 responseType = "document" and fall back on (new DOMParser).parseFromString(responseText, getResponseHeader("Content-Type")) with my text/html patch. See https://gist.github.com/1138724 for an example of how I detect responseType = "document" support (synchronously checking response === null on an object URL created from a text/html blob).
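
The fallback described above could be sketched roughly as follows. This is not the code from the gist: it assumes a modern browser, where DOMParser already handles text/html (so the patch is not needed) and where an unsupported responseType assignment is simply ignored. The names extractDocument and fetchDocument are hypothetical.

```javascript
// MIME types that DOMParser.parseFromString accepts.
const PARSEABLE_TYPES = [
  'text/html', 'text/xml', 'application/xml',
  'application/xhtml+xml', 'image/svg+xml',
];

// Hypothetical helper: turn a completed XHR into a Document.
// `parser` is any object with a DOMParser-style parseFromString method.
function extractDocument(xhr, parser) {
  // Where responseType = "document" is supported, the assignment sticks
  // and xhr.response is already a parsed Document (or null when the
  // response is not HTML or XML).
  if (xhr.responseType === 'document') {
    return xhr.response;
  }
  // Otherwise parse the raw text. The Content-Type parameters
  // (e.g. "; charset=utf-8") are dropped because parseFromString
  // only accepts a bare MIME type.
  const header = xhr.getResponseHeader('Content-Type') || '';
  const type = header.split(';')[0].trim().toLowerCase();
  const mime = PARSEABLE_TYPES.includes(type) ? type : 'text/html';
  return parser.parseFromString(xhr.responseText, mime);
}

// Hypothetical wrapper: request `url`, then call `callback(err, doc)`.
function fetchDocument(url, callback) {
  const xhr = new XMLHttpRequest();
  xhr.open('GET', url);
  xhr.responseType = 'document'; // assumed to be ignored where unsupported
  xhr.onload = function () {
    callback(null, extractDocument(xhr, new DOMParser()));
  };
  xhr.onerror = function () {
    callback(new Error('Request failed: ' + url));
  };
  xhr.send();
}
```

With this in place, the getPageContent(url) placeholder from the question becomes a call such as fetchDocument(url, function (err, doc) { ... }), and $(doc).find('.item') can run inside the callback.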

Use the Chrome WebRequest API to hide X-Requested-With headers, etc.
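
A minimal sketch of the header filtering, assuming a Manifest V2 extension with the webRequest and webRequestBlocking permissions plus host permissions for the target site. The helper name stripTelltaleHeaders and the example.com URL pattern are hypothetical, and newer Chrome versions may additionally require "extraHeaders" in the last argument before some headers can be modified.

```javascript
// Header names (lowercase) that give away a script-initiated request.
const TELLTALE_HEADERS = new Set(['x-requested-with', 'origin']);

// Hypothetical helper: takes the requestHeaders array that webRequest
// passes to listeners ([{ name, value }, ...]) and returns a copy
// without the tell-tale entries.
function stripTelltaleHeaders(requestHeaders) {
  return requestHeaders.filter(function (header) {
    return !TELLTALE_HEADERS.has(header.name.toLowerCase());
  });
}

// Register the listener only when running inside an extension.
if (typeof chrome !== 'undefined' && chrome.webRequest) {
  chrome.webRequest.onBeforeSendHeaders.addListener(
    function (details) {
      return { requestHeaders: stripTelltaleHeaders(details.requestHeaders) };
    },
    { urls: ['https://example.com/*'] }, // hypothetical target site
    ['blocking', 'requestHeaders']
  );
}
```

Keeping the filter as a separate function makes it easy to check outside the browser; only the registration at the bottom depends on the chrome object.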

+12
Aug 25 2018-11-11T00:

If you are open to something other than a Google Chrome plugin, check out PhantomJS, which uses Qt-WebKit in the background and runs just like a browser, including making ajax requests. You can call it a headless browser, as it does not display its output on screen and can work in the background while you do other things. If you want, you can export images or PDFs of the pages it fetches. It provides a JS interface for loading pages, clicking buttons, etc., much as you would in a browser. You can also inject custom JS, such as jQuery, into any page you want to scrape, and use it to access the DOM and export the desired data. Since it uses WebKit, its rendering behavior is exactly like Google Chrome's.

Another option is Aptana Jaxer, which is based on the Mozilla engine and is a very good concept in itself. It can also be used as a simple scraping tool.
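
As a rough illustration of that JS interface, here is a small PhantomJS script, run as phantomjs scrape.js. The file name, the example.com URL and the collectItems helper are hypothetical; the .item selector is borrowed from the question. It is written in ES5 style because PhantomJS ships an older JavaScript engine.

```javascript
// Runs inside the loaded page via page.evaluate(), so it must be
// self-contained and may only return JSON-serializable values.
function collectItems(selector) {
  var nodes = document.querySelectorAll(selector);
  return Array.prototype.map.call(nodes, function (node) {
    return node.textContent;
  });
}

// Only start the scrape when the PhantomJS `phantom` global exists.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  var url = 'http://example.com/'; // hypothetical target page
  page.open(url, function (status) {
    if (status !== 'success') {
      console.log('Failed to load ' + url);
      phantom.exit(1);
      return;
    }
    // Extra arguments to page.evaluate are passed on to collectItems.
    var items = page.evaluate(collectItems, '.item');
    console.log(JSON.stringify(items));
    phantom.exit();
  });
}
```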

+10
Aug 25 2018-11-18T00:

Scraping web pages is tricky in a Chrome extension. A few points:

  • You run content scripts to access the DOM.
  • Background pages (one per browser) can exchange messages with content scripts. That is, you can run a content script that sets up an RPC endpoint and fires a specified callback in the context of the background page as the response.
  • You can execute content scripts in all frames of a web page, and then stitch the document tree together (it is composed of the 1..N frames that make up the page).
  • As S.K. suggested, your background page can send the data as an XMLHttpRequest to some kind of lightweight HTTP server that listens locally.
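
The content-script RPC idea from the list above might look roughly like this. It uses the current chrome.runtime.onMessage and chrome.tabs.sendMessage calls rather than whatever API was in use when the answer was written. The message type getPageContent and the names handleRequest and requestPageContent are hypothetical, and in a real extension the two halves would live in separate content-script and background files.

```javascript
// Hypothetical message handler: answers a "getPageContent" request
// with the serialized DOM of the given document.
function handleRequest(message, doc) {
  if (message && message.type === 'getPageContent') {
    return { html: doc.documentElement.outerHTML };
  }
  return null;
}

// Content-script side: content scripts have chrome.runtime but not
// chrome.tabs, which is used here to tell the two contexts apart.
if (typeof chrome !== 'undefined' && chrome.runtime && !chrome.tabs) {
  chrome.runtime.onMessage.addListener(function (message, sender, sendResponse) {
    var reply = handleRequest(message, document);
    if (reply) {
      sendResponse(reply);
    }
  });
}

// Background side: ask the content script in `tabId` for its page
// content and pass the HTML string (or null) to `callback`.
function requestPageContent(tabId, callback) {
  chrome.tabs.sendMessage(tabId, { type: 'getPageContent' }, function (reply) {
    callback(reply ? reply.html : null);
  });
}
```

The HTML arrives in the background page as a string, which matches what the question asks for.
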
+6
Aug 30 '11 at 19:05

Since this question was asked, many tools have been released.

artoo.js is one of them. It is a piece of JavaScript code meant to be run in the browser console to provide you with some scraping utilities. It can also be used as a Chrome extension.

+6
Nov 27 '14 at 12:14

I am not sure this is possible with JavaScript alone, but if you can set up a dedicated PHP script for your extension that uses cURL to fetch the HTML of the page, the PHP script can scrape the page for you, and your extension can read the result through an AJAX request.

The page actually being scraped would not know about the AJAX request, however, since it is being accessed via cURL.

+5
Jul 07 '11 at 13:21

I think you can start with this.

So, you can try using an Extension + Plugin combination. The extension would have access to the DOM (including the plugin) and drive the process, and the plugin would send the actual HTTP requests.

I can recommend FireBreath as a cross-platform plugin framework for Chrome/Firefox. In particular, take a look at this example: Firebreath - Create HTTP Requests with SimpleStreamsHelper

+4
Aug 30 '11 at 10:26

Couldn't you just do some iframe trickery? If you load the URL into a dedicated frame, you have the DOM in a document object and can make your jQuery selections, no?
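
A rough sketch of the iframe idea, with the caveat that the same-origin policy leaves contentDocument null for cross-origin pages, and that sites sending X-Frame-Options can refuse to be framed at all. So this only works where the framing page is allowed to read the frame. The helper name loadInFrame is hypothetical.

```javascript
// Hypothetical helper: load `url` in a hidden iframe and call
// `callback(err, doc)` with the frame's document once it has loaded.
// `hostDoc` defaults to the current page's document.
function loadInFrame(url, callback, hostDoc) {
  var doc = hostDoc || document;
  var frame = doc.createElement('iframe');
  frame.style.display = 'none';
  frame.onload = function () {
    // contentDocument is null when the framed page is cross-origin.
    var inner = frame.contentDocument;
    if (inner) {
      callback(null, inner);
    } else {
      callback(new Error('Frame document is not accessible: ' + url));
    }
  };
  frame.src = url;
  doc.body.appendChild(frame);
}
```

Inside the callback, the jQuery selection from the question would be $(doc).find('.item').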

+2
Aug 12 '11 at 18:09


