Retrieving JavaScript Variable Values Using Web Scraping

For a company project, I need to build a web scraping application with PHP and JavaScript (including jQuery) that will extract specific data from each page of our customers' websites. The scraper needs to collect two kinds of data for each page: 1) whether certain HTML elements with specific IDs are present, and 2) the value of a specific JavaScript variable. The JS variable name is the same on every page, but the value is usually different.

I believe I know how to meet the first requirement: use PHP's file_get_contents() to fetch each HTML page, then use JavaScript/jQuery to parse the HTML and find elements with the specific IDs. However, I'm not sure how to get the second piece of data, the value of the JavaScript variable. The variable isn't even in the HTML of each page; it is in an external JavaScript file that the page links to. And even if the JavaScript were embedded in the HTML, I know file_get_contents() would only retrieve the JavaScript code (and the rest of the HTML), not any variable values.

Can someone suggest a good approach to get this variable value for each page of this website?

EDIT: just to clarify, I need the values of the JavaScript variables after the JavaScript code has run. Is such a thing possible?

+4
4 answers

Presumably this isn't possible, since it seems so simple, but if it's your own .js that you're trying to detect, why not just have that .js do something in the page that the scraper can detect?

Use JS to populate a tag like this somewhere (via element.innerHTML, presumably):

<span><!--Important js thing has been activated!--></span>

Edit: alternatively, perhaps use document.write if the marker needs to be detectable as soon as the page loads.
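
A minimal sketch of this marker idea, written as two plain functions so it can be tried in Node. The marker id `scrape-marker`, the `data-value` attribute, and the sample value are assumptions made up for illustration; nothing in the question prescribes them. Note that the marker only exists after the page's JavaScript has run, so this assumes the scraper fetches the page with a client that executes JS, which a bare file_get_contents() call does not.

```javascript
// Hypothetical marker id; chosen for this sketch only.
const MARKER_ID = "scrape-marker";

// Page side: build the markup the script would place in the page,
// e.g. someElement.innerHTML = buildMarker(theVariable).
function buildMarker(value) {
  // JSON + encodeURIComponent keeps double quotes and angle brackets
  // out of the attribute value.
  const encoded = encodeURIComponent(JSON.stringify(value));
  return `<span id="${MARKER_ID}" data-value="${encoded}" hidden></span>`;
}

// Scraper side: pull the value back out of the rendered HTML.
function readMarker(html) {
  const pattern = new RegExp(`<span id="${MARKER_ID}" data-value="([^"]*)"`);
  const match = pattern.exec(html);
  return match ? JSON.parse(decodeURIComponent(match[1])) : undefined;
}

// Simulated rendered page containing the marker.
const rendered = `<body><p>Hello</p>${buildMarker({ plan: "pro", seats: 12 })}</body>`;
console.log(readMarker(rendered));
```

Carrying the value in an attribute rather than an HTML comment is a design choice of this sketch: it lets the scraper recover the variable's value, not just the fact that the script ran.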

+2

You say you need the value of the variable after the JS has executed. I'm assuming it is always the same JS, with only the initial values of the variables changing. Your best bet is to port the JS to PHP, which lets you extract the initial values of the JS variables and then pretend you have executed the JS.

Here is a function to extract variable values from JavaScript source:

    /**
     * Extracts a variable value given its name and type. Makes certain
     * assumptions about the source, i.e. can't handle strings with escaped quotes.
     *
     * @param string $jsText the JavaScript source
     * @param string $name   the name of the variable
     * @param string $type   the variable type, either 'string' (default), 'float' or 'int'
     * @return string|int|float|false the extracted value, or false if nothing matched
     */
    function extractVar($jsText, $name, $type = 'string') {
        if ($type == 'string') {
            // The backreference \1 makes the closing quote match the opening one.
            $valueMatch = "(\"|')(.*?)\\1";
        } else {
            // Greedy on purpose: a lazy quantifier here would capture only the first digit.
            $valueMatch = "(-?[0-9.]+)";
        }
        // Escape the name so characters such as $ are matched literally.
        $quotedName = preg_quote($name, '/');
        if (!preg_match("/$quotedName\s*=\s*$valueMatch/", $jsText, $matches)) {
            return false;
        }
        if ($type == 'string') {
            return $matches[2];
        } else if ($type == 'float') {
            return (float)$matches[1];
        } else if ($type == 'int') {
            return (int)$matches[1];
        }
        return false;
    }
+4

Can't you have a JS script delivered to your clients' pages, and have that script send the information to your server?

0

You might be able to use Zombie.js, which runs on Node.js: http://zombie.labnotes.org/

It can click links, walk the DOM tree, and it copes with JS because it actually runs the JavaScript.
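
The underlying idea, running the script and then reading the variable, can be sketched without Zombie.js by using Node's built-in `vm` module. This is only an illustration under stated assumptions: the helper name `readVarAfterRun`, the sample script, and the variable `orderTotal` are hypothetical, and the script is assumed not to touch `window` or `document`, which `vm` does not provide (a headless browser such as Zombie.js does). `vm` is also not a security boundary, so this is only suitable for code you trust.

```javascript
const vm = require("node:vm");

// Runs a script in a fresh context and returns the final value of one
// top-level variable. Hypothetical helper, not part of any library.
function readVarAfterRun(jsSource, varName, timeoutMs = 1000) {
  const sandbox = {};
  vm.createContext(sandbox);
  // The timeout guards against scripts that never finish.
  vm.runInContext(jsSource, sandbox, { timeout: timeoutMs });
  // Top-level `var` declarations and implicit globals become properties
  // of the sandbox object; `let` and `const` do not, so this sketch
  // only covers the former.
  return sandbox[varName];
}

// Stand-in for the external script; the value is computed at run time,
// so a regex over the source text could not recover it.
const pageScript = `
  var basePrice = 20;
  var quantity = 3;
  var orderTotal = basePrice * quantity;
  orderTotal += 5;
`;

console.log(readVarAfterRun(pageScript, "orderTotal")); // 65
```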

0