Can JavaScript read the source of any webpage?

I am working on screen scraping and want to get the source code of a specific page.

How can this be done using JavaScript? Please help me.

+61
javascript html
Mar 25 '09 at 7:32
11 answers

An easy way to get started is to try jQuery:

$("#links").load("/Main_Page #jq-p-Getting-Started li"); 

See the jQuery docs for more.

Another way to do screen scraping in a much more structured way is to use YQL (Yahoo Query Language). It returns the scraped data structured as JSON or XML.
For example, let's scrape stackoverflow.com:

 select * from html where url="http://stackoverflow.com" 

will give you JSON (the output option I chose) like this:

  "results": { "body": { "noscript": [ { "div": { "id": "noscript-padding" } }, { "div": { "id": "noscript-warning", "p": "Stack Overflow works best with JavaScript enabled" } } ], "div": [ { "id": "notify-container" }, { "div": [ { "id": "header", "div": [ { "id": "hlogo", "a": { "href": "/", "img": { "alt": "logo homepage", "height": "70", "src": "http://i.stackoverflow.com/Content/Img/stackoverflow-logo-250.png", "width": "250" } …….. 

The beauty of this is that you can use projections and where clauses, which ultimately gets you scraped, structured data and only the data you need (much less bandwidth over the wire).
For example,

 select * from html where url="http://stackoverflow.com" and xpath='//div/h3/a' 

will get you

  "results": { "a": [ { "href": "/questions/414690/iphone-simulator-port-for-windows-closed", "title": "Duplicate: Is any Windows simulator available to test iPhone application? as a hobbyist who cannot afford a mac, i set up a toolchain kit locally on cygwin to compile objecti … ", "content": "iphone\n simulator port for windows [closed]" }, { "href": "/questions/680867/how-to-redirect-the-web-page-in-flex-application", "title": "I have a button control ....i need another web page to be redirected while clicking that button .... how to do that ? Thanks ", "content": "How\n to redirect the web page in flex application ?" }, ….. 

Now, to get only the questions, we do

 select title from html where url="http://stackoverflow.com" and xpath='//div/h3/a' 

Note the title in the projection:

  "results": { "a": [ { "title": "I don't want the function to be entered simultaneously by multiple threads, neither do I want it to be entered again when it has not returned yet. Is there any approach to achieve … " }, { "title": "I'm certain I'm doing something really obviously stupid, but I've been trying to figure it out for a few hours now and nothing is jumping out at me. I'm using a ModelForm so I can … " }, { "title": "when i am going through my project in IE only its showing errors A runtime error has occurred Do you wish to debug? Line 768 Error:Expected')' Is this is regarding any script er … " }, { "title": "I have a java batch file consisting of 4 execution steps written for analyzing any Java application. In one of the steps, I'm adding few libs in classpath that are needed for my co … " }, { …… 

Once you write your query, it generates a URL for you:

http://query.yahooapis.com/v1/public/yql?q=select%20title%20from%20html%20where%20url%3D%22http%3A%2F%2Fstackoverflow.com%22%20and%0A%20%20%20%20%20%20xpath%3D'%2F%2Fdiv%2Fh3%2Fa'%0A%20%20%20%20&format=json&callback=cbfunc

in our case.
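
The same kind of URL can also be assembled by hand. This is a minimal sketch, assuming the endpoint and the format and callback parameters seen in the generated URL; the helper name buildYqlUrl is hypothetical and not part of any library:

```javascript
// Hypothetical helper: builds a query URL like the generated one.
// The endpoint and parameter names are copied from that URL.
function buildYqlUrl(query, callbackName) {
  const endpoint = "http://query.yahooapis.com/v1/public/yql";
  // encodeURIComponent escapes spaces, quotes, ":" and "/" in the query text.
  let url = endpoint + "?q=" + encodeURIComponent(query) + "&format=json";
  if (callbackName) {
    url += "&callback=" + encodeURIComponent(callbackName);
  }
  return url;
}

const theAboveUrl = buildYqlUrl(
  "select title from html where url=\"http://stackoverflow.com\" and xpath='//div/h3/a'",
  "cbfunc"
);
```

Encoding the query in code avoids copying a long escaped string around whenever the url or xpath changes.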

So, in the end, you do something like this:

 var titleList = $.getJSON(theAboveUrl); 

and play with it.

Pretty neat, right?
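
Once the response arrives, the titles can be pulled out of it. This is a minimal sketch, assuming the callback receives an object with the results.a[].title shape printed earlier; extractTitles is a hypothetical name, and any outer wrapper around results is left to the caller:

```javascript
// Hypothetical helper: collect the title strings from a response object.
function extractTitles(response) {
  const anchors = (response && response.results && response.results.a) || [];
  // Assumption: a lone match might not be wrapped in an array, so normalise it.
  const list = Array.isArray(anchors) ? anchors : [anchors];
  return list
    .map(function (a) { return a.title; })
    .filter(function (t) { return typeof t === "string"; })
    .map(function (t) { return t.trim(); });
}

// Sample input in the same shape as the output printed earlier.
const sample = {
  results: {
    a: [
      { title: "How to redirect the web page in flex application ? " },
      { title: "iphone simulator port for windows [closed]" }
    ]
  }
};
const titles = extractTitles(sample);
```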

+104
Mar 25 '09 at 8:09

JavaScript can be used, as long as you fetch whatever page you are after through a proxy on your own domain:

 <html>
 <head>
   <script src="/js/jquery-1.3.2.js"></script>
 </head>
 <body>
   <script>
     // The proxy on your own domain fetches the remote page and returns its HTML.
     $.get("http://www.mydomain.com/?url=www.google.com", function (response) {
       alert(response);
     });
   </script>
 </body>
 </html>
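
The answer leaves the server side of the proxy unspecified. A minimal sketch of what it could look like in Node.js, assuming Node 18+ (where fetch is global) and the ?url= parameter used by the page; targetFromRequest and makeProxyHandler are hypothetical names, and this version does nothing to restrict which targets may be fetched:

```javascript
// Turn "/?url=www.google.com" into an absolute URL, or null if it is unusable.
function targetFromRequest(requestUrl) {
  const raw = new URL(requestUrl, "http://localhost").searchParams.get("url");
  if (!raw) return null;
  // Assumption: a target without a scheme is meant to be plain http.
  const withScheme = /^https?:\/\//i.test(raw) ? raw : "http://" + raw;
  try {
    return new URL(withScheme).href;
  } catch (err) {
    return null;
  }
}

// Build a request handler; fetchImpl is passed in so it can be swapped out.
function makeProxyHandler(fetchImpl) {
  return async function (req, res) {
    const target = targetFromRequest(req.url);
    if (!target) {
      res.writeHead(400, { "Content-Type": "text/plain" });
      res.end("Missing or invalid url parameter");
      return;
    }
    try {
      const upstream = await fetchImpl(target);
      const body = await upstream.text();
      // Send the page source back as text so the browser script can read it.
      res.writeHead(200, { "Content-Type": "text/plain; charset=utf-8" });
      res.end(body);
    } catch (err) {
      res.writeHead(502, { "Content-Type": "text/plain" });
      res.end("Could not fetch " + target);
    }
  };
}

// Usage, in a Node.js script:
//   require("http").createServer(makeProxyHandler(fetch)).listen(8080);
```

Because the browser only ever talks to your own domain, the cross-domain request happens on the server, where the browser's restrictions do not apply.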
+27
Mar 25 '09 at 8:06

You can simply use XMLHttpRequest (Ajax) to hit the required URL, and the HTML response from that URL will be available in the responseText property. If it is not the same domain, your users will get a browser alert saying something like "This page is trying to access a different domain. Do you want to allow this?"
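
A minimal sketch of that approach, assuming a browser environment where XMLHttpRequest is available; getPageSource and its optional third parameter (a constructor to use in place of the built-in one) are hypothetical additions for illustration:

```javascript
// Hypothetical helper: request a URL and hand its source to a callback.
function getPageSource(url, onDone, XhrCtor) {
  const Ctor = XhrCtor || XMLHttpRequest;
  const request = new Ctor();
  request.onreadystatechange = function () {
    // readyState 4 means the response has fully arrived.
    if (request.readyState !== 4) return;
    if (request.status >= 200 && request.status < 300) {
      onDone(null, request.responseText);
    } else {
      onDone(new Error("Request failed with status " + request.status));
    }
  };
  request.open("GET", url, true);
  request.send();
}

// Usage in a browser, for a page on the same domain:
//   getPageSource("/Main_Page", function (err, html) { if (!err) alert(html); });
```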

+7
Mar 25 '09 at 7:40

As a security measure, JavaScript cannot read files from different domains. Though there might be some strange workaround for it, I would consider a different language for this task.

+5
Mar 25 '09 at 7:37

Using jQuery:

 <html>
 <head>
   <script src="http://jqueryjs.googlecode.com/files/jquery-1.3.2.js"></script>
 </head>
 <body>
   <script>
     // An absolute URL needs its scheme; without it the path is treated as relative.
     $.get("http://www.google.com", function (response) {
       alert(response);
     });
   </script>
 </body>
 </html>
+3
Mar 25 '09 at 7:49

I used ImportIO. They let you request the HTML of any website once you have set up an account with them (which is free). They allow up to 50 thousand requests per year. I have not taken the time to look for an alternative, but I am sure there are some.

In your JavaScript, you simply make a GET request like this:

 var request = new XMLHttpRequest();
 request.onreadystatechange = function () {
   // readyState 4 means the response has fully arrived.
   if (request.readyState === 4) {
     var jsontext = request.responseText;
     alert(jsontext);
   }
 };
 request.open("GET", "https://extraction.import.io/query/extractor/THE_PUBLIC_LINK_THEY_GIVE_YOU?_apikey=YOUR_KEY&url=YOUR_URL", true);
 request.send();

Side note: I found this question while researching what I felt was the same problem, so others might find my solution helpful.

UPDATE: I created a new one, which they let me use for less than 48 hours before telling me I had to pay for the service. It seems they shut down your project pretty quickly if you are not paying. I made my own similar service with NodeJS and a library called NightmareJS. You can see their tutorial here and create your own web scraping tool. It is relatively easy. I have not tried to set it up as an API that could take requests or anything like that.

+3
Aug 21 '16 at 0:52

If you absolutely need to use JavaScript, you can load the page source using an Ajax request.

Note that with JavaScript you can only retrieve pages that are located on the same domain as the requesting page.
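
Whether a target counts as "the same" can be checked up front. This is a minimal sketch, assuming the standard URL class and the usual browser rule that scheme, host and port must all match; isSameOrigin is a hypothetical helper name:

```javascript
// Hypothetical helper: true when targetUrl has the same origin as pageUrl.
function isSameOrigin(pageUrl, targetUrl) {
  const page = new URL(pageUrl);
  // Relative targets such as "/Main_Page" are resolved against the page first.
  const target = new URL(targetUrl, page);
  // origin combines scheme, host and port (default ports are omitted).
  return page.origin === target.origin;
}

// Usage in a browser:
//   isSameOrigin(window.location.href, "http://stackoverflow.com/questions")
```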

+2
Mar 25 '09 at 7:39

You can create an XMLHttpRequest, request the page, and then read the responseText property to get the content.

0
Jun 22 2018-12-12T00:

You can use the FileReader API to read the file: in the file picker, enter the URL of your web page in the file name box. Use this code:

 function readFile() {
   var f = document.getElementById("yourfileinput").files[0];
   if (f) {
     var r = new FileReader();
     // onload fires once the file's text has been read.
     r.onload = function (e) {
       alert(r.result);
     };
     r.readAsText(f);
   } else {
     alert("file could not be found");
   }
 }
0
Oct 26 '14 at 20:14

You can get around the same-origin policy by creating a browser extension, or even by saving the file as .hta (HTML Application) on Windows.

0
Oct 26 '14 at 8:58

Despite many comments to the contrary, I believe it is possible to overcome the same-origin requirement with simple JavaScript.

I do not claim that the following is original, because I believe that some time ago I saw something similar elsewhere.

I tested this only with Safari on Mac.

The following demo opens the page named in the base tag in a new window and copies its innerHTML into a textarea. My script adds the html tags itself, but with most modern browsers this can be avoided by using outerHTML.

 <html>
 <head>
   <base href='http://apod.nasa.gov/apod/'>
   <title>test</title>
   <style>
     body { margin: 0 }
     textarea { outline: none; padding: 2em; width: 100%; height: 100% }
   </style>
 </head>
 <body onload="w=window.open('#'); x=document.getElementById('t'); a='<html>\n'; b='\n</html>'; setTimeout('x.innerHTML=a+w.document.documentElement.innerHTML+b; w.close()',2000)">
   <textarea id=t></textarea>
 </body>
 </html>
-1
Mar 06 '15 at 13:29


