How to get the source of a given URL from a servlet?

I want to read the source code (HTML tags) of a given URL from my servlet.

For example, the URL is http://www.google.com , and my servlet needs to read its HTML source code. Why do I need this? My web application is going to read other web pages, extract useful content, and do something with it.

Suppose my application shows a list of stores in one category in a city. As this list is created, my web application (servlet) goes through a web page that displays various stores and reads its content. From the source code, my servlet filters out the useful information and finally builds the list (since my servlet does not have access to the database behind the given URL).

Does anyone know a solution? (I especially need to do this in a servlet.) If you think there is a better way to get information from another site, please let me know.

Thanks

java html jsp servlets web-scraping

6 answers

What you are trying to do is called web scraping. Kayak and similar sites do this; search the web for the term. ;) In Java you can do it like this:

    URL url = new URL(<your URL>);
    BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
    String inputLine;
    StringBuffer response = new StringBuffer();
    while ((inputLine = in.readLine()) != null) {
        response.append(inputLine).append("\n");
    }
    in.close();

The response buffer will contain the full HTML content returned by that URL.


A servlet is not needed to read data from a remote server. You can simply use the java.net.URL or java.net.URLConnection class to read remote content from an HTTP server. For example,

    InputStream input = (InputStream) new URL("http://www.google.com").getContent();
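Building on that, here is a minimal sketch using java.net.URLConnection, which additionally lets you set connect and read timeouts. The class and method names (PageFetcher, fetch, readAll) are my own choices for illustration, not from any particular library.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

public class PageFetcher {

    // Read an entire stream into a String, decoding as UTF-8.
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int len;
        while ((len = in.read(buffer)) != -1) {
            out.write(buffer, 0, len);
        }
        return out.toString("UTF-8");
    }

    // Fetch the HTML of a page, failing fast if the server is slow.
    static String fetch(String address) throws IOException {
        URLConnection conn = new URL(address).openConnection();
        conn.setConnectTimeout(5000);  // milliseconds
        conn.setReadTimeout(5000);
        try (InputStream in = conn.getInputStream()) {
            return readAll(in);
        }
    }
}
```

A real page is often not UTF-8; in practice you would inspect the Content-Type header (conn.getContentType()) before picking a charset.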

Check out jsoup for fetching and parsing HTML.

    Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
    Elements newsHeadlines = doc.select("#mp-itn b a");

As written above, you do not need a servlet for this purpose. The servlet API is for responding to requests; a servlet container runs on the server side. If I understand you correctly, you do not need a server here at all — you need a simple HTTP client. I hope the following example helps you:

    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.http.HttpResponse;
    import org.apache.http.client.HttpClient;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.DefaultHttpClient;

    public class SimpleHttpClient {
        public String execute() {
            HttpClient httpClient = new DefaultHttpClient();
            // The URL must include the scheme, e.g. "http://google.com".
            HttpGet httpGet = new HttpGet("http://google.com");
            StringBuilder content = new StringBuilder();
            try {
                HttpResponse response = httpClient.execute(httpGet);
                InputStream is = response.getEntity().getContent();
                byte[] buffer = new byte[1024];
                int length;
                // Append only the bytes actually read on each pass,
                // otherwise the tail of the buffer repeats stale data.
                while ((length = is.read(buffer)) != -1) {
                    content.append(new String(buffer, 0, length, "UTF-8"));
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
            return content.toString();
        }
    }

There are several solutions.

The easiest is a regular expression. If you just want to extract links from anchor tags like <a href="THE URL">, use a custom regular expression such as <a\s+href\s*=\s*["']?(.*?)["']\s*>. Group 1 contains the URL. Then create a Matcher and walk through your document while matcher.find() returns true.
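A minimal sketch of that regex approach, using java.util.regex. The class name LinkExtractor is my own, and the pattern is a slight variation on the one above (the capture group excludes quotes and whitespace, so unquoted href values also match):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {

    // Matches <a href="URL">, <a href='URL'> or <a href=URL>;
    // group(1) captures the URL itself.
    private static final Pattern HREF = Pattern.compile(
            "<a\\s+href\\s*=\\s*[\"']?([^\"'>\\s]+)[\"']?",
            Pattern.CASE_INSENSITIVE);

    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }
}
```

Keep in mind that regexes break down on unusual markup (attributes before href, comments, scripts), which is exactly why the parser-based options below exist.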

The next solution is to use an XML parser on the HTML. This works well if the site is written in well-formed XHTML. Since that is not always the case, this solution only applies to selected sites.

The next solution uses the HTML parser built into Java (in javax.swing.text.html): http://java.sun.com/products/jfc/tsc/articles/bookmarks/

The next and most flexible approach is to use a "real" HTML parser, or even better a Java-based headless HTML browser: Java HTML Parsing

Which one you choose depends on the details of your task. If parsing static anchor tags is enough, a custom regular expression will do; if not, pick one of the parser-based methods above.


As people have said, you can use the basic classes java.net.URL and java.net.URLConnection to fetch web pages. But Apache HttpClient is more convenient for this purpose. Find docs and examples here: http://hc.apache.org/

