The easiest way to properly load HTML from a webpage into a String in Java

What the title says.

Any help is appreciated, thanks!

+26
java html parsing
04 Sep '09 at 21:26
3 answers

An extremely common mistake is failing to convert the HTTP response from bytes to characters correctly. To do this, you need to know the character encoding of the response. Ideally it is given as the charset parameter of the "Content-Type" header. But it may instead be declared in the body, since a meta tag with the "http-equiv" attribute is also an option.

So it is surprisingly hard to load a page correctly into a String, and even third-party libraries like HttpClient do not offer a general solution.

Here is a simple implementation that will handle the most common case:

    URL url = new URL("http://stackoverflow.com/questions/1381617");
    URLConnection con = url.openConnection();
    Pattern p = Pattern.compile("text/html;\\s+charset=([^\\s]+)\\s*");
    Matcher m = p.matcher(con.getContentType());
    /* If Content-Type doesn't match this pre-conception, choose default and
     * hope for the best. */
    String charset = m.matches() ? m.group(1) : "ISO-8859-1";
    Reader r = new InputStreamReader(con.getInputStream(), charset);
    StringBuilder buf = new StringBuilder();
    while (true) {
        int ch = r.read();
        if (ch < 0)
            break;
        buf.append((char) ch);
    }
    String str = buf.toString();
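The charset lookup described in this answer (header first, then a meta tag in the body) could be sketched roughly as follows. This is only an illustration under assumptions: the class CharsetSniffer, its method names, and the 1024-byte scan window are hypothetical and not part of the original answer, and real pages can declare their encoding in ways these regular expressions miss.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: guesses the charset of an already downloaded page. */
final class CharsetSniffer {
    // charset parameter of a Content-Type header, e.g. "text/html; charset=UTF-8"
    private static final Pattern HEADER_CHARSET = Pattern.compile(
            "charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);
    // charset declared inside a <meta> tag, with or without http-equiv
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([^\\s\"'/>;]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset name found in the header value, or null if none. */
    static String fromContentType(String contentType) {
        if (contentType == null) return null; // the header may be missing
        Matcher m = HEADER_CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    /** Returns the charset name found in a meta tag near the start, or null. */
    static String fromMeta(byte[] body) {
        int n = Math.min(body.length, 1024); // assumed scan window
        // ISO-8859-1 maps every byte to one char, so this decode cannot fail
        String head = new String(body, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    /** Decodes the body using header, then meta tag, then ISO-8859-1. */
    static String decode(byte[] body, String contentType) {
        String name = fromContentType(contentType);
        if (name == null) name = fromMeta(body);
        Charset cs = StandardCharsets.ISO_8859_1; // same default as the answer
        if (name != null) {
            try {
                cs = Charset.forName(name);
            } catch (IllegalArgumentException e) {
                // illegal or unsupported charset name: keep the default
            }
        }
        return new String(body, cs);
    }
}
```

With the response bytes read into a byte array first, a call would look like CharsetSniffer.decode(bytes, con.getContentType()). Buffering the bytes is what makes the meta tag fallback possible, at the cost of holding the whole page in memory.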
+30
Sep 04 '09 at 22:21

You can simplify it a bit with org.apache.commons.io.IOUtils:

    URL url = new URL("http://stackoverflow.com/questions/1381617");
    URLConnection con = url.openConnection();
    Pattern p = Pattern.compile("text/html;\\s+charset=([^\\s]+)\\s*");
    Matcher m = p.matcher(con.getContentType());
    /* If Content-Type doesn't match this pre-conception, choose default and
     * hope for the best. */
    String charset = m.matches() ? m.group(1) : "ISO-8859-1";
    String str = IOUtils.toString(con.getInputStream(), charset);
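If adding Commons IO is not an option, a similar helper can be sketched with the JDK alone. The class name StreamText and the 4096-character buffer are assumptions made for illustration. The sketch reads in chunks rather than one character at a time and closes the stream when done.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

/** Hypothetical JDK-only stand-in for IOUtils.toString(InputStream, String). */
final class StreamText {
    /** Reads the whole stream as text in the given charset, then closes it. */
    static String readAll(InputStream in, Charset charset) {
        StringBuilder buf = new StringBuilder();
        char[] chunk = new char[4096]; // assumed buffer size
        try (Reader r = new InputStreamReader(in, charset)) {
            int n;
            // read() returns -1 at end of stream
            while ((n = r.read(chunk)) != -1) {
                buf.append(chunk, 0, n);
            }
        } catch (IOException e) {
            // rethrown unchecked so callers need no throws clause
            throw new UncheckedIOException(e);
        }
        return buf.toString();
    }
}
```

In the snippet above, the last line would then become String str = StreamText.readAll(con.getInputStream(), Charset.forName(charset));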
+4
Mar 19 '10 at 13:31

I use this:

    BufferedReader bufferedReader = null;
    try {
        // urlToSearch is the page address as a String.
        // No charset is passed, so the platform default encoding is used.
        bufferedReader = new BufferedReader(
                new InputStreamReader(
                        new URL(urlToSearch).openConnection().getInputStream()));
        StringBuilder sb = new StringBuilder();
        String line = null;
        while ((line = bufferedReader.readLine()) != null) {
            sb.append(line);
            sb.append("\n");
        }
    } finally {
        if (bufferedReader != null) {
            bufferedReader.close();
        }
    }

It works most of the time.

+1
Sep 04 '09 at 21:34