An extremely common mistake is failing to convert the HTTP response from bytes to characters correctly. To do this, you need to know the character encoding of the response. Hopefully, it is specified as the "charset" parameter of the "Content-Type" header. But it can also be declared in the body itself, for example in a meta tag with the "http-equiv" attribute.
So it is surprisingly hard to load a page into a String correctly, and even third-party libraries like HttpClient do not offer a general solution.
Here is a simple implementation that will handle the most common case:
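    import java.io.InputStreamReader;
    import java.io.Reader;
    import java.net.URL;
    import java.net.URLConnection;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    URL url = new URL("http://stackoverflow.com/questions/1381617");
    URLConnection con = url.openConnection();
    Pattern p = Pattern.compile("text/html;\\s+charset=([^\\s]+)\\s*");
    // getContentType() returns null when the header is missing
    String contentType = con.getContentType();
    Matcher m = p.matcher(contentType == null ? "" : contentType);
    // If Content-Type doesn't match, fall back to a default and hope for the best
    String charset = m.matches() ? m.group(1) : "ISO-8859-1";
    StringBuilder buf = new StringBuilder();
    // Decode the body with the detected charset; the reader is closed automatically
    try (Reader r = new InputStreamReader(con.getInputStream(), charset)) {
        while (true) {
            int ch = r.read();
            if (ch < 0)
                break;
            buf.append((char) ch);
        }
    }
    String str = buf.toString();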
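For the less common case, where the header carries no charset and the page declares it in a meta tag, something along these lines might serve as a starting point. This is only a rough sketch: `CharsetSniffer` and its methods are hypothetical names, not part of any library, and it assumes you have already buffered the first chunk of the body (say, the first few kilobytes) into a byte array. It also assumes the declared encoding is ASCII-compatible, so the markup is still readable when decoded as ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the JDK or HttpClient.
final class CharsetSniffer {
    // Matches the charset parameter of a Content-Type value,
    // e.g. "text/html; charset=UTF-8" or "text/html;charset=\"utf-8\"".
    private static final Pattern HEADER_CHARSET =
        Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Matches a charset declared inside a <meta> tag, covering both
    // <meta charset="utf-8"> and
    // <meta http-equiv="Content-Type" content="text/html; charset=utf-8">.
    private static final Pattern META_CHARSET =
        Pattern.compile("<meta[^>]*?charset\\s*=\\s*[\"']?([^\\s\"'/>;]+)",
                        Pattern.CASE_INSENSITIVE);

    // Returns the charset named in the header value, or null if there is none.
    static String fromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = HEADER_CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    // Returns the charset named in a meta tag within the given bytes, or null.
    static String fromMeta(byte[] head) {
        // ISO-8859-1 maps every byte to exactly one char, so this decode
        // never fails; it is only used to search for the ASCII markup.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    // Header first, then meta tag, then the same default as the code above.
    static String detect(String contentType, byte[] head) {
        String cs = fromContentType(contentType);
        if (cs == null) {
            cs = fromMeta(head);
        }
        return cs != null ? cs : "ISO-8859-1";
    }
}
```

You would call `CharsetSniffer.detect(con.getContentType(), head)` and then decode the whole body (the buffered bytes plus the rest of the stream) with the result. A regex is not a real HTML parser, and this does not handle byte order marks or UTF-16 pages, so treat it as a best-effort guess.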
erickson Sep 04 '09 at 22:21