I am trying to get the contents of a webpage as a string. I found this question on how to write a basic web crawler, which claims to (and seems to) deal with the encoding problem. However, the code provided there, which works on US and English websites, does not properly handle other languages.
Here is the complete Java class that demonstrates what I mean:
```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class I18NScraper {
    static {
        System.setProperty("http.agent", "");
    }

    public static final String IE8_USER_AGENT = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; WOW64; Trident/4.0; SLCC1; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; InfoPath.2)";

    // /questions/65453/simplest-way-to-correctly-load-html-from-web-page-into-a-string-in-java
    private static final Pattern CHARSET_PATTERN = Pattern.compile("text/html;\\s+charset=([^\\s]+)\\s*");

    public static String getPageContentsFromURL(String page)
            throws UnsupportedEncodingException, MalformedURLException, IOException {
        Reader r = null;
        try {
            URL url = new URL(page);
            HttpURLConnection con = (HttpURLConnection) url.openConnection();
            con.setRequestProperty("User-Agent", IE8_USER_AGENT);

            Matcher m = CHARSET_PATTERN.matcher(con.getContentType());
            /* If Content-Type doesn't match this pre-conception, choose default and
             * hope for the best. */
            String charset = m.matches() ? m.group(1) : "ISO-8859-1";
            r = new InputStreamReader(con.getInputStream(), charset);
            StringBuilder buf = new StringBuilder();
            while (true) {
                int ch = r.read();
                if (ch < 0)
                    break;
                buf.append((char) ch);
            }
            return buf.toString();
        } finally {
            if (r != null) {
                r.close();
            }
        }
    }

    private static final Pattern TITLE_PATTERN = Pattern.compile("<title>([^<]*)</title>");

    public static String getDesc(String page) {
        Matcher m = TITLE_PATTERN.matcher(page);
        if (m.find())
            return m.group(1);
        return page.contains("<title>") + "";
    }

    public static void main(String[] args)
            throws UnsupportedEncodingException, MalformedURLException, IOException {
        System.out.println(getDesc(getPageContentsFromURL("http://yandex.ru/yandsearch?text=%D0%A0%D0%B5%D0%B7%D1%83%D0%BB%D1%8C%D1%82%D0%B0%D1%82%D0%BE%D0%B2&lr=223")));
    }
}
```
This outputs:
??????????? — ??????: ??????? 360 ??? ???????
Although it should be:
Результатов — Яндекс: Нашлось 360 млн ответов
Can you help me understand what I'm doing wrong? Trying things like forcing UTF-8 does not help, even though it is the encoding specified in the source and the HTTP header.
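To narrow down where the characters get lost, a check along the following lines might help. It is only a sketch: `EncodingCheck` and `toUnicodeEscapes` are hypothetical names that are not part of the class above, and the hard-coded title is a stand-in for the value returned by `getDesc(getPageContentsFromURL(...))`. The idea is to print the string as ASCII-only escapes, so the result does not depend on what the console can display.

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

// Hypothetical diagnostic helper; not part of the I18NScraper class above.
final class EncodingCheck {

    /** Replaces every non-ASCII char with an ASCII-only escape (backslash, 'u', four hex digits). */
    static String toUnicodeEscapes(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                sb.append(c); // plain ASCII is kept as-is
            } else {
                sb.append(String.format("\\u%04X", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Assumption: stand-in for getDesc(getPageContentsFromURL(...)).
        String title = "\u0420\u0435\u0437";

        // The default charset is what plain System.out typically encodes with.
        System.out.println("default charset: " + Charset.defaultCharset());

        // ASCII-only view of the decoded string.
        System.out.println(toUnicodeEscapes(title));

        // Same string written through an explicitly UTF-8 encoded stream.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println(title);
    }
}
```

If the escaped output shows code points in the Cyrillic range (U+0400 to U+04FF), the page was probably decoded correctly and the question marks are introduced when printing. If it shows literal `?` characters, the loss more likely happens while reading the response.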
java encoding internationalization utf-8 web-crawler
dimo414 Sep 30 '11 at 19:15