Encoding issues crawling non-English sites

I am trying to get the contents of a webpage as a string, and I found this question on writing a basic web crawler, which claims to (and seems to) handle the encoding problem. However, the code provided there, which works on US and UK websites, does not handle other languages properly.

Here is the complete Java class that demonstrates what I mean:

 import java.io.IOException;
 import java.io.InputStreamReader;
 import java.io.Reader;
 import java.io.UnsupportedEncodingException;
 import java.net.HttpURLConnection;
 import java.net.MalformedURLException;
 import java.net.URL;
 import java.util.regex.Matcher;
 import java.util.regex.Pattern;

 public class I18NScraper {
     static {
         System.setProperty("http.agent", "");
     }

     public static final String IE8_USER_AGENT = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; WOW64; Trident/4.0; SLCC1; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; InfoPath.2)";

     ///questions/65453/simplest-way-to-correctly-load-html-from-web-page-into-a-string-in-java
     private static final Pattern CHARSET_PATTERN =
         Pattern.compile("text/html;\\s+charset=([^\\s]+)\\s*");

     public static String getPageContentsFromURL(String page)
             throws UnsupportedEncodingException, MalformedURLException, IOException {
         Reader r = null;
         try {
             URL url = new URL(page);
             HttpURLConnection con = (HttpURLConnection) url.openConnection();
             con.setRequestProperty("User-Agent", IE8_USER_AGENT);
             Matcher m = CHARSET_PATTERN.matcher(con.getContentType());
             /* If Content-Type doesn't match this preconception, choose a default
              * and hope for the best. */
             String charset = m.matches() ? m.group(1) : "ISO-8859-1";
             r = new InputStreamReader(con.getInputStream(), charset);
             StringBuilder buf = new StringBuilder();
             while (true) {
                 int ch = r.read();
                 if (ch < 0)
                     break;
                 buf.append((char) ch);
             }
             return buf.toString();
         } finally {
             if (r != null) {
                 r.close();
             }
         }
     }

     private static final Pattern TITLE_PATTERN = Pattern.compile("<title>([^<]*)</title>");

     public static String getDesc(String page) {
         Matcher m = TITLE_PATTERN.matcher(page);
         if (m.find())
             return m.group(1);
         return page.contains("<title>") + "";
     }

     public static void main(String[] args)
             throws UnsupportedEncodingException, MalformedURLException, IOException {
         System.out.println(getDesc(getPageContentsFromURL(
             "http://yandex.ru/yandsearch?text=%D0%A0%D0%B5%D0%B7%D1%83%D0%BB%D1%8C%D1%82%D0%B0%D1%82%D0%BE%D0%B2&lr=223")));
     }
 }

This outputs:

 ???????????&nbsp;&mdash; ??????: ??????? 360&nbsp;???&nbsp;??????? 

Although it should be:

 Результатов&nbsp;&mdash; Яндекс: нашлось 360&nbsp;млн&nbsp;ответов 

Can you help me understand what I'm doing wrong? Trying things like forcing UTF-8 does not help, even though it is the encoding specified in the source and the HTTP header.

+3
java encoding internationalization utf-8 web-crawler
Sep 30 '11 at 19:15
3 answers

The problem you see is that the default encoding on your Mac does not support the Cyrillic script. I'm not sure whether this is true in the Oracle JVM, but when Apple released its own JVMs, the default character encoding for Java was MacRoman.

When launching your program, set the file.encoding system property to UTF-8 (which is the default setting on Mac OS X): java -Dfile.encoding=UTF-8 .... Note that you must set it at launch; if you set it programmatically (with a call to System.setProperty()), it is too late and the setting will be ignored.

Whenever Java needs to encode characters into bytes, for example when it converts text to bytes for writing to the standard output or error streams, it uses the default encoding unless you specify otherwise. If the default encoding cannot encode a particular character, a replacement character is substituted.

If the encoding can represent it, the Unicode replacement character U+FFFD (�) is used. Otherwise, a question mark (?) is the commonly used replacement.
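
To make the substitution concrete, here is a small self-contained sketch (the class name is mine; the behavior it prints is standard JDK):

 import java.nio.charset.Charset;
 import java.nio.charset.StandardCharsets;

 // Hypothetical demo: an encoding without Cyrillic silently turns
 // every unmappable character into '?'.
 public class ReplacementDemo {
     public static void main(String[] args) {
         String cyrillic = "Результатов";
         // ISO-8859-1 has no Cyrillic: each character becomes '?' (0x3F)
         byte[] latin1 = cyrillic.getBytes(StandardCharsets.ISO_8859_1);
         System.out.println(new String(latin1, StandardCharsets.ISO_8859_1)); // ???????????
         // UTF-8 round-trips the text intact
         byte[] utf8 = cyrillic.getBytes(StandardCharsets.UTF_8);
         System.out.println(new String(utf8, StandardCharsets.UTF_8));        // Результатов
         // This is what -Dfile.encoding controls at JVM startup
         System.out.println("Default charset: " + Charset.defaultCharset());
     }
 }

On a JVM where file.encoding governs the console encoding, running this once with -Dfile.encoding=UTF-8 and once with -Dfile.encoding=ISO-8859-1 makes the failure mode obvious in the last line.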

+1
Oct 1 '11

Correctly determining the encoding of a page can be difficult.

You need to use a combination of:

a) Content-Type HTML META tag:

 <META http-equiv="Content-Type" content="text/html; charset=EUC-JP"> 

b) HTTP response header:

 Content-Type: text/html; charset=utf-8 

c) heuristics for determining the encoding from the raw bytes (see this question)

The reason for using all three is:

  • (a) and (b) may be absent.
  • The META Content-Type may be incorrect (see this question)

What if (a) and (b) are both missing?

In this case, you need to use some heuristics to determine the correct encoding; see this question.

I believe this sequence is the most reliable for identifying the encoding of an HTML page:

  • Use the HTTP Content-Type response header (if present)
  • Use an encoding detector on the response bytes
  • Use the HTML META Content-Type tag

though you could swap steps 2 and 3.
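
As an illustration, here is a minimal JDK-only sketch of that sequence. The class name, the regexes, and the UTF-8 last resort are my own assumptions, and step 2 is left as a stub where a byte-level detector library would plug in:

 import java.nio.charset.Charset;
 import java.nio.charset.StandardCharsets;
 import java.util.regex.Matcher;
 import java.util.regex.Pattern;

 // Hypothetical helper showing the header -> detector -> META fallback chain.
 public class CharsetResolver {
     private static final Pattern HEADER_CHARSET =
         Pattern.compile("charset=([\\w-]+)", Pattern.CASE_INSENSITIVE);
     private static final Pattern META_CHARSET =
         Pattern.compile("<meta[^>]+charset=[\"']?([\\w-]+)", Pattern.CASE_INSENSITIVE);

     public static Charset resolve(String contentTypeHeader, byte[] body) {
         // 1. HTTP Content-Type response header, if present
         if (contentTypeHeader != null) {
             Matcher m = HEADER_CHARSET.matcher(contentTypeHeader);
             if (m.find()) return Charset.forName(m.group(1));
         }
         // 2. A byte-level encoding detector would go here; omitted to keep
         //    this sketch dependency-free.
         // 3. META tag: decode leniently as Latin-1 just to scan the markup
         String markup = new String(body, StandardCharsets.ISO_8859_1);
         Matcher m = META_CHARSET.matcher(markup);
         if (m.find()) return Charset.forName(m.group(1));
         return StandardCharsets.UTF_8; // last-resort default
     }
 }

Note that Charset.forName() throws if a page advertises a bogus charset name, so a production version should catch that and fall through to the next step.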

+2
Sep 30 '11

Apache Tika contains an implementation of what you want here; many people use it for exactly this. You can also look at Apache Nutch. Then again, you may not have to implement your own crawler at all.
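
For example, a small sketch using Tika's CharsetDetector from the tika-parsers module; the API names are as I recall them, and "page.html" is a hypothetical local copy of a fetched page, so treat the details as assumptions:

 import java.nio.file.Files;
 import java.nio.file.Paths;

 import org.apache.tika.parser.txt.CharsetDetector;
 import org.apache.tika.parser.txt.CharsetMatch;

 public class TikaCharsetExample {
     public static void main(String[] args) throws Exception {
         // Read the raw bytes of a previously fetched page (hypothetical file)
         byte[] bytes = Files.readAllBytes(Paths.get("page.html"));
         CharsetDetector detector = new CharsetDetector();
         detector.setText(bytes);
         CharsetMatch match = detector.detect(); // best statistical guess
         System.out.println(match.getName()
             + " (confidence " + match.getConfidence() + ")");
     }
 }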

0
Sep 30 '11


