Python website language detection

I am writing a bot that needs to check thousands of websites and determine whether or not each one is in English.

I am using Scrapy (a Python 2.7 framework) to scan the first page of each website.

Can someone tell me the best way to check the language of a site?

Any help would be appreciated.

+4
8 answers

Check out the Natural Language Toolkit:

NLTK: http://nltk.org/

What you want is the default English word list that ships with NLTK's corpus:

nltk.corpus.words.words()

Then compare the text against that word list using difflib.

Link: http://docs.python.org/library/difflib.html

Using these tools, you can build a score that measures how closely your text matches the English words defined by NLTK.
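
For illustration, here is a minimal sketch of what such a score could look like. The tokenizer, the 0.9 cutoff, and the example sentence are my own assumptions rather than part of the answer, and the difflib fallback is slow on long pages:

    # Requires: pip install nltk, then nltk.download('words') once.
    import re
    import difflib
    import nltk

    english_vocab = set(w.lower() for w in nltk.corpus.words.words())

    def english_score(text):
        # Fraction of tokens that are (or nearly are) known English words.
        tokens = re.findall(r"[a-zA-Z']+", text.lower())
        if not tokens:
            return 0.0
        hits = 0
        for token in tokens:
            if token in english_vocab:
                hits += 1
            # Fuzzy fallback for near-misses such as typos; very slow on long pages.
            elif difflib.get_close_matches(token, english_vocab, n=1, cutoff=0.9):
                hits += 1
        return hits / float(len(tokens))

    print(english_score("The quick brown fox jumps over the lazy dog"))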

+1

Since you are using Python, you can try NLTK. More precisely, you can check NLTK.detect.

More information and an exact code snippet can be found here: NLTK and language detection

+4

You can use the response headers to find out, in particular the Content-Language header (see Wikipedia's list of HTTP header fields).
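
For example, inside a Scrapy spider the header can be read straight off the response. This is only a sketch (the spider name and start URL are placeholders), and many sites send no Content-Language header at all:

    import scrapy

    class LanguageHeaderSpider(scrapy.Spider):
        name = 'language_header_check'              # placeholder name
        start_urls = ['https://en.wikipedia.org/']  # placeholder URL

        def parse(self, response):
            lang = response.headers.get('Content-Language')
            if lang and lang.lower().startswith(b'en'):
                self.logger.info('%s looks English (Content-Language: %s)',
                                 response.url, lang)
            else:
                self.logger.info('%s: no usable Content-Language header',
                                 response.url)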

+2

If the sites are multilingual, you can send the header "Accept-Language: en-US,en;q=0.8" and expect a response in English. If that does not work, you can check the response.headers dict and see if you can find any information about the language there.

If you're still unlucky, you can try mapping the server's IP address to a country, and the country to a language. As a last resort, try detecting the language from the page content itself (I don't know how accurate that is).
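
As a rough sketch of the first suggestion, the header can be attached to each request Scrapy sends; the spider name and URL below are placeholders for illustration only:

    import scrapy

    class AcceptLanguageSpider(scrapy.Spider):
        name = 'accept_language_check'   # placeholder name

        def start_requests(self):
            # In the real bot this would loop over the thousands of sites.
            yield scrapy.Request('https://example.com/',
                                 headers={'Accept-Language': 'en-US,en;q=0.8'})

        def parse(self, response):
            # If the site honours content negotiation, the body should now be
            # English; otherwise inspect response.headers for language hints.
            self.logger.info('Content-Language: %s',
                             response.headers.get('Content-Language'))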

+2

If you use Python, I highly recommend the standalone langid.py package written by Marco Lui and Tim Baldwin. The model comes pre-trained, recognition is very accurate, and it can also process XML/HTML documents.
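
A minimal usage sketch (assuming the package is installed with pip install langid); classify() returns a (language code, score) pair:

    import langid

    lang, score = langid.classify(u"This restaurant serves excellent seafood.")
    print(lang, score)   # e.g. ('en', ...)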

+2

You can use the language detection API at http://detectlanguage.com. It accepts a text string via GET or POST and provides JSON results with scores. There are free and premium services.
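
A rough sketch using the requests library. The endpoint URL and the q/key parameter names are taken from my reading of the service's documentation, so verify them against the current docs before relying on this:

    import requests

    resp = requests.post('https://ws.detectlanguage.com/0.2/detect',
                         data={'q': 'Buenos dias, senor', 'key': 'YOUR_API_KEY'})
    print(resp.json())   # JSON listing detected languages with confidence scores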

+1

If a website uses non-English characters, that is usually declared in a meta tag in the page's source code; it helps browsers know how to render the page.

Here is an example from the Arabic website http://www.tanmia.ae, which has both an English page and an Arabic page.

Meta tag on the Arabic page:

<meta http-equiv="X-UA-Compatible" content="IE=edge">

The same page, but in English:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

Maybe the bot could look at the meta tag, continue if it indicates English, and skip the page otherwise?
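
As a sketch, the relevant hints can be pulled out of the markup with Scrapy selectors inside the spider's callback. The helper name is my own, and none of these attributes are guaranteed to be present or truthful:

    def page_language_hints(response):
        # Collect whatever language/charset hints the markup exposes.
        return {
            'html_lang': response.xpath('//html/@lang').extract_first(),
            'meta_content_language': response.xpath(
                '//meta[@http-equiv="Content-Language"]/@content').extract_first(),
            'meta_charset': response.xpath('//meta/@charset').extract_first(),
        }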

0

If you do not want to trust what the web page tells you and want to check for yourself, you can use a statistical algorithm to detect the language. Trigram-based algorithms are robust and should work well even with pages that are mostly in another language but contain a bit of English (enough to fool a heuristic like "check whether the words 'and' or 'the' appear on the page"). Google "ngram language classification" and you will find many references on how this is done.

Compiling your own trigram tables for English takes some work, but the Natural Language Toolkit comes with a set for several common languages. They are located in NLTK_DATA/corpora/langid. You can use these trigrams without the nltk library itself, but you may also want to look at the nltk.util.trigrams module.
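
Here is a minimal, self-contained sketch of the trigram idea. It does not use the NLTK langid corpus files (whose on-disk format I have not checked), and the reference sample should really be much larger:

    import math
    from collections import Counter

    def trigram_profile(text):
        # Character-trigram frequency profile of the text.
        text = ' ' + ' '.join(text.lower().split()) + ' '
        return Counter(text[i:i + 3] for i in range(len(text) - 2))

    def cosine_similarity(p, q):
        dot = sum(p[t] * q[t] for t in set(p) & set(q))
        norm = math.sqrt(sum(v * v for v in p.values())) * \
               math.sqrt(sum(v * v for v in q.values()))
        return dot / norm if norm else 0.0

    english = trigram_profile("the quick brown fox jumps over the lazy dog "
                              "this is a short sample of ordinary english text")
    page = trigram_profile("ceci est une page principalement en francais")
    print(cosine_similarity(page, english))   # low value -> probably not English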

0
