Feature selection and unsupervised learning for multilingual data + machine learning algorithm selection

Questions

I want to classify / categorize / cluster / group together a set of several thousand websites. There is data that we can train on, so we could do supervised learning, but it is not data that we gathered ourselves and we are not committed to using it, so we are also considering unsupervised learning.

  • What features can I use in a machine learning algorithm to deal with multilingual data? Note that some of these languages may not have been dealt with in the Natural Language Processing field.

  • If I were to use an unsupervised learning algorithm, should I just partition the data by language and deal with each language separately? Different languages may have different relevant categories (or not, depending on your psycholinguistic theoretical tendencies), which may affect the decision to partition.

  • I was thinking of using decision trees, or perhaps support vector machines (SVMs) to allow for more features (from my understanding of them). This post suggests random forests instead of SVMs. Any thoughts?
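For the first bullet, one option that needs no tokenizer, stemmer, or tagger is to use character n-grams as features, which matters if some of the languages have no NLP tooling. The following is only a minimal sketch using the Python standard library. The function names (`char_ngrams`, `tfidf_vectors`, `cosine`), the trigram size, and the smoothed IDF formula are illustrative assumptions, not something taken from this thread.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lowercased, whitespace-normalized string."""
    text = " ".join(text.lower().split())
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(docs, n=3):
    """Map each document to a sparse {ngram: weight} dict (assumed smoothed TF-IDF)."""
    counts = [Counter(char_ngrams(d, n)) for d in docs]
    # Document frequency: in how many documents each n-gram occurs.
    df = Counter()
    for c in counts:
        df.update(c.keys())
    n_docs = len(docs)
    vectors = []
    for c in counts:
        total = sum(c.values()) or 1
        vectors.append({
            gram: (k / total) * (math.log((1 + n_docs) / (1 + df[gram])) + 1.0)
            for gram, k in c.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(g, 0.0) for g, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    docs = [
        "football match results and league tables",
        "football league results and match reports",
        "recetas de cocina con tomate y queso",
    ]
    vecs = tfidf_vectors(docs)
    print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

The resulting sparse vectors could be fed to any of the classifiers mentioned in the third bullet. Whether character n-grams beat word-level features on real website text is something to check on your own data.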

Pragmatic approaches are welcome! (Theoretical ones too, but those can be saved for later fun.)

Some context

We are trying to classify a corpus of many thousands of websites in 3 to 5 languages (perhaps up to 10, but we are not sure).
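To make the "partition by language" option concrete, here is a rough sketch that groups pages by a language label and clusters each group separately. It assumes every page already carries a known language label, and it swaps in a small spherical k-means (k-means on unit vectors with cosine similarity) over character trigram counts. All names are hypothetical, and the centroids are seeded from the first k pages only to keep the example deterministic.

```python
import math
from collections import Counter, defaultdict


def ngram_counts(text, n=3):
    """Count character n-grams of a lowercased, whitespace-normalized string."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def normalize(vec):
    """Scale a sparse vector to unit length (empty dict if it has zero norm)."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {k: v / norm for k, v in vec.items()} if norm else {}


def dot(a, b):
    """Dot product of two sparse vectors, iterating over the smaller one."""
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b.get(k, 0.0) for k, v in a.items())


def spherical_kmeans(vectors, k, iterations=10):
    """Cluster unit-length sparse vectors by cosine similarity.

    Centroids are seeded from the first k vectors (an assumption made here
    for determinism; random restarts are the usual choice in practice).
    """
    centroids = [dict(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid by cosine similarity.
        labels = [
            max(range(len(centroids)), key=lambda c: dot(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid becomes the normalized sum of its members.
        for c in range(len(centroids)):
            total = Counter()
            for v, label in zip(vectors, labels):
                if label == c:
                    for key, val in v.items():
                        total[key] += val
            if total:
                centroids[c] = normalize(total)
    return labels


def cluster_per_language(pages, k=2):
    """pages: list of (language, text). Returns {language: [(text, cluster_id), ...]}."""
    by_lang = defaultdict(list)
    for lang, text in pages:
        by_lang[lang].append(text)
    result = {}
    for lang, texts in by_lang.items():
        vectors = [normalize(ngram_counts(t)) for t in texts]
        labels = spherical_kmeans(vectors, min(k, len(texts)))
        result[lang] = list(zip(texts, labels))
    return result
```

Cluster ids from different languages are not comparable in this setup, so matching an English cluster to a Spanish one would need a separate step. That is exactly the trade-off raised in the second bullet.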

[Remainder of the context not recoverable from the source; surviving terms: Brown Corpus, Brill, Orange.]

+5
3

[Answer text not recoverable from the source; surviving terms: Naive Bayes, SVM.]

+3

[Answer text not recoverable from the source; surviving terms: Yuval, Naive Bayes, SVM, Rosetta Stone, URL.]

+3

[Answer text not recoverable from the source; surviving terms: k-means, LSI, LSA, Random Indexing, Clojure, github, Naive Bayes, SVM, C4.5.]

+1
