Machine Learning Platform Selection

I have a set of user data and their loan repayment metrics (how long they took, how many payments, etc.). Now I want to analyze the users' past loan history and say: "If we lend them X, they are likely to repay it in Y payments over Z days."

Here are my thoughts:

  • The algorithm should be a clustering algorithm that groups all users according to their repayment habits.
  • I want to use SOM or K-Means.

So my question is: which platforms are suitable for this? I am already looking at Mahout.

+6
machine-learning cluster-analysis mahout
5 answers

Take a good look at Weka. It is a fairly mature open source toolkit with many machine learning algorithms, including clustering.

+2

RapidMiner - free community edition - easy to use - nice visualization

http://rapid-i.com/content/view/181/190/

+2

Another good library is scikits.learn (since renamed scikit-learn), a machine learning library for Python programmers.

0

There is an amazing book on this subject: Toby Segaran's Programming Collective Intelligence. It discusses various machine learning algorithms, clustering, etc. It also includes links to useful libraries and sample code.

0

Why clustering? This does not look like a clustering problem. You can run cluster analysis as a pre-processing phase to identify several user groups (or you can skip this phase), but then you still need to make some kind of numerical prediction: both the number of payments and the number of days are numbers, so how are you going to get these numbers from clusters?
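If you do keep clustering as a pre-processing step, here is a minimal sketch of K-Means (plain Lloyd's algorithm) in pure Python. The feature choice (days to repay, number of payments), the sample users and the `kmeans` helper are all made up for illustration. In practice you would probably scale the features first and use the implementation that ships with Weka, RapidMiner or scikits.learn.

```python
import random

def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm. Returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k distinct points as starting centroids
    labels = None
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels

# Hypothetical users: (days to repay, number of payments).
points = [(30, 3), (35, 3), (28, 2), (40, 4),
          (180, 12), (200, 12), (170, 10), (210, 14)]
centroids, labels = kmeans(points, k=2)
```

The labels only tell you which group a user belongs to. They do not give you the predicted number of payments or days, which is why a prediction step is still needed afterwards.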

I suggest you use regression for this task. Linear regression should fit your needs. If the dependent variables (number of payments and days) depend on other attributes non-linearly, you can try polynomial regression or even algorithms such as M5', which first build a decision tree and then fit a regression model at each leaf of the tree.
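To make the regression idea concrete, here is a rough sketch of ordinary least squares with a single predictor, written in plain Python. The loan amounts, the repayment days and the `fit_line` helper are invented for illustration (the data is deliberately exactly linear). A real model would use several attributes and one of the toolkits mentioned in this thread.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical history: loan amount -> days taken to repay.
amounts = [1000, 2000, 3000, 4000, 5000]
days = [40, 70, 100, 130, 160]

slope, intercept = fit_line(amounts, days)

def predict_days(amount):
    """Predicted repayment time for a new loan of the given amount."""
    return slope * amount + intercept

estimate = predict_days(2500)
```

You would fit a second model in the same way for the number of payments, since each dependent variable gets its own regression.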

If you have non-numeric attributes, you can also try using classification. In this case you need to manually create the possible classes (for example, the number of payments: from 3 to 5, from 6 to 10, etc.) and then use any classification algorithm (C4.5, SVM, Naive Bayes, to mention a few).
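Creating the classes by hand can be as simple as binning the target value. A small sketch, assuming the bin edges below (they extend the "3 to 5, 6 to 10" example and are otherwise arbitrary) and a hypothetical `payment_class` helper:

```python
from bisect import bisect_right

# Assumed class boundaries; adjust them to whatever ranges make sense for your data.
BIN_EDGES = [3, 6, 11]
BIN_LABELS = ["fewer than 3", "3 to 5", "6 to 10", "more than 10"]

def payment_class(n_payments):
    """Map a number of payments to its class label."""
    return BIN_LABELS[bisect_right(BIN_EDGES, n_payments)]

# Hypothetical payment counts turned into class labels for a classifier.
history = [2, 3, 5, 6, 10, 14]
classes = [payment_class(n) for n in history]
```

The resulting labels then become the target column for whichever classifier you pick.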

Actually, I don’t think you have a lot of data. I believe that if the total is less than 50 MB, there is no need to use monsters such as Mahout, which are designed to handle really large amounts of data. You can use Weka or RapidMiner for this purpose. Even if they cannot process your data using the default configuration, simply increase the memory for the JVM, and in 99% of cases they will be fine.

0