How can we use unsupervised learning methods in a dataset and then label clusters?

Question


First of all, this is, of course, homework (so please, no full code). Nonetheless...

I need to test an unsupervised algorithm next to a supervised algorithm using the Neural Network Toolbox in MATLAB. The dataset is the UCI Artificial Characters database. The problem is that I had a good tutorial on supervised algorithms, but was left to sink on the unsupervised ones.

So, I know how to create a self-organizing map using selforgmap, and then I train it with train(net, trainingSet). I don't understand what to do next. I know that it has grouped the data I gave it into (hopefully) 10 clusters (one for each letter).

Two questions:

1. How can I then label the clusters (given that I have a comparison sample)?
   - Am I trying to turn this into a supervised learning problem when I do this?
2. How do I create a confusion matrix on (another) test set to compare with the supervised algorithm?

I think I am missing something conceptual or jargon-related: all my searches turn up supervised learning techniques. A point in the right direction would be greatly appreciated. My existing code is below:

    P = load('-ascii', 'pattern');
    T = load('-ascii', 'target');

    % data needs to be transposed
    P = P';
    T = T';
    T = T(find(sum(T')), :);

    mynet = selforgmap([10 10]);
    mynet.trainParam.epochs = 5000;
    mynet = train(mynet, P);

    P = load('-ascii', 'testpattern');
    T = load('-ascii', 'testtarget');
    P = P';
    T = T';
    T = T(find(sum(T')), :);

    Y = sim(mynet, P);
    Z = compet(Y);

    % this gives me a confusion matrix for supervised techniques:
    C = T*Z'

+6

matlab unsupervised-learning machine-learning neural-network

Hotchips Oct 9 '12 at 3:46


2 answers

Can this video provide any help? It does not answer your question, but it shows that human interaction may be required even to select the number of clusters. Automatic labeling of clusters is even more difficult.

If you think about it, there is no guarantee that the clustering will be based on the character shown. The network may just as well group the characters by line width, font smoothing, etc.

+1

Ivan Koblik Oct 9 '12 at 11:18


gevang · Accepted Answer · 2012-10-10T02:40:19+0000

Since you are not using any part of the labeled data, you are using an unsupervised method by definition.

"How can I then label the clusters (given that I have a comparison sample)?"

You can try different permutations of the label set and keep the one that minimizes the average error (or maximizes the accuracy) on the comparison sample. With clustering, you are free to label your clusters however you like. In other words, try different label assignments until you optimize the chosen performance metric.
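The permutation search described above can be sketched as follows. This is only an illustration on hypothetical toy data: `clusterIdx`, `trueLbl`, and the choice of three clusters are assumptions, not values from the question, and the code uses only core MATLAB functions (`accumarray`, `perms`, `sub2ind`).

```matlab
% Hypothetical toy data: cluster index and true class for 8 samples, C = 3.
clusterIdx = [1 1 2 2 2 3 3 3];   % output of some clustering step
trueLbl    = [2 2 3 3 1 1 1 1];   % labels of the comparison sample
C = 3;

% M(i, k) = number of samples of true class i that fell into cluster k.
M = accumarray([trueLbl(:) clusterIdx(:)], 1, [C C]);

P = perms(1:C);                   % every one-to-one labeling, C! rows
bestAcc = -1;
bestMap = [];
for r = 1:size(P, 1)
    p = P(r, :);                  % p(k) is the label given to cluster k
    acc = sum(M(sub2ind([C C], p, 1:C))) / numel(trueLbl);
    if acc > bestAcc
        bestAcc = acc;
        bestMap = p;
    end
end

predLbl = bestMap(clusterIdx);    % cluster output rewritten as class labels
```

Here `bestMap(k)` holds the class label chosen for cluster `k`, so indexing it with the cluster assignments turns cluster numbers into predicted classes.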

"Am I trying to turn this into a supervised learning problem when I do this?"

It depends. If you explicitly use (known) labeled data points during the clustering process, then this is semi-supervised. If not, you are simply using the label information to evaluate and "compare" against supervised approaches. This is still a form of supervision, though based not on a training set but on the best expected performance (i.e., an "agent" pinpointing the correct labels for the clusters).

"How do I create a confusion matrix on (another) test set to compare with the supervised algorithm?"

You need a way to turn clusters into labeled classes. For a small number of clusters (e.g. C <= 5) you could build all C! possible confusion matrices and keep the one that minimizes your average classification error. In your case, however, with C = 10 (10! = 3,628,800 candidate labelings), this is obviously impractical and overkill!
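Once a cluster-to-label mapping has been fixed, the confusion matrix on a separate test set is a simple count. A minimal sketch, assuming a hypothetical mapping `clusterToLabel` and hypothetical test vectors `testCluster` and `testTrue` (none of these names come from the question):

```matlab
% Number of one-to-one labelings an exhaustive search would visit for C = 10.
numMappings = factorial(10);

% Hypothetical mapping found on the comparison sample: cluster k -> label.
clusterToLabel = [2 3 1];
C = 3;

% Hypothetical test set: cluster assigned to each sample and its true class.
testCluster = [1 2 3 3 2 1];
testTrue    = [2 3 1 1 1 2];

% Translate cluster indices into predicted class labels.
predLbl = clusterToLabel(testCluster);

% confMat(i, j) = number of test samples of true class i predicted as class j.
confMat  = accumarray([testTrue(:) predLbl(:)], 1, [C C]);
accuracy = trace(confMat) / numel(testTrue);
```

With one-hot target and output matrices, as in the question's `T` and `Z`, the product `T*Z'` yields the same kind of count table, except that its columns are indexed by cluster rather than by class until a mapping is applied.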

Alternatively, you can label the clusters (and thus obtain confusion matrices) using:

- Semi-supervised approaches, where clusters are labeled a priori or the process is guided by seeding it with data belonging to a known cluster / class.
- Ranking or computing distances between the estimated cluster centroids and the ground-truth labels. This assigns each cluster the closest or most similar label.
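The second option can be sketched as a nearest-mean assignment. This is one possible reading of "distance between centroids and ground truth", not necessarily what the answer's author had in mind: each class is represented by the mean of its labeled samples (`classMeans`, hypothetical), and each cluster centroid takes the label of the closest class mean under squared Euclidean distance. All data below is made up for illustration.

```matlab
% Hypothetical 2-D data, one sample per column (same layout as P after P').
X = [0 0.2 0.1  5 5.1 4.9  0 0.1;
     0 0.1 0.2  5 4.9 5.2  5 5.2];
clusterIdx = [1 1 1 2 2 2 3 3];    % cluster of each column
K = 3;                             % number of clusters

% Hypothetical class means computed from labeled samples, one class per column:
% class 1 near (5,5), class 2 near (0,5), class 3 near (0,0).
classMeans = [5 0 0;
              5 5 0];

clusterToLabel = zeros(1, K);
for k = 1:K
    % Centroid of cluster k.
    centroid = mean(X(:, clusterIdx == k), 2);
    % Squared Euclidean distance from the centroid to every class mean.
    d = sum((classMeans - repmat(centroid, 1, size(classMeans, 2))).^2, 1);
    [dMin, nearest] = min(d);
    clusterToLabel(k) = nearest;
end
```

Unlike the permutation search, this mapping does not have to be one-to-one, so several clusters may share a label. That matters for the question's code, where `selforgmap([10 10])` builds a 10-by-10 grid of 100 units rather than 10 clusters.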
