How to use the Dirichlet Process Gaussian Mixture Model (DPGMM) in scikit-learn? (n_components?)

My understanding of an infinite mixture model with a Dirichlet Process prior on the number of clusters is that the number of clusters is determined by the data as inference converges on some number of clusters.


This R implementation, https://github.com/jacobian1980/ecostates, determines the number of clusters in this way. Although the R implementation uses Gibbs sampling, I'm not sure whether that affects this behavior.

I am confused by the n_components parameter. The docs describe it as: n_components : int, default 1 : Number of mixture components. If the number of components is determined by the data and the Dirichlet Process, then what is this parameter for?

Ultimately, I am trying to get:

(1) cluster assignment for each sample;

(2) probability vectors for each cluster; and

(3) the probability / log probability of each sample.

It appears that (1) corresponds to the predict method and (3) to the score method. However, the output of (1) is completely dependent on the n_components hyperparameter.
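For concreteness, here is a minimal sketch (on toy data) of how I expect those three outputs to map onto the API; predict and score appear in the linked docs, while predict_proba and weights_ are my assumption based on the GMM base class:

    import numpy as np
    from sklearn.mixture import DPGMM

    X = np.random.randn(100, 2)   # toy data standing in for my real table

    model = DPGMM(n_components=3)
    model.fit(X)

    labels = model.predict(X)      # (1) cluster assignment for each sample
    weights = model.weights_       # (2) mixing weight of each cluster
    resp = model.predict_proba(X)  # (2) per-sample probability vector over clusters
    log_prob = model.score(X)      # (3) per-sample log probability in this old API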

My apologies if this is a naive question; I am very new to Bayesian programming and noticed that scikit-learn has a Dirichlet Process implementation that I wanted to try.

Here's the docs: http://scikit-learn.org/stable/modules/generated/sklearn.mixture.DPGMM.html#sklearn.mixture.DPGMM

Here is a usage example: http://scikit-learn.org/stable/auto_examples/mixture/plot_gmm.html

Here is my naive use:

    import pandas as pd
    from sklearn.mixture import DPGMM

    # Load the data (samples as rows, features as columns)
    X = pd.read_table("Data/processed/data.tsv", sep="\t", index_col=0)

    # Fit a Dirichlet Process GMM with at most 3 components
    mod_dpgmm = DPGMM(n_components=3)
    mod_dpgmm.fit(X)
1 answer

As @maxymoo pointed out in the comments, n_components is a truncation parameter: the variational algorithm considers at most that many components, and the Dirichlet Process prior drives the weights of any surplus components toward zero.
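Here is a minimal sketch of what truncation means in practice, on synthetic data with two well-separated clusters. It uses the DPGMM API as documented at the time (DPGMM was later deprecated in favor of BayesianGaussianMixture with weight_concentration_prior_type='dirichlet_process'); the specific parameter values are illustrative:

    import numpy as np
    from sklearn.mixture import DPGMM

    rng = np.random.RandomState(0)
    # Two well-separated blobs, i.e. the "true" number of clusters is 2
    X = np.vstack([rng.randn(200, 2), rng.randn(200, 2) + 8])

    # Truncate at 10 components; the DP prior should only "use" about 2 of them
    model = DPGMM(n_components=10, alpha=1.0, n_iter=100)
    model.fit(X)

    print(np.round(model.weights_, 3))        # most weights should be near zero
    print(len(np.unique(model.predict(X))))   # effective number of clusters

So n_components only caps the number of clusters; the data and alpha determine how many of them actually receive appreciable weight.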

In the context of the Chinese restaurant process, which is related to the stick-breaking representation behind sklearn's DP-GMM, a new data point joins an existing cluster k with probability |k| / (n - 1 + alpha) and starts a new cluster with probability alpha / (n - 1 + alpha). Here alpha can be interpreted as the concentration parameter of the Dirichlet process, and it influences the final number of clusters.
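To see how alpha influences the number of clusters, here is a small simulation of the Chinese restaurant process; this is my own plain-NumPy illustration of the formula above, not scikit-learn code:

    import numpy as np

    def crp(n, alpha, rng):
        """Simulate cluster assignments for n points under a CRP with concentration alpha."""
        counts = []  # counts[k] = number of points already in cluster k
        for i in range(n):
            # Existing cluster k is chosen with probability counts[k] / (i + alpha),
            # a new cluster with probability alpha / (i + alpha)
            probs = np.array(counts + [alpha], dtype=float)
            probs /= probs.sum()
            k = rng.choice(len(probs), p=probs)
            if k == len(counts):
                counts.append(1)  # open a new table/cluster
            else:
                counts[k] += 1
        return len(counts)

    rng = np.random.RandomState(0)
    for alpha in [0.1, 1.0, 10.0]:
        n_clusters = [crp(1000, alpha, rng) for _ in range(20)]
        print(alpha, np.mean(n_clusters))  # grows roughly like alpha * log(n)

Larger alpha makes new clusters more likely, so the expected number of clusters grows with it.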

Unlike the R implementation, which uses Gibbs sampling, the sklearn DP-GMM implementation uses variational inference. This may account for the differences in results.

A gentle Dirichlet Process tutorial can be found here.
