My understanding of an infinite mixture model with a Dirichlet process prior on the number of clusters is that the number of clusters is determined by the data, which converge to some number of clusters during inference.
This R implementation https://github.com/jacobian1980/ecostates determines the number of clusters in this way. Although the R implementation uses a Gibbs sampler, I'm not sure whether that affects anything here.
What confuses me is the n_components parameter. The docs say: n_components : int, default 1 — number of mixture components. If the number of components is determined by the data and the Dirichlet process, then what is this parameter for?
Ultimately, I am trying to get:
(1) cluster assignment for each sample;
(2) probability vectors for each cluster; and
(3) probability / logarithmic probability for each sample.
It appears that (1) comes from the predict method and (3) from the score method. However, output (1) depends entirely on the hyperparameter n_components.
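For what it's worth, my current understanding is that n_components is only a truncation level (an upper bound) for the stick-breaking approximation, not the final number of clusters. A minimal sketch of this, using BayesianGaussianMixture (which replaced DPGMM in current scikit-learn) with a Dirichlet-process prior on synthetic data I made up for illustration:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Two well-separated 2-D blobs, but we allow up to 10 components.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(200, 2), rng.randn(200, 2) + 10])

# With a Dirichlet-process prior, n_components is a truncation (upper
# bound); the fit can leave extra components with near-zero weight.
bgm = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    random_state=0,
).fit(X)

# Count components that actually receive appreciable weight.
effective = int(np.sum(bgm.weights_ > 0.01))
print(bgm.weights_.round(3))
print("effective components:", effective)
```

If that reading is right, the data decide how many of the n_components "slots" are actually used, and the rest are shrunk toward zero weight.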
My apologies if this is a naive question; I am very new to Bayesian programming, and I noticed scikit-learn has a Dirichlet process implementation that I wanted to try.
Here's the docs: http://scikit-learn.org/stable/modules/generated/sklearn.mixture.DPGMM.html#sklearn.mixture.DPGMM
Here is a usage example: http://scikit-learn.org/stable/auto_examples/mixture/plot_gmm.html
Here is my naive usage (with the pandas import that was missing):

    import pandas as pd
    from sklearn.mixture import DPGMM

    X = pd.read_table("Data/processed/data.tsv", sep="\t", index_col=0)
    mod_dpgmm = DPGMM(n_components=3)
    mod_dpgmm.fit(X)
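In case it helps frame the question, here is how I would expect to pull out the three outputs above. This is a sketch using BayesianGaussianMixture (the current replacement for the deprecated DPGMM) on made-up data standing in for my TSV file; in the new API, score_samples gives per-sample log-likelihoods, and I am treating weights_ and means_ as the per-cluster parameters, which may not match my "probability vectors" exactly:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Made-up stand-in for the real TSV data.
rng = np.random.RandomState(1)
X = np.vstack([rng.randn(100, 2), rng.randn(100, 2) + 5])

model = BayesianGaussianMixture(
    n_components=5,
    weight_concentration_prior_type="dirichlet_process",
    random_state=1,
).fit(X)

labels = model.predict(X)          # (1) cluster assignment per sample
weights = model.weights_           # (2) mixing weight per component
means = model.means_               # (2) component means
log_prob = model.score_samples(X)  # (3) per-sample log-likelihood

print(labels[:5], weights.round(3), log_prob[:3].round(2))
```

So my question reduces to: with this setup, is it correct that the labels from (1) can only take values below whatever I pass as n_components?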