Supervised Latent Dirichlet Allocation (LDA) for document classification?

I have a collection of documents that have already been manually classified into groups.

Is there a modified version of LDA that I can use to train a model on these documents and then later classify unknown documents with it?

+7
3 answers

For what it is worth, LDA as a classifier is going to be fairly weak, because it is a generative model while classification is a discriminative problem. There is a variant of LDA called supervised LDA which uses a more discriminative criterion to form the topics (you can get source code for this in various places), and there is also a paper with a max margin formulation whose source code status I do not know. I would avoid the Labeled LDA formulation unless you are sure that is what you want, because it makes a strong assumption about the correspondence between topics and categories in the classification problem.

However, it is worth pointing out that none of these methods use the topic model directly for classification. Instead, they take documents and, rather than using word-based features, use the posterior over topics (the vector obtained from inference on the document) as the feature representation before feeding it to a classifier, usually a linear SVM. This gives you a topic-model-based dimensionality reduction followed by a strong discriminative classifier, which is probably what you are after. This pipeline is available in most languages using popular toolkits.
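To make that pipeline concrete, here is a minimal sketch (my own illustration, assuming scikit-learn; it uses plain unsupervised LDA for the dimensionality reduction plus a linear SVM, and the documents, labels, and parameter values are placeholders):

# Sketch: LDA topic posteriors as features for a linear SVM (assumes scikit-learn)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import LinearSVC

train_docs = ["the film was a masterpiece", "tedious and silly", "well worth seeing"]  # placeholder documents
train_labels = [1, 0, 1]                                                               # placeholder class labels
test_docs = ["a great film"]

# word counts -> per-document topic distributions -> linear SVM
vectorizer = CountVectorizer(stop_words='english')
X_train = vectorizer.fit_transform(train_docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
train_topics = lda.fit_transform(X_train)   # D x K matrix of topic proportions

clf = LinearSVC()
clf.fit(train_topics, train_labels)

# classify an unseen document with the same pipeline
test_topics = lda.transform(vectorizer.transform(test_docs))
print(clf.predict(test_topics))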

+10

Yes, you can try Labeled LDA in the Stanford Topic Modeling Toolbox at http://nlp.stanford.edu/software/tmt/tmt-0.4/

+3

You can implement supervised LDA (sLDA) with PyMC, which uses Metropolis sampling to infer the latent variables in the following graphical model:

(figure: sLDA graphical model)

The training corpus consists of 10 movie reviews (5 positive and 5 negative), along with an associated star rating for each document. The star rating is the known response variable of interest for each document. The documents and response variables are modeled jointly in order to find latent topics that best predict the response variables for future unlabeled documents. For more information, see the original sLDA paper. Consider the following code:

import pymc as pm
import numpy as np
import matplotlib.pyplot as plt   # needed for the topic plots below (missing in the original listing)
from sklearn.feature_extraction.text import TfidfVectorizer

train_corpus = ["exploitative and largely devoid of the depth or sophistication ",
                "simplistic silly and tedious",
                "it so laddish and juvenile only teenage boys could possibly find it funny",
                "it shows that some studios firmly believe that people have lost the ability to think",
                "our culture is headed down the toilet with the ferocity of a frozen burrito",
                "offers that rare combination of entertainment and education",
                "the film provides some great insight",
                "this is a film well worth seeing",
                "a masterpiece four years in the making",
                "offers a breath of the fresh air of true sophistication"]
test_corpus = ["this is a really positive review, great film"]

# star ratings (1-5) shifted to the range [-2, 2]
train_response = np.array([3, 1, 3, 2, 1, 5, 4, 4, 5, 5]) - 3

# LDA parameters
num_features = 1000   # vocabulary size
num_topics = 4        # fixed for LDA

tfidf = TfidfVectorizer(max_features=num_features, max_df=0.95, min_df=0, stop_words='english')

# generate tf-idf term-document matrix
A_tfidf_sp = tfidf.fit_transform(train_corpus)   # size D x V

print "number of docs: %d" % A_tfidf_sp.shape[0]
print "dictionary size: %d" % A_tfidf_sp.shape[1]

# tf-idf dictionary
tfidf_dict = tfidf.get_feature_names()

K = num_topics           # number of topics
V = A_tfidf_sp.shape[1]  # number of words
D = A_tfidf_sp.shape[0]  # number of documents

data = A_tfidf_sp.toarray()

# Supervised LDA graphical model
Wd = [len(doc) for doc in data]
alpha = np.ones(K)
beta = np.ones(V)

# per-document topic proportions and per-topic word distributions
theta = pm.Container([pm.CompletedDirichlet("theta_%s" % i, pm.Dirichlet("ptheta_%s" % i, theta=alpha)) for i in range(D)])
phi = pm.Container([pm.CompletedDirichlet("phi_%s" % k, pm.Dirichlet("pphi_%s" % k, theta=beta)) for k in range(K)])

# per-word topic assignments
z = pm.Container([pm.Categorical('z_%s' % d,
                                 p=theta[d],
                                 size=Wd[d],
                                 value=np.random.randint(K, size=Wd[d]))
                  for d in range(D)])

# empirical topic frequencies per document, used as regression covariates
@pm.deterministic
def zbar(z=z):
    zbar_list = []
    for i in range(len(z)):
        hist, bin_edges = np.histogram(z[i], bins=K)
        zbar_list.append(hist / float(np.sum(hist)))
    return pm.Container(zbar_list)

# regression coefficients and noise precision for the response variable
eta = pm.Container([pm.Normal("eta_%s" % k, mu=0, tau=1.0/10**2) for k in range(K)])
y_tau = pm.Gamma("tau", alpha=0.1, beta=0.1)

@pm.deterministic
def y_mu(eta=eta, zbar=zbar):
    y_mu_list = []
    for i in range(len(zbar)):
        y_mu_list.append(np.dot(eta, zbar[i]))
    return pm.Container(y_mu_list)

# response likelihood
y = pm.Container([pm.Normal("y_%s" % d, mu=y_mu[d], tau=y_tau, value=train_response[d], observed=True) for d in range(D)])

# word likelihood
# cannot use p=phi[z[d][i]] here since phi is an ordinary list while z[d][i] is stochastic
w = pm.Container([pm.Categorical("w_%i_%i" % (d, i),
                                 p=pm.Lambda('phi_z_%i_%i' % (d, i), lambda z=z[d][i], phi=phi: phi[z]),
                                 value=data[d][i],
                                 observed=True)
                  for d in range(D) for i in range(Wd[d])])

model = pm.Model([theta, phi, z, eta, y, w])
mcmc = pm.MCMC(model)
mcmc.sample(iter=1000, burn=100, thin=2)

# visualize topics
phi0_samples = np.squeeze(mcmc.trace('phi_0')[:])
phi1_samples = np.squeeze(mcmc.trace('phi_1')[:])
phi2_samples = np.squeeze(mcmc.trace('phi_2')[:])
phi3_samples = np.squeeze(mcmc.trace('phi_3')[:])

ax = plt.subplot(221)
plt.bar(np.arange(V), phi0_samples[-1, :])
ax = plt.subplot(222)
plt.bar(np.arange(V), phi1_samples[-1, :])
ax = plt.subplot(223)
plt.bar(np.arange(V), phi2_samples[-1, :])
ax = plt.subplot(224)
plt.bar(np.arange(V), phi3_samples[-1, :])
plt.show()

Given the training data (observed words and response variables), we can learn the global topics (beta) and the regression coefficients (eta) for predicting the response variable (Y), in addition to the topic proportions for each document (theta). To predict Y for new documents with the learned beta and eta, we can define a second model in which Y is not observed (sketched after the figure discussion below), which produces the following result:

(figure: sLDA prediction result)

Here we predict a positive review (approximately 2, given a rating range of -2 to 2) for the test document consisting of the single sentence "this is a really positive review, great film", as indicated by the mode of the posterior histogram of y on the right. See the ipython notebook for a complete implementation.
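For a rough idea of what that test-time model could look like, here is a minimal sketch (my own illustration, not the notebook's actual code). It assumes the training script above has already run in the same session, so mcmc, tfidf, test_corpus, and K are available, and it simply freezes phi, eta, and tau at their posterior means:

# Sketch of a prediction model: phi, eta, tau are fixed at posterior means from the
# training MCMC run above; the test document's words are observed and y is not.
import pymc as pm
import numpy as np

phi_mean = np.array([np.squeeze(mcmc.trace('phi_%d' % k)[:]).mean(axis=0) for k in range(K)])
eta_mean = np.array([mcmc.trace('eta_%d' % k)[:].mean() for k in range(K)])
tau_mean = mcmc.trace('tau')[:].mean()

test_data = tfidf.transform(test_corpus).toarray()   # reuse the training vectorizer
W_test = len(test_data[0])

theta_test = pm.CompletedDirichlet("theta_test", pm.Dirichlet("ptheta_test", theta=np.ones(K)))
z_test = pm.Categorical("z_test", p=theta_test, size=W_test,
                        value=np.random.randint(K, size=W_test))

# empirical topic frequencies of the test document
@pm.deterministic
def zbar_test(z=z_test):
    hist, _ = np.histogram(z, bins=K)
    return hist / float(np.sum(hist))

@pm.deterministic
def y_mu_test(zbar=zbar_test):
    return np.dot(eta_mean, zbar)

# unobserved response: its posterior samples form the predictive histogram
y_test = pm.Normal("y_test", mu=y_mu_test, tau=tau_mean)

# observed words of the test document, with topics fixed at their posterior means
w_test = pm.Container([pm.Categorical("w_test_%i" % i,
                                      p=pm.Lambda('phi_z_test_%i' % i,
                                                  lambda z=z_test[i]: phi_mean[z]),
                                      value=test_data[0][i], observed=True)
                       for i in range(W_test)])

test_model = pm.Model([theta_test, z_test, y_test, w_test])
test_mcmc = pm.MCMC(test_model)
test_mcmc.sample(iter=1000, burn=100, thin=2)

y_pred_samples = test_mcmc.trace('y_test')[:]   # predictive samples of the star rating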

+3
