Creating a term-document matrix in Python from an ElasticSearch index

I'm new to ElasticSearch. I have a set of text documents that I indexed into ElasticSearch through the Python ElasticSearch client. Now I want to do some machine learning on these documents using Python and scikit-learn. I need to do the following.

  • Use ElasticSearch analyzers to process the text (stemming, lowercasing, etc.).
  • Extract the processed documents (or the analyzed tokens) from the index.
  • Convert the processed documents into a term-document matrix (TDM) for classification, possibly using scikit-learn's CountVectorizer. Alternatively, maybe there is a way to get a TDM directly from ElasticSearch. (A sketch of these steps follows this list.)
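
For illustration, here is a minimal sketch of those three steps: it runs each document through the index's own analyzer via the _analyze API and feeds the resulting token lists to CountVectorizer. The host, index name ("docs"), and field name ("text") are assumptions, and the exact call signature can vary between elasticsearch-py versions:

    from elasticsearch import Elasticsearch
    from sklearn.feature_extraction.text import CountVectorizer

    es = Elasticsearch("http://localhost:9200")  # assumed host

    raw_docs = ["The quick brown fox", "jumped over the lazy dog"]

    def analyze(text):
        # Run the text through the analyzer configured for the "text" field,
        # so Python sees exactly the tokens ES would index.
        resp = es.indices.analyze(index="docs", body={"field": "text", "text": text})
        return [t["token"] for t in resp["tokens"]]

    token_lists = [analyze(d) for d in raw_docs]

    # CountVectorizer accepts a callable analyzer; the identity function
    # works here because the documents are already tokenized by ES.
    vectorizer = CountVectorizer(analyzer=lambda tokens: tokens)
    tdm = vectorizer.fit_transform(token_lists)  # sparse document-term matrix

The drawback, as noted below, is that every document gets analyzed a second time on the way out.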

I'm having trouble working out the right way to approach this, and there seems to be no simple example of doing it with ElasticSearch.

For example, I could just extract the raw, unanalyzed documents from ES and then process them in Python, but I want to use the ES analyzers. I could run the ES analyzers every time I request a set of documents from ES, but that seems like doing the work twice, since the text has already been analyzed and stored in the index. Alternatively, I think I can ask ES for the term vectors of each document, manually extract the tokens and counts from the results, and then build the TDM from those tokens and counts myself. That is the most direct approach I have come up with so far (sketched below).
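
Here is a rough sketch of that term-vectors route, building the matrix with scipy. It assumes an index "docs" with an analyzed "text" field; ES can compute term vectors on the fly from _source even when they are not stored in the mapping, though storing them in the mapping is faster:

    from elasticsearch import Elasticsearch
    from scipy.sparse import lil_matrix

    es = Elasticsearch("http://localhost:9200")  # assumed host
    doc_ids = ["1", "2", "3"]                    # assumed document IDs

    # Collect per-document term frequencies from the _termvectors API.
    freqs = []
    for doc_id in doc_ids:
        resp = es.termvectors(index="docs", id=doc_id, fields=["text"])
        terms = resp["term_vectors"]["text"]["terms"]
        freqs.append({term: info["term_freq"] for term, info in terms.items()})

    # Build a shared vocabulary, then fill a sparse matrix with one row
    # per document and one column per term (the scikit-learn convention).
    vocab = {t: i for i, t in enumerate(sorted(set().union(*freqs)))}
    tdm = lil_matrix((len(doc_ids), len(vocab)), dtype=int)
    for row, tf in enumerate(freqs):
        for term, count in tf.items():
            tdm[row, vocab[term]] = count

    tdm = tdm.tocsr()  # efficient format for scikit-learn estimators

For many documents, the mtermvectors API can fetch term vectors in batches instead of one request per document.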

Is there a good way to build a TDM from an ES index in Python?

1 Answer:

You can do this in Python. For example:

- If the matrix is too large to fit in memory, consider Apache Spark. A sparse matrix can be stored in Spark MLlib as a RowMatrix backed by an RDD, and it is accessible from Python.
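
In case it helps, here is a sketch of that Spark route, reusing the hypothetical freqs and vocab structures from the term-vectors example above to build a distributed RowMatrix of sparse vectors:

    from pyspark import SparkContext
    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.linalg.distributed import RowMatrix

    sc = SparkContext(appName="tdm")

    # One sparse MLlib vector per document, indexed by the shared vocabulary.
    n_terms = len(vocab)
    rows = sc.parallelize(
        [Vectors.sparse(n_terms, {vocab[t]: float(c) for t, c in tf.items()})
         for tf in freqs]
    )
    mat = RowMatrix(rows)  # distributed, row-sparse term-document matrix
    print(mat.numRows(), mat.numCols())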
