MongoDB storage alongside MySQL XPath features

I have: a large set of complex documents with a tree-like data structure (each document carries its own data tree, which may differ from document to document). The backend is implemented with Django 1.3 and MySQL.

I need:

  • store these documents in scalable, fast storage
  • filter documents by a set of predefined queries
  • search within a limited subset of document fields
  • an additional feature: search for documents using arbitrary queries and extract arbitrary information from the data tree. This feature is a customer requirement and should be convenient for non-technical users. At the moment we have agreed that XPath is enough for this.

Note: documents rarely need to change; about 90% of operations will be reads.

Note: I rarely need all possible fields from the data tree. Roughly 90% of the time, the data I need covers only about 10% of the entire tree. The only case where all the data is needed is the additional feature described above, and in practice it is not a frequently used part of the system.

Note: the data tree that comes with each document is a representation of a custom format that cannot be changed. I can only extract the necessary pieces of data from the tree and convert them into a readable form (and write them back in that custom format).

I am currently using:

  • MySQL to store the data tree for each XML document
  • some pre-selected data from the XML stored as additional columns in one table to speed up searches
  • all other necessary fields are extracted from the XML on the fly during each query using MySQL's ExtractValue() function (more than 10 ExtractValue() calls per SQL query)
  • all search and filtering operations are performed with XPath / ExtractValue() queries against the stored XML

The problem is that this workaround performs very poorly: on a 100k+ record dataset, a single query with ~10 ExtractValue() calls takes about a minute to execute.
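
For illustration, such a query might look roughly like the sketch below (the table, column names, and XPath expressions are hypothetical, not my actual schema); the slowness presumably comes from every ExtractValue() call having to evaluate an XPath expression against the stored XML for each row:

    from django.db import connection

    def slow_document_search(min_amount):
        # Several ExtractValue() calls per row: each one evaluates an XPath
        # expression against the XML column for every candidate row.
        sql = """
            SELECT id,
                   ExtractValue(xml_tree, '/doc/customer/name') AS customer_name,
                   ExtractValue(xml_tree, '/doc/order/amount')  AS order_amount
              FROM documents
             WHERE ExtractValue(xml_tree, '/doc/order/amount') + 0 > %s
        """
        cursor = connection.cursor()
        cursor.execute(sql, [min_amount])
        return cursor.fetchall()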

The solutions that I see at the moment are:

1) Continue using the approach with pre-selected fields in additional columns. These fields are retrieved once when the document enters the system.

Pros:

  • Using proven technology (MySQL)
  • Most searches will run against the pre-selected fields without using the very slow ExtractValue() function
  • I tested XPath queries with this approach on a 100k+ record set, and a single ExtractValue() call is not too slow (<1 s per query) compared to the 10+ simultaneous ExtractValue() calls in the current interim approach

Cons:

  • Because each document can have its own data tree, and therefore its own set of pre-selected fields, I have to create a set of tables to hold these field sets and join them depending on the document type (see the model sketch after this list)
  • Pre-selected fields can be lists, not just single values, so each list needs a separate table
  • The pre-selected fields and the XML must be kept in sync
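
To make the drawback concrete, here is a minimal sketch of what those per-type and per-list tables could look like as Django models (all model and field names are hypothetical):

    from django.db import models

    class Document(models.Model):
        doc_type = models.CharField(max_length=50)
        xml_tree = models.TextField()              # the original custom-format tree

    # Pre-selected scalar fields for one particular document type.
    class InvoiceFields(models.Model):
        document = models.OneToOneField(Document)
        customer_name = models.CharField(max_length=255, db_index=True)
        total_amount = models.DecimalField(max_digits=12, decimal_places=2)

    # A list-valued pre-selected field needs its own table.
    class InvoiceTag(models.Model):
        document = models.ForeignKey(Document)
        tag = models.CharField(max_length=100, db_index=True)

Every new document type means another set of such tables, and queries that cross document types mean joining or unioning them.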

2) Use MySQL only for the custom XPath queries (i.e. the additional feature) and MongoDB for all other searches. MongoDB stores all the necessary pre-selected fields; MySQL stores only the XML.

Note: I don't think I should keep all system data in MongoDB (user accounts, sessions, etc.); using MongoDB for the documents alone will be enough.

Pros:

  • 90% of the required queries should be fast
  • I could store arbitrary attached data for each document; data trees may vary from document to document, and there is no need to join many tables
  • MongoDB has very convenient tooling for use with Python

Cons:

  • Unproven technology (at least for me): I have no experience with MongoDB, though I have consulted programmers who use it and it looks promising
  • MongoDB has no XPath-like querying (and it doesn't look like it will arrive in the near future), so I would keep using MySQL for XPath, as in solution 1). As a developer, I want to avoid inventing a custom query language, so XPath looks like a good compromise.
  • Synchronization between MySQL and MongoDB is therefore required (a sketch of this dual write follows)
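
As a rough sketch of what that synchronization could look like under solution 2) — assuming recent pymongo (MongoClient / insert_one) and hypothetical names, including a Django model named Document with an xml_tree field — the XML goes to MySQL while the pre-selected fields go to MongoDB, keyed by the MySQL primary key:

    from pymongo import MongoClient

    mongo_docs = MongoClient()['myapp']['documents']   # hypothetical db / collection

    def store_document(xml_string, preselected_fields):
        doc = Document.objects.create(xml_tree=xml_string)            # XML into MySQL
        mongo_docs.insert_one({'_id': doc.pk, 'fields': preselected_fields})
        return doc

    def fast_search(criteria):
        # e.g. fast_search({'fields.customer_name': 'ACME'})
        ids = [d['_id'] for d in mongo_docs.find(criteria, {'_id': 1})]
        return Document.objects.filter(pk__in=ids)

Keeping both writes in a single code path (e.g. the model's save method) is what keeps the two stores from drifting apart.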

My questions:

  • Are there any hidden or non-obvious problems with solution 2)?
  • I'm still worried about the performance of ExtractValue(). From the client's point of view, the XPath approach could be replaced with something equivalent, but I don't know of such tools for either MySQL or MongoDB. Are there XPath-like alternatives?

I would be grateful for any feedback, thoughts and comments on the above.

2 answers

So, if I understood the question correctly, you want to:

  • find a given node in the tree, given a partial path through the tree to that node, plus an additional query expression
  • then return that node and everything under it

You can do this with materialized paths. The main thing you need to work out is: if a document has the path "a..b..c..d..e" and you want to find documents by the partial path "..b..c..d..", how do you make that fast? Starting from the top of the tree is easy, but that is not the case here. It may make sense to use a combined approach where the document stores a materialized path for the node plus an array of the node's ancestors, something like:

{ path : ",a,b,c,d,e,", ancestor : ['a','b','c','d','e'] } 

We could index the ancestor field, which creates a multikey index (one entry per array element). Then we can find nodes on the partial path ",b,c,d," with some efficiency:

 find( { path : /,b,c,d,/, ancestor : 'd', <more_query_expressions_optionally> } ) 

In the above example, the index on ancestor will be used, so only documents with ancestor 'd' need to be inspected. One could also try the following, which might be even better depending on how smart the query optimizer is:

 find( { path : /,b,c,d,/, ancestor : { $all : ['a','d'] }, ... } ) 
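
For completeness, a pymongo version of the same idea (database, collection and field names are illustrative):

    import re
    from pymongo import MongoClient

    nodes = MongoClient()['treedb']['nodes']
    nodes.create_index('ancestor')            # multikey index, since ancestor is an array

    nodes.insert_one({'path': ',a,b,c,d,e,',
                      'ancestor': ['a', 'b', 'c', 'd', 'e']})

    # Find nodes containing the partial path ",b,c,d,": the ancestor condition
    # lets MongoDB use the index, the regex confirms the order of the nodes.
    hits = nodes.find({'path': re.compile(',b,c,d,'), 'ancestor': 'd'})

    # Possibly better, depending on the query optimizer:
    hits = nodes.find({'path': re.compile(',b,c,d,'),
                       'ancestor': {'$all': ['a', 'd']}})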

This is a very broad question, but one of the things I would consider is a dedicated XML data store such as MarkLogic or eXist, which are designed to optimize queries against tree-structured data.

You could also consider rolling your own basic search index on top of MySQL, or perhaps Lucene/Solr if you need better full-text search capabilities (phrases, synonyms, proximity queries, etc.).

I would do something like index the textual content of each named element and attribute (this is more or less the approach the XML stores mentioned above use), use that to get a candidate list of documents, and then evaluate the XPath expressions against the candidates to weed out false positives. However, this is not a small project.
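
A rough sketch of that approach in Python, assuming lxml for the XPath evaluation (function names are made up and the indexing is deliberately naive):

    from lxml import etree

    def index_terms(xml_string):
        """Collect element names, attribute names and element text for a simple term index."""
        root = etree.fromstring(xml_string)
        terms = set()
        for el in root.iter():
            if isinstance(el.tag, str):               # skip comments / processing instructions
                terms.add(el.tag)
                terms.update(el.keys())               # attribute names
                if el.text and el.text.strip():
                    terms.add((el.tag, el.text.strip()))
        return terms

    def xpath_filter(candidates, xpath_expr):
        """candidates: (doc_id, xml_string) pairs pre-selected via the term index."""
        hits = []
        for doc_id, xml_string in candidates:
            if etree.fromstring(xml_string).xpath(xpath_expr):   # weed out false positives
                hits.append(doc_id)
        return hits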

