Lucene Indexing: Generic or Isolated on an Account?

I am evaluating Lucene to implement a global search function in a SaaS application.

We do not want users to see the contents of other accounts, so the search will always be limited to the account.

Is it better to have one index with an account identifier field or one index per account? What are the advantages and disadvantages of each approach?

My concern is that the global index may affect performance due to frequent updates.

Thanks.

EDIT

  • Estimated total number of documents: 500,000
  • Number of accounts: 4000
  • Indexed data is never shared between accounts
  • Account users can update their indexed data several times a day (no more than 100 updates in most cases).
  • The amount of indexed data tends to be stable after the initial setup process.
  • We need to store 10-20 fields per document
3 answers

Here are some things I would think about besides the usual concerns (e.g. index updates):

  • How Lucene ranks results depends on some corpus-wide statistics, such as the total number of documents in which a term appears for a given field. So if the index statistics for client A do not suit client B, a shared index hurts ranking for both clients. It is also a security risk: if Oscar is smart enough, he can start to reverse-engineer Bob's documents because of the nature of the inverted index: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.159.9682 Perhaps you can work around this with a ranking algorithm such as this one: https://issues.apache.org/jira/browse/LUCENE-2864
  • Some other things in Lucene apply to a field as a whole or to the index as a whole, and you should be aware that they cannot really be varied per client if you combine the indexes: omitTF (if you set it on one document for a field, it applies across the board for that field), similarity (in any released version of Lucene you can only set the similarity across the board, so customers cannot each configure their own ranking model), spellcheck (you would have to hack something together so that every customer has their own filtered spellcheck index), ...
  • On the other hand, if you have a lot of terms, the terms index needs quite a lot of RAM, and if you give each client its own index you will need more memory to hold the terms index in RAM for all of those indexes. However, you can reduce this somewhat by tuning things like termIndexInterval / Divisor.
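
To make the point about corpus-wide statistics concrete, here is a toy sketch in Python (not Lucene code). It assumes a classic TF-IDF style inverse document frequency of the form 1 + ln(N / (df + 1)); the two accounts and their documents are invented for illustration.

```python
import math

def idf(term, docs):
    """Classic TF-IDF style idf: 1 + ln(N / (df + 1))."""
    n = len(docs)
    df = sum(1 for d in docs if term in d.split())
    return 1.0 + math.log(n / (df + 1))

# Hypothetical accounts: "invoice" is common for account A, rare for account B.
account_a = ["invoice march", "invoice april", "invoice may", "contract draft"]
account_b = ["meeting notes", "travel plan", "invoice dispute", "team offsite"]

idf_a_alone = idf("invoice", account_a)              # A has its own index
idf_b_alone = idf("invoice", account_b)              # B has its own index
idf_shared = idf("invoice", account_a + account_b)   # one combined index

print(idf_a_alone, idf_b_alone, idf_shared)
```

In this toy setup the shared value lands between the two per-account values, so neither account gets the term weight its own data would produce. The shift also hints to B how common the term is in someone else's documents.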

If it were me, and there is no regulatory reason why you cannot, I would put them all in one index. That's just my "don't optimize what you don't need to" side talking.

The first concern is simply legal: CAN you commingle the data, even if it is separated by logical means? That is a question for your lawyers, clients, and service agreements. It is not a technical problem.

Assuming that you can, the next question is what impact users will have on each other. If user A is using the system while user B is importing his 100K documents, will this affect user A? And if it does, is that because of how Lucene works, or simply because of the overall system load that comes with importing and indexing documents?

Try and see.

The main thing is to make sure that your client systems do not access Lucene directly, but go through some kind of facade. This facade is an ideal place to enforce customer segregation, as well as a good place to redirect traffic if at some point you decide that you need to shard your indexes.

You may need to pull one heavy user out into their own index. Or perhaps you sell a higher tier of response time to someone who is guaranteed more resources in their SLA, and so on.
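
A minimal sketch of such a facade in Python, using an in-memory stand-in rather than a real Lucene index. All names here (`SearchFacade`, `InMemoryIndex`, `isolate`, the account ids) are hypothetical. The idea is only that the account filter and the routing decision both live in one layer.

```python
from dataclasses import dataclass, field

@dataclass
class InMemoryIndex:
    """Stand-in for a real index; stores (account_id, text) pairs."""
    docs: list = field(default_factory=list)

    def add(self, account_id, text):
        self.docs.append((account_id, text))

    def search(self, account_id, term):
        # The account filter is always applied here, never by the caller.
        return [t for a, t in self.docs if a == account_id and term in t.split()]

class SearchFacade:
    """Single entry point: clients never touch an index directly."""

    def __init__(self):
        self.shared = InMemoryIndex()
        self.dedicated = {}  # account_id -> that account's own index

    def _index_for(self, account_id):
        return self.dedicated.get(account_id, self.shared)

    def add(self, account_id, text):
        self._index_for(account_id).add(account_id, text)

    def search(self, account_id, term):
        return self._index_for(account_id).search(account_id, term)

    def isolate(self, account_id):
        """Move one account out of the shared index into its own."""
        own = InMemoryIndex()
        own.docs = [d for d in self.shared.docs if d[0] == account_id]
        self.shared.docs = [d for d in self.shared.docs if d[0] != account_id]
        self.dedicated[account_id] = own

facade = SearchFacade()
facade.add("acme", "quarterly invoice")
facade.add("globex", "invoice dispute")
acme_hits = facade.search("acme", "invoice")
facade.isolate("globex")  # e.g. a heavy user gets a dedicated index
globex_hits = facade.search("globex", "invoice")
```

Callers use the same `search` call before and after `isolate`, so moving an account to its own index does not change any client code.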

But deciding on the best approach now? Eh, it seems early.

500K documents is not much data for Lucene. Just make sure your implementation leaves you the flexibility to add this later if you find that putting everything in one instance is not viable. And by "add the ability" I mean exactly that: add it later. Do NOT actually implement, say, client-based shards now. Rather, the facade is a good point at which sharding MAY be implemented later without reworking a bunch of plumbing.
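
As a rough back-of-the-envelope check, the figures from the question can be plugged in directly. This assumes roughly 500,000 documents, 4000 accounts, and the stated upper bound of 100 updates per account per day. Real load will be burstier than a daily average.

```python
total_docs = 500_000
accounts = 4_000
updates_per_account_per_day = 100  # upper bound "in most cases"

avg_docs_per_account = total_docs / accounts
updates_per_day = accounts * updates_per_account_per_day
updates_per_second = updates_per_day / (24 * 60 * 60)

print(f"{avg_docs_per_account:.0f} docs per account on average")
print(f"{updates_per_day:,} updates per day at most")
print(f"{updates_per_second:.1f} updates per second, averaged over a day")
```

That works out to about 125 documents per account and fewer than five updates per second on average, which is why testing one shared index first seems reasonable.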


I have built a few "security-trimmed" indexes here and there; it is definitely possible, if it is allowed. However, my general inclination for multi-tenant SaaS would be to separate customers as much as possible, for several reasons:

a) It ensures that coding errors do not lead to data leakage, angry customers, lawsuits and other hilarity.
b) It makes per-client customization much easier: your entire code base does not have to handle client-specific requirements.
c) It gives you a horizontally scalable architecture from day one. Scaling is easy if adding instances is easy, right?

Oh, and definitely take Will Hartung's advice: put search behind a facade. This stuff really should not leak out of that layer.

