How to search for a keyword in 100 billion posts?

This is a college project:

I have a database (MySQL or PostgreSQL, it doesn't matter) with 100 billion posts, and I need to search them for a common keyword as quickly as possible.

Each post contains 500-1000 keywords.

This is not only a database problem but also a software problem (indexing or something else).

How can I do this?

I could use some advanced search engine technology, but I don’t know which one.

+6
database mysql postgresql
8 answers

You might want to check out Sphinx. It is a full-text search engine that supports distributed indexes: you can split the data across many machines, and a query sent to one server can be forwarded to the other servers, with the results collected from each. It is quite fast, but you probably can't handle 100 billion posts on one machine.

You probably won't be able to do something like this in MySQL or PostgreSQL alone. Although they can store all the data, MySQL and Postgres do not offer the indexing and text-search speed that a real full-text index gives you.

MySQL can be compiled with support for the Sphinx storage engine (SphinxSE). The index is still stored in Sphinx, separately from MySQL, but you can query the Sphinx search engine through MySQL and join the results against other tables in your MySQL database. However, if you only want to do simple searches over documents and don't need to join against other data, you can simply use the native PHP interface.
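The distributed querying described above (one node forwards the query to every shard and merges the answers) can be sketched roughly as follows. This is not Sphinx's API; `SHARDS`, `search_shard`, and `search_all` are hypothetical names, and each shard is just an in-memory dictionary standing in for a remote index server.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq

# Hypothetical shards: each maps a keyword to a sorted list of post ids.
# In a real deployment each of these would live on a different machine.
SHARDS = [
    {"python": [1, 4], "mysql": [4]},
    {"python": [7], "sphinx": [8]},
    {"mysql": [12, 15], "python": [12]},
]


def search_shard(shard, keyword):
    """Look up one keyword on a single shard (stand-in for a network call)."""
    return shard.get(keyword, [])


def search_all(keyword, shards=SHARDS):
    """Query every shard in parallel, then merge the sorted partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda s: search_shard(s, keyword), shards))
    # Each partial list is already sorted, so a k-way merge keeps the order.
    return list(heapq.merge(*partials))


print(search_all("python"))
```

The point of the design is that no single machine has to hold the whole index; each shard only answers for its own slice of the posts.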

+4

Have you considered using Apache Lucene?

It does not work directly with your SQL database; you will need to write code that submits documents to it for indexing, and you can then query the resulting index.

I do not know how much extra space or indexing time it would require.
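The submit-then-query workflow described here rests on an inverted index, which maps each term to the documents containing it. Below is a minimal sketch of that idea only; it is not Lucene's API, and `InvertedIndex`, `add_document`, and `search` are hypothetical names. Tokenization is assumed to be plain whitespace splitting.

```python
from collections import defaultdict


class InvertedIndex:
    """Toy inverted index: term -> set of ids of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add_document(self, doc_id, text):
        # Naive tokenization (an assumption): lowercase, split on whitespace.
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, term):
        # A lookup touches only the postings for one term, not every document.
        return sorted(self.postings.get(term.lower(), set()))


index = InvertedIndex()
index.add_document(1, "MySQL full text search")
index.add_document(2, "PostgreSQL text search")
index.add_document(3, "Sphinx distributed index")
print(index.search("text"))
```

The extra space the answer worries about is the postings themselves: one entry per distinct term per document.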

+13

Sell the data to Google for $100 billion. :)

They will index it for you for free, and you will make money.

+10

There are about 6.8 billion people on the planet who can read about 1 message per minute (on average).

If everyone pitches in, that is 100 billion / 6.8 billion ≈ 14.7 messages per person, so about 14.7 minutes to read all the messages.

So:

1) Conquer the Earth.
2) Make everyone your slave.
3) Make them read the messages.
4)

5) Profit!

+5

Have you tried your database's built-in full-text indexing? You should try it and show that it does not work before deciding it is unsuitable and looking for something else.

+3

Use Google Custom Search. In addition, you will earn a little and save a lot of hosting resources.

+3

First of all, are we talking about keywords in separate fields or in messages?

If they are in separate fields, that is fairly easy. Just create a table of keyword-post relationships and do a simple lookup with SELECT post_id ... WHERE keyword = 'X'.

If we are talking about full-text indexing, you are best off using dedicated indexing software, as suggested in some of the other answers.
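A minimal sketch of such a keyword-post table, using Python's built-in `sqlite3` module as a stand-in for MySQL or PostgreSQL. The table name `post_keyword`, the helper `posts_for`, and the sample rows are assumptions made for illustration.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for MySQL/PostgreSQL.
conn = sqlite3.connect(":memory:")

# One row per (keyword, post) pair. The composite primary key doubles as
# the index that makes "WHERE keyword = ?" an index lookup, not a full scan.
conn.execute(
    """
    CREATE TABLE post_keyword (
        keyword TEXT    NOT NULL,
        post_id INTEGER NOT NULL,
        PRIMARY KEY (keyword, post_id)
    )
    """
)

conn.executemany(
    "INSERT INTO post_keyword (keyword, post_id) VALUES (?, ?)",
    [("mysql", 1), ("search", 1), ("search", 2), ("sphinx", 3)],
)


def posts_for(keyword):
    """Return the ids of all posts tagged with the given keyword."""
    rows = conn.execute(
        "SELECT post_id FROM post_keyword WHERE keyword = ? ORDER BY post_id",
        (keyword,),
    )
    return [row[0] for row in rows]


print(posts_for("search"))
```

At the scale in the question, 100 billion posts with 500-1000 keywords each would mean on the order of 5×10^13 to 10^14 rows in this table, so it would almost certainly have to be partitioned across many machines.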

+1