What is a good way to structure a 100M-record table for fast ad-hoc queries?

The scenario is quite simple: the table has about 100M records with 10 columns (analytics data), and I need to be able to run queries on any combination of these 10 columns. For example, something like this:

  • How many records with a = 3 && b > 100 exist in the last 3 months?

Basically, all queries will be of the form "how many records with attributes X exist in time interval Y", where X can be any combination of these 10 columns.
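To make the shape concrete, here is an illustrative sketch of such a query (the table name events, the columns a, b and event_ts, and the date arithmetic are just placeholders; the interval syntax varies by database):

    -- Illustrative only: count records matching arbitrary column predicates
    -- within a time window. Names and date syntax are placeholders.
    SELECT COUNT(*)
    FROM events
    WHERE a = 3
      AND b > 100
      AND event_ts >= CURRENT_DATE - INTERVAL '3' MONTH;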

Data will keep flowing in; it is not a fixed set of 100M records, but one that grows over time.

Since the choice of columns can be completely arbitrary, creating indexes for popular combinations is most likely not feasible.

The question has two parts:

  • How should I structure this in an SQL database so that queries run as quickly as possible, and what are the main steps I can take to improve performance?
  • Is there a NoSQL database optimized for this kind of search? The only one I can think of is ElasticSearch, but I am not sure how well it works on a dataset this large.
+7
6 answers

Without indexes, your options for tuning an RDBMS to support this kind of processing are very limited. Basically you need massive parallelism and very fast hardware. But clearly you are not storing relational data, so an RDBMS is a poor fit.

Going down the parallel route, the industry standard is Hadoop. You can still run SQL-style queries through Hive.
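As a rough illustration of what that looks like in HiveQL (the table, columns and partitioning scheme here are hypothetical, not anything from the question):

    -- HiveQL sketch: a hypothetical events table partitioned by day,
    -- queried with ad-hoc predicates on the attribute columns.
    CREATE TABLE events (
      a        INT,
      b        INT,
      event_ts TIMESTAMP
    )
    PARTITIONED BY (event_date STRING);

    SELECT COUNT(*)
    FROM events
    WHERE event_date >= '2012-01-01'
      AND a = 3
      AND b > 100;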

Another NoSQL option would be to consider a columnar database. This is an alternative way of organising data for analytics without using cubes, and such databases load data fast. Vectorwise is the latest player in this arena. I have not used it personally, but somebody was recommending it at a LondonData event last night. Check it out.

Of course, moving away from SQL databases, in whatever direction you go, will mean a steep learning curve.

+1

You should build an SSAS cube and use MDX to query it.

The cube has "aggregations", meaning results that are calculated ahead of time. Depending on how you set up your cube (and your aggregations), you can have an attribute (such as A) as a SUM in a measure group, and every time you ask the cube how many records with A there are, it just reads the aggregation instead of scanning the whole table and computing it.

0

In Oracle terms, this would likely be structured as an interval-partitioned table with local bitmap indexes on each of the columns you may query against, with new data added either via direct-path insert or by partition exchange.

Queries for popular column combinations could be optimized with a set of materialized views, possibly using ROLLUP or CUBE queries.
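For illustration only, a minimal sketch of that structure in Oracle syntax (the table name, columns, partition boundary and the staging_events table are invented for the example):

    -- Interval-partitioned table; a new monthly partition is created automatically.
    CREATE TABLE events (
      event_ts  DATE NOT NULL,
      a         NUMBER,
      b         NUMBER
      -- further analytic columns would go here
    )
    PARTITION BY RANGE (event_ts)
    INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))
    (PARTITION p_initial VALUES LESS THAN (DATE '2012-01-01'));

    -- One local bitmap index per queryable column.
    CREATE BITMAP INDEX ix_events_a ON events (a) LOCAL;
    CREATE BITMAP INDEX ix_events_b ON events (b) LOCAL;

    -- Direct-path insert for new data (alternatively, load into a staging
    -- table and use ALTER TABLE ... EXCHANGE PARTITION).
    INSERT /*+ APPEND */ INTO events
    SELECT event_ts, a, b FROM staging_events;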

0

To make these queries run quickly with an SQL solution, use these rules of thumb. There are many caveats, though, and the actual SQL engine you use will matter a great deal.

I assume your data consists of integers, dates or short scalar values; long strings and the like change the game. I also assume you are using only simple comparisons (=, <, >, <>, etc.).

a) If the time interval Y appears in every query, make sure it is indexed, unless the Y predicate selects a large percentage of rows. Also make sure rows are stored in Y order so that they are packed next to each other on disk; with new data this will largely happen by itself anyway. If the Y predicate is very selective (i.e. only a few hundred rows), that may be all you need to do. (A brief sketch of rules a and c follows this list.)

b) Are you running "SELECT *" or "SELECT COUNT(*)"? If you are not doing "SELECT *", vertical partitioning may help, depending on the engine and what other indexes are present.

c) Create single-column indexes for each column whose values are well spread out and do not contain too many duplicates. An index on YEAR_OF_BIRTH will usually be fine, but indexing FEMALE_OR_MALE is often not much use, although this is very database-specific.

d) If you have columns like FEMALE_OR_MALE together with wide Y predicates, you have a different problem: selecting a count of, say, females from most of the rows. You can try indexing it, but it depends on the engine.

e) Make columns NOT NULL where possible; as a rule it saves a bit per row and can simplify the optimizer's internal work.

f) Updates/inserts: adding indexes often degrades insert performance, but if your insert rate is low enough it may not matter. With only 100M rows, I assume your insert rate is fairly low.

g) Multi-column (composite) keys would help, but you have already said they are not an option.

h) Get fast (high-RPM) disks; the problem with these kinds of queries is usually I/O (the TPC-H benchmark is I/O-bound, and your workload sounds like an "H"-type problem).
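To make rules (a) and (c) concrete, a minimal sketch (table and column names are invented, and index syntax varies slightly by engine):

    -- Rule (a): index the time column used in every query.
    CREATE INDEX ix_events_ts ON events (event_ts);

    -- Rule (c): single-column index on a well-spread attribute column.
    CREATE INDEX ix_events_a ON events (a);

    -- A typical count can then be answered from the indexes plus a small
    -- number of matching rows rather than a full scan.
    SELECT COUNT(*)
    FROM events
    WHERE event_ts >= DATE '2012-01-01'
      AND event_ts <  DATE '2012-04-01'
      AND a = 3;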

There are lots of options here, and it depends on how much effort you want to spend to "make the queries run as quickly as possible". There are plenty of NoSQL and other options for solving this problem, but I will leave that part of the question to others.

0

In addition to the suggestions above, consider querying only a regularly refreshed materialized view. I would probably just create a SELECT ..., COUNT(*) ... GROUP BY CUBE(...) materialized view over the table.
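Roughly along these lines, in Oracle-style syntax (the view name, table and columns are placeholders; other engines may require you to maintain the summary yourself):

    -- Pre-aggregated counts for every combination of the listed columns,
    -- refreshed periodically rather than kept up to the second.
    CREATE MATERIALIZED VIEW mv_event_cube
    BUILD IMMEDIATE
    REFRESH COMPLETE ON DEMAND
    AS
    SELECT a, b, c,
           TRUNC(event_ts, 'MM') AS event_month,
           COUNT(*)              AS cnt
    FROM events
    GROUP BY CUBE (a, b, c, TRUNC(event_ts, 'MM'));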

That will give you a complete cube to work with. Play with this on a small test table to understand how CUBE() works. Look at Joe Celko's books for some examples, or just consult your specific RDBMS documentation.

You are a little stuck if you always need to query the very latest data in your table. But if you can relax that requirement, you will find a materialized cube of some kind a pretty decent choice.

Are you absolutely sure your users will hit all 10 columns equally? I have seen premature optimization of this kind before, only to find that users actually used one or two columns for most of their reports, and that rolling up to those one or two columns was "good enough".

0

If you cannot create an OLAP cube from the data, could you create a summary table based on the unique combinations of X and Y? If the time period Y is bucketed at a coarse enough granularity, the summary table can be reasonably small. Obviously this is data-dependent.

In addition, you should capture the queries that users actually run. As a rule, users say they need every possible combination, when in practice this rarely happens, and most user queries can be satisfied from pre-calculated results. A summary table would again be an option here; you get some staleness with this approach, but it may work.
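As a rough sketch of that idea (all names are invented; TRUNC is Oracle-flavoured, other engines use CAST or DATE_TRUNC instead):

    -- Summary table keyed on day plus the attribute columns; rebuilt or
    -- incrementally maintained on a schedule, hence slightly stale.
    CREATE TABLE daily_counts AS
    SELECT TRUNC(event_ts) AS event_day,
           a, b, c,
           COUNT(*)        AS cnt
    FROM events
    GROUP BY TRUNC(event_ts), a, b, c;

    -- Ad-hoc questions then aggregate the much smaller summary table.
    SELECT SUM(cnt)
    FROM daily_counts
    WHERE event_day >= DATE '2012-01-01'
      AND a = 3
      AND b > 100;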

Another option, if feasible, would be to look at the hardware. I have had good results in the past with solid-state storage such as Fusion-io; it can reduce query times significantly. It is no substitute for good design, but with good design and the right hardware it works well.

0
