Database Performance: Filtering by Column or Separate Table

I was wondering what the best approach would be for the following situation:

I have an Orders table in the database that obviously contains all the orders. But this is literally ALL orders, including fully finished ones, which are simply marked as "full". Over the open orders only, I want to calculate some things (for example, open quantity, open positions, etc.). What would be better:

Keep one Orders table with ALL orders, including full/archived ones, and do the calculations by filtering on the "full" flag?

Or should I create a second table, for example "Orders_Archive", so that the Orders table contains only the open orders I use for the calculations?

Is there a (clear) performance difference between these approaches?

(BTW I'm on a PostgreSQL db.)

+7
performance database postgresql database-design
4 answers

Or should I create a second table, for example "Orders_Archive", so that the Orders table contains only the open orders I use for the calculations?

Yes. This is called a data warehouse. People do it because it speeds up the transactional system to remove history that is hardly ever used. First, the tables become physically smaller and faster. Second, long-running history reports no longer interfere with transaction processing.

Is there a (clear) performance difference between these approaches?

Yes. Bonus: you can restructure your history so that it is no longer in 3NF (optimized for updating) but in a Star Schema (optimized for reporting). The benefits are huge.

Buy the Kimball Data Warehouse Toolkit to learn more about star-schema design and about moving history from active tables to warehouse tables.
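As a rough sketch of the history move this answer describes (the table and column names here are assumptions for illustration, not from the question), PostgreSQL can move completed rows into an archive table atomically with a single data-modifying CTE:

```sql
-- Hypothetical schema: orders(id, order_date, completed, quantity, ...)
-- and an orders_archive table with an identical column list.
-- DELETE ... RETURNING feeds the deleted rows straight into the INSERT,
-- so each row is moved exactly once, inside one transaction.
BEGIN;

WITH moved AS (
    DELETE FROM orders
    WHERE completed
      AND order_date < now() - interval '1 year'
    RETURNING *
)
INSERT INTO orders_archive
SELECT * FROM moved;

COMMIT;
```

Run on a schedule (nightly, monthly), this keeps the transactional table small while the archive grows.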

+5

This is a common problem in database design: whether to separate out, or "archive", records that are no longer "active".

The most common approaches are:

  • Everything in one table; mark orders as "complete" as needed. Pros: the simplest solution (in both code and structure), good flexibility (for example, it is easy to "resurrect" orders). Cons: the table can grow quite large, which is a problem both for queries and for things like backups.
  • Archive old records into a separate table. Solves the problems of the first approach, at the cost of greater complexity.
  • Use a partitioned table. This means that logically (to the application) everything is in one table, but behind the scenes the DBMS places the data in separate areas depending on the value(s) in some column(s). You would probably partition on the "full" flag or on an "order completion date" column.

The last approach combines the good parts of the first two, but it needs support from the DBMS and is more complex to set up.
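To make the partitioning idea concrete, here is a minimal sketch using PostgreSQL's declarative partitioning (available since PostgreSQL 10; the names and schema are illustrative assumptions, not from the question):

```sql
-- Partition orders by the completion flag: open orders live in a small,
-- hot partition; finished orders accumulate in a separate one.
CREATE TABLE orders (
    id         bigint  NOT NULL,
    order_date date    NOT NULL,
    completed  boolean NOT NULL DEFAULT false,
    quantity   integer NOT NULL
) PARTITION BY LIST (completed);

CREATE TABLE orders_open     PARTITION OF orders FOR VALUES IN (false);
CREATE TABLE orders_finished PARTITION OF orders FOR VALUES IN (true);

-- The application still queries one logical table; the planner prunes
-- the finished partition when the filter excludes it:
SELECT sum(quantity) FROM orders WHERE NOT completed;
```

Marking an order complete (`UPDATE orders SET completed = true WHERE id = ...`) automatically moves the row between partitions, so no manual archiving job is needed.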

Note:

Tables that store only "archived" data are usually called "archive tables". Some DBMSs even provide special storage engines for such tables (for example, MySQL's ARCHIVE engine), optimized for fast reads and good storage efficiency at the cost of slow updates/inserts.

+7

Never separate or partition current/archived data. It is simply wrong. It can be called a "data warehouse" or anything else, but it is wrong, unnecessary, and creates problems that were not otherwise present. As a result:

  • everyone who queries the data must now look for it in two places instead of one
  • and, even worse, combine aggregated values manually (in Excel or elsewhere)
  • you introduce key anomalies; integrity is lost (which would otherwise be guaranteed by a single db constraint)
  • when you need to change a completed order (or many), you must pull it out of the "warehouse" and put it back into the "database"

If, and only if, response time on the table is slow, then address that and improve the speed. That only; nothing more. In every case I have seen, it is an indexing error (missing indexes, or the wrong columns, or the wrong column order). Typically, all you need is an IsComplete column in the index, along with whatever your users search on most often, to include/exclude open/completed orders.
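As a sketch of the indexing fix this answer describes (column names are assumed for illustration), that would be a composite index leading with the completion flag, followed by the column users filter on most often:

```sql
-- Composite index: the completion flag first, then the most common
-- search column. Open-order lookups touch only the matching index range.
CREATE INDEX orders_iscomplete_customer_ix
    ON orders (is_complete, customer_id);

-- A typical open-order query this index serves:
SELECT *
FROM orders
WHERE is_complete = false
  AND customer_id = 42;
```

The column order matters: putting the low-selectivity flag first keeps open and completed orders in separate contiguous index ranges, so either set can be scanned without visiting the other.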

Now, if your DBMS platform cannot handle large tables or large result sets, that is a different problem, and you need to use whatever methods the tool makes available. But as a matter of database design it is simply wrong; there is no need to create duplicates, populate them, and maintain them (with all the ensuing problems), unless you are limited by your platform.

Both last year and this year, as part of ordinary performance assignments, I consolidated such partitioned tables with billions of rows (and had to resolve all the duplicate rows that supposedly did not exist; yes, right, two days just for that). The consolidated tables with revised indexes were faster than the partitioned ones; the excuse that "billions of rows slowed the table down" was completely false. Users love me because they no longer need to use two tools and query two "databases" to get what they need.

+3

Since you are using PostgreSQL, you can use a partial index. Suppose you often query unfinished orders by orderdate; you can create the index like this:

create index order_orderdate_unfinished_ix on orders (orderdate) where completed is null or completed = 'f';

With this condition in place, PostgreSQL will not index completed orders, which saves disk space and makes the index faster, since it holds only a small amount of data. This way you get the benefit without the problems of partitioning.
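One caveat worth noting: the planner uses a partial index only when the query's WHERE clause implies the index predicate. A sketch of a query that qualifies (reusing the index definition above; the date literal is just an example):

```sql
-- The WHERE clause repeats the index predicate, so the planner can
-- answer this from the small partial index alone:
SELECT orderdate, count(*)
FROM orders
WHERE (completed IS NULL OR completed = 'f')
  AND orderdate >= date '2011-01-01'
GROUP BY orderdate;
```

A query filtering only on orderdate, without the completed condition, would have to fall back to a sequential scan or another index, because the partial index does not cover completed orders.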

If you do split the data into ORDERS and ORDERS_ARCHIVE, you will also have to adjust your existing reports. If you have many of them, that can be painful.

See the full description of partial indexes on this page: http://www.postgresql.org/docs/9.0/static/indexes-partial.html

EDIT: for archiving, I prefer to create another database with an identical schema and then move the old data from the transactional db into that archive db.

+1
