If you are running min / max / avg queries, would you prefer to use aggregation tables or simply query the relevant range of rows from the source table?
This is obviously a very open-ended question with no single correct answer, so I'm just looking for people's general suggestions. Suppose the raw data table consists of a timestamp, a numeric foreign key (for example, a user ID), and a decimal value (for example, a purchase amount). Also suppose the table has millions of rows.
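For concreteness, here is roughly the shape of table and query I have in mind (the table and column names are just made up for illustration):

```sql
-- Hypothetical raw data table matching the description above
CREATE TABLE purchases (
    purchased_at  TIMESTAMP      NOT NULL,  -- the timestamp
    user_id       BIGINT         NOT NULL,  -- numeric foreign key
    amount        DECIMAL(12, 2) NOT NULL   -- decimal value
);

-- The kind of range query in question: aggregates over a time window
SELECT MIN(amount), MAX(amount), AVG(amount)
FROM purchases
WHERE purchased_at >= '2013-01-01'
  AND purchased_at <  '2013-02-01';
```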
I have done both. On the one hand, aggregation tables gave me significantly faster queries, but at the cost of a growing number of extra tables. Displaying current values for an aggregated range either requires falling back to the raw data table entirely or combining smaller, finer-grained aggregates. I found that tracking in the application code which aggregation table to query is more work than you'd expect, and that schema changes are inevitable, because the original aggregation ranges are never enough ("But I wanted to see our sales for the last 3 pay periods!").
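By "combining smaller granular aggregates" I mean something like the sketch below; the table name, columns, and daily granularity are just assumptions for the example. Note that the average has to be rebuilt from sum and count, since averaging averages gives the wrong answer:

```sql
-- Hypothetical daily rollup table
CREATE TABLE purchases_daily (
    day         DATE           NOT NULL,
    user_id     BIGINT         NOT NULL,
    min_amount  DECIMAL(12, 2) NOT NULL,
    max_amount  DECIMAL(12, 2) NOT NULL,
    sum_amount  DECIMAL(14, 2) NOT NULL,  -- kept so AVG can be recomputed
    row_count   BIGINT         NOT NULL,
    PRIMARY KEY (day, user_id)
);

-- Composing a larger range out of the smaller granular aggregates
SELECT MIN(min_amount)                    AS min_amount,
       MAX(max_amount)                    AS max_amount,
       SUM(sum_amount) / SUM(row_count)   AS avg_amount
FROM purchases_daily
WHERE day >= '2013-01-01'
  AND day <  '2013-02-01';
```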
On the other hand, querying the raw data can be punishingly slow, but it gives me a lot of flexibility over the data ranges. When the range requirements change, I just modify the query rather than rebuild aggregation tables; likewise, the application code needs fewer updates. I suspect that if I were smarter about my indexing (i.e., always had good covering indexes), I could reduce the penalty for selecting from the raw data, but it is hardly a panacea.
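By "good covering indexes" I mean something along these lines (hypothetical names; the right column order depends on the actual query patterns):

```sql
-- The timestamp leads so the range predicate can seek, and amount is
-- included so the aggregate can be computed from the index alone,
-- without touching the base table.
CREATE INDEX ix_purchases_time_amount
    ON purchases (purchased_at, amount);

-- Per-user variant, if queries are usually filtered by user_id first
CREATE INDEX ix_purchases_user_time_amount
    ON purchases (user_id, purchased_at, amount);
```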
Anyway, can I have the best of both worlds?