Improving performance in SQL with multiple tables

I have two tables: Log (id, user, action, date) and ActionTypes (action, type). Given an action A0 and a type T0, I would like to calculate, for each user, how many times she used each action Ai immediately after A0, skipping Log actions that are not of type T0. For example:

Log:

    id  user  action  date
    ----------------------------------------
    1   mary  start   2012-07-16 08:00:00
    2   mary  open    2012-07-16 09:00:00
    3   john  start   2012-07-16 09:00:00
    4   mary  play    2012-07-16 10:00:00
    5   john  open    2012-07-16 10:30:00
    6   mary  start   2012-07-16 11:00:00
    7   mary  jump    2012-07-16 12:00:00
    8   mary  close   2012-07-16 13:00:00
    9   mary  delete  2012-07-16 14:00:00
    10  mary  start   2012-07-16 15:00:00
    11  mary  open    2012-07-16 16:00:00

ActionTypes:

    action  type
    --------------
    start   0
    open    1
    play    1
    jump    2
    close   1
    delete  1

So, given the action "start" and type "1", the answer will be as follows:

    user  action  ntimes
    ------------------------
    mary  open    2
    mary  close   1
    john  open    1
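
For reference, a minimal schema sketch matching the sample data (the column types here are my assumptions):

    -- Hypothetical DDL for the sample tables; types are assumptions.
    CREATE TABLE Log (
        id     int         NOT NULL PRIMARY KEY,
        [user] varchar(20) NOT NULL,
        action varchar(20) NOT NULL,
        date   datetime    NOT NULL
    );

    CREATE TABLE ActionTypes (
        action varchar(20) NOT NULL PRIMARY KEY,
        type   int         NOT NULL
    );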

My attempt

    SELECT b.[user], b.action, COUNT(*)
    FROM log a, log b
    WHERE a.action = 'start'
      AND b.date > a.date
      AND a.[user] = b.[user]
      AND 1 = (SELECT type FROM ActionTypes WHERE action = b.action)
      AND NOT EXISTS (SELECT c.action
                      FROM log c
                      WHERE c.[user] = a.[user]
                        AND c.date > a.date
                        AND c.date < b.date
                        AND 1 = (SELECT type FROM ActionTypes WHERE action = c.action))
    GROUP BY b.[user], b.action

There are about 1 million tuples in our Log table; the query works, but it is too slow. We are using SQL Server. Any tips on how to make this faster? Thanks!

4 answers

Could you try this query? It checks whether the previous chronological record is the requested action. I believe it will be faster than a self-join. I've put up a SQL Fiddle demo.

    select log.[user], log.action, count(*) ntimes
    from log
    inner join actiontype t on log.action = t.action
    where t.type = 1
      and exists (select *
                  from (select top 1 t1.type
                        from log l1
                        inner join actiontype t1 on l1.action = t1.action
                        where l1.[user] = log.[user]
                          and l1.date < log.date
                          and t1.type in (0, 1)
                        order by l1.date desc
                       ) prevEntry
                  where prevEntry.type = 0
                 )
    group by log.[user], log.action
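
If it is still slow, an index that supports the per-user, date-ordered lookup should help; a sketch (the index name and the INCLUDE column are my suggestion, not from the original post):

    -- Covering index for the "previous entry for this user, by date" lookup.
    CREATE INDEX IX_log_user_date ON log ([user], date) INCLUDE (action);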

I don't understand why mary/close is in the result list. The previous entry is jump, which is of type 2 and should not be skipped, to begin with.


Borrowing an idea from @Nikola Markovinovich, I came up with the following solution:

    WITH ranked AS (
        SELECT
            L1.[user],
            L2.action,
            rnk = ROW_NUMBER() OVER (PARTITION BY L1.id ORDER BY L2.date)
        FROM Log L1
        INNER JOIN Log L2 ON L2.[user] = L1.[user] AND L2.date > L1.date
        INNER JOIN ActionType at ON L2.action = at.action
        WHERE L1.action = @Action
          AND at.type = @Type
    )
    SELECT [user], action, ntimes = COUNT(*)
    FROM ranked
    WHERE rnk = 1
    GROUP BY [user], action;
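
The query assumes the two parameters are declared elsewhere; a minimal usage sketch with the values from the question (inline DECLARE initialization needs SQL Server 2008 or later):

    -- Example parameter values from the question; types are assumptions.
    DECLARE @Action varchar(20) = 'start';
    DECLARE @Type   int         = 1;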

Basically, the query selects from the Log table all the user records that have the specified action, then joins this subset back to Log to retrieve all the actions of the specified type that follow those in the first subset, ranking them in ascending order of date along the way (using ROW_NUMBER()). The outer query then keeps only the rows with a rank of 1, groups them by user and action, and counts the rows in each group.

You can see (and play with) a working example at SQL Fiddle.


Your queries on action, and on all relation fields in general, would be much faster with an integer than with a string.

The only way to make your queries faster is to change the database structure. Relations must be indexed and should be integers, not strings. For example, something like this:

    id  user  action  date
    ----------------------------------------
    1   mary  1       2012-07-16 08:00:00
    2   mary  2       2012-07-16 09:00:00
    3   john  3       2012-07-16 09:00:00
    4   mary  1       2012-07-16 10:00:00
    5   john  3       2012-07-16 10:30:00
    6   mary  4       2012-07-16 11:00:00
    7   mary  5       2012-07-16 12:00:00
    8   mary  6       2012-07-16 13:00:00
    9   mary  1       2012-07-16 14:00:00
    10  mary  3       2012-07-16 15:00:00
    11  mary  1       2012-07-16 16:00:00

will solve your problem.

In addition, if you have only a small number of action types, the action can be stored as a tinyint; and if you add a tinyint id as the primary key, your queries will definitely be simpler (plain joins), and your database will be more flexible for future changes. For example:

    id  action  type
    ------------------
    1   start   0
    2   open    1
    3   play    1
    4   jump    2
    5   close   1
    6   delete  1

Here id is the primary key, and the action column in the Log table is a foreign key referencing this id.

I think the main problem is that your relations have no indexes and no foreign keys.
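
To illustrate, a sketch of the proposed structure (the exact names and types are assumptions):

    -- Integer keys with an explicit foreign key and supporting index.
    CREATE TABLE ActionTypes (
        id     tinyint     NOT NULL PRIMARY KEY,
        action varchar(20) NOT NULL,
        type   tinyint     NOT NULL
    );

    CREATE TABLE Log (
        id     int         NOT NULL PRIMARY KEY,
        [user] varchar(20) NOT NULL,
        action tinyint     NOT NULL FOREIGN KEY REFERENCES ActionTypes (id),
        date   datetime    NOT NULL
    );

    CREATE INDEX IX_Log_user_date ON Log ([user], date);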


I do not agree with these statements:

  • ... much faster with an integer than with a string

    This is not entirely true: once the action column is indexed, there is little difference between integers and strings.

  • ... the only way to make your queries faster is to change the database structure

    In this case, the query can be optimized in several ways:

    • Avoid filtering the combined data set (Log x ActionTypes); try to filter as early as possible (in the example below, the filtering happens in the inner select).
    • Avoid repeating filter conditions (WHERE). Even though SQL Server internally optimizes such duplication, it is usually a sign that you are doing the same computation several times, and most of the time you can find a solution in which the condition appears only once (in the example below, the exclusion condition appears exactly once).
    • Your best friend is the query analyzer. It is a built-in tool in SQL Server Management Studio that shows you the cost of executing SQL queries against the actual amount of data. It is a really good tool and helps find bottlenecks in queries.

    Here is a simplified query that should give the result you need (it was written and tested on Oracle, as it has been a while since I worked with MS SQL Server):

    select "user", action, count(*)
    from action_log
    where action not in (
        -- exclusion criteria
        select action_type."action"
        from action_type
        where action_type."type" = 1
    )
    group by "user", action
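
On SQL Server the same idea would more idiomatically use bracketed identifiers instead of the double-quoted ones (assuming the same table names as in the Oracle sketch above):

    -- T-SQL rendering of the Oracle query above.
    select [user], action, count(*)
    from action_log
    where action not in (
        -- exclusion criteria
        select action_type.action
        from action_type
        where action_type.type = 1
    )
    group by [user], action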
