How to avoid redundant data fields in a result set when using JOIN?

Question

The following join should retrieve user information along with each user's messages, for users with a certain status:

SELECT * FROM user, message WHERE message.user_id=user.id AND user.status=1

The problem is that every row for a given user repeats the same user data (the columns that come from the user table); only the columns from the message table carry non-redundant information. Something like this:

 user.id  username  email            message.id  subject
 1        jane      jane@gmail.com   120         Notification
 1        jane      jane@gmail.com   122         Re:Hello
 1        jane      jane@gmail.com   125         Quotation
 2        john      john@yahoo.com   127         Hi jane
 2        john      john@yahoo.com   128         Fix thiss
 2        john      john@yahoo.com   129         Ok
 3        jim       jim@msn.com      140         Re:Re:Quotation

As you can see, a lot of data is redundant. We do not want to fetch the users first and then fetch their messages in a loop or a similar structure; loops that issue many small queries should be avoided at all costs.

I'm not concerned about my program's output, which the user interface handles well. I think the network traffic generated by returning this query's result could be reduced significantly if I could somehow eliminate the repetition of user data across all rows belonging to the same user.

sql join redundancy

ashy_32bit Jul 05 '10 at 4:42

4 answers

Borealid · Answer 1 · 2010-07-05T05:34:09+0000

There are a few things you should know.

First, the default SQL join is essentially a cross product restricted by the WHERE clause. This means it is multiplicative: you get repeated values, which you then filter down. You should also be careful when there are NULL fields.

Second, there is the DISTINCT keyword. When you put it at the start of the select list, duplicate rows are removed from the results, so each combination of selected values appears at most once. Thus, for your query, "SELECT DISTINCT user.id FROM ..." eliminates the redundancy on the server side.

Third, the right way to solve this problem is most likely to avoid the '*' wildcard. I suggest:

SELECT user.id, username, email, subject FROM message m, user WHERE m.user_id = user.id AND user.status = 1

This uses the simple, easy-to-understand implicit join syntax and should be valid SQL on any server. I can vouch that it works with MySQL, at least. It also aliases the message table as "m" for brevity.

As you might guess, this reduces the traffic from the SQL server to your application.

edit: if you want to eliminate the "redundant" email information, you cannot; you must make two separate queries. SQL results are tables and must be rectangular, with every value filled in. There is no "ditto" entry.

edit 2: you only need to make two queries. For instance:

SELECT subject FROM message WHERE message.user_id IN (SELECT user.id FROM user WHERE status = 1)

This is one statement containing a subquery, so the database effectively runs two queries. But it needs no loops in your program.

Rew · Answer 2 · 2010-07-05T04:48:57+0000

There is no way to do this in plain SQL if you keep it as a single query. If you are producing the output programmatically, order by user ID and only reprint the user information when the user ID changes.
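A minimal sketch of this approach in Python, assuming an in-memory SQLite database filled with the sample rows from the question. The table layout, the status values, and the helper names `build_sample_db` and `grouped_lines` are assumptions made for illustration, not part of the original post. The rows are ordered by user id, and the user details are emitted only when the id changes:

```python
import sqlite3
from itertools import groupby


def build_sample_db():
    """Create an in-memory database holding the question's sample data.

    The schema and the status value of 1 for every user are assumptions.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE user (id INTEGER PRIMARY KEY, username TEXT,
                           email TEXT, status INTEGER);
        CREATE TABLE message (id INTEGER PRIMARY KEY, user_id INTEGER,
                              subject TEXT);
        INSERT INTO user VALUES (1, 'jane', 'jane@gmail.com', 1),
                                (2, 'john', 'john@yahoo.com', 1),
                                (3, 'jim', 'jim@msn.com', 1);
        INSERT INTO message VALUES (120, 1, 'Notification'),
                                   (122, 1, 'Re:Hello'),
                                   (125, 1, 'Quotation'),
                                   (127, 2, 'Hi jane'),
                                   (128, 2, 'Fix thiss'),
                                   (129, 2, 'Ok'),
                                   (140, 3, 'Re:Re:Quotation');
    """)
    return conn


def grouped_lines(conn):
    """Return output lines with each user's details printed only once."""
    rows = conn.execute("""
        SELECT user.id, user.username, user.email, message.id, message.subject
        FROM user, message
        WHERE message.user_id = user.id AND user.status = 1
        ORDER BY user.id, message.id
    """)
    lines = []
    # groupby only merges adjacent rows, which is why ORDER BY user.id matters.
    for (uid, username, email), msgs in groupby(rows, key=lambda r: r[:3]):
        lines.append(f"{uid} {username} {email}")
        for row in msgs:
            lines.append(f"    {row[3]} {row[4]}")
    return lines


if __name__ == "__main__":
    print("\n".join(grouped_lines(build_sample_db())))
```

Note that this only tidies the output; the full redundant result set still travels over the network.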

Jonathan Leffler · Answer 3 · 2010-07-05T04:58:41+0000

In standard SQL, you can use NATURAL JOIN; it joins on the column names the two tables have in common and keeps only one copy of those common columns.

In practice, you carefully list the columns you want rather than resorting to the '*' shorthand.
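As a rough illustration of what NATURAL JOIN does and does not remove, here is a small Python sketch against an in-memory SQLite database. It assumes a hypothetical schema in which the key is named `user_id` in both tables and no other column name is shared. That differs from the question's schema, where both tables have a column called `id` and a natural join would therefore match user.id against message.id. The function name `natural_join_columns` is also illustrative.

```python
import sqlite3


def natural_join_columns():
    """Run a NATURAL JOIN and return (column names, rows)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Hypothetical schema: user_id is the only column name shared
        -- by the two tables, so it is the only join column.
        CREATE TABLE user (user_id INTEGER PRIMARY KEY, username TEXT,
                           email TEXT, status INTEGER);
        CREATE TABLE message (message_id INTEGER PRIMARY KEY,
                              user_id INTEGER, subject TEXT);
        INSERT INTO user VALUES (1, 'jane', 'jane@gmail.com', 1);
        INSERT INTO message VALUES (120, 1, 'Notification'),
                                   (122, 1, 'Re:Hello');
    """)
    cursor = conn.execute(
        "SELECT * FROM user NATURAL JOIN message WHERE status = 1"
    )
    columns = [d[0] for d in cursor.description]
    return columns, cursor.fetchall()


if __name__ == "__main__":
    columns, rows = natural_join_columns()
    print(columns)
    for row in rows:
        print(row)
```

The shared `user_id` column appears once per row instead of twice, but the username and email are still repeated on every message row, so this removes a duplicated column rather than the row-to-row repetition the question asks about.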

potatopeelings · Answer 4 · 2010-07-05T05:59:47+0000

Assuming you can use a stored procedure, you could write one that runs the above query and then uses a cursor to store nulls for the "redundant information", to get something like

 user.id  username  email            message.id  subject
 1        jane      jane@gmail.com   120         Notification
 null     null      null             122         Re:Hello
 null     null      null             125         Quotation
 2        john      john@yahoo.com   127         Hi jane
 null     null      null             128         Fix thiss
 null     null      null             129         Ok
 3        jim       jim@msn.com      140         Re:Re:Quotation

and then return this result set via a temporary table. But while this may reduce network traffic, it adds processing overhead.

Another way is to run 2 queries, one to get the user information and the other to get the message information along with only the id of the associated user, and then do the "join" in code on the application server. Something like

 SELECT DISTINCT user.* FROM user, message WHERE message.user_id=user.id AND user.status=1

and

 SELECT user.id, message.* FROM user, message WHERE message.user_id=user.id AND user.status=1

leading to 2 trips to the database instead of 1, which may end up being slower even if network traffic is reduced.
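The application-side "join" could look roughly like the following Python sketch. It assumes an in-memory SQLite database with the question's sample rows; the schema, the status values, and the helper names `build_sample_db` and `fetch_users_with_messages` are illustrative assumptions. The columns are listed explicitly instead of using `user.*` and `message.*`.

```python
import sqlite3


def build_sample_db():
    """Create an in-memory database holding the question's sample data.

    The schema and the status value of 1 for every user are assumptions.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE user (id INTEGER PRIMARY KEY, username TEXT,
                           email TEXT, status INTEGER);
        CREATE TABLE message (id INTEGER PRIMARY KEY, user_id INTEGER,
                              subject TEXT);
        INSERT INTO user VALUES (1, 'jane', 'jane@gmail.com', 1),
                                (2, 'john', 'john@yahoo.com', 1),
                                (3, 'jim', 'jim@msn.com', 1);
        INSERT INTO message VALUES (120, 1, 'Notification'),
                                   (122, 1, 'Re:Hello'),
                                   (125, 1, 'Quotation'),
                                   (127, 2, 'Hi jane'),
                                   (128, 2, 'Fix thiss'),
                                   (129, 2, 'Ok'),
                                   (140, 3, 'Re:Re:Quotation');
    """)
    return conn


def fetch_users_with_messages(conn):
    """Run the two queries and stitch the results together in memory."""
    # Query 1: each matching user exactly once.
    users = {
        uid: {"username": username, "email": email, "messages": []}
        for uid, username, email in conn.execute("""
            SELECT DISTINCT user.id, user.username, user.email
            FROM user, message
            WHERE message.user_id = user.id AND user.status = 1
        """)
    }
    # Query 2: the messages, carrying only the id of the owning user.
    message_rows = conn.execute("""
        SELECT user.id, message.id, message.subject
        FROM user, message
        WHERE message.user_id = user.id AND user.status = 1
        ORDER BY message.id
    """)
    for user_id, message_id, subject in message_rows:
        users[user_id]["messages"].append((message_id, subject))
    return users


if __name__ == "__main__":
    for uid, info in fetch_users_with_messages(build_sample_db()).items():
        print(uid, info["username"], info["email"], info["messages"])
```

Each user's name and email cross the network once, at the cost of a second round trip and a dictionary lookup per message row.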

And another way is to combine these 2 into one result set with something like

 SELECT DISTINCT user.id, user.username, user.email FROM user, message WHERE message.user_id=user.id AND user.status=1 UNION ALL SELECT user.id, message.id, message.subject FROM user, message WHERE message.user_id=user.id AND user.status=1

to get something like

 user.id  username/message.id  email/subject
 1        jane                 jane@gmail.com
 2        john                 john@yahoo.com
 3        jim                  jim@msn.com
 1        120                  Notification
 1        122                  Re:Hello
 1        125                  Quotation
 2        127                  Hi jane
 2        128                  Fix thiss
 2        129                  Ok
 3        140                  Re:Re:Quotation

and then use application server logic to separate it. This reduces network traffic, but puts more load on the application server and marginally more on the database server.
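One way the separation step might be written, sketched in Python against an in-memory SQLite database with the question's sample rows. The schema, the status values, and the helper names `build_sample_db` and `fetch_combined` are assumptions. The leading `kind` column is an addition of this sketch, not part of the answer: it lets the application tell user rows from message rows without guessing from the values. The first SELECT uses DISTINCT so that each user appears once, as in the sample output above.

```python
import sqlite3
from collections import defaultdict


def build_sample_db():
    """Create an in-memory database holding the question's sample data.

    The schema and the status value of 1 for every user are assumptions.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE user (id INTEGER PRIMARY KEY, username TEXT,
                           email TEXT, status INTEGER);
        CREATE TABLE message (id INTEGER PRIMARY KEY, user_id INTEGER,
                              subject TEXT);
        INSERT INTO user VALUES (1, 'jane', 'jane@gmail.com', 1),
                                (2, 'john', 'john@yahoo.com', 1),
                                (3, 'jim', 'jim@msn.com', 1);
        INSERT INTO message VALUES (120, 1, 'Notification'),
                                   (122, 1, 'Re:Hello'),
                                   (125, 1, 'Quotation'),
                                   (127, 2, 'Hi jane'),
                                   (128, 2, 'Fix thiss'),
                                   (129, 2, 'Ok'),
                                   (140, 3, 'Re:Re:Quotation');
    """)
    return conn


def fetch_combined(conn):
    """Fetch users and messages in one round trip, then split them."""
    rows = conn.execute("""
        SELECT DISTINCT 'user' AS kind, user.id, user.username, user.email
        FROM user, message
        WHERE message.user_id = user.id AND user.status = 1
        UNION ALL
        SELECT 'message', user.id, message.id, message.subject
        FROM user, message
        WHERE message.user_id = user.id AND user.status = 1
    """).fetchall()
    users = {}
    messages = defaultdict(list)
    for kind, user_id, second, third in rows:
        if kind == "user":
            users[user_id] = (second, third)            # (username, email)
        else:
            messages[user_id].append((second, third))   # (message id, subject)
    return users, dict(messages)


if __name__ == "__main__":
    users, messages = fetch_combined(build_sample_db())
    for uid, (username, email) in users.items():
        print(uid, username, email, sorted(messages.get(uid, [])))
```

The query has no ORDER BY, so the order of rows is not guaranteed; the sketch does not depend on it because every row carries the user id. Servers with stricter typing than SQLite may require a cast so that the username and message id can share a column.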

But saved network traffic is rarely worth the extra complexity.
