Question about joins and a table with millions of rows

I need to create 2 tables:

Magazine (10 million rows with these columns: id, title, genres, print, price)

Author (180 million rows with these columns: id, name, magazine_id)

Each author can write for ONLY ONE magazine, and each magazine has many authors.

So, if I want to know all the authors of magazines in the Motors genre, I have to use this query:

SELECT * FROM Author, Magazine WHERE ( Author.magazine_id = Magazine.id ) AND ( genres = 'Motors' ) 

The same applies to the print and price columns.

To avoid these joins between tables with millions of rows, I thought of using these tables:

Magazine (10 million rows with these columns: id, title, genres, print, price)

Author (180 million rows with these columns: id, name, magazine_id, genres, print, price)

and this query:

 SELECT * FROM Author WHERE genres = 'Motors' 

Is this a good approach?

I want the query to run faster.

I can use Postgresql or Mysql.

5 answers

No, I don't think that duplicating the information the way you describe is a good design for a relational database.

If you change the genre or price of a magazine, you will have to remember to change it in every Author row where that information is duplicated. And if you sometimes forget, you will get anomalies in your data. How will you then know which copy is correct?

This is one of the benefits of normalizing a relational database: information is stored with minimal redundancy, so you don't get anomalies.
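As a rough sketch of the update problem described above, compare what a single price change costs in each design. The id 42 and the price 4.99 are made-up example values; the table and column names are the ones from the question.

```sql
-- Normalized design: the price is stored once, so one row changes.
-- (42 and 4.99 are hypothetical example values.)
UPDATE Magazine
SET price = 4.99
WHERE id = 42;

-- Denormalized design from the question: the same change must also be
-- applied to every Author row that carries a copy of that magazine's price.
UPDATE Author
SET price = 4.99
WHERE magazine_id = 42;
```

If the second statement is ever skipped, the Author rows and the Magazine row disagree about the price, which is the anomaly in question.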

To make it run faster, which I think is what you're trying to do, you should learn how to use indexes.


If you only need the authors of a magazine (and no information about the magazine itself), you can use EXISTS. Some say EXISTS is faster than JOIN because EXISTS stops searching after the first hit. In that case you would use:

 SELECT * FROM Author WHERE EXISTS (SELECT 1 FROM Magazine WHERE genres = 'Motors' AND Author.magazine_id = Magazine.id) 

Also, as mentioned in another answer, selecting only the columns you need instead of * will speed things up.


Is this a good approach?

  • The cons of this approach outweigh the pros. The disadvantages of denormalization (which is what you are proposing) include:
    • You need to keep the genres, print and price data in the Author table correct for every magazine_id whenever they change in the Magazine table. That is expensive.
    • You obviously waste a lot of storage space, repeating each magazine's data 18 times on average (are those row counts right?).
    • Any other select on, or maintenance of, the Author table becomes slower and more expensive.
  • Your query seems to be broken. It should be:
      SELECT * FROM Author, Magazine 
      WHERE Author.magazine_id = Magazine.id AND genres = 'Motors'
     
  • To solve your problem, make sure you have an index on genres in the Magazine table and an index on magazine_id in the Author table.
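A minimal sketch of the two indexes suggested above, followed by a way to check whether the planner uses them. The index names are hypothetical, and this assumes genres is a short string type such as VARCHAR; the statements should work in both PostgreSQL and MySQL.

```sql
-- Index names are made up for illustration.
-- Lets the database find the 'Motors' magazines without scanning the table.
CREATE INDEX idx_magazine_genres ON Magazine (genres);

-- Lets the database find the authors of each matching magazine.
CREATE INDEX idx_author_magazine_id ON Author (magazine_id);

-- Ask the planner how it intends to run the join.
EXPLAIN
SELECT Author.name
FROM Author
JOIN Magazine ON Author.magazine_id = Magazine.id
WHERE Magazine.genres = 'Motors';
```

The EXPLAIN output format differs between the two databases, but in either one you are looking for the new indexes being used instead of full table scans.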

You should do this:

 SELECT * FROM Author JOIN Magazine ON Author.magazine_id = Magazine.id WHERE genres = 'Motors' 

It should be fast. If it is too slow, make sure you have all the relevant indexes, including the primary key indexes on the id columns of both tables, an index on Author.magazine_id, and an index on genres.

You should also list the columns you want rather than returning them all. Note that this query can potentially return millions of rows. Are you sure you want to fetch them all? I would consider paging: fetch only the first 50 rows, and fetch more only when the user asks for the next page.
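The paging idea could look roughly like this. The page size of 50 comes from the text above; the id value 1050 in the second query is a made-up example standing in for the last Author.id shown on the previous page. LIMIT and OFFSET are accepted by both PostgreSQL and MySQL.

```sql
-- First page: 50 authors, in a stable order so pages do not overlap.
SELECT Author.id, Author.name
FROM Author
JOIN Magazine ON Author.magazine_id = Magazine.id
WHERE Magazine.genres = 'Motors'
ORDER BY Author.id
LIMIT 50 OFFSET 0;

-- Next page, keyset style: continue after the last id already shown
-- (1050 is a hypothetical value) instead of using a growing OFFSET.
SELECT Author.id, Author.name
FROM Author
JOIN Magazine ON Author.magazine_id = Magazine.id
WHERE Magazine.genres = 'Motors'
  AND Author.id > 1050
ORDER BY Author.id
LIMIT 50;
```

The second form is worth considering for deep pages, because a large OFFSET still makes the database walk past all the skipped rows.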


You do not need a JOIN at all, and in any case your original query is wrong. You meant to write:

 SELECT name FROM Author WHERE magazine_id IN (SELECT id FROM Magazine WHERE genres = 'Motors') 

There are many ways to manage huge data warehouses like this. If you give an example of what you want to extract from this data, people can offer effective ways to do this.

