Find distinct values of a varchar column in a very large MySQL table

Question

I want to find the distinct values of a varchar column in a very large MySQL table (1 billion rows).
I am considering the following options:

 1. select distinct (col_name) from mytable;
 2. export this column to a text file incrementally (select col_name from mytable where myid >= x and myid < x + n), then deduplicate it with Linux sort:
    sort myfile.txt | uniq

The problem with the first method (even if the column is indexed) is that the query may run for a long time and then crash, and I would have to start all over again.
I am leaning towards the second option. Is there a faster way?
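For the export-and-sort option, it may help to check exactly which uniq behaviour is wanted. The sketch below uses a small hypothetical sample file (standing in for the exported column) to show that `sort -u` (equivalent to `sort | uniq`) prints each distinct value once, whereas `uniq -u` prints only the values that occur exactly once in the input.

```shell
# Hypothetical sample data standing in for the exported column.
demo_file=$(mktemp)
printf 'b\na\nb\nc\na\n' > "$demo_file"

# Each distinct value exactly once.
distinct_values=$(sort -u "$demo_file")

# Only the values that appear a single time in the input.
singletons=$(sort "$demo_file" | uniq -u)

printf 'distinct:\n%s\n' "$distinct_values"
printf 'singletons:\n%s\n' "$singletons"

rm -f "$demo_file"
```

With this sample, the distinct list is a, b, c, while the singleton list contains only c, so `uniq -u` would silently drop every value that repeats.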

sorting mysql unique distinct

user775187 Jun 10 '11 at 23:50

2 answers

Benjamin · Answer 1 · 2011-06-11T00:29:56+0000

...
...
SELECT col_name FROM mytable GROUP BY col_name;

Even though they return the same result set, the two queries actually use different execution plans, and I have noticed that GROUP BY in MySQL is slightly faster than DISTINCT.

I second spinning_plate's comment regarding the index. If you already have one, getting the result will be much less painful. What is the cardinality of your index?

dfb · Answer 2 · 2011-06-11T00:43:14+0000

Unfortunately, I have had to resort to this kind of nonsense with MySQL before. If you can't simply rely on an index and GROUP BY isn't faster (I don't know why it would be, going by @Ben's post), you could try to segment the problem so that it can be processed in batches.

I would still do the work in MySQL; it will most likely be faster than anything you write yourself or run on the UNIX command line. Treat it as you would a materialized view or an aggregate table in a data warehouse. One simple way would be to create a batch script that runs SELECT DISTINCT over small ranges and puts the results into a second table of distinct values (via MERGE or some other mechanism). This is easier to restart, but you may run into the same performance problems as with DISTINCT. You will have to experiment with the parameters (batch size). If you use this in a production environment and people expect to get all the distinct values as if they were querying the database directly, it would be better to have three tables: the source, a temporary table for the current batch, and a current table with the latest values and a date_modified column.
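A minimal sketch of the batching idea, under several assumptions: the table and column names (mytable, col_name, myid) come from the question, while distinct_vals is a hypothetical second table with a PRIMARY KEY or UNIQUE index on col_name. INSERT IGNORE is used here as one possible "other mechanism" for skipping values that are already stored; it is not necessarily what the answer had in mind. The function only prints the statements, one per id range, so they can be inspected before being run.

```shell
# Emit one INSERT IGNORE ... SELECT DISTINCT statement per id range.
# Arguments: first id (inclusive), last id (inclusive), batch size.
# Assumes a hypothetical table such as:
#   CREATE TABLE distinct_vals (col_name VARCHAR(255) PRIMARY KEY);
gen_batches() {
  start=$1
  last=$2
  step=$3
  lo=$start
  while [ "$lo" -le "$last" ]; do
    hi=$((lo + step))
    # Half-open range [lo, hi) so that no id is skipped or read twice.
    printf 'INSERT IGNORE INTO distinct_vals (col_name) SELECT DISTINCT col_name FROM mytable WHERE myid >= %d AND myid < %d;\n' "$lo" "$hi"
    lo=$hi
  done
}

# Example: three batches covering ids 1..250 in steps of 100.
gen_batches 1 250 100
```

The output could then be piped to the mysql client (for example `gen_batches 1 1000000000 1000000 | mysql mydb`, where mydb is a placeholder database name). If a run fails partway through, it can be resumed by calling the function again with the first id of the range that did not complete, since re-inserting values that are already present is a no-op with INSERT IGNORE.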
