Performing Delta E Calculation and Sorting (CIE Lab) in SQL

I have a database table where each row is a color. My goal: to set the input color, calculate its distance to each color in the database table and sort the results by this distance. Or, as a user story: when I select a color, I want to see a list of colors that are most similar to the one I selected, with the closest matches at the top of the list.

I understand that various Delta E formulas (CIE Lab) are the best choice for this . I could not find any native SQL versions of the formulas, so I wrote my own SQL versions of the Delta E CIE 1976 and Delta E CIE 2000 . I checked the accuracy of my SQL versions of formulas against the results generated by python-colormath .

Formula 1976 is easy to write in SQL or any other language, because it is a simple calculation of Euclidean distance. It works well and quickly for me, on data sets of any size (I checked it on a color table with 100,000 rows, and the query takes less than 1 second).

Formula 2000, by contrast, is very long and complex. I managed to implement it in SQL, but its performance is small: about 5 seconds for querying 10,000 rows and about 1 minute for querying 100,000 rows.

I wrote an example application called colorsearchtest (in Python / Flask / Postgres) to play with my implementations (and I set up a demo on Heroku ). If you try this application, you can clearly see the difference in performance between Delta E. 1976 and 2000 queries.

This is a diagram for a color table (for each color, it stores the corresponding RGB and Lab representations as three numerical values):

CREATE TABLE color ( id integer NOT NULL, rgb_r integer, rgb_g integer, rgb_b integer, lab_l double precision, lab_a double precision, lab_b double precision ); 

This is some data in the table (all random colors created using a script in my application):

 INSERT INTO color (id, rgb_r, rgb_g, rgb_b, lab_l, lab_a, lab_b) VALUES (902, 164, 214, 189, 81.6521019943304793, -21.2561872439361323, 7.08354581694699004); INSERT INTO color (id, rgb_r, rgb_g, rgb_b, lab_l, lab_a, lab_b) VALUES (903, 113, 229, 64, 81.7930860963098212, -60.5865728472875205, 66.4022741184551819); INSERT INTO color (id, rgb_r, rgb_g, rgb_b, lab_l, lab_a, lab_b) VALUES (904, 65, 86, 78, 34.6593864327796624, -9.95482220634028003, 2.02661293272071719); ... 

And this is the Delta E CIE 2000 SQL function that I use:

 CREATE OR REPLACE FUNCTION DELTA_E_CIE2000(double precision, double precision, double precision, double precision, double precision, double precision, double precision, double precision, double precision) RETURNS double precision AS $$ WITH c AS (SELECT (CAST($1 AS VARCHAR) || ',' || CAST($2 AS VARCHAR) || ',' || CAST($3 AS VARCHAR) || ',' || CAST($4 AS VARCHAR) || ',' || CAST($5 AS VARCHAR) || ',' || CAST($6 AS VARCHAR)) AS lab_pair_str, (($1 + $4) / 2.0) AS avg_lp, SQRT( POW($2, 2.0) + POW($3, 2.0)) AS c1, SQRT( POW(($5), 2.0) + POW(($6), 2.0)) AS c2), gs AS (SELECT c.lab_pair_str, (0.5 * (1.0 - SQRT( POW(((c.c1 + c.c2) / 2.0), 7.0) / ( POW(((c.c1 + c.c2) / 2.0), 7.0) + POW(25.0, 7.0))))) AS g FROM c WHERE c.lab_pair_str = ( CAST($1 AS VARCHAR) || ',' || CAST($2 AS VARCHAR) || ',' || CAST($3 AS VARCHAR) || ',' || CAST($4 AS VARCHAR) || ',' || CAST($5 AS VARCHAR) || ',' || CAST($6 AS VARCHAR))), ap AS (SELECT gs.lab_pair_str, ((1.0 + gs.g) * $2) AS a1p, ((1.0 + gs.g) * $5) AS a2p FROM gs WHERE gs.lab_pair_str = ( CAST($1 AS VARCHAR) || ',' || CAST($2 AS VARCHAR) || ',' || CAST($3 AS VARCHAR) || ',' || CAST($4 AS VARCHAR) || ',' || CAST($5 AS VARCHAR) || ',' || CAST($6 AS VARCHAR))), cphp AS (SELECT ap.lab_pair_str, SQRT( POW(ap.a1p, 2.0) + POW($3, 2.0)) AS c1p, SQRT( POW(ap.a2p, 2.0) + POW($6, 2.0)) AS c2p, ( DEGREES(ATAN2($3, ap.a1p)) + ( CASE WHEN DEGREES(ATAN2($3, ap.a1p)) < 0.0 THEN 360.0 ELSE 0.0 END)) AS h1p, ( DEGREES(ATAN2($6, ap.a2p)) + ( CASE WHEN DEGREES(ATAN2($6, ap.a2p)) < 0.0 THEN 360.0 ELSE 0.0 END)) AS h2p FROM ap WHERE ap.lab_pair_str = ( CAST($1 AS VARCHAR) || ',' || CAST($2 AS VARCHAR) || ',' || CAST($3 AS VARCHAR) || ',' || CAST($4 AS VARCHAR) || ',' || CAST($5 AS VARCHAR) || ',' || CAST($6 AS VARCHAR))), av AS (SELECT cphp.lab_pair_str, ((cphp.c1p + cphp.c2p) / 2.0) AS avg_c1p_c2p, (((CASE WHEN (ABS(cphp.h1p - cphp.h2p) > 180.0) THEN 360.0 ELSE 0.0 END) + cphp.h1p + cphp.h2p) / 2.0) AS avg_hp FROM cphp WHERE cphp.lab_pair_str = ( CAST($1 AS VARCHAR) || ',' || CAST($2 AS VARCHAR) || ',' || CAST($3 AS VARCHAR) || ',' || CAST($4 AS VARCHAR) || ',' || CAST($5 AS VARCHAR) || ',' || CAST($6 AS VARCHAR))), ts AS (SELECT av.lab_pair_str, (1.0 - 0.17 * COS(RADIANS(av.avg_hp - 30.0)) + 0.24 * COS(RADIANS(2.0 * av.avg_hp)) + 0.32 * COS(RADIANS(3.0 * av.avg_hp + 6.0)) - 0.2 * COS(RADIANS(4.0 * av.avg_hp - 63.0))) AS t, (( (cphp.h2p - cphp.h1p) + (CASE WHEN (ABS(cphp.h2p - cphp.h1p) > 180.0) THEN 360.0 ELSE 0.0 END)) - (CASE WHEN (cphp.h2p > cphp.h1p) THEN 720.0 ELSE 0.0 END)) AS delta_hlp FROM av INNER JOIN cphp ON av.lab_pair_str = cphp.lab_pair_str WHERE av.lab_pair_str = ( CAST($1 AS VARCHAR) || ',' || CAST($2 AS VARCHAR) || ',' || CAST($3 AS VARCHAR) || ',' || CAST($4 AS VARCHAR) || ',' || CAST($5 AS VARCHAR) || ',' || CAST($6 AS VARCHAR))), d AS (SELECT ts.lab_pair_str, ($4 - $1) AS delta_lp, (cphp.c2p - cphp.c1p) AS delta_cp, (2.0 * ( SQRT(cphp.c2p * cphp.c1p) * SIN(RADIANS(ts.delta_hlp) / 2.0))) AS delta_hp, (1.0 + ( (0.015 * POW(c.avg_lp - 50.0, 2.0)) / SQRT(20.0 + POW(c.avg_lp - 50.0, 2.0)))) AS s_l, (1.0 + 0.045 * av.avg_c1p_c2p) AS s_c, (1.0 + 0.015 * av.avg_c1p_c2p * ts.t) AS s_h, (30.0 * EXP(-(POW(((av.avg_hp - 275.0) / 25.0), 2.0)))) AS delta_ro, SQRT( (POW(av.avg_c1p_c2p, 7.0)) / (POW(av.avg_c1p_c2p, 7.0) + POW(25.0, 7.0))) AS r_c FROM ts INNER JOIN cphp ON ts.lab_pair_str = cphp.lab_pair_str INNER JOIN c ON ts.lab_pair_str = c.lab_pair_str INNER JOIN av ON ts.lab_pair_str = av.lab_pair_str WHERE ts.lab_pair_str = ( CAST($1 AS VARCHAR) || ',' || CAST($2 AS VARCHAR) || ',' || CAST($3 AS VARCHAR) || ',' || CAST($4 AS VARCHAR) || ',' || CAST($5 AS VARCHAR) || ',' || CAST($6 AS VARCHAR))), r AS (SELECT d.lab_pair_str, (-2.0 * d.r_c * SIN(2.0 * RADIANS(d.delta_ro))) AS r_t FROM d WHERE d.lab_pair_str = ( CAST($1 AS VARCHAR) || ',' || CAST($2 AS VARCHAR) || ',' || CAST($3 AS VARCHAR) || ',' || CAST($4 AS VARCHAR) || ',' || CAST($5 AS VARCHAR) || ',' || CAST($6 AS VARCHAR))) SELECT SQRT( POW(d.delta_lp / (d.s_l * $7), 2.0) + POW(d.delta_cp / (d.s_c * $8), 2.0) + POW(d.delta_hp / (d.s_h * $9), 2.0) + r.r_t * (d.delta_cp / (d.s_c * $8)) * (d.delta_hp / (d.s_h * $9))) AS delta_e_cie2000 FROM r INNER JOIN d ON r.lab_pair_str = d.lab_pair_str WHERE r.lab_pair_str = ( CAST($1 AS VARCHAR) || ',' || CAST($2 AS VARCHAR) || ',' || CAST($3 AS VARCHAR) || ',' || CAST($4 AS VARCHAR) || ',' || CAST($5 AS VARCHAR) || ',' || CAST($6 AS VARCHAR)) $$ LANGUAGE SQL IMMUTABLE RETURNS NULL ON NULL INPUT; 

(I originally wrote this function using nested subqueries at about 10 levels, but then rewrote it instead of using WITH statements, that is, Postgres CTE. The new version is much readable, and the performance is similar to the old version. You can see both versions in the code .)

After defining the function, I use it in the following query:

 SELECT c.rgb_r, c.rgb_g, c.rgb_b, DELTA_E_CIE2000(73.9206633504, -50.2996953437, 23.8259166281, c.lab_l, c.lab_a, c.lab_b, 1.0, 1.0, 1.0) AS de2000 FROM color c ORDER BY de2000 LIMIT 100; 

So my question is: is there a way to improve the performance of the DELTA_E_CIE2000 function DELTA_E_CIE2000 that it can be used in real time for non-trivial data sets? Or, given the complexity of the formula, is that as fast as it is going to get?

From the testing that I did in my demo application, I would say that for the example of using a simple search for “similar colors” on a website, the difference in the accuracy of the results between the functions of 1976 and 2000 is. virtually negligible. That is, I am already sure that for my needs the 1976 formula is "good enough." However, the 2000 function returns slightly better results (depending on where the input color lies in the laboratory space), and indeed, I'm just wondering if it can accelerate its further progress.

+5
source share
1 answer

Two things: 1) you are not using the database fully and 2) your problem is a great example for the PostgreSQL user extension. That's why.

You use the database only as storage, keeping the colors as floating. In your current configuration, regardless of the type of query, the database should always check all values ​​(perform a sequential scan). This means a lot of input / output and a lot of calculations for several returned matches. You are trying to find the nearest colors N, so there are several ways to avoid performing calculations on all the data.

Simple improvement

The easiest way is to limit your calculations to a smaller subset of data. You can assume that the difference will be greater if the components are different from each other. If you can find a safe difference between components where results are always out of place, you can completely eliminate these colors by using ranged WHERE with btree indices. However, due to the nature of the L * a * b color space, this is likely to degrade your results.

First create the indexes:

 CREATE INDEX color_lab_l_btree ON color USING btree (lab_l); CREATE INDEX color_lab_a_btree ON color USING btree (lab_a); CREATE INDEX color_lab_b_btree ON color USING btree (lab_b); 

Then I adapted your query to include a WHERE clause to filter only colors where any of the components differs by no more than 20.

Update:. After another look, adding a limit of 20 is likely to worsen the results, since I found at least one point in space for which this is true:

 SELECT c.rgb_r, c.rgb_g, c.rgb_b, DELTA_E_CIE2000( 25.805780252087963, 53.33446637366859, -45.03961353720049, c.lab_l, c.lab_a, c.lab_b, 1.0, 1.0, 1.0) AS de2000 FROM color c WHERE c.lab_l BETWEEN 25.805780252087963 - 20 AND 25.805780252087963 + 20 AND c.lab_a BETWEEN 53.33446637366859 - 20 AND 53.33446637366859 + 20 AND c.lab_b BETWEEN -45.03961353720049 - 20 AND -45.03961353720049 + 20 ORDER BY de2000 ; 

I populated a table of 100,000 random colors with a script and tested:

Time without indexes: 44006.851 ms

Time with indexes and range request: 1293.092 ms

You can add this WHERE clause to delta_e_cie1976_query too, according to my random data, it drops the request time from ~ 110 ms to ~ 22 ms.

BTW: I got number 20 empirically: I tried with 10, but only got 380 entries that seem a bit low and might exclude some better options, since the limit is 100. With 20, the full set was 2900 lines and one can be pretty sure that coming matches will be there. I did not study the color space DELTA_E_CIE2000 or L * a * b *, so the threshold may need to be adjusted for various components so that it is true, but the principle of excluding uninteresting data is preserved.

Rewrite Delta E CIE 2000 to C

As you said, Delta E CIE 2000 is complex and quite unsuitable for implementation in SQL. It currently uses about 0.4 ms per call on my laptop. Implementing this in C should speed this up significantly. PostgreSQL assigns a default value for SQL functions as 100 and C as 1. I assume this is based on real-world experience.

Update:. Since this also scratches one of my clicks, I redefined the Delta E functions from the colormath module in C as the PostgreSQL extension available on PGXN . With this, I can see an acceleration of about 150x for CIE2000 when querying all records from a table with 100k records.

With this C function, I get a request time between 147 ms and 160 ms for 100 thousand colors. With an additional WHERE, the request time is about 20 ms, which seems quite acceptable to me.

The best but advanced solution

However, since your problem is finding the closest neighbor in 3-dimensional space, you can use the K-Nearest-Neighbor Indexing, which has been in PostgreSQL since version 9.1 .

For this to work, you put the L * a * b * components in the cube . This extension does not yet support the distance operator ( this is in the works ), but even if it were, it would not support the Delta E distances, and you would need to redefine it as the extension C.

This means that the GiST index statement operator (the btree_gist PostgreSQL extension in Contrib does this) to support indexing according to Delta E distances. The good part is that you could use different operators for different versions of Delta E, for example. <-> for Delta E CIE 2000 and <#> for Delta E CIE 1976 and queries will be really really fast for a little LIMIT even with Delta E CIE 2000.

In the end, it may depend on your requirements (requirements) and limitations.

+6
source

All Articles