How to create a custom window function for PostgreSQL? (Running average example)

I would like to better understand what is involved in creating a UDF that operates over windows in PostgreSQL. I have searched several times for how to create UDFs in general, but I have not found an example of one that operates over a window.

To that end, I am hoping someone is willing to share code for a UDF (in C, PL/pgSQL, or any of the procedural languages supported by PostgreSQL) that calculates the running average of the numbers in a window. I realize there are ways to do this by applying the standard average aggregate function with the windowing syntax (the ROWS BETWEEN syntax, I believe); I am asking for this function only because I think it makes a good simple example. Also, I think that if there were a windowing version of the average function, the database could keep a running sum and observation count rather than summing up almost identical sets of rows at each iteration.

4 answers

You should look at the PostgreSQL source code: postgresql/src/backend/utils/adt/windowfuncs.c and postgresql/src/backend/executor/nodeWindowAgg.c

There is no good documentation :( A fully functional window function can be implemented only in C or PL/v8; there is no API for other languages.

http://www.pgcon.org/2009/schedule/track/Version%208.4/128.en.html is a presentation by the author of the window function implementation in PostgreSQL.

I found only one implementation outside the PostgreSQL core: http://api.pgxn.org/src/kmeans/kmeans-1.1.0/

http://pgxn.org/dist/plv8/1.3.0/doc/plv8.html


According to the documentation, "Other window functions can be added by the user. In addition, any built-in or user-defined normal aggregate function can be used as a window function." (section 4.2.8). This worked for me for computing stock split adjustments:

 CREATE OR REPLACE FUNCTION prod(float8, float8) RETURNS float8
     AS 'SELECT $1 * $2;'
     LANGUAGE SQL IMMUTABLE STRICT;

 CREATE AGGREGATE prods (float8) (
     SFUNC = prod,
     STYPE = float8,
     INITCOND = 1.0
 );

 CREATE OR REPLACE VIEW demo.price_adjusted AS
 SELECT id,
        vd,
        prods(sdiv) OVER (PARTITION BY id ORDER BY vd DESC ROWS UNBOUNDED PRECEDING) AS adjf,
        rawprice * prods(sdiv) OVER (PARTITION BY id ORDER BY vd DESC ROWS UNBOUNDED PRECEDING) AS price
 FROM demo.prices_raw
 LEFT OUTER JOIN demo.adjustments USING (id, vd);

Here are the schemas of the two tables:

 CREATE TABLE demo.prices_raw (
     id       VARCHAR(30),
     vd       DATE,
     rawprice float8
 );

 CREATE TABLE demo.adjustments (
     id   VARCHAR(30),
     vd   DATE,
     sdiv float
 );

PL/R provides this functionality. See here for some examples. However, I am not sure that it (currently) meets your requirement to "keep a running sum and observation count and [not] sum up almost identical sets of rows at each iteration" (see here).


Starting from the table

 payments
 +-------------+--------+-------+
 | customer_id | amount | item  |
 +-------------+--------+-------+
 |           5 |     10 | book  |
 |           5 |     71 | mouse |
 |           7 |     13 | cover |
 |           7 |     22 | cable |
 |           7 |     19 | book  |
 +-------------+--------+-------+

 SELECT customer_id,
        AVG(amount) OVER (PARTITION BY customer_id) AS avg_amount,
        item
 FROM payments;

we get

 +-------------+------------+-------+
 | customer_id | avg_amount | item  |
 +-------------+------------+-------+
 |           5 |       40.5 | book  |
 |           5 |       40.5 | mouse |
 |           7 |         18 | cover |
 |           7 |         18 | cable |
 |           7 |         18 | book  |
 +-------------+------------+-------+
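To try the query yourself, the payments table can be recreated roughly as follows. This is only a sketch: the column types (INTEGER, NUMERIC, TEXT) are assumptions, since the answer shows the rows but not the table definition.

```sql
-- Assumed column types; the answer only lists the data.
CREATE TABLE payments (
    customer_id INTEGER,
    amount      NUMERIC,
    item        TEXT
);

INSERT INTO payments (customer_id, amount, item) VALUES
    (5, 10, 'book'),
    (5, 71, 'mouse'),
    (7, 13, 'cover'),
    (7, 22, 'cable'),
    (7, 19, 'book');

-- Every row keeps its own identity; the average is computed per customer.
SELECT customer_id,
       AVG(amount) OVER (PARTITION BY customer_id) AS avg_amount,
       item
FROM payments;
```

With NUMERIC amounts the averages may be printed with more decimal places than in the table above.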

AVG is an aggregate function and can act as a window function. However, not all window functions are aggregate functions. Aggregate functions are the simplest kind of window function.

In the query above, let's replace the built-in AVG function with our own implementation. It does the same thing, but it is implemented by the user. The query above becomes:

 SELECT customer_id,
        my_avg(amount) OVER (PARTITION BY customer_id) AS avg_amount,
        item
 FROM payments;

The only difference from the previous query is that AVG has been replaced with my_avg. Now we need to implement our custom function.

How to compute the average

Sum all the elements, then divide by the number of elements. For customer_id 7 that is (13 + 22 + 19)/3 = 18. We can split this into:

  • a step-by-step accumulation: the sum.
  • a final operation: the division.

How the aggregate function gets to the result

The average value is calculated in steps. Only the last value is needed. Start with an initial value of 0.

  1. Feed 13. Compute the accumulated sum, which is 13.
  2. Feed 22. Compute the accumulated sum, which is the previous sum plus this element: 13 + 22 = 35.
  3. Feed 19. Compute the accumulated sum, which is the previous sum plus this element: 35 + 19 = 54. This is the total that has to be divided by the number of elements (3).
  4. The result of step 3 is passed to another function, which knows how to divide the accumulated sum by the number of elements.

What happened here is that the state started from the initial value 0, was changed at each step, and was then passed on to the next step.
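The same steps can be watched in SQL by letting the built-in COUNT and SUM run over a growing window frame. This is only an illustration of the accumulation, not part of the custom aggregate; the step column is a hypothetical ordering column added so that the rows are fed in the order used above.

```sql
-- Each output row shows the state after feeding one more element:
-- how many elements have been seen and their accumulated sum.
SELECT step,
       amount,
       COUNT(amount) OVER w AS elements_so_far,
       SUM(amount)   OVER w AS accumulated_sum
FROM (VALUES (1, 13), (2, 22), (3, 19)) AS t(step, amount)
WINDOW w AS (ORDER BY step ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
ORDER BY step;
```

The last row carries the values from step 3: three elements with an accumulated sum of 54.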

The state travels between steps for as long as there is data. When all the data has been consumed, the state goes to the final function (the terminal operation). We want the state to contain all the information needed by the accumulator as well as by the terminal operation.

In the specific case of computing an average, the terminal operation needs to know how many elements the accumulator has processed, because it has to divide by that count. For this reason, the state must include both the accumulated sum and the number of elements.

We need a tuple that holds both. PostgreSQL's predefined POINT type comes to the rescue. POINT(5, 89) means an accumulated sum of 89 over 5 elements. The initial state is POINT(0,0).
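As a quick check of how such a state is read back, the two coordinates of a POINT can be accessed by subscript. A minimal sketch; the column aliases are only illustrative.

```sql
-- [0] is the first coordinate (the element count here),
-- [1] is the second coordinate (the accumulated sum here).
SELECT (POINT(5, 89))[0] AS element_count,
       (POINT(5, 89))[1] AS accumulated_sum;
```

Both coordinates are double precision values, so using POINT as the state trades NUMERIC exactness for convenience. That is acceptable for a demonstration but worth keeping in mind for real data.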

The accumulator is implemented in what is called a state function. The terminal operation is implemented in what is called a final function.

When defining a custom aggregate function, we need to specify:

  • the aggregate function's name and input type
  • the initial state
  • the type of the state that the infrastructure passes between the steps and on to the final function
  • the state function, which knows how to perform the accumulation steps
  • the final function, which knows how to perform the terminal operation. It is not always required (for example, in a custom SUM implementation the final value of the accumulated sum is already the result).

Here is the definition for a custom aggregate function.

 CREATE AGGREGATE my_avg (NUMERIC) (  -- NUMERIC is the type of the values the aggregate takes in
     initcond = '(0,0)',              -- the initial state, of type POINT
     stype = POINT,                   -- the type of the state that is passed between steps
     sfunc = my_acc,                  -- state function: takes the state (POINT) and the element for the step (NUMERIC), returns the new state
     finalfunc = my_final_func        -- final function: takes the state (POINT) and returns the result of the aggregate (NUMERIC)
 );

It only remains to define the two functions my_acc and my_final_func.

 -- State function: performs one step of the accumulation
 CREATE FUNCTION my_acc (state POINT, elem_for_step NUMERIC)
 RETURNS POINT
 LANGUAGE SQL
 AS $$
     -- state[0] is the number of elements, state[1] is the accumulated sum
     SELECT POINT(state[0] + 1, state[1] + elem_for_step);
 $$;

 -- Final function: performs the division and returns the final value
 CREATE FUNCTION my_final_func (POINT)
 RETURNS NUMERIC
 LANGUAGE SQL
 AS $$
     -- $1[1] is the sum, $1[0] is the number of elements;
     -- NULLIF returns NULL instead of raising a division-by-zero error when no rows were aggregated
     SELECT ($1[1] / NULLIF($1[0], 0))::NUMERIC;
 $$;

Once these functions exist, the CREATE AGGREGATE statement above runs successfully. With the aggregate defined, we can run a query based on my_avg instead of the built-in AVG:

 SELECT customer_id,
        my_avg(amount) OVER (PARTITION BY customer_id) AS avg_amount,
        item
 FROM payments;

The results are the same as what you get when using the built-in AVG .
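On the question's point about keeping a running sum and count rather than re-summing nearly identical sets of rows: since PostgreSQL 9.4, CREATE AGGREGATE also accepts an inverse transition function (minvfunc) for moving-aggregate mode, which lets the executor remove rows that leave the window frame from the state. The following is a minimal sketch under that assumption, not the answer author's code. All my_mov_* names and my_moving_avg are hypothetical, and the payments table is assumed to have a NUMERIC amount column.

```sql
-- Forward transition: add one element to the (count, sum) state.
CREATE FUNCTION my_mov_acc (state POINT, elem NUMERIC)
RETURNS POINT LANGUAGE SQL IMMUTABLE STRICT AS $$
    SELECT POINT(state[0] + 1, state[1] + elem);
$$;

-- Inverse transition: remove one element that has left the frame.
CREATE FUNCTION my_mov_inv (state POINT, elem NUMERIC)
RETURNS POINT LANGUAGE SQL IMMUTABLE STRICT AS $$
    SELECT POINT(state[0] - 1, state[1] - elem);
$$;

-- Final function: sum divided by count, NULL for an empty state.
CREATE FUNCTION my_mov_final (state POINT)
RETURNS NUMERIC LANGUAGE SQL IMMUTABLE STRICT AS $$
    SELECT (state[1] / NULLIF(state[0], 0))::NUMERIC;
$$;

CREATE AGGREGATE my_moving_avg (NUMERIC) (
    sfunc      = my_mov_acc,
    stype      = POINT,
    finalfunc  = my_mov_final,
    initcond   = '(0,0)',
    -- moving-aggregate mode, used when the frame start moves
    msfunc     = my_mov_acc,
    minvfunc   = my_mov_inv,
    mstype     = POINT,
    mfinalfunc = my_mov_final,
    minitcond  = '(0,0)'
);

-- Average of the current row and the one before it, per customer.
SELECT customer_id, item, amount,
       my_moving_avg(amount) OVER (
           PARTITION BY customer_id
           ORDER BY amount
           ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
       ) AS moving_avg
FROM payments;
```

The forward and inverse functions are both declared STRICT because PostgreSQL requires their strictness to match. Since the state is stored as double precision inside a POINT, repeated additions and subtractions can accumulate rounding error on long partitions.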

The PostgreSQL documentation suggests that users are limited to implementing custom aggregate functions:

In addition to these [pre-defined window] functions, any built-in or user-defined general-purpose or statistical aggregate (i.e., not ordered-set or hypothetical-set aggregates) can be used as a window function;

What I suspect "not ordered-set or hypothetical-set aggregates" means:

  • the return value is the same for every row in the partition (for example, AVG and SUM; by contrast, RANK returns different values for the rows in a group, depending on more sophisticated criteria)
  • ORDER BY is not needed alongside PARTITION BY, because the values are the same for all rows anyway. By contrast, we do want ORDER BY when using RANK()

Query:

 SELECT customer_id, item, rank() OVER (PARTITION BY customer_id ORDER BY amount desc) FROM payments; 

Geometric mean

The following is a custom aggregate function for which I found no built-in equivalent; it may be useful to some.

The state function computes the running average of the natural logarithms of the terms.

The final function raises the constant e to whatever the accumulator provides.

 -- State function: updates the running mean of the logarithms
 CREATE OR REPLACE FUNCTION sum_of_log (state POINT, curr_val NUMERIC)
 RETURNS POINT
 LANGUAGE SQL
 AS $$
     -- state[0] is the number of elements, state[1] is the mean of LN(value) so far
     SELECT POINT(state[0] + 1, (state[1] * state[0] + LN(curr_val)) / (state[0] + 1));
 $$;

 -- Final function: e raised to the mean of the logarithms
 CREATE OR REPLACE FUNCTION e_to_avg_of_log (POINT)
 RETURNS NUMERIC
 LANGUAGE SQL
 AS $$
     SELECT exp($1[1])::NUMERIC;
 $$;

 CREATE AGGREGATE geo_mean (NUMERIC) (
     stype = POINT,
     initcond = '(0,0)',  -- the initial POINT value
     sfunc = sum_of_log,
     finalfunc = e_to_avg_of_log
 );
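One way to sanity-check a geometric mean aggregate is to compare it with the same quantity written using built-in functions only, since the geometric mean equals e raised to the average of the logarithms. A small sketch with made-up sample values (2 and 8, whose geometric mean is the square root of 16):

```sql
-- Built-in cross-check: exp(avg(ln(x))) over two sample values.
-- The result should be approximately 4.
SELECT EXP(AVG(LN(amount))) AS geo_mean_check
FROM (VALUES (2::NUMERIC), (8::NUMERIC)) AS t(amount);
```

Running geo_mean(amount) over the same two values should give a result close to this one, up to the rounding introduced by the double precision state.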