Get row with maximum value in Hive / SQL?

I am new to Hive / SQL and I am stuck with a pretty simple problem. My data looks like this:

+------------+--------------------+-----------------------+
| carrier_iD | meandelay          | meancanceled          |
+------------+--------------------+-----------------------+
| EV         | 13.795802119653473 | 0.028584251044292006  |
| VX         | 0.450591016548463  | 2.364066193853424E-4  |
| F9         | 10.898001378359766 | 0.00206753962784287   |
| AS         | 0.5071547420965062 | 0.0057404326123128135 |
| HA         | 1.2031093279839498 | 5.015045135406214E-4  |
| 9E         | 8.147899230704216  | 0.03876067292247866   |
| B6         | 9.45383857757506   | 0.003162096314343487  |
| UA         | 8.101511665305816  | 0.005467725574605967  |
| FL         | 0.7265068895709532 | 0.0041141513746490044 |
| WN         | 7.156119279121648  | 0.0057419058192869415 |
| DL         | 4.206288692245839  | 0.005123990066804269  |
| YV         | 6.316802855264404  | 0.029304029304029346  |
| US         | 3.2221527095063736 | 0.007984031936127766  |
| OO         | 6.954715814690328  | 0.02596499362466706   |
| MQ         | 9.74568222216328   | 0.025628100708354324  |
| AA         | 8.720522654298968  | 0.019242775597574157  |
+------------+--------------------+-----------------------+

I want Hive to return the row with the maximum meandelay value. I have:

 SELECT CAST(MAX(meandelay) as FLOAT) FROM flightinfo; 

which does return the max (I use the cast because my values are stored as STRING). So I tried:

 SELECT * FROM flightinfo WHERE meandelay = (SELECT CAST(MAX(meandelay) AS FLOAT) FROM flightinfo); 

I get the following error:

 FAILED: ParseException line 1:44 cannot recognize input near 'select' 'cast' '(' in expression specification 
+7
sql hive
4 answers

Use windowing and analytics functions:

 SELECT carrier_id, meandelay, meancanceled FROM (SELECT carrier_id, meandelay, meancanceled, rank() over (order by cast(meandelay as float) desc) as r FROM flightinfo) S WHERE S.r = 1; 

This also handles ties: if more than one row has the same maximum value, you will get all of those rows in the result. If you need just one row, change rank() to row_number() or add another term to the order by.
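The single-row variant mentioned above might look like the following. It is a minimal sketch, assuming the flightinfo table from the question. Using carrier_id as the tie-breaker and casting to DOUBLE rather than FLOAT are illustrative choices, not part of the original answer.

```sql
-- Sketch: return exactly one row for the maximum delay, even when values tie.
-- row_number() assigns 1, 2, 3, ... with no repeats, unlike rank().
SELECT carrier_id, meandelay, meancanceled
FROM (
  SELECT carrier_id, meandelay, meancanceled,
         row_number() OVER (
           ORDER BY CAST(meandelay AS DOUBLE) DESC,  -- largest delay first
                    carrier_id                       -- assumed tie-breaker
         ) AS r
  FROM flightinfo
) s
WHERE s.r = 1;
```

Without the second ORDER BY term, which of the tied rows receives r = 1 is not deterministic.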

+8

Use a join instead:

 SELECT a.* FROM flightinfo a LEFT SEMI JOIN (SELECT MAX(CAST(meandelay AS DOUBLE)) AS maxdelay FROM flightinfo) b ON (CAST(a.meandelay AS DOUBLE) = b.maxdelay); 
+2

I don't think your subquery is allowed ...

Take a quick look here:

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SubQueries

It says:

As of Hive 0.13, some types of subqueries are supported in the WHERE clause. Those are queries where the result of the query can be treated as a constant for IN and NOT IN statements (called uncorrelated subqueries because the subquery does not reference columns from the parent query).
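Going by that passage, rewriting the comparison as an IN predicate may be enough on Hive 0.13 or later. The following is an untested sketch, assuming the flightinfo table from the question. The cast to DOUBLE on both sides is an added assumption, so that the values are compared as numbers rather than as strings.

```sql
-- Sketch: uncorrelated IN subquery (Hive 0.13+ according to the manual).
-- The subquery yields a single value, the numeric maximum of meandelay.
SELECT *
FROM flightinfo
WHERE CAST(meandelay AS DOUBLE) IN (
  SELECT MAX(CAST(meandelay AS DOUBLE))
  FROM flightinfo
);
```

On older Hive versions this form is not available, and a join or a window function is needed instead.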

0

You can use the collect_max UDF from Brickhouse ( http://github.com/klout/brickhouse ) to solve this problem, passing a value of 1 to indicate that you only need the single maximum value.

 select array_index( map_keys( collect_max( carrier_id, meandelay, 1) ), 0 ) from flightinfo; 

Also, I read somewhere that Hive's max UDF allows you to access other fields in a row, but I think it's easier to use collect_max.
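The max trick alluded to above is probably the one where max() is applied to a struct whose first field is the value being compared. This is a hedged sketch, assuming a Hive version in which max() accepts struct arguments and the flightinfo table from the question. The aliases m and t are illustrative.

```sql
-- Sketch only: relies on max() comparing structs field by field, so the
-- first field (the numeric delay) decides which struct is the largest.
-- struct() names its fields col1, col2, col3 in argument order.
SELECT m.col2 AS carrier_id,
       m.col1 AS meandelay,
       m.col3 AS meancanceled
FROM (
  SELECT max(struct(CAST(meandelay AS DOUBLE), carrier_id, meancanceled)) AS m
  FROM flightinfo
) t;
```

This returns a single row. When delays tie, the later struct fields break the tie, so the carrier shown is determined by carrier_id ordering rather than by anything meaningful.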

0
