Getting the most used words from a row column in SQL

So, we have this database filled with a bunch of strings, in this case post-captions.

What I want to do:

  • Divide the line into words
  • Indicate how many times words appear in lines
  • Give me 50 words
  • Does not have this timeout in the data.se request

I tried using the information from this SO question adapted to data.se as follows:

select word, count(*) from ( select (case when instr(substr(p.Title, nums.n+1), ' ') then substr(p.Title, nums.n+1) else substr(p.Title, nums.n+1, instr(substr(p.Title, nums.n+1), ' ') - 1) end) as word from (select ' '||Title as string from Posts p )Posts cross join (select 1 as n union all select 2 union all select 10 ) nums where substr(p.Title, nums.n, 1) = ' ' and substr(p.Title, nums.n, 1) <> ' ' ) w group by word order by count(*) desc 

Unfortunately, this gives me a lot of errors:

'substr' is not a recognized built-in function name. Incorrect syntax next to '|'. Invalid syntax next to 'nums'.

So, for a column of rows in SQL with a variable amount of text in each row, how can I get a list of the most frequently used words of X?

+8
source share
3 answers

As Blogbeard said, the query you specified does not work with SQL Server. Here is one way to count the most used word. This is based on the DelimitedSplitN4K feature written by Jeff Moden and enhanced by members of the SQL Server Central community.

ONLINE DEMO

 WITH E1(N) AS ( SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1) ) t(N) ), E2(N) AS (SELECT 1 FROM E1 a CROSS JOIN E1 b), E4(N) AS (SELECT 1 FROM E2 a CROSS JOIN E2 b) SELECT TOP 50 x.Item, COUNT(*) FROM Posts p CROSS APPLY ( SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1), Item = LTRIM(RTRIM(SUBSTRING(p.Title, l.N1, l.L1))) FROM ( SELECT s.N1, L1 = ISNULL(NULLIF(CHARINDEX(' ',p.Title,s.N1),0)-s.N1,4000) FROM( SELECT 1 UNION ALL SELECT t.N+1 FROM( SELECT TOP (ISNULL(DATALENGTH(p.Title)/2,0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4 ) t(N) WHERE SUBSTRING(p.Title ,tN,1) = ' ' ) s(N1) ) l(N1, L1) ) x WHERE x.item <> '' GROUP BY x.Item ORDER BY COUNT(*) DESC 

Since function creation is prohibited, I wrote it like this. Here is the function definition if you are interested:

 CREATE FUNCTION [dbo].[DelimitedSplitN4K]( @pString NVARCHAR(4000), @pDelimiter NCHAR(1) ) RETURNS TABLE WITH SCHEMABINDING AS RETURN WITH E1(N) AS ( SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 ), E2(N) AS (SELECT 1 FROM E1 a, E1 b), E4(N) AS (SELECT 1 FROM E2 a, E2 b), cteTally(N) AS( SELECT TOP (ISNULL(DATALENGTH(@pString)/2,0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4 ), cteStart(N1) AS ( SELECT 1 UNION ALL SELECT t.N+1 FROM cteTally t WHERE SUBSTRING(@pString,tN,1) = @pDelimiter ), cteLen(N1,L1) AS( SELECT s.N1, ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,s.N1),0)-s.N1,4000) FROM cteStart s ) SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1), Item = SUBSTRING(@pString, l.N1, l.L1) FROM cteLen l ; 

And here is how you would use it:

 SELECT TOP 50 x.Item, COUNT(*) FROM Posts p CROSS APPLY dbo.DelimitedSplitN4K(p.Title, ' ') x WHERE LTRIM(RTRIM(x.Item)) <> '' GROUP BY x.Item ORDER BY COUNT(*) DESC 

Result:

 Item -------- ------- to 3812411 in 3331522 a 2543636 How 1770915 the 1534298 with 1341632 of 1297468 and 1166664 on 970554 from 964449 for 886007 not 835979 is 704724 using 703007 I 633838 - 632441 an 548450 when 449169 file 409717 how 358745 data 335271 do 323854 can 310298 get 305922 or 266317 error 263563 use 258408 value 254392 it 251254 my 238902 function 235832 by 231025 Android 228308 as 216654 array 209157 working 207445 does 207274 Is 205613 multiple 203336 that 197826 Why 196979 into 196591 after 192056 string 189053 PHP 187018 one 182360 class 179965 if 179590 text 174878 table 169393 
+10
source

Query solution (no split function required)

PostgreSQL

 select word, count(*) from ( -- get 1st words select split_part(title, ' ', 1) as word from posts union all -- get 2nd words select split_part(title, ' ', 2) as word from posts union all -- get 3rd words select split_part(title, ' ', 3) as word from posts -- can do this as many times as the number of words in longest title ) words where word is not null and word NOT IN ('', 'and', 'for', 'of', 'on') group by word order by count desc limit 50; 

for the short version: https://dba.stackexchange.com/a/82456/95929

+1
source

Thanks to the now available STRING_SPLIT function (starting with SQL Server 2016, compatibility level 130), this query becomes much simpler:

 SELECT TOP 50 value [word] , COUNT(*) [#times] FROM posts p CROSS APPLY STRING_SPLIT(p.title, ' ') GROUP BY value ORDER BY COUNT(*) DESC 

See it in action in the Stack Exchange Data Explorer, where it still works for less than 2 minutes for the current number of posts in the database. In Qaru em Português, he works without fear for a terrible timeout.

The results are similar to what you saw in Felix's answer :

result of words and their frequency

0
source

All Articles