SQL: problem counting the number of words with LEN()

I am trying to count the words of text that are written in a column of a table. Therefore, I use the following query.

    SELECT LEN(ExtractedText) - LEN(REPLACE(ExtractedText, ' ', '')) + 1
    FROM EDDSDBO.Document
    WHERE ID = '100'

I get a wrong result that is far too high. On the other hand, if I paste the text directly into the statement as a string literal, it works, i.e.

    SELECT LEN('blablabla text') - LEN(REPLACE('blablabla text', ' ', '')) + 1

The column's data type is nvarchar(max), because the text is very long. I have already tried converting the column to text or ntext and using DATALENGTH() instead of LEN(). The result is the same: it works on a string literal but not on the column from the table.

4 answers

You are counting spaces, not words, which usually gives only an approximate answer.

For example:

 ' this string will give an incorrect result ' 
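To make the miscount concrete, here is a minimal sketch that applies the question's formula to a hypothetical sample value (the variable `@s` and its content are assumptions, not data from the question). The string contains two words but several extra spaces.

```sql
-- Hypothetical sample: two words with leading, repeated and trailing spaces.
DECLARE @s NVARCHAR(100) = N'  hello   world  ';

-- Same formula as in the question: (number of spaces) + 1.
-- LEN() ignores trailing spaces, so LEN(@s) is 15 here, not 17.
SELECT LEN(@s) - LEN(REPLACE(@s, ' ', '')) + 1 AS SpaceBasedCount;  -- 15 - 10 + 1 = 6
```

The formula reports 6 although there are only two words, because every extra leading or repeated space is counted as a word boundary.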

Try this approach: http://www.sql-server-helper.com/functions/count-words.aspx

    CREATE FUNCTION [dbo].[WordCount] ( @InputString VARCHAR(4000) )
    RETURNS INT
    AS
    BEGIN
      DECLARE @Index INT
      DECLARE @Char CHAR(1)
      DECLARE @PrevChar CHAR(1)
      DECLARE @WordCount INT

      SET @Index = 1
      SET @WordCount = 0

      WHILE @Index <= LEN(@InputString)
      BEGIN
        SET @Char = SUBSTRING(@InputString, @Index, 1)
        SET @PrevChar = CASE WHEN @Index = 1 THEN ' '
                             ELSE SUBSTRING(@InputString, @Index - 1, 1)
                        END

        IF @PrevChar = ' ' AND @Char != ' '
          SET @WordCount = @WordCount + 1

        SET @Index = @Index + 1
      END

      RETURN @WordCount
    END
    GO

Usage:

    DECLARE @String VARCHAR(4000)
    SET @String = 'Health Insurance is an insurance against expenses incurred through illness of the insured.'

    SELECT [dbo].[WordCount] ( @String )

Leading spaces, trailing spaces, two or more spaces between adjacent words are likely causes of the wrong results you get.

The LTRIM() and RTRIM() functions can help you resolve the first two issues. As for the third, you can use REPLACE(ExtractedText, '  ', ' ') to replace double spaces with single spaces, but that is not enough if the text may contain triple (or longer) runs of spaces, in which case you will need to repeat the replacement.
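A minimal sketch of that idea, using a hypothetical variable `@s` in place of the ExtractedText column: trim both ends, then repeat the double-space replacement until no double space remains, and only then apply the space-counting formula. It only treats the plain space character as a separator; tabs and line breaks are not handled.

```sql
-- Hypothetical sample value standing in for the column content.
DECLARE @s NVARCHAR(MAX) = N'  hello   world  ';

-- Remove leading and trailing spaces.
SET @s = LTRIM(RTRIM(@s));

-- Repeat the replacement until no run of two spaces is left.
WHILE CHARINDEX('  ', @s) > 0
    SET @s = REPLACE(@s, '  ', ' ');

-- Words are now separated by exactly one space; an empty string has no words.
SELECT CASE WHEN @s = '' THEN 0
            ELSE LEN(@s) - LEN(REPLACE(@s, ' ', '')) + 1
       END AS WordCount;
```

The loop is needed because a single pass only halves each run of spaces.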


UPDATE

Here's a UDF that uses CTEs and ranking to collapse each run of consecutive spaces into one group, and then counts those groups to return the number of words:

    CREATE FUNCTION fnCountWords (@Str varchar(max))
    RETURNS int
    AS
    BEGIN
      DECLARE @xml xml, @res int;
      SET @Str = RTRIM(LTRIM(@Str));
      WITH split AS (
        SELECT idx = number, chr = SUBSTRING(@Str, number, 1)
        FROM master..spt_values
        WHERE type = 'P'
          AND number BETWEEN 1 AND LEN(@Str)
      ),
      ranked AS (
        SELECT idx, chr, rnk = idx - ROW_NUMBER() OVER (PARTITION BY chr ORDER BY idx)
        FROM split
      )
      SELECT @res = COUNT(DISTINCT rnk) + 1
      FROM ranked
      WHERE chr = ' ';
      RETURN @res;
    END

Using this function, your query will look like this:

    SELECT dbo.fnCountWords(ExtractedText)
    FROM EDDSDBO.Document
    WHERE ID = '100'

(A scalar UDF must be called with its schema name; dbo is assumed here.)

UPDATE 2

The function uses one of the system tables, master..spt_values, as the tally table. The particular subset used contains only the values from 0 to 2047. This means that the function will not work correctly for inputs longer than 2047 characters (after trimming both the leading and the trailing spaces), as @t-clausen.dk correctly noted in his comment. Therefore, a custom tally table should be used if longer input strings are possible.
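One possible way to build such a custom tally table is sketched below. The table name dbo.Tally and the upper bound of 1,000,000 are assumptions chosen for illustration, and the sketch assumes that the cross join of sys.all_objects with itself yields at least that many rows, which is typical but not guaranteed.

```sql
-- Hypothetical one-time setup: a tally table holding the numbers 1..1,000,000.
CREATE TABLE dbo.Tally (number INT NOT NULL PRIMARY KEY);

INSERT INTO dbo.Tally (number)
SELECT TOP (1000000)
       ROW_NUMBER() OVER (ORDER BY (SELECT NULL))  -- consecutive numbers starting at 1
FROM sys.all_objects AS a
CROSS JOIN sys.all_objects AS b;                    -- cross join only to get enough rows
```

Inside the function, the source of the split CTE would then become `FROM dbo.Tally WHERE number BETWEEN 1 AND LEN(@Str)` instead of the filtered master..spt_values. Inputs longer than the largest number in the table would still be truncated.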


Replace each space with a space followed by a marker that never occurs in your text, such as ' $!' (or pick another value). Then replace every '$! ' with nothing, and finally every remaining '$!' with nothing. This way you never have more than one space before a word. Then use your current script. I have defined a word as a space followed by a non-space.

Here is an example:

    DECLARE @T TABLE(COL1 NVARCHAR(2000), ID INT)

    INSERT @T VALUES('ABC D', 100)

    SELECT LEN(C) - LEN(REPLACE(C, ' ', '')) COUNT
    FROM (
      SELECT REPLACE(REPLACE(REPLACE(' ' + COL1, ' ', ' $!'), '$! ', ''), '$!', '') C
      FROM @T
    ) A
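To see why the three nested replacements collapse runs of spaces, the following sketch shows each intermediate value for a hypothetical input with three spaces between two words. The variable name and the step column names are assumptions for illustration.

```sql
-- Hypothetical input: three spaces between the two words.
DECLARE @s NVARCHAR(100) = N'ABC   D';

SELECT
    -- Step 1: every space becomes ' $!'          -> ' $!ABC $! $! $!D'
    REPLACE(' ' + @s, ' ', ' $!') AS Step1,
    -- Step 2: drop each marker followed by space -> ' $!ABC $!D'
    REPLACE(REPLACE(' ' + @s, ' ', ' $!'), '$! ', '') AS Step2,
    -- Step 3: drop the remaining markers         -> ' ABC D'
    REPLACE(REPLACE(REPLACE(' ' + @s, ' ', ' $!'), '$! ', ''), '$!', '') AS Step3;
```

After step 3 every word is preceded by exactly one space, so counting the spaces gives the word count (2 for this input).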

Here is a recursive solution

    DECLARE @T TABLE(COL1 NVARCHAR(2000), ID INT)

    INSERT @T VALUES('ABC D', 100)
    INSERT @T VALUES('have a nice day with 7 words', 100)

    ;WITH CTE AS (
      SELECT 1 words, col1 c, col1
      FROM @t
      WHERE id = 100
      UNION ALL
      SELECT words + 1, right(c, len(c) - patindex('% [^ ]%', c)), col1
      FROM cte
      WHERE patindex('% [^ ]%', c) > 0
    )
    SELECT words, col1
    FROM cte
    WHERE patindex('% [^ ]%', c) = 0

You must declare the column using the varchar data type, for example:

    create table emp(ename varchar(22));
    insert into emp values('amit');
    select ename, len(ename) from emp;

output: 4
