SQL: problem counting the number of words with LEN()

Question

I am trying to count the words in text stored in a table column, so I use the following query:

    SELECT LEN(ExtractedText) - LEN(REPLACE(ExtractedText, ' ', '')) + 1 FROM EDDSDBO.Document WHERE ID = '100'

I get a wrong result that is far too high. On the other hand, if I paste the text directly into the statement as a string literal, it works:

    SELECT LEN('blablabla text') - LEN(REPLACE('blablabla text', ' ', '')) + 1

The column's data type is nvarchar(max), since the text is very long. I have already tried converting the column to text or ntext and using DATALENGTH() instead of LEN(). The outcome is the same: the formula works on a string literal but not on the column read from the table.

sql sql-server tsql

berl13 Aug 11 '11 at 7:40

4 answers

Code magician · Answer 1 · 2011-08-11T07:52:58+0000

You are counting spaces, not words. That usually gives only an approximate answer.

For example:

 ' this string will give an incorrect result '
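To see why counting spaces overcounts, the sketch below applies the question's formula to a string of that kind. The variable @s and the doubled spaces in it are illustrative assumptions, not data from the question.

```sql
-- Illustrative only: @s is a made-up sample with a leading space,
-- a trailing space, and doubled spaces between some of its seven words.
DECLARE @s VARCHAR(100);
SET @s = ' this  string will  give an incorrect result ';

SELECT
    -- The question's formula: every space is treated as a word separator.
    LEN(@s) - LEN(REPLACE(@s, ' ', '')) + 1 AS SpaceBasedCount,
    -- Counted by hand for comparison.
    7 AS ActualWordCount;
```

Because every extra space adds one to the result, SpaceBasedCount comes out higher than the seven words actually present. Note that LEN() ignores trailing spaces, so only the leading and doubled spaces inflate the number here.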

Try this approach: http://www.sql-server-helper.com/functions/count-words.aspx

    CREATE FUNCTION [dbo].[WordCount] ( @InputString VARCHAR(4000) )
    RETURNS INT
    AS
    BEGIN
        DECLARE @Index INT
        DECLARE @Char CHAR(1)
        DECLARE @PrevChar CHAR(1)
        DECLARE @WordCount INT

        SET @Index = 1
        SET @WordCount = 0

        WHILE @Index <= LEN(@InputString)
        BEGIN
            SET @Char = SUBSTRING(@InputString, @Index, 1)
            SET @PrevChar = CASE WHEN @Index = 1 THEN ' '
                                 ELSE SUBSTRING(@InputString, @Index - 1, 1)
                            END

            IF @PrevChar = ' ' AND @Char != ' '
                SET @WordCount = @WordCount + 1

            SET @Index = @Index + 1
        END

        RETURN @WordCount
    END
    GO

Usage:

    DECLARE @String VARCHAR(4000)
    SET @String = 'Health Insurance is an insurance against expenses incurred through illness of the insured.'

    SELECT [dbo].[WordCount] ( @String )

Andriy m · Answer 2 · 2011-08-11T07:55:51+0000

Leading spaces, trailing spaces, and two or more spaces between adjacent words are the likely causes of the wrong results you are getting.

The LTRIM() and RTRIM() functions can help you resolve the first two issues. As for the third, you can use REPLACE(ExtractedText, '  ', ' ') to replace double spaces with single spaces, but the text may also contain triple or longer runs of spaces, in which case you will need to repeat the replacement.
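A minimal sketch of that idea, assuming the text has been read into a variable: trim both ends, then repeat the double-space replacement until no run of two spaces remains. The variable @s and its sample value are hypothetical.

```sql
-- Hypothetical sample value; in practice this would come from the column.
DECLARE @s NVARCHAR(MAX);
SET @s = N'  two   or more    spaces  ';

-- Remove leading and trailing spaces first.
SET @s = LTRIM(RTRIM(@s));

-- Repeat the replacement until no run of two spaces is left.
WHILE CHARINDEX(N'  ', @s) > 0
    SET @s = REPLACE(@s, N'  ', N' ');

-- With single separators only, the question's formula applies;
-- an empty string is reported as zero words.
SELECT CASE WHEN @s = N'' THEN 0
            ELSE LEN(@s) - LEN(REPLACE(@s, N' ', N'')) + 1
       END AS WordCount;
```

The loop handles runs of any length, at the cost of several passes over the string. It only treats the space character as a separator, so tabs or line breaks in the text would still need separate handling.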

UPDATE

Here's a UDF that uses CTEs and ranking to group each run of consecutive spaces, then counts those groups and adds one to get the number of words:

    CREATE FUNCTION fnCountWords (@Str varchar(max))
    RETURNS int
    AS
    BEGIN
        DECLARE @res int;
        SET @Str = RTRIM(LTRIM(@Str));

        WITH split AS (
            -- One row per character position of the trimmed string.
            SELECT idx = number, chr = SUBSTRING(@Str, number, 1)
            FROM master..spt_values
            WHERE type = 'P' AND number BETWEEN 1 AND LEN(@Str)
        ),
        ranked AS (
            -- Consecutive occurrences of the same character share one rnk value.
            SELECT idx, chr,
                   rnk = idx - ROW_NUMBER() OVER (PARTITION BY chr ORDER BY idx)
            FROM split
        )
        SELECT @res = COUNT(DISTINCT rnk) + 1
        FROM ranked
        WHERE chr = ' ';

        RETURN @res;
    END

Using this function, your query will look like this (a scalar UDF must be called with a schema-qualified name; dbo is assumed here):

    SELECT dbo.fnCountWords(ExtractedText) FROM EDDSDBO.Document WHERE ID = '100'

UPDATE 2

The function uses one of the system tables, master..spt_values, as the tally (numbers) table. The particular subset used contains only the values 0 to 2047. This means the function will not work correctly for inputs longer than 2047 characters (after trimming leading and trailing spaces), as @t-clausen.dk correctly noted in his comment. Therefore, a custom tally table should be used if longer input strings are possible.
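One way to lift the 2047-character limit is to generate the numbers inline instead of reading them from master..spt_values. The following is only a sketch under stated assumptions: the function name fnCountWordsLong and the CTE names digits and tally are hypothetical, the VALUES row constructor requires SQL Server 2008 or later, and six cross-joined copies of ten rows cap the input at one million characters.

```sql
-- Sketch only: fnCountWordsLong, digits and tally are hypothetical names.
-- Same ranking logic as fnCountWords, with an inline tally of up to 10^6 numbers.
CREATE FUNCTION dbo.fnCountWordsLong (@Str varchar(max))
RETURNS int
AS
BEGIN
    DECLARE @res int;
    SET @Str = RTRIM(LTRIM(@Str));

    WITH digits AS (
        SELECT n FROM (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) AS v(n)
    ),
    tally AS (
        -- TOP stops number generation at the length of the input.
        SELECT TOP (ISNULL(LEN(@Str), 0))
               ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS number
        FROM digits d1 CROSS JOIN digits d2 CROSS JOIN digits d3
             CROSS JOIN digits d4 CROSS JOIN digits d5 CROSS JOIN digits d6
    ),
    split AS (
        -- One row per character position.
        SELECT idx = number, chr = SUBSTRING(@Str, number, 1)
        FROM tally
    ),
    ranked AS (
        -- Consecutive spaces share one rnk value.
        SELECT idx, chr,
               rnk = idx - ROW_NUMBER() OVER (PARTITION BY chr ORDER BY idx)
        FROM split
    )
    SELECT @res = COUNT(DISTINCT rnk) + 1
    FROM ranked
    WHERE chr = ' ';

    RETURN @res;
END
```

It would be called the same way, for example SELECT dbo.fnCountWordsLong(ExtractedText) FROM EDDSDBO.Document WHERE ID = '100'. A permanent tally table would serve equally well; the inline version simply avoids creating another object. Like the original function, it returns 1 for an empty string.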

t-clausen.dk · Answer 3 · 2011-08-11T08:42:13+0000

Replace every space with a space followed by a marker that never occurs in your text, such as ' $!' (or pick another value). Then replace every '$! ' and every remaining '$!' with nothing. This way you never have more than one space before a word. Then use your current script. I have defined a word as a space followed by a non-space.

Here is an example:

    DECLARE @T TABLE (COL1 NVARCHAR(2000), ID INT)
    INSERT @T VALUES ('ABC D', 100)

    SELECT LEN(C) - LEN(REPLACE(C, ' ', '')) AS [COUNT]
    FROM (
        SELECT REPLACE(REPLACE(REPLACE(' ' + COL1, ' ', ' $!'), '$! ', ''), '$!', '') AS C
        FROM @T
    ) A

Here is a recursive solution:

    DECLARE @T TABLE (COL1 NVARCHAR(2000), ID INT)
    INSERT @T VALUES ('ABC D', 100)
    INSERT @T VALUES ('have a nice day with 7 words', 100)

    ;WITH CTE AS (
        SELECT 1 words, col1 c, col1
        FROM @T
        WHERE id = 100
        UNION ALL
        SELECT words + 1, RIGHT(c, LEN(c) - PATINDEX('% [^ ]%', c)), col1
        FROM CTE
        WHERE PATINDEX('% [^ ]%', c) > 0
    )
    SELECT words, col1
    FROM CTE
    WHERE PATINDEX('% [^ ]%', c) = 0
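One caveat with the recursive approach: SQL Server stops a recursive CTE after 100 recursion levels by default, and this query recurses once per word. For longer text, a MAXRECURSION hint is needed. The sketch below assumes a made-up 150-word sample built with REPLICATE; everything else follows the query above.

```sql
-- Hypothetical sample: 150 words separated by single spaces, no trailing space.
DECLARE @T TABLE (COL1 NVARCHAR(2000), ID INT);
INSERT @T VALUES (N'word' + REPLICATE(N' word', 149), 100);

WITH CTE AS (
    -- Anchor: start the count at 1 with the full text.
    SELECT 1 AS words, COL1 AS c, COL1
    FROM @T
    WHERE ID = 100
    UNION ALL
    -- Each step cuts the text off after the next space that precedes a non-space.
    SELECT words + 1, RIGHT(c, LEN(c) - PATINDEX('% [^ ]%', c)), COL1
    FROM CTE
    WHERE PATINDEX('% [^ ]%', c) > 0
)
SELECT words
FROM CTE
WHERE PATINDEX('% [^ ]%', c) = 0
OPTION (MAXRECURSION 0);  -- 0 removes the default limit of 100 levels
```

MAXRECURSION 0 removes the limit entirely; a fixed cap (up to 32767) may be the safer choice if the input size is known.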

jak · Answer 4 · 2011-08-11T08:49:10+0000

You must declare the column using the varchar data type, for example:

    create table emp (ename varchar(22));
    insert into emp values ('amit');
    select ename, len(ename) from emp;

output: 4
