Leading spaces, trailing spaces, or two or more spaces between adjacent words are the likely causes of the wrong results you are getting.
The LTRIM() and RTRIM() functions take care of the first two issues. For the third, you can use REPLACE(ExtractedText, '  ', ' ') to replace double spaces with single spaces. Note that a single pass does not fully collapse runs of three or more spaces, so if your data may contain them you will need to repeat the replacement.
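One way to repeat the replacement is a simple loop that runs until no double space is left. This is only a sketch: the variable @s and the sample string are made up for illustration, and the final count assumes a non-empty input (an empty string would still report 1).

```sql
-- Sample input (hypothetical): leading, trailing and repeated spaces.
DECLARE @s varchar(max) = '  one   two    three ';

-- Remove leading and trailing spaces first.
SET @s = LTRIM(RTRIM(@s));

-- Keep collapsing double spaces until none remain.
WHILE CHARINDEX('  ', @s) > 0
    SET @s = REPLACE(@s, '  ', ' ');

-- Words = number of single spaces + 1.
SELECT LEN(@s) - LEN(REPLACE(@s, ' ', '')) + 1 AS WordCount;  -- returns 3
```

The loop is easy to read but works on one value at a time, so it fits a variable or a scalar function better than a set-based query over many rows.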
UPDATE
Here's a UDF that uses CTEs and ranking to collapse each run of consecutive spaces into a single group, then counts those groups and adds one to get the number of words:
CREATE FUNCTION dbo.fnCountWords (@Str varchar(max))
RETURNS int
AS
BEGIN
    DECLARE @res int;

    SET @Str = RTRIM(LTRIM(@Str));

    -- split: one row per character of the trimmed string
    WITH split AS (
        SELECT idx = number, chr = SUBSTRING(@Str, number, 1)
        FROM master..spt_values
        WHERE type = 'P' AND number BETWEEN 1 AND LEN(@Str)
    ),
    -- ranked: consecutive occurrences of the same character share one rnk
    ranked AS (
        SELECT idx, chr,
               rnk = idx - ROW_NUMBER() OVER (PARTITION BY chr ORDER BY idx)
        FROM split
    )
    -- each distinct rnk among the spaces is one gap between words
    SELECT @res = COUNT(DISTINCT rnk) + 1
    FROM ranked
    WHERE chr = ' ';

    RETURN @res;
END
Using this function, your query will look like this (scalar UDFs must be called with a schema-qualified name; dbo is assumed here):
SELECT dbo.fnCountWords(ExtractedText) FROM EDDSDBO.Document WHERE ID='100'
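To see why counting distinct rnk values works, it may help to run the two CTEs on their own against a small string. The sample value 'a  b c' is made up for illustration; the rest mirrors the logic of the function.

```sql
-- Hypothetical sample: two spaces after 'a', one space after 'b'.
DECLARE @Str varchar(max) = 'a  b c';

WITH split AS (
    -- One row per character position.
    SELECT idx = number, chr = SUBSTRING(@Str, number, 1)
    FROM master..spt_values
    WHERE type = 'P' AND number BETWEEN 1 AND LEN(@Str)
),
ranked AS (
    -- Position minus per-character row number stays constant
    -- across a run of consecutive identical characters.
    SELECT idx, chr,
           rnk = idx - ROW_NUMBER() OVER (PARTITION BY chr ORDER BY idx)
    FROM split
)
SELECT idx, rnk
FROM ranked
WHERE chr = ' '
ORDER BY idx;
-- idx  rnk
--   2    1
--   3    1
--   5    2
```

The spaces at positions 2 and 3 are adjacent, so they share rnk = 1, while the space at position 5 gets rnk = 2. Two distinct values mean two gaps, hence three words.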
UPDATE 2
The function uses one of the system tables, master..spt_values, as a numbers (tally) table. The particular subset used (type = 'P') contains only the values from 0 to 2047. This means that the function will not work correctly for inputs longer than 2047 characters (after trimming the leading and trailing spaces), as @t-clausen.dk correctly noted in his comment. Therefore, a custom tally table should be used if longer input strings are possible.
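A custom tally table could be built along these lines. This is a sketch under assumptions: the name dbo.Tally and the upper bound of 1,000,000 are arbitrary choices, and the VALUES row constructor requires SQL Server 2008 or later.

```sql
-- Hypothetical tally table: one row per integer from 1 to 1,000,000.
CREATE TABLE dbo.Tally (number int NOT NULL PRIMARY KEY);

WITH digits AS (
    SELECT n FROM (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) AS v(n)
)
INSERT INTO dbo.Tally (number)
SELECT 1 + d1.n + 10 * d2.n + 100 * d3.n
         + 1000 * d4.n + 10000 * d5.n + 100000 * d6.n
FROM digits d1
CROSS JOIN digits d2
CROSS JOIN digits d3
CROSS JOIN digits d4
CROSS JOIN digits d5
CROSS JOIN digits d6;

-- The split step can then read from dbo.Tally instead of master..spt_values.
DECLARE @Str varchar(max) = 'one  two three';

SELECT idx = number, chr = SUBSTRING(@Str, number, 1)
FROM dbo.Tally
WHERE number BETWEEN 1 AND LEN(@Str);
```

With this table the same limitation still exists, only at 1,000,000 characters instead of 2047, so the upper bound should be sized to the longest input you expect.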