Sorry for the question. I included a test script to demonstrate the situation and my best attempt at a solution.
There are two tables:
test_WORDS= Words extracted from several sources. A column OBJ_FKis a source identifier. WORD_IDis the identifier of the word itself, the only one in the source. Each line contains one word.test_PHRASE= list of phrases to search in test_WORDS. A column PHRASE_TEXTis a fragment separated by a space, for example, "foo bar" (see below), so each row contains several words.
Requirements:
Return the first word from test_WORDS, which is the start of matching phrases from test_PHRASE.
I would rather have something set to avoid the RBAR approach below. Also my decision is limited to 5 phrases. I need to support up to 20 phrases. Is it possible to combine words from a string in test_PHRASEwith continuous lines in test_WORDwithout cursors?
After breaking phrase words into a temporary table, the problem boils down to the coincidence of the parts of the two sets together in row order.
CREATE TABLE [dbo].[test_WORDS](
[OBJ_FK] [bigint] NOT NULL,
[WORD_ID] [int] NOT NULL,
[WORD_TEXT] [nvarchar](50) NOT NULL,
CONSTRAINT [PK_test_WORDS] PRIMARY KEY CLUSTERED
(
[OBJ_FK] ASC,
[WORD_ID] ASC
)
) ON [PRIMARY]
GO
CREATE TABLE [dbo].[test_PHRASE](
[ID] [int],
[PHRASE_TEXT] [nvarchar](150) NOT NULL
CONSTRAINT [PK_test_PHRASE] PRIMARY KEY CLUSTERED
(
[ID] ASC
)
)
GO
INSERT INTO dbo.test_WORDS
SELECT 1,1,'aaa' UNION ALL
SELECT 1,2,'bbb' UNION ALL
SELECT 1,3,'ccc' UNION ALL
SELECT 1,4,'ddd' UNION ALL
SELECT 1,5,'eee' UNION ALL
SELECT 1,6,'fff' UNION ALL
SELECT 1,7,'ggg' UNION ALL
SELECT 1,8,'hhh' UNION ALL
SELECT 2,1,'zzz' UNION ALL
SELECT 2,2,'yyy' UNION ALL
SELECT 2,3,'xxx' UNION ALL
SELECT 2,4,'www'
INSERT INTO dbo.test_PHRASE
SELECT 1, 'bbb ccc ddd' UNION ALL
SELECT 2, 'ddd eee fff' UNION ALL
SELECT 3, 'xxx xxx xxx' UNION ALL
SELECT 4, 'zzz yyy xxx' UNION ALL
SELECT 5, 'xxx www ppp' UNION ALL
SELECT 6, 'zzz yyy xxx www'
DECLARE @maxRow AS INTEGER
DECLARE @currentRow AS INTEGER
DECLARE @phraseSubsetTable AS TABLE(
[ROW] int IDENTITY(1,1) NOT NULL,
[ID] int NOT NULL,
[PHRASE_TEXT] nvarchar(150) NOT NULL
)
DECLARE @WordList table
(
ID int,
WORD nvarchar(50)
)
DECLARE @returnTable AS TABLE(
OBJECT_FK INT NOT NULL,
WORD_ID INT NOT NULL,
PHRASE_ID INT NOT NULL
)
DECLARE @phrase AS NVARCHAR(150)
DECLARE @phraseID AS INTEGER
INSERT INTO @phraseSubsetTable
SELECT ID, PHRASE_TEXT
FROM dbo.test_PHRASE
WHERE ID IN (2,3,4)
SET @maxRow = @@ROWCOUNT
SET @currentRow = 1
WHILE @currentRow <= @maxRow
BEGIN
SELECT @phrase=PHRASE_TEXT, @phraseID=ID FROM @phraseSubsetTable WHERE row = @currentRow
DELETE FROM @WordList
;WITH Pieces(pn, start, stop) AS (
SELECT 1, 1, CHARINDEX(' ', @phrase)
UNION ALL
SELECT pn + 1, stop + 1, CHARINDEX(' ', @phrase, stop + 1)
FROM Pieces
WHERE stop > 0)
insert into @WordList
SELECT pn,
SUBSTRING(@phrase, start, CASE WHEN stop > 0 THEN stop-start ELSE 1056 END) AS WORD
FROM Pieces
DECLARE @wordCt as int
select @wordCt=count(ID) from @WordList;
;WITH WordOrder_CTE AS (
SELECT OBJ_FK, WORD_ID, WORD_TEXT,
ROW_NUMBER() OVER (Partition BY OBJ_FK ORDER BY WORD_ID) AS rownum
FROM test_WORDS)
INSERT INTO @returnTable
SELECT r1.OBJ_FK, r1.WORD_ID, @phraseID AS PHRASE_ID
FROM WordOrder_CTE r1
INNER JOIN @WordList w1 ON r1.WORD_TEXT = w1.WORD and w1.ID=1
LEFT JOIN WordOrder_CTE r2
ON r1.rownum = r2.rownum - 1 and r1.OBJ_FK = r2.OBJ_FK
LEFT JOIN @WordList w2 ON r2.WORD_TEXT = w2.WORD and w2.ID=2
LEFT JOIN WordOrder_CTE r3
ON r1.rownum = r3.rownum - 2 and r1.OBJ_FK = r3.OBJ_FK
LEFT JOIN @WordList w3 ON r3.WORD_TEXT = w3.WORD and w3.ID=3
LEFT JOIN WordOrder_CTE r4
ON r1.rownum = r4.rownum - 3 and r1.OBJ_FK = r4.OBJ_FK
LEFT JOIN @WordList w4 ON r4.WORD_TEXT = w4.WORD and w4.ID=4
LEFT JOIN WordOrder_CTE r5
ON r1.rownum = r5.rownum - 4 and r1.OBJ_FK = r5.OBJ_FK
LEFT JOIN @WordList w5 ON r5.WORD_TEXT = w5.WORD and w5.ID=5
WHERE (@wordCt < 2 OR w2.ID is not null) and
(@wordCt < 3 OR w3.ID is not null) and
(@wordCt < 4 OR w4.ID is not null) and
(@wordCt < 5 OR w5.ID is not null)
SET @currentRow = @currentRow+1
END
SELECT OBJECT_FK, WORD_ID, PHRASE_ID FROM @returnTable
GO
DROP TABLE [dbo].[test_WORDS]
DROP TABLE [dbo].[test_PHRASE]
Edited solution:
This is editing the correct solution below to account for non-contiguous word identifiers. Hope this helps someone as much as me.
;WITH
numberedwords AS (
SELECT
OBJ_FK,
WORD_ID,
WORD_TEXT,
rowcnt = ROW_NUMBER() OVER
(PARTITION BY OBJ_FK ORDER BY WORD_ID DESC),
totalInSrc = COUNT(WORD_ID) OVER (PARTITION BY OBJ_FK)
FROM dbo.test_WORDS
),
phrasedwords AS (
SELECT
nw1.OBJ_FK,
nw1.WORD_ID,
nw1.WORD_TEXT,
PHRASE_TEXT = RTRIM((
SELECT [text()] = nw2.WORD_TEXT + ' '
FROM numberedwords nw2
WHERE nw1.OBJ_FK = nw2.OBJ_FK
AND nw2.rowcnt BETWEEN nw1.rowcnt AND nw1.totalInSrc
ORDER BY nw2.OBJ_FK, nw2.WORD_ID
FOR XML PATH ('')
))
FROM numberedwords nw1
GROUP BY nw1.OBJ_FK, nw1.WORD_ID, nw1.WORD_TEXT, nw1.rowcnt, nw1.totalInSrc
)
SELECT *
FROM phrasedwords pw
INNER JOIN test_PHRASE tp
ON LEFT(pw.PHRASE_TEXT, LEN(tp.PHRASE_TEXT)) = tp.PHRASE_TEXT
ORDER BY pw.OBJ_FK, pw.WORD_ID
Note. The last query I used in production uses indexed temporary tables instead of CTEs. I also limited the length of the PHRASE_TEXT column to my needs. Thanks to these improvements, I was able to reduce the request time from 3 minutes to 3 seconds!