I am new to this forum, so I apologize if this is a very long question.
I am trying to build a general keyword parser that accepts a list of keywords and a list of text strings (which could come from a database or a free-format text file). From each text string I want to extract the keywords it mentions, so that I can generate three key outputs:
- the keyword mentioned,
- the text string in which this keyword was mentioned, and
- the number of times this keyword was mentioned in that text string.
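To make the target concrete, here is a minimal sketch of one such output record. The names `KeywordHit` and `count_mentions` are hypothetical, and the keyword is assumed to be a literal string rather than a regular expression.

```python
import re
from typing import NamedTuple, Optional


class KeywordHit(NamedTuple):
    """One output row (hypothetical name): keyword, source text, and count."""
    keyword: str
    text: str
    count: int


def count_mentions(keyword: str, text: str) -> Optional[KeywordHit]:
    # Assumption: the keyword is a literal string, so it is escaped first.
    matches = re.findall(re.escape(keyword), text)
    if not matches:
        return None
    return KeywordHit(keyword, text, len(matches))


hit = count_mentions("apple", "apple pie and apple juice")
print(hit)
```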
Below is the Python code I wrote for this. As you can see, I am trying to accomplish this in three stages:
Stage 1 - Apply the reject pattern so that I can remove all known unwanted lines from the list of text lines.
Stage 2 (pass 1) - Do an index-style lookup of the keywords to narrow down the list of lines that need the full search loop.
Stage 3 (pass 2) - Run the full search loop.
Problem: Stage 3 (pass 2 in the code) is extremely inefficient. For example, with a keyword list of 4,500 items and almost 2 million text lines, the code runs for more than 24 hours.
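A rough count shows where the time goes. The sketch below assumes the worst case, in which every one of the roughly 2 million lines reaches pass 2, and uses the figures quoted above.

```python
# Rough cost estimate for pass 2, assuming every text line reaches it
# (worst case) and using the figures quoted in the question.
num_keywords = 4_500
num_lines = 2_000_000

# Pass 2 calls re.findall once per (line, keyword) pair.
regex_calls = num_keywords * num_lines
print(f"{regex_calls:,} regex calls")  # 9,000,000,000 regex calls

# Sustained rate that would be needed to finish within 24 hours.
seconds_per_day = 24 * 60 * 60
print(f"{regex_calls / seconds_per_day:,.0f} calls per second")
```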
Can anyone suggest a better way to do pass 2, or is there a better way to write the whole function?
I'm a Python beginner, so if I have missed something obvious, I apologize in advance.
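import re


def keywordParser(keywordIDsList, keywordDict, valueIDsList, valuesDict, rejectPattern):
    parsedResultsDict = {}
    parsedHashIDsList = []
    successResultIDsList = []
    rejectResultIDsList = []
    processListPass1 = []
    processListPass2 = []

    # Index the keywords by their text for exact-match lookups in pass 1.
    idxkeyWordDict = {}
    for keyID in keywordIDsList:
        keywordID, keyWord = keywordDict[keyID]
        idxkeyWordDict[keyWord] = (keywordID, keyWord)

    # Stage 1: drop lines that match the reject pattern.
    # (Assumption: rejectPattern is a regular expression; this step was
    # missing from the posted code, which left processListPass1 empty.)
    for valueID in valueIDsList:
        valKey, valText = valuesDict[valueID]
        if not re.search(rejectPattern, valText):
            processListPass1.append(valueID)

    # Stage 2 (pass 1): lines that are not an exact keyword match go to pass 2.
    percCount = 1
    for valueID in processListPass1:
        valKey, valText = valuesDict[valueID]
        try:
            keywordID, keyWordVal = idxkeyWordDict[valText]
        except KeyError:
            processListPass2.append(valueID)

    # Stage 3 (pass 2): search every remaining line for every keyword.
    percCount = 0
    for valueID in processListPass2:
        percCount += 1
        valKey, valText = valuesDict[valueID]
        valSuccess = 'N'
        for keyID in keywordIDsList:
            # Same (ID, text) order as when the index was built above.
            keywordID, keyWordVal = keywordDict[keyID]
            keySearch = re.findall(keyWordVal, valText, re.DOTALL)
            if keySearch:
                parsedHashID = hash(str(valueID) + str(keyID))
                parsedHashIDsList.append(parsedHashID)
                parsedResultsDict[parsedHashID] = (valueID, keywordID, len(keySearch))
                valSuccess = 'Y'
        if valSuccess == 'Y':
            successResultIDsList.append(valueID)
        else:
            rejectResultIDsList.append(valueID)

    return (parsedResultsDict, parsedHashIDsList, successResultIDsList, rejectResultIDsList)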
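One possible way to speed up pass 2 is to compile all keywords into a single alternation pattern and scan each text line once, instead of calling `re.findall` once per keyword per line. The sketch below is an illustration under assumptions, not a drop-in replacement: the names `build_keyword_pattern` and `count_keywords` are hypothetical, keywords are treated as literal strings (escaped with `re.escape`) whereas the code above treats them as regular expressions, and matches are non-overlapping, so a keyword contained inside a longer matched keyword is not counted separately.

```python
import re
from collections import Counter


def build_keyword_pattern(keywords):
    """Compile all keywords into one alternation pattern (hypothetical helper)."""
    # Longest first, so that "new york" is tried before "new".
    ordered = sorted(set(keywords), key=len, reverse=True)
    return re.compile("|".join(re.escape(k) for k in ordered))


def count_keywords(pattern, texts):
    """Return {(text_id, keyword): count} for every keyword found in texts."""
    results = {}
    for text_id, text in texts.items():
        # A single scan of the text finds all keywords at once.
        counts = Counter(m.group(0) for m in pattern.finditer(text))
        for keyword, count in counts.items():
            results[(text_id, keyword)] = count
    return results


keywords = ["apple", "new york", "new"]
texts = {1: "apple pie in new york, a new apple", 2: "nothing here"}
pattern = build_keyword_pattern(keywords)
print(count_keywords(pattern, texts))
```

Keying the results by the `(text_id, keyword)` tuple also avoids building a hash from concatenated ID strings, where for instance IDs 1 and 23 would produce the same string as IDs 12 and 3.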