Finding list items in another, longer list in Python

I am new to this forum, so I apologize if this is a very long question.

I am trying to create a general keyword parser that accepts a list of keywords and a list of text strings (which could be generated from a database or a free-format text file). The goal is to extract entities from the text strings based on the keywords so that I can generate three key outputs:

  • the keyword mentioned,
  • the text string in which this keyword was mentioned, and
  • the number of times this keyword was mentioned in that text string.
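
To make the three outputs concrete, here is a minimal sketch on made-up data. The function name `count_keyword_mentions` and the sample strings are hypothetical, and it counts plain substrings with `str.count` rather than using regular expressions. It only illustrates the shape of the result, not a fast way to compute it.

```python
def count_keyword_mentions(keywords, texts):
    """Return (keyword, text, count) for every keyword found in every text.

    Hypothetical helper for illustration; counts non-overlapping substrings.
    """
    results = []
    for text in texts:
        for keyword in keywords:
            count = text.count(keyword)  # non-overlapping occurrences
            if count:
                results.append((keyword, text, count))
    return results


keywords = ["apple", "pear"]
texts = ["apple pie with apple sauce", "no fruit here", "pear and apple"]
for row in count_keyword_mentions(keywords, texts):
    print(row)
```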

Below is an example of the Python code I wrote for this. As you can see, I am trying to accomplish this in three stages:

Stage 1 - Apply the reject pattern so that I can remove all known unwanted lines from the list of text lines.

Stage 2 (level 1 parsing, pass 1 in the code) - Do an index-based lookup of keywords to narrow down the list of lines that need a full search loop.

Stage 3 (pass 2 in the code) - Run the full search loop.
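
The first two stages can be sketched roughly as follows. This is a simplified illustration that works on plain strings instead of the ID lists and dicts used in the code below; the name `prefilter` and its parameters are assumptions made for the example.

```python
import re


def prefilter(texts, keywords, reject_pattern=None):
    """Split texts into rejected lines, exact keyword matches, and the rest.

    Hypothetical helper: `texts` is a list of strings, `keywords` an iterable
    of keyword strings, `reject_pattern` an optional regular expression.
    """
    reject_re = re.compile(reject_pattern) if reject_pattern else None
    keyword_set = set(keywords)  # constant-time membership test per line

    rejected, exact, remaining = [], [], []
    for text in texts:
        if reject_re is not None and reject_re.search(text):
            rejected.append(text)    # stage 1: known unwanted line
        elif text in keyword_set:
            exact.append(text)       # stage 2: the whole line is a keyword
        else:
            remaining.append(text)   # stage 3 still has to scan this line
    return rejected, exact, remaining


print(prefilter(["# comment", "apple", "apple pie"], ["apple"], r"^#"))
```

Only the lines in `remaining` need the expensive full search, which is why the first two stages are worth running first.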

Problem: Stage 3 (pass 2 in the code) is extremely inefficient. For example, with a keyword list of 4,500 elements and almost 2 million text lines, the code runs for more than 24 hours. If every line reaches pass 2, that is about 4,500 × 2,000,000 = 9 billion re.findall calls.

Can anyone suggest a better way to do pass 2, or a better way to write the whole function?

I'm a Python beginner, so if I missed something obvious, I apologize in advance.

##########################################################################################
# The keyWord parser conducts a 2 pass keyword lookup and parsing.
# Inputs:
#  keywordIDsList - Is a list of the IDs of the keyword (Standard declaration: keywordIDsList[]= Hash value of the keyWords)
#  KeywordDict - is the Dict of all the keywords and the associated ID.
#          (Standard declaration: keywordDict[keywordID]=(keywordID, keyWord) where keywordID is hash value in keywordIDsList)
#  valueIDsList - Is a list of the IDs of all the values that need to be parsed (Standard declaration: valueIDsList[]= Unique reference number of the values)
#  valuesDict - Is the Dict of all the value lines and the associated IDs.
#          (Standard declaration: valuesDict[uniqueValueKey]=(uniqueValueKey, valueText) where uniqueValueKey is the unique key in valueIDsList)
#  rejectPattern - A regular expression based pattern for rejecting columns with certain types of patterns. This is an optional field.
# Outputs:
#  parsedHashIDsList - Is a list of the hash values generated for every successful parse result
#  parsedResultsDict - Is the actual parsed value as parsedResultsDict[parsedHashID]=(uniqueValueKey, keywordID, frequencyResult)
#  successResultIDsList - list of all unique value references that were parsed successfully
#  rejectResultIDsList - list of all unique value references that were rejected
##########################################################################################

import re

def keywordParser(keywordIDsList, keywordDict, valueIDsList, valuesDict, rejectPattern):
    parsedResultsDict = {}
    parsedHashIDsList = []
    successResultIDsList = []
    rejectResultIDsList = []
    processListPass1 = []
    processListPass2 = []
    idxkeyWordDict = {}

    for keyID in keywordIDsList:
        keywordID, keyWord = keywordDict[keyID]
        idxkeyWordDict[keyWord] = (keywordID, keyWord)

    percCount = 1
    #    optional: if rejectPattern is provided then reject lines
    # ## Some python code for processing the reject patterns - this works fine

    #    Pass 1: Index based matching - partial code for index based search
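    for valueID in processListPass1:
        valKey, valText = valuesDict[valueID]
        try:
            # idxkeyWordDict values are (keywordID, keyWord) tuples
            keywordID, keyWordVal = idxkeyWordDict[valText]
        except KeyError:
            # no exact match; defer this line to the full search in pass 2
            processListPass2.append(valueID)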

    percCount = 0

    #   Pass 2: Text based search and lookup - this part of the code is extremely inefficient

    for valueID in processListPass2:
        percCount += 1
        valKey, valText = valuesDict[valueID]
        valSuccess = 'N'
        for keyID in keywordIDsList:
            # keywordDict values are (keywordID, keyWord) tuples
            keywordID, keyWordVal = keywordDict[keyID]
            # the keyword is used as a regular expression pattern here
            keySearch = re.findall(keyWordVal, valText, re.DOTALL)
            if keySearch:
                parsedHashID = hash(str(valueID) + str(keyID))
                parsedHashIDsList.append(parsedHashID)
                parsedResultsDict[parsedHashID] = (valueID, keywordID, len(keySearch))
                valSuccess = 'Y'
        if valSuccess == 'Y':
            successResultIDsList.append(valueID)
        else:
            rejectResultIDsList.append(valueID)

    return (parsedResultsDict, parsedHashIDsList, successResultIDsList, rejectResultIDsList)

Look at the Aho-Corasick algorithm. Python implementations of it exist.
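
As a rough illustration of the idea, here is a minimal pure-Python sketch of an Aho-Corasick automaton. It scans each text line once, no matter how many keywords there are, instead of running one search per keyword. The function names `build_automaton` and `count_keywords` are made up for this example. It also treats keywords as literal strings and counts overlapping occurrences, so the counts can differ from what `re.findall` returns for regular expression patterns. A packaged implementation would likely be faster than this sketch.

```python
from collections import deque


def build_automaton(keywords):
    """Build a simple Aho-Corasick automaton as three parallel lists."""
    goto = [{}]     # goto[state][char] -> next state
    fail = [0]      # fail[state] -> state to fall back to on a mismatch
    output = [[]]   # output[state] -> keywords that end at this state

    # Phase 1: insert every non-empty keyword into a trie.
    for keyword in keywords:
        if not keyword:
            continue
        state = 0
        for char in keyword:
            if char not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][char] = len(goto) - 1
            state = goto[state][char]
        if keyword not in output[state]:
            output[state].append(keyword)

    # Phase 2: breadth-first pass to fill in the failure links.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for char, nxt in goto[state].items():
            queue.append(nxt)
            fallback = fail[state]
            while fallback and char not in goto[fallback]:
                fallback = fail[fallback]
            fail[nxt] = goto[fallback].get(char, 0)
            # Keywords ending at the fallback state also end here.
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def count_keywords(text, automaton):
    """Return {keyword: occurrences} for one text in a single pass."""
    goto, fail, output = automaton
    counts = {}
    state = 0
    for char in text:
        while state and char not in goto[state]:
            state = fail[state]
        state = goto[state].get(char, 0)
        for keyword in output[state]:
            counts[keyword] = counts.get(keyword, 0) + 1
    return counts


automaton = build_automaton(["apple", "pie", "pear"])  # build once
for line in ["apple pie with apple sauce", "no fruit here"]:
    print(line, count_keywords(line, automaton))  # reuse for every line
```

The automaton is built once from the keyword list and then reused for every text line, so the per-line cost depends on the line length and the number of matches rather than on the number of keywords.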
