I need to find which strings from a list (2,000 to 10,000 entries) occur in thousands of text files (up to 100,000 files, from 1 KB to 100 MB each) saved in a folder, and produce a CSV that maps each string to the names of the matching text files.
I wrote code that does the job, but it takes about 8-9 hours to check 2,000 strings against about 2,000 text files (~2.5 GB in total).
It also consumes so much system memory that I sometimes have to split the 2,000 text files into smaller batches to run it at all.
The code is as follows (Python 2.7):
import os
import pandas as pd

# Example search terms; the real list has 2,000-10,000 entries.
searchlist = ["Blue Chip", "JP Morgan Global Healthcare",
              "Maximum Horizon", "1838 Large Cornerstone"]

def match(searchterm):
    global result
    filenameText = ''
    matchrateText = ''
    for i, content in enumerate(TextContent):
        matchrate = search(searchterm, content)
        if matchrate:
            filenameText += str(listoftxtfiles[i]) + ";"
            matchrateText += str(matchrate) + ";"
    result.append([searchterm, filenameText, matchrateText])

def search(searchterm, content):
    # Case-insensitive substring test: 100 for a hit, 0 otherwise.
    if searchterm.lower() in content.lower():
        return 100
    else:
        return 0

# Read every file in the folder into memory up front.
listoftxtfiles = os.listdir("Txt/")
TextContent = []
for txt in listoftxtfiles:
    with open("Txt/" + txt, 'r') as txtfile:
        TextContent.append(txtfile.read())

result = []
for i, searchterm in enumerate(searchlist):
    print("Checking for " + str(i + 1) + " of " + str(len(searchlist)))
    match(searchterm)

df = pd.DataFrame(result, columns=["String", "Filename", "Hit%"])
df.to_csv("matches.csv", index=False)  # write the matches out as CSV
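One direction I am considering for the memory problem is to invert the loops: read and lowercase each file exactly once, keep only one file in memory at a time, and test all search terms against it (in the code above, content.lower() runs once per term per file). A rough, untested sketch of what I mean, reusing the Txt/ folder and output layout from above:

import os
import pandas as pd

# Same example terms as above; the real list has 2,000-10,000 entries.
searchlist = ["Blue Chip", "JP Morgan Global Healthcare",
              "Maximum Horizon", "1838 Large Cornerstone"]
terms_lower = [(term, term.lower()) for term in searchlist]

hits = {term: [] for term in searchlist}  # term -> matching file names
for txt in os.listdir("Txt/"):
    with open("Txt/" + txt, 'r') as txtfile:
        content = txtfile.read().lower()  # only one file in memory at a time
    for term, low in terms_lower:
        if low in content:
            hits[term].append(txt)

rows = []
for term in searchlist:
    files = hits[term]
    if files:
        rows.append([term, ";".join(files) + ";",
                     ";".join(["100"] * len(files)) + ";"])
    else:
        rows.append([term, "NA", "NA"])  # mirrors the NA rows in the output

pd.DataFrame(rows, columns=["String", "Filename", "Hit%"]).to_csv(
    "matches.csv", index=False)

Each file would still be scanned once per term, so I am not sure this alone fixes the running time, but it should cap memory usage at a single file.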
Example input below.
List of strings:
["Blue Chip", "JP Morgan Global Healthcare", "Maximum Horizon", "1838 Large Cornerstone"]
Text files:
Plain text files, each containing lines separated by \n.
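For illustration only (these contents are made up), a file that matches "JP Morgan Global Healthcare" simply contains that string somewhere in its text, for example:

quarterly overview of holdings
JP Morgan Global Healthcare Fund Class A
prepared by the research team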
The expected output (CSV) is below.
String,Filename,Hit%
JP Morgan Global Healthcare,000032.txt;000031.txt;000029.txt;000015.txt;,100;100;100;100;
Blue Chip,000116.txt;000126.txt;000114.txt;,100;100;100;
1838 Large Cornerstone,NA,NA
Maximum Horizon,000116.txt;000126.txt;000114.txt;,100;100;100;
Is there a faster and less memory-hungry way to do this?