I need to find which strings from a list (2,000 to 10,000 entries) occur in thousands of text files (up to 100,000 files, from 1 KB to 100 MB each) saved in a folder, and produce a CSV that maps each string to the names of the matching text files.
I wrote code that does the job, but it takes about 8-9 hours to check 2,000 strings against about 2,000 text files (~2.5 GB in total).
It also consumes so much system memory that I sometimes have to split the 2,000 text files into smaller batches to run it at all.
The code is as follows (Python 2.7):
import os
import pandas as pd

# Example search terms; the real list has 2,000-10,000 entries.
searchlist = ["Blue Chip", "JP Morgan Global Healthcare",
              "Maximum Horizon", "1838 Large Cornerstone"]

def match(searchterm):
    global result
    filenameText = ''
    matchrateText = ''
    for i, content in enumerate(TextContent):
        matchrate = search(searchterm, content)
        if matchrate:
            filenameText += str(listoftxtfiles[i]) + ";"
            matchrateText += str(matchrate) + ";"
    result.append([searchterm, filenameText, matchrateText])

def search(searchterm, content):
    # Case-insensitive substring test: 100 for a hit, 0 otherwise.
    if searchterm.lower() in content.lower():
        return 100
    else:
        return 0

# Read every file in the folder into memory up front.
listoftxtfiles = os.listdir("Txt/")
TextContent = []
for txt in listoftxtfiles:
    with open("Txt/" + txt, 'r') as txtfile:
        TextContent.append(txtfile.read())

result = []
for i, searchterm in enumerate(searchlist):
    print("Checking for " + str(i + 1) + " of " + str(len(searchlist)))
    match(searchterm)

df = pd.DataFrame(result, columns=["String", "Filename", "Hit%"])
df.to_csv("matches.csv", index=False)  # write the matches out as CSV
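One direction I am considering for the memory problem is to invert the loops: read and lowercase each file exactly once, keep only one file in memory at a time, and test all search terms against it (in the code above, content.lower() runs once per term per file). A rough, untested sketch of what I mean, reusing the Txt/ folder and output layout from above:

import os
import pandas as pd

# Same example terms as above; the real list has 2,000-10,000 entries.
searchlist = ["Blue Chip", "JP Morgan Global Healthcare",
              "Maximum Horizon", "1838 Large Cornerstone"]
terms_lower = [(term, term.lower()) for term in searchlist]

hits = {term: [] for term in searchlist}  # term -> matching file names
for txt in os.listdir("Txt/"):
    with open("Txt/" + txt, 'r') as txtfile:
        content = txtfile.read().lower()  # only one file in memory at a time
    for term, low in terms_lower:
        if low in content:
            hits[term].append(txt)

rows = []
for term in searchlist:
    files = hits[term]
    if files:
        rows.append([term, ";".join(files) + ";",
                     ";".join(["100"] * len(files)) + ";"])
    else:
        rows.append([term, "NA", "NA"])  # mirrors the NA rows in the output

pd.DataFrame(rows, columns=["String", "Filename", "Hit%"]).to_csv(
    "matches.csv", index=False)

Each file would still be scanned once per term, so I am not sure this alone fixes the running time, but it should cap memory usage at a single file.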
Example input below.
List of strings:
["Blue Chip", "JP Morgan Global Healthcare", "Maximum Horizon", "1838 Large Cornerstone"]
Text files:
Plain text files, each containing lines separated by \n.
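For illustration only (these contents are made up), a file that matches "JP Morgan Global Healthcare" simply contains that string somewhere in its text, for example:

quarterly overview of holdings
JP Morgan Global Healthcare Fund Class A
prepared by the research team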
The expected output (CSV) is below.
String,Filename,Hit%
JP Morgan Global Healthcare,000032.txt;000031.txt;000029.txt;000015.txt;,100;100;100;100;
Blue Chip,000116.txt;000126.txt;000114.txt;,100;100;100;
1838 Large Cornerstone,NA,NA
Maximum Horizon,000116.txt;000126.txt;000114.txt;,100;100;100;
Is there a faster and less memory-hungry way to do this?