How can I efficiently search for multiple strings in many files at once?

Question

Hi, I will post my code first and then explain my purpose:

for eachcsv in matches:
    with open(eachcsv, 'r') as f:
        lines = f.readlines()
        for entry in rs:
            for line in lines:
                if entry in line:
                    print("found %s in %s" % (entry, eachcsv))

So, "matches" holds a list of csv files (the paths to them). I open each csv file and load it into memory using readlines(). "rs" is a list of unique identifiers. For each item in the rs list, I need to search every line of the csv file and print a message each time I find the identifier in the file (I will check later whether the line also contains another fixed word).

The code above works for my purpose, but I don't know why it takes more than 10 minutes to process a 400k-line file. I need to do this for thousands of files, so at this speed I cannot finish the task. It seems to me that the slow part is the testing step, but I'm not sure.

Please note that I am using Python because it is what I am most comfortable with. If there is a solution to my problem using other tools, I am fine with that.

EDIT: I will try to post some examples

"rs" list:
rs12334435
rs3244567
rs897686
....

files

# header data not needed
# data
# data
# data
# data
# data [...]
#COLUMN1    COLUMN2               COLUMN3   ...
data        rs7854.rs165463       dataSS=1(random_data)
data        rs465465data          datadata
data        rs798436              dataSS=1  
data        datars45648           dataSS=1

The ultimate goal is to count how many times each rs appears in each file, and to mark in the output whether column 3 contains SS=1. Something like:

found rs12345 SS yes file 3 folder /root/foobar/file
found rs74565 SS no file 3 folder /root/foobar/file
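
The goal described above (a count per identifier, plus a flag for SS=1) could be sketched roughly as follows. This is only an illustration under assumptions: the helper name count_ids is made up, header lines are assumed to start with "#", and the SS=1 check looks at the whole line rather than strictly at column 3.

```python
import re
from collections import Counter

def count_ids(lines, ids):
    """Count (identifier, has_ss) pairs found in an iterable of lines."""
    # Longest identifiers first, so "rs1234" is tried before its prefix "rs123".
    ordered = sorted(ids, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(i) for i in ordered))
    counts = Counter()
    for line in lines:
        if line.startswith("#"):
            continue  # assumption: header and comment lines start with '#'
        has_ss = "SS=1" in line  # assumption: the flag appears literally as SS=1
        for found in pattern.findall(line):
            counts[(found, has_ss)] += 1
    return counts

sample = [
    "# header data not needed",
    "data        rs7854.rs165463       dataSS=1(random_data)",
    "data        rs798436              dataSS=1",
    "data        rs798436              datadata",
]
counts = count_ids(sample, ["rs7854", "rs798436", "rs165463"])
for (rs_id, has_ss), n in sorted(counts.items()):
    print("found %s SS %s count %d" % (rs_id, "yes" if has_ss else "no", n))
```

Note that this matches substrings, exactly like the "in" test in the question, so "rs45648" would also be found inside "datars45648". If only whole identifiers should count, the pattern would need boundaries added around it.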

+4

python shell awk csv

Bluestarry Apr 28 '16 at 18:54

2 answers

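Reading every line is the expensive part, so it should be the outer loop.

Also, do not read the whole file at once with readlines; read it one line at a time instead, which needs far less memory.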

for eachcsv in matches:
    with open(eachcsv, 'r') as f:
        for line in f:
            for entry in rs:
                if entry in line:
                    print("found %s in %s" % (entry, eachcsv))

https://docs.python.org/2/tutorial/inputoutput.html

Also, depending on your needs, you could process each csv file in its own thread. https://docs.python.org/2/library/threading.html

import threading

def findRSinCSVFile(csvFile, rs):
    # Scan one csv file and report every identifier found on each line.
    with open(csvFile, 'r') as f:
        for line in f:
            for entry in rs:
                if entry in line:
                    print("found %s in %s" % (entry, csvFile))

# One thread per csv file; matches is the list of csv paths from the question.
threads = []
for csvFile in matches:
    threads.append(threading.Thread(target=findRSinCSVFile, args=(csvFile, rs)))

for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
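
With thousands of files, starting one thread per file may be more than the machine handles comfortably. A bounded pool is one alternative; the sketch below uses concurrent.futures from the Python 3 standard library. The names find_rs_in_file and search_files and the max_workers value are illustrative assumptions, not part of the answer above. Because CPython threads share the interpreter lock, any speedup for this kind of string matching is not guaranteed and should be measured.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def find_rs_in_file(path, rs):
    """Return (identifier, path) pairs, one per line that contains an identifier."""
    hits = []
    with open(path, "r") as f:
        for line in f:
            for entry in rs:
                if entry in line:
                    hits.append((entry, path))
    return hits

def search_files(paths, rs, max_workers=8):
    # A bounded pool avoids starting one thread per file when there are thousands.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda p: find_rs_in_file(p, rs), paths)
        # map() yields results in the same order as paths.
        return [hit for hits in results for hit in hits]

# Tiny demonstration on two temporary files.
tmpdir = tempfile.mkdtemp()
paths = []
for name, text in (("a.csv", "data rs1 x\ndata rs2 y\n"), ("b.csv", "data rs2 z\n")):
    path = os.path.join(tmpdir, name)
    with open(path, "w") as f:
        f.write(text)
    paths.append(path)

all_hits = search_files(paths, ["rs1", "rs2"])
for entry, path in all_hits:
    print("found %s in %s" % (entry, os.path.basename(path)))
```

Returning the hits and printing them from the main thread keeps the output from different files from interleaving.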

0

sverasch 28 . '16 19:35

James Youngman · Accepted Answer · 2016-05-02T07:52:33+0000

The program below times the code from the question against a regex-based alternative:

import re
import random
import sys
import time

# get_patterns makes up some test data.
def get_patterns():
    rng = random.Random(1)  # fixed seed, for reproducibility
    n = 300
    # Generate up to n unique integers between 60k and 80k, as strings.
    # range (rather than xrange) keeps this runnable on Python 2 and 3.
    return list(set(str(rng.randint(60000, 80000)) for _ in range(n)))

def original(rs, matches):
    for eachcsv in matches:
        with open(eachcsv, 'r') as f:
            lines = f.readlines()
            for entry in rs:
                for line in lines:
                    if entry in line:
                        print("found %s in %s" % (entry, eachcsv))

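def mine(rs, matches):
    # Compile one regex that matches any of the identifiers.
    my_rx = re.compile(build_regex(rs))
    for eachcsv in matches:
        with open(eachcsv, 'r') as f:
            body = f.read()
            # Use a separate name so the matches argument is not shadowed.
            found = my_rx.findall(body)
            for match in found:
                print("found %s in %s" % (match, eachcsv))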

def build_regex(literal_patterns):
    return "|".join([re.escape(pat) for pat in literal_patterns])

def print_elapsed_time(label, callable, args):
    t1 = time.time()
    callable(*args)
    t2 = time.time()
    elapsed_ms = (t2 - t1) * 1000
    print("%8s: %9.1f milliseconds" % (label, elapsed_ms))


def main(args):
    rs = get_patterns()
    filenames = args[1:]
    for function_name_and_function in (('original', original), ('mine', mine)):
        name, func = function_name_and_function
        print_elapsed_time(name, func, [rs, filenames])
    return 0

if __name__ == '__main__':
    sys.exit(main(sys.argv))

original, - mine. 300 400 . 30- . . , 3% ( , , ).

: . , - , :

~/source/stackoverflow/36923237$ python search.py example.csv
found green fox in example.csv
original:    9218.0 milliseconds
found green fox in example.csv
    mine:     600.4 milliseconds

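Suppose you are searching for foobar and umspquux. The original code scans each line once for foobar and then again for umspquux.

A regex search can look for both at once. A match has to begin with "f" or "u", and the second character then has to be "o" or "m" respectively, so most positions in the text are rejected almost immediately.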

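A regular expression is a pattern that describes text. The regex foobar matches the text "foobar". Alternatives are separated with '|', so foobar|umspquux matches either "foobar" or "umspquux". To match a literal '|', escape it with '\'.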

This is what build_regex does: it turns ['foobar', 'umspquux'] into foobar|umspquux.
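
To see the pattern produced for the two example words, here is a short sketch that reuses the build_regex helper from the program above; the sample sentence is made up for illustration.

```python
import re

def build_regex(literal_patterns):
    # Same helper as in the answer: escape each literal, then join with '|'.
    return "|".join([re.escape(pat) for pat in literal_patterns])

regex_text = build_regex(["foobar", "umspquux"])
print(regex_text)  # foobar|umspquux

rx = re.compile(regex_text)
found = rx.findall("a foobar, then umspquux, then foobar again")
print(found)  # ['foobar', 'umspquux', 'foobar']
```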

The re.escape call in build_regex ensures that any special characters in the patterns (such as '|') are escaped (as '\|'), so that they match literally.
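
A small sketch of what the escaping changes, using a made-up literal that contains '|':

```python
import re

literal = "a|b"
escaped = re.escape(literal)
print(escaped)  # a\|b

# Unescaped, '|' means "a or b"; escaped, only the literal text "a|b" matches.
unescaped_hits = re.findall(literal, "a b a|b")
escaped_hits = re.findall(escaped, "a b a|b")
print(unescaped_hits)  # ['a', 'b', 'a', 'b']
print(escaped_hits)    # ['a|b']
```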

findall returns all the matches found in the file body.
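
The findall call in mine returns the matched strings only, without saying where in the file each one was found. If the surrounding line is needed as well (for instance to check it for SS=1, as the question asks), one option is finditer, which yields match objects with positions. The sketch below is an illustration with made-up data; the line-recovery logic is an assumption and not part of the program above.

```python
import re

body = "data rs1 SS=1\ndata rs2 none\ndata rs1 none\n"
rx = re.compile("rs1|rs2")

print(rx.findall(body))  # ['rs1', 'rs2', 'rs1']

located = []
for m in rx.finditer(body):
    # Recover the line containing this match from the newlines around it.
    start = body.rfind("\n", 0, m.start()) + 1
    end = body.find("\n", m.end())
    if end == -1:
        end = len(body)
    line = body[start:end]
    located.append((m.group(0), "SS=1" in line))

for name, has_ss in located:
    print("found %s SS %s" % (name, "yes" if has_ss else "no"))
```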

For more on regular expressions in Python, see the Python documentation, the Google Developers Python course, and Jeffrey Friedl's book on regular expressions.
