Python - increasing search efficiency over a large file using readlines(size)

I am new to Python, and currently I'm using Python 2. I have source files, each of which consists of a huge amount of data (about 19 million lines). They look like this:

    apple \t N \t apple
    n&apos garden \t N \t garden
    b\ta\md great \t Adj \t great
    nice \t Adj \t (unknown)
    etc.

My task is to search the third column of each file for certain target words, and every time a target word occurs in the corpus, the 10 words before and after it should be added to a multidimensional dictionary.

EDIT: Lines containing '&', a '\' or the string '(unknown)' should be excluded.
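
Concretely, the nested dictionary that the code below builds has this shape (the example words are taken from the data above; the counts are just made up for illustration):

    # targets[lemma][pos][context_lemma][context_pos] = co-occurrence count
    targets = {
        'great': {                    # target word found in column 3
            'Adj': {                  # its POS tag from column 2
                'garden': {'N': 2},   # context word -> {POS tag: count}
                'apple':  {'N': 1},
            }
        }
    }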

I tried to solve this problem with readlines() and enumerate(), as you can see in the code below. The code does what it should, but it is clearly not efficient enough for the amount of data in the source files.

I know that readlines() or read() should not be used for huge datasets, since they load the entire file into memory. However, when reading the file line by line I was not able to use enumerate() to get the 10 words before and after the target word. I also cannot use mmap, because I do not have permission to use it on this file.

So I think that readlines() with some size limit would be the most efficient solution. However, wouldn't I introduce mistakes that way, since each time the size limit is reached, the 10 words after a target word near the end of the chunk would not be captured, because the code simply breaks off there?
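
To make the concern concrete, the chunked approach I have in mind would look roughly like this (sketch only; the size hint is just an example value):

    # Read the gzipped corpus in chunks with readlines(sizehint)
    with gzip.open(path_file) as corpusfile:
        leftover = []                                    # lines carried over from the previous chunk
        while True:
            chunk = corpusfile.readlines(8 * 1024 * 1024)  # roughly 8 MB worth of lines
            if not chunk:
                break
            lines = leftover + chunk
            # ... filter and enumerate() over `lines` as in the code below ...
            # Problem: targets within the last 10 lines of this chunk do not
            # see their 10 following lines yet, because those are still in
            # the next chunk.
            leftover = lines[-10:]                       # overlap for the "before" context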

    import os
    import re
    import csv
    import gzip

    def get_target_to_dict(file):
        targets_dict = {}
        with open(file) as f:
            for line in f:
                targets_dict[line.strip()] = {}
        return targets_dict

    targets_dict = get_target_to_dict('targets_uniq.txt')

    # browse the directory and process each file
    # find the target words and add the 10 words before and after to the dictionary
    # exclude lines starting with <, -, ; to just have raw text
    def get_co_occurence(path_file_dir, targets, results):
        lines = []
        for file in os.listdir(path_file_dir):
            if file.startswith('corpus'):
                path_file = os.path.join(path_file_dir, file)
                with gzip.open(path_file) as corpusfile:
                    # PROBLEMATIC CODE HERE
                    # lines = corpusfile.readlines()
                    for line in corpusfile:
                        if re.match('[A-Z]|[a-z]', line):
                            if '(unknown)' in line:
                                continue
                            elif '\\' in line:
                                continue
                            elif '&' in line:
                                continue
                            lines.append(line)
                    for i, line in enumerate(lines):
                        line = line.strip()
                        if re.match('[A-Z]|[a-z]', line):
                            parts = line.split('\t')
                            lemma = parts[2]
                            if lemma in targets:
                                pos = parts[1]
                                if pos not in targets[lemma]:
                                    targets[lemma][pos] = {}
                                counts = targets[lemma][pos]
                                context = []
                                # look at the 10 previous lines
                                for j in range(max(0, i - 10), i):
                                    context.append(lines[j])
                                # look at the next 10 lines
                                for j in range(i + 1, min(i + 11, len(lines))):
                                    context.append(lines[j])
                                # END OF PROBLEMATIC CODE
                                for context_line in context:
                                    context_line = context_line.strip()
                                    parts_context = context_line.split('\t')
                                    context_lemma = parts_context[2]
                                    if context_lemma not in counts:
                                        counts[context_lemma] = {}
                                    context_pos = parts_context[1]
                                    if context_pos not in counts[context_lemma]:
                                        counts[context_lemma][context_pos] = 0
                                    counts[context_lemma][context_pos] += 1
        csvwriter = csv.writer(results, delimiter='\t')
        for k, v in targets.iteritems():
            for k2, v2 in v.iteritems():
                for k3, v3 in v2.iteritems():
                    for k4, v4 in v3.iteritems():
                        csvwriter.writerow([str(k), str(k2), str(k3), str(k4), str(v4)])
                        # print(str(k) + "\t" + str(k2) + "\t" + str(k3) + "\t" + str(k4) + "\t" + str(v4))

    results = open('results_corpus.csv', 'wb')
    word_occurrence = get_co_occurence(path_file_dir, targets_dict, results)

I copied the whole code for completeness, since it is all part of one function that creates a multidimensional dictionary from all the extracted information and then writes it to a CSV file.

I am very grateful for any hint or suggestion to make this code more efficient.

EDIT: I fixed the code so that it takes exactly the 10 words before and after the target word into account.

python dictionary multidimensional-array readlines enumerate
2 answers

My idea was to create one buffer to store the 10 lines before the current line and another buffer to store the 10 lines after it. As the file is read, each line is pushed into the "before" buffer, and the buffer pops its oldest line once its size exceeds 10.

For the "after" buffer, I clone a second iterator from the first file iterator with itertools.tee. Then I run both iterators in parallel inside the loop, with the cloned iterator running 10 iterations ahead to collect the next 10 lines.
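
In isolation, that look-ahead mechanism works like this (a toy illustration with numbers standing in for corpus lines):

    import itertools

    it1, it2 = itertools.tee(iter(range(20)))
    next(it2)                                   # the clone skips the current element...
    lookahead = [next(it2) for _ in range(10)]  # ...and reads 10 elements ahead
    for value in it1:
        # `lookahead` holds up to 10 elements that follow `value`;
        # tee only buffers the gap between the two iterators (at most 11
        # items here), so the whole stream is never held in memory
        print(value, lookahead)
        if lookahead:
            lookahead.pop(0)
        try:
            lookahead.append(next(it2))
        except StopIteration:
            pass                                # the clone runs out first, near the end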

This avoids using readlines() and loading the entire file into memory. Hope it works for you in the real case.

Edited: only fill the before/after buffers if column 3 does not contain any of '&', '\', '(unknown)'. Also changed split('\t') to just split(), so it takes care of any mix of spaces and tabs.
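
For example, with a line that mixes spaces and tabs:

    >>> "apple \t N \t apple".split('\t')
    ['apple ', ' N ', ' apple']
    >>> "apple \t N \t apple".split()
    ['apple', 'N', 'apple']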

    import itertools

    def get_co_occurence(path_file_dir, targets, results):
        excluded_words = ['&', '\\', '(unknown)']  # modify excluded words here
        for file in os.listdir(path_file_dir):
            if file.startswith('testset'):
                path_file = os.path.join(path_file_dir, file)
                with open(path_file) as corpusfile:
                    # CHANGED CODE HERE
                    before_buf = []  # buffer to store the previous 10 lines
                    after_buf = []   # buffer to store the next 10 lines
                    corpusfile, corpusfile_clone = itertools.tee(corpusfile)  # clone the file iterator to access the next 10 lines
                    for line in corpusfile:
                        line = line.strip()
                        if re.match('[A-Z]|[a-z]', line):
                            parts = line.split()
                            lemma = parts[2]

                            # before-buffer handling: only fill the buffer if the line
                            # contains none of the excluded words
                            if not any(w in line for w in excluded_words):
                                before_buf.append(line)  # append to the before buffer
                                if len(before_buf) > 11:
                                    before_buf.pop(0)  # keep the current line plus the 10 before it

                            # after-buffer handling
                            while len(after_buf) <= 10:
                                try:
                                    after = next(corpusfile_clone)  # advance the cloned iterator
                                    after_lemma = ''
                                    after_tmp = after.split()
                                    if re.match('[A-Z]|[a-z]', after) and len(after_tmp) > 2:
                                        after_lemma = after_tmp[2]
                                except StopIteration:
                                    break  # the cloned iterator exhausts first because it runs 10 iterations ahead
                                if after_lemma and not any(w in after for w in excluded_words):
                                    after_buf.append(after)  # append to the after buffer
                                    # print 'after', z, after, ' - ', after_lemma
                            if after_buf and line in after_buf[0]:
                                after_buf.pop(0)  # pop one off, ready for the next line

                            if lemma in targets:
                                pos = parts[1]
                                if pos not in targets[lemma]:
                                    targets[lemma][pos] = {}
                                counts = targets[lemma][pos]

                                # the 10 previous lines
                                context = before_buf[:-1]  # minus the current line
                                # plus the next 10 lines
                                context.extend(after_buf)

                                # END OF CHANGED CODE
                                # CONTINUE YOUR STUFF HERE WITH CONTEXT

A functional alternative, written in Python 3.5. I simplified your example to only 5 words on each side. There are other simplifications regarding the filtering of unwanted values, but that only requires minor changes. I will use the fn package from PyPI to make this functional code more natural to read.

    from typing import List, Tuple
    from itertools import groupby, filterfalse
    from fn import F

First we need to extract the column:

    def getcol3(line: str) -> str:
        return line.split("\t")[2]

Then we need to break the lines into blocks separated by a predicate:

    TARGET_WORDS = {"target1", "target2"}

    # this is our predicate
    def istarget(word: str) -> bool:
        return word in TARGET_WORDS

Let's filter out the garbage and write a function to take the first and last 5 words:

    def isjunk(word: str) -> bool:
        return word == "(unknown)"

    def first_and_last(words: List[str]) -> (List[str], List[str]):
        first = words[:5]
        last = words[-5:]
        return first, last
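
For example (with made-up words):

    >>> getcol3("apple\tN\tapple")
    'apple'
    >>> first_and_last(["w1", "w2", "w3", "w4", "w5", "w6", "w7"])
    (['w1', 'w2', 'w3', 'w4', 'w5'], ['w3', 'w4', 'w5', 'w6', 'w7'])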

Now let's get the groups:

    words = (F() >> (map, str.strip) >> (filter, bool) >> (map, getcol3) >> (filterfalse, isjunk))(lines)
    groups = groupby(words, istarget)
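
(If you prefer not to install fn, the same pipeline can be written with plain generator expressions; this is equivalent:)

    stripped = (line.strip() for line in lines)           # remove whitespace/newlines
    nonempty = (line for line in stripped if line)        # drop empty lines
    col3 = (getcol3(line) for line in nonempty)           # take the third column
    words = (word for word in col3 if not isjunk(word))   # drop "(unknown)"
    groups = groupby(words, istarget)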

Now let's process the groups:

    def is_target_group(group: Tuple[str, List[str]]) -> bool:
        return istarget(group[0])

    def unpack_word_group(group: Tuple[str, List[str]]) -> List[str]:
        return [*group[1]]

    def unpack_target_group(group: Tuple[str, List[str]]) -> List[str]:
        return [group[0]]

    def process_group(group: Tuple[str, List[str]]):
        return (unpack_target_group(group) if is_target_group(group)
                else first_and_last(unpack_word_group(group)))

And the last steps:

 words = list(map(process_group, groups)) 

PS

This is my test case:

    from io import StringIO

    buffer = """
    _\t_\tword
    _\t_\tword
    _\t_\tword
    _\t_\t(unknown)
    _\t_\tword
    _\t_\tword
    _\t_\ttarget1
    _\t_\tword
    _\t_\t(unknown)
    _\t_\tword
    _\t_\tword
    _\t_\tword
    _\t_\ttarget2
    _\t_\tword
    _\t_\t(unknown)
    _\t_\tword
    _\t_\tword
    _\t_\tword
    _\t_\t(unknown)
    _\t_\tword
    _\t_\tword
    _\t_\ttarget1
    _\t_\tword
    _\t_\t(unknown)
    _\t_\tword
    _\t_\tword
    _\t_\tword
    """

    # this simulates an opened file
    lines = StringIO(buffer)

Given this file, you will get this result:

    [(['word', 'word', 'word', 'word', 'word'], ['word', 'word', 'word', 'word', 'word']),
     (['target1'], ['target1']),
     (['word', 'word', 'word', 'word'], ['word', 'word', 'word', 'word']),
     (['target2'], ['target2']),
     (['word', 'word', 'word', 'word', 'word'], ['word', 'word', 'word', 'word', 'word']),
     (['target1'], ['target1']),
     (['word', 'word', 'word', 'word'], ['word', 'word', 'word', 'word'])]

From here you can easily pull out the first 5 and last 5 words around each target.

