How to perform a binary search on a text file for keyword lookup in Python?

The text file contains two columns: an index number (5 characters wide) and a keyword (30 characters wide). The file is sorted in lexicographical order. I want to do a binary search to look up a keyword.
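
For concreteness, records in such a file might look like this (hypothetical data; each line is a 5-character index field followed by a keyword field padded with trailing spaces to 30 characters):

    00001aardvark
    00002abacus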


Here's an interesting way to do this with Python's built-in bisect module.

    import bisect
    import os

    class Query(object):
        def __init__(self, query, index=5):
            self.query = query
            self.index = index

        def __lt__(self, comparable):
            # Compare the query against the keyword field of a record,
            # i.e. everything after the 5-character index column.
            return self.query < comparable[self.index:]

    class FileSearcher(object):
        def __init__(self, file_pointer, record_size=35):
            self.file_pointer = file_pointer
            self.file_pointer.seek(0, os.SEEK_END)
            self.record_size = record_size + len(os.linesep)
            self.num_bytes = self.file_pointer.tell()
            self.file_size = (self.num_bytes // self.record_size)

        def __len__(self):
            return self.file_size

        def __getitem__(self, item):
            # Random access to the item-th fixed-width record.
            self.file_pointer.seek(item * self.record_size)
            return self.file_pointer.read(self.record_size)

    if __name__ == '__main__':
        with open('data.dat') as file_to_search:
            query = raw_input('Query: ')
            wrapped_query = Query(query)
            searchable_file = FileSearcher(file_to_search)
            print "Located @ line: ", bisect.bisect(searchable_file, wrapped_query)
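
Note that bisect.bisect() only returns an insertion point; it does not by itself tell you whether the keyword is actually in the file. A minimal membership check built on the same classes might look like this (a sketch: it treats the query as a prefix of the keyword field, assuming that field is space-padded to 30 characters):

    def keyword_present(searchable_file, query, index=5):
        # bisect() places the query just before the first record whose keyword
        # field compares greater, so the candidate match is the record at i.
        i = bisect.bisect(searchable_file, Query(query, index))
        return i < len(searchable_file) and searchable_file[i][index:].startswith(query)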

Do you need to do a binary search? If not, try converting your file to cdb (a constant database). This will give you very fast hash lookups to find the index for a given word:

    import cdb

    # convert the corpus file to a constant database one time
    db = cdb.cdbmake('corpus.db', 'corpus.db_temp')
    for line in open('largecorpus.txt', 'r'):
        index, word = line.split()
        db.add(word, index)
    db.finish()

In a separate script, run queries against it:

    import cdb

    db = cdb.init('corpus.db')
    db.get('chaos')   # -> '12345'
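
cdb stores both keys and values as strings. A sketch of a query with a missing-key check, assuming this python-cdb binding returns None when the key is absent (worth verifying against your version):

    import cdb

    db = cdb.init('corpus.db')
    index = db.get('chaos')   # e.g. '12345', or None if 'chaos' is absent (assumed)
    if index is not None:
        print(int(index))     # convert the stored index back to a number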

If you need to find one keyword in a file:

    line_with_keyword = next((line for line in open('file') if keyword in line), None)
    if line_with_keyword is not None:
        print line_with_keyword  # found

To find multiple keywords, you can use set() as @kriegar did:

    def extract_keyword(line):
        return line[5:35]  # assuming the keyword starts at the 6th position and is 30 characters long

    with open('file') as f:
        keywords = set(extract_keyword(line) for line in f)  # O(n) creation

    if keyword in keywords:  # O(1) search
        print(keyword)

You can use a dict() instead of a set() above if you need to preserve the index information.
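
A minimal sketch of that dict() variant, reusing extract_keyword() from above and assuming the index occupies the first 5 characters of each line:

    with open('file') as f:
        keyword_to_index = dict((extract_keyword(line), line[:5]) for line in f)  # O(n) creation

    if keyword in keyword_to_index:       # O(1) search
        print(keyword_to_index[keyword])  # the 5-character index field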

And here is how you could do a binary search on the text file:

    import bisect

    lines = open('file').readlines()  # O(n) list creation
    keywords = map(extract_keyword, lines)
    i = bisect.bisect_left(keywords, keyword)  # O(log n) search
    if i != len(keywords) and keywords[i] == keyword:
        print(lines[i])  # found

It has no advantage over the set() variant.

Note: all of the options above except the first load the entire file into memory. The FileSearcher() approach proposed by @Mahmoud Abdelkader does not require loading the whole file into memory.


Consider using a set instead of a binary search to search for a keyword in your file.

Set:

O(n) to create, O(1) to search, O(1) to insert/delete

If your input file is space-delimited, then:

    f = open('file')
    keywords = set(line.strip().split(" ")[1] for line in f.readlines())
    f.close()

    my_word in keywords  # returns True or False

Dictionary:

    f = open('file')
    keywords = dict([(pair[1], pair[0]) for pair in [line.strip().split(" ") for line in f.readlines()]])
    f.close()

    keywords[my_word]  # returns the index of my_word

Binary search:

O(n log n) to create, O(log n) to search

Edit: for your case of a 5-character index and a 30-character keyword, you can just use line slicing:

    f = open('file')
    keywords = set(line[5:-1] for line in f.readlines())
    f.close()

    my_word in keywords

or

    f = open('file')
    keywords = dict([(line[5:-1], line[:5]) for line in f.readlines()])
    f.close()

    keywords[my_word]

It is possible, with a small loss of efficiency, to perform a binary search on a sorted text file with records of unknown length, by repeatedly bisecting a byte range and reading ahead past the next line terminator. Here is what I use to search a CSV file with two header lines for a number in the first field. Give it an open file and the first field to search for. It is fairly easy to adapt to your problem. A match on the very first line at offset zero will fail, so that case may need special handling. In my circumstances, the first two lines are headers and are always skipped.

Please excuse my lack of polished Python below. I use this function, and a similar one, to perform GeoCity Lite latitude and longitude lookups directly from the CSV files distributed by MaxMind.

Hope this helps

==========================================

    import os

    # See if the input loc is in the file
    def look1(f, loc):
        # Compute the size of the open file sent to us
        hi = os.fstat(f.fileno()).st_size
        lo = 0
        lookfor = int(loc)
        # print "looking for: ", lookfor
        while hi - lo > 1:
            # Find the midpoint and seek to it
            loc = int((hi + lo) / 2)
            # print " hi = ", hi, " lo = ", lo
            # print "seek to: ", loc
            f.seek(loc)
            # Skip ahead to the beginning of the next line
            while f.read(1) != '\n':
                pass
            # Now skip past lines that are headers
            while 1:
                # Read a line
                line = f.readline()
                # print "read_line: ", line
                # Crude csv parsing: remove quotes and split on ,
                row = line.replace('"', "")
                row = row.split(',')
                # Make sure the 1st field is numeric
                if row[0].isdigit():
                    break
            s = int(row[0])
            if lookfor < s:
                # Bisect into the lower half
                hi = loc
                continue
            if lookfor > s:
                # Bisect into the higher half
                lo = loc
                continue
            return row  # Found
        # If not found
        return False
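
A hypothetical call, assuming a MaxMind-style blocks CSV whose first field is the numeric start of an IP range (the file name and the value below are made up):

    with open('GeoLiteCity-Blocks.csv') as f:
        row = look1(f, 167772160)  # the integer form of an IP address
        print(row if row else 'not found')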
