How to perform a binary search on a text file for keyword lookup in Python?

The text file contains two columns: an index number (5 characters wide) and a keyword (30 characters wide). The file is sorted in lexicographical order. I want to do a binary search to look up a keyword.
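
For concreteness, records in such a file might look like this (hypothetical data; each line is a 5-character index field followed by a keyword field padded with trailing spaces to 30 characters):

    00001aardvark
    00002abacus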


Here's an interesting way to do this with Python's built-in bisect module.

    import bisect
    import os

    class Query(object):
        def __init__(self, query, index=5):
            self.query = query
            self.index = index

        def __lt__(self, comparable):
            # Compare the query against the keyword field of a record,
            # i.e. everything after the 5-character index column.
            return self.query < comparable[self.index:]

    class FileSearcher(object):
        def __init__(self, file_pointer, record_size=35):
            self.file_pointer = file_pointer
            self.file_pointer.seek(0, os.SEEK_END)
            self.record_size = record_size + len(os.linesep)
            self.num_bytes = self.file_pointer.tell()
            self.file_size = (self.num_bytes // self.record_size)

        def __len__(self):
            return self.file_size

        def __getitem__(self, item):
            # Random access to the item-th fixed-width record.
            self.file_pointer.seek(item * self.record_size)
            return self.file_pointer.read(self.record_size)

    if __name__ == '__main__':
        with open('data.dat') as file_to_search:
            query = raw_input('Query: ')
            wrapped_query = Query(query)
            searchable_file = FileSearcher(file_to_search)
            print "Located @ line: ", bisect.bisect(searchable_file, wrapped_query)
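
Note that bisect.bisect() only returns an insertion point; it does not by itself tell you whether the keyword is actually in the file. A minimal membership check built on the same classes might look like this (a sketch: it treats the query as a prefix of the keyword field, assuming that field is space-padded to 30 characters):

    def keyword_present(searchable_file, query, index=5):
        # bisect() places the query just before the first record whose keyword
        # field compares greater, so the candidate match is the record at i.
        i = bisect.bisect(searchable_file, Query(query, index))
        return i < len(searchable_file) and searchable_file[i][index:].startswith(query)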

Do you need to do a binary search? If not, try converting your file to cdb (a constant database). This will give you very fast hash lookups to find the index for a given word:

    import cdb

    # convert the corpus file to a constant database one time
    db = cdb.cdbmake('corpus.db', 'corpus.db_temp')
    for line in open('largecorpus.txt', 'r'):
        index, word = line.split()
        db.add(word, index)
    db.finish()

In a separate script, run queries against it:

    import cdb

    db = cdb.init('corpus.db')
    db.get('chaos')   # -> '12345'
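
cdb stores both keys and values as strings. A sketch of a query with a missing-key check, assuming this python-cdb binding returns None when the key is absent (worth verifying against your version):

    import cdb

    db = cdb.init('corpus.db')
    index = db.get('chaos')   # e.g. '12345', or None if 'chaos' is absent (assumed)
    if index is not None:
        print(int(index))     # convert the stored index back to a number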

If you need to find one keyword in a file:

    line_with_keyword = next((line for line in open('file') if keyword in line), None)
    if line_with_keyword is not None:
        print line_with_keyword  # found

To find multiple keywords, you can use set() as @kriegar did:

    def extract_keyword(line):
        return line[5:35]  # assuming the keyword starts at the 6th position and is 30 characters long

    with open('file') as f:
        keywords = set(extract_keyword(line) for line in f)  # O(n) creation

    if keyword in keywords:  # O(1) search
        print(keyword)

You can use a dict() instead of a set() above if you need to preserve the index information.
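
A minimal sketch of that dict() variant, reusing extract_keyword() from above and assuming the index occupies the first 5 characters of each line:

    with open('file') as f:
        keyword_to_index = dict((extract_keyword(line), line[:5]) for line in f)  # O(n) creation

    if keyword in keyword_to_index:       # O(1) search
        print(keyword_to_index[keyword])  # the 5-character index field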

And here is how you could do a binary search on the text file:

    import bisect

    lines = open('file').readlines()  # O(n) list creation
    keywords = map(extract_keyword, lines)
    i = bisect.bisect_left(keywords, keyword)  # O(log n) search
    if i != len(keywords) and keywords[i] == keyword:
        print(lines[i])  # found

It has no advantage over the set() variant.

Note: all of the options above except the first load the entire file into memory. The FileSearcher() approach proposed by @Mahmoud Abdelkader does not require loading the whole file into memory.


Consider using a set instead of a binary search to search for a keyword in your file.

Set:

O(n) to create, O(1) to search, O(1) to insert/delete

If your input file is space-delimited, then:

    f = open('file')
    keywords = set(line.strip().split(" ")[1] for line in f.readlines())
    f.close()

    my_word in keywords  # returns True or False

Dictionary:

    f = open('file')
    keywords = dict([(pair[1], pair[0]) for pair in [line.strip().split(" ") for line in f.readlines()]])
    f.close()

    keywords[my_word]  # returns the index of my_word

Binary search:

O(n log n) to create, O(log n) to search

Edit: for your case of a 5-character index and a 30-character keyword, you can just use line slicing:

    f = open('file')
    keywords = set(line[5:-1] for line in f.readlines())
    f.close()

    my_word in keywords

or

    f = open('file')
    keywords = dict([(line[5:-1], line[:5]) for line in f.readlines()])
    f.close()

    keywords[my_word]

It is possible, with a small loss of efficiency, to perform a binary search on a sorted text file with records of unknown length, by repeatedly bisecting a byte range and reading ahead past the next line terminator. Here is what I use to search a CSV file with two header lines for a number in the first field. Give it an open file and the first field to search for. It is fairly easy to adapt to your problem. A match on the very first line at offset zero will fail, so that case may need special handling. In my circumstances, the first two lines are headers and are always skipped.

Please excuse my lack of polished Python below. I use this function, and a similar one, to perform GeoCity Lite latitude and longitude lookups directly from the CSV files distributed by MaxMind.

Hope this helps

==========================================

    import os

    # See if the input loc is in the file
    def look1(f, loc):
        # Compute the size of the open file sent to us
        hi = os.fstat(f.fileno()).st_size
        lo = 0
        lookfor = int(loc)
        # print "looking for: ", lookfor
        while hi - lo > 1:
            # Find the midpoint and seek to it
            loc = int((hi + lo) / 2)
            # print " hi = ", hi, " lo = ", lo
            # print "seek to: ", loc
            f.seek(loc)
            # Skip ahead to the beginning of the next line
            while f.read(1) != '\n':
                pass
            # Now skip past lines that are headers
            while 1:
                # Read a line
                line = f.readline()
                # print "read_line: ", line
                # Crude csv parsing: remove quotes and split on ,
                row = line.replace('"', "")
                row = row.split(',')
                # Make sure the 1st field is numeric
                if row[0].isdigit():
                    break
            s = int(row[0])
            if lookfor < s:
                # Bisect into the lower half
                hi = loc
                continue
            if lookfor > s:
                # Bisect into the higher half
                lo = loc
                continue
            return row  # Found
        # If not found
        return False
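
A hypothetical call, assuming a MaxMind-style blocks CSV whose first field is the numeric start of an IP range (the file name and the value below are made up):

    with open('GeoLiteCity-Blocks.csv') as f:
        row = look1(f, 167772160)  # the integer form of an IP address
        print(row if row else 'not found')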
