Python: determining a prefix from a set of (similar) strings

Question

Python: determining a prefix from a set of (similar) strings

I have a rowset like

my_prefix_what_ever my_prefix_what_so_ever my_prefix_doesnt_matter

I just want to find the longest common part of these lines, here is the prefix. In the above result, the result should be

 my_prefix_

Lines

 my_prefix_what_ever my_prefix_what_so_ever my_doesnt_matter

must have a prefix

my_

Is there a relatively painless way in Python to define a prefix (without having to iterate over each character manually)?

PS: I am using Python 2.6.3.

+50

python string prefix

Kawu Jul 16 '11 at 15:03

source share

8 answers

Ned Batchelder is probably right. But for fun, here's a more efficient version of phimuemue responding with itertools .

 import itertools strings = ['my_prefix_what_ever', 'my_prefix_what_so_ever', 'my_prefix_doesnt_matter'] def all_same(x): return all(x[0] == y for y in x) char_tuples = itertools.izip(*strings) prefix_tuples = itertools.takewhile(all_same, char_tuples) ''.join(x[0] for x in prefix_tuples)

As an insult to readability, here is a single-line version :)

 >>> from itertools import takewhile, izip >>> ''.join(c[0] for c in takewhile(lambda x: all(x[0] == y for y in x), izip(*strings))) 'my_prefix_'

+11

senderle Jul 16 '11 at 18:12

source share

Here is my solution:

 a = ["my_prefix_what_ever", "my_prefix_what_so_ever", "my_prefix_doesnt_matter"] prefix_len = len(a[0]) for x in a[1 : ]: prefix_len = min(prefix_len, len(x)) while not x.startswith(a[0][ : prefix_len]): prefix_len -= 1 prefix = a[0][ : prefix_len]

+4

MRAB Jul 16 '11 at 15:35

source share

The following is a working, but probably rather inefficient solution.

 a = ["my_prefix_what_ever", "my_prefix_what_so_ever", "my_prefix_doesnt_matter"] b = zip(*a) c = [x[0] for x in b if x==(x[0],)*len(x)] result = "".join(c)

For small sets of strings this is not a problem. But for large sets, I personally would like to write another manual solution that checks each character one by one and stops when there are differences.

Algorithmically, this gives the same procedure, however, you could avoid creating a list c .

+2

phimuemue Jul 16 2018-11-17T00:

source share

Just out of curiosity, I figured out another way to do this:

 def common_prefix(strings): if len(strings) == 1:#rule out trivial case return strings[0] prefix = strings[0] for string in strings[1:]: while string[:len(prefix)] != prefix and prefix: prefix = prefix[:len(prefix)-1] if not prefix: break return prefix strings = ["my_prefix_what_ever","my_prefix_what_so_ever","my_prefix_doesnt_matter"] print common_prefix(strings) #Prints "my_prefix_"

As Ned said, it might be better to use os.path.commonprefix , which is a pretty elegant feature.

+1

ThePhysicist Oct 30 '13 at 15:57

source share

The second line uses a reduction function for each character in the input lines. It returns a list of elements N + 1, where N is the length of the shortest input line.

Each item in the lot is either (a) an input symbol if all input lines match at that position, or (b) None. lot.index (No) is the position of the first No in the lot: the length of the common prefix. out is a common prefix.

 val = ["axc", "abc", "abc"] lot = [reduce(lambda a, b: a if a == b else None, x) for x in zip(*val)] + [None] out = val[0][:lot.index(None)]

+1

Mano Bastardo Nov 02 '15 at 15:15

source share

Here is another way to do this using OrderedDict with minimal code.

 import collections import itertools def commonprefix(instrings): """ Common prefix of a list of input strings using OrderedDict """ d = collections.OrderedDict() for instring in instrings: for idx,char in enumerate(instring): # Make sure index is added into key d[(char, idx)] = d.get((char,idx), 0) + 1 # Return prefix of keys while value == length(instrings) return ''.join([k[0] for k in itertools.takewhile(lambda x: d[x] == len(instrings), d)])

0

user2758579 Jan 14 '15 at 7:10

source share

Here is a simple clean solution. The idea is to use the zip () function to align all characters, putting them in the list of the 1st character, the list of the 2nd character, ... the list of the nth character. Then iterate over each list to see if it contains only 1 value.

 a = ["my_prefix_what_ever", "my_prefix_what_so_ever", "my_prefix_doesnt_matter"] list = [all(x[i] == x[i+1] for i in range(len(x)-1)) for x in zip(*a)] print a[0][:list.index(0) if list.count(0) > 0 else len(list)]

output: my_prefix _

0

Patmanizer Nov 24 '16 at 15:51

source share

Ned Batchelder · Accepted Answer · 2011-07-16 15:45

Never rewrite what is provided to you: os.path.commonprefix does just that:

Returns the longest path prefix (accepted by character), which is the prefix of all paths in the list. If the list is empty, return an empty string ( '' ). Note that this may return the wrong paths because it works with the symbol at a time.

For comparison with other answers, here is the code:

 # Return the longest prefix of all list elements. def commonprefix(m): "Given a list of pathnames, returns the longest common leading component" if not m: return '' s1 = min(m) s2 = max(m) for i, c in enumerate(s1): if c != s2[i]: return s1[:i] return s1

Python: determining a prefix from a set of (similar) strings

More articles: