Mark dynamic substrings in a string list

Question


Suppose I have the following two sets of strings:

file=sheet-2016-12-08.xlsx
file=sheet-2016-11-21.xlsx
file=sheet-2016-11-12.xlsx
file=sheet-2016-11-08.xlsx
file=sheet-2016-10-22.xlsx
file=sheet-2016-09-29.xlsx
file=sheet-2016-09-05.xlsx
file=sheet-2016-09-04.xlsx

size=1024KB
size=22KB
size=980KB
size=15019KB
size=202KB

I need to run a function on each of these sets separately and get the following output, respectively:

file=sheet-2016-*.xlsx

size=*KB

A data set can be any set of strings and does not have to match the formats above. Here is another example:

id.4030.paid
id.1280.paid
id.88.paid

for which the expected output is:

id.*.paid

Basically, I need a function that analyzes a set of strings and replaces the substring that varies between them with an asterisk (*).

+6

python algorithm

Hydera Aug 25 '17 at 22:26


2 answers

from os.path import commonprefix

def commonsuffix(m):
    return commonprefix([s[::-1] for s in m])[::-1]

def inverse_glob(strs):
    start = commonprefix(strs)
    n = len(start)
    ends = [s[n:] for s in strs]
    end = commonsuffix(ends)
    if start and not any(ends):
        return start
    else:
        return start + '*' + end

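Note that the problem is underspecified in general. For ['spamAndEggs', 'spamAndHamAndEggs'], both spam*AndEggs and spamAnd*Eggs are valid answers. Inputs such as ['aXXXXz', 'aXXXz'] are also tricky, because the common prefix and the common suffix overlap.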
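The two example inputs above can be checked with a short script. This is only a sketch: naive_pattern and stripped_pattern are hypothetical helper names introduced here, and fnmatch.fnmatchcase from the standard library is used to test whether a pattern still matches the strings it was derived from.

```python
from fnmatch import fnmatchcase
from os.path import commonprefix


def commonsuffix(strs):
    # Longest common suffix, via the common prefix of the reversed strings.
    return commonprefix([s[::-1] for s in strs])[::-1]


def naive_pattern(strs):
    # Hypothetical helper: prefix and suffix are computed independently,
    # so they may overlap on short strings.
    return commonprefix(strs) + '*' + commonsuffix(strs)


def stripped_pattern(strs):
    # Hypothetical helper: the suffix is computed only on what remains
    # after the common prefix, so the two parts cannot overlap.
    start = commonprefix(strs)
    return start + '*' + commonsuffix([s[len(start):] for s in strs])


overlap = ['aXXXXz', 'aXXXz']
print(naive_pattern(overlap))     # aXXX*XXXz
print(stripped_pattern(overlap))  # aXXX*z
print(all(fnmatchcase(s, naive_pattern(overlap)) for s in overlap))     # False
print(all(fnmatchcase(s, stripped_pattern(overlap)) for s in overlap))  # True

# Ambiguity: both candidate patterns match both inputs.
ambiguous = ['spamAndEggs', 'spamAndHamAndEggs']
for pattern in ('spam*AndEggs', 'spamAnd*Eggs'):
    print(pattern, all(fnmatchcase(s, pattern) for s in ambiguous))  # True for both
```

With the overlapping input, the naive pattern no longer matches its own inputs, while stripping the prefix first keeps the pattern consistent. Which of the two spam patterns is "right" remains a matter of specification.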

Credit to JFF for the idea of using os.path.commonprefix.

+1

wim Aug 25 '17 at 22:41

Jean-François Fabre · Accepted Answer · 2017-08-25T22:35:44+0000

You can use os.path.commonprefix to compute the common prefix. It is meant for finding the shared leading part of a list of file paths, but it works on any list of strings.
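As a quick illustration of why this works on arbitrary strings, the sketch below calls os.path.commonprefix on a pair of paths and on the size strings from the question. The sample paths are made up for the example.

```python
import os

# commonprefix compares character by character, not path component by
# path component, so the result need not be a valid directory.
paths = ['/usr/lib', '/usr/local/lib']
print(os.path.commonprefix(paths))  # /usr/l

# The same call on non-path strings from the question.
sizes = ['size=1024KB', 'size=22KB', 'size=980KB', 'size=15019KB', 'size=202KB']
print(os.path.commonprefix(sizes))  # size=

# An empty list gives an empty string.
print(repr(os.path.commonprefix([])))  # ''
```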

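The common suffix can be computed the same way: reverse the strings, take the common prefix, and reverse the result (adapted from https://gist.github.com/willwest/ca5d050fdf15232a9e67)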

dataset = """id.4030.paid
id.1280.paid
id.88.paid""".splitlines()

import os


# Return the longest common suffix in a list of strings
def longest_common_suffix(list_of_strings):
    reversed_strings = [s[::-1] for s in list_of_strings]
    return os.path.commonprefix(reversed_strings)[::-1]

common_prefix = os.path.commonprefix(dataset)
common_suffix = longest_common_suffix(dataset)

print("{}*{}".format(common_prefix,common_suffix))

Result:

id.*.paid

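EDIT: as wim pointed out:

when all strings are equal, the common prefix and the common suffix are the same, but the function should return the string itself rather than prefix*suffix, so that case needs an explicit check;
when the common prefix and the common suffix overlap, the suffix should be computed on the strings with the common prefix removed.

So an all-in-one function is needed that first checks that at least two strings differ, and that slices off the common prefix before computing the common suffix: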

def compute_generic_string(dataset):
    # edge case where all strings are the same
    if len(set(dataset)) == 1:
        return dataset[0]

    commonprefix = os.path.commonprefix(dataset)
    # strip the common prefix first so prefix and suffix cannot overlap
    remainders = [s[len(commonprefix):] for s in dataset]
    commonsuffix = os.path.commonprefix([s[::-1] for s in remainders])[::-1]

    return "{}*{}".format(commonprefix, commonsuffix)

Test:

for dataset in [['id.4030.paid','id.1280.paid','id.88.paid'],['aBBc', 'aBc'],[]]:
    print(compute_generic_string(dataset))

Result:

id.*.paid
aB*c
*

(Note that an empty dataset yields a lone *.)
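To tie this back to the question, here is a hedged sketch that runs the same prefix-then-suffix idea on the two original data sets and checks each derived pattern against its inputs. The name wildcard_pattern is hypothetical, and fnmatch.fnmatchcase is used only as a sanity check.

```python
import os
from fnmatch import fnmatchcase


def wildcard_pattern(strings):
    # Hypothetical name; same prefix-then-suffix idea as in the answer.
    if len(set(strings)) == 1:
        return strings[0]
    prefix = os.path.commonprefix(strings)
    # Compute the suffix only on what is left after the prefix.
    rest = [s[len(prefix):] for s in strings]
    suffix = os.path.commonprefix([s[::-1] for s in rest])[::-1]
    return prefix + '*' + suffix


files = """file=sheet-2016-12-08.xlsx
file=sheet-2016-11-21.xlsx
file=sheet-2016-11-12.xlsx
file=sheet-2016-11-08.xlsx
file=sheet-2016-10-22.xlsx
file=sheet-2016-09-29.xlsx
file=sheet-2016-09-05.xlsx
file=sheet-2016-09-04.xlsx""".splitlines()

sizes = ['size=1024KB', 'size=22KB', 'size=980KB', 'size=15019KB', 'size=202KB']

for data in (files, sizes):
    pattern = wildcard_pattern(data)
    # Every input should still match the pattern derived from it.
    print(pattern, all(fnmatchcase(s, pattern) for s in data))
```

This prints file=sheet-2016-*.xlsx True and size=*KB True. Inputs that themselves contain *, ? or [ would need escaping before the fnmatch check is meaningful.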
