Dynamic Regular Expression Generation for Predictable Repeating String Patterns in a Data Feed

I'm currently trying to process several data feeds that I have no control over, using regular expressions in C# to extract information.

The originator of each feed takes the underlying row data from their database (for example, product name, price, etc.) and formats it into English text strings. Within each line, part of the text is static and repeated, and part is generated dynamically from the database.

For example:

Panasonic TV with a FREE Blu-ray player

Sony TV with free DVD player + Box Office DVD

Kenwood Hi-Fi Unit with $20 Amazon MP3 Voucher

Thus, the format in this case is: PRODUCT with FREEGIFT.

PRODUCT and FREEGIFT are the dynamic parts of each line, and the text "with" is static. Each feed has about 2,000 lines.

Creating a regular expression to extract the dynamic parts is trivial.
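For the format above, a minimal sketch might look like the following. It assumes the literal " with " always separates the two parts; the class name `FeedLineParser` and the group names `product` and `gift` are hypothetical and chosen only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FeedLineParser
{
    // Hypothetical pattern for the "PRODUCT with FREEGIFT" layout.
    // The lazy (.+?) stops at the first " with ", so a gift description
    // that itself contains " with " ends up entirely in the gift group.
    private static readonly Regex LinePattern =
        new Regex(@"^(?<product>.+?) with (?<gift>.+)$", RegexOptions.Compiled);

    // Returns false (and null outputs) when the line does not fit the layout.
    public static bool TryParse(string line, out string product, out string gift)
    {
        Match match = LinePattern.Match(line);
        product = match.Success ? match.Groups["product"].Value : null;
        gift = match.Success ? match.Groups["gift"].Value : null;
        return match.Success;
    }
}
```

Called on "Kenwood Hi-Fi Unit with $20 Amazon MP3 Voucher", this yields "Kenwood Hi-Fi Unit" as the product and "$20 Amazon MP3 Voucher" as the gift.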

The problem is that the marketing people who control the data feed keep changing the structure of the static text, usually once every two weeks, so this week I might have:

Brand new Panasonic TV and a FREE Blu-ray Player if you order today

Brand new Sony TV and a free DVD player + Box Office DVD if you order today

Brand new Kenwood Hi-Fi unit and a $20 Amazon MP3 Voucher if you order today

And next week it will probably be something else, so I have to keep changing my regexes...

How would you handle this?

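One approach is to detect the static text automatically: find the content that every line in a feed shares, and treat whatever lies between those shared pieces as the dynamic parts. The following method returns the shared content in order, ignoring matches shorter than a minimum length: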

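// Requires: using System; using System.Collections.Generic; using System.Linq;

// Yields, in order, each run of text shared by every string that is at least
// minimumMatchLength characters long.
private static IEnumerable<string> FindCommonContent(string[] input, int minimumMatchLength)
{
    // Work on a copy so the caller's array is not modified.
    string[] strings = (string[])input.Clone();
    string sharedContent = "";

    while (strings.Length > 0 && strings.All(x => x.Length > 0))
    {
        char firstCharacter = strings[0][0];

        // Every string starts with the same character: extend the current shared run.
        if (strings.All(x => x[0] == firstCharacter))
        {
            sharedContent += firstCharacter;

            for (int index = 0; index < strings.Length; index++)
                strings[index] = strings[index].Substring(1);

            continue;
        }

        // The shared run has ended; report it if it is long enough.
        if (sharedContent.Length >= minimumMatchLength)
            yield return sharedContent;

        sharedContent = "";

        // If the first minimumMatchLength characters of a string aren't in all the other strings, consume the first character of that string
        bool consumed = false;
        for (int index = 0; index < strings.Length; index++)
        {
            string testBlock = strings[index].Substring(0, Math.Min(minimumMatchLength, strings[index].Length));

            if (!strings.All(x => x.Contains(testBlock)))
            {
                strings[index] = strings[index].Substring(1);
                consumed = true;
            }
        }

        // Guard against an endless loop: if nothing was consumed, the state would
        // never change, so force progress by dropping one character.
        if (!consumed)
            strings[0] = strings[0].Substring(1);
    }

    if (sharedContent.Length >= minimumMatchLength)
        yield return sharedContent;
}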

Example 1 (the original feed format):

FindCommonContent(strings, 4);
=> "with "

Example 2 (the changed feed format):

FindCommonContent(strings, 4);
=> "Brand new ", "and a ", "if you order today"

Building the regular expression from the result:

"^(.*)" + string.Join("(.*)", FindCommonContent(strings, 4)) + "(.*)$";
=> "^(.*)Brand new (.*)and a (.*)if you order today(.*)$"
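The step from shared content to extracted values can be sketched as follows. This is an illustration under assumptions rather than a definitive implementation: the class `FeedRegexBuilder` and its methods are hypothetical names, the static parts are passed in directly instead of being computed, and `Regex.Escape` is added so that characters such as `$` or `+` in the static text are matched literally.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class FeedRegexBuilder
{
    // Builds an anchored pattern with one capture group before, between
    // and after the static parts.
    public static Regex Build(IEnumerable<string> staticParts)
    {
        // Escape each static part so regex metacharacters are treated as text.
        string body = string.Join("(.*?)", staticParts.Select(Regex.Escape));
        return new Regex("^(.*?)" + body + "(.*)$");
    }

    // Returns the non-empty dynamic parts of a line, trimmed, or an empty
    // array when the line does not match the pattern.
    public static string[] Extract(Regex pattern, string line)
    {
        Match match = pattern.Match(line);
        if (!match.Success)
            return Array.Empty<string>();

        return match.Groups.Cast<Group>()
            .Skip(1)                          // group 0 is the whole match
            .Select(g => g.Value.Trim())
            .Where(v => v.Length > 0)
            .ToArray();
    }
}
```

With the static parts "Brand new ", "and a " and "if you order today", the line "Brand new Panasonic TV and a FREE Blu-ray Player if you order today" yields "Panasonic TV" and "FREE Blu-ray Player". The lazy groups split at the first occurrence of each static part, so a product name that happens to contain "and a " would be split in the wrong place; lines that fail to match are a useful signal that the feed format has changed again.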
