Very complex regex

Question


I'm stuck trying to write this regular expression that I need. Basically, I have a long string consisting of two different data types:

1. [a-f0-9]{32}
2. [A-Za-z0-9=]{x}

The catch is that x is constant only within a given instance: if it is 12 in one case, it will be 12 throughout that data set, but the next time I run the regular expression it might be 15, or 45, for example. There is an unpredictable number of type (1) parts between each pair of type (2) parts. My goal is to "collect" all the data of type (2).

For example, I could have a line like this:

[a-f0-9]{192} [a-zA-Z0-9=]{11} [a-f0-9]{96} [a-zA-Z0-9=]{11} [a-f0-9]{160} [a-zA-Z0-9=]{11}

(All run together, without separators.) I need to return a string of 33 characters from the character set [a-zA-Z0-9=]. The fact that the number of characters in each of these substrings is constant within an instance (11 in the case above, but it could just as well be 13) is vital: because type (1) uses a subset of the characters of type (2), there is otherwise no way to know where one string begins and the other ends.

I have been trying to get this to work for almost a month, and I'm close to tearing my hair out. I am not very good at regular expressions ...

Sample data:

    3c21e03a10b9415fb3e1067ea75f8205
    c8dc9900a5089d31e01241c7a947ed7e
    d5f8cd6bb86ebef6d7d104c84ae6e8a7
    e23c99af9c9d6d0294d8b51094c39021
    4bb4af7e61760735ba17c29e8f542a66
    875da91e90863f1ddb7e149297fc59af
    cf5de951fb65d06d2927aab7b9b54830
    e2d935616a54c381c2f38db3731d5a37
    SGVsbG8gbXk
    6dd11d15c419ac219901f14bdd999f38
    0ad94e978ad624d15189f5230e5435a9
    2dc19fe95e583e7d593dd52ae7e68a6e
    465ffa6074a371a8958dad3ad271181a
    23310939b981b4e56f2ecee26f82ec60
    fe04bef49be47603d1278cc80673b226
    gbmFtZSBpcy
    3c21e03a10b9415fb3e1067ea75f8205
    c8dc9900a5089d31e01241c7a947ed7e
    d5f8cd6bb86ebef6d7d104c84ae6e8a7
    e23c99af9c9d6d0294d8b51094c39021
    BvbGl2ZXIga
    4bb4af7e61760735ba17c29e8f542a66
    875da91e90863f1ddb7e149297fc59af
    cf5de951fb65d06d2927aab7b9b54830
    e2d935616a54c381c2f38db3731d5a37
    G9vcmF5IQ==

I would like to extract "SGVsbG8gbXkgbmFtZSBpcyBvbGl2ZXIgaG9vcmF5IQ==".

+6

regex encryption

Mala Dec 28 '09 at 13:10


14 answers

I do not believe that regular expressions are the right tool for this problem.

My concern is that the range [a-f0-9] is contained in the range [a-zA-Z0-9=], and since there are no separators and the length of the entries is variable, the boundary between the two kinds of entry seems pretty fuzzy.

You may be able to come up with a heuristic that determines where records start and end by looking for a pattern in the data, and you could then apply regular expressions using that pattern, but regular expressions are unlikely to help you discover the pattern in the first place.

+5

Eric Bréchemier Dec 28 '09 at 13:23


I don’t think that your “data types” are clearly defined enough to solve this problem for all cases, regardless of whether you use regular expressions at all.

Since, judging by your example, type 1 can occur several times in a row, and type 2 can look like type 1 because the character sets overlap, I don't see how you can tell them apart in all cases, even if you know x (which, judging by the question, I'm not sure you do).

As a crude example, given a string of 2,000 repetitions of the letter "a", how could you tell which parts are type 1 and which are type 2?

If there is any possibility at all of having something that gives you data placed in explicit delimiters, do it. Otherwise, you will have to use heuristics to disambiguate, and I don't think regex is the right tool for this.
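To make the ambiguity concrete, here is a minimal sketch, scaled down from 2,000 to 96 characters so it can be checked by hand. The helper name `candidateLengths` and the exact pattern are my own assumptions, not something from the question.

```php
<?php
// Hypothetical helper: list every chunk length x for which the whole input
// can be read as runs of 32-character hex blocks, each run followed by one
// x-character chunk.
function candidateLengths(string $s, int $maxX): array
{
    $found = [];
    for ($x = 1; $x <= $maxX; $x++) {
        $pattern = '/^(?:(?:[a-f0-9]{32})+[a-zA-Z0-9=]{' . $x . '})+$/';
        if (preg_match($pattern, $s) === 1) {
            $found[] = $x;
        }
    }
    return $found;
}

// 96 copies of "a": every character is valid for both data types.
$ambiguous = str_repeat('a', 96);
print_r(candidateLengths($ambiguous, 64)); // chunk lengths 16, 32 and 64 all fit
```

Three different values of x explain the same input equally well, so nothing in the data itself can pick the right one.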

+3

Michael Borgwardt Dec 28 '09 at 13:31


It seems that the data between the hexadecimal strings is Base64. The actual problem you describe seems unsolvable under the constraints you have given (no lengths can be assumed, etc.).

But the important thing to be aware of is that the Base64 character set also contains the characters "+" and "/". The "=" characters are padding: the length of the whole (in your case, concatenated) Base64-encoded string is always a multiple of 4 characters.
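As a rough illustration of that point, the sketch below checks the joined result against the full Base64 alphabet and the multiple-of-4 rule before decoding it. The variable names are mine, and it assumes the chunks have already been extracted and concatenated.

```php
<?php
// Assumed input: the type (2) chunks, already collected and joined.
$joined = 'SGVsbG8gbXkgbmFtZSBpcyBvbGl2ZXIgaG9vcmF5IQ==';

// Full standard Base64 alphabet: letters, digits, "+" and "/", with at most
// two "=" padding characters, allowed only at the very end.
$isBase64 = preg_match('~^[A-Za-z0-9+/]*={0,2}$~', $joined) === 1
    && strlen($joined) % 4 === 0;

// Strict mode makes base64_decode() return false if it meets a character
// outside the Base64 alphabet.
$decoded = $isBase64 ? base64_decode($joined, true) : false;

var_dump($isBase64);
echo $decoded, "\n"; // Hello my name is oliver hooray!
```

If the real data can contain "+" or "/", the character class used for the chunks in the question would need the same two characters added.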

+2

Nakedible Dec 28 '09 at 13:41


As some of the other answers have already said, I think regular expressions are the wrong tool here, or at least not the first one to reach for. You need to start with an algorithmic approach. Here's why: you cannot know the exact value of x. The best you can do is estimate x for each fragment of type 2. Then you need a mechanism for guessing the most likely value of x from all of those estimates (possibly using something like hill climbing). After that, you can apply the regular expression, or simply pull out pieces of the appropriate length.

+2

Ben Dec 28 '09 at 13:41


If you know the size of each field, I would just use substr.

    // Offsets for the layout 192 + 11 + 96 + 11 + 160 + 11
    $a = substr($line, 192, 11);
    $b = substr($line, 299, 11);
    $c = substr($line, 470, 11); // 299 + 11 + 160

or use str_split and convert the string to an array and create substrings from arrays.
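A small sketch of the same idea that accumulates the offsets instead of hard-coding them. The function name `extractChunks` and the synthetic test line are assumptions for illustration; the layout (192, 96 and 160 hex characters, 11-character chunks) is the one from the question.

```php
<?php
// Hypothetical helper: $hexRuns lists the length of the hex run that
// precedes each chunk; $x is the known chunk length.
function extractChunks(string $line, array $hexRuns, int $x): string
{
    $offset = 0;
    $result = '';
    foreach ($hexRuns as $runLength) {
        $offset += $runLength;                 // skip the hex run
        $result .= substr($line, $offset, $x); // take the chunk that follows
        $offset += $x;                         // move past the chunk
    }
    return $result;
}

// Synthetic line with the layout from the question.
$line = str_repeat('0', 192) . 'SGVsbG8gbXk'
      . str_repeat('1', 96)  . 'gbmFtZSBpcy'
      . str_repeat('2', 160) . 'BvbGl2ZXIga';

echo extractChunks($line, [192, 96, 160], 11), "\n"; // SGVsbG8gbXkgbmFtZSBpcyBvbGl2ZXIga
```

This only helps when both the run lengths and x are known up front, which is exactly the condition stated above.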

+1

tvanfosson Dec 28 '09 at 13:29


You're on the wrong track, IMO. The input is hexadecimal-encoded data with base64-encoded parts inserted into it. That hexadecimal data must mean something, and that meaning can be used to determine where the "needed" data begins. Also, if the original data you are decoding is split into pieces of equal length, that must mean something too. You should "understand" the data rather than try to match it with a mindless regex pattern, which is not possible here.

+1

BYK Dec 28 '09 at 13:36


How do you define this magic x?

1) If you know x in advance for each data set, just use your regular expression and substitute the actual value for x before each call (in most languages you can build an arbitrary string and use it as a regular expression).
2) If you do not know x, then I do not see how there can be any answer, since x cannot be determined from the input data alone (as you point out).

Edit:

From your comment, 2) seems to apply: x is not known in advance.

As others have pointed out, in the general case there will then be several solutions for a given piece of input data.

You can write a program that will extract all the substrings that match your criteria. If there is only one solution for this input, you're in luck; otherwise you will have to decide what you like best.

To extract the substrings, one idea (perhaps not optimal) would be to simply iterate over all the reasonable values of x and try your regular expression for each one. If it matches, you have found one solution. If more than one x matches, more than one solution exists.

There is probably a more efficient way to do this, but if you have a low enough upper bound for x, this should be doable. (Obviously, a data size of 32 is always the upper bound for x, so this will always work in principle).
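A minimal sketch of that loop, collecting every x that works rather than stopping at the first. The helper name `solutionsFor`, the upper bound and the synthetic input are all assumptions for illustration.

```php
<?php
// Hypothetical helper: returns [x => joined type (2) data] for every x in
// 1..$maxX whose pattern accounts for the whole input.
function solutionsFor(string $s, int $maxX): array
{
    $solutions = [];
    for ($x = 1; $x <= $maxX; $x++) {
        // Whole-input check: repeated (hex blocks, then one x-character chunk).
        $whole = '/^(?:(?:[a-f0-9]{32})+[a-zA-Z0-9=]{' . $x . '})+$/';
        if (preg_match($whole, $s) !== 1) {
            continue;
        }
        // Capture the chunk that follows each run of hex blocks.
        $each = '/(?:[a-f0-9]{32})+([a-zA-Z0-9=]{' . $x . '})/';
        preg_match_all($each, $s, $m);
        $solutions[$x] = implode('', $m[1]);
    }
    return $solutions;
}

// Synthetic input: one hex block, a 4-character chunk, two hex blocks,
// another 4-character chunk.
$block = str_repeat('0', 32);
$s = $block . 'Zm9v' . $block . $block . 'YmFy';

print_r(solutionsFor($s, 10)); // only x = 4 fits, giving "Zm9vYmFy"
```

Note that raising the bound can introduce extra solutions: for this input x = 72 also fits (one hex block followed by a single 72-character chunk), which is the "several solutions" situation described above.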

0

sleske Dec 28 '09 at 13:21


How about something along the lines of:

 ([a-f0-9]*([a-zA-Z0-9=]*))*

And then just concatenate the matches of ([a-zA-Z0-9=]*).

Can you expect the [a-zA-Z0-9=]* part to be the same length every time? Or do you need to check that? If you have to check the length every time, then this problem is not solvable with a regular expression (i.e., this is not a regular language, but rather a non-context-free language).

0

Yuval Adam Dec 28 '09 at 13:24


Is it possible that the last string you want to match always ends with "=="?

If so, you could first match the string that ends with "==", compute its length, and then use that as your x to capture the other strings you are after.

0

ℝaphink Dec 28 '09 at 13:25


I really think that you cannot collect all your parts of type (2) if you do not know how many pieces of type (1) you will have and their length.

The best solution is to parse line by line and apply a regular expression to each line. If the line matches type (2), append it to the result string.

If your string is not already split into lines, run it through preg_replace before parsing it.

0

dasilvj Dec 28 '09 at 13:27


Or you can simply check for valid characters with a regular expression, and then check the length of the string through a property or function. You seem to be making things harder than they need to be.

0

marko Dec 28 '09 at 13:42


Why not just do it:

 ^[a-zA-Z0-9]+==$

or

 ^[a-zA-Z0-9]+[=]+$

-1

Bobby Dec 28 '09 at 13:22


You don't seem to care about the contents of the string, so this should do. Of course, you must know the numbers to use. Also, I assume that the data is all on one line (I suppose you only added the line breaks for readability).

 ^.{192}(.{11}).{96}(.{11}).{160}(.{11}).*$

Then you just need to concatenate the three captured groups from the match.
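Applied with preg_match, that might look like the sketch below. The synthetic `$line` is an assumption built to match the 192/96/160 layout from the question, with 11-character chunks.

```php
<?php
// Synthetic line: hex runs of 192, 96 and 160 characters, each followed by
// an 11-character chunk.
$line = str_repeat('a', 192) . 'SGVsbG8gbXk'
      . str_repeat('b', 96)  . 'gbmFtZSBpcy'
      . str_repeat('c', 160) . 'BvbGl2ZXIga';

$pattern = '/^.{192}(.{11}).{96}(.{11}).{160}(.{11}).*$/';

$joined = null;
if (preg_match($pattern, $line, $m) === 1) {
    // $m[0] is the whole match; $m[1] to $m[3] are the three chunks.
    $joined = $m[1] . $m[2] . $m[3];
}

echo $joined, "\n"; // SGVsbG8gbXkgbmFtZSBpcyBvbGl2ZXIga
```

Because "." accepts any character, this does no validation at all; it relies entirely on the lengths being right.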

== Added

OK, since uppercase characters seem to be an indicator of where you need to extract:

What you need to do is first get the positions of all the uppercase characters, take the largest multiple of 32 below each position, and then use a substring call to extract the content you want. How do you get the 11 again?

-1

jfno Dec 28 '09 at 13:33


Mark Byers · Accepted Answer · 2009-12-28T13:57:18+0000

This is your lucky day! The problem is not solvable in general, but I believe the following will almost always give the correct answer for typical real-life data:

    <?php
    $s = '
    3c21e03a10b9415fb3e1067ea75f8205
    c8dc9900a5089d31e01241c7a947ed7e
    d5f8cd6bb86ebef6d7d104c84ae6e8a7
    e23c99af9c9d6d0294d8b51094c39021
    4bb4af7e61760735ba17c29e8f542a66
    875da91e90863f1ddb7e149297fc59af
    cf5de951fb65d06d2927aab7b9b54830
    e2d935616a54c381c2f38db3731d5a37
    SGVsbG8gbXk
    6dd11d15c419ac219901f14bdd999f38
    0ad94e978ad624d15189f5230e5435a9
    2dc19fe95e583e7d593dd52ae7e68a6e
    465ffa6074a371a8958dad3ad271181a
    23310939b981b4e56f2ecee26f82ec60
    fe04bef49be47603d1278cc80673b226
    gbmFtZSBpcy
    3c21e03a10b9415fb3e1067ea75f8205
    c8dc9900a5089d31e01241c7a947ed7e
    d5f8cd6bb86ebef6d7d104c84ae6e8a7
    e23c99af9c9d6d0294d8b51094c39021
    BvbGl2ZXIga
    4bb4af7e61760735ba17c29e8f542a66
    875da91e90863f1ddb7e149297fc59af
    cf5de951fb65d06d2927aab7b9b54830
    e2d935616a54c381c2f38db3731d5a37
    G9vcmF5IQ==
    ';

    // Strip all whitespace (line breaks and any indentation) so the data
    // becomes one continuous string.
    $s = preg_replace('/\s+/', '', $s);

    // Try small chunk lengths first to reduce the chance of a false match.
    for ($i = 1; $i < 20; ++$i) {
        // Does the whole input consist of runs of 32-character hex blocks,
        // each optionally followed by an $i-character chunk?
        $pattern = "/^(([a-f0-9]{32})+([a-zA-Z0-9=]{{$i}})?)+$/";
        if (preg_match($pattern, $s)) {
            // Capture the chunk that follows each run of hex blocks.
            $pattern = "/(?:[a-f0-9]{32})+([a-zA-Z0-9=]{{$i}})/";
            $matches = array();
            preg_match_all($pattern, $s, $matches);
            print_r(join('', $matches[1]));
            break;
        }
    }

The output in this case:

 SGVsbG8gbXkgbmFtZSBpcyBvbGl2ZXIgaG9vcmF5IQ==

I believe the code can be improved, but I'm sure you are just happy to get something that works. I think this is similar to the bazooka method described above, but I honestly don't think there is a better way. Also note that it is important to start with small guesses first, to minimize the chance of false matches. The order of the terms in the regular expression also matters for increasing the likelihood of making the right choice when more than one is possible (try the greedy match first, and fall back to the simpler match only if that fails).
