Regular expression to clear a numbered list

Question


I have just started playing with regex and seem to be a bit stuck. I wrote a bulk find-and-replace using multi-line mode in TextSoap. It is for cleaning up recipes that I have OCR'd, and because they contain both Ingredients and Directions, I cannot simply change every "1 " to "1. ", as this would also rewrite "1 Tbsp" as "1. Tbsp".

So I added a check that the following two lines (possibly with blank lines in between) start with the next consecutive numbers, using these patterns as the find:

 ^(1) (.*)\n?((\n))(^2 (.*)\n?(\n)^3 (.*)\n?(\n))
 ^(2) (.*)\n?((\n))(^3 (.*)\n?(\n)^4 (.*)\n?(\n))
 ^(3) (.*)\n?((\n))(^4 (.*)\n?(\n)^5 (.*)\n?(\n))
 ^(4) (.*)\n?((\n))(^5 (.*)\n?(\n)^6 (.*)\n?(\n))
 ^(5) (.*)\n?((\n))(^6 (.*)\n?(\n)^7 (.*)\n?(\n))

and as a replacement for each of the above:

 $1. $2 $3 $4$5

My problem is that although it works the way I wanted it, it will never complete the task for the last three numbers ...

Example text I want to clear:

 1 This is the first step in the list
 2 Second lot if instructions to run through
 3 Doing more of the recipe instruction
 4 Half way through cooking up a storm
 5 almost finished the recipe
 6 Serve and eat

And I want it to look like this:

 1. This is the first step in the list
 2. Second lot if instructions to run through
 3. Doing more of the recipe instruction
 4. Half way through cooking up a storm
 5. almost finished the recipe
 6. Serve and eat

Is there a way to check the line or two above, so that the match can run backwards? I have looked at lookahead and lookbehind and am a little confused by them. Does anyone have a way to clean up a numbered list, or can you help me with the regex I am after, please?

+6

regex

Palendrone Jan 16 '13 at 13:17


2 answers

How about this?

 1 Tbsp salt
 2 Tsp sugar
 3 Eggs

You are running into a basic limitation of regular expressions: they do not work well when your data cannot be strictly defined. You may know intuitively which lines are ingredients and which are steps, but it is not easy to turn that intuition into a robust set of rules for an algorithm.

I suggest you think about an approach based on position within the file. A given cookbook usually formats all of its recipes the same way: for example, the ingredients appear first, followed by the list of steps. That would probably be an easier way to tell the two apart.
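As a rough illustration of the position-based idea, here is a minimal Python sketch. It assumes each recipe has a line reading "Ingredients" and a later line reading "Directions"; those exact heading strings and the function name `number_directions` are assumptions made for the example, not something the OCR output is known to contain.

```python
import re

# A numbered line: leading digits, whitespace, then the rest of the line.
STEP = re.compile(r"^(\d+)\s+(.+)$")

def number_directions(recipe: str, heading: str = "Directions") -> str:
    """Add a period after the leading number, but only below the heading.

    The heading text is an assumption about how the recipes are laid out.
    """
    out = []
    in_directions = False
    for line in recipe.splitlines():
        stripped = line.strip().lower()
        if stripped == heading.lower():
            in_directions = True          # steps start after this line
        elif stripped == "ingredients":
            in_directions = False         # a new recipe's ingredient list
        elif in_directions:
            line = STEP.sub(r"\1. \2", line)
        out.append(line)
    return "\n".join(out)

recipe = "\n".join([
    "Ingredients",
    "1 Tbsp salt",
    "2 Eggs",
    "Directions",
    "1 Mix the salt in",
    "2 Serve and eat",
])
print(number_directions(recipe))
```

Lines under the ingredients heading are left untouched, so "1 Tbsp salt" is never rewritten, while the numbered steps gain their periods.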

+1

user1919238 Jan 16 '13 at 13:28


alan · Accepted Answer · 2013-01-16T16:09:40+0000

dan1111 is right: you may run into problems with data that merely looks similar, such as numbered ingredient lines. But given the sample you provided, this should work:

 ^(\d+)\s+([^\r\n]+)(?:[\r\n]*)   // search
 $1. $2\r\n\r\n                   // replace

If you are not using Windows, remove \r from the replace line.

Explanation:

 ^             // beginning of the line
 (\d+)         // capture group 1. one or more digits
 \s+           // any spaces after the digit. don't capture
 ([^\r\n]+)    // capture group 2. all characters up to any EOL
 (?:[\r\n]*)   // consume additional EOL, but do not capture

Replace:

 $1.       // group 1 (the digit), then period and a space
 $2        // group 2
 \r\n\r\n  // two EOLs, to create a blank line
           // (remove both \r for Linux)
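For anyone wanting to try the same search and replace outside TextSoap, here is a small Python sketch of it. The function name `number_steps` and the `eol` parameter are made up for the example, and `re.MULTILINE` is assumed to be the equivalent of the editor's multi-line mode, so that `^` matches at the start of every line.

```python
import re

# Same pattern as above: number, whitespace, rest of line, trailing EOLs.
STEP = re.compile(r"^(\d+)\s+([^\r\n]+)(?:[\r\n]*)", re.MULTILINE)

def number_steps(text: str, eol: str = "\n") -> str:
    """Rewrite '1 text' as '1. text' followed by a blank line.

    Pass eol="\r\n" for Windows line endings.
    """
    return STEP.sub(lambda m: f"{m.group(1)}. {m.group(2)}{eol}{eol}", text)

sample = (
    "1 This is the first step in the list\n"
    "2 Second lot if instructions to run through\n"
    "3 Serve and eat\n"
)
print(number_steps(sample))
```

As noted above, this matches any line that starts with a number, so it would also rewrite ingredient lines such as "1 Tbsp salt" if they are included in the input.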
