Regular expression to clean up a numbered list

I've just started playing with regex and I seem to be a bit stuck! I wrote a massive multi-line find and replace in TextSoap. This is for cleaning up recipes that I have OCR'd, and because they contain both Ingredients and Directions, I cannot simply change "1 " to "1. ", as this would also rewrite "1 Tbsp" as "1. Tbsp".

So I checked whether the following two lines (possibly with extra blank lines in between) start with the next consecutive numbers, using these find patterns, one per number:

    ^(1) (.*)\n?((\n))(^2 (.*)\n?(\n)^3 (.*)\n?(\n))
    ^(2) (.*)\n?((\n))(^3 (.*)\n?(\n)^4 (.*)\n?(\n))
    ^(3) (.*)\n?((\n))(^4 (.*)\n?(\n)^5 (.*)\n?(\n))
    ^(4) (.*)\n?((\n))(^5 (.*)\n?(\n)^6 (.*)\n?(\n))
    ^(5) (.*)\n?((\n))(^6 (.*)\n?(\n)^7 (.*)\n?(\n))

and as a replacement for each of the above:

 $1. $2 $3 $4$5 

My problem is that although it works the way I wanted, it never completes the task for the last three numbers ...

Example text I want to clean up:

    1 This is the first step in the list
    2 Second lot if instructions to run through
    3 Doing more of the recipe instruction
    4 Half way through cooking up a storm
    5 almost finished the recipe
    6 Serve and eat

And I want it to look like this:

    1. This is the first step in the list
    2. Second lot if instructions to run through
    3. Doing more of the recipe instruction
    4. Half way through cooking up a storm
    5. almost finished the recipe
    6. Serve and eat

Is there a way to check the line or two above instead, so that this runs backwards? I looked at lookahead and lookbehind, and I'm a little confused by them. Does anyone have a way to clean up a numbered list, or can you help me with the regex I'm after, please?

+6
2 answers

dan1111 is right. You may encounter problems with similar data. But, given the sample you provided, this should work:

    ^(\d+)\s+([^\r\n]+)(?:[\r\n]*)    // search
    $1. $2\r\n\r\n                    // replace

If you are not using Windows, remove \r from the replace line.

Explanation:

    ^              // beginning of the line
    (\d+)          // capture group 1. one or more digits
    \s+            // any spaces after the digit. don't capture
    ([^\r\n]+)     // capture group 2. all characters up to any EOL
    (?:[\r\n]*)    // consume additional EOL, but do not capture

Replace:

    $1.         // group 1 (the digit), then period and a space
    $2          // group 2
    \r\n\r\n    // two EOLs, to create a blank line
                // (remove both \r for Linux)
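
For anyone who wants to try this outside TextSoap, here is a minimal sketch of the same search and replace in Python. The function name `number_steps` and its `eol` parameter are made up for illustration. It assumes `^` must match at every line start (hence `re.MULTILINE`) and writes `$1`/`$2` as `m.group(1)`/`m.group(2)`, which is how Python spells them.

```python
import re

# The search pattern from above; re.MULTILINE lets ^ match at each line start.
STEP = re.compile(r"^(\d+)\s+([^\r\n]+)(?:[\r\n]*)", re.MULTILINE)


def number_steps(text: str, eol: str = "\n") -> str:
    """Rewrite '1 Do this' as '1. Do this' followed by one blank line."""
    # Group 1 is the number, group 2 the rest of the line. The pattern also
    # swallows any trailing EOLs, so two are written back to leave a blank line.
    return STEP.sub(lambda m: f"{m.group(1)}. {m.group(2)}{eol}{eol}", text)


sample = (
    "1 This is the first step in the list\n"
    "2 Second lot if instructions to run through\n"
    "\n"
    "3 Doing more of the recipe instruction\n"
)
print(number_steps(sample))
```

Pass `eol="\r\n"` if the file uses Windows line endings. Note that this still renumbers every line that starts with digits, so it has the same limits as the pattern itself.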
+2

How about this?

    1 Tbsp salt
    2 Tsp sugar
    3 Eggs

You are running into the main limitation of regular expressions: they do not work well when your data cannot be strictly defined. You may know intuitively which lines are ingredients and which are steps, but it's not easy to get from that intuition to a robust set of rules for an algorithm.

I suggest you think about an approach based on position within the file. A cookbook usually formats all of its recipes the same way: for example, the ingredients appear first, and then the list of steps. That would probably be an easier way to tell the two apart.
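
To make the problem concrete, here is a small Python check, assuming the "leading number plus text" pattern from the other answer is applied with `re.MULTILINE`. The replacement uses a single newline rather than a blank line, purely to keep the output short.

```python
import re

# Same "number, whitespace, rest of line" pattern as in the other answer.
STEP = re.compile(r"^(\d+)\s+([^\r\n]+)(?:[\r\n]*)", re.MULTILINE)

ingredients = "1 Tbsp salt\n2 Tsp sugar\n3 Eggs\n"

# The pattern cannot tell a quantity from a step number, so every
# ingredient line gets a period after its leading number as well.
result = STEP.sub(r"\g<1>. \g<2>\n", ingredients)
print(result)
```

Every line comes back with a period after the quantity ("1. Tbsp salt" and so on), which is exactly the rewrite the question is trying to avoid.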
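
As a rough sketch of what a position-based approach could look like, the Python below only touches numbered lines that come after a heading line. It assumes each recipe has "Ingredients" and "Directions" headings on lines of their own. That layout, and the names `number_directions`, `start` and `stop`, are assumptions for illustration and would need adjusting to the real OCR output.

```python
import re

# A line that starts with a number, then whitespace, then the step text.
STEP_LINE = re.compile(r"^(\d+)\s+(.*)$")


def number_directions(text: str, start: str = "Directions",
                      stop: str = "Ingredients") -> str:
    """Add a period after leading numbers, but only between start and stop."""
    in_directions = False
    out = []
    for line in text.splitlines():
        label = line.strip().lower()
        if label == start.lower():
            in_directions = True      # numbered lines below this are steps
        elif label == stop.lower():
            in_directions = False     # a new ingredients list begins
        elif in_directions:
            line = STEP_LINE.sub(r"\1. \2", line)
        out.append(line)
    return "\n".join(out)


recipe = (
    "Ingredients\n"
    "1 Tbsp salt\n"
    "3 Eggs\n"
    "Directions\n"
    "1 Beat the eggs\n"
    "2 Add the salt"
)
print(number_directions(recipe))
```

The ingredient lines are left alone and only the lines under "Directions" gain a period. A step whose text wraps onto a new line that happens to start with a number would still be changed, so this is a starting point rather than a complete rule set.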

+1