Regex to remove xml declaration from string

First of all, I know that this is a bad decision, and I should not do this.

Background: feel free to skip


However, I need a quick fix for a live system. We currently have a data structure that serializes itself to a string by creating "xml" fragments with a series of string builders. Whether the result is valid XML, I rather doubt. After this xml is created and before it is sent on, some clean-up code searches the string for xml declarations and removes them.

The way this is done (iterating over every character and calling IndexOf for <?xml at each one) is so slow that it causes thread timeouts and kills our systems. Ultimately, I will fix it properly (build the xml using xml documents or something similar), but for now I need a quick fix to replace what is there.

Please keep in mind, I know this is far from an ideal solution, but I need a quick solution to get us back to work.


Question

I was thinking of using a regular expression to find the declarations. My plan was the pattern <\?xml.*?> and then Regex.Replace(input, string.Empty) to remove them.

Could you tell me whether there are any glaring problems with this regular expression, or whether simply writing it in code using string.IndexOf("<?xml") and string.IndexOf("?>") pairs in a (much saner) loop would be better?

EDIT: I do need to handle newlines.

Would <\?xml[^>]*?> do the trick?

EDIT2

Thanks for the help. Regex-wise, <\?xml.*?\?> worked fine. In the end, I wrote some timing code and tested both a regex and IndexOf(). I found that for our simplest use case, just removing the declaration took:

  • Almost a second with the existing code
  • 0.01 seconds with the regex
  • Too little time to measure with a loop and IndexOf()

So I went with IndexOf(), as it is a very simple loop.
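For readers wondering what such a loop could look like, here is a minimal sketch; it is not the code that was actually deployed. The class and method names (XmlDeclarationStripper, Strip) are made up for illustration. It assumes that anything starting with <?xml and ending at the next ?> should be removed, which means it would also strip processing instructions such as <?xml-stylesheet ...?>.

```csharp
using System;
using System.Text;

public static class XmlDeclarationStripper
{
    // Hypothetical helper: removes every "<?xml ... ?>" span in one forward scan.
    public static string Strip(string input)
    {
        const string open = "<?xml";
        const string close = "?>";

        int start = input.IndexOf(open, StringComparison.Ordinal);
        if (start < 0) return input; // nothing to remove, avoid copying

        var result = new StringBuilder(input.Length);
        int pos = 0; // first character not yet copied or skipped
        while (start >= 0)
        {
            int end = input.IndexOf(close, start + open.Length, StringComparison.Ordinal);
            if (end < 0) break; // unterminated declaration: keep the rest untouched

            result.Append(input, pos, start - pos); // text before the declaration
            pos = end + close.Length;               // resume right after "?>"
            start = input.IndexOf(open, pos, StringComparison.Ordinal);
        }
        result.Append(input, pos, input.Length - pos); // remaining tail
        return result.ToString();
    }
}
```

Each IndexOf call resumes from where the previous match ended, so the input is scanned roughly once instead of once per character. Newlines inside a declaration need no special handling, because the search only looks for the literal ?> terminator.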

+4
2 answers

You probably want either this: <\?xml.*\?> or this: <\?xml.*?\?>, because as you have it now, the regular expression is not looking for '?>' but only for '>'. I don't think you want the first option, because it is greedy and will delete everything between the first occurrence of '<?xml' and the last occurrence of '?>'. The second option will work as long as you don't have nested XML tags. If you do, it will delete everything from the first '<?xml' to the first '?>' that follows, even if another tag starts in between.
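To make the greedy versus non-greedy difference concrete, here is a small sketch. The class and method names are invented for illustration, and RegexOptions.Singleline is my addition so that '.' also matches newlines inside a declaration; without it, neither pattern would match a declaration that spans several lines.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DeclarationRegex
{
    // Non-greedy: each match stops at the first "?>" after "<?xml".
    public static string RemoveLazy(string input)
    {
        return Regex.Replace(input, @"<\?xml.*?\?>", string.Empty, RegexOptions.Singleline);
    }

    // Greedy: a single match runs from the first "<?xml" to the last "?>".
    public static string RemoveGreedy(string input)
    {
        return Regex.Replace(input, @"<\?xml.*\?>", string.Empty, RegexOptions.Singleline);
    }
}
```

For the input <?xml version="1.0"?><a><?xml version="1.0"?><b/></a>, RemoveLazy returns <a><b/></a>, while RemoveGreedy returns <b/></a> because it also swallows the <a> between the two declarations.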

Also, I don't know how regular expressions are implemented in .NET, but I seriously doubt that they are faster than using indexOf.

+6
 // Strips everything up to and including the first "?>" (assumes strXML begins with a declaration).
 strXML = strXML.Remove(0, strXML.IndexOf(@"?>", 0) + 2);
-1
