Regular expressions - how to replace a character in quotation marks

Hi Regular Expression Experts,

Until now, I have never run into a string-manipulation problem that I could not solve with regular expressions, at least not elegantly and in a single step. Here is an example of the data I'm working with:

0, "section1", "(7) The supply of a" certificate "outside the State is prohibited. Since both sections 339 of the 1940 statute, 68 and section 341 of this law, in their statement that the certificate should be provided to a citizen only if such person is at the time in the United States, it's clear that the document could not and cannot be delivered outside the United States. ", http://www.google.com/

1, "section2", http://www.google.com/

2, "section3", ",,", http://www.google.com/

This is a section of a much larger CSV file. With one elegant regular expression, I would like to replace every comma that occurs inside double quotes with an underscore (_), and only those commas. It is important that the regular expression does NOT replace any commas outside the quotes, because that would ruin the CSV data structure.

Thanks, Tom

-

UPDATE:

Sorry guys, I posted the question without fully explaining my situation, so let me summarize below:

  • Assume that quotation marks inside quoted fields are already escaped (in a CSV file saved from Excel they are represented by "" or """ etc., so they are easy to replace in advance).
  • I work in JavaScript.

Using the example text above, this is what it should look like after running the regular expression replacement (there should be only 5 replacements):

0, "section1", "(7) The supply of a" certificate "outside the State is prohibited. Since both sections 339 of the 1940 statute_ 68 and section 341 of this law_ in their statement that the certificate should be provided to a citizen only if such person is at the time in the United States_ it's clear that the document could not and cannot be delivered outside the United States. ", http://www.google.com/

1, "section2", http://www.google.com/

2, "section3", "__", http://www.google.com/

+7
3 answers

I will help you, but you must promise to stop using the word "elegant". It has been overworked lately and deserves a break. :P

 (?m),(?=[^"]*"(?:[^"\r\n]*"[^"]*")*[^"\r\n]*$) 

This matches a comma if there is an odd number of quotation marks between the comma and the end of the record. I am assuming a standard CSV format in which a record ends at the next line separator that is not enclosed in quotation marks. Line separators are legal inside quoted fields, and so are quotation marks if they are escaped with another quotation mark.
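
Since the asker works in JavaScript, here is a minimal sketch of how the pattern might be applied there. As far as I know, JavaScript does not accept a leading (?m), so the sketch assumes the m flag is set on the regex literal instead, together with g so that every match is replaced. The helper name replaceQuotedCommas and the sample data are made up for illustration.

```javascript
// Same lookahead as the pattern above; the m flag stands in for the inline (?m).
const commaInsideQuotes = /,(?=[^"]*"(?:[^"\r\n]*"[^"]*")*[^"\r\n]*$)/gm;

// Hypothetical helper: replaces every comma that sits inside a quoted field.
function replaceQuotedCommas(csv) {
  return csv.replace(commaInsideQuotes, "_");
}

// Two short records; only the commas inside quotes should change.
const sample = '1,"a,b",c\n2,"d",",,",e';
console.log(replaceQuotedCommas(sample));
// 1,"a_b",c
// 2,"d","__",e
```

The lookahead rescans the rest of the record for every comma, so on a very large file this may be noticeably slower than a single pass over the text.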

Depending on the regex flavor you are using, you may need \r?$ instead of $ . In .NET, for example, only the linefeed ( \n ) is considered a line separator. In Java, however, $ matches before the \r in \r\n , but not between the \r and the \n (unless you set UNIX_LINES mode).
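
For JavaScript specifically, my understanding is that with the m flag $ matches before any line terminator, which includes both \r and \n, so the pattern should cope with \r\n line endings without the \r?$ variant. Here is a quick check you can run to confirm this in your own engine (the sample string is made up):

```javascript
// Mark every position where $ matches in multiline mode.
const marked = "a\r\nb".replace(/$/gm, "|");

// JSON.stringify makes the control characters visible.
console.log(JSON.stringify(marked)); // "a|\r|\nb|"
```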

+12

Regular expressions are not particularly good for matching balanced text (i.e., start and end quotes).

A naive approach would be to repeatedly apply something like this (until it no longer matches):

 s/(^[^"]*(?:"[^"]*"[^"]*)*?)"([^",]*),([^"]*)"/$1"$2_$3"/ 
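
In JavaScript, the "apply until it no longer matches" loop could look roughly like this. It is only a sketch: the function name replaceUntilStable is hypothetical, the pattern is the one above written as a JavaScript literal, and the input is assumed to be a single record.

```javascript
// The substitution above as a JavaScript regex (no g flag: one comma per pass).
const oneQuotedComma = /(^[^"]*(?:"[^"]*"[^"]*)*?)"([^",]*),([^"]*)"/;

// Hypothetical helper: keeps substituting until the record stops changing.
function replaceUntilStable(record) {
  let previous;
  do {
    previous = record;
    record = record.replace(oneQuotedComma, '$1"$2_$3"');
  } while (record !== previous);
  return record;
}

console.log(replaceUntilStable('2,"section3",",,",x')); // 2,"section3","__",x
```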

But this will not work with escaped quotes. The best (i.e., the simplest, most readable, and most appropriate) solution is to use a CSV parser, iterate over the field values one by one (replacing commas with underscores as you go), and then write the result back to the file.
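
As a rough illustration of the non-regex route, here is a small JavaScript sketch. Note that it is not a real CSV parser: it is a simplified character scanner that only tracks whether the current position is inside quotes, and it assumes quotes inside fields are escaped by doubling them. For real data an actual CSV library is the safer choice. The function name is hypothetical.

```javascript
// Simplified stand-in for a CSV parser: a single pass that tracks quote state.
function replaceCommasInQuotedFields(csv, replacement = "_") {
  let inQuotes = false;
  let out = "";
  for (const ch of csv) {
    if (ch === '"') {
      // A doubled quote ("") toggles twice, so the state ends up unchanged.
      inQuotes = !inQuotes;
      out += ch;
    } else if (ch === "," && inQuotes) {
      out += replacement;
    } else {
      out += ch;
    }
  }
  return out;
}

console.log(replaceCommasInQuotedFields('1,"say ""hi"", ok",x'));
// 1,"say ""hi""_ ok",x
```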

+3

Sorry if you are not using Python, which is what the following code is written in. I have not seen any indication of which language you are using. Anyway, I think the code is understandable.

 import re

 ch = '''0,"section1","(7) Delivery of 'certificate' outside the United States prohibited. Since both section 339 of the 1940 statute, 68/ and section 341 of the present law are explicit in their statement that the certificate shall be furnished the citizen, only if such individual is at the time within the United States, it is clear that the document could not and cannot be delivered outside the United States.",http://www.google.com/
 1,"section2",,http://www.google.com/
 2,"section3",",,",http://www.google.com/
 '''

 # Matches one double-quoted field (assumes no escaped quotes inside it).
 poto = re.compile(r'("[^"]+")')

 def comma_replacement(match):
     # Replace the commas within a single quoted field.
     return match.group().replace(',', '_')

 # print(...) with parentheses runs on both Python 2 and Python 3.
 print(poto.sub(comma_replacement, ch))

This method leaves the two adjacent commas in the line

1,"section2",,http://www.google.com/

unchanged. Is that what you need?
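
For comparison, roughly the same idea in JavaScript, the asker's language. This is a sketch rather than a port checked against the real file: the function name is made up, and it uses [^"]* where the Python pattern has [^"]+ (my own change) so that an empty field "" is consumed as one unit and the later quotes stay paired up.

```javascript
// Hypothetical helper mirroring the Python approach above.
function underscoreQuotedCommas(csv) {
  // For each double-quoted chunk, swap its commas for underscores.
  return csv.replace(/"[^"]*"/g, (field) => field.replace(/,/g, "_"));
}

const rows =
  '1,"section2",,http://www.google.com/\n' +
  '2,"section3",",,",http://www.google.com/';
console.log(underscoreQuotedCommas(rows));
// 1,"section2",,http://www.google.com/
// 2,"section3","__",http://www.google.com/
```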

0