Regular expression to find unescaped double quotes in a CSV file

What regular expression finds pairs of unescaped double quotes that occur inside columns delimited by double quotes in a CSV file?

Does not match:

    "asdf","asdf"
    "", "asdf"
    "asdf", ""
    "adsf", "", "asdf"

Match:

    "asdf""asdf", "asdf"
    "asdf", """asdf"""
    "asdf", """"
5 answers

Try the following:

 (?m)""(?![ \t]*(,|$)) 

Explanation:

    (?m)         // enable multi-line matching (^ will act as the start of the line and $ will act as the end of the line (i))
    ""           // match two successive double quotes
    (?!          // start negative look ahead
      [ \t]*     //   zero or more spaces or tabs
      (          //   open group 1
        ,        //     match a comma
        |        //     OR
        $        //     the end of the line or string
      )          //   close group 1
    )            // stop negative look ahead

So, in plain English: "match two successive double quotes, but only if they are NOT followed by a comma or the end of the line, with optional spaces and tabs in between".

(i) in addition to being the normal start-of-string and end-of-string metacharacters.
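
A minimal sketch for checking the pattern above against the question's examples, assuming Ruby as the host language (the answer does not name an engine). In Ruby, `^` and `$` always match at line boundaries, and `(?m)` instead makes `.` match newlines, so the flag is dropped here. The inner group is written as non-capturing, and the constant and method names are made up for illustration.

```ruby
# Pattern from the answer without (?m): in Ruby, $ already matches at the end of each line.
UNESCAPED_PAIR = /""(?![ \t]*(?:,|$))/

# Hypothetical helper: true if the line contains a "" pair that is not
# followed by optional blanks and then a comma or the end of the line.
def unescaped_pair?(line)
  line.match?(UNESCAPED_PAIR)
end

# Examples taken from the question.
SHOULD_NOT_MATCH = ['"asdf","asdf"', '"", "asdf"', '"asdf", ""', '"adsf", "", "asdf"']
SHOULD_MATCH     = ['"asdf""asdf", "asdf"', '"asdf", """asdf"""', '"asdf", """"']

(SHOULD_NOT_MATCH + SHOULD_MATCH).each do |line|
  puts "#{line}: #{unescaped_pair?(line)}"
end
```

The first four lines should report `false` and the last three `true`. Engines where `$` only matches at the end of the input by default (Java, PCRE, Python) still need the `(?m)` flag.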

+3
source

Due to the complexity of your problem, the solution depends on the engine you are using. This is because solving it requires look-behind and look-ahead, and support for these differs between engines.

My answer uses the Ruby engine. The check itself is a single RegEx, but I include the full code here to explain it better.

Please note that, due to the Ruby RegEx engine (or the limits of my knowledge), optional look-ahead/look-behind is not possible. So I need a small workaround for spaces before and after the comma.

Here is my code:

    orgTexts = [
      '"asdf","asdf"',
      '"", "asdf"',
      '"asdf", ""',
      '"adsf", "", "asdf"',
      '"asdf""asdf", "asdf"',
      '"asdf", """asdf"""',
      '"asdf", """"'
    ]

    orgTexts.each { |orgText|
      # Preprocessing: eliminate spaces before and after a comma.
      # Needed if there may be spaces before and after a valid comma.
      orgText = orgText.gsub(/" *, *"/, '","')

      # Replace every valid character (a non-quote or a valid quote) with '-'
      resText = orgText.gsub(/[^"]|^"|"$|(?<=,)"|"(?=,)|(?<=\\)"/, '-')
      # Equivalent alternative:
      # resText = orgText.gsub(/[^"]|(^|(?<=,)|(?<=\\))"|"($|(?=,))/, '-')
      #   [^"]      ===> a non-quote
      #   ^"        ===> opening quote at the start of the line
      #   "$        ===> closing quote at the end of the line
      #   (?<=,)"   ===> quote just after a comma
      #   "(?=,)    ===> quote just before a comma
      #   (?<=\\)"  ===> backslash-escaped quote

      # Show the invalid (unescaped) quotes by marking them with '^'
      puts orgText
      puts resText.gsub('"', '^')

      # The actual check; use only this if you don't need to know which quote
      # is unescaped. The line matches from start to end (^...$) only when
      # every character is valid, so a failed match means an unescaped quote.
      isMatch = (orgText !~ /^([^"]|^"|"$|(?<=,)"|"(?=,)|(?<=\\)")*$/).to_s

      puts orgText + ": " + isMatch
      puts
    }

When executing the code, it prints:

    "asdf","asdf"
    -------------
    "asdf","asdf": false

    "","asdf"
    ---------
    "","asdf": false

    "asdf",""
    ---------
    "asdf","": false

    "adsf","","asdf"
    ----------------
    "adsf","","asdf": false

    "asdf""asdf","asdf"
    -----^^------------
    "asdf""asdf","asdf": true

    "asdf","""asdf"""
    --------^^----^^-
    "asdf","""asdf""": true

    "asdf",""""
    --------^^-
    "asdf","""": true

I hope this gives you some ideas that you can use with another engine or language.

+2
source
 ".*"(\n|(".*",)*) 

should work, I think ...

0
source

For single line matches:

 ^("[^"]*"\s*,\s*)*"[^"]*""[^"]*" 

or for multiple lines:

 (^|\r\n)("[^\r\n"]*"\s*,\s*)*"[^\r\n"]*""[^\r\n"]*" 

Edit/Note: Depending on the regular expression engine you are using, you could use lookbehinds and other features to make the regex more compact. But this should work in most regex engines just fine.
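
As a rough illustration of how the single-line pattern might be applied to a whole file, the sketch below runs it against each line separately and collects the numbers of the lines that match. Ruby is assumed as the host language, and `offending_lines` and `SAMPLE` are hypothetical names, not part of the answer.

```ruby
# Single-line pattern from the answer, unchanged.
LINE_WITH_UNESCAPED = /^("[^"]*"\s*,\s*)*"[^"]*""[^"]*"/

# Hypothetical helper: returns the 1-based numbers of the lines that match.
def offending_lines(csv_text)
  numbers = []
  csv_text.each_line.with_index(1) do |line, number|
    # chomp drops the trailing newline so each line is tested on its own
    numbers << number if line.chomp.match?(LINE_WITH_UNESCAPED)
  end
  numbers
end

# Sample input built from the question's examples.
SAMPLE = <<~CSV
  "asdf","asdf"
  "asdf""asdf", "asdf"
  "", "asdf"
  "asdf", """"
CSV

p offending_lines(SAMPLE) # prints [2, 4]
```

Testing line by line avoids the `\r\n` handling of the multi-line variant, at the cost of not supporting fields that contain line breaks.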

0
source

Try this regex:

 "(?:[^",\\]*|\\.)*(?:""(?:[^",\\]*|\\.)*)+" 

This will match any quoted string that contains at least one pair of unescaped double quotes.
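
Because this pattern matches the quoted field itself rather than the whole line, it can be used to pull out the offending fields. A small sketch, assuming Ruby; `offending_fields` is a hypothetical helper name. The pattern uses only non-capturing groups, so `String#scan` returns the full matches.

```ruby
# Pattern from the answer, unchanged.
FIELD_WITH_UNESCAPED = /"(?:[^",\\]*|\\.)*(?:""(?:[^",\\]*|\\.)*)+"/

# Hypothetical helper: returns every quoted field that contains a "" pair.
def offending_fields(line)
  line.scan(FIELD_WITH_UNESCAPED)
end

p offending_fields('"asdf""asdf", "asdf"') # one field: "asdf""asdf"
p offending_fields('"asdf", """asdf"""')   # one field: """asdf"""
p offending_fields('"", "asdf"')           # no fields: []
```

Note that the character class excludes commas, so this only works for fields that do not themselves contain a comma.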

0
source
