I'm having difficulty putting together a function that processes strings the way I want. I have looked at several previous questions (1, 2, 3), among others, without finding a fit. Here is the setup: I have well-structured but variable data that needs to be split from a line read from a file into an array of strings. The following are some examples of the data I'm dealing with.
('Vdfbr76','gsdf','gsfd','',NULL),
('Vkdfb23l','gsfd','gsfg','ggg@df.gf',NULL),
('4asg0124e','Lead Actor/SFX MUA/Prop designer','John Smith','jsmith@email.com',NULL),
('asdguIux','Director, Camera Operator, Editor, VFX','John Smith','',NULL),
...
(492,'E1asegaZ1ox','Nysdag_5YmD','145872325372620',1,'long, string, with, commas'),
I want to split these lines on commas; however, some fields contain commas themselves, which causes problems. On top of that, writing one exact pattern for re.split(regex, line) is difficult, because the number of elements per line changes as the file is read.
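To make the problem concrete, here is a minimal sketch of a naive comma split on one of the sample rows above. The names `row`, `inner`, and `naive_fields` are only for illustration and are not part of my real code.

```python
# Naive comma split on one of the sample rows above.
# `row`, `inner` and `naive_fields` are illustrative names only.
row = "('asdguIux','Director, Camera Operator, Editor, VFX','John Smith','',NULL),"

# Drop the trailing comma and the surrounding parentheses before splitting.
inner = row.rstrip(",").strip("()")

naive_fields = inner.split(",")
print(len(naive_fields))  # 8 pieces, although the row has only 5 fields
print(naive_fields)
```

The second field is cut into four pieces because its internal commas are treated as separators, so a plain `str.split(",")` is not enough here.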
Here are some solutions I've tried so far.
import re

def splitLine(text, fields, delimiter):
    return_line = []
    # Build one capture group per field, joined by the delimiter.
    regex_string = "(.*?),"
    for i in range(0, len(fields) - 1):
        regex_string += "(.*)"
        if i < len(fields) - 2:
            regex_string += delimiter
    return_line = re.split(regex_string, text)
    return return_line
Printing regex_string and return_line gives the output below. The main problem is that it sometimes lumps two fields together, as happens with the third value in the array here.
(.*?),(.*),(.*),(.*),(.*),(.*)
['', '\t(222', "'Vy1asdfnuJkA','Ndfbyz3_YMD'", "'14541242640005471'", '2', "'Hello World!')", '', '\n']
The ideal result would instead look like this:
['', '\t(222', "'Vy1asdfnuJkA'", "'Ndfbyz3_YMD'", "'14541242640005471'", '2', "'Hello World!')", '', '\n']
This is a small difference, but it has a huge impact on the result. I tried adjusting the regex string to better match what I want, but unfortunately every time I fixed one case, another one broke.
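A minimal sketch of what seems to cause the lumping, assuming the line contains one more comma than the pattern does (which may be what the trailing comma after the closing parenthesis does in my rows). The `pattern` and `line` values here are hypothetical and much shorter than the real data.

```python
import re

# Hypothetical four-field line; the pattern below only has three groups,
# mirroring a row with more commas than the regex expects.
pattern = r"(.*?),(.*),(.*)"
line = "a,b,c,d"

groups = re.match(pattern, line).groups()
print(groups)  # ('a', 'b,c', 'd')
```

The first group is lazy and stops at the first comma, while the greedy `(.*)` that follows grabs as much as it can and only gives back what the rest of the pattern needs, so the surplus comma ends up inside the second group.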
Another approach I played with came from the user Aaron Cronin in this post (4), and is shown below.
def split_at(text, delimiter, opens='<([', closes='>)]', quotes='"\''):
    result = []
    buff = ""
    level = 0
    is_quoted = False
    for char in text:
        if char in delimiter and level == 0 and not is_quoted:
            result.append(buff)
            buff = ""
        else:
            buff += char
            if char in opens:
                level += 1
            if char in closes:
                level -= 1
            if char in quotes:
                is_quoted = not is_quoted
    if not buff == "":
        result.append(buff)
    return result
The results of this look like this:
["\t('Vk3NIasef366l','gsdasdf','gsfasfd','',NULL),\n"]
The main problem is that the line comes back as a single element, unsplit. This puts me in a feedback loop.
An ideal result would look like this:
[\t('Vk3NIasef366l','gsdasdf','gsfasfd','',NULL),\n]
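A minimal sketch of one likely reason split_at returns the row in one piece, assuming the delimiter passed in is a comma: the opening parenthesis raises the nesting level, so every comma inside the row is skipped. The function body repeats the logic quoted above; `row`, `whole`, and `parts` are hypothetical names, and the sample row leaves out the tab, trailing comma, and newline that my real lines have, so the exact output on real data may differ.

```python
def split_at(text, delimiter, opens='<([', closes='>)]', quotes='"\''):
    """Same logic as the split_at function quoted above."""
    result, buff, level, is_quoted = [], "", 0, False
    for char in text:
        if char in delimiter and level == 0 and not is_quoted:
            result.append(buff)
            buff = ""
        else:
            buff += char
            if char in opens:
                level += 1
            if char in closes:
                level -= 1
            if char in quotes:
                is_quoted = not is_quoted
    if buff != "":
        result.append(buff)
    return result


# Hypothetical row in the same shape as my data (no tab or newline).
row = "('abc','x, y','',NULL)"

# With the parentheses in place, every comma sits at nesting level 1.
whole = split_at(row, ",")
print(whole)  # ["('abc','x, y','',NULL)"]

# Stripping the outer parentheses first leaves the commas at level 0.
parts = split_at(row.strip("()"), ",")
print(parts)  # ["'abc'", "'x, y'", "''", 'NULL']
```

In this sketch the comma inside `'x, y'` survives because the quote tracking is still active, while the commas between fields now act as separators once the outer parentheses are gone.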