Process text file using various delimiters

Question

My text file (unfortunately) looks like this ...

 <amar>[amar-1000#Fem$$$_Y](1){india|1000#Fem$$$,mumbai|1000#Mas$$$}
 <akbar>[akbar-1000#Fem$$$_Y](1){}
 <john>[-0000#$$$_N](0){USA|0100#$avi$$,NJ|0100#$avi$$}

It contains the name of the customer, followed by some information. The sequence is a text string followed by a list, a set, and then a dictionary:

 <> [] () {}

It is not a Python-compatible file, so the data is not as expected. I want to process the file and extract some information:

 amar 1000 | 1000 | 1000
 akbar 1000
 john 0000 | 0100 | 0100

1) The name between < and >

2) The number between - and # in the list

3 and 4) Split the dictionary on commas and take the number between | and # in each entry (there may be more than two entries)

I am open to using any tool that is most suitable for this task.

+4

python grep awk sed

shantanuo Aug 14 '15 at 8:03

5 answers

Since the grammar is quite complex, you may find that a proper parser is the best solution.

 #!/usr/bin/env python
 import fileinput
 from pyparsing import Word, Regex, Optional, Suppress, ZeroOrMore, alphas, nums

 # <name>
 name = Suppress('<') + Word(alphas) + Suppress('>')
 # [name-NUMBER#...] keeps only the number between - and #
 reclist = Suppress('[' + Optional(Word(alphas)) + '-') + Word(nums) + Suppress(Regex("[^]]+]"))
 # (digit) is parsed but discarded
 digit = Suppress('(' + Word(nums) + ')')
 dictStart = Suppress('{')
 # key|NUMBER#... keeps only the number between | and #
 dictVals = Suppress(Word(alphas) + '|') + Word(nums) + Suppress('#' + Regex('[^,}]+') + Optional(','))
 dictEnd = Suppress('}')

 parser = name + reclist + digit + dictStart + ZeroOrMore(dictVals) + dictEnd

 for line in fileinput.input():
     print(' | '.join(parser.parseString(line)))

This solution uses the pyparsing library and runs as follows:

 $ python parse.py file
 amar | 1000 | 1000 | 1000
 akbar | 1000
 john | 0000 | 0100 | 0100

+3

Chris Seymour Aug 14 '15 at 9:42

You can add all delimiters to the FS variable in awk and count fields, for example:

 awk -F'[<>#|-]' '{ print $2, $4, $6, $8 }' infile

If you have more than two entries between the curly braces, you can use a loop to step through every second field up to the last one, for example:

 awk -F'[<>#|-]' '{
     printf "%s %s ", $2, $4
     for (i = 6; i <= NF; i += 2) {
         printf "%s ", $i
     }
     printf "\n"
 }' infile

Both commands give the same results:

 amar 1000 1000 1000
 akbar 1000
 john 0000 0100 0100
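To see why the awk commands pick fields 2, 4, 6 and 8, it can help to split one sample line on the same delimiter set and look at the numbering. The following is a minimal Python sketch, not part of the original answer: `re.split` stands in for awk's field splitting, and the helper name `awk_like_fields` is hypothetical.

```python
import re

# Same delimiter set as the awk option -F'[<>#|-]'.
DELIMS = re.compile(r"[<>#|-]")

def awk_like_fields(line):
    """Split a line on the delimiters and key each piece by its 1-based field number."""
    parts = DELIMS.split(line.rstrip("\n"))
    return {i: part for i, part in enumerate(parts, start=1)}

sample = "<amar>[amar-1000#Fem$$$_Y](1){india|1000#Fem$$$,mumbai|1000#Mas$$$}"
fields = awk_like_fields(sample)

# Field 1 is the empty text before '<', field 2 is the name, field 4 the
# list number, and every second field from 6 onward is a dictionary number.
name = fields[2]
numbers = [fields[i] for i in range(4, len(fields) + 1, 2)]
print(name, " ".join(numbers))
```

The odd-numbered fields hold the leftover text between delimiters, which is why both the awk loop and this sketch advance in steps of two.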

+2

Birei Aug 14 '15 at 8:51

You can use a regex to pick out the values.

Example:

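 import re

 a = "<john>[-0000#$$$_N](0){USA|0100#$avi$$,NJ|0100#$avi$$}"
 # Name and the number between - and # in the list
 name = " ".join(re.findall(r"<(\w+)>[\s\S]+?-(\d+)#", a)[0])
 # Numbers between | and # in the dictionary (may be empty)
 others = re.findall(r"\|(\d+)#", a)
 print(" | ".join([name] + others))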

Output:

 john 0000 | 0100 | 0100

Full code:

 import re

 with open("input.txt", "r") as inp:
     for line in inp:
         # Join the (name, number) tuple so it can be combined with the rest
         name = " ".join(re.findall(r"<(\w+)>[\s\S]+?-(\d+)#", line)[0])
         others = re.findall(r"\|(\d+)#", line)
         print(" | ".join([name] + others))

+2

The6thSense Aug 14 '15 at 9:03

For one line of your file:

 test='<amar>[amar-1000#Fem$$$_Y](1){india|1000#Fem$$$,mumbai|1000#Mas$$$}'

Remove the < and delete everything from > onward to get the name:

 echo $test | sed -e 's/<//g' | sed -e 's/>.*//g'

Get all the four-digit numbers:

 echo $test | grep -o '[0-9]\{4\}'

Replace the spaces with your favorite separator:

 sed -e 's/ /|/g'

This will do:

 echo $(echo $test | sed -e 's/<//g' | sed -e 's/>.*//g') $(echo $test | grep -o '[0-9]\{4\}') | sed -e 's/ /|/g'

This will output:

 amar|1000|1000|1000

With a quick script you can process the whole file: your_script.sh input_file output_file

 #!/bin/bash
 IFS=$'\n' # line delimiter

 # empty your output file
 cp /dev/null "$2"

 for i in $(cat "$1"); do
     newline=`echo $(echo $i | sed -e 's/<//g' | sed -e 's/>.*//g') $(echo $i | grep -o '[0-9]\{4\}') | sed -e 's/ /|/g'`
     echo $newline >> "$2"
 done
 cat "$2"
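For comparison, the same two steps (strip the name out of the angle brackets, then collect every four-digit run) can be sketched in Python. This is an illustration rather than the answer's own method: the function name `extract_fields` is hypothetical, and like the grep pattern it assumes that every number of interest is exactly four digits and that the name itself contains no four-digit run.

```python
import re

def extract_fields(line, sep="|"):
    """Mimic the sed/grep pipeline: the name first, then every four-digit run."""
    # Like the two sed calls: drop '<', then drop everything from '>' onward.
    name = re.sub(r">.*", "", line.strip().replace("<", ""))
    # Like grep -o with a four-digit pattern: all non-overlapping matches.
    numbers = re.findall(r"[0-9]{4}", line)
    return sep.join([name] + numbers)

test = "<amar>[amar-1000#Fem$$$_Y](1){india|1000#Fem$$$,mumbai|1000#Mas$$$}"
print(extract_fields(test))
```

Passing `sep=" | "` would give the spaced separator shown in the question instead of the bare pipe.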

+2

Bertrand Martel Aug 14 '15 at 9:17

Martin Evans · Accepted Answer · 2015-08-14T09:11:55+0000

The following Python script will read your text file and give the desired results:

 import re, itertools

 with open("input.txt", "r") as f_input:
     for line in f_input:
         # Groups: 1 = name, 2 = contents of [...], 3 = contents of {...}
         reLine = re.match(r"<(\w+)>\[(.*?)\].*?\{(.*?)\}", line)
         # Pull the digit runs out of the list and dictionary parts
         lNumbers = [re.findall(r".*?(\d+).*?", entry) for entry in reLine.groups()[1:]]
         lNumbers = list(itertools.chain.from_iterable(lNumbers))
         print(reLine.group(1), " | ".join(lNumbers))

Prints the following output:

 amar 1000 | 1000 | 1000
 akbar 1000
 john 0000 | 0100 | 0100
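One case the script above does not cover is a line that fails to match the pattern: `re.match` then returns `None` and the following attribute access raises an error. A minimal sketch of a guarded variant follows. The function name `parse_record` and the choice to return `None` for unmatched lines are assumptions made for illustration, not part of the accepted answer.

```python
import re

# Groups: 1 = name, 2 = contents of [...], 3 = contents of {...}
RECORD = re.compile(r"<(\w+)>\[(.*?)\].*?\{(.*?)\}")

def parse_record(line):
    """Return 'name n1 | n2 | ...' for one record, or None if the line does not match."""
    match = RECORD.match(line)
    if match is None:
        return None
    # Collect the digit runs from the list part and the dictionary part, in order.
    numbers = []
    for part in match.group(2, 3):
        numbers.extend(re.findall(r"\d+", part))
    return "{} {}".format(match.group(1), " | ".join(numbers))

lines = [
    "<amar>[amar-1000#Fem$$$_Y](1){india|1000#Fem$$$,mumbai|1000#Mas$$$}",
    "<akbar>[akbar-1000#Fem$$$_Y](1){}",
    "not a record",
]
results = [parse_record(line) for line in lines]
for result in results:
    if result is not None:
        print(result)
```

Skipping unmatched lines silently is only one option; logging them or raising a descriptive error may suit a real data file better.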
