How to extract data from a text table using Python?

I have the following text and I would like to extract the data items and save them in Excel. Is there a way to do this in Python?

text = """ ANNUAL COMPENSATION LONG-TERM COMPENSATION --------------------------------------- ------------------------------------- AWARDS PAYOUTS -------------------------- ---------- SECURITIES OTHER RESTRICTED UNDERLYING ALL OTHER NAME AND PRINCIPAL ANNUAL STOCK OPTIONS/ LTIP COPMPENSA- POSITION YEAR SALARY ($) BONUS ($) COMPENSATION ($) AWARD(S) ($)(1) SAR (#) PAYOUTS($) TION($)(3) ------------------ ---- ---------- --------- ---------------- --------------- ---------- ---------- ---------- JOHN W. WOODS 1993 $595,000 $327,250 There is no $203,190.63 18,000 $ 29,295 Chairman, President, & 1992 $545,000 $245,250 compensation 166,287.50 18,825 (2) Not $ 29,123 Chief Executive Officer 1991 $515,000 $283,251 required to be 45,000 Applicable of AmSouth & AmSouth disclosed in Bank NA this column. C. STANLEY BAILEY 1993 $266,667(4) $133,333 117,012.50 4,500 $ 11,648 Vice Chairman, AmSouth 1992 $210,000 $ 84,000 42,400.00 4,800 $ 12,400 & AmSouth Bank NA 1991 $186,750 $ 82,170 161,280.00 9,750 C. DOWD RITTER 1993 $266,667(4) $133,333 117,012.50 4,500 $ 13,566 Vice Chairman, AmSouth 1992 $210,000 $ 84,000 42,400.00 4,800 $ 12,920 & AmSouth Bank NA 1991 $188,625 $ 82,995 161,280.00 9,750 WILLIAM A. POWELL, JR. 1993 $211,335 $ 95,101 11,000 $124,548 President, AmSouth 1992 $330,000 $132,000 98,050.00 11,100 $ 22,225 and Vice Chairman, 1991 $308,000 $169,401 24,000 AmSouth Bank NA Retired in 1993 A. FOX DEFUNIAK, III 1993 $217,000 $ 75,950 52,971.88 4,500 $ 11,122 Senior Executive Vice 1992 $200,000 $ 62,000 42,400.00 4,800 $ 11,240 President, Birmingham 1991 $177,500 $ 78,100 161,280.00 9,750 Banking Group, AmSouth Bank NAEW STEPHENSON, JR. 1993 $177,833 $ 71,133 52,971.88 3,400 $ 9,256 Senior Executive Vice 1992 $150,000 $ 45,000 27,825.00 3,150 $ 8,560 President, AmSouth 1991 $140,000 $ 52,488 107,520.00 6,750 and Chairman & Chief Executive Officer, AmSouth Bank of Florida """ 

Right now, I'm just trying to get it into a CSV-style format with a '|' character separating the data items, and then extract the fields I need manually:

 import re

 with open('tmp.txt', 'w') as tmp:
     tmp.write(text)

 data1 = []
 for line in open('tmp.txt'):
     line = line.lower()
     if 'salary' in line:  # must compare lowercase after lower()
         line = line.replace(' ', '|')
     line = line.replace('--', '')
     line = line.replace('- -', '')
     # strip footnote and unit markers
     for marker in ('(1)', '(2)', '(3)', '(4)', '(5)', '(6)', '(7)',
                    '(8)', '(9)', '(10)', '(11)', '(s)', '($)', '(#)'):
         line = line.replace(marker, '')
     line = line.replace('$', '')
     line = line.replace('-0-', '0')
     line = line.replace(')', '|')
     line = line.replace('(', '|-')
     line = re.sub(r'\s(\d)', r'|\1', line)
     line = line.replace(' ', '')
     line = line.replace('||', '|')
     data1.append(line)
 data = ''.join(data1)
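As an aside, the long chain of `replace` calls for the footnote and unit markers can be collapsed into a single regular expression. A minimal sketch (the helper name `strip_markers` is my own):

```python
import re

def strip_markers(line):
    """Remove footnote markers like (1)..(11) and unit markers (S), ($), (#) in one pass."""
    line = re.sub(r'\((?:\d{1,2}|[S$#])\)', '', line)
    line = line.replace('-0-', '0').replace('$', '')
    return line

print(strip_markers("1993 $266,667(4) $133,333"))  # -> 1993 266,667 133,333
```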

The problem is that I have to do this thousands of times, and each time I have to go through the table and pick out the elements I need. Is there a way to build a dictionary that tracks things like year, salary, bonus, other annual compensation, etc., for each person listed in the leftmost column?

3 answers

Here is the code to get you started:

 text = """JOHN ..."""  # text without the header

 # These can be inferred if necessary
 cols = [0, 24, 29, 39, 43, 52, 71, 84, 95, 109, 117]

 db = []
 row = []
 for line in text.strip().split("\n"):
     data = [line[cols[i]:cols[i + 1]] for i in range(len(cols) - 1)]
     if data[0][0] != " ":
         if row:
             db.append(row)
         row = [[cell] for cell in data]
     else:
         for i, c in enumerate(data):
             row[i].append(c)
 db.append(row)  # don't forget the last person
 print(db)

This will create a list with one element per person. Each element is a list of columns, and each column is a list of row values. This way, you can easily access different years or do something like printing a person's name and title:

 for person in db:
     print("Name:", person[0][0])
     print(" ".join(s.strip() for s in person[0][1:]))
     print()

Will yield:

 Name: JOHN W. WOODS
 Chairman, President, & Chief Executive Officer of AmSouth & AmSouth Bank NA

 Name: C. STANLEY ...
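To get the per-person dictionary the question asks for, the `db` structure can be re-keyed by name. A hedged sketch, assuming (from the `cols` offsets above) that column 1 is the year, column 2 the salary, and column 3 the bonus; adjust the indices if your columns differ:

```python
def db_to_dict(db):
    """Turn the nested db structure into {name: [ {year, salary, bonus}, ... ]}."""
    records = {}
    for person in db:
        name = person[0][0].strip()
        years = []
        for i, year in enumerate(person[1]):
            year = year.strip()
            if not year:
                continue  # continuation lines carry no year
            years.append({
                "year": year,
                "salary": person[2][i].strip(),
                "bonus": person[3][i].strip(),
            })
        records[name] = years
    return records
```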

You need to write a series of generators that process the data in stages, reducing the noise and complexity step by step.

This is not an easy problem to solve in any programming language.

 def strip_top(source_text):
     src = iter(source_text)
     for line in src:
         if line.rstrip().startswith("AWARDS"):
             next(src)
             break
     for line in src:
         yield line

 def columnize(source_text):
     """Assumes strip_top or similar to remove confusing extra headers."""
     for line in source_text:
         yield line[0:24], line[25:30], ...  # one slice for each column

 def collapse_headers(source_text):
     """Assumes columnize( strip_top() )."""
     src = iter(source_text)
     headings = [[] for i in range(9)]
     for line in src:
         if line[0] == "------------------":
             break
         for col in range(9):
             headings[col].append(line[col].strip())
     yield [" ".join(h) for h in headings]
     for line in src:
         yield line

 # etc.

Then your "main" program collects these transformations into a pipeline.

 with open("some file", "r") as text:
     for line in collapse_headers(columnize(strip_top(text))):
         # further cleanup?
         # actual processing

This allows you to "tune" each part of the conversion sequence separately.
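The generator-pipeline idea above can be seen in a minimal runnable form with a toy two-stage pipeline; the real `strip_top`/`columnize`/`collapse_headers` would follow the same pattern (the stage names `strip_blank` and `split_fixed` here are illustrative):

```python
def strip_blank(lines):
    """Stage 1: drop blank lines."""
    for line in lines:
        if line.strip():
            yield line

def split_fixed(lines, widths=(0, 5, 10)):
    """Stage 2: slice each line at fixed column offsets."""
    for line in lines:
        yield [line[widths[i]:widths[i + 1]].strip()
               for i in range(len(widths) - 1)]

raw = ["1993 595  ", "", "1992 545  "]
rows = list(split_fixed(strip_blank(raw)))
# rows == [["1993", "595"], ["1992", "545"]]
```

Each stage consumes the previous one lazily, so nothing is read into memory until the final loop pulls lines through the pipeline.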


Well, splitting the columns is easy: they are fixed-width, so you can do:

 cells = [rowtext[0:24], rowtext[25:29], ...] 

Splitting the rows is a bit more complicated. It looks like you can handle the headers separately and then check

 cells[0] == cells[0].upper() 

to see whether to start a new block of rows (i.e. when the first cell of the row is all uppercase, it is a block header). Of course, I'm assuming that your thousands of files all have roughly the same format.

Once you get the data into an easy-to-use format, manipulating it in Python is the easy part. You can put it all into a dictionary, or, if it is too large, write it to disk as a big CSV file or an sqlite database.
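For the CSV route, the stdlib `csv` module makes the dump a few lines. A sketch, assuming the rows have already been parsed into `(name, year, salary, bonus)` tuples (the filename and column set are illustrative):

```python
import csv

rows = [
    ("JOHN W. WOODS", 1993, 595000, 327250),
    ("C. STANLEY BAILEY", 1993, 266667, 133333),
]

with open("compensation.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "year", "salary", "bonus"])
    writer.writerows(rows)
```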

