How to parse a file, create records and manipulate records, including the frequency of timing and distance calculations

Question

How to parse a file, create records and manipulate records, including the frequency of timing and distance calculations

I participate in the Perl introclass, looking for suggestions and feedback on my approach to writing a small (but complex) program that analyzes atom data. My professor encourages forums. I am not promoting myself with subsets or Perl modules (including Bioperl), so please limit your answers to the corresponding “beginner level” so that I can understand and learn from your sentences and / or code (also indicate “Magic”).

The program requirements are as follows:

Read the file (containing Atoms data) from the command line and create an array of atom records (one record / atom per new line). For each entry, the program will have to store:
• Atom serial number (cols 7-11)
• The three-letter name of the amino acid to which it refers (cols 18-20)
• Atom three coordinates (x, y, z) (cols 31 - 54)
• Atomic one- or two-letter element name (for example, C, O, N, Na) (cols 77-78)
Request one of three commands: frequency, length, density d (d is a certain number):
• freq - how many of each type of atom are in the file (example Nitrogen, sodium, etc. will be displayed as follows: N: 918 S: 23
• length - distance between coordinates
• density d (where d is a number) - the program will prompt you to specify a file name to save the calculations and will contain the distance between this atom and each other atom. If this distance is less than or equal to the number d, it increases the counter of the number of atoms located at this distance, unless this count is zero in the file. The result will look something like this: 1: 5
2: 3
3: 6
... (a very large file) and will close when it ends.

I am looking for feedback on what I wrote (and should write) in the code below. I especially appreciate any feedback on how to write my subscribers. I have included sample input data below.

The structure and description of the structure of the program, as I see it:

$^W = 1; # turn on warnings use strict; # behave! my @fields; my @recs; while ( <DATA> ) { chomp; @fields = split(/\s+/); push @recs, makeRecord(@fields); } for (my $i = 0; $i < @recs; $i++) { printRec( $recs[$i] ); } my %command_table = ( freq => \&freq, length => \&length, density => \&density, help => \&help, quit => \&quit ); print "Enter a command: "; while ( <STDIN> ) { chomp; my @line = split( /\s+/); my $command = shift @line; if ($command !~ /^freq$|^density$|length|^help$|^quit$/ ) { print "Command must be: freq, length, density or quit\n"; } else { $command_table{$command}->(); } print "Enter a command: "; } sub makeRecord # Read the entire line and make records from the lines that contain the # word ATOM or HETATM in the first column. Not sure how to do this: { my %record = ( serialnumber => shift, aminoacid => shift, coordinates => shift, element => [ @_ ] ); return\%record; } sub freq # take an array of atom records, return a hash whose keys are # distinct atom names and whose values are the frequences of # these atoms in the array. sub length # take an array of atom records and return the max distance # between all pairs of atoms in that array. My instructor # advised this would be constructed as a for loop inside a for loop. sub density # take an array of atom records and a number d and will return a # hash whose keys are atom serial numbers and whose values are # the number of atoms within that distance from the atom with that # serial number. sub help { print "To use this program, type either\n", "freq\n", "length\n", "density followed by a number, d,\n", "help\n", "quit\n"; } sub quit { exit 0; } # truncating for testing purposes. Actual data is aprox. 100 columns # and starts with ATOM or HETATM. __DATA__ ATOM 4743 CG GLN A 704 19.896 32.017 54.717 1.00 66.44 C ATOM 4744 CD GLN A 704 19.589 30.757 55.525 1.00 73.28 C ATOM 4745 OE1 GLN A 704 18.801 29.892 55.098 1.00 75.91 O

+6

perl hash bioinformatics

Koala Dec 04 '10 at 0:12

source share

2 answers

The FM answer is pretty good. I just mentioned a couple of additional things:

You already have a hash with valid commands (which is a good idea). There is no need to duplicate this list in regular expression. I would do something like this:

 if (my $routine = $command_table{$command}) { $routine->(@line); } else { print "Command must be: freq, length, density or quit\n"; }

Notice that I also pass @line to the subroutine because you will need this for the density command. Routines that take no arguments can simply ignore them.

You can also generate a list of valid commands for reporting an error using keys %command_table , but I will leave this as an exercise for you.

Another thing is that in the description of the input file column numbers are mentioned, which indicates that it has a fixed-width format. This is better analyzed using substr or unpack . If the field is always empty or contains a space, then your split will not parse it correctly. (If you are using substr , keep in mind that the number of columns starts at 0, when people often label the first column 1.)

+4

cjm Dec 04 '10 at 1:27

source share

Fmc · Accepted Answer · 2010-12-04T01:00:16+0000

It looks like your Perl skills are doing well - using links and complex data structures. Here are some tips and tricks.

Enable warnings with use warnings , not $^W = 1 . The first is self-documenting and has the advantage that it is local to the closing block, and not a global setting.
Use well-known variables to help document program behavior, rather than relying on Perl special $_ . For example:
```
 while (my $input_record = <DATA>){ } 
```
In user input scenarios, an infinite loop provides the ability to avoid repeated instructions such as "Enter a command." See below.
Your regex can be simplified to avoid having to re-bind. See below.
Affirmative tests are generally easier to understand than negative tests. See the modified if-else structure below.
Include each part of the program in your own routine. This is a good general practice for a number of reasons, so I would just start this habit.
A related good practice is to minimize the use of global variables. As an exercise, you can try to write a program so that it does not use global variables at all. Instead, any necessary information will be passed between subprograms. With small programs, you don’t have to be tough on avoiding globals, but it’s not a bad idea to keep the ideal in mind.
Give your routine length different name. This name is already used by the built-in function length .
Regarding your question about makeRecord , one of them is to ignore the filtering problem inside makeRecord . Instead, makeRecord may contain an additional hash field, and the filtering logic will be somewhere else. For example:
```
 my $record = makeRecord(@fields); push @recs, $record if $record->{type} =~ /^(ATOM|HETATM)$/; 
```

An illustration of some of the points above:

 use strict; use warnings; run(); sub run { my $atom_data = load_atom_data(); print_records($atom_data); interact_with_user($atom_data); } ... sub interact_with_user { my $atom_data = shift; my %command_table = (...); while (1){ print "Enter a command: "; chomp(my $reply = <STDIN>); my ($command, @line) = split /\s+/, $reply; if ( $command =~ /^(freq|density|length|help|quit)$/ ) { # Run the command. } else { # Print usage message for user. } } } ...

How to parse a file, create records and manipulate records, including the frequency of timing and distance calculations

More articles: