How can I generate a range of ranges from the first letters of a word list in Perl?

Question

How can I generate a range of ranges from the first letters of a word list in Perl?

I'm not sure how to explain this, so I'll just start with an example.

Given the following data:

Apple Apricot Blackberry Blueberry Cherry Crabapple Cranberry Elderberry Grapefruit Grapes Kiwi Mulberry Nectarine Pawpaw Peach Pear Plum Raspberry Rhubarb Strawberry

I want to create an index based on the first letter of my data, but I want the letters to be grouped together.

Here is the frequency of the first letters in the dataset above:

  2 A 2 B 3 C 1 E 2 G 1 K 1 M 1 N 4 P 2 R 1 S

Since my sample data set is small, let's just say that the maximum number to combine letters is 3. Using the data above, this will be my index:

 ABC DG HO P QZ

By clicking on the link "DG", you will see:

 Elderberry Grapefruit Grapes

In my listing above, I cover the complete alphabet - I think this is not entirely necessary - I will be fine with this output too:

 ABC EG KN P RS

Obviously, my dataset is not a fetus, I will have more data (about 1000-2000 elements), and my "maximum per range" will be more than 3.

I'm also not too worried about one-way data, so if 40% of my data starts with "S", then S will only have its own link - I do not need to break it into the second letter in the data.

Since my dataset will not change too often, I will be fine with a static "maximum per range", but it would be nice to have it calculated dynamically. In addition, the data set will not start with numbers - it is guaranteed to start with the letter AZ.

I started building an algorithm for this, but it keeps getting so dirty that I start over. I don’t know how to search Google for this - I’m not sure if this method is being called.

Here is what I started with:

 #!/usr/bin/perl use strict; use warnings; my $index_frequency = { map { ( $_, 0 ) } ( 'A' .. 'Z' ) }; my $ranges = {}; open( $DATASET, '<', 'mydata' ) || die "Cannot open data file: $!\n"; while ( my $item = <$DATASET> ) { chomp($item); my $first_letter = uc( substr( $item, 0, 1 ) ); $index_frequency->{$first_letter}++; } foreach my $letter ( sort keys %{$index_frequency} ) { if ( $index_frequency->{$letter} ) { # build $ranges here } }

My problem is that I continue to use a bunch of global variables to track counts and previous letters - my code is very messy.

Can someone give me a step in the right direction? I guess this is more of an algorithm issue, so if you don’t have the ability to do this in Perl, the pseudocode will work as well, I think I can convert it to Perl.

Thanks in advance!

+4

algorithm perl range

Brianh Sep 2 '09 at 15:13

source share

5 answers

Here is my suggestion:

 # get the number of instances of each letter my %count = (); while (<FILE>) { $count{ uc( substr( $_, 0, 1 ) ) }++; } # transform the list of counts into a map of count => letters my %freq = (); while (my ($letter, $count) = each %count) { push @{ $freq{ $count } }, $letter; } # now print out the list of letters for each count (or do other appropriate # output) foreach (sort keys %freq) { my @sorted_letters = sort @{ $freq{$_} }; print "$_: @sorted_letters\n"; }

Update . I think I misunderstood your requirements. The following block of code does something more like what you want.

 my %count = (); while (<FILE>) { $count{ uc( substr( $_, 0, 1 ) ) }++; } # get the maximum frequency my $max_freq = (sort values %count)[-1]; my $curr_set_count = 0; my @curr_set = (); foreach ('A' .. 'Z') { push @curr_set, $_; $curr_set_count += $count{$_}; if ($curr_set_count >= $max_freq) { # print out the range of the current set, then clear the set if (@curr_set > 1) print "$curr_set[0] - $curr_set[-1]\n"; else print "$_\n"; @curr_set = (); $curr_set_count = 0; } } # print any trailing letters from the end of the alphabet if (@curr_set > 1) print "$curr_set[0] - $curr_set[-1]\n"; else print "$_\n";

+2

JSB ձոգչ Sep 2 '09 at 15:53

source share

Try something similar, where frequency is the frequency array that you calculated in the previous step, and threshold_low is the minimum number of records in the range, and threshold_high is the max. number. This should give harmonious results.

 count=0 threshold_low=3 threshold_high=6 inrange=false frequency['Z'+1]=threshold_high+1 for letter in range('A' to 'Z'): count += frequency[letter]; if (count>=threshold_low or count+frequency[letter+1]>threshold_high): if (inrange): print rangeStart+'-' print letter+' ' inrange=false count=0 else: if (not inrange) rangeStart=letter inrange=true

+1

redtuna Sep 2 '09 at 15:33

source share

 use strict; use warnings; use List::Util qw(sum); my @letters = ('A' .. 'Z'); my @raw_data = qw( Apple Apricot Blackberry Blueberry Cherry Crabapple Cranberry Elderberry Grapefruit Grapes Kiwi Mulberry Nectarine Pawpaw Peach Pear Plum Raspberry Rhubarb Strawberry ); # Store the data by starting letter. my %data; push @{$data{ substr $_, 0, 1 }}, $_ for @raw_data; # Set max page size dynamically, based on the average # letter-group size (in this case, a multiple of it). my $MAX_SIZE = sum(map { scalar @$_ } values %data) / keys %data; $MAX_SIZE = int(1.5 * $MAX_SIZE + .5); # Organize the data into pages. Each page is an array reference, # with the first element being the letter range. my @pages = (['']); for my $letter (@letters){ my @d = exists $data{$letter} ? @{$data{$letter}} : (); if (@{$pages[-1]} - 1 < $MAX_SIZE or @d == 0){ push @{$pages[-1]}, @d; $pages[-1][0] .= $letter; } else { push @pages, [ $letter, @d ]; } } $_->[0] =~ s/^(.).*(.)$/$1-$2/ for @pages; # Convert letters to range.

+1

Fmc Sep 2 '09 at 20:02

source share

This is an example of how I will write this program.

 #! /opt/perl/bin/perl use strict; use warnings; my %frequency; { use autodie; open my $data_file, '<', 'datafile'; while( my $line = <$data_file> ){ my $first_letter = uc( substr( $line, 0, 1 ) ); $frequency{$first_letter} ++ } # $data_file is automatically closed here } #use Util::Any qw'sum'; use List::Util qw'sum'; # This is just an example of how to calculate a threshold my $mean = sum( values %frequency ) / scalar values %frequency; my $threshold = $mean * 2; my @index; my @group; for my $letter ( sort keys %frequency ){ my $frequency = $frequency{$letter}; if( $frequency >= $threshold ){ if( @group ){ if( @group == 1 ){ push @index, @group; }else{ # push @index, [@group]; # copy @group push @index, "$group[0]-$group[-1]"; } @group = (); } push @index, $letter; }elsif( sum( @frequency{@group,$letter} ) >= $threshold ){ if( @group == 1 ){ push @index, @group; }else{ #push @index, [@group]; push @index, "$group[0]-$group[-1]" } @group = ($letter); }else{ push @group, $letter; } } #push @index, [@group] if @group; push @index, "$group[0]-$group[-1]" if @group; print join( ', ', @index ), "\n";

0

Brad gilbert Sep 2 '09 at 16:48

source share

user80168 · Accepted Answer · 2009-09-02T16:00:01+0000

The main approach:

 #!/usr/bin/perl -w use strict; use autodie; my $PAGE_SIZE = 3; my %frequencies; open my $fh, '<', 'data'; while ( my $l = <$fh> ) { next unless $l =~ m{\A([az])}i; $frequencies{ uc $1 }++; } close $fh; my $current_sum = 0; my @letters = (); my @pages = (); for my $letter ( "A" .. "Z" ) { my $letter_weigth = ( $frequencies{ $letter } || 0 ); if ( $letter_weigth + $current_sum > $PAGE_SIZE ) { if ( $current_sum ) { my $title = $letters[ 0 ]; $title .= '-' . $letters[ -1 ] if 1 < scalar @letters; push @pages, $title; } $current_sum = $letter_weigth; @letters = ( $letter ); next; } push @letters, $letter; $current_sum += $letter_weigth; } if ( $current_sum ) { my $title = $letters[ 0 ]; $title .= '-' . $letters[ -1 ] if 1 < scalar @letters; push @pages, $title; } print "Pages : " . join( " , ", @pages ) . "\n";

The problem is what it infers (from your data):

 Pages : A , B , CD , EJ , KO , P , QZ

But I would say that this is actually a good approach :) And you can always change the for loop to:

 for my $letter ( sort keys %frequencies ) {

if you need.

How can I generate a range of ranges from the first letters of a word list in Perl?

More articles: