I'm not sure how to explain this, so I'll just start with an example.
Given the following data:
Apple Apricot Blackberry Blueberry Cherry Crabapple Cranberry Elderberry Grapefruit Grapes Kiwi Mulberry Nectarine Pawpaw Peach Pear Plum Raspberry Rhubarb Strawberry
I want to create an index based on the first letter of my data, but I want the letters to be grouped together.
Here is the frequency of the first letters in the dataset above:
2 A, 2 B, 3 C, 1 E, 2 G, 1 K, 1 M, 1 N, 4 P, 2 R, 1 S
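These counts can be reproduced with a short tally over the first letters. This is a minimal sketch, assuming the items are already loaded into an array; the helper name first_letter_counts is hypothetical:

```perl
use strict;
use warnings;

# Count how many items start with each letter (case-insensitive).
# first_letter_counts is a hypothetical helper name.
sub first_letter_counts {
    my @items = @_;
    my %count;
    for my $item (@items) {
        next unless $item =~ /^([A-Za-z])/;    # skip blank or non-letter entries
        $count{ uc $1 }++;
    }
    return \%count;
}

my @fruit = qw(
    Apple Apricot Blackberry Blueberry Cherry Crabapple Cranberry
    Elderberry Grapefruit Grapes Kiwi Mulberry Nectarine Pawpaw
    Peach Pear Plum Raspberry Rhubarb Strawberry
);

my $count = first_letter_counts(@fruit);

# prints: 2 A, 2 B, 3 C, 1 E, 2 G, 1 K, 1 M, 1 N, 4 P, 2 R, 1 S
print join( ', ', map { "$count->{$_} $_" } sort keys %$count ), "\n";
```

Only letters that actually occur end up as keys, so letters without data are simply absent from the hash.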
Since my sample dataset is small, let's say that a single range may cover at most 3 items. Using the data above, this would be my index:
A B C D-G H-O P Q-Z
By clicking on the link "D-G", you would see:
Elderberry Grapefruit Grapes
In the listing above, I cover the complete alphabet. I don't think that is strictly necessary; I would be fine with this output too:
A B C E-G K-N P R-S
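One way to read this compact variant is as a single greedy pass: walk the letters that have data in alphabetical order, and keep extending the current range while its running total stays within the maximum. The following is a minimal sketch under that assumption, not a definitive solution; build_ranges and the shape of the range records are hypothetical:

```perl
use strict;
use warnings;

# Greedily merge consecutive letters (only those that have data) into ranges
# whose combined count stays within $max. A letter whose own count exceeds
# $max still gets a range to itself. build_ranges is a hypothetical name.
sub build_ranges {
    my ( $count, $max ) = @_;
    my @ranges;    # each entry: { first => 'E', last => 'G', total => 3 }
    for my $letter ( sort keys %$count ) {
        my $n       = $count->{$letter};
        my $current = $ranges[-1];    # undef while no range exists yet
        if ( $current && $current->{total} + $n <= $max ) {
            $current->{last} = $letter;    # extend the current range
            $current->{total} += $n;
        }
        else {
            push @ranges, { first => $letter, last => $letter, total => $n };
        }
    }

    # Turn each range into a label: "P" for one letter, "E-G" for a span.
    my @labels;
    for my $r (@ranges) {
        push @labels,
            $r->{first} eq $r->{last} ? $r->{first} : "$r->{first}-$r->{last}";
    }
    return @labels;
}

my %count = (
    A => 2, B => 2, C => 3, E => 1, G => 2, K => 1,
    M => 1, N => 1, P => 4, R => 2, S => 1,
);

print join( ' ', build_ranges( \%count, 3 ) ), "\n";    # A B C E-G K-N P R-S
```

All state lives in the last element of @ranges, so no global counters or "previous letter" variables are needed. Covering the complete alphabet would need an extra rule for which neighbour absorbs the empty letters, which this sketch does not attempt.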
Obviously, my real dataset is not a list of fruit; it will be larger (about 1000-2000 items), and my "maximum per range" will be more than 3.
I'm also not too worried about lopsided data: if 40% of my data starts with "S", then S will simply get its own link. I do not need to break it down by the second letter.
Since my dataset will not change very often, I would be fine with a static "maximum per range", but it would be nice to have it calculated dynamically. Also, no item will start with a digit; every item is guaranteed to start with a letter from A to Z.
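One possible way to derive the maximum dynamically is to divide the total item count by the number of index links you would like to end up with. This is only a heuristic sketch; max_per_range and the target of 7 links are assumptions, not something the requirements above specify:

```perl
use strict;
use warnings;
use POSIX qw(ceil);

# Derive a "maximum per range" by spreading all items evenly over a desired
# number of index links. The helper name and the target value are assumptions.
sub max_per_range {
    my ( $count, $target_links ) = @_;
    die "target_links must be at least 1\n" if $target_links < 1;

    my $total = 0;
    $total += $_ for values %$count;

    my $max = ceil( $total / $target_links );
    return $max < 1 ? 1 : $max;    # never return 0 for an empty dataset
}

my %count = (
    A => 2, B => 2, C => 3, E => 1, G => 2, K => 1,
    M => 1, N => 1, P => 4, R => 2, S => 1,
);

print max_per_range( \%count, 7 ), "\n";    # 20 items over 7 links: prints 3
```

Because a heavy letter still takes a link of its own, the final index can contain more links than the target.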
I started building an algorithm for this, but it keeps getting so messy that I start over. I don't know how to search Google for this, because I'm not sure what this technique is called.
Here is what I started with:
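```perl
#!/usr/bin/perl
use strict;
use warnings;

# Start every letter at zero so that letters without data still appear.
my $index_frequency = { map { ( $_, 0 ) } ( 'A' .. 'Z' ) };
my $ranges = {};

# The handle must be declared with "my"; a bare $DATASET fails under strict.
open( my $DATASET, '<', 'mydata' ) || die "Cannot open data file: $!\n";

# Count the items by their upper-cased first letter.
while ( my $item = <$DATASET> ) {
    chomp($item);
    my $first_letter = uc( substr( $item, 0, 1 ) );
    $index_frequency->{$first_letter}++;
}

foreach my $letter ( sort keys %{$index_frequency} ) {
    if ( $index_frequency->{$letter} ) {
```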
My problem is that I keep ending up with a bunch of global variables to track counts and previous letters, and my code gets very messy.
Can someone give me a push in the right direction? I guess this is more of an algorithm question, so if you can't do this in Perl, pseudocode will work just as well; I think I can convert it to Perl.
Thanks in advance!