May use some help with this soundex coding

Question

May use some help with this soundex coding

The US Census Bureau uses a special encoding called "soundex" to search for information about a person. Soundex is an encoding of surnames (surnames) based on how the surname sounds, and not how it is written. Surnames that sound the same but differ in different ways, for example SMITH and SMYTH, have the same code and are filed together. The soundex coding system was designed so that you can find the last name, even if it may have been written under various spellings.

In this lab, you will develop, code, and document a program that creates a soundex code when entering with a last name. The user will be asked to indicate the last name, and the program should display the appropriate code.

Basic Soundex Encoding Rules

Each sound coding of a surname consists of a letter and three numbers. The letter used is always the first letter of the last name. Numbers are assigned to the remaining letters of the surname in accordance with the soundex manual below. Zero values are added at the end, if necessary, to always create a four-character code. Additional letters are not counted.

Soundex Coding Guide

Soundex assigns a number for different consonants. Consonants that sound the same are assigned the same number:

Number of consonants

1 B, F, P, V 2 C, G, J, K, Q, S, X, Z 3 D, T 4 L 5 M, N 6 R

Soundex ignores the letters A, E, I, O, U, H, W, and Y.

There are 3 additional rules for encoding sounds. A good program design will implement them as one or more separate functions.

Rule 1. Names with double letters

If the surname has double letters, they should be considered as one letter. For example:

Gutierrez is encoded by G362 (G, 3 for T, 6 for the first R, second R is ignored, 2 for Z).

Rule 2. Names with letters side by side that have the same Soundex code

If the last name has different letters side by side, which have the same number in the audio coding guide, they should be treated as one letter. Examples:

Pfister is encoded as P236 (P, F is ignored because it is considered the same as P, 2 for S, 3 for T, 6 for R).
Jackson is encoded as J250 (J, 2 for C, K is ignored just like C, S is ignored just like C, 5 for adding N, 0).

Rule 3. Consonant Separators

3.a. If the vowel (A, E, I, O, U) separates two consonants that have the same soundex code, the consonant is encoded to the right of the vowel. Example:

Tymczak is encoded as T-522 (T, 5 for M, 2 for C, Z is ignored (see the Side by Side rule above), 2 for K). Since the vowel "A" separates Z and K, K. is encoded.

3.b. If "H" or "W" is shared by two consonants that have the same soundex code, the consonant on the right is not encoded. Example:

* Ashcraft is encoded by A261 (A, 2 for S, C is ignored just like S with H in between, 6 for R, 1 for F). It is not encoded A226.

So far this is my code:

surname = raw_input("Please enter surname:") outstring = "" outstring = outstring + surname[0] for i in range (1, len(surname)): nextletter = surname[i] if nextletter in ['B','F','P','V']: outstring = outstring + '1' elif nextletter in ['C','G','J','K','Q','S','X','Z']: outstring = outstring + '2' elif nextletter in ['D','T']: outstring = outstring + '3' elif nextletter in ['L']: outstring = outstring + '4' elif nextletter in ['M','N']: outstring = outstring + '5' elif nextletter in ['R']: outstring = outstring + '6' print outstring

The code does enough of what they are asking for, I'm just not sure how to encode the three rules. This is where I need help. So any help is appreciated.

+2

python soundex

user189375 Oct 13 '09 at 19:31

source share

2 answers

mjv · Answer 1 · 2009-10-13T19:41:01+0000

A few tips:

Using an array in which each Soundex code is stored and indexed by the ASCII value (or value in a shorter numeric range derived from this) of the corresponding letter, you will both make the code more efficient and more readable. This is a very common method : understand, use and reuse; -)
When you parse an input string, you need to track (or compare) with a previously processed letter to ignore duplicate letters and handle other rules. (implement each of them in a separate function, as outlined in the entry). The idea could be to introduce a function responsible for - adding the soundex code for the current letter of the input being processed. This function, in turn, will call each of the "rules" functions, possibly discarding earlier, based on the return values of some rules. In other words, replace the systematic ...

  outstring = outstring + c # btw could be + =
 ... with
     outstring + = AppendCodeIfNeeded (c)

Beware that this multifunctional structure is redundant for such trivial logic, but it is not a bad idea to do this for practice.

steveha · Answer 2 · 2009-10-13T20:01:38+0000

Here are some quick tips on general Python stuff.

0) You can use a for loop to loop through any sequence, and a line is considered a sequence. Therefore, you can write:

 for nextletter in surname[1:]: # do stuff

This is easier to write and easier to understand than calculating an index and indexing a surname.

1) You can use the += operator to add lines. Instead

 x = x + 'a'

You can write

 x += 'a'

As for help on your specific problem, you will need to track the previous letter. If your job had a rule stating that the two "z" characters in a string should be encoded as 99 ", you can add this code:

 def rule_two_z(prevletter, curletter): if prevletter.lower() == 'z' and curletter.lower() == 'z': return 99 else: return -1 prevletter = surname[0] for curletter in surname[1:]: code = rule_two_z(prevletter, curletter) if code < 0: # do something else here outstring += str(code) prevletter = curletter

Hmmm, you wrote your code to return integer strings like '3' , whereas I wrote your code to return the actual integer, and then call str() on it before adding it to the string. Anyway, maybe good.

Good luck

May use some help with this soundex coding

More articles: