Normalizing a table with a low degree of integrity

I was given a table of about 18,000 rows. Each row describes the location of one client. The problem is that whoever created the table did not add a "Company Name" field, only a "Location Name" field, and one company can have many locations.

For example, here are a few entries describing the same client:

Location table

 ID  Location_Name     
 1   TownShop#1        
 2   Town Shop - Loc 2 
 3   The Town Shop     
 4   TTS - Someplace   
 5   Town Shop,the 3   
 6   Toen Shop4        

My goal is to do this:

Location table

 ID  Company_ID   Location_Name     
 1   1            TownShop#1        
 2   1            Town Shop - Loc 2 
 3   1            The Town Shop     
 4   1            TTS - Someplace   
 5   1            Town Shop,the 3   
 6   1            Toen Shop4        

Company table

 Company_ID  Company_Name  
 1           The Town Shop 

There is no "Company" table, so I will need to generate the list of company names myself, taking the most descriptive or best location name to represent each group of locations.

Currently, I think I need to generate a list of location names that are similar, and then go through that list manually.
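One way to produce such a list of similar names might look like the sketch below. It is only an illustration under several assumptions: the cleanup rules in `normalize` (splitting CamelCase, dropping digits, punctuation and the filler words "the" and "loc"), the 0.8 similarity threshold, and the function names are all hypothetical choices, and Python's standard `difflib.SequenceMatcher` stands in for whatever similarity measure you prefer.

```python
import re
from difflib import SequenceMatcher

# The six sample names from the question.
NAMES = [
    "TownShop#1",
    "Town Shop - Loc 2",
    "The Town Shop",
    "TTS - Someplace",
    "Town Shop,the 3",
    "Toen Shop4",
]

def normalize(name):
    """Reduce a location name to a rough comparison key (assumed rules)."""
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", name)   # TownShop -> Town Shop
    letters = re.sub(r"[^a-z ]+", " ", spaced.lower())   # drop digits/punctuation
    words = [w for w in letters.split() if w not in {"the", "loc"}]
    return " ".join(words)

def group_similar(names, threshold=0.8):
    """Greedy grouping: a name joins the first group whose seed key is similar enough."""
    groups = []  # list of (seed_key, [original names])
    for name in names:
        key = normalize(name)
        for seed, members in groups:
            if SequenceMatcher(None, seed, key).ratio() >= threshold:
                members.append(name)
                break
        else:
            groups.append((key, [name]))
    return [members for _, members in groups]

groups = group_similar(NAMES)
for members in groups:
    print(members)
```

On the six sample names this puts the "Town Shop" variants (including the misspelled "Toen Shop4") in one group and leaves "TTS - Someplace" on its own, so abbreviations would still need the manual pass.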

, .

@Neall, , , , , , . , "repcount" 1 .

@yukondude, 4 - .

+5

, , , ? , , , Levenshtein algo, CompanyNames LocationNames.


, , .

... :

  • CompanyNames, , . . .
  • () , .
  • CompanyName LocationName ( Levenshtein - ). .
  • , MatchScore < CompanyName.
  • LocationNames by CompanyName | | MatchScore, , . MatchScore .
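The Levenshtein comparison mentioned in the list above can be sketched roughly as follows. This is an illustration only: the function names are hypothetical, and defining MatchScore as the raw edit distance between lowercased names (lower meaning closer) is an assumption, not necessarily the scoring the answer has in mind.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]

def match_score(company_name, location_name):
    """Assumed MatchScore: edit distance on lowercased names (lower = closer)."""
    return levenshtein(company_name.lower(), location_name.lower())

def rank_locations(company_name, location_names):
    """Return (MatchScore, LocationName) pairs, best matches first."""
    return sorted((match_score(company_name, loc), loc) for loc in location_names)

ranked = rank_locations("The Town Shop", ["Toen Shop4", "The Town Shop", "TTS - Someplace"])
for score, loc in ranked:
    print(score, loc)
```

Sorting the location names by candidate company name and score in this way puts the likely matches next to each other for review.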

. , , , 18K .

0

. - . . " ". :

SELECT count(*) AS repcount, "Location Name" FROM mytable
 WHERE "Company Name" IS NULL
 GROUP BY "Location Name"
 ORDER BY repcount DESC
 LIMIT 5;

, , UPDATE... WHERE "Location Name" = "Location".

P.S. - .

: - - ? ?

+1

- , , ( ..), .

Amazon Mechanical Turk .

0

, , Company, company_id "Location", Company, , id. ( 18 000 , varchar).

- Location. , - :

  • Company , ( ).
  • .
  • , company_id, , NULL ( ), Company.id.
  • , row_id . , . , , , , .
  • , Location company_id, ALTER Company, NOT NULL company_id ( , , ).

, SQL company_id. , script .

0

, 4 - doozy.

, , , , . , , company_id:

UPDATE  Location
SET     Company_ID = 1
WHERE   (LOWER(Location_Name) LIKE '%to_n shop%'
OR      LOWER(Location_Name) LIKE '%tts%')
AND     Company_ID IS NULL;

I believe that this would be consistent with your examples (I added the IS NULL condition so as not to overwrite previously set Company_ID values), but of course, with 18,000 rows, you will have to be quite inventive to handle the various combinations.
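A quick way to check what such an UPDATE actually catches is to dry-run it on a small copy of the data. The sketch below does that with Python's built-in `sqlite3` module and an in-memory table; SQLite is used here only as a convenient stand-in (an assumption, since the thread does not say which database holds the real table), and the rows are the six sample names from the question's first table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Location ("
    "ID INTEGER PRIMARY KEY, Company_ID INTEGER, Location_Name TEXT)"
)
sample_rows = [
    (1, "TownShop#1"),
    (2, "Town Shop - Loc 2"),
    (3, "The Town Shop"),
    (4, "TTS - Someplace"),
    (5, "Town Shop,the 3"),
    (6, "Toen Shop4"),
]
conn.executemany("INSERT INTO Location (ID, Location_Name) VALUES (?, ?)", sample_rows)

# Same statement as above: "_" matches exactly one character, "%" any run.
cur = conn.execute("""
    UPDATE Location
    SET    Company_ID = 1
    WHERE  (LOWER(Location_Name) LIKE '%to_n shop%'
    OR      LOWER(Location_Name) LIKE '%tts%')
    AND    Company_ID IS NULL
""")
updated = cur.rowcount

# Rows still without a company are the ones the patterns missed.
unmatched = conn.execute(
    "SELECT ID, Location_Name FROM Location WHERE Company_ID IS NULL ORDER BY ID"
).fetchall()
print(updated, unmatched)
```

With these spellings the statement assigns five rows and leaves `TownShop#1` unassigned, because that name has no space before "Shop". Listing the remaining NULL rows after each pass is a simple way to see which patterns still need refining.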

Something else that could help would be to use company names to generate queries like the ones above. You can do something like the following (in MySQL):

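-- Build one UPDATE per company; spaces in the name become % wildcards.
-- Note: a company name containing a single quote would need escaping first.
SELECT  CONCAT('UPDATE Location SET Company_ID = ',
        Company_ID, ' WHERE LOWER(Location_Name) LIKE ''%',
        LOWER(REPLACE(Company_Name, ' ', '%')), '%'' AND Company_ID IS NULL;')
FROM    Company;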

Then just run the statements it generates. That can do a lot of the work for you.
