Database Design - Search for Radical Change

I have a relational database that stores information about the plant. There is plant_id, a category of plant that refers to a table of categories, etc. The design is not bad, but I came across a task that is quite interesting. Data should be automatically populated by users.

An ideal scenario is a user who logs onto a website and uploads their data set, usually in text format (for example, CSV or another Windows document), and this data is automatically inserted into the database of my factories.

So, I basically issue the specification; the user must set their .csv columns for a specific name so that the script can communicate with the fields of my database. There are many such fields that relate to the general attributes of the plant ... for example, the height of the plant or the color of the plant. This information may be height using a metric system or using another system. The color may be something like brownish, which is actually the same as saying brown. There are hundreds of examples.

I am trying to automatically correct these inconsistencies so that the brown attribute is snapped to the actual brown color. This link is important because people using the site will want to find plants of a certain color, and now I need to go through and manually link all these different attributes.

How do you fix this and avoid manual intervention?

+6
mysql database-design
source share
4 answers

ABOUT! such an interesting bit of work.

You need to write a grammar and a parser that can update the data according to the grammar. What you can do is do an initial study of the final data set and create some grammar and try loading the data. If something fails, write it down with details about the grammar, expected value and actual value, for example, say something like this "plant color mismatch, expected" brown (or BROWN, brownish) and found "borwn", please , update the input or add β€œborwn” to the grammar. ”This will tell you what to do in the future. In addition, you can plan to automatically add such things to the grammar by looking at some templates. But initially it may be manual. Therefore, you will have time from the time Eni and logs to update the grammar.

Later, if you feel comfortable, you can open the interface for updating the grammar.

I speak very positively, but I know that there will be many problems. Please return your problems.

+3
source share

If you want to accept all these values ​​for the plant height - 3 mm, 3 cm, 3 inches, 3 ", 3 ', 3 - then what you want to do is impossible. There is no automated way to determine if 3' means" 3 cm "or 3 inches

There are several different ways to solve this problem. But the approach you take depends in part on who is interested in the data.

If the data that your users upload is mostly used by the users themselves, they will endure more work for you. If the data is mainly used for others, it will not.

You may be mean or liberal in what you accept. The more mean you are, the more automated your processes are.

Perhaps you can be really stingy by providing your users an Excel spreadsheet with validation .

If you need to be very liberal in what you accept, you probably need several tables to map one standard value to many non-standard values. A table for displaying colors may look like this:

Standard User-supplied -- brown brown brown brownish brown brn brown puce 

First you need to fill in this table manually. But the more data you process, the more likely it is that a custom color will already be present in the table.

This table should not exist in the database. It can be external, and the data cleaning process can be external. The β€œT” part for ETL for data warehouses is sometimes done this way. (β€œETL” means β€œExtract, Convert, and Download.”)

+2
source share

Even I was thinking about something like this ..... maybe this link can help http://magia3e.wordpress.com/2007/05/29/semantic-analysis-making-sense-of-the-chaos -of-free-text /

0
source share

I agree with Catcall's answer that the typed color map matches real colors. Perhaps for your first iteration, you can do some research to come up with some of these meanings, starting with words that contain a common color name (for example, your brown BROWNish example). Also, for other columns, I'm sure there is a finite set that you can find by analyzing the data. For example, heights cm, inches, feet, etc. It's not that much.

In the end, for each column, you will need a finite set of descriptors. Perhaps they can be nested (red can be the parent of red, burgundy, lilac, burgundy). What you could embed in the application upon import would be a function that parses the imported columns and offers suggestions to match the final sets. If this is truly a community-based site, you can allow users to suggest new options for this column. Very similar to the stackoverflow tagging function.

The guy xkcd conducted a fun experiment with colors, not sure if this data is available or useful - http://blog.xkcd.com/2010/05/03/color-survey-results/ .

Good luck

0
source share

All Articles