Consider an e-commerce application with several stores. Each store owner can edit the catalog of goods of his store.
My current database schema is as follows:
item_names: id | name | description | picture | common(BOOL) items: id | item_name_id | picture | price | description | picture item_synonyms: id | item_name_id | name | error(BOOL)
Notes: error indicates misspelling (for example, "Erickson"). description and picture the item_names table are "global", which can be overridden by the "local" description and picture fields of the items table (if the store owner wants to provide a different image for the item). common helps to separate unique item names ("Jimmy Joe Cheese Pizza" from "Cheese Pizza")
I think the bright side of this scheme is:
Optimized synonym search and processing: I can query the item_names and item_synonyms with the name LIKE %QUERY% and get a list of item_name_id to be combined with the items table. (Examples of synonyms: "Sony Ericsson", "Sony Ericson", "X10", "X 10")
Autocomplete: Again a simple query on the item_names table. I can avoid using DISTINCT and minimize the number of options ("Sony Ericsson Xperia β’ X10", "Sony Ericsson - Xperia X10", "Xperia X10, Sony Ericsson")
Down side:
Overhead: When I insert an item, I ask for item_names to find out if this name exists. If not, I create a new entry. When deleting an item, I count the number of records with the same name. If this is the only element with this name, I delete the entry from the item_names table (only so that everything is clean, the accounts of possible erroneous representations). And updating is a combination of both.
Unknown element names: Store owners sometimes use sentences like "Harry Potter 1, 2 Books + CD + Magic Hat". There is something wrong that you have so much overhead to accommodate such cases. Perhaps this was the main reason why I am inclined to this scheme:
items: id | name | picture | price | description | picture
(... with item_names and item_synonyms as service tables that I could query)
- Is there a better scheme you would suggest?
- Do element names need to be normalized for autocomplete? Perhaps this is what Facebook does for the posts "School", "City"?
- Is the first scheme or the second one better / optimal for the search?
Thanks in advance!
References: (1) Does the name of the person normalize too far? , (2) Avoid DISTINCT
EDIT: If you enter two items with similar names, the administrator who sees this simply clicks "Make a synonym", which converts one of the names to the synonym of the other. I do not need a way to automatically detect if the name entered is synonymous with another. I hope autocomplete takes care of 95% of such cases. As the table grows in size, the need for βMake a synonymβ will decrease. Hope clears up the confusion.
UPDATE: For those who would like to know what I went with ... I went with the second scheme, but deleted the item_names and item_synonyms in the hope that Solr will give me the opportunity to complete all the other tasks that I need:
items: id | name | picture | price | description | picture
Thank you all for your help!