Best way to store the names of user-submitted items (and their synonyms)

Consider an e-commerce application with several stores. Each store owner can edit the catalog of goods for their own store.

My current database schema is as follows:

 item_names: id | name | description | picture | common(BOOL)
 items: id | item_name_id | picture | price | description | picture
 item_synonyms: id | item_name_id | name | error(BOOL)

Notes:

  • error marks a synonym as a misspelling (for example, "Erickson").
  • description and picture in the item_names table are "global" values, which can be overridden by the "local" description and picture fields of the items table (if the store owner wants to provide a different image for the item).
  • common helps to separate unique item names from generic ones ("Jimmy Joe Cheese Pizza" versus "Cheese Pizza").

I think the upsides of this schema are:

Optimized synonym search and processing: I can query item_names and item_synonyms with name LIKE '%QUERY%' and get a list of item_name_id values to join against the items table. (Examples of synonyms: "Sony Ericsson", "Sony Ericson", "X10", "X 10")

Autocomplete: Again a simple query on the item_names table. I can avoid using DISTINCT and minimize the number of variations ("Sony Ericsson Xperia™ X10", "Sony Ericsson - Xperia X10", "Xperia X10, Sony Ericsson")
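To make the synonym search above concrete, here is a minimal sketch using Python's sqlite3 module. The table and column names come from the schema in the question; the sample rows, the prices, and the helper names build_demo_db and search_items are made up for illustration, and a real deployment would likely use a different database.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a cut-down version of the schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE item_names (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL
        );
        CREATE TABLE item_synonyms (
            id INTEGER PRIMARY KEY,
            item_name_id INTEGER NOT NULL REFERENCES item_names(id),
            name TEXT NOT NULL,
            error INTEGER NOT NULL DEFAULT 0
        );
        CREATE TABLE items (
            id INTEGER PRIMARY KEY,
            item_name_id INTEGER NOT NULL REFERENCES item_names(id),
            price REAL
        );
        -- Hypothetical sample data.
        INSERT INTO item_names (id, name) VALUES
            (1, 'Sony Ericsson Xperia X10'),
            (2, 'Cheese Pizza');
        INSERT INTO item_synonyms (item_name_id, name, error) VALUES
            (1, 'Sony Ericson', 1),
            (1, 'X 10', 0);
        INSERT INTO items (id, item_name_id, price) VALUES
            (10, 1, 299.0),
            (11, 1, 310.0),
            (12, 2, 8.5);
    """)
    return conn


def search_items(conn, query):
    """Return ids of items whose canonical name or any synonym contains query."""
    pattern = f"%{query}%"
    rows = conn.execute(
        """
        SELECT i.id
        FROM items AS i
        WHERE i.item_name_id IN (
            SELECT id FROM item_names WHERE name LIKE ?
            UNION
            SELECT item_name_id FROM item_synonyms WHERE name LIKE ?
        )
        ORDER BY i.id
        """,
        (pattern, pattern),
    ).fetchall()
    return [row[0] for row in rows]


conn = build_demo_db()
print(search_items(conn, "Ericson"))  # matched through the misspelled synonym
```

Note that a LIKE pattern with a leading wildcard generally cannot use an ordinary index, so this approach scans both name tables on every search.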

Downsides:

Overhead: When I insert an item, I query item_names to find out whether the name already exists. If not, I create a new entry. When deleting an item, I count the number of items with the same name. If it is the only item with that name, I also delete the entry from the item_names table (just to keep things clean, which also accounts for possible erroneous submissions). Updating is a combination of both.
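The bookkeeping described above might look roughly like this. It is only a sketch: the UNIQUE constraint on item_names.name and the helper names make_db, add_item and delete_item are assumptions, not part of the original schema.

```python
import sqlite3


def make_db():
    """In-memory database with only the columns needed for this sketch."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE item_names (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL UNIQUE  -- assumption: names are unique
        );
        CREATE TABLE items (
            id INTEGER PRIMARY KEY,
            item_name_id INTEGER NOT NULL REFERENCES item_names(id),
            price REAL
        );
    """)
    return conn


def add_item(conn, name, price):
    """Insert an item, creating its item_names row only if it is missing."""
    with conn:  # one transaction for the lookup and both inserts
        conn.execute(
            "INSERT OR IGNORE INTO item_names (name) VALUES (?)", (name,)
        )
        name_id = conn.execute(
            "SELECT id FROM item_names WHERE name = ?", (name,)
        ).fetchone()[0]
        cur = conn.execute(
            "INSERT INTO items (item_name_id, price) VALUES (?, ?)",
            (name_id, price),
        )
        return cur.lastrowid


def delete_item(conn, item_id):
    """Delete an item and drop its name if no other item still uses it."""
    with conn:
        row = conn.execute(
            "SELECT item_name_id FROM items WHERE id = ?", (item_id,)
        ).fetchone()
        if row is None:
            return
        name_id = row[0]
        conn.execute("DELETE FROM items WHERE id = ?", (item_id,))
        conn.execute(
            """
            DELETE FROM item_names
            WHERE id = ?
              AND NOT EXISTS (SELECT 1 FROM items WHERE item_name_id = ?)
            """,
            (name_id, name_id),
        )
```

Using NOT EXISTS in the cleanup statement avoids a separate COUNT query, but the extra round trips on every insert and delete remain.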

Unusual item names: Store owners sometimes use phrases like "Harry Potter 1, 2 Books + CD + Magic Hat". It feels wrong to carry so much overhead just to accommodate such cases. Perhaps this is the main reason why I am leaning toward this schema:

 items: id | name | picture | price | description | picture 

(... with item_names and item_synonyms as service tables that I could query)

  • Is there a better schema you would suggest?
  • Do item names need to be normalized for autocomplete? Perhaps this is what Facebook does for fields such as "School" and "City"?
  • Is the first schema or the second one better for search?

Thanks in advance!

References: (1) Does the name of the person normalize too far? , (2) Avoid DISTINCT


EDIT: If two items with similar names are entered, the administrator who notices this simply clicks "Make a synonym", which converts one of the names into a synonym of the other. I do not need a way to automatically detect whether an entered name is a synonym of another. I hope autocomplete takes care of 95% of such cases. As the table grows in size, the need for "Make a synonym" will decrease. Hope this clears up the confusion.
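One possible reading of the "Make a synonym" action against the first schema is sketched below. The question does not describe the exact steps, so the order of operations, the function name make_synonym and the cut-down tables are assumptions.

```python
import sqlite3


def make_db():
    """In-memory database with a cut-down version of the first schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE item_names (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE item_synonyms (
            id INTEGER PRIMARY KEY,
            item_name_id INTEGER NOT NULL REFERENCES item_names(id),
            name TEXT NOT NULL,
            error INTEGER NOT NULL DEFAULT 0
        );
        CREATE TABLE items (
            id INTEGER PRIMARY KEY,
            item_name_id INTEGER NOT NULL REFERENCES item_names(id)
        );
    """)
    return conn


def make_synonym(conn, duplicate_id, canonical_id, is_error=False):
    """Turn the item_names row duplicate_id into a synonym of canonical_id."""
    with conn:  # all steps succeed or fail together
        row = conn.execute(
            "SELECT name FROM item_names WHERE id = ?", (duplicate_id,)
        ).fetchone()
        if row is None:
            raise ValueError(f"no item_names row with id {duplicate_id}")
        # Repoint items and existing synonyms at the surviving name.
        conn.execute(
            "UPDATE items SET item_name_id = ? WHERE item_name_id = ?",
            (canonical_id, duplicate_id),
        )
        conn.execute(
            "UPDATE item_synonyms SET item_name_id = ? WHERE item_name_id = ?",
            (canonical_id, duplicate_id),
        )
        # Keep the old spelling searchable as a synonym, then drop the name.
        conn.execute(
            "INSERT INTO item_synonyms (item_name_id, name, error) "
            "VALUES (?, ?, ?)",
            (canonical_id, row[0], int(is_error)),
        )
        conn.execute("DELETE FROM item_names WHERE id = ?", (duplicate_id,))
```

The is_error flag maps onto the error column, so the administrator could mark the merged name as a misspelling in the same step.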


UPDATE: For those who would like to know what I went with: I chose the second schema, but dropped the item_names and item_synonyms tables in the hope that Solr will let me handle all the remaining tasks I need:

 items: id | name | picture | price | description | picture 

Thank you all for your help!

+6
database database-design normalization denormalization
3 answers

The requirements that you list in your comment (optimized search, synonym processing, and autocomplete) are not things that are typically associated with an RDBMS. It seems that you are trying to solve a search problem rather than a data storage and normalization problem. You might want to take a look at a search platform such as Solr.

Excerpt from the Solr feature list:

  • Faceted search based on unique field values, explicit queries, or date ranges
  • Spelling suggestions for user queries
  • "More Like This" suggestions for a given document
  • Auto-suggest functionality
  • Performance optimizations

+2

If there were more attributes to match on, I would suggest using a fast search index. There is no need to set up aliases as entries are added: the attributes are simply indexed, and each search returns matches with a relevance score. Take the top X% as valid matches and show them.

Creating and storing aliases seems like a crude, time-consuming approach that probably will not be able to adapt to the needs of your users.

+1

Just an idea.

One thing that comes to mind is to sort the characters in each name and synonym, discarding all whitespace. It is like the solution for finding all anagrams of a word. The end result is the ability to quickly look up related entries. As you indicated, all synonyms must converge on the same term or name. The search is then performed against the synonyms using the input string, sorted in the same way.
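A minimal sketch of that idea, assuming the key is simply the lowercased, whitespace-stripped characters in sorted order. The names signature and SynonymIndex are made up for this example.

```python
from collections import defaultdict


def signature(name):
    """Lowercase the name, drop whitespace, and sort the remaining characters."""
    return "".join(sorted(ch for ch in name.lower() if not ch.isspace()))


class SynonymIndex:
    """Maps the sorted-character key of each variant to its canonical names."""

    def __init__(self):
        self._by_signature = defaultdict(set)

    def add(self, canonical, variant):
        """Register variant (a synonym or the name itself) for canonical."""
        self._by_signature[signature(variant)].add(canonical)

    def lookup(self, query):
        """Return canonical names whose variants share the query's key."""
        return sorted(self._by_signature.get(signature(query), set()))


index = SynonymIndex()
index.add("Xperia X10", "X10")
index.add("Xperia X10", "Xperia X10")
print(index.lookup("x 10"))       # same characters as "X10"
print(index.lookup("X10 Xperia")) # word order does not matter
```

This handles differences in spacing, case and word order, but a misspelling that adds or drops a letter ("Ericson" versus "Ericsson") produces a different key, and unrelated names that happen to be anagrams of each other would collide.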

0