Your "BIG flaw" is not unusual. As you say, Amazon will be "connected" to everything. This is a fairly common problem with any recommendation system that relies on this kind of relationship. I haven't done this with store categories, but the problem is very similar to one I ran into with a video selection / ranking system that I built.
The usual way to keep popular items from dominating is to give each store a weight instead of using the raw number of matching categories. Typical weights are 1/category_count or 1/sqrt(category_count) , and a store's score is its number of matching categories multiplied by its weight.
Consider three stores:
Jim Books - 2 categories: ["Books", "Music"]
Amazon - 10 categories: ["Books", "Music", "Movies", "Housewares", etc.]
Ralph Remainders - 3 categories: ["Books", "Music", "Movies"]
Now, if you are looking for stores similar to Jim Books, you match the categories. Obviously, both Amazon and Ralph include the Books and Music categories, and if you use only the number of matching categories, both will have the same rating.
But if you apply a weighting factor, their scores are very different. With a weighting factor of 1/category_count :
Amazon - 10 categories, weighting factor = 1/10.
Ralph - 3 categories, weighting factor = 1/3.
So, with two matching categories each, Amazon gets a similarity score of 2 × 1/10 = 0.20, and Ralph gets 2 × 1/3 ≈ 0.67.
If the weighting factor is 1/sqrt(category_count) , then:
Amazon - weighting factor = 1/sqrt(10) ≈ 0.316
Ralph - weighting factor = 1/sqrt(3) ≈ 0.577
In this case, the Amazon score is about 0.632 (2 × 0.316) and the Ralph score is about 1.155 (2 × 0.577).
I found that 1/sqrt(category_count) is generally better because it dampens the overwhelming influence of broad stores (i.e. those that have many categories), but not so much that those stores drop out of the results entirely. 1/category_count puts too much emphasis on stores that have only one or two categories.
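
The weighting described above can be sketched in a few lines of Python. This is only an illustration, not code from the system I built: the function name `similarity`, the `weight` parameter, and the filler category names used to pad Amazon out to 10 categories are all made up for the example.

```python
import math

def similarity(query, candidate, weight="sqrt"):
    """Score = (number of shared categories) * (weight of the candidate store)."""
    if not candidate:
        return 0.0  # a store with no categories cannot match anything
    shared = len(query & candidate)
    if weight == "count":
        factor = 1 / len(candidate)             # 1/category_count
    elif weight == "sqrt":
        factor = 1 / math.sqrt(len(candidate))  # 1/sqrt(category_count)
    else:
        raise ValueError(f"unknown weight scheme: {weight!r}")
    return shared * factor

jim = {"Books", "Music"}
ralph = {"Books", "Music", "Movies"}
# Only four of Amazon's categories are named above; the rest are
# hypothetical filler so that the set has 10 entries.
amazon = {"Books", "Music", "Movies", "Housewares"} | {f"Filler {i}" for i in range(6)}

for scheme in ("count", "sqrt"):
    for name, store in (("Amazon", amazon), ("Ralph", ralph)):
        print(scheme, name, round(similarity(jim, store, scheme), 3))
```

Sorting candidate stores by this score in descending order gives the ranking. Switching `weight` between the two schemes is an easy way to compare how strongly each one penalizes stores with many categories on your own data.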