I have a set of alphanumeric product codes for various products. Similar products do not have similar codes; for example, the product code "A123" could mean "Harry Potter Volume 1 DVD" while "B123" could mean "Kellogg's Corn Flakes". I also have no description or other identifying information for the products. All I have is the "owner" of each code. So my data looks (roughly) like this:
Owner1: ProductCodes A123, B124, W555, M221, M556, 127, 102
Owner2: ProductCodes D103, Z552, K112, L3254, 223, 112
Owner3: ProductCodes G123
....
I have huge (i.e. terabytes) sets of this data.
My assumption is that an owner will, for the most part, have an undetermined number of groups of similar products. For example, an owner might have just two groups: all the Harry Potter DVDs and books, and a collection of "Iron Maiden" CDs. I would like to analyze this data and determine a distance function between product codes, so that I can start making assumptions about how "close" product codes are to each other, and also cluster the product codes (so that I can identify how many groups an owner has). I have done some research on text clustering algorithms, but there are many to choose from, and I'm not sure which one best fits this scenario.
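To make the idea of a distance function concrete, here is a minimal sketch of the kind of thing I have in mind. It assumes (and this is only an assumption) that two codes are "close" when they tend to be held by the same owners, and measures that with the Jaccard distance between their owner sets. The toy data and the helper names `invert`, `jaccard_distance` and `pairwise_distances` are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Toy data, invented for illustration only (not the real data set).
owners = {
    "Owner1": {"A123", "B124", "W555"},
    "Owner2": {"A123", "B124", "Z552"},
    "Owner3": {"W555", "Z552"},
    "Owner4": {"A123", "B124"},
}


def invert(owner_to_codes):
    """Map each product code to the set of owners that hold it."""
    code_to_owners = defaultdict(set)
    for owner, codes in owner_to_codes.items():
        for code in codes:
            code_to_owners[code].add(owner)
    return dict(code_to_owners)


def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b|; 0.0 means identical owner sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def pairwise_distances(owner_to_codes):
    """Jaccard distance for every unordered pair of codes (toy scale only)."""
    code_to_owners = invert(owner_to_codes)
    return {
        (c1, c2): jaccard_distance(code_to_owners[c1], code_to_owners[c2])
        for c1, c2 in combinations(sorted(code_to_owners), 2)
    }


distances = pairwise_distances(owners)
for pair, dist in sorted(distances.items(), key=lambda item: item[1]):
    print(pair, round(dist, 3))
```

Computing all pairs like this is clearly not feasible at terabyte scale, so it only illustrates the distance itself. In principle such distances could be handed to a clustering routine that accepts precomputed distances (for example scikit-learn's `DBSCAN` with `metric="precomputed"`), but I don't know whether that is the right choice here.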
Can someone please point me to the most suitable Python-based clustering functions or libraries for this?
Richard Green