Best Python Cluster Library for Product Data Analysis

I have a set of alphanumeric product codes for various products. Similar products do not have similar codes, i.e. the product code "A123" could mean "Harry Potter Volume 1 DVD" and "B123" could mean "Kellogg's Corn Flakes". I also do not have a description or any other product identification. All I have is the "owner" of each code. So my data looks (anonymized) like this:

Owner 1: Product codes A123, B124, W555, M221, M556, 127, 102

Owner 2: Product codes D103, Z552, K112, L3254, 223, 112

Owner 3: Product codes G123

....

I have huge (i.e. terabytes) sets of this data.

I assume that most owners will have an indeterminate number of groups of similar products - an owner might have just two groups, e.g. all their DVDs and all their Harry Potter books, but perhaps also an Iron Maiden CD collection. I would like to analyze this data and work out a distance function between product codes, so that I can start making assumptions about how "close" product codes are to each other, and also between groups of product codes (and so determine how many groups each owner has). I have done some research on text clustering algorithms, but there are many options to choose from and I'm not sure which one is best for this scenario.
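To make the structure concrete, this is roughly how I picture one small chunk of the data in memory (purely illustrative - the real set is far too big to load like this):

```python
# Illustrative only: owner -> set of product codes (the real data is terabytes)
owners = {
    "Owner1": {"A123", "B124", "W555", "M221", "M556", "127", "102"},
    "Owner2": {"D103", "Z552", "K112", "L3254", "223", "112"},
    "Owner3": {"G123"},
}
```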

Can someone please point me to the most suitable Python-based clustering functions / libraries for this?

+7
6 answers

You have a bipartite graph. As a first stab, it sounds like you would treat the lists of neighbours as feature vectors between which you define some kind of similarity / correlation - it could, for example, be the normalized Hamming distance. Depending on how you do this, you will get a graph over one of the two domains - product codes or owners. It will soon become clear why I have cast everything in graph language, so bear with me.

Now, why are you pressing for a Python implementation? Clustering large-scale data is time- and memory-consuming. To let the cat out of the bag: I wrote and still maintain a graph clustering algorithm that is used quite widely in bioinformatics. It is threaded, accepts weighted graphs and has been used on graphs with millions of nodes and up to a billion edges. See http://micans.org/mcl/ for more information. Of course, if you trawl Stack Overflow and Stack Exchange there are quite a few threads that might interest you. I would also recommend the Louvain method, except that I'm not sure whether it accepts the weighted networks you are likely to produce.
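To make the "neighbour lists as feature vectors" idea concrete, here is a minimal sketch of my own (toy data, and the file name `graph.abc` is just an example): it builds a binary owner-membership vector for each product code, scores pairs with a normalized Hamming similarity, and writes the weighted edges in MCL's label ("abc") format so `mcl` can consume them. On sparse data a Jaccard-style score may behave better than Hamming, but the scoring function is easy to swap.

```python
from itertools import combinations

# Toy data: owner -> product codes (stand-in for the real terabyte-scale set)
owners = {
    "Owner1": {"A123", "B124", "W555"},
    "Owner2": {"A123", "W555", "K112"},
    "Owner3": {"G123", "K112"},
}

owner_ids = sorted(owners)

# Binary "neighbour list" vector per product code: which owners carry it
codes = sorted({c for prods in owners.values() for c in prods})
vectors = {c: [1 if c in owners[o] else 0 for o in owner_ids] for c in codes}

def hamming_similarity(u, v):
    """1 - normalized Hamming distance: fraction of owners on which u and v agree."""
    return sum(x == y for x, y in zip(u, v)) / len(u)

# Weighted edge list in MCL's label ("abc") format: node <tab> node <tab> weight
with open("graph.abc", "w") as fh:
    for a, b in combinations(codes, 2):
        w = hamming_similarity(vectors[a], vectors[b])
        if w > 0:
            fh.write(f"{a}\t{b}\t{w:.3f}\n")

# Then, outside Python (if I remember the invocation correctly):
#   mcl graph.abc --abc -o clusters.out
```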

+8

The R language contains many packages for finding groups in data, and there are Python bindings to R, called RPy. R provides several of the algorithms already mentioned here and is also known for good performance on large data sets.
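As a rough illustration, using rpy2 (the currently maintained successor to RPy; exact API details vary between versions), you can evaluate R clustering code from Python and pull the cluster assignments back. The toy matrix below is just a placeholder for whatever feature matrix you build from the owner / product-code data.

```python
import rpy2.robjects as robjects

# Run R's built-in k-means on a toy matrix; in practice you would pass in
# your own feature matrix built from the owner / product-code data.
robjects.r('''
    set.seed(42)
    m   <- matrix(rnorm(200), ncol = 2)
    fit <- kmeans(m, centers = 3)
''')

# Pull the cluster assignments back into Python as a plain list
clusters = list(robjects.r('fit$cluster'))
print(clusters[:10])
```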

+1

I don't know much about your problem domain, but PyCluster is a pretty decent clustering package that works well on large datasets: http://bonsai.hgc.jp/~mdehoon/software/cluster/software.htm
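If it helps, basic usage looks roughly like this (a sketch from memory with a made-up toy matrix; check the PyCluster docs for the exact signature):

```python
import numpy as np
import Pycluster

# Toy feature matrix: one row per owner, one column per product code (1 = owns it)
data = np.array([
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
], dtype=float)

# k-means into 2 clusters; clusterid[i] is the cluster assigned to row i
clusterid, error, nfound = Pycluster.kcluster(data, nclusters=2, npass=10)
print(clusterid)
```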

Hope this helps.

0

I don't know of a ready-made library for this, sorry. There are big libraries for full-text search and similarity, but for bit sets you will have to roll your own (as far as I know). In any case, a couple of suggestions:

  • Bit-array approach: first get, say, 10k owners x 100k products (or 100k x 10k) into memory to play with. You can use bitarray to build a big array of 10k x 100k bits (see the sketch after this list). But then, what do you want to do with it?
    To find similar pairs among N objects (owners or products) you have to look at all N * (N - 1) / 2 pairs, which is a lot;
    or, there has to be some structure in the data that allows early pruning / hierarchical similarity;
    or, google "greedy clustering" python - I don't see a ready-made lib.

  • How do you define the "similarity" of owners / products? There are many possibilities - a shared-item count, that count as a ratio, tf-idf ...
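A minimal sketch of the bit-array idea (illustrative only; the product indices are made up, and the similarity measures here - shared count and Jaccard - are just placeholders for whatever definition you settle on):

```python
from bitarray import bitarray

N_PRODUCTS = 100_000  # size of the product-code universe

def owner_row(product_indices):
    """One owner as a bit row: bit i is set if the owner has product i."""
    row = bitarray(N_PRODUCTS)
    row.setall(False)
    for i in product_indices:
        row[i] = True
    return row

a = owner_row([12, 507, 99_321])
b = owner_row([12, 507, 44_000])

shared = (a & b).count()          # products both owners have
union = (a | b).count()           # products either owner has
jaccard = shared / union if union else 0.0
print(shared, jaccard)
```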

(Added): Have you looked at Mahout's recommendation API - is that what you are looking for?
This question says there is no Python equivalent, which leaves two options:
a) ask whether anyone has used Mahout from Jython, or b) if you can't lick 'em, join 'em.

0

You can try clustering with the k-means algorithm; an implementation is available as scikits.learn.cluster.KMeans (scikits.learn has since become scikit-learn, where the class lives at sklearn.cluster.KMeans).
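A minimal sketch with the current scikit-learn API, assuming you first turn each owner into a binary product-membership vector (toy data only - terabyte-scale input would need a sparse or streamed encoding):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: owner -> product codes
owners = {
    "Owner1": {"A123", "B124", "W555"},
    "Owner2": {"A123", "W555"},
    "Owner3": {"G123", "K112"},
}

# One row per owner, one column per product code (1 = owner has it)
codes = sorted({c for prods in owners.values() for c in prods})
X = np.array([[1 if c in prods else 0 for c in codes] for prods in owners.values()])

# Cluster owners into 2 groups; labels[i] is the cluster of the i-th owner
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(dict(zip(owners, labels)))
```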

0
