Since someone has already posted a defaultdict solution, I'll give a pandas one, just for a change. pandas is a very convenient data-processing library. Among its other nice features, it can handle this counting task in a single line, depending on what kind of output you need:
    df = pd.read_csv("cluster.csv")
    counted = df.groupby(["Cluster_id", "User", "Quality"]).size()
    counted.to_csv("counted.csv")
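If you don't have a cluster.csv handy, here is a minimal self-contained sketch of the same idea, using a small made-up DataFrame in place of the file (the column names match the question's data; the row values are invented for illustration):

```python
import pandas as pd

# Hypothetical stand-in for cluster.csv
df = pd.DataFrame({
    "Tag":        ["bbb", "bbb", "bag", "bag"],
    "User":       ["u001", "u002", "u001", "u001"],
    "Quality":    ["bad", "bad", "good", "bad"],
    "Cluster_id": [39, 36, 11, 11],
})

# One count per (Cluster_id, User, Quality) combination
counted = df.groupby(["Cluster_id", "User", "Quality"]).size()
print(counted)
```

The result is a Series indexed by the (Cluster_id, User, Quality) triples, which `to_csv` will happily write out one row per combination.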
To give a sense of what pandas simplifies, we can load the file; the main data-storage object in pandas is called a DataFrame:
    >>> import pandas as pd
    >>> df = pd.read_csv("cluster.csv")
    >>> df
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 500000 entries, 0 to 499999
    Data columns:
    Tag           500000 non-null values
    User          500000 non-null values
    Quality       500000 non-null values
    Cluster_id    500000 non-null values
    dtypes: int64(1), object(3)
We can verify that the first few lines look fine:
    >>> df[:5]
       Tag  User Quality  Cluster_id
    0  bbb  u001     bad          39
    1  bbb  u002     bad          36
    2  bag  u003    good          11
    3  bag  u004    good           9
    4  bag  u005     bad          26
and then we can group by Cluster_id and User and work with each group:
    >>> for name, group in df.groupby(["Cluster_id", "User"]):
    ...     print 'group name:', name
    ...     print 'group rows:'
    ...     print group
    ...     print 'counts of Quality values:'
    ...     print group["Quality"].value_counts()
    ...     raw_input()
    ...
    group name: (1, 'u003')
    group rows:
            Tag  User Quality  Cluster_id
    372002  xxx  u003     bad           1
    counts of Quality values:
    bad    1
    group name: (1, 'u004')
    group rows:
               Tag  User Quality  Cluster_id
    126003  ground  u004     bad           1
    348003  ground  u004    good           1
    counts of Quality values:
    good    1
    bad     1
    group name: (1, 'u005')
    group rows:
               Tag  User Quality  Cluster_id
    42004   ground  u005     bad           1
    258004  ground  u005     bad           1
    390004  ground  u005     bad           1
    counts of Quality values:
    bad    3
    [etc.]
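If you'd rather have those per-group counts as one wide table (one row per cluster/user, one column per Quality value) instead of looping, `unstack` will pivot the grouped counts. A small sketch, using invented rows that mirror the grouped output above:

```python
import pandas as pd

# Hypothetical rows mirroring the groups shown above
df = pd.DataFrame({
    "User":       ["u004", "u004", "u005", "u005", "u005"],
    "Quality":    ["bad", "good", "bad", "bad", "bad"],
    "Cluster_id": [1, 1, 1, 1, 1],
})

# Count per (Cluster_id, User, Quality), then pivot Quality into columns;
# fill_value=0 puts an explicit zero where a combination never occurs
table = (df.groupby(["Cluster_id", "User", "Quality"])
           .size()
           .unstack("Quality", fill_value=0))
print(table)
```

The resulting DataFrame is often easier to write to CSV or inspect than the long Series form.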
If you're going to process CSV files a lot, pandas is definitely worth a look.