Python: counting cases in a csv file

I have a csv file with 4 columns {Tag, User, Quality, Cluster_id}. Using Python, I would like to do the following: for each Cluster_id (from 1 to 500), count for each user the number of good and bad tags (taken from the Quality column). There are over 6,000 users. I can only read the csv file line by line, so I am not sure how this can be done.

For instance:

    Columns of csv = [Tag  User  Quality  Cluster_id]
    Row1 = [bag     u1  good  1]
    Row2 = [ground  u2  bad   2]
    Row3 = [xxx     u1  bad   1]
    Row4 = [bbb     u2  good  3]

So far I have only managed to read every line of the csv file.

I can only access one row at a time, so I cannot simply write two nested loops over clusters and users. The pseudocode of the algorithm I want to implement:

    for cluster in clusters:
        for user in users:
            if quality == good:
                good_num = good_num + 1
            else:
                bad_num = bad_num + 1
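What I think I need is to accumulate the counts while reading, roughly like the sketch below (assuming the file is called cluster.csv and has a header row) - but I am not sure whether this is the right approach:

    import csv
    from collections import Counter

    counts = Counter()
    with open("cluster.csv") as f:
        reader = csv.DictReader(f)
        for row in reader:
            # one counter entry per (cluster, user, quality) combination
            counts[(row["Cluster_id"], row["User"], row["Quality"])] += 1

    # e.g. counts[("1", "u1", "good")] -> number of good tags by user u1 in cluster 1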
2 answers

Since someone has already posted a defaultdict solution, I'm going to give a pandas one, just for a change. pandas is a very convenient data-processing library. Among other nice features, it can handle this counting task in a single line, depending on what type of output is required. Indeed:

    df = pd.read_csv("cluster.csv")
    counted = df.groupby(["Cluster_id", "User", "Quality"]).size()
    counted.to_csv("counted.csv")
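If separate good and bad columns per (Cluster_id, User) pair are wanted instead of that long format, one possible reshape (a sketch building on the counted series above) is to unstack the Quality level:

    # pivot Quality into columns; missing (cluster, user, quality) combinations become 0
    table = counted.unstack("Quality").fillna(0)
    table.to_csv("counted.csv")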


To give a flavour of what pandas makes easy, we can load the file - the main data storage object in pandas is called a "DataFrame":

    >>> import pandas as pd
    >>> df = pd.read_csv("cluster.csv")
    >>> df
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 500000 entries, 0 to 499999
    Data columns:
    Tag           500000  non-null values
    User          500000  non-null values
    Quality       500000  non-null values
    Cluster_id    500000  non-null values
    dtypes: int64(1), object(3)

We can verify that the first few lines look fine:

    >>> df[:5]
       Tag  User Quality  Cluster_id
    0  bbb  u001     bad          39
    1  bbb  u002     bad          36
    2  bag  u003    good          11
    3  bag  u004    good           9
    4  bag  u005     bad          26

and then we can group by Cluster_id and User and work on each group:

    >>> for name, group in df.groupby(["Cluster_id", "User"]):
    ...     print 'group name:', name
    ...     print 'group rows:'
    ...     print group
    ...     print 'counts of Quality values:'
    ...     print group["Quality"].value_counts()
    ...     raw_input()
    ...
    group name: (1, 'u003')
    group rows:
            Tag  User Quality  Cluster_id
    372002  xxx  u003     bad           1
    counts of Quality values:
    bad    1

    group name: (1, 'u004')
    group rows:
               Tag  User Quality  Cluster_id
    126003  ground  u004     bad           1
    348003  ground  u004    good           1
    counts of Quality values:
    good    1
    bad     1

    group name: (1, 'u005')
    group rows:
               Tag  User Quality  Cluster_id
    42004   ground  u005     bad           1
    258004  ground  u005     bad           1
    390004  ground  u005     bad           1
    counts of Quality values:
    bad    3

    [etc.]

If you are going to be processing csv files a lot, pandas is definitely worth a look.


collections.defaultdict should be very useful here:

    # WARNING: untested
    import csv
    from collections import defaultdict

    # nested "autovivifying" dict: data[cluster][user] will hold per-user counts
    auto_vivificator = lambda: defaultdict(auto_vivificator)
    data = auto_vivificator()

    with open("cluster.csv") as csv_file:
        reader = csv.reader(csv_file)
        next(reader)  # skip the header row
        for tag, user, quality, cluster in reader:
            counts = data[cluster].setdefault(user, defaultdict(int))
            if quality == "good":
                counts["good"] += 1
            else:
                counts["bad"] += 1

    for cluster, users in sorted(data.items()):
        print "Cluster:", cluster
        for user, quality_counts in sorted(users.items()):
            print "User:", user
            print dict(quality_counts)
            print  # a blank line
