Comprehensive DataFrame Filtering

Question

I just started working with Pandas, and I'm trying to figure out if this is the right tool for my problem.

I have a dataset:

date, sourceid, destid, h1..h12

I am mainly interested in the sum of each column H1..H12, but I need to exclude several ranges from the data set.

Examples:

exclude data H4, H5, H6, where source = 4944 and exclude H8, H9-H12 where destination = 481981 and ...

... this can go on for many many filters as we are constantly deleting data to get closer to our final model.

I think I saw a solution where I could create a list of the filters I want and then write a function that applies them, but I could not find a good example to work from.

My initial thought was to create a copy of df and just delete the data that we don’t need, and if we need it again, copy it back from the original df, but that seems like the wrong way.

python pandas

Glenn Feb 14 '13 at 6:30

2 answers

Def_os · Answer 1 · 2013-02-14T08:19:54+0000

With masks, you do not need to delete data from the DataFrame. For instance:

 mask1 = df.sourceid == 4944
 var1 = df[mask1][['H4', 'H5', 'H6']].sum()

Or directly:

 var1 = df[df.sourceid == 4944][['H4', 'H5', 'H6']].sum()

In the case of multiple filters, you can combine logical masks with Boolean operators:

 totmask = mask1 & mask2
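To connect this to the exclusion case in the question, here is a minimal sketch that negates a mask with ~ so the matching rows are left out of the sum. The toy values and the reduced set of columns are assumptions made up for illustration; only the column names and the two filter values come from the question.

```python
import pandas as pd

# Toy data: column names follow the question, the values are made up.
df = pd.DataFrame({
    "sourceid": [4944, 4944, 7, 8],
    "destid": [1, 481981, 481981, 2],
    "h4": [1.0, 2.0, 3.0, 4.0],
    "h8": [10.0, 20.0, 30.0, 40.0],
})

mask1 = df.sourceid == 4944    # rows to leave out of the h4 sum
mask2 = df.destid == 481981    # rows to leave out of the h8 sum

# ~ negates a boolean mask, so these sums skip the excluded rows.
h4_sum = df.loc[~mask1, "h4"].sum()
h8_sum = df.loc[~mask2, "h8"].sum()

# Masks combine with & (and) and | (or); these rows match both filters.
both = df[mask1 & mask2]
```

The original df is never modified, so dropping or adding a filter only means changing which masks are applied.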

Hyry · Answer 2 · 2013-02-15T03:53:16+0000

You can use DataFrame.loc[] (DataFrame.ix[] in pandas versions before 1.0) to set the excluded data to zero.

First create a dummy DataFrame:

 import numpy as np
 import pandas as pd

 N = 10000
 df = pd.DataFrame(np.random.rand(N, 12),
                   columns=["h%d" % i for i in range(1, 13)],
                   index=["row%d" % i for i in range(1, N + 1)])
 df["sourceid"] = np.random.randint(0, 50, N)
 df["destid"] = np.random.randint(0, 50, N)

Then for each of your filters you can call:

 df.loc[df.sourceid == 10, "h4":"h6"] = 0
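Since the question asks about keeping a list of filters and a function that applies them, one way to sketch that is below. The helper name apply_exclusions and the (column, value, first column, last column) tuple layout are hypothetical choices for illustration, not part of pandas, and the toy data is made up.

```python
import pandas as pd

def apply_exclusions(df, exclusions):
    """Return a copy of df with every excluded cell set to 0.

    `exclusions` is a hypothetical list of tuples:
    (key column, key value, first h column, last h column).
    """
    out = df.copy()  # leave the caller's DataFrame untouched
    for key_col, value, first, last in exclusions:
        # Label slices in .loc include both endpoints.
        out.loc[out[key_col] == value, first:last] = 0
    return out

# Toy data, made up for illustration.
df = pd.DataFrame({
    "sourceid": [10, 10, 3],
    "destid": [20, 5, 20],
    "h4": [1.0, 2.0, 3.0],
    "h5": [1.0, 2.0, 3.0],
    "h6": [1.0, 2.0, 3.0],
})

exclusions = [
    ("sourceid", 10, "h4", "h6"),  # drop h4..h6 where sourceid == 10
    ("destid", 20, "h5", "h5"),    # drop h5 only where destid == 20
]

cleaned = apply_exclusions(df, exclusions)
totals = cleaned[["h4", "h5", "h6"]].sum()
```

Working on a copy means the filter list can be edited and re-applied to the original data at any time, which avoids copying values back by hand.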

Since you have 600k rows, building a mask array such as df.sourceid == 10 for every filter may be slow. Instead, you can create Series objects that map each sourceid or destid value to the matching DataFrame index labels:

 sourceid = pd.Series(df.index.values, index=df["sourceid"].values).sort_index()
 destid = pd.Series(df.index.values, index=df["destid"].values).sort_index()

and then exclude h4, h5, h6, where sourceid == 10:

 df.loc[sourceid[10].values, "h4":"h6"] = 0

To find the row labels where sourceid == 10 and destid == 20:

 np.intersect1d(sourceid[10].values, destid[20].values, assume_unique=True)

To find the row labels where 10 <= sourceid <= 12 and 3 <= destid <= 5:

 np.intersect1d(sourceid.loc[10:12].values, destid.loc[3:5].values, assume_unique=True)

sourceid and destid are Series with duplicate index values. When the index values are sorted, pandas uses searchsorted to locate the matching entries. That lookup is O(log N), which is faster than creating a mask array, which is O(N).
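To illustrate the idea behind the sorted lookup, the sketch below does the binary search by hand with np.searchsorted and checks that it finds the same rows as a boolean mask. It only mimics the principle; what pandas does internally may differ between versions, and the seed and data size are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
N = 1000
df = pd.DataFrame({"sourceid": rng.integers(0, 50, N)},
                  index=["row%d" % i for i in range(1, N + 1)])

# Sorted Series: index holds sourceid values, data holds the row labels.
lookup = pd.Series(df.index.values, index=df["sourceid"].values).sort_index()

# Binary search for the contiguous block of entries whose key equals 10.
keys = lookup.index.values
lo = np.searchsorted(keys, 10, side="left")
hi = np.searchsorted(keys, 10, side="right")
rows_via_search = set(lookup.values[lo:hi])

# The O(N) alternative: compare every row against the value.
rows_via_mask = set(df.index[df.sourceid == 10])
```

Both approaches return the same row labels; the sorted version pays the sorting cost once and then answers each lookup with two binary searches.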
