You can use DataFrame.loc[] to set the data to zeros (the older .ix indexer is deprecated).
First create a dummy DataFrame:
import numpy as np
import pandas as pd

N = 10000
df = pd.DataFrame(np.random.rand(N, 12),
                  columns=["h%d" % i for i in range(1, 13)],
                  index=["row%d" % i for i in range(1, N + 1)])
df["sourceid"] = np.random.randint(0, 50, N)
df["destid"] = np.random.randint(0, 50, N)
Then for each of your filters you can call:
df.loc[df.sourceid == 10, "h4":"h6"] = 0
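A minimal, self-contained sketch of this mask-based assignment (the sizes and the value 10 are just for illustration; a fixed random seed is used so the check is repeatable):

```python
import numpy as np
import pandas as pd

# Small dummy frame in the same shape as above.
N = 1000
rng = np.random.RandomState(42)
df = pd.DataFrame(rng.rand(N, 12),
                  columns=["h%d" % i for i in range(1, 13)],
                  index=["row%d" % i for i in range(1, N + 1)])
df["sourceid"] = rng.randint(0, 50, N)
df["destid"] = rng.randint(0, 50, N)

# Zero h4..h6 in every row whose sourceid is 10.
df.loc[df.sourceid == 10, "h4":"h6"] = 0

# The matched rows are now all zero in those columns.
assert (df.loc[df.sourceid == 10, "h4":"h6"] == 0).all().all()
```

Note that the column slice "h4":"h6" works because the columns were created in order h1..h12, so label-based slicing is well defined.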
Since you have 600k rows, building a boolean mask such as df.sourceid == 10 for every filter scans the whole column and may be slow. Instead you can create Series objects that map each column value to the DataFrame index:
sourceid = pd.Series(df.index.values, index=df["sourceid"].values).sort_index()
destid = pd.Series(df.index.values, index=df["destid"].values).sort_index()
and then zero out h4, h5 and h6 where sourceid == 10:
df.loc[sourceid[10], "h4":"h6"] = 0
To find the row labels where sourceid == 10 and destid == 20:
np.intersect1d(sourceid[10].values, destid[20].values, assume_unique=True)
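The returned labels can be fed straight back into .loc. A self-contained sketch (the values 10 and 20 and the seed are arbitrary choices for illustration):

```python
import numpy as np
import pandas as pd

N = 1000
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.rand(N, 12),
                  columns=["h%d" % i for i in range(1, 13)],
                  index=["row%d" % i for i in range(1, N + 1)])
df["sourceid"] = rng.randint(0, 50, N)
df["destid"] = rng.randint(0, 50, N)

# Value -> row-label lookup Series, sorted for fast searching.
sourceid = pd.Series(df.index.values, index=df["sourceid"].values).sort_index()
destid = pd.Series(df.index.values, index=df["destid"].values).sort_index()

# Row labels where sourceid == 10 AND destid == 20.
rows = np.intersect1d(sourceid[10].values, destid[20].values,
                      assume_unique=True)

# Zero h4..h6 only on those rows.
df.loc[rows, "h4":"h6"] = 0
assert (df.loc[rows, "h4":"h6"] == 0).all().all()
```

assume_unique=True is safe here because each row label appears at most once in each lookup Series, and it lets intersect1d skip a deduplication pass.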
To find the row labels where 10 <= sourceid <= 12 and 3 <= destid <= 5:
np.intersect1d(sourceid.loc[10:12].values, destid.loc[3:5].values, assume_unique=True)
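Because the lookup Series are sorted, label slices like .loc[10:12] are valid (and inclusive of both endpoints). A sketch under the same dummy-data assumptions, cross-checked against plain boolean masks:

```python
import numpy as np
import pandas as pd

N = 1000
rng = np.random.RandomState(2)
df = pd.DataFrame(rng.rand(N, 12),
                  columns=["h%d" % i for i in range(1, 13)],
                  index=["row%d" % i for i in range(1, N + 1)])
df["sourceid"] = rng.randint(0, 50, N)
df["destid"] = rng.randint(0, 50, N)

sourceid = pd.Series(df.index.values, index=df["sourceid"].values).sort_index()
destid = pd.Series(df.index.values, index=df["destid"].values).sort_index()

# Labels where 10 <= sourceid <= 12 and 3 <= destid <= 5.
rows = np.intersect1d(sourceid.loc[10:12].values,
                      destid.loc[3:5].values, assume_unique=True)

# Same selection via boolean masks, for comparison.
mask = df["sourceid"].between(10, 12) & df["destid"].between(3, 5)
assert set(rows) == set(df.index[mask])
```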
sourceid and destid are Series with duplicate index values. That is fine here: because the index is sorted, pandas can locate the labels with a binary search (searchsorted), which is O(log N), and is therefore faster than building boolean masks, which is O(N).
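To make the correctness of that shortcut concrete, here is a hedged sketch (dummy data, fixed seed) checking only that the sorted-Series lookup selects exactly the same rows as the O(N) boolean-mask scan; actual speedups should be measured with timeit on your real data:

```python
import numpy as np
import pandas as pd

N = 1000
rng = np.random.RandomState(1)
df = pd.DataFrame(rng.rand(N, 12),
                  columns=["h%d" % i for i in range(1, 13)],
                  index=["row%d" % i for i in range(1, N + 1)])
df["sourceid"] = rng.randint(0, 50, N)

sourceid = pd.Series(df.index.values, index=df["sourceid"].values).sort_index()

# O(log N) label lookup on the sorted Series...
fast = np.sort(sourceid[10].values)
# ...versus the O(N) boolean-mask scan of the column.
slow = np.sort(df.index[df["sourceid"] == 10].values)

assert len(fast) == len(slow)
assert (fast == slow).all()
```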