Duplicate row based on value in different columns

I have a dataframe of transactions. Each row represents a transaction of two items (think of it as a sale of two event tickets, or something similar). I want to duplicate each row based on the quantity sold.

Here is some sample code:

    import pandas as pd

    # dictionary of transactions
    d = {'1': ['20', 'NYC', '2'], '2': ['30', 'NYC', '2'], '3': ['5', 'NYC', '2'],
         '4': ['300', 'LA', '2'], '5': ['30', 'LA', '2'], '6': ['100', 'LA', '2']}
    columns = ['Price', 'City', 'Quantity']

    # create dataframe and rename columns
    df = pd.DataFrame.from_dict(data=d, orient='index')
    df.columns = columns

This creates a dataframe that looks like this:

      Price City Quantity
         20  NYC        2
         30  NYC        2
          5  NYC        2
        300   LA        2
         30   LA        2
        100   LA        2

So in the case above, each row becomes two duplicate rows. If the Quantity column were 3, that row would become three duplicate rows.

python pandas
3 answers

First, I recreated your data using integers instead of text. I also varied the quantities to make the problem easier to follow.

    import pandas as pd

    d = {1: [20, 'NYC', 1], 2: [30, 'NYC', 2], 3: [5, 'SF', 3],
         4: [300, 'LA', 1], 5: [30, 'LA', 2], 6: [100, 'SF', 3]}
    columns = ['Price', 'City', 'Quantity']

    # create dataframe and rename columns
    df = pd.DataFrame.from_dict(data=d, orient='index').sort_index()
    df.columns = columns

    >>> df
       Price City  Quantity
    1     20  NYC         1
    2     30  NYC         2
    3      5   SF         3
    4    300   LA         1
    5     30   LA         2
    6    100   SF         3

I then created a new DataFrame using a nested list comprehension.

    # emit each row once per unit sold, using label-based lookup with .loc
    df_new = pd.DataFrame(
        [df.loc[idx] for idx in df.index for _ in range(df.loc[idx, 'Quantity'])]
    ).reset_index(drop=True)

    >>> df_new
        Price City  Quantity
    0      20  NYC         1
    1      30  NYC         2
    2      30  NYC         2
    3       5   SF         3
    4       5   SF         3
    5       5   SF         3
    6     300   LA         1
    7      30   LA         2
    8      30   LA         2
    9     100   SF         3
    10    100   SF         3
    11    100   SF         3
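As a quick sanity check on the list-comprehension approach, the expanded frame should contain exactly as many rows as the sum of the Quantity column. The following is a minimal, self-contained sketch that rebuilds the sample data from this answer and compares the two numbers.

```python
import pandas as pd

# Same sample data as in this answer (integer quantities).
d = {1: [20, 'NYC', 1], 2: [30, 'NYC', 2], 3: [5, 'SF', 3],
     4: [300, 'LA', 1], 5: [30, 'LA', 2], 6: [100, 'SF', 3]}
df = pd.DataFrame.from_dict(data=d, orient='index').sort_index()
df.columns = ['Price', 'City', 'Quantity']

# One copy of each row per unit sold.
df_new = pd.DataFrame(
    [df.loc[idx] for idx in df.index for _ in range(df.loc[idx, 'Quantity'])]
).reset_index(drop=True)

# The row count of the result should match the total quantity sold.
print(len(df_new), df['Quantity'].sum())
```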

How about this approach? I changed your data slightly to include a sale of 4 tickets.

We build a helper array of the appropriate size with np.ones(), and then the key line of code is: a[np.arange(a.shape[1])[:] > a[:,0,np.newaxis]] = 0

I was shown this technique here: numpy - update values using slicing to reflect the value of the array

After that, it's just a matter of calling .stack() and doing some basic filtering.

    import numpy as np
    import pandas as pd

    d = {'1': ['20', 'NYC', '2'], '2': ['30', 'NYC', '2'], '3': ['5', 'NYC', '2'],
         '4': ['300', 'LA', '2'], '5': ['30', 'LA', '4'], '6': ['100', 'LA', '2']}
    columns = ['Price', 'City', 'Quantity']
    df = pd.DataFrame.from_dict(data=d, orient='index')
    df.columns = columns
    df['Quantity'] = df['Quantity'].astype(int)

    # make a ones array: one row per transaction, one column per possible unit
    # (integer dtype so the flags print as 1/0)
    my_ones = np.ones(shape=(len(df), df['Quantity'].max()), dtype=int)

    # turn my_ones into a dataframe with the same index as df so we can join it
    # on the right-hand side; there are plenty of other ways to do this
    df_my_ones = pd.DataFrame(data=my_ones, index=df.index)
    df = df.join(df_my_ones)

which gives:

       Price City  Quantity  0  1  2  3
    1     20  NYC         2  1  1  1  1
    3      5  NYC         2  1  1  1  1
    2     30  NYC         2  1  1  1  1
    5     30   LA         4  1  1  1  1
    4    300   LA         2  1  1  1  1

Now pull the Quantity column and the ones columns into a NumPy array:

    a = df.iloc[:, 2:].values

Here is the clever bit:

    a[np.arange(a.shape[1])[:] > a[:, 0, np.newaxis]] = 0
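To see what that mask does in isolation, here is a small sketch on a standalone array. The quantities and the names `quantity`, `max_qty`, and `flags` are made up for illustration. Note one difference from the line above: in the answer, column 0 of `a` holds Quantity itself, so the flag columns are shifted by one and `>` is the right comparison; in this sketch the array holds only flags, so the comparison is `>=`.

```python
import numpy as np

# Hypothetical quantities for three rows; max_qty sets the number of flag columns.
quantity = np.array([2, 4, 1])
max_qty = quantity.max()

# Start with a flag of 1 in every slot.
flags = np.ones((len(quantity), max_qty), dtype=int)

# Column positions 0..max_qty-1 are broadcast against a column vector of
# quantities: slot j of row i is cleared when j >= quantity[i].
flags[np.arange(max_qty) >= quantity[:, np.newaxis]] = 0

print(flags)
# [[1 1 0 0]
#  [1 1 1 1]
#  [1 0 0 0]]
```

Each row is left with exactly as many ones as its quantity, which is what lets the later stack-and-filter step produce one output row per unit.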

and assign the result back to df:

    df.iloc[:, 2:] = a

Now df looks like this. Notice how the flags beyond each row's Quantity have been set to zero:

       Price City  Quantity  0  1  2  3
    1     20  NYC         2  1  1  0  0
    3      5  NYC         2  1  1  0  0
    2     30  NYC         2  1  1  0  0
    5     30   LA         4  1  1  1  1
    4    300   LA         2  1  1  0  0

Then stack the flag columns and filter out the zeros:

    df.set_index(['Price', 'City', 'Quantity'], inplace=True)
    df = df.stack().to_frame()
    df.columns = ['sale_flag']
    df.reset_index(inplace=True)

    # keep only the rows whose flag is still set
    print(df[['Price', 'City', 'Quantity']][df['sale_flag'] != 0])
    # the full stacked frame, for inspection
    print(df)

which produces:

       Price City  Quantity
    0     20  NYC         2
    1     20  NYC         2
    4      5  NYC         2
    5      5  NYC         2
    8     30  NYC         2
    9     30  NYC         2
    12    30   LA         4
    13    30   LA         4
    14    30   LA         4
    15    30   LA         4
    16   300   LA         2
    17   300   LA         2

An answer using repeat:

    # assumes Quantity holds integers
    df.loc[df.index.repeat(df.Quantity)]
    Out[448]:
      Price City  Quantity
    1    20  NYC         2
    1    20  NYC         2
    2    30  NYC         2
    2    30  NYC         2
    3     5  NYC         2
    3     5  NYC         2
    4   300   LA         2
    4   300   LA         2
    5    30   LA         2
    5    30   LA         2
    6   100   LA         2
    6   100   LA         2
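For the data exactly as posted in the question, where Quantity is stored as text, the repeat approach needs an integer cast first. Here is a minimal sketch under that assumption; the names `counts` and `expanded` are introduced only for illustration, and the final reset_index is optional if you want to keep the original labels.

```python
import pandas as pd

# The question's data, where every value (including Quantity) is a string.
d = {'1': ['20', 'NYC', '2'], '2': ['30', 'NYC', '2'], '3': ['5', 'NYC', '2'],
     '4': ['300', 'LA', '2'], '5': ['30', 'LA', '2'], '6': ['100', 'LA', '2']}
df = pd.DataFrame.from_dict(data=d, orient='index')
df.columns = ['Price', 'City', 'Quantity']

# Index.repeat needs integer counts, so cast the text column first.
counts = df['Quantity'].astype(int)

# Repeat each index label Quantity times, select those rows, then renumber.
expanded = df.loc[df.index.repeat(counts)].reset_index(drop=True)

print(len(expanded))  # 12
```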