Pandas: Normalizing DataFrame

Question

Pandas: Normalizing DataFrame

I have input in a flattened file. I want to normalize this data by dividing it into tables. Can I do this neatly with pandas , that is, by reading flattened data into a DataFrame instance, and then applying some functions to get the resulting DataFrame instances?

Example:

The data is transferred to my disk as a CSV file as follows:

 ItemId ClientId PriceQuoted ItemDescription 1 1 10 scroll of Sneak 1 2 12 scroll of Sneak 1 3 13 scroll of Sneak 2 2 2500 scroll of Invisible 2 4 2200 scroll of Invisible

I want to create two DataFrames:

 ItemId ItemDescription 1 scroll of Sneak 2 scroll of Invisibile

and

 ItemId ClientId PriceQuoted 1 1 10 1 2 12 1 3 13 2 2 2500 2 4 2200

If pandas has only a good solution for the simplest case (normalization leads to two tables with many relationships - as in the previous example), this may be enough for my current needs. However, I may need a more general solution in the future.

+7

python pandas database-normalization

max Aug 27 '12 at 10:40

source share

1 answer

Wouter overmeire · Accepted Answer · 2012-08-27T11:07:30+0000

 In [30]: df = pandas.read_csv('foo1.csv', sep='[\s]{2,}') In [30]: df Out[30]: ItemId ClientId PriceQuoted ItemDescription 0 1 1 10 scroll of Sneak 1 1 2 12 scroll of Sneak 2 1 3 13 scroll of Sneak 3 2 2 2500 scroll of Invisible 4 2 4 2200 scroll of Invisible In [31]: df1 = df[['ItemId', 'ItemDescription']].drop_duplicates().set_index('ItemId') In [32]: df1 Out[32]: ItemDescription ItemId 1 scroll of Sneak 2 scroll of Invisible In [33]: df2 = df[['ItemId', 'ClientId', 'PriceQuoted']] In [34]: df2 Out[34]: ItemId ClientId PriceQuoted 0 1 1 10 1 1 2 12 2 1 3 13 3 2 2 2500 4 2 4 2200

Pandas: Normalizing DataFrame

More articles: