Pandas change cell values based on another cell

I am currently formatting data from two different data sets. One data set records the observed number of people in a room, once per hour; the second records the number of people based on wifi logs, generated every 5 minutes.

After merging these two data frames into one, I ran into a problem: each hourly row (like "10:00:00") has the data from the first set, but the other rows (every 5 minutes, like "10:47:14") do not include this data.

Here is what the merged data looks like:

           room  time                 con  auth  capacity  %    Count  module     size
    0      B002  Mon Nov 02 10:32:06  23   23    90        NaN  NaN    NaN        NaN
    1      B002  Mon Nov 02 10:37:10  25   25    90        NaN  NaN    NaN        NaN
    12527  B002  Mon Nov 02 10:00:00  NaN  NaN   90        50%  45.0   COMP30520  60
    12528  B002  Mon Nov 02 11:00:00  NaN  NaN   90        0%   0.0    COMP30520  60

Is there a way to go through the data frame, find the "occupancy", "OccupancyCount", "module" and "size" values from 11:00:00, and write them to all the cells that fall on the same day and whose hour is between 10:00:00 and 10:59:59?

This would give me all the information on each row, and then let me compute min(), max() and median() grouped by "day" and "hour".
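For illustration only, here is a minimal sketch of that goal. It is not the code from this thread: the tiny frames are made up, the year 2015 is an assumption (the source timestamps carry no year), and the helper column `hour_start` is a hypothetical name. Each 5-minute row is truncated to the start of its hour, matched to the hourly row of the same room, and then aggregated.

```python
import pandas as pd

# Hypothetical miniature versions of the two source tables (year assumed).
hourly = pd.DataFrame({
    "room": ["B002", "B002"],
    "time": pd.to_datetime(["2015-11-02 10:00:00", "2015-11-02 11:00:00"]),
    "module": ["COMP30520", "COMP30520"],
    "size": [60, 60],
})
wifi = pd.DataFrame({
    "room": ["B002", "B002", "B002"],
    "time": pd.to_datetime([
        "2015-11-02 10:32:06", "2015-11-02 10:37:10", "2015-11-02 11:02:00",
    ]),
    "auth": [23, 25, 4],
})

# Truncate every wifi timestamp to the start of its hour (10:32:06 -> 10:00:00).
wifi["hour_start"] = wifi["time"].dt.floor("h")

# Attach the hourly information to every 5-minute row of the same room and hour.
combined = wifi.merge(
    hourly.rename(columns={"time": "hour_start"}),
    on=["room", "hour_start"],
    how="left",
)

# Per-room, per-hour statistics of the wifi counts.
stats = (
    combined.groupby(["room", "hour_start"])["auth"]
    .agg(["min", "max", "median"])
    .reset_index()
)
print(combined)
print(stats)
```

Because `hour_start` is a full timestamp, it already encodes both the day and the hour, so no separate "day" column is needed for the grouping.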

In response to the comment asking for the source data, here it is.
First data frame:

       time                 room  module     size
    0  Mon Nov 02 09:00:00  B002  COMP30190  29
    1  Mon Nov 02 10:00:00  B002  COMP40660  53

Second data frame:

           room  time                 con  auth  capacity  %    Count
    0      B002  Mon Nov 02 20:32:06  0    0     NaN       NaN  NaN
    1      B002  Mon Nov 02 20:37:10  0    0     NaN       NaN  NaN
    2      B002  Mon Nov 02 20:42:12  0    0     NaN       NaN  NaN
    12797  B008  Wed Nov 11 13:00:00  NaN  NaN   40        25   10.0
    12798  B008  Wed Nov 11 14:00:00  NaN  NaN   40        50   20.0
    12799  B008  Wed Nov 11 15:00:00  NaN  NaN   40        25   10.0

These two data frames were then merged as follows:

 DFinal = pd.merge(DF, d3, left_on=["room", "time"], right_on=["room", "time"], how="outer", left_index=False, right_index=False) 

Any help with this would be greatly appreciated.

Thank you very much,

-Romain

3 answers

In fact, I was able to fix this:

First, I sliced the strings in the time column to create two additional columns: one for the day and one for the hour. I used lambda functions to get these columns (applied to each data frame before merging):

    # For "Mon Nov 02 10:32:06": x[8:-8] is "02 " (the day), x[10:-6] is " 10" (the hour)
    df['date'] = df['time'].map(lambda x: x[8:-8])
    df['hour'] = df['time'].map(lambda x: x[10:-6])

Based on these two new columns, I changed the way the data is merged.

Here is the code I used to fix it:

 dataframeFinal = pd.merge(dataframe1, dataframe2, left_on=["room", "date", "hour"], right_on=["room", "date", "hour"], how="outer", left_index=False, right_index=False, copy=False) 

After this merge, I had duplicate time columns ("time_y" and "time_x").
Therefore, I replaced the NaN values as follows:

    dataframeFinal['time_y'] = dataframeFinal['time_y'].fillna(dataframeFinal['time_x'])

Now the column "time_y" contains all the time values, with no NaN left. I don't need the "time_x" column, so I drop it from the data frame:

 dataframeFinal = dataframeFinal.drop('time_x', axis=1) 
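Put together, the steps above might look roughly like the sketch below. The two small frames are made up, and the slice positions (`str[4:10]` for the date, `str[11:13]` for the hour) are my own choice for strings shaped like "Mon Nov 02 10:32:06", not the exact slices used above.

```python
import pandas as pd

# Hypothetical sample rows in the same string format as the question.
hourly = pd.DataFrame({
    "room": ["B002", "B002"],
    "time": ["Mon Nov 02 10:00:00", "Mon Nov 02 11:00:00"],
    "module": ["COMP30520", "COMP40660"],
    "size": [60, 53],
})
wifi = pd.DataFrame({
    "room": ["B002", "B002"],
    "time": ["Mon Nov 02 10:32:06", "Mon Nov 02 10:37:10"],
    "auth": [23, 25],
})

# Derive the merge keys from the time string in both frames.
for frame in (hourly, wifi):
    frame["date"] = frame["time"].str[4:10]   # e.g. "Nov 02"
    frame["hour"] = frame["time"].str[11:13]  # e.g. "10"

# Merge on room, date and hour instead of the exact timestamp.
merged = pd.merge(hourly, wifi, on=["room", "date", "hour"], how="outer")

# "time" exists in both inputs, so the merge yields time_x (hourly) and time_y (wifi).
# Prefer the 5-minute timestamp and fall back to the hourly one where it is missing.
merged["time"] = merged["time_y"].fillna(merged["time_x"])
merged = merged.drop(columns=["time_x", "time_y"])
print(merged)
```

In this sketch the 11:00 row has no wifi match, so it keeps its hourly timestamp, while both 10:xx rows pick up the module and size recorded at 10:00.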

Somewhere to start:

 b = df[(df['time'] > X) & (df['time'] < Y)] 

This selects all rows whose time lies between X and Y.

And then

    df.loc[df['column_name'].isin(b['column_name'])]

This gives you the rows you need (i.e. those between X and Y), and you can simply assign to them as you wish. I think you want to assign the values of the selected rows to the rows at time X?
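As a rough sketch of that idea, assuming parsed timestamps and a made-up four-row frame (the year and the values are hypothetical), a boolean mask can be passed straight to `.loc` to assign into the selected rows. Here the value recorded at X is copied into every row strictly between X and Y:

```python
import pandas as pd

# Hypothetical frame: two hourly rows with a module, two 5-minute rows without.
df = pd.DataFrame({
    "time": pd.to_datetime([
        "2015-11-02 10:00:00", "2015-11-02 10:32:06",
        "2015-11-02 10:37:10", "2015-11-02 11:00:00",
    ]),
    "module": ["COMP30520", None, None, "COMP40660"],
})

X = pd.Timestamp("2015-11-02 10:00:00")
Y = pd.Timestamp("2015-11-02 11:00:00")

# Boolean mask for the rows strictly between X and Y.
in_window = (df["time"] > X) & (df["time"] < Y)

# Value recorded exactly at X, copied into every row of the window.
value_at_x = df.loc[df["time"] == X, "module"].iloc[0]
df.loc[in_window, "module"] = value_at_x
print(df)
```

This handles one window at a time; covering every hour of every day would need a loop or a key-based merge.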

Hope this helps.

Note that these snippets are cut-and-paste jobs from:
[1] Filter rows of a data frame if the value in a column is in a list of values
[2] Select rows from a DataFrame based on values in a column in pandas


If I understood this correctly, you want to fill in all the missing values in the merged data frame with the corresponding nearest data point available in the given hour. I did something essentially similar in the past, using a variation of pandas.cut on time intervals, but I can't find it, and it wasn't really nice anyway.

Although I'm not quite sure, the pandas fillna method may be what you want (see the pandas docs).

Let your two data frames be called df_hour and df_cinq, merged as follows:

 df = pd.merge(df_hour, df_cinq, left_on=["room", "time"], right_on=["room", "time"], how="outer", left_index=False, right_index=False) 

Then you set the index to the time column and sort it:

    df.set_index('time', inplace=True)
    df.sort_index(inplace=True)

The fillna method has a "method" option, which can take these values:

    Method              Action
    pad / ffill         Fill values forward
    bfill / backfill    Fill values backward
    nearest             Fill from the nearest index value

(Note that "nearest" is accepted by reindex, not by fillna.)

Using this for forward filling (i.e. missing values are filled with the previous value in the frame):

 df.fillna(method='ffill', inplace=True) 

The problem with this in your data is that all the missing data during off hours belonging to the 5-minute observations will be filled with outdated data points. You can use the limit parameter to limit the number of consecutive data points that need to be populated, but I don't know if it is useful to you.
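To make the stale-fill concern concrete, here is a small sketch on made-up data (room names, times and the year are assumptions). It compares a plain forward fill, a forward fill capped with `limit`, and a forward fill done per room via `groupby`, which keeps values from leaking from one room into another. It uses the `ffill` method directly, since recent pandas versions deprecate `fillna(method=...)`.

```python
import pandas as pd

# Hypothetical time-indexed frame: one known value followed by gaps,
# with a row from a different room in between.
times = pd.to_datetime([
    "2015-11-02 10:00:00", "2015-11-02 10:05:00", "2015-11-02 10:10:00",
    "2015-11-02 10:15:00", "2015-11-02 10:20:00",
])
df = pd.DataFrame(
    {
        "room": ["B002", "B008", "B002", "B002", "B002"],
        "module": ["COMP30520", None, None, None, None],
    },
    index=times,
)

# Plain forward fill: every gap is filled, including the B008 row.
plain = df["module"].ffill()

# limit=2 fills at most two consecutive gaps after a known value.
limited = df["module"].ffill(limit=2)

# Forward fill within each room only, so B008 does not inherit B002's module.
per_room = df.groupby("room")["module"].ffill()

print(pd.DataFrame({"plain": plain, "limited": limited, "per_room": per_room}))
```

Whether a limit or a per-room fill is appropriate depends on how long an hourly observation should be considered valid, which the data alone cannot tell you.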

Here's the full script I wrote as a toy example:

    import random

    import pandas as pd

    hourly_count = 8        # working hours
    cinq_count = 24 * 12    # one day of 5-minute samples

    # ISO date strings; lowercase 'h' is the hourly alias in current pandas
    hour_rng = pd.date_range('2016-01-01 09:00:00', periods=hourly_count, freq='h')
    cinq_rng = pd.date_range('2016-01-01 00:02:53', periods=cinq_count, freq='5min')

    roomz = 'room0 room1 secretroom'.split()

    # Random hourly data
    hourlydata = {'col1': [], 'col2': [], 'room': []}
    for i in range(hourly_count):
        hourlydata['room'].append(random.choice(roomz))
        hourlydata['col1'].append(random.random())
        hourlydata['col2'].append(random.randint(0, 100))

    # Random 5-minute data
    cinqdata = {'col3': [], 'col4': [], 'room': []}
    frts = 'apples oranges peaches grapefruits whatmore'.split()
    vgtbls = 'onion1 onion2 onion3 onion4 onion5 onion0'.split()
    for i in range(cinq_count):
        cinqdata['room'].append(random.choice(roomz))
        cinqdata['col3'].append(random.choice(frts))
        cinqdata['col4'].append(random.choice(vgtbls))

    hourlydf = pd.DataFrame(hourlydata)
    hourlydf['time'] = hour_rng
    cinqdf = pd.DataFrame(cinqdata)
    cinqdf['time'] = cinq_rng

    df = pd.merge(hourlydf, cinqdf,
                  left_on=['room', 'time'], right_on=['room', 'time'],
                  how='outer', left_index=False, right_index=False)
    df.set_index('time', inplace=True)
    df.sort_index(inplace=True)
    # Same effect as fillna(method='ffill'), which recent pandas versions deprecate
    df.ffill(inplace=True)
    print(df.loc['2016-01-01 09:00:00':'2016-01-01 17:00:00'])