How to flatten a pandas DataFrame with some columns as JSON?

I have a DataFrame df that loads data from a database. Most columns are JSON strings, and some are lists of JSON objects. For example:

    id  name  columnA                             columnB
    1   John  {"dist": "600", "time": "0:12.10"}  [{"pos": "1st", "value": "500"},{"pos": "2nd", "value": "300"},{"pos": "3rd", "value": "200"}, {"pos": "total", "value": "1000"}]
    2   Mike  {"dist": "600"}                     [{"pos": "1st", "value": "500"},{"pos": "2nd", "value": "300"},{"pos": "total", "value": "800"}]
    ...

As you can see, the JSON in a given column does not have the same number of elements in every row.

What I need to do is keep the regular columns, such as id and name, as they are, and flatten the JSON columns as follows:

    id  name  columnA.dist  columnA.time  columnB.pos.1st  columnB.pos.2nd  columnB.pos.3rd  columnB.pos.total
    1   John  600           0:12.10       500              300              200              1000
    2   Mike  600           NaN           500              300              NaN              800

I tried to use json_normalize like this:

    from pandas.io.json import json_normalize
    json_normalize(df)

But this raises a KeyError. What is the right way to do this?

+21
json python pandas flatten dataframe
4 answers

Here's a solution that still uses json_normalize(), with user-defined functions to get the data into the format json_normalize expects.

    import ast
    import pandas as pd

    def only_dict(d):
        '''
        Convert json string representation of dictionary to a python dict
        '''
        return ast.literal_eval(d)

    def list_of_dicts(ld):
        '''
        Create a pos -> value mapping after converting
        the json string of a list to a python list
        '''
        # Look the keys up by name: relying on the order of d.values()
        # swaps key and value when dicts keep insertion order (Python 3.7+)
        return {d['pos']: d['value'] for d in ast.literal_eval(ld)}

    # pd.json_normalize is the top-level name since pandas 1.0
    A = pd.json_normalize(df['columnA'].apply(only_dict).tolist()).add_prefix('columnA.')
    B = pd.json_normalize(df['columnB'].apply(list_of_dicts).tolist()).add_prefix('columnB.pos.')

Finally, join the DataFrames on their common index to get:

 df[['id', 'name']].join([A, B]) 



EDIT: As @MartijnPieters noted in a comment, the recommended way to decode JSON strings is json.loads(), which is much faster than ast.literal_eval() if you know that the data source is JSON.
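
As a minimal sketch of that suggestion, the same steps can be written with json.loads(). The sample frame is reconstructed from the two rows shown in the question, parse_positions is a hypothetical helper name, and pd.json_normalize assumes pandas 1.0 or later.

```python
import json
import pandas as pd

# Sample data reconstructed from the question (two rows only).
df = pd.DataFrame({
    "id": [1, 2],
    "name": ["John", "Mike"],
    "columnA": ['{"dist": "600", "time": "0:12.10"}', '{"dist": "600"}'],
    "columnB": [
        '[{"pos": "1st", "value": "500"}, {"pos": "2nd", "value": "300"}, '
        '{"pos": "3rd", "value": "200"}, {"pos": "total", "value": "1000"}]',
        '[{"pos": "1st", "value": "500"}, {"pos": "2nd", "value": "300"}, '
        '{"pos": "total", "value": "800"}]',
    ],
})


def parse_positions(s):
    # Hypothetical helper: JSON list of {"pos", "value"} objects -> {pos: value}
    return {d["pos"]: d["value"] for d in json.loads(s)}


# Decode each JSON string, then let json_normalize build one column per key.
A = pd.json_normalize(df["columnA"].apply(json.loads).tolist()).add_prefix("columnA.")
B = pd.json_normalize(df["columnB"].apply(parse_positions).tolist()).add_prefix("columnB.pos.")

# Join on the shared default index; keys missing from a row become NaN.
flat = df[["id", "name"]].join([A, B])
print(flat)
```

Because the values in the sample JSON are strings, the flattened columns hold strings too; convert them with pd.to_numeric if numbers are needed.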

+21

Create a custom function to flatten columnB, then use pd.concat:

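    import pandas as pd

    def flatten(js):
        # index the list of dicts by 'pos'; squeeze() turns the remaining 'value' column into a Series
        return pd.DataFrame(js).set_index('pos').squeeze()

    pd.concat([df.drop(['columnA', 'columnB'], axis=1),
               df.columnA.apply(pd.Series),
               df.columnB.apply(flatten)], axis=1)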

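
The snippet in this answer expects columnA and columnB to already hold Python dicts and lists rather than JSON strings. A minimal sketch of the whole pipeline, assuming the columns arrive as strings and are decoded with json.loads first (the sample frame is reconstructed from the question; col_a, col_b and result are illustrative names):

```python
import json
import pandas as pd

# Sample data reconstructed from the question (two rows only).
df = pd.DataFrame({
    "id": [1, 2],
    "name": ["John", "Mike"],
    "columnA": ['{"dist": "600", "time": "0:12.10"}', '{"dist": "600"}'],
    "columnB": [
        '[{"pos": "1st", "value": "500"}, {"pos": "2nd", "value": "300"}, '
        '{"pos": "3rd", "value": "200"}, {"pos": "total", "value": "1000"}]',
        '[{"pos": "1st", "value": "500"}, {"pos": "2nd", "value": "300"}, '
        '{"pos": "total", "value": "800"}]',
    ],
})


def flatten(js):
    # Index the list of dicts by 'pos' and return the 'value' column as a Series
    return pd.DataFrame(js).set_index("pos").squeeze()


# Decode the JSON strings, then expand each cell into columns.
col_a = df["columnA"].apply(json.loads).apply(pd.Series)
col_b = df["columnB"].apply(json.loads).apply(flatten)

result = pd.concat([df.drop(["columnA", "columnB"], axis=1), col_a, col_b], axis=1)
print(result)
```

Unlike the first answer, the new columns are not prefixed here (dist, time, 1st, ...); add_prefix can be applied to col_a and col_b if the columnA. and columnB.pos. names are wanted.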

+8

The fastest seems to be:

    import json
    import pandas as pd

    json_struct = json.loads(df.to_json(orient="records"))
    df_flat = pd.json_normalize(json_struct)  # on pandas < 1.0, use pd.io.json.json_normalize
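
A small sketch of what this round trip does, assuming the nested column already holds Python dicts (not JSON strings) and pandas 1.0 or later for pd.json_normalize. The sample frame is illustrative. Only nested dicts are expanded this way; list-valued cells such as columnB stay as lists.

```python
import json
import pandas as pd

# Illustrative frame: columnA holds already-decoded dicts.
df = pd.DataFrame({
    "id": [1, 2],
    "name": ["John", "Mike"],
    "columnA": [{"dist": "600", "time": "0:12.10"}, {"dist": "600"}],
})

# Serialize the frame to JSON records, load them back as plain Python
# objects, and let json_normalize expand the nested dicts.
json_struct = json.loads(df.to_json(orient="records"))
df_flat = pd.json_normalize(json_struct)
print(df_flat)
```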
0
 data = { "data": [ { "date": "2018-08-20T00:00:00", "values": [ { "account_id": "account_1", "device_id": "device_1", "deviceModel": "testdev", "csp_id": "csp_device_1", "Events": [ { "EventCategory": "Security Scan", "EventCategoryData": [ { "name": "security_scan_malware_detected", "info": [ { "threat": "Pup", "count": 8.0 } ] }, { "name": "security_scan_malware_removed", "info": [ { "threat": "adware", "count": 1.0 } ] } ], "scancount": 2.0 }, { "EventCategory": "Web Security", "EventCategoryData": [ { "name": "web_security_number_of_unverified_sites", "info": [ { "threat": "Unverified Web Sites", "count": 2.0 } ] }, { "name": "web_security_number_of_suspicious_sites", "info": [ { "threat": "Suspicious Web Sites", "count": 0.0 } ] }, { "name": "web_security_number_of_risky_sites", "info": [ { "threat": "Risky Web Sites", "count": 2.0 } ] } ] }, { "EventCategory": "Network Security", "EventCategoryData": [ { "name": "network_security_threat_detected", "info": [ { "threat": "Wap-wifi", "count": 2.0 } ] } ], "scancount": 4.0 }, { "EventCategory": "Others", "EventCategoryData": [ { "name": "security_scan_dat_update_complete", "info": [ { "previousversion": "default", "updatedversion": "default" } ] } ] } ] } ] }, { "date": "2018-08-22T00:00:00", "values": [ { "account_id": "account_1", "device_id": "device_1", "deviceModel": "testdev", "csp_id": "csp_device_1", "Events": [ { "EventCategory": "Security Scan", "EventCategoryData": [ { "name": "security_scan_malware_detected", "info": [ { "threat": "Pup", "count": 2 } ] }, { "name": "security_scan_malware_removed", "info": [ { "threat": "Malware", "count": 1 }, { "threat": "Pup", "count": 1 } ] } ], "scancount": 1 }, { "EventCategory": "Web Security", "EventCategoryData": [ { "name": "web_security_number_of_unverified_sites", "info": [ { "threat": "Unverified Web Sites", "count": 1 } ] }, { "name": "web_security_number_of_suspicious_sites", "info": [ { "threat": "Suspicious Web Sites", "count": 1 } ] }, { "name": 
"web_security_number_of_risky_sites", "info": [ { "threat": "Risky Web Sites", "count": 1 } ] } ] }, { "EventCategory": "Network Security", "EventCategoryData": [ { "name": "network_security_threat_detected", "info": [ { "threat": "OpenWifi", "count": 1 }, { "threat": "Wap-wifi", "count": 1 } ] } ], "scancount": 1 }, { "EventCategory": "Others", "EventCategoryData": [ { "name": "security_scan_dat_update_complete", "info": [ { "previousversion": "default", "updatedversion": "default" } ] } ] } ] } ] } ], "status": "success", "identifier": "device_1", "identifier_type": "csp", "query_type": "rt_aggregate", "info_type": "" } 

I have the same requirement for the data above, but I cannot use Nickil Maveli's solution.

0
