Speed up JSON to DataFrame conversion with lots of data manipulation

I have a massive amount of JSON data formatted as follows:

    [
      [{
        "created_at": "2017-04-28T16:52:36Z",
        "as_of": "2017-04-28T17:00:05Z",
        "trends": [{
          "url": "http://twitter.com/search?q=%23ChavezSigueCandanga",
          "query": "%23ChavezSigueCandanga",
          "tweet_volume": 44587,
          "name": "#ChavezSigueCandanga",
          "promoted_content": null
        }, {
          "url": "http://twitter.com/search?q=%2327Abr",
          "query": "%2327Abr",
          "tweet_volume": 79781,
          "name": "#27Abr",
          "promoted_content": null
        }],
        "locations": [{
          "woeid": 395277,
          "name": "Turmero"
        }]
      }],
      [{
        "created_at": "2017-04-28T16:57:35Z",
        "as_of": "2017-04-28T17:00:03Z",
        "trends": [{
          "url": "http://twitter.com/search?q=%23fyrefestival",
          "query": "%23fyrefestival",
          "tweet_volume": 141385,
          "name": "#fyrefestival",
          "promoted_content": null
        }, {
          "url": "http://twitter.com/search?q=%23HotDocs17",
          "query": "%23HotDocs17",
          "tweet_volume": null,
          "name": "#HotDocs17",
          "promoted_content": null
        }],
        "locations": [{
          "woeid": 9807,
          "name": "Vancouver"
        }]
      }],
      ...
    ]
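For reference, a minimal sketch of how data shaped like this can be loaded into Python; the file name trends.json is just a hypothetical placeholder for wherever the JSON actually lives:

    import json

    # Hypothetical path; the file holds a list of single-element lists,
    # one per location, exactly as shown above.
    with open('trends.json') as f:
        data = json.load(f)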

I wrote a function that converts it into a pandas DataFrame, which takes this form:

+----+--------------------------------+------------------+----------------------------------+--------------+--------------------------------------------------------------+----------------------+----------------------+---------------+----------------+
|    | name                           | promoted_content | query                            | tweet_volume | url                                                          | as_of                | created_at           | location_name | location_woeid |
+----+--------------------------------+------------------+----------------------------------+--------------+--------------------------------------------------------------+----------------------+----------------------+---------------+----------------+
| 47 | #BatesMotel                    |                  | %23BatesMotel                    | 59748        | http://twitter.com/search?q=%23BatesMotel                    | 2017-04-25T17:00:05Z | 2017-04-25T16:53:43Z | Winnipeg      | 2972           |
| 48 | #AdviceForPeopleJoiningTwitter |                  | %23AdviceForPeopleJoiningTwitter | 51222        | http://twitter.com/search?q=%23AdviceForPeopleJoiningTwitter | 2017-04-25T17:00:05Z | 2017-04-25T16:53:43Z | Winnipeg      | 2972           |
| 49 | #CADTHSymp                     |                  | %23CADTHSymp                     |              | http://twitter.com/search?q=%23CADTHSymp                     | 2017-04-25T17:00:05Z | 2017-04-25T16:53:43Z | Winnipeg      | 2972           |
|  0 | #WorldPenguinDay               |                  | %23WorldPenguinDay               | 79006        | http://twitter.com/search?q=%23WorldPenguinDay               | 2017-04-25T17:00:05Z | 2017-04-25T16:58:22Z | Toronto       | 4118           |
|  1 | #TravelTuesday                 |                  | %23TravelTuesday                 |              | http://twitter.com/search?q=%23TravelTuesday                 | 2017-04-25T17:00:05Z | 2017-04-25T16:58:22Z | Toronto       | 4118           |
|  2 | #DigitalLeap                   |                  | %23DigitalLeap                   |              | http://twitter.com/search?q=%23DigitalLeap                   | 2017-04-25T17:00:05Z | 2017-04-25T16:58:22Z | Toronto       | 4118           |
|  … | …                              | …                | …                                | …            | …                                                            | …                    | …                    | …             | …              |
|  0 | #nusnc17                       |                  | %23nusnc17                       |              | http://twitter.com/search?q=%23nusnc17                       | 2017-04-25T17:00:05Z | 2017-04-25T16:58:24Z | Birmingham    | 12723          |
|  1 | #WorldPenguinDay               |                  | %23WorldPenguinDay               | 79006        | http://twitter.com/search?q=%23WorldPenguinDay               | 2017-04-25T17:00:05Z | 2017-04-25T16:58:24Z | Birmingham    | 12723          |
|  2 | #littleboyblue                 |                  | %23littleboyblue                 | 20772        | http://twitter.com/search?q=%23littleboyblue                 | 2017-04-25T17:00:05Z | 2017-04-25T16:58:24Z | Birmingham    | 12723          |
+----+--------------------------------+------------------+----------------------------------+--------------+--------------------------------------------------------------+----------------------+----------------------+---------------+----------------+

This is the function that builds that DataFrame from the JSON:

    import pandas as pd

    def trends_to_dataframe(data):
        df = pd.DataFrame()
        for location in data:
            temp_df = pd.DataFrame()
            for trend in location[0]['trends']:
                # Each trend becomes one row (temp_df is copied on every append).
                temp_df = temp_df.append(pd.Series(trend), ignore_index=True)

            # Attach the location-level fields to every row of this block.
            temp_df['as_of'] = location[0]['as_of']
            temp_df['created_at'] = location[0]['created_at']
            temp_df['location_name'] = location[0]['locations'][0]['name']
            temp_df['location_woeid'] = location[0]['locations'][0]['woeid']

            df = df.append(temp_df)
        return df

Unfortunately, with the amount of data I have (based on some simple timing tests), this will take about 4 hours to run. Any thoughts on how to speed it up?
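A rough sketch of the kind of simple timing test mentioned above; the 100-location slice is an arbitrary sample size used only to extrapolate the full runtime:

    import time

    start = time.perf_counter()
    trends_to_dataframe(data[:100])  # arbitrary small slice for a quick estimate
    print(f'{time.perf_counter() - start:.2f} s for 100 locations')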

1 answer

You can speed up the process by flattening the data concurrently with concurrent.futures, and then loading all of the records into a DataFrame in a single from_records call.

    from concurrent.futures import ThreadPoolExecutor

    import pandas as pd

    def get_trends(location):
        """Flatten one location entry into a list of trend records."""
        trends = []
        for trend in location[0]['trends']:
            # Copy the location-level fields onto each trend record.
            trend['as_of'] = location[0]['as_of']
            trend['created_at'] = location[0]['created_at']
            trend['location_name'] = location[0]['locations'][0]['name']
            trend['location_woeid'] = location[0]['locations'][0]['woeid']
            trends.append(trend)
        return trends

    flat_data = []
    with ThreadPoolExecutor() as executor:
        # Flatten every location and gather the records into one flat list.
        for trends in executor.map(get_trends, data):
            flat_data += trends

    # Build the DataFrame in a single call instead of appending row by row.
    df = pd.DataFrame.from_records(flat_data)
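As a note on the design: most of the gain comes from building one flat list and creating the DataFrame once with from_records rather than appending row by row. The flattening itself is plain CPU-bound Python, so a simple comprehension (sketched below, reusing the get_trends helper above) should land in roughly the same ballpark even without the thread pool:

    # Same flattening without the executor, reusing get_trends from above.
    flat_data = [trend for location in data for trend in get_trends(location)]
    df = pd.DataFrame.from_records(flat_data)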
