I have a massive JSON data frame formatted as follows:
[ [{ "created_at": "2017-04-28T16:52:36Z", "as_of": "2017-04-28T17:00:05Z", "trends": [{ "url": "http://twitter.com/search?q=%23ChavezSigueCandanga", "query": "%23ChavezSigueCandanga", "tweet_volume": 44587, "name": "#ChavezSigueCandanga", "promoted_content": null }, { "url": "http://twitter.com/search?q=%2327Abr", "query": "%2327Abr", "tweet_volume": 79781, "name": "#27Abr", "promoted_content": null }], "locations": [{ "woeid": 395277, "name": "Turmero" }] }], [{ "created_at": "2017-04-28T16:57:35Z", "as_of": "2017-04-28T17:00:03Z", "trends": [{ "url": "http://twitter.com/search?q=%23fyrefestival", "query": "%23fyrefestival", "tweet_volume": 141385, "name": "#fyrefestival", "promoted_content": null }, { "url": "http://twitter.com/search?q=%23HotDocs17", "query": "%23HotDocs17", "tweet_volume": null, "name": "#HotDocs17", "promoted_content": null }], "locations": [{ "woeid": 9807, "name": "Vancouver" }] }] ]...
I wrote a function that formats it in the pandas framework, which takes this form:
+----+--------------------------------+------------------+----------------------------------+--------------+--------------------------------------------------------------+----------------------+----------------------+---------------+----------------+ | | name | promoted_content | query | tweet_volume | url | as_of | created_at | location_name | location_woeid | +----+--------------------------------+------------------+----------------------------------+--------------+--------------------------------------------------------------+----------------------+----------------------+---------------+----------------+ | 47 | #BatesMotel | | %23BatesMotel | 59748 | http://twitter.com/search?q=%23BatesMotel | 2017-04-25T17:00:05Z | 2017-04-25T16:53:43Z | Winnipeg | 2972 | | 48 |
This is a function that writes JSON to a DataFrame:
def trends_to_dataframe(data): df = pd.DataFrame() for location in data: temp_df = pd.DataFrame() for trend in location[0]['trends']: temp_df = temp_df.append(pd.Series(trend), ignore_index=True) temp_df['as_of'] = location[0]['as_of'] temp_df['created_at'] = location[0]['created_at'] temp_df['location_name'] = location[0]['locations'][0]['name'] temp_df['location_woeid'] = location[0]['locations'][0]['woeid'] df = df.append(temp_df) return df
Unfortunately, with the amount of data that I have (and some simple timers that I tested), it will take about 4 hours. Any thoughts on how to speed this up?