Extract data from a URL result using custom formatting

I have a url:
http://somewhere.com/relatedqueries?limit=2&query=seedterm

where changing the limit and query inputs will generate the required data. The limit is the maximum number of returned terms, and the query is the seed term.

The URL provides a text result formatted as follows:
oo.visualization.Query.setResponse({version: '0.5', reqId: '0', status: 'ok', sig: '1303596067112929220', table: {cols: [{id: 'score', label: 'Score', type: 'number', pattern: '#,##0.###'}, {id: 'query', label: 'Query', type: 'string', pattern: ''}], rows: [{c: [{v: 0.9894380670262618, f: '0.99'}, {v: 'newterm1'}]}, {c: [{v: 0.9894380670262618, f: '0.99'}, {v: 'newterm2'}]}], p: {'totalResultsCount': '7727'}}});

I would like to write a python script that takes two arguments (limit number and query seed), iterates over the data online, parses the result and returns a list with new terms ['newterm1', 'newterm2'] in this case.

I would really appreciate some help, especially with fetching the URL, as I have never done this before.

2 answers

It looks like you can break this problem down into several subtasks.

Subtasks

There are several problems to solve before assembling a complete script:

  • Generating the request URL: creating a custom request URL from a template
  • Fetching the data: performing the request
  • Unwrapping the JSONP: the returned data looks like JSON wrapped in a JavaScript function call
  • Traversing the object graph: navigating the result to find the desired bits of information

Generating a request URL

This is just string formatting.

    url_template = 'http://somewhere.com/relatedqueries?limit={limit}&query={seedterm}'
    url = url_template.format(limit=2, seedterm='seedterm')
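If the seed term can contain spaces or other characters that are unsafe in a URL, it is safer to percent-encode it. A minimal sketch using the standard `urllib.parse.urlencode` (the host and parameter names are the placeholders from the question):

```python
from urllib.parse import urlencode

def build_url(limit, seedterm):
    # urlencode percent-escapes unsafe characters in the query string
    params = urlencode({'limit': limit, 'query': seedterm})
    return 'http://somewhere.com/relatedqueries?' + params

url = build_url(2, 'seed term')  # note the space in the seed term
```

This handles escaping for you instead of relying on plain string formatting.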

Python 2 Note

In Python 2, use the string formatting operator ( % ) instead.

    url_template = 'http://somewhere.com/relatedqueries?limit=%(limit)d&query=%(seedterm)s'
    url = url_template % dict(limit=2, seedterm='seedterm')

Data retrieval

For this you can use the built-in urllib.request module.

    import urllib.request

    data = urllib.request.urlopen(url)  # url from the previous section

This returns a file-like object called data . You can also use a with statement here:

    with urllib.request.urlopen(url) as data:
        # do processing here
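One detail worth knowing: in Python 3, `data.read()` returns bytes, not a string, so you will usually want to decode before parsing. A sketch that consults the response headers for the declared charset, falling back to UTF-8 (the fallback is an assumption; the server may declare something else):

```python
import urllib.request

def decode_response(raw, charset):
    # fall back to UTF-8 when the server does not declare a charset
    return raw.decode(charset or 'utf-8')

def fetch_text(url):
    with urllib.request.urlopen(url) as resp:
        # HTTPResponse.read() returns bytes in Python 3;
        # get_content_charset() reads the Content-Type header
        return decode_response(resp.read(), resp.headers.get_content_charset())
```

`fetch_text` is not run here since it needs the network; `decode_response` is the reusable part.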

Python 2 Note

Import urllib2 instead of urllib.request . Note that in Python 2 the object returned by urllib2.urlopen cannot be used in a with statement, so close it explicitly.

Unwrapping the JSONP

The result you pasted looks like JSONP. Given that the wrapping function call ( oo.visualization.Query.setResponse ) does not change, we can simply strip it off.

    result = data.read()
    prefix = 'oo.visualization.Query.setResponse('
    suffix = ');'
    if result.startswith(prefix) and result.endswith(suffix):
        result = result[len(prefix):-len(suffix)]
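If you would rather not hard-code the exact callback name, a regular expression can strip any function-call wrapper. A sketch (the pattern for what counts as a callback name is an assumption):

```python
import re

def unwrap_jsonp(text):
    # match `dotted.identifier( ... );` and capture the payload inside
    match = re.match(r'^\s*[\w.$]+\s*\((.*)\)\s*;?\s*$', text, re.DOTALL)
    if match is None:
        raise ValueError('response does not look like JSONP')
    return match.group(1)
```

This tolerates a renamed callback, trailing whitespace, and a missing semicolon.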

JSON parsing

The result string is now plain JSON data. Parse it with the built-in json module.

    import json

    result_object = json.loads(result)

Traversing the object graph

You now have a result_object that represents the JSON response. The object itself will be a dict with keys such as version , reqId , and so on. Based on your question, here is what you need to do to build the list.

    # Get the rows in the table, then get the second column's value
    # for each row
    terms = [row['c'][1]['v'] for row in result_object['table']['rows']]
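Expanded into an explicit loop against a miniature hand-built object with the same shape as the parsed response (note that the second column is index 1):

```python
# a miniature result object shaped like the parsed JSONP response
result_object = {
    'table': {
        'rows': [
            {'c': [{'v': 0.9894, 'f': '0.99'}, {'v': 'newterm1'}]},
            {'c': [{'v': 0.9894, 'f': '0.99'}, {'v': 'newterm2'}]},
        ]
    }
}

terms = []
for row in result_object['table']['rows']:
    columns = row['c']             # the list of column cells for this row
    terms.append(columns[1]['v'])  # index 1 is the second (query) column

# terms == ['newterm1', 'newterm2']
```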

Putting it all together

    #!/usr/bin/env python3
    """A script for retrieving and parsing results from requests to
    somewhere.com.

    This script works as either a standalone script or as a library.
    To use it as a standalone script, run it as `python3 scriptname.py`.
    To use it as a library, use the `retrieve_terms` function."""

    import json
    import sys
    import urllib.error
    import urllib.request

    E_OPERATION_ERROR = 1
    E_INVALID_PARAMS = 2

    def parse_result(result):
        """Parse a JSONP result string and return a list of terms."""
        prefix = 'oo.visualization.Query.setResponse('
        suffix = ');'
        # Strip the JSONP function wrapper
        if result.startswith(prefix) and result.endswith(suffix):
            result = result[len(prefix):-len(suffix)]
        # Deserialize the JSON to Python objects
        result_object = json.loads(result)
        # Get the rows in the table, then get the second column's value
        # for each row
        return [row['c'][1]['v'] for row in result_object['table']['rows']]

    def retrieve_terms(limit, seedterm):
        """Retrieve and parse data and return a list of terms."""
        url_template = 'http://somewhere.com/relatedqueries?limit={limit}&query={seedterm}'
        url = url_template.format(limit=limit, seedterm=seedterm)
        try:
            with urllib.request.urlopen(url) as data:
                result = data.read().decode('utf-8')
        except urllib.error.URLError:
            print('Could not request data from server', file=sys.stderr)
            exit(E_OPERATION_ERROR)
        return parse_result(result)

    def main(limit, seedterm):
        """Retrieve and parse data and print each term to standard output."""
        terms = retrieve_terms(limit, seedterm)
        for term in terms:
            print(term)

    if __name__ == '__main__':
        try:
            limit = int(sys.argv[1])
            seedterm = sys.argv[2]
        except (IndexError, ValueError):
            error_message = 'usage: {} limit seedterm\n\nlimit must be an integer'.format(sys.argv[0])
            print(error_message, file=sys.stderr)
            exit(E_INVALID_PARAMS)
        exit(main(limit, seedterm))
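As a side note, the manual sys.argv handling in a script like this can be replaced with argparse from the standard library, which gives you usage messages and type checking for free. A minimal sketch (the argument names mirror the ones used above):

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(
        description='Retrieve related query terms.')
    # type=int makes argparse reject non-integer limits with a clear error
    parser.add_argument('limit', type=int, help='maximum number of terms')
    parser.add_argument('seedterm', help='seed term for the query')
    return parser

# parse_args normally reads sys.argv; pass a list explicitly to demonstrate
args = build_parser().parse_args(['2', 'seedterm'])
# args.limit == 2, args.seedterm == 'seedterm'
```

With argparse, a bad limit produces a usage message automatically instead of a hand-rolled error string.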

Python Version 2.7

    #!/usr/bin/env python2.7
    """A script for retrieving and parsing results from requests to
    somewhere.com.

    This script works as either a standalone script or as a library.
    To use it as a standalone script, run it as `python2.7 scriptname.py`.
    To use it as a library, use the `retrieve_terms` function."""

    import json
    import sys
    import urllib2

    E_OPERATION_ERROR = 1
    E_INVALID_PARAMS = 2

    def parse_result(result):
        """Parse a JSONP result string and return a list of terms."""
        prefix = 'oo.visualization.Query.setResponse('
        suffix = ');'
        # Strip the JSONP function wrapper
        if result.startswith(prefix) and result.endswith(suffix):
            result = result[len(prefix):-len(suffix)]
        # Deserialize the JSON to Python objects
        result_object = json.loads(result)
        # Get the rows in the table, then get the second column's value
        # for each row
        return [row['c'][1]['v'] for row in result_object['table']['rows']]

    def retrieve_terms(limit, seedterm):
        """Retrieve and parse data and return a list of terms."""
        url_template = 'http://somewhere.com/relatedqueries?limit=%(limit)d&query=%(seedterm)s'
        url = url_template % dict(limit=limit, seedterm=seedterm)
        try:
            # urllib2 responses are not context managers in Python 2,
            # so close the connection explicitly
            data = urllib2.urlopen(url)
            try:
                result = data.read()
            finally:
                data.close()
        except urllib2.URLError:
            sys.stderr.write('Could not request data from server\n')
            exit(E_OPERATION_ERROR)
        return parse_result(result)

    def main(limit, seedterm):
        """Retrieve and parse data and print each term to standard output."""
        terms = retrieve_terms(limit, seedterm)
        for term in terms:
            print term

    if __name__ == '__main__':
        try:
            limit = int(sys.argv[1])
            seedterm = sys.argv[2]
        except (IndexError, ValueError):
            error_message = 'usage: %s limit seedterm\n\nlimit must be an integer' % sys.argv[0]
            sys.stderr.write('%s\n' % error_message)
            exit(E_INVALID_PARAMS)
        exit(main(limit, seedterm))

I didn't fully understand your problem, because from your code it looks like you are using the Visualization API (which, by the way, is the first time I have heard of it).

But if you are just looking for a way to retrieve data from a web page, you can use urllib2 ; it is made just for retrieving data. If you want to parse the retrieved data, you will need a more suitable library, such as BeautifulSoup.

If you are dealing with a web service (RSS, Atom, RPC) rather than web pages, you can find a bunch of Python libraries that deal with each of those services.

    import urllib2
    from BeautifulSoup import BeautifulSoup

    result = urllib2.urlopen('http://somewhere.com/relatedqueries?limit=%s&query=%s' % (2, 'seedterm'))
    htmltext = result.read()
    result.close()

    soup = BeautifulSoup(htmltext, convertEntities="html")
    # you can parse your data now; check the BeautifulSoup API
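If you want to avoid the third-party dependency entirely, the standard library's html.parser can handle simple extraction. A sketch collecting the text of a tags from an inline snippet (the snippet itself is made up for illustration):

```python
from html.parser import HTMLParser

class LinkTextCollector(HTMLParser):
    """Collect the text content of <a> elements."""

    def __init__(self):
        super().__init__()
        self.in_link = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == 'a':
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.links.append(data)

parser = LinkTextCollector()
parser.feed('<ul><li><a href="/q/1">first</a></li>'
            '<li><a href="/q/2">second</a></li></ul>')
# parser.links == ['first', 'second']
```

BeautifulSoup is more forgiving of malformed HTML, but for well-formed markup this keeps everything in the standard library.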
