How can I lazily read multiple JSON values from a file / stream in Python?

Question

How can I lazily read multiple JSON values from a file / stream in Python?

I would like to read several JSON objects from a file / stream in Python, one at a time. Unfortunately json.load() only .read() to the end of the file; there seems to be no way to use it to read a single object or to lazily loop through objects.

Is there any way to do this? Using a standard library would be ideal, but if there was a third-party library, I would use it.

At the moment, I put each object on a separate line and using json.loads(f.readline()) , but I would prefer not to.

Usage example

example.py

 import my_json as json import sys for o in json.iterload(sys.stdin): print("Working on a", type(o))

in.txt

 {"foo": ["bar", "baz"]} 1 2 [] 4 5 6

session example

 $ python3.2 example.py < in.txt Working on a dict Working on a int Working on a int Working on a list Working on a int Working on a int Working on a int

+91

json python serialization

Jeremy Banks Jul 30 '11 at 10:12

source share

11 answers

JSON is generally not very good for this kind of incremental use; There is no standard way to serialize multiple objects so that they can be easily downloaded one at a time without analyzing the entire batch.

The solution you use for each row is visible elsewhere. Scrapy calls this the "JSON lines":

You can do this a little more Pythonically:

 for jsonline in f: yield json.loads(jsonline) # or do the processing in this loop

I think this is the best way - it does not depend on third-party libraries, and it is easy to understand what is happening. I also used this in my own code.

+38

Thomas K Jul 30 2018-11-21T00:

source share

A little late, maybe, but I had this exact problem (well, more or less). My standard solution to these problems is usually to simply decompose regular expressions on some known root object, but in my case this was not possible. The only possible way to do this in the general case is to implement the correct tokenizer.

After you did not find a sufficiently general and sufficiently effective solution, I ended this myself by writing the splitstream module. This is a preliminary tokenizer that understands JSON and XML and splits the continuous stream into several fragments for parsing (it leaves the actual parsing to you, though). To get some kind of performance, it is written as module C.

Example:

 from splitstream import splitfile for jsonstr in splitfile(sys.stdin, format="json")): yield json.loads(jsonstr)

+29

Krumelur Jun 25 '15 at 21:14

source share

Of course you can do it. You just need to accept raw_decode directly. This implementation loads the entire file into memory and works with this line (as json.load does); if you have large files, you can change it to read only from a file if necessary without much difficulty.

 import json from json.decoder import WHITESPACE def iterload(string_or_fp, cls=json.JSONDecoder, **kwargs): if isinstance(string_or_fp, file): string = string_or_fp.read() else: string = str(string_or_fp) decoder = cls(**kwargs) idx = WHITESPACE.match(string, 0).end() while idx < len(string): obj, end = decoder.raw_decode(string, idx) yield obj idx = WHITESPACE.match(string, end).end()

Usage: just as you requested, this is a generator.

+26

Jeremy Roman Jul 31 '11 at 0:18

source share

This is a pretty nasty problem actually, because you need to translate the strings, but matching patterns between multiple strings in curly braces, and also matching the json pattern. This is a kind of json-prepse followed by json parse. Json, compared to other formats, is easy to parse, so it’s not always necessary to look for a parse library, however, how do we solve these conflicting problems?

Generators to the rescue!

The beauty of generators for such a problem lies in the fact that you can stack them on top of each other, gradually abstracting the complexity of the problem, while maintaining laziness. I also looked at using a mechanism to pass values to a generator (send ()), but, fortunately, I didn't need to use it.

To solve the first of the problems, you will need some kind of streaming fader, like the streaming version of re.finditer. My attempt at this below pulls the lines as needed (uncomment the debug expression to see) while still returning matches. I actually then changed it a bit to get the wrong lines, as well as matches (marked as 0 or 1 in the first part of the resulting tuple).

 import re def streamingfinditer(pat,stream): for s in stream: # print "Read next line: " + s while 1: m = re.search(pat,s) if not m: yield (0,s) break yield (1,m.group()) s = re.split(pat,s,1)[1]

Thus, then it is possible to match until there are brackets, each time take into account whether curly braces are balanced, and then both simple and complex objects are returned as necessary.

 braces='{}[]' whitespaceesc=' \t' bracesesc='\\'+'\\'.join(braces) balancemap=dict(zip(braces,[1,-1,1,-1])) bracespat='['+bracesesc+']' nobracespat='[^'+bracesesc+']*' untilbracespat=nobracespat+bracespat def simpleorcompoundobjects(stream): obj = "" unbalanced = 0 for (c,m) in streamingfinditer(re.compile(untilbracespat),stream): if (c == 0): # remainder of line returned, nothing interesting if (unbalanced == 0): yield (0,m) else: obj += m if (c == 1): # match returned if (unbalanced == 0): yield (0,m[:-1]) obj += m[-1] else: obj += m unbalanced += balancemap[m[-1]] if (unbalanced == 0): yield (1,obj) obj=""

This returns the tuples as follows:

 (0,"String of simple non-braced objects easy to parse") (1,"{ 'Compound' : 'objects' }")

Basically, that unpleasant part is done. Now we just need to do the final level of parsing as we see fit. For example, we can use the Jeremy Roman iterload function (Thanks!) To parse one line:

 def streamingiterload(stream): for c,o in simpleorcompoundobjects(stream): for x in iterload(o): yield x

Check this:

 of = open("test.json","w") of.write("""[ "hello" ] { "goodbye" : 1 } 1 2 { } 2 9 78 4 5 { "animals" : [ "dog" , "lots of mice" , "cat" ] } """) of.close() // open & stream the json f = open("test.json","r") for o in streamingiterload(f.readlines()): print o f.close()

I get these results (and if you enable this debug line, you will see that it pulls the lines as needed):

 [u'hello'] {u'goodbye': 1} 1 2 {} 2 9 78 4 5 {u'animals': [u'dog', u'lots of mice', u'cat']}

This will not work for all situations. Due to the implementation of the json library, it is impossible to work completely correctly without overriding the parser yourself.

+26

Benedict Oct 17 '11 at 2:11 a.m.

source share

I believe that the best way to do this is to use a state machine. Below is an example of the code that I developed by converting NodeJS code from the link below in Python ~~3 (the non-local keyword used, available only in Python 3, the code will not work in Python 2)~~

Edit-1: Updated and made code compatible with Python 2

Edit-2: updated and added only version of Python3

https://gist.github.com/creationix/5992451

Python 3 only

 # A streaming byte oriented JSON parser. Feed it a single byte at a time and # it will emit complete objects as it comes across them. Whitespace within and # between objects is ignored. This means it can parse newline delimited JSON. import math def json_machine(emit, next_func=None): def _value(byte_data): if not byte_data: return if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20: return _value # Ignore whitespace if byte_data == 0x22: # " return string_machine(on_value) if byte_data == 0x2d or (0x30 <= byte_data < 0x40): # - or 0-9 return number_machine(byte_data, on_number) if byte_data == 0x7b: #: return object_machine(on_value) if byte_data == 0x5b: # [ return array_machine(on_value) if byte_data == 0x74: # t return constant_machine(TRUE, True, on_value) if byte_data == 0x66: # f return constant_machine(FALSE, False, on_value) if byte_data == 0x6e: # n return constant_machine(NULL, None, on_value) if next_func == _value: raise Exception("Unexpected 0x" + str(byte_data)) return next_func(byte_data) def on_value(value): emit(value) return next_func def on_number(number, byte): emit(number) return _value(byte) next_func = next_func or _value return _value TRUE = [0x72, 0x75, 0x65] FALSE = [0x61, 0x6c, 0x73, 0x65] NULL = [0x75, 0x6c, 0x6c] def constant_machine(bytes_data, value, emit): i = 0 length = len(bytes_data) def _constant(byte_data): nonlocal i if byte_data != bytes_data[i]: i += 1 raise Exception("Unexpected 0x" + str(byte_data)) i += 1 if i < length: return _constant return emit(value) return _constant def string_machine(emit): string = "" def _string(byte_data): nonlocal string if byte_data == 0x22: # " return emit(string) if byte_data == 0x5c: # \ return _escaped_string if byte_data & 0x80: # UTF-8 handling return utf8_machine(byte_data, on_char_code) if byte_data < 0x20: # ASCII control character raise Exception("Unexpected control character: 0x" + str(byte_data)) string += chr(byte_data) return _string def _escaped_string(byte_data): nonlocal string if byte_data == 0x22 or byte_data == 0x5c or byte_data == 0x2f: # " \ / string += chr(byte_data) return _string if byte_data == 0x62: # b string += "\b" return _string if byte_data == 0x66: # f string += "\f" return _string if byte_data == 0x6e: # n string += "\n" return _string if byte_data == 0x72: # r string += "\r" return _string if byte_data == 0x74: # t string += "\t" return _string if byte_data == 0x75: # u return hex_machine(on_char_code) def on_char_code(char_code): nonlocal string string += chr(char_code) return _string return _string # Nestable state machine for UTF-8 Decoding. def utf8_machine(byte_data, emit): left = 0 num = 0 def _utf8(byte_data): nonlocal num, left if (byte_data & 0xc0) != 0x80: raise Exception("Invalid byte in UTF-8 character: 0x" + byte_data.toString(16)) left = left - 1 num |= (byte_data & 0x3f) << (left * 6) if left: return _utf8 return emit(num) if 0xc0 <= byte_data < 0xe0: # 2-byte UTF-8 Character left = 1 num = (byte_data & 0x1f) << 6 return _utf8 if 0xe0 <= byte_data < 0xf0: # 3-byte UTF-8 Character left = 2 num = (byte_data & 0xf) << 12 return _utf8 if 0xf0 <= byte_data < 0xf8: # 4-byte UTF-8 Character left = 3 num = (byte_data & 0x07) << 18 return _utf8 raise Exception("Invalid byte in UTF-8 string: 0x" + str(byte_data)) # Nestable state machine for hex escaped characters def hex_machine(emit): left = 4 num = 0 def _hex(byte_data): nonlocal num, left if 0x30 <= byte_data < 0x40: i = byte_data - 0x30 elif 0x61 <= byte_data <= 0x66: i = byte_data - 0x57 elif 0x41 <= byte_data <= 0x46: i = byte_data - 0x37 else: raise Exception("Expected hex char in string hex escape") left -= 1 num |= i << (left * 4) if left: return _hex return emit(num) return _hex def number_machine(byte_data, emit): sign = 1 number = 0 decimal = 0 esign = 1 exponent = 0 def _mid(byte_data): if byte_data == 0x2e: # . return _decimal return _later(byte_data) def _number(byte_data): nonlocal number if 0x30 <= byte_data < 0x40: number = number * 10 + (byte_data - 0x30) return _number return _mid(byte_data) def _start(byte_data): if byte_data == 0x30: return _mid if 0x30 < byte_data < 0x40: return _number(byte_data) raise Exception("Invalid number: 0x" + str(byte_data)) if byte_data == 0x2d: # - sign = -1 return _start def _decimal(byte_data): nonlocal decimal if 0x30 <= byte_data < 0x40: decimal = (decimal + byte_data - 0x30) / 10 return _decimal return _later(byte_data) def _later(byte_data): if byte_data == 0x45 or byte_data == 0x65: # E e return _esign return _done(byte_data) def _esign(byte_data): nonlocal esign if byte_data == 0x2b: # + return _exponent if byte_data == 0x2d: # - esign = -1 return _exponent return _exponent(byte_data) def _exponent(byte_data): nonlocal exponent if 0x30 <= byte_data < 0x40: exponent = exponent * 10 + (byte_data - 0x30) return _exponent return _done(byte_data) def _done(byte_data): value = sign * (number + decimal) if exponent: value *= math.pow(10, esign * exponent) return emit(value, byte_data) return _start(byte_data) def array_machine(emit): array_data = [] def _array(byte_data): if byte_data == 0x5d: # ] return emit(array_data) return json_machine(on_value, _comma)(byte_data) def on_value(value): array_data.append(value) def _comma(byte_data): if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20: return _comma # Ignore whitespace if byte_data == 0x2c: # , return json_machine(on_value, _comma) if byte_data == 0x5d: # ] return emit(array_data) raise Exception("Unexpected byte: 0x" + str(byte_data) + " in array body") return _array def object_machine(emit): object_data = {} key = None def _object(byte_data): if byte_data == 0x7d: # return emit(object_data) return _key(byte_data) def _key(byte_data): if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20: return _object # Ignore whitespace if byte_data == 0x22: return string_machine(on_key) raise Exception("Unexpected byte: 0x" + str(byte_data)) def on_key(result): nonlocal key key = result return _colon def _colon(byte_data): if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20: return _colon # Ignore whitespace if byte_data == 0x3a: # : return json_machine(on_value, _comma) raise Exception("Unexpected byte: 0x" + str(byte_data)) def on_value(value): object_data[key] = value def _comma(byte_data): if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20: return _comma # Ignore whitespace if byte_data == 0x2c: # , return _key if byte_data == 0x7d: # return emit(object_data) raise Exception("Unexpected byte: 0x" + str(byte_data)) return _object

Python 2 compatible

 # A streaming byte oriented JSON parser. Feed it a single byte at a time and # it will emit complete objects as it comes across them. Whitespace within and # between objects is ignored. This means it can parse newline delimited JSON. import math def json_machine(emit, next_func=None): def _value(byte_data): if not byte_data: return if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20: return _value # Ignore whitespace if byte_data == 0x22: # " return string_machine(on_value) if byte_data == 0x2d or (0x30 <= byte_data < 0x40): # - or 0-9 return number_machine(byte_data, on_number) if byte_data == 0x7b: #: return object_machine(on_value) if byte_data == 0x5b: # [ return array_machine(on_value) if byte_data == 0x74: # t return constant_machine(TRUE, True, on_value) if byte_data == 0x66: # f return constant_machine(FALSE, False, on_value) if byte_data == 0x6e: # n return constant_machine(NULL, None, on_value) if next_func == _value: raise Exception("Unexpected 0x" + str(byte_data)) return next_func(byte_data) def on_value(value): emit(value) return next_func def on_number(number, byte): emit(number) return _value(byte) next_func = next_func or _value return _value TRUE = [0x72, 0x75, 0x65] FALSE = [0x61, 0x6c, 0x73, 0x65] NULL = [0x75, 0x6c, 0x6c] def constant_machine(bytes_data, value, emit): local_data = {"i": 0, "length": len(bytes_data)} def _constant(byte_data): # nonlocal i, length if byte_data != bytes_data[local_data["i"]]: local_data["i"] += 1 raise Exception("Unexpected 0x" + byte_data.toString(16)) local_data["i"] += 1 if local_data["i"] < local_data["length"]: return _constant return emit(value) return _constant def string_machine(emit): local_data = {"string": ""} def _string(byte_data): # nonlocal string if byte_data == 0x22: # " return emit(local_data["string"]) if byte_data == 0x5c: # \ return _escaped_string if byte_data & 0x80: # UTF-8 handling return utf8_machine(byte_data, on_char_code) if byte_data < 0x20: # ASCII control character raise Exception("Unexpected control character: 0x" + byte_data.toString(16)) local_data["string"] += chr(byte_data) return _string def _escaped_string(byte_data): # nonlocal string if byte_data == 0x22 or byte_data == 0x5c or byte_data == 0x2f: # " \ / local_data["string"] += chr(byte_data) return _string if byte_data == 0x62: # b local_data["string"] += "\b" return _string if byte_data == 0x66: # f local_data["string"] += "\f" return _string if byte_data == 0x6e: # n local_data["string"] += "\n" return _string if byte_data == 0x72: # r local_data["string"] += "\r" return _string if byte_data == 0x74: # t local_data["string"] += "\t" return _string if byte_data == 0x75: # u return hex_machine(on_char_code) def on_char_code(char_code): # nonlocal string local_data["string"] += chr(char_code) return _string return _string # Nestable state machine for UTF-8 Decoding. def utf8_machine(byte_data, emit): local_data = {"left": 0, "num": 0} def _utf8(byte_data): # nonlocal num, left if (byte_data & 0xc0) != 0x80: raise Exception("Invalid byte in UTF-8 character: 0x" + byte_data.toString(16)) local_data["left"] -= 1 local_data["num"] |= (byte_data & 0x3f) << (local_data["left"] * 6) if local_data["left"]: return _utf8 return emit(local_data["num"]) if 0xc0 <= byte_data < 0xe0: # 2-byte UTF-8 Character local_data["left"] = 1 local_data["num"] = (byte_data & 0x1f) << 6 return _utf8 if 0xe0 <= byte_data < 0xf0: # 3-byte UTF-8 Character local_data["left"] = 2 local_data["num"] = (byte_data & 0xf) << 12 return _utf8 if 0xf0 <= byte_data < 0xf8: # 4-byte UTF-8 Character local_data["left"] = 3 local_data["num"] = (byte_data & 0x07) << 18 return _utf8 raise Exception("Invalid byte in UTF-8 string: 0x" + str(byte_data)) # Nestable state machine for hex escaped characters def hex_machine(emit): local_data = {"left": 4, "num": 0} def _hex(byte_data): # nonlocal num, left i = 0 # Parse the hex byte if 0x30 <= byte_data < 0x40: i = byte_data - 0x30 elif 0x61 <= byte_data <= 0x66: i = byte_data - 0x57 elif 0x41 <= byte_data <= 0x46: i = byte_data - 0x37 else: raise Exception("Expected hex char in string hex escape") local_data["left"] -= 1 local_data["num"] |= i << (local_data["left"] * 4) if local_data["left"]: return _hex return emit(local_data["num"]) return _hex def number_machine(byte_data, emit): local_data = {"sign": 1, "number": 0, "decimal": 0, "esign": 1, "exponent": 0} def _mid(byte_data): if byte_data == 0x2e: # . return _decimal return _later(byte_data) def _number(byte_data): # nonlocal number if 0x30 <= byte_data < 0x40: local_data["number"] = local_data["number"] * 10 + (byte_data - 0x30) return _number return _mid(byte_data) def _start(byte_data): if byte_data == 0x30: return _mid if 0x30 < byte_data < 0x40: return _number(byte_data) raise Exception("Invalid number: 0x" + byte_data.toString(16)) if byte_data == 0x2d: # - local_data["sign"] = -1 return _start def _decimal(byte_data): # nonlocal decimal if 0x30 <= byte_data < 0x40: local_data["decimal"] = (local_data["decimal"] + byte_data - 0x30) / 10 return _decimal return _later(byte_data) def _later(byte_data): if byte_data == 0x45 or byte_data == 0x65: # E e return _esign return _done(byte_data) def _esign(byte_data): # nonlocal esign if byte_data == 0x2b: # + return _exponent if byte_data == 0x2d: # - local_data["esign"] = -1 return _exponent return _exponent(byte_data) def _exponent(byte_data): # nonlocal exponent if 0x30 <= byte_data < 0x40: local_data["exponent"] = local_data["exponent"] * 10 + (byte_data - 0x30) return _exponent return _done(byte_data) def _done(byte_data): value = local_data["sign"] * (local_data["number"] + local_data["decimal"]) if local_data["exponent"]: value *= math.pow(10, local_data["esign"] * local_data["exponent"]) return emit(value, byte_data) return _start(byte_data) def array_machine(emit): local_data = {"array_data": []} def _array(byte_data): if byte_data == 0x5d: # ] return emit(local_data["array_data"]) return json_machine(on_value, _comma)(byte_data) def on_value(value): # nonlocal array_data local_data["array_data"].append(value) def _comma(byte_data): if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20: return _comma # Ignore whitespace if byte_data == 0x2c: # , return json_machine(on_value, _comma) if byte_data == 0x5d: # ] return emit(local_data["array_data"]) raise Exception("Unexpected byte: 0x" + str(byte_data) + " in array body") return _array def object_machine(emit): local_data = {"object_data": {}, "key": ""} def _object(byte_data): # nonlocal object_data, key if byte_data == 0x7d: # return emit(local_data["object_data"]) return _key(byte_data) def _key(byte_data): if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20: return _object # Ignore whitespace if byte_data == 0x22: return string_machine(on_key) raise Exception("Unexpected byte: 0x" + byte_data.toString(16)) def on_key(result): # nonlocal object_data, key local_data["key"] = result return _colon def _colon(byte_data): # nonlocal object_data, key if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20: return _colon # Ignore whitespace if byte_data == 0x3a: # : return json_machine(on_value, _comma) raise Exception("Unexpected byte: 0x" + str(byte_data)) def on_value(value): # nonlocal object_data, key local_data["object_data"][local_data["key"]] = value def _comma(byte_data): # nonlocal object_data if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20: return _comma # Ignore whitespace if byte_data == 0x2c: # , return _key if byte_data == 0x7d: # return emit(local_data["object_data"]) raise Exception("Unexpected byte: 0x" + str(byte_data)) return _object

Testing

 if __name__ == "__main__": test_json = """[1,2,"3"] {"name": "tarun"} 1 2 3 [{"name":"a", "data": [1, null,2]}] """ def found_json(data): print(data) state = json_machine(found_json) for char in test_json: state = state(ord(char))

Conclusion of the same

 [1, 2, '3'] {'name': 'tarun'} 1 2 3 [{'name': 'a', 'data': [1, None, 2]}]

+7

Tarun Lalwani Dec 04 '17 at 17:26

source share

I would like to provide a solution. The main idea is to "try" to decode: if it fails, give it more feed, otherwise use the offset information to prepare the next decoding.

However, the current json module cannot wrap SPACE in the header of the string to be decrypted, so I have to disable them.

 import sys import json def iterload(file): buffer = "" dec = json.JSONDecoder() for line in file: buffer = buffer.strip(" \n\r\t") + line.strip(" \n\r\t") while(True): try: r = dec.raw_decode(buffer) except: break yield r[0] buffer = buffer[r[1]:].strip(" \n\r\t") for o in iterload(sys.stdin): print("Working on a", type(o), o)

========================== I have tested several txt files and it works great. (In1.txt)

 {"foo": ["bar", "baz"] } 1 2 [ ] 4 {"foo1": ["bar1", {"foo2":{"A":1, "B":3}, "DDD":4}] } 5 6

(in2.txt)

 {"foo" : ["bar", "baz"] } 1 2 [ ] 4 5 6

(in.txt, your initial)

 {"foo": ["bar", "baz"]} 1 2 [] 4 5 6

(output for the Benedict test)

 python test.py < in.txt ('Working on a', <type 'list'>, [u'hello']) ('Working on a', <type 'dict'>, {u'goodbye': 1}) ('Working on a', <type 'int'>, 1) ('Working on a', <type 'int'>, 2) ('Working on a', <type 'dict'>, {}) ('Working on a', <type 'int'>, 2) ('Working on a', <type 'int'>, 9) ('Working on a', <type 'int'>, 78) ('Working on a', <type 'int'>, 4) ('Working on a', <type 'int'>, 5) ('Working on a', <type 'dict'>, {u'animals': [u'dog', u'lots of mice', u'cat']})

+5

wuliang Apr 17 '12 at 16:39

source share

I used the elegant @wuilang solution. A simple approach - reading bytes, trying to decode, reading bytes, trying to decode, ... - worked, but, unfortunately, it was very slow.

In my case, I was trying to read "pretty-printed" JSON objects of the same type of object from a file. This allowed me to optimize the approach; I could read the file line by line, only decrypt when I found a line containing exactly "}":

 def iterload(stream): buf = "" dec = json.JSONDecoder() for line in stream: line = line.rstrip() buf = buf + line if line == "}": yield dec.raw_decode(buf) buf = ""

If you are working with single-line compact JSON that escapes strings in string literals, you can further simplify this approach:

 def iterload(stream): dec = json.JSONDecoder() for line in stream: yield dec.raw_decode(line)

Obviously, these simple approaches only work for very specific JSON types. However, if these assumptions are fulfilled, these decisions work correctly and quickly.

+4

sigpwned Jan 09 '16 at 19:44

source share

Here's mine:

 import simplejson as json from simplejson import JSONDecodeError class StreamJsonListLoader(): """ When you have a big JSON file containint a list, such as [{ ... }, { ... }, { ... }, ... ] And it too big to be practically loaded into memory and parsed by json.load, This class comes to the rescue. It lets you lazy-load the large json list. """ def __init__(self, filename_or_stream): if type(filename_or_stream) == str: self.stream = open(filename_or_stream) else: self.stream = filename_or_stream if not self.stream.read(1) == '[': raise NotImplementedError('Only JSON-streams of lists (that start with a [) are supported.') def __iter__(self): return self def next(self): read_buffer = self.stream.read(1) while True: try: json_obj = json.loads(read_buffer) if not self.stream.read(1) in [',',']']: raise Exception('JSON seems to be malformed: object is not followed by comma (,) or end of list (]).') return json_obj except JSONDecodeError: next_char = self.stream.read(1) read_buffer += next_char while next_char != '}': next_char = self.stream.read(1) if next_char == '': raise StopIteration read_buffer += next_char

+3

user3542882 16 . '14 20:29

source share

json.JSONDecoder, raw_decode . python JSON , . ( ) JSON. , while JSON , -.

 import json def yield_multiple_value(f): ''' parses multiple JSON values from a file. ''' vals_str = f.read() decoder = json.JSONDecoder() try: nread = 0 while nread < len(vals_str): val, n = decoder.raw_decode(vals_str[nread:]) nread += n # Skip over whitespace because of bug, below. while nread < len(vals_str) and vals_str[nread].isspace(): nread += 1 yield val except json.JSONDecodeError as e: pass return

, . , - json.JSONDecoder.raw_decode() , , , whileloop ...

 def yield_multiple_value(f): ''' parses multiple JSON values from a file. ''' vals_str = f.read() decoder = json.JSONDecoder() while vals_str: val, n = decoder.raw_decode(vals_str) #remove the read characters from the start. vals_str = vals_str[n:] # remove leading white space because a second call to decoder.raw_decode() # fails when the string starts with whitespace, and # I don't understand why... vals_str = vals_str.lstrip() yield val return

json.JSONDecoder raw_decode https://docs.python.org/3/library/json.html#encoders-and-decoders :

JSON , .

JSON. , .

input.txt, , , .

+2

hetepeperfan 07 . '17 23:01

source share

( !!!) ujson :

 import ujson as json

!

-one

A STEFANI 10 . '17 18:02

source share

Nic Watson · Accepted Answer · 2017-05-25 21:33

Here is a much, much simpler solution. The secret lies in the attempt, failure and proper use of information in the exception. The only limitation is that the file must be searchable.

 def stream_read_json(fn): import json start_pos = 0 with open(fn, 'r') as f: while True: try: obj = json.load(f) yield obj return except json.JSONDecodeError as e: f.seek(start_pos) json_str = f.read(e.pos) obj = json.loads(json_str) start_pos += e.pos yield obj

Edit: just noticed that this will only work for Python> = 3.5. For earlier failures, returns a ValueError, and you need to parse the position from the string, for example.

 def stream_read_json(fn): import json import re start_pos = 0 with open(fn, 'r') as f: while True: try: obj = json.load(f) yield obj return except ValueError as e: f.seek(start_pos) end_pos = int(re.match('Extra data: line \d+ column \d+ .*\(char (\d+).*\)', e.args[0]).groups()[0]) json_str = f.read(end_pos) obj = json.loads(json_str) start_pos += end_pos yield obj

How can I lazily read multiple JSON values ​​from a file / stream in Python?

Usage example

example.py

in.txt

session example

Python 3 only

Python 2 compatible

Testing

More articles:

How can I lazily read multiple JSON values from a file / stream in Python?