I have a large XML data file (> 160 MB) to process, and it seems that SAX / expat / pulldom parsing is the way to go. I would like one thread to walk through the nodes and push the ones to be processed onto a queue, while other worker threads pull the next available node off the queue and process it.
I have the following (it should have locks, I know; it will, later):
    import sys
    import time
    import threading
    import xml.parsers.expat

    q = []  # shared between threads; no locking yet

    def start_handler(name, attrs):
        # Called by expat for each opening tag
        q.append(name)

    def do_expat():
        p = xml.parsers.expat.ParserCreate()
        p.StartElementHandler = start_handler
        p.buffer_text = True
        print("opening {0}".format(sys.argv[1]))
        # Binary mode: on Python 3, ParseFile expects read() to return bytes
        with open(sys.argv[1], "rb") as f:
            print("file is open")
            p.ParseFile(f)
        print("parsing complete")

    t = threading.Thread(group=None, target=do_expat)
    t.start()

    # Poll the list once a second while the parser thread runs
    while True:
        print(q)
        time.sleep(1)
The problem is that the body of the while block runs only once, and after that I can't even interrupt it with Ctrl-C. On smaller files the output looks as expected, but that seems to indicate the handler is only called once the document has been completely parsed, which seems to defeat the purpose of a SAX parser.
I am sure this is my own ignorance, but I do not see where I am making a mistake.
PS: I also tried changing start_handler like this:
    def start_handler(name, attrs):
        def app():
            q.append(name)
        u = threading.Thread(group=None, target=app)
        u.start()
No love, however.
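For reference, here is a minimal sketch of the pipeline I am aiming for, assuming Python 3. The names `CHUNK_SIZE`, `SENTINEL`, `produce`, `consume` and `run` are made up for illustration and are not in my code above. It feeds expat in chunks through `Parse` instead of a single `ParseFile` call and uses `queue.Queue` in place of the bare list. I have not confirmed that this cures the stall described above, so treat it as an illustration of the intended design rather than a fix.

```python
import queue
import threading
import xml.parsers.expat

CHUNK_SIZE = 64 * 1024  # assumed chunk size, chosen arbitrarily
SENTINEL = None         # hypothetical marker telling a worker to stop


def produce(stream, q, n_workers):
    """Feed the parser chunk by chunk; handlers fire as each chunk is parsed."""
    p = xml.parsers.expat.ParserCreate()
    p.StartElementHandler = lambda name, attrs: q.put(name)
    try:
        while True:
            chunk = stream.read(CHUNK_SIZE)
            if not chunk:
                break
            p.Parse(chunk, False)   # more data may follow
        p.Parse(b"", True)          # tell expat the document is complete
    finally:
        # One sentinel per worker so every worker wakes up and exits
        for _ in range(n_workers):
            q.put(SENTINEL)


def consume(q, results):
    """Pull element names off the queue until the sentinel arrives."""
    while True:
        name = q.get()
        if name is SENTINEL:
            break
        results.append(name)  # placeholder for the real per-node work


def run(stream, n_workers=2):
    """Start the workers and the producer, wait for all of them, return results."""
    q = queue.Queue(maxsize=1000)  # bounded, so the parser cannot run far ahead
    results = []
    workers = [
        threading.Thread(target=consume, args=(q, results))
        for _ in range(n_workers)
    ]
    for w in workers:
        w.start()
    producer = threading.Thread(target=produce, args=(stream, q, n_workers))
    producer.start()
    producer.join()
    for w in workers:
        w.join()
    return results
```

It would be called with a file opened in binary mode, e.g. `run(open(sys.argv[1], "rb"))`. With several workers the order of `results` is not guaranteed.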
python multithreading xml sax
decitrig