Reading a Freebase data dump in Python: too few lines read?

I am trying to use a Freebase data dump, but I am having trouble reading the files with Python. It seems that my program does not read all the lines.

    def test2():
        count = 0
        for line in open(FREEBASE_TOPIC):
            count += 1
        return count

    def test3():
        count = 0
        for line in open(FREEBASE_QUAD):
            count += 1
        return count

    if __name__ == "__main__":
        print "FREEBASE TOPIC - NR LINES:", test2()
        print "FREEBASE QUAD - NR LINES:", test3()

Results in this:

    FREEBASE TOPIC - ITR TIME: 1.21000003815
    FREEBASE TOPIC - NR LINES: 1643010
    FREEBASE QUAD - ITER TIME: 0.797000169754
    FREEBASE QUAD - NR LINES: 3155131

That can't be all of it. A few million lines seems far too few to contain the whole of Freebase, and I don't see how you could iterate over one 33 GB file and another 5 GB file in about 2 seconds.

What is going wrong? I could download the files again in case something went wrong during the download, but that would take ages on my connection, so I am asking here first. The file sizes are correct, and I printed some lines and they look right.
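
One way to check whether the read loop really stops early, before downloading anything again, is to compare the number of bytes actually consumed with the size on disk. The following is a minimal sketch written for Python 3 (the question uses Python 2); the function name `scan_file` and the `chunk_size` parameter are made up for illustration.

```python
import os


def scan_file(path, chunk_size=1024 * 1024):
    """Read a file in binary mode, chunk by chunk.

    Returns (newline_count, bytes_read, size_on_disk).
    scan_file and chunk_size are illustrative names, not from the question.
    """
    newline_count = 0
    bytes_read = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:  # an empty bytes object means end of file
                break
            bytes_read += len(chunk)
            newline_count += chunk.count(b"\n")
    return newline_count, bytes_read, os.path.getsize(path)
```

If `bytes_read` equals `size_on_disk` here but the original loop reports far fewer lines than `newline_count`, the file itself is probably intact and the way it is opened is the more likely culprit.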

3 answers

I ran into this problem myself. Opening the file in binary mode should solve it:

    open('file', 'rb')

In text mode 'r', which is the default, a chr(26) byte (Ctrl-Z, the legacy end-of-file marker) in the data can make Python 2 on Windows treat the file as ended at that point.
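
As a rough illustration of the binary-mode approach, here is a sketch of the line counter from the question with only the open mode changed. It is written for Python 3, and `count_lines_binary` is a hypothetical name, not something from the original script.

```python
def count_lines_binary(path):
    """Count lines by iterating over the file opened in binary mode.

    In binary mode each line is a bytes object and a chr(26) byte
    is ordinary data, so it cannot end the iteration early.
    """
    count = 0
    with open(path, "rb") as handle:
        for _line in handle:
            count += 1
    return count
```

Note that the lines come back as bytes, so any later parsing would need to decode them or split on byte strings such as `b"\t"`.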


It looks like you are decompressing the files before using them. You are almost certainly better off keeping each file compressed and decompressing it on the fly as you read it.

    from bz2 import BZ2File
    for line in BZ2File('freebase-datadump-quadruples-<date>.tsv.bz2', 'rU'):
        <process a line>
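
For comparison, a Python 3 version of the same idea might look like the sketch below, which counts lines while streaming straight from the compressed file. It assumes Python 3.3 or later, where `bz2.open` is available; `count_lines_bz2` is a hypothetical helper name, and the path to pass in is whatever your dump file is called.

```python
import bz2


def count_lines_bz2(path):
    """Count lines in a .bz2 file without decompressing it to disk first.

    bz2.open in "rb" mode yields decompressed lines as bytes objects.
    """
    count = 0
    with bz2.open(path, "rb") as handle:
        for _line in handle:
            count += 1
    return count
```

Only a small buffer of decompressed data is held in memory at a time, so this trades some CPU time for not needing the disk space of the fully unpacked dump.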

Your script works fine and produces the correct number of lines for me on Ubuntu. Could this be a limitation of your OS?

Related question: Parsing a large (20 GB) text file with Python - reading in 2 lines as 1


Source: https://habr.com/ru/post/1416031/

