I have a very large (2.8 GB) XML file: a dump of the articles on the Polish Wikipedia. The size of this file is very problematic for me. The task is to search this file for a large number of articles; all I have is the article titles. My first thought was to sort these titles and find them all in one linear pass through the file. The idea is not bad, but the articles are not sorted alphabetically; they are sorted by identifier, which I do not know a priori.
So my second thought was to build an index of this file: to store entries in another file (or a database) in the format title;id;offset (possibly without the id). I have asked for help with this before. The idea was that once I have the byte offset of the desired tag, I can use a simple Seek to move the file cursor straight there without reading everything before it. For small files, I think this would work fine. But on my machine (a laptop, Core 2 Duo, Win7, VS2008) the application stops responding while building the index.
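To make the Seek idea concrete, here is a minimal sketch of the lookup side, assuming the index file and dump filenames ("index.txt", "plwiki.xml") and the title;id;offset format described above; a linear scan of the index stands in for whatever lookup structure is used:

```csharp
using System;
using System.IO;
using System.Text;

class SeekDemo
{
    static void Main()
    {
        // Find the byte offset of a title in the index, then jump straight
        // to that position in the dump without reading the 2.8 GB prefix.
        long offset = FindOffset("index.txt", "Polska"); // filenames are assumptions
        if (offset < 0) return;

        using (var fs = new FileStream("plwiki.xml", FileMode.Open, FileAccess.Read))
        {
            fs.Seek(offset, SeekOrigin.Begin); // O(1) jump to the recorded line start
            using (var reader = new StreamReader(fs, Encoding.UTF8))
            {
                // Read forward only until the end of this page element.
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    Console.WriteLine(line);
                    if (line.Contains("</page>")) break;
                }
            }
        }
    }

    // Linear scan of the title;id;offset index built by the indexing program.
    static long FindOffset(string indexFile, string title)
    {
        using (var reader = new StreamReader(indexFile, Encoding.UTF8))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                string[] parts = line.Split(';');
                if (parts[0] == title) return long.Parse(parts[2]);
            }
        }
        return -1;
    }
}
```

This only works if the recorded offsets point at line starts in the file's own encoding, so the StreamReader opened after the Seek begins on a valid UTF-8 boundary.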
In my indexing program, I read the file line by line and check whether each line contains the tag I need. I also count all the bytes read and write out entries in the format above. At some point, however, the indexing program freezes. By then the index file has grown to 36.2 MB, and the last recorded offset is 2,872,765,202 B, while the whole XML file is 3,085,439,630 B.
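One detail worth stressing about the byte counting: for the offsets to be valid Seek targets they must be counted in bytes of the file's encoding, not in characters, because in UTF-8 every Polish diacritic takes two bytes. A small illustration:

```csharp
using System;
using System.Text;

class ByteCount
{
    static void Main()
    {
        string line = "Zażółć gęślą jaźń";
        // String length counts characters; GetByteCount counts encoded bytes.
        Console.WriteLine(line.Length);                      // 17 characters
        Console.WriteLine(Encoding.UTF8.GetByteCount(line)); // 26 bytes
    }
}
```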
My third thought was to split the file into smaller parts: 26 pieces (one per Latin letter), each containing only the entries whose titles start with the same letter. For example, a.xml would hold all entries whose titles begin with "A". The resulting files would be tens of MB each, at most about 200 MB, I think. But this runs into the same problem of reading the whole file.
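The routing step of that idea can be sketched as below. One assumption to flag: Polish titles can begin with characters outside A..Z (Ł, Ż, digits, ...), so a 27th catch-all file is needed; the filenames here are illustrative.

```csharp
using System;

class Router
{
    // Map a title to one of 26 per-letter files, or a catch-all.
    static string TargetFile(string title)
    {
        char first = char.ToUpperInvariant(title[0]);
        return (first >= 'A' && first <= 'Z') ? first + ".xml" : "other.xml";
    }

    static void Main()
    {
        Console.WriteLine(TargetFile("Adam Mickiewicz")); // A.xml
        Console.WriteLine(TargetFile("Łódź"));            // other.xml ('Ł' is not in A..Z)
    }
}
```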
To read the file, I used what is perhaps the fastest way: a StreamReader. I read somewhere that StreamReader and the XmlReader class from System.Xml are the fastest methods, StreamReader being even faster than XmlReader. Obviously, I cannot load this entire file into memory: I have only 3 GB of RAM installed, and Win7 takes roughly 800 MB to 1 GB under full load.
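For completeness, a streaming XmlReader pass also keeps memory constant, since it holds only one node at a time; only the dump filename here is an assumption. A sketch that prints every article title:

```csharp
using System;
using System.Xml;

class TitleScan
{
    static void Main()
    {
        // Forward-only, pull-based parsing: no DOM is built, so memory
        // stays flat regardless of file size.
        using (var reader = XmlReader.Create("plwiki.xml"))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "title")
                {
                    // Reads the element text and advances past </title>.
                    string title = reader.ReadElementContentAsString();
                    Console.WriteLine(title);
                }
            }
        }
    }
}
```

Unlike counting bytes by hand, XmlReader also handles encoding and entities for free, at the cost of not knowing the exact byte offset of each element.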
So, I am asking for help. What is the best approach? The point is that looking things up in this XML file should be fast: faster than loading the individual Wikipedia pages as HTML. I am not even sure that this is possible.
Perhaps I should load all the necessary content into a database? Maybe that would be faster? But I would still need to read the entire file at least once.
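One hedged sketch of the database variant, using the third-party System.Data.SQLite library (an assumption; any database with an index on the title column would do). Only the title/id/offset index goes into the database, not the article text, so the 2.8 GB file is still read exactly once and later lookups hit the primary-key index instead of scanning a file:

```csharp
using System.Data.SQLite; // third-party provider, not part of the BCL

class IndexDb
{
    static void Main()
    {
        using (var conn = new SQLiteConnection("Data Source=index.db"))
        {
            conn.Open();
            new SQLiteCommand(
                "CREATE TABLE IF NOT EXISTS pages (title TEXT PRIMARY KEY, id TEXT, offset INTEGER)",
                conn).ExecuteNonQuery();

            // Bulk insert inside one transaction; otherwise every row
            // becomes its own disk sync and indexing crawls.
            using (var tx = conn.BeginTransaction())
            using (var cmd = new SQLiteCommand(
                "INSERT INTO pages VALUES (@t, @i, @o)", conn))
            {
                cmd.Parameters.AddWithValue("@t", "Polska"); // sample row
                cmd.Parameters.AddWithValue("@i", "1");
                cmd.Parameters.AddWithValue("@o", 123456L);
                cmd.ExecuteNonQuery();
                tx.Commit();
            }

            // Lookup by title uses the primary-key index; no linear scan.
            using (var cmd = new SQLiteCommand(
                "SELECT offset FROM pages WHERE title = @t", conn))
            {
                cmd.Parameters.AddWithValue("@t", "Polska");
                long offset = (long)cmd.ExecuteScalar();
            }
        }
    }
}
```

The returned offset would then feed the same Seek-based read as before.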
I'm not sure whether there is a limit on question length, but here is a sample of my indexing source code.
    while (reading)
    {
        if (!reader.EndOfStream)
        {
            line = reader.ReadLine();
            fileIndex += enc.GetByteCount(line) + 2; // +2 covers the \r\n not included in the line
            position = 0;
        }
        else
        {
            reading = false;
            continue;
        }

        if (currentArea == Area.nothing) // nothing interesting at the moment
        {
            // search for the position of the <title> tag
            position = MoveAfter("<title>", line, position); // searches until it finds the <title> tag
            if (position >= 0)
                currentArea = Area.title;
            else
                continue;
        }

        (...)

        if (currentArea == Area.text)
        {
            position = MoveAfter("<text", line, position);
            if (position >= 0)
            {
                long index = fileIndex;
                // Subtract the line's BYTE length, not line.Length: character
                // count differs from byte count for Polish text in UTF-8.
                index -= enc.GetByteCount(line) + 2;
                WriteIndex(currentTitle, currentId, index);
                currentArea = Area.nothing;
            }
            else
                continue;
        }
    }
    reader.Close();
    reader.Dispose();
    writer.Close();
}

private void WriteIndex(string title, string id, long index)
{
    writer.WriteLine(title + ";" + id + ";" + index.ToString());
}
Best wishes and thanks,
Ventus
Edit: link to the Wikipedia dump: http://download.wikimedia.org/plwiki/20100629/