Working with a very large XML file in C#

I have a very large XML file, about 2.8 GB. It is a dump of the articles on the Polish Wikipedia, and its size is a real problem for me. The task is to search this file for a large number of articles; all I have is the article titles. My first thought was to sort those titles and go through the file in a single linear pass. The idea is not bad, but the articles are not sorted alphabetically. They are sorted by an identifier that I do not know in advance.

So my second thought was to build an index of the file: store entries in another file (or a database) in the format title;id;offset (possibly without the id). I asked a question about that as well. The idea was that if I knew the byte offset of the tag I want, I could simply Seek to that position in the file without reading the whole contents. For small files I think this would work fine, but on my computer (a laptop, Core 2 Duo, Win7, VS2008) the indexing runs so long that Windows reports the application as not responding.

My indexing program reads every line of the file and checks whether it contains the tag I need. It also counts all the bytes read and saves entries in the format above. So while it is indexing, the program appears frozen. Still, the resulting index file is 36.2 MB so far, and the last offset written is 2,872,765,202 bytes, while the whole XML file is 3,085,439,630 bytes.

My third thought was to split the file into smaller parts - more precisely, into 26 parts (there are 26 letters in the Latin alphabet), each containing only the entries whose titles start with the same letter; for example, a.xml would hold all entries whose titles begin with "A". The resulting files should be tens of MB each, at most around 200 MB, I think. But then there is the same problem of having to read the whole file.

To read the file I used what is probably the fastest approach: a StreamReader. I read somewhere that StreamReader and the XmlReader class from System.Xml are the fastest methods, with StreamReader even faster than XmlReader. Obviously I cannot load the entire file into memory; my laptop has only 3 GB of RAM, and Win7 takes about 800 MB to 1 GB when fully loaded.

So I am asking for advice on what is best to do here. The point is that searching this XML file has to be fast - faster than loading the individual Wikipedia pages in HTML form. I am not even sure that is possible.

Perhaps I should load all the necessary content into a database? Maybe that would be faster? But I would still need to read the entire file at least once.

I am not sure whether there is a limit on question length, but here is a sample of my indexing code.

    while (reading)
    {
        if (!reader.EndOfStream)
        {
            line = reader.ReadLine();
            fileIndex += enc.GetByteCount(line) + 2; // +2 to cover the \r\n characters not included in the line
            position = 0;
        }
        else
        {
            reading = false;
            continue;
        }

        if (currentArea == Area.nothing) // nothing interesting at the moment
        {
            // search for the position of the <title> tag
            position = MoveAfter("<title>", line, position); // searches until it finds the <title> tag
            if (position >= 0)
                currentArea = Area.title;
            else
                continue;
        }

        (...)

        if (currentArea == Area.text)
        {
            position = MoveAfter("<text", line, position);
            if (position >= 0)
            {
                long index = fileIndex;
                index -= line.Length;
                WriteIndex(currentTitle, currentId, index);
                currentArea = Area.nothing;
            }
            else
                continue;
        }
    }

    reader.Close();
    reader.Dispose();
    writer.Close();
    }

    private void WriteIndex(string title, string id, long index)
    {
        writer.WriteLine(title + ";" + id + ";" + index.ToString());
    }

Best wishes and thanks,

Ventus

Edit: link to the Wiki dump: http://download.wikimedia.org/plwiki/20100629/

+7
c# xml
10 answers

Well, if it fits your requirements, I would first import that XML into an RDBMS such as SQL Server, and then run the queries against SQL Server.

With the right indexes (full-text indexes, if you want to search a lot of text), it should be pretty fast...

This also removes the overhead of having a library parse the XML file on every search...
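For illustration, here is a minimal sketch of the query side in C#, assuming the dump has already been imported into a hypothetical Articles(Title, Body) table and that a full-text index exists on the Body column (the table, column and connection-string names are assumptions, not part of this answer):

    using System;
    using System.Data.SqlClient;

    // Hypothetical schema: Articles(Title nvarchar(450), Body nvarchar(max))
    // with a full-text index created on Body.
    static void Search(string connectionString, string term)
    {
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(
            "SELECT TOP 20 Title FROM Articles WHERE CONTAINS(Body, @term)", conn))
        {
            // term follows the full-text predicate syntax, e.g. Brzeg or "\"exact phrase\""
            cmd.Parameters.AddWithValue("@term", term);
            conn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                    Console.WriteLine(reader.GetString(0));
            }
        }
    }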

+6

Well... if you are going to search it, I strongly recommend finding a better way than working with the file itself. As you mention yourself, I suggest putting it into a well-normalized and indexed database and doing the searching there. Anything else you do will just duplicate what a database already does.

However, that will take some time to set up. XmlTextReader is probably your best bet; it works with one node at a time. LINQ to XML should also be fairly efficient, but I have not tried it with a file this large and so cannot comment.
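As a rough sketch of the node-at-a-time approach (not the asker's code), something like the following streams the dump with XmlReader; it assumes the MediaWiki export layout, where each <page> element contains a <title> and a <revision>/<text>, and the file name and looked-up title are only examples:

    using System.Xml;

    static string FindArticleText(string dumpPath, string wantedTitle)
    {
        using (XmlReader reader = XmlReader.Create(dumpPath))
        {
            while (reader.ReadToFollowing("page"))
            {
                reader.ReadToFollowing("title");
                string title = reader.ReadElementContentAsString();

                reader.ReadToFollowing("text");
                string text = reader.ReadElementContentAsString();

                if (title == wantedTitle)   // e.g. "Brzeg"
                    return text;
            }
        }
        return null; // not found
    }

Each lookup still streams from the start of the 2.8 GB file, which is why pairing this with an index or a database pays off.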

May I ask where this huge XML file comes from? Perhaps there is a way to deal with the problem at the source, rather than having to process a 3 GB file.

+7

I like the idea of creating an index - it keeps your code very simple and you don't need any awful dependencies like databases :)

So, create an index in which you store entries of the form:

[content to search]:[byte offset to the start of the xml node that contains the content]

To capture the byte offset, create your own stream over the input file and create a reader on top of it; you query the stream's position after each reader.Read(..) call. An example index entry:

"Now is the winter of our discontent":554353

This means that the entry in the XML file containing "Now is the winter of our discontent" starts at the node at byte offset 554353. Note: I would be tempted to encode the search part of the entry so that you don't run into problems with the delimiter.

To read the index, you scan the index file on disk (i.e. you don't load it into memory), looking for the matching record; that gives you the byte offset. Then open a new stream over the XML file, set its position to that byte offset, create a new reader and read the document from that point.
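A minimal sketch of the read side, using the title;id;offset layout from the question rather than the colon-delimited form above (the idea is the same); the method and file names are placeholders:

    using System.IO;
    using System.Text;

    static string ReadArticleAt(string xmlPath, string indexPath, string wantedTitle)
    {
        long offset = -1;

        // Scan the index on disk line by line; it never has to fit in memory.
        using (var index = new StreamReader(indexPath))
        {
            string entry;
            while ((entry = index.ReadLine()) != null)
            {
                string[] parts = entry.Split(';');   // title;id;offset
                if (parts[0] == wantedTitle)
                {
                    offset = long.Parse(parts[2]);
                    break;
                }
            }
        }
        if (offset < 0) return null;

        // Seek straight to the stored byte offset and read from there,
        // without touching the rest of the 2.8 GB file.
        using (var fs = new FileStream(xmlPath, FileMode.Open, FileAccess.Read))
        {
            fs.Seek(offset, SeekOrigin.Begin);
            using (var reader = new StreamReader(fs, Encoding.UTF8))
            {
                var sb = new StringBuilder();
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    sb.AppendLine(line);
                    if (line.Contains("</page>")) break; // stop at the end of the article node
                }
                return sb.ToString();
            }
        }
    }

The sketch reads plain text after the seek and stops at the closing tag; an XmlReader with ConformanceLevel.Fragment could probably be used instead if you want parsed nodes.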

+2

You could store the file in CouchDB. I wrote a Python script to do this:

    import couchdb
    import datetime
    import time
    from lxml import etree

    couch = couchdb.Server()
    db = couch["wiki"]

    infile = '/Users/johndotnet/Downloads/plwiki-20100629-pages-articles.xml'

    context = etree.iterparse(source=infile, events=("end", ),
                              tag='{http://www.mediawiki.org/xml/export-0.4/}page')

    for event, elem in context:
        #dump(elem)
        couchEle = {}
        for ele in elem.getchildren():
            if ele.tag == "{http://www.mediawiki.org/xml/export-0.4/}id":
                couchEle['id'] = ele.text
            if ele.tag == "{http://www.mediawiki.org/xml/export-0.4/}title":
                couchEle['title'] = ele.text
            if ele.tag == "{http://www.mediawiki.org/xml/export-0.4/}revision":
                for subEle in ele.getchildren():
                    if subEle.tag == "{http://www.mediawiki.org/xml/export-0.4/}text":
                        couchEle['text'] = subEle.text

        db[couchEle['title']] = couchEle

This should import every article, with its id, title and text, into CouchDB.

You can then query it like this:

    code = '''
    function(doc) {
        if (doc.title.indexOf("Brzeg") > -1) {
            emit(doc._id, doc);
        }
    }
    '''
    results = db.query(code)

Hope this helps!

+1

XmlReader will be fast, but you need to check whether it is fast enough for your scenario. Suppose we are looking for a value located in a node called Item:

    using (var reader = XmlReader.Create("data.xml"))
    {
        while (reader.Read())
        {
            if (reader.NodeType == XmlNodeType.Element && reader.Name == "Item")
            {
                string value = reader.ReadElementContentAsString();
                if (value == "ValueToFind")
                {
                    // value found
                    break;
                }
            }
        }
    }
0

I would do this:

1) Break the XML into smaller files. For example, if the XML looks like the sample below, I would create one file per article node, named after its title attribute. If the title is not unique, I would just number the files. (A rough sketch of this step follows the list.)

Since that would be a lot of files, I would split them across subdirectories, each holding no more than 1000 files.

    <root>
        <article title="aaa"> ... </article>
        <article title="bbb"> ... </article>
        <article title="ccc"> ... </article>
    </root>

2) Create an index table with the file names and the columns you want to search on.

3) Optionally, you can store the XML fragments in the database instead of on disk; SQL Server's varchar(MAX) type is suitable for this.
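A rough sketch of the splitting step, assuming the simplified <article title="..."> structure from the sample above (for the real Wikipedia dump you would match <page> elements and pull the title from a child element instead); method and directory names are illustrative:

    using System.IO;
    using System.Xml;

    static void SplitIntoFiles(string dumpPath, string outputRoot)
    {
        using (var reader = XmlReader.Create(dumpPath))
        {
            int i = 0;
            while (reader.ReadToFollowing("article"))
            {
                string title = reader.GetAttribute("title") ?? ("article_" + i);

                // Make the title safe to use as a file name.
                foreach (char c in Path.GetInvalidFileNameChars())
                    title = title.Replace(c, '_');

                // No more than 1000 files per subdirectory.
                string dir = Path.Combine(outputRoot, (i / 1000).ToString());
                Directory.CreateDirectory(dir);

                // Write just this article element to its own small file.
                using (var sub = reader.ReadSubtree())
                using (var writer = XmlWriter.Create(Path.Combine(dir, title + ".xml")))
                    writer.WriteNode(sub, true);

                i++;
            }
        }
    }

Note that duplicate titles would overwrite each other here; as mentioned in step 1, you would number such files instead.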

0

Dump it into a Solr index and use that to search. You can run Solr as a standalone search engine, and a simple bit of scripting will loop through the file and dump each article into the index. Solr then gives you full-text search over whichever fields you decide to index...
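As a hedged sketch of what that scripting could look like in C#, posting one article at a time with Solr's XML update syntax; it assumes a Solr instance at the default http://localhost:8983/solr whose schema defines id, title and text fields, which is not part of this answer:

    using System.Net;
    using System.Security;

    static void IndexArticle(string id, string title, string text)
    {
        string doc =
            "<add><doc>" +
            "<field name=\"id\">" + SecurityElement.Escape(id) + "</field>" +
            "<field name=\"title\">" + SecurityElement.Escape(title) + "</field>" +
            "<field name=\"text\">" + SecurityElement.Escape(text) + "</field>" +
            "</doc></add>";

        using (var client = new WebClient())
        {
            client.Headers[HttpRequestHeader.ContentType] = "text/xml";
            client.UploadString("http://localhost:8983/solr/update", doc);
        }
    }

    // After the loop over all articles, send a commit so the documents become searchable:
    //     client.UploadString("http://localhost:8983/solr/update", "<commit/>");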

0

The only way you will be able to search this quickly is to store it in a database, as the others have said. If a database is not an option, then it will take a long time, no doubt about it. What I would do is build a multithreaded application: worker threads that read the data and push each line into a queue of strings. Have 5 of them, each working on its own segment of the file (the first thread starts at the beginning, the second starts 1/5 of the way into the file, the third at 2/5, and so on). Meanwhile another thread reads from that queue and looks for whatever you are searching for, and each thread is cleaned up once it finishes. It will take some time, but it should not crash or eat tons of memory.

If you find that it still eats a lot of memory, put a limit on the number of items the queue can hold and pause the reading threads until the queue size falls back below that threshold.
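A rough sketch of that layout, using a bounded BlockingCollection<string> as the queue so that the producers pause automatically when the limit is hit. The file name, the search string and the byte accounting (which assumes UTF-8 and "\n" line endings) are all assumptions, and lines right at segment boundaries would need more careful handling in real code:

    using System;
    using System.Collections.Concurrent;
    using System.IO;
    using System.Text;
    using System.Threading;

    class SegmentedScan
    {
        static void Main()
        {
            const string path = "plwiki-20100629-pages-articles.xml"; // placeholder
            const string needle = "<title>Brzeg</title>";             // placeholder
            const int workers = 5;

            // Bounded queue: Add() blocks the producers whenever it is full.
            var queue = new BlockingCollection<string>(boundedCapacity: 10000);

            long length = new FileInfo(path).Length;
            var producers = new Thread[workers];

            for (int w = 0; w < workers; w++)
            {
                long start = length * w / workers;
                long end = length * (w + 1) / workers;
                producers[w] = new Thread(() =>
                {
                    using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read))
                    {
                        fs.Seek(start, SeekOrigin.Begin);
                        using (var reader = new StreamReader(fs, Encoding.UTF8))
                        {
                            long consumed = start;
                            if (start > 0) // skip the partial line; the previous worker reads it
                                consumed += Encoding.UTF8.GetByteCount(reader.ReadLine() ?? "") + 1;

                            string line;
                            while (consumed < end && (line = reader.ReadLine()) != null)
                            {
                                consumed += Encoding.UTF8.GetByteCount(line) + 1; // approximate
                                queue.Add(line); // blocks when the queue is full
                            }
                        }
                    }
                });
                producers[w].Start();
            }

            // Single consumer thread scanning the queued lines.
            var consumer = new Thread(() =>
            {
                foreach (string line in queue.GetConsumingEnumerable())
                    if (line.Contains(needle))
                        Console.WriteLine("Found: " + line);
            });
            consumer.Start();

            foreach (var p in producers) p.Join();
            queue.CompleteAdding();
            consumer.Join();
        }
    }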

0

You could use the XML data type in SQL Server, which supports up to 2 GB of XML data per value, and query the XML directly.

See: http://technet.microsoft.com/en-us/library/ms189887(v=sql.105).aspx

Hope this helps!

0

I know this question/answer is old, but I recently ran into this problem myself. Personally, I would use Json.NET (Newtonsoft); it makes it simple to deserialize the results of the XML document into C# objects.

Right now my documents (results) are only a few megabytes each (about 5 MB on average), but I can see this growing to the point of using a Solr index. Either way, I am getting quick results.

There is a discussion on CodePlex regarding its performance.

0
