What is the fastest way to parse text with custom delimiters and some very, very large field values in C#?

I am trying to deal with some delimited text files that have non-standard delimiters (not comma/quote or tab delimited). The delimiters are arbitrary ASCII characters that rarely appear inside the field values. After searching, I have found no built-in .NET solution that fits my needs, and the user-written libraries for this seem to have drawbacks when it comes to gigantic input (a 4 GB file in which some field values easily run to several million characters).

While this seems a bit extreme, it is actually standard in the Electronic Document Discovery (EDD) industry for some review software to have field values that contain the full contents of a document. For reference, I have previously done this in Python using the csv module without any problems.

Here is an example input:

Field delimiter = 
quote character = þ

þFieldName1þþFieldName2þþFieldName3þþFieldName4þ
þValue1þþValue2þþValue3þþSomeVery,Very,Very,Large value(5MB or so)þ
...etc...

Edit: I went ahead and wrote a delimited-file parser from scratch. I am a little wary of using this solution, as it may be prone to bugs. It also does not feel "elegant" or correct to have to write my own parser for a task like this, and I have the feeling that I probably should not have had to write one from scratch at all.

+3
6

API . .NET . IL- ​​ .

; 4 .

- , string.split:

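// Lazily yields one split line at a time, so the whole file is never held in memory.
// Caveats: ReadLine assumes no field contains a newline, and splitting on the
// quote character also yields the entries that sit between quoted fields.
public IEnumerable<string[]> CreateEnumerable(StreamReader input)
{
    string line;
    while ((line = input.ReadLine()) != null)
    {
        yield return line.Split('þ');
    }
}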

, , Linq ;) , , IEnumerable , StreamReader , ( , ToList/ToArray, , , , , !).

:

using (StreamReader sr = new StreamReader("c:\\test.file"))
{
    var qry = from l in CreateEnumerable(sr).Skip(1)
              where l[3].Contains("something")
              select new { Field1 = l[0], Field2 = l[1] };
    foreach (var item in qry)
    {
        Console.WriteLine(item.Field1 + " , " + item.Field2);
    }
}
Console.ReadLine();

, , 4- "-". , .

+5

Windows -, IO Completion. , .

, #/. NET, Joe Duffy

18) Windows Asynchronous Procedure Calls (APC) .

;), APC, IOCP . -, .

, Eric White .

+1

(msdn .NET- ) , IEnumerable / ( - )

0

, , , . 8K ( ), , .

, ? - ? ?

0

. , BCL, .

"" - , , . , API API- BCL "reader", "".

, , , . FileStream. .

, :

using(var reader = new EddReader(new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read, 8192))) {
    // Read a small field
    string smallField = reader.ReadFieldAsText();
    // Read a large field
    Stream largeField = reader.ReadFieldAsStream();
}
0

Although this does not help with the large-input problem, one possible approach to the parsing problem is a custom parser that uses the strategy pattern to supply the delimiter.
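To make that idea concrete, here is a minimal sketch of a parser that takes its delimiter and quote character from a strategy object. Everything named here (`IDelimiterStrategy`, `FixedDelimiterStrategy`, `DelimitedParser.ReadRecords`) is hypothetical and made up for illustration; none of it comes from the question or from the BCL. It assumes the quote character never appears inside a field value and that records end with a newline outside quotes. As noted above, it does nothing for the large-input problem: each field is still buffered fully in memory.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Strategy: tells the parser which characters separate and quote fields.
public interface IDelimiterStrategy
{
    char FieldDelimiter { get; }
    char QuoteCharacter { get; }
}

// Hypothetical strategy that simply carries two fixed characters.
public sealed class FixedDelimiterStrategy : IDelimiterStrategy
{
    public FixedDelimiterStrategy(char fieldDelimiter, char quoteCharacter)
    {
        FieldDelimiter = fieldDelimiter;
        QuoteCharacter = quoteCharacter;
    }

    public char FieldDelimiter { get; }
    public char QuoteCharacter { get; }
}

public static class DelimitedParser
{
    // Lazily yields one record at a time. Input is read character by character,
    // so a newline inside a quoted field does not end the record.
    // Assumption: quote characters are never escaped or doubled inside a field.
    public static IEnumerable<string[]> ReadRecords(TextReader input, IDelimiterStrategy strategy)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;
        bool recordHasContent = false;
        int c;

        while ((c = input.Read()) != -1)
        {
            char ch = (char)c;
            if (ch == strategy.QuoteCharacter)
            {
                inQuotes = !inQuotes;
                recordHasContent = true;
            }
            else if (inQuotes)
            {
                current.Append(ch);
            }
            else if (ch == strategy.FieldDelimiter)
            {
                fields.Add(current.ToString());
                current.Clear();
                recordHasContent = true;
            }
            else if (ch == '\n')
            {
                if (recordHasContent)
                {
                    fields.Add(current.ToString());
                    yield return fields.ToArray();
                }
                fields.Clear();
                current.Clear();
                recordHasContent = false;
            }
            else if (ch != '\r')
            {
                current.Append(ch);
                recordHasContent = true;
            }
        }

        // Emit a final record that is not terminated by a newline.
        if (recordHasContent)
        {
            fields.Add(current.ToString());
            yield return fields.ToArray();
        }
    }
}
```

A caller would pass the real delimiter for the file; the `'|'` below is only a placeholder, for example `DelimitedParser.ReadRecords(new StreamReader(path), new FixedDelimiterStrategy('|', 'þ'))`. Swapping in a different strategy changes the format without touching the parsing loop.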

-2