C# StreamReader.ReadLine() - need to pick up line terminators

I wrote a C# program that reads an Excel .xls/.xlsx file and outputs it as CSV and as Unicode text. I wrote a separate program to remove blank records. It works by reading each line with StreamReader.ReadLine(), then going through the string character by character and not writing the line to the output if it contains only commas (for CSV) or only tabs (for Unicode text).

The problem occurs when the Excel file has embedded newlines (\x0A) inside cells. I changed my XLS-to-CSV converter to find these newlines (since it goes cell by cell) and write them out as \x0A, while normal lines just use StreamWriter.WriteLine().

The problem shows up in the separate program that removes blank records. When I read with StreamReader.ReadLine(), by definition it returns only the line's contents, not the terminator. Since the embedded newlines show up as two separate lines, I can't tell which is a full record and which is an embedded newline when I write them to the final file.

I'm not even sure I can read the \x0A at all, because everything in the input registers as '\n'. I could go character by character, but that would destroy my logic for removing blank lines.
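To make the symptom concrete, here is a minimal sketch of what ReadLine() does with mixed terminators. It uses StringReader instead of StreamReader purely so the example needs no file; both treat "\n", "\r" and "\r\n" as line terminators. The class and method names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class ReadLineDemo
{
    // Returns every string that ReadLine() yields for the given text.
    // The terminators themselves are discarded by ReadLine().
    public static List<string> ReadAllLines(string text)
    {
        var lines = new List<string>();
        using (var reader = new StringReader(text))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                lines.Add(line);
            }
        }
        return lines;
    }
}
```

Calling ReadLineDemo.ReadAllLines("1050,\"Aziz\n\",Individual\r\n1050a,x\r\n") returns three strings, and nothing in them says whether a given line originally ended in LF or CRLF.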

Any ideas would be greatly appreciated.

5 answers

I would recommend changing your architecture to work more like a parser in a compiler.

You want to create a lexer that returns a sequence of tokens, and then a parser that reads the sequence of tokens and does something with them.

In your case, the tokens would be:

  • Column data
  • Comma
  • End of line

You would treat a '\n' ('\x0a') by itself as an embedded newline, and therefore include it as part of a column data token. A '\r\n' would constitute an end-of-line token.

This has the following advantages:

  • Only one pass over the data
  • Only one row's worth of data held in memory at a time
  • Reuses as much memory as possible (for the string builder and the list)
  • Easy to change if your requirements change

Here is a sample of what the lexer would look like:

Disclaimer: I haven't even compiled, let alone tested, this code, so you will need to clean it up and make sure it works.

    // Requires: using System.Collections.Generic; using System.IO; using System.Text;
    enum TokenType { ColumnData, Comma, LineTerminator }

    class Token
    {
        public TokenType Type { get; private set; }
        public string Data { get; private set; }

        public Token(TokenType type) { Type = type; }
        public Token(TokenType type, string data) { Type = type; Data = data; }
    }

    private IEnumerable<Token> GetTokens(TextReader s)
    {
        var builder = new StringBuilder();
        while (s.Peek() >= 0)
        {
            var c = (char)s.Read();
            switch (c)
            {
                case ',':
                    if (builder.Length > 0)
                    {
                        yield return new Token(TokenType.ColumnData, ExtractText(builder));
                    }
                    yield return new Token(TokenType.Comma);
                    break;

                case '\r':
                    // Swallow the '\n' of a "\r\n" pair; a lone '\n' falls
                    // through to default and stays part of the column data.
                    if (s.Peek() == '\n')
                    {
                        s.Read();
                    }
                    if (builder.Length > 0)
                    {
                        yield return new Token(TokenType.ColumnData, ExtractText(builder));
                    }
                    yield return new Token(TokenType.LineTerminator);
                    break;

                default:
                    builder.Append(c);
                    break;
            }
        }

        // Flush whatever is left if the input does not end with a terminator.
        if (builder.Length > 0)
        {
            yield return new Token(TokenType.ColumnData, ExtractText(builder));
        }
    }

    // Returns the builder's contents and empties it for reuse.
    private string ExtractText(StringBuilder b)
    {
        var ret = b.ToString();
        b.Remove(0, b.Length);
        return ret;
    }

Your parser code will look like this:

    public void ConvertXLS(TextReader s)
    {
        var columnData = new List<string>();
        bool lastWasColumnData = false;
        bool seenAnyData = false;

        foreach (var token in GetTokens(s))
        {
            switch (token.Type)
            {
                case TokenType.ColumnData:
                    seenAnyData = true;
                    if (lastWasColumnData)
                    {
                        //TODO: do some error reporting
                    }
                    else
                    {
                        lastWasColumnData = true;
                        columnData.Add(token.Data);
                    }
                    break;

                case TokenType.Comma:
                    if (!lastWasColumnData)
                    {
                        columnData.Add(null); // empty column
                    }
                    lastWasColumnData = false;
                    break;

                case TokenType.LineTerminator:
                    // Rows with no column data at all are blank and are skipped.
                    if (seenAnyData)
                    {
                        OutputLine(columnData); // OutputLine is yours to write
                    }
                    seenAnyData = false;
                    lastWasColumnData = false;
                    columnData.Clear();
                    break;
            }
        }

        if (seenAnyData)
        {
            OutputLine(columnData);
        }
    }

You can't change StreamReader to return the line terminators, and you can't change what it uses to terminate a line.

I'm not entirely clear about the problem in terms of what escaping you are doing, particularly in terms of "and write them as \x0A". A sample of the file would probably help.

It sounds like you may need to work character by character, or possibly load the whole file first and do a global replace, e.g.

    x.Replace("\r\n", "\u0000")   // Or some other unused character
     .Replace("\n", "\\x0A")      // Or whatever escaping you need
     .Replace("\u0000", "\r\n")   // Replace the real line breaks

I'm sure you could do that with a regex, and it would probably be more efficient, but I find the long way easier to understand :) It is a bit of a hack to have to do a global replace, though. Hopefully with more information we will come up with a better solution.
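As a rough sketch of how that replace chain could be wrapped up together with its inverse: the class and method names below are made up, and the code assumes that neither the U+0000 placeholder nor the literal text \x0A ever occurs in the real data.

```csharp
using System;

public static class NewlineEscaper
{
    // Assumption: this character never appears in the file.
    private const string Placeholder = "\u0000";
    // Assumption: this literal four-character text never appears in a cell.
    private const string Escaped = "\\x0A";

    // Turns lone LF characters into the visible text \x0A and leaves
    // CRLF record terminators untouched.
    public static string EscapeEmbedded(string text)
    {
        return text.Replace("\r\n", Placeholder)
                   .Replace("\n", Escaped)
                   .Replace(Placeholder, "\r\n");
    }

    // Restores the embedded LF characters in a line read by ReadLine().
    public static string UnescapeEmbedded(string line)
    {
        return line.Replace(Escaped, "\n");
    }
}
```

After EscapeEmbedded, every remaining line break is a real record terminator, so ReadLine() can be used as before; UnescapeEmbedded is applied to each surviving line just before it is written out.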


Essentially, a hard return in Excel (Shift+Enter or Alt+Enter, I can't remember which) puts in a newline that is equivalent to \x0A in the default encoding I use to write my CSV. When I write to CSV, I use StreamWriter.WriteLine(), which outputs the line plus a line terminator (which I believe is \r\n).

The CSV is fine and comes out exactly as Excel would save it. The problem is when I read it into the blank-record remover: I use ReadLine(), which treats a record with an embedded newline as if it ended with a CRLF.

Here is an example of the file after I convert it to CSV:

    Reference,Name of Individual or Entity,Type,Name Type,Date of Birth,Place of Birth,Citizenship,Address,Additional Information,Listing Information,Control Date,Committees
    1050,"Aziz Salih al-Numan
    ",Individual,Primary Name,1941 or 1945,An Nasiriyah,Iraqi,,Ba'th Party Regional Command Chairman; Former Governor of Karbala and An Najaf Former Minister of Agriculture and Agrarian Reform (1986-1987),Resolution 1483 (2003),6/27/2003,1518 (Iraq)
    1050a,???? ???? ???????,Individual,Original script,1941 or 1945,An Nasiriyah,Iraqi,,Ba'th Party Regional Command Chairman; Former Governor of Karbala and An Najaf Former Minister of Agriculture and Agrarian Reform (1986-1987),Resolution 1483 (2003),6/27/2003,1518 (Iraq)

As you can see, the first record has an embedded newline after al-Numan. When I use ReadLine(), I get '1050,"Aziz Salih al-Numan', and when I write that out, WriteLine() ends that line with a CRLF, so I lose the original line terminator. When I use ReadLine() again, I get the line starting with '1050a'.

I could read the whole file in and replace them, but then I would have to replace them back afterwards. Basically, what I want to do is get the line terminator to determine whether it is \x0A or CRLF, and then, if it's \x0A, use Write() and insert that terminator myself.
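One way to get at the terminator is a small replacement for ReadLine() that reads character by character and reports what ended the line. The following is only a sketch: TerminatorReader and TryReadLine are hypothetical names, not part of the framework.

```csharp
using System;
using System.IO;
using System.Text;

public static class TerminatorReader
{
    // Reads one line from the reader. On success, 'line' holds the text
    // without its terminator and 'terminator' is "\r\n", "\n", "\r",
    // or "" when the input ended without one.
    // Returns false once there is nothing left to read.
    public static bool TryReadLine(TextReader reader, out string line, out string terminator)
    {
        line = null;
        terminator = "";

        int next = reader.Read();
        if (next < 0)
        {
            return false;
        }

        var builder = new StringBuilder();
        while (next >= 0)
        {
            char c = (char)next;
            if (c == '\n')
            {
                terminator = "\n";
                break;
            }
            if (c == '\r')
            {
                // Look ahead so that "\r\n" is consumed as one terminator.
                if (reader.Peek() == '\n')
                {
                    reader.Read();
                    terminator = "\r\n";
                }
                else
                {
                    terminator = "\r";
                }
                break;
            }
            builder.Append(c);
            next = reader.Read();
        }

        line = builder.ToString();
        return true;
    }
}
```

The caller can then write line + terminator with Write() instead of WriteLine(), so an embedded \x0A stays an \x0A and a CRLF stays a CRLF. The blank-record check would still need to look at the whole logical record, since a record split by an embedded newline arrives in two pieces.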


I know I'm a little late to the game here, but I had the same problem and my solution was much simpler than most of those given.

If you can determine the column count, which should be easy since the first row is usually the column headings, you can check each row's column count against the expected count. If the column count doesn't equal the expected count, you simply concatenate the current line with the previous unmatched lines. For example:

    string sep = "\",\"";
    int columnCount = 0;
    int lineCount = 0;
    string currentLine;
    string lastLine = null;
    string[] lineData;

    while ((currentLine = sr.ReadLine()) != null)
    {
        if (lineCount == 0)
        {
            // The first row holds the column headings.
            lineData = currentLine.Split(new string[] { sep }, StringSplitOptions.None);
            columnCount = lineData.Length;
            ++lineCount;
            continue;
        }

        string thisLine = lastLine + currentLine;
        lineData = thisLine.Split(new string[] { sep }, StringSplitOptions.None);
        if (lineData.Length < columnCount)
        {
            // Incomplete record: keep accumulating lines.
            lastLine += currentLine;
            continue;
        }
        else
        {
            lastLine = null;
        }
        ......
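A self-contained variant of the same idea might look like the sketch below. RecordMerger and ReadRecords are made-up names. It assumes that the separator never occurs inside a field, that the header row has no embedded newline, and that the embedded terminator was a single LF, which is what it puts back when joining the pieces.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class RecordMerger
{
    // Yields logical records, joining physical lines until the number of
    // fields reaches the count found in the first (header) row.
    public static IEnumerable<string> ReadRecords(TextReader reader, char separator)
    {
        int expected = -1;      // field count taken from the header row
        string pending = null;  // partial record still being accumulated
        string line;

        while ((line = reader.ReadLine()) != null)
        {
            // Assumption: the lost embedded terminator was "\n".
            string candidate = pending == null ? line : pending + "\n" + line;
            int count = candidate.Split(separator).Length;

            if (expected < 0)
            {
                expected = count;
            }

            if (count < expected)
            {
                pending = candidate;
                continue;
            }

            pending = null;
            yield return candidate;
        }

        // Hand back a trailing partial record instead of dropping it silently.
        if (pending != null)
        {
            yield return pending;
        }
    }
}
```

Note that a completely empty physical line has fewer fields than expected, so this sketch would merge it into the following record; a row of bare separators is unaffected.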

Thanks so much for your code and that of some others, with which I came up with the following solution! I have added a link at the bottom to some code I wrote that uses some of the logic from this page. I figured I would give credit where it is due! Thanks!

Below is a little explanation of what I needed: I wrote this because I have some very large '|'-delimited files that have \r\n inside some of the columns, and I needed to use \r\n as the end-of-line delimiter. I was trying to import some files using SSIS packages, but the import failed because of some corrupted data in the files. The file was over 5 GB, so it was too large to open and fix manually. I found the answer by looking through lots of forums to understand how streams work, and ended up with a solution that reads each character in the file and splits out the lines based on the definitions I added to it. It is meant for use in a command-line application :). I hope this helps some other people; I haven't found a solution quite like it anywhere else, although the ideas were inspired by this forum and others.

stack overflow
