Bug in find.exe?

In C++, we have a method that searches for text within a file. It works by reading the file into a variable and using strstr. But we ran into trouble when the file turned out to be very large.

I thought I could solve this by calling find.exe using _popen. It works unless all of the following conditions are true:

  • The file is Unicode (UTF-16 LE, BOM = FF FE)
  • The file is EXACTLY 4096 bytes
  • The text you are searching for is the last text in the file

To recreate, you can do this:

  • Open Notepad
  • Insert 2046 X characters, then an A at the end
  • Save as test.txt, encoding = "Unicode"
  • Make sure the file is exactly 4096 bytes
  • Open a command prompt and type: find "A" /c test.txt -> no hits
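If you would rather generate the test file from code than from Notepad, the steps above can be sketched roughly as follows. This is only a minimal sketch: the helper names makeUtf16LeContents and writeTestFile are made up for illustration, and it assumes that Notepad's "Unicode" encoding means UTF-16 LE with an FF FE BOM and no trailing newline.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical helper: builds the raw bytes of a UTF-16 LE text file.
// Layout: BOM (FF FE), then `fillCount` copies of 'X', then a single 'A'.
// Every character occupies two bytes (low byte first, high byte zero).
std::string makeUtf16LeContents(std::size_t fillCount) {
    std::string bytes;
    bytes += '\xFF';
    bytes += '\xFE';
    for (std::size_t i = 0; i < fillCount; ++i) {
        bytes += 'X';
        bytes += '\0';
    }
    bytes += 'A';
    bytes += '\0';
    return bytes;
}

// Hypothetical helper: writes the bytes in binary mode so that no
// newline translation can change the file size.
bool writeTestFile(const std::string& path, std::size_t fillCount) {
    std::ofstream out(path, std::ios::binary);
    if (!out) return false;
    const std::string bytes = makeUtf16LeContents(fillCount);
    out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    return static_cast<bool>(out);
}
```

With fillCount = 2046 the buffer is 2 + 2046 * 2 + 2 = 4096 bytes, which matches the size in the steps above. Changing fillCount by one gives a file that is no longer 4096 bytes.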

I also tried this:

  • Add or remove X and you will get a hit (the file is no longer 4096 bytes)
  • Save as UTF-8 (and add enough X so that the file is again 4096 bytes) and you get a hit
  • Search for something in the middle of the file (the file is still Unicode and 4096 bytes) and you get a hit.

Is this a bug, or am I missing something?

1 answer

Very interesting bug.

This question prompted me to run some experiments on XP and Windows 7, and the behavior is different.

XP

ANSI - FIND cannot read past 1023 characters (1023 bytes) on a single line. FIND can match a line that exceeds 1023 characters as long as the search string matches before the 1024th character. The printout of the matching line is truncated after 1023 characters.

Unicode - FIND cannot read past 1024 characters (2048 bytes) on a single line. FIND can match a line that exceeds 1024 characters as long as the search string matches before the 1025th character. The printout of the matching line is truncated after 1024 characters.

I find it very odd that the line limits for Unicode and ANSI on XP are not the same number of bytes, nor is one a simple multiple of the other. The Unicode limit, expressed in bytes, is 2 times the ANSI limit plus 2 (2 × 1023 + 2 = 2048).

Note: truncating a long matching line also truncates its newline character, so the next matching line is appended to the previous one. You can tell that it is a new line if you use the /N option.

Windows 7

ANSI - I did not find a limit on the maximum line length that can be searched (although I did not try very hard). Any matching line that exceeds 4095 characters (4095 bytes) is truncated after 4095 characters. FIND can successfully search past 4095 characters on a line, it just cannot display all of them.

Unicode - I did not find a limit on the maximum line length that can be searched (although I did not try very hard). Any matching line that exceeds 2047 characters (4094 bytes) is truncated after 2047 characters. FIND can successfully search past 2047 characters on a line, it just cannot display all of them.

Since Unicode byte lengths are always a multiple of 2, and the maximum ANSI displayed length is an odd number, it makes sense that the maximum displayed line length in bytes is one less for Unicode than for ANSI.

But then there is a strange Unicode bug. If the Unicode file length is an exact multiple of 4096 bytes, then the last character cannot be found or printed. It does not matter whether the file contains one line or several lines. It depends only on the total file length.

I find it interesting that the multiple-of-4096 bug is within one byte of the maximum displayable line length (4095 bytes for ANSI). But I do not know whether these behaviors are related or whether it is just a coincidence.

Note: truncating a long matching line also truncates its newline character, so the next matching line is appended to the previous one. You can tell that it is a new line if you use the /N option.
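If you call FIND from C++ and want to know up front whether a file falls into this case, the condition can be sketched as a simple check. This is only an illustration of the observation above, not a description of how FIND works internally. The function name matchesReportedCondition is made up, and it assumes that a "Unicode" file can be recognized by a leading FF FE BOM.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true when the buffer starts with the
// UTF-16 LE BOM (FF FE) and its total length is an exact multiple of
// 4096 bytes, which is the condition observed on Windows 7 above.
bool matchesReportedCondition(const std::string& bytes) {
    const std::size_t kBlock = 4096;
    if (bytes.size() < 2) return false;  // too short to hold a BOM
    const bool hasBom = static_cast<unsigned char>(bytes[0]) == 0xFF &&
                        static_cast<unsigned char>(bytes[1]) == 0xFE;
    return hasBom && bytes.size() % kBlock == 0;
}
```

A caller could read the file in binary mode, pass the bytes to this check, and fall back to a different search method when it returns true.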
