Why does std :: regex_iterator cause a stack overflow with this data?

I used std::regex_iterator to parse the log files. My program worked pretty well for several weeks and analyzed millions of log lines, until today, when today I ran it against a log file and got a stack overflow. It turned out that the reason is only one line of the log in the log file. Does anyone know why my regex invokes such massive recursion? Here is a small standalone program that shows the problem (my compiler is VC2012):

 #include <string> #include <regex> #include <iostream> using namespace std; std::wstring test = L"L3 T15356 79726859 [CreateRegistryAction] Creating REGISTRY Action:\n" L" Identity: 272A4FE2-A7EE-49B7-ABAF-7C57BEA0E081\n" L" Description: Set Registry Value: \"SortOrder\" in Key HKEY_CURRENT_USER\\Software\\Hummingbird\\PowerDOCS\\Core\\Plugins\\Fusion\\Settings\\DetailColumns\\LONEDOCS1\\Search Unsaved\\$AUTHOR.FULL_NAME;DOCSADM.PEOPLE.SYSTEM_ID\n" L" Operation: 3\n" L" Hive: HKEY_CURRENT_USER\n" L" Key: Software\\Hummingbird\\PowerDOCS\\Core\\Plugins\\Fusion\\Settings\\DetailColumns\\LONEDOCS1\\Search Unsaved\\$AUTHOR.FULL_NAME;DOCSADM.PEOPLE.SYSTEM_ID\n" L" ValueName: SortOrder\n" L" ValueType: REG_DWORD\n" L" ValueData: 0\n" L"L4 T15356 79726859 [CEMRegistryValueAction::ClearRevertData] [ENTER]\n"; int wmain(int argc, wchar_t* argv[]) { static wregex rgx_log_lines( L"^L(\\d+)\\s+" // Level L"T(\\d+)\\s+" // TID L"(\\d+)\\s+" // Timestamp L"\\[((?:\\w|\\:)+)\\]" // Function name L"((?:" // Complex pattern L"(?!" // Stop matching when... L"^L\\d" // New log statement at the beginning of a line L")" L"[^]" // Matching all until then L")*)" // ); try { for (std::wsregex_iterator it(test.begin(), test.end(), rgx_log_lines), end; it != end; ++it) { wcout << (*it)[1] << endl; wcout << (*it)[2] << endl; wcout << (*it)[3] << endl; wcout << (*it)[4] << endl; wcout << (*it)[5] << endl; } } catch (std::exception& e) { cout << e.what() << endl; } return 0; } 
+7
c ++ regex c ++ 11
source share
2 answers

Negative look patterns that are tested on each character seem like a bad idea to me, and what you're trying to do is not difficult. You want to match (1) the rest of the line, and then (2) any number of the following (3) lines starting with something other than L \ d (small error, see below): (another edit: these are regular expressions ; if you want to write them as string literals, you need to change \ to \\ .)

  .*\n(?:(?:[^L]|L\D).*\n)* | | | +-1 | +---------------3 +---------------------2 

In extreme text mode . should not match \ n, but you can always replace two . in this expression with [^\n]

Edited to add: I understand that this may not work if there is an empty line before the end of the log entry, but this should cover this case; I have changed . on [^\n] for extra precision:

  [^\n]*\n(?:(?:(?:[^L\n]|L\D)[^\n]*)?\n)* 
+4
source share

The regular expression looks OK; at least there is nothing in it that could lead to a catastrophic return.

I see a little opportunity to optimize the regex by reducing stack usage:

 static wregex rgx_log_lines( L"^L(\\d+)\\s+" // Level L"T(\\d+)\\s+" // TID L"(\\d+)\\s+" // Timestamp L"\\[([\\w:]+)\\]" // Function name L"((?:" // Complex pattern L"(?!" // Stop matching when... L"^L\\d" // New log statement at the beginning of a line L")" L"[^]" // Matching all until then L")*)" // ); 

Have you set the ECMAScript option ? Otherwise, I suspect that the regular expression library uses POSIX regular expressions by default and they do not support lookahead statements.

+1
source share

All Articles