Optimizing the conversion of several million char * inputs to .NET strings

I have an application that receives several million char * values as input (usually UTF-8 strings fewer than 512 characters long) and must convert and store them as .NET strings.

This has turned out to be a real bottleneck in my application, and I am wondering whether there is a design pattern or other idea that would make it more efficient.

There is one key observation that makes me feel this can be improved: there are many duplicates. If, say, 1 million objects arrive, there may be only about 50 unique char * patterns among them.

For the record, here is the algorithm I use to convert a char * to a String (it is written in C++/CLI, while the rest of the project is in C#):

String ^StringTools::MbCharToStr (const char *Source)
{
    String ^str;

    if ((Source == NULL) || (Source[0] == '\0'))
    {
        str = gcnew String ("");
    }
    else
    {
        // Find the number of UTF-16 characters needed to hold the
        // converted UTF-8 string, and allocate a buffer for them.
        const size_t max_strsize = 2048;
        int wstr_size = MultiByteToWideChar (CP_UTF8, 0L, Source, -1, NULL, 0);

        if (wstr_size < max_strsize)
        {
            // Save the malloc/free overhead if it's a reasonable size.
            // Plus, KJN was having fits with exceptions within exception logging
            // due to a corrupted heap.
            wchar_t wstr[max_strsize];
            (void) MultiByteToWideChar (CP_UTF8, 0L, Source, -1, wstr, (int) wstr_size);
            str = gcnew String (wstr);
        }
        else
        {
            wchar_t *wstr = (wchar_t *) calloc (wstr_size, sizeof (wchar_t));
            if (wstr == NULL)
                throw gcnew PCSException (__FILE__, __LINE__, PCS_INSUF_MEMORY, MSG_SEVERE);

            // Convert the UTF-8 string into the UTF-16 buffer, construct the
            // result String from the UTF-16 buffer, and then free the buffer.
            (void) MultiByteToWideChar (CP_UTF8, 0L, Source, -1, wstr, (int) wstr_size);
            str = gcnew String (wstr);
            free (wstr);
        }
    }
    return str;
}
+7
4 answers

You could build a trie keyed on the characters of the input string, with each leaf holding a single .NET string object. Then, when a char * you have seen before arrives, you can quickly look up the existing .NET string without allocating any memory.

Pseudocode:

  • start with an empty trie,
  • process a char * by walking the trie as far as its characters match,
  • add nodes until the entire char * is encoded as a path of nodes,
  • at the leaf, attach the actual .NET string.

The answer to this other SO question should help you get started: How to create a trie in C#
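
A minimal sketch of that idea in C++/CLI (all names here are mine, not from the question; it reuses the question's MbCharToStr and, being a sketch, never frees its nodes):

#include <vcclr.h>    // gcroot: holds a managed handle inside a native class
#include <cstring>    // memset

struct TrieNode
{
    TrieNode *child[256];            // one slot per possible input byte
    gcroot<System::String^> str;     // set only at nodes where a string ends

    TrieNode () { std::memset (child, 0, sizeof child); }
};

class StringInterner
{
    TrieNode root;
public:
    // Returns the cached String^ for Source (assumed non-NULL),
    // converting it only the first time it is seen.
    System::String ^Intern (const char *Source)
    {
        TrieNode *node = &root;
        for (const unsigned char *p = (const unsigned char *) Source; *p != '\0'; ++p)
        {
            if (node->child[*p] == NULL)
                node->child[*p] = new TrieNode ();  // first time down this path
            node = node->child[*p];
        }
        if ((System::String ^) node->str == nullptr)
            node->str = StringTools::MbCharToStr (Source);  // convert once
        return node->str;
    }
};

With only ~50 unique patterns among a million inputs, the 256-way fan-out costs little memory; for denser key sets, a map-based child table would trade some lookup speed for space.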

+5

There is one key observation that makes me feel this can be improved: there are many duplicates. If, say, 1 million objects arrive, there may be only about 50 unique char * patterns among them.

If so, you may want to store the "found" patterns in a map (for example a std::map<const char*, gcroot<String^>>, although you will need a custom comparator for the const char* keys), and use it to return the previously converted value.

There is some overhead in storing and comparing the map entries, but this may be outweighed by drastically reduced memory usage (you can reuse the managed string instances) and by the memory allocations (calloc/free) you save. Also, using malloc instead of calloc is likely to be a (very small) improvement, since you do not need the memory zeroed before calling MultiByteToWideChar.
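
A minimal sketch of that cache, again reusing the question's MbCharToStr. I substitute std::string keys for const char* so the map owns a copy of the key bytes and cannot dangle if the caller reuses its input buffer (a const char* key with a strcmp-based comparator also works if the inputs are stable):

#include <map>
#include <string>
#include <utility>
#include <vcclr.h>   // gcroot

class ConversionCache
{
    std::map<std::string, gcroot<System::String^> > cache;

public:
    System::String ^Convert (const char *Source)
    {
        std::string key (Source);
        std::map<std::string, gcroot<System::String^> >::iterator it = cache.find (key);
        if (it != cache.end ())
            return it->second;                        // hit: no conversion at all
        System::String ^str = StringTools::MbCharToStr (Source);
        cache.insert (std::make_pair (key, gcroot<System::String^> (str)));
        return str;
    }
};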

+3

I think the first optimization you could make here is to have your first call to MultiByteToWideChar use a buffer instead of a null pointer. Because you specified CP_UTF8, MultiByteToWideChar has to scan the entire string to determine the expected length. If there is some length that is longer than the vast majority of your strings, you could optimistically allocate a buffer of that size on the stack; if that fails, fall back to dynamic allocation. That is, move the first branch of your if/else block outside the if/else.

You could also save some time by computing the length of the source string once and passing it explicitly, so that MultiByteToWideChar does not have to do a strlen every time you call it.
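
A sketch combining both suggestions (the names and the 2048-element buffer are illustrative, and the exception types stand in for the question's PCSException):

#include <windows.h>
#include <cstring>
#include <cstdlib>

System::String ^MbCharToStrFast (const char *Source)
{
    if (Source == NULL || Source[0] == '\0')
        return System::String::Empty;

    // Compute the length once and pass it explicitly, so that
    // MultiByteToWideChar does not strlen the input internally.
    const int src_len = (int) std::strlen (Source);

    // Optimistic first attempt: convert straight into a stack buffer.
    const int max_strsize = 2048;
    wchar_t wstr[max_strsize];
    int converted = MultiByteToWideChar (CP_UTF8, 0, Source, src_len,
                                         wstr, max_strsize);
    if (converted > 0)
        return gcnew System::String (wstr, 0, converted);

    if (GetLastError () != ERROR_INSUFFICIENT_BUFFER)
        throw gcnew System::Exception ("MultiByteToWideChar failed");

    // Rare path: the string is longer than the stack buffer. Ask for the
    // size, then convert into a heap buffer (malloc, not calloc: the memory
    // does not need to be zeroed before it is overwritten).
    int needed = MultiByteToWideChar (CP_UTF8, 0, Source, src_len, NULL, 0);
    wchar_t *buf = (wchar_t *) malloc (needed * sizeof (wchar_t));
    if (buf == NULL)
        throw gcnew System::OutOfMemoryException ();
    MultiByteToWideChar (CP_UTF8, 0, Source, src_len, buf, needed);
    System::String ^str = gcnew System::String (buf, 0, needed);
    free (buf);
    return str;
}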

However, if the rest of your project is in C#, it seems you should be using the .NET BCL facilities designed for this rather than keeping a side-by-side C++/CLI assembly for the sole purpose of converting strings. That is what System.Text.Encoding is for.
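
For example, a sketch of a pure-BCL version written in C++/CLI (the helper name is mine; newer .NET versions also expose pointer-based overloads such as Encoding.GetString(Byte*, Int32) that would avoid the intermediate copy):

#include <cstring>
using namespace System;
using namespace System::Runtime::InteropServices;
using namespace System::Text;

// Same conversion done entirely with the BCL: Encoding::UTF8 performs the
// UTF-8 -> UTF-16 transcoding that MultiByteToWideChar did above.
String ^MbCharToStrBcl (const char *Source)
{
    if (Source == NULL || Source[0] == '\0')
        return String::Empty;

    int len = (int) std::strlen (Source);
    array<Byte> ^bytes = gcnew array<Byte> (len);
    Marshal::Copy (IntPtr ((void *) Source), bytes, 0, len);
    return Encoding::UTF8->GetString (bytes);
}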

I doubt that any caching data structure you could use here would make a significant difference.

Oh, and do not ignore the result of MultiByteToWideChar: not only should you not be casting anything to void, you have undefined behavior in the event that MultiByteToWideChar fails.
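
A sketch of what that check might look like (the exception choice is illustrative):

// Check the result instead of casting it to void: zero means failure.
int wstr_size = MultiByteToWideChar (CP_UTF8, 0, Source, -1, NULL, 0);
if (wstr_size == 0)
{
    // e.g. ERROR_NO_UNICODE_TRANSLATION when MB_ERR_INVALID_CHARS is
    // used and the input is malformed UTF-8.
    DWORD err = GetLastError ();
    throw gcnew System::ComponentModel::Win32Exception ((int) err);
}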

+2

I would probably use a cache based on a ternary search tree, or a similar structure, and look up the input string to see whether it has already been converted before converting even a single character to its .NET representation.
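
A compact sketch of such a ternary-search-tree cache (all names are mine; it reuses the question's MbCharToStr and, like the trie above, never frees its nodes):

#include <vcclr.h>   // gcroot

// Ternary search tree: each node holds one byte of a key; lo/hi lead to
// keys whose next byte sorts lower/higher, eq advances to the next byte.
struct TstNode
{
    char ch;
    TstNode *lo, *eq, *hi;
    gcroot<System::String^> value;   // set only where a complete key ends

    TstNode (char c) : ch (c), lo (NULL), eq (NULL), hi (NULL) {}
};

static System::String ^Lookup (TstNode *&node, const char *s, const char *full)
{
    if (node == NULL)
        node = new TstNode (*s);          // grow the tree on a miss
    if (*s < node->ch) return Lookup (node->lo, s, full);
    if (*s > node->ch) return Lookup (node->hi, s, full);
    if (s[1] != '\0')  return Lookup (node->eq, s + 1, full);
    // Reached the end of the key: convert only if it was never seen before.
    if ((System::String ^) node->value == nullptr)
        node->value = StringTools::MbCharToStr (full);
    return node->value;
}

System::String ^Intern (TstNode *&root, const char *s)
{
    if (s == NULL || *s == '\0')
        return System::String::Empty;     // handle the empty case up front
    return Lookup (root, s, s);
}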

+1
