Hash table performance: why is C++ the slowest?

A simple hash performance test; the C++ version seems to be slower than both the Perl version and the Go version.

  • The Perl version took about 200 milliseconds.
  • The C++ version took 280 milliseconds.
  • The Go version took 56 milliseconds.

Tested on my PC with an Intel Core(TM) i7-2670QM CPU @ 2.20GHz, running Ubuntu 14.04.3 LTS.

Any ideas?

Perl version

    use Time::HiRes qw( gettimeofday );

    sub getTS {
        my ($seconds, $microseconds) = gettimeofday;
        return $seconds + (0.0 + $microseconds) / 1000000.0;
    }

    my %mymap;
    $mymap{"US"} = "Washington";
    $mymap{"UK"} = "London";
    $mymap{"France"} = "Paris";
    $mymap{"Russia"} = "Moscow";
    $mymap{"China"} = "Beijing";
    $mymap{"Germany"} = "Berlin";
    $mymap{"Japan"} = "Tokyo";
    $mymap{"China"} = "Beijing";
    $mymap{"Italy"} = "Rome";
    $mymap{"Spain"} = "Madrid";

    $x = "";
    $start = getTS();
    for ($i = 0; $i < 1000000; $i++) {
        $x = $mymap{"China"};
    }
    printf "took %f sec\n", getTS() - $start;

C++ version

    #include <iostream>
    #include <cstdio>
    #include <string>
    #include <unordered_map>
    #include <sys/time.h>

    double getTS() {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1000000.0;
    }

    using namespace std;

    int main() {
        std::unordered_map<std::string, std::string> mymap;
        // populating container:
        mymap["US"] = "Washington";
        mymap["UK"] = "London";
        mymap["France"] = "Paris";
        mymap["Russia"] = "Moscow";
        mymap["China"] = "Beijing";
        mymap["Germany"] = "Berlin";
        mymap["Japan"] = "Tokyo";
        mymap["China"] = "Beijing";
        mymap["Italy"] = "Rome";
        mymap["Spain"] = "Madrid";

        double start = getTS();
        string x;
        for (int i = 0; i < 1000000; i++) {
            mymap["China"];
        }
        printf("took %f sec\n", getTS() - start);
        return 0;
    }

Go version

    package main

    import (
        "fmt"
        "time"
    )

    func main() {
        var x string
        mymap := make(map[string]string)
        mymap["US"] = "Washington"
        mymap["UK"] = "London"
        mymap["France"] = "Paris"
        mymap["Russia"] = "Moscow"
        mymap["China"] = "Beijing"
        mymap["Germany"] = "Berlin"
        mymap["Japan"] = "Tokyo"
        mymap["China"] = "Beijing"
        mymap["Italy"] = "Rome"
        mymap["Spain"] = "Madrid"

        t0 := time.Now()
        sum := 1
        for sum < 1000000 {
            x = mymap["China"]
            sum += 1
        }
        t1 := time.Now()
        fmt.Printf("The call took %v to run.\n", t1.Sub(t0))
        fmt.Println(x)
    }

Update 1

To improve the C++ version, I changed x = mymap["China"]; to mymap["China"];, but it made very little difference in performance.

Update 2

I got the original result when compiling without any optimization: g++ -std=c++11 unorderedMap.cc. With optimization (g++ -std=c++11 -O2 unorderedMap.cc) it takes only about half the time, roughly 150 ms.

Update 3

To remove a possible call to the std::string(const char*) constructor, I created a string constant. The time drops to approximately 220 ms (compiled without optimization). Thanks to @neil-kirk's suggestion, with optimization (the -O2 flag) the time is about 80 ms.

    double start = getTS();
    string x = "China";
    for (int i = 0; i < 1000000; i++) {
        mymap[x];
    }
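For completeness, a self-contained sketch of this variant, assuming the same timer and key set as the original program (only two keys are shown):

    #include <cstdio>
    #include <string>
    #include <unordered_map>
    #include <sys/time.h>

    // Same timer as in the original program.
    double getTS() {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1000000.0;
    }

    int main() {
        std::unordered_map<std::string, std::string> mymap;
        mymap["US"] = "Washington";
        mymap["China"] = "Beijing";
        // ... remaining keys as in the original ...

        double start = getTS();
        // The key is built once, outside the loop, so the
        // std::string(const char*) constructor no longer runs per iteration.
        std::string x = "China";
        for (int i = 0; i < 1000000; i++) {
            mymap[x];
        }
        printf("took %f sec\n", getTS() - start);
        return 0;
    }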

Update 4

Thanks to @steffen-ullrich, who pointed out a syntax error in the Perl version, which I have fixed above. The Perl version now takes about 150 ms.

Update 5

It seems that the number of executed instructions matters. Using valgrind --tool=cachegrind <cmd>:

For the Go version:

    $ valgrind --tool=cachegrind ./te1
    ==2103== Cachegrind, a cache and branch-prediction profiler
    ==2103== Copyright (C) 2002-2013, and GNU GPL'd, by Nicholas Nethercote et al.
    ==2103== Using Valgrind-3.10.0.SVN and LibVEX; rerun with -h for copyright info
    ==2103== Command: ./te1
    ==2103==
    --2103-- warning: L3 cache found, using its data for the LL simulation.
    The call took 1.647099s to run.
    Beijing
    ==2103==
    ==2103== I   refs:      255,763,381
    ==2103== I1  misses:          3,709
    ==2103== LLi misses:          2,743
    ==2103== I1  miss rate:        0.00%
    ==2103== LLi miss rate:        0.00%
    ==2103==
    ==2103== D   refs:      109,437,132  (77,838,331 rd + 31,598,801 wr)
    ==2103== D1  misses:        352,474  (   254,714 rd +     97,760 wr)
    ==2103== LLd misses:        149,260  (    96,250 rd +     53,010 wr)
    ==2103== D1  miss rate:         0.3% (       0.3%   +        0.3%  )
    ==2103== LLd miss rate:         0.1% (       0.1%   +        0.1%  )
    ==2103==
    ==2103== LL refs:           356,183  (   258,423 rd +     97,760 wr)
    ==2103== LL misses:         152,003  (    98,993 rd +     53,010 wr)
    ==2103== LL miss rate:          0.0% (       0.0%   +        0.1%  )

For the C++ version (compiled without optimization flags):

    $ valgrind --tool=cachegrind ./a.out
    ==2180== Cachegrind, a cache and branch-prediction profiler
    ==2180== Copyright (C) 2002-2013, and GNU GPL'd, by Nicholas Nethercote et al.
    ==2180== Using Valgrind-3.10.0.SVN and LibVEX; rerun with -h for copyright info
    ==2180== Command: ./a.out
    ==2180==
    --2180-- warning: L3 cache found, using its data for the LL simulation.
    took 64.657681 sec
    ==2180==
    ==2180== I   refs:      5,281,474,482
    ==2180== I1  misses:            1,710
    ==2180== LLi misses:            1,651
    ==2180== I1  miss rate:          0.00%
    ==2180== LLi miss rate:          0.00%
    ==2180==
    ==2180== D   refs:      3,170,495,683  (1,840,363,429 rd + 1,330,132,254 wr)
    ==2180== D1  misses:           12,055  (       10,374 rd +         1,681 wr)
    ==2180== LLd misses:            7,383  (        6,132 rd +         1,251 wr)
    ==2180== D1  miss rate:           0.0% (          0.0%   +           0.0%  )
    ==2180== LLd miss rate:           0.0% (          0.0%   +           0.0%  )
    ==2180==
    ==2180== LL refs:              13,765  (       12,084 rd +         1,681 wr)
    ==2180== LL misses:             9,034  (        7,783 rd +         1,251 wr)
    ==2180== LL miss rate:            0.0% (          0.0%   +           0.0%  )

For the optimized C++ version (-O2):

    $ valgrind --tool=cachegrind ./a.out
    ==2157== Cachegrind, a cache and branch-prediction profiler
    ==2157== Copyright (C) 2002-2013, and GNU GPL'd, by Nicholas Nethercote et al.
    ==2157== Using Valgrind-3.10.0.SVN and LibVEX; rerun with -h for copyright info
    ==2157== Command: ./a.out
    ==2157==
    --2157-- warning: L3 cache found, using its data for the LL simulation.
    took 9.419447 sec
    ==2157==
    ==2157== I   refs:      1,451,459,660
    ==2157== I1  misses:            1,599
    ==2157== LLi misses:            1,549
    ==2157== I1  miss rate:          0.00%
    ==2157== LLi miss rate:          0.00%
    ==2157==
    ==2157== D   refs:      430,486,197  (340,358,108 rd + 90,128,089 wr)
    ==2157== D1  misses:         12,008  (     10,337 rd +      1,671 wr)
    ==2157== LLd misses:          7,372  (      6,120 rd +      1,252 wr)
    ==2157== D1  miss rate:         0.0% (        0.0%   +        0.0%  )
    ==2157== LLd miss rate:         0.0% (        0.0%   +        0.0%  )
    ==2157==
    ==2157== LL refs:            13,607  (     11,936 rd +      1,671 wr)
    ==2157== LL misses:           8,921  (      7,669 rd +      1,252 wr)
    ==2157== LL miss rate:          0.0% (        0.0%   +        0.0%  )

4 answers

From your Perl code (before your attempts to fix it):

    @mymap = ();
    $mymap["US"] = "Washington";
    $mymap["UK"] = "London";

This is not a map but an array. The syntax for hash maps is:

    %mymap;
    $mymap{"US"} = ....

So what you are actually doing is creating an array, not a hash map, and accessing element 0 all the time. Please always use use strict; and use warnings; with Perl; even a simple syntax check with warnings would have shown you the problem:

    perl -cw x.pl
    Argument "US" isn't numeric in array element at x.pl line 9.
    Argument "UK" isn't numeric in array element at x.pl line 10.

In addition, the main part of the benchmark effectively does nothing useful (it assigns a variable and never uses the result), and some compilers can detect that and simply optimize it away.

If you check the code Perl generates from your program, you will see:

    $ perl -MO=Deparse x.pl
    @mymap = ();
    $mymap[0] = 'Washington';
    $mymap[0] = 'London';
    ...
    for ($i = 0; $i < 1000000; ++$i) {
        $x = $mymap[0];
    }

That is, Perl resolves the non-numeric index at compile time and replaces every lookup with access to array index 0.

Thus, whenever you write benchmarks:

  • Make sure your program actually does what you think it does.
  • Make sure the compiler does not optimize work away or perform it at compile time where other languages would do it at run time. Statements whose results are unused or predictable are prone to such optimization (a sketch of one countermeasure follows this list).
  • Make sure you are truly measuring what you intend to measure and not something else. Even small changes to a program can affect the run time, because a memory allocation is needed where it wasn't before, etc., and these changes may have nothing to do with what you meant to measure. In your benchmark you access the same hash element over and over again, without any access to other elements in between. This is an access pattern that plays very well with CPU caches but probably does not reflect real-world use.
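To make the second point concrete, here is a minimal C++ sketch (an illustration, not part of the original answer) of one way to keep a benchmark loop observable: feed every lookup into a checksum that is printed afterwards, so the compiler cannot prove the loop body dead.

    #include <cstdio>
    #include <string>
    #include <unordered_map>
    #include <chrono>

    int main() {
        std::unordered_map<std::string, std::string> mymap;
        mymap["China"] = "Beijing";

        auto start = std::chrono::steady_clock::now();
        size_t checksum = 0;
        for (int i = 0; i < 1000000; i++) {
            // Feed each lookup into a value the program later prints,
            // so the optimizer cannot remove the loop body as dead code.
            checksum += mymap["China"].size();
        }
        auto elapsed = std::chrono::steady_clock::now() - start;
        printf("checksum = %zu, took %ld ms\n", checksum,
               (long)std::chrono::duration_cast<std::chrono::milliseconds>(elapsed).count());
        return 0;
    }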

And using a simple timer is not a realistic benchmark. There are other processes on the system, there is the scheduler, there is caching... and with today's CPUs the result depends heavily on system load, because the CPU may run one benchmark in a lower-power mode than another, i.e. at a different clock frequency. For example, multiple runs of the same "benchmark" vary between 100 ms and 150 ms on my system.
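A common mitigation for this timing noise, again only a sketch under the same assumptions: repeat the measurement several times and keep the minimum, which is the value least affected by scheduling and CPU-frequency effects.

    #include <algorithm>
    #include <chrono>
    #include <cstdio>
    #include <string>
    #include <unordered_map>

    int main() {
        std::unordered_map<std::string, std::string> mymap;
        mymap["China"] = "Beijing";

        double best_ms = 1e300;
        size_t checksum = 0;
        for (int run = 0; run < 5; run++) {  // repeat the whole measurement
            auto start = std::chrono::steady_clock::now();
            for (int i = 0; i < 1000000; i++)
                checksum += mymap["China"].size();  // keep the result live
            std::chrono::duration<double, std::milli> d =
                std::chrono::steady_clock::now() - start;
            best_ms = std::min(best_ms, d.count());  // minimum resists noise
        }
        printf("best of 5: %.1f ms (checksum %zu)\n", best_ms, checksum);
        return 0;
    }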

Benchmarks are lies, and micro-benchmarks like yours doubly so.


I modified your example a bit to get some information about the structure of the hash table:

    #include <cstdio>
    #include <string>
    #include <unordered_map>
    #include <chrono>

    using namespace std;

    int main() {
        std::unordered_map<std::string, std::string> mymap;
        // populating container:
        mymap["US"] = "Washington";
        mymap["UK"] = "London";
        mymap["France"] = "Paris";
        mymap["Russia"] = "Moscow";
        mymap["China"] = "Beijing";
        mymap["Germany"] = "Berlin";
        mymap["Japan"] = "Tokyo";
        mymap["China"] = "Beijing";
        mymap["Italy"] = "Rome";
        mymap["Spain"] = "Madrid";

        std::hash<std::string> h;
        for (auto const& i : mymap) {
            // print the low 32 bits of each key's hash
            printf("hash(%s) = %u\n", i.first.c_str(), (unsigned)h(i.first));
        }
        for (size_t i = 0; i != mymap.bucket_count(); ++i) {
            auto const bucketsize = mymap.bucket_size(i);
            if (0 != bucketsize) {
                printf("size of bucket %zu = %zu\n", i, bucketsize);
            }
        }

        auto const start = std::chrono::steady_clock::now();
        string const x = "China";
        std::string res;  // note: never assigned in the loop
        for (int i = 0; i < 1000000; i++) {
            mymap.find(x);
        }
        auto const elapsed = std::chrono::steady_clock::now() - start;
        printf("%s\n", res.c_str());
        printf("took %ld ms\n",
               (long)std::chrono::duration_cast<std::chrono::milliseconds>(elapsed).count());
        return 0;
    }

Running this on my system, I get a runtime of about 68 ms, with the following output:

    hash(Japan) = 3611029618
    hash(Spain) = 749986602
    hash(China) = 3154384700
    hash(US) = 2546447179
    hash(Italy) = 2246786301
    hash(Germany) = 2319993784
    hash(UK) = 2699630607
    hash(France) = 3266727934
    hash(Russia) = 3992029278
    size of bucket 0 = 0
    size of bucket 1 = 0
    size of bucket 2 = 1
    size of bucket 3 = 1
    size of bucket 4 = 1
    size of bucket 5 = 0
    size of bucket 6 = 1
    size of bucket 7 = 0
    size of bucket 8 = 0
    size of bucket 9 = 2
    size of bucket 10 = 3

It turns out the hash table is not sized very well and contains some collisions. Printing the elements per bucket further shows that Spain and China share bucket 9. A bucket is probably a linked list with nodes scattered in memory, which explains the performance drop.

If you choose a different hash table size such that there are no collisions, you can get a speedup. I tested adding mymap.rehash(1001) and got a speedup of 20-30%, down to somewhere between 44 and 52 ms.
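For reference, a minimal sketch of this tweak (1001 is the bucket count tried above; whether it removes all collisions depends on the library's hash and bucket mapping):

    #include <cstdio>
    #include <string>
    #include <unordered_map>

    int main() {
        std::unordered_map<std::string, std::string> mymap;
        mymap["US"] = "Washington";  mymap["UK"] = "London";
        mymap["France"] = "Paris";   mymap["Russia"] = "Moscow";
        mymap["China"] = "Beijing";  mymap["Germany"] = "Berlin";
        mymap["Japan"] = "Tokyo";    mymap["Italy"] = "Rome";
        mymap["Spain"] = "Madrid";

        mymap.rehash(1001);  // request at least 1001 buckets

        // With far more buckets than keys, chains of length > 1
        // (i.e. collisions) should disappear; report any that remain.
        for (size_t b = 0; b != mymap.bucket_count(); ++b)
            if (mymap.bucket_size(b) > 1)
                printf("bucket %zu still holds %zu keys\n", b, mymap.bucket_size(b));
        return 0;
    }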

Another remaining cost is computing the hash value of "China" over and over: the hash function runs on every iteration. We can get rid of it by switching to plain C string constants:

    #include <cstdio>
    #include <string>
    #include <unordered_map>
    #include <chrono>

    static auto constexpr us = "US";
    static auto constexpr uk = "UK";
    static auto constexpr fr = "France";
    static auto constexpr ru = "Russia";
    static auto constexpr cn = "China";
    static auto constexpr ge = "Germany";
    static auto constexpr jp = "Japan";
    static auto constexpr it = "Italy";
    static auto constexpr sp = "Spain";

    using namespace std;

    int main() {
        // Keys are now const char*, so the map hashes pointer values.
        std::unordered_map<const char*, std::string> mymap;
        mymap[us] = "Washington";
        mymap[uk] = "London";
        mymap[fr] = "Paris";
        mymap[ru] = "Moscow";
        mymap[cn] = "Beijing";
        mymap[ge] = "Berlin";
        mymap[jp] = "Tokyo";
        mymap[it] = "Rome";
        mymap[sp] = "Madrid";

        char const* res = nullptr;
        auto const start = std::chrono::steady_clock::now();
        for (int i = 0; i < 1000000; i++) {
            res = mymap[cn].c_str();
        }
        auto const elapsed = std::chrono::steady_clock::now() - start;
        printf("%s\n", res);
        printf("took %ld ms\n",
               (long)std::chrono::duration_cast<std::chrono::milliseconds>(elapsed).count());
        return 0;
    }

On my machine, this cuts the runtime by 50%, to about 20 ms. The difference is that instead of computing a hash from the string contents, the map now just converts the address to a hash value, which is much faster because it merely returns the pointer value as a size_t. We also do not need to rehash, since there are no collisions in the bucket for cn.
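A caveat worth noting (my addition, not part of the answer): with const char* keys the map hashes and compares pointers, not characters, so a lookup only succeeds through a pointer with the same address. A minimal sketch of the pitfall:

    #include <cstdio>
    #include <cstring>
    #include <unordered_map>

    int main() {
        const char* a = "China";   // the pointer stored as the key
        char b[] = "China";        // same contents, different address

        std::unordered_map<const char*, const char*> m;
        m[a] = "Beijing";

        // std::hash<const char*> hashes the pointer value and the map
        // compares keys with ==, i.e. pointer equality, so only the
        // original pointer finds the entry.
        printf("lookup via a: %s\n", m.count(a) ? "found" : "missing");   // found
        printf("lookup via b: %s\n", m.count(b) ? "found" : "missing");   // missing
        printf("equal contents: %s\n", strcmp(a, b) == 0 ? "yes" : "no"); // yes
        return 0;
    }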


This just goes to show that for this particular case, Go's hash map implementation is optimized very well.

mymap["China"] calls mapaccess1_faststr , which is specially optimized for string keys. In particular, for small cards with one bucket, the hash code is not even calculated for short (less than 32 bytes) lines.


This is a guess:

unordered_map::operator[] requires a std::string argument, and you are providing a char*. Without optimizations, the C++ version likely calls the std::string(const char*) constructor a million times to turn "China" into a std::string. Go's language specification makes string literals the same type as string, so no constructor needs to be called.

With optimization, the string constructor gets hoisted out of the loop, and you will not see the same problem. Or quite possibly no code will be generated at all, except for the two system calls to get the time and the call to print the difference, because in the end it all has no observable effect.
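As a modern aside (my addition; this needs a C++20 compiler such as GCC 11 or later, which did not exist when the question was asked), heterogeneous lookup with a transparent hash avoids constructing a temporary std::string per lookup while keeping string keys:

    #include <cstdio>
    #include <functional>
    #include <string>
    #include <string_view>
    #include <unordered_map>

    // Transparent hash: lets find() accept a string_view or char* directly
    // (C++20 heterogeneous lookup), so no temporary std::string is built.
    struct StringHash {
        using is_transparent = void;
        size_t operator()(std::string_view sv) const {
            return std::hash<std::string_view>{}(sv);
        }
    };

    int main() {
        std::unordered_map<std::string, std::string,
                           StringHash, std::equal_to<>> mymap;
        mymap["China"] = "Beijing";

        // "China" is hashed as a string_view; no allocation per lookup.
        if (auto it = mymap.find(std::string_view{"China"}); it != mymap.end())
            printf("%s\n", it->second.c_str());
        return 0;
    }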

To confirm this, you would have to look at the assembly being generated, which will depend on the compiler. See this question for the flags needed with GCC.

