Setup

  • All measurements were done on an Intel Core i5-5300U (Broadwell).
    Caches: L1 - 32 KB, L2 - 256 KB, L3 - 3 MB.
    64-bit Linux, kernel 4.15.0-74. Transparent huge pages enabled.

HashMap implementations

Four different HashMap implementations were analysed: 'chaining' (reference-based and inline-based) and 'open addressing' (reference-based and inline-based).

RCH

Reference-based CHaining hashtable, a.k.a. java.util.HashMap, used as the baseline.

ICH

Inline types CHaining implementation, the corelibs.mapprotos.XHashMap class.

IOA

Inline types Open Addressing implementation, the corelibs.mapprotos.HashMap class.

ROA

Reference-based Open Addressing implementation, derived from the corelibs.mapprotos.HashMap class.

Several other inline types chaining implementations were also checked; their results did not differ significantly from corelibs.mapprotos.XHashMap.

Benchmarks

The benchmarks cover three basic map operations: getting a value from the map, putting a value into the map, and iterating over the map.

Iteration is the simple case: every inline-based implementation is faster than the reference-based ones (speedups range from 1.5x to 3x), so iteration is omitted from the rest of this write-up.

getHit

get operations on the map; all requested keys are present in the map.

getMix

get operations on the map; half of the requested keys are present in the map.

put

a series of put operations; the map grows from an empty map (created with the default constructor) to the requested size.

putSized

a series of put operations; the map is created with enough capacity to hold all keys, so the resize operation is excluded.
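
The actual harness code is not part of this write-up; the following is a minimal JMH-style sketch of how a getHit-type benchmark can be structured. Class, field and method names here are assumptions made for the example; swapping new HashMap<>() for the corelibs.mapprotos classes would give the ICH/IOA/ROA variants.

    // Illustrative JMH-style skeleton of a getHit-type benchmark (not the original harness).
    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.TimeUnit;
    import org.openjdk.jmh.annotations.*;

    @State(Scope.Thread)
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    public class MapGetBench {

        @Param({"767", "769"})        // one dense and one sparse size from the tables below
        public int size;

        Map<Integer, Integer> map;
        Integer[] keys;

        @Setup
        public void setup() {
            // swap in corelibs.mapprotos.XHashMap / HashMap to measure the other variants
            map = new HashMap<>();
            keys = new Integer[size];
            for (int i = 0; i < size; i++) {
                keys[i] = i;
                map.put(keys[i], i);
            }
        }

        @Benchmark
        public int getHit() {
            // all keys are present; the per-get time is the reported score divided by 'size'
            int sum = 0;
            for (Integer k : keys) {
                sum += map.get(k);
            }
            return sum;
        }
    }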

Data size (number of <key,value> pairs in the maps).

It was found that HashMap behavior depends heavily on the load factor. In a typical scenario the actual load factor is in the range 0.375 .. 0.75: 0.75 is the default maximum load factor, and 0.375 is half of it (the actual load factor right after the capacity has been doubled upon reaching the maximum).

All measurements were done for size pairs <dense, sparse>, where the dense size was chosen to give an actual load factor of 0.75 and the sparse size to give 0.375 (each sparse size is the dense size + 2). A worked example follows the list below.

  • <767,769> - all internal map structures fit into the L1 cache

  • <1535,1537> - fit into the L2 cache

  • <6143,6145> - fit into the L2 cache

  • <12287,12289> - fit into the L3 cache

  • <49151,49153> - fit into the L3 cache

  • <98303,98305> - mostly fit into the L3 cache

  • <1572863,1572865> - do not fit into any cache (32 to 70 MB of memory)
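
A worked example for the smallest pair (my arithmetic, assuming java.util.HashMap's power-of-two table sizing and the default maximum load factor of 0.75):

    dense:  767 pairs  ->  1024-bucket table (resize threshold 768)  ->  load factor 767 / 1024 ≈ 0.75
    sparse: 769 pairs  ->  table doubles to 2048 buckets             ->  load factor 769 / 2048 ≈ 0.375

At roughly 37 bytes per pair for the baseline map (see Table 9 below), 767 pairs occupy about 28 KB, which indeed fits into the 32 KB L1 cache.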

Raw results

Results better than java.util.HashMap are marked green.

Table 1. getHit (average time per get method; ns)
size          RCH dense   RCH sparse   ROA dense   ROA sparse   IOA dense   IOA sparse   ICH dense   ICH sparse
767(+2)       10.6        10.7         14.5        13.1         14.6        12.6         13.1        12.6
1535(+2)      11.7        11.5         24.1        14.8         22.9        14.2         18.5        14.4
6143(+2)      20.6        17.2         43.9        24.7         34.4        21.3         21.9        19.5
12287(+2)     22.9        19.6         48.7        27.9         40.6        22.7         25.8        20.9
49151(+2)     29.5        23.7         68.9        40.7         58.7        33.1         28.6        29.7
98303(+2)     51.8        53.6         126.9       68.3         80.5        51.2         43.4        45.9
1572863(+2)   77.9        69.4         270.4       113.6        181.4       78.8         75.5        67.3

Table 2. getHit performance relative to j.u.HashMap (RCH baseline = 1.00; <1 – faster, >1 – slower)
size          ROA dense   ROA sparse   IOA dense   IOA sparse   ICH dense   ICH sparse
767(+2)       1.36        1.23         1.38        1.18         1.24        1.18
1535(+2)      2.06        1.28         1.96        1.23         1.58        1.25
6143(+2)      2.13        1.43         1.67        1.24         1.06        1.13
12287(+2)     2.13        1.42         1.77        1.16         1.13        1.06
49151(+2)     2.34        1.72         1.99        1.40         0.97        1.25
98303(+2)     2.45        1.27         1.55        0.95         0.84        0.86
1572863(+2)   3.47        1.64         2.33        1.14         0.97        0.97

Table 3. getMix (average time per get method; ns)
size          RCH dense   RCH sparse   ROA dense   ROA sparse   IOA dense   IOA sparse   ICH dense   ICH sparse
767(+2)       10.2        9.5          23.3        11.1         22.9        12.3         10.9        10.6
1535(+2)      11.3        10.5         28.3        12.1         31.3        13.8         16.0        12.6
6143(+2)      32.1        26.4         54.3        32.4         47.5        28.5         39.7        27.5
12287(+2)     38.8        31.4         60.5        42.0         54.8        38.9         42.1        34.7
49151(+2)     47.3        40.7         91.5        52.9         76.3        51.4         42.9        39.9
98303(+2)     61.1        57.1         140.6       72.6         113.8       71.6         51.9        53.4
1572863(+2)   109.0       93.9         267.8       123.6        258.4       107.9        90.4        77.3

Table 4. getMix performance relative to j.u.HashMap (RCH baseline = 1.00; <1 – faster, >1 – slower)
size          ROA dense   ROA sparse   IOA dense   IOA sparse   ICH dense   ICH sparse
767(+2)       2.29        1.17         2.26        1.29         1.07        1.12
1535(+2)      2.50        1.15         2.76        1.31         1.41        1.20
6143(+2)      1.69        1.23         1.48        1.08         1.24        1.04
12287(+2)     1.56        1.34         1.41        1.24         1.09        1.10
49151(+2)     1.93        1.30         1.61        1.26         0.91        0.98
98303(+2)     2.30        1.27         1.86        1.25         0.85        0.94
1572863(+2)   2.46        1.32         2.37        1.15         0.83        0.82

Table 5. put (average time per put method; ns)
size          RCH dense   RCH sparse   ROA dense   ROA sparse   IOA dense   IOA sparse   ICH dense   ICH sparse
767(+2)       29.4        49.3         48.0        70.7         70.5        98.1         41.9        69.6
1535(+2)      38.4        60.8         71.6        96.8         86.4        122.4        49.6        89.5
6143(+2)      49.8        76.0         93.4        112.7        111.7       137.3        66.1        104.8
12287(+2)     55.0        85.7         99.0        124.2        115.3       244.4        69.2        216.6
49151(+2)     72.4        210.6        141.9       292.5        245.2       294.7        199.0       303.9
98303(+2)     184.5       244.5        282.7       419.8        254.1       329.5        252.2       363.0
1572863(+2)   376.3       474.5        512.5       689.3        1951.2      2111.8       823.6       1046.6

Table 6. put performance relative to j.u.HashMap (RCH baseline = 1.00; <1 – faster, >1 – slower)
size          ROA dense   ROA sparse   IOA dense   IOA sparse   ICH dense   ICH sparse
767(+2)       1.63        1.44         2.39        1.99         1.42        1.41
1535(+2)      1.87        1.59         2.25        2.01         1.29        1.47
6143(+2)      1.87        1.48         2.24        1.81         1.33        1.38
12287(+2)     1.80        1.45         2.10        2.85         1.26        2.53
49151(+2)     1.96        1.39         3.39        1.40         2.75        1.44
98303(+2)     1.53        1.72         1.38        1.35         1.37        1.48
1572863(+2)   1.36        1.45         5.19        4.45         2.19        2.21

Table 7. putSized (average time per put method; ns)
size          RCH dense   RCH sparse   ROA dense   ROA sparse   IOA dense   IOA sparse   ICH dense   ICH sparse
767(+2)       20.8        22.5         34.2        18.2         40.9        30.0         23.3        24.5
1535(+2)      23.8        23.3         38.8        20.2         46.1        31.2         24.9        25.5
6143(+2)      23.0        23.1         52.7        25.3         59.5        34.5         28.6        28.1
12287(+2)     27.8        25.5         56.3        30.2         70.4        94.3         31.8        151.4
49151(+2)     41.9        130.0        83.8        139.6        135.2       176.4        145.2       193.5
98303(+2)     141.5       157.8        208.0       157.5        242.0       291.7        192.7       249.4
1572863(+2)   295.7       338.3        404.4       401.2        1197.5      880.1        747.6       693.6

Table 8. putSized performance relative to j.u.HashMap (RCH baseline = 1.00; <1 – faster, >1 – slower)
size          ROA dense   ROA sparse   IOA dense   IOA sparse   ICH dense   ICH sparse
767(+2)       1.64        0.81         1.97        1.34         1.12        1.09
1535(+2)      1.63        0.87         1.93        1.34         1.05        1.10
6143(+2)      2.29        1.10         2.59        1.49         1.24        1.21
12287(+2)     2.03        1.18         2.53        3.69         1.14        5.94
49151(+2)     2.00        1.07         3.23        1.36         3.47        1.49
98303(+2)     1.47        1.00         1.71        1.85         1.36        1.58
1572863(+2)   1.37        1.19         4.05        2.60         2.53        2.05

Some unstructured observations

Open addressing

Open addressing causes significant performance degradation compared to the chaining implementations due to the nature of linear probing. A lookup in a dense open addressing map is almost twice as slow as a lookup in a sparse open addressing map. This behavior correlates with the chart https://en.wikipedia.org/wiki/File:Hash_table_average_insertion_time.png (taken from https://en.wikipedia.org/wiki/Hash_table).

Moving from reference chaining to reference open addressing causes roughly a 2x performance degradation in the dense case (about 1.5x in the sparse case). The subsequent move to inline types open addressing gives back 15-25% of speed, but cannot compensate for the degradation of the first step.
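
For illustration, a minimal sketch of a linear-probing lookup over a flat table (this is an assumption-laden sketch, not the actual corelibs.mapprotos.HashMap code). The probe run only stops at a match or at an empty slot, so the fuller the table, the longer the average probe run - which is why the dense case suffers the most:

    // Illustrative linear-probing lookup over parallel key/value arrays.
    final class ProbeSketch {
        static Object get(Object[] keys, Object[] values, Object key) {
            int mask = keys.length - 1;      // table length is assumed to be a power of two
            int i = key.hashCode() & mask;   // home slot
            while (keys[i] != null) {        // keep probing until an empty slot is reached
                if (key.equals(keys[i])) {
                    return values[i];        // hit
                }
                i = (i + 1) & mask;          // next slot, wrapping around the table
            }
            return null;                     // miss: reached an empty slot
        }
    }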

A benchmarking methodology flaw in the Valhalla world

The goal of inline types is to provide performance benefits through a lower cost of memory access, and one of the key factors in memory access cost is CPU cache misses. At the same time, our benchmarking tools (and benchmarking methodology in general) do not always give enough information; what is missing is the distinction between hot and cold access to data. Hot means the data is used frequently at a given moment in time (and therefore resides in the caches); cold means the access is either rare or spread across the application (and therefore cannot populate the caches well). Due to the repetitive nature of benchmarks we may be measuring only hot scenarios. As a result, even the reference-based implementations get pulled into the caches, and the benchmarks may not show the performance benefits of inline types. This should be kept in mind when analyzing Valhalla benchmarks: if we observe the same performance for reference and inline implementations in our benchmarks, we may still get a performance difference in the cold case.

For example:
  • Maps of size 767(+2) fit into the L1 cache for all implementations. For such benchmarks we can observe performance benefits only if we perform fewer memory accesses; if the number of memory accesses is the same, these benchmarks say nothing about cache misses.

  • Maps of size 1535(+2) (or 6143(+2)) fit into the L2 cache. The only meaningful metric here is L1 misses, because both the reference and the inline implementations reside in the L2 cache.

Some answers

Why put operations are always slower for inline maps.

In general, execution time factorizes into instructions * CPI (* cycle time), where CPI (cycles per instruction) is the component that reflects all CPU stalls, including cache misses.

For all analysed map implementations the CPI is approximately the same (jitter within 10%-20%), but all inline-based maps execute 2.5x-3x more instructions. The CPI side is considered below; first, the instruction counts.

Inline-based chaining map:
  • put benchmark - the key contributor is the resize method: we have to move 4x more data (a 16-byte entry vs a 4-byte reference). The idea of moving inline types with SSE instructions will not help here, because the table contents are restructured during resize.

  • putSized benchmark - the reason is the relatively expensive isEmpty method:

    static inline class XNode<K,V> implements Map.Entry<K,V> {
        // an XNode slot counts as empty only when all three fields hold their default values
        boolean isEmpty() {
            return hash == 0 && key == null && value == null;
        }
        // ... fields (hash, key, value) and the rest of the class omitted
    }

As an experiment, violating the HashMap contract by prohibiting null keys allows isEmpty to be simplified (to just key == null) and speeds up the put method by 25%, but still does not reach java.util.HashMap performance.
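
For illustration, the simplified check could look like this (a hypothetical sketch, valid only under the assumption that null keys are rejected on put):

    // Hypothetical simplified check; correct only if put() rejects null keys,
    // which breaks the java.util.HashMap contract.
    boolean isEmpty() {
        return key == null;
    }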

Why get operations are mostly slower.

For open addressing the instruction count is 1.5x-2x higher than java.util.HashMap due to the deficiencies of linear probing (with comparable CPI).

The inline-based chaining implementation has a similar instruction count to java.util.HashMap.

Breaking down the CPI is more complex. First of all, there are no significant changes in cache misses. The performance benefits of the inline-based chaining implementation on large data sizes come from a reduced number of TLB misses. That is why no performance benefits were observed for small maps (the TLB does not matter in that case). It also explains earlier experiments in which turning off large pages showed performance benefits even for mid-sized maps (note: all numbers above were taken with large pages turned on).

Why cache misses did not change.

Here we have to look at the amount of memory consumed by the maps. The measured data show that the number of bytes per <key, value> pair does not depend on the map size, but does depend on the load factor.

Table 9. bytes per 1 <key,value> pair (only Map’s structures)
map     dense   sparse
RCH     37      43
ROA     37      43
IOA     21      43
ICH     31      48

  • With a small load factor (the sparse case) the inline-based chaining map needs more memory than the reference-based implementation.

  • A natural property of any hashtable is gaps (up to 25% of slots in the dense case). In a reference-based map a gap is just a null reference, while an inline-based map wastes a whole entry-sized slot in the table, so the table is 4 times larger. The result is a shift: the reference-based map takes more cache misses when dereferencing to the Entry, and the inline-based map takes more cache misses when walking the table itself. The overall number of cache misses is unchanged.

  • The inline-based table is larger, but more structured (just a sequential range of addresses). That is why inline-based maps touch fewer memory pages than the scattered reference-based map and gain a performance benefit when the TLB capacity is not sufficient.
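
As a sanity check of the RCH row of Table 9 (my arithmetic, assuming compressed references, a 12-byte object header and 8-byte object alignment): a java.util.HashMap.Node holds an int hash and three 4-byte references (key, value, next), and each pair also owns a share of the bucket table.

    Node object:         12 (header) + 4 (hash) + 3 * 4 (refs) = 28 bytes, padded to 32
    bucket table slot:   4 / 0.75  ≈  5.3 bytes per pair (dense),   4 / 0.375 ≈ 10.7 bytes (sparse)
    total per pair:      32 + 5.3  ≈ 37 bytes (dense),              32 + 10.7 ≈ 43 bytes (sparse)

This matches the RCH column of Table 9.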

The memory consumption above was measured for the default execution with compressed references. With uncompressed references the picture gets worse for the inline-based maps.

Table 10. bytes per 1 <key,value> pair (-XX:-UseCompressedOops)
map     dense   sparse
RCH     59      69
ROA     51      61
IOA     43      85
ICH     57      93