2019-08-23

Benchmarks

It’s always tricky to write Java microbenchmarks for sorting algorithms (or for any data structure modification), because a fresh data structure has to be created (or a golden copy cloned) before each operation. Even if preparation time is successfully excluded from the measured time (which is not always possible), it is quite difficult to exclude it from profile information. Besides, the number of golden copies cannot really be large or randomized, so interesting scenarios may be lost. This study takes a different approach: a 'salt' value is mixed into the comparison operation and changed after each sort, which immediately (and 'magically') turns the just-sorted array back into an unsorted one. The disadvantage of this approach is that it may produce data with quite different degrees of unsortedness, and thus large differences in sorting time; that required a large number of repetitions to achieve the desired convergence of the average result.
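A minimal, self-contained sketch of the trick (illustrative only; the real harness uses the hand-written sorts described below):

    import java.util.Arrays;

    // Demo of the 'salted' comparison trick: the salt takes part in every
    // comparison key, so once a sort completes, changing the salt makes the
    // very same array 'unsorted' again without rebuilding it.
    final class SaltedSortDemo {
        static int salt = 0x9E3779B9;

        static int compare(Integer a, Integer b) {
            return Integer.compare(a ^ salt, b ^ salt);
        }

        public static void main(String[] args) {
            Integer[] array = {3, 1, 4, 1, 5, 9, 2, 6};
            for (int i = 0; i < 10; i++) {
                Arrays.sort(array, SaltedSortDemo::compare); // measured operation
                salt = salt * 0x01000193 + 1;                // rotate the salt
            }
            System.out.println(Arrays.toString(array));
        }
    }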

The key goal of the study was to compare the behavior of identical sorting algorithms on similar reference and inline classes. This implies that the number and sequence of compare operations is the same for both kinds of classes; the only difference is data movement. Inline classes require more data copying (the whole value has to be moved) than references (only the reference is moved). That is why the benchmarks use the most lightweight compare operation possible, to reduce the cost of comparison: only data movement is targeted. To keep the number of benchmarks reasonable, the implemented compare operation follows the worst-case memory access scenario: it reads all fields of the compared class. Partial scenarios (where only some fields are read) may be more common in practice, but they work at least as fast as the worst-case scenario.
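As an illustration (the class shape, field names, and key folding are assumptions of this sketch, not the benchmark source), such a worst-case comparator for a class with four int fields reads every field of both operands while keeping the arithmetic trivial:

    // Worst-case memory access: all four fields of 'a' and all four fields
    // of 'b' are loaded; the computation itself stays as cheap as possible.
    static int compare(Val16 a, Val16 b) {
        int ka = a.f0 + a.f1 + a.f2 + a.f3;
        int kb = b.f0 + b.f1 + b.f2 + b.f3;
        return Integer.compare(ka ^ salt, kb ^ salt);
    }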

Two stable sorting algorithms were analyzed: classical MergeSort and TimSort (the default sorting algorithm of the Java class library). A manual specialization of each algorithm was implemented for inline classes. Besides the straightforward sorting of inline classes, both algorithms were also implemented in an 'indexed' version: sort an array of 'int' indices, then copy the values into the result afterwards. Two algorithms for rearranging the unsorted array by the sorted indices were checked: the first uses an auxiliary array and copies the values from one array to the other; the second performs the permutation in place and needs no additional allocation. The second worked significantly slower than the first and was excluded.
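The retained copy-out step looks essentially like this sketch (Val16 and applyOrder are illustrative names; the index sort itself reuses the same MergeSort/TimSort logic, comparing the values the indices point to):

    // Auxiliary-array copy-out: idx[] has already been sorted so that
    // a[idx[0]] <= a[idx[1]] <= ... under the benchmark comparator.
    static Val16[] applyOrder(Val16[] a, int[] idx) {
        Val16[] out = new Val16[a.length];  // the auxiliary allocation
        for (int i = 0; i < a.length; i++) {
            out[i] = a[idx[i]];             // exactly one value copy per element
        }
        return out;
    }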

Classes of 8, 16, 24, 32, and 64 bytes in size (2, 4, 6, 8, and 16 int fields) were analyzed.
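For example, the 16-byte shape can be declared roughly as follows in the inline-class syntax of the 2019 Valhalla prototypes (the name and fields are illustrative):

    // Four 4-byte int fields = 16 bytes of payload; an array of Val16
    // stores the fields flattened instead of holding references.
    inline class Val16 {
        final int f0, f1, f2, f3;

        Val16(int f0, int f1, int f2, int f3) {
            this.f0 = f0; this.f1 = f1; this.f2 = f2; this.f3 = f3;
        }
    }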

The benchmarked arrays have sizes 400, 4000, 40000, 400000, and 4000000. These numbers were chosen so that the working set approximately fits into the L1 cache (400), the L2 cache (4000), the L3 cache (40000), is slightly larger than L3 (400000), or is significantly larger than L3 (4000000). This holds for classes of 8..32 bytes; the 64-byte class falls slightly outside the intended cache sizes.
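As an order-of-magnitude check (the exact cache sizes of the test machine are not stated here): for the 32-byte class a flattened array occupies about 400 × 32 B ≈ 12.5 KiB, 4000 × 32 B ≈ 125 KiB, 40000 × 32 B ≈ 1.2 MiB, 400000 × 32 B ≈ 12.2 MiB, and 4000000 × 32 B ≈ 122 MiB.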

The goal was to look at the current general picture, so no algorithmic tuning was performed. It is also known that some inefficiencies may appear in the code generated for inline types; that case was not analyzed either.

Results by size of class

Sort of 8-byte classes

                          400   4000   40000   400000   4000000
ref MergeSort            25.8    314    4031   143900   2102239
val MergeSort            18.6    232    2851    33866    391601
val MergeSort indexed    21.3    275    3256    40744    455590
ref TimSort              39.3    416    4049    69785    951722
val TimSort              23.0    244    2349    26131    282501
val TimSort indexed      26.3    278    2697    30450    310405

average time, microseconds

The results from the table above are converted into a relative chart, where the reference TimSort result (as the default JDK algorithm) is taken as 100%:

[chart size02: relative sorting time, 8-byte classes, ref TimSort = 100%]

Sort of 16-byte classes

                          400   4000   40000   400000   4000000
ref MergeSort            27.9    361    4650   155665   2236805
val MergeSort            27.6    364    4906    53787    646101
val MergeSort indexed    27.9    351    5148    55690    618230
ref TimSort              40.1    431    4162    66821    898132
val TimSort              26.3    282    2877    35440    386164
val TimSort indexed      30.1    318    3742    37109    373690

average time, microseconds

Relative performance:

[chart size04: relative sorting time, 16-byte classes, ref TimSort = 100%]

Sort of 24-byte classes

                          400   4000   40000   400000   4000000
ref MergeSort            31.4    420    5416   173281   2346903
val MergeSort            40.0    534    7715    85742   1040775
val MergeSort indexed    33.7    433    6280    75284    838683
ref TimSort              41.9    442    4304    65311   1047121
val TimSort              30.0    335    4370    48482    563881
val TimSort indexed      33.1    368    4499    48008    454127

average time, microseconds

Relative performance:

[chart size06: relative sorting time, 24-byte classes, ref TimSort = 100%]

Sort of 32-byte classes

                          400   4000   40000   400000   4000000
ref MergeSort            34.1    464    6133   180560   2393601
val MergeSort            51.2    681    9358   106365   1259343
val MergeSort indexed    38.9    518    7294    87751    994094
ref TimSort              43.1    459    4492    80248    962111
val TimSort              32.9    353    4614    52174    619560
val TimSort indexed      35.9    387    4603    50910    492273

average time, microseconds

Relative performance:

[chart size08: relative sorting time, 32-byte classes, ref TimSort = 100%]

Sort of 64-byte classes

                          400   4000   40000   400000   4000000
ref MergeSort            46.9    655    9292   215155   4380417
val MergeSort            97.9   1336   18915   216270   2593009
val MergeSort indexed    66.1    878   13376   146132   1764097
ref TimSort              50.8    532    5373    78048   1139305
val TimSort              47.2    546    7664    95353   1141067
val TimSort indexed      49.4    527    7006    76962    706724

average time, microseconds

Relative performance:

[chart size16: relative sorting time, 64-byte classes, ref TimSort = 100%]

Results by length of array

[charts: relative sorting time grouped by array length, one chart per array size (400, 4000, 40000, 400000, 4000000)]

Memory consumption

Sorting inline types needs significantly larger auxiliary memory allocations. As an example, for array size = 400:

Table 1. Extra allocated bytes (columns per class size)
                            8      16      24      32      64
ref MergeSort            1616    1616    1616    1616    1616
val MergeSort            3216    6416   12816   12816   25616
val MergeSort indexed    6448    9648   16048   16048   28848
ref TimSort               976     976     976     976     976
val TimSort              1776    3376    6576    6576   12976
val TimSort indexed      5832    9032   15432   15432   28232

extra allocation per sort, bytes
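Assuming 4-byte compressed references and a 16-byte array header, the MergeSort rows are easy to account for:

  • ref MergeSort allocates one auxiliary array of 400 references: 400 × 4 B + 16 B = 1616 B, independent of class size;

  • val MergeSort allocates one flattened auxiliary array: 400 × 8 B + 16 B = 3216 B for the 8-byte class, up to 400 × 64 B + 16 B = 25616 B for the 64-byte class (the 24-byte column matches the 32-byte one, which suggests elements are padded to a 32-byte boundary);

  • val MergeSort indexed additionally allocates two int index arrays: 3216 + 2 × (400 × 4 B + 16 B) = 6448 B for the 8-byte class.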

Table 2. Relative: how many times more extra memory is allocated than by reference TimSort
                            8      16      24      32      64
ref MergeSort             1.7     1.7     1.7     1.7     1.7
val MergeSort             3.3     6.6    13.1    13.1    26.2
val MergeSort indexed     6.6     9.9    16.4    16.4    29.6
val TimSort               1.8     3.5     6.7     6.7    13.3
val TimSort indexed       6.0     9.3    15.8    15.8    28.9

extra allocation per sort, relative times to ref TimSort

In general, the extra allocations of the reference sorts are within 3%–20% of the original data set size; the extra allocations for inline types are within 100%–200% (which is not surprising).

Conclusion

  • for small class sizes or small array sizes, the value type sort is faster than the reference sort and even faster than the indexed value type sort

  • for larger class sizes and larger array sizes, the indexed sort is preferable

  • the reference sort wins only in a small number of cases

So even for some data-modifying algorithms, inline types win over references due to data locality.

The case where reference types are fastest here is the 400000-element array, the "fits in L3" case. The reason is the larger extra memory allocation for inline types: their data set becomes slightly larger than the L3 cache.

Note
It’s crucial to get specialized code generation for each inline class used for sorting. Without specialization, a generic objectified version will kill all the benefits gained from data locality. Right now C2 is able to provide such specialization only through method inlining, and the TimSort code is large enough not to be inlined. Taking this into consideration, the indexed sort looks preferable, because the inline types are localized in a relatively small top-level method.