Mergesort Investigation 9 - memory access patterns

As a way to see whether merely accessing the linked list nodes’ memory causes the mergesort performance drops, I wrote two non-sorting functions and benchmarked them, trying to distinguish the effects of:

  • accessing the nodes’ data
  • modifying .Next pointers

I started with the code from my transliteration of the Wikipedia bottom up algorithm. Benchmarking shows it has the same abrupt performance drops as my original July 2021 algorithm, but the “merge” step is broken out into a separate function, making it easier to perform different actions on the sublists to be merged. It’s also a little faster, so benchmarking goes quicker.
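For orientation, here’s a minimal sketch of that framework, assuming a Node type like the one below. The combine parameter is my illustration of the broken-out merge; the actual transliteration surely differs in details.

type Node struct {
    Data int
    Next *Node
}

// merge is the usual sorted merge of two lists. The driver below
// takes it, or a non-sorting stand-in, as a parameter.
func merge(a, b *Node) *Node {
    var head Node // dummy node; head.Next becomes the result
    t := &head
    for a != nil && b != nil {
        if a.Data < b.Data {
            t.Next = a
            a = a.Next
        } else {
            t.Next = b
            b = b.Next
        }
        t = t.Next
    }
    // append the remainder of whichever list isn't exhausted
    t.Next = a
    if a == nil {
        t.Next = b
    }
    return head.Next
}

// bottomUp sketches the Wikipedia bottom-up mergesort with lists:
// array[i] holds a sorted sublist of 2^i nodes.
func bottomUp(head *Node, combine func(a, b *Node) *Node) *Node {
    var array [32]*Node
    result := head
    for result != nil {
        next := result.Next
        result.Next = nil // detach a single node
        i := 0
        for ; i < 32 && array[i] != nil; i++ {
            result = combine(array[i], result)
            array[i] = nil
        }
        if i == 32 {
            i--
        }
        array[i] = result
        result = next
    }
    // combine all leftover sublists into the final result
    result = nil
    for i := 0; i < 32; i++ {
        result = combine(array[i], result)
    }
    return result
}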

I wrote two variant programs. Both have the same framework as Wikipedia’s bottom up mergesort with lists, and both should access linked list nodes in roughly the same order. I also added a function to my benchmarking program that creates linked lists whose .Data fields hold increasing numerical values, so the linked list is already sorted (a sketch of that helper appears after the list below). That gave me four experiments:

  1. Walking sublists of the same lengths and in the same order as the Wikipedia bottom up algorithm, but incrementing the .Data field of each node rather than switching .Next pointers around.
  2. Weaving a single list from two sublists by alternating nodes from each sublist. This switches some .Next pointers around, though generally not as many as actually sorting.
  3. Sorting an already sorted list. No .Next pointers get changed.
  4. While I’m at it, sorting a reverse-sorted list. Almost every .Next pointer change during merging is very local.
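
Here’s a sketch of that list-building helper. The function name is my invention, and the real benchmarking code likely differs:

// makeSortedList builds a linked list whose .Data values increase,
// so the list is already sorted. (A hypothetical name, not the
// actual benchmark code.)
func makeSortedList(n int) *Node {
    var head Node // dummy node
    t := &head
    for i := 0; i < n; i++ {
        t.Next = &Node{Data: i}
        t = t.Next
    }
    return head.Next
}

The reverse-sorted list for experiment 4 would be the same loop with Data: n - i.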

Walk sublists, increment .Data values

node incrementing benchmarking

Above, walking sublists of the same lengths and in the same order as the Wikipedia bottom up algorithm, but incrementing the .Data field of each node rather than switching .Next pointers around. I made a mistake setting the list length increment: the data incrementing benchmarks used a 40,000 node increment, while my earlier benchmarking used a 400,000 node increment. I used lines instead of points to visualize the performance. I see no apparent performance drops. The Dell R530 produces benchmarks with a lot more variation, as it does in most of the other benchmarking.

Incrementing .Data values isn’t nearly as fast as walking linked lists, but it’s much faster than sorting.
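
I can sketch what the non-sorting stand-in for the merge function might look like. This is a guess, not my actual variant code: it bumps every node’s .Data, then concatenates the two sublists so sublist lengths double exactly as in the real sort, changing only the single .Next pointer at the first sublist’s tail.

// incrementData is a guessed stand-in for merge(): it increments
// every node's .Data and concatenates the sublists instead of
// interleaving them. Only one .Next pointer changes per call.
func incrementData(a, b *Node) *Node {
    if a == nil {
        a, b = b, a // handle combine(nil, x)
    }
    if a == nil {
        return nil
    }
    t := a
    t.Data++
    for t.Next != nil {
        t = t.Next
        t.Data++
    }
    t.Next = b // concatenate: the single pointer change
    for t = b; t != nil; t = t.Next {
        t.Data++
    }
    return a
}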

Walk sublists, weave .Next pointers

sublist weaving benchmarking

Above, weaving sublists of the same lengths and in the same order as the Wikipedia bottom up algorithm. The purple plus signs in the graph above are the same data as the filled-in golden squares on this graph.

The weaving code looks like this:


    for *x != nil && *y != nil {
        t.Next = *y // take a node from one sublist,
        t = t.Next
        *y = (*y).Next

        t.Next = *x // then a node from the other
        t = t.Next
        *x = (*x).Next
    }

In the above, x and y are of type **Node, assigned to left and right sublists at random.
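
Filled in around that loop, a complete weave function might look like the sketch below. The original picks x and y’s assignments at random; I fix them here for brevity.

// weave interleaves two sublists node by node, then appends
// whatever remains of the longer one. A sketch, not the original.
func weave(left, right *Node) *Node {
    var head Node // dummy node
    t := &head
    x, y := &left, &right // the original assigns these at random
    for *x != nil && *y != nil {
        t.Next = *y
        t = t.Next
        *y = (*y).Next

        t.Next = *x
        t = t.Next
        *x = (*x).Next
    }
    // append the remainder of whichever sublist isn't exhausted
    t.Next = *x
    if *x == nil {
        t.Next = *y
    }
    return head.Next
}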

I made the same mistake setting the increment here: the sublist weaving benchmarks used a 40,000 node increment on list lengths, while my earlier benchmarking used a 400,000 node increment.

I used lines instead of points to create the graph above. With 10 sublist weaving points for every iterative algorithm benchmark point, the “points” symbols gnuplot uses are too large for the smaller increment, and the sublist weaving benchmarks came out looking fuzzy.

Surprisingly, it looks like other abrupt performance drops show up, at list lengths roughly 2^21 apart.

Anomaly detection

I’d like some objective measure of whether or not the steps every 2^21 list lengths in the “weaving sublists” benchmarks actually exist. I tried some simple anomaly detection.

I chose to use the Qotom fanless server’s sublist weaving benchmarks, represented by the green line labeled “Qotom” in the graph above. Its apparent steps or drops in performance are larger than in the other machines’ benchmark data.

I calculated the time increment from one benchmark datum to the next: small “bumps” in the mean elapsed time for 10 sorts. I then calculated each bump’s z-value and plotted the z-values.
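
The calculation is simple enough to sketch. This assumes the mean elapsed times already sit in a slice, one entry per list length:

import "math"

// zValues turns successive differences ("bumps") of the mean
// benchmark times into z-scores against the mean and standard
// deviation of all the bumps. Needs at least 2 times.
func zValues(times []float64) []float64 {
    bumps := make([]float64, len(times)-1)
    var sum float64
    for i := range bumps {
        bumps[i] = times[i+1] - times[i]
        sum += bumps[i]
    }
    mean := sum / float64(len(bumps))
    var ss float64
    for _, b := range bumps {
        ss += (b - mean) * (b - mean)
    }
    sd := math.Sqrt(ss / float64(len(bumps)))
    z := make([]float64, len(bumps))
    for i, b := range bumps {
        z[i] = (b - mean) / sd
    }
    return z
}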

sublist weaving z-values graph

The dots in the above graph are the z-values of the bumps in mean sublist weaving benchmark time from one list length to the next on the Qotom fanless server, the purple points in the first graph on this page. I chose the Qotom server’s benchmark data because there’s less visible elapsed time variation from one list length to the next, so I believed any anomalies would be easier to detect.

The red dots are z-values greater than 1.5 or less than -1.5. The choice of ±1.5 is admittedly convenient. I wish that the negative bumps had been smaller. There’s some relationship to list lengths that are powers of 2: the largest z-values land almost exactly at 2^22, 2^23 and 2^24.

List length    z-value     log2(list length)   Δ list length
    2121000     1.540751       21.02
    4201000     3.362053       22.00               2080000
    6321000     1.750901       22.59               2120000
    8401000     7.106960       23.00               2080000
   10521000     1.897553       23.33               2120000
   12601000     4.135628       23.59               2080000
   14681000     1.826495       23.81               2080000
   14841000    -1.607975       23.82             (ignored)
   16801000    16.262354       24.00               2120000
   17761000    -2.158296       24.08             (ignored)

I believe this helps show that the anomalies, the abrupt performance drops, appear at list lengths about 2^21 = 2097152 apart.

Recursive mergesort, small increment

Do these extra performance drops show up if I run my recursive mergesort algorithm at a smaller list length increment between timings? I benchmarked recursive mergesort at small increments in linked list length to find out.
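
For reference, a recursive linked list mergesort along these lines looks something like the sketch below, reusing the merge function sketched earlier; my actual code differs in details.

// recursiveSort is a sketch of top-down linked list mergesort:
// split the list at its middle, sort each half, merge the halves.
func recursiveSort(head *Node) *Node {
    if head == nil || head.Next == nil {
        return head // 0 or 1 nodes: already sorted
    }
    // find the middle: slow advances one node for fast's two
    slow, fast := head, head.Next
    for fast != nil && fast.Next != nil {
        slow = slow.Next
        fast = fast.Next.Next
    }
    right := slow.Next
    slow.Next = nil // split into two halves
    return merge(recursiveSort(head), recursiveSort(right))
}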

recursive merge sort, small increment benchmark graph

The turquoise colored points, labeled “Dell R530, recursive” above, are the same as the “Dell R530 #1” gold-colored squares on this graph. For the smaller increment benchmarking, there are 10 list lengths benchmarked between turquoise points, so I had gnuplot draw a purple line instead of individual points, which would look fuzzy and uninformative.

The difference in performance arises from a code change I made that improves clarity (I think) at the expense of a little performance.

Old:

for left != nil && right != nil {
    var n *Node
    if left.Data < right.Data {
        n = left // take the smaller head node
        left = left.Next
    } else {
        n = right
        right = right.Next
    }
    t.Next = n // append it to the merged list
    t = t.Next
}

New:

for left != nil && right != nil {
    n := &right // n points at whichever sublist head is smaller
    if left.Data < right.Data {
        n = &left
    }
    t.Next = *n // append that node and advance its sublist
    *n = (*n).Next
    t = t.Next
}

For the “new” code, I made the merge happen by chasing pointers-to-pointers, an extra dereference on each node advance. Fewer lines of code generally mean fewer bugs, and reduce human readers’ cognitive burden in understanding the code.
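
Both versions end the same way after the loop exits: append whatever remains of the sublist that isn’t exhausted, something like this sketch:

// after the merge loop: at most one of left, right is non-nil
t.Next = left
if left == nil {
    t.Next = right
}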

In any case, the recursive linked list mergesort does not show the abrupt performance drops even when I graph smaller linked list length increments.

Iterative mergesorts, small increment

Do these extra performance drops show up if I run my iterative mergesort algorithm (July 2021 algorithm) and Wikipedia bottom-up with lists algorithm at a smaller list length increment between timings?

iterative mergesort benchmarking graph

Above, running 10 sorts of randomly-valued linked lists at 40,000 node increments, list lengths from 1,000 to 17,961,000 nodes. If you squint, you can see a few steps at multiples of 2^21 on the bottom-up algorithm lines in the above graph, at about 6×10^5 and 1.3×10^7, but really at no other places.

Once again, the Dell R530 produces benchmarks with more variation (purple line), at least with my July 2021 algorithm.

Sorting an already sorted list

Sorting an already sorted list should show whether changing .Next pointers causes most of the abrupt performance drops. All the node value comparisons take place, all the work of merging sublists occurs, but no (or at least very few) .Next pointers change.

pre-sorted lists benchmark graph

That’s benchmarking for mergesorting linked lists whose data values are either already sorted (low to high) or sorted high-to-low, the reverse of the order the algorithms produce. The purple line in this graph is the same data as the green line in the data touching benchmark above. I used 400,000 node increments in list length for this graph, so gnuplot points work for visualization.

In the above benchmarking, the Qotom fanless server has the most variation in benchmarks. I don’t see anything approaching the abrupt performance drops you can see when sorting randomly-valued lists on any of the machines.

Conclusions

At the very least, this experiment shows that merely accessing the nodes’ data isn’t what causes the performance drops; it’s changing the .Next pointers and following those changed pointers. I still don’t understand why changing .Next pointers around, and then following them in the next merge of two sublists, causes the performance drops at specific linked list lengths.

The benchmarks sorting pre-sorted and reverse-sorted lists demonstrate that changing .Next pointers is involved in causing the performance drops. But changing just any old .Next pointer isn’t enough: sorting reverse-sorted linked lists doesn’t show any abrupt performance drops, yet such a sort changes every single .Next pointer in the list.

There may be some very small performance drops at linked list lengths that are multiples of 2^21, but nothing like the bigger steps that caused me to investigate in the first place.