Mergesort Investigation 4 - reuse linked lists

After I wrote a recursive merge sort, I reviewed it to make sure it wasn’t doing extra work. In that review, I noticed that my benchmark program created a new linked list for each of the 10 sorts it ran at a given list length.

I thought I could make the overall benchmark run faster by re-using the existing list rather than creating a new, very large, linked list for every run. I hoped to spend less time between the ten measured sort executions, not to make the sorting itself faster. I surprised myself by getting the opposite effect.

I added code to my benchmarking that used the same linked list nodes for each of the 10 timed algorithm executions. The results were interesting: every length of linked list I benchmarked sorted slower when the program re-used a previously allocated linked list.

benchmarks of iterative algorithm re-using linked lists

Above, benchmark runs of the iterative algorithm with and without re-using linked lists. The MacBook M2 ARM laptop appears to be the only machine that shows the abrupt performance drops at 8-million- and 16-million-node lists while re-using those lists.

benchmarks of recursive algorithm re-using linked lists

Recursive algorithm with and without re-using linked lists. The recursive algorithm is also slower with a re-used linked list.

Re-using a linked list over and over causes major performance hits.


To an older programmer like me, this smells like the benchmarks used enough memory to cause paging. But modern computers have enough RAM that paging probably isn’t occurring: even the largest lists here, tens of millions of 16-byte nodes, amount to at most a few hundred megabytes.

I wrote a short program to see what addresses and Node struct sizes my code uses. My mergesort benchmark program is written in Go, and I hadn’t really looked at what the Go runtime heap allocator does.

To recap, a linked list in these posts is a chain of Go structs, each struct having a .Next field containing the address of another Node struct, the one that’s logically next in the list. “Logically” is doing a lot of work in that sentence.

    type Node struct {
        Data uint
        Next *Node
    }

My short program says that each Node is 16 bytes. That makes sense on an x86_64 machine, where the “natural” size of an integer is 64 bits, or 8 bytes. Address sizes for my machines (according to lscpu) are 48 bits virtual, with physical address sizes between 39 and 46 bits. That’s too big for a 32-bit register, so these benchmarks use 8-byte addresses: 8 bytes for the unsigned integer .Data field, 8 more bytes for the .Next pointer field, 16 bytes for the whole struct. Since both fields are naturally aligned 8-byte quantities, no padding is needed.
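A minimal sketch of that kind of size check, using Go’s unsafe package (a sketch, not necessarily the short program mentioned above):

    package main

    import (
        "fmt"
        "unsafe"
    )

    type Node struct {
        Data uint
        Next *Node
    }

    func main() {
        // Two naturally-aligned 8-byte fields: 16 bytes, no padding.
        fmt.Println("Node size: ", unsafe.Sizeof(Node{}))  // 16 on 64-bit machines
        fmt.Println("Node align:", unsafe.Alignof(Node{})) // 8
    }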

It turns out that the Go runtime allocates 504 nodes contiguously in a single, 8192-byte piece of memory. 504 × 16 bytes is 8064 bytes, so apparently the Go heap allocator keeps the remaining 128 bytes of each 8192-byte piece for its own purposes.
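One way to see those runs is to allocate a batch of nodes and watch where the 16-byte stride between consecutive allocations breaks. This sketch assumes the allocator hands out ascending addresses within a span and that nothing else allocates in between:

    package main

    import (
        "fmt"
        "unsafe"
    )

    type Node struct {
        Data uint
        Next *Node
    }

    func main() {
        // Keep all the nodes alive in a slice so they land on the heap.
        nodes := make([]*Node, 5000)
        for i := range nodes {
            nodes[i] = &Node{}
        }
        // Count consecutive allocations that sit exactly 16 bytes apart;
        // any other delta marks the boundary of an 8192-byte piece.
        run := 1
        for i := 1; i < len(nodes); i++ {
            delta := uintptr(unsafe.Pointer(nodes[i])) - uintptr(unsafe.Pointer(nodes[i-1]))
            if delta == 16 {
                run++
            } else {
                fmt.Println("contiguous run of", run, "nodes")
                run = 1
            }
        }
        fmt.Println("final contiguous run of", run, "nodes")
    }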

The code for creating an unsorted list of length n looks like this:

    // Build the list back to front: each new Node is prepended,
    // so its .Next points at the previously allocated Node.
    for i := 0; i < n; i++ {
        head = &Node{
            Data: uint(rand.Int()), // rand.Int returns an int; Data is a uint
            Next: head,
        }
    }

The above procedure makes a list where most nodes’ addresses and the values of their .Next fields are numerically close, usually differing only by the size of a Node struct (16 bytes). Further, the .Next pointer usually holds a lower address than the node’s own: the 504-node sub-lists within an 8192-byte block run backwards in memory if you follow .Next pointers. The jumps between blocks of 504 nodes also mostly go from higher to lower addresses, but occasionally there’s a numerically much larger jump from the end of one 8192-byte block to the next, which can be either higher or lower in address.


Every data point on a graph in these mergesort investigation posts is the mean execution time of 10 sorts of a list with randomly-chosen integer data values.
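In sketch form, one data point’s measurement has this shape; buildList and mergesort here are hypothetical stand-ins for the benchmark’s actual functions:

    // One data point: time 10 sorts of an n-node list, report the mean.
    // buildList and mergesort are hypothetical stand-ins, and n is the
    // list length for this data point.
    var total time.Duration
    const runs = 10
    for i := 0; i < runs; i++ {
        head := buildList(n) // a fresh list every run, before re-use was added
        start := time.Now()
        head = mergesort(head)
        total += time.Since(start)
    }
    fmt.Println(n, total/runs)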

At the start of a benchmark run, a 16-node linked list with randomly-chosen data values between 0 and 16 looks like this, with lower memory addresses on the left and higher addresses on the right:

example unsorted 16-node linked list

The Node at the head of the list has a doubled boundary. .Next pointers are represented by the bold lines joining Node ovals.

Remember that sorting a linked list doesn’t change the addresses of the Node structs; it merely changes the values of .Next fields to hold the address of the next higher-valued Node. After sorting, a node’s address and its .Next field are typically not numerically close.
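To make that concrete, here’s a sketch of the merge step of a linked-list mergesort, a generic version rather than the exact code from the earlier posts. It splices two sorted lists together purely by rewriting .Next fields:

    // merge splices two sorted lists into one by rewriting .Next
    // fields; no Node is copied or moved in memory.
    func merge(a, b *Node) *Node {
        var head Node // temporary anchor for the result
        tail := &head
        for a != nil && b != nil {
            if a.Data <= b.Data {
                tail.Next, a = a, a.Next
            } else {
                tail.Next, b = b, b.Next
            }
            tail = tail.Next
        }
        // Append whichever list still has nodes left.
        if a != nil {
            tail.Next = a
        } else {
            tail.Next = b
        }
        return head.Next
    }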

Here’s a representation of the same list, after getting sorted.

example sorted 16-node linked list

A different node with data value of 2 is the head node of the sorted list; it’s the double-circled one. Changing head nodes means my iterative mergesort is not stable. If you follow the bold arrows, you visit the values in numerically sorted order.

2 → 2 → 2 → 4 → 5 → 5 → 5 → 6 → 8 → 8 → 9 → 9 → 10 → 14 → 15 → 15

The re-use code walks the sorted list via .Next pointers and gives each node a new, randomly-chosen integer value, converting it back to an unsorted list.
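That walk-and-overwrite amounts to only a few lines. A sketch of such re-use code, assuming math/rand is imported (hypothetical, but close in spirit to what’s described above):

    // refill overwrites each node's Data with a fresh random value,
    // re-using the already-allocated, just-sorted nodes. Addresses and
    // .Next pointers keep their scrambled, post-sort order.
    func refill(head *Node) {
        for n := head; n != nil; n = n.Next {
            n.Data = uint(rand.Int())
        }
    }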

In the linked-list-reusing benchmarks, only the first of the 10 executions of a sorting algorithm starts with a linked list where the nodes are “near” each other when referenced by following .Next pointers. The remaining 9 executions start with a linked list whose .Next pointers contain addresses that aren’t local to the nodes’ own addresses.

It turns out that both algorithms I wrote, iterative and recursive, are startlingly sensitive to this lack of locality of reference. I don’t think poor locality of reference is the cause of the abrupt slowdowns, though, because those abrupt performance drops show up in the MacBook M2 ARM benchmarking.