Dmytro Taranovsky
September 9, 2018; modified September 12, 2018

Sorting with an average of $\mathrm{lg}(n!)+o(n)$ comparisons

Abstract: We prove that comparison-based sorting is possible with an average of $\mathrm{lg}(n!)+o(n)$ comparisons. Under a conjectured randomization assumption, $o(n)$ can be improved to $O(n^{1-ε})$. We also attempt constructions for $n^{0.5+o(1)}$ and $O(n^{0.5-ε})$.

1   Introduction

In a number of ways, sorting is the classic problem in computer science, and yet it continues to surprise us with new results. A simple entropy argument shows that comparison-based sorting requires an average of $≥\mathrm{lg}(n!)$ comparisons, but how closely can we approach this bound?

A symbolic significance of $\mathrm{lg}(n!)+o(n)$ is that it is $o(n)$ comparisons from optimal, wasting an average of only $o(1)$ comparisons per element. If achievable, the significance of $O(n^{0.5-ε})$ is that it breaks the square root barrier commonly seen in randomized constructions. While the immediate benefit of the results is mostly theoretical, the constructions here enhance our understanding of probability and computation and may be useful in other contexts.

Prior work: Merge sort uses $\mathrm{lg}(n!) + Θ(n)$ comparisons even in the worst case, and is efficiently implementable in practice. Merge-insertion sort (also known as Ford–Johnson sort) also uses $\mathrm{lg}(n!)+ Θ(n)$ comparisons but with a much smaller constant in the $Θ(n)$ term. [1] gives a variation and a modification that is efficient in terms of computation. Slightly improved average-case bounds are obtained in [2], and their (1,2)-Insertion algorithm resembles a portion of our $\mathrm{lg}(n!)+o(n)$ algorithm. The results in this paper were originally presented in [3]; this paper is the expanded version.

Open problems:
• Proving the $O(n^{1-ε})$ bound.
• How much can the bound be improved? Can the constructions for $n^{0.5+o(1)}$ and $O(n^{0.5-ε})$ work?
• Is sorting with $\mathrm{lg}(n!)+o(n)$ comparisons possible in the worst case?

2   Basic Approach

We can assume that all elements are distinct, by annotating them if necessary. The average case uses distinct elements in random order. By randomly shuffling the items, we can turn the average performance into the expected performance for every input, and the probability of 'bad' behavior can typically be made exponentially small. As an aside, we note that in practice, one may first want to check whether the list is already largely sorted, and also to take advantage of equal items and other information.

We can compute the average number of comparisons by adding the entropy loss of each comparison relative to using a fair coin. At $\frac{1}{2}+ε$ yes probability, the loss is only $Θ(ε^2)$, which with the proper constructs gives us great freedom to use slightly imbalanced comparisons in order to reduce the need for bad, strongly imbalanced comparisons at the end. While not used in this paper, we note that merging two random sorted lists of the same size $n$ (by going sequentially from small to large) wastes only $Θ(\log n)$ entropy, because for an element in position $i$ the comparison probability is typically $\frac{1}{2}±Θ(1/\sqrt{\min(1+i,n-i)})$; the exact waste is $2n-\mathrm{lg} {2n \choose n}+O(1) = \frac{1}{2} \mathrm{lg}(n) + O(1)$.
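
As a quick numerical check of the last claim, the following computes the waste $2n-\mathrm{lg}{2n \choose n}$ and its deviation from $\frac{1}{2}\mathrm{lg}(n)$ (an illustration only, not part of any algorithm):

    from math import comb, log2

    # Entropy waste of merging two random sorted lists of length n:
    # 2n - lg(C(2n, n)) should be 0.5*lg(n) + O(1); the constant approaches 0.5*lg(pi).
    for n in (10, 100, 1000, 10000):
        waste = 2 * n - log2(comb(2 * n, n))
        print(n, round(waste, 3), round(waste - 0.5 * log2(n), 3))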

Our starting point is insertion sort with a binary search to decide where to insert the next element into the sorted subset $S$. When $(1-ε)2^m ≤ |S| ≤ 2^m-1$, an insertion uses at most $m$ comparisons, which (in terms of entropy) is optimal up to an $O(ε)$ additive factor; and for average-case complexity, $2^m ≤ |S| ≤ (1+ε) 2^m$ also works. Now, when $|S|$ is not close to a power of 2, insertion of an element $A$ is suboptimal, even in the average case and regardless of how we balance each query: any balancing that does not 'cross' a power of 2 leads to the same average number of comparisons (as can be checked using the comparison tree).
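
For concreteness, a minimal comparison-counting sketch of this baseline (plain binary-search insertion sort, not the algorithm of this paper):

    import random
    from math import factorial, log2

    def binary_insert(S, x, count):
        # Binary search for the insertion position of x in the sorted list S,
        # counting comparisons; uses at most ceil(lg(|S|+1)) comparisons.
        lo, hi = 0, len(S)
        while lo < hi:
            mid = (lo + hi) // 2
            count[0] += 1
            if x < S[mid]:
                hi = mid
            else:
                lo = mid + 1
        S.insert(lo, x)

    def binary_insertion_sort(items):
        S, count = [], [0]
        for x in items:
            binary_insert(S, x, count)
        return S, count[0]

    n = 1024
    _, used = binary_insertion_sort(random.sample(range(10 * n), n))
    print(used, round(log2(factorial(n)), 1))

The printed gap illustrates the $Θ(n)$ overhead that we aim to reduce to $o(n)$.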

However, if by using other elements, and wasting only $o(1)$ comparisons per element, we can steer $A$ to an approximately uniform distribution over an interval of $S$ whose length is close to a power of 2, we get the desired $\mathrm{lg}(n!)+o(n)$.

We achieve this by adding elements in batches, and sometimes efficiently comparing elements of the batch with each other, such that the interval of $S$ corresponding to an element $A$ decreases in a quasi-random manner (and with the probability distribution of $A$ inside the interval nearly uniform), and when the interval length is close enough to a power of 2, doing the binary search to insert $A$.

Common constructs

We will keep a subset $S$ of sorted elements, and for each unsorted element $A$, we will keep track of the minimal interval $I_A$ of $S$ where $A$ is known to be located. $|I_A|$ is the length of $I_A$; $I_A=I_B$ is by the identity of the intervals.

Let $\mathrm{Compare}(A,B)$ be: Compare $A$ with $B$, and then (in random order) compare $A$ and $B$ against corresponding elements of $S$ until their intervals are disjoint (or both have length 1). The element of $S$ is chosen (in a consistent manner) to make the probabilities for the comparison as close to 1/2 as possible, assuming that when $\mathrm{Compare}$ is called, $(A,B)$ is uniformly distributed on $I_A⨯I_B$.

Because of the disjointness in the end, $\mathrm{Compare}$ preserves the uniformity assumption. Also, on average, $\mathrm{Compare}$ does $O(1)$ comparisons, and aside from the initial comparison, wastes only $O(l^{-2})$ entropy where $l$ is the shortest interval length.
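
A simplified sketch of $\mathrm{Compare}$ in this interval model (intervals are closed ranges of insertion positions into $S$; the probability-balancing choice of the element of $S$ is replaced by plain bisection of the wider interval, so this is illustrative only):

    def narrow(x, I, S, count):
        # One comparison of x against S that bisects x's interval I = (lo, hi),
        # a closed range of possible insertion positions (0 <= lo <= hi <= len(S)).
        lo, hi = I
        mid = (lo + hi) // 2          # a valid index into S whenever lo < hi
        count[0] += 1
        return (lo, mid) if x < S[mid] else (mid + 1, hi)

    def compare_pair(A, B, IA, IB, S, count):
        # Sketch of Compare(A, B): one A-B comparison, then comparisons against S
        # until I_A and I_B are disjoint or both have length 1. The paper chooses
        # the elements of S to make each comparison near-balanced (using the A-B
        # outcome and the uniformity assumption); this sketch merely bisects the
        # wider interval and only records the A-B outcome.
        count[0] += 1
        _ = A < B
        while not (IA[1] < IB[0] or IB[1] < IA[0]):   # while the intervals overlap
            if IA[0] == IA[1] and IB[0] == IB[1]:     # both have length 1
                break
            if IA[1] - IA[0] >= IB[1] - IB[0]:
                IA = narrow(A, IA, S, count)
            else:
                IB = narrow(B, IB, S, count)
        return IA, IB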

The following three sections can be mostly read independently of each other. Using different algorithms, we
- prove the $\mathrm{lg}(n!)+o(n)$ bound,
- get $\mathrm{lg}(n!)+O(n^{1-ε})$ under a very likely assumption, and
- attempt to get $\mathrm{lg}(n!)+n^{0.5+o(1)}$ and $\mathrm{lg}(n!)+O(n^{0.5-ε})$.

3   A $\mathrm{lg}(n!)+o(n)$ algorithm

Our first algorithm is the following.

Given: A sorted list $S$, and a batch of $m$ unsorted elements; $m∈ω(1)∩o(|S|)$; the unsorted elements are random relative to $S$.

Repeat (1)-(3) while possible:
1. Pick two elements $A$ and $B$ from the batch with $I_A=I_B$ (any choice will work).
2. Run $\mathrm{Compare}(A,B)$.
3. If $|I_A|$ is close enough to a power of 2 (note 1), remove $A$ from the batch (without forgetting $I_A$); and do similarly with $B$.
Finally: Insert all elements into $S$ and complete the sort.

Note 1: For "close enough", any $o(1)$ relative error (as a function of $m$) works as long as $m-o(m)$ elements get removed in step (3) (possible by note 2). Under a conjectured randomization assumption ($\frac{|I_A|}{2^{\lfloor \mathrm{lg} |I_A| \rfloor}}$ behaves randomly enough), using $c \log \log m / \log m$ relative error captures $m(1-\log^{-Θ(c)}m)$ elements, allowing a sorting algorithm with $\mathrm{lg}(n!)+O(n \log \log n / \log n)$ average comparisons.

Note 2: Because the same sequence of comparisons leads to the same bounding interval, almost all elements will go through step (1) $Ω(\log m)$ times (unless removed in step (3)). In the beginning, if $A < B$ and we pick $A$, we compare $A$ against the element $S[≈(1-1/\sqrt{2})|S|]$, and each application of step (2) to $A$ has $Θ(1)$ probability of reducing $|I_A|$ by a factor of $≈1/(1-1/\sqrt{2})$. Now, for every ratio $a>1$ that is not a rational power of 2, we have $∀ε>0 \, ∀d>0 \, ∃m_1,m_2∈\mathbb{N} \,\, 1-ε < \frac{a^{m_1}}{d2^{m_2}} < 1+ε$ (and by compactness, an upper bound $M$ on $m_1$ can be chosen independent of $d$ if $1 ≤ d < 2$). Thus, for every $ε$, there is an $M$ such that a chain of at most $M$ $\mathrm{Compare}$ calls has $2^{-O(M)}$ probability of getting close enough to a power of 2, and so we get the $o(n)$ bound.
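
The loop of steps (1)-(3) can be sketched as follows (reusing narrow and compare_pair from the sketch in the Common constructs section; the "close enough" test uses a fixed relative error, and the final step, binary insertion restricted to each remembered interval, is omitted):

    from math import log2

    def near_power_of_2(length, rel_err):
        # Is `length` within relative error rel_err of a power of 2?
        d = length / 2 ** int(log2(length))    # in [1, 2)
        return d < 1 + rel_err or d > 2 * (1 - rel_err)

    def batch_step(S, batch, rel_err, count):
        # One batch of the Section 3 algorithm: repeatedly Compare two elements with
        # identical intervals and retire an element (keeping its interval) once its
        # interval length is close to a power of 2. Returns the intervals.
        intervals = {i: (0, len(S)) for i in range(len(batch))}
        active = set(range(len(batch)))
        while True:
            groups = {}
            for i in active:
                groups.setdefault(intervals[i], []).append(i)
            pair = next((g for g in groups.values() if len(g) >= 2), None)
            if pair is None:
                break                          # no two active elements share an interval
            a, b = pair[0], pair[1]
            intervals[a], intervals[b] = compare_pair(batch[a], batch[b],
                                                      intervals[a], intervals[b], S, count)
            for i in (a, b):
                lo, hi = intervals[i]
                if near_power_of_2(hi - lo + 1, rel_err):
                    active.discard(i)
        return intervals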

4   A likely $\mathrm{lg}(n!)+O(n^{1-ε})$ algorithm

Modulo a randomization assumption, we can achieve $\mathrm{lg}(n!)+O(n^{1-ε})$ average comparisons as follows. The algorithm below was chosen for simplicity rather than optimality; also, the batch size can be customized.

If our randomization assumption works out (i.e. the distribution of interval lengths and positions is random enough), then throughout much of the process, a typical $A$ can be efficiently compared with a choice of $n^{Θ(1)}$ elements (with $n^{Θ(1)}$ different interval lengths). Thus, we can typically choose a comparison for (1) above, and if we are unlucky with the comparison result, we still get $Θ(\log n)$ chances, thus achieving (if $ε$ is small enough, say 0.01) a $\mathrm{lg}(n!)+O(n^{1-ε})$-comparison algorithm. With some changes and approximations, the total compute can be made quasilinear: Given an element $A$, compute promising interval lengths, and then look up $B$s with the right approximate center and interval lengths.

Initially, $\mathrm{Compare}$ will be called primarily on elements with identical intervals. Because the effect of $\mathrm{Compare}$ on length is multiplicative (and multiplication is commutative), the length randomization will initially be only $2^{o(\mathrm{depth})}$, but once we get enough randomization of positions and start comparing elements with different interval lengths, the lengths should be quickly randomized as well.

Optimizations

There are a number of ways to optimize the comparisons, but the obstacle is that each comparison may end up being unlucky while we have only a limited number of comparisons, and about $n^{1-ε}$ elements will be unlucky in every comparison. To improve $ε$, we can
- accelerate the initial randomization (included in $\mathrm{lg}(n!)+n^{0.5+o(1)}$ below; we expect $o(\log n)$ randomization depth)
- be more tolerant of deviations as the interval length shrinks (since we get fewer choices for comparison); this should stretch out the choices (for unlucky elements) until the length is roughly $n^ε$.
- optimize comparisons (both in choosing $A$ and $B$ and in $\mathrm{Compare}(A,B)$).

In $\mathrm{Compare}(A,B)$, by symmetry (i.e. $P(A < B)≈0.5$) the lengths will not significantly depend on whether $A < B$. However, both $A$ and $B$ must be compared with $S$ to get a length reduction that is not a power of 2. This suggests the following comparison protocol.

Comparison: If trying to optimize $I_A$, choose $B$ such that if, after the $A$-$B$ and then the $A$-$S$ comparison, the comparison results are in the same direction, then $|I_A|$ is close to a power of 2. The approximate required $|I_B|$ for this exists and is unique. (Also, $|I_A|/|I_B| < 2$, as otherwise after $A < B$ and then $A < S[i]$, $|I_A|$ would be reduced by a factor of 4.) If lucky (after doing the $A$-$B$ and $A$-$S$ comparisons), compare $B$ with $S$ until $I_A$ and $I_B$ are disjoint, or it is sufficiently likely that $B$ is in $I_A$. Make $I_A$ and $I_B$ disjoint in a reasonable manner.
Success Probability: At either small enough or large enough $|I_B|/|I_A|$, the probability is ≈1/2 (with an average of 4 comparisons conditioned on success), but the probability is a bit lower at other ratios. Also, by choosing whether to optimize $A$ or use $A$ for optimizing $B$ (and also how to make $I_A$ and $I_B$ disjoint), we should be able to avoid low success probabilities and random use of $\mathrm{Compare}$.

If we assume (probably slightly pessimistically) that the modified $\mathrm{Compare}$ uses (with survivor correction) around 5 comparisons and has 1/3 probability of success, we get $≈(1-ε)/5⋅\mathrm{lg}(n)$ chances, and $ε ≈ (1-ε)/5/\log_{3/2} 2 ≈ 0.10$.
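
To spell out the arithmetic behind this estimate: an element stays unlucky only if all of its $≈(1-ε)\lg(n)/5$ chances fail, which at success probability $1/3$ per chance happens with probability $≈(2/3)^{(1-ε)\lg(n)/5} = n^{-(1-ε)/(5\log_{3/2} 2)}$; requiring the expected number of unlucky elements, $n⋅n^{-(1-ε)/(5\log_{3/2} 2)}$, to be $n^{1-ε}$ gives $ε = (1-ε)/(5\log_{3/2} 2) ≈ 0.10$.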

Unfortunately, even at a million elements, this should reduce the excess comparisons only severalfold. A perhaps much better approach is to wait until an interval is close to a power of 2, controlling not the individual interval lengths but the distributions of lengths.

5   A possible $\mathrm{lg}(n!)+n^{0.5+o(1)}$ algorithm

The sorting algorithm will be: randomly shuffle the list and sort the first half $S$. To insert the second half, make the distribution of the $I_A$ right, and perform (*) below. To make the distribution of the $I_A$ right, we will randomize the intervals at $o(\log n)$ depth, and then shape the distribution by withholding some elements and randomizing the rest. We do not know how well the distribution shaping works.

The initial randomization should only need $n^{o(1)}$ entropy waste. If the distribution shaping works well enough, the entropy waste stems primarily from random interval length mismatches during (*) and is thus $n^{0.5+o(1)}$.

Sorting

Suppose that $|S|=n$ and we are given an unsorted batch of $n$ elements with the intervals $I_A$ also given such that
- $|I_A|$ is typically $n^{1-o(1)}$, or rather $P(|I_A| < n^{1-d}) ∈ n^{o(1)-d}$.
- $\frac{|I_A|}{2^{\lfloor \mathrm{lg} |I_A| \rfloor}}$ is distributed uniformly, or rather $∀i ∀d \, |\{A < S[i]: \frac{|I_A|}{2^{\lfloor \mathrm{lg} |I_A| \rfloor}} < d\}| = d \, i ± n^{0.5+o(1)}$.
- the total entropy (given the intervals) is $\sum{\mathrm{lg}|I_A|}-n^{0.5+o(1)}$ bits.

Then, we can sort the items wasting an average of $n^{0.5+o(1)}$ comparisons as follows:
(*) Insert all elements in the order of their initial $\frac{|I_A|}{2^{\lfloor \mathrm{lg} |I_A| \rfloor}}$. This way all elements are inserted when their interval length is close to a power of 2.

Notes:
• Using the updated $I_A$ should be marginally better, but the interaction between different intervals would complicate the analysis.
• The $\frac{|I_A|}{2^{\lfloor \mathrm{lg} |I_A| \rfloor}}$ uniformity assumption can be weakened provided that the average magnitude of the errors (relative interval length error at actual insertion time) in (*) is $n^{-0.5+o(1)}$.
• $P(|I_A| < n^{1-d}) ∈ n^{o(1)-d}$ is consistent with $|I_A|/\text{(distance to edge)} = o(1)$ (unless $|I_A|=1$), which can be useful for handling elements close to the edges. Also, $P(|I_A| < n^{1-d}) ∈ n^{o(1)-d/2}$ suffices if the uniformity assumption is strengthened appropriately.
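
A small sketch of the scheduling in (*), under the heuristic that a typical interval grows by roughly a factor $1+t$ once a fraction $t$ of the batch has been inserted, so an element with initial ratio $d$ should be inserted around $t ≈ 2/d-1$, i.e. in decreasing order of $d$ (the function and the interval_len helper are illustrative assumptions, not fixed by the construction):

    from math import log2

    def insertion_schedule(batch, interval_len):
        # Order the batch for (*) by the initial ratio d = |I_A| / 2^floor(lg |I_A|),
        # largest d first; interval_len(A) is an assumed helper returning |I_A|.
        def ratio(A):
            L = interval_len(A)
            return L / 2 ** int(log2(L))      # in [1, 2)
        return sorted(batch, key=ratio, reverse=True)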

Randomization

To make a 'random' distribution, we can randomly use $\mathrm{Compare}(A,B)$ with $P(A < B)≈0.5$, and with $n^{Θ(1)}$ choices, we should get randomization after $Θ(n \log n / \log \log n)$ comparisons, except that initially all $I_A$ are identical. Thus, with just $\mathrm{Compare}$, we should not expect randomization at a sublogarithmic depth (i.e. with $I_A$ long enough). However, we conjecture that using a generalization of $\mathrm{Compare}$ to $k=ω(1)$ elements will work: With $k$ elements entangled, we should have about $k$ noncommuting choices for each comparison with $S$. This should allow $O(\log_k n + \log k)$ randomization depth, as desired (assuming that $k$ is not too large as we need depth $Θ(\log k)$ to disentangle the elements). We expect that the compute can be made quasilinear with small enough $k$, but the dependence on $k$ is unclear.

To be more precise, define the graph of batch elements by connecting $A$ and $B$ with an edge if $I_A$ and $I_B$ overlap and we know whether $A < B$. One randomization construction (there are many) is the following:
Repeat until we have enough randomization, at which point disconnect the elements:
If possible, randomly pick $A$ and $B$ with $P(A < B)≈0.5$ that are not connected to each other and with both connected components having size $<k$. Otherwise, randomly pick $A$ from a component with maximum size, and compare it against the corresponding element of $S$.
Note: Initially, the sizes of the connected components will be powers of 2. Also, if $k$ is $Ω(\log n)$, we may wish to use specific construction designs so as to build bigger components. Any $2^i$ unsorted elements can be connected with $2^i-1$ even comparisons.
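
A structural sketch of this loop follows; the test for $P(A < B)≈0.5$ and the choice of the element of $S$ are abstracted into assumed helpers (balanced_pair and narrow), so only the component bookkeeping is spelled out:

    import random

    class DSU:
        # Union-find over batch indices, tracking component sizes.
        def __init__(self, n):
            self.parent, self.size = list(range(n)), [1] * n
        def find(self, i):
            while self.parent[i] != i:
                self.parent[i] = self.parent[self.parent[i]]
                i = self.parent[i]
            return i
        def union(self, i, j):
            ri, rj = self.find(i), self.find(j)
            if ri != rj:
                if self.size[ri] < self.size[rj]:
                    ri, rj = rj, ri
                self.parent[rj] = ri
                self.size[ri] += self.size[rj]

    def randomization_round(batch, intervals, S, k, dsu, count, balanced_pair, narrow):
        # One iteration of the randomization loop. balanced_pair is an assumed helper
        # returning an unconnected pair (i, j) with P(batch[i] < batch[j]) ~ 1/2 and
        # both components of size < k, or None; narrow does one comparison of an
        # element against S (e.g. the sketch from the Common constructs section).
        pair = balanced_pair(batch, intervals, dsu, k)
        if pair is not None:
            i, j = pair
            count[0] += 1
            _ = batch[i] < batch[j]            # connect ('entangle') the two components
            dsu.union(i, j)
        else:
            sizes = [dsu.size[dsu.find(i)] for i in range(len(batch))]
            m = max(sizes)
            i = random.choice([i for i in range(len(batch)) if sizes[i] == m])
            intervals[i] = narrow(batch[i], intervals[i], S, count)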

Distribution Shaping

To make the global distribution of $|I_A|/2^{\lfloor \mathrm{lg} |I_A| \rfloor}$ right for (*), we can make a 'random' distribution, and then withhold the right fraction of the elements for each $|I_A|/2^{\lfloor \mathrm{lg} |I_A| \rfloor}$ while randomizing the rest. This works (with $O(1)$ repetitions) provided that the 'random' distribution is effectively correct up to a constant factor. To the extent that withholding alters randomization of other elements, we can use gradient descent to find the right corrections.
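
As a toy illustration of the withholding step (bucket the ratios $|I_A|/2^{\lfloor \mathrm{lg} |I_A| \rfloor}$, then withhold just enough from each bucket so that the released counts match the target shape; the function and its arguments are illustrative only):

    def withholding_rates(observed, target):
        # observed[b]: number of elements whose ratio falls in bucket b;
        # target[b]: desired relative frequency of bucket b (e.g. uniform).
        # Release scale*target[b] elements from bucket b and withhold the rest,
        # with scale as large as the observed counts allow.
        scale = min(o / t for o, t in zip(observed, target) if t > 0)
        return [0.0 if o == 0 else 1.0 - scale * t / o
                for o, t in zip(observed, target)]

    # Example: counts skewed toward small ratios, uniform target.
    print(withholding_rates([120, 80, 60, 40], [1, 1, 1, 1]))
    # -> withhold 2/3, 1/2, 1/3, 0; each bucket then releases 40 elements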

However, the problem is that the distribution of lengths may depend on location. If we could use exact locations, this likely would be addressable: The insertion of an element $A$ will delay insertion of other elements, and if this causes a systematic inefficiency (with the above corrections), we can 'reroll' $A$ to make its error random. Also, after initial randomization, rerolling an element can presumably be done using $\mathrm{Compare}$ with $O(1)$ expected comparisons.

However, we only have the intervals $I_A$, and while we have enough degrees of freedom (and many different randomization ways), the withholding rates have to be between 0 and 1; and we do not know what is achievable.

6   A possible $\mathrm{lg}(n!)+O(n^{0.5-ε})$ combination

Suppose that the errors in the distribution shaping (above) are sufficiently random (we do not know whether the square root barrier will prevent that): $∀i_1,i_2,d_1,d_2 \, (0 ≤ i_1 < i_2 < n ∧ 1 ≤ d_1 < d_2 ≤ 2)$ $|\{A: S[i_1] < A < S[i_2] ∧ d_1 ≤ \frac{|I_A|}{2^{\lfloor \mathrm{lg} |I_A| \rfloor}} < d_2\}| - (d_2-d_1)(i_2-i_1) =$ $O(((d_2-d_1)(i_2-i_1))^{0.5+o(1)} + n^{0.5-2ε+o(1)})$. Then, if we make the batch size equal $|S|+n^{0.5+ε}$ and selectively reject $≈n^{0.5+ε}$ elements in (*) (above), we can insert all but these $≈n^{0.5+ε}$ elements with entropy waste $n^{0.5-ε/2+o(1)}$ as follows.

Split $S$ into $n^ε$ nearly equal intervals, and when, during insertion, $I_A$ settles on one of these intervals, reject $A$ (i.e. cancel the insertion) if that interval is too long, thus reducing the variation in the lengths of these intervals by a factor of $n^{ε/2-o(1)}$, which in turn reduces the length variations of random intervals of length $n^{1-o(1)}$ by a factor of $n^{ε/2-o(1)}$, as needed. (Intuitively, increasing the precision by a factor of $x$ requires increasing both the spatial and the temporal resolution of the corrections, by a combined factor of $x^2$, thus increasing the corrective term by a factor of $x^2$.) Now, we can use the above $\mathrm{lg}(n!)+O(n^{1-ε})$ algorithm to insert the remaining elements with $O(n^{0.5-ε'})$ waste if $ε$ is small enough.

The $ε$ we get in the $O(n^{0.5-ε})$ is quite small. Using the above $\mathrm{lg}(n!)+O(n^{1-ε})$ analysis and the 3-fold reduction here, $3ε≈(0.5-4ε)/5/\log_{3/2} 2$ (here $0.5-4ε$ represents the range of the available depths), so $ε ≈ 0.016 > 0.01$, but perhaps there are more efficient algorithms. Also, at a typical point we have around $n^{0.5-ε}$ choices on which element to insert, and in addition, using slightly imbalanced queries gives us many choices for fine-tuning the distributions of lengths, which may give a bigger $ε$.
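
Explicitly, solving $3ε = (0.5-4ε)/(5\log_{3/2} 2)$ gives $ε = 0.5/(15\log_{3/2} 2+4) ≈ 0.0169$.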

Worst-case complexity of sorting: While our constructions do not appear to work for the worst case, and we do not know whether closeness to a power of 2 can be circumvented in the worst case, most likely there is a sorting algorithm with $\mathrm{lg}(n!)+o(n)$ worst-case comparisons. For finding the median, there is a linear gap between the average case ($1.5n+o(n)$ comparisons) and the worst case (at least $(2+ε)n-O(1)$ comparisons) [4]. However, for sorting, there is plenty of freedom for arranging comparisons and for finding new sorting algorithms.

References

[1] Stefan Edelkamp and Armin Weiß. QuickXsort: Efficient Sorting with n log n - 1.399n + o(n) Comparisons on Average. CSR 2014, pp. 139–152.
[2] Kazuo Iwama and Junichi Teruyama. Improved Average Complexity for Comparison-Based Sorting. WADS 2017: Algorithms and Data Structures, pp. 485–496.
[3] Dmytro Taranovsky. Sorting with an average of $\mathrm{lg}(n!)+o(n)$ comparisons. https://cstheory.stackexchange.com/questions/41493/sorting-with-an-average-of-mathrmlgnon-comparisons
[4] Dorit Dor and Uri Zwick. Median Selection Requires $(2+\epsilon)n$ Comparisons. SIAM Journal on Discrete Mathematics, 2001, Vol. 14, No. 3, pp. 312–325.