$\DeclareMathOperator{\lg}{lg}$

Dmytro Taranovsky
September 9, 2018; modified June 14, 2020

# Sorting with $\lg(n!)+o(n)$ comparisons

Abstract: We prove that comparison-based sorting is possible with $\lg(n!)+o(n)$ worst-case comparisons. Also, under a conjectured randomization assumption, we get sorting with an average of $\lg(n!)+O(n^{1-ε})$ comparisons (and give a possible adaptation to the worst-case), and we also give a possible construction for sorting with an average of $\lg(n!)+O(\log n)$ comparisons.

## 1   Introduction

In a number of ways, sorting is the classic problem in computer science, and yet it continues to surprise us with new results. A simply entropy argument shows that comparison-based sorting requires an average of $≥\lg(n!)$ comparisons ($\lg$ is $\log_2$), but how closely can we approach this bound?

A symbolic significance of $\lg(n!)+o(n)$ is that it is $o(n)$ comparisons from optimal, wasting an average of only $o(1)$ comparisons per element. If achievable, $O(n^{0.5-ε})$ would break the square root barrier commonly seen in randomized constructions, and $O(\log n)$ is exceptionally close to having zero waste. While the immediate benefit of the results is mostly theoretical, the constructions here enhance our understanding of probability and computation and may be useful in other contexts, especially in optimization.

Prior work: Merge sort uses $\lg(n!) + Θ(n)$ comparisons even in the worst case, and is efficiently implementable in practice. Merge-insertion sort (also known as Ford–Johnson sort) also uses $\lg(n!)+ Θ(n)$ comparisons but with a much smaller constant in $Θ(n)$. [1] gives a variation and a modification that is efficient in terms of compute. Slightly improved average bounds are obtained in [2], and their (1,2)Insertion algorithm resembles a portion of our average-case $\lg(n!)+o(n)$ algorithm. Many of the results in this paper were originally presented in [3] (2018), with this paper being the expanded version. In May 2020, I obtained the results for the worst-case, as well as a possible construction for an average of $\lg(n!)+O(\log n)$ comparisons.

Open problems:
• Is sorting possible with an average of $\lg(n!)+O(\log n)$ comparisons? Is this optimal?
• How many comparisons are needed in the worst case?
Related to the above problems:
• Proving the $O(n^{1-ε})$ bound.
• Can the constructions in the paper for $n^{0.5+o(1)}$ and $O(n^{0.5-ε})$ and $O(\log n)$ work?
• Proving the $O(n^{1-ε})$ construction for the worst-case.

Outline:
1   Introduction
2   Basic Approach - basic approach for the average case, including a discussion of entropy, and a key function that we will call Compare.
3   A $\lg(n!)+o(n)$ algorithm - proves that sorting is possible with an average of $\lg(n!)+o(n)$ comparisons. The algorithms in sections 3,4,5 are all different.
4   A likely $\lg(n!)+O(n^{1-ε})$ algorithm - under a likely randomization assumption, gives an algorithm using an average $\lg(n!)+O(n^{1-ε})$ comparisons and discusses various refinements.
5   A possible $\lg(n!)+n^{0.5+o(1)}$ algorithm
6   A possible $\lg(n!)+O(n^{0.5-ε})$ combination - combines the algorithms in the previous two sections (less important now because of the possible $\lg(n!)+O(\log n)$ algorithm)
7   A possible $\lg(n!)+O(\log n)$ algorithm - refines the $\lg(n!)+n^{0.5+o(1)}$ algorithm, plausibly achieving an average of $\lg(n!)+O(\log n)$ comparisons
8   Worst-case complexity of sorting - proves that sorting is possible with $\lg(n!)+o(n)$ worst-case comparisons, and discusses how to adapt the $\lg(n!)+O(n^{1-ε})$ algorithm to the worst case.

6   A possible $\lg(n!)+n^{0.5+o(1)}$ algorithm In section 2, we present the basic approach for the average-case. In sections 3-7, we give different algorithms and likely constructions for the average case. In section 8, we adapt some of the algorithms for the worst-case.

## 2   Basic Approach

We can assume that all elements are distinct, by annotating them if necessary. The average case uses distinct elements in random order. By randomly shuffling the items, we can turn the average performance into the expected performance for every input, and the probability of 'bad' behavior can be typically made exponentially small. As an aside, we note that in practice, one may first want to check if the list is already largely sorted, and also take advantage of equal items and other information.

We can compute the average number of comparisons by adding the entropy loss for each comparison relative to using a fair coin. At $\frac{1}{2}+ε$ yes probability, the loss, which will also call entropy waste is $Θ(ε^2)$. Thus, assuming an average of $O(n \log n)$ comparisons (and without bounding $ε$), the precise average number of comparisons is $\lg(n!)(1+Θ(E(ε^2)))$. Proper constructs will allow us great freedom to use slightly imbalanced comparisons to reduce the need for bad strongly imbalanced comparisons in the end and minimize the expected $ε^2$.

For one of the algorithms, we will use that merging two random sorted lists (by going sequentially from small to large) of the same size $n$ wastes only $Θ(\log n)$ entropy because for an element in position $i$ the comparison probability is typically $\frac{1}{2}±Θ(1/\sqrt{\min(1+i,n-i)})$; the exact waste is $2n-\lg {2n \choose n}+O(1) = \frac{1}{2} \lg n + O(1)$ (merging uses $2n+O(1)$ comparisons while reducing bit entropy by $\lg {2n \choose n}$). However, such a low entropy waste is likely the exception rather than the rule.

Our starting point is insertion sort with a binary search to decide where to insert the next element into the sorted subset $S$. When $(1-ε)2^m ≤ |S| ≤ 2^m-1$, an insertion uses at most $m$ comparisons, which (in terms of entropy) is optimal up to an $O(ε)$ additive factor; and for average-case complexity, $2^m ≤ |S| ≤ (1+ε) 2^m$ also works. Now, when $|S|$ is not close to a power of 2, insertion of an element $A$ is suboptimal, even in the average case and regardless of how we balance each query: Any balancing that does not 'cross' a power of 2 leads to the same average number of comparisons (as can be checked using the comparison tree).

However, if by using other elements, and wasting $o(1)$ comparisons, we could steer an element $A$ to an approximately uniform distribution over an interval of $S$ of length close to a power of 2, we get the desired $\lg(n!)+o(n)$.

We achieve this by adding elements in batches, and sometimes efficiently comparing elements of the batch with each other, such that the interval of $S$ corresponding to an element $A$ decreases in a quasi-random manner (and with the probability distribution of $A$ inside the interval nearly uniform), and when the interval length is close enough to a power of 2, doing the binary search to insert $A$.

Common constructs

We will keep a subset $S$ of sorted elements, and for each unsorted element $A$, we will keep track of the minimal interval $I_A$ of $S$ where $A$ is known to be located. $|I_A|$ is the length of $I_A$; $I_A=I_B$ is by the identity of the intervals.

Let $\mathrm{Compare}(A,B)$ be: Compare $A$ with $B$, and then (in random order) compare $A$ and $B$ against corresponding elements of $S$ until their intervals are disjoint (or both have length 1). The element of $S$ is chosen (in a consistent manner) to make the probabilities for the comparison as close to 1/2 as possible, assuming that when $\mathrm{Compare}$ is called, $(A,B)$ is uniformly distributed on $I_A⨯I_B$. Some algorithms will use Compare with nonrandom selection of $A$ vs $B$ in comparisons with $S$.

Because of the disjointness in the end, $\mathrm{Compare}$ preserves the uniformity assumption. Also, on average, $\mathrm{Compare}$ does $O(1)$ comparisons, and aside from the initial comparison, wastes only $O(l^{-2})$ entropy where $l$ is the shortest interval length.

We say that uninserted elements $A$ and $B$ are directly entangled iff $I_A$ and $I_B$ overlap and we know whether $A<B$; and 'entangled' is the transitive closure of directly entangled. For unentangled $A$ and $B$, the probability distribution on $I_A×I_B$ will be approximately uniform in our algorithms, and similarly for bigger groups of elements.

In the next three sections, using different algorithms, we
- prove the $\lg(n!)+o(n)$ bound,
- get $\lg(n!)+O(n^{1-ε})$ under a very likely assumption, and
- attempt to get $\lg(n!)+n^{0.5+o(1)}$, and in the following sections, $\lg(n!)+O(n^{0.5-ε})$ and $\lg(n!)+O(\log n)$ refinements.

## 3   A $\lg(n!)+o(n)$ algorithm

Our first algorithm is the following (which can be applied repeatedly to sort a list).

Given: A sorted list $S$, and a batch of $m$ unsorted elements; $m∈ω(1)∩o(|S|)$; the unsorted elements are random relative to $S$.

Repeat (1)-(3) while possible:
1. Pick two elements $A$ and $B$ from the batch with $I_A=I_B$ (any choice will work).
2. Run $\mathrm{Compare}(A,B)$.
3. If $|I_A|$ is close enough to a power of 2,(note 1) remove $A$ from the batch (without forgetting $I_A$); and do similarly with $B$ (using $|I_B|$).
Finally: Insert all elements into $S$ and complete the sort.

Note 1: For "close enough", any $o(1)$ relative error (as a function of $m$) works as long as $m-o(m)$ elements will be removed in step (3) (possible by note 2). Under a conjectured randomization assumption ($\frac{|I_A|}{2^{\lfloor \lg |I_A| \rfloor}}$ behaves randomly enough), using $c \log \log m / \log m$ relative error captures $m(1-\log^{-Θ(c)}m)$ elements, allowing (using a large enough $c$) a $\lg(n!)+O(n \log \log n / \log n)$ average comparison sorting algorithm.

Note 2 (proof of the $o(m)$ bound on wasted entropy): Because the same sequence of comparisons leads to the same bounding interval, almost all elements will go through step (1) $Ω(\log m)$ times (unless removed in step 3). In the beginning, if $A < B$ and we pick $A$, we compare $A$ against element $S[≈(1-1/\sqrt{2})|S|]$, and each application of step (3) to $A$ has $Θ(1)$ probability of reducing $|I_A|$ in $≈1/(1-1/\sqrt{2})$ times. Now for every ratio $a>1$ that is not a rational power of 2, we have $∀ε>0 \, ∀d>0 \, ∃m_1,m_2∈\mathbb{N} \,\, 1-ε < \frac{a^{m_1}}{d2^{m_2}} < 1+ε$ (and by compactness, an upper bound $M$ on $m_1$ can be chosen independent of $d$ if $1 ≤ d < 2$). Thus, for every $ε$, there is $M$ such that a chain of $≤M$ $\mathrm{Compare}$ has $2^{-O(M)}$ probability of getting close enough to a power of 2, and so we get the $o(m)$ bound, and hence sorting with an average of $\lg(n!)+o(n)$ comparisons.

## 4   A likely $\lg(n!)+O(n^{1-ε})$ algorithm

Modulo a randomization assumption, we can achieve $\lg(n!)+O(n^{1-ε})$ average comparisons as follows. The below was chosen for simplicity rather than optimality; also, the batch size can be customized.

• Randomly shuffle the items, and sort the first half into a list $S$, while keeping the second half as an unsorted batch.
• Repeat until the batch is empty:
Randomly choose $A∈\text{batch}$. Let $G = \{ B∈\text{batch}: |P(A < B) - 0.5| < n^{-0.51ε} \}$. If $G$ is empty, remove $A$ from the batch and insert into $S$. Otherwise:
1. If there is $B∈G$ such that with probability $Θ(1)$ (say ≥0.05), $\mathrm{Compare}(A,B)$ makes $|I_A|$ within $n^{-ε}$ relative error of a power of 2, run $\mathrm{Compare}(A,B)$ and if successful (i.e. $|I_A|$ is within $n^{-ε}$ relative error of a power of 2), remove $A$ from the batch and insert into $S$.
2. If there is no such $B∈G$, run $\mathrm{Compare}(A,B)$ for a random $B∈G$.

If our randomization assumption works out (i.e. the distribution of interval lengths and positions is random enough), then throughout much of the process, a typical $A$ can be efficiently compared with a choice of $n^{Θ(1)}$ elements (with $n^{Θ(1)}$ different interval lengths). Thus, we can typically choose a comparison for (1) above, and if we are unlucky with the comparison result, we still get $Θ(\log n)$ chances, thus achieving (if $ε$ is small enough, say 0.01) a $\lg(n!)+O(n^{1-ε})$-comparison algorithm. With some changes and approximations, the total compute can be made quasilinear: Given an element $A$, compute promising interval lengths, and then look up $B$s with the right approximate center and interval lengths.

Initially, $\mathrm{Compare}$ will be called primarily on elements with identical intervals. Because the effect of $\mathrm{Compare}$ on length is multiplicative (and multiplication is commutative), the length randomization will initially be $2^{o(\mathrm{depth})}$, but should we get enough randomization of positions, and after we start comparing elements with different interval lengths, lengths should be quickly randomized as well.

We likely get $\lg(n!)+O(n^{1-ε})$ comparisons with probability $1-\frac{1}{(n!)^{Ω(1)}}$.

Optimizations

There are number of ways to optimize the comparisons, but the obstacle is that each comparison may end up being unlucky and we have a limited number of comparisons, and about $n^{1-ε}$ elements will be unlucky in every comparison. To improve $ε$, we can
- accelerate the initial randomization (included in the possible $\lg(n!)+n^{0.5+o(1)}$ algorithm below; we expect $o(\log n)$ randomization depth)
- be more tolerant of deviations as the interval length shrinks (since we get fewer choices for comparison), this should stretch out choices (for unlucky elements) until roughly length $n^ε$.
- optimize comparisons (both in choosing $A$ and $B$ and in $\mathrm{Compare}(A,B)$).

In $\mathrm{Compare}(A,B)$, by symmetry (i.e. $P(A < B)≈0.5$) the lengths will not significantly depend on whether $A < B$. However, both $A$ and $B$ must be compared with $S$ to get length reduction that is not a power of 2. This suggest the following comparison protocol.

Comparison: If trying to optimize $I_A$, choose $B$ such that after $A$-$B$ and then $A$-$S$ comparison if the comparison results are in the same direction, then $|I_A|$ is close to a power of 2. The approximate required $|I_B|$ for this exists and is unique. (Also, $|I_A|/|I_B| < 2$ as otherwise after $A < B$ and then $A < S[i]$, $|I_A|$ will be reduced in 4 times.) If lucky (after doing the $A$-$B$ and $A$-$S$ comparisons), compare $B$ with $S$ until $|I_A|$ and $|I_B|$ are disjoint, or it is sufficiently likely that $B$ is in $I_A$. Make $I_A$ and $I_B$ disjoint in a reasonable manner.
Success Probability: At either small enough or large enough $|I_B|/|I_A|$, the probability is ≈1/2 (with an average of 4 comparisons conditioned on success), but the probability is a bit lower at other ratios. Also, by choosing whether to optimize $A$ or use $A$ for optimizing $B$ (and also how to make $I_A$ and $I_B$ disjoint), we should be able to avoid low success probabilities and random use of $\mathrm{Compare}$.

If we assume (probably slightly pessimistically) that the modified $\mathrm{Compare}$ uses (with survivor correction) around 5 comparisons and has 1/3 probability of success, we get $≈(1-ε)/5⋅\lg(n)$ chances, and $ε ≈ (1-ε)/5/\log_{3/2} 2 ≈ 0.10$.

Unfortunately, even at a million elements, this should only reduce excess comparisons in several times. A likely much better approach is to wait until an interval is close to a power of 2, controlling not the individual interval lengths but the distributions of lengths.

## 5   A possible $\lg(n!)+n^{0.5+o(1)}$ algorithm

We present a possible $\lg(n!)+n^{0.5+o(1)}$ average comparison sorting algorithm; then in the next section give a possible $\lg(n!)+O(n^{0.5-ε})$ combination with the algorithm in the previous section; and in the section after that, give a possible $\lg(n!)+O(\log n)$ refinement of the $\lg(n!)+n^{0.5+o(1)}$ algorithm (which makes our possible $\lg(n!)+O(n^{0.5-ε})$ combination less important). A caveat is that we use the term "algorithm" loosely, leaving some implementation choices (especially in the distribution shaping below) to the programmer, with the assumption that a straightforward implementation would work, both in theory, and after experimenting, with reasonable constants in practice.

The sorting algorithm will be: Randomly shuffle the list and sort the first half $S$. To insert the second half, make the distribution of $I_A$ right, and do the optimized insertion (*) below. To make the distribution of $I_A$ right, we will randomize the intervals at $o(\log n)$ depth (below), and then shape the distribution by withholding some elements and randomizing the rest. We do not know how well the distribution shaping works.

The initial randomization should only need $n^{o(1)}$ entropy waste. If the distribution shaping works well enough, the entropy waste stems primarily from random interval length mismatches during (*) and is thus $n^{0.5+o(1)}$.

Sorting

Suppose that $|S|=n$ and we are given an unsorted batch of $n$ elements with the intervals $I_A$ also given such that
- $|I_A|$ is typically $n^{1-o(1)}$, or rather $P(|I_A| < n^{1-d}) ∈ n^{o(1)-d}$.
- $\frac{|I_A|}{2^{\lfloor \lg |I_A| \rfloor}}$ is distributed uniformly, or rather $∀i ∀d \, |\{A < S[i]: \frac{|I_A|}{2^{\lfloor \lg |I_A| \rfloor}} < d\}| = d \, i ± n^{0.5+o(1)}$.
- the total entropy is within $n^{0.5+o(1)}$ bits of having an effectively random distribution given the intervals.

Then, we should be able to sort the items wasting an average of $n^{0.5+o(1)}$ comparisons as follows:
(*) Insert all elements in the order of their initial $\frac{|I_A|}{2^{\lfloor \lg |I_A| \rfloor}}$. This way all elements are inserted when their interval length is close to a power of 2.

Notes:
• Using the updated $I_A$ should be marginally better, but the interaction between different intervals would complicate the analysis.
• The $\frac{|I_A|}{2^{\lfloor \lg |I_A| \rfloor}}$ uniformity assumption can be weakened provided that the average magnitude of the errors (relative interval length error at actual insertion time) in (*) is $n^{-0.5+o(1)}$.
• $P(|I_A| < n^{1-d}) ∈ n^{o(1)-d}$ is consistent with $|I_A|/\text{(distance to edge)} = o(1)$ (unless $|I_A|=1$), which can be useful for handling elements close to the edges. Also, $P(|I_A| < n^{1-d}) ∈ n^{o(1)-d/2}$ suffices if the uniformity assumption is strengthened appropriately.

Randomization

To make a 'random' distribution, we can randomly use $\mathrm{Compare}(A,B)$ with $P(A < B)≈0.5$, and with $n^{Θ(1)}$ choices (typically $n^{0.5±o(1)}$ choices if we aim for $O(\log n)$ (or $n^{o(1)}$) total waste), we should get randomization after $Θ(n)$ comparisons, except that initially all $I_A$ are identical. Thus, with just $\mathrm{Compare}$, we should not expect randomization at a sublogarithmic depth (i.e. with $I_A$ long enough). However, we conjecture that using a generalization of $\mathrm{Compare}$ to $k=ω(1)$ elements will work: With $k$ elements entangled, we should have about $k$ noncommuting choices for each comparison with $S$. This should allow $O(\log_k n + \log k)=o(\log n)$ randomization depth, as desired (perhaps assuming that $k$ is not too large as we need enough depth to disentangle the elements).

To be more precise, define the entanglement graph of batch elements by connecting $A$ and $B$ with an edge if $I_A$ and $I_B$ overlap and we know whether $A < B$. One randomization construction (there are many) is the following:
Repeat until we have enough randomization, at which point disconnect the elements:
If possible, randomly pick and compare $A$ and $B$ with $P(A < B)≈0.5$ that are not connected to each other and with both connected components having size $<k$. Otherwise, randomly pick $A$ from a component with maximum size, and compare it against the corresponding element of $S$.

For the possible $\lg(n!)+O(\log n)$ algorithm, we need to be more precise. We first note that after the initial randomization, $Θ(n)$ applications of Compare (with reasonable choices, including in disentangling $A$ and $B$) can be done at $O(1)$ expected comparisons per element and $O(d^{-1})$ expected fraction of intervals reduced in $>d$ times. We want the initial randomization to have this property as well. With $2^i-1$ exactly even comparisons (and thus zero entropy waste), $m=2^i$ unsorted elements can be entangled into a tree. Furthermore, there are $m^{Ω(\log m)}$ (presumably $m^{Θ(\log m)}$ inequivalent trees here. For a typical tree, we can typically make $Θ(m)$ comparisons with $S$ before it splits into pieces, and the ordering of the comparisons and their results should give $m^{Θ(\log m)}$ bits of entropy, which is large enough for a large enough $m=Ω(\log n / \log \log m)$. Furthermore, for every tree, disentanglement uses an average of $O(1)$ comparisons per element, and for a typical tree with $m=n^{O(1)}$, we appear to get the right behavior at the tails. There are atypical trees, such as min-heap, where disentanglement gives a $Θ(\log m /d)$ elements with $O(d^{-1})$ interval lengths, but we can choose not to construct too many of these trees.

With $m$ entangled elements, the probability distribution for an element is piecewise polynomial (in the limit $|S|→∞$) of degree $m-1$, with all piece boundaries occurring on interval boundaries. And assuming tree structure for the entanglement, we can integrate out the $m$ elements one-by-one, and for polylogarithmic $m$, identify the right element of $S$ for comparison in polylogarithmic time.

Distribution Shaping

To make the global distribution of $|I_A|/2^{\lfloor \lg |I_A| \rfloor}$ right for (*), we can make a 'random' distribution, and then withhold the right fraction of the elements for each $|I_A|/2^{\lfloor \lg |I_A| \rfloor}$ while randomizing the rest. This works (with $O(1)$ repetitions) provided that the 'random' distribution is effectively correct up to a constant factor. To the extent that withholding alters randomization of other elements, we can use gradient descent to find the right corrections.

However, the problem is that the distribution of lengths may depend on location. At least if we could use exact locations, this is likely addressable: The insertion of an element $A$ will delay insertion of other elements, and if this causes a systematic inefficiency (with the above corrections), we can 'reroll' $A$ to make its error random. Also, after initial randomization, rerolling an element can presumably be done using $\mathrm{Compare}$ with $O(1)$ expected comparisons.

However, we only have the intervals $I_A$, and while we have enough degrees of freedom (and many different randomization ways), the withholding rates have to be between 0 and 1; and we do not know what is achievable. Complicating the analysis is that the losses might be concentrated on the outliers, and that the behavior may be different near the top and the bottom of the sorted list. We will discuss implementation of distribution shaping during discussion of the possible $\lg(n!)+O(\log n)$ algorithm.

## 6   A possible $\lg(n!)+O(n^{0.5-ε})$ combination

Suppose that the errors in the distribution shaping (above) are sufficiently random (we do not know whether the square root barrier will prevent that; the $\lg(n!)+O(\log n)}$ section does not use this specific assumption):
$∀i_1,i_2,d_1,d_2 \, (0 ≤ i_1 < i_2 < n ∧ 1 ≤ d_1 < d_2 ≤ 2)$ $|\{A: S[i_1] < A < S[i_2] ∧ d_1 ≤ \frac{|I_A|}{2^{\lfloor \lg |I_A| \rfloor}} < d_2\}| - (d_2-d_1)(i_2-i_1) =$ $O(((d_2-d_1)(i_2-i_1))^{0.5+o(1)} + n^{0.5-2ε+o(1)})$. Then, if we make the batch size equal $|S|+n^{0.5+ε}$ and selectively reject $≈n^{0.5+ε}$ elements in (*) (above), we can insert all but these $≈n^{0.5+ε}$ elements with entropy waste $n^{0.5-ε/2+o(1)}$ as follows.

Split $S$ into $n^ε$ nearly equal intervals, and when during insertion, $I_A$ settles on one of the intervals, reject (i.e. cancel the insertion) if the interval is too long, thus reducing the variation in the lengths of these intervals $n^{ε/2-o(1)}$ times, which in turn reduces the length variations of random length $n^{1-o(1)}$ intervals in $n^{ε/2-o(1)}$ times, as needed. (Intuitively, increasing precision in $x$ times requires increasing both spatial and temporal resolutions for corrections in $x^2$ times, thus increasing the corrective term in $x^2$ times.) Now, we can use the above $\lg(n!)+O(n^{1-ε})$ algorithm to insert the remaining elements with $O(n^{0.5-ε'})$ waste if $ε$ is small enough.

The $ε$ we get in the $O(n^{0.5-ε})$ is quite small. Using the above $\lg(n!)+O(n^{1-ε})$ analysis and the 3-fold reduction here, $3ε≈(0.5-4ε)/5/\log_{3/2} 2$ (here $0.5-4ε$ represents the range of the available depths), so $ε ≈ 0.016 > 0.01$, but perhaps there are more efficient algorithms. Also, at a typical point we have around $n^{0.5-ε}$ choices on which element to insert, and in addition, using slightly imbalanced queries gives us many choices for fine-tuning the distributions of lengths, which may give a bigger $ε$.

## 7   A possible $\lg(n!)+O(\log n)$ algorithm

Surprisingly, it may be possible to sort with an average of $\lg(n!)+O(\log n)$ comparisons.

Starting with a single element, at each stage we will approximately double the length of the sorted list $S$ until all elements are sorted. Like in the $\lg(n!)+n^{0.5+o(1)}$ algorithm, each stage will consist of randomization and distribution shaping, followed by optimized insertion. As noted above, we do not know how well distribution shaping works, and we get the $O(\log n)$ bound only if it works essentially as best as general considerations suggest it would.

At all but the last stage, $Θ(|S|)$ elements will be forward to the next stage. We can even amalgamate all but the last stage, doing randomization and optimized insertion in parallel. Except for the last stage, $E(|S|/|I_A|)=O(1)$ (averaged over elements $A$ used in the stage); for the last stage, after the randomization, $E(|S|/|I_A|)=O(\log n)$.

Errors and tolerances before insertion

A random element $A$ adds a total $Θ(|I_A|^{-1})$ inefficiency for insertions of other elements that ignore $A$, with $Θ(|I_A|^{-2})$ inefficiency per insertion for $Θ(|I_A|)$ insertions (assuming uniform distribution on $I_A$ and that $|S|/(|S|-|I_A|)=Ω(1)$). This corresponds to $O(1)$ entropy waste per stage, except for the $O(\log n)$ waste in the last stage.

For nonfinal stages, $E(|S|/|I_A|)=O(1)$ is likely achievable since we have many randomization and distribution shaping possibilities, and since the desired distribution of intervals is sufficiently smooth. We mostly (or fully) refuse to disentangle elements once get to a certain depth, leaving them to the last stage. Furthermore, we can cancel insertions of some elements (leaving them to the final stage) to fine-tune the distribution.

Our tolerance for systematic errors is $O(\sqrt{|S|})$, except for $\sqrt{n \log n}$ in the last stage. An element may cause a systematic error at multiple length scales, so the worst case is effective $k \sqrt{\log |S|}$ systematic error from $k$ elements. This does not prevent large corrections as the distribution of interval lengths and entanglement patterns does not count as systematic errors in the absence of interference with the algorithm.

Insertion

After randomization and distribution shaping, we have a supply of elements waiting to be inserted into the sorted list $S$, with element $A$ having an interval $I_A$ and with essentially nothing known about $A$ beyond that. When $|I_A|$ approaches a power of 2, we will do a slightly imbalanced $A$-$S$ comparison with one of the intervals being a power of 2. If we get a power of 2, insert $A$, and otherwise, wait until the next opportunity; this is the key point for getting much better than $n^{0.5+o(1)}$ inefficiency. The greedy approach (choose $A$ with the closest ratio to a power of 2) might work. At a nonfinal stage, we have enough elements waiting for insertion (and with the ability to cancel) that without systematic errors, the typical mismatch interval lengths can likely be $O(|S|^{-5-ε}|$, so the mismatch with the power of 2 contributes only $O(1)$ (and if desired, after $ω(1)$ stages, a total of $o(1)$) entropy waste).

Insertion during the final stage: For the final stage, have a typical deficiency of a large enough $Θ(\sqrt{n \log n})$ of the number of suitable intervals for insertion. This deficiency will cause a typical $Θ(\log n / n)$ per element waste for $Θ(n)$ elements, plus $Θ(\sqrt{n \log n})$ per element waste for the $Θ(\sqrt{n \log n})$ elements at the end, though the latter can be essentially avoided by tapering down the deficiency near the end.

Underflows and overflows: Note that, in the final stage, we can handle underflows (i.e. not enough elements with the right intervals) by proceeding with insertion of larger intervals. We can likely also deal with local overflows (interval length grows too large) by anticipating them and using comparisons such that regions with overflow risk get larger intervals. A global overflow, however, apparently has $n^{Ω(1)}$ cost, so to make their risk small enough, we apparently need a large enough $n \sqrt{\log n}$ expected underflow.

Distribution of the number of comparisons: The standard deviation for the number of comparisons can likely be made $Θ(\sqrt{\log n})$ (or by wasting comparisons, even $O(n^{-c})$ for a given constant $c$ (number of comparisons depends on $c$)). The imbalanced comparisons in the main insertion part of the last stage give $Θ(\sqrt{\log n})$ stdev, and nonfinal stages appear to have $Θ(1)$ per stage variance. Optimizing the $Θ(\log n)$ term, however, can lead to a polynomially large stdev through a polynomially small risk of an overflow.

Optimality: The $O(\log n)$ inefficiency budget tightly constrains the sorting algorithm possibilities. Furthermore, in the above, we appear to $\log$ in three places: $O(1)$ per stage from non-uniformities from rejected elements, tail of shorter depths in the final stage, and buffer against overflows in the final stage. We can try to address the first one by choosing insertions to get a better than random non-uniformity in $S$, but we do not see how to address the other two. Most likely, an average of $\lg(n!)+Θ(\log n)$ comparisons is really optimal.

Alternative construction for $\mathrm{lg(n!)}+\mathrm{polylog}(n)$ comparisons: An alternative method of handling 'extraneous' elements in the final stage is to randomly split the $n$ elements into batches of approximate size $n/2,n/4,n/8,...$. Leftover elements from sorting the $n/2$ batch will go into the $n/4$ batch and so on. At the end, merge the batches using sequential merging of similar size lists. However, the merging adds up to inefficiency of $\frac{1}{4}(\lg^2 n) + O(\lg n)$ comparisons, which is in addition to mutual information between the lists being largely wasted. Given a single element $A$ moved from an otherwise sorted batch $i$ (with $I_A$ being the interval for $A$ and with batch size $x$) to batch $i+1$ and with no other connections, the expected mutual information (and hence waste) between sorted batch $i$ and sorted batch $i+1$ is $Θ(\frac{\sqrt x}{|I_A|})$ (assuming this is $O(1)$), which limits our ability to move elements between batches.

Variations on the average comparisons measure: There other measures and setups that respect the $\lg(n!)$ lower bound on the number of comparisons/bits. An example is to permit, in $ε$ fraction of the problems, to bow out and complete the remainder of the sorting using biased comparisons: $C(p,A,B)$ ($0<p<1$) returns whether $A<B$ and takes $-\lg p$ bits if $A<B$ and other $-\lg (1-p)$ bits otherwise. By managing rare bad events, we possibly get a better $Θ(\log n)$ term in the worst case for a large enough $ε=o(1)$ than the ordinary average case. Another variation is to allow bowing out in every case, but only after $1-ε$ of the list has been sorted; an interesting question is whether it is possible to get $O(1)$ waste for some $ε=n^{-Ω(1)}$.

Implementation suggestions and distribution shaping

The $\lg(n!)+O(\log n)$ bound (if achievable) requires high precision, but even simpler (and more relaxed) versions in this style should perform very well (compared to previous algorithms) in terms of number of comparisons on moderate size lists.

For distribution shaping, we recommend an iterative process to select among different algorithms and refine their parameters. For now, we can give a description that plausibly works assuming large enough constants (which will be reasonable).

For nonfinal stages, we might forgo complicated distribution shaping (perhaps at the cost of increasing the constant in $Θ(\log n)$) by simply scheduling enough elements for insertion and withholding some (unentangled) elements to prevent non-uniformity in $S$. Initial randomization is as described previously, except that we disentangle elements only up to a large enough fixed depth (measured using entropy), leaving the rest for the final stage. For each element, we have its probability distribution using its interval, or using its piecewise polynomial for entangled elements. By adding up their contributions (ignoring interaction), we get a sufficiently accurate entropy loss from doing insertions through numerically almost even comparisons with $S$ that is due to the nonuniformity of $S$. We then prioritize insertions that avoid and/or reduce the entropy loss for future insertions (under a reasonable interval distribution for upcoming insertions, or even just considering all intervals to be equally likely). The insertion decision can be cancelled after any comparison, and it can be based on current inefficiency (in terms of not getting a power of 2 and non-uniformity), contribution to future inefficiencies, and with an adjustment for possible cancellation.

For the final stage, after disentangling the elements, we can do distribution shaping through Compare (and a small number of variations of Compare based on the selection of $A-S$ vs $B-S$ comparisons). For each appropriate combination ($A$, $B$, Compare variation), we can compute its contribution to distribution shaping, and then do gradient descent to compensate for interactions between the contributions, getting a probability for each combination such that the total is good enough. We then apply all the Compare, and (if needed) repeat $O(1)$ times (or if somehow needed, more times around the edges of $S$). Ideally, the initial randomization (using groups of more than 2 elements) would do most distribution shaping to avoid or reduce the need for Compare afterwards.

The quality of the distribution can be estimated by assuming that elements are inserted in the order approximately based on the interval length $|I_A|$, and estimating the entropy loss from non-uniformity in $S$, the interval lengths not being a power of 2, and some measure of the risk of an overflow. A simple measure of each of these that is off by a constant factor may work (for overflow, the constant factor is about $O(\sqrt{n \log n}$ and must not be too small). If $|S|$ more than doubles, then for elements with appropriate lengths, we can randomly (or intelligently) choose during which doubling they will be inserted. During insertion, we can tweak the selection of an element to insert (and also whether the left or right subinterval is a power of 2) using its impact on the distribution shape.

All runtimes are polynomial, and we suspect that with further research, an optimized implementation can achieve a quasilinear run time. There are $Θ(n^2)$ different intervals on $S$, but with our error tolerances, it suffices to keep track of expectations for $\tilde{Θ}(n)$ intervals. For each approximate point $s$ on $S$, and each appropriate ratio $r$, we compute the expected number of non-inserted elements $A$ with $A<s$ and $\frac{|I_A|}{2^{\lfloor \lg |I_A| \rfloor}}<r$. We then apply the usual squared error for global deviations in $\frac{|I_A|}{2^{\lfloor \lg |I_A| \rfloor}}<r$, and for each $r$ and each length scale (with effectively $Θ(\log n)$ length scales) compute the appropriate squared error for nonuniformity of the distribution of $\{A:\frac{|I_A|}{2^{\lfloor \lg |I_A| \rfloor}}\}<r$ relative to $S$. Especially if the number of Compare per element is a soft $O(1)$ limit, we also include the $E(|S|/|I_A|)$ penalty term to avoid too many random errors from short intervals. The number of element pairs suitable for Compare is $\tilde{Θ}(n^{1.5})$, and they are generally non-equivalent, but if $\tilde{Θ}(n^{1.5})$ runtime is unacceptable, we we might be able to work backward

## 8   Worst-case complexity of sorting

### 8.1   General notes

There is often a gap between average-case and worst-case complexity of algorithms. For example, for finding median, there is a linear gap between the average case ($1.5n+o(n)$ comparisons) and the worst case (at least $(2+ε)n-O(1)$ comparisons). We suspect that sorting can be done in worst-case $\lg(n!)+O(n^{1-ε})$ but not $\lg(n!)+O(n^ε)$ comparisons.

Our constructions above, being heavily reliant on probability, do not immediately work for the worst case. However, some of the algorithms are sufficiently robust that we can bias the comparisons such that either we get enough randomization or we consistently gain information compared to 1 bit / comparison, which will compensate the nonrandomness. We prove that $\lg(n!)+o(n)$ is possible, and also give a likely construction for $\lg(n!)+O(n^{1-ε})$ comparisons.

As an aside, before we had such an algorithm, some results on the number of comparisons are as follows. (With the algorithm $f(x)=0$, but we might get some of the results by dividing the difference (in the definition of $f$) by an appropriate function.) Let $g(n)$ be the minimum number of comparisons to sort $n$ elements in the worst case, and $f(x) = \lim\limits_{i→∞} \frac{g(n)-\lg(n!)}{n}$ ($n=\mathrm{round}(2^i x)$, $i∈ℕ$, $x>0$). Note that $λx f(\lg x)$ has period 1. Using efficiency of equal size two-way merge, $f$ converges. Using efficiency of inserting elements into sorted list of size $(1-o(1))2^i$, $f'_{-}(1)=-f(1)$ (i.e. is as small as possible). On the interval $[1,2]$, $f'$ and $f''$ are bounded from above. The latter is because merging two sorted lists, with one $1+ε$ times bigger than the other, incurs $O(ε^2)+o(1)$ worst-case entropy loss per element. Thus, $f'_{-}(x)≤f'_{+}(x)$ and both of these exist. An 'efficient' $k$-way merge (of equal size sorted lists) with $k$ not being a power of 2 would give $f(x)=0$ since we could compose a sorting algorithm from $k$-way merges and insertions at size $(1-o(1))2^i$.

### 8.2   A $\lg(n!)+o(n)$ algorithm

We prove that sorting is possible with worst-case $\lg(n!)+O(n/\log \log \log n)$ comparisons; the running time can be made $O(n \log n)$. We do this by showing that insertion of $m$ elements into a sorted list of length $n$ can be done with $\lg((n+m)!/n!)+O(m/\log \log \log m)$ comparisons. By splitting the insertions into batches, we can assume that $m$ is $O(n^{1-ε})$. The algorithm and the proof are not especially complex, but the tradeoff is that we only get $O(m/\log \log \log m)$, which is only 'slightly' $o(m)$.

Specifically, we adapt the average-case $\lg(n!)+o(n)$ algorithm (in section 3) to the worst-case by modifying Compare and by changing 'close to a power of 2'.

Recall at the start of Compare, $I_A=I_B$, and Compare has $Θ(1)$ chance of reducing $|I_A|$ in $≈r=1/(1-1/\sqrt{2})$ times. Using the irrationality measure of $r^m$ (and how far away $r^m$ ($m≤n$) is away from rational numbers with small denominators), $\lg r$ is at least $2^{-O(n)}$ away from every rational number with denominator $≤n$ (this is presumably far from optimal). From there, $∀x ∃N∈2^{O(n)} ∃m \, (1-1/n) < r^N/2^m < 1$. Let $f_m(x)$ be the least natural number $N$ such that $\frac{x}{r^N}$ is close enough to a power of 2 without risking overflow — for example, $1-\frac{N+2}{N+1} \frac{100}{\log \log \log m} < \frac{x}{2^k r^N} < 1-\frac{N+1}{N+2} \frac{1}{\log \log m}$ works. (The $\frac{N+2}{N+1}$ factor relaxes the bounds as $N$ decreases, which will permit our use of slightly biased comparisons below.) Every unentangled element $A$ with $f_m(|I_A|)=0$ will be removed from the batch and marked for insertion.

We will bias Compare as follows:
* To avoid randomness in disentanglement, let Compare alternatively compare $A$ and $B$ with the sorted list $S$, starting with $A$.
* After getting $A<B$ or $B<A$, there is a path involving 2 comparisons with $|I_A|$ decreasing in $≈r$ times. Increase the probability for the comparisons in this path by $εj/5^{f_m(|I_A|)}$ (assuming the uniform distribution on $I_A×I_B$ before the $A-B$ comparison) where $j=1$ for the first comparison and $j=2$ for the second comparison, and $ε$ can be for example $1/\sqrt{\log m}$.
* For the remaining paths, after three comparisons (including $A$ vs $B$), reduce the probabilities leading to failure to separate $A$ and $B$ in $1+ε_2 1.1^n$ times where $n$ is the number of comparisons so far in the current Compare, and $ε_2$ can be for example $1/\log^2 m$. Thus, at a low enough cost per Compare, Compare is effectively limited to $O(\log ε_2^{-1})$ comparisons as otherwise we gain $ω(1)$ extra information.

We get the bound on the total number of comparisons by summing up inefficiencies along every path:
* For every element $A$, the inefficiency is $O(1)$ comparisons (using the choices above, $O(1/\sqrt{\log m})$ (amortized) plus $\text{depth}/\log^2 m$ plus $O(1)$ for the insertion).
* If $|I_A|$ gets close enough (from below) towards a power of 2, we lose only $O(1/\log \log \log m)$ comparisons on $A$ as the inefficiencies involving Compare are small enough.
* At most $m^{o(1)}$ $A$ can run out of Compare at comparison depth $o(\log m)$ without settling on an interval close enough to a power of 2.
* If $|I_A|$ is used by Compare up to a sufficient depth, we gain $ω(1)$ information (use the effective upper bound on Compare length, and the gain of roughly $εj/5^{f_m(x)}$ information for the worst $x$).

To be written.

To be written.

## References

[1] Stefan Edelkamp and Armin Weiß. QuickXsort: Efficient Sorting with n log n - 1.399n + o(n) Comparisons on Average. CSR 2014: pp. 139–152.
[2] Kazuo Iwama and Junichi Teruyama. Improved Average Complexity for Comparison-Based Sorting. WADS 2017: Algorithms and Data Structures pp. 485-496.
[3] Dmytro Taranovsky. Sorting with an average of $\lg(n!)+o(n)$ comparisons. https://cstheory.stackexchange.com/questions/41493/sorting-with-an-average-of-mathrmlgnon-comparisons .
[4] Dorit Dor and Uri Zwick. Median Selection Requires $(2+\epsilon)n$ Comparisons. SIAM Journal on Discrete Mathematics, 2001, Vol. 14, No. 3 : pp. 312-325.