DynamicLogLog: Faster, Smaller, and More Accurate Cardinality Estimation
Brian Bushnell 1*
2026
1 DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
*Corresponding author: bbushnell@lbl.gov
ORCiD: Brian Bushnell: https://orcid.org/0000-0002-8140-0131

Abstract

Cardinality estimation — calculating the number of distinct elements in a stream — is a longstanding problem with applications across numerous fields, from networking to bioinformatics to animal population studies. Widely used approaches include Linear Counting [1], accurate at low cardinalities; LogLog [2], accurate at high cardinalities; and HyperLogLog (HLL) [3], a fusion of the two, which is accurate at both low and high cardinalities. However, HyperLogLog has a well-known error spike in the mid-region corresponding to the transition from Linear Counting to LogLog [4], though this has been eliminated in the more recent UltraLogLog (ULL) [5]. Furthermore, the accuracy of LogLog improves with the number of buckets (B) used (standard error proportional to 1/√B), and the maximum cardinality it can represent is limited by the number of bits per bucket. Thus, increasing the maximum representable cardinality while maintaining accuracy requires expanding each bucket. Each squaring of the cardinality (doubling the exponent) requires an additional B bits — the size of the data structure is B · log(log(cardinality)). Here we present DynamicLogLog (DLL), which uses a shared exponent to allow early exits over 99.9% of the time at high cardinality, increasing speed while reducing the size complexity to essentially constant with respect to cardinality, for a chosen precision: 4B + log(log(cardinality)) bits. Thus, squaring the maximum representable cardinality (doubling its exponent) requires only a single additional bit of global state, regardless of the number of buckets.
As such, while traditional LogLog variants using 6 bits per bucket can count the stars in the Milky Way, DynamicLogLog can count the particles in the universe at similar accuracy while using 33% less space — and with a flat error curve due to a new blending function. DynamicLogLog also uses a novel cardinality technique, Dynamic Linear Counting (DLC), which allows accurate cardinality estimation at any cardinality without needing a correction factor, as well as a new Logarithmic Hybrid Blend to eliminate HLL's error hump. DLL's 4-bit buckets additionally allow more efficient packing in power-of-2 computer words. Accuracy was quantified as the cardinality-weighted mean absolute error, using 2,048 buckets, averaged over 512,000 simulations out to a true cardinality of 8,388,608, sampled at exponentially-spaced checkpoints. DynamicLogLog's hybrid estimate demonstrated 1.830% mean and 1.834% peak absolute error using 1,024 bytes, compared to 1.84% mean and 34.1% peak for HyperLogLog using 1,536 bytes (6 bits × 2,048 buckets), while UltraLogLog (ULL) using 1,024 bytes (8 bits × 1,024 buckets) demonstrated 1.95% mean and 1.96% peak. DLC, which is used to calculate blend points in DLL's hybrid function, achieved 1.90% mean and 1.93% peak without correction. Furthermore, DynamicUltraLogLog (UDLL6), a fusion of DLL and ULL, achieves ULL-level accuracy at 75% of the memory (1.5 KB vs 2 KB for 2,048 registers).

1. Introduction

Counting the number of distinct elements in a data stream — the cardinality estimation problem — arises wherever large datasets must be summarized in bounded space. Network monitoring systems estimate the number of distinct IP flows; bioinformatics pipelines count unique k-mers in sequencing reads to estimate genome size; ecologists can apply feature analysis to estimate animal populations.
In each case, exact counting requires space proportional to both the cardinality and per-element information content, which is infeasible for streams of billions of complex elements. Probabilistic cardinality estimators trade exactness for dramatically reduced space. The foundational insight, due to Flajolet and Martin [6], is that the statistical properties of hash values — specifically, the lengths of runs of leading zeros — encode information about the number of distinct elements seen. This observation led to a family of increasingly refined algorithms: Probabilistic Counting [6], Linear Counting [1], LogLog [2], SuperLogLog [2], and HyperLogLog [3]. HyperLogLog, the prevailing standard, uses B buckets of 6 bits each to achieve a standard error of approximately 1.04/√B. At 2,048 buckets (1,536 bytes packed at 6 bits), this yields roughly 2.3% standard error (distinct from mean absolute error) — sufficient for most applications. However, HyperLogLog has three well-known limitations:

1. The error bulge. HLL uses Linear Counting (LC) for cardinalities below ~2.5B and the harmonic mean estimator above ~5B. In the transition region (roughly 2.5B to 5B), neither estimator is accurate, producing a characteristic error spike that exceeds 34% absolute error. The HyperLogLog++ variant [4] mitigates this with an empirical bias correction table, but does not eliminate it.

2. Memory scaling. Each bucket must store the maximum observed leading-zero count (NLZ). With 6-bit buckets, the maximum representable NLZ is 63, limiting the countable cardinality to approximately 2^63 · B. To square the maximum representable cardinality (doubling its exponent), every bucket needs an additional bit — the total data structure size scales as B · log(log(C)) where C is the maximum cardinality.

3. Speed.
Every input element must be hashed, its bucket identified, and the stored value compared and potentially updated. There is no mechanism to skip elements that cannot possibly affect the result.

In this paper we present DynamicLogLog (DLL), a cardinality estimator that addresses all three limitations simultaneously. DLL uses a shared exponent (called minZeros) across all buckets, storing only the relative leading-zero count per bucket. This design yields three key benefits:

• Smaller. Only 4 bits per bucket suffice (vs. 6 for HLL), a 33% memory reduction. At the same memory budget, DLL can use 50% more buckets, reducing variance by a factor of 1.22.

• Faster. The shared exponent enables an early exit mask (eeMask): a single unsigned comparison against the hash value rejects elements whose NLZ is below the current floor. At high cardinality, over 99.9% of elements are rejected before any bucket access occurs.

• Flatter. DLL introduces Dynamic Linear Counting (DLC), a tier-aware extension of Linear Counting that provides accurate estimates across the full cardinality range. Combined with a Logarithmic Hybrid Blend using logarithmically scaled mixing weights, DLL eliminates the LC-to-LogLog transition bulge entirely.

DLL's size complexity is 4B + log(log(C)) — the shared exponent costs a single global integer, and squaring the maximum representable cardinality (doubling its exponent) requires only one additional bit regardless of the number of buckets. In contrast, HLL's size complexity is 6B (or more generally, bB where b is the bits per bucket, and b must grow with log(log(C))). DLL is implemented in Java as part of the BBTools bioinformatics suite [7] and is usable for k-mer cardinality counting of sequence files via the loglog.sh shell script with the flag loglogtype=dll4. The source code is available in BBTools/current/cardinality/DynamicLogLog4.java.
UDLL6 is in the same directory as UltraDynamicLogLog6i.java.

The remainder of this paper is organized as follows: Section 2 reviews the mathematical background common to all LogLog-family estimators. Section 3 describes the DLL architecture, including tier promotion and the early exit mask. Section 4 introduces the DLC family of estimators and the Logarithmic Hybrid Blend. Section 5 covers correction factors, including self-similar correction factor (CF) lookup and the DLL3 overflow correction. Section 6 describes MicroIndex, a 64-bit cardinality Bloom filter allowing lazy bucket allocation for small sets. Section 7 describes the simulation and benchmarking methods. Section 8 presents experimental results comparing DLL to HLL on both high-complexity and low-complexity datasets, evaluates the complementary nature of DLL and UltraLogLog, introduces history-corrected hybrid estimation (Hybrid+n), and presents Layered Dynamic Linear Counting (LDLC). Section 9 discusses the results and future directions. Section 10 concludes.

[Figure 1. "HLL Error Bulge vs DLL Flat Profile." Mean absolute error vs true cardinality on a log-log scale; series: LL6/HLL 1.5 KB, DLL4/Hybrid 1 KB, DLL4/DLC 1 KB, DLL4/Hybrid 1.5 KB (3,072 B). HLL (LL6) peaks at 34.1% error near 1.5–2.5B during the LC-to-harmonic-mean transition. DLL Hybrid and DLC remain below 2% throughout with no transition artifact. (2048 buckets, 512k simulations.)]

2. Background

2.1 The LogLog Framework

All LogLog-family estimators share a common structure.
Given a hash function h mapping elements to uniformly distributed integers in [0, 2^L), the hash value is split into two parts:

• The bucket selector: typically the lowest k bits determine which of B = 2^k buckets this element maps to (alternatively, a modulo operation allows arbitrary bucket counts).

• The rank: the Number of Leading Zeros (NLZ) in the remaining L − k bits. (Flajolet's original definition adds one to avoid zero, but DLL uses raw NLZ throughout; the +1 offset is handled by the stored = relNlz + 1 encoding described in Section 3.1.)

Each bucket stores the maximum NLZ observed among all elements mapping to that bucket. The maximum NLZ encodes information about the number of distinct elements: if n distinct elements are distributed across B buckets, the expected maximum NLZ in each bucket is approximately log2(n/B).

The key estimators derived from this structure are:

Linear Counting (LC). Counts the fraction of empty buckets. If V buckets out of B are empty:

    LC = B · ln(B / V)

This is accurate when many buckets are empty (cardinality ≲ 2.5B) but becomes meaningless when all buckets are filled (V = 0).

LogLog / HyperLogLog. Uses the stored maximum NLZ values to estimate cardinality. The harmonic mean variant (HyperLogLog) computes:

    HLL = α_m · B² · (Σ_j 2^(−NLZ_j))^(−1)

where NLZ_j is the maximum NLZ in bucket j (denoted M_j in Flajolet's original notation) and α_m is a bias-correction constant (≈ 0.7213/(1 + 1.079/B) for large B). This is accurate at high cardinality but unreliable below ~2.5B.

HLL's transition problem. HLL uses LC for cardinalities below a threshold (typically 2.5B) and the harmonic mean above it. The transition between these two estimators produces the characteristic error bulge: neither is accurate in the crossover region, and switching between them introduces a discontinuity.
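The hash split described above can be sketched as follows. This is a minimal illustration, not the BBTools implementation; the class name, the choice k = 11, and the guard-bit trick for capping the NLZ are all assumptions for the sketch.

```java
// Sketch of the LogLog hash split (Section 2.1): the lowest k bits select
// the bucket, and the rank is the NLZ of the remaining 64 - k bits.
// Assumption: setting guard bit (k - 1) caps the NLZ at 64 - k, so the
// bucket-selector bits can never be counted as leading zeros.
public class HashSplitSketch {
    static final int K = 11;            // 2^11 = 2048 buckets
    static final int B = 1 << K;

    static int bucket(long hash) {
        return (int) (hash & (B - 1));  // lowest k bits
    }

    static int rank(long hash) {
        // Guard bit ensures NLZ <= 64 - k even if all top bits are zero.
        return Long.numberOfLeadingZeros(hash | (1L << (K - 1)));
    }
}
```

With these definitions, a hash whose top bit is set has rank 0, and a hash of 0 saturates at rank 64 − k, matching the bounded range a bucket must represent.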
2.2 Notation

Throughout this paper we use:

Symbol     Meaning
B          Number of buckets (power of 2 for bitmask selection; arbitrary with modulo)
k          Bucket selector bits: B = 2^k
NLZ        Number of leading zeros in the non-bucket portion of the hash
absNlz     Absolute NLZ value (the raw count from the hash)
minZeros   Shared exponent: the current minimum NLZ floor across all buckets
relNlz     Relative NLZ: absNlz − minZeros
stored     Encoded bucket value: relNlz + 1 (0 reserved for empty)
V          Number of empty buckets
V_t        Number of buckets with absNlz < t ("empty" at tier t)
C          True cardinality
CF         Correction factor: multiplier applied to a raw estimate

Error metrics:

Metric               Definition
Signed error         (estimate − true) / true. Positive = overcount, negative = undercount
Absolute error       |estimate − true| / true. Always non-negative
Mean absolute error  Average of absolute error across all instances at a given cardinality
Standard deviation   Std dev of signed error across instances at a given cardinality; measures precision
Peak error           Maximum mean absolute error across all cardinality checkpoints
Log-weighted avg     Unweighted average of error at exponentially-spaced checkpoints; each cardinality decade contributes equally. Reflects a log-uniform cardinality workload
Card-weighted avg    Each checkpoint weighted by its cardinality; emphasizes high-cardinality behavior.
Reflects a linearly-distributed workload.

Estimation methods:

Name      Definition
LC        Linear Counting: B · ln(B/V)
LCmin     Tier-compensated LC: 2^minZeros · B · ln(B/V)
Mean      Occupancy-corrected harmonic mean (Section 5.3)
HMean     Flajolet's harmonic mean with static α_m (Section 5.3)
GMean     Geometric mean of difference values (Section 5.3)
HLL       Standard HyperLogLog estimator: α_m · B² · (Σ_j 2^(−NLZ_j))^(−1)
FGRA      Fisherian Generalized Remaining Area estimator (Ertl 2024; Section 8.3)
DLC(t)    Dynamic Linear Counting at tier t: 2^t · B · ln(B/V_t)
DLC       Exponential log-space blend across all tiers (Section 4.5)
DLCBest   Best single-tier DLC estimate, selected by optimal occupancy (Section 4.4)
Hybrid    Logarithmic Hybrid Blend: LCmin + Mean with CF correction (Section 4.6)
Hybrid+n  Hybrid with n-bit per-state history correction (e.g., Hybrid+2 uses 2-bit history)
Mean+n    Mean with n-bit per-state history correction
LDLC      Layered Dynamic Linear Counting: per-tier history-corrected LC blend (Section 8.5)

Estimator types:

Name   Bits/bucket  Early exit  History bits  Description
LL6    6            No          0             Traditional HyperLogLog with unpacked byte array
DLL4   4            Yes         0             DynamicLogLog, 4-bit
DLL3   3            Yes         0             DynamicLogLog, 3-bit with overflow correction
ULL    8            No          2             UltraLogLog (Ertl 2024)
UDLL5  5            Yes         1             DynamicUltraLogLog, 1-bit history
UDLL6  6            Yes         2             DynamicUltraLogLog, 2-bit history (= DLL + ULL fusion)
UDLL7  7            Yes         3             DynamicUltraLogLog, 3-bit history

We differentiate between Estimator Types — the actual data structure format and method of updating it when a new element arrives — and Estimation Methods, the algorithms used to derive a cardinality estimate from the bucket state. These are semi-independent: many different data structures can yield multiple different estimates.
For example, DynamicLogLog4 (the 4-bit-bucket version) can produce Linear Counting, Dynamic Linear Counting, HyperLogLog, Mean, and Hybrid estimates from the same bucket state; similarly, most estimation methods can be applied to different data structures, sometimes needing minor modifications (e.g., DLC requires tier-aware bucket counts, which DLL tracks natively but LL6 can reconstruct from stored NLZ values).

3. DynamicLogLog Architecture

3.1 Shared-Exponent Representation

Like all LogLog-family estimators, DLL requires a hash function that produces uniformly distributed, unbiased output. Any hash satisfying this requirement is suitable; our implementation uses Stafford's Mix13 64-bit finalizer [8]. The central observation behind DLL is that after enough distinct elements have been observed, the NLZ values across all B buckets eventually exceed some minimum value. After n distinct elements have been added, the expected maximum NLZ per bucket is approximately log2(n/B), causing both the average and the floor to rise with increasing cardinality. Traditional LogLog stores the absolute NLZ per bucket, requiring enough bits to represent the full range (0 to 63 for a 64-bit hash). DLL instead factors the NLZ into a shared component and a per-bucket residual:

    absNlz = minZeros + relNlz

The minZeros value is a single integer shared across all buckets, representing the current "floor" — the minimum NLZ that any non-empty bucket can have. Each bucket stores only the relative NLZ (relNlz), encoded as stored = relNlz + 1 with 0 reserved for empty buckets. With 4 bits per bucket, stored ranges from 0 to 15, representing relative NLZ values 0 through 14. This 15-tier range is sufficient: the probability of a valid update causing a bucket to overflow (relNlz > 14) is approximately 1/32,768 per update.
This is negligible compared to the intrinsic expected error of the estimator for typical bucket counts (~2,048), though in practice overflow could be practically eliminated by using 5 bits per bucket. Thematically, "lossy algorithms can gain from lossiness everywhere."

The memory layout is simple and cache-friendly: 8 buckets pack into a single 32-bit integer (4 bits × 8 = 32), with zero wasted bits. Compare to 6-bit HLL, where 6 does not divide evenly into 32 or 64, requiring either wasteful padding or complex cross-word packing.

3.2 Tier Promotion

As cardinality increases, the NLZ values across all buckets drift upward. Eventually, every bucket has stored >= 1, meaning the shared floor can be raised. This tier promotion is analogous to a floating-point exponent increment:

1. Increment minZeros by 1.
2. Subtract 1 from every non-empty bucket's stored value.
3. Buckets that drop to stored = 0 become "empty" in the new frame.
4. Count the new empty (or tier-0) buckets to determine minZeroCount.
5. If minZeroCount is still 0, repeat (multiple promotions can chain).

The complete implementation is shown in Listings 1–2 (after Section 3.3).

Early promotion. By default, DLL promotes when all buckets have stored >= 1 (EARLY_PROMOTE mode), meaning every bucket has been written at least once since the last promotion. This is safe because DLC (Section 4) correctly handles post-promotion empty buckets. Early promotion reduces the time each bucket spends at the maximum stored value, reducing overflow pollution in the DLL3 (3-bit) variant.

Merge considerations. When merging two DLL instances (e.g., from parallel threads), the merged result adopts the higher minZeros and takes the per-bucket max after re-framing. However, because each instance promoted independently based on its subset of the data, the merged tier distribution has more overflow than a single instance would — see Section 8 (Limitations) for quantification.
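The promotion steps above can be sketched with an unpacked bucket array. This is a simplified illustration, not the BBTools code: the real DLL packs 8 four-bit buckets per 32-bit word, and the names here (promoteIfPossible, allBucketsNonEmpty) are invented for the sketch.

```java
// Sketch of tier promotion (Section 3.2) over an unpacked bucket array.
// buckets[] holds "stored" values (0 = empty); minZeros is the shared floor.
public class TierPromotionSketch {
    int[] buckets;
    int minZeros = 0;

    TierPromotionSketch(int numBuckets) { buckets = new int[numBuckets]; }

    // Called after an update fills the last tier-0 bucket.
    void promoteIfPossible() {
        while (allBucketsNonEmpty()) {
            minZeros++;                                   // step 1: raise the floor
            int minZeroCount = 0;
            for (int i = 0; i < buckets.length; i++) {
                buckets[i]--;                             // step 2: re-frame
                if (buckets[i] == 0) minZeroCount++;      // steps 3-4: count new empties
            }
            if (minZeroCount > 0) break;                  // step 5: otherwise chain
        }
    }

    private boolean allBucketsNonEmpty() {
        for (int b : buckets) if (b == 0) return false;
        return true;
    }
}
```

If every bucket holds stored >= 2, the loop chains: two promotions happen back to back, exactly as step 5 describes.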
Self-similarity. Each interval between successive tier promotions — an era — is statistically self-similar. The cardinality at which the t-th promotion occurs is approximately B · 2^t · ln(B), and the bucket distribution at each promotion boundary is identically shaped (scaled by a factor of 2). This self-similarity is exploited for correction factor lookup at arbitrary cardinality (Section 5.2).

DLL4 is the primary 4-bit variant of DLL, storing 8 buckets per 32-bit integer. DLL3 is a 3-bit variant offering greater memory savings at slightly reduced accuracy for low-complexity data (Section 5.4). LL6 is traditional HyperLogLog with 6 bits per bucket stored in an unpacked byte array, used throughout this paper as a baseline for comparison. DDL is a mantissa variant described in Section 9.

3.3 Early Exit Mask

After t tier promotions, all buckets represent NLZ ≥ t, and thus any new element whose hash has fewer than t leading zeros cannot possibly update a bucket. Rather than computing the NLZ and checking, DLL uses a precomputed early exit mask (eeMask) to reject such elements with a single comparison:

    eeMask = 0xFFFFFFFFFFFFFFFFL >>> minZeros  // top minZeros bits cleared
    if (key > eeMask) return;                  // unsigned comparison

The fraction of elements that pass the mask is approximately 2^(−minZeros). Since minZeros ≈ log2(C/B) where C is the current cardinality, the early exit rate approaches 100% exponentially — the higher the cardinality, the faster DLL becomes per element. This is the origin of the "Dynamic" in DynamicLogLog: the algorithm dynamically adapts its workload to the data. At low cardinality (minZeros = 0), no elements are rejected and DLL processes every hash like traditional HLL. As cardinality grows and tiers promote, an exponentially increasing fraction of elements are rejected before any bucket array memory access.

Benchmark.
We measured wall-clock time processing 20 million 150bp reads (2.4 billion 31-mer adds) at a true cardinality of 1.72 billion (a medium-high-complexity stream with 72% distinct k-mers), using 2048 buckets on a single thread:

Configuration        Time    Notes
LL6 (HLL baseline)   23.4 s  Every hash accesses a bucket
DLL4 without eeMask  27.6 s  Tier promotion overhead, no early exit
DLL4 with eeMask     19.6 s  99.99% of hashes rejected before bucket access

Without eeMask, DLL4 is 18% slower than LL6 due to the overhead of tier promotion (the countAndDecrement scan at each promotion). The early exit mask more than compensates: DLL4 with eeMask is 16% faster than LL6 overall, despite the tier promotion cost. The true speedup of eeMask alone is 29% (27.6 s → 19.6 s), in a non-memory-constrained, single-estimator scenario. At this cardinality (minZeros ≈ 20), only 1 in ~10^6 hashes passes the mask (0.0001%). Of those that reach bucket-access logic, only 11.9% actually update a bucket value. The vast majority of the 19.6 s is spent on k-mer generation and hashing, which occur regardless of the early exit. The actual bucket-access time is reduced to near zero.

Crucially, the early exit has zero accuracy impact: every rejected element would have produced newStored <= oldStored and been a no-op anyway. The mask simply avoids the work of confirming this. The eeMask technique is applicable to any LogLog variant that uses tier promotion, and could be retroactively added to existing implementations.

Amortized cost. Tier promotions occur only log2(C/B) times over the lifetime of the estimator (each time minZeroCount reaches 0). Each promotion runs countAndDecrement once, looping over all B buckets. The total promotion cost is therefore O(B · log(C/B)), which is negligible relative to the O(C) cost of processing all elements — particularly since eeMask reduces the per-element cost to near zero at high cardinality.
(Note: the storage cost of minZeros is only log(log(C)) bits, since that many bits suffice to represent the promotion count; the distinction between the number of promotions and the bits needed to count them is the difference between time and space complexity.)

Listing 1. The complete hashAndStore method for DLL4. The early exit mask (line 4) rejects the vast majority of elements at high cardinality before any bucket access. MicroIndex (lines 7–8) provides a Bloom-filter floor for low cardinality. The tier promotion loop (lines 16–21) advances minZeros and tightens the early exit mask when all buckets have been filled.

Listing 2. The countAndDecrement method, called during tier promotion. Decrements every non-empty bucket's stored value by 1 and counts how many reach the minimum tier, determining whether another promotion is needed.

3.4 Memory Comparison

Variant       Bits/bucket  2048 buckets  4096 buckets  Relative to HLL-2048
HLL (6-bit)   6            1,536 B       3,072 B       1.00
DLL4 (4-bit)  4            1,024 B       2,048 B       0.67
DLL3 (3-bit)  3            768 B         1,536 B       0.50

At equal memory, DLL4 uses 50% more buckets than HLL, reducing the standard error by a factor of 1.22. DLL3 uses double the buckets, reducing error by 1.41 — though with some limitations for low-complexity data (Section 5.3). DLL4's 4-bit packing is additionally hardware-friendly: 8 buckets per 32-bit word with zero wasted bits. HLL's 6-bit packing wastes 2 bits per 32-bit word (only 5 buckets per word) or requires cross-word spanning. Beyond the bucket array, DLL requires a small constant overhead: one byte for the current tier floor (minZeros), which is the only architecturally required auxiliary state to allow cardinality over 2^255.
Our implementation additionally stores a 64-bit MicroIndex for improved low-cardinality estimation and cached early-exit state (eeMask + counter) for speed, totaling approximately 28 bytes of overhead independent of bucket count — negligible relative to the bucket array at any practical size.

4. Dynamic Linear Counting

4.1 The LC Ceiling Problem

Linear Counting estimates cardinality from the fraction of empty buckets:

    LC = B · ln(B / V)

This is remarkably accurate at low cardinality — when V is large relative to B. But LC has a hard ceiling: once all buckets are filled (V = 0), the estimate becomes infinite. In practice, LC degrades rapidly as V shrinks; above ~5B, its variance is too large for reliable use as a blending component. HyperLogLog addresses this by switching to the harmonic mean estimator above a threshold. But the transition is the source of HLL's characteristic error bulge: in the crossover region, neither estimator is well-suited, and the discontinuity between them introduces additional error.

4.2 DLC: Tier-Aware Linear Counting

DLL's tier structure offers a natural solution. At any tier t, define the tier-t empty count:

    V_t = V + Σ_{i<t} n_i

where V is the number of truly empty buckets and n_i is the number of buckets with absolute NLZ equal to i. Conceptually, V_t treats all buckets with NLZ below t as "empty for the purposes of tier t." The Dynamic Linear Counting estimate at tier t is then:

    DLC(t) = 2^t · B · ln(B / V_t)

At tier 0, this reduces to classic LC. At tier 1, buckets with NLZ = 0 are added to the "empty" pool, doubling the effective range. Each successive tier extends the range by another factor of 2. Where LC is accurate up to ~2.5B, DLC(t) is accurate up to approximately 2^t · 2.5B. The key insight is that DLC provides a useful estimate at every cardinality, not just at low cardinality.
At any given cardinality, there exists some tier t where V_t is near the optimal range (roughly B/6 to B/3, centered on the optimal V ≈ B/4 where LC's estimation error is minimized). DLC at that tier gives a low-error estimate (1.620% mean absolute error without CF; see Section 8.3, Table 3 and Figure 6) without needing the harmonic mean, correction factors, or a transition function. DLL's Hybrid blend (Section 4.6) still uses empirically chosen boundaries (0.2B and 5B), similar to HLL's 2.5B threshold — but the critical difference is that both estimators in DLL's blend (LCmin and Mean) are accurate throughout the crossover zone, so the exact cutoff values are not load-bearing. HLL's threshold is fragile because it switches between a failing estimator (LC) and a not-yet-accurate one (harmonic mean); DLL's boundaries merely define a window over which two good estimates are smoothly combined.

4.3 LCmin: Tier-Compensated Linear Counting

The simplest DLC variant uses the tier floor directly:

    LCmin = 2^minZeros · B · ln(B / V)

This equals DLC(t) at the lowest active tier (the tier corresponding to minZeros). After each tier promotion, the empty-bucket count resets and LC becomes accurate again — scaled by the accumulated promotion factor. LCmin serves as the low-cardinality anchor in DLL's hybrid blend.

4.4 DLCbest: Best Single-Tier Estimate

For each tier t, compute V_t and select the tier whose V_t is closest to a target occupancy (empirically, 25% free — i.e., V_t ≈ B/4):

    t* = argmin_t |V_t − B/4|

DLCbest returns DLC(t*). When two adjacent tiers are equidistant from the target, their estimates are averaged. This produces the most accurate single-point estimate at any cardinality, though it can be slightly noisier than the blended DLC due to tier switching.
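Sections 4.2–4.4 can be sketched as follows, using an unpacked array of absolute NLZ values (−1 for empty). The helper names (tierEmptyCount, dlcBest) are illustrative, not from the BBTools source, and the tie-averaging of adjacent tiers is omitted for brevity.

```java
// Sketch of DLC(t) and DLCbest (Sections 4.2-4.4). Assumption: absNlz[]
// holds each bucket's absolute NLZ, with -1 = empty; the real DLL derives
// these counts from its packed 4-bit buckets and shared minZeros floor.
public class DlcSketch {

    // V_t: buckets that are "empty for the purposes of tier t"
    // (truly empty, or absolute NLZ below t).
    static int tierEmptyCount(int[] absNlz, int t) {
        int vt = 0;
        for (int nlz : absNlz) if (nlz < t) vt++;
        return vt;
    }

    // DLC(t) = 2^t * B * ln(B / V_t); infinite when V_t == 0.
    static double dlc(int[] absNlz, int t) {
        int B = absNlz.length;
        int vt = tierEmptyCount(absNlz, t);
        if (vt == 0) return Double.POSITIVE_INFINITY;
        return Math.scalb(B * Math.log((double) B / vt), t);
    }

    // DLCbest: pick the tier whose V_t is closest to the target B/4.
    static double dlcBest(int[] absNlz, int maxTier) {
        double target = absNlz.length / 4.0;
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int t = 0; t <= maxTier; t++) {
            double d = Math.abs(tierEmptyCount(absNlz, t) - target);
            if (d < bestDist) { bestDist = d; best = t; }
        }
        return dlc(absNlz, best);
    }
}
```

Note how each increment of t folds one more NLZ level into the "empty" pool and doubles the 2^t scale factor, which is exactly why the tier estimates tile the cardinality range.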
4.5 DLC: Exponential Log-Space Blend

The production DLC estimator weights all tiers by their proximity to the target occupancy:

    w_t = exp(−λ · |V_t − V_target| / B)
    DLC = exp( Σ_t w_t · ln DLC(t) / Σ_t w_t )

The log-space averaging prevents outlier tiers from dominating. The decay constant λ = 9.0 and target occupancy V_target ≈ 0.25B (i.e., 75% full) were calibrated empirically across multiple bucket sizes. Because inter-tier occupancy boundaries vary dramatically across tiers — at tier 0, V_0 starts at B (all empty) and LC error starts at zero, while at high tiers V_t may start near 0 (all "filled" at that tier) and the single-tier error starts at infinity — a smooth transition to LCmin at low occupancy (V > 0.3B) prevents artifacts when few tiers are informative.

Parameter robustness. The DLC blend has three hand-picked parameters: the target occupancy (0.25B), the decay constant (λ = 9.0), and the LCmin transition zone (0.3B–0.5B). To quantify sensitivity, we varied each parameter and measured the uncorrected DLC mean absolute error over 4,000 DLL4 estimator instances at 2,048 buckets, averaged across 670 exponentially-spaced checkpoints starting from cardinality 1, with each checkpoint receiving equal weight (equivalent to uniform weighting over log-cardinality). The worst-case change was +0.005 percentage points of absolute error (from 1.369% to 1.374%, a 0.4% relative increase), confirming that none of the three parameters are load-bearing. The DLC blend's accuracy comes from the tiling of tier estimates across the cardinality range, not from fine-tuning of the blend parameters. DLC achieves accuracy comparable to CF-corrected HLL across the full cardinality range — without requiring a correction factor table. This is unique among the estimators evaluated: HLL, Mean, and Hybrid all require CF tables for optimal accuracy, and even FGRA (Section 8.3) embeds correction constants derived from theoretical analysis.
DLC is the only estimator that covers the full cardinality range with no pre-computed corrections of any kind. When correction factors are applied, DLC's accuracy improves further, as the CF addresses the small systematic biases inherent in the LC formula.

4.6 Logarithmic Hybrid Blend: Eliminating the Bulge

DLL's hybrid estimator — the Logarithmic Hybrid Blend — smoothly transitions between LCmin and the CF-corrected Mean estimator using a linearly interpolated blend with a logarithmically scaled mixing weight:

    Hybrid = LCmin                             if LCmin < 0.2B
    Hybrid = (1 − t) · LCmin + t · Mean · CF   if 0.2B <= LCmin <= 5B
    Hybrid = Mean · CF                         if LCmin > 5B

where t = ln(LCmin / 0.2B) / ln(5B / 0.2B). The mixing weight t is logarithmic in LCmin, providing smooth transitions at both endpoints. The critical differences from HLL's transition are that both endpoints are accurate in the crossover region, and the entire transition region is more accurate than either input curve, whereas HLL is less accurate than either input, and usually both. LCmin, being tier-compensated, remains accurate much further into the transition zone than raw LC. The Mean estimator with correction factor is accurate from moderate cardinality upward. The blend smoothly combines two good estimates, rather than switching between a failing estimate (LC) and a not-yet-accurate one (harmonic mean), as in HyperLogLog. The result is a hybrid estimator with no error bulge: the crossover region is as flat as the regions on either side of it. The blend region exceeds the accuracy of either input because the two estimates often exhibit error in opposite directions, resulting in partial cancellation.

5. Correction Factors

5.1 CF Table Structure

Like HyperLogLog, DLL benefits from correction factors (CFs) that compensate for systematic biases in the raw estimators. DLL uses a cardinality-indexed CF table: the correction factor for each estimator type is stored as a function of true cardinality.
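The Section 4.6 blend can be sketched as follows; this is a minimal illustration of the piecewise formula, with the boundaries 0.2B and 5B taken from the text and all names invented for the sketch.

```java
// Sketch of the Logarithmic Hybrid Blend (Section 4.6). Inputs are the
// LCmin estimate and the CF-corrected Mean estimate; the mixing weight
// is logarithmic in LCmin: 0 at 0.2*B, 1 at 5*B.
public class HybridBlendSketch {
    static double hybrid(double lcMin, double meanCf, int B) {
        double lo = 0.2 * B, hi = 5.0 * B;
        if (lcMin < lo) return lcMin;            // pure LCmin region
        if (lcMin > hi) return meanCf;           // pure Mean*CF region
        double t = Math.log(lcMin / lo) / Math.log(hi / lo);
        return (1.0 - t) * lcMin + t * meanCf;   // linear blend, log weight
    }
}
```

At the geometric midpoint of the window (LCmin = sqrt(0.2B · 5B) = B), t = 0.5 and the two estimates contribute equally, which is the behavior the logarithmic weight is designed to produce.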
The CF lookup uses iterative refinement: starting from the raw DLC estimate as a seed, look up the CF, compute the corrected estimate, use the corrected estimate as a new lookup key, and repeat until convergence. For DLL4, whose CF curves are nearly flat, a single iteration suffices. For DLL3, whose overflow correction introduces slight nonlinearity, 2–3 iterations are needed. CF tables are generated by simulation: many independent estimator instances with different hash functions are run across a range of known cardinalities, and the average bias at each cardinality point yields the multiplicative correction factor.

[Figure 2. "LC → Mean Transition: Why DLL Hybrid Stays Flat." Absolute error of LC, Mean (CF), HLL, and DLL Hybrid as cardinality increases, on a log Y-axis. LC and HLL diverge in the transition region while the Hybrid remains flat throughout. The reduced error of Hybrid compared to HLL in the 400–10,500 cardinality region is a combination of the improved blend between LC and Mean, and the inherent lower error of Mean compared to HMean (see Figures 4–5). Below cardinality ~20, the MicroIndex provides additional accuracy. (DLL4, 2048 buckets, 512k simulations.)]
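The iterative refinement described above can be sketched as follows. The CF table is modeled here as a simple function from an estimate to a multiplicative factor; the real table is cardinality-indexed with logarithmic key spacing, and all names in this sketch are illustrative rather than taken from the BBTools source.

```java
// Sketch of iterative CF refinement (Section 5.1): look up the CF keyed
// by the current corrected estimate, re-correct the raw estimate, and
// repeat until the corrected value stops moving.
import java.util.function.DoubleUnaryOperator;

public class CfRefineSketch {
    static double refine(double rawEstimate, DoubleUnaryOperator cfTable,
                         int maxIters) {
        double corrected = rawEstimate;
        for (int i = 0; i < maxIters; i++) {
            double next = rawEstimate * cfTable.applyAsDouble(corrected);
            if (Math.abs(next - corrected) < 0.001 * corrected) return next;
            corrected = next;  // corrected estimate becomes the new lookup key
        }
        return corrected;
    }
}
```

For a nearly flat CF curve (the DLL4 case), the first re-lookup already lands on the same table entry, so the loop converges after one iteration, matching the behavior described in the text.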
Figure 3. Three-panel demonstration of correction factors. (a) Implied CF values by estimator type. (b) Signed error without CF: large systematic biases are visible. (c) Signed error with CF: bias is eliminated across the full cardinality range. (DLL4, 2048 buckets, 512k simulations.)

5.2 Self-Similar CF Lookup

In many cases, such as DLL4 Hybrid, the CF appears to asymptotically approach a constant, but tier promotion effects are more pronounced for DLL3 (due to its higher overflow rate) and for DLC with any estimator type. DLL's tier structure is self-similar: the bucket distribution after each tier promotion is statistically identical to the previous era, scaled by a factor of 2. This means the correction factor pattern repeats every cardinality doubling:

CF(2c) = CF(c)

The CF table is generated at a fixed range, typically 2,048 buckets × 4,096 maximum multiplier = 8,388,608 maximum cardinality, with 1% logarithmic key spacing yielding approximately 1,600 entries per estimator type. To look up a correction factor at any cardinality beyond the table's range, the estimate is simply right-shifted until it falls within bounds:

while (estimate > tableMax) estimate >>>= 1;
CF = table.lookup(estimate);

Because the CF pattern repeats every cardinality doubling, this produces the correct correction factor regardless of the actual cardinality.
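A minimal runnable form of the fold, assuming only the doubling periodicity described above (the class and method names are ours):

```java
// Sketch of the self-similar CF fold (Section 5.2). Because CF(2c) = CF(c),
// halving the estimate until it fits the table range leaves the looked-up
// correction factor unchanged.
public final class CfFold {
    /** Fold an estimate into table range; the doubling periodicity makes this exact. */
    static long fold(long estimate, long tableMax) {
        while (estimate > tableMax) {
            estimate >>>= 1;   // one cardinality doubling per shift
        }
        return estimate;
    }
}
```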
The table is therefore a small, fixed-size constant, approximately 1,600 entries per estimator type, that serves the full cardinality range from 1 to 2^63 without extension or tiling. Only one CF table is needed regardless of the number of estimator instances.

5.3 Bucket Averaging Methods

All LogLog-family estimators must combine the maximum NLZ values stored across B buckets into a single cardinality estimate. Several averaging methods are possible; we evaluate three. Let NLZ_j denote the absolute NLZ stored in bucket j, and let the sums below run over filled buckets only (count = number of filled buckets). Define the difference value d_j = 2^(63 - NLZ_j), which is proportional to the probability of observing NLZ >= NLZ_j in a uniform 64-bit hash.

Occupancy-corrected Mean. Computes the arithmetic mean of the difference values and inverts:

Mean = ((count + B) / (2B)) * count * 2^64 / Σ_j d_j

Substituting d_j = 2^(63 - NLZ_j) and simplifying reveals:

Mean = ((count + B) / B) * count / Σ_j 2^(-NLZ_j)

HyperLogLog's harmonic mean (HMean). The classic HLL formula applied over filled buckets only:

HMean = α * count * 2^64 / Σ_j d_j

where α ≈ 0.72135 is Flajolet's bias-correction constant. Both formulas are fundamentally harmonic means of the underlying bucket probabilities, sharing the same core, count * 2^64 / Σ_j d_j. The only difference is the leading coefficient. HMean uses Flajolet's static constant α, which is derived under the assumption that all buckets are occupied. Mean uses the dynamic, occupancy-aware coefficient (count + B) / (2B), which tracks the actual number of filled buckets. This distinction explains three empirically observed regimes:

1. High cardinality (count = B): All buckets are filled, so Mean's coefficient reduces to (B + B) / (2B) = 1, a fixed constant, comparable to α. Because both formulas are now static scalar multiples of the shared core, the CF table maps them to identical values with identical variance.

2. Low cardinality (count << B): The vast majority of buckets are empty, and non-empty buckets almost all contain a single element. Both formulas effectively degenerate to tracking the occupied count, rendering them equivalent.

3.
Transition zone (cardinality ~0.71B to ~7.02B): The number of filled buckets (count) fluctuates due to binomial variation across random streams at the same true cardinality. Mean's dynamic coefficient (count + B) / (2B) absorbs this fluctuation, dampening the variance of the raw estimate before CF correction is applied. HMean's static α provides no such dampening.

In other words, Mean is not a fundamentally different estimator from HMean; it is a generalization that replaces Flajolet's constant-occupancy approximation with an exact occupancy correction. HMean is the special case of Mean where the coefficient is frozen. Mean is never worse than HMean at any cardinality tested, and is modestly but consistently better in the transition zone (e.g., 1.613% vs 1.645% mean absolute error).

Geometric Mean (GMean). Computes the geometric mean of the difference values and inverts:

GMean = ((count + B) / (2B)) * 2^64 * exp(-(1/count) Σ_j ln d_j)

By the AM-GM inequality, the geometric mean of the d_j is always at most the arithmetic mean, so inverting it produces a larger raw estimate than Mean. This translates to a large positive bias before correction. After CF correction, GMean achieves roughly twice the absolute error of Mean and HMean throughout the cardinality range. CF correction removes bias but cannot compensate for GMean's higher intrinsic variance. We use Mean in DLL's hybrid estimator because it is provably at least as accurate as HMean and requires no special constant.

5.4 DLL3: 3-Bit Variant and Overflow Correction

DLL3 uses 3 bits per bucket (stored values 0-7), encoding stored = relNlz + 1 with 0 reserved for empty, yielding relative NLZ values 0 through 6. This reduces memory by 25% compared to DLL4; equivalently, it allows 33% more buckets at the same memory. The tradeoff is that buckets whose relative NLZ exceeds 6 are clamped to stored = 7 (the maximum), creating a systematic underestimate at high cardinality.
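The 3-bit encoding and clamping just described can be sketched as follows (the method names are ours; the encoding rule is from the text):

```java
// Sketch of DLL3's 3-bit bucket encoding (Section 5.4): stored = relNlz + 1,
// with 0 reserved for empty. Relative NLZ above 6 is clamped into stored = 7,
// which therefore represents both legitimate relNlz = 6 and overflow.
public final class Dll3Encoding {
    static int encode(int relNlz) {          // relNlz >= 0
        return Math.min(relNlz, 6) + 1;      // clamp overflow into stored = 7
    }
    static int decode(int stored) {          // stored in 1..7; 0 means empty
        return stored - 1;                   // overflow is indistinguishable from relNlz = 6
    }
}
```

The indistinguishability of clamped overflow from relNlz = 6 is exactly what the overflow correction of the next subsection compensates for.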
No stored value is reserved for overflow; stored = 7 represents both legitimate relNlz = 6 and clamped overflow. Each affected tier accumulates approximately B ln(2B)/256 "ghost" overflow values per era.

Recording overflow. At each tier promotion, DLL3 counts the number of buckets at stored = 7 (the maximum) and records half this count as the estimated overflow for the next tier:

storedOverflow = topCount / 2

The factor of 1/2 arises from the geometric distribution: half the buckets at stored = 7 have true NLZ exactly equal to 6 + minZeros, while the other half have NLZ strictly greater (overflow victims).

Cumulative-space correction. The correction operates in reverse-cumulative space rather than on individual tier counts: cumRaw_t = Σ_{s >= t} count_s, the number of filled buckets whose stored tier is at least t.

Figure 4. Mean absolute error after CF correction for Mean, HMean, GMean, and DLC. Mean and HMean are both harmonic means of the same bucket probabilities; the only difference is Mean's dynamic occupancy coefficient (count + B)/(2B) versus HMean's static α_m. Below cardinality 0.71B and above 7.02B (arrows), occupancy is effectively constant and the two converge. In the transition region, Mean's dynamic coefficient absorbs occupancy fluctuations, yielding modestly lower error. GMean is consistently worse. (LL6, 2048 buckets, 512k simulations.)

Figure 5. Standard deviation of error after CF correction for the same estimators.
Mean and HMean show identical standard deviation outside the 0.71B-7.02B cardinality range, confirming that when occupancy is constant, the dynamic and static coefficients produce identical variance. The difference is confined to the partial-occupancy region where collisions are common. (LL6, 2048 buckets, 512k simulations.)

For each affected tier:

corrCum_t = cumRaw_t + X_t

where X_t is the stored overflow estimate for tier t. Working in cumulative space is essential: it makes each tier's correction independent, avoiding the cascading errors that arise when correcting individual tier counts. The corrected single-tier counts are recovered as differences of adjacent cumulative values.

With this correction, DLL3 matches DLL4's accuracy within 0.02% for high-complexity (all-unique) data. However, for low-complexity data with many duplicate values, tier promotions caused by duplicates interact with the overflow correction to produce a systematic positive bias above ~100B cardinality. DLL4, with its 15-tier range, is unaffected (see Figure 11).

Recommendation: DLL4 (4-bit) is the recommended default. DLL3 (3-bit) outperforms DLL4 in the same memory space only when the stream is known a priori to have near-maximal complexity.

6. MicroIndex

For very small cardinalities (below ~120 distinct elements), a 64-bit Bloom filter [9] is sufficient for accurate cardinality estimates, without allocating a bucket array. DLL addresses this with a MicroIndex: a single 64-bit integer that can be used either to lazy-allocate the bucket array once cardinality exceeds a threshold (resulting in larger cardinalities being underestimated by up to that threshold), or to increase the accuracy of LC in the low-cardinality region by adding hash-collision resilience. Six bits from the hash value (above the bucket selector bits) index into the 64-bit word, setting the corresponding bit.
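The MicroIndex update, and a standard 64-bucket Linear Counting estimate over its population count, can be sketched as follows. The field names and the exact bit-slice arithmetic are assumptions consistent with the description above.

```java
// Sketch of the MicroIndex (Section 6): a single 64-bit word updated from
// the 6 hash bits just above the bucket-selector bits. Field names and the
// slice position are assumptions; the LC formula is the standard one.
public final class MicroIndex {
    long bits;                 // the entire MicroIndex: one 64-bit word
    final int bucketBits;      // log2(number of main buckets)

    MicroIndex(int bucketBits) { this.bucketBits = bucketBits; }

    void add(long hash) {
        int slot = (int) ((hash >>> bucketBits) & 63); // 6 bits above the bucket selector
        bits |= 1L << slot;                            // set the corresponding bit
    }

    /** Linear Counting over 64 virtual buckets: -64 ln(empty / 64). */
    double estimate() {
        int empty = 64 - Long.bitCount(bits);
        return -64.0 * Math.log(Math.max(1, empty) / 64.0);
    }
}
```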
The resulting population count gives a miniature Linear Counting estimate over 64 virtual "buckets":

MicroEst = -64 * ln(max(1, 64 - popcount(microIndex)) / 64)

This is used as the best estimate prior to lazy array allocation. When the bucket array is allocated, the MicroIndex improves the accuracy of Linear Counting directly: at low cardinality, multiple elements can hash to the same main bucket but different MicroIndex positions, so the MicroIndex population count may exceed the number of filled buckets. LC is therefore calculated using an effective filled count of:

max(filledBuckets, popcount(microIndex))

This reduces V, the empty-bucket count (increasing the LC estimate), when the MicroIndex has detected collisions that the main bucket array missed, improving accuracy for cardinalities below ~64 where hash collisions within the 64-bit space are still informative. A secondary floor, max(estimator, MicroEst), provides a safety net.

The MicroIndex also enables lazy array allocation: the main bucket array need not be created until the MicroIndex saturates, saving memory for applications that create many estimators (e.g., per-k-mer tracking in bioinformatics). The MicroIndex requires zero additional memory beyond a single long field.

Additionally, in practice, cardinality estimates are capped at the number of elements added (which eliminates overestimation in maximally-complex streams), though this clamping was disabled for all figures in this paper.

7. Methods

7.1 Simulation Framework

All accuracy evaluations used purpose-built calibration drivers included in BBTools. Two simulators were developed to test estimator accuracy under different data distributions:

High-complexity simulator (DDLCalibrationDriver2). This driver tests accuracy on streams of all-unique elements.
Each thread creates one estimator instance at a time, feeds it uniformly random 64-bit values from a seeded PRNG (Xoshiro256++ [10]), hashed through a 64-bit finalizer based on Stafford's Mix13 [8], records estimator output at logarithmically-spaced cardinality thresholds (1% increments), then discards the instance and creates the next. This one-at-a-time design keeps the working set in L1/L2 cache for the entire run, eliminating cache effects from the accuracy measurement. The driver is fully deterministic: the same master seed produces identical results regardless of thread count. Results across all instances are merged after all threads complete. For each threshold, the driver records the signed relative error (estimate - true)/true, the absolute relative error, and the squared error for standard deviation computation. Correction factor tables are generated from the same output by computing CF = true/estimate averaged across all instances at each cardinality point.

Low-complexity simulator (LowComplexityCalibrationDriver). This driver tests accuracy on streams with bounded cardinality and many duplicate elements, simulating real-world data where the same elements appear repeatedly. A fixed array of cardinality unique random 64-bit values is generated from a master seed. Each estimator instance draws from this array with replacement, biased toward lower indices via min(rand(), rand()) to produce a skewed frequency distribution resembling natural data. Elements are added via hashAndStore() on every draw (including duplicates). True cardinality is tracked via a BitSet recording which unique values have been seen. Estimates are recorded on every add while the true cardinality is at a reporting threshold, capturing how the estimator's state evolves as duplicate values trigger tier promotions. Between thresholds, only hashAndStore() runs, with no rawEstimates() overhead.
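The skewed draw used by the low-complexity simulator can be sketched as follows. The pool-index helper is ours; the min(rand(), rand()) bias is from the text.

```java
// Sketch of the low-complexity simulator's skewed draw (Section 7.1):
// taking the minimum of two uniform draws biases indices toward the low
// end of the fixed value array, producing a skewed frequency distribution.
import java.util.Random;

public final class SkewedDraw {
    /** Draw an index in [0, n), biased toward lower indices. */
    static int skewedIndex(Random rnd, int n) {
        return Math.min(rnd.nextInt(n), rnd.nextInt(n));
    }
}
```

For a uniform draw the expected index is about n/2; for the minimum of two uniform draws it is about n/3, so low-index (frequent) elements are drawn disproportionately often.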
The iterations parameter controls how long each estimator runs beyond saturation: iterations=N means each estimator processes cardinality × N total adds. This allows testing behavior at extreme duplication levels where the ratio of total adds to unique elements is very high.

Reporting. Both simulators report results at exponentially-spaced cardinality thresholds: each threshold is at least 1% larger than the previous, yielding approximately 230 reporting points per decade of cardinality. Error metrics are the mean across all estimator instances at each threshold: signed error (bias), absolute error (accuracy), and standard deviation (precision).

7.2 Baseline Comparator

For fair comparison, we implemented LL6 (LogLog6), a 6-bit-per-bucket estimator that serves as a structural equivalent of HyperLogLog. LL6 stores one byte per bucket (6 bits used, 2 wasted), has no tier promotion, no shared exponent, and no eeMask early exit. Every hash accesses a bucket. LL6 uses the same estimator pipeline as DLL4 (the same CardinalityStats.fromNlzCounts() path, the same DLC computation, the same hybrid blend), isolating the effect of DLL's architectural innovations (shared exponent, tier promotion, eeMask) from estimator algorithm differences. LL6's correction factor table was generated independently using the high-complexity simulator with 512,000 instances at 2,048 buckets.

Note that LL6 is a structural baseline, not a reimplementation of HyperLogLog++ [4]. HLL++ adds empirical bias correction tables (k-nearest-neighbor interpolation at 200 calibration points) and a sparse representation at higher virtual precision for small cardinalities, which partially reduce the transition bulge but do not eliminate it. Our LL6 baseline uses DLL's estimator pipeline (including DLC and Hybrid), isolating the comparison to architectural differences rather than estimation algorithm differences.
7.3 Speed Benchmarks

Speed benchmarks were conducted on a 64-core compute node (AMD EPYC 7502P, 2.5 GHz; 32 KB L1d, 512 KB L2 per core; 128 MB shared L3; 512 GB RAM) using two approaches: (1) the BBDuk [7] bioinformatics tool processing 325 million 150bp Illumina reads (48.8 Gbp) with 16 threads, measuring end-to-end throughput in Mbp/s; and (2) the DDLCalibrationDriver in single-threaded benchmark mode (benchmark=t), which runs pure hashAndStore() loops with no estimate computation, measuring cache effects across varying numbers of simultaneous estimator instances.

8. Experimental Results

8.1 High-Complexity Evaluation

We simulated 512,000 independent estimator instances with 2,048 buckets each, out to a true cardinality of 8,388,608 (4,096 × the bucket count), sampled at 1,297 exponentially-spaced checkpoints.

The HLL bulge (Figure 1). LL6 (HLL) exhibits a pronounced error spike in the transition region between Linear Counting and the harmonic mean estimator, peaking at 34.1% mean absolute error near cardinality 3,000-5,000 (1.5-2.5 × B). In contrast, DLL4's Hybrid estimator and DLC both remain below 2% error throughout the entire range, with no visible transition artifact.

Stable estimation methods. At a readable scale (0-4% error), HLL's spike goes off-chart while Hybrid, DLC, and FGRA (ULL/UDLL6) maintain relatively flat profiles throughout the cardinality range. Hybrid (with CF) achieves 1.574% log-weighted average error; DLC (without CF) achieves 1.620%. Both outperform HLL's 2.684% log-weighted average, and the peak error difference is dramatic: Hybrid peaks at 1.835% vs HLL's 34.136%. FGRA also eliminates the bulge entirely via its single closed-form estimator (see Section 8.3).
Figure 6. Individual DLC tier estimates (DLC0-DLC19) each have a U-shaped accuracy curve, accurate near their optimal cardinality range. The DLC blend (red) threads through the valleys of all tier curves. (DLL4, 2048 buckets, 512k simulations.)

DLC tier structure (Figure 6). The individual DLC tier estimates (DLC0 through DLC19) each have a U-shaped accuracy curve: accurate near their optimal cardinality range and diverging above and below. The tiers tile the cardinality range like overlapping scales, each covering approximately one octave. The DLC blend (red) threads through the valleys of all tier curves, selecting the optimal tier at each cardinality.

Speed (Figure 7). In a production BBDuk run processing 325 million 150bp reads with 16 threads, DLL4 achieved 1,415 Mbp/s compared to 1,182 Mbp/s for LL6 (20% faster). Disabling eeMask reduced DLL4 to 987

Figure 7. BBDuk throughput (Mbp/s) processing 325M reads with 16 threads, 2,048 buckets. As a QC tool, BBDuk measures k-mer cardinality for both input and output reads (before and after filtering/trimming), so each k-mer is added twice, using two estimators per thread. DLL4 with eeMask achieves 1,415 Mbp/s; without eeMask it drops to 987 Mbp/s. DDL achieves 1,729 Mbp/s. ULL achieves 901 Mbp/s; UDLL6 achieves 1,388 Mbp/s (54% faster than ULL).
Mbp/s (17% slower than LL6), demonstrating that eeMask is essential to overcome tier promotion overhead. DDL, which uses the same eeMask with 8-bit buckets and simpler byte-array packing, achieved 1,729 Mbp/s (70% faster than without eeMask). ULL achieved 901 Mbp/s and UDLL6 achieved 1,388 Mbp/s (54% faster than ULL), nearly matching DLL4. ULL lacks early exit entirely; UDLL6's int-packed registers and conservative eeMask recover most of the speed (see Section 8.3 for details).

Figure 8. Throughput (million adds/sec, single-threaded) vs number of simultaneous estimator instances, for DLL4 and LL6 at 2048 and 8192 buckets. LL6's larger per-instance footprint causes severe cache thrashing at high counts; DLL4's eeMask eliminates most memory accesses regardless of cache pressure. At 4096 simultaneous 8192-bucket estimators, DLL4 is 12.9× faster; at 8192 estimators, where neither data structure fits in cache, DLL4 is still 3.1× faster.

Cache thrashing (Figure 8). When multiple estimator instances are used simultaneously (as in per-k-mer cardinality tracking), LL6 suffers progressive cache thrashing as the combined working set exceeds the CPU cache hierarchy. This single-threaded benchmark measures pure hashAndStore() throughput with varying numbers of simultaneous estimators on the AMD EPYC 7502P (32 KB L1d, 512 KB L2 per core, 128 MB shared L3). With 512 simultaneous 8,192-bucket estimators (total working set: 2 MB for DLL4, 4 MB for LL6, exceeding L2 but fitting in L3), DLL4 achieves 2.3× the throughput of LL6, because eeMask eliminates >99.99% of memory accesses regardless of cache pressure.
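An eeMask-style early exit can be illustrated as follows. This is a plausible sketch, not DLL's actual mask construction: it assumes the shared exponent establishes a floor of minZeros leading zeros, so any hash with a 1 bit among its top minZeros bits cannot update any bucket and can be rejected without a memory access.

```java
// Illustrative sketch of an eeMask-style early exit. Assumption: a hash
// whose NLZ is below the shared floor (minZeros) cannot raise any bucket,
// so testing the top minZeros bits suffices to reject it.
public final class EarlyExit {
    long eeMask;      // bit positions a qualifying hash must have as 0

    void setFloor(int minZeros) {
        // Top minZeros bits set; 0 when minZeros == 0 (no early exit possible).
        eeMask = (minZeros == 0) ? 0L : (-1L << (64 - minZeros));
    }

    /** True when the hash can be rejected without touching the bucket array. */
    boolean earlyExit(long hash) {
        return (hash & eeMask) != 0;  // hash has fewer leading zeros than the floor
    }
}
```

Under this assumption, one AND and one compare replace the bucket-array access for the vast majority of hashes at high cardinality, which is why the early exit is insensitive to cache pressure.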
At the extreme (8,192 simultaneous 8,192-bucket estimators, total working set: 32-64 MB, exceeding even L3 in effective terms given the dual-socket CPU setup), DLL4 remains 3.1× faster than LL6.

8.2 Low-Complexity Evaluation

To test robustness under realistic conditions where many input elements are duplicates, we used the low-complexity simulator with 256 buckets, 16,384 estimator instances, 8 iterations, and cardinalities up to 100 million (390,000 × the bucket count).

DLL4 stability (Figure 11). DLL4 shows no error degradation at any cardinality tested, including well past tier 15 (the point at which all buckets may have, and most have, experienced overflow) and into tier 18. The error profile remains flat from low to extreme cardinality, confirming that the 15-tier range of 4-bit storage is sufficient

Figure 9. DLL3 vs DLL4 absolute error under high-complexity data with CF correction. All four estimators achieve nearly identical accuracy, demonstrating that DLL3's 3-bit overflow correction is effective for all-unique streams. (2048 buckets, 512k simulations.)

Figure 10. Signed error for DLL3 and DLL4 under high-complexity conditions. Both are calibrated to within ±0.1% across the full range, confirming CF correction is essentially perfect for all-unique data.
Figure 11. DLL4 absolute error under low-complexity conditions (256 buckets, 16k instances, 8 iterations, up to 100M cardinality). Error remains flat through tier 15 and beyond; the 4-bit range is sufficient for practical use.

for practical use. Buckets for these high-cardinality tests were restricted to 256 to allow reaching high tiers with high precision, as the amount of simulation time is O(estimators × buckets × 2^maxTier). DLL4's Hybrid achieves 4.83% average and 5.30% peak absolute error under low-complexity conditions.

DLL3 under low-complexity (Figures 12-13). DLL3's behavior under low-complexity data is nuanced and depends on both the cardinality-to-bucket ratio and the duplication rate. At moderate cardinalities, DLL3 shows comparable log-weighted average error to DLL4: Hybrid achieves 4.89% (vs DLL4's 4.83%) and DLC achieves 5.05% (vs 5.03%). However, DLL3's peak error is moderately higher: Hybrid peaks at 7.39% (vs DLL4's 5.30%).

The divergence onset shifts depending on simulation parameters. In our 100-million-cardinality test with 8 iterations (reaching ~93 million true cardinality out of 100 million possible), the error was still rising at the end of the simulation. With more iterations (a higher duplication rate) or a closer approach to maximum cardinality, the divergence would likely increase further. We report the maximum observed divergence under our test conditions but cannot bound the worst case, as the error depends on the specific duplication pattern and how closely the true cardinality approaches the array size.

Without overflow correction (Section 5.3), DLL3 would strictly undercount in both high- and low-complexity cases, because clamping always loses upward information.
The overflow correction successfully compensates for this in high-complexity data, where DLL3 matches DLL4 within 0.02% (Figure 9). However, under low-complexity data with sustained duplication, the correction overcorrects: overflowed duplicates that were already accounted for by the correction can trigger additional tier promotions, causing the same lost information to be compensated twice, producing a systematic positive bias. This overcorrection is responsible for DLL3's higher peak error (7.39% vs DLL4's 5.30%); the divergence is caused by the correction, not despite it.

Table 4. Mean absolute error under low-complexity conditions (256 buckets, 16k instances, 8 iterations, up to 100M cardinality). DLL3 matches DLL4 at moderate cardinality but diverges at extreme duplication.

Figure 12. DLL3 vs DLL4 Hybrid absolute error under low-complexity data. DLL3 tracks DLL4 closely at moderate cardinality but diverges at extreme duplication. (256 buckets, 16k instances, 8 iterations.)

Figure 13. DLL3 vs DLL4 DLC signed error under low-complexity data. The creeping positive bias above tier 7 (marked by the trend onset) is clearly visible for DLL3; DLL4 remains flat throughout.
Estimator  Method  Avg Abs Error  Peak Abs Error
DLL4       Hybrid  0.04833        0.05304
DLL4       DLC     0.05028        0.05691
DLL3       Hybrid  0.04892        0.07388
DLL3       DLC     0.05051        0.06864

HLL (LL6) achieves 5.68% average error under the same conditions, comparable to DLL4's DLC (5.03%), but its peak error of 29.8% is 5.6× worse than DLL4's worst case (5.30%), as the LC-to-HLL transition bulge is amplified under duplication.

8.3 Complementary Nature of DynamicLogLog and UltraLogLog

UltraLogLog (ULL) [5] takes an orthogonal approach to improving HLL's space efficiency: rather than sharing state to shrink registers while maintaining information content, it stores additional sub-NLZ history bits per register (2 extra bits recording whether updates with NLZ values one and two below the register's maximum occurred), increasing both the size and information content of each register. This enables more efficient estimation via Ertl's FGRA (Fisherian Generalized Remaining Area) estimator, yielding a 28% lower memory-variance product while preserving HLL's independent-register architecture and full merge semantics. Thus, both ULL and DLL increase the information density of registers using opposite but compatible methods, as each method modifies a different part of the register: DLL increases the density of the NLZ section, while ULL adds a new higher-density section.

FGRA also eliminates the LC-to-HLL transition bulge: because it is a single closed-form estimator that uses all register information simultaneously, it has no discontinuous handoff between low-cardinality and high-cardinality regimes (Figure 14). ULL does not, however, provide early-exit acceleration: every hash must be fully processed.

The two approaches are complementary. DLL's shared exponent requires only 4 bits for relative NLZ, potentially freeing bits for sub-NLZ history.
We tested this directly by implementing DynamicUltraLogLog (UDLL6): a 6-bit-per-register variant that combines DLL4's relative encoding and tier promotion with ULL's 2-bit sub-NLZ history and FGRA estimator. Each 6-bit register stores 4 bits of relative NLZ plus 2 bits of history, in the same encoding as ULL but relative to the shared exponent. This yields 2,048 registers in 1.5 KB, or 1,640 bytes as implemented, due to int-packing 5 registers per 32-bit word with 2 bits unused per word (versus ULL's 2 KB for 2,048 8-bit registers, or DLL4's 1 KB for 2,048 4-bit registers).

Figure 14 and Table 3 compare accuracy across memory budgets. Two weighting schemes are shown: log-weighted (unweighted average of error at exponentially-spaced checkpoints, corresponding to what is visible on a log-scale plot and reflecting an exponentially-distributed cardinality workload) and cardinality-weighted (each checkpoint weighted by its cardinality, emphasizing steady-state behavior at high cardinality and corresponding to a linear-scale view).

For equal-memory comparisons at non-power-of-2 bucket counts, we use a variant of DLL4 that selects buckets via unsigned modulo rather than bitmask, allowing arbitrary bucket counts (e.g., 3,072 buckets for exactly 1.5 KB at 4 bits per bucket). This uses the same estimator pipeline and CF tables as the bitmask version.

Table 3. Mean absolute relative error at equal memory, averaged to 8,388,608 cardinality. Log-wt = unweighted average across exponentially-spaced checkpoints; Card-wt = cardinality-weighted average emphasizing steady state. HLL = uncorrected LL6 (6-bit HyperLogLog) as baseline. DLC rows use no correction factor; all other methods use CF where applicable.
Estimator  Method  Memory  Buckets  Log-wt   vs HLL  Card-wt  vs HLL  Peak
LL6        HLL     1.5 KB  2,048    0.02684  —       0.01839  —       0.34136
DLL4       Hybrid  1 KB    2,048    0.01572  -41%    0.01830  -0.5%   0.01834
DLL4       DLC     1 KB    2,048    0.01617  -40%    0.01898  +3%     0.01928
DLL4       Hybrid  1.5 KB  3,072    0.01266  -53%    0.01494  -19%    0.01501
DLL4       Hybrid  2 KB    4,096    0.01085  -60%    0.01294  -30%    0.01298
DLL4       DLC     2 KB    4,096    0.01114  -58%    0.01340  -27%    0.01373
ULL        FGRA    1 KB    1,024    0.01741  -35%    0.01949  +6%     0.01955
ULL        FGRA    2 KB    2,048    0.01210  -55%    0.01377  -25%    0.01381
UDLL6      FGRA    1.5 KB  2,048    0.01210  -55%    0.01376  -25%    0.01380

The log-weighted and cardinality-weighted averages tell complementary stories. At equal register count (2,048 buckets), DLL4 at 1 KB and HLL at 1.5 KB have nearly identical steady-state error (Card-wt: 0.01830 vs 0.01839), but DLL4's elimination of the LC-to-HLL transition bulge gives it a 41% advantage in log-weighted error; the bulge contributes disproportionately because it spans a wide range on the log scale. When DLL4 is given the same memory as HLL (1.5 KB = 3,072 buckets via modulo addressing), its advantage becomes substantial under both weightings: 53% log-weighted, 19% cardinality-weighted.

UDLL6 at 1.5 KB and ULL at 2 KB overlap almost perfectly (Figure 14), demonstrating that the fusion of DLL's shared exponent with ULL's sub-NLZ history achieves ULL-level accuracy at 75% of the memory. At equal memory (1.5 KB), UDLL6 outperforms DLL4 by 4-8%, suggesting that FGRA's use of sub-NLZ history provides a small but consistent advantage over the DLC/Hybrid pipeline at the same information budget. ULL at 1 KB (1,024 registers) exhibits 6% higher mean error than HLL at 1.5 KB in steady state despite its substantially more efficient estimator: FGRA extracts enough information from 1,024 8-bit registers to nearly match 2,048 4-bit registers, but not quite.
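The int-packed register layout described in this section (5 six-bit registers per 32-bit word, leaving 2 bits unused, so 2,048 registers occupy 410 ints = 1,640 bytes) can be sketched as follows; the accessor names are ours.

```java
// Sketch of 6-bit register int-packing (Section 8.3): 5 registers per
// 32-bit word with 2 bits unused, so 2,048 registers fit in 1,640 bytes.
public final class PackedRegisters {
    final int[] words;

    PackedRegisters(int registers) {
        words = new int[(registers + 4) / 5];  // 5 registers per 32-bit word
    }

    int get(int idx) {
        return (words[idx / 5] >>> (6 * (idx % 5))) & 0x3F;
    }

    void set(int idx, int value) {
        int shift = 6 * (idx % 5);
        words[idx / 5] = (words[idx / 5] & ~(0x3F << shift)) | ((value & 0x3F) << shift);
    }
}
```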
ULL's improvement over HLL at equal register count is dramatic (25% lower cardinality-weighted error at 2,048 buckets), confirming that FGRA is a major advance. Yet DLL4 at equal memory still outperforms ULL: the additional buckets enabled by 4-bit storage provide a larger accuracy gain than ULL's richer per-register information.

Adapting ULL's register encoding to DLL's relative framework required rearranging the bit layout so that the NLZ portion occupies the high bits of each register, enabling the same unsigned-comparison early exit used by DLL4. However, because UDLL6 must accept hashes whose NLZ falls within the 2-bit history margin below the current floor (to update sub-NLZ history for registers near the floor), its eeMask threshold is necessarily 2 bits more conservative than DLL4's. This reduces the early exit rate, which may partially explain UDLL6's lower throughput. In production, UDLL6 is 54% faster than ULL (1,388 vs 901 Mbp/s in BBDuk; Figure 7) and nearly matches DLL4 with eeMask (1,415 Mbp/s). The speed gain over ULL comes from int-packed 6-bit registers (5 per 32-bit word, 1,640 bytes for 2,048 registers) combined with the conservative eeMask that rejects most hashes before any register access.

8.4 History-Corrected Hybrid Estimation

UltraLogLog's per-register history bits (Section 8.3) record whether updates with NLZ values one and two below the register's maximum occurred, providing sub-NLZ information that FGRA exploits via Fisherian analysis. An alternative to FGRA is direct per-state correction: each of the 2^h history states (where h is the number of history bits) exhibits a characteristic bias relative to the tier average, and this bias can be measured by simulation and stored as a small additive correction table. We call the resulting estimators Hybrid+n (or Mean+n), where n is the number of history bits used. For 2-bit history, each register falls into one of 4 states based on whether the two tracked NLZ updates have been observed.
Simulation over 6.5 million trials per state yields a 4-entry additive correction in NLZ-space (e.g., state 0 contributes a negative offset to the effective NLZ, while state 3 contributes +0.21). These per-state corrections are applied before the harmonic mean computation, making each register a less biased estimator of its local cardinality. The approach generalizes naturally to 1-bit and 3-bit history (2 and 8 correction entries, respectively). We denote the resulting estimator types by their total bits per register: UDLL5 (4 NLZ + 1 history), UDLL6 (4 NLZ + 2 history), and UDLL7 (4 NLZ + 3 history). Note that UDLL6 is architecturally identical to the UDLL6 described in Section 8.3 — the same data structure supports both FGRA and Hybrid+2 as estimation methods.

Figure 14. Mean absolute relative error for DLL4, ULL, UDLL6, and HLL at equal memory budgets (512k estimators, up to 8.4M cardinality). UDLL6 at 1.5 KB and ULL at 2 KB overlap almost perfectly, demonstrating that UDLL6 achieves ULL-level accuracy at 75% of the memory. HLL (uncorrected LL6) shown for reference; all DLL/ULL variants use correction.

Figure 15.
Hybrid accuracy by history bits: DLL4 Hybrid (0-bit), UDLL5 Hybrid+1 (1-bit), UDLL6 Hybrid+2 (2-bit), and UDLL7 Hybrid+3 (3-bit), all at 2,048 buckets with CF correction. Each additional history bit provides diminishing improvement: the first bit reduces error by 10.4%, the second by an additional 8.1%, the third by only 3.4%. (512k simulations.)

Table 5. Mean absolute error by estimation method (2,048 buckets, CF-corrected, 512k simulations). Log-wt = unweighted average across exponentially-spaced checkpoints; Card-wt = cardinality-weighted average.

Estimator  Method     Bits/reg  Log-wt   Card-wt
DLL4       Hybrid     4         0.01573  0.01830
UDLL5      Hybrid+1   5         0.01410  0.01551
UDLL6      Hybrid+2   6         0.01287  0.01383
UDLL7      Hybrid+3   7         0.01231  0.01289
UDLL6      FGRA       6         0.01210  0.01376

The first history bit provides the largest improvement (−10.4% log-weighted, −15.2% cardinality-weighted), with −10.8% for the second and −6.8% for the third (cardinality-weighted). At 2 history bits, UDLL6 Hybrid+2 nearly matches UDLL6 FGRA (log-weighted 0.01287 vs 0.01210; cardinality-weighted 0.01383 vs 0.01376) using only a 10-entry correction table, compared to FGRA's full Fisherian analysis with its series coefficients. At equal memory, history bits outperform extra buckets:

Table 6. Memory-fair comparison: UDLL Hybrid+n vs DLL4 Hybrid at equal memory (512k simulations).

Memory (bytes)  UDLL Hybrid+n          Log-wt   Card-wt  DLL4 Hybrid           Log-wt   Card-wt  Log-wt winner  Card-wt winner
1,368           UDLL5 (2,048 buckets)  0.01410  0.01551  DLL4 (2,560 buckets)  0.01394  0.01637  DLL4 by 1.1%   UDLL5 by 5.2%
1,640           UDLL6 (2,048 buckets)  0.01287  0.01383  DLL4 (3,072 buckets)  0.01266  0.01494  DLL4 by 1.6%   UDLL6 by 7.4%
1,824           UDLL7 (2,048 buckets)  0.01231  0.01289  DLL4 (3,584 buckets)  0.01163  0.01382  DLL4 by 5.5%   UDLL7 by 6.8%

The two weighting schemes tell different stories: DLL4 wins on log-weighted error (by 1–6%), where its extra buckets reduce low-cardinality variance, while UDLL Hybrid+n wins on cardinality-weighted error (by 5–7%), where history bits improve steady-state accuracy.
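The Hybrid+2 estimation step can be sketched as below. This is a hedged illustration, not the paper's calibrated estimator: the correction values are placeholders (only state 3's +0.21 is quoted above; the real table is derived from simulation), and alpha is the standard HLL bias constant rather than the CF-corrected pipeline.

```python
# Hypothetical 4-entry correction table for 2-bit history states; only the
# +0.21 entry for state 3 is taken from the text, the rest are placeholders.
STATE_CORRECTION = [-0.20, -0.05, 0.05, 0.21]

def hybrid_estimate(registers, floor, num_buckets):
    """Harmonic-mean cardinality estimate with per-state NLZ correction.
    registers: list of (relative_nlz, history_state) pairs; the effective
    NLZ of a register is floor + offset + its state's additive correction,
    applied before the harmonic mean as described in Section 8.4."""
    s = sum(2.0 ** -(floor + r + STATE_CORRECTION[h]) for r, h in registers)
    alpha = 0.7213 / (1.0 + 1.079 / num_buckets)  # standard HLL constant
    return alpha * num_buckets * num_buckets / s
```

The point of the design is that the correction is a table lookup per register, so Hybrid+n adds essentially no cost over the plain harmonic mean.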
UDLL6 Hybrid+2 offers the best practical tradeoff: 93.7% packing efficiency in 32-bit words, a 7.4% cardinality-weighted advantage over DLL4, and near-FGRA accuracy with minimal machinery.

8.5 Layered Dynamic Linear Counting (LDLC)

The per-state history corrections can also improve DLC. At each tier, the cumulative DLC empty count V_t can be refined by incorporating History-only Linear Counting (HC): an LC estimate computed using only the history bits of registers at the current tier boundary. DLC and HC exhibit systematic biases that are approximately anti-phase — where DLC overestimates at a given cardinality, HC tends to underestimate, and vice versa — producing characteristic sine-wave patterns in signed error that are visible across tier boundaries (Figure 16, upper panel). LDLC blends these two estimates as a weighted sum: 60% DLC + 40% HC. The anti-phase cancellation reduces the periodic bias component, yielding an 8.1% improvement in log-weighted error over DLC alone (0.01467 vs 0.01597 at 2,048 buckets). Like DLC, LDLC requires no correction factor table — the accuracy gain comes entirely from combining two complementary estimators that are individually unbiased on average but biased in opposite directions at any given cardinality.

Figure 16. Anti-phase sine wave cancellation in LDLC. Upper panel: signed error of DLC (blue) and HC (gold) shows opposite-phase periodic structure; the LDLC blend (red) threads between them. Lower panel: absolute error showing LDLC's 8.1% improvement over DLC. (UDLL6, 2,048 buckets, 1.5 KB, 512k simulations.)

9. Discussion

DynamicLogLog demonstrates that the LogLog framework's memory and speed can be substantially improved by sharing state across buckets. The shared-exponent design is conceptually simple — analogous to floating-point representation with a global absolute exponent and per-bucket relative exponent offsets — but its consequences are far-reaching: 33% memory savings, 16–29% higher throughput (depending on workload), and the elimination of the LC-to-HLL transition artifact.

Figure 17. Memory-fair comparison at 1.5 KB: DLL4 Hybrid and DLC (3,072 4-bit), UDLL6 Hybrid+2, LDLC, and FGRA (2,048 6-bit). All estimators eliminate the LC-to-HLL transition artifact; history-corrected methods achieve the lowest error. (512k simulations.)

The DLL family of estimators is perhaps the most practically significant contribution. Where HLL requires a carefully tuned transition between two fundamentally different estimators (LC and harmonic mean), DLC provides a single, unified framework that is accurate across the full cardinality range. The tier-aware extension of LC is natural given DLL's architecture, but it could in principle be applied to any LogLog variant that tracks per-bucket NLZ values. The per-state history correction mechanism (Hybrid+n) demonstrates that a remarkably small correction table — just 4 entries for 2-bit history — captures most of the information that FGRA extracts through full Fisherian analysis.
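The LDLC blend described in Section 8.5 reduces to a few lines; the sketch below assumes the DLC and HC cardinality estimates have already been computed from the same sketch, and uses the fixed 60/40 weights stated in the text.

```python
import math

def linear_count(num_buckets: int, empty: int) -> float:
    """Classic Linear Counting estimate from an empty-register count [1]."""
    return num_buckets * math.log(num_buckets / empty)

def ldlc(dlc_estimate: float, hc_estimate: float) -> float:
    """LDLC: fixed 60/40 weighted sum of the DLC estimate and the
    history-only LC (HC) estimate; their anti-phase periodic biases
    partially cancel, with no correction table required."""
    return 0.60 * dlc_estimate + 0.40 * hc_estimate
```

Because both inputs are individually unbiased on average, any fixed convex combination is also unbiased on average; the 60/40 split is the point where the opposite-phase periodic components cancel best in the reported simulations.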
LDLC further shows that anti-phase error cancellation between DLC and history-only LC can improve accuracy without any correction tables at all. A mantissa variant, DynamicDemiLog (DDL), trades bucket count for per-bucket collision resistance by storing fractional NLZ bits. This enables set comparison operations analogous to MinHash sketching, as well as element frequency histograms. These capabilities will be described in a forthcoming paper.

Limitations. DLL's shared-exponent design introduces a tradeoff for parallel merge operations. When multiple DLL instances process disjoint subsets of a stream and are merged (via per-bucket max), each instance promotes tiers less aggressively than a single instance seeing all data. This increases the effective overflow at merge boundaries, producing a systematic underestimate. With DLL4's 15-tier range, the effect is minor in practice — approximately 0.02% undercount with 8 merged instances, 0.15% with 16, and 1.4% with 64, on a 1.72-billion-element dataset with 2,048 buckets. For parallel cardinality estimation of a single stream, we recommend using a single synchronized DLL instance rather than per-thread copies. Traditional HLL, which has no tier promotion, does not suffer from this issue.

Figure 18. Merge accuracy: mean absolute error vs number of merged DLL4 instances, at true cardinality 1.72 billion with 2,048 buckets. Error is negligible at 8 instances (0.02%) and grows to 1.4% at 64 instances.

Future work. The eeMask early-exit technique is applicable to any LogLog-family estimator that adopts tier promotion, and could be retroactively added to existing implementations.
Additionally, the self-similar CF structure (Section 5.2) generalizes to any estimator with periodic error structure. The merge undercount could potentially be addressed by a merge compensation factor derived from the number of merged instances and the tier gap between them, though this remains to be investigated.

10. Conclusion

DynamicLogLog achieves higher accuracy and 16–29% higher throughput per element than HyperLogLog at 33% less memory. The key innovations — shared-exponent storage, early exit masking, and Dynamic Linear Counting — are individually simple but collectively transformative. DLL provides a drop-in replacement for HyperLogLog in any application where cardinality estimation is needed, with significant advantages in accuracy, speed, and memory, and only a minor disadvantage in merge accuracy for highly parallel workloads (Section 9). Furthermore, DLL's framework is complementary with UltraLogLog's per-register history approach: their fusion (UDLL6) achieves ULL-level accuracy at 75% of the memory, and per-state history corrections (Hybrid+2) nearly match FGRA using only a 10-entry correction table. Layered Dynamic Linear Counting (LDLC) extends this further, achieving 8% better accuracy than DLC through anti-phase error cancellation — with no correction tables at all. These results demonstrate that architectural compaction and information-theoretic efficiency are orthogonal improvements that compose naturally.

Acknowledgements

Simulations and benchmarks were conducted on the Dori computing cluster at the Joint Genome Institute, Lawrence Berkeley National Laboratory. The work conducted by the U.S. Department of Energy Joint Genome Institute (https://ror.org/04xm1d337), a DOE Office of Science User Facility, is supported by the Office of Science of the U.S. Department of Energy operated under Contract No. DE-AC02-05CH11231.
Synthetic benchmark reads were generated using randomgenome.sh and randomreadsmg.sh from the BBTools suite and are fully reproducible by any user with those tools. The author thanks Alex Copeland for reviewing and commenting on the manuscript, and Chloe for substantial contributions to algorithm implementation, calibration infrastructure, and manuscript preparation throughout this project.

Disclosure of AI assistance. This work made extensive use of large language model (LLM) AI tools, including Anthropic Claude and Google Gemini. The calibration drivers, benchmarking infrastructure, and supporting analysis scripts described in the Methods were developed with AI assistance. The initial draft of this manuscript, including all sections, was prepared with AI assistance and subsequently reviewed, revised, and validated by the author. All algorithmic design decisions, experimental methodology, interpretation of results, and scientific conclusions are solely those of the author, who takes full responsibility for the accuracy of all content. AI tools were used as an instrument of productivity, not as a source of scientific judgment. This disclosure is provided in accordance with emerging journal policies on AI-assisted authorship and in the interest of full transparency.

References

[1] K.-Y. Whang, B. T. Vander-Zanden, and H. M. Taylor, "A linear-time probabilistic counting algorithm for database applications," ACM Transactions on Database Systems, vol. 15, no. 2, pp. 208–229, 1990, doi: 10.1145/78922.78925.
[2] M. Durand and P. Flajolet, "Loglog counting of large cardinalities," in Algorithms – ESA 2003, Lecture Notes in Computer Science, vol. 2832. Springer, 2003, pp. 605–617. doi: 10.1007/978-3-540-39658-1_55.
[3] P. Flajolet, É. Fusy, O. Gandouet, and F.
Meunier, "HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm," in Proceedings of the 2007 International Conference on Analysis of Algorithms (AofA), Discrete Mathematics and Theoretical Computer Science, 2007, pp. 137–156.
[4] S. Heule, M. Nunkesser, and A. Hall, "HyperLogLog in practice: Algorithmic engineering of a state of the art cardinality estimation algorithm," Proceedings of the EDBT 2013 Conference, pp. 683–692, 2013, doi: 10.1145/2452376.2452456.
[5] O. Ertl, "UltraLogLog: A practical and more space-efficient alternative to HyperLogLog for approximate distinct counting," Proceedings of the VLDB Endowment, vol. 17, no. 7, pp. 1655–1668, 2024, doi: 10.14778/3654621.3654632.
[6] P. Flajolet and G. N. Martin, "Probabilistic counting algorithms for data base applications," Journal of Computer and System Sciences, vol. 31, no. 2, pp. 182–209, 1985, doi: 10.1016/0022-0000(85)90041-8.
[7] B. Bushnell, "BBTools: Bioinformatics tools for sequence analysis." 2014. Available: https://bbmap.org
[8] T. Stafford, "Better bit mixing — improving on MurmurHash3's 64-bit finalizer." 2011. Available: https://zimbry.blogspot.com/2011/09/better-bit-mixing-improving-on.html
[9] B. H. Bloom, "Space/time trade-offs in hash coding with allowable errors," Communications of the ACM, vol. 13, no. 7, pp. 422–426, 1970, doi: 10.1145/362686.362692.
[10] D. Blackman and S. Vigna, "Scrambled linear pseudorandom number generators," ACM Transactions on Mathematical Software, vol. 47, no. 4, pp. 1–32, 2021, doi: 10.1145/3460772.