Sketched k-mer representation
We represent our sequences and genomes in terms of k-mers. We do not use all available k-mers in our sequences. Instead, we use a sketch15 of all k-mers using the FracMinHash method21; given a hash function h that maps k-mers to [0, M], we retain a k-mer x only if h(x) < M/c. The expected fraction of subsampled k-mers is exactly \(\frac{1}{c}\) under appropriate uniformity and stochasticity assumptions on the hash function; hence, c can be thought of as the rate of subsampling. In practice, we use minimap2’s66 k-mer hash function implementation.
Our sketched representation has several benefits. First, we take c = 200 by default, speeding up computation and lowering memory costs by a factor of roughly 200. Second, two subsampled k-mers are unlikely to be near each other in a read or a genome, making them approximately independent of one another in a statistical sense, as discussed below.
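To illustrate, a minimal Python sketch of the FracMinHash retention rule follows; Python’s built-in `hash` is used only as a stand-in for minimap2’s k-mer hash function, and all names are illustrative rather than sylph’s actual implementation:

```python
# FracMinHash retention rule: keep k-mer x iff h(x) < M / c.
M = 2**64  # hash range [0, M)
c = 200    # subsampling rate; ~1/c of k-mers are retained in expectation

def keep_kmer(kmer: str) -> bool:
    h = hash(kmer) % M  # stand-in for a uniform k-mer hash (e.g., minimap2's)
    return h < M // c

def fracminhash_sketch(seq: str, k: int = 31) -> set:
    """Return the retained (sketched) k-mers of seq."""
    kmers = (seq[i:i + k] for i in range(len(seq) - k + 1))
    return {x for x in kmers if keep_kmer(x)}
```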
Independent substitution model with read sampling
Sylph assumes a generative probabilistic model, where calculating ANI becomes a question of parameter estimation under our model. The model consists of two parts: sequence evolution and read sampling. We consider the independent substitution model without spurious k-mers67 and uniform random read sampling with error.
Assume we are given a genome S. The first part of our random model is an independent substitution model on S parameterized by θ ∈ [0, 1]: every letter in S mutates to another letter independently with probability θ, resulting in a mutated version denoted S′.
Definition 1
Given random substitution parameter θ, let τ = 1 − θ. In our model, τ is defined as the (fractional) ANI between S and S′.
Note that we define ANI to be fractional for mathematical convenience.
The second modeling part is a read-sampling step, independent of the first part, where reads are sampled from the mutated genome S′ uniformly at random. Each read has some error rate ϵ, where errors are also modeled as independent substitutions. Importantly, under the uniformity assumption and assuming reads are short enough, we can approximate the depth of coverage as Poisson distributed: if the average number of times a base is covered (its true coverage) is δ, the number of times a base is covered follows a Poisson distribution with mean δ.
We extend this Poisson assumption to k-mers. That is, the k-mer multiplicity, the number of times a specific k-mer is seen, is also Poisson distributed. However, because each read has its own error rate ϵ drawn from some distribution Dϵ, the actual coverage of k-mers is modified to \(\delta \cdot {{\mathbb{E}}}_{\epsilon \sim {D}_{\epsilon }}[{(1-\epsilon )}^{k}]\); a k-mer may not be ‘covered’ if an error falls within it on a read. Note that random errors are distinct from consistent mutations between genomes. In addition, for a read of length L, there are only L − k + 1 k-mers; thus, the coverage decreases by a factor of (L − k + 1)/L. Assuming all reads have length L, we provide a second definition.
Definition 2
Assuming read length L, k-mer length k, coverage δ and read error rate ϵ~Dϵ under a distribution Dϵ, we define the effective coverage λ as follows:
$$\lambda =\delta \cdot {{\mathbb{E}}}_{\epsilon \sim {D}_{\epsilon }}[{(1-\epsilon )}^{k}]\cdot (L-k+1)/L.$$
In practice, we observe k-mer multiplicities and not the true coverage; hence, the distribution is Poisson with parameter λ. That is, sylph only sees the effective coverage under our model and not the true coverage δ. We discuss true coverage computation in ‘Computation of effective and true coverage’ (Methods).
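As a worked example of definition 2, the following Python sketch computes λ under the simplifying assumption that Dϵ is a point distribution at a single error rate ϵ:

```python
def effective_coverage(delta: float, eps: float, L: int, k: int = 31) -> float:
    """lambda = delta * E[(1 - eps)^k] * (L - k + 1) / L, with D_eps a
    point distribution at eps, so E[(1 - eps)^k] = (1 - eps)^k."""
    return delta * (1 - eps) ** k * (L - k + 1) / L

# e.g., true coverage 10x, 1% error rate, 150-bp reads, k = 31:
# lambda = 10 * 0.99^31 * 120/150 ~ 5.9, well below the true coverage.
lam = effective_coverage(10.0, 0.01, 150)
```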
Shotgun metagenome k-mer model
Under the above model, we can now precisely state the input and output for sylph. Let R be a set of reads from our sample. We model R as the output from uniform read sampling for a true underlying set of genomes G1, G2, …, Gn at effective coverages λ1, λ2, …, λn, with all final reads mixed together. We model an input genome \({G}_{i}^{{\prime} }\) as mutating from Gi because of random substitutions at ANI τi. Our goal is to infer and output the true λi and τi parameters from our k-mer data under this model.
Definition 3
Given an input reference genome \({G}_{i}^{{\prime} }\) and a fixed hash function resulting in N FracMinHash k-mers for \({G}_{i}^{{\prime} }\), let X1, …, XN be the multiplicities of \({G}_{i}^{{\prime} }\)’s k-mers in the reads R.
We make the following assumptions for modeling purposes: (1) the k-mers for each genome within the metagenome are unique; (2) if a k-mer is mutated in \({G}_{i}^{{\prime} }\), it does not appear in the metagenome; and (3) if a k-mer in a read is erroneous, it does not show up in any genome. In practice, assumption 1 is frequently violated (for example, because of mobile elements or repetitive elements); we take these issues into account later in the inference step. Assumptions 2 and 3 are reasonable in practice.
Under these assumptions, we derive our model of Xi’s distribution. If the k-mer for Xi is present in R, then Xi is Poisson with parameter λi. However, if the k-mer corresponding to Xi is mutated and, thus, differs between Gi and \({G}_{i}^{{\prime} }\), we assume that it will not exist in R according to assumption 2; therefore, Xi is 0. The probability of a k-mer being not mutated is \({\tau }_{i}^{k}\); hence, it follows that \({X}_{i} \sim {\rm{Pois}}({\lambda }_{i}){\rm{Bern}}({\tau }_{i}^{k})\), where the Poisson and Bernoulli random variables are independent by assumption. In other words, \({X}_{i} \sim {\rm{ZIP}}({\tau }_{i}^{k},{\lambda }_{i})\), where ZIP is a zero-inflated Poisson distribution. If the sampling locations of the FracMinHash k-mers for X1, …, XN are close together on the genome, these random variables become strongly correlated, as nearby k-mers are frequently covered by the same reads. However, we subsample k-mers by FracMinHash; therefore, on average, our k-mers are c bases apart. Hence, we assume that the k-mers for X1, …, XN are sufficiently spread out that treating the Xi as statistically independent is reasonable in practice.
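To make the model concrete, the following Python sketch simulates k-mer multiplicities under this zero-inflated Poisson model; the parameter values are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_multiplicities(n: int, tau: float, lam: float, k: int = 31):
    """Draw X_1, ..., X_n ~ Pois(lam) * Bern(tau^k): each k-mer survives
    mutation with probability tau^k and, if so, is covered Pois(lam) times."""
    survives = rng.random(n) < tau ** k
    return np.where(survives, rng.poisson(lam, size=n), 0)

X = simulate_multiplicities(100_000, tau=0.99, lam=1.5)
```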
Inference for λ
We can now rephrase our problem precisely as inferring the parameters τk and λ from N independent and identically distributed (i.i.d.) samples X1, …, XN ~ ZIP(τk, λ) for each genome in the database. Literature exists for inference from i.i.d. ZIP samples, but existing maximum-likelihood and method-of-moments estimators depend on the means and variances of the Xi (ref. 68), which are not robust to the long-tailed outliers commonly present in metagenomic data (for example, mobile elements and low-complexity k-mers). As such, we use a ratio-of-multiplicity estimator for λ, as used by Skmer23, which we define below.
We first define Na to be the number of Xi with multiplicity a. It follows that, for all a ≥ 1, \({\mathbb{E}}[{N}_{a}]=N\cdot \Pr ({X}_{i}=a)=N\cdot {\tau }^{k}\cdot ({e}^{-\lambda }{\lambda }^{a})/a!\). Notice that \({\mathbb{E}}[{N}_{a+1}]/{\mathbb{E}}[{N}_{a}]=\lambda /(a+1)\), which does not depend on τ. We have samples of Na available from data; thus, we can use this ratio as our estimator.
Definition 4
The ratio-of-multiplicity λ estimator was inspired by ξ from Sarmashghi et al.23. Our estimator for λ is as follows:
$$\hat{\lambda }=\frac{{N}_{a+1}}{{N}_{a}}\cdot (a+1).$$
(1)
(The estimator used in Skmer23 differs from sylph’s in that their Na variables are absolute k-mer frequencies over all k-mers in the sample, whereas our Na variables count only k-mers contained in a reference genome.)
In practice, the estimator’s bias is small except for very small λ (Supplementary Fig. 3). We discuss the zero-denominator case in ‘Practical thresholds’ (Methods).
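For illustration, a minimal Python version of the ratio-of-multiplicity estimator, including the sparsity guards described in ‘Practical thresholds’ (Methods), could be written as follows; the function name is ours, not sylph’s:

```python
import numpy as np

def estimate_lambda(X: np.ndarray):
    """lambda_hat = (N_{a+1} / N_a) * (a + 1), where a >= 1 is the nonzero
    multiplicity with the most k-mers (usually a = 1)."""
    counts = np.bincount(X[X > 0])       # counts[m] = N_m for m >= 1
    if counts.size < 2:
        return None
    a = counts[1:].argmax() + 1          # most populated nonzero bin
    if a + 1 >= counts.size or counts[a] < 3 or counts[a + 1] < 3:
        return None                      # zero/sparse denominator: give up
    return counts[a + 1] / counts[a] * (a + 1)
```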
λ-adjusted containment index and ANI
We described how sylph obtains λ in the previous section but not the ANI τ. The central idea for calculating ANI is simple: check how many k-mers from the reference genome are contained in our reads. Let A be the set of sketched k-mers in our genome and B be the set of sketched k-mers in our reads. We call \(\frac{| A\cap B| }{| A| }\) the containment index20. Suppose A and B are k-mers from two complete genomes, without read sampling. In this case, the containment index can be transformed into an estimator for τ = ANI by a simple formula, which we call the naive ANI:
$$\frac{| A\cap B| }{| A| }\approx {\tau }^{k}={{\rm{ANI}}}^{k},$$
where the approximate equality can be made precise18 under an appropriate random model.
The formula does not take into account coverage and read sampling. It also does not take into account large structural differences between the two genomes, which we discuss further in Supplementary Note 1. To adjust for read sampling, we derive a different ANI formula by the following argument: we can write \(| A\cap B| =\mathop{\sum }\nolimits_{i = 1}^{N}I({X}_{i}\ge 1)\), where I is the indicator 0–1 random variable. Thus, the expected value is as follows:
$${\mathbb{E}}[| A\cap B| ]=N\cdot \Pr ({\rm{ZIP}}({\tau }^{k},\lambda )\ge 1)=N\cdot [{\tau }^{k}(1-{e}^{-\lambda })].$$
Remembering that we assume a fixed, nonrandom hash function for FracMinHash, whereby ∣A∣ = N, we get the following:
$${\mathbb{E}}\left[\frac{| A\cap B| }{| A| }\right]={\tau }^{k}(1-{e}^{-\lambda }).$$
(2)
This immediately gives rise to our final ANI estimator, the λ-adjusted containment ANI estimator.
Definition 5
The coverage-adjusted ANI estimate is as follows:
$$\hat{\tau }={\left(\frac{| A\cap B| }{| A| }\cdot \frac{1}{(1-{e}^{-\lambda })}\right)}^{1/k}.$$
(3)
We can use this estimator once we know λ. In practice, we first estimate λ using the \(\hat{\lambda }\) estimator in equation (1), which does not depend on τ, and then plug \(\hat{\lambda }\) into equation (3) to obtain our final ANI estimate \(\hat{\tau }\). Note that the λ-adjusted ANI in equation (3) can exceed 1, in which case we threshold \(\hat{\tau }\) at 1.
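Putting the two estimators together, a Python sketch of the coverage-adjusted ANI calculation follows, with a worked example showing how much the adjustment matters at low coverage:

```python
import math

def coverage_adjusted_ani(containment: float, lam: float, k: int = 31) -> float:
    """tau_hat = (containment / (1 - e^{-lam}))^{1/k}, thresholded at 1 (eq. 3)."""
    tau_hat = (containment / (1 - math.exp(-lam))) ** (1 / k)
    return min(tau_hat, 1.0)

# At containment |A ∩ B|/|A| = 0.5 and lam = 1, the naive ANI is
# 0.5^(1/31) ~ 0.978, whereas the adjusted ANI is
# (0.5 / (1 - e^-1))^(1/31) ~ 0.992.
ani = coverage_adjusted_ani(0.5, 1.0)
```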
Practical thresholds
In practice, we apply three filters: (1) we apply the λ adjustment for ANI only if the median multiplicity of the k-mers X1, …, XN is ≤3, as the adjustment factor 1/(1 − e−λ) in equation (3) is negligible when λ ≥ 3; (2) we proceed with λ adjustment only if Na ≥ 3 and Na+1 ≥ 3, where a is the nonzero k-mer multiplicity with the most k-mers (usually a = 1); these two filters ensure that λ adjustment is performed only when it is necessary and when enough information is available; and (3) we output ANIs only if the number of FracMinHash k-mers for a genome is >50 by default, although this can be changed for smaller genomes and contigs.
Masking multicopy and dependent k-mers
Note that we can remove or mask k-mers in A, the k-mers in our reference genomes, without affecting the inference of our parameters, even if that k-mer is in B. This masking procedure simply reduces our sample size during inference by discarding some of the sample Xi. We can, thus, remove FracMinHash k-mers that are not ‘satisfactory’. First, we only take unique FracMinHash k-mers in the reference genome, as our k-mer Poisson model is predicated on the assumption of unique k-mers. Second, we remove FracMinHash k-mers in the reference genome if they are less than 30 bp apart by default, as k-mers that are too close have strongly correlated coverage, which is not suitable under our assumption of independence. We found that the latter step gives better inferences for small c.
Handling paired-end fragments and duplicated reads
For real short-read and paired-end data, we found two major failures of our uniformly distributed read assumption, which in turn affected our Poisson coverage model: (1) for paired-end reads, if the two reads in a pair overlap too much because of a small fragment length69, k-mers can be double-counted; and (2) polymerase chain reaction (PCR) duplicates70, such as from Illumina sequencing, can double-count reads and, thus, k-mers.
To deal with failure 1, we use a simple heuristic: if a FracMinHash k-mer appears twice across a pair of reads, we consider this k-mer as counted only once. To deal with failure 2, we use a simple locality-sensitive hashing technique for deduplicating reads that we describe below. In Supplementary Figs. 22–24, we show that deduplication greatly improves detection for low-abundance genomes.
We first explain the deduplication algorithm for paired-end reads. We start by scanning through each read pair and, for each FracMinHash k-mer x detected, we take the first 32-mer on both the forward read and the reverse read and we mask both 32-mers in a 010101…01 pattern, ending up with two 16-mers that we denote y1 and z1. We get another pair of two 16-mers by masking with a 1010…10 pattern instead, which we call y2 and z2. We then insert the tuples (x, y1, z1) and (x, y2, z2) into a set membership data structure, as discussed below. If one or both of these two tuples are already present, we ignore the FracMinHash k-mer and assume that it is because of a PCR duplicate.
The above algorithm is locality sensitive in that, if two PCR duplicates differ by at most one substitution, it is guaranteed not to double-count the k-mers coming from a PCR duplicate, as at least one of the two masking patterns retains only unchanged bases. In practice, it can often tolerate more. For our set membership data structure, we use a scalable cuckoo filter71,72 implementation with false positive rate P = 0.0001, which we found to decrease memory consumption while not affecting inference.
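A simplified Python sketch of the paired-end deduplication step follows, using an ordinary set in place of the scalable cuckoo filter; which parity of bases each mask pattern keeps is a convention chosen here for illustration:

```python
seen = set()  # stand-in for the scalable cuckoo filter

def count_kmer_once(x: str, fwd_read: str, rev_read: str) -> bool:
    """Return True if FracMinHash k-mer x should be counted, or False if
    the read pair looks like a PCR duplicate of one already seen."""
    y, z = fwd_read[:32], rev_read[:32]
    # Two complementary half-masks over the 32-mers: a single substitution
    # changes at most one of them, so true duplicates still collide.
    key1 = (x, y[1::2], z[1::2])  # 0101...01 mask -> two 16-mers
    key2 = (x, y[0::2], z[0::2])  # 1010...10 mask -> two 16-mers
    if key1 in seen or key2 in seen:
        return False
    seen.add(key1)
    seen.add(key2)
    return True
```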
For single-end reads, we make a few changes. First, we ignore reads with length > 400 bp, as these are likely to be long reads. Secondly, instead of letting z1 and z2 be from the second read in the pair, we take z1 and z2 to be the k-mers starting from the middle of the read. Finally, we only deduplicate for the FracMinHash k-mers with multiplicity < 4, as PCR duplicates are indistinguishable from true reads for high-coverage, single-end read sets.
Bootstrapping for uncertainty quantification
We found the main source of uncertainty to be the estimation of λ from read sampling when λ is very small, which propagates to uncertainty in τ. Other sources of uncertainty, such as that from FracMinHash sampling, have been shown to be small when the ANI is greater than 90% for bacterial genomes18.
To quantify the uncertainty in ANI because of λ uncertainty, we perform a bootstrap on the resulting distribution of Xi (the k-mer multiplicities). Sylph resamples (with replacement) the k-mer multiplicities over 100 iterations and performs coverage-adjusted ANI calculations for each iteration. Sylph then takes the 5th and 95th percentile ANI estimates as the 90% confidence interval. Notably, a resample may have Na or Na+1 equal to 0, in which case we discard that resample. We only proceed if >50 of the resamples give nonzero Na and Na+1. Note that our bootstrapping procedure is fundamentally different from that in a related study73, where resampling already-sampled reads causes issues; we resample k-mers from the reference genome, not k-mers from reads.
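A condensed Python sketch of this bootstrap, reusing `estimate_lambda` and `coverage_adjusted_ani` from the sketches above, is shown below:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ani_ci(X: np.ndarray, k: int = 31, iters: int = 100):
    """90% confidence interval on ANI by resampling the reference k-mer
    multiplicities X_1, ..., X_N with replacement."""
    anis = []
    for _ in range(iters):
        Xb = rng.choice(X, size=X.size, replace=True)
        lam = estimate_lambda(Xb)
        if lam is None:              # N_a or N_{a+1} zero: discard resample
            continue
        containment = (Xb >= 1).mean()
        anis.append(coverage_adjusted_ani(containment, lam, k))
    if len(anis) <= 50:              # too few usable resamples
        return None
    return np.percentile(anis, [5, 95])
```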
We show the confidence interval coverage probabilities (‘coverage’ here referring to confidence interval coverage, not read-sampling coverage) in Supplementary Table 3 for a synthetic test. In general, sylph’s confidence intervals are slightly tight but the coverage probabilities are not less than 70% under simulations. Under real samples with more sources of uncertainty, we expect our intervals to be tight but still provide a rough quantification of uncertainty.
Computation of effective and true coverage
Let m be the median k-mer multiplicity for a given genome. The effective coverage for a genome is output as the λ estimated from equation (1) if m ≤ 3, as a robust mean k-mer coverage when 4 ≤ m ≤ 15 or as m otherwise. The robust mean k-mer coverage is calculated by taking the mean multiplicity of k-mers with multiplicity less than α, where α is the smallest number such that Pr(Pois(m) > α) < 10−10 for a Poisson random variable Pois(m) with mean m. We found that this robust mean is more accurate than the median for small (effective) coverages because the median is necessarily an integer. However, for large coverages, the robust mean is unnecessary; thus, we use the median instead.
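A Python sketch of this output rule (reusing `estimate_lambda` from above) might look as follows:

```python
import numpy as np
from scipy.stats import poisson

def genome_coverage(X: np.ndarray):
    """Output lambda_hat if the median multiplicity m <= 3, a robust mean
    for 4 <= m <= 15 and the median itself otherwise."""
    m = np.median(X)
    if m <= 3:
        return estimate_lambda(X)      # ratio estimator, equation (1)
    if m <= 15:
        alpha = poisson.isf(1e-10, m)  # smallest a with Pr(Pois(m) > a) < 1e-10
        return X[X < alpha].mean()     # robust mean, long tail trimmed
    return m
```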
The effective coverage, λ, can be much smaller than the true coverage, δ, because of read errors and small read lengths (definition 2). Users can toggle sylph to estimate the true coverage instead by using the following method. To estimate δ, we need to know L and \({{\mathbb{E}}}_{\epsilon \sim {D}_{\epsilon }}[{(1-\epsilon )}^{k}]\), which we denote E. This gives the following formula:
$$\delta =\frac{\lambda }{E}\cdot \frac{L}{L-k+1}.$$
Sylph allows the user to input a fixed read length L; otherwise, the mean single-end read length is used. Although read lengths for long reads may be heterogeneous, read length has a negligible effect on effective coverage for long reads; hence, this is not an issue in practice. For estimating E, if the user inputs a read error rate ϵ, sylph simply sets E = (1 − ϵ)k; that is, Dϵ, the distribution of error rates, is taken as a point distribution at ϵ. If the user does not input a read error rate, sylph estimates E as follows. Defining na as the number of k-mers in the read sketch with multiplicity a, sylph automatically estimates E as
$$E=1-\frac{{n}_{1}}{{\sum }_{a\ > \ 1}a\cdot {n}_{a}}.$$
This formula comes from two assumptions: (1) most k-mers with multiplicity 1 are erroneous k-mers from highly covered genomes rather than k-mers from lowly covered genomes; and (2) erroneous k-mers are unique (the same error does not occur more than once). Thus, \({\sum }_{a > 1}a\cdot {n}_{a}\cdot (1-E)\approx {n}_{1}\), giving the above formula. We used sylph’s automatic E estimation for all samples unless explicitly stated or unless the sample was from soil or ocean, where ϵ = 0.005 (99.5% identity) was assumed. In soil and ocean samples, we found that assumption 1 does not hold; we thus encourage users to provide an approximate ϵ of 0.001–0.010 for short reads instead.
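The true-coverage conversion and the automatic estimation of E can be sketched in a few lines of Python; here, `read_sketch_counts` stands for the multiplicities of all k-mers in the read sketch:

```python
import numpy as np

def estimate_E(read_sketch_counts: np.ndarray) -> float:
    """E = E[(1 - eps)^k] estimated as 1 - n_1 / sum_{a>1} a * n_a, where
    n_a is the number of read-sketch k-mers with multiplicity a."""
    n = np.bincount(read_sketch_counts)
    a = np.arange(n.size)
    return 1 - n[1] / (a[2:] * n[2:]).sum()

def true_coverage(lam: float, L: float, E: float, k: int = 31) -> float:
    """delta = lambda * L / (L - k + 1) / E."""
    return lam * L / (L - k + 1) / E
```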
Profiling by k-mer reassignment
After ANI calculation, sylph can output results with the ‘query’ command. However, such results are not a metagenomic profile with accurate relative abundances. Shared k-mers within similar genomes in the database may cause multiple related species to have high ANI when only one of the species is actually present.
To reassign shared k-mers, we use a winner-take-all heuristic, which is also an option in Mash Screen and sourmash. The ‘profile’ command finds putative ANIs in the same manner as ‘query’ but then assigns each k-mer uniquely to the genome with the highest ANI. For example, if a k-mer with a multiplicity of 10 in the sample is present in three database genomes, all ten ‘copies’ of the k-mer are assigned to the genome with the highest containment ANI and assumed to be absent from the other two. The ANI computation is then performed again considering only these reassigned k-mers, giving our final ANIs. Sylph then outputs genomes with ANI >95% by default.
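The reassignment heuristic can be sketched as follows; the data structures are illustrative simplifications of sylph’s internals:

```python
def reassign_kmers(kmer_to_genomes: dict, putative_ani: dict) -> dict:
    """Winner-take-all: every copy of a shared k-mer is credited only to
    the genome with the highest putative (containment) ANI."""
    assigned = {g: set() for g in putative_ani}
    for kmer, genomes in kmer_to_genomes.items():
        winner = max(genomes, key=lambda g: putative_ani[g])
        assigned[winner].add(kmer)
    return assigned  # ANI is then recomputed on these k-mers only
```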
Outputting abundances, percentage of reads detected and taxonomic profiles
Let GL1, …, GLq be the genome lengths of the q genomes passing the threshold. Let λi and δi denote the effective and true coverage of the ith detected genome. Sylph outputs the taxonomic abundance for each genome as \({\lambda }_{i}/\mathop{\sum }\nolimits_{j = 1}^{q}{\lambda }_{j}\) and the sequence abundance as \({\lambda }_{i}\cdot G{L}_{i}/\mathop{\sum }\nolimits_{j = 1}^{q}{\lambda }_{j}\cdot G{L}_{j}\). We estimate the percentage of reads detected at the species level as \(100\cdot \mathop{\sum }\nolimits_{i = 1}^{q}{\delta }_{i}\cdot G{L}_{i}\) divided by the total number of bases in the reads, outputting 100% if this value exceeds 100. Importantly, the percentage of reads detected depends on the true coverage.
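These outputs can be expressed compactly; the following Python sketch takes per-genome effective coverages, true coverages and genome lengths as arrays:

```python
import numpy as np

def output_abundances(lam, delta, GL, total_read_bases):
    """Taxonomic abundance, sequence abundance and percentage of reads
    detected for the q genomes passing the ANI threshold."""
    lam, delta, GL = map(np.asarray, (lam, delta, GL))
    taxonomic = lam / lam.sum()
    sequence = (lam * GL) / (lam * GL).sum()
    pct_reads = min(100.0, 100.0 * (delta * GL).sum() / total_read_bases)
    return taxonomic, sequence, pct_reads
```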
We provide a script for turning a metagenomic profile without taxonomy information into a taxonomic profile for multiple prebuilt databases. This script can be easily customized by adding taxonomic accessions to arbitrary genomes. It works by translating each genome in the output to a taxonomic accession string and aggregating sylph’s output abundances at each taxonomic rank; it is independent of the core algorithm.
Synthetic metagenome construction and benchmarking
To construct the undercharacterized synthetic metagenome (Fig. 2a), we first computed nearest-neighbor ANIs from the newer GTDB-R214 database to GTDB-R89. All genome-to-genome ANI calculations were performed with skani74. We then arbitrarily selected 50 genomes with nearest-neighbor ANI of 95–97.5% and 150 genomes with nearest-neighbor ANI of 85–90% (Fig. 2a). Notably, ANI and the aligned fraction between two genomes are positively correlated75 (Supplementary Fig. 25), implying that the low-ANI genomes in the synthetic community will likely have large-scale variation not found in our references. We then sampled 3 Gbp of 2× 150-bp reads for this dataset using wgsim76, with abundances following a log-normal distribution whose underlying normal distribution has mean 2 and s.d. 1, as applied in other benchmarks38. Ten samples were created for this dataset. Exact software commands for profiling are shown in Supplementary Note 4. For the five synthetic metagenomes binned by 1% ANI (Fig. 2b), we generated, for each 1% ANI bin, a metagenome with 50 genomes and 750 Mbp of reads using the same distributions as in the undercharacterized case.
MetaPhlAn4 could not be fairly profiled against the undercharacterized case because its database contains species-level genomes in the genus-level holdout set. We attempted to profile MetaPhlAn4 on the 95–100% binned ANI dataset; its database should contain all GTDB-R89 species, as its taxonomy can be mapped to GTDB-R207, a newer database11. Therefore, we first mapped taxa from MetaPhlAn4’s default NCBI taxonomy to the GTDB-R207 database using a script included with MetaPhlAn4. Then, we projected the representative genome for each GTDB-R207 taxon to the nearest-neighbor ANI genome’s species name in GTDB-R89. One issue we found was that MetaPhlAn4 SGBs could collapse species together in the GTDB-R89 taxonomy, thus picking the wrong representative because of the inherent incompatibility between the SGB definition and GTDB. This lowered the precision and sensitivity (Supplementary Figs. 8 and 15).
CAMI2 benchmarking details
The Strain Madness and Marine challenge datasets from the CAMI2 challenge were used37. The Marine dataset consisted of 777 genomes, as well as 200 circular elements. The Strain Madness dataset consisted of 408 genomes with high strain diversity, including over 180 Streptococcus pneumoniae strains in a single sample.
To obtain CAMI2 results, we took the profiles for MetaPhlAn (version 2.9), mOTUs (version 2.5) and KMCP as compiled in the KMCP study32 and the official CAMI2 repository. We reran all profiles using the latest version of OPAL34 (Supplementary Table 4) because we found slightly different values depending on the version of OPAL. For sylph, we took all genomes in the provided RefSeq CAMI2 snapshot (January 8, 2019) with a taxid present and used these genomes as sylph’s database (with default settings). TaxonKit77 was used for converting sylph’s output into a CAMI-compatible profile.
We also ran MetaPhlAn4 and mOTUs3 ourselves because no official profiles are available. However, these profilers use more comprehensive databases than the official submissions; thus, direct comparison is not completely fair. We discuss CAMI2 taxonomy issues in Supplementary Note 2.
In the Marine dataset, we found that many of the detected species had closely related genomes in the provided database; sylph’s median ANI for detected genomes was 99.8% over all Marine samples. There were also species in the true profiles that were not present in the given reference genomes, slightly obfuscating the profiling metrics. For example, Psychrobacter sp. JCM 18900 is present in the taxonomy and in the gold-standard profile for multiple samples in the Marine dataset, but it is not present in the RefSeq databases and is not found by any method. This causes other Psychrobacter species to be profiled and classified as incorrect, lowering both sensitivity and precision.
Real metagenome benchmarking with mOTUs3, MetaPhlAn4 and sylph
To fairly compare mOTUs3, sylph and MetaPhlAn4, we used the GTDB-R207 database, with which all three methods were compatible. We profiled sylph directly against the GTDB-R207 database. We used the default databases for the other methods (versions in Supplementary Table 4) and mapped their profiles to GTDB-R207 as follows. MetaPhlAn4’s taxonomic profile was mapped to the GTDB-R207 database using the default provided scripts as before. mOTUs3’s output was mapped to GTDB-R207 using the mappings provided at https://zenodo.org/records/10275750. mOTUs3’s corresponding GTDB taxa may have a blank species name; we removed such taxa from species-level comparisons. Effective coverage (Fig. 4d) was estimated by sylph on the full (nondownsampled) metagenomes and divided by 10.
MWAS procedure details
We ran logistic regression using statsmodels78 for each of the 289,232 genomes with sylph’s ANI as a covariate and the disease–control status as the response. In addition, we used all of the same metadata covariates as specified by Table 1 in Wallen et al.42 to control for potential confounding (for example, sex and age). Note that we used only one genome per species representative in the Q–Q plot (Fig. 5d), because genomes in the same species give highly dependent P values, but we ran the MWAS on all 289,232 genomes.
We set a presence–absence threshold of 98% ANI: if the ANI for a sample was below 98%, we set it to 98% for the regression; otherwise, we kept it as is. This gave a continuous presence–absence covariate in which only high ANIs carry information for the regression, which is desirable when trying to find strain-level associations. We then used the Benjamini–Hochberg procedure44 to control the FDR at a significance level of α = 0.05.
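A sketch of the per-genome regression in Python with statsmodels follows; the covariate handling is illustrative, and the column names are placeholders rather than the actual Wallen et al.42 metadata fields:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def mwas_pvalue(ani: np.ndarray, disease: np.ndarray,
                covariates: pd.DataFrame) -> float:
    """Logistic regression of disease-control status on a floored ANI
    covariate plus metadata covariates (e.g., sex, age)."""
    ani_floored = np.maximum(ani, 98.0)  # ANIs below 98% are set to 98%
    X = sm.add_constant(pd.concat(
        [pd.Series(ani_floored, name="ani"),
         covariates.reset_index(drop=True)], axis=1))
    fit = sm.Logit(disease, X).fit(disp=0)
    return fit.pvalues["ani"]

# P values over all genomes would then be corrected with, for example,
# statsmodels.stats.multitest.multipletests(pvals, method="fdr_bh").
```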
We created a linear ordering on genomes within a species group, where similar genomes should be placed next to each other along the Manhattan plot, by running an ANI-based average linkage clustering on the genomes. We used skani74 for all-to-all ANI calculation and scipy79 for clustering, using the resulting dendrogram ordering as the within-species ordering.
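A minimal version of this ordering with scipy is shown below; `ani_matrix` is assumed to be a square all-to-all ANI matrix (in percent) for one species group:

```python
import numpy as np
from scipy.cluster.hierarchy import average, leaves_list
from scipy.spatial.distance import squareform

def within_species_order(ani_matrix: np.ndarray) -> np.ndarray:
    """Average-linkage clustering on ANI distance (1 - ANI); the dendrogram
    leaf order places similar genomes next to each other."""
    dist = 1.0 - ani_matrix / 100.0
    np.fill_diagonal(dist, 0.0)
    Z = average(squareform(dist, checks=False))
    return leaves_list(Z)
```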
Synthetic viral metagenome profiling
To generate synthetic viral metagenomes for profiling, we downloaded all viral genomes from RefSeq and the MGV database53 and computed ANI from all RefSeq viruses to MGV using skani with the ‘--slow’ setting. RefSeq viral genomes with >95% ANI and >80% aligned fraction for both the query and the genome were considered, with the nearest-neighbor MGV genome being designated as the true genome. We randomly selected 50 RefSeq genomes for each sample over ten samples and simulated reads using the same methodology as for the other synthetic metagenomes, except that we sampled only 3 Mbp of reads for each sample.
Benchmarking hardware and software
Benchmarking was performed on an Intel(R) Xeon(R) CPU at 3.10 GHz with 64 cores and 240 GB of RAM as a Google Cloud instance with a solid-state drive. Software versions are shown in Supplementary Table 4.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.