On 2022-10-16 16:15:13, user Alex Crits-Christoph wrote:
In this preprint, Washburne and colleagues put forth some reasoning and basic analysis that they believe suggests the viral genomic data from the early SARS-CoV-2 pandemic is consistent with a single spillover event. This is in contrast to the work of Pekar et al. 2022 Science, which concluded that the genomic data from the early pandemic is best explained by multiple independent spillover events from an animal population. However, this preprint misrepresents the findings of Pekar et al. 2022, and makes several conceptual errors that fundamentally undermine its conclusions.
There are 4 basic features of the early SARS-CoV-2 phylogeny that are each largely inconsistent with a single spillover event:
(1) A Lineage A ancestral haplotype is inconsistent with the molecular clock: Lineage B exhibits more divergence from the root of the tree than would be expected if lineage A were the ancestral virus in humans (Pekar Fig S20, S19).
(2) Two basal polytomies of lineages A and B were formed at the start of the SARS-CoV-2 epidemic, whereas most single introductions within a city, location, or event are characterized by a single polytomy.
(3) There are no plausible candidates for intermediate genomes observed for lineages A and B.
(4) Both Lineage A and Lineage B are connected to and were present during the outbreak at the Huanan Seafood Market, and there was sustained case transmission within the market for up to a month.
The authors have *attempted* (unsuccessfully) to address points 2 and 3, but they have entirely ignored points 1 and 4, which are still highly pertinent. All four of these observations need to be explained by any hypothesis of SARS-CoV-2 origins.
Now, on to specific scientific errors in this work:
- In the first section, the authors describe how superspreading events can create polytomies, as do introduction events. This is an intuitive observation, as both superspreading events and successful introductions can result from rapid transmission from a singular infection source. What they fail to note, however, is that superspreading events and introduction events are characterized by a single polytomy, not by two. Here is a simple list of introduction/superspreading events characterized by a single polytomy:
New Zealand https://www.nature.com/arti...
Lombardy https://www.nature.com/arti...
Louisiana (Mardi Gras superspreading event) https://www.sciencedirect.c...
Xinfadi market in Beijing https://academic.oup.com/ns...
In none of the above cases of introduction/superspreader events do we observe two basal polytomies separated by two mutations with no intermediates as we do for early SARS-CoV-2 in Wuhan.
Ironically, the authors cite Popa et al. 2020 Nature Communications on the spread of SARS-CoV-2 in Austria as an example of how polytomies can be linked to superspreader events. However, this work elegantly describes how each polytomy results from a separate introduction event into Austria:
Vienna-1 clade/polytomy: connected to an index patient from Italy.
Tyrol-1 clade/polytomy: phylogenetically linked to North America.
Vienna-3 clade/polytomy: connected to Cluster OG, an independent travel-associated cluster.
Tyrol-3 clade/polytomy: connected to Cluster D, an independent travel-associated cluster.
So indeed, the cited work is actually stronger evidence that introduction events, including those of a ‘superspreader’ nature, are characterized by a single polytomy. We see no instances of a single superspreader event creating two concurrent polytomies, separated by two or more mutations, as we observe with the rise of lineages A and B in Wuhan. It is not merely the existence of polytomies in a phylogeny that is relevant, but the observed ratio of polytomy frequency and size, which the simulations in Pekar et al. predict would arise very infrequently with a single introduction.
Further, the authors are incorrect in their characterization of the FAVITES models used by Pekar et al. FAVITES has been modified to accurately recapitulate the superspreading nature of SARS-CoV-2; see Worobey et al. 2020 Science, Figure S2. Washburne et al. say:
“and the transmission model of FAVITES will extend superspreading events over timescales that within-host evolution can occur”. However, the simulations in Pekar et al., 2022, and in FAVITES more broadly, account for within-host evolution: the coalescent process and subsequent mutational evolution are agnostic to subsampling and within-host evolution.
- In the second section, the authors describe how ascertainment biases and biased contact tracing could affect the recovered phylogeny. The core conceptual errors here are:
The lineage A/B split and the basal polytomies of SARS-CoV-2 are still obvious in any phylogeny of early SARS-CoV-2 even when excluding genomes from the city of Wuhan: this phylogenetic structure is factually not an artifact of sampling, and anyone is welcome to build a tree of sequences before April 2020 excluding those from Wuhan and demonstrate this.
Likewise, lineage A is still incompatible with the molecular clock when genomes linked to the Huanan Market are excluded. Even in sequences from February 2020, you can see a ‘lag’ in the evolution of lineage A from its root compared to lineage B (Pekar Fig S20).
The authors propose no explanation of how contact tracing of patients connected to one market could produce a phylogenetic artifact of two large, basal polytomies: indeed, their simple analysis in Fig 2 shows that contact tracing will preferentially sample just one lineage, not two. Small polytomies are common throughout the SARS-CoV-2 phylogeny.
A contact tracing bias cannot, in and of itself, explain a lack of intermediate genomes between lineages A and B. First, if the evolution between the lineages occurred in humans, the patients with intermediate genomes should be contact traceable from normal lineage B patients. Second, even if they were missed in Wuhan, we would see the phylogenetic descendants of the intermediate genotype spread to other countries, unless this lineage just happened to be wiped out very quickly.
As discussed in the Worobey 2021 Science perspective, several of the earliest known SARS-CoV-2 patients were emphatically not contact traced from others; they were independently noticed in different hospitals throughout the city. This includes the earliest known case of lineage A, who was not contact traced and had no noted connection to the Huanan Seafood Market, but who was later found to live just a few blocks away (and to have shopped at a nearby market).
Several other data points, which together indicate that the known early case data in Wuhan are not strongly affected by ascertainment bias, are discussed in the section on this topic in the supplementary text of Worobey et al. 2022 Science.
- In the third section, the authors put forth the possibility that several sampled genomes were intermediate sequences of lineage A and lineage B. Again here, they both misunderstand the data that they are reporting on, and misconstrue the methods and findings of Pekar et al.
They propose that a set of genomes obtained from Sichuan may constitute C/C intermediate haplotypes between lineages A and B. However, the data does not support this, as elegantly explained by Zach Hensel on Twitter:
https://twitter.com/alchemy...
https://twitter.com/alchemy...
Washburne writes: "It is difficult to see how sequencing errors, which are random, could occur at exactly the same position in these 12 early outbreak genomes."
However, what they do not understand is that several of these genomes were plagued by systematic bioinformatics errors, not random sequencing errors. This was likely due to a known issue with a pipeline that imputed the reference genotype at loci with no read support, instead of replacing those positions with N characters. As demonstrated by Hensel above, in this particular low-coverage dataset, the vast majority of samples had no coverage at the relevant sites.
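To make the mechanism concrete, here is a minimal sketch of how reference imputation can manufacture an apparent C/C "intermediate". The function names and the toy sample are hypothetical and are not taken from the pipeline in question. The sketch assumes only that the Wuhan-Hu-1 reference carries the lineage B bases (C at 8782, T at 28144) and that lineage A carries T at 8782 and C at 28144.

```python
# Toy illustration only: function names and the sample are hypothetical,
# not the actual pipeline used to produce the Sichuan genomes.

# Reference (Wuhan-Hu-1) bases at the two lineage-defining sites; the
# reference genome itself has the lineage B haplotype.
REF = {8782: "C", 28144: "T"}


def call_consensus(observed, impute_reference):
    """Build consensus calls at the two sites.

    `observed` maps position -> base supported by reads, or None when
    no reads cover the site. With impute_reference=True, uncovered sites
    are filled with the reference base (the problematic behaviour);
    otherwise they are masked with 'N'.
    """
    calls = {}
    for pos, ref_base in REF.items():
        base = observed.get(pos)
        if base is None:
            base = ref_base if impute_reference else "N"
        calls[pos] = base
    return calls


def haplotype(calls):
    """Classify a genome by its bases at 8782 and 28144."""
    pair = calls[8782] + "/" + calls[28144]
    if "N" in pair:
        return "undetermined"
    known = {"C/T": "lineage B", "T/C": "lineage A"}
    return known.get(pair, "apparent intermediate (" + pair + ")")


# A genuine lineage A genome (T8782, C28144) with no read coverage at 8782.
sample = {8782: None, 28144: "C"}

print(haplotype(call_consensus(sample, impute_reference=True)))
print(haplotype(call_consensus(sample, impute_reference=False)))
```

With imputation, the uncovered site silently takes the reference C and the genome looks like a C/C intermediate. With masking, the same genome is simply reported as undetermined. Because the fill-in is deterministic, the same artifact appears at the same position in every low-coverage sample, which is what distinguishes it from random sequencing error.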
Further, the authors misunderstand why certain genomes have been excluded from Pekar et al. The deciding observation is not the quality of the underlying sequencing data (although that is likely the hidden cause) but the observation that some genomes share multiple polymorphisms with derived lineages in A and B, strongly indicating that they are phylogenetically aberrant. In all cases in which the underlying data are available, it has been confirmed that these phylogenetic outliers are plagued by data quality issues, with missing data that has often been incorrectly imputed. In cases without the underlying data, the only alternative explanation would have to be a highly unusual degree of recurrent mutation. As this is fully explained in Pekar et al. 2022, I highly suggest the authors attempt a re-read to understand the reasoning of how we can identify these incorrect genomes.
There are two more “minor” (in the grand scheme of things) errors in this section:
“Lineage A and Lineage B, are separated by only two defining single nucleotide changes (SNCs), at positions 8782 and 21844”
This is incorrect - the second position should be 28144, not 21844. This is wrong throughout the manuscript.
"Intermediate sequences suggest there may not be two basal polytomies"
Polytomies can be separated by a single mutation and still be polytomies: there is a basal polytomy in lineage A, and a separate basal polytomy in lineage B. The existence of intermediate genomes would not preclude the presence of these two polytomies.
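The last point can be checked on a toy tree. The sketch below uses a hypothetical child-list representation with made-up node names, and the rooting is arbitrary and chosen only for illustration. It counts nodes with three or more descendant branches, with and without an intermediate node placed between the two basal nodes.

```python
# Toy illustration: node names and the tree shape are made up, and the
# direction of rooting is arbitrary (it is not a claim about ancestry).

def count_polytomies(tree, node, min_children=3):
    """Count nodes at or below `node` with at least `min_children` children."""
    children = tree.get(node, [])
    here = 1 if len(children) >= min_children else 0
    return here + sum(count_polytomies(tree, c, min_children) for c in children)


def make_tree(with_intermediate):
    """Two basal nodes, each with several tips, optionally joined through
    a single intermediate node instead of one direct two-mutation branch."""
    tree = {
        "basal_1": ["t1", "t2", "t3"],
        "basal_2": ["u1", "u2", "u3", "u4"],
    }
    if with_intermediate:
        tree["basal_1"] = tree["basal_1"] + ["intermediate"]
        tree["intermediate"] = ["basal_2"]
    else:
        tree["basal_1"] = tree["basal_1"] + ["basal_2"]
    return tree


print(count_polytomies(make_tree(with_intermediate=False), "basal_1"))
print(count_polytomies(make_tree(with_intermediate=True), "basal_1"))
```

Both calls report two polytomies: inserting an intermediate node only splits the connecting branch and leaves the branching at each basal node unchanged.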
In sum, none of the three points raised by Washburne and colleagues is in fact relevant to the hypothesis of multiple spillovers of SARS-CoV-2. Finally, it is also important to briefly discuss a broader conceptual error made by the authors. As they write:
"Far from being able to conclude two spillover events, both hypotheses - natural origin and lab origin - are still on the table."
This quote (along with knowledge of their past works) makes evident the aim of the authors: to reject the possibility of multiple SARS-CoV-2 spillovers because it is a finding largely inconsistent with their preferred laboratory origin hypothesis. They are correct in thinking that multiple spillovers of SARS-CoV-2 cannot easily be explained by a hypothesis of laboratory emergence. They are, however, incorrect in their statement that a lack of evidence for multiple spillovers would “put the lab origin hypothesis on the table”. There is an astounding degree of evidence against the possibility of laboratory emergence, primarily:
(1) the complete lack of epidemiological contacts traced to the WIV, and the March 2020 seronegativity of Shi Zhengli’s group,
(2) the geographic epicenter of the pandemic was in Hankou, Wuhan, not Wuchang, where the WIV resides,
(3) the detailed insight we have into the research ongoing at the WIV in 2018-2019, including CoV sequences submitted to GenBank in 2018 (Yu Ping et al.) and Latinne et al. 2020 (submitted Oct 6 2019), multiple publicly available theses and papers, interviews, collaborator emails, US intelligence investigations, and unfunded grant proposals: all of which has so far indicated a lack of a SARS-CoV-2 progenitor at WIV,
(4) the preponderance of evidence from the known early cases within the city of Wuhan, which were either linked to or centered around the Huanan Seafood Market, including the very first cases identified in hospitals, as reported by independent journalists and described in the Worobey 2021 Science perspective,
(5) the positive viral samples from an animal cage, a freezer, a defeathering machine, and the drains and ground of wildlife selling stalls within the western half of the Huanan Seafood Market, the half to which most human cases were also linked, and
(6) direct and geographic links of patients and environmental sampling firmly establishing that both early SARS-CoV-2 lineages A and B were first identified in connection to the Huanan Seafood Market.
Put otherwise, it is clear that the authors misrepresent and misunderstand the reasons why multiple spillovers have been proposed. Contrary to their beliefs, it is not to undermine or reject the laboratory hypothesis. The clear evidence against that hypothesis is well described in Holmes et al. 2021 Cell, the WHO Mission Report, and Worobey et al. 2022 Science; it is entirely incidental that the likelihood of multiple spillovers also happens to be inconsistent with their hypothesis.
Why then has the possibility of multiple spillovers been proposed? Because the genomic data from the early SARS-CoV-2 pandemic is *peculiar*, and these peculiarities have so far only been adequately explained by models incorporating multiple spillovers. It is as simple as that.