On 2020-05-06 15:28:18, user Charles Warden wrote:
I agree that using imputed values from a SNP chip can be a problem, and I would say medical decisions should never be made using an imputed value (whether that is from a SNP chip or lcWGS data).
I have some general notes (and opinions related to some comments), but I thought I should move those to a blog post, in order to keep the commentary more focused.
In terms of this specific paper:
1) The choice of array can affect the results. For example, while the conclusions are similar to the concordance sections of this paper, the GLIMPSE paper shows better performance with the Infinium Omni 2.5 than with other SNP chips (in these interactive plots for EUR and ASW individuals, as well as the Rubinacci et al. 2020 pre-print). It should also be possible to make this comparison with the 1000 Genomes samples. Could the results differ from the GSA results?
For example, I checked the manifest files, and it looks like the Infinium Global Screening Array (GSA) would be a “medium” density array rather than a “high” density array (with the categories defined from that other study).
2) In the context of this paper, the comparison that I am interested in is not imputed SNP chip genotypes versus imputed lcWGS genotypes, since I would agree that some (although possibly subtle) improvement is likely for lcWGS imputations over SNP chip imputations.
Instead, what I would like to see is directly assayed SNP chip genotypes versus imputed lcWGS genotypes. The ability to provide results without any imputations is the main reason I prefer SNP chips over lcWGS (if those were my only 2 options), so I would want to compare the SNP chip genotypes where all variants were directly measured by the SNP chip versus the lcWGS imputations.
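As a minimal sketch of that comparison (not the method used in the paper), concordance could be computed only at sites that the SNP chip measured directly. The function name, the variant IDs, and the 0/1/2 alternate-allele coding below are illustrative assumptions, and the lcWGS calls are assumed to have already been converted to hard genotype calls.

```python
def genotype_concordance(chip_calls, lcwgs_calls):
    """Return (concordance, n_compared) over directly assayed chip sites.

    Both inputs are hypothetical dicts mapping a variant ID to an
    alternate-allele count (0, 1, or 2), with None for a missing call.
    Only variants present and non-missing in both inputs are compared.
    """
    shared = [
        variant
        for variant, chip_gt in chip_calls.items()
        if chip_gt is not None and lcwgs_calls.get(variant) is not None
    ]
    if not shared:
        raise ValueError("no overlapping, non-missing variants to compare")
    matches = sum(chip_calls[v] == lcwgs_calls[v] for v in shared)
    return matches / len(shared), len(shared)


# Toy data (made up): directly assayed chip calls vs. imputed lcWGS calls.
chip = {"rs1": 0, "rs2": 1, "rs3": 2, "rs4": None}
lcwgs = {"rs1": 0, "rs2": 2, "rs3": 2, "rs4": 1, "rs5": 0}
concordance, n_compared = genotype_concordance(chip, lcwgs)
```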
Restricting the comparison to directly assayed genotypes would mean you could only compare PRS values among probes present on the SNP chip, for example. However, it looks like you selected a CAD PRS with 1,745,179 variants (225,667 were directly typed on the GSA) and a BC PRS with 313 variants (75 were directly typed on the GSA). The BC PRS is closer to the number of variants in the PRS that I have tested on myself. So, if you can find a PRS with between dozens and thousands of variants, where 100% were directly typed on either the GSA or Omni-2.5 SNP chip (or both), then maybe that can help with providing the comparison that I would like to see?
As I mention below, maybe using public SNP chip + WGS data can help you identify a custom array where the probes were designed to cover everything for a PRS? I would guess/hope that this would be a requirement if getting FDA approval for a clinical test (and this is why I don’t consider a PRS with imputed SNP chip values to be equivalent), but maybe you can also find this for some research-level PRS (like the 23andMe diabetes example that I described in my blog post above, which I think uses 1,244 loci, even though other risk factors were more likely to predict whether you got diabetes)?
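To make the restricted PRS comparison concrete, here is a rough sketch under stated assumptions: the PRS is a simple weighted sum of effect-allele dosages, and the weights, variant IDs, and dosages are invented for illustration (they are not taken from the paper). Only the variant counts come from the numbers quoted above, which work out to roughly 13% (CAD) and 24% (BC) of variants directly typed on the GSA.

```python
def restricted_prs(weights, dosages, typed_variants):
    """Weighted dosage sum over directly typed variants only.

    weights: hypothetical dict of variant ID -> per-allele effect weight
    dosages: hypothetical dict of variant ID -> effect-allele dosage (0 to 2)
    typed_variants: set of variant IDs directly assayed on the SNP chip
    Returns (score, number_of_variants_used).
    """
    usable = [v for v in weights if v in typed_variants and v in dosages]
    score = sum(weights[v] * dosages[v] for v in usable)
    return score, len(usable)


# Fractions of directly typed variants, using the counts quoted in the comment.
cad_typed_fraction = 225_667 / 1_745_179
bc_typed_fraction = 75 / 313

# Toy example (made-up values): the same restricted score computed from
# directly assayed chip genotypes and from imputed lcWGS dosages.
weights = {"rs1": 0.2, "rs2": -0.1, "rs3": 0.5}
typed_on_chip = {"rs1", "rs3"}
chip_dosages = {"rs1": 1, "rs3": 2}
lcwgs_dosages = {"rs1": 1.1, "rs2": 0.3, "rs3": 1.8}

chip_score, chip_n = restricted_prs(weights, chip_dosages, typed_on_chip)
lcwgs_score, lcwgs_n = restricted_prs(weights, lcwgs_dosages, typed_on_chip)
score_difference = lcwgs_score - chip_score
```

Because both scores use the same variant set, any difference between them reflects lcWGS imputation error rather than differences in which variants were included.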
Or, removing the PRS results would be another option that would reduce my concerns. For example, it looks like the error rate in this paper was noticeably higher when the BC PRS used hundreds of SNPs (instead of a PRS with >1 million variants). There are also other factors that could cause me to prefer removing the PRS results (which I moved to the blog post).
3) I agree that batch effects (like “index hopping”) could cause down-sampling to underestimate the error rate for lcWGS (which I would guess is more of an issue for smaller libraries on higher throughput machines). In the “Experimental Overview”, it sounds like you used cell lines for 1000 Genomes individuals for the new sequencing experiments? While it is hard for me to say exactly what could cause a problem, am I correct that previous Gencove developments considered 1000 Genomes data? Is it possible to have more independent test datasets for your estimates (for a set of ~120 individuals)? I myself am one individual from the Personal Genome Project who has public high-coverage and low-coverage WGS from different companies (along with SNP chip genotypes).
To be fair, this may be less important than some of the other points, especially if sections or content are removed. For example, if you reduced the focus to variability in technical replicates derived from 1000 Genomes subjects and different technologies (and removed the PRS application), then I don’t think this extra analysis needs to be added.
4) If the goal is to show a general principle, then maybe you could show whether open-source programs like STITCH, GLIMPSE, etc. can achieve similar performance with lcWGS data? I think this would make disclosing the conflicting interests less important, even though I think that still needs to be done. This would be good to show for readers who might usually prefer to use open-source options, unless Gencove is changed to become open-source (and I think showing the performance of alternative programs is common, even if that were true).
5) I think the most important issue has been fixed in the link for revision 2 (I previously had issues accessing the content in s3://gencove-sbir/, but I can see the data at https://gencove-sbir.s3.amazonaws.com/index.html). Nevertheless, in order to match the current 1000 Genomes data deposits, is it possible to deposit the data (derived from 1000 Genomes subjects) into public genomics databases like the SRA? Or, if you have already done so, can you please provide accessions that don’t require an extra step (or steps) to access the data?
Summary:
I think this is an interesting topic of research, and I think lcWGS imputations can be useful for certain applications (such as relatedness and broad ancestry). However, I have concerns about the clinical utility of Polygenic Risk Scores from imputed genotypes in lcWGS data. That said, I think these results could be presented with less controversy if the PRS section and Figure 4 were removed, and that is a possible solution to some concerns that doesn’t require extra work (taking out results, rather than adding in new results).
The points I am most concerned about are testing 1000 Genomes Omni 2.5 SNP chip concordance (and/or only comparing “directly assayed” SNP chip genotypes) and potentially removing the PRS results.
Thank you very much for putting together this pre-print. I believe that it is important to see independent presentations of results from different groups. I can also tell that a lot of work was put into this paper (with several pages of supplemental information), so I appreciate this.