On 2017-06-24 07:54:37, user sandeep chakraborty wrote:
I thank the authors (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4644263/) for engaging in a public debate on the TEP-paper.
I understand their critical view on my pre-prints - "and once more highlights the critical need for peer-reviewing" and " we wish to point at the importance of peer review".
However, I would also like to point out that biorxiv provides a platform to ask `impertinent' questions in case the human (and fallible) peer-review process missed asking them.
"That is the essence of science: ask an impertinent question, and you are on the way to a pertinent answer." - Jacob Bronowski.
So, I hope readers judge the data I am providing - and not my credentials.
I will try to prove the statement "the author clearly has no understanding of the general principle of surrogate signatures" wrong below, in detail.
My problems are with:
1) low counts indicating degraded RNA (or low sample amount) - and lack of supporting information about real values.
2) application of "deep learning" techniques despite the low counts.
The problems with measuring low counts have been critiqued by Sinha et al., 2017.
http://biorxiv.org/content/...
And noted in a news item:
https://www.wired.com/2017/...
"Sinha and other academic researchers aren’t the only ones who need that kind of needle-in-a-haystack sensitivity.
Precision medicine—like spotting a piece of tumor DNA in a drop of blood or finding a rare variant among the 3 billion base pairs in the human genome—also requires high-resolution sequencing. Clinical researchers and biotech start-ups that need that kind of resolving power are increasingly using Illumina’s ExAmp chemistry and the machines that employ it, including its newest line, the NovaSeq."
Since the TEP-study is using the same "needle-in-a-haystack sensitivity", it needs to provide the data that shows over-expression of MET.
"Overexpression of MET protein in tumor tissue relative to adjacent normal tissues occurs in 25-75% of NSCLC" - https://www.mycancergenome..... Is it 25% or 75% among the current 60 samples? That is a wide range. So it is very important to verify that the classification has been done properly before providing statistics based on surrogate biomarkers.
RNA-seq values certainly show no over-expression (on the contrary - but leaving that aside, since surrogacy does not require them to be there)
Healthy MET = [287 197 61 142 178 127 7 176 2 133 188 156 23 2 185 170 104 23 28 71 108]
NSCLC MET = [0 14 24 5 11 9 45 1 5 11 9 12 5 2 34 3 7 42 6 16 12 2 5 2]
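To put a number on the two MET lists above, here is a minimal Python sketch that compares their medians. It uses only the raw counts quoted in this comment, with no normalisation for library size, so it is a rough check and not the paper's analysis. The variable names are illustrative.

```python
from statistics import median

# Raw MET read counts as quoted above (no library-size normalisation applied).
healthy_met = [287, 197, 61, 142, 178, 127, 7, 176, 2, 133, 188,
               156, 23, 2, 185, 170, 104, 23, 28, 71, 108]
nsclc_met = [0, 14, 24, 5, 11, 9, 45, 1, 5, 11, 9, 12,
             5, 2, 34, 3, 7, 42, 6, 16, 12, 2, 5, 2]

median_healthy = median(healthy_met)
median_nsclc = median(nsclc_met)

print("healthy median:", median_healthy)    # 127
print("NSCLC median:", median_nsclc)        # 8.0
print("max NSCLC count:", max(nsclc_met))   # 45
```

On these raw numbers the largest NSCLC count is below the healthy median, which is the opposite direction to over-expression.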
Instead, the paper summarily states:
"Assessment of MET overexpression in non-small cell lung cancer FFPE slides was performed by immunohistochemistry (anti-Total cMET SP44 Rabbit mono-clonal antibody (mAb), Ventana, or the A2H2-3 anti-human MET mAb (Gruver et al., 2014))."
The challenges in FFPE include:
1) Degraded/fragmented DNA or RNA
2) Insufficient amount of sample
https://www.promega.com/-/m...
And similarly, the EGFR mutations were determined using FFPE (what EGFR mutations? there are so many, see below).
Is this not the same problem stated by Sinha et al., 2017, albeit for a different sequencer?
In the refutation to my analysis, Dr Best states:
"Surrogate signatures are composed of indirect biomarkers, and as stated in the publication the direct markers were not detected using thromboSeq (shallow sequencing), in contrast to amplicon sequencing (deep sequencing), which did allow for the detection of such direct markers in tumor-educated platelets."
I guess it refers to the line in the manuscript:
"We subsequently compared the diagnostic accuracy of the TEP mRNA classification method with a targeted KRAS (exon 12 and 13) and EGFR (exon 20 and 21) amplicon deep sequencing strategy (~5,000× coverage) on the Illumina Miseq platform using prospectively collected blood samples of patients with localized or metastasized cancer."
Two questions:
(A) Where is the supporting data? With the zero counts of EGFR in the RNA-seq sample, is it not important to see that?
EGFR in NSCLC = [1 4 1 0 14 9 1 0 0 0 2 0 0 0 19 0 0 21 0 0 6 0 0 0]
(B) EGFR (exon 20 and 21) gives a subset of possible EGFR mutations. Does its absence suffice to infer wildtype?
Here is a comprehensive list from https://www.mycancergenome....
Kinase Domain Duplication
c.2156G>C (G719A)
c.2155G>T (G719C)
c.2155G>A (G719S)
Exon 19 Deletion
Exon 19 Insertion
Exon 20 Insertion
c.2290_2291ins (A763_Y764insFQEA)
c.2303G>T (S768I)
c.2369C>T (T790M)
c.2573T>G (L858R)
c.2582T>A (L861Q)
To be specific, exons 20 and 21 span from "2541 to 2881" and include only a couple of the possible mutations.
How do all these different mutations modify the surrogate signatures in the same manner?
About the Kappa statistics - Table S7 talks about "EGFR mut" in "4/39 (10%)".
Since only 36 samples are annotated (Table S1), how does one know about the remaining 3?
"the author clearly has no understanding of the general principle of surrogate signatures" - here is what I understand. Please correct me if I am wrong.
Platelets circulating through the body get their mRNA profile changed when they are in proximity to tumor cells. "A total of 1,453 out of 5,003 mRNAs were increased and 793 out of 5,003 mRNAs were decreased in TEPs as compared to platelet samples of healthy donors". (See below for questions on the validity of the statistics).
Now, these are passed onto the SVM algorithm.<br />
I have used the perl version http://search.cpan.org/~kwi..., so I know a bit.<br />
Essentially, this is some sort of machine learning (there are features with values, binary or otherwise, and training sets) followed by validation (10-fold, for example: (a) partition the dataset into 10 sets of size n/10; (b) train on 9 sets and test on 1; (c) repeat 10 times and take the mean accuracy). As mentioned in the TEP-paper, "The algorithms we developed use a limited number of different spliced RNAs for sample classification". And finally, out comes a set of genes that classifies between different diseases or between disease and healthy (and this is the final problem, see my closing statements). I understand it does not even have to be the genes implicated in the cancer (for example, MET and EGFR in NSCLC).
That's the "surrogate signature" theory.
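The 10-fold validation steps (a) to (c) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a toy nearest-centroid classifier on a single made-up feature stands in for the SVM, and the function names and toy data are hypothetical, not taken from the TEP paper or from the Perl module mentioned above.

```python
import random
from statistics import mean

def k_fold_indices(n, k, seed=0):
    """(a) Shuffle the n sample indices and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def train_centroids(xs, ys):
    """Toy stand-in for an SVM: store the mean feature value of each class."""
    return {c: mean([x for x, y in zip(xs, ys) if y == c]) for c in set(ys)}

def predict(centroids, x):
    """Assign x to the class whose centroid is nearest."""
    return min(centroids, key=lambda c: abs(centroids[c] - x))

def cross_validate(xs, ys, k=10):
    accuracies = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        # (b) Train on the other k-1 folds, test on the held-out fold.
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = train_centroids(train_x, train_y)
        hits = sum(predict(model, xs[i]) == ys[i] for i in fold)
        accuracies.append(hits / len(fold))
    # (c) Repeat for every fold and report the mean accuracy.
    return mean(accuracies)

# Made-up, perfectly separable toy data (not real counts).
xs = list(range(20)) + list(range(100, 120))
ys = ["healthy"] * 20 + ["cancer"] * 20
print(cross_validate(xs, ys, k=10))
```

Note that the accuracy reported this way is only trustworthy if any gene selection is also done inside each training fold, a point that matters for the closing argument of this comment.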
Moving on, take one gene (TRAT1) from the set of 1072 genes which is supposed to discriminate between healthy and pan-cancer (Fig 1F and G column in Table S4).
Healthy = [158 75 39 88 98 96 0 242 0 92 359 53 7 3 53 51 103 11 1801 48 67]
NSCLC = [0 1 8 0 40 1 855 0 8 10 2 25 1 288 19 0 0 3 1 3 2 0 13 0]
A quick glance shows some difference - most NSCLC samples have low counts, and most healthy samples have higher counts (though a few go the other way, I will still say it shows a difference).
But the raw reads are too low for discrimination, and if, as in the case of MET, small read counts can translate into much larger numbers under FFPE analysis, then this can go either way.
For MET, the values shown above are:
Healthy MET = [287 197 61 142 178 127 7 176 2 133 188 156 23 2 185 170 104 23 28 71 108]
NSCLC MET = [0 14 24 5 11 9 45 1 5 11 9 12 5 2 34 3 7 42 6 16 12 2 5 2]
If FFPE analysis finds over-expression in NSCLC from such RNA-seq data, what is to say of the real expression levels of other genes which are expressed at the same levels (like TRAT1) - they could be anything!
2) The P-values of the over-expression values seem doubtful. Take the ribosomal gene RPSA (Table S2) - another gene used for discrimination.
Healthy (21 samples) = [490 323 160 783 306 342 531 593 241 283 846 364 44 572 253 372 701 54 1416 214 546]
Pan-cancer
(98 samples) = [118 342 381 328 122 991 151 191 242 347 265 580 195 88 279 354 213 366 92 118 317 292 74 206 157 171 343 117 194 136 209 189 170 160 15 155 300 734 46 173 288 140 119 762 1156 341 223 363 508 416 709 73 40 227 204 174 156 148 147 255 324 298 121 194 295 103 58 1078 206 190 50 140 156 457 577 152 260 175 147 105 179 178 201 275 134 277 19 315 215 71 120 189 102 175 102 444 139 133]
Visually, I don't see any difference to justify a P-value of 1E-38 (this is ranked second), and I will find it hard to believe any statistic that comes up with that number.
Again, these counts are too low - and as mentioned above with MET, can go either way when amplified.
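Readers who want more than a visual check can run a simple rank-based comparison on the two RPSA lists. The sketch below is a hand-written two-sided Mann-Whitney U test with a normal approximation and no tie correction. It is an assumption of this example, not the statistic used in the TEP paper, and it works on the raw counts quoted here without library-size normalisation, so its P-value is not directly comparable with the published one.

```python
import math

def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test (normal approximation, no tie correction)."""
    pooled = sorted((value, group) for group, data in enumerate((a, b)) for value in data)
    rank_sum_a = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the tied ranks i+1 .. j
        rank_sum_a += avg_rank * sum(1 for _, g in pooled[i:j] if g == 0)
        i = j
    n1, n2 = len(a), len(b)
    u = rank_sum_a - n1 * (n1 + 1) / 2
    z = (u - n1 * n2 / 2) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return u, p

# Raw RPSA read counts as quoted above.
healthy_rpsa = [490, 323, 160, 783, 306, 342, 531, 593, 241, 283, 846,
                364, 44, 572, 253, 372, 701, 54, 1416, 214, 546]
pan_cancer_rpsa = [118, 342, 381, 328, 122, 991, 151, 191, 242, 347, 265, 580, 195, 88,
                   279, 354, 213, 366, 92, 118, 317, 292, 74, 206, 157, 171, 343, 117,
                   194, 136, 209, 189, 170, 160, 15, 155, 300, 734, 46, 173, 288, 140,
                   119, 762, 1156, 341, 223, 363, 508, 416, 709, 73, 40, 227, 204, 174,
                   156, 148, 147, 255, 324, 298, 121, 194, 295, 103, 58, 1078, 206, 190,
                   50, 140, 156, 457, 577, 152, 260, 175, 147, 105, 179, 178, 201, 275,
                   134, 277, 19, 315, 215, 71, 120, 189, 102, 175, 102, 444, 139, 133]

u_stat, p_value = mann_whitney_u(healthy_rpsa, pan_cancer_rpsa)
print("U =", u_stat, "two-sided p =", p_value)
```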
Note, the lowest RPKM of RPSA in a set of tissues is 83 (±9) in liver - https://www.ncbi.nlm.nih.go....
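For comparison with that figure, raw reads can be converted to RPKM with the standard formula, reads x 10^9 / (gene length in bases x total reads). The sketch below assumes that the mean read count of about 22 million quoted from the TEP paper in this comment applies to every sample and that all reads count as mapped. Both are simplifications, and the gene length of 1186 is the value given in this comment.

```python
def rpkm(raw_reads, gene_length_bp, total_reads):
    """Reads per kilobase of transcript per million reads."""
    return raw_reads * 1e9 / (gene_length_bp * total_reads)

GENE_LENGTH_BP = 1186      # RPSA length as given in this comment
TOTAL_READS = 22_000_000   # assumption: the ~22 million mean holds for each sample

# Dividing a raw count by this number gives its RPKM (about 26.1).
divisor = GENE_LENGTH_BP / 1e3 * TOTAL_READS / 1e6

# Largest healthy RPSA raw count listed above.
highest_healthy = rpkm(1416, GENE_LENGTH_BP, TOTAL_READS)
print(round(divisor, 1), round(highest_healthy, 1))
```

Under these assumptions even the largest healthy RPSA count stays below the liver value of 83.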
The values above from the TEP-study are raw reads - dividing them by the number of million reads and by the transcript length in kilobases would give very low values.
"Platelet RNA sequencing yielded a mean read count of ~22 million reads per sample". And the gene is 1186 bases long. So divide the raw reads above by about 22 (more precisely 22 × 1.186, about 26) to get the RPKM.
In conclusion, I can describe the basic problem of this flow in this way - assume one
1) Randomly assigns values to each gene count,
2) Removes genes where the random generator has assigned equal values or too random values,
3) Gives it to a classifier to do its MAGIC, which will finally give a set of genes that separates the classes.
That's its job, and we have made it easier for it by step 2.
Each time one does these steps, she/he will get a different answer - with no biological significance.
Here, RNA-seq is not a random number generator (platelet markers have huge counts) - but for low-count genes, its counts cannot be trusted and become almost random.
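The three steps can be simulated directly. The following is a minimal sketch under assumptions of its own: counts are uniform random integers, step 2 is approximated by keeping the genes whose class means happen to differ most, and a signed-sum threshold rule stands in for the classifier. None of the names or parameters come from the TEP paper.

```python
import random

def run_once(seed, n_genes=2000, n_per_class=20, n_keep=10, n_fresh=200):
    rng = random.Random(seed)

    def random_samples(n):
        # Step 1: every gene count is random and carries no class signal.
        return [[rng.randint(0, 50) for _ in range(n_genes)] for _ in range(n)]

    class_a, class_b = random_samples(n_per_class), random_samples(n_per_class)

    def mean_of(samples, g):
        return sum(s[g] for s in samples) / len(samples)

    # Step 2: keep only the genes whose class means differ most by chance.
    gaps = {g: mean_of(class_a, g) - mean_of(class_b, g) for g in range(n_genes)}
    kept = sorted(gaps, key=lambda g: abs(gaps[g]), reverse=True)[:n_keep]

    def score(sample):
        # Step 3: toy classifier, a signed sum over the kept genes.
        return sum(sample[g] if gaps[g] > 0 else -sample[g] for g in kept)

    mean_a = sum(map(score, class_a)) / n_per_class
    mean_b = sum(map(score, class_b)) / n_per_class
    threshold = (mean_a + mean_b) / 2

    def accuracy(a_samples, b_samples):
        hits = sum(score(s) > threshold for s in a_samples)
        hits += sum(score(s) <= threshold for s in b_samples)
        return hits / (len(a_samples) + len(b_samples))

    train_acc = accuracy(class_a, class_b)
    fresh_acc = accuracy(random_samples(n_fresh), random_samples(n_fresh))
    return kept, train_acc, fresh_acc

kept, train_acc, fresh_acc = run_once(seed=1)
print("selected genes:", sorted(kept))
print("accuracy on the data used for selection:", train_acc)
print("accuracy on fresh random data:", fresh_acc)
```

The selected "signature" separates the samples it was chosen on, performs near chance on fresh random data, and changes with the seed, which is the behaviour described in the steps above.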
best regards,
Sandeep