On 2020-05-05 20:42:00, user Taekjip Ha wrote:
Thank you very much for sharing your interesting manuscript!<br />
We used your preprint as one of the journal club papers in the Single<br />
Molecule & Single Cell Biophysics course for graduate students of Johns<br />
Hopkins University during the Covid-19 lockdown. Students also practiced peer<br />
reviews as the final assignment. I am submitting their formal reviews here <br />
and hope you find them useful.
Taekjip Ha
Reviewer 1.
The authors develop an α-hemolysin nanopore-based sequencing-by-synthesis assay<br />
which can be used to interrogate the kinetic properties of single DNA<br />
polymerases. Their method is novel and addresses the problem of increasing the<br />
throughput of polymerase screening methods. Previous techniques only allowed<br />
kinetics of polymerases to be screened one at a time. This new method is a<br />
clever integration of existing nanopore sequencing technologies that addresses a<br />
longstanding problem in development of specialized polymerases in biotechnology.<br />
The paper is interesting to read and not especially difficult for someone<br />
outside of the field to understand.
Each polymerase-pore complex could be uniquely tagged with a circular barcode<br />
template, allowing the assay to be multiplexed and scaled up to accommodate 96<br />
complexes at once. Convincing proof of concept data is shown highlighting the<br />
ability of the method to distinguish between barcodes, as well as the stability<br />
of the circular template. The title and abstract are appropriate, concise, and<br />
clearly lay out the aims of the paper. Introductory figures showing assay design<br />
and low throughput tests are very well presented and easy for the reader to<br />
follow. Low throughput tests show clear clustering, in both two-dimensional<br />
plots and PCA, of data obtained from each tested polymerase which could be used<br />
to distinguish and characterize them. Later in the paper, however, there are<br />
confusing inconsistencies between what is stated and what is shown in the data.
Figure 3a shows how each kinetic parameter is defined by the voltage trace. Only<br />
four of the five kinetic parameters are shown: dwell time, tag release rate, tag<br />
capture rate, and full catalytic rate. Tag capture dwell time (TCD) is not<br />
shown, yet it is featured in the principal component analysis and is shown to<br />
have a relatively high coefficient for some polymerases. How this parameter is<br />
defined by the trace and how it differs from dwell time is not clearly addressed<br />
in the main text of the paper. This figure (3a) and the subsequent analysis<br />
could be improved by explaining how each parameter is calculated and how they<br />
differ to clear up any ambiguity. Explanations of how each parameter correlates<br />
to polymerase fidelity, processivity, speed, etc. may also help convince the<br />
reader of the utility of their method. This is done well for some but not all of<br />
the described parameters.
Figure 5 shows the distribution of counts associated with each of 96 unique<br />
circular barcodes over three polymerases. RPol1 is associated with relatively<br />
few read counts, which are not much higher than the background off-target signal from<br />
RPol3. The uneven distribution of barcode counts is attributed to the low<br />
processivity of polymerase 1. Later (figure 6), in the 96-plex screen of<br />
polymerase mutants, fewer than twenty mutants in the screen have detectable<br />
barcode counts and those that do have few counts. This observation is again<br />
thought to be due to poor processivity of the polymerases. Polymerase fidelity<br />
very likely also plays a role in the ability of the assay to identify<br />
polymerases. Since barcode assignment is alignment based, and nanopore<br />
sequencing platforms are known to have a relatively high error rate as well, one<br />
can imagine that a more error-prone polymerase will also escape detection. There<br />
is no benchmarking data to define a polymerase detection threshold. It is clear<br />
that the efficacy of the method decreases for polymerases with lower fidelity<br />
and processivity, but what might be designated as ‘low’ is never defined. What<br />
subset of polymerases make it through this new screening process and what are<br />
their defining kinetic characteristics? How widely applicable would this method<br />
be for identifying desired features in polymerase variants? What kinds of<br />
polymerases would be expected to be missed by the screen?
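One way to start defining such a detection threshold is a back-of-envelope calculation. The sketch below is not the authors' method. It assumes the 32-nt barcode length and 0.8 BMPI cut-off mentioned elsewhere in these reviews, independent substitution-only errors, and a simple per-position identity score standing in for the BMPI (whose exact definition is not given here). Under those assumptions, the chance that a single barcode read clears the cut-off is a binomial tail:

```python
from math import comb

def pass_probability(barcode_len: int, error_rate: float, cutoff: float = 0.8) -> float:
    """Probability that a read's identity to its true barcode exceeds `cutoff`.

    Assumptions (not from the preprint): errors are independent substitutions
    at `error_rate` per base, and identity = matching positions / barcode length.
    """
    # Mismatch counts that still leave identity strictly above the cutoff.
    allowed = [k for k in range(barcode_len + 1)
               if (barcode_len - k) / barcode_len > cutoff]
    return sum(
        comb(barcode_len, k) * error_rate**k * (1 - error_rate)**(barcode_len - k)
        for k in allowed
    )

if __name__ == "__main__":
    # Combined per-base error (polymerase misincorporation plus read error).
    for e in (0.05, 0.10, 0.20, 0.30):
        print(f"per-base error {e:.2f}: P(pass) = {pass_probability(32, e):.3f}")
```

A table of this kind, computed with the real error model, would show at which combined error rate a polymerase effectively drops out of the screen.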
There are some minor inconsistencies in the data that should be addressed.<br />
Supplemental table 5 shows the calculation of the proportion of mapped reads in<br />
the low throughput 3-plex experiments. The number of total raw reads used to<br />
calculate the 67% CBT mapping as described by the main text is 418, the value<br />
for RPol1 alone rather than a sum of the total read values for all three<br />
columns. Similarly, the text states that 20 polymerase variants were identified<br />
in the screen while figure 6a shows only 17 polymerases were associated with<br />
barcode counts.
The method described in the paper is conceptually strong and should be very<br />
helpful in identifying polymerases with desirable kinetic properties when<br />
coupled to mutagenesis screens. It has the potential to be improved upon as<br />
nanopore sequencing technology is further developed and the error rate that is<br />
currently innate to the platform is decreased. It is likely that general<br />
improvements to nanopore sequencing itself would greatly decrease false positive<br />
rates in the described method. This technique could also be more applicable if<br />
its points of failure were addressed and proper thresholds defined. The higher<br />
false positive rate observed for RPol2 (supplemental figure 11a) more likely<br />
reflects the fidelity of the polymerase than a characteristic of the<br />
barcode set. What kind of polymerase misincorporation rate is permissible to<br />
still allow confident barcode assignment? At what point does polymerase<br />
processivity become an issue and cause ambiguity in barcode identification?<br />
There appears to be a set of kinetic parameters that must be met in order for<br />
differences in polymerases to be resolved by this assay. Clearly defining what<br />
it is good at and what it is going to miss is essential before it can be used<br />
reliably for screening.
Reviewer 2.
Summary<br />
In this article, the authors expand upon their previously published system of single-molecule<br />
nanopore sequencing-by-synthesis and investigate whether it can be scaled-up to be<br />
used as a screening method downstream of polymerase directed evolution experiments. The<br />
major advancement in this paper is that as a screening tool for polymerases, it also has the<br />
capability to provide detailed kinetic information on each of the polymerases, something that<br />
prior methods struggled to do. As a proof-of-principle, the authors simultaneously screen 96<br />
polymerases with 96 barcodes and extract kinetic data from their single-molecule profiling.<br />
This work has multiple merits. Notably, although the general framework is the same, the<br />
authors have made a series of changes to improve their system since their previously published<br />
work, that played a role in allowing them to make multiplexed measurements. The authors also<br />
creatively pull a variety of kinetic parameters from their single-molecule voltage traces that<br />
allow them to easily separate different polymerases after principal component analysis.<br />
On the other hand, the work has a few issues with regard to controls and clarity,<br />
detailed below, that would be worth addressing.<br />
Major Issues<br />
1. The authors utilize tagged DNA bases that generate unique signals for recognition<br />
when the tag is captured in, and blocks, the nanopore. From the principal component analysis<br />
tables (Supplementary Table 4a-c), it appears that the polymerases vary quite a bit with<br />
regard to processing different bases. At present, it is unclear whether these kinetic<br />
differences are being caused by differences between structures of the bases, or whether<br />
they are caused by differences between structures of the tags. One control would be to<br />
repeat one set of experiments with the tags shuffled between the bases and observe<br />
how reproducible the results are. This would give the reader a sense of how much<br />
measurements are being affected by the tags used for this technique.<br />
2. For the experiment in Fig. 5, the authors end up showing that barcodes can be identified<br />
with a false positive rate of 13%. This is with a pilot experiment of 96 barcodes. These<br />
data suggest that this technique would be difficult to scale up any further, which<br />
may limit its usefulness; in fact, even 96 barcodes may already be pushing the limit.<br />
From reading the paper, it is unclear whether this problem is dominated by the length of<br />
the barcode (i.e. limited sequence divergence at 32 nt), or whether nanopore sequencing<br />
accuracy is still a limiting factor. It would be great to see a small pilot experiment with<br />
longer barcodes to see if this could allow for improved accuracy, or some in silico<br />
statistical modeling extrapolating from their current data (e.g. length of barcode x<br />
required to accurately separate number of polymerases y with a false positive rate of z).
quite flexible, it still is unaddressed whether this repeated jostling of the tag<br />
(linked directly to the base) would affect kinetic measurements. Overall, it would be nice<br />
to see some measurements compared or benchmarked against a more well-established<br />
technique side-by-side (e.g. single-molecule optical trap), just to see if the data matches<br />
up or not. Notably with a parallel technique, you can also do the control of tagged vs.<br />
untagged nucleotides, thus unambiguously determining the potential effect of a tag on<br />
polymerase kinetics.<br />
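The in silico modeling suggested in major issue 2 (barcode length x, number of polymerases y, false positive rate z) could start from a small Monte Carlo simulation. This is a minimal sketch under assumptions not taken from the preprint: random barcodes, substitution-only errors, and nearest-barcode assignment by Hamming distance in place of the authors' alignment-based classifier. All function names are illustrative.

```python
import random

BASES = "ACGT"

def random_barcodes(n: int, length: int, rng: random.Random) -> list[str]:
    """Draw n distinct random barcodes of the given length."""
    if n > 4 ** length:
        raise ValueError("not enough distinct sequences of this length")
    seen: set[str] = set()
    while len(seen) < n:
        seen.add("".join(rng.choice(BASES) for _ in range(length)))
    return sorted(seen)  # sorted so results do not depend on set ordering

def noisy_copy(seq: str, error_rate: float, rng: random.Random) -> str:
    """Replace each base by a different base with probability error_rate."""
    return "".join(
        rng.choice([b for b in BASES if b != base]) if rng.random() < error_rate else base
        for base in seq
    )

def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def misassignment_rate(n_barcodes: int, length: int, error_rate: float,
                       trials: int = 300, seed: int = 0) -> float:
    """Fraction of simulated reads whose nearest barcode is not the true one."""
    rng = random.Random(seed)
    barcodes = random_barcodes(n_barcodes, length, rng)
    wrong = 0
    for _ in range(trials):
        true_idx = rng.randrange(n_barcodes)
        read = noisy_copy(barcodes[true_idx], error_rate, rng)
        # Ties go to the lowest index, so a tie can count as a misassignment.
        best = min(range(n_barcodes), key=lambda i: hamming(read, barcodes[i]))
        wrong += best != true_idx
    return wrong / trials

if __name__ == "__main__":
    for length in (16, 32, 64):
        print(length, misassignment_rate(96, length, error_rate=0.3))
```

Sweeping `length` and `n_barcodes` until the rate falls below a target z would give a first estimate of the x, y, z relationship. Real nanopore reads also contain insertions and deletions, so an alignment-based score would be needed for a quantitative answer.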
Minor Issues<br />
1. In the abstract the authors mention they “develop a robust classification algorithm that<br />
discriminates kinetic characteristics of the different polymerase variants.” It is unclear<br />
what this is referring to in the paper. If it is simply the principal component analysis, then<br />
saying “develop” may be a bit overreaching.<br />
2. Rather than referring to prior publications this publication should have in the<br />
supplement and/or methods the exact nucleotide + tag combinations used in this paper.<br />
3. It is unclear after reading the methods why there are three separate PCA tables per<br />
polymerase in the supplement.<br />
4. It is unclear what the difference is between tdwell and tag capture dwell from the written<br />
descriptions in the paper. Highlighting the difference visually in Fig. 3a (as was done<br />
with the rest of the kinetic variables) would help the reader clearly understand exactly<br />
what is being measured.<br />
5. A table of the 96 barcodes used for Fig. 5/6 should be added to the supplementary<br />
materials.<br />
6. The numbers in Supplementary Table 5 do not add up correctly – the authors should<br />
take a look again and make sure the correct numbers are present.<br />
7. In Fig. 2 the authors experimentally calculate BMPI cut-offs for 3 different barcodes and<br />
get 0.8, whereas in Supplementary Fig. 8 the authors do an in-silico calculation for BMPI<br />
cut-off and still get 0.8. One would imagine that increasing the number of barcodes<br />
would require a stricter BMPI cut-off. Some sort of commentary on this, or perhaps<br />
reanalysis of the multiplexed data with a stricter BMPI cut-off could be helpful.<br />
8. In Supplementary Fig. 12 the authors show a protein gel of their pore-polymerase<br />
conjugates. The bands show that post-linking, there is still a decent amount of non-linked<br />
polymerase. In the methods there is no mention of a size exclusion purification<br />
step post-conjugation. Are the authors loading a mixed population onto their chips? This<br />
needs to be clarified.<br />
9. In Supplementary Table 7 the tag capture dwell (TCD) variable is missing.
Reviewer 3.
In the study titled Multiplex single-molecule kinetics of nanopore-coupled<br />
polymerases, Palla et al. developed and demonstrated the use of a<br />
single-molecule sequencing technology for the high-throughput identification of<br />
DNA polymerases with desired kinetic properties. Nanopore sequencing reactions<br />
were carried out on complementary metal-oxide-semiconductor (CMOS) chips, each<br />
of which contains over 30,000 individually addressable electrodes, thereby<br />
allowing sequencing reactions to be carried out on each chip in a multiplex<br />
fashion. Each DNA polymerase was coupled to an α-hemolysin pore and bound to a<br />
51 bp circular barcoded ssDNA template (CBT). The template is bound to a primer,<br />
thus enabling the incorporation of the appropriate nucleotides by the polymerase<br />
into the ssDNA template. Since each ssDNA template is circular, multiple<br />
iterations of the barcoded region can be observed during the sequencing of each<br />
template. Furthermore, each of the four nucleotides is uniquely tagged. When a<br />
nucleotide is being incorporated into the template ssDNA, the tag attached to<br />
the nucleotide is captured in the nanopore, thereby decreasing the conductance<br />
through the pore. Such a decrease in conductance is measured by an analog to<br />
digital converter (ADC) placed parallel to the sequencing circuit, and the<br />
recorded ADC values are then converted into a fraction of open channel signal<br />
(FOCS). Because the four tags are different from each other, the corresponding<br />
FOCS generated differ from each other as well, and can thus be used to<br />
distinguish the nucleotides from each other. Using software, the FOCS is<br />
converted into raw reads. Then, using a barcode classification algorithm, each<br />
qualified raw read is compared to any template of the experimenter’s choice.<br />
Aligning a raw read to the correct template will more likely generate a higher<br />
barcode match probability index (BMPI) value for that read, while aligning a raw<br />
read to an incorrect template will more likely generate a lower BMPI value for<br />
that read. As such, for each sequencing experiment, the average BMPI value<br />
(derived from comparing raw reads to a template) can be used to identify the<br />
template to which the polymerase is bound. And if each polymerase-template pair<br />
is unique, the average BMPI value can then be used to identify the polymerase as<br />
well. Lastly, the authors defined a set of five kinetic parameters that can be<br />
measured during the course of a sequencing reaction. Because different<br />
polymerases are likely to differ from each other with respect to these kinetic<br />
parameters, comparison of the parameters between polymerases can help identify a<br />
polymerase with the desired properties.
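The classification step summarized above can be sketched as follows. This is an illustration only: the exact BMPI definition is not reproduced in this review, so the score here is the similarity ratio from Python's `difflib.SequenceMatcher`, used as a stand-in. The template sequences are made up, and the 0.8 cut-off is the value quoted in the reviews.

```python
from difflib import SequenceMatcher
from typing import Optional

def match_score(read: str, template: str) -> float:
    """Similarity in [0, 1]; a stand-in for BMPI, not the authors' definition."""
    return SequenceMatcher(None, read, template, autojunk=False).ratio()

def classify_read(read: str, templates: dict[str, str],
                  cutoff: float = 0.8) -> Optional[str]:
    """Return the best-scoring template name, or None if no score exceeds the cutoff."""
    scores = {name: match_score(read, seq) for name, seq in templates.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > cutoff else None

if __name__ == "__main__":
    # Hypothetical barcode sequences, for illustration only.
    templates = {"CBT1": "ACGTACGGTCAATGCC", "CBT2": "GGATCCTTAGCAGTCA"}
    print(classify_read("ACGTACGGTCAATGCC", templates))
    print(classify_read("TTTTTTTTTTTTTTTT", templates))
```

Because each polymerase is loaded with a known template, the name returned for a read also identifies the polymerase that produced it. Reads that span several iterations of the circular template would need to be split or aligned locally first, which this sketch does not attempt.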
To develop their nanopore sequencing technology, the authors first showed that<br />
the BMPI value can be used to identify a CBT. Thereafter, the authors showed<br />
that, after a polymerase is loaded with a particular CBT, the loaded CBT will<br />
not get replaced by another CBT that is present in the same reaction volume,<br />
thereby demonstrating the potential for multiplexing this sequencing platform.<br />
Then, as stated above, the authors defined five kinetic parameters that can be<br />
measured during sequencing. Using principal component analysis (PCA), the<br />
authors showed that these kinetic parameters differ between polymerases, thus<br />
indicating the ability of this platform to distinguish polymerases based on<br />
these parameters. To demonstrate the multiplex potential of their platform, the<br />
authors conducted multiplex experiments in which different sets of CBTs were<br />
loaded onto three different polymerases. These pore-polymerase-CBT conjugates<br />
were then pooled prior to loading onto the CMOS chip. Notably, these experiments<br />
showed that CBTs can be identified in a pooled format. Finally, as a practical<br />
demonstration of the capability of the platform to identify, in a multiplex<br />
format, polymerases with properties of interest, the authors generated 96<br />
polymerases, each of which was then loaded with a unique CBT. In this multiplex<br />
reaction, the authors identified four polymerases that are potential candidates<br />
for further development for use in DNA amplification methods.
Here are some thoughts I had while going through the preprint:
-
The authors state that, in their pooled 3-plex sequencing experiment, about<br />
67% of the raw reads (n = 418) were identified as any of the three barcodes used<br />
in the experiment. In Supplementary Table 5, it can be seen that, for total<br />
RPol-CBT, [the percent of raw reads with BMPI > 0.8] = [the number of raw reads<br />
with BMPI > 0.8] / [the total number of raw reads]. That is, 66.9% = 280 / 418.<br />
However, the table shows that the total number of raw reads for the RPol1-CBT1<br />
alone is 418. If this is the case, it is unclear to me how the total number of<br />
raw reads for all three RPol-CBTs (RPol1-CBT1, RPol2-CBT2, and RPol3-CBT3) can<br />
be 418 if that of RPol1-CBT1 alone is already 418.
-
On p19, line 1, I believe that “Experiments 1 and 3” should say “experiments<br />
1 through 3”, since in all three of these experiments, the raw reads were<br />
compared to the correct template, as noted in the legend below the figure<br />
(Supplementary Figure 6b).
-
In Supplementary Figure 6a, the color-coding legend indicates that the<br />
barcode region of the ssDNA template is highlighted in grey. However, nothing in<br />
the ssDNA sequence was highlighted in grey.
-
The data presentation for Supplementary Figure 6b along with the associated<br />
text description is a bit confusing to me. It is stated that, in experiments<br />
1-3, the reads were compared to the correct templates, while the reads in<br />
experiments 4-5 were compared to the incorrect templates shown in Supplementary<br />
figure 6a. In this part of the study, the three pore-polymerase-CBT conjugates<br />
(RPol1:CBT1, RPol2:CBT2, and RPol3:CBT3) were first individually assembled, and<br />
then pooled and loaded onto the CMOS chip. Assuming that this has been done for<br />
each of the five experiments indicated in Supplementary Figure 6, then there is<br />
really no universally correct template (e.g., comparing CBT1 to the raw reads of<br />
a pooled experiment would only yield higher BMPI values for a third of the reads,<br />
i.e., only for RPol1:CBT1-derived raw reads). Are the raw reads from experiments<br />
1, 2, and 3 compared to CBT1, CBT2, and CBT3, respectively? This wasn’t<br />
specified anywhere in the text.
-
Regarding Figure 6a, the authors stated that, out of all of the 96<br />
polymerases screened in this multiplex experiment, 20 polymerases were<br />
identified as having detectable activity (p23, bottom). However, as depicted in<br />
Figure 6a, there are only 17 polymerases for which the associated barcodes were<br />
counted (i.e., there are only 17 yellow bars). Thus, it is unclear to me where<br />
the number “20” is derived from.
-
In the PCA analysis in Supplementary Figure 11, the authors tried to map the<br />
sequencing data derived from the multiplex experiment back to those derived from<br />
the singleplex experiments involving the same three polymerases. The sequencing<br />
data set for the second barcode set (CBT33-64) could not be mapped back well,<br />
and it was stated that this might be due to the high false positive rate of<br />
barcode identification for that barcode set. That being said, as indicated in<br />
Supplementary Table 6, the false positive rates for RPol1:CBT1-32 and<br />
RPol2:CBT33-64 are 11.94% and 16.06%, respectively. Thus, if the authors’ claim<br />
is true, the inability to map back is due to a 16.06% – 11.94% = 4.12%<br />
difference in the false positive rate. It is unclear to me if a 4.12% difference<br />
in false positive rate would really lead to such a dramatic difference in the<br />
ability to map back. Also, it is unclear if this higher false positive rate<br />
arose due to the polymerase (RPol2), the templates (CBT33-64), both, or neither.<br />
Logically, it seems unlikely that the rate would be due to the CBTs since it is<br />
unlikely that the middle third of the set of 96 CBTs would just happen to give<br />
higher false positive rates in comparison to the other two thirds. An easily<br />
accomplished comparison between two polymerases would be to load both<br />
polymerases with the exact same set of CBTs, and then compare the derived false<br />
positive rate for each polymerase. Then, one can repeat the experiment but using<br />
a different CBT set. This will help narrow down whether the observed false<br />
positive rate is due to the polymerase or the CBTs themselves.
-
Regarding Figure 5, the exact differences between 5a and 5b are unclear to<br />
me. I see that the data presentation is a little different, but I’m not sure if<br />
both figures are necessary here given that both deal with the same three<br />
polymerases as well as the same set of 96 CBTs.
-
It is stated that the surface of each individual CMOS chip contains 32,768<br />
electrodes (p30) and that the chip contains thousands of pores (p4). As<br />
described in the Figure 1a legend, the measurement setup<br />
requires two electrodes (a counter electrode and a working electrode). Given<br />
this, it is unclear to me what proportion of those 30,000-some electrodes are<br />
working or counter electrodes. I believe that clarification on this would help<br />
the reader get a better sense of the number of pore-polymerase-CBT conjugates on<br />
each individual CMOS chip, and thus, a better sense and appreciation of the<br />
multiplex scale.
-
On p30, under the section Pore-polymerase-template complex formation,<br />
“SpyCather” should say “SpyCatcher” (i.e., a “c” is missing).
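
The arithmetic behind the Supplementary Table 5 and Supplementary Table 6 comments above can be checked directly. The short sketch below uses only the numbers quoted in this review; the function name is illustrative.

```python
def percent_mapped(mapped: int, total: int) -> float:
    """Percentage of raw reads that pass the BMPI cut-off."""
    return 100 * mapped / total

# Numbers as quoted in the review of Supplementary Table 5.
mapped_reads = 280   # raw reads with BMPI > 0.8
total_reads = 418    # denominator used in the main text

print(f"{percent_mapped(mapped_reads, total_reads):.2f}%")  # prints 66.99%

# False positive rates as quoted from Supplementary Table 6 (in percent).
fp_rpol1, fp_rpol2 = 11.94, 16.06
print(f"{fp_rpol2 - fp_rpol1:.2f} percentage points")       # prints 4.12 percentage points
```

The quotient 280 / 418 rounds to 67.0%, so the 66.9% figure appears to be truncated rather than rounded. If 418 is the read total for RPol1-CBT1 alone, the pooled denominator must be larger, and the pooled percentage correspondingly lower than 67%.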