That's not true. I just sequenced and assembled a new species of fungus to high quality from my home lab using nanopore. You can see all the code I used for assembly and analysis, which will be referenced in a paper I plan to publish in Jan, here: https://github.com/EverymanBio/pestalotiopsis
Given that the decoder is machine-learned and depends on a training set to go from squiggle -> ATGC..., how do you ensure that sequences it hasn't seen before (not in the training set) are still basecalled accurately?
We used Guppy for basecalling, which is neural-network based and turns the raw signal data into predicted bases. There are no guarantees of accuracy, only tools to determine and assess quality. One major way of assessing accuracy is to compare the subject genome with similar reference genomes and note the high degree of homology in highly conserved regions.
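To make the read-level part of that concrete: Guppy writes per-base Phred quality scores into its basecalled FASTQ output, so one simple sanity check is to convert those scores into an expected per-read accuracy. This is just a minimal sketch, assuming standard Phred+33 encoding; the file name is a placeholder, not something from the repo.

    # Minimal sketch: estimate expected per-read accuracy from the Phred
    # quality scores in a basecalled FASTQ (e.g. Guppy output).
    # Assumes standard Phred+33 encoding; "reads.fastq" is a placeholder.

    def read_accuracies(fastq_path):
        """Yield (read_id, expected_accuracy) for each FASTQ record."""
        with open(fastq_path) as fh:
            while True:
                header = fh.readline()
                if not header:
                    break
                fh.readline()              # sequence line (unused here)
                fh.readline()              # '+' separator line
                quals = fh.readline().strip()
                # Phred score Q -> error probability 10^(-Q/10)
                errs = [10 ** (-(ord(c) - 33) / 10) for c in quals]
                accuracy = 1.0 - sum(errs) / len(errs)
                yield header[1:].split()[0], accuracy

    if __name__ == "__main__":
        accs = [a for _, a in read_accuracies("reads.fastq")]
        print(f"reads: {len(accs)}, "
              f"mean expected accuracy: {sum(accs) / len(accs):.3%}")

That only tells you what the basecaller thinks of its own calls, which is why the downstream checks against conserved regions and assembly completeness still matter.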
My question is whether, in the future, we would be able to fully rely on these predicted bases alone, or whether there would always be a need to cross-check against a different sequencing methodology for truly novel genetic information (i.e. de novo sequencing with no reference genome available).
Is there publicly available information on how accurate Guppy is, and on how its accuracy scales with the amount of training data?
It didn't seem like these things were mentioned explicitly in the Community Update, beyond the expectation that accuracy will continue improving; a clearer roadmap would definitely be helpful.
There are quality checks throughout the entire process, starting from the raw read quality scores returned directly by the sequencer all the way to the completeness of the fully assembled genome. In our paper, one of the tools we used for this is BUSCO[0], which scored our assembly at 97.9%, a relatively high score for a de novo assembly.
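For anyone curious what that number means: BUSCO searches the assembly for a lineage-specific set of near-universal single-copy orthologs and reports the fraction it finds complete. Here's a rough sketch of pulling that figure out of BUSCO's short summary file; the exact summary-line format and the file path are assumptions based on typical BUSCO output, not copied from our pipeline.

    # Rough sketch: extract completeness figures from a BUSCO short summary.
    # Assumes the usual one-line summary of the form
    #   C:97.9%[S:97.4%,D:0.5%],F:0.8%,M:1.3%,n:758
    # and a placeholder path; adapt to your actual BUSCO output directory.
    import re

    def busco_completeness(summary_path):
        """Return (complete_pct, fragmented_pct, missing_pct, total_buscos)."""
        pattern = re.compile(
            r"C:(?P<c>[\d.]+)%\[S:[\d.]+%,D:[\d.]+%\],"
            r"F:(?P<f>[\d.]+)%,M:(?P<m>[\d.]+)%,n:(?P<n>\d+)"
        )
        with open(summary_path) as fh:
            for line in fh:
                match = pattern.search(line)
                if match:
                    return (float(match["c"]), float(match["f"]),
                            float(match["m"]), int(match["n"]))
        raise ValueError("no BUSCO summary line found")

    c, f, m, n = busco_completeness("short_summary.txt")
    print(f"{c}% complete, {f}% fragmented, {m}% missing out of {n} BUSCOs")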