Crossref journal-article
Oxford University Press (OUP)
Nucleic Acids Research (286)
Abstract

Novel sequencing technologies permit the rapid production of large sequence data sets. These technologies are likely to revolutionize genetics and biomedical research, but a thorough characterization of the ultra-short read output is necessary. We generated and analyzed two Illumina 1G ultra-short read data sets, i.e. 2.8 million 27mer reads from a Beta vulgaris genomic clone and 12.3 million 36mers from the Helicobacter acinonychis genome. We found that error rates range from 0.3% at the beginning of reads to 3.8% at the end of reads. Wrong base calls are frequently preceded by base G. Base substitution error frequencies vary by 10- to 11-fold, with A > C transversion being among the most frequent and C > G transversions among the least frequent substitution errors. Insertions and deletions of single bases occur at very low rates. When simulating re-sequencing we found a 20-fold sequencing coverage to be sufficient to compensate errors by correct reads. The read coverage of the sequenced regions is biased; the highest read density was found in intervals with elevated GC content. High Solexa quality scores are over-optimistic and low scores underestimate the data quality. Our results show different types of biases and ways to detect them. Such biases have implications on the use and interpretation of Solexa data, for de novo sequencing, re-sequencing, the identification of single nucleotide polymorphisms and DNA methylation sites, as well as for transcriptome analysis.

Bibliography

Dohm, J. C., Lottaz, C., Borodina, T., & Himmelbauer, H. (2008). Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Research, 36(16). https://doi.org/10.1093/nar/gkn425

Authors 4
  1. Juliane C. Dohm (first)
  2. Claudio Lottaz (additional)
  3. Tatiana Borodina (additional)
  4. Heinz Himmelbauer (additional)
References 21 Referenced 860
  1. 10.1126/science.1137325 / Science / Polony multiplex analysis of gene expression (PMAGE) in mouse hypertrophic cardiomyopathy by Kim (2007)
  2. 10.1073/pnas.74.12.5463 / Proc. Natl Acad. Sci. USA / DNA sequencing with chain-terminating inhibitors by Sanger (1977)
  3. 10.1101/SQB.1986.051.01.032 / Cold Spring Harb. Symp. Quant. Biol. / Specific enzymatic amplification of DNA in vitro: the polymerase chain reaction by Mullis (1986)
  4. 10.1038/nature03959 / Nature / Genome sequencing in microfabricated high-density picolitre reactors by Margulies (2005)
  5. 10.1186/1471-2164-7-275 / BMC Genomics / 454 sequencing put to the test using the complex genome of barley by Wicker (2006)
  6. 10.1186/1471-2164-7-272 / BMC Genomics / Sequencing Medicago truncatula expressed sequenced tags using 454 Life Sciences technology by Cheung (2006)
  7. 10.1101/gr.5145806 / Genome Res. / Gene discovery and annotation using LCM-454 transcriptome sequencing by Emrich (2007)
  8. 10.1104/pp.107.096677 / Plant Physiol. / Sampling the Arabidopsis transcriptome with massively parallel pyrosequencing by Weber (2007)
  9. 10.1093/nar/gkl444 / Nucleic Acids Res. / Multiplex sequencing of paired-end ditags (MS-PET): a strategy for the ultra-high-throughput analysis of transcriptomes and genomes by Ng (2006)
  10. 10.1101/gr.6435207 / Genome Res. / SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing by Dohm (2007)
  11. 10.1016/j.cell.2007.05.009 / Cell / High-resolution profiling of histone methylations in the human genome by Barski (2007)
  12. 10.1038/nmeth1068 / Nat. Methods / Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing by Robertson (2007)
  13. 10.1186/gb-2007-8-7-r143 / Genome Biol. / Accuracy and quality of massively parallel DNA pyrosequencing by Huse (2007)
  14. 10.1093/nar/gni170 / Nucleic Acids Res. / An analysis of the feasibility of short read sequencing by Whiteford (2005)
  15. 10.1371/journal.pgen.0020120 / PLoS Genet. / Who ate whom? Adaptive Helicobacter genomic changes that accompanied a host jump from early humans to large felines by Eppinger (2006)
  16. 10.1038/nmeth.1179 / Nat. Methods / Whole-genome sequencing and variant discovery in C. elegans by Hillier (2008)
  17. 10.1093/nar/gkl404 / Nucleic Acids Res. / Sequence biases in large scale gene expression profiling data by Siddiqui (2006)
  18. 10.1093/nar/29.12.e60 / Nucleic Acids Res. / Identification and prevention of a GC content bias in SAGE libraries by Margulies (2001)
  19. 10.1093/nar/23.8.1411 / Nucleic Acids Res. / PCR bias in amplification of androgen receptor alleles, a trinucleotide repeat marker used in clonality studies by Mutter (1995)
  20. (no DOI) / Automated DNA Sequencing and Analysis, 1st edn. by Kelley (1994)
  21. 10.1038/nature06745 / Nature / Shotgun bisulphite sequencing of the Arabidopsis genome reveals DNA methylation patterning by Cokus (2008)
Dates
Type When
Created July 26, 2008, 8:25 p.m.
Deposited Aug. 3, 2021, 2:11 p.m.
Indexed Aug. 29, 2025, 6:04 a.m.
Issued July 26, 2008
Published July 26, 2008
Published Online July 26, 2008
Published Print Sept. 1, 2008
Funders 1
  1. Max-Planck-Gesellschaft 10.13039/501100004189

    Region: Europe

    gov (Research institutes and centers)

    Labels 4
    1. Max Planck Society for the Advancement of Science
    2. Max-Planck-Gesellschaft zur Förderung der Wissenschaften
    3. Max Planck Society
    4. MPG

@article{Dohm_2008,
  title={Substantial biases in ultra-short read data sets from high-throughput DNA sequencing},
  volume={36},
  ISSN={0305-1048},
  url={http://dx.doi.org/10.1093/nar/gkn425},
  DOI={10.1093/nar/gkn425},
  number={16},
  journal={Nucleic Acids Research},
  publisher={Oxford University Press (OUP)},
  author={Dohm, Juliane C. and Lottaz, Claudio and Borodina, Tatiana and Himmelbauer, Heinz},
  year={2008},
  month=jul
}