The US National Center for Biotechnology Information (NCBI) hosts repositories for many types of biomedical and genomics data. Today we’ll retrieve reference data from the Genomes Database FTP server and sequencing reads from the Sequence Read Archive (SRA).
Our dataset is a SARS-CoV-2 Next Generation Sequencing sample. In this section we’ll obtain our reference data and our NGS reads in preparation for alignment.
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/858/895/GCF_009858895.2_ASM985889v3/GCF_009858895.2_ASM985889v3_genomic.fna.gz
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/858/895/GCF_009858895.2_ASM985889v3/GCF_009858895.2_ASM985889v3_genomic.gff.gz
Click Start followed by Close
Two jobs will appear in the History. Each job changes color as it progresses: Grey (pending) -> Orange (running) -> Green (complete).
The virus genome is in FASTA format. A FASTA record has two parts: a header line that starts with a “>” symbol and carries the sequence identifier, followed by the sequence itself on subsequent lines. You can see a preview of it by clicking on the genome dataset in the History panel.
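To make the two-part structure concrete, here is a minimal Python sketch of a FASTA reader. The function name `parse_fasta` and the toy record are illustrative assumptions, not part of the Galaxy workflow; for real analyses an established library such as Biopython would normally be the better choice.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {identifier: sequence} from FASTA-formatted lines."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # Header line: the identifier is the first token after ">"
            fields = line[1:].split()
            current_id = fields[0] if fields else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence line: append to the record opened by the last header
            records[current_id] += line
    return records


# Toy record (made up for illustration, not the SARS-CoV-2 genome)
example = [
    ">toy_seq hypothetical example record",
    "ATTAAAGGTT",
    "TATACCTTCC",
]

records = parse_fasta(example)
for name, sequence in records.items():
    print(name, len(sequence), sequence)
```

The sequence lines are joined into one string, so the line wrapping used in the file does not affect the result.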
The gene annotation file is in Generic Feature Format (GFF). This format tells us where genes are located in the reference genome. To preview the GFF file, click on the uncompressed genes dataset. Note that we must always be sure that our gene information and genome come from the same source, because the annotation coordinates are only meaningful for the exact genome assembly they were made for.
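As a rough illustration of how GFF records describe gene locations, the sketch below reads the nine tab-separated columns of GFF3 (seqid, source, type, start, end, score, strand, phase, attributes) and keeps only the rows of one feature type. The names `Feature` and `parse_gff`, and the toy annotation lines, are assumptions made for this example and do not come from the NCBI file.

```python
from typing import Dict, Iterable, List, NamedTuple


class Feature(NamedTuple):
    seqid: str
    feature_type: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    strand: str
    attributes: Dict[str, str]


def parse_gff(lines: Iterable[str], feature_type: str = "gene") -> List[Feature]:
    """Collect features of one type from GFF3-formatted lines."""
    features: List[Feature] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        columns = line.split("\t")
        if len(columns) != 9:
            continue  # not a well-formed feature row
        seqid, _source, ftype, start, end, _score, strand, _phase, attrs = columns
        if ftype != feature_type:
            continue
        # Column 9 holds semicolon-separated key=value pairs
        attributes = dict(
            pair.split("=", 1) for pair in attrs.split(";") if "=" in pair
        )
        features.append(Feature(seqid, ftype, int(start), int(end), strand, attributes))
    return features


# Toy annotation (made-up coordinates, for illustration only)
example = [
    "##gff-version 3",
    "toy_genome\tExample\tregion\t1\t100\t.\t+\t.\tID=toy_genome:1..100",
    "toy_genome\tExample\tgene\t10\t90\t.\t+\t.\tID=gene-toy1;Name=toy1",
]

genes = parse_gff(example)
for gene in genes:
    print(gene.seqid, gene.start, gene.end, gene.strand, gene.attributes.get("Name"))
```

The `seqid` column must match a sequence identifier in the FASTA file, which is one practical reason the genome and annotation need to come from the same source.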
We are interested in obtaining reads from the sample titled “Viral genomic RNA sequencing of a B.1.617.2/Delta isolate; Severe acute respiratory syndrome coronavirus 2; RNA-Seq”.
We’ll download the data from the Sequence Read Archive using a Galaxy tool called SRA Toolkit.
SRR15607266
FASTQ format stores both sequence data and information about the quality of each sequenced position.
Each block of 4 lines contains one sequencing read, for example:
@SRR15607266.1 1 length=76
NTTATCTACTTTTATTTCAGCAGCTCGGCAAGGGTTTGTTGATTCAGATGTAGAAACTAAAGATGTTGTTGAATGT
+SRR15607266.1 1 length=76
#8ACCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
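The four-line layout can be read with a short Python sketch like the one below. The helper name `read_fastq` is hypothetical, and the record is a shortened toy version of the read shown above (its first 8 bases and quality symbols), so it is an assumption for illustration rather than real data.

```python
from itertools import islice
from typing import Iterable, Iterator, Tuple


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) for each 4-line FASTQ record."""
    iterator = iter(lines)
    while True:
        block = [line.rstrip("\n") for line in islice(iterator, 4)]
        if not block:
            return  # end of input
        if len(block) < 4:
            raise ValueError("truncated FASTQ record")
        header, sequence, separator, quality = block
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], sequence, quality


# Shortened toy record based on the example read (illustrative only)
example = [
    "@toy.1 1 length=8",
    "NTTATCTA",
    "+toy.1 1 length=8",
    "#8ACCGGG",
]

reads = list(read_fastq(example))
for name, sequence, quality in reads:
    print(name, sequence, quality)
```

Line 1 holds the read name, line 2 the bases, line 3 a separator starting with “+”, and line 4 one quality symbol per base, which is why the sketch checks that lines 2 and 4 have equal length.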
Paired-end sequencing data is typically stored as two FASTQ files, one for the forward reads and one for the reverse reads. Each file should contain the same number of reads, with the same labels, in the same order. If this convention is not followed, it could cause errors with downstream tools. Fortunately there are tools such as BBTools Repair that can help restore pairing information.
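A quick way to see what “same labels, in the same order” means is to compare the read names of the two files position by position. The sketch below is only a check, not a repair tool like the one mentioned above. The function names and toy reads are assumptions for illustration, and stripping a trailing `/1` or `/2` reflects one common naming convention rather than a rule every dataset follows.

```python
from itertools import zip_longest
from typing import Iterable, List, Optional


def read_name(header: str) -> str:
    """Extract the read name from a FASTQ header line."""
    fields = header[1:].split()
    name = fields[0] if fields else ""
    # Some datasets mark mates with a /1 or /2 suffix; ignore it when comparing
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def header_lines(lines: Iterable[str]) -> List[str]:
    """Return every 4th line, starting with the first (the record headers)."""
    return [line.strip() for i, line in enumerate(lines) if i % 4 == 0]


def first_mismatch(forward: Iterable[str], reverse: Iterable[str]) -> Optional[int]:
    """Return the index of the first unpaired read, or None if all reads pair up."""
    pairs = zip_longest(header_lines(forward), header_lines(reverse))
    for index, (fwd, rev) in enumerate(pairs):
        if fwd is None or rev is None or read_name(fwd) != read_name(rev):
            return index
    return None


# Toy forward and reverse reads (made up for illustration)
forward = ["@toy.1/1", "ACGT", "+", "GGGG", "@toy.2/1", "TTGA", "+", "GGGG"]
reverse = ["@toy.1/2", "ACGT", "+", "GGGG", "@toy.2/2", "TCAA", "+", "GGGG"]

print(first_mismatch(forward, reverse))                    # files agree
print(first_mismatch(forward, reverse[4:] + reverse[:4]))  # reverse reads reordered
```

A result of `None` means every position holds a matching pair; an integer gives the position of the first read where the two files disagree or where one file runs out.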
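The symbols we see in the read quality string are an encoding of the quality score: each character’s ASCII code minus 33 gives the score Q (the Phred+33 convention, under which “G” decodes to 38).
A quality score is a prediction of the probability P of an error in base calling, with Q = -10 × log10(P): Q = 10 means a 1 in 10 chance of error, Q = 20 means 1 in 100, and Q = 30 means 1 in 1000.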
Going back to our read, we can see that most positions have the quality symbol “G”, which corresponds to Q = 38, that is, a probability of less than 1/1000 of a base-calling error.
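The arithmetic behind this reading can be reproduced in a few lines of Python. This is a sketch that assumes the Phred+33 encoding (ASCII code minus 33), which matches the “G” = 38 value above; the function names are illustrative.

```python
def phred_score(symbol: str, offset: int = 33) -> int:
    """Decode one quality character into a Phred score (assumes Phred+33)."""
    return ord(symbol) - offset


def error_probability(score: int) -> float:
    """Convert a Phred score Q into an error probability, P = 10^(-Q/10)."""
    return 10 ** (-score / 10)


# The distinct symbols that open the example quality string
for symbol in "#8ACG":
    q = phred_score(symbol)
    print(f"{symbol!r}: Q = {q:2d}, P(error) = {error_probability(q):.2e}")
```

The first base of the example read has the symbol “#”, which decodes to Q = 2, a very low confidence that fits the uncalled base “N” at that position.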