intro-to-galaxy-ngs-sarscov2

Obtaining reference data and NGS data from public repositories

The US National Center for Biotechnology Information (NCBI) hosts repositories for many types of biomedical and genomics data. Today we’ll retrieve reference data from the Genomes Database FTP server and sequencing reads from the Sequence Read Archive (SRA).

Step 1: Galaxy Setup

Create a new history

Step 2: Obtaining our Data

Our dataset is a SARS-CoV-2 Next Generation Sequencing sample. In this section we’ll obtain our reference data and our NGS reads in preparation for alignment.

Import the SARS-CoV-2 genome and gene annotation from NCBI

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/858/895/GCF_009858895.2_ASM985889v3/GCF_009858895.2_ASM985889v3_genomic.fna.gz
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/858/895/GCF_009858895.2_ASM985889v3/GCF_009858895.2_ASM985889v3_genomic.gff.gz
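The two URLs above share a directory layout in which the accession digits are split into three-character folders. As a minimal sketch, the paths can be rebuilt from the accession and assembly name. The function name `ncbi_genome_url` is hypothetical, and the layout is inferred only from the two URLs shown here, so it may not hold for every assembly.

```python
def ncbi_genome_url(accession: str, assembly_name: str, suffix: str) -> str:
    """Build a download URL, assuming the directory layout seen in the URLs above."""
    prefix, rest = accession.split("_", 1)   # "GCF", "009858895.2"
    digits = rest.split(".", 1)[0]           # "009858895"
    # Split the digits into three-character directory names: 009/858/895
    chunks = [digits[i:i + 3] for i in range(0, len(digits), 3)]
    stem = f"{accession}_{assembly_name}"
    base = "https://ftp.ncbi.nlm.nih.gov/genomes/all"
    return f"{base}/{prefix}/{'/'.join(chunks)}/{stem}/{stem}_{suffix}"


# Accession and assembly name are taken from the URLs in this tutorial.
genome_url = ncbi_genome_url("GCF_009858895.2", "ASM985889v3", "genomic.fna.gz")
annotation_url = ncbi_genome_url("GCF_009858895.2", "ASM985889v3", "genomic.gff.gz")
print(genome_url)
print(annotation_url)
```

Building both URLs from the same accession is one way to make sure the genome and its annotation come from the same assembly.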

Fasta Format

The virus genome is in fasta format. Each fasta record has two parts: a sequence identifier line preceded by a “>” symbol, followed by the sequence on subsequent lines. You can see a preview of it by clicking on the genome dataset in the History panel.
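To make the two-part structure concrete, here is a minimal sketch of a fasta reader. The function name `parse_fasta` and the toy record are made up for illustration; for real analyses an established library would be a safer choice.

```python
from typing import Dict, Optional


def parse_fasta(text: str) -> Dict[str, str]:
    """Map each identifier line (without the leading '>') to its full sequence."""
    records: Dict[str, str] = {}
    name: Optional[str] = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:]      # a new record starts here
            records[name] = ""
        elif name is not None:
            records[name] += line  # sequence may span several lines
    return records


# Hypothetical toy record, not the real virus genome.
example = ">toy_sequence example\nACGTAC\nGGTA\n"
records = parse_fasta(example)
print(records)
```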

GFF Format

The gene annotation file is in Generic Feature Format (GFF). This format tells us where genes are located in the reference genome. To preview the GFF file, click on the uncompressed genes dataset. Note that we must always be sure that our gene information and genome come from the same source.
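A GFF3 feature line has nine tab-separated columns (sequence ID, source, type, start, end, score, strand, phase, attributes). The sketch below splits one such line into named fields. The names `GffFeature` and `parse_gff_line` and the toy line are hypothetical, and comment lines starting with “#” are assumed to have been skipped already.

```python
from typing import Dict, NamedTuple


class GffFeature(NamedTuple):
    seqid: str
    source: str
    type: str
    start: int
    end: int
    score: str
    strand: str
    phase: str
    attributes: Dict[str, str]


def parse_gff_line(line: str) -> GffFeature:
    """Split one GFF3 feature line into its nine columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, found {len(cols)}")
    # Column 9 holds key=value pairs separated by ';'
    attrs = dict(item.split("=", 1) for item in cols[8].split(";") if "=" in item)
    return GffFeature(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                      cols[5], cols[6], cols[7], attrs)


# Hypothetical toy feature, not a line from the real annotation file.
toy_line = "toy_genome\texample\tgene\t10\t90\t.\t+\t.\tID=gene-toy;Name=toy"
feature = parse_gff_line(toy_line)
print(feature.seqid, feature.type, feature.start, feature.end, feature.attributes)
```

The first column should match a sequence identifier in the fasta file, which is one practical check that the annotation and the genome come from the same source.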

Step 3: Import NGS sequencing data from Sequence Read Archive

We are interested in obtaining reads from the sample “Viral genomic RNA sequencing of a B.1.617.2/Delta isolate; Severe acute respiratory syndrome coronavirus 2; RNA-Seq”.

Download Reads

We’ll download the data from the Sequence Read Archive using a Galaxy tool called SRA Toolkit.

Fastq format

Fastq format is a way to store both sequence data and information about the quality of each sequenced position.

Each block of 4 lines contains one sequencing read, for example:

@SRR15607266.1 1 length=76
NTTATCTACTTTTATTTCAGCAGCTCGGCAAGGGTTTGTTGATTCAGATGTAGAAACTAAAGATGTTGTTGAATGT
+SRR15607266.1 1 length=76
#8ACCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

The four lines are:

  1. Sequence identifier
  2. Sequence
  3. + (optionally lists the sequence identifier again)
  4. Quality string
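
The four-line structure can be sketched as a small reader that yields one record per block. The function name `read_fastq` and the shortened toy read are hypothetical. The sketch assumes every record occupies exactly four lines, as in the example above.

```python
from typing import Iterable, Iterator, Tuple


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, sequence, quality) for each 4-line fastq record."""
    cleaned = [line.rstrip("\n") for line in lines if line.strip()]
    if len(cleaned) % 4 != 0:
        raise ValueError("fastq input should contain a multiple of 4 lines")
    for i in range(0, len(cleaned), 4):
        header, sequence, plus, quality = cleaned[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence and quality lengths differ for {header}")
        yield header[1:], sequence, quality


# Hypothetical toy read, shortened to 8 bases for illustration.
toy_lines = ["@toy.1 1 length=8", "NTTATCTA", "+toy.1 1 length=8", "#8ACCGGG"]
reads = list(read_fastq(toy_lines))
print(reads)
```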

Paired-end sequencing data will typically be stored as two fastq files, one for the forward reads and one for the reverse reads. Each file should contain the same number of reads, with the same labels, in the same order. If this convention is not followed, downstream tools may fail or produce incorrect results. Fortunately, there are tools such as BBTools Repair that can help restore pairing information.
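
As a rough illustration of the pairing convention, the sketch below compares read labels from the two files position by position and reports the first disagreement. The function name `first_pairing_mismatch` is hypothetical, and it assumes the labels are expected to be exactly identical in both files; some datasets add suffixes such as “/1” and “/2”, which would need to be stripped first.

```python
from itertools import zip_longest
from typing import Iterable, Optional, Tuple


def first_pairing_mismatch(
    forward_ids: Iterable[str], reverse_ids: Iterable[str]
) -> Optional[Tuple[int, Optional[str], Optional[str]]]:
    """Return (index, forward_id, reverse_id) at the first mismatch, or None if paired."""
    # zip_longest pads the shorter input with None, so unequal read counts are caught too.
    for index, (fwd, rev) in enumerate(zip_longest(forward_ids, reverse_ids)):
        if fwd != rev:
            return index, fwd, rev
    return None


# Hypothetical read labels for illustration.
print(first_pairing_mismatch(["r1", "r2", "r3"], ["r1", "r2", "r3"]))
print(first_pairing_mismatch(["r1", "r2", "r3"], ["r1", "r3", "r2"]))
```

This only detects a problem; it does not repair the files.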

Base Quality Scores

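The symbols we see in the read quality string are an encoding of the quality score. In the encoding used here (Phred+33), the score Q of a base is the ASCII code of its symbol minus 33.

A quality score is a prediction of the probability P of an error in base calling: Q = -10 × log10(P), or equivalently P = 10^(-Q/10).

Going back to our read, we can see that for most of our read the quality symbol is “G” -> Q = 38 -> probability of an error < 1/1000.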
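
The decoding can be sketched in a few lines, assuming the Phred+33 encoding (ASCII code minus 33), which is consistent with “G” mapping to 38, and the standard relation P = 10^(-Q/10). The function names `phred_scores` and `error_probability` are hypothetical.

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert a quality string to Phred scores, assuming an ASCII offset of 33."""
    return [ord(symbol) - offset for symbol in quality]


def error_probability(q: int) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


# The distinct symbols at the start of the example read's quality string.
scores = phred_scores("#8ACCG")
print(scores)  # [2, 23, 32, 34, 34, 38]
print(error_probability(38) < 1 / 1000)  # True
```

The first base of the example read has the symbol “#”, which decodes to Q = 2; this matches the uncalled base “N” at the start of the sequence.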

Preview Fastq data
