Introduction to Next Generation Sequencing Bioinformatics

Approximate time: 20 minutes

Goals

Log into the HPC cluster’s On Demand interface

[tutln01@login001 ~]$

This prompt indicates that you are logged in to the login node of the cluster.

Set up for the analysis

Find 500M of storage space

Result:

Home Directory Quota
Disk quotas for user tutln01 (uid 31394):
     Filesystem  blocks   quota   limit   grace   files   quota   limit   grace
hpcstore03:/hpc_home/home
                  1222M   5120M   5120M            2161   4295m   4295m        


Listing quotas for all groups you are a member of
Group: facstaff	Usage: 16819478240KB	Quota: 214748364800KB	Percent Used: 7.00%

The blocks column shows how much storage you are currently using, and the quota column shows your limit. Here, the user has used 1222M of the available 5120M, which leaves more than the 500M needed for our analysis.
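The check described above is simple subtraction. A minimal sketch, with the numbers hard-coded from the example output (substitute the values from your own quota report; the variable names are illustrative only):

```shell
# Values copied from the example quota output; replace with your own.
used_mb=1222     # "blocks" column
quota_mb=5120    # "quota" column
needed_mb=500    # space required for this workshop

free_mb=$((quota_mb - used_mb))
echo "Free: ${free_mb}M"

if [ "$free_mb" -ge "$needed_mb" ]; then
    echo "Enough space for the analysis"
else
    echo "Not enough space: consider using a project directory"
fi
```

For the example user this reports 3898M free, comfortably above the 500M requirement.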

Download the data

srun --pty -t 3:00:00 --mem 16G -N 1 --cpus-per-task 4 bash

Notes: If wait times are very long, you can try a different partition by adding, e.g., -p preempt or -p interactive before bash. If you work through this workshop over multiple sessions, you will have to rerun this step each time you log in.
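Because the interactive session has to be requested again after every login, it can help to check whether the current shell is already inside one. A small sketch, relying on the SLURM_JOB_ID environment variable that Slurm sets inside a job; the function name in_slurm_job is made up for this example:

```shell
# Succeeds if this shell is running inside a Slurm allocation.
# Slurm exports SLURM_JOB_ID to processes it launches; on a login node it is unset.
in_slurm_job() {
    [ -n "${SLURM_JOB_ID:-}" ]
}

if in_slurm_job; then
    echo "Inside Slurm job ${SLURM_JOB_ID}"
else
    echo "Not inside a Slurm job: rerun the srun command first"
fi
```

The shell prompt is another hint: on the login node it shows a login hostname such as login001.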

cd

Or, if you are using a project directory:

cd /cluster/tufts/labname/username/

cp -R /cluster/tufts/bio/tools/training/intro-to-ngs/ .

(Also available via: git clone https://gitlab.tufts.edu/rbator01/intro-to-ngs.git)

tree intro-to-ngs

You’ll see a list of all files:

intro-to-ngs
├── all_commands.sh          <-- Bash script with all commands
├── raw_data                 <-- Folder with paired end fastq files
│   ├── na12878_1.fq         
│   └── na12878_2.fq
├── README.md                <-- Contents description
└── ref_data                 <-- Folder with reference sequence
    └── chr10.fa
2 directories, 5 files
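If tree is not available, or you want a quick pass/fail check that the copy is complete, the listing above can be turned into a short test. This is only a sketch: check_layout is a hypothetical helper, and the file names are taken from the tree output shown here.

```shell
# check_layout DIR: print any expected workshop file that is missing from DIR.
# Returns 0 if all five files are present, 1 otherwise.
check_layout() {
    dir=$1
    missing=0
    for f in all_commands.sh README.md \
             raw_data/na12878_1.fq raw_data/na12878_2.fq \
             ref_data/chr10.fa; do
        if [ ! -f "$dir/$f" ]; then
            echo "Missing: $dir/$f"
            missing=1
        fi
    done
    return "$missing"
}

# Usage, from the directory that contains intro-to-ngs:
#   check_layout intro-to-ngs && echo "All files present"
```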

Data for the class

Genome in a Bottle (GIAB) was initiated in 2011 by the National Institute of Standards and Technology “to develop the technical infrastructure (reference standards, reference methods, and reference data) to enable translation of whole human genome sequencing to clinical practice” (Zook et al. 2012). We’ll be using a DNA Whole Exome Sequencing (WES) dataset released by GIAB for benchmarking bioinformatics tools.

The source DNA, known as NA12878, was taken from a single person: the daughter in a father-mother-child ‘trio’. She is also the mother of 11 children of her own, for whom sequence data are also available (HBC Training). Father-mother-child trios are often sequenced to study genetic links between family members.

As mentioned in the introduction, WES is a method that concentrates the sequenced DNA fragments in the coding regions (exons) of the genome.

For this class, we’ve created a small dataset of reads that align to a single gene, so that our commands finish quickly.

Sample: NA12878

Gene: CYP2C19 on chromosome 10

Sequencing: Illumina, Paired End, Exome
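Because the data are paired end, the two fastq files should contain the same number of reads. A rough way to confirm this, assuming the files use the common four-lines-per-record FASTQ layout (this has not been verified for these particular files, and count_reads is a hypothetical helper):

```shell
# count_reads FILE: number of FASTQ records in FILE,
# assuming each record occupies exactly 4 lines.
count_reads() {
    lines=$(wc -l < "$1")
    echo $((lines / 4))
}

# Usage, from the directory that contains intro-to-ngs:
#   r1=$(count_reads intro-to-ngs/raw_data/na12878_1.fq)
#   r2=$(count_reads intro-to-ngs/raw_data/na12878_2.fq)
#   [ "$r1" -eq "$r2" ] && echo "Read counts match: $r1 pairs"
```

If the two counts differ, the files are probably truncated or mismatched and should be copied again.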
