Alignment and filtering of reads
Contributors: Mary Piper, Radhika Khetani, Meeta Mistry
Approximate time: 45 minutes
Learning Objectives
- Perform alignment of reads to the genome using Bowtie2
- Examining a SAM file and understanding the information stored in it
- Filtering aligned reads to keep only uniquely mapped ones
Alignment to Genome
Now that we have assessed the quality of our sequence data, we are ready to align the reads to the reference genome. Bowtie2 is a fast and accurate alignment tool that indexes the genome with an FM Index based on the Burrows-Wheeler Transform method to keep memory requirements low for the alignment process. Bowtie2 supports gapped, local and paired-end alignment modes and works best for reads that are at least 50 bp (shorter read lengths should use Bowtie1). By default, Bowtie2 will perform a global end-to-end read alignment, which is best for quality-trimmed reads. However, it also has a local alignment mode, which will perform soft-clipping for the removal of poor quality bases or adapters from untrimmed reads. We will use this option since we did not trim our reads.
NOTE: Our reads are only 36 bp, so technically we should explore alignment with bwa or Bowtie1 to see if it is better. However, since it is rare that you will have sequencing reads with less than 50 bp, we will show you how to perform alignment using Bowtie2.
Creating a Bowtie2 index
To perform the Bowtie2 alignment, a genome index is required. The index is analagous to the index in the back of a book. By indexing the genome, we have organized it in a manner that now allows for efficient search and retrieval of matches of the query (sequence read) to the genome. We previously generated the genome indices for you, and they exist in the reference_data
directory.
However, if you needed to create a genome index yourself, you would use the following command:
# DO NOT RUN
bowtie2-build <path_to_reference_genome.fa> <prefix_to_name_indexes>
A quick note on shared databases for human and other commonly used model organisms. The O2 cluster has a designated directory at
/n/groups/shared_databases/
in which there are files that can be accessed by any user. These files contain, but are not limited to, genome indices for various tools, reference sequences, tool specific data, and data from public databases, such as NCBI and PDB. So when using a tool that requires a reference of sorts, it is worth taking a quick look here because chances are it's already been taken care of for you.
bash $ ls -l /n/groups/shared_databases/igenome/
Aligning reads to the genome with Bowtie2
Since we have our indices already created, we can get started with read alignment. Change directories to the bowtie2
folder:
$ cd ~/chipseq/results/bowtie2
Now let's load the module. We can find out more on the module on O2:
$ module spider bowtie2
You will notice that before we load this module we also need to load the gcc compiler (as will be the case for many of the NGS analysis tools on O2. Always check module spider
first.)
$ module load gcc/6.2.0 bowtie2/2.2.9
We will perform alignment on our single raw FASTQ file, H1hesc_Input_Rep1_chr12.fastq
. Details on Bowtie2 and its functionality can be found in the user manual; we encourage you to peruse through to get familiar with all available options.
The basic options for aligning reads to the genome using Bowtie2 are:
-p
: number of processors / cores-q
: reads are in FASTQ format--local
: local alignment feature to perform soft-clipping-x
: /path/to/genome_indices_directory-U
: /path/to/FASTQ_file-S
: /path/to/output/SAM_file
$ bowtie2 -p 2 -q --local \
-x ~/chipseq/reference_data/chr12 \
-U ~/chipseq/raw_data/H1hesc_Input_Rep1_chr12.fastq \
-S ~/chipseq/results/bowtie2/H1hesc_Input_Rep1_chr12_aln_unsorted.sam
Alignment file format: SAM/BAM
The output we requested from the Bowtie2 aligner is an unsorted SAM file, also known as Sequence Alignment Map format. The SAM file, is a tab-delimited text file that contains information for each individual read and its alignment to the genome. While we will go into some features of the SAM format, the paper by Heng Li et al provides a lot more detail on the specification.
The file begins with a header, which is optional. The header is used to describe source of data, reference sequence, method of alignment, etc., this will change depending on the aligner being used. Each section begins with character ‘@’ followed by a two-letter record type code. These are followed by two-letter tags and values. Example of some common sections are provided below:
@HD The header line
VN: format version
SO: Sorting order of alignments
@SQ Reference sequence dictionary
SN: reference sequence name
LN: reference sequence length
SP: species
@PG Program
PN: program name
VN: program version
Following the header is the alignment section. Each line that follows corresponds to alignment information for a single read. Each alignment line has 11 mandatory fields for essential mapping information and a variable number of other fields for aligner specific information.
An example read mapping is displayed above. Note that the example above spans two lines, but in the file it is a single line. Let's go through the fields one at a time. First, you have the read name (QNAME
), followed by a FLAG
The FLAG
value that is displayed can be translated into information about the mapping.
Flag | Description |
---|---|
1 | read is mapped |
2 | read is mapped as part of a pair |
4 | read is unmapped |
8 | mate is unmapped |
16 | read reverse strand |
32 | mate reverse strand |
64 | first in pair |
128 | second in pair |
256 | not primary alignment |
512 | read fails platform/vendor quality checks |
1024 | read is PCR or optical duplicate |
- For a given alignment, each of these flags are either on or off indicating the condition is true or false.
- The
FLAG
is a combination of all of the individual flags (from the table above) that are true for the alignment - The beauty of the flag values is that any combination of flags can only result in one sum.
There are tools that help you translate the bitwise flag, for example this one from Picard
Moving along the fields of the SAM file, we then have RNAME
which is the reference sequence name. The example read is from chromosome 1 which explains why we see 'chr1'. POS
refers to the 1-based leftmost position of the alignment. MAPQ
is giving us the alignment quality, the scale of which will depend on the aligner being used.
CIGAR
is a sequence of letters and numbers that represent the edits or operations required to match the read to the reference. The letters are operations that are used to indicate which bases align to the reference (i.e. match, mismatch, deletion, insertion), and the numbers indicate the associated base lengths for each 'operation'.
Operation | Description |
---|---|
M | sequence match or mismatch |
I | insertion to the reference |
D | deletion from reference |
N | skipped region from the reference |
Now to the remaning fields in our SAM file:
The next three fields are more pertinent to paired-end data. MRNM
is the mate reference name. MPOS
is the mate position (1-based, leftmost). ISIZE
is the inferred insert size.
Finally, you have the data from the original FASTQ file stored for each read. That is the raw sequence (SEQ
) and the associated quality values for each position in the read (QUAL
).
Let;'s take a quick peek at our SAM file that we just generated. Since it is just a text file, we can browse through it using less
:
$ less H1hesc_Input_Rep1_chr12_aln_unsorted.sam
Does the information you see line up with the fields we described above?
Filtering reads
An important issue with ChIP-seq data concerns the inclusion of multiple mapped reads (reads mapped to multiple loci on the reference genome). Allowing for multiple mapped reads increases the number of usable reads and the sensitivity of peak detection; however, the number of false positives may also increase [1]. Therefore we need to filter our alignment files to contain only uniquely mapping reads in order to increase confidence in site discovery and improve reproducibility. Since there is no parameter in Bowtie2 to keep only uniquely mapping reads, we will need to perform the following steps to generate alignment files containing only the uniquely mapping reads:
- Change alignment file format from SAM to BAM
- Sort BAM file by read coordinate locations
- Filter to keep only uniquely mapping reads (this will also remove any unmapped reads)
1. Changing file format from SAM to BAM
While the SAM alignment file output by Bowtie2 is human readable, we need a BAM alignment file for downstream tools. Therefore, we will use Samtools to convert the file formats.
To use samtools
we will need to load the module:
module load samtools/1.3.1
The command we will use is samtools view
with the following parameters:
-h
: include header in output-S
: input is in SAM format-b
: output BAM format-o
: /path/to/output/file
$ samtools view -h -S -b \
-o H1hesc_Input_Rep1_chr12_aln_unsorted.bam \
H1hesc_Input_Rep1_chr12_aln_unsorted.sam
You can find additional parameters for the samtools functions in the manual.
2. Sorting BAM files by genomic coordinates
Before we can filter to keep the uniquely mapping reads, we need to sort our BAM alignment files by genomic coordinates (instead of by name). To perform this sort, we will use Sambamba, which is a tool that quickly processes BAM and SAM files.
The command we will use is sambamba sort
with the following parameters:
-t
: number of threads / cores-o
: /path/to/output/file
$ sambamba sort -t 2 \
-o H1hesc_Input_Rep1_chr12_aln_sorted.bam \
H1hesc_Input_Rep1_chr12_aln_unsorted.bam
NOTE: This tool is not available as a module on O2. You will only be able to use this as part of the tools available in the
bcbio
pipeline. In a previous lesson, you had added this to your $PATH by modifying your.bashrc
file. If the command above does not work for you, run this line below:
export PATH=/n/app/bcbio/tools/bin:$PATH
We could have also used samtools
to perform the above sort, however using sambamba
gives us dual functionality. List the contents of the directory -- what do you see? The advantage to using sambamba
is that along with the newly sorted file, an index file is generated. If we used samtools
this would have been a two-step process.
3. Filtering uniquely mapping reads
Finally, we can filter the uniquely mapped reads. We will use the sambamba view
command with the following parameters:
-t
: number of threads / cores-h
: print SAM header before reads-f
: format of output file (default is SAM)-F
: set custom filter - we will be using the filter to remove multimappers and unmapped reads.
$ sambamba view -h -t 2 -f bam \
-F "[XS] == null and not unmapped " \
H1hesc_Input_Rep1_chr12_aln_sorted.bam > H1hesc_Input_Rep1_chr12_aln.bam
We filtered out unmapped reads by specifying in the filter not unmapped
. Also, among the reads that were aligned, we filtered out multimappers by specifying [XS] == null
. 'XS' is a tag generated by Bowtie2 that gives an alignment score for the second-best alignment, and it is only present if the read is aligned and more than one alignment was found for the read.
Now that the alignment files contain only uniquely mapping reads, we are ready to perform peak calling.
NOTE: After performing read alignment, it's useful to generate QC metrics for the alignment using tools such as MultiQC prior to moving on to the next steps of the analysis.
This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.