module load perl-bioperl/1.7.6-semipod
perl ~/classdata/REFS/script/selectSeqsAboveMinLength.pl <your_assembly.fasta> <largest_contig.fasta> 8000
Congratulations!
You have made it to the mid-point of the course.
Now is a good opportunity to look back over anything you’re unclear on and would like to go over again. If you haven’t already annotated all your prokaryote genomes with Prokka, now is the time to do so. If you’ve done all that and are chomping at the bit for more, read on for a tutorial on how to annotate your mitochondrial assemblies.
Mitos2 for mitochondrial annotation
MITOS is a web server for the automatic annotation of metazoan mitochondrial genomes. MITOS allows a reliable and consistent annotation of proteins and non-coding RNAs. The analysis steps are as follows:
• Candidate protein coding genes are found by detecting congruences in the results of blastx searches against the amino acid sequences of the annotated proteins of metazoan mitochondrial genomes found in the NCBI RefSeq 81. A post-processing step detects start and stop codons, duplicates, and hits belonging to the same transcript, e.g. frame shift or splicing.
• tRNAs are annotated using MITFI, i.e. novel structure-based covariance models as described in Jühling, et al. Nucleic Acids Research, 2012, 40(7):2833-2845. This approach was shown to have an unmatched sensitivity (outperforming ARWEN and tRNAscan-SE, respectively) and a precision higher than ARWEN and equivalent to tRNAscan-SE.
• rRNA annotation is performed using structure-based covariance models that have been developed similarly to the tRNA models. Structural considerations improve 5’ and 3’ end predictions of the rRNAs.
• In a final step, conflicts are resolved and the outcome is prepared for visualization.
Ensure you have a single contig
MITOS will only work on a single contig. If your genome has assembled into multiple contigs, you have two choices: either pick the biggest, or artificially join the contigs together.
This code will extract your longest contig - if the assembly is a single contig there is no need to perform this step. It uses a script in a language called perl, so you will need to load the perl module.
The risk associated with this option is that the gene you are interested in may not be in the longest contig.
This code will remove the contig names (as long as they are named 2-9) for all except the first contig, and then remove any line breaks to ensure correct base numbering for the annotation.
module load seqtk/1.4-l7d6lqn
#This sed removes any contigs where the name start '>[any digit between 2-9]'
sed -i '/^>[2-9].*$/d' mito_assemblies/${base}.fasta
#This seqtk utility removes any line breaks from the sequences ensuring correct base numbering for the annotation
seqtk seq -l 0 [assembly].fasta > [assembly_NLB.fasta]
Annotate with MITOS2
N.B. Ensure you create your output folder before you run mitos
module load mitos/2.0.4-conda
mkdir mitos_annot
runmitos.py -i contigs.fasta -c 2 -o mitos_annot -R ~/classdata/REFS/mitos/ -r refseq81m --rrna 0 --trna 0 --intron 0 --debug --noplot
A loop would be something like this (remember to edit loop for your files)
#!/bin/bash
workdir=~/mydata/Session3/
module load mitos/2.0.4
for i in {1..3}; do
mkdir grp${i}_mitos_annot
runmitos.py -i "${workdir}/group${i}_unknown_assembly.fasta" \
-c 2 \
-o grp${i}_mitos_annot \
-R ~/classdata/REFS/mitos/ \
-r refseq81m \
--rrna 1 \
--trna 0 \
--intron 0 \
--debug \
--noplot
done
module unload mitos/2.0.4
Extract Fasta sequence for genes in Bed file
You can extract the fasta sequence for the genes in the bed file using bedtools getfasta:
module load bedtools2/2.31.1-7imv7b4
bedtools getfasta –fi contigs.fasta -bed result.bed -name > result.fasta
Or using a loop
for i in {1..3};
do
bedtools getfasta -fi spades_output_${i}/before_rr.fasta -bed mitos_${i}/result.bed -name >mitos_${i}_result.fasta
done
Follow these steps to annotate your mitochondrial genome using mitos:
Use seqtk seq to select only the largest contig into a new file
Use mitos to annotate the contig, and visualize it with Artemis
Create a loop to generate the output for all mitochondrial assemblies
[Extension] Use the BEDtools to extract the sequence of the annotated genes
[Extension] Use seqtk to extract just the COX1 gene
Take your assembled mitochondrial genome and identify mitochondrial orthologues using blast.
Perform blastx analysis of your sequence against the mitochondrial protein database. Interrogate the using ‘less’.
Re-run the blastx analysis with the output format parameter set to a cutomised
-outfmt 6
, which will output the data into a tablular format compatible with Artemis.
blastx -query [contigs.fasta] -db ~/classdata/REFS/blastdb/hum_mt_pep -out [output.txt] -outfmt "6 qseqid sseqid pident length mismatch gapopen sstart send qstart qend evalue bitscore"
You can now open the fasta sequence you have used in the blast with Artemis and then overlay with the clast output file.
Repeat the blastx analysis of your sequence against the mitochondrial protein database [mito_pro] only returning blast matches with an E value lower that 1E-10 and using all 4 threads.
Repeat the blastx analysis of your sequence against the mitochondrial protein database, only returning blast matches with an E value lower that 1E-10, using all 4 threads, limiting the outputs so that you keep to 100 and putting the outputs into the tabular format compatible with Artemis.
[Extension] Produce a loop to annotate all assembled contigs using blast
[Extension] You can try repeating this exercise with swisprot and note the differences