Loading Software and Accessing Data

It’s all there if you know where to look

Author

Prof. Peter Kille

Published

October 18, 2024

Now that you’re (hopefully) re-caffeinated and have stretched your legs and cleared your brain, let’s have a look at some more need-to-know information.

Software

The Linux OS has some really powerful utilities, but these are not bioinformatics software in themselves. We have to ‘install’ or ‘load’ the tools we need. We do not want all our software loaded all the time, because many bioinformatics tools have complex requirements, e.g. specific versions of Python or Java. Loading tools that require conflicting versions at the same time can cause them to malfunction.

Most managed HPC systems (unlike your own personal computer) handle this by using modules. A module is a preconfigured set of environment settings that makes an already installed software package available in your session. Cardiff systems use modules, so you can see the software that is available with the following command:

module avail

This will bring up a software list that looks like the following:

------------ /mnt/clusters/darter/software/spack/share/spack/modules/linux-rocky8-zen -------------
abricate/1.0.0-qk6snru        freebayes/1.3.6-l3l4sxc                 py-panaroo/1.2.10-fukmiob
abyss/2.3.5-b76h5om           gatk/4.5.0.0-riltjjd                    py-quast/5.2.0-hgcvroq
alan/2.1.1-izjr2po            gdbm/1.23-n6o3b5r                       py-unicycler/0.5.0-d6k3wzh
augustus/3.5.0-qinnslv        grace/5.1.25-gukphap                    python/3.7.4-ftvphto
barrnap/0.9-kewos3t           gromacs/2022.6-3xkd27n                  python/3.10.5-fsxphqu
bcftools/1.19-d7iukuu         hmmer/3.4-yak3kcd                       python/3.11.7-d2ytb5r
bedtools2/2.31.1-7imv7b4      igv/2.16.2-qiah4ab                      qualimap/2.2.1-c5hghxs
blast-plus/2.14.1-eqbljdp     kraken2/2.1.2-lam4dbm                   r/4.4.1-5jt6lfe
bowtie/1.3.1-zgreifk          mafft/7.505-beg3hnq                     racon/1.4.3-qpn4rnq
bowtie2/2.5.2-4j3lpsk         mcl/14-137-hetggoi                      raxml-ng/1.1.0-3gc6hnr
busco/5.4.3-av4kf4t           minimap2/2.28-w7jvpwe                   samtools/1.19.2-m76oqh7
bwa/0.7.17-kikwf2c            ncurses/6.5-yhetvhq                     seqtk/1.4-l7d6lqn
caliper/2.11.0-774rsmi        perl-bioperl/1.7.6-semipod              spades/3.15.5-4zo7gyg
cdhit/4.8.1-z4h4fo5           perl/5.34.0-hvoyvlh                     star/2.7.11a-cbbjioq
dos2unix/7.4.4-scyfkh5        picard/3.1.1-peglbgr                    subread/2.0.6-abbqxcc
dyninst/13.0.0-ssbc5ep        pilon/1.22-hoz46gi                      tbl2asn/2022-04-26-mogg2jy
emboss/6.6.0-mlzei3k          prodigal/2.6.3-o4shfn4                  tmux/3.4-pp5yxaf
fastp/0.23.4-mr6qifo          prokka/1.14.6-barrnap-bedtools-5z4fskk  trimmomatic/0.39-erd3ktn
fasttree/2.1.11-x2xclli       prokka/1.14.6-mqcdb45                   velvet/1.2.10-rryts6f
fastx-toolkit/0.0.14-gp27cub  py-macs3/3.0.0b3-w77yz26                vt/0.5772-d76l4yn
figtree/1.4.4-knoy37j         py-multiqc/1.15-mqimc53                 wtdbg2/2.3-iy52ckp

------------------------ /mnt/clusters/darter/software/nospack/modulefiles ------------------------
apptainer/1.3.3       fmlrc/v1.0.0  MAGScoT/1.0.0-conda  qiime2/2024.5-conda  star/2.7.11b-conda
artemis/18.2.0-conda  go/1.22.4     maxbin2/2.2.7-conda  reportseff/v2.7.6
coverm/0.7.0-conda    java/11.0.2   metabat2/2.17-conda  snippy/v4.6.0
fastqc/v0.12.1        java/17.0.1   mitos/2.0.4-conda    snp-sites/2.5.1

Any of these software packages can be loaded by simply using the module load command, e.g.

module load fastqc/v0.12.1

Remember to unload the module when you are finished so it does not conflict with the next piece of software, or if you want to unload all modules, use the module purge command.

module unload fastqc/v0.12.1
# or, to unload everything at once:
module purge

Module versions can change, so check they are correct with module avail rather than just copying and pasting from this guide.
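A typical module session can be sketched as follows. This is a minimal example, assuming an Environment Modules or Lmod setup like the one listed above. The version string fastqc/v0.12.1 is copied from that listing and may have changed on your system, and the check for the module command is only there so that the sketch fails gracefully on a machine without modules.

```shell
# Sketch: find a tool, load it, confirm it is available, then clean up.
# The version string is an assumption copied from the listing above.
tool_module="fastqc/v0.12.1"

if command -v module >/dev/null 2>&1; then
    module avail fastqc          # list the fastqc versions currently installed
    module load "$tool_module"   # make fastqc available in this session
    module list                  # show everything that is currently loaded
    command -v fastqc || echo "fastqc not found on PATH"
    module purge                 # unload all modules again
else
    echo "module command not found: are you logged in to the cluster?"
fi
```

Running module list before and after loading is a quick way to confirm what is active in your session.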

Data

The data for your course is available to you on a pre-mapped drive location. You can see and list the folders it contains using the following simple command:

ls ~/classdata/

#this should yield the following list of folders

datasets REFS  Session1  Session2  Session3  Session4  Session5  Session6

You can see these folders but not write to them! Before each session, copy the corresponding folder to your mydata directory using cp.

Note the inclusion of the ‘-r’ flag to allow you to copy the folder ‘recursively’: this will copy all the folders and files inside.
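The copy itself might look like the sketch below for Session 1. It assumes the folder layout described above (~/classdata for the read-only course data and ~/mydata for your own working copy). The variable names src and dest are just for illustration, and the check on the source folder is there so that the sketch reports a clear message if the drive is not mapped.

```shell
# Copy the read-only Session1 folder into your own writable mydata directory.
# Paths are assumptions based on the layout described above.
src="$HOME/classdata/Session1"
dest="$HOME/mydata"

if [ -d "$src" ]; then
    mkdir -p "$dest"        # create mydata if it does not exist yet
    cp -r "$src" "$dest"/   # -r copies the folder and everything inside it
    ls "$dest/Session1"     # check that the files have arrived
else
    echo "Cannot find $src: check that you are logged in to the course server"
fi
```

For later sessions, change Session1 to the folder you need.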

Visualising text files

There are many commands available for reading text files on Linux/Unix. These are useful when you want to look at the contents of a file, but not edit them. Among the most common of these commands are cat, more, less, head and tail.

cat streams the entire contents of a file to your terminal, so it is not that useful for reading long files: the text streams past too quickly to read. It is, however, very useful for feeding files into other programs.

head and tail show the first and last 10 lines of a file, respectively. head in particular is useful for taking a quick peek inside a file. You can use the argument -n [number] to change the number of lines displayed. For example, to see the first 20 lines of a file:

head -n 20 file.txt
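To see how -n behaves with both commands, here is a small self-contained sketch. It builds a hypothetical 100-line practice file called numbers.txt in a temporary directory (it is not part of the course data), so the line counts are easy to check.

```shell
# Create a practice file containing the numbers 1 to 100, one per line.
demo_dir=$(mktemp -d)
seq 1 100 > "$demo_dir/numbers.txt"

# With no options, head shows the first 10 lines.
default_count=$(head "$demo_dir/numbers.txt" | wc -l | tr -d ' ')

# -n changes how many lines are shown, for head and for tail.
first_three=$(head -n 3 "$demo_dir/numbers.txt")
last_two=$(tail -n 2 "$demo_dir/numbers.txt")

echo "head shows $default_count lines by default"
echo "First three lines:"; echo "$first_three"
echo "Last two lines:";    echo "$last_two"
```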

more and less are commands that show the contents of a file one page at a time, but less has more functionality than more (doh)! With both more and less, you can use the space bar to scroll down the page, and typing the letter q causes the program to quit – returning you to your command line prompt.

Once you are reading a document with more or less, typing a forward slash / will start a prompt at the bottom of the page, where you can type text to search for below your current position in the document. In less, typing ? also searches for a text string you enter, but it searches above your current position (in many versions of more, ? shows help instead). Hitting the n key during a search jumps to the next instance of that text in the file.

With less (but not more), you can use the arrow keys to scroll up and down the page, and the b key to move back up the document if you wish to.

Remember these are just for reading the files. If you want to edit them you’ll need a text editor like nano or vi (see below).

Exercise: looking at text files

Move into mydata/Session1

Explore the file O97435.txt by using the commands cat, more, less, head and tail. (Note that the file name starts with a capital O, rather than a zero.)

Don’t forget that tab completion can save you typing effort!

cat O97435.txt
more O97435.txt # Use the spacebar to scroll down and press q to quit

Now try less

less O97435.txt

Use the spacebar to scroll down, b to go up a page, and the up and down arrow keys to move up and down the file line by line.

Press the / key and search for the word ‘sequence’ in the file. Press the ? key and search for the word ‘gene’ in the file.

Press the n key to search for other instances of ‘gene’ in the file.

head O97435.txt
tail O97435.txt

For reading files yourself, we recommend the command less. The command cat is more usually used in conjunction with other commands when you wish to process text from within a file in some way.
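As an illustration of cat working with other commands, the sketch below joins two files and then streams the result into grep. The two tiny FASTA files are hypothetical and are created in a temporary directory purely for the demonstration.

```shell
# Make two tiny practice FASTA files (hypothetical names and contents).
work_dir=$(mktemp -d)
printf '>seq1\nACGT\n' > "$work_dir/a.fasta"
printf '>seq2\nGGCC\n' > "$work_dir/b.fasta"

# cat can join (concatenate) several files into one new file.
cat "$work_dir/a.fasta" "$work_dir/b.fasta" > "$work_dir/combined.fasta"

# cat can also stream a file into another program through a pipe.
# Here grep -c counts the header lines, i.e. lines starting with '>'.
n_seqs=$(cat "$work_dir/combined.fasta" | grep -c '^>')
echo "combined.fasta contains $n_seqs sequences"
```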

Remember the man pages

There are many command line options available for each of the above commands, as well as functionality we do not cover here. To read more about them, consult the manual pages:

man cat
man more
man less
man head
man tail

Using text editors

Plain text files are important, both as input to bioinformatics programs and as input or configuration files for system programs. We highly recommend that you learn to use a text editor to prepare and edit plain text files.

Text files, Word Processors and Bioinformatics

Documents written using a word processor such as Microsoft Word or OpenOffice Writer are not plain text documents; they will not display properly in a text editor and will not work as scripts. If your filename has an extension such as .doc or .odt, it is unlikely to be a plain text document. (Try opening a Word document in Notepad or another text editor on Windows if you want proof of this.)

Word processors are very useful for preparing documents, but do not use them for working with bioinformatics-related files. We recommend that you prepare text files for bioinformatics analyses using Linux-based text editors and not Windows- or Mac-based text editors. This is because Windows- or Mac-based text editors may insert hidden characters (for example, Windows-style carriage returns at the end of each line) that are not handled properly by Linux-based programs. There are endless posts on bioinformatics forums bemoaning the problems that arise from this!
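One common example of such hidden characters is the carriage return that Windows editors add at the end of each line. The sketch below shows one way to detect and strip them. The practice file is hypothetical and is created in a temporary directory, and tr is only one of several tools that could be used for this.

```shell
# Make a small file with Windows-style (CRLF) line endings to practise on.
work_dir=$(mktemp -d)
printf 'sample1\r\nsample2\r\n' > "$work_dir/windows.txt"

# Count the lines that contain a hidden carriage return character.
cr=$(printf '\r')
before=$(grep -c "$cr" "$work_dir/windows.txt" || true)

# Strip the carriage returns with tr, writing a clean copy.
# (dos2unix, which appears in the module list above, does the same job.)
tr -d '\r' < "$work_dir/windows.txt" > "$work_dir/unix.txt"
after=$(grep -c "$cr" "$work_dir/unix.txt" || true)

echo "Lines with carriage returns: $before before, $after after"
```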

There are a number of different text editors available on Linux. These range in ease of use, and each has its pros and cons. In this practical we will briefly look at two editors, nano and vi. Vi is an extremely powerful text editor and popular with many professionals, as it is installed on almost every Linux system. However, it is also notoriously frustrating to get to grips with. For this reason, you are probably better starting off with nano, which is very easy to use and displays instructions at the bottom of the screen.

Nano

Pros:

Very easy to use: command options are visible at the bottom of the window, it can be used when logged in without graphical support, and it is fast to start up and use.

Cons:

By default, older versions of nano (before 4.0) put return characters into lines that are too long for the screen, which can make using nano for system administration dangerous! This behavior can be changed by setting different defaults for the program or by running it with the -w option. It is also not completely intuitive for people who are used to graphical word processors.

Vi (or Vim)

Pros:

Appears on nearly every Unix system. Can be very powerful if you take the time to learn the keyboard shortcuts.

Cons:

You have to know the shortcuts! There are no menus and no on-screen prompts.

Exercise

Note that unlike most graphical programmes, where you write a document and then name it at the end, with command-line text editors you usually name a file when you create it.

Create a file with nano

nano test_nano.txt 

Type some text, then exit with Ctrl+X. When prompted, press Y to save and Enter to confirm the file name. Back at the command line, view the contents of the file you created:

less test_nano.txt 

Create a file with vi

vi test_vi.txt 

Vi has two separate modes: ‘input’ mode, where you can type and edit text, and ‘control’ mode where you can do things such as save, search and exit the programme.

Type ‘i’ to enter input mode, and add some text.

Press [esc] to enter control mode.

Exit, saving your edits, by typing :wq - this stands for write and quit! If you wanted to exit without saving the edits, typing :q! will force vi to exit without saving.

Now view the contents of the file you created:

less test_vi.txt 

Running code when you need to have a coffee

Often you have some code or a script that will take some time to run, and you want to log out, shut down your own computer and go for coffee or sleep (yes, even informaticians sleep) while the code continues to run on the server. To make this work, use the command nohup, which stops the job from being killed when your session closes. The syntax is simple; here are two examples.

# Run this and you can close your session and enjoy some relaxation time.
# By default, any output is written to a file called nohup.out.
nohup [command/script] &

# This version writes any output and error messages to a file called logfile.txt
nohup [command/script] > logfile.txt 2>&1 &
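To try this out safely, the sketch below uses a hypothetical two-second job as a stand-in for a long analysis and writes its log to a temporary directory. In real use you would replace the sh -c '...' part with your own command or script and log out instead of waiting.

```shell
# Stand-in for a long-running analysis: a hypothetical two-second job.
work_dir=$(mktemp -d)
log="$work_dir/logfile.txt"

# Start the job in the background, immune to hangups, logging to a file.
nohup sh -c 'echo "job started"; sleep 2; echo "job finished"' > "$log" 2>&1 &
job_pid=$!                      # $! holds the process ID of the background job
echo "Job running with PID $job_pid"

# In real use you would log out here, then log back in later and
# check on progress by looking at the end of the log with tail.
# For this demonstration we simply wait for the job to end and show the log.
wait "$job_pid"
cat "$log"
```

Keeping a note of the process ID makes it easier to check later whether the job is still running.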

End of Session 1. Go and relax….and have a coffee!

Or if you’re a total nerd and can’t leave the command line, go ahead and have a look at Extension: Commandline Tools and Scripting. This contains more Linux-related tidbits and a sneak preview of some material we’ll cover in Session 2.