Querying files with grep

grep stands for global regular expression print; you use this command to search for text patterns in a file (or any stream of text). eg.

grep "ency" ~/mydata/Session1/subsample.txt

You can also use flexible search terms, known as regular expressions, in your grep searches. You have already used glob pattern expressions in this practical, but regular expressions are somewhat different and more powerful. For example, when you listed all files with the pattern testxt you were using a glob pattern comprising explicit characters (e.g. tes) and special symbols (* meaning any character or characters). The equivalent in grep would be “tes.txt.” where the period signifies any single character and the * signifies any number of repeats.

Therefore to get from a shell glob pattern to a regular expression replace each * with .* and each ? with . . You also need to enclose the expression in quotes to tell the shell not to try and interpret it as a glob.

Unmodified glob patterns will be accepted by grep but will not work as intended. For example the pattern tes* in grep means te followed by any number of s characters in sequence (te, tes, tess, tesss, …). The question mark now signifies optionality – so tes? means te followed by zero or one s character (te, tes). Regular expressions are found in several places other than grep, most notably in the Perl scripting language. The full syntax is extensive and powerful but is beyond the scope of this course, so back to the command itself…

grep requires a regular expression as input, and returns all the lines containing that pattern to you as output.

grep is especially useful in combination with pipes as you can filter the results of other commands.

For example, perhaps you only want to see only the information in an txt file relating to the origin of the sequence, that is, the DE line. You do not need to search the file in an editor, you can just grep for lines beginning in DE, as in the next exercise.

Exercise

Move into mydata/Session1

Type the command:

grep "DE" O97435.txt

What is this command doing?

Can you see why the above command results in the output you see? An explanation of this command can be found below this exercise box.

Try the commands:

grep "^DE" O97435.txt
grep -x "DE.*" O97435.txt

What are the ^ symbol and the -x parameter in these commands doing? Check the man page for grep to be sure.

Try the command:

cat O97435.txt | grep "^DE".

Does that do what you expected?

Use the above command with a pipe and a grep command to search for files created or modified today.

The first command in the above exercise searches all the text in the O97435.txt file and returns the lines in which it finds the letter D followed by the letter E.

The second command in the exercise also returns lines in the file that have a letter D followed by a letter E, but only where DE is found at the beginning of a line. This is because the ^ symbol means “match at the beginning of a line”. The $ symbol can be used similarly to mean “at the end of a line”. These are known as anchors. Passing the -x flag to grep tells it to automatically anchor both ends of the search pattern.

What this anchoring does in the example above is return to you just the organism information in the txt file. This is because none of the other lines returned in the previous command started with DE, they just contained DE somewhere in them. This is an example where knowing how information is stored in an given file, along with a few basic Linux commands, allows you to retrieve information quickly.

Another common example is counting how many sequences are in a set of multi-fasta files. We can do this with pipes between the commands cat, grep and the handy wc (word count) utility, which here we use to count lines found by grep.

cat subsample_Ill1.fasta | grep "^>" | wc -l 

grep -c "^>" subsample_Ill1.fasta

Each sequence in a fasta file starts with a header line that begins with a > . The above command streams the contents of all files matching the glob pattern *seqs.fasta through a search with grep looking for lines that start with the symbol > . The quotes around the pattern ^> are necessary, as otherwise it is interpreted as a request for redirection of output to a file, rather than as a character to look for. As before, the ^ symbol means “match only at the beginning of the line”.

The output of this grep search is sent to the wc command, with the -l indicating that you want to know the number of lines – ie. the number of headers and by implication the number of sequences.

An easier version of this shown as the second example uses the grep -c argument that return the number of matches found.

Now try this:

cat *.fasta | grep "^>" | wc -l

Remember that * means any pattern, and that cat will concatentate all the files given as input. Therefore, a synopsis of the command above is: Read through all files with names ending seqs.fasta and look for all the header lines in the combined output, then count up those lines that matched and return the number to screen.

Use zgrep with compressed files

You can use all of grep functionality with a compressed file by adding a ‘z’ - just replace grep with zgrep and you can directly query that .gz files.

Running and rerunning scripts

Bioinformatic is full of altruistic people who share their skills by share their scripts. Most scripts you find on the internet are written in () or (). There are a few simple steps to making use of this fantastic resource.

Step 1: Create your Script

Make a text file containing the script in question. This can be achieved by downloading or transferring the scripts as a file in the correct format. Sometimes the scripts are posted as part of a website such as a web-post in a discussion forum. To use these scripts, create a file using vi or nano and copy into the test file the script in question ensuring you save it with an appropriate name.

Here’s a script you can try

#!/bin/bash 

echo {10..1} 

echo 'Blast off'

Step 2: Make your script executable

Make the file executable. Before you can run the script you must make it executable. This is done by changing its property using

chmod a+x [script name]

this is shorthand for chmod (change modify) a(all)+(add)execute(e) [script name] – thus changing the permission to allow everyone to execute a script. For more guide to chmod see https://en.wikipedia.org/wiki/Chmod

Step 3: Run your script

Run the program. This should be easy but there are a few ways of doing this.

Place the program into the directory where you want to use it and type

./[script name] parameters arguments

On first use try to run with no parameters or arguments or with -h and -help to see the manual for the script. Some poorly written scripts will need you to define the program you need to use them, i.e. for a bash script :

bash [script name] parameters arguments

Run from your current location using the script’s full path.

/full path/[script name] parameters arguments

Place the script into your ‘PATH’ – this means that the computer automatically knows about the script and will run it from any location just given the program name. I suggest that if you want to do this ask the demonstrators and they can show you……this is advanced as if you put two scripts with the same name into the PATH you can cause issues.

For…Done Loops – When Processing Lots of Data

Loops using numerical variables

Creating a Loop

Invoke a text editor such as nano, then type

#!/bin/bash

for i in {1..[number]}; do 

# use hash to include some level of documentation so when you get to script in a few months time 

# you can remember what it was all about.  ${i} = number which increment by 1 each time the loop runs

[your commands]${i} 

done

save.

now make the program executable

chmod +x [program_name]

run

./[program_name]

Loops using Strings (lists) as variables

Invoke a text editor such as nano, then type

#!/bin/bash

for i in sampleA sampleB sampleC sampleD; do 

# use hash to include some level of documentation so when you get to script in a few months time

# you can remember what it was all about.  ${i} = the list of strings given at the start of the for loop 

[your commands]${i} 

done

save

now make the program executable

chmod +x [program_name]

run

./[program_name]

Loops using directory listings

This is a great method if you want to execute a series of command on a set of data files contained in a specific directory, for instance a series of sequence files.

Invoke a text editor such as nano, then type

#!/bin/bash

sequence_dir=[location of folder containing paired end sequence files]
#*_R1.fastq - lists all files ending in _R1.fastq

for f in ${sequence_dir}/*_R1.fastq
do

#the file name are placed in variable $f - we can separate the name of the file away from the suffix (.fastq) using this simple cut expression - the variable 'R1' now contains the file name with no suffix 
R1=$(basename $f | cut -f1 -d.)

#Sometimes we want the 'base' name of the file without the direction suffix (_R1) - this expression creates a variable 'base' where the _R1 has been replace is nothing - ie removed 
base=$(echo $R1 | sed 's/_R1//')

echo ${base}

done

save

now make the program executable

chmod +x [program_name]

run

./[program_name]

Decompressing tar.gz and .gz files

You will commonly come across 2 type of compressed files in Linux .gz and tar.gz. The gz are equivalent to a ‘zip’ file in windows, whilst a tar.gz represent a compressed archive - that means it will contain multiple files and folders. Here’s how to handle these file:

gz files

….wait a minute do you really need to decompress this file !! Many programs will happily use a .gz file directly, this a win for your file space so check out if you really need to decompress the file. Unfortunate some utilits like ‘sed’ require files to be unzipped, it that case:

#to decompress
gunzip [filename].gz
#to recompress
gzip [filename]

How to extract a .tar.gz file on Linux

To extract a .tar.gz file on Linux, you can use the “tar” command in the terminal. Here is the general syntax:

tar -xvzf filename.tar.gz

Here is a brief explanation of the options used:

-x: This option tells tar to extract the contents of the archive.

-v: This option is for verbose output, which means that tar will display a list of the files being extracted as it does so.

-z: This option tells tar to decompress the archive using gzip.

-f: This option is used to specify the archive file to extract.