Week 1 Exercises

Author

Jelmer Poelstra

Published

February 25, 2024

The following are some of the exercises from Chapter 1 of the CSB book.

Getting set up

You should already have the book’s GitHub repository with exercise files in your personal dir within /fs/ess/PAS2700 from Thursday’s class. (If not, cd to /fs/ess/PAS2700/users/$USER, and run git clone https://github.com/CSB-book/CSB.git — this downloads the CSB directory referred to in the first step of the first exercise.)

Intermezzo 1.1

(a) ~~Go to your home directory~~. Go to /fs/ess/PAS2700/users/$USER

(b) Navigate to the sandbox directory within the CSB/unix directory.

(c) Use a relative path to go to the data directory within the python directory.

(d) Use an absolute path to go to the sandbox directory within CSB/python.

(e) Return to the data directory within the python directory.

Show hints

We didn’t see this in class, but the CSB book (page 21) mentions a shortcut you can use with cd to go back to the dir you were in previously (a little like a Browser’s “back” button).

Intermezzo 1.2

To familiarize yourself with using basic Unix commands, try the following:

(a) Go to the data directory within CSB/unix.

(b) How many lines are in the file Marra2014_data.fasta?

(c) Create the empty file toremove.txt in the CSB/unix/sandbox directory without leaving the current directory.

Show hints

We didn’t see this in class, but the CSB book (page 23) demonstrates the use of the touch command to create a new, empty file.

(d) List the contents of the directory unix/sandbox.

(e) Remove the file toremove.txt.

Intermezzo 1.3

(a) If we order all species names (fifth column) of Pacifici2013_data.csv (in CSB/unix/data/) in alphabetical order, which is the first species? And which is the last?

Show hints

You can either first select the 5th column using cut and then use sort, or directly tell the sort command which column to sort by.

In either case, you’ll also need to specify the column delimiter (if you use the latter approach, check the book for how to do that with sort).

To view just the first or the last line so you don’t have to scroll to get your answer, pipe to head or tail.

(b) How many families are represented in the database?

Show hints

Check the first line of the file to see which column contains the family, and then select the relevant column with cut.
Use the “tail trick” we saw in class to exclude the first line.
Remember to sort before using uniq.

Exercise 1.10.1: Next-Generation Sequencing Data

In this exercise, we work with next generation sequencing (NGS) data. Unix is excellent at manipulating the huge FASTA files that are generated in NGS experiments.

FASTA files contain sequence data in text format. Each sequence segment is preceded by a single-line description. The first character of the description line is a “greater than” sign (>).

The NGS data set we will be working with was published by Marra and DeWoody (2014), who investigated the immunogenetic repertoire of rodents. You will find the sequence file Marra2014_data.fasta in the directory CSB/unix/data. The file contains sequence segments (contigs) of variable size. The description of each contig provides its length, the number of reads that contributed to the contig, its isogroup (representing the collection of alternative splice products of a possible gene), and the isotig status.

1. Change directory to CSB/unix/sandbox.

2. What is the size of the file Marra2014_data.fasta?

Show hints

Recall that you can see file sizes by using specific options to the ls command.

3. Create a copy of Marra2014_data.fasta in the sandbox, and name it my_file.fasta.

4. How many “contigs” (FASTA entries, in this case) are classified as isogroup00036?

Show hints

Is there a grep option that counts the number of occurrences? Alternatively, you can pass the output of grep to wc -l.

5. Modify my_file.fasta to replace the original “two-spaces” delimiter with a comma (i.e. don’t just print the output to screen, but end up with a modified file). You’ll probably want to take a look at the “output file hints” below to see how you can end up with modified file contents.

Show output file hints

Due to the “streaming” nature of Unix commands, we can’t write output to a file that also serves as input (see here). So the following is not possible:

cat myfile.txt | tr "a" "b" > myfile.txt # Don't do this!

In this case, you’ll have to save the output in a different file. Then, if you do want to end up with a modified original file, you can overwrite the original file using mv.

Show other hints

In the file, the information on each contig is separated by two spaces:
```
>contig00001  length=527  numreads=2  ...
```
We would like to obtain:
```
>contig00001,length=527,numreads=2,...
```
Use cat to print the file, and substitute the spaces using the command tr. Note that you’ll first have to reduce the two spaces two one – can you remember an option to do that?

6. How many unique isogroups are in the file?

Show hints

You can use grep to match any line containing the word isogroup. Then, use cut to isolate the part detailing the isogroup. Finally, you want to remove the duplicates, and count.

7. Which contig has the highest number of reads (numreads)? How many reads does it have?

Show hints

Use a combination of grep and cut to extract the contig names and read counts. The command sort allows you to choose the delimiter and to order numerically — we didn’t see those sort options in class, so check the book for details.

Exercise 1.10.2: Hormone Levels in Baboons

Gesquiere et al. (2011) studied hormone levels in the blood of baboons. The data file is in CSB/unix/data/Gesquiere2011_data.csv.

Every individual was sampled several times. How many times were the levels of individuals 3 and 27 recorded?

Show hints

You can first use cut to extract just the maleID column from the file.
To match an individual (3 or 27), you can use grep with the -w option to match whole “words” only: this will prevent and individual ID like “13” to match when you search for “3”.

Solutions

Intermezzo 1.1

Solution

(a) ~~Go to your home directory.~~ Go to /fs/ess/PAS2700/users/$USER.

cd /fs/ess/PAS2700/users/$USER # To home would have been: "cd ∼"

(b) Navigate to the sandbox directory within the CSB/unix directory.

cd CSB/unix/sandbox

(c) Use a relative path to go to the data directory within the python directory.

cd ../../python/data

(d) Use an absolute path to go to the sandbox directory within python.

cd /fs/ess/PAS2700/users/$USER/CSB/python/sandbox

(e) Return to the data directory within the python directory.

With cd -:

# The '-' shortcut for cd will move you back to the previously visited dir
# (Note: you can't keep going back with this: using it a second time will toggle you "forward" again.)
cd -

Using a relative path:
```
cd ../data
```

Intermezzo 1.2

Solution

(a) Go to the data directory within CSB/unix.

cd /fs/ess/PAS2700/users/$USER/CSB/unix/data

(b) How many lines are in file Marra2014_data.fasta?

wc -l Marra2014_data.fasta

9515 Marra2014_data.fasta

(c) Create the empty file toremove.txt in the CSB/unix/sandbox directory without leaving the current directory.

touch ../sandbox/toremove.txt

(d) List the contents of the directory unix/sandbox.

ls ../sandbox

Papers and reviews  toremove.txt

(e) Remove the file toremove.txt.

rm ../sandbox/toremove.txt

Intermezzo 1.3

Solution

(a) If we order all species names (fifth column) of Pacifici2013_data.csv in alphabetical order, which is the first species? And which is the last?

# First species:
cut -d ";" -f 5 Pacifici2013_data.csv | sort | head -n 1

Abditomys latidens

# Last species:
cut -d ";" -f 5 Pacifici2013_data.csv | sort | tail -n 1

Zyzomys woodwardi

# Or, using sort directly, but then you get all columns unless you pipe to cut:
sort -t ";" -k 5 Pacifici2013_data.csv | head -n 1

42641;Rodentia;Muridae;Abditomys;Abditomys latidens;268.09;PanTHERIA;no information;no information;no information;no information;no information;no information;639.6318318208;Mean_family_same_body_mass

(b) How many families are represented in the database?

cut -d ";" -f 3 Pacifici2013_data.csv | tail -n +2 | sort | uniq | wc -l

Exercise 1.10.1: Next-Generation Sequencing Data

1. Change directory to CSB/unix/sandbox.

cd /fs/ess/PAS2700/users/$USER/CSB/unix/sandbox

2. What is the size of the file Marra2014_data.fasta?

ls -lh ../data/Marra2014_data.fasta

-rw-rw----+ 1 jelmer PAS0471 553K Feb 24 20:30 ../data/Marra2014_data.fasta

Alternatively, the command du (disk usage) can be used for more compact output:

du -h ../data/Marra2014_data.fasta

560K    ../data/Marra2014_data.fasta

3. Create a copy of Marra2014_data.fasta in the sandbox, and name it my_file.fasta.

cp ../data/Marra2014_data.fasta my_file.fasta

4. How many contigs are classified as isogroup00036?

To count the occurrences of a given string, use grep with the option -c:

grep -c isogroup00036 my_file.fasta

Slightly less efficient is to use a “regular” grep and then pipe to wc -l:

grep isogroup00036 my_file.fasta | wc -l

5. Replace the original “two-spaces” delimiter with a comma.

We use the tr option -s (squeeze) to change two spaces two one, and then replace the space with a ,. Importantly, we also write the output to a new file (see the Hints for details):
```
cat my_file.fasta | tr -s ' ' ',' > my_file.tmp
```
If we want to change the original file, we can now overwrite it as follows:
```
mv my_file.tmp my_file.fasta
```

Let’s take a look to check whether out delimiter replacement worked:

grep ">" my_file.fasta | head

>contig00001,length=527,numreads=2,gene=isogroup00001,status=it_thresh
>contig00002,length=551,numreads=8,gene=isogroup00001,status=it_thresh
>contig00003,length=541,numreads=2,gene=isogroup00001,status=it_thresh
>contig00004,length=291,numreads=3,gene=isogroup00001,status=it_thresh
>contig00005,length=580,numreads=12,gene=isogroup00001,status=it_thresh
>contig00006,length=3288,numreads=35,gene=isogroup00001,status=it_thresh
>contig00008,length=1119,numreads=10,gene=isogroup00001,status=it_thresh
>contig00010,length=202,numreads=4,gene=isogroup00001,status=it_thresh
>contig00011,length=5563,numreads=61,gene=isogroup00001,status=it_thresh
>contig00012,length=824,numreads=10,gene=isogroup00001,status=it_thresh

6. How many unique isogroups are in the file?

First, searching for > with grep will extract all lines with contig information:

grep '>' my_file.fasta | head -n 2

>contig00001,length=527,numreads=2,gene=isogroup00001,status=it_thresh
>contig00002,length=551,numreads=8,gene=isogroup00001,status=it_thresh

Now, add cut to extract the 4th column:

grep '>' my_file.fasta | cut -d ',' -f 4 | head -n 2

gene=isogroup00001
gene=isogroup00001

Finally, add sort -> uniq -> wc -l to count the number of unique occurrences:

grep '>' my_file.fasta | cut -d ',' -f 4 | sort | uniq | wc -l

7. Which contig has the highest number of reads (numreads)? How many reads does it have?

First, we need to isolate the number of reads as well as the contig names. We can use a combination of grep and cut:

grep '>' my_file.fasta | cut -d ',' -f 1,3 | head -n 3

>contig00001,numreads=2
>contig00002,numreads=8
>contig00003,numreads=2

Now we want to sort according to the number of reads. However, the number of reads is part of a more complex string. We can use -t '=' to split according to the = sign, and then take the second column (-k 2) to sort numerically (-n):
```
grep '>' my_file.fasta | cut -d ',' -f 1,3 | sort -t '=' -k 2 -n | head -n 5
```
```
>contig00089,numreads=1
>contig00176,numreads=1
>contig00210,numreads=1
>contig00001,numreads=2
>contig00003,numreads=2
```
Adding the sort option -r, we can sort in reverse order, which tells us that contig00302 has the highest coverage, with 3330 reads:
```
grep '>' my_file.fasta | cut -d ',' -f 1,3 | sort -t '=' -k 2 -n -r | head -n 1
```
```
>contig00302,numreads=3330
```

Exercise 1.10.2: Hormone Levels in Baboons

How many times were the levels of individuals 3 and 27 recorded?

First, let’s move back into the data dir:
```
cd ../data
```

Next, let’s take a look at the structure of the file:

head -n 3 Gesquiere2011_data.csv

maleID        GC      T
1     66.9    64.57
1     51.09   35.57

We want to extract all the rows in which the first column is 3 (or 27), and count them. To extract only the first column, we can use cut:
```
cut -f 1 Gesquiere2011_data.csv | head -n 3
```
```
maleID
1
1
```

Then we can pipe the results to grep -c to count the number of occurrences (note the option -w to match whole “words” – this will make it match 3 but not 13 or 23):
```
# For maleID 3
cut -f 1 Gesquiere2011_data.csv | grep -c -w 3
```
```
61
```
```
# For maleID 27
cut -f 1 Gesquiere2011_data.csv | grep -c -w 27
```
```
5
```