Week 2 exercises

Author

Jelmer Poelstra

Published

March 4, 2024

Exercise 1: Course notes in Markdown

Create a Markdown document with course notes. I recommend writing this document in VS Code.

Make notes of this week’s material in some detail. If you have notes from last week in another format, include those too. (And try to keep using this document throughout the course!)

Some pointers:

- Use several header levels and use them consistently: e.g. a level 1 header (#) for the document’s title, level 2 headers (##) for each week, and so on.
- Though this should foremost be a functional document for notes, try to incorporate any appropriate formatting option: e.g. bold text, italic text, inline code, code blocks, ordered/unordered lists, and hyperlinks.
- Make sure you understand and try out how Markdown deals with whitespace, e.g. how to start a new paragraph and how to force a newline.

Exercise 2: Organize project files

While doing this exercise, save the commands you use in a text document – either write in a text document in VS Code and send the commands to the terminal, or copy them into a text document later.

Getting set up
Create a directory for this exercise, and change your working dir to go there. Do this within your personal dir in the course’s project dir (e.g. /fs/ess/PAS2700/users/$USER/week02/ex2/).
Create a disorganized mock project
Using the touch command and brace expansions, create a mock project by creating hundreds of empty files, either in a single directory or in a disorganized directory structure.

If you want, you can create file types according to what you typically have in your project – otherwise, a suggestion is to create files with:
- Raw data (e.g. .fastq.gz)
- Reference data (e.g. .fasta)
- Metadata (e.g. .txt or .csv)
- Processed data and results (e.g. .bam, .out)
- Scripts (e.g. .sh, .py or .R)
- Figures (e.g. .png or .eps)
- Notes (.txt and/or .md)
- Perhaps some other file type you usually have in your projects.
Organize the mock project
Organize the mock project according to some of the principles we discussed this week.

Even while adhering to these principles, there is plenty of wiggle room and no single perfect dir structure: what is optimal will depend on what works for you and on the project size and structure. Therefore, think about what makes sense to you, and what makes sense given the files you find yourself with.

Try to use as few commands as possible to move the files – use wildcards!
Create mock “alignment” files¹
- Create a directory alignment inside an appropriate dir in your project (e.g. analysis, results)
- Inside the alignment dir, create files with names from sample01_A_08-14-2020.sam to sample30_E_09-16-2020.sam, for all combinations of:
  - 30 samples (01-30)
  - 5 treatments (A-E)
  - 2 dates (08-14-2020 and 09-16-2020 – yes, use this date format for now)
These 300 files can be created with a single touch command².

Hints

Use brace expansion three times in the command: to expand (1) sample IDs, (2) treatments, and (3) dates.

Note that {01..20} will successfully zero-pad single-digit numbers.
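
To see the zero-padding in action, you can preview an expansion with echo before using it with touch. A minimal sketch, assuming Bash 4 or newer (older versions, such as the Bash 3.2 that ships with macOS, don't zero-pad); the variable names are only for illustration:

```bash
# Without a leading zero, the numbers are not padded:
unpadded=$(echo {8..11})
echo "$unpadded"    # 8 9 10 11

# With a leading zero in the range, all numbers get the same width:
padded=$(echo {08..11})
echo "$padded"      # 08 09 10 11
```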

Rename files in a batch
Whoops! We stored the alignment files that we created in the previous step as SAM files (.sam), but this was a mistake – the files are actually the binary counterparts of SAM files: BAM files (.bam).

Move into the dir with the alignment files, and use a for loop to rename them, changing the extension from .sam to .bam.

Hints

Loop over the files using globbing (wildcard expansion) directly; there is no need to call ls.
Use the basename command, or alternatively, cut, to strip the extension.
Store the output of the basename (or cut) call using command substitution ($(command) syntax).
The new extension can simply be pasted behind the file name, like newname="$filename_no_extension"bam or newname=$(basename ...)bam.

Copy files with wildcards
Still in the dir with your (now renamed) BAM files, create a new dir called subset. Then, using a single cp command, copy files that satisfy the following conditions into the subset dir:
- The sample ID/number should be 01-19, and
- The treatment should be A, B, or C.
Create a README.md in the subset dir that explains what you did.

Hints

Just like you used multiple consecutive brace expansions above, you can use two consecutive wildcard character sets ([]) here.

Create a README
Include a project-wide README.md that describes what you did. Again, try to get familiar with Markdown syntax by using formatting liberally.
Bonus: a trickier renaming loop
You now realize that your date format is suboptimal (MM-DD-YYYY; which gave 08-14-2020 and 09-16-2020) and that you should use the YYYY-MM-DD format. Use a for loop to rename the files.

Hints

Use cut to extract the three elements of the date (day, month, and year) on three separate lines.
Store the output of these lines in variables using command substitution, like: day=$(commands).
Finally, paste your new file name together like: newname="$part1"_"$year" etc.
When first writing your commands, it’s helpful to be able to experiment easily: start by echo-ing a single example file name, as in: echo sample23_C_09-16-2020.sam | cut ....

Bonus: Change file permissions
Make sure no-one has write permissions for the raw data files, not even yourself. You can also change other permissions to whatever you think is a reasonable or necessary precaution for your fictional project.

Hints

Use the chmod command to change file permissions and recall that you can use wildcard expansion to operate on many files at once.

See this Bonus section of the Managing files in the shell page for an overview of file permissions and the chmod command.

Alternatively, chmod also has an -R argument to act recursively: that is, to act on dirs and all of their contents (including other dirs and their contents).
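
As a rough illustration of the recursive option, the sketch below builds a small throwaway directory tree (the raw/ and batch1/ names are made up for this example), removes write permissions recursively, and then restores them for the owner. Note that -R also changes the directories themselves, so you can't add or delete files in them until write permission is restored:

```bash
# Work in a scratch dir so nothing in your project is touched:
cd "$(mktemp -d)"
mkdir -p raw/batch1
touch raw/top.fastq.gz raw/batch1/nested.fastq.gz

# Remove write permission from raw/ and everything inside it:
chmod -R a-w raw

# Check a nested file and a dir (first 10 characters of the ls -l line):
file_perms=$(ls -l raw/batch1/nested.fastq.gz | cut -c 1-10)
dir_perms=$(ls -ld raw/batch1 | cut -c 1-10)
echo "$file_perms $dir_perms"

# Restore write permission for the owner, e.g. to clean up later:
chmod -R u+w raw
```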

Bonus exercises

Exercise 3

If you feel like it would be good to reorganize one of your own, real projects, you can do so using what you’ve learned this week. Make sure you create a backup copy of the entire project first!

Buffalo Chapter 3 code-along

Move back to /fs/ess/PAS2700/users/$USER and download the repository accompanying the Buffalo book using git clone https://github.com/vsbuffalo/bds-files.git. Then, move into the new dir bds-files, and code along with Buffalo Chapter 3.

Solutions

Exercise 2

1. Getting set up

# For example:
mkdir -p /fs/ess/PAS2700/users/$USER/week02/ex2

cd /fs/ess/PAS2700/users/$USER/week02/ex2

2. Create a disorganized mock project

An example:

touch sample{001..150}_{F,R}.fastq.gz
touch ref.fasta ref.fai
touch sample_info.csv sequence_barcodes.txt
touch sample{001..150}{.bam,.bam.bai,_fastqc.zip,_fastqc.html} gene-counts.tsv DE-results.txt GO-out.txt
touch fastqc.sh multiqc.sh align.sh sort_bam.sh count1.py count2.py DE.R GO.R KEGG.R
touch Fig{01..05}.png all_qc_plots.eps weird-sample.png
touch dontforget.txt README.md README_DE.md tmp5.txt
touch slurm-84789570.out slurm-84789571.out slurm-84789572.out
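
Before creating hundreds of files, it can help to preview what a brace expansion produces by passing it to echo instead of touch. A small sketch, assuming Bash 4+ for the zero-padded range (the n_names variable is only for illustration):

```bash
# Preview a handful of names:
echo sample{001..003}_{F,R}.fastq.gz

# Count how many names the full expansion would produce (150 samples x 2 reads):
n_names=$(echo sample{001..150}_{F,R}.fastq.gz | wc -w | tr -d ' ')
echo "$n_names"    # 300
```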

3. Organize the mock project

An example:

Create directories:

mkdir -p data/{fastq,meta,ref}
mkdir -p results/{bam,counts,DE,enrichment,logfiles,qc/figures}
mkdir -p scripts
mkdir -p figures/{ms,sandbox}
mkdir -p doc/misc

Move files:

mv *fastq.gz data/fastq/
mv ref.fa* data/ref/
mv sample_info.csv sequence_barcodes.txt data/meta/
mv *.bam *.bam.bai results/bam/
mv *fastqc* results/qc/
mv gene-counts.tsv results/counts/
mv DE-results.txt results/DE/
mv GO-out.txt results/enrichment/
mv *.sh *.R *.py scripts/
mv README_DE.md results/DE/
mv Fig[0-9][0-9]* figures/ms
mv weird-sample.png figures/sandbox
mv all_qc_plots.eps results/qc/figures/
mv dontforget.txt tmp5.txt doc/misc/
mv slurm* results/logfiles/
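
When moving many files with wildcards, two habits can prevent surprises: previewing the match with echo first, and afterwards listing what is still left in the top-level dir. The sketch below shows both on a tiny mock project in a scratch directory; the file names (such as notes.txt) and the leftover variable are made up for this example:

```bash
cd "$(mktemp -d)"
touch sample{01..03}_{F,R}.fastq.gz ref.fasta ref.fai notes.txt
mkdir -p data/{fastq,ref}

# Preview what the wildcard matches before moving anything:
echo *fastq.gz

mv *fastq.gz data/fastq/
mv ref.fa* data/ref/

# List the files that are still sitting in the top-level dir:
leftover=$(find . -maxdepth 1 -type f)
echo "$leftover"    # ./notes.txt
```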

4. Create mock alignment files

mkdir -p results/alignment
cd results/alignment 

# Create the files:
touch sample{01..30}_{A..E}_{08-14-2020,09-16-2020}.sam

# Check if we have 300 files:
ls | wc -l
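
The expected number follows from 30 samples × 5 treatments × 2 dates = 300. As an optional extra check, wildcards can confirm that each treatment and each date has its expected share of files. A sketch in a scratch directory (the variable names are illustrative):

```bash
cd "$(mktemp -d)"
touch sample{01..30}_{A..E}_{08-14-2020,09-16-2020}.sam

n_total=$(ls | wc -l | tr -d ' ')                    # All files
n_treat_A=$(ls sample*_A_* | wc -l | tr -d ' ')      # 30 samples x 2 dates
n_date1=$(ls *_08-14-2020.sam | wc -l | tr -d ' ')   # 30 samples x 5 treatments

echo "$n_total $n_treat_A $n_date1"    # 300 60 150
```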

5. Rename files in a batch

for oldname in *.sam; do
   newname=$(basename "$oldname" sam)bam
   mv -v "$oldname" "$newname"
done

In the code above:

$oldname will contain the old file name in each iteration of the loop.
We remove the sam suffix using basename "$oldname" sam.
We use command substitution ($() syntax) to catch the output of the basename command, and paste bam at the end.

Also, note that:

We don’t need a special construction to paste strings together: we simply type bam after what will be the extension-less file name from the basename command.
I used informative variable names (oldname and newname), not cryptic ones like i and o.
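
As an alternative to basename, the shell's built-in parameter expansion can strip a suffix: ${oldname%.sam} removes a trailing .sam. This is a different technique from the one used in the solution above, shown here as a sketch on two mock files in a scratch directory:

```bash
cd "$(mktemp -d)"
touch sample01_A_08-14-2020.sam sample02_B_09-16-2020.sam

for oldname in *.sam; do
    # Strip the .sam suffix, then append .bam:
    newname="${oldname%.sam}.bam"
    mv -v "$oldname" "$newname"
done

ls
```

This avoids starting a separate basename process for every file, which hardly matters for a few hundred files but can add up in larger loops.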

6. Copy files with wildcards

Create the new dir:
```
mkdir subset
```

Copy the files using four consecutive wildcard selections:

The first digit should be a 0 or a 1 [01] (or [0-1]),
The second can be any number [0-9] (? would work, too),
The third, after an underscore, should be A, B, or C [A-C],
We don’t care about what comes after that, but do need to account for the additional characters, so will use a * to match any number of characters:

cp -v sample[01][0-9]_[A-C]* subset/

‘sample01_A_08-14-2020.bam’ -> ‘subset/sample01_A_08-14-2020.bam’
‘sample01_A_09-16-2020.bam’ -> ‘subset/sample01_A_09-16-2020.bam’
‘sample01_B_08-14-2020.bam’ -> ‘subset/sample01_B_08-14-2020.bam’
‘sample01_B_09-16-2020.bam’ -> ‘subset/sample01_B_09-16-2020.bam’
‘sample01_C_08-14-2020.bam’ -> ‘subset/sample01_C_08-14-2020.bam’
‘sample01_C_09-16-2020.bam’ -> ‘subset/sample01_C_09-16-2020.bam’
‘sample02_A_08-14-2020.bam’ -> ‘subset/sample02_A_08-14-2020.bam’
# [...output truncated...]
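
A quick sanity check on the result: 19 samples × 3 treatments × 2 dates should give 114 copied files. The sketch below recreates the mock files in a scratch directory and counts the copies (n_copied is an illustrative name):

```bash
cd "$(mktemp -d)"
touch sample{01..30}_{A..E}_{08-14-2020,09-16-2020}.bam
mkdir subset

# Same wildcard pattern as above, without -v to keep the output short:
cp sample[01][0-9]_[A-C]* subset/

n_copied=$(ls subset | wc -l | tr -d ' ')
echo "$n_copied"    # 114
```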

Report what we did, including a command substitution to insert the current date:

echo "On $(date), created a dir 'subset' and copied only files for samples 1-19
and treatments A-C into this dir." > subset/README.md

Check the resulting README file:

cat subset/README.md

On Mon Mar 18 10:07:17 EDT 2024, created a dir 'subset' and copied only files for samples 1-19
and treatments A-C into this dir.
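
The default date output is fairly verbose. If you prefer a compact, sortable date in the README, date accepts a format string. The sketch below uses +%F (YYYY-MM-DD), which should be supported by both GNU and BSD date; the today variable and the shortened message are only for illustration, and the file is written in a scratch directory:

```bash
cd "$(mktemp -d)"
mkdir subset

# Store today's date in YYYY-MM-DD format:
today=$(date +%F)

echo "On $today, created the dir 'subset'." > subset/README.md
cat subset/README.md
```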

8. Bonus: a trickier renaming loop

In the loop, first use cut to extract the month, day, and year:
- Start by extracting the entire date: cut by an _ and take the third item (cut -d "_" -f 3).
- Then extract the different components of the date separately for month, day, and year, with cut -d "-": the first item is the month, the second is the day, and the third is the year.
- Save these components of the date in variables using command substitution ($()).
Second, use cut to extract what we may call the “sample prefix”, which contains the sample number and the treatment.
Third, build the new file name simply by putting the variables in the right order with _ and - delimiters.
Use mv to rename the files; below, I’ve added -v (verbose) so that it reports what it does.

for oldname in *.bam; do
     # Extract and store the month, day, and year:
     # (First cut by '_' taking the 3rd item, then by '-')
     month=$(echo "$oldname" | cut -d "_" -f 3 | cut -d "-" -f 1)
     day=$(echo "$oldname" | cut -d "_" -f 3 | cut -d "-" -f 2)
     year=$(basename "$oldname" .bam | cut -d "_" -f 3 | cut -d "-" -f 3)
     
     # Extract and store the sample prefix:
     sample_prefix=$(echo "$oldname" | cut -d "_" -f 1-2)
     
     # Paste together the new name:
     newname="$sample_prefix"_"$year"-"$month"-"$day".bam
     
     # Execute the move:
     mv -v "$oldname" "$newname"
done

‘sample01_A_08-14-2020.bam’ -> ‘sample01_A_2020-08-14.bam’
‘sample01_A_09-16-2020.bam’ -> ‘sample01_A_2020-09-16.bam’
‘sample01_B_08-14-2020.bam’ -> ‘sample01_B_2020-08-14.bam’
‘sample01_B_09-16-2020.bam’ -> ‘sample01_B_2020-09-16.bam’
‘sample01_C_08-14-2020.bam’ -> ‘sample01_C_2020-08-14.bam’
‘sample01_C_09-16-2020.bam’ -> ‘sample01_C_2020-09-16.bam’
‘sample01_D_08-14-2020.bam’ -> ‘sample01_D_2020-08-14.bam’
‘sample01_D_09-16-2020.bam’ -> ‘sample01_D_2020-09-16.bam’
‘sample01_E_08-14-2020.bam’ -> ‘sample01_E_2020-08-14.bam’
# [...output truncated...]
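
When developing a loop like this, it can be easier to first run the individual steps on a single example file name and print the result, without renaming anything. The sketch below does so for one name; the date_part variable is an extra intermediate step that is not in the loop above:

```bash
oldname=sample23_C_09-16-2020.bam

# Strip the extension, then take the 3rd '_'-separated field (the date):
date_part=$(basename "$oldname" .bam | cut -d "_" -f 3)

# Split the date on '-':
month=$(echo "$date_part" | cut -d "-" -f 1)
day=$(echo "$date_part" | cut -d "-" -f 2)
year=$(echo "$date_part" | cut -d "-" -f 3)

# The sample prefix is the first two '_'-separated fields:
sample_prefix=$(echo "$oldname" | cut -d "_" -f 1-2)

newname="${sample_prefix}_${year}-${month}-${day}.bam"
echo "$oldname -> $newname"
# sample23_C_09-16-2020.bam -> sample23_C_2020-09-16.bam
```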

9. Change file permissions

Before we start, let’s check the current file permissions (run this from the project’s top-level dir, since the path is relative):

ls -lh data/fastq/ | head
total 0
-rw-rw----+ 1 jelmer PAS0471 0 Mar 18 10:19 sample001_F.fastq.gz
-rw-rw----+ 1 jelmer PAS0471 0 Mar 18 10:19 sample001_R.fastq.gz
-rw-rw----+ 1 jelmer PAS0471 0 Mar 18 10:19 sample002_F.fastq.gz
-rw-rw----+ 1 jelmer PAS0471 0 Mar 18 10:19 sample002_R.fastq.gz
-rw-rw----+ 1 jelmer PAS0471 0 Mar 18 10:19 sample003_F.fastq.gz
-rw-rw----+ 1 jelmer PAS0471 0 Mar 18 10:19 sample003_R.fastq.gz
-rw-rw----+ 1 jelmer PAS0471 0 Mar 18 10:19 sample004_F.fastq.gz
-rw-rw----+ 1 jelmer PAS0471 0 Mar 18 10:19 sample004_R.fastq.gz
-rw-rw----+ 1 jelmer PAS0471 0 Mar 18 10:19 sample005_F.fastq.gz

The file “owner”/“user” (you) and the “group” (in this case, PAS0471, likely a different group for you) have read and write permissions, and “others” have no permissions at all.

There are several different ways to change permissions with the chmod command. Here are some examples which would ensure that no-one has write permission for the raw data:

Set read(-only) permissions for all:

# a=r => all=read
chmod a=r data/fastq/*

Take away write permissions for all:

# a-w => all minus write
chmod a-w data/fastq/*

You can also use the “numeric” syntax:
```
chmod 444 data/fastq/*
```

After running the second option, “others” still won’t have read access, because that option only removes write permissions. The first and third options should give this result:

ls -lh data/fastq/ | head
total 0
-r--r--r--+ 1 jelmer PAS0471 0 Mar 18 10:19 sample001_F.fastq.gz
-r--r--r--+ 1 jelmer PAS0471 0 Mar 18 10:19 sample001_R.fastq.gz
-r--r--r--+ 1 jelmer PAS0471 0 Mar 18 10:19 sample002_F.fastq.gz
-r--r--r--+ 1 jelmer PAS0471 0 Mar 18 10:19 sample002_R.fastq.gz
-r--r--r--+ 1 jelmer PAS0471 0 Mar 18 10:19 sample003_F.fastq.gz
-r--r--r--+ 1 jelmer PAS0471 0 Mar 18 10:19 sample003_R.fastq.gz
-r--r--r--+ 1 jelmer PAS0471 0 Mar 18 10:19 sample004_F.fastq.gz
-r--r--r--+ 1 jelmer PAS0471 0 Mar 18 10:19 sample004_R.fastq.gz
-r--r--r--+ 1 jelmer PAS0471 0 Mar 18 10:19 sample005_F.fastq.gz
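
To compare the three options side by side without touching your project files, the sketch below applies each one to a separate empty file in a scratch directory. The file names are made up, and chmod 660 is used only to mimic the rw-rw---- starting point of the listing at the top of this section:

```bash
cd "$(mktemp -d)"
touch file_a file_b file_c
chmod 660 file_a file_b file_c    # Start from rw-rw----

chmod a=r file_a    # Option 1: set read-only for all
chmod a-w file_b    # Option 2: take away write for all
chmod 444 file_c    # Option 3: numeric syntax

# Print the first 10 characters of the ls -l line for each file:
for f in file_a file_b file_c; do
    echo "$f: $(ls -l "$f" | cut -c 1-10)"
done
# file_a: -r--r--r--
# file_b: -r--r-----
# file_c: -r--r--r--
```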

Footnotes

1. Real alignment files like SAM/BAM are generated by aligning FASTQ sequence reads to a reference genome.
2. If you already happened to have an alignment dir among your mock project dirs, first delete its contents or rename it.