Week 2 exercises
Exercise 1: Course notes in Markdown
Create a Markdown document with course notes. I recommend writing this document in VS Code.
Make notes of this week’s material in some detail. If you have notes from last week in another format, include those too. (And try to keep using this document throughout the course!)
Some pointers:
- Use several header levels and use them consistently: e.g. a level 1 header (#) for the document’s title, level 2 headers (##) for each week, and so on.
- Though this should foremost be a functional document for notes, try to incorporate any appropriate formatting option: e.g. bold text, italic text, inline code, code blocks, ordered/unordered lists, and hyperlinks.
- Make sure you understand and try out how Markdown deals with whitespace, e.g. starting a new paragraph and how to force a newline.
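As a rough illustration of the pointers above, the shell snippet below writes a small Markdown file from the terminal. The file name notes_example.md, the headings, and the link target are made up for this sketch; in practice you would simply type such content in VS Code.

```shell
# Write a small example notes file into a temporary directory
# (notes_example.md is a hypothetical name, used only for this sketch)
workdir=$(mktemp -d)
notes_file="$workdir/notes_example.md"

cat > "$notes_file" <<'EOF'
# Course notes

## Week 2

Some **bold** text, some *italic* text, and `inline code`.

1. First ordered item
2. Second ordered item

- An unordered item
- A [hyperlink](https://commonmark.org)

This line and the next are separated by a single newline,
so they end up in the same paragraph.

A blank line, as above, starts a new paragraph.

EOF

# Two trailing spaces force a line break; printf makes the spaces explicit
printf 'Line one, followed by two spaces  \nLine two\n' >> "$notes_file"

cat "$notes_file"
```

Opening the resulting file in the VS Code Markdown preview is one way to see how the single newline, the blank line, and the trailing spaces are each rendered.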
Exercise 2: Organize project files
While doing this exercise, save the commands you use in a text document – either write in a text document in VS Code and send the commands to the terminal, or copy them into a text document later.
Getting set up
Create a directory for this exercise, and change your working dir to go there. Do this within your personal dir in the course’s project dir (e.g. /fs/ess/PAS2700/users/$USER/week02/ex2/).
Create a disorganized mock project
Using the touch command and brace expansions, create a mock project by creating 100s of empty files, either in a single directory or a disorganized directory structure.
If you want, you can create file types according to what you typically have in your project; otherwise, a suggestion is to create files with:
- Raw data (e.g. .fastq.gz)
- Reference data (e.g. .fasta)
- Metadata (e.g. .txt or .csv)
- Processed data and results (e.g. .bam, .out)
- Scripts (e.g. .sh, .py or .R)
- Figures (e.g. .png or .eps)
- Notes (.txt and/or .md)
- Perhaps some other file type you usually have in your projects.
Organize the mock project
Organize the mock project according to some of the principles we discussed this week.
Even while adhering to these principles, there is plenty of wiggle room and no single perfect dir structure: what is optimal will depend on what works for you and on the project size and structure. Therefore, think about what makes sense to you, and what makes sense given the files you find yourself with.
Try to use as few commands as possible to move the files – use wildcards!
Create mock “alignment” files
- Create a directory alignment inside an appropriate dir in your project (e.g. analysis, results).
- Inside the alignment dir, create files with names from sample01_A_08-14-2020.sam to sample30_E_09-16-2020.sam for all combinations of:
  - 30 samples (01-30)
  - 5 treatments (A-E)
  - 2 dates (08-14-2020 and 09-16-2020; yes, use this date format for now)
These 300 files can be created with a single touch command.
Hints
Use brace expansion three times in the command: to expand (1) sample IDs, (2) treatments, and (3) dates.
Note that {01..20} will successfully zero-pad single-digit numbers.
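To get a feel for how consecutive brace expansions combine, it can help to preview them with echo before creating any files. The sketch below uses made-up file names (file08_a.txt and so on) and assumes Bash 4 or later, since older versions do not zero-pad numeric ranges.

```shell
# Preview the expansion: nothing is created, the names are only printed
names=$(echo file{08..11}_{a,b}.txt)
echo "$names"
# file08_a.txt file08_b.txt file09_a.txt file09_b.txt file10_a.txt file10_b.txt file11_a.txt file11_b.txt

# Count the names: 4 numbers x 2 letters = 8 combinations
n_names=$(echo $names | wc -w)
echo "$n_names"
```

Once the printed names look right, swapping echo for touch creates the files.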
Rename files in a batch
Whoops! We stored the alignment files that we created in the previous step as SAM files (.sam), but this was a mistake; the files are actually the binary counterparts of SAM files: BAM files (.bam).
Move into the dir with BAM files, and use a for loop to rename them, changing the extension from .sam to .bam.
Hints
- Loop over the files using globbing (wildcard expansion) directly; there is no need to call ls.
- Use the basename command, or alternatively, cut, to strip the extension.
- Store the output of the basename (or cut) call using command substitution ($(command) syntax).
- The new extension can simply be pasted behind the file name, like newname="$filename_no_extension"bam or newname=$(basename ...)bam.
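The last three hints can be tried out on a single name before writing the loop. This is a minimal sketch with a made-up file name (report_2024.txt) and a different pair of extensions than the exercise uses.

```shell
# A hypothetical file name, used only for illustration
filename="report_2024.txt"

# basename with a second argument strips that suffix from the end of the name;
# command substitution stores the result in a variable
filename_no_extension=$(basename "$filename" txt)

# The trailing dot is still there, so the new extension is pasted right behind it
newname="$filename_no_extension"md

echo "$filename -> $newname"    # report_2024.txt -> report_2024.md
```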
Copy files with wildcards
Still in the dir with your BAM files, create a new dir called subset. Then, using a single cp command, copy files that satisfy the following conditions into the subset dir:
- The sample ID/number should be 01-19, and
- The treatment should be A, B, or C.
Create a README.md in the subset dir that explains what you did.
Hints
Just like you used multiple consecutive brace expansions above, you can use two consecutive wildcard character sets ([]) here.
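Here is a small sketch of how consecutive character sets narrow down a match. The plot*.png file names are hypothetical and live in a temporary directory, so nothing in your project is touched.

```shell
# Create a few made-up files in a temporary directory
workdir=$(mktemp -d)
(
    cd "$workdir" &&
    touch plot05_A.png plot05_D.png plot15_A.png plot15_D.png plot25_A.png plot25_D.png
)

# [01] matches one character that is 0 or 1, [0-9] matches any single digit,
# [A-C] matches A, B, or C, and * matches whatever follows
matches=$(cd "$workdir" && echo plot[01][0-9]_[A-C]*)
echo "$matches"    # plot05_A.png plot15_A.png
```

Each character set matches exactly one character, which is why two of them are needed for a two-digit number.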
Create a README
Include a project-wide README.md that describes what you did. Again, try to get familiar with Markdown syntax by using formatting liberally.
Bonus: a trickier renaming loop
You now realize that your date format is suboptimal (MM-DD-YYYY; which gave 08-14-2020 and 09-16-2020) and that you should use the YYYY-MM-DD format. Use a for loop to rename the files.
Hints
- Use cut to extract the three elements of the date (day, month, and year) on three separate lines.
- Store the output of these lines in variables using command substitution, like: day=$(commands).
- Finally, paste your new file name together like: newname="$part1"_"$year" etc.
- When first writing your commands, it’s helpful to be able to experiment easily: start by echo-ing a single example file name, as in: echo sample23_C_09-16-2020.sam | cut ....
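The experimenting step from the last hint might look like the sketch below, which works on the example file name as a plain string and does not touch any files.

```shell
filename="sample23_C_09-16-2020.sam"

# Split on "_" and keep the 3rd field
echo "$filename" | cut -d "_" -f 3    # 09-16-2020.sam

# Chain a second cut, now splitting on "-", to isolate one element of the date
month=$(echo "$filename" | cut -d "_" -f 3 | cut -d "-" -f 1)
echo "$month"    # 09
```

Note that the last date element still carries the file extension after the first cut, so that part needs one extra step.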
Bonus: Change file permissions
Make sure no-one has write permissions for the raw data files, not even yourself. You can also change other permissions to what you think is reasonable or necessary precaution for your fictional project.
Hints
- Use the chmod command to change file permissions and recall that you can use wildcard expansion to operate on many files at once.
- See the Bonus section of the “Managing files in the shell” page for an overview of file permissions and the chmod command.
- chmod also has an -R argument to act recursively: that is, to act on dirs and all of their contents (including other dirs and their contents).
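To see what the -R argument does, here is a minimal sketch in a temporary directory. The rawdata dir and the file names are made up for the example.

```shell
# Build a small nested directory with made-up files
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/rawdata/run1"
touch "$demo_dir/rawdata/top.fastq.gz" "$demo_dir/rawdata/run1/nested.fastq.gz"
chmod 644 "$demo_dir/rawdata/top.fastq.gz" "$demo_dir/rawdata/run1/nested.fastq.gz"

# -R: remove write permission from rawdata/ and everything inside it
chmod -R a-w "$demo_dir/rawdata"

# The file in the nested dir was changed as well
nested_perms=$(ls -l "$demo_dir/rawdata/run1/nested.fastq.gz" | cut -c1-10)
echo "$nested_perms"    # -r--r--r--

# The dirs themselves lost write permission too, so give it back to the
# owner before cleaning up the demo directory
chmod -R u+w "$demo_dir/rawdata"
rm -r "$demo_dir"
```

Because -R also changes the directories, files can no longer be added to or removed from them until write permission is restored, which is worth keeping in mind before applying it to a whole project.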
Bonus exercises
Exercise 3
If you feel like it would be good to reorganize one of your own, real projects, you can do so using what you’ve learned this week. Make sure you create a backup copy of the entire project first!
Buffalo Chapter 3 code-along
Move back to /fs/ess/PAS2700/users/$USER and download the repository accompanying the Buffalo book using git clone https://github.com/vsbuffalo/bds-files.git. Then, move into the new dir bds-files, and code along with Buffalo Chapter 3.
Solutions
Exercise 2
1. Getting set up
# For example:
mkdir -p /fs/ess/PAS2700/users/$USER/week02/ex2
cd /fs/ess/PAS2700/users/$USER/week02/ex2
2. Create a disorganized mock project
An example:
touch sample{001..150}_{F,R}.fastq.gz
touch ref.fasta ref.fai
touch sample_info.csv sequence_barcodes.txt
touch sample{001..150}{.bam,.bam.bai,_fastqc.zip,_fastqc.html} gene-counts.tsv DE-results.txt GO-out.txt
touch fastqc.sh multiqc.sh align.sh sort_bam.sh count1.py count2.py DE.R GO.R KEGG.R
touch Fig{01..05}.png all_qc_plots.eps weird-sample.png
touch dontforget.txt README.md README_DE.md tmp5.txt
touch slurm-84789570.out slurm-84789571.out slurm-84789572.out
3. Organize the mock project
An example:
Create directories:
mkdir -p data/{fastq,meta,ref}
mkdir -p results/{bam,counts,DE,enrichment,logfiles,qc/figures}
mkdir -p scripts
mkdir -p figures/{ms,sandbox}
mkdir -p doc/misc
Move files:
mv *fastq.gz data/fastq/
mv ref.fa* data/ref/
mv sample_info.csv sequence_barcodes.txt data/meta/
mv *.bam *.bam.bai results/bam/
mv *fastqc* results/qc/
mv gene-counts.tsv results/counts/
mv DE-results.txt results/DE/
mv GO-out.txt results/enrichment/
mv *.sh *.R *.py scripts/
mv README_DE.md results/DE/
mv Fig[0-9][0-9]* figures/ms
mv weird-sample.png figures/sandbox
mv all_qc_plots.eps results/qc/figures/
mv dontforget.txt tmp5.txt doc/misc/
mv slurm* results/logfiles/
4. Create mock alignment files
mkdir -p results/alignment
cd results/alignment
# Create the files:
touch sample{01..30}_{A..E}_{08-14-2020,09-16-2020}.sam
# Check if we have 300 files:
ls | wc -l
300
5. Rename files in a batch
for oldname in *.sam; do
    newname=$(basename "$oldname" sam)bam
    mv -v "$oldname" "$newname"
done
In the code above:
- $oldname will contain the old file name in each iteration of the loop.
- We remove the sam suffix using basename "$oldname" sam.
- We use command substitution ($() syntax) to catch the output of the basename command, and paste bam at the end.
Also, note that:
- We don’t need a special construction to paste strings together: we simply type bam after what will be the extension-less file name from the basename command.
- I used informative variable names (oldname and newname), not cryptic ones like i and o.
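As an aside, the shell can also strip a suffix without calling basename, using the ${variable%pattern} form of parameter expansion. This is an alternative that was not part of the exercise; the sketch below runs on two made-up files in a temporary directory.

```shell
# Two hypothetical files in a temporary directory
workdir=$(mktemp -d)
touch "$workdir/x_1.sam" "$workdir/x_2.sam"

for oldname in "$workdir"/*.sam; do
    # ${oldname%.sam} removes a trailing ".sam" from the value of $oldname
    newname="${oldname%.sam}.bam"
    mv -v "$oldname" "$newname"
done
```

Unlike basename, this keeps any leading directory path in the name, which matters when looping over files outside the current dir.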
6. Copy files with wildcards
Create the new dir:
mkdir subset
Copy the files using four consecutive wildcard selections:
- The first digit should be a 0 or a 1: [01] (or [0-1]),
- The second can be any number: [0-9] (? would work, too),
- The third, after an underscore, should be A, B, or C: [A-C],
- We don’t care about what comes after that, but do need to account for the additional characters, so will use a * to match any number of characters:
cp -v sample[01][0-9]_[A-C]* subset/
‘sample01_A_08-14-2020.bam’ -> ‘subset/sample01_A_08-14-2020.bam’
‘sample01_A_09-16-2020.bam’ -> ‘subset/sample01_A_09-16-2020.bam’
‘sample01_B_08-14-2020.bam’ -> ‘subset/sample01_B_08-14-2020.bam’
‘sample01_B_09-16-2020.bam’ -> ‘subset/sample01_B_09-16-2020.bam’
‘sample01_C_08-14-2020.bam’ -> ‘subset/sample01_C_08-14-2020.bam’
‘sample01_C_09-16-2020.bam’ -> ‘subset/sample01_C_09-16-2020.bam’
‘sample02_A_08-14-2020.bam’ -> ‘subset/sample02_A_08-14-2020.bam’
# [...output truncated...]
Report what we did, including a command substitution to insert the current date:
echo "On $(date), created a dir 'subset' and copied only files for samples 1-19 and treatments A-C into this dir." > subset/README.md
Check the resulting README file:
cat subset/README.md
On Mon Mar 18 10:07:17 EDT 2024, created a dir 'subset' and copied only files for samples 1-19 and treatments A-C into this dir.
8. Bonus: a trickier renaming loop
- In the loop, first use cut to extract the month, day, and year:
  - Start by extracting the entire date: cut by an _ and take the third item (cut -d "_" -f 3).
  - Then extract the different components of the date separately for month, day, and year, with cut -d "-": the first item is the month, the second is the day, and the third is the year. The third item would still include the .bam extension, so for the year, strip the extension first with basename.
  - Save these components of the date in variables using command substitution ($()).
- Second, use cut to extract what we may call the “sample prefix”, which contains the sample number and the treatment.
- Third, build the new file name simply by putting the variables in the right order with _ and - delimiters.
- Use mv to rename the files; below, I’ve added -v for verbose so it will report what it does.
for oldname in *.bam; do
    # Extract and store the month, day, and year:
    # (First cut by '_' taking the 3rd item, then by '-')
    month=$(echo "$oldname" | cut -d "_" -f 3 | cut -d "-" -f 1)
    day=$(echo "$oldname" | cut -d "_" -f 3 | cut -d "-" -f 2)
    year=$(basename "$oldname" .bam | cut -d "_" -f 3 | cut -d "-" -f 3)
    # Extract and store the sample prefix:
    sample_prefix=$(echo "$oldname" | cut -d "_" -f 1-2)
    # Paste together the new name:
    newname="$sample_prefix"_"$year"-"$month"-"$day".bam
    # Execute the move:
    mv -v "$oldname" "$newname"
done
‘sample01_A_08-14-2020.bam’ -> ‘sample01_A_2020-08-14.bam’
‘sample01_A_09-16-2020.bam’ -> ‘sample01_A_2020-09-16.bam’
‘sample01_B_08-14-2020.bam’ -> ‘sample01_B_2020-08-14.bam’
‘sample01_B_09-16-2020.bam’ -> ‘sample01_B_2020-09-16.bam’
‘sample01_C_08-14-2020.bam’ -> ‘sample01_C_2020-08-14.bam’
‘sample01_C_09-16-2020.bam’ -> ‘sample01_C_2020-09-16.bam’
‘sample01_D_08-14-2020.bam’ -> ‘sample01_D_2020-08-14.bam’
‘sample01_D_09-16-2020.bam’ -> ‘sample01_D_2020-09-16.bam’
‘sample01_E_08-14-2020.bam’ -> ‘sample01_E_2020-08-14.bam’
# [...output truncated...]
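Before running a renaming loop like this on real files, one cautious option is a dry run that prints each planned mv command instead of executing it. The sketch below loops over two literal example names rather than a glob, so it needs no files, and it extracts the date once per file as a small variation on the solution above.

```shell
for oldname in sample01_A_08-14-2020.bam sample02_B_09-16-2020.bam; do
    # Strip the extension and take the date (3rd "_"-separated field) once
    old_date=$(basename "$oldname" .bam | cut -d "_" -f 3)
    month=$(echo "$old_date" | cut -d "-" -f 1)
    day=$(echo "$old_date" | cut -d "-" -f 2)
    year=$(echo "$old_date" | cut -d "-" -f 3)
    sample_prefix=$(echo "$oldname" | cut -d "_" -f 1-2)
    newname="${sample_prefix}_${year}-${month}-${day}.bam"
    # echo in front of mv: nothing is renamed, the command is only printed
    echo mv "$oldname" "$newname"
done
# mv sample01_A_08-14-2020.bam sample01_A_2020-08-14.bam
# mv sample02_B_09-16-2020.bam sample02_B_2020-09-16.bam
```

If the printed commands look right, removing the echo (and looping over *.bam) performs the actual renaming.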
9. Change file permissions
Before we start, let’s check the current file permissions:
ls -lh data/fastq/ | head
total 0
-rw-rw----+ 1 jelmer PAS0471 0 Mar 18 10:19 sample001_F.fastq.gz
-rw-rw----+ 1 jelmer PAS0471 0 Mar 18 10:19 sample001_R.fastq.gz
-rw-rw----+ 1 jelmer PAS0471 0 Mar 18 10:19 sample002_F.fastq.gz
-rw-rw----+ 1 jelmer PAS0471 0 Mar 18 10:19 sample002_R.fastq.gz
-rw-rw----+ 1 jelmer PAS0471 0 Mar 18 10:19 sample003_F.fastq.gz
-rw-rw----+ 1 jelmer PAS0471 0 Mar 18 10:19 sample003_R.fastq.gz
-rw-rw----+ 1 jelmer PAS0471 0 Mar 18 10:19 sample004_F.fastq.gz
-rw-rw----+ 1 jelmer PAS0471 0 Mar 18 10:19 sample004_R.fastq.gz
-rw-rw----+ 1 jelmer PAS0471 0 Mar 18 10:19 sample005_F.fastq.gz
The file “owner”/“user” (you) and the “group” (in this case, PAS0471, likely a different group for you) have read and write permissions, and “others” have no permissions at all.
There are several different ways to change permissions with the chmod command. Here are some examples which would ensure that no-one has write permission for the raw data:
Set read(-only) permissions for all:
# a=r => all=read
chmod a=r data/fastq/*
Take away write permissions for all:
# a-w => all minus write
chmod a-w data/fastq/*
You can also use the “numeric” syntax:
chmod 444 data/fastq/*
The second option only takes away write permissions, so others still won’t have read access afterwards (r--r-----). The first and third options should give this result:
ls -lh data/fastq/ | head
total 0
-r--r--r--+ 1 jelmer PAS0471 0 Mar 18 10:19 sample001_F.fastq.gz
-r--r--r--+ 1 jelmer PAS0471 0 Mar 18 10:19 sample001_R.fastq.gz
-r--r--r--+ 1 jelmer PAS0471 0 Mar 18 10:19 sample002_F.fastq.gz
-r--r--r--+ 1 jelmer PAS0471 0 Mar 18 10:19 sample002_R.fastq.gz
-r--r--r--+ 1 jelmer PAS0471 0 Mar 18 10:19 sample003_F.fastq.gz
-r--r--r--+ 1 jelmer PAS0471 0 Mar 18 10:19 sample003_R.fastq.gz
-r--r--r--+ 1 jelmer PAS0471 0 Mar 18 10:19 sample004_F.fastq.gz
-r--r--r--+ 1 jelmer PAS0471 0 Mar 18 10:19 sample004_R.fastq.gz
-r--r--r--+ 1 jelmer PAS0471 0 Mar 18 10:19 sample005_F.fastq.gz
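The difference between the three options can be checked on throwaway files. In the sketch below, each made-up test file starts from rw-rw---- (numeric 660), like the listing before the change. In the numeric syntax, read counts as 4, write as 2, and execute as 1, with one digit each for user, group, and others, so 444 means read-only for all three.

```shell
workdir=$(mktemp -d)

for option in a=r a-w 444; do
    # One hypothetical test file per chmod option
    testfile="$workdir/file_$option"
    touch "$testfile"
    chmod 660 "$testfile"          # start from rw-rw----
    chmod "$option" "$testfile"    # apply the option being compared
    # Print only the 10 permission characters from the long listing
    echo "$option: $(ls -l "$testfile" | cut -c1-10)"
done
# a=r: -r--r--r--
# a-w: -r--r-----
# 444: -r--r--r--
```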