Jelmer Poelstra


February 26, 2024

1 Basic commands

Command Description Examples / options
pwd Print current working directory (dir). pwd
ls List files in working dir (default) or elsewhere. ls data/
    -l long format
    -h human-readable file sizes
    -a show hidden files
cd Change working dir. As with all commands, you can use an absolute path (starting from the root dir /) or a relative path (starting from the current working dir). cd /fs/ess/PAS1855 (With absolute path)
cd ../.. (Two levels up)
cd - (To previous dir)
cp Copy files or, with -r, dirs and their contents (i.e., recursively).
If target is a dir, file will keep same name; otherwise, a new name can be provided.
cp *.fq data/ (All .fq files into dir data)
cp my.fq data/new.fq (With new name)
cp -r data/ ~ (Copy dir and contents to home dir)
mv Move/rename files or dirs (-r not needed).
If target is a dir, file will keep same name; otherwise a new name can be provided.
mv my.fq data/ (Keep same name)
mv my.fq my.fastq (Simple rename)
mv file1 file2 mydir/ (Last arg is destination)
rm Remove files or, with -r, dirs and their contents (i.e., recursively).
With -f (force), any write-protections that you have set will be overridden.
rm *fq (Remove all matching files)
rm -r mydir/ (Remove dir & contents)
    -i Prompt for confirmation
    -f Force remove
mkdir Create a new dir.
Use -p to create multiple levels at once and to avoid an error if the dir exists.
mkdir my_new_dir
mkdir -p new1/new2/new3
touch If file does not exist: create empty file.
If file exists: change last-modified date.
touch newfile.txt
cat Print file contents to standard out (screen). cat my.txt
cat *.fa > concat.fa (Concatenate files)
head Print the first 10 lines of a file or specify number with -n <n> or shorthand -<n>. head -n 40 my.fq (print 40 lines)
head -40 my.fq (equivalent)
tail Like head but print the last lines. tail -n +2 my.csv (“trick” to skip first line)
tail -f slurm.out (“follow” file)
less View a file in a file pager; type q to exit. See below for more details. less myfile
    -S disable line-wrapping
column -t View a tabular file with columns nicely lined up in the shell. Nice viewing of a CSV file:
column -s "," -t my.csv
history Print previously issued commands. history | grep "cut" (Find previous cut usage)
chmod Change file permissions for file owner (user, u), “group” (g), others (o) or everyone (all; a). Permissions can be set for reading (r), writing (w), and executing (x).
chmod u+x script.sh (Make script executable)
chmod a=r data/raw/* (Make data read-only)
    -R recursive
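
The commands in this table are often used together. Below is a minimal sketch that combines a few of them in a throwaway directory; all file and dir names (data/raw, results, my.csv, copy.csv) are made up for illustration.

```shell
# Work in a throwaway directory so nothing real is touched
workdir=$(mktemp -d)
cd "$workdir"

# Create nested dirs in one go; -p also avoids an error if they exist
mkdir -p data/raw results

# Make a small CSV file with a header line and three data lines
printf 'id,value\na,1\nb,2\nc,3\n' > data/raw/my.csv

# Copy under a new name, then move the copy into another dir
cp data/raw/my.csv data/copy.csv
mv data/copy.csv results/

# Skip the header with tail, then keep only the first two data lines
first_two=$(tail -n +2 results/copy.csv | head -n 2)
echo "$first_two"

# Make the raw data read-only for everyone, then check the permissions
chmod a=r data/raw/my.csv
perms=$(ls -l data/raw/my.csv | cut -c 1-10)
echo "$perms"
```

The permission string printed at the end should show only r entries, since a=r removes write and execute permissions for user, group, and others.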

2 Data tools

Command Description Examples and options
wc -l Count the number of lines in a file. wc -l my.fq
cut Select one or more columns from a file. Select columns 1-4:
cut -f 1-4 my.csv
    -d "," comma as delimiter

sort Sort lines.
The -V option will correctly sort chr10 after chr2, etc.
Sort column 1 alphabetically, column 2 reverse numerically:
sort -k1,1 -k2,2nr my.bed
    -k 1,1 by column 1 only
    -n numerical sorting
    -r reverse order
    -V recognize numbers within strings
uniq Remove consecutive duplicate lines (often from single-column selection): i.e., removes all duplicates if input is sorted. Unique values for column 2:
cut -f2 my.tsv | sort | uniq
uniq -c If input is sorted, create a count table for occurrences of each line (often from single-column selection). Count table for column 3:
cut -f3 my.tsv | sort | uniq -c

tr Substitute (translate) characters or character classes (like A-Z for uppercase letters). Does not take files as arguments; piping or redirection is needed.
To “squeeze” (-s) is to remove consecutive duplicates (akin to uniq).
TSV to CSV: cat my.tsv | tr "\t" ","
Uppercase to lowercase: tr A-Z a-z < in.txt > out.txt
    -d delete
    -s squeeze

grep Search files for a pattern and print matching lines (or only the matching string with -o).
Default regex is basic (GNU BRE): use -E for extended regex (GNU ERE) and -P for Perl-like regex.
To print lines surrounding a match, use -A n (n lines after match), -B n (n lines before match), or -C n (n lines before and after match).
Match AAC or AGC:
grep "A[AG]C" my.fa
Omit comment lines:
grep -v "^#" my.gff
    -c count
    -i ignore case
    -r recursive
    -v invert
    -o print match only
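
These tools are typically chained with pipes. The sketch below runs a few such chains on a small, made-up tab-separated file (my.tsv, with columns for chromosome, position, and feature type); the data are invented purely for illustration.

```shell
workdir=$(mktemp -d)
# Hypothetical tab-separated file: chromosome, position, feature type
printf 'chr10\t5\texon\nchr2\t7\tintron\nchr1\t9\texon\nchr2\t3\texon\n' > "$workdir/my.tsv"

# Count table for column 3: sort first, because uniq only merges adjacent duplicates.
# The final sort puts the most frequent value on top.
counts=$(cut -f 3 "$workdir/my.tsv" | sort | uniq -c | sort -k1,1nr)
echo "$counts"

# Unique chromosome names in "natural" order: -V puts chr10 after chr2
chroms=$(cut -f 1 "$workdir/my.tsv" | sort -V | uniq | tr "\n" " ")
echo "$chroms"

# Convert tabs to commas with tr, then count the lines that contain "exon"
n_exon=$(tr "\t" "," < "$workdir/my.tsv" | grep -c "exon")
echo "$n_exon"
```

Note that sort -V is available in GNU sort; other sort implementations may lack it.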

3 Miscellaneous

Symbol Meaning example
/ Root directory. cd /
. Current working directory. cp data/file.txt . (Copy to working dir)
Use ./ to execute a script in the working dir that is not in $PATH.
.. One directory level up. cd ../.. (Move 2 levels up)
~ or $HOME Home directory. cp myfile.txt ~ (Copy to home)
$USER User name. mkdir $USER
> Redirect standard out to a file. echo "My 1st line" > myfile.txt
>> Append standard out to a file. echo "My 2nd line" >> myfile.txt
2> Redirect standard error to a file. Send standard out and standard error for a script to separate files: >log.txt 2> err.txt
&> Redirect standard out and standard error to a file. &> log.txt
| Pipe standard out (output) of one command into standard in (input) of a second command. The output of the sort command will be piped into head to show the first lines:
sort myfile.txt | head
{} Brace expansion. Use .. to indicate numeric or character ranges (1..4 => 1, 2, 3, 4) and , to separate items. mkdir Jan{01..31} (Jan01, Jan02, …, Jan31)
touch fig1{A..F} (fig1A, fig1B, …, fig1F)
mkdir fig1{A,D,H} (fig1A, fig1D, fig1H)
$() Command substitution. Allows for flexible usage of the output of any command: e.g., use command output in an echo statement or assign it to a variable. Report number of FASTQ files:
echo "I see $(ls *fastq | wc -l) files"
Substitute with date in YYYY-MM-DD format:
mkdir results_$(date +%F)
Assign command output to a variable:
nlines=$(wc -l < "$infile")
$PATH Contains colon-separated list of directories with executables: these will be searched when trying to execute a program by name.
To add a dir to the path, append it to $PATH. (But for lasting changes, edit the Bash configuration file ~/.bashrc.)
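
The following sketch puts several of these symbols to work in a scratch directory. The file names and the bin dir are hypothetical, and the PATH change only lasts for the current shell session.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# > overwrites (or creates) the file, >> appends to it
echo "My 1st line" > myfile.txt
echo "My 2nd line" >> myfile.txt

# Send standard out and standard error to separate files:
# ls lists the existing file on standard out and reports the missing one on standard error
ls myfile.txt no_such_file > log.txt 2> err.txt || true

# Command substitution: store the line count in a variable and reuse it
nlines=$(wc -l < myfile.txt)
echo "myfile.txt has $nlines lines"

# Add a (hypothetical) dir to $PATH for the current session only
mkdir -p "$workdir/bin"
PATH="$PATH:$workdir/bin"
echo "$PATH" | tr ":" "\n" | tail -n 1
```

Appending the new dir at the end of $PATH means that executables in the existing dirs are still found first.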

4 Shell wildcards

Wildcard Matches
* Any number of any character, including nothing. ls data/*fastq.gz (Matches any file ending in “fastq.gz”)
ls *R1* (Matches any file containing “R1” somewhere in the name.)
? Any single character. ls sample1_?.fastq.gz (Matches sample1_A.fastq.gz but not sample1_AA.fastq.gz)
[] and [^] One or none (^) of the “character set” within the brackets.
ls fig1[A-C] (Matches fig1A, fig1B, fig1C)
ls fig[0-3] (Matches fig0, fig1, fig2, fig3)
ls fig[^4]* (Does not match files with a “4” after “fig”)
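
A quick way to check what a wildcard pattern matches is to pass it to echo, which simply prints the expanded file names. The sketch below does this in a scratch directory with a handful of empty, made-up files.

```shell
workdir=$(mktemp -d)
cd "$workdir"
touch fig1A fig1B fig1D fig4A sample1_A.fastq.gz sample1_AA.fastq.gz

# * matches any number of characters (including none): both FASTQ files
n_fastq=$(ls *fastq.gz | wc -l)
echo "$n_fastq"

# ? matches exactly one character, so sample1_AA.fastq.gz is skipped
one_char=$(echo sample1_?.fastq.gz)
echo "$one_char"

# [A-C] matches one character from the range A to C, so fig1D is skipped
range=$(echo fig1[A-C])
echo "$range"

# [^4] matches one character that is not "4" (Bash syntax;
# [!4] is the portable POSIX spelling), so fig4A is skipped
echo fig[^4]*
```

Wildcards are expanded by the shell before the command runs, so echo and ls receive the same list of file names.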

5 Regular expressions

“ERE” = GNU Extended regular expressions

Where it says “yes” in the ERE column, the symbol in question needs ERE turned on in order to work: use the -E flag for grep and sed to turn on ERE (note that awk uses ERE by default).

Symbol ERE Matches Example
. Any single character Match Olfr with none or any characters after it:
grep -o "Olfr.*"
* Quantifier: matches preceding character any number of times See previous example.
+ yes Quantifier: matches preceding character at least once At least one digit:
grep -E "[0-9]+"
? yes Quantifier: matches preceding character at most once Zero or one digit:
grep -E "[0-9]?"
{m} / {m,} / {m,n} yes Quantifier: match preceding character m times / at least m times / m to n times Between 50 and 100 consecutive Gs:
grep -E "G{50,100}"
^ / $ Anchors: match beginning / end of line Exclude empty lines:
grep -v "^$"
Exclude lines beginning with a “#”:
grep -v "^#"
\t Tab (To match in grep, needs -P flag for Perl-like regex) echo -e "column1 \t column2"
\n Newline (Not straightforward to match since Unix tools are line-based.) echo -e "Line1 \n Line2"
\w (yes) “Word” character: any alphanumeric character or “_”. Needs -E (ERE) in grep but not in sed. Match gene_id followed by a space and a “word”:
grep -E -o 'gene_id "\w+"'
Change every word character to X:
sed 's/\w/X/g'
| yes Alternation / logical or: match either the string before or after the | Find lines with either intron or exon:
grep -E "intron|exon"
() yes Grouping Find “AAG” repeated 10 times:
grep -E "(AAG){10}"
\1, \2, etc. yes Backreferences to groups captured with (): first group is \1, second group is \2, etc.
Invert order of two words:
sed -E 's/(\w+) (\w+)/\2 \1/'
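
The sketch below tries several of these symbols on a few invented GTF-like lines held in a shell variable. It uses [a-z] and [0-9] rather than \w so that it does not depend on GNU-specific extensions; the input text is made up for illustration.

```shell
# Hypothetical input: three data lines, one comment line, one empty line
lines='gene_id "Olfr1"; type exon
gene_id "Olfr22"; type intron
# a comment line

gene_id "Actb"; type CDS'

# Anchors: drop comment lines, then count the lines that are not empty
n_data=$(printf '%s\n' "$lines" | grep -v "^#" | grep -v -c "^$")
echo "$n_data"

# Alternation needs ERE (-E): count lines with either intron or exon
n_feat=$(printf '%s\n' "$lines" | grep -E -c "intron|exon")
echo "$n_feat"

# + needs ERE too: Olfr followed by at least one digit; -o prints only the match
olfr=$(printf '%s\n' "$lines" | grep -E -o "Olfr[0-9]+" | tr "\n" " ")
echo "$olfr"

# {m,} quantifier: at least two consecutive digits
two_digits=$(printf '%s\n' "$lines" | grep -E -o "[0-9]{2,}")
echo "$two_digits"

# Groups and backreferences in sed -E: swap two space-separated words
swapped=$(echo "inverted words" | sed -E 's/([a-z]+) ([a-z]+)/\2 \1/')
echo "$swapped"
```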

6 More details for a few commands

6.1 less

Key Function
q Exit less
space / b Go down / up a page. (pgdn / pgup usually also work.)
d / u Go down / up half a page.
g / G Go to the first / last line (home / end also work).
/<pattern> or ?<pattern> Search for <pattern> forwards / backwards: type your search after / or ?.
n / N When searching, go to next / previous search match.

6.2 sed

sed flags:

Flag Meaning
-E Use extended regular expressions
-e When using multiple expressions, precede each with -e
-i Edit a file in place
-n Don’t print lines unless specified with p modifier

sed examples

# Replace "chrom" by "chr" in every line,
# with "i": case insensitive, and "g": global (>1 replacements per line)
sed 's/chrom/chr/ig' chroms.txt

# Only print lines matching "abc":
sed -n '/abc/p' my.txt

# Print lines 20-50:
sed -n '20,50p' my.txt

# Change the genomic coordinates format chr1:431-874 ("chrom:start-end")
# to one that has a tab ("\t") between each field:
echo "chr1:431-874" | sed -e 's/:/\t/' -e 's/-/\t/'
#> chr1    431     874

# Invert the order of two words:
echo "inverted words" | sed -E 's/(\w+) (\w+)/\2 \1/'
#> words inverted

# Capture transcript IDs from a GTF file (format 'transcript_id "ID_I_WANT"'):
# (Needs "-n" and "p" so lines with no transcript_id are not printed.) 
grep -v "^#" my.gtf | sed -E -n 's/.*transcript_id "([^"]+)".*/\1/p'

# When a pattern contains a `/`, use a different expression delimiter:
echo "data/fastq/sampleA.fastq" | sed 's#data/fastq/##'
#> sampleA.fastq
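
The -i flag from the table above is not covered by these examples. Here is a minimal sketch of in-place editing on a scratch file, assuming a sed that accepts a backup suffix attached directly to -i (both GNU and BSD sed do); the file name and contents are made up.

```shell
workdir=$(mktemp -d)
printf 'chrom1\nCHROM2\nabc\n' > "$workdir/chroms.txt"

# Edit the file in place, keeping the original as chroms.txt.bak
sed -i.bak 's/chrom/chr/' "$workdir/chroms.txt"

# The edited file has "chr1"; the uppercase line is untouched (no "i" modifier used)
edited=$(tr "\n" " " < "$workdir/chroms.txt")
echo "$edited"

# The backup still holds the original contents
backup=$(tr "\n" " " < "$workdir/chroms.txt.bak")
echo "$backup"

# -n with the p modifier: print only lines 2-3 of the edited file
subset=$(sed -n '2,3p' "$workdir/chroms.txt" | tr "\n" " ")
echo "$subset"
```

Keeping a backup is a cheap safeguard, because an in-place edit with a faulty expression cannot otherwise be undone.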

6.3 awk

  • Records and fields: by default, each line is a record (assigned to $0). Each column is a field (assigned to $1, $2, etc).

  • Patterns and actions: A pattern is a condition to be tested, and an action is something to do when the pattern evaluates to true.

    • Omit the pattern: action applies to every record.

      awk '{ print $0 }' my.txt     # Print entire file
      awk '{ print $3,$2 }' my.txt  # Print columns 3 and 2 for each line
    • Omit the action: print full records that match the pattern.

      # Print all lines for which:
      awk '$3 < 10' my.bed          # Column 3 is less than 10
      awk '$1 == "chr1"' my.bed     # Column 1 is "chr1"
      awk '/chr1/' my.bed           # Regex pattern "chr1" matches
      awk '$1 ~ /chr1/' my.bed      # Column 1 _matches_ "chr1"

awk examples

# Count columns in a GTF file after excluding the header
# (lines starting with "#"):
awk -F "\t" '!/^#/ {print NF; exit}' my.gtf

# Print all lines for which column 1 matches "chr1" and the difference
# ...between columns 3 and 2 (feature length) is greater than 10:
awk '$1 ~ /chr1/ && $3 - $2 > 10' my.bed

# Select lines with "chr2" or "chr3", print all columns and add a column 
# ...with the difference between column 3 and 2 (feature length):
awk '$1 ~ /chr2|chr3/ { print $0 "\t" $3 - $2 }' my.bed

# Calculate the mean feature length (column 3 minus column 2):
awk 'BEGIN{ sum = 0 };            
     { sum += ($3 - $2) };             
     END{ print "mean: " sum/NR };' my.bed

awk comparison and logical operators

Comparison Description
a == b a is equal to b
a != b a is not equal to b
a < b a is less than b
a > b a is greater than b
a <= b a is less than or equal to b
a >= b a is greater than or equal to b
a ~ /b/ a matches regular expression pattern b
a !~ /b/ a does not match regular expression pattern b
a && b logical and: a and b
a || b logical or: a or b [note typo in Buffalo]
!a not a (logical negation)

awk special variables and keywords

BEGIN Used as a pattern that matches the start of the file
END Used as a pattern that matches the end of the file
NR Number of Records (running count; in END: total nr. of lines)
NF Number of Fields (for each record)
$0 Contains entire record (usually a line)
$1 - $n Contains one column each
FS Input Field Separator (default: any whitespace)
OFS Output Field Separator (default: single space)
RS Input Record Separator (default: newline)
ORS Output Record Separator (default: newline)
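
As an illustration of these variables, the sketch below reads a small, invented BED-like file (chrom, start, end) and uses FS, OFS, NR, and NF; the file name and its contents are assumptions made for the example.

```shell
workdir=$(mktemp -d)
# Hypothetical BED-like file: chrom, start, end (tab-separated)
printf 'chr1\t10\t25\nchr2\t5\t9\nchr1\t100\t130\n' > "$workdir/my.bed"

# Set FS and OFS in BEGIN: read tab-separated input, write comma-separated output.
# NR gives the running record number; commas in print insert OFS between items.
csv=$(awk 'BEGIN{ FS = "\t"; OFS = "," } { print NR, $1, $3 - $2 }' "$workdir/my.bed" | tr "\n" " ")
echo "$csv"

# NF is the number of fields in the current record: check the first line, then stop
nfields=$(awk -F "\t" 'NR == 1 { print NF; exit }' "$workdir/my.bed")
echo "$nfields"

# In END, NR holds the total number of records
nrec=$(awk 'END{ print NR }' "$workdir/my.bed")
echo "$nrec"

# Pattern plus action: sum the feature lengths for chr1 only
chr1_len=$(awk -F "\t" '$1 == "chr1" { sum += $3 - $2 } END{ print sum }' "$workdir/my.bed")
echo "$chr1_len"
```

The -F option is a command-line shorthand for setting FS in a BEGIN block.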

Some awk functions

Function Meaning
length(<string>) Return number of characters
tolower(<string>) Convert to lowercase
toupper(<string>) Convert to uppercase
substr(<string>, <start>, <length>) Return substring of <length> characters from position <start>
sub(<from>, <to>, <string>) Substitute (replace) first regex match
gsub(<from>, <to>, <string>) Substitute all regex matches (>1 substitution per line)
print Print, e.g. column: print $1
exit Break out of record-processing loop;
e.g. to stop when match is found
next Skip the remaining rules for the current record and move on to the next record
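
The sketch below exercises these functions on short strings piped in with echo and printf. When sub and gsub are called with two arguments, they operate on the whole record ($0). The input strings are made up.

```shell
# substr takes a start position (counting from 1) and a length
sub3=$(echo "chromosome" | awk '{ print substr($0, 1, 3) }')
echo "$sub3"

# sub replaces the first match only; gsub replaces every match
first=$(echo "a-b-c" | awk '{ sub(/-/, "_"); print }')
all=$(echo "a-b-c" | awk '{ gsub(/-/, "_"); print }')
echo "$first $all"

# toupper and length applied to a single field
info=$(echo "acgt" | awk '{ print toupper($1), length($1) }')
echo "$info"

# next skips the remaining rules for comment lines, so they are never printed
kept=$(printf '# header\nchr1\nchr2\n' | awk '/^#/ { next } { print }' | tr "\n" " ")
echo "$kept"
```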

7 Keyboard shortcuts

Shortcut Function
Tab Tab completion
↑ / ↓ Cycle through previously issued commands
Ctrl+Shift+C Copy selected text
Ctrl+Shift+V Paste text from clipboard
Ctrl+A / Ctrl+E Go to beginning/end of line
Ctrl+U / Ctrl+K Cut from cursor to beginning / end of line
Ctrl+W Cut word before cursor
Ctrl+Y Paste (“yank”)
Alt+. Last argument of previous command (very useful!)
Ctrl+R Search history: press Ctrl+R again to cycle through matches, Enter to put command in prompt.
Ctrl+C Kill (stop) currently active command
Ctrl+D Exit (a program or the shell depending on the context)
Ctrl+Z Suspend (pause) a process: then use bg to move to background.
