Topic overview: Unix shell

Author

Jelmer Poelstra

Published

February 26, 2024



1 Basic commands

Command Description Examples / options
pwd Print current working directory (dir). pwd
ls List files in working dir (default) or elsewhere. ls data/
    -l long format
    -h human-readable file sizes
    -a show hidden files
cd Change working dir. As with all commands, you can use an absolute path (starting from the root dir /) or a relative path (starting from the current working dir). cd /fs/ess/PAS1855 (With absolute path)
cd ../.. (Two levels up)
cd - (To previous dir)
cp Copy files or, with -r, dirs and their contents (i.e., recursively).
If target is a dir, file will keep same name; otherwise, a new name can be provided.
cp *.fq data/ (All .fq files into dir data)
cp my.fq data/new.fq (With new name)
cp -r data/ ~ (Copy dir and contents to home dir)
mv Move/rename files or dirs (-r not needed).
If target is a dir, file will keep same name; otherwise a new name can be provided.
mv my.fq data/ (Keep same name)
mv my.fq my.fastq (Simple rename)
mv file1 file2 mydir/ (Last arg is destination)
rm Remove files or, with -r, dirs and their contents (i.e., recursively).
With -f (force), any write-protections that you have set will be overridden.
rm *fq (Remove all matching files)
rm -r mydir/ (Remove dir & contents)
    -i Prompt for confirmation
    -f Force remove
mkdir Create a new dir.
Use -p to create multiple levels at once and to avoid an error if the dir exists.
mkdir my_new_dir
mkdir -p new1/new2/new3
touch If file does not exist: create empty file.
If file exists: change last-modified date.
touch newfile.txt
cat Print file contents to standard out (screen). cat my.txt
cat *.fa > concat.fa (Concatenate files)
head Print the first 10 lines of a file or specify number with -n <n> or shorthand -<n>. head -n 40 my.fq (print 40 lines)
head -40 my.fq (equivalent)
tail Like head but print the last lines. tail -n +2 my.csv (“trick” to skip first line)
tail -f slurm.out (“follow” file)
less View a file in a file pager; type q to exit. See below for more details. less myfile
    -S disable line-wrapping
column -t View a tabular file with columns nicely lined up in the shell. Nice viewing of a CSV file:
column -s "," -t my.csv
history Print previously issued commands. history | grep "cut" (Find previous cut usage)
chmod Change file permissions for file owner (user, u), “group” (g), others (o) or everyone (all; a). Permissions can be set for reading (r), writing (w), and executing (x).
chmod u+x script.sh (Make script executable)
chmod a=r data/raw/* (Make data read-only)
    -R recursive
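
As a minimal sketch of how several of the commands above fit together, the following runs in a throwaway directory created with mktemp. The directory and file names (data/raw, results, s1.fq, s2.fq) are hypothetical and only serve as illustration.

```shell
# Work in a throwaway directory so that no real files are touched
scratch=$(mktemp -d)
cd "$scratch"

mkdir -p data/raw results             # -p: create nested dirs, no error if they exist
touch data/raw/s1.fq data/raw/s2.fq   # create two empty files
cp data/raw/s1.fq results/            # target is a dir: the copy keeps its name
mv results/s1.fq results/s1.fastq     # simple rename
chmod a=r data/raw/*                  # make the "raw data" read-only
n_raw=$(ls data/raw | wc -l)          # count the files in data/raw
ls -l data/raw results                # long-format listing of both dirs
rm -r results                         # remove the dir and its contents
```

After this runs, data/raw still holds both files, while results is gone.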


2 Data tools

Command Description Examples and options
wc -l Count the number of lines in a file. wc -l my.fq
cut Select one or more columns from a file. Select columns 1-4:
cut -f 1-4 my.csv
    -d "," comma as delimiter
sort

Sort lines.

The -V option sorts numbers embedded in strings in natural order, so that chr2 comes before chr10, etc.

Sort column 1 alphabetically,
column 2 reverse numerically:
sort -k1,1 -k2,2nr my.bed

    -k 1,1 by column 1 only
    -n numerical sorting
    -r reverse order
    -V natural sort of numbers within strings
uniq Remove consecutive duplicate lines (often from single-column selection): i.e., removes all duplicates if input is sorted. Unique values for column 2:
cut -f2 my.tsv | sort | uniq
uniq -c If input is sorted, create a count table for occurrences of each line (often from single-column selection). Count table for column 3:
cut -f3 my.tsv | sort | uniq -c
tr

Substitute (translate) characters or character classes (like A-Z for uppercase letters). Does not take files as argument; piping or redirection needed.

To “squeeze” (-s) is to remove consecutive duplicates (akin to uniq).

TSV to CSV:
cat my.tsv | tr "\t" ","
Uppercase to lowercase: tr A-Z a-z < in.txt > out.txt

    -d delete
    -s squeeze
grep

Search files for a pattern and print matching lines (or only the matching string with -o).

Default regex is basic (GNU BRE): use -E for extended regex (GNU ERE) and -P for Perl-like regex.

To print lines surrounding a match, use -A n (n lines after match) or -B n (n lines before match) or -C n (n lines before and after match).


Match AAC or AGC:
grep "A[AG]C" my.fa
Omit comment lines:
grep -v "^#" my.gff

    -c count
    -i ignore case
    -r recursive
    -v invert
    -o print match only
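
The data tools above are mostly used in small pipelines. Here is a hedged sketch on a made-up two-column, tab-separated file (toy.tsv, a hypothetical name); note that sort -V is assumed to be available, as it is in GNU coreutils.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# Hypothetical toy file: chromosome name and feature type, tab-separated
printf 'chr2\texon\nchr10\tintron\nchr1\texon\nchr2\texon\n' > toy.tsv

n_lines=$(wc -l < toy.tsv)                          # number of lines
counts=$(cut -f2 toy.tsv | sort | uniq -c)          # count table for column 2
chroms=$(cut -f1 toy.tsv | sort -V | uniq)          # unique names, chr2 before chr10
n_exon=$(grep -c "exon" toy.tsv)                    # number of lines matching "exon"
first_csv=$(tr '\t' ',' < toy.tsv | head -n 1)      # TSV to CSV, first line only

echo "$counts"
echo "$chroms"
```

Because uniq only removes consecutive duplicates, the sort step before it is what makes the count table and the list of unique names complete.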


3 Miscellaneous

Symbol Meaning Example
/ Root directory. cd /
. Current working directory. cp data/file.txt . (Copy to working dir)
Use ./ to execute script if not in $PATH:
./myscript.sh
.. One directory level up. cd ../.. (Move 2 levels up)
~ or $HOME Home directory. cp myfile.txt ~ (Copy to home)
$USER User name. mkdir $USER
> Redirect standard out to a file. echo "My 1st line" > myfile.txt
>> Append standard out to a file. echo "My 2nd line" >> myfile.txt
2> Redirect standard error to a file. Send standard out and standard error for a script to separate files:
myscript.sh >log.txt 2> err.txt
&> Redirect standard out and standard error to a file. myscript.sh &> log.txt
| Pipe standard out (output) of one command into standard in (input) of a second command. The output of the sort command will be piped into head to show the first lines:
sort myfile.txt | head
{} Brace expansion. Use .. to indicate numeric or character ranges (1..4 => 1, 2, 3, 4) and , to separate items. mkdir Jan{01..31} (Jan01, Jan02, …, Jan31)
touch fig1{A..F} (fig1A, fig1B, …, fig1F)
mkdir fig1{A,D,H} (fig1A, fig1D, fig1H)
$() Command substitution. Allows for flexible usage of the output of any command: e.g., use command output in an echo statement or assign it to a variable. Report number of FASTQ files:
echo "I see $(ls *fastq | wc -l) files"
Substitute with date in YYYY-MM-DD format:
mkdir results_$(date +%F)
nlines=$(wc -l < $infile)
$PATH Contains colon-separated list of directories with executables: these will be searched when trying to execute a program by name.
Add dir to path:
PATH=$PATH:/new/dir
(But for lasting changes, edit the Bash configuration file ~/.bashrc.)
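
The redirection, pipe, and command substitution symbols above can be combined as in the following sketch. The file names are illustrative, and no_such_file is a deliberately missing path used only to produce an error message.

```shell
scratch=$(mktemp -d)
cd "$scratch"

echo "My 1st line" > myfile.txt       # > creates or overwrites the file
echo "My 2nd line" >> myfile.txt      # >> appends to it
nlines=$(wc -l < myfile.txt)          # command substitution into a variable

# Send standard out and standard error to separate files;
# "|| true" keeps the expected failure of ls from stopping a strict script
ls myfile.txt no_such_file > out.txt 2> err.txt || true

first=$(sort -r myfile.txt | head -n 1)   # pipe: output of sort feeds head
echo "myfile.txt has $nlines lines; first in reverse sort order: $first"
```

Here out.txt receives the regular listing and err.txt receives the complaint about the missing file.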


4 Shell wildcards

Wildcard Matches
* Any number of any character, including nothing. ls data/*fastq.gz (Matches any file ending in “fastq.gz”)
ls *R1* (Matches any file containing “R1” somewhere in the name.)
? Any single character. ls sample1_?.fastq.gz (Matches sample1_A.fastq.gz but not sample1_AA.fastq.gz)
[] and [^] Any one character in ([]) or not in ([^]) the “character set” within the brackets.
ls fig1[A-C] (Matches fig1A, fig1B, fig1C)
ls fig[0-3] (Matches fig0, fig1, fig2, fig3)
ls fig[^4]* (Does not match files with a “4” after “fig”)
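
To see what the wildcards above expand to, echo is a safe way to preview a pattern before using it with a command like rm. This is a small sketch with made-up file names; it uses [!4], the POSIX spelling of negation, which Bash treats the same as [^4].

```shell
scratch=$(mktemp -d)
cd "$scratch"
touch fig1A fig1B fig1C fig1D fig2A fig4A
touch sample1_A.fastq.gz sample1_AA.fastq.gz

set_match=$(echo fig1[A-C])          # one character from the set A-C
not4=$(echo fig[!4]*)                # anything but a "4" right after "fig"
single=$(echo sample1_?.fastq.gz)    # ? matches exactly one character
n_gz=$(ls *fastq.gz | wc -l)         # * matches any number of characters

echo "$set_match"
echo "$not4"
echo "$single"
```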


5 Regular expressions

“ERE” = GNU Extended regular expressions

Where it says “yes” in the ERE column, the symbol in question needs ERE turned on in order to work (see footnote 1): use the -E flag for grep and sed to turn on ERE (awk uses ERE by default).

Symbol ERE (see footnote 2) Matches Example
. Any single character Match Olfr with none or any characters after it:
grep -o "Olfr.*"
* Quantifier: matches preceding character any number of times See previous example.
+ yes Quantifier: matches preceding character at least once One or more consecutive digits:
grep -E "[0-9]+"
? yes Quantifier: matches preceding character at most once Zero or one digit (so on its own, this also matches lines without digits):
grep -E "[0-9]?"
{m} / {m,} / {m,n} yes Quantifier: match preceding character m times / at least m times / m to n times Between 50 and 100 consecutive Gs:
grep -E "G{50,100}"
^ / $ Anchors: match beginning / end of line Exclude empty lines:
grep -v "^$"
Exclude lines beginning with a “#”:
grep -v "^#"
\t Tab (To match in grep, needs -P flag for Perl-like regex) echo -e "column1 \t column2"
\n Newline (Not straightforward to match since Unix tools are line-based.) echo -e "Line1 \n Line2"
\w “Word” character: any alphanumeric character or “_”. A GNU extension that works in grep and sed with or without -E; the grep example here needs -E only for the + quantifier. Match gene_id followed by a space and a quoted “word”:
grep -E -o 'gene_id "\w+"'
Change every word character to X:
sed 's/\w/X/g'
| yes Alternation / logical or: match either the string before or after the | Find lines with either intron or exon:
grep -E "intron|exon"
() yes Grouping Find “AAG” repeated 10 times:
grep -E "(AAG){10}"
\1, \2, etc. yes Backreferences to groups captured with (): first group is \1, second group is \2, etc.
Invert order of two words:
sed -E 's/(\w+) (\w+)/\2 \1/'
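
The regular expression symbols above can be tried out on a small made-up file (toy.txt, a hypothetical name), as in this sketch. The sed command uses [a-z]+ rather than \w+ purely as a portability choice, since \w is a GNU extension.

```shell
scratch=$(mktemp -d)
cd "$scratch"
printf 'AAGAAGAAG\nAAGTTT\ngene12\nexon 7\n\n# comment\n' > toy.txt

n_repeat=$(grep -c -E '(AAG){3}' toy.txt)       # grouping plus {m}: AAG three times
digits=$(grep -o -E '[0-9]+' toy.txt)           # +: one or more digits, match only
n_alt=$(grep -c -E 'gene|exon' toy.txt)         # alternation
# Anchors: drop empty lines and lines starting with "#"
n_kept=$(grep -v '^$' toy.txt | grep -v '^#' | wc -l)
# Backreferences: swap two words
swapped=$(echo "inverted words" | sed -E 's/([a-z]+) ([a-z]+)/\2 \1/')

echo "$digits"
echo "$swapped"
```

Quoting each pattern keeps the shell from interpreting characters such as *, ?, and parentheses before grep or sed sees them.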


6 More details for a few commands

6.1 less

Key Function
q Exit less
space / b Go down / up a page. (pgdn / pgup usually also work.)
d / u Go down / up half a page.
g / G Go to the first / last line (home / end also work).
/<pattern> or ?<pattern> Search for <pattern> forwards / backwards: type your search after / or ?.
n / N When searching, go to next / previous search match.

6.2 sed

sed flags:

Flag Meaning
-E Use extended regular expressions
-e When using multiple expressions, precede each with -e
-i Edit a file in place
-n Don’t print lines unless specified with p modifier
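
A short sketch of the flags above on a made-up file (coords.txt, a hypothetical name). For in-place editing it attaches a backup suffix to -i; the suffix .bak is an illustrative choice, and this form is assumed here because it is accepted by both GNU and BSD sed.

```shell
scratch=$(mktemp -d)
cd "$scratch"
printf 'chrom1:10-20\nchrom2:30-40\n' > coords.txt

# -n with the p modifier: print only the matching line
only2=$(sed -n '/chrom2/p' coords.txt)

# Two expressions, each preceded by -e
spaced=$(sed -e 's/:/ /' -e 's/-/ /' coords.txt | head -n 1)

# -i edits the file in place; the suffix keeps the original as coords.txt.bak
sed -i.bak 's/chrom/chr/' coords.txt
edited=$(head -n 1 coords.txt)
backup=$(head -n 1 coords.txt.bak)

echo "$only2 | $spaced | $edited | $backup"
```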

sed examples

# Replace "chrom" by "chr" in every line,
# with "i": case insensitive, and "g": global (>1 replacements per line)
sed 's/chrom/chr/ig' chroms.txt

# Only print lines matching "abc":
sed -n '/abc/p' my.txt

# Print lines 20-50:
sed -n '20,50p' my.txt

# Change the genomic coordinates format chr1:431-874 ("chrom:start-end")
# ...to one that has a tab ("\t") between each field:
echo "chr1:431-874" | sed -e 's/:/\t/' -e 's/-/\t/'
#> chr1    431     874

# Invert the order of two words:
echo "inverted words" | sed -E 's/(\w+) (\w+)/\2 \1/'
#> words inverted

# Capture transcript IDs from a GTF file (format 'transcript_id "ID_I_WANT"'):
# (Needs "-n" and "p" so lines with no transcript_id are not printed.) 
grep -v "^#" my.gtf | sed -E -n 's/.*transcript_id "([^"]+)".*/\1/p'

# When a pattern contains a `/`, use a different expression delimiter:
echo "data/fastq/sampleA.fastq" | sed 's#data/fastq/##'
#> sampleA.fastq
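
To make the transcript ID example above concrete, here is a sketch that runs it on a tiny, made-up GTF-like file (toy.gtf; its contents are invented for illustration and are not from a real annotation).

```shell
scratch=$(mktemp -d)
cd "$scratch"

# Hypothetical 3-line file: a header, a transcript line, and a gene line
printf '#!genome-build toy\nchr1\tsrc\ttranscript\t1\t9\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\nchr1\tsrc\tgene\t1\t9\t.\t+\t.\tgene_id "g1";\n' > toy.gtf

# The gene line has no transcript_id, so -n plus p keeps it out of the output
ids=$(grep -v "^#" toy.gtf | sed -E -n 's/.*transcript_id "([^"]+)".*/\1/p')

# Print a line range (here lines 2-3) and count what came out
n_range=$(sed -n '2,3p' toy.gtf | wc -l)

echo "$ids"
```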

6.3 awk

  • Records and fields: by default, each line is a record (assigned to $0). Each column is a field (assigned to $1, $2, etc).

  • Patterns and actions: A pattern is a condition to be tested, and an action is something to do when the pattern evaluates to true.

    • Omit the pattern: action applies to every record.

      awk '{ print $0 }' my.txt     # Print entire file
      awk '{ print $3,$2 }' my.txt  # Print columns 3 and 2 for each line
    • Omit the action: print full records that match the pattern.

      # Print all lines for which:
      awk '$3 < 10' my.bed          # Column 3 is less than 10
      awk '$1 == "chr1"' my.bed     # Column 1 is "chr1"
      awk '/chr1/' my.bed           # Regex pattern "chr1" matches
      awk '$1 ~ /chr1/' my.bed      # Column 1 _matches_ "chr1"
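
The difference between an exact comparison (==) and a regex match (~) is easy to miss, so here is a sketch on a made-up BED-like file (toy.bed, a hypothetical name) in which the pattern chr1 also matches chr10.

```shell
scratch=$(mktemp -d)
cd "$scratch"
printf 'chr1\t0\t5\nchr1\t10\t40\nchr10\t3\t8\nchr2\t7\t9\n' > toy.bed

n_small=$(awk '$3 < 10' toy.bed | wc -l)          # pattern only: column 3 below 10
n_exact=$(awk '$1 == "chr1"' toy.bed | wc -l)     # exact string comparison
n_regex=$(awk '$1 ~ /chr1/' toy.bed | wc -l)      # regex: also matches chr10
first_swapped=$(awk '{ print $3, $2 }' toy.bed | head -n 1)   # action only

echo "$n_small $n_exact $n_regex $first_swapped"
```

To restrict the regex to chr1 alone, it could be anchored as /^chr1$/.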

awk examples

# Count columns in a GTF file after excluding the header
# (lines starting with "#"):
awk -F "\t" '!/^#/ {print NF; exit}' my.gtf

# Print all lines for which column 1 matches "chr1" and the difference
# ...between columns 3 and 2 (feature length) is less than 10:
awk '$1 ~ /chr1/ && $3 - $2 < 10' my.bed

# Select lines with "chr2" or "chr3", print all columns and add a column 
# ...with the difference between column 3 and 2 (feature length):
awk '$1 ~ /chr2|chr3/ { print $0 "\t" $3 - $2 }' my.bed

# Calculate the mean feature length (column 3 minus column 2):
awk 'BEGIN{ sum = 0 };            
     { sum += ($3 - $2) };             
     END{ print "mean: " sum/NR };' my.bed
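
The examples above can be checked on a made-up file (toy.bed, a hypothetical name) with a comment line at the top. As a variation on the mean calculation, this sketch keeps its own counter n instead of using NR, an assumption made here so that the skipped comment line does not inflate the denominator.

```shell
scratch=$(mktemp -d)
cd "$scratch"
printf '# toy intervals\nchr1\t0\t5\nchr1\t10\t40\nchr10\t3\t8\nchr2\t7\t9\n' > toy.bed

# Number of columns in the first non-comment line
ncols=$(awk -F "\t" '!/^#/ { print NF; exit }' toy.bed)

# Lines where column 1 matches "chr1" and the feature is shorter than 10
n_short=$(awk '$1 ~ /chr1/ && $3 - $2 < 10' toy.bed | wc -l)

# Mean feature length over non-comment lines only
mean=$(awk '!/^#/ { sum += $3 - $2; n++ }
            END   { if (n > 0) print "mean: " sum / n }' toy.bed)

echo "$ncols columns; $mean"
```

The feature lengths here are 5, 30, 5, and 2, which gives a mean of 10.5.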

awk comparison and logical operators

Comparison Description
a == b a is equal to b
a != b a is not equal to b
a < b a is less than b
a > b a is greater than b
a <= b a is less than or equal to b
a >= b a is greater than or equal to b
a ~ /b/ a matches regular expression pattern b
a !~ /b/ a does not match regular expression pattern b
a && b logical and: a and b
a || b logical or: a or b
!a not a (logical negation)
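
A brief sketch of the negated match and the logical operators from the table above, again on a made-up file (toy.bed, a hypothetical name).

```shell
scratch=$(mktemp -d)
cd "$scratch"
printf 'chr1\t0\t5\nchr1\t10\t40\nchr10\t3\t8\nchr2\t7\t9\n' > toy.bed

n_nomatch=$(awk '$1 !~ /chr1/' toy.bed | wc -l)               # does not match regex
n_or=$(awk '$1 == "chr2" || $3 >= 40' toy.bed | wc -l)        # logical or
n_and=$(awk '$2 != 0 && $3 <= 9' toy.bed | wc -l)             # logical and
n_not=$(awk '!($2 <= 5)' toy.bed | wc -l)                     # logical negation

echo "$n_nomatch $n_or $n_and $n_not"
```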

awk special variables and keywords

Keyword / variable Meaning
BEGIN Used as a pattern whose action runs once, before the first record is read
END Used as a pattern whose action runs once, after the last record has been read
NR Number of Records (running count; in END: total nr. of lines)
NF Number of Fields (for each record)
$0 Contains entire record (usually a line)
$1 - $n Contains one column each
FS Input Field Separator (default: any whitespace)
OFS Output Field Separator (default: single space)
RS Input Record Separator (default: newline)
ORS Output Record Separator (default: newline)
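
The following sketch shows several of the special variables above at work on a made-up CSV file (toy.csv, a hypothetical name). The assignment $1 = $1 is a common idiom that makes awk rebuild the record using OFS.

```shell
scratch=$(mktemp -d)
cd "$scratch"
printf 'a,b,c\nd,e\n' > toy.csv

# Set separators in BEGIN, report NR and NF per record, and the total in END
report=$(awk 'BEGIN { FS = ","; OFS = "\t" }
              { print NR, NF, $1 }
              END { print "total", NR }' toy.csv)

# CSV to TSV: touching a field makes awk re-join the record with OFS
tsv_first=$(awk 'BEGIN { FS = ","; OFS = "\t" } { $1 = $1; print }' toy.csv | head -n 1)

echo "$report"
```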

Some awk functions

Function Meaning
length(<string>) Return number of characters
tolower(<string>) Convert to lowercase
toupper(<string>) Convert to uppercase
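substr(<string>, <start>, <length>) Return substring of <length> characters starting at position <start>
sub(<from>, <to>, <string>) Substitute (replace) the first regex match
gsub(<from>, <to>, <string>) Substitute all regex matches (>1 substitution per line)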
print Print, e.g. column: print $1
exit Break out of record-processing loop;
e.g. to stop when match is found
next Skip the remaining rules for the current record and move on to the next record
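
A small sketch of the functions above, using made-up input strings. Note that sub and gsub modify their target in place rather than returning the new string.

```shell
result=$(echo "chrom1 acgt" | awk '{
    n = length($2)              # number of characters in field 2
    up = toupper($2)            # uppercase copy
    part = substr($2, 2, 2)     # 2 characters starting at position 2
    sub(/chrom/, "chr", $1)     # replace first match, modifies $1 in place
    gsub(/[ct]/, "N", $2)       # replace all matches, modifies $2 in place
    print $1, n, up, part, $2
}')
echo "$result"

# next skips comment lines; exit stops after the first printed record
first_data=$(printf '# skip\nx\ny\n' | awk '/^#/ { next } { print; exit }')
echo "$first_data"
```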


7 Keyboard shortcuts

Shortcut Function
Tab Tab completion
↑ / ↓ Cycle through previously issued commands
Ctrl+Shift+C Copy selected text
Ctrl+Shift+V Paste text from clipboard
Ctrl+A / Ctrl+E Go to beginning/end of line
Ctrl+U / Ctrl+K Cut from cursor to beginning / end of line (see footnote 3)
Ctrl+W Cut word before cursor (see footnote 4)
Ctrl+Y Paste (“yank”)
Alt+. Last argument of previous command (very useful!)
Ctrl+R Search history: press Ctrl+R again to cycle through matches, Enter to put command in prompt.
Ctrl+C Kill (stop) currently active command
Ctrl+D Exit (a program or the shell depending on the context)
Ctrl+Z Suspend (pause) a process: then use bg to move to background.

Footnotes

  1. When using the default regular expressions in grep and sed, Basic Regular Expressions (BRE), the symbol would need to be preceded by a backslash to work.

  2. GNU Extended Regular Expressions

  3. Ctrl+K doesn’t work by default in VS Code, but can be set there.

  4. Doesn’t work by default in VS Code, but can be set there.