Topic overview: Unix shell
1 Basic commands
Command | Description | Examples / options |
---|---|---|
pwd | Print the current working directory (dir). | pwd |
ls | List files in the working dir (default) or elsewhere. | ls data/; -l long format; -h human-readable file sizes; -a show hidden files |
cd | Change the working dir. As with all commands, you can use an absolute path (starting from the root dir /) or a relative path (starting from the current working dir). | cd /fs/ess/PAS1855 (With absolute path); cd ../.. (Two levels up); cd - (To previous dir) |
cp | Copy files or, with -r, dirs and their contents (i.e., recursively). If the target is a dir, the file keeps its name; otherwise, a new name can be provided. | cp *.fq data/ (All .fq files into dir data); cp my.fq data/new.fq (With new name); cp -r data/ ~ (Copy dir and contents to home dir) |
mv | Move/rename files or dirs (-r not needed). If the target is a dir, the file keeps its name; otherwise a new name can be provided. | mv my.fq data/ (Keep same name); mv my.fq my.fastq (Simple rename); mv file1 file2 mydir/ (Last arg is the destination) |
rm | Remove files, or dirs and their contents recursively (with -r). With -f (force), any write protections that you have set are overridden. | rm *fq (Remove all matching files); rm -r mydir/ (Remove dir & contents); -i prompt for confirmation; -f force remove |
mkdir | Create a new dir. Use -p to create multiple levels at once and to avoid an error if the dir exists. | mkdir my_new_dir; mkdir -p new1/new2/new3 |
touch | If the file does not exist: create an empty file. If the file exists: update its last-modified date. | touch newfile.txt |
cat | Print file contents to standard out (screen). | cat my.txt; cat *.fa > concat.fa (Concatenate files) |
head | Print the first 10 lines of a file, or specify the number of lines with -n <n> or the shorthand -<n>. | head -n 40 my.fq (Print 40 lines); head -40 my.fq (Equivalent) |
tail | Like head, but print the last lines. | tail -n +2 my.csv (“Trick” to skip the first line); tail -f slurm.out (“Follow” the file) |
less | View a file in a pager; type q to exit. See below for more details. | less myfile; -S disable line-wrapping |
column -t | View a tabular file with columns nicely lined up in the shell. | Nice viewing of a CSV file: column -s "," -t my.csv |
history | Print previously issued commands. | history \| grep "cut" (Find previous cut usage) |
chmod | Change file permissions for the file owner (user, u), “group” (g), others (o), or everyone (all, a). Permissions can be set for reading (r), writing (w), and executing (x). | chmod u+x script.sh (Make script executable); chmod a=r data/raw/* (Make data read-only); -R recursive |
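As a quick illustration of chmod's symbolic modes, which combine a target (u, g, o, a), an operator (+, -, =), and permissions (r, w, x) — a minimal sketch with hypothetical file names:

chmod u+x run.sh       # Add execute permission for the file owner only
chmod go-w notes.txt   # Remove write permission for group and others
chmod a=r data/raw/*   # Set permissions to read-only for everyone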
2 Data tools
Command | Description | Examples and options |
---|---|---|
wc -l | Count the number of lines in a file. | wc -l my.fq |
cut | Select one or more columns from a file. | Select columns 1-4: cut -f 1-4 my.csv; -d "," comma as delimiter |
sort | Sort lines. By default, sorting is alphabetical; use the options for numerical, reverse, and by-column sorting. | Sort column 1 alphabetically, column 2 reverse numerically: sort -k1,1 -k2,2nr my.bed; -k1,1 sort by column 1 only; -n numerical sorting; -r reverse order; -V recognize numbers within strings |
uniq | Remove consecutive duplicate lines (often from a single-column selection): i.e., removes all duplicates if the input is sorted. | Unique values for column 2: cut -f2 my.tsv \| sort \| uniq |
uniq -c | If the input is sorted, create a count table for the occurrences of each line (often from a single-column selection). | Count table for column 3: cut -f3 my.tsv \| sort \| uniq -c |
tr | Substitute (translate) characters or character classes (like A-Z). Use -s to “squeeze” repeated characters into a single one and -d to delete characters. | TSV to CSV: cat my.tsv \| tr "\t" ","; Uppercase to lowercase: tr A-Z a-z < in.txt > out.txt; -d delete; -s squeeze |
grep | Search files for a pattern and print matching lines (or only the matching string with -o). The default regex flavor is basic (GNU BRE): use -E for extended regex (ERE). To print lines surrounding a match, use -A <n> (after), -B <n> (before), or -C <n> (both). | Match AAC or AGC: grep "A[AG]C" my.fa; Omit comment lines: grep -v "^#" my.gff; -c count; -i ignore case; -r recursive; -v invert; -o print match only |
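These tools are commonly chained with pipes. For instance, a sketch (assuming a tab-separated file my.tsv) that produces a count table of the values in column 3, sorted by frequency:

# Count how often each value in column 3 occurs, most frequent first:
cut -f3 my.tsv | sort | uniq -c | sort -k1,1nr | head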
3 Miscellaneous
Symbol | Meaning | Example |
---|---|---|
/ | Root directory. | cd / |
. | Current working directory. | cp data/file.txt . (Copy to working dir); Use ./ to execute a script that is not in $PATH: ./myscript.sh |
.. | One directory level up. | cd ../.. (Move 2 levels up) |
~ or $HOME | Home directory. | cp myfile.txt ~ (Copy to home) |
$USER | User name. | mkdir $USER |
> | Redirect standard out to a file. | echo "My 1st line" > myfile.txt |
>> | Append standard out to a file. | echo "My 2nd line" >> myfile.txt |
2> | Redirect standard error to a file. | Send standard out and standard error of a script to separate files: myscript.sh > log.txt 2> err.txt |
&> | Redirect standard out and standard error to a file. | myscript.sh &> log.txt |
\| | Pipe the standard out (output) of one command into the standard in (input) of a second command. | The output of the sort command is piped into head to show the first lines: sort myfile.txt \| head |
{} | Brace expansion. Use .. to indicate numeric or character ranges (1..4 => 1, 2, 3, 4) and , to separate items. | mkdir Jan{01..31} (Jan01, Jan02, …, Jan31); touch fig1{A..F} (fig1A, fig1B, …, fig1F); mkdir fig1{A,D,H} (fig1A, fig1D, fig1H) |
$() | Command substitution. Allows flexible use of the output of any command: e.g., use command output in an echo statement or assign it to a variable. | Report the number of FASTQ files: echo "I see $(ls *fastq \| wc -l) files"; Dir name with the date in YYYY-MM-DD format: mkdir results_$(date +%F); nlines=$(wc -l < $infile) |
$PATH | Contains a colon-separated list of directories with executables: these are searched when trying to execute a program by name. | Add a dir to the path: PATH=$PATH:/new/dir (But for lasting changes, edit the Bash configuration file ~/.bashrc.) |
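For example, to make a $PATH addition persist across sessions, one option (assuming Bash reads ~/.bashrc on your system) is to append an export line to that file:

# Append the export line to the Bash config file, then reload it in the current shell:
echo 'export PATH="$PATH:/new/dir"' >> ~/.bashrc
source ~/.bashrc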
4 Shell wildcards
Wildcard | Matches | Examples |
---|---|---|
* | Any number of any character, including nothing. | ls data/*fastq.gz (Matches any file ending in “fastq.gz”); ls *R1* (Matches any file containing “R1” somewhere in the name) |
? | Any single character. | ls sample1_?.fastq.gz (Matches sample1_A.fastq.gz but not sample1_AA.fastq.gz) |
[] and [^] | One character from the “character set” within the brackets, or (with ^) any single character not in the set. | ls fig1[A-C] (Matches fig1A, fig1B, fig1C); ls fig[0-3] (Matches fig0, fig1, fig2, fig3); ls fig[^4]* (Does not match files with a “4” after “fig”) |
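Wildcards can also be combined in a single pattern; a small sketch with hypothetical file names:

# Copy the R1 files for samples 1-9 (but not, e.g., sample10 or R2 files):
cp data/sample[1-9]_R1*.fastq.gz backup/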
5 Regular expressions
Where it says “yes” in the ERE column, the symbol in question needs extended regular expressions (ERE) turned on in order to work¹: use the -E flag for grep and sed to turn on ERE (note that awk uses ERE by default).
Symbol | ERE² | Matches | Example |
---|---|---|---|
. | | Any single character | Match Olfr with none or any characters after it: grep -o "Olfr.*" |
* | | Quantifier: matches the preceding character any number of times | See previous example. |
+ | yes | Quantifier: matches the preceding character at least once | One or more consecutive digits: grep -E "[0-9]+" |
? | yes | Quantifier: matches the preceding character at most once | At most one digit (zero or one): grep -E "[0-9]?" |
{m} / {m,} / {m,n} | yes | Quantifier: match the preceding character m times / at least m times / m to n times | Between 50 and 100 consecutive Gs: grep -E "G{50,100}" |
^ / $ | | Anchors: match the beginning / end of a line | Exclude empty lines: grep -v "^$"; Exclude lines beginning with a “#”: grep -v "^#" |
\t | | Tab (to match in grep, needs the -P flag for Perl-like regex) | echo -e "column1 \t column2" |
\n | | Newline (not straightforward to match, since Unix tools are line-based) | echo -e "Line1 \n Line2" |
\w | (yes) | “Word” character: any alphanumeric character or “_”. Needs -E (ERE) in grep but not in sed. | Match gene_id followed by a space and a “word”: grep -E -o 'gene_id "\w+"'; Change every word character to X: sed 's/\w/X/g' |
\| | yes | Alternation / logical or: match either the string before or the string after the \| | Find lines with either intron or exon: grep -E "intron\|exon" |
() | yes | Grouping | Find “AAG” repeated 10 times: grep -E "(AAG){10}" |
\1, \2, etc. | yes | Backreferences to groups captured with (): the first group is \1, the second group is \2, etc. | Invert the order of two words: sed -E 's/(\w+) (\w+)/\2 \1/' |
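To illustrate footnote 1: with grep's default BRE, the grouping and quantifier symbols from the table need backslashes, whereas with -E (ERE) they work as-is. A sketch using the “AAG” example above:

grep -E "(AAG){10}" my.fa    # ERE: find "AAG" repeated 10 times
grep "\(AAG\)\{10\}" my.fa   # BRE equivalent (GNU grep): same symbols, backslash-escaped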
6 More details for a few commands
6.1 less
Key | Function |
---|---|
q | Exit less |
space / b | Go down / up a page. (pgup / pgdn usually also work.) |
d / u | Go down / up half a page. |
g / G | Go to the first / last line (home / end also work). |
/<pattern> or ?<pattern> | Search for <pattern> forwards / backwards: type your search after / or ?. |
n / N | When searching, go to the next / previous search match. |
6.2 sed
sed flags:
Flag | Meaning |
---|---|
-E | Use extended regular expressions |
-e | When using multiple expressions, precede each with -e |
-i | Edit a file in place |
-n | Don’t print lines unless specified with the p modifier |
sed examples
# Replace "chrom" by "chr" in every line,
# with "i": case insensitive, and "g": global (>1 replacements per line)
sed 's/chrom/chr/ig' chroms.txt
# Only print lines matching "abc":
sed -n '/abc/p' my.txt
# Print lines 20-50:
sed -n '20,50p'
# Change the genomic coordinates format chr1:431-874 ("chrom:start-end")
# ...to one that has a tab ("\t") between each field:
echo "chr1:431-874" | sed -e 's/:/\t/' -e 's/-/\t/'
#> chr1 431 874
# Invert the order of two words:
echo "inverted words" | sed -E 's/(\w+) (\w+)/\2 \1/'
#> words inverted
# Capture transcript IDs from a GTF file (format 'transcript_id "ID_I_WANT"'):
# (Needs "-n" and "p" so lines with no transcript_id are not printed.)
grep -v "^#" my.gtf | sed -E -n 's/.*transcript_id "([^"]+)".*/\1/p'
# When a pattern contains a `/`, use a different expression delimiter:
echo "data/fastq/sampleA.fastq" | sed 's#data/fastq/##'
#> sampleA.fastq
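The -i flag from the flags table edits a file in place; a cautious sketch (GNU sed assumed) keeps a backup of the original file:

# Replace in place, saving the unmodified file as chroms.txt.bak:
sed -i.bak 's/chrom/chr/g' chroms.txt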
6.3 awk
Records and fields: by default, each line is a record (assigned to $0). Each column is a field (assigned to $1, $2, etc.).
Patterns and actions: a pattern is a condition to be tested, and an action is something to do when the pattern evaluates to true.
Omit the pattern: the action applies to every record.
awk '{ print $0 }' my.txt      # Print the entire file
awk '{ print $3,$2 }' my.txt   # Print columns 3 and 2 for each line
Omit the action: print full records that match the pattern.
# Print all lines for which:
awk '$3 < 10' my.bed        # Column 3 is less than 10
awk '$1 == "chr1"' my.bed   # Column 1 is "chr1"
awk '/chr1/' my.bed         # The regex pattern "chr1" matches (anywhere in the line)
awk '$1 ~ /chr1/' my.bed    # Column 1 _matches_ "chr1"
awk examples
# Count columns in a GTF file after excluding the header
# (lines starting with "#"):
awk -F "\t" '!/^#/ {print NF; exit}' my.gtf
# Print all lines for which column 1 matches "chr1" and the difference
# ...between columns 3 and 2 (feature length) is more than 10:
awk '$1 ~ /chr1/ && $3 - $2 > 10' my.bed
# Select lines with "chr2" or "chr3", print all columns and add a column
# ...with the difference between column 3 and 2 (feature length):
awk '$1 ~ /chr2|chr3/ { print $0 "\t" $3 - $2 }' my.bed
# Calculate the mean feature length (column 3 minus column 2):
awk 'BEGIN{ sum = 0 };
{ sum += ($3 - $2) };
END{ print "mean: " sum/NR };' my.bed
awk comparison and logical operators
Comparison | Description |
---|---|
a == b | a is equal to b |
a != b | a is not equal to b |
a < b | a is less than b |
a > b | a is greater than b |
a <= b | a is less than or equal to b |
a >= b | a is greater than or equal to b |
a ~ /b/ | a matches the regular expression pattern b |
a !~ /b/ | a does not match the regular expression pattern b |
a && b | logical and: a and b |
a \|\| b | logical or: a or b [note typo in Buffalo] |
!a | not a (logical negation) |
awk special variables and keywords
Keyword / variable | Meaning |
---|---|
BEGIN | Used as a pattern that matches the start of the file |
END | Used as a pattern that matches the end of the file |
NR | Number of Records (running count; in END: the total number of lines) |
NF | Number of Fields (for each record) |
$0 | Contains the entire record (usually a line) |
$1 - $n | Contain one column (field) each |
FS | Input Field Separator (default: any whitespace) |
OFS | Output Field Separator (default: single space) |
RS | Input Record Separator (default: newline) |
ORS | Output Record Separator (default: newline) |
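As an example of FS and OFS, the sketch below converts a tab-separated file to comma-separated output (the $1 = $1 assignment forces awk to rebuild each record with the new OFS):

# TSV in, CSV out:
awk 'BEGIN { FS = "\t"; OFS = "," } { $1 = $1; print }' my.tsv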
Some awk functions
Function | Meaning |
---|---|
length(<string>) | Return the number of characters |
tolower(<string>) | Convert to lowercase |
toupper(<string>) | Convert to uppercase |
substr(<string>, <start>, <length>) | Return a substring |
sub(<from>, <to>, <string>) | Substitute (replace) the first regex match |
gsub(<from>, <to>, <string>) | Like sub, but with >1 substitution per line |
print | Print, e.g. a column: print $1 |
exit | Break out of the record-processing loop, e.g. to stop when a match is found |
next | Skip the remaining commands for the current record and move on to the next record |
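A small sketch using a few of these functions on the my.bed file from the earlier examples (column contents are assumed):

# Replace "chr" with "c" in column 1, uppercase it, and print it with the feature length:
awk '{ sub(/chr/, "c", $1); print toupper($1), $3 - $2 }' my.bed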
7 Keyboard shortcuts
Shortcut | Function |
---|---|
Tab | Tab completion |
↑ / ↓ | Cycle through previously issued commands |
Ctrl+Shift+C | Copy selected text |
Ctrl+Shift+V | Paste text from clipboard |
Ctrl+A / Ctrl+E | Go to beginning/end of line |
Ctrl+U / Ctrl+K | Cut from the cursor to the beginning / end of the line³ |
Ctrl+W | Cut the word before the cursor⁴ |
Ctrl+Y | Paste (“yank”) |
Alt+. | Last argument of previous command (very useful!) |
Ctrl+R | Search history: press Ctrl+R again to cycle through matches, Enter to put command in prompt. |
Ctrl+C | Kill (stop) currently active command |
Ctrl+D | Exit (a program or the shell depending on the context) |
Ctrl+Z | Suspend (pause) a process: then use bg to move to background. |
Footnotes
1. When using the default regular expressions in grep and sed, Basic Regular Expressions (BRE), the symbol would need to be preceded by a backslash to work.↩︎
2. GNU Extended Regular Expressions.↩︎
3. Ctrl+K doesn’t work by default in VS Code, but can be set there.↩︎
4. Doesn’t work by default in VS Code, but can be set there.↩︎