# (Don't run this)
<- 5
x x
[1] 5
Week 1 - Thu lecture
Jelmer Poelstra
February 29, 2024
In this session, we’ll cover much of the material from CSB Chapter 1, to learn:
We can conceptualize Unix (or “Unix-like” / “*nix”)1 as a family of operating systems, which includes Linux and Mac but not Windows.
While these are not synonyms, in day-to-day computing/bioinformatics, they are often used interchangeably.
Versus programs with graphical user interfaces:
Versus scripting languages like Python or R:
“Directory” (or “dir” for short) is the term for folder that is commonly used in Unix contexts.
The Unix directory structure is hierarchical, with a single starting point: the root, depicted as /
.
A “path” gives the location of a file or directory, in which directories are separated by forward slashes /
.
For example, the path to our OSC project’s dir is /fs/ess/PAS2700
. This means: the dir PAS2700
is located inside the dir ess
, which in turn is inside the dir fs
, which in turn is in the computer’s root directory.
The OSC dir structure is somewhat different from that of a personal computer. Our Home dir is not /home/<username>
like in the book and the schematic on the left, but /users/<a-project>/<username>
.
Clusters
and then Pitzer Shell Access
.You can’t right-click in this shell, so to copy-and-paste:
Try copying and pasting a random word into your shell. This may just work, you may get a permission pop-up, or it may silently fail — if the latter, click on the clipboard icon in your browser’s address bar (see red circle in screenshot below):
Inside your terminal, the “prompt” indicates that the shell is ready for a command. What is shown exactly varies across shells and can also be customized, but our prompts at OSC should show the following information:
<username>@<node-name> <working-dir>]$
For example (and note that ~
means your Home dir):
[jelmer@pitzer-login02 ~]$
We type our commands after the dollar sign, and then press Enter to execute the command. When the command has finished executing, we’ll get our prompt back and can type a new command.
OSC prints welcome messages and storage quota information when you open a shell. To reduce the amount of text on the screen, I will clear the screen now and regularly throughout. This can be done with the keyboard shortcut Ctrl+L.
date
and pwd
The Unix shell comes with hundreds of “commands”: small programs that perform specific actions. If you’re familiar with R or Python, a Unix command is like an R/Python function.
Let’s start with a few simple commands:
The date
command prints the current date and time:
Thu Feb 29 14:58:19 EST 2024
The pwd
(Print Working Directory) command prints the path to the directory you are currently located in:
/users/PAS0471/jelmer
Both of those commands provided us with some output. That output was printed to screen, which is the default behavior for nearly every Unix command.
cal
command — and options & argumentsThe cal
command is another example of a command that simply prints some information to the screen, in this case a calendar. We’ll use it to learn about command options and arguments.
Invoking cal
without options or arguments will show the current month:
February 2024
Su Mo Tu We Th Fr Sa
1 2 3
4 5 6 7 8 9 10
11 12 13 14 15 16 17
18 19 20 21 22 23 24
25 26 27 28 29
Use the option -j
(dash and then j) for a Julian calendar, in which day numbering is continuous instead of restarting each month:
February 2024
Sun Mon Tue Wed Thu Fri Sat
32 33 34
35 36 37 38 39 40 41
42 43 44 45 46 47 48
49 50 51 52 53 54 55
56 57 58 59 60
Use the -3
option to show 3 months (adding the previous and next month):
January 2024 February 2024 March 2024
Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa
1 2 3 4 5 6 1 2 3 1 2
7 8 9 10 11 12 13 4 5 6 7 8 9 10 3 4 5 6 7 8 9
14 15 16 17 18 19 20 11 12 13 14 15 16 17 10 11 12 13 14 15 16
21 22 23 24 25 26 27 18 19 20 21 22 23 24 17 18 19 20 21 22 23
28 29 30 31 25 26 27 28 29 24 25 26 27 28 29 30
31
We can always combine multiple options, for example:
January 2024 February 2024 March 2024
Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat
1 2 3 4 5 6 32 33 34 61 62
7 8 9 10 11 12 13 35 36 37 38 39 40 41 63 64 65 66 67 68 69
14 15 16 17 18 19 20 42 43 44 45 46 47 48 70 71 72 73 74 75 76
21 22 23 24 25 26 27 49 50 51 52 53 54 55 77 78 79 80 81 82 83
28 29 30 31 56 57 58 59 60 84 85 86 87 88 89 90
91
Handily, options can be “pasted together” like so (output not shown - same as above):
As we’ve seen, options are specified with a dash -
(or --
, as you’ll see later). So far, we’ve only worked with the type of options that are also called “flags”, which change some functionality in an ON/OFF type way:
-j
-3
.In general terms, options change the behavior of a command.
Whereas options change the behavior of a command, arguments typically tell the command what to operate on. Most commonly, these are file or directory paths.
Admittedly, the cal
command is not the best illustration of this pattern — when you give it one argument, this is supposed to be the year to show a calendar for:
2020
January February March
Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa
1 2 3 4 1 1 2 3 4 5 6 7
5 6 7 8 9 10 11 2 3 4 5 6 7 8 8 9 10 11 12 13 14
12 13 14 15 16 17 18 9 10 11 12 13 14 15 15 16 17 18 19 20 21
19 20 21 22 23 24 25 16 17 18 19 20 21 22 22 23 24 25 26 27 28
26 27 28 29 30 31 23 24 25 26 27 28 29 29 30 31
# [...output truncated, entire year is shown...]
We can also combine options and arguments:
2020
January February
Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat
1 2 3 4 32
5 6 7 8 9 10 11 33 34 35 36 37 38 39
12 13 14 15 16 17 18 40 41 42 43 44 45 46
19 20 21 22 23 24 25 47 48 49 50 51 52 53
26 27 28 29 30 31 54 55 56 57 58 59 60
# [...output truncated, entire year is shown...]
So, arguments to a command:
-
(or --
)Many commands – and other command-line programs! – have a -h
option for help, which usually gives a concise summary of the command’s syntax, i.e. it’s available options and arguments:
Usage:
cal [options] [[[day] month] year]
Options:
-1, --one show only current month (default)
-3, --three show previous, current and next month
-s, --sunday Sunday as first day of week
-m, --monday Monday as first day of week
-j, --julian output Julian dates
-y, --year show whole current year
-V, --version display version information and exit
-h, --help display this help text and exit
man
command (Click to expand)
cal -h
, and try an option we haven’t used yet4. (You can also combine this new option with other options, if you want.) February 2024
Mo Tu We Th Fr Sa Su
1 2 3 4
5 6 7 8 9 10 11
12 13 14 15 16 17 18
19 20 21 22 23 24 25
26 27 28 29
--
). Why would those be useful?For example:
February 2024
Mon Tue Wed Thu Fri Sat Sun
32 33 34 35
36 37 38 39 40 41 42
43 44 45 46 47 48 49
50 51 52 53 54 55 56
57 58 59 60
The advantage of using long options is that it is much more likely that any reader of the code (including yourself next week) will immediately understand what these options are doing.
Note that long options cannot be “pasted together” like short options.
cal [options] [[[day] month] year]
means. Can you print a calendar for April 2017?First of all, the square brackets around options
and all of the possible arguments means that none of these are required — as we’ve seen, just cal
(with no options or arguments) is a valid command.
The structure of the multiple square brackets around the day-month-year arguments indicate that:
Therefore, to print a calendar for April 2017:
April 2017
Su Mo Tu We Th Fr Sa
1
2 3 4 5 6 7 8
9 10 11 12 13 14 15
16 17 18 19 20 21 22
23 24 25 26 27 28 29
30
cd
and command “actions”All the commands so far “merely” provided some information, which was printed to screen.
But many commands perform another kind of action. For example, the command cd
will change your working directory. And like many commands that perform a potentially invisible action, cd
normally has no output at all.
First, let’s check again where we are — we should be in our Home directory:
/users/PAS0471/jelmer
Now, let’s use cd
to move to another directory by specifying the path to that directory as an argument:
/fs/ess/PAS2700
In summary:
cal
, cd
accepts an argument. Unlike cal
, this argument takes the form of a path that the command should operate on, which is much more typical.cd
gives no output when it successfully changed the working directory. This is very common behavior for Unix commands that perform operations: when they succeed, they are silent.Let’s also see what happens when cd
does not succeed — it gives an error:
-bash: cd: /fs/ess/PAs2700: No such file or directory
#
is considered a comment instead of code!# This entire line is a comment - you can run it and nothing will happen
pwd # 'pwd' will be executed but everything after the '#' is ignored
/users/PAS0471/jelmer
Using keyboard shortcuts help you work much more efficiently in the shell. And some are invaluable:
Up
/ Down
arrow keys to cycle through your command history.Type /f
and press Tab (will autocomplete to /fs/)
Add e
(/fs/e) and press Tab (will autocomplete to /fs/ess/).
Add PAS
(/fs/ess/PAS) and press Tab. Nothing should happen: there are multiple (many!) options.
Press Tab Tab (i.e., twice in quick succession) and it should say:
Display all 503 possibilities? (y or n)
Type n
to answer no: we don’t need to see all the dirs starting with PAS
.
Add 27
(/fs/ess/PAS27) and press Tab (should autocomplete to /fs/ess/PAS2700).
Press Enter. What does the resulting error mean?
bash: /fs/ess/PAS2700/: Is a directory
Press ⇧ to get the previous “command” back on the prompt.
Press Ctrl+A to move to the beginning of the line at once.
Add cd
and a space in front of the dir, and press Enter again.
To simulate a long-running command that we may want to abort, we can use the sleep
command, which will make the computer wait for a specified amount of time until giving your prompt back. Run the below command and instead of waiting for the full 60 seconds, press Ctrl + C to get your prompt back sooner!
Or, use Ctrl + C after running this example of an incomplete command (an opening parenthesis (
):
Note that even on Macs, you should use Ctrl instead of switching them out for Cmd as you may be used to doing (in some cases, like copy/paste, both work).
Shortcut | Function |
---|---|
Tab | Tab completion |
⇧ / ⇩ | Cycle through previously issued commands |
Ctrl(+Shift)+C | Copy selected text |
Ctrl(+Shift)+V | Paste text from clipboard |
Ctrl+A / Ctrl+E | Go to beginning/end of line |
Ctrl+U /Ctrl+K | Cut from cursor to beginning / end of line |
Ctrl+W | |
Ctrl+Y | Paste (“yank”) text that was cut with one of the shortcuts above |
Alt+. / Esc+. | Retrieve last argument of previous command (very useful!) (Esc+. for Mac) |
Ctrl+R | Search history: press Ctrl+R again to cycle through matches, Enter to put command in prompt. |
Ctrl+C | Cancel (kill/stop/abort) currently active command |
Ctrl+D | Exit (a program or the shell, depending on the context) (same as exit command) |
Ctrl+L | Clear the screen (same as clear command) |
You may be familiar with the concept of variables from previous experience with perhaps R or another language. Variables can hold values and other pieces of data and are essential in programming.
Assigning and printing the value of a variable in R:
Assigning and printing the value of a variable in the Unix shell:
5
=
in x=5
.$
prefix to reference (but not to assign) variables in the shell5.echo
command, a general command to print text, to print the value of $x
(cf. in R).By the way, echo
can also print literal text (as shown below) or combinations of literals and variables (next exercise):
Welcome to PLNTPTH 6193
Environment variables are pre-existing variables that have been automatically assigned values. Two examples:
/users/PAS0471/jelmer
jelmer
Using an environment variable, print “Hello, my name is <your username>” (e.g. “Hello, my name is natalie”).
Our base OSC directory for the course is the /fs/ess/PAS2700
dir we are currently in. Now, let’s all create our own subdir in here, and get the data from the CSB book.
Create a directory for yourself using the mkdir
(make dir) command:
Move there using cd
:
Get the files associated with the CSB book by “cloning” (downloading) its Git repository:
Move into the sandbox dir for the Unix chapter (remember to use tab completion):
Absolute (full) paths (e.g. /fs/ess/PAS2700
)
Paths that begin with a /
always start from the computer’s root directory, and are called “absolute paths”.
(They are equivalent to GPS coordinates for a geographical location, as they work regardless of where you are).
Relative paths (e.g. CSB/unix/sandbox
)
Paths that instead start from your current working directory are called “relative paths”.
(These work like directions along the lines of “take the second left:” they depend on your current location.)
Next week, we’ll talk more about the distinction between absolute and relative paths, and their respective merits.
~
(a tilde) — represents your Home directory. For example, cd ~
moves you to your Home dir..
(a single period) — represents the current working directory (we’ll soon see how why that can be useful)...
(two periods) — Represents the directory “one level up”, i.e. towards the computer’s root dir.Some examples of ..
:
Use ..
to go up one level in the dir hierarchy:
/fs/ess/PAS2700/users/jelmer/CSB/unix/sandbox
/fs/ess/PAS2700/users/jelmer/CSB/unix
This pattern can be continued all the way to the root of the computer, so ../..
means two levels up:
/fs/ess/PAS2700/users/jelmer
These are general shell shortcuts that work with any command that accepts a path/file name.
/fs/ess/PAS2700/users/$USER/CSB/unix/sandbox
dir...
) to move into the /fs/ess/PAS2700/users/$USER/CSB/unix/data
dir.You may have done this in two steps, because you may not have realized that you can “add to” a path after ..
like we did above. So you may have done this:
That’s OK, but is obviously more typing.
ls
command lists files and dirs, and accepts one or more paths as arguments. Use ls
to list the files in your Home dir with a shortcut and without moving there.ls
to list filesThe ls
command, short for “list”, will list files and directories — by default those in your current working dir:
Buzzard2015_about.txt Gesquiere2011_about.txt Marra2014_about.txt miRNA Pacifici2013_data.csv Saavedra2013_about.txt
Buzzard2015_data.csv Gesquiere2011_data.csv Marra2014_data.fasta Pacifici2013_about.txt Saavedra2013
ls
output colors (click to expand)
The ls
output above does not show the different colors you should see in your shell — the most common ones are:
miRNA
and Saavedra2013
above)For which dir ls
lists its contents can be changed with arguments, and how it shows the output can be changed with options. For example, we can call ls
with the option -l
(lowercase L):
total 1793
-rw-rw----+ 1 jelmer PAS0471 562 Feb 24 20:30 Buzzard2015_about.txt
-rw-rw----+ 1 jelmer PAS0471 39058 Feb 24 20:30 Buzzard2015_data.csv
-rw-rw----+ 1 jelmer PAS0471 447 Feb 24 20:30 Gesquiere2011_about.txt
-rw-rw----+ 1 jelmer PAS0471 38025 Feb 24 20:30 Gesquiere2011_data.csv
-rw-rw----+ 1 jelmer PAS0471 756 Feb 24 20:30 Marra2014_about.txt
-rw-rw----+ 1 jelmer PAS0471 566026 Feb 24 20:30 Marra2014_data.fasta
drwxrwx---+ 2 jelmer PAS0471 4096 Feb 24 20:30 miRNA
-rw-rw----+ 1 jelmer PAS0471 520 Feb 24 20:30 Pacifici2013_about.txt
-rw-rw----+ 1 jelmer PAS0471 1076150 Feb 24 20:30 Pacifici2013_data.csv
drwxrwx---+ 2 jelmer PAS0471 4096 Feb 24 20:30 Saavedra2013
-rw-rw----+ 1 jelmer PAS0471 322 Feb 24 20:30 Saavedra2013_about.txt
It lists the same items as earlier, but printed in a different format: one item per line, with additional information such as the date and time each file was last modified, and file sizes in bytes (to the left of the date).
Let’s add another option, -h
:
total 1.8M
-rw-rw----+ 1 jelmer PAS0471 562 Feb 24 20:30 Buzzard2015_about.txt
-rw-rw----+ 1 jelmer PAS0471 39K Feb 24 20:30 Buzzard2015_data.csv
-rw-rw----+ 1 jelmer PAS0471 447 Feb 24 20:30 Gesquiere2011_about.txt
-rw-rw----+ 1 jelmer PAS0471 38K Feb 24 20:30 Gesquiere2011_data.csv
-rw-rw----+ 1 jelmer PAS0471 756 Feb 24 20:30 Marra2014_about.txt
-rw-rw----+ 1 jelmer PAS0471 553K Feb 24 20:30 Marra2014_data.fasta
drwxrwx---+ 2 jelmer PAS0471 4.0K Feb 24 20:30 miRNA
-rw-rw----+ 1 jelmer PAS0471 520 Feb 24 20:30 Pacifici2013_about.txt
-rw-rw----+ 1 jelmer PAS0471 1.1M Feb 24 20:30 Pacifici2013_data.csv
drwxrwx---+ 2 jelmer PAS0471 4.0K Feb 24 20:30 Saavedra2013
-rw-rw----+ 1 jelmer PAS0471 322 Feb 24 20:30 Saavedra2013_about.txt
The only difference is in the format of the column reporting the sizes of the items listed.
We now have “Human-readable filesizes” (hence -h
), where sizes on the scale of kilobytes will be shown with K
s, of megabytes with M
s, and of gigabytes with G
s. That can be really useful especially for very large files.
We can also list files in directories other than the one we are in, by specifying that dir as an argument:
ggo_miR.fasta hsa_miR.fasta miR_about.txt miRNA_about.txt ppa_miR.fasta ppy_miR.fasta ptr_miR.fasta ssy_miR.fasta
And like we saw with cal
, we can combine options and arguments:
total 320K
-rw-rw----+ 1 jelmer PAS0471 18K Feb 24 20:30 ggo_miR.fasta
-rw-rw----+ 1 jelmer PAS0471 131K Feb 24 20:30 hsa_miR.fasta
-rw-rw----+ 1 jelmer PAS0471 104 Feb 24 20:30 miR_about.txt
-rw-rw----+ 1 jelmer PAS0471 104 Feb 24 20:30 miRNA_about.txt
-rw-rw----+ 1 jelmer PAS0471 4.0K Feb 24 20:30 ppa_miR.fasta
-rw-rw----+ 1 jelmer PAS0471 33K Feb 24 20:30 ppy_miR.fasta
-rw-rw----+ 1 jelmer PAS0471 29K Feb 24 20:30 ptr_miR.fasta
-rw-rw----+ 1 jelmer PAS0471 495 Feb 24 20:30 ssy_miR.fasta
Let’s move into the sandbox
dir in preparation for the next sections:
Papers and reviews
cp
to copy filesThe cp
command copies files and/or dirs from one location to another. It has two required arguments: what you want to copy (source), and where you want to copy it to (destination). Its basic syntax is cp <source> <destination>
.
For example, to copy a file to our current working directory using the .
shortcut, keeping the original file name:
We can also copy using a new name for the copy:
cp
will by default refuse to copy directories and their contents — that is, it is not “recursive” by default. The -r
option is needed for recursive copying:
Check the contents of the sandbox
dir again now that we’ve copied several items there:
buzz2.txt Buzzard2015_about.txt data Papers and reviews
mv
to move and rename filesUse mv
both to move and rename files (this is fundamentally the same operation):
Unlike cp
, mv
is recursive by default, so you won’t need the -r
option.
mv
and cp
commands will by default:
-v
option for verbose to make them report what they do).-i
option for interactive to make them ask before overwriting).rm
to remove files and dirsThe rm
command removes files and optionally dirs — here, we’ll remove the file copy we made above:
Like with cp
, the -r
option is needed to make the command work recursively:
# First we create 3 levels of dirs - we need `-p` to make mkdir work recursively:
mkdir -p d1/d2/d3
# Then we try to remove the d1 dir - which fails:
rm d1
rm: cannot remove ‘d1’: Is a directory
rm
with caution! (Click to expand)
rm -r
can be very dangerous — for example rm -r /
would at least attempt to remove the entire contents of the computer, including the operating system.
A couple ways to take precautions:
-i
option, which will have you confirm each individual removal (can be tedious)rmdir
command which will do just (and only) that — that way, if the dir isn’t empty after all, you’ll get an error.cp
and mv
behaviorFor both cp
and mv
, when operating on files (and this works equivalently for dirs):
With that in mind, try to answer the following questions about this command:
Because we put a trailing forward slash in more_data/
, we are making clear that we are referring to a directory. So the file should be copied into a dir more_data
, and keep the same file name.
However, the more_data/
dir does not exist, and cp
will not create a dir on the fly, so this will fail:
cp: cannot create regular file ‘more_data/’: Not a directory
If we had omitted the trailing forward slash, we would have created a copy of the file with file name more_data
(note that no file extension is needed, per se).
P.S: To make the original intention work, first create the destination dir:
Note also that once the more_data
dir exists, it does not make a difference whether or not you using a trailing slash (!).
cat
The cat
command will print the entire contents of (a) file(s) to screen:
Data published by:
Marra NJ, DeWoody JA (2014) Transcriptomic characterization of the immunogenetic repertoires of heteromyid rodents. BMC Genomics 15: 929. http://dx.doi.org/10.1186/1471-2164-15-929
Data description:
File D_spec_spleen_filtered.fasta (57.01Mb) contains Dipodomys spectabilis spleen transcriptome data. Combined assembly of 454 reads and fragmented Illumina assembly (see methods of associated paper) in gsAssembler version 2.6.
Note that we truncated the original file to 1% of its original size and named it Marra2014_data.txt
Data taken from:
Marra NJ, DeWoody JA (2014) Data from: Transcriptomic characterization of the immunogenetic repertoires of heteromyid rodents. Dryad Digital Repository. http://dx.doi.org/10.5061/dryad.qn474
head
and tail
The head
and tail
commands will print the first or last lines of a file:
head
& tail
’s defaults are to print 10 lines:
maleID GC T
1 66.9 64.57
1 51.09 35.57
1 65.89 114.28
1 80.88 137.81
1 32.65 59.94
1 60.52 101.83
1 65.89 65.84
1 52.72 43.98
1 84.85 102.31
Use the -n
option to specify the number of lines to print:
maleID GC T
1 66.9 64.57
1 51.09 35.57
A neat trick with tail
is to start at a specific line, often used to skip the header line, like in this example6:
1 66.9 64.57
1 51.09 35.57
1 65.89 114.28
1 80.88 137.81
1 32.65 59.94
1 60.52 101.83
1 65.89 65.84
1 52.72 43.98
1 84.85 102.31
1 98.25 149.61
[...output truncated...]
wc -l
The wc
command is different from the previous commands, which all printed file contents. By default, wc
will count lines, words, and characters — but it is most commonly used to only count lines, with the -l
option:
9 Marra2014_about.txt
Start by moving back to the sandbox
dir:
The regular output of a command is also called “standard out” (“stdout”). As we’ve seen many times now, such output is by default printed to screen, but it can alternatively be “redirected”, such as into a file.
With “>
”, we redirect output to a file:
First, let’s remind ourselves what echo
does without redirection:
My first line
Now, let’s redirect to a new file test.txt
— no output is printed, as it went into the file:
My first line
Let’s redirect another line into that same file:
My second line
That may not have been what we intended! As explained above, the file was overwritten.
With “>>
”, however, we append the output to a file:
My second line
My third line
Recall from today’s previous examples that a file name can be given as an argument to many commands — for example, see the following sequence of commands:
# First we redirect the ls output to a file
ls ../data/Saavedra2013 > filelist.txt
# Let's check what that looks like:
head -n 5 filelist.txt
n10.txt
n11.txt
n12.txt
n13.txt
n14.txt
# Then we count the nr. of lines, which is the number of files+dirs in Saavedra2013:
wc -l filelist.txt
59 filelist.txt
However, most commands also accept input from so-called “standard input” (stdin) using the pipe, “|
”:
59
When we use the pipe, the output of the command on the left-hand side (a file listing, in this case) is no longer printed to screen but is redirected into the wc
command, which will gladly accept input that way instead of via an argument with a file name like in the earlier example.
Pipes are useful because they avoid having to write/read intermediate files — this saves typing, makes the operation quicker, and reduces file clobber.
cut
We’ll now turn to some commands that may be described as Unix “data tools”: cut
, sort
, uniq
, tr
, and grep
.
Most of the examples will use the file Pacifici2013_data.csv
, so let’s have a look at the contents of that file first:
TaxID;Order;Family;Genus;Scientific_name;AdultBodyMass_g;Sources_AdultBodyMass;Max_longevity_d;Sources_Max_longevity;Rspan_d;AFR_d;Data_AFR;Calculated_GL_d;GenerationLength_d;Sources_GL
7580;Rodentia;Cricetidae;Eligmodontia;Eligmodontia typus;17.37;PanTHERIA;292;PanTHERIA;254.64;73.74;calculated;147.5856;147.5856;Rspan-AFR(SM+Gest)
42632;Rodentia;Cricetidae;Microtus;Microtus oregoni;20.35;PanTHERIA;456.25;PanTHERIA;445.85;58.06;calculated;187.3565;187.3565;Rspan-AFR(SM+Gest)
In this course, we’ll be working almost exclusively with so-called “plain-text” files. These are simple files that can be opened by any text editor and Unix shell tool, as opposed to more complex “binary” formats like Excel or Word files. Almost all common genomics file formats are plain-text.
“Tabular” files contain data that is arranged in a rows-and-columns format, like a table or an Excel worksheet.
Plain-text tabular files, like Pacifici2013_data.csv
, contain a specific “delimiter” to delimit columns — most commonly:
.tsv
extension for Tab-Separated Values (“TSV file”)..csv
extension for Comma-Separated Values (“CSV file”).Pacifici2013_data.csv
file appear to contain? (Click for the answer)
;
— “even though” the file has a .csv
extension.
The cut
command will select/“cut out” one or more columns from a tabular file:
-f
option to specify the desired column(s) / “field”(s).-d
.TaxID
7580
42632
42653
42662
16652
[...output truncated...]
That worked, but a ton of output was printed, and we may find ourselves scrolling to the top to see the first few lines – in many cases, it can be useful to pipe the output to head
to see if our command works:
TaxID
7580
42632
cut
(Click to expand)
To select multiple columns, use a range or comma-delimited list:
TaxID;Order;Family;Genus
7580;Rodentia;Cricetidae;Eligmodontia
42632;Rodentia;Cricetidae;Microtus
Order;Max_longevity_d
Rodentia;292
Rodentia;456.25
cut
, sort
, and uniq
to create a listLet’s say we want an alphabetically sorted list of animal orders from the Pacifici2013_data.csv
file. To do this, we’ll need two new commands:
sort
to sort/order/arrange rows, by default in alphanumeric order.uniq
to remove duplicates (i.e., keep all distinct) entries from a sorted file/listWe’ll build up a small “pipeline” to do this, step-by-step and piping the output into head
every time. First, we get rid of the header line with our tail
trick:
7580;Rodentia;Cricetidae;Eligmodontia;Eligmodontia typus;17.37;PanTHERIA;292;PanTHERIA;254.64;73.74;calculated;147.5856;147.5856;Rspan-AFR(SM+Gest)
42632;Rodentia;Cricetidae;Microtus;Microtus oregoni;20.35;PanTHERIA;456.25;PanTHERIA;445.85;58.06;calculated;187.3565;187.3565;Rspan-AFR(SM+Gest)
42653;Rodentia;Cricetidae;Peromyscus;Peromyscus gossypinus;27.68;PanTHERIA;471.45833335;PanTHERIA;444.87833335;72.58;calculated;201.59471667;201.5947166715;Rspan-AFR(SM+Gest)
42662;Macroscelidea;Macroscelididae;Elephantulus;Elephantulus myurus;59.51;PanTHERIA;401.5;PanTHERIA;412.34;90.48;calculated;210.0586;210.0586;Rspan-AFR(SM+Gest)
16652;Rodentia;Cricetidae;Peromyscus;Peromyscus boylii;23.9;PanTHERIA;547.5;PanTHERIA;514.13;79.97;calculated;229.0677;229.0677;Rspan-AFR(SM+Gest)
Second, we select our column of interest with cut
:
Rodentia
Rodentia
Rodentia
Macroscelidea
Rodentia
Third, we pipe to sort
to sort the result:
Afrosoricida
Afrosoricida
Afrosoricida
Afrosoricida
Afrosoricida
Fourth and finally, we use uniq
to only keep unique rows (values):
Afrosoricida
Carnivora
Cetartiodactyla
Chiroptera
Cingulata
Dasyuromorphia
Dermoptera
Didelphimorphia
[...output truncated...]
uniq -c
(Click to expand)
With a very small modification to our pipeline, using uniq
’s -c
option (for count), we can generate a “count table” instead of a simple list:
54 Afrosoricida
280 Carnivora
325 Cetartiodactyla
1144 Chiroptera
21 Cingulata
[...output truncated...]
grep
to print lines matching a patternThe grep
command is useful to find specific text or patterns in a file. By default, it will print each line that contains a “match” in full. It’s basic syntax is grep "<pattern>" file
.
For example, this will print all lines from Pacifici2013_data.csv
that contain “Vombatidae”:
40555;Diprotodontia;Vombatidae;Lasiorhinus;Lasiorhinus latifrons;26163.8;PanTHERIA;9928;"PanTHERIA;AnAge";9317.5;652.24;calculated;3354.315;3354.315;Rspan-AFR(SM+Gest)
40556;Diprotodontia;Vombatidae;Vombatus;Vombatus ursinus;26000;PanTHERIA;10238.25;"PanTHERIA;AnAge";9511.6;783.65;calculated;3542.014;3542.014;Rspan-AFR(SM+Gest)
11343;Diprotodontia;Vombatidae;Lasiorhinus;Lasiorhinus krefftii;31849.99;PanTHERIA;10950;"PanTHERIA;AnAge";no information;no information;no information;no information;3354.315;Mean_congenerics_same_body_mass
Instead of printing matching lines, we can also count them with the -c
option:
1144
The option -v
inverts grep
’s behavior and will print all lines not matching the pattern — here, we’ll combine -v
and -c
to count the number of lines that do not contain “Vombatidae”:
5424
grep
tips
Vombat
” matches Vombatidae
.grep
has many other useful options — more in Week 4:
-i
to ignore case-r
to search files recursively-w
to match “words”tr
tr
for tr
anslate will substitute characters – here, any a
for a b
:
bbbbbbb
Oddly enough tr
does not take a file name as an argument, so how can we provide it with input from a file? The easiest way is by using a pipe (note that the book also shows another method):
TaxID Order Family Genus Scientific_name AdultBodyMass_g Sources_AdultBodyMass Max_longevity_d Sources_Max_longevity Rspan_d AFR_d Data_AFR Calculated_GL_d GenerationLength_d Sources_GL
7580 Rodentia Cricetidae Eligmodontia Eligmodontia typus 17.37 PanTHERIA 292 PanTHERIA 254.64 73.74 calculated 147.5856 147.5856 Rspan-AFR(SM+Gest)
42632 Rodentia Cricetidae Microtus Microtus oregoni 20.35 PanTHERIA 456.25 PanTHERIA 445.85 58.06 calculated 187.3565 187.3565 Rspan-AFR(SM+Gest)
The example above converted the ;
delimited to a Tab
(i.e., a CSV file to a TSV file), where \t
is a regular expression meaning Tab8. (Though note that we didn’t modify the original file nor saved the output in a new file.)
tr
(Click to expand)
tr
also works well with multiple values and replacements (Click to expand)
tr
outputModify our command in which we changed the delimiter to a Tab to redirect the output to a new file, Pacifici2013_data.tsv
(note the extension) in the sandbox
dir (that’s not where you are located yourself).
less
pager to view filesfind
command to find filesfor
loops$PATH
and bash profile settingsWe’ll cover all of these in class over the next few weeks.
This is the Unix philosophy: Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.
— Doug McIlory
Advantages of a modular approach:
Text “streams”?
Rather than loading entire files into memory, process them one line at a time. Very useful with large files!
# This command would combine all files in the working dir ending in `.fa`
# (i.e. FASTA files) into a single file -- even if that's multiple GBs,
# this will not be a heavy lift at all!
cat *.fa > combined.fa
In all course weeks, we will be working in the Unix shell, though our focus in several cases will be on a specific tool, such as Git in week 3.
In week 4, we’ll fully focus on the shell and shell scripting.
Technically, the latter terms are more correct, as Unix formally does refer to a specific operating system.↩︎
Command-line Interface (CLI), as opposed to Graphical User Interface (GUI)↩︎
Though some commands are flexible and accept either order.↩︎
There really is only one proper additional options: two others reflect the defaults, and then there’s the version option.↩︎
Because of this, anytime you see a word/string that starts with a $
in the shell, you can safely assume that it is a variable.↩︎
We’ll see this in action in a bit↩︎
Even though for “literal string” like in the examples above, as opposed to regular expressions, this is not strictly necessary, it’s good habit to always quote.↩︎
We’ll learn more about regular expressions in Week 4.↩︎