Wildcards and "Argument list too long"
OS X and Linux systems limit the number of arguments that can be supplied to a command (more technically, the limit is on the total length of the arguments), so a wildcard that expands to too many files fails with an "Argument list too long" error.
In general, it’s best to be as restrictive as possible with wildcards, which also protects against accidental matches. Shell wildcards allow you to match specific characters or ranges of characters. For example, we could match the characters U, V, W, X, and Y with either [UVWXY] or [U-Y] (both are equivalent). Returning to our example, we could exclude the C sample using either:
$ ls zmays[AB]_R1.fastq
zmaysA_R1.fastq zmaysB_R1.fastq
$ ls zmays[A-B]_R1.fastq
zmaysA_R1.fastq zmaysB_R1.fastq
There’s one very important caveat: ranges operate on character ranges, not numeric ranges like 13 through 30. This means that a wildcard like snps_[10-13].txt will not match the files snps_10.txt, snps_11.txt, snps_12.txt, and snps_13.txt.
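To see why, note that [10-13] is read as a set of single characters: 1, the range 0-1, and 3. The following is a minimal sketch, assuming Bash and a scratch directory; the file names are made up for illustration:

```shell
#!/usr/bin/env bash
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Illustrative files: some single-digit names, some two-digit names.
touch snps_0.txt snps_1.txt snps_3.txt
touch snps_10.txt snps_11.txt snps_12.txt snps_13.txt

# [10-13] matches exactly ONE character out of the set {0, 1, 3},
# so only the single-digit files match.
matched=$(echo snps_[10-13].txt)
echo "$matched"   # prints: snps_0.txt snps_1.txt snps_3.txt
```

None of the two-digit files are matched, because the bracket expression never consumes more than one character.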
While wildcard matching and brace expansion may seem to behave similarly, they are slightly different. Wildcards expand only to existing files that match them, whereas brace expansions always expand, regardless of whether the corresponding files or directories exist.
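The difference can be sketched as follows, assuming Bash (numeric sequences like {10..13} are a Bash feature, not part of every shell) and a scratch directory in which two of the four hypothetical files are deliberately missing:

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Only two of the four files exist; snps_11.txt and snps_13.txt are absent.
touch snps_10.txt snps_12.txt

# Brace expansion generates every name, whether or not the file exists.
braces=$(echo snps_{10..13}.txt)
echo "$braces"    # prints: snps_10.txt snps_11.txt snps_12.txt snps_13.txt

# A wildcard expands only to the files that are actually present.
globbed=$(echo snps_1?.txt)
echo "$globbed"   # prints: snps_10.txt snps_12.txt
```

In practice this means a command such as ls given a brace expansion can report "No such file or directory" for the names that do not exist, whereas the wildcard silently skips them.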
Common Unix filename wildcards
Wildcard | What it matches |
---|---|
* | Zero or more characters (but ignores hidden files starting with a period) |
? | Exactly one character (also ignores hidden files) |
[A-Z] | Any single character within the supplied alphanumeric range |
Leading Zeros and Sorting
Use leading zeros (e.g., file-0021.txt rather than file-21.txt) when naming files. This is useful because lexicographically sorting files (as ls does) then leads to the correct ordering. Leading zeros aren’t just useful in filenames; this is also the best way to name genes, transcripts, and so on. Projects like Ensembl use this scheme in naming their genes (e.g., ENSG00000164256).
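A small sketch of the effect, assuming Bash; the file names are hypothetical, and LC_ALL=C is set so that sort compares plain byte values regardless of the system locale:

```shell
#!/usr/bin/env bash
# Without padding, lexicographic order puts 10 before 2.
unpadded=$(printf 'file-%d.txt\n' 2 10 21 | LC_ALL=C sort)
echo "$unpadded"
# file-10.txt
# file-2.txt
# file-21.txt

# With zero padding (%04d), lexicographic order matches numeric order.
padded=$(printf 'file-%04d.txt\n' 2 10 21 | LC_ALL=C sort)
echo "$padded"
# file-0002.txt
# file-0010.txt
# file-0021.txt
```

The printf format %04d pads each number to four digits, so it is a convenient way to generate padded names in a loop.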
Markdown for Project Notebooks
It’s very important to keep a project notebook containing detailed information about the chronology of your computational work, steps you’ve taken, information about why you’ve made decisions, and of course all pertinent information to reproduce your work. Plain text is a future-proof format. Additionally, plain-text project notebooks can also be put under version control.
Markdown is a lightweight markup language: a plain-text format that is easy to read, can be painlessly incorporated into typed notes, and can also be rendered to HTML or PDF.
Markdown Formatting Basics
Markdown’s main features: text can be broken down into hierarchical sections, there’s syntax for both code blocks and inline code, and it’s easy to embed links and images.
John Gruber’s full Markdown syntax specification is available on his website: https://daringfireball.net/projects/markdown/syntax
Here is a basic Markdown document illustrating the format:
Markdown Format
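As a rough illustration of these features, the sketch below writes a small notebook file from the shell. Everything in it is a placeholder invented for this example (the project title, the data/seqs/ path, and the steps); it is not the original example document.

```shell
#!/usr/bin/env bash
# Write a small, made-up notebook entry to a temporary location.
notebook=$(mktemp -d)/notebook.md

# The quoted 'EOF' keeps the shell from interpreting backticks and asterisks.
cat > "$notebook" <<'EOF'
# *Zea mays* SNP Calling

## Data

Sequencing reads are in `data/seqs/`, for example:

    zmaysA_R1.fastq
    zmaysB_R1.fastq

## Steps taken

1. Checked read quality.
2. Aligned reads to the reference.

Syntax reference: [Markdown syntax](https://daringfireball.net/projects/markdown/syntax)
EOF

cat "$notebook"
```

Lines starting with # and ## form the hierarchical sections, backticks mark inline code, a four-space indent marks a code block, and the [text](URL) form embeds a link.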
Using Pandoc to Render Markdown to HTML
We’ll use Pandoc (http://johnmacfarlane.net/pandoc/), a popular document converter, to render our Markdown documents to valid HTML. These HTML files can then be shared with collaborators or hosted on a website. See the Pandoc installation page (http://bit.ly/pan-install) for instructions on how to install Pandoc on your system.
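A minimal sketch of the conversion, assuming Pandoc is installed and on your PATH. The names notebook.md and notebook.html are placeholders, and the block skips the conversion when pandoc is not found:

```shell
#!/usr/bin/env bash
# Create a tiny placeholder Markdown file in a scratch directory.
outdir=$(mktemp -d)
printf '# Project notebook\n\nA first entry.\n' > "$outdir/notebook.md"

# Convert Markdown to HTML only if pandoc is available.
if command -v pandoc >/dev/null 2>&1; then
    pandoc --from markdown --to html "$outdir/notebook.md" > "$outdir/notebook.html"
    echo "Wrote $outdir/notebook.html"
else
    echo "pandoc not found; install it first" >&2
fi
```

By default this produces an HTML fragment; adding the --standalone option asks Pandoc for a complete HTML document with a header, which is usually what you want when hosting the file on a website.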