We strongly support open source and freely share our software tools with others.
Some comments on commonly used software
Computational work is done almost exclusively on Linux systems (my laptop currently runs Ubuntu 16.04). It’s fine to start with data on Macs or PCs (your data is in a bunch of Word docs and Excel files?!), but to process this stuff systematically and efficiently, we’re likely to move it to Linux.
Bioinformatics design and analysis work frequently employs a set of command-line C programs, maintained together as “vertools” (see below). In addition, sequence alignment and processing tools like BLAST, bedtools, samtools, picard, etc., are often used. For multi-step processes (e.g. design of indexes, generation of feature tables for machine learning, etc.), “wrapper scripts” or “pipelines” that string together software tools for end-to-end data processing are often built. Depending on the scale and frequency of tasks, the effort spent writing such scripts is often well worth it; besides automating tasks, scripts provide consistency (e.g. by removing sources of human error) and can function as “working documentation” for the processes they address.
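As a rough illustration of the wrapper-script idea, here is a minimal Python sketch that chains command-line steps, feeding each step's output to the next and stopping at the first failure. Everything in it is an assumption made for illustration: `run_pipeline`, `UPPERCASE`, `DROP_SHORT`, and `STEPS` are hypothetical names, and the two steps are small Python one-liners standing in for real tools, since actual tool invocations and options depend on the task.

```python
import subprocess
import sys


def run_pipeline(steps, input_text):
    """Run each command in `steps` in order, piping stdout to the next stdin.

    check=True raises subprocess.CalledProcessError as soon as a step exits
    non-zero, so a failed step cannot silently feed bad data downstream.
    """
    data = input_text
    for cmd in steps:
        result = subprocess.run(
            cmd, input=data, capture_output=True, text=True, check=True
        )
        data = result.stdout
    return data


# Hypothetical stand-in steps (NOT real bioinformatics tools): each is a
# Python one-liner that reads stdin and writes stdout, like a typical filter.
UPPERCASE = [
    sys.executable, "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]
DROP_SHORT = [
    sys.executable, "-c",
    "import sys; sys.stdout.writelines("
    "l for l in sys.stdin if len(l.strip()) >= 8)",
]

# The ordered list of steps doubles as a readable record of the process.
STEPS = [UPPERCASE, DROP_SHORT]
```

Calling `run_pipeline(STEPS, "acgtacgt\nacg\n")` upper-cases both lines and then drops the one shorter than 8 characters. In a real pipeline each entry in `STEPS` would be the argument list for an actual tool.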
Public web tools and databases are, of course, used as needed. The number of useful bioinformatics resources online is phenomenal. For example, the UCSC genome site (go Banana Slugs!) has all kinds of useful data and tools in addition to the browser.
Vertools software collection
Vertools is a collection of command-line C programs useful for various sequence and data processing tasks. The source code can be found on GitHub: https://github.com/ryantkoehler/vertools
The code is more or less mature and trusted, though by no means “finished”. The repository has only minimal documentation (this is on my to-do list! … or if anyone would like to volunteer?), so it’s probably not too clear what the various tools are good for. The table below provides a quick summary.
|Program||What it does||Typical uses|
|alphcont||DNA alphabet content reporting.||Getting base composition values (e.g. %GC, runs of bases). Numbers can be used for filtering or as features for sequence analysis.|
|comp_seq||Comparing sequences using simple alignments.||Evaluating various similarity, complementarity metrics between primers, probes, etc. Used for oligo design and QC.|
|dna_util||Sequence slicing and dicing. Includes complexity metrics.||Sequence manipulations (e.g. rev comp, reformatting, masking, probe tiling, etc). Complexity metrics useful for filtering and sequence analysis.|
|filter||Simple line-based numerical filter.||Extracting subsets of sequence or data files with specific values.|
|gen_seq||Generate sequences.||Constructing enumerated or random sequence sets, with optional composition constraints. Sequences can serve as starting points for probe or index designs.|
|numstat||Simple line-based numerical stats tool.||Quick evaluation of numbers (e.g. mean, sd, percentiles, etc). Includes text-based histogram plotting.|
|pick_seq||Select sequence subsets, minimizing pairwise metrics.||Selecting orthogonal sequence sets, for example indexes or hybridization probes. Can separately weight similarity, complementarity metrics. Simple or Hamming alignments.|
|scoretab||Data table and number manipulations.||Manipulating data tables (e.g. bounding, scaling, subset extraction, etc.). Generating scores from values (e.g. map oligo Tm to range 0-1 via transfer function). Includes genetic algorithm operators.|
|seqtweak||Modify input sequences.||Introducing "mutations" into supplied sequences. Can randomly or systematically introduce SNVs and/or InDels. This is useful for simulations and some design tasks.|
|tm_util||Thermodynamic calculator, including extraction of sub-seqs.||Calculates Tm, dG, fraction hybridized using various nearest-neighbor models. Useful for extracting candidate oligos with specific thermodynamic properties (e.g. length 15-25 and Tm 60-62).|
|venpipe||Wrapper for secondary structure and energy calculations.||Wraps Vienna package folding code, reporting oligo structures and energies. Used for oligo design and QC.|
|wordfreq||Tallies DNA words (n-mers). Includes word-based scoring.||Tallying n-mers. Frequency tables derived from genomes, sequence libraries, etc. are useful to investigate uniformity or biases, and may be used to score novel sequences.|
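To make a few of the quantities in the table concrete, here is a short Python sketch of GC fraction, reverse complement, Hamming distance, and n-mer tallying. These are plain textbook definitions written for illustration only. They are not the vertools implementations, the function names are hypothetical, and the actual programs may define or report these values differently.

```python
from collections import Counter

# Translation table for complementing DNA bases (case preserved).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def gc_fraction(seq):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def reverse_complement(seq):
    """Complement each base, then reverse the result."""
    return seq.translate(_COMPLEMENT)[::-1]


def hamming_distance(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def count_words(seq, n):
    """Tally all overlapping n-mers ("words") in a sequence."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
```

For example, `gc_fraction("ACGT")` gives 0.5 and `reverse_complement("AACG")` gives `"CGTT"`. Values like these are the sort of per-sequence numbers that can then be filtered on or used as features.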