About Software

General philosophy

We strongly support open source and freely share our software tools with others.

Some comments on commonly used software

Computational work is done almost exclusively on Linux systems (my laptop currently runs Ubuntu 16.04). It’s fine to start with data on Macs or PCs (your data is in a bunch of Word docs and Excel files?!), but to process this stuff systematically and efficiently, we’re likely to move it to Linux.

Preferred languages are Python for data analysis and Bash for scripting. For interactive data exploration, visualization and modeling (e.g. scikit-learn), running Python in Jupyter notebooks works great. Some applications (e.g. connecting to NCBI databases) work well with Perl, and C is sometimes used for compute-intensive things. When required, we can also work with other languages (e.g. Ruby, JavaScript, PHP, Visual Basic), for example to validate, document and/or port algorithm code.

Bioinformatics design and analysis work frequently employs a set of command-line C programs, maintained together as “vertools” (see below). In addition, sequence alignment and processing tools like BLAST, bedtools, samtools and picard are often used. For multi-step processes (e.g. design of indexes, generation of feature tables for machine learning), “wrapper scripts” or “pipelines” that string together software tools for end-to-end data processing are often built. Depending on the scale and frequency of tasks, the effort spent writing such scripts is often well worth it. Besides automating tasks, scripts provide consistency (e.g. they remove sources of human error) and can function as “working documentation” for the processes they address.
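To make the "wrapper script" idea concrete, here is a minimal Python sketch of a pipeline runner that feeds each step's output into the next and stops on the first failure. The function name `run_pipeline` and the two placeholder steps are assumptions for illustration only; a real pipeline would substitute actual tool commands and their documented options.

```python
import shlex
import subprocess
import sys


def run_pipeline(steps, input_text=""):
    """Run commands in order, passing each step's stdout to the next step's stdin."""
    data = input_text
    for cmd in steps:
        # check=True raises CalledProcessError on a non-zero exit status,
        # so a failed step halts the pipeline instead of passing bad data on.
        result = subprocess.run(
            cmd, input=data, capture_output=True, text=True, check=True
        )
        # Logging each command to stderr gives a record of what was run.
        print("ran:", shlex.join(cmd), file=sys.stderr)
        data = result.stdout
    return data


if __name__ == "__main__":
    # Placeholder steps (hypothetical): inline Python one-liners stand in
    # for real command-line tools so the sketch runs anywhere.
    to_upper = [sys.executable, "-c",
                "import sys; sys.stdout.write(sys.stdin.read().upper())"]
    dedup = [sys.executable, "-c",
             "import sys; sys.stdout.writelines(sorted(set(sys.stdin)))"]
    print(run_pipeline([to_upper, dedup], "acgt\nttga\nacgt\n"), end="")
```

Because the list of steps is plain data, the script itself records exactly which commands ran and in what order, which is the "working documentation" benefit described above.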

Public web tools and databases are, of course, used as needed. The number of useful bioinformatics resources online is phenomenal. For example, the UCSC genome site (go Banana Slugs!) has all kinds of useful data and tools in addition to the browser.

Vertools software collection

Vertools is a collection of command-line C programs useful for various sequence and data processing tasks. The source code can be found on GitHub: https://github.com/ryantkoehler/vertools

The code is more or less mature and trusted, though by no means “finished”. The repository has only minimal documentation (this is on my to-do list! … or if anyone would like to volunteer?), so it’s probably not too clear what the various tools are good for. The table below provides a quick summary.

Program | Summary | Used for
alphcont | DNA alphabet content reporting. | Getting base composition values (e.g. %GC, runs of bases). Numbers can be used for filtering or as features for sequence analysis.
blastout | BLAST output parser. | Extracting seqs, coords, scores. Compiling alignments into metrics (e.g. specificity screens).
comp_seq | Comparing sequences using simple alignments. | Evaluating various similarity, complementarity metrics between primers, probes, etc. Used for oligo design and QC.
dna_util | Sequence slicing and dicing. Includes complexity metrics. | Sequence manipulations (e.g. rev comp, reformatting, masking, probe tiling, etc.). Complexity metrics useful for filtering and sequence analysis.
filter | Simple line-based numerical filter. | Extracting subsets of sequence or data files with specific values.
gen_seq | Generate sequences. | Constructing enumerated or random sequence sets, with optional composition constraints. Sequences can serve as starting points for probe or index designs.
numstat | Simple line-based numerical stats tool. | Quick evaluation of numbers (e.g. mean, sd, percentiles, etc.). Includes text-based histogram plotting.
pick_seq | Select sequence subsets, minimizing pairwise metrics. | Selecting orthogonal sequence sets, for example indexes or hybridization probes. Can separately weight similarity, complementarity metrics. Simple or Hamming alignments.
scoretab | Data table and number manipulations. | Manipulating data tables (e.g. bounding, scaling, subset extraction, etc.). Generating scores from values (e.g. map oligo Tm to range 0-1 via transfer function). Includes genetic algorithm operators.
seqtweak | Modify input sequences. | Introducing "mutations" into supplied sequences. Can randomly or systematically introduce SNVs and/or InDels. This is useful for simulations and some design tasks.
tm_util | Thermodynamic calculator, including extraction of sub-seqs. | Calculates Tm, dG, fraction hybridized using various nearest neighbor models. Useful for extracting candidate oligos with specific thermodynamic properties (e.g. length 15-25 and Tm 60-62).
venpipe | Wrapper for secondary structure and energy calculations. | Wraps Vienna package folding code, reporting oligo structure(s) and energy(s). Used for oligo design and QC.
wordfreq | Tallies DNA words (n-mers). Includes word-based scoring. | Tallying n-mers. Frequency tables derived from genomes, sequence libraries, etc. are useful to investigate uniformity or biases, and may be used to score novel sequences.
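For readers unfamiliar with the metrics named in the table, the following Python sketch illustrates a few of the underlying ideas: base composition (%GC), reverse complement, n-mer tallying and Hamming distance. This is not the vertools code and does not reproduce its options or output; the function names are hypothetical, and the real tools are C programs with their own behavior.

```python
from collections import Counter

# Translation table for complementing unambiguous DNA bases (case preserved).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def gc_fraction(seq):
    """Fraction of G/C among the A, C, G, T bases; other characters are ignored."""
    s = seq.upper()
    counted = sum(s.count(base) for base in "ACGT")
    return (s.count("G") + s.count("C")) / counted if counted else 0.0


def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_words(seq, n):
    """Tally overlapping n-mers ("words") in a sequence."""
    s = seq.upper()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def hamming_distance(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    print(gc_fraction("ACGT"))              # 0.5
    print(reverse_complement("AACG"))       # CGTT
    print(count_words("ACGACG", 3))
    print(hamming_distance("ACGT", "ACGA")) # 1
```

Values like these can serve as filters or as features for sequence analysis, in the spirit of the "Used for" column above.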