About Software

General philosophy

We strongly support open source and freely share our software tools with others.

Some comments on commonly used software

Computational work is done almost exclusively on Linux systems (my laptop currently runs Ubuntu 18.04). It’s fine to start with data on Macs or PCs (your data is in a bunch of Word docs and Excel files?!), but to process it systematically and efficiently, we’re likely to move it to Linux.

Preferred languages are Python for data analysis and Bash for scripting. For interactive data exploration, visualization and modeling (e.g. scikit-learn), running Python in Jupyter notebooks works great. Some applications (e.g. connecting to NCBI databases) work well with Perl, and C is sometimes used for compute-intensive things. When required, we can also work with other languages (e.g. Ruby, JavaScript, PHP, Visual Basic), for example to validate, document and/or port algorithm code.

Bioinformatics design and analysis work frequently employs a set of command-line C programs, maintained together as “vertools” (see below). In addition, sequence alignment and processing tools like BLAST, bedtools, samtools and picard are often used. For multi-step processes (e.g. design of indexes, generation of feature tables for machine learning), “wrapper scripts” or “pipelines” that string together software tools for end-to-end data processing are often built. Depending on the scale and frequency of tasks, the effort spent writing such scripts is often well worth it. Besides automating tasks, scripts provide consistency (e.g. they remove sources of human error) and can function as “working documentation” for the processes they address.
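As a rough illustration of the wrapper-script idea, here is a minimal Python sketch that chains command-line steps, feeding each step's output to the next. The function name `run_pipeline` and the two placeholder steps are hypothetical, made up for this example; in real use the steps would be calls to actual tools with whatever arguments they need.

```python
import subprocess
import sys


def run_pipeline(steps, input_text=""):
    """Run each command in order, passing one step's stdout to the next step's stdin."""
    data = input_text
    for cmd in steps:
        # check=True raises CalledProcessError, so the pipeline stops at the first failing step
        result = subprocess.run(
            cmd, input=data, capture_output=True, text=True, check=True
        )
        data = result.stdout
    return data


if __name__ == "__main__":
    py = sys.executable
    # Placeholder steps (hypothetical): uppercase the input, then sort and de-duplicate lines.
    steps = [
        [py, "-c", "import sys; sys.stdout.write(sys.stdin.read().upper())"],
        [py, "-c", "import sys; sys.stdout.write(''.join(sorted(set(sys.stdin.readlines()))))"],
    ]
    print(run_pipeline(steps, "acgt\nttga\nacgt\n"), end="")
```

Keeping the steps as a plain list means the script itself records what was run and in what order, which is the “working documentation” benefit mentioned above.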

Public web tools and databases are, of course, used as needed. The number of useful bioinformatics resources online is phenomenal. For example, the UCSC genome site (go Banana Slugs!) has all kinds of useful data and tools in addition to the browser.

Vertools software collection

Vertools is a collection of command-line C programs useful for various sequence and data processing tasks. The source code can be found on GitHub: https://github.com/ryantkoehler/vertools

The code is more or less mature and trusted, though by no means “finished”. The repository has only minimal documentation (this is on my to-do list! … or if anyone would like to volunteer?), so it’s probably not too clear what the various tools are good for. This table provides a quick summary.

Other shared software

Several other potentially useful software tools are also on GitHub. Here’s a short description of what these do and how they may be useful. Each repository includes brief instructions for use, along with example input files.

Command line color tools (https://github.com/ryantkoehler/comline_color_tools) is useful for highlighting things in terminal output. There are three Perl scripts to color numbers, rows / columns, or DNA sequences in various ways using ANSI color codes.
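To show what coloring with ANSI codes amounts to, here is a small Python sketch that colors the bases of a DNA sequence. It is not taken from the Perl scripts in the repository: the function name `color_dna` and the base-to-color mapping are assumptions chosen for this example.

```python
# ANSI SGR foreground color codes: 31 red, 32 green, 33 yellow, 34 blue.
# The base-to-color mapping is an arbitrary choice for this sketch.
BASE_COLORS = {"A": 32, "C": 34, "G": 33, "T": 31}
RESET = "\033[0m"


def color_dna(seq):
    """Return seq with each A/C/G/T wrapped in an ANSI color escape sequence."""
    out = []
    for base in seq:
        code = BASE_COLORS.get(base.upper())
        if code is None:
            # Leave anything else (N, gaps, whitespace) uncolored
            out.append(base)
        else:
            out.append(f"\033[{code}m{base}{RESET}")
    return "".join(out)


if __name__ == "__main__":
    print(color_dna("ACGTNNacgt"))
```

Printed to a terminal that understands ANSI escapes, each base shows up in its own color; redirected to a file, the escape codes appear as literal characters.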

Sequence feature table generator (https://github.com/ryantkoehler/seqs2ftab) is useful for data analysis and machine learning starting from DNA sequences. This Python script uses a “score definition file” to call various programs (e.g. Vertools) that report computed features of input sequences, then combines all outputs into a matrix with columns of feature values for each sequence.
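The general shape of such a feature table can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the real script calls external programs as directed by its score definition file, whereas this sketch uses three simple in-process functions, and all names here (`FEATURES`, `feature_table`, `gc_fraction`, `longest_run`) are hypothetical.

```python
def gc_fraction(seq):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def longest_run(seq):
    """Length of the longest run of a single repeated base."""
    best = run = 0
    prev = None
    for base in seq.upper():
        run = run + 1 if base == prev else 1
        best = max(best, run)
        prev = base
    return best


# Stand-in for a score definition: column name -> function computing that feature
FEATURES = {"length": len, "gc_frac": gc_fraction, "max_run": longest_run}


def feature_table(seqs):
    """Build (header, rows) with one row per sequence and one column per feature."""
    header = ["name"] + list(FEATURES)
    rows = [
        [name] + [func(seq) for func in FEATURES.values()]
        for name, seq in seqs.items()
    ]
    return header, rows


if __name__ == "__main__":
    header, rows = feature_table({"s1": "ACGT", "s2": "GGGC"})
    print("\t".join(header))
    for row in rows:
        print("\t".join(str(value) for value in row))
```

A table in this form (one row per sequence, one named column per feature) can be loaded directly into the usual Python analysis and machine learning tools.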

Real-time PCR visualization (https://github.com/ryantkoehler/rtPCR_viewer) is a GUI-based Python app to analyze PCR and melting data via various plots and calculated metrics. Its practical utility is limited (it is mainly interesting for viewing curves), but the code may serve as a useful example or starting point for wxPython applications.