Contributing to the SARSE toolbox
Tutorial by Ebbe S. Andersen and Allan Lind-Thomsen
This tutorial helps you add existing programs or contribute new programs to the SARSE toolbox. It also explains new developments of the column format to accommodate the various data types used for DNA origami design.
Wrapping existing programs for the SARSE toolbox
From the SARSE toolbox you can run any program that can be executed on the command line. This can be either an independent program that is provided with input from SARSE or a data analyzer that produces several outputs. The case that fits best with the SARSE concept (link to about) is a program that receives data from SARSE and returns a modified version back to SARSE. This allows iterative refinement in projects like DNA origami design (link tutorial).
If you find a program on the web or have written one yourself, and it makes sense to include it in the SARSE toolbox, you can easily adapt it to SARSE by using a bash super script and a format converter (described below). The bash script used by SARSE contains some code that helps SARSE find the program and working directory and fetch any optional values:
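#!/usr/bin/env bash

# Print the directory that holds this script, following one level of symbolic link.
program_dir() {
    local p l
    # Make the script path absolute.
    if [ "${0:0:1}" = "/" ]
    then
        p=$0
    else
        p=$PWD/$0
    fi
    # If the script was called through a symbolic link, resolve the link target.
    if [ -L "$p" ]
    then
        # readlink prints the target of the symbolic link.
        l=$(readlink "$p")
        if [ "${l:0:1}" = "/" ]
        then
            p=$l
        else
            p=$(dirname "$p")/$l
        fi
    fi
    dirname "$p"
}

dir=$(program_dir)

# If the second argument is an existing file, the first argument is the optional value.
if [ -e "$2" ]
then
    file="$2"
    range="$1"
else
    file="$1"
    range=" "
fi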
Now you add the command line that executes your program, using the input file "${file}" and the parameter "${range}". You can write any number of files to the working directory. The file that you want to send back to SARSE has to be in the column format (see below) and must be written to STDOUT. You can use one of the format converters provided in the SARSE/programs/tools directory, e.g.
fasta2col file.fasta > file.col
At the end of the bash script you exit with:
exit 0
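The argument handling at the end of the script can be tried in isolation. The following is a minimal sketch that repeats the same test inside a function; the function name parse_args and the range value "10-20" are made up for illustration, and the exact form in which SARSE passes optional values is an assumption.

```shell
#!/usr/bin/env bash
# Same test as in the wrapper script: if the second argument is an existing
# file, the first argument is treated as the optional value.
# parse_args is a hypothetical name used only in this sketch.
parse_args() {
    if [ -e "${2-}" ]; then
        file="$2"
        range="$1"
    else
        file="$1"
        range=" "
    fi
}

# Demonstration with a temporary file standing in for the input from SARSE.
tmp=$(mktemp)
parse_args "10-20" "$tmp"      # optional value followed by the file
echo "file=$file range=[$range]"
parse_args "$tmp"              # file only: range falls back to a single space
echo "file=$file range=[$range]"
rm -f "$tmp"
```

When only the file is given, "${range}" is a single space, so it can be passed on to the wrapped program without changing its command line.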
Making programs available in the SARSE toolbox
In the following we assume that you know basic XML syntax. The only prerequisites that must be fulfilled for a program to be added to SARSE are:
It must run under Linux/Mac.
It must read a col-file from std-in.
The results must be written to std-out in col-format.
It must be in your path.
The only thing you must do is edit an XML file called programs.xml that is located in the "properties" folder in your SARSE installation directory.
You insert your program immediately after a </program> tag. If you are in doubt, place it between these two tags at the end of the file: </program> </programs>. The enclosing tag for each program is a <program> tag, which has a few necessary attributes. Some of them have default values because they are reserved for future extensions. Here is an example from the file:
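Before editing programs.xml, two of the prerequisites above can be checked from a shell. The following is a minimal sketch and not part of SARSE: the function names on_path and filters_stdin are made up for illustration, and the second check only confirms that the program reads standard input and prints something, not that the output is valid column format.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for checking a program before adding it to programs.xml.

# Succeeds if the named command can be found through PATH.
on_path() {
    command -v "$1" >/dev/null 2>&1
}

# Succeeds if the command, given the file on standard input,
# exits with status 0 and writes something to standard output.
filters_stdin() {
    check_prog="$1"
    check_input="$2"
    check_out=$("$check_prog" < "$check_input") && [ -n "$check_out" ]
}
```

For example, `on_path stem_colors && filters_stdin stem_colors example.col` succeeds only if both checks pass; example.col is a placeholder for any col-file you have at hand.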
<program name="stem_colors" priority="7" package="coloring tools" selected="false" sequencetype="RNA" type="analyzer" depends=""> </program>
The "name" attribute is both the exact name of the command that runs the program and the name that is displayed in the menu. "priority" decides which programs are run first: the lower the number, the sooner the program is run. The "package" attribute groups the programs in the menu. "selected" determines whether the program is selected by default when you open the program menu. "sequencetype" must be "RNA" and "type" must be "analyzer"; there is no choice. The "depends" attribute can take the value of the "name" attribute of another program in the XML file. When you select a program in the menu that depends on another program, that other program is automatically selected as well. The program description has a tag of its own and is added like this:
<program ...>
  <programdescription>
    Colors stems of alignment in different colors.
  </programdescription>
</program>
You then need to declare the in- and output formats. At the moment this is limited to col-format so you have to add the input-formats and output-formats in the following way:
<program ....>
  <programdescription>...</programdescription>
  <inputformats>
    <fileextension>col</fileextension>
  </inputformats>
  <outputformats>
    <fileext>col</fileext>
  </outputformats>
</program>
If your program doesn't take commandline options you just insert <parameters/> just before the </program> and you are done.
Options are enclosed in a <parameters> tag and each option is placed in a <param> tag. Each <param> has four attributes. "selected" is either 'true' or 'false' and decides whether the option is selected by default. "input" is also 'true' or 'false' and decides whether the option takes a value. "number" must be 0. "spaced" tells whether there should be a space between the parameter and its value (e.g. -r 10 vs. -r10); it takes the values 'true' or 'false'.
<param selected="false" input="false" number="0" spaced="false">
Then you add a <name> tag, which should contain the option exactly as it is used on the command line, including "-" or "--" if used, e.g.
<name>-s</name>
Then you supply the description:
<paramdescription>
Support information is output as the last entry.
</paramdescription>
And then end with a closing tag for the option.
</param>
Another example is an option that takes an input. Then the input attribute of the <param> must be true. The difference is an additional <input> tag:
<input number="1" delimiter="" description="limit for support"> 0.75 </input>
The "number" attribute tells how many individual values it contains and "delimiter" says which character is used to separate the values. Then there is a "description" attribute, and lastly the default value is given as the content of the tag. The whole example looks like this:
<parameters>
  <param selected="false" input="true" number="0" spaced="false">
    <name>-l</name>
    <paramdescription>
      Sets the limit for support. The default is 2/3.
    </paramdescription>
    <input number="1" delimiter="" description="limit for support"> 0.75 </input>
  </param>
</parameters>
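To illustrate what the "spaced" attribute means for the resulting command line, here is a small sketch. It is not SARSE's own code; the function name format_option is hypothetical and only mimics the behaviour described above for a single option and value.

```shell
#!/usr/bin/env bash
# format_option NAME VALUE SPACED
# Prints the option as it would appear on the command line.
# SPACED is 'true' or 'false', as in the "spaced" attribute of <param>.
# (Hypothetical helper for illustration only.)
format_option() {
    opt_name="$1"
    opt_value="$2"
    opt_spaced="$3"
    if [ "$opt_spaced" = "true" ]; then
        printf '%s %s\n' "$opt_name" "$opt_value"
    else
        printf '%s%s\n' "$opt_name" "$opt_value"
    fi
}

format_option -r 10 true      # prints: -r 10
format_option -l 0.75 false   # prints: -l0.75
```

With spaced="false", as in the example entry, the option -l and its default value 0.75 would be joined without a space.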
Writing your own programs for the SARSE toolbox
You can write a data analyzer in any programming language, since SARSE executes analysis tools directly on the command line. However, SARSE only reads and writes the column format (described below).
A set of format converters to and from the column format is provided in the SARSE package programs/tools directory. They convert the common sequence formats to the column format and vice versa. If your program reads and writes another format, you can use these converters to wrap your program using bash super scripts (described above).
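The wrapping step can be sketched as a small shell function that converts from the column format, runs the program, and converts the result back to standard output. This is only a sketch under assumptions: the function name run_wrapped is hypothetical, and it assumes that both converters and the program take a file name as their only argument and write to standard output, as fasta2col does in the example above.

```shell
#!/usr/bin/env bash
# run_wrapped FROM_COL TOOL TO_COL COLFILE
#   FROM_COL  converter from column format to the tool's format
#   TOOL      the program to wrap
#   TO_COL    converter back to column format (e.g. fasta2col)
#   COLFILE   the col-file received from SARSE
# Each command is assumed to take one file argument and write to stdout.
run_wrapped() {
    from_col="$1"
    tool="$2"
    to_col="$3"
    colfile="$4"
    workdir=$(mktemp -d) || return 1
    # Convert, run the tool, and convert back; stop at the first failure.
    "$from_col" "$colfile" > "$workdir/input.native" &&
        "$tool" "$workdir/input.native" > "$workdir/output.native" &&
        "$to_col" "$workdir/output.native"
    status=$?
    rm -rf "$workdir"
    return $status
}
```

In a wrapper script this could be called as `run_wrapped col2fasta my_tool fasta2col "$file"`, where col2fasta and my_tool are placeholder names for the reverse converter and your own program.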
The format converters use a Perl module called Formats.pm that is found in the SARSE/programs/tools directory. This module also makes it very easy to write new data analyzers in Perl. In your Perl script you include the module:
use Formats;
which allows you to read data like this:
( $header, $entries ) = &Formats::read_col ( $file );
where $entries is a hash of the entries and $header the file header information.
After analysis and applying changes the data can be written like this:
&Formats::write_col ( $header, $entries );
For more details on programming in Perl, see the Perl documentation and look at the programs in the SARSE/Programs directory.
