Column format

From Sarse

Motivation

Column format files are text-based files for biological sequences. They are designed to be easy to work with rather than compact. Another important feature is that many different kinds of information can be stored in the same file. Converting the format into other formats should also be simple.

General description

The data is organized in columns: each entry (sequence) is arranged in columns together with its various assignments. Each entry consists of a header, which describes the column organization and holds miscellaneous information about the sequence, followed by the column data. The file (database) as a whole has a main header that can be used to describe overall features. In column format files, everything except the sequence positions is on lines that begin with a semicolon. Some fields of the entry header are compulsory. The first line of the entry header must state what type of molecule the entry contains (the "TYPE" field). The next lines state what information is in the different columns of the entry. After this comes the entry name, which may be followed by additional information.

A column format file starts with a header containing information about the file. This header is followed by a line with a semicolon and at least 10 equals signs (=). A header could look like this:

; This file contains a database of globin genes
;
; It is located at http://www.xyz.xyz
;
; ======================================================================== 

The sequence entry headers begin with a "TYPE" field indicating, for example, whether the entry is an RNA sequence or a special entry describing the base pairings of the alignment. That is, the first line in an entry should be of the form

; TYPE                 <type>

where <type> is one of "RNA", "DNA", "DNA_blast", or "PROTEIN" (the examples further down also use other types, such as "pairingmask"). Upper and lower case letters are not distinguished. Lines of the form

; COL <number>         <word>

indicate that the data type <word> is stored in column <number>. Each entry has an "ENTRY" field of the form

; ENTRY                <one_word>

Other lines in the header describe miscellaneous features and have the form

; <ONE_TAG>            <string>

Header and columns are separated by a line of the type

; ----------

with at least 10 dashes. The column lines have the form

<word(COL 1)>  <word(COL 2)>  . . .  <word(COL N)>

for N columns. Entries are ended by a line of the type

; **********

with at least 10 asterisks (*).
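The rules above are simple enough to read with a few lines of code. The following is a minimal sketch of a reader in Python, not one of the SARSE or rnadbtool programs. The function name `parse_col` and the returned dictionary layout are assumptions made for illustration. It also assumes that tags are single words and that data fields are separated by whitespace.

```python
import re

def parse_col(text):
    """Parse column-format text into (file_header_lines, entries).

    Hypothetical helper: each entry is a dict with keys "header",
    "columns" (names ordered by column number) and "rows".
    """
    file_header, entries = [], []
    entry = None
    state = "file_header"  # file_header -> entry_header -> data

    def close(e):
        # Order the column names by their COL number and label each row.
        names = [e["columns"][k] for k in sorted(e["columns"])]
        e["rows"] = [dict(zip(names, r)) for r in e["rows"]]
        e["columns"] = names
        entries.append(e)

    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if not line.startswith(";"):
            # Only lines without a semicolon carry column data.
            if state == "data" and entry is not None:
                entry["rows"].append(line.split())
            continue
        body = line[1:].strip()
        if state == "file_header":
            if re.fullmatch(r"={10,}", body):
                state = "entry_header"
            else:
                file_header.append(body)
        elif state == "entry_header":
            if entry is None:
                entry = {"header": {}, "columns": {}, "rows": []}
            if re.fullmatch(r"-{10,}", body):
                state = "data"
            elif body:
                col = re.fullmatch(r"COL\s+(\d+)\s+(\S+)", body, re.IGNORECASE)
                if col:
                    entry["columns"][int(col.group(1))] = col.group(2)
                else:
                    parts = body.split(None, 1)
                    value = parts[1] if len(parts) > 1 else ""
                    entry["header"][parts[0].upper()] = value
        elif state == "data" and re.fullmatch(r"\*{10,}", body):
            close(entry)
            entry, state = None, "entry_header"

    if entry is not None:  # tolerate a missing final terminator line
        close(entry)
    return file_header, entries
```

All values are kept as strings, so a caller decides whether a field such as seqpos should become an integer or stay "." for a missing value.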

A description of the column types can be found on the colusage page, along with a listing of which programs from the rnadbtool page use them. Examples of how to use those programs can be found here.

Database example

An example from the tmRNA database alignment of RNA sequences.

;  The tmRNA Database version 043 (January 2001):
;  ----------------------------------------------
;
;
;  Availability:
;  -------------
;
; . . . .
; . . . . 
;
; ========================================================================
; TYPE              pairingmask
; COL 1             label
; COL 2             residue
; COL 3             alignpos
; ENTRY             pairingmask
; ---------- 
M     a     1 
M     a     2
M     a     3
M     a     4
M     a     5
.     .     .
.     .     .
M     -   675
M     -   676
; **********
; TYPE              RNA
; COL 1             label
; COL 2             residue
; COL 3             seqpos
; COL 4             alignpos
; COL 5             align_bp
; ENTRY             AQU.AEO.
; ORGANISM          Aquifex aeolicus
; ACCESSION         AE000657 + AE000749
; WWW-ACCESS        http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=6626248&dopt=GenBank + 
                                    http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=02983975&dopt=GenBank
; LINEAGE (NCBI)    CELLULAR ORGANISMS; BACTERIA; AQUIFICALES; AQUIFICACEAE; AQUIFEX
; ----------
N     G     1     1   672
N     G     2     2   671
N     G     3     3   670
N     G     4     4   669
N     G     5     5   668
N     C     6     6   667
N     G     7     7   666
N     g     8     8     .
N     a     9     9     .
G     -     .    10     .
N     a    10    11     .
N     a    11    12     .
N     g    12    13     .
N     g    13    14     .
.     .     .     .     .
.     .     .     .     .

The first entry describes the pairing mask (stem helix mapping) of the structural alignment (it is obtained using txt2col with the "-m" option). In the second entry, the first column indicates whether the sequence contains a gap or a nucleotide at a particular alignment position. When TYPE is RNA or DNA, label takes the values "N" (nucleotide) and "G" (gap). "residue" refers to the individual nucleotides.

Examples of program output in col format

Shown here are examples of column file output from different programs. Only the beginning of each file is shown.

Output from txt2col:

; Generated by txt2col
; ========================================================================
; TYPE              RNA
; COL 1             label
; COL 2             residue
; COL 3             seqpos
; COL 4             alignpos
; COL 5             align_bp
; ENTRY             Aqu.aeo.
; ----------
N     G     1     1   665
N     G     2     2   664
N     G     3     3   663
N     G     4     4   662
N     G     5     5   661
N     C     6     6   660
N     G     7     7   659
N     g     8     8     .
N     a     9     9     .
G     -     .    10     .
N     a    10    11     .

Output from ct2col:

; Generated by ct2col
; ========================================================================
; TYPE              RNA
; COL 1             label
; COL 2             residue
; COL 3             seqpos
; COL 4             align_bp
; ENTRY             mtu
; LENGTH            168
; ---------- 
N     C     1     .
N     U     2     .
N     U     3    17
N     C     4    16
N     G     5    15
N     C     6    14
N     A     7     .
N     U     8     .
N     C     9     .
N     A    10     .
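Output like the ct2col listing above can be turned into an explicit list of base pairs. The sketch below is an illustration only. It assumes that align_bp holds the position of the pairing partner, with "." for unpaired positions (as the listings suggest), and that the partner numbers refer to the seqpos column when no alignpos column is present. The function name `base_pairs` and its parameters are hypothetical.

```python
def base_pairs(rows, pos_col=2, bp_col=3):
    """Return sorted (i, j) pairs with i < j from whitespace-split data rows.

    pos_col and bp_col are zero-based field indices; the defaults match
    the four-column ct2col listing (label, residue, seqpos, align_bp).
    """
    pairs = set()
    for fields in rows:
        pos, partner = fields[pos_col], fields[bp_col]
        if pos == "." or partner == ".":
            continue  # gap or unpaired position
        i, j = int(pos), int(partner)
        # Store each pair once, whichever end it was read from.
        pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)

rows = [line.split() for line in """
N     C     1     .
N     U     2     .
N     U     3    17
N     C     4    16
N     G     5    15
N     C     6    14
N     A     7     .
""".strip().splitlines()]

print(base_pairs(rows))  # [(3, 17), (4, 16), (5, 15), (6, 14)]
```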

Output from gb2col:

; File generated by gb2col
; ========================================================================
; TYPE              DNA
; COL 1             label
; COL 2             residue
; COL 3             seqpos
; ENTRY             MTU88049
; LENGTH            2805
; ACCESSION         U88049
; ---------- 
N     t     1
N     t     2
N     g     3
N     g     4
N     g     5
N     c     6
N     c     7
N     g     8
N     c     9
N     c    10

Output from blast2col:

; Generated by blast2col
; ========================================================================
; TYPE              DNA_blast
; COL 1             label
; COL 2             query_residue
; COL 3             match
; COL 4             subject_residue
; COL 5             query_seqpos
; COL 6             subject_seqpos
; ENTRY             MTU88049_vs_U88049
; BLAST_VERSION     BLASTN 2.0.11 [Jan-20-2000]
; QUERY             MTU88049
; QUERY_LENGTH      200
; SUBJECT           U88049
; SUBJECT_COMMENT   MTU88049 2805 bp DNA BCT U88049 .
; SUBJECT_STRAND    Plus
; SUBJECT_LENGTH    2805
; ALIGNMENT_LENGTH  200
; SCORE             396
; EXPECT            1e-109
; IDENTITIES        200
; ----------
N T  -  T       1     1
N T  -  T       2     2
N G  -  G       3     3
N G  -  G       4     4
N G  -  G       5     5
N C  -  C       6     6
N C  -  C       7     7
N G  -  G       8     8
N C  -  C       9     9
N C  -  C      10    10

Using the column format for various data types

The column format has previously been used for biological sequence data. It is not designed to be compact, but rather to be easy to read and manipulate. The overall structure of the format is:

; file header
; ==========
; TYPE X
; column content
; ENTRY 1
; ----------
column data
; **********
; TYPE Y
; column content
; ENTRY 2
; ----------
column data
; **********

The file header can contain a description of the content or information about how the file has been processed. Each entry header defines the data TYPE, the column content, and the ENTRY name. The data type can be the same or differ between entries, which allows many related kinds of data to be stored in one file. An entry for a DNA sequence with secondary structure can look like this:

; ==========
; TYPE		DNA
; COL 1	label
; COL 2	residue
; COL 3	alignpos
; COL 4	align_bp
; COL 5	color_r
; COL 6	color_g
; COL 7	color_b
; ENTRY	#
; ---------- 
N	A	1	2	1.00	0.50	0.50
N	T	2	1	1.00	0.50	0.50
; **********
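Writing an entry is as simple as reading one. The following sketch shows one way to produce an entry in Python. The function name `format_entry` and its parameters are assumptions for illustration, and the tab separation simply mirrors the example above.

```python
def format_entry(entry_type, name, columns, rows, extra=None):
    """Return one column-format entry as text (hypothetical helper).

    columns: list of column names, written as COL 1..N in order.
    rows:    iterable of sequences with one value per column.
    extra:   optional dict of additional "<TAG> <string>" header lines.
    """
    lines = ["; TYPE\t" + entry_type]
    for number, col in enumerate(columns, start=1):
        lines.append(f"; COL {number}\t{col}")
    lines.append("; ENTRY\t" + name)
    for tag, value in (extra or {}).items():
        lines.append(f"; {tag}\t{value}")
    lines.append("; " + "-" * 10)  # header/data separator, at least 10 dashes
    for row in rows:
        if len(row) != len(columns):
            raise ValueError("row length does not match number of columns")
        lines.append("\t".join(str(value) for value in row))
    lines.append("; " + "*" * 10)  # entry terminator, at least 10 asterisks
    return "\n".join(lines) + "\n"

text = format_entry(
    "DNA", "demo",
    ["label", "residue", "alignpos", "align_bp"],
    [("N", "A", 1, 2), ("N", "T", 2, 1)],
)
print(text)
```

The file header and its line of equals signs are not part of an entry, so a complete file would write those first and then append one or more entries.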

For the DNA origami designs described in the manuscript we do not use regular DNA sequences. First we import bitmap pixels, where each pixel corresponds to a DNA base in the horizontal direction, while the vertical direction relates to the distance between adjacent helices in a DNA origami. We use the following ENTRY header:

; ==========
; TYPE		ORIGAMI
; COL 1	residue
; COL 2	color_r
; COL 3	color_g
; COL 4	color_b
; ENTRY	line number
; ----------

This simple symbol with a color attached is sufficient both for the symbolic representation of the scaffold strand and staple strands and for the insertion of the M13 sequence in the design. For generating a 3D atomic model of the origami design, we describe the atomic coordinates with the entry header:

; ==========
; TYPE		ATOM
; COL 1	Number
; COL 2	Atom
; COL 3	Label
; COL 4	x
; COL 5	y
; COL 6	z
; ENTRY	project name
; ----------

This is very similar to the Protein Data Bank (PDB) format. The PDB format stems from the 1970s and has severe limitations when building large molecular models such as DNA origami. With some workarounds, however, the col format can still be converted into the PDB format and displayed in viewers such as PyMOL, QuteMol or the like.
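As an illustration of such a conversion, the sketch below formats one ATOM row of the col format as a fixed-width PDB ATOM record. It is not the converter used for the designs described here. It assumes that the Label column can stand in for the PDB residue name, and it wraps atom serial numbers at 100000 as one possible workaround for the five-character serial field. The function name `col_row_to_pdb` and the fixed chain, residue number, occupancy and temperature factor are placeholders.

```python
def col_row_to_pdb(fields, chain="A", res_seq=1):
    """Format one col-format ATOM row (Number, Atom, Label, x, y, z)
    as a PDB ATOM record. Hypothetical helper with simplified naming:
    the atom name is left-justified and Label is used as residue name.
    """
    number, atom, label, x, y, z = fields
    serial = int(number) % 100000  # assumed workaround: wrap 5-digit field
    return (
        f"ATOM  {serial:5d} {atom:<4.4s} {label:>3.3s} {chain:1.1s}"
        f"{res_seq % 10000:4d}    "
        f"{float(x):8.3f}{float(y):8.3f}{float(z):8.3f}"
        f"{1.00:6.2f}{0.00:6.2f}"  # placeholder occupancy and B-factor
    )

line = col_row_to_pdb(["1", "P", "DA", "1.0", "2.5", "-3.25"])
print(line)
```

Wrapping the serial number makes serials non-unique in large models, so anything that relies on them (for example explicit bond records) would need a different approach.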

Finally, we define a column format for AFM image data to be loaded into the SARSE editor. It accommodates the two types of data in an AFM image file: the height information (nm) and the angular deviation data (V). Each ENTRY in the column format corresponds to one scan line of the AFM image.

; resolution info
; ==========
; TYPE	AFM
; COL 1	residue
; COL 2	nm
; COL 3	V
; COL 4	color_r
; COL 5	color_g
; COL 6	color_b
; ENTRY	line number
; ----------