gnomap (version 2 alpha, May 2002)
for simple, efficient genome map production from command-line systems
d. gilbert, gilbertd@bio.indiana.edu

--- gnomap for simple efficient genome map graphics ---

gnomap is optimized to read genome annotation data, draw a fairly simple and rigidly formatted display of it with Java AWT graphics methods, then write that out to a graphics file (or output stream) in selected common formats, including GIF, PNG, PDF, and PS. The first two are bitmapped formats suitable for web browser displays; the latter two are vector graphics with more uses, including print and document publication.

It is currently used primarily in web servers at http://eugenes.org/ and http://flybase.net/ (as /.bin/gnomap) to provide a simple, effective way for bioscientists to view and locate genes and other annotation features in the large quantity of data from genomes of human, fruitfly, worm and other organisms. It lacks some flexibility and options for styles and display layouts, and has no GUI to allow interactive Java application use (these were cut out from earlier map tool variants to build an efficient server-side genome web browsing tool).

The initial impetus for constructing gnomap in 1999 was the release of several large genome data sets (fruitfly, human, worm, weed), and the lack at that time of a suitable tool, with source code available, for such simple map construction. AceDB-based map graphics were attractive but required use of a complex database; Java applications/applets such as GeneScene/Seen had extensive client-side requirements. The main design goals for a suitable map display program were: server-based production of html-imagemap displays with integrated map traversal and selection choices, which work on the extant range of web browsers; efficient, fast generation of displays based on 100s of megabytes of annotated genome data; and no complex database or other special software requirement.
A 100% pure Java implementation would also allow running it on most biologists' computers as desired, and make it easy to incorporate in most web servers. With efficiency measures built in, gnomap can and does produce rapid, web-browser traversable displays of large data, including the 300MB human chromosome 1 with some 150000 features.

--- Sources and Availability ---

NOTE: version 2 is in development at this writing, with bugs and missing features that haven't been corrected to version 1 standards yet. It is running in test mode at the eugenes development server,
  http://cricket.bio.indiana.edu:8089/.bin/gnomap
On the other hand, version 1 sources are only java 1.1 compatible, and are structured in a non-javac compatible form (lots of public classes are mixed in source files that are not named precisely for the classes :)

Source code is freely available for unrestricted use. Its only accessory requirements beyond a Java 1.2+ runtime system are the Readseq program (in public domain, ftp://iubio.bio.indiana.edu/molbio/readseq/java), Gnu regexp (which may be replaced by java 1.4 regexp), an XML parser package such as Xerces, and for 'headless' graphics on unix servers, a package like the PJA toolkit. These additional java archive libraries are included with the gnomap distribution.

Gnomap has a home directory for source code and examples at
  ftp://iubio.bio.indiana.edu/molbio/java/apps/gnomap/
  http://iubio.bio.indiana.edu/soft/molbio/java/apps/gnomap/

Compile gnomap using the javac compiler on unix as
  find gnomap.src -name "*.java" -exec \
    javac -classpath lib/readseq.jar:lib/gnu-regexp-1.1.3.jar \
    -sourcepath gnomap.src -d gnomap.src {} \;

Test
  java -cp gnomap.src:lib/readseq.jar:lib/gnu-regexp-1.1.3.jar \
    flybase.GnoMap

The program is also in the java source tree for flybase public server applications (ftp://flybase.bio.indiana.edu/flybase-source/fbjava-source*). This source tree is complex and separating out the gnomap parts is a chore.
Other distribution thru sourceforge.net will be forthcoming.

--- Genome data inputs ---

Version 1 of gnomap was optimized to use a simple but specific genome text database format that can be generated from the variety of extant data: from flat file biosequence formats including GenBank and EMBL feature tables and sequences, to the GFF gene finding format common for automated annotations, to databases from the C. elegans AceDB and Human projects. For this data format structure, see http://eugenes.org/docs/eg-features-doc.html, and ftp://iubio.bio.indiana.edu/eugenes/fly/features/ for an example dataset.

This 'flattened-feature-format' (FFF) database is a tab-delimited structure with associated index files for quick, file-system based lookups from a directory of species and chromosome segments. One basic difference between FFF and the GFF format used with DAS and other annotation readers is that the concept of a 'feature' in FFF closely matches the DDBJ/EMBL/GenBank feature table construction, where any entity on a sequence has a feature class and a location, which can be complex and non-contiguous, and any range of attributes, including names and references. GFF has a more rigid structure, where the basic feature has only start/stop positions, and complex features must be built from groupings of these.

gnomap also reads data in any of a range of standard biosequence formats via Readseq software, though this is less efficient than the optimized FFF database structure.

Version 2 of gnomap is extending the range of inputs to include DAS (www.das.org) sources and an experimental LDAP directory service for genome data, and likely will add SQL relational database methods.

--- Feature styling and display options ---

These currently are limited to built-in java classes, with the ability to set feature drawing styles thru property files and with 'plug-in' java classes that are loaded via property settings.
The base class flybase.map.FeatureKind is extended for any specific feature kind; for instance, flybase.map.Feature_gene is used to draw features of kind 'gene'.

--- Common command-line options to run gnomap ---

gnomap is commonly run from a unix shell script with a command-line invocation like:
  java -cp $cp $app $opts
with
  app=flybase.GnoMap
  opts="style=pdf cl=all data=my-data.file bp=4500..8500"
  cp=./:./gnomap.jar:./readseq.jar

The most useful command options are

  cl=gene,mRNA,CDS, ...
      -- feature kinds to display
  bp=12300..345000
      -- base pair range to display
  start=12300 stop=345000
      -- also works for range setting
  style=html OR png OR gif OR pdf OR ps
      -- output format (html is gif + html image map)
  mimetype=image/gif OR biosequence/embl ...
      -- similar to style= but more extensive, including sequence format
         output as well as map images.
         Sequence output mimetypes are those defined by ReadSeq
         (biosequence/genbank, biosequence/embl, biosequence/fasta,
         biosequence/gff, biosequence/xml, biosequence/fff are useful).
         For html output with hyperlinks in these biosequence formats,
         append ';html', as in biosequence/embl;html
  out=output-file.pix
      -- file name for output (optional). gnomap will generate a unique
         output ID name for given input parameters, for use with caching
         in web service; e.g. fly4r99999-499999.pdf is the output ID for
         species-chromosome-baserange (w/o other opts)
  data=gno.ins/sean-FBgn0016122.html
      -- path to single data file, or URL to data source.
         The following data formats and inputs are supported (v.2.0 alpha);
         see gnomapio-test.sh for test programs:
           file(s) in standard biosequence format with feature table
             (GenBank, EMBL, GFF, FFF, and other formats known to ReadSeq)
           LDAP; data=ldap://eugenes.org/srv=srsgnomap
             where ldap server data schema matches
             server-ldap-das/gnomap2.schema (still in testing v2.0 alpha)
           DAS; http://.../das...
             where das server matches www.das.org 1.0 specifications
             (still in testing v2.0 alpha)
  mapdata=gno.ins/
      -- path to flat-file database structure with species subfolders
         (fly,worm,yeast,...) of feature tables as used for euGenes server,
         where spp= and chr= options need to be specified

For use with any data directory/database input with multiple source/segment (species/chromosome):

  spp=fly
      -- species of data, depends on input source choices
  chr=2L
      -- chromosome of data, depends on input source/segment choices

Some other special options

  thumb
      -- create thumbnail sized map
  width=550
      -- width of map
  height=500
      -- height of map
  id=FBgn0000015
      -- locate and display map around given gene ID (will mark map);
         in this case, for a map database, location coordinates are
         not needed
  idlist=file-of-gene-ids
      -- as above, for multiple IDs

Resource files included in gnomap.jar:

  gnomap.jar:rez/gnomap.properties
      -- used to set various options and styles. To localize options,
         extract the properties file, edit it, and place it ahead of the
         .jar in the classpath (cp), e.g. in a rez/ folder inside the same
         folder as gnomap.jar, with that folder listed in the classpath.
         Useful properties include base URLs for hyperlink maps, path to
         picture cache, and others for use as a CGI program
  gnomap.jar:rez/mapdata/
      -- holds tables of species specific information (now has fly
         cytologic to sequence mapping table)

--- Java Graphics in Unix XWindow-less shells ---

On unix systems, Java graphics normally require a running XWindow server (MacOSX is different - no XWindows). This poses problems for doing java graphics from command-line shell scripts, including web CGI servers. The prior version of gnomap used a self-developed Xwindow-free graphics environment. This v2 release switches to the PJA toolkit from http://www.eteks.com/ as a better choice.
Version 1.4 of Java includes its own variant of this, enabled with a '-Djava.awt.headless=true' runtime option, but in initial tests this requires more programming/debugging to use than the PJA toolkit. PJA is used with command-line switches such as these, to install the pja.jar code before the java runtime tries for its default graphics toolkit.

  $java3/bin/java -Xbootclasspath/p:$PJA/pja.jar \
    -Dawt.toolkit=com.eteks.awt.PJAToolkit \
    -Djava.awt.graphicsenv=com.eteks.java2d.PJAGraphicsEnvironment \
    -Djava.awt.fonts=$java3/jre/lib/fonts \
    -Duser.home=$PJAHOME \
    $jflags -cp $cp $app $opts

--- Testing map input adaptors ---

Besides the main program, flybase.GnoMap, there are testing main entries for individual modules. These are newly in development in v2 alpha, and include

  flybase.mapio.SeqFileService
      - read biosequence files via ReadSeq
  flybase.mapio.FlatFFDBService
      - read flatfile database in FFF format
        (see http://eugenes.org:8089/docs/eg-features-doc.html
        and ftp://eugenes.org/eugenes/fly/features/ for example dataset)
  flybase.mapio.DasFeatureService
      - read from DAS server (www.das.org)
  flybase.mapio.LdapFeatureService
      - read from LDAP server with gnomap2.schema data directories

  ## this is direct bio-sequence feature+sequence file reader
  ## (e.g. GenBank, Embl, GFF, FFF, others with feature tables)
  app=flybase.mapio.SeqFileService
  opts="debug=1 style=png source=fly segment=4 range=100000..200000 cl=$features data="

  ## this is local filesystem flat-file database (now used in euGenes.org to serve maps)
  app=flybase.mapio.FlatFFDBService
  opts="debug=1 style=png source=fly segment=4 range=100000..200000 cl=$features MAP_DATA=csomes/"

  ## this is standard das-bioperl Bio::DB::GFF - mysql data server
  app=flybase.mapio.DasFeatureService
  opts="debug=1 style=png source=$spp segment=CHROMOSOME_$chr bp=$range features=$features data=http://eugenes.org:7180/cgi-bin/das/"

  ## this is das adapted to run off same srs backend and data set as ldap://eugenes/srv=srsgnomap
  app=flybase.mapio.DasFeatureService
  opts="debug=1 style=png source=$spp segment=$chr bp=$range features=$features data=http://eugenes.org:7180/cgi-bin/srsdas/"

  ## this is LDAP-based directory of genome annotation features
  app=flybase.mapio.LdapFeatureService
  opts="debug=1 style=png source=$spp segment=$chr bp=$range features=$features data=ldap://eugenes.org:3891/srv=srsgnomap"
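
The option strings in the invocations above are space-separated key=value pairs, with base ranges written as start..stop. As an illustration only, here is a minimal Java sketch of how such options could be split and a range parsed. It is not gnomap's own parser: the class and method names (OptSketch, parseOpts, parseRange) are hypothetical, and treating a bare word such as "thumb" as a true flag is an assumption. The sketch uses generics, so it needs a current JDK rather than the Java 1.2 runtime gnomap targets.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of gnomap: parses key=value options. */
public class OptSketch {

    /**
     * Split each "key=value" argument at the FIRST '=' so that values
     * such as ldap://eugenes.org:3891/srv=srsgnomap stay intact.
     * Assumption: a bare word like "thumb" is stored with value "true".
     */
    public static Map<String, String> parseOpts(String[] args) {
        Map<String, String> opts = new LinkedHashMap<>();
        for (String arg : args) {
            int eq = arg.indexOf('=');
            if (eq < 0) {
                opts.put(arg, "true");
            } else {
                opts.put(arg.substring(0, eq), arg.substring(eq + 1));
            }
        }
        return opts;
    }

    /** Parse a base-pair range written as "start..stop" into {start, stop}. */
    public static long[] parseRange(String range) {
        int dots = range.indexOf("..");
        if (dots < 0) {
            throw new IllegalArgumentException("expected start..stop, got: " + range);
        }
        long start = Long.parseLong(range.substring(0, dots).trim());
        long stop = Long.parseLong(range.substring(dots + 2).trim());
        if (start > stop) {
            throw new IllegalArgumentException("start exceeds stop: " + range);
        }
        return new long[] { start, stop };
    }

    public static void main(String[] args) {
        // Option values copied from the SeqFileService test line, plus "thumb".
        String[] demo = { "debug=1", "style=png", "source=fly", "segment=4",
                          "range=100000..200000", "thumb" };
        Map<String, String> opts = parseOpts(demo);
        long[] bp = parseRange(opts.get("range"));
        System.out.println(opts.get("style") + " " + bp[0] + " " + bp[1]);
    }
}
```

Run as written, main prints "png 100000 200000". Splitting at the first '=' only is the one design choice that matters here, since data= values can themselves contain '='.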