LuceGene - GBrowse GFF data adaptor

DESCRIPTION

This package provides a data adaptor to GBrowse using the
Lucene/Lucegene text-data indexing software. It provides a low-effort,
high-performance adaptor to view genome feature data in GBrowse.

Lucene is a fast, open-source search package written in Java. It is
comparable in speed to compiled software such as Berkeley DB, and
generally faster than RDBMS software (MySQL, Oracle, PostgreSQL) at
text searching. Use with GBrowse, however, is dominated by numeric
searches (genome locations), where MySQL performs equivalently.

The Lucegene wrapper to Lucene (www.gmod.org/lucegene/) adds
bioinformatic and genomic data indexing and search methods, making it
suitable for use with genome maps, biosequences and other biodata
search and retrieval (similar to the commercial SRS Sequence Retrieval
System).

The Perl package Bio::DB::GFF::Adaptor::LucegeneDBI is a basic
interface to the Java Lucegene package for query construction, search
and retrieval. It is tailored to GFF/FFF genome feature searches and is
independent of Bio:: modules. It also includes methods, used by
lucegene_bulk_load_gff.pl, for creating an index of GFF data suitable
for use in GBrowse genome map display.

This work is tested using the GBrowse v1.63 pre-release source:

    cvs -d me@cvs.sf.net:/cvsroot/gmod export -r release_1_62-bugfixes Generic-Genome-Browser

Originally tested 8 Jun 05; retested and updated August 2005.

Tests compare this Lucegene adaptor with the GFF-MySQL and ChadoFC
database adaptors for the same fly data source. The test server is at
http://eugenes.org/cgi-bin/gbrowse/ with datasets including dmel_*
(Euchromatin) and dmelhet_* (Heterochromatin). There is also a set of
Lucene indices for GFF from C.elegans (Wormbase), Sacc. yeast (SGD),
and Rice (Gramene). Several GBrowse adaptors are configured with these
data sets for comparisons (see BENCHMARKS below).

An update to the GMOD Chado adaptor for GBrowse, called
Bio::DB::Das::ChadoFC.pm, has various fixes and optimizations to run
faster and to display genome features more fully with the fly Chado db.

The Lucene indices used here are available at
    ftp://eugenes.org/eugenes/gbrowse/indices/lucene/
if you want to test without generating them from GFF data.

A copy of the software to add to GGB (Generic Genome Browser) is in
    ftp://eugenes.org/eugenes/gbrowse/lucegene-gbrowse.zip

See also the 'Lucene-lite' adaptor to GBrowse. It is almost as fast as
Lucegene, and is a very simple software set:
    ftp://eugenes.org/eugenes/gbrowse/lucene-gbrowse-lite.zip

BENCHMARKS

This Lucegene BioPerl GFF adaptor to GBrowse is somewhat faster than
the MySQL-GFF adaptor. Both are about 5x faster than an optimized
Das::Chado adaptor to a Chado PostgreSQL database with the same feature
data.

Time (seconds) for GBrowse web display, 30 iterations at different
map locations on fruitfly (dmel) genome
----------------------------------------------------------
                      Server3        Server2      Relative
GBrowse-Adaptor       Mean    SE     Mean    SE   time (ave.)
dmel_lucegene_500k     5.4   0.15    1.86   0.05     100
dmel_lucene_500k       6.1   0.13    2.23   0.05     117
dmel_mysql_500k        7.9   0.31    2.14   0.06     128
dmel_bdb_500k          8.3   0.53    4.10   0.32     187
dmel_chadofc_500k     25.9   0.91    9.86   0.77     510
----------------------------------------------------------
run Aug.2005

This uses a 500kb map range; differences increase with map range.
All adaptors use the same data. There are slight differences in the
displays due to configuration and to how each adaptor works, but no
significant differences in the data returned by the adaptors.

Lucene and MySQL indices are shared across platforms here. BerkeleyDB
and Postgres databases cannot be shared, and had to be regenerated for
each server.

Server2 is x64-Solaris-10 (yr2005), Server3 is ppc-MacOSX-10.3 (yr2004).
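
The Mean and SE columns above summarize 30 timed requests per adaptor.
As a minimal sketch of how such a summary can be computed from a list
of per-request timings, the Python snippet below uses the sample
standard deviation divided by sqrt(n). The function name summarize and
the sample timings are hypothetical, and whether the original benchmark
used exactly this formula is an assumption.

```python
import math
import statistics


def summarize(timings):
    """Return (mean, standard error of the mean) for timings in seconds."""
    n = len(timings)
    if n < 2:
        raise ValueError("need at least two timings to estimate a standard error")
    mean = statistics.mean(timings)
    # Sample standard deviation (n - 1 denominator), then SE = sd / sqrt(n).
    se = statistics.stdev(timings) / math.sqrt(n)
    return mean, se


# Made-up timings for illustration only; these are not the benchmark data.
sample = [2.0, 4.0, 4.0, 6.0]
mean, se = summarize(sample)
print(f"mean={mean:.2f} s  SE={se:.2f} s")
```

In a real run, the list would hold one wall-clock measurement per
GBrowse page request, taken at a different map location each time.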

Adaptor key:
    lucegene - lucegene.pm GFF (Lucene v1.9; Java 1.4/1.5)
    lucene   - simple lucene.pm GFF (Lucene v1.9; Java 1.4/1.5)
    bdb      - berkeleydb.pm GFF (BerkeleyDB v4.2)
    mysql    - mysqlopt.pm GFF (MySQL v4.0x)
    chadofc  - chado.pm DAS, modified for flybase Chado db (Postgres v7 & 8)

Ref. http://sourceforge.net/mailarchive/forum.php?thread_id=8094404&forum_id=31947

TODO

The Lucene-lite adaptor supports FastA in the same manner as the
Berkeley DB adaptor, but uses Lucene instead of BerkeleyDB to index the
FastA. The Lucegene adaptor still needs to incorporate techniques
learned from writing Lucene-lite (as of 5 Sep 05). It handles FastA
sequence now, but in an awkward way: by creating .raw files for each
chromosome/unit. The gbrowse.conf configuration file needs an entry for
this, giving the path info as
    -DNA_LIB /path/to/dna/per-chromosome*.raw

AUTHOR

Don Gilbert. Copyright (c) 2005 Indiana University. This library is
free software; you can redistribute it and/or modify it under the same
terms as Perl itself.