Experimental Biodata Directory Systems
October 2002
Experiments toward developing practical, widely usable bio-data
access for the BioGrid
Further information, sources and test components used here are available at
http://iubio.bio.indiana.edu/biogrid/directories/
Test Cases
This is a test of new automated methods for
bio-data search and retrieval systems. It looks for
practical methods that are suited to high-volume data
search and retrieval, that provide standard programming
interfaces for wide use of automated bio-data access, and
that fit data grid and related distributed computing for
bioinformatics.
This work is based on these components:
-
SRS,
the bioinformatics Sequence Retrieval System.
SRS is used as a backend data access system.
Currently SRS provides bio-object search and retrieval for
the largest volume of public bioinformatics data (some 30 million
objects at EBI's server, Sept. 2002).
-
LDAP, the Lightweight
Directory Access Protocol, which provides a mature standard
client/server protocol for search and retrieval of high-volume
directories of objects (whether people info, computer resources,
or bio-data). The OpenLDAP implementation provides an
open-source, high-performance component for implementing LDAP
services (C source).
-
JavaLDAP,
an easy to use and manipulate implementation of LDAP services
(Java source). Performance is near to that of OpenLDAP, while
providing a simpler installation and programming interface for
experimental work.
-
Web Services,
an emerging IT standard for computable access to data and compute
services, using XML over Web (HTTP) protocols (SOAP, WSDL, UDDI
and others).
-
GLUE, a very useful
and practical Web Service toolkit. This also provides a common Web server
and Java servlet pages for testing LDAP and WS through web page clients.
-
Jakarta,
the Apache Jakarta project's set of Java-based web service and related
tools.
-
DSML,
the Directory Services Markup Language (an LDAP-XML
translation). This provides a standard XML translation to and from
LDAP, ensuring a common standard for these services
and allowing LDAP and WS access methods to be developed together.
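As an illustration of the LDAP-to-XML translation that DSML describes,
here is a minimal Java sketch that renders one directory entry as a
DSML v1 style fragment. It is not code from this project. The class
name DsmlSketch, the entry DN and the attribute names are hypothetical,
and the rendering is simplified: every attribute is written as a plain
dsml:attr element, and the enclosing dsml:dsml wrapper with its
namespace declaration is omitted. The entry is held in the standard
JNDI Attributes type, which is also what a JNDI LDAP search returns.

```java
import javax.naming.NamingEnumeration;
import javax.naming.NamingException;
import javax.naming.directory.Attribute;
import javax.naming.directory.Attributes;
import javax.naming.directory.BasicAttributes;

public class DsmlSketch {

    /** Escape the characters that are special in XML text and attribute values. */
    public static String xmlEscape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /**
     * Render one entry (a DN plus its attributes) as a simplified
     * DSML v1 style fragment. Attribute order follows the Attributes
     * object and is not guaranteed.
     */
    public static String toDsml(String dn, Attributes attrs) {
        StringBuilder xml = new StringBuilder();
        xml.append("<dsml:entry dn=\"").append(xmlEscape(dn)).append("\">\n");
        try {
            NamingEnumeration<? extends Attribute> all = attrs.getAll();
            while (all.hasMore()) {
                Attribute attr = all.next();
                xml.append("  <dsml:attr name=\"")
                   .append(xmlEscape(attr.getID())).append("\">\n");
                // An LDAP attribute may hold several values.
                for (int i = 0; i < attr.size(); i++) {
                    xml.append("    <dsml:value>")
                       .append(xmlEscape(String.valueOf(attr.get(i))))
                       .append("</dsml:value>\n");
                }
                xml.append("  </dsml:attr>\n");
            }
        } catch (NamingException e) {
            // Rethrown unchecked to keep the sketch's signature simple.
            throw new IllegalStateException("could not read attributes", e);
        }
        xml.append("</dsml:entry>\n");
        return xml.toString();
    }

    public static void main(String[] args) {
        // Hypothetical bio-data entry; names and values are made up.
        Attributes attrs = new BasicAttributes(true);
        attrs.put("id", "EXAMPLE0001");
        attrs.put("description", "example sequence record");
        System.out.print(toDsml("id=EXAMPLE0001,db=example,o=biodata", attrs));
    }
}
```

Because the input is the same Attributes type that an LDAP client
receives, a fragment like this could be produced from either a live
directory search or an in-memory entry, which is the sense in which
the two access methods can share one data model.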
Why Java?
This work is now strongly focused on Java components. Why?
The methods being tested are all multi-language, multi-source
technologies with wide software industry support, including
open-source and commercial implementations in all standard
programming languages. This is a necessary basis for a
bioinformatics technology designed for wide acceptance and
deployment. The author is testing methods using C, C++, Perl and
Java.
Java currently provides the best common ground for testing and,
especially, for easily deploying these tests to new service centers,
both in terms of the wide range of freely available libraries for
network services, XML, Web and other needs, and for its simple
packaging for deployment (with some very carefully chosen
exceptions, the tests will run without compiling any
platform-specific code on the range of Solaris, Linux, MacOSX and
other Unix systems).
Perl is a very attractive language for bioinformatics and bio-data access.
However, tests so far using standard Perl network and related libraries
have fallen short of the performance of Java, and have often blown up in
high data volume tests (as a few Java tests have). The proportion of
compiled, platform-specific additions is also higher for Perl than Java,
making packaging and deployment on multiple systems more difficult.
The C-based OpenLDAP server has provided the most efficient method
so far for bio-directory access. But its installation and use
require more effort, and modifying it to use bioinformatics data
access backends takes more work. For a production system, OpenLDAP may
be the best choice. The JavaLDAP package comes close to its performance
(about 1/2x), while being much simpler to deploy and develop with.
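Since both servers speak the same standard LDAP protocol, one client
can be pointed at either of them. The following is a minimal sketch of
such a client using Java's standard JNDI LDAP provider; it is an
illustration, not the test code used here. The class name
BioDirectoryClient, the attribute name "description" and the search
value are hypothetical, and the server URL and base DN are taken from
the command line rather than assumed.

```java
import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.NamingEnumeration;
import javax.naming.NamingException;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;
import javax.naming.directory.SearchControls;
import javax.naming.directory.SearchResult;

public class BioDirectoryClient {

    /** Escape a value for use inside an LDAP search filter (RFC 4515). */
    public static String escapeFilterValue(String value) {
        StringBuilder out = new StringBuilder();
        for (char c : value.toCharArray()) {
            switch (c) {
                case '\\': out.append("\\5c"); break;
                case '*':  out.append("\\2a"); break;
                case '(':  out.append("\\28"); break;
                case ')':  out.append("\\29"); break;
                case '\0': out.append("\\00"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    /** Build an equality filter of the form (attr=value). */
    public static String equalityFilter(String attr, String value) {
        return "(" + attr + "=" + escapeFilterValue(value) + ")";
    }

    /** Run a subtree search and print each matching entry with its attributes. */
    public static void search(String url, String baseDn, String filter)
            throws NamingException {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, url);
        DirContext ctx = new InitialDirContext(env);
        try {
            SearchControls controls = new SearchControls();
            controls.setSearchScope(SearchControls.SUBTREE_SCOPE);
            NamingEnumeration<SearchResult> hits = ctx.search(baseDn, filter, controls);
            while (hits.hasMore()) {
                SearchResult hit = hits.next();
                System.out.println(hit.getNameInNamespace() + " " + hit.getAttributes());
            }
        } finally {
            ctx.close();
        }
    }

    public static void main(String[] args) throws NamingException {
        // Hypothetical query; the attribute name is an assumption.
        String filter = equalityFilter("description", "kinase (human)");
        System.out.println(filter);
        // Only contact a server when a URL and base DN are supplied.
        if (args.length == 2) {
            search(args[0], args[1], filter);
        }
    }
}
```

Run with no arguments, the program only prints the escaped filter.
Given a server URL and a base DN as its two arguments, it performs the
search against whichever LDAP server is listening there.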
Contact: Don Gilbert, gilbertd@bio.indiana.edu, Oct. 2002