Name                Last modified        Size
Parent Directory    10-Mar-2008 14:13    -
doc/                12-Jun-2008 10:58    -
drospege/           20-Sep-2010 12:26    -
scripts/            26-Jun-2008 17:50    -
tandy/              26-Jun-2008 16:46    -
build.properties    09-Jun-2008 10:48    1k
build.xml           26-Jun-2008 18:23    5k
genogrid-0.5.zip    26-Jun-2008 18:23    544k
Genome analysis and annotation via Grid computing
This subproject builds re-usable tools and workflows for genome analysis and annotation, using shared cyberinfrastructure (Grids or clusters). Included are collections of scripts, documents and workflows for employing existing genome analysis tools (BLAST, homology tools, predictors, comparative and phylogenetic analyses) on available cyberinfrastructure.
The emphasis is on simplifying the use of grids and genome tools, so that new genome projects can readily take advantage of them.
See doc/ folder for more details, especially genome-grid-jobs.txt
A common, ongoing task in research that uses genome databases is to compare an organism's genome and proteome with those of related organisms and with other sequence data sets (ESTs, SNPs, transposable elements). This task requires significant computational infrastructure, where reusable tools, protocols and resources are valuable and can significantly reduce duplicated infrastructure and maintenance effort. Software tools to fully assemble, analyze and compare these genomes are available to bioscientists, but the ability to employ them on genome data sets is limited to those with extensive computational resources and engineering talent.
Effective use of shared cyberinfrastructure remains a problem in bioinformatics today. Cluster and Grid computing in bioinformatics have followed other disciplines in parallelizing applications, but this is costly and limited to a subset of bioinformatics applications. Parallelizing data access has the potential to open many existing and new biology analyses to effective use of Grid and cluster computing. Gene finding, homology comparison, multiple alignment, phylogenetic comparison and functional analyses can all benefit from data grid approaches. Promising newer genome tools draw relevant data from several sources: cross-species homologies and large-scale functional and interaction data, along with genome sequences.
A common practice in genome analyses is ad hoc development of scripts to split and collate data and results. This can be automated for Grid computing by software designed to split genome sequences into subsets. Each subset can be analyzed independently, and the results collated. Depending on the analysis tools, this can be simple enough to handle in a generic manner.
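The split, analyze and collate pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code shipped in scripts/: the function names (read_fasta, partition, collate, toy_analysis) are hypothetical, the input is a small in-memory FASTA example, and toy_analysis stands in for a real tool such as BLAST that would run on a separate grid node.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def partition(records, n_parts):
    """Deal records round-robin into n_parts subsets of similar size."""
    parts = [[] for _ in range(n_parts)]
    for i, record in enumerate(records):
        parts[i % n_parts].append(record)
    return parts


def toy_analysis(subset):
    """Hypothetical stand-in for a real analysis tool: report sequence lengths."""
    return [(name, len(seq)) for name, seq in subset]


def collate(result_lists):
    """Concatenate the per-subset result lists into one list."""
    return [row for results in result_lists for row in results]


fasta = """>seq1
ACGTACGT
>seq2
GGCC
>seq3
TTTTTT
""".splitlines()

parts = partition(read_fasta(fasta), 2)
# On a grid, each subset would be sent to its own node; here they run in turn.
results = collate(toy_analysis(p) for p in parts)
print(sorted(results))
```

Because each subset is analyzed independently, the same partition and collate steps can wrap different tools, which is what makes a generic treatment feasible.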
The TeraGrid project (www.teragrid.org) is part of a shared cyberinfrastructure for the sciences, funded primarily by NSF. TeraGrid has compute centers at national supercomputer centers, including NCSA, PSC, and SDSC, with associated project sites at other universities, including Indiana University. An important focus for the future of TeraGrid is to enable practical uses among science communities, in hopes of providing common, cost-effective infrastructure, much in the same way the GMOD initiative is building common tools for genome databases. The TeraGrid system is particularly suited to providing a high-performance genome analysis service, for example delivering BLAST, gene finding and phylogenetic services via a Science Gateway (TeraGrid 2005) serving this community. TeraGrid can be used by genome projects for (a) building a genomic database pipeline for annotation, often a manual and computationally intensive task that needs to be updated frequently, and (b) delivery of genome database searches to the science community.
One time
  1. Obtain TeraGrid account: via web, see http://www.teragrid.org/userinfo/
  2. Establish certificates: Grid-security entries; test proxy; local workstation certificate
  3. Locate biology software: find and compile parallel applications
Processing, per analysis
  4. Locate and prepare data: partition, shred & randomize
  5. Transfer data to TeraGrid: FTP, secure-shell, other
  6. Configure and run analysis: Globus run scripts, attention to errors, queuing
  7. Return and collate results: post-process to combine results from nodes; e.g. to-GFF for map view of genome blast.
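Steps 4 and 7 can be illustrated with a short Python sketch: shred a genome sequence into overlapping fragments, shuffle them, and later map a tabular BLAST hit on a fragment back to genome coordinates as a GFF line. Several things here are assumptions for illustration rather than the conventions of this distribution: the function names, the "name:offset" fragment ID scheme, the use of the 12-column tabular BLAST format with the genome fragment as the query, and the choice of "match" as the GFF feature type.

```python
import random


def shred(name, seq, size, overlap):
    """Cut a sequence into overlapping fragments.

    Each fragment ID is "name:offset", where offset is the 1-based
    genome position of the fragment's first base (hypothetical scheme).
    Requires size > overlap.
    """
    step = size - overlap
    frags = []
    for start in range(0, max(len(seq) - overlap, 1), step):
        frags.append((f"{name}:{start + 1}", seq[start:start + size]))
    return frags


def randomize(frags, seed=0):
    """Return the fragments in shuffled order so subsets get a similar mix."""
    shuffled = list(frags)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def blast_tab_to_gff(line, source="blast"):
    """Convert one 12-column tabular BLAST hit to a GFF line.

    Assumes the query is a shredded genome fragment named "name:offset";
    query coordinates are shifted back onto the whole sequence.
    """
    f = line.rstrip("\n").split("\t")
    query, subject = f[0], f[1]
    qstart, qend, sstart, send = (int(x) for x in f[6:10])
    bitscore = f[11].strip()
    name, offset = query.rsplit(":", 1)
    shift = int(offset) - 1
    start, end = sorted((qstart + shift, qend + shift))
    # Reversed subject coordinates are taken to mean a minus-strand hit.
    strand = "+" if sstart <= send else "-"
    target = f"Target={subject} {min(sstart, send)} {max(sstart, send)}"
    return "\t".join([name, source, "match", str(start), str(end),
                      bitscore, strand, ".", target])


frags = shred("chr1", "ACGTACGTAC", size=4, overlap=1)
shuffled = randomize(frags)

# A made-up hit on the fragment that starts at genome position 7.
hit = "chr1:7\tprotX\t98.5\t3\t0\t0\t2\t4\t10\t12\t1e-5\t50.0"
gff_line = blast_tab_to_gff(hit)
print(gff_line)
```

Encoding the offset in the fragment ID keeps the collation step stateless: any node's output can be converted to genome coordinates without consulting the original partitioning.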
See doc/ for more details, such as genome-grid-jobs.txt
The docs and scripts in this distribution's drospege/ folder contain an earlier variant of this basic genome annotation pipeline/workflow process. See especially drospege/00-README.txt
Don Gilbert, firstname.lastname@example.org