Index of /gmod/genogrid

      Name                    Last modified       Size  Description

[DIR] Parent Directory        10-Mar-2008 14:13      -
[TXT] build.properties        09-Jun-2008 10:48     1k
[DIR] doc/                    12-Jun-2008 10:58      -
[DIR] tandy/                  26-Jun-2008 16:46      -
[DIR] scripts/                26-Jun-2008 17:50      -
[   ] build.xml               26-Jun-2008 18:23     5k
[   ] genogrid-0.5.zip        26-Jun-2008 18:23   544k
[DIR] drospege/               20-Sep-2010 12:26      -

GMOD Genome Grid

README for GMOD Genome Grid

  Genome analysis and annotation via Grid computing

This subproject builds reusable tools and workflows for genome analysis and annotation using shared cyberinfrastructure (Grids or clusters). It contains collections of scripts, documents and workflows for running existing genome analysis tools (BLAST, homology tools, predictors, comparative and phylogenetic analyses) on available cyberinfrastructure.

The emphasis is on simplifying the use of grids and genome tools, so that new genome projects can readily take advantage of them.

See the doc/ folder for more details, especially genome-grid-jobs.txt


REPOSITORY

  http://gmod.org/Genome_grid
  http://gmod.cvs.sourceforge.net/gmod/genogrid/


BACKGROUND

A common, ongoing task for research that uses genome databases is to compare an organism's genome and proteome with those of related organisms and with other sequence data sets (ESTs, SNPs, transposable elements). This task requires significant computational infrastructure, and reusable tools, protocols and resources can significantly reduce duplicated infrastructure and maintenance effort. Software tools to fully assemble, analyze and compare these genomes are available to bioscientists, but the ability to employ them on genome data sets is limited to those with extensive computational resources and engineering talent.

Effective use of shared cyberinfrastructure in bioinformatics remains a problem today. Cluster and Grid computing in bioinformatics have followed other disciplines in parallelizing applications, but this is costly and limited to a subset of bioinformatics applications. Parallelizing data access has the potential to open many existing and new biology analyses to effective use of Grid and cluster computing. Gene finding, homology comparison, multiple alignment, phylogenetic comparison and functional analyses can all benefit from data grid approaches. Promising newer genome tools draw relevant data from several sources: cross-species homologies and large-scale functional and interaction data, along with genome sequences.

A common practice in genome analyses is the ad hoc development of scripts to split and collate data and results. This can be automated for Grid computing by software designed to split genome sequences into subsets. Each subset can be analyzed independently and the results collated. Depending on the analysis tools, this can be simple enough to handle in a generic manner.
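
The split, analyze and collate pattern described above can be sketched in a few lines of Python. This is only an illustration and is not taken from the scripts in this distribution: the function names (read_fasta, split_into_subsets, collate) are hypothetical, and sequence length stands in for a real per-subset analysis such as a BLAST run.

```python
import random

def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def split_into_subsets(records, n_subsets, seed=0):
    """Shuffle the records, then deal them round-robin into n_subsets lists."""
    records = list(records)
    random.Random(seed).shuffle(records)  # randomize to even out the load
    subsets = [[] for _ in range(n_subsets)]
    for i, record in enumerate(records):
        subsets[i % n_subsets].append(record)
    return subsets

def collate(per_subset_results):
    """Merge the result rows of all subsets into one sorted list."""
    return sorted(row for rows in per_subset_results for row in rows)

if __name__ == "__main__":
    fasta = [">seqA", "ACGT", "ACG", ">seqB", "GG", ">seqC", "TTTTT"]
    subsets = split_into_subsets(read_fasta(fasta), n_subsets=2)
    # Stand-in analysis: each subset could be processed on a separate node.
    results = [[(name, len(seq)) for name, seq in subset] for subset in subsets]
    print(collate(results))  # [('seqA', 7), ('seqB', 2), ('seqC', 5)]
```

Shuffling before dealing is one simple way to avoid placing all long sequences in the same subset; a size-aware partition would balance the load more carefully.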

The TeraGrid project (www.teragrid.org) is part of a shared cyberinfrastructure for the sciences, funded primarily by NSF. TeraGrid has compute centers at national supercomputer centers, including NCSA, PSC, and SDSC, with associated project sites at other universities, including Indiana University. An important focus for the future of TeraGrid is to enable practical uses among science communities, in hopes of providing common, cost-effective infrastructure in much the same way that the GMOD initiative is building common tools for genome databases. The TeraGrid system is particularly suitable for providing a high-performance genome analysis service, for example delivering BLAST, gene finding and phylogenetic services via a Science Gateway (TeraGrid 2005) serving this community. TeraGrid can be used by genome projects for (a) building a genomic database pipeline for annotation, often a manual and computationally intensive task that needs to be updated frequently, and (b) delivery of genome database searches to the science community.

TERAGRID USAGE STEPS

  One time
    1. Obtain TeraGrid account
       Via web, see http://www.teragrid.org/userinfo/
    2. Establish certificates
       Grid-security entries; test proxy; local workstation certificate
    3. Locate biology software
       Find and compile parallel applications

  Per analysis (processing)
    4. Locate and prepare data
       Partition, shred & randomize
    5. Transfer data to TeraGrid
       FTP, secure-shell, other
    6. Configure and run analysis
       Globus run scripts, attention to errors, queuing
    7. Return and collate results
       Post-process to combine results from nodes; e.g. to-GFF for map view of genome blast.
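
As an illustration of step 7, the following Python sketch combines tabular BLAST hits returned from several nodes into GFF3 lines for a genome map view. It is not the post-processing script shipped here. It assumes the standard 12-column tabular BLAST output with the genome as the subject and a blastn-like search (query coordinates ascending). The function names, the "blast" source, the "match" feature type and the use of the e-value as the GFF score are illustrative choices.

```python
def blast_tab_to_gff(line, source="blast", feature_type="match"):
    """Convert one 12-column tabular BLAST hit into a GFF3 line on the subject."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError("expected 12 tab-separated columns, got %d" % len(fields))
    query_id, subject_id = fields[0], fields[1]
    qstart, qend, sstart, send = (int(x) for x in fields[6:10])
    evalue = fields[10]
    # Minus-strand hits report sstart > send; GFF requires start <= end.
    strand = "+" if sstart <= send else "-"
    start, end = min(sstart, send), max(sstart, send)
    attributes = "Target=%s %d %d" % (query_id, qstart, qend)
    return "\t".join([subject_id, source, feature_type, str(start), str(end),
                      evalue, strand, ".", attributes])

def collate_to_gff(node_outputs):
    """Merge hit lines from several node outputs into sorted GFF3 lines."""
    rows = []
    for lines in node_outputs:
        for line in lines:
            # Skip blank lines and comment lines in the node output.
            if line.strip() and not line.startswith("#"):
                rows.append(blast_tab_to_gff(line))
    # Order by sequence id, then by start position, for map display.
    rows.sort(key=lambda r: (r.split("\t")[0], int(r.split("\t")[3])))
    return ["##gff-version 3"] + rows

if __name__ == "__main__":
    node1 = ["est1\tchr2\t98.5\t200\t3\t0\t1\t200\t5200\t5001\t1e-50\t380"]
    node2 = ["# comment line",
             "est2\tchr1\t90.0\t50\t5\t0\t1\t50\t10\t59\t2e-5\t60"]
    for gff_line in collate_to_gff([node1, node2]):
        print(gff_line)
```

If the genome was shredded into chunks in step 4, the chunk offset would also have to be added back to the start and end coordinates before writing GFF; that step is omitted here.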


SEE ALSO

See doc/ for more details, such as genome-grid-jobs.txt

The docs and scripts in this distribution's drospege/ folder contain an earlier variant of this basic genome annotation pipeline/workflow. See especially drospege/00-README.txt


PROJECT CONTRIBUTORS

  Don Gilbert, gilbertd@indiana.edu