Index of /gmod/genbank2chado

      Name                    Last modified       Size  Description

[DIR] Parent Directory        10-Mar-2008 14:13      -
[DIR] bin/                    27-Mar-2007 17:43      -
[DIR] cgi-bin/                02-May-2007 13:34      -
[DIR] conf/                   02-May-2007 13:34      -
[DIR] data/                   26-Mar-2007 22:02      -
[TXT] gb2chado-pod.html       02-May-2007 14:16    19k
[TXT] gb2chado-pod.txt        02-May-2007 14:15    14k
[DIR] htdocs/                 22-Mar-2007 14:09      -
[DIR] lib/                    02-Apr-2007 14:01      -
[TXT] setenv                  22-Mar-2007 14:47     1k

genbank2chado



NAME

genbank2chado package


SYNOPSIS

This package provides updates for GMOD and Bioperl tools, to simplify creating Chado genome databases using NCBI GenBank genomes.

  * Check prerequisites: some version of GMOD and GBrowse
  * Fetch and install new components in a safe, test directory 
    Find at http://eugenes.org/gmod/genbank2chado/ 
  * Load Postgres Chado database template
  * Fetch sample Genbank genome/chromosomes
  * Run Genbank2GFF3 for Chado db
  * Run Bulk_load_GFF3 to Chado db
  * View genome(s) with GBrowse.  An active instance is here
    http://server3.eugenes.org/cgi-bin/gmod01/gbrowse/dev_chado_ggb/

In summary, to load Yeast chromosome X to Chado database 'mychado', from a unix command-line, use

  curl ftp://ftp.ncbi.nih.gov/genomes/Saccharomyces_cerevisiae/CHR_X/NC_001142.gbk \
  | perl bp_genbank2gff3.pl -noCDS -in stdin -out stdout \
  | perl gmod_bulk_load_gff3.pl -dbname mychado -organism fromdata

GBROWSE_CHADO_EDIT

This May 2007 addition is a simple way to add community annotations to a Chado database. See http://server3.eugenes.org/cgi-bin/gmod01/gbrowse_details/dev_chado_ggb/?name=TAX4 and change the URL from /gbrowse_details/ to /gbrowse_edit/


QUICK VIEW

Here is a list of changed code for the Genbank to Chado conversion updates, for those with current installations of Bioperl, GMOD and GBrowse who want to test. These files are available now at http://eugenes.org/gmod/genbank2chado/ and I will add them to the standard CVS distributions.

Bioperl bp_genbank2gff3.pl
  bin/bp_genbank2gff3.pl   (Bioperl CVS scripts/Bio-GFF-DB/genbank2gff3.PLS)
  lib/Bio-new/SeqFeature/Tools/TypeMapper.pm      (required for genbank2gff3 update)
  lib/Bio-new/SeqFeature/Tools/Unflattener.pm     (minor change suggested for genbank2gff3)
    (put into your Bioperl lib/Bio/... directories)
GMOD bulk_load_gff3.pl
  bin/bulk_load_gff3.pl       (GMOD CVS schema/chado/load/bin/bulk_load_gff3.PLS)
  lib/Bio-new/GMOD/DB/Adapter.pm
    (put into your GMOD lib/Bio/... directories)
GBrowse using Chado DB adaptor
  lib/Bio-new/DB/Das/Chado.pm                   (GMOD/GBrowse CVS)
  lib/Bio-new/DB/Das/Chado/Segment.pm
  lib/Bio-new/DB/Das/Chado/Segment/Feature.pm
  lib/Bio-new/Graphics/Glyph/processed_transcript.pm   
    (add 'clipfeature = polypeptide' to gbrowse.conf with 'glyph = processed_transcript')
    (put into your GBrowse lib/Bio/... directories)
GBrowse_edit to Chado DB
  This is a quick'n'dirty or simple'n'sweet addition to show 
  how community annotations can be added/updated to Chado db. May 2007
  Test server:
  http://server3.eugenes.org/cgi-bin/gmod01/gbrowse/dev_chado_ggb/
   -- pick a gene to view, then detail view of a gene; 
   -- change URL from gbrowse_details/ to gbrowse_edit/ and try updates.
  Updated files include 
    cgi-bin/gbrowse_edit               (hacked from gbrowse_details)
    lib/Bio-new/DB/Das/Chado.pm                   (GMOD/GBrowse CVS)
    conf/gbrowse.conf/dev_chado_ggb.conf
    conf/update_features_in.sql and v_genepage3.sql


PREREQUISITES

You should have some version of a GMOD installation and GBrowse working, so that the general prerequisites are in place. This package includes current and new/test versions of the needed software (GMOD schema, GBrowse, Bioperl).

Postgres DB, the Apache web server, and the general Perl packages that GMOD requires (such as Pg.pm and GD.pm) should already be installed and tested on your system. Refer to the GMOD install documents (e.g. http://wiki.gmod.org/index.php/GMOD_FAQ)

This package is safe to test. It *does not* replace or overwrite current system installation, but installs completely in a new directory of your choice.

You will need to make some symbolic links to your Apache cgi-bin folder, and install new databases in your Postgres instance. You need roughly 10-100 MB of extra disk space to process genome data.


INSTALL

FETCH Genbank2Chado

This test update package is available at http://eugenes.org/gmod/genbank2chado/ and includes patches to Bioperl and GMOD perl scripts including GBrowse to handle fuller conversion of GenBank to Chado use. Included in the package are a full Bioperl and GBrowse file set, with configurations and updated modules.

  mkdir mygenbank2chado
  
  # Fetch this way; -n makes rsync do a dry run, so remove -n to actually copy the files.
  rsync -n -au rsync://eugenes.org/argos/gmod/web/gmod/genbank2chado/ mygenbank2chado/

CONFIGURE

Set the GMOD environment and Bioperl paths by editing the directory paths in 'setenv' and conf/default.conf. A few files contain a fixed path that you must update: '/bio/argos/gmod/in01'. The commands here use csh/tcsh syntax.


  cd mygenbank2chado/
  set TEST_HOME=`pwd`
  perl -pi -e"s,/bio/argos/gmod/in01,$TEST_HOME,g;"  \
     setenv \
     conf/default.conf \
     cgi-bin/gbrows*
  source setenv  # sets $GMOD_ROOT
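
After the substitution it can be useful to confirm that no file still carries the old fixed path. The following is a minimal sketch, not part of the package: it is written for a POSIX sh rather than the csh used above, the function name files_with_old_path is made up for illustration, and it assumes a grep that supports -r (GNU and BSD grep do).

```shell
#!/bin/sh
# Sketch: list files that still contain the package's fixed install path.
OLD_PATH='/bio/argos/gmod/in01'   # the fixed path named in this document

files_with_old_path() {
  # $1 = directory to scan. -F matches the path as a literal string,
  # -l prints only file names, -r descends into subdirectories.
  grep -rlF -- "$OLD_PATH" "$1" | sort
}

# Demo on a throwaway directory; the file contents are made up.
demo=$(mktemp -d)
printf 'setenv GMOD_ROOT %s\n' "$OLD_PATH" > "$demo/setenv"
printf 'root = /home/me/mygenbank2chado\n' > "$demo/default.conf"
files_with_old_path "$demo"   # lists only the demo setenv file
rm -rf "$demo"
```

Run against the real install directory, an empty result would suggest that every fixed path was replaced.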

If needed, update the Postgres database name in conf/gbrowse.conf/dev_chado_ggb.conf

    database      = dbi:Pg:dbname=dev_chado_01c;host=localhost

Create symlinks in your Apache web server cgi-bin to this TEST_HOME instance of GBrowse for viewing.

    cd /my/path/to/www/cgi-bin
    ln -s $TEST_HOME/cgi-bin gmod01    
    cd  $TEST_HOME

LOAD TEMPLATE Chado database

A template Chado database is included in TEST_HOME/data as chado_01_template.gz. This includes the current Chado schema plus standard values loaded into the Ontology, Organism and Db tables. It is also available at http://eugenes.org/gmod/genbank2chado/data/

Make sure that your Postgres environment is working (e.g. 'psql -l'), then load the chado_01_template this way:

  set dbname=chado_01_template
  set dbnote="GMOD Chado database template, version 0.1, 2007 march"
  
  createdb $dbname "$dbnote"
  createlang plpgsql $dbname  # is this still needed?
  (gzcat data/$dbname.gz | psql -d $dbname -f - )>&  log.chload &

Add a 'www' public user and privileges to chado db template:

  psql -At -d $dbname -o grantpublic.sql -c "\
    CREATE USER www;\
    SELECT 'grant select on table '||tablename||'  to www;' \
      FROM pg_tables where schemaname = 'public';" 
  
  psql -d $dbname -f grantpublic.sql
  
Then create a working instance to load data into:

  createdb --template=chado_01_template  dev_chado_01c


USAGE

A GenBank data file is loaded into a Chado database in a two-step process:

  1. bp_genbank2gff3.pl of BioPerl, with updates here, will
     convert GenBank to GFF data format suited to Chado.
    
  2. gmod_bulk_load_gff3.pl of GMOD, with updates below,
     will load that GFF to Chado database.

Fetch Genbank genomes

Genbank genome data is available from NCBI genomes section, ftp://ftp.ncbi.nih.gov/genomes/ or from a current mirror at ftp://bio-mirror.net/biomirror/ncbigenomes/

  mkdir data; cd data
  
  # fetch from NCBI, or this Indiana mirror 
  curl ftp://bio-mirror.net/biomirror/ncbigenomes/
  curl -RO ftp://bio-mirror.net/biomirror/ncbigenomes/Saccharomyces_cerevisiae/CHR_X/NC_001142.gbk.gz
  cd $TEST_HOME
  
  # .. etc. for other sample genomes of interest ..
  #   Drosophila_melanogaster/CHR_4/NC_004353.gbk.gz
  #   Caenorhabditis_elegans/CHR_III/NC_003281.gbk.gz
  #   Arabidopsis_thaliana/CHR_IV/NC_003075.gbk.gz
  #   M_musculus/CHR_19/mm_ref_chr19.gbk.gz
  #   H_sapiens/CHR_19/hs_ref_chr19.gbk.gz

Genbank to GFF

The Bioperl script bp_genbank2gff3.pl will convert to GFF v3 suited to Chado loading. The new -noCDS flag is required for this. Use '-s' flag to summarize features found.

  source setenv  # need perl paths now
  
  perl bin/bp_genbank2gff3.pl -noCDS -s -o data/ data/NC_001142.gbk.gz
  
  # .. etc. for ..
  #   data/NC_004353.gbk.gz  data/NC_003281.gbk.gz data/NC_003075.gbk.gz data/mm_ref_chr19.gbk.gz
  perl bin/bp_genbank2gff3.pl -noCDS -s -o data/ data/hs_ref_chr19.gbk.gz
  ## there are parse problems with this one in last NT_011295 contig; drop it (few features)
  grep -v ^NT_011295 data/hs_ref_chr19.gbk.gz.gff > data/hs_ref_chr19.fixed.gff
 
  # check data
  head data/*.gbk.gz.gff
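
Beyond looking at the first lines, a quick tally of feature types (GFF column 3) can show whether genes, mRNAs and exons came through. This is a rough sketch in POSIX sh with awk; the function name count_gff_types and the sample rows are invented for illustration and are not output of bp_genbank2gff3.pl.

```shell
#!/bin/sh
# Sketch: count features per type (column 3) in a GFF3 file.
count_gff_types() {
  awk -F'\t' '
    /^##FASTA/ { exit }          # sequence section follows; stop reading
    /^#/ || NF < 9 { next }      # skip directives, comments, short lines
    { n[$3]++ }
    END { for (t in n) print t "\t" n[t] }
  ' "$1" | sort
}

# Demo with made-up rows (tab-separated, nine columns each).
sample=$(mktemp)
{
  printf '##gff-version 3\n'
  printf 'chrDemo\tGenBank\tchromosome\t1\t5000\t.\t+\t.\tID=chrDemo\n'
  printf 'chrDemo\tGenBank\tgene\t100\t900\t.\t+\t.\tID=g1\n'
  printf 'chrDemo\tGenBank\tmRNA\t100\t900\t.\t+\t.\tID=m1;Parent=g1\n'
  printf 'chrDemo\tGenBank\texon\t100\t400\t.\t+\t.\tParent=m1\n'
  printf 'chrDemo\tGenBank\texon\t600\t900\t.\t+\t.\tParent=m1\n'
} > "$sample"
count_gff_types "$sample"   # one line per type; exon is counted twice here
rm -f "$sample"
```

The same function could be pointed at a converted file such as data/NC_001142.gbk.gz.gff.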

GFF to Chado

Use the GMOD script gmod_bulk_load_gff3.pl for this. Note that gmod_bulk_load_gff3 will only handle ONE organism at a time. Choose the best --dbxref per organism (WormBase, SGD, MGI, FLYBASE), depending on the contents of the GenBank annotations. The 'GeneID' dbxref is standard for most GenBank genomes.

  bin/gmod_bulk_load_gff3.pl \
    --dbname dev_chado_01c \
    --dbxref GeneID \
    --organism fromdata \
    --gff data/NC_001142.gbk.gz.gff
  bin/gmod_bulk_load_gff3.pl  --dbname dev_chado_01c --dbxref GeneID --organism fromdata \
    --gff data/NC_004353.gbk.gz.gff
    
  ... etc ...
  
  # check data  
  psql -d dev_chado_01c -c'select count(f.*), \
   (select common_name from organism where organism_id = f.organism_id) as species \
   from feature f group by f.organism_id;'
  psql -d dev_chado_01c -c'select count(f.*), \
   (select common_name from organism where organism_id = f.organism_id) as species \
   from feature f where f.seqlen>0 group by f.organism_id;'
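
Because the loader handles one organism per run, several GFF files mean several loader invocations. The sketch below only prints the commands so they can be reviewed first; it is written in POSIX sh, the function name print_load_commands is hypothetical, and the flags are the ones used in this document.

```shell
#!/bin/sh
# Sketch: print one loader command per GFF file. Nothing is executed.
print_load_commands() {
  db=$1        # target Chado database name
  dbxref=$2    # dbxref source, e.g. GeneID
  shift 2      # remaining arguments are GFF files
  for gff in "$@"; do
    echo "bin/gmod_bulk_load_gff3.pl --dbname $db --dbxref $dbxref --organism fromdata --gff $gff"
  done
}

print_load_commands dev_chado_01c GeneID \
  data/NC_001142.gbk.gz.gff data/NC_004353.gbk.gz.gff
```

Printing first keeps the step safe to try; once the list looks right, the output could be saved to a file and run from there.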

GBrowse view

An active instance is here: http://server3.eugenes.org/cgi-bin/gmod01/gbrowse/dev_chado_ggb/

The install steps included making a symlink from your Apache www/cgi-bin folder to this TEST_HOME/cgi-bin with gbrowse software. This gbrowse instance needs the correct path to TEST_HOME, and you may need adjustments when using Mod_Perl with Apache server.

At this point your web server should find this test gbrowse at http://YOUR_SERVER/cgi-bin/gmod01/gbrowse/ with the Chado genome database at cgi-bin/gmod01/gbrowse/dev_chado_ggb/

If this fails, try the default gbrowse yeast data set at cgi-bin/gmod01/gbrowse/yeast_chr1/. If that also fails, the problem lies outside what this test example covers. If it works but dev_chado_ggb/ fails, check the settings in your gbrowse.conf/dev_chado_ggb.conf and, as needed, edit this setting to match your chado database name:

    database      = dbi:Pg:dbname=dev_chado_01c;host=localhost
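
To compare the configured database name with what Postgres actually has, the name can be pulled out of the conf file. This is a small sketch in POSIX sh; the function name conf_dbname is made up, and it assumes the setting is written on one line in the 'database = dbi:Pg:dbname=...' form shown in this document.

```shell
#!/bin/sh
# Sketch: print the dbname from the first "database = dbi:Pg:dbname=..." line.
conf_dbname() {
  sed -n 's/^[[:space:]]*database[[:space:]]*=.*dbname=\([^;[:space:]]*\).*/\1/p' "$1" \
    | head -n 1
}

# Demo with a throwaway conf file holding the setting from this document.
conf=$(mktemp)
printf 'description   = demo\n' > "$conf"
printf 'database      = dbi:Pg:dbname=dev_chado_01c;host=localhost\n' >> "$conf"
conf_dbname "$conf"   # prints dev_chado_01c
rm -f "$conf"
```

The printed name can then be looked for in the output of 'psql -l'.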

Check your web server error logs for messages from this software.

GBrowse_edit to Chado DB

Use of this assumes you have installed and populated your Chado database. It should work for any Chado db. The only alterations to the Chado db are to (1) add a new update_features table and a view, (2) populate this table with a view that extracts most feature properties, and (3) give the public www user rights to update this table.

To install, read and execute these new Chado db additions: conf/v_genepage3.sql and conf/update_features_in.sql

Install these updated perl scripts

    cgi-bin/gbrowse_edit               (hacked from gbrowse_details)
    lib/Bio-new/DB/Das/Chado.pm                   (GMOD/GBrowse CVS)

Edit your gbrowse.conf to add [feature:EDITS] stanzas as in conf/gbrowse.conf/dev_chado_ggb.conf

Then any gbrowse_details/ view can be changed to gbrowse_edit/, and form submissions should go into your Chado update_features table.

Reason for simple update_features

Think about a spreadsheet style where one table serves for all data fields and values, with one column for field names and one for values, or many columns for field tags and values. This might make a good intermediate table structure for simple annotation uses via wiki, gbrowse and other tools. It is related to the simple data tuple in XML-RDF (RSS, etc.) and the like. It would be non-normalized, but would allow gene-centric (or feature-centric) use for updates. One could store all of a gene object's data that way; see e.g. the sample Chado gene page and outputs here:

http://www.gmod.org/wiki/index.php/Sample_Chado_SQL#gene_page

Suppose we added such a table, 'update_features' to a chado database, and let annotation tools write to it updates with a structure like

  feature-name/id   field-tag   value   (status type: new, change, delete) (update-id)

This would serve as a staging table for updating the main chado tables, and offer a simple schema api that would be easy to use from other tools. Such a gene_flat_table could be populated from a chado view or procedure, and then updated via external programs interactively. An annotation tool would be able to search it simply (one or two fields), retrieve e.g. all values for a given gene feature-id, and update a given feature-id value easily.
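
The flat tuple idea can be tried out with a plain tab-separated file before any table exists. The sketch below is only an illustration in POSIX sh with awk: the column order follows the structure line above (feature-name/id, field-tag, value, status, update-id), while the rows and the function name values_for_feature are invented and do not reflect the real update_features table definition.

```shell
#!/bin/sh
# Sketch: retrieve all staged values for one feature id from a flat,
# tab-separated file. Columns: feature-id, field-tag, value, status, update-id.
values_for_feature() {
  # $1 = staging file, $2 = feature id; prints field-tag, value, status
  awk -F'\t' -v id="$2" '$1 == id { print $2 "\t" $3 "\t" $4 }' "$1"
}

# Demo rows (made up).
staging=$(mktemp)
{
  printf 'geneA\tNote\tfirst community comment\tnew\t1\n'
  printf 'geneB\tNote\tunrelated comment\tnew\t2\n'
  printf 'geneA\tsynonym\tdemo-alias\tchange\t3\n'
} > "$staging"
values_for_feature "$staging" geneA   # prints the two geneA rows
rm -f "$staging"
```

A single equality test on one column is all the lookup needs, which is the kind of simple access the staging structure is meant to allow.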


CHANGES

Genbank2gff3 changes
  * Polypeptide alternate gene model added (--noCDS option)
    Standard gene model:  gene > mRNA > (UTR,CDS,exon)
    G-R-P-E alternate model:   gene > mRNA > polypeptide > exon
    Polypeptide contains all the important protein info (IDs, translation, GO terms)
  * IO pipes: curl ftp://ncbigenomes/... | genbank2gff3 --in stdin --out stdout | gff2chado ...
  
  * GenBank main record fields are added to source feature
    and the sourcetype, commonly chromosome for genomes, is used.
      
  * Gene Model handling for ncRNA and pseudogenes is added.
  * GFF header is cleaner and more informative, and there is a GFF_VERSION option
    
  * GFF ##FASTA inclusion is improved, and translation sequence stored there.
     
  * FT -> GFF attribute mapping is improved.
  
  * --format choice of SeqIO input formats (GenBank default). 
    Uniprot/Swissprot and EMBL produce useful GFF.
    
  * SeqFeature::Tools::TypeMapper has a few FT -> SOFA additions, more flexible usage.
Bulk_load_gff3 changes
  * auto-inserts (--noaddfpcv) these items:
    db table database IDs,  cvterm and cv fields
 
  * finds organism from GFF source line ( --organism fromdata)
 
  * sets reference class/type in chado database (chromosome, region, ...)
 
  * Bio::GMOD::DB::Adapter now easier to add new tables to update (cvterm,cv,db,..)
Gbrowse Bio/DB/Das changes

  * Find map reference class type from db (cvtermprop table)
  * correction to name2term for SO/non-SO terms (cv name is needed)
  * ugly patch to attributes() to fetch polypeptide translation (residues)
    - should go into 'gfffeatureatts' chado procedure
  * patch Glyph/processed_transcript.pm
  * species() and desc() additions to Chado::Segments

GBrowse_edit Chado
  * added May 2007
TODO
  * check analysis handling for predictions, blast-match gff


AUTHOR

Don Gilbert (gilbertd@indiana.edu)

GPL (c) 2007 Indiana University.