GMOD is supported by a specific cooperative agreement from the USDA Agricultural Research Service, and by NIH grants co-funded from the National Human Genome Research Institute and the National Institute of General Medical Sciences.
NAME
genbank2chado package
SYNOPSIS
This package provides updates for GMOD and Bioperl tools, to simplify creating Chado genome databases using NCBI GenBank genomes.
* Check prerequisites: some version of GMOD and GBrowse
* Fetch and install new components in a safe, test directory
Find at http://eugenes.org/gmod/genbank2chado/
* Load Postgres Chado database template
* Fetch sample Genbank genome/chromosomes
* Run Genbank2GFF3 for Chado db
* Run Bulk_load_GFF3 to Chado db
* View genome(s) with GBrowse. An active instance is here
http://server3.eugenes.org/cgi-bin/gmod01/gbrowse/dev_chado_ggb/
In summary, to load Yeast chromosome X into the Chado database 'mychado' from a Unix command line, use:
    curl ftp://ftp.ncbi.nih.gov/genomes/Saccharomyces_cerevisiae/CHR_X/NC_001142.gbk \
      | perl bp_genbank2gff3.pl -noCDS -in stdin -out stdout \
      | perl gmod_bulk_load_gff3.pl -dbname mychado -organism fromdata
GBROWSE_CHADO_EDIT
This May 2007 addition is a simple way to add community annotations to a Chado database. See http://server3.eugenes.org/cgi-bin/gmod01/gbrowse_details/dev_chado_ggb/?name=TAX4 and change /gbrowse_details/ in the URL to /gbrowse_edit/
QUICK VIEW
Here is a list of changed code for the GenBank to Chado conversion updates, for those with current installations of Bioperl, GMOD and GBrowse who want to test. Find it now at http://eugenes.org/gmod/genbank2chado/
I will add these to the standard CVS distributions.
PREREQUISITES
You should have a working installation of some version of GMOD and GBrowse, so that the general prerequisites are in place. This package includes current and new/test versions of the needed software (GMOD schema, GBrowse, Bioperl). The requirements for GMOD, which include a Postgres database, an Apache web server, and general Perl packages such as Pg.pm and GD.pm, should already be in tested use on your system. Refer to the GMOD install documents (e.g. http://wiki.gmod.org/index.php/GMOD_FAQ).
This package is safe to test. It *does not* replace or overwrite your current system installation, but installs completely in a new directory of your choice. You will need to make some symbolic links to your Apache cgi-bin folder, and install new databases in your Postgres instance. You need 10-100 MB of extra disk space to process genome data.
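A quick way to see which prerequisites are missing is to probe for them before installing. The sketch below is written for a POSIX sh (not the csh used elsewhere in this document), and the function names `missing_commands` and `missing_perl_modules` are made up for illustration. The module names in the usage comment are the ones mentioned above; the command list is an assumption to adjust for your system.

```shell
#!/bin/sh
# Print each named command that is NOT found on PATH, one per line.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Print each named Perl module that fails to load, one per line.
# (If perl itself is missing, every module is reported.)
missing_perl_modules() {
    for mod in "$@"; do
        perl -M"$mod" -e 1 >/dev/null 2>&1 || printf '%s\n' "$mod"
    done
}

# Example (hypothetical command list; module names from the text above):
#   missing_commands perl psql createdb rsync curl
#   missing_perl_modules Pg GD
```

Empty output from both calls suggests the basics are in place; it does not prove that GMOD or GBrowse themselves are configured correctly.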
INSTALL
FETCH Genbank2Chado
This test update package is available at http://eugenes.org/gmod/genbank2chado/ and includes patches to Bioperl and GMOD Perl scripts, including GBrowse, to handle fuller conversion of GenBank data for Chado use. The package includes a full Bioperl and GBrowse file set, with configurations and updated modules.
    mkdir mygenbank2chado
    # Fetch this way; -n makes it a dry run, remove -n to do it.
    rsync -n -au rsync://eugenes.org/argos/gmod/web/gmod/genbank2chado/ mygenbank2chado/
CONFIGURE
Set the GMOD environment paths and the Bioperl path by editing the directory paths in 'setenv' and conf/default.conf. A few files have a fixed path that you must update: '/bio/argos/gmod/in01'
    # csh/tcsh syntax
    cd mygenbank2chado/
    set TEST_HOME=`pwd`
    perl -pi -e"s,/bio/argos/gmod/in01,$TEST_HOME,g;" \
      setenv \
      conf/default.conf \
      cgi-bin/gbrows*
    source setenv  # sets $GMOD_ROOT
Update as needed the Postgres database name in conf/gbrowse.conf/dev_chado_ggb.conf
    database = dbi:Pg:dbname=dev_chado_01c;host=localhost
Create symlinks in your Apache web server cgi-bin to this TEST_HOME instance of GBrowse for viewing.
    cd /my/path/to/www/cgi-bin
    ln -s $TEST_HOME/cgi-bin gmod01
    cd $TEST_HOME
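For readers who prefer a POSIX sh over csh, the path substitution above can be sketched as a small function. This is an illustration, not part of the package: `relocate_paths` is a hypothetical name, and the sketch uses sed instead of `perl -pi`. It assumes that neither path contains a comma or regex metacharacters.

```shell
#!/bin/sh
# Replace every occurrence of OLD with NEW in the given files, in place.
# Usage: relocate_paths OLD NEW file...
relocate_paths() {
    old=$1; new=$2; shift 2
    for f in "$@"; do
        # Skip anything that is not a regular file (e.g. an unmatched glob).
        [ -f "$f" ] || continue
        # "," is the sed delimiter, so "/" in the paths needs no escaping.
        # Writing back with cat keeps the original file's permissions,
        # which matters for the executable scripts under cgi-bin.
        sed "s,$old,$new,g" "$f" > "$f.tmp" && cat "$f.tmp" > "$f" && rm -f "$f.tmp"
    done
}

# Example, run from the top of the test directory:
#   relocate_paths /bio/argos/gmod/in01 "$(pwd)" setenv conf/default.conf cgi-bin/gbrows*
```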
LOAD TEMPLATE Chado database
A template Chado database is included in TEST_HOME/data as chado_01_template.gz. It contains the current Chado schema plus loaded standard values for the Ontology, Organism, and Db tables. Find it also at http://eugenes.org/gmod/genbank2chado/data/
Make sure that your Postgres environment is working, e.g. 'psql -l'. Load the chado_01_template this way:
    set dbname=chado_01_template
    set dbnote="GMOD Chado database template, version 0.1, 2007 march"
    createdb $dbname "$dbnote"
    createlang plpgsql $dbname  # is this still needed?
    # the load runs in the background; wait for it to finish before the next step
    (gzcat data/$dbname.gz | psql -d $dbname -f - ) >& log.chload &
Add a 'www' public user and privileges to the Chado db template:
    psql -At -d $dbname -o grantpublic.sql -c "\
    CREATE USER www;\
    SELECT 'grant select on table '||tablename||' to www;' \
    FROM pg_tables where schemaname = 'public';"
    psql -d $dbname -f grantpublic.sql
Then create a working instance to load data into:
    createdb --template=chado_01_template dev_chado_01c
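The grant step above builds its SQL inside the database. As an illustration of what that query produces, the same statements can be generated from a plain list of table names in a POSIX sh. The function name `gen_grants` and its optional privilege argument are hypothetical additions, not part of the package.

```shell
#!/bin/sh
# Read table names on stdin (one per line) and print a GRANT statement
# for each one. Usage: gen_grants ROLE [PRIVILEGE]   (default: select)
gen_grants() {
    role=$1
    priv=${2:-select}
    while IFS= read -r table; do
        # Ignore blank lines in the input.
        [ -n "$table" ] || continue
        printf 'grant %s on table %s to %s;\n' "$priv" "$table" "$role"
    done
}

# Example (needs a running Postgres with the template loaded):
#   psql -At -d chado_01_template \
#        -c "SELECT tablename FROM pg_tables WHERE schemaname = 'public'" \
#     | gen_grants www > grantpublic.sql
```

Reviewing grantpublic.sql before running it with `psql -f` is a cheap way to confirm that only the intended tables are exposed to the public user.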
USAGE
A GenBank data file is loaded into a Chado database in a two-step process:
1. bp_genbank2gff3.pl of BioPerl, with updates here, will
convert GenBank to GFF data format suited to Chado.
2. gmod_bulk_load_gff3.pl of GMOD, with updates below,
will load that GFF to Chado database.
Fetch Genbank genomes
GenBank genome data is available from the NCBI genomes section, ftp://ftp.ncbi.nih.gov/genomes/ or from a current mirror at ftp://bio-mirror.net/biomirror/ncbigenomes/
    mkdir data; cd data
    # fetch from NCBI, or this Indiana mirror
    curl ftp://bio-mirror.net/biomirror/ncbigenomes/
    curl -RO ftp://bio-mirror.net/biomirror/ncbigenomes/Saccharomyces_cerevisiae/CHR_X/NC_001142.gbk.gz
    cd $TEST_HOME
.. etc for other sample genomes of interest ..
    Drosophila_melanogaster/CHR_4/NC_004353.gbk.gz
    Caenorhabditis_elegans/CHR_III/NC_003281.gbk.gz
    Arabidopsis_thaliana/CHR_IV/NC_003075.gbk.gz
    M_musculus/CHR_19/mm_ref_chr19.gbk.gz
    H_sapiens/CHR_19/hs_ref_chr19.gbk.gz
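To fetch several of the sample genomes in one pass, the relative paths listed above can be expanded into full mirror URLs. This is a minimal POSIX sh sketch; `genome_urls` and the `MIRROR` variable are names invented here, and the download loop in the comment needs network access and assumes the mirror is still reachable.

```shell
#!/bin/sh
# Base URL of the mirror named in the text above.
MIRROR=ftp://bio-mirror.net/biomirror/ncbigenomes

# Print the full download URL for each relative path given as an argument.
genome_urls() {
    for path in "$@"; do
        printf '%s/%s\n' "$MIRROR" "$path"
    done
}

# Example: download two of the samples into data/
#   genome_urls Saccharomyces_cerevisiae/CHR_X/NC_001142.gbk.gz \
#               Drosophila_melanogaster/CHR_4/NC_004353.gbk.gz |
#   while read -r url; do (cd data && curl -RO "$url"); done
```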
Genbank to GFF
The Bioperl script bp_genbank2gff3.pl will convert to GFF v3 suited to Chado loading. The new -noCDS flag is required for this. Use the '-s' flag to summarize the features found.
    source setenv  # need perl paths now
    perl bin/bp_genbank2gff3.pl -noCDS -s -o data/ data/NC_001142.gbk.gz
    # .. etc for ..
    #   data/NC_004353.gbk.gz data/NC_003281.gbk.gz data/NC_003075.gbk.gz data/mm_ref_chr19.gbk.gz
    perl bin/bp_genbank2gff3.pl -noCDS -s -o data/ data/hs_ref_chr19.gbk.gz
    ## there are parse problems with this one in last NT_011295 contig; drop it (few features)
    grep -v ^NT_011295 data/hs_ref_chr19.gbk.gz.gff > data/hs_ref_chr19.fixed.gff
    # check data
    head data/*.gbk.gz.gff
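The grep filter above drops every line that begins with the contig name. A slightly stricter variant, sketched here in POSIX sh with awk, compares the whole first (seqid) column instead, so a contig whose name merely starts with the same characters is kept. `drop_seqid` is a hypothetical helper name, not part of Bioperl or GMOD.

```shell
#!/bin/sh
# Copy GFF from stdin to stdout, omitting feature lines whose first
# (seqid) column is exactly SEQID. Comment and directive lines start
# with "#", so they never equal SEQID and pass through unchanged.
# Usage: drop_seqid SEQID < in.gff > out.gff
drop_seqid() {
    awk -F '\t' -v id="$1" '$1 != id'
}

# Example:
#   drop_seqid NT_011295 < data/hs_ref_chr19.gbk.gz.gff > data/hs_ref_chr19.fixed.gff
```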
GFF to Chado
Use the GMOD script gmod_bulk_load_gff3.pl for this. Note that gmod_bulk_load_gff3 will only handle ONE organism at a time. Choose the best --dbxref per organism (WormBase, SGD, MGI, FLYBASE), depending on the contents of the GenBank annotations. The 'GeneID' dbxref is standard for most GenBank genomes.
    bin/gmod_bulk_load_gff3.pl \
      --dbname dev_chado_01c \
      --dbxref GeneID \
      --organism fromdata \
      --gff data/NC_001142.gbk.gz.gff
    bin/gmod_bulk_load_gff3.pl --dbname dev_chado_01c --dbxref GeneID --organism fromdata \
      --gff data/NC_004353.gbk.gz.gff
    ... etc ...
    # check data: feature counts per organism
    psql -d dev_chado_01c -c'select count(f.*), \
    (select common_name from organism where organism_id = f.organism_id) as species \
    from feature f group by f.organism_id;'
    # check data: counts of features that have sequence
    psql -d dev_chado_01c -c'select count(f.*), \
    (select common_name from organism where organism_id = f.organism_id) as species \
    from feature f where f.seqlen>0 group by f.organism_id;'
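Because the loader handles one organism per run, loading several genomes means one invocation per GFF file. The POSIX sh sketch below only prints the commands (a dry run), using the same options as the examples above; `loader_commands` is a hypothetical name, and passing a single --dbxref for all files is a simplification.

```shell
#!/bin/sh
# Print one loader command per GFF file, since each run handles one organism.
# Usage: loader_commands DBNAME DBXREF file.gff...
loader_commands() {
    dbname=$1; dbxref=$2; shift 2
    for gff in "$@"; do
        printf 'bin/gmod_bulk_load_gff3.pl --dbname %s --dbxref %s --organism fromdata --gff %s\n' \
            "$dbname" "$dbxref" "$gff"
    done
}

# Example: review the plan first, then pipe it to sh to run it.
#   loader_commands dev_chado_01c GeneID data/NC_001142.gbk.gz.gff data/NC_004353.gbk.gz.gff
#   loader_commands dev_chado_01c GeneID data/NC_001142.gbk.gz.gff | sh
```

Printing before executing keeps a record of exactly what was loaded, which helps when one genome needs a different --dbxref and has to be rerun separately.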
GBrowse view
An active instance is here: http://server3.eugenes.org/cgi-bin/gmod01/gbrowse/dev_chado_ggb/
The install steps included making a symlink from your Apache www/cgi-bin folder to this TEST_HOME/cgi-bin with the gbrowse software. This gbrowse instance needs the correct path to TEST_HOME, and you may need adjustments when using mod_perl with the Apache server.
At this point your web server should find this test gbrowse at
    http://YOUR_SERVER/cgi-bin/gmod01/gbrowse/
with the Chado genome database as
    cgi-bin/gmod01/gbrowse/dev_chado_ggb/
If this fails, try the default gbrowse yeast data set as
    cgi-bin/gmod01/gbrowse/yeast_chr1/
If that also fails, the problem lies outside what this test example covers. If it works and dev_chado_ggb/ fails, check the settings in your gbrowse.conf/dev_chado_ggb.conf. As needed, edit this setting to match your Chado database name:
    database = dbi:Pg:dbname=dev_chado_01c;host=localhost
Check your web server error logs for messages from this software.
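When dev_chado_ggb/ fails but the yeast data set works, a mismatched database name is a likely cause. The POSIX sh sketch below pulls the name out of a config file so it can be compared with what Postgres reports. `conf_dbname` is a hypothetical helper, and it assumes the setting is written on one line in the `database = dbi:Pg:dbname=...` form shown above.

```shell
#!/bin/sh
# Print the Postgres database name from the first
# "database = dbi:Pg:dbname=NAME;..." line of a GBrowse config file.
# Usage: conf_dbname path/to/file.conf
conf_dbname() {
    sed -n 's/^[[:space:]]*database[[:space:]]*=.*dbname=\([^;[:space:]]*\).*/\1/p' "$1" |
        head -n 1
}

# Example: check that the configured database exists in Postgres.
#   name=$(conf_dbname conf/gbrowse.conf/dev_chado_ggb.conf)
#   psql -l | grep "$name"
```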
GBrowse_edit to Chado DB
Use of this assumes you have installed and populated your Chado database. It should work for any Chado db. The only alterations to the Chado db are (1) add a new update_features table and a view, (2) populate this table with a view that extracts most feature properties, (3) allow the public www user rights to update this table.
To install, read and execute these new Chado db additions:
    conf/v_genepage3.sql and conf/update_features_in.sql
Install these updated perl scripts:
cgi-bin/gbrowse_edit (hacked from gbrowse_details)
lib/Bio-new/DB/Das/Chado.pm (GMOD/GBrowse CVS)
Edit your gbrowse.conf to add [feature:EDITS] stanzas as in conf/gbrowse.conf/dev_chado_ggb.conf
Then any gbrowse_details/ view can be changed to gbrowse_edit/,
and form submissions
Reason for simple update_features
Think about a spreadsheet style where one table serves for all data fields and values, with a column for field names and one for values, or many columns for field tags and values. This might make a good intermediate table structure for simple annotation uses via wiki, gbrowse and other tools. It is related to the simple data tuple in XML-RDF (RSS, etc.) and the like. It would be non-normalized, but would allow gene-centric (or feature-centric) use for updates. One could store all of a gene object's data that way; see e.g. the sample Chado gene page and outputs at http://www.gmod.org/wiki/index.php/Sample_Chado_SQL#gene_page
Suppose we added such a table, 'update_features', to a Chado database, and let annotation tools write updates to it with a structure like
    feature-name/id  field-tag  value  (status type: new, change, delete)  (update-id)
This would serve as a staging table for updating the main Chado tables, and offer a simple schema API that would be easy to use from other tools. Such a gene_flat_table could be populated from a Chado view or procedure, and then updated interactively via external programs. An annotation tool would be able to search it simply (one or two fields), retrieve e.g. all values for a given gene feature-id, and update a given feature-id value easily.
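To make the flat tuple idea concrete, here is a POSIX sh sketch that treats a tab-separated file as a stand-in for such a table and retrieves all current values for one feature. The column order follows the structure described above, but the file layout, the function name `feature_values`, and the sample rows in the comment are illustrative assumptions; the actual table is defined in conf/update_features_in.sql, which this sketch does not reproduce.

```shell
#!/bin/sh
# Hypothetical flat layout, one tab-separated row per update:
#   feature-id <TAB> field-tag <TAB> value <TAB> status <TAB> update-id
# Print "tag<TAB>value" for every row of one feature that is not
# marked "delete". Usage: feature_values FEATURE_ID < updates.tsv
feature_values() {
    awk -F '\t' -v id="$1" '$1 == id && $4 != "delete" { print $2 "\t" $3 }'
}

# Example with made-up rows:
#   printf 'TAX4\tsymbol\tTAX4\tnew\t1\nTAX4\tnote\told text\tdelete\t2\n' |
#     feature_values TAX4
```

A single filter on the feature column is all the lookup needs, which is the kind of one-or-two-field search the design aims for.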
CHANGES
AUTHOR
Don Gilbert (gilbertd@indiana.edu)
GPL (c) 2007 Indiana University.