From cavanaug from ncbi.nlm.nih.gov  Fri Sep 12 10:35:21 2008
From: cavanaug from ncbi.nlm.nih.gov (Cavanaugh, Mark (NIH/NLM/NCBI) [E])
Date: Fri Sep 12 10:35:32 2008
Subject: [Genbank-bb] GenBank Update Problem : 0912 : Incorrect files between 04:30am and 10:08am
Message-ID: <7B6F170840CA6C4DA63EE0C8A7BB43EC027D8ABA@NIHCESMLBX15.nih.gov>

Dear GenBank Users,

Processing for the GenBank Incremental Update (GIU) and for GenBank
WGS data products was moved to new hardware on Thursday, September 11.

Unfortunately, some configuration files that were used during previous
tests of the new hardware were *not* updated with the files from the
production system. This led to the creation of unnecessarily large GIU
files on September 12 (nc0912), containing records that date back to
(at least) August 10th.

The affected 0912 GIU files had these timestamps and sizes:

-rw-r--r-- 1 gbupdate gbproces  30892117 Sep 12 04:44 con_nc.0912.flat.gz
-rw-r--r-- 1 gbupdate gbproces 533951491 Sep 12 04:31 nc0912.flat.gz
-rw-r--r-- 1 gbupdate gbproces 317692100 Sep 12 04:37 nc0912.fsa_nt.gz
-rw-r--r-- 1 gbupdate gbproces  34925179 Sep 12 04:10 nc0912.fsa.gz
-rw-r--r-- 1 gbupdate gbproces  60205075 Sep 12 04:10 nc0912.gnp.gz
-rw-r--r-- 1 gbupdate gbproces  92824768 Sep 12 04:14 nc0912.qscore.gz
-rw-r--r-- 1 gbupdate gbproces  33507308 Sep 12 04:44 con_nc.0912.aso.gz
-rw-r--r-- 1 gbupdate gbproces 423510078 Sep 12 04:16 nc0912.aso.gz

Note that the uncompressed size of nc0912.flat.gz is over 2.5 GB :

compressed  uncompressed  ratio  uncompressed_name
 533951491    2652522424  79.9%  nc0912.flat

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

This problem was discovered on the morning of September 12.
The incorrect GIU files were removed, a new GIU run was started, and
this yielded corrected 0912 update products at about 10:00am :

-rw-r--r-- 1 gbupdate gbproces 55850950 Sep 12 10:08 nc0912.flat.gz
-rw-r--r-- 1 gbupdate gbproces 36800819 Sep 12 10:08 nc0912.fsa_nt.gz
-rw-r--r-- 1 gbupdate gbproces  1919440 Sep 12 10:08 nc0912.fsa.gz
-rw-r--r-- 1 gbupdate gbproces  3169378 Sep 12 10:08 nc0912.gnp.gz
-rw-r--r-- 1 gbupdate gbproces  2254192 Sep 12 10:08 nc0912.qscore.gz
-rw-r--r-- 1 gbupdate gbproces 38159306 Sep 12 10:08 nc0912.aso.gz

Note that the uncompressed size of the corrected nc0912.flat.gz GIU is
only about a tenth of the size of the incorrect version:

compressed  uncompressed  ratio  uncompressed_name
  55850950     252578593  77.9%  nc0912.flat

Note also that there are no CON-division GIU products for 0912 .

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

The invalid 0912 GIU products were available via FTP for approximately
six hours. If you transferred them between 4:00am ET and 10:08am ET,
please check their sizes to see if you need to obtain new, corrected,
smaller versions of the files.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Fortunately, the effect on our WGS project files was minimal : the
data files for a single project, CABB, were unnecessarily refreshed.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Our apologies for the inconvenience that this error has caused.
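
The size check suggested above could be scripted roughly as follows. This is only a sketch, not an NCBI tool: the byte counts are copied from the corrected listing in this message, while the function names, the status labels, and the choice to flag any local con_nc.0912 file as stale (because the corrected run has no CON-division products) are assumptions made for illustration.

```python
import os

# Sizes in bytes of the corrected 0912 GIU files, copied from the listing above.
CORRECTED_SIZES = {
    "nc0912.flat.gz": 55850950,
    "nc0912.fsa_nt.gz": 36800819,
    "nc0912.fsa.gz": 1919440,
    "nc0912.gnp.gz": 3169378,
    "nc0912.qscore.gz": 2254192,
    "nc0912.aso.gz": 38159306,
}

# Assumption: CON-division files for 0912 can only come from the incorrect run,
# since the corrected run produced none.
INCORRECT_ONLY = {"con_nc.0912.flat.gz", "con_nc.0912.aso.gz"}


def classify(name, size):
    """Return 'ok', 'refetch', 'stale', or 'unknown' for one local file."""
    if name in INCORRECT_ONLY:
        return "stale"
    expected = CORRECTED_SIZES.get(name)
    if expected is None:
        return "unknown"
    return "ok" if size == expected else "refetch"


def check_directory(path):
    """Classify every 0912 GIU file that is present in a local mirror directory."""
    results = {}
    for name in sorted(CORRECTED_SIZES.keys() | INCORRECT_ONLY):
        full = os.path.join(path, name)
        if os.path.exists(full):
            results[name] = classify(name, os.path.getsize(full))
    return results
```

For example, check_directory("/path/to/mirror") (a hypothetical path) returns a dictionary mapping each 0912 file found there to its status; any file reported as 'refetch' should be transferred again.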

Mark Cavanaugh
GenBank
NCBI/NLM/NIH/HHS

From cavanaug from ncbi.nlm.nih.gov  Fri Sep 26 15:41:11 2008
From: cavanaug from ncbi.nlm.nih.gov (Cavanaugh, Mark (NIH/NLM/NCBI) [E])
Date: Fri Sep 26 15:41:39 2008
Subject: [Genbank-bb] PROJECT linetype to be replaced by DBLINK
Message-ID: <7B6F170840CA6C4DA63EE0C8A7BB43EC029E10A9@NIHCESMLBX15.nih.gov>

Greetings GenBank Users,

The PROJECT linetype allows a sequence record to be linked to
information about the sequencing project that generated the data which
ultimately resulted in the record's submission to the International
Nucleotide Sequence Database ( INSD : http://www.insdc.org ).

This complete bacterial GenBank record illustrates the use of PROJECT:

LOCUS       CP000964             5641239 bp    DNA     circular BCT 24-SEP-2008
DEFINITION  Klebsiella pneumoniae 342, complete genome.
ACCESSION   CP000964
VERSION     CP000964.1  GI:206564770
PROJECT     GenomeProject:28471

When viewed on the web in NCBI's Entrez:Nucleotide, the record's
project identifier (28471) links to an entry in the Genome Project
Database (GPDB) :

http://www.ncbi.nlm.nih.gov/sites/entrez?db=genomeprj&cmd=Retrieve&dopt=Overview&uid=28471

where information about the sequencing center, the bacterium, and other
GenBank records (e.g., plasmids) associated with the sequencing project
can be obtained.

Since the introduction of PROJECT, the scope of the "Genome" Project
Database has expanded, to include projects that are not necessarily
targeted to the sequencing of a complete genome. In addition, there can
be other resources which underlie an INSD sequence record, such as the
Trace Assembly Archive at the NCBI:

http://www.ncbi.nlm.nih.gov/Traces/assembly/assmbrowser.cgi?cmd=show&f=tree&m=main&s=tree

Because of the expanded scope of the GPDB, and because we anticipate a
need to link to more resources than just the GPDB, the PROJECT linetype
is going to be replaced by a new linetype: DBLINK

Further details about this change, and its timetable, follow.

Mark Cavanaugh
GenBank
NCBI/NLM/NIH/HHS

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Modifications to linetypes can be disruptive, so the switch to DBLINK
will occur in several stages.

Starting in October 2008, links to the NCBI Trace Assembly Archive will
be supported via a line of text in the COMMENT section of sequence
records. Here is a mock-up, based on CP000964, which illustrates this
change:

LOCUS       CP000964             5641239 bp    DNA     circular BCT 24-SEP-2008
DEFINITION  Klebsiella pneumoniae 342, complete genome.
ACCESSION   CP000964
VERSION     CP000964.1  GI:206564770
PROJECT     GenomeProject:28471
....
COMMENT     Trace Assembly Archive:123456
            The source for the DNA and/or cells is: Professor Eric W.
            Triplett, Chair, Department of Microbiology and Cell Science,
            Institute of Food and Agricultural Sciences, University of
            Florida, P.O. Box 110700, Gainesville, FL 32611-0700,
            ewt@ufl.edu.

Note: Use of the Trace Assembly Archive is still in its early stages,
so only a few records are expected to have these links in the short
term.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

The new DBLINK linetype will be introduced as of GenBank Release 170.0
(February 15, 2009) . The Genome Project ID and the Trace Assembly
Archive ID will be presented via DBLINK, and the existing PROJECT line
will continue to be displayed:

LOCUS       CP000964             5641239 bp    DNA     circular BCT 24-SEP-2008
DEFINITION  Klebsiella pneumoniae 342, complete genome.
ACCESSION   CP000964
VERSION     CP000964.1  GI:206564770
PROJECT     GenomeProject:28471
DBLINK      Project:28471
            Trace Assembly Archive:123456
....
COMMENT     The source for the DNA and/or cells is: Professor Eric W.
            Triplett, Chair, Department of Microbiology and Cell Science,
            Institute of Food and Agricultural Sciences, University of
            Florida, P.O. Box 110700, Gainesville, FL 32611-0700,
            ewt@ufl.edu.
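
While PROJECT and DBLINK are both displayed, software that reads flatfiles may encounter either linetype, or both. The Python sketch below shows one way a reader might tolerate that; it is not NCBI code. It assumes the usual flatfile layout (keyword within the first 12 columns, continuation lines indented by 12 spaces), and the function name parse_links is made up for this example. It does not look inside COMMENT, so the interim October 2008 COMMENT-based link is ignored.

```python
def parse_links(header_text):
    """Collect database links from PROJECT and DBLINK lines of a flatfile header.

    Returns a dict such as {"Project": "28471", "Trace Assembly Archive": "123456"}.
    """
    links = {}
    current = None  # keyword of the linetype currently being read
    for line in header_text.splitlines():
        keyword = line[:12].strip()
        value = line[12:].strip()
        if keyword:
            current = keyword  # a non-blank keyword column starts a new linetype
        if not value or current not in ("PROJECT", "DBLINK"):
            continue
        db, _, ident = value.partition(":")
        if current == "PROJECT":
            # Old style: 'GenomeProject' is presented as 'Project' under DBLINK.
            if db == "GenomeProject":
                db = "Project"
            links.setdefault(db, ident)  # never override a DBLINK value
        else:
            links[db] = ident  # DBLINK takes precedence when both are present
    return links


if __name__ == "__main__":
    sample = (
        "PROJECT     GenomeProject:28471\n"
        "DBLINK      Project:28471\n"
        "            Trace Assembly Archive:123456\n"
    )
    print(parse_links(sample))
```

Giving DBLINK precedence is a design choice for this sketch; during the overlap release the two lines are expected to carry the same project identifier.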

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

PROJECT and DBLINK will co-exist for one GenBank release, until Release
171.0 (April 15, 2009), at which point the PROJECT line will be removed.
In its final state, our mock-up for CP000964 becomes:

LOCUS       CP000964             5641239 bp    DNA     circular BCT 24-SEP-2008
DEFINITION  Klebsiella pneumoniae 342, complete genome.
ACCESSION   CP000964
VERSION     CP000964.1  GI:206564770
DBLINK      Project:28471
            Trace Assembly Archive:123456
....
COMMENT     The source for the DNA and/or cells is: Professor Eric W.
            Triplett, Chair, Department of Microbiology and Cell Science,
            Institute of Food and Agricultural Sciences, University of
            Florida, P.O. Box 110700, Gainesville, FL 32611-0700,
            ewt@ufl.edu.

In summary:

  PROJECT -> DBLINK
  'GenomeProject' -> 'Project'
  Additional linkages, such as Trace Assembly, will be added to DBLINK
    as needed
  The PROJECT line will be removed as of April 15, 2009.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

For those who process sequence data in NCBI's ASN.1 format:

The underlying representation for (Genome) Project IDs will remain
unchanged; there will be no changes to the ASN.1 User-object that is
used to store them:

  user {
    type str "GenomeProjectsDB" ,
    data {
      {
        label str "ProjectID" ,
        data int 28471
      } ,
      {
        label str "ParentID" ,
        data int 0
      }
    }
  } ,

However, to support linkages to other resources, like the Trace
Assembly Archive, a new "DBLink" User-object will be introduced:

  user {
    type str "DBLink" ,
    data {
      {
        label str "Trace Assembly Archive" ,
        data ints { 123456 }
      }
    }
  }

As new types of linkages are established, they will be added to the
DBLink User-object, and displayed via the DBLINK linetype in the GenBank
flatfile format.

There is a possibility that the GenomeProjectsDB User-object might
someday be incorporated into the new DBLink User-object. But at the
moment, there are no firm plans to do so.
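
For a quick look at the two User-objects shown above in ASN.1 text form, a regular expression can pull out each label and its integer data. This is a rough Python sketch for illustration only, not a real ASN.1 parser and not part of any NCBI toolkit; the pattern and the function name user_object_fields are assumptions, and the pattern handles only the "data int N" and "data ints { ... }" shapes that appear in this message.

```python
import re

# Matches:  label str "NAME" , data int 123
#      or:  label str "NAME" , data ints { 1, 2, 3 }
FIELD = re.compile(
    r'label\s+str\s+"([^"]+)"\s*,\s*'
    r'data\s+(?:int\s+(\d+)|ints\s*\{([^}]*)\})'
)


def user_object_fields(asn1_text):
    """Map each label in a text-format User-object to a list of integers."""
    fields = {}
    for label, single, many in FIELD.findall(asn1_text):
        if single:
            fields[label] = [int(single)]
        else:
            fields[label] = [int(x) for x in many.split(",") if x.strip()]
    return fields


if __name__ == "__main__":
    dblink = '''
      user {
        type str "DBLink" ,
        data {
          { label str "Trace Assembly Archive" , data ints { 123456 } }
        }
      }
    '''
    print(user_object_fields(dblink))
```

Production code would more sensibly use a proper ASN.1 reader; the sketch only illustrates that the DBLink object stores a list of integers per label, whereas GenomeProjectsDB stores single integers.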

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

From cavanaug from ncbi.nlm.nih.gov  Fri Sep 26 16:03:42 2008
From: cavanaug from ncbi.nlm.nih.gov (Cavanaugh, Mark (NIH/NLM/NCBI) [E])
Date: Fri Sep 26 16:04:09 2008
Subject: [Genbank-bb] Re-Send : PROJECT linetype to be replaced by DBLINK
Message-ID: <7B6F170840CA6C4DA63EE0C8A7BB43EC029E10B2@NIHCESMLBX15.nih.gov>

[This listserv seems to impose a fairly short line-wrap for
text messages, which made my previous post difficult to read.
Hence this re-send, with shorter line lengths, where possible.]

Greetings GenBank Users,

The PROJECT linetype allows a sequence record to be linked
to information about the sequencing project that generated
the data which ultimately resulted in the record's submission
to the International Nucleotide Sequence Database
( INSD; see http://www.insdc.org ).

This complete bacterial GenBank record illustrates the use
of the PROJECT line:

LOCUS       CP000964             5641239 bp    DNA     circular BCT 24-SEP-2008
DEFINITION  Klebsiella pneumoniae 342, complete genome.
ACCESSION   CP000964
VERSION     CP000964.1  GI:206564770
PROJECT     GenomeProject:28471

When viewed on the web in NCBI's Entrez:Nucleotide, the
record's project identifier (28471) links to an entry in
the Genome Project Database (GPDB) :

http://www.ncbi.nlm.nih.gov/sites/entrez?db=genomeprj&cmd=Retrieve&dopt=Overview&uid=28471

where information about the sequencing center, the bacterium,
and other GenBank records (e.g., plasmids) associated with the
sequencing project can be found.

Since the introduction of PROJECT, the scope of the "Genome"
Project Database has expanded, to include projects that are
not necessarily targeted to the sequencing of a complete genome.
In addition, there can be other resources which underlie an
INSD sequence record, such as the Trace Assembly Archive at
the NCBI:

http://www.ncbi.nlm.nih.gov/Traces/assembly/assmbrowser.cgi?cmd=show&f=tree&m=main&s=tree

Because of the expanded scope of the GPDB, and because we
anticipate a need to link to more resources than just the
GPDB, the PROJECT linetype is going to be replaced by a new
linetype: DBLINK

Further details about this change, and its timetable, follow.

Mark Cavanaugh
GenBank
NCBI/NLM/NIH/HHS

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Modifications to linetypes can be disruptive, so the switch
to DBLINK will occur in several stages.

Starting in October 2008, links to the NCBI Trace Assembly
Archive will be supported via a line of text in the COMMENT
section of sequence records. Here is a mock-up, based on
CP000964, to illustrate this change:

LOCUS       CP000964             5641239 bp    DNA     circular BCT 24-SEP-2008
DEFINITION  Klebsiella pneumoniae 342, complete genome.
ACCESSION   CP000964
VERSION     CP000964.1  GI:206564770
PROJECT     GenomeProject:28471
....
COMMENT     Trace Assembly Archive:123456
            The source for the DNA and/or cells is: Professor Eric W.
            Triplett, Chair, Department of Microbiology and Cell Science,
            Institute of Food and Agricultural Sciences, University of
            Florida, P.O. Box 110700, Gainesville, FL 32611-0700,
            ewt@ufl.edu.

Note: Use of the Trace Assembly Archive is still in its
early stages, so only a few records are expected to have
these links in the short term.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

The new DBLINK linetype will be introduced as of GenBank
Release 170.0 (February 15, 2009) . The Genome Project ID
and the Trace Assembly Archive ID will be presented via
DBLINK, and the existing PROJECT line will continue to be
displayed:

LOCUS       CP000964             5641239 bp    DNA     circular BCT 24-SEP-2008
DEFINITION  Klebsiella pneumoniae 342, complete genome.
ACCESSION   CP000964
VERSION     CP000964.1  GI:206564770
PROJECT     GenomeProject:28471
DBLINK      Project:28471
            Trace Assembly Archive:123456
....
COMMENT     The source for the DNA and/or cells is: Professor Eric W.
            Triplett, Chair, Department of Microbiology and Cell Science,
            Institute of Food and Agricultural Sciences, University of
            Florida, P.O. Box 110700, Gainesville, FL 32611-0700,
            ewt@ufl.edu.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

PROJECT and DBLINK will co-exist for one GenBank release,
until Release 171.0 (April 15, 2009), at which point the
PROJECT line will be removed. In its final state, our
mock-up for CP000964 becomes:

LOCUS       CP000964             5641239 bp    DNA     circular BCT 24-SEP-2008
DEFINITION  Klebsiella pneumoniae 342, complete genome.
ACCESSION   CP000964
VERSION     CP000964.1  GI:206564770
DBLINK      Project:28471
            Trace Assembly Archive:123456
....
COMMENT     The source for the DNA and/or cells is: Professor Eric W.
            Triplett, Chair, Department of Microbiology and Cell Science,
            Institute of Food and Agricultural Sciences, University of
            Florida, P.O. Box 110700, Gainesville, FL 32611-0700,
            ewt@ufl.edu.

In summary:

  PROJECT -> DBLINK
  'GenomeProject' -> 'Project'
  Additional linkages, such as Trace Assembly, added to
    DBLINK as needed
  The PROJECT line will be removed as of April 15, 2009.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

For those who process sequence data in NCBI's ASN.1 format:

The underlying representation for (Genome) Project IDs will
remain unchanged; there will be no changes to the ASN.1
User-object that is used to store them:

  user {
    type str "GenomeProjectsDB" ,
    data {
      {
        label str "ProjectID" ,
        data int 28471
      } ,
      {
        label str "ParentID" ,
        data int 0
      }
    }
  } ,

However, to support linkages to other resources, like the
Trace Assembly Archive, a new "DBLink" User-object will be
introduced:

  user {
    type str "DBLink" ,
    data {
      {
        label str "Trace Assembly Archive" ,
        data ints { 123456 }
      }
    }
  }

As new types of linkages are established, they will be
added to the DBLink User-object, and displayed via the
DBLINK linetype in the GenBank flatfile format.

There is a possibility that the GenomeProjectsDB User-object
might someday be incorporated into the new DBLink User-object.
But at the moment, there are no firm plans to do so.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=