From usenet.ucs.indiana.edu!sol.ctr.columbia.edu!spool.mu.edu!agate!biosci!net.bio.net!GUNBRF.NBRFNET.GEORGETOWN.EDU!POSTMASTER Thu Oct 1 13:10:09 EST 1992
Article: 757 of bionet.software
Path: usenet.ucs.indiana.edu!sol.ctr.columbia.edu!spool.mu.edu!agate!biosci!net.bio.net!GUNBRF.NBRFNET.GEORGETOWN.EDU!POSTMASTER
From: POSTMASTER@GUNBRF.NBRFNET.GEORGETOWN.EDU
Newsgroups: bionet.software
Subject: Announcements of PIR Network Request Service
Message-ID: <01GPFPRHTHHI8WVZ4N@NBRF.Georgetown.Edu>
Date: 1 Oct 92 17:47:20 GMT
Sender: daemon@net.bio.net
Distribution: bionet
Lines: 302

         Announcements of the Protein Identification Resource
                       Network Request Service

Highlights

1. PIR Release 34 and NRL_3D Release 10
2. Updated Distribution Information
3. Database Standardization Efforts
4. Internet Addresses for Anonymous FTP and Network Request Service
5. FASTA Searches for NRL_3D Only
6. Network Request Service Command Summary

Announcements

1. PIR Release 34 and NRL_3D Release 10

As of 30 September, Release 34 of the PIR databases and Release 10 of
the NRL_3D database (corresponding to Brookhaven Protein Data Bank
Release 61) are available through the PIR On-line system and Network
Request Server. Distribution of the tape and CD-ROMs of the new
release will begin shortly.

    Database   Release   Sequences    Residues
    PIR1        34.00       10,550   3,591,370
    PIR2        34.00       16,188   4,330,190
    PIR3        34.00       18,162   5,284,017
    NRL_3D      10.00        1,457     244,804

Growth of the PIR databases is documented in the file DBGROWTH.LIS,
available through the Network Request Server. The following files are
also available through the Server:

    SUPERFAM.LIS   list of superfamilies in PIR1
    KEYWORDS.LIS   list of keywords in PIR1 and PIR2
    FEATURES.LIS   list of features in PIR1 and PIR2

2. Updated Distribution Information

The databases and programs of the PIR are distributed on magnetic tape
and on TK50 and TK70 cartridges in VAX/VMS format and in ASCII card
image format. The protein databases are updated and distributed
quarterly; the sequence analysis software package is updated
irregularly. The prices listed are per release and are subject to
change. Tapes may be ordered on a one-time or on a standing order
basis.

The PIR-International Protein Sequence Database ($250) contains
substantially sequenced proteins and sequences translated from nucleic
acid sequences. The database is divided into three data sets
categorized by the degree of annotation in the sequence entries. The
sequences in the PIR1 data set (and some of the PIR2 data set) have
been annotated to identify post-translational modifications, active
sites, signal sequences, disulfide bonds, etc. The PIR3 data set
contains minimal entries that have not yet been examined by scientific
staff. The datatape also contains the NRL_3D database of sequence
information extracted from the Brookhaven Protein Data Bank. The
VAX/VMS format of the protein sequence datatape contains the PSQ
(Protein Sequence Query) and the NAQ (Nucleic Acid Query) retrieval
programs and programs for creating user databases. As a service to our
users, the PIR is also including files required to use the PIR
database with the GCG software.

The ATLAS multidatabase retrieval program is available on CD-ROM
($100) along with the PIR-International Protein Sequence Database, the
ALN protein alignment database, the NRL_3D database, the PATCHX
database, and the GenBank Genetic Sequence Databank. The ATLAS program
is currently designed to run on PC/DOS and VAX/VMS systems. Support
for UNIX and Mac will be added.

The PATCHX database ($250) is produced by MIPS at the Max Planck
Institute for Biochemistry, Martinsried, Germany.
The PATCHX database includes all protein sequences (not identical with
or contained in sequences from PIR1, PIR2 and PIR3) from the following
databases:

    MIPSOwn    MIPS preliminary entries
    PIRMOD     MIPS/PIR preliminary entries
    MIPSH      MIPS yeast entries
    NRL_3D     Brookhaven Data Bank Sequences
    MIPSTrn    MIPS preliminary translations
    EMTrans    EMBL translation by F. Pfeiffer
    SwissProt
    GenPept    GenBank(R) translation by Los Alamos Nat. Lab.
    Kabat
    PSeqIP

All sequences that are IDENTICAL within or between databases are
presented only ONCE. Sequences completely contained within others have
also been removed.

The NBRF-PIR Sequence Analysis Software tape ($200) contains programs
designed to run on a VAX computer operating under VMS version 5. All
programs are written in VAX-11 Fortran (a superset of ANSI Fortran
77), with the exception of the Lipman-Pearson programs (FASTA, RDF),
which are written in VAX-11 C. Included are:

    database searching programs    SEARCH, ISEARCH, FASTA
    global similarity programs     ALIGN, IALIGN
    local similarity programs      RELATE, DOTMATRIX
    prediction programs            PRPLOT, CHOFAS (from the IDEAS package)

More information about the databases, sequence analysis programs,
tapes, on-line services, custom services or prices can be obtained by
contacting:

    Kathryn E. Sidman
    Protein Identification Resource
    National Biomedical Research Foundation
    3900 Reservoir Road, NW
    Washington, DC 20007
    Phone:  (202) 687-2121
    FAX:    (202) 687-1662
    E-mail: PIRMAIL@GUNBRF.BITNET

3. Database Standardization Efforts

The combined staffs of the PIR-International have been engaged in a
vigorous effort to standardize the keyword and features records
occurring in the PIR1 and PIR2 databases. Previous efforts to
standardize the species and reference records and the title records
for enzymes had been very successful.
The standardization effort progressed by:

    (1) determining the complete variety of information that existed
        in those records,
    (2) formulating rules for which forms were acceptable and which
        were not,
    (3) imposing those rules by correcting the non-compliant entries
        and introducing additional checking procedures during the
        data entry process.

The success of this standardization effort for the keyword records can
be judged from these results: in Release 30 there were 1614 different
keywords, with 63% of those keywords appearing in fewer than 4
entries; in Release 34 there are 1037 different keywords, and 40% of
those keywords appear in fewer than 4 entries. The following table
provides a more complete breakdown.

                 Frequency of Keywords

    Frequency         Different Keywords
    in Entries        Rel. 30    Rel. 34
      >400                7         12
     201-400             10         24
     101-200             19         42
      51-100             38         58
      26-50              61         61
      13-25             103        105
       7-12             131        135
       4-6              218        185
       2-3              395        208
       1                632        207

4. Internet Addresses for Anonymous FTP and Network Request Service

During September the PIR Network Request Service was made available
through the National Biomedical Research Foundation's Internet
address. For users on BITNET the address remains FILESERV@GUNBRF. For
users on Internet and other networks with gateways to Internet the
preferred address is now FILESERV@NBRF.Georgetown.Edu. The last part
of this announcement gives a synopsis of instructions for using this
database query and FASTA sequence search service.

Each PIR release and its accompanying NRL_3D release are available for
anonymous FTP from the UH Gene-Server, ftp.bchs.uh.edu, IP address
129.7.2.43. The login is "anonymous" and the password is your e-mail
address. The files are kept in

    pub/gene-server/pir/pir_relXX/{ascii,vms}

where "XX" is the release number. All files are stored as Unix 16-bit
compressed files and the file names end in .Z (e.g. pir.1.dat.Z) as a
reminder.
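
As a rough illustration of the directory convention just described,
the Python sketch below builds the server path of one compressed file
for a given release. It is not PIR or Gene-Server software. The helper
name release_file_path is hypothetical, and the sketch assumes that
"XX" is simply the decimal release number (so Release 34 would live
under pir_rel34) and that pir.1.dat.Z sits in the "ascii" directory.
Only the host name, the directory layout and the example file name
come from this announcement.

```python
import posixpath

FTP_HOST = "ftp.bchs.uh.edu"        # UH Gene-Server, from the announcement
BASE_DIR = "pub/gene-server/pir"    # from the announcement
FORMATS = ("ascii", "vms")          # the two per-release subdirectories


def release_file_path(release, fmt, name):
    """Return the server path of one .Z file (hypothetical helper).

    Assumes "XX" in pir_relXX is the plain decimal release number.
    """
    if fmt not in FORMATS:
        raise ValueError("fmt must be 'ascii' or 'vms'")
    if not name.endswith(".Z"):
        name += ".Z"                # files are stored Unix-compressed
    return posixpath.join(BASE_DIR, "pir_rel%d" % release, fmt, name)


# Placing pir.1.dat.Z under "ascii" is an assumption for illustration.
print(release_file_path(34, "ascii", "pir.1.dat"))
```

Files fetched this way still have to be uncompressed locally before
use, since the sketch only constructs the path.
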
The "ascii" directory contains the CODATA format files, and the "vms"
directory the NBRF format files and indices in VMS format. Note that
two of the files required by GCG V.7.X are not included; those can be
generated by GCG-supplied utilities.

Uncompress utilities are available for non-Unix systems: the DOS
archive sites have a file "cmprs430.zip"; the Info-Mac archives have
"maccompress-32.hqx"; and various VMS archives have "lhzcomp.exe" or
"decompress.exe". The latter is also available in pub/gene-server/pir,
with a sample (but non-working) .CLD file.

Questions about the FTP server can be directed to Dan Davison,
davison@uh.edu. Our thanks to Bill Pearson and Dan Davison for their
efforts in providing FTP access to the PIR databases.

5. FASTA Searches for NRL_3D Only

Some users had suggested that they wanted to do FASTA sequence
searches only for the sequences with known 3-dimensional structures,
the sequences extracted from the Brookhaven Protein Data Bank in
NRL_3D. Normally our FASTA searches are done against all the protein
databases: PIR1, PIR2, PIR3, the non-redundant PATCHX (described in
the August announcement and in part 2 above) and NRL_3D. Now when the
command

    USE BASES NRL_3D

is used before a SEARCH command, only the NRL_3D database will be used
for the FASTA search. Otherwise, all the protein databases will be
used. Thanks to Ada Prochnicka-Chalufour at the Pasteur Institute for
her helpful suggestion and her hospitality this spring.

6. PIR Network Request Service Command Summary

The National Biomedical Research Foundation Protein Identification
Resource network request service is a full-function fileserver and
database query system. It has been operating since August 1990 and is
capable of handling database queries, sequence searches and sequence
submissions, in addition to fileserver requests. To use this server,
request commands should be sent to FILESERV@GUNBRF on BITNET.
The FILESERVer recognizes the following commands sent either in a mail
message, or (if the sender is on BITNET) in command messages or in a
file:

    Command       Action
    -------       -----------------------------------------------
    ACCESSION     list entry codes and titles by accession number
    AND           combine QUERY commands with Boolean AND
    AUTHOR        list entry codes and titles by author
    BASES         list accessible databases
    CROSS         list PIR entry codes and titles corresponding to
                  a particular nucleic sequence database entry
    DEPOSIT       deposit entry for database submission
    END DEPOSIT   terminate deposit entry
    FEATURE       list entry codes and titles by feature table entry
    GENE          list entry codes and titles for a gene name
    GET           return entry by entry code
    HELP          return HELP instructions
    HOST          list entry codes and titles by host species
    INDEX         list SENDable files
    JOURNAL       list entry codes and titles by journal citation
    KEYWORD       list entry codes and titles by keyword
    MEMBER        list alignments containing entry code as a member
    NOT           combine QUERY commands with Boolean NOT
    OR            combine QUERY commands with Boolean OR
    QUERY         begin collecting QUERY commands
    END QUERY     terminate collecting commands and execute QUERY
    QUIT          ignore the remaining text (E-mail signature blocks)
    RETURN        change return address for gateway mail
    SEARCH        search for sequence by FASTA procedure
    END SEARCH    terminate sequence for searching
    SEND          send file
    SPECIES       list entry codes and titles by species
    SUGGEST       leave suggestion or correction for PIR staff
    END SUGGEST   terminate suggestion text
    SUPERFAMILY   list entry codes and titles by superfamily name
    TAXONOMY      report taxonomy for scientific or common name
    TITLE         list entry codes and titles by title
    USE           set databases, dates or formats to use in limited
                  searches

Multiple commands can be sent with one command on each line of a mail
message or file. Commands should NOT be sent on the Subject line of a
mail message. Receipt of command messages and files will be
acknowledged immediately.
Mail messages will be acknowledged by return mail.

For help in using any of the commands, send a request of the form
HELP topic, for example

    HELP SEARCH

In addition to the commands, help instructions are also available on
the following topics:

    Custom_Services     Databases           Gateway_Access
    Help_en_Espanol     Help_en_francais    IBM-VM_BITNET
    On-Line_Access      PIR_Distribution    VAX-VMS_BITNET

Because of network gateway communication protocols, there are
limitations on requests sent through gateways. Users not on BITNET or
INTERNET who access BITNET through local or network gateways should
read and carefully follow these instructions before sending requests.
Only mail message requests (not command messages or files) can be sent
through gateways. Because addresses posted on gateway mail do not
always work for the return, before you send requests through network
gateways it is strongly recommended that you first contact Dr. John S.
Garavelli at POSTMASTER@GUNBRF on BITNET. We will confirm a return
address for you and may instruct you to use the RETURN command to
ensure that your request output will reach you. It is not usually
necessary to do this if you are on BITNET or INTERNET, unless your
system employs a local remailer or your mail program applies a
non-standard return address (for example a personal name on the FROM:
line).

The BITNET network and the network gateways impose strict limits on
file size. Poorly posed database queries may result in output so
extensive that it could not be returned by network mail. Therefore,
the PIR FILESERVer imposes an output limit of 1000 lines for each
command and 3000 lines for each request.

The DEPOSIT and QUERY commands must, and the SEARCH and SUGGEST
commands may, be followed by their respective END commands when text
appears on intervening lines. The DEPOSIT command requires, and the
SEARCH command optionally uses, parameters that appear on the same
line as the command.
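
The rules above (one command per line, nothing on the Subject line,
mandatory END lines for DEPOSIT and QUERY) can be sketched as a small
pre-flight check in Python. This is only an illustration under stated
assumptions, not PIR software: the function names are hypothetical,
the argument syntax shown for KEYWORD is assumed, and the
authoritative rules are in the HELP texts returned by the server.

```python
from email.message import EmailMessage

SERVER = "FILESERV@NBRF.Georgetown.Edu"   # Internet address from this post
MUST_END = ("DEPOSIT", "QUERY")           # END line is mandatory for these


def unterminated_block(commands):
    """Return the DEPOSIT or QUERY left open at the end, or None."""
    open_name = None
    for line in commands:
        words = line.upper().split()
        if open_name is None:
            # DEPOSIT may carry parameters, so only the first word counts.
            if words and words[0] in MUST_END:
                open_name = words[0]
        elif words == ["END", open_name]:
            open_name = None
    return open_name


def build_request(commands):
    """Build a mail message with one command per line and no Subject."""
    open_name = unterminated_block(commands)
    if open_name is not None:
        raise ValueError("missing END " + open_name)
    message = EmailMessage()
    message["To"] = SERVER
    message.set_content("\n".join(commands) + "\n")
    return message


request = build_request([
    "HELP SEARCH",           # example taken from the announcement
    "QUERY",
    "KEYWORD kinase",        # assumed argument syntax, illustration only
    "END QUERY",
])
print(request.get_content())
```

The check deliberately ignores SEARCH and SUGGEST, whose END lines are
optional, and it does not try to estimate the 1000-line and 3000-line
output limits, which depend on what the server finds.
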
Because the DEPOSIT, QUERY, SEARCH and SUGGEST commands are so
complex, users should obtain and carefully read the help instructions
before attempting to use them.

The databases available through the PIR Network Server and their
abbreviations for code specification are as follows:

    Abbreviation  Database                              Update Schedule
    PIR1          PIR Annotated and Classified Entries  quarterly
    PIR2          PIR Preliminary Entries               approximately monthly
    PIR3          PIR Unverified Entries                weekly
    ALN           PIR Alignment Entries                 semiannually
    NRL_3D        Brookhaven Data Bank Sequences        quarterly
    PATCHX        MIPS PIR-Supplementary Database       quarterly
    N             NBRF Nucleic
    GB            GenBank (TM)                          as received
    GBSUP         GenBank (TM)                          as received
    GBNEW         GenBank (TM) New Entries              weekly
    EMBL          EMBL                                  as received
    EMBLSUP       EMBL                                  as received

Not all commands work with all databases; please read the information
returned by the command HELP DATABASES.

------------------------------------------------------------------------
Dr. John S. Garavelli
Database Coordinator
Protein Identification Resource
National Biomedical Research Foundation
Washington, DC 20007
POSTMASTER@GUNBRF.BITNET
POSTMASTER@NBRF.Georgetown.Edu