// SRS-FastA Notes
// d.gilbert, oct. 1995
// email: software@bio.indiana.edu

SRS-FastA modification notes

This note describes modifications to Thure Etzold's Sequence Retrieval System (SRS) and William Pearson's FastA software that allow one to perform biosequence similarity searches on subsets of biosequence data libraries.

Software needed for this service includes:

Source code for this service, including the modifications to the SRS and FastA sources, is available at <ftp://iubio.bio.indiana.edu/molbio/search/subsets/>.

Perl and HTML Glue


  • sfsubsearch.pl -- runs a general SRS query, or a query on a selected section, followed by a FastA similarity search. Normally it is installed in an HTTP server's "cgi-bin" folder, where it is called by 'sfsubsearch.html'.
  • sfsectionlib.pl -- generates index libraries for fixed section queries (e.g., species subsets). Run it whenever the SRS indices are updated; it updates the data files and the list of data in 'sfsectionsearch.html'.
  • HTML

    -- In sfsubsearch.pl, the variables defined at the top need to be localized for each server. These include:

    ###  Server specific vars
    $SRSROOT = "/b4/srs";  ## path to SRS software
    $seqsearchpath  = 'nice /b4/srsfasta/bin/';  # [t]fasta app is appended to path
    $SRSserver  = "http://iubio.bio.indiana.edu:81/srs/srsc";

    -- In sfsectionlib.pl, the variables defined at the top need to be localized for each server. These include:

    ###  Server specific vars
    $SRSROOT = "/b4/srs";  ## path to SRS software
    # SRS Databanks to use
    @nadatabanks=('genbank','gbnew', 'gbest');  
    @aadatabanks=('pir','swissprot', 'swissnew'); 
    # SRS queries to generate predefined subset libraries
    # format is 'data-field=srs-query=index-file-name=description'
    push(@queries, 'org=arabidopsis=arabidopsis');
    push(@queries, 'org=caenorhabditis=caenorhabditis');
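
    To illustrate the query-spec format named in the comment above, here is a minimal C sketch that splits one spec string on '='. It is not part of sfsectionlib.pl: the struct QuerySpec and the function parse_query_spec are hypothetical names. Fields that are absent (the sample entries above carry three fields rather than four) are left as empty strings.

```c
#include <assert.h>
#include <string.h>

#define SPEC_FIELD_MAX 128

/* Hypothetical holder for one
   'data-field=srs-query=index-file-name=description' spec. */
typedef struct {
    char field[SPEC_FIELD_MAX];
    char query[SPEC_FIELD_MAX];
    char index_name[SPEC_FIELD_MAX];
    char description[SPEC_FIELD_MAX];
} QuerySpec;

/* Copy len characters from start into dst, truncating to the field size. */
static void copy_field(char *dst, const char *start, size_t len)
{
    if (len >= SPEC_FIELD_MAX) len = SPEC_FIELD_MAX - 1;
    memcpy(dst, start, len);
    dst[len] = '\0';
}

/* Split spec on '='. Returns the number of fields found (1 to 4);
   fields that are missing stay empty. */
int parse_query_spec(const char *spec, QuerySpec *out)
{
    char *slots[4];
    const char *p = spec;
    int n = 0;

    assert(spec != NULL && out != NULL);
    memset(out, 0, sizeof *out);
    slots[0] = out->field;
    slots[1] = out->query;
    slots[2] = out->index_name;
    slots[3] = out->description;

    while (n < 4) {
        /* The last field keeps any further '=' characters. */
        const char *eq = (n < 3) ? strchr(p, '=') : NULL;
        size_t len = eq ? (size_t)(eq - p) : strlen(p);
        copy_field(slots[n++], p, len);
        if (!eq) break;
        p = eq + 1;
    }
    return n;
}
```

    Called on 'org=arabidopsis=arabidopsis', this fills field, query and index_name and leaves description empty.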

    FastA changes

    The main change to the FastA software is a new sequence library type that lets FastA use the index files (described in the SRS section) to read only the specified sequence records from a set of data library files. The changes consist of small edits to FFFASTA.C, NXGETAA.C, and ALTLIB.H, plus a new source file, SUBGETAA.C, which holds the bulk of the code for reading sequences through the subset indices. The index file format is simple enough to be generated by software other than SRS, so this change does not depend on SRS.
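
    The core idea, reading one record from a data file by byte offset instead of scanning the whole library, can be sketched in C as follows. This is not the code in SUBGETAA.C. The function name read_record_span is hypothetical, and treating an end offset that is not greater than the start as "end unknown" is an assumption made for this sketch.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read the bytes in [start, end) of an open data file into buf.
 * Assumption for this sketch: end <= start means the end is unknown,
 * so we read until the buffer is full or the file ends.
 * buf is always NUL-terminated. Returns the number of bytes stored,
 * or -1 if the seek fails.
 */
long read_record_span(FILE *fp, long start, long end, char *buf, size_t bufsize)
{
    size_t want, got;

    assert(fp != NULL && buf != NULL && bufsize > 0);
    if (fseek(fp, start, SEEK_SET) != 0)
        return -1;

    want = bufsize - 1;
    if (end > start && (size_t)(end - start) < want)
        want = (size_t)(end - start);

    got = fread(buf, 1, want, fp);
    buf[got] = '\0';
    return (long)got;
}
```

    A caller would open each data library file once and then call this function for every index entry that refers to that file.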

    In FastA, this subset library type is number 12. When Fasta/Tfasta asks for a library, give the name of the subset indices file (without its suffix) followed by the library type number (12), as in

        test sequence file name: my.seq
        library file name: []srslibs  12

    where the files srslibs.names and srslibs.indices were generated with an SRS command like 'getz [databank-section:query] -fse srslibs'

    SRS changes

    The main change to the SRS software is a new option that writes a file of indices for the sequences matching a given query. This amounts to a new command argument (ListFSE) for GETZ and/or WGETZ (I use it from Getz) and a new function, "SlbListFSE", in the shared SEQLIB.C file. The ListFSE function produces two files, "queryname.indices" and "queryname.names".

    The version of SRS that I am working from, srs4_02, is not the current one, so you may need to do some judicious cut & paste to get these patches into the current release. I've made various other changes to my working version of SRS software, so the difference files supplied show many changes not relevant to this modification.

    -- File structure

    The two files .names and .indices are created by calling the new ListFSE function in getz or wgetz (using the "-fse dataname" command argument). This is their current structure.

    The .names file is a list of data file names, in this form

    full-pathname        file-format   file-number
    devc:srs:data:pir:pir1.dat 2 0
    devc:srs:data:pir:pir2.dat 2 1

    The .indices file has an entry for each sequence matching the SRS query, in this format:

    file-number   start-of-document  start-of-sequence  end-of-sequence
    0 33946 34851 35946
    0 35946 36851 1104
    1 12972 14168 15074
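
    As a reading aid, the two line formats above can be parsed with a few lines of C. This is a sketch only, not the reader in SUBGETAA.C: the struct and function names are hypothetical, and it assumes whitespace-separated columns (spaces or tabs) and pathnames without embedded spaces. Any extra trailing columns are ignored.

```c
#include <assert.h>
#include <stdio.h>

#define NAME_PATH_MAX 512

/* Hypothetical in-memory form of one .names line. */
typedef struct {
    char path[NAME_PATH_MAX]; /* full pathname of the data file */
    int  format;              /* file-format column */
    int  file_number;         /* number referenced from the .indices file */
} NamesEntry;

/* Hypothetical in-memory form of one .indices line. */
typedef struct {
    int  file_number; /* which .names entry holds this sequence */
    long doc_start;   /* start-of-document offset */
    long seq_start;   /* start-of-sequence offset */
    long seq_end;     /* end-of-sequence offset */
} IndexEntry;

/* Parse "full-pathname file-format file-number". Returns 1 on success. */
int parse_names_line(const char *line, NamesEntry *out)
{
    assert(line != NULL && out != NULL);
    return sscanf(line, "%511s %d %d",
                  out->path, &out->format, &out->file_number) == 3;
}

/* Parse "file-number start-of-document start-of-sequence end-of-sequence".
   Returns 1 on success. */
int parse_index_line(const char *line, IndexEntry *out)
{
    assert(line != NULL && out != NULL);
    return sscanf(line, "%d %ld %ld %ld",
                  &out->file_number, &out->doc_start,
                  &out->seq_start, &out->seq_end) == 4;
}
```

    A reader would load every .names line into a table indexed by file_number, then use each IndexEntry to pick the file and the byte offsets to seek to.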

    -- The changes to GETZ.C or WGETZ.C are small, adding merely a new function call:


     else if (ParGetNum ("printText") || ParGetNum ("printData") ||
         *ParGetStr ("fieldList")
         || *ParGetStr("listFSE") /* dgg <<< */
         ) {
            /*  ParDefNum ("listFOSN", 0); no longer necessary, new seqlib */
         EntryPrintSet (setName, 1, 0);
     }


      /* dgg --- add for subset indices */
      else if ((tmp = ParGetStr ("listFSE")) && *tmp) {
        if (QryDo (query, "Q")) {
          ENTRYo   *entry;
          SETo     *set;
          char     *setName;
          IDoENTRY  id;
          INT4      setEntryN, k;

          setName= "Q";
          set = SetGet (setName);
          setEntryN = SetSize (setName);
          for (k=1;  k <= setEntryN;  k++) {
            SetGetID (set, k, &id);
            entry = EntryOpen (&id);
            if (entry) SlbListFSE( entry,  setName,  setEntryN,  k);
            EntryClose (&entry);
          }
        }
      }

    -- Add items to the odd/parameters.sdl and odd/arglist.sdl to allow this new function to be called in getz and wgetz. Also the SRSTEMP folder variable is defined to specify where the new files *.names and *.indices will be created and stored.




    ! dgg addition
        #parameter /name="listFSE" /type=string /str=""
             /comment="produce a File of Sequence Entries"
        #parameter /name=tempDirName /type=string /str="SRSTEMP:"  
             /comment="name of temporary files directory"  ! dgg


    #command /name=getz
        #arg /name="-fse"  /parameter="listFSE"     ! dgg added
        #arg /name="-tdir" /parameter=tempDirName     ! dgg added
    #command /name=wgetz
        #arg /name="-fse"  /parameter="listFSE"     ! dgg added
        #arg /name="-tdir" /parameter=tempDirName     ! dgg added

    -- Each sequence data ODD file that defines a #filetype must add a "/seqtype=" entry, for instance "/seqtype=genbank". I did this for EMBL/SwissProt (which share a #filetype), GenBank, and PIR, as shown below. Note that I use the CODATA format of PIR, while others may use the NBRF-formatted PIR dataset.


    #filetype /ID=%EMBLSEQ_FILE 
              /begstr="SQ   " /advance=1 
              /seqtype=embl ! SLBxEMBL, dgg addition
              /typename="dat" /maxline=90 /exitstr="//"


    #filetype /ID=%GBSEQ_FILE 
              /begstr="ORIGIN" /advance=1
              /seqtype=genbank ! SLBxGENBANK, dgg addition
              /typename="seq" /maxline=200 /exitstr="//"


    #filetype /ID=%PIRASEQ_FILE 
              /typename="dat" /maxline=200 
              /seqtype=codata ! SLBxCODATA, dgg addition
              /begstr="SEQUENCE" /advance=1 /exitstr="///"

    -- In srsgeneral-def.sdl, the allowable seqtypes are defined in the %%O_FILTYP file type definition. Each seqtype value must equal the corresponding FastA library file type value plus 10 (for example, codata is 12 here and appears as file-format 2 in the sample .names file above). The ListFSE function subtracts SLBxPEARSON again and writes the resulting FastA value into the '.names' file.


    #object /id=%%O_FILTYP /ppname=SRS
        #attribute /name=seqtype /declname="type" /type=uchar
            /valtype=name /defaultval=0
            /rem="Type of sequence file.  Distinguish base file format for various
                  uses." !dgg modified
            !/rem="Type of sequence file. This was only important on VMS and with
            !      GCG. GCG version 8.0 does not have a record separator as with 
            !      version 7.0."
            #value /name=normal /val=0
            #value /name=gcg7 /val=1
            #value /name=gcg8 /val=2
            ! dgg additions for fasta lib reader format values, val-10 for pearson value
            #value /name=pearson /val=10
            #value /name=genbank /val=11
            #value /name=codata  /val=12
            #value /name=embl    /val=13
            #value /name=ig      /val=14
            #value /name=nbrf    /val=15
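
    The relationship between the seqtype values listed above and the FastA library format codes (the SRS value minus 10) can be written out as a small C sketch. The enum and function names are hypothetical and do not appear in the SRS or FastA sources; only the numeric values come from the table above.

```c
#include <assert.h>

/* SRS seqtype values, copied from the %%O_FILTYP definition above.
   The identifiers themselves are invented for this sketch. */
enum SrsSeqType {
    SRS_SEQ_NORMAL  = 0,
    SRS_SEQ_GCG7    = 1,
    SRS_SEQ_GCG8    = 2,
    SRS_SEQ_PEARSON = 10,
    SRS_SEQ_GENBANK = 11,
    SRS_SEQ_CODATA  = 12,
    SRS_SEQ_EMBL    = 13,
    SRS_SEQ_IG      = 14,
    SRS_SEQ_NBRF    = 15
};

/* Convert an SRS seqtype to the value FastA expects.
   Returns -1 for seqtypes outside the pearson..nbrf range. */
int fasta_format_from_seqtype(int seqtype)
{
    /* The table above spans six consecutive values, pearson through nbrf. */
    assert(SRS_SEQ_NBRF - SRS_SEQ_PEARSON == 5);
    if (seqtype < SRS_SEQ_PEARSON || seqtype > SRS_SEQ_NBRF)
        return -1;
    return seqtype - SRS_SEQ_PEARSON;
}
```

    For example, codata (12) maps to 2, which matches the file-format column in the sample .names file.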

    -- the ListFSE routine added to srs/src/seqlib.c

     /* --- ListFSE parts --------------*/
    /* dgg - from hash.c, for hashing file names... */
    static  long HashString(void *s)
    {
      register unsigned long h=0, g;
      register char *p = (char *) s;
      while (*p) {
        h= (h << 4) + *p++;
        if ((g= (h & 0xf0000000)))
          h= (h ^ (g >> 24)) ^ g;
      }
      return (long) h;
    }
    static long   lastdocstart = -1;
    static short  lastfileid = -1;
    static long   nFofn = 0, maxFofn = 0;
    static long * hashFofn = NULL;
    char * gOutentrySuffix = ".indices";   
    char * gOutnamesSuffix = ".names";
    void SlbListFSE(ENTRYo *entry, char *setName, int entryN, int entryCurrN)
    {
      /* dgg addition
         - list sequence position in file
      */
      if ( EntryOpenData (entry)) {
        short   entrynuminfile, seqformat, fileid;
        long    docstart, datastart, dataend;
        char *  filename;
        IDoENTRY  nextid;
        ENTRYo   *nextentry;

        filename= entry->file[1]->nam;
        fileid= 0; /* this must be set to filename entry in file of filenames ! */
        entrynuminfile= entry->id->fip; /* need to look at next entry for end index !*/
        docstart= entry->fip[0];
        datastart= entry->fip[1];
        dataend= 0; /* flag we don't know end */
        /* we want real file format, in fasta/pearson lib file values */
        seqformat= entry->lib->form->fil_t[1]->type - SLBxPEARSON;

        if (!entryCurrN || !file) {
          /* force output to files -- file of entry indices & file of filenames */
          char  *tmp, *outentry;
          char  outDirName[FILxXNAM+1], outFile[FILxXNAM+1];

          lastdocstart= 0; lastfileid= 0;
          nFofn = 0;
          maxFofn= 20;
          if (hashFofn) free(hashFofn);
          hashFofn = (long*) malloc( maxFofn * sizeof(long));

          if ((tmp = ParGetStr ("tempDirName"))) sprintf (outDirName, "%s", tmp);
          else  *outDirName = '\0';
          outentry= ParGetStr("listFSE");

          /* open (or append to) the .indices file */
          sprintf (outFile, "%s%s%s", outDirName, outentry, gOutentrySuffix);
          file = fopen( outFile, "r+");
          if (!file) file = fopen( outFile, "w");
          else fseek(file, 0, 2);

          /* open (or append to) the .names file */
          sprintf (outFile, "%s%s%s", outDirName, outentry, gOutnamesSuffix);
          fofn = fopen( outFile, "r+");
          if (!fofn)
            fofn = fopen( outFile, "w");
          else {
            /* read & store hashFofn values */
            char *ep, aline[FILxXNAM+20];
            *aline= '\0';
            while ( fgets( aline, sizeof(aline), fofn) ) {
              ep= aline;
              while (*ep && *ep != '\t' && *ep != ' ' && *ep != '\n') ep++;
              if (*ep) *ep= '\0';
              if (nFofn >= maxFofn) {
                maxFofn = nFofn + 20;
                hashFofn= ( long*) realloc(hashFofn, maxFofn * sizeof( long));
              }
              if (hashFofn) hashFofn[nFofn]= HashString( aline);
              nFofn++; /* increment restored; presumed lost from this listing */
            }
          }
        }

        if (fofn) {
          /* !! save time by storing filenames in local array !? */
          short linenum, more;
          long  namelen;
          long  filehash;

          more= 1;
          filehash= HashString( filename);
          if (hashFofn) for (linenum=0; linenum<nFofn; linenum++) {
            if (filehash == hashFofn[linenum]) {
              fileid= linenum;
              more= 0;
            }
          }
          if (more) {
            char * libname;
            /* libname= LibGetName(entry->lib,"full"); */
            libname= entry->lib->lnam[0]; /* short name !? */
            fileid= nFofn;
            fseek( fofn, 0, 2);
            fprintf( fofn,"%s\t%d\t%d\t%s\n", filename, seqformat, fileid, libname);
            if (nFofn >= maxFofn) {
              maxFofn = nFofn + 20;
              hashFofn= ( long*) realloc(hashFofn, maxFofn * sizeof( long));
            }
            if (hashFofn) hashFofn[nFofn]= filehash;
            nFofn++; /* increment restored; presumed lost from this listing */
          }
        }

        if (fileid == lastfileid && docstart == lastdocstart)
          goto skipListFSE;

        /* find data end -- shouldn't this be in entry record somewhere ! */
        IdCopy( &nextid, entry->id);
        nextid.fip++; /* skip to next entry in library */
        nextentry = EntryOpen (&nextid);
        if (nextentry) {
          if ( EntryOpenText ( nextentry) ) dataend= nextentry->fip[0];
          EntryClose (&nextentry);
        }

        /* now write it all out -- use binary output to file for compactness? */
        fprintf( file, "%d\t%ld\t%ld\t%ld\n", fileid, docstart, datastart, dataend);

        /* save to screen out dups */
        /* strncpy(lastename, EntryGetFullName(entry), sizeof(lastename)); */
        lastfileid= fileid;
        lastdocstart= docstart;

      skipListFSE: /* label restored; its exact position in the original is presumed */
        if ( entryCurrN == entryN - 1 ) {
          if (file) { fclose (file); file = NULL; }
          if (fofn) { fclose(fofn); fofn= NULL; } /* dgg */
          if (hashFofn) { free(hashFofn); hashFofn= NULL; }
        }
      }
    }
     /*--------end ListFSE----------- */
    /****** SlbEntryPrint *********************************************************
    : :
    static INT4 gSeqFormat = -1;
    void SlbEntryPrint (ENTRYo *entry, char *setName, int entryN, int entryCurrN)
    {
      SLBoFEATURE *feature;
      SEQo        *seq = NULL;
      char        *fileExt;
      char        *forms;
      INT4        errCode, seqFormat, (*print)(char*,...);

      if ( *ParGetStr("listFSE")) {
        /* a self-contained function - can be called from other places - dgg */
        SlbListFSE( entry,  setName,  entryN,  entryCurrN);
      }