As far as file format is concerned, the program proceeds as follows. If the file extension of the database file is .SCA (MSDOS or VAX) or SCANFILE (CMS), it assumes that the library file is in FASTA format (i.e. a header line starting with a >, with the sequence following on subsequent lines in upper case). If the file extension is anything else, the program uses the GETSEQ routine described in the paper: the database format is determined from the first line of each entry as being either EMBL, GENBANK, CODATA, Staden (DButil) or SCANFILE. If it is none of these, the program reads in the file in its entirety (Blank format). With this approach it is no problem to have a file of mixed parentage, containing entries of individual interest extracted from the various databases and collated into a single file.

ATTENTION: If the file format is NOT SCANFILE format, then entries have to be separated by lines containing a double slash:

//

In the program the following functions are called:

File name = xxxxxxx.SCA
    getdbseq
    getlib     /* reads in a single sequence assuming SCANFILE format */

File name != xxxxxx.SCA
    getdata
    getseq     /* reads in a single sequence not assuming file format */

GETSEQ also performs the following operations: it forces all sequence characters to upper case, and it translates any u (U) to T, which simplifies scanning DNA/RNA sequence databases. The latter is necessary for convenience until the databases are all circulated in one format (I know EMBL has declared its intention to do so). However, since this is performed using a table lookup that would have to be performed anyway, there is no additional penalty in terms of program execution.

As described in the paper, case is significant when the motif (or pattern) is defined. Normal sequence characters should be given in upper case, as the lower case characters are reserved for the IUB nucleic acid symbols: the ambiguous bases are translated by the program into the equivalent base set (e.g. r becomes [AG] and y becomes [CT]).
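The two character-level operations described above, the table-lookup normalisation done by GETSEQ and the expansion of lower case IUB symbols in a motif, can be sketched in C roughly as follows. This is only an illustration and not the program's actual source: the names norm_table, init_norm_table, normalise_sequence, iub_set and expand_motif are hypothetical. Only the r and y expansions are taken from the text above; the remaining entries follow the standard IUB nomenclature and are assumed to match what the program does.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical 256-entry lookup table (not the program's actual table). */
static unsigned char norm_table[256];

/* Fill the table once: each byte maps to its upper case form,
   except that u and U both map to T. */
static void init_norm_table(void)
{
    int c;
    for (c = 0; c < 256; c++)
        norm_table[c] = (unsigned char)toupper(c);
    norm_table['u'] = 'T';
    norm_table['U'] = 'T';
}

/* Normalise a sequence in place: one table lookup per character. */
static void normalise_sequence(char *seq)
{
    assert(seq != NULL);
    for (; *seq != '\0'; seq++)
        *seq = (char)norm_table[(unsigned char)*seq];
}

/* Base set for a lower case IUB symbol, or NULL if the character
   is not an ambiguity symbol. Standard IUB codes; assumed here. */
static const char *iub_set(char symbol)
{
    switch (symbol) {
    case 'r': return "AG";
    case 'y': return "CT";
    case 'm': return "AC";
    case 'k': return "GT";
    case 's': return "CG";
    case 'w': return "AT";
    case 'h': return "ACT";
    case 'b': return "CGT";
    case 'v': return "ACG";
    case 'd': return "AGT";
    case 'n': return "ACGT";
    default:  return NULL;
    }
}

/* Copy motif into out, replacing each IUB symbol by its bracketed
   base set. Returns 0 on success, -1 if out is too small. */
static int expand_motif(const char *motif, char *out, size_t outsize)
{
    size_t used = 0;

    if (outsize == 0)
        return -1;
    for (; *motif != '\0'; motif++) {
        const char *set = iub_set(*motif);
        if (set != NULL) {
            size_t len = strlen(set);
            if (used + len + 2 >= outsize)   /* '[' + set + ']' + NUL */
                return -1;
            out[used++] = '[';
            memcpy(out + used, set, len);
            used += len;
            out[used++] = ']';
        } else {
            if (used + 1 >= outsize)         /* character + NUL */
                return -1;
            out[used++] = *motif;
        }
    }
    out[used] = '\0';
    return 0;
}
```

After init_norm_table() has been called once, normalise_sequence turns the buffer "acgu" into "ACGT", and expand_motif applied to the motif "TArAy" writes "TA[AG]A[CT]" into the output buffer. Because the upper case and U to T conversions share one table, the normalisation costs a single lookup per character, which is the point made above about there being no extra execution penalty.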