Many programs read a file of data from the GCG data directories. You will not always be aware that a data file is being read, but the program output will always document the file used. For example, MAP reads a file of restriction endonuclease names and target sequences from GENRUNDATA:ENZYME.DAT. All sequence alignment programs (e.g. GAP, COMPARE) read a 'symbol comparison table'; see chapter 13.
To obtain a copy of a data file, use the GCG program FETCH, e.g.:
$ Fetch enzyme.dat
FETCH searches all the GCG data directories for the matching file name, and writes a copy to your default directory. If you subsequently use a program that requires ENZYME.DAT it will use the copy in your current directory. This allows you to edit the file to your own requirements.
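The lookup behaviour described above (a copy in your current directory takes precedence over the copy in the data directories) can be sketched in Python. This is only an illustration, not how GCG itself is implemented: the function name find_data_file is hypothetical, and the order in which the data directories are searched is an assumption supplied by the caller.

```python
from pathlib import Path


def find_data_file(name, search_dirs):
    """Return the first file called `name` found in `search_dirs`, or None.

    Names are compared case-insensitively (an assumption, mirroring the way
    ENZYME.DAT and enzyme.dat are used interchangeably in the text).
    """
    wanted = name.lower()
    for directory in search_dirs:
        directory = Path(directory)
        if not directory.is_dir():
            continue  # skip directories that do not exist
        for entry in sorted(directory.iterdir()):
            if entry.is_file() and entry.name.lower() == wanted:
                return entry
    return None
```

Listing the current directory first, e.g. find_data_file("enzyme.dat", [Path.cwd(), data_dir]), makes an edited local copy shadow the system copy.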
The GCG data directories of interest to users are:
GENRUNDATA, GENLOCALDATA, GENMOREDATA, GENCOMFILES, GENDOCDATA, GENDOCSOURCE
The top of the enzyme data file looks like this:
Overhang codes :   5'overhang  >  0
                   blunt       =  0
                   3'overhang  <  0
                   unknown     =  ?

         Off              Over   Com                   Commercial     Ref
Enzyme   Set   Site       Hang   ment  Isoschizomers   Source         Number
;AatI      3   AGG'CCT       0   !     Eco147I,StuI    >OU            806
AatII      5   G_ACGT'C     -4   !                     >EFLMNOPRSUX
StuI       3   AGG'CCT       0   !     AatI,Eco147I    >ABEGKLMNPQR
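If you want to check an edited copy of the file outside GCG, a row can be split into its fields along the following lines. This is a minimal sketch in Python based only on the three sample rows above: the function name parse_enzyme_row and the field names are hypothetical, and it assumes that the ! separates the four leading fields from the rest, that supplier codes follow a > with no embedded spaces, and that at most one reference number is present.

```python
def parse_enzyme_row(line):
    """Split one enzyme row into named fields (layout assumed from the samples)."""
    head, _, tail = line.partition("!")
    name, offset, site, overhang = head.split()

    isoschizomers, suppliers, ref_number = [], "", None
    for token in tail.split():
        if token.startswith(">"):
            suppliers = token[1:]           # commercial source codes
        elif token.isdigit():
            ref_number = int(token)         # reference number, if present
        else:
            isoschizomers = token.split(",")

    return {
        "name": name,
        "offset": int(offset),
        "site": site,
        # "?" means the overhang is unknown
        "overhang": None if overhang == "?" else int(overhang),
        "isoschizomers": isoschizomers,
        "suppliers": suppliers,
        "ref_number": ref_number,
    }
```

For example, parse_enzyme_row("AatII 5 G_ACGT'C -4 ! >EFLMNOPRSUX") yields an offset of 5, an overhang of -4 and an empty isoschizomer list.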
The syntax of the file must be preserved when you edit it. The basic rules are:
Enzyme names starting with ; mark isoschizomers that are not selected when all enzymes are selected. Where isoschizomers have the same recognition site AND cut point, only one of them lacks the ; character.
The 'enzyme' names should be no more than 31 letters.
Recognition sequences should be less than 60 characters long and may include IUB ambiguity codes.
There must be an offset number to show the programs where to mark the 'cut'.
An apostrophe (') denotes the cut point on the top strand. An underscore (_) shows the cut site on the bottom strand when the cut does not leave a blunt end.
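The rules above can be checked mechanically before an edited file is used. The Python sketch below is an illustration only, and several points in it are assumptions rather than documented GCG behaviour: the names cut_positions and check_entry are hypothetical; the 31-letter limit is applied to the name without its leading ;; the 60-character limit is applied to the site as written, including the ' and _ marks; the offset is taken to be the number of bases before the apostrophe; and the overhang is taken to be the bottom-strand cut position minus the top-strand cut position. The last two assumptions agree with the three sample rows shown earlier but have not been checked against other enzymes.

```python
# Standard IUB nucleotide codes (assumed to be the full permitted alphabet).
IUB_CODES = set("ACGTRYMKSWHBVDN")


def cut_positions(site):
    """Return (top, bottom) cut positions, counted in bases from the site start."""
    top = bottom = None
    bases_seen = 0
    for ch in site:
        if ch == "'":
            top = bases_seen
        elif ch == "_":
            bottom = bases_seen
        else:
            bases_seen += 1
    if bottom is None:
        bottom = top        # no underscore: blunt cut, same position
    return top, bottom


def check_entry(name, offset, site):
    """Return (selected, problems) for one enzyme entry."""
    problems = []
    selected = not name.startswith(";")   # ';' marks an unselected isoschizomer

    if len(name.lstrip(";")) > 31:
        problems.append("name longer than 31 letters")
    if len(site) >= 60:
        problems.append("recognition sequence not shorter than 60 characters")

    bases = site.replace("'", "").replace("_", "").upper()
    if not set(bases) <= IUB_CODES:
        problems.append("site contains characters that are not IUB codes")

    if site.count("'") != 1:
        problems.append("site needs exactly one apostrophe")
    else:
        top, _ = cut_positions(site)
        if top != offset:
            problems.append("offset does not match the apostrophe position")

    return selected, problems
```

For AatII, cut_positions("G_ACGT'C") gives a top-strand position of 5 and a bottom-strand position of 1, and 1 - 5 = -4 matches the overhang listed in the sample row.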