AFRC TRAINING MANUAL - CHAPTER 13

13. GCG DATA FILES

13.1. Local data files

Many of the GCG programs use "local data files" to read default data for running the program. The programs have been deliberately constructed to allow users to change those files which have a strong influence on the results of any analysis. The files you are likely to be most interested in are: the enzyme tables, codonpreference tables, translation tables, symbol comparison tables and secondary structure tables. The subtopic "local_data_files" for each program in GENHELP documents the file(s) used.

13.1.1. Enzyme tables

The full list of enzymes used by the mapping programs is in a file called enzyme.dat:

$ Fetch Enzyme.dat
$ Type Enzyme.dat

You can reduce the list of enzymes to just the ones you are interested in by editing this file. You can also make the mapping programs identify any sequence you choose, even if it is not a restriction site! You should observe the syntax used by the enzymes already present.

Use the editor to add the following lines to the ENZYME.DAT file:

MotI     1  CGCTCC  0  ! Motif
MotII    1  CCTCGC  0  ! Inverted motif
MadI     3  bA'rmy  0  ! Insane example

'b', 'r', 'm' and 'y' are all nucleotide ambiguity codes.

Now run one of the mapping programs. eg: Use MAP.

$ MAP
(Linear) MAP of what sequence ? platelet.seq
Begin (* 1 *) ?
End (* 1308 *) ?
*** I read your enzyme data file "Enzyme.dat"!! ***
Enzyme(* * *): mot*
"MOT*" selected 2 enzymes, new total: 2.
Enzyme: madi
"MADI" selected 1 enzyme, new total: 3.
Enzyme:
Please select (capitalize for 3-letter) (* t *): {space}
What should I call the output file (* Platelet.Map *) ?
$ Type platelet.map

NB: The program FINDPATTERNS can be used to search the databases for motifs, by providing a data file in the same format.

13.1.2. Codon usage (or codonpreference) tables

A codon usage table is compiled from the codons within a defined set of sequences. The table lists every codon and the frequency of its use within the observed sequences.
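To make the idea of "compiling" a codon usage table concrete, here is a minimal Python sketch. It is not how CODONFREQUENCY works and it does not write the GCG .COD file format; the function name codon_usage and the per-thousand scaling are assumptions chosen for illustration. It assumes each input is a DNA coding sequence already in frame.

```python
from collections import Counter
from itertools import product


def codon_usage(sequences):
    """Count codons over in-frame coding sequences (hypothetical helper).

    Returns a dict listing all 64 codons with their frequency per
    thousand codons observed. Codons containing symbols other than
    A, C, G, T are ignored. This is an illustration only, not the
    GCG .COD format.
    """
    all_codons = ["".join(c) for c in product("ACGT", repeat=3)]
    counts = Counter({codon: 0 for codon in all_codons})
    for seq in sequences:
        seq = seq.upper()
        usable = len(seq) - len(seq) % 3  # drop a trailing partial codon
        for i in range(0, usable, 3):
            codon = seq[i:i + 3]
            if codon in counts:
                counts[codon] += 1
    total = sum(counts.values())
    if total == 0:
        return {codon: 0.0 for codon in all_codons}
    return {codon: 1000.0 * n / total for codon, n in counts.items()}


# Two toy "genes": ATG GCT GCT TAA and ATG GCT AAA TAA
table = codon_usage(["ATGGCTGCTTAA", "ATGGCTAAATAA"])
```

With these two toy sequences, GCT accounts for three of the eight codons counted, and codons that never occur are still listed with a frequency of zero.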
The set of sequences may be related by gene family, organism, type of expression, regulation, or any combination of these. In turn the table is used (eg: by CODONPREFERENCE) to check whether other sequences have the same usage, or bias (if it exists), and could thereby be related.

The default table in GCG is ECOHIGH.COD, which you can FETCH and examine. ECOHIGH.COD is for "highly expressed genes in E.coli". You can create your own table using CODONFREQUENCY. Some additional tables for other organisms are available in the GENMOREDATA directory. Additional tables in GENLOCALDATA show the observed codon-usage tables by ORGANISM, unsorted, and do not necessarily reflect a codon preference.

$ DIR GENMOREDATA:*.COD

The .COD files can be fetched or referred to using the full file specification. eg: at the prompt from CODONPREFERENCE:

What codon frequency file (* GenDocData:EcoHigh.Cod *) ?

Your reply would be:

GENMOREDATA:ATH.COD

The codonpreference table is very important when using the BACKTRANSLATE program.

Run the CODONPREFERENCE program with the default codon usage table against the sequence EMBL:AGMG. Why does this "whale myoglobin" sequence have a codon usage like that of highly expressed E.coli genes?

13.1.3. Symbol comparison tables

PRETTYPEP.CMP              - Dayhoff evolutionary matrix, used by many programs.
GENMOREDATA:PAM250.CMP     - An alternative version of the above.
GENMOREDATA:BLOSUM62.CMP   - Another alternative protein matrix
SWGAPDNA.CMP               - Bestfit DNA matrix
SWGAPPEP.CMP               - Bestfit protein matrix
NWSGAPDNA.CMP              - Gap DNA matrix
NWSGAPPEP.CMP              - Gap protein matrix

All the above can be FETCHed. Note the difference between the DNA tables used by BESTFIT and GAP. BESTFIT penalises mismatches more heavily. Other programs use similar tables, or slight variations. You will note that only one protein symbol comparison table is used throughout by different programs. JCALIEN and (G)CLUSTALV can also use different tables. Tables are also used by PROSRCH (chapter 6) and PROFILESEARCH (chapter 10).
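The effect of a heavier mismatch penalty can be seen with a small Python sketch. The two tables below are hypothetical: their values are made up for illustration and are not copied from SWGAPDNA.CMP or NWSGAPDNA.CMP, and the helper names make_dna_table and score_ungapped are assumptions. The sketch scores two equal-length sequences position by position, with no gaps.

```python
def make_dna_table(match, mismatch):
    """Build a toy DNA symbol comparison table (hypothetical values)."""
    return {(a, b): (match if a == b else mismatch)
            for a in "ACGT" for b in "ACGT"}


def score_ungapped(seq_a, seq_b, table):
    """Sum the table value for each aligned pair of symbols (no gaps)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(table[(a, b)] for a, b in zip(seq_a.upper(), seq_b.upper()))


# Illustrative values only, NOT taken from the GCG .CMP files.
lenient_table = make_dna_table(match=1.0, mismatch=0.0)
strict_table = make_dna_table(match=1.0, mismatch=-1.0)

# Four matching positions and two mismatches.
lenient_score = score_ungapped("ACGTAC", "ACCTAG", lenient_table)
strict_score = score_ungapped("ACGTAC", "ACCTAG", strict_table)
```

The same pair of sequences scores lower under the table with the negative mismatch value, which is why the choice of symbol comparison table can change which alignment a program prefers.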
13.1.4. Translation tables

There are five different translation tables available in GCG:

Translate.Txt          - "Universal or standard" table
TransCiliate.Txt       - Ciliated protozoa
TransMitoDros.Txt      - Drosophila mitochondrion
TransMitoMam.Txt       - Mammalian mitochondrion
TransMitoYeast.Txt     - Yeast mitochondrion

By default the Translate.txt table is used. To use any of the others, first FETCH them, then either rename them to Translate.txt or enter the file name on the command line. In this example, with the mammalian mitochondrial table, two of the standard ARG codons have become END codons.

$ Fetch transmitomam.txt
$ translate/trans=transmitomam.txt
TRANSLATE from what sequence ? platelet.seq
Begin (* 1 *) ? 388
End (* 1308 *) ? 1020
Reverse (* No *) ?
Range begins ATGAG and ends CCACC. Is this correct (* Yes *) ?
.
.
Please choose one (* W *):
What should I call the output file (* Platelet.Pep *) ? mam.pep

Now try

$ Frames/font=1/Trans=transmitomam.txt

13.1.5. Yet more data!!

The directory GENMOREDATA holds a few more surprises:

ENZ_REFS.TXT       - Published references for the ENZYME.DAT enzymes.
ENZ_SOURCES.TXT    - Addresses of commercial suppliers.
TFSITES.DAT        - The transcription factor sites.

You can use the TFSITES*.DAT files with FINDPATTERNS

$ Findpatterns/data=genmoredata:tfsites.dat

13.2. PLASMIDMAP files

MAPSORT/fragment creates a file that allows PLASMIDMAP to display boxes for the sequence range between cut sites. This is called "block" display as opposed to the default "tick" display. (PLASMIDMAP /remarks=0 removes the text below the circle, /nocaption removes the caption to the left.)

$ Mapsort/fragment/Infile=platelet.seq
(Choose just two or three enzymes)
$ Plasmidmap/remarks=0/nocaption platelet.block

The PLATELET.BLOCK file denotes "block" against the cut positions. PLASMIDMAP interprets these accordingly. We can use "block", "range" and "tick" features together if we edit the file.

13.2.1. Displaying blocks and ranges

In chapter 3 we used the "tick" file produced by MAPSORT/PLASMID; it contained information for PLASMIDMAP to mark the enzyme cut sites:

$ Mapsort/mincuts=2/maxcuts=2/plasmid/Infile=platelet.seq
{accept all the defaults}

Now edit the platelet.tick file to create a label for the region between the two HaeI sites.

HaeI     844   844  .  Black  .  .  Tick
HaeI     972   972  .  Black  .  .  Tick
MaeII    724   724  .  Green  .  .  Tick
MaeII    857   857  .  Green  .  .  Tick
MaeIII  1091  1091  .  Green  .  .  Tick
MaeIII  1215  1215  .  Green  .  .  Tick

To read:

HaeI     844   972  +  Black  |  >  Range
MaeII    724   857  -  Green  >  >  Range
MaeIII  1091  1215  +  Green  .  .  Block

The only styles available are tick, block and range. The direction of a range is controlled by a + or - in the strand column.

Display this in the usual way:

$ Plasmidmap/font=1 platelet.tick
{accept all the defaults}
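The hand edit described above (two tick lines for one enzyme merged into a single range or block line) can also be scripted. The following Python sketch is an assumption-laden illustration: it treats each line as eight whitespace-separated fields in the order shown in the example (name, start, end, strand, colour, two further fields, style), and the function name ticks_to_feature is hypothetical. The text does not explain the sixth and seventh fields, so the sketch copies them unchanged; set them by hand if you want the end markers shown in the HaeI example.

```python
VALID_STYLES = ("Tick", "Block", "Range")


def ticks_to_feature(tick_a, tick_b, strand="+", style="Range"):
    """Merge two tick lines for the same enzyme into one feature line.

    Assumes eight whitespace-separated fields per line, as in the
    platelet.tick example. Fields six and seven are copied unchanged
    because their meaning is not documented in this chapter.
    """
    if style not in VALID_STYLES:
        raise ValueError("style must be one of Tick, Block or Range")
    if strand not in ("+", "-"):
        raise ValueError("strand must be + or -")
    a = tick_a.split()
    b = tick_b.split()
    if len(a) != 8 or len(b) != 8:
        raise ValueError("expected eight fields per line")
    if a[0] != b[0]:
        raise ValueError("both lines must name the same enzyme")
    start, end = sorted((int(a[1]), int(b[1])))
    fields = [a[0], str(start), str(end), strand] + a[4:7] + [style]
    return " ".join(fields)


hae_range = ticks_to_feature(
    "HaeI 844 844 . Black . . Tick",
    "HaeI 972 972 . Black . . Tick",
)
mae_block = ticks_to_feature(
    "MaeIII 1091 1091 . Green . . Tick",
    "MaeIII 1215 1215 . Green . . Tick",
    style="Block",
)
```

Here hae_range holds a Range line spanning 844 to 972 on the + strand, and mae_block holds the Block line for the two MaeIII sites. Paste the resulting lines into platelet.tick in place of the original tick pairs.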