AFRC TRAINING MANUAL - CHAPTER 9

9. FINDING SEQUENCE MOTIFS

9.1. Locating partial sequences

FINDPATTERNS locates short sequence patterns, or 'motifs'. Patterns are
specified either by typing them in at the keyboard or by creating a file
called PATTERN.DAT.

    $ Findpatterns

    FINDPATTERNS in what sequence(s) ?  Platelet.seq
    .
    .
    Pattern 1:  GGAGGA
    Pattern 2:  TTCTTC
    Pattern 3:

    What should I call the output file (* Platelet.Find *) ?

    Submit Findpatterns job to which batch queue (* LONG *) ?  Short

Examine the file PLATELET.FIND.

When searching for several different patterns, it may be easier to supply
them in a file. FINDPATTERNS recognises the "local data file" name
PATTERN.DAT. An example can be fetched:

    $ FETCH pattern.dat
    $ Type pattern.dat

    Name      Offset  Pattern             Overhang  Documentation ..

    BamHI     1       GGATCC              0         !
    EcoRI     1       GAATTC              0         !
    Promotor  1       TAATA(N){20,30}ATG  0         !

These files are similar to the enzyme data files used by the mapping
programs, and FINDPATTERNS can even read an enzyme data file directly.

Edit the file to add a real or imaginary pattern of your own. This time,
allow for a single mismatch in any pattern by using /mismatch:

    $ Findpatterns/mismatch=1

    FINDPATTERNS in what sequence(s) ?  @myoglobin.strings

    Search patterns read from "Pattern.Dat"

    What should I call the output file (* FindPatterns.Find *) ?  myoglobin.find

    Submit Findpatterns job to which batch queue (* LONG *) ?  Short

The file MYOGLOBIN.STRINGS was created using the STRINGSEARCH program,
which identified all entries with "myoglobin" in the description line.
MYOGLOBIN.FIND shows all the occurrences of the patterns (with up to one
mismatch) in the sequences named in MYOGLOBIN.STRINGS.

9.2. Protein motifs

The PROSITE database has about 500 protein motifs, identified and
documented from the Swissprot protein database. A motif is a partial
sequence associated with a particular function or structure that has been
identified in a group of proteins.
The GCG program MOTIFS searches for PROSITE patterns within a sequence or
sequences.

    $ Motifs

    MOTIFs from what protein sequence(s) ?  sw:kad1_human

    What should I call the output file (* Kad1_Human.Motifs *) ?

    Kad1_Human len: 194  .............

The output file shows the motif, the fitted pattern and the documentation
for each matching motif in PROSITE. Frequently found patterns, such as
post-translational modifications, are NOT shown unless the /FREQ command
line parameter is used.

9.2.1. Retrieving PROSITE documentation

MOTIFS is very like FINDPATTERNS in that it uses a file of patterns
(PROSITE.PATTERNS). To see what patterns are available, retrieve the
pattern file.

    $ FETCH prosite.patterns

Have a look at the file using the editor.

    PROSITETOGCG of: D16:[Flat.Prosite]Prosite.Doc and
    D16:[Flat.Prosite]Prosite.Da*  Release 10.1 (4/93)

    Name              Offset  Pattern ..                        PDoc_Name

    11s_Seed_Storage  1       NGx(D,E)2x2C(S,T)                 0284.PDoc
    .
    .
    Adenylate_Kinase  1       (L,I,V,M,F,Y,W)3DG(F,Y)PRx3(N,Q)  0104.PDoc

The 11s_Seed_Storage pattern is the first in the file. The documentation
for that pattern is in the file 0284.PDOC, which can also be fetched:

    $ FETCH 0284.Pdoc

Again, have a look at the file using the editor (or $ Type).
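
The pattern notation used in PATTERN.DAT and PROSITE.PATTERNS can be easier to read when compared with a familiar regular expression. The Python sketch below is not part of GCG. It assumes, judging only from the examples shown in this chapter, that parentheses with commas list alternatives, that a bare number repeats the preceding element, that {m,n} gives a repeat range, and that x (protein) or N (nucleotide) stands for any symbol. The function name gcg_to_regex and the test sequence are hypothetical, and mismatches (/mismatch) are not handled.

```python
import re


def gcg_to_regex(pattern: str, nucleotide: bool = False) -> str:
    """Translate a GCG-style pattern into a Python regular expression.

    The syntax rules are assumptions inferred from the example patterns
    in this chapter, not taken from the GCG documentation.
    """
    any_symbol = "N" if nucleotide else "x"  # assumed wildcard symbol
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "(":
            # (A,B,C) -> character class [ABC]; (N) or (x) -> any symbol
            close = pattern.index(")", i)
            members = pattern[i + 1:close].split(",")
            if members == [any_symbol]:
                out.append(".")
            else:
                out.append("[" + "".join(members) + "]")
            i = close + 1
        elif ch == "{":
            # {m,n} repeat range is already valid regex syntax
            close = pattern.index("}", i)
            out.append(pattern[i:close + 1])
            i = close + 1
        elif ch.isdigit():
            # A bare number repeats the preceding element: x3 -> .{3}
            j = i
            while j < len(pattern) and pattern[j].isdigit():
                j += 1
            out.append("{" + pattern[i:j] + "}")
            i = j
        elif ch == any_symbol:
            out.append(".")
            i += 1
        else:
            out.append(re.escape(ch))
            i += 1
    return "".join(out)


# Patterns copied from the listings in this chapter.
adenylate_kinase = gcg_to_regex("(L,I,V,M,F,Y,W)3DG(F,Y)PRx3(N,Q)")
promotor = gcg_to_regex("TAATA(N){20,30}ATG", nucleotide=True)

# Made-up test sequence, for illustration only (not a real protein).
toy_sequence = "GKLIVDGFPRAAANK"
match = re.search(adenylate_kinase, toy_sequence)
if match:
    # Report the position counting from 1, as sequence programs usually do.
    print(match.start() + 1, match.group())
```

Under these assumptions the Adenylate_Kinase pattern becomes [LIVMFYW]{3}DG[FY]PR.{3}[NQ], which is a compact way to check by eye what a line in the pattern file will and will not match before submitting a search.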