DATASET allows you to create a database from a set of GCG format sequences. The database can be made from a subset of any of the other GCG databases, or your own sequence data, or a mixture of both.
First you need a file of sequence names, as before. The file MYOGLOBIN.STRINGS is an example.
$ Dataset
Assemble DATASET from what sequence(s) ? @myoglobin.strings
What should I call the data library ? myog
.
.
DATASET complete:
Sequences: 24
Total length: 28,445
Output files: Myog.Seq, .Header, .Ref, .Offset, .Names, .Numbers, .SeqCat
DATASET creates a logical name for your database, so all GCG programs will know where it is. The name is written to the file GCGMYINIT.COM in your login directory.
You can use FETCH, FASTA and other programs with the myog database:
$ FETCH MYOG:AGMG
$ FASTA/in1=AGMG.MYOG
Search for query in what sequence(s) (* GenEMBL *) ? MYOG:*
{submit the search to the SHORT queue}
You should always choose a unique name for your database, preferably one that is clearly distinct from any existing database (e.g. avoid names beginning with NEW, PROT, NUC or OLD). If you forget the name of the database, look in the GCGMYINIT.COM file.
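One way to check the name is sketched below. It assumes that DATASET defined a logical name called MYOG, as in the example above, and that your login directory is reachable through the standard SYS$LOGIN logical name.

```dcl
$ ! Display the definitions that DATASET wrote to your login directory
$ TYPE SYS$LOGIN:GCGMYINIT.COM
$ ! Check whether the logical name is defined in the current session
$ ! (MYOG is the example database name used above)
$ SHOW LOGICAL MYOG
```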
A database may be shared amongst a group of people. It is quite simple to place a single command line in each person's LOGIN.COM file to run a common GCGMYINIT.COM file. The database can be maintained by one person in the group (possibly in a separate user account). Before sharing, consider who needs access to the data and who else could potentially see it.
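The single line in LOGIN.COM could look like the sketch below. The device and directory (DISK$GROUP:[GCGSHARED]) are hypothetical, and the file name follows the GCGMYINIT.COM spelling used earlier; substitute the actual location and name of the shared file at your site.

```dcl
$ ! Run the shared initialisation file at login so that the group's
$ ! database logical names are defined (hypothetical location)
$ @DISK$GROUP:[GCGSHARED]GCGMYINIT.COM
```

Each member of the group needs read access to the shared file and to the database files it points to.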
The output from the STRINGSEARCH program can be used directly as a FOSN (file of sequence names). To get a FOSN from a FASTA search, edit the output file so that only the ranked-scores table remains, omitting the alignments. (Otherwise the sequences will appear in your "database" several times!)