Newsgroups: bionet.molbio.gene-linkage,bionet.molbio.genome-program
From: rootd@ohsu.edu (Darrell Root)
Subject: bionet.molbio.gene-linkage FAQ under construction
Message-ID: <1994May16.051730.25905@ohsu.edu>
Organization: Oregon Health Sciences University
Date: Mon, 16 May 1994 05:17:30 GMT

This is a PRELIMINARY ROUGH DRAFT of a FAQ (frequently asked questions list) for the bionet.gene-linkage newsgroup.

A FAQ is a document where commonly asked questions can be answered by the experts in that area. By putting common questions (and their answers!) into one document, researchers will waste less time searching the internet for answers to common questions.

This is my attempt to start a FAQ for bionet.gene-linkage. When reading through this preliminary rough draft, please remember the following:

1) I am not an expert in these topics (except for Sun speed optimization)
2) I do not even know many of the questions which should be asked/answered
3) The questions I have asked/answered are biased toward my areas of research, so other (valid!) areas may not be represented at all

What I would like to do is create a first-class FAQ for genetic-linkage researchers to use. I cannot do this alone. Below is my preliminary rough draft. Read through it. Pick a question I have not: 1) asked, 2) answered completely, or 3) answered correctly.
State the question clearly, answer it to the best of your ability (even if you just add one point which I failed to mention), and email the result to rootd@ohsu.edu

I will incorporate new submissions and post an updated FAQ during the May 28-29 weekend. Eventually I will arrange for the FAQ to be archived at the normal FAQ archives (if this project is successful). Think of all the time you waste looking for stuff on the internet, all of which could be saved if a first-class FAQ were created!

For the record, this project is not supported by any of our grants. I wrote this on my own time.

Darrell Root
rootd@ohsu.edu

BIONET.GENE-LINKAGE FREQUENTLY-ASKED-QUESTIONS

1) Where can I obtain the bionet.gene-linkage FAQ?
2) What anonymous-ftp sites have programs/utilities useful for genetic linkage analysis?
3) I think I know the name of a program I want, but I don't know where I can find it
4) I have an ftp site with gene-linkage programs/utilities on it. How do I get registered with the archie servers?
5) What gopher sites have useful genetic-linkage information?
6) What database management programs do people use for genetic-linkage data?
7) What programs are available for pedigree drawing?
8) Why are some programs used primarily for chromosome mapping, while others are used for disease-mapping?
9) What programs are used for chromosome mapping?
10) What programs are used for disease-gene mapping?
11) How do you calculate MAXHAP?
12) What programs are available to help detect errors in linkage data?
13) What books are helpful when learning about genetic linkage analysis?
14) How can I increase the speed of the linkage/fastlink package on my workstation?
15) I set up 300 megs of paging space on my workstation, but now I'm running out of hard-drive. Is there any way I can use my hard drive space more efficiently?
16) But I don't know how to do all this optimization, and my research assistant is spending all his/her time trying to figure it out.
17) What genetic-linkage databases are available on the internet?


1) Where can I obtain the bionet.gene-linkage FAQ?
[rootd;15may94]

It is available by anonymous ftp from ursula.ee.pdx.edu in /pub/users/cat/rootd. Once it is no longer "preliminary" I will make certain it is on all the normal FAQ archive sites.

2) What anonymous-ftp sites have programs/utilities useful for genetic linkage analysis?
[rootd;15may94]

corona.med.utah.edu has Jurg Ott's Linkage package for many platforms, including some binaries.

york.ccc.columbia.edu also has Jurg Ott's linkage package, but it is on a platform running VMS, and it is difficult for us UNIX types to "look around" there.

softlib.cs.rice.edu has FASTLINK, the optimized C versions of linkage, which continue to undergo massive improvements.

genome1.hgen.pitt.edu has Multimap, a Lisp-based expert system which uses an optimized version of crimap to map chromosomes.

ftp.bchs.uh.edu has some useful IBM programs, including:

    peddraw   a DOS pedigree drawing program (completely different from the B. Dyke Macintosh peddraw 4.x)
    fastmap   produces a quick approximation to multipoint lod scores
    dolink    a DOS genetic database/analysis-setup program
    easistat  a simple DOS statistics package
    easigraf  draws graphs of lod scores

prep.ai.mit.edu is the home of GNU (the Free Software Foundation), which produces free software (such as the gcc compiler and the emacs editor).

wuarchive.wustl.edu is the largest anonymous-ftp site on the planet. They have the whole GNU/Free Software Foundation distribution, and tons of other stuff.

mendel.welch.jhu.edu has all the files for OMIM (Online Mendelian Inheritance in Man) and GDB (Genome Data Base). Searching with the online search program (see question 17) is much easier than browsing the files.

[I need an ftp site for crimap]

[There are many more sites with useful stuff.
Email information to rootd@ohsu.edu and I will add it to this list]

3) I think I know the name of a program I want, but I don't know where I can find it
[rootd;15may94]

There is a database program called archie, which maintains a list of all files in registered anonymous-ftp sites. You can telnet to an archie server and have it search the database. Each site is updated every 30 days, so very recently posted programs might not be listed yet.

To use archie, you need to telnet to one of the archie server sites, which are:

    archie.rutgers.edu
    archie.sura.net
    archie.unl.edu
    archie.ans.net
    archie.mcgill.ca

(thanks to O'Reilly's Internet book for this list)

Use the login name "archie" and nothing as your password. Here is a simple archie login and search:

    bigbox% telnet archie.unl.edu
    login: archie
    password:                  <--just hit return, not like anonymous ftp
    unl-archie> find linkmap
    # Search type: sub.
    # Your queue position: 2
    # Estimated time for completion: 00:24
    working... -

    Host gatekeeper.dec.com    (16.1.0.2)
    Last updated 21:04  9 Apr 1994

        Location: /contrib/src/pa/m3-2.07/src/driver/boot-DS3100
          FILE    -rw-r--r--    4000 bytes  23:00  2 Jun 1992  M3LinkMap_i.c
          FILE    -rw-r--r--   14027 bytes  23:00  2 Jun 1992  M3LinkMap_m.c

        Location: /contrib/src/pa/m3-2.07/src/driver/linker/src
          FILE    -rw-r--r--    1307 bytes  00:00  4 Dec 1991  M3LinkMap.i3
          FILE    -rw-r--r--    3078 bytes  00:00  4 Dec 1991  M3LinkMap.m3

    unl-archie>

Unfortunately, these linkmap programs have nothing to do with J Ott's linkage package. Most gene-linkage programs are not on registered ftp sites.

4) I have an ftp site with gene-linkage programs/utilities on it. How do I get registered with the archie servers?
[rootd;15may94]

Send email to archie-admin@bunyip.com with the domain name of the ftp site and the email address of the administrator. If you are the administrator of the ftp site, identify yourself as such.

5) What gopher sites have useful genetic-linkage information?
[rootd;15may94]

gopher.gdb.org has background information on the human genome project, and archives of the "Human Genome News" newsletter.

Editor's note: There are many more, including the genethon gopher site (whose address I do not know).

6) What database management programs do people use for genetic-linkage data?
[rootd;15may94]

Paradox: This is a full database-management system available from Borland for IBM machines. Like most other "full feature" databases, it is reliable and supported on most IBM platforms, but not tailored specifically to the needs of genetic researchers. It has a good educational discount. We use it, but have to repeatedly set up our report formats for linkage output. Getting liped output format is nontrivial.

Linksys: This custom-made database program was written by J Attwood and S Bryant. Although they continue to use it, Dr Attwood suggests using dolink instead. Linksys is not currently available at any ftp sites.

Dolink: This DOS custom database program (by D Curtis I think??) manages genetic data and sets up input files for your analysis. It is available from ftp.bchs.uh.edu

Kindred: This new DOS database program, distributed by Epicenter Software, is specifically designed for linkage analysis. A free demo is available by calling (818)-304-9487. In addition to database duties, this program (according to the ad, not from personal experience) will draw pedigrees, haplotype marker data, and output in linkage format. The demo did not work on our IBM because our monitor is from the stone age. We were able to get the demo to run on a Power-PC Mac with SoftWindows emulation, but it crashed the Mac when we hit the escape key during the demo. Be forewarned: the list price is about $500.

CEPH: This database is specifically designed for chromosome mapping with CEPH-style pedigrees. It can output data in ped.out format or linkage format. Our version (5.0) fails when we output more than 90 markers, but not when we output the entire dataset.
Santosh Gupta wrote a program (called mkcrigen) which converts the ped.out files to .gen files. Unfortunately we only have an old binary which was compiled with a maximum of about 85 markers. If you try to convert a ped.out file with more than 85 markers to a .gen file, your final .gen file is messed up. Santosh Gupta modified the program to work with 500 markers, but we do not have any source code for mkcrigen (any version) and we do not have a binary for the improved version. Some other labs output the data in linkage format and convert that to .gen format. We don't like that because it separates the marker name from the marker data, and can result in errors. I believe that the CEPH database is available on the CEPH ftp site, but I do not have the address.

[Please send comments on database programs you use]

7) What programs are available for pedigree drawing?
[rootd;15may94]

peddraw (IBM version): This program (possibly written by Dave Curtis) is a pedigree drawing program for IBMs available from ftp.bchs.uh.edu in the /pub/gene-server/dos directory. I have never used it.

ftree: This is another IBM pedigree program written by Rodney C.P.(?) at the University of Alabama. I have a copy, but do not know where this program is available. I don't use it, but some old pedigrees in a notebook look very pretty.

peddraw (Mac version): This program, written by B Dyke, P Mamelka, and J MacCluer, is available from:

    Paul Mamelka
    Department of Genetics
    Southwest Foundation for Biomedical Research
    P.O. Box 28147
    San Antonio, TX 78228-0147

An upgrade from a previous version is $10 (current version = 4.4). Documentation costs $10. I THINK the program itself costs $35, but that may be too high.

8) Why are some programs used primarily for chromosome mapping, while others are used for disease-mapping?
[rootd;15may94]

Any family can be used for chromosome mapping, so CEPH has picked a particular family "shape" and generated a large database with these families.
Programs designed for chromosome mapping can be optimized for using these families, reducing the time needed for calculations.

Only families afflicted with a disease can be used for disease-gene mapping. As a result, programs designed for disease-gene mapping need to be able to deal with arbitrary pedigrees. In addition, these programs need to be able to handle incomplete penetrance.

9) What programs are used for chromosome mapping?
[rootd;15may94]

crimap: This program has been used for chromosome mapping for years. It has options which can generate maps, calculate order probabilities, and print out recombination data. It works on .gen files with data from CEPH-style families.

multimap: This Lisp-based expert system uses an optimized version of crimap to create a chromosome map. It is available via anonymous ftp from genome1.hgen.pitt.edu. The authors (T Matise, M Perlin, and A Chakravarti) continue to improve the code, add new functions, and provide excellent support. When used with the crimap chrompic option (which finds double recombinations to identify possible errors), it is incredibly useful.

10) What programs are used for disease-gene mapping?
[rootd;15may94]

Simlink: This Fortran program (by L Ploughman and M Boehnke) simulates linkage analysis on a family to "estimate the probability, or power, of detecting linkage given family history information on a set of identified pedigrees." It allows the researcher to determine whether a family is sufficiently informative to detect linkage. In addition, it can help the researcher decide how far apart to space their genetic probes without "missing" the disease locus (i.e., do I use probes separated by 30 cM, or will 40 cM be close enough given the informativeness of this family?). This can save the researcher considerable time and money. The researcher won't waste money doing a genome search on an insufficiently informative family.
Large families can be "trimmed" during the initial genome search, and then the entire family can be used later during marker localization. Simlink data can be useful on grant applications (to prove that the family you propose to analyze is sufficiently informative). Simlink requires large quantities of memory. It was written for IBMs, but has been ported to many platforms including Sequent Symmetry S8000's.

Liped: This IBM program (written by Jurg Ott) calculates probabilities for genetic linkage between disease-markers and genetic-markers. Its input file differentiates between phenotypes and genotypes. As a result, this program is easiest to use when your data is from "old-style" genetic markers (such as blood phenotype data).

Linkage: This package of programs, written by Jurg Ott in Pascal, calculates genetic linkage probabilities. It consists of several analysis programs (each of which does a particular type of analysis) and several utility programs (which make the analysis programs easy to use). Versions are available for IBMs and Unix platforms. Here are some of the analysis programs:

    mlink:    2-point lod-score calculations at fixed recombination distances
    linkmap:  multipoint lod-score calculations at fixed distances
    ilink:    calculates the recombination distance with the highest lod-score

fastlink: This is a port of the linkage package to C (by A Schaffer, R Cottingham, and R Idury). The initial port increased the speed by an order of magnitude. They continue to optimize the algorithm and code, resulting in continued speed improvements. In addition, fastlink allows you to compile in "fast" or "slow" mode (the slow version of fastlink is still much faster than the old linkage programs). The "fast" version uses a ton of memory, but uses that memory to hold some of the intermediate results which are repeatedly recalculated in the "slow" version (and the old linkage package).
We obtain good results by setting up 300 megs of virtual memory on our sparc and using the fast version (at one point we ran a fastlink linkmap run with 700 haplotypes). The fastlink programs are also more portable. Earlier versions of fastlink required installation of p2c (the free-software foundation's pascal-to-C converter). That is no longer necessary.

emaillink: I am developing an email server for the fastlink programs, which will allow users to submit linkage data for analysis via email (similar to NIH's "Blast" DNA homology search server). The system is currently working, but needs improvement before we beta-test. Beta-testing should begin later this month.

11) How do you calculate MAXHAP?
[rootd;15may94]

Maxhap is the maximum possible number of haplotypes in your analysis. You multiply together the number of alleles at each locus used in a particular run (not all the loci in your dataset, just the loci you use). Remember that affection status counts as two alleles, regardless of the number of liability classes. For example, if a dataset has the following information:

    affection status: 4 liability classes
    marker A: 3 alleles
    marker B: 4 alleles
    marker C: 5 alleles

and your run is a linkmap run between affection status, A, and B, then your MAXHAP must be (at least) 2*3*4 = 24.

12) What programs are available to help detect errors in linkage data?

By linkage data, I mean any genetic-linkage dataset, not just those for Ott's Linkage package.

This is an important question, and I simply do not know the answer. I've used the crimap chrompic option, and played with xpic/phap a little bit, but I really hope some people send me some information on this topic.

13) What books are helpful when learning about genetic linkage analysis?
[rootd;15may94]

Jurg Ott's Analysis of Human Genetic Linkage is THE work in this area, but it is very advanced and difficult for many people (including me) to understand (although I haven't tried recently, I should try again...)
Rumors indicate that Jurg Ott is also writing a book on how to use the linkage package.

Please send me other suggestions.

14) How can I increase the speed of the linkage/fastlink package on my workstation?
[rootd;15may94] [aha, finally a question I can confidently answer!]

a. Use fastlink (it will increase your speed by an order of magnitude).

b. Set up tons of paging space (using the hard drive as virtual memory) and use the "fast" versions of fastlink. 300 megs is usually plenty.

c. Use gcc (the GNU/Free Software Foundation C compiler) to compile fastlink (gcc produces machine code that is about 10% faster than Sun's C compiler).

d. Install the generic-small kernel instead of the generic kernel (the generic kernel has device drivers for almost EVERYTHING; the generic-small kernel is configured for a system without many devices and without many users). Installing a generic-small kernel is an option during system installation on Sun workstations.

e. Reconfigure your kernel so it has only the devices which you need. This is a task for an experienced system administrator. This should give you a small improvement in overall system speed, but if you are already running the generic-small kernel, the additional improvement may be so small that it's not worth the trouble. If the generic-small kernel is insufficient for your system (so you were forced to install the generic kernel) this step is a MUST. The generic kernel will slow down your workstation significantly, and most of the device support is unnecessary.

f. Don't run your linkage analyses in the background, because running programs in the background gives them a lower priority (on Suns it reduces the priority level by 3 out of a total range of 40). Either do the runs in the foreground (which is fine as long as you don't plan to log out) or use the root password to renice the pedin process by -3 to compensate (negative nice values give a higher priority).
If you need to log out, you can use the screen command (distributed by GNU/Free Software Foundation) and "detach" a session so you can log out without programs terminating. Later you can log back in and "reattach" the session, which continued to run while you were logged out. The screen command is available at prep.ai.mit.edu, and is also on the O'Reilly Unix Power Tools CD-ROM. According to the Sun documentation, renicing below -10 can interfere with the operating system and actually reduce the process's speed. I just run them at a priority/nice level of 0 (the standard default level). That gives me reasonable response with my other applications, but still lets fastlink run at a decent speed.

g. Run with 100% penetrance. Such runs can be faster than runs with incomplete penetrance. Of course, if you have an unaffected obligate carrier, this won't work. In addition, incomplete-penetrance runs may be necessary for your research to be "good" (decisions like this are why the professors make the big bucks :-)

Of course, buying more RAM will increase your speed. I've heard that increasing RAM from 16 to 32 megs will result in a large increase in speed. Increasing RAM from 32 to 64 megs will result in a significant increase. Increasing beyond 64 megs is not particularly helpful. Note that this data is anecdotal in nature (I haven't seen it myself), but it makes intuitive sense to me. If someone sends me some SIMMs for our sparcII, I'll be glad to test it out :-)

[note: I run on a Sun sparcII. I'd like to hear data from people on other platforms. I'd especially like to hear data on the speed-RAM relationship.]

15) I set up 300 megs of paging space on my workstation, but now I'm running out of hard-drive. Is there any way I can use my hard drive space more efficiently?

Paging space is hard-drive space which is used as virtual RAM. Unix boxes use paging space constantly, swapping processes out to the hard drive and back into RAM.
There are two types of paging space on Sun systems (and many other types of Unix systems as well): paging files and paging sectors. Paging files are actual files in the filesystem (you can do an ls and find them in a directory somewhere). Paging sectors are separate disk partitions, and as such are not in the filesystem.

A filesystem has two types of overhead. Consider the following output:

    bigbox% df
    Filesystem            kbytes    used   avail capacity  Mounted on
    /dev/sd0a               7735    5471    1491    79%    /
    /dev/sd0g             151399  127193    9067    93%    /usr
    /dev/sd3a             306418  266644    9133    97%    /usr2
    bigbox% df -i
    Filesystem             iused   ifree  %iused  Mounted on
    /dev/sd0a                951    3913    20%   /
    /dev/sd0g              10218   66390    13%   /usr
    /dev/sd3a               6278  150394     4%   /usr2

The top df command shows the space available on "bigbox" in kbytes. Note that, although sd3a has 306 megs, of which 267 megs are used, only 9 megs are available. This is because the filesystem saves a "10%" rainy-day fund, so 10% of the filesystem is unusable. Although you can reduce this percentage (with the root password and using an arcane command), it is not recommended. According to Sun's documentation, when the filesystem gets more than 90% full the speed of the filesystem will begin to drop rapidly. When you have a 100 meg paging file, there is a corresponding 10 megs of "rainy-day fund" which you cannot access, so setting up a 100 meg paging file requires about 110 megs of disk space. But when you use a separate partition as a paging sector, no 10% rainy-day fund is necessary. 100 megs of raw disk space will give you 100 megs of virtual RAM.

The bottom df command shows the number of inodes available in the filesystem. An inode points to a file, and is a part of the filesystem that you rarely need to look at. By default, when you create a filesystem in a partition, one inode is created for every 2k in the partition. The 306 meg partition has about 156,000 inodes, but only 4% of them are used.
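
The figures in the df output above can be checked with a few lines of arithmetic. The following Python sketch assumes the reserve is exactly 10% of the kbytes column and that the inode total is iused + ifree; the function names are made up for illustration and are not part of any Sun tool.

```python
# Rough check of the df figures quoted above for /dev/sd3a.
# Assumptions (not taken from Sun documentation): the reserve is exactly
# 10% of the "kbytes" column, and the inode total is iused + ifree.


def available_kbytes(total_kb, used_kb, reserve_fraction=0.10):
    """Space an ordinary user can still write once the reserve is held back."""
    usable_kb = total_kb * (1.0 - reserve_fraction)
    return max(0, int(round(usable_kb - used_kb)))


def kbytes_per_inode(total_kb, iused, ifree):
    """Average amount of disk space behind each inode in the filesystem."""
    return total_kb / (iused + ifree)


if __name__ == "__main__":
    total_kb, used_kb = 306418, 266644   # df line for /dev/sd3a
    iused, ifree = 6278, 150394          # df -i line for /dev/sd3a

    print("available kbytes:", available_kbytes(total_kb, used_kb))
    print("total inodes:", iused + ifree)
    print("kbytes per inode: %.2f" % kbytes_per_inode(total_kb, iused, ifree))
```

Under those assumptions the available space comes out within a kbyte of the 9133 that df reports, and the inode density comes out just under the 2k-per-inode default.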
I don't know how large an inode is (a quick search through my documentation failed to find it) but I would guess that an inode is 256 bytes. If that's true, the 150,000 unused inodes above are wasting 37.5 megs of disk space. One inode for every 2k is too much. When you create a 100 meg paging file, you only use 1 inode, but that 100 megs of filesystem has a corresponding 50,000 inodes! If you create a paging sector, you are not using a filesystem, so no inodes are necessary. In addition, when you create a filesystem, you can reduce the number of inodes to something more reasonable (like one inode for every 10k of disk space). I generally don't mess with the inode count on my / and /usr partitions, since they contain the operating system. Make certain not to reduce the default inode number too much: YOU DON'T WANT TO RUN OUT OF INODES.

We converted our 350 megs of paging files to paging sectors, and got another 70 megs of free disk space as a result (20%)!

16) But I don't know how to do all this optimization, and my research assistant is spending all his/her time trying to figure it out.

Unix system administration is a complex task which requires experience. An experienced sysadmin can do in minutes what it would take you hours (or days) to accomplish. In addition, an experienced sysadmin won't make stupid mistakes very often (let's see, while I was learning on the job I ruined our backup tape during an upgrade {luckily the upgrade was successful!}, moved a directory inside itself as root, botched email service a couple of times, and spent tons of time figuring out how to accomplish simple tasks).

Most universities have small budgets for their system administrators. Many head sysadmins have recruited students to assist them. Basically the students slave away for nothing, learn tons of stuff, barely pass their classes, become unix gods, and get hired for 40k+/year if/when they graduate/flunk out.
If your university has a sysadmin group like this, you can probably "hire" them to support your machine for about $6/hour at about 4 hours per week per machine. The head sysadmin will be happy to give some money to their more experienced volunteers, the volunteers get another line on their resume plus additional experience, and you get experienced sysadmins to run your machine. In addition, most sysadmin groups have an automated nightly backup. Just think: your machine gets backed up EVERY NIGHT AUTOMATICALLY!

At Portland State University the Electrical Engineering sysadmin group has been hired to maintain the unix machines of four other departments, at an average price of $15 per week per machine (no additional price for xterms!). The quality of the service is excellent (especially since the most experienced volunteers are usually the ones given the money), there is no annual training gap as people leave (since the experienced volunteers are constantly training the new ones), and you have the entire resources and experience of the sysadmin group to help you.

Of course, test them by deleting an unimportant file and seeing if they can restore it from backups (the backup test is the most important in system administration--have you tested your backups lately?). If they successfully restore the file from backups, give them the Sun optimization list (above two questions) and watch as the most experienced volunteer turns the optimization into a recruit-training session :-) They may even have a contest to see how small they can make your kernel configuration file!

17) What genetic-linkage databases are available on the internet?

Medline is a database for searching for articles in journals. If you are in the Pacific Northwest, you can get to Medline using telnet. Just telnet to uwin.u.washington.edu and go into the library databases. It can even email you the output if you wish! Many libraries and many internet service providers have Medline services online.
Some interfaces are better than others (we don't even bother using the one at OHSU--it's too painful...). Your local library can probably supply you with information.

Victor McKusick wrote a book: Mendelian Inheritance in Man. It is continuously updated online at Johns Hopkins University (making it online-MIM or OMIM). Combined with the Genome Data Base, it is available online at welchlab.welch.jhu.edu. You need to get an account. Send email to help@welch.jhu.edu for information.
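
As a footnote to question 11, the MAXHAP rule can be written out in a few lines of Python. This is only a sketch: the function name maxhap and the dictionary layout are invented here for illustration and do not come from the linkage or fastlink packages.

```python
# Sketch of the MAXHAP rule from question 11: multiply the allele counts of
# the loci used in one run. Affection status always counts as two alleles,
# regardless of the number of liability classes.
# All names here are illustrative, not part of linkage or fastlink.

AFFECTION_ALLELES = 2


def maxhap(allele_counts, loci_in_run, includes_affection=True):
    """Smallest MAXHAP that covers a run over the given marker loci."""
    result = AFFECTION_ALLELES if includes_affection else 1
    for locus in loci_in_run:
        result *= allele_counts[locus]
    return result


if __name__ == "__main__":
    # Dataset from the example: markers A, B, and C with 3, 4, and 5 alleles.
    alleles = {"A": 3, "B": 4, "C": 5}
    # The run uses affection status, A, and B; C is in the dataset but unused.
    print(maxhap(alleles, ["A", "B"]))
```

Only the loci named in the run are multiplied together, so marker C does not affect the result for the run in the example.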