Enclosed in this package is a Perl script to convert single nucleotide mRNA variants in HGMD-like format to genomic coordinates that can be used as input into SKIPPY. There are 9 files in this tarball: the Perl script itself, 3 supporting files needed to run this script, 3 sample output files, and 2 documents, including this README file. You will need to edit some of the supporting files to fit your own specific needs or create the equivalent files using the instructions provided here. Outline: 1. To run this program, you will need to have Perl and one of its support packages "XML::Simple" installed on your system. Step 2 and 3 briefly describe how to obtain Perl and install "XML::Simple". Please contact your system administrator if you need assistance. 2. In a terminal window, issue the command "perl -v". Perl is installed on your computer if the "version" information is returned. Otherwise, please download and install the Perl interpreter for your system from http://www.perl.org/get.html 3. Once Perl is installed, issue the command "cpan -i XML::Simple" in a terminal window. The system should either install the "XML::Simple" package for Perl or should show a message indicating that XML::Simple is "up to date". If the command does not work, contact your system administrator. There may be a problem with file permissions on your system, or you may need to download "XML::Simple" from http://search.cpan.org/dist/XML-Simple/. 4. Go to ftp://ftp.nhgri.nih.gov/pub/software/mrna2genomics/ to download the tarball "mRNA2genomics.tar.gz". Save it to your local system, for example in /home/userXYZ/tools/. 5. Use the UNIX "tar" command to unpack this file; for example, "tar -xf mRNA2genomics.tar.gz" would result in "/home/userXYZ/tools/mRNA2genomics/". 6. Open the included sample "XML" configuration file "NM_003183.xml" in the unpacked directory ("/home/userXYZ/tools/mRNA2genomics/" in the example above). Edit and save as your own XML configuration file so that the new file reflects your configurations. A detailed description of this file is provided below. 7. Format the list of variants as shown in the file "adam17.variants". This format is the one used by the public version of The Human Gene Mutation Database (HGMD; http://www.hgmd.cf.ac.uk/). The variants must be numbered by their codon position with respect to a reference mRNA sequence. 8. Determine the genomic coordinates of the exons of the reference mRNA sequence. The coordinates can be determined manually, using a program like splign (http://www.ncbi.nlm.nih.gov/sutils/splign/). Alternatively, if the accession number of the mRNA is known, the coordinates can be downloaded from the UCSC Table Browser. The coordinates should be formatted as shown in "NM_003183.hg19.refGene.txt". 9. Note that the "mRNA2genomics.pl" script assumes that the mRNA sequence aligns to the genome throughout the full length of the mRNA, with no indels in the mRNA. Mismatches are acceptable. If the UCSC-provided alignment is not over 100% of the length of the mRNA, the exon coordinate file (step 8) will need to be edited manually (see "Editing the exon coordinate file," below). 10. Make sure that the coordinates of the variants (step 7) and the coordinates of the exons (step 8) are based on the same transcript. 11. Run "mRNA2genomics.pl" with the XML configuration file as a passed argument, for example, "perl mRNA2genomics.pl NM_003183.xml". The names of the variant file and the exon coordinate file must be indicated in this XML file. 12. The program produces three output files, which are written into the directory containing the variant file (step 7). The one with the suffix "skippy_web.output" has 3 columns and can be used as input into the form on the SKIPPY website (http://research.nhgri.nih.gov/skippy/input.shtml). The one with the suffix "skippy_command.output" has 3 columns and can be used as input into the command-line version of SKIPPY (ftp://ftp.nhgri.nih.gov/pub/software/SKIPPY/). The file with the suffix "genomics.output" is more complete; it has 11 columns, described below. Detailed descriptions of the 9 files included in the tarball are provided below. ------------------------------------------ README.txt This file. ------------------------------------------ COPYRIGHT.txt This software/database is "United States Government Work" under the terms of the United States Copyright Act. It was written as part of the authors' official duties for the United States Government and thus cannot be copyrighted. This software/database is freely available to the public for use without a copyright notice. Restrictions cannot be placed on its present or future use. Although all reasonable efforts have been taken to ensure the accuracy and reliability of the software and data, the National Human Genome Research Institute (NHGRI) and the U.S. Government does not and cannot warrant the performance or results that may be obtained by using this software or data. NHGRI and the U.S. Government disclaims all warranties as to performance, merchantability or fitness for any particular purpose. In any work or product derived from this material, proper attribution of the authors as the source of the software or data should be made, using "NHGRI Genome Technology Branch" as the citation. ------------------------------------------ mRNA2genomics.pl The Perl script. It takes an XML configuration file (described below) as input. ------------------------------------------ NM_003183.xml This is the XML configuration file that "mRNA2genomics.pl" requires. Edit as necessary for your own purposes. The text within the angle-brackets is required; do not change any of it, including the slash marks. Example contents: adam17.variants mRNA2genomics 184 RefSeq NM_003183 NM_003183.hg19.refGene.txt Note: The cdsStart is the start position of the CDS feature in a GenBank flatfile for an mRNA sequence. For example, for mRNA NM_003183 (http://www.ncbi.nlm.nih.gov/nuccore/NM_003183), the CDS start is at position 184: CDS 184..2658 ------------------------------------------ adam17.variants A tab-delimited text file listing the variants. The variants should be formatted following the convention of The Human Gene Mutation Database (HGMD; http://www.hgmd.cf.ac.uk/). Note that this script can only convert single site missense or nonsense mutations. The first line is a required header line. 1st column: A unique identifier for each variant. 2nd column: The codon that is changed is required and should be CAPITALIZED; any other sequences are optional and should be lower cased; the codon to the left of the dash corresponds to the codon in the reference sequence while the codon after the dash contains the variant. 3rd column: Amino acid change caused by the codon change; stop codon is marked as "Term". 4th column: Codon number at which the variation occurs. 5th column and beyond: Notes or references; the script will concatenate these columns and report them in a single column in the output file. Example contents: User_Acc Codon Change AA Change Codon Number ID_00001 cCCA-CAA Pro-Gln 783 ID_00004 TCG-TTG Ser-Leu 747 ID_00002 GTTg-ATT Val-Ile 673 ID_00003 CCT-TCT Pro-Ser 491 ID_00005 cAGA-TGA Arg-Term 202 ------------------------------------------ NM_003183.hg19.refGene.txt The genomic coordinates of the exons of the reference mRNA, downloaded from the UCSC Table Browser or created in another program like splign (http://www.ncbi.nlm.nih.gov/sutils/splign/). All coordinates in this file must follow the UCSC Genome Browser's internal database representations of coordinates and have a zero-based start and a one-based end. Coordinates obtained as described here are in the correct format. Note that the "mRNA2genomics.pl" script assumes that the mRNA sequence aligns to the genome throughout the full length of the mRNA, with no indels in the mRNA. Mismatches are acceptable. If the UCSC-provided alignment is not over 100% of the length of the mRNA, the exon coordinate file will need to be edited manually (see "Editing the exon coordinate file," below). To download the annotation file with the exon coordinates, if you know the accession number: 1. Go to the UCSC table browser: http://genome.ucsc.edu/cgi-bin/hgTables 2. select "Mammal" from "clade", "Human" from "genome", 3. select the desired genome assembly ("Feb 2009 (GRCh37/hg19)" in this example) 4. select "Genes and Gene Prediction Tracks" from "group" list for for RefSeq accessions (select "mRNA and EST tracks" from "group" list for GenBank accessions) 5. select "RefSeq Genes" from "track" list for RefSeq accessions (select "Human mRNAs"for GenBank accessions). 6. select "refGene" from the "table" list for RefSeq accessions (select "all_mrna" for GenBank accessions). 7. enter the accession number (NM_003183 or U69611 in these examples) in the "region" text box 8. select "position" button 9. select "all fields from selected tables" from "output format" 10. enter desired file name for the genomic coordinates 11. choose "plain text" 12. click "get output" and save the file refSeq example (refGene table): 2 lines are returned when the refGene table is downloaded for the region of RefSeq accession NM_003183. The relevant lines for NM_003183 are shown here; other lines will be ignored by the "mRNA2genomics.pl" script. #bin name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds score name2 cdsStartStat cdsEndStat exonFrames 658 NM_003183 chr2 - 9629410 9695917 9630305 9695734 19 9629410,9631229,9633026,9633875,9634765,9637242,9642301,9645294,9650107,9658029,9658231,9661331,9663377,9666239,9667914,9675962,9676826,9683281,9695637, 9630647,9631280,9633115,9633954,9634896,9637377,9642405,9645494,9650260,9658118,9658376,9661445,9663467,9666373,9668083,9676051,9676957,9683414,9695917, 0 ADAM17 cmpl cmpl 0,0,1,0,1,1,2,0,0,1,0,0,0,1,0,1,2,1,0, GenBank example (all_mrna table): 13 lines are returned when the all_mrna table is downloaded for the region of GenBank accession U69611. The relevant lines for U69611 are shown here; other lines will be ignored by the "mRNA2genomics.pl" script. #bin matches misMatches repMatches nCount qNumInsert qBaseInsert tNumInsert tBaseInsert strand qName qSize qStart qEnd tName tSize tStart tEnd blockCount blockSizes qStarts tStarts 658 3010 3 0 0 1 1 19 62955 - U69611 3014 0 3014 chr2 243199373 9629880 9695848 21 63,301,402,51,89,79,131,135,104,200,153,89,145,114,90,134,169,89,131,133,211, 0,64,365,767,818,907,986,1117,1252,1356,1556,1709,1798,1943,2057,2147,2281,2450,2539,2670,2803, 9629880,9629943,9630245,9631229,9633026,9633875,9634765,9637242,9642301,9645294,9650107,9658029,9658231,9661331,9663377,9666239,9667914,9675962,9676826,9683281,9695637, o For RefSeq-style accession numbers (NM_XXXXX), the refGene file must have at least 11 columns (RefSeq example above, for accession NM_003183): o Column 1: can be blank; not read by the "mRNA2genomics.pl" script. o Column 2 (name): RefSeq accession number; must be identical to the tag in the XML configuration file (line 10 in the XML file). o Column 3 (chrom): The chromosome; in the format of "chr" and a number, for example, "chr17". o Column 4 (strand): The strand of the transcript, "+" for forward strand and "-" for reverse. o Column 5 (txStart): The transcription start coordinate of this mRNA in the genome. o Column 6 (txEnd): The transcription end coordinate of this mRNA in the genome. o Column 7( cdsStart): The CDS start coordinate of this mRNA in the genome. o Column 8 (cdsEnd): The CDS end coordinate of this mRNA in the genome. o Column 9 (exonCount): The number of exons in this mRNA. o Column 10 (exonStarts): A comma-delimited list of the start coordinates of all the exons; the number of values should match the exon count in Column 9; the values are in ascending order. o Column 11 (exonEnds): A comma-delimited list of the end coordinates of all the exons; the number of values should match the exon count in Column 9; the values are in ascending order. o Columns 12 and following: can be blank; not read by "mRNA2genomics.pl" script. o o For GenBank-style accession numbers, the file must have at least 22 Columns (GenBank example above, for accession U69611): o Columns 1-9: can be blank; not read by "mRNA2genomics.pl" script. o Column 10 (strand): The strand of the transcript, "+" for forward strand and "-" for reverse. o Column 11 (qName): Accession number; must be identical to the tag in the XML configuration file (line 10 in the XML file). o Columns 12-14: can be blank; not read by "mRNA2genomics.pl" script. o Column 15 (tName): The chromosome; in the format of "chr" and a number, for example, "chr17". o Column 16: can be blank; not read by "mRNA2genomics.pl" script. o Column 17 (tStart): The lower coordinate of the position at which this mRNA aligns to the genome; regardless of the strand of the transcript, this value should be less than the value in Column 18. o Column 18 (tEnd): The upper coordinate of the position at which this mRNA aligns to the genome; regardless of the strand of the transcript, this value should be greater than the value in Column 18. o Column 19 (blockCount): The number of exons in this mRNA. o Column 20 (blockSizes): A comma-delimited list of the lengths of each exon; the number of values should match the exon count in Column 19; the order of values here matches the order of values in column 22. o Column 21: can be blank; not read by "mRNA2genomics.pl" script. o Column 22 (tStarts): A comma-delimited list of the lower coordinates of all exons; for transcripts on the + strand, these numbers are the exon starts; for transcripts on the - strand, these numbers are the exon ends; the number of values should match the exon count in Column 19; the values are in ascending order. o Columns 23 and following: can be blank; not read by "mRNA2genomics.pl" script. *** Notes*** The example provided in the file NM_003183.hg19.refGene.txt is from the refGene table. You can also create the exon coordinate file manually, for example, after determining the exon boundaries using splign. You can then format this file either in the all_mrna format (for GenBank accessions) or in the refGene format (for RefSeq accessions). Be sure to indicate the correct accession_file_type in line 9 of the XML file configuration file. All coordinates in this file must have a zero-based start and a one-based end ------------------------------------------ After you have set up all the files, run "mRNA2genomics.pl" with the XML configuration file as a passed argument, for example, "perl mRNA2genomics.pl NM_003183.xml". The names of the variant file and the exon coordinate file must be indicated in this XML file. The script writes the output files to the directory containing the list of variants, so be sure that this directory has the correct write permissions. The program produces three output files, which are written into the directory containing the variant file. The one with the suffix "skippy_web.output" has 3 columns and can be used as input into the form on the SKIPPY website (http://research.nhgri.nih.gov/skippy/input.shtml). The one with the suffix "skippy_command.output" has 3 columns and can be used as input into the command-line version of SKIPPY (ftp://ftp.nhgri.nih.gov/pub/software/SKIPPY/). The file with the suffix "genomics.output" is more complete; it has 11 columns as shown below. All coordinates are 1-based. ------------------------------------------ adam17.variants.skippy_web.output chr2 9630433 C->A chr2 9630541 C->T chr2 9633092 G->A chr2 9645368 C->T chr2 9667930 A->T ------------------------------------------ adam17.variants.skippy_command.output chr2 9630433 C A chr2 9630541 C T chr2 9633092 G A chr2 9645368 C T chr2 9667930 A T ------------------------------------------ adam17.variants.genomics.output In addition to returning the information in the variant file, the script also returns the coordinate of the variant in the mRNA sequence (cDNA_coord) as well as the coordinate of the variant in the genomic sequence (Genomic_coord). GenBank_Acc Variant_ID Base_change aa_change Notes Genomic_coord cDNA_coord Skippy_chromosome Skippy_coord Skippy_base_change_web Skippy_base_change_command NM_003183 ID_00001 C to A Pro-Gln ID_00001|cCCA-CAA|Pro-Gln|783 9630433 2531 chr2 9630433 C->A C A NM_003183 ID_00004 C to T Ser-Leu ID_00004|TCG-TTG|Ser-Leu|747 9630541 2423 chr2 9630541 C->T C T NM_003183 ID_00002 G to A Val-Ile ID_00002|GTTg-ATT|Val-Ile|673 9633092 2200 chr2 9633092 G->A G A NM_003183 ID_00003 C to T Pro-Ser ID_00003|CCT-TCT|Pro-Ser|491 9645368 1654 chr2 9645368 C->T C T NM_003183 ID_00005 A to T Arg-Term ID_00005|cAGA-TGA|Arg-Term|202 9667930 787 chr2 9667930 A->T A T ------------------------------------------ Editing the exon coordinate file Note that the "mRNA2genomics.pl" script assumes that the mRNA sequence aligns to the genome throughout the full length of the mRNA, with no indels in the mRNA. Mismatches are acceptable. If the UCSC-provided alignment is not over 100% of the length of the mRNA, the exon coordinate file must be edited manually. For example, the alignment between accession U14680 and the hg19 genome assembly shows that 7 nt in the first exon of the mRNA (top sequence) do not align correctly to hg19 (bottom sequence): AGCTCGCTGA GACTTCCTGG ACcccgcacC AGGCTGTGGG GTTTCTCAGA 50 AGCTCGCTGA GACTTCCTGG ACggggga-C AGGCTGTGGG GTTTCTCAGA 41277338 Because of this mis-alignment, in the corresponding all_mrna file, this 7-nt genomic sequence becomes an intron, splitting the first exon into two separate exons. This is not correct, and has the result of throwing off the numbering of the exon coordinates. Thus, the blockSizes and tStarts columns must be manually edited to remove this intron. (Since the mRNA aligns to the minus strand, the first exon is on the right). blockSizes original from the all-mrna table: 125,61,74,55,84,41,78,88,311,191,127,172,89,3426,77,46,106,140,89,78,54,99,71,22, manually edited: 125,61,74,55,84,41,78,88,311,191,127,172,89,3426,77,46,106,140,89,78,54,99,100, tStarts original from the all-mrna table: 41197694,41199659,41201137,41203079,41209068,41215349,41215890,41219624,41222944,41226347,41228504,41234420,41242960,41243451,41247862,41249260,41251791,41256138,41256884,41258472,41267742,41276033,41277287,41277364, manually edited: 41197694,41199659,41201137,41203079,41209068,41215349,41215890,41219624,41222944,41226347,41228504,41234420,41242960,41243451,41247862,41249260,41251791,41256138,41256884,41258472,41267742,41276033,41277287,