Annotation exceptions
From AAAWiki
Please post here any problems you find with annotations, orthology calls, etc.
Amino translations, GFF phase errors
There is a problem with the protein translations for at least DGIL SNAP and NCBI Gnomon. The GFF phase (column 8) is missing for SNAP (I don't know why, but SNAP doesn't output a phase field), and may be off for Gnomon and/or not read by Venky Iyer or others. This leads to wrong aa translations of the CDS identified in the GFF features.
See here for aa translations:
ftp://ftp.ncbi.nih.gov/genomes/Drosophila_melanogaster/special_requests/CAF1/
ftp://eugenes.org/eugenes/genomes/caf1a/
http://rana.lbl.gov/~venky/AAA/fasta/translation/
--Dongilbert 13:43, 11 August 2006 (PDT)
I've learned from Ian Korf how to recover the phase value for SNAP predictions, and will provide updated SNAP gff data sets as soon as I can work out the details. --Dongilbert 18:33, 11 August 2006 (PDT)
All the SNAP prediction GFF files now have phase values added at ftp://eugenes.org/eugenes/genomes/caf1a/ --Dongilbert 12:11, 14 August 2006 (PDT)
Examples:
Dana:
scaffold_1001 DGIL_SNO gene 150 655 7.745 + . ID=GF_DGIL_SNO_28056503;orig_id=scaffold_1001-snapho.1
scaffold_1001 DGIL_SNO exon 150 655 7.745 + . Parent=GF_DGIL_SNO_28056503

Snap aa:
>GF_DGIL_SNO_28056503
KRIQDEVKTTCLAAQTNVTTDNPWDKLLNRNSSWLKLIHTLAYVLRFIHR
MKHPSSKQTSNSLTFDEIKAARIRWLQHAQAGFQQEFQLLRANKALGNQS
QLVKLSPASWQLAKVIDTFQGKDNMVRAVKIKTAAGELTRPITKIAKLPS
SETVFQGGPGCLGTIDT*

dana-venkaa.1
>GF_DGIL_SNO_28056503 dana_scaffold_1001:150-655:+
TNAYKTKSKLPA*LRRRMSRRTTHGISFSIATHLG*SLYTHLLTFCASFIA*STHLRNKH
RTL*RSMRSRQQGFGGCNTLKLVFNRSSSCYAQTRL*ETNLNWSSSHQHHGN*RR*STRS
KERTIWFAQSRSRRQQEN*PGQ*RKLRNYPVQKPCFRVARDV*ELSIHX

Dana CDS (as per GFF):
>GF_DGIL_SNO_28056503
ACAAACGCATACAAGACGAAGTCAAAACTACCTGCCTAGCTGCGCAGACGAATGTCACGA
CGGACAACCCATGGGATAAGCTTCTCAATCGCAACTCATCTTGGCTGAAGCTTATACACA
CACTTGCTTACGTTTTGCGCTTCATTCATCGCATGAAGCACCCATCTTCGAAACAAACAT
CGAACTCTCTAACGTTCGATGAGATCAAGGCAGCAAGGATTCGGTGGCTGCAACACGCTC
AAGCTGGTTTTCAACAGGAGTTCCAGTTGCTACGCGCAAACAAGGCTCTAGGAAACCAAT
CTCAATTGGTCAAGCTCTCACCAGCATCATGGCAACTAGCGAAGGTGATCGACACGTTCC
AAGGAAAGGACAATATGGTTCGCGCAGTCAAGATCAAGACGGCAGCAGGAGAATTGACCC
GGCCAATAACGAAAATTGCGAAACTACCCAGTTCAGAAACCGTGTTTCAGGGTGGCCCGG
GATGTTTAGGAACTATCGATACATAG

Translations:
Forward 2 << Snap produced aa
0 KRIQDEVKTT CLAAQTNVTT DNPWDKLLNR NSSWLKLIHT LAYVLRFIHR
50 MKHPSSKQTS NSLTFDEIKA ARIRWLQHAQ AGFQQEFQLL RANKALGNQS
100 QLVKLSPASW QLAKVIDTFQ GKDNMVRAVK IKTAAGELTR PITKIAKLPS
150 SETVFQGGPG CLGTIDT!
Forward 0 << venky's translation
0 TNAYKTKSKL PA!LRRRMSR RTTHGISFSI ATHLG!SLYT HLLTFCASFI
50 A!STHLRNKH RTL!RSMRSR QQGFGGCNTL KLVFNRSSSC YAQTRL!ETN
100 LNWSSSHQHH GN!RR!STRS KERTIWFAQS RSRRQQEN!P GQ!RKLRNYP
150 VQKPCFRVAR DV!ELSIH

Dwil NCBI_GNO example:
from NCBI_GNO gff:
scaffold_10475 NCBI_GNO mRNA 94 1060 . + . ID=GF_NCBI_GNO_32261360;Parent=GF_NCBI_GNO_30261360;Name=hmm261360;model_evidence=pure ab initio;note=Derived by automated computational analysis using gene prediction method: GNOMON.
scaffold_10475 NCBI_GNO CDS 94 1060 . + 1 Parent=GF_NCBI_GNO_32261360;exon_number=1

ncbi aa:
>GF_NCBI_GNO_32261360 Partial gene predicted by Gnomon on Drosophila ananassae genomic scaffold dana_caf1 _scaffold_10475
RRRRSSIWGSHPHGGACVKASTSKRVDWARHAASIEPFPTLFSAWLKKYA
NIVRTVLDGEGKEPRRRVEQRDDRHGGCPICGGQHAKTSCREFIEASPPG
RLSMVKRHRLCFTCLRSGHSSRSCDVHGECQTNGCRRLHHRLLHGAYEER
RWPEQRGGFRRHNGGNQQSAVSRRSPDRRSSPRGSYRHHERSHQSAVPRN
SLERRAPQPAEAPVQRNLSIDVEGGRLLFRILPVTLYRAGRQVDTYALLD
EGSSVTMIDDELWRDLEVRGEQRQLNIQWFGGRTSREPTNVVSLEISGAG
KPTRHPLKNVYAVSSLSLPMQ

venky's aa
>GF_NCBI_GNO_32261360 dana_scaffold_10475:94-1060:+
FVEGGAAFGDPTLMEELVSKLPRASEWIGPGMLHRSSPFPRFLARG*RSTQTSCVRFWTA
RERSRGVESSSGMIGMEVVQFVEGNMLRRAAGSSSKLRHRAG*AW*RGIGSASHAYGVVI
RPDPAMCMASARPTDAADCITVCYMELTKSADGRSSEVASGATTEETSSQQFPDAAPTGG
LRHEVVTGTTRGAISRLSPETAWREGPRSQRRRPCRGI*ASTSKEADFYSVYCQ*RCTEL
VARWTHTRSWMKDPPSR*STTSYGGIWKCEANSDS*TSNGLEEGPAGSPPT**AWR*VEL
GSPLATR*RTCTPYRA*VCRCRX

CDS
>GF_NCBI_GNO_32261360 Partial gene predicted by Gnomon on Drosophila ananassae genomic scaffold dana_caf1 _scaffold_10475
TTCGTCGAAGGCGGAGCAGCATTTGGGGATCCCACCCTCATGGAGGAGCT
TGTGTCAAAGCTTCCACGAGCAAGCGAGTGGATTGGGCCAGGCATGCTGC
ATCGATCGAGCCCTTTCCCACGCTTTTTAGCGCGTGGCTAAAGAAGTACG
CAAACATCGTGCGTACGGTTTTGGACGGCGAGGGAAAGGAGCCGAGGCGT
CGAGTCGAGCAGCGGGATGATCGGCATGGAGGTTGTCCAATTTGTGGAGG
GCAACATGCTAAGACGAGCTGCAGGGAGTTCATCGAAGCTTCGCCACCGG
GCAGGTTGAGCATGGTGAAGAGGCATCGGCTCTGCTTCACATGCTTACGG
AGTGGTCATTCGTCCAGATCCTGCGATGTGCATGGCGAGTGCCAGACCAA
CGGATGCCGCAGATTGCATCACCGTCTGCTACATGGAGCTTACGAAGAGC
GCAGATGGCCGGAGCAGCGAGGTGGCTTCAGGCGCCACAACGGAGGAAAC
CAGCAGTCAGCAGTTTCCAGACGCAGCCCCGACAGGAGGTCTTCGCCACG
AGGTAGTTACAGGCACCACGAGAGGAGCCATCAGTCGGCTGTCCCCAGAA
ACAGCCTGGAGAGAAGGGCCCCGCAGCCAGCGGAGGCGCCCGTGCAGAGG
AATCTAAGCATCGACGTCGAAGGAGGCCGACTTTTATTCCGTATACTGCC
AGTAACGCTGTACCGAGCTGGTCGCCAGGTGGACACATACGCGCTCTTGG
ATGAAGGATCCTCCGTCACGATGATCGACGACGAGCTATGGAGGGATCTG
GAAGTGCGAGGCGAACAGCGACAGCTGAACATCCAATGGTTTGGAGGAAG
GACCAGCAGGGAGCCCACCAACGTAGTGAGCCTGGAGATAAGTGGAGCTG
GGAAGCCCACTCGCCACCCGTTGAAGAACGTGTACGCCGTATCGAGCTTG
AGTCTGCCGATGCAGAG

Translations of above:
Forward 2 << ncbi amino; BUT ncbi gff says phase=1
0 RRRRSSIWGS HPHGGACVKA STSKRVDWAR HAASIEPFPT LFSAWLKKYA
50 NIVRTVLDGE GKEPRRRVEQ RDDRHGGCPI CGGQHAKTSC REFIEASPPG
100 RLSMVKRHRL CFTCLRSGHS SRSCDVHGEC QTNGCRRLHH RLLHGAYEER
150 RWPEQRGGFR RHNGGNQQSA VSRRSPDRRS SPRGSYRHHE RSHQSAVPRN
200 SLERRAPQPA EAPVQRNLSI DVEGGRLLFR ILPVTLYRAG RQVDTYALLD
250 EGSSVTMIDD ELWRDLEVRG EQRQLNIQWF GGRTSREPTN VVSLEISGAG
300 KPTRHPLKNV YAVSSLSLPM Q
Forward 0 << venky's translation
0 FVEGGAAFGD PTLMEELVSK LPRASEWIGP GMLHRSSPFP RFLARG!RST
50 QTSCVRFWTA RERSRGVESS SGMIGMEVVQ FVEGNMLRRA AGSSSKLRHR
100 AG!AW!RGIG SASHAYGVVI RPDPAMCMAS ARPTDAADCI TVCYMELTKS
150 ADGRSSEVAS GATTEETSSQ QFPDAAPTGG LRHEVVTGTT RGAISRLSPE
200 TAWREGPRSQ RRRPCRGI!A STSKEADFYS VYCQ!RCTEL VARWTHTRSW
250 MKDPPSR!ST TSYGGIWKCE ANSDS!TSNG LEEGPAGSPP T!!AWR!VEL
300 GSPLATR!RT CTPYRA!VCR CR
Venky's response
NCBI's GFF is correct, I think. Maybe NCBI is using the "intron phase" which is the number of bases into the codon when it is interrupted by an intron (and is used to label the subsequent exon).
Phase is not the same thing as Frame. Phase is the number of bases to skip before reading in-frame, while frame is the actual frame identifier beginning at 1. My mistake was that I ignored NCBI's phase information and recomputed my own under the faulty assumption that they wouldn't tack on partial codons at the beginning of the gene. This affected partial gene predictions.
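To make the phase/frame distinction concrete, here is a minimal Python sketch (not code from any of the tools discussed on this page) that skips `phase` bases before translating with the standard genetic code. The name `translate` and the use of `X` for unrecognised codons are assumptions made for illustration. Fed the first 23 bases of the Dana CDS above, an offset of 0 reproduces the start of venky's translation and an offset of 2 reproduces the start of the SNAP protein.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(cds, phase=0):
    """Translate a CDS after skipping `phase` leading bases (0, 1 or 2).

    Trailing bases that do not fill a codon are ignored; codons with
    ambiguous bases become 'X'.
    """
    seq = cds.upper()[phase:]
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3))


# First 23 bases of the Dana CDS for GF_DGIL_SNO_28056503.
cds_start = "ACAAACGCATACAAGACGAAGTC"
print(translate(cds_start, phase=0))  # TNAYKTK  (start of venky's translation)
print(translate(cds_start, phase=2))  # KRIQDEV  (start of the SNAP protein)
```

The only thing the phase value changes is where codon reading starts, which is why ignoring it matters only for partial genes whose first CDS begins mid-codon.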
I'll fix these in the next few days.
I should point out that while this affects the translations for the submitted sets, it does not affect any of the GLEAN consensus sets.
I'm not sure how to deal with SNAP if it doesn't output the phase...
GFF phase column ambiguity
Do be cautious about what is in GFF column 8 of the predictions; there has been a fair amount of ambiguity about this, and usage has differed. For instance, the GFF v3 spec used to say something else about phase. In the NCBI Gnomon case above, the GFF says phase=1, but I had to remove two bases from the start to get an aa translation matching NCBI's, which is not in accord with the current GFF v3 spec (it should be phase=2). --Dongilbert 18:29, 11 August 2006 (PDT)
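One way to see which offset a predictor actually used, whatever column 8 says, is to translate at all three offsets and compare against the protein the predictor supplied. A minimal Python sketch, assuming the standard genetic code; `matching_phases` is a hypothetical helper, not part of any tool mentioned here. It is run on the first 32 bases of the Gnomon CDS and the first 10 residues of the NCBI protein shown above.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(cds, phase=0):
    """Translate after skipping `phase` leading bases; partial last codon dropped."""
    seq = cds.upper()[phase:]
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3))


def matching_phases(cds, reference_aa):
    """Return every offset (0, 1, 2) whose translation agrees with the
    reference protein over the length the two sequences share."""
    hits = []
    for phase in (0, 1, 2):
        aa = translate(cds, phase)
        n = min(len(aa), len(reference_aa))
        if n > 0 and aa[:n] == reference_aa[:n]:
            hits.append(phase)
    return hits


gnomon_cds_start = "TTCGTCGAAGGCGGAGCAGCATTTGGGGATCC"
ncbi_aa_start = "RRRRSSIWGS"
print(matching_phases(gnomon_cds_start, ncbi_aa_start))  # [2], although the GFF says 1
```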
GFF phase corrections
Phase is only needed for partial gene predictions, a small percentage, mostly found at the ends of short scaffolds (e.g. SNAP has 0 in dmel, a handful in dpse, dyak, dsim, and around 1000 in the species with several thousand scaffolds). --Dongilbert 21:38, 13 August 2006 (PDT)
The predictions are inconsistent in their use of phase:
SNAP (DGIL_SNP, _SNO) had no phase values; updated now (see above)
BREN_NSC has good phase values
BATZ_CNA needs corrected phase values
RGUI_GID has good phase values
NCBI_GNO needs corrected phase values
PACH_GMP, OXFD_GPI, EISE_CGW, EISE_CEX have good phases
EISE_CGM probably needs corrected phases
Tool to check aa translations and add phase: http://insects.eugenes.org/species/data/work/snap-predictions/cdsphase.pl
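For readers who want to experiment without the Perl tool, the following Python sketch shows one possible fallback when no reference protein is available: pick the offset whose translation has the fewest internal stop codons, then write it into column 8 of a tab-separated GFF line. This is a heuristic chosen for illustration. It is not the logic of cdsphase.pl, nor the SNAP recovery method learned from Ian Korf, neither of which is described on this page. The helper names `fewest_stops_phase` and `set_gff_phase` are hypothetical, the CDS is assumed to be already oriented 5' to 3', and ties are resolved toward the lowest offset, so short sequences can give a wrong answer.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(cds, phase=0):
    """Translate after skipping `phase` leading bases; partial last codon dropped."""
    seq = cds.upper()[phase:]
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3))


def fewest_stops_phase(cds):
    """Heuristic: offset (0, 1, 2) with the fewest stops before the final one."""
    def internal_stops(phase):
        return translate(cds, phase).rstrip("*").count("*")
    return min((0, 1, 2), key=internal_stops)


def set_gff_phase(gff_line, phase):
    """Return the GFF line with column 8 (phase) replaced by `phase`."""
    cols = gff_line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated GFF columns")
    cols[7] = str(phase)
    return "\t".join(cols)


# First 90 bases of the Dana CDS for GF_DGIL_SNO_28056503.
cds_start = ("ACAAACGCATACAAGACGAAGTCAAAACTACCTGCCTAGCTGCGCAGACGAATGTCACGA"
             "CGGACAACCCATGGGATAAGCTTCTCAATC")
line = "scaffold_1001\tDGIL_SNO\texon\t150\t655\t7.745\t+\t.\tParent=GF_DGIL_SNO_28056503"
print(set_gff_phase(line, fewest_stops_phase(cds_start)))
```

On this fragment offsets 0 and 1 each hit one stop codon and offset 2 hits none, so the line comes back with 2 in column 8, in agreement with the SNAP protein shown in the examples.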
