Annotation exceptions

From AAAWiki

Please post here all problems you find with annotations, orthology calls, etc.

Amino translations, GFF phase errors

There is a problem with the protein translations for at least the DGIL SNAP and NCBI Gnomon predictions. The GFF phase (column 8) is missing for SNAP (I don't know why, but SNAP doesn't output the phase field), and for Gnomon it may be off and/or not read by Venky Iyer or others. This leads to wrong aa translations of the CDS identified in the GFF features.

See here for aa translations:
ftp://ftp.ncbi.nih.gov/genomes/Drosophila_melanogaster/special_requests/CAF1/
ftp://eugenes.org/eugenes/genomes/caf1a/
http://rana.lbl.gov/~venky/AAA/fasta/translation/

--Dongilbert 13:43, 11 August 2006 (PDT)

I've learned from Ian Korf how to recover the phase value for SNAP predictions, and will provide updated SNAP gff data sets as soon as I can work out the details. --Dongilbert 18:33, 11 August 2006 (PDT)

All the SNAP prediction GFF files now have phase values added at ftp://eugenes.org/eugenes/genomes/caf1a/ --Dongilbert 12:11, 14 August 2006 (PDT)

Examples:

Dana:
scaffold_1001   DGIL_SNO        gene    150     655     7.745   +       .       ID=GF_DGIL_SNO_28056503;orig_id=scaffold_1001-snapho.1
scaffold_1001   DGIL_SNO        exon    150     655     7.745   +       .       Parent=GF_DGIL_SNO_28056503

Snap aa:
>GF_DGIL_SNO_28056503
KRIQDEVKTTCLAAQTNVTTDNPWDKLLNRNSSWLKLIHTLAYVLRFIHR
MKHPSSKQTSNSLTFDEIKAARIRWLQHAQAGFQQEFQLLRANKALGNQS
QLVKLSPASWQLAKVIDTFQGKDNMVRAVKIKTAAGELTRPITKIAKLPS
SETVFQGGPGCLGTIDT*

dana-venkaa.1
>GF_DGIL_SNO_28056503 dana_scaffold_1001:150-655:+
TNAYKTKSKLPA*LRRRMSRRTTHGISFSIATHLG*SLYTHLLTFCASFIA*STHLRNKH
RTL*RSMRSRQQGFGGCNTLKLVFNRSSSCYAQTRL*ETNLNWSSSHQHHGN*RR*STRS
KERTIWFAQSRSRRQQEN*PGQ*RKLRNYPVQKPCFRVARDV*ELSIHX

Dana CDS (as per GFF):
>GF_DGIL_SNO_28056503
ACAAACGCATACAAGACGAAGTCAAAACTACCTGCCTAGCTGCGCAGACGAATGTCACGA
CGGACAACCCATGGGATAAGCTTCTCAATCGCAACTCATCTTGGCTGAAGCTTATACACA
CACTTGCTTACGTTTTGCGCTTCATTCATCGCATGAAGCACCCATCTTCGAAACAAACAT
CGAACTCTCTAACGTTCGATGAGATCAAGGCAGCAAGGATTCGGTGGCTGCAACACGCTC
AAGCTGGTTTTCAACAGGAGTTCCAGTTGCTACGCGCAAACAAGGCTCTAGGAAACCAAT
CTCAATTGGTCAAGCTCTCACCAGCATCATGGCAACTAGCGAAGGTGATCGACACGTTCC
AAGGAAAGGACAATATGGTTCGCGCAGTCAAGATCAAGACGGCAGCAGGAGAATTGACCC
GGCCAATAACGAAAATTGCGAAACTACCCAGTTCAGAAACCGTGTTTCAGGGTGGCCCGG
GATGTTTAGGAACTATCGATACATAG

Translations:
Forward 2 << Snap produced aa 
   0 KRIQDEVKTT CLAAQTNVTT DNPWDKLLNR NSSWLKLIHT LAYVLRFIHR 
   50 MKHPSSKQTS NSLTFDEIKA ARIRWLQHAQ AGFQQEFQLL RANKALGNQS 
  100 QLVKLSPASW QLAKVIDTFQ GKDNMVRAVK IKTAAGELTR PITKIAKLPS 
  150 SETVFQGGPG CLGTIDT! 

Forward 0 << venky's translation
   0 TNAYKTKSKL PA!LRRRMSR RTTHGISFSI ATHLG!SLYT HLLTFCASFI 
   50 A!STHLRNKH RTL!RSMRSR QQGFGGCNTL KLVFNRSSSC YAQTRL!ETN 
  100 LNWSSSHQHH GN!RR!STRS KERTIWFAQS RSRRQQEN!P GQ!RKLRNYP 
  150 VQKPCFRVAR DV!ELSIH 

Dana NCBI_GNO example:
from NCBI_GNO gff:
scaffold_10475  NCBI_GNO        mRNA    94      1060    .       +       .       ID=GF_NCBI_GNO_32261360;Parent=GF_NCBI_GNO_30261360;Name=hmm261360;model_evidence=pure ab initio;note=Derived by automated computational analysis using gene prediction method: GNOMON.
scaffold_10475  NCBI_GNO        CDS     94      1060    .       +       1       Parent=GF_NCBI_GNO_32261360;exon_number=1

ncbi aa:
>GF_NCBI_GNO_32261360 Partial gene predicted by Gnomon on Drosophila ananassae genomic scaffold dana_caf1_scaffold_10475
RRRRSSIWGSHPHGGACVKASTSKRVDWARHAASIEPFPTLFSAWLKKYA
NIVRTVLDGEGKEPRRRVEQRDDRHGGCPICGGQHAKTSCREFIEASPPG
RLSMVKRHRLCFTCLRSGHSSRSCDVHGECQTNGCRRLHHRLLHGAYEER
RWPEQRGGFRRHNGGNQQSAVSRRSPDRRSSPRGSYRHHERSHQSAVPRN
SLERRAPQPAEAPVQRNLSIDVEGGRLLFRILPVTLYRAGRQVDTYALLD
EGSSVTMIDDELWRDLEVRGEQRQLNIQWFGGRTSREPTNVVSLEISGAG
KPTRHPLKNVYAVSSLSLPMQ

venky's aa
>GF_NCBI_GNO_32261360 dana_scaffold_10475:94-1060:+
FVEGGAAFGDPTLMEELVSKLPRASEWIGPGMLHRSSPFPRFLARG*RSTQTSCVRFWTA
RERSRGVESSSGMIGMEVVQFVEGNMLRRAAGSSSKLRHRAG*AW*RGIGSASHAYGVVI
RPDPAMCMASARPTDAADCITVCYMELTKSADGRSSEVASGATTEETSSQQFPDAAPTGG
LRHEVVTGTTRGAISRLSPETAWREGPRSQRRRPCRGI*ASTSKEADFYSVYCQ*RCTEL
VARWTHTRSWMKDPPSR*STTSYGGIWKCEANSDS*TSNGLEEGPAGSPPT**AWR*VEL
GSPLATR*RTCTPYRA*VCRCRX

CDS
>GF_NCBI_GNO_32261360 Partial gene predicted by Gnomon on Drosophila ananassae genomic scaffold dana_caf1_scaffold_10475
TTCGTCGAAGGCGGAGCAGCATTTGGGGATCCCACCCTCATGGAGGAGCT
TGTGTCAAAGCTTCCACGAGCAAGCGAGTGGATTGGGCCAGGCATGCTGC
ATCGATCGAGCCCTTTCCCACGCTTTTTAGCGCGTGGCTAAAGAAGTACG
CAAACATCGTGCGTACGGTTTTGGACGGCGAGGGAAAGGAGCCGAGGCGT
CGAGTCGAGCAGCGGGATGATCGGCATGGAGGTTGTCCAATTTGTGGAGG
GCAACATGCTAAGACGAGCTGCAGGGAGTTCATCGAAGCTTCGCCACCGG
GCAGGTTGAGCATGGTGAAGAGGCATCGGCTCTGCTTCACATGCTTACGG
AGTGGTCATTCGTCCAGATCCTGCGATGTGCATGGCGAGTGCCAGACCAA
CGGATGCCGCAGATTGCATCACCGTCTGCTACATGGAGCTTACGAAGAGC
GCAGATGGCCGGAGCAGCGAGGTGGCTTCAGGCGCCACAACGGAGGAAAC
CAGCAGTCAGCAGTTTCCAGACGCAGCCCCGACAGGAGGTCTTCGCCACG
AGGTAGTTACAGGCACCACGAGAGGAGCCATCAGTCGGCTGTCCCCAGAA
ACAGCCTGGAGAGAAGGGCCCCGCAGCCAGCGGAGGCGCCCGTGCAGAGG
AATCTAAGCATCGACGTCGAAGGAGGCCGACTTTTATTCCGTATACTGCC
AGTAACGCTGTACCGAGCTGGTCGCCAGGTGGACACATACGCGCTCTTGG
ATGAAGGATCCTCCGTCACGATGATCGACGACGAGCTATGGAGGGATCTG
GAAGTGCGAGGCGAACAGCGACAGCTGAACATCCAATGGTTTGGAGGAAG
GACCAGCAGGGAGCCCACCAACGTAGTGAGCCTGGAGATAAGTGGAGCTG
GGAAGCCCACTCGCCACCCGTTGAAGAACGTGTACGCCGTATCGAGCTTG
AGTCTGCCGATGCAGAG

Translations of above:

Forward 2 << ncbi amino; BUT ncbi gff says phase=1 
   0 RRRRSSIWGS HPHGGACVKA STSKRVDWAR HAASIEPFPT LFSAWLKKYA 
   50 NIVRTVLDGE GKEPRRRVEQ RDDRHGGCPI CGGQHAKTSC REFIEASPPG 
  100 RLSMVKRHRL CFTCLRSGHS SRSCDVHGEC QTNGCRRLHH RLLHGAYEER 
  150 RWPEQRGGFR RHNGGNQQSA VSRRSPDRRS SPRGSYRHHE RSHQSAVPRN 
  200 SLERRAPQPA EAPVQRNLSI DVEGGRLLFR ILPVTLYRAG RQVDTYALLD 
  250 EGSSVTMIDD ELWRDLEVRG EQRQLNIQWF GGRTSREPTN VVSLEISGAG 
  300 KPTRHPLKNV YAVSSLSLPM Q 

Forward 0 << venky's translation
   0 FVEGGAAFGD PTLMEELVSK LPRASEWIGP GMLHRSSPFP RFLARG!RST 
   50 QTSCVRFWTA RERSRGVESS SGMIGMEVVQ FVEGNMLRRA AGSSSKLRHR 
  100 AG!AW!RGIG SASHAYGVVI RPDPAMCMAS ARPTDAADCI TVCYMELTKS 
  150 ADGRSSEVAS GATTEETSSQ QFPDAAPTGG LRHEVVTGTT RGAISRLSPE 
  200 TAWREGPRSQ RRRPCRGI!A STSKEADFYS VYCQ!RCTEL VARWTHTRSW 
  250 MKDPPSR!ST TSYGGIWKCE ANSDS!TSNG LEEGPAGSPP T!!AWR!VEL 
  300 GSPLATR!RT CTPYRA!VCR CR

Venky's response

NCBI's GFF is correct, I think. Maybe NCBI is using the "intron phase" which is the number of bases into the codon when it is interrupted by an intron (and is used to label the subsequent exon).

Phase is not the same thing as Frame. Phase is the number of bases to skip before reading in-frame, while frame is the actual frame identifier beginning at 1. My mistake was that I ignored NCBI's phase information and recomputed my own under the faulty assumption that they wouldn't tack on partial codons at the beginning of the gene. This affected partial gene predictions.
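The phase/frame distinction above can be illustrated with a minimal Python sketch. It assumes the standard genetic code and treats phase as the number of leading bases to skip; the function name translate is made up for illustration and is not taken from any of the tools discussed here. The test fragment is the first 32 bases of the Dana SNAP CDS listed in the examples.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(cds: str, phase: int = 0) -> str:
    """Translate a CDS after skipping `phase` leading bases (0, 1 or 2).

    A trailing partial codon is dropped; codons with ambiguous bases give 'X'.
    """
    if phase not in (0, 1, 2):
        raise ValueError("phase must be 0, 1 or 2")
    seq = cds.upper()[phase:]
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3))


# First 32 bases of the Dana SNAP CDS from the example above.
fragment = "ACAAACGCATACAAGACGAAGTCAAAACTACC"
print(translate(fragment, 0))  # TNAYKTKSKL  (start of venky's translation)
print(translate(fragment, 2))  # KRIQDEVKTT  (start of the SNAP translation)
```

Skipping two bases reproduces the SNAP amino sequence, while skipping none reproduces the translation that ignored phase.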

I'll fix these in the next few days.

I should point out that while this affects the translations for the submitted sets, it does not affect any of the GLEAN consensus sets.

I'm not sure how to deal with SNAP if it doesn't output the phase...

GFF phase column ambiguity

Do be cautious about what is in GFF column 8 of the predictions; there has been a fair amount of ambiguity about it, and usage has differed. For instance, the GFF v3 spec used to say something else about phase. For the NCBI Gnomon case above, the GFF says phase=1, but I had to remove two bases from the start to get an aa translation matching NCBI's, which is not in accord with the current GFF v3 spec (it should be phase=2). --Dongilbert 18:29, 11 August 2006 (PDT)
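One way to sidestep the ambiguity is to ignore column 8 and test all three offsets against the submitter's own protein. The Python sketch below does that for the first 32 bases of the Gnomon CDS above. The helper names (matching_offsets, intron_phase_to_skip) are hypothetical, and the conversion only shows that a value of 1 under the "intron phase" reading described in Venky's response corresponds to skipping 2 bases; whether NCBI actually used that convention is not confirmed here.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(cds: str, skip: int) -> str:
    """Translate after skipping `skip` leading bases; drop a trailing partial codon."""
    seq = cds.upper()[skip:]
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3))


def matching_offsets(cds: str, protein: str) -> list:
    """Return every offset (0, 1, 2) whose translation equals `protein`.

    A terminal stop ('*') is ignored on both sides of the comparison.
    """
    target = protein.rstrip("*")
    return [s for s in (0, 1, 2) if translate(cds, s).rstrip("*") == target]


def intron_phase_to_skip(intron_phase: int) -> int:
    """Convert 'bases of the codon already used before the intron'
    into 'bases to skip at the start of the next exon'."""
    return (3 - intron_phase) % 3


# First 32 bases of the Gnomon CDS and first 10 residues of NCBI's protein.
fragment = "TTCGTCGAAGGCGGAGCAGCATTTGGGGATCC"
print(matching_offsets(fragment, "RRRRSSIWGS"))  # [2]
print(intron_phase_to_skip(1))                   # 2
```

For full-length sequences the same comparison applies unchanged, since an offset that matches the complete protein is the number of bases to skip.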

GFF phase corrections

Phase is only needed for partial gene predictions, a small percentage of the total, mostly found at the ends of short scaffolds (e.g. SNAP has 0 in dmel, a handful in dpse, dyak, dsim, and around 1000 in the species with several thousand scaffolds). --Dongilbert 21:38, 13 August 2006 (PDT)

The predictions are inconsistent in their use of phase:

 SNAP (DGIL_SNP, _SNO) had no phase values; updated now (see above)
 BREN_NSC has good phase values
 BATZ_CNA needs corrected phase values
 RGUI_GID has good phase values
 NCBI_GNO needs corrected phase values
 PACH_GMP,OXFD_GPI,EISE_CGW,EISE_CEX have good phases
 EISE_CGM probably needs corrected phases

Tool to check aa translations and add phase: http://insects.eugenes.org/species/data/work/snap-predictions/cdsphase.pl
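
Once the correct offset is known, writing it back into the GFF is a one-column change. The following Python sketch is only an illustration of that step and is not a description of how cdsphase.pl works; the function name set_phase is hypothetical, and it assumes tab-separated, nine-column GFF lines. The example line is the Dana SNAP exon from above, with phase 2 taken from the "Forward 2" translation that matched the SNAP protein.

```python
def set_phase(gff_line: str, phase: int) -> str:
    """Return the GFF line with column 8 (phase) set to `phase`.

    Assumes a tab-separated line with exactly nine columns.
    """
    if phase not in (0, 1, 2):
        raise ValueError("phase must be 0, 1 or 2")
    cols = gff_line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated GFF columns")
    cols[7] = str(phase)  # column 8 is index 7
    return "\t".join(cols)


# The SNAP exon from the Dana example, which was written with '.' as its phase.
line = "\t".join([
    "scaffold_1001", "DGIL_SNO", "exon", "150", "655", "7.745", "+", ".",
    "Parent=GF_DGIL_SNO_28056503",
])
print(set_phase(line, 2))
```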