Consensus sets
This page tracks the efforts of various groups to build consensus sets.
Updates
Lineage-specific orthologies posted here --venky 14 Sep
Consensus models' GFF3s, orthologies, FASTA and CDS files, 1-1 ortholog FASTA/CDS files, and updated Genome Browsers posted.
Bug: Translations for some of the submitted sets are incorrect and, relatedly, the phase in the (standardized) GFF files for some of the submitted sets is incorrect. The consensus sets are not affected.
Changes:
- All GLEAN* models are renamed with species-specific prefixes (dsim_GLEANR_1 etc.). Nothing else about these models should have changed.
- Genome Browsers are now searchable by gene name or FBgn ID. GLEANR models' detail pages link to the ortholog on FlyBase's dmel browser.
--venky 05 Aug
Links
Orthology
all by all & combined
6793 genes have exactly 1 ortholog in each of 12 species.
See detailed ortholog/paralog clusters here tsv and histogram of classifications
This was done using a multiple-genome version of the Fuzzy Reciprocal BLAST clustering algorithm (Iyer VN, Pollard DA, Eisen MB, unpublished).
Another set using MultiParanoid, and all pairs of orthologies/synteny maps, will follow.
dmel-pairwise
I assigned orthology against dmel using two methods: INPARANOID and Fuzzy Reciprocal BLAST (Iyer VN, Pollard DA, Eisen MB, unpublished), a variation on the Reciprocal BLAST technique that we developed.
- 1-1 orthology was assigned by both FRB and INPARANOID to 10,143 models in dpse; the figure is between 9,000 and 11,000 models for every genome.
- 4760 models have orthologs in all 11 species.
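The Fuzzy Reciprocal BLAST algorithm is unpublished, so the sketch below shows only the plain reciprocal-best-hit technique that it is described as a variation of. It is not the FRB method itself. The input format (tuples of query, subject, bit score) and the gene names are assumptions made for illustration.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    `hits` is an iterable of (query, subject, bitscore) tuples, a
    hypothetical simplification of tabular BLAST output.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Toy data: identifiers and scores are made up for illustration.
dmel_vs_dpse = [
    ("dmel_g1", "dpse_GLEANR_1", 900.0),
    ("dmel_g1", "dpse_GLEANR_2", 300.0),
    ("dmel_g2", "dpse_GLEANR_2", 450.0),
]
dpse_vs_dmel = [
    ("dpse_GLEANR_1", "dmel_g1", 880.0),
    ("dpse_GLEANR_2", "dmel_g1", 310.0),
    ("dpse_GLEANR_2", "dmel_g2", 440.0),
]
pairs = reciprocal_best_hits(dmel_vs_dpse, dpse_vs_dmel)
print(pairs)
```

A strict reciprocal-best-hit rule like this yields at most one partner per gene, which is why it is a natural starting point for 1-1 orthology calls.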
Some statistics on the GLEANR consensus set
I'll post more detailed numbers later, but briefly (for the dpse genome, as an example -- the numbers are similar for the other genomes):
- 17328 models
- 23% have no overlap with any model in the conservative GLEANSH/JIGSAW sets
- 26% are exon-by-exon identical in all three sets (GLEANSH, JIGSAW, GLEANFPH)
- 35% overlap exactly 1 model in each set
GLEAN
- built by Venky Iyer/Eisen Lab/UC Berkeley.
Description of various GLEAN sets and how the putative Final set was built
I built five consensus sets for each genome using GLEAN.
- The GLEAN set: This was the first set I built and posted here as the 20060703_allsets_v0.1 version. This was built using all the protein coding annotations, each as a separate source of evidence to GLEAN.
I noticed a few issues with the models in the GLEAN set:
- models that were separate in the homology-based sets (GeneWise, GeneMapper etc.) were sometimes linked in the consensus
- consensus models sometimes used exons from the ab initio sets (SNAP, GeneID etc.) in place of exons from the homology sets.
Given that Dmel is a well-annotated genome, I think that the issues mentioned above are likely to be mistakes, since the homology-based annotations use the Dmel annotations as explicit evidence. On the other hand, the homology-based annotations are prone to miss exons (often small and terminal exons). We'd also like to be able to discover new genes and genes without good homology to Dmel.
So I built a few more sets:
- The GLEANH set (homology): This was an initial attempt at a homology-only set. I used DGIL_SNP, BREN_NSC, NCBI_GNO (filtered for supported models), EISE_*, PACH_GMP, and OXFD_GPI as evidence to see what the effect of ab initio predictions was.
- The GLEANSH set (strict_homology): This is a strict homology set: GeneMapper (EISE_CGM, PACH_GMP :grouped), GeneWise/Exonerate (EISE_CGW, EISE_CEX, OXFD_GPI :grouped), and Gnomon (NCBI_GNO filtered for supported models). The grouping was done to get good statistics, in consultation with Aaron Mackey (the author of GLEAN).
The GLEANSH set has ~13000 models uniformly across the genomes. This is what you'd expect from a set built with Dmel models as the starting point, given ~13000 genes in Dmel.
dana.gff3:13016
dere.gff3:13429
dgri.gff3:12714
dmoj.gff3:12253
dper.gff3:12972
dpse.gff3:13024
dpserec.gff3:13320
dsec.gff3:13917
dsim.gff3:13942
dvir.gff3:12350
dwil.gff3:12567
dyak.gff3:14110
dyakrec.gff3:14057
- The GLEANFPH set (filtered_plus_homology): This was another all-inclusive set, with grouped evidence sources. ab initio sources: SNAP (DGIL_*), GeneID (RGUI_*), NCBI_GNO (ab initio models only), CONTRAST, plus the grouped homology sources used in GLEANSH. All the ab initio sets were filtered to remove models that linked two predictions from the GLEANSH consensus set, in an attempt to eliminate one of the issues mentioned above.
The GLEANFPH set has between roughly 15,500 and 23,600 models, depending upon the genome.
dana.gff3:22485
dere.gff3:17281
dgri.gff3:17385
dmoj.gff3:18278
dper.gff3:23629
dpse.gff3:17660
dpserec.gff3:16949
dsec.gff3:21913
dsim.gff3:18892
dvir.gff3:18230
dwil.gff3:19847
dyak.gff3:20075
dyakrec.gff3:15573
- The GLEANR set (reconciled): This was a reconciliation of the GLEANFPH and GLEANSH sets:
- Any GLEANFPH models that overlapped two GLEANSH predictions were replaced by the individual GLEANSH models.
- When a GLEANFPH model overlapped a GLEANSH model, every GLEANSH CDS was required to be represented by an overlapping GLEANFPH CDS. That is, relative to the GLEANSH set, the GLEANR set is allowed to add or extend exons but not to remove them completely. If this condition was not met, the GLEANFPH model was replaced by the GLEANSH model.
- Any GLEANSH models not represented in the GLEANFPH set were added back.
- Any GLEANFPH models not overlapping GLEANSH models were kept.
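The four reconciliation rules can be sketched as follows. This is a minimal illustration, not the script used to build GLEANR. It assumes each model is a list of (start, end) CDS intervals on the same scaffold and strand, and it treats "overlapped two GLEANSH predictions" as "two or more". The model IDs are hypothetical, and the behaviour when several GLEANFPH models hit the same GLEANSH model is not specified above.

```python
def cds_overlap(a, b):
    """True if two closed (start, end) intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def models_overlap(cds_a, cds_b):
    """True if any CDS of one model overlaps any CDS of the other."""
    return any(cds_overlap(x, y) for x in cds_a for y in cds_b)


def reconcile(fph, sh):
    """Reconcile two model sets; both map model ID -> list of CDS intervals."""
    result = {}
    used_sh = set()
    for fph_id, fph_cds in fph.items():
        hits = [s for s, s_cds in sh.items() if models_overlap(fph_cds, s_cds)]
        used_sh.update(hits)
        if not hits:
            # GLEANFPH model with no GLEANSH overlap: keep it.
            result[fph_id] = fph_cds
        elif len(hits) > 1:
            # GLEANFPH model links several GLEANSH models: use those instead.
            for s in hits:
                result[s] = sh[s]
        else:
            s = hits[0]
            # Every GLEANSH CDS must be overlapped by some GLEANFPH CDS.
            covered = all(
                any(cds_overlap(c, f) for f in fph_cds) for c in sh[s]
            )
            if covered:
                result[fph_id] = fph_cds
            else:
                result[s] = sh[s]
    # GLEANSH models never touched by a GLEANFPH model are added back.
    for s, s_cds in sh.items():
        if s not in used_sh:
            result[s] = s_cds
    return result


sh = {
    "sh1": [(100, 200), (300, 400)],
    "sh2": [(1000, 1100)],
    "sh3": [(2000, 2100), (2300, 2400)],
    "sh4": [(5000, 5100)],
    "sh5": [(6000, 6100), (6300, 6400)],
}
fph = {
    "fphA": [(100, 200), (300, 400), (450, 500)],  # extends sh1
    "fphB": [(1000, 1100), (1500, 1600), (2000, 2100), (2300, 2400)],  # links sh2, sh3
    "fphC": [(3000, 3100)],  # no GLEANSH counterpart
    "fphD": [(6000, 6100)],  # drops an exon of sh5
}
reconciled = reconcile(fph, sh)
print(sorted(reconciled))
```

In the toy data, fphA and fphC survive, fphB is split back into sh2 and sh3, fphD gives way to sh5, and sh4 is added back.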
The GLEANR set was then compared to the GLEANSH, GLEANFPH and JIGSAW sets. Every model is annotated as being identical to, or overlapping one or multiple models in each of the comparison sets.
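The per-model annotation could look something like the sketch below. The labels ("identical", "single", "multiple", "none") and the interval representation are assumptions for illustration, not the labels used in the posted files. Here a model that overlaps more than one comparison model is labelled "multiple" even if it is identical to one of them.

```python
def classify(model_cds, comparison_set):
    """Label one model relative to one comparison set.

    `model_cds` is a list of closed (start, end) CDS intervals and
    `comparison_set` maps model ID -> list of CDS intervals, all assumed
    to lie on the same scaffold and strand.
    """
    overlapping = [
        other_id
        for other_id, other_cds in comparison_set.items()
        if any(a[0] <= b[1] and b[0] <= a[1] for a in model_cds for b in other_cds)
    ]
    if not overlapping:
        return "none"
    if len(overlapping) > 1:
        return "multiple"
    # Exactly one overlapping model: check for exon-by-exon identity.
    if sorted(comparison_set[overlapping[0]]) == sorted(model_cds):
        return "identical"
    return "single"


# Hypothetical comparison set with two models.
comparison = {
    "m1": [(100, 200), (300, 400)],
    "m2": [(1000, 1100)],
}
labels = [
    classify([(100, 200), (300, 400)], comparison),
    classify([(150, 200), (300, 400)], comparison),
    classify([(150, 1050)], comparison),
    classify([(5000, 5100)], comparison),
]
print(labels)
```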
Some numbers (and poor man's histograms) for how well the GLEANR models agreed with the GLEANSH, GLEANFPH and JIGSAW sets are here
Barring systematic problems and major bugs, I'd like to present the GLEANR set for consideration as the putative final consensus set.
The GLEANR set has similar numbers to the GLEANFPH set. Note that every CDS/mRNA from the GLEANSH set should be represented in the GLEANR set, plus additional models from the GLEANFPH set.
dana.gff3:22551
dere.gff3:16881
dgri.gff3:16901
dmoj.gff3:17739
dper.gff3:23029
dpse.gff3:17328
dpserec.gff3:16836
dsec.gff3:21332
dsim.gff3:18273
dvir.gff3:17684
dwil.gff3:20257
dyak.gff3:19430
dyakrec.gff3:15425
This set can be filtered for conservative analyses in two ways:
- by requiring a model to be represented in GLEANSH, GLEANFPH, and JIGSAW, or in some subset of them
- by filtering for models where one-to-one orthology is assigned by comparison to the Dmel translation set (see below).
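Both filters can be expressed as a simple selection over per-model annotations. The sketch below assumes a hypothetical table in which each GLEANR model carries one label per comparison set, with "none" meaning no overlapping model, plus an optional set of IDs with 1-1 orthology to Dmel. The function and label names are illustrative and not taken from the posted files.

```python
def conservative_subset(annotations, required_sets, one_to_one=None):
    """Return sorted IDs of models supported by every set in `required_sets`.

    `annotations` maps model ID -> {set name: label}; a label of "none"
    (or a missing entry) means the model is not represented in that set.
    If `one_to_one` is given, models must also appear in it.
    """
    kept = []
    for model_id, support in annotations.items():
        if any(support.get(s, "none") == "none" for s in required_sets):
            continue
        if one_to_one is not None and model_id not in one_to_one:
            continue
        kept.append(model_id)
    return sorted(kept)


# Hypothetical annotations for three models.
annotations = {
    "dpse_GLEANR_1": {"GLEANSH": "identical", "GLEANFPH": "identical", "JIGSAW": "identical"},
    "dpse_GLEANR_2": {"GLEANSH": "none", "GLEANFPH": "single", "JIGSAW": "single"},
    "dpse_GLEANR_3": {"GLEANSH": "single", "GLEANFPH": "single", "JIGSAW": "none"},
}
strict = conservative_subset(annotations, ["GLEANSH", "GLEANFPH", "JIGSAW"])
homology_only = conservative_subset(annotations, ["GLEANSH"])
print(strict, homology_only)
```

Passing a shorter `required_sets` list corresponds to the "some subset" option, and the two filters can be combined by supplying `one_to_one` as well.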
One observation that surprised me (from looking at the genome browsers) was how much the specifics of a gene model, especially the ends of mRNAs and CDSs, vary among the prediction sets even when they get the general region correct. This is probably an important caveat for any detailed analyses of gene structure in these genomes.
Translation and CDS FASTA files and gene model GFF3 files, as well as Genome Browsers for all the sets, are available. See the Links section above.
Ongoing: Orthologies are being assigned for the GLEANR set using INPARANOID and Fuzzy Reciprocal BLAST. These will be posted later this week.
GLEAN statistics, comparison of submitted sets
GLEAN statistics are now here: html, text, spreadsheet
These are the false positive and false negative rates for each of the sets/species, as trained by GLEAN. They are meant to reflect some level of confidence in a set's ability to "accurately" (measured by agreement to other sets and validity of gene models) predict each of four "sites": starts, stops, donors, and acceptors. The raw data as well as one slice of it is here.
(Note: there's a minor wrinkle in the calculation of the means that my script did not initially account for: some prediction sets included dyakrec and dpserec, while others did not. fp and fn for dpserec/dyakrec were 0 and 1 when there were no predictions. Taking this into account, i.e., excluding the dpserec/dyakrec columns for the sets that didn't include them, changes the values of the means slightly, but the ranks and the ordering of the sets by the ranksums do not change. --venky)
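The effect of the correction can be illustrated with a small sketch. The rates, set names, and the exact rank-sum scheme below are assumptions for illustration. The page does not spell out how the ranksums were computed, so this simply ranks sets on each metric (lower is better) and sums the ranks.

```python
from statistics import mean


def mean_over_predicted(rates, predicted_species):
    """Average per-species rates over only the species a set annotated."""
    return mean(rates[sp] for sp in predicted_species)


def rank_sums(scores):
    """Sum per-metric ranks for each set (rank 1 = lowest value).

    `scores` maps set name -> list of metric values. Ties are broken by
    set name, which is a simplification.
    """
    names = sorted(scores)
    n_metrics = len(next(iter(scores.values())))
    totals = {name: 0 for name in names}
    for i in range(n_metrics):
        ordered = sorted(names, key=lambda name: (scores[name][i], name))
        for rank, name in enumerate(ordered, start=1):
            totals[name] += rank
    return totals


# Hypothetical fn rates for one set that did not annotate dpserec/dyakrec,
# so those columns carry the placeholder value 1.
fn_rates = {"dpse": 0.25, "dyak": 0.5, "dpserec": 1.0, "dyakrec": 1.0}
naive_mean = mean(fn_rates.values())
corrected_mean = mean_over_predicted(fn_rates, ["dpse", "dyak"])
print(naive_mean, corrected_mean)

# Hypothetical mean (fp, fn) values for three sets.
totals = rank_sums({
    "SET_A": [0.1, 0.375],
    "SET_B": [0.2, 0.25],
    "SET_C": [0.3, 0.5],
})
print(totals)
```

Because ranks depend only on ordering, a modest shift in the means can leave the rank sums unchanged, which is consistent with the note above.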
JIGSAW
(courtesy Jonathan Allen, with reformatting by Venky Iyer) See the Links section above.
Genome Browser Instructions
This is now fixed to turn on all the tracks by default. You can search by scaffold_id. Each gbrowser has an example of what the scaffold_id looks like in the Instructions section. You may have to zoom in to visualize features.
