Datasets

These datasets are frozen for final analyses, barring catastrophic problems with them. Contact Venky Iyer (venky AT berkeley . edu) if you have questions.


CAF1 assemblies have been submitted to Genbank and underwent some changes in the process. Conversion tables are linked below. Please convert your datasets to the Genbank coordinates so that they are comparable with other groups' results and data.


Most directories linked below have footers beneath the directory tree, with notes describing the contents of the files in that directory.

ChangeLog

Nov 20, 8am EST
Fixed a bug affecting how masked alignment columns were stripped to generate the /stripped/*.fasta
alignments.  The bug caused frame shifts to be introduced into approximately 300 alignments when
masked alignment columns were stripped out.  Tar balls for coding gene alignments have been updated
with the fixed alignments.  Note that this ONLY affects the /stripped/*.fasta alignments, not the 
masked/*.fasta or full/*.fasta alignments.

Nov 18, 1pm PST
Protein-coding alignments for three genes were missing from the all_species alignment sets 
(with and without guide-tree). These have now been added back. No other changes have been made.

Nov 16th, 6am PST
Fixed a bug in the synteny_resolved_ortholog tables. The bugfix caused some synpipe calls that
conflicted with the frb calls to be removed. As a consequence, the annotation GFFs were also
modified. 
(reported by Hiroshi Akashi)

Nov 15th, 1pm PST
Fixed a bug affecting the attributes of these GFF files that sometimes mislabeled 1:n paralogs as 
orthologs. Please use the new files if you downloaded the data previously. The homology 
assignments did not change, just the labeling of a few hundred gene models in the GFF files.
(reported by Peili Zhang)

Genomes

Genbank FASTA files

All final analyses should be in reference to the Genbank accessions/assemblies. All datasets here are based on the Genbank sequences.

CAF1<->Genbank mappings

(from Paul Kitts at NCBI). Models not in the Genbank assemblies were excluded.

Foreign sequences

(from Paul Kitts). These sequences appear to be spurious sequences from other species (human, yeast, Wolbachia, etc.). Models from these regions were excluded.

Protein Coding Genes

Gene Model GFF3 files

Gene models were built using a series of GLEAN (Aaron Mackey) consensus-building and filtering steps, starting from community-submitted annotation sets. Only coding sequences are annotated in these models.

Homologies

Fuzzy Reciprocal BLAST clustering

Homologies were assigned first by using a modified reciprocal BLAST method (Fuzzy Reciprocal BLAST clustering, Venky Iyer unpublished). This produces clusters of homologous genes from all 12 species. These were then parsed into pairwise 'ortholog' (1:1), 'paralog' (1:n, n:1, n:n), and 'no_homolog' (0:?, ?:0) assignments.
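
The parsing of clusters into pairwise assignments can be illustrated with a minimal sketch. It is not the FRB code itself: the cluster representation (a dictionary from species code to member gene IDs), the function name `classify_pair`, and the gene IDs are all hypothetical, and the rule simply follows the 1:1 / 1:n, n:1, n:n / 0:?, ?:0 definitions given above.

```python
from typing import Dict, List, Optional

# Hypothetical representation: species code -> gene model IDs in one cluster.
Cluster = Dict[str, List[str]]


def classify_pair(cluster: Cluster, sp_a: str, sp_b: str) -> Optional[str]:
    """Label the relationship between two species within one cluster."""
    n_a = len(cluster.get(sp_a, []))
    n_b = len(cluster.get(sp_b, []))
    if n_a == 0 and n_b == 0:
        return None              # neither species is in this cluster
    if n_a == 0 or n_b == 0:
        return "no_homolog"      # 0:? or ?:0
    if n_a == 1 and n_b == 1:
        return "ortholog"        # 1:1
    return "paralog"             # 1:n, n:1 or n:n


# Made-up gene IDs, for illustration only.
cluster = {"dmel": ["geneA"], "dsim": ["geneB"], "dyak": ["geneC", "geneD"]}
for other in ("dsim", "dyak", "dana"):
    print("dmel", other, classify_pair(cluster, "dmel", other))
```

Under these assumptions, a cluster with one dmel member and two dyak members yields a 'paralog' call for the dmel-dyak pair, while a species absent from the cluster yields 'no_homolog'.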

Synpipe

Independently, Arjun (AJ) Bhutkar from the Gelbart group at Harvard/Flybase used dmel-dxxx TBLASTN-based comparisons and syntenic considerations [Ref: "Techniques for Multi-Genome Synteny Analysis to Overcome Assembly Limitations", Arjun Bhutkar, Susan Russo, Temple F. Smith and William M. Gelbart; Genome Informatics Vol. 17, No. 2, 2006] to assign relationships between a dmel translation and a dxxx genomic region. These are the Synpipe calls. They were then mapped to GLEANR models for each of the dxxx species, filtered to retain only 1:1 relationships that were not already represented in the FRB calls and did not conflict with them, and added as 'synteny_resolved_ortholog' calls. These calls can usually be interpreted as resolving paralogous groups by using synteny.

It is important to keep in mind that the synteny_resolved_ortholog models potentially have one or more paralogs.
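
The filtering step described above can be sketched as follows. This is only an illustration under stated assumptions, not the pipeline that produced the tables: the data structures, the function name `filter_synpipe` and the gene IDs are hypothetical, and "conflict" is assumed to mean that either the dmel gene or the dxxx model already has a different FRB ortholog partner.

```python
from collections import Counter
from typing import Dict, List, Tuple


def filter_synpipe(calls: List[Tuple[str, str]],
                   frb_orthologs: Dict[str, str]) -> List[Tuple[str, str]]:
    """Keep 1:1 Synpipe calls that FRB neither contains nor contradicts.

    calls          -- (dmel_gene, dxxx_model) pairs from Synpipe
    frb_orthologs  -- dmel_gene -> dxxx_model for FRB 'ortholog' calls
    """
    dmel_counts = Counter(dmel for dmel, _ in calls)
    dxxx_counts = Counter(dxxx for _, dxxx in calls)
    frb_targets = set(frb_orthologs.values())

    kept = []
    for dmel, dxxx in calls:
        if dmel_counts[dmel] != 1 or dxxx_counts[dxxx] != 1:
            continue  # not a 1:1 relationship within the Synpipe calls
        if dmel in frb_orthologs or dxxx in frb_targets:
            continue  # already represented in FRB, or conflicts with it
        kept.append((dmel, dxxx))
    return kept


# Made-up IDs, for illustration only.
calls = [("m1", "x1"), ("m2", "x2"), ("m3", "x3"), ("m4", "x4"), ("m4", "x5")]
frb = {"m1": "x1", "m2": "x9"}
print(filter_synpipe(calls, frb))
```

Here `m1` is dropped because FRB already has the same call, `m2` because FRB pairs it with a different model, and `m4` because it is not 1:1; only the `m3` call would be added as a 'synteny_resolved_ortholog'.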

Translations and CDSs, FASTA

Coding Gene Alignments

PLEASE NOTE: There are known problems with some of the coding gene alignment sets on this site. Official alignment sets from the community papers can be found on FLYBASE. Old alignment sets, however, can still be accessed here.

Coding gene alignments were produced using T-COFFEE by threading CDS sequences onto protein alignments, and were then filtered to mask out regions where the alignment quality was poor. More details on the masking protocol can be found here.
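
As a rough illustration of what threading a CDS onto a protein alignment means, here is a minimal sketch. It is not the code used to build these alignment sets; the function name `thread_cds` is hypothetical, gaps are assumed to be written as "-", and a trailing stop codon that is absent from the protein is assumed to be droppable.

```python
def thread_cds(aligned_protein: str, cds: str) -> str:
    """Expand one aligned protein row into a codon-aligned nucleotide row."""
    n_residues = sum(1 for aa in aligned_protein if aa != "-")
    # Assumption: a CDS may carry a final stop codon that the protein lacks.
    if len(cds) == 3 * (n_residues + 1):
        cds = cds[:-3]
    if len(cds) != 3 * n_residues:
        raise ValueError("CDS length does not match the protein sequence")
    codons = iter([cds[i:i + 3] for i in range(0, len(cds), 3)])
    # Each residue consumes the next codon; each gap becomes three gap columns.
    return "".join("---" if aa == "-" else next(codons) for aa in aligned_protein)


print(thread_cds("M-KL", "ATGAAACTGTAA"))
```

Because every protein column maps to exactly three nucleotide columns, the reading frame is preserved as long as columns are later masked or removed in whole-codon units.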

Currently available

The following alignment sets are available:
  • Fuzzy Reciprocal Blast (FRB) based alignments of Dmel genes with 1-1 orthologs in either the 5 species in the Dmel species group or in all 11 species
    • alternative CDSs/translations, longest CDSs/translations
    • with/without guide tree
    • masked, unmasked
  • FRB based alignments of Dmel genes with 1-1 orthologs in any species where an ortholog was found
    • longest CDSs/translations only
    • with a guide tree only
    • masked, unmasked
  • FRB based alignments of Dmel genes with 1-1 orthologs in any species with paralogs resolved to putative orthologs using Synpipe
    • longest CDSs/translations only
    • with a guide tree only
    • masked, unmasked
  • FRB based alignments of all clusters without paralogs in any species (these contain lineage specific genes)
    • longest CDSs/translations only
    • with a guide tree only
    • unmasked (these might not ever get masked unless there is demand for it)
  • Matt Hahn's group has separately produced FRB based alignments of all clusters, including paralogs, which are available here.

Noncoding RNA genes

Gene Model GFF3 files

Feb 12, 2007 17:00 GMT
New version of GFF3 produced for ncRNA gene sets. Major changes include:

1) include MAKA group snRNA predictions, 
2) include miscellaneous ncRNAs, and 
3) filter out Rfam predictions that are RNA structures, not RNA genes.  

A nonredundant set of noncoding RNA genes has been assembled from Rfam, miRNA, tRNA and snoRNA submissions. This set was derived by removing all miRNA, snoRNA and tRNA predictions from the Rfam datasets and adding the higher-resolution results from the miRNA-, tRNA-, snRNA- and snoRNA-specific analyses.
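
The merge described above amounts to a filter followed by a union. The sketch below shows the idea only; the record layout (dictionaries with a "type" key), the names `REPLACED_TYPES` and `build_nonredundant`, and the example records are hypothetical and do not reflect the actual GFF3 attributes.

```python
from typing import Dict, Iterable, List

# Classes dropped from the Rfam predictions because a dedicated
# analysis supplies them instead (per the description above).
REPLACED_TYPES = {"miRNA", "snoRNA", "tRNA"}


def build_nonredundant(rfam: Iterable[Dict[str, str]],
                       specialised: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Drop replaced classes from Rfam, then append the specialised calls."""
    kept = [p for p in rfam if p["type"] not in REPLACED_TYPES]
    return kept + list(specialised)


# Made-up records, for illustration only.
rfam = [{"id": "r1", "type": "tRNA"}, {"id": "r2", "type": "rRNA"}]
specialised = [{"id": "t1", "type": "tRNA"}, {"id": "s1", "type": "snRNA"}]
print([p["id"] for p in build_nonredundant(rfam, specialised)])
```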

CAF1 versions of the data can be downloaded from here, and species-specific GFF files are as follows:

Dmel Dsim Dsec Dyak Dere Dana Dpse Dper Dwil Dvir Dmoj Dgri

NB: Genbank conversion for the final ncRNA gene models is pending.

Deprecated Nov 2006 versions of these files can be found here.

Alignments of predicted ncRNAs

Multiple alignments of Rfam ncRNA gene predictions were made using stemloc by members of Ian Holmes' lab (Robert Bradley, Gabriel Wu and Yuri Bendana).

They are hosted on our webserver in Stockholm format (tarball) and in HTML colorized by mutation type (tarball).

Additionally, Rfam and tRNA predictions were aligned using INFERNAL's cmalign (using all default parameters). Here are the gzipped tarballs of the output (each tarball contains all of the alignments):

entire cmalign trace (incl. scores for each sequence)

just the Stockholm alignment portion

You can also get these alignments on our webserver broken down by family.

Whole Genome Alignments

MAVID/MERCATOR Alignments

Multiple whole-genome alignments of Drosophila species were generated by Mercator (an orthology mapping program) and MAVID (a multiple alignment program). These alignments were engineered by Lior Pachter's group (Nick Bray, Anat Caspi, Colin Dewey). More details on the program can be found here.

Currently available

The following alignment sets are available:
  • pairwise alignments of each of the 11 other genomes to D. melanogaster
  • alignments restricted to each internal branching in the tree (i.e., alignments for each group and subgroup)
  • an alignment at the root of the tree incorporating all 12 genomes

UCSC whole genome alignments

Angie Hinrichs and colleagues at UCSC have produced whole genome alignments for the CAF1 genomes, Anopheles, Apis and Tribolium.

UCSC's multiz alignments of 12 flies (CAF1 freeze), mosquito, honeybee and red flour beetle can be downloaded here: [1].

The alignments and phastCons scores are displayed together in the "Conservation (15)" track in the UCSC test browser: [2].

The pairwise alignments used to make UCSC's multiple alignment are available for download as follows:

Dsim Dsec Dyak Dere Dana Dpse Dper Dwil Dvir Dmoj Dgri Agam Amel Tcas

Possible Transposon Contaminations in Gene Sets

Many predicted genes are highly masked by repeats (>90% of their CDS masked; ReAS de novo assembly). Aligning these highly masked genes against known TE-related peptides showed that about half of them match well. Very few of them appear in the FRB and Synpipe ortholog tables, so these highly masked genes are most likely TE contamination. Detailed information on the fraction of CDS masked for each gene is available via ftp. The related ReAS repeat annotation of the genomes is also available via ftp. For any questions, please contact Ruiqiang Li <lirq@genomics.org.cn>.

ID    # Genes  # Genes with >90% CDS Masked  # Overlap FRB  # Overlap Synpipe
dsim  17,049   956                           0              4
dsec  21,332   4,254                         1              4
dmel  13,733   17                            -              -
dyak  18,816   2,121                         0              7
dere  16,880   1,378                         1              2
dana  22,551   7,007                         0              6
dpse  17,328   450                           2              5
dper  23,029   5,489                         0              7
dwil  20,211   3,805                         1              15
dmoj  17,738   2,614                         1              3
dvir  17,679   2,815                         1              4
dgri  16,901   1,422                         0              4
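
The >90% criterion used for the table can be sketched as below. This is an illustration only: it assumes that repeat-masked bases are shown in lowercase (soft masking), which may not match how the ReAS masking is actually encoded in the ftp files, and the function names are hypothetical.

```python
def masked_fraction(cds: str) -> float:
    """Fraction of CDS bases that are masked (assumed: lowercase = masked)."""
    if not cds:
        raise ValueError("empty CDS")
    masked = sum(1 for base in cds if base.islower())
    return masked / len(cds)


def is_highly_masked(cds: str, threshold: float = 0.9) -> bool:
    """True when more than `threshold` of the CDS is masked."""
    return masked_fraction(cds) > threshold


print(masked_fraction("ATGaaa"))         # half of the bases are lowercase
print(is_highly_masked("Atgaaacccggg"))  # 11 of 12 bases masked
```

Genes flagged this way that also lack an FRB or Synpipe ortholog call would be the candidates for TE contamination described above.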

Todo

  • dyak caf1-genbank mapping needs work to make sure no models span inter-scaffold gaps
  • synpipe orthology calls are made, need to match to fuzzyRB homology calls and break into high conf/low/medium confidence sets
  • send off to dan/tim for alignment/masking
  • generate synteny blocks for dmel-dxxx pairings using synpipe calls to add to dxxx-dxxx pairings
  • pull in whole genome alignments