Assembly Problems

Probable Assembly Errors Near Coding Sequence

We investigated possible cases of misassembly in the D. simulans CAF1 assembly that led to apparent transpositions of small regions of the genome. Our strategy was to first match D. melanogaster genes against the D. simulans assembly using BLAT and, as a control, against the D. melanogaster assembly. For genes that matched a different chromosome, we controlled for the possibility that the match was to a paralog by comparing it with the BLAT result against D. melanogaster. For those genes whose best match did appear to be transposed, we identified the original contigs from which the sequence was derived and double-checked the assembly in and around each contig using AGP and trace archive data, as well as BLATs to the D. melanogaster assembly.
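The screening step described above can be sketched in Python. This is only an illustration under stated assumptions, not the script used for this analysis: the `Hit` record is a hypothetical simplification of BLAT output (real PSL files have many more columns), the gene names and scores are made up, and the paralog control is interpreted here as "the best hit in the D. melanogaster control lands on the annotated chromosome".

```python
from collections import namedtuple

# Hypothetical, simplified hit record; real BLAT PSL output has more columns.
Hit = namedtuple("Hit", ["gene", "target_chrom", "score"])


def best_hits(hits):
    """Return a dict mapping each gene to its highest-scoring hit."""
    best = {}
    for hit in hits:
        current = best.get(hit.gene)
        if current is None or hit.score > current.score:
            best[hit.gene] = hit
    return best


def candidate_transpositions(expected_chrom, sim_hits, mel_hits):
    """Flag genes whose best D. simulans hit is on an unexpected chromosome.

    A gene is flagged only if its best hit in the D. melanogaster control
    is on the annotated chromosome (an assumed stand-in for the paralog
    check) while its best D. simulans hit is on a different one.
    """
    sim_best = best_hits(sim_hits)
    mel_best = best_hits(mel_hits)
    flagged = []
    for gene, chrom in expected_chrom.items():
        sim = sim_best.get(gene)
        mel = mel_best.get(gene)
        if sim is None or mel is None:
            continue  # no hit in one of the assemblies; nothing to compare
        if mel.target_chrom != chrom:
            continue  # control does not recover the annotated location
        if sim.target_chrom != chrom:
            flagged.append((gene, chrom, sim.target_chrom))
    return sorted(flagged)


# Made-up example data, for illustration only.
expected = {"geneA": "X", "geneB": "3L", "geneC": "2R"}
sim_hits = [
    Hit("geneA", "3L", 900),
    Hit("geneA", "X", 400),
    Hit("geneB", "3L", 800),
    Hit("geneC", "2L", 700),
]
mel_hits = [
    Hit("geneA", "X", 1000),
    Hit("geneB", "3L", 1000),
    Hit("geneC", "2L", 1000),
]
print(candidate_transpositions(expected, sim_hits, mel_hits))
```

Genes flagged this way are only candidates; each one still needs the contig-level check against AGP and trace data described above.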

We identified several types of errors. In one, the contigs that appeared to be derived from a different chromosome were flanked by N's. In these cases it appears that the contig was misassigned to the chromosome in question. In another error type, there appeared to be a breakpoint within a contig's sequence that was not spanned by any traces in the NCBI trace archive. On either side of the breakpoint the sequence appeared to have been derived from different chromosomes, indicating a problem with contig assembly.
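The first error type (a contig flanked by N's on both sides) can be looked for directly in an AGP file. The following Python sketch assumes standard tab-separated AGP lines, in which column 5 is the component type (`N` or `U` for a gap) and column 6 is the component ID for non-gap lines. The function names and the example lines are hypothetical, and the layout of the CAF1 AGP files has not been checked against this.

```python
def is_gap(fields):
    """True if an AGP line describes a gap (component type N or U)."""
    return fields[4] in ("N", "U")


def gap_flanked_components(agp_lines):
    """Return (object, start, end, component_id) for each component that
    has a gap line immediately before and after it on the same object."""
    rows = []
    for line in agp_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        rows.append(line.split("\t"))

    flanked = []
    for i in range(1, len(rows) - 1):
        prev, cur, nxt = rows[i - 1], rows[i], rows[i + 1]
        same_object = prev[0] == cur[0] == nxt[0]
        if same_object and not is_gap(cur) and is_gap(prev) and is_gap(nxt):
            flanked.append((cur[0], int(cur[1]), int(cur[2]), cur[5]))
    return flanked


# Made-up AGP lines, for illustration only.
example_agp = [
    "\t".join(["chrZ", "1", "1000", "1", "W", "ctg1", "1", "1000", "+"]),
    "\t".join(["chrZ", "1001", "1100", "2", "N", "100", "fragment", "yes"]),
    "\t".join(["chrZ", "1101", "1600", "3", "W", "ctg2", "1", "500", "+"]),
    "\t".join(["chrZ", "1601", "1700", "4", "N", "100", "fragment", "yes"]),
    "\t".join(["chrZ", "1701", "2500", "5", "W", "ctg3", "1", "800", "+"]),
]
print(gap_flanked_components(example_agp))
```

A gap-flanked contig is not an error by itself; it only becomes suspicious when its sequence also matches a different chromosome, as in the cases listed on this page. The second error type needs trace-level evidence and is not covered by this sketch.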

The problems found are listed below, by location of the gene in D. melanogaster:

X chromosome

CG14218: Misassigned to 3L sequence on contig W501_Contig137.10. Whole contig appears to belong on X chromosome.

CG12685: Poorly Assembled Contig -- The contig begins at 4459460 and runs to 4495704 of chromosome 3R. The first ~1010 bases appear to belong on the simulans X chromosome, and the break appears to lie somewhere near position 1110 of the contig.

CG12689: Poorly Assembled Contig -- The contig begins at 9351985 and runs to 9352307 on the 3L chromosome. There appears to be a break at base 2219 of the contig.

CG14218: Misassigned contig between 19168693 and 19172670 of 3L.

CG9303-CG12398: Apparently misplaced contig surrounding these genes on 2L from base 10266502 to 10274605.

CG32559: Poorly Assembled Contig: W501_Contig5.29 has portions that should be assigned to 3L.

CG5659: Poorly Assembled Contig: W501_Contig4.7 has erroneous regions around base 8000. Contig is on 3L but has portions belonging to X.

3L chromosome

Enormous chromosomal move to 2R? Maybe not. Around the breakpoint, the sequence is heavy with material that matches ALL chromosomes, and it is also heavily padded with N's. Still, certainly worth checking later if there is a simple test.

CG32219: Contig between 11930117 and 11944896 of X chromosome. Single contig bordered by N's, probably misassigned.

CG14177: Poorly Assembled contig -- The contig begins at 6241263 and runs to 6248000 of chromosome 2R. The apparent error occurs at 6246062 of the chromosome.

CG11251: Poorly Assembled contig -- The contig begins at 3878343 and runs to 3883220 of chromosome 2R. The apparent error occurs at 5297 of the contig.

CG14041-CG9124: Possible 3L sequence misplaced on 2L.

3R chromosome

CG14369: Poorly Assembled contig -- The gene begins at 4621879 and runs to 4622234 of chromosome 2R in simulans. Areas of the contig are inconsistent with many traces in the archive.

4 chromosome

CG17923 has moved to chromosome 3R, and this has been experimentally confirmed (Masly et al. 2006). No breaks in trace sequence or flanking N's were detected.

See also: Assembly Quality Control

--Cdjones 11:00, 17 January 2007 (PST)