Background
TaoChongBao, metaphorically described as "an e-marketplace for exchanging worm variants", is a comprehensive database for managing and using the large collection of missense mutations from EMS-mutagenized Caenorhabditis elegans.
Data sources and processing
- Protein sequences of the nematode (Caenorhabditis elegans) were downloaded from Ensembl (assembly: WBcel235).
- Protein sequences of human (Homo sapiens) were downloaded from Ensembl (assembly: GRCh38).
- The nematode-human orthologous pairs were retrieved from the OrthoList2 project, with an additional 4,572 pairs identified by MMseqs2 clustering.
- Sequence pairs were aligned using the ClustalW package with the BLOSUM62 scoring matrix, which identified 2,248,512 residues in nematode proteins that are conserved in their human homologs.
- AlphaMissense predictions for human isoforms were downloaded from the official repository.
- ClinVar data were downloaded from the official FTP.
ClinVar data were obtained in 2025/11; all other data were obtained in 2024/03.
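The alignment step can be illustrated with a minimal sketch that maps conserved alignment columns back to nematode residue numbers. It assumes that "conserved" means an identical residue in both aligned sequences; the criterion actually used by the database is not stated here and may differ (for example, a BLOSUM62 score threshold). The function name and the toy sequences are hypothetical.

```python
def conserved_positions(aligned_worm: str, aligned_human: str) -> list[int]:
    """Return 1-based positions in the ungapped worm protein whose aligned
    human residue is identical (assumed definition of 'conserved')."""
    if len(aligned_worm) != len(aligned_human):
        raise ValueError("aligned sequences must have equal length")
    positions = []
    worm_pos = 0  # residue index in the ungapped worm protein
    for w, h in zip(aligned_worm, aligned_human):
        if w != "-":
            worm_pos += 1  # only non-gap columns advance the worm coordinate
        if w != "-" and h != "-" and w == h:
            positions.append(worm_pos)
    return positions


# Toy gapped alignment (made-up sequences, not taken from the database).
worm = "MKT-LLVA"
human = "MRTALL-A"
print(conserved_positions(worm, human))  # [1, 3, 4, 5, 7]
```

Tracking the worm coordinate separately from the alignment column keeps the reported positions comparable with residue numbers in variant annotations, even when the alignment contains gaps.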
Content
- 12,069 viable C. elegans strains harbouring a total of 20,315,536 variations
- 541,102 unique missense mutations in 20,914 genes, an average of 25.9 mutations per gene
- 22.2% (120,363 of 541,102) of missense mutations occur at nematode-human conserved residues, covering 3.3% of all conserved residues
- 50.8% (61,183 of 120,363) of the missense variations at conserved residues were predicted to be "pathogenic" by AlphaMissense
- 6,255 missense variations were recorded in ClinVar, of which 625 were reported as "(likely) pathogenic"
- Also includes 25,241 unique stop-gained variations in 11,548 genes, and 1,659 unique frameshift variations in 1,444 genes
Data above as of 2025/11
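As a quick sanity check, the percentages above can be recomputed from the listed counts. This is only an arithmetic sketch: the helper name `pct` is hypothetical, and the counts are copied from the list rather than queried from the database.

```python
def pct(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


missense_total = 541_102        # unique missense mutations
missense_conserved = 120_363    # those at nematode-human conserved residues
predicted_pathogenic = 61_183   # of the conserved-residue mutations
genes_with_missense = 20_914

print(f"at conserved residues: {pct(missense_conserved, missense_total):.1f}%")
print(f"predicted pathogenic:  {pct(predicted_pathogenic, missense_conserved):.1f}%")
print(f"mean per gene:         {missense_total / genes_with_missense:.2f}")
```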
References
A paper describing our results and methods is in preparation.