Validation of compounds and reactions from the KEGG Ligand database

BioMeta is intended to complement the KEGG Ligand database by applying organic chemical knowledge to small compounds, so that the compounds, and implicitly the reactions, are correct. Hundreds of molecular structures were corrected or improved.

Table 1 gives a breakdown of the validation results and the corrections made in the 12,815 molecule entries present in both BioMeta and the KEGG Ligand compound section of October 25, 2005. Note that a missing structure is not necessarily an error: the entry may be a generic compound such as "acceptor" or "phosphorylated protein". The validation program can detect only syntactic problems, e.g., valence violations, an unspecified enantiomer, or invalid stereochemistry. Some of these are real errors requiring correction, such as valence violations or ambiguously drawn stereocenters. Problems in the "undefined" categories suggest incomplete structural information, but not all such cases are necessarily incorrect; for instance, a drug that is a racemic compound would trigger the warning "unspecified enantiomer". Problems in the "incorrect" categories were not detected by the validation program because these errors are semantic rather than syntactic; they were found through visual inspection. In Table 1, the row "Total (stereochemistry)" is not the sum of the preceding rows because a compound may have multiple problems. In the two total rows, the KEGG count minus the BioMeta count does not equal the number corrected because of the "unknown" entries; if those numbers were known, the rows would balance.

Table 1. Detected and corrected problems in the BioMeta database (version October 23, 2006)

Type of Problem                  # in KEGG  # in BioMeta  # Corrected
Structure missing                     1239          1106          133
Valence violation(s)                    76             0           76
Incorrect constitution             unknown       unknown          107
Total (constitution)                  1315          1106          316

Undefined stereo double bond(s)         35            32            3
Invalid sp3 stereocenter(s)             70            47           23
Ambiguous sp3 stereocenter(s)           46             0           46
Undefined sp3 stereocenter(s)         1398           865          533
Unspecified enantiomer                2326          1840          486
Undefined sp3 stereochemistry          554           366          188
Incorrect stereochemistry          unknown       unknown           69
Total (stereochemistry)               3990          2907         1152

Total corrected                                                  1468
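
The arithmetic behind the total rows of Table 1 can be checked with a short script. The following is only a sketch: the counts are copied from Table 1, while the variable and function names are illustrative assumptions and not part of BioMeta. It computes, for each total row, how many corrections are not explained by the drop from the KEGG count to the BioMeta count.

```python
# Counts copied from Table 1 as (in_kegg, in_biometa, corrected).
# None stands for an "unknown" entry; all names here are illustrative.
constitution_rows = {
    "Structure missing": (1239, 1106, 133),
    "Valence violation(s)": (76, 0, 76),
    "Incorrect constitution": (None, None, 107),
}
constitution_total = (1315, 1106, 316)
stereo_total = (3990, 2907, 1152)
total_corrected = 1468


def column_sum(rows, index):
    """Sum one column, skipping "unknown" (None) entries."""
    return sum(r[index] for r in rows.values() if r[index] is not None)


def unexplained(total_row):
    """Corrections not accounted for by the drop from KEGG to BioMeta."""
    in_kegg, in_biometa, corrected = total_row
    return corrected - (in_kegg - in_biometa)


# The constitution rows sum column by column to their total row.
constitution_sums = tuple(column_sum(constitution_rows, i) for i in range(3))
print(constitution_sums == constitution_total)

# Gaps in the two total rows.
print(unexplained(constitution_total))
print(unexplained(stereo_total))

# The two "# Corrected" totals add up to the overall total.
print(constitution_total[2] + stereo_total[2] == total_corrected)
```

With the counts from Table 1, the two gaps are 107 and 69, which are exactly the "Incorrect constitution" and "Incorrect stereochemistry" corrections whose KEGG and BioMeta counts are listed as unknown.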

A total of 1468 structures were corrected. The large majority of valence errors involved nitrogen atoms that were not trivalent. The two most common cases were: 1) a nitrogen atom with one double bond and two single bonds but no charge (i.e., intended to be a pyridinium- or nitro-type nitrogen), which was corrected either by removing an attached hydrogen or by adding a positive charge; and 2) coordinative bonds from an imine-type nitrogen to a metal drawn as covalent bonds. Unfortunately, the molfile format does not support coordinative bonds, so these bonds had to be removed.
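
The nitrogen cases described above can be illustrated with a minimal valence check. This is a hypothetical sketch, not the validation program used for BioMeta: the table `ALLOWED_VALENCE` and the function `has_valence_violation` are assumptions made for illustration. The check covers only nitrogen and treats a neutral nitrogen as strictly trivalent, as the text does.

```python
# Hypothetical minimal valence check (illustration only).
# Allowed sums of bond orders, counting bonds to hydrogen,
# keyed by (element, formal charge). Only nitrogen is covered.
ALLOWED_VALENCE = {
    ("N", 0): {3},
    ("N", 1): {4},
}


def has_valence_violation(element, charge, bond_orders):
    """Return True if the bond-order sum is not allowed for this atom."""
    allowed = ALLOWED_VALENCE.get((element, charge))
    if allowed is None:
        return False  # element/charge combination outside this sketch
    return sum(bond_orders) not in allowed


# Case 1: one double bond and two single bonds, no charge.
uncharged = has_valence_violation("N", 0, [2, 1, 1])

# Fix A: add a positive charge (pyridinium- or nitro-type nitrogen).
charged = has_valence_violation("N", 1, [2, 1, 1])

# Fix B: remove an attached hydrogen, i.e. drop one single bond.
dehydrogenated = has_valence_violation("N", 0, [2, 1])

print(uncharged, charged, dehydrogenated)
```

Case 2 looks the same to such a check: an imine-type nitrogen with an additional covalent bond to a metal also has a bond-order sum of four without a charge, and removing the metal bond restores the trivalent state.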

Table 2 gives a more detailed breakdown of the sp3 stereochemistry enhancements from Table 1 (in some places the numbers are slightly lower because double-bond stereochemistry is omitted). The numbers relate to the 12,815 molecule entries present in both BioMeta and the KEGG Ligand compound section of October 25, 2005, minus the 1,239 entries that had no structure in KEGG. The table lists 76 more entries for BioMeta than for KEGG because compounds with valence errors are not analysed stereochemically. The "unspecified enantiomer" cases from Table 1 are split here into two "relative" stereochemistry cases: incompletely defined and completely defined. Note again that not all "Completely defined - relative" cases are necessarily errors, since a number of drugs may be racemic compounds. All cases, including those for meso compounds, are listed so that the numbers add up.

Table 2. Statistics of sp3 stereochemical content in the KEGG Compound and BioMeta (version October 23, 2006) databases

Stereochemistry                  OK  # in KEGG  # in BioMeta  # Corrected
Not possible                      +       3725          3764
Undefined (i.e., left out)                 554           366          188
Incompletely defined - meso                 24             3           21
Incompletely defined - absolute           1080           691          389
Incompletely defined - relative            294           171          123
Completely defined - meso         +         56            89
Completely defined - absolute     +       3735          4823
Completely defined - relative             2032          1669          363
Total not OK                              3984          2900         1084
Total OK                                  7516          8676
Total                                    11500         11576
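
The totals in Table 2 and their relation to Table 1 can be verified in the same way. The sketch below copies the counts from Table 2; the dictionary layout and the helper `total` are illustrative assumptions.

```python
# Counts copied from Table 2 as (stereo_ok, in_kegg, in_biometa).
# The layout and names are illustrative, not taken from BioMeta.
table2 = {
    "Not possible": (True, 3725, 3764),
    "Undefined (i.e., left out)": (False, 554, 366),
    "Incompletely defined - meso": (False, 24, 3),
    "Incompletely defined - absolute": (False, 1080, 691),
    "Incompletely defined - relative": (False, 294, 171),
    "Completely defined - meso": (True, 56, 89),
    "Completely defined - absolute": (True, 3735, 4823),
    "Completely defined - relative": (False, 2032, 1669),
}
KEGG, BIOMETA = 1, 2  # column indices in the tuples above


def total(ok, column):
    """Sum one column over the rows whose OK flag equals `ok`."""
    return sum(row[column] for row in table2.values() if row[0] == ok)


kegg_total = total(True, KEGG) + total(False, KEGG)
biometa_total = total(True, BIOMETA) + total(False, BIOMETA)

# BioMeta total: shared entries minus those without a structure in KEGG.
print(biometa_total == 12815 - 1239)

# KEGG lists fewer entries: compounds with valence errors are not analysed.
print(biometa_total - kegg_total)

# "Unspecified enantiomer" in Table 1 is the sum of the two "relative" rows.
unspecified_kegg = (table2["Incompletely defined - relative"][KEGG]
                    + table2["Completely defined - relative"][KEGG])
print(unspecified_kegg)
```

The difference between the two totals is 76, the number of valence violations in Table 1, and the two "relative" rows sum to the 2326 "unspecified enantiomer" cases listed there for KEGG.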