Validation program for MDL molfiles/sdfiles

Available for the Intel/Linux platform.

To obtain the software, download the file files.tar.gz; no other download is needed.

Auxiliary files

The files structin_elements.dat and cvtstr_pref.dat should reside in the directory from which cvtstr.exe is called; otherwise, the environment variables STRUCTIN_ELEMENTS and/or CVTSTR_PREF must be set before calling cvtstr.exe. structin_elements.dat contains element and isotope data from the periodic table, and cvtstr_pref.dat contains the preferences.
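When the two data files live somewhere other than the calling directory, the environment can be prepared from a script. The following Python sketch is only an illustration: it assumes that each variable holds the full path to the corresponding file (the text above does not say whether a file path or a directory is expected), and the helper name cvtstr_environment and the directory /opt/cvtstr are hypothetical.

```python
import os
from pathlib import Path


def cvtstr_environment(data_dir, base_env=None):
    """Return a copy of the environment with both cvtstr variables set.

    Assumption: each variable points at the data file itself, not at
    its directory. Adjust if the program expects something else.
    """
    env = dict(os.environ if base_env is None else base_env)
    data_dir = Path(data_dir)
    env["STRUCTIN_ELEMENTS"] = str(data_dir / "structin_elements.dat")
    env["CVTSTR_PREF"] = str(data_dir / "cvtstr_pref.dat")
    return env


# Hypothetical install location, used only for illustration.
env = cvtstr_environment("/opt/cvtstr", base_env={})
for name in ("STRUCTIN_ELEMENTS", "CVTSTR_PREF"):
    print(name, "=", env[name])
```

The returned dictionary can be passed as the env argument of subprocess.run when starting cvtstr.exe.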

The program

The executable cvtstr.exe can perform three tasks: file conversion, structure analysis, and plotting (PostScript file generation). For structure validation, the analysis mode should be used. The program has a simple command-line interface.

Sample input

Two sample input files (sample.sdf and tartrates.sdf) are provided, illustrating some of the complaints the validation program can produce. Run cvtstr.exe and answer its prompts with "A" (to request an analysis), "SDFILE" (to indicate an SD file), and "sample" or "tartrates" (the name of the input file, given here without the .sdf extension).
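For batch use, the three replies can be supplied from a script instead of being typed. The Python sketch below is a rough illustration under two assumptions that the text does not confirm: that cvtstr.exe reads its replies from standard input, one per line, and that "SDFILE" is the spelling of the SD-file keyword. The function names analysis_answers and run_analysis are hypothetical.

```python
import subprocess


def analysis_answers(basename, file_type="SDFILE"):
    """Build the replies for an analysis run, one reply per line.

    "A" requests an analysis, file_type names the input format
    (assumed spelling), and basename is the input file name.
    """
    return "\n".join(["A", file_type, basename]) + "\n"


def run_analysis(basename, exe="./cvtstr.exe", env=None):
    """Start cvtstr.exe and feed it the replies on standard input.

    Assumption: the program accepts piped input in place of typed
    replies. The call is not executed in this sketch.
    """
    return subprocess.run(
        [exe],
        input=analysis_answers(basename),
        text=True,
        capture_output=True,
        env=env,
        check=False,
    )


# Show the replies that would be sent for the sample file.
print(analysis_answers("sample"), end="")
```

Calling run_analysis("tartrates") would then process the second sample file in the same way, provided the stdin assumption holds.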

Output

The program produces two output files: for sample.sdf, these are sample_ana.sdf and sample_ana.txt. The SD file contains canonicalized and standardized structural representations of the input molecules, except for those with errors that prevented further analysis (e.g., too many atoms/bonds, valence violations). The text file contains the analysis results. A number of fields are calculated for each input molecule. For example, the fifth entry in sample.sdf (note that it is the fourth entry in the output!) yields:

##        4 Chiral  Absolute   1   1   0   0 354B626A
IK C03066
NM 3-Hydroxy-L-glutamate
TY 1
ST a
SD 1
SU 1
BD 0
BU 0
RC 0
US a,C\,C|,C,N,C,O,O,O,C,O,O,1-2,1-3,1-4,2-5,2-6,3=7,3-8,5-9,9=10,9-11
UC n,C,C,C,N,C,O,O,O,C,O,O,1-2,1-3,1-4,2-5,2-6,3=7,3-8,5-9,9=10,9-11
MF C5H9NO5
MW 163.129
EM 163.0481
MZ 100% 163.0481

Legend:
ID BioMeta Compound ID
IK KEGG Compound ID
NM compound name
TY structure type (3 = polymeric, 2 = generic, 1 = normal)
ST stereochemistry (a = absolute, m = meso, r = relative, x = racemic, n = none)
SD number of defined sp3 stereocenters
SU number of undefined sp3 stereocenters
BD number of defined stereo double bonds
BU number of undefined stereo double bonds
RC number of rings
US unique (canonical) string including stereochemistry
UC unique (canonical) string excluding stereochemistry
MF molecular formula
MW molecular weight
EM exact mass
MZ M/Z peak with abundance (currently only 100% and exact mass)
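
The analysis text file can be read back with a few lines of code. The Python sketch below is a minimal illustration, not part of the distributed software: it assumes that every record starts with a "##" line and that each following line is a two-letter tag, a space, and a value, as in the sample above. It also recomputes MW from MF using standard abridged atomic weights (an assumption; the values actually used by the program come from structin_elements.dat). The names parse_records and formula_weight are hypothetical.

```python
import re

# Standard abridged atomic weights; only the elements needed here.
ATOMIC_WEIGHTS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}

# Abridged copy of the sample record shown above.
SAMPLE = """\
##        4 Chiral  Absolute   1   1   0   0 354B626A
IK C03066
NM 3-Hydroxy-L-glutamate
ST a
MF C5H9NO5
MW 163.129
"""


def parse_records(text):
    """Split analysis output into one dict per '##' record."""
    records = []
    current = None
    for line in text.splitlines():
        if line.startswith("##"):
            current = {"header": line[2:].split()}
            records.append(current)
            continue
        tag, _, value = line.partition(" ")
        # Keep only lines that look like a two-letter upper-case tag.
        if current is not None and len(tag) == 2 and tag.isalpha() and tag.isupper():
            current[tag] = value.strip()
    return records


def formula_weight(formula):
    """Sum atomic weights for a simple formula such as C5H9NO5."""
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_WEIGHTS[symbol] * int(count or 1)
    return total


record = parse_records(SAMPLE)[0]
print(record["NM"], record["MF"], record["MW"])
print(round(formula_weight(record["MF"]), 3))
```

For the sample record, the recomputed weight rounds to the same value as the MW field, which gives a quick consistency check on parsed output.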