Current Spark databases

The databases currently released for Spark are listed below. Cresset release databases derived from commercially available screening compounds (the ZINC [1] drug-like set), from literature reports (ChEMBL), from theoretical rings (VEHICLe), and from commercial reagents (eMolecules). The larger databases are split according to how frequently their fragments occur.

Fragments from screening compounds

In general, fragments from the Very Common or Common databases are more likely to be readily synthesizable, as they appear in many different commercially available molecules. Fragments from the Rare, Very Rare and Singleton databases are more likely to be non-drug-like or hard to make.

Note that these databases have been filtered to remove potentially toxic or reactive fragments (such as alkyl halides or nitroso functionalities). In addition, all phosphorus-containing fragments have been removed as the calculation of fields on phosphorus-containing functional groups is still under development.
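The frequency-based split described above can be sketched as follows. This is a minimal illustration only: the threshold values and the toy fragment labels are hypothetical, as the actual cut-offs used to build the Spark databases are not stated here.

```python
from collections import Counter

# Hypothetical thresholds: the minimum number of source molecules a fragment
# must appear in to land in each tier. The real cut-offs are not given here.
TIERS = [
    ("Very Common", 50),
    ("Common", 20),
    ("Less Common", 10),
    ("Rare", 5),
    ("Very Rare", 2),
    ("Singleton", 1),
]


def tier_for(count):
    """Return the first tier whose minimum count is satisfied."""
    for name, minimum in TIERS:
        if count >= minimum:
            return name
    raise ValueError("count must be at least 1")


def split_by_frequency(molecule_fragments):
    """Group fragments into tiers by how many molecules contain them.

    molecule_fragments: iterable of fragment collections, one per molecule.
    """
    counts = Counter()
    for fragments in molecule_fragments:
        counts.update(set(fragments))  # count each fragment once per molecule
    tiers = {name: [] for name, _ in TIERS}
    for fragment, count in counts.items():
        tiers[tier_for(count)].append(fragment)
    return tiers


# Toy input: plain labels stand in for real fragment structures.
molecules = [["phenyl", "amide"], ["phenyl", "amide"], ["phenyl", "oxetane"]]
tiers = split_by_frequency(molecules)
```

Counting each fragment once per molecule means the tiers reflect how many different molecules contain a fragment, rather than how many times it occurs in total.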

Fragments from ChEMBL

The current Spark databases from ChEMBL are based on release 21 of ChEMBL and again have been split based on the frequency of occurrence of the fragments.

Reagents

Spark reagent databases are derived from eMolecules building blocks using Cresset’s reagent importer, which converts a file of usable reagents into the corresponding R-groups. For example, to create the eMolecules_acid database, all the eMolecules building blocks containing a C(=O)OH or C(=O)Cl group were processed and the resulting R-groups added to the database. Using databases derived from available reagents ensures that the results of your Spark experiment are tethered to molecules which are readily synthetically accessible.
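To make the idea of turning a reagent into an R-group concrete, here is a deliberately naive sketch for the "delete the -COOH" case. It is not Cresset's reagent importer: the function name, the string-suffix test and the use of `[*]` as the attachment point are all assumptions made for illustration. A real importer would use proper substructure matching from a cheminformatics toolkit instead of string handling.

```python
# Toy assumption: the acid or acid chloride group is written at the very end
# of the SMILES string. Reagents written any other way are skipped.
ACID_SUFFIXES = ("C(=O)O", "C(=O)Cl")


def acid_to_r_group(smiles):
    """Return the R-group SMILES with '[*]' as the attachment point, or None.

    Only handles SMILES that end with one of ACID_SUFFIXES; anything else is
    skipped rather than guessed at.
    """
    for suffix in ACID_SUFFIXES:
        if smiles.endswith(suffix):
            remainder = smiles[: -len(suffix)]
            # Nothing left means there is no R-group to keep (e.g. formic acid).
            return remainder + "[*]" if remainder else None
    return None


reagents = ["CC(C)C(=O)O", "c1ccccc1C(=O)Cl", "CC(=O)OC"]
r_groups = [acid_to_r_group(s) for s in reagents]
```

In this toy run the isobutyric acid and benzoyl chloride inputs each yield an R-group, while the ester (methyl acetate) is rejected because it does not end in a free acid or acid chloride group.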

Analysis of fragment databases

Database overlaps (number of fragments present in both databases)

ChEMBL21 database VeryCommon Common LessCommon Rare VeryRare Singleton
ChEMBL_common 40,937 40,786 42,650 26,092 17,803 14,878
ChEMBL_rare 7,445 23,628 44,437 46,357 45,123 33,921
ChEMBL_veryrare 2,652 10,881 26,099 34,483 42,126 69,648
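The raw overlap counts are easier to interpret as a fraction of each database. The short sketch below does this for the ChEMBL_common row, assuming that the column headings refer to the screening-compound databases and taking the totals from the "Fragment and connection point counts" table in this document. The variable names are illustrative only.

```python
# Fragments shared between ChEMBL_common and each screening-compound database
# (ChEMBL_common row of the overlap table).
overlap_with_chembl_common = {
    "Very Common": 40_937,
    "Common": 40_786,
    "Less Common": 42_650,
    "Rare": 26_092,
    "Very Rare": 17_803,
    "Singleton": 14_878,
}

# Total fragments in each screening-compound database
# ("Total fragments" column of the fragment count table).
database_totals = {
    "Very Common": 58_853,
    "Common": 127_199,
    "Less Common": 283_358,
    "Rare": 445_559,
    "Very Rare": 742_983,
    "Singleton": 1_459_827,
}

# Fraction of each database that is also present in ChEMBL_common.
shared_fraction = {
    name: overlap_with_chembl_common[name] / database_totals[name]
    for name in database_totals
}

for name, fraction in shared_fraction.items():
    print(f"{name:12s} {fraction:6.1%}")
```

On these figures roughly 70% of the Very Common fragments also occur in ChEMBL_common, falling to about 1% for the Singleton database.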

Fragment and connection point counts

Database Total fragments Fragments with 1 connection Fragments with 2 connections Fragments with 3 connections Fragments with 4 connections Rings only
Very Common 58,853 22,176 24,894 9,906 1,877 1,572
Common 127,199 43,681 52,508 24,726 6,284 1,883
Less Common 283,358 83,759 115,525 64,280 19,794 3,189
Rare 445,559 133,219 172,209 103,044 37,087 5,027
Very Rare 742,983 253,349 273,555 158,373 57,706 7,311
Singleton 1,459,827 601,833 506,664 259,657 91,673 16,606
ChEMBL_common 266,880 51,521 101,156 78,713 35,490 6,176
ChEMBL_rare 438,040 86,464 156,128 129,154 66,294 6,456
ChEMBL_veryrare 598,316 151,083 213,242 155,983 78,008 9,013
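As a quick sanity check on the table above, the following sketch parses the rows as plain text and confirms that the four connection-count columns add up to the total for every database. The parsing approach is just one way to handle database names that contain spaces. Treating "Rings only" as a separate count that does not add to the total is an inference from the figures, not something the table states.

```python
ROWS = """\
Very Common 58,853 22,176 24,894 9,906 1,877 1,572
Common 127,199 43,681 52,508 24,726 6,284 1,883
Less Common 283,358 83,759 115,525 64,280 19,794 3,189
Rare 445,559 133,219 172,209 103,044 37,087 5,027
Very Rare 742,983 253,349 273,555 158,373 57,706 7,311
Singleton 1,459,827 601,833 506,664 259,657 91,673 16,606
ChEMBL_common 266,880 51,521 101,156 78,713 35,490 6,176
ChEMBL_rare 438,040 86,464 156,128 129,154 66,294 6,456
ChEMBL_veryrare 598,316 151,083 213,242 155,983 78,008 9,013
""".splitlines()


def parse_row(line, n_values=6):
    """Split a row into (database name, list of integer columns)."""
    # Database names may contain spaces, so peel the numeric columns off the right.
    name, *values = line.rsplit(None, n_values)
    return name, [int(v.replace(",", "")) for v in values]


table = dict(parse_row(line) for line in ROWS)

# For each database, check that the 1-4 connection columns sum to the total.
consistent = {
    name: c1 + c2 + c3 + c4 == total
    for name, (total, c1, c2, c3, c4, rings_only) in table.items()
}

# Share of each database made up of single-connection fragments (R-groups).
single_connection_share = {
    name: cols[1] / cols[0] for name, cols in table.items()
}
```

Every row passes the check, so each fragment is counted under exactly one of the four connection-count columns.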

Number of fragments within specified molecular weight range

Molecular weight   1-50 51-100 101-150 151-200 201-250
Very Common 250 7,673 33,570 16,990 370
Common 84 5,978 55,001 61,652 4,484
Less Common 90 8,527 87,277 162,119 25,345
Rare 66 11,217 114,156 258,945 61,175
Very Rare 56 15,599 173,660 428,438 125,230
Singleton 68 24,617 310,894 844,468 279,780
ChEMBL_common 384 18,501 102,211 121,303 24,481
ChEMBL_rare 86 15,012 128,082 228,121 66,739
ChEMBL_veryrare 53 14,346 150,012 320,510 113,395

Atom count distribution

Number of heavy atoms (NH) 1-2 3-4 5-6 7-8 9-10 11-12 13-14 15-16 17-18
Very Common 81 733 4,096 13,430 21,543 16,040 2,864 66 0
Common 8 357 3,070 15,116 38,682 47,872 20,648 1,438 8
Less Common 8 560 4,427 20,939 62,953 102,885 79,900 11,497 189
Rare 11 526 5,773 26,803 82,368 149,329 149,665 30,251 833
Very Rare 1 543 7,576 41,308 126,849 242,429 256,403 65,255 2,619
Singleton 5 646 11,492 71,505 232,477 469,778 516,075 150,186 7,663
ChEMBL_common 92 1,508 9,942 33,628 67,219 81,675 56,197 15,217 1,402
ChEMBL_rare 11 639 7,607 35,437 89,336 135,034 122,689 42,093 5,194
ChEMBL_veryrare 7 499 6,996 39,181 107,134 178,827 182,956 72,922 9,794

Number of fragments with specified number of rotatable bonds

Num Rot Bonds 0 1 2 3 4 5
Very Common 5,778 15,745 19,442 13,374 3,714 800
Common 9,782 31,101 45,325 32,267 7,378 1,346
Less Common 20,090 78,583 108,619 62,076 12,060 1,930
Rare 37,347 137,290 173,644 80,486 14,625 2,167
Very Rare 67,757 242,805 283,650 124,533 21,166 3,072
Singleton 138,561 491,429 559,268 227,716 38,014 4,839
ChEMBL_common 27,864 79,816 92,992 49,725 13,411 3,072
ChEMBL_rare 46,004 134,829 155,765 77,919 19,489 4,034
ChEMBL_veryrare 67,533 192,490 211,541 98,267 23,836 4,649

Number of fragments with specified number of conformations

Number of conformations 1-5 6-10 11-15 16-20 21-25 26-30
Very Common 34,581 9,405 4,800 2,958 1,865 5,242
Common 69,116 21,587 11,546 7,147 4,664 13,129
Less Common 159,438 47,452 25,495 15,064 9,355 26,521
Rare 260,847 74,542 38,476 22,059 13,458 36,069
Very Rare 440,776 125,504 62,270 36,149 21,745 56,194
Singleton 865,867 248,748 122,812 70,060 42,504 108,894
ChEMBL_common 158,404 38,633 17,721 12,035 7,275 17,535
ChEMBL_rare 246,360 67,776 30,238 21,426 12,390 27,068
ChEMBL_veryrare 336,517 92,088 41,957 29,468 17,168 35,764

Analysis of reagent databases

Number of fragments within specified molecular weight range

The figures below are approximate. The exact number of fragments may change over time as the reagent databases are updated on a monthly basis.

Make sure you keep your reagent databases up to date by following the instructions at installing Spark databases.

Database Description Total MW 1-50 MW 51-100 MW 101-150 MW 151-200 MW 201-250
eMolecules_acidCO Acids, keep the CO 30000 3 350 6000 16700 7000
eMolecules_acid Acids, delete the -COOH 69500 45 2400 17000 35000 15000
eMolecules_alcohol Aliphatic alcohols, delete the O 24300 7 1200 7300 11900 3900
eMolecules_alcoholO Alcohols, keep the O 28900 3 370 6700 15000 6900
eMolecules_aliphatic_halide Aliphatic halide 17600 10 690 3800 9100 4000
eMolecules_alkyne Alkynes, delete the -C#C 2500 15 360 1300 700 100
eMolecules_aromatic_alcoholO Aromatic alcohols, keep the O 17800 0 45 2600 9900 5300
eMolecules_aromatic_aminesN Aromatic amines, keep the N 25100 0 120 4700 15000 5300
eMolecules_aromatic_halide Aromatic halide 66500 7 440 15000 40000 11000
eMolecules_boronic Aromatic boronic acids, delete -B(OH)2 9600 0 140 2500 4500 2500
eMolecules_cyano Cyano groups, delete -CN 49800 15 1000 8800 26000 14000
eMolecules_isocyanateCO Isocyanates, keep -NCO 1650 0 15 370 970 290
eMolecules_olefin Olefins, delete the -C=C 4400 10 420 1800 1700 450
eMolecules_primary_aliphatic_amine Primary aliphatic amines, delete the N 24500 6 1100 8000 12000 3400
eMolecules_primary_aliphatic_amineN Primary aliphatic amines, keep the N 11600 0 330 4400 5700 1200
eMolecules_primary_aliphatic_halide Primary aliphatic halide 15300 10 500 3200 7900 3700
eMolecules_primary_aromatic_amines Primary aromatic amines, delete N 39300 0 330 8000 21000 10000
eMolecules_secondary_aliphatic_amineN Secondary aliphatic amines, keep the N 14800 1 230 3600 8400 2600
eMolecules_sulfonicacid Sulfonic acids, delete the -SO2X 20600 30 640 4900 11000 4000
eMolecules_sulfonicacidSO2 Sulfonic acids, keep the -SO2 6100 0 10 300 2400 3400
eMolecules_thiol Aliphatic thiols, delete S 515 5 130 270 100 10
eMolecules_thiolS Thiols, keep the S 5300 0 30 570 2400 2300

References

[1] Irwin, Sterling, Mysinger, Bolstad and Coleman, J. Chem. Inf. Model. 2012.