Currently released Spark™ databases

The currently released databases for Spark are derived from commercially available screening compounds (eMolecules screening compounds [1]), literature reports (ChEMBL [2]), commercial reagents (eMolecules building blocks [3]), small molecule crystal structures (Crystallography Open Database [4] and Cambridge Structural Database [5]), and theoretical ring systems (VEHICLe [6]).

Update to the latest version of all databases by following the instructions at Installing Spark databases.

Fragments from screening compounds

The Commercial Spark databases are based on eMolecules screening compounds and are split based on the frequency of occurrence of the fragments.

  • VeryCommon (477 MB) – fragments which appear in more than 725 molecules
  • Common (961 MB) – fragments which appear in 215-724 molecules
  • LessCommon (1.95 GB) – fragments which appear in 65-214 molecules
  • Rare (2.68 GB) – fragments which appear in 25-64 molecules
  • VeryRare (5.3 GB) – fragments which appear in 9-24 molecules
  • ExtremelyRare (5.8 GB) – fragments which appear in 5-8 molecules
  • UltraRare (8.1 GB) – fragments which appear in 3-4 molecules

In general, fragments from the VeryCommon or Common databases are more likely to be readily synthesizable, as they appear in many different commercially available molecules. Fragments from the VeryRare, ExtremelyRare and UltraRare databases are more likely to be non-drug-like or hard to make. These databases have been filtered to remove potentially toxic or reactive fragments (such as alkyl halides or nitroso functionalities). In addition, all phosphorus-containing fragments have been removed, as the calculation of fields on phosphorus-containing functional groups is still under development. See below for a detailed analysis of these databases.

Two optional databases are also available:

  • Doubleton (2 files, each of 5.7 GB) – fragments which appear in 2 molecules
  • Singleton (3 files, each of 9.3 GB) – fragments which appear in a single molecule

We typically recommend installing only the databases whose fragments appear in at least 3 molecules of the original collections (UltraRare and more frequent). The databases containing less frequent fragments are very large and may contain fragments derived from unrealistic or incorrect structures in the original collections. Contact support if you wish to download these optional databases.
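The frequency bins listed above amount to a lookup from occurrence count to database name. The Python sketch below is illustrative only and is not part of Spark; the function name commercial_database_for is hypothetical. The listed ranges leave a count of exactly 725 unassigned, so the sketch places it in Common, which is an assumption.

```python
# Lower bound of each occurrence bin, highest first, as listed on this page.
# Assumption: "more than 725" is read as 726 or more, so 725 falls in Common.
COMMERCIAL_BINS = [
    (726, "VeryCommon"),
    (215, "Common"),
    (65, "LessCommon"),
    (25, "Rare"),
    (9, "VeryRare"),
    (5, "ExtremelyRare"),
    (3, "UltraRare"),
    (2, "Doubleton"),   # optional database
    (1, "Singleton"),   # optional database
]


def commercial_database_for(occurrences: int) -> str:
    """Return the commercial database whose bin contains this occurrence count."""
    if occurrences < 1:
        raise ValueError("a fragment must appear in at least one molecule")
    for lower_bound, name in COMMERCIAL_BINS:
        if occurrences >= lower_bound:
            return name
    raise AssertionError("unreachable: the last bin starts at 1")


print(commercial_database_for(300))
print(commercial_database_for(4))
```

A fragment seen in 300 molecules maps to Common, and one seen in 4 molecules maps to UltraRare.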

Fragments from ChEMBL

The current ChEMBL Spark databases are based on Release 26 of ChEMBL and are split based on the frequency of occurrence of the fragments.

  • ChEMBL_common (1.9 GB) – fragments which appear in more than 12 molecules
  • ChEMBL_rare (2.5 GB) – fragments which appear in 4-12 molecules
  • ChEMBL_veryrare (3.3 GB) – fragments which appear in 2-3 molecules

An optional database is also available (contact Cresset support to download this database):

  • ChEMBL_extremelyrare (5.3 GB) – fragments which appear in a single molecule
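To plan disk space before installing, the file sizes quoted on this page can be added up. The sketch below is a rough estimate only: it assumes 1 GB = 1000 MB, covers only the commercial and ChEMBL fragment databases (no sizes are given here for the reagent, COD, CSD or VEHICLe databases), and the names RECOMMENDED, OPTIONAL and total_mb are hypothetical.

```python
# Each entry is (size of one file in MB, number of files), copied from this page.
# Assumption: 1 GB is taken as 1000 MB.
RECOMMENDED = {
    "VeryCommon": (477, 1),
    "Common": (961, 1),
    "LessCommon": (1950, 1),
    "Rare": (2680, 1),
    "VeryRare": (5300, 1),
    "ExtremelyRare": (5800, 1),
    "UltraRare": (8100, 1),
    "ChEMBL_common": (1900, 1),
    "ChEMBL_rare": (2500, 1),
    "ChEMBL_veryrare": (3300, 1),
}

OPTIONAL = {
    "Doubleton": (5700, 2),
    "Singleton": (9300, 3),
    "ChEMBL_extremelyrare": (5300, 1),
}


def total_mb(databases: dict) -> int:
    """Sum the download size in MB over all files of the given databases."""
    return sum(size_mb * files for size_mb, files in databases.values())


print(f"Recommended databases: {total_mb(RECOMMENDED) / 1000:.1f} GB")
print(f"Optional databases:    {total_mb(OPTIONAL) / 1000:.1f} GB")
```

Under these assumptions the optional databases need more space than all the recommended fragment databases combined.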

Reagents

Spark Reagent Databases are derived from eMolecules building blocks using the Cresset reagent importer, which converts a file of usable reagents into the corresponding R-groups. For example, to create the eMolecules_acid database, all the eMolecules building blocks containing a C(=O)OH or C(=O)Cl group were processed to add the R-group to the database.
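The idea of turning a reagent into an R-group can be illustrated with a deliberately naive string-based toy. This is not the Cresset reagent importer and does not reflect how it works internally: it assumes each reagent is given as a SMILES string written so that the acid or acid chloride group is the final token, and it marks the attachment point with the SMILES wildcard atom [*]. The function name acid_r_groups is hypothetical. A real implementation would use substructure matching in a cheminformatics toolkit.

```python
ATTACHMENT = "[*]"  # SMILES wildcard atom, used here to mark the attachment point

# Terminal groups named in the text: carboxylic acid and acid chloride.
ACID_SUFFIXES = ("C(=O)O", "C(=O)Cl")


def acid_r_groups(smiles: str):
    """Return (R-group with -COOH deleted, R-group keeping the -CO), or None.

    Toy assumption: the acid group is the last token of the SMILES string.
    """
    for suffix in ACID_SUFFIXES:
        if smiles.endswith(suffix):
            stem = smiles[: -len(suffix)]
            if not stem:
                return None  # nothing left to act as an R-group
            return stem + ATTACHMENT, stem + "C(=O)" + ATTACHMENT
    return None


print(acid_r_groups("CCC(=O)O"))
print(acid_r_groups("c1ccccc1C(=O)Cl"))
print(acid_r_groups("CCO"))
```

The two returned forms correspond to the two acid-derived databases described in the reagent analysis on this page: one deletes the whole -COOH group, the other keeps the -CO.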

Using databases derived from available reagents ensures that the results of your Spark experiment are tethered to molecules that are readily synthetically accessible. Monthly updates for these databases provide reliable availability information on the reagents that you wish to employ.

The current list of Spark Reagent Databases includes 23 common chemical transformations. See below for a detailed analysis of these databases.

Fragments from small molecule crystal structures

These databases contain fragments in their crystallographic conformation, derived from small molecule crystal structures.  

The Spark COD database contains fragments from the Crystallography Open Database. This database is available for download to all Spark customers.

The Spark CSD Fragment Database is derived from the Cambridge Structural Database (CSD). A valid CSD-System license is required for use of this database. If you do not already have a license, please contact CCDC for assistance.

Theoretical rings

This database contains a collection of theoretical ring systems derived from the VEHICLe database [6].

Create your own database

Spark fragment and reagent databases provide an excellent source of new bioisosteres. However, if you have access to significant proprietary chemistry or specialized reagents, or simply want to consider only fragments from reagents that you have in stock, then creating your own custom databases will add value to your Spark experiments.

Custom databases can be easily created using the Database Generator, a dedicated and user-friendly interface to custom database creation within Spark, or using the equivalent functionality from the command line.

Contact Cresset support if you need assistance with the Spark Database Generator.

Analysis of fragment databases

Database overlaps (number of fragments present in both databases)

ChEMBL database       VeryCommon  Common  LessCommon    Rare  VeryRare  ExtremelyRare  UltraRare  Doubleton  Singleton   Unique
ChEMBL_common             39,692  32,368      29,831  19,141    17,510         10,050      8,981      7,894     11,728   54,656
ChEMBL_rare                5,930  16,342      25,920  24,584    28,361         16,307     14,213     12,012     18,410  123,775
ChEMBL_veryrare            2,672   8,621      16,564  18,898    27,425         19,827     19,528     20,442     28,251  220,070
ChEMBL_extremelyrare       1,687   6,915      14,976  18,655    29,899         23,567     26,144     24,975     58,362  435,774
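The overlap figures can be cross-checked against the total fragment counts reported in the fragment and connection point table on this page: for each ChEMBL database, the nine overlap columns plus Unique add up exactly to the database total. This suggests that Unique counts fragments not found in any commercial database, although the page does not state this explicitly, so treat that reading as an inference. The Python sketch below performs the check and reports the Unique share; the names OVERLAPS, TOTAL_FRAGS and unique_percentage are hypothetical.

```python
# Columns of the overlap table: the nine commercial databases, then Unique.
COLUMNS = ["VeryCommon", "Common", "LessCommon", "Rare", "VeryRare",
           "ExtremelyRare", "UltraRare", "Doubleton", "Singleton", "Unique"]

# Rows copied from the overlap table on this page.
OVERLAPS = {
    "ChEMBL_common": [39692, 32368, 29831, 19141, 17510, 10050, 8981, 7894, 11728, 54656],
    "ChEMBL_rare": [5930, 16342, 25920, 24584, 28361, 16307, 14213, 12012, 18410, 123775],
    "ChEMBL_veryrare": [2672, 8621, 16564, 18898, 27425, 19827, 19528, 20442, 28251, 220070],
    "ChEMBL_extremelyrare": [1687, 6915, 14976, 18655, 29899, 23567, 26144, 24975, 58362, 435774],
}

# Total fragment counts copied from the fragment count table on this page.
TOTAL_FRAGS = {
    "ChEMBL_common": 231851,
    "ChEMBL_rare": 285854,
    "ChEMBL_veryrare": 382298,
    "ChEMBL_extremelyrare": 640954,
}


def unique_percentage(database: str) -> float:
    """Percentage of a ChEMBL database's fragments listed in the Unique column."""
    row = dict(zip(COLUMNS, OVERLAPS[database]))
    return 100.0 * row["Unique"] / TOTAL_FRAGS[database]


for name, counts in OVERLAPS.items():
    consistent = sum(counts) == TOTAL_FRAGS[name]
    print(f"{name}: row sum matches total = {consistent}, "
          f"unique = {unique_percentage(name):.1f}%")
```

The Unique share grows steadily from ChEMBL_common to ChEMBL_extremelyrare, as expected for fragments that occur less often in the literature.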

Fragment and connection point counts

Database              Total frags  1 connection  2 connections  3 connections  4 connections  Rings only
VeryCommon                 67,888        20,525         28,979         14,872          3,512       2,129
Common                    112,949        33,362         47,336         25,704          6,547       1,712
LessCommon                211,480        55,409         90,338         50,652         15,081       2,214
Rare                      279,658        70,058        117,127         69,984         22,489       2,286
VeryRare                  526,840       138,999        211,859        130,123         45,859       3,458
ExtremelyRare             534,444       153,583        203,897        128,718         48,246       3,307
UltraRare                 769,744       242,565        283,665        174,584         68,930       4,136
Doubleton               1,053,200       348,677        372,553        233,822         98,148       4,552
Singleton               2,525,655       990,026        863,227        476,933        195,469      13,822
ChEMBL_common             231,851        43,204         88,710         69,205         30,732       5,909
ChEMBL_rare               285,854        56,125        103,480         84,481         41,768       4,741
ChEMBL_veryrare           382,298        79,814        135,555        110,334         56,595       5,449
ChEMBL_extremelyrare      640,954       160,002        229,892        167,813         83,247       8,555
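The raw counts are easier to compare across databases as percentages. The sketch below uses three rows copied from the table above. Because the four connection columns alone add up to the total, Rings only appears to be a separate property rather than a fifth connection class; that reading is an inference from the numbers. The names CONNECTION_COUNTS and connection_shares are hypothetical.

```python
# (total fragments, [fragments with 1, 2, 3, 4 connection points]),
# copied from three rows of the table on this page.
CONNECTION_COUNTS = {
    "VeryCommon": (67888, [20525, 28979, 14872, 3512]),
    "Singleton": (2525655, [990026, 863227, 476933, 195469]),
    "ChEMBL_common": (231851, [43204, 88710, 69205, 30732]),
}


def connection_shares(database: str) -> list:
    """Percentage of fragments with 1, 2, 3 and 4 connection points."""
    total, counts = CONNECTION_COUNTS[database]
    return [100.0 * count / total for count in counts]


for name in CONNECTION_COUNTS:
    shares = connection_shares(name)
    formatted = ", ".join(f"{n} conn: {share:.1f}%" for n, share in enumerate(shares, start=1))
    print(f"{name}: {formatted}")
```

In these rows, two-connection fragments are the largest class in VeryCommon and ChEMBL_common, while one-connection fragments are the largest class in Singleton.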

Number of fragments within specified molecular weight range

MW                     1-50  51-100  101-150    151-200  201-250
VeryCommon              278   9,640   34,547     21,376    2,047
Common                   78   6,504   49,778     49,163    7,426
LessCommon               71   8,164   79,583    104,379   19,283
Rare                     62   7,940   90,885    148,243   32,528
VeryRare                 89  11,525  147,883    300,310   67,033
ExtremelyRare            40   9,562  132,797    316,064   75,981
UltraRare                53  11,225  175,373    460,012  123,081
Doubleton                46  12,771  220,404    641,188  178,791
Singleton               100  27,089  504,001  1,546,404  448,061
ChEMBL_common           401  17,864   92,074    102,534   18,978
ChEMBL_rare              95  11,377   91,214    146,515   36,653
ChEMBL_veryrare          46  11,439  105,863    203,958   60,992
ChEMBL_extremelyrare     61  14,225  155,820    351,153  119,695

Atom count distribution

NH                    1-2    3-4     5-6     7-8     9-10    11-12    13-14    15-16  17-18
VeryCommon             91    823   5,005  15,716   21,928   17,265    6,458      575     27
Common                  7    355   3,217  15,463   35,939   37,023   18,343    2,505     97
LessCommon              4    370   4,008  20,771   59,820   77,274   42,080    6,881    272
Rare                    6    324   3,700  21,798   71,050  109,396   63,055    9,880    449
VeryRare                6    501   5,332  31,599  113,807  208,056  148,049   18,759    731
ExtremelyRare           3    340   4,313  26,647  102,055  205,647  171,247   23,360    832
UltraRare               6    381   4,915  33,299  139,587  297,631  250,142   42,541  1,232
Doubleton              10    339   5,455  40,232  174,597  403,392  357,547   69,323  2,305
Singleton              11    748  11,269  87,867  407,805  988,823  850,975  171,433  6,824
ChEMBL_common          95  1,494   9,379  30,952   59,453   69,688   46,958   12,600  1,232
ChEMBL_rare            10    524   5,517  25,292   62,313   88,749   76,203   24,575  2,671
ChEMBL_veryrare         4    407   5,425  27,631   73,403  116,229  112,918   41,023  5,258
ChEMBL_extremelyrare    7    509   6,663  37,541  109,724  192,523  203,053   80,939  9,995

Number of fragments with specified number of rotatable bonds

Num Rot Bonds               0        1          2        3       4      5
VeryCommon              7,672   21,257     23,070   12,806   2,556    527
Common                  8,109   30,861     44,005   24,089   4,985    900
LessCommon             11,515   51,357     86,080   50,841   9,929  1,758
Rare                   14,903   66,903    113,832   69,916  12,201  1,903
VeryRare               24,878  125,381    226,167  125,802  21,587  3,025
ExtremelyRare          25,555  134,461    233,722  119,379  18,901  2,426
UltraRare              36,607  200,252    336,853  167,705  25,273  3,054
Doubleton              54,939  285,567    459,339  218,635  31,324  3,486
Singleton             125,606  658,710  1,089,611  564,699  79,640  7,389
ChEMBL_common          24,864   70,511     80,043   42,156  11,603  2,674
ChEMBL_rare            28,614   88,319    102,614   50,604  13,028  2,675
ChEMBL_veryrare        40,356  120,742    136,867   65,392  15,790  3,151
ChEMBL_extremelyrare   67,529  209,632    231,504  104,026  23,818  4,445

Number of fragments with specified number of conformations

Number of conformations        1-5     6-10    11-15    16-20    21-25    26-30
VeryCommon                  44,252   10,555    4,911    2,645    1,572    3,953
Common                      66,886   18,687    9,566    5,354    3,483    8,973
LessCommon                 117,222   36,734   19,443   11,193    7,261   19,627
Rare                       153,445   48,487   25,923   15,198    9,610   26,995
VeryRare                   277,079   95,952   52,121   30,677   18,843   52,168
ExtremelyRare              272,328   98,675   55,479   32,889   20,505   54,568
UltraRare                  392,232  142,261   79,098   47,446   30,349   78,358
Doubleton                  519,712  204,130  111,851   66,985   42,200  108,322
Singleton                1,205,682  469,222  276,838  173,270  111,995  288,648
ChEMBL_common              148,624   33,534   15,814    9,370    6,089   18,420
ChEMBL_rare                176,979   45,018   20,980   12,352    7,664   22,861
ChEMBL_veryrare            235,841   61,924   28,423   16,609   10,230   29,271
ChEMBL_extremelyrare       406,523  100,086   46,551   25,959   16,332   45,503

Analysis of reagent databases

Number of fragments within specified molecular weight range

The figures below are approximate. The exact number of fragments may change over time, as the reagent databases are updated monthly. Keep your reagent databases up to date by following the instructions at Installing Spark databases.

Database                               Description                             Total  1-50  51-100  101-150  151-200  201-250
eMolecules_acidCO                      Acids, keep the -CO                     23944     3     401     6725    13333     3482
eMolecules_acid                        Acids, delete the -COOH                 41416    44    2804    15594    17979     5095
eMolecules_alcohol                     Aliphatic alcohols, delete the O        17929    11    1434     7495     7127     1862
eMolecule_alcoholO                     Aliphatic alcohols, keep the O          19589     3     467     6758     9646     2715
eMolecules_aliphatic_halide            Aliphatic halide                         8810    13     925     3650     3424      798
eMolecule_alkyne                       Alkynes, delete the -C#C                 2823    27     505     1412      767      112
eMolecules_aromatic_alcoholO           Aromatic alcohols, keep the O            8617     0      44     1924     5022     1627
eMolecules_aromatic_aminesN            Aromatic amines, keep the N             18459     0     110     4196    10495     3658
eMolecules_aromatic_halide             Aromatic halide                         39929     8     449    13568    22636     3268
eMolecules_boronic                     Aromatic boronic acids, delete -B(OH)2   4500     0     128     1898     2093      381
eMolecules_cyano                       Cyano groups, delete -CN                15084    20    1085     5653     6266     2060
eMolecules_isocyanateCO                Isocyanates, keep -NCO                    554     0      20      170      286       78
eMolecules_olefin                      Olefins, delete the -C=C                 3214    16     523     1408     1048      219
eMolecules_primary_aliphatic_amine     Primary aliphatic amines, delete the N  18952     6    1367     8475     7731     1373
eMolecules_primary_aliphatic_amineN    Primary aliphatic amines, keep the N    11547     0     398     5223     5092      834
eMolecules_primary_aliphatic_halide    Primary aliphatic halide                 6879    12     627     2886     2708      646
eMolecules_primary_aromatic_amines     Primary aromatic amines, delete the N   23241     0     323     6555    12115     4248
eMolecules_secondary_aliphatic_amineN  Secondary aliphatic amines, keep the N  14992     1     276     4253     8372     2090
eMolecules_sulphonicacid               Sulfonic acids, delete the SO2X          5068    31     602     2268     1759      408
eMolecules_sulphonicacidSO2            Sulfonic acids, keep the -SO2X           3075     0      13      302     1586     1174
eMolecules_thiol                       Aliphatic thiols, delete S                716     7     205      327      163       14
eMolecules_thiolS                      Thiols, keep S                           1972     1      38      533     1073      327

References

  1. https://www.emolecules.com/info/products-screening-compounds
  2. https://www.ebi.ac.uk/chembl/
  3. https://www.emolecules.com/info/products-building-blocks
  4. http://www.crystallography.net/cod/
  5. https://www.ccdc.cam.ac.uk/solutions/csd-system/components/csd/
  6. Pitt, W. R.; Parry, D. M.; Perry, B. G.; Groom, C. R. Heteroaromatic Rings of the Future. J. Med. Chem. 2009, 52 (9), 2952–2963 ftp://ftp.ebi.ac.uk/pub/databases/chembl/VEHICLe/
