
Currently released Spark™ databases

The currently released databases for Spark are derived from commercially available screening compounds (eMolecules screening compounds [1]), literature reports (ChEMBL [2]), patents (SureChEMBL [2]), commercial reagents (eMolecules building blocks [3]), small molecule crystal structures (Crystallography Open Database [4] and Cambridge Structural Database [5]), and theoretical ring systems (VEHICLe [6]).

Update to the latest version of all databases by following the instructions at Installing Spark databases.

Fragments from screening compounds

The Commercial Spark databases are based on eMolecules screening compounds and are split based on the frequency of occurrence of the fragments.

  • VeryCommon (477 MB) – fragments which appear in at least 725 molecules
  • Common (961 MB) – fragments which appear in 215-724 molecules
  • LessCommon (1.95 GB) – fragments which appear in 65-214 molecules
  • Rare (2.68 GB) – fragments which appear in 25-64 molecules
  • VeryRare (5.3 GB) – fragments which appear in 9-24 molecules
  • ExtremelyRare (5.8 GB) – fragments which appear in 5-8 molecules
  • UltraRare (8.1 GB) – fragments which appear in 3-4 molecules

In general, fragments from the VeryCommon or Common databases are more likely to be readily synthesizable, as they appear in many different commercially available molecules. Fragments from the VeryRare, ExtremelyRare and UltraRare databases are more likely to be non-drug-like or hard to make. These databases have been filtered to remove potentially toxic or reactive fragments (such as alkyl halides or nitroso functionalities). In addition, all phosphorus-containing fragments have been removed, as the calculation of fields on phosphorus-containing functional groups is still under development. See below for a detailed analysis of these databases.

Two optional databases are also available:

  • Doubleton (2 files, each of 5.7 GB) – fragments which appear in two molecules
  • Singleton (3 files, each of 9.3 GB) – fragments which appear in a single molecule

We typically recommend installing only the databases whose fragments appear at least three times in the original collections (UltraRare and more frequent). The databases containing less frequent fragments are very large and may contain fragments derived from unrealistic or incorrect structures in the original collections. Contact Cresset support if you wish to download these optional databases.
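The frequency split described above can be sketched as a simple lookup. This is an illustration only: the thresholds are copied from the lists above, while the function and constant names are hypothetical and are not part of Spark.

```python
# Minimum number of molecules a fragment must appear in for each database,
# highest threshold first. Values come from the lists above; the names
# COMMERCIAL_TIERS and commercial_tier are illustrative, not a Spark API.
COMMERCIAL_TIERS = [
    (725, "VeryCommon"),
    (215, "Common"),
    (65, "LessCommon"),
    (25, "Rare"),
    (9, "VeryRare"),
    (5, "ExtremelyRare"),
    (3, "UltraRare"),
    (2, "Doubleton"),   # optional database
    (1, "Singleton"),   # optional database
]


def commercial_tier(occurrences: int) -> str:
    """Return the database a fragment would fall into, given how many
    molecules it appears in."""
    if occurrences < 1:
        raise ValueError("a fragment must appear in at least one molecule")
    for minimum, name in COMMERCIAL_TIERS:
        if occurrences >= minimum:
            return name
    raise AssertionError("unreachable: the last tier accepts 1")


for n in (1000, 724, 64, 3, 2):
    print(n, commercial_tier(n))
```

Under this reading, the recommended set corresponds to every tier with a threshold of 3 or more.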

Fragments from ChEMBL

The current ChEMBL Spark databases are based on Release 30 of ChEMBL and are split based on the frequency of occurrence of the fragments.

  • ChEMBL_common (1.8 GB) – fragments which appear in more than 12 molecules
  • ChEMBL_rare (2.6 GB) – fragments which appear in 4-12 molecules
  • ChEMBL_veryrare (3.4 GB) – fragments which appear in 2-3 molecules

An optional database is also available (contact Cresset support to download this database):

  • ChEMBL_extremelyrare (6.6 GB) – fragments which appear once

Fragments from SureChEMBL

The SureChEMBL Spark databases are based on the complete SureChEMBL compound collection and are split based on the frequency of occurrence of the fragments.

  • SureChEMBL_verycommon (4.1 GB) – fragments which appear in at least 45 molecules
  • SureChEMBL_common (7.0 GB) – fragments which appear in 14-44 molecules
  • SureChEMBL_uncommon (5.0 GB) – fragments which appear in 8-13 molecules

Additional optional databases are also available (contact Cresset support to download these databases):

  • SureChEMBL_rare (8.8 GB) – fragments which appear in 5-7 molecules
  • SureChEMBL_veryrare (7.0 GB) – fragments which appear in 4 molecules
  • SureChEMBL_extremelyrare (9.2 GB) – fragments which appear in 3 molecules
  • SureChEMBL_doubleton (3 files, each of 8.2 GB) – fragments which appear twice
  • SureChEMBL_singleton (3 files, each of 24 GB) – fragments which appear once
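To plan disk space before requesting the optional databases, the sizes listed above can be added up. The sketch below uses only the figures from the two lists; the variable names are illustrative, and the totals are rounded because the listed sizes are themselves rounded.

```python
# Sizes in GB as listed above. Multi-file databases are entered as
# (number of files) * (size of each file).
STANDARD_GB = {
    "SureChEMBL_verycommon": 4.1,
    "SureChEMBL_common": 7.0,
    "SureChEMBL_uncommon": 5.0,
}
OPTIONAL_GB = {
    "SureChEMBL_rare": 8.8,
    "SureChEMBL_veryrare": 7.0,
    "SureChEMBL_extremelyrare": 9.2,
    "SureChEMBL_doubleton": 3 * 8.2,
    "SureChEMBL_singleton": 3 * 24.0,
}

standard_total = round(sum(STANDARD_GB.values()), 1)
optional_total = round(sum(OPTIONAL_GB.values()), 1)

print(f"Standard SureChEMBL databases: {standard_total} GB")
print(f"Optional SureChEMBL databases: {optional_total} GB")
```

On these figures the optional databases need several times the space of the standard three, most of it for the singleton files.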

Reagents

Spark Reagent Databases are derived from eMolecules building blocks using the Cresset reagent importer, which converts a file of usable reagents into the corresponding R-groups. For example, to create the eMolecules_acid database, all the eMolecules building blocks containing a C(=O)OH or C(=O)Cl group were processed to add the R-group to the database.

Using databases derived from available reagents ensures that the results of your Spark experiment are tethered to molecules that are readily synthetically accessible. Monthly updates for these databases provide reliable availability information on the reagents that you wish to employ.

The current list of Spark Reagent Databases includes 23 common chemical transformations. See below for a detailed analysis of these databases.

Fragments from small molecule crystal structures

These databases contain fragments in their crystallographic conformation, derived from small molecule crystal structures.  

The Spark COD database contains fragments from the Crystallography Open Database. This database is available for download to all Spark customers.

The Spark CSD Fragment Database is derived from the Cambridge Structural Database (CSD). A valid CSD-System license is required for use of this database. If you do not already have a license, please contact CCDC for assistance.

Theoretical rings

This database is a collection of theoretical ring systems derived from the VEHICLe database [6].

Create your own database

Spark fragment and reagent databases provide an excellent source of new bioisosteres. However, if you have access to significant proprietary chemistry or to specialized reagents, or simply want to consider only fragments from reagents that you have in stock, then creating your own custom databases will add value to your Spark experiments.

Custom databases can be easily created using the Database Generator, a dedicated and user-friendly interface to custom database creation within Spark, or using the equivalent functionality from the command line.

Contact Cresset support if you need assistance with the Spark Database Generator.

Analysis of fragment databases

Database overlaps (number of fragments present in both databases)

VeryCommon Common LessCommon Rare VeryRare ExtremelyRare UltraRare Doubleton Singleton Unique
ChEMBL_common 38,687 31,717 28,917 18,493 16,285 9,251 8,338 7,324 10,988 52,926
ChEMBL_rare 6,160 17,611 27,846 26,412 29,948 17,198 14,873 12,168 18,793 132,602
ChEMBL_veryrare 2,520 8,702 17,121 19,536 28,007 20,231 19,887 20,989 26,793 225,922
ChEMBL_extremelyrare 1,707 7,042 16,144 20,506 33,310 26,688 29,587 28,784 67,511 551,591

SureChEMBL_verycommon SureChEMBL_common SureChEMBL_uncommon SureChEMBL_rare SureChEMBL_veryrare SureChEMBL_extremelyrare SureChEMBL_doubleton SureChEMBL_singleton Unique
ChEMBL_common 146,547 31,002 7,536 5,865 2,682 2,828 3,780 4,674 18,012
ChEMBL_rare 84,101 64,144 22,539 19,914 9,116 8,349 12,050 14,985 68,413
ChEMBL_veryrare 49,522 64,713 29,188 31,969 16,604 17,930 27,779 25,255 126,748
ChEMBL_extremelyrare 43,094 77,947 43,087 55,009 29,521 39,150 59,864 103,934 331,264
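In the figures above, each ChEMBL row appears to add up to that database's total fragment count as listed in the "Fragment and connection point counts" table (overlaps plus the Unique column). This is an observation from the numbers rather than a documented guarantee. A minimal sketch of the check, with hypothetical helper names:

```python
from typing import Dict, List


def parse_counts(row: str) -> List[int]:
    """Turn a row such as '38,687 31,717' into [38687, 31717]."""
    return [int(token.replace(",", "")) for token in row.split()]


# Total fragments per database, from the fragment counts table.
TOTALS: Dict[str, int] = {"ChEMBL_common": 222_926, "ChEMBL_rare": 303_611}

# Overlap rows (last value is the Unique column), copied from the tables above.
VS_COMMERCIAL = {
    "ChEMBL_common": "38,687 31,717 28,917 18,493 16,285 9,251 8,338 7,324 10,988 52,926",
    "ChEMBL_rare": "6,160 17,611 27,846 26,412 29,948 17,198 14,873 12,168 18,793 132,602",
}
VS_SURECHEMBL = {
    "ChEMBL_common": "146,547 31,002 7,536 5,865 2,682 2,828 3,780 4,674 18,012",
    "ChEMBL_rare": "84,101 64,144 22,539 19,914 9,116 8,349 12,050 14,985 68,413",
}


def row_matches_total(name: str, row: str) -> bool:
    """True if overlaps + unique equals the database's total fragment count."""
    return sum(parse_counts(row)) == TOTALS[name]


for label, table in (("commercial", VS_COMMERCIAL), ("SureChEMBL", VS_SURECHEMBL)):
    for name, row in table.items():
        print(f"{name} vs {label}: consistent = {row_matches_total(name, row)}")
```

If the relationship holds, each fragment in a ChEMBL database is counted in exactly one column of a given row.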

Fragment and connection point counts

Database Total frags Frags with 1 connection Frags with 2 connections Frags with 3 connections Frags with 4 connections Rings only
VeryCommon 67,888 20,525 28,979 14,872 3,512 2,129
Common 112,949 33,362 47,336 25,704 6,547 1,712
LessCommon 211,480 55,409 90,338 50,652 15,081 2,214
Rare 279,658 70,058 117,127 69,984 22,489 2,286
VeryRare 526,840 138,999 211,859 130,123 45,859 3,458
ExtremelyRare 534,444 153,583 203,897 128,718 48,246 3,307
UltraRare 769,744 242,565 283,665 174,584 68,930 4,136
Doubleton 1,053,200 348,677 372,553 233,822 98,148 4,552
Singleton 2,525,655 990,026 863,227 476,933 195,469 13,822
ChEMBL_common 222,926 41,681 85,539 66,533 29,173 5,554
ChEMBL_rare 303,611 58,825 110,572 90,044 44,170 4,837
ChEMBL_veryrare 389,708 78,333 138,420 114,435 58,520 5,234
ChEMBL_extremelyrare 782,870 196,345 280,643 204,450 101,432 10,803
SureChEMBL_verycommon 509,358 86,420 187,700 158,257 76,981 12,976
SureChEMBL_common 794,523 143,479 280,986 241,500 128,558 13,201
SureChEMBL_uncommon 553,707 106,530 194,255 165,853 87,069 8,466
SureChEMBL_rare 956,521 196,465 335,158 277,534 147,364 13,816
SureChEMBL_veryrare 756,812 135,465 251,609 230,503 139,235 8,975
SureChEMBL_extremelyrare 979,007 224,962 339,225 272,282 142,538 14,079
SureChEMBL_doubleton 2,545,388 543,596 850,467 731,831 419,494 28,330
SureChEMBL_singleton 7,190,330 1,842,604 2,420,995 1,894,851 1,031,880 99,811
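The connection point columns can be turned into percentages to compare databases of very different sizes. The sketch below uses two rows from the table above; the function name is illustrative. In these rows the four connection columns add up to the total, so "Rings only" is treated as a separate subset and not as a fifth category.

```python
from typing import Dict, List, Tuple

# (total, 1 connection, 2 connections, 3 connections, 4 connections)
ROWS: Dict[str, Tuple[int, int, int, int, int]] = {
    "VeryCommon": (67_888, 20_525, 28_979, 14_872, 3_512),
    "ChEMBL_common": (222_926, 41_681, 85_539, 66_533, 29_173),
}


def connection_shares(row: Tuple[int, int, int, int, int]) -> List[float]:
    """Percentage of fragments with 1, 2, 3 and 4 connection points."""
    total, *by_connections = row
    if sum(by_connections) != total:
        raise ValueError("connection columns do not add up to the total")
    return [100.0 * count / total for count in by_connections]


for name, row in ROWS.items():
    shares = ", ".join(f"{share:.1f}%" for share in connection_shares(row))
    print(f"{name}: {shares}")
```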

Number of fragments within specified molecular weight range

MW 1-50 51-100 101-150 151-200 201-250
VeryCommon 278 9,640 34,547 21,376 2,047
Common 78 6,504 49,778 49,163 7,426
LessCommon 71 8,164 79,583 104,379 19,283
Rare 62 7,940 90,885 148,243 32,528
VeryRare 89 11,525 147,883 300,310 67,033
ExtremelyRare 40 9,562 132,797 316,064 75,981
UltraRare 53 11,225 175,373 460,012 123,081
Doubleton 46 12,771 220,404 641,188 178,791
Singleton 100 27,089 504,001 1,546,404 448,061
ChEMBL_common 390 16,983 89,129 98,541 17,883
ChEMBL_rare 95 11,717 96,026 157,481 38,292
ChEMBL_veryrare 42 11,250 107,594 209,116 61,706
ChEMBL_extremelyrare 63 17,115 189,677 429,362 146,653
SureChEMBL_verycommon 742 47,394 219,111 207,803 34,308
SureChEMBL_common 135 44,056 294,455 376,934 78,943
SureChEMBL_uncommon 59 25,058 194,015 272,902 61,673
SureChEMBL_rare 59 36,652 316,932 484,401 118,477
SureChEMBL_veryrare 34 22,361 225,615 395,407 113,395
SureChEMBL_extremelyrare 33 31,277 309,965 505,645 132,087
SureChEMBL_doubleton 62 61,850 740,051 1,353,626 389,799
SureChEMBL_singleton 101 146,960 2,071,784 3,873,342 1,098,143

Atom count distribution

Number of heavy atoms (NH) 1-2 3-4 5-6 7-8 9-10 11-12 13-14 15-16 17-18
VeryCommon 91 823 5,005 15,716 21,928 17,265 6,458 575 27
Common 7 355 3,217 15,463 35,939 37,023 18,343 2,505 97
LessCommon 4 370 4,008 20,771 59,820 77,274 42,080 6,881 272
Rare 6 324 3,700 21,798 71,050 109,396 63,055 9,880 449
VeryRare 6 501 5,332 31,599 113,807 208,056 148,049 18,759 731
ExtremelyRare 3 340 4,313 26,647 102,055 205,647 171,247 23,360 832
UltraRare 6 381 4,915 33,299 139,587 297,631 250,142 42,541 1,242
Doubleton 10 339 5,455 40,232 174,597 403,392 357,547 69,323 2,305
Singleton 11 748 11,269 87,867 407,805 988,823 850,875 171,433 6,824
ChEMBL_common 89 1,423 8,873 29,512 57,389 67,408 45,087 12,129 1,016
ChEMBL_rare 11 538 5,686 25,859 65,266 95,270 82,224 26,202 2,555
ChEMBL_veryrare 6 379 5,294 27,218 73,888 119,503 116,455 41,936 5,029
ChEMBL_extremelyrare 5 552 7,850 45,343 132,855 234,897 248,173 100,834 12,361
SureChEMBL_verycommon 131 3,605 25,590 77,393 139,579 145,566 91,880 23,795 1,819
SureChEMBL_common 15 1,615 22,051 92,561 194,428 248,864 180,431 50,910 3,648
SureChEMBL_uncommon 10 686 12,114 58,502 129,982 176,680 134,020 38,822 2,891
SureChEMBL_rare 5 841 16,907 92,084 215,539 306,917 244,564 74,264 5,400
SureChEMBL_veryrare 2 473 10,029 61,888 157,124 240,510 206,432 74,025 6,329
SureChEMBL_extremelyrare 5 540 13,927 86,669 214,839 314,899 259,750 81,667 6,711
SureChEMBL_doubleton 4 954 26,565 192,048 525,322 820,654 707,802 247,689 24,350
SureChEMBL_singleton 11 1,650 59,560 498,692 1,488,435 2,346,572 2,022,233 701,144 72,033

Number of fragments with specified number of rotatable bonds

Num Rot Bonds 0 1 2 3 4 5
VeryCommon 7,672 21,257 23,070 12,806 2,556 527
Common 8,109 30,861 44,005 24,089 4,985 900
LessCommon 11,515 51,357 86,080 50,841 9,929 1,758
Rare 14,903 66,903 113,832 69,916 12,201 1,903
VeryRare 24,878 125,381 226,167 125,802 21,587 3,025
ExtremelyRare 25,555 134,461 233,722 119,379 18,901 2,426
UltraRare 36,607 200,252 336,853 167,705 25,273 3,054
Doubleton 54,939 285,567 459,339 218,635 31,234 3,486
Singleton 125,606 658,710 1,089,611 564,699 79,640 7,389
ChEMBL_common 23,571 68,638 77,166 40,153 10,866 2,532
ChEMBL_rare 29,233 94,861 110,432 53,160 13,252 2,673
ChEMBL_veryrare 39,627 124,544 140,450 66,406 15,617 3,064
ChEMBL_extremelyrare 82,920 258,927 281,784 125,207 28,727 5,305
SureChEMBL_verycommon 54,168 154,094 172,017 91,940 28,968 8,171
SureChEMBL_common 75,189 235,835 279,244 150,539 42,743 10,973
SureChEMBL_uncommon 51,188 163,399 196,338 105,991 29,725 7,066
SureChEMBL_rare 85,850 286,799 338,432 183,991 50,079 11,370
SureChEMBL_veryrare 79,560 235,261 260,900 137,687 35,676 7,728
SureChEMBL_extremelyrare 86,466 291,036 349,375 189,661 51,469 11,000
SureChEMBL_doubleton 262,379 775,009 881,264 477,870 124,286 24,580
SureChEMBL_singleton 703,648 2,156,275 2,528,462 1,381,729 359,156 61,060
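One way to summarize a row of this table is the average number of rotatable bonds per fragment. The sketch below assumes the last column means exactly 5 rotatable bonds (not "5 or more"), which the table does not state explicitly; the names are illustrative.

```python
from typing import Dict, Sequence

# Fragment counts for 0, 1, 2, 3, 4 and 5 rotatable bonds, from the table above.
ROTATABLE_BOND_COUNTS: Dict[str, Sequence[int]] = {
    "VeryCommon": (7_672, 21_257, 23_070, 12_806, 2_556, 527),
    "UltraRare": (36_607, 200_252, 336_853, 167_705, 25_273, 3_054),
}


def mean_rotatable_bonds(counts: Sequence[int]) -> float:
    """Weighted mean, where the position in `counts` is the number of bonds."""
    total = sum(counts)
    weighted = sum(n_bonds * n_frags for n_bonds, n_frags in enumerate(counts))
    return weighted / total


for name, counts in ROTATABLE_BOND_COUNTS.items():
    print(f"{name}: {mean_rotatable_bonds(counts):.2f} rotatable bonds on average")
```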

Number of fragments with specified number of conformations

Number of conformations 1-5 6-10 11-15 16-20 21-25 26-30
VeryCommon 44,252 10,555 4,911 2,645 1,572 3,953
Common 66,886 18,687 9,566 5,354 3,483 8,973
LessCommon 117,222 36,734 19,443 11,193 7,261 19,627
Rare 153,445 48,487 25,923 15,198 9,610 26,995
VeryRare 277,079 95,952 52,121 30,677 18,843 52,168
ExtremelyRare 272,328 98,675 55,479 32,889 20,505 54,568
UltraRare 392,232 142,261 79,098 47,446 30,349 78,358
Doubleton 519,712 204,130 111,851 66,985 42,200 108,322
Singleton 1,205,682 469,222 276,838 173,270 111,995 288,648
ChEMBL_common 145,594 31,494 14,636 8,669 5,599 16,934
ChEMBL_rare 192,666 46,668 21,355 12,232 7,853 22,837
ChEMBL_veryrare 245,523 61,298 28,334 16,183 9,932 28,438
ChEMBL_extremelyrare 493,161 123,404 57,544 32,120 19,880 56,761
SureChEMBL_verycommon 326,882 74,551 35,357 20,391 13,053 39,124
SureChEMBL_common 483,553 129,383 61,031 35,321 22,073 63,162
SureChEMBL_uncommon 328,986 92,816 44,684 25,613 16,260 45,348
SureChEMBL_rare 558,832 164,147 79,803 46,280 28,124 79,335
SureChEMBL_veryrare 447,316 131,320 62,844 35,314 21,541 58,477
SureChEMBL_extremelyrare 557,873 172,719 85,156 48,850 30,061 84,348
SureChEMBL_doubleton 1,438,946 463,072 224,621 128,134 79,230 211,385
SureChEMBL_singleton 3,876,155 1,363,410 676,075 389,869 241,079 643,742

Analysis of reagent databases

Number of fragments within specified molecular weight range

The figures below are approximate. The exact number of fragments may change over time, as the reagent databases are updated monthly. Keep your reagent databases up to date by following the instructions in Installing Spark databases.

Database Description Total MW 1-50 MW 51-100 MW 101-150 MW 151-200 MW 201-250
eMolecules_acidCO Acids, keep the -CO 19,694 3 354 4,930 10,718 3,689
eMolecules_acid Acids, delete the -COOH 36,182 42 2,122 11,871 16,339 5,808
eMolecules_alcohol Aliphatic alcohols, delete the O 15,442 10 1,160 5,417 6,500 2,355
eMolecules_alcoholO Aliphatic alcohols, keep the O 18,923 3 408 5,555 9,279 3,678
eMolecules_reductive_amination Aldehydes/ketones, delete the O and reduce C 24,005 5 658 5,419 12,003 5,920
eMolecules_aliphatic_halide Aliphatic halide 7,564 13 667 2,512 3,191 1,181
eMolecules_alkyne Alkynes, delete the -C#C 2,448 24 346 1,080 848 150
eMolecules_aromatic_alcoholO Aromatic alcohols, keep the O 10,097 0 40 2,040 5,531 2,486
eMolecules_aromatic_aminesN Aromatic amines, keep the N 18,312 0 114 3,959 10,012 4,227
eMolecules_aromatic_halide Aromatic halide 44,181 9 433 14,329 25,671 3,739
eMolecules_boronic Aromatic boronic acids, delete -B(OH)2 5,479 0 127 2,160 2,644 548
eMolecules_cyano Cyano groups, delete -CN 19,271 18 838 5,771 8,782 3,862
eMolecules_isocyanateCO Isocyanates, keep -NCO 626 0 9 92 320 205
eMolecules_olefin Olefins, delete the -C=C 3,126 18 399 1,287 1,173 249
eMolecules_primary_aliphatic_amine Primary aliphatic amines, delete the N 12,937 7 1,010 5,282 5,344 1,294
eMolecules_primary_aliphatic_amineN Primary aliphatic amines, keep the N 7,762 1 318 3,320 3,322 801
eMolecules_primary_aliphatic_halide Primary aliphatic halide 6,150 12 467 1,986 2,631 1,054
eMolecules_primary_aromatic_amines Primary aromatic amines, delete the N 24,010 0 301 6,366 12,193 5,150
eMolecules_secondary_aliphatic_amineN Secondary aliphatic amines, keep the N 9,428 1 224 2,852 4,837 1,514
eMolecules_sulfonicacid Sulfonic acids, delete the SO2X 3,313 35 308 1,278 1,253 439
eMolecules_sulfonicacidSO2 Sulfonic acids, keep the -SO2X 1,928 0 14 184 832 898
eMolecules_thiol Aliphatic thiols, delete S 456 5 103 197 139 12
eMolecules_thiolS Thiols, keep S 1,572 0 34 336 927 275
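Because the description column contains spaces and commas, a row of this table is easiest to read back by taking the first token as the database name and the last six tokens as numbers. The parser below is a sketch for the rows as printed here; the class and function names are hypothetical.

```python
from typing import NamedTuple, Tuple


class ReagentRow(NamedTuple):
    database: str
    description: str
    total: int
    mw_bins: Tuple[int, ...]  # counts for MW 1-50, 51-100, 101-150, 151-200, 201-250


def parse_reagent_row(line: str) -> ReagentRow:
    """Split a table row into name, free-text description and six counts."""
    tokens = line.split()
    if len(tokens) < 8:
        raise ValueError("expected a name, a description and six numbers")
    numbers = [int(token.replace(",", "")) for token in tokens[-6:]]
    return ReagentRow(
        database=tokens[0],
        description=" ".join(tokens[1:-6]),
        total=numbers[0],
        mw_bins=tuple(numbers[1:]),
    )


row = parse_reagent_row(
    "eMolecules_acid Acids, delete the -COOH 36,182 42 2,122 11,871 16,339 5,808"
)
print(row.database, "|", row.description, "|", row.total)
print("bins add up to total:", sum(row.mw_bins) == row.total)
```

For the row shown, the five molecular weight bins add up to the Total column, which is a quick sanity check after each monthly update.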

References

  1. https://www.emolecules.com/info/products-screening-compounds
  2. https://www.ebi.ac.uk/chembl/
  3. https://www.emolecules.com/info/products-building-blocks
  4. http://www.crystallography.net/cod/
  5. https://www.ccdc.cam.ac.uk/solutions/csd-system/components/csd/
  6. Pitt, W. R.; Parry, D. M.; Perry, B. G.; Groom, C. R. Heteroaromatic Rings of the Future. J. Med. Chem. 2009, 52 (9), 2952–2963. ftp://ftp.ebi.ac.uk/pub/databases/chembl/VEHICLe/
