The currently released databases for Spark are derived from commercially available screening compounds (eMolecules screening compounds¹), literature reports (ChEMBL²), patents (SureChEMBL²), commercial reagents (eMolecules building blocks³), small molecule crystal structures (Crystallography Open Database⁴ and Cambridge Structural Database⁵), and theoretical ring systems (VEHICLe⁶).
Update to the latest version of all databases by following the instructions at Installing Spark databases.
Fragments from screening compounds
The Commercial Spark databases are based on eMolecules screening compounds and are split based on the frequency of occurrence of the fragments.
- VeryCommon (477 MB) – fragments which appear in at least 725 molecules
- Common (961 MB) – fragments which appear in 215-724 molecules
- LessCommon (1.95 GB) – fragments which appear in 65-214 molecules
- Rare (2.68 GB) – fragments which appear in 25-64 molecules
- VeryRare (5.3 GB) – fragments which appear in 9-24 molecules
- ExtremelyRare (5.8 GB) – fragments which appear in 5-8 molecules
- UltraRare (8.1 GB) – fragments which appear in 3-4 molecules
In general, fragments from the VeryCommon or Common databases are more likely to be readily synthesizable as they appear in many different commercially available molecules. Fragments from the VeryRare, ExtremelyRare and UltraRare databases are more likely to be non-drug-like or hard to make. These databases have been filtered to remove potentially toxic or reactive fragments (such as alkyl halides or nitroso functionalities). In addition, all phosphorus-containing fragments have been removed as the calculation of fields on phosphorus-containing functional groups is still under development. See below for a detailed analysis of these databases.
Two optional databases are also available:
- Doubleton (2 files, each of 5.7 GB) – fragments which appear in two molecules
- Singleton (3 files, each of 9.3 GB) – fragments which appear in a single molecule
We typically recommend installing only the databases containing fragments which appear at least three times in the original collections. The databases containing fragments seen with lower frequency are very large, and may contain fragments derived from unrealistic or incorrect structures in the original collections. Contact support if you wish to download these optional databases.
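The frequency bands listed above can be read as a simple lookup from "number of source molecules containing a fragment" to database name. The following Python sketch illustrates that mapping only; the function and variable names are hypothetical and are not part of Spark, and the thresholds are copied from the lists above.

```python
from bisect import bisect_right

# Lower bound (number of source molecules) for each Commercial database,
# copied from the lists above. Names here are illustrative, not a Spark API.
TIERS = [
    (1, "Singleton"),
    (2, "Doubleton"),
    (3, "UltraRare"),
    (5, "ExtremelyRare"),
    (9, "VeryRare"),
    (25, "Rare"),
    (65, "LessCommon"),
    (215, "Common"),
    (725, "VeryCommon"),
]
LOWER_BOUNDS = [lower for lower, _ in TIERS]


def commercial_tier(molecule_count: int) -> str:
    """Return the Commercial database whose frequency band contains the count."""
    if molecule_count < 1:
        raise ValueError("a fragment must appear in at least one molecule")
    # bisect_right finds the first band whose lower bound exceeds the count;
    # the band just before it is the one that contains the count.
    return TIERS[bisect_right(LOWER_BOUNDS, molecule_count) - 1][1]


for count in (1, 4, 64, 724, 725):
    print(count, commercial_tier(count))
```

A sorted list of lower bounds keeps the bands in one place, so the same function could be reused for the ChEMBL or SureChEMBL bands by swapping in a different table.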
Fragments from ChEMBL
The current ChEMBL Spark databases are based on Release 30 of ChEMBL and are split based on the frequency of occurrence of the fragments.
- ChEMBL_common (1.8 GB) – fragments which appear in more than 12 molecules
- ChEMBL_rare (2.6 GB) – fragments which appear in 4-12 molecules
- ChEMBL_veryrare (3.4 GB) – fragments which appear in 2-3 molecules
An optional database is also available (contact Cresset support to download this database):
- ChEMBL_extremelyrare (6.6 GB) – fragments which appear once
Fragments from SureChEMBL
The SureChEMBL Spark databases are based on the complete SureChEMBL compound collection and are split based on the frequency of occurrence of the fragments.
- SureChEMBL_verycommon (4.1 GB) – fragments which appear in at least 45 molecules
- SureChEMBL_common (7.0 GB) – fragments which appear in 14-44 molecules
- SureChEMBL_uncommon (5.0 GB) – fragments which appear in 8-13 molecules
Additional optional databases are also available (contact Cresset support to download these databases):
- SureChEMBL_rare (8.8 GB) – fragments which appear in 5-7 molecules
- SureChEMBL_veryrare (7.0 GB) – fragments which appear in 4 molecules
- SureChEMBL_extremelyrare (9.2 GB) – fragments which appear in 3 molecules
- SureChEMBL_doubleton (3 files, each of 8.2 GB) – fragments which appear twice
- SureChEMBL_singleton (3 files, each of 24 GB) – fragments which appear once
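Because the optional SureChEMBL databases are much larger than the default ones, it can help to estimate disk space before requesting them. The Python sketch below simply adds up the file sizes quoted in the lists above; the dictionary and function names are illustrative, and the actual download sizes may differ from these published figures.

```python
# Sizes in GB as (size per file, number of files), copied from the lists above.
SURECHEMBL_DEFAULT = {
    "SureChEMBL_verycommon": (4.1, 1),
    "SureChEMBL_common": (7.0, 1),
    "SureChEMBL_uncommon": (5.0, 1),
}
SURECHEMBL_OPTIONAL = {
    "SureChEMBL_rare": (8.8, 1),
    "SureChEMBL_veryrare": (7.0, 1),
    "SureChEMBL_extremelyrare": (9.2, 1),
    "SureChEMBL_doubleton": (8.2, 3),
    "SureChEMBL_singleton": (24.0, 3),
}


def total_gb(databases: dict) -> float:
    """Sum the size of every file in a group of databases, in GB."""
    return sum(size * files for size, files in databases.values())


print(f"Default SureChEMBL databases:  {total_gb(SURECHEMBL_DEFAULT):.1f} GB")
print(f"Optional SureChEMBL databases: {total_gb(SURECHEMBL_OPTIONAL):.1f} GB")
```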
Reagents
Spark Reagent Databases are derived from eMolecules building blocks using the Cresset reagent importer, which converts a file of usable reagents into the corresponding R-groups. For example, to create the eMolecules_acid database, all the eMolecules building blocks containing a C(=O)OH or C(=O)Cl group were processed to add the corresponding R-group to the database.
Using databases derived from available reagents ensures that the results of your Spark experiment are tethered to molecules that are readily synthetically accessible. Monthly updates for these databases provide reliable availability information on the reagents that you wish to employ.
The current list of Spark Reagent Databases includes 23 common chemical transformations. See below for a detailed analysis of these databases.
Fragments from small molecule crystal structures
These databases contain fragments in their crystallographic conformation, derived from small molecule crystal structures.
The Spark COD database contains fragments from the Crystallography Open Database. This database is available for download to all Spark customers.
The Spark CSD Fragment Database is derived from the Cambridge Structural Database (CSD). A valid CSD-System license is required for use of this database. If you do not already have a license, please contact CCDC for assistance.
Theoretical rings
This database contains a collection of theoretical ring systems derived from the VEHICLe⁶ database.
Create your own database
Spark fragment and reagent databases provide an excellent source of new bioisosteres. However, if you have access to significant proprietary chemistry or to specialized reagents, or simply want to consider only fragments from reagents that you have in stock, then creating your own custom databases will add value to your Spark experiments.
Custom databases can be easily created using the Database Generator, a dedicated and user-friendly interface to custom database creation within Spark, or using the equivalent functionality from the command line.
Contact Cresset support if you need assistance with the Spark Database Generator.
Analysis of fragment databases
Database overlaps (number of fragments present in both databases)
Database | VeryCommon | Common | LessCommon | Rare | VeryRare | ExtremelyRare | UltraRare | Doubleton | Singleton | Unique |
ChEMBL_common | 38,687 | 31,717 | 28,917 | 18,493 | 16,285 | 9,251 | 8,338 | 7,324 | 10,988 | 52,926 |
ChEMBL_rare | 6,160 | 17,611 | 27,846 | 26,412 | 29,948 | 17,198 | 14,873 | 12,168 | 18,793 | 132,602 |
ChEMBL_veryrare | 2,520 | 8,702 | 17,121 | 19,536 | 28,007 | 20,231 | 19,887 | 20,989 | 26,793 | 225,922 |
ChEMBL_extremelyrare | 1,707 | 7,042 | 16,144 | 20,506 | 33,310 | 26,688 | 29,587 | 28,784 | 67,511 | 551,591 |
Database | SureChEMBL_verycommon | SureChEMBL_common | SureChEMBL_uncommon | SureChEMBL_rare | SureChEMBL_veryrare | SureChEMBL_extremelyrare | SureChEMBL_doubleton | SureChEMBL_singleton | Unique |
ChEMBL_common | 146,547 | 31,002 | 7,536 | 5,865 | 2,682 | 2,828 | 3,780 | 4,674 | 18,012 |
ChEMBL_rare | 84,101 | 64,144 | 22,539 | 19,914 | 9,116 | 8,349 | 12,050 | 14,985 | 68,413 |
ChEMBL_veryrare | 49,522 | 64,713 | 29,188 | 31,969 | 16,604 | 17,930 | 27,779 | 25,255 | 126,748 |
ChEMBL_extremelyrare | 43,094 | 77,947 | 43,087 | 55,009 | 29,521 | 39,150 | 59,864 | 103,934 | 331,264 |
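In the published figures, each row of the overlap tables adds up to the total fragment count reported for that ChEMBL database in the fragment count table, so the Unique column can be read as a share of the whole database. This is an observation from the numbers, not a documented guarantee. The Python sketch below checks it for the ChEMBL versus Commercial table and reports the unique share; the variable and function names are illustrative.

```python
# Overlap of each ChEMBL database with the Commercial databases, in column
# order: VeryCommon ... Singleton, followed by Unique (from the table above).
COMMERCIAL_OVERLAP = {
    "ChEMBL_common": [38687, 31717, 28917, 18493, 16285, 9251, 8338, 7324, 10988, 52926],
    "ChEMBL_rare": [6160, 17611, 27846, 26412, 29948, 17198, 14873, 12168, 18793, 132602],
    "ChEMBL_veryrare": [2520, 8702, 17121, 19536, 28007, 20231, 19887, 20989, 26793, 225922],
    "ChEMBL_extremelyrare": [1707, 7042, 16144, 20506, 33310, 26688, 29587, 28784, 67511, 551591],
}

# Total fragments per database, from the fragment count table.
TOTAL_FRAGMENTS = {
    "ChEMBL_common": 222926,
    "ChEMBL_rare": 303611,
    "ChEMBL_veryrare": 389708,
    "ChEMBL_extremelyrare": 782870,
}


def unique_share(row: list) -> float:
    """Fraction of a database's fragments found in none of the compared databases."""
    return row[-1] / sum(row)


for name, row in COMMERCIAL_OVERLAP.items():
    consistent = sum(row) == TOTAL_FRAGMENTS[name]
    print(f"{name}: unique share {unique_share(row):.1%}, row sum matches total: {consistent}")
```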
Fragment and connection point counts
Database | Total frags | Frags with 1 connection | Frags with 2 connections | Frags with 3 connections | Frags with 4 connections | Rings only |
VeryCommon | 67,888 | 20,525 | 28,979 | 14,872 | 3,512 | 2,129 |
Common | 112,949 | 33,362 | 47,336 | 25,704 | 6,547 | 1,712 |
LessCommon | 211,480 | 55,409 | 90,338 | 50,652 | 15,081 | 2,214 |
Rare | 279,658 | 70,058 | 117,127 | 69,984 | 22,489 | 2,286 |
VeryRare | 526,840 | 138,999 | 211,859 | 130,123 | 45,859 | 3,458 |
ExtremelyRare | 534,444 | 153,583 | 203,897 | 128,718 | 48,246 | 3,307 |
UltraRare | 769,744 | 242,565 | 283,665 | 174,584 | 68,930 | 4,136 |
Doubleton | 1,053,200 | 348,677 | 372,553 | 233,822 | 98,148 | 4,552 |
Singleton | 2,525,655 | 990,026 | 863,227 | 476,933 | 195,469 | 13,822 |
ChEMBL_common | 222,926 | 41,681 | 85,539 | 66,533 | 29,173 | 5,554 |
ChEMBL_rare | 303,611 | 58,825 | 110,572 | 90,044 | 44,170 | 4,837 |
ChEMBL_veryrare | 389,708 | 78,333 | 138,420 | 114,435 | 58,520 | 5,234 |
ChEMBL_extremelyrare | 782,870 | 196,345 | 280,643 | 204,450 | 101,432 | 10,803 |
SureChEMBL_verycommon | 509,358 | 86,420 | 187,700 | 158,257 | 76,981 | 12,976 |
SureChEMBL_common | 794,523 | 143,479 | 280,986 | 241,500 | 128,558 | 13,201 |
SureChEMBL_uncommon | 553,707 | 106,530 | 194,255 | 165,853 | 87,069 | 8,466 |
SureChEMBL_rare | 956,521 | 196,465 | 335,158 | 277,534 | 147,364 | 13,816 |
SureChEMBL_veryrare | 756,812 | 135,465 | 251,609 | 230,503 | 139,235 | 8,975 |
SureChEMBL_extremelyrare | 979,007 | 224,962 | 339,225 | 272,282 | 142,538 | 14,079 |
SureChEMBL_doubleton | 2,545,388 | 543,596 | 850,467 | 731,831 | 419,494 | 28,330 |
SureChEMBL_singleton | 7,190,330 | 1,842,604 | 2,420,995 | 1,894,851 | 1,031,880 | 99,811 |
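To compare databases of very different sizes, the connection point counts can be converted to percentages. The Python sketch below does this for four rows of the table above; for these rows the four connection columns add up to the total fragment count. The names used are illustrative only.

```python
# Fragments with 1, 2, 3 and 4 connection points, from the table above.
CONNECTION_COUNTS = {
    "VeryCommon": (20525, 28979, 14872, 3512),
    "Singleton": (990026, 863227, 476933, 195469),
    "ChEMBL_common": (41681, 85539, 66533, 29173),
    "SureChEMBL_verycommon": (86420, 187700, 158257, 76981),
}


def connection_shares(counts: tuple) -> list:
    """Percentage of fragments with 1, 2, 3 and 4 connection points."""
    total = sum(counts)
    return [100.0 * count / total for count in counts]


for name, counts in CONNECTION_COUNTS.items():
    shares = ", ".join(f"{share:.1f}%" for share in connection_shares(counts))
    print(f"{name}: {shares}")
```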
Number of fragments within specified molecular weight range
MW | 1-50 | 51-100 | 101-150 | 151-200 | 201-250 | |
VeryCommon | 278 | 9,640 | 34,547 | 21,376 | 2,047 | |
Common | 78 | 6,504 | 49,778 | 49,163 | 7,426 | |
LessCommon | 71 | 8,164 | 79,583 | 104,379 | 19,283 | |
Rare | 62 | 7,940 | 90,885 | 148,243 | 32,528 | |
VeryRare | 89 | 11,525 | 147,883 | 300,310 | 67,033 | |
ExtremelyRare | 40 | 9,562 | 132,797 | 316,064 | 75,981 | |
UltraRare | 53 | 11,225 | 175,373 | 460,012 | 123,081 | |
Doubleton | 46 | 12,771 | 220,404 | 641,188 | 178,791 | |
Singleton | 100 | 27,089 | 504,001 | 1,546,404 | 448,061 | |
ChEMBL_common | 390 | 16,983 | 89,129 | 98,541 | 17,883 | |
ChEMBL_rare | 95 | 11,717 | 96,026 | 157,481 | 38,292 | |
ChEMBL_veryrare | 42 | 11,250 | 107,594 | 209,116 | 61,706 | |
ChEMBL_extremelyrare | 63 | 17,115 | 189,677 | 429,362 | 146,653 | |
SureChEMBL_verycommon | 742 | 47,394 | 219,111 | 207,803 | 34,308 | |
SureChEMBL_common | 135 | 44,056 | 294,455 | 376,934 | 78,943 | |
SureChEMBL_uncommon | 59 | 25,058 | 194,015 | 272,902 | 61,673 | |
SureChEMBL_rare | 59 | 36,652 | 316,932 | 484,401 | 118,477 | |
SureChEMBL_veryrare | 34 | 22,361 | 225,615 | 395,407 | 113,395 | |
SureChEMBL_extremelyrare | 33 | 31,277 | 309,965 | 505,645 | 132,087 | |
SureChEMBL_doubleton | 62 | 61,850 | 740,051 | 1,353,626 | 389,799 | |
SureChEMBL_singleton | 101 | 146,960 | 2,071,784 | 3,873,342 | 1,098,143 | |
Atom count distribution
NH | 1-2 | 3-4 | 5-6 | 7-8 | 9-10 | 11-12 | 13-14 | 15-16 | 17-18 |
VeryCommon | 91 | 823 | 5,005 | 15,716 | 21,928 | 17,265 | 6,458 | 575 | 27 |
Common | 7 | 355 | 3,217 | 15,463 | 35,939 | 37,023 | 18,343 | 2,505 | 97 |
LessCommon | 4 | 370 | 4,008 | 20,771 | 59,820 | 77,274 | 42,080 | 6,881 | 272 |
Rare | 6 | 324 | 3,700 | 21,798 | 71,050 | 109,396 | 63,055 | 9,880 | 449 |
VeryRare | 6 | 501 | 5,332 | 31,599 | 113,807 | 208,056 | 148,049 | 18,759 | 731 |
ExtremelyRare | 3 | 340 | 4,313 | 26,647 | 102,055 | 205,647 | 171,247 | 23,360 | 832 |
UltraRare | 6 | 381 | 4,915 | 33,299 | 139,587 | 297,631 | 250,142 | 42,541 | 1,242 |
Doubleton | 10 | 339 | 5,455 | 40,232 | 174,597 | 403,392 | 357,547 | 69,323 | 2,305 |
Singleton | 11 | 748 | 11,269 | 87,867 | 407,805 | 988,823 | 850,875 | 171,433 | 6,824 |
ChEMBL_common | 89 | 1,423 | 8,873 | 29,512 | 57,389 | 67,408 | 45,087 | 12,129 | 1,016 |
ChEMBL_rare | 11 | 538 | 5,686 | 25,859 | 65,266 | 95,270 | 82,224 | 26,202 | 2,555 |
ChEMBL_veryrare | 6 | 379 | 5,294 | 27,218 | 73,888 | 119,503 | 116,455 | 41,936 | 5,029 |
ChEMBL_extremelyrare | 5 | 552 | 7,850 | 45,343 | 132,855 | 234,897 | 248,173 | 100,834 | 12,361 |
SureChEMBL_verycommon | 131 | 3,605 | 25,590 | 77,393 | 139,579 | 145,566 | 91,880 | 23,795 | 1,819 |
SureChEMBL_common | 15 | 1,615 | 22,051 | 92,561 | 194,428 | 248,864 | 180,431 | 50,910 | 3,648 |
SureChEMBL_uncommon | 10 | 686 | 12,114 | 58,502 | 129,982 | 176,680 | 134,020 | 38,822 | 2,891 |
SureChEMBL_rare | 5 | 841 | 16,907 | 92,084 | 215,539 | 306,917 | 244,564 | 74,264 | 5,400 |
SureChEMBL_veryrare | 2 | 473 | 10,029 | 61,888 | 157,124 | 240,510 | 206,432 | 74,025 | 6,329 |
SureChEMBL_extremelyrare | 5 | 540 | 13,927 | 86,669 | 214,839 | 314,899 | 259,750 | 81,667 | 6,711 |
SureChEMBL_doubleton | 4 | 954 | 26,565 | 192,048 | 525,322 | 820,654 | 707,802 | 247,689 | 24,350 |
SureChEMBL_singleton | 11 | 1,650 | 59,560 | 498,692 | 1,488,435 | 2,346,572 | 2,022,233 | 701,144 | 72,033 |
Number of fragments with specified number of rotatable bonds
Num Rot Bonds | 0 | 1 | 2 | 3 | 4 | 5 |
VeryCommon | 7,672 | 21,257 | 23,070 | 12,806 | 2,556 | 527 |
Common | 8,109 | 30,861 | 44,005 | 24,089 | 4,985 | 900 |
LessCommon | 11,515 | 51,357 | 86,080 | 50,841 | 9,929 | 1,758 |
Rare | 14,903 | 66,903 | 113,832 | 69,916 | 12,201 | 1,903 |
VeryRare | 24,878 | 125,381 | 226,167 | 125,802 | 21,587 | 3,025 |
ExtremelyRare | 25,555 | 134,461 | 233,722 | 119,379 | 18,901 | 2,426 |
UltraRare | 36,607 | 200,252 | 336,853 | 167,705 | 25,273 | 3,054 |
Doubleton | 54,939 | 285,567 | 459,339 | 218,635 | 31,234 | 3,486 |
Singleton | 125,606 | 658,710 | 1,089,611 | 564,699 | 79,640 | 7,389 |
ChEMBL_common | 23,571 | 68,638 | 77,166 | 40,153 | 10,866 | 2,532 |
ChEMBL_rare | 29,233 | 94,861 | 110,432 | 53,160 | 13,252 | 2,673 |
ChEMBL_veryrare | 39,627 | 124,544 | 140,450 | 66,406 | 15,617 | 3,064 |
ChEMBL_extremelyrare | 82,920 | 258,927 | 281,784 | 125,207 | 28,727 | 5,305 |
SureChEMBL_verycommon | 54,168 | 154,094 | 172,017 | 91,940 | 28,968 | 8,171 |
SureChEMBL_common | 75,189 | 235,835 | 279,244 | 150,539 | 42,743 | 10,973 |
SureChEMBL_uncommon | 51,188 | 163,399 | 196,338 | 105,991 | 29,725 | 7,066 |
SureChEMBL_rare | 85,850 | 286,799 | 338,432 | 183,991 | 50,079 | 11,370 |
SureChEMBL_veryrare | 79,560 | 235,261 | 260,900 | 137,687 | 35,676 | 7,728 |
SureChEMBL_extremelyrare | 86,466 | 291,036 | 349,375 | 189,661 | 51,469 | 11,000 |
SureChEMBL_doubleton | 262,379 | 775,009 | 881,264 | 477,870 | 124,286 | 24,580 |
SureChEMBL_singleton | 703,648 | 2,156,275 | 2,528,462 | 1,381,729 | 359,156 | 61,060 |
Number of fragments with specified number of conformations
Number of conformations | 1-5 | 6-10 | 11-15 | 16-20 | 21-25 | 26-30 |
VeryCommon | 44,252 | 10,555 | 4,911 | 2,645 | 1,572 | 3,953 |
Common | 66,886 | 18,687 | 9,566 | 5,354 | 3,483 | 8,973 |
LessCommon | 117,222 | 36,734 | 19,443 | 11,193 | 7,261 | 19,627 |
Rare | 153,445 | 48,487 | 25,923 | 15,198 | 9,610 | 26,995 |
VeryRare | 277,079 | 95,952 | 52,121 | 30,677 | 18,843 | 52,168 |
ExtremelyRare | 272,328 | 98,675 | 55,479 | 32,889 | 20,505 | 54,568 |
UltraRare | 392,232 | 142,261 | 79,098 | 47,446 | 30,349 | 78,358 |
Doubleton | 519,712 | 204,130 | 111,851 | 66,985 | 42,200 | 108,322 |
Singleton | 1,205,682 | 469,222 | 276,838 | 173,270 | 111,995 | 288,648 |
ChEMBL_common | 145,594 | 31,494 | 14,636 | 8,669 | 5,599 | 16,934 |
ChEMBL_rare | 192,666 | 46,668 | 21,355 | 12,232 | 7,853 | 22,837 |
ChEMBL_veryrare | 245,523 | 61,298 | 28,334 | 16,183 | 9,932 | 28,438 |
ChEMBL_extremelyrare | 493,161 | 123,404 | 57,544 | 32,120 | 19,880 | 56,761 |
SureChEMBL_verycommon | 326,882 | 74,551 | 35,357 | 20,391 | 13,053 | 39,124 |
SureChEMBL_common | 483,553 | 129,383 | 61,031 | 35,321 | 22,073 | 63,162 |
SureChEMBL_uncommon | 328,986 | 92,816 | 44,684 | 25,613 | 16,260 | 45,348 |
SureChEMBL_rare | 558,832 | 164,147 | 79,803 | 46,280 | 28,124 | 79,335 |
SureChEMBL_veryrare | 447,316 | 131,320 | 62,844 | 35,314 | 21,541 | 58,477 |
SureChEMBL_extremelyrare | 557,873 | 172,719 | 85,156 | 48,850 | 30,061 | 84,348 |
SureChEMBL_doubleton | 1,438,946 | 463,072 | 224,621 | 128,134 | 79,230 | 211,385 |
SureChEMBL_singleton | 3,876,155 | 1,363,410 | 676,075 | 389,869 | 241,079 | 643,742 |
Analysis of reagent databases
Number of fragments within specified molecular weight range
The figures below are approximate. The exact number of fragments may change over time as the reagent databases are updated on a monthly basis. Make sure you keep your reagent databases up to date by following the instructions at Installing Spark databases.
Database | Description | Total | MW 1-50 | MW 51-100 | MW 101-150 | MW 151-200 | MW 201-250 |
eMolecules_acidCO | Acids, keep the -CO | 19,694 | 3 | 354 | 4,930 | 10,718 | 3,689 |
eMolecules_acid | Acids, delete the -COOH | 36,182 | 42 | 2,122 | 11,871 | 16,339 | 5,808 |
eMolecules_alcohol | Aliphatic alcohols, delete the O | 15,442 | 10 | 1,160 | 5,417 | 6,500 | 2,355 |
eMolecules_alcoholO | Aliphatic alcohols, keep the O | 18,923 | 3 | 408 | 5,555 | 9,279 | 3,678 |
eMolecules_reductive_amination | Aldehydes/ketones, delete the O and reduce C | 24,005 | 5 | 658 | 5,419 | 12,003 | 5,920 |
eMolecules_aliphatic_halide | Aliphatic halide | 7,564 | 13 | 667 | 2,512 | 3,191 | 1,181 |
eMolecules_alkyne | Alkynes, delete the -C#C | 2,448 | 24 | 346 | 1,080 | 848 | 150 |
eMolecules_aromatic_alcoholO | Aromatic alcohols, keep the O | 10,097 | 0 | 40 | 2,040 | 5,531 | 2,486 |
eMolecules_aromatic_aminesN | Aromatic amines, keep the N | 18,312 | 0 | 114 | 3,959 | 10,012 | 4,227 |
eMolecules_aromatic_halide | Aromatic halide | 44,181 | 9 | 433 | 14,329 | 25,671 | 3,739 |
eMolecules_boronic | Aromatic boronic acids, delete -B(OH)2 | 5,479 | 0 | 127 | 2,160 | 2,644 | 548 |
eMolecules_cyano | Cyano groups, delete -CN | 19,271 | 18 | 838 | 5,771 | 8,782 | 3,862 |
eMolecules_isocyanateCO | Isocyanates, keep -NCO | 626 | 0 | 9 | 92 | 320 | 205 |
eMolecules_olefin | Olefins, delete the -C=C | 3,126 | 18 | 399 | 1,287 | 1,173 | 249 |
eMolecules_primary_aliphatic_amine | Primary aliphatic amines, delete the N | 12,937 | 7 | 1,010 | 5,282 | 5,344 | 1,294 |
eMolecules_primary_aliphatic_amineN | Primary aliphatic amines, keep the N | 7,762 | 1 | 318 | 3,320 | 3,322 | 801 |
eMolecules_primary_aliphatic_halide | Primary aliphatic halide | 6,150 | 12 | 467 | 1,986 | 2,631 | 1,054 |
eMolecules_primary_aromatic_amines | Primary aromatic amines, delete the N | 24,010 | 0 | 301 | 6,366 | 12,193 | 5,150 |
eMolecules_secondary_aliphatic_amineN | Secondary aliphatic amines, keep the N | 9,428 | 1 | 224 | 2,852 | 4,837 | 1,514 |
eMolecules_sulfonicacid | Sulfonic acids, delete the -SO2X | 3,313 | 35 | 308 | 1,278 | 1,253 | 439 |
eMolecules_sulfonicacidSO2 | Sulfonic acids, keep the -SO2X | 1,928 | 0 | 14 | 184 | 832 | 898 |
eMolecules_thiol | Aliphatic thiols, delete S | 456 | 5 | 103 | 197 | 139 | 12 |
eMolecules_thiolS | Thiols, keep S | 1,572 | 0 | 34 | 336 | 927 | 275 |
References
1. https://www.emolecules.com/info/products-screening-compounds
2. https://www.ebi.ac.uk/chembl/
3. https://www.emolecules.com/info/products-building-blocks
4. http://www.crystallography.net/cod/
5. https://www.ccdc.cam.ac.uk/solutions/csd-system/components/csd/
6. Pitt, W. R.; Parry, D. M.; Perry, B. G.; Groom, C. R. Heteroaromatic Rings of the Future. J. Med. Chem. 2009, 52 (9), 2952–2963. ftp://ftp.ebi.ac.uk/pub/databases/chembl/VEHICLe/