Spark databases

The currently released databases for Spark are derived from commercially available screening compounds (eMolecules screening compounds¹), literature reports (ChEMBL²), patents (SureChEMBL²), commercial reagents (eMolecules building blocks³), small molecule crystal structures (Crystallography Open Database⁴ and Cambridge Structural Database⁵), and theoretical ring systems (VEHICLe⁶).

Update to the latest version of all databases by following the instructions at Installing Spark databases.

Fragments from screening compounds

The Commercial Spark databases are based on eMolecules screening compounds and are split based on the frequency of occurrence of the fragments.

VeryCommon (477 MB) – fragments which appear in at least 725 molecules
Common (961 MB) – fragments which appear in 215-724 molecules
LessCommon (1.95 GB) – fragments which appear in 65-214 molecules
Rare (2.68 GB) – fragments which appear in 25-64 molecules
VeryRare (5.3 GB) – fragments which appear in 9-24 molecules
ExtremelyRare (5.8 GB) – fragments which appear in 5-8 molecules
UltraRare (8.1 GB) – fragments which appear in 3-4 molecules

In general, fragments from the VeryCommon or Common databases are more likely to be readily synthesizable as they appear in many different commercially available molecules. Fragments from the VeryRare, ExtremelyRare and UltraRare databases are more likely to be non-drug-like or hard to make. These databases have been filtered to remove potentially toxic or reactive fragments (such as alkyl halides or nitroso functionalities). In addition, all phosphorus-containing fragments have been removed as the calculation of fields on phosphorus-containing functional groups is still under development. See below for a detailed analysis of these databases.

Two optional databases are also available:

Doubleton (2 files, each of 5.7 GB) – fragments which appear in two molecules
Singleton (3 files, each of 9.3 GB) – fragments which appear in a single molecule

We typically recommend installing only the databases that contain fragments appearing at least 3-4 times in the original collections. Databases of less frequent fragments are very large and may contain fragments derived from unrealistic or incorrect structures in the original collections. Contact Cresset support if you wish to download these optional databases.
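
The frequency bands listed above, including the two optional databases, amount to a simple lookup on the number of molecules in which a fragment occurs. The following Python sketch illustrates that mapping; the function name and table layout are hypothetical and are not part of Spark.

```python
# Lower bound of each frequency band, taken from the lists above.
# The structure and names here are illustrative only, not a Spark API.
COMMERCIAL_TIERS = [
    (725, "VeryCommon"),
    (215, "Common"),
    (65, "LessCommon"),
    (25, "Rare"),
    (9, "VeryRare"),
    (5, "ExtremelyRare"),
    (3, "UltraRare"),
    (2, "Doubleton"),  # optional database
    (1, "Singleton"),  # optional database
]


def commercial_tier(molecule_count: int) -> str:
    """Return the database whose frequency band contains molecule_count."""
    if molecule_count < 1:
        raise ValueError("a fragment must appear in at least one molecule")
    # Bands are sorted from most to least frequent, so the first match wins.
    for lower_bound, name in COMMERCIAL_TIERS:
        if molecule_count >= lower_bound:
            return name
    raise AssertionError("unreachable: every count >= 1 matches a band")
```

For example, `commercial_tier(300)` returns `"Common"` and `commercial_tier(4)` returns `"UltraRare"`.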

Fragments from ChEMBL

The current ChEMBL Spark databases are based on Release 30 of ChEMBL and are split based on the frequency of occurrence of the fragments.

ChEMBL_common (1.8 GB) – fragments which appear in more than 12 molecules
ChEMBL_rare (2.6 GB) – fragments which appear in 4-12 molecules
ChEMBL_veryrare (3.4 GB) – fragments which appear in 2-3 molecules

An optional database is also available (contact Cresset support to download this database):

ChEMBL_extremelyrare (6.6 GB) – fragments which appear once

Fragments from SureChEMBL

The SureChEMBL Spark databases are based on the complete SureChEMBL compound collection and are split based on the frequency of occurrence of the fragments.

SureChEMBL_verycommon (4.1 GB) – fragments which appear in at least 45 molecules
SureChEMBL_common (7.0 GB) – fragments which appear in 14-44 molecules
SureChEMBL_uncommon (5.0 GB) – fragments which appear in 8-13 molecules

Additional optional databases are also available (contact Cresset support to download these databases):

SureChEMBL_rare (8.8 GB) – fragments which appear in 5-7 molecules
SureChEMBL_veryrare (7.0 GB) – fragments which appear in 4 molecules
SureChEMBL_extremelyrare (9.2 GB) – fragments which appear in 3 molecules
SureChEMBL_doubleton (3 files, each of 8.2 GB) – fragments which appear twice
SureChEMBL_singleton (3 files, each of 24 GB) – fragments which appear once
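
When planning an installation it can help to add up the download sizes quoted above. The sketch below does this for the commercial, ChEMBL and SureChEMBL databases only (no sizes are quoted here for the reagent, crystal structure or theoretical ring databases). It assumes 1 GB = 1000 MB, and the dictionary and function names are hypothetical.

```python
# (size of one file in MB, number of files), copied from the lists above.
# Assumption: 1 GB = 1000 MB.
DEFAULT_DATABASES = {
    "VeryCommon": (477, 1),
    "Common": (961, 1),
    "LessCommon": (1950, 1),
    "Rare": (2680, 1),
    "VeryRare": (5300, 1),
    "ExtremelyRare": (5800, 1),
    "UltraRare": (8100, 1),
    "ChEMBL_common": (1800, 1),
    "ChEMBL_rare": (2600, 1),
    "ChEMBL_veryrare": (3400, 1),
    "SureChEMBL_verycommon": (4100, 1),
    "SureChEMBL_common": (7000, 1),
    "SureChEMBL_uncommon": (5000, 1),
}

OPTIONAL_DATABASES = {
    "Doubleton": (5700, 2),
    "Singleton": (9300, 3),
    "ChEMBL_extremelyrare": (6600, 1),
    "SureChEMBL_rare": (8800, 1),
    "SureChEMBL_veryrare": (7000, 1),
    "SureChEMBL_extremelyrare": (9200, 1),
    "SureChEMBL_doubleton": (8200, 3),
    "SureChEMBL_singleton": (24000, 3),
}


def total_mb(databases: dict) -> int:
    """Sum the size of every file of every database, in MB."""
    return sum(size * files for size, files in databases.values())


if __name__ == "__main__":
    print(f"default:  {total_mb(DEFAULT_DATABASES) / 1000:.1f} GB")
    print(f"optional: {total_mb(OPTIONAL_DATABASES) / 1000:.1f} GB")
```

With the sizes quoted above, the optional databases come to more than three times the size of the default set, which is consistent with the recommendation to install them only when needed.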

Reagents

Spark Reagent Databases are derived from eMolecules building blocks using the Cresset reagent importer, which converts a file of usable reagents into the corresponding R-groups. For example, to create the eMolecules_acid database, all the eMolecules building blocks containing a C(=O)OH or C(=O)Cl group were processed and the resulting R-groups were added to the database.
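
The idea of turning a reagent into an R-group can be illustrated with a deliberately simplified, string-level toy. This is not the Cresset reagent importer: it only handles SMILES strings in which the acid or acyl chloride group is written as the final characters, the `[*]` attachment-point notation is an assumption, and the function name is hypothetical. A real importer would rely on substructure matching in a cheminformatics toolkit.

```python
from typing import Optional

# Toy illustration only: recognises the group solely when it is written as
# the last characters of the SMILES string (e.g. "c1ccccc1C(=O)O").
ACID_SUFFIXES = ("C(=O)Cl", "C(=O)O")
ATTACHMENT = "[*]"  # assumed notation for the R-group attachment point


def acid_to_r_group(smiles: str, keep_carbonyl: bool = False) -> Optional[str]:
    """Return an R-group SMILES, or None if no terminal acid group is found.

    keep_carbonyl=False removes the whole acid group;
    keep_carbonyl=True keeps the C(=O) and replaces only the OH or Cl.
    """
    for suffix in ACID_SUFFIXES:
        if smiles.endswith(suffix):
            stem = smiles[: -len(suffix)]
            if not stem:
                return None  # nothing left to use as an R-group
            tail = "C(=O)" + ATTACHMENT if keep_carbonyl else ATTACHMENT
            return stem + tail
    return None
```

Under these assumptions, `acid_to_r_group("c1ccccc1C(=O)O")` gives `"c1ccccc1[*]"`, while `acid_to_r_group("CCC(=O)O", keep_carbonyl=True)` gives `"CCC(=O)[*]"`. The two modes loosely mirror databases that delete the reacting group and databases that keep part of it.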

Using databases derived from available reagents ensures that the results of your Spark experiment are tethered to molecules that are readily synthetically accessible. Monthly updates for these databases provide reliable availability information on the reagents that you wish to employ.

The current list of Spark Reagent Databases includes 23 common chemical transformations. See below for a detailed analysis of these databases.

Fragments from small molecule crystal structures

These databases contain fragments in their crystallographic conformation, derived from small molecule crystal structures.

The Spark COD database contains fragments from the Crystallography Open Database. This database is available for download to all Spark customers.

The Spark CSD Fragment Database is derived from the Cambridge Structural Database (CSD). A valid CSD-System license is required for use of this database. If you do not already have a license, please contact CCDC for assistance.

Theoretical rings

This database contains a collection of theoretical ring systems derived from the VEHICLe⁶ database.

Create your own database

Spark fragment and reagent databases provide an excellent source of new bioisosteres. However, if you have access to significant proprietary chemistry or specialized reagents, or simply want to consider only fragments from reagents that you have in stock, creating your own custom databases will add value to your Spark experiments.

Custom databases can be created easily using the Database Generator, a dedicated, user-friendly interface for custom database creation within Spark, or using the equivalent functionality from the command line.

Contact Cresset support if you need assistance with the Spark Database Generator.

Analysis of fragment databases

Database overlaps (number of fragments present in both databases)

	VeryCommon	Common	LessCommon	Rare	VeryRare	ExtremelyRare	UltraRare	Doubleton	Singleton	Unique
ChEMBL_common	38,687	31,717	28,917	18,493	16,285	9,251	8,338	7,324	10,988	52,926
ChEMBL_rare	6,160	17,611	27,846	26,412	29,948	17,198	14,873	12,168	18,793	132,602
ChEMBL_veryrare	2,520	8,702	17,121	19,536	28,007	20,231	19,887	20,989	26,793	225,922
ChEMBL_extremelyrare	1,707	7,042	16,144	20,506	33,310	26,688	29,587	28,784	67,511	551,591

	SureChEMBL_verycommon	SureChEMBL_common	SureChEMBL_uncommon	SureChEMBL_rare	SureChEMBL_veryrare	SureChEMBL_extremelyrare	SureChEMBL_doubleton	SureChEMBL_singleton	Unique
ChEMBL_common	146,547	31,002	7,536	5,865	2,682	2,828	3,780	4,674	18,012
ChEMBL_rare	84,101	64,144	22,539	19,914	9,116	8,349	12,050	14,985	68,413
ChEMBL_veryrare	49,522	64,713	29,188	31,969	16,604	17,930	27,779	25,255	126,748
ChEMBL_extremelyrare	43,094	77,947	43,087	55,009	29,521	39,150	59,864	103,934	331,264
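
One way to read the overlap tables: in the figures given on this page, the overlap counts in each row plus the Unique value add up to that ChEMBL database's total fragment count as listed under Fragment and connection point counts. This suggests that Unique counts the fragments found in none of the compared databases, although that reading is an inference from the numbers rather than something stated here. The Python sketch below checks the arithmetic for the first table; the names are hypothetical.

```python
# Overlaps of each ChEMBL database with the commercial databases, in column
# order VeryCommon ... Singleton, followed by the value of the Unique column.
COMMERCIAL_OVERLAP = {
    "ChEMBL_common": (
        [38687, 31717, 28917, 18493, 16285, 9251, 8338, 7324, 10988], 52926),
    "ChEMBL_rare": (
        [6160, 17611, 27846, 26412, 29948, 17198, 14873, 12168, 18793], 132602),
    "ChEMBL_veryrare": (
        [2520, 8702, 17121, 19536, 28007, 20231, 19887, 20989, 26793], 225922),
    "ChEMBL_extremelyrare": (
        [1707, 7042, 16144, 20506, 33310, 26688, 29587, 28784, 67511], 551591),
}

# "Total frags" column of the fragment and connection point table.
TOTAL_FRAGMENTS = {
    "ChEMBL_common": 222926,
    "ChEMBL_rare": 303611,
    "ChEMBL_veryrare": 389708,
    "ChEMBL_extremelyrare": 782870,
}


def row_total(name: str) -> int:
    """Overlap counts plus the Unique count for one ChEMBL database."""
    overlaps, unique = COMMERCIAL_OVERLAP[name]
    return sum(overlaps) + unique


def shared_fraction(name: str) -> float:
    """Fraction of the database's fragments also seen in a commercial database."""
    overlaps, _unique = COMMERCIAL_OVERLAP[name]
    return sum(overlaps) / row_total(name)


if __name__ == "__main__":
    for name in COMMERCIAL_OVERLAP:
        assert row_total(name) == TOTAL_FRAGMENTS[name]
        print(f"{name}: {shared_fraction(name):.0%} shared with commercial databases")
```

On this reading, roughly three quarters of the ChEMBL_common fragments also occur in a commercial database, against under a third of the ChEMBL_extremelyrare fragments.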

Fragment and connection point counts

Database	Total frags	Frags with 1 connection	Frags with 2 connections	Frags with 3 connections	Frags with 4 connections	Rings only
VeryCommon	67,888	20,525	28,979	14,872	3,512	2,129
Common	112,949	33,362	47,336	25,704	6,547	1,712
LessCommon	211,480	55,409	90,338	50,652	15,081	2,214
Rare	279,658	70,058	117,127	69,984	22,489	2,286
VeryRare	526,840	138,999	211,859	130,123	45,859	3,458
ExtremelyRare	534,444	153,583	203,897	128,718	48,246	3,307
UltraRare	769,744	242,565	283,665	174,584	68,930	4,136
Doubleton	1,053,200	348,677	372,553	233,822	98,148	4,552
Singleton	2,525,655	990,026	863,227	476,933	195,469	13,822
ChEMBL_common	222,926	41,681	85,539	66,533	29,173	5,554
ChEMBL_rare	303,611	58,825	110,572	90,044	44,170	4,837
ChEMBL_veryrare	389,708	78,333	138,420	114,435	58,520	5,234
ChEMBL_extremelyrare	782,870	196,345	280,643	204,450	101,432	10,803
SureChEMBL_verycommon	509,358	86,420	187,700	158,257	76,981	12,976
SureChEMBL_common	794,523	143,479	280,986	241,500	128,558	13,201
SureChEMBL_uncommon	553,707	106,530	194,255	165,853	87,069	8,466
SureChEMBL_rare	956,521	196,465	335,158	277,534	147,364	13,816
SureChEMBL_veryrare	756,812	135,465	251,609	230,503	139,235	8,975
SureChEMBL_extremelyrare	979,007	224,962	339,225	272,282	142,538	14,079
SureChEMBL_doubleton	2,545,388	543,596	850,467	731,831	419,494	28,330
SureChEMBL_singleton	7,190,330	1,842,604	2,420,995	1,894,851	1,031,880	99,811
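
Because the databases differ greatly in size, the raw counts above are easier to compare as percentages. The sketch below converts three rows of the table into shares per number of connection points. In these rows the four connection columns sum to the total, so the Rings only column is treated as a separate subset; that treatment is an inference from the numbers, and the names used are hypothetical.

```python
# (total fragments, 1 connection, 2 connections, 3 connections, 4 connections)
# copied from the table above for three of the commercial databases.
COUNTS = {
    "VeryCommon": (67888, 20525, 28979, 14872, 3512),
    "UltraRare": (769744, 242565, 283665, 174584, 68930),
    "Singleton": (2525655, 990026, 863227, 476933, 195469),
}


def connection_shares(name: str) -> list:
    """Percentage of fragments with 1, 2, 3 and 4 connection points."""
    total, *by_connections = COUNTS[name]
    return [round(100 * n / total, 1) for n in by_connections]


if __name__ == "__main__":
    for name in COUNTS:
        print(name, connection_shares(name))
```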

Number of fragments within specified molecular weight range

MW	1-50	51-100	101-150	151-200	201-250
VeryCommon	278	9,640	34,547	21,376	2,047
Common	78	6,504	49,778	49,163	7,426
LessCommon	71	8,164	79,583	104,379	19,283
Rare	62	7,940	90,885	148,243	32,528
VeryRare	89	11,525	147,883	300,310	67,033
ExtremelyRare	40	9,562	132,797	316,064	75,981
UltraRare	53	11,225	175,373	460,012	123,081
Doubleton	46	12,771	220,404	641,188	178,791
Singleton	100	27,089	504,001	1,546,404	448,061
ChEMBL_common	390	16,983	89,129	98,541	17,883
ChEMBL_rare	95	11,717	96,026	157,481	38,292
ChEMBL_veryrare	42	11,250	107,594	209,116	61,706
ChEMBL_extremelyrare	63	17,115	189,677	429,362	146,653
SureChEMBL_verycommon	742	47,394	219,111	207,803	34,308
SureChEMBL_common	135	44,056	294,455	376,934	78,943
SureChEMBL_uncommon	59	25,058	194,015	272,902	61,673
SureChEMBL_rare	59	36,652	316,932	484,401	118,477
SureChEMBL_veryrare	34	22,361	225,615	395,407	113,395
SureChEMBL_extremelyrare	33	31,277	309,965	505,645	132,087
SureChEMBL_doubleton	62	61,850	740,051	1,353,626	389,799
SureChEMBL_singleton	101	146,960	2,071,784	3,873,342	1,098,143

Atom count distribution

Heavy atoms	1-2	3-4	5-6	7-8	9-10	11-12	13-14	15-16	17-18
VeryCommon	91	823	5,005	15,716	21,928	17,265	6,458	575	27
Common	7	355	3,217	15,463	35,939	37,023	18,343	2,505	97
LessCommon	4	370	4,008	20,771	59,820	77,274	42,080	6,881	272
Rare	6	324	3,700	21,798	71,050	109,396	63,055	9,880	449
VeryRare	6	501	5,332	31,599	113,807	208,056	148,049	18,759	731
ExtremelyRare	3	340	4,313	26,647	102,055	205,647	171,247	23,360	832
UltraRare	6	381	4,915	33,299	139,587	297,631	250,142	42,541	1,242
Doubleton	10	339	5,455	40,232	174,597	403,392	357,547	69,323	2,305
Singleton	11	748	11,269	87,867	407,805	988,823	850,875	171,433	6,824
ChEMBL_common	89	1,423	8,873	29,512	57,389	67,408	45,087	12,129	1,016
ChEMBL_rare	11	538	5,686	25,859	65,266	95,270	82,224	26,202	2,555
ChEMBL_veryrare	6	379	5,294	27,218	73,888	119,503	116,455	41,936	5,029
ChEMBL_extremelyrare	5	552	7,850	45,343	132,855	234,897	248,173	100,834	12,361
SureChEMBL_verycommon	131	3,605	25,590	77,393	139,579	145,566	91,880	23,795	1,819
SureChEMBL_common	15	1,615	22,051	92,561	194,428	248,864	180,431	50,910	3,648
SureChEMBL_uncommon	10	686	12,114	58,502	129,982	176,680	134,020	38,822	2,891
SureChEMBL_rare	5	841	16,907	92,084	215,539	306,917	244,564	74,264	5,400
SureChEMBL_veryrare	2	473	10,029	61,888	157,124	240,510	206,432	74,025	6,329
SureChEMBL_extremelyrare	5	540	13,927	86,669	214,839	314,899	259,750	81,667	6,711
SureChEMBL_doubleton	4	954	26,565	192,048	525,322	820,654	707,802	247,689	24,350
SureChEMBL_singleton	11	1,650	59,560	498,692	1,488,435	2,346,572	2,022,233	701,144	72,033

Number of fragments with specified number of rotatable bonds

Num Rot Bonds	0	1	2	3	4	5
VeryCommon	7,672	21,257	23,070	12,806	2,556	527
Common	8,109	30,861	44,005	24,089	4,985	900
LessCommon	11,515	51,357	86,080	50,841	9,929	1,758
Rare	14,903	66,903	113,832	69,916	12,201	1,903
VeryRare	24,878	125,381	226,167	125,802	21,587	3,025
ExtremelyRare	25,555	134,461	233,722	119,379	18,901	2,426
UltraRare	36,607	200,252	336,853	167,705	25,273	3,054
Doubleton	54,939	285,567	459,339	218,635	31,234	3,486
Singleton	125,606	658,710	1,089,611	564,699	79,640	7,389
ChEMBL_common	23,571	68,638	77,166	40,153	10,866	2,532
ChEMBL_rare	29,233	94,861	110,432	53,160	13,252	2,673
ChEMBL_veryrare	39,627	124,544	140,450	66,406	15,617	3,064
ChEMBL_extremelyrare	82,920	258,927	281,784	125,207	28,727	5,305
SureChEMBL_verycommon	54,168	154,094	172,017	91,940	28,968	8,171
SureChEMBL_common	75,189	235,835	279,244	150,539	42,743	10,973
SureChEMBL_uncommon	51,188	163,399	196,338	105,991	29,725	7,066
SureChEMBL_rare	85,850	286,799	338,432	183,991	50,079	11,370
SureChEMBL_veryrare	79,560	235,261	260,900	137,687	35,676	7,728
SureChEMBL_extremelyrare	86,466	291,036	349,375	189,661	51,469	11,000
SureChEMBL_doubleton	262,379	775,009	881,264	477,870	124,286	24,580
SureChEMBL_singleton	703,648	2,156,275	2,528,462	1,381,729	359,156	61,060

Number of fragments with specified number of conformations

Number of conformations	1-5	6-10	11-15	16-20	21-25	26-30
VeryCommon	44,252	10,555	4,911	2,645	1,572	3,953
Common	66,886	18,687	9,566	5,354	3,483	8,973
LessCommon	117,222	36,734	19,443	11,193	7,261	19,627
Rare	153,445	48,487	25,923	15,198	9,610	26,995
VeryRare	277,079	95,952	52,121	30,677	18,843	52,168
ExtremelyRare	272,328	98,675	55,479	32,889	20,505	54,568
UltraRare	392,232	142,261	79,098	47,446	30,349	78,358
Doubleton	519,712	204,130	111,851	66,985	42,200	108,322
Singleton	1,205,682	469,222	276,838	173,270	111,995	288,648
ChEMBL_common	145,594	31,494	14,636	8,669	5,599	16,934
ChEMBL_rare	192,666	46,668	21,355	12,232	7,853	22,837
ChEMBL_veryrare	245,523	61,298	28,334	16,183	9,932	28,438
ChEMBL_extremelyrare	493,161	123,404	57,544	32,120	19,880	56,761
SureChEMBL_verycommon	326,882	74,551	35,357	20,391	13,053	39,124
SureChEMBL_common	483,553	129,383	61,031	35,321	22,073	63,162
SureChEMBL_uncommon	328,986	92,816	44,684	25,613	16,260	45,348
SureChEMBL_rare	558,832	164,147	79,803	46,280	28,124	79,335
SureChEMBL_veryrare	447,316	131,320	62,844	35,314	21,541	58,477
SureChEMBL_extremelyrare	557,873	172,719	85,156	48,850	30,061	84,348
SureChEMBL_doubleton	1,438,946	463,072	224,621	128,134	79,230	211,385
SureChEMBL_singleton	3,876,155	1,363,410	676,075	389,869	241,079	643,742

Analysis of reagent databases

Number of fragments within specified molecular weight range

The figures below are approximate: the exact number of fragments may change over time because the reagent databases are updated monthly. Keep your reagent databases up to date by following the instructions in Installing Spark databases.

Database	Description	Total	MW 1-50	MW 51-100	MW 101-150	MW 151-200	MW 201-250
eMolecules_acidCO	Acids, keep the -CO	19,694	3	354	4,930	10,718	3,689
eMolecules_acid	Acids, delete the -COOH	36,182	42	2,122	11,871	16,339	5,808
eMolecules_alcohol	Aliphatic alcohols, delete the O	15,442	10	1,160	5,417	6,500	2,355
eMolecules_alcoholO	Aliphatic alcohols, keep the O	18,923	3	408	5,555	9,279	3,678
eMolecules_reductive_amination	Aldehydes/ketones, delete the O and reduce C	24,005	5	658	5,419	12,003	5,920
eMolecules_aliphatic_halide	Aliphatic halide	7,564	13	667	2,512	3,191	1,181
eMolecules_alkyne	Alkynes, delete the -C#C	2,448	24	346	1,080	848	150
eMolecules_aromatic_alcoholO	Aromatic alcohols, keep the O	10,097	0	40	2,040	5,531	2,486
eMolecules_aromatic_aminesN	Aromatic amines, keep the N	18,312	0	114	3,959	10,012	4,227
eMolecules_aromatic_halide	Aromatic halide	44,181	9	433	14,329	25,671	3,739
eMolecules_boronic	Aromatic boronic acids, delete -B(OH)2	5,479	0	127	2,160	2,644	548
eMolecules_cyano	Cyano groups, delete -CN	19,271	18	838	5,771	8,782	3,862
eMolecules_isocyanateCO	Isocyanates, keep -NCO	626	0	9	92	320	205
eMolecules_olefin	Olefins, delete the -C=C	3,126	18	399	1,287	1,173	249
eMolecules_primary_aliphatic_amine	Primary aliphatic amines, delete the N	12,937	7	1,010	5,282	5,344	1,294
eMolecules_primary_aliphatic_amineN	Primary aliphatic amines, keep the N	7,762	1	318	3,320	3,322	801
eMolecules_primary_aliphatic_halide	Primary aliphatic halide	6,150	12	467	1,986	2,631	1,054
eMolecules_primary_aromatic_amines	Primary aromatic amines, delete the N	24,010	0	301	6,366	12,193	5,150
eMolecules_secondary_aliphatic_amineN	Secondary aliphatic amines, keep the N	9,428	1	224	2,852	4,837	1,514
eMolecules_sulfonicacid	Sulfonic acids, delete the SO2X	3,313	35	308	1,278	1,253	439
eMolecules_sulfonicacidSO2	Sulfonic acids, keep the -SO2X	1,928	0	14	184	832	898
eMolecules_thiol	Aliphatic thiols, delete S	456	5	103	197	139	12
eMolecules_thiolS	Thiols, keep S	1,572	0	34	336	927	275

References

1. https://www.emolecules.com/info/products-screening-compounds
2. https://www.ebi.ac.uk/chembl/
3. https://www.emolecules.com/info/products-building-blocks
4. http://www.crystallography.net/cod/
5. https://www.ccdc.cam.ac.uk/solutions/csd-system/components/csd/
6. Pitt, W. R.; Parry, D. M.; Perry, B. G.; Groom, C. R. Heteroaromatic Rings of the Future. J. Med. Chem. 2009, 52 (9), 2952–2963. ftp://ftp.ebi.ac.uk/pub/databases/chembl/VEHICLe/
