We released BlazeGPU a couple of years ago, allowing the full power of the Blaze virtual screening system to be used on a few consumer graphics cards rather than a full-scale Linux cluster. Since then, graphics cards and CPUs have only got faster, so we decided that it was time to update our benchmarks and see how well all of the new hardware performs.
For these benchmarks we took a random subset of 4,000 molecules from our in-house Blaze data set and searched with a medium-sized query molecule. The molecules in the data set average 80 conformers each. We’ve run with three different search conditions: the full slow-but-accurate simplex algorithm, the standard clique algorithm and the new fastclique algorithm. All of these were run with 50% fields and 50% shape.
Firstly, the CPU benchmarks. All of these are single-core performance, but with all cores loaded so that we’re not benefitting from Intel Turbo Boost. In most cases Blaze will be saturating all cores, so this is representative of real-world performance. Note that the vertical axis is on a log scale.
As can be seen, there’s a significant performance difference between the older CPUs at Cresset (such as the Q6600) and the newer Ivy Bridge i7-3770K chips, but not nearly as much as you would expect given that the Q6600s are around 7-8 years old at this point. The significant speed improvements of the fastclique algorithm are clearly visible with the throughput being more than 4x greater than the original clique algorithm. The last set of columns on the graph are from an Amazon c4.xlarge instance and show that the performance of each core on those systems is roughly the same as the Sandy Bridge i3-2120.
Moving on to the GPUs, we’ve tested the throughput on a variety of different systems. Firstly, we’ve tested a variety of GTX580s on different motherboards and processors. As you would expect, for the most part the performance is governed by the GPU, but the exception is the fifth test system which is noticeably slower than the others. That card is sitting in a much older chassis with an older motherboard and hence is probably suffering from lack of backplane bandwidth to the GPU.
The newer GTX960s perform extremely well on the Blaze calculations. We weren’t sure if they would, after the disappointment of the GTX680 which was noticeably slower than the 580 (data not shown). The difference is noticeable in the clique stages, but really stands out in the simplex calculations where a GTX960 is 50% faster than the GTX580s. By contrast, the high-end Tesla hardware is not a great performer on the Blaze OpenCL kernels. By all accounts the Tesla hardware is significantly faster than the consumer hardware on double precision workloads, but the Blaze code is all single precision and in that realm the cheap consumer hardware has an unbeatable price/performance advantage.
Finally, the GRID K520 is the hardware found on the Amazon g2.2xlarge and g2.8xlarge instances. As can be seen, it’s not a brilliant performer on the Blaze workload, being around the same speed as the Tesla on the fastclique algorithm but noticeably slower than all of the other cards tested on the simplex workload. However, it provides a nice test of GPU scaling: when running on a 4 times larger data set on all 4 GPUs of a g2.8xlarge instance, we observed substantially the same throughput as running the original data set on a single K520 GPU, showing that we can parallelise across multi-GPU systems with no loss of performance.
Cost efficiency on Amazon
Converting the throughput shown above, we can look at the cost of screening on the Amazon cluster with Blaze. The raw cost to screen a million molecules is shown in the table. Note that the actual costs will be somewhat higher, due to job overheads and data transfer costs.
The Amazon GPU solutions are noticeably cheaper for fastclique jobs, roughly cost-competitive for the clique runs, but the poor performance of the K520 on the simplex task means that it is significantly more expensive there. As a result, at the moment there’s no real impetus to use the Amazon GPU resources unless you can get them significantly more discounted than the CPU instances on the spot market.
New hardware is significantly faster at running Blaze than old stock as would be expected. However, the speed increases are much lower than they have been in the past, with CPUs that are well past their best still performing adequately. On the GPU side, Blaze performs particularly well on commodity graphics cards leaving few reasons for us to invest in dedicated GPU co-processing cards.
The cost of running a million molecule virtual screen on the Amazon cloud has never been cheaper. If tiered processing is used as is the default for Blaze then these screens can be performed for a very low cost indeed – less than $15 per million molecules for the processing costs.