
The Trading Mesh

FPGA & Hardware Accelerated Trading, Part Two - Alternative Approaches

Mon, 02 Jul 2012 05:00:00 GMT

This is the second part of a six-part series by Mike O'Hara, looking at the usage of FPGAs and other types of hardware acceleration in the financial trading ecosystem.


In part one of this series, we looked at how, where and by whom FPGA technology is currently being used in financial trading.


In this article, we investigate some alternative approaches to hardware acceleration, identifying key differences between FPGAs, GPUs and ASICs. We also look at how FPGA technology is developing, particularly around areas like floating-point calculations.


One of the key advantages offered by FPGA technology is its capacity for parallel processing. Unlike a traditional CPU, which works through its instruction stream largely sequentially, an FPGA can process many operations in parallel in hardware.


There are other, cheaper chips designed for massively parallel processing, e.g. GPGPUs (General-Purpose Graphics Processing Units). However, GPU parallelism and FPGA parallelism are two different things, as James Spooner, Vice President of Acceleration at Maxeler Technologies, a firm specialising in high-performance computing solutions, explains.


“GPUs have a stronghold in the parallel processing of graphics, but where they fall down is that they don’t help you with latency at all,” says Spooner. “While they have a massive amount of parallelism, it’s coarse-grain parallelism, it’s not pipeline parallelism.”


What does that mean exactly?


“If you’ve got one message, you can only process it in one core. So it doesn’t matter how many different cores you have, it’s only in one of them. Which means that GPUs give you a throughput play rather than a latency play. If you want to reduce latency, you need fine-grain parallelism,” says Spooner. “At Maxeler we use dataflow engines (DFEs), which provide the same fine-grain parallelism, but are ready-to-compute rather than just blank chips.”
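Spooner’s distinction can be made concrete with a toy model. The sketch below (an illustrative Python approximation, not Maxeler’s actual tooling) compares the latency of summing eight values one after another on a single core with the dependency depth of an FPGA-style adder tree, where independent additions happen side by side, which is fine-grain parallelism inside a single message:

```python
import math

def sequential_latency(n_values, t_add=1.0):
    """One core: the adds happen one after another -> (n - 1) add times."""
    return (n_values - 1) * t_add

def pipelined_tree_latency(n_values, t_add=1.0):
    """FPGA-style adder tree: independent adds run side by side,
    so latency is only the dependency depth, log2(n) levels."""
    return math.ceil(math.log2(n_values)) * t_add

print(sequential_latency(8))      # 7.0
print(pipelined_tree_latency(8))  # 3.0
```

The gap widens with message size: for 1,024 values the sequential chain needs 1,023 add times, the tree only 10.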


Another drawback of GPUs is that they tend to be more prone to hardware failures and calculation errors. So although they can work well for throughput-oriented parallel computing tasks such as Monte Carlo simulations, they are unable to deliver the level of determinism offered by FPGAs.
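Monte Carlo pricing illustrates why GPUs suit throughput work: every simulated path is independent, so thousands can run at once, and the answer is a statistical average rather than a deterministic, latency-critical result. A minimal single-threaded Python sketch of that kind of workload (the model and parameters here are illustrative, not drawn from the article):

```python
import math
import random

def mc_european_call(s0, k, r, sigma, t, n_paths, seed=42):
    """Monte Carlo price of a European call under lognormal dynamics.
    Each path is independent of every other -- the embarrassingly
    parallel shape that maps well onto a GPU's many cores."""
    rng = random.Random(seed)
    drift = (r - 0.5 * sigma ** 2) * t
    vol = sigma * math.sqrt(t)
    payoff_sum = 0.0
    for _ in range(n_paths):
        z = rng.gauss(0.0, 1.0)               # one random draw per path
        s_t = s0 * math.exp(drift + vol * z)  # terminal price
        payoff_sum += max(s_t - k, 0.0)       # call payoff
    return math.exp(-r * t) * payoff_sum / n_paths

price = mc_european_call(s0=100, k=100, r=0.05, sigma=0.2, t=1.0,
                         n_paths=100_000)
```

On a GPU each path would run on its own thread, multiplying throughput; but any single path still takes the same time, which is Spooner’s point about throughput versus latency.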


At the other end of the spectrum are ASICs (Application-Specific Integrated Circuits). Designed and fabricated to run one set of instructions and perform one function, ASICs offer both 100% determinism and massive parallelism, but at a high cost. Because they cannot be reprogrammed, they are uneconomical for trading environments where requirements are constantly changing.


It is possible that some standard functions such as TCP offload, which are now often performed on FPGAs, might start appearing on ASICs but we are still some distance away from that point, according to Kelly Masood, Chief Technology Officer of Intilop, a firm that provides FPGA-based TCP offload engines.


“We are talking to companies who want to implement our TCP offload engines onto ASICs,” says Masood. “The technology itself will then become a lot cheaper for users, but the disadvantage of ASICs is of course that they’re not programmable. In high-speed trading, the flexibility of changes and modifications is very important because the standards are not quite defined or mature yet.”


Even TCP standards?


“TCP standards have been mature for a long time, but other standards – for things like interfacing with other hardware and software functions, for example – are not defined, they’re constantly being updated, upgraded and modified. They are also end user application dependent. So if you implement this technology in an ASIC, it then becomes very difficult, very time-consuming and very costly to change those interfaces and those specs.


“The great thing about FPGAs versus ASICs,” concludes Masood, “is that they allow you to make those modifications – and implement them – quickly. I see FPGA and ASIC implementations coexisting for the foreseeable future.”


FPGA technology is improving all the time, both in terms of the number of gates and the number of individual calculation units they have. But other than following Moore’s law, how else are FPGAs actually evolving?


Marcus Perrett, Head of Development at Fixnetix, the outsourced trading systems vendor, believes that part of the evolutionary process will see more offerings from vendors in terms of algorithms offered directly on the FPGA.


“We’re now seeing vendors embracing the full power of the chip,” states Perrett.


“Traditionally they offered developers a blank canvas, then they adopted an approach whereby various cores were added, like MegaWizard from Altera and CORE Gen from Xilinx, which were well-known, well-established constructs around things like FIFO (first-in-first-out). That was great because you could build a design quite easily using this ‘Lego’. But what it didn’t really do was offer a multi-faceted algorithm development platform. Yes, you had ANDs, ORs and other arithmetic operators, but trying to build up logical or algorithmic expressions using individual components can end up being quite weighty and not really optimised. So to do anything clever like standard deviation, you were basically on your own and had to develop it yourself.”
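As an illustration of the kind of algorithm Perrett mentions, a running standard deviation is a one-liner in software but must be hand-assembled from adders, multipliers and dividers on a bare FPGA. A one-pass (Welford-style) formulation in Python, shown purely as a software reference point for what such a hand-built block has to compute:

```python
import math

class RunningStdDev:
    """One-pass running (population) standard deviation -- the sort of
    streaming building block the quote describes having to develop
    yourself from primitive operators on an FPGA."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        return math.sqrt(self.m2 / self.n) if self.n else 0.0

s = RunningStdDev()
for px in [100.0, 101.0, 99.5, 100.5]:
    s.update(px)
```

The one-pass form matters in hardware: it updates on every tick with a fixed handful of operations, rather than buffering a window and recomputing from scratch.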


That situation is now changing, according to Perrett.


“What we’re seeing now is an encapsulation of some of these algorithms offered by the vendors themselves,” he says. “Which means that in the future firms like ours can look at offering out algorithm development, either as a professional services undertaking or via our own API, which customers can hook into as part of the EMS platform.”


Calculations like standard deviation can involve floating-point processing, which is an area where FPGAs have traditionally been weak, although FPGA chip vendors like Altera have been doing a lot of work in their tool chains to support floating-point development based on IEEE 754, the standard format for floating-point arithmetic.
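For readers unfamiliar with IEEE 754, the single-precision format packs a number into one sign bit, eight exponent bits and twenty-three fraction bits, and it is exactly this bit-level layout that FPGA floating-point cores have to implement in logic. A short Python illustration of the layout:

```python
import struct

def float_to_bits(x):
    """Split a value's IEEE 754 single-precision encoding into its
    sign (1 bit), exponent (8 bits) and fraction (23 bits) fields."""
    (raw,) = struct.unpack(">I", struct.pack(">f", x))
    bits = f"{raw:032b}"
    return bits[0], bits[1:9], bits[9:]

sign, exponent, fraction = float_to_bits(1.5)
# 1.5 = 1.1 (binary) x 2^0: biased exponent 127, fraction 100...0
```

The variable-length normalisation, rounding and exception handling hiding behind those fields are what make floating point so much more expensive in FPGA logic than fixed-point arithmetic.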


However, according to Ron Huizen, Vice President of Technology at BittWare, a provider of FPGA computing systems, doing floating-point designs in FPGAs can still be time-consuming, which is why his firm has come up with an alternative approach.


“Floating point in FPGA can provide a tremendous amount of performance, but it is not that easy to do for complex algorithms”, states Huizen. “So we’ve invested in what we call a floating point co-processor for FPGA, the Anemone, which is actually a small, very low power, many-core, C-programmable processor designed to sit beside an FPGA so you can offload floating point to it and write C code to control it.  That way you can optimize the more straightforward parts of your algorithms in pure FPGA, and leave the complex parts, or those that change often, in C on the programmable co-processor”.


This differs from most other approaches, according to Huizen.


“Other people have gone down the route of having the programmable processor as the centre with the FPGA on the periphery, but because we work primarily in FPGA, we see it the other way around. We’ve deliberately made this co-processor small, low power and very high performance because we don’t care about all the other interfaces, that’s what the FPGA itself is for. If you look at a typical Texas Instruments co-processor, it’s a big, power-hungry chip that has PCIe, 10GbE and all these interfaces built in, whereas all we have is a very small chip with external ports that connect to the FPGA and then have 16 cores that just run flat-out. So we get 24 gigaFLOPS in under 1 watt of total core power.


“The guys who are doing the pure network interface stuff probably don’t care as much about floating point as the ones who are doing hardware acceleration and risk analysis”, suggests Huizen, “but for people trying to put their big algorithms on the FPGA, there are some things that are still hard to do in FPGA. So with this approach, where there’s a lot of decision-making going on or if there are things that change very often, they can put that into C code and run it in the Anemone co-processor instead.


“We’re seeing a lot of interest in this from people in the finance space, because it means they can do the decisions in the FPGA space instead of sending them up into the Intel CPU over the PCIe bus, where there’s latency and Linux has to deal with it. By having the Anemone processor sitting directly alongside the FPGA, they can put some decision-making down there and have it close to the hardware but still write it in C code,” says Huizen. “They’ve recognized that the Anemone coprocessor is really 16 little RISC engines that can run any code, it doesn’t have to be floating point based. So essentially you get to run 16 control threads in parallel with no context switching.”


There is no doubt that, as FPGA and other forms of hardware acceleration become increasingly prevalent in trading architectures, not just at the so-called high-frequency trading shops but across a much wider group of firms, vendors will come up with ever more innovative solutions to serve those firms.


In part three of this series, which will be published next week, we will look at the different programmatic approaches to working with FPGAs.
