FPGA & Hardware Accelerated Trading, Part Three - Programming
Mon, 09 Jul 2012 06:07:00 GMT
This is part three of a six-part series by Mike O'Hara, looking at the usage of FPGAs and other types of hardware acceleration in the financial trading ecosystem.
In part one of this series, we looked at how, where and why FPGA technology is currently being used in financial trading, and in part two we investigated some alternative approaches to hardware acceleration, identifying key differences between FPGAs, GPUs and ASICs and looking at developments around floating-point and maths co-processing technology.
This article focuses on FPGA programming, and looks at the various paths that different firms are taking in order to arrive at the same destination, i.e. a well-programmed chip that can perform a specific set of tasks efficiently, quickly and deterministically.
At its lowest level, the representation of a digital circuit on an FPGA is defined at Register Transfer Level (RTL). This RTL definition is usually created via some sort of Hardware Description Language (HDL), the two most common being VHDL and Verilog. These are both very low-level programming languages that can be used to describe and define the digital logic within an FPGA circuit.
However, in the real world, actual business logic in financial trading applications is generally programmed using high-level programming languages, such as C++ (or even higher-level languages like Java).
This leads to a potential problem for firms that want to build trading solutions around FPGAs, because it is not particularly easy to go from a high-level language to a low-level language. Taking an abstracted language like C++ and translating it into hardware is challenging, to say the least.
James Spooner, Vice President of Acceleration at Maxeler Technologies, explains the issue.
“There are two separate things here,” he says. “One is dealing with the details, the syntax and so on in making the chip work, all the stuff that you don’t think about until you’re faced with actually doing it. The second is thinking about the algorithm in a data flow way rather than a sequential way.
“At one extreme, you give someone a low-level language, you give them all the detail they need, the data sheets, the specifications, an oscilloscope and a layout tool, so that they can build a board and develop it from there. At the other end of the spectrum is the claim that you can take a serial process, written in C for example, and somehow magically turn it into something that can run in a massively parallel way, which is what you need to do to put it on the chip.
“Trying to parallelize something that was originally expressed in a serial way is well recognized as an extremely difficult problem that people have been working on for as long as computers have been invented. So trying to solve that problem is not necessarily the best way to make this work. On the other hand, forcing people to understand the very low level details about how to, for example, reset a phase-locked loop, is just not useful,” he says.
So what is the alternative?
“At Maxeler we take a very different approach. You shouldn’t be thinking in terms of sequential C++ code, but instead think about the flow of data through your algorithm, which at the end of the day is all that matters. MaxCompiler does all the heavy lifting for you, like making sure the data is in the right place at the right time, and presents the programmer with a high level abstraction of the dataflow that is easy to conceptualise. Because of this you spend your time designing great algorithms rather than getting your hands dirty with all the messy details,” claims Spooner.
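To make the distinction Spooner draws a little more concrete, here is a minimal sketch in Python of the same calculation written in both styles. It is purely illustrative: it is not MaxCompiler code, and the names `moving_average_sequential`, `MovingAverageKernel` and `run_kernel` are hypothetical. The first function indexes back into a stored array, as sequential code typically does. The second consumes one price per "tick" and keeps only a small window of state, which is closer to how a hardware pipeline would see the data.

```python
from collections import deque


def moving_average_sequential(prices, window=3):
    """Sequential style: loop over indices and re-read stored data."""
    out = []
    for i in range(window - 1, len(prices)):
        total = 0.0
        for j in range(i - window + 1, i + 1):
            total += prices[j]
        out.append(total / window)
    return out


class MovingAverageKernel:
    """Dataflow style: one value in per tick, one value out once primed.

    The deque stands in for a shift register (an assumption made for
    illustration); the kernel never looks at data outside its window.
    """

    def __init__(self, window=3):
        self.window = window
        self.taps = deque(maxlen=window)

    def tick(self, price):
        self.taps.append(price)
        if len(self.taps) < self.window:
            return None  # pipeline is still filling
        return sum(self.taps) / self.window


def run_kernel(prices, window=3):
    """Stream every price through the kernel and keep the valid outputs."""
    kernel = MovingAverageKernel(window)
    results = (kernel.tick(p) for p in prices)
    return [r for r in results if r is not None]
```

Both versions produce the same numbers. The difference is that the kernel version states explicitly what must be held between ticks, which is the kind of information a hardware design needs.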
Nick Ciarleglio, Systems Engineer and FSI Product Manager at Arista Networks, believes that it is good to take a pragmatic approach to FPGA toolsets.
“You’re not going to input C and output perfectly formed machine code that you can overlay on a hardware platform - or multiple platforms - and say ‘OK, we’ve got our application in hardware now’,” he says. “But what makes some of these toolsets interesting is their potential to allow someone who doesn’t know how to write Verilog or VHDL to start developing to a library that may at some point output RTL, which can then be synthesized on hardware.”
Do such tools actually exist?
“Although I don’t believe there are any magic tools right now that allow you to take (for example) socket calls or libraries and output platform-specific code, things like Impulse C are pretty interesting because they allow a hardware developer, who actually knows how to write RTL, to take some kind of usable application logic output, fit it to a platform and start tuning and tweaking the timing and latencies through the circuit,” he says.
There are other advantages to the toolset approach, maintains Ciarleglio.
“Tools like Impulse C allow people to work on software development environments that are completely separate from the hardware developers writing the lower level code. So if anything, they allow more people to develop at once on getting code ported over to hardware. All of the final placement and tuning needs to be done by someone who is closer to being an electronics engineer and who writes RTL, but that doesn’t mean that he couldn’t be taking output from multiple people writing applications to a library,” Ciarleglio says.
Despite the clear advantages that tools like Impulse C offer, challenges still exist, according to Brian Durwood, CEO of Impulse Accelerated Technologies, the firm behind Impulse C.
“To write code that compiles down to multiple streams, you have to separate as many elements as you can into individual streaming processes. If you’re using a compiler, you have to learn to write in a slightly different style, using coarse-grained logic, so the machine can compile down to multiple streaming processes. Then from there, it’s just iterative,” he explains.
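As a rough illustration of the style Durwood describes, the sketch below splits a small task into three coarse-grained stages that communicate only through streams. It is written in Python with generators, not in Impulse C, and everything in it is an assumption made for the example: the `SYMBOL,SIDE,PRICE` message format and the function names `decode`, `only_bids`, `best_bid` and `pipeline` are all hypothetical.

```python
def decode(lines):
    """Stage 1: turn raw 'SYMBOL,SIDE,PRICE' text into tuples."""
    for line in lines:
        symbol, side, price = line.strip().split(",")
        yield symbol, side, float(price)


def only_bids(messages, symbol):
    """Stage 2: pass through bid prices for one symbol, drop the rest."""
    for msg_symbol, side, price in messages:
        if msg_symbol == symbol and side == "B":
            yield price


def best_bid(prices):
    """Stage 3: emit the running maximum after every accepted bid."""
    best = float("-inf")
    for price in prices:
        best = max(best, price)
        yield best


def pipeline(lines, symbol):
    """Connect the stages; each one sees only the stream in front of it."""
    return list(best_bid(only_bids(decode(lines), symbol)))
```

Python evaluates these generators lazily in a single thread, so nothing here actually runs in parallel. The point is the structure: because no stage reaches into another's state, each could in principle become its own streaming process, and that separation is what a compiler targeting hardware needs in order to find parallelism.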
But Durwood accepts that there is no “silver bullet”.
“There’s no such thing as a compiler that lets you put in microprocessor code at one end, press a big red button and come out with perfectly parallelised code at the other,” he says. “You have to iterate. And the pain with iteration is that the final synthesis or ‘place-and-route’ step, where it all compiles down to RTL, can take four to eight hours. Some of the new devices have what’s called ‘partial reconfigurability’, which really helps, but it’s still a challenge.”
Another problem with the compiler approach, according to Mike Dini, President of Dini Group, a hardware and software engineering firm specializing in digital circuit design and application development, is that when you abstract yourself away from the hardware - for example at C level - you hurt yourself in exactly the area you wanted to improve, i.e. latency.
“The entire purpose for the existence of C-to-gates compilers is to make it easy to go from C code into gates,” he states. “They don’t provide you with a nice hardware-optimised solution. You get something that’s very large and very slow.”
There is a raging argument in the industry around this whole topic, according to Dini.
“In the ASIC prototyping world, there have been attempts to get from C to gates for many years. And there have been any number of companies founded and failed on the basis of their C-to-gates technology, because it just doesn’t work out. From our point of view, as FPGA guys, we don’t understand how you can even begin to minimise latency in terms of clock cycles without designing as close down to the actual logic as possible. There are companies claiming you can do it in C or even in MATLAB, but we don’t agree,” he says.
Terry Keene, CEO of iSys Integration Systems, an independent financial technology consultancy that specialises in next generation trading technology and exchange infrastructures, agrees that efficiency will always be compromised when going down the compiler route.
“People go to FPGAs because they are looking for the most efficient, the fastest, lowest latency, lowest jitter platform they can find,” he says. “But if you put a whole bunch of overhead in just because you tried to make it easier to program, how much have you had to trade off?”
It’s a good question. But all of this means that there is now a growing demand in the financial markets for programmers and engineers who can work directly in VHDL and Verilog.
“Nobody knows how to write parallel code except the HPC guys,” asserts Keene. “A programmer writing an application to do financial transactions is unlikely to know how to write parallel code. FPGAs are still the realm of the electrical engineer as opposed to the computer science engineer. And those guys have now discovered that what they know is really valuable, so they’re charging an arm and a leg for it.”
Does that mean that it is difficult to find programmers or engineers with these skill sets?
Not necessarily, according to Arista’s Ciarleglio.
“It’s no more difficult than hiring mathematicians and physicists,” he says. “It’s just a different skill set that typically wasn’t around in financial services until recently. There is a decent amount of talent out there and people who know VHDL and Verilog have now managed to find their way into the financial world, although they’re not necessarily living in New York or Chicago or London.”
Now that such talent is becoming more widely available, will we see much wider adoption of FPGAs right across the industry?
Terry Keene has his doubts.
“Like everything else,” Keene says, “the pendulum swings. When everyone found out what FPGAs were without knowing all the details, they all said ‘we’ve got to use FPGAs’ and the pendulum thrashed way over to that side. Now people are finding out that if you’re not clever, effective and efficient in both designing and writing the programs for them - and more importantly testing them to see if they do what you expected them to do - then FPGAs don’t buy you much at all except a lot of pain, heartache and money spent. The pendulum is slowly swinging back to the point where people are realizing that maybe FPGAs are not the nirvana, even though there are some things they do really well.
“One interesting area to look at,” believes Keene, “is taking some of the communications technology and building a programmatic interface to it, to be able to put it on the NIC card and build it into a data flow that could then be connected to an FPGA or a CPU or even a GPU. You could then start taking advantage of a lot of stuff Intel are now building into their chips. They’ve now put some pretty significant maths functions directly into hardware. And as they continue to do that, HFT firms will most likely gain more of an advantage by using those chips for their algorithms rather than trying to move them onto FPGAs.
"If you want to stay ahead as an HFT,” concludes Keene, “you’ve got to be optimising your algorithms almost on a daily basis, making subtle changes continuously. And you can’t do that with FPGA technology.”
Not yet, anyway. But who knows what the future holds?
In part four of this series, which will be published at www.thetradingmesh.com/pg/fpga next week, we will drill down into some of the specific challenges firms face when working with FPGAs.