CPU vs FPGA - make RAM the new disk!
Wed, 15 Feb 2012 05:15:00 GMT
I had a great conversation with Rolf Andersson of Pantor last week and the discussion turned to FPGAs and their use in High Performance Trading. I've seen them used as "circuit breakers" for in-flow risk checking and for some limited algo trading, but Rolf has seen sophisticated algos running on complex FPGAs. At a recent discussion of FIX trading engine performance, Ben Stephens described how Nomura is committing to FPGA development. Things are changing in the low latency infrastructure space too: for example, SolarFlare, the low latency network folks, have discussed adding an FPGA to their NICs. Rolf's observation was that software programmers are responding to the FPGA challenge by becoming sensitive to CPU design, and now describe RAM as the new disk!
If RAM is the new disk?
When I was a performance analyst at Sun, one of the hardest tasks was persuading programmers that reading from disk was bad and best avoided. Large battery-backed cache arrays from NetApp, EMC et al. go some way to addressing the issue, but comparing the read performance of CPU cache (60ns) to memory (400ns) to disk (10ms) reveals the shocking difference. Missing the cache is bad; fetching the data from disk is plain awful. Rolf's idea that RAM is the new disk points out that a CPU cache miss hurts performance as badly in the low latency context as a disk read does in conventional programming.
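To put rough numbers on how much a miss costs, here is a minimal sketch that computes the average access time for a given cache hit ratio, using the 60ns, 400ns and 10ms figures quoted above. The function names and the simple two-level model (every access is either a cache hit or a trip to memory) are my own assumptions for illustration, not a measurement of any real system.

```python
# Latency figures quoted in the post, in nanoseconds.
CACHE_NS = 60
RAM_NS = 400
DISK_NS = 10_000_000  # 10 ms


def average_access_ns(hit_ratio, hit_ns=CACHE_NS, miss_ns=RAM_NS):
    """Average access time for a simple two-level model (an assumption):
    a fraction `hit_ratio` of accesses cost `hit_ns`, the rest `miss_ns`."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * hit_ns + (1.0 - hit_ratio) * miss_ns


def slowdown(fast_ns, slow_ns):
    """How many times slower the slow tier is than the fast tier."""
    return slow_ns / fast_ns


if __name__ == "__main__":
    for ratio in (1.0, 0.99, 0.9, 0.5):
        print(f"hit ratio {ratio:.2f}: {average_access_ns(ratio):.1f} ns on average")
    print(f"memory vs cache: {slowdown(CACHE_NS, RAM_NS):.1f}x slower")
    print(f"disk vs memory:  {slowdown(RAM_NS, DISK_NS):.0f}x slower")
```

Even at a 90% hit ratio the average climbs from 60ns to roughly 94ns under this model, which is why a latency-sensitive programmer starts to treat every miss as an event worth designing around.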
If nanoseconds were seconds
As a human I struggle to comprehend milliseconds, much less nanoseconds, so I use an analogy: what if nanoseconds were seconds? A 60 nanosecond CPU cache fetch would then take a minute, like me looking for a page on my desk and finding it a minute later. A cache miss that fetches from memory takes about 400 nanoseconds, or 6 to 7 minutes if nanoseconds were seconds. That's like me having to leave my desk and search for the page in a filing cabinet, an annoying interruption.
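The same conversion can be written down as a few lines of arithmetic. This is a small sketch, with a helper name of my own choosing, that treats each nanosecond as one second and breaks the result into days, hours, minutes and seconds. Applying it to the 10ms disk figure quoted earlier is my extension of the analogy rather than something worked through above.

```python
def as_human_scale(nanoseconds):
    """Pretend each nanosecond lasts one second and return the duration
    as a (days, hours, minutes, seconds) tuple."""
    seconds = int(nanoseconds)  # 1 ns -> 1 s on the human scale
    days, rem = divmod(seconds, 86_400)
    hours, rem = divmod(rem, 3_600)
    minutes, secs = divmod(rem, 60)
    return days, hours, minutes, secs


if __name__ == "__main__":
    for label, ns in (("cache", 60), ("memory", 400), ("disk", 10_000_000)):
        d, h, m, s = as_human_scale(ns)
        print(f"{label:>6}: {d}d {h}h {m}m {s}s")
```

On this scale the cache fetch is 1 minute, the memory fetch is 6 minutes 40 seconds, and the 10ms disk read stretches to more than 115 days.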
Cache sensitive programming
The latest generation of low latency network cards uses both PCIe 3.0 and a technique called kernel bypass to write a data packet straight into the user space of an application at a blistering pace. If the programmer can keep the processing in cache (there's 32MB), then there's every chance the algorithm will perform as well as FPGA.
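One way to reason about "keeping the processing in cache" is a back-of-the-envelope budget: how many fixed-size records fit in 32MB? The sketch below assumes a made-up 24-byte record layout (two 64-bit fields and two 32-bit fields) and reads 32MB as 32 * 1024 * 1024 bytes. Both are illustrative assumptions, and the result is an upper bound, since real caches are shared with code, stacks and other data.

```python
import struct

# Assumption: "32MB" of cache is taken as 32 * 1024 * 1024 bytes.
CACHE_BYTES = 32 * 1024 * 1024

# Hypothetical packed record: two 64-bit integers and two 32-bit
# unsigned integers, with no padding ("<" disables alignment).
RECORD_FORMAT = "<qqII"
RECORD_BYTES = struct.calcsize(RECORD_FORMAT)  # 24 bytes


def records_that_fit(cache_bytes=CACHE_BYTES, record_bytes=RECORD_BYTES):
    """Upper bound on the number of whole records that fit in the cache."""
    return cache_bytes // record_bytes


def fits_in_cache(n_records, cache_bytes=CACHE_BYTES, record_bytes=RECORD_BYTES):
    """True if n_records of the given size stay within the cache budget."""
    return n_records * record_bytes <= cache_bytes


if __name__ == "__main__":
    print(f"record size: {RECORD_BYTES} bytes")
    print(f"records that fit: {records_that_fit():,}")
    print(f"1,000,000 records fit: {fits_in_cache(1_000_000)}")
    print(f"2,000,000 records fit: {fits_in_cache(2_000_000)}")
```

With these assumptions a little under 1.4 million records fit, so a working set of a million records stays on cache while two million spills into RAM.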
It's a delight to read the article. However, I'm not sure that I agree with "there's every chance the algorithm will perform as well as FPGA"; it would be good to see some comparisons in numbers.
Jack Harvard (2012/02/26)