The Trading Mesh

Derivative Pricing on Altera's OpenCL-enabled FPGAs

Wed, 10 Apr 2013 06:13:41 GMT           

By Paul Sutton at the Xcelerit Blog

 

FPGAs are programmable hardware devices, traditionally used in the signal processing domain for real-time number-crunching where high performance and low power consumption are paramount. For financial services, their deterministic high performance and low latency make FPGAs a perfect fit for high-frequency trading, and that’s where FPGAs are typically used in banks and hedge funds. However, the complexity of developing in VHDL or Verilog has been a major barrier to adoption in derivatives pricing and risk management applications, i.e. compute-intensive analytics. In this blog post, we look into using FPGAs for this type of algorithm, using the OpenCL-enabled PCIe-385N FPGA board from Nallatech that we’ve just received. It features the powerful Altera Stratix V A7 FPGA. We’ve put it to the test using a complex derivative pricing algorithm as the example.

 

Nallatech PCIe-385N

This board comes in a PCI Express form factor and can be plugged into workstations easily. It is small (low-profile, half-length PCIe) and low power (tens of watts). It can be configured with 8 or 16GB of memory, leaving enough headroom for most financial applications. The board supports Altera’s SDK for OpenCL, which allows it to be programmed using higher-level software tools. This SDK automatically compiles and synthesizes the OpenCL kernel code into FPGA logic, creating deep parallel pipelines and adding the interfacing logic needed to control execution from the host CPU.


Nallatech PCIe-385N FPGA Card with Altera Stratix V FPGA

 

Algorithm

As a test algorithm, we’ve used a Monte-Carlo LIBOR swaption portfolio pricer. It prices a portfolio of 15 swaptions on the LIBOR rate, using thousands of Monte-Carlo paths. In each path, a potential future development of the LIBOR interest rate is simulated at 80 time steps, employing a LIBOR market model and using normally-distributed random numbers. For each simulated path, the value of the swaption portfolio is computed by applying a portfolio pay-off function. The overall value of the portfolio is then estimated by computing the mean across all paths. The equations for computing the LIBOR rates and pay-offs are given in Prof. Mike Giles’ notes (Oxford University). The algorithm is depicted by the dataflow graph below:
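To make the structure of such a pricer concrete, here is a minimal Python sketch of the three stages: path simulation, pay-off, and mean. It is a heavily simplified stand-in and not the LIBOR market model from Prof. Giles’ notes: it evolves a single driftless lognormal rate instead of a full forward curve, uses a plain call-style pay-off, and applies no discounting. The function names, the parameters (`l0`, `vol`, `dt`) and the strike values are all assumptions made for illustration.

```python
import math
import random

# Hypothetical portfolio: 15 strikes, one per instrument (illustrative values only).
STRIKES = [0.03 + 0.002 * i for i in range(15)]


def simulate_path(rng, n_steps=80, l0=0.05, vol=0.2, dt=0.25):
    """Evolve one rate over n_steps using normally-distributed shocks.

    Simplified driftless lognormal dynamics, NOT the full LIBOR market model.
    """
    rate = l0
    for _ in range(n_steps):
        z = rng.gauss(0.0, 1.0)  # standard normal random number
        rate *= math.exp(-0.5 * vol * vol * dt + vol * math.sqrt(dt) * z)
    return rate


def portfolio_payoff(rate, strikes):
    """Sum of call-style pay-offs max(rate - strike, 0) over the portfolio."""
    return sum(max(rate - k, 0.0) for k in strikes)


def price_portfolio(n_paths, strikes, seed=42):
    """Estimate the portfolio value as the mean pay-off across all paths."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        total += portfolio_payoff(simulate_path(rng), strikes)
    return total / n_paths


print(price_portfolio(2000, STRIKES))
```

Each path is independent of the others, which is what makes this kind of algorithm a natural candidate for a deep parallel pipeline; only the final mean combines results across paths.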

 

LIBOR Swaption Portfolio Valuation

 

Test Setup

We’ve run the described algorithm on the Nallatech card with the core algorithm implemented completely on the FPGA. That is, the random number generation, path computation, and mean reduction all run in FPGA logic. The overall application is controlled by software running on the host CPU. The FPGA uses single-precision floating point in all computations.
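As a side note on what single precision means for a mean reduction, the following Python sketch emulates single-precision rounding with the standard `struct` module and compares a running-sum mean against Python's native double precision. It says nothing about how the FPGA design actually accumulates its results; the helper names `to_f32`, `mean_f32` and `mean_f64` and the test data are assumptions chosen purely for illustration.

```python
import struct


def to_f32(x):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def mean_f32(values):
    """Running-sum mean with every intermediate result rounded to single precision."""
    total = 0.0
    for v in values:
        total = to_f32(total + to_f32(v))
    return to_f32(total / len(values))


def mean_f64(values):
    """Reference mean in double precision."""
    return sum(values) / len(values)


values = [0.1] * 100_000
print("single precision mean:", mean_f32(values))
print("double precision mean:", mean_f64(values))
```

The gap between the two results grows with the number of accumulated values, so it is worth keeping in mind when comparing single-precision results against a double-precision reference.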

The following test system was used:

  • CPU: 2 Intel Xeon E5620 processors, 4 cores each
  • Accelerator: Nallatech PCIe-385N with Stratix V A7 FPGA
  • OS: RedHat Enterprise Linux 5.4 (64bit)
  • RAM: 24GB
  • FPGA Design Suite: Altera Quartus II 12.1 SP1, 64bit
  • OpenCL SDK: Altera SDK for OpenCL version 12.1 beta
  • Host Compiler: GCC 4.1

The FPGA resource utilisation is as follows:

Resource                     Usage
Logic Utilisation            54%
ALUTs Used                   27%
Dedicated Logic Registers    26%
Memory Blocks                91%
DSP Blocks                   57%

 

As can be seen, the design is dominated by memory blocks (used for storing temporary arrays and caching data), followed by DSP blocks (which perform the floating point calculations).

 

Performance

Note: These performance numbers are indicative only, as they have been recorded with a beta version of the first OpenCL SDK from Altera. The purpose of this blog post is to show that FPGAs have evolved from a specialist hardware domain into a platform that can be smoothly handled by a software developer without FPGA experience.

We’ve measured the computation times of the FPGA version and compared them to a sequential CPU reference running the same code. The speedup factors of the FPGA-based computation over the sequential CPU reference take the full algorithm execution time into account. They are illustrated in the graph below for varying numbers of paths, and in the table that follows.

 

Speedup FPGA vs. Sequential CPU (LIBOR Swaption Portfolio Pricer)

 

Paths      4K       16K      64K      256K     1024K
Speedup    35.0x    29.3x    27.9x    27.6x    27.6x

 

Discussion

These numbers clearly show that the FPGA delivers high performance: up to 35x faster than the sequential CPU implementation. Considering that it consumes an order of magnitude less power than a server-grade CPU, this is even more impressive. The deterministic performance of FPGAs can also be seen in the speedups above: the execution time per Monte-Carlo path is almost exactly constant on the FPGA, while on the CPU it improves with more paths. This is why the speedup curve in the figure above decreases slightly as the number of paths grows.
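The reasoning about the speedup curve can be checked with a little arithmetic. The speedup at N paths is the CPU time divided by the FPGA time, which equals the ratio of the per-path times. If we assume the FPGA per-path time is exactly constant (the measurements only say "almost exactly"), then the CPU per-path time is proportional to the speedup. The short Python sketch below applies that assumption to the measured speedups from the table; the function name is made up for illustration.

```python
# Measured speedups from the table above.
paths = ["4K", "16K", "64K", "256K", "1024K"]
speedup = [35.0, 29.3, 27.9, 27.6, 27.6]


def relative_cpu_time_per_path(speedups):
    """CPU per-path time relative to the first entry.

    Assumes the FPGA per-path time is exactly constant, so that
    speedup(N) = cpu_time_per_path(N) / fpga_time_per_path.
    """
    base = speedups[0]
    return [s / base for s in speedups]


rel = relative_cpu_time_per_path(speedup)
for p, s, r in zip(paths, speedup, rel):
    print(f"{p:>6}: speedup {s:5.1f}x, CPU time per path relative to 4K: {r:.3f}")
```

Under this assumption, the CPU spends roughly 21% less time per path at 1024K paths than at 4K paths (27.6 / 35.0 ≈ 0.79), which accounts for the whole drop from 35.0x to 27.6x.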

It should be noted that synthesising hardware designs, even when they are generated automatically, is a much more complex process than compiling a piece of conventional software: synthesis for the above example took nearly 4 hours. The cycle of fixing some code and testing it is therefore much more involved. If you are interested in learning more about efficient software development techniques for FPGAs, just drop us a line and we’ll get back to you with more details.

 

The opinions and writing contained in this article are of the author alone and do not necessarily represent those of HFTReview.com.

