The Trading Mesh

The LMAX Disruptor: Open-Sourced Mechanical Sympathy in Action

Mon, 01 Aug 2011 08:21:00 GMT           

An Interview with Martin Thompson

 

In this interview for the High Frequency Trading Review, Mike O’Hara talks to Martin Thompson, Chief Technical Officer of LMAX, about the design principles behind the Disruptor, the Concurrent Programming Framework for high-performance, low-latency transaction processing, which is used to run LMAX’s own trading platform and was open-sourced earlier this year.

 

HFT Review: Martin, before we look in detail at what the Disruptor actually does and how it does it, can you give us some background on LMAX and how the Disruptor came about?


Martin Thompson: Yes, LMAX is an online exchange for retail investors who want to trade FX and CFDs (contracts for difference) on a range of different products.

 

In terms of background, our parent company Betfair followed the typical trajectory of a successful Internet company and became constrained by its own technology as it grew. So the question was: how could we build a new exchange that would put the growth issues to bed for a while and cope with any capacity demands we might face over the next five years?

 

Our goal was to come up with an exchange that was one hundred times the throughput capacity of the current Betfair exchange, so we called that project 100X. It was a successful project; the final result was a prototype that was capable of 120X. That heavily influenced the design of the exchange here at LMAX when we started.

 

HFTR: What kind of message volumes and throughput are we talking about here?


MT: We internally set ourselves a goal of designing an exchange that could do over 100,000 orders per second, with less than one millisecond latency. And although we’re not seeing those transaction volumes yet, we’re building systems capable of scaling to more than that.

 

HFTR: What were the main challenges you faced when building this infrastructure?


MT: One of the things we learned very quickly as we started profiling this application was that we had significant latency. In the retail space, latency is critical: people manage their risk when getting out of positions by getting a closing order onto the order book quickly. So we looked into the cause of that latency, and we discovered that the queues between the different stages of processing an order, or managing someone’s risk, were actually taking up more time than the business logic itself.

 

We were using an architecture back then called SEDA (Staged Event-Driven Architecture), where you break the process down into a number of steps. A CPU thread is allocated to each step: it does its part of the work, puts the result onto a queue for the next step to pick up, then takes the next item from its own queue, so the whole thing works like a pipeline. But we discovered that putting things onto the queues and taking things off them was taking the most time, introducing a lot of latency into the process.
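The staged model described above can be sketched in a few lines of Java. This is purely illustrative (the class and stage names are mine, not LMAX’s code): each stage runs on its own thread and hands results to the next stage through a bounded queue, and it is exactly that queue hand-off where the latency hides.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Minimal sketch of a SEDA-style pipeline: each stage has its own thread
// and passes work to the next stage through a bounded queue.
public class StagedPipeline {

    // Runs a two-hop pipeline for a single message and returns the result.
    static String runPipeline(String message) {
        BlockingQueue<String> stage1ToStage2 = new ArrayBlockingQueue<>(1024);
        BlockingQueue<String> stage2ToStage3 = new ArrayBlockingQueue<>(1024);

        Thread stage2 = new Thread(() -> {
            try {
                // Stage 2: take from the inbound queue, do its piece of
                // work (stubbed as an uppercase transform), and pass it on.
                stage2ToStage3.put(stage1ToStage2.take().toUpperCase());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        stage2.start();

        try {
            stage1ToStage2.put(message);            // stage 1 produces
            String result = stage2ToStage3.take();  // stage 3 consumes
            stage2.join();
            return result;
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(runPipeline("new order")); // prints "NEW ORDER"
    }
}
```

Every `put` and `take` here crosses a thread boundary through a shared, contended structure, which is the cost the interview is describing.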

 

We also realised that lots of things couldn’t run in parallel. For example, inside our core exchange, whenever we get a message off the network, it’s just a stream of bytes, which we might want to do a number of things with. We might want to write that message to disk, so we can recover it at some point in the future. We might want to send it to another server so that we can have hot failover. We might also have to turn that stream of bytes into an object that exists in the programming language we’re dealing with, so it actually processes the message at that point in time. Now all three of those things are completely independent and can happen at the same time, but we found that with the previous model, we were doing one, then the next, followed by the next, all of which was adding latency.
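The three independent activities just described, journalling to disk, replicating to a failover node, and unmarshalling the bytes into an object, can be sketched as concurrent consumers of the same message. Again this is a hedged illustration with stubbed-out work, not LMAX’s implementation:

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: journalling, replication and unmarshalling all read the same
// inbound bytes and are independent, so they can run concurrently
// instead of one after the other.
public class ParallelConsumers {

    // Returns the number of consumers that completed (3 when all ran).
    public static int consumeInParallel(byte[] message) {
        AtomicInteger done = new AtomicInteger();
        CountDownLatch latch = new CountDownLatch(3);

        Runnable journal = () -> {
            // Write the raw bytes to disk for recovery (stubbed out here).
            done.incrementAndGet();
            latch.countDown();
        };
        Runnable replicate = () -> {
            // Send the raw bytes to another server for hot failover (stubbed).
            done.incrementAndGet();
            latch.countDown();
        };
        Runnable unmarshal = () -> {
            // Turn the bytes into a domain object (stubbed as a String).
            String order = new String(message, StandardCharsets.UTF_8);
            done.incrementAndGet();
            latch.countDown();
        };

        new Thread(journal).start();
        new Thread(replicate).start();
        new Thread(unmarshal).start();
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return done.get();
    }
}
```

None of the three consumers mutates the message or depends on the others, which is what makes running them in parallel safe.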

 

So we asked ourselves how we could make all of this happen at the same time without the cost of having queues between the different stages. Or how do we just make queues much faster? We were kicking around these ideas for a while and we did manage to create new types of queues, new implementations that were much faster than the standard available ones, but they still had issues and we couldn’t run all these steps in parallel.

 

Then eventually we came up with the design that ultimately became the Disruptor.

 

HFTR: I understand that one of the key problems you faced was how to deal with contention, is that correct?


MT: Yes, as with anything in life where two or more things contend for a given resource, you have to work out how to arbitrate that contention. With a queue, you’ve fundamentally got two “writers”, because you’ve got something putting things onto the queue (the producer) and something taking things off it (the consumer), and in fact you can have multiple producers and multiple consumers. They all modify the queue structure by adding things to it or removing things from it, and that gives you a contention problem. You need to manage all of that interaction, and you can do it in one of two ways.

 

One way to manage that interaction is to put a lock around the whole structure. What happens then is that you need arbitration: you go down to the operating system kernel, which says to one thread, “You got the lock first, so you can continue to run”, and to the other, “You’re second, so I’m going to put your process to sleep until the first guy has finished and released the lock”. That carries a lot of overhead, because of things like the ring transitions needed for the kernel to run in privileged mode to stop and start the threads. Meanwhile, the two threads will have pulled data into their caches (and caches are usually accessed one or two orders of magnitude faster than main memory). When the kernel gets involved, it pulls in its own data and ends up polluting the cache, evicting the data that had been “warmed up” by those two threads. So when those threads run again, they have to pull their data through from main memory once more. That adds a lot of overhead, and with that overhead can come significant latency.

 

HFTR: I believe you actually have some figures in your white paper (http://disruptor.googlecode.com/files/Disruptor-1.0.pdf) on how much latency is actually introduced by these processes?


MT: Yes, there’s one section where I run some comparisons: updating a variable; updating a variable with a lock; two variables contending on a lock, etc. For example, if I want to increment a single counter, I can do it 500,000,000 times in 300 milliseconds when there’s no locking and nothing else involved. If I end up with two threads contending to update that counter using locks, it’s 224,000 milliseconds, which is nearly four minutes! At that level of contention, the pain and expense are obvious.
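The style of comparison described here is easy to reproduce in miniature. The sketch below (my own code, with much smaller iteration counts than the white paper’s 500 million) contrasts a plain unsynchronised increment with two threads contending for a lock on the same counter; absolute timings depend entirely on the machine, so the point is only the relative cost.

```java
// Sketch of the white paper's comparison: an uncontended increment
// versus two threads contending on a lock for the same counter.
public class CounterCost {

    static long plainIncrement(long iterations) {
        long counter = 0;
        for (long i = 0; i < iterations; i++) counter++;
        return counter;
    }

    // Two threads share the counter and contend on the same lock, so the
    // kernel may be called in to arbitrate (park/unpark the loser).
    static long contendedLockedIncrement(long iterationsPerThread) {
        long[] counter = {0};
        Object lock = new Object();
        Runnable work = () -> {
            for (long i = 0; i < iterationsPerThread; i++) {
                synchronized (lock) {
                    counter[0]++;
                }
            }
        };
        Thread a = new Thread(work);
        Thread b = new Thread(work);
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return counter[0];
    }

    public static void main(String[] args) {
        long n = 1_000_000L;
        long t0 = System.nanoTime();
        plainIncrement(n);
        System.out.printf("plain:     %d us%n", (System.nanoTime() - t0) / 1_000);
        t0 = System.nanoTime();
        contendedLockedIncrement(n / 2);
        System.out.printf("contended: %d us%n", (System.nanoTime() - t0) / 1_000);
    }
}
```

On any modern machine the contended version is dramatically slower per increment, for the cache-pollution and arbitration reasons described above.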

 

The second way to manage contention in a queue is to use CAS (Compare And Swap) operations, where a single CPU instruction says, “set this variable to this value if its current value is what I expect it to be”. That way you don’t get the kernel involved for arbitration. It’s important that the compare and the swap happen together as one instruction, because somebody else could have snuck in and done a piece of work underneath you. The problem is that with something queue-based, it takes a minimum of two CAS operations to do anything, and CAS operations are fairly expensive in themselves, usually an order of magnitude more than a simple operation. On top of that, you’re still fundamentally touching the same structures. Imagine two threads going for something: one succeeds with the CAS, and the other misses and has to keep spinning and retrying until the first one concludes. OK, the kernel hasn’t got involved, but the CPU and the underlying infrastructure still have to fight between the two, co-ordinating ownership of that piece of memory, which bounces backwards and forwards between them in the CPU caches the whole time.
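In Java, the CAS retry loop just described looks like the following. This is an illustrative counter of my own, built on `AtomicLong.compareAndSet`, which compiles down to the hardware compare-and-swap instruction:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the CAS approach: no lock and no kernel arbitration, but a
// losing thread must spin and retry, and the cache line holding the
// counter still ping-pongs between the contending cores.
public class CasCounter {
    private final AtomicLong counter = new AtomicLong();

    public long increment() {
        long current;
        long next;
        do {
            current = counter.get();  // read the value we expect to see
            next = current + 1;
            // If another thread snuck in and changed the value, the CAS
            // fails and we loop to try again with the fresh value.
        } while (!counter.compareAndSet(current, next));
        return next;
    }

    public long get() {
        return counter.get();
    }

    public static void main(String[] args) {
        CasCounter c = new CasCounter();
        c.increment();
        c.increment();
        System.out.println(c.get()); // prints 2
    }
}
```

Note that correctness is preserved without a lock, but under heavy contention the `while` loop can spin many times, which is the residual cost the interview is pointing at.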

 

The point is, when you’ve got contention, you’ve got a problem. With CAS, you might not have the hugely expensive arbitrator in place (i.e. the kernel), but you’ve still got the problem where two things are fighting over the same thing. A better solution would be to look at how to remove that contention problem altogether. And that’s what we’ve done with the Disruptor.

 

HFTR: So what are the key, fundamental elements of the Disruptor’s design? How is it different?

