The Trading Mesh

SubMicroTrading Open Source Ultra Low Latency Trading Framework

Mon, 17 Aug 2015 09:28:39 GMT           


SubMicroTrading is a new open source component-based trading framework. It has been designed from the ground up around two core principles: minimal latency and maximum throughput. SubMicroTrading contains various components including:


  • FIX engine
  • Order Management System (OMS)
  • Market data handlers
  • Book management
  • Exchange trading adapters
  • Basic exchange simulator
  • Highly concurrent exchange agnostic strategy container


SubMicroTrading is a million-dollar trading toolkit with over a quarter of a million lines of code, now open source and available on GitHub.


4 micros average tick-to-trade, wire to wire, at 800,000 ticks/second (MAX TCP replay ... 2 micros in the Java process)



In 2008 the fastest third-party OMS written in Java had an average order-processing latency of 400 microseconds. At that time anything sub-millisecond was considered fast. My background in finance, real-time systems, compilers and telecoms gave me the opportunity to design and build a new Order Management System for a major bank. This new system was deployed globally and achieved average latencies of less than 100 microseconds. In 2010 I left the bank and founded Low Latency Trading Limited. The objective was to build a new framework for trading systems that would allow even lower latencies, under 20 micros. To achieve this, the framework was designed from the ground up to make every line of code as efficient as possible.


Achieving Holistic Latency

Jitter and latency in an application can come from many sources. To achieve the best performance you must consider the system as a whole.


The BIOS has various power-saving settings which need to be disabled, along with SpeedStep and hyper-threading, as you want cores to run at full speed without interruption.


The OS has a number of features which can cause jitter, e.g. the irqbalance daemon, paging and context switching, plus other key configuration such as TCP tuning parameters, all of which affect performance. CPU core isolation and OpenOnload are big winners here. (Solarflare wrote a good guide on system configuration for best performance.)


The programming language is another consideration. Java has GC and JIT compilation, but it also allows in-lining of virtual methods. GC pauses are easy to avoid with object pooling, and JIT warm-up is mitigated by running systems 24*7 (as in the old days) and by warm-up code.
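To illustrate the pooling point: if every inbound message allocated a fresh object, steady-state trading would generate garbage and eventually a GC pause. A minimal sketch of the idea follows; `Order` and `ObjectPool` are illustrative names, not SubMicroTrading classes.

```java
import java.util.ArrayDeque;

// Pre-allocate reusable objects at startup so the steady-state hot path
// creates no garbage and therefore triggers no GC pauses.
final class Order {
    long id;
    double price;
    int qty;
    void reset() { id = 0; price = 0.0; qty = 0; }
}

final class ObjectPool {
    private final ArrayDeque<Order> free = new ArrayDeque<>();

    ObjectPool(int size) {                 // allocate everything up front
        for (int i = 0; i < size; i++) free.add(new Order());
    }

    Order acquire() {                      // reuse; allocate only if exhausted
        Order o = free.poll();
        return (o != null) ? o : new Order();
    }

    void release(Order o) {                // recycle after the order is done
        o.reset();
        free.add(o);
    }
}
```

In a real system each thread would own its pool, avoiding any synchronisation on acquire/release.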


Application design is the single biggest factor in achieving ultra low latency. It cannot be achieved by attempting to tune a poorly designed system, and profiling at the nanosecond level just doesn't work. The core principles are to minimise expensive operations such as object creation, memcpy, map lookups, synchronisation, try/catch handlers and nested looping. Throughput during exchange busy periods requires concurrency. Concurrency requires discrete core thread affinity and spin locks, along with thread multiplexing and careful mapping of threads to cores based on the target hardware.
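Spin locks pair naturally with pinned cores: a spinning thread burns a few CPU cycles instead of paying the microseconds a kernel park-and-wake would cost. A minimal busy-spin lock might look like the sketch below (`SpinLock` is an illustrative name, not a framework class; `Thread.onSpinWait()` requires Java 9+).

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Busy-spin mutual exclusion: the waiting thread never yields to the
// scheduler, so acquisition latency stays in the nanosecond range.
final class SpinLock {
    private final AtomicBoolean locked = new AtomicBoolean(false);

    void lock() {
        while (!locked.compareAndSet(false, true)) {
            Thread.onSpinWait();   // CPU hint: we are in a spin loop
        }
    }

    void unlock() {
        locked.set(false);         // volatile write publishes the release
    }
}
```

This only pays off when each spinning thread has its own isolated core; on an oversubscribed box, spinning steals cycles from the lock holder and makes things worse.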


Hidden latency can also be caused by external factors, such as slow consumption on the exchange side leaving packets enqueued in the TCP send buffer. Figure 1 depicts a backlog of UDP messages in a single-threaded system, as may be experienced during spikes when market data is generated faster than the single-threaded process can consume it. The order generated off the back of tick T1 is in the Exchange Session Writer, about to be encoded. Meanwhile ticks T2 to T12 are awaiting processing in the operating system buffers.



There is a misconception that single-threaded systems are faster and easier to write and maintain than multi-threaded systems. This is not true of low latency single-threaded systems, especially those that have to share the thread of control across many logical tasks.


Reading input, order management, risk checks, market data processing and exchange order writing all sit naturally in a more concurrent form. Each can be written as single-threaded code running in its own thread, passing messages to the others via fast queues or ring buffers (as seen with the Disruptor pattern). In a low latency single-threaded system, by contrast, all code must be written to ensure it doesn't exceed a fair time slice, and this multiplexing creates some very complex code.
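A single-producer/single-consumer ring buffer is the usual fast queue between such pipeline stages: producer and consumer share only two sequence counters, so handoff needs no locks at all. The sketch below shows the idea in miniature; it is illustrative only, not the Disruptor API or a SubMicroTrading class.

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal SPSC ring buffer: power-of-two capacity lets a bit-mask
// replace the modulo when wrapping the sequence onto a slot index.
final class SpscRing {
    private final long[] slots;
    private final int mask;
    private final AtomicLong head = new AtomicLong(); // next write sequence
    private final AtomicLong tail = new AtomicLong(); // next read sequence

    SpscRing(int capacityPow2) {
        slots = new long[capacityPow2];
        mask = capacityPow2 - 1;
    }

    boolean offer(long v) {
        long h = head.get();
        if (h - tail.get() == slots.length) return false; // buffer full
        slots[(int) (h & mask)] = v;
        head.set(h + 1);          // volatile write publishes the slot
        return true;
    }

    long poll() {                 // returns -1 when empty (sentinel)
        long t = tail.get();
        if (t == head.get()) return -1;
        long v = slots[(int) (t & mask)];
        tail.set(t + 1);          // frees the slot for the producer
        return v;
    }
}
```

Production implementations go further (cache-line padding to avoid false sharing, lazy sequence publication, batch draining), but the lock-free two-counter handoff is the core of the pattern.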


Horizontal / pipeline scaling is the ability to split the workflow into individual components which can run concurrently. From the above example the following components can run concurrently:


  • Market Data Handlers
  • Exchange Line Handlers 
  • Book Management
  • Strategy Instances


Vertical scaling is the ability to run multiple horizontal pipelines concurrently, for example decoding CME market data concurrently from different multicast groups. Vertical pipes don't have to be complete duplicates of the horizontal pipeline; instead they can join and split with the use of queues and routers. Key to success is understanding the throughput of each component and keeping cross-thread contention to an absolute minimum.


Controlled thread multiplexing allows a pinned core to be shared across a number of configured tasks. Different configurations are used for different target hardware.
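The actual allocation map file format ships with the repository; purely to illustrate the idea, a hypothetical mapping for an 8-core host might look like this (every key and value below is invented for illustration):

```properties
# Hypothetical thread-to-core map (illustrative only - see the real
# allocation map file in the SubMicroTrading repo for the actual format).
# Cores 2-6 are isolated from the OS scheduler; core 7 is shared.
marketDataHandler.CME   = core2
bookManager             = core3
strategyContainer       = core4,core5
exchangeSessionWriter   = core6
# low-priority tasks (logging, admin) multiplexed on one shared core
background              = core7
```

The point is that the same binary can be retuned for a 4-core or 16-core box by editing the map, not the code.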


Figure 2 shows how concurrent processing, both across components and by pipeline, increases efficiency and reduces "real" wire-to-wire latency. At the point where the order generated off the back of T1 is in the Exchange Session, ticks T2 to T12 have been read from the O/S buffers and are being processed at various stages of the pipeline.




Benchmark results are almost as bad as politicians' spin at election time. Is the benchmark representative of real conditions in the market? Is it free of skewing, e.g. wire-to-wire measurement using network packet capture? How do you compare benchmark results when in reality they are apples and oranges? What does it mean to decode a market data tick ... what about updating the order book and ensuring no concurrent dirty reads? Is it a proper book that is thread safe and useful?


The STAC-T1 benchmark was a good starting point, measuring the time to process a tick and send an order to market. The input market data is canned, and the test requires network capture for wire-to-wire timing, which is independent of the server running the trading application.


This is still too simplistic a scenario. When developing benchmarks to measure your software you need to cater for both average and peak market conditions. You need to consider how many books to subscribe to, which trading strategy to use and how many strategy instances to run. Understanding the data topology and the required performance characteristics is key. A benchmark where 95% of updates are on a single book will give very different results to a random spread across 1000 books.


Don't over-focus on the 99.99th latency percentile ... if your 99th percentile is fastest then 99 times out of 100 you will win. The assumption that the top 1% of jitter in trading applications coincides with arbitrage opportunities in the market is often pure speculation.
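To make the percentile discussion concrete, here is a sketch of extracting percentiles from raw latency samples; production systems usually record into a histogram (e.g. HdrHistogram) rather than sorting raw arrays, so treat this as illustration only.

```java
import java.util.Arrays;

// Nearest-rank percentile over raw latency samples in nanoseconds.
// Simple and exact, but O(n log n) per query - fine for offline
// benchmark analysis, not for the hot path.
final class Percentiles {
    static long percentile(long[] samplesNanos, double pct) {
        long[] sorted = samplesNanos.clone();
        Arrays.sort(sorted);
        // nearest-rank: the smallest value with at least pct% of samples at or below it
        int idx = (int) Math.ceil(pct / 100.0 * sorted.length) - 1;
        return sorted[Math.max(idx, 0)];
    }
}
```

Comparing the 50th, 99th and 99.99th values side by side is what reveals whether your tail is jitter you can fix or noise you can ignore.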


Don't benchmark just market data processing or exchange order encoding; benchmark the whole of the main use cases wire to wire. Otherwise you may miss hidden latency between components.


Having good repeatable and controlled benchmarks for all key use cases is essential for a deep understanding of system performance and tracking the impact of development changes over time.


Colocation OMS

The first application to be built with the new framework was an ultra low latency order management system targeting FX colocation. Although a technical success, the advent of sponsored access for DMA, and the perception that FPGA kill-switch cards (touted at 8 micros at the time) were faster than software-based solutions, killed off that application as a business opportunity. That said, the order management system contained many components that could be reused in an algo trading system, including the FIX engine, generated exchange codecs and exchange line handlers.


Exchange Agnostic Strategy Container

I have previously worked on strategy trading containers that allowed rapid prototyping with normalised market data and exchange agnostic trading events. This worked very well, so I decided to write one for SubMicroTrading. The first requirement was a safe, non-blocking, concurrent book manager which allows algos to skip stale tick updates. This component processes the normalised market data POJO events, updating the master book as required. Book changes invoke listeners without map lookups, and the strategies then process the event, snapping the book if required. A simple order API is provided for generating and routing exchange orders, and an exchange agnostic event handler is provided for processing the exchange reply POJOs.
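One way to picture stale-tick skipping: the book carries a sequence number, and a strategy that falls behind processes only the latest state rather than every intermediate notification. The sketch below shows the idea; `Book` and `Strategy` here are illustrative stand-ins, not the actual SubMicroTrading interfaces.

```java
// Stale-tick skipping sketch: the book manager bumps a sequence number
// on each update; a strategy notified late sees only the latest state
// and skips notifications whose state it has already consumed.
final class Book {
    private volatile long seqNum;
    private volatile double bid, ask;

    void update(long seq, double b, double a) {
        bid = b;
        ask = a;
        seqNum = seq;       // published last: readers never see a newer
    }                       // seq with older prices
    long seq()    { return seqNum; }
    double bid()  { return bid; }
    double ask()  { return ask; }
}

final class Strategy {
    private long lastSeen;
    int snapsTaken;         // how many distinct book states we acted on

    // invoked on each change notification; snaps only if the book moved
    void onBookChanged(Book book) {
        long s = book.seq();
        if (s == lastSeen) return;   // already consumed this state: skip
        lastSeen = s;
        snapsTaken++;                // snap the book and run the algo
    }
}
```

The payoff is during data spikes: ten queued notifications for the same book collapse into one snap of the freshest prices, instead of ten runs over stale ones.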


Open Source

The reason for open sourcing SubMicroTrading is to showcase how Java can be successfully used for ultra low latency trading. In my opinion, the benefits of faster development and ease of hiring make Java a winner. I have no more time to invest in this project, and I sincerely hope that someone picks it up and uses it successfully. If you are a C++ shop, or use a hybrid of software and FPGA, then why not try it out and benchmark it wire to wire against your existing system? Please bear in mind that to fairly benchmark the system you will need to configure it correctly for your target hardware (check out the thread to CPU core allocation map file).
