Remember me

Register  |   Lost password?

The Trading Mesh

How to Maximize Java Application Performance in a Trading Environment

Tue, 02 Apr 2013 09:53:39 GMT           

An interview with Scott Sellers, President, CEO & Co-Founder Azul Systems

(2nd April 2013)

 

In the world of latency-sensitive trading and HFT, we often focus on how performance can be improved at the hardware layer, for example via hardware FPGA/GPU acceleration or high-speed networking components. But what about the software stack?

 

In this interview, we speak with Scott Sellers, President, CEO & Co-Founder of Azul Systems, about how application performance in a Java trading environment can be massively improved by using the right Java Virtual Machine (JVM).

 

HFT Review: Scott, welcome to HFT Review. Can we start with you giving some background on yourself and on your company Azul Systems?

 

Scott Sellers: I started the company in 2002 along with two other individuals, including Gil Tene, our CTO. Since its inception, Azul has served a variety of different markets; everything from financial services to e-commerce to insurance to telecommunications to big SaaS companies. Over the course of our history we’ve identified a tremendous market opportunity for addressing many of the shortcomings prevalent in today’s enterprise Java deployments. And because of the proliferation of Java that exists throughout the world - specifically in financial services - our value proposition has really resonated.

 

HFTR: What sort of shortcomings are you referring to?

 

SS: There are a number of limitations in today’s Java platforms that can limit the inherent scalability of applications. There are also limitations impacting the business requirements of an application in terms of the percentage of time, for example, that it needs to respond within a certain latency.  Customers also complain about the lack of visibility within the Java platform, so that when an application is in production it’s very difficult to really understand what’s actually going on inside the run-time environment. It’s like a black box, which is not very comforting for the owners of mission critical deployments. Those three key elements of pain that we continue to hear about -- scalability, inconsistent response times, and poor visibility -- are the three major problems that existed in 2002 when we started the company and they continue to exist today for people using off-the-shelf Java solutions.

 

HFTR: I’d like to drill down on the problem with inconsistent response times.  Can you expand on that? 

 

SS: Certainly.  Most applications that we’re typically asked to power with our products are going to have a set of SLAs or a set of multi-dimensional business requirements, and latency is always going to be one very critical element. For example, how fast can you process a trade?  How quickly can you pass a message through if you’re dealing with something like a FIX engine or a FIX gateway? How quickly can you update your risk analysis for a real time risk engine?  Latency is a critical element, particularly latency under load. What typically happens is that an application might be meeting the business requirements and processing what it needs to do within a certain latency threshold, but only under a certain set of load conditions. So it might work 99% of the time, but in mission critical deployments 99% is nowhere near good enough. Our customers require three, four or five 9’s, meaning 99.999% of the time the application must respond within a very strict latency requirement. 

 

The nature of the challenge is that although the average latency may be acceptable for the business, the most challenging latencies - the ones that occur in the three, four or five 9’s - are not acceptable.  And given the fact that this is such a brutally competitive environment where pennies sometimes have to be made millions of times over in a day in order to make money, if you are unable to have an absolutely consistent platform, not only may you be losing money, quite frankly you may be out of business.

 

HFTR: So where do the bottlenecks tend to occur?

 

SS: The problem is always one of peeling away the onion.  When you’re trying to deliver a truly world-class platform that meets all of the latency requirements (which inevitably are reducing over time), there can be many elements that can cause worst-case latencies. It could be slow hardware, it could be network limitations, it could be how the operating system is tuned, it could be the application itself, and then of course it could be the piece of software that runs the Java workload, i.e. the Java Virtual Machine (JVM).  Many of those elements in the overall application stack could be a limiting factor causing latency outliers. 

 

Intel has done a wonderful job of continuing to evolve the server platform and as long as an application is running on a reasonably modern piece of commodity hardware, typically it’s not going to be the server itself that’s the limiter.  Networking standards have also improved and networking vendors have done tremendous work in terms of reducing latencies associated with TCP/IP networking protocols. Five years ago network latency was a real issue, but that’s becoming less and less so with the evolution of the technology in that space. 

 

From the Operating Systems perspective, there has been a number of changes over the last few years to improve the scheduler within the Linux kernel, adding some tuning flags that help with latency sensitive situations, and distributions of real-time Linux are also beginning to be used.  And once the Operating System is tuned, it can generally be removed as a cause of latency outliers. 

 

Another culprit of latency outliers may be the application itself. Perhaps poor programming constructs or practices are causing long application delays. Developers can use profilers to determine areas in their code that are not performing and make required changes.  But once they’ve been addressed there’s only thing left, and that’s the JVM. And this is one of the most frustrating elements for users of Java because you only have so much control over a JVM, and what happens inside the JVM is very complicated. It’s responsible for compiling the Java application onto whatever platform it’s being run on (typically x86 Linux), it’s responsible for running the application itself, it’s responsible for all the threading involved in the applications, and very importantly it’s also involved with the garbage collection process, which quite frankly is the Achilles’ heel of the Java run-time.

 

HFTR: Can you spend a couple of minutes just explaining why the garbage collection process can have such a significant impact on the performance of running applications?

 

SS: Certainly. In Java - or any managed run-time type of language - it is not the responsibility of the application itself to worry about the memory management, it’s actually the responsibility of the run-time layer, and in the case of Java that’s the JVM.  The reason it’s important is that the management of memory is a very tricky thing to get right.  With programs written in C and C++ the memory management is the responsibility of the programmer or the application itself and as a result of that there are often bugs in the application and/or security vulnerabilities, because the way the memory is managed can be exploited by viruses and other threats that find holes in the application and do malicious things.

 

The combination of the trickiness of getting memory management code correct, bug free and scalable in performance, combined with the inherent security vulnerabilities of having memory managed by the application itself, led the computer science world to this concept of managed run-time environments, like Java. One of the reasons that Java has been and remains the world’s most popular programming language is because of the programmer efficiencies and the productivities of using a language where you only need to worry about coding the business logic, you don’t have to worry about subtleties such as memory management.  That’s the good news. 

 

The bad news is that when memory management is the responsibility of the run-time,  as the application is running and objects are being created, used, thrown away and then eventually reclaimed, the memory used by the running application gets fragmented. An analogy is disk fragmentation that occurs on your desktop computer or your laptop. At some point in time what used to be a nice, clean memory space ends up looking like Swiss cheese. As the application runs, at some point the memory becomes so fragmented that no more memory can be effectively allocated, so the memory needs to be cleaned up and defragmented. And just like when the disk defragmenter kicks in on your desktop computer and your work productivity plummets, the same thing is true in a running Java application.  When the garbage collector - which is the process within the JVM responsible for cleaning up this memory - kicks in, it can have a dramatic impact on the running Java application because, in essence, the application has to be paused while the JVM goes around and cleans up the memory, remarks it, packs it and condenses it, in order to be able to remove those holes and make it a nice, clean, continuous memory space so that memory can be allocated once again and the application’s happy. 

 

HFTR: And presumably the impact of the garbage collection process on the application’s latency can be quite significant?

 

SS: Well, the technology and the algorithms involved in doing this have improved over the years.  In the early days of Java this was truly an “all or nothing” type of capability.  Either the application was running or the garbage collector was cleaning up the memory, and when that was happening, it had to stop the application completely, what’s called a “stop the world” (STW) pause. Those familiar with tuning Java still have nightmares about STW pauses because if you’re an application with real-time requirements, it’s very difficult to predict when that garbage collector will actually kick in and freeze the application itself. And when that application is stopped, that means trades aren’t happening, messages aren’t being passed, risk is not being analysed, matching engines freeze, all these things pause whilst this crazy process called garbage collection is just doing its job. 

 

HFTR: Presumably this is where Azul comes in?

 

SS: Yes. The state of the market today is that there are really only two major JVMs being used within the financial services community on trading platforms. It’s either going to be the HotSpot JVM that originated from Sun or it’s going to be Azul’s Zing JVM. The difference between the two is that in the context of strict latency requirements, the garbage collection algorithms inside the HotSpot JVM were never designed for the demanding low latency requirements mandated by today’s financial services markets. Even though the application pauses due to garbage collection with Hotspot have improved over time, they’re nowhere near the levels that are required by today’s aggressive trading platforms, which require worst-case pauses in the low single digit milliseconds and often times in microsecond levels.

 

The big benefit of Zing is specifically around this point of the worst-case pauses associated with the garbage collection process. Not only have we dramatically reduced those pauses down to sub-millisecond levels, but we have also done it in a way that is independent of the size of the memory being used, which is another key element. HotSpot, with lots of tuning, lots of effort and very, very small heap sizes (the heap is the amount of memory used for a running Java application), can perhaps meet the business requirements of a given application.  But what happens is that today’s applications want to use more and more memory to be able to handle higher volumes and to do more in-memory analytics. So instead of accessing systems across a network (whether that’s a database system or talking to another platform), in today’s modern environments everything wants to be held in memory as close to the application as possible. As a result, the memory size requirements for these Java applications are growing all the time. Within HotSpot, there’s a direct relationship between the amount of memory being used and the worst-case pauses associated with garbage collection and there’s nothing you can do about it, the more memory you use, the longer the pauses are going to be. 

 

Understandably this causes angst amongst HotSpot users within the Java community. This pressure to keep memory small to try to keep the garbage collection pauses manageable is in conflict with modern computer science programming, where you want do more in-memory analytics, you don’t want to take the penalties of crossing networks and hopping over other systems. So the benefit of Azul’s Zing is the fact that not only does it slash latencies for any size memory, but it can also host microsecond-level worst-case pause times with hundreds of gigabytes of heap being used for Java processes. It’s truly the best of all worlds and quite frankly why Zing is being used so prevalently is today’s financial markets.

 

HFTR: Can you give us a few examples of where Zing has been implemented in the financial markets, how it’s helped and what kind of results have been achieved?

 

SS: Certainly.  In a general sense we’ve seen it used predominantly in the front and middle office.  Obviously those are the areas where latency and faster response times are really critical for the business.  Specifically, in the front office, we see Zing first and foremost in trading platforms that are processing millions of trades a day across the equities markets, across foreign exchange, across fixed income and a variety of other instruments, where the trader needs the absolute, most consistent performance and fastest results.  There are many other areas that require similarly consistent response times, so we power many of the FIX engines and FIX gateways that are commonly used throughout banks and hedge funds.  Our customers include market makers whose pricing engines need to be more responsive and more competitive all the time.  We power matching engines. 

 

In the mid-office, risk is moving from a batch process into real time and as a result, the SLAs for the real time risk engines are now as strict as the trading platforms. The need to be able to constantly assess risk in real time, to be able to re-analyse risk based on changing prices and changing positions means that we’re also seeing a lot of use of Zing in this area, in terms of being able to reduce latency to keep up this never ending need for faster and faster performance and more and more consistent latency.

 

HFTR: When swapping out an existing JVM with Azul Zing, how much tuning is typically required in order to get optimum performance?

 

SS: That’s a good question. The answer is perhaps ironic in the sense that we find a lot less tuning is needed with Zing compared to HotSpot.  Often when we engage with a business we find that they have spent literally man-years of effort to try to tune HotSpot to get to where they need to be in terms of the requirements of the business, and all of their efforts have ultimately resulted in a failure to meet those demands.  The beauty of Zing is that it’s a 100% compatible JVM, it’s fully compliant and compatible with all the Java specifications. So instead of what literally can be hundreds of tuning flags associated with HotSpot and all of the trial and error of trying to get it to do something reasonable to meet the business demands, Zing is incredibly simple and has very few flags and arguments associated with it. Most of the time the only thing a customer has to do is just set the heap size and they’re off and running.  The same application that runs on HotSpot runs on Zing out of the box, there’s no application change, re-coding, re-compilation, none of that stuff is necessary. 

 

We can very quickly derive quantitative metrics that compare Zing vs HotSpot, and what we typically see is that worst-case latencies will be reduced by factors of 100x or 1000x and average latencies also reduced. We also see maximum volume increasing because you can use more memory and it’s a much more scalable JVM. The fact that Zing is truly a plug and play solution that doesn’t require a bunch of tuning facilitates a very easy pilot/evaluation process, which also results in an easy quantitative comparison between HotSpot and Zing.

 

HFTR: I understand you offer a free open source toolkit. What can you tell me about that?

 

SS: We find in certain cases, users of Java don’t really understand where their applications are failing to meet the business requirements, so we have a set of tools that help them with that. The most common one used is something called jHiccup, a 100% open source free tool that can be run with Zing or with HotSpot. It is simple to install and runs alongside an application and measures the “hiccups” or the pauses of the latency outliers, allowing an operator to quickly assess whether or not those latency outliers are due to the JVM.  It’s very beneficial because it helps users try to figure out what layer of the onion is a potential bottleneck, so it’s very popular and is used by all sorts of people in the Java community.

 

HFTR: Finally, what can you tell us about your future plans?

 

SS: Clearly the success we’ve had in financial services is a foundation for future development.  We are becoming the de facto standard for Java applications in the financial markets. More and more, Java applications that have strict SLAs and competitive latency requirements are using Zing.  Where we’re headed is to continue to add features within Zing that are unique and also specifically designed for use within capital markets.  Without getting into specifics, the fact that we are powering so many of these mission critical deployments within the financial space means that we have a unique perspective on some of the remaining problems that exist, which we’re working into our roadmap. 

 

And of course we have plans to support all of the forthcoming versions of Java, so we’ll continue to support the industry standards.

 

HFTR: Scott, thank you very much.

 

Sponsored by Azul Systems

www.azulsystems.com

 

Click here to download this article as a pdf


, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,

Can't the GC be implemented as an accelerator in the CPU? so it doesn't need to swap the app thread out from the CPU.

Jack Harvard 1666 days ago,(2013/04/02)

<In reply to Jack Harvard's question>

The problem with most GC implementations is their "stop the world" nature. When the GC algorithm itself requires all processing to be stopped while collection work is ongoing (as is the case with HotSpot newgen collection, even when empty CPUs are available), acceleration or custom hardware will not do anything to help such algorithms. When the GC algorithm allows collection work to proceed concurrently with the application code executing (as is the case in Azul's C4 collector, even for newgen collections), using a co-processor or accelerator is not a concern either, as the work can easily be done on otherwise unused general purpose CPU cores [and most modern servers have plenty of those, especially in the low latency apps space].

 

Gil Tene 1665 days ago,(2013/04/02)

Even if the GC can run on a different core on a many/multiple core CPU, the GC will jump from one core to the other, causing poor performance (or in the Zing JVM, the GC is affixed to one core). The accelerator at least won't have the thread migration issue. And the poor scalability of Java is very true even for concurrent GC, the CPU utilization is not that high for multithreaded Java apps, somewhere around 75% top as I've seen. 

Not a GC expert myself, but why it can't be like the GC scheme used by the city council, every household gets several bins, and the council only collects the bins once or twice a week. It seems to work well and predictable in my area at least. Profile the app, make sure it never overflows the bins. The traderoff is to speed up each GC process and do GCs regularly (rather than on demand), will it work?

Jack Harvard 1665 days ago,(2013/04/02)

GC can actually extremely efficient and scalable, especially given the massive amounts of cheap empty memory current servers can throw at it. It's only real problem is those pesky pauses - for both newgen and oldgen collections. Once those are taken care of (as is the case with Azul's C4), GC becomes just a small amount of background work that is not hapenning within your sensitive transaction times, and does not slow them down because it happens on other cores. Simply put: it's noise contribution drops so low that it is no longer dominant - other issues like system jitter, CPU affinity, power management become bigger signals than GC.

If you want to understand more about the things in GC that make it different and harder to deal with than the "city council GC" you discuss above, I'd suggest a 1 hour crash-course in the core GC mechanisms and terminology: you can watch my InfoQ presentation titled "Understanding Java Garbage Collection" at http://tinyurl.com/bvbs4hh which covers exactly that.

Gil Tene 1665 days ago,(2013/04/03)

Many thanks Gil for pointing out the GC tutorial, very good, are there still challenges faced by the C4 GC? in other words, what's on the roadmap if you could share?

Jack Harvard 1664 days ago,(2013/04/04)

Glad you found the GC tutorial useful.

As to remaining challanges with C4 - we can always make it better - and we keep doing so. For example, While the C4 algorithm itself is completely puase-free, C4's implementation in Zing still performs some phase flips (changing from one phase to another in the GC) in safeppints, to coordinate work with some of the non-GC parts of the system. These phase flips are currently all in the sub-msec range, and we keep knocking them down lower and lower (to the low 100s of usec), making them orders of magnitude better than on other JVMs. They are not only dramatically shorter, but also much more rare than the typical GC pause events in other JVMs, because we are able to use empty memory to elongate our cycles without any increase in outlier magnitude.

The reality for C4 right now is that the "noise" level generated by GC with C4 is now smaller than the noise levels caused by other JVM Runtime and Linux system issues. This means that for normal enterprise (not low latency) apps, where multi-milliseconds-level noise is not considered a problem, we've effectively eliminated GC as a response time concern altogether. For low latency applications, where the goal is always to reduce the nosie levels as low as possible, C4 is to a point where GC is no longer the dominant cause of jitter. We now tend to focus on ways to contain reponse-time-affecting artifacts other than GC. In Zing, we've already cleaned up much of the other stuff a JVM's Runtime does that could cause such jitters (long time-to-safepoint code paths, locking logic that causes global pauses, etc.), and we continue to beat down those parts. We're also continually growing the set of Linux tuning tricks we use to help the OS itself avoid or reduce system-related jitter.

The road to ever lower latency and jitter never ends. But it's nice to be so far ahead of the pack when it comes to Java.

 

 

Gil Tene 1664 days ago,(2013/04/04)