Java Performance Testing
The most amazing achievement of the computer software industry is its continuing cancellation of the steady and staggering gains made by the computer hardware industry. [1]
While I don't agree with the above quote completely, there are elements of it that ring true. In my experience, the quest for faster and faster delivery of product features has led to inefficient solutions to the problems we attempt to solve. To compound that, if we are doing our job well as software engineers we are also thinking about how to build a solution that will be easy to maintain over time, which can pull our focus further toward maintainability at the expense of performance. In this post I'd like to discuss the times when we do get to, or ought to, focus on performance: when we take a step back and think about what we want to deliver, how that delivery should perform, and lastly how we can verify it meets the benchmarks we set out to meet, whether it's a new service, a new feature, or perhaps incorporating a new technology into your stack.
What inspired this post is something I recently came across during a move from Java 8 to Java 17 in a key part of our platform; along with that upgrade we incorporated observability in the form of tracing. As part of this upgrade we wanted to ensure various performance metrics were not only maintained, but improved, before rolling out the upgrade to other services in this part of our platform. Before we dive in, let's discuss some of these metrics and what they mean, because not all of them are commonly discussed when talking about performance.
Observable Performance Metrics
The key performance metrics I would like to run through are:
- Throughput
- Latency
- Capacity
- Utilization
- Efficiency
- Scalability
- Degradation
Throughput
Throughput measures the rate at which a system completes work, often expressed in units per time (e.g., transactions per second). It provides a direct sense of a system’s capacity to handle load and must be contextualized with the platform and test setup to be meaningful.
Latency
Latency is the time it takes for a unit of work to complete — how long it takes from initiation to response. Lower latency typically implies better responsiveness but doesn’t necessarily equate to higher throughput.
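To make throughput and latency concrete, here's a minimal sketch of how both can be derived from the same timing samples; doUnitOfWork() is a hypothetical placeholder for whatever the system under test actually does.

```java
import java.util.ArrayList;
import java.util.List;

public class ThroughputVsLatency {

    static void doUnitOfWork() {
        // placeholder workload; replace with a real call into the code under test
        Math.sqrt(System.nanoTime());
    }

    public static void main(String[] args) {
        int operations = 10_000;
        List<Long> latenciesNanos = new ArrayList<>(operations);

        long start = System.nanoTime();
        for (int i = 0; i < operations; i++) {
            long opStart = System.nanoTime();
            doUnitOfWork();
            latenciesNanos.add(System.nanoTime() - opStart);  // per-operation latency
        }
        double elapsedSeconds = (System.nanoTime() - start) / 1_000_000_000.0;

        // throughput: completed operations per unit time
        double throughputPerSecond = operations / elapsedSeconds;
        // latency: mean time from initiation to completion of one operation
        double meanLatencyMicros = latenciesNanos.stream()
                .mapToLong(Long::longValue).average().orElse(0) / 1_000.0;

        System.out.printf("Throughput: %.0f ops/s, mean latency: %.2f µs%n",
                throughputPerSecond, meanLatencyMicros);
    }
}
```

Note that the two numbers don't have to move together: a batch-oriented system can post high throughput while each individual operation still has high latency.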
Capacity
Capacity refers to the maximum load a system can handle concurrently — how much work it can process in parallel. It’s tightly linked to available resources and system architecture.
Utilization
Utilization indicates how efficiently system resources (CPU, memory, etc.) are being used. High utilization can imply efficiency but may also signal resource saturation, while low utilization could mean underuse or idle resources.
Efficiency
Efficiency is the ratio of useful work output to resource input. A system is efficient if it delivers high throughput and/or low latency with minimal resource consumption and cost.
Scalability
Scalability assesses how well a system handles increasing workload or resource addition. A scalable system maintains or improves performance as it scales, whether vertically (more powerful machine) or horizontally (more machines).
Degradation
Degradation refers to how system performance deteriorates under load or stress. Robust systems degrade gracefully, maintaining acceptable service levels as conditions worsen, rather than failing abruptly.
Establishing Testing Goals
In many cases, but not all, performance testing is done as part of some sort of change. In my case it was upgrading from Java 8 to Java 17, so I needed to establish the current baseline performance before beginning any work towards the upgrade. That baseline was absolutely crucial to approaching performance testing scientifically, versus going off of a 'vibe' as I've seen many times before. Once you have a baseline you can establish goals for your performance work: ask yourself which key performance metrics you would like to improve and what amount of improvement would be considered a success. I urge you to focus on only a few metrics at a time, as oftentimes that is all that can be done without incorporating too broad a set of changes. In fact, in long-lived projects, optimizing for one metric may actually cause the degradation of another.
Because of the nature of the change I was pursuing, the business goal of the upgrade was not to improve performance, but instead to improve security and support. So I ended up with the following goal:
Throughput, Latency, and Capacity see zero degradation with the move from Java 8 to Java 17
Once you have your testing goal, you'll be ready to do your baseline system test, implement your changes, and move on to retesting and verifying.
Testing Types, Tips and Tools
There are many different performance tests one can perform on a system, and in most cases it's beneficial to perform a suite of them. Let's go over some of them and what they are generally used for, as these test names often get used interchangeably when the intent of each test is actually different.
Stress Test
A stress test pushes a system beyond its capacity to observe how it fails. It's used to reveal hard limits, stability issues, or bottlenecks under extreme conditions that may not show up during normal usage.
Load Test
Load tests simulate expected usage levels, often mimicking peak real-world demand, to verify that the system meets performance goals under normal and heavy conditions without failure or unacceptable latency.
Endurance Test
Endurance tests assess how a system behaves under sustained load over long periods. They reveal memory leaks, slow resource exhaustion, and other issues that only appear after extended uptime.
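To show how intent rather than mechanics distinguishes these tests, here is a minimal load-generation sketch in plain Java; the endpoint URL (http://localhost:8080/health), thread count, and request count are all assumptions you would replace with values from your own expected peak traffic. Run it with a realistic THREADS value and it behaves like a crude load test; push THREADS and REQUESTS_PER_THREAD well past expected capacity and the same harness becomes a crude stress test.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class SimpleLoadTest {
    private static final int THREADS = 20;             // concurrent virtual users
    private static final int REQUESTS_PER_THREAD = 500;

    public static void main(String[] args) throws InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(5))
                .build();
        // hypothetical endpoint; point this at the service under test
        HttpRequest request = HttpRequest.newBuilder(URI.create("http://localhost:8080/health"))
                .timeout(Duration.ofSeconds(10))
                .GET()
                .build();

        AtomicLong ok = new AtomicLong();
        AtomicLong failed = new AtomicLong();
        ExecutorService pool = Executors.newFixedThreadPool(THREADS);

        long start = System.nanoTime();
        for (int t = 0; t < THREADS; t++) {
            pool.submit(() -> {
                for (int i = 0; i < REQUESTS_PER_THREAD; i++) {
                    try {
                        HttpResponse<Void> response =
                                client.send(request, HttpResponse.BodyHandlers.discarding());
                        if (response.statusCode() == 200) ok.incrementAndGet();
                        else failed.incrementAndGet();
                    } catch (Exception e) {
                        failed.incrementAndGet();   // timeouts and refusals count as failures
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.MINUTES);

        double seconds = (System.nanoTime() - start) / 1_000_000_000.0;
        long total = ok.get() + failed.get();
        System.out.printf("%d requests in %.1fs -> %.1f req/s, %d failures%n",
                total, seconds, total / seconds, failed.get());
    }
}
```

Dedicated tools (discussed below) add ramp profiles, reporting, and distributed load generation, but the shape is the same: concurrent workers, a defined load profile, and a throughput and error summary at the end.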
So we have a few different tests we can perform, and there are even more out there that you can read up on, like degradation and capacity-planning tests.
Next, let's talk about tools you can use to do performance testing. Two that I have used in the past are Apache JMeter and, believe it or not, Postman. I find Postman easier for non-technical people to set up and use, whilst JMeter is more feature-rich. You should investigate what will work best for your needs; perhaps there are better tools out there now.
Lastly, a tip for performance testing: all measurements contain some amount of error, so you should always perform your tests multiple times and then average the results. According to Statistically Rigorous Java Performance Evaluation [2], that number is 30 runs!
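If you want the warm-up and repeated-measurement mechanics handled for you on the JVM, one option is JMH (the Java Microbenchmark Harness). This targets micro-benchmarks rather than the service-level tests discussed above; treat the following as a minimal sketch that assumes the org.openjdk.jmh dependency is on the classpath, with parsePayload() as a hypothetical stand-in for the code path under test.

```java
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Warmup;

import java.util.concurrent.TimeUnit;

public class PayloadBenchmark {

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)        // report mean time per operation (latency)
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    @Warmup(iterations = 5)                 // discard early iterations while the JIT warms up
    @Measurement(iterations = 30)           // 30 measured iterations, mirroring the guidance above
    @Fork(2)                                // repeat in fresh JVMs to reduce run-to-run bias
    public int parsePayload() {
        // hypothetical workload: replace with the code path under test;
        // returning the result keeps the JIT from eliminating it as dead code
        return Integer.parseInt("123456") * 7;
    }
}
```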
Comparing Baseline Performance to Baseline + Modifications
Okay, the time has come for you to compare your test results to the baseline measurements you gathered prior to your modifications, to verify the changes you made had the effects you intended. It may be tempting to take any improvement (or regression) at face value, but you should first determine whether the change is statistically significant. In short, statistical significance helps determine whether the result you received was due to chance or to some factor of interest (the modifications you made). A shorthand way of doing this is as follows (a small code sketch follows the list):
- Compute mean and standard deviation
  - For each metric, calculate the mean and standard deviation (σ).
  - If you're comparing two implementations (A and B), do this for both.
- Look at overlap
  - Compute mean ± 2×(σ/√n) for each group; with 30 or so runs this approximates a 95% confidence interval for the mean.
  - If the confidence intervals do not overlap, you can reasonably infer the difference is statistically significant.
- Rule of thumb for practical significance
  - If performance improves or degrades by >10% consistently, it's often worth acting on, regardless of formal significance.
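Here is a minimal sketch of that overlap check, assuming you've already collected per-run samples for the baseline (A) and the modified system (B); the sample values in main are made up purely for illustration.

```java
public class SignificanceCheck {

    record Interval(double mean, double low, double high) {
        boolean overlaps(Interval other) {
            return this.low <= other.high && other.low <= this.high;
        }
    }

    static Interval confidenceInterval(double[] samples) {
        int n = samples.length;
        double mean = 0;
        for (double s : samples) mean += s;
        mean /= n;

        double variance = 0;
        for (double s : samples) variance += (s - mean) * (s - mean);
        double stdDev = Math.sqrt(variance / (n - 1));   // sample standard deviation

        // ~95% confidence interval for the mean: mean ± 2 × standard error
        double margin = 2 * stdDev / Math.sqrt(n);
        return new Interval(mean, mean - margin, mean + margin);
    }

    public static void main(String[] args) {
        // hypothetical throughput samples (requests/second) from 30 runs each
        double[] baseline = {812, 798, 805, 820, 799, 810, 803, 815, 808, 800,
                             811, 797, 806, 818, 802, 809, 804, 813, 807, 801,
                             814, 796, 805, 817, 800, 808, 803, 812, 806, 799};
        double[] modified = {835, 842, 828, 840, 831, 838, 829, 845, 833, 836,
                             830, 841, 827, 839, 832, 837, 834, 843, 826, 835,
                             844, 829, 838, 831, 840, 833, 836, 828, 842, 834};

        Interval a = confidenceInterval(baseline);
        Interval b = confidenceInterval(modified);

        System.out.printf("Baseline: %.1f [%.1f, %.1f]%n", a.mean(), a.low(), a.high());
        System.out.printf("Modified: %.1f [%.1f, %.1f]%n", b.mean(), b.low(), b.high());
        System.out.println(a.overlaps(b)
                ? "Intervals overlap: difference may be due to chance"
                : "Intervals do not overlap: difference looks statistically significant");
    }
}
```

If the intervals do overlap, you can't rule out chance; gathering more runs tightens the intervals and may resolve the ambiguity.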
Conclusion
Performance work isn’t glamorous, and it often gets sidelined in favor of faster delivery or easier maintenance. But as this post has hopefully shown, there are moments in a system’s lifecycle — like a major Java version upgrade — when performance deserves focused attention. In those moments, it’s essential to approach testing methodically: define clear goals, understand which metrics matter, baseline before you optimize, and validate your changes with rigor, not gut feel.
-- Joseph Leiferman July 14th, 2025
Further Reading
[1] This quote can be attributed to Henry Petroski
[2] Georges, Buytaert, and Eeckhout, Statistically Rigorous Java Performance Evaluation, OOPSLA 2007.