
Design Approach

This chapter describes the design approach for StreamBase® applications. It also defines the key concepts that must be understood to perform performance analysis and tuning.

Concepts

Path length: The amount of time that it takes to process a unit of application work (such as processing a request and sending its response), excluding any time spent blocked (such as disk I/O, or waiting for a response from an intermediate system).
Horizontal scaling: Adding more computing nodes (machines) to a system.
Vertical scaling: Adding more resources (such as CPUs or memory) to a single computing node in a system.
Contention: Competition for computing resources. When resources are not available, the application waits, and often consumes other system resources while competing for the requested resource.
Latency: The time between when a request is issued and when its response is received. Latency can consist of a variety of components, such as network, disk, and application time.
Throughput: A measure of the overall amount of work that a system is capable of performing over a given time.
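
The difference between path length and latency can be made concrete with a small measurement. The following is a minimal sketch, not part of the StreamBase® API: the class name `WorkTiming` is hypothetical, and per-thread CPU time (read through the standard `java.lang.management.ThreadMXBean`) is used only as an approximation of path length, on the assumption that a blocked thread consumes little or no CPU.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

/** Hypothetical helper: contrasts wall-clock latency with CPU time for one unit of work. */
final class WorkTiming {
    final long wallNanos; // elapsed time, including any time spent blocked
    final long cpuNanos;  // CPU time used by the calling thread, or -1 if unavailable

    private WorkTiming(long wallNanos, long cpuNanos) {
        this.wallNanos = wallNanos;
        this.cpuNanos = cpuNanos;
    }

    /** Runs the work on the calling thread and records both timings. */
    static WorkTiming measure(Runnable work) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // CPU timing is optional in the JVM, so check before relying on it.
        boolean cpuAvailable = threads.isCurrentThreadCpuTimeSupported()
                && threads.isThreadCpuTimeEnabled();
        long cpuStart = cpuAvailable ? threads.getCurrentThreadCpuTime() : 0L;
        long wallStart = System.nanoTime();
        work.run();
        long wall = System.nanoTime() - wallStart;
        long cpu = cpuAvailable ? threads.getCurrentThreadCpuTime() - cpuStart : -1L;
        return new WorkTiming(wall, cpu);
    }
}
```

For work that spends most of its time waiting (for example, on disk I/O or on an intermediate system), `wallNanos` is expected to be much larger than `cpuNanos`. For work that never blocks, the two values should be close.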

Guidelines

Identifying Performance Requirements

Clear and complete requirements.

Start with a clearly stated set of requirements. Without this, performance work cannot be judged as either necessary or complete.
What are the units of work to be measured?

Request messages? Request and response messages? Some larger business aggregation of requests and responses? Are there logging requirements?
Which protocol stacks are used?

How many connections? What are the expected request rates per connection? What do the messages look like?
What are the request/response latency requirements?

What is the maximum allowable latency? What is the required average latency? What percentage of request/response pairs must meet the average latency? Occasionally there are no latency requirements, only throughput requirements.
What is the sustained throughput requirement?

How many units of work per second must be completed, while still meeting the average latency requirements?
What is the burst throughput requirement?

How many units of work per second must be completed, while still meeting the maximum latency requirements?
Are third-party simulators required for the testing?

What role do the simulators play in the performance testing? What are their performance characteristics? Are they stable, correct, predictable, scalable, and linear? Are they capable of generating a load that meets the performance requirements for the application?
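
Once the latency questions above have answers, they can be written down in a form that a test can check automatically. The following is a minimal sketch under stated assumptions: the class `LatencyRequirements`, its fields, and the choice of milliseconds as the unit are all hypothetical, and the samples are assumed to be one latency value per request/response pair collected during a steady-state run.

```java
import java.util.Arrays;

/** Hypothetical requirement check for measured request/response latencies. */
final class LatencyRequirements {
    final double maxMillis;             // maximum allowable latency
    final double averageMillis;         // required average latency
    final double fractionWithinAverage; // share of pairs (0..1) that must meet the average

    LatencyRequirements(double maxMillis, double averageMillis, double fractionWithinAverage) {
        this.maxMillis = maxMillis;
        this.averageMillis = averageMillis;
        this.fractionWithinAverage = fractionWithinAverage;
    }

    /** Returns true only if all three latency requirements hold for the samples. */
    boolean isMetBy(double[] samplesMillis) {
        if (samplesMillis.length == 0) {
            return false; // no data means the requirement cannot be judged as met
        }
        double max = Arrays.stream(samplesMillis).max().getAsDouble();
        double mean = Arrays.stream(samplesMillis).average().getAsDouble();
        long within = Arrays.stream(samplesMillis)
                .filter(s -> s <= averageMillis)
                .count();
        double fraction = (double) within / samplesMillis.length;
        return max <= maxMillis
                && mean <= averageMillis
                && fraction >= fractionWithinAverage;
    }
}
```

A sustained throughput run would be judged against the average latency requirement and a burst run against the maximum, so a real test might split this check into two. Keeping them together here is only for brevity.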

Measuring Performance

Working on performance without first collecting meaningful and repeatable performance data is wasted effort. No one can predict the exact performance of an application, nor can one predict where the top bottlenecks are. These things must be measured. And they must be remeasured as the application or environment changes.

Develop an automated test.

Performance testing involves the repeated configuration and coordination of many individual parts. Doing these steps manually guarantees a lack of repeatability.
Measure meaningful configurations.

Do not test performance in the VMware® image. Test in an environment that is the same or similar to the production environment.

Use production mode binaries.

Test with assertions disabled.

Eliminate deadlocks from the application. The performance of a path that contains a deadlock is boundlessly worse than the same path without a deadlock. Performance tests that encounter deadlocks in the steady state are invalid, and indicate application problems.

Do not run performance tests with any extraneous code, logging or tracing enabled. Developers often add extra code and tracing to an application that may be of use in the development process. This code will not be used in the production system, and can significantly perturb performance measurements.

Use representative numbers of client connections. There is much to be learned from the performance characteristics of a single client talking to the application via a single connection. Performance testing should start with this case. However, StreamBase® is optimized for parallelism. Almost all well-designed applications support multiple connections. Performance testing configuration should mirror the actual application configuration. If the target configuration calls for 100 connections at 10 messages per second, per connection, test it that way. This is not the same as one connection at 1000 messages per second.
Measure the steady state of an application, not the load stabilization time or the application startup time.

Complex systems generally have complex initialization sequences, and while there are often performance requirements for this startup time, they are generally not the primary performance requirements. Repeatable performance runs are done against an already started and stable application, with a warm-up period that allows for the test load to stabilize.
Run on otherwise idle hardware.

Steady states cannot be meaningfully measured if there is concurrent uncontrolled machine usage on the target systems.
Start measuring performance early in a project.

Do not wait until the end of a project to create the performance tests. Performance measurement can, and should, be done throughout the life cycle of the project. Once a test is in place, it should be mostly a matter of configuration to integrate it with the application. Begin working with the real application as soon as possible.
Performance runs versus data collection runs.

Make a distinction between test runs that measure the best achievable performance and runs that collect data for analyzing performance. Best performance runs should have little or no extra data collection enabled.
Do not measure saturated systems.

When a system has reached the maximum rate of work that it can process, it is said to be saturated. Production systems are intentionally not run in this way, nor should performance testing be run in this manner. At the saturation point (or approaching it), systems can exhibit various forms of undesirable behavior, such as excessive CPU utilization per request, increased memory utilization, and non-linearly deteriorating response times and throughput. As the system nears the saturation point, the performance generally decreases due to these effects. Saturation can occur at different levels, including protocol stacks, application logic, and the system itself. Performance testing should be organized to identify the various saturation points of the system.

For example, CPU utilization should not be driven to 100%. Typically, tests should be designed to drive the CPU utilization to a maximum of 80-90%.
Sweeping the load.

Nothing useful can be gained by running a performance test with a single configuration that saturates the application. Proper performance testing calls for starting with a simple configuration and a modest workload that does not tax the application, and then increasing the load in subsequent runs to identify the linearity of the application and the saturation point. The exercise is then repeated with other configurations.
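
The results of a load sweep can be examined mechanically to find where linearity breaks down. The following is a minimal sketch, assuming that each run records its configuration (number of connections and request rate per connection) and the throughput actually achieved. The class `LoadSweep`, its nested `Run` type, and the tolerance parameter are hypothetical names introduced for illustration.

```java
import java.util.List;

/** Hypothetical analysis of a load sweep: finds the first run that falls short of its offered load. */
final class LoadSweep {

    /** One run of the sweep. */
    static final class Run {
        final int connections;
        final double ratePerConnection; // requests per second offered on each connection
        final double achievedPerSec;    // units of work per second actually completed

        Run(int connections, double ratePerConnection, double achievedPerSec) {
            this.connections = connections;
            this.ratePerConnection = ratePerConnection;
            this.achievedPerSec = achievedPerSec;
        }

        /** Total offered load; identical totals can still be different configurations. */
        double offeredPerSec() {
            return connections * ratePerConnection;
        }
    }

    /**
     * Returns the index of the first run whose achieved throughput is below
     * (1 - tolerance) times its offered load, or -1 if every run keeps up.
     * Runs are assumed to be ordered by increasing offered load.
     */
    static int firstSaturatedRun(List<Run> runs, double tolerance) {
        for (int i = 0; i < runs.size(); i++) {
            Run run = runs.get(i);
            if (run.achievedPerSec < run.offeredPerSec() * (1.0 - tolerance)) {
                return i;
            }
        }
        return -1;
    }
}
```

Keeping the connection count and the per-connection rate as separate fields is deliberate: 100 connections at 10 messages per second and one connection at 1000 messages per second offer the same total load, but they are different configurations and should be swept separately.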

Analyzing Performance

In StreamBase® applications, we concern ourselves with three main types of performance:

Single-path performance: the CPU cost and wall clock time for a single request.
Multi-threaded or vertical scaling: running the single path concurrently on multiple threads.
Multi-node or horizontal scaling: running the single path concurrently on multiple threads on multiple machines.

We generally want to look first at multi-threaded performance. The StreamBase® runtime environment is optimized for running on multi-processor, multi-threaded platforms. Regardless of the single path speed, additional performance is most easily obtained by running the single path concurrently on multiple threads.

At this point, you should have a set of data that describes the application functioning normally and in saturation. You will already have done some analysis that led to your choice of configurations to measure.

Now examine your data with scalability questions in mind. Pick unsaturated data points with the same number of requests per second and differing numbers of clients. How does the CPU utilization change as the number of clients is increased? If your data shows near-perfect linearity and scaling, your application may not need tuning. In this case, additional performance can be gained by adding more or faster CPUs. Usually, however, the data shows a lack of scaling or linearity, an inability to utilize all of the CPUs, or overall performance that is not acceptable on the target hardware. The next task is to understand why. At this point, performance work resembles scientific research:

A set of experiments is run and data is collected.
The data is analyzed.
Hypotheses are made to explain the data.
A change is made to the system under test and the process is repeated.
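
The scalability question posed earlier (how CPU utilization changes as clients are added at a fixed request rate) can be reduced to a single number per data point. The following is a minimal sketch, assuming that each data point supplies an overall CPU utilization between 0 and 1, the number of cores on the machine, and the measured request rate. The class `ScalingCheck` and its method names are hypothetical.

```java
/** Hypothetical helper for comparing data points taken at the same request rate. */
final class ScalingCheck {

    /**
     * CPU-seconds consumed per request: overall utilization (0..1) times the
     * number of cores, divided by the request rate.
     */
    static double cpuSecondsPerRequest(double cpuUtilization, int cores, double requestsPerSec) {
        if (requestsPerSec <= 0) {
            throw new IllegalArgumentException("requestsPerSec must be positive");
        }
        return cpuUtilization * cores / requestsPerSec;
    }

    /**
     * Ratio of the per-request cost at a higher client count to the cost at the
     * baseline client count. A value near 1.0 suggests that adding clients did
     * not make each request more expensive; a growing value hints at contention.
     */
    static double costGrowth(double baselineCost, double cost) {
        if (baselineCost <= 0) {
            throw new IllegalArgumentException("baselineCost must be positive");
        }
        return cost / baselineCost;
    }
}
```

A rising ratio is a reason to form a hypothesis and run the next experiment. It does not by itself identify which resource is contended.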

At this point, we transition to statistics collection runs to help us identify scaling bottlenecks. Typically, scaling bottlenecks are a result of contention. Contention can be for a variety of resources: processing cycles, network I/O, disk I/O, transaction locks, Java monitors, system memory, and so on. Excessive garbage collection can also be a cause of performance problems. When trying to identify the cause of scaling problems, there are no absolute rules that tell us ahead of time which statistics reports will produce the most interesting data.
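
As one illustration of a statistics collection run, garbage collection activity can be sampled before and after a measurement interval using the standard JVM management beans. This is a minimal sketch and not a StreamBase® statistics report: the class `GcSnapshot` is hypothetical, and it only observes the JVM in which it runs.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical snapshot of cumulative garbage collection activity in this JVM. */
final class GcSnapshot {
    final long collections;      // total collections reported by all collectors
    final long collectionMillis; // approximate accumulated collection time
    final long takenAtNanos;     // when the snapshot was taken

    private GcSnapshot(long collections, long collectionMillis, long takenAtNanos) {
        this.collections = collections;
        this.collectionMillis = collectionMillis;
        this.takenAtNanos = takenAtNanos;
    }

    static GcSnapshot take() {
        long count = 0;
        long millis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both getters return -1 when the collector does not report the value.
            count += Math.max(0L, gc.getCollectionCount());
            millis += Math.max(0L, gc.getCollectionTime());
        }
        return new GcSnapshot(count, millis, System.nanoTime());
    }

    /** Collection time between two snapshots as a fraction of elapsed wall-clock time. */
    static double gcFraction(GcSnapshot before, GcSnapshot after) {
        double elapsedMillis = (after.takenAtNanos - before.takenAtNanos) / 1_000_000.0;
        if (elapsedMillis <= 0) {
            return 0.0;
        }
        return (after.collectionMillis - before.collectionMillis) / elapsedMillis;
    }
}
```

The reported collection time is approximate, and with concurrent collectors it is not the same as time during which the application was paused. Treat the fraction as a hint that garbage collection deserves a closer look, not as a precise cost.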

In looking at these data, one should first look for anything that indicates gross problems with the run, such as application failure, deadlocks, or swapping. If seen, the results should be disregarded, and the problem corrected before continuing.
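
Part of this validity check can be automated. The following is a minimal sketch that asks the JVM whether any of its threads are deadlocked on Java monitors or ownable synchronizers, using the standard `ThreadMXBean`. The class `RunValidity` is a hypothetical name. The check is limited to JVM-level thread deadlocks in the process where it runs, so deadlocks on transaction locks, application failures, and swapping must still be checked by other means.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

/** Hypothetical post-run check for JVM-level thread deadlocks. */
final class RunValidity {

    /** Returns the number of threads the JVM currently reports as deadlocked. */
    static int deadlockedThreadCount() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // findDeadlockedThreads() also covers java.util.concurrent locks, but it is
        // only available when the JVM supports synchronizer usage monitoring.
        long[] ids = threads.isSynchronizerUsageSupported()
                ? threads.findDeadlockedThreads()
                : threads.findMonitorDeadlockedThreads();
        // Both methods return null when no deadlock is found.
        return ids == null ? 0 : ids.length;
    }

    /** A run is treated as suspect if any thread is deadlocked. */
    static boolean looksValid() {
        return deadlockedThreadCount() == 0;
    }
}
```

A test harness could call `RunValidity.looksValid()` at the end of each run and discard the results when it returns false.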

At this point, you should have an automated, repeatable test, along with data that demonstrates performance or scaling issues with your target application. You can now begin to use the collected data to optimize your application.

After you have removed the bottlenecks and the application scales well across multiple processors, it may still not meet the performance requirements. The single execution path performance should be examined at this time with a Java profiling tool.

Horizontal scaling may also be examined at this point as a way to increase overall system throughput. Add a single node at a time (or pairs for high-availability active and replica nodes) to the test configuration and rerun the measurement process.