Monday, April 30, 2012

Successful application performance tuning

Non-technical but useful to techies
In this entry I am going to outline a strategy for successful application performance tuning. This is deliberately non-technical but my target audience here is also the technical people involved in fixing these problems. For the most part performance tuning failures do not spring from a lack of technical know-how, they spring from a lack of a strategic approach.

The key steps are easily laid out, but there are pitfalls along the way that must be avoided.

The overview is:
  1. Assemble a team
  2. Define the problem
  3. Reproduce the problem
  4. Investigate the cause
  5. Propose a solution
  6. Test for success
Depending on the scale of the problem or complexity of the application some of the below may not be applicable, but the general outline will be helpful at any level of tuning exercise.

Assemble a team
It is essential to avoid the pass-the-parcel approach that is often used in tackling performance issues. The users complain about the issue, management asks the application support team to investigate. They in turn ask the DBA team, who talk to the Unix team and so on.

Performance issues are generally a combination of different components within a system. Their resolution takes a combination of skills. Depending on your organization or the application being analyzed these people may have different titles or some people may provide a combination of the skills needed, but it breaks down to this:
  1. Business Analyst, someone who knows what the system is supposed to do for the business and preferably has a direct contact with the user community.
  2. Functional Application support, someone with knowledge and understanding of the functional aspects of the application
  3. Technical Application support, someone with a technical understanding of the application infrastructure.
  4. Database support, someone with knowledge of the underlying database.
  5. OS support, someone who can provide insight into the operating system and hardware resources.
  6. Developer and/or vendor support. If the application is supplied by a vendor and they cannot be onsite for the whole process then a clear channel of communication, preferably with a named individual should be setup.
  7. Network Support, although not normally required throughout the process it is good to have involvement from this area from the start.

Defining the problem
A "What not to do" note. You do not yet know what the problem is. So do not start solving it. It is very easy in the initial meetings of the team to begin to discuss things such as CPU load or bad SQL, lack of memory etc etc. Do not do this.

The first step is to properly identify what the problem is. Usually the response to the question "What is slow?" is vague at best. Quite often "It's just generally slow" is the reply. Most of the time this is wrong.

It seems obvious that you must first identify what is slow before you can fix it, but a lack of clear definition at this stage is central to the failure of many a performance tuning effort. This is the point where you define the scope of any activities. It is also the starting point for communication with the user community. The other vital component that comes from this step is a set of criteria to judge success by.

In short, if the problem definition is "It's slow" and the success criteria "It's faster" then the outcome will likely be failure.

If at all possible direct communication with the users should be started. By this I mean that someone from the team should visit the users, sit down with them as they work and actually see the problem in action. This has the added benefit of letting the users know that someone is working on their problems. Get out from behind your email client and go sit with your user community.

Your goal is a set of clear Problem Statements that define the problem or problems. Avoid generalized statements such as "The Order Entry screen is slow".  Also, we are not looking for Use Cases such as might be required for functional testing or performance load testing. An example of the output you want might be:
  1. In the Order Entry screen, when a client name is searched for it takes over 30s to return results
A user enters the Order Entry screen and searches for the client name by entering the first two characters of the name in the search box. When the user presses the Search button it takes over 30s to return results. 

It is also useful to add in a screenshot of the problem page and the menu choices taken to arrive at the page. Where applicable the username used to log in and any group information such as business units should be noted.

The number and selection of your Problem Statements will depend on the nature of the application and the performance issues. If there are many areas where performance is a problem it is best to list a fixed number of targets in key areas of the application. Indeed, creating a Top 10 list with the users can be very helpful if you are struggling to properly define Problem Statements. Even though you may not be targeting all the possible issues that users are experiencing, you need a focused list to perform a proper analysis with.

Reproduce the problem
Another "What not to do" note. Do not try and reproduce the problem with all your analysis tooling switched on. This runs the risk of giving initial false results.

The first goal you have is to reproduce the problem on a test environment. While it is possible to investigate on a production system it is often very difficult to isolate the activity of the Problem Statements in a busy production environment.

A test environment must be as close as possible to the production one. This is not only the case for hardware infrastructure but also for data. With data care should be taken to not only reproduce the size of the data set, but also data skew. If you have ten business units in an application you will often find that most of the orders come through only one or two business units. When querying order history it is likely that performance will be poorer for business units with large numbers of orders.

Ideally a copy of the production system would be used, but this opens up security issues in creating a test environment with production data. Tools such as Oracle's Data Masking Pack can duplicate a production database and at the same time mask the underlying data, giving you a realistic test environment without the danger of exposing production data.

Care should also be taken with parameter files and application options. Again, these options should preferably be a copy of a production set adjusted to match the test environment. Be wary of how these parameters are changed. Any such changes should be recorded and tracked with the team.

Once the test environment is setup a backup should be taken and used as a reference point for all future tests and results reporting. This is important because as testing continues you will normally be changing the underlying data set. Each test cycle should start from the same base of data and parameters.

You should initially set out to reproduce each Problem Statement. In doing so look to define the conditions under which the problem occurs. Is it always for a given client number, or a particular business unit? For a search does it matter how many characters are searched on or the number of results returned? Remember also the effect of caching at various levels of the infrastructure. Try and level the playing field in this aspect by repeating tests with and without data cached. The goal of this activity is not only to reproduce the problem but also to begin to understand the workings of the application with the underlying data set and infrastructure.

In some cases the Problem Statement may include a time-of-day factor or the problem may be intermittent. If such cases cannot be immediately reproduced then notes should be taken on the tests made and the results. Work should then continue onto the next phase.

Investigate the cause
A good first step here is to set up automated tests using tools that can replay the test cases and properly time them. The creation of these automated tests should use the previous steps gathered data as input. Their output should become the initial baseline for reporting the impact of tuning activity in the teams final report.

This area is very technical and I will leave the details to another time, but some useful approaches can be outlined:

  1. Make very sure that you are working together and not against each other. In particular coordinate when changes and testing will occur and when the environment will be set back to its baseline start point.
  2. Make use of available profiling tools, tracing and other data gathering.
  3. Work closely with the developers and use any data gathering tools built into the application. If none are available discuss with them how they might be added.
  4. Reuse this work. As you begin to build up data points think also about how this data might be reused later in gathering and reporting production performance metrics.
  5. Find and examine changes in behavior closely. Work to explain why a particular case behaves well with one set of options or data and poorly with another. This will greatly aid your understanding of the problem.
  6. Extrapolate and test. If you have a working theory on the nature of a problem you should be able to predict under what circumstances it will worsen or improve. Create these circumstances and test and prove your theory.
Be aware that you cannot hope for perfection, therefore you need to focus efforts on where the majority of the problems lie. Here your data gathering abilities come into key play. Measure each aspect of the various parts of the system per Problem Statement to determine where the elapsed time is being used up. In a modern Java environment for example, you want to find put how much CPU time is recorded at the web layer, how much at the application tier and how much at the database. If out of 30s elapsed time there are 25s spent in CPU on the database then this will become your focus in resolving the problem.

At this stage you should have far more knowledge about where the problems lie. You can now revisit the Problem Statements in the previous stage that you could not reproduce to see if, armed with your new knowledge, you are better able to do so.

Try and build a picture across Problem Statements. Are there correlations in where time is most spent? Relating the data and information gathered across your different tests can help you pinpoint more structural issues in the application or its framework.

Propose a solution
One last "What not to do" note. You are not now armed with a set of weapons with which to attack a group or groups of people that you think "caused" the problem. A solution will only come about if you can get agreement from all on what the problem is and how to tackle it. Be sensitive about the vested interests of those involved and try and bring them with you.


Unless you are particularly lucky there is probably no quick fix that can be easily implemented. If that had been the case it would normally have been found and put in place a while ago. You are faced then with what might be a considerable requirement for time and effort from one or more groups to implement a solution. It is therefore important how you represent your findings.

The data gathered and the information about the solution are only a part of this. You also need to show what work was done to ensure that your findings are correct. As well you need to demonstrate why the solution is the best way forward and in particular why smaller incremental changes will not work. There will be a strong tendency, driven by a need to reduce costs, to accept small changes in place of your proposed solution. Your report should demonstrate, with evidence, why this will not resolve the problem or will only create a short term fix.

It is all too easy for those required to fix the problem to feel under attack. This is particularly the case for vendors or other external parties. You should discuss fully the contents of the report with the groups involved in the solution prior to publishing. Gain their trust and agreement on the data presented and the proposals being made. Ensure that everyone is of one voice in this.

Use the communication opened with your user community to understand how changes will impact them. It may be that changes in what information is presented or how work is done are a part of the solution. If that is the case you will need to demonstrate that the changes will not diminish the business fit of the application. Having user approval prior to the report being finalized will help in this area.

Examine also across the team what options there exist for staged implementation. Your investigation will have brought about a better understanding of what the most important problem areas are. Can these areas be tackled first, or does the solution have to be a big bang?

The job does not finish with the report being published. Organize follow up meetings to discuss the report involving decision makers and preferably the project management that will be required to implement. The important part here is a clear set of goals and check markers for the project that will ensure the underlying issues have been satisfactorily dealt with and not merely masked. In the case of external suppliers this set of success criteria will have to be agreed with them.

Test for success
As all or part of the solution is implemented you should reuse the test environment built for the reproduction and investigation phases to test the results of the changes. It is important here to have a level playing field or to be able to account for differences. If there has been a length of time between the investigation and the solution you may be looking at a larger data set than had previously existed. As much as possible you want the new test results to be comparable to the previous ones.

The tests of the solution should not simply be about timings. They should also reuse the expertise gained in the initial investigations. You are looking to ensure that the same or similar mistakes that caused the original problems have been avoided.

Be wary of fixes being implemented that fit the test cases. Go back to your reproduction phase and test for other conditions outside of the test cases. Try and ensure that you have not broken something else in order to solve one or more of your Problem Statements.

Finally, you now have an infrastructure and knowledge base for ongoing micro testing of performance. Make sure this is now a documented part of your test and release process.