Friday, March 9, 2012

Database performance, measuring the effects of network latency

In Oracle 11gR2 there are 16 different wait events associated with SQL*Net traffic. The two most common are SQL*Net message to client and SQL*Net message from client. Without additional data neither of these events is useful in measuring the impact of network latency on application performance. The first measures only the time to pass the message to the operating system, not to actually send the message across the wire. The second is the time spent waiting for the client to send a message, so it does not tell us how long the message took to send.

Yet we do know that there exists an impact. If it was suggested to place the application tier of a performance critical application on one continent and the database on another there would be an immediate concern about latency. We can measure the latency of a network connection using network tools, but this alone does not give us real data about what effect that latency will have on user experience.

I decided to delve into this a little further, to see first how we might measure the issue and then later to use those measurements to get a better understanding of what impact latency has.

Gathering the data
The tests were executed on three separate databases, each with the same hardware configuration and init.ora parameters. The first database, Local, was on the same machine as the test scripts. The second, LAN, was on a high performance local LAN which is expected to have minimal amounts of latency. The third, WAN, was over a Wide Area Network which has an expected latency of 10-12ms.

The test scripts were written in Perl, using the DBD::Oracle package. The timings were taken using the Time::HiRes package, which delivers timings in microseconds. The time measured was from the start to the end of each iteration.

The Output
SELECT 2000 rows, Fetch size 4 rows

SELECT 2000 rows, fetch size 4, timings (s)
In this test a single SELECT statement is executed against a table containing 2000 rows. The fetch size is set to four rows, meaning the rows are returned to the client four at a time. This results in 503 round trips from the application to the database. The total includes parse and execute round trips and an extra fetch returning no rows that signals the end of the cursor to the client. You can see from the results that there is a noticeable difference even between the  Local and LAN times. The WAN timing is over 4.5s, not an acceptable response time for such a simple query.

SELECT 2000 rows, Fetch size 20 rows

SELECT 2000 rows, fetch size 20, timings (s)
Here the test is repeated, but this time fetching 20 rows at a time. There is an immediate benefit across the three test sets. Even in the Local test the response time has decreased. In the Local test the connection does not pass through any network, but there is still an overhead in process communication and in CPU time spent in re-reading blocks in memory to get the next rows. The WAN savings are of course the most significant. A simple parameter change has reduced the response time here to under 1s.

SELECT 2000 rows, Fetch size 200 rows

SELECT 2000 rows, fetch size 20,0 timings (s)
This time the test is executed fetching 200 rows at a time. Again there is benefit to be had, but at the Local and LAN level it is becoming less significant. It should be kept in mind that there is an overhead in memory usage the larger the fetch size. The client is required to pre-allocate enough memory to fit all columns at their largest byte size for all rows. Care should be taken not to create a huge memory footprint on the application tier just to shave off a few ms of elapsed time.

SELECT 1 row, multiple executions

In this test the script executes a single SELECT statement returning one row from a table containing only one row. Each execution includes the statement parse. Along with the execution and fetch steps this is a total of three network To/From pairs per execution.

This test cuts down as much as possible any other factors than just network latency. For the WAN measurements only the first 500 results, reporting 1500 round trips, are shown. This was done purely to keep a reasonable scale for the chart.
As might be expected, there is a linear progression between the number of network packets and the elapsed response time. Even though the Local connection should show only a negligible latency the times become significant when there are very large numbers of interactions with the database. In the LAN test the drop in response time is steeper. The difference between the LAN and Local timings would probably not be significant for most applications, but where there are design faults which lead to large amounts of network traffic it could be an issue. With the WAN configuration the drop-off in response time becomes very significant quite quickly. 

But do the tests reasonably match with the expected results? Although for Local and LAN there is no fixed expected latency the WAN configuration has, as I stated at the start, an expected round trip latency of 10-12 ms. Below are the figures for the difference between Local, LAN and WAN timings.

Delta times between configuration tests (s)
You can see that the delta time for the WAN compared to the Local timing is on average about 9ms. This is near our expected 10-12ms. Not spot on the values we expect, but within a range that shows that the tests are producing a reasonable estimation of the network latency. The figures for response time calculations are reasonable then. They can be used to give estimates of the effect of network latency on applications.

Finishing up

Network latency and network performance between the application tier and database tier can be a significant factor in response times. It has an effect that can be measured and these measurements can be used to feed information into business cases for improvements in infrastructure to reduce latency or the re-factoring of applications to reduce network round trips.

The second topic, that of application re-factoring, involves more than just network latency. It also includes other overheads that can have a far larger impact on overall performance. In a future blog I will return to the topic of how you might apply these measurements in real applications.

1 comment:

  1. Great job Mark. I see this all the time.. There is a batch job that takes 2 minutes, and makes 20,000 calls to the database. The law of physics is the bottleneck not the database.
    Developers need to utilize more logic in the database and get rid of the chatter.