'Busy Tone' for CGI Web Applications
Steve Jenkin
This was implemented for the Australian Government's first large on-line transaction system - Australian Business Number (ABN) registrations. The site served 20 times its normal load for two days and maintained reasonable service levels for connected users.
Techniques for handling caching or overload of static pages are well understood. This paper describes a method for handling overload of CGI-based applications.
The telephone system answers the question, "How do you economically handle extraordinarily high traffic loads, such as Christmas Day?", with a 'busy tone' - "come back later, we're busy just now".
These traffic peaks are characterised as unpredictable, unrestrained demand of short duration, with no tolerance for increased pricing. It is uneconomic for the service provider to install and maintain sufficient permanent capacity for a 10- or 20-fold increase over normal loads. Callers/users will trade overall cost for busy time limitations, preferably with predictable delays.
This concept is also called "Load Shedding".
The prerequisites are:-
Extreme peaks in demand that can't be serviced economically,
user tolerance of limited busy time access,
a usability requirement of reasonable service levels once connected,
precise and repeatable identification of individual sessions, allowing user anonymity,
real-time measurement of the primary service level characteristic, and
the ability for a real-time load shedding response to system saturation.
The on-line ABN registration system (https://trans.business.gov.au) was deployed in November 1999. It was expected to serve 5% of a total 2M registrations. Businesses wishing to claim GST (Goods & Services Tax) rebates after 1st July, 2000, had to apply for an ABN 2-4 weeks in advance to allow for Tax Office checks and processing. The advertised deadline, our "Christmas Day", was 31 May 2000. Figures from Canada suggested 60% of registrations would occur in the last 3 weeks.
The application consisted of ~20 HTML forms backed by a CGI script interfacing to an ORACLE database. Completed applications were transferred by verified e-mail to the Tax Office system. Security regulations required the database system be isolated from the web servers by a firewall. A 'layer 4' (L4) load balancer was used to share the load across multiple web servers, and to provide increased manageability and availability. All connections were SSL (https) hence sessions could be identified by the L4 switch and web server without using cookies - a key part of the privacy policy.
The CGI application could not see this information and relied on HTTP's username/password facility to identify sessions and allow their later 'resumption'.
There was a single 'entry' page where registrants were allocated a unique session identifier and asked to supply a password. This page was used as the 'gate', either allowing users access to the registration forms or delivering a 'busy, please try later' page.
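As an illustration only (not the production code), a CGI script behind HTTP authentication can recover the session identity from its environment; here the session identifier is assumed to double as the HTTP username:

    import os

    def current_session_id():
        # With HTTP username/password authentication the web server checks the
        # password and exposes the authenticated username in the standard CGI
        # variable REMOTE_USER; in this scheme it is also the session identifier.
        session_id = os.environ.get("REMOTE_USER")
        if session_id is None:
            raise RuntimeError("form page requested outside an authenticated session")
        return session_id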
The site accepted 350,000 ABN applications of 3.3M total by 1 June 2000. Nearly 50% of sessions led to submitted applications. A factor in this was an application time limit of 36 hours to 'resume' a held application.
Registrations grew from a few hundred a day at launch, to 21,000 on 31st May, 2000. 'Steady state' traffic became ~650 a day, 4000 a week - probably half current new registrations. Traffic doubled about every 3 weeks, excepting the six week school holiday period over Christmas.
There were consistent daily and weekly traffic patterns - two roughly equal peaks during the day, AM and PM, and an evening peak of up to 75% of the daytime peaks. Traffic tailed off quickly after midnight East Coast time, resuming about 6 AM and getting busy about 9 AM. The evening peak was consistently missing on Friday nights, whilst it took until midday Saturday for the traffic to rise. Friday is still 'party night' in Australia. The Sunday night peak lasted later than any other day - till past midnight. Weekend traffic was about 60% of normal. Mondays were always 'slow days', whilst the busiest day moved from Tuesday to Thursday.
The web and database servers were all multi-CPU systems capable of being quickly upgraded. There was sufficient system memory to avoid any paging, except during database backups. The database server used a hardware disk array giving about a 10 fold performance increase on most disk operations over the other systems. There was a six week lead time in purchasing additional CPU's. For the last of the 'crunch' time, the hardware supplier generously loaned some disks and a CPU board, giving 2 and 4 CPU's on the two web servers and 6 CPU's on the database. Commissioning additional web servers was investigated, but abandoned due to the relatively low performance of available hardware and the high cost and delay in building and configuring 'hardened' systems.
ORACLE was hired to perform a comprehensive performance analysis of the database and to tune it. Another contractor was employed to analyse the network traffic, identify bottlenecks, and report daily on system throughput, response to load, and estimate system throughput limits. A number of significant application improvements were made as a result.
Some simple light-weight system monitoring, via pages of graphs available from the web server, was set up with the public domain software RRD (Round Robin Database). This became the most powerful and useful tool for monitoring system health and performance. It was crucial in explaining performance to managers and developers alike. It was also crucial in tuning and monitoring the 'busy tone' software when deployed.
A key system overhead, the CPU time used by the web server in processing the encrypted SSL connections, was not available. Replaying and timing live traffic captured after decryption to establish a comparison was not possible at the time.
At the end of February, well before the final deadline of 31st May, the system hit saturation. At the time there was no L4 switch, only a single 2-CPU web server and 2 CPU's in the database server. The peak system load had been 70% the previous week, allowing about 2 weeks before saturation. A number of system activities were scheduled to meet this projected demand in time.
The Tax Office launched its advertising campaign for the site over the weekend, and traffic roughly doubled on the Monday. As queuing theory predicts, when a resource nears saturation, service times explode. Response times that had sat around 2 seconds for months went out beyond 20 seconds, and a flood of complaints came down the hotline. With responses delayed past the TCP timeouts, no user could get anything back from the system. Although the systems were flat out, no useful work was done.
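As an illustration only (the ABN system was never modelled formally), the simplest queuing model, M/M/1, gives a mean response time W = (1/mu) / (1 - rho), where 1/mu is the bare service time and rho the utilisation: at 50% utilisation W is twice the service time, at 90% it is ten times, and as rho approaches 1 it grows without bound.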
This provided the impetus to commission the 'busy tone' subsystem and for management to approve roughly doubling CPU capacity. The overload was addressed in the short term by bringing forward a scheduled upgrade to the script interpreter used.
The 'gate' page CGI acted in concert with a monitoring program that set a busy flag in the database. Additionally, to cater for failure of the monitor program, the 'gate' page CGI also counted, using the database, the number of applications started. When this exceeded a tuneable threshold per period, the busy page was served.
Because there was a single database and multiple web servers, the monitor program was run on only a single web server, the one judged to be the busiest. The L4 switch was used to apportion sessions according to the CPU capacity of the frontends.
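A minimal sketch of the 'gate' decision described above, assuming some interface 'db' to the shared database; the method names and the threshold value are illustrative, not the deployed code:

    APPS_PER_PERIOD_LIMIT = 120   # tuneable threshold of applications started per period

    def should_serve_busy(db):
        # Primary signal: the busy flag set in the database by the monitor program.
        if db.busy_flag_is_set():
            return True
        # Fallback, in case the monitor program has died: count the applications
        # started this period and compare against the tuneable threshold.
        return db.applications_started_this_period() > APPS_PER_PERIOD_LIMIT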
The performance target was a per page 'response time' of 5 seconds. Only the page start time was available from the web server logs, not the elapsed time per page. Also, there was no way to separate the server response time (internal) from the network (external) delays. Web browser delays (rendering pages) were also ignored, being completely client dependent and unmeasurable.
The response time monitor program read the system accounting logs each period (settable) and calculated a mean response time. During overnight database backups, apparent response times blew out, so a minimum traffic threshold was added to the monitor. A huge assumption was made: that the execution time of the CGI program reflected the server part of the total response time.
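A sketch of the monitor's core decision, assuming records of (program name, elapsed seconds) have already been extracted from the period's accounting logs; the program name, thresholds and database interface are illustrative:

    TARGET_SECONDS = 5.0   # per-page response time target
    MIN_SAMPLES = 20       # minimum traffic per period, so backups don't trip the flag

    def update_busy_flag(db, records):
        cgi_times = [secs for (prog, secs) in records if prog == "register.cgi"]
        if len(cgi_times) < MIN_SAMPLES:
            return                          # too quiet to judge; leave the flag alone
        mean = sum(cgi_times) / len(cgi_times)
        # Key assumption from the text: CGI execution time stands in for the
        # server-side part of the total response time.
        db.set_busy_flag(mean > TARGET_SECONDS)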
An initial design idea was to model behaviour on 'sendmail' and set the busy flag based on 'system load average', a number that scales accurately with multiple CPU's and systems, is very cheap to obtain, and needs no background monitor. This was abandoned when it was realised that it is a secondary measure, not a primary one, and behaves exceedingly non-linearly. By the time the load average started to rise, the response times would have already exceeded the target. Data for load average and nominal response times still exist, and this hypothesis could be tested.
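For comparison, the abandoned check would have been as cheap as the following (the threshold is illustrative), but the load average only starts to climb after response times have already blown out:

    import os

    one_minute_load, _, _ = os.getloadavg()   # no background monitor needed
    busy = one_minute_load > 8.0              # e.g. a threshold scaled to the CPU count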
The L4 switch also tested the 'health' of each of the web servers by polling a page. If a server failed to respond before the next health check, the server was automatically taken out of service. The first time the database system saturated, all the web servers were 'failed' as response times uniformly blew out. For the rest of the duration, the L4 switch 'health checks' were set to 'TCP/IP' (ping).
The busy tone subsystem was deployed 3 weeks before the deadline. There was still more than 30% of the system unused at the time. Apart from testing, it 'fired' for the first time about 10 days before the deadline. It was active for most of the 2nd-last day, and from about 9 AM until after midnight on the final day. Due to other commitments on test system resources, and privacy problems with recording and replaying live traffic, none of the design and implementation assumptions had been tested before deployment, nor were the system's response characteristics known.
Tuning the busy tone parameters, and understanding the interactions between load, demand, and system behaviour, proved somewhat 'hit and miss'. A final set of tuning parameters was settled on that kept response time 'cycling' between 3 and 5 seconds and the rate of new sessions reasonably constant, maximising system throughput whilst maintaining user response times.
The final model was a feedback system with high inertia, because the 'gate' committed the system to 20 or more pages of work over ~30 minutes for each accepted connection. The monitor cycle time was initially too long: it provided a 'moving average' of 5 to 10 minutes, nowhere near an instantaneous value. The system showed classic feedback-system 'hunting' behaviour, cycling between 2 and 15 seconds response time. The 'undershoot' significantly lowered system throughput, causing the subsystem to be switched off at times. ('Hunting' is a control system failure, usually heard in stationary motors when the speed surges up and down.)
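One conventional way to damp such hunting, consistent with the 3 to 5 second band finally settled on, is hysteresis: raise the flag above the upper bound, but clear it only once the response time falls below the lower bound. The sketch below illustrates the idea; it is not the deployed code:

    BUSY_ABOVE = 5.0    # seconds: set the busy flag above this mean response time
    CLEAR_BELOW = 3.0   # seconds: clear it only once the mean falls below this

    def next_flag_state(currently_busy, mean_response_time):
        if mean_response_time > BUSY_ABOVE:
            return True
        if mean_response_time < CLEAR_BELOW:
            return False
        return currently_busy   # inside the band: hold the previous state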
The resources used in running the 'gate' CGI script were not monitored, resulting in only about 75% of best-case registration throughput on the final night. The rest of the system resources were presumably devoted to serving 'busy pages'.
Unexpectedly, the firewalls became the ultimate limiting factor. All systems only reached about 80-90% CPU at the busiest time, whilst a number of subsystems, such as DNS and e-mail, experienced network connectivity problems.
Other strategies were employed to maximise throughput over the last two days. The supporting website and searches were turned off and replaced by a single static page, e-mailing of submitted applications was rescheduled to the overnight quiet time, and replication of the database logs (500 Mb/hour) using the CPU-intensive 'secure shell' was stopped. A number of small websites containing mostly static pages had been moved to an unused small system well beforehand.
Whilst the subsystem developed performed adequately, it only had to survive a single overwhelming peak. Intensive management, manual intervention, and some extreme trade-offs, such as stopping rollback log replication exactly when it was needed most, were acceptable for the last two days.
For large scale commercial systems that can reasonably expect ongoing unpredictable overwhelming loads, moving a number of facilities off the web servers to dedicated switches or servers optimised for their tasks would be most useful:-
SSL servers are now available to lighten server CPU load considerably.
Definitive response timings, even per server, are available at the L4 switch.
The L4 switch, if it buffers server responses, can simultaneously calculate both the server and network delay, and minimise use of kernel resources on servers by quickly releasing connections.
The L4 switch is the perfect place to make the busy tone decision and return a static busy page with a short cache life (an example response is sketched below).
L4 switches already recognise sessions, allowing busy tone to be extended from the 'gate' configuration described here, to sets of individual CGI pages.
Modifying individual CGI scripts for busy tone requires considerable redesign and rework. The L4 switch is already a separate, high reliability device that could provide a consistent, easily implemented and tuned implementation.
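As an illustration of the static busy page suggested above, whichever device returns it - web server or L4 switch - can give it a short cache life and a retry hint, so that browsers and proxies do not keep serving the busy answer after the peak has passed. The status code, header values and wording below are only one possible choice, shown here in CGI-output form:

    # Minimal CGI-style busy response with a short cache life (illustrative values).
    print("Status: 503 Service Unavailable")
    print("Retry-After: 120")            # hint to come back in a couple of minutes
    print("Cache-Control: max-age=30")   # the busy answer must not outlive the peak
    print("Content-Type: text/html")
    print()
    print("<html><body>We are busy just now - please try again shortly.</body></html>")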