Thursday, June 30, 2005

JavaOne: Real-world high availability and scalability design decisions

Notes from JavaOne 2005...

Mikael Ottosson, Oracle

"The Daily Planet" system characteristics: Hosts more than 20 different major news sites and growing. More than 10 million unique users per month. Dynamic and quickly changing load characteristics. Uptime and reliability is most important when the load is the highest.

Requirements: Scalability - how to scale with increased traffic? more applications? peak loads and traffic spikes? Availability - How to achieve no single-points-of-failure? Handle system abnormalities? How does app isolation in a clustered environment affect reliability and utilization? System management - how to manage dynamic application updates? configuration changes?

High availability and scalability - When designing a hardware and software architecture, scalability, reliability, redundancy, and fault tolerance must be considered because one affects the other.

Choices...
Hardware configuration - few large or many smaller servers? Layered architecture - how to avoid the "funnel of death". Load balancing and failover. Session affinity. Application isolation.

If you have good management and monitoring tools, many smaller boxes are usually better than fewer large ones.

Layered architecture - the funnel effect.

Normal operation, normal load: X users hitting web tier results in Y connections to J2EE layer and Z resources are consumed in J2EE and back-end layers. Unusually high user load, spikes can consume more resources in underlying layers - high threads, memory, CPU.

Normal load, latency in middle-tier or back-end systems. Latency can result in unpredictable resource allocations in all tiers.

Solutions: implement a throttle at the web server layer. Pay close attention to how back-end resources are utilized. Monitor request-response timing. Use timeouts in infrastructure and when accessing external services - don't allow infinite wait to happen. Implement "smart" load balancing metrics.

Load balancing and failover: goal is to get even distribution of load during normal operation. Get even distribution of failed-over sessions. Get the right distribution when something abnormal happens. Get the right ramp-up when starting or re-starting components.

Normal operation - round robin or a simple metric-based load balancing is usually fine. Also works well for redistributing failed-over systems - might need to adapt if application state requires it.

When latency goes up or other abnormality, you might need more intelligence for load balancing. Choke functionality, metric-based load balancing, watch connections and CPU - don't go into hyper-polling mode, since that can exacerbate performance issues.

For start-up, choke, metrics, or monitoring-based load balancing can be useful.

The idea is that you keep it simple as long as possible until you really need to do something more sophisticated - that means your apps need to be able to recognize and adapt to changing conditions.

Session affinity - when you need to pin sessions to servers. This is another place where metric-based load balancing (connections, sessions, CPU, whatever) may apply. This is important to restore equilibrium in the event of system failures.

When evaluating load balancing and fail-over strategies, you must consider various scenarios including specifics of applications, user patterns, and abnormalities.

Application isolation and resource utilization. Do you stripe apps across servers? Do you combine same apps on multiple containers on same systems? If you had unlimited hardware and good management tools, it's better to keep applications isolated from each other. However, this isn't usually practical. Resource utilization will be best if you stripe apps across all servers.

System and application management (e.g., application updates or configuration changes) affects the availability of the system. "Daily Planet" requirements: need for dynamic updates to the apps without downtime and with minimal administration (could be changes to individual pages, smaller changes to already deployed apps). Need for controlled, larger changes without downtime.

Dynamic application updates - servers all read their apps from a single NFS directory. Configuration files may be needed locally on each server. To make config updates - change on one server and propagate to others, then manually or automatically restart servers in sequence.

Summary: More small servers scale better and give more reliability / flexibility, but attention must be paid to system & app management. Architecture, "funnel effect", load balancing, and fail-over. Choke control at web services layer. Careful attention to back-end resources. Careful implementation of functionality for load balancing, fail-over scenarios, startup, and restart. Avoid or limit session affinity if possible. System and application management - understand requirements and procedures, understand need for scalability and availability.

3 comments:

andrew cooke said...

seems like we went to the same presentations (along with a thousand others).

found your page trying to google for more specific information on routing. wondered if you knew - or at least whether you could point me to a more exact idea of what i'm asking.

before javaone we were going to use jms with activemq servers all over teh damn place so that we could get p2p asynchronous messaging without going through a central server, when such communication was appropriate.

my question, then, is wheter we get that for free with esbs. or whether it's a stupid optimisation and shows that we're doing something wrong (like don't have well enough separated services, or are using messaging at two low a level).

andrew cooke said...

oh, my email is acooke@noao.edu, though i'll try to check back here. just on the off chance...

andrew cooke said...

forgive stupid errors too!

www.flickr.com