Scale and Performance
I work at the BBC, where scale and performance are a big deal. After talking to Mark Gledhill (one of the Senior Software Engineers) and Andy Macinnes (head of ops) I put together some notes on the subject. It isn't rocket science; just good common sense.
The general message is to avoid unnecessary repetition. If you have a scalability/performance problem then you could explore ...
Tweak Server settings
Major example: increase db connections
The aim of caching is to reduce hits on the server.
You can cache at different levels:
- code. e.g. store result of a calculation
- file. e.g. news feed updated every 10 minutes
- template. i.e. common part of the page
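As a rough sketch of the code-level case, the decorator below keeps a function's result for a fixed time and only recomputes once it has expired. The names `ttl_cache` and `build_news_feed` are made up for illustration, and the 600 seconds simply mirrors the 10-minute news feed example.

```python
import time
from functools import wraps
from typing import Any, Callable, Dict, Tuple


def ttl_cache(ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
    """Cache results per argument tuple and reuse them for ttl_seconds."""
    def decorator(fn):
        store: Dict[Tuple[Any, ...], Tuple[float, Any]] = {}

        @wraps(fn)
        def wrapper(*args):
            now = clock()
            hit = store.get(args)
            if hit is not None and now - hit[0] < ttl_seconds:
                return hit[1]            # still fresh: skip the expensive work
            value = fn(*args)            # missing or expired: recompute
            store[args] = (now, value)
            return value

        wrapper.cache = store            # exposed so callers can clear it
        return wrapper
    return decorator


@ttl_cache(ttl_seconds=600)  # hypothetical feed, rebuilt at most every 10 minutes
def build_news_feed(section: str) -> str:
    # Stand-in for the slow part (database queries, templating, etc.).
    return f"<feed section='{section}'/>"
```

Calling `build_news_feed("sport")` twice within ten minutes does the work once; the second call is served from memory.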
You can flatten dynamic pages to enable caching. Until recently almost all BBC pages were flat html served statically through a publishing queue. It is clunky but serves pages fast because the pages can be heavily cached. Only recently have we moved to dynamic publishing; this has only been possible by buying lots of kit.
You must remember to release the cache. This is complex when using several layers of caching. Poor caching can cause a memory leak if the cache is not released.
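One common way to stop a cache growing without limit is to cap its size and evict the least recently used entry. The sketch below is an illustration only; `BoundedCache` is a hypothetical name, not something taken from the BBC code base.

```python
from collections import OrderedDict


class BoundedCache:
    """In-memory cache that never holds more than max_entries items."""

    def __init__(self, max_entries: int):
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self._max = max_entries
        self._data = OrderedDict()

    def get(self, key, default=None):
        if key not in self._data:
            return default
        self._data.move_to_end(key)          # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        while len(self._data) > self._max:
            self._data.popitem(last=False)   # evict the least recently used

    def invalidate(self, key):
        """Release one entry explicitly, e.g. when the source data changes."""
        self._data.pop(key, None)

    def __len__(self):
        return len(self._data)
```

The size cap deals with the memory leak; `invalidate` is the hook for releasing an entry when the underlying data changes, which is the part that gets complicated across several cache layers.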
You need fast access to the database, so ...
- keep db connections open, e.g. each apache server could have one db connection open at all times
- index what you're interested in
- clear out expired info
- Possibly replicate DB for redundancy
- Cache queries
- Optimise queries
- focus on writes (writes take 1000 times longer than reads)
- but reads can also be problematic if spanning multiple tables
- can aggregate data from normalised tables into a new table to query against. Requires synchronising the various tables on update
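A few of these points can be sketched with Python's built-in `sqlite3` module: one connection opened once and reused, an index on the column being searched, and a small aggregate table kept in step with the normalised one. The table and function names are invented for the example, and SQLite stands in for whatever database is actually in use.

```python
import sqlite3

# One connection, opened once and reused for every request.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, forum TEXT, author TEXT);
    CREATE INDEX idx_posts_forum ON posts (forum);  -- index what you query on
    CREATE TABLE forum_totals (                     -- aggregate table for reads
        forum TEXT PRIMARY KEY,
        post_count INTEGER NOT NULL
    );
""")


def add_post(forum: str, author: str) -> None:
    """Write to both tables in one transaction so they stay synchronised."""
    with conn:
        conn.execute(
            "INSERT INTO posts (forum, author) VALUES (?, ?)", (forum, author)
        )
        conn.execute(
            "INSERT OR IGNORE INTO forum_totals (forum, post_count) VALUES (?, 0)",
            (forum,),
        )
        conn.execute(
            "UPDATE forum_totals SET post_count = post_count + 1 WHERE forum = ?",
            (forum,),
        )


def post_count(forum: str) -> int:
    """Cheap read: one primary-key lookup instead of counting rows."""
    row = conn.execute(
        "SELECT post_count FROM forum_totals WHERE forum = ?", (forum,)
    ).fetchone()
    return row[0] if row else 0
```

The trade-off is visible in `add_post`: every write now touches two tables, which is the synchronisation cost mentioned above.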
For load testing to be effective you must simulate live conditions and hit scripts with millions of requests. The testers used Avalanche for this but the developers use ApacheBench (ab).
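To show the idea in miniature, the sketch below fires a number of concurrent calls at a function and reports throughput, failures and 95th percentile latency. It is nowhere near a substitute for Avalanche or ApacheBench; `run_load` and `request_fn` are hypothetical names, and in real use `request_fn` would make an HTTP request to the script under test.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_load(request_fn, total_requests: int, concurrency: int) -> dict:
    """Call request_fn(i) total_requests times using `concurrency` threads."""

    def timed_call(i):
        start = time.perf_counter()
        ok = True
        try:
            request_fn(i)
        except Exception:
            ok = False                      # count failures, keep going
        return time.perf_counter() - start, ok

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_call, range(total_requests)))
    wall = time.perf_counter() - wall_start

    latencies = sorted(elapsed for elapsed, _ in results)
    failures = sum(1 for _, ok in results if not ok)
    # Nearest-rank style 95th percentile over the sorted latencies.
    p95_index = min(len(latencies) - 1, int(0.95 * len(latencies)))
    return {
        "requests": total_requests,
        "failures": failures,
        "requests_per_sec": total_requests / wall if wall > 0 else float("inf"),
        "p95_seconds": latencies[p95_index],
    }
```

The numbers only mean something if the request mix and volume resemble live traffic, which is the point made above.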
You must understand what simple programming statements are doing behind the scenes, e.g. db queries in Ruby on Rails.
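A classic case is a loop that looks harmless but issues one query per row. The sketch below uses Python's `sqlite3` rather than Rails, purely as an illustration, and counts the SELECT statements each approach sends; the tables and function names are invented.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE articles (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO articles VALUES (1, 1, 'A'), (2, 2, 'B'), (3, 1, 'C');
""")

statements = []
db.set_trace_callback(statements.append)   # record every statement executed


def titles_lazy():
    """One query for the articles, then one more per article for its author."""
    rows = db.execute(
        "SELECT id, author_id, title FROM articles ORDER BY id"
    ).fetchall()
    result = []
    for _id, author_id, title in rows:
        name = db.execute(
            "SELECT name FROM authors WHERE id = ?", (author_id,)
        ).fetchone()[0]
        result.append((title, name))
    return result


def titles_joined():
    """The same data fetched with a single joined query."""
    return db.execute(
        "SELECT a.title, u.name FROM articles a "
        "JOIN authors u ON u.id = a.author_id ORDER BY a.id"
    ).fetchall()


def count_selects(fn):
    """Run fn and return (its result, number of SELECTs it triggered)."""
    statements.clear()
    result = fn()
    selects = [s for s in statements if s.lstrip().upper().startswith("SELECT")]
    return result, len(selects)
```

With three articles the lazy version sends four SELECTs and the joined version sends one; with thousands of rows the gap grows in step with the row count.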
At the BBC page assembly is handled by specialists variously called Client Side Developers (CSDs), Front End Developers or Web Developers.
They have to:
- reduce number of files used to assemble page
- reduce page weight
- take care of SSIs
- have conditional logic
- have variables which can hit a performance threshold
- User work flow
- change user experience to reduce hits on server
- An example of this was the 606 forum where an editorial change eliminated a major performance problem without technical input.
- Spread load
- BBC was using "DNS round robin", where a user is restricted to a particular server for 5 minutes and is then allocated a new server. One ISP has one DNS server, so the ISP, and all its users, go to one random server for 5 minutes. A major problem is that you can't take a machine out.
- Now requests go to servers that are up
- Track trends and look at the graphs over time
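The "requests go to servers that are up" approach to spreading load can be sketched as a round robin that skips machines marked as down. This is an illustration of the idea only, not how the BBC's load balancing is implemented; the class and server names are hypothetical.

```python
class HealthAwareBalancer:
    """Round robin over a fixed server list, skipping servers marked down."""

    def __init__(self, servers):
        self._servers = list(servers)
        self._up = set(self._servers)
        self._next = 0

    def mark_down(self, server):
        # e.g. called when a health check fails or for planned maintenance
        self._up.discard(server)

    def mark_up(self, server):
        if server in self._servers:
            self._up.add(server)

    def pick(self):
        """Return the next healthy server in rotation."""
        for _ in range(len(self._servers)):
            server = self._servers[self._next % len(self._servers)]
            self._next += 1
            if server in self._up:
                return server
        raise RuntimeError("no servers are up")
```

Unlike plain DNS round robin, a machine can be taken out of rotation with `mark_down` and no request is sent to it until `mark_up` is called.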
Multi-casting means that, instead of streaming directly to users, we stream to ISPs who stream to users. This takes the load off our servers.
Get more capacity
- Buy more kit in the hosting environment
- Buy more streaming capability from your Content Distribution Network (CDN), e.g. Akamai
- Use existing kit differently. For example if you can predict the peaks then temporarily borrow kit from other areas for the peak and then hand it back afterwards.
The BBC covers scalability issues in one of its development standards. This is so the organisation, and new developers, can benefit from past lessons.
Rough overview of the BBC set up
The BBC has impressive stats:
- page traffic doubles every year
- Aim to serve every page in < 1.3 seconds
- this isn't possible on all pages. For example the old homepage took ~ 5 seconds.
- serve 13 GB data/sec (peak over 20 GB)
- 6 billion page impressions / month
- performance varies a lot
- Media Selector 120 requests / sec
- SSO 20 requests / sec
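To put the monthly figure in per-second terms, here is a quick back-of-envelope calculation. It assumes a 30-day month and perfectly even traffic, neither of which is true in practice, so the result is an average rather than a peak; the function names are made up for the example.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60   # assumption: a 30-day month


def average_per_second(impressions_per_month: float) -> float:
    """Average page impressions per second if traffic were perfectly even."""
    return impressions_per_month / SECONDS_PER_MONTH


def projected(impressions: float, years: int, annual_growth: float = 2.0) -> float:
    """Traffic after `years` if it keeps doubling (growth factor 2) each year."""
    return impressions * annual_growth ** years


monthly = 6_000_000_000                  # 6 billion page impressions / month
print(round(average_per_second(monthly)))        # about 2,300 per second
print(projected(monthly, years=3) / 1e9)         # billions per month after 3 years
```

Six billion a month works out to roughly 2,300 page impressions a second on average, and three more years of doubling would take the monthly figure to 48 billion.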
Broadly speaking both the old and new platforms are a four tier architecture:
- presentation layer
- custom applications e.g. message boards, celebdaq
- platform, e.g. DNA, SSO, Postcoder, Search
- 40-50 web servers
- 20-30 app servers
- each platform replicated on many servers, e.g. SSO on 7
- Solaris 9/10
- Apache 2 on web servers
- Apache 1.3 + modperl on app servers
- predominantly serve static pages (hence borg)
- for dynamic serving
- Linux (HP blades)
- 160 app servers