Just a quick post on why sharding makes large WordPress installs happy, with a real-life example, and on how being able to see the code makes debugging and lifting a service out of downtime much easier.
We sharded the db of UBC Blogs about a year into the project, back in 2010, after reading about the bottlenecks a large WPMU install can hit; generally it just keeps the database happy. Why does this make a DB happy? Situations like what happened last week. We had a plugin on UBC Blogs running happily since May with no problems, churning out xAPI statements by the thousand. We needed to move it to another WordPress multisite install with far fewer sites and users than UBC Blogs (still a fair amount, 2,000+ sites), but one that was not partitioned / sharded. It ran for a week with no problems noticed… once the students arrived, though, bam: the CPU was devoured on the database server. Quickly watching the idle drop to 20%, it was clear a crash was going to occur.
Thoughts were initially:
1) Did the devs push anything new? Confirmed that was a no.
2) Was it just more traffic? Looking at the analytics, it was nowhere near peaks we had handled with no problem in the past.
Thinking it might have been more traffic plus some bad code, we asked IT for more CPU / RAM, which they delivered. That brought the idle up, but we weren't out of the woods: it hovered around 25% and eventually tanked briefly.
We quickly started digging through the web logs: nothing out of the ordinary. Then on to the db logs: no MySQL errors. Then to the db console with the old SHOW PROCESSLIST; which revealed a nasty SHOW TABLES LIKE query running often. That is absolutely devastating on a multisite install with an unpartitioned database: the query had to traverse 50,000+ tables, and just running it raw took 4 seconds to execute. On UBC Blogs, because the db is partitioned and the table in question lives in a global db, the same query is practically instant, with zero performance impact. Once we realized this was the issue, I quickly asked Richard to kill the plugin that was doing this, and we were back to normal. The plugin fix was very easy: since the plugin only had to run that query once, on installation, Richard patched it and put it back into production quickly.
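The console session above boils down to a few MySQL commands. A minimal sketch, assuming shell access to the db host; the database name, table pattern, and process Id here are placeholder examples, not the actual plugin's:

# List running queries and how long they have been going;
# long-running entries point at the culprit.
mysql -u root -p -e "SHOW FULL PROCESSLIST;"

# The offending pattern: on an unpartitioned multisite db this
# has to scan metadata for every one of the 50,000+ tables.
mysql -u root -p wordpress_db -e "SHOW TABLES LIKE 'wp_%_example_plugin';"

# Kill a runaway query using the Id column from the processlist.
mysql -u root -p -e "KILL 12345;"

On a sharded install the same SHOW TABLES LIKE only scans the handful of tables in the shard it runs against, which is why UBC Blogs never felt it.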
- Partitioning is good.
- Having an application performance monitor would help locate bad code / issues more efficiently (time for New Relic?).
- Having the ability to see, fix, and re-deploy code quickly in an open source environment makes getting out of bad situations manageable, and in some instances relatively painless. (I actually have no idea how we would get out of this if it were a black-box closed-source app. Call the vendor???)
We were throttled on the UBC CMS by kick-back traffic from Connect going down. Always impressed with how well our pretty basic setup handles it. A properly tuned PHP opcode cache with a content cache still goes a long way 🙂
Using this handy cURL command to see if the blocks we are adding to prevent DDoS scans are working:
curl -L -A "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5" -c cookie.txt -X POST https://example.com/wp-login.php
-L = follow redirects
-A = user agent string
-c = cookie file
-X = request method
Furthering our attempts to make our WordPress install handle real-time classroom back-channel usage, I started testing Batcache/Memcache in our verf environment as a replacement for the good old reliable WP Super Cache. WP Super Cache is great on a single-server install serving cached pages to traditional traffic, but once you have a fair number of users logging in across a multi-server install, it may be time to move to a persistent backend for the WordPress object cache.
memcache installed on a 1 CPU / 4 GB RAM VM running RHEL 6.3 (Santiago)
on the web servers, pecl memcache-3.0.7 installed (*this was key: the default 2.2.7 was very buggy). Web servers had 4 GB of RAM and 2x CPU.
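Two quick checks we can run to confirm the setup; a sketch, assuming shell access to a web server and the memcached box (the hostname memcache-host is a placeholder):

# Confirm the pecl memcache extension version loaded into PHP;
# 3.0.7 was the key fix, the stock 2.2.7 was very buggy for us.
php -r 'echo phpversion("memcache"), PHP_EOL;'

# Poke the memcached daemon directly and pull a few stats
# (uptime, connections, hit counters) over its text protocol.
echo stats | nc -w 1 memcache-host 11211 | egrep 'uptime|curr_connections|get_hits'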
Test Results comparing WP-Super-Cache vs Batcache/Memcache
using ab -n 600 -c 100 against a site running PulsePress
WP Super Cache:
Concurrency Level: 100
Requests per second: 49.32 [#/sec] (mean)
Batcache/Memcache:
Concurrency Level: 100
Requests per second: 1007.86 [#/sec] (mean)
That is a huge jump in requests per second.
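To put a number on that jump, the ratio of the two ab means works out to roughly 20x:

# Speedup of Batcache/Memcache over WP Super Cache,
# from the two requests-per-second means above.
awk 'BEGIN { printf "%.1fx\n", 1007.86 / 49.32 }'
# prints 20.4x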
Trying even bigger loads, with 250 concurrent connections making 5,000 requests, still gave a pretty solid result:
Requests per second: 333.19 [#/sec] (mean)
September is always the busy time on campus, so one cool thing to do is look back at last year's traffic and see how things are faring: are things going up or down, are things scaling, etc. Well, things are still going up, way up, which is very cool. No longer brochure traffic! WordPress is more popular than ever on this campus.
The WordPress CMS service has been growing rapidly and has seen a huge spike in traffic, thanks to some 50+ campus sites coming on board (and one very popular site, http://elearning.ubc.ca). Last year during the month of September the service had 79,821 page views; fast forward a year and the service averaged 2,009,301 page views, a 25x increase!
WordPress CMS Service page views Sept 2009 – 2010
UBC Blogs, which has moved out of “pilot” mode, is also increasing, though not as rapidly as the CMS service, which is kind of a surprise; we thought it would be the opposite… In September 2009 UBC Blogs saw 158,654 page views; in September 2010, 434,203, more than double.
UBC Blogs page views September 2009 – 2010
Even wiki.ubc.ca has seen its traffic double, thanks to quite a few courses taking a stab at developing their course in the wide-open wiki.