Show tables like === BAD , Sharding saves DBs and Open Source debugging

Just a quick post on why sharding makes large WordPress installs happy with a real life example and how seeing the code makes debugging and lifting a service out of a downtime much easier.

We sharded the db of UBC Blogs about a year into the project in 2010 after reading about potential bottle necks a large WPMU install can hit and generally just makes the db happy. Why this makes a DB happy are situations like what happened last week. We had a plugin on UBC Blogs running happily since May with no problems churning out xAPI statements by the thousand . We needed to move this to another WordPress multi site install with far fewer sites and users as UBC Blogs (still a fair amount 2000+ sites) but not partitioned / sharded. It ran for a week no problems noticed… once the students arrived though bam the CPU was devoured on the database server quickly watching the idle drop to 20% it was clear a crash was going to occur.

Thoughts were initially
1) Did the devs push anything new? confirmed that was a no.
2) Was it just more traffic? Looking at the analytics it was not up near a peak that we handled no problem in the past?

Thinking it might have been more traffic and some bad code we asked for more CPU / RAM from IT which they delivered. That brought the idle up but not out of the woods still around 25% and eventually tanked briefly.

Quickly started digging through the web logs nothing out of the ordinary then onto the db logs no MySQL errors. Then moved onto the db console using the old SHOW PROCESSLIST; It was revealed a nasty SHOW TABLES LIKE query was running often which is absolutely devastating on a multi site install with an un partitioned database, this query had to traverse 50000+ tables just running the query raw took 4 seconds to execute. On UBC Blogs because the db was partitioned and the table is located in a global db this query was pretty much instant zero performance impact. Once realizing this was the issue quickly asked Richard to kill the plugin that was doing this and we were back to normal. The plugin fix was very easy since the plugin only had to run that query once on installation Richard patched and put it back into production quickly.

Take aways?

  • Partitioning is good.
  • Having an application performance monitor would be best to locate bad code / issues more efficiently (time for New Relic?) .
  • Having the ability to see fix and re-deploy code quickly in an open source enviro makes getting out of bad situations manageable in some instances relatively painless (I actually have no idea how we would get out of this if it was a black box closed source app call the vendor???) .