Ignoring Rules When Benchmarking Databases

This post is part of a series about benchmarking the performance of relational databases.  My earlier posts detailed the official TPC benchmarks and the popular tools for running informal versions of them.  This post describes some of the rules that drive cost and complexity, which leads us to the rules most commonly ignored.

Each database benchmark has a strict set of rules called a "specification".  There is one specification for TPC-C, another for TPC-E, and so on for every TPC benchmark.  Following all of the rules in a given specification is expensive and time-consuming.  Some of the rules aren't practical and can be ignored when testing informally or internally.  You are only required to follow the rules when you test with the intent to publish or plan to use the TPC trademark in a publication.

To be honest, I have never performed a formal TPC benchmark or followed all of the rules.  All of my tests have been informal, using tools like HammerDB, SwingBench, and Benchmark Factory, which don't implement all of the rules and let you disable others.

Earlier I said it is expensive to follow the rules.  A formal TPC-C test conducted by Oracle in 2013 achieved 8.55 million transactions per minute (tpmC) using 170TB of storage and 6.8 million users spread over 16 clients, 56 storage servers, and one database server.  The total system cost was $4.66M in 2013 dollars, so it would certainly cost more today.  If you want to compete with the big boys on a formal TPC benchmark, these are the kinds of resources you must pony up.

To understand how the rules impact cost: the TPC prescribes a performance range your results must fit within, and that drives up the data scale factor, which in turn drives up the number of users and the amount of equipment.  For example, TPC-C requires your results to fall between 9.0 and 12.86 transactions per minute per unit of scale, and you must also have at least 10 users per unit of scale.  The idea is that you first pick a score you want to achieve, divide it by roughly 12.8 to determine how much data you need to load, multiply that by 10 to determine how many users you need, and then build a system big enough to do all of that.  To achieve a score of 1 million you need 1,000,000 / 12.8 = 78,125 warehouses of data loaded, and that requires 78,125 * 10 = 781,250 users.
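To make the arithmetic concrete, here is a rough back-of-the-envelope sketch of that sizing rule in Python.  The 12.86 cap and the ten-users-per-warehouse minimum are the figures from the spec discussed above; the function name is just illustrative and not part of any TPC tool.

```python
import math

MAX_TPMC_PER_WAREHOUSE = 12.86   # spec ceiling on tpmC per warehouse
USERS_PER_WAREHOUSE = 10         # spec minimum of ten terminals per warehouse

def tpcc_sizing(target_tpmc: float) -> tuple[int, int]:
    """Estimate the (warehouses, users) needed to legally post target_tpmc."""
    warehouses = math.ceil(target_tpmc / MAX_TPMC_PER_WAREHOUSE)
    return warehouses, warehouses * USERS_PER_WAREHOUSE

warehouses, users = tpcc_sizing(1_000_000)
print(f"{warehouses:,} warehouses, {users:,} users")
# -> 77,761 warehouses, 777,610 users (the text rounds 12.86 down to 12.8,
#    which gives the slightly larger 78,125 / 781,250 figures)
```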

Here's a real-life example of how the rules drive cost.  The current TPC-C world record is held by Alibaba Cloud Services.  They used a scale factor of 55,944,000 warehouses with 559,440,000 users (10x the scale, as required) and achieved a score of 707,351,007 New Order Transactions Per Minute (tpmC), which works out to about 12.64 tpmC per warehouse and fits neatly within the allowable range of 9.00 to 12.86.  The system consisted of 1,557 database nodes, each with 84 CPU threads and 716 GB RAM, plus 3 management nodes and another 400 client nodes (a total of 1,960 computers).  Total system cost was CNY 2.8 billion (roughly USD $300M).

Think about that.  Someone in a meeting said, "Hey guys, let's shoot for 700 million transactions a minute," and after reading the rules they realized it would require nearly 560 million users and an insane number of computers.  And then the boss actually said, "Ok, let's do it!"

So, yeah, I don't do all of that.  I generally run TPC-C-like tests on a single server with a total cost under $80K.  About half of the cost is the server, and the other half is flash storage.  In practice there is always a new computer about to be deployed that I can borrow for a week of testing, so really my cost is $0.

Now, down to the reality of informal testing.  Let's talk about what most people actually do when running informal benchmarks.

For TPC-C-like testing I disable all of the following …

First, disable "key and think time" in all tests.  In real life each user thinks before typing, and it takes time to key in the data before clicking the Submit button.  The TPC-C specification notes that Key Time averages 9.63 seconds and Think Time averages 11.36 seconds, and the maximum total Key and Think Time is 23.07 seconds per transaction, so each user completes roughly 3 transactions per minute.  Most people who run an informal TPC-C-like benchmark disable Key and Think Time so there is zero lag between transactions and they run as fast as your system can go.

Note on my first rule of cheating … When all constraints like Key Time and Think Time are enabled, a TPC-C-like test in HammerDB will achieve 3.79 transactions per minute per user.  It's predictable.  It doesn't matter how fast your server is.  If you use 5,000 sessions with K&T enabled, then you are going to get 5000 * 3.79 = 18,950 transactions per minute.  Funny how that works.
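For a feel of what Key and Think Time do to a load driver, here is a minimal sketch of a paced virtual-user loop, assuming the average delays quoted above.  transaction_work() is a stand-in for whatever your tool actually submits, not a HammerDB API.

```python
import time

AVG_KEY_TIME = 9.63     # seconds: average keying time quoted above
AVG_THINK_TIME = 11.36  # seconds: average think time quoted above

def transaction_work() -> None:
    """Placeholder for the real work (submit a New Order, Payment, etc.)."""
    pass

def run_virtual_user(duration_s: float, key_and_think: bool) -> int:
    """Run one user loop for duration_s seconds; return transactions completed."""
    done = 0
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        if key_and_think:
            time.sleep(AVG_KEY_TIME)    # user keys in the form
        transaction_work()
        if key_and_think:
            time.sleep(AVG_THINK_TIME)  # user reads the response and thinks
        done += 1
    return done
```

With the sleeps in place each user is pinned to roughly 60 / (9.63 + 11.36) ≈ 2.9 transactions per minute no matter how fast the server is, which is the same effect behind the flat 3.79 tpm-per-user number above; with the sleeps removed, the loop spins as fast as the database can answer.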

The second rule to ignore is the maximum allowed performance and its relationship to the scale factor.  The TPC-C spec says you cannot exceed 12.86 transactions per minute per warehouse, so if you use the HammerDB GUI's maximum of 5,000 warehouses then you're limited to 5000 * 12.86 = 64,300 transactions per minute.  Really!?!  We want a score of a million or more.  Similarly, the TPC-E spec limits your score to (scale / 500) * 1.02.  What they want you to do is pick a target score first and use these formulas to calculate the scale of data you must load.  Forget about it.
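The same formulas can be run in reverse to see the ceiling a given scale imposes.  A quick sketch using the two limits just cited; nothing here comes from the tools themselves.

```python
def tpcc_max_tpm(warehouses: int) -> float:
    """Ceiling on tpmC for a given TPC-C scale factor (12.86 per warehouse)."""
    return warehouses * 12.86

def tpce_max_score(scale: int) -> float:
    """Ceiling on a TPC-E score, per the (scale / 500) * 1.02 formula above."""
    return (scale / 500) * 1.02

print(tpcc_max_tpm(5_000))      # ≈ 64,300 tpm cap at HammerDB's GUI maximum of 5,000 warehouses
print(tpce_max_score(500_000))  # ≈ 1,020 at a scale of 500,000
```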

The third rule to ignore is the required number of users.  The TPC-C spec (section 4.2.2, paragraph 1) calls for ten terminals per warehouse, so even a very small database with a scale of 5,000 warehouses needs 50,000 clients.  (In a formal test the virtual terminals live in a middleware program called a TP monitor; they don't need to be distinct physical clients.)  This "ten times" rule goes hand-in-hand with the Key and Think Time rule we discussed earlier: each of those 50,000 users is sitting at a computer keying in a transaction every 18-20 seconds.  It simply isn't feasible to run 50,000 client sessions against a database server that only has enough resources to support 10,000 process stacks.

Notes on rule 3:

  • Try setting the number of database sessions to twice the number of processor threads on the server.  A server with dual 18-core CPUs and Hyper-Threading enabled has 72 threads, so try 144 users (see the sketch after this list).
  • If you are using Key and Think Time, then consider HammerDB's Event Driven Scaling feature, which essentially turns each of these connections into a pool capable of running hundreds of sessions.
  • HammerDB has a user-scaling feature called AutoPilot that cycles through an increasing number of users so you can determine the optimal number for your server.  You might start at 100 users and increment by 10 up to 200 users to see where performance peaks.
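Here is a tiny helper capturing the first and last suggestions; the numbers are the ones from the list above, and nothing here is computed by HammerDB itself.

```python
def starting_sessions(cpu_threads: int) -> int:
    """Rule of thumb from the list above: begin with twice the server's processor threads."""
    return cpu_threads * 2

def autopilot_ramp(start: int, stop: int, step: int) -> list[int]:
    """User counts to cycle through when hunting for the performance peak."""
    return list(range(start, stop + 1, step))

print(starting_sessions(72))         # dual 18-core CPUs with Hyper-Threading -> 144 sessions
print(autopilot_ramp(100, 200, 10))  # [100, 110, ..., 190, 200]
```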

Fourth, do try to use all of the data so you don't get worthless test results.  The TPC-C spec calls for ten times as many users as warehouses to ensure all warehouses are actively used during the benchmark.  It randomly assigns users to warehouses in such a way that it takes ten times as many users to guarantee each warehouse has at least one assigned user.  But, as noted earlier, it isn't feasible to run that many users, so HammerDB has a checkbox called Use All Warehouses to solve the problem.  As long as you have at least as many users as warehouses, HammerDB will map at least one user to each warehouse.

Note on rule #4: The TPC-C spec requires each user be assigned to a random warehouse using a predefined function, and statistically every 100 users will land on about 92 warehouses while the other 8 warehouses get no assignments.  Only those 92 warehouses are used during a test.  If you load 5,000 warehouses and only start 100 users, you'll still have roughly 92 active warehouses and the other 4,908 warehouses never get touched.  In other words, your test will access such a trivial amount of data that it will run entirely in cache and your results will be worthless.  Statistically, you need ten times as many users as warehouses just to guarantee that every warehouse gets at least one assignment.
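If you want a feel for that coverage effect, here is a quick Monte Carlo sketch.  It assumes plain uniform random assignment of users to home warehouses, which is a simplification of the spec's actual selection function, so treat the exact counts as ballpark figures.

```python
import random

def estimate_active_warehouses(users: int, warehouses: int, trials: int = 100) -> float:
    """Average number of warehouses that receive at least one user assignment."""
    total = 0
    for _ in range(trials):
        assigned = {random.randrange(warehouses) for _ in range(users)}
        total += len(assigned)
    return total / trials

print(estimate_active_warehouses(100, 5_000))     # only ~100 of the 5,000 warehouses are ever touched
print(estimate_active_warehouses(50_000, 5_000))  # with 10x users, essentially all 5,000 are covered
```

Either way the conclusion above stands: with far fewer users than warehouses, most of the data you loaded is never read and the test runs entirely in cache.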

Rule 5 only applies to TPC-H: disable the two refresh functions.  They simulate loading new data from an OLTP system into the DSS system being tested.  I appreciate the goals of causing contention, growing the database, and so on, but because they change the data you cannot re-run a TPC-H test without dropping the schema and rebuilding it, and building a good-sized TPC-H schema can take several days.

There are a lot of knobs and buttons in the TPC-like benchmarking tools that let you disable various rules and capabilities.  Likewise, most DBMSs have hundreds of "tuning parameters".  Just use good sense when disabling things.  Getting maximum performance from a configuration your own organization would refuse to support in production because it breaks ACID compliance is a waste of time.
