Using DBGen to Generate TPC-H Test Data

In this post I describe the Database Generator (DBGen) utility: what it is, how to install it, and how to use it.  This post assumes a Linux operating system.  The build procedure for DBGen is different on Windows, but the concepts carry over.

DBGen is used to create a TPC-H schema (i.e., it provides the CREATE TABLE statements) and to generate the data.  DBGen is used in official TPC-H benchmarking, but it can also be used in informal TPC-H-like benchmarking.  The schema and data are compatible with most third-party benchmarking tools like HammerDB, so even if you plan to run a TPC-H-like test using HammerDB you may still use DBGen to create the schema and data.

The CREATE TABLE statements are ANSI-standard SQL and can be run in any DBMS without editing.  DBGen provides no SQL for indexes or foreign key constraints, since those objects are not required.  The data is created as delimited flat files, one file for each of the eight TPC-H tables.  There are no scripts for loading the data: loading it into your database is entirely up to you.
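
For reference, each line of a table file is one row, with columns separated by the pipe character.  A line from region.tbl looks something like this (an illustrative line, not copied verbatim from a real run):

0|AFRICA|lar deposits. blithely final packages cajole. regular waters are final requests.|

Note the trailing "|": by default DBGen appends a field separator after the last column of every record, which matters for PostgreSQL as discussed later in this post.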

Before we start using DBGen there are several important points:

  • DBGen can be installed by any user.  You don't need root privileges, except to install prerequisite packages like gcc and make.
  • DBGen will not connect to a database and doesn't care if you have a database.  It just generates data and writes that data into delimited flat files that we call "table files" because there is one file per table.  This allows you to work on a different computer without stealing CPU cycles from your database.
  • DBGen needs to be stored on and run from the filesystem where you want to create the table files.  It can only create files in the current working directory.
  • The table files can be quite large.  Size is 1 GB per unit of scale, and the allowable scale factors are 1, 10, 30, 100, 300, 1000, 3000, 10000, 30000, and 100000.  Most people use scale 3000, which requires 3 TB of free disk space to stage the data and about 4 TB of additional space in your database to load a copy of that data (database blocks maintain free space).  Frankly, small data sets are better generated with tools like HammerDB, so with DBGen we assume terabyte-scale data.
  • Creating the data files is mostly CPU bound, while loading the data into your database is largely I/O bound.  During generation, each DBGen worker can write data at roughly 70 MB/s and up to 250 large-block IOPS, depending on your CPU power.

To summarize the requirements, you need a computer with enough free space to stage the data before it gets copied into the database.  You do not need a database to run DBGen: it can run on one computer and the data can then be copied into a database on another computer.  Since you are copying the data, the database needs at least as much space as your staging area.
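
Before generating anything, it is worth confirming that the staging filesystem actually has the room.  For example, assuming it is mounted at /foo as in the examples later in this post:

df -h /foo    # expect roughly 1 GB of free space per unit of scale, e.g. ~3 TB for scale 3000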

Start by obtaining the DBGen utility source code from tpc.org.  It's a zip file.  The software is free, but you must agree to the TPC's license agreement before they will give you access to the software.


Figure 1: the TPC Download menu

Figure 2: the registration page for TPC-H Tools

Once you have downloaded the software to your test computer, which can be Windows, Linux, or Unix, unzip the source code.  Do not unzip it to a temporary filesystem, for the reasons noted earlier.  Assuming we have a filesystem mounted at /foo and have created a directory there called dbgen, we can extract the DBGen source code with the following command:

unzip ./tpc-h-tools_v2.18.0.zip -d /foo/dbgen

Navigate into the software:

cd /foo/dbgen/2.18.0_rc2/dbgen/

Notice the file dss.ddl.  This file contains all of the CREATE TABLE statements you need to create your TPC-H schema.  It will not change when you build the DBGen software, so you can go ahead and use it now to create your schema.  The SQL statements are plain vanilla and can be run on any DBMS (Oracle, MSSQL, MySQL, PostgreSQL, etc.).  The statements do not include partitioning or other features you might want to use, so we typically copy this file and customize it before running it in our database.
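
As a quick sanity check you can count the statements in the file and, if your database already exists, run it directly.  For example, against a hypothetical PostgreSQL database named tpch:

grep -c 'CREATE TABLE' dss.ddl    # should report 8, one per TPC-H table
psql -d tpch -f dss.ddl           # or run the file with sqlplus, mysql, etc.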

Next, edit the file makefile.suite and set the following variables:

################
## CHANGE NAME OF ANSI COMPILER HERE
################
CC      = gcc
# Current values for DATABASE are: INFORMIX, DB2, TDAT (Teradata)
#                                  SQLSERVER, SYBASE, ORACLE, VECTORWISE
# Current values for MACHINE are:  ATT, DOS, HP, IBM, ICL, MVS,
#                                  SGI, SUN, U2200, VMS, LINUX, WIN32
# Current values for WORKLOAD are:  TPCH
DATABASE= ORACLE
MACHINE = LINUX
WORKLOAD = TPCH

  • The C compiler named by the CC variable is in lowercase letters (gcc).  The next three variables must have uppercase values and match one of the sample values exactly.
  • The DATABASE variable can be set to any of the sample values, since the CREATE TABLE statements in dss.ddl are plain SQL regardless of the target DBMS.  This variable only affects the query syntax produced by the qgen utility, which we are not using.
  • Set the MACHINE variable for your platform.  All flavors of Linux simply use LINUX - there's no distinction between RHEL, Oracle Linux, Ubuntu, or SuSE Linux.
  • The WORKLOAD variable must be set to TPCH.

Save and close the file.
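
If you prefer to script the change instead of opening an editor, something along these lines should work (a sketch; the -i.bak option keeps a backup copy, and you should verify the result against your copy of makefile.suite):

sed -i.bak -e 's/^CC *=.*/CC      = gcc/' \
           -e 's/^DATABASE *=.*/DATABASE= ORACLE/' \
           -e 's/^MACHINE *=.*/MACHINE = LINUX/' \
           -e 's/^WORKLOAD *=.*/WORKLOAD = TPCH/' makefile.suite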

If the target DBMS is PostgreSQL, then you must edit one additional file to prevent DBGen from adding a field separator to the end of each record.  Other DBMSs don't have a problem with the trailing field separator (DBGen was defined by a team from Oracle, Sybase, IBM, and others), but PostgreSQL cannot handle it.

  • Edit file dss.h
  • Locate the following line:

#define  PR_END(fp) fprintf(fp, "\n")   /* finish the record here */

  • Change it to look like this:

#define  PR_END(fp) {fseek(fp, -1, SEEK_CUR);fprintf(fp, "\n");}

  • When done, save and close the file.
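
A quick way to confirm the edit took effect (the fseek version of the macro should appear in the output):

grep PR_END dss.h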

One thing I didn't mention earlier: you need a C compiler and a "make" utility on Linux/UNIX.  All of my Linux systems include the gcc and make packages.  If you aren't sure, run this command (or the equivalent for your operating system) to ensure the required packages are installed:

yum install gcc make

Finally, we can make the DBGen executable.  The result will be a file named dbgen (all lowercase letters).

make -f ./makefile.suite

List the new DBGen executable:

ls -l dbgen

Let's test that it works by generating a very small "scale factor 1" set of data.  The -v argument instructs it to display verbose messages.  The -f argument tells it to overwrite any existing table files, so you can re-run the command without having to clean up manually first.  The -s N argument sets the scale of the data to generate and equates to N GB of data.

./dbgen -v -f -s 1

The output is one file per table, so you get eight files whose names end with .tbl (or include .tbl when splits are used, as described later).  Table files are created in the directory where dbgen is run.  These are plain text files containing rows of data, owned by the user who ran dbgen.

The earlier example purposely did not generate a realistic amount of data, because that takes a lot of time and we don't yet know whether things are working.  To verify the setup, load one of the tables into your database; I would choose a small file like region.tbl.  Either they will all load or they will all fail.  Once you have verified that things are working, feel free to generate larger data sets.
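
As an illustration, a minimal smoke test on PostgreSQL might look like the commands below.  This assumes the schema from dss.ddl exists in a database named tpch (a made-up name) and that the dss.h patch described earlier was applied so there is no trailing field separator:

psql -d tpch -c "\copy region FROM 'region.tbl' WITH (FORMAT csv, DELIMITER '|')"
psql -d tpch -c "SELECT count(*) FROM region;"    # REGION always contains exactly 5 rows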

The below example is from a scale 300 generation:

# ls -lSrh *.tbl
-rw-r--r-- 1 root root  384 Mar 26 08:48 region.tbl
-rw-r--r-- 1 root root 2.2K Mar 26 08:48 nation.tbl
-rw-r--r-- 1 root root 409M Mar 26 08:48 supplier.tbl
-rw-r--r-- 1 root root 6.9G Mar 26 08:48 part.tbl
-rw-r--r-- 1 root root 6.9G Mar 26 08:48 customer.tbl
-rw-r--r-- 1 root root  35G Mar 26 08:48 partsupp.tbl
-rw-r--r-- 1 root root  50G Mar 26 08:48 orders.tbl
-rw-r--r-- 1 root root 224G Mar 26 08:48 lineitem.tbl

Note that the sizes shown above are base 2, not base 10.  File lineitem.tbl is 240 GB, which is 224 GiB.
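
If you want the total on-disk footprint of a generated data set as a single number:

du -ch *.tbl | tail -1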

For very large data sets DBGen offers chunking, or split files.  The added complexity makes this approach not worth the effort for smaller data sets.  Scales below 1000 are generally created and loaded using the HammerDB GUI, not with DBGen.  Scale 1000 is generally created using DBGen without splits or compression.  Scales 3000 and larger are generally created using DBGen with splits and compression.

When splitting table data into multiple chunks (separate files), you must specify the total number of chunks and which split to generate each time you invoke DBGen.  Tables NATION and REGION are never split because they are too small; instead, each command creates and re-creates them in their entirety, so it is important to use the force overwrite flag (-f) to allow the NATION and REGION files to be overwritten each time you generate a chunk.

As a quick example, let's just do table LINEITEM and create split number 3 of 10.  See the notes that follow the example.

./dbgen -s 10 -T L -C 10 -S 3

  • Parameter -s 10 tells it to create scale factor 10.  If omitted, then scale factor 1 is created.
  • Parameter -T L tells it to create only table LINEITEM.  If omitted, then all tables are created.  The mandatory tables NATION and REGION are always created regardless of which table you specify.
  • Parameter -C 10 tells it to chunk the data into 10 evenly sized files.  If omitted, then you must also omit parameter -S and you will get one file per table.  This parameter only defines the total number of chunks; it does not by itself generate anything.
  • Parameter -S 3 tells it to create split file #3.  If omitted, then you must also omit parameter -C and you will get one file per table.

Since each command can only generate one of the splits, we need 10 commands to create 10 splits (following the earlier example with parameter -C 10). 

The example below creates all 10 splits.  Parameter -s 3000 will generate 3 TB of data.  Parameter -C sets the total number of chunks, or splits, of the output to be generated, and parameter -S indicates which split should be created by this run.  To create all 10 splits we repeat the command 10 times, incrementing the split number from 1 to 10.  The force overwrite flag (-f) is needed because every execution of DBGen re-creates tables NATION and REGION.  The verbose flag (-v) is optional.

./dbgen -v -f -s 3000 -C 10 -S 1
./dbgen -v -f -s 3000 -C 10 -S 2
./dbgen -v -f -s 3000 -C 10 -S 3
./dbgen -v -f -s 3000 -C 10 -S 4
./dbgen -v -f -s 3000 -C 10 -S 5
./dbgen -v -f -s 3000 -C 10 -S 6
./dbgen -v -f -s 3000 -C 10 -S 7
./dbgen -v -f -s 3000 -C 10 -S 8
./dbgen -v -f -s 3000 -C 10 -S 9
./dbgen -v -f -s 3000 -C 10 -S 10

As a more interesting example, you could use a loop to create all ten splits, as shown below (this run limits generation to LINEITEM with -T L, as in the earlier single-split example):

$ for i in {1..10}; do   ./dbgen -f -s 3000 -T L -C 10 -S $i; done

TPC-H Population Generator (Version 2.18.0)
Copyright Transaction Processing Performance Council 1994 - 2010
TPC-H Population Generator (Version 2.18.0)
Copyright Transaction Processing Performance Council 1994 - 2010
TPC-H Population Generator (Version 2.18.0)
Copyright Transaction Processing Performance Council 1994 - 2010
TPC-H Population Generator (Version 2.18.0)
Copyright Transaction Processing Performance Council 1994 - 2010
TPC-H Population Generator (Version 2.18.0)
Copyright Transaction Processing Performance Council 1994 - 2010
TPC-H Population Generator (Version 2.18.0)
Copyright Transaction Processing Performance Council 1994 - 2010
TPC-H Population Generator (Version 2.18.0)
Copyright Transaction Processing Performance Council 1994 - 2010
TPC-H Population Generator (Version 2.18.0)
Copyright Transaction Processing Performance Council 1994 - 2010
TPC-H Population Generator (Version 2.18.0)
Copyright Transaction Processing Performance Council 1994 - 2010
TPC-H Population Generator (Version 2.18.0)
Copyright Transaction Processing Performance Council 1994 - 2010
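
Because each split is generated independently, you can also run several invocations concurrently if the machine has spare cores.  A rough sketch along the lines of the loop above (keep in mind that every run also rewrites the small NATION and REGION files, so if those two files look odd afterwards, regenerate them with one final non-parallel run):

for i in {1..10}; do
  ./dbgen -f -s 3000 -T L -C 10 -S $i &   # each split runs as a background job
done
wait                                       # wait for all ten jobs to finish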

Compression is not a feature of DBGen, but you can compress the table files manually.  gzip level 1 compression works best here: it is much faster than the higher levels and the table files still compress well.  For instance:

gzip -1 lineitem.tbl*
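
The same idea applies to the other table files, and if the pigz package is installed you can use it as a multi-threaded drop-in replacement for gzip:

gzip -1 *.tbl*        # or: pigz -1 *.tbl*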

The data must be decompressed before it can be loaded into a database.  Oracle has the ability to treat gzip files as an external table, decompressing the data on the fly (via a preprocessor) as it is read into the database.  Most DBMSs lack such a feature.

The example below shows creating an Oracle external table object named CUSTOMER_EXT that maps to a set of 16 table files compressed using gzip.

CREATE TABLE customer_ext
  (
   C_CUSTKEY NUMBER(10),
   C_NAME VARCHAR2(25),
   C_ADDRESS VARCHAR2(40),
   C_NATIONKEY NUMBER(10),
   C_PHONE VARCHAR2(15),
   C_ACCTBAL NUMBER,
   C_MKTSEGMENT VARCHAR2(10),
   C_COMMENT VARCHAR2(117)
  )
  ORGANIZATION EXTERNAL
  ( TYPE oracle_loader
    DEFAULT DIRECTORY dbgen_staging
    ACCESS PARAMETERS (
       RECORDS DELIMITED BY NEWLINE
       preprocessor zcat_dir:'zcat'
       BADFILE 'bad_%a_%p.bad'
       LOGFILE 'log_%a_%p.log'
       FIELDS TERMINATED BY '|'
       MISSING FIELD VALUES ARE NULL)
    LOCATION ('customer.tbl.1.gz',  'customer.tbl.2.gz',
              'customer.tbl.3.gz',  'customer.tbl.4.gz',
              'customer.tbl.5.gz',  'customer.tbl.6.gz',
              'customer.tbl.7.gz',  'customer.tbl.8.gz',
              'customer.tbl.9.gz',  'customer.tbl.10.gz',
              'customer.tbl.11.gz', 'customer.tbl.12.gz',
              'customer.tbl.13.gz', 'customer.tbl.14.gz',
              'customer.tbl.15.gz', 'customer.tbl.16.gz' )
   )
  PARALLEL 8
  REJECT LIMIT 0
  NOMONITORING;

Now, to read that data into your database you simply "select" from this external table object like so:

INSERT INTO CUSTOMER SELECT * FROM CUSTOMER_EXT;

Every DBMS has some method of reading data from external sources into a database table, and each works a bit differently.  Few can read compressed files directly, as in the Oracle-specific example above.
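
For a DBMS without that capability you can instead decompress on the fly in the load pipeline.  For example, on PostgreSQL (again assuming a database named tpch and the trailing-separator patch), something along these lines works:

zcat lineitem.tbl.1.gz | psql -d tpch -c "COPY lineitem FROM STDIN WITH (FORMAT csv, DELIMITER '|')"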

DBGen is provided by the Transaction Processing Performance Council (TPC).  The terms TPC Benchmark and TPC-H are trademarks of the Transaction Processing Performance Council.

 
