Using DBGen to Generate TPC-H Test Data
In this post I describe the Database Generator (DBGen) utility: what it is, how to install it, and how to use it. This post assumes a Linux operating system; the build procedure differs on Windows, but the concepts carry over.
DBGen is used to create a TPC-H schema (i.e., it provides the CREATE TABLE statements) and to generate the data. DBGen is used in official TPC-H benchmarking, but it can also be used in informal, TPC-H-like benchmarking. The schema and data are compatible with most third-party benchmarking tools such as HammerDB, so even if you plan to run a TPC-H-like test using HammerDB you may still use DBGen to create the schema and data.
The CREATE TABLE statements are ANSI-standard SQL and can be run in any DBMS without editing. DBGen provides no SQL statements for indexes or foreign key constraints because those objects are not required. The data is created as delimited flat files, one file for each of the eight TPC-H tables. There are no scripts for loading the data: that part is entirely up to you.
Before we start using DBGen there are several important points:
- DBGen can be installed by any user. You don't need root privileges, except to install prerequisite packages like gcc and make.
- DBGen will not connect to a database and doesn't care if you have a database. It just generates data and writes that data into delimited flat files that we call "table files" because there is one file per table. This allows you to work on a different computer without stealing CPU cycles from your database.
- DBGen needs to be stored on and run from the filesystem where you want to create the table files. It can only create files in the current working directory.
- The table files can be quite large. Size is 1 GB per unit of scale, and the allowable scale factors are 1, 10, 30, 100, 300, 1000, 3000, 10000, 30000, and 100000. Most people use scale 3000, which requires 3 TB of free disk space to stage the data and roughly 4 TB of additional space in your database to load a copy of that data (database blocks maintain free space). Frankly, small data sets are better generated with tools like HammerDB, so with DBGen we assume terabyte-scale data.
- Creating the data files is mostly CPU bound, while loading the data into your database is largely I/O bound. Depending on your CPU, each DBGen process can generate data at up to about 70 MB/s, producing up to 250 large-block write IOPS. At that rate a single process would need roughly half a day to emit a 3 TB (scale 3000) data set, which is why we split the work across processes later in this post.
To summarize the requirements: you need a computer with enough free space to stage the data before it gets copied into the database. You do not need a database to run DBGen: it can run on one computer and the data can then be copied into a database on another. Since you're copying the data, the database needs as much space as the staging area, or more.
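Before generating anything, it's worth confirming the staging filesystem actually has the headroom. A minimal check, assuming your staging area is mounted at /foo as in the examples below:
# confirm free space on the staging filesystem (scale 3000 needs ~3 TB free)
df -h /foo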
Start by obtaining the DBGen utility source code from tpc.org. It's a zip file. The software is free, but you must accept the TPC's license agreement before you can download it.
Once you have downloaded the software to your test computer (Windows, Linux, or Unix), unzip the source code. Do not unzip it to a temporary filesystem, for the reasons noted earlier: DBGen writes its table files to the directory it runs from. Assuming we have a filesystem mounted at /foo and we created a directory there called dbgen, we can extract the DBGen source code with the following command:
unzip ./tpc-h-tools_v2.18.0.zip -d /foo/dbgen
Navigate into the software:
cd /foo/dbgen/2.18.0_rc2/dbgen/
Notice the file dss.ddl. This file contains all of the CREATE TABLE statements you need to create your TPC-H schema. The file is not changed when you make the DBGen software, so you can go ahead and use it now to create your schema. The SQL statements are plain vanilla and can be run on any DBMS (Oracle, MSSQL, MySQL, PostgreSQL, etc.). The statements do not include partitioning or other features you might want to use, so we typically copy this file and customize it before running it in our database.
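For example, a typical workflow is to copy dss.ddl, add your own storage or partitioning clauses, and run the copy against your database. A sketch for Oracle (the connection string and the edits themselves are hypothetical placeholders):
# keep the original pristine; customize a copy
cp dss.ddl my_dss.ddl

# edit my_dss.ddl to add tablespace, partitioning, or storage clauses,
# then run it against your database, e.g. with SQL*Plus:
sqlplus tpch_user@mydb @my_dss.ddl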
Next, edit file makefile.suite and set the following fields:
################
## CHANGE NAME OF ANSI COMPILER HERE
################
CC       = gcc
# Current values for DATABASE are: INFORMIX, DB2, TDAT (Teradata)
#                                  SQLSERVER, SYBASE, ORACLE, VECTORWISE
# Current values for MACHINE are:  ATT, DOS, HP, IBM, ICL, MVS,
#                                  SGI, SUN, U2200, VMS, LINUX, WIN32
# Current values for WORKLOAD are: TPCH
DATABASE = ORACLE
MACHINE  = LINUX
WORKLOAD = TPCH
- The value of CC is the C compiler's name and is lowercase (gcc). The next three variables must have uppercase values that match one of the sample values exactly.
- The DATABASE variable can be set to any of the sample values, because the CREATE TABLE statements are always generated in plain SQL without regard for the target DBMS. This variable only affects the queries generated by the qgen utility, which we are not using.
- Set the MACHINE variable for your platform. All flavors of Linux simply use LINUX; there's no distinction between RHEL, Oracle Linux, Ubuntu, or SuSE Linux.
- The WORKLOAD variable must be set to TPCH.
Save and close the file.
If the target DBMS is PostgreSQL, you must edit one additional file to prevent DBGen from adding a field separator to the end of each record. Other DBMSs don't have a problem with the trailing separators (DBGen was defined by a team from Oracle, Sybase, IBM, and others), but PostgreSQL's COPY cannot handle them.
- Edit file dss.h
- Locate the following line:
#define PR_END(fp) fprintf(fp, "\n") /* finish the record here */
- Change it to look like this:
#define PR_END(fp) {fseek(fp, -1, SEEK_CUR);fprintf(fp, "\n");}
- When done, save and close the file.
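If you'd rather script this edit, a sed one-liner can make the same change. This is only a sketch: it assumes the #define appears exactly as shown above (whitespace and all), so verify the result with grep before building:
# back up dss.h, then rewrite the PR_END macro to seek back over the
# trailing field separator before writing the newline
sed -i.bak 's/#define PR_END(fp) fprintf(fp, "\\n")/#define PR_END(fp) {fseek(fp, -1, SEEK_CUR);fprintf(fp, "\\n");}/' dss.h

# confirm the change took effect
grep 'PR_END' dss.h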
One thing I didn't mention earlier: you need a C compiler and a make utility on Linux/UNIX. All of my Linux systems include the gcc and make packages. If you aren't sure, run this command (or the equivalent for your operating system) to ensure the required packages are installed:
yum install gcc make
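On Debian or Ubuntu the package manager is apt rather than yum; either way, you can confirm both tools are actually on your PATH before building:
# Debian/Ubuntu equivalent of the yum command above
sudo apt-get install -y gcc make

# verify both tools are available
gcc --version
make --version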
Finally, we can make the DBGen executable.
The result will be a file named dbgen (all lowercase letters).
make -f ./makefile.suite
List the new DBGen executable:
ls -l dbgen
Let's test it by generating a very small "scale factor 1" set of data. The -v argument instructs it to display verbose messages. The -f argument tells it to overwrite any existing table files, so you can re-run the command without having to clean up manually first. The -s N argument tells it the scale of data to generate, which equates to N GB of data.
./dbgen -v -f -s 1
The output is one file per table, so you get eight files whose names end with .tbl (or include .tbl, when using the splits described later). Table files are created in the directory where dbgen is run. These are plain text files containing rows of data, owned by the user who ran dbgen.
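As a quick sanity check you can peek at a record and count rows. Each record is one line of pipe-delimited fields; at scale factor 1 the well-known TPC-H cardinalities are 5 regions, 25 nations, 10,000 suppliers, 150,000 customers, 200,000 parts, 800,000 partsupps, 1,500,000 orders, and roughly 6 million lineitems:
# first record of the smallest table (the trailing comment text will vary)
head -1 region.tbl

# row counts per table file; compare against the expected cardinalities
wc -l *.tbl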
The earlier example purposefully did not generate a realistic amount of data, because that takes a lot of time and we don't yet know if things are working. To verify the setup, load one of the tables into your database. I would choose a small file like region.tbl; either they will all work or all fail. Once you verify things are working, feel free to generate larger data sets.
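As an example of such a verification load, here is a minimal sketch for PostgreSQL using psql's \copy command. The database name tpch is a hypothetical placeholder, and this assumes the dss.h patch above was applied so records have no trailing field separator:
# load the 5-row region table as a smoke test
psql -d tpch -c "\copy region FROM 'region.tbl' WITH (DELIMITER '|')"

# confirm the rows arrived
psql -d tpch -c "SELECT count(*) FROM region"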
The below example is from a scale 300 generation:
# ls -lSrh *.tbl
-rw-r--r-- 1 root root  384 Mar 26 08:48 region.tbl
-rw-r--r-- 1 root root 2.2K Mar 26 08:48 nation.tbl
-rw-r--r-- 1 root root 409M Mar 26 08:48 supplier.tbl
-rw-r--r-- 1 root root 6.9G Mar 26 08:48 part.tbl
-rw-r--r-- 1 root root 6.9G Mar 26 08:48 customer.tbl
-rw-r--r-- 1 root root  35G Mar 26 08:48 partsupp.tbl
-rw-r--r-- 1 root root  50G Mar 26 08:48 orders.tbl
-rw-r--r-- 1 root root 224G Mar 26 08:48 lineitem.tbl
Note the sizes shown above are base 2, not base 10: file lineitem.tbl is 240 GB, which ls reports as 224G (GiB).
For very large data sets DBGen offers chunking, or split files. The complexity of using this feature makes it not worth the trouble for smaller data sets. Scales less than 1000 are generally created and loaded using the HammerDB GUI, not with DBGen. Scale 1000 is generally created using DBGen without splits or compression. Scales 3000 and larger are generally created using DBGen with splits and compression.
When splitting table data into multiple chunks (separate files), you must specify the total number of chunks and which split to generate each time you invoke DBGen. Tables NATION and REGION are never split because they are too small; instead they are created, in their entirety, by every command. For that reason it is important to use the force overwrite flag (-f) to allow overwriting the NATION and REGION files each time you generate a chunk.
As a quick example, let's just do table LINEITEM and create split number 3 of 10. See the notes that follow the example.
./dbgen -s 10 -T L -C 10 -S 3
- Parameter -s 10 tells it to create scale factor 10. If omitted, scale factor 1 is created.
- Parameter -T L tells it to create only table LINEITEM. If omitted, all tables are created. The mandatory tables NATION and REGION are always created regardless of which table you specify.
- Parameter -C 10 tells it to chunk the data into 10 evenly sized files. If omitted, you must also omit parameter -S, and you will get one file per table. This parameter only defines the total number of chunks; it does not by itself generate anything.
- Parameter -S 3 tells it to create split file #3. If omitted, you must also omit parameter -C, and you will get one file per table.
Since each command can only generate one of the splits, we need 10 commands to create 10 splits (following the earlier example with parameter -C 10).
The next example creates all 10 splits. Parameter -s 3000 will generate 3 TB of data. Parameter -C sets the total number of chunks or splits of the output file to be generated, and parameter -S indicates which split should be created at this time. To create all 10 splits we repeat the command 10 times, incrementing the split number from 1 to 10. The force overwrite flag (-f) is needed because every execution of DBGen creates tables NATION and REGION. The verbose flag (-v) is optional.
./dbgen -v -f -s 3000 -C 10 -S 1
./dbgen -v -f -s 3000 -C 10 -S 2
./dbgen -v -f -s 3000 -C 10 -S 3
./dbgen -v -f -s 3000 -C 10 -S 4
./dbgen -v -f -s 3000 -C 10 -S 5
./dbgen -v -f -s 3000 -C 10 -S 6
./dbgen -v -f -s 3000 -C 10 -S 7
./dbgen -v -f -s 3000 -C 10 -S 8
./dbgen -v -f -s 3000 -C 10 -S 9
./dbgen -v -f -s 3000 -C 10 -S 10
As a more convenient alternative, you could use a loop to create all ten splits:
$ for i in {1..10}; do ./dbgen -f -s 3000 -T L -C 10 -S $i; done
TPC-H Population Generator (Version 2.18.0)
Copyright Transaction Processing Performance Council 1994 - 2010
(the same two-line banner repeats for each of the ten invocations)
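The loop above runs the splits one after another. Since generation is CPU bound, you can also run the splits in parallel, one process per split, if your machine has the cores and I/O headroom. This is only a sketch: note that every dbgen process rewrites the small NATION and REGION files, so concurrent runs briefly collide on those two files; since each process writes identical content this is generally harmless, but it is worth verifying on your system.
# run all ten splits concurrently, one dbgen process per split
for i in {1..10}; do
  ./dbgen -f -s 3000 -C 10 -S $i &
done
wait   # block until every background dbgen process finishes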
Compression is not a feature of DBGen, but you can compress the table files manually. In my experience gzip level 1 works best, giving the best trade-off between compression speed and file size. For instance:
gzip -1 lineitem.tbl*
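gzip is single-threaded per file, so compressing a multi-terabyte data set one file at a time can take a while. A simple way to parallelize, assuming the split files from the previous step and eight spare cores:
# compress the lineitem split files eight at a time
ls lineitem.tbl.* | xargs -P 8 -n 1 gzip -1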
The data must be decompressed before it can be loaded into a database. Oracle has the ability to treat a gzip file as an external table and can decompress the data while reading it into the database. Most DBMSs lack such a feature.
The example below shows creating an Oracle external table
object named CUSTOMER_EXT that maps to a set of 16 table
files compressed using gzip.
CREATE TABLE customer_ext
(
  C_CUSTKEY    NUMBER(10),
  C_NAME       VARCHAR2(25),
  C_ADDRESS    VARCHAR2(40),
  C_NATIONKEY  NUMBER(10),
  C_PHONE      VARCHAR2(15),
  C_ACCTBAL    NUMBER,
  C_MKTSEGMENT VARCHAR2(10),
  C_COMMENT    VARCHAR2(117)
)
ORGANIZATION EXTERNAL
(
  TYPE oracle_loader
  DEFAULT DIRECTORY dbgen_staging
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    PREPROCESSOR zcat_dir:'zcat'
    BADFILE 'bad_%a_%p.bad'
    LOGFILE 'log_%a_%p.log'
    FIELDS TERMINATED BY '|'
    MISSING FIELD VALUES ARE NULL)
  LOCATION ('customer.tbl.1.gz',  'customer.tbl.2.gz',
            'customer.tbl.3.gz',  'customer.tbl.4.gz',
            'customer.tbl.5.gz',  'customer.tbl.6.gz',
            'customer.tbl.7.gz',  'customer.tbl.8.gz',
            'customer.tbl.9.gz',  'customer.tbl.10.gz',
            'customer.tbl.11.gz', 'customer.tbl.12.gz',
            'customer.tbl.13.gz', 'customer.tbl.14.gz',
            'customer.tbl.15.gz', 'customer.tbl.16.gz')
)
PARALLEL 8
REJECT LIMIT 0
NOMONITORING;
Now, to read that data into your database you simply
"select" from this external table object like so:
INSERT INTO CUSTOMER SELECT * FROM CUSTOMER_EXT;
All DBMSs have some method of reading data from external sources into a database table, though each works a bit differently, and few support compressed input as in the Oracle-specific example above.
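For a DBMS without native support for compressed input, you can stream the decompression through a pipe instead of staging uncompressed copies on disk. A sketch using PostgreSQL (again, the tpch database name is a hypothetical placeholder):
# decompress on the fly and feed the rows straight into the customer table
zcat customer.tbl.*.gz | psql -d tpch -c "\copy customer FROM STDIN WITH (DELIMITER '|')"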
DBGen is provided by the Transaction Processing Performance Council
(TPC). The terms TPC Benchmark and TPC-H
are trademarks of the Transaction Processing Performance Council.