"I understand that system characteristic is in α and β and the interpretation is the key. And that there is aggregation. In GCaP [book] (in table 5.1) there are ray tracing benchmark results - what is workload (users, tasks) described in such case? Numbers? Is it Xmax on given processor in this table? - so processor p1 is loaded to measure Imax, next p4 is loaded to measuer Imax? - isn't task or user number important for given p-Imax?"
First off, I have to confess that I know nothing about the specific benchmark workloads or how they are executed. One of the great beauties of performance modeling is how much you can accomplish while remaining totally clueless. All I did was review the CMG 2000 paper, "COMPARING CPU PERFORMANCE BETWEEN AND WITHIN PROCESSOR FAMILIES," and notice that their data was a good candidate for modeling with the USL; such complete scaling data is not easy to come by.
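For readers who want to see the USL mentioned above in concrete form, here is a minimal sketch. The α and β values are purely illustrative placeholders, not parameters fitted to the Table 5.1 ray-tracing data.

```python
def usl_capacity(p, alpha, beta):
    """Universal Scalability Law: relative capacity C(p) on p processors.

    alpha: contention (serialization) parameter
    beta:  coherency (crosstalk) parameter
    """
    return p / (1 + alpha * (p - 1) + beta * p * (p - 1))

# Illustrative parameter values only (not fitted to the benchmark data):
for p in (1, 4, 16, 64):
    print(p, usl_capacity(p, alpha=0.02, beta=0.0005))
```

With β > 0, C(p) eventually peaks and retrogrades, which is exactly the kind of behavior one looks for when fitting complete scaling data like the ray-tracing table.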
In this BRL-CAD ray-tracing benchmark, there are 6 reference models or codes that are executed: 9moss, world, star, Bldg391, M35, and Sphflake. MW's question really pertains to the benchmark runtime methodology on SMPs. Are these codes scheduled individually across processors or all run as a master process? How does it compare with SPECrate?
"The SPECrate measurement involves running multiple copies of a benchmark program simultaneously, and the formula is a bit more complicated because it needs to include an extra variable for the number of copies that were running simultaneously."The SPECrate run rules say:
"What is measured is the elapsed time from when all copies are launched simultaneously until the time the last copy finishes. The elapsed time and the number of copies executed are then used in a formula to calculate a completion rate: the SPECrate."The performance metric is calculated as: SPECrate = #CopiesRun * ReferenceFactor * UnitTime/ElapsedExecutionTime.
I downloaded the BRL-CAD benchmark code, but that didn't make me any the wiser, so I decided to email the authors; their reply appears below.

So, to answer MW's question: the throughput X is being measured as a function of the hardware processor (p) configuration, up to a maximum of 64 processors in the SGI Origin. In other words, the Table shows X(p): throughput as a function of p processors. The number N of processes (as in software processes) per processor (CPU) is some fixed quantity that saturates a single CPU. Suppose N = 10 processes maxed out one CPU; then, when the system is reconfigured with 2 CPUs, N = 20 processes need to be run in the system, and so on.

-------- Original Message --------
Subject: RE: BRL-CAD benchmarking question (UNCLASSIFIED)
Date: Fri, 16 Apr 2010 14:23:07 -0400
From: Name redacted (Civ, ARL/SLAD)
To: Neil Gunther
> of the 6 reference models/codes ... are these codes scheduled individually
> across CPUs or all run as a master process?
These references are actually different data sets used as inputs to the same
code. They are run individually, not all at the same time.
The architecture of the benchmark (in simple terms) for an SMP architecture
is that the data is read from disk and then the timing is started. The code
then creates a thread of execution for each execution unit (CPU/core, pick
your terminology). These share semaphored access to the pool of work to be
done, and "work units" are taken "on demand" from the pool of available
work. Computation continues until all work units are completed, whereupon
the created threads terminate and the original thread stops the clock and
As to how the threads of execution are scheduled, that depends on the
operating system and scheduler in place on the OS.
Based upon the short description you quote, this does not have any
resemblance to the SPECrate measurement you speak of.