We are constantly looking for new testing tools for our hard disk drives, which is why we couldn’t help checking whether the new FutureMark PCMark04 is suitable for HDD testing. Learn about the pitfalls of the new test package. You will be surprised by the results of this detailed investigation!

by **Nikita Nikolaichev**

02/11/2004 | 11:52 PM

I am constantly looking for new testing tools for our hard disk drives (read about the reasons that inspire me in our article called 13 IDE Hard Disk Drives Roundup). So, I certainly couldn’t pass by the pretty interesting PCMark04 test set from FutureMark, a company you all probably know for its widespread 3DMark test package. Especially since some of my colleagues have already managed to try this test in their reviews…

I have been quite skeptical about the mere idea of testing HDDs with the PCMark04 test set for a while, remembering very well the failure I went through when I tried using the previous test package, PCMark2002 (by MadOnion in those days), for HDD tests (for more details see our article called PCMark2002 as Hard Disk Drive Test). But I am an eternal optimist by nature, so when one of the testbeds turned out to be free, I decided to take a closer look at the new PCMark04.

Of all the things offered by PCMark04 I was interested only in the tests developed for the disk subsystem. According to the PCMark04 description in the PDF file on the official FutureMark web-site, the disk subsystem testing set consists of four sub-tests.

Unlike PCMark2002, the new testing software involves the so-called “traces of disk activity”, in other words, disk activity traces pre-recorded on a reference system. A trace was created for each subtest, and PCMark04 then reproduces it on the hard disk drive of the tested system. The performance is estimated according to the time it takes the drive to process the trace and is expressed in MB/sec.

FutureMark used a special RankDisk utility developed by Intel for trace creation (actually, the recording is done by the WinTrace32 utility, though PCMark04 mentions RankDisk). This software tool allows saving a sequence of requests sent to the drive at a pretty low level (by the controller driver), so that we get only the pure HDD performance when the trace is reproduced. At least, that is what the PCMark04 developers think.

The RankDisk utility for reproducing HDD request traces has long been used by the most respected HDD testers, such as the guys from the well-known StorageReview site. That is why the use of this program, or even a part of its code, in PCMark04 is most welcome.

However, the StorageReview people reproduce the saved traces on an unformatted drive, which guarantees that the requests will repeatedly “fall” right into the zone of the drive that the trace was initially aimed at. The PCMark04 developers seem to suggest a different approach: according to the PDF file, they reproduce the traces on a formatted drive. For testing purposes they also have to create a special dummy file; the trace is rearranged so that it works only inside this file, and this way they protect the user’s data against possible damage during the test. Unfortunately, no one can guarantee that this temporary dummy file will be located in the same part of the drive where the trace was initially recorded. Moreover, it would be even harder to make sure that this file is created in the very same place on drives of different storage capacity.

Moreover, the trace contains not only the requests sent by user programs to the HDD, but also the entire disk activity of the operating system (swap-file requests, transaction logging, etc.). So, I would take the phrase “file is created in the same (or closest possible) physical location of the target hard disk” with a grain of salt.

Of course, I do understand the desire of the PCMark04 developers to let not only professionals, but also regular users, test their hard disk drives. However, when you test an HDD with the OS already installed, you automatically break the major requirement for the testing environment: you will not get repeatable results, as the tested drive is not free from the influence of the operating system.

Luckily, PCMark04 also allows running the tests on an unformatted HDD:

Well, I believe this screenshot could scare an unprepared user :)

So, PCMark04 reproduces the following types of disk workload:

**Windows XP Startup**: This trace contains a sequence of requests sent to the HDD on system start-up.

**Application Loading**: This trace contains disk activity when the system opens and closes the following applications:

- Microsoft Word
- Adobe Acrobat Reader 5
- Windows Media Player
- 3DMark 2001SE
- Leadtek Winfast DVD
- Mozilla Internet Browser

**File Copying**: This trace contains a log of HDD requests recorded during the copying of about 400MB of files. Unfortunately, the PDF file mentions neither the average file size nor the number of files.

**General Hard Disk Drive Usage**: This trace contains info about disk activity during the work of some widespread applications. When this trace was recorded, the following actions were performed:

- We opened a Microsoft Word document, checked the grammar, saved and closed the file.
- The file was packed and then unpacked with the WinZIP utility.
- The file was encrypted and then decrypted with the PowerCrypt utility.
- A set of files was checked for viruses with F-Secure Antivirus software.
- WinAmp played MP3- and WAV-files.
- Windows Media Player played DivX-video.
- The system browsed through pictures with Windows Picture Viewer.
- Etc.

To reduce the time required to process a single trace, the traces were compressed: the long pauses between requests were cut down to 50 milliseconds. This value was derived experimentally and represents the minimal pause possible without affecting the disk subsystem performance.

Having run all the tests for a given HDD, PCMark04 generates a certain performance index calculated according to the following formula:

HDD Score = (XP Startup Trace x 120) + (Application Load trace x 180) + (File Copy Trace x 28) + (General Usage x 265)

This sophisticated rating formula can actually be explained in a very simple way: the results of the subtests are weighted differently in the final score. For instance, the Windows loading speed weighs 25%, while application loading weighs a little bit more: 28%. Copy speed, on the contrary, weighs less: only 12%. And the maximum weight of 35% belongs to General Hard Disk Drive Usage.
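As a quick sketch, the rating can be reproduced from the four trace speeds. The MB/sec figures below are made-up placeholders, not measured results:

```python
# Weights from the PCMark04 HDD Score formula quoted above.
WEIGHTS = {
    "XP Startup": 120,
    "Application Load": 180,
    "File Copy": 28,
    "General Usage": 265,
}

def hdd_score(trace_speeds_mbs):
    """Weighted sum of the four trace speeds (MB/sec)."""
    return sum(WEIGHTS[name] * speed for name, speed in trace_speeds_mbs.items())

# Hypothetical trace results in MB/sec (placeholders only):
example = {"XP Startup": 7.5, "Application Load": 6.0,
           "File Copy": 30.0, "General Usage": 5.0}
print(hdd_score(example))  # 900 + 1080 + 840 + 1325 = 4145.0
```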

Our test system was configured as follows:

- Albatron PX865PE Pro II mainboard;
- Intel Pentium 4 2.4GHz CPU;
- IBM DTLA-307015 15GB system HDD;
- Radeon 7000 32MB graphics card;
- 256MB PC2700 DDR SDRAM;
- Promise Ultra133 TX2 controller;
- Microsoft Windows XP Pro SP1.

The major difference between the testbed we used for PCMark04 and our standard testbed, which we use for most of our storage reviews, is the Windows XP operating system. Unfortunately, we had to break our principle: **not to use Windows operating systems until SP2 comes out**. Otherwise, we would be unable to run PCMark04… The thing is that the disk subsystem benchmarks of the PCMark04 suite work only in WinXP (which is pretty strange, as RankDisk should work fine in any NT-like operating system).

Not so long ago Futuremark released a patch for PCMark04 (build 110), and among the changes introduced in it we can see the long-awaited line: "Hard Disk Tests now work also on Windows 2000". So, I would say that all our complaints about the inability of this benchmark to work in Win2000 have become somewhat outdated. In upcoming articles we will try to figure out how comparable the results obtained in Win2000 and WinXP are.

We used the four “coolest” IDE hard disk drives for our tests, because the quality of testing software is best checked with the most advanced drives. Therefore, I took a 250GB Maxtor MaxLine (7Y250P0), a WD2500JB and two Hitachi drives: HDS722525VLAT80 and IC35L180AVV207-1. As you may have noticed, the latter differs from the previous three models in its storage capacity. Besides, it belongs to the previous HDD generation (60GB platters). However, all the drives have the same 8MB cache buffer. That is why, firstly, we will be able to compare the performance of three 250GB HDDs from different manufacturers and see how much the results differ from what we obtained before (see our article called 7200rpm Hard Disk Drives Roundup: Major League). And secondly, we will get a perfect chance to compare the performance of two Hitachi HDDs representing this manufacturer’s two latest product generations.

To test the hard disk drives we created a 32GB partition on each of them and formatted it in NTFS and FAT32 (I decided to stick to the methodology suggested by the benchmark developers and check the HDD performance on formatted drives). The results showed that the benchmark is indifferent to the file system of the tested drive (which I actually expected, taking the testing methodology into account). Therefore, I will provide only one set of results in the performance analysis below: those obtained in NTFS.

To investigate the repeatability of PCMark04 results, we carried out a series of 10 tests. The HDD was defragmented between the tests, then we restarted the system, and when the boot-up was complete we waited for five minutes before starting the test session. We undertook all these measures to reduce the influence of the operating system on the benchmark results.

Well, it is probably the first time I provide such detailed results. Very soon you will understand why :)

So, each test was run 10 times for each hard disk drive. Then the average results were taken for the diagrams:

Hitachi 7K250 finished the XP Startup trace faster than its competitors: maybe it is due to the drive’s low access time? The results of WD2500JB prove this point, too: WD2500JB has the highest access time of all the tested drives, and therefore it fell significantly behind the competitors.

Hitachi 7K250 also coped with the Application Loading trace fastest of all, while Hitachi 180GXP arrived last, as it features the lowest data density among all the tested drives.

During file copying we got two evident leaders: Hitachi 7K250 and Maxtor 7Y250P0. They managed to leave Hitachi 180GXP and WD2500JB considerably behind. What is especially interesting, this result corresponds very well to what we already discussed in our article called 7200rpm Hard Disk Drives Roundup: Major League during the file copy test analysis. There was only one thing that didn’t fit the general picture, but I will dwell on it a little later.

During General Hard Disk Drive Usage, Hitachi 7K250 is ahead. Well, today is its benefit night, I assume…

The other HDDs run very close to one another. Really close… Our testing participants have evidently shown very close results in some tests. Therefore, a question arises: do these results repeat consistently enough for us to conclude that one drive is faster than the other? In other words, don’t you think we should check the dispersions?

In order to find out how correct the obtained results are, let’s go a little deeper into mathematical statistics.

As we have already mentioned at the beginning of this article, many random phenomena, including natural ones, obey the normal Gaussian distribution law. We come across this law even in HDD tests. This is what the distribution of HDD random access time looks like:

Therefore, let’s assume that the HDD performance we measure equals [Actual HDD Speed + Error], where the actual HDD speed is a constant and the error is random. Then the distribution of the random HDD performance values will be normal for a sufficiently large number of measurements.

The most important parameters of the measured distribution are the mathematical expectation **M(x)** and dispersion **D(x)** of the random variable **x**. The distribution parameters of the random variable **x** are usually unknown in mathematical statistics tasks. The researcher usually has at his disposal only a sample of independent experiments of size **n [x1, x2, …, xn]**. In this case the *sampling parameters* are derived from the sample and then serve as a certain approximation of the theoretical, or *general, parameters*. The larger the sample n, the better the approximation. In practice, sampling parameters can be considered to coincide with the general parameters when n > 50.

Let’s take a closer look at these parameters and their features.

The mathematical expectation of the continuous random variable is set by the following integral:
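In standard notation, with f(x) the probability density of x:

```latex
M(x) = \int_{-\infty}^{+\infty} x \, f(x) \, dx
```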

For the discrete random variable the formula looks as follows:
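In the same notation:

```latex
M(x) = \sum_{i=1}^{n} x_i \, p_i
```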

where **x(i)** and **p(i)** are the separate values and corresponding probabilities of the random variable, and **n** is the number of its possible values.

In the particular case of a uniform distribution of a random variable over **n** equally probable values we get:
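With every p(i) equal to 1/n, the sum reduces to the arithmetic mean:

```latex
M(x) = \sum_{i=1}^{n} x_i \cdot \frac{1}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i
```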

In other words, the mathematical expectation coincides with the notion of the arithmetic mean value. In the general case, when the individual values are not equally probable, the mathematical expectation equals the so-called weighted average of the discrete random variable, with the different probabilities of the individual values taken into account.

To cut a long story short, the mathematical expectation is the value around which the values of the random variable are scattered.

The dispersion **D(x)** of the random variable is the mathematical expectation of **[x-M(x)]^2**. For the continuous random variable **x** it looks as follows:
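In the same notation as above, with f(x) the probability density:

```latex
D(x) = \int_{-\infty}^{+\infty} \left[ x - M(x) \right]^2 f(x) \, dx
```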

For the random variable with **n** possible values we get:
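For the discrete case:

```latex
D(x) = \sum_{i=1}^{n} \left[ x_i - M(x) \right]^2 p_i
```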

The dispersion for the sample selection including n values of the random variable is calculated according to the following formula:
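The usual unbiased estimate, with x̄ denoting the sample mean:

```latex
S^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2
```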

This value is the sample variance; its square root, S, is the sample standard deviation.

Dispersion is a very convenient and natural statistic, because it takes into account all deviations of the results from the average and normalizes them accordingly.

The repeatability of an experiment and the corresponding random measuring errors are usually characterized by two statistical criteria:

- *The width of the confidence interval* **[x1, x2]**, within which the results of individual experiments may fall;
- *The confidence probability* that these results will not fall beyond this interval.

During statistical analysis and data processing, the random variable may have a normal or close-to-normal distribution (as you remember, we mentioned this at the very beginning of our discussion), while the sample representing it is too small, i.e. not representative enough. The part of mathematical statistics devoted to such small samples (**n** < 20) is also known as micro-statistics.

Micro-statistical estimates of normally distributed random values are based on Student’s distribution, which links together three major parameters of the sample: the width of the confidence interval, the corresponding confidence probability, and the sample size, or the number of degrees of freedom **f=n-1**.

This is what the dependence of the probability density on the width of the confidence interval t in Student’s distribution looks like for different numbers of degrees of freedom:

When f is infinite, the curve coincides with the curve for the normalized standard distribution. But the fewer degrees of freedom involved, the flatter the graph is for large |t| values (it approaches the x axis more slowly). As a result, for the same width of the confidence interval, the confidence probability according to Student’s distribution is always lower than the confidence probability of the Gaussian-Laplace normal distribution. Moreover, the less representative the sample, the greater the difference between the estimates.

The confidence estimate of the average result in Student’s distribution looks as follows:

**x̄ − t·S/√n ≤ M(x) ≤ x̄ + t·S/√n**

where **M(x)** is the mathematical expectation of the average result, and the chosen confidence probability is the probability that the random error of n independent experiments stays below **t·S/√n**.

The analytical representation of the f(t) function is pretty complicated, which is why a table of pre-calculated Student’s coefficients is usually used. In this case, all you need to know is the number of degrees of freedom f=n-1 and the required confidence probability.

And now that we have gone through all this theoretical mathematics, let’s return to our initial goal.

When we compared the average performance of the tested HDDs, I got the impression that the difference between them is too small. And I immediately asked myself: is this difference statistically significant? Since not many testers can afford to repeat HDD tests multiple times, no one can actually guarantee that the average value will not shift to the right or to the left in the case of a small sample (few repeated experiments). And in this case HDDs that really perform close to one another may simply swap places in the rating…

So, we need to figure out the width of the confidence interval that the average HDD speed will fall into as a result of 10 experiments, with a given confidence probability (take, for instance, 0.95). As soon as we get the width of the confidence interval, we will check whether the confidence intervals of different HDDs overlap.
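Here is a minimal sketch of that calculation, assuming ten hypothetical runs for one drive; the Student coefficient for f = 9 and P = 0.95 (2.262) is taken from a standard table:

```python
import math
from statistics import mean, stdev

# Student's coefficients for P = 0.95 (two-sided), from a standard table.
STUDENT_T_095 = {5: 2.571, 9: 2.262, 18: 2.101, 19: 2.093}

def confidence_interval(runs):
    """Return (mean, half-width) of the 0.95 confidence interval for the sample mean."""
    n = len(runs)
    t = STUDENT_T_095[n - 1]               # f = n - 1 degrees of freedom
    half_width = t * stdev(runs) / math.sqrt(n)
    return mean(runs), half_width

# Ten hypothetical trace results in MB/sec (placeholders, not measured data):
runs = [5.1, 5.3, 5.0, 5.2, 5.4, 5.1, 5.2, 5.3, 5.0, 5.2]
m, h = confidence_interval(runs)
print(f"{m:.3f} +/- {h:.3f} MB/sec")
```

If the intervals [m − h, m + h] of two drives overlap, their ranking cannot be trusted at this confidence level.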

At first, let’s check how S (the standard deviation, calculated as the square root of the dispersion) depends on the type of the test:

Of course, we see that the deviation of the results is highest in the Copy trace. On the contrary, the smallest deviation is observed in General Hard Disk Drive Usage. I wonder if it has anything to do with the time it takes to run the entire trace?

Just in case, let’s also check whether the dispersion difference is random or meaningful. The criterion for determining the significance of a dispersion difference at a certain level (1-0.95=0.05) is known as the F-criterion (Fisher’s criterion) and is based on Fisher’s distribution.

The value of the F-function for the two considered samples is obtained as the ratio of their dispersions, with the larger dispersion put into the numerator. If the obtained F is lower than Fcritical at the given level, then we can consider the experimental results represented by both samples equally precise.

To simplify all these calculations, we will take the samples with the maximal and minimal S for each type of test, because if these results pass the qualification according to Fisher’s criterion, then the other results surely will.
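In code, this check might look as follows. The run results are hypothetical, and the Fcritical value of 3.18 for f1 = f2 = 9 at the 0.05 level is a tabulated figure:

```python
from statistics import variance

def fisher_equal_precision(sample_a, sample_b, f_critical):
    """True if the two samples can be considered equally precise: the ratio of
    the larger to the smaller sample dispersion stays below F-critical."""
    va, vb = variance(sample_a), variance(sample_b)
    f = max(va, vb) / min(va, vb)   # larger dispersion in the numerator
    return f < f_critical

# Hypothetical runs with the largest and the smallest scatter for one trace:
worst = [5.0, 5.3, 5.1, 5.4, 5.2, 5.2, 5.1, 5.3, 5.2, 5.2]
best  = [5.1, 5.2, 5.1, 5.3, 5.2, 5.1, 5.2, 5.3, 5.1, 5.2]
print(fisher_equal_precision(worst, best, f_critical=3.18))
```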

The calculations revealed that our results can be considered equally precise.

And now we should only do two more things.

Now that we know the S values, we can finally calculate the confidence intervals. After all the necessary manipulations, we got three performance values for each hard disk drive: the average performance across the experiments, and the minimum and maximum boundaries of the confidence interval.

It turned out that the picture is not quite clear only in one subtest.

Note that the confidence intervals of the two drives overlap. To be more exact, the confidence interval of WD2500JB “includes” the confidence interval of Hitachi 180GXP. It means that despite the higher average performance of WD2500JB compared to Hitachi 180GXP (the average obtained as a result of 10 experiments), we can’t claim with 0.95 probability that the situation will not change with a different sample.

Let’s use Student’s t-criterion to check the significance of the performance difference between the Hitachi 180GXP and WD2500JB HDDs in the Application Loading test.

According to this criterion, the sample average values x̄a and x̄b differ significantly if their difference exceeds the standard deviation of that difference by more than t times, where t is the Student’s coefficient for the chosen confidence probability and f = na + nb − 2 degrees of freedom of the combined sample.

In practice, the following ratios are usually calculated:

**t = |x̄a − x̄b| / (S·√(1/na + 1/nb))**

where S is the pooled standard deviation of the two samples:

**S² = [(na − 1)·Sa² + (nb − 1)·Sb²] / (na + nb − 2)**

And then these values are compared with the Student’s coefficient. Let’s do the same thing now :)

The obtained value (0.530484) is lower than the Student’s coefficient (t=2.10), so the difference between the average results is not significant.
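The same comparison can be sketched in Python. The samples below are hypothetical stand-ins for the two drives’ ten runs, and t = 2.10 is the tabulated Student coefficient for f = 18 and P = 0.95:

```python
import math
from statistics import mean, variance

def t_statistic(a, b):
    """Two-sample Student's t: difference of means over the pooled standard error."""
    na, nb = len(a), len(b)
    # Pooled dispersion with f = na + nb - 2 degrees of freedom.
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return abs(mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))

# Hypothetical Application Loading results (MB/sec) for the two drives:
hitachi = [5.10, 5.15, 5.05, 5.12, 5.08, 5.11, 5.09, 5.14, 5.07, 5.13]
wd      = [5.12, 5.16, 5.08, 5.13, 5.10, 5.12, 5.11, 5.15, 5.09, 5.14]
t = t_statistic(hitachi, wd)
print(t < 2.10)  # below the Student coefficient -> the difference is not significant
```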

If the performance difference between these two HDDs is not statistically significant, then why don’t we call them “equally fast” (in this test) and pass over to the conclusions? :)

The results of the PCMark04 benchmarking set showed that Hitachi 7K250 is the indisputable performance leader. The other three drives performed about the same. However, I would still like to point out that Maxtor 7Y250P0 was very fast in the Copy test, and WD2500JB was quite slow in XP Startup and File Copying.

Now I am going to sum up our verdict about the PCMark04 as a test for hard disk drives. At first let’s see if this benchmark meets the requirements I formulated two years ago (see our article called PCMark2002 as Hard Disk Drive Test).

I believe that each good benchmark should meet the following requirements:

- It should be of reasonable size: Yes. By today’s standards, 35MB is not that much at all (compare with 3DMark03, for instance).
- It should be free for the end-user: No. The free version of this test provides only the overall performance index without any detailed results for each PC subsystem.
- It should work properly under any popular OS: No. If you want to test your PC thoroughly, you need to purchase Windows XP; this is especially critical for the HDD tests :)
- It should not be optimized for the products of some particular company: So far, I have no info about any optimizations of the kind.
- It should generate "repeating" results, i.e. the difference between the results obtained during multiple runs shouldn't be too big: As for the disk subsystem, this seems to be the case.
- The test algorithm and the results obtained should be easy to explain: Here I can speak only for myself :) Having read the PDF file and the FAQ on the FutureMark site, I didn’t encounter any problems working with PCMark04.

Now I should estimate how adequate the results obtained during HDD testing were. The performance values we got in PCMark04 do correlate with our ideas about contemporary HDD speeds (see our article called 7200rpm Hard Disk Drives Roundup: Major League). However, I would still like to dwell on the results of one test. According to PCMark04, two drives out of four performed file copying at about 40MB/sec, while the maximum read speed of these drives is a little over 60MB/sec.

As far as I understand, during file copying the HDD should:

- read some data from the platter into the buffer;
- transfer the data from the buffer into the PC memory;
- move the heads to the track where the data should be saved;
- move the data from the PC memory back into the HDD buffer;
- write the data block.

This 1-5 cycle is repeated until all data has been moved from A to B. But even if we assume that the latest and greatest hard disk drive skips steps 2 and 4 and performs step 3 instantly, the copy speed still cannot exceed 50% of the maximum read speed!

In other words, for hard disk drives with a 60MB/sec read speed, the copy speed cannot be higher than 30MB/sec, and this is what our FC-Test shows: the maximum copy speed we could obtain was 25MB/sec. And here we have 40MB/sec. Something must be wrong…
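The bound itself is simple arithmetic: if every byte must first be read and then written, and both passes run at the drive’s maximum rates, the effective copy rate is read·write/(read + write):

```python
def max_copy_speed(read_mbs, write_mbs):
    """Idealized upper bound on copy speed: copying N MB takes
    N/read + N/write seconds, so the effective rate is the combination below."""
    return read_mbs * write_mbs / (read_mbs + write_mbs)

# With ~60 MB/sec sequential read and an (optimistic) equal write speed:
print(max_copy_speed(60, 60))  # 30.0 MB/sec -- well below the ~40 MB/sec PCMark04 reports
```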

And now a few words about the testing methodology for HDD tests in PCMark04 (if you decide to run them). When you compare the results of hard disk drives tested in NTFS, FAT32, and unformatted, you will get the most repeatable results in the latter case, which is what I actually expected.

If we compare the absolute values for hard disk drives tested in the various modes, I have to point out a significant performance drop for three drives out of four when we ran XP Startup without any file system installed. I cannot tell you whether it is the influence of the OS or a benchmark peculiarity…

So, you CAN use PCMark04 for HDD testing, but you’d better test unformatted drives and skip the File Copy test.

Good luck now!