by Aleksey Meyev, Nikita Nikolaichev
12/03/2009 | 01:09 PM
The corporate sector, which is the main consumer of such serious products as RAID controllers, is highly conservative. For a business, any failure or downtime, however small, can cause a financial loss. Most consumers in this market follow the principle of making haste slowly. It is easy to come up with an example: while desktop HDDs quickly transitioned from the parallel to the serial interface (from PATA to SATA), the similar transition from SCSI to SAS has taken a long while. Disk racks with the SCSI 320 interface are no museum rarity even today. Anyway, it is time for yet another change. This time around, the SAS interface has increased its bandwidth from 3Gbps to 6Gbps. The new SAS 2.0 standard has brought about some other innovations including reduced EMI, an increased maximum cable length (from 6 to 10 meters) with reduced crosstalk, and a revised topology for complex configurations. It is the doubled bandwidth that is the most prominent and called-for innovation, though. Of course, the new standard is fully compatible with the old one. The transition would be much more difficult otherwise. Generally speaking, SAS has a stable niche among today’s disk interfaces. It is “shorter” than the competing Fibre Channel and iSCSI but much cheaper than the former and has lower latencies than the latter. Now at 6Gbps, SAS looks like a very nice way of connecting disks located in several adjacent racks.
As a matter of fact, the bandwidth of 3Gbps (or, theoretically, 300MBps) is already quite enough if each disk is connected to a dedicated port of the RAID controller. Even the best of today’s SAS drives, the models with a spindle rotation speed of 15,000rpm and a huge recording density, are only approaching a read speed of 200MBps. So why is the higher bandwidth so important? Because it can be used not only for connecting individual disks but whole racks of disks! If one controller channel is shared by multiple drives, the bandwidth of 3Gbps is not enough as it can only satisfy two modern HDDs. We also shouldn’t forget about solid state drives that easily deliver speeds of 250-270MBps and are already limited by the interface.
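The arithmetic behind these figures can be sketched as follows. This is a rough illustration, not a measurement: it assumes 8b/10b line encoding (ten bits on the wire per data byte), which is how a 3Gbps link arrives at roughly 300MBps, and it takes the 200MBps per-disk figure from the paragraph above. The function names are ours.

```python
def link_payload_mbps(line_rate_gbps: float) -> float:
    """Usable payload rate in MB/s of a serial link, assuming 8b/10b encoding."""
    # With 8b/10b, every data byte occupies 10 bits on the wire.
    return line_rate_gbps * 1000 / 10


def disks_to_saturate(line_rate_gbps: float, disk_mbps: float) -> float:
    """How many disks streaming at disk_mbps it takes to fill one link."""
    return link_payload_mbps(line_rate_gbps) / disk_mbps


if __name__ == "__main__":
    for rate in (3.0, 6.0):
        print(f"{rate}Gbps link: {link_payload_mbps(rate):.0f}MBps payload, "
              f"filled by {disks_to_saturate(rate, 200.0):.1f} disks at 200MBps")
```

Under these assumptions a shared 3Gbps channel is full with fewer than two drives streaming at top speed, while a 6Gbps channel holds three.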
It is also easy to tell which users need this increased bandwidth. It is not necessary for those who want as many operations per second from their disk subsystem as possible, but it is vitally important for users who need to quickly read and write large amounts of data simultaneously. Various file storage systems or video-on-demand servers are examples of that.
Now, let’s have a look at the first SAS RAID controller we’ve got that supports SAS 6Gbps. It is the MegaRAID SAS 9260-8i model from a new controller series introduced by the renowned LSI.
LSI’s new-generation controllers can be easily distinguished by the number 92 at the beginning of the model name. They split into three series: HBA, Value and Feature. The HBA series consists of controllers that have no XOR coprocessor and support only the basic array types: RAID0, RAID1, RAID1E and RAID10. They still need a rather advanced processor (a PowerPC 440 clocked at 533MHz), though. For the two senior series LSI has developed a new processor, the LSISAS 2108, with a clock rate of 800MHz. As a matter of fact, the Value and Feature series are very similar, differing only in the type of connectors. The Feature series models are equipped with external SFF-8088 connectors while the Value series, like the 9260-8i model we’ve got, has only internal SFF-8087 ports. So, everything we write below applies to all products from the two new series.
The controller does not differ externally from LSI’s previous series. It is a low-profile MD2 card whose main chip is covered by a needle-shaped heatsink. Like all its cousins (including the 4-port models), the LSI 9260-8i carries 512 megabytes of DDR2 SDRAM clocked at 800MHz. Installing less memory on board would hardly be serious, whereas a larger amount is not really necessary (at least on configurations similar to our testbed). An important, although not conspicuous, difference from the previous series is that the interface to the mainboard has changed. The controller now uses PCI Express version 2.0, which means that it can pump up to 64Gbps (both directions combined) through its eight PCI Express lanes, which is more than enough even for eight fully loaded SAS 6Gbps ports.
Everything is conventional when it comes to the array types supported. New types have not been invented while nearly every full-featured controller today supports all the old types like RAID0, 1, 5, 6 and combinations thereof.
LSI’s new controllers support a battery backup unit. Each of them is compatible with the iBBU07 model which had been previously used for the LSI 8880EM2 only.
SSD support must be noted, too, although it is not unique to this controller series. It has been added with the latest version of LSI’s software which was released almost together with the new controllers. You should not expect any miracles from that, but RAID controllers have learned to identify SSDs as a separate class of storage devices that are completely different from HDDs. Such operations as patrol reading of the surface make no sense for an SSD, for example. Hopefully, cache data are written to an SSD less often, which should affect its service life positively.
The following testing utilities were used:
We have had to change our testbed somewhat. We have switched to Windows Server 2008. The reason is simple. LSI just does not release drivers for the ancient Windows 2000.
Unfortunately, the hardware of our testbed has remained the same. We still use a mainboard with a PCI Express 1.0 slot and hard drives with SAS 3Gbps. However, with our test method, in which each HDD is connected to an individual controller port, the interface should have no effect on the performance.
So, the controller was installed into the mainboard’s PCI Express x8 slot. We used Fujitsu MBA3073RC disks installing them into the default rack of the SC5200 case. The controller was tested with eight HDDs in the following modes:
We have changed the set of arrays for our tests. To save time and effort, we do not test 4-disk and degraded arrays, but include RAID50.
We will publish the results of a single Fujitsu MBA3073RC on an LSI SAS3041E-R controller for reference, but you should be aware that this controller/drive combination has a well-known problem. It is slow at writing in FC-Test.
The stripe size is set at 64KB for each array type.
The controller was tested with the latest BIOS and driver versions we could get: BIOS 12.0.1-008 and driver 188.8.131.52.
Before proceeding to our tests, we want to tell you about one peculiarity of this test session. In some cases you will see results of two test runs with absolutely identical arrays. This is because the controller behaved queerly in our first test session. Just take a look at the following diagram.
We had seen lots of weird things in our tests but there was something basically wrong in the 8-disk arrays being slower than the single disk at sequential writing. And there was no such problem with reading. At first we could not pinpoint the reason. Everything was right with the BBU and caching. The logs were clean but the speed refused to rise whatever we did with the settings.
We found the cause of the problem eventually. It was in the order of our tests. Our standard test procedure goes like this: a script launches various types of loads in IOMeter and then we manually partition the disk and launch FC-Test. The sequence of IOMeter loads is as follows: the access time test goes first (it is a lot of operations with small-size random-address data blocks), then we have random reading and writing tests with a varying data block size. Next goes a group of tests with sequential requests (including multithreaded loads) and finally we emulate server loads and run the Database pattern. We found out that if we first ran FC-Test and then began the IOMeter part with the group of sequential-load tests, we had completely different results. The only explanation we could think of was that the controller was capable of adjusting to the current load. However, the controller seems to require some time or a certain number of requests to decide that the caching policy needs adjustment, and during our rather short tests the controller could not keep up with our rate of load change.
And one more note: when changing arrays, we had to shut the server down after we had removed the previous array. Otherwise, we would get a low write speed again on the newly established array.
It wasn’t our business to deeply explore the controller’s operating algorithms, yet we’ve decided to publish data from two test cycles. One cycle began with random-address loads, and the other with sequential loads.
So, we can evaluate the controller’s performance in two scenarios: a file-server and a video server.
This time we would like to start the discussion of test results with the sequential patterns for pretty obvious reasons. In these patterns the storage devices receive a chain of requests with a queue depth of 4. Once a minute the data block size increases. As a result, we can check the dependence of the array’s linear read and write speeds on the size of the data block and thus estimate the maximum achieved speed.
There is almost no difference between the two test cycles. It fits within the measurement accuracy range. The order of test loads has no effect on sequential reading. What can we see in the diagram? First, the 2-level RAID10 and RAID50 arrays are somewhat slower than the other arrays on small data blocks. Second, the controller is obviously trying to read data from both disks in mirror pairs, but only on large data blocks. The top speeds are all right. You can predict them by knowing the speed of the single disk and making allowances for a minor loss of efficiency. It is only sad that the arrays reach those top speeds on rather large data blocks. We have seen higher speeds in our comparative review.
After this diagram we could breathe a sigh of relief because we had managed to make the controller deliver a decent speed of writing. The results of the first cycle are somewhere at the bottom of the diagram whereas the second-cycle speeds are much better, even though not perfect. The RAID5 and RAID50 stumble on medium-size data blocks for some reason. The RAID10’s flat stretch on 16 to 32KB blocks is no good, either. On very large data blocks the top speeds are proper enough: the RAID0 reaches almost 1000MBps. The RAID5 is slower by the speed of one disk. The RAID6 and RAID50 are slower yet because each gives up two disks’ worth of capacity to checksums. The RAID10 is almost exactly half as fast as the leader.
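The top speeds described above follow a simple rule of thumb: the number of data-bearing disks multiplied by the speed of one disk. Here is a minimal sketch of that rule. It is an idealized model, not the controller’s behavior; the 125MBps single-disk figure is our assumption, derived from the RAID0 reaching almost 1000MBps on eight disks, and the RAID50 is assumed to consist of two RAID5 halves.

```python
def expected_write_mbps(level: str, disks: int, disk_mbps: float) -> float:
    """Idealized sequential write speed: data-bearing disks times single-disk speed."""
    if level == "RAID0":
        data_disks = disks          # every disk holds data
    elif level == "RAID5":
        data_disks = disks - 1      # one disk's worth of checksums
    elif level in ("RAID6", "RAID50"):
        data_disks = disks - 2      # two checksums per stripe, or one per RAID5 half
    elif level == "RAID10":
        data_disks = disks // 2     # every block is written to both disks of a mirror
    else:
        raise ValueError(f"unknown array type: {level}")
    return data_disks * disk_mbps


if __name__ == "__main__":
    for level in ("RAID0", "RAID5", "RAID6", "RAID50", "RAID10"):
        print(level, expected_write_mbps(level, disks=8, disk_mbps=125.0))
```

Real arrays fall somewhat short of these numbers, but the ordering matches what the diagram shows on very large data blocks.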
In the Database pattern the disk subsystem is processing a stream of requests to read and write 8KB random-address data blocks. The ratio of read to write requests is changing from 0% to 100% with a step of 10% throughout the test while the request queue depth varies from 1 to 256.
You can click these links to view the tabled results:
We won’t show both cycles in the diagrams for the sake of clarity because the difference is negligible.
The only firmware algorithm that works at a request queue of 1 is deferred writing. And it works well enough on the LSI 9260-8i: write requests are cached and the performance scales up together with the total amount of cache memory in the disk subsystem (that of the controller and disks) an array has.
Take note of one curious fact: the RAID50 writes about as fast as (and even somewhat faster than) the RAID5 and much faster than the RAID6. There is nothing really odd about that, though. There is no need for a RAID50 to calculate two checksums simultaneously. One checksum is enough. The minor advantage over the RAID5 can be explained by the fact that data for checksums have to be taken from only half the number of disks. So if you are willing to exchange some degree of security (a RAID6 can survive a failure of any two disks whereas a RAID50 can only survive a failure of two disks if they belong to different RAID5 sub-arrays) for an increase in writing performance, RAID50 may be the right option for you.
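The security trade-off can be put into numbers by enumerating every possible pair of failed disks. The sketch below assumes an 8-disk RAID50 built from two 4-disk RAID5 halves (disks 0-3 and 4-7); the layout and the function names are ours, for illustration only.

```python
from itertools import combinations


def raid6_survives(failed, disks=8):
    """A RAID6 tolerates the loss of any two disks."""
    return len(failed) <= 2


def raid50_survives(failed, disks=8, groups=2):
    """A RAID50 tolerates at most one failed disk in each RAID5 half."""
    per_group = disks // groups
    losses = [0] * groups
    for disk in failed:
        losses[disk // per_group] += 1
    return all(count <= 1 for count in losses)


def surviving_pairs(survives, disks=8):
    """Return (survivable two-disk failures, all two-disk failures)."""
    pairs = list(combinations(range(disks), 2))
    return sum(1 for pair in pairs if survives(pair)), len(pairs)


if __name__ == "__main__":
    print("RAID6: ", surviving_pairs(raid6_survives))
    print("RAID50:", surviving_pairs(raid50_survives))
```

Under this layout the RAID50 survives 16 of the 28 possible double failures, whereas the RAID6 survives all of them.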
It is not exactly correct to compare this controller with the LSI 8708EM2 we tested earlier since our OS has changed, but anyway. The new controller is faster at writing. Every type of RAID is faster and the RAID5’s graph is nearly horizontal. The graph of the RAID10 even curls up on the right. What is especially nice, the controller has not lost its ability to effectively search for the luckier disk in a mirror pair. It is through this technique that the RAID10 is ahead of the other arrays at high percentages of reads.
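We do not know how LSI’s firmware actually picks the “luckier” disk. One common approach, shown here purely as a hypothetical illustration, is to send each read to the mirror member whose heads are currently closest to the requested address. The names and the simple distance metric are our assumptions.

```python
def pick_mirror_disk(head_positions, target_lba):
    """Index of the mirror member whose heads are nearest to target_lba."""
    return min(range(len(head_positions)),
               key=lambda i: abs(head_positions[i] - target_lba))


def total_head_travel(requests, start_positions=(0, 0)):
    """Serve random reads with nearest-head selection; return summed head travel."""
    heads = list(start_positions)
    travel = 0
    for lba in requests:
        i = pick_mirror_disk(heads, lba)
        travel += abs(heads[i] - lba)
        heads[i] = lba  # the chosen disk's heads now rest at the requested address
    return travel


if __name__ == "__main__":
    reads = [10, 1000, 20, 990]
    print("mirror pair:", total_head_travel(reads))
    print("single disk:", total_head_travel(reads, start_positions=(0,)))
```

In this toy example the mirror pair covers about a third of the distance a single disk would, which is the kind of saving that shows up as a lower read response time.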
Take note that the RAID50 is still much better at writing than the RAID6 although it has no advantage over the RAID5 anymore.
When the queue depth is increased to 256 requests, we see the RAID50 lose its speed at high percentages of reads. While the other arrays yield about the same amount of read operations from eight disks (excepting the RAID10 which is better than the others by choosing the luckier disk in a mirror), the RAID50 is surprisingly slower. Otherwise, everything is very good, and we can see rather high results at writing.
For 10 minutes IOMeter is sending a stream of requests to read and write 512-byte data blocks with a request queue of 1. The total of requests processed by the disk subsystem is over 60 thousand, so we get a sustained response time that doesn’t depend on the amount of cache memory.
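With a request queue of 1 there is never more than one request in flight, so the mean response time is simply the test duration divided by the number of completed requests. A minimal sketch of that relation follows; the function names are ours, and the 60,000 figure is the lower bound mentioned above rather than a measured result.

```python
def mean_response_ms(duration_s: float, completed_requests: int) -> float:
    """Mean response time at queue depth 1: elapsed time per completed request."""
    return duration_s * 1000.0 / completed_requests


def operations_per_second(duration_s: float, completed_requests: int) -> float:
    """The same measurement expressed as operations per second."""
    return completed_requests / duration_s


if __name__ == "__main__":
    print(mean_response_ms(600, 60_000), "ms")
    print(operations_per_second(600, 60_000), "ops/s")
```

So 60 thousand requests in 10 minutes correspond to a mean response time of 10 milliseconds; any disk subsystem that completes more requests has a proportionally lower response time.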
It is easy to see that the order of tests in a cycle has no effect on the response time. The same array delivers the same result all the time (by the way, this proves the good repeatability of the statistical method of measuring a disk subsystem’s response). The RAID10 is ahead, of course. The other arrays are somewhat slower than the single HDD but the RAID10 is faster than the latter thanks to the firmware algorithms.
It is the array’s total cache that determines its write response. The larger it is, the lower the response. We must acknowledge that, unlike its predecessor, this controller is as good as the best of its opponents in this test.
Now we will see the dependence between the controllers’ performance in random read and write modes on the size of the processed data block.
We will discuss the results in two ways. For small-size data chunks we will draw graphs showing the dependence of the amount of operations per second on the data chunk size. For large chunks we will compare performance depending on data-transfer rate in megabytes per second. This approach helps us evaluate the disk subsystem’s performance in two typical scenarios: working with small data chunks is typical for databases. The amount of operations per second is more important than sheer speed then. Working with large data blocks is nearly the same as working with small files, and the traditional measurement of speed in megabytes per second becomes more relevant.
Let’s start with reading.
There are no surprises when the controller is reading small random-address data blocks. The RAID10 is ahead, followed by the single HDD. The others are trailing behind. The results do not depend on the order of tests.
The arrays split up in twos on large data blocks. Reading is stable irrespective of the data chunk size and the previous load. The standings are normal and compatible with what we have seen in the sequential reading test. The RAID0 is ahead, followed by the RAID5. The RAID10 is the worst array here. Its ability to choose a luckier disk in a mirror is an advantage when reading small data blocks, but when the sequential speed becomes the crucial factor (at 2MB blocks), the RAID10 cannot match the other arrays.
It is the amount of available cache memory and the need (or the lack thereof) to calculate checksums that is the decisive factor at writing small data blocks. Take note that the RAID50 is always a little bit faster than the RAID5 as it has to make fewer disk accesses to calculate a checksum. The RAID6 cannot match the RAID50 under this load because the second checksum is quite a burden at random writing.
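To show why checksums weigh on small random writes, here is a sketch of how an XOR checksum can be updated. It covers two textbook strategies: read-modify-write, and rebuilding the checksum from the untouched data blocks of the stripe. Which one the controller uses is not known to us, and the second RAID6 checksum is not a plain XOR, so only the access counts apply to it. All names are illustrative.

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def parity_after_small_write(old_parity: bytes, old_data: bytes, new_data: bytes) -> bytes:
    """Read-modify-write: cancel the old data out of the parity, then add the new data."""
    return xor_blocks(old_parity, old_data, new_data)


def rmw_disk_accesses(checksums: int) -> int:
    """Accesses per small write: read and rewrite the data block and every checksum."""
    return 2 * (1 + checksums)


def reconstruct_reads(disks_in_raid5: int) -> int:
    """Reads needed when the parity is rebuilt from the untouched data blocks."""
    return disks_in_raid5 - 2  # all disks minus the parity disk and the one being written


if __name__ == "__main__":
    stripe = [b"\x01\x02", b"\x10\x20", b"\x0f\xf0"]
    parity = xor_blocks(*stripe)
    new_block = b"\xaa\x55"
    updated = parity_after_small_write(parity, stripe[1], new_block)
    print(updated == xor_blocks(stripe[0], new_block, stripe[2]))
    print(rmw_disk_accesses(1), rmw_disk_accesses(2))
    print(reconstruct_reads(8), reconstruct_reads(4))
```

With read-modify-write, a single checksum costs four disk accesses per small write and two checksums cost six. If the parity is rebuilt instead, a 4-disk RAID5 half of a RAID50 has to read two blocks where an 8-disk RAID5 has to read six, which is consistent with the RAID50’s small lead.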
Interestingly, both test cycles produce almost coinciding results although we are testing writing. You cannot see the difference in the diagrams. Thus, the discrepancy is not due to the controller’s request caching mechanisms.
It is at writing in large data blocks that we can see the huge difference between the two test cycles. This is all quite mysterious. Most of the first-cycle arrays seem to hit against some performance ceiling on large data blocks, even showing some reduction of speed as the data block grows larger. However, the RAID0 did very well in the first cycle but delivered low performance in the second cycle as if finding the same barrier that prevented the other first-cycle arrays from showing their best. This is a very odd behavior, we should say.
The multithreaded tests simulate a situation when there are one to four clients accessing the virtual disk at the same time – the clients’ address zones do not overlap. We will discuss diagrams for a request queue of 1 as the most illustrative ones. When the queue is 2 or more requests long, the speed doesn’t depend much on the number of applications.
As we have already found out, reading is performed in the same way in both test cycles, so we publish only one set of arrays here.
The controller is unfortunately no better than its predecessor at one data thread. It only accelerates to its top speed at queue depths longer than 1 request as you can easily see in the tables. Thus, we see a rather funny picture: the single-level arrays all deliver the same speed of slightly higher than 400MBps at the shortest queue depth (in practical terms, it is the same as simply reading a single file in 64KB blocks). The 2-level RAID10 and RAID50 are somewhat slower, so the RAID50 finds itself losing to the RAID6 despite the identical results in the sequential read test.
There are notable changes when there are more data threads to be processed. Every array slows down at two threads, but in a varying degree. The RAID10 and RAID50 suffer the most as their speed is reduced by half. The RAID10 behaves in a curious manner. It speeds up at three data threads (perhaps it is reading one thread from an individual set of disks in the mirror pairs?) but slows down again at four threads. The RAID50 behaves like the other arrays: it is slowly increasing its speed as the number of data threads is increased. It looks like the increased number of threads is for these arrays like a replacement of a long request queue. You can see it clearly with the RAID0: this array reads four threads faster than only one thread at a queue depth of 1 request.
Writing gives us one more reason to compare the two test cycles. In the first cycle the speeds are very low even at one thread. Every array is inferior to the single HDD. The second-cycle arrays are much better, however. Each of them delivers about 400MBps then. The RAID0 stands out with its highest speed as well as with being the only array to accelerate as the queue depth grows longer. Although the speeds are far from what we can expect theoretically, this controller is better than its predecessor which was much worse in this test.
The first-cycle arrays are indifferent to the number of data threads while the second-cycle arrays all speed up at two threads. The RAID0 is especially good here.
The same RAID0 reacted eagerly to the addition of a third data thread, speeding up a little more. There is nothing particularly interesting overall, though. The arrays just retain their speeds.
The drives are tested under loads typical of servers and workstations.
The names of the patterns are self-explanatory. The Web-Server pattern emulates a server that receives read requests only whereas the File-Server pattern has a small share of write requests. The request queue is limited to 32 requests in the Workstation pattern.
The order of tests has no effect on the performance of the controller in this group of tests. Therefore we will only show one test cycle in the diagrams below.
The number of disks in the array is the crucial factor when the load consists of reads only, as in this test, unless additional factors come into play. Two such factors are the algorithm of selecting a luckier disk in the RAID10, which helps this array win, and some problems of the RAID50 at request queue depths longer than 16, which make it slower than the other arrays.
Under mixed load the standings are not so obvious. The RAID10 is ahead at short queue depths but gives way to the RAID0 at long ones. However, the RAID10 has the highest overall rating according to our formula. The RAID5, RAID6 and RAID50 arrays go neck and neck at short queue depths but the RAID5 goes ahead at longer queue depths whereas the RAID6 falls behind the RAID50.
The Workstation load is more complex and we do not consider queue depths longer than 32 requests. As a result, the RAID10 is again in the lead while the RAID50 is somewhat faster than the RAID5. Both of them are much faster than the RAID6. When there is a lot of writing to be done, calculating two checksums affects the latter array’s performance most negatively.
If the test zone is limited to the initial 32 gigabytes of storage space, the results improve. The resulting zone on the disks is very narrow, so the luckier disk selection algorithms do not provide a big advantage for the RAID10. The RAID0 is on top as a result.
For this test two 32GB partitions are created on the disk and formatted in NTFS and then in FAT32. A file-set is then created, read from the disk, copied within the same partition and copied into another partition. The time taken to perform these operations is measured and the speed of the disk is calculated. The Windows and Programs file-sets consist of a large number of small files whereas the other three patterns (ISO, MP3, and Install) include a few large files each. The ISO pattern uses the largest files of all.
You should be aware that the copying test not only indicates the speed of copying within the same disk but is also indicative of the latter’s behavior under complex load. In fact, the disk is processing two data threads then, one for reading and another for writing.
Both test cycles are interesting here.
The writing results are especially impressive, so we want to discuss them in groups.
Let’s start with the ISO pattern. Its files are so large that they cannot fit into the cache memory. As a result, the arrays are all very slow in the first test cycle. If it were not for the problem with the firmware of the LSI SAS3041E-R controller, the single HDD would be able to beat all its 8-disk opponents. In the second test cycle the speeds are 3 to 4 times higher. They should be even higher theoretically, yet these results are good anyway: the previous LSI 8708EM2 model could only deliver a data-transfer rate of 200MBps with RAID0. LSI’s programmers have to work on the firmware more: RAID0 should not lose to RAID50 and be but slightly ahead of other array types.
It is completely different with the Install and MP3 patterns. The files are smaller, so some of them sink right into the cache. This produces a curious effect: the controller delivers almost the same speeds as with the ISO pattern. Moreover, there is a difference between the results of the first and second cycles, but it is not too big. And the first-cycle arrays win in more than half the number of cases. It looks like the large number of random-address requests affects the caching mechanism in some way or another, which provokes a terrible performance hit for large files but can produce some performance gain for small files. If so, the controller’s behavior is no good. Its performance is just too unpredictable.
And finally, the Programs and Windows patterns are all about small files. As a result, the difference between the arrays is almost nonexistent. The arrays of different types and the same-type arrays from the two test cycles all deliver very similar speeds. Frankly speaking, we are quite at a loss trying to comprehend what is going on in the controller.
The inexplicable can be observed in the results again. All the previous results seem to suggest that reading goes in the same way in both test cycles but in two out of the 25 cases there is a difference that cannot be written off as a measurement inaccuracy.
We don’t see high speeds with the ISO pattern: no array is able to overcome a 400MBps barrier. A notable difference between the arrays can be seen with the Install and MP3 patterns: the 2-level RAID10 and RAID50 fall behind the others. This reminds us of the multithreaded test where we saw the same standings when the arrays were processing one data thread. The arrays all have similar performance with the other two file-sets.
By the way, take note that the speed of reading of the last two file-sets is much lower than the speed of writing. This must be due to caching. Our test load cannot fully fit into the disk subsystem’s cache, yet the 512 megabytes of cache memory must be accounted for when discussing the results.
The speed of copying files is usually determined by the speed of writing. This is why we see the second-cycle arrays enjoy such a large advantage over the first-cycle ones with the ISO pattern. The MP3 file-set produces a similar picture with one difference: the RAID0 is rather too slow with the ISO files. The 2-level RAID10 and RAID50 are the expected losers in the MP3 pattern.
The arrays produce similar results irrespective of the previous load in the Programs and Windows patterns.
The Install pattern produces weird results. The mixed load (this pattern contains both large and small files) plays a trick on the controller, messing up its performance.
We have tested a new-generation controller from LSI that supports the new SAS 6Gbps standard. The controller has shown some indubitable progress over its predecessors and offers excellent performance at database operations (especially with mirror-based arrays). On the other hand, the new SAS interface is going to be the most demanded for streaming loads. And it is at streaming loads that we cannot see impressive results. Other controllers delivered better performance on SAS 3Gbps.
The adaptive caching algorithms are the most memorable thing for us, though. It may be an excellent idea for real-life applications, but is a real headache for hardware testers. :)