RAID resync – Best practices

RAID, short for Redundant Array of Independent Disks, provides fault tolerance for your servers.

But, what if there are errors within your RAID array?

Unfortunately, that can result in loss of data. RAID resync helps to keep the disk data in sync.

At 1onlyhost, we regularly monitor the RAID resync process in servers as part of our Server Monitoring Services.

Today, we’ll see the details of the RAID resync process and how our Support Engineers actively monitor it to avoid potential hard disk failures.

Understanding RAID resync

Firstly, let’s get an understanding of the RAID resync process.

In production servers, a device can be added to the RAID array at any time. RAID can only protect your data if the member disks hold consistent copies of it. But a newly added disk is not yet synchronized with the other devices. That’s where RAID resyncing helps.

In the resyncing process, the kernel starts a scan of the original devices and writes the correct blocks to the new device. Usually, the resync is set up as a cron job that runs at regular intervals. For example, in Debian, it is based on a Linux utility called mdadm, which manages and monitors software RAID devices.

Similarly, CentOS systems make use of the script /usr/sbin/raid-check.
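To make the mechanism concrete: as far as we know, these scheduled jobs ultimately ask the kernel md driver to start a check by writing to the array's sync_action file in sysfs. Below is a minimal sketch of doing the same by hand, assuming an array named md0. The function name md_start_check and the MD_SYSFS_ROOT override are hypothetical, added only for illustration and for dry runs against a scratch directory.

```shell
# Hypothetical helper: request a consistency check on one md array.
# MD_SYSFS_ROOT defaults to /sys/block; override it to dry-run elsewhere.
md_start_check() {
    md_dev="$1"
    action_file="${MD_SYSFS_ROOT:-/sys/block}/${md_dev}/md/sync_action"
    if [ ! -w "$action_file" ]; then
        echo "cannot write $action_file (not root, or no such array?)" >&2
        return 1
    fi
    # Writing "check" asks the md driver to scan the array for inconsistencies.
    echo check > "$action_file"
}

# Example (as root): md_start_check md0
```

Reading the same file back with cat shows the current action, for example idle or check.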

Best practices in RAID resync

The server performs a resync of its software RAID at defined intervals. Usually, this results in heavy load and may affect all services until the resync is complete. Unfortunately, the disk resync process can be lengthy and take several hours, depending on the size of the disk.

Now, let’s see the best practices that our Support Engineers follow to make the RAID resync process faster.

1. Resource allocation limits

Normally, the server kernel automatically lowers the priority of the RAID resync when other I/O is active, to avoid impact on server performance. But, in our experience managing servers, we often see degraded server performance as the resync progresses.

To overcome this scenario, our Support Engineers limit the bandwidth allocated to the resync process. For this, we set the minimum and maximum speed limits in /proc/sys/dev/raid/speed_limit_min and /proc/sys/dev/raid/speed_limit_max.

For example, to restrict the maximum speed of RAID reconstruction to about 5 MB/s (the value is given in KiB/s per device), we set the value as:

echo 5000 > /proc/sys/dev/raid/speed_limit_max
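Both limits can be set together with a small wrapper. This is only a sketch under stated assumptions: the function name set_resync_limits and the RAID_PROC_DIR override are hypothetical, and the values are passed in KiB/s, the unit used by these kernel files.

```shell
# Hypothetical helper: set the md resync speed limits, both in KiB/s.
# Usage: set_resync_limits MIN MAX
# RAID_PROC_DIR defaults to /proc/sys/dev/raid; override it for a dry run.
set_resync_limits() {
    min_kib="$1"
    max_kib="$2"
    proc_dir="${RAID_PROC_DIR:-/proc/sys/dev/raid}"
    # Accept only non-empty strings made of digits.
    for value in "$min_kib" "$max_kib"; do
        case "$value" in
            ''|*[!0-9]*) echo "limits must be whole numbers (KiB/s)" >&2; return 1 ;;
        esac
    done
    if [ "$min_kib" -gt "$max_kib" ]; then
        echo "minimum ($min_kib) must not exceed maximum ($max_kib)" >&2
        return 1
    fi
    echo "$min_kib" > "$proc_dir/speed_limit_min" || return 1
    echo "$max_kib" > "$proc_dir/speed_limit_max" || return 1
}

# Example (as root): set_resync_limits 1000 5000
```

These settings do not survive a reboot. The same values can also be set with sysctl, using the keys dev.raid.speed_limit_min and dev.raid.speed_limit_max.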

Similarly, we’ve seen cases where we need to put off the resync process until later, for example while the websites are in their peak hours. Here, to stop the RAID check and prevent it from restarting, we set the following entry:

echo frozen > /sys/block/md0/md/sync_action

This will stop the check, but leave the array in a partially checked state. The next time a check starts, it will resume from where it left off. Thus, it can really help with managing server resources.
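Pausing and later resuming a check can be wrapped in two small functions. This is a hedged sketch: the names md_pause_check and md_resume_check and the MD_SYSFS_ROOT override are hypothetical. It assumes that writing idle lifts the freeze and that writing check then requests a new run.

```shell
# Hypothetical helpers: pause and resume a check on one md array.
# MD_SYSFS_ROOT defaults to /sys/block; override it for a dry run.
md_pause_check() {
    # "frozen" stops the running check and blocks new ones from starting.
    echo frozen > "${MD_SYSFS_ROOT:-/sys/block}/$1/md/sync_action"
}

md_resume_check() {
    action_file="${MD_SYSFS_ROOT:-/sys/block}/$1/md/sync_action"
    # "idle" lifts the freeze; "check" then requests a new check run.
    echo idle > "$action_file" && echo check > "$action_file"
}

# Example (as root):
#   md_pause_check md0     # before peak hours
#   md_resume_check md0    # once traffic has dropped
```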

2. Using read_ahead

Again, from our experience in managing RAID, we see that setting read_ahead per RAID device also helps to make the resync faster. During any disk read operation, the read-ahead policy determines when the controller will read additional data records into cache.

In an application that reads data sequentially, read_ahead can improve performance. For example, to set read-ahead to 32 MiB (65536 sectors of 512 bytes), we use the command:

blockdev --setra 65536 /dev/md0
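The value passed to blockdev --setra is a count of 512-byte sectors, which is easy to get wrong. A small sketch of the conversion follows; the function name mib_to_sectors is hypothetical.

```shell
# Hypothetical helper: convert a read-ahead size in MiB to 512-byte sectors,
# the unit that "blockdev --setra" expects.
mib_to_sectors() {
    echo $(( $1 * 1024 * 1024 / 512 ))
}

# Example (as root):
#   blockdev --setra "$(mib_to_sectors 32)" /dev/md0
#   blockdev --getra /dev/md0    # read the value back, in sectors
```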

3. Set stripe_cache_size

Similarly, increasing the stripe_cache_size shows better results in some types of RAID like RAID5 and RAID6. The stripe cache plays an important role in synchronising all write operations to the array and all read operations if the array is degraded.

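However, using high values can cause an ‘Out of memory’ error on the server. Therefore, our Support Engineers set the values as per the resource availability on the server. Note that the value is a number of cache entries, not bytes: the memory consumed is roughly the system page size multiplied by the number of member disks and by stripe_cache_size. To set stripe_cache_size to 16384 entries for /dev/md3, we use: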

echo 16384 > /sys/block/md3/md/stripe_cache_size
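Before raising the value, it helps to estimate how much RAM the stripe cache will take. The sketch below uses the rule of thumb that memory is about page size × number of member disks × stripe_cache_size. The function name stripe_cache_mib is hypothetical, and a 4096-byte page is assumed unless another size is passed.

```shell
# Hypothetical helper: estimate RAM used by the RAID5/6 stripe cache, in MiB.
# Arguments: stripe_cache_size (entries), number of member disks,
# and optionally the page size in bytes (assumed 4096 if omitted).
stripe_cache_mib() {
    entries="$1"
    disks="$2"
    page_size="${3:-4096}"
    echo $(( entries * disks * page_size / 1024 / 1024 ))
}

# Example: stripe_cache_mib 16384 4    # 4-disk array, 4 KiB pages
```

With 16384 entries on a four-disk array this comes to about 256 MiB, so the value should be weighed against the free memory on the server. The actual page size can be read with getconf PAGESIZE.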

4. Disable NCQ

Yet another method to reduce the resync time in RAID is to disable Native Command Queuing (NCQ).

NCQ allows an individual hard disk to internally optimize the order in which it executes the read and write commands it receives. However, it can also slow down the resync process. Therefore, we disable it for all the drives in the array.
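One common way to disable NCQ on Linux is to set the drive's queue depth to 1 through sysfs. The following is a minimal sketch, not necessarily the exact procedure our engineers use. The function name disable_ncq, the SYS_BLOCK_ROOT override, and the drive names sda and sdb are hypothetical; the real member drives can be listed with mdadm --detail.

```shell
# Hypothetical helper: disable NCQ on the given drives by setting queue depth to 1.
# Usage: disable_ncq sda sdb ...
# SYS_BLOCK_ROOT defaults to /sys/block; override it for a dry run.
disable_ncq() {
    root="${SYS_BLOCK_ROOT:-/sys/block}"
    for drive in "$@"; do
        depth_file="$root/$drive/device/queue_depth"
        if [ -w "$depth_file" ]; then
            echo 1 > "$depth_file"
        else
            echo "skipping $drive: $depth_file not writable" >&2
        fi
    done
}

# Example (as root): disable_ncq sda sdb
```

The setting is lost on reboot, so it is worth noting the previous queue depth in case it needs to be restored after the resync.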

5. Regular monitoring

Again, regular monitoring of RAID resync always helps. When you suddenly see the RAID resyncing for no apparent reason, it can be a warning sign that something is wrong. It can be a bad disk, or even a RAID failure.

That’s why we always keep a check on the RAID resync process. On the managed servers, our Support Engineers set up monitoring tools like Nagios that constantly monitor the RAID status from the file /proc/mdstat.
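As an illustration of what such a check can read from /proc/mdstat, here is a minimal sketch that lists degraded arrays and any resync, recovery, check or reshape in progress. It is not the Nagios plugin itself. The function name mdstat_summary is hypothetical, and the parsing assumes the usual mdstat layout, with a status such as [U_] for a missing member and a progress line such as "resync = 12.6%".

```shell
# Hypothetical helper: summarize array health from an mdstat-style file.
# Usage: mdstat_summary [FILE]    (FILE defaults to /proc/mdstat)
mdstat_summary() {
    awk '
        # Remember which array the following lines belong to.
        /^md[^ ]* :/ { array = $1 }
        # A status like [U_] or [_U] means at least one member is missing.
        /\[[U_]*_[U_]*\]/ { print array " degraded" }
        # Progress lines look like: "... resync = 12.6% (...) finish=..."
        /(resync|recovery|check|reshape) *=/ {
            if (match($0, /[0-9]+\.[0-9]+%/)) {
                pct = substr($0, RSTART, RLENGTH)
                match($0, /(resync|recovery|check|reshape)/)
                print array " " substr($0, RSTART, RLENGTH) " " pct
            }
        }
    ' "${1:-/proc/mdstat}"
}

# Example: mdstat_summary
```

A monitoring wrapper could raise an alert whenever this prints a degraded line, or when a resync line appears outside the scheduled window.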

[Need advice on setting up RAID? Our Support Engineers can help you.]

Conclusion

In a nutshell, RAID resync helps devices catch up with the RAID array and get data back in sync. Today, we saw the best practices followed by our Support Engineers in making the resync process faster.