We maintain our datacenters at ambient temperatures

mostakimvip04 · Post by **mostakimvip04** » Wed Jul 09, 2025 5:59 am

Our data mirroring scheme ensures that information stored on any specific disk, on a specific node, and in a specific rack is replicated to another disk of the same capacity, in the same relative slot, and in the same relative datanode in a another rack usually in another datacenter. In other words, data stored on drive 07 of datanode 5 of rack 12 of Internet Archive datacenter 6 (fully identified as ia601205-07) has the same information stored in datacenter 8 (ia8) at ia801205-07. This organization and naming scheme keeps tracking and monitoring 20,000 drives with a small team manageable.

humidity, meaning that we don’t incur the cost of operating telegram data and maintaining an air-conditioned environment (although we do use exhaust fans in hot weather). This keeps our power consumption down to just the operational requirements of the racks (about 5 kilowatts each), but does put some constraints on environmental specifications for the computers we use as data nodes. So far, this approach has (for the most part) worked in terms of both computer and disk drive longevity.

Of course, disk drives all eventually fail. So we have an active team that monitors drive health and replaces drives showing early signs for failure. We replaced 2,453 drives in 2015, and 1,963 year-to-date 2016… an average of 6.7 drives per day. Across all drives in the cluster the average “age” (arithmetic mean of the time in-service) is 779 days. The median age is 730 days, and the most tenured drive in our cluster has been in continuous use for 6.85 years!

So what happens when a drive does fail? Items on that drive are made “read only” and our operations team is alerted. A new drive is put in to replace the failed one and immediately after replacement, the content from the mirror drive is copied onto the fresh drive and read/write status is restored.

Although there are certainly alternatives to drive mirroring to ensure data integrity in a large storage system (ECC systems like RAID arrays, CEPH, Hadoop, etc.) Internet Archive chooses the simplicity of mirroring in-part to preserve the the transparency of data on a per-drive basis. The risk of ECC approaches is that in the case of truly catastrophic events, falling below certain thresholds of disk population survival means a total loss of all data in that array. The mirroring approach means that any disk that survives the catastrophe has usable information on it.