Maintaining a Mining Farm: Farm Monitoring, Backup Drive, Cleaning and Check-ups, Warranty and Minor Repairs
Hi there! This is the fifth article in the series where I share my experience with beginner miners. This time, we will learn how to maintain rigs so that they last as long as possible.
Quick routine maintenance
I recommend running a quick maintenance once every month or two. First, update your software to the latest version. An operating system update may close known vulnerabilities, and the latest versions of mining software usually offer improved performance. Unfortunately, updating Windows is often complicated and risky because of frequent issues with video drivers, which is why I would recommend buying one of the commercial Linux-based distributions.
Normally, I update my software in two steps: first, I install the update on a couple of rigs to see how it works, and a week later I proceed with the rest of the devices. I once had a bad experience installing a Hive OS update on all rigs at once: mining was down for an hour because Claymore stopped working for some reason. I had to urgently switch to Ethminer, which turned out to have a lower hash rate.
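The two-step update described above can be written down as a small plan. This is only a minimal sketch in Python: the rig names, the `plan_staged_update` function and its parameters are made up for illustration, and it only works out who gets updated when. It does not talk to Hive OS or any other system.

```python
from datetime import date, timedelta

def plan_staged_update(rigs, test_count=2, wait_days=7, start=None):
    """Split rigs into a small test group and the rest, each with an update date.

    test_count and wait_days mirror "a couple of rigs" and "a week later";
    both are illustrative defaults.
    """
    start = start or date.today()
    return {
        "test": {"rigs": rigs[:test_count], "update_on": start},
        "rest": {"rigs": rigs[test_count:],
                 "update_on": start + timedelta(days=wait_days)},
    }

# Hypothetical farm of five rigs, first update on 1 January 2024.
plan = plan_staged_update(
    ["rig01", "rig02", "rig03", "rig04", "rig05"],
    start=date(2024, 1, 1),
)
for group, info in plan.items():
    print(group, info["update_on"], info["rigs"])
```

If the test rigs misbehave during the waiting week, the rest of the farm simply stays on the old version.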
A visual examination is one of the steps of quick maintenance. Before you start, shut down the computer and switch off the power supply so that the PSU capacitors can discharge. The PSU is what you need to look at first, since it is difficult to monitor remotely. Test the fan: check whether it turns freely and whether too much dust has built up inside. If it makes any unusual sounds or spins too slowly, take care of it as soon as possible, as overheating can seriously damage the board.
In most cases this is easy to fix. Remove the PSU cover, which is fastened with four screws, and then undo the four fan screws. The fan cable is either plugged into the board with a detachable connector or soldered onto it. Regular PC fans usually work just fine, so you can buy one of the right size in any computer store. In my experience, Chieftec fans need replacing most often.
GPU and riser power cable connectors often burn after working at full capacity for a long time. If your PSU has removable (modular) cables, make sure to check the sockets on the PSU housing as well.
If the plastic sockets are burnt or darkened, you will have to halt the rig and replace the PSU. A damaged cable can sometimes be repaired by soldering on a connector from another cable, which is why I keep all the cables that come with my PSUs. Unfortunately, in many cases the connectors were burnt so badly that I had to buy a new PSU. Surprisingly, I have never had burnt cables on China-made mining PSUs, whereas almost all my Corsair and Chieftec hardware got damaged over time.
Next, check the CPU fan. For modern processors, overheating is not a big deal as they shut down before irreparable damage is done; besides, the CPU is not usually overloaded when you mine using video cards. However, a broken CPU fan can cause problems when starting the rig. A CPU FAN ERROR will be displayed, and the computer will prompt you to press the F1 key. The fan can be monitored remotely: if the CPU temperature rises, you might want to check the cooler. In my experience, the fan itself is rarely the cause of problems; it’s mostly some foreign objects like cables or peeling riser bases that get stuck inside it.
Scheduled advanced maintenance
Every few months I run advanced maintenance, including a visual inspection and cleaning. An air compressor is your best friend for dust control. It is better not to vacuum your computers because of static electricity, and canned air dusters will break the bank at this scale. So, a simple air compressor is highly recommended for a modern mining farm. Always remember to turn the power off before you start cleaning.
When the cleaning is done, I also highly recommend reseating all your cables and connectors. Connector pins may oxidize over time, causing a number of tricky issues. Therefore, it pays to unplug and plug back in your power cables, risers, and RAM on a regular basis. While you are at it, you can inspect the power connectors.
When you turn the rig back on, watch the fans on your video cards. If a fan is slow to start, it's time to replace it. You may even have to turn the blades by hand for a while to get it going again.
Now, let's move on to graphics card maintenance.
Graphics card maintenance
Video cards are the most heavily loaded and the most important element of the whole system. Pay especially close attention to the GPU temperature, as it is the parameter that matters most for the performance of the memory chips. If possible, keep the temperature under 70 °C. At that level, the chips do not degrade and the cooling system is not overloaded. If one of your cards runs hotter than the others at a higher fan RPM, its fan most likely has difficulty spinning. The order in which the OS lists your graphics cards usually doesn't match their order on the motherboard; nonetheless, as I mentioned before, a visual inspection will help you identify the problem card.
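The rule of thumb above (stay under 70 °C, and be suspicious of a card that is both hotter and spinning faster than its neighbours) can be sketched in a few lines of Python. The readings, the `find_suspect_cards` function and the two margins are hypothetical. Only the 70 °C target comes from the text, so tune the rest to your own farm.

```python
from statistics import median

TEMP_LIMIT_C = 70  # target ceiling mentioned in the article

def find_suspect_cards(cards, temp_margin_c=5, fan_margin_rpm=300):
    """Return indices of cards that exceed the limit, or that run both
    hotter and at a higher fan speed than the median card in the rig."""
    med_temp = median(c["temp_c"] for c in cards)
    med_fan = median(c["fan_rpm"] for c in cards)
    suspects = []
    for i, card in enumerate(cards):
        over_limit = card["temp_c"] > TEMP_LIMIT_C
        hot_and_fast = (card["temp_c"] - med_temp >= temp_margin_c
                        and card["fan_rpm"] - med_fan >= fan_margin_rpm)
        if over_limit or hot_and_fast:
            suspects.append(i)
    return suspects

# Made-up readings for a four-card rig, in the order the OS reports them.
readings = [
    {"temp_c": 61, "fan_rpm": 1800},
    {"temp_c": 62, "fan_rpm": 1850},
    {"temp_c": 69, "fan_rpm": 2900},  # hotter AND faster than the rest
    {"temp_c": 60, "fan_rpm": 1750},
]
print(find_suspect_cards(readings))  # [2]
```

The index refers to the order reported by the OS, so you still need a visual check to find the physical card.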
Unfortunately, instead of a ball bearing most factory fans use a copper sleeve (bushing), which wears out and develops play. There is no point in lubricating the fan, and a matching new bushing is extremely hard to find, so I simply ordered new fans in bulk from AliExpress. They are sold in pairs for $7.50, all cables included. They are easy to replace: you just undo a few screws, remove the old fan and install the new one. I had to replace both fans at once because the new ones had a different connector, but that depends on the video card. Replacement under warranty is also possible, but we'll get to that later.
There is a thin layer of heat-conducting material between the graphics chip and the heatsink of the cooling system. This material is either thermal paste or thermal pads. If replacing the fan and cleaning didn't help, it may be time to reapply the thermal paste. On mining farms, paste and pads usually last for about 3–5 years. Avoid replacing thermal pads too often: if the new pads are not thick enough, they will not make proper contact and the card will overheat constantly.
Also, please pay attention to GPU memory errors. Memory chips typically wear out over time, and errors start to occur when mining. To make sure you don't get banned by the mining pool, reduce the overclock. For more information about it, check this article.
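Backing off the overclock when memory errors appear can be expressed as a very simple rule. The sketch below is an illustration only: the function name, the 25 MHz step and the idea of counting errors between checks are my assumptions, and actually applying the new offset depends on your OS and overclocking tool.

```python
def next_memory_offset(current_offset_mhz, errors_since_last_check,
                       step_mhz=25, floor_mhz=0):
    """Lower the memory clock offset by one step if any errors were seen,
    never going below floor_mhz; otherwise leave it unchanged."""
    if errors_since_last_check > 0:
        return max(floor_mhz, current_offset_mhz - step_mhz)
    return current_offset_mhz

print(next_memory_offset(800, 3))  # 775: errors seen, step down
print(next_memory_offset(800, 0))  # 800: no errors, keep the setting
```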
As for the motherboard, the only thing you need to do is change the CR2032 CMOS battery once every year or two. A depleted battery will reset the BIOS settings, and you may only notice that after a sudden power outage. Old batteries are also prone to corrosion and leaking, which can damage the board.
Repairing a motherboard costs a fortune and is too difficult, which is why it’s easier to replace it every once in a while.
If you are sure that your hardware problem was caused by a manufacturing defect, you may try having your devices repaired under warranty. The warranty terms depend on your area: in some countries you can ask the seller for a refund, and in others you have to go to an authorized service center. In either case, they don't usually like miners very much.
If your warranty provider discovers that you have a modified VBIOS or four video cards connected to one PSU, they may refuse to go through with the repair even if there is a manufacturing defect. So before bringing your devices in for diagnostics, I suggest you restore the firmware to default and keep the PSU setup to yourself.
The warranty period depends on the manufacturer and the product line. So-called 'mining' versions usually come with a shorter 3-month warranty, so I always suggest buying a standard or gaming graphics card. All the seals have to be intact, of course. Fans are the parts that break down most often, and you can request a replacement under warranty. However, that might not make up for the downtime, so I replace them myself with new ones I order from AliExpress. Replacing a fan doesn't involve breaking any seals and keeps the warranty valid.
Early failure detection is the key to smooth operation. I only use remote monitoring for a couple of things:
- GPU temperature. The web-based GUI usually handles this perfectly well.
- CPU temperature and fan RPM. In some cases, you can check them in the web GUI or from the command line (the lm-sensors tool).
- GPU errors.
If you have a smart home kit, consider adding a room temperature sensor to avoid unnecessary overloading of the cooling system.
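For the CPU temperature and fan check, the `sensors` command from lm-sensors can print its readings as JSON (`sensors -j`) in reasonably recent versions. Below is a minimal Python sketch that picks the temperature and fan values out of that output. The chip and label names in the sample, as well as the 80 °C and 300 RPM thresholds, are assumptions and will differ from board to board.

```python
import json
import subprocess

def read_sensors():
    """Run `sensors -j` and return the parsed JSON.
    Requires an lm-sensors version with JSON output support."""
    result = subprocess.run(["sensors", "-j"],
                            capture_output=True, text=True, check=True)
    return json.loads(result.stdout)

def collect_readings(data):
    """Collect every temp*_input and fan*_input value, keyed by chip/label."""
    temps, fans = {}, {}
    for chip, features in data.items():
        for label, values in features.items():
            if not isinstance(values, dict):
                continue  # e.g. the "Adapter" entry is a plain string
            for key, value in values.items():
                if key.startswith("temp") and key.endswith("_input"):
                    temps[f"{chip}/{label}"] = value
                elif key.startswith("fan") and key.endswith("_input"):
                    fans[f"{chip}/{label}"] = value
    return temps, fans

def find_alerts(temps, fans, max_temp_c=80.0, min_fan_rpm=300.0):
    """Thresholds are illustrative assumptions, not recommendations."""
    alerts = [f"{name}: {t:.0f} C" for name, t in temps.items() if t > max_temp_c]
    alerts += [f"{name}: {rpm:.0f} RPM" for name, rpm in fans.items() if rpm < min_fan_rpm]
    return alerts

# Made-up sample in the same shape as `sensors -j` output.
sample = {
    "coretemp-isa-0000": {
        "Adapter": "ISA adapter",
        "Package id 0": {"temp1_input": 85.0, "temp1_max": 100.0},
    },
    "nct6775-isa-0290": {
        "Adapter": "ISA adapter",
        "fan1": {"fan1_input": 0.0},
    },
}
temps, fans = collect_readings(sample)
alerts = find_alerts(temps, fans)
print(alerts)
```

On a real rig you would pass `read_sensors()` instead of `sample` and run the script on a schedule.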
Unfortunately, not all problems can be anticipated if you rely on remote monitoring. For the unexpected ones, I recommend keeping a few spare risers, PSUs, and a backup drive on hand so you can react quickly. The backup drive should already have a configured OS and the miner installed, so that a broken HDD can be swapped out and operation restored as soon as possible. On Hive OS, you can add your FARM_HASH to the rig.conf file, and the new rig will show up in the web GUI as soon as it connects. From there it is easy to set up using the web GUI.
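Preparing the backup drive means writing your FARM_HASH into rig.conf ahead of time. The sketch below assumes rig.conf is a plain text file of KEY="value" lines. The exact format and quoting, the `OTHER_SETTING` line and the hash value are all placeholders, so check them against your own Hive OS image before relying on this.

```python
def set_conf_value(conf_text, key, value):
    """Replace an existing KEY=... line or append a new one.
    Assumes a simple one-setting-per-line KEY="value" format."""
    lines = conf_text.splitlines()
    new_line = f'{key}="{value}"'
    for i, line in enumerate(lines):
        if line.strip().startswith(f"{key}="):
            lines[i] = new_line
            break
    else:
        lines.append(new_line)  # key was not present yet
    return "\n".join(lines) + "\n"

# Placeholder content; a real rig.conf will contain other settings.
sample_conf = 'OTHER_SETTING="keep me"\nFARM_HASH=""\n'
updated = set_conf_value(sample_conf, "FARM_HASH", "your_farm_hash_here")
print(updated)
```

Work on a copy of the file first, so the original configuration can be restored if something goes wrong.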
This is the end of the article on how to maintain your rigs. Next time, I will cover quitting mining and selling your mining farm. Subscribe to the blog to make sure you don't miss it!