My knowledge is very specific to just these cards.

The only 'secret', which isn't really a secret, is that I realized that since this is hardware, we know the maximum settings (hash/watt) that should be possible. So I set the cards to that best setting and then tune down from there. This is the opposite of how most other miners think about it.

Most people think: let's start low and then tune up from there to make the cards 'faster'. But the cards crash when they can't handle the settings, so tuning down turns out to be the better approach: once a card stops crashing, it is stable and doesn't need any more tuning. There are 3 different sets of 'knobs' to tweak, so I had to build an algo that adjusts the knobs in the right order to tune things down. It just had the concept of 'current -> next settings'.
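To make the 'current -> next settings' idea concrete, here is a minimal sketch in Go. It is not the actual algo: the three knob names, the step sizes, the floors and the order in which knobs are lowered are all made-up assumptions for illustration.

```go
package main

import "fmt"

// Settings holds three hypothetical knobs. The names and units are
// assumptions for this sketch, not the real knobs of any card.
type Settings struct {
	MemClock  int // MHz
	CoreClock int // MHz
	PowerCap  int // watts
}

// knob says how one setting is lowered: which field, by how much,
// and the lowest value worth trying.
type knob struct {
	field func(*Settings) *int
	step  int
	floor int
}

// knobs lists the knobs in the (assumed) order they get lowered.
var knobs = []knob{
	{func(s *Settings) *int { return &s.MemClock }, 50, 1800},
	{func(s *Settings) *int { return &s.CoreClock }, 25, 1000},
	{func(s *Settings) *int { return &s.PowerCap }, 5, 90},
}

// Next returns the settings to try after a crash at cur. It lowers the
// first knob that still has room above its floor. The bool is false when
// every knob is already at its floor and there is nothing left to try.
func Next(cur Settings) (Settings, bool) {
	next := cur
	for _, k := range knobs {
		v := k.field(&next)
		if *v-k.step >= k.floor {
			*v -= k.step
			return next, true
		}
	}
	return cur, false
}

func main() {
	// Start at the assumed best-case settings and pretend every step crashes.
	s := Settings{MemClock: 1900, CoreClock: 1050, PowerCap: 100}
	for {
		fmt.Printf("%+v\n", s)
		n, ok := Next(s)
		if !ok {
			break
		}
		s = n
	}
}
```

Because Next is a pure function of the current settings, the caller only has to remember the last settings that were applied; a card that stops crashing simply never gets Next called on it again.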

Temperature and power fluctuations can make the cards crash too, and since neither of those could be controlled, machines would reboot randomly all the time. By always tuning down, you're always heading towards more stability instead of instability.

The software I built was a golang daemon that ran on each machine, watched for these crashes and modified the tuning of each card individually. The daemon is pretty cool, as it is effectively a task runner. I had different tasks to configure and monitor the machines as well. The machines are all independent, idempotent and self-healing workers. Reliably distributing the software to 20k different workers is a fun challenge. There are a ton of unit tests, which helped a lot.

If I have the energy, I may rip the tasks out of the daemon, turn them into a library and open source that. It is kind of a fun project that could be useful for others trying to manage large-scale fleets of individual workers. Tasks could easily be 'apt install' or monitoring utilities. I even bundled node_exporter into the binary, so that we could monitor the machines with prometheus.