Tuesday, December 4, 2012

Slurm and health check

With Torque we used to run a cron job every 30 minutes that checked whether a node was working properly and, if not, disabled it. With Slurm I finally took the time to look for a way to have the scheduler do this automatically, and it turned out to be extremely easy. You just need to add two config lines to slurm.conf:


HealthCheckProgram=/home/test/testNode.sh
HealthCheckInterval=300
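
These parameters live in slurm.conf (the same file on the controller and the compute nodes). A minimal sketch of picking the change up without restarting the daemons, plus a quick way to verify it took effect:

scontrol reconfigure
scontrol show config | grep HealthCheck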

Slurm now runs the health check program every 5 minutes (HealthCheckInterval is given in seconds), and if the script gets stuck it is killed within 60 seconds. The script performs the checks, and if one fails it has to either fix the problem or disable the node. That turns out to be fairly simple. For example, we check that the /hdfs directory (our storage mount) is populated, and if it is not, we ask Slurm to drain the node:


# Test HDFS: an empty /hdfs means the storage mount is gone, so drain the node
NODE=$(hostname -s)            # assumption: the Slurm node name matches the short hostname
NHDFS=$(ls /hdfs | wc -l)
if [ "$NHDFS" -eq 0 ]; then
  scontrol update NodeName=$NODE State=drain Reason="HDFS mount lost"
fi
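
Other checks follow exactly the same pattern: test something, and on failure drain the node with a descriptive reason. As a hypothetical illustration (the /tmp check and the 90% threshold are just assumptions for this sketch, not part of our setup), a local disk space check could look like this:

# Test local scratch: drain the node if /tmp is nearly full
TMPUSE=$(df -P /tmp | awk 'NR==2 {gsub("%","",$5); print $5}')
if [ "$TMPUSE" -ge 90 ]; then
  scontrol update NodeName=$NODE State=drain Reason="/tmp over 90% full"
fi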

You can add pretty much any check you want. The result is that sinfo nicely shows the drained nodes with reasons:


[root@slurm-1 ~]# sinfo -lN
Tue Dec  4 16:39:01 2012
NODELIST         NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT FEATURES REASON              
wn-v-[2036,..,5384]  8     main*   allocated   32   2:16:1  65536        0      1   (null) none                
wn-v-[2072,..,7039] 19     main*     drained  24+  2:12+:1 49152+        0      1   (null) HDFS mount lost     
wn-v-[2324,..,6428]  6     main*        idle   32   2:16:1  65536        0      1   (null) none 



As you can see, nodes that have lost the mount are automatically drained and taken out of service, so you can deal with them at some convenient point.
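
Once the underlying problem is fixed (for example the HDFS mount is restored), the node can be put back into service by hand. A minimal sketch, using one of the node names from the listing above:

scontrol update NodeName=wn-v-2072 State=resume

Until then, sinfo -R gives a compact list of only the drained and down nodes together with their reasons.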

