This is a backup because ParallelCluster with Slurm is really hard to get working, while Terraform has been more reliable. We are going to create a Flux cluster in EC2 and then pull down Singularity containers to it.
Note that we previously built the images with Packer. That no longer seems to work (maybe this issue),
so instead we are going to run the build commands manually and save the AMI. The previous instruction was to export AWS credentials, cd into build-images,
and run make. For the manual build, you'll need to create an m5.large instance with Ubuntu 22.04 in the web UI and manually run the contents of each
of the scripts in build-images. For example, for the top AMI below I ran each of:
- install-deps.sh
- install-flux.sh
- install-singularity.sh
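If you would rather not paste each script by hand, the manual build can be sketched as a small loop. This is only a sketch: it assumes the three scripts sit directly in build-images and can be run with bash as the current user (they may need sudo), and `run_build_scripts` is a made-up helper name, not something in this repository.

```shell
# Run the build-images scripts in the order listed above.
# Assumption: the scripts live directly in the given directory
# (default: build-images) and are plain bash scripts.
run_build_scripts() {
    dir="${1:-build-images}"
    for script in install-deps.sh install-flux.sh install-singularity.sh; do
        echo "==> running ${script}"
        # Stop at the first failure so a broken step is not baked into the AMI.
        bash "${dir}/${script}" || { echo "FAILED: ${script}" >&2; return 1; }
    done
}
```

On the instance you would call `run_build_scripts build-images` and then save the AMI from the web UI once it finishes without a FAILED line.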
Once you have images, we deploy! Make sure you update the AMI to be the one you built.
$ cd tf

And then init and build. Note that running make will run init, fmt, validate and build in one command.
Make sure to change the number of instances to the size that you want - the min and max should be identical:
$ make

You can then shell into any node and check the status of Flux. I usually grab the instance name via "Connect" in the portal, but you could likely use the AWS client for this too.
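For the AWS client route, a minimal sketch could look like the following. It assumes the AWS CLI is installed and configured with the same credentials, and that the cluster is in us-east-1 (which is what the compute-1.amazonaws.com hostnames suggest). `list_public_dns` is a hypothetical helper name. Note that it lists every running instance in the region, not just this cluster; narrowing it down would need a tag filter, and the tag names depend on the Terraform config.

```shell
# Print the public DNS name of each running instance, one per line.
# Assumptions: AWS CLI installed and configured; region defaults to us-east-1.
list_public_dns() {
    region="${1:-us-east-1}"
    aws ec2 describe-instances \
        --region "$region" \
        --filters "Name=instance-state-name,Values=running" \
        --query "Reservations[].Instances[].PublicDnsName" \
        --output text | tr '\t' '\n'
}
```

You could then pick one of the printed names and use it as the host in the ssh command.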
$ ssh -o 'IdentitiesOnly yes' -i "mykey.pem" ubuntu@ec2-xx-xxx-xx-xxx.compute-1.amazonaws.com

Check the cluster status, the overlay status, and try running a job:
$ flux resource list
STATE NNODES NCORES NODELIST
free 2 2 i-012fe4a110e14da1b,i-0354d878a3fd6b017
allocated 0 0
down 0 0

$ flux run -N 2 hostname
i-012fe4a110e14da1b.ec2.internal
i-0354d878a3fd6b017.ec2.internal

You can look at the startup script logs like this if you need to debug:
$ cat /var/log/cloud-init-output.log

At this point (with a running, working cluster) move into the experiment directory.
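Before starting experiments it can help to confirm that every node actually joined. The sketch below parses the default `flux resource list` table shown earlier and compares the free node count against the size you set in Terraform. `free_nodes` and `check_cluster_size` are hypothetical helper names, and the parsing assumes the table layout matches the output above (STATE in the first column, NNODES in the second).

```shell
# Print the number of free nodes from the default `flux resource list` table.
# Assumption: STATE is column 1 and NNODES is column 2, as in the output above.
free_nodes() {
    flux resource list | awk '$1 == "free" { print $2 }'
}

# Succeed only if the free node count equals the expected cluster size.
check_cluster_size() {
    expected="$1"
    actual="$(free_nodes)"
    if [ "$actual" = "$expected" ]; then
        echo "ok: ${actual} nodes free"
    else
        echo "expected ${expected} free nodes, found ${actual:-none}" >&2
        return 1
    fi
}
```

For the two-node cluster above, `check_cluster_size 2` should succeed on any node; if it fails, the cloud-init log is the first place to look.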