ELSA3 Upgrade FAQ
ELSA3 Upgrade Issues & Resolutions
Job failed with "The following module(s) are unknown" error
If your job fails and you see an error like the one below in your job.####.out file, it means that the application, or the version of the application, you are trying to load is not available on the ELSA3 cluster. For an application like Matlab, where there are multiple versions, simply change your SLURM submit script to select a different version, or remove the version entirely and let the system pick the "default" version. If the application does not exist on the ELSA3 cluster and you need it for your workflow, please contact Shawn Sivy (ssivy@tcnj.edu) to have it moved over from the old ELSA cluster.
Lmod has detected the following error: The following module(s) are unknown: "matlab/R2021a"
Look for a line in your submit script like this (add and load are synonymous):
module add matlab/R2021a
and replace it with one of these:
module add matlab/R2025a
module add matlab
To see what modules are available, use the module avail command.
Example:
module avail
module avail matlab
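Putting the pieces together, a minimal submit script for ELSA3 might look like the sketch below. The job name, output file, and resource requests are placeholder examples, not required values; adjust them for your own workflow, and run module avail matlab first to confirm which versions exist.

```shell
#!/bin/bash
#SBATCH --job-name=matlab-test     # example name; pick your own
#SBATCH --output=job.%j.out        # %j expands to the job number
#SBATCH --ntasks=1
#SBATCH --time=00:10:00            # example 10-minute limit

# Load the default MATLAB module; pin a specific version
# (e.g. "module add matlab/R2025a") only if your workflow needs it.
module add matlab

# Run a short non-interactive command as a smoke test.
matlab -batch "disp('hello from ELSA3')"
```

Submit it with sbatch and check the job output file; if the module line fails, the Lmod "unknown module" error will appear there.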
ELSA3 Upgrade/Testing Frequently Asked Questions
So why is there a new cluster?
ELSA3 features an upgraded operating system, Rocky Linux 9, and the OpenHPC 3.x cluster tools. The current cluster is using Rocky Linux 8 and the OpenHPC 2.x cluster tools.
Does this cluster upgrade affect the virtual machines (VMs) on the "MoOSE" virtual infrastructure?
No. The virtualization system is a separate cluster with its own servers and software. However, I will look to upgrade the underlying virtualization software this summer, but that will (mostly) be transparent to the end-users of that system.
Why do I need to test anything?
Like moving from one version of Windows to another or one version of MacOS to another, there are some changes that may cause issues with the existing software applications. While I have done some basic testing to make sure there aren't obvious issues with the applications, I am inviting you to also try out some of your applications and double-check my testing.
I don't have time to do the testing. Can I have my students access the ELSA3 test cluster to do the testing instead?
Absolutely. Feel free to share this email with them. Have them contact me if they run into any issues.
Both my students and I are too busy to do any testing. What will happen if we don't do any testing?
That's not a problem. I just wanted to provide you with an opportunity to test drive the new cluster before it goes into production after the spring semester. If you don't have any time to test now, we can work together later on to resolve any issues you encounter.
Are my files also available on this new cluster?
Yes. Your files in your home directory, /projects, /courses and global /scratch are available and are currently shared between the two clusters.
Is there a new login for the ELSA3 test cluster?
Nope. Just use your regular TCNJ username and password like you do on the current ELSA cluster.
How do I access the test cluster?
For the test, you can use these names to access the ELSA3 test cluster:
Via SSH: elsa-test3.hpc.tcnj.edu, e.g. ssh ssivy@elsa-test3.hpc.tcnj.edu
Via Web: https://ondemand-test3.hpc.tcnj.edu/
When we move to production, these will be changed to the existing elsa.hpc.tcnj.edu and ondemand.hpc.tcnj.edu.
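If you will be connecting repeatedly during the test window, an entry in your SSH configuration file saves retyping the full hostname. This is just a convenience sketch: the alias name is arbitrary, and the username shown must be replaced with your own TCNJ username.

```
# ~/.ssh/config
Host elsa-test3                       # arbitrary alias
    HostName elsa-test3.hpc.tcnj.edu
    User ssivy                        # replace with your TCNJ username
```

With this in place, ssh elsa-test3 connects you to the test cluster.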
How long is the testing period of the first "cluster crunch"?
You can login to the new cluster now and continue to use it through Friday, March 27th. At that point, I'll take the new cluster down to make any necessary changes, updates or tweaks. Another test period will be available in April.
What have you tested?
I surveyed the main users of the cluster to determine which applications needed to be "ported" over to ELSA3. Some applications and some older versions of applications have been left behind. You can see all the applications that have been ported and tested by me by reviewing this Google Sheet.
In addition, all applications available under the Interactive Applications in Open OnDemand on the new cluster have been tested, and, at the very least, should start up.
Is there an entire new cluster running for this test?
Yes, sort of. The test cluster should be full-featured except for the items listed below. These applications are disabled in this test because they could cause issues with their equivalent applications on the current cluster. The disabled apps are:
- WebMO website
- ELSA VDI Desktop under Interactive Applications in Open OnDemand
- ARM nodes (these will be reimaged with Rocky Linux 9 over the summer)
These will be available in a future test session or with the launch of the ELSA3 cluster in production.
There are limited resources available. Here are the nodes available for testing:
- 2 CPU-based nodes (1 AMD-based and 1 Intel-based)
- 1 GPU node with 4 x RTX 2080 GPUs
- 1 GPU node with 4 x RTX A5000 GPUs
- 1 GPU node with 4 x L40S GPUs
- 1 Visualization node with an L40 GPU (new!) for graphics rendering
- 1 login node
- 1 Open OnDemand node
When will ELSA3 launch into production?
I hope to switch over to the new ELSA3 version the week of May 25th (Memorial Day week). Exact dates and times will be announced in the future. There is currently a planned campus power outage on May 25th, so I am thinking of starting up the new cluster after the power returns. The switchover may take up to 2 days depending on how smoothly the process goes (I'm hoping for just one day).
What will happen to the current cluster?
All the nodes will be moved over to the new ELSA3 cluster when it goes into production. The old cluster software will remain for a while, but will no longer be accessible to end-users. If something needs to be moved over later, I will still be able to access the old cluster configurations and software. Remember, all files in your /home, /projects, /courses and /scratch will be available to you on the new cluster.
Any gotchas I should worry about when testing?
Mostly no. I've disabled the ELSA VDI Desktop because switching back and forth between the new cluster and the old cluster can cause issues. My only other suggestion is to be careful using Open OnDemand on both clusters at the same time. Since your /home directory is shared, both will write information to it. Don't accidentally delete or cancel a job from the wrong cluster's Open OnDemand; it won't work anyway, and it will confuse the other cluster's Open OnDemand. I would suggest using only one at a time and running it in a private/incognito browser window.