ELSA3 Upgrade FAQ
Latest revision as of 12:27, 27 May 2026
ELSA3 Upgrade Issues & Resolutions
SSH host key/identification error when connecting to elsa.hpc.tcnj.edu
If you can't connect to elsa.hpc.tcnj.edu using your SSH client because of a host key/identification error, see the instructions on the HPC_Cluster_ELSA3_Host_Key_Change wiki page to resolve the issue.
This does not affect using the web interface and its built-in terminal in Open OnDemand.
Job or module command failed with "The following module(s) are unknown" error
If your job or module add command fails and you see an error like Lmod has detected the following error: The following module(s) are unknown: in your job.####.out file, the application, or the version of the application, you are trying to use is not on the ELSA3 cluster. For an application like Matlab, where there are multiple versions, simply change your SLURM submit script to select a different version, or remove the version entirely and let the system pick the best "default" version. If the application does not exist on the ELSA3 cluster and you need it for your workflow, please contact Shawn Sivy (ssivy@tcnj.edu) to have it moved over from the old ELSA cluster.
To fix the situation where the version is no longer available, look for a line in your submit script like the one below. Note: module add and module load are synonymous.
module add matlab/R2021a
and replace it with one of these
module add matlab/R2025a or module add matlab
To see what modules are available, use the module avail command.
Example: module avail or module avail matlab
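The edit described above can also be scripted. The following is a minimal sketch, not an official ELSA tool: the function name unpin_module is hypothetical, and it assumes your submit script puts each module add or module load command on its own line. It removes the pinned version so the cluster's default version is loaded, and keeps a backup copy of the script with a .bak suffix.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ELSA or Lmod): rewrite lines such as
#   module add matlab/R2021a   ->   module add matlab
# so the system's default version of the application is used.
unpin_module() {
    local app="$1"      # application name, e.g. matlab
    local script="$2"   # path to the SLURM submit script
    # -i.bak edits the file in place and saves the original as <script>.bak.
    # The pattern matches "module add" or "module load" followed by app/version.
    sed -E -i.bak \
        "s|^([[:space:]]*module[[:space:]]+(add\|load)[[:space:]]+${app})/[^[:space:]]+|\1|" \
        "$script"
}
```

For example, unpin_module matlab submit.sh would change module add matlab/R2021a to module add matlab in a script named submit.sh (a placeholder name). Run module avail matlab first if you would rather pin a specific version that exists on ELSA3.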
The module command gives me an "attempt to compare string with number" error
If you use the module command (for example, module avail) and get an error like the one below, run the lmod-clear-cache command. After that, the module command should work again as expected. This error occurs because the cache format used by the new version of module differs from the one used by the prior version.
[ssivy@login001 ~]$ module avail
/usr/bin/lua: /opt/ohpc/admin/lmod/lmod/libexec/Cache.lua:663: attempt to compare string with number
stack traceback:
/opt/ohpc/admin/lmod/lmod/libexec/Cache.lua:663: in function 'Cache.build'
/opt/ohpc/admin/lmod/lmod/libexec/ModuleA.lua:750: in function 'ModuleA.singleton'
/opt/ohpc/admin/lmod/lmod/libexec/Hub.lua:1251: in function 'Hub.avail'
/opt/ohpc/admin/lmod/lmod/libexec/cmdfuncs.lua:145: in function 'Avail'
/opt/ohpc/admin/lmod/lmod/libexec/lmod:527: in function 'main'
/opt/ohpc/admin/lmod/lmod/libexec/lmod:603: in main chunk
[C]: in ?
[ssivy@login001 ~]$
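If you want to check for this error before clearing the cache, the sketch below shows one way to do it. The function names has_lmod_cache_error and fix_module_cache are hypothetical; lmod-clear-cache is the command named above and is assumed to be on your PATH on the ELSA3 login node.

```shell
#!/usr/bin/env bash
# Succeeds if the text on standard input contains the Lmod cache error.
has_lmod_cache_error() {
    grep -q 'attempt to compare string with number'
}

# Hypothetical wrapper: clear the Lmod cache only when the error shows up.
# Assumes the "module" command and "lmod-clear-cache" exist on the cluster.
fix_module_cache() {
    if module avail 2>&1 | has_lmod_cache_error; then
        lmod-clear-cache
    fi
}
```

Running lmod-clear-cache directly, as described above, is equally fine; the wrapper only avoids clearing the cache when nothing is wrong.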
Large message when using the Open OnDemand Terminal
When you launch the >_ ELSA OpenHPC Cluster Shell Access to get to a command-line terminal or use the >_Open in Terminal option, you see a large error message similar to the one below.
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ECDSA key sent by the remote host is
SHA256:2d6jgn56y7AwKQdOHeyIvDG9wpgybxcQvDcgqYr5mWU.
Please contact your system administrator.
Add correct host key in /home/hpc/hpcuser/.ssh/known_hosts to get rid of this message.
Offending ECDSA key in /home/hpc/hpcuser/.ssh/known_hosts:200
Password authentication is disabled to avoid man-in-the-middle attacks.
Keyboard-interactive authentication is disabled to avoid man-in-the-middle attacks.
UpdateHostkeys is disabled because the host key is not trusted.
You should still get to a terminal prompt. To remove the message when launching future terminals, run the command ssh-keygen -R elsa.hpc.tcnj.edu
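As a slightly more cautious variant of the command above, the sketch below first checks whether a key for the host is actually saved. The function name forget_host_key and its optional second argument are hypothetical; ssh-keygen -F (find a host) and ssh-keygen -R (remove a host) are standard OpenSSH options. It assumes the warning is expected because of the ELSA3 upgrade.

```shell
#!/usr/bin/env bash
# Hypothetical helper: remove a saved host key only if one is recorded.
forget_host_key() {
    local host="$1"
    local known_hosts="${2:-$HOME/.ssh/known_hosts}"
    # ssh-keygen -F searches the file for the host and exits non-zero if absent.
    if ssh-keygen -F "$host" -f "$known_hosts" >/dev/null 2>&1; then
        # ssh-keygen -R deletes the entry; the previous file is kept as known_hosts.old.
        ssh-keygen -R "$host" -f "$known_hosts"
    else
        echo "No saved host key for $host in $known_hosts"
    fi
}
```

For example, forget_host_key elsa.hpc.tcnj.edu has the same effect as the single command above when a stale key is present. The next connection will ask you to accept the new host key.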
ELSA3 Upgrade Frequently Asked Questions
So why is there a new cluster?
ELSA3 features an upgraded operating system, Rocky Linux 9, and the OpenHPC 3.x cluster tools. The current cluster is using Rocky Linux 8 and the OpenHPC 2.x cluster tools.
Does this cluster upgrade affect the virtual machines (VMs) on the "MoOSE" virtual infrastructure?
No. The virtualization system is a separate cluster with its own servers and software. I do plan to upgrade the underlying virtualization software this summer, but that will (mostly) be transparent to the end-users of that system.
Are my files also available on this new cluster?
Yes. Your files in your home directory, /projects, and /courses are available and are currently shared between the two clusters. Because of how the global /scratch system works, it won't be available until after the upgrade.
Is there a new login for the ELSA3 test cluster?
No. Just use your regular TCNJ username and password like you do on the current ELSA cluster.
When will ELSA3 launch into production?
ELSA will be upgraded on May 21st & 22nd (Commencement days). Because of the campus power outage on Sunday, May 24th and Memorial Day on Monday, May 25th, ELSA3 will be down until Tuesday, May 26th. I hope to have the system up and operational that afternoon.
What will happen to the current cluster?
All the nodes will be moved over to the new ELSA3 cluster when it goes into production. The old cluster software will remain for a while, but will no longer be accessible to end-users. If something needs to be moved over later, I will still be able to access the old cluster configurations and software. Remember, all files in your /home, /projects, /courses and /scratch will be available to you on the new cluster.