ELSA3 Upgrade FAQ

From HPC Docs
Latest revision as of 13:04, 18 May 2026

ELSA3 Upgrade Issues & Resolutions

SSH host key/identification error when connecting to elsa.hpc.tcnj.edu

If you can't connect to elsa.hpc.tcnj.edu using your SSH client because of a host key/identification error, see these instructions to resolve the issue.

This does not affect using the web interface and its built-in terminal in Open OnDemand.

Job or module command failed with "The following module(s) are unknown" error

If your job or module add command fails with an error like Lmod has detected the following error: The following module(s) are unknown: in your job.####.out file, the application, or the version of the application, that you are trying to use is not on the ELSA3 cluster. For an application like MATLAB that has multiple versions, change your SLURM submit script to select a different version, or remove the version entirely and let the system pick the best "default" version. If the application does not exist on the ELSA3 cluster and you need it for your workflow, please contact Shawn Sivy (ssivy@tcnj.edu) to have it moved over from the old ELSA cluster.

To fix the situation where the version is no longer available, look for a line in your submit script like the one below. Note: module add and module load are synonymous.

module add matlab/R2021a

and replace it with one of these

module add matlab/R2025a or module add matlab

To see what modules are available, use the module avail command.

Example: module avail or module avail matlab
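
The replacement described above can also be scripted. The following is only a sketch: the file name submit.sh and its contents are made up for illustration (a sample file is created in a temporary directory so the snippet runs on its own), and it assumes your script loads MATLAB with a single module add or module load line. It uses grep to show the module lines, then sed to drop the retired version so the system default is used, keeping a backup copy with a .bak suffix.

```shell
#!/bin/bash
# Sketch: switch a submit script from a retired MATLAB version to the default.
# "submit.sh" is a hypothetical file name; a sample is created here so the
# snippet is self-contained. Point "script" at your real submit script instead.
workdir=$(mktemp -d)
script="$workdir/submit.sh"
cat > "$script" <<'EOF'
#!/bin/bash
#SBATCH --job-name=matlab-demo
module add matlab/R2021a
matlab -batch "disp(version)"
EOF

# Show every module add/load line with its line number before changing anything.
grep -n -E '^[[:space:]]*module[[:space:]]+(add|load)[[:space:]]' "$script"

# Remove "/R2021a" from the matlab module line; the original is kept as submit.sh.bak.
sed -E -i.bak 's#^([[:space:]]*module[[:space:]]+(add|load)[[:space:]]+matlab)/R2021a[[:space:]]*$#\1#' "$script"

# Show the module lines again to confirm the change.
grep -n -E '^[[:space:]]*module[[:space:]]+(add|load)[[:space:]]' "$script"
```

To pin a specific version instead of the default, replace \1 in the sed expression with \1/R2025a, after confirming with module avail matlab that the version exists.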

The module command gives me an "attempt to compare string with number" error

If you run a module command such as module avail and get an error like the one below, run the lmod-clear-cache command. After that, the module command should work again as expected. This error occurs because the cache format used by module on ELSA3 differs from the format used by the prior version.

[ssivy@login001 ~]$ module avail
/usr/bin/lua: /opt/ohpc/admin/lmod/lmod/libexec/Cache.lua:663: attempt to compare string with number
stack traceback:
  /opt/ohpc/admin/lmod/lmod/libexec/Cache.lua:663: in function 'Cache.build'
  /opt/ohpc/admin/lmod/lmod/libexec/ModuleA.lua:750: in function 'ModuleA.singleton'
  /opt/ohpc/admin/lmod/lmod/libexec/Hub.lua:1251: in function 'Hub.avail'
  /opt/ohpc/admin/lmod/lmod/libexec/cmdfuncs.lua:145: in function 'Avail'
  /opt/ohpc/admin/lmod/lmod/libexec/lmod:527: in function 'main'
  /opt/ohpc/admin/lmod/lmod/libexec/lmod:603: in main chunk
  [C]: in ?
[ssivy@login001 ~]$

ELSA3 Upgrade Frequently Asked Questions

So why is there a new cluster?

ELSA3 features an upgraded operating system, Rocky Linux 9, and the OpenHPC 3.x cluster tools. The current cluster is using Rocky Linux 8 and the OpenHPC 2.x cluster tools.

Does this cluster upgrade affect the virtual machines (VMs) on the "MoOSE" virtual infrastructure?

No. The virtualization system is a separate cluster with its own servers and software. I do plan to upgrade the underlying virtualization software this summer, but that will (mostly) be transparent to the end-users of that system.

Are my files also available on this new cluster?

Yes. The files in your home directory, /projects, and /courses are available and are currently shared between the two clusters. Because of how the global /scratch system works, /scratch won't be available on ELSA3 until after the upgrade.

Is there a new login for the ELSA3 test cluster?

No. Just use your regular TCNJ username and password like you do on the current ELSA cluster.

When will ELSA3 launch into production?

ELSA will be upgraded on May 21st and 22nd (Commencement days). Because of the campus power outage on Sunday, May 24th and Memorial Day on Monday, May 25th, ELSA3 will be down until Tuesday, May 26th. I hope to have the system up and operational that afternoon.

What will happen to the current cluster?

All the nodes will be moved over to the new ELSA3 cluster when it goes into production. The old cluster software will remain for a while, but it will no longer be accessible to end-users. If something needs to be moved over later, I will still be able to access the old cluster configurations and software. Remember, all files in your /home, /projects, /courses, and /scratch directories will be available to you on the new cluster.