UMIACS Servers

From David's Wiki
 

Notes on using UMIACS servers


Modules

Use modules to load programs you need to run.

Notes
  • You can load modules in your .bashrc file
# List loaded modules
module list

# Load a module
module load [my_module]

# List all available modules
module avail

Some useful modules in my .bashrc file

module load tmux
module load cuda/10.0.130
module load cudnn/v7.5.0
module load Python3/3.7.6
module load git

Python

Do not install Anaconda in your home directory. You will run out of space.
Load the Python 3 module by adding the following to your .bashrc file:

module load Python3/3.7.6
export PATH="${PATH}:$(python3 -c 'import site; print(site.USER_BASE)')/bin"

Then run the following to get pip installed

curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python get-pip.py --user
Notes
  • You will need to install packages with pip install --user
  • You may need to add your local site-packages to your PYTHONPATH environment variable
    • Add this to .bashrc:
    • export PYTHONPATH="${PYTHONPATH}:/nfshomes/$(whoami)/.local/lib/python3.7/site-packages/"
  • You can also install using pip install --target=/my-libs-folder/

Conda

If you must install conda, install it somewhere with a lot of space like scratch.
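
For example, a minimal sketch that installs Miniconda under a scratch directory (the path is just an example; use your own scratch space):

curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
# -b runs the installer non-interactively, -p sets the install prefix
bash Miniconda3-latest-Linux-x86_64.sh -b -p /scratch1/davidli/miniconda3
# Activate it for the current shell
source /scratch1/davidli/miniconda3/bin/activate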

Install PyTorch

pip install --user torch===1.3.1 torchvision===0.4.2 -f https://download.pytorch.org/whl/torch_stable.html
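
To sanity-check the install (run this inside a GPU job; on a CPU-only node the CUDA check will report False):

python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"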

Installing Packages to a Directory

pip install geographiclib -t /scratch1/davidli/python/
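
To use packages installed to a custom directory, add it to PYTHONPATH (reusing the example path above):

export PYTHONPATH="${PYTHONPATH}:/scratch1/davidli/python/"
# Should now import without errors
python3 -c "import geographiclib"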

MBRC Cluster

See UMIACS MBRC

SLURM Job Management

See https://docs.rc.fas.harvard.edu/kb/convenient-slurm-commands/

1 GPU
srun --pty --gres=gpu:1 --mem=16G --qos=high --time=47:59:00 -w mbrc00 bash
2 GPUs on mbrc00
srun --pty --gres=gpu:2 --mem=16G --qos=default --time=23:59:00 -w mbrc00 bash
CPU-only on scavenger QOS
srun --pty --account=scavenger --partition=scavenger \
     --time=3:59:00 \
     --mem=1G -c1 -w mbrc00 bash
Notes
  • You can add -w mbrc01 to pick mbrc01
  • -c 4 for 4 cores
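
Non-interactive jobs can also be submitted with sbatch instead of srun. A minimal sketch reusing the flags above (job.sh and train.py are placeholders for your own script):

#!/bin/bash
#SBATCH --qos=default
#SBATCH --gres=gpu:1
#SBATCH --mem=16G
#SBATCH --time=23:59:00
#SBATCH -c 4

module load cuda/10.0.130 Python3/3.7.6
python3 train.py

Submit it with sbatch job.sh.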

See Jobs

See my own jobs
squeue -u <user> -o "%8i %10P %8j %10u %10L %5b"
Formatting
  • %L is remaining time
  • %b is the number of GPUs
See all jobs
squeue
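
To cancel jobs:

# Cancel a specific job
scancel <jobid>
# Cancel all of your jobs
scancel -u <user>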

SFTP

Note: If you know of an easier way, please tell me.

On your PC:
Start an sshd for forwarding. You can do this in a docker container for privacy purposes.

On the cluster:
Generate an sshd host key:

ssh-keygen -t ed25519 -a 100 -f /nfshomes/dli7319/ssh/ssh_host_ed25519_key

Create the following sshd_config file

#	$OpenBSD: sshd_config,v 1.103 2018/04/09 20:41:22 tj Exp $
Port 5981
HostKey /nfshomes/dli7319/ssh/ssh_host_ed25519_key
AuthorizedKeysFile	.ssh/authorized_keys
Subsystem	sftp	/usr/libexec/openssh/sftp-server

Start the sshd daemon and proxy the port to your local sshd. You can make a script like this:

#!/bin/bash

LOCAL_PORT=5981        # Port the cluster sshd listens on (matches sshd_config)
REMOTE_PORT=22350      # Port opened on your PC for the reverse tunnel
REMOTE_SSH_PORT=22450  # Port of the sshd running on your PC (e.g. in docker)
REMOTE_ADDR=$(echo "$SSH_CONNECTION" | awk '{print $1}')  # IP you connected from

# Start sshd on this node and open a reverse tunnel so that $REMOTE_PORT on
# your PC forwards to the sshd listening on $LOCAL_PORT here.
/usr/sbin/sshd -D -f sshd_config &
ssh -R $REMOTE_PORT:localhost:$LOCAL_PORT root@$REMOTE_ADDR -p $REMOTE_SSH_PORT

On your PC:
Proxy the sshd port from the docker container to your localhost.
Connect to the sshd on the cluster.
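
A sketch of these last two steps, assuming the container's sshd (port 22) is published on 22450 and the reverse tunnel from the cluster is listening on port 22350 inside the container:

# Forward the tunnel port from the docker container to your localhost
ssh -p 22450 root@localhost -N -L 22350:localhost:22350
# In another terminal, connect to the cluster's sshd through the forwarded port
sftp -P 22350 dli7319@localhost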

Class Accounts

See UMIACS Wiki: ClassAccounts

Class accounts have the lowest priority. If GPUs are available, you can use one GPU for up to 48 hours.
However, your home directory only has 18GB, and installing PyTorch alone takes up ~3GB.
You cannot fit a conda environment there, so just use the Python module.

The ssh endpoint is

class.umiacs.umd.edu

Start a job with:

srun --pty --account=class --partition=class --gres=gpu:1 --mem=16G --qos=default --time=47:59:00 -c4 bash
My .bashrc
#PS1='\w$ '
PS1='\[\e]0;\u@\h: \w\a\]${debian_chroot:+($debian_chroot)}\[\033[01;32m\]\u@\h\[\033[00m\]:\[\033[01;34m\]\w\[\033[00m\]\$'

# Modules
module load tmux
module load cuda/10.0.130
module load cudnn/v7.5.0
module load Python3/3.7.6
alias python=python3

export PATH="${PATH}:${HOME}/bin/"
export PATH="${PATH}:${HOME}/.local/bin/"

.bashrc

My .bashrc
#PS1='\w$ '
PS1='\[\e]0;\u@\h: \w\a\]${debian_chroot:+($debian_chroot)}\[\033[01;32m\]\u@\h\[\033[00m\]:\[\033[01;34m\]\w\[\033[00m\]\$'

if test -f "/opt/rh/rh-php72/enable"; then
    source /opt/rh/rh-php72/enable
fi

export NVM_DIR="$HOME/.nvm"
[ -s "$NVM_DIR/nvm.sh" ] && \. "$NVM_DIR/nvm.sh"  # This loads nvm
[ -s "$NVM_DIR/bash_completion" ] && \. "$NVM_DIR/bash_completion"  # This loads nvm bash_completion

command_exists() {
  type "$1" &> /dev/null ;
}


# Modules
if command_exists module ; then
  module load tmux
  module load cuda/10.2.89
  module load cudnn/v8.0.4
  module load Python3/3.7.6
  module load git/2.25.1
  module load gitlfs
  module load gcc/8.1.0
  module load openmpi/4.0.1
  module load ffmpeg
  module load rclone
fi
if command_exists python3 ; then
  alias python=python3
fi

if command_exists python3 ; then
  export PATH="${PATH}:$(python3 -c 'import site; print(site.USER_BASE)')/bin"
fi
export PYTHONPATH="${PYTHONPATH}:/nfshomes/dli7319/.local/lib/python3.7/site-packages/"

export PATH="${HOME}/bin/:${PATH}"

Software

git

The MBRC cluster has a git module available.
You can download a precompiled git-lfs binary and drop it in ~/bin/ (see the sketch below).
Make sure ${HOME}/bin is in your PATH and run git lfs install.

Notes
  • Make sure you have a recent version of git
    • E.g. module load git/2.25.1
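
A sketch of the git-lfs install (the version number and extracted directory name are examples; check https://github.com/git-lfs/git-lfs/releases for the latest release):

curl -LO https://github.com/git-lfs/git-lfs/releases/download/v3.4.0/git-lfs-linux-amd64-v3.4.0.tar.gz
tar -xzf git-lfs-linux-amd64-v3.4.0.tar.gz
cp git-lfs-3.4.0/git-lfs ~/bin/
# Register the lfs filters in your ~/.gitconfig
git lfs install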

Copying Files

There are three ways I copy files:

  • For small files, you can copy to your home directory under /nfshomes/ via SFTP to the submission node. I rarely do this because the home directory is only a few gigs.
  • For large files and folders, I typically use rclone to copy to the cloud and then copy back to the scratch drives with a CPU-only job (see the sketch after this list).
    • You can store project files on Google Drive or the UMIACS object storage.
    • Note that Google Drive has a limit on files per second and a daily limit of 750GB in transfers.
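
A sketch of the rclone copy, assuming a remote named gdrive has already been set up with rclone config (the remote name and paths are examples):

# Upload from your PC
rclone copy ./my-project gdrive:my-project --progress
# Download to scratch from inside a CPU-only job
rclone copy gdrive:my-project /scratch1/davidli/my-project --progress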