Overview
CPOS IT operates the High Performance Computing Facility (HPCF), comprising 54 nodes with 3,266 CPU threads, 20,121 GB of memory and 4.2 PB of storage capacity, connected by a high-speed network.
Target Users
- Staff and students in HKU needing computing services for omics research
Service Hours
- Generally available on a 24×7 basis except for system maintenance
High Performance Computing Cluster (HPCF2)
Hardware and Access
HPCF2 cluster
The master node, Omics, can be accessed via SSH at hostname hpcf2.cpos.hku.hk.
The HPCF2 cluster consists of 8 compute nodes with system configuration listed below:
Server | CPU Brand & Model | No of CPU | Cores per CPU | Physical Cores per Server | CPU Threads per Server | RAM (GB) |
hpch01 | Intel Xeon E5-2650 v4 2.2GHz | 2 | 12 | 24 | 48 | 256 |
hpch02 | Intel Xeon E5-2650 v4 2.2GHz | 2 | 12 | 24 | 48 | 256 |
hpch03 | Intel Xeon E5-2650 v4 2.2GHz | 2 | 12 | 24 | 48 | 256 |
hpch04 | Intel Xeon E5-2650 v4 2.2GHz | 2 | 12 | 24 | 48 | 256 |
hpch05 | Intel Xeon E5-2650 v4 2.2GHz | 2 | 12 | 24 | 48 | 256 |
hpch06 | Intel Xeon E5-2650 v4 2.2GHz | 2 | 12 | 24 | 48 | 256 |
hpch07 | Intel Xeon E5-2650 v4 2.2GHz | 2 | 12 | 24 | 48 | 256 |
hpch08 | Intel Xeon E5-2683 v4 2.1GHz | 2 | 16 | 32 | 64 | 512 |
Total: | | | | 200 | 400 | 2,304 |
Operating System and System Software
All servers in HPCF2 run the CentOS 7 Linux operating system, and major bioinformatics software packages have been installed and are ready for use by all users.
Cluster Nodes
Master Nodes
Services
- Compile/Run command-line programs in interactive mode via SSH
- Submit batch jobs to the job queue system
- File transfer via an SFTP client such as FileZilla
Control Measures
The master node shall be used for the services listed above, while users’ analysis jobs shall be executed on the compute nodes by submission to the job queue system. In particular, the master node shall not be used to run resource-hungry jobs. The following control measures are therefore implemented on the master node:
- The maximum total CPU usage of all of a user’s jobs at any one time is 400% (e.g. four concurrent single-threaded jobs each using 100% CPU time, or a single job with four threads each using 100% CPU time).
- The maximum running time of each user’s job is 10 minutes.
- The maximum total amount of main memory usage of all the user’s jobs at any one time is 10GB.
The master node checks whether each user has exceeded these limits and, if so, terminates that user’s jobs in descending order of resource usage until usage is brought back within the limits. Notification emails with details of the terminated processes are sent to the user automatically.
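To check your own current CPU and memory usage on the master node before the limits are reached, standard Linux tools can be used; a minimal sketch (the options shown are standard GNU/Linux and not HPCF-specific):
# show your own processes, ordered by CPU usage, updating interactively
top -u $USER
# one-off snapshot of your processes with CPU%, memory% and elapsed time
ps -u $USER -o pid,pcpu,pmem,etime,comm --sort=-pcpu | head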
Compute Nodes
Services
- Users submit PBS jobs from the master node through the job scheduling system for execution on the compute nodes; users cannot log in to the compute nodes directly to run jobs.
User Account Setup
- Please contact itsupport.cpos@hku.hk for user account setup.
- Account approval is subject to consideration by CPOS.
- Upon approval of the account application, each user is allocated a Linux user account for access to the cluster, with 100GB of disk space by default.
- Each user account is issued to, and is the responsibility of, a named person. NO account sharing is allowed.
- Users may apply for extra disk quota, subject to availability and charges.
- Although the storage is protected by hardware redundancy, users are strongly recommended to regularly back up important files from the cluster to a local disk for peace of mind.
- Users should be familiar with the Linux/UNIX software environment. The Unix user guide prepared by ITS at http://intraweb.hku.hk/local/its/training/unix_sem.ppt may be used as a reference.
Cluster Login
Users may remotely connect to a command-line terminal of HPCF2 via SSH using an SSH client such as PuTTY (http://www.chiark.greenend.org.uk/~sgtatham/putty/).
To connect to the master node of HPCF2, please use an SSH client to open an SSH terminal session with the following:
hostname: hpcf2.cpos.hku.hk
Log in with the same user account as on the current HPCF.
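For Mac/Linux users, the same connection can be opened from a local terminal with the standard ssh client (replace userid with your own account name):
ssh userid@hpcf2.cpos.hku.hk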
N.B. For security reasons, the servers can only be reached from machines within the HKU network. From machines outside the HKU network, users must first connect to HKUVPN before connecting to the servers. Please see the details at http://www.its.hku.hk/documentation/guide/network/remote/hkuvpn2fa
Data Transfer
For Windows users there are several free, GUI-based secure file transfer clients, such as WinSCP or FileZilla (http://filezilla-project.org/). You can use these clients to simply drag and drop files between the cluster front-end node and your desktop or laptop.
On a Mac or Linux machine, the command-line scp client is typically installed with the operating system and can be accessed through a terminal running on your local machine. For example, suppose you have a Mac with a local file called myfile that you want to copy to your home directory on the cluster.
On a Mac/Linux PC, start the Terminal application and type
scp myfile userid@hpcf2.cpos.hku.hk:mydir/
That copies the local file myfile to the subdirectory mydir/ under your home directory on the cluster, keeping the filename myfile. As with any Unix command that takes files or directories as arguments, the source or destination can be specified as either an absolute path (one that begins with a slash) or a relative path (one that does not). For local files or directories, relative paths are relative to the current working directory. For remote files or directories, relative paths are relative to your home directory on the remote host.
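Transfers in the reverse direction and whole directories work the same way; the following illustrative commands assume a remote file mydir/results.txt and a local directory localdir exist (both names are placeholders):
# copy a file from your cluster home directory back to the current local directory
scp userid@hpcf2.cpos.hku.hk:mydir/results.txt .
# copy an entire local directory recursively to the cluster
scp -r localdir userid@hpcf2.cpos.hku.hk:mydir/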
Bulk Data Transfer
In order not to affect other users on the HKU campus network, HKU Information Technology Services (ITS) suggests that network transfers of bulk data (>30GB) between the cluster and machines outside the HKU network be scheduled outside office hours. If more than 500GB is to be transferred, prior notice should be sent to itsupport.cpos@hku.hk so that we can relay it to ITS for planning the network traffic. Failure to do so may result in ITS blocking our servers from reaching the Internet.
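For long-running bulk transfers, a resumable tool such as rsync (installed by default on most Linux and Mac systems) may be preferable to a single scp, since an interrupted transfer can be resumed; a minimal sketch, assuming a local directory bigdata/ is to be copied into your cluster home directory:
# -a preserves file attributes, -v is verbose, -P keeps partial files and shows progress
rsync -avP bigdata/ userid@hpcf2.cpos.hku.hk:bigdata/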
Charges
Effective 1-Sep-2024
For each 100GB disk quota: HK$ 39 / month
Per user account: HK$ 200 / month
Charging Policies
HPCF User Accounts
- Each HPCF user account would be charged monthly for daily support and maintenance of that account in our cluster.
- Charging period starts from the 1st day of every month.
- For a new user account, charging starts from the next charging period. For example, if the new account is ready and an email notification is sent to the user on the 28th, it would not be charged for the rest of that month but would be charged starting from the 1st of the next month onwards.
- Each HPCF user account must be in use and charged for at least ONE month. Short term usage is NOT allowed.
- Removal of a user account would be performed on the last day of a charging period, during which charging would still be applied.
User Home Data Storage
- The user home data storage (/home/) would be charged monthly by the disk quota granted in units of 100GB (1KB=1000 bytes).
- Charging period starts from the 1st day of every month.
- For the home storage of a new user account, it would be charged starting from the next charging period, along with the new user account charges.
- By default, each new user account would have 100GB disk quota if not specified.
- An increase of the disk quota would be performed upon request from the user (with cc to the PI), and the raised disk quota would be charged starting from the current charging period.
- Upon request, a decrease of the disk quota would be performed on the last day of a charging period, during which charging would still be applied.
User Group Data Storage
- The user group data storage (/home/groups/) would be charged monthly by the disk quota granted in units of 100GB (1KB=1000 bytes).
- Charging period starts from the 1st day of every month.
- An increase of the disk quota would be performed upon request from the group coordinator (with cc to the PI), and the raised disk quota would be charged starting from the current charging period.
- Upon request, a decrease of the disk quota would be performed on the last day of a charging period, during which charging would still be applied.
Co-location Services
- For new co-location equipment units, the monthly support fee would be charged starting from the next charging period after UAT acceptance.
Important Notes
CPOS may revise the charges from time to time based on actual usages and recovery needs. Prior notice will be given before changes are implemented.
For co-location equipment, all FEO inventory would be under CPOS. A co-location server may be integrated into the HPCF cluster and set up as a compute node. A separate job queue would be set up for job execution on the co-location server(s). Please note that CPOS IT centrally manages all the system resources in HPCF, including co-location equipment, and may from time to time schedule jobs on any idling resources without prior notice, for effective and efficient use to the benefit of the HKU community.
Environment Modules
A system called “Environment Modules” is used for the management of centrally installed software. Several versions of commonly used bioinformatics software have already been installed. Note that it is available on the new HPCF2 cluster only, NOT on the old HPCF (statgenpro).
Show available modules
Environment modules allow users to dynamically switch between software environments with a few simple commands. To see which software packages and versions are available on the cluster, use the command “module avail” to list all available module environments:
[itsupport@omics ~]$ module avail

--------------------------------------------- /software/Modules/modulefiles ----------------------------------------------
   ANNOVAR/2017Jul16            HTSeq/0.9.1            (D)    STAR/2.5.2a                bwa/0.7.12
   BEDTools/2.12.0              MACS/2.0.10-2012.06.06        STAR/2.5.3a          (D)   bwa/0.7.17           (D)
   BEDTools/2.17.0              MACS/2.1.0-2015.04.20  (D)    TrimGalore/0.4.1           cutadapt/1.8.1
   BEDTools/2.27.1        (D)   MUMmer/3.22                   TrimGalore/0.4.5     (D)   cutadapt/1.15        (D)
   BioPerl/1.7.2                MUMmer/3.23            (D)    Trimmomatic/0.33           idba/1.1.3
   Canu/1.5                     NCBI-blast/2.2.27+            Trimmomatic/0.36     (D)   java/7.0_25
   Canu/1.6               (D)   NCBI-blast/2.7.1+      (D)    VerifyBamID/1.1.2          java/7.0_80
   CellRanger/2.0.1             Oncotator/1.9.6.1             VerifyBamID/1.1.3    (D)   java/8.0_161         (D)
   CellRanger/2.1.0       (D)   PEAR/0.9.10                   bamUtil/1.0.13             java/9.0.4
   DESeq2/1.10.1                PEAR/0.9.11            (D)    bamUtil/1.0.14       (D)   miniconda2/4.3.31
   DESeq2/1.18.1          (D)   Perl/5.26.1                   bamtools/2.3.0             muTect/1.1.4
   EBSeq/1.9.3                  Picard/2.0.1                  bamtools/2.5.1       (D)   muTect/1.1.5         (D)
   EBSeq/1.18             (D)   Picard/2.17.4          (D)    bcl2fastq/2.19             python2/2.7.14
   FASTX-toolkit/0.0.13.2       QIIME/1.9.1                   bcl2fastq/2.20       (D)   python3/3.6.4
   FASTX-toolkit/0.0.14   (D)   QIIME2/2017.12                bedGraphToBigWig/4         samtools/0.1.18
   FastQC/0.11.2                R/3.2.5                       bismark/0.14.3             samtools/1.3
   FastQC/0.11.7          (D)   R/3.4.3                (D)    bismark/0.19.0       (D)   samtools/1.6         (D)
   GenomeAnalysisTK/3.5         RNAmmer/1.2                   bowtie/1.0.0               strelka/1.0.15
   GenomeAnalysisTK/3.7         RSEM/1.2.31                   bowtie/1.2.2         (D)   strelka/2.8.4        (D)
   GenomeAnalysisTK/3.8   (D)   RSEM/1.3.0             (D)    bowtie2/2.2.5              tRNAscan-SE/1.3.1
   HOMER/4.9                    SPAdes/3.10.0                 bowtie2/2.3.4        (D)
   HTSeq/0.6.1                  SPAdes/3.11.1          (D)    bwa/0.6.2

-------------------------------------- /opt/Lmod/7.7.14/lmod/lmod/modulefiles/Core ---------------------------------------
   lmod/7.7.14    settarg/7.7.14

  Where:
   D:  Default Module
Load module
You can then execute these centrally installed software packages by loading the corresponding module(s). For example, to run the BWA alignment tool, use the command “module load” to enable the bwa module environment:
[itsupport@omics ~]$ module load bwa
bwa/0.7.17 is loaded
The default version of the bwa module is then loaded, and you can use the bwa tool for your subsequent data analysis:
[itsupport@omics ~]$ bwa

Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.17-r1188
Contact: Heng Li

Usage:   bwa <command> [options]

Command: index         index sequences in the FASTA format
         mem           BWA-MEM algorithm
         fastmap       identify super-maximal exact matches
         ...
List loaded modules
To list out those modules that are currently loaded, use the command “module list”:
[itsupport@omics ~]$ module list

Currently Loaded Modules:
  1) bwa/0.7.17
Unload modules
When you no longer need to execute that software, use the command “module unload” to unload it from your current session:
[itsupport@omics ~]$ module unload bwa
bwa/0.7.17 is unloaded
Search for available versions of a module
If you would like to use another version but not the default one, you can search and specify a particular version of the software module:
[itsupport@omics ~]$ module avail bwa

------------------------------- /software/Modules/modulefiles -------------------------------
   bwa/0.6.2    bwa/0.7.12    bwa/0.7.17 (D)

  Where:
   D:  Default Module
Load specific version of a module
Load an older version (0.7.12) of bwa:
[itsupport@omics ~]$ module load bwa/0.7.12
bwa/0.7.12 is loaded
You can then execute this older version of bwa now:
[itsupport@omics ~]$ bwa

Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.12-r1039
Contact: Heng Li

Usage:   bwa <command> [options]

Command: index         index sequences in the FASTA format
         mem           BWA-MEM algorithm
         ...
Loading alternative version of a module
[itsupport@omics ~]$ module load bowtie2/2.2.5
bowtie2/2.2.5 is loaded
[itsupport@omics ~]$ module load bowtie2/2.3.4
bowtie2/2.2.5 is unloaded
bowtie2/2.3.4 is loaded

The following have been reloaded with a version change:
  1) bowtie2/2.2.5 => bowtie2/2.3.4
Quick reference for Module Commands
Command | Description |
module list | List currently loaded module(s) |
module avail | Show what modules are available for loading |
module avail [name] | Show only the modules that are available for the application named [name] |
module keyword [word1] [word2] … | Show available modules matching the search criteria |
module whatis [module_name] | Show description of particular module |
module help [module_name] | Show help information |
module load [module_name] | Configure your environment according to modulefile(s) |
module load [module_name]/[version] | Load specific version of a module |
module load [mod A] [mod B] … | Load a list of modules |
module unload [module_name] | Roll back configuration performed by the modulefile(s) |
module unload [mod A] [mod B] … | Unload a list of modules |
module swap [module A] [module B] | Unload modulefile A and load modulefile B |
module purge | Unload all modules currently loaded |
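As an illustration, a short session combining several of these commands might look like the following (the module names are those used in the earlier examples):
module load samtools bwa        # load two modules at once
module list                     # confirm which modules are loaded
module swap bwa bwa/0.7.12      # switch to a specific version of bwa
module purge                    # unload all modules when finished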
Using module in shell scripts
Here is an example of using module in a shell script:
#!/bin/bash

# cleanup first
module purge

# load the modules we need for this script
module load bwa/0.7.12

# perform the data analysis
bwa mem reference.fa reads1.fq reads2.fq > aligned_pairs.sam
Loading modules with prerequisites
Some modules depend on other modules in order to work properly; users will be prompted to load the prerequisite modules when loading such modules.
[itsupport@omics ~]$ module load bismark
Lmod has detected the following error:  Cannot load module "bismark/0.19.0" without these module(s) loaded:
   bowtie2

While processing the following module(s):
    Module fullname  Module Filename
    ---------------  ---------------
    bismark/0.19.0   /software/Modules/modulefiles/bismark/0.19.0.lua
[itsupport@omics ~]$ module load bowtie2
bowtie2/2.3.4 is loaded
[itsupport@omics ~]$ module load bismark
bismark/0.19.0 is loaded
Software
The software packages listed below are centrally installed on the HPCF2 cluster and can be used simply by loading them with the module command:
Software Name | Module Name on HPCF2 | Installed Version(s) | Homepage | Description and Functionalities |
7-Zip | 7-Zip | 16.02 | http://www.7-zip.org/ | file archiver used to place groups of files within compressed containers known as “archives”. |
ANNOVAR | ANNOVAR | 2020Jun08 | http://annovar.openbioinformatics.org/ | annotate genetic variants detected from diverse genomes |
BamTools | bamtools | 2.3.0 2.5.1 | https://github.com/pezmaster31/bamtools | end-user’s toolkit for handling BAM files |
BamUtil | bamUtil | 1.0.13 1.0.14 | https://genome.sph.umich.edu/wiki/BamUtil | end-user’s programs for operating BAM/SAM files |
bcl2fastq | bcl2fastq | 2.19 2.20 | https://support.illumina.com/sequencing/sequencing_software/bcl2fastq-conversion-software.html | demultiplexes data and converts BCL files generated by Illumina sequencing systems to standard FASTQ file formats for downstream analysis |
bedGraphToBigWig | bedGraphToBigWig | 4 | https://www.encodeproject.org/software/bedgraphtobigwig/ | Convert bedGraph to bigWig file |
BEDTools | BEDTools | 2.12.0 2.17.0 2.27.1 | http://bedtools.readthedocs.io/en/latest/ | a collection of utility tools for a wide-range of genomics analysis tasks such as merging and shuffling genomic intervals from multiple files in widely-used genomic file formats such as BAM, BED, GFF/GTF, VCF |
BioPerl | BioPerl | 1.7.2 | http://bioperl.org/ | open source Perl tools for bioinformatics, genomics and life science |
Bismark | bismark | 0.14.3 0.19.0 0.20.0 | https://www.bioinformatics.babraham.ac.uk/projects/bismark/ | A tool to map bisulfite converted sequence reads and determine cytosine methylation states |
Bowtie | bowtie | 1.0.0 1.2.2 2.2.5 2.3.4 2.3.4.1 2.3.4.3 2.4.2 | http://bowtie-bio.sourceforge.net/index.shtml | An ultrafast memory-efficient short read aligner |
Bowtie 2 | bowtie2 | 2.2.5 2.3.4 2.3.4.1 2.3.4.3 2.4.2 | http://bowtie-bio.sourceforge.net/bowtie2/index.shtml | An ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences |
BWA | bwa | 0.6.2 0.7.12 0.7.17 | http://bio-bwa.sourceforge.net/ | Alignment of short reads via Burrows-Wheeler transformation on an indexed reference sequence |
Canu | Canu | 1.5 1.6 1.9 | http://canu.readthedocs.io/ | a single molecule sequence assembler for genomes large and small |
CellRanger | CellRanger | 2.0.1 2.1.0 2.2.0 3.0.2 3.1.0 4.0.0 6.1.2 | https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger | a set of analysis pipelines that process Chromium single-cell RNA-seq output to align reads, generate gene-cell matrices and perform clustering and gene expression analysis |
cutadapt | cutadapt | 1.16 1.8.1 2.3 3.4 | https://github.com/marcelm/cutadapt | removes adapter sequences from high-throughput sequencing data |
DESeq2 | DESeq2 | 1.10.1 1.18.1 | https://bioconductor.org/packages/release/bioc/html/DESeq2.html | Differential gene expression analysis based on the negative binomial distribution |
EBSeq | EBSeq | 1.9.3 1.18 | http://bioconductor.org/packages/release/bioc/html/EBSeq.html | An R package for gene and isoform differential expression analysis of RNA-seq data |
FastQC | FastQC | 0.11.8 0.11.9 | https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ | A quality control tool for high throughput sequence data |
FASTX-Toolkit | FASTX-toolkit | 0.0.14 1.3.0 | http://hannonlab.cshl.edu/fastx_toolkit/ | a collection of command line tools for Short-Reads FASTA/FASTQ files preprocessing |
Genome Analysis Toolkit (GATK) | GenomeAnalysisTK | 3.7 3.8.1.0 4.1.9.0 4.2.0.0 | https://software.broadinstitute.org/gatk/ | a wide variety of tools with a primary focus on variant discovery and genotyping |
HOMER | HOMER | 4.9 4.11 | http://homer.ucsd.edu/homer/ | HOMER (Hypergeometric Optimization of Motif EnRichment) is a suite of tools for Motif Discovery and next-gen sequencing analysis |
HTSeq | HTSeq | 0.6.1 0.9.1 | https://htseq.readthedocs.io/en/release_0.9.1/ | a Python package that provides infrastructure to process data from high-throughput sequencing assay |
IDBA | idba | 1.1.3 | https://github.com/loneknightpy/idba | basic iterative de Bruijn graph assembler for second-generation sequencing reads |
Java | java | 8.0_161 9.0.4 10.0.2 11.0.9 12.0.2 13.0.2 | https://www.java.com | a general-purpose computer-programming language that is concurrent, class-based, object-oriented, and specifically designed to have as few implementation dependencies as possible |
MACS | MACS | 2.0.10-2012.06.06 2.1.0-2015.04.20 2.1.1-2016.03.09 2.1.2-2019.09.06 | http://liulab.dfci.harvard.edu/MACS/ | Model-based Analysis of ChIP-Seq (MACS), for identifying transcription factor binding sites |
MUMmer | MUMmer | 3.22 3.23 4.0.ob2 | http://mummer.sourceforge.net/ | a system for rapidly aligning entire genomes, whether in complete or draft form |
muTect | muTect | 1.1.4 1.1.5 | http://archive.broadinstitute.org/cancer/cga/mutect | identification of somatic point mutations in next generation sequencing data of cancer genomes |
NCBI-blast | NCBI-blast | 2.2.27+ 2.7.1+ | https://blast.ncbi.nlm.nih.gov/Blast.cgi | BLAST finds regions of similarity between biological sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance. |
Oncotator | Oncotator | 1.9.6.1 1.9.9.0 | http://portals.broadinstitute.org/oncotator/ | a web application for annotating human genomic point mutations and indels with data relevant to cancer researchers |
PEAR | PEAR | 0.9.10 0.9.11 | https://www.h-its.org/downloads/pear-academic/ | an ultrafast, memory-efficient and highly accurate paired-end read merger |
Perl | Perl | 5.26.1 | https://www.perl.org | a high-level, general-purpose, interpreted, dynamic programming language |
Picard | Picard | 2.17.4 2.18.9 2.25.2 | https://broadinstitute.github.io/picard/ | command-line utilities that manipulate SAM files |
Python 2 | python2 | 2.7.14 | https://www.python.org | an interpreted high-level programming language for general-purpose programming |
Python 3 | python3 | 3.6.4 3.7.10 3.9.2 | https://www.python.org | an interpreted high-level programming language for general-purpose programming |
QIIME | QIIME | 1.9.1 | http://qiime.org/ | an open-source bioinformatics pipeline for performing microbiome analysis from raw DNA sequencing data |
QIIME2 | QIIME2 | 2017.12 2019.7 | https://qiime2.org/ | a microbiome analysis package with a focus on data and analysis transparency |
R | R | 3.4.3 3.5.1 3.6.1 4.1.0 | https://www.r-project.org | a programming language and free software environment for statistical computing and graphics |
RNAmmer | RNAmmer | 1.2 | http://www.cbs.dtu.dk/cgi-bin/nph-sw_request?rnammer | predicting ribosomal RNA genes in full genome sequences |
RSEM | RSEM | 1.2.31 1.3.0 1.3.3 | http://deweylab.github.io/RSEM/ | accurate quantification of gene and isoform expression from RNA-Seq data |
SAMtools | samtools | 1.6 1.8 1.9 1.11 | http://samtools.sourceforge.net/ | SAMtools provides various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format |
SPAdes | SPAdes | 3.10.0 3.11.1 3.13.0 | http://cab.spbu.ru/software/spades/ | St. Petersburg genome assembler – is an assembly toolkit containing various assembly pipelines |
STAR | STAR | 2.7.8a 2.7.9a | https://github.com/alexdobin/STAR | RNA-seq aligner |
strelka | strelka | 1.0.15 2.8.4 2.9.7 | https://github.com/Illumina/strelka | germline and somatic small variant caller |
TrimGalore | TrimGalore | 0.4.1 0.4.5 | https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/ | A wrapper tool around Cutadapt and FastQC to consistently apply quality and adapter trimming to FastQ files |
Trimmomatic | Trimmomatic | 0.33 0.36 0.38 | http://www.usadellab.org/cms/index.php?page=trimmomatic | A flexible read trimming tool for Illumina NGS data |
tRNAscan-SE | tRNAscan-SE | 1.3.1 | http://eddylab.org/software.html | tRNA detection in large-scale genome sequence |
VerifyBamID | VerifyBamID | 1.1.2 1.1.3 | https://genome.sph.umich.edu/wiki/VerifyBamID | verifies whether the reads in a particular file match previously known genotypes for an individual (or group of individuals), and checks whether the reads are contaminated as a mixture of two samples |
Note: The above list is not exhaustive. Please contact us to check whether a particular software package is available.
Job Scheduling System (PBS Pro) at HPCF2 cluster
All jobs must be scheduled via the batch job system (PBS Pro). Jobs are submitted to PBS with the requested resources specified, including the queue, the number of CPUs, the amount of memory, and the walltime. PBS then runs the job(s) on compute nodes when the resources become available, subject to limits on maximum resource usage.
General Job Queues
Queue name | Max no of processors per job | Max memory (GB) per job | Max no of job running per user | Max no of job queuing per user | Max Walltime (hr) |
small | 2 | 10 | 18 | 40 | 6 |
small_ext | 2 | 10 | 6 | 12 | 60 |
medium | 12 | 50 | 12 | 25 | 24 |
medium_ext | 12 | 50 | 6 | 8 | 60 |
large | 12 | 120 | 3 | 4 | 84 |
legacy | 12 | 45 | 8 | 16 | 96 |
test | 24 | 190 | 1 | 1 | 1 |
Special Queues (On Request)
Sometimes you may have jobs that need more computing resources than the general job queues provide. In this case, please send us the details of your job execution plan and the resources needed. Depending on current cluster resource usage and trends, we may set up a customized job queue with specific computing resources to enable your job execution on a short-term basis.
Job Scripting
PBS Job Directives
Below are some commonly used PBS options in a job command file, which can also be used on the command line with qsub.
PBS Job Directives | Description |
#PBS -A acct | Causes the job time to be charged to “acct”. |
#PBS -N myJob | Assigns a job name. The default is the name of PBS job script. |
#PBS -l nodes=4:ppn=2 | The number of nodes and processors per node. |
#PBS -l walltime=01:00:00 | Sets the maximum wall-clock time during which this job can run. (walltime=hh:mm:ss) |
#PBS -l mem=n | Sets the maximum amount of memory allocated to the job. |
#PBS -q queuename | Assigns your job to a specific queue. |
#PBS -o mypath/my.out | The path and file name for standard output. |
#PBS -e mypath/my.err | The path and file name for standard error. |
#PBS -j oe | Join option that merges the standard error stream with the standard output stream of the job. |
#PBS -M email-address | Sends email notifications to a specific user email address. |
#PBS -m | Set email to be sent to the user when: |
#PBS -m a | a – the job aborts |
#PBS -m b | b – the job begins |
#PBS -m e | e – the job ends |
#PBS -r n | Indicates that a job should not rerun if it fails. |
#PBS -S shell | Sets the shell to use. Make sure the full path to the shell is correct. |
#PBS -V | Exports all environment variables to the job. |
#PBS -W | Used to set job dependencies between two or more jobs. |
Some useful batch job directives to try are the following (a combined example is shown after the table):
Directive | Function |
-S /bin/bash | Specifies which shell to use. |
-N JobName | Gives a name to the job. The name will appear in the output of qstat. |
-M username@hku.hk | Specifies an email address where notification messages will be sent. |
-m abe (or any subset of a, b, and e) | Specifying when an email will be sent: a — abort, b — begin, e — end. See notes on email below. |
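Combined in the header of a job script, these directives might look like the following sketch (the job name and email address are placeholders):
#PBS -S /bin/bash
#PBS -N MyAlignment
#PBS -M username@hku.hk
#PBS -m abe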
PBS Job Output/Error Files
You can specify the full file names of the PBS job output/error files:
PBS Job Directives | Description |
#PBS -o mypath/my.out | The path and file name for standard output |
#PBS -e mypath/my.err | The path and file name for standard error |
Or you can specify only the directories:
PBS Job Directives | Description |
#PBS -o mypath | The path for standard output. Output file will be generated as, e.g. 123.omics.OU |
#PBS -e mypath | The path for standard error. Error file will be generated as, e.g. 123.omics.ER |
Or if these two directives -e and -o are not specified, the working directory and default filenames as below would be used:
Output / Error Files | Description |
Output File | File with name $PBS_JOBNAME.o$PBS_JOBID would be generated, e.g. myfirstjob.o123 |
Error File | File with name $PBS_JOBNAME.e$PBS_JOBID would be generated, e.g. myfirstjob.e123 |
PBS Pro uses the directory from which the job was submitted to define the working directory for a job, regardless of where the job submission script is located. Please note that, by default, the PBS scheduler stores data relative to your home directory. It is therefore recommended to specify a full path for the filenames.
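If your job should run in the directory from which it was submitted, a common approach is to change into the PBS-provided submission directory at the start of the script (a minimal sketch; PBS_O_WORKDIR is a standard PBS environment variable set for every job):
#!/bin/bash
#PBS -q small
# change to the directory from which qsub was invoked
cd $PBS_O_WORKDIR
# commands below now run relative to the submission directory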
Tips:
The PBS job variables (e.g. the commonly used $PBS_JOBID and $PBS_JOBNAME) are NOT resolved within #PBS job directives on the Omics cluster with the new PBS scheduling system (this differs from the statgenpro cluster).
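These variables are, however, resolved normally by the shell within the body of the script, so they can still be used there, for example to name per-job log files; a minimal sketch:
#!/bin/bash
#PBS -q small
# $PBS_JOBID and $PBS_JOBNAME are expanded here by the shell at run time,
# not inside the #PBS directives above
echo "Job ${PBS_JOBNAME} (${PBS_JOBID}) started on $(hostname)" > "${PBS_JOBNAME}.${PBS_JOBID}.log"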
Interactive PBS Jobs
Use of PBS is not limited to batch jobs only. It also allows users to use the compute nodes interactively when needed. For example, users can work with the development environment provided by R on the compute nodes and run their jobs until the walltime expires.
Instead of preparing a submission script, users pass the job requirements directly to the qsub command. For example, the following PBS script:
#PBS -l nodes=1:ppn=4
#PBS -l mem=2gb
#PBS -l walltime=15:00:00
#PBS -q small
would correspond to the qsub command with parameters as below:
qsub -I -q small -l nodes=1:ppn=4,walltime=15:00:00,mem=2gb
Hence, the PBS scheduler will allocate 4 cores to the user as soon as nodes with the given specifications become available, and then automatically log the user into one of the compute nodes. Any interactive PBS job (i.e. qsub -I) will be logged out after 30 minutes of idle time to free up allocated but unused resources.
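An illustrative interactive session, kept within the limits of the small queue, might look like this (the module chosen is just an example):
# request a 1-core interactive session in the small queue for one hour
qsub -I -q small -l nodes=1:ppn=1,walltime=01:00:00,mem=4gb
# once the session starts on a compute node, load and run the software needed
module load R
R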
Job Management
Submit a batch job
Batch jobs allow users to submit a set of jobs that are queued up and then executed across the cluster by the job scheduler when the requested resources become available. Users submit jobs as scripts, which include instructions on how to run the job. The output of the job is written to a file for later review. A batch job can do anything that can be typed on the command line.
Here is an example of a batch script (with filename simple.sh):
#!/bin/bash
#PBS -l nodes=1:ppn=1
#PBS -l mem=2g
#PBS -l walltime=00:01:00
#PBS -m ae
#PBS -N omics-simple
#PBS -q small

module load bamtools/2.5.1
bamtools -v
Then you can submit this job script by:
$ qsub simple.sh
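On success, qsub prints the ID assigned to the new job, which can then be used with qstat or qdel; an illustrative example (the job ID shown is arbitrary):
$ qsub simple.sh
1652.omics
$ qstat 1652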
See “PBS Job Directives” for more about the PBS job parameters that could be specified in the script file.
Check job status
After your job is submitted, you can use the qstat command to view the status of your job and the job queues. If there are resources available on the compute nodes, your job should change to the R (running) state; if the compute nodes are busy, your job may stay in the Q (queued) state until the requested resources become available. Jobs shown in the C state have completed.
Show running jobs
$ qstat -rn1
Example:
[itsupport@omics ~]$ qstat -rn1

omics:
                                                            Req'd  Req'd   Elap
Job ID          Username  Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- --------- -------- ---------- ------ --- --- ------ ----- - -----
1649.omics      itsupport large    BWA_19NT   121502   1  12   90gb 48:00 R hpch01/0*12
Show queuing/held jobs
$ qstat -i
Example:
[itsupport@omics ~]$ qstat -i

omics:
                                                            Req'd  Req'd   Elap
Job ID          Username  Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- --------- -------- ---------- ------ --- --- ------ ----- - -----
1651.omics      itsupport large    BWA_19T       ---   1  12   90gb 48:00 Q
Show all jobs (including completed jobs)
$ qstat -xan1
Example:
[itsupport@omics Exome]$ qstat -xan1

omics:
                                                            Req'd  Req'd   Elap
Job ID          Username  Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- --------- -------- ---------- ------ --- --- ------ ----- - -----
1648.omics      itsupport large    BWA_23T    127476   1  12   45gb 48:00 F 00:03 hpch08/0*12
1649.omics      itsupport large    BWA_23NT   121502   1  12   90gb 48:00 R 01:12 hpch08/0*12
1650.omics      itsupport large    BWA_43T      5940   1  12   45gb 48:00 F 00:01 hpch08/0*12
1651.omics      itsupport large    BWA_43NT       --   1  12   90gb 48:00 Q    --    --
Show details of a job
$ qstat -xf JobID
Example:
[itsupport@omics]$ qstat -f 1649
Job Id: 1649.omics
    Job_Name = BWA_34NT
    Job_Owner = itsupportn@omics
    resources_used.cpupercent = 916
    resources_used.cput = 05:56:56
    resources_used.mem = 59554972kb
    resources_used.ncpus = 12
    resources_used.vmem = 92494740kb
    resources_used.walltime = 01:28:38
    job_state = R
    ...
Check resource usage of a completed job
Function: Show job information and resource usage percentages of a particular job
Usage: myjob JobID
Examples:
$ myjob 440356
Job information and usage summary of your HPCF job 440356 :
+-------------------+----------+-------+---------+-----+---------------------+-----------+------+-------+
| jobid             | username | queue | jobname | E S | End Time            | walltime% | mem% | cpu%  |
+-------------------+----------+-------+---------+-----+---------------------+-----------+------+-------+
| 440356.statgenpro | kelvin   | large | Large   | 0   | 2016-08-30 04:31:30 | 25.36     | 8.28 | 14.51 |
+-------------------+----------+-------+---------+-----+---------------------+-----------+------+-------+
+---------------------+---------------------+----------+----------+------+------+-------+----------+--------+
| Submit Time         | Start Time          | wtime@   | wtime#   | mem@ | mem# | vmem@ | CPUTime@ | nproc# |
+---------------------+---------------------+----------+----------+------+------+-------+----------+--------+
| 2016-08-29 22:26:16 | 2016-08-29 22:26:18 | 06:05:13 | 24:00:00 | 3.31 | 40gb | 4.75  | 10:35:53 | 12     |
+---------------------+---------------------+----------+----------+------+------+-------+----------+--------+
E S = Exit Status ; % = usage percentage ; # = requested ; @ = used ; mem@/vmem@ in GB ; nproc = number of processors
Only the owner of the job may check its usage:
$ myjob 123456
ERROR: You (kelvin) not the owner of the job 123456.
Check resource usage summary of completed job(s)
Function: Display job summary and resource usage percentages of your recent jobs
Usage: myjobs [-v] [-j] [n]
Optional Parameters:
-v verbose mode with resource usage data of walltime/memory/cpu requested and used
-j Last jobs by JobID numbers (instead of the default by End Time). Note that the jobs in the list by JobID may be different from that by End Time.
[n] the number of jobs to display (default: 20)
Examples
$ myjobs 5
Your last 5 HPCF completed jobs (by End Time):
+-------------------+----------+-------+-----------+-----+---------------------+-------+-------+--------+
| jobid             | username | queue | jobname   | E S | End Time            | cpu%  | mem%  | wtime% |
+-------------------+----------+-------+-----------+-----+---------------------+-------+-------+--------+
| 458952.statgenpro | kelvin   | large | GATK_CK02 | 0   | 2016-10-28 16:45:05 | 69.82 | 10.80 | 0.42   |
| 458860.statgenpro | kelvin   | large | GATK_NG07 | 0   | 2016-10-28 11:19:56 | 53.44 | 10.70 | 0.40   |
| 458853.statgenpro | kelvin   | large | GATK_CK03 | 0   | 2016-10-28 11:13:53 | 69.20 | 10.80 | 0.42   |
| 458862.statgenpro | kelvin   | large | GATK_YJ01 | 0   | 2016-10-28 11:09:05 | 49.90 | 10.30 | 0.24   |
| 458858.statgenpro | kelvin   | large | GATK_KC04 | 0   | 2016-10-28 10:57:39 | 55.96 | 10.30 | 0.11   |
+-------------------+----------+-------+-----------+-----+---------------------+-------+-------+--------+
E S = Exit Status ; % = usage percentage ; wtime = walltime
Verbose mode
$ myjobs -v 5
Your last 5 HPCF completed jobs (by End Time):
+-------------------+----------+-------+-----------+-----+---------------------+-------+-------+--------+----------+--------+------+-------+----------+-----------+
| jobid             | username | queue | jobname   | E S | End Time            | cpu%  | mem%  | wtime% | CPUTime@ | CPUno# | mem@ | mem#  | wtime@   | wtime#    |
+-------------------+----------+-------+-----------+-----+---------------------+-------+-------+--------+----------+--------+------+-------+----------+-----------+
| 458952.statgenpro | kelvin   | large | GATK_CK02 | 0   | 2016-10-28 16:45:05 | 69.82 | 10.80 | 0.42   | 00:41:55 | 2      | 1.08 | 10.00 | 00:30:01 | 120:00:00 |
| 458860.statgenpro | kelvin   | large | GATK_NG07 | 0   | 2016-10-28 11:19:56 | 53.44 | 10.70 | 0.40   | 00:31:03 | 2      | 1.07 | 10.00 | 00:29:03 | 120:00:00 |
| 458853.statgenpro | kelvin   | large | GATK_KC03 | 0   | 2016-10-28 11:13:53 | 69.20 | 10.80 | 0.42   | 00:42:17 | 2      | 1.08 | 10.00 | 00:30:33 | 120:00:00 |
| 458862.statgenpro | kelvin   | large | GATK_YJ01 | 0   | 2016-10-28 11:09:05 | 49.90 | 10.30 | 0.24   | 00:16:57 | 2      | 1.03 | 10.00 | 00:16:59 | 120:00:00 |
| 458858.statgenpro | kelvin   | large | GATK_KC04 | 0   | 2016-10-28 10:57:39 | 55.96 | 10.30 | 0.11   | 00:08:55 | 2      | 1.03 | 10.00 | 00:07:58 | 120:00:00 |
+-------------------+----------+-------+-----------+-----+---------------------+-------+-------+--------+----------+--------+------+-------+----------+-----------+
E S = Exit Status ; % = usage percentage ; wtime = walltime ; @ = used ; # = requested ; CPUno = number of processors ; mem@/vmem@/mem# in GB
Display last jobs by JobID numbers
$ myjobs -j 5
Your last 5 HPCF completed jobs (by JobID):
+-------------------+----------+-------+-----------+-----+---------------------+-------+-------+--------+
| jobid             | username | queue | jobname   | E S | End Time            | cpu%  | mem%  | wtime% |
+-------------------+----------+-------+-----------+-----+---------------------+-------+-------+--------+
| 458952.statgenpro | kelvin   | large | GATK_CK02 | 0   | 2016-10-28 16:45:05 | 69.82 | 10.80 | 0.42   |
| 458862.statgenpro | kelvin   | large | GATK_YJ01 | 0   | 2016-10-28 11:09:05 | 49.90 | 10.30 | 0.24   |
| 458860.statgenpro | kelvin   | large | GATK_NG07 | 0   | 2016-10-28 11:19:56 | 53.44 | 10.70 | 0.40   |
| 458858.statgenpro | kelvin   | large | GATK_KC04 | 0   | 2016-10-28 10:57:39 | 55.96 | 10.30 | 0.11   |
| 458857.statgenpro | kelvin   | large | GATK_LR06 | 0   | 2016-10-28 10:54:35 | 55.46 | 10.30 | 0.11   |
+-------------------+----------+-------+-----------+-----+---------------------+-------+-------+--------+
E S = Exit Status ; % = usage percentage ; wtime = walltime
Delete a job
$ qdel JobID
Example:
[itsupport@omics]$qdel 1651
How do I check how much data storage space I have used?
Users may use the “df” command to show the disk usage of their home folder.
[tmchan@omics ~]$ df -h ~/
Filesystem         Size  Used Avail Use% Mounted on
compellent2:/home 1000G  308G  693G  31% /home
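To see which subdirectories account for the space used, the standard du command can be used as a complement to df; for example:
# per-subdirectory usage of your home directory, largest last
du -sh ~/* | sort -h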
Make file compression faster with multiple CPU cores
By default, file compression with 7z uses only a single CPU core, which can take considerable time when archiving a large number of files.
We can speed it up by enabling the use of multiple CPU cores in 7z.
We can select the “bzip2” algorithm with the “-mm=Bzip2” argument and specify the number of threads with the “-mmt=<#THREAD>” argument.
Below is an example command that uses 4 threads.
7za a -mm=Bzip2 -mmt=4 output_zip_file input_file
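To extract such an archive later, the same 7za tool can be used; a minimal example:
7za x output_zip_file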
Contact
itsupport.cpos@hku.hk
Core Facilities
Address
6th Floor
The Hong Kong Jockey Club Building for Interdisciplinary Research
5 Sassoon Road
Pokfulam, Hong Kong
Tel: 2831-5500
Fax: 2818-5653
Web: https://cpos.hku.hk
Email: enquiry.cpos@hku.hk
Office Hours
Mon-Fri: 9:00am – 5:30pm
Samples and goods reception not available 1:00pm – 2:00pm
Closed on Saturday, Sunday, all University and Public holidays.