This job was posted by https://illinoisjoblink.illinois.gov. For more
information, please see:
https://illinoisjoblink.illinois.gov/jobs/12622955
Department
Provost Research Computing Center
About the Department
The University of Chicago Research Computing Center (RCC), a unit in the
Office of Research, provides high-end research computing resources to
researchers at the University of Chicago. It is dedicated to enabling
research by providing access to centrally managed High-Performance
Computing (HPC), storage, and visualization resources. These resources
include hardware, software, high-level scientific and technical user
support, and the education and training required to help researchers
make full use of modern HPC technology and local and national
supercomputing resources. The Office of Research oversees the conduct of
sponsored research, research program development, and contract
management functions.
Job Summary
The job participates in the design of automated, scalable, and rapidly
deployable solutions to systems infrastructure and server configuration.
Installs, configures, and maintains operating systems, monitoring and
alerting systems, utility software, and firewalls. Plans and executes
hands-on maintenance for production servers as well as Windows and Linux
servers.
The University of Chicago Research Computing Center (RCC) is seeking a
skilled HPC System Administrator to join its Systems and Operations
Team. This position will support the deployment, maintenance, and
automation of RCC's HPC systems, including CPU/GPU clusters, storage,
and networking infrastructure. The HPC System Administrator will assist
in system-level administration, troubleshooting, performance tuning, and
automation while collaborating with faculty and researchers to enable
cutting-edge computational science.
This is a hybrid position requiring 3 days onsite.
Responsibilities
Administer, install, monitor, and maintain HPC systems, including
compute nodes, storage, networking, and software stacks.
Develop and maintain automation tools for system provisioning,
configuration management, and monitoring.
Assist in the implementation and management of distributed file
systems (e.g., Lustre, BeeGFS, GPFS).
Install, configure, and optimize job scheduling and resource
management tools (e.g., Slurm, LSF, PBS).
Assist in system security, patch management, and troubleshooting
operational issues.
Contribute to performance benchmarking, system tuning, and capacity
planning.
Deploy and maintain commonly used HPC applications and software
stacks.
Document system administration procedures and contribute to
knowledge-sharing initiatives.
Support researchers by providing technical expertise and resolving
escalated support tickets.
Participate in vendor coordination, system procurement, and
hardware/software lifecycle management.
Installs, configures, and maintains operating system workstations
and servers. Performs software installations and upgrades to
operating systems and layered software packages. Monitors and tunes
the system to achieve optimum performance levels, acquiring
higher-level skills in the process.
Maintains all supporting documentation for comprehensive operating
system, hardware and software configuration. Monitors primary
responses for information technology related security incidents and
violations. Keeps current with new security and network monitoring
technologies, applicable laws, and regulations.
Performs other related work as needed.
Minimum Qualifications
Education:
Minimum requirements include a college or university degree in related
field.
Work Experience:
Minimum requirements include knowledge and skills developed through 2-5
years of work experience in a related job discipline.
Certifications:
Preferred Qualifications
Technical Skills or Knowledge:
Experience administering Linux-based HPC clusters, including job
schedulers (e.g., Slurm, LSF, PBS).
Familiarity with high-speed networking (e.g., InfiniBand, Ethernet).
Scripting/programming skills (Python, Bash, or Perl).
Experience configuring, installing and troubleshooting MPI and OpenMP
applications.
Experience configuring, installing, tuning and maintaining scientific
applications on large-scale systems.
Experience with system automation tools (e.g., Ansible, Puppet).
Experience with system provisioning tools (e.g., xCAT, Confluent,
Warewulf).
Knowledge of distributed storage systems (e.g., Lustre, BeeGFS, GPFS).
Experience with containerization (Docker, Singularity, Apptainer).
Experience configuring, installing, maintaining, and/or using
infrastructure and performance monitoring and optimization tools (such
as CheckMK, Grafana, Prometheus, or Icinga).
Experience in setting up and executing benchmarks in an HPC environment
and analyzing