HPC Sr. Scientific Software Engineer (IT@JH Research ...
Johns Hopkins University - Baltimore, MD
Job Description
Computing is seeking an _HPC Sr. Scientific Software Engineer_ who will design, build, and support Johns Hopkins University's high-performance computing and AI research infrastructure. This role integrates elements of both systems and software engineering, ensuring scalable, secure, and reproducible environments for scientific and data-intensive research. The Engineer develops and automates system and application workflows across CPU/GPU clusters, parallel storage, and hybrid cloud platforms. Responsibilities include configuring and optimizing large-scale Linux environments, implementing job scheduling and orchestration frameworks, containerizing applications, and supporting researchers in optimizing performance and reproducibility. Work combines project-based engineering with operational support, requiring both independent problem-solving and close collaboration with the Research Computing team and faculty stakeholders.

Specific Duties & Responsibilities

_Software Deployment and Design_
+ Develop and refine deployment strategies for scientific software on HPC and AI systems.
+ Design computational workflows, selecting optimal software configurations and using tools like Ansible for automation.
+ Assist teams in implementing, tuning, and optimizing AI models and gateway applications (e.g., XDMoD, Coldfront, Open OnDemand, CryoSPARC Live, SBGrid, AI Agents).

_Performance Optimization_
+ Analyze and optimize the performance of AI models and HPC applications, focusing on GPU-enabled computing.
+ Implement parallel processing, distributed computing, and resource management techniques for efficient job execution.

_Integration and Optimization_
+ Develop, debug, and maintain software tools, libraries, and frameworks supporting HPC and AI workloads.
+ Collaborate with the system team and software vendors (e.g., NVIDIA, Intel, Matlab) to optimize systems for maximum performance.
+ Utilize CUDA, DNN, TensorRT, and Intel Compilers to enhance system performance.
_HPC Scientific Software Support_
+ Manage and support scientific software deployment across HPC, cloud-based, and colocation facilities.
+ Oversee installation, configuration, and maintenance of HPC packages with tools like CMake, Make, EasyBuild, Spack, and Lua module files.

_Collaboration and Mentorship_
+ Work closely with cross-functional teams, including researchers, data scientists, and software developers, to address complex HPC/AI challenges.
+ Mentor junior engineers and foster a culture of continuous learning.

_Technical Support, Training Workshops, and Troubleshooting_
+ Resolve complex technical issues and perform root cause analysis for HPC/AI software challenges.
+ Implement effective solutions to prevent recurrence and improve system reliability.
+ Provide training workshops for researchers and students, focusing on troubleshooting, optimizing workflows, and effectively using HPC systems.

_Learning and Development_
+ Stay current with advances in HPC and AI technologies and methodologies.
+ Incorporate new research findings into existing systems to improve performance and capabilities.

_Container Orchestration_
+ Develop and manage container orchestration strategies to ensure scalability, reliability, and security of applications.
+ Oversee the container lifecycle from creation and deployment to scaling and removal.

_Documentation and Compliance_
+ Create comprehensive documentation for system designs, performance metrics, and project status.
+ Ensure compliance with security and regulatory standards for all HPC and AI systems.

_In Addition to the Duties Described Above_
+ Design, deploy, and maintain large-scale Linux HPC clusters with CPU/GPU resources, high-speed networks, and distributed storage.
+ Develop and maintain automation frameworks for provisioning, monitoring, and software lifecycle management.
+ Implement and optimize job scheduling, container orchestration, and workflow automation tools to support diverse research workloads.
+ Collaborate with faculty and research teams to parallelize, containerize, and scale computational workflows for multi-GPU and distributed environments.
+ Benchmark and tune application performance across architectures, documenting findings and sharing best practices.
+ Integrate and support AI/ML frameworks, scientific libraries, and workflow engines (Snakemake, Nextflow, Dask, Ray).
+ Ensure system and application reliability through proactive monitoring (Prometheus, Grafana, ELK) and incident response participation.
+ Support reproducibility and FAIR data principles through version-controlled, containerized environments.
+ Contribute to documentation, training materials, and technical guidance to enhance user experience and self-service capabilities.
+ Participate in evaluation and adoption of new technologies to advance performance, efficiency, and sustainability in research computing.

Minimum Qualifications
+ PhD in a quantitative discipline.
+ Five years of experience in HPC user support, software deployment, and performance optimization within an academic or research environment.
+ Additional education may substitute for required experience and additional related experience may substitute for required education beyond a high school diploma/graduation equivalent, to the extent permitted by the JHU equivalency formula.

Preferred Qualifications
+ Eight+ years of professional experience in high-performance computing, large-scale systems, or research software engineering.
+ Deep proficiency in Linux systems administration, performance tuning, and automation tools (Ansible, Terraform, Jenkins, or similar).
+ Experience with cluster management, workload schedulers (e.g., Slurm), and distributed or parallel file systems (e.g., GPFS, Lustre, WekaFS, Ceph).
+ Strong background in programming or scripting (Python, Bash, C/C++, Go, or Rust).
+ Familiarity with containerization and orchestration technologies used in HPC (Singularity, Apptainer, Docker, Kubernetes).
+ Understanding of high-speed interconnects (InfiniBand, 100/400 Gb Ethernet) and storage/data access patterns for AI and analytics.
+ Experience developing or maintaining CI/CD pipelines and module environments (Lmod/Spack) for research software.
+ Knowledge of GPU computing (CUDA, ROCm), MPI/OpenMP, and AI/ML frameworks.
+ Demonstrated ability to collaborate with researchers on performance optimization, workflow design, and reproducible computing.

Classified Title: HPC Sr. Scientific Software Engineer
Job Posting Title (Working Title): HPC Sr. Scientific Software Engineer (Computing)
Role/Level/Range: ATP/04/PG
Starting Salary Range: $99,800 - $175,000 Annually (Commensurate w/exp.)
Employee group: Full Time
Schedule: Mon-Fri, 8:30am-5pm
FLSA Status: Exempt
Location: Johns Hopkins Bayview
Department name: Computing
Personnel area: University Administration

Equal Opportunity Employer
All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, or status as a protected veteran.
Created: 2025-12-04