HiPCastor - Research
Our overarching goal is to research and develop new techniques applied to a diverse spectrum of computer systems (including edge, cloud, and high-performance computing) and across layers of abstraction (including computer architecture, operating systems, virtualization, middleware, software engineering, and applications).
Here are some areas that we currently work in:
Serverless Workflows (FaaS)
Modern Function-as-a-Service (FaaS) cloud platforms offer great potential for supporting event-driven scientific workflows. Nonetheless, barriers to adoption remain in scientific communities such as the environmental sciences, where R is the focal language for application development and users are typically not well-versed in FaaS APIs. We have designed and implemented FaaSr, a novel open-source middleware that supports event-driven scientific workflows in R. A key novelty in FaaSr is the ability to deploy workflows across FaaS providers without the need for any managed servers for coordination. We have also explored using lightweight per-function virtual machines (virtines) to enable more strongly isolated FaaS platforms.
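The coordination idea can be illustrated with a toy sketch. This is not the FaaSr API (FaaSr workflows are written in R); all names below are hypothetical. Each function, when it finishes, directly invokes its successors in the workflow graph, so no managed coordinator server is needed:

```python
# Conceptual sketch of serverless workflow coordination without a central
# coordinator (hypothetical names, not the FaaSr API): each finishing
# function triggers its successors itself.

WORKFLOW = {                 # adjacency list: function -> successors
    "fetch": ["clean"],
    "clean": ["analyze", "plot"],
    "analyze": [],
    "plot": [],
}

REGISTRY = {}                # maps function names to callables

def register(name):
    def wrap(fn):
        REGISTRY[name] = fn
        return fn
    return wrap

def invoke(name, payload, log):
    """Stand-in for a FaaS platform invoking a function by name."""
    result = REGISTRY[name](payload)
    log.append(name)
    # Decentralized coordination: the finishing function fires its successors.
    for succ in WORKFLOW[name]:
        invoke(succ, result, log)

@register("fetch")
def fetch(p):   return p + ["raw"]

@register("clean")
def clean(p):   return p + ["clean"]

@register("analyze")
def analyze(p): return p + ["stats"]

@register("plot")
def plot(p):    return p + ["figure"]

log = []
invoke("fetch", [], log)
print(log)   # → ['fetch', 'clean', 'analyze', 'plot']
```

This sketch sidesteps joins (each function has at most one predecessor); handling fan-in without a coordinator is one of the harder parts of the real problem.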
High-performance Memory Systems
Modern memory systems involve designs aimed at surmounting the “memory wall,” where memory capacity and bandwidth can limit workload performance. We are investigating new system software support for disaggregated memory, where a workload’s memory is transparently expanded across nodes in a cluster. In particular, we developed a new compiler and runtime system for high-performance far memory called TrackFM. To address memory bandwidth limitations, near-data processing architectures move compute nearer to memory, for example with processing units integrated near DRAM banks. We are investigating new software and hardware abstractions for next-generation processing-in-memory (PIM) architectures.
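The far-memory idea can be illustrated with a toy sketch. This is not TrackFM's actual mechanism (TrackFM operates via compiler transformations on native code), and all names below are hypothetical: accesses go through a "far pointer" that fetches the object from a remote node on first touch and hits a local cache afterward.

```python
# Conceptual sketch of transparent far memory (illustrative only):
# dereferencing a far pointer triggers a remote fetch the first time,
# then subsequent accesses hit the local cache.

REMOTE_STORE = {"tensor:0": list(range(8))}   # stands in for another node's memory
FETCHES = {"count": 0}

class FarRef:
    """A far pointer: deref fetches remotely on first touch, then caches."""
    def __init__(self, key):
        self.key = key
        self._local = None            # local cache, empty until first access

    def deref(self):
        if self._local is None:       # analogous to faulting on far memory
            FETCHES["count"] += 1
            self._local = REMOTE_STORE[self.key]   # simulated remote read
        return self._local

ref = FarRef("tensor:0")
total = sum(ref.deref()) + sum(ref.deref())   # two derefs, one remote fetch
print(total, FETCHES["count"])   # → 56 1
```

The point a compiler-based approach like TrackFM exploits is that such checks can be generated automatically for legacy code rather than written by hand.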
System Software for HPC
We are interested in ground-up redesigns of the hardware/software layer for high-performance computing. In the past, we have developed new operating systems, virtual machine monitors, languages, compilers, and hardware designs for HPC.
Software-defined virtual networks for edge-to-cloud computing
Within cloud data centers, nodes can typically communicate without Network Address Translators (NATs) in the path. Edge computing applications, in contrast, require devices to communicate across different private networks and must handle NAT traversal to enable edge-to-edge communication. We have designed and implemented EdgeVPN, a technique that enables virtual private Ethernet networks spanning edge and cloud resources, including those constrained by NAT and firewall middleboxes. EdgeVPN builds upon a scalable structured peer-to-peer overlay and integrates overlay tunnels with Software-Defined Networking (SDN) software switches to create a virtual network with dynamic membership, supporting unmodified Ethernet/IP stacks to facilitate the deployment of edge applications.
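To give a flavor of the structured peer-to-peer substrate such a system builds on, here is a toy sketch of greedy routing on an identifier ring with power-of-two "finger" shortcuts (Chord-style). The identifier space, node IDs, and routing rule are illustrative, not EdgeVPN's actual protocol:

```python
# Toy structured-overlay routing on an identifier ring (illustrative only):
# each node keeps shortcuts at power-of-two distances and forwards greedily.

RING = 64                        # identifier space size (illustrative)
NODES = sorted([1, 9, 17, 25, 33, 41, 49, 57])

def successor(ident):
    """First node clockwise from ident on the ring."""
    for n in NODES:
        if n >= ident:
            return n
    return NODES[0]

def fingers(node):
    """Node's shortcut pointers at power-of-two distances."""
    return [successor((node + 2**i) % RING) for i in range(6)]

def route(src, dest_id):
    """Greedily forward toward the node responsible for dest_id."""
    path, cur = [src], src
    while cur != successor(dest_id):
        # pick the finger that gets closest without overshooting the target
        best = max(
            (f for f in fingers(cur)
             if (f - cur) % RING <= (dest_id - cur) % RING),
            key=lambda f: (f - cur) % RING,
            default=successor((cur + 1) % RING),
        )
        cur = best
        path.append(cur)
    return path

print(route(1, 50))   # → [1, 33, 49, 57]
```

Each hop roughly halves the remaining distance, which is what keeps lookup state and path length logarithmic in the number of nodes.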
System techniques to efficiently serve large-scale AI models on HPC systems
Systems and algorithms for deploying and optimizing large deep neural networks at scale on HPC platforms. Foundational work includes Fauce (VLDB’21), an efficient framework for deep ensemble inference; MD-HM (ICS’21), a memory optimization scheme for large molecular dynamics simulations; and Tahoe (EuroSys’21), high-performance inference for decision tree ensembles on GPUs. Building on this base, we have advanced scalable LLM serving with new methods for retrieval-augmented generation (RAG) and multi-agent coordination, as well as Auto-HPCnet (HPDC’23), an automatic framework for plugging AI surrogates into HPC applications. These efforts introduce a unified software stack that connects low-level operators, model execution, and end-to-end applications for large-scale AI.
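As a minimal illustration of the retrieval step in RAG, the sketch below scores a tiny corpus against a query and prepends the best match to the prompt. The corpus, bag-of-words "embedding," and prompt format are illustrative only; production systems use learned encoders and approximate nearest-neighbor indexes:

```python
# Minimal sketch of RAG retrieval (illustrative only): embed query and
# documents, rank by cosine similarity, prepend the top hit to the prompt.

import math
from collections import Counter

CORPUS = [
    "MPI collectives dominate communication time at scale",
    "LoRA adapters fine-tune large language models cheaply",
    "checkpoint restart protects long-running HPC jobs",
]

def embed(text):
    """Toy bag-of-words 'embedding' (stands in for a learned encoder)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, k=1):
    q = embed(query)
    ranked = sorted(CORPUS, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

query = "how do I fine-tune a large language model"
context = retrieve(query, k=1)
prompt = "Context: " + context[0] + "\nQuestion: " + query
print(prompt)
```

At HPC scale, the systems challenge is precisely what this sketch hides: sharding and caching the index, batching retrieval with generation, and keeping accelerators fed.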
Efficient and physics-informed AI for scientific applications
Domain-aware and efficient AI methods that reduce simulation cost while preserving physical fidelity. Adaptive Neural Network-Based Approximation to Accelerate Eulerian Fluid Simulation (SC’19) demonstrated neural surrogates for accelerating fluid dynamics. Smart-PGSim (SC’20) showed AI acceleration for power grid simulation and was later highlighted by DOE and PNNL. Auto-HPCnet (HPDC’23) introduced an automated framework for constructing neural surrogate models for HPC applications, bridging AI workflows and large-scale simulation codes. TimeX++ (ICML’24) extended these ideas to temporal modeling through an interpretable and efficient time-series learning framework for scientific forecasting. Most recently, our submitted IPDPS’26 paper LUMOS advances SciML efficiency by unifying automatic feature selection and structured parameter pruning through L0-regularized learning, reducing manual trial-and-error in building high-quality scientific surrogate models. Together, these contributions advance surrogate modeling, AI model efficiency, and uncertainty-aware learning and fine-tuning for scientific workloads.
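The surrogate-modeling idea underlying this line of work can be sketched in a few lines: sample an expensive solver offline, fit a cheap model to its input/output behavior, then answer new queries with the model. The "solver" and the fitting scheme below (exact quadratic interpolation rather than a neural network) are purely illustrative:

```python
# Conceptual sketch of a simulation surrogate (illustrative only): fit a
# cheap model offline to an expensive solver, then query the model online.

import time

def expensive_solver(x):
    """Stands in for a costly simulation kernel."""
    time.sleep(0.001)                 # pretend this is expensive
    return 2.0 * x * x + 3.0 * x + 1.0

# Offline: sample the solver and fit a quadratic surrogate (Lagrange form).
xs = [0.0, 1.0, 2.0]
ys = [expensive_solver(x) for x in xs]

def surrogate(x):
    """Cheap stand-in fitted to the sampled input/output pairs."""
    total = 0.0
    for i, xi in enumerate(xs):
        li = 1.0
        for j, xj in enumerate(xs):
            if i != j:
                li *= (x - xj) / (xi - xj)
        total += ys[i] * li
    return total

# Online: the surrogate matches the solver on unseen inputs, without the cost.
err = max(abs(surrogate(x) - expensive_solver(x)) for x in [0.5, 1.5, 3.0])
print(f"max abs error: {err:.2e}")
```

Real scientific surrogates replace the interpolant with a learned model and must additionally manage feature selection, fidelity constraints, and uncertainty, which is where the work above focuses.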
System optimization on AI accelerators
Runtime and system-level techniques for emerging AI accelerators, with a focus on Cerebras wafer-scale engines. These platforms depart from the traditional GPU model by providing very large on-chip compute and memory capacity, and our work helps unlock their potential for training and serving increasingly large AI models. Our recent workshop paper, Phoenix (SC’25 Workshop), presents a framework for scalable and memory-efficient execution of sparse LoRA workloads on Cerebras systems, enabling larger and more efficient fine-tuning on wafer-scale hardware. This work is conducted in collaboration with Argonne and Yale University. Current student projects extend this work to RAG optimization on Cerebras, leveraging its massive on-chip memory to redesign dataflow and caching for LLM-based scientific and data-intensive applications.
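For readers unfamiliar with LoRA, the core idea referenced above is a low-rank update to a frozen weight matrix: W is left untouched and only two small factors B and A are trained, so the adapted layer computes with W + B·A. The dimensions below are toy values, and sparse-LoRA scheduling on wafer-scale hardware (as in Phoenix) is well beyond this sketch:

```python
# Minimal sketch of the LoRA low-rank update (illustrative dimensions):
# adapt a frozen d x k weight W by training only B (d x r) and A (r x k).

d, k, r = 6, 4, 2                  # output dim, input dim, LoRA rank

def matmul(X, Y):
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

W = [[0.0] * k for _ in range(d)]  # frozen pretrained weight (toy values)
B = [[1.0] * r for _ in range(d)]  # trainable, d x r
A = [[0.5] * k for _ in range(r)]  # trainable, r x k

delta = matmul(B, A)               # rank-r update, d x k
W_adapted = [[W[i][j] + delta[i][j] for j in range(k)] for i in range(d)]

full_params = d * k                # what updating W directly would train
lora_params = r * (d + k)          # what LoRA actually trains
# At realistic sizes the gap is large: d = k = 4096, r = 8 gives
# ~16.8M frozen parameters vs ~65.5K trainable ones.
print(full_params, lora_params)    # → 24 20
```

Sparsifying B and A shrinks the trainable footprint further, and mapping many such small, sparse updates efficiently is what makes wafer-scale execution interesting.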
Software
- EdgeVPN - VPN at the edge
- FaaSr - Function-as-a-Service for R
- LUMOS - SciML workflow optimization with unified feature and parameter adaptation
- Phoenix - sparse LoRA fine-tuning and inference on wafer-scale systems
- Shipyard - load-balanced, sharded consensus
- TrackFM compiler - automated far memory for legacy apps
- Wasp - a microhypervisor for function-granularity virtualization
Sponsors