CXLMemUring: Breaking the Memory Wall with Hardware-Software Co-design
Introduction: Why Do We Need CXLMemUring?
We're living in the era of the Great Memory Wall. Whether you're running HPC applications, training Deep Learning Recommendation Models (DLRM), or building Large Language Models (LLMs), memory latency and bandwidth have become the critical bottlenecks limiting application scalability.
Enter CXL (Compute Express Link) - an emerging interconnect that expands memory for both host CPUs and device accelerators through a load/store interface. By extending memory coherency to the PCIe root complex, CXL enables flexible co-design patterns in which near-device compute can access memory coherently.
However, traditional latency-hiding machinery - the reorder buffer (ROB), miss-status holding registers (MSHRs), read-ahead caches, and the TLB - doesn't scale well to CXL memory pools. This is where CXLMemUring comes in: a hardware-software co-design paradigm for asynchronous, flexible, parallel access to CXL memory pools.
The Core Concept: Memory Access Like io_uring
The name CXLMemUring draws inspiration from Linux's io_uring mechanism. Just as io_uring revolutionized asynchronous I/O, CXLMemUring aims to do the same for memory access.
How It Works
- Offload Memory Operations: Synthesized memory operations are offloaded to CXL endpoints, CXL switches, or compute near the CXL root complex (such as Intel's Data Streaming Accelerator, DSA)
- Asynchronous Execution: CPUs or accelerators can perform other computations while memory is being loaded
- Smart Notification: When CXL completes data loading, the data is placed into L1 cache (if capacity permits), and the in-core ROB is notified via a mailbox mechanism to resume computation on the previous hardware context
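To make this flow concrete, here is a minimal sketch of a pointer-chasing loop under this model. The `cxlur_offload_chase` call and the `CxlurHandle` mailbox flag are invented stand-ins for the hardware mechanisms above (CXLMemUring rewrites binaries via JIT rather than exposing an API), and the offload is stubbed synchronously so the sketch actually runs:

```cpp
// Minimal sketch of the offload/overlap/notify flow. The cxlur_* names
// are invented for exposition; the "offload" is a synchronous software
// stub standing in for the hardware path.
#include <cstdint>
#include <cstddef>

struct Node { uint64_t key; Node* next; };

struct CxlurHandle { Node* result = nullptr; bool done = false; };

// Stub: a core near the CXL endpoint would walk the chain remotely and
// post completion to the in-core mailbox; here it happens inline.
static CxlurHandle cxlur_offload_chase(Node* head, size_t hops) {
    CxlurHandle h;
    Node* cur = head;
    for (size_t i = 0; i < hops && cur && cur->next; ++i) cur = cur->next;
    h.result = cur;
    h.done = true;                 // hardware would set this asynchronously
    return h;
}

uint64_t overlap_example(Node* remote_head, const uint64_t* local, size_t n) {
    // 1. Offload: hand the pointer chase to a near-endpoint core.
    CxlurHandle h = cxlur_offload_chase(remote_head, /*hops=*/8);

    // 2. Asynchronous execution: useful local work overlaps the chase.
    uint64_t acc = 0;
    for (size_t i = 0; i < n; ++i) acc += local[i];

    // 3. Notification: the in-core mailbox flips when data lands in L1 and
    //    the previous hardware context resumes; modeled here as a poll.
    while (!h.done) { /* would run more independent work */ }
    return acc + (h.result ? h.result->key : 0);
}
```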
Technical Architecture: The Art of Hardware-Software Co-design
Software Stack
CXLMemUring employs a binary JIT compiler approach, similar to Apple Rosetta:
- Forward Analysis: Translates all remotable memory accesses and pointer accesses to CXL byte-addressable format using MLIR
- Backward Analysis: Identifies every function that receives remote pointers, so those functions can be rewritten to use native local pointers
- Dynamic Optimization: Uses profiling-guided techniques to adaptively adjust the offloading window, since access patterns often cannot be statically computed
The system marks functions and labels as profiling-guided points for cost-model penalties. At runtime, the JIT can patch the code after a label has executed once, capturing a better timing window.
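As a rough, source-level analogue of what that forward pass does, consider an indirect load through a remote array. In the sketch below, the remotable access is split into a batched asynchronous fetch plus a consumption point, with the window size `W` left as the profiling-tunable parameter; the `cxlur_*` calls named in the comments are hypothetical (the real transformation happens at the binary/MLIR level), so the gather is stubbed inline to keep the sketch runnable:

```cpp
// Hand-sketched analogue of the forward-analysis rewrite, in source form.
#include <cstdint>
#include <cstddef>

// Before: every iteration stalls on a long-latency remote indirect load.
uint64_t sum_indirect(const uint64_t* remote, const uint32_t* idx, size_t n) {
    uint64_t acc = 0;
    for (size_t i = 0; i < n; ++i)
        acc += remote[idx[i]];                 // dependent CXL access
    return acc;
}

// After: accesses in a window W are issued as one asynchronous batch,
// then consumed once the batch completes. W is tuned by profiling.
uint64_t sum_indirect_offloaded(const uint64_t* remote, const uint32_t* idx,
                                size_t n, size_t W) {
    uint64_t acc = 0;
    uint64_t buf[64];                          // staging buffer
    if (W == 0) W = 1;                         // keep the loop well-formed
    if (W > 64) W = 64;                        // clamp to staging capacity
    for (size_t base = 0; base < n; base += W) {
        size_t m = (n - base < W) ? (n - base) : W;
        // Hypothetically: cxlur_prefetch_indirect(remote, idx + base, m, buf);
        //                 cxlur_wait();   // mailbox signals completion
        for (size_t j = 0; j < m; ++j) buf[j] = remote[idx[base + j]]; // stub
        for (size_t j = 0; j < m; ++j) acc += buf[j];                  // consume
    }
    return acc;
}
```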
Hardware Stack
The hardware design consists of two main components:
- In-CPU Async Loading Engine:
  - Notifies the CPU to resume the previous context when data arrives
  - Implements callbacks through CXL.io requests with an in-core mailbox
- Near-Endpoint Coprocessor:
  - Computes offloaded load-instruction sequences and simple memory operations
  - Can be deployed near CXL Flash, GPUs, or CXL switches
  - Designed to handle the computational aspects of memory requests
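For a sense of what the mailbox half of this design implies, here is a minimal sketch of one in-core mailbox slot; every field name and width is invented for illustration rather than taken from actual RTL:

```cpp
// Invented layout for one in-core mailbox slot. The async loading engine
// fills a slot when a CXL.io callback arrives, and commit logic uses
// rob_tag to wake the stalled entry instead of raising an interrupt.
#include <cstdint>
#include <atomic>

struct MailboxSlot {
    std::atomic<uint8_t> valid;   // set by the loading engine on completion
    uint8_t  status;              // e.g. ok / remote fault
    uint16_t rob_tag;             // ROB entry waiting on this request
    uint32_t hw_ctx;              // hardware context to resume
    uint64_t l1_line_addr;        // where the returned data landed in L1
};

// Commit-side check: a completed slot resumes the previous hardware context.
inline bool mailbox_poll(MailboxSlot& s, uint16_t expected_tag) {
    return s.valid.load(std::memory_order_acquire) && s.rob_tag == expected_tag;
}
```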
Key Innovations: Beyond Traditional Approaches
1. Fine-Grained Access Optimization
Unlike RDMA's 4KB granularity, CXL supports 64-byte access granularity - much better suited for typical C++ object sizes. CXLMemUring leverages this to excel at pointer chasing and indirect memory reading scenarios.
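A quick back-of-the-envelope comparison makes the point: for a 48-byte node (a plausible small C++ object; the size is an assumption for illustration), a 64-byte CXL access moves about 1.3x the useful bytes, while a 4KB RDMA page moves roughly 85x:

```cpp
// Back-of-the-envelope: bytes moved per useful object for a 48-byte node.
#include <cstdio>

int main() {
    const double object = 48.0;   // assumed list-node size in bytes
    const double cxl    = 64.0;   // CXL access granularity
    const double rdma   = 4096.0; // RDMA page granularity
    std::printf("CXL  moves %.1fx the useful bytes\n", cxl / object);  // ~1.3x
    std::printf("RDMA moves %.1fx the useful bytes\n", rdma / object); // ~85.3x
    return 0;
}
```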
2. Flexible Offloading Points
The system supports multiple offloading locations:
- CXL endpoints
- CXL switches
- DSA near the root complex
Each offloading point has a different cost model, allowing the system to choose the optimal approach based on the specific workload.
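As an illustration of what such a cost model could look like, the sketch below scores each offload point by submission distance plus per-hop access latency from that point; all constants and the scoring rule are invented placeholders, since the actual penalties are derived from profiling:

```cpp
// Illustrative-only cost model for choosing an offload point. Constants
// are invented; the real system derives penalties from profiling-guided points.
#include <cstddef>

enum class OffloadPoint { Endpoint, Switch, RootDSA };

struct PointCost {
    double submit_ns;         // round trip to hand off the request
    double access_ns_per_hop; // latency of each dependent access from there
};

OffloadPoint choose_offload(size_t chain_hops) {
    // Placeholder intuition: the endpoint is far to reach but each chased
    // access is local to the data; the near-root DSA is cheap to reach but
    // every dependent access must cross the fabric.
    const PointCost ep  {600.0,  40.0};
    const PointCost sw  {400.0, 200.0};
    const PointCost dsa {150.0, 500.0};
    auto cost = [&](const PointCost& p) {
        return p.submit_ns + p.access_ns_per_hop * double(chain_hops);
    };
    double ce = cost(ep), cs = cost(sw), cd = cost(dsa);
    if (ce <= cs && ce <= cd) return OffloadPoint::Endpoint;
    return (cs <= cd) ? OffloadPoint::Switch : OffloadPoint::RootDSA;
}
```

Short chains favor the near-root DSA; long pointer chases amortize the trip to the endpoint, which is exactly the trade-off the profiling-guided window is meant to capture.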
3. Interrupt-Free Design
Rather than using traditional interrupt-driven approaches, CXLMemUring sets ROB metadata to activate requests. This eliminates interrupt overhead and improves system efficiency.
Comparison with Existing Solutions
CXLMemUring offers distinct advantages over existing approaches:
- vs. Data Streaming Accelerator (DSA): DSA is currently designed only for single-root CPUs and bulk memory loads, and it requires driver code and auxiliary data transmission for communication
- vs. asynchronous RDMA/SmartNICs: RDMA's 4KB granularity is too large for most C++ objects, making it unsuitable for pointer chasing and indirect memory reading
- vs. "in-order-core" asynchronous memory units: existing proposals are evaluated only on in-order cores, without fully considering L2 contention and the interplay with the ROB and MSHRs
Evaluation Framework: Proving the Value
CXLMemUring is evaluated using a modified BOOMv3 implementation on FPGA, with CHI (Arm's AMBA Coherent Hub Interface) used to simulate CXL switch and accelerator access. Key evaluation dimensions include:
- Window Capture Effectiveness: How effectively instruction windows are offloaded and how much computation can be done before memory arrives
- ROB/MSHR Integration: Optimal design patterns for integrating with existing memory subsystems
- On-chip Area Comparison: Whether this approach saves chip area
- Programming Model Guidance: Insights for future programming models
The Vision: Rethinking Memory Access
CXLMemUring represents a paradigm shift in how we think about computation. In this model:
- The CPU becomes a hub for combining DSA requests and executing OLAP operations
- Control flow is offloaded while most memory remains local
- Only minimal memory communication is required
- Hardware design becomes more flexible, with compute capabilities deployed where needed
Implementation Insights
The evaluation is based on BOOMv3 over FPGA, chosen for its modifiability. The system adds CHI for simulating CXL switches and accelerator access, plus a weaker RISC-V core near the endpoint for code offloading.
For the in-core logic, all memory returns go into L1 cache to ensure they're unique to the SMT core and instantly consumed. The design avoids interrupts, instead setting ROB metadata to activate requests.
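For a flavor of the endpoint side, this is the kind of simple dependent-load sequence the weaker RISC-V core is meant to run; `post_to_host_mailbox` is a made-up stand-in for the CHI-side completion write into the host's mailbox:

```cpp
// Sketch of an offloaded kernel on the near-endpoint core: a dependent-load
// sequence over endpoint memory that posts its result back to the host.
#include <cstdint>
#include <cstddef>

struct Node { uint64_t payload; Node* next; };

static void post_to_host_mailbox(uint16_t rob_tag, uint64_t value) {
    (void)rob_tag; (void)value;   // stub: hardware would wake the stalled ROB entry
}

void chase_and_reply(Node* head, size_t hops, uint16_t rob_tag) {
    Node* cur = head;
    for (size_t i = 0; i < hops && cur && cur->next; ++i)
        cur = cur->next;          // each hop stays local to endpoint memory
    post_to_host_mailbox(rob_tag, cur ? cur->payload : 0);
}
```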
Looking Forward: The Future of Memory Access
As demand grows for memory capacity at tolerable latency and bandwidth, approaches like CXLMemUring become increasingly critical. The co-design paradigm allows us to:
- Better utilize CXL's coherency capabilities
- Overcome limitations of traditional prefetching approaches
- Enable new programming models that better match modern workload requirements
Conclusion
CXLMemUring isn't just a technical solution - it's a new way of thinking about memory access in the CXL era. By combining hardware and software innovation, we can break through the memory wall and enable the next generation of high-performance applications.
As the CXL ecosystem matures, innovations like CXLMemUring will become increasingly important. They not only improve system performance but also provide developers with more flexible programming models, ultimately driving the entire computing industry forward.