Ouroboros: Virtualized queues for dynamic memory management on GPUs

Martin Winter, Daniel Mlakar, Mathias Parger, Markus Steinberger

Research output: Chapter in Book/Report/Conference proceeding > Conference paper > peer-review

Abstract

Dynamic memory allocation on a single instruction, multiple threads architecture, like the Graphics Processing Unit (GPU), is challenging, and implementation guidelines caution against it. Data structures must rise to the challenge of thousands of concurrently active threads trying to allocate memory. Efficient queueing structures have been used in the past to allow for simple allocation and reuse of memory directly on the GPU, but they do not scale well to different allocation sizes, as each size requires its own queue.

In this work, we propose Ouroboros, a virtualized queueing structure that manages dynamically allocatable data chunks whilst being built on top of these same chunks. Data chunks are interpreted on-the-fly either as building blocks for the virtualized queues or as paged user data. Re-usable user memory is managed in one of two ways, either as individual pages or as chunks containing pages. The queueing structures grow and shrink dynamically: only currently needed queue chunks are held in memory, and freed-up queue chunks can be reused within the system. Thus, we retain the performance benefits of an efficient, static queue design while keeping the memory requirements low. Performance evaluation on an NVIDIA TITAN V against the native device memory allocator in CUDA 10.1 shows speed-ups between 11X and 412X, with an average of 118X. For real-world testing, we integrate our allocator into faimGraph, a dynamic graph framework with proprietary memory management. Throughout all memory-intensive operations, such as graph initialization and edge updates, our allocator shows similar or improved performance. Additionally, we show improved algorithmic performance on PageRank and Static Triangle Counting.

Overall, our memory allocator can be efficiently initialized, allows for high-throughput allocation and offers, with its per-thread allocation model, a drop-in replacement for comparable dynamic memory allocators.
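
The chunk-based design described in the abstract can be illustrated with a minimal, single-threaded C++ sketch. It is not the authors' GPU implementation: it omits all concurrency, and every name and size in it (ChunkPool, VirtualQueue, PageAllocator, 16-word chunks, 4-word pages) is an assumption made for illustration. It only shows the core idea that a queue of free pages stores its entries in chunks taken from the same pool that serves user data, and that drained queue chunks return to that pool.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Illustrative, single-threaded sketch only; all names and sizes are assumptions.
constexpr std::uint32_t kWordsPerChunk = 16;  // one chunk = 16 x 32-bit words
constexpr std::uint32_t kWordsPerPage = 4;    // one page = 4 words
constexpr std::uint32_t kPagesPerChunk = kWordsPerChunk / kWordsPerPage;

// One contiguous memory region, split into equally sized chunks.
struct ChunkPool {
    std::vector<std::uint32_t> words;
    std::vector<std::uint32_t> reusable;  // chunk ids handed back to the pool
    std::uint32_t next_unused = 0;
    std::uint32_t capacity;

    explicit ChunkPool(std::uint32_t chunks)
        : words(static_cast<std::size_t>(chunks) * kWordsPerChunk), capacity(chunks) {}

    std::uint32_t acquire() {
        if (!reusable.empty()) {  // prefer a previously released chunk
            std::uint32_t id = reusable.back();
            reusable.pop_back();
            return id;
        }
        assert(next_unused < capacity && "pool exhausted");
        return next_unused++;
    }
    void release(std::uint32_t id) { reusable.push_back(id); }
    std::uint32_t* chunk(std::uint32_t id) {
        return words.data() + static_cast<std::size_t>(id) * kWordsPerChunk;
    }
    std::uint32_t chunks_in_use() const {
        return next_unused - static_cast<std::uint32_t>(reusable.size());
    }
};

// FIFO of page ids whose own storage lives in chunks taken from the pool.
// A std::deque tracks the backing chunk ids here purely for brevity.
class VirtualQueue {
public:
    explicit VirtualQueue(ChunkPool& pool) : pool_(pool) {}

    void enqueue(std::uint32_t value) {
        if (chunks_.empty() || back_ == kWordsPerChunk) {  // grow by one queue chunk
            chunks_.push_back(pool_.acquire());
            back_ = 0;
        }
        pool_.chunk(chunks_.back())[back_++] = value;
        ++size_;
    }

    bool dequeue(std::uint32_t& value) {
        if (size_ == 0) return false;
        value = pool_.chunk(chunks_.front())[front_++];
        --size_;
        if (front_ == kWordsPerChunk || size_ == 0) {  // shrink: front chunk is drained
            pool_.release(chunks_.front());
            chunks_.pop_front();
            front_ = 0;
        }
        return true;
    }

    std::size_t queue_chunks() const { return chunks_.size(); }

private:
    ChunkPool& pool_;
    std::deque<std::uint32_t> chunks_;
    std::uint32_t front_ = 0, back_ = 0;
    std::size_t size_ = 0;
};

// Page-granular allocator: free pages wait in a VirtualQueue.
class PageAllocator {
public:
    explicit PageAllocator(ChunkPool& pool) : pool_(pool), free_pages_(pool) {}

    std::uint32_t allocate() {
        std::uint32_t page;
        if (free_pages_.dequeue(page)) return page;
        std::uint32_t chunk = pool_.acquire();  // interpret a chunk as user pages
        for (std::uint32_t i = 1; i < kPagesPerChunk; ++i)
            free_pages_.enqueue(chunk * kPagesPerChunk + i);
        return chunk * kPagesPerChunk;
    }
    void deallocate(std::uint32_t page) { free_pages_.enqueue(page); }
    std::uint32_t* data(std::uint32_t page) {
        return pool_.chunk(page / kPagesPerChunk) + (page % kPagesPerChunk) * kWordsPerPage;
    }
    const VirtualQueue& queue() const { return free_pages_; }

private:
    ChunkPool& pool_;
    VirtualQueue free_pages_;
};
```

In this sketch, the first allocation turns one chunk into user pages and a second chunk into queue storage for the remaining free pages. Once those pages are handed out, the queue chunk is drained and released, and the next allocation reuses that same chunk as user data. A real GPU allocator would additionally need atomic operations and a bounded structure for tracking queue chunks, both of which are left out here.
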
Original language: English
Title of host publication: Proceedings of the 34th ACM International Conference on Supercomputing, ICS 2020
Publisher: Association for Computing Machinery
Pages: 1-12
Number of pages: 12
ISBN (Electronic): 9781450379830
ISBN (Print): 9781450379830
DOIs
Publication status: Published - 29 Jun 2020
Event: 34th ACM International Conference on Supercomputing: ICS 2020 - Worldwide online event, Virtual/Barcelona, Spain
Duration: 29 Jun 2020 to 2 Jul 2020
https://ics2020.bsc.es/

Conference

Conference: 34th ACM International Conference on Supercomputing
Abbreviated title: ICS'20
Country/Territory: Spain
City: Virtual/Barcelona
Period: 29/06/20 to 2/07/20
Internet address

Keywords

  • dynamic graphs
  • dynamic memory allocation
  • GPU
  • queueing
  • resource management

ASJC Scopus subject areas

  • Computer Science(all)
