Ouroboros: Virtualized queues for dynamic memory management on GPUs

Martin Winter; Daniel Mlakar; Mathias Parger; Markus Steinberger

doi:10.1145/3392717.3392742

Institute of Computer Graphics and Vision (7100)

Research output: Chapter in Book/Report/Conference proceeding › Conference paper › peer-review

Abstract

Dynamic memory allocation on a single instruction, multiple threads architecture, like the Graphics Processing Unit (GPU), is challenging and implementation guidelines caution against it. Data structures must rise to the challenge of thousands of concurrently active threads trying to allocate memory. Efficient queueing structures have been used in the past to allow for simple allocation and reuse of memory directly on the GPU but do not scale well to different allocation sizes, as each requires its own queue.

In this work, we propose Ouroboros, a virtualized queueing structure, managing dynamically allocatable data chunks, whilst being built on top of these same chunks. Data chunks are interpreted on-the-fly either as building blocks for the virtualized queues or as paged user data. Re-usable user memory is managed in one of two ways, either as individual pages or as chunks containing pages. The queueing structures grow and shrink dynamically: only currently needed queue chunks are held in memory, and freed-up queue chunks can be reused within the system. Thus, we retain the performance benefits of an efficient, static queue design while keeping the memory requirements low. Performance evaluation on an NVIDIA TITAN V relative to the native device memory allocator in CUDA 10.1 shows speed-ups between 11X and 412X, with an average of 118X. For real-world testing, we integrate our allocator into faimGraph, a dynamic graph framework with proprietary memory management. Throughout all memory-intensive operations, such as graph initialization and edge updates, our allocator shows similar or improved performance. Additionally, we show improved algorithmic performance on PageRank and Static Triangle Counting.

Overall, our memory allocator can be efficiently initialized, allows for high-throughput allocation and offers, with its per-thread allocation model, a drop-in replacement for comparable dynamic memory allocators.
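
The idea in the abstract of a queue that manages chunks while being stored in those same chunks can be illustrated with a small conceptual model. The sketch below is not the authors' implementation: it is single-threaded, runs on the CPU, and omits the concurrency handling that a GPU allocator needs. The names `ChunkPool`, `VirtualizedQueue`, `PageAllocator`, and `CHUNK_SLOTS` are hypothetical and chosen only for illustration. It loosely follows the "individual pages" variant, where the queue holds free pages.

```python
from collections import deque

CHUNK_SLOTS = 4  # hypothetical: queue entries per queue chunk and pages per data chunk


class ChunkPool:
    """Fixed set of chunk ids; a chunk can serve as queue storage or as user data."""

    def __init__(self, num_chunks):
        self.free = list(range(num_chunks - 1, -1, -1))  # pop() hands out id 0 first

    def acquire(self):
        if not self.free:
            raise MemoryError("chunk pool exhausted")
        return self.free.pop()

    def release(self, chunk_id):
        self.free.append(chunk_id)


class VirtualizedQueue:
    """FIFO whose storage is a chain of chunks borrowed from the same pool."""

    def __init__(self, pool):
        self.pool = pool
        self.chunks = deque()  # (chunk_id, entries) pairs, front to back
        self.head = 0          # read offset into the front chunk

    def is_empty(self):
        return not self.chunks

    def enqueue(self, value):
        # Grow: take a new queue chunk only when the back chunk is full.
        if not self.chunks or len(self.chunks[-1][1]) == CHUNK_SLOTS:
            self.chunks.append((self.pool.acquire(), []))
        self.chunks[-1][1].append(value)

    def dequeue(self):
        if not self.chunks:
            raise IndexError("dequeue from empty queue")
        chunk_id, entries = self.chunks[0]
        value = entries[self.head]
        self.head += 1
        # Shrink: a fully consumed queue chunk goes back to the pool for reuse.
        if self.head == len(entries):
            self.chunks.popleft()
            self.pool.release(chunk_id)
            self.head = 0
        return value


class PageAllocator:
    """Page-based allocation: the queue stores free (chunk_id, page_index) pairs."""

    def __init__(self, pool):
        self.pool = pool
        self.queue = VirtualizedQueue(pool)

    def malloc(self):
        if self.queue.is_empty():
            # No free page queued: interpret a fresh chunk as paged user data.
            data_chunk = self.pool.acquire()
            for page in range(CHUNK_SLOTS):
                self.queue.enqueue((data_chunk, page))
        return self.queue.dequeue()

    def free(self, page):
        self.queue.enqueue(page)
```

In this model the queue occupies pool chunks only while it has entries to store, which mirrors the abstract's point that only currently needed queue chunks are held in memory.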

Original language	English
Title of host publication	Proceedings of the 34th ACM International Conference on Supercomputing, ICS 2020
Publisher	Association for Computing Machinery
Pages	1-12
Number of pages	12
ISBN (Electronic)	9781450379830
ISBN (Print)	9781450379830
DOIs	https://doi.org/10.1145/3392717.3392742
Publication status	Published - 29 Jun 2020
Event	34th ACM International Conference on Supercomputing: ICS 2020 - Worldwide online event, Virtual/Barcelona, Spain. Duration: 29 Jun 2020 → 2 Jul 2020. https://ics2020.bsc.es/

Conference

Conference	34th ACM International Conference on Supercomputing
Abbreviated title	ICS'20
Country/Territory	Spain
City	Virtual/Barcelona
Period	29/06/20 → 2/07/20
Internet address	https://ics2020.bsc.es/

Keywords

dynamic graphs
dynamic memory allocation
GPU
queueing
resource management

ASJC Scopus subject areas

Computer Science (all)

Access to Document

10.1145/3392717.3392742

Cite this

Winter, M, Mlakar, D, Parger, M & Steinberger, M 2020, Ouroboros: Virtualized queues for dynamic memory management on GPUs. in Proceedings of the 34th ACM International Conference on Supercomputing, ICS 2020. Association for Computing Machinery, pp. 1-12, 34th ACM International Conference on Supercomputing, Virtual/Barcelona, Spain, 29/06/20. https://doi.org/10.1145/3392717.3392742

@inproceedings{14530267d11c4214a8d2e507aafee173,

title = "Ouroboros: Virtualized queues for dynamic memory management on GPUs",

abstract = "Dynamic memory allocation on a single instruction, multiple threads architecture, like the Graphics Processing Unit (GPU), is challenging and implementation guidelines caution against it. Data structures must rise to the challenge of thousands of concurrently active threads trying to allocate memory. Efficient queueing structures have been used in the past to allow for simple allocation and reuse of memory directly on the GPU but do not scale well to different allocation sizes, as each requires its own queue. In this work, we propose Ouroboros, a virtualized queueing structure, managing dynamically allocatable data chunks, whilst being built on top of these same chunks. Data chunks are interpreted on-the-fly either as building blocks for the virtualized queues or as paged user data. Re-usable user memory is managed in one of two ways, either as individual pages or as chunks containing pages. The queueing structures grow and shrink dynamically, only currently needed queue chunks are held in memory and freed up queue chunks can be reused within the system. Thus, we retain the performance benefits of an efficient, static queue design while keeping the memory requirements low. Performance evaluation on an NVIDIA TITAN V with the native device memory allocator in CUDA 10.1 shows speed-ups between 11X and 412X, with an average of 118X. For real-world testing, we integrate our allocator into faimGraph, a dynamic graph framework with proprietary memory management. Throughout all memory-intensive operations, such as graph initialization and edge updates, our allocator shows similar to improved performance. Additionally, we show improved algorithmic performance on PageRank and Static Triangle Counting. Overall, our memory allocator can be efficiently initialized, allows for high-throughput allocation and offers, with its per-thread allocation model, a drop-in replacement for comparable dynamic memory allocators.",

keywords = "dynamic graphs, dynamic memory allocation, GPU, queueing, resource management",

author = "Martin Winter and Daniel Mlakar and Mathias Parger and Markus Steinberger",

year = "2020",

month = jun,

day = "29",

doi = "10.1145/3392717.3392742",

language = "English",

isbn = "9781450379830",

pages = "1--12",

booktitle = "Proceedings of the 34th ACM International Conference on Supercomputing, ICS 2020",

publisher = "Association for Computing Machinery",

address = "United States",

note = "34th ACM International Conference on Supercomputing: ICS 2020, ICS'20; Conference date: 29-06-2020 through 02-07-2020",

url = "https://ics2020.bsc.es/",

}

TY - GEN

T1 - Ouroboros

T2 - 34th ACM International Conference on Supercomputing

AU - Winter, Martin

AU - Mlakar, Daniel

AU - Parger, Mathias

AU - Steinberger, Markus

PY - 2020/6/29

Y1 - 2020/6/29

N2 - Dynamic memory allocation on a single instruction, multiple threads architecture, like the Graphics Processing Unit (GPU), is challenging and implementation guidelines caution against it. Data structures must rise to the challenge of thousands of concurrently active threads trying to allocate memory. Efficient queueing structures have been used in the past to allow for simple allocation and reuse of memory directly on the GPU but do not scale well to different allocation sizes, as each requires its own queue. In this work, we propose Ouroboros, a virtualized queueing structure, managing dynamically allocatable data chunks, whilst being built on top of these same chunks. Data chunks are interpreted on-the-fly either as building blocks for the virtualized queues or as paged user data. Re-usable user memory is managed in one of two ways, either as individual pages or as chunks containing pages. The queueing structures grow and shrink dynamically, only currently needed queue chunks are held in memory and freed up queue chunks can be reused within the system. Thus, we retain the performance benefits of an efficient, static queue design while keeping the memory requirements low. Performance evaluation on an NVIDIA TITAN V with the native device memory allocator in CUDA 10.1 shows speed-ups between 11X and 412X, with an average of 118X. For real-world testing, we integrate our allocator into faimGraph, a dynamic graph framework with proprietary memory management. Throughout all memory-intensive operations, such as graph initialization and edge updates, our allocator shows similar to improved performance. Additionally, we show improved algorithmic performance on PageRank and Static Triangle Counting. Overall, our memory allocator can be efficiently initialized, allows for high-throughput allocation and offers, with its per-thread allocation model, a drop-in replacement for comparable dynamic memory allocators.

AB - Dynamic memory allocation on a single instruction, multiple threads architecture, like the Graphics Processing Unit (GPU), is challenging and implementation guidelines caution against it. Data structures must rise to the challenge of thousands of concurrently active threads trying to allocate memory. Efficient queueing structures have been used in the past to allow for simple allocation and reuse of memory directly on the GPU but do not scale well to different allocation sizes, as each requires its own queue. In this work, we propose Ouroboros, a virtualized queueing structure, managing dynamically allocatable data chunks, whilst being built on top of these same chunks. Data chunks are interpreted on-the-fly either as building blocks for the virtualized queues or as paged user data. Re-usable user memory is managed in one of two ways, either as individual pages or as chunks containing pages. The queueing structures grow and shrink dynamically, only currently needed queue chunks are held in memory and freed up queue chunks can be reused within the system. Thus, we retain the performance benefits of an efficient, static queue design while keeping the memory requirements low. Performance evaluation on an NVIDIA TITAN V with the native device memory allocator in CUDA 10.1 shows speed-ups between 11X and 412X, with an average of 118X. For real-world testing, we integrate our allocator into faimGraph, a dynamic graph framework with proprietary memory management. Throughout all memory-intensive operations, such as graph initialization and edge updates, our allocator shows similar to improved performance. Additionally, we show improved algorithmic performance on PageRank and Static Triangle Counting. Overall, our memory allocator can be efficiently initialized, allows for high-throughput allocation and offers, with its per-thread allocation model, a drop-in replacement for comparable dynamic memory allocators.

KW - dynamic graphs

KW - dynamic memory allocation

KW - GPU

KW - queueing

KW - resource management

UR - http://www.scopus.com/inward/record.url?scp=85088540824&partnerID=8YFLogxK

U2 - 10.1145/3392717.3392742

DO - 10.1145/3392717.3392742

M3 - Conference paper

SN - 9781450379830

SP - 1

EP - 12

BT - Proceedings of the 34th ACM International Conference on Supercomputing, ICS 2020

PB - Association for Computing Machinery

Y2 - 29 June 2020 through 2 July 2020

ER -
