Globally Homogeneous, Locally Adaptive Sparse Matrix-vector Multiplication on the GPU

Markus Steinberger; Rhaleb Zayer; Hans-Peter Seidel

doi:10.1145/3079079.3079086

Publication: Contribution to book/report/conference proceedings › Contribution to conference proceedings › Peer-reviewed

Abstract

The rising popularity of the graphics processing unit (GPU) across various numerical computing applications triggered a breakneck race to optimize key numerical kernels, in particular the sparse matrix-vector product (SpMV). Despite great strides, most existing GPU-SpMV approaches trade off one aspect of performance against another. They either require preprocessing, exhibit inconsistent behavior, lead to execution divergence, suffer load imbalance or induce detrimental memory access patterns. In this paper, we present an uncompromising approach for SpMV on the GPU. Our approach requires no separate preprocessing or knowledge of the matrix structure and works directly on the standard compressed sparse rows (CSR) data format. From a global perspective, it exhibits a homogeneous behavior reflected in efficient memory access patterns and steady per-thread workload. From a local perspective, it avoids heterogeneous execution paths by adapting its behavior to the workload at hand, uses an efficient encoding to keep temporary data requirements for on-chip memory low, and leads to divergence-free execution. We evaluate our approach on more than 2500 matrices, comparing against vendor-provided and state-of-the-art SpMV implementations. Our approach not only significantly outperforms approaches directly operating on the CSR format (20% average performance increase), but also outperforms approaches that preprocess the matrix even when preprocessing time is discarded. Additionally, the same strategies lead to a significant performance increase when adapted for transpose SpMV.
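For readers unfamiliar with the CSR layout the abstract refers to, the following is a minimal sequential sketch of y = A x over CSR arrays. It is a plain reference loop and not the paper's GPU kernel; the names `csr_spmv`, `row_ptr`, `col_idx`, and `values`, as well as the example matrix, are illustrative assumptions that do not come from the source.

```python
def csr_spmv(row_ptr, col_idx, values, x):
    """Reference y = A @ x for a matrix stored in CSR form (sequential sketch)."""
    num_rows = len(row_ptr) - 1
    y = [0.0] * num_rows
    for row in range(num_rows):
        acc = 0.0
        # Non-zeros of this row occupy positions row_ptr[row] .. row_ptr[row + 1] - 1.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# Hypothetical 3x3 example matrix (assumption, for illustration only):
# [[1, 0, 2],
#  [0, 0, 0],
#  [0, 3, 4]]
row_ptr = [0, 2, 2, 4]
col_idx = [0, 2, 1, 2]
values = [1.0, 2.0, 3.0, 4.0]
x = [1.0, 1.0, 1.0]
y = csr_spmv(row_ptr, col_idx, values, x)
```

The rows in this example hold 2, 0, and 2 non-zeros. Uneven row lengths of this kind are what can cause the load imbalance and divergence the abstract mentions when rows are mapped naively to GPU threads.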

Original language	English
Title	ICS '17: Proceedings of the International Conference on Supercomputing
Place of publication	New York, NY, USA
Publisher	ACM SIGWEB
Pages	1–11
ISBN (Print)	978-1-4503-5020-4
DOIs	https://doi.org/10.1145/3079079.3079086
Publication status	Published - 2017
Published externally	Yes
Event	International Conference on Supercomputing: ICS 2017 - Chicago, United States. Duration: 14 June 2017 → 16 June 2017

Conference

Conference	International Conference on Supercomputing
Short title	ICS '17
Country/Territory	United States
City	Chicago
Period	14/06/17 → 16/06/17

Access to document

10.1145/3079079.3079086

Cite this

Globally Homogeneous, Locally Adaptive Sparse Matrix-vector Multiplication on the GPU. / Steinberger, Markus; Zayer, Rhaleb; Seidel, Hans-Peter.
ICS '17: Proceedings of the International Conference on Supercomputing. New York, NY, USA: ACM SIGWEB, 2017. pp. 1–11 13.

Steinberger, M, Zayer, R & Seidel, H-P 2017, Globally Homogeneous, Locally Adaptive Sparse Matrix-vector Multiplication on the GPU. in ICS '17: Proceedings of the International Conference on Supercomputing., 13, ACM SIGWEB, New York, NY, USA, pp. 1–11, International Conference on Supercomputing, Chicago, Illinois, United States, 14/06/17. https://doi.org/10.1145/3079079.3079086

@inproceedings{65f83efb37ae4d33838086b61b0dc132,

title = "Globally Homogeneous, Locally Adaptive Sparse Matrix-vector Multiplication on the GPU",

abstract = "The rising popularity of the graphics processing unit (GPU) across various numerical computing applications triggered a breakneck race to optimize key numerical kernels, in particular the sparse matrix-vector product (SpMV). Despite great strides, most existing GPU-SpMV approaches trade off one aspect of performance against another. They either require preprocessing, exhibit inconsistent behavior, lead to execution divergence, suffer load imbalance or induce detrimental memory access patterns. In this paper, we present an uncompromising approach for SpMV on the GPU. Our approach requires no separate preprocessing or knowledge of the matrix structure and works directly on the standard compressed sparse rows (CSR) data format. From a global perspective, it exhibits a homogeneous behavior reflected in efficient memory access patterns and steady per-thread workload. From a local perspective, it avoids heterogeneous execution paths by adapting its behavior to the workload at hand, uses an efficient encoding to keep temporary data requirements for on-chip memory low, and leads to divergence-free execution. We evaluate our approach on more than 2500 matrices, comparing against vendor-provided and state-of-the-art SpMV implementations. Our approach not only significantly outperforms approaches directly operating on the CSR format (20% average performance increase), but also outperforms approaches that preprocess the matrix even when preprocessing time is discarded. Additionally, the same strategies lead to a significant performance increase when adapted for transpose SpMV.",

keywords = "GPU, SpMV, linear algebra, sparse matrix",

author = "Markus Steinberger and Rhaleb Zayer and Hans-Peter Seidel",

year = "2017",

doi = "10.1145/3079079.3079086",

language = "English",

isbn = "978-1-4503-5020-4",

pages = "1--11",

booktitle = "ICS '17: Proceedings of the International Conference on Supercomputing",

publisher = "ACM SIGWEB",

note = "International Conference on Supercomputing: ICS 2017, ICS '17; Conference date: 14-06-2017 through 16-06-2017",

}

TY - GEN

T1 - Globally Homogeneous, Locally Adaptive Sparse Matrix-vector Multiplication on the GPU

AU - Steinberger, Markus

AU - Zayer, Rhaleb

AU - Seidel, Hans-Peter

PY - 2017

Y1 - 2017

N2 - The rising popularity of the graphics processing unit (GPU) across various numerical computing applications triggered a breakneck race to optimize key numerical kernels, in particular the sparse matrix-vector product (SpMV). Despite great strides, most existing GPU-SpMV approaches trade off one aspect of performance against another. They either require preprocessing, exhibit inconsistent behavior, lead to execution divergence, suffer load imbalance or induce detrimental memory access patterns. In this paper, we present an uncompromising approach for SpMV on the GPU. Our approach requires no separate preprocessing or knowledge of the matrix structure and works directly on the standard compressed sparse rows (CSR) data format. From a global perspective, it exhibits a homogeneous behavior reflected in efficient memory access patterns and steady per-thread workload. From a local perspective, it avoids heterogeneous execution paths by adapting its behavior to the workload at hand, uses an efficient encoding to keep temporary data requirements for on-chip memory low, and leads to divergence-free execution. We evaluate our approach on more than 2500 matrices, comparing against vendor-provided and state-of-the-art SpMV implementations. Our approach not only significantly outperforms approaches directly operating on the CSR format (20% average performance increase), but also outperforms approaches that preprocess the matrix even when preprocessing time is discarded. Additionally, the same strategies lead to a significant performance increase when adapted for transpose SpMV.

KW - GPU, SpMV, linear algebra, sparse matrix

U2 - 10.1145/3079079.3079086

DO - 10.1145/3079079.3079086

M3 - Conference paper

SN - 978-1-4503-5020-4

SP - 1

EP - 11

BT - ICS '17: Proceedings of the International Conference on Supercomputing

PB - ACM SIGWEB

CY - New York, NY, USA

T2 - International Conference on Supercomputing

Y2 - 14 June 2017 through 16 June 2017

ER -
