Online shielding for reinforcement learning

Bettina Könighofer; Julian Rudolf; Alexander Palmisano; Martin Tappler; Roderick Bloem

doi:10.1007/s11334-022-00480-4

Bettina Könighofer^*, Julian Rudolf, Alexander Palmisano, Martin Tappler, Roderick Bloem

^*Corresponding author for this work

Research output: Contribution to journal › Conference article › peer-review

Abstract

Besides the recent impressive results on reinforcement learning (RL), safety is still one of the major research challenges in RL. RL is a machine-learning approach to determine near-optimal policies in Markov decision processes (MDPs). In this paper, we consider the setting where the safety-relevant fragment of the MDP together with a temporal logic safety specification is given, and many safety violations can be avoided by planning ahead a short time into the future. We propose an approach for online safety shielding of RL agents. During runtime, the shield analyses the safety of each available action. For any action, the shield computes the maximal probability to not violate the safety specification within the next k steps when executing this action. Based on this probability and a given threshold, the shield decides whether to block an action from the agent. Existing offline shielding approaches compute exhaustively the safety of all state-action combinations ahead of time, resulting in huge computation times and large memory consumption. The intuition behind online shielding is to compute at runtime the set of all states that could be reached in the near future. For each of these states, the safety of all available actions is analysed and used for shielding as soon as one of the considered states is reached. Our approach is well-suited for high-level planning problems where the time between decisions can be used for safety computations and it is sustainable for the agent to wait until these computations are finished. For our evaluation, we selected a 2-player version of the classical computer game Snake. The game represents a high-level planning problem that requires fast decisions and the multiplayer setting induces a large state space, which is computationally expensive to analyse exhaustively.
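The bounded-horizon check described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy MDP, the state and action names, the absolute threshold, and the fallback to the safest action when everything would be blocked are all assumptions made for illustration. It covers only the k-step safety computation and the blocking decision, not the online precomputation for states reachable in the near future.

```python
from functools import lru_cache

# Hypothetical toy MDP (an assumption, not from the paper):
# TRANSITIONS[state][action] -> list of (probability, next_state).
TRANSITIONS = {
    "start": {
        "left": [(1.0, "safe_road")],
        "right": [(0.5, "safe_road"), (0.5, "crash")],
    },
    "safe_road": {
        "go": [(0.9, "safe_road"), (0.1, "crash")],
        "wait": [(1.0, "safe_road")],
    },
    "crash": {"stay": [(1.0, "crash")]},
}
UNSAFE = {"crash"}  # states that violate the safety specification


@lru_cache(maxsize=None)
def max_safe_prob(state, k):
    """Maximal probability of avoiding UNSAFE for the next k steps from `state`."""
    if state in UNSAFE:
        return 0.0
    if k == 0:
        return 1.0
    return max(action_safety(state, a, k) for a in TRANSITIONS[state])


def action_safety(state, action, k):
    """Maximal probability of staying safe for k >= 1 steps when `action` is taken first."""
    return sum(p * max_safe_prob(nxt, k - 1)
               for p, nxt in TRANSITIONS[state][action])


def shielded_actions(state, k, threshold):
    """Actions the shield allows: those whose k-step safety meets `threshold`.

    Assumption: if every action falls below the threshold, the safest
    action is allowed so that the agent is never left without a choice.
    """
    scores = {a: action_safety(state, a, k) for a in TRANSITIONS[state]}
    allowed = [a for a, s in scores.items() if s >= threshold]
    return allowed or [max(scores, key=scores.get)]
```

In this toy model, calling `shielded_actions("start", 3, 0.9)` blocks `right`, because half of its probability mass leads directly to the unsafe state, and leaves `left` available to the agent.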

Original language	English
Journal	Innovations in Systems and Software Engineering
Early online date	23 Sept 2022
DOIs	https://doi.org/10.1007/s11334-022-00480-4
Publication status	E-pub ahead of print - 23 Sept 2022
Event	13th NASA Formal Methods Symposium: NFM 2021 - Houston, Virtual, United States. Duration: 24 May 2021 → 28 May 2021

Keywords

Shielding
Runtime enforcement
Markov decision processes
Safe reinforcement learning

ASJC Scopus subject areas

Software

Access to Document

10.1007/s11334-022-00480-4
Licence: CC BY 4.0

Cite this

@article{291e792c930046119d7a5eb1ebfe7026,

title = "Online shielding for reinforcement learning",

abstract = "Besides the recent impressive results on reinforcement learning (RL), safety is still one of the major research challenges in RL. RL is a machine-learning approach to determine near-optimal policies in Markov decision processes (MDPs). In this paper, we consider the setting where the safety-relevant fragment of the MDP together with a temporal logic safety specification is given, and many safety violations can be avoided by planning ahead a short time into the future. We propose an approach for online safety shielding of RL agents. During runtime, the shield analyses the safety of each available action. For any action, the shield computes the maximal probability to not violate the safety specification within the next k steps when executing this action. Based on this probability and a given threshold, the shield decides whether to block an action from the agent. Existing offline shielding approaches compute exhaustively the safety of all state-action combinations ahead of time, resulting in huge computation times and large memory consumption. The intuition behind online shielding is to compute at runtime the set of all states that could be reached in the near future. For each of these states, the safety of all available actions is analysed and used for shielding as soon as one of the considered states is reached. Our approach is well-suited for high-level planning problems where the time between decisions can be used for safety computations and it is sustainable for the agent to wait until these computations are finished. For our evaluation, we selected a 2-player version of the classical computer game Snake. The game represents a high-level planning problem that requires fast decisions and the multiplayer setting induces a large state space, which is computationally expensive to analyse exhaustively.",

keywords = "Shielding, Runtime enforcement, Markov decision processes, Safe reinforcement learning",

author = "Bettina K{\"o}nighofer and Julian Rudolf and Alexander Palmisano and Martin Tappler and Roderick Bloem",

year = "2022",

month = sep,

day = "23",

doi = "10.1007/s11334-022-00480-4",

language = "English",

journal = "Innovations in Systems and Software Engineering",

issn = "1614-5046",

publisher = "Springer London",

note = "13th NASA Formal Methods Symposium: NFM 2021; Conference date: 24-05-2021 through 28-05-2021",

}

TY - JOUR

T1 - Online shielding for reinforcement learning

AU - Könighofer, Bettina

AU - Rudolf, Julian

AU - Palmisano, Alexander

AU - Tappler, Martin

AU - Bloem, Roderick

PY - 2022/9/23

Y1 - 2022/9/23

N2 - Besides the recent impressive results on reinforcement learning (RL), safety is still one of the major research challenges in RL. RL is a machine-learning approach to determine near-optimal policies in Markov decision processes (MDPs). In this paper, we consider the setting where the safety-relevant fragment of the MDP together with a temporal logic safety specification is given, and many safety violations can be avoided by planning ahead a short time into the future. We propose an approach for online safety shielding of RL agents. During runtime, the shield analyses the safety of each available action. For any action, the shield computes the maximal probability to not violate the safety specification within the next k steps when executing this action. Based on this probability and a given threshold, the shield decides whether to block an action from the agent. Existing offline shielding approaches compute exhaustively the safety of all state-action combinations ahead of time, resulting in huge computation times and large memory consumption. The intuition behind online shielding is to compute at runtime the set of all states that could be reached in the near future. For each of these states, the safety of all available actions is analysed and used for shielding as soon as one of the considered states is reached. Our approach is well-suited for high-level planning problems where the time between decisions can be used for safety computations and it is sustainable for the agent to wait until these computations are finished. For our evaluation, we selected a 2-player version of the classical computer game Snake. The game represents a high-level planning problem that requires fast decisions and the multiplayer setting induces a large state space, which is computationally expensive to analyse exhaustively.

AB - Besides the recent impressive results on reinforcement learning (RL), safety is still one of the major research challenges in RL. RL is a machine-learning approach to determine near-optimal policies in Markov decision processes (MDPs). In this paper, we consider the setting where the safety-relevant fragment of the MDP together with a temporal logic safety specification is given, and many safety violations can be avoided by planning ahead a short time into the future. We propose an approach for online safety shielding of RL agents. During runtime, the shield analyses the safety of each available action. For any action, the shield computes the maximal probability to not violate the safety specification within the next k steps when executing this action. Based on this probability and a given threshold, the shield decides whether to block an action from the agent. Existing offline shielding approaches compute exhaustively the safety of all state-action combinations ahead of time, resulting in huge computation times and large memory consumption. The intuition behind online shielding is to compute at runtime the set of all states that could be reached in the near future. For each of these states, the safety of all available actions is analysed and used for shielding as soon as one of the considered states is reached. Our approach is well-suited for high-level planning problems where the time between decisions can be used for safety computations and it is sustainable for the agent to wait until these computations are finished. For our evaluation, we selected a 2-player version of the classical computer game Snake. The game represents a high-level planning problem that requires fast decisions and the multiplayer setting induces a large state space, which is computationally expensive to analyse exhaustively.

KW - Shielding

KW - Runtime enforcement

KW - Markov decision processes

KW - Safe reinforcement learning

UR - http://www.scopus.com/inward/record.url?scp=85138710788&partnerID=8YFLogxK

U2 - 10.1007/s11334-022-00480-4

DO - 10.1007/s11334-022-00480-4

M3 - Conference article

SN - 1614-5046

JO - Innovations in Systems and Software Engineering

JF - Innovations in Systems and Software Engineering

T2 - 13th NASA Formal Methods Symposium

Y2 - 24 May 2021 through 28 May 2021

ER -

Projects

EU - FOCETA - Foundations for continuous engineering of trustworthy autonomy