Nash Q-Network for Multi-Agent Cybersecurity Simulation

Cybersecurity defense is inherently adversarial, making multi‑agent reinforcement learning a natural fit, but simultaneous training of competing agents in complex environments is notoriously unstable. This work proposes a game‑theoretic deep RL framework for CybORG that extends Nash Q‑learning with a centralized joint Q‑network (critic) and separate decentralized policies. The critic estimates joint state–action values to construct payoff matrices and compute Nash equilibria, while Blue and Red policies are trained by minimizing cross‑entropy to these equilibrium strategies under partial observability. By decoupling critic learning from policy updates, the method mitigates non‑stationarity and guides agents—especially the Blue defender—toward robust, equilibrium‑based behaviors against an adaptive attacker.
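
A minimal sketch of the equilibrium-driven update described above, assuming a zero-sum payoff matrix assembled from hypothetical joint-critic Q-values; the matrix, shapes, and numbers are illustrative and not the paper's implementation.

import numpy as np
from scipy.optimize import linprog

def solve_zero_sum(payoff):
    """Maximin mixed strategy of the row player for a zero-sum payoff matrix."""
    m, n = payoff.shape
    c = np.zeros(m + 1)
    c[-1] = -1.0                                      # maximize the game value v
    A_ub = np.hstack([-payoff.T, np.ones((n, 1))])    # v <= sum_i x_i * payoff[i, j] for all j
    b_ub = np.zeros(n)
    A_eq = np.append(np.ones(m), 0.0).reshape(1, -1)  # mixed strategy sums to one
    b_eq = np.array([1.0])
    bounds = [(0, 1)] * m + [(None, None)]
    res = linprog(c, A_ub, b_ub, A_eq, b_eq, bounds=bounds)
    return res.x[:m], res.x[-1]                       # (mixed strategy, game value)

# Hypothetical critic output for one state: Q[i, j] is Blue's value when Blue plays
# action i and Red plays action j (Red receives the negation in a zero-sum game).
Q = np.array([[ 0.3, -0.8],
              [-0.1,  0.5]])
blue_eq, value = solve_zero_sum(Q)         # Blue's equilibrium mixture
red_eq, _ = solve_zero_sum(-Q.T)           # Red solves the negated, transposed game

# Each decentralized policy is then pushed toward its equilibrium mixture with a
# cross-entropy loss, e.g. for Blue (policy probabilities are illustrative):
blue_policy = np.array([0.6, 0.4])
ce_loss = -np.sum(blue_eq * np.log(blue_policy + 1e-8))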

Auction Design for Cyber Operation Strategic Planning

Cyber defense operations increasingly require long-term strategic planning under uncertainty and resource constraints. This work proposes using combinatorial auctions to allocate bundles of defensive actions in a realistic cyber environment, with host-specific valuations derived from reinforcement learning (RL) Q-values. Because these Q-values encode long-term expected utility, they enable planning upstream of moment-to-moment defense decisions.
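
As a rough illustration, the snippet below values each candidate bundle of defensive actions by the sum of (hypothetical) per-host Q-values and runs a greedy winner-determination pass over disjoint bundles under a budget; a real combinatorial auction would solve this allocation exactly, and every host name and number here is invented for the example.

q_value = {"web01": 4.2, "db01": 6.8, "mail01": 1.5}   # long-term utility per host (from RL)

bundles = [                                            # (hosts covered, cost of the bundle)
    ({"web01", "db01"}, 3.0),
    ({"db01"}, 1.0),
    ({"mail01"}, 0.5),
]

def winner_determination(bundles, budget):
    """Greedy stand-in for the exact winner-determination problem: pick disjoint
    bundles by value density until the budget is exhausted."""
    scored = sorted(bundles,
                    key=lambda b: sum(q_value[h] for h in b[0]) / b[1],
                    reverse=True)
    chosen, covered, spent = [], set(), 0.0
    for hosts, cost in scored:
        if spent + cost <= budget and not (hosts & covered):
            chosen.append((hosts, cost))
            covered |= hosts
            spent += cost
    return chosen

allocation = winner_determination(bundles, budget=3.5)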

Nash Q-Network for Multi-agent Cybersecurity Simulation

Qintong Xie, Edward Koh, Xavier Cadet, Sang (Peter) Chin

Cite

@InProceedings{10.1007/978-3-032-08067-7_3,
author="Xie, Qintong
and Koh, Edward
and Cadet, Xavier
and Chin, Peter",
editor="Baras, John S.
and Papavassiliou, Symeon
and Tsiropoulou, Eirini Eleni
and Sayin, Muhammed O.",
title="Nash Q-Network for Multi-agent Cybersecurity Simulation",
booktitle="Game Theory and AI for Security",
year="2026",
publisher="Springer Nature Switzerland",
address="Cham",
pages="43--60",
abstract="Cybersecurity defense involves interactions between adversarial parties (namely defenders and hackers), making multi-agent reinforcement learning (MARL) an ideal approach for modeling and learning strategies for these scenarios. This paper addresses the challenge of simultaneous multi-agent training in complex environments and introduces a Nash Q-Network that facilitates learning in partially observed settings. We demonstrate the successful implementation of this algorithm in a notable complex cyber defense simulation treated as a two-player zero-sum Markov game setting. We propose the Nash Q-Network, which aims to learn Nash-optimal strategies that translate to robust defenses in cybersecurity settings. Our approach incorporates aspects of proximal policy optimization (PPO), deep Q-network (DQN), and the Nash-Q algorithm, addressing common challenges like non-stationarity and instability in multi-agent learning. The training process employs distributed data collection and carefully designed neural architectures for both agents and critics.",
isbn="978-3-032-08067-7"
}

Explore Reinforced: Equilibrium Approximation with Reinforcement Learning

Mateusz Nowak, Qintong Xie, Emma Graham, Ryan Yu, Michelle Yilin Feng, Roy Leibovitz, Xavier Cadet, Sang (Peter) Chin

Cite

@InProceedings{10.1007/978-3-032-08064-6_3,
author="Nowak, Mateusz
and Xie, Qintong
and Graham, Emma
and Yu, Ryan
and Feng, Michelle Yilin
and Leibovitz, Roy
and Cadet, Xavier
and Chin, Peter",
editor="Baras, John S.
and Papavassiliou, Symeon
and Tsiropoulou, Eirini Eleni
and Sayin, Muhammed O.",
title="Explore Reinforced: Equilibrium Approximation with Reinforcement Learning",
booktitle="Game Theory and AI for Security",
year="2026",
publisher="Springer Nature Switzerland",
address="Cham",
pages="42--60",
abstract="Current approximate Coarse Correlated Equilibria (CCE) algorithms struggle with equilibrium approximation for games in large stochastic environments. While these game-theoretic methods are theoretically guaranteed to converge to a strong solution concept, reinforcement learning (RL) algorithms have shown increasing capability in such environments but lack the equilibrium guarantees provided by game-theoretic approaches. In this paper, we introduce Exp3-IXRL, an equilibrium approximator that utilizes RL, specifically leveraging the agent's action selection, to update equilibrium approximations while preserving the integrity of both learning processes. We therefore extend the Exp3 algorithms beyond the stateless, non-stochastic settings. Empirically, we demonstrate improved performance in classic non-stochastic multi-armed bandit settings, capability in stochastic multi-armed bandits, and strong results in a complex and adversarial cybersecurity network environment.",
isbn="978-3-032-08064-6"
}
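
The abstract does not spell out how the RL agent's action selection is coupled to the bandit update, so the sketch below shows only the standard Exp3-IX building block that the paper extends, with a comment marking where, on one reading, the RL agent's choice would replace sampling from the Exp3 distribution; constants and the toy loss function are illustrative.

import numpy as np

def exp3_ix(n_arms, horizon, loss_fn, eta=0.1, gamma=0.05, seed=0):
    """Standard Exp3-IX: exponential weights over importance-weighted loss estimates
    with implicit exploration (the +gamma in the denominator)."""
    rng = np.random.default_rng(seed)
    cum_loss = np.zeros(n_arms)
    for t in range(horizon):
        probs = np.exp(-eta * (cum_loss - cum_loss.min()))
        probs /= probs.sum()
        arm = rng.choice(n_arms, p=probs)        # in Exp3-IXRL, the RL agent's action
                                                 # selection would be used here instead
        loss = loss_fn(arm, t)                   # observed loss in [0, 1]
        cum_loss[arm] += loss / (probs[arm] + gamma)
    probs = np.exp(-eta * (cum_loss - cum_loss.min()))
    return probs / probs.sum()                   # final mixed strategy

# Toy two-armed bandit in which arm 1 is better on average.
strategy = exp3_ix(2, 5000, lambda a, t: np.random.rand() * (0.9 if a == 0 else 0.4))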

PoolFlip: A Multi-agent Reinforcement Learning Security Environment for Cyber Defense

Xavier Cadet, Simona Boboila, Sie Hendrata Dharmawan, Alina Oprea, Sang (Peter) Chin

Cite

@InProceedings{10.1007/978-3-032-08064-6_9,
author="Cadet, Xavier
and Boboila, Simona
and Dharmawan, Sie Hendrata
and Oprea, Alina
and Chin, Peter",
editor="Baras, John S.
and Papavassiliou, Symeon
and Tsiropoulou, Eirini Eleni
and Sayin, Muhammed O.",
title="PoolFlip: A Multi-agent Reinforcement Learning Security Environment for Cyber Defense",
booktitle="Game Theory and AI for Security",
year="2026",
publisher="Springer Nature Switzerland",
address="Cham",
pages="172--192",
abstract="Cyber defense requires automating defensive decision-making under stealthy, deceptive, and continuously evolving adversarial strategies. The FlipIt game provides a foundational framework for modeling interactions between a defender and an advanced adversary that compromises a system without being immediately detected. In FlipIt, the attacker and defender compete to control a shared resource by performing a Flip action and paying a cost. However, the existing FlipIt frameworks rely on a small number of heuristics or specialized learning techniques, which can lead to brittleness and the inability to adapt to new attacks. To address these limitations, we introduce PoolFlip, a multi-agent gym environment that extends the FlipIt game to allow efficient learning for attackers and defenders. Furthermore, we propose Flip-PSRO, a multi-agent reinforcement learning (MARL) approach that leverages population-based training to train defender agents equipped to generalize against a range of unknown, potentially adaptive opponents. Our empirical results suggest that Flip-PSRO defenders are $2\times$ more effective than baselines at generalizing to a heuristic attack not exposed in training. In addition, our newly designed ownership-based utility functions ensure that Flip-PSRO defenders maintain a high level of control while optimizing performance.",
isbn="978-3-032-08064-6"
}
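
To make the population-based idea concrete, here is a toy PSRO-style loop on a drastically simplified FlipIt where both players use periodic flip strategies and best-response training is replaced by a search over candidate periods; this is an assumption-laden illustration, not the PoolFlip or Flip-PSRO interface.

import numpy as np

def ownership_payoff(defender_period, attacker_period, horizon=1000, flip_cost=0.05):
    """Defender utility in a toy FlipIt: fraction of time owned minus flip costs,
    with both players flipping periodically and the defender's flip applied last."""
    owner, owned, flips = 0, 0, 0                # 0 = defender owns, 1 = attacker owns
    for t in range(1, horizon + 1):
        if t % attacker_period == 0:
            owner = 1
        if t % defender_period == 0:
            owner = 0
            flips += 1
        owned += (owner == 0)
    return owned / horizon - flip_cost * flips / horizon

candidate_periods = [2, 5, 10, 20, 50]
attacker_pool = [10]                             # population starts from one heuristic attacker
defender_pool = []
for _ in range(3):                               # a few PSRO-style expansion rounds
    meta = np.ones(len(attacker_pool)) / len(attacker_pool)       # uniform meta-strategy
    best_defender = max(candidate_periods,       # search stands in for RL best-response training
                        key=lambda d: sum(w * ownership_payoff(d, a)
                                          for w, a in zip(meta, attacker_pool)))
    defender_pool.append(best_defender)
    attacker_pool.append(min(candidate_periods,  # attacker best-responds to the new defender
                             key=lambda a: ownership_payoff(best_defender, a)))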

Tree Search for Simultaneous Move Games via Equilibrium Approximation

Ryan Yu, Alex Olshevsky, Sang (Peter) Chin

Cite

@InProceedings{10.1007/978-3-032-08064-6_1,
author="Yu, Ryan
and Olshevsky, Alex
and Chin, Peter",
editor="Baras, John S.
and Papavassiliou, Symeon
and Tsiropoulou, Eirini Eleni
and Sayin, Muhammed O.",
title="Tree Search for Simultaneous Move Games via Equilibrium Approximation",
booktitle="Game Theory and AI for Security",
year="2026",
publisher="Springer Nature Switzerland",
address="Cham",
pages="3--22",
abstract="Neural network supported tree-search has shown strong results in a variety of perfect information multi-agent tasks. However, the performance of these methods on imperfect information games has generally been below competing approaches. Here we study the class of simultaneous-move games, a subclass of imperfect information games that is most similar to perfect information games: both agents know the game state with the exception of the opponent's move, which is revealed only after each agent makes its own move. Simultaneous move games include popular benchmarks such as Google Research Football and Starcraft Multi Agent Challenge. Our goal in this paper is to take tree search algorithms trained through self-play and adapt them to simultaneous move games without significant loss of performance. While naive ways to do this fail, we are able to achieve this by deriving a practical method that attempts to approximate a coarse correlated equilibrium as a subroutine within a tree search. Our algorithm, Neural Network-Coarse Correlated Equilibrium (NN-CCE), works on cooperative, competitive, and mixed tasks, and our results are better than the current best MARL algorithms on a wide range of accepted baselines.",
isbn="978-3-032-08064-6"
}
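
The paper's subroutine is not reproduced here; a standard way to approximate a coarse correlated equilibrium at a decision point is to let both players run a no-regret learner and take the empirical joint distribution of play, which the sketch below does with Hedge on a one-shot matrix game. Constants and the example payoffs are illustrative.

import numpy as np

def approx_cce(payoff_a, payoff_b, iters=2000, eta=0.1, seed=0):
    """Run Hedge (multiplicative weights) for both players; the empirical joint
    distribution of sampled play approximates a coarse correlated equilibrium."""
    rng = np.random.default_rng(seed)
    m, n = payoff_a.shape
    loss_a, loss_b = np.zeros(m), np.zeros(n)
    joint = np.zeros((m, n))
    for _ in range(iters):
        pa = np.exp(-eta * (loss_a - loss_a.min()))
        pa /= pa.sum()
        pb = np.exp(-eta * (loss_b - loss_b.min()))
        pb /= pb.sum()
        i, j = rng.choice(m, p=pa), rng.choice(n, p=pb)
        joint[i, j] += 1
        loss_a += -payoff_a[:, j]                # counterfactual losses vs. the realized action
        loss_b += -payoff_b[i, :]
    return joint / joint.sum()

# Matching pennies: both marginals of the returned joint distribution approach 50/50.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
cce = approx_cce(A, -A)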

Strategic Cyber Defense via Reinforcement Learning-Guided Combinatorial Auctions

Mai Pham, Vikrant Vaze, Sang (Peter) Chin

Cite

@INPROCEEDINGS{11196565,
author={Pham, Mai and Vaze, Vikrant and Chin, Peter},
booktitle={2025 IEEE High Performance Extreme Computing Conference (HPEC)},
title={Strategic Cyber Defense via Reinforcement Learning-Guided Combinatorial Auctions},
year={2025},
volume={},
number={},
pages={1-7},
abstract={Cyber defense operations increasingly require long-term strategic planning under uncertainty and resource constraints. We propose a new use of combinatorial auctions for allocating defensive action bundles in a realistic cyber environment, using host-specific valuations derived from reinforcement learning (RL) Q-values. These Q-values encode long-term expected utility, allowing upstream planning. We train CAFormer, a differentiable Transformer-based auction mechanism, to produce allocations that are approximately incentive-compatible under misreporting. Rather than benchmarking against existing agents, we explore the qualitative and strategic properties of the learned mechanisms. Compared to oracle and heuristic allocations, our method achieves competitive revenue while offering robustness to misreporting. In addition, we find that allocation patterns correlate with adversarial and defensive activity, suggesting implicit alignment with operational priorities. Our results demonstrate the viability of auction-based planning in cyber defense and highlight the interpretability benefits of RL-derived value structures.},
keywords={Training;Uncertainty;Reinforcement learning;Strategic planning;Transformers;Robustness;Resource management;Cost accounting;Optimization;Resilience;Cyber defense;strategic planning;mechanism design;differentiable optimization},
doi={10.1109/HPEC67600.2025.11196565},
ISSN={2643-1971},
month={Sep.},}
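
One way to read "approximately incentive-compatible under misreporting" is as low ex-post regret: no bidder gains much by deviating from a truthful report. The sketch below computes that regret for a black-box single-item mechanism and checks it on a toy second-price auction; the mechanism interface and numbers are assumptions, not the CAFormer API.

def ex_post_regret(mechanism, true_values, misreports):
    """A bidder's regret is its largest utility gain from misreporting while the
    others stay truthful; near-zero regret means approximately incentive-compatible."""
    regrets = []
    for i, v_i in enumerate(true_values):
        alloc, pay = mechanism(list(true_values))
        truthful = v_i * alloc[i] - pay[i]
        best = truthful
        for r in misreports:                     # candidate misreports for bidder i
            bids = list(true_values)
            bids[i] = r
            alloc_r, pay_r = mechanism(bids)
            best = max(best, v_i * alloc_r[i] - pay_r[i])
        regrets.append(best - truthful)
    return regrets

def second_price(bids):
    """Toy truthful benchmark: highest bid wins the single item, pays the second price."""
    winner = max(range(len(bids)), key=lambda i: bids[i])
    alloc = [1.0 if i == winner else 0.0 for i in range(len(bids))]
    pay = [sorted(bids)[-2] if i == winner else 0.0 for i in range(len(bids))]
    return alloc, pay

print(ex_post_regret(second_price, [3.0, 5.0, 1.0], misreports=[0.0, 2.0, 4.0, 6.0]))  # ~[0, 0, 0]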

Quantitative Resilience Modeling for Autonomous Cyber Defense

Xavier Cadet, Simona Boboila, Edward Koh, Sang (Peter) Chin, Alina Oprea

Cite

@article{cadet2025quantitative,
title={Quantitative Resilience Modeling for Autonomous Cyber Defense},
author={Cadet, Xavier and Boboila, Simona and Koh, Edward and Chin, Peter and Oprea, Alina},
journal={Reinforcement Learning Journal},
volume={6},
pages={894–908},
year={2025}
}

nFlip: Deep Reinforcement Learning in Multiplayer FlipIt

Reinforcement learning has shown much success in games such as chess, backgammon, and Go. However, in most of these games, agents have full knowledge of the environment at all times. We describe a deep learning model that successfully maximizes its score using reinforcement learning in a game with incomplete and imperfect information. We apply our model to FlipIt, a two-player game in which both players, the attacker and the defender, compete for ownership of a shared resource and only receive information on the current state upon making a move. Our model is a deep neural network combined with Q-learning and is trained to maximize the defender's time of ownership of the resource. We extend FlipIt to a larger action-spaced game with the introduction of a new lower-cost move and generalize the model to multiplayer FlipIt.
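
A minimal tabular stand-in for the deep Q-learning setup described above, in which the defender only tracks the time since its own last flip; the environment dynamics, reward shaping, and constants are invented for illustration and are far simpler than FlipIt proper.

import random
from collections import defaultdict

ALPHA, GAMMA, EPS, FLIP_COST = 0.1, 0.95, 0.1, 0.2
Q = defaultdict(lambda: [0.0, 0.0])              # state -> [value of wait, value of flip]

def step(time_since_flip, action, attacker_period=7):
    """Toy dynamics: the defender is assumed to still own the resource if it flipped
    recently enough, and pays FLIP_COST whenever it flips (action 1)."""
    owned = 1.0 if (action == 1 or time_since_flip < attacker_period) else 0.0
    reward = owned - (FLIP_COST if action == 1 else 0.0)
    next_state = 0 if action == 1 else min(time_since_flip + 1, 20)
    return next_state, reward

state = 0
for _ in range(20000):                           # epsilon-greedy Q-learning
    action = random.randrange(2) if random.random() < EPS else int(Q[state][1] > Q[state][0])
    nxt, reward = step(state, action)
    Q[state][action] += ALPHA * (reward + GAMMA * max(Q[nxt]) - Q[state][action])
    state = nxt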

Using Game Theory and Reinforcement Learning to Predict the Future

Baseball is a well-known, repeated, finite, adversarial, stochastic game with a massive amount of available data. Reinforcement learning (RL) models, on the other hand, take significant time and resources to train. By fusing game theory and RL, we answer questions such as: given a video of a pitch, can we compute the utility of that pitch from its desired location, its resulting location, and the game setting?

Deep Reinforcement Learning for FlipIt Security Game

Cite

@inproceedings{greige_deep_2022,
abstract = {Reinforcement learning has shown much success in games such as chess, backgammon and Go [21, 22, 24]. However, in most of these games, agents have full knowledge of the environment at all times. In this paper, we describe a deep learning model in which agents successfully adapt to different classes of opponents and learn the optimal counter-strategy using reinforcement learning in a game under partial observability. We apply our model to FlipIt [25], a two-player security game in which both players, the attacker and the defender, compete for ownership of a shared resource and only receive information on the current state of the game upon making a move. Our model is a deep neural network combined with Q-learning and is trained to maximize the defender's time of ownership of the resource. Despite the noisy information, our model successfully learns a cost-effective counter-strategy outperforming its opponent's strategies and shows the advantages of the use of deep reinforcement learning in game theoretic scenarios. We also extend FlipIt to a larger action-spaced game with the introduction of a new lower-cost move and generalize the model to n-player FlipIt.},
address = {Cham},
author = {Greige, Laura and Chin, Peter},
booktitle = {Complex Networks \& Their Applications X},
editor = {Benito, Rosa Maria and Cherifi, Chantal and Cherifi, Hocine and Moro, Esteban and Rocha, Luis M. and Sales-Pardo, Marta},
isbn = {978-3-030-93409-5},
pages = {831--843},
publisher = {Springer International Publishing},
title = {Deep Reinforcement Learning for FlipIt Security Game},
year = {2022}
}