Cybersecurity defense is inherently adversarial, making multi‑agent reinforcement learning a natural fit, but simultaneous training of competing agents in complex environments is notoriously unstable. This work proposes a game‑theoretic deep RL framework for CybORG that extends Nash Q‑learning with a centralized joint Q‑network (critic) and separate decentralized policies. The critic estimates joint state–action values to construct payoff matrices and compute Nash equilibria, while Blue and Red policies are trained by minimizing cross‑entropy to these equilibrium strategies under partial observability. By decoupling critic learning from policy updates, the method mitigates non‑stationarity and guides agents—especially the Blue defender—toward robust, equilibrium‑based behaviors against an adaptive attacker.
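To make the described architecture concrete, the following is a minimal sketch, not the paper's implementation: it assumes a two-player zero-sum Blue-vs-Red stage game with small discrete action sets, a flat state vector standing in for CybORG observations, and a linear-programming solver for the equilibrium of the critic's payoff matrix. Names such as `JointCritic`, `Policy`, `solve_zero_sum_nash`, and all dimensions are illustrative assumptions.

```python
# Hedged sketch: centralized joint critic + decentralized policies trained by
# cross-entropy to the Nash equilibrium of the critic's payoff matrix.
# All sizes, names, and the zero-sum assumption are illustrative.
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from scipy.optimize import linprog

N_BLUE, N_RED, STATE_DIM = 4, 3, 16  # hypothetical action/state sizes

class JointCritic(nn.Module):
    """Centralized critic: maps a global state to a joint payoff matrix
    Q(s, a_blue, a_red), expressed from Blue's perspective (zero-sum)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 64), nn.ReLU(),
            nn.Linear(64, N_BLUE * N_RED))
    def forward(self, s):
        return self.net(s).view(-1, N_BLUE, N_RED)

class Policy(nn.Module):
    """Decentralized policy: maps an agent's (partial) observation to a
    log-distribution over its own actions."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
    def forward(self, obs):
        return F.log_softmax(self.net(obs), dim=-1)

def solve_zero_sum_nash(payoff):
    """Maximin LP for the row player on a zero-sum payoff matrix; the column
    player's strategy is obtained from the negated, transposed matrix."""
    def maximin(M):
        m, n = M.shape
        # variables [p_1..p_m, v]: maximize v  <=>  minimize -v
        c = np.zeros(m + 1); c[-1] = -1.0
        A_ub = np.hstack([-M.T, np.ones((n, 1))])   # v - p^T M[:, j] <= 0
        b_ub = np.zeros(n)
        A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])  # sum p_i = 1
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                      bounds=[(0, None)] * m + [(None, None)])
        return res.x[:m]
    blue_strategy = maximin(payoff)        # Blue maximizes the payoff
    red_strategy = maximin(-payoff.T)      # Red maximizes the negated payoff
    return blue_strategy, red_strategy

# One hypothetical policy-update step: solve the critic's stage game, then
# minimize cross-entropy of each policy to its equilibrium mixed strategy.
# Critic learning (TD on joint Q-values) would run separately, decoupled
# from these policy updates, as the abstract describes.
critic = JointCritic()
blue_pi, red_pi = Policy(STATE_DIM, N_BLUE), Policy(STATE_DIM, N_RED)
opt = torch.optim.Adam(list(blue_pi.parameters()) + list(red_pi.parameters()), lr=1e-3)

state = torch.randn(1, STATE_DIM)     # stand-in global state
blue_obs, red_obs = state, state      # partial observations omitted for brevity
with torch.no_grad():
    payoff = critic(state)[0].numpy()
pi_b, pi_r = solve_zero_sum_nash(payoff)

loss = (-(torch.tensor(pi_b, dtype=torch.float32) * blue_pi(blue_obs)[0]).sum()
        - (torch.tensor(pi_r, dtype=torch.float32) * red_pi(red_obs)[0]).sum())
opt.zero_grad(); loss.backward(); opt.step()
```

In this sketch the equilibrium strategies act as fixed (non-differentiated) targets, so gradients flow only into the policy networks; this is one simple way to realize the decoupling of critic learning from policy updates that the abstract attributes to the method.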