Hacker News
Ask HN: Advice on setting up adversarial programming challenge?
1 point by HanClinto 8 months ago
I have been very intrigued lately by the results shown in the SPAG paper [1], and I would like to take the next step in research.

I would like to set up an adversarial training exercise for LLMs around writing secure / robust code.

For those unfamiliar with the SPAG paper, the idea is to train an LLM with reinforcement learning on an adversarial language game. As their results show, teaching an LLM to play a language game (such as Taboo, in their paper) through reinforcement learning increases the model's ability to solve other human reasoning tasks (compared to traditional imitation learning).

Rough idea:

Setup: The defender is responsible for writing code according to a specification and a set of (hidden) unit tests that evaluate the code for validity and completeness. This is the standard setup for LeetCode and other coding-challenge websites.

The attacker is responsible for calling the code in such a way that it crashes, fails, or is exploited in some fashion. For instance, if the attacker's goal is to make the defender's code crash, then as long as the stack trace originates inside the defender's code (and not the attacker's code), it's a victory for the attacker.
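To make the win condition concrete, here is a minimal sketch of how a judge could score one round under these rules. Everything here is hypothetical (the function names `judge_round` and `defender_mean` are illustrative, not from the SPAG paper): the attacker wins only if the traceback shows the exception was raised inside the defender's own code, not in the harness.

```python
import sys
import traceback

def judge_round(defender_fn, attacker_args):
    """Score one round: run the defender's function on the attacker's input.

    The attacker wins if an uncaught exception originates inside the
    defender's code; otherwise the defender survives the round.
    """
    try:
        defender_fn(*attacker_args)
        return "defender"
    except Exception:
        # Walk the traceback: the crash must come from the defender's
        # frame, not from the judging harness or attacker-side code.
        frames = traceback.extract_tb(sys.exc_info()[2])
        crashed_in_defender = any(f.name == defender_fn.__name__ for f in frames)
        return "attacker" if crashed_in_defender else "defender"

# Example defender that forgot to guard against the empty-list edge case.
def defender_mean(xs):
    return sum(xs) / len(xs)  # ZeroDivisionError when xs is empty

print(judge_round(defender_mean, ([1, 2, 3],)))  # defender survives
print(judge_round(defender_mean, ([],)))         # attacker wins
```

In a real harness you would run the defender's code in a sandboxed subprocess with resource limits, but the traceback-inspection idea is the same.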

Is there a better way to set up an automated red-team vs. blue-team, CTF-style challenge around LeetCode-type coding problems? If you wanted to set up an adversarial challenge to train the defender to write more robust code, how would you do it?

[1] - https://github.com/Linear95/SPAG
