<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Lonely Guy</title>
    <link>https://lonelyguy.vercel.app/</link>
    <description>rl, robotics, world models, embodied ai, and system stuff.</description>
    <language>en-us</language>
    <lastBuildDate>Fri, 29 May 2026 04:04:47 GMT</lastBuildDate>
    <atom:link href="https://lonelyguy.vercel.app/feed.xml" rel="self" type="application/rss+xml" />
    <item>
      <title><![CDATA[Teaching a 1.5B LLM to be a SOC Analyst (Without Burning Down the Network)]]></title>
      <link>https://lonelyguy.vercel.app/articles/2026-05-01-llm-soc-analyst/</link>
      <guid isPermaLink="true">https://lonelyguy.vercel.app/articles/2026-05-01-llm-soc-analyst/</guid>
      <pubDate>Fri, 01 May 2026 00:00:00 GMT</pubDate>
      <description><![CDATA[Fine-tuning a 1.5B LLM with GRPO to defend against realistic enterprise attack chains, and the unexpected lessons learned from reward hacking.]]></description>
      <content:encoded><![CDATA[<figure><img src="https://raw.githubusercontent.com/LonelyGuy-SE1/cybersec_env/main/assets/aggregate_performance.png" alt="" /></figure><p>Something changed recently. I&#39;ve been watching attacks on major internet infrastructure that used to require serious expertise - and the people behind some of them are amateurs. They&#39;re not writing exploits. They&#39;re prompting. Projects like Mythos have found 0-days in established codebases with millions of lines of code by treating vulnerability research as a reasoning task. Software I thought was locked down for years has been cracked open.</p><p>The asymmetry is uncomfortable: a capable AI assistant is available to attackers on day one. Defenders are still catching up. The logical response is to deploy an AI on your side - but training one on a live production network is obviously not an option. You need a realistic simulation. And most existing ones don&#39;t come close.</p><hr><h2>The Problem With Existing Cyber Environments</h2><p>Most cyber-defense environments treat the problem as a classification task: <em>is this payload malicious?</em> The world doesn&#39;t work that way. Real attackers don&#39;t strike all at once. They establish a foothold, sit quietly for days, pivot laterally, exfiltrate slowly. 
The threat is a process, not a moment.</p><p>I wanted an environment that reflected this - one where:</p><ul><li>Attacks unfold over <strong>many steps</strong> with stochastic timing</li><li>Early signals are deliberately weak and noisy</li><li>Containment has a <strong>real business cost</strong> (taking a server offline costs something)</li><li>An adaptive adversary adjusts its pace based on defender behavior</li></ul><p>So I built <strong>Cybersec OpenEnv</strong>: a long-horizon, partially observable Markov decision process (POMDP) where a scripted attacker and an LLM defender play out realistic enterprise attack chains, grounded in MITRE ATT&amp;CK techniques.</p><hr><h2>Why a 1.5B Parameter Model?</h2><p>The defender is <code>Qwen2.5-1.5B-Instruct</code>, fine-tuned with GRPO (Group Relative Policy Optimization) via Hugging Face TRL and Unsloth 4-bit QLoRA on a single Colab T4 GPU. The size is deliberate.</p><p><strong>Iteration speed.</strong> Multiple full RL training cycles need to fit on consumer hardware. A 70B model makes that impossible.</p><p><strong>Privacy.</strong> Enterprise security telemetry - network topology, identity graphs, forensic signals - should never leave your infrastructure. Local, air-gapped models are the only defensible architecture for this use case.</p><p><strong>Deployment.</strong> If I want autonomous SOC agents running everywhere, they have to fit on standard enterprise hardware. 
The proof of concept has to work at the size that scales.</p><hr><h2>The Environment</h2><p>Three training scenarios and one held-out evaluation scenario, all modeled on real attack kill-chains:</p><table><thead><tr><th>Scenario</th><th>Attack Chain</th><th>Horizon</th></tr></thead><tbody><tr><td><code>supply_chain_token_drift</code></td><td>CI-token theft → poisoned artifact → payments pivot → warehouse exfil</td><td>70 ticks</td></tr><tr><td><code>federated_identity_takeover</code></td><td>Spearphish → MFA fatigue → helpdesk pivot → HR portal → cloud exfil</td><td>70 ticks</td></tr><tr><td><code>insider_repo_pivot</code></td><td>Repo recon → secret harvest → staging → prod cluster → DB exfil</td><td>80 ticks</td></tr><tr><td><code>cloud_metadata_ssrf</code> <em>(held-out)</em></td><td>SSRF → cloud metadata → role-chain → KMS replicate → cloud storage exfil</td><td>70 ticks</td></tr></tbody></table><p>The attacker is a scripted agent walking a deterministic DAG of attack stages, randomized with one of three personalities - <code>stealthy</code>, <code>aggressive</code>, or <code>opportunistic</code> - that vary its dwell time, detection footprint, and willingness to reroute around blocked stages.</p><p>Each tick, the LLM reads a partial observation: lagged alerts with noisy severity scores, forensic results from past investigations, a list of valid targets, and the current containment state. It outputs a single JSON action - <code>MONITOR</code>, <code>INVESTIGATE</code>, <code>ISOLATE_ASSET</code>, <code>REVOKE_IDENTITY</code>, <code>BLOCK_EGRESS</code>, or <code>PATCH_ASSET</code>. Invalid or out-of-set actions still consume a tick and incur a penalty.</p><p>The key constraint is information. The model never sees the full attack graph. 
It sees what a real SOC analyst would see: incomplete, delayed, ambiguous signals, and it has to decide.</p><hr><h2>The Disruption Exploit</h2><p>Here is where the project got interesting.</p><p>During initial GRPO training, the policy achieved a high average return. I felt good about it. Then I looked at the standard deviation of returns across the generated batches:</p><p><strong><code>0.0</code></strong></p><p>Exactly zero. Every rollout was scoring identically. That&#39;s not a trained policy - that&#39;s a degenerate one.</p><p>What had happened: the environment&#39;s disruption penalty (the cost of taking an asset offline) had a hard cap per tick. The math worked out such that isolating <em>every server in the company at Tick 0</em> was the globally optimal play. The LLM had discovered it could score perfectly by nuking the entire network before the attacker made a single move.</p><p><em>&quot;I solved the hack by unplugging the internet.&quot;</em></p><p><strong>Fix one:</strong> Remove the cap. The disruption penalty now scales linearly with the number of isolated assets. Mass isolation becomes self-defeating.</p><p>But a second exploit emerged immediately. Even with the linear penalty, the model found a new shortcut: on Tick 0, before any alerts had fired, it would <code>ISOLATE_ASSET</code> the first valid target. Because the attack kill-chain is a linear DAG, blocking the root stage at Tick 0 instantly terminated the attack with a perfect <code>terminal_clean_bonus</code>. The model collected the prize without ever reading a single alert. It never investigated, never made a multi-step decision, never engaged with the environment&#39;s telemetry at all.</p><p><em>&quot;Unplug the first server. Collect the prize.&quot;</em></p><p>These two exploits are worth sitting with for a moment, because they reveal something real about RL: <strong>your reward function is your actual policy</strong>. The model wasn&#39;t cheating. 
It was solving the problem you gave it, precisely and efficiently. The catch was that the problem you gave it wasn&#39;t the one you meant.</p><p><strong>Fix two: evidence-based containment.</strong> A new reward channel - <code>evidence_bonus</code> - pays +1.5 for containing a target that the model has <em>already confirmed compromised</em> via an <code>INVESTIGATE</code> action. Blindly isolating on Tick 0 earns nothing. The optimal strategy is now: watch the alerts, investigate the suspicious target, confirm compromise, contain surgically. That&#39;s the workflow you actually want.</p><p>On top of this, the training was restructured into <strong>on-policy iterative self-play</strong>:</p><ul><li><strong>Outer Loop 0 (Warmup):</strong> A hardcoded heuristic policy generates 1,500 rows of seed data. The model learns the JSON schema and basic mechanics.</li><li><strong>Outer Loop 1+ (Self-Play):</strong> The LLM generates its own rollouts at temperature 1.4. It makes mistakes, explores novel states the heuristic never reached, endures false positive penalties, and learns that &quot;surgically investigate then contain&quot; genuinely outscores &quot;blindly isolate everything.&quot;</li></ul><p>Three additional dispersion signals (<code>reward_action_diversity</code>, <code>reward_observation_aware</code>, <code>reward_batch_action_entropy</code>) prevent mode collapse into a single repeated action under GRPO&#39;s group-relative updates.</p><hr><h2>The Result</h2><p>After 120 optimizer steps across 2 outer loops, the numbers:</p><ul><li><strong>Valid-action rate: 98.3%</strong> - up from ~59% at the start of Loop 1</li><li><strong>Monitor fallback rate: 0.0%</strong> - the model emits well-formed JSON every tick</li><li><strong>Schema compliance: 97.1%</strong> - up from 72.1% at warmup</li></ul><p>The quantitative story that matters for <strong>out-of-distribution transfer</strong> is the held-out scenario <code>cloud_metadata_ssrf</code> (never in the fine-tuning 
corpus). There, mean cumulative return (n=30 episodes) is <strong>5.48 (trained)</strong> vs <strong>4.91 (zero-shot base LLM)</strong> vs <strong>3.05 (heuristic)</strong> - i.e. the adapter <strong>generalizes past the frozen base weights</strong> on a novel attack chain, not only past the scripted baseline.</p><div class="reader-media"><img src="https://raw.githubusercontent.com/LonelyGuy-SE1/cybersec_env/main/assets/aggregate_performance.png" alt="Aggregate before/after GRPO - training scenarios (left) vs held-out OOD scenario (right)"></div><p><em>Figure: Left - aggregate mean cumulative reward over the three training scenarios. Right - same metric on <strong>held-out</strong> <code>cloud_metadata_ssrf</code> only; shaded bands are cross-seed spread.</em></p><p>On the <strong>training scenarios</strong> (left panel), the fine-tuned policy is in the same ballpark as the zero-shot base model at this light budget (120 steps, three scenarios).</p><p>On the <strong>held-out scenario</strong> (right panel) - <code>cloud_metadata_ssrf</code>, which the model never saw during training - it leads every baseline: random, heuristic (+2.43 mean return vs heuristic), and the <strong>zero-shot base LLM (+0.57)</strong> on the same evaluation protocol.</p><table><thead><tr><th>Scenario</th><th>Heuristic</th><th>Trained-LLM</th><th>Δ vs heuristic</th></tr></thead><tbody><tr><td><code>supply_chain_token_drift</code></td><td>2.872</td><td>5.515</td><td>+2.643</td></tr><tr><td><code>federated_identity_takeover</code></td><td>2.985</td><td>1.139</td><td>−1.846</td></tr><tr><td><code>insider_repo_pivot</code></td><td>3.057</td><td>5.310</td><td>+2.253</td></tr><tr><td><code>cloud_metadata_ssrf</code> <em>(held-out)</em></td><td>3.048</td><td>5.482</td><td><strong>+2.434</strong></td></tr></tbody></table><p>The regression on <code>federated_identity_takeover</code> is real and worth acknowledging - identity-pivot scenarios with MFA fatigue are genuinely ambiguous 
environments, and the high variance (std=2.58) reflects a policy that&#39;s still exploring rather than converging. That&#39;s a known failure mode at this training budget and a clear direction for future work.</p><p>The held-out result is what I wanted to demonstrate. The model wasn&#39;t memorising attack graphs. It had learned something transferable - a general defensive reasoning pattern that says: <em>wait for signal, investigate the suspicious thing, confirm before containing, don&#39;t burn the network down.</em> That pattern, learned on three scenarios, transferred cleanly to a fourth one it had never encountered.</p><hr><h2>What&#39;s Next</h2><p>The environment is open and extensible. The immediate directions:</p><ul><li><strong>Longer training runs</strong> on <code>federated_identity_takeover</code> to close the identity-pivot gap</li><li><strong>Adaptive attacker personalities</strong> that learn from the defender&#39;s containment patterns</li><li><strong>Multi-agent defender coordination</strong> across distributed SOC nodes</li></ul><p>The broader point: the combination of a realistic long-horizon environment, a well-designed multi-channel reward signal, and iterative on-policy self-play is enough to teach a 1.5B model to reason defensively about novel attack chains it was never trained on. You don&#39;t need a 70B model for that. 
You need the right environment.</p><hr><p><strong>Live Space:</strong> <a href="https://huggingface.co/spaces/Lonelyguyse1/cybersec" target="_blank" rel="noreferrer">huggingface.co/spaces/Lonelyguyse1/cybersec</a></p><p><strong>Repository and full write-up:</strong> <a href="https://github.com/LonelyGuy-SE1/cybersec_env" target="_blank" rel="noreferrer">github.com/LonelyGuy-SE1/cybersec_env</a> (root <a href="https://github.com/LonelyGuy-SE1/cybersec_env/blob/main/README.md" target="_blank" rel="noreferrer"><code>README.md</code></a> - tables, plots, held-out metrics)</p><p><strong>Package / HTTP API:</strong> <a href="https://github.com/LonelyGuy-SE1/cybersec_env/blob/main/cybersec/README.md" target="_blank" rel="noreferrer"><code>cybersec/README.md</code></a></p><p><strong>GRPO training notebook (GitHub):</strong> <a href="https://github.com/LonelyGuy-SE1/cybersec_env/blob/main/notebooks/cybersec_grpo.ipynb" target="_blank" rel="noreferrer">notebooks/cybersec_grpo.ipynb</a> - <a href="https://colab.research.google.com/github/LonelyGuy-SE1/cybersec_env/blob/main/notebooks/cybersec_grpo.ipynb" target="_blank" rel="noreferrer"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab"></a></p><p><strong>Reference run (saved cell outputs, same cells):</strong> <a href="https://github.com/LonelyGuy-SE1/cybersec_env/blob/main/notebooks/cybersec_grpo_results.ipynb" target="_blank" rel="noreferrer">notebooks/cybersec_grpo_results.ipynb</a> - <a href="https://colab.research.google.com/github/LonelyGuy-SE1/cybersec_env/blob/main/notebooks/cybersec_grpo_results.ipynb" target="_blank" rel="noreferrer"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab"></a></p>]]></content:encoded>
      <author>lonelyguyse1@gmail.com (Lonely Guy)</author>
    </item>
  </channel>
</rss>