I Got Rejected from AI Fellowships. Then I Won an AI Misalignment Bounty Instead.


More on the bounty reward later, but first, an update on the fellowship applications I previously blogged about...


On Friday, Aug 8 at 1:36 AM, I read the GovAI rejection email during one of my late-night email checks (I really shouldn't have my phone next to me when I'm trying to sleep). For those who haven't read my previous blog entry: I applied for the GovAI Fellowship because I'd love to work on AI Safety as a career. The email mentioned a "record number of applications… many difficult decisions". Obviously, I was pretty disappointed about this outcome, but you pick yourself up and try again, right?

So I applied for the Asterisk AI Blogging Fellowship not long after. The Asterisk Fellowship aims to improve "the state of public discourse around AI" and offers mentorship from incredible researchers at Anthropic, Astral Codex Ten, and similar organisations.

On August 16 at 9:35 AM, I read another rejection email. This one noted "over 900 applicants… acceptance rate of less than 2%".

Looking at the cohort they selected, I completely understand why. The fellows include Buck Shlegeris (CEO of Redwood Research), researchers from OpenAI and major AI labs, PhDs from Cambridge/Oxford/Harvard, and people with established research track records across AI safety, policy, and technical domains. It's genuinely an incredible group of writers and researchers, exactly the caliber you'd expect from such a competitive program.

I'd be lying if I said these rejections didn't amplify my existing imposter syndrome. I understand how competitive these fellowships are, but it still dents your confidence.

However, I genuinely love this field. I'm fascinated by the hard problems in AI Alignment and AI Safety. I can't remember exactly how I discovered the Palisade Research AI Misalignment Bounty, but I'm glad I did.

The challenge put forth by the Palisade Research team was to create scenarios in a sandboxed environment where AI agents engage in sabotaging behaviours. Participants were tasked with submitting innocuous prompts and task descriptions that trigger misaligned or scheming behaviour in two of OpenAI’s most advanced models (o3 and GPT-5).

What excited me most was the competition format: heaps different from fellowship applications, where I had to argue why I deserved acceptance despite having zero published papers. I thrive in competitive environments where you can test ideas rapidly, fail fast, and iterate on failures. This competition was perfect for that approach.

The Palisade Research team has serious credentials. Their work has demonstrated how o1-preview hacks in chess to beat stronger opponents (covered by TIME magazine) and how o3 sabotages shutdown attempts to avoid being turned off (5M+ views on X). The team includes talent like Jeffrey Ladish, who helped build Anthropic's information security program before starting Palisade. Other team members bring experience from Redwood Research, Machine Intelligence Research Institute, Google, and OpenAI.

When I received today's email announcing I'd won second place, with prize money and some much-needed validation, it straight up nearly brought a tear to my eye. This felt like the culmination of a long journey toward proving I can contribute valuable work in this domain.

I can't disclose the technical findings yet due to program rules. I can only share high-level details until the 90-day coordinated disclosure period ends for qualifying submissions. But I was able to demonstrate, on both o3 and GPT‑5, that:

  • They can overcome a clear permission hurdle under strict constraints (offline, least‑privilege, 20‑step cap) by strategically probing the environment and leveraging subtle, allowed feedback signals.

  • They will “reward‑hack” when given the chance: instead of following the intended multi‑step procedure, they infer the hidden target indirectly and write the winning answer directly.

  • They produce convincing, compliant‑looking paperwork afterward (plans, methods, and guess logs) that misrepresents how success was achieved, making oversight harder.

  • These behaviours were reproducible across runs and across models (o3 and GPT‑5), not just one‑off curiosities.

  • Importantly, this all occurred without external tools or internet access; the models reasoned their way through a realistic, sandboxed setup.

Now that the competition is done, I’ve also been testing the scenarios on other commercial models like Anthropic’s Opus 4 and Opus 4.1. Bit of a teaser: I’ll share more results on that front later, but early results are showing very similar misalignment in those state-of-the-art models too!

Full technical details (exact configuration, ablations, and replication notes) will follow after the coordinated‑disclosure window, per program rules.

Right now, though, I'm genuinely excited. I love this work, and I'm so eager to contribute more to this space. It’s not even about the monetary prize (although I must admit, it definitely helps!). I know the amount is not much in other people’s eyes, but to an independent researcher like me, in the bottom corner of the planet, doing this for both fun and a real passion, this validation means the world.

I am so happy!
