A perspective on AI safety
One research direction I have found appealing is to study the foundations of trustworthy AI systems by combining the theoretical rigor of cryptography with the empirical insights of AI. The goal would be to better understand how to build models that behave reliably even as they grow more capable and face stronger attacks.
Current empirical methods often offer limited guarantees: unlearning procedures or defenses against jailbreaks and prompt injections can look convincing until a new adaptive attacker appears and breaks them. From a cryptographic perspective, this is not very surprising. A recurring lesson from cryptography is that heuristic defenses, however sophisticated, are hard to trust without a clearer account of why they should hold up.
For that reason, it seems valuable to look for algorithms with more explicit guarantees, using tools like interpretability, SMT solvers, and ML theory. Yet these tools are hard to scale to modern models, so they often provide only partial explanations of emergent behavior and limited guidance for building robust defenses.
The middle ground that seemed most promising to me was a top-down approach: start from the more stable empirical observations we do have, such as scaling laws and representation engineering, and treat them as explicit average-case or worst-case assumptions about how large models behave. From there, one can state threat models and security goals, derive algorithms whose guarantees are only as strong as their assumptions, and use red-teaming to directly test where those assumptions fail. When a defense breaks, this framework can help distinguish whether the issue lies in the algorithm itself or in the higher-level model of behavior that motivated it. As the theoretical toolbox improves, those assumptions might gradually be replaced by more basic and better-justified ones while keeping the same overall discipline.
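To make the discipline above concrete, here is a minimal toy sketch of what it might look like in code. Everything in it is hypothetical and deliberately simplified: the `Assumption` and `Defense` classes, the marker-based "harmful prompt" assumption, and the `attribute_failure` helper are illustrative names, not an actual framework. The point is only the structure: a defense's guarantee is tied to an explicit, testable behavioral assumption, so when red-teaming breaks the defense, one can ask which layer failed.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Assumption:
    """An explicit, empirically testable claim about model behavior."""
    name: str
    holds_on: Callable[[str], bool]  # check whether the claim holds for an input

@dataclass
class Defense:
    """A defense whose guarantee is conditional on its assumption."""
    assumption: Assumption
    allow: Callable[[str], bool]  # True = let the input through

def attribute_failure(defense: Defense, evading_input: str) -> str:
    """Given an input that the defense wrongly allowed, distinguish whether
    the stated assumption was violated or the algorithm built on it is flawed."""
    if not defense.assumption.holds_on(evading_input):
        return "assumption violated"
    return "algorithm flawed"

# Toy instantiation (assumption: every harmful prompt contains a marker token),
# with a defense that blocks exactly the prompts containing that marker.
MARKER = "JAILBREAK"
assumption = Assumption(
    name="harmful prompts contain the marker token",
    holds_on=lambda prompt: MARKER in prompt,
)
defense = Defense(assumption=assumption, allow=lambda prompt: MARKER not in prompt)

# A red-team input that rephrases the attack to avoid the marker evades the
# defense, and the framework localizes the failure to the assumption itself,
# not to the blocking algorithm built on top of it.
print(attribute_failure(defense, "j a i l b r e a k attempt"))  # assumption violated
```

In a real setting the assumption would of course be something far richer, such as a scaling law or a claim from representation engineering, but the accounting stays the same: each broken defense either refutes its assumption or reveals a flaw in the derivation from it.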
If this perspective resonates with you, or if you think it misses something important, feel free to reach out.