Get Jobs Tailored to Your Resume

Filtr uses AI to scan 1000+ jobs and finds postings that perfectly matches your resume

AI Safety Research Intern-1

Centific

$30 - $50

Sep 22, 2025

Seattle, WA, USA

Centific is looking to solve the problem of ensuring safe and responsible AI deployment by developing robust security and trustworthy LLM systems that resist adversarial and behavioral exploits. This involves tackling cutting-edge AI safety across adversarial robustness, jailbreak defense, agentic workflows, and human-in-the-loop risk modeling.

Requirements

Strong Python and PyTorch/JAX skills; comfort with toolkits for language models, benchmarking, and simulation.
Demonstrated research in at least one of: LLM jailbreak attacks/defense, agentic AI safety, human-AI interaction vulnerabilities.
Proven ability to go from concept → code → experiment → result, with rigorous tracking and ablation studies.
Experience in adversarial prompt engineering, jailbreak detection (narrative, obfuscated, sequential attacks).
Prior work on multi-agent architectures or robust defense strategies for LLMs.
Familiarity with red-teaming, synthetic behavioral data, and regulatory safety standards.
Scalable training and deployment: Ray, distributed evaluation, CI/telemetry for defense protocols.

Responsibilities

Design, implement, and evaluate attack and defense strategies for LLM jailbreaks (prompt injection, obfuscation, narrative red teaming).
Analyze and simulate human-AI interaction patterns to uncover behavioral vulnerabilities, social engineering risks, and over-defensive vs. permissive response tradeoffs.
Prototype workflows for multi-agent safety (e.g., agent self-checks, regulatory compliance, defense chains) that span perception, reasoning, and action.
Create reproducible evaluation protocols/KPIs for safety, over-defensiveness, adversarial resilience, and defense effectiveness across diverse models (including latest benchmarks and real-world exploit scenarios).
Package research into robust, monitorable AI services using modern stacks (Kubernetes, Docker, Ray, FastAPI); integrate safety telemetry, anomaly detection, and continuous red-teaming.
Systematically red-team advanced LLMs (GPT-4o, GPT-5, LLaMA, Mistral, Gemma, etc.), uncovering novel exploits and defense gaps.
Implement context-aware, multi-turn attack detection and guardrail mechanisms, including countermeasures for obfuscated prompts (e.g., StringJoin, narrative exploits).

Other

Ph.D. student in CS/EE/ML/Security (or related); actively publishing in AI Safety, NLP robustness, or adversarial ML (ACL, NeurIPS, BlackHat, IEEE S&P, etc.).
Public code artifacts (GitHub) and first-author publications or strong open-source impact.
A publishable outcome (with company approval) or production-ready module measurably improving safety KPIs: adversarial robustness, over-defensiveness, and incident response latency.
Clean, reproducible code with documented ablations and end-to-end rerun reports for safety benchmarks.
A demo that communicates capabilities, limits, and next steps in defense and security assurance.