LLM Red Teaming & Safety | Homa Hosseinmardi

This project focuses on systematically evaluating the safety, robustness, and alignment of large language models through red teaming methodologies. We develop novel techniques to probe LLM vulnerabilities and assess potential risks in real-world deployment scenarios.

Our red teaming approaches combine automated adversarial prompting with human-in-the-loop evaluation to comprehensively assess LLM behavior across diverse contexts and potential misuse scenarios.

Key research areas include:

Adversarial prompt engineering for eliciting harmful or biased outputs
Systematic safety evaluation across different model architectures and training paradigms
Robustness testing under distribution shift and edge cases
Alignment assessment between stated objectives and actual model behavior

Our work contributes to the broader effort of ensuring AI safety and developing responsible deployment practices for large-scale language models.