UofTCSSLab's Collections

SIREN

SIREN is a lightweight guard model that detects harmful content using the internal representations of LLMs.
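This page does not say how SIREN reads those representations. The general idea can be shown with a linear probe: a logistic-regression classifier trained on hidden-state vectors that outputs a harm probability. This is a minimal sketch only. The vectors are synthetic stand-ins, not real LLM activations. The names make_synthetic_states, train_probe, and harm_score are hypothetical. None of this is SIREN's actual architecture or API.

```python
# Hypothetical sketch of a linear probe over hidden states; NOT SIREN's implementation.
import math
import random


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_synthetic_states(n_per_class: int, dim: int, seed: int = 0):
    """Hypothetical stand-in for LLM hidden states: benign (label 0) and
    harmful (label 1) examples are Gaussian clouds separated along the
    all-ones direction."""
    rng = random.Random(seed)
    features, labels = [], []
    for label, mean in ((0, -0.75), (1, 0.75)):
        for _ in range(n_per_class):
            features.append([rng.gauss(mean, 1.0) for _ in range(dim)])
            labels.append(label)
    return features, labels


def harm_score(w, b, x) -> float:
    """Probe output: estimated probability that hidden state x is harmful."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_probe(features, labels, lr: float = 0.5, epochs: int = 200):
    """Fit logistic-regression weights with full-batch gradient descent."""
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            # Gradient of the log loss with respect to the logit is (p - y)
            err = harm_score(w, b, x) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


X, y = make_synthetic_states(n_per_class=100, dim=8)
w, b = train_probe(X, y)
# Flag an input as harmful when the probe's probability is at least 0.5
accuracy = sum((harm_score(w, b, x) >= 0.5) == bool(t) for x, t in zip(X, y)) / len(X)
print(f"training accuracy: {accuracy:.2f}")
```

A probe like this stays lightweight because it only adds a small classifier on top of activations the LLM already computes. In a real setting the features would presumably be activations taken from one or more model layers. This page does not say which layers SIREN uses or how it combines them.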