Bhalla, Usha, Alex Oesterling, Claudio Mayrink Verdun, Himabindu Lakkaraju, and Flavio Calmon. "Temporal Sparse Autoencoders: Leveraging the Sequential Nature of Language for Interpretability." ...
Li, Aaron Jiaxun, Suraj Srinivas, Usha Bhalla, and Himabindu Lakkaraju. "Evaluating Adversarial Robustness of Concept Representations in Sparse Autoencoders." Proceedings of the Conference of the ...
New interpretability leap: Anthropic's Natural Language Autoencoders convert AI's internal activations into human-readable summaries, offering direct insight into chatbot reasoning. Safety and trust ...