Synthical
Long Phan
Articles
Representation Engineering: A Top-Down Approach to AI Transparency
3 March 2025 by Andy Zou and others
Machine Learning, Artificial Intelligence
Humanity's Last Exam
21 February 2025 by Long Phan and others
Machine Learning, Artificial Intelligence
Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs
19 February 2025 by Mantas Mazeika and others at UC Berkeley
Machine Learning, Artificial Intelligence
Tamper-Resistant Safeguards for Open-Weight LLMs
10 February 2025 by Rishub Tamirisa and others
Machine Learning, Artificial Intelligence
Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?
27 December 2024 by Richard Ren and others
Machine Learning, Artificial Intelligence
Distillation Contrastive Decoding: Improving LLMs Reasoning with Contrastive Decoding and Distillation
23 August 2024 by Phuc Phan and others
Computation and Language, Artificial Intelligence
Improving Alignment and Robustness with Circuit Breakers
12 July 2024 by Andy Zou and others at Carnegie Mellon University
Machine Learning, Artificial Intelligence
The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning
15 May 2024 by Nathaniel Li and others at UC Berkeley
Machine Learning, Artificial Intelligence
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
27 February 2024 by Mantas Mazeika and others at Carnegie Mellon University
Machine Learning, Artificial Intelligence