Multi-modal safety evaluation benchmark for AI systems. Covers 20 hazard categories and 9 attack strategies aligned with MLCommons AI Safety standards.
Agent Memory Intelligence Benchmark. Existing benchmarks reward flashy retrieval metrics (Recall@k) that don't correlate with downstream task quality. Engram measures what actually matters: does th...
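
For reference, Recall@k is commonly defined as the fraction of relevant items that appear among the top k retrieved results. The sketch below implements that standard definition only. The function name `recall_at_k`, the memory IDs, and the convention of returning 0.0 for an empty relevant set are illustrative assumptions, not part of Engram's code.

```python
from typing import Hashable, Iterable, Sequence


def recall_at_k(
    retrieved: Sequence[Hashable],
    relevant: Iterable[Hashable],
    k: int,
) -> float:
    """Fraction of relevant items that appear in the top-k retrieved results."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant_set = set(relevant)
    if not relevant_set:
        # Assumed convention: some definitions leave this case undefined.
        return 0.0
    hits = relevant_set.intersection(retrieved[:k])
    return len(hits) / len(relevant_set)


# Hypothetical memory IDs, for illustration only.
ranked = ["m7", "m2", "m9", "m4", "m1"]  # retrieval order, best first
gold = {"m2", "m4", "m8"}                # items judged relevant

score_at_3 = recall_at_k(ranked, gold, k=3)  # only "m2" is in the top 3
score_at_5 = recall_at_k(ranked, gold, k=5)  # "m2" and "m4" are in the top 5
```

The metric only checks whether relevant items were retrieved. It says nothing about whether the agent then used them correctly, which is consistent with the concern raised above about downstream task quality.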