Multi-modal safety evaluation benchmark for AI systems. Covers 20 hazard categories and 9 attack strategies aligned with MLCommons AI Safety standards.
Open-source, enterprise-ready platform for benchmarking, monitoring, and managing LLM & RAG applications. Features include multi-tenant support, RBAC, structured evaluation workflows, 20+ metrics (...
Agent Memory Intelligence Benchmark. Existing benchmarks reward flashy retrieval metrics (Recall@k) that don't correlate with downstream task quality. Engram measures what actually matters: does th...
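For reference, Recall@k, the retrieval metric named in the description above, is commonly computed as the fraction of relevant items that appear in the top k results. The following is a minimal sketch under that common definition; the function name `recall_at_k` and the sample IDs are hypothetical and are not taken from Engram or any other project listed here.

```python
from typing import Hashable, Iterable, Sequence


def recall_at_k(ranked: Sequence[Hashable], relevant: Iterable[Hashable], k: int) -> float:
    """Fraction of relevant items found among the top-k ranked results."""
    if k < 0:
        raise ValueError("k must be non-negative")
    relevant_set = set(relevant)
    if not relevant_set:
        # Assumed convention: return 0.0 when nothing is relevant.
        # Some definitions treat this case as undefined instead.
        return 0.0
    # Count relevant items that appear within the first k ranked results.
    hits = len(relevant_set.intersection(ranked[:k]))
    return hits / len(relevant_set)


# Hypothetical example: two relevant memories, one of them retrieved in the top 3.
ranked_ids = ["m3", "m7", "m1", "m9"]
relevant_ids = {"m1", "m2"}
score = recall_at_k(ranked_ids, relevant_ids, k=3)
```

A score like this only records whether relevant items were retrieved; it carries no information about how the retrieved items affected the downstream task, which is the limitation the description points at.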