Candidates Testimonials – How C.S.S Got Me Hired
Advice From Our Recruitment Team – By Carolyne N. – Head Of Recruitment
Personalized Support for Your Success
Upcoming Trainings & Events – Leadership & Career Growth Events
AI Evaluation Engineer Job
IT Jobs. Gramian Consultancy Jobs
Role overview
We are looking for an AI Evaluation Engineer with a strong research background to design and evaluate complex, multi-agent tasks used to benchmark next-generation AI systems.
In this role, you will work at the intersection of research, data structuring, and AI evaluation, building high-quality tasks that require deep document understanding, structured reasoning, and multi-step synthesis. You will create datasets and evaluation frameworks that test whether AI agents can truly read, reason, and extract knowledge from large-scale unstructured data.
This is a high-precision, detail-oriented role requiring strong analytical thinking, structured problem decomposition, and the ability to translate research content into measurable evaluation tasks.
Key Responsibilities
- Build multi-agent benchmark tasks that require reading, analyzing, and synthesizing large document collections
- Curate real-world research corpora — academic papers, case studies, technical reports — and design questions that require comprehensive analysis
- Write structured ground-truth oracles (JSON) with specific, verifiable answers that prove the agent actually read the source material
- Design LLM judge prompts that evaluate agent output field-by-field against the oracle
- Create decomposition guides that split research across multiple parallel sub-agents (one per document, one per domain, then synthesis)
Requirements
- 5+ years of experience in research (academic or industry) in a scientific, technical, or analytical domain
- Strong ability to read, analyze, and extract structured information from unstructured documents
- Experience designing or working with structured data formats (JSON, schemas, validation)
- Proficiency in Python scripting (data processing, validation, or evaluation scripts)
- Experience with AI evaluation, coding benchmarks, or structured reasoning tasks (e.g., SWE-bench, Terminal-bench, or similar)
- Experience working with Docker (building images, debugging containers)
- Strong attention to detail, especially when defining exact, verifiable outputs
- Ability to design complex, multi-step problem-solving workflows
How to Apply
🚨 Before You Apply for This Job. Need Help With Your CV?
This job will attract 1000+ applicants.
Many qualified professionals miss out on getting shortlisted and interviews — not because they lack experience, but because their CV doesn’t clearly show how they fit this specific job.
🎯 Want to get an interview fast? Customize your CV specifically for this job.
Using the same CV for every application will not get you interviews.
Email your CV today to our Client Service Manager, Rose, using cvwriting@corporatestaffing.co.ke
Subject: CV Review & Upgrade.
Rose and our recruiters will review your CV and show you exactly how to improve it for the job you are targeting.
Using an A.I-generated CV but not getting interviews? Get it reviewed here by our recruiters today.

