# AI Evaluation Engineer (Data Analysis & Multi-Agent Systems)

**Company:** [Gramian Consulting Group](http://jobs.workable.com/companies/kANY7hHLXDH7fUqyifRLmf.md)
**Location:** Remote
**Workplace:** remote
**Employment type:** Contract
**Department:** Talent Solutions

[Apply for this job](http://jobs.workable.com/view/92c3c858-128e-4121-bdee-3bf832f6f29e)

## Description

**About Us**

Gramian Consultancy is a boutique consultancy specializing in IT professional services and engineering talent solutions. With a strong background in software engineering and leadership, we help companies build high-performing teams by matching them with professionals who truly fit their needs.

**Role overview**

We are looking for an **AI Evaluation Engineer specialized in data analysis** to design benchmark tasks that simulate real-world analytical workflows.

You will create scenarios where AI systems must analyze **large, messy, multi-source datasets**, decompose tasks across multiple agents, and produce clear, verifiable conclusions.

**Commitment required: 8 hours per day, with 4 hours of overlap with PST.**

**Employment type: Contractor assignment (no medical/paid leave)**

**Duration of contract: 4 weeks+**

**Location:** **Bangladesh, Brazil, Colombia, Egypt, Ghana, India, Indonesia, Kenya, Nigeria, Turkey, Vietnam**

**Interview: take-home assessment (60 min)**

### **Responsibilities**

-   Design and develop **multi-agent benchmark tasks** focused on complex data analysis workflows
-   Create or curate **realistic datasets** (CSV, JSON, logs, reports, financial or operational data)
-   Build tasks requiring:
    -   Cross-referencing across multiple data sources
    -   Anomaly detection and contradiction identification
    -   Statistical analysis and interpretation
-   Define **task decomposition strategies** across specialized sub-agents (e.g., financial, technical, operational analysis)
-   Develop **verification logic** to validate precise analytical outputs (not generic summaries)
-   Implement evaluation pipelines using **Python and SQL**
-   Create reproducible environments using **Docker**
-   Analyze task performance and refine tasks for **clarity, difficulty, and scoring accuracy**

## Requirements

-   5+ years of experience in **data analysis or analytics-heavy roles**
-   Strong proficiency in **Python (pandas, NumPy)** and **SQL**
-   Experience working with **real-world, messy datasets** (CSV, JSON, logs, reports)
-   Ability to design **analytical problems with clear, verifiable answers**
-   Solid understanding of **statistics** (distributions, correlations, outliers)
-   Familiarity with **AI benchmarks or evaluation environments** (e.g., SWE-bench or similar)
-   Hands-on experience with **Docker** (Dockerfiles, image builds, debugging)

### **Nice to Have**

-   Experience in **financial analysis, operations analytics, or risk analysis**
-   Exposure to **data pipelines or ETL workflows**
-   Experience with **data quality validation or anomaly detection systems**
-   Familiarity with **AI/ML data workflows or evaluation frameworks**