# Lead / Staff Data Engineer - Data Platform

**Company:** [Apna](http://jobs.workable.com/companies/oGcPwdAqKGbezdH7GyyVKX.md)
**Location:** Bengaluru, India
**Workplace:** on site
**Department:** Engineering

[Apply for this job](http://jobs.workable.com/view/0e281649-541e-4e5d-a6ee-6ba71a2d6686)

## Description

**Team:** Data Platform / Engineering

**Experience:** 5-7 years

**Why Join Apna**

At Apna, data is central to how we build products, understand users, improve employer outcomes, power recommendations, and scale decision-making. This role gives you the opportunity to build the backbone of Apna’s data platform and influence how data is used across the company.

You will work on real-world, high-scale problems across jobs, users, employers, communities, matching, growth, and AI-driven systems.

**About the Role**

Apna is looking for a **Lead / Staff Data Engineer** to build and scale our core data platform. This role will work on large-scale data pipelines, lakehouse architecture, query platforms, workflow orchestration, and data reliability systems that power analytics, product intelligence, machine learning, business dashboards, experimentation, and operational decision-making across Apna.

We are looking for someone who can think deeply about **data architecture**, design reliable pipelines, improve data quality, and help build a platform that can scale with Apna’s growth.

**What You’ll Own**

You will be responsible for designing, building, and operating critical parts of Apna’s data platform, including:

-   Building scalable batch and near-real-time data pipelines across product, business, growth, and ML use cases.
-   Designing and improving our lakehouse architecture using technologies like **Apache Hudi**.
-   Working with query engines such as **Presto / Trino** for large-scale analytical workloads.
-   Building and maintaining orchestration workflows using **Apache Airflow**.
-   Creating reusable data models, curated datasets, and reliable data marts for analytics and product teams.
-   Improving data platform reliability, observability, SLA tracking, lineage, and data quality checks.
-   Optimizing storage, compute, query performance, and pipeline costs.
-   Partnering with product, analytics, ML, and backend engineering teams to understand data needs and convert them into scalable platform solutions.
-   Driving engineering standards around data modeling, schema evolution, partitioning, deduplication, backfills, replayability, and pipeline ownership.
-   Mentoring data engineers and influencing architecture decisions across teams.

**What We’re Looking For**

**Must Have**

-   Strong experience in **data engineering**, preferably at scale.
-   Hands-on experience with **Apache Airflow** or similar orchestration systems.
-   Strong knowledge of **Presto / Trino** or other distributed query engines.
-   Good understanding of **Apache Hudi** concepts such as:
    -   Copy-on-write vs merge-on-read
    -   Upserts and deletes
    -   Incremental reads
    -   Compaction
    -   Clustering
    -   Timeline and commits
    -   Schema evolution
    -   Partitioning strategy
-   Strong knowledge of distributed data processing and storage systems.
-   Ability to design and build reliable ETL / ELT pipelines.
-   Strong SQL skills and ability to debug complex data issues.
-   Good understanding of different data architectures, including:
    -   Data warehouse
    -   Data lake
    -   Lakehouse
    -   Lambda architecture
    -   Kappa architecture
    -   Medallion architecture
    -   Event-driven data architecture
-   Experience with data modeling for analytics and reporting.
-   Strong programming skills in at least one language such as **Python, Java, or Scala**.
-   Ability to reason about trade-offs between freshness, cost, reliability, latency, and complexity.
-   Strong debugging skills and a production-ownership mindset.

**Good to Have**

-   Experience with Kafka, Spark, Flink, Hive, Iceberg, Delta Lake, or BigQuery.
-   Experience building internal data platforms or self-serve data infrastructure.
-   Experience with data quality frameworks such as Great Expectations, Deequ, Soda, or custom validation systems.
-   Exposure to ML feature pipelines or feature stores.
-   Experience with metadata management, data catalogs, lineage, and governance.
-   Experience with cloud infrastructure such as AWS, GCP, or Azure.
-   Understanding of privacy, compliance, PII handling, and access control in data systems.

**What Success Looks Like**

In this role, success means:

-   Critical business and product datasets are reliable, discoverable, and trusted.
-   Pipelines are observable, recoverable, and have clear SLAs.
-   Query performance improves across major analytical workloads.
-   Data freshness and quality issues decrease significantly.
-   Teams can build on top of the data platform faster without reinventing pipelines.
-   The platform can scale with Apna’s user, job, employer, and engagement data.
