Big Data Engineer (Spark)
Job Description
About this role
Spark is the workhorse of large-scale data processing, and writing Spark code well requires understanding the JVM, the cluster, and the laws of distributed computation all at once. As a Big Data Engineer (Spark) for AI training, you will help AI generate Spark code that doesn't just run but scales, recovers, and respects partitioning, shuffles, and skew.
Key Responsibilities
• Generate and evaluate Spark instruction-response pairs covering DataFrame, SQL, and Structured Streaming APIs.
• Review AI-generated code in Scala Spark, PySpark, and Spark SQL.
• Provide feedback on partitioning strategies, broadcast joins, and skew handling.
• Validate AI handling of Delta Lake, Iceberg, and Hudi table formats.
• Evaluate cluster sizing, dynamic allocation, and Spark-on-Kubernetes patterns.
• Identify subtle issues in shuffle behavior, serialization, and AQE-related regressions.
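As a flavor of the skew-handling work described above, the sketch below shows the "key salting" idea in plain, dependency-free Python (so it runs without a Spark cluster). All function names here are illustrative assumptions, not part of any Spark API; in real PySpark the same idea is typically expressed by adding a random salt column before the join.

```python
import random

NUM_SALTS = 4  # number of sub-keys a hot key is split into (illustrative)

def salt_key(key, hot_keys, rng=random):
    """Append a random salt to known-hot keys so their rows spread
    across NUM_SALTS shuffle partitions instead of piling onto one."""
    if key in hot_keys:
        return f"{key}#{rng.randrange(NUM_SALTS)}"
    return f"{key}#0"  # cold keys keep a fixed salt

def explode_small_side(key):
    """The small (dimension) side of the join must be replicated
    across every salt so the salted join still finds its match."""
    return [f"{key}#{s}" for s in range(NUM_SALTS)]

def partition_of(salted_key, num_partitions=4):
    """Stand-in for a hash partitioner deciding shuffle placement."""
    return hash(salted_key) % num_partitions
```

The key invariant is that every salted value of a hot key on the large side matches one of the replicated rows on the small side, trading a small amount of duplication for an even shuffle.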
Ideal Qualifications
• 6+ years in big data engineering, including 4+ years writing production Spark.
• Deep familiarity with both Scala Spark and PySpark.
• Strong grasp of Spark internals (Catalyst, Tungsten, AQE) and distributed-systems trade-offs.
• Experience with at least one modern table format (Delta, Iceberg, Hudi).
• Comfort with cloud data platforms (Databricks, EMR, Dataproc, Synapse).
• Familiarity with Kafka, Airflow, or dbt is a plus.
Project Timeline
• Start Date: Immediate
• Duration: Ongoing
• Commitment: Flexible, 10-25 hours/week
Contract & Payment Terms
• Independent contractor agreement
• Remote work — anywhere in eligible locations
• Weekly payment via Stripe or bank transfer
• Flexible hours
Scale AI's grasp of distributed data with Spark — apply now!