Data Platform & AI Engineer

Modern data infrastructure

Sagar Das

I am a Data Platform Engineer with 3+ years of experience designing and implementing scalable, cloud-native solutions that drive innovation for global enterprises.

Key Achievements

  • Designed an automated renewable energy monitoring system processing 100GB+ IoT data daily.
  • Developed a customer segmentation framework using Spark ML, boosting targeted marketing effectiveness.
  • Built a scalable GenAI pipeline combining RAG (Retrieval-Augmented Generation) architecture with LLM fine-tuning.
  • Migrated legacy ETL workflows to modern cloud-native architectures, reducing processing times by 60%.

Areas of Expertise

  • Data Engineering: Architecting scalable, event-driven data pipelines and serverless data frameworks for enterprise analytics.
  • MLOps: Automating model training, deployment, and monitoring to drive ML adoption at scale.
  • GenAI & NLP: Leveraging LLMs and Retrieval-Augmented Generation (RAG) to create AI-powered enterprise solutions.
  • Big Data & Distributed Systems: Designing scalable, fault-tolerant architectures to handle high-velocity streaming & batch workloads.

Work Experience

Graduate Assistant - Data Engineering
Department of IT, University of Maryland - College Park

Cloud Infrastructure Automation
  • Engineered cloud-native deployment system for academic analytics tools:
    • Automated provisioning of 50+ R applications using ShinyProxy/Terraform/Docker
    • Implemented LDAP/SSO integration with granular RBAC policies
    • Reduced VM provisioning time by 65% through infrastructure-as-code practices

Learning Analytics Pipeline Development

  • Architected real-time clickstream processing system handling 20M+ daily events:
    • GCP Dataflow pipeline integrating Pub/Sub and BigQuery
    • Apache Superset dashboards monitoring 150+ engagement metrics
    • Anomaly detection model identifying classroom irregularities with 92% accuracy
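
The detection model itself is not shown here; as a hedged illustration, a trailing-window z-score check in pure Python captures the basic idea of flagging irregular engagement counts (the window size and threshold below are assumptions, not the production settings):

```python
from statistics import mean, stdev

def zscore_anomalies(counts, window=20, threshold=3.0):
    """Flag indices whose event count deviates more than `threshold`
    standard deviations from the trailing window's mean."""
    flagged = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(counts[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged

# A stable baseline with natural jitter, then a sudden spike:
# zscore_anomalies([100, 102, 98, 101, 99] * 6 + [500]) flags only the spike.
```
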

Academic Research Modernization
  • Transformed EDUCAUSE survey analysis through:
    • Dimensional modeling on AWS Redshift with automated materialized view refreshes
    • Natural language processing pipeline using SpaCy/NLTK
    • LLM-powered Q&A system (Gemma2 9B/Ollama) with RAG architecture
    • Cross-department Tableau dashboards serving 12 academic units
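
The retrieval half of a RAG system like the one above can be sketched in plain Python. This illustration substitutes term-frequency cosine similarity for real embeddings and omits the Gemma2/Ollama generation step; it only shows the retrieve-then-prompt shape:

```python
import math
from collections import Counter

def tf_vector(text):
    """Bag-of-words term frequencies (a stand-in for real embeddings)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(count * b.get(term, 0) for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, docs, k=2):
    """Return the k documents most similar to the query; in a full RAG
    loop these passages would be inserted into the LLM prompt."""
    q = tf_vector(query)
    return sorted(docs, key=lambda d: cosine(q, tf_vector(d)), reverse=True)[:k]
```
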

Senior Software Engineer - Data Engineering
Tiger Analytics

Cloud Data Fabric Development
  • Partnered with a Lead Data Architect to lead a 10-engineer team building an AWS-based data fabric from scratch, accelerating the client’s time-to-market by 30% through:
    • Custom ETL pipeline design with SCD Type-2 historization for accurate historical tracking
    • Metadata-driven workflow management for dynamic ETL orchestration
    • Real-time monitoring API using AWS Lambda & GraphQL for proactive failure detection
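
For readers unfamiliar with SCD Type-2 historization, the close-and-open-version logic can be sketched in pure Python (the production pipelines would express this in Spark/SQL; the column names here are hypothetical):

```python
from datetime import date

def scd2_upsert(dim_rows, incoming, today=None):
    """Apply one batch of source records to a Type-2 dimension.

    dim_rows: dicts with an `id`, attribute columns, `valid_from`, and
    `valid_to` (None marks the current version). A changed record
    closes the current version and opens a new one; unchanged and
    unseen records are left alone.
    """
    today = today or date.today()
    current = {r["id"]: r for r in dim_rows if r["valid_to"] is None}
    for rec in incoming:
        cur = current.get(rec["id"])
        if cur is None:  # brand-new key: open its first version
            dim_rows.append({**rec, "valid_from": today, "valid_to": None})
        elif any(cur[k] != rec[k] for k in rec if k != "id"):
            cur["valid_to"] = today  # close the old version...
            dim_rows.append({**rec, "valid_from": today, "valid_to": None})
    return dim_rows
```
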

ML Engineering & Analytics
  • Developed a scalable customer segmentation model using Spark ML/K-Means, leveraging PCA-driven feature engineering
  • Implemented a forecasting engine on a GCP Data Lake comparing:
    • ARIMA for capturing seasonal trends in time-series forecasting
    • Random Forest & XGBoost for enhanced prediction accuracy
  • Designed a phased rollout strategy for ML model deployment, ensuring gradual adaptation and A/B testing
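
Phased rollouts of this kind are commonly driven by deterministic hash bucketing; the sketch below is illustrative rather than the client's actual mechanism:

```python
import hashlib

def in_rollout(user_id, feature, percent):
    """Deterministically bucket a user into a feature rollout.

    hash(feature, user) maps to a bucket 0..99; users below `percent`
    receive the new model. Raising `percent` only ever adds users and
    never reshuffles existing ones, which keeps A/B cohorts stable.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < percent
```

Because buckets are derived from a stable hash rather than randomness, a user admitted at the 10% stage stays admitted at every later stage.
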

Cloud Infrastructure Automation
  • Reduced cloud provisioning time by 60% using Terraform-based Infrastructure as Code (IaC) templates for AWS/GCP
  • Built a serverless cost optimization framework for dynamic EC2 resource allocation, leveraging:
    • Spot Instance optimization for EMR clusters, reducing compute costs
    • Auto-scaling strategies for dynamic resource allocation based on demand

Data Governance & Security Compliance
  • Architected a metadata-driven Data Quality (DQ) framework integrating Great Expectations & Deequ for automated data validation
  • Implemented a scalable PII protection system with:
    • AWS Macie-powered sensitive data detection for compliance with GDPR/CCPA
    • Format-preserving encryption (FPE-FF1) in Python/Scala to maintain referential integrity
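
FF1 is a NIST-specified algorithm (SP 800-38G) that should come from a vetted library in practice. The toy Feistel network below is NOT FF1; it only illustrates the format-preserving idea: even-length digit strings go in, same-length digit strings come out, and the transformation is reversible with the key, so joins on the encrypted column still line up:

```python
import hmac
import hashlib

ROUNDS = 8  # illustrative round count, not the FF1 schedule

def _round_fn(key, half, rnd, width):
    """Keyed pseudorandom function mapping a half-block to `width` digits."""
    mac = hmac.new(key, f"{rnd}:{half}".encode(), hashlib.sha256)
    return int(mac.hexdigest(), 16) % 10 ** width

def fpe_encrypt(key: bytes, digits: str) -> str:
    """Toy Feistel cipher over even-length digit strings (NOT FF1)."""
    n = len(digits) // 2
    left, right = int(digits[:n]), int(digits[n:])
    for rnd in range(ROUNDS):
        left, right = right, (left + _round_fn(key, right, rnd, n)) % 10 ** n
    return f"{left:0{n}d}{right:0{n}d}"

def fpe_decrypt(key: bytes, digits: str) -> str:
    """Invert fpe_encrypt by running the rounds in reverse."""
    n = len(digits) // 2
    left, right = int(digits[:n]), int(digits[n:])
    for rnd in reversed(range(ROUNDS)):
        left, right = (right - _round_fn(key, left, rnd, n)) % 10 ** n, left
    return f"{left:0{n}d}{right:0{n}d}"
```
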


Intern & Software Engineer - Data Engineering
Xenonstack

ML Engineering
  • Scaled TensorFlow training pipelines using Python/Ray across a 6-node cluster, reducing computation time by 10%
  • Developed production ML evaluation CLI with MLFlow/Python that automated feature scoring, reducing manual engineering effort by 25%
  • Implemented market segmentation models (Decision Trees/Random Forest) for personalized product recommendations

Data Engineering
  • Designed and developed a Linux-based command-line tool for automated data ingestion and storage:
    • Ingested enterprise, asset, meteorological, and failure data from 70+ geo-distributed locations of a renewable energy giant
    • Structured data into a multi-zoned AWS data lake, ensuring efficient organization and retrieval
    • Implemented user-specified granularity, enabling data science teams to build historical datasets
    • Powered KPI-driven dashboards for energy grid performance and downtime analytics
  • Integrated Spark Streaming into ETL pipelines and migrated from Spark 1.4 to Spark 2.4:
    • Processed ~100GB of IoT data daily from 70+ wind and solar farms
    • Structured data in a multi-zoned AWS data lake for consistent availability
    • Enhanced processing speed and scalability through Spark 2.4 migration
  • Developed a command-line interface for CI/CD automation:
    • Automated Docker image creation and registry push
    • Integrated with a remote Jenkins server to trigger CI/CD build pipelines
    • Standardized and automated containerized deployment workflows
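
A multi-zoned lake layout with user-specified granularity can be pictured with a small key builder. The zone and partition names below are hypothetical, not the actual layout:

```python
from datetime import datetime

ZONES = ("raw", "cleansed", "curated")  # assumed zone names

def lake_key(zone, site, dataset, ts, granularity="day"):
    """Build an S3-style object key for a multi-zoned data lake.

    `granularity` controls how deep the time partitioning goes, which
    lets consumers rebuild history at the resolution they need.
    """
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    parts = [zone, f"site={site}", f"dataset={dataset}", f"year={ts.year}"]
    levels = {"year": 1, "month": 2, "day": 3, "hour": 4}[granularity]
    if levels >= 2:
        parts.append(f"month={ts.month:02d}")
    if levels >= 3:
        parts.append(f"day={ts.day:02d}")
    if levels >= 4:
        parts.append(f"hour={ts.hour:02d}")
    return "/".join(parts)
```
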

Academic Foundation

Data engineering academic background

University of Maryland, College Park

Master of Information Management [Data Science] 2023-2025 | GPA: 4.0/4.0
Specialized Study Areas
  • Advanced Data Systems Architecture
  • Machine Learning Operationalization (MLOps)
  • Generative AI Implementation Strategies

Relevant Coursework
  • Cloud-Native Data Engineering (AWS/GCP)
  • Deep Learning Architectures
  • Large Language Model Applications
  • Data Governance & Ethics

Panjab University

Bachelor of Engineering [Information Technology] 2015-2019 | GPA: 3.74/4.0
Core Competencies Developed
  • Distributed Systems Design
  • Algorithm Optimization
  • Database Architecture

Leadership & Engagement
  • Technical Member - Entrepreneurship Cell
  • Captain - University Football Team

Core Technical Competencies

Data Engineering
  • AWS and GCP cloud tools
  • Apache Spark | Apache Beam
  • Kafka
  • BigQuery | Redshift
  • Airflow
  • Apache Iceberg | Delta Lake
  • Hadoop Ecosystem
  • dbt Core
  • Distributed Systems

Software Engineering
  • Python | Scala | Rust
  • Git | GitOps
  • CI/CD Pipelines
  • Docker
  • Kubernetes
  • Terraform
  • REST APIs
  • Backend Development

Data Science & MLOps
  • PyTorch
  • MLflow
  • Advanced SQL
  • Weights & Biases
  • Pandas/NumPy
  • Scikit-learn
  • XGBoost
  • Kubeflow
  • Spark ML
  • Apache Superset

Generative AI
  • LangChain
  • Natural Language Processing
  • Vector DBs
  • RAG Architecture
  • LLM Fine-tuning
  • Ollama
  • Prompt Engineering

Selected Works

GenAI Text Summarization Engine

Gemma2 on Ollama | spaCy | Python | NLTK | PyTorch
  • Developed prompt engineering pipeline for document summarization
  • Integrated zero-shot learning for multi-domain adaptability
  • Achieved 80% content retention accuracy on internal documents
View Implementation →

Mitigating DQ for Fintech Loan Analysis

AWS | Python | Great Expectations | Postgres | Tableau
  • Developed a 5-step AWS Glue and Great Expectations pipeline for metadata extraction, profiling, validation, and transformation
  • Built an event-driven pipeline with AWS Lambda, Step Functions, and S3 for real-time quality checks and transformations
  • Integrated cleaned datasets into AWS RDS and created Tableau dashboards to visualize data quality trends and anomalies
View on GitHub →
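
The kind of expectation such a pipeline evaluates can be mimicked in pure Python. This is a hedged analogue of a Great Expectations not-null check with a `mostly` threshold, not the library itself:

```python
def expect_column_values_not_null(rows, column, mostly=1.0):
    """Analogue of a Great Expectations not-null check: succeeds when
    at least `mostly` of the column's values are non-null."""
    values = [row.get(column) for row in rows]
    non_null = sum(v is not None for v in values)
    fraction = non_null / len(values) if values else 1.0
    return {"success": fraction >= mostly, "non_null_fraction": fraction}
```

The `mostly` knob is what separates a hard gate (fail on any null) from a tolerance-based quality check that flags drift without blocking the load.
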

Monitoring EKS Cluster

AWS | Terraform | Helm | Prometheus | Jenkins CI/CD | Kubernetes
  • Streamlined the setup for deploying the OpenTelemetry Webshop
  • Enabled modular deployment by splitting large Kubernetes manifests into component YAML files
  • Provisioned EKS, Grafana, and Prometheus with Terraform and Helm
  • Developed a system to collect logs from the kube-system namespace and track unhealthy pods
  • Built pipelines to test and build Docker images via a remote Jenkins server
View on GitHub →

Cloud Data Fusion Platform

Google Cloud | Apache Spark | Terraform | Apache Superset | Bash | SQL
  • Architected automated big data solution processing 1M+ NYC records
  • Implemented real-time weather/accident correlation analytics
  • Reduced query latency by 40% through query optimization
View on GitHub →

Sports Analytics System

Python | Plotly | Pandas | Tableau
  • Developed EDA framework analyzing 10+ performance KPIs
  • Created interactive dashboards for tactical analysis
  • Identified 3 key success factors through Bayesian analysis
View Analysis →


Open for Opportunities

Data Systems & ML Engineering Roles