Hi, I am Sagar!

A software engineer passionate about distributed systems and data engineering. I relish the challenge of building well-architected systems that turn raw data into reliable, analytics-ready insights.

From real-time event processing to cloud-native data pipelines, I specialize in designing scalable architectures using tools like Apache Spark, Iceberg, Kafka, Airflow, and modern cloud platforms (AWS & GCP). Whether it's ingesting millions of clickstream records, deploying secure data apps, or enabling time-travel on petabyte-scale data lakes, I bring a systems-thinking mindset to solve data challenges that power decision-making at scale.

When production issues don't bug me, I'm either watching Liverpool attempt to give me a heart attack in stoppage time, playing football myself, or out with my Olympus pretending I understand composition. I also write tech musings on Medium when the caffeine hits just right.

Let's build something impactful — feel free to explore my work or reach out to connect!


Open to offers

Roles: Software Engineer | Data Engineer | ML Engineer | AI Engineer | Analytics Engineer

Download Resume

Work Experience

DATA SPECIALIST - GRADUATE ASSISTANT

DIVISION OF IT - UMD | COLLEGE PARK, MD | SEP 2023 - MAY 2025
  • Built an NLP pipeline to transform surveys into analytics-ready datasets using PyTorch, LangChain, and HuggingFace Transformers, saving 8 research analysts a collective 20+ hours of manual data querying
  • Deployed R Shiny applications on GCP using ShinyProxy, Docker, and Terraform for multi-user collaboration on internal research applications
  • Incorporated security controls (IAM, RBAC) alongside a Flask application gateway with Google OAuth and reverse-proxy SSL to ensure secure access
  • Prepared a POC using GCP Pub/Sub, Apache Beam, and BigQuery to process 20M+ daily clickstream events from Canvas ELMS, enabling instant classroom analytics for the academic leadership team via Apache Superset dashboards
  • Optimized AWS ETL workflows by implementing incremental ingestion and advanced SQL techniques (CTEs, partitioning, indexing), slashing processing time from 7 hrs to 4 hrs while ensuring high data accuracy for reporting
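The incremental-ingestion pattern behind that last optimization can be sketched in a few lines. This is a minimal illustration, not the production code: the table, columns, and watermark values are hypothetical, with SQLite standing in for the warehouse.

```python
# Minimal sketch of watermark-based incremental ingestion.
# Table/column names are hypothetical; SQLite stands in for the warehouse.
import sqlite3

def incremental_load(conn, watermark):
    """Fetch only rows newer than the last high-water mark."""
    rows = conn.execute(
        "SELECT id, updated_at FROM events "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    # Advance the watermark to the newest row seen, if any
    new_watermark = rows[-1][1] if rows else watermark
    return rows, new_watermark

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, updated_at TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(1, "2024-01-01"), (2, "2024-01-02"), (3, "2024-01-03")])

rows, wm = incremental_load(conn, "2024-01-01")  # only ids 2 and 3 qualify
print(len(rows), wm)  # → 2 2024-01-03
```

Each run processes only rows past the saved watermark, which is what cuts a full-table scan down to the day's delta.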
SENIOR SOFTWARE ENGINEER

    TIGER ANALYTICS | CHENNAI, INDIA | JUL 2021 - JUL 2023
  • Partnered with data architects to prototype MVPs and launch 2 enterprise solutions: Intelligent Data Express and Data Observability Framework
  • Developed the MVP backends using AWS serverless tools and later scaled them to a microservices architecture using FastAPI and Docker to support 3x user growth
  • Developed a metadata-driven ETL ingestion framework using Airbyte, Apache Airflow, AWS Glue, and Python, enabling rapid ingestion from diverse enterprise data sources (CDC, streaming, batch, on-premise databases), reducing asset onboarding time from weeks to hours
  • Engineered an ACID-compliant Lakehouse using Apache Iceberg, AWS S3, Athena, and Redshift. Implemented SCD Type-2 and time travel capabilities to ensure historical data integrity and support analytical workloads
  • Built a data quality tool with Great Expectations, Apache Spark, and Airflow to validate 30+ custom checks on data at rest, reducing data anomalies by 60% across downstream BI and analytics pipelines
  • Implemented an infrastructure observability pipeline utilizing CloudWatch logs, ELK stack and Grafana, accelerating root cause analysis and decreasing Mean-Time-To-Resolution (MTTR) for production issues by 40%
  • Established data governance capabilities by integrating LinkedIn DataHub, providing detailed tracking, auditability, and visualization of data flow, aiding regulatory compliance (GDPR, CCPA) and trust in data assets
  • Collaborated with business analysts and sales teams to translate functional requirements into engineering solutions
  • Provided MVP demos and simplified complex technical architectures to clients through concise presentations
Took a career break to focus on personal goals and well-being
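The SCD Type-2 versioning mentioned in the Lakehouse work above boils down to closing the current record and appending a new version. A toy sketch in plain Python (the actual implementation used Iceberg merge semantics; the keys and attributes here are hypothetical):

```python
# Minimal sketch of SCD Type-2 versioning in plain Python.
# The real system used Apache Iceberg; names here are illustrative.
from datetime import date

def scd2_upsert(history, key, new_attrs, as_of):
    """Close the current record for `key` and append a new version."""
    for rec in history:
        if rec["key"] == key and rec["is_current"]:
            rec["is_current"] = False   # expire the old version
            rec["valid_to"] = as_of
    history.append({"key": key, **new_attrs,
                    "valid_from": as_of, "valid_to": None,
                    "is_current": True})

history = []
scd2_upsert(history, "cust-1", {"city": "Chennai"}, date(2022, 1, 1))
scd2_upsert(history, "cust-1", {"city": "College Park"}, date(2023, 9, 1))

current = [r for r in history if r["is_current"]]
print(current[0]["city"])  # → College Park
```

Because expired versions keep their validity window, any past state can be reconstructed by filtering on `valid_from`/`valid_to`, which is the basis of the time-travel queries.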

INTERN & SOFTWARE ENGINEER

    XENONSTACK | CHANDIGARH, INDIA | JAN 2019 - NOV 2019
  • Modernized a data platform by migrating legacy Hadoop workflows and Pig scripts to Scala/PySpark ETL jobs on Databricks, processing IoT sensor and weather data from 45 geo-locations via Kafka to create a high-availability data lake powering data science workflows
  • Collaborated with the MLOps team to build a Python framework using MLflow to automate model management & scoring, improving feature engineering efficiency by 33% and reducing model discovery time in production
  • Scaled TensorFlow model training by implementing a Ray-based distributed pipeline across a 6-node cluster, reducing computation time by ~10%
  • Developed a cost-optimization strategy to bid on and select EC2 Spot Instances for AWS EMR jobs during off-peak hours
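The spot-instance selection idea above can be sketched as a simple filter-and-rank: accept only instance types whose spot quote stays under a bid ceiling derived from the on-demand rate, then take the cheapest. The prices and instance names below are illustrative, not real quotes.

```python
# Minimal sketch of spot-instance selection for batch jobs.
# Prices and instance types are illustrative placeholders.
def pick_spot_instance(quotes, on_demand_price, max_fraction=0.6):
    """Return (instance_type, price) of the cheapest quote under the ceiling."""
    ceiling = on_demand_price * max_fraction  # bid cap vs. on-demand rate
    eligible = [(t, p) for t, p in quotes.items() if p <= ceiling]
    return min(eligible, key=lambda tp: tp[1]) if eligible else None

quotes = {"m5.xlarge": 0.083, "r5.xlarge": 0.094, "c5.xlarge": 0.151}
choice = pick_spot_instance(quotes, on_demand_price=0.192)
print(choice)  # → ('m5.xlarge', 0.083)
```

Returning `None` when no quote clears the ceiling lets the scheduler fall back to on-demand capacity rather than overpay for spot.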
Education

    University of Maryland - College Park | USA

    Master's in Information Management 2023-2025 | GPA: 4.0/4.0
  • Relevant Coursework: Big Data Infrastructure, Data Analytics, Data Integration, Advanced Data Science, Cloud Computing, Product Management
  • Received complete tuition fee waiver for the entire duration of the degree program
    Panjab University - Chandigarh | India

    B.E. Information Technology 2015-2019 | GPA: 3.74/4.0
  • Relevant Coursework: Data Structures and Algorithms, Database Systems, Network Security, Operating Systems, Object Oriented Programming
Core Competencies

    Data Engineering
    • AWS and GCP cloud tools
    • Apache Spark | Apache Beam
    • Kafka
    • BigQuery | Redshift
    • Airflow
    • Apache Iceberg | Delta Lake
    • Hadoop Ecosystem
    • dbt Core
    • Apache Superset
    • Distributed Systems
    Software Engineering
    • Data Structures & Algorithms
    • Python | Scala | Rust
    • Git | GitOps
    • CI/CD Pipelines
    • Docker
    • Kubernetes
    • Terraform
    • REST APIs
    • Backend Development
    • Shell Scripting
    • Server Side Programming
    • HTML | CSS | JavaScript
    ML Engineering
    • Feature Engineering
    • Supervised & Unsupervised learning
    • PyTorch
    • Advanced SQL
    • Weights & Biases
    • Scikit-learn
    • XGBoost
    • MLflow
    • Kubeflow
    • Metaflow
    • SparkML
    • Model Serving and Management
    Generative AI
    • LangChain
    • Natural Language Processing
    • Vector DBs
    • RAG Architecture
    • LLMs & Fine-tuning
    • Prompt Engineering
    • Hugging Face Transformers
    • Agentic AI systems

    Selected Works

    Data Fusion Engineering

    GCP | Serverless Apache Spark | Terraform | Apache Superset | Bash | SQL
    • Developed a complete analytics solution on GCP to ingest, store, transform, and analyze data
    • Automated incremental data ingestion from 6 NYC Open Data APIs using a time-triggered Cloud Function
    • Created automated Dataproc pipelines to process ingested data and load it into BigQuery
    • Prepared auto-updating Apache Superset dashboards to visualize KPIs to identify accident-prone areas
    View on GitHub →

    Intelligent Record Management

    PyTorch | NLP Tools | LangChain | Streamlit | Elasticsearch | Gemma2
    • A document processing and semantic search system for intelligent indexing of congressional archives
    • Allows users to input a query via Streamlit UI and retrieve relevant past press releases
    • LLM based document summaries infused with NLP: Topic Modeling, NER, and keyword extraction
    View on GitHub →
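The retrieval step of this project reduces to ranking documents by embedding similarity. A toy sketch with hand-made vectors (the real system used learned embeddings served from Elasticsearch; the document names and vectors below are illustrative):

```python
# Minimal sketch of similarity-based retrieval: rank documents by cosine
# similarity to a query vector. Vectors here are hand-made toy embeddings.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = (math.sqrt(sum(x * x for x in a)) *
            math.sqrt(sum(y * y for y in b)))
    return dot / norm if norm else 0.0

docs = {
    "press-release-1": [1.0, 0.0, 2.0],
    "press-release-2": [0.0, 3.0, 1.0],
}
query = [1.0, 0.0, 1.0]

# Rank documents by similarity to the query, best match first
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked[0])  # → press-release-1
```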

    Loan Default Prediction System

    PySpark | Seaborn | scikit-learn | XGBoost | Random Forest | K-Means Clustering | PCA
    • End-to-end machine learning pipeline to predict loan defaults using the LendingClub dataset
    • Incorporated data preprocessing, feature engineering, & supervised ML modeling
    • Prepared a borrower segmentation using K-Means clustering to identify high-risk defaulter profiles
    View on GitHub →

    Data Preparation for Fintech Analytics

    AWS | Python | Great Expectations | Postgres | Tableau
    • A serverless framework to automate metadata extraction, profiling, validation, and transformation
    • Event-driven workflows on AWS Lambda, Step Functions, and S3 for DQ checks and transformations
    • Stored cleaned datasets into AWS RDS and created Tableau dashboards to visualize KPIs
    View on GitHub →

    Monitoring EKS Cluster

    AWS | Terraform | Helm | Prometheus | Jenkins CI/CD | Kubernetes
    • Streamlined the setup for deploying OpenTelemetry Webshop
    • Enabled modular deployment by splitting large Kubernetes manifests into component YAML files
    • Provisioned EKS, Grafana, and Prometheus with Terraform and Helm
    • Developed a system to collect logs from the kube-system namespace and track unhealthy pods
    • Built pipelines to test and build Docker images via a remote Jenkins server
    View on GitHub →

    Sports Analytics System

    Python | Plotly | Pandas | Tableau
    • Developed EDA framework analyzing 10+ performance KPIs
    • Created interactive dashboards for tactical analysis
    • Identified 3 key success factors through Bayesian analysis
    View on Kaggle →