Hi, I am Sagar!
A software engineer passionate about distributed systems and data engineering.
I relish the challenge of building well-architected systems that turn raw data into reliable, actionable insights.
From real-time event processing to cloud-native data pipelines, I specialize in designing scalable architectures using tools like Apache Spark, Iceberg, Kafka, Airflow, and modern cloud platforms (AWS & GCP).
Whether it's ingesting millions of clickstream records, deploying secure data apps, or enabling time travel on petabyte-scale data lakes, I bring a systems-thinking mindset to data challenges that power decision-making at scale.
When production issues aren't bugging me, I'm either watching Liverpool attempt to give me a heart attack in stoppage time, playing football myself, or out with my Olympus pretending I understand composition.
I also write tech musings on Medium when the caffeine hits just right.
Let's build something impactful — feel free to explore my work or reach out to connect!
Open to offers
Roles: Software Engineer | Data Engineer | ML Engineer | AI Engineer | Analytics Engineer
Download Resume
Work Experience
Data Specialist - Graduate Assistant
DIVISION OF IT - UMD | COLLEGE PARK, MD | SEP 2023 - MAY 2025
Built an NLP pipeline to transform surveys into analytics-ready datasets using PyTorch, LangChain, and HuggingFace Transformers, collectively saving 8 research analysts 20+ hours of manual data querying.
Deployed R Shiny applications on GCP using ShinyProxy, Docker, and Terraform for multi-user collaboration on internal research applications.
Incorporated security controls (IAM, RBAC) alongside a Flask application gateway with Google OAuth and reverse-proxy SSL to ensure secure access.
Prepared a POC using GCP Pub/Sub, Apache Beam, and BigQuery to process 20M+ daily clickstream events from Canvas ELMS, enabling near-real-time classroom analytics for the academic leadership team via Apache Superset dashboards (pipeline sketch below).
Optimized AWS ETL workflows by implementing incremental ingestion and advanced SQL techniques (CTEs, partitioning, indexing), cutting processing time from 7 hours to 4 hours while maintaining high data accuracy for reporting.
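A minimal sketch of how such a streaming pipeline wires together, assuming placeholder project, topic, and table names (the actual Canvas ELMS schema is not shown):

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical names for illustration only.
TOPIC = "projects/my-project/topics/clickstream"
TABLE = "my-project:analytics.clickstream_events"

def run():
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(topic=TOPIC)
            | "ParseJSON" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                TABLE,
                # Destination table is assumed to already exist.
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )

if __name__ == "__main__":
    run()
```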
Senior Software Engineer
TIGER ANALYTICS | CHENNAI, INDIA | JUL 2021 - JUL 2023
Partnered with data architects to prototype MVPs and launch 2 enterprise solutions: Intelligent Data Express and Data Observability Framework.
Developed the MVP backends using AWS serverless tools, later scaling them to a microservice architecture with FastAPI and Docker to support 3x user growth.
Developed a metadata-driven ETL ingestion framework using Airbyte, Apache Airflow, AWS Glue, and Python, enabling rapid ingestion from diverse enterprise data sources (CDC, streaming, batch, on-premise databases) and reducing asset onboarding time from weeks to hours.
Engineered an ACID-compliant Lakehouse using Apache Iceberg, AWS S3, Athena, and Redshift. Implemented SCD Type-2 and time-travel capabilities to ensure historical data integrity and support analytical workloads (time-travel sketch below).
Built a data quality tool with Great Expectations, Apache Spark, and Airflow to validate 30+ custom checks on data at rest, reducing data anomalies by 60% across downstream BI and analytics pipelines.
Implemented an infrastructure observability pipeline using CloudWatch logs, the ELK stack, and Grafana, accelerating root cause analysis and decreasing Mean-Time-To-Resolution (MTTR) for production issues by 40%.
Established data governance capabilities by integrating LinkedIn DataHub, providing detailed tracking, auditability, and visualization of data flow, aiding regulatory compliance (GDPR, CCPA) and trust in data assets.
Collaborated with business analysts and sales teams to translate functional requirements into engineering solutions.
Presented MVP demos and explained complex technical architectures to clients through concise presentations.
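To give a flavor of the time-travel capability above, here is a minimal PySpark sketch against a hypothetical Iceberg table (catalog, table, and snapshot IDs are placeholders):

```python
from pyspark.sql import SparkSession

# Assumes a Spark session already configured with an Iceberg catalog
# named "lakehouse"; all names below are placeholders.
spark = SparkSession.builder.appName("iceberg-time-travel").getOrCreate()

# Current state of the table.
spark.sql("SELECT count(*) FROM lakehouse.sales.orders").show()

# Same table as of an earlier point in time (Spark 3.3+ SQL syntax).
spark.sql("""
    SELECT count(*)
    FROM lakehouse.sales.orders
    FOR TIMESTAMP AS OF '2023-01-01 00:00:00'
""").show()

# DataFrame API equivalent: read a specific snapshot by ID.
df = (
    spark.read.option("snapshot-id", 10963874102873)  # placeholder snapshot ID
    .table("lakehouse.sales.orders")
)
```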
Took a career break to focus on personal goals and well-being
Intern & Software Engineer
XENONSTACK | CHANDIGARH, INDIA | JAN 2019 - NOV 2019
Modernized a data platform by migrating legacy Hadoop workflows and Pig scripts to Scala/PySpark ETL jobs on Databricks, processing IoT sensor and weather data from 45 geo-locations via Kafka to create a high-availability data lake powering data science workflows (streaming-read sketch below).
Collaborated with the MLOps team to build a Python framework using MLflow to automate model management & scoring, improving feature engineering efficiency by 33% and reducing model discovery time in production.
Scaled TensorFlow model training by implementing a Ray-based distributed pipeline across a 6-node cluster, reducing computation time by ~10%.
Developed a cost-optimization strategy to bid and select EC2 spot instances for AWS EMR jobs during off-peak hours.
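As an illustration of the Kafka-to-Spark leg of that migration, a minimal Structured Streaming read might look like this (broker, topic, schema, and paths are placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

spark = SparkSession.builder.appName("iot-ingest").getOrCreate()

# Placeholder schema for an IoT sensor reading.
schema = StructType([
    StructField("sensor_id", StringType()),
    StructField("temperature", DoubleType()),
    StructField("recorded_at", StringType()),
])

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
    .option("subscribe", "sensor-events")              # placeholder topic
    .load()
)

# Kafka delivers bytes; decode and parse the JSON payload.
events = raw.select(
    from_json(col("value").cast("string"), schema).alias("event")
).select("event.*")

# Append parsed events to the data lake as Parquet files.
query = (
    events.writeStream.format("parquet")
    .option("path", "s3a://data-lake/sensor-events/")  # placeholder path
    .option("checkpointLocation", "s3a://data-lake/_checkpoints/sensor-events/")
    .start()
)
query.awaitTermination()
```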
Education
University of Maryland - College Park | USA
Master's in Information Management
2023-2025 | GPA: 4.0/4.0
Relevant Coursework: Big Data Infrastructure, Data Analytics, Data Integration, Advanced Data Science, Cloud Computing, Product Management
Received a complete tuition-fee waiver for the entire duration of the degree program
Panjab University - Chandigarh | India
B.E. Information Technology
2015-2019 | GPA: 3.74/4.0
Relevant Coursework: Data Structures and Algorithms, Database Systems, Network Security, Operating Systems, Object Oriented Programming
Core Competencies
Data Engineering
- AWS and GCP cloud tools
- Apache Spark | Apache Beam
- Kafka
- BigQuery | Redshift
- Airflow
- Apache Iceberg | Delta Lake
- Hadoop Ecosystem
- dbt Core
- Apache Superset
- Distributed Systems
Software Engineering
- Data Structures & Algorithms
- Python | Scala | Rust
- Git | GitOps
- CI/CD Pipelines
- Docker
- Kubernetes
- Terraform
- REST APIs
- Backend Development
- Shell Scripting
- Server Side Programming
- HTML | CSS | JavaScript
ML Engineering
- Feature Engineering
- Supervised & Unsupervised Learning
- PyTorch
- Advanced SQL
- Weights & Biases
- Scikit-learn
- XGBoost
- MLflow
- Kubeflow
- Metaflow
- SparkML
- Model Serving and Management
Generative AI
- LangChain
- Natural Language Processing
- Vector DBs
- RAG Architecture
- LLMs & Fine-tuning
- Prompt Engineering
- Hugging Face Transformers
- Agentic AI systems
Selected Works
Data Fusion Engineering
GCP Serverless
Apache Spark
Terraform
Apache Superset
Bash
SQL
- Developed a complete analytics solution on GCP to ingest, store, transform, and analyze data
- Automated incremental ingestion from 6 NYC Open Data APIs using a time-triggered Cloud Function (sketched below)
- Created automated Dataproc pipelines to process the ingested data and load it into BigQuery
- Prepared auto-updating Apache Superset dashboards that visualize KPIs and surface accident-prone areas
View on GitHub →
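A minimal sketch of the time-triggered ingestion function, assuming a gen-1 background Cloud Function invoked via Cloud Scheduler and Pub/Sub; the bucket name, endpoint, and watermark handling are simplified placeholders:

```python
import json
from datetime import datetime, timezone

import requests
from google.cloud import storage

BUCKET = "raw-nyc-data"  # placeholder bucket name
# Illustrative NYC Open Data endpoint; the project pulls from 6 such APIs.
API_URL = "https://data.cityofnewyork.us/resource/h9gi-nx95.json"

def ingest(event, context):
    """Background Cloud Function triggered by a Cloud Scheduler Pub/Sub message.

    Fetches only records newer than a high-water-mark date to keep the
    ingestion incremental (watermark storage is simplified here).
    """
    watermark = "2024-01-01T00:00:00"  # placeholder; persisted externally in practice
    resp = requests.get(
        API_URL,
        params={"$where": f"crash_date > '{watermark}'", "$limit": 50000},
        timeout=60,
    )
    resp.raise_for_status()
    records = resp.json()

    # Land the raw batch in GCS, keyed by ingestion timestamp.
    ts = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
    blob = storage.Client().bucket(BUCKET).blob(f"crashes/{ts}.json")
    blob.upload_from_string(json.dumps(records), content_type="application/json")
```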
Intelligent Record Management
PyTorch
NLP Tools
LangChain
Streamlit
Elasticsearch
Gemma2
- A document processing and semantic search system for intelligent indexing of congressional archives
- Lets users enter a query via a Streamlit UI and retrieve relevant past press releases (retrieval sketch below)
- LLM-based document summaries enriched with NLP: topic modeling, NER, and keyword extraction
View on GitHub →
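A minimal sketch of the retrieval step, assuming an Elasticsearch 8.x index with a dense_vector field and a sentence-transformers embedding model (endpoint, index, and model names are placeholders):

```python
from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer

es = Elasticsearch("http://localhost:9200")        # placeholder endpoint
model = SentenceTransformer("all-MiniLM-L6-v2")    # placeholder embedding model

def search_press_releases(query: str, k: int = 5):
    """Embed the user query and run an approximate kNN search."""
    vector = model.encode(query).tolist()
    resp = es.search(
        index="press-releases",  # placeholder index name
        knn={
            "field": "embedding",
            "query_vector": vector,
            "k": k,
            "num_candidates": 50,
        },
        source=["title", "summary", "date"],
    )
    return [hit["_source"] for hit in resp["hits"]["hits"]]
```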
Loan Default Prediction System
PySpark
Seaborn
Scikit-learn
XGBoost
Random Forest
K-Means Clustering
PCA
- End-to-end machine learning pipeline to predict loan defaults using the LendingClub dataset (modeling sketch below)
- Incorporated data preprocessing, feature engineering, & supervised ML modeling
- Prepared a borrower segmentation using K-Means clustering to identify high-risk defaulter profiles
View on GitHub →
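A condensed sketch of the supervised-modeling step, with placeholder column names standing in for the engineered LendingClub features:

```python
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Placeholder features and label; the real pipeline engineers many more.
FEATURES = ["loan_amnt", "int_rate", "annual_inc", "dti"]

df = pd.read_csv("lending_club.csv")  # placeholder path
X_train, X_test, y_train, y_test = train_test_split(
    df[FEATURES], df["default"], test_size=0.2,
    stratify=df["default"], random_state=42,
)

model = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.05)
model.fit(X_train, y_train)

# Evaluate ranking quality on the held-out set.
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Test AUC: {auc:.3f}")
```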
Data Preparation for Fintech Analytics
AWS
Python
Great Expectations
Postgres
Tableau
- A serverless framework to automate metadata extraction, profiling, validation, and transformation
- Event-driven workflows on AWS Lambda, Step Functions, and S3 for DQ checks and transformations (handler sketch below)
- Stored cleaned datasets in AWS RDS and created Tableau dashboards to visualize KPIs
View on GitHub →
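A minimal sketch of one Lambda step in the event-driven flow: triggered by an S3 object event, it loads the file and runs simple validation checks (bucket and column names are placeholders; the real framework drives its checks from Great Expectations suites):

```python
import io

import boto3
import pandas as pd

s3 = boto3.client("s3")

def handler(event, context):
    """Validate a newly landed CSV before it moves downstream."""
    record = event["Records"][0]["s3"]
    bucket, key = record["bucket"]["name"], record["object"]["key"]

    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    df = pd.read_csv(io.BytesIO(body))

    # Placeholder checks; the production framework is metadata-driven.
    failures = []
    if df["transaction_id"].isnull().any():
        failures.append("transaction_id contains nulls")
    if (df["amount"] < 0).any():
        failures.append("amount contains negative values")

    return {"bucket": bucket, "key": key, "passed": not failures, "failures": failures}
```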
Monitoring EKS Cluster
AWS
Terraform
Helm
Prometheus
Jenkins CI/CD
Kubernetes
- Streamlined the setup for deploying OpenTelemetry Webshop
- Enabled modular deployment by splitting large Kubernetes manifests into component YAML files
- Provisioned EKS, Grafana, and Prometheus with Terraform and Helm
- Developed a system to collect logs from the kube-system namespace and track unhealthy pods (see the sketch below)
- Built pipelines to test and build Docker images via a remote Jenkins server
View on GitHub →
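A minimal sketch of the unhealthy-pod check using the official Kubernetes Python client (namespace and health logic simplified):

```python
from kubernetes import client, config

def find_unhealthy_pods(namespace: str = "kube-system"):
    """List pods in the namespace that are not Running or Succeeded."""
    config.load_kube_config()  # use config.load_incluster_config() inside the cluster
    v1 = client.CoreV1Api()

    unhealthy = []
    for pod in v1.list_namespaced_pod(namespace).items:
        if pod.status.phase not in ("Running", "Succeeded"):
            unhealthy.append((pod.metadata.name, pod.status.phase))
    return unhealthy

if __name__ == "__main__":
    for name, phase in find_unhealthy_pods():
        print(f"{name}: {phase}")
```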
Sports Analytics System
Python
Plotly
Pandas
Tableau
- Developed EDA framework analyzing 10+ performance KPIs
- Created interactive dashboards for tactical analysis
- Identified 3 key success factors through Bayesian analysis
View on Kaggle →