Sagar Das Profile

Sagar Das

Data Engineer | ML Engineer | AI Engineer

About Me

Professional Journey

Data Specialist

Part-Time
University of Maryland | College Park, MD | 09/2023 - 05/2025
  • Built an NLP pipeline to transform surveys into analytics-ready datasets using PyTorch, LangChain, and Hugging Face Transformers, saving 8 research analysts a combined 20+ hours of manual data querying.
  • Deployed R Shiny applications on GCP using ShinyProxy, Docker, and Terraform for multi-user collaboration on internal research applications.
  • Incorporated security controls (IAM, RBAC) alongside a Flask application gateway with Google OAuth and reverse-proxy SSL to ensure secure access.
  • Prepared a POC using GCP Pub/Sub, Apache Beam, and BigQuery to process 20M+ daily clickstream events from Canvas ELMS, enabling instant classroom analytics for the academic leadership team via Apache Superset dashboards.
  • Optimized AWS ETL workflows by implementing incremental ingestion and advanced SQL techniques (CTEs, partitioning, indexing), cutting processing time from 7 hours to 4 hours while ensuring high data accuracy for reporting.
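
To illustrate the incremental ingestion idea from the last bullet, here is a minimal sketch of a high-watermark filter. It is not the production workflow: the `updated_at` field, the `incremental_batch` helper, and the sample rows are all hypothetical, and a real pipeline would persist the watermark between runs.

```python
from datetime import datetime


def incremental_batch(rows, last_watermark):
    """Return rows changed after last_watermark, plus the new watermark.

    `rows` is a list of dicts with a hypothetical `updated_at` datetime field.
    """
    new_rows = [r for r in rows if r["updated_at"] > last_watermark]
    # Advance the watermark only when new rows arrived; otherwise keep the old one.
    new_watermark = max((r["updated_at"] for r in new_rows), default=last_watermark)
    return new_rows, new_watermark


# Hypothetical source table contents.
source = [
    {"id": 1, "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "updated_at": datetime(2024, 1, 3)},
    {"id": 3, "updated_at": datetime(2024, 1, 5)},
]

# First run picks up only rows changed since the stored watermark.
batch, watermark = incremental_batch(source, datetime(2024, 1, 2))

# A second run with the advanced watermark has nothing left to process.
empty_batch, same_watermark = incremental_batch(source, watermark)
```

Only rows newer than the stored watermark are reprocessed on each run, which is where the time saving over a full reload comes from.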

Senior Software Engineer - Data

Full-Time
Tiger Analytics | Chennai, India | 07/2021 - 07/2023
  • Partnered with data architects to prototype MVPs and launch 2 enterprise solutions: Intelligent Data Express and Data Observability Framework.
  • Developed the MVP backends using AWS serverless tools, then scaled them to a microservice architecture with FastAPI and Docker to support 3x user growth.
  • Developed a metadata-driven ETL ingestion framework using Airbyte, Apache Airflow, AWS Glue, and Python, enabling rapid ingestion from diverse enterprise data sources (CDC, streaming, batch, on-premises databases) and reducing asset onboarding time from weeks to hours.
  • Engineered an ACID-compliant lakehouse using Apache Iceberg, AWS S3, Athena, and Redshift. Implemented SCD Type-2 and time travel capabilities to ensure historical data integrity and support analytical workloads.
  • Built a data quality tool with Great Expectations, Apache Spark, and Airflow to run 30+ custom checks on data at rest, reducing data anomalies by 60% across downstream BI and analytics pipelines.
  • Implemented an infrastructure observability pipeline using CloudWatch Logs, the ELK stack, and Grafana, accelerating root cause analysis and decreasing mean time to resolution (MTTR) for production issues by 40%.
  • Established data governance capabilities by integrating LinkedIn DataHub, providing detailed tracking, auditability, and visualization of data flow to support regulatory compliance (GDPR, CCPA) and trust in data assets.
  • Collaborated with business analysts and sales teams to translate functional requirements into engineering solutions.
  • Delivered MVP demos and explained complex technical architectures to clients through concise presentations.
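
As an illustration of the SCD Type-2 technique named in the lakehouse bullet, here is a minimal plain-Python sketch of the row-versioning logic. It is an assumption-laden toy, not the Iceberg implementation: the `apply_scd2` helper, the column names (`valid_from`, `valid_to`, `is_current`), and the sample data are hypothetical. On a lakehouse table the same idea would typically be expressed as a merge statement.

```python
from datetime import date


def apply_scd2(dimension, incoming, key, tracked, as_of):
    """Apply one batch of changes to an SCD Type-2 dimension (a list of dicts).

    Each row carries valid_from, valid_to (None means open-ended) and is_current.
    """
    current = {row[key]: row for row in dimension if row["is_current"]}
    for record in incoming:
        existing = current.get(record[key])
        if existing and all(existing[col] == record[col] for col in tracked):
            continue  # tracked attributes unchanged: keep the current version
        if existing:
            # Close the old version instead of overwriting it, preserving history.
            existing["valid_to"] = as_of
            existing["is_current"] = False
        new_row = {**record, "valid_from": as_of, "valid_to": None, "is_current": True}
        dimension.append(new_row)
        current[record[key]] = new_row
    return dimension


# Hypothetical dimension with one current row.
dim = [
    {"customer_id": 1, "city": "Chennai", "valid_from": date(2022, 1, 1),
     "valid_to": None, "is_current": True},
]
updates = [
    {"customer_id": 1, "city": "Chandigarh"},  # changed attribute: new version
    {"customer_id": 2, "city": "Delhi"},       # new key: first version
]
apply_scd2(dim, updates, key="customer_id", tracked=["city"], as_of=date(2023, 1, 1))
```

After the call, the original row for customer 1 is closed with an end date rather than replaced, so queries can still reconstruct what the dimension looked like before the change.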

Intern & Software Engineer - Data

Internship + Full-Time
Xenonstack Pvt. Limited | Chandigarh, India | 01/2019 - 11/2019
  • Modernized a data platform by migrating legacy Hadoop workflows and Pig scripts to Scala/PySpark ETL jobs on Databricks, processing IoT sensor and weather data from 45 geo-locations via Kafka to create a high-availability data lake powering data science workflows.
  • Collaborated with the MLOps team to build a Python framework using MLflow to automate model management and scoring, improving feature engineering efficiency by 33% and reducing model discovery time in production.
  • Scaled TensorFlow model training by implementing a Ray-based distributed pipeline across a 6-node cluster, reducing computation time by ~10%.
  • Developed a cost-optimization strategy to bid and select EC2 spot instances for AWS EMR jobs during off-peak hours.

Featured Projects

Data Fusion Engineering

Google Cloud | Apache Spark | Terraform | Apache Superset | Bash | SQL

Analyzing NYC road safety by integrating weather, traffic, and taxi data to uncover impactful patterns using a modern data stack.

View on GitHub →

Intelligent Record Management

PyTorch | NLP Tools | LangChain | Streamlit | Elasticsearch | Gemma 2

NLP-driven system for organizing 10,000+ documents with semantic search, knowledge graphs, and automated NER and topic analysis.

View on GitHub →

Loan Default Prediction System

PySpark | Seaborn | scikit-learn | XGBoost | Random Forest | K-Means | PCA

End-to-end ML pipeline for borrower risk segmentation using advanced feature engineering and clustering.

View on GitHub →

Data Preparation for Fintech Analytics

AWS | Python | Great Expectations | Postgres | Tableau

Automated AWS framework for metadata storage, profiling, quality checks, and transformation in fintech analytics.

View on GitHub →

Monitoring EKS Cluster

AWS | Terraform | Helm | Prometheus | Jenkins CI/CD | Kubernetes | EKS

Modular Kubernetes setup with Terraform and Helm, Prometheus monitoring, log health checks, and Jenkins CI/CD for deploying the OpenTelemetry demo webshop.

View on GitHub →

Sports Analytics System

Python | Plotly | Pandas | Tableau

Deep-dive analytics of Liverpool FC's decade: market dynamics, player profiles, tactical investments, and managerial legacies.

View on Kaggle →

Technical Skills

Data Engineering
  • AWS and GCP cloud tools
  • Apache Spark | Apache Beam
  • Kafka
  • BigQuery | Redshift
  • Airflow
  • Apache Iceberg | Delta Lake
  • Hadoop Ecosystem
  • dbt
  • Apache Superset
  • Distributed Systems

Software Engineering
  • Data Structures & Algorithms
  • Python | Scala | Rust
  • Git | GitOps
  • CI/CD Pipelines
  • Docker
  • Kubernetes
  • Terraform
  • REST APIs
  • Backend Development
  • Shell Scripting
  • Server Side Programming
  • HTML | CSS | JavaScript

ML Engineering
  • Feature Engineering
  • Supervised & Unsupervised Learning
  • PyTorch
  • Advanced SQL
  • Weights & Biases
  • Scikit-learn
  • XGBoost
  • MLflow
  • Kubeflow
  • Metaflow
  • SparkML
  • Model Serving and Management

Generative AI
  • LangChain
  • Natural Language Processing
  • Vector DBs
  • RAG Architecture
  • LLMs and Fine-tuning
  • Prompt Engineering
  • Hugging Face Transformers
  • Agentic AI Systems

Education

Master of Information Management

Data Science Track

University of Maryland, College Park, MD
Aug 2023 - May 2025 | GPA: 4.0/4.0

Key Courses:

Big Data Infrastructure | Data Analytics | Cloud Computing | Advanced Data Science | Data Integration | Product Management

Bachelor of Engineering

Information Technology

Panjab University, Chandigarh, India
Aug 2015 - May 2019 | GPA: 3.74/4.0

Key Courses:

Data Structures & Algorithms | Database Management Systems | Operating Systems | Computer Networks | Object-Oriented Programming | Agile Methods

Let's Talk About Your Data Needs

Whether you're looking to build a data platform, optimize existing pipelines, or explore how AI/ML can enhance your data strategy, I'd love to hear from you.