Sagar Das Profile

Sagar Das

Data Engineer | ML Engineer | AI Engineer

Professional Journey

Data Specialist

Part-Time
University of Maryland | College Park, MD | 09/2023 - 05/2025
  • Designed a streaming analytics pipeline on GCP to process and analyze 20M+ clickstream events and logs per day using Pub/Sub, Apache Beam, and BigQuery streaming; developed a Superset dashboard for academic leadership to track 15+ engagement KPIs.
  • Developed an NLP-driven prototype using RAG architecture (LangChain, FAISS, OpenAI, Gemma), enabling Q&A and summarization over 50+ Excel-based survey datasets (100K+ responses); fine-tuned prompts and rerankers, achieving 70% QA relevance.
  • Developed a Flask web app for researchers to self-provision GCP/AWS compute via Terraform with pre-configured runtimes, secured behind an nginx reverse proxy with SSL, SSO, and role-based access control, eliminating repetitive IT requests.
  • Optimized AWS ETL workflows by implementing incremental ingestion and advanced SQL techniques (CTEs, partitioning, indexing), cutting processing time from 7 hrs to 4 hrs while maintaining data accuracy for reporting.

Senior Software Engineer - Data

Full-Time
Tiger Analytics | Chennai, India | 07/2021 - 07/2023
  • Partnered with data architects to prototype a minimum viable product, define product roadmaps, and deliver end-to-end platform features for the successful launch of Tiger Intelligent Data Express, now utilized by key clients and external teams.
  • Developed the MVP's backend using AWS serverless tools and later scaled it to a microservice architecture using FastAPI and Docker to support 3x user growth.
  • Developed a metadata-driven ETL ingestion framework using Airbyte, Apache Airflow, AWS Glue, and Python, enabling rapid ingestion from diverse enterprise data sources (CDC, streaming, batch, on-premise databases) and reducing asset onboarding time from weeks to hours.
  • Engineered an ACID-compliant Lakehouse using Apache Iceberg, AWS S3, Athena, and Redshift. Implemented Iceberg optimizations such as scheduled compaction, orphan file cleanup, and retention jobs, which improved read query performance and cost efficiency for analytical workloads.
  • Built a data quality tool with Great Expectations, Apache Spark, and Airflow to run 30+ custom validation checks on data at rest, reducing data anomalies by 60% across downstream BI and analytics pipelines.
  • Established data governance capabilities by integrating DataHub with Postgres to track data lineage and metadata across 100+ Airflow DAGs, Glue ETLs, and Iceberg tables.

Intern & Software Engineer - Data

Internship + Full-Time
Xenonstack Pvt. Limited | Chandigarh, India | 01/2019 - 11/2019
  • Migrated legacy Hadoop jobs to a modern Kappa architecture using Scala and Spark Streaming on Databricks, integrating Kafka and Delta Lake for optimized data processing.
  • Enabled granular micro-batch feeds (15, 30, 60 minutes) for downstream forecasting models by streaming 10GB/day of IoT data from 45 global regions via the Kappa architecture.
  • Collaborated with the MLOps team to develop a Python and MLflow integration, automating model management and scoring processes; increased feature engineering efficiency by 33% and reduced champion model discovery time.
  • Automated CI/CD workflows by creating a custom Linux CLI tool to deploy Dockerized applications to Amazon ECR via Jenkins and Terraform, eliminating manual operations.

Featured Projects

Data Fusion Engineering

Google Cloud | Apache Spark | Terraform | Apache Superset | Bash | SQL

Analyzing NYC road safety by integrating weather, traffic, and taxi data to uncover impactful patterns using a modern data stack.

View on GitHub →

Intelligent Record Management

PyTorch | NLP Tools | LangChain | Streamlit | Elasticsearch | Gemma 2

NLP-driven system for organizing 10,000+ documents with semantic search, knowledge graphs, and automated NER and topic analysis.

View on GitHub →

Loan Default Prediction System

PySpark | Seaborn | scikit-learn | XGBoost | Random Forest | K-Means | PCA

End-to-end ML pipeline for borrower risk segmentation using advanced feature engineering and clustering.

View on GitHub →

Data Preparation for Fintech Analytics

AWS | Python | Great Expectations | Postgres | Tableau

Automated AWS framework for metadata storage, profiling, quality checks, and transformation in fintech analytics.

View on GitHub →

Monitoring EKS Cluster

AWS | Terraform | Helm | Prometheus | Jenkins | CI/CD | Kubernetes | EKS

Modular Kubernetes setup with Terraform/Helm, Prometheus monitoring, log health checks, and Jenkins CI/CD for OpenTelemetry demo webshop deployment.

View on GitHub →

Sports Analytics System

Python | Plotly | Pandas | Tableau

Deep-dive analytics of Liverpool FC's decade: market dynamics, player profiles, tactical investments, and managerial legacies.

View on Kaggle →

Technical Skills

Data Engineering
  • AWS and GCP cloud tools
  • Apache Spark | Apache Beam
  • Kafka
  • BigQuery | Redshift
  • Airflow
  • Apache Iceberg | Delta Lake
  • Hadoop Ecosystem
  • dbt
  • Apache Superset
  • Distributed Systems
Software Engineering
  • Data Structures & Algorithms
  • Python | Scala | Rust
  • Git | GitOps
  • CI/CD Pipelines
  • Docker
  • Kubernetes
  • Terraform
  • REST APIs
  • Backend Development
  • Shell Scripting
  • Server Side Programming
  • HTML | CSS | JavaScript
ML Engineering
  • Feature Engineering
  • Supervised & Unsupervised learning
  • PyTorch
  • Advanced SQL
  • Weights & Biases
  • Scikit-learn
  • XGBoost
  • MLflow
  • Kubeflow
  • Metaflow
  • SparkML
  • Model Serving and Management
Generative AI
  • LangChain
  • Natural Language Processing
  • Vector DBs
  • RAG Architecture
  • LLMs and Fine-tuning
  • Prompt Engineering
  • Hugging Face Transformers
  • Agentic AI systems

Education

Master of Information Management

Data Science Track

University of Maryland, College Park, MD
Aug 2023 - May 2025 | GPA: 4.0/4.0

Key Courses:

Big Data Infrastructure | Data Analytics | Cloud Computing | Advanced Data Science | Data Integration | Product Management

Bachelor of Engineering

Information Technology

Panjab University, Chandigarh, India
Aug 2015 - May 2019 | GPA: 3.74/4.0

Key Courses:

Data Structures & Algorithms | Database Management Systems | Operating Systems | Computer Networks | Object Oriented Programming | Agile Methods

Let's Talk About Your Data Needs

Whether you're looking to build a data platform, optimize existing pipelines, or explore how AI/ML can enhance your data strategy, I'd love to hear from you.