Data Platform & AI Engineer
I am a Data Platform Engineer with 3+ years of experience designing and implementing scalable, cloud-native data and AI solutions for global enterprises.
Key Achievements
- Designed an automated renewable energy monitoring system processing 100GB+ IoT data daily.
- Developed a customer segmentation framework using Spark ML, boosting targeted marketing effectiveness.
- Built a scalable GenAI pipeline combining RAG (Retrieval-Augmented Generation) architecture with LLM fine-tuning.
- Migrated legacy ETL workflows to modern cloud-native architectures, reducing processing times by 60%.
Areas of Expertise
- Data Engineering: Architecting scalable, event-driven data pipelines and serverless data frameworks for enterprise analytics.
- MLOps: Automating model training, deployment, and monitoring to drive ML adoption at scale.
- GenAI & NLP: Leveraging LLMs and Retrieval-Augmented Generation (RAG) to create AI-powered enterprise solutions.
- Big Data & Distributed Systems: Designing scalable, fault-tolerant architectures to handle high-velocity streaming & batch workloads.
Dive deep into my story:
Work Experience
Graduate Assistant - Data Engineering
Department of IT, University of Maryland - College Park
Cloud Infrastructure Automation
- Engineered cloud-native deployment system for academic analytics tools:
- Automated provisioning of 50+ R applications using ShinyProxy/Terraform/Docker
- Implemented LDAP/SSO integration with granular RBAC policies
- Reduced VM provisioning time by 65% through infrastructure-as-code practices
Learning Analytics Pipeline Development
- Architected real-time clickstream processing system handling 20M+ daily events:
- GCP Dataflow pipeline with Pub/Sub/BigQuery integration (sketched below)
- Apache Superset dashboards monitoring 150+ engagement metrics
- Anomaly detection model identifying classroom irregularities with 92% accuracy
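The core of that pipeline follows the standard Apache Beam pattern below; the Pub/Sub subscription, BigQuery table, and event fields are placeholders rather than the production resources, and in practice it runs on the DataflowRunner.

```python
# Minimal Beam sketch: Pub/Sub clickstream -> BigQuery (resource names are placeholders).
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def parse_event(message: bytes) -> dict:
    """Decode a JSON clickstream event published to Pub/Sub."""
    event = json.loads(message.decode("utf-8"))
    return {
        "user_id": event.get("user_id"),
        "course_id": event.get("course_id"),
        "action": event.get("action"),
        "timestamp": event.get("timestamp"),
    }

options = PipelineOptions(streaming=True)  # use the DataflowRunner for the managed version
with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/clickstream-sub")
        | "ParseJSON" >> beam.Map(parse_event)
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:analytics.clickstream_events",
            schema="user_id:STRING,course_id:STRING,action:STRING,timestamp:TIMESTAMP",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```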
Academic Research Modernization
- Transformed EDUCAUSE survey analysis through:
- Dimensional modeling on AWS Redshift with automated materialized view refreshes
- Natural language processing pipeline using spaCy/NLTK
- LLM-powered Q&A system (Gemma2 9B/Ollama) with RAG architecture (sketched below)
- Developing cross-department Tableau dashboards serving 12 academic units
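Conceptually, the Q&A system follows the RAG loop sketched below. TF-IDF retrieval stands in for the actual vector store, generation assumes a local Ollama server with a Gemma 2 model pulled, and the document snippets are illustrative:

```python
# Simplified RAG sketch: TF-IDF retrieval stands in for a vector store;
# generation goes through a local Ollama server (http://localhost:11434 by default).
import requests
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Survey respondents cited hybrid learning support as a top IT priority.",
    "Most units reported increased demand for analytics dashboards.",
]  # in practice: chunked survey responses

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(documents)

def answer(question: str, top_k: int = 1) -> str:
    # Retrieve: rank chunks by cosine similarity to the question.
    scores = cosine_similarity(vectorizer.transform([question]), doc_matrix)[0]
    context = "\n".join(documents[i] for i in scores.argsort()[::-1][:top_k])

    # Generate: ground the LLM's answer in the retrieved context.
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "gemma2:9b", "prompt": prompt, "stream": False},
        timeout=120,
    )
    return response.json()["response"]

print(answer("What did respondents say about hybrid learning?"))
```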
Senior Software Engineer - Data Engineering
Tiger Analytics
Cloud Data Fabric Development
- Collaborated with a Lead Data Architect to drive a 10-engineer team building an AWS-based data fabric from scratch, accelerating the client’s time-to-market by 30% through:
- Custom ETL pipeline design with SCD Type-2 historization for accurate historical tracking (sketched below)
- Metadata-driven workflow management for dynamic ETL orchestration
- Real-time monitoring API using AWS Lambda & GraphQL for proactive failure detection
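The SCD Type-2 historization boils down to the PySpark pattern below; table and column names are placeholders, net-new records are omitted for brevity, and in the real pipeline this logic was generated from metadata rather than hard-coded:

```python
# SCD Type-2 sketch: expire changed current rows, append new current versions.
# Assumes the staging extract carries the same business columns as the dimension.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("scd2-sketch").getOrCreate()

dim = spark.table("dw.customer_dim")                 # existing history
updates = spark.table("staging.customer_updates")    # latest extract

current = dim.filter("is_current = true")
history = dim.filter("is_current = false")

# Detect customers whose tracked attribute changed.
changed_ids = (
    current.alias("c").join(updates.alias("u"), "customer_id")
    .filter(F.col("c.address") != F.col("u.address"))
    .select("customer_id")
)

# 1) Expire the current version of changed records.
expired = (
    current.join(changed_ids, "customer_id", "left_semi")
    .withColumn("effective_to", F.current_date())
    .withColumn("is_current", F.lit(False))
)

# 2) Current rows with no change stay as they are.
unchanged_current = current.join(changed_ids, "customer_id", "left_anti")

# 3) Insert the new current version for each changed record.
new_versions = (
    updates.join(changed_ids, "customer_id", "left_semi")
    .withColumn("effective_from", F.current_date())
    .withColumn("effective_to", F.lit(None).cast("date"))
    .withColumn("is_current", F.lit(True))
)

result = history.unionByName(unchanged_current).unionByName(expired).unionByName(new_versions)
result.write.mode("overwrite").saveAsTable("dw.customer_dim_updated")
```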
ML Engineering & Analytics
- Developed a scalable customer segmentation model using Spark ML/K-Means, leveraging PCA-driven feature engineering (see the sketch after this list)
- Implemented a forecasting engine on a GCP Data Lake comparing:
- ARIMA for capturing seasonal trends in time-series forecasting
- Random Forest & XGBoost for enhanced prediction accuracy
- Designed a phased rollout strategy for ML model deployment, ensuring gradual adaptation and A/B testing
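A minimal Spark ML version of the segmentation approach looks like the sketch below; the feature columns, number of principal components, and k are placeholders, not the tuned production values:

```python
# Spark ML sketch: assemble and scale features, reduce with PCA, cluster with K-Means.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler, PCA
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("segmentation-sketch").getOrCreate()
customers = spark.table("analytics.customer_features")

pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["recency", "frequency", "monetary"], outputCol="features"),
    StandardScaler(inputCol="features", outputCol="scaled", withMean=True),
    PCA(k=2, inputCol="scaled", outputCol="pca_features"),
    KMeans(k=5, featuresCol="pca_features", predictionCol="segment", seed=42),
])

model = pipeline.fit(customers)
segments = model.transform(customers).select("customer_id", "segment")
segments.show(5)
```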
Cloud Infrastructure Automation
- Reduced cloud provisioning time by 60% using Terraform-based Infrastructure as Code (IaC) templates for AWS/GCP
- Built a serverless cost optimization framework for dynamic EC2 resource allocation, leveraging:
- Spot Instance optimization for EMR clusters, reducing compute costs
- Auto-scaling strategies for dynamic resource allocation based on demand
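At its simplest, the Spot-based EMR provisioning reduces to a request like the one below, shown with boto3 for illustration; the actual framework was Terraform-driven, and the cluster name, instance counts, and release label are placeholders:

```python
# boto3 sketch: launch an EMR cluster whose core nodes run on Spot capacity.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="cost-optimized-etl",
    ReleaseLabel="emr-6.10.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"Name": "driver", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "workers", "InstanceRole": "CORE", "Market": "SPOT",
             "InstanceType": "m5.xlarge", "InstanceCount": 4},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```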
Data Governance & Security Compliance
- Architected a metadata-driven Data Quality (DQ) framework integrating Great Expectations & Deequ for automated data validation (see the sketch after this list)
- Implemented a scalable PII protection system with:
- AWS Macie-powered sensitive data detection for compliance with GDPR/CCPA
- Format-preserving encryption (FPE-FF1) in Python/Scala to maintain referential integrity
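At its core, the DQ framework wrapped expectation suites like the sketch below (classic pandas-dataset style; the Great Expectations API differs across versions, and the file path, columns, and thresholds here are illustrative):

```python
# Great Expectations sketch: validate a landed extract before it moves downstream.
import great_expectations as ge
import pandas as pd

df = pd.read_parquet("landing/customers.parquet")  # placeholder extract
batch = ge.from_pandas(df)

checks = [
    batch.expect_column_values_to_not_be_null("customer_id"),
    batch.expect_column_values_to_be_unique("customer_id"),
    batch.expect_column_values_to_be_between("credit_score", min_value=300, max_value=850),
]

failed = [c for c in checks if not c.success]
if failed:
    raise ValueError(f"{len(failed)} data quality expectations failed")
```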
Intern & Software Engineer - Data Engineering
Xenonstack
ML Engineering
- Scaled TensorFlow training pipelines using Python/Ray across a 6-node cluster, reducing computation time by 10%
- Developed a production ML evaluation CLI with MLflow/Python that automated feature scoring, reducing manual engineering effort by 25% (sketched below)
- Implemented market segmentation models (Decision Trees/Random Forest) for personalized product recommendations
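The evaluation CLI followed the general shape below; the scoring logic, metric names, and experiment name are stand-ins for the production tool:

```python
# CLI sketch: evaluate a predictions file and log the metrics to MLflow.
import argparse
import mlflow
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score

def main() -> None:
    parser = argparse.ArgumentParser(description="Evaluate predictions and log to MLflow")
    parser.add_argument("--predictions", required=True, help="CSV with y_true and y_pred columns")
    parser.add_argument("--run-name", default="evaluation")
    args = parser.parse_args()

    df = pd.read_csv(args.predictions)
    mlflow.set_experiment("model-evaluation")
    with mlflow.start_run(run_name=args.run_name):
        mlflow.log_metric("accuracy", accuracy_score(df["y_true"], df["y_pred"]))
        mlflow.log_metric("f1", f1_score(df["y_true"], df["y_pred"], average="macro"))

if __name__ == "__main__":
    main()
```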
Data Engineering
- Designed and developed a Linux-based command-line tool for automated data ingestion and storage:
- Ingested enterprise, asset, meteorological, and failure data from 70+ geo-distributed locations of a renewable energy giant
- Structured data into a multi-zoned AWS data lake, ensuring efficient organization and retrieval
- Implemented user-specified granularity, enabling data science teams to build historical datasets
- Powered KPI-driven dashboards for energy grid performance and downtime analytics
- Integrated Spark Streaming into ETL pipelines and migrated from Spark 1.4 to Spark 2.4 (see the sketch after this list):
- Processed ~100GB of IoT data daily from 70+ wind and solar farms
- Structured data in a multi-zoned AWS data lake for consistent availability
- Enhanced processing speed and scalability through Spark 2.4 migration
- Developed a command-line interface for CI/CD automation:
- Automated Docker image creation and registry push
- Integrated with a remote Jenkins server to trigger CI/CD build pipelines
- Standardized and automated containerized deployment workflows
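The streaming ingestion pattern from the second item above is sketched here with Structured Streaming; the Kafka topic, event schema, and S3 paths are placeholders for the actual wind/solar farm feeds:

```python
# Structured Streaming sketch: IoT telemetry from Kafka into the raw zone of an S3 data lake.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("iot-ingest-sketch").getOrCreate()

schema = StructType([
    StructField("site_id", StringType()),
    StructField("metric", StringType()),
    StructField("value", DoubleType()),
    StructField("event_time", TimestampType()),
])

telemetry = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "iot-telemetry")
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("event"))
    .select("event.*")
    .withColumn("ingest_date", F.to_date("event_time"))
)

query = (
    telemetry.writeStream.format("parquet")
    .option("path", "s3a://datalake/raw/iot/")
    .option("checkpointLocation", "s3a://datalake/checkpoints/iot/")
    .partitionBy("ingest_date")
    .trigger(processingTime="5 minutes")
    .start()
)
query.awaitTermination()
```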
Academic Foundation
University of Maryland, College Park
Master of Information Management [Data Science]
2023-2025 | GPA: 4.0/4.0
Specialized Study Areas
Advanced Data Systems Architecture
Machine Learning Operationalization (MLOps)
Generative AI Implementation Strategies
Relevant Coursework
Cloud-Native Data Engineering (AWS/GCP)
Deep Learning Architectures
Large Language Model Applications
Data Governance & Ethics
Panjab University
Bachelor of Engineering [Information Technology]
2015-2019 | GPA: 3.74/4.0
Core Competencies Developed
Distributed Systems Design
Algorithm Optimization
Database Architecture
Leadership & Engagement
Technical Member - Entrepreneurship Cell
Captain - University Football Team
Core Technical Competencies
Data Engineering
- AWS | GCP
- Apache Spark | Apache Beam
- Kafka
- BigQuery | Redshift
- Airflow
- Apache Iceberg | Delta Lake
- Hadoop Ecosystem
- dbt Core
- Distributed Systems
Software Engineering
- Python | Scala | Rust
- Git | GitOps
- CI/CD Pipelines
- Docker
- Kubernetes
- Terraform
- REST APIs
- Backend Development
Data Science & MLOps
- PyTorch
- MLflow
- Advanced SQL
- Weights & Biases
- Pandas/NumPy
- Scikit-learn
- XGBoost
- Kubeflow
- Spark ML
- Apache Superset
Generative AI
- LangChain
- Natural Language Processing
- Vector DBs
- RAG Architecture
- LLM Fine-tuning
- Ollama
- Prompt Engineering
Selected Works
GenAI Text Summarization Engine
Gemma2 on Ollama
spaCy
Python
NLTK
PyTorch
- Developed prompt engineering pipeline for document summarization
- Integrated zero-shot learning for multi-domain adaptability
- Achieved 80% content retention accuracy on internal documents
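A pared-down version of the summarization flow, assuming a local Ollama server with a Gemma 2 model pulled and spaCy's small English model installed; the prompt wording and sentence cap are illustrative:

```python
# Summarization sketch: spaCy sentence segmentation + a zero-shot prompt sent to Ollama.
import requests
import spacy

nlp = spacy.load("en_core_web_sm")

def summarize(text: str, max_sentences: int = 3) -> str:
    # Keep the prompt bounded: segment into sentences and cap the context length.
    sentences = [sent.text.strip() for sent in nlp(text).sents][:200]
    prompt = (
        f"Summarize the following document in at most {max_sentences} sentences, "
        "preserving key facts and figures:\n\n" + " ".join(sentences)
    )
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "gemma2:9b", "prompt": prompt, "stream": False},
        timeout=300,
    )
    return response.json()["response"].strip()
```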
View Implementation →
Mitigating Data Quality Issues in Fintech Loan Analysis
AWS
Python
Great Expectations
Postgres
Tableau
- Developed a 5-step AWS Glue and Great Expectations pipeline for metadata extraction, profiling, validation, and transformation.
- Built an event-driven pipeline with AWS Lambda, Step Functions, and S3 for real-time quality checks and transformations.
- Integrated cleaned datasets into AWS RDS and created Tableau dashboards to visualize data quality trends and anomalies.
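The event-driven trigger in the second bullet reduces to a Lambda handler along these lines; the state machine ARN and downstream validation steps are placeholders:

```python
# Lambda handler sketch: an S3 object-created event kicks off the Step Functions DQ workflow.
import json
import os
import boto3

sfn = boto3.client("stepfunctions")

def handler(event, context):
    # One S3 event can batch several records; start one validation run per uploaded file.
    for record in event["Records"]:
        payload = {
            "bucket": record["s3"]["bucket"]["name"],
            "key": record["s3"]["object"]["key"],
        }
        sfn.start_execution(
            stateMachineArn=os.environ["DQ_STATE_MACHINE_ARN"],
            input=json.dumps(payload),
        )
    return {"statusCode": 200}
```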
View on GitHub →
Monitoring EKS Cluster
AWS
Terraform
Helm
Prometheus
Jenkins CI/CD
Kubernetes
- Streamlined the setup for deploying OpenTelemetry Webshop
- Enabled modular deployment by splitting large Kubernetes manifests into component YAML files
- Provisioned EKS, Grafana, and Prometheus with Terraform and Helm
- Developed a system to collect logs from the kube-system and track unhealthy pods
- Built pipelines to test and build Docker images via a remote Jenkins server
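The unhealthy-pod check is essentially the loop below, using the official Kubernetes Python client; the restart-count cutoff is an illustrative threshold:

```python
# Sketch: list kube-system pods and flag any that are not Running/Succeeded or restart too often.
from kubernetes import client, config

config.load_kube_config()  # inside the cluster, use config.load_incluster_config()
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod(namespace="kube-system").items:
    phase = pod.status.phase
    restarts = sum(cs.restart_count for cs in (pod.status.container_statuses or []))
    if phase not in ("Running", "Succeeded") or restarts > 5:
        print(f"unhealthy: {pod.metadata.name} phase={phase} restarts={restarts}")
```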
View on GitHub →
Cloud Data Fusion Platform
Google Cloud
Apache Spark
Terraform
Apache Superset
Bash
SQL
- Architected automated big data solution processing 1M+ NYC records
- Implemented real-time weather/accident correlation analytics
- Reduced query latency by 40% through query optimization
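The weather/accident correlation step amounts to a join-and-correlate like the sketch below; dataset paths and column names are placeholders:

```python
# PySpark sketch: correlate daily accident counts with precipitation.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("nyc-correlation-sketch").getOrCreate()

accidents = spark.read.parquet("gs://nyc-data/accidents/")
weather = spark.read.parquet("gs://nyc-data/weather/")

daily = (
    accidents.groupBy(F.to_date("crash_time").alias("day"))
    .agg(F.count("*").alias("accidents"))
    .join(weather.select(F.to_date("observed_at").alias("day"), "precipitation_mm"), "day")
)

print("precipitation vs. accidents:", daily.stat.corr("precipitation_mm", "accidents"))
```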
View on GitHub →
Sports Analytics System
Python
Plotly
Pandas
Tableau
- Developed EDA framework analyzing 10+ performance KPIs
- Created interactive dashboards for tactical analysis
- Identified 3 key success factors through Bayesian analysis
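The EDA framework centers on grouped KPI tables and interactive charts along these lines; the input file and column names are placeholders:

```python
# EDA sketch: aggregate a few match KPIs with pandas and chart them with Plotly.
import pandas as pd
import plotly.express as px

matches = pd.read_csv("matches.csv")

kpis = (
    matches.groupby("team")
    .agg(goals_per_game=("goals", "mean"),
         possession_pct=("possession", "mean"),
         pass_accuracy=("pass_accuracy", "mean"))
    .reset_index()
)

fig = px.bar(kpis, x="team", y="goals_per_game", title="Average goals per game by team")
fig.show()
```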
View Analysis →