Office Address

Edmonton, Alberta, Canada

Phone Number

+1 (587) 937-3409

Email Address

yogesh.kumar.rajakumar.v.21@gmail.com

Portfolio

AI for Energy Pipelines

[Images: project overview, architecture diagram, interface preview]

This capstone project is an AI-powered tool that summarizes code and traces data lineage across enterprise-scale Databricks pipelines. It lets cross-functional teams in the energy sector understand, audit, and interpret complex data workflows without deep programming knowledge.

Client: University of Alberta / Ernst & Young
Start: January 2024
Complete: April 2024
Focus Areas: NLP, Data Engineering, MLOps
Stack: Azure, Databricks, Python, Streamlit

The tool combines a transformer-based model (BERT) with sequence modeling (LSTM) to generate clear, business-friendly summaries of complex code logic. It maps data flows using graph-based lineage tracing and presents the summaries and lineage through a lightweight Streamlit web interface.

  • Achieved an F1-score of 0.73 in summarizing Databricks pipeline code.
  • Traced data lineage with 81 percent accuracy using probabilistic graphs.
  • Handled scaling efficiently with a time complexity of O(n log n).
  • Maintained robustness with only a 10 percent drop under noisy data conditions.
  • Adapted based on user feedback, improving performance by 15 percent.
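For readers unfamiliar with the F1-score quoted above, the sketch below shows one common way to score a generated summary against a reference: token-overlap F1. This is an assumption for illustration only; the project description does not say how its F1-score was computed, and the function name `token_f1` and the example sentences are hypothetical.

```python
from collections import Counter


def token_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1 between a generated summary and a reference summary."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return 0.0

    # Multiset intersection counts each shared token at most as often
    # as it appears in both texts.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(cand_tokens)  # share of generated tokens that match
    recall = overlap / len(ref_tokens)      # share of reference tokens recovered
    return 2 * precision * recall / (precision + recall)


# Hypothetical example summaries (not taken from the project data).
generated = "filters null rows then sums volume by day"
reference = "drops null rows and sums volume by day"
score = token_f1(generated, reference)
print(round(score, 2))  # 6 of 8 tokens match on each side, so F1 = 0.75
```

Averaging this score over a held-out set of notebook cells would give a single figure comparable in form to the 0.73 reported above, though the actual evaluation protocol may have differed.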

Project Overview

This project was developed as part of my M.Sc. in Mathematical and Statistical Sciences. It combines artificial intelligence and software engineering to improve data transparency in regulated industries, and it was built using real-world energy pipelines from my internship at Ernst & Young Canada.

The system processes Python-based Databricks notebooks, extracts data transformations, and uses statistical models to generate lineage diagrams and concise natural language summaries. An interactive Streamlit frontend allows users to navigate the summaries, provide feedback, and download reports.
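To make the idea of extracting transformations and lineage from notebook code concrete, here is a minimal sketch that uses Python's standard `ast` module. It is a deterministic simplification, not the project's statistical approach: it only records which variables each assignment reads. The function names `extract_lineage` and `upstream`, and the table and column names in the sample cell, are hypothetical.

```python
import ast
from collections import defaultdict


def extract_lineage(source: str) -> dict:
    """Map each assigned variable to the names its right-hand side reads."""
    tree = ast.parse(source)
    lineage = defaultdict(set)
    for node in ast.walk(tree):
        if isinstance(node, ast.Assign):
            # Every Name node inside the right-hand side is a dependency.
            reads = {n.id for n in ast.walk(node.value) if isinstance(n, ast.Name)}
            for target in node.targets:
                if isinstance(target, ast.Name):
                    lineage[target.id] |= reads
    return dict(lineage)


def upstream(lineage: dict, name: str) -> set:
    """Return every name that `name` depends on, directly or transitively."""
    seen = set()
    stack = [name]
    while stack:
        current = stack.pop()
        for parent in lineage.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical notebook cell; it is parsed only, never executed.
cell = '''
raw = spark.read.table("meter_readings")
clean = raw.dropna()
daily = clean.groupBy("day").sum("volume")
report = daily.join(clean, "day")
'''

lineage = extract_lineage(cell)
print(sorted(lineage["report"]))           # ['clean', 'daily']
print(sorted(upstream(lineage, "report")))  # ['clean', 'daily', 'raw', 'spark']
```

A sketch like this yields exact edges only for simple assignments; a real pipeline would also need to handle table reads and writes, function calls across cells, and SQL strings, which is presumably where probabilistic inference becomes useful.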

The architecture is optimized for scalability and maintainability, using modular components for NLP, lineage inference, Azure data integration, and user interaction. Feedback loops are built into the interface to help refine output quality through real-world usage.
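One way to picture the modular design with a built-in feedback loop is sketched below. This is an illustrative assumption, not the project's actual code: the summarizer and lineage tracer are passed in as plain functions so each can be swapped independently, and user ratings are stored alongside the output. The names `PipelineReport` and `build_report`, and the 1 to 5 rating scale, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set


@dataclass
class PipelineReport:
    """Output of one analysis run plus the user feedback collected on it."""
    summary: str
    lineage: Dict[str, Set[str]]
    ratings: List[int] = field(default_factory=list)

    def add_feedback(self, rating: int) -> None:
        # Assumed 1-5 scale; the real interface may collect feedback differently.
        if not 1 <= rating <= 5:
            raise ValueError("rating must be between 1 and 5")
        self.ratings.append(rating)

    def mean_rating(self) -> Optional[float]:
        """Average rating, or None when no feedback has been given yet."""
        if not self.ratings:
            return None
        return sum(self.ratings) / len(self.ratings)


def build_report(
    source: str,
    summarize: Callable[[str], str],
    trace: Callable[[str], Dict[str, Set[str]]],
) -> PipelineReport:
    """Run the NLP and lineage components independently and bundle the results."""
    return PipelineReport(summary=summarize(source), lineage=trace(source))


# Stand-in components for demonstration; real ones would wrap the models.
report = build_report(
    "clean = raw.dropna()",
    summarize=lambda src: "Removes rows with missing values.",
    trace=lambda src: {"clean": {"raw"}},
)
report.add_feedback(4)
report.add_feedback(5)
print(report.mean_rating())  # 4.5
```

Keeping the components behind simple callables means a new summarization model or lineage method can be tested against the same interface, and the stored ratings give a signal for deciding which version to keep.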

The project delivers a robust, interpretable, and production-ready solution that meets the dual needs of technical precision and business communication. It serves as a foundation for building broader AI-enabled data governance tools in the energy industry.