Portfolio

AI for Energy Pipelines

This capstone project presents an AI-powered tool designed to summarize code and trace data lineage across enterprise-scale Databricks pipelines. The system enables cross-functional teams in the energy sector to understand, audit, and interpret complex data workflows without deep programming knowledge.

Client

University of Alberta / Ernst & Young

Start

January 2024

Complete

April 2024

Focus Areas

NLP, Data Engineering, MLOps

Stack

Azure, Databricks, Python, Streamlit

The tool integrates transformer-based models (BERT) and sequence modeling (LSTM) to generate clear, business-friendly summaries of complex code logic. It maps out data flows using graph-based lineage tracing and visualizes the entire process through a lightweight web interface.
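To illustrate what graph-based lineage tracing can look like, here is a minimal sketch that walks a dependency graph to find every upstream source of a dataset. The graph contents, the table names, and the `trace_upstream` helper are all hypothetical and chosen for illustration; the project's actual graph model is not described here.

```python
from collections import deque

# Hypothetical lineage edges: each key is a dataset, each value lists the
# datasets it is derived from. Table names are invented for illustration.
LINEAGE = {
    "reports.daily_flow": ["clean.meter_readings", "ref.pipeline_segments"],
    "clean.meter_readings": ["raw.meter_readings"],
    "ref.pipeline_segments": [],
    "raw.meter_readings": [],
}


def trace_upstream(graph, target):
    """Return every dataset that `target` depends on, nearest first."""
    seen = set()
    order = []
    queue = deque(graph.get(target, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already visited through another path
        seen.add(node)
        order.append(node)
        queue.extend(graph.get(node, []))
    return order


upstream = trace_upstream(LINEAGE, "reports.daily_flow")
print(upstream)
```

A breadth-first walk lists direct inputs before more distant ones, which is usually the order an auditor wants to read them in.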

Achieved an F1-score of 0.73 when summarizing Databricks pipeline code.
Traced data lineage with 81 percent accuracy using probabilistic graphs.
Scaled efficiently, with O(n log n) time complexity.
Remained robust under noisy data, with only a 10 percent drop in performance.
Incorporated user feedback, which improved performance by 15 percent.
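For readers unfamiliar with F1-scores on generated text, one common formulation is token-overlap F1 between a generated summary and a reference summary. The sketch below shows that formulation only; whether this project computed its F1-score this way is an assumption, and the `token_f1` helper and the example sentences are hypothetical.

```python
from collections import Counter


def token_f1(predicted, reference):
    """Token-overlap F1 between two whitespace-tokenized strings."""
    pred_tokens = predicted.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical generated and reference summaries.
score = token_f1(
    "joins meter readings with segments",
    "joins readings with pipeline segments",
)
print(round(score, 2))
```

Here four of the five tokens overlap on each side, so precision and recall are both 0.8 and the F1-score is 0.8.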

Project Overview

I developed this project as part of my M.Sc. in Mathematical and Statistical Sciences. It combines artificial intelligence and software engineering to improve data transparency in regulated industries, and it was built on real-world energy pipelines from my internship at Ernst & Young Canada.

The system processes Python-based Databricks notebooks, extracts data transformations, and uses statistical models to generate lineage diagrams and concise natural language summaries. An interactive Streamlit frontend allows users to navigate the summaries, provide feedback, and download reports.
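As a rough illustration of the extraction step, the sketch below parses a notebook cell with Python's standard `ast` module and collects the tables it reads and writes. It is a simplified heuristic, not the project's implementation: the cell contents, the table names, and the `extract_tables` helper are hypothetical, calls are matched by method name only, and only string-literal table names are picked up.

```python
import ast

# Hypothetical notebook cell; table names are invented for illustration.
# The cell is only parsed, never executed, so no Spark session is needed.
NOTEBOOK_CELL = '''
readings = spark.read.table("raw.meter_readings")
segments = spark.table("ref.pipeline_segments")
joined = readings.join(segments, "segment_id")
joined.write.mode("overwrite").saveAsTable("clean.meter_readings")
'''

READ_METHODS = {"table"}         # spark.read.table(...), spark.table(...)
WRITE_METHODS = {"saveAsTable"}  # df.write.saveAsTable(...)


def extract_tables(source):
    """Return (reads, writes): table names found in method calls."""
    reads, writes = [], []
    for node in ast.walk(ast.parse(source)):
        # Only consider method-style calls such as obj.method(...).
        if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute)):
            continue
        if not node.args:
            continue
        first = node.args[0]
        # Skip table names built dynamically; only literals are handled.
        if not (isinstance(first, ast.Constant) and isinstance(first.value, str)):
            continue
        if node.func.attr in READ_METHODS:
            reads.append(first.value)
        elif node.func.attr in WRITE_METHODS:
            writes.append(first.value)
    return reads, writes


reads, writes = extract_tables(NOTEBOOK_CELL)
print("reads:", sorted(reads))
print("writes:", writes)
```

Each (read, write) pair found in a cell can then become an edge in a lineage graph. A real extractor would also need to handle SQL cells, file paths, and table names held in variables.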

The architecture is optimized for scalability and maintainability, using modular components for NLP, lineage inference, Azure data integration, and user interaction. Feedback loops are built into the interface to help refine output quality through real-world usage.

The project delivers a robust, interpretable, and production-ready solution that meets the dual needs of technical precision and business communication. It serves as a foundation for building broader AI-enabled data governance tools in the energy industry.
