HKU Engineering FYP24099

Exploring BTC Momentum Misclassifications: A Retrieval-Augmented Generation (RAG) Approach​

Exploring BTC Momentum Misclassifications: A Retrieval-Augmented Generation (RAG) Approach

SUPERVISOR: Yiu Siu Ming

Group: Ezen CHONG, Gillian Ru Qu TOO, Keith Kai Xuan CHAN, Zi Zheng FONG HU

Final Presentation

PROJECT INTRODUCTION

Can we better understand how narratives affect cryptocurrencies?

Project Schedule

Planning Stage

  • Project Brainstorming  
  • Detailed Project Plan and Initial Setup of Project Webpage 

Data Preprocessing

  • Initiate data collection
  • Utilize open-sourced models like LLAMA
  • import models locally via techniques such as QLORA

Training the Model

  • Sentiment Analysis Model Development 
  • Machine Learning Model Training 
  • Correlation and Prediction Model Development 
  • Preliminary Implementation and Interim Report 

Refinement and Optimization

  • Optimize models based on evaluation results
  • Perform extensive testing with a larger dataset
  • Analyze model robustness and reliability
  • Finalised Implementation 

Concluding the Project

  • Comprehensive Evaluation 
  • Final Report and Documentation 

Deliverables

Methodology

Our methodology is organized into four main sections: Architecture & Data Collection, Data Preprocessing, Machine Learning Models, and RAG Framework.

Architecture & Data Collection

The overall model architecture is designed to integrate multiple data sources to capture both technical signals and market sentiment to predict Bitcoin price momentum. The three primary data sources are 

  • Historical Bitcoin Price Data from Yahoo Finance
  • Sentiment Data from Cryptocurrency and Market News through the News API
  • Fear and Greed Index Data through their API

Data Preprocessing

A robust preprocessing step is essential to ensure that both sentiment data and price data are harmonized, which allows for an integrated analysis in later parts of our modeling and evaluation.  After processing the merged dataset is separated into 4 distinct sentiment sets:

  • With BTC Sentiment: This feature set incorporated sentiment scores derived from crypto-specific news related exclusively to Bitcoin. 
  • With Market Sentiment: This feature set incorporated sentiment scores derived from  market news.   
  • With F&G Sentiment (Alternative Crypto Fear & Greed Index): In this variation, the sentiment proxy was the Fear & Greed Index value. 
  • No Sentiment Indicators: This baseline feature set consisted solely of technical price data (e.g., Open, High, Low, Close, and Volume) without any sentiment information.  

Machine Learning Models

Each model will be trained on the standardized and pre-processed data as mentioned in the above section, with varying sentiment datasets as well as a common target variable, which is defined as the next day’s directional movement (upwards/downwards), which is determined from the daily log returns of BTC’s price. This experimental design allows us to assess the incremental value of sentiment indicators on our prediction task and to further handle refining and understanding on a deeper level as to which data sets perform better and why the results are as such.  The chosen models include  

Our chosen models include logistic regression, decision trees, gradient boosting, naïve Bayes, support vector machines (SVM), neural networks, and random forests (with XGBoost serving as a benchmark). For evaluation, accuracy was used, given that our dataset was fairly balanced.

RAG Framework

The system first uses machine learning to predict cryptocurrency prices. When the model makes errors, the system identifies the dates of those errors and retrieves aggregated news from those dates. This news is then fed into a RAG model, which is prompted to explain how the news might have influenced prices in the opposite direction of the prediction. Traders then review the RAG output to identify relevant narratives, implicitly refining their understanding of market dynamics and improving future trading decisions.