Overview
- First, raw data will be collected via web scraping and from existing datasets. The data will then be labeled with appropriate tags and passed to several models for training. Each model's performance will be recorded for comparison, and the best-performing model will be fine-tuned so that it remains effective while still being able to run locally.
Data Processing
A study by Yiming Zhu and colleagues shows that ChatGPT can label data with an average accuracy of 60.9%, peaking at 64.9% on sentiment analysis. To reduce labeling errors, multiple AI models will be used, with human review whenever the results are unsatisfactory. Since text can fall anywhere on a spectrum between “fraud” and “normal,” content-specific tags will also be added to support finer-grained classification.
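A minimal sketch of how labels from several models could be combined by majority vote, with disagreements flagged for human review (the `labelers` argument is a hypothetical list of callables standing in for real model API calls):

```python
from collections import Counter

def majority_label(text, labelers):
    """Label text by majority vote across several model labelers.

    `labelers` is a hypothetical list of callables, each mapping a
    string to a tag such as "fraud" or "normal".
    """
    votes = [labeler(text) for labeler in labelers]
    label, count = Counter(votes).most_common(1)[0]
    # Any disagreement among the models triggers human review.
    needs_review = count < len(votes)
    return label, needs_review
```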
Additionally, text preprocessing, such as removing punctuation and irrelevant words, is necessary for effective model training. Large language models will assist in streamlining this step.
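As a minimal sketch of this cleaning step, assuming an English corpus and scikit-learn's built-in stop-word list:

```python
import re
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def clean_text(text: str) -> str:
    """Lowercase the text, strip punctuation, and drop common stop words."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # replace punctuation with spaces
    tokens = [t for t in text.split() if t not in ENGLISH_STOP_WORDS]
    return " ".join(tokens)

print(clean_text("URGENT!!! You have won a FREE prize - claim it now!"))
```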
Model Training Overview
In the model training phase, a variety of algorithms will be tested to determine the best fit for the task. The models under consideration are listed below, with a brief training sketch after the list:
- Support Vector Machine (SVM):
  - A supervised learning algorithm that excels in high-dimensional spaces.
  - It maximizes the margin between classes for robust separation, reducing overfitting.
- Random Forest:
  - An ensemble of decision trees that vote on the classification.
  - Performance may decline with high dimensionality due to sparsity in the data.
- Naive Bayes:
  - A probabilistic classifier based on Bayes’ Theorem, assuming feature independence.
  - Efficient for large datasets but may struggle with highly correlated features.
- Logistic Regression:
  - A linear model for binary classification, using the logistic function.
  - Regularization techniques can help mitigate overfitting in high dimensions.
- Long Short-Term Memory (LSTM):
  - A type of recurrent neural network designed for sequential data.
  - Effective at capturing long-term dependencies but requires careful regularization.
- Bidirectional LSTM (BiLSTM):
  - Processes data in both directions for richer context.
  - Useful in NLP applications, though it also faces overfitting challenges.
- BERT and ChatGPT:
  - Will be tested after the initial models to compare performance.
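As a brief sketch of how the classical candidates might be compared on a shared TF-IDF pipeline (the tiny inline corpus is purely illustrative; the LSTM, BiLSTM, BERT, and ChatGPT comparisons would need a separate deep-learning setup):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in for the labeled corpus described above.
texts = [
    "urgent claim your free prize now",
    "your account is locked verify immediately",
    "you won the lottery send your bank details",
    "meeting moved to 3pm tomorrow",
    "lunch at noon works for me",
    "please review the attached quarterly report",
]
labels = ["fraud", "fraud", "fraud", "normal", "normal", "normal"]

candidates = {
    "SVM": LinearSVC(),
    "Random Forest": RandomForestClassifier(n_estimators=200),
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# Cross-validated weighted F1 gives a first-pass ranking of the models.
for name, clf in candidates.items():
    pipe = make_pipeline(TfidfVectorizer(), clf)
    scores = cross_val_score(pipe, texts, labels, cv=3, scoring="f1_weighted")
    print(f"{name}: weighted F1 = {scores.mean():.3f}")
```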
Model Performance Evaluation
- Weighted-average Precision:
  - Measures the accuracy of positive predictions.
- Weighted-average Recall:
  - Assesses the model’s ability to identify all relevant positive instances.
- Weighted-average F1-score:
  - The harmonic mean of precision and recall, balancing both metrics.
- False Negative Rate (FNR):
  - The proportion of actual positives incorrectly predicted as negatives.
- True Negative Rate (TNR):
  - Measures the model’s ability to correctly identify negative instances.
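A minimal sketch of computing these metrics with scikit-learn, using toy predictions (1 = fraud, 0 = normal) purely for illustration:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])  # toy ground truth
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])  # toy model output

print("weighted precision:", precision_score(y_true, y_pred, average="weighted"))
print("weighted recall:   ", recall_score(y_true, y_pred, average="weighted"))
print("weighted F1:       ", f1_score(y_true, y_pred, average="weighted"))

# FNR and TNR come straight from the confusion matrix.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("FNR:", fn / (fn + tp))  # actual positives predicted as negative
print("TNR:", tn / (tn + fp))  # actual negatives correctly identified
```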