NLP-Based Annual Report Reader
A trustworthy AI advisor that supports decision-making in financial assessments by analyzing companies' risk management practices.
Project Overview
Our research aims to tackle the challenges of identifying and classifying potential risks within annual reports using advanced Natural Language Processing (NLP) techniques.
Our first major innovation is a risk sentence extractor powered by BusinessBERT, a domain-specific language model. By fine-tuning this model on a substantial dataset of labeled sentences from recent NASDAQ annual reports, we improve its ability to recognize and extract risk-related information efficiently.
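As a concrete illustration, below is a minimal sketch of this fine-tuning step using the Hugging Face transformers library. The checkpoint name, CSV file names, and hyperparameters are illustrative assumptions, not our actual configuration.

```python
# Minimal sketch: fine-tune a BERT-style encoder as a binary
# risk-sentence classifier. Checkpoint name, data files, and
# hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification,
                          AutoTokenizer, Trainer, TrainingArguments)

MODEL = "bert-base-uncased"  # stand-in for a BusinessBERT checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

# Expected CSV columns: "text" (sentence), "label" (1 = risk, 0 = not risk)
data = load_dataset("csv", data_files={"train": "train.csv",
                                       "validation": "dev.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

data = data.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="risk-extractor",
    learning_rate=2e-5,              # typical BERT fine-tuning rate
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

Trainer(model=model, args=args,
        train_dataset=data["train"],
        eval_dataset=data["validation"],
        tokenizer=tokenizer).train()
```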
In addition to extraction, we employ a classification method based on word-vector similarity to categorize the identified risks into types such as credit, market, and operational risk. This approach distinguishes our work from traditional machine-learning methods and helps mitigate common issues such as overfitting.
Project Methodology
Data Collection
We downloaded 2,000 recent NASDAQ annual reports from AnnualReports.com for model training; to reduce overfitting, we plan to gather additional documents from other stock exchanges.
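A minimal sketch of the bulk-download step follows. The URL pattern and ticker list are hypothetical placeholders, since the actual AnnualReports.com page structure is not reproduced here.

```python
# Minimal sketch: bulk-download annual report PDFs.
# BASE_URL is a hypothetical placeholder, not the real
# AnnualReports.com URL scheme.
import pathlib
import requests

BASE_URL = "https://example.com/reports/{ticker}/latest.pdf"  # placeholder
TICKERS = ["AAPL", "MSFT", "NVDA"]  # small sample of NASDAQ tickers

out_dir = pathlib.Path("reports")
out_dir.mkdir(exist_ok=True)

for ticker in TICKERS:
    resp = requests.get(BASE_URL.format(ticker=ticker), timeout=30)
    if resp.ok:
        (out_dir / f"{ticker}.pdf").write_bytes(resp.content)
    else:
        print(f"skipped {ticker}: HTTP {resp.status_code}")
```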
Data Preprocessing
Because no public dataset of risk-labeled sentences exists, we began labeling sentences from NASDAQ reports manually, but found labeling at scale infeasible. We therefore used Generative Adversarial Networks (GANs) to generate synthetic training data and applied Mask-BERT to ensure the quality and relevance of these samples.
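To illustrate the filtering idea, the sketch below scores synthetic sentences with a masked language model and keeps only fluent ones. This is a simplified stand-in for the Mask-BERT step, not its actual procedure; the model name and cutoff are assumptions.

```python
# Simplified stand-in for the Mask-BERT quality filter: score each
# synthetic sentence by masked-LM pseudo-log-likelihood and keep
# only sentences above a threshold. Model and cutoff are assumptions.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
mlm = AutoModelForMaskedLM.from_pretrained(MODEL).eval()

@torch.no_grad()
def pseudo_log_likelihood(sentence: str) -> float:
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    # Mask one token at a time and sum its log-probability.
    for i in range(1, len(ids) - 1):           # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        logits = mlm(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total / (len(ids) - 2)              # length-normalized

synthetic = ["The company faces significant credit risk exposure.",
             "banana risk risk the the of of"]
kept = [s for s in synthetic if pseudo_log_likelihood(s) > -4.0]  # assumed cutoff
print(kept)
```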
Model Development and Fine-tuning
We utilized FinBERT/BusinessBERT for risk extraction, fine-tuning it on our labeled dataset. To reduce overfitting, we applied the Fast Gradient Method (FGM) during fine-tuning, at the cost of additional computational overhead.
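Below is a minimal PyTorch sketch of FGM adversarial fine-tuning: perturb the embedding weights along the gradient direction, take a second backward pass, then restore. The epsilon value and embedding-parameter name are common defaults, not our verified settings.

```python
# Minimal FGM sketch: add an L2-normalized gradient perturbation to
# the embedding weights, accumulate adversarial gradients, restore.
import torch

class FGM:
    def __init__(self, model, epsilon=1.0, emb_name="word_embeddings"):
        self.model = model
        self.epsilon = epsilon      # perturbation magnitude (assumed default)
        self.emb_name = emb_name    # substring matching embedding params
        self.backup = {}

    def attack(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad and self.emb_name in name:
                self.backup[name] = param.data.clone()
                norm = torch.norm(param.grad)
                if norm != 0 and not torch.isnan(norm):
                    param.data.add_(self.epsilon * param.grad / norm)

    def restore(self):
        for name, param in self.model.named_parameters():
            if name in self.backup:
                param.data = self.backup[name]
        self.backup = {}

# Usage inside the training loop:
#   loss = model(**batch).loss
#   loss.backward()                   # gradients on clean inputs
#   fgm.attack()                      # add adversarial perturbation
#   model(**batch).loss.backward()    # accumulate adversarial gradients
#   fgm.restore()                     # remove perturbation
#   optimizer.step(); optimizer.zero_grad()
```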
Risk Classification
We train a binary classifier for each risk type. Instead of directly fine-tuning FinBERT/BusinessBERT, we compute cosine similarity between sentence vectors and padded keyword vectors, then pass the similarities through a fully connected layer to obtain risk probabilities.
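A minimal sketch of this similarity-based classifier, assuming sentence and keyword embeddings come from a frozen encoder; the dimensions and keyword counts are illustrative.

```python
# Minimal sketch: cosine similarities between a sentence vector and a
# fixed, padded bank of keyword vectors feed a fully connected layer
# that outputs a per-risk-type probability. Sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimilarityRiskClassifier(nn.Module):
    def __init__(self, keyword_vectors: torch.Tensor):
        # keyword_vectors: (num_keywords, hidden) from a frozen encoder,
        # zero-padded so every risk type has the same keyword count.
        super().__init__()
        self.register_buffer("keywords", keyword_vectors)
        self.fc = nn.Linear(keyword_vectors.size(0), 1)  # one binary head

    def forward(self, sentence_vec: torch.Tensor) -> torch.Tensor:
        # sentence_vec: (batch, hidden), e.g. the [CLS] embedding.
        sims = F.cosine_similarity(
            sentence_vec.unsqueeze(1),      # (batch, 1, hidden)
            self.keywords.unsqueeze(0),     # (1, num_keywords, hidden)
            dim=-1,
        )                                   # (batch, num_keywords)
        return torch.sigmoid(self.fc(sims)).squeeze(-1)  # risk probability

# One such classifier per risk type (credit, market, operational),
# each with its own keyword bank.
keyword_bank = torch.randn(32, 768)         # placeholder keyword vectors
clf = SimilarityRiskClassifier(keyword_bank)
print(clf(torch.randn(4, 768)))             # probabilities for 4 sentences
```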
Evaluation of Results
We will evaluate whether this similarity-based approach yields higher accuracy than directly fine-tuning FinBERT/BusinessBERT, with per-risk-type results to be reported in a later phase.
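A minimal sketch of the per-risk-type evaluation we have in mind, using standard scikit-learn metrics; the label arrays below are placeholders for real predictions.

```python
# Minimal sketch: per-risk-type evaluation with standard metrics.
# The label arrays are placeholders, not real results.
from sklearn.metrics import classification_report

y_true = {"credit": [1, 0, 1, 1], "market": [0, 0, 1, 0]}
y_pred = {"credit": [1, 0, 0, 1], "market": [0, 1, 1, 0]}

for risk_type in y_true:
    print(f"=== {risk_type} risk ===")
    print(classification_report(y_true[risk_type], y_pred[risk_type],
                                target_names=["no risk", "risk"]))
```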
Upcoming Phases
We will conduct a literature review on risky sentence extraction, analyze annual reports for nuanced risk indicators, and combine manual and AI-assisted labeling to enhance our risk assessment methodology.
More on Data Preprocessing
- No free datasets exist for labeled sentences indicating potential risk.
- Manual labeling of NASDAQ annual report sentences is infeasible given our limited resources and the risk of annotator bias.
- Generative Adversarial Networks (GANs) are used to augment the dataset by generating synthetic data.
- GANs pair a generator that creates new samples with a discriminator that assesses their similarity to real data (see the sketch after this list).
- Mask-BERT is applied to ensure the quality and relevance of synthetic data by filtering out irrelevant information.
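To make the generator/discriminator interplay concrete, here is a minimal GAN sketch that operates on sentence-embedding vectors rather than raw text; the architecture, sizes, and training loop are illustrative assumptions, not our actual model.

```python
# Minimal GAN sketch over sentence-embedding vectors: the generator
# maps noise to synthetic embeddings, the discriminator scores how
# real they look. Sizes and training details are illustrative.
import torch
import torch.nn as nn

EMB_DIM, NOISE_DIM, BATCH = 768, 100, 64

generator = nn.Sequential(
    nn.Linear(NOISE_DIM, 256), nn.ReLU(),
    nn.Linear(256, EMB_DIM),
)
discriminator = nn.Sequential(
    nn.Linear(EMB_DIM, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1),  # real/fake logit
)

bce = nn.BCEWithLogitsLoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

real = torch.randn(BATCH, EMB_DIM)  # placeholder for real embeddings

for step in range(200):
    # Discriminator: push real embeddings toward 1, generated toward 0.
    fake = generator(torch.randn(BATCH, NOISE_DIM)).detach()
    d_loss = bce(discriminator(real), torch.ones(BATCH, 1)) + \
             bce(discriminator(fake), torch.zeros(BATCH, 1))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator: produce embeddings the discriminator labels as real.
    fake = generator(torch.randn(BATCH, NOISE_DIM))
    g_loss = bce(discriminator(fake), torch.ones(BATCH, 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```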
More on Risk Classification
- We use binary classifiers for each risk type, including credit, market, and operational risks, recognizing that sentences may imply multiple risks.
- Instead of directly fine-tuning FinBERT/BusinessBERT, we calculate cosine similarity between sentence vectors and padded word vectors.
- A fully connected layer is added to derive probabilities for each risk type.
- Risky sentences are filtered by our extractor, and keywords are sourced from reliable resources and SEC literature.
- This method is expected to be more accurate than direct fine-tuning, which will be validated in the evaluation phase.
Yan Tianxiang