An Enhanced Note Assistance Application with Handwriting Recognition Leveraging LLM


Supervisor: Prof. Luo Ping

Group Member: Deng Jiaqi (3035832490), Gu Zhuangcheng (3035827110), Xie Changhe (3035770575), Zhou Zihan (3035772640)

Introduction

In today’s world, many of us rely on handwritten notes, whether on paper or on digital devices such as iPads. However, interpreting these notes can be challenging. Most existing note-taking applications focus on basic functionalities such as text transcription and keyword search, and do not handle diverse inputs such as handwritten notes, sketches, and complex visual data effectively. They also lack context-aware querying and intelligent content categorization, making it difficult for users to organize and interact with their notes, especially when the notes combine text, diagrams, and annotations.

Therefore, we introduce an innovative note assistant application that combines the strengths of general-purpose and specialized LLMs to offer a more interactive and structured note-management experience. This project presents a tool that transforms the note-taking process through handwriting recognition, sketch conversion, and note question-answering (QA). Using large language models (LLMs) and optical character recognition (OCR), the application converts handwritten drafts into organized, searchable, and insightful digital notes. It transforms rough sketches into clean formats such as Markdown and enables users to query their notes for deeper insights.

Objective

This note assistant application aims to help users organize and understand their notes more efficiently and effectively.

Objective 1: Written Content Recognition

The project introduces a Vision Language Model (VLM) for detecting and recognizing user-written content, including both text and graphical elements, enabling the conversion of handwritten drafts and notes into neatly structured digital formats.

Objective 2: Document QA

This project presents a Large Language Model (LLM) fine-tuned for document question-answering (DocQA) in lecture note-taking scenarios. Combined with techniques such as Retrieval-Augmented Generation (RAG), it supports document understanding and allows users to query their notes for detailed insights and information.
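To make the RAG-based DocQA flow concrete, the sketch below shows a minimal retrieval-augmented pipeline: note chunks are ranked against the question and the top matches are packed into a prompt for the QA model. The bag-of-words embedding and the prompt format are illustrative placeholders, not the project's fine-tuned models or actual retrieval stack.

```python
# Minimal sketch of retrieval-augmented DocQA over note chunks.
# The toy embedding and prompt format are placeholders, not the project's pipeline.
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would use a neural encoder.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    # Rank note chunks by similarity to the question and keep the top k.
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(question: str, chunks: list[str]) -> str:
    # Retrieved chunks are prepended as context for the QA model.
    context = "\n".join(f"- {c}" for c in retrieve(question, chunks))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

notes = [
    "Lecture 3: gradient descent updates weights in the direction of the negative gradient.",
    "Lecture 4: convolution layers share weights across spatial positions.",
    "Admin: the midterm covers lectures 1 to 5.",
]
print(build_prompt("How does gradient descent update the weights?", notes))
```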

Objective 3: Interactive Application

This project offers an interactive mobile/desktop application that seamlessly integrates OCR and QA capabilities, transforming handwritten content into clean, organized, searchable, and easy-to-understand digital notes.

Methodology

1 – App Development

  • The system adopts an Electron-Flask architecture, which facilitates support for multiple operating systems and devices.
  • The backend, developed with Flask, serves as the central framework for application logic and for integration with external APIs (a minimal sketch follows this list).
  • The frontend is built as a cross-platform system to support desktop and mobile environments.
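As a concrete illustration of the Electron-Flask split, the sketch below shows a minimal Flask backend exposing a hypothetical /api/ocr endpoint that the frontend could call with an uploaded page image. The route name, request fields, and model stub are assumptions for illustration, not the application's actual API.

```python
# Minimal sketch of the Flask backend, assuming a hypothetical /api/ocr endpoint
# that the Electron frontend calls with an uploaded page image.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/api/ocr", methods=["POST"])
def ocr():
    # The frontend sends the handwritten page as a multipart file upload.
    image = request.files.get("image")
    if image is None:
        return jsonify({"error": "no image provided"}), 400
    # Placeholder: the real app would pass the image to the fine-tuned VLM.
    text = run_ocr_model(image.read())
    return jsonify({"markdown": text})

def run_ocr_model(image_bytes: bytes) -> str:
    # Stub so the sketch runs end to end; replace with actual model inference.
    return "# Recognized note\n(placeholder output)"

if __name__ == "__main__":
    app.run(port=5000)
```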

2 – Dataset Collection

OmniNote-1M

  • It includes 180k Page OCR samples, 172k Text Spotting examples, and 160k Text Recognition instances, which collectively enable robust document understanding.
  • For visual question answering (VQA), the dataset contains Natural Image VQA (443k samples), Document VQA (39k), and Chart VQA (40k).
  • The Document Parsing subset includes self-collected data from arXiv and Common Crawl for Page OCR and Text Spotting, while Text Recognition also incorporates synthesized handwritten data from public datasets such as CASIA-HWDB for Chinese and IAM for English.
  • Visual Question Answering leverages public datasets such as VQA v2, LRV_Chart, and TextVQA to ensure diversity in natural and document images.
  • We also filtered the dataset by extracting image features with YOLO-v10 and applying kNN clustering (sketched below).
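The following sketch illustrates one plausible reading of this filtering step: near-duplicate removal with k-nearest neighbours over pre-extracted image features. The feature source (a YOLO-v10 backbone), the cosine metric, and the distance threshold are assumptions, not the project's actual settings.

```python
# Sketch of kNN-based filtering over image features (assumed to be pre-extracted
# with a YOLO-v10 backbone and stored as an (N, D) array). Threshold and metric
# are illustrative choices.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def filter_near_duplicates(features: np.ndarray, threshold: float = 0.05) -> np.ndarray:
    """Return indices of samples kept after dropping near-duplicates."""
    nn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(features)
    # distances[:, 0] is each point to itself; [:, 1] is its nearest other point.
    distances, neighbors = nn.kneighbors(features)
    keep: list[int] = []
    dropped: set[int] = set()
    for i, (dist, j) in enumerate(zip(distances[:, 1], neighbors[:, 1])):
        if i in dropped:
            continue
        keep.append(i)
        if dist < threshold:
            dropped.add(int(j))  # drop the neighbour that is nearly identical
    return np.array(keep)

features = np.random.rand(1000, 512).astype(np.float32)  # stand-in for YOLO features
kept = filter_near_duplicates(features)
print(f"kept {len(kept)} of {len(features)} samples")
```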

3 – Model Development

  • The raw document first passes through a feature extraction module that integrates the visual and textual features. The representation then flows into a hidden states generation component that produces a dense vector encoding of the document.
  • Separate decoder heads are employed for each target task, such as document understanding via question answering, element detection and localization, or document indexing and search.
  • By sharing the same encoder backbone while learning task-specific output layers, we can optimize the model for a wide range of document-processing objectives while benefiting from transfer learning and reduced architectural complexity (a toy sketch of this design follows this list).
  • We use two models: Qwen2VL-2B for visual question answering and GOT for document understanding. Both are fine-tuned following the workflow shown on the left.
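The toy PyTorch module below illustrates the shared-encoder, task-specific-heads idea described above. Layer sizes, head names, and output shapes are illustrative only; the actual system builds on Qwen2VL-2B and GOT rather than this simplified module.

```python
# Toy sketch of one shared encoder with task-specific heads, as described above.
# Dimensions and heads are illustrative, not the real Qwen2VL/GOT architecture.
import torch
import torch.nn as nn

class SharedEncoderModel(nn.Module):
    def __init__(self, feat_dim: int = 256, hidden_dim: int = 512, vocab: int = 1000):
        super().__init__()
        # Shared backbone: fuses (already extracted) visual+textual features
        # into a dense document representation.
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, hidden_dim)
        )
        # Task-specific heads trained on top of the same representation.
        self.qa_head = nn.Linear(hidden_dim, vocab)    # document QA (token logits)
        self.detect_head = nn.Linear(hidden_dim, 4)    # element localization (box)
        self.index_head = nn.Linear(hidden_dim, 128)   # embedding for search/indexing

    def forward(self, feats: torch.Tensor) -> dict[str, torch.Tensor]:
        h = self.encoder(feats)
        return {
            "qa_logits": self.qa_head(h),
            "boxes": self.detect_head(h),
            "index_embedding": self.index_head(h),
        }

model = SharedEncoderModel()
outputs = model(torch.randn(2, 256))  # batch of 2 fused feature vectors
print({k: v.shape for k, v in outputs.items()})
```

Because every head reads the same document representation, gradients from each task update the shared encoder, which is where the transfer-learning benefit described above comes from.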

Project Outcome

1 – OmniNote Desktop App

  • Canvas-based Note-Taking: Effortlessly capture and organize ideas using a versatile canvas, supporting text entry, drawing tools, and image embedding.
  • Hierarchical File Management: Create structured folders and manage your notes and files with intuitive drag-and-drop uploads and previews.
  • Interactive Model Integration: Leverage powerful AI models directly within the app for intelligent assistance with content creation and information extraction.
  • Responsive and Customizable Interface: Experience OmniNote across different devices with adaptive design.

2 – OmniNote Mobile App

  • The system is specifically designed to be extensible to mobile platforms, as handwritten input is more practical on devices such as iPads and smartphones.
  • The mobile frontend is developed with React Native and Expo, which enable applications for both iOS and Android to be built from a single codebase.

3 – Document Parsing Model

Our state-of-the-art document parsing model (Siglip-Qwen2) demonstrates exceptional accuracy across multiple data formats including text, equations, and tables. Notable advantages include:

  • Compact and Efficient: Achieves superior performance with significantly fewer parameters (0.8B) compared to larger models like GPT-4o and InternVL2-Llama3 (over 70B parameters).
  • Outstanding Text Recognition: Outperforms leading models with the lowest edit distances for both English (0.081) and Chinese (0.214) text (the metric is sketched after this list).
  • Versatile Parsing Capability: Delivers robust performance in extracting structured content from diverse document layouts and languages, demonstrating consistency and adaptability.
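The edit-distance figures above are normalized scores (lower is better). The sketch below shows one standard way to compute such a score, assuming Levenshtein distance normalized by the longer sequence; the project's exact evaluation protocol may differ.

```python
# Minimal sketch of a normalized edit distance, the kind of metric behind the
# 0.081 (English) and 0.214 (Chinese) scores above. Assumes standard Levenshtein
# distance normalized by the longer sequence.
def edit_distance(a: str, b: str) -> int:
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def normalized_edit_distance(pred: str, ref: str) -> float:
    denom = max(len(pred), len(ref)) or 1
    return edit_distance(pred, ref) / denom

print(normalized_edit_distance("handwrittn note", "handwritten note"))  # ~0.06
```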

Project Schedule

Phase | Period | Deliverables & Milestones | Status
------|--------|---------------------------|-------
0 | Aug – Sep 2024 | Research & Detailed Project Plan; Phase 0 Deliverables: Detailed Project Plan & Project Website Setup | Done
1 | Oct – Nov 2024 | Data Collection; Fine-tune the Vision Language Model (VLM) | Done
1 | Dec 2024 | Fine-tune the Large Language Model (LLM) for the DocQA task; Phase 1 Deliverables: Interim Report & First Presentation | Done
2 | Jan – Feb 2025 | Develop Application | Done
2 | Mar – Apr 2025 | Integrate VLM and LLM into Application; Phase 2 Deliverables: Application | Done
3 | Apr 2025 | Conduct User Experience Survey; Test and Refine the System; Phase 3 Deliverables: Final Report & Final Presentation | Done