Introduction

Modern large language models (LLMs) have demonstrated remarkable performance on a wide range of tasks, such as question answering, reading comprehension, text summarization, and mathematical reasoning\cite{llmsurvey}. Models such as GPT-3.5\cite{lmfewshotlearners}, GPT-4\cite{gpt4}, and Llama 3\cite{llamapaper} are widely used for their generative capabilities. However, these models remain constrained in the following ways: (1) their knowledge is limited in scope and lacks up-to-date information, since the pre-training data is static and has a cutoff date; (2) they lack domain-specific knowledge unless fine-tuned; and (3) they still suffer from errors and hallucinations on complex or time-sensitive tasks, generating plausible but inaccurate text. These limitations pose challenges for deploying LLMs effectively in fields such as healthcare, finance, law, and scientific research, and addressing them is crucial for enhancing the models' applicability and reliability.

To address these limitations, Retrieval-Augmented Generation (RAG)\cite{ragpaper} systems, which consist of a knowledge database, a retriever, and an LLM, augment generation with knowledge retrieved from an external knowledge database. The knowledge database contains a large number of texts from various sources, such as Wikipedia, news articles, social media, and online communities. The retriever fetches the texts most relevant to the user's query from the knowledge database, and these texts are then supplied as context to ground the generation, reducing hallucination by giving the LLM access to relevant external knowledge.
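Concretely, the retrieval step is commonly formalized as follows (a standard dense-retrieval sketch, where the embedding model $E$, the similarity function $\mathrm{sim}$, and the cutoff $k$ are notational assumptions rather than components fixed by this work; sparse retrievers such as BM25 follow the same pattern): given a query $q$ and a knowledge database $\mathcal{D}$,
\begin{equation*}
\mathcal{D}_k(q) \;=\; \operatorname*{arg\,top\text{-}k}_{d \in \mathcal{D}} \; \mathrm{sim}\!\left(E(q), E(d)\right),
\qquad
y \;\sim\; p_{\mathrm{LLM}}\!\left(y \mid q, \mathcal{D}_k(q)\right),
\end{equation*}
where $\mathrm{sim}(\cdot,\cdot)$ is typically cosine similarity or inner product, and the retrieved texts $\mathcal{D}_k(q)$ are prepended to the prompt as context for the LLM.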

To provide better services, companies such as OpenAI deploy web crawlers\cite{gptbot} that routinely collect texts from the Internet for use at different stages, including pre-training, fine-tuning, and building the knowledge database for RAG. This practice raises risks of intellectual property and privacy infringement. Researchers have been working towards making private data unlearnable\cite{unlearnableexample, textunlearnable, authorshipleakage}, but little progress has been made on preventing RAG systems from retrieving private data and using it for generation.

Empirical studies have also shown that RAG systems are vulnerable to leaking their private retrieval databases\cite{ragprivacyrisk}, raising concerns about privacy. Beyond that, RAG raises further concerns about intellectual property infringement if the knowledge database contains copyrighted materials\cite{llmcopyrightviolation}.

In the interest of building ethical and responsible AI systems, this work explores means to mitigate the privacy leaks and intellectual property infringement caused by private data being retrieved and used for generation in RAG systems. Extending the idea of making private data unlearnable\cite{unlearnableexample, textunlearnable}, this work proposes a pipeline for generating texts that cannot be retrieved or used for generation by RAG systems.