Generating Tuning Hints with Large Language Models by Exploiting NLP

Introduction: Two machine learning models were developed to reduce the cognitive load on database administrators when fine-tuning a PostgreSQL database system. The first model answers simple questions by computing embeddings over a custom-tailored knowledge base consisting of user manuals, YouTube transcripts, and Reddit/Quora posts. The second model fine-tunes a 124M-parameter Generative Pre-trained Transformer 2 (GPT-2) to answer more complex queries that require understanding semantic relationships and dependencies between non-independent parameters. The training data used to fine-tune this model was generated by ChatGPT-4 and the author, leveraging the first model's input-output pairs.


Technical Details

  1. Model 1: Haystack was used to build the question-answering pipeline, which combines data preparation/preprocessing of a user-defined knowledge base with sentence-similarity retrieval (a minimal sketch is given below this list).
  2. Model 2: SFTTrainer was used to fine-tune the GPT-2 model on the curated question-answer pairs (see the fine-tuning sketch after this list).
  3. Web Scraping: Playwright and BeautifulSoup4 were used to scrape tuning hints from Reddit/Quora and save them into the user-defined knowledge base as text files. Additionally, youtube-transcript-api was used to obtain text files containing transcripts of conference talks that discussed tuning-related hints (a data-collection sketch also follows this list).
  4. External Tools: ChatGPT-4 was used to help generate the training data for fine-tuning the LLM.
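
For reference, here is a minimal sketch of how such a question-answering pipeline can be wired up with Haystack 1.x (`farm-haystack`). The `knowledge_base/` directory and the embedding/reader model names are illustrative placeholders, not the exact configuration used in the notebook.

```python
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import EmbeddingRetriever, FARMReader
from haystack.pipelines import ExtractiveQAPipeline
from haystack.utils import convert_files_to_docs

# Load the collected text files (manuals, transcripts, forum posts) as documents.
document_store = InMemoryDocumentStore(embedding_dim=384)
docs = convert_files_to_docs(dir_path="knowledge_base/")  # hypothetical folder of .txt files
document_store.write_documents(docs)

# Sentence-similarity retrieval over the knowledge base.
retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
)
document_store.update_embeddings(retriever)

# An extractive reader pulls the answer span out of the retrieved passages.
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2")
pipeline = ExtractiveQAPipeline(reader=reader, retriever=retriever)

result = pipeline.run(
    query="What does shared_buffers control in PostgreSQL?",
    params={"Retriever": {"top_k": 5}, "Reader": {"top_k": 1}},
)
print(result["answers"][0].answer)
```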
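The fine-tuning step can be sketched roughly as follows, assuming Hugging Face's TRL library in an older release where SFTTrainer accepts `dataset_text_field` and `max_seq_length` directly (newer releases move these into SFTConfig). The `qa_pairs.jsonl` file name and the output directory are hypothetical.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer

# Hypothetical JSONL file of generated Q&A pairs, one {"text": "Question: ... Answer: ..."} record per line.
dataset = load_dataset("json", data_files="qa_pairs.jsonl", split="train")

# GPT-2 base checkpoint (~124M parameters).
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
)
trainer.train()
trainer.save_model("gpt2-postgres-tuning")  # hypothetical output directory
```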
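For the data-collection side, a rough sketch along these lines is plausible. The URL, the paragraph-based extraction, and the output file names are placeholders; the snippet assumes Playwright's sync API and the pre-1.0 youtube-transcript-api interface.

```python
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright
from youtube_transcript_api import YouTubeTranscriptApi

# Render a forum thread (Reddit/Quora pages are JavaScript-heavy) and grab its HTML.
url = "https://www.reddit.com/r/PostgreSQL/comments/example_thread/"  # placeholder URL
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(url)
    html = page.content()
    browser.close()

# Extract the visible post text with BeautifulSoup and store it in the knowledge base.
soup = BeautifulSoup(html, "html.parser")
post_text = " ".join(para.get_text(" ", strip=True) for para in soup.find_all("p"))
with open("knowledge_base/reddit_thread.txt", "w", encoding="utf-8") as f:
    f.write(post_text)

# Fetch a conference-talk transcript and save it alongside the scraped posts.
video_id = "VIDEO_ID"  # placeholder YouTube video ID
transcript = YouTubeTranscriptApi.get_transcript(video_id)
with open("knowledge_base/conference_talk.txt", "w", encoding="utf-8") as f:
    f.write(" ".join(chunk["text"] for chunk in transcript))
```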

Source Code & Paper

  1. The source code was developed on Google Colab. You can access the notebook and dataset here.
  2. To access additional files (e.g., the written report), please feel free to reach out to me! :)