Exploring Language Model Development for Underrepresented Languages
As we progress through 2023, I am embarking on an exciting project: creating a generative chatbot for a language that currently lacks support in most large language models (LLMs). Widely used models like ChatGPT and LLaMA struggle to represent or understand this language, sometimes fabricating words outright and failing to reason coherently in it.
Strategies for Teaching an Underrepresented Language to LLaMA
I am seeking guidance on the most effective methodologies for training LLaMA (or similar models) on my target language. Key questions I have include:
- Fine-tuning with Prompts: Should I focus on fine-tuning with prompts written exclusively in my target language to improve understanding and generation?
- Fine-tuning for Translation: Would it be more effective to fine-tune specifically on translation tasks (for example, between my language and English)?
- Model Adaptation Techniques: Should I fine-tune the entire model, or would parameter-efficient approaches such as LoRA (Low-Rank Adaptation) be more practical? A rough sketch of the LoRA route I have in mind follows this list.
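To make the third question concrete, here is a minimal sketch of the LoRA setup I have in mind, assuming the Hugging Face transformers and peft libraries. The model name, target modules, and hyperparameters are placeholders rather than a tested recipe:

```python
# Minimal LoRA sketch (assumptions: Hugging Face transformers + peft;
# model name and hyperparameters are placeholders, not a tested recipe).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model = "meta-llama/Llama-2-7b-hf"  # placeholder LLaMA checkpoint

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.float16,  # A100s handle fp16/bf16 comfortably
    device_map="auto",
)

# Low-rank adapters on the attention projections; only these small adapter
# matrices are trained, so memory use stays far below a full fine-tune.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all weights
```

My working assumption is that LoRA would let me iterate on the prompt data quickly with the GPUs I have, while a full fine-tune could come later if the adapters prove insufficient, but I would welcome corrections on that reasoning.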
Resources at My Disposal
I have access to valuable resources, including the capability to generate approximately 50,000 to 100,000 prompts with a team of human contributors. Additionally, I will be utilizing several A100 GPUs for the training process.
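For concreteness, this is roughly how I imagine storing the human-written data as instruction/response pairs in JSONL before training. The field names and file path are just my own convention, not a required schema:

```python
# Sketch of the dataset format I plan to use (JSONL of instruction/response
# pairs). Field names and file name are my own convention, not a standard.
import json

examples = [
    {
        "instruction": "<prompt written in the target language>",
        "response": "<fluent answer written by a native speaker>",
    },
    # ... roughly 50,000-100,000 such pairs from the contributor team
]

with open("target_language_sft.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")
```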
Call for Insights and Research
If anyone has encountered similar projects or relevant research papers in this area, I would greatly appreciate your recommendations. Sharing knowledge and past experiences can significantly enhance our approach to developing robust support for underrepresented languages in Artificial Intelligence.
I look forward to any insights or suggestions you might have!