Best practices for InstructLab instruction datasets

Instruction datasets are specialized datasets designed to help a language model understand and respond to specific instructions or prompts. They provide the model with structured examples of questions, commands, or statements and the corresponding desired responses. These datasets essentially "teach" the model how to follow instructions by exposing it to various scenarios where it learns the correct patterns and formats for responses.

When providing instruction datasets to InstructLab, you can think of the model as a "student" and yourself as the "teacher." You need to provide your student with an educational reading in the form of a Markdown file that will act as the student's source of truth when answering all questions. You also need to provide some example questions and answers in the form of a qna.yaml file to demonstrate how the student will be expected to apply the knowledge from the reading during the test.

Providing strong instruction datasets is essential

The quality of the model's answers tracks the quality of the example question-and-answer pairs you provide. InstructLab uses your Markdown file as its source of truth and your qna.yaml file to generate additional, similar synthetic question-and-answer pairs for the model to train on, which gives your instruction datasets an outsized impact on the model.

This article provides best practices for instruction datasets so you can effectively train your model to generate relevant outputs.

Building the context text file

The "reading" you give your model is the context file which should be a Markdown or text file. It is most effective if it uses a variety of paragraphs, tables, bullet points, and lists. The model learns best when it sees diverse formats, just like we do.

If you are pulling this information from somewhere other than your brain, copy the original source verbatim for best results. Make sure to remove any links.

Any code or preformatted text in the Markdown document should be enclosed in the corresponding Markdown markup: single backticks for an inline CLI command, or triple backticks for multi-line formatted output or code blocks.
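The link-removal step can be scripted. The following is a minimal sketch, not part of InstructLab: `strip_links` is a hypothetical helper name, and the regular expression only handles inline links of the form `[label](target)`. It leaves images, reference-style links, and bare URLs alone, and it would also rewrite links that appear inside code blocks, so review the result by hand.

```python
import re

# Inline Markdown links: [label](target). The negative lookbehind skips
# images, which start with "!" (for example ![alt](image.png)).
_INLINE_LINK = re.compile(r"(?<!!)\[([^\]]+)\]\(([^)]*)\)")


def strip_links(markdown_text: str) -> str:
    """Replace each inline link with its visible label only."""
    return _INLINE_LINK.sub(r"\1", markdown_text)


if __name__ == "__main__":
    sample = "See the [reef guide](https://example.com/reef) for details."
    print(strip_links(sample))
```

Keeping the label and dropping only the target preserves the sentence as the reader sees it, which stays close to the "copy verbatim" advice above.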

Build the qna.yaml file: Chunking blurbs from the context file

Before each set of question and answer pairs in the qna.yaml file, you place a paragraph extracted verbatim from the Markdown file. These chunks of text should come from various parts of the Markdown file so that the beginning, middle, or end is not overrepresented. Each chunk should correspond to the questions and answers that follow it, and it should be about 500 tokens or 375 words.

For example:

version: 3
domain: coral_reefs
created_by: lkerriso
seed_examples:
  - context: |
      Coral reefs support marine biodiversity by creating complex habitats that sustain a wide array of marine species. They provide essential shelter, feeding grounds, and breeding spaces for numerous organisms, including fish, sea turtles, and invertebrates. The structure of coral reefs, with its crevices and nooks, offers safe places for animals to hide from predators, facilitating a diverse community of species to thrive. Coral reefs act as nurseries for young fish, offering them shelter and protection during early life stages, which increases their chances of survival and contributes to population growth. The reefs’ physical structure enables juvenile fish to evade predators, which is essential for the health of broader marine populations. As these fish grow, they may leave the reef and populate other marine environments, supporting biodiversity in nearby ecosystems. The reefs also play a crucial role in the food web. They house tiny algae called zooxanthellae that live within coral tissues and perform photosynthesis, producing energy that sustains both the corals and the organisms that feed on them. This energy supports primary consumers like plankton and herbivorous fish, which are eaten by larger predators, establishing a balanced food web. This structure supports fish species diversity, providing resources for fish of all sizes.
    questions_and_answers:
      - question: |
          [Question based on context]
        answer: |
          [Answer related to question and context]
  - context: |
      In addition, coral reefs are highly efficient in nutrient recycling, capturing and redistributing essential elements that benefit surrounding marine areas. Organisms like sponges help filter water, removing excess organic material and releasing nutrients usable by other reef organisms. This recycling maintains a healthy, balanced ecosystem that can support diverse marine life. In summary, coral reefs support marine biodiversity by providing habitats, nursery grounds, and a balanced food web that sustains various species. The intricate ecosystem services of coral reefs make them indispensable for the health and diversity of marine life worldwide. Protecting coral reefs is essential for the survival of countless marine species and for the stability of ocean ecosystems.
    questions_and_answers:
      - question: |
          [Question based on context]
        answer: |
          [Answer related to question and context]
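Selecting chunks that are spread across the document and stay near the 375-word guideline can be done with a small script. This is a sketch under stated assumptions: `build_chunks` and `pick_evenly` are hypothetical helper names, paragraphs are assumed to be separated by blank lines, and word count stands in for a real tokenizer.

```python
def build_chunks(markdown_text: str, max_words: int = 375) -> list[str]:
    """Group consecutive paragraphs into verbatim chunks of at most max_words.

    A single paragraph longer than max_words is kept whole as its own chunk,
    because the text is never cut or reworded.
    """
    chunks: list[str] = []
    current: list[str] = []
    current_words = 0
    for paragraph in (p.strip() for p in markdown_text.split("\n\n")):
        if not paragraph:
            continue
        n_words = len(paragraph.split())
        # Start a new chunk when adding this paragraph would exceed the limit.
        if current and current_words + n_words > max_words:
            chunks.append("\n\n".join(current))
            current, current_words = [], 0
        current.append(paragraph)
        current_words += n_words
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def pick_evenly(chunks: list[str], count: int) -> list[str]:
    """Pick count chunks spaced evenly from the first chunk to the last."""
    if count <= 0:
        return []
    if count >= len(chunks):
        return list(chunks)
    if count == 1:
        return [chunks[len(chunks) // 2]]
    step = (len(chunks) - 1) / (count - 1)
    return [chunks[round(i * step)] for i in range(count)]
```

For example, `pick_evenly(build_chunks(text), 5)` returns up to five chunks that include the first and last part of the document, so no single region dominates the seed examples.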

The questions

Now you are crafting example questions and answers for your model student. Strive for a diversity of simple and complex questions to prepare the model to handle various user needs. Include "what, how, and why" questions, and do not be vague, as this will lead to potentially confusing and inaccurate outputs.

For example, instead of asking the question "What is a coral reef?", I suggest you go deeper with something like "How do coral reefs support marine biodiversity?" or "Why are coral reefs sensitive to environmental changes?"

The answers

The answers should refer back to the question and use complete sentences so the student/model will learn to provide contextual and clear outputs. Avoid single-word or short-phrase answers, which can teach the model to output answers that merely grab keywords and appear less thoughtful.

When writing both the questions and the answers, pay attention to the wording used in the context/reading text file and use similar language in the qna.yaml file to limit how much the model has to extrapolate. Make sure to wrap text at 120 characters for readability, and follow Markdown formatting.

For example, if the question is "How do coral reefs support marine biodiversity?" the answer should be "Coral reefs support marine biodiversity by providing a habitat for marine animals, including fish, sea turtles, and invertebrates, to live, feed, and reproduce. Coral reefs also act as a "nursery" for young fish, offering shelter and protection from predators, thereby increasing fish populations."
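The 120-character wrapping and the "no short answers" advice can both be checked mechanically. The sketch below uses Python's standard `textwrap` module; the helper names and the `min_words=8` threshold are illustrative assumptions, not values defined by InstructLab.

```python
import textwrap


def wrap_for_qna(text: str, width: int = 120) -> str:
    """Collapse whitespace, then wrap the text so no line exceeds width."""
    return textwrap.fill(" ".join(text.split()), width=width)


def looks_too_short(answer: str, min_words: int = 8) -> bool:
    """Flag answers that are probably a keyword or a short phrase.

    min_words is an arbitrary illustrative threshold.
    """
    return len(answer.split()) < min_words


if __name__ == "__main__":
    answer = (
        "Coral reefs support marine biodiversity by providing a habitat for "
        "marine animals, including fish, sea turtles, and invertebrates, to "
        "live, feed, and reproduce."
    )
    print(wrap_for_qna(answer))
    print(looks_too_short("Habitat."))
```

A flagged answer is only a prompt to take a second look; a short answer can still be correct, but the guidance above favors complete sentences that restate the question.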

InstructLab’s synthetic data generation

Again, good formatting here is essential, because InstructLab will use the files you provided to generate more synthetic questions and answers. High-quality human-generated training data will therefore be multiplied for stronger results, but so will low-quality data.

The format typically looks something like this:

```yaml
version: 3
domain: Coral_Reefs
created_by: <user-name>
seed_examples:
  - context: |
      [Insert sample paragraph, table, or list with markdown formatting]
    questions_and_answers:
      - question: |
          [Question based on context]
        answer: |
          [Answer related to question and context]
```
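Because indentation mistakes are easy to make in this layout, one option is to generate the file from plain data. The following is a rough sketch, not an InstructLab tool: `render_qna` and the dictionary keys `context` and `pairs` are hypothetical names, and the function assumes `domain` and `created_by` are simple values that need no YAML quoting.

```python
import textwrap


def render_qna(domain: str, created_by: str, seed_examples: list[dict]) -> str:
    """Render seed examples in the layout shown above using block scalars."""
    lines = [
        "version: 3",
        f"domain: {domain}",
        f"created_by: {created_by}",
        "seed_examples:",
    ]
    for example in seed_examples:
        lines.append("  - context: |")
        # Block scalar content must be indented deeper than its key.
        lines.append(textwrap.indent(example["context"].strip(), " " * 6))
        lines.append("    questions_and_answers:")
        for question, answer in example["pairs"]:
            lines.append("      - question: |")
            lines.append(textwrap.indent(question.strip(), " " * 10))
            lines.append("        answer: |")
            lines.append(textwrap.indent(answer.strip(), " " * 10))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_qna(
        "coral_reefs",
        "your-username",
        [{
            "context": "Reefs shelter fish.",
            "pairs": [("How do reefs help fish?",
                       "Reefs help fish by sheltering them.")],
        }],
    ))
```

Generating the file this way keeps every `context`, `question`, and `answer` at a consistent depth, which is the part of the format that is easiest to get wrong by hand.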

How long should my files be?

The thing to remember when building a qna.yaml file is to aim for content that is short yet context-rich. InstructLab uses tokens (basically, pieces of words or characters) to process text, and longer content uses more tokens. The rule of thumb:

Context should be about 500 tokens (roughly 375 words).
Each Q&A pair should total around 250 tokens (or about 185 words).
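These rules of thumb can be turned into a quick check. The sketch below is a rough heuristic only: it derives its ratio from the figures above (500 tokens for roughly 375 words, so about 4 tokens per 3 words), and a real tokenizer will give different counts. The function and constant names are illustrative assumptions.

```python
CONTEXT_BUDGET = 500   # tokens, from the rule of thumb above
QA_PAIR_BUDGET = 250   # tokens, from the rule of thumb above


def estimate_tokens(text: str) -> int:
    """Estimate tokens as 4 per 3 words, rounded up (a heuristic, not a tokenizer)."""
    words = len(text.split())
    return (words * 4 + 2) // 3


def check_seed_example(context: str, question: str, answer: str) -> dict:
    """Report estimated token counts and whether each stays within its budget."""
    context_tokens = estimate_tokens(context)
    qa_tokens = estimate_tokens(question) + estimate_tokens(answer)
    return {
        "context_tokens": context_tokens,
        "qa_tokens": qa_tokens,
        "context_ok": context_tokens <= CONTEXT_BUDGET,
        "qa_ok": qa_tokens <= QA_PAIR_BUDGET,
    }
```

Since the guidance says "about" 500 and "around" 250, a result slightly over budget is a hint to trim, not a hard failure.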

Can I add visuals, hyperlinks, graphs, etc.?

Adding visuals or graphs can also save token space, but keep in mind that the combined length of the context and the Q&A text should stay under 750 tokens. A bit of trimming here and there will help fit more information without overwhelming the model.

Avoid hyperlinks in the context; they consume tokens but don’t provide helpful information.

Can I use multiple documents?

If you’re working with multiple documents, you can combine them into one qna.yaml if they’re closely related. If they’re unrelated, it’s better to create separate files to keep each topic focused.

By following these guidelines, you’ll create a qna.yaml file that gives InstructLab strong examples to build on. Taking a little extra care in building these files will go a long way toward building a more accurate and user-friendly model.

To ask more questions and share knowledge you discover, join the InstructLab Slack.
