While large language models (LLMs) offer incredible potential, they also come with their share of challenges. Working with LLMs demands high-quality training data, specialized skills and knowledge, and extensive computing resources. Forking and retraining a model is also time-consuming and expensive.
The InstructLab project offers an open source approach to generative AI, sourcing community contributions to support regular builds of an enhanced version of an LLM. This approach is designed to lower costs, remove barriers to testing and experimentation, and improve alignment, which means ensuring the model's answers are accurate, unbiased, and consistent with the values and goals of its users and creators.
What is InstructLab?
Initiated by IBM and Red Hat, the InstructLab project aims to democratize generative AI through the power of an open source community. It simplifies the LLM training phase through community skills and knowledge submissions.
InstructLab uses the LAB (Large-scale Alignment for chatBots) methodology to enable community-driven development and model evolution. To learn more about the science behind the LAB approach, see the InstructLab research paper posted by IBM.
Who is InstructLab for?
You don’t have to be a rocket scientist to contribute to InstructLab (but it’s great if you are!). The InstructLab approach requires minimal technical experience. Contributions to the model are accepted in the form of knowledge and skills, with topics ranging from Beyoncé facts to professional law. This broad scope of topics makes the process approachable and entertaining.
We especially encourage contributions from experts in non-technical fields. Not only do these contributions enhance the model's performance on a topic, but they also give non-technical industry experts a voice in the AI conversation. InstructLab offers a practical way for less technical folks to contribute to a technical space that is poised to have a lasting impact on the world.
What benefits does InstructLab offer?
InstructLab provides a cost-effective, community-driven solution for improving the alignment of LLMs and makes it easy for those with minimal machine learning experience to contribute.
Cost-effective
An open source approach makes InstructLab accessible to individuals and organizations regardless of their financial resources. As long as you have access to a laptop, you can download and use InstructLab tools, as we've designed it to run on laptop hardware. Such accessibility promotes a more inclusive environment for both developers and contributors.
Community-driven instruction tuning also drives down the cost of model training. The community covers topic generation: users add tasks of interest through skills and knowledge contributions. Because InstructLab uses synthetic data generation, a smaller amount of data is needed from those contributions to have an impact on the model during training. All of this is tuned into a small-parameter model with an open source license that is relatively cheap to both tune and serve for inferencing.
Community-driven
Opening up the data generation for the instruction tuning phase of model training to a large pool of contributors helps address innovation challenges that often arise during LLM training. A community brings together diverse talent by fostering collaboration among individuals with different backgrounds, expertise, and perspectives. This in turn encourages a wide range of contributions to land in the models. In addition, feedback from the users, contributors, and code reviewers in the community can help inform topic selection instead of relying solely on performance analysis and benchmarking data.
Ease of use
Non-technical people are typically deterred from contributing to software or AI due to perceived complexity and technical barriers. The vast array of models and tooling available and the perceived investment of time and effort required can be overwhelming for anyone, especially those without a technical background.
However, InstructLab removes most of these barriers. Thanks to YAML’s structured format and intuitive syntax, it’s easy to contribute knowledge and skill bounties in the form of a question-and-answer template. Contributors also benefit from an entire community with a wealth of resources including forums, docs, and user groups where individuals can seek support from one another.
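As a rough illustration of what a question-and-answer contribution might look like, here is a minimal Python sketch that models a contribution, checks it for obvious gaps, and renders it as YAML-style text. The field names (created_by, task_description, seed_examples) and the helper names are assumptions made for this example, not the project's confirmed schema; check the taxonomy repository for the actual template before contributing.

```python
import json
from dataclasses import dataclass, field


@dataclass
class SeedExample:
    # One question-and-answer pair supplied by a contributor.
    question: str
    answer: str


@dataclass
class Contribution:
    # Field names here are illustrative assumptions, not the official schema.
    created_by: str
    task_description: str
    seed_examples: list = field(default_factory=list)


def validate(contribution):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    if not contribution.created_by.strip():
        problems.append("created_by is empty")
    if not contribution.task_description.strip():
        problems.append("task_description is empty")
    if not contribution.seed_examples:
        problems.append("no seed examples")
    for i, example in enumerate(contribution.seed_examples):
        if not example.question.strip():
            problems.append(f"example {i}: question is empty")
        if not example.answer.strip():
            problems.append(f"example {i}: answer is empty")
    return problems


def to_yaml_text(contribution):
    """Render the contribution as YAML-style text.

    json.dumps is used only to quote strings safely; JSON strings are
    also valid YAML double-quoted scalars.
    """
    lines = [
        f"created_by: {json.dumps(contribution.created_by)}",
        f"task_description: {json.dumps(contribution.task_description)}",
        "seed_examples:",
    ]
    for example in contribution.seed_examples:
        lines.append(f"  - question: {json.dumps(example.question)}")
        lines.append(f"    answer: {json.dumps(example.answer)}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    draft = Contribution(
        created_by="example-contributor",  # hypothetical username
        task_description="Answer basic questions about contract law.",
        seed_examples=[
            SeedExample(
                question="What is consideration in a contract?",
                answer="Something of value that each party agrees to exchange.",
            )
        ],
    )
    print(validate(draft))
    print(to_yaml_text(draft))
```

The point of the sketch is that a contribution is little more than a short description plus a handful of question-and-answer pairs, which is why no machine learning background is needed to write one.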
Key features
- Regularly released models built with community contributions: Stay up to date by creating an account on HuggingFace.co and ‘liking’ the model(s) from the InstructLab repository.
- Pull request-focused contribution process in the open community: Keep track of new knowledge and skills contributions by watching the InstructLab/taxonomy repository on GitHub.
- Enhanced CLI tooling for contributing skills and knowledge, with the ability to smoke test them in a locally built model: Stay tuned by following our GitHub repo.
What’s next?
- Discover the developer preview of Red Hat Enterprise Linux AI, a foundation model platform to develop, test, and run Granite family large language models for enterprise applications.
- If you’re interested in building models that you can develop and serve yourself, check out Podman Desktop AI Lab, an open source extension for Podman Desktop for working with LLMs in a local environment.
Get started
Check out the InstructLab community page to get started now.
Last updated: May 15, 2024