Exploring Dolly 2.0: A Free, Open-Source, GPT-Style Chat AI Model
Introduction
In March 2023, the folks at Databricks announced Dolly, a Large Language Model (LLM) fine-tuned on the Databricks machine learning platform using a small, high-quality open instruction-following dataset. The first version, however, was trained on data generated with OpenAI’s API (the Stanford Alpaca dataset), and the terms of service attached to that output proved to be a limiting factor.
Great news! Databricks has just announced that they are open-sourcing Dolly 2.0, which means that the training code, dataset, and model weights are all available for commercial use. This is a huge step forward in the world of language models, as it enables companies and organizations to create their own powerful LLMs without having to pay for an API subscription or worry about their data being shared with third parties. So, if you’re interested in using this model, you can now take advantage of it and customize it to suit your needs.
Why 2.0 all of a sudden?
Upon closer examination of Dolly’s development, it has come to light that the initial release in March 2023 (Dolly 1.0) relied heavily on training data generated with OpenAI’s API, via the Stanford Alpaca dataset. While this approach worked well, it meant that Dolly 1.0 couldn’t be used commercially without violating OpenAI’s terms of service, which prohibit using API output to develop competing models.
Recognizing this limitation, the Databricks team set out to create a version that was not only open-source but also free for commercial use. They achieved this by creating a new, high-quality training dataset called databricks-dolly-15k and using it to fine-tune an open-source base model.
The Dataset: databricks-dolly-15k
The key to Dolly 2.0’s success is the databricks-dolly-15k dataset. This dataset consists of 15,000 high-quality instruction/response pairs that were generated by Databricks employees. These pairs were created using seven specific instruction categories, each designed to cover a wide range of use cases.
The Databricks team created a contest to motivate their employees to participate in the dataset creation process, with the top 20 labellers receiving a substantial award. In addition, they outlined seven specific tasks that were required for the dataset creation:
- Open Q&A: For instance, “Why do people like comedy movies?” or “What is the capital of France?” In some cases, there’s no correct answer; in others, it requires drawing on knowledge of the world.
- Closed Q&A: These are questions that can be answered using only the information contained in a passage of reference text. For instance, given a paragraph from Wikipedia on the atom, one might ask, “What is the ratio between protons and neutrons in the nucleus?”
- Extract information from Wikipedia: Here an annotator would copy a paragraph from Wikipedia and extract entities or other factual information such as weights or measurements from the passage.
- Summarize information from Wikipedia: For this, annotators provided a passage from Wikipedia and were asked to distill it to a short summary.
- Brainstorming: This task asked for open-ended ideation and an associated list of possible options. For instance, “What are some fun activities I can do with my friends this weekend?”
- Classification: For this task, annotators were asked to make judgments about class membership (e.g. are the items in a list animals, minerals, or vegetables) or to judge the properties of a short passage of text, such as the sentiment of a movie review.
- Creative writing: This task would include things like writing a poem or a love letter.
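The categories above are recorded in the dataset itself: databricks-dolly-15k is distributed as JSONL, with each record carrying `instruction`, `context`, `response`, and `category` fields. Here is a minimal sketch of working with that format; the rows below are invented for illustration, and the exact category labels (e.g. `open_qa`) are assumptions about the dataset's naming:

```python
import json
from collections import Counter

# Illustrative records in the databricks-dolly-15k JSONL schema.
# These rows are made up for demonstration, not taken from the dataset;
# the category labels are assumed spellings.
sample_jsonl = """\
{"instruction": "What is the capital of France?", "context": "", "response": "Paris is the capital of France.", "category": "open_qa"}
{"instruction": "Summarize the passage.", "context": "The atom consists of a dense nucleus...", "response": "Atoms have a dense central nucleus.", "category": "summarization"}
{"instruction": "Name three weekend activities.", "context": "", "response": "Hiking, board games, and a picnic.", "category": "brainstorming"}
"""

# Parse one JSON object per line, then count records per category.
records = [json.loads(line) for line in sample_jsonl.splitlines()]
per_category = Counter(r["category"] for r in records)
print(per_category)
```

The `context` field is empty for open-ended tasks like brainstorming and holds the reference passage for closed Q&A, extraction, and summarization tasks.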
Example: Open Q&A in databricks-dolly-15k
Instruction: How do I build a campfire?
Response: Safety should always come first when starting a campfire. Ensure you have water or another way to extinguish your fire readily available. Use a campfire ring, pit, or clear a large area around your campfire location. Also, make sure campfires are permitted in your area.
With safety out of the way, all a fire needs are heat, fuel, and oxygen. Any combustible material will start a fire, but good campfire fuel sources include tinder, kindling, and dry logs.
Use materials like dry pine needles, dry leaves, grasses, and wood shavings for tinder…
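At inference time, an instruction like this is typically rendered into a single Alpaca-style prompt string before being fed to the model. A minimal sketch of such a template follows; the exact wording Databricks uses in its generation pipeline may differ, so treat this as an approximation:

```python
def build_prompt(instruction: str, context: str = "") -> str:
    """Render an instruction (and optional reference passage) in an
    Alpaca-style prompt template, similar in shape to the one used by
    Dolly-family instruction models. Wording is an approximation."""
    intro = ("Below is an instruction that describes a task. "
             "Write a response that appropriately completes the request.")
    parts = [intro, f"### Instruction:\n{instruction}"]
    if context:
        # Closed Q&A, extraction, and summarization tasks supply a passage.
        parts.append(f"### Context:\n{context}")
    parts.append("### Response:\n")
    return "\n\n".join(parts)

print(build_prompt("How do I build a campfire?"))
```

The model then generates text after the final `### Response:` marker, which is how a single next-token predictor is steered into instruction-following behaviour.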
Why Dolly 2.0 Matters
Dolly 2.0 represents a significant step forward in developing large language models. By making it accessible and open-source, Databricks has levelled the playing field and opened up a new world of possibilities for businesses and organizations.
The availability of Dolly 2.0 brings unprecedented transparency, flexibility, and customization options to businesses. Because it is licensed for commercial use, companies gain a powerful tool for holding natural, engaging conversations with customers while maintaining complete control over their data.
With this development, we can expect to see new and innovative applications of AI in the business world.
But Dolly 2.0 is not just about accessibility; it is also about quality. The dataset used to train the model, databricks-dolly-15k, was explicitly designed for instruction tuning of LLMs, with 15,000 high-quality human-generated prompt/response pairs. These training records are natural, expressive, and designed to represent a variety of behaviours, from brainstorming and content generation to information extraction and summarization.
This focused effort results in a powerful and versatile model capable of performing a wide range of tasks with impressive accuracy and speed. And because the model is open source, it is fully customizable, allowing users to tweak and refine it to suit their specific needs.
Conclusion
In short, Dolly 2.0 marks a milestone for open large language models: a commercially usable model, dataset, and training code that any organization can build on. As the field of AI continues to evolve, it is clear that Dolly 2.0 will play a pivotal role in shaping the future of LLMs and driving innovation in this exciting field.
What do you think about Dolly 2.0? Let me know in the comments. I’ll see you at the next one. Have a wonderful day!