Databricks announces 'Dolly' - an open-source ChatGPT rival

It is aimed at democratising large language models

Big-data analytics firm Databricks has released an open-source language model called Dolly, which it claims can replicate ChatGPT's abilities without the expensive hardware or vast training datasets.

The firm, founded by the creators of Apache Spark, says the release of Dolly is aimed at democratising large language models (LLMs). It should help smaller firms build their own generative AI models, rather than leaving the technology to the large tech companies that can afford the investment.

By open-sourcing Dolly and its training data, Databricks hopes to enable anyone to train and operate genuinely human-like AI models.

The release comes amid growing interest in LLMs and generative AI.

OpenAI's ChatGPT has become one of the most popular chatbots in recent months, and one of the world's fastest-growing applications. It quickly gained worldwide attention after its release in November last year for its ability to mimic human-like interactions and hold engaging conversations.

One of the main reasons behind that popularity is the fact that OpenAI trained ChatGPT on trillions of words from the internet, making it a highly sophisticated language model.

ChatGPT's popularity has pushed other tech firms to develop their own proprietary instruction-following models.

Last month, Google announced its AI chatbot technology, "Bard", which it says will provide "fresh, high-quality responses" to users' queries by drawing on information from the web.

Facebook parent Meta joined the race by launching its partially open-source model named LLaMA, which is also believed to have been trained on trillions of words.

Despite the trend of training models on vast amounts of data, a group of researchers managed to create Alpaca, an AI model based on Facebook's LLaMA, using a relatively small dataset of around 50,000 questions and answers. The researchers have since taken it offline due to cost and safety concerns.

Databricks adopted a more targeted approach with Dolly. Rather than building a massive new model from scratch and spending months training it, the firm took an older open-source LLM called GPT-J, originally developed by the research collective EleutherAI roughly two years earlier.

Databricks made slight modifications to the EleutherAI LLM to enhance its instruction-following capabilities, including brainstorming and text generation, which were not present in the original model. The team leveraged data from Alpaca to achieve this.
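The article does not describe Databricks' exact training pipeline, but instruction-tuning on Alpaca-style data typically starts by rendering each record, an instruction, optional input context, and the desired output, into a single training prompt. The sketch below is a hypothetical illustration of that formatting step; the field names follow the published Alpaca dataset, and the helper name is this article's own.

```python
def format_instruction_record(record):
    """Render one Alpaca-style record into a single training prompt.

    Each record is a dict with an "instruction", an optional "input"
    providing extra context, and the desired "output". (Field names
    follow the public Alpaca dataset; Databricks' actual pipeline
    may differ.)
    """
    if record.get("input"):
        return (
            "Below is an instruction that describes a task, paired with "
            "an input that provides further context. Write a response "
            "that appropriately completes the request.\n\n"
            f"### Instruction:\n{record['instruction']}\n\n"
            f"### Input:\n{record['input']}\n\n"
            f"### Response:\n{record['output']}"
        )
    return (
        "Below is an instruction that describes a task. Write a response "
        "that appropriately completes the request.\n\n"
        f"### Instruction:\n{record['instruction']}\n\n"
        f"### Response:\n{record['output']}"
    )


# Example record in the Alpaca style:
example = {
    "instruction": "Name the first cloned mammal.",
    "input": "",
    "output": "Dolly the sheep.",
}
print(format_instruction_record(example))
```

Fine-tuning a base model such as GPT-J on tens of thousands of prompts like these is what teaches it to follow instructions rather than merely continue text.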

"We're calling the model Dolly — after Dolly the sheep, the first cloned mammal — because it's an open source clone of an Alpaca, inspired by a LLaMA," the company says.

Databricks says Dolly's ability to mimic human interaction was accomplished using "only" 6 billion parameters, a single machine, and a dataset of just 50,000 question-and-answer records, in contrast to ChatGPT's 175 billion parameters and extensive training across thousands of GPUs.

The company believes the key to developing AI that can follow instructions lies in providing specific examples of how to communicate effectively with humans, rather than training on massive datasets.

"We're in the earliest days of the democratisation of AI for the enterprise, and much work remains to be done," wrote Databricks. They added that they believe the technology behind Dolly presents a new opportunity for companies that aim to build their own instruction-following models inexpensively.

Databricks has open-sourced Dolly and has also provided a Databricks notebook, which customers can use to build their own Dolly models on the Databricks platform.