Google's new AI training data nearly fivefold larger than its predecessor's

New AI model has been trained on 3.6 trillion tokens

Google's PaLM 2 training data surpasses that of its predecessor by nearly fivefold, report finds

Google unveiled PaLM 2, its latest large language model (LLM), at its annual developer conference last week - but its claims at the time about using a smaller training data set have been called into question.

A report by CNBC has found that PaLM 2 actually uses nearly five times as much training data as its predecessor, PaLM (Pathways Language Model), giving it the capability to perform tasks such as maths, advanced coding and creative writing.

According to CNBC, the PaLM 2 model has been trained on 3.6 trillion tokens, a significant increase compared to PaLM's 780 billion tokens.

Tokens, the small chunks of text (whole words or pieces of words) that a model processes, play a crucial role in training LLMs: the model learns to predict the next token in a given sequence. Using tokens as building blocks, LLMs gain the ability to understand and generate coherent language patterns.
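
To make the idea concrete, the toy Python sketch below shows how a piece of text can be split into tokens and turned into next-token prediction examples. It uses a naive whitespace tokenizer purely for illustration; production models such as PaLM 2 rely on subword tokenizers and vastly larger corpora, so this is a simplified sketch rather than a description of Google's actual pipeline.

```python
# Illustrative only: a toy whitespace "tokenizer" and next-token training pairs.
# Real LLMs use subword tokenizers and train on trillions of tokens, so this is
# a deliberate simplification, not Google's pipeline.

text = "large language models predict the next token in a sequence"

# Step 1: split the text into tokens (here, naively on whitespace).
tokens = text.split()

# Step 2: build (context, target) pairs; the model is trained to predict
# the target token given the tokens that precede it.
training_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, target in training_pairs[:3]:
    print(f"context={context!r} -> predict {target!r}")
```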

Where PaLM 2 is smaller than PaLM is in its parameter count. According to reports, the original model was trained with 540 billion parameters, while PaLM 2 has 340 billion.

"It excels at advanced reasoning tasks, including code and math, classification and question answering, translation and multilingual proficiency, and natural language generation better than our previous state-of-the-art LLMs, including PaLM," Google said in a blog post.

"It can accomplish these tasks because of the way it was built - bringing together compute-optimal scaling, an improved dataset mixture, and model architecture improvements."

Google says PaLM 2 has been trained on text in 100 languages, and has demonstrated "mastery"-level performance in advanced language proficiency exams.

"PaLM 2 can decompose a complex task into simpler subtasks and is better at understanding nuances of the human language than previous LLMs, like PaLM.

"For example, PaLM 2 excels at understanding riddles and idioms, which requires understanding ambiguous and figurative meaning of words, rather than the literal meaning."

PaLM 2 uses a technique known as "compute-optimal scaling" to enhance its efficiency and overall performance, resulting in faster inference, fewer parameters to serve and lower serving costs.

By implementing this approach, PaLM 2 achieves improved operational efficiency without compromising its performance capabilities.
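
As a rough illustration of what compute-optimal scaling means in practice, the short calculation below compares training tokens per parameter for PaLM and PaLM 2, using the figures reported above. The tokens-per-parameter framing comes from DeepMind's published work on compute-optimal training (the "Chinchilla" result) and is used here only as a back-of-the-envelope sketch, not as Google's stated methodology.

```python
# Back-of-the-envelope comparison using the figures reported in this article.
# Compute-optimal scaling favours training a smaller model on more tokens;
# the ratio below is a rough illustration, not Google's published recipe.

models = {
    "PaLM":   {"tokens": 780e9,  "parameters": 540e9},
    "PaLM 2": {"tokens": 3.6e12, "parameters": 340e9},
}

for name, m in models.items():
    ratio = m["tokens"] / m["parameters"]
    print(f"{name}: ~{ratio:.1f} training tokens per parameter")

# Approximate output:
# PaLM: ~1.4 training tokens per parameter
# PaLM 2: ~10.6 training tokens per parameter
```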

At the I/O conference, Google announced the integration of PaLM 2 across its ecosystem, unveiling more than 25 new products and features that harness the model.

This includes the expansion of its Bard AI chatbot to additional languages, bringing its capabilities to a far wider linguistic audience.

The company introduced an updated search engine employing generative AI technology, in an attempt to compete with the GPT-4 integration in Microsoft's Bing.

"With generative AI, we are taking the next step with a bold and responsible approach," said Sundar Pichai, Google's CEO.

While Google has been enthusiastic about demonstrating the potential of its artificial intelligence technology, the company has been reluctant to disclose specific information about its training data.

Similarly, OpenAI, the organisation behind ChatGPT, has also chosen to withhold specific details about its latest LLM, GPT-4.

Both Google and OpenAI cite the competitive nature of the industry as the reason for their lack of disclosure.

During a hearing of the Senate Judiciary subcommittee on privacy and technology, OpenAI CEO Sam Altman agreed with lawmakers about the necessity for a new system to address the challenges posed by AI.

"For a very new technology we need a new framework," Altman said.

"Certainly companies like ours bear a lot of responsibility for the tools that we put out in the world."