Facebook to open up on the hardware behind its next-gen AI platform

Facebook's next-generation systems for AI computing are based on Nvidia Tesla GPUs

Facebook is set to open up the design of the custom server hardware it has developed to provide the machine learning and artificial intelligence (AI) power behind its platform.

Facebook said the move is intended to foster collaboration with others to build ever more complex AI systems.

Facebook said that the next-generation systems for AI computing developed by its Facebook AI Research (FAIR) group are based on the firm's own custom design. This is one of the first systems to use Nvidia's newly released Tesla M40 GPU accelerator, a PCI Express card that the chip maker claims was developed specifically for deep learning and for training machine learning systems.

Going by the name of Big Sur - a region of the California coastline - the new system is an Open Rack-compatible chassis that can be fitted with up to eight Tesla GPUs of up to 300W each, with the flexibility to switch between multiple PCIe topologies, Facebook said.
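By way of illustration, the short sketch below shows how software on a multi-GPU chassis of this kind might enumerate the accelerators available to it. It is a hypothetical example using PyTorch's CUDA bindings - the article does not say what software stack Facebook runs on Big Sur - and the names and memory sizes printed depend entirely on the cards installed.

    # Illustrative sketch only: enumerate the CUDA devices visible on a
    # multi-GPU machine. Assumes PyTorch; Facebook's actual Big Sur
    # software stack is not described in the article.
    import torch

    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}, {props.total_memory // 2**20} MiB")

On a machine with Nvidia's drivers installed, running nvidia-smi topo -m at the command line similarly reports how the GPUs are wired across the PCIe topology.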

Facebook said it intends to open-source the specifications for Big Sur, and will submit the design materials to the Open Compute Project (OCP), an initiative started by Facebook to openly develop and share designs for data centre infrastructure.

"At Facebook, we've made great progress thus far with off-the-shelf infrastructure components and design. We've developed software that can read stories, answer questions about scenes, play games and even learn unspecified tasks through observing some examples," said Facebook's Serkan Piantino and Kevin Lee, writing on the firm's Engineering blog.

"But we realised that truly tackling these problems at scale would require us to design our own systems. Today, we're unveiling our next-generation GPU-based systems for training neural networks, which we've code-named Big Sur," the pair explained.

With Nvidia's latest generation of GPU accelerator, Big Sur delivers twice the performance of Facebook's previous generation of hardware, which means the firm can train neural networks twice as quickly and explore networks twice as large. The web giant also said that distributing training across all eight GPUs allows it to scale the size and speed of its networks by another factor of two.
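The article does not detail how that eight-way scaling is achieved, but the standard technique is data parallelism: the same network is replicated on every GPU and each replica processes a slice of the input batch. Below is a minimal, hypothetical sketch of the idea using PyTorch's DataParallel wrapper; the model, batch size and GPU count are placeholders rather than Facebook's actual configuration.

    # Hypothetical sketch of data-parallel training across eight GPUs.
    # Assumes PyTorch and a machine with eight CUDA devices; the model
    # and data below are placeholders, not Facebook's configuration.
    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10))

    # Replicate the model on GPUs 0-7; each forward pass splits the batch
    # across the replicas, so throughput scales roughly with GPU count.
    model = nn.DataParallel(model.cuda(), device_ids=list(range(8)))

    optimiser = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    inputs = torch.randn(256, 1024).cuda()         # placeholder batch
    targets = torch.randint(0, 10, (256,)).cuda()  # placeholder labels

    optimiser.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimiser.step()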

However, while Big Sur was built with the Tesla M40 in mind, the hardware is qualified to support a wide range of PCIe cards, Facebook said. The new systems are also optimised for thermal and power efficiency, which means they can run happily even in the firm's air-cooled data centres.

Facebook said that Big Sur has also been designed with serviceability in mind, stripping out infrequently used components, while parts that fail relatively often, such as hard drives and memory modules, can now be removed and replaced in a few seconds. No special training or service guide is needed, according to the firm.

"Even the motherboard can be removed within a minute, whereas on the original AI hardware platform it would take over an hour. In fact, Big Sur is almost entirely toolless - the CPU heat sinks are the only things you need a screwdriver for," the blog states.

Facebook stated that it wants to make it easier for AI researchers to share techniques and technologies.

"As with all hardware systems that are released into the open, it's our hope that others will be able to work with us to improve it. We believe that this open collaboration helps foster innovation for future designs, putting us all one step closer to building complex AI systems that bring this kind of innovation to our users and, ultimately, help us build a more open and connected world."

Facebook did not say when it is likely to publish full specifications for Big Sur or its FAIR platform, but this will probably happen at the next OCP Summit event early in 2016.