Microsoft withdraws facial recognition database of 100,000 people

'MS Celeb' database had scraped images and videos published online under a Creative Commons open-source licence

Microsoft has withdrawn a facial recognition database featuring 10 million images of 100,000 people following claims it was being used by the military and the companies in China behind the repressive surveillance in Xinjiang province.

The database, labelled MS Celeb, was published in 2016. Microsoft claimed that it was the largest facial recognition dataset in the world.

However, the individuals featured in the database had not been asked their consent. Instead, Microsoft had scraped the images from search engines and videos published online under a Creative Commons licence.

The database was taken down after the Financial Times reported on it. "The site was intended for academic purposes. It was run by an employee that is no longer with Microsoft and has since been removed," claimed Microsoft in a statement to the FT.

Microsoft had labelled the database 'Celeb' to suggest that the faces were of publicly known figures, but it also included a number of private individuals, including journalists. People in the database contacted by the FT claimed no knowledge of their inclusion.

According to the FT, Microsoft's MS Celeb database had been used by a number of different companies, including IBM, Panasonic, Nvidia and Hitachi, as well as Sensetime and Megvii in China.

The latter two companies are involved in the Chinese government surveillance system installed in the province of Xinjiang, where ethnic Uyghur people are closely monitored. More than one million people are believed to be held in internment camps, although the Chinese government claims that they are training centres.

It is not the only facial recognition dataset published online. Other datasets have since been removed, including one set up by researchers at Duke University and another, called Brainwash, by Stanford University.

The datasets had been discovered by Adam Harvey, who runs the Megapixels project, which tracks databases of personal information and how they are used.

Harvey warned that although Microsoft had taken the database offline, it is still being widely shared by people and groups who had downloaded it.

"People are posting it on GitHub, hosting the files on Dropbox and Baidu Cloud, so there is no way from stopping them from continuing to post it and use it for their own purposes," Harvey told the FT.
