A team made up largely of Microsoft scientists has published research suggesting that anyone scraping the internet for faces to train facial recognition algorithms may be wasting their time and money.
The research, led by a scientist from the University of Cambridge, has not been peer reviewed. In the paper, the team reports that a large-scale dataset of synthetic faces produced an accuracy of 96.17 percent. That is still short of the 99.8 percent accuracy that others achieve using billions of facial images scraped from social media and elsewhere on the web.
According to the paper, pairing a dataset of 500,000 to 1.22 million synthetic digital faces with a small set of real images covering between 200 and 2,000 identities (20 images per identity) boosted accuracy to roughly par with datasets holding billions of scraped images.
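To make the arithmetic concrete, the sketch below builds toy label lists matching the figures reported above: a 500,000-image synthetic set (the paper's lower bound) paired with a consented real set of 2,000 identities at 20 images each. The identity counts, naming scheme, and `build_dataset` helper are illustrative assumptions, not the authors' actual pipeline.

```python
# Hedged sketch of the dataset pairing described in the paper.
# All names and per-identity splits here are assumptions for illustration.

def build_dataset(num_identities, images_per_identity, prefix):
    """Return (identity_label, image_id) pairs for a toy dataset."""
    return [
        (f"{prefix}_{i}", f"{prefix}_{i}_img{j}")
        for i in range(num_identities)
        for j in range(images_per_identity)
    ]

# Large synthetic set: 10,000 identities x 50 images = 500,000 images,
# matching the paper's lower bound (the split is an assumption).
synthetic = build_dataset(10_000, 50, prefix="syn")

# Small consented real set: the paper's upper bound of 2,000 identities,
# 20 images each, i.e. 40,000 real images in total.
real = build_dataset(2_000, 20, prefix="real")

print(len(synthetic))  # 500000
print(len(real))       # 40000
```

The point the numbers make: the real-image component is tiny (tens of thousands of images from at most 2,000 consenting people) compared with the billions of scraped images used by systems like Clearview AI.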
The authors point out that ethically requesting images from 2,000 people is far easier than obtaining consent for the billions of images currently harvested without permission by companies like Clearview AI.
The team says that images collected indiscriminately and in unprecedented bulk carry not-so-hidden costs, including demographic bias, poor source quality, and label noise.
And, of course, privacy advocates have called that process unethical at best, because every major Western social media platform prohibits scraping faces from its users' accounts; doing so violates their terms of service.
In an email exchange, Clearview CEO Hoan Ton-That says he is “impressed” with academic research using generated faces generally.
But, he added, "current research shows that the accuracy of the algorithms created from generated faces [is] lower compared to the current state of the art algorithms."
Ton-That did tip his hat to the idea of combining generated and real images to improve performance and accuracy.
The DigiFace-1M dataset is available for download from GitHub.