SOME STARTUPS USE FAKE DATA TO TRAIN AI
BERLIN STARTUP SPIL.LY had a problem last spring. The company was developing an augmented-reality app akin to a full-body version of Snapchat’s selfie filters—hold up your phone and see your friends’ bodies transformed with special effects like fur or flames. To make it work, Spil.ly needed to train machine-learning algorithms to closely track human bodies in video. But the scrappy startup didn’t have the resources to collect the tens or hundreds of thousands of hand-labeled images typically needed to teach algorithms in such projects.
“It’s really hard being a startup in AI, we couldn’t afford to pay for that much data,” says CTO Max Schneider.
His solution? Fabricate the data.
Spil.ly’s engineers began creating their own labeled images to train the algorithms by adapting techniques used to make movie and videogame graphics. About a year later, the company has roughly 10 million images made by pasting digital humans it calls simulants into photos of real-world scenes. They look weird, but they work. Think of it as putting the artificial in artificial intelligence.
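The appeal of this approach is that the labels come for free: the program that pastes a figure into a photo already knows exactly which pixels it touched. The sketch below illustrates that idea in plain Python, with images as nested lists of numbers. It is not Spil.ly's pipeline; the function names `composite` and `make_dataset` and the toy data layout are assumptions made for illustration.

```python
import random


def composite(background, figure, mask, top, left):
    """Paste `figure` onto a copy of `background` wherever `mask` is 1.

    Returns the new image, a per-pixel label map, and a bounding box
    (top, left, bottom, right). All labels are derived from the paste
    itself, so no hand annotation is involved.
    """
    height, width = len(background), len(background[0])
    image = [row[:] for row in background]      # leave the original intact
    labels = [[0] * width for _ in range(height)]
    rows, cols = [], []
    for r, mask_row in enumerate(mask):
        for c, inside in enumerate(mask_row):
            y, x = top + r, left + c
            if inside and 0 <= y < height and 0 <= x < width:
                image[y][x] = figure[r][c]
                labels[y][x] = 1
                rows.append(y)
                cols.append(x)
    bbox = (min(rows), min(cols), max(rows), max(cols)) if rows else None
    return image, labels, bbox


def make_dataset(background, figure, mask, count, seed=0):
    """Generate `count` labeled samples at random positions (hypothetical helper).

    Assumes the figure is no larger than the background.
    """
    rng = random.Random(seed)
    max_top = len(background) - len(mask)
    max_left = len(background[0]) - len(mask[0])
    samples = []
    for _ in range(count):
        top = rng.randint(0, max_top)
        left = rng.randint(0, max_left)
        samples.append(composite(background, figure, mask, top, left))
    return samples
```

A real system would render 3D figures with varied poses and lighting rather than paste a fixed cutout, but the principle of generating the image and its annotation in the same step is the same.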
“The models we train on purely synthetic data are pretty much equivalent to models we train on actual data,” says Adam Schuster, an engineer at Spil.ly. In a demo, a virtual monkey appears on a table viewed through an iPhone’s camera, jumps to the ground, and squirts paint onto the clothes of a real person standing nearby.
Fake it ‘til you make it has long been a motto of startups trying to survive in markets stalked by larger competitors. It has led some companies, like blood-test “innovator” Theranos, into trouble. In the world of machine learning, however, spoofing training data is becoming a legitimate strategy to jumpstart projects when cash or real training data is short. If data is the new oil, this is like brewing biodiesel in your backyard.
The phony data movement could accelerate the use of artificial intelligence in new areas of life and business. Machine-learning algorithms are inflexible compared to human intelligence, and applying them to a new problem generally requires new training data specific to that situation. Neuromation, a startup based in Tallinn, Estonia, is churning out images containing simulated pigs as part of work for a client that wants to use cameras to track the growth of livestock. Apple, Google, and Microsoft have all published research papers noting the convenience of using synthetic training data.
Evan Nisselson, a partner at venture firm LDV Capital, says synthetic data offers startups hope of competing with data-rich AI giants. Talented teams are often hamstrung by a lack of data, he says. “The ability to create synthetic data and train models with that can level the playing field between startups and big companies,” Nisselson says.
Spil.ly’s story adds some weight to that argument. In February, Facebook disclosed its own machine-learning software that can apply special effects to humans in video. Densepose, as it is called, was trained with 50,000 images of people hand-annotated with 5 million points. Within days, Spil.ly began synthesizing data similar to Facebook’s. The startup has since integrated ideas from Densepose into its own product.
Neuromation and others want to establish themselves as brokers of fake data. Another Neuromation project involves creating images of grocery store shelves for OSA HP, a retail analytics company with customers including French supermarket group Auchan. The data is training algorithms that read images to track stock on shelves. “The sheer number of product categories and the varying retail environments make gathering and labelling images impractical,” says Alex Isaev, CEO of OSA.
Ofir Chakon, cofounder of Israeli startup DataGen, says his company charges up to seven-figure sums to generate custom videos of simulated, and somewhat creepy, hands. The company’s realism comes in part from generative adversarial networks, a technique recently trendy in machine-learning circles that can create photo-realistic images.
To human eyes, those hands and Neuromation’s fake pigs couldn’t pass as real. “When I first saw the synthetic dataset I thought, ‘This is terrible. How is it possible the computer can be learning from this?’” says Schuster of Spil.ly. “But what matters is what the computer understands from an image.”
Getting the computer to understand the right thing can take some work. Spil.ly originally synthesized only naked figures, but found the software learned to look only for skin. The startup’s system now generates people with varied body shapes, skin tones, hair, and clothing. Spil.ly and others often also train their systems on a smaller number of real images, in addition to millions of synthetic examples.
Even the world’s most data- and cash-rich AI teams are embracing synthetic data. Google researchers train robots in simulated worlds, for example, while Microsoft published results last year on how 2 million synthetic sentences could improve translation of the Levantine dialect of Arabic.
Apple, which keeps its AI inspirations more secret, also has signalled interest in faking training data. In 2016, the company released a research paper on generating realistic images of eyes to improve gaze-detection software. Almost a year later, the company released the iPhone X, which unlocks by detecting a user’s gaze and then recognizing the face. Some of the same researchers contributed to both projects. The company declines to comment on whether it incorporated findings of the research in the unlocking feature.
In robotics, synthetic training data helps researchers carry out experiments at greater scale than is possible in the real world. Alphabet’s Waymo says its self-driving cars have driven millions of miles on public roads, but its control software has traveled billions of miles on simulated streets.
Giving machines digital doubles can help robots learn to better handle objects in factories or homes. Researchers at OpenAI, the research institute cofounded by Elon Musk, have found that software trained in a simulated world can work reasonably well when it controls a real robot. Tricks that help include randomly varying the colors and textures in the simulated world to make the software focus on the core physical problem, and generating millions of different, oddly shaped objects to be grasped. “Two years ago the prevailing belief was that simulated data was not very useful,” says Josh Tobin, a researcher at OpenAI. “In the last year or so that perception is starting to shift.”
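The randomization trick can be pictured as drawing a fresh set of scene settings for every training example, so that surface appearance never stays constant long enough for the software to rely on it. The following is a minimal sketch of that idea, not OpenAI's code: the `Scene` fields, the texture names, and the numeric ranges are all assumptions chosen for illustration, and a real setup would pass these settings to a simulator or renderer.

```python
import random
from dataclasses import dataclass

# Hypothetical texture names; a real simulator would define its own.
TEXTURES = ("checker", "noise", "stripes", "flat")


@dataclass
class Scene:
    table_rgb: tuple        # nuisance factor: varied at random
    object_rgb: tuple       # nuisance factor: varied at random
    texture: str            # nuisance factor: varied at random
    light_intensity: float  # nuisance factor: varied at random
    object_size: tuple      # oddly proportioned shapes to be grasped


def random_rgb(rng):
    """Pick a random 8-bit color."""
    return tuple(rng.randint(0, 255) for _ in range(3))


def randomized_scene(rng):
    """Draw one scene with randomized appearance and object dimensions."""
    return Scene(
        table_rgb=random_rgb(rng),
        object_rgb=random_rgb(rng),
        texture=rng.choice(TEXTURES),
        light_intensity=rng.uniform(0.2, 1.0),
        # Width, depth, height in meters; the range is an assumed value.
        object_size=tuple(round(rng.uniform(0.02, 0.15), 3) for _ in range(3)),
    )


def generate_scenes(count, seed=0):
    """Generate `count` scenes reproducibly from a seed."""
    rng = random.Random(seed)
    return [randomized_scene(rng) for _ in range(count)]
```

Because colors, textures, and lighting change from one scene to the next, the only consistent signal left for a model to learn from is the geometry of the object and the task itself.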
Despite those successes, fake data is not omnipotent. Many complex problems aren’t well enough understood to simulate realistically, says DataGen’s Chakon. In other cases, the stakes are too high to risk creating a system with any disconnect from reality. Michael Abramoff, a professor at the University of Iowa, has developed ways to generate images of the retina, and says he uses synthetic data in grad-student projects. But he stuck to real images when developing the retina-checking software his startup IDx got approved by the FDA this month. “We wanted to be maximally conservative,” Abramoff says.