In Cringe Video, OpenAI CTO Says She Doesn’t Know Where Sora’s Training Data Came From; "I'm actually not sure about that."

Wondering what data OpenAI used to train its buzzy new text-to-video AI? The company's CTO is similarly unsure.

Mira Murati, OpenAI's longtime chief technology officer, sat down with The Wall Street Journal's Joanna Stern this week to discuss Sora, the company's forthcoming video-generating AI. About halfway through the 10-minute interview, Stern asked Murati point-blank where the new model's training data came from. But Murati, in the most cringe-inducing way possible, couldn't find an answer beyond vague corporate language.

"We used publicly available data and licensed data," Murati responded to the resoundingly simple question.

Asked whether that included videos from YouTube, Murati said, "I'm actually not sure about that," before rebuffing further queries about whether videos shared to Instagram or Facebook were fed into the model.

"You know, if they were publicly available — publicly available to use," the CTO answered, "but I'm not sure. I'm not confident about it."

Stern then inquired about OpenAI's training data partnership with the stock image company Shutterstock, asking if videos on the partnered platform were sucked into Sora's training material. And this time? Murati decided to shut down the line of questioning altogether.

"I'm just not going to go into detail about the data that was used," Murati continued. "But it was publicly available or licensed data."

So, in sum, Murati can't tell you exactly where the videos gobbled up by Sora first came from. But rest assured, the sourceless data was definitely, one hundred percent publicly available or licensed. Convincing stuff!

It's a bad look all around for OpenAI, which has drawn wide controversy, not to mention multiple copyright lawsuits, including one from The New York Times, for its data-scraping practices. After all, if the company's CTO can't firmly tell you where its buzziest new model's training data was sourced from, it doesn't exactly suggest that OpenAI's higher-ups care much about the issue.

Me: What data was used to train Sora? YouTube videos?
OpenAI CTO: I'm actually not sure about that...
(I really do encourage you to watch the full @WSJ interview where Murati did answer a lot of the biggest questions about Sora. Full interview, ironically, on YouTube:… pic.twitter.com/51O8Wyt53c
— Joanna Stern (@JoannaStern) March 14, 2024

After the interview, Murati reportedly confirmed to the WSJ that Shutterstock videos were indeed included in Sora's training set. But when you consider the vastness of video content across the web, any clips available to OpenAI through Shutterstock are likely only a small drop in the Sora training data pond.

Online, reactions to the clip were mixed, with many chalking Murati's close-lipped responses up to a possible lack of candidness.

"So when *the CTO* of OpenAI is asked if Sora was trained on YouTube videos, she says 'actually I'm not sure' and refuses to discuss all further questions about the training data," former LA Times tech columnist Brian Merchant wrote in an X-formerly-Twitter post. "Either a rather stunning level of ignorance of her own product, or a lie — pretty damning either way!"

Others, meanwhile, jumped to Murati's defense, arguing that if you've ever published anything to the internet, you should be perfectly fine with AI companies gobbling it up.

"Why does it matter? That is the question," said one X user. "I find it insane that people make things public to everyone in the world and then complain when someone uses that public thing. If you want to be private, then be private."

That latter argument, though, speaks to the bizarre new reality that internet users have now found themselves in. Historically, when someone told you to be careful of what you post online, the reasoning was something akin to "you might regret that later" — and not "a multibillion-dollar AI company might turn a profit by vacuuming that Facebook video of you and your family, or a goofy YouTube video you made with your friends, into a generative AI model."

Whether Murati was keeping things close to the vest to avoid more copyright litigation or simply didn't know the answer, people have good reason to wonder where AI data, be it "publicly available and licensed" or not, is coming from. And moving forward, vague corporate mumbling probably isn't going to cut it.

“Sora, make me a video of a bumbling Joe Biden stumbling on the stage, confused about where to go, and mumbling unintelligible bosh into the microphone, then using an annoying stage whisper.”

“Sorry, you can find that on any nightly newscast. You don’t need to ask me for that.”

3 posted on 03/17/2024 1:35:27 PM PDT by ProtectOurFreedom (“Occupy your mind with good thoughts or your enemy will fill them with bad ones.” ~ Thomas More)

From her employer’s perspective, it was the right answer. The problem is copyright claims.

5 posted on 03/17/2024 3:25:51 PM PDT by Zhang Fei (My dad had a Delta 88. That was a car. It was like driving your living room)

She doesn’t know or care where the programming came from for Artificial Stupidity.

The NYTimes’ lawyers are salivating over this clip, waiting for the day they can use this in court in their lawsuit against OpenAI. Many more plaintiffs will appear this week as a result of Murati’s disastrous gaffe.

Disclaimer: Opinions posted on Free Republic are those of the individual posters and do not necessarily represent the opinion of Free Republic or its management. All materials posted herein are protected by copyright law and the exemption for fair use of copyrighted works.