We Should All Be Scared of OpenAI’s New AI Video Generator

OpenAI’s CEO, Sam Altman, unveiled Sora, the company’s new AI video generator, yesterday. Just like DALL-E and ChatGPT before it, Sora can interpret natural-language prompts and generate what you describe. But unlike any AI program I’ve seen before, Sora produces complete, lifelike video rather than text replies or still images. That’s not meant as a compliment.

Initial Sora impression: Terror

OpenAI has a series of different videos on Sora’s announcement page showing off what it can do, and they’re stunning—in the worst way. Sora can generate animated content, such as a “short fluffy monster kneeling beside a melting red candle,” or “a cartoon kangaroo disco dances.” While the results wouldn’t match the quality of, say, Pixar or DreamWorks, they largely look professional (and some definitely look better than others). I doubt many people would guess at first glance that no humans were involved in the process.

But while its animation potential is concerning enough, it’s the realistic videos that are downright terrifying. OpenAI showed off “drone footage” of a historic church on the Amalfi coast, a parade of people celebrating Chinese Lunar New Year, and a tracking shot down a snowy street in Tokyo, and I promise you’d assume these videos were real on a first watch. I mean, some of them still don’t seem AI-generated to me, and I know they are.

Even the ones with AI flaws, like warping and shifting assets, could be mistaken for video compression artifacts. There’s a video of puppies playing in the snow, and while there are hiccups you’ll spot once you know it isn’t real, the physics and quality of the image sell the illusion. How are none of these puppies real? They so clearly love the snow. God, are we already living in the Matrix?

How does Sora work?

While we don’t have all the details, OpenAI describes Sora’s core processes in its technical report. First off, Sora is a diffusion model. Like AI image generators, Sora creates a video by beginning with, essentially, a field of static noise, then removing that noise step by step until the result resembles the video you asked for.
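
To make that loop concrete, here’s a minimal toy sketch of diffusion-style sampling in Python. Everything in it is an illustrative assumption: the stand-in noise “predictor” and the simplified update rule are placeholders for Sora’s actual trained network, which OpenAI hasn’t described in that kind of detail.

```python
import numpy as np

# Toy illustration of diffusion sampling: start from pure static and
# iteratively strip the noise away. A real model like Sora uses a trained
# neural network to predict the noise at each step; the "predictor" here
# just nudges the sample toward a known target image so the loop runs.

rng = np.random.default_rng(0)
target = rng.uniform(-1, 1, size=(8, 8))   # pretend this is the "clean" image
x = rng.standard_normal((8, 8))            # start: nothing but static noise

for t in range(50, 0, -1):
    predicted_noise = x - target           # stand-in for the learned predictor
    x = x - predicted_noise / t            # remove a fraction of the noise
    if t > 1:
        x += 0.05 * rng.standard_normal(x.shape)  # DDPM-style re-noising

print("mean distance from target:", np.abs(x - target).mean())
```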

Sora is trained on units of data called patches: these are made by compressing images and videos into a “lower-dimension latent space,” then breaking that representation down further into “spacetime” patches, the units the model actually understands. Each patch encodes information about both the space and time of a chunk of video. Sora generates videos within that latent space, and a decoder maps the result back to pixel space, producing the finished video.
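
As a rough illustration of what “spacetime patches” could look like, here’s a short NumPy sketch that chops a video tensor into patch tokens. The patch sizes are made-up numbers, and the learned compression into latent space is skipped entirely; this only shows the reshaping idea.

```python
import numpy as np

# Sketch: chop a video tensor into "spacetime" patches. Sora reportedly
# first compresses video into a lower-dimensional latent space with a
# learned encoder; that step is skipped here, and raw pixels are
# patchified directly just to show the bookkeeping.

T, H, W, C = 16, 64, 64, 3    # frames, height, width, color channels
pt, ph, pw = 4, 16, 16        # patch extent in time, height, width (made up)

video = np.random.rand(T, H, W, C).astype(np.float32)

# Carve the video into a (T/pt) x (H/ph) x (W/pw) grid of patches...
patches = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
patches = patches.transpose(0, 2, 4, 1, 3, 5, 6)

# ...then flatten each patch into one token-like vector.
tokens = patches.reshape(-1, pt * ph * pw * C)
print(tokens.shape)  # (64, 3072): 64 spacetime patches, each a flat vector
```

OpenAI’s report describes a transformer operating on these patches, much the way GPT models operate on text tokens.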

The company doesn’t confirm where this video and photo data comes from, however. (Curious.) It does say Sora builds on research from its DALL-E and GPT models, using the re-captioning technique from DALL-E 3 to generate highly descriptive captions for its video training data, which helps the model follow users’ text prompts more faithfully.

What else can Sora do?

While it can obviously generate videos from text prompts, OpenAI says Sora can also generate video from still images. Apple researchers are working on a similar type of process with their Keyframer program.

It can also extend an existing video forward or backward in time. OpenAI showed an example of this using a video of a streetcar in San Francisco, adding roughly 15 seconds of new footage to the start in three different ways: each version opens differently, but all three sync up at the end into the same, original clip. The same technique can be used to make “perfect loops,” as well.
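
OpenAI doesn’t explain exactly how the extension works, but a common trick with diffusion models is to clamp the frames you already have at every denoising step, so the newly generated frames are forced to line up with them. That’s an assumption on my part, sketched below with the same kind of toy stand-in predictor as before.

```python
import numpy as np

# Toy sketch of extending a clip backwards in time: treat the known
# ending as fixed, and at every denoising step overwrite those frames
# with the original pixels, so the invented lead-in is forced to sync
# up with them. The "denoise" update is a stand-in, not a real model.

known_end = np.random.default_rng(0).uniform(-1, 1, (8, 4, 4))  # clip we have

def extend_backwards(known_end, new_len, seed):
    rng = np.random.default_rng(seed)
    # A real model would hallucinate plausible earlier frames; our toy
    # "model" needs a concrete target, so we invent one per seed.
    target = np.concatenate([rng.uniform(-1, 1, (new_len, 4, 4)), known_end])
    x = rng.standard_normal(target.shape)          # start from pure noise
    for t in range(50, 0, -1):
        x = x - (x - target) / t                   # stand-in denoising step
        x[-len(known_end):] = known_end            # clamp the known frames
        if t > 1:
            x[:new_len] += 0.05 * rng.standard_normal((new_len, 4, 4))
    return x

a = extend_backwards(known_end, 16, seed=1)
b = extend_backwards(known_end, 16, seed=2)
print(np.allclose(a[-8:], b[-8:]))   # True: both endings match the original
print(np.allclose(a[:16], b[:16]))   # False: the invented openings differ
```

Run it with different seeds and you get different openings that all converge on the identical ending, which is essentially what OpenAI’s three streetcar variations demonstrate.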

OpenAI thinks Sora is perfect for simulating worlds. (Awesome!) It can create video with consistent 3D elements, so people and objects stay in place and interact as they should. Sora doesn’t lose track of people and objects when they leave the frame, and it remembers actions that leave a lasting mark on the “world,” such as someone painting strokes on a canvas. It can also, um, generate Minecraft on the fly, simulating the player while simultaneously generating the world around them.

Sora isn't perfect

To its credit, OpenAI does note Sora’s current weaknesses and limitations. According to the company, the model may struggle to reproduce accurate physics in a “complex scene,” as well as certain cause-and-effect situations. OpenAI gives the example of a video of a person eating a cookie: when you see the cookie afterward, it has no bite mark. Apparently, shattering glass is also tricky to render.

The company also says that Sora may mess up “spatial details” in your prompt (confusing left for right, for example), and might not be able to properly render events happening over time.

You can see some of these limitations in the videos OpenAI displays as examples of Sora’s “mistakes.” For a prompt asking Sora to generate a person running, Sora produces a man running the wrong way on a treadmill; when the prompt asks for archeologists discovering a plastic chair in the desert, the “archeologists” pull a sheet out of the sand, and the chair essentially materializes out of it. (This one is particularly trippy to watch.)

The future isn't now, but it is very soon

If you scroll through Sora’s introduction site, you might have a mini panic attack. But keep in mind: with the exception of the videos OpenAI highlights as mistakes, these are the best videos Sora can produce right now, curated to show off its capabilities.

Sam Altman took to Twitter following the announcement and asked users to send him prompts to run through Sora. He tweeted the results for about eight of them, and I doubt any would have made the announcement page. The first attempt at “A half duck half dragon flies through a beautiful sunset with a hamster dressed in adventure gear on its back” was laughably bad, looking like something out of the first draft of a direct-to-DVD cartoon from the 2000s.

The end result for “two golden retrievers podcasting on top of a mountain,” on the other hand, was confounding: It looks as if someone took stock footage of all the assets and quickly layered them on top of one another. It doesn’t look “real” so much as Photoshopped, which, again, raises the question of what exactly Sora was trained on.

These quick demos actually made me feel a bit better, but only just. I don’t think Sora is yet at the point where it can generate lifelike videos indistinguishable from reality on a whim. There are likely thousands upon thousands of results that OpenAI sifted through before settling on the highlights we see in its announcement.

But that doesn’t mean Sora isn’t terrifying. It won’t take much research or time for it to improve. I mean, this is where AI video generation was just 10 months ago; I wonder what Sora would spit out if given the same prompt.

OpenAI is adamant that it’s taking the proper precautions here: It’s currently working with red teamers on harm-reduction research, and it wants to give Sora-generated content a watermark, similar to its other AI programs, so you can always tell when something was generated with OpenAI’s technology.
