What Is OpenAI’s Sora Video Generator AI? How Does It Work?
It’s no secret that many of us have envisioned a future where AI can generate content on demand. That vision has already come true for images, but photorealistic AI video always seemed like a different frontier, given the enormous computing power required. Yet just over a year after the debut of ChatGPT, OpenAI has unveiled Sora, its new text-to-video model. Here is everything you need to know about it.
What can Sora do?
When people hear “AI video generator”, they might think of simple animations or last year’s viral clip of Will Smith eating spaghetti. Sora is much more than that: it can generate photorealistic videos that look virtually indistinguishable from real footage. Users describe what they want in a text prompt, and the AI generates a matching video.
To better understand Sora’s capabilities, let’s look at an example of a video generated by Sora based on the prompt: “A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage.”
The result looks and feels real. The shadows, lighting, people, and cityscape are rendered almost perfectly, creating a shot that could convince almost anyone scrolling past it online. The model also appears to have tackled the long-standing problem of faces and fine detail: the woman’s face in the video looks like it came straight out of a camera.
How does Sora work?
Unlike large language models (LLMs), which operate on text tokens, Sora uses visual patches as its basic unit, letting it treat many kinds of visual data in a unified way. To handle a video, the model first compresses it into a lower-dimensional latent representation, much like summarizing a book into a few key points. It then cuts that representation into spacetime patches, the visual equivalent of tokens. These patches let Sora understand and generate diverse visual content, including videos of different resolutions, durations, and aspect ratios.
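To make the patch idea concrete, here is a minimal NumPy sketch of the “video → compressed latent → spacetime patches” pipeline. Sora’s actual encoder, patch sizes, and latent dimensions are not public, so the toy average-pooling compressor, the shapes, and the function names below are illustrative assumptions rather than OpenAI’s implementation.

```python
# Illustrative sketch only: Sora's real architecture is not public.
# It shows the general "video -> latent -> spacetime patches" idea;
# the toy pooling encoder and all shapes here are assumptions.
import numpy as np

def encode_to_latent(video, spatial_stride=8):
    """Toy stand-in for a learned video compressor: average-pool each
    frame spatially to obtain a lower-dimensional latent video."""
    t, h, w, c = video.shape
    return video.reshape(
        t, h // spatial_stride, spatial_stride,
        w // spatial_stride, spatial_stride, c
    ).mean(axis=(2, 4))                       # (t, h/8, w/8, c)

def to_spacetime_patches(latent, patch_t=4, patch_hw=2):
    """Cut the latent video into spacetime patches, the visual
    analogue of text tokens in a language model."""
    t, h, w, c = latent.shape
    patches = latent.reshape(
        t // patch_t, patch_t,
        h // patch_hw, patch_hw,
        w // patch_hw, patch_hw, c
    )
    # Group the patch grid first, then flatten each patch to a vector.
    patches = patches.transpose(0, 2, 4, 1, 3, 5, 6)
    return patches.reshape(-1, patch_t * patch_hw * patch_hw * c)

video = np.random.rand(16, 256, 256, 3)   # 16 frames of 256x256 RGB
latent = encode_to_latent(video)          # (16, 32, 32, 3)
tokens = to_spacetime_patches(latent)     # (1024, 48) patch "tokens"
print(tokens.shape)
```

The key point is that the resulting sequence of patch vectors plays the same role for Sora that a sequence of text tokens plays for an LLM, which is what lets one model handle clips of different resolutions and lengths.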
Finally, Sora uses a diffusion transformer: the model is given “noisy” patches and trained to predict what the original, clean version looked like. At generation time it starts from pure noise and removes it step by step, guided by the text prompt, until a coherent video emerges.
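The denoising objective can also be sketched in a few lines. The snippet below uses the standard DDPM diffusion formulation to add noise to a patch sequence and then recover it; the “oracle” denoiser that returns the true noise is a placeholder for Sora’s diffusion transformer, which in the real system must predict that noise itself, conditioned on the text prompt.

```python
# Conceptual sketch of the diffusion step, not OpenAI's code.
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)       # common DDPM noise schedule
alpha_bar = np.cumprod(1.0 - betas)      # cumulative signal fraction

x0 = rng.random((1024, 48))              # pretend "clean" patch sequence
t = 600                                  # an intermediate noising step
eps = rng.standard_normal(x0.shape)

# Forward process: blend the clean patches with Gaussian noise.
xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def oracle_denoiser(noisy, step, true_eps):
    """Placeholder for the diffusion transformer, which is trained to
    predict the added noise. Here we cheat and return the true noise
    so the reconstruction below is exact."""
    return true_eps

# Given a (perfect) noise prediction, recover the clean patches.
eps_hat = oracle_denoiser(xt, t, eps)
x0_hat = (xt - np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alpha_bar[t])
print(np.abs(x0_hat - x0).max())         # ~0: exact recovery
```

In practice the model’s noise prediction is imperfect, so generation proceeds over many small steps, each one removing a little noise, until the patches decode back into a watchable video.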
Limitations of OpenAI’s Sora
Since the model is still being tested and has not been released publicly, the generated videos show several limitations. In the example of the woman walking through Tokyo, her legs occasionally move unnaturally; generating anatomically consistent hands and legs across frames remains difficult for video models. The same issue appears in other Sora clips, where the AI struggles to reproduce the natural movement of fingers.
Beyond these smaller glitches, the AI sometimes deviates from the prompt entirely or breaks basic physics. When asked to generate a man running on a treadmill, for instance, it produced a clip of the man running in the wrong direction, and in another example dogs appeared out of thin air.
Strict safety protocols required
Given that the AI can create photorealistic videos, there is an obvious concern about threat actors using it to spread misinformation. With elections scheduled in both the USA and India, the tool could be exploited to create misleading videos of political figures. Fortunately, OpenAI hasn’t released the tool to the public yet and is currently in discussions with policymakers, educators, and other stakeholders to explore potential use cases.
Moreover, the company is actively working on establishing stringent safety measures to ensure that when the AI is eventually released to the public, bad actors cannot misuse it to generate misinformation.
What does the future look like?
While we cannot predict the future precisely, if tools like OpenAI’s Sora become mainstream and iron out their bugs, the video-production industry could undergo a significant transformation. Today, for instance, if a company wants a drone shot overlooking Big Sur in California, it has to hire multiple people, including a drone pilot and a field crew.
With Sora, a production house could generate a video of that location that looks indistinguishable from the real thing, potentially making some of those roles redundant. AI systems like this would have a similarly profound impact on the stock footage industry.