OpenAI’s Sora will be the new rage; folks just await its arrival

Introducing Sora, our text-to-video model. Sora can generate videos up to a minute long while maintaining visual quality and adherence to the user’s prompt.

Update: 2024-02-16 14:07 IST

The prompt given for the above picture (a screenshot from a short video) is: “Tour of an art gallery with many beautiful works of art in different styles.”

Courtesy: OpenAI

HYDERABAD: Sora is the new-age rage and is trending. Even toddlers would take to the OpenAI tool, which creates highly creative videos based on a text description. Cyberspace will soon be swarmed with the imaginative output of people who can give the best descriptive text input to generate the most creative AI videos.

OpenAI, the maker of ChatGPT, which is backed by Microsoft and others, is busy developing the tool for people to use across the globe and across age groups.

OpenAI announced on its website: “Introducing Sora, our text-to-video model. Sora can generate videos up to a minute long while maintaining visual quality and adherence to the user’s prompt. Today, Sora is becoming available to red teamers to assess critical areas for harms or risks. We are also granting access to a number of visual artists, designers, and filmmakers to gain feedback on how to advance the model to be most helpful for creative professionals.

“We’re sharing our research progress early to start working with and getting feedback from people outside of OpenAI and to give the public a sense of what AI capabilities are on the horizon.”

OpenAI disclosed through its website that “Sora is able to generate complex scenes with multiple characters, specific types of motion, and accurate details of the subject and background. The model understands not only what the user has asked for in the prompt, but also how those things exist in the physical world.”

How does it work?

“The model has a deep understanding of language, enabling it to accurately interpret prompts and generate compelling characters that express vibrant emotions. Sora can also create multiple shots within a single generated video that accurately persist characters and visual style,” according to OpenAI.

For now, the model has its own problems, and OpenAI is addressing them. The website said: “The current model has weaknesses. It may struggle with accurately simulating the physics of a complex scene, and may not understand specific instances of cause and effect. For example, a person might take a bite out of a cookie, but afterward, the cookie may not have a bite mark.

“The model may also confuse spatial details of a prompt, for example, mixing up left and right, and may struggle with precise descriptions of events that take place over time, like following a specific camera trajectory.”

OpenAI explained the safety precautions it is taking to prevent fake and deepfake videos. The company is researching and analysing the tool extensively, refining its understanding of language so that the output can be tailored to users’ precise expectations.

Safety steps being incorporated

OpenAI disclosed: “We’ll be taking several important safety steps ahead of making Sora available in OpenAI’s products. We are working with red teamers (domain experts in areas like misinformation, hateful content, and bias) who will be adversarially testing the model.

“We’re also building tools to help detect misleading content such as a detection classifier that can tell when a video was generated by Sora. We plan to include C2PA metadata in the future if we deploy the model in an OpenAI product.

“In addition to us developing new techniques to prepare for deployment, we’re leveraging the existing safety methods that we built for our products that use DALL·E 3, which are applicable to Sora as well.

Extreme violence, sexual content, hateful imagery, etc. to be blocked

“For example, once in an OpenAI product, our text classifier will check and reject text input prompts that are in violation of our usage policies, like those that request extreme violence, sexual content, hateful imagery, celebrity likeness, or the IP of others. We’ve also developed robust image classifiers that are used to review the frames of every video generated to help ensure that it adheres to our usage policies, before it’s shown to the user.

“We’ll be engaging policymakers, educators and artists around the world to understand their concerns and to identify positive use cases for this new technology. Despite extensive research and testing, we cannot predict all of the beneficial ways people will use our technology, nor all the ways people will abuse it. That’s why we believe that learning from real-world use is a critical component of creating and releasing increasingly safe AI systems over time,” OpenAI disclosed.

Explaining the research process, the company said: “Sora is a diffusion model, which generates a video by starting off with one that looks like static noise and gradually transforms it by removing the noise over many steps.
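For readers curious about what “removing the noise over many steps” means in practice, the idea can be illustrated with a toy sketch in Python. This is not OpenAI’s code. The function names are made up for illustration, the “video” is just a short list of numbers, and the trained neural network that would normally estimate the noise is replaced by a stand-in that already knows the clean result.

```python
import random


def denoise_step(sample, predicted_noise, fraction):
    """Remove a fraction of the predicted noise from every value."""
    return [s - fraction * n for s, n in zip(sample, predicted_noise)]


def toy_reverse_diffusion(clean, num_steps=50, seed=0):
    """Start from static noise and clean it up a little at each step.

    ASSUMPTION: `clean` stands in for what a trained model would produce.
    A real diffusion model never sees the clean video; it uses a learned
    network to estimate the noise at each step.
    """
    rng = random.Random(seed)
    # Begin with pure random static of the same size as the target.
    sample = [rng.gauss(0.0, 1.0) for _ in clean]
    for step in range(num_steps):
        # Stand-in for the network's noise estimate (hypothetical oracle).
        predicted_noise = [s - c for s, c in zip(sample, clean)]
        remaining = num_steps - step
        # Removing 1/remaining of the noise shrinks the error evenly,
        # so nothing is left after the final step.
        sample = denoise_step(sample, predicted_noise, 1.0 / remaining)
    return sample


if __name__ == "__main__":
    target = [0.0, 0.25, 0.5, 0.75, 1.0]
    print(toy_reverse_diffusion(target))
```

With more steps, each individual correction is smaller; the end result in this toy version is the same, which is why it should be read only as an illustration of the step-by-step cleanup, not of how a real model behaves.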

Entire video is generated

“Sora is capable of generating entire videos all at once or extending generated videos to make them longer. By giving the model foresight of many frames at a time, we’ve solved a challenging problem of making sure a subject stays the same even when it goes out of view temporarily.

“Similar to GPT models, Sora uses a transformer architecture, unlocking superior scaling performance.

“We represent videos and images as collections of smaller units of data called patches, each of which is akin to a token in GPT. By unifying how we represent data, we can train diffusion transformers on a wider range of visual data than was possible before, spanning different durations, resolutions and aspect ratios.”
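The notion of patches can be pictured with a small Python sketch. OpenAI’s statement does not say how the patches are cut, so the version below assumes simple non-overlapping blocks spanning a few frames and a few pixels; the function name and patch sizes are hypothetical and chosen only for illustration.

```python
def video_to_patches(video, patch_t, patch_h, patch_w):
    """Split video[t][y][x] into non-overlapping blocks, each flattened.

    ASSUMPTION: block-shaped patches across time and space; the quoted
    statement does not specify the actual patching scheme.
    """
    frames, height, width = len(video), len(video[0]), len(video[0][0])
    if frames % patch_t or height % patch_h or width % patch_w:
        raise ValueError("video dimensions must be divisible by the patch size")
    patches = []
    for t0 in range(0, frames, patch_t):
        for y0 in range(0, height, patch_h):
            for x0 in range(0, width, patch_w):
                # Flatten one block into a single list, the "token".
                patches.append([
                    video[t][y][x]
                    for t in range(t0, t0 + patch_t)
                    for y in range(y0, y0 + patch_h)
                    for x in range(x0, x0 + patch_w)
                ])
    return patches


def make_video(frames, height, width):
    """Build a dummy video whose pixel value encodes its position."""
    return [[[t * 100 + y * 10 + x for x in range(width)]
             for y in range(height)]
            for t in range(frames)]


if __name__ == "__main__":
    long_clip = video_to_patches(make_video(4, 4, 6), 2, 2, 3)
    short_clip = video_to_patches(make_video(2, 2, 3), 2, 2, 3)
    print(len(long_clip), len(short_clip))
```

In this sketch a longer or larger clip simply yields more patches of the same size, which gives a rough sense of how one representation could cover videos of different durations, resolutions and aspect ratios.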

OpenAI also explained how the model was built:

“Sora builds on past research in DALL·E and GPT models. It uses the recaptioning technique from DALL·E 3, which involves generating highly descriptive captions for the visual training data. As a result, the model is able to follow the user’s text instructions in the generated video more faithfully.

“In addition to being able to generate a video solely from text instructions, the model is able to take an existing still image and generate a video from it, animating the image’s contents with accuracy and attention to small detail. The model can also take an existing video and extend it or fill in missing frames.”

Once the red teamers complete the tasks entrusted to them, Sora will be available for public use.