
Mishaal Rahman / Android Authority
When the AI-generated “Will Smith eating spaghetti” video went viral a little over 2 years ago, I wasn’t as skeptical as some about the future of AI video generation. I anticipated improvements, but I never imagined the technology would advance so quickly. Indeed, it was just last month that Google rolled out Veo 2, its second-generation AI video generator model, to the public, and the company is already back with a much more impressive model. After making over 25 videos with it, I’m convinced that Google’s Veo 3 is a mind-blowing advancement in AI video generation, for better or worse.
What is Veo 3?
Veo 3 is Google’s state-of-the-art text-to-video generation model. Like Veo 2, Veo 3 creates high-quality videos in a range of subjects and styles, even capturing nuanced object interactions and human expressions. Both models also block “harmful requests and results” and mark their video outputs with an invisible watermark called SynthID.
The Veo 2 model could only produce silent videos, making it more like a high-quality GIF generator. The new Veo 3 model, however, supports native audio generation, putting it leaps and bounds ahead of its predecessor. The new model can not only generate sound effects and ambient noise but also create dialogue that’s synced with the video.

Veo 3 vs Veo 2
While Veo 3 outputs are still limited to short, 8-second video clips, the addition of native audio generation has allowed people to create some truly mind-blowing AI videos that have taken the Internet by storm. I’m sure you’ve seen some of these videos already, but if not, we’ve put together a collection of over 25 videos made by Veo 3 that demonstrate the tool’s prowess and its current limitations. While Veo 3 can be a pain to work with, its low barrier to entry makes it an incredible tool for anyone with enough time to create convincing, life-like videos, and I’m not convinced the world is ready for this.
Veo 3 almost makes it too easy to create realistic videos
If you’ve spent any amount of time on social media in the last few weeks, then you’ve probably seen people argue over whether 100 men could beat 1 gorilla in a fight. It’s become somewhat of a meme, with laypeople and experts alike chiming in on the debate. Some amateur video makers have even created their own simulations of the hypothetical brawl. I wanted to see how easy it would be for me, someone with virtually no 3D animation experience, to make a video showing 100 men take on 1 gorilla.
It was as simple as asking the Gemini chatbot, “Create a video showing 100 men fighting one silverback gorilla.”
Now, I’m sure if you pixel-peep, you can find some errors. Maybe you’ll spot some men or weapons in the background appearing or disappearing randomly, or perhaps you’ll notice that there clearly aren’t 100 men in this 8-second clip. But if you were to simply watch this video on a small smartphone screen, you’d be hard-pressed to find any major issues at a glance.
This video definitely captures the chaotic, fast-paced action that would ensue when 100 men take on 1 gorilla. The sound that Veo 3 generated for the gorilla’s punches had weight to it, making it feel believable. I knew it was an AI-generated video, of course, because I was the one who made it. But when I showed my mom — who was unaware of the memes on social media — this clip, she asked me what movie it was from!
Another video that demonstrates Veo 3’s skills at simulating animal physics is this one:
I asked Gemini to make a video of “a bull rampaging in a shop selling fine china,” and it created a video where, again, if you were to pixel-peep, you’d probably find issues. But at a glance, everything looks strikingly real: the bull’s movement through the shop, the way dishes scatter, and the accompanying sound of them breaking. Most shops selling fine china would probably be more organized, but some out there might indeed look like this.
While I think the “100 men vs. 1 gorilla” video does a decent job showing how Veo 3 handles people, this next example better illustrates its ability to capture the nuances of human expressions. One of our readers asked us to create a video of “a British Parliament debate between two men using the roadman accent,” and I was amazed at how it turned out.
Veo 3 generated strikingly realistic, subtle movements in the (need I remind you, AI-generated) hands and facial muscles of the man on the left as he says, “Do you know what? Blud.” And the way the man on the right moves as he coughs in the other man’s face felt incredibly real.
What I think Veo 3 does best is create realistic videos of utterly unreal situations. Sure, you probably won’t ever see 100 men fight 1 gorilla, a bull rampaging through a fine china shop, or two British Parliamentarians arguing in a roadman accent in your lifetime, but they’re all things that could conceivably happen.
You’ll never see a real video depicting a “hyper-magnified view of a bustling ant colony” where the ants are actually “intricate clockwork robots” building mini skyscrapers from “glowing sugar crystals,” only for this intricate scene to be disrupted by a human finger pressing down from above, culminating in a close-up of a spherical sugar crystal.
And you’ll never see “an asteroid crashing into an ocean of water balloons.”
Both of these videos demonstrate Veo 3’s impressive skill at simulating physics interactions, even when dealing with such outlandish scenarios. To my amateur eyes, these look like they could have been meticulously created by an expert in 3D animation, but they weren’t.
My favorite video to come from my testing is the one where the Google weather froggy mascot pops out of a Google Nest Hub to have a picnic on the kitchen counter. I just love how the frog pops out of the display, quickly lays out a picnic blanket, and waves at the camera. I also love how the camera pans out to show the kitchen, and how accurately Veo 3 rendered the frog’s shadow and reflection on the kitchen counter.
The video isn’t perfect, though, as it strangely doesn’t have any audio. Also, the text on the Nest Hub says “Nesst” instead of “Nest,” showing that Veo 3 still has issues with rendering text accurately in videos. When I asked Veo 3 to redo the video but add sound, the result was much more unsettling: Instead of a cute frog popping out of the display, what looks like a person in a frog costume jumps over the display and starts a picnic on the counter before saying, “Hello, everybody!”
This redo encapsulates some of the issues that Veo 3 still has with consistency and prompt adherence. In fact, this was just one of many examples of Veo 3 struggling to stick to my prompt, despite Google’s stated improvements to prompt adherence. In one sense, I’m almost happy that Veo 3 isn’t perfect; the model won’t always give you what you’re looking for, and that makes it harder for people with malicious intentions to abuse it.
Should we be happy that Veo 3 isn’t perfect?
The videos I shared above, as well as the many great Veo 3-generated videos you may have already seen online, may have given you the impression that Veo 3 always produces excellent results. It can produce them with very little effort, but it still struggles a lot with generating legible text and adhering to the prompt.
For example, when I asked it to create a video of “a woman paragliding from the top of the Eiffel Tower,” it gave me a pretty realistic-looking video of a woman paragliding…next to the Eiffel Tower.
Or when someone asked me to make a video “where someone types a prompt for Google Veo inside Google Veo,” the model churned out a video with unintelligible text on a laptop screen.
The most amusing issue I’ve encountered with Veo 3 is its inability to understand what a ‘bugdroid’ is. Bugdroid is the official nickname of the Android robot mascot, but Veo 3 consistently fails to accurately portray the robot in its generated videos, often creating generic robots with large eyes or bug-like antennas. It’s not like the model refuses to generate videos featuring the Android mascot due to brand safety concerns; it can easily be made to do so if you tell it to make a “green Android bot.”
Speaking of brand safety, it’s nice to see that Veo 3 at least has some basic protections against generating videos featuring famous people. If you ask it to, for example, make a video featuring YouTube star MrBeast, it’ll refuse to do so. If you try to work around this by describing the exact person instead of providing their name, Veo 3 will still refuse to generate the video.
With some really clever prompting, however, Veo 3 can still be coaxed into creating videos featuring famous people, as shown by the several Veo 3-made recreations of Will Smith eating spaghetti already circulating online. This is problematic given how realistic Veo 3’s videos can be. Even videos that don’t feature celebrities can cause a stir, as demonstrated by a fake video of a woman being denied boarding because she wanted to bring her “emotional support kangaroo” onboard.
To combat this, Google has started to put a visible watermark on all Veo 3 videos generated through Gemini. However, there’s an exception: this watermark isn’t placed if you’re a Google AI Ultra subscriber using Flow, Google’s new AI-powered filmmaking tool, to generate videos. The Google AI Ultra plan costs an eye-watering $249.99 per month, which is pricey enough to deter some, but not all, people with malicious intent.
With Flow, you can even guide Veo 3’s output with your own images or AI-generated images. This allows for greater control over the output and enables generating videos that better align with your creative intent, capture your desired aesthetic, and match your characters’ designs. It also opens up new avenues of video creation by allowing Veo 3 to generate expanded scenes, reimagine videos in different styles, remove unwanted objects from videos, and animate drawn characters.

Flow mitigates many of the inherent limitations of using Veo 3 in Gemini, and I can genuinely see it being useful for amateur and even professional filmmakers. But it also circumvents Google’s new visible watermarking policy and only adds an invisible watermark that few platforms support. It also makes it easier to put together longer videos, which means it has a higher potential for abuse.
I don’t think there’s any turning back at this point; AI video generators are here to stay, and they’re just going to keep getting better. Veo 3 is leaps and bounds better than the first text-to-video models, and it’s only been a few years. With how much existing and new video data Google can pull from YouTube, the company will undoubtedly make major improvements in its upcoming Veo 4 model.
See all the videos we made with Veo 3
If you’re curious, here’s every single AI-generated video we made with Veo 3 through Gemini:
- Create a video showing 100 men fighting one silverback gorilla. (Link)
- Can you repeat this, but make it like it was uploaded to Snapchat circa 2018, filmed on an iPhone? (Link)
- Create a cinematic trailer for an imaginary sci-fi movie set on a distant planet with floating cities where the protagonist is secretly the son of the villain. (Link)
- Create an animated video of garden gnomes constructing a futuristic AI supercomputer using CPUs that resemble carrots, potatoes, and broccoli. Show them working in a magical underground lab with glowing circuits and enchanted tools. (Link)
- Make a video showing the Android bugdroid walking down a path by itself, looking at a smartphone it’s holding. That bugdroid gets surprised by a couple of other bugdroids that are hanging out together and invite the first droid to join them. Each bugdroid should be wearing a hat that says “Android Faithful”. (Link)
- A man walking on water. (Link)
- A green Android bot messing with a red Apple and finally eating it. (Link)
- Make a video of 3 humanoid mechs fighting against an army of 20 red colored humanoid mechs via aeriel combat above the skies of Tokyo. All of the combatants are actively fighting and not taking turns or waiting around. The camera is slowly zooming outwards from the action throughout the entire scene. Behind a distant cloud, a silhouette of a flying lizard monster can be seen. (Link)
- A bull rampaging in a shop selling fine china. (Link)
- An influencer announces to the world, via a short form, vertical video, a 100 man vs 1 gorilla showdown while showing the contenders. (Link)
- A woman paragliding from the top of the Eiffel Tower. (Link)
- A surreal F1 race where iconic cars and drivers from different eras compete on a track that morphs through time and space. (Link)
- A hyper-magnified view of a bustling ant colony, but instead of ants, tiny, intricate clockwork robots frantically build and dismantle miniature skyscrapers made of glowing sugar crystals. Suddenly, a colossal, slow-motion ‘finger of doom’ descends from above, casting a giant shadow, accompanied by a booming, distorted ‘THUD’ followed by the sound of crumbling glass and the squeaking, frantic whirring of the robots as everything collapses into sparkling dust. End on a close-up of a single, perfectly spherical sugar crystal landing silently. (Link)
- Create a video of a basketball game between 5 bugdroids in yellow jerseys and 5 bugdroids in white jerseys where one of the yellow bugdroids dribbles the ball from his three point line to his free throw line then back to his three point line then turns around and shoots the ball to the basket. The ball bounces on the rim then hangs in the air then drops in. The bugdroids that shot the ball yells in delight while putting his hands around his throat and his other four teammates dressed in yellow run over to him to celebrate. Bugdroids, by the way, are the term for the Android OS robot mascot. (Link)
- Create a video of a beach filled with people with everyone doing a unique beach activity while the camera pans to the left 180 degrees, then the camera pans right back to the original spot. In the foreground is a person with a surfboard walking left, and the person is eventually out of the frame, but will reappear once the camera pans over him again as it is returning to the first frame of the scene. (Link)
- The froggy mascot popping out of a Google Nest Hub and having his picnic on a kitchen counter. He then greets the camera as it zooms out to reveal the kitchen. (Link)
- Create a video of the Android Bot mascot holding a Google Pixel phone and texting his friend, a red Apple holding an iPhone. The perspective should be over the shoulder of the Android Bot, showing the Pixel phone’s screen with the Google Messages app on screen. The message the Bot is sending should read, “Thanks for getting the message!” As he hits send, the video should show the red Apple receiving the message on his iPhone, and the message should be shown in a green bubble in the Apple Messages app. (Link)
- Make a video inspired by a Bollywood action movie scene that transitions to a musical dance sequence after a few seconds. Afterwards, the action sequence resumes from where it was suspended. All of the actors are of Indian descent and the lead actor is dressed in traditional Indian garbs. Alongside him is his sidekick who is dressed as a call center worker. (Link)
- Create a scene where a black male dressed in Hawaiian clothing is running away from a T-Rex. The two are moving in the direction of the audience. This scene plays for 1.5 seconds and marks the end of segment A of the video. Segment B of the video now begins. The camera zooms in on the man’s face as he looks to the left side of the screen. The scene pans to the left to reveal a close up shot of a white male dressed only in a beach towel and sunglasses. The new character lifts up his glasses slowly and shows a bewildered expression. This section marks segment B of the video and lasts for 4 seconds in total. Segment C of the video starts at this point and the camera now shifts back from the white male back to the black male running from the T-Rex. The camera zooms out and the T-Rex lets out a roar. In the background, a volcano can be seen erupting once the roaring starts. (Link)
- A gamer who is live streaming themselves playing a tactical, turn-based, fantasy-themed Japanese RPG. The player is currently in a battle with a party of 4, one thief, one white mage, one paladin, and one black mage, fighting against a red dragon. The player can be seen in a small square overlay on the bottom right corner of the screen. (Link)
- An asteroid crashing into an ocean of water balloons. (Link)
- Show us a video of NASA astronauts finding life on mars from the perspective of the mission control room. (Link)
- A video where someone types a prompt for Google Veo inside Google Veo. (Link)
- Giants Causeway with the basalt columns rising and sinking into the sea in waves. (Link)
- A war between unicellular and multicellular organisms. Video from the battlefield. (Link)
- A British parliament debate between two men using the roadman accent. (Link)