Chinese-made anime video AI takes off! Customize your 2D "waifu" at will, with gothic, dreamy, and mecha styles in one click
Anime fans, you no longer have to wait for your "waifu" to appear on screen!
Now we can not only produce our own fan content, but also make it move.
AI video generation is a red-hot race right now. The newer, stronger models all follow the scaling-law route, aiming to be large, general-purpose models.
However, output quality still depends on gacha-style luck, to say nothing of the uncanny valley effect in generated live-action videos and the abrupt art-style shifts in generated anime videos.
As with large language models, a model that tries to cover every application struggles to serve the specific characteristics and demands of a single industry.
For an anime fan like this editor in particular, it has been hard to find a suitable model.
After all, an ordinary anime lover who wants to appear in the same frame as a favorite character, or to create a derivative work, can only dream about it without drawing skills.
From scripting and keyframe drawing to rigging and dynamic rendering, the process takes a great deal of time and effort.
Recently, the editor came across "YoYo", a creative site built specifically for anime fans.
With simple operations such as entering a text prompt or uploading a picture, you can generate high-quality, consistent anime content with one click and bring your favorite characters to life in fan videos!
Mainland China site: yoyo.avolutionai.com
International site: yoyo.art
Fan videos in one click
As you can see, YoYo's creation interface is clean and very easy to use.
Most importantly for anime lovers and creators, the anime atmosphere is extremely immersive.
Both prompts and pictures come with a rich library of high-quality material: dozens of popular characters and a variety of styles such as general, flat color, and mecha. It is a true one-stop collection, and an addictive one.
These customization options allow you to control your character's design, storytelling, and even every subtle animation during the generation process.
Without further ado, let's put it to the test.
Falling cherry blossoms, a nod and a smile, and an exquisite background and costume: the atmosphere comes through at once.
prompt: Woman in kimono in a garden full of prints
The burning candles, the flaming eyes, the black lolita, and the eerie atmosphere are very well handled.
prompt: highest quality, masterpiece, illustrations, super detailed, (1 female: 1.2), shoulder-length hair, goth costume, haunted mansion, holding candles, spooky
Next, let's look at its excellent character consistency. (White-hair lovers, rejoice!)
From a dragon-slaying girl:
prompt:1girl ,hair between eyes ,white hair, blue eyes,long hair,no hat,white dress ,elf,pointy ears, fight with a big dragon, sword
To a young girl out for a walk in the woods:
prompt:1girl,white hair,elf,blue eyes,long hair,pointy ears,sitting in river,stars,white dress,pink canvas backpack,taking a walk in the forest
Or an elf princess sitting in the water:
prompt:1girl,white hair,elf,blue eyes,long hair,pointy ears,sitting in river,stars,white dress,sitting quietly on the water
By the way, mixed Chinese-English prompts are also supported.
prompt:1girl,hair between eyes,white hair,blue eyes,long hair,no hat,white dress,elf,pointy ears,waterfall,sit under the waterfall, fold your hands, close your eyes
As the GIFs above show, the AI renders accurate and expressive facial expressions, giving a video of just a few seconds a real sense of story.
The hair, the dandelions, and the skirt all flutter very naturally in the wind.
Prompt: A girl with long purple hair smiling in the wind in a dandelion-covered prairie with auroras twinkling in the sky
The falling snow and the steam rising from the cup remain clearly distinguishable even where they overlap.
prompt: A short-haired girl with a scarf drinking hot tea on a snowy day
A huge mecha stands in the city, and together with the high-rise buildings it makes for a striking scene.
prompt: mech, unmanned, alone, cloud, weapon, sci-fi, glow, sky, holding weapon, building, city
In addition to characters, the generated backgrounds are also very cinematic.
Prompt: A bird's-eye view of the fantastic forest continent, with forest lakes, small towns, and distant mountains
prompt: A quaint town with a lively street market
From now on, however fantastical the scene in our minds, we can bring it to life in animation!
prompt: Fantasy forest on the continent of the forest, bunnies, squirrels, colorful mushrooms
prompt: A snow-white deer with horned plum blossoms stands on the top of a snow-capped mountain and looks into the distance, with a glimmer of light around it
In the "Scenery" section, we can reproduce a favorite scene generated by other users with one click.
After selecting "Extract", the model generates an image in a similar style based on the same prompt.
Then click "Generate Video": a long-haired girl in a JK uniform and a white cat playing the piano. The result is simply gorgeous.
The generation models
At present, AI video generation has two major technical shortcomings: controllability and generation speed.
Most previous models take an image or a text instruction as the generation condition but lack precise, interactive control over the motion in the video. Generation is also very slow, which seriously hurts the user experience of consumer-facing applications.
To address these shortcomings, the Deer Shadow team has long focused on technical research and has published a number of substantive, high-quality papers.
The Motion-I2V paper, published in January this year, proposes an image-to-video framework that generates consistent and controllable video even for complex images.
Address: https://arxiv.org/abs/2401.15977
Previous approaches, such as the AnimateDiff architecture, typically made a single model responsible for both motion modeling and video generation, learning the mapping from image to video directly.
The paper argues that coupling the two can distort motion details and cause temporal inconsistency. Motion-I2V therefore decouples the two processes.
The first stage uses a diffusion-based motion field predictor that conditions on the given image and text prompt and focuses on pixel-level trajectory inference, predicting the motion field that maps the reference frame to every future frame.
In the second stage, a novel motion-augmented temporal layer is proposed to enhance the model's limited one-dimensional temporal attention. This enlarges the temporal receptive field and eases the difficulty of learning spatial and temporal patterns at the same time.
Guided by the trajectories predicted in the first stage, the second-stage model can propagate the features of the given image to the synthesized frames more effectively. Combined with a ControlNet for sparse trajectory control, Motion-I2V also lets users precisely control motion trajectories and motion regions.
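To make the idea of a motion field concrete, here is a minimal sketch of backward warping: each pixel of a future frame looks up its source in the reference frame through a per-pixel displacement. The function name `warp_with_flow` and the toy data are hypothetical, and this is only an illustration on raw pixel values with nearest-neighbour lookup. It is not Motion-I2V's implementation, which propagates learned features inside a diffusion model.

```python
def warp_with_flow(reference, flow):
    """Backward-warp a 2D grid: output[y][x] = reference[y + dy][x + dx].

    reference: list of rows (H x W) of pixel values.
    flow: H x W grid of (dx, dy) displacements pointing from each output
          pixel back to its source in the reference frame.
    Sources outside the image are clamped to the nearest edge pixel.
    """
    height, width = len(reference), len(reference[0])
    output = []
    for y in range(height):
        row = []
        for x in range(width):
            dx, dy = flow[y][x]
            src_x = min(max(int(round(x + dx)), 0), width - 1)
            src_y = min(max(int(round(y + dy)), 0), height - 1)
            row.append(reference[src_y][src_x])
        output.append(row)
    return output


# Toy example: every pixel takes its value from one pixel to the left,
# so the content appears to move one pixel to the right.
reference_frame = [
    [1, 2, 3],
    [4, 5, 6],
]
shift_right = [[(-1, 0)] * 3 for _ in range(2)]
next_frame = warp_with_flow(reference_frame, shift_right)
print(next_frame)
```

Predicting one such field per future frame, all relative to the same reference frame, is what lets every frame draw its content from the original image.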
This approach provides more control over the I2V process than relying on text prompts alone. In addition, the second-stage model natively supports zero-shot video-to-video translation.
Compared with existing methods, Motion-I2V produces more consistent video even under large motion and changing viewpoints.
The demo makes it clear that, compared with models such as Pika and Gen-2, Motion-I2V does produce better motion, with more realistic visual detail.
On the text-to-video side, the AnimateLCM model released in February this year has open-sourced its code and pretrained weights and can generate high-quality animations in only 4 sampling steps. It has been warmly received by the open source community, with more than 60,000 downloads in a single month.
Repository address: https://huggingface.co/wangfuyun/AnimateLCM
The paper points out that although diffusion models generate excellent results, their iterative denoising process takes 30 to 50 steps, which is computationally expensive and time-consuming and limits practical use.
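As a rough back-of-the-envelope illustration of why the step count matters, the sketch below counts forward passes of the denoising network. It assumes one pass per step (two when classifier-free guidance runs separate conditional and unconditional passes) and ignores all other costs such as text encoding and decoding. The helper names are hypothetical, and the ratios are simple arithmetic, not measured speedups.

```python
def denoiser_calls(num_steps, uses_cfg=False):
    """Number of denoising-network forward passes for one sample.

    Assumes one pass per step, doubled when classifier-free guidance
    runs separate conditional and unconditional passes.
    """
    return num_steps * (2 if uses_cfg else 1)


def step_speedup(baseline_steps, fast_steps):
    """Ratio of denoiser calls, ignoring all other per-sample overhead."""
    return denoiser_calls(baseline_steps) / denoiser_calls(fast_steps)


# Compare the 30 to 50 steps of a typical sampler with a 4-step sampler.
for baseline in (30, 50):
    print(baseline, "steps vs 4 steps:", step_speedup(baseline, 4))
```

Under these assumptions, going from 30 to 50 steps down to 4 cuts the number of network evaluations by a factor of 7.5 to 12.5.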
Inspired by the Latent Consistency Model (LCM), the team aims to generate high-quality, realistic video in as few steps as possible.
Paper address: https://arxiv.org/abs/2402.00769
Instead of training directly on raw video data, AnimateLCM distills prior knowledge from a trained Stable Diffusion model. It also adopts a decoupling strategy that separates the image generation prior from the motion generation prior and then inflates the image model to 3D, which improves training efficiency and generation quality.
In addition, so that AnimateLCM works well with the various adapters widely used in the community, the paper proposes an "acceleration" strategy that trains adapters without requiring additional teacher models.
Experiments show that this strategy works. It is compatible with image-conditioned and layout-conditioned adapters, extending the model's functionality without compromising sampling efficiency.
In addition to text-to-video and image-to-video, AnimateLCM can also perform efficient zero-shot video style transfer, or extend a video to up to 4 times the base length with near-perfect consistency.
While AnimateLCM has already achieved great results, the development team didn't stop there and chose to explore further on top of that.
In their latest paper, published in May, the authors point out that the latent consistency model still has some fundamental flaws. The paper investigates the causes of these flaws one by one and proposes an improved Phased Consistency Model (PCM), which achieves significant improvements.
Address: https://arxiv.org/abs/2405.18407
The design limitations of CM and LCM are mainly reflected in three aspects:
1. Controllability: Image and video generation has an important parameter called CFG (classifier-free guidance), which controls how strongly the text prompt influences the result. The higher the CFG value, the more closely the image or video follows the prompt, but the likelier the image is to be distorted.
Stable Diffusion can produce good images over a wide range of CFG values (2~15), but the acceptable CFG value for LCM generally cannot exceed 2, otherwise overexposure occurs.
Being unable to raise the CFG value greatly limits how much control text prompts have over the generated video. LCM is also very insensitive to negative prompts: in the first example below, the model blatantly ignores the prompt's requirement and generates a dog with black fur.
2. Consistency: Both models can only use stochastic multi-step sampling, so even when starting from the same seed, the results at different numbers of inference steps are visibly inconsistent.
3. Efficiency: In addition to the two flaws above, the authors found that LCM cannot produce good results with fewer than 4 steps, which limits sampling efficiency.
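For readers unfamiliar with CFG, the sketch below shows the commonly used formulation: the final noise prediction starts from the unconditional prediction and moves toward the prompt-conditioned one, scaled by the guidance value. The function name `apply_cfg` and the toy numbers are hypothetical, real models apply this to large tensors rather than short lists, and some implementations use a slightly different scale convention.

```python
def apply_cfg(noise_uncond, noise_cond, guidance_scale):
    """Combine unconditional and prompt-conditioned noise predictions.

    guided = uncond + scale * (cond - uncond)
    scale = 1 reproduces the conditional prediction; larger values push
    the result further in the direction the prompt suggests.
    """
    return [
        u + guidance_scale * (c - u)
        for u, c in zip(noise_uncond, noise_cond)
    ]


uncond = [0.0, 1.0, -1.0]   # toy values, not real model outputs
cond = [0.5, 1.0, -2.0]

for scale in (1.0, 2.0, 7.5):
    print(scale, apply_cfg(uncond, cond, scale))
```

Note how the values move further from both inputs as the scale grows. That extrapolation is one intuition for why a high CFG value strengthens prompt adherence but can also distort the image.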
The architecture design of PCM addresses the three flaws above:
An adversarial loss is introduced in the latent space to keep the generated distribution consistent, which greatly improves generation quality in few-step inference.
With these targeted solutions in place, videos generated by PCM in 1 to 4 inference steps are significantly better than those from LCM. Subsequent ablation experiments also confirm the necessity of these design choices.
From Motion-I2V to AnimateLCM and then to the latest PCM, the Deer Shadow team has pursued breakthroughs through steady iteration. The resulting performance of PCM is evident in both benchmark scores and side-by-side comparisons.
In single-step image generation, PCM beats Stable Diffusion-Turbo on almost all of the 5 metrics across 2 datasets, and its advantage in consistency score is especially marked, rising from SD-Turbo's 0.71 to 0.81.
This advantage remains evident as the number of inference steps increases from 1 to 16. In most cases, the normal ODE solver is the preferable sampling method.
When video generation quality is evaluated quantitatively with three metrics (CLIP score, optical flow estimation, and CLIP consistency), PCM still holds a clear advantage in few-step inference (≤ 4 steps), a large improvement over the two diffusion baselines DDIM and DPM as well as AnimateLCM.
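As an illustration of what a frame-consistency metric measures, the sketch below averages the cosine similarity between embeddings of consecutive frames. This is an assumption about the general shape of such a metric, and the paper's exact definition of CLIP consistency may differ. The embeddings here are toy 2D vectors rather than real CLIP features, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def temporal_consistency(frame_embeddings):
    """Mean cosine similarity between each pair of consecutive frames."""
    pairs = zip(frame_embeddings, frame_embeddings[1:])
    sims = [cosine_similarity(a, b) for a, b in pairs]
    return sum(sims) / len(sims)


# Toy embeddings: one clip never changes, the other flips every frame.
steady_clip = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
flickering_clip = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]

print(temporal_consistency(steady_clip))
print(temporal_consistency(flickering_clip))
```

A clip whose frames stay semantically close scores near 1, while abrupt changes in content or style between frames pull the score down.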
It is worth noting that Deer Shadow Technology's R&D results did not come overnight; its technical innovation has spanned several years of continuous iteration.
For example, FlowFormer, a novel architecture proposed in 2022, ranked first on the Sintel optical flow benchmark at the time, and VideoFlow, a video optical flow estimation framework released in 2023, set new state-of-the-art results on all public benchmarks.
Address: https://arxiv.org/abs/2203.16194
Address: https://arxiv.org/abs/2303.08340
MPI Sintel, an open-source dataset co-developed by researchers at the University of Washington, Georgia Tech, and the Max Planck Institute, is currently one of the most widely used benchmarks for optical flow algorithms. Its samples are highly representative of natural scenes and motion, which makes them extremely challenging for current methods.
In the latest rankings, the VideoFlow series holds three of the top five positions, with ViCo_VideoFlow_MOF ranked first, which shows the technical depth and strength of the Deer Shadow team.
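Optical flow benchmarks such as Sintel are typically scored by average end-point error (EPE): the mean Euclidean distance between each predicted flow vector and the ground truth, where lower is better. The sketch below computes it for a handful of made-up vectors. The function name `average_epe` and the data are hypothetical and have nothing to do with the actual leaderboard numbers.

```python
import math


def average_epe(predicted_flow, true_flow):
    """Mean Euclidean distance between predicted and true (dx, dy) vectors."""
    errors = [
        math.hypot(pu - tu, pv - tv)
        for (pu, pv), (tu, tv) in zip(predicted_flow, true_flow)
    ]
    return sum(errors) / len(errors)


# Toy flow vectors for three pixels (in pixels per frame).
truth = [(1.0, 0.0), (0.0, 2.0), (3.0, 4.0)]
prediction = [(1.0, 0.0), (0.0, 0.0), (0.0, 0.0)]

# Per-pixel errors are 0, 2 and 5, so the average is 7 / 3.
print(average_epe(prediction, truth))
```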
For a long time we have talked up the rise of Chinese animation (guoman), but new works have been slow to arrive and a real breakthrough has remained out of reach.
With AI entering the field, both the current state of animation production and its creative output stand to improve greatly.
For Deer Shadow Technology, the next step is to turn its research results into products quickly and let AI tools help original animation grow exponentially.