Data Science

How we used Stable Diffusion to enable geometry-aware avatar outfit texturing

Re-textured 3D garments using the proposed method and a reference image input.

Intro

Stable Diffusion, ControlNet and similar machine-learning models have been used to generate art and visual content. But they can also be adapted to generate high-quality textures for 3D objects. In this article, we demonstrate how we used Stable Diffusion and ControlNet to generate textures for Ready Player Me avatar outfits.

fig1
Left: generated diffuse and PBR materials from the prompt “Steampunk”; Right: original textures created by a 3D artist.
fig2
Left: original asset, back view; Right: generated diffuse and PBR materials, back view.

By changing the Physically Based Rendering (PBR) materials of a low-poly 3D mesh, the look and feel of the asset can be completely transformed. This technology could be used to:

  • Create many variations of a single asset
  • Stylize assets to match a particular art style
  • Give artists a tool to iterate on their ideas quickly
  • Power a content creation tool for users

Our goal was to create a solution that quickly produces high-quality textures for avatar outfits, requiring little to no manual editing.

ML Problem

Input: 3D mesh with existing UVs + prompt or image
Output: PBR UV textures
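
To make the interface concrete, here is a minimal sketch of the signature we target. The texture_outfit function, the PBRTextures container and its fields are hypothetical illustrations, not part of any released API:

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class PBRTextures:
    """UV-space texture maps produced by the generator (hypothetical container)."""
    diffuse: np.ndarray    # (H, W, 3) base color
    normal: np.ndarray     # (H, W, 3) normal map
    roughness: np.ndarray  # (H, W) roughness
    metallic: np.ndarray   # (H, W) metalness


def texture_outfit(mesh_path: str,
                   prompt: Optional[str] = None,
                   image_prompt_path: Optional[str] = None) -> PBRTextures:
    """Generate PBR UV textures for a mesh that already has UVs.

    Exactly one of `prompt` or `image_prompt_path` is expected.
    """
    raise NotImplementedError  # the rest of the article describes how this is built
```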

Solution

We developed a solution that generates textures directly in UV space from the start, guided by the geometry of the asset. To achieve this, we condition generation on a linear combination of the object-space normal map and the positions of the surface in 3D space. By encoding the 3D mesh into an image for ControlNet conditioning, we allow the model to follow the object's symmetry and understand the positions of pockets, zippers and fabric creases.
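
A minimal sketch of how such a conditioning image can be assembled, assuming the object-space normal map and a per-texel position map have already been baked into the asset's UV layout (for example, in Blender). The file names and the 0.5/0.5 blend weight are illustrative assumptions, not our exact pipeline values:

```python
import numpy as np
from PIL import Image

# Assumed inputs: an object-space normal map and a 3D position map, both already
# baked into the asset's UV layout (file names are illustrative placeholders).
normals = np.asarray(Image.open("object_space_normals.png").convert("RGB"), dtype=np.float32) / 255.0
positions = np.asarray(Image.open("position_map.png").convert("RGB"), dtype=np.float32) / 255.0

# Rescale positions into [0, 1] so both signals share the same range.
positions = (positions - positions.min()) / (positions.max() - positions.min() + 1e-8)

# Linear combination of the two geometry signals. The 0.5/0.5 weighting is an
# assumption; the point is that symmetry and surface details (pockets, zippers,
# creases) stay visible in the blended conditioning image.
alpha = 0.5
conditioning = alpha * normals + (1.0 - alpha) * positions

Image.fromarray((conditioning * 255).astype(np.uint8)).save("controlnet_condition.png")
```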

Example #1: conditioning image and generated texture produced from the prompt “Plants Queen”
Example #2: conditioning image and generated textures from various prompts

To train ControlNet, we created a dataset of ~1,000 existing Ready Player Me assets and annotated them automatically by captioning their renders with BLIP-2. As a result, Stable Diffusion with the trained ControlNet produces textures of similar quality to those in the dataset, and the method generalizes to unseen outfits given a conditioning image.
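
A minimal sketch of the automatic annotation step, assuming rendered images of each asset are available on disk; the checkpoint name and folder layout are illustrative assumptions:

```python
import glob

import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=dtype
).to(device)

# Caption each render; the captions become the text prompts for ControlNet training.
captions = {}
for path in glob.glob("renders/*.png"):
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt").to(device, dtype)
    out = model.generate(**inputs, max_new_tokens=30)
    captions[path] = processor.decode(out[0], skip_special_tokens=True).strip()
```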

Baked shadows and lights

It became evident that the data used for ControlNet training infuses the generated images with certain stylistic biases. With Stable Diffusion v1.5, the generations tend to look like the majority of the existing Ready Player Me textures: for example, dark shadows appear in clothing creases and under the arms. We did not run experiments with alternative datasets of 3D models, but we assume that if a dataset's diffuse textures have no baked-in shadows, a ControlNet trained on it will inherit that bias instead.

Avatars wearing outfits textured entirely using the proposed method.

Image input

For the image input, we used IP-Adapter to augment Stable Diffusion with an image prompt. It works well in combination with ControlNet, allowing us to transfer the concept of the input image onto the UV texture of the garment. However, reproducing the exact style of the input is not always achievable, since the base Stable Diffusion model may have limitations with certain styles and concepts.
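
A minimal sketch of combining the two in diffusers, assuming a ControlNet fine-tuned on our UV conditioning images (the local checkpoint path is a placeholder) together with the publicly available SD 1.5 IP-Adapter weights; the reference image and scale value are illustrative:

```python
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image

# Placeholder path for a ControlNet fine-tuned on UV-space geometry conditioning images.
controlnet = ControlNetModel.from_pretrained("./controlnet-uv-geometry", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# Public IP-Adapter weights for SD 1.5 add image-prompt conditioning.
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")
pipe.set_ip_adapter_scale(0.7)  # how strongly the image prompt influences the result

condition = load_image("controlnet_condition.png")   # UV-space geometry conditioning image
style_ref = load_image("medieval_outfit_photo.png")  # reference image whose concept we transfer

texture = pipe(
    prompt="outfit texture, high quality",
    image=condition,
    ip_adapter_image=style_ref,
    num_inference_steps=30,
).images[0]
texture.save("generated_uv_texture.png")
```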

Example of textures generated from an input image with a medieval outfit.

Limitations

  1. The method does not work perfectly on the outfit’s seams. This typically happens when geometric parts of a 3D asset that sit close together are split into different UV islands; the misalignment then appears around the UV edges.
  2. The method is unsuitable for creating fully coherent textures for assets with UVs that are too fragmented.
  3. Without keywords indicating what type of asset it is (t-shirt/jacket, pants, shoes), the model can get confused and hallucinate the wrong details.
  4. Using image prompts with faces may cause the faces to appear on the garments.
  5. Beautiful, coherent and useful results occur roughly once in every five generations, which is not a very efficient use of compute. This issue could be addressed through model alignment or by other means.
a) Render of pants with generated textures, demonstrating misalignment in the seam regions. b) The original asset created by a 3D artist. c) Generated UV texture. d) Conditioning image demonstrating how the UV islands are separated.

Asset Designer 

A stable version of the method is available in Asset Designer for a limited number of assets, with only diffuse texture generation available.

Asset Designer with an asset textured using a predefined “Medieval” prompt.

Future improvements

Addressing seams and consistency issues

Recently, Xianfang Zeng et al. released a paper introducing Paint3D. The described method uses depth conditioning and several views of the model to generate the desired new look of the 3D object, then re-projects the texture onto the original UVs and completes it with inpainting and refinement steps. In the inpainting step, it uses a technique similar to the one proposed in our method: the authors also trained a ControlNet conditioned on positional encodings embedded in UV space to produce diffuse maps without baked-in lighting. The produced textures inherit this lightless bias from the training set.

Main takeaways

  • More diverse meshes are necessary to train higher-quality ControlNet models that generalize to diverse assets. We had ~1,000 meshes, while Paint3D used ~105K meshes sourced from the Objaverse dataset.
  • Creating a coarse texture using depth-conditioned reprojections should improve coherence, address the seam issues, and help the model generalize better to arbitrary assets.

Real-time performance

To speed up generation, it makes sense to experiment with training custom Latent Consistency Models (LCMs) on UV textures. Using smaller and faster models instead of vanilla Stable Diffusion should also speed up the process significantly, although this requires re-training ControlNet for those models.
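
As a rough illustration of the speed-up path, the publicly available LCM-LoRA for SD 1.5 can be attached to a ControlNet pipeline in diffusers to cut the step count to a handful. This is a sketch that reuses the hypothetical fine-tuned ControlNet checkpoint from the earlier example, not something we have validated on UV textures:

```python
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel, LCMScheduler
from diffusers.utils import load_image

# Placeholder path for the ControlNet fine-tuned on UV-space conditioning images.
controlnet = ControlNetModel.from_pretrained("./controlnet-uv-geometry", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# Swap in the LCM scheduler and attach the public LCM-LoRA distilled for SD 1.5.
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")

# LCM typically needs only 4-8 steps and low classifier-free guidance.
image = pipe(
    prompt="steampunk outfit texture",
    image=load_image("controlnet_condition.png"),
    num_inference_steps=4,
    guidance_scale=1.0,
).images[0]
image.save("fast_uv_texture.png")
```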