Learning to Generate Chairs with Convolutional Neural Networks

May 30, 2019 · Ben Ahlbrand
Image credit: Unsplash

For a project during my master’s degree, we implemented the paper Learning to Generate Chairs by Dosovitskiy et al. in PyTorch. The only implementation available was the authors’ in Caffe with Lua, so we set out to create a clean open-source implementation of the work.

This work presents an early differentiable renderer that uses a convolutional neural network to render models from ShapeNet. Trained on a sparse sampling of discrete azimuth and elevation views of each object, it renders a 2D image of the object from a requested viewpoint. The network can also find correspondences between objects of the same type, so interpolating between them is meaningful; in other words, the interpolation is semantically stable.

class 800, phi: 20 degrees, theta: 300 degrees
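To make the network's inputs concrete, here is a minimal sketch of how a request like the one in the caption above could be turned into input vectors. The function name `encode_inputs`, the sine/cosine angle encoding, and the class count of 809 are assumptions made for illustration, not necessarily what our code or the paper does.

```python
import math

def encode_inputs(class_id, phi_deg, theta_deg, num_classes):
    """Build hypothetical (class, view) input vectors for one rendering request."""
    if not 0 <= class_id < num_classes:
        raise ValueError("class_id out of range")
    # One-hot class vector: a single 1.0 at the index of the requested object.
    class_vec = [0.0] * num_classes
    class_vec[class_id] = 1.0
    # Encode each angle as (sin, cos) so that 0 and 360 degrees coincide.
    phi, theta = math.radians(phi_deg), math.radians(theta_deg)
    view_vec = [math.sin(phi), math.cos(phi), math.sin(theta), math.cos(theta)]
    return class_vec, view_vec

# Example: class 800, phi 20 degrees, theta 300 degrees.
# num_classes=809 is an assumed dataset size; adjust it to your data.
class_vec, view_vec = encode_inputs(800, 20, 300, num_classes=809)
```

Encoding angles as sine and cosine pairs avoids the discontinuity at 360 degrees, which matters when the network is asked for views between the sampled ones.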

The work is an abstract differentiable renderer (and, to our knowledge, one of the earliest useful examples): the network is given an object class and a camera viewpoint (azimuth and elevation), and in a single forward pass it renders a 2D image of the corresponding 3D model. The novelty of this paper is that instead of performing rasterization and the rest of the traditional pipeline, you show the network a sufficient number of examples and it can imagine the views in between. The weaknesses seem to be 1) the output resolution of the network and 2) capturing details not shown in the training dataset. The latter is a reasonable shortcoming, since the network has no way of knowing what is underneath the arm of a chair if it was never shown. The image at the top of this post shows renderings at angles never seen by the network, and three more examples are shown below:

class 666, phi: 15 degrees, theta: 250 degrees

We also discovered that the masking this paper uses, where the network learns the mask simultaneously with the rendered result, is nearly useless, as shown in the figures below. If you look carefully you can see some artifacts, but once the images are compressed and visualized it’s honestly difficult to distinguish artifacts produced by the network from those produced by the compression. One of these days I’ll retrain the network at a much higher resolution :)

class 417 - no mask
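The idea of learning an image and a mask at the same time can be sketched in PyTorch roughly as follows. This is a toy illustration, not our architecture or the paper's: the names `ChairGenerator` and `joint_loss`, the layer sizes, the 64x64 output, and the `mask_weight` value of 0.1 are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChairGenerator(nn.Module):
    """Toy two-headed generator: (class, view) -> RGB image and mask."""

    def __init__(self, num_classes, view_dim=4, hidden=256):
        super().__init__()
        self.class_fc = nn.Sequential(nn.Linear(num_classes, hidden), nn.LeakyReLU(0.2))
        self.view_fc = nn.Sequential(nn.Linear(view_dim, hidden), nn.LeakyReLU(0.2))
        self.joint_fc = nn.Sequential(nn.Linear(2 * hidden, 256 * 4 * 4), nn.LeakyReLU(0.2))
        # Shared upsampling trunk: 4x4 -> 8x8 -> 16x16 -> 32x32.
        self.trunk = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        )
        # Two output heads, each upsampling 32x32 -> 64x64.
        self.rgb_head = nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1)
        self.mask_head = nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1)

    def forward(self, class_vec, view_vec):
        # Process class and view separately, then merge into one code.
        h = torch.cat([self.class_fc(class_vec), self.view_fc(view_vec)], dim=1)
        h = self.joint_fc(h).view(-1, 256, 4, 4)
        h = self.trunk(h)
        return self.rgb_head(h), self.mask_head(h)

def joint_loss(rgb_pred, mask_pred, rgb_target, mask_target, mask_weight=0.1):
    """Squared-error loss on the image plus a down-weighted one on the mask."""
    return F.mse_loss(rgb_pred, rgb_target) + mask_weight * F.mse_loss(mask_pred, mask_target)

# Usage with random stand-in data (10 classes, batch of 2).
model = ChairGenerator(num_classes=10)
class_vec = F.one_hot(torch.tensor([3, 7]), num_classes=10).float()
view_vec = torch.rand(2, 4)
rgb, mask = model(class_vec, view_vec)
loss = joint_loss(rgb, mask, torch.rand(2, 3, 64, 64), torch.rand(2, 1, 64, 64))
loss.backward()
```

In this sketch, a "no mask" variant would amount to setting `mask_weight=0` (or dropping `mask_head` entirely) and training on the image term alone.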

More modern work building on this paper (which was quite revolutionary when it was published) can be found in Kaolin, the recently released framework from NVIDIA. This paper also demonstrated that you can interpolate between classes across latent space, which is of course quite common now in GAN and autoencoder papers.
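As a rough illustration of class interpolation, one simple approach is to blend the class input vectors of two objects linearly and render each blend at a fixed viewpoint. The helper `interpolate_classes` below is hypothetical, and the class count of 809 is an assumed value; the class indices are borrowed from the figure captions purely as examples.

```python
def interpolate_classes(class_a, class_b, num_classes, steps):
    """Return `steps` class vectors blending linearly from class_a to class_b."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    blends = []
    for i in range(steps):
        t = i / (steps - 1)  # runs from 0.0 (pure class_a) to 1.0 (pure class_b)
        vec = [0.0] * num_classes
        vec[class_a] += 1.0 - t
        vec[class_b] += t
        blends.append(vec)
    return blends

# Five blends between class 417 and class 666 (num_classes is an assumed value).
frames = interpolate_classes(417, 666, num_classes=809, steps=5)
```

Each vector in `frames` would then be passed to the trained network together with the same view input, giving a sequence of images that morphs from one object into the other.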

Our implementation can be found on my GitHub.