What You Need To Know About OpenAI’s New 3D Model-Making AI, Point-E
Jason D. Rowley
‘Twas the week before Christmas, and just when folks felt it was safe to start logging out for the end-of-year holiday season, Bay Area machine learning skunkworks OpenAI launched yet another generative AI platform, this time aimed at rendering 3D objects out of digital noise.
You know Whisper and Ada and Babbage and Curie
Davinci and Codex and DALL-E
But do you recall
The newest OpenAI model of all?
Point-e Makes Its Debut
Just as a person can type out a short text prompt to generate a 2D image with a generative model like DALL-E, they can now do the same with OpenAI's newest model, Point-e, except that the output is a 3D object.
One needs only ask for "a corgi wearing a red Santa hat" and, voila, Point-e manifests an RGB point cloud that indeed resembles what was described in the text prompt.
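For readers who have not worked with one before, an RGB point cloud is just a collection of points that each carry a position and a color. The sketch below is a minimal illustration of that data structure; the ColoredPoint class, the bounding_box helper, and the toy values are hypothetical names made up for this example and are not part of Point-e's code.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class ColoredPoint:
    """One point in an RGB point cloud: a position plus a color."""
    x: float
    y: float
    z: float
    r: float  # color channels are assumed to lie in [0, 1] for this sketch
    g: float
    b: float


def bounding_box(cloud: List[ColoredPoint]) -> Tuple[Vec3, Vec3]:
    """Return the (min corner, max corner) of the axis-aligned box around the cloud."""
    if not cloud:
        raise ValueError("point cloud is empty")
    xs = [p.x for p in cloud]
    ys = [p.y for p in cloud]
    zs = [p.z for p in cloud]
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))


# A toy three-point cloud; a generated one would hold thousands of points.
toy_cloud = [
    ColoredPoint(0.0, 0.0, 0.0, 1.0, 0.0, 0.0),   # a red point at the origin
    ColoredPoint(1.0, 2.0, -1.0, 1.0, 1.0, 1.0),  # a white point
    ColoredPoint(-0.5, 0.5, 3.0, 0.6, 0.4, 0.2),  # a brownish point
]

low, high = bounding_box(toy_cloud)
print("min corner:", low, "max corner:", high)
```

Nothing connects the points to one another, which is why a raw point cloud looks like a loose swarm rather than a solid surface.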
At least, that's one implementation. Point-e can also generate 3D point clouds from 2D images, as well as convert point clouds into 3D meshes, which look more like video game assets than a bunch of points loosely assembled in three-dimensional space. Jupyter notebooks demonstrating these capabilities, along with the text-to-point cloud functionality mentioned earlier, are available on GitHub.
Point-e implements some of the same techniques used by text-to-image models like DALL-E, Stable Diffusion, and Midjourney. It's Gaussian diffusion all the way down.
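To make "Gaussian diffusion" a little more concrete, here is a minimal sketch of the forward (noising) half of the process, in which a clean sample is gradually blended with Gaussian noise. It assumes the linear noise schedule popularized by the original DDPM paper; that schedule, the step count, and the function names are assumptions chosen for illustration, not Point-e's actual settings. The generative model is trained to run this process in reverse.

```python
import math
import random
from typing import List


def linear_beta_schedule(num_steps: int, beta_start: float = 1e-4,
                         beta_end: float = 0.02) -> List[float]:
    """Per-step noise variances that rise linearly from beta_start to beta_end."""
    if num_steps < 2:
        raise ValueError("need at least two diffusion steps")
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas: List[float]) -> List[float]:
    """alpha_bar_t = product of (1 - beta_s) for s <= t: the surviving signal fraction."""
    alpha_bars = []
    running = 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0: List[float], t: int, alpha_bars: List[float],
              rng: random.Random) -> List[float]:
    """Draw x_t from N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
    signal_scale = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    return [signal_scale * v + noise_scale * rng.gauss(0.0, 1.0) for v in x0]


alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
rng = random.Random(0)
clean = [0.5, -0.25, 1.0]  # e.g. the x, y, z coordinates of a single point

slightly_noisy = add_noise(clean, 10, alpha_bars, rng)   # still close to the input
mostly_noise = add_noise(clean, 999, alpha_bars, rng)    # almost pure Gaussian noise
print(slightly_noisy, mostly_noise)
```

By the final step almost none of the original signal remains, so generation can start from pure noise and denoise step by step.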
Text-to-Point Cloud: Under the Hood
Generating a 3D object from a text prompt happens in two primary phases.
First, the text prompt is passed to a custom-tuned version of GLIDE, a text-to-image synthesis model released by OpenAI at the tail end of 2021, which has received additional conditioning on 3D objects. Notably, the version of GLIDE used in Point-e's synthesis stack was trained on data from two sources: 95% came from the original GLIDE training data, and the remaining 5% came from a dataset of "several million" 3D renderings curated and assembled by OpenAI.
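One common way to realize a split like the 95/5 mix described above is to pick the source dataset at random each time a training example is drawn. The sketch below assumes that approach; the function names and the source labels are hypothetical, and the article does not say this is exactly how OpenAI built its training batches.

```python
import random
from collections import Counter


def pick_source(rng: random.Random, render_prob: float = 0.05) -> str:
    """Choose which dataset the next training example is drawn from."""
    return "3d_renders" if rng.random() < render_prob else "original"


def mixture_counts(num_draws: int, rng: random.Random,
                   render_prob: float = 0.05) -> Counter:
    """Tally how often each source is chosen over num_draws draws."""
    return Counter(pick_source(rng, render_prob) for _ in range(num_draws))


counts = mixture_counts(20_000, random.Random(0))
render_share = counts["3d_renders"] / sum(counts.values())
print(f"share of 3D renderings: {render_share:.3f}")
```

Keeping most draws on the original data is a typical way to teach a model a new domain without letting it forget what it already knew.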
Once GLIDE generates a 2D image from the text prompt, the next step in the pipeline is to generate the coordinates of the points in 3D space. This is handled by another diffusion model, which extends the point-voxel diffusion work of Zhou et al. to infer and include RGB color values for each point in the cloud, coupled with a transformer model. The point cloud is then upsampled to add density and smooth over any gaps before either being displayed as-is or getting passed to the mesh generation step. In Point-e’s synthesis pipeline, mesh generation is facilitated by another transformer which builds a signed distance field (SDF) model from the underlying point cloud data, resulting in the relatively smooth 3D mesh models displayed above.
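A signed distance field may be an unfamiliar idea, so here is a minimal sketch using the simplest possible shape, a sphere, whose SDF has a closed form. The sign convention (negative inside, positive outside) and the function names are assumptions for this illustration; Point-e's model learns an SDF from point cloud data rather than using a formula like this one.

```python
import math
from typing import Callable, Tuple

Vec3 = Tuple[float, float, float]


def sphere_sdf(point: Vec3, center: Vec3 = (0.0, 0.0, 0.0),
               radius: float = 1.0) -> float:
    """Signed distance from point to a sphere's surface (negative means inside)."""
    return math.dist(point, center) - radius


def classify(point: Vec3, sdf: Callable[[Vec3], float], tol: float = 1e-9) -> str:
    """Label a point as inside, outside, or on the surface described by sdf."""
    d = sdf(point)
    if abs(d) <= tol:
        return "surface"
    return "inside" if d < 0 else "outside"


for p in [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]:
    print(p, sphere_sdf(p), classify(p, sphere_sdf))
```

The mesh is the set of places where the field crosses zero, which is why an SDF yields a continuous surface where a point cloud alone yields only scattered samples.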
If that explains how Point-e can take a text prompt and turn it into a 3D object mesh, then what about the other implementations of the model? Well, it all depends on where and when sample data enters the synthesis pipeline. For image-to-point cloud applications, Point-e sidesteps the initial text-to-image process with GLIDE. If given a point cloud, generating a mesh does not require GLIDE or the entire point cloud synthesis step.
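The idea that each use case simply enters the same pipeline at a different point can be sketched as a small lookup. The stage names and the stages_to_run function below are illustrative labels invented for this example, not identifiers from the Point-e codebase.

```python
from typing import List

# Illustrative stage labels, listed in the order described in the article.
STAGES = [
    "text_to_image",         # fine-tuned GLIDE
    "image_to_point_cloud",  # point cloud diffusion model
    "upsample_point_cloud",  # adds density and fills gaps
    "point_cloud_to_mesh",   # SDF-based mesh generation
]

# Where each kind of input joins the pipeline.
ENTRY_STAGE = {
    "text": "text_to_image",
    "image": "image_to_point_cloud",
    "point_cloud": "point_cloud_to_mesh",
}


def stages_to_run(input_kind: str, want_mesh: bool = True) -> List[str]:
    """List the stages needed for a given input, skipping everything upstream of it."""
    if input_kind not in ENTRY_STAGE:
        raise ValueError(f"unknown input kind: {input_kind!r}")
    start = STAGES.index(ENTRY_STAGE[input_kind])
    plan = STAGES[start:]
    if not want_mesh:
        # Stop at the point cloud instead of converting it to a mesh.
        plan = [stage for stage in plan if stage != "point_cloud_to_mesh"]
    return plan


for kind in ("text", "image", "point_cloud"):
    print(kind, "->", stages_to_run(kind))
```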
All of this suggests that although Point-e is a somewhat experimental, proof-of-concept release, it could be the foundation for a much more robust 3D asset generation toolkit down the road. Not unlike how 2D image generators are taking at least a certain corner of the art and publishing world by storm, one could imagine Point-e or its successor becoming an integral part of the workflow for professions ranging from video game design to hardware product development and everywhere in between.
That said, realizing a future where one can simply declare what part needs to be 3D printed, or which game character should have a certain appearance, and have a generative AI model produce those results at a level commensurate with a human artist or engineer, well, that still feels a bit like science fiction… for now.
What Comes Next?
OpenAI researchers openly state that text-to-3D object synthesis is "a fairly new area of research," and that Point-e may produce 3D assets of lower quality than the current state of the art. The tradeoff is speed: the model takes significantly less time to produce results on commodity GPU hardware.
Researchers enumerated the system's limitations in slightly greater depth:
"Currently, our pipeline requires synthetic renderings, but this limitation could be lifted in the future by training 3D generators that condition on real-world images. Furthermore, while our method produces colored three-dimensional shapes, it does so at a relatively low resolution in a 3D format (point clouds) that does not capture fine-grained shape or texture."
There are clear areas to improve.
As for what's next for Point-e, it's hard to say. OpenAI has been consistently inconsistent with updating and iterating its more nascent, research-stage models, often opting to release a dramatically improved (but also very different) model months or years later. This is certainly an exciting area, though, with long-run implications whose ultimate outcome is very much to be determined.
Point-e's launch caps off a very productive year at OpenAI. Earlier in 2022, the company made several iterative improvements to its flagship large language model (LLM), GPT-3. OpenAI also released its image generation model, DALL-E, to the general public, making it available via a web interface and API. And in November, OpenAI launched ChatGPT, an uncannily good chatbot that rides on a similar architecture to GPT-3. The company has reportedly begun training GPT-4, which is widely expected to launch in the first quarter of 2023.
There will be plenty more to share about OpenAI’s ongoing developments in the exciting field of generative artificial intelligence in the new year. Oh, and here’s one more corgi for good measure. ☃️
Inline Image Credit: OpenAI