The team behind yack! wanted to use Deepgram and computer vision to build a fun and novel project. What came out of it was an automatic video-to-comic generator, which is more complex than you might think. I sat down with Allan Zhang, Andreas Economides, Felix Chippendale, and Tom Grant to ask them about their project.
yack! takes a video and restyles it as a classic comic book using Deepgram's Speech Recognition API and computer vision. The output looks a bit like this:
Once a video is provided, yack! generates a transcript with Deepgram. Then, keyframes in the video are chosen. Each frame is cropped and given some comic book styling, and captions are overlaid as speech bubbles. Finally, each 'tile' is placed in a dynamic SVG element which is rendered on the page.
That's... a lot.
How It Works
The team got Deepgram working within the first hour of building, which freed them to focus on the more complex parts of the project. To make the returned transcript as useful as possible, they used our utterances feature to decide which keyframes to show, and diarization to color the text differently for each detected speaker.
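To make that concrete, here is a minimal sketch of how a transcript could be turned into a tile plan. It assumes a Deepgram response requested with utterances and diarization enabled, so each entry in `results.utterances` carries `start`, `end`, `transcript`, and `speaker`. The `plan_tiles` function, the color palette, and the choice to grab the frame at the midpoint of each utterance are illustrative assumptions, not the team's actual code.

```python
from typing import Dict, List

# Hypothetical palette: the post does not say which colors yack! uses.
SPEAKER_COLORS = ["#d62828", "#1d4e89", "#2a9d8f", "#f4a261"]


def plan_tiles(response: Dict) -> List[Dict]:
    """Turn each utterance into one planned comic tile."""
    tiles = []
    for utt in response["results"]["utterances"]:
        # "speaker" is only present when diarization is enabled.
        speaker = utt.get("speaker", 0)
        tiles.append({
            # Assumption: sample the keyframe halfway through the utterance.
            "frame_time": (utt["start"] + utt["end"]) / 2,
            "caption": utt["transcript"],
            "color": SPEAKER_COLORS[speaker % len(SPEAKER_COLORS)],
        })
    return tiles


# A hand-written stand-in for a real API response.
sample = {"results": {"utterances": [
    {"start": 0.0, "end": 2.0, "transcript": "Hello there.", "speaker": 0},
    {"start": 2.5, "end": 4.5, "transcript": "Hi!", "speaker": 1},
]}}
tiles = plan_tiles(sample)
```

Each planned tile then knows which moment of the video to grab, what its speech bubble says, and which color to use for the speaker.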
Once a keyframe is chosen, computer vision is used to detect the speaker's location in the frame. The frame is then cropped so that faces remain visible, there is enough space for text to be overlaid, and the aspect ratio is roughly maintained. During development, face detection was one of the slowest parts, taking up to 20 seconds, though the team managed to speed this up slightly.
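The cropping constraints can be sketched in a few lines. This is only an illustration under stated assumptions: the face detector is taken to return a box as `(x, y, width, height)` in pixels, and the `zoom`, `aspect`, and `caption_frac` parameters are made-up knobs rather than values from the yack! codebase.

```python
def crop_around_face(frame_w, frame_h, face, aspect=4 / 3, zoom=2.5,
                     caption_frac=0.3):
    """Return a (left, top, right, bottom) crop that keeps the face visible
    and leaves room above it for a speech bubble."""
    fx, fy, fw, fh = face
    # Size the crop from the face, never larger than the frame itself.
    crop_h = min(frame_h, max(fh, round(fh * zoom)))
    crop_w = min(frame_w, max(fw, round(crop_h * aspect)))
    # Center horizontally on the face.
    left = fx + fw / 2 - crop_w / 2
    # Put the top of the face caption_frac of the way down the crop.
    top = fy - caption_frac * crop_h
    # Slide the box back inside the frame if it spills over an edge.
    left = round(min(max(left, 0), frame_w - crop_w))
    top = round(min(max(top, 0), frame_h - crop_h))
    return left, top, left + crop_w, top + crop_h


# A 100x120 face in a 1280x720 frame.
box = crop_around_face(1280, 720, (600, 200, 100, 120))
```

When the frame is too small for the requested size, the clamping wins and the aspect ratio is only approximately kept, which matches the "roughly maintained" behavior described above.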
The style transfer comes next: a set of simple visual tricks that make a real-life image look more comic-like, namely reducing colors, finding edges and making them darker and bolder, and stacking. This was by far the slowest step, accounting for around 60% of the overall processing time. Given more time, this could be done with machine learning.
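Two of those tricks, color reduction and bold edges, can be illustrated without any imaging library. The sketch below works on a tiny image stored as rows of `(r, g, b)` tuples. The function names, the number of color levels, and the edge threshold are assumptions chosen for the example, and a real implementation would almost certainly use an optimized image library rather than Python loops.

```python
def posterize(value, levels=4):
    """Snap a 0-255 channel value to one of `levels` evenly spaced values."""
    step = 255 / (levels - 1)
    return round(round(value / step) * step)


def comic_filter(pixels, levels=4, edge_threshold=60):
    """Reduce the palette and draw black lines where brightness jumps."""
    h, w = len(pixels), len(pixels[0])
    gray = [[sum(p) // 3 for p in row] for row in pixels]
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            # Brightness change toward the right and downward neighbors
            # (the last row and column compare against themselves).
            gx = gray[y][min(x + 1, w - 1)] - gray[y][x]
            gy = gray[min(y + 1, h - 1)][x] - gray[y][x]
            if abs(gx) + abs(gy) > edge_threshold:
                row.append((0, 0, 0))  # edge: draw it as a black line
            else:
                row.append(tuple(posterize(c, levels) for c in pixels[y][x]))
        out.append(row)
    return out


# Two rows, dark on the left and bright on the right.
dark, bright = (90, 80, 100), (250, 240, 230)
image = [[dark, dark, bright, bright], [dark, dark, bright, bright]]
styled = comic_filter(image)
```

In `styled`, the slightly varied input colors collapse into flat regions, and the boundary between the dark and bright halves becomes a black outline.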
Finally, the text is overlaid, and a dynamic SVG is created. The placement of tiles is, in itself, an engineering challenge. The team used a block-claiming algorithm to have tiles 'claim' space on the page.
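The post doesn't spell out how the block-claiming works, so the following is one plausible reading rather than the team's implementation: treat the page as a grid of cells, and let each tile scan row by row for the first spot where every cell it needs is still free, then claim those cells. The `place_tiles` name and the grid representation are assumptions.

```python
def place_tiles(tile_sizes, columns):
    """Give each (width, height) tile a (col, row) position on a grid that is
    `columns` cells wide and grows downward as needed."""
    claimed = set()
    positions = []
    for w, h in tile_sizes:
        if w > columns:
            raise ValueError("tile is wider than the page")
        row = 0
        placed = False
        while not placed:
            for col in range(columns - w + 1):
                cells = {(col + dx, row + dy)
                         for dx in range(w) for dy in range(h)}
                if not cells & claimed:
                    claimed |= cells  # the tile claims its block
                    positions.append((col, row))
                    placed = True
                    break
            row += 1
    return positions


# Four tiles of mixed sizes on a page three cells wide.
layout = place_tiles([(2, 1), (1, 2), (1, 1), (2, 1)], columns=3)
```

Here the first tile takes the top-left, the tall tile takes the right-hand column, and the later tiles drop into the next free rows. A first-fit scan like this can leave gaps, which is part of what makes tile placement an interesting problem.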
Try It Out
The yack! team built a website for users to interact with and a Docker image to create a portable, scalable, and easily-deployable server.
You can try out yack! at yack.ml
If you have any feedback about this post, or anything else around Deepgram, we'd love to hear from you. Please let us know in our GitHub discussions.