In our Built with AssemblyAI series, we showcase developer and hackathon projects, innovative companies, and impressive products created using the AssemblyAI Core Transcription API and/or our Audio Intelligence APIs such as Sentiment Analysis, Auto Chapters, Entity Detection, and more.
This Real-time Speech-to-Image Generation project was built by students at ASU HACKML 2022.
Describe what you built with AssemblyAI.
Using the AssemblyAI Core Transcription API, we are able to reproduce elements of the zero-shot capabilities presented in the DALL-E paper in real time. To accomplish this, we used a much less complex model trained on a summarized version of the training data, the kind of meta-dataset described in the paper Less is More: Summary of Long Instructions is Better for Program Synthesis.
This is only possible because of the robust features in the AssemblyAI API, which allow for seamless integration with the machine learning model and the web interface framework, as well as the corrective language modeling AssemblyAI uses to repair malformed input and isolate sentences as they are spoken.
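To give a sense of the real-time side, here is a minimal Python sketch of streaming microphone audio to AssemblyAI's real-time WebSocket endpoint (the streaming counterpart to the Core Transcription API) and printing the finalized sentences as they arrive. It assumes the v2 real-time endpoint and message format as documented at the time, the `websockets` and `pyaudio` packages, and a placeholder API key; the team's actual server is written in Node.js with Express, as described below.

```python
# Minimal sketch: stream microphone audio to AssemblyAI's real-time API and
# print finalized sentences as they arrive. Assumes the v2 real-time WebSocket
# endpoint and message format; the project's real server is Node.js + Express.
import asyncio
import base64
import json

import pyaudio          # microphone capture
import websockets       # WebSocket client

API_KEY = "YOUR_ASSEMBLYAI_API_KEY"   # assumption: replace with your own key
ENDPOINT = "wss://api.assemblyai.com/v2/realtime/ws?sample_rate=16000"
FRAMES_PER_BUFFER = 3200              # ~200 ms of 16 kHz, 16-bit mono audio


async def transcribe():
    mic = pyaudio.PyAudio().open(
        format=pyaudio.paInt16, channels=1, rate=16000,
        input=True, frames_per_buffer=FRAMES_PER_BUFFER,
    )
    async with websockets.connect(
        ENDPOINT, extra_headers={"Authorization": API_KEY}
    ) as ws:

        async def send_audio():
            # Continuously forward raw PCM from the microphone as base64 JSON.
            while True:
                pcm = mic.read(FRAMES_PER_BUFFER, exception_on_overflow=False)
                await ws.send(json.dumps(
                    {"audio_data": base64.b64encode(pcm).decode("utf-8")}
                ))
                await asyncio.sleep(0.01)

        async def receive_text():
            # FinalTranscript messages are the isolated sentences that would be
            # forwarded to the image-generation model.
            async for message in ws:
                result = json.loads(message)
                if result.get("message_type") == "FinalTranscript":
                    print(result["text"])

        await asyncio.gather(send_audio(), receive_text())


if __name__ == "__main__":
    asyncio.run(transcribe())
```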
What was the inspiration for your product?
This project was inspired by the OpenAI paper Zero-Shot Text-to-Image Generation, which introduces DALL-E, an autoregressive model trained on images broken into segments that are paired with natural language descriptions.
How did you build your project?
There are two major components to this project:
- Real-time audio transcription. For this, we interfaced with the AssemblyAI API. The client-side interface was built with HTML and CSS, and the server runs on Node.js using Express. When a button press is detected on the client side, the server opens a connection to the AssemblyAI real-time API.
- Text-to-image generation. For this, pretrained models are deployed on the local machine, and a PyTorch model runs in parallel with the client and server. Once an asynchronous connection to the AssemblyAI API is established, Selenium passes messages back and forth between the client and the pretrained model as the API returns text transcribed from the audio. Selenium updates the image on the client side while the model also saves images to the local drive (see the sketches after this list).
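The hand-off from a finalized transcript to the locally deployed model might look roughly like the following Python sketch. The `TextToImageModel` wrapper and its checkpoint path are hypothetical stand-ins for the team's pretrained PyTorch model; the sketch only shows a prompt going in and an image being saved to the local drive.

```python
# Sketch of the hand-off from transcription to image generation. The
# TextToImageModel wrapper is hypothetical -- it stands in for the pretrained
# PyTorch model deployed on the local machine.
from pathlib import Path

import torch
from PIL import Image


class TextToImageModel:
    """Hypothetical wrapper around the locally deployed pretrained model."""

    def __init__(self, checkpoint: str,
                 device: str = "cuda" if torch.cuda.is_available() else "cpu"):
        self.device = device
        # Assumption: the checkpoint is a serialized nn.Module.
        self.model = torch.load(checkpoint, map_location=device)
        self.model.eval()

    @torch.no_grad()
    def generate(self, prompt: str) -> Image.Image:
        # Assumption: the model maps a text prompt to an HxWx3 uint8 tensor.
        pixels = self.model(prompt).cpu().numpy()
        return Image.fromarray(pixels)


def on_final_transcript(text: str, model: TextToImageModel,
                        out_dir: Path = Path("generated")) -> Path:
    """Called for every FinalTranscript message; saves the image locally."""
    out_dir.mkdir(exist_ok=True)
    image = model.generate(text)
    path = out_dir / f"{abs(hash(text))}.png"
    image.save(path)
    return path
```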
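And here is a sketch of the Selenium glue that refreshes the image on the client page. The page URL and the `generated-image` element id are assumptions about the HTML/CSS front end; only the Selenium calls themselves are standard. Each time a FinalTranscript arrives, a caller could chain the two sketches with something like `push_image(driver, on_final_transcript(text, model))`.

```python
# Sketch of the Selenium glue that pushes a newly generated image into the
# client page. The URL and element id "generated-image" are assumptions.
import base64
from pathlib import Path

from selenium import webdriver


def open_client(url: str = "http://localhost:3000") -> webdriver.Chrome:
    # Launch a browser window pointed at the client-side interface.
    driver = webdriver.Chrome()
    driver.get(url)
    return driver


def push_image(driver: webdriver.Chrome, image_path: Path) -> None:
    """Swap the <img> element's source for the newly generated image."""
    encoded = base64.b64encode(image_path.read_bytes()).decode("utf-8")
    driver.execute_script(
        "document.getElementById('generated-image').src = arguments[0];",
        f"data:image/png;base64,{encoded}",
    )
```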
What were your biggest takeaways from your project?
Audio transcription tools are getting more impressive all the time! Being able to produce results similar to the state-of-the-art model with significantly less data also implies that the larger models are trained on some amount of superfluous data. Finally, changing the pretraining paradigm can make a big difference.
What's next for Speech-to-Image generation?
There are three main things on the horizon:
- Using a knowledge graph such as ATOMIC as a commonsense grounding to check the semantic correctness of the object associations present in the image during the generation phase, in order to produce better images.
- Associating natural language with a series of images in a vector that can visually describe a sequence of events written in text.
- Generating a video from the resulting image vector.
Learn more about this project here.