Splitting A Recording By Voice
I have been bouncing around ideas on how to make generated content more interesting for a couple of years now. In April 2023, I created a video with generated content, using a green-screen type effect to put a real street scene in the background. In January 2024, I tried emulating the podcasts I love and added a co-host: I had ChatGPT generate a dialog between two characters, and then used a different voice for each character to create a discussion.
Google’s NotebookLM does this REALLY well, SO MUCH better than just creating a two-person dialog. The conversation flows naturally, and you get the “oh yeah?” and “interesting” interjections you would hear in a real conversation. NotebookLM creates a single WAV file. I wanted a separate WAV file for each “speaker”, so I could run each one through NVIDIA’s Audio2Face model. I created a tool that splits a WAV file per speaker using AssemblyAI’s transcription service. If this is useful for anyone, you can check it out here: https://github.com/raudette/speakersplit
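To give a feel for the splitting step, here is a rough Python sketch. It is not the code from speakersplit, just an illustration of the idea. It assumes the transcription step has already produced a list of utterances, each with a speaker label and start/end times in milliseconds (the `speaker`, `start`, and `end` field names are assumptions for this example), and that the input is 16-bit PCM WAV, where zero bytes are silence. Each speaker gets a full-length track that is silent whenever someone else is talking, so the tracks stay in sync with each other.

```python
import wave


def split_by_speaker(frames, frame_rate, frame_width, utterances):
    """Return {speaker: bytes}. Each track is as long as the input and is
    silent (zero bytes) wherever that speaker is not talking."""
    tracks = {}
    total = len(frames)
    for u in utterances:
        # Convert millisecond timestamps to byte offsets on frame boundaries.
        start = min((u["start"] * frame_rate // 1000) * frame_width, total)
        end = min((u["end"] * frame_rate // 1000) * frame_width, total)
        # Create the speaker's silent track the first time we see them.
        track = tracks.setdefault(u["speaker"], bytearray(total))
        if start < end:
            track[start:end] = frames[start:end]
    return {speaker: bytes(track) for speaker, track in tracks.items()}


def split_wav(path, utterances, out_prefix):
    """Read one WAV file and write one WAV file per speaker.
    Assumes uncompressed 16-bit PCM audio."""
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames = src.readframes(params.nframes)
    frame_width = params.sampwidth * params.nchannels
    tracks = split_by_speaker(frames, params.framerate, frame_width, utterances)
    for speaker, data in tracks.items():
        with wave.open(f"{out_prefix}_{speaker}.wav", "wb") as dst:
            dst.setparams(params)
            dst.writeframes(data)
    return sorted(tracks)
```

With a hypothetical utterance list such as `[{"speaker": "A", "start": 0, "end": 2500}, {"speaker": "B", "start": 2500, "end": 4000}]`, calling `split_wav("podcast.wav", utterances, "podcast")` would write `podcast_A.wav` and `podcast_B.wav`, each the same length as the original.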
For an example of how this all comes together, check out this video of two avatars chatting about an internet forum thread on making the perfect coffee: “Brewing Perfection: Inside the World of At-Home Espresso Machines” on YouTube.