For many who grew up in the French-speaking world, the television series “C’est pas sorcier” was a cultural touchstone. Its format, a duo of hosts explaining complex scientific topics in an accessible and entertaining way, made it one of my favorite TV shows as a child. This project grew out of a desire to recreate that engaging, conversational dynamic for my daughter using modern technology. The goal was to use recent advances in generative AI, specifically the multi-speaker text-to-speech (TTS) capabilities of Google’s Gemini models, to programmatically generate similar educational content. The project, titled “It’s Not Artificial,” shows that the approach works, and the early results are promising.
The Concept: A Dynamic Duo Powered by AI 💡
The core concept is to generate a complete audio episode from a set of simple parameters. The process involves creating a script with a dialogue between two distinct personas: a curious inquirer and a subject matter expert, mirroring the dynamic of the original show.
Below are two examples generated by the system, one in English and one in French, on the topic of Artificial Intelligence.
An examination of the underlying code reveals a two-part process.
The Settings: Customizing the Experience ⚙️
One of the most powerful aspects of this project is its flexibility. Beyond NotebookLM, the underlying notebook is designed to be highly customizable, allowing anyone to tailor the generated content. Here are the key settings you can adjust:
AGE: Defines the target age for the content, which influences the complexity of the vocabulary and explanations.
LANG: Sets the language for the transcript and audio (e.g., "English" or "French"). Gemini models support 24 languages automatically.
THEME: Determines the topic of the conversation. You could set this to anything from "Black Holes" to "The History of Pizza."
MINUTES: Specifies the desired length of the audio episode.
SPEAKER_1_NAME & SPEAKER_2_NAME: Customizes the names of the two hosts in the script.
SPEAKER_1_VOICE & SPEAKER_2_VOICE: Selects from a range of 30 pre-built voices for each speaker, allowing for unique vocal combinations.
This level of control makes it possible to generate a virtually unlimited variety of educational audio content.
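As a rough illustration, the settings above could be declared as plain Python constants at the top of the notebook. The setting names come from the list above; the values, the voice names ("Puck" and "Kore", assumed to be among the pre-built voices), and the check_settings helper are hypothetical and not taken from the original notebook.

```python
# Hypothetical example values; only the setting names come from the article.
AGE = 6
LANG = "English"
THEME = "Artificial Intelligence"
MINUTES = 3
SPEAKER_1_NAME = "Fred"
SPEAKER_2_NAME = "Jamy"
SPEAKER_1_VOICE = "Puck"  # assumed to be a valid pre-built voice name
SPEAKER_2_VOICE = "Kore"  # assumed to be a valid pre-built voice name


def check_settings(age, minutes, name_1, name_2, voice_1, voice_2):
    """Return a list of human-readable problems (empty when the settings look sane)."""
    problems = []
    if age <= 0:
        problems.append("AGE must be a positive number")
    if minutes <= 0:
        problems.append("MINUTES must be a positive number")
    # Voices are assigned by speaker name, so the two names must differ.
    if name_1.strip().lower() == name_2.strip().lower():
        problems.append("speaker names must be different")
    if not voice_1 or not voice_2:
        problems.append("both speakers need a voice name")
    return problems


print(check_settings(AGE, MINUTES, SPEAKER_1_NAME, SPEAKER_2_NAME,
                     SPEAKER_1_VOICE, SPEAKER_2_VOICE))
```

Checking the settings before calling the API is a cheap way to catch mistakes such as two hosts sharing the same name, which would make the speaker tags ambiguous.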
The Code: A Two-Step Symphony 🎼
The process is orchestrated in Python using the Google GenAI SDK.
Part 1: Generating the Transcript
The initial step involves generating an audio script. Rather than manual composition, a Gemini model is prompted to create a transcript. The effectiveness of this step hinges on precise prompt engineering, which defines the speaker roles, tone, topic, and target audience.
TRANSCRIPT_PROMPT = f"""
Generate a {MINUTES}-minute transcript in {LANG} about {THEME} for a {AGE}-year-old.
The speaker names are {SPEAKER_1_NAME} and {SPEAKER_2_NAME}.
- {SPEAKER_1_NAME} has a curious mind and asks questions.
- {SPEAKER_2_NAME} is an expert and answers questions.
Strictly follow the format below for the transcript (e.g., no extra sounds, no markdown, ...):
{SPEAKER_1_NAME}: So... what's on the agenda today?
{SPEAKER_2_NAME}: You're never going to guess!
{SPEAKER_1_NAME}: Black holes?
{SPEAKER_2_NAME}: Yes!
"""
# Ask the text model for the dialogue; .text holds the generated transcript.
transcript = client.models.generate_content(
    model=TRANSCRIPT_MODEL,
    contents=TRANSCRIPT_PROMPT,
    # ... configuration ...
).text
Because the prompt spells out each speaker's role, the model generates a natural-sounding conversation that flows logically. Here is an excerpt from a generated transcript:
Fred: So... what's on the agenda today?
Jamy: Today, we're talking about something super smart!
Fred: Ooh, like an owl? Or a dolphin?
Jamy: Even smarter, in a way. We're talking about Artificial Intelligence.
Fred: Arty-fish-all... what now?
Jamy: Artificial Intelligence. Let's call it AI for short. It's like giving a computer or a robot a special brain so it can learn and think.
Fred: A robot brain? Cool! So it can think just like me?
Jamy: Almost! It can think and solve problems, but in a different way. Imagine you have a toy robot. AI is like the magic that makes the robot smart enough to play a game with you.
Fred: It can play games? Like checkers?
Jamy: Exactly! Or it can play chess or even video games. Some AI are so good they can beat the best players in the world.
Fred: Wow! What else can this AI brain do?
Jamy: Lots of things! Do you ever talk to a grown-up's phone or a smart speaker and ask it to play a song or tell you a joke?
...
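Since the prompt asks for a strict "Name: text" format, it can be useful to verify that the model followed it before sending the transcript to the TTS step. The following is a minimal sketch of such a check; parse_transcript is a hypothetical helper that is not part of the original notebook, and it assumes each turn fits on a single line.

```python
import re


def parse_transcript(text, speakers):
    """Split a 'Name: text' transcript into (speaker, utterance) pairs.

    Raises ValueError on any non-empty line that does not start with
    one of the expected speaker tags (e.g. stray markdown or sound cues).
    """
    names = "|".join(re.escape(name) for name in speakers)
    pattern = re.compile(rf"^({names}):\s*(.+)$")
    turns = []
    for number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # blank lines between turns are harmless
        match = pattern.match(line)
        if match is None:
            raise ValueError(f"line {number} has no known speaker tag: {line!r}")
        turns.append((match.group(1), match.group(2)))
    return turns


sample = (
    "Fred: So... what's on the agenda today?\n"
    "Jamy: Today, we're talking about something super smart!\n"
)
for speaker, utterance in parse_transcript(sample, ["Fred", "Jamy"]):
    print(f"{speaker} says: {utterance}")
```

If the check fails, one simple option is to regenerate the transcript rather than try to repair it by hand.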
Part 2: Bringing the Script to Life with Multiple Voices
This is the core of the implementation. The latest Gemini models can generate audio with multiple, distinct speakers in a single API call. The configuration involves defining the speakers and assigning a pre-built voice to each.
response = client.models.generate_content(
    model=TEXT_TO_SPEECH_MODEL,
    contents=f"Read this in {LANG} with a style interesting for a {AGE}-year-old:\n\n{transcript}",
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            multi_speaker_voice_config=types.MultiSpeakerVoiceConfig(
                speaker_voice_configs=[
                    # Each speaker name must match the tag used in the transcript.
                    types.SpeakerVoiceConfig(
                        speaker=SPEAKER_1_NAME,
                        voice_config=types.VoiceConfig(
                            prebuilt_voice_config=types.PrebuiltVoiceConfig(
                                voice_name=SPEAKER_1_VOICE,
                            )
                        ),
                    ),
                    types.SpeakerVoiceConfig(
                        speaker=SPEAKER_2_NAME,
                        voice_config=types.VoiceConfig(
                            prebuilt_voice_config=types.PrebuiltVoiceConfig(
                                voice_name=SPEAKER_2_VOICE,
                            )
                        ),
                    ),
                ]
            ),
        )
    )
)
The model parses the transcript, identifies the speaker tags (e.g., Fred: or Jamy:), and applies the designated voice to the corresponding lines of dialogue. This produces a seamless, conversational audio file without the need for manual audio editing or splicing.
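The snippet above stops at the API response, so here is a minimal sketch of how the audio might be written to disk. It assumes the TTS model returns raw 16-bit mono PCM at 24 kHz and that the bytes sit in the first part's inline_data; both points should be checked against the current Gemini documentation. The save_wav and pcm_duration_seconds helpers are hypothetical names, not part of the original notebook.

```python
import wave


def save_wav(path, pcm, channels=1, rate=24000, sample_width=2):
    """Wrap raw PCM bytes in a WAV container so any audio player can open them."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)  # 2 bytes per sample = 16-bit audio
        wf.setframerate(rate)
        wf.writeframes(pcm)


def pcm_duration_seconds(pcm, channels=1, rate=24000, sample_width=2):
    """Length of the audio in seconds, handy for comparing against MINUTES."""
    return len(pcm) / (rate * channels * sample_width)


# With a real response, the audio bytes are assumed to be located here:
# pcm = response.candidates[0].content.parts[0].inline_data.data
# save_wav("episode.wav", pcm)
# print(f"Episode length: {pcm_duration_seconds(pcm) / 60:.1f} minutes")
```

Comparing the measured duration with the requested MINUTES gives a quick sanity check, since the model treats the target length as a guideline rather than a guarantee.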
The Impact: Beyond Nostalgia 📈
While this project was inspired by a classic television show, its implications extend far beyond nostalgia.
In Education: Consider a paradigm where educational content is highly personalized. A student could select a topic, language, and the personas of their AI tutors. History could be learned through a simulated conversation with historical figures. This level of customization has the potential to make learning more accessible and engaging for a diverse range of students.
In Professional Environments: The applications are equally compelling. Corporate training, for instance, could be transformed from static presentations into interactive, conversational modules. Employee onboarding could feature simulated dialogues with key personnel, and complex technical documentation could be elucidated through an expert-novice conversational format.
The Future is Conversational 💬
This experiment highlights a shift towards more natural and intuitive human-computer interaction. The ability to dynamically generate multi-speaker audio opens a new frontier for automated content creation.
The success of formats like “C’est pas sorcier” demonstrates the educational power of conversation. With tools like Gemini, this principle can be programmatically integrated into our digital experiences, fostering a new generation of dynamic and engaging learning tools. In the future, the project could also generate images and videos to support the episode.