Lecture Notes: Building ElevenLabs and the Global Voice AI Revolution

One. Course Details

This is week two of CS 153: Frontier Systems (nicknamed "AI Coachella") at Stanford University, featuring guest speaker Mati Staniszewski, co-founder and CEO of ElevenLabs. Anjani "Anj" Mehta introduces Mati and recalls first discovering ElevenLabs in 2022 as a viral Discord text-to-speech bot, then becoming an early angel investor. Mati, a former Google and Palantir engineer, tells the story of building ElevenLabs from a two-person startup in Europe into one of the fastest-growing AI companies in the world, with over $430 million in annual recurring revenue 36 months after launch.

The lecture covers:

The personal inspiration behind ElevenLabs and the problem of bad international dubbing
How the company pivoted from full AI dubbing to perfecting text-to-speech first
The technical tradeoffs between cascaded and fused audio model architectures
ElevenLabs' product-led growth (PLG) strategy and community-driven development
The company's explosive revenue growth and operational philosophy
Pricing strategy for AI products: value-based vs. cost-based pricing
Safety, security, and ethical challenges in synthetic voice technology
Real-world deployments with governments, enterprises, and creators
The future of on-device speech models and conversational voice agents

Two. Key Learning Takeaways

Start with the smallest, most solvable problem first. ElevenLabs initially aimed to build full AI dubbing but pivoted to perfecting text-to-speech when they realized it was the most immediate pain point for users.
Product-led growth and community feedback are superpowers. Building directly with creators and developers on Discord allowed ElevenLabs to iterate faster and discover use cases they never predicted.
Cascaded architectures are better for reliability; fused architectures win on latency. For enterprise use cases where accuracy and tool calling are critical, separate transcription, LLM, and TTS models remain superior. Fused models will dominate low-latency consumer use cases.
Price based on value delivered, not cost of compute. The most successful AI companies capture roughly 10% of the value they create for customers, rather than charging based on how much it costs to run their models.
Collaboration beats competition in emerging industries. Mati emphasizes that founders should view other startups as potential partners rather than enemies, especially in fast-growing fields like audio AI.
Safety must be baked into models from day one. ElevenLabs has built watermarking, content tracing, and moderation directly into its technology to prevent fraud and abuse.
Small, autonomous teams drive exponential growth. ElevenLabs maintains teams of fewer than ten people with full ownership, allowing it to move faster than much larger competitors.

Three. Course Gold Quotes

"The best way to build a product is to stay as close to your users as possible. They will show you use cases you never would have thought of yourself, 6, 12, or 18 months before anyone else."
"Never start pricing from your costs. Start from the value you deliver to the customer and work backwards. Aim to capture about one-tenth of that value in your pricing."
"You can go further together. What seems like competition today will be a partnership tomorrow. The real competitors are the hyperscalers and legacy companies, not the other startups."
"Voice authentication is not the future. If you can replicate a voice, you should never use it to secure bank accounts or sensitive information."
"Necessity is the mother of invention. When we started, we couldn't afford $6,000 for a patent, so we just focused on innovating faster than anyone else could copy us."
"The biggest breakthrough in voice AI wasn't making voices sound human—it was making them controllable. Only when directors could tell the model exactly how to deliver a line did studios start adopting it."
"We don't see ourselves as just a model company. We want to be the go-to platform for any business that wants to interact with their customers through voice."

Four. Layered Learning Notes

Module 1: Origins and Early Days of ElevenLabs

Mati and his co-founder Piotr (both Polish) were inspired by the terrible state of dubbing in Poland, where all foreign movies are narrated by a handful of monotone male voices.
They left Google and Palantir in 2022 to fix this problem, initially operating between London and Warsaw.
The company originally ran entirely on Discord, building bots to automate internal communication before switching to Slack.
They followed Midjourney's community-driven model, launching a Discord bot that allowed users to generate audio clips for free.
Early user feedback revealed that people wanted voiceover corrections and narration in their own language far more than full dubbing, leading to the pivot to text-to-speech.
The first models were trained on free compute credits from programs like NVIDIA Inception, costing only tens of thousands of dollars. They famously declined to spend $6,000 on a patent, choosing instead to out-innovate competitors.

Module 2: Technical Evolution and Breakthroughs

Early text-to-speech models suffered from two major flaws: they could not accurately replicate voice characteristics, and they could not understand context to deliver natural emotional performances.
ElevenLabs' key innovation was moving away from hardcoding voice parameters (gender, age, accent) and instead letting the model learn abstract representations of voice.
They combined ideas from open-source models like James Betker's Tortoise TTS (built on nights and weekends as a side project) with transformer and diffusion architectures.
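The contrast between hardcoded voice parameters and a learned representation can be illustrated with a toy sketch. Everything here is an assumption for illustration: the lecture does not describe ElevenLabs' actual representation, and the names `HardcodedVoice`, `VoiceEmbedding`, `blend`, and `cosine_similarity` are hypothetical. The point is only that a fixed set of categories cannot express a voice "in between" two others, while a continuous vector can be interpolated and compared.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class HardcodedVoice:
    """Hand-specified voice: only the listed categories can be expressed."""
    gender: str
    age_group: str
    accent: str


# Hypothetical learned representation: a plain vector of floats.
# In a real system a model would produce this from reference audio.
VoiceEmbedding = List[float]


def blend(a: VoiceEmbedding, b: VoiceEmbedding, weight: float) -> VoiceEmbedding:
    """Linear interpolation: weight=0 returns a, weight=1 returns b."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    return [(1 - weight) * x + weight * y for x, y in zip(a, b)]


def cosine_similarity(a: VoiceEmbedding, b: VoiceEmbedding) -> float:
    """Similarity of two voices as the cosine of the angle between vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


if __name__ == "__main__":
    # Two categorical voices are either identical or different, nothing more.
    print(HardcodedVoice("male", "adult", "polish") ==
          HardcodedVoice("male", "adult", "british"))
    # Two embeddings can be mixed and compared on a continuous scale.
    halfway = blend([0.0, 2.0], [1.0, 4.0], 0.5)
    print(halfway, cosine_similarity(halfway, [1.0, 4.0]))
```

The vectors above are made-up numbers; the sketch says nothing about how such embeddings are trained.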
The company expanded its technology stack over time:
- 2022: First high-quality English text-to-speech model
- 2023: Multi-language support, voice cloning, and the voice marketplace
- 2024: Speech-to-text, AI translation, and full AI dubbing (used for Javier Milei's UN speech and Lex Fridman's interviews with world leaders)
- 2025: Real-time conversational voice agents
Cascaded vs. Fused Architectures:
- Cascaded: Separate models for speech-to-text, LLM reasoning, and text-to-speech. Better for reliability, tool calling, and enterprise use cases. Allows debugging each step and applying guardrails.
- Fused: Single end-to-end model that generates speech tokens directly from audio input. Much lower latency (~300 ms) but less reliable and harder to debug. Best for consumer companion use cases.
ElevenLabs is pursuing both approaches, planning to blend them depending on the customer's needs.
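The structural difference between the two architectures can be sketched in a few lines. This is a minimal illustration under stated assumptions, not ElevenLabs' implementation: the three `fake_*` functions are hypothetical stand-ins for real speech-to-text, LLM, and text-to-speech models, and `blocklist_guardrail` is a toy check. What the sketch shows is why a cascaded pipeline is easier to debug and guard: it exposes text between stages, while a fused model is a single opaque call.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Trace:
    """Records every intermediate result so each stage can be inspected."""
    steps: List[str] = field(default_factory=list)

    def log(self, stage: str, value: str) -> None:
        self.steps.append(f"{stage}: {value}")


# Hypothetical stand-ins for real models (assumptions, not real APIs).
def fake_speech_to_text(audio: bytes) -> str:
    return audio.decode("utf-8")      # pretend the audio is already text


def fake_llm(prompt: str) -> str:
    return f"You said: {prompt}"


def fake_text_to_speech(text: str) -> bytes:
    return text.encode("utf-8")       # pretend these bytes are audio


def blocklist_guardrail(text: str) -> str:
    """Toy guardrail: reject text that contains a blocked word."""
    blocked = {"password"}
    if any(word in text.lower() for word in blocked):
        raise ValueError("guardrail rejected text")
    return text


def cascaded_turn(audio: bytes, trace: Trace,
                  guardrail: Callable[[str], str] = blocklist_guardrail) -> bytes:
    """One conversational turn through three separate, inspectable stages."""
    transcript = guardrail(fake_speech_to_text(audio))
    trace.log("stt", transcript)
    reply = guardrail(fake_llm(transcript))
    trace.log("llm", reply)
    audio_out = fake_text_to_speech(reply)
    trace.log("tts", f"{len(audio_out)} bytes")
    return audio_out


def fused_turn(audio: bytes) -> bytes:
    """A fused model is one opaque call: no text to inspect or filter."""
    return fake_text_to_speech(fake_llm(fake_speech_to_text(audio)))


if __name__ == "__main__":
    trace = Trace()
    cascaded_turn(b"hello", trace)
    print(trace.steps)   # every stage's output is visible for debugging
```

In the cascaded version a tool call or a compliance check can be attached at the text boundary between stages; the price is the added latency of running three models in sequence.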

Module 3: Business Model and Explosive Growth

ElevenLabs uses a hybrid business model with roughly 50% of revenue from self-serve PLG and 50% from enterprise customers.
The company crossed $330 million ARR at the end of 2025 and added over $100 million ARR in the first quarter of 2026, reaching over $430 million total.
It employs just 450 people, organized into small, autonomous teams of fewer than ten people with full decision-making authority.
Pricing philosophy: Always price based on the value delivered to the customer, not the cost of running the model. Aim to capture 10% of the value you create.
Forward-deployed engineers work directly with enterprise customers to integrate ElevenLabs' technology into their workflows, creating predictable revenue streams.
The voice marketplace allows creators to license their voices and earn passive income, creating a network effect that attracts more users and voice talent.
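The arithmetic behind this module can be worked through in a short sketch. The ARR and headcount figures are the ones reported in the lecture (both ARR numbers are stated as lower bounds); the customer in the usage example, the function names, and the 50% markup are hypothetical and chosen only to contrast value-based pricing with cost-plus pricing.

```python
def value_based_price(customer_value: float, capture_rate: float = 0.10) -> float:
    """Price as a share of the value the customer receives (10% per the lecture)."""
    if not 0 <= capture_rate <= 1:
        raise ValueError("capture_rate must be between 0 and 1")
    return customer_value * capture_rate


def cost_plus_price(compute_cost: float, margin: float) -> float:
    """Price as compute cost plus a markup: the approach the lecture advises against."""
    return compute_cost * (1 + margin)


# Figures reported in the lecture (USD millions; both ARR values are lower bounds).
ARR_END_2025 = 330
ARR_ADDED_Q1_2026 = 100
EMPLOYEES = 450

arr_q1_2026 = ARR_END_2025 + ARR_ADDED_Q1_2026       # 430
quarterly_growth = ARR_ADDED_Q1_2026 / ARR_END_2025   # roughly 0.30
arr_per_employee = arr_q1_2026 / EMPLOYEES            # roughly 0.96 (USD millions)

if __name__ == "__main__":
    # Hypothetical customer: the product is worth $200,000 a year to them
    # and costs $2,000 a year in compute to serve.
    print(f"value-based price: ${value_based_price(200_000):,.0f}")
    print(f"cost-plus price:   ${cost_plus_price(2_000, margin=0.5):,.0f}")
    print(f"ARR after Q1 2026: ${arr_q1_2026}M "
          f"({quarterly_growth:.0%} growth in one quarter, "
          f"${arr_per_employee:.2f}M per employee)")
```

With these made-up customer numbers the two methods differ by several times, which is the gap the pricing advice is pointing at: cost-plus pricing ignores how much the customer actually gains.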

Module 4: Industry Collaboration and Ecosystem

Mati emphasizes the importance of collaboration over competition in the emerging audio AI space.
He has a close working relationship with Brendan Iribe, CEO of Sesame (another voice AI startup), and the two have angel invested in each other's companies.
Sesame open-sourced its CSM conversational speech model, which many students used for final projects in last year's class.
Anj notes that this collaborative culture is rare in tech but drives faster progress for the entire industry.
The real competition for startups is not other startups but large hyperscalers and legacy technology companies.

Module 5: Safety, Ethics, and Social Impact

Synthetic voice technology creates significant safety risks, including fraud, scams, and impersonation.
ElevenLabs has built three layers of protection:
1. In-model moderation to stop harmful content before it is generated
2. Watermarking and content tracing to identify AI-generated audio
3. A public verification system to confirm if audio was generated by ElevenLabs
The company strongly advises against using voice authentication for banking or sensitive systems.
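The three layers listed above can be sketched as a toy service. This is an illustration under loud assumptions, not ElevenLabs' system: `ToyVoiceService` and its methods are hypothetical, the "audio" is just encoded text, and a SHA-256 registry stands in for content tracing. A real watermark is embedded in the audio signal so it can survive re-encoding or trimming; an exact-match hash like the one below cannot.

```python
import hashlib
from typing import Iterable, Set


class ToyVoiceService:
    """Toy model of moderation, tracing, and public verification."""

    def __init__(self, blocked_phrases: Iterable[str]) -> None:
        self._blocked = [p.lower() for p in blocked_phrases]
        self._registry: Set[str] = set()   # fingerprints of generated audio

    def _moderate(self, text: str) -> None:
        # Layer 1: refuse harmful requests before anything is generated.
        lowered = text.lower()
        for phrase in self._blocked:
            if phrase in lowered:
                raise PermissionError(f"request blocked: contains {phrase!r}")

    def generate(self, text: str) -> bytes:
        self._moderate(text)
        audio = ("AUDIO:" + text).encode("utf-8")   # placeholder for synthesis
        # Layer 2: record a fingerprint so the output can be traced later.
        # (Stand-in for watermarking; exact-match only.)
        self._registry.add(hashlib.sha256(audio).hexdigest())
        return audio

    def was_generated_here(self, audio: bytes) -> bool:
        # Layer 3: a public check anyone could run against a clip.
        return hashlib.sha256(audio).hexdigest() in self._registry


if __name__ == "__main__":
    service = ToyVoiceService(blocked_phrases=["wire the money"])
    clip = service.generate("Welcome to the show")
    print(service.was_generated_here(clip))            # clip came from this service
    print(service.was_generated_here(b"other audio"))  # unknown clip
```

The ordering matters in the sketch: moderation runs before generation, so a blocked request never produces audio and never enters the registry.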
ElevenLabs uses its technology for social good:
- It has helped almost 10,000 people with ALS or throat cancer regain their voices.
- It worked with the Ukrainian government to add voice capabilities to the Diia citizen app, providing critical services to displaced people during the war.

Module 6: Challenges and Future Trends

Key bottlenecks: Hiring exceptional talent, continuing research breakthroughs, and securing enough compute.
Mati notes that limited compute can actually drive innovation by forcing teams to be more efficient.
China and the global AI race: ElevenLabs actively blocks distillation attacks on its models. While Chinese labs are making rapid progress in audio AI, Mati believes Western companies can compete by building trusted brands and ecosystems.
Studio adoption: Hollywood studios are finally starting to adopt AI voice technology now that models are fully controllable. Most use it for scratch work, post-production repairs, and localization rather than replacing lead actors.
On-device models: Mati announced that ElevenLabs would release its first on-device text-to-speech model the day after the lecture. He expects on-device models to keep lagging behind cloud models in capability, but considers them critical for privacy and low-latency use cases.
Five-year vision: ElevenLabs aims to become one of the three to five leading conversational AI platforms, providing the full stack of tools businesses need to interact with customers through voice.

Module 7: Advice for Student Founders

Start small and solve a specific problem that you personally care about.
Stay extremely close to your users and iterate quickly based on their feedback.
Don't worry about patents early on—focus on innovating faster than anyone else.
Build small, autonomous teams with full ownership of their work.
Collaborate with other founders rather than viewing everyone as a competitor.
The final project for CS 153 (the "one-person frontier lab") is an incredible opportunity to build something meaningful with tools that would have required a team of dozens just a few years ago.
Wishing you all an incredible journey exploring the frontier of voice AI. This is one of the most exciting and fast-moving areas of technology right now, and you have access to tools that would have seemed like science fiction just a few years ago. Whether you want to build the next generation of conversational agents, create new forms of audio content, or solve important problems for people around the world, the possibilities are endless. Take advantage of this special time, collaborate with your classmates, and don't be afraid to build something bold. The future of how we communicate with machines—and with each other—is being written right now, and you all have a chance to shape it.

Video Source and Usage Instructions

• Video Title: Stanford CS153 Frontier Systems | Mati Staniszewski from ElevenLabs on The Future of Voice Systems
• Course Series: Stanford CS153 Frontier Systems
• Original Platform: YouTube
• Original Publisher: Stanford
• Original Video URL: https://youtu.be/vfF011ko89o?si=QIiYdBTsriqBUSE-
