Most video models behave like voicemail. You send a prompt, wait, and a finished clip comes back. It can look stunning, but it can't hold a conversation - by the time it renders, the moment has passed. PikaStream 1.0 is built to close exactly that gap. It is a real-time visual engine that streams personalized avatar video continuously while people talk, so an AI agent can join a live video meeting as a visible, animated participant rather than a name in a list. Released in beta in April 2026, it runs at 24 frames per second with around 1.5 seconds of end-to-end latency on a single GPU - numbers that, taken together, move generated video from "async clip" to "live call." This guide walks through what it is, the systems work that makes it possible, what it can do, how to set it up, where it's useful, what it costs, and the honest limits of the current beta.
What it is - generative video at conversation speed
The key idea is continuous generation. A traditional model produces one finished result after a prompt; PikaStream produces video the whole time a conversation is unfolding. Speech comes in; reasoning and audio generation run in parallel; the avatar streams back out with a stable identity, synchronized mouth movement, and emotionally appropriate reactions - all within roughly a second and a half. The effect is that the agent doesn't appear as a blank tile or a delayed clip. It appears as a dynamic presence, visible to everyone on the call, responsive to the flow of conversation, and able to carry out tasks while it talks. Pair it with a Pika AI Self and the avatar becomes a living digital extension of you, ready to meet, decide, and act.
Most video models generate a clip and call it done. PikaStream generates a participant - one that keeps up with the conversation instead of arriving after it.
Why it matters - from voicemail to FaceTime
The previous-generation ultra-fast Pika model, Pikaformance, needed eight GPUs and took about 4.5 seconds to respond. That's quick for video generation but far too slow for a real conversation - every exchange felt like leaving a voicemail, and the rhythm of dialogue collapsed under the wait. PikaStream 1.0 cuts that to a single H100 and roughly 1.5 seconds of latency, streaming continuously at 24 FPS, and the experience changes character entirely: it starts to feel like FaceTime.
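To put those latency figures in conversational terms, the short sketch below converts them into frames at the stated 24 FPS. The numbers come from the text; the function names are illustrative only and are not part of any Pika tooling.

```python
# Figures quoted in the text: 24 FPS output, ~4.5 s (Pikaformance)
# versus ~1.5 s (PikaStream 1.0) of response latency.
FPS = 24


def frame_interval_ms(fps: int = FPS) -> float:
    """Time available to produce one frame at a steady frame rate."""
    return 1000.0 / fps


def frames_of_delay(latency_s: float, fps: int = FPS) -> int:
    """Number of frames' worth of video that elapse while a reply is pending."""
    return round(latency_s * fps)


if __name__ == "__main__":
    print(f"per-frame budget: {frame_interval_ms():.1f} ms")
    print(f"~4.5 s latency: {frames_of_delay(4.5)} frames of delay")
    print(f"~1.5 s latency: {frames_of_delay(1.5)} frames of delay")
```

At 24 FPS each frame has a budget of about 41.7 ms, and the drop from 4.5 to 1.5 seconds shrinks the gap before a reply from 108 frames to 36.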
The point the team makes is that latency is the product. Not as a benchmark boast, but in the practical sense of whether the tool kills your momentum. The same dynamic played out in coding tools, where raw speed changed how developers behaved; once responses are fast enough, people stop scripting everything up front and start steering in the moment. PikaStream lands in that territory for video - the point where AI stops being something you pause your workflow to use and becomes something that exists inside it.
The architecture - three components, one GPU
PikaStream 1.0 is built around a 9-billion-parameter Diffusion Transformer paired with a custom streaming VAE, fused into a single-GPU inference pipeline that delivers audio-conditioned video at real-time frame rates. It helps to look at the three stages in order.
Input - audio & context
Speech, the text prompt, and per-frame audio tokens enter the pipeline alongside the agent's identity, memory, and workspace context. That context is what lets responses arrive informed rather than generic.
Core - the 9B Diffusion Transformer
The model is a bidirectional DiT trained for maximum quality, then distilled into a causal autoregressive "student" through optimized self-forcing. That distillation is what enables chunk-by-chunk streaming at real-time frame rates instead of waiting for a whole clip. The DiT is audio-conditioned, so the picture responds to the sound as it's produced.
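One common way to express the difference between a bidirectional teacher and a chunk-wise causal student is through the attention mask. The sketch below is an illustration under that assumption, not Pika's implementation: the source does not describe the actual masking, and the chunk size and function names here are hypothetical.

```python
def bidirectional_mask(num_frames: int) -> list[list[bool]]:
    """Teacher-style mask: every frame may attend to every other frame."""
    return [[True] * num_frames for _ in range(num_frames)]


def block_causal_mask(num_frames: int, chunk_size: int) -> list[list[bool]]:
    """Student-style mask for chunk-by-chunk streaming.

    mask[q][k] is True when query frame q may attend to key frame k.
    A frame sees its own chunk and all earlier chunks, never later ones,
    so a chunk can be emitted as soon as it is generated.
    """
    if num_frames < 0 or chunk_size <= 0:
        raise ValueError("num_frames must be >= 0 and chunk_size must be > 0")
    return [
        [k // chunk_size <= q // chunk_size for k in range(num_frames)]
        for q in range(num_frames)
    ]


if __name__ == "__main__":
    # Hypothetical sizes: 6 frames generated in chunks of 2.
    for row in block_causal_mask(num_frames=6, chunk_size=2):
        print("".join("1" if allowed else "0" for allowed in row))
```

The design point is that nothing in a row depends on frames from a later chunk, which is what allows output to start before the whole clip exists.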
Output - FlashVAE streaming decode
Decoding is handled by FlashVAE, a full Transformer-based VAE trained from scratch with its own latent space, which reconstructs video in real time via streaming decoding - reportedly around 441 FPS with about 1.1 GB of VRAM overhead. It builds on Pika's FlashDecoder research, which showed a Transformer-based streaming decoder can match conventional 3D convolutional decoders on reconstruction quality while running more than ten times faster.
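As a rough sanity check on what a decode rate of around 441 FPS means for a 24 FPS stream, the arithmetic can be sketched as follows. It assumes decoding runs sequentially at exactly the reported throughput, which is a simplification; the function names are illustrative.

```python
STREAM_FPS = 24    # output frame rate stated in the text
DECODE_FPS = 441   # reported FlashVAE decode throughput


def decode_time_per_frame_ms(decode_fps: float = DECODE_FPS) -> float:
    """Average decode time for one frame at the given throughput."""
    return 1000.0 / decode_fps


def decode_share_of_budget(stream_fps: float = STREAM_FPS,
                           decode_fps: float = DECODE_FPS) -> float:
    """Fraction of each second spent decoding to sustain the stream."""
    return stream_fps / decode_fps


if __name__ == "__main__":
    print(f"decode time per frame: {decode_time_per_frame_ms():.2f} ms")
    print(f"share of real-time budget: {decode_share_of_budget():.1%}")
    print(f"headroom over real time: {DECODE_FPS / STREAM_FPS:.1f}x")
```

Under these assumptions decoding takes only a few percent of the per-second budget, leaving most of the GPU's time for the Diffusion Transformer.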
Three attention design choices make the live experience hold together. Spatio-temporal self-attention, full across all frames at training and distilled into causal streaming at inference, keeps the avatar's identity stable across long calls. Frame-wise audio cross-attention has each video frame attend only to its temporally aligned audio tokens, which is what keeps lip-sync tight. And per-chunk text conditioning lets prompts be swapped on the fly mid-stream, so motion, expressions, and actions can be steered during live generation.
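Two of these choices are easy to illustrate in a few lines: the frame-wise audio alignment and the per-chunk prompt swap. The sketch below is a simplified model under stated assumptions, not PikaStream's code: it assumes a fixed number of audio tokens per video frame and a prompt schedule keyed by chunk index, and every name in it is hypothetical.

```python
def audio_alignment_mask(num_frames: int,
                         tokens_per_frame: int) -> list[list[bool]]:
    """mask[f][t] is True when video frame f may attend to audio token t.

    Assumes audio tokens are laid out in time order with a fixed count
    per frame, so frame f owns tokens
    [f * tokens_per_frame, (f + 1) * tokens_per_frame).
    """
    if num_frames < 0 or tokens_per_frame <= 0:
        raise ValueError("num_frames must be >= 0 and tokens_per_frame > 0")
    total_tokens = num_frames * tokens_per_frame
    return [
        [t // tokens_per_frame == f for t in range(total_tokens)]
        for f in range(num_frames)
    ]


def prompt_for_chunk(chunk_index: int,
                     prompt_changes: dict[int, str],
                     default: str = "") -> str:
    """Return the prompt in effect for a chunk.

    prompt_changes maps the chunk index at which a new prompt takes
    effect to its text; the most recent change at or before chunk_index
    wins, and `default` applies before the first change.
    """
    active = default
    for start in sorted(prompt_changes):
        if start > chunk_index:
            break
        active = prompt_changes[start]
    return active


if __name__ == "__main__":
    for row in audio_alignment_mask(num_frames=3, tokens_per_frame=2):
        print("".join("1" if allowed else "0" for allowed in row))
    schedule = {0: "neutral, listening", 5: "smile and nod"}
    print([prompt_for_chunk(i, schedule) for i in (0, 4, 5, 9)])
```

Restricting each frame to its own audio tokens gives the mouth shape no way to drift toward earlier or later sounds, and looking the prompt up per chunk lets a change take effect at the next chunk boundary.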
Two things position PikaStream as a real infrastructure layer that other agent apps can plug into, rather than a viral teaser: the measured footprint (24 FPS on one H100 with ~1.1 GB of VAE memory overhead) and a published research note (April 2026) that makes concrete claims about frame rate, latency, decoding speed, lip-sync alignment, and identity consistency.
Capabilities - built for live, identity-stable interaction
PikaStream is not a filter or an avatar overlay; it's a generative model engineered for the demands of live meetings. The capability set reflects that:
- Real-time video presence. A dynamic, animated avatar visible to every participant, generated continuously at 24 FPS as the call unfolds.
- Voice cloning. Record a short sample with the `clone-voice` subcommand (with an optional noise-reduction flag) and your agent speaks in your voice, not generic text-to-speech.
- On-demand avatar generation. Describe an avatar with `generate-avatar` and PikaStream uses OpenAI image models to produce it, or pass `--image` to use your own asset.
- Persistent memory & identity. The agent retains who it is, who it knows, and what was discussed - session to session, week to week.
- Agentic task execution. It doesn't just talk; it pulls data, updates documents, and schedules actions live during the call without breaking conversational flow.
- Workspace context awareness. Before joining, it synthesizes your identity, recent activity, and known contacts into the system prompt.
- Expressive natural gestures. Tight lip-sync plus appropriate emotional reactions, eye contact, and facial cues make it read as a participant.
- Post-meeting notes. When the call ends, it summarizes decisions, who said what, and action items, and shares them automatically.
- Agent-agnostic. It works with any agent that can read markdown instructions and run scripts - including Claude, OpenClaw, and custom agents you build.
Setup & commands - production-ready in minutes
PikaStream is delivered as a Skill through the open-source Pika-Skills repository on GitHub. Clone it, configure a voice and avatar, and your agent can join its first Google Meet within minutes. The flow looks like this:
# 1. Clone the Pika-Skills repo and enter the skill directory
$ git clone https://github.com/Pika-Labs/Pika-Skills.git
$ cd Pika-Skills/pikastream-video-meeting
# 2. Set your developer API key
$ export PIKA_API_KEY="sk-pika-..."
# 3. Optional: clone your voice from a sample
$ ./pikastream clone-voice --audio my-voice.wav --name "alex-voice" --denoise
# 4. Optional: generate an avatar from a description
$ ./pikastream generate-avatar --prompt "warm, professional, 30s" --output ./avatar.png
# 5. Join a Google Meet - your agent shows up live
$ ./pikastream join --meet https://meet.google.com/abc-defg-hij --voice "alex-voice" --image ./avatar.png
The four conceptual steps behind those commands:
- Get a developer key at `pika.me/dev/login`. PikaStream runs an automated balance check before each session and surfaces a secure top-up link if your credits are low.
- Add the PikaStream Skill from `github.com/Pika-Labs/Pika-Skills`. Drop it into your agent runtime and it auto-detects and exposes the meeting interface without manual wiring.
- Configure voice & avatar with `clone-voice` and either `generate-avatar` or your own `--image`.
- Join a meeting by passing a Google Meet URL to `./pikastream join`. Google Meet and the Pika app are supported today; Zoom and FaceTime are announced as coming soon.
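If you drive the CLI from a script rather than typing commands by hand, a thin wrapper that assembles argument lists can help avoid quoting mistakes. The sketch below only builds the commands shown in this guide and does not execute anything. The subcommands and flags are copied from the examples above and have not been checked against the actual CLI; the Python function names are hypothetical.

```python
import shlex

CLI = "./pikastream"  # executable path as written in this guide's commands


def clone_voice_cmd(audio_path: str, name: str, denoise: bool = False) -> list[str]:
    """Arguments for the optional voice-cloning step."""
    cmd = [CLI, "clone-voice", "--audio", audio_path, "--name", name]
    if denoise:
        cmd.append("--denoise")
    return cmd


def generate_avatar_cmd(prompt: str, output_path: str) -> list[str]:
    """Arguments for the optional avatar-generation step."""
    return [CLI, "generate-avatar", "--prompt", prompt, "--output", output_path]


def join_cmd(meet_url: str, voice: str, image_path: str) -> list[str]:
    """Arguments for joining a meeting; only Google Meet URLs are accepted here."""
    if not meet_url.startswith("https://meet.google.com/"):
        raise ValueError("expected a Google Meet URL")
    return [CLI, "join", "--meet", meet_url, "--voice", voice, "--image", image_path]


if __name__ == "__main__":
    steps = [
        clone_voice_cmd("my-voice.wav", "alex-voice", denoise=True),
        generate_avatar_cmd("warm, professional, 30s", "./avatar.png"),
        join_cmd("https://meet.google.com/abc-defg-hij", "alex-voice", "./avatar.png"),
    ]
    for step in steps:
        # shlex.join quotes arguments that contain spaces or shell metacharacters.
        print(shlex.join(step))
```

Passing each list to `subprocess.run` sidesteps shell quoting entirely, which matters once avatar prompts contain spaces or commas.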
Where it fits - live presence, where it matters
The clearest wins are anywhere a meeting needs a face that can both talk and act. A few of the strongest scenarios:
- Delegated meetings & coverage - send your AI Self to a call you can't make; it shows up with your face, voice, and context, takes notes, and kicks off follow-ups.
- Customer-facing support - a real visual presence that answers questions and retrieves account data live, at the price of an automation rather than a human seat.
- Sales discovery & demos - prospects book in for a conversational walkthrough, with draft proposals and follow-ups created during the call.
- Internal standups & reviews - a team agent joins recurring meetings, syncs status across tools, and reports back with context preserved between sessions.
- Localization & global reach - the agent speaks languages you don't, running identical face-to-face conversations across markets.
- Creator-led 1:1s - offer "video calls with your AI Self" as a tier for fans, students, or community members, beyond what one human schedule can hold.
- Education & coaching - course intros, office hours, and language practice with an agent that knows the curriculum and remembers each learner.
- Developer-built agent products - drop PikaStream into any runtime via API and your custom agent gains a meeting-ready face with no rendering infrastructure to maintain.
Pricing - pay only for live minutes
Billing is usage-based by design: the bot is charged only while it's active in a meeting, which keeps short check-ins and 1:1 calls economical and lets longer sessions scale predictably. The beta rate is $0.20 per minute of active participation, with no monthly minimums. Voice cloning, avatar generation, and post-meeting notes are included in that rate. Before each session, the skill runs an automated balance check and surfaces a secure top-up link if you're low on credits.
In practical terms, a five-minute customer check-in costs about $1.00, and a thirty-minute team standup runs about $6.00 - cheaper than the coffee, and more reliable than a junior coverage hire. Because pricing can change during beta, confirm the current rate before you build a budget around it.
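For budgeting, the per-minute arithmetic is simple enough to script. The sketch below uses the beta rate quoted above and assumes cost scales linearly with active minutes; how partial minutes are rounded on an actual invoice is not stated in the source, so treat the results as estimates. The function name is illustrative.

```python
from decimal import Decimal

BETA_RATE_PER_MINUTE = Decimal("0.20")  # USD, beta rate quoted in this guide


def session_cost(active_minutes, rate: Decimal = BETA_RATE_PER_MINUTE) -> Decimal:
    """Estimated cost in USD for the given number of active meeting minutes."""
    if active_minutes < 0:
        raise ValueError("active_minutes must be non-negative")
    # Decimal avoids binary floating-point drift in currency arithmetic.
    return (Decimal(str(active_minutes)) * rate).quantize(Decimal("0.01"))


if __name__ == "__main__":
    print(f"5-minute check-in:  ${session_cost(5)}")
    print(f"30-minute standup:  ${session_cost(30)}")
    # Hypothetical month: twenty 30-minute standups.
    print(f"20 standups/month:  ${session_cost(20 * 30)}")
```

Passing a different `rate` makes it easy to rerun the estimate if the beta price changes.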
Beta status - what to expect today
PikaStream 1.0 is in beta and currently developer-facing. Setup is technical: it runs through GitHub, API keys, and command-line tools, which makes it well-suited to developers and early adopters right now. A broader, consumer-friendly experience inside the Pika app for Pika AI Self users is rolling out, but full consumer parity hasn't been announced. If you're not comfortable with a terminal yet, it's worth tracking the app rollout rather than wiring it up by hand.
Identity & ethics - not a filter, not a deepfake
It's worth being precise about what PikaStream is and isn't. It is not a filter or an avatar overlay applied to a real video stream. The avatar is generated frame by frame by the underlying 9B Diffusion Transformer, conditioned on audio in real time, with identity reference injection for stability. Because the technology can render a convincing face and clone a voice, Pika's terms strictly prohibit using someone else's likeness without permission, and impersonation accounts can be reported to the moderation team. The responsible use is clear: deploy it as yourself or with explicit consent, and be transparent that participants are interacting with an AI presence.
Who should try it now
Because today's experience is developer-facing, the people who'll get the most out of PikaStream 1.0 right now are those comfortable with a little setup - and those with a clear, repeatable use for a live agent.
- Developers building agent products. If you already run an agent runtime, adding a meeting-ready face via the skill or API is a fast way to differentiate, with no rendering infrastructure to operate.
- Founders and small teams who keep missing calls. A delegated AI Self that shows up, takes notes, and starts follow-ups recovers real hours every week.
- Support and sales leaders testing whether a visual, conversational agent can handle first-line calls or discovery at automation pricing rather than a human seat.
- Creators and educators who want to offer 1:1 face-to-face time at a scale a single calendar can't support.
If you're none of those yet and you'd rather not touch a terminal, the smart move is to wait for the consumer experience inside the Pika app, which removes the GitHub-and-API-key friction. Either way, trying a single short call is the fastest way to feel the difference between a clip that renders and a presence that participates - and at roughly a dollar for five minutes, the cost of finding out is small.