PikaStream 1.0

What is PikaStream 1.0?

PikaStream 1.0 is a real-time visual engine - not a clip generator. Instead of producing a finished video after a prompt, it generates personalized, audio-conditioned avatar video continuously while a conversation happens, so your AI agent shows up in a meeting as a dynamic, animated participant - with a stable identity, tight lip-sync, and the ability to execute tasks live on the call.

The Complete Guide

Everything about PikaStream 1.0

Most video models behave like voicemail. You send a prompt, wait, and a finished clip comes back. It can look stunning, but it can't hold a conversation - by the time it renders, the moment has passed. PikaStream 1.0 is built to close exactly that gap. It is a real-time visual engine that streams personalized avatar video continuously while people talk, so an AI agent can join a live video meeting as a visible, animated participant rather than a name in a list. Released in beta in April 2026, it runs at 24 frames per second with around 1.5 seconds of end-to-end latency on a single GPU - numbers that, taken together, move generated video from "async clip" to "live call." This guide walks through what it is, the systems work that makes it possible, what it can do, how to set it up, where it's useful, what it costs, and the honest limits of the current beta.

What it is - generative video at conversation speed

The key idea is continuous generation. A traditional model produces one finished result after a prompt; PikaStream produces video the whole time a conversation is unfolding. Speech comes in; reasoning and audio generation run in parallel; the avatar streams back out with a stable identity, synchronized mouth movement, and emotionally appropriate reactions - all within roughly a second and a half. The effect is that the agent doesn't appear as a blank tile or a delayed clip. It appears as a dynamic presence, visible to everyone on the call, responsive to the flow of conversation, and able to carry out tasks while it talks. Pair it with a Pika AI Self and the avatar becomes a living digital extension of you, ready to meet, decide, and act.
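The difference between clip-style and continuous generation can be sketched in a few lines of Python. This is only an illustration of the control flow, not Pika's implementation: the function names, the string "frames", and the byte-string audio chunks are all placeholders invented for the example.

```python
from typing import Iterable, Iterator, List


def render_clip(audio_chunks: Iterable[bytes]) -> List[str]:
    """Clip-style generation: consume all audio, then return every frame at once."""
    frames: List[str] = []
    for i, _chunk in enumerate(audio_chunks):
        frames.append(f"frame-for-chunk-{i}")
    return frames  # nothing is visible until the whole clip is finished


def stream_frames(audio_chunks: Iterable[bytes]) -> Iterator[str]:
    """Streaming generation: yield a frame as soon as its audio chunk arrives."""
    for i, _chunk in enumerate(audio_chunks):
        yield f"frame-for-chunk-{i}"  # a viewer could display this immediately


audio = [b"a", b"b", b"c"]                   # stand-in audio chunks (hypothetical)
first_streamed = next(stream_frames(audio))  # available after one chunk
clip = render_clip(audio)                    # available only after all chunks
```

Both paths produce the same frames; what changes is when the first one becomes visible, which is the property the guide is describing.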

Most video models generate a clip and call it done. PikaStream generates a participant - one that keeps up with the conversation instead of arriving after it.

Why it matters - from voicemail to FaceTime

The previous-generation ultra-fast Pika model, Pikaformance, needed eight GPUs and took about 4.5 seconds per response. That's quick for video generation but far too slow for a real conversation - every exchange felt like leaving a voicemail, and the rhythm of dialogue collapsed under the wait. PikaStream 1.0 cuts that to a single H100 and roughly 1.5 seconds of end-to-end latency, streaming continuously at 24 FPS, and the experience changes character entirely: it starts to feel like FaceTime.

The point the team makes is that latency is the product: not as a benchmark boast, but in the practical sense of whether the tool kills your momentum. The same dynamic played out in coding tools, where raw speed changed how developers behaved; once responses are fast enough, people stop scripting everything up front and start steering in the moment. PikaStream lands in that territory for video - the point where AI stops being something you pause your workflow to use and becomes something that exists inside it.

The architecture - three components, one GPU

PikaStream 1.0 is built around a 9-billion-parameter Diffusion Transformer paired with a custom streaming VAE, fused into a single-GPU inference pipeline that delivers audio-conditioned video at real-time frame rates. It helps to look at the three stages in order.

Input - audio & context

Speech, the text prompt, and per-frame audio tokens enter the pipeline alongside the agent's identity, memory, and workspace context. That context is what lets responses arrive informed rather than generic.

Core - the 9B Diffusion Transformer

The model is a bidirectional DiT trained for maximum quality, then distilled into a causal autoregressive "student" through optimized self-forcing. That distillation is what enables chunk-by-chunk streaming at real-time frame rates instead of waiting for a whole clip. The DiT is audio-conditioned, so the picture responds to the sound as it's produced.
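To make the bidirectional-versus-causal distinction concrete, here is a minimal sketch of the two attention patterns over video frames. It assumes a chunk-wise causal mask, a common way to stream autoregressively. The guide does not state PikaStream's actual chunk size or masking scheme, so `chunk_size` and both function names are illustrative assumptions.

```python
from typing import List

Mask = List[List[bool]]


def bidirectional_mask(num_frames: int) -> Mask:
    """Teacher-style attention: every frame may attend to every other frame."""
    return [[True] * num_frames for _ in range(num_frames)]


def chunk_causal_mask(num_frames: int, chunk_size: int) -> Mask:
    """Student-style attention: frame i may attend to frame j only if
    j's chunk is not later than i's chunk (chunk_size is an assumed value)."""
    return [
        [(j // chunk_size) <= (i // chunk_size) for j in range(num_frames)]
        for i in range(num_frames)
    ]


mask = chunk_causal_mask(num_frames=6, chunk_size=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# xx....
# xx....
# xxxx..
# xxxx..
# xxxxxx
# xxxxxx
```

Because no row ever looks at a later chunk, each chunk can be generated and shipped before the next one exists, which is what makes streaming possible.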

Output - FlashVAE streaming decode

Decoding is handled by FlashVAE, a full Transformer-based VAE trained from scratch with its own latent space, which reconstructs video in real time via streaming decoding - reportedly around 441 FPS with about 1.1 GB of VRAM overhead. It builds on Pika's FlashDecoder research, which showed a Transformer-based streaming decoder can match conventional 3D convolutional decoders on reconstruction quality while running more than ten times faster.
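A quick back-of-the-envelope calculation shows why the decoder figure matters. The sketch below uses only the numbers quoted in this guide. It assumes the reported 441 FPS applies on the same hardware as the 24 FPS stream, which the guide does not state explicitly.

```python
STREAM_FPS = 24      # output frame rate quoted in the guide
DECODER_FPS = 441    # reported FlashVAE decode throughput
LATENCY_S = 1.5      # approximate end-to-end latency

frame_budget_ms = 1000 / STREAM_FPS       # time available per streamed frame
decode_cost_ms = 1000 / DECODER_FPS       # time the decoder needs per frame
decode_share = decode_cost_ms / frame_budget_ms
frames_in_latency_window = STREAM_FPS * LATENCY_S

print(f"frame budget: {frame_budget_ms:.1f} ms")                      # 41.7 ms
print(f"decode cost:  {decode_cost_ms:.2f} ms")                       # 2.27 ms
print(f"decode share: {decode_share:.1%}")                            # 5.4%
print(f"frames per latency window: {frames_in_latency_window:.0f}")   # 36
```

Under those assumptions, decoding uses only about a twentieth of each frame's time budget, leaving the rest for the Diffusion Transformer and audio conditioning.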

Three attention design choices make the live experience hold together. Spatio-temporal self-attention, full across all frames at training and distilled into causal streaming at inference, keeps the avatar's identity stable across long calls. Frame-wise audio cross-attention has each video frame attend only to its temporally aligned audio tokens, which is what keeps lip-sync tight. And per-chunk text conditioning lets prompts be swapped on the fly mid-stream, so motion, expressions, and actions can be steered during live generation.
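Two of these choices are easy to picture as data structures. The sketch below builds a frame-wise audio cross-attention mask and a per-chunk prompt lookup. It assumes a fixed number of audio tokens per video frame and a simple "most recent prompt wins" schedule. Neither detail comes from the guide, and every name here is hypothetical.

```python
from typing import Dict, List


def framewise_audio_mask(num_frames: int, tokens_per_frame: int) -> List[List[bool]]:
    """mask[f][t] is True when video frame f may attend to audio token t.
    Each frame sees only the tokens aligned to it (tokens_per_frame is assumed)."""
    num_tokens = num_frames * tokens_per_frame
    return [
        [t // tokens_per_frame == f for t in range(num_tokens)]
        for f in range(num_frames)
    ]


def prompt_for_chunk(chunk_index: int, schedule: Dict[int, str], default: str) -> str:
    """Return the most recent prompt whose start chunk is <= chunk_index."""
    starts = [start for start in schedule if start <= chunk_index]
    return schedule[max(starts)] if starts else default


mask = framewise_audio_mask(num_frames=3, tokens_per_frame=2)
schedule = {2: "nod and smile", 5: "look at the shared screen"}
prompts = [prompt_for_chunk(i, schedule, "neutral listening") for i in range(7)]
```

The mask is block-diagonal, so a frame's mouth shape can only depend on the audio spoken at that instant. The prompt list changes at chunks 2 and 5 without restarting the stream.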

WHY THE NUMBERS MATTER

24 FPS on one H100, with roughly 1.1 GB of VAE memory overhead, is backed by a published research note (April 2026) that makes concrete claims about frame rate, latency, decoding speed, lip-sync alignment, and identity consistency. Together, these position PikaStream as an infrastructure layer other agent apps can plug into - not a viral teaser.

Capabilities - built for live, identity-stable interaction

PikaStream is not a filter or an avatar overlay; it's a generative model engineered for the demands of live meetings. The capability set reflects that:

Real-time video presence. A dynamic, animated avatar visible to every participant, generated continuously at 24 FPS as the call unfolds.
Voice cloning. Record a short sample with the clone-voice subcommand (with an optional noise-reduction flag) and your agent speaks in your voice, not generic text-to-speech.
On-demand avatar generation. Describe an avatar with generate-avatar and PikaStream uses OpenAI image models to produce it, or pass --image to use your own asset.
Persistent memory & identity. The agent retains who it is, who it knows, and what was discussed - session to session, week to week.
Agentic task execution. It doesn't just talk; it pulls data, updates documents, and schedules actions live during the call without breaking conversational flow.
Workspace context awareness. Before joining, it synthesizes your identity, recent activity, and known contacts into the system prompt.
Expressive natural gestures. Tight lip-sync plus appropriate emotional reactions, eye contact, and facial cues make it read as a participant.
Post-meeting notes. When the call ends, it summarizes decisions, who said what, and action items, and shares them automatically.
Agent-agnostic. It works with any agent that can read markdown instructions and run scripts - including Claude, OpenClaw, and custom agents you build.
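The workspace-context step in the list above can be pictured as plain string assembly. The sketch below is a guess at the shape of that step, not the skill's real schema. The guide names the three categories (identity, recent activity, known contacts), while the `WorkspaceContext` class, its fields, and the prompt wording are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WorkspaceContext:
    # Hypothetical structure: the guide lists these categories but no schema.
    identity: str
    recent_activity: List[str] = field(default_factory=list)
    known_contacts: List[str] = field(default_factory=list)


def build_system_prompt(ctx: WorkspaceContext) -> str:
    """Fold identity, recent activity, and contacts into one prompt string."""
    lines = [f"You are attending this meeting as {ctx.identity}."]
    if ctx.recent_activity:
        lines.append("Recent activity: " + "; ".join(ctx.recent_activity))
    if ctx.known_contacts:
        lines.append("Known contacts: " + ", ".join(ctx.known_contacts))
    return "\n".join(lines)


ctx = WorkspaceContext(
    identity="Alex (AI Self)",
    recent_activity=["reviewed Q3 roadmap", "updated pricing doc"],
    known_contacts=["Sam", "Priya"],
)
prompt = build_system_prompt(ctx)
```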

Setup & commands - production-ready in minutes

PikaStream is delivered as a Skill through the open-source Pika-Skills repository on GitHub. Clone it, configure a voice and avatar, and your agent can join its first Google Meet within minutes. The flow looks like this:

# 1. Clone the Pika-Skills repository
$ git clone https://github.com/Pika-Labs/Pika-Skills.git
$ cd Pika-Skills/pikastream-video-meeting

# 2. Set your developer API key
$ export PIKA_API_KEY="sk-pika-..."

# 3. Optional: clone your voice from a sample
$ ./pikastream clone-voice --audio my-voice.wav --name "alex-voice" --denoise

# 4. Optional: generate an avatar from a description
$ ./pikastream generate-avatar --prompt "warm, professional, 30s" --output ./avatar.png

# 5. Join a Google Meet - your agent shows up live
$ ./pikastream join --meet https://meet.google.com/abc-defg-hij --voice "alex-voice" --image ./avatar.png

The four conceptual steps behind those commands:

Get a developer key at pika.me/dev/login. PikaStream runs an automated balance check before each session and surfaces a secure top-up link if your credits are low.
Add the PikaStream Skill from github.com/Pika-Labs/Pika-Skills. Drop it into your agent runtime and it auto-detects and exposes the meeting interface without manual wiring.
Configure voice & avatar with clone-voice and either generate-avatar or your own --image.
Join a meeting by passing a Google Meet URL to ./pikastream join. Google Meet and the Pika app are supported today; Zoom and FaceTime are announced as coming soon.
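If your agent runtime is written in Python, the join command shown earlier can be assembled programmatically. This is a minimal sketch that uses only the flags appearing in this guide (`--meet`, `--voice`, `--image`). Treating `--voice` and `--image` as optional on `join` is an assumption based on the "Optional" labels in the setup steps, and `build_join_command` is a hypothetical helper, not part of the skill.

```python
import shlex
from typing import List, Optional


def build_join_command(meet_url: str,
                       voice: Optional[str] = None,
                       image: Optional[str] = None) -> List[str]:
    """Assemble argv for `./pikastream join` from the flags shown in the guide."""
    if not meet_url.startswith("https://meet.google.com/"):
        raise ValueError("expected a Google Meet URL")
    argv = ["./pikastream", "join", "--meet", meet_url]
    if voice:
        argv += ["--voice", voice]   # assumed optional
    if image:
        argv += ["--image", image]   # assumed optional
    return argv


cmd = build_join_command(
    "https://meet.google.com/abc-defg-hij",
    voice="alex-voice",
    image="./avatar.png",
)
print(shlex.join(cmd))
# ./pikastream join --meet https://meet.google.com/abc-defg-hij --voice alex-voice --image ./avatar.png
```

Passing the list to `subprocess.run(cmd, check=True)` from inside the cloned skill directory would then launch the session. Building argv as a list rather than a single string avoids shell-quoting problems with URLs and file paths.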

Where it fits - live presence, where it matters

The clearest wins are anywhere a meeting needs a face that can both talk and act. A few of the strongest scenarios:

Delegated meetings & coverage - send your AI Self to a call you can't make; it shows up with your face, voice, and context, takes notes, and kicks off follow-ups.
Customer-facing support - a real visual presence that answers questions and retrieves account data live, at the price of an automation rather than a human seat.
Sales discovery & demos - prospects book in for a conversational walkthrough, with draft proposals and follow-ups created during the call.
Internal standups & reviews - a team agent joins recurring meetings, syncs status across tools, and reports back with context preserved between sessions.
Localization & global reach - the agent speaks languages you don't, running identical face-to-face conversations across markets.
Creator-led 1:1s - offer "video calls with your AI Self" as a tier for fans, students, or community members, beyond what one human schedule can hold.
Education & coaching - course intros, office hours, and language practice with an agent that knows the curriculum and remembers each learner.
Developer-built agent products - drop PikaStream into any runtime via API and your custom agent gains a meeting-ready face with no rendering infrastructure to maintain.

Pricing - pay only for live minutes

Billing is usage-based by design: the bot is charged only while it's active in a meeting, which keeps short check-ins and 1:1 calls economical and makes the cost of longer sessions predictable. The beta rate is $0.20 per minute of active participation, with no monthly minimums. Voice cloning, avatar generation, and post-meeting notes are included in that rate. Before each session, the skill runs an automated balance check and surfaces a secure top-up link if you're low on credits.

In practical terms, at $0.20 per minute a five-minute customer check-in costs about $1 and a thirty-minute team standup about $6 - cheaper than the coffee, and more reliable than a junior coverage hire. Because pricing can change during beta, confirm the current rate before you build a budget around it.
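For budgeting, the arithmetic is simple enough to script. The sketch below hard-codes the beta rate quoted in this guide and assumes billing in whole minutes, since the guide does not say how partial minutes are rounded. `needs_top_up` is a hypothetical stand-in for the skill's own balance check, not its actual logic.

```python
from decimal import Decimal

BETA_RATE_PER_MINUTE = Decimal("0.20")  # beta rate from the guide; may change


def session_cost(active_minutes: int,
                 rate: Decimal = BETA_RATE_PER_MINUTE) -> Decimal:
    """Cost of a session billed per active minute (whole minutes assumed)."""
    if active_minutes < 0:
        raise ValueError("active_minutes must be non-negative")
    return rate * active_minutes


def needs_top_up(balance: Decimal, planned_minutes: int) -> bool:
    """Hypothetical pre-session check: is the balance below the planned cost?"""
    return balance < session_cost(planned_minutes)


print(session_cost(5))                      # 1.00
print(session_cost(30))                     # 6.00
print(needs_top_up(Decimal("2.00"), 30))    # True
```

`Decimal` is used instead of `float` so that currency amounts stay exact.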

Beta status - what to expect today

PikaStream 1.0 is in beta and currently developer-facing. Setup is technical: it runs through GitHub, API keys, and command-line tools, which makes it well-suited to developers and early adopters right now. A broader, consumer-friendly experience inside the Pika app for Pika AI Self users is rolling out, but full consumer parity hasn't been announced. If you're not comfortable with a terminal yet, it's worth tracking the app rollout rather than wiring it up by hand.

Identity & ethics - not a filter, not a deepfake

It's worth being precise about what PikaStream is and isn't. It is not a filter or an avatar overlay applied to a real video stream. The avatar is generated frame by frame by the underlying 9B Diffusion Transformer, conditioned on audio in real time, with identity reference injection for stability. Because the technology can render a convincing face and clone a voice, Pika's terms strictly prohibit using someone else's likeness without permission, and impersonation accounts can be reported to the moderation team. The responsible use is clear: deploy it as yourself or with explicit consent, and be transparent that participants are interacting with an AI presence.

Who should try it now

Because today's experience is developer-facing, the people who'll get the most out of PikaStream 1.0 right now are those comfortable with a little setup - and those with a clear, repeatable use for a live agent.

Developers building agent products. If you already run an agent runtime, adding a meeting-ready face via the skill or API is a fast way to differentiate, with no rendering infrastructure to operate.
Founders and small teams who keep missing calls. A delegated AI Self that shows up, takes notes, and starts follow-ups recovers real hours every week.
Support and sales leaders testing whether a visual, conversational agent can handle first-line calls or discovery at automation pricing rather than a human seat.
Creators and educators who want to offer 1:1 face-to-face time at a scale a single calendar can't support.

If you're none of those yet and you'd rather not touch a terminal, the smart move is to wait for the consumer experience inside the Pika app, which removes the GitHub-and-API-key friction. Either way, trying a single short call is the fastest way to feel the difference between a clip that renders and a presence that participates - and at roughly a dollar for five minutes, the cost of finding out is small.

How It Works

Join your first call in four steps

STEP 01 - Get a developer key. Generate an API key at pika.me/dev/login. A balance check runs before each session.
STEP 02 - Add the Skill. Clone Pika-Labs/Pika-Skills and drop the pikastream skill into your agent runtime.
STEP 03 - Set voice & avatar. Run clone-voice with a sample, then generate-avatar or pass your own image.
STEP 04 - Join a meeting. Pass a Google Meet URL to ./pikastream join - your agent presents on camera and acts live.

Capabilities

Built for live, identity-stable interaction

Real-time presence. A dynamic, animated avatar visible to everyone, generated continuously at 24 FPS.
Voice cloning. Clone your voice from a short sample so the agent sounds like you, not generic TTS.
Avatar generation. Generate an avatar from a prompt with OpenAI image models, or supply your own.
Persistent memory. Retains who it is, who it knows, and what was discussed - session to session.
Agentic execution. Pulls data, updates docs, and schedules actions live, mid-conversation.
Workspace context. Synthesizes your identity, recent activity, and contacts before the call.
Natural gestures. Tight lip-sync plus emotional reactions, eye contact, and facial cues.
Post-meeting notes. Auto-summarizes decisions and action items and shares them after the call.
Agent-agnostic. Works with Claude, OpenClaw, and any agent that can read markdown and run scripts.

Where It Fits

Live presence, where it matters most

Delegated meetings - Send your AI Self to a call you can't make - with your face, voice, and full context.
Customer support - A visual presence that answers questions and retrieves account data live on video.
Sales demos - Conversational walkthroughs with proposals and follow-ups created during the call.
Standups & reviews - A team agent syncs status across tools and reports back with context preserved.
Localization - Run identical face-to-face conversations across markets in languages you don't speak.
Creator 1:1s - Offer video calls with your AI Self as a tier for fans, students, or community.
Education & coaching - Office hours and language practice with an agent that remembers each learner.
Agent products - Give any custom agent a meeting-ready face via API - no rendering infra to run.

Comparison

PikaStream 1.0 vs Pikaformance

The previous ultra-fast model was quick for clips but too slow for conversation. PikaStream changes the equation.

Metric	Pikaformance (previous gen)	PikaStream 1.0
End-to-end latency	~4.5 seconds	~1.5 seconds
GPU footprint	8× GPUs	1× H100
Output	Async clips	Continuous 24 FPS stream
Feel	"Like leaving a voicemail"	"Like FaceTime"
Use case	Async generation	Live meetings
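The table's two numeric rows can be combined into a rough efficiency comparison. The sketch below multiplies latency by GPU count to get "GPU-seconds per response". That is a crude upper bound which assumes every GPU is busy for the full latency window and ignores that the guide does not name the GPU model used by Pikaformance. The `Profile` class is purely illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    name: str
    latency_s: float   # end-to-end latency from the table (approximate)
    gpus: int          # GPU footprint from the table

    @property
    def gpu_seconds_per_response(self) -> float:
        # Rough upper bound: assumes all GPUs are busy for the whole response.
        return self.latency_s * self.gpus


pikaformance = Profile("Pikaformance", latency_s=4.5, gpus=8)
pikastream = Profile("PikaStream 1.0", latency_s=1.5, gpus=1)

latency_ratio = pikaformance.latency_s / pikastream.latency_s
gpu_ratio = pikaformance.gpus / pikastream.gpus
gpu_seconds_ratio = (pikaformance.gpu_seconds_per_response
                     / pikastream.gpu_seconds_per_response)

print(f"latency: {latency_ratio:.0f}x lower")               # 3x
print(f"GPUs: {gpu_ratio:.0f}x fewer")                      # 8x
print(f"GPU-seconds/response: {gpu_seconds_ratio:.0f}x")    # 24x
```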

Pricing

Pay only for live minutes

Usage-based by design - billed only while your agent is active in a meeting. Beta pricing.

$0.20 / minute

Per-minute billing - no monthly minimums
Automated pre-meeting balance check
Secure top-up link if credits are low
Voice cloning & avatar generation included
Post-meeting notes auto-generated & shared
Works with any AI agent runtime

~$1 · 5-minute call. A quick customer check-in costs about a dollar of live participation.
~$6 · 30-minute standup. A half-hour team meeting runs around six dollars, billed by the minute.
Included tooling. Voice cloning, avatar generation, and post-meeting notes at the standard rate.

Questions, answered

What exactly is PikaStream 1.0?

It's a real-time visual engine, not a traditional clip model. Instead of rendering video after the fact, it streams personalized audio-conditioned avatar video continuously while a conversation happens. The first product built on it is the pikastream-video-meeting skill, which lets any agent join a Google Meet with a face, a cloned voice, and the ability to act during the call.

How is it different from Pikaformance?

Pikaformance needed 8 GPUs and took ~4.5s per response - fast for clips, but every exchange felt like leaving a voicemail. PikaStream runs on one H100 at ~1.5s latency, streaming 24 FPS continuously, so it feels like a live video call instead of async generation.

What's the architecture?

Three components fused into a single-GPU pipeline: FlashVAE for latent encoding and streaming decode (~441 FPS, ~1.1 GB VRAM), a 9-billion-parameter audio-conditioned Diffusion Transformer, and inference engineering that fuses decoding, audio conditioning, and scheduling. The DiT is trained bidirectionally, then distilled into a causal autoregressive student for real-time streaming.

Does it only work with Pika's own AI?

No. Pika AI Self is the most native integration, but it's agent-agnostic. The skill is published in the open-source Pika-Labs/Pika-Skills repo and works with any agent that can read markdown and run scripts - including Claude, OpenClaw, and custom agents - or directly via API key.

Which meeting platforms are supported?

Google Meet and the Pika app today, with Zoom and FaceTime announced as coming soon. The agent joins like any participant - a visible avatar tile with voice output, not a hidden background bot.

Can the agent actually do things during a call?

Yes. Agentic skills are enabled for video chats, so it can pull data, update documents, schedule follow-ups, or make API calls live while the conversation continues. That's what separates it from a meeting-summary bot.

How much does it cost?

Beta pricing is $0.20 per minute of active participation, with no monthly minimums. A balance check runs before each session and offers a secure top-up link if needed. Voice cloning, avatar generation, and post-meeting notes are included.

Is it a filter or a deepfake?

Neither. The avatar is generated frame by frame by the 9B Diffusion Transformer, conditioned on audio with identity reference injection for stability - not an overlay on a real stream. Pika's terms prohibit using someone else's likeness without permission, and impersonation can be reported to moderation.

Is it ready for everyday consumer use?

Not quite. It's in beta and primarily developer-facing, with setup via GitHub, API keys, and command-line tools. A consumer-friendly version inside the Pika app for AI Self users is rolling out, but full consumer parity hasn't been announced.