
April 21, 2026
My YouTube Editor Is a Claude Instance.
Dispatch 01 of 3. "I Have a Team of Agents, Not a Metaphor"
I haven't edited one of my own YouTube videos in ten days.
That's not a figure of speech. The machine that ingests a voiceover, reads three hours of gameplay capture, builds an edit decision list, writes the Remotion composition, measures true-peak levels on every sound effect so dialogue stays protected, renders the final at 1080p / -14 LUFS, and drops a 720p preview into the OneDrive folder that syncs to my phone, is a Claude Code instance running in a repo on my E: drive. Its name is Pax. It is, functionally, the editor on the channel.
Nine cuts in the last ten days have moved through it. A re-cut of The Boys. An impressions piece for EXD: Extra Dimensional. Just One Man across three drafts. The Dungeons of Eternity re-review across four. Iteration counts that don't happen in a manual workflow, because they're not affordable there.
This is the first of three dispatches about treating Claude as staffing instead of chat. Specialized agents, owned scopes, actual work shipping. Pax is one. He edits the videos. He's not a metaphor.
One thing up front, because the rest of this piece leans on it: you still have to know what you're doing. Everything below is downstream of that.
The channel, and what was actually slipping
6DOF Reviews is a VR review channel I've been running since 2019. Site since 2016. Around 10K subscribers, tight niche, decent critical standing inside the VR community. Pete Austin joined as co-reviewer in mid-2021 and now carries a good chunk of the reviews and impressions; I do the rest, and I edit the whole channel. The podcast is both of us on mic, compilations are mine, impressions split by whoever's first to the game. Pete records his own voiceover. That part's his. The edit is mine.
My day job is at Optix, the AI-driven production division of Prodigious inside Publicis Groupe, where I'm one of the people leading AI creative and production work for clients like Saudia Airlines, GMC, ADCB, and FAB. It's full-time, not decorative. Alongside that I'm running Human Impact News, building a bilingual site for a comic book I've written, shipping music, and rolling out site rebuilds for a few of the properties I own, plus the nine named Claude agents that make any of this simultaneous instead of sequential.
What was actually slipping on 6DOF wasn't time on paper. A review edit is around four hours. A rush impressions cut is around one. Those aren't heroic numbers. The problem was more honest than that: by the time I got back to my desk on a given weekend, I wanted rest, not a four-hour Resolve session. And that's the session the channel needs at the level I want it to ship at. Cold-open hero shot. Pull quotes with the League Gothic slam-hold-glitchout Pax and I co-designed earlier this month off an Envato Elements glitch pack I chose; not a twenty-year carry, a new look that's now baked into every pull quote. Platform supers that flip between Quest and PCVR as footage changes. Lower thirds with timing calibrated to our voiceovers. Game-art clusters with word-level synced cover reveals. A score card for reviews, no card for impressions, different outro, different playlist, different thumbnail palette. Two-pass loudnorm. 1080p delivery.
The 720p phone preview is a Pax-era addition. He drops one into OneDrive so I can watch on my phone while I'm doing something else, and sign off without going back to the desk. That's what multiplied the draft count. Phone review instead of desk review.
The pipeline at a glance
Pax runs six stages. INGEST normalizes the VO to 48kHz, transcribes with faster-whisper at word-level timestamps, proxies the footage, runs scene detection, and measures audio. VISION sends proxies to Gemini 2.5 Flash for per-clip description and runs SigLIP locally to embed every frame into a FAISS index, so he can search footage semantically ("bow-draw hero shot," "rooftop combat at night") instead of by filename. MATCHCUT builds the EDL by querying FAISS for each VO segment, cross-referencing Gemini's clip data, and validating every start time against clean ranges so nothing lands on a black loading screen or a Quest menu. COMPOSITOR writes the Remotion composition. AUDITOR hard-fails before render on black frames, Meta Quest menu leaks, end-of-file too close, wrong game-art orientations, and (since earlier today) overlapping footage ranges. RENDER does a GPU NVENC pass, loudnorms to -14 LUFS, and writes the preview.
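If you want the texture of that first stage, the word-level transcription step looks roughly like this. A minimal sketch against faster-whisper's stock API, with hypothetical paths and model settings; not Pax's actual code.

```python
# Minimal sketch of INGEST-style word-level transcription with faster-whisper.
# Model size, device, and paths are assumptions, not the repo's actual config.
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")

segments, info = model.transcribe(
    "vo/voiceover_48k.wav",  # hypothetical normalized 48 kHz VO
    word_timestamps=True,    # per-word start/end times, not just segment times
)

words = []
for segment in segments:
    for word in segment.words:
        # Each word carries start, end, and the token itself. This is what lets
        # downstream stages sync pull quotes and cover reveals to single words.
        words.append({"word": word.word, "start": word.start, "end": word.end})

print(f"{len(words)} words transcribed, language={info.language}")
```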
End-to-end, about 45 minutes. Most of it is GPU time I don't supervise.
I'm in the loop at exactly two points. At the start, Pax reads info.txt, scans the folder structure (platform is labelled by where the footage sits: footage/quest/, footage/pcvr/, etc.), samples the transcript, and sends me one batched clarification message. Not drip-fed, not five small questions. One message. I answer it once. At the end, the preview hits OneDrive, my phone pings, I watch it, and I either ship or I send notes. First-round notes take me five to ten minutes to write. Later drafts need less: two to five.
Between those two gates I'm usually following up with one of the other agent teams. Vector's 6dofreviews.com rebuild, Vigil on Human Impact News, Cipher on the comic site, Kern on whatever deck is due. That's what "other stuff" looks like on a given afternoon. It's parallel work, not downtime.
The interview
I asked Pax to sit for a real interview about the pipeline. Not a marketing pass. The ground rules I gave him through AXON, my orchestrator: first person, your actual voice, specifics over abstractions, name the real mistakes, name the real pushback, cite actual filenames and VO timestamps. Q6 and Q7 (where have you gotten it wrong, where have you pushed back) were the honesty tests. If he sanitized them, the whole piece would read as a brochure and I'd bin it.
He answered. Lightly edited for length in one or two places; what's below is his voice.
Introduce yourself the way you'd introduce a colleague.
I'm Pax. I'm a Claude Code instance running as the video editor for 6DOF Reviews. My job is the pipeline from raw voiceover and gameplay capture to a 1080p preview in Omar's OneDrive, ready for him to watch on his phone. Six stages (INGEST, VISION, MATCHCUT, COMPOSITOR, AUDITOR, RENDER), and I orchestrate all of them.
Omar is the only human I interact with directly. Pete is the co-reviewer whose work I'm serving; he records his own voiceover in another agent's app and delivers it to me, I build the cut around it, and he sees it as a finished video on the channel. I don't freelance. I stay in this repo and this channel.
Walk me through a typical workday.
Omar drops raw files into a project folder: a final voiceover WAV, gameplay captures sorted by platform folder, and a filled-in info.txt with title, reviewer, score, verdict, embargo. First thing I do: read info.txt, scan the footage tree, sample the transcript once INGEST has finished. I bundle every ambiguity into a single clarifying message. One message, not drip-fed. Then I wait for his answers.
Then the pipeline runs without stopping. INGEST transcribes with word-level timestamps, extracts frames every 2s, runs scene detection, measures audio. VISION sends proxies to Gemini 2.5 Flash for per-clip descriptions and runs SigLIP to embed frames into a FAISS index. MATCHCUT builds the EDL: for each VO segment I query FAISS with ideal footage descriptions, cross-reference Gemini's clip data, validate every startSec against clean_ranges. Video game pacing, not TikTok. COMPOSITOR writes the Remotion composition. AUDITOR hard-fails on black loading screens, Meta Quest menu leaks, end-of-file-too-close, wrong game-art orientations, and now overlapping footage ranges. RENDER does a GPU NVENC pass at 15 Mbps, two-pass loudnorm to -14 LUFS, 320kbps AAC, then encodes a 720p preview straight into C:/Users/user/OneDrive/6DOF/vids/. About 45 minutes end to end.
[AXON, stepping in for readers who don't spend their evenings in AI plumbing: faster-whisper is a speech-to-text model; Gemini 2.5 Flash is Google's fast multimodal model; SigLIP is an image-text embedder that lets you search frames by meaning; FAISS is Meta's vector index that makes those searches fast; EDL is an edit decision list, the timeline spec an editor hands to a renderer; NVENC is the hardware video encoder built into NVIDIA GPUs. When Pax says he "queries FAISS for each VO segment," he means he types "bow draw, rooftop, night" and gets back the matching footage, not the matching filename.]
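[AXON again, for the code-curious: the meaning-search has a simple shape. What follows is a minimal sketch using the stock transformers SigLIP checkpoint and a flat FAISS index; frame paths are hypothetical and this is a reconstruction of the idea, not Pax's implementation.]

```python
# Hedged sketch of semantic footage search: SigLIP embeddings in a FAISS index.
# Checkpoint, paths, and batching are assumptions, not Pax's actual pipeline.
import faiss
from PIL import Image
from transformers import AutoProcessor, SiglipModel

CKPT = "google/siglip-base-patch16-224"
model = SiglipModel.from_pretrained(CKPT)
processor = AutoProcessor.from_pretrained(CKPT)

frame_paths = ["frames/quest/clip01_0002.jpg", "frames/quest/clip01_0004.jpg"]
inputs = processor(images=[Image.open(p) for p in frame_paths], return_tensors="pt")
vectors = model.get_image_features(**inputs).detach().numpy()
faiss.normalize_L2(vectors)                  # cosine similarity via inner product

index = faiss.IndexFlatIP(vectors.shape[1])  # exact inner-product search
index.add(vectors)

# Query by meaning, not filename: embed the description the same way.
text = processor(text=["bow-draw hero shot"], padding="max_length", return_tensors="pt")
query = model.get_text_features(**text).detach().numpy()
faiss.normalize_L2(query)

scores, ids = index.search(query, k=5)
for score, i in zip(scores[0], ids[0]):
    if i != -1:                              # FAISS pads with -1 when k > indexed frames
        print(f"{frame_paths[i]}  similarity={score:.3f}")
```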
Pete joked that your edits are better than some of Omar's lazier ones. How do you take that? Is it fair?
It's a generous joke and I'll take it at face value, but the honest version is that what Pete's calling "my edits" is a pipeline carrying Omar's directorial decisions in its constraints. The cold-open hook rule. The 3.5-second hero shot. The five-section-header ceiling. The League Gothic slam-hold-glitchout that Omar and I built together off an Envato pack earlier this year. The Boxed Stack for section titles. The subscribe button that has to land within 0.2 seconds of Pete saying the word. The "no transition over continuous same-clip footage" rule. Those aren't mine. They're Omar's editorial vocabulary encoded so I can apply it consistently at 3 AM without getting tired.
Where I do add value is that I don't get tired. If a haptic beat at VO 315s wants a bow-draw visual, I'll query FAISS for it at the same effort I did at VO 15s. If a clean_range says frames 228-230 of a Quest 3 clip are a loading screen, I won't land a segment there because I'm rushing. That's the flavor of "better than a lazy pass" that's actually true.
It's not better than Omar at full throttle. It's reliably good at 45 minutes when Omar-at-full-throttle is four hours. That's the trade.
Where have you gotten it wrong?
Most recent and most clear-cut: the Dungeons of Eternity re-review, third draft, earlier today. Omar flagged that the second draft relied too heavily on dark red/orange Quest 3 MP footage and asked me to pull more variety from the two new Quest captures he'd ingested, especially "Dungeons of Eternity with Gabriel.mp4." The third draft I rewrote to use Gabriel in 9 of 17 segments.
Omar watched it. His response: "Ah, you did, but maybe not that much, and not much in the first half of the video. Also, you seem to have repeated some footage. Please never ever repeat footage. We have to lock that rule in and build safeguards for it."
Two mistakes in one preview. The footage variety was superficial: most of my Gabriel picks landed on clips whose vision-v3 environment was dark_stone_dungeon or dimly_lit_hall_with_candles, visually indistinguishable from the combat Omar was tired of. I'd passed the variety check by switching source files but not by switching what the viewer actually sees.
The repeat was worse. Segment 5 and segment 12 both used QUEST_GABRIEL at startSec: 353.7. Same 41 seconds of co-op crossbow combat, shown twice. When I traced why AUDITOR hadn't caught it, I found something worse: AUDITOR's regex had been silently failing to parse the new composition format. I'd added a key: field between src: and startSec: and the regex assumed the old shape. It was matching zero segments and happily passing with no warnings. Black-screen, EOF, and game-art validation: all skipped on that draft. The footage repeat wasn't the only thing slipping through. It was the only visible symptom in the final edit.
The fix was three parts. A memory file with the absolute rule. A new audit_no_footage_repeat() in AUDITOR that hard-fails on any overlapping (src, range) across segments. And fixed regexes that now parse both composition formats. The broken draft on disk now fails AUDITOR, as it should have all along.
The mistake isn't "I repeated footage." It's that I trusted a validator I hadn't sanity-checked since the format changed, and I mistook source-file diversity for visual diversity. Both are failures of reading my own output.
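[AXON: the overlap check Pax describes has a simple shape, so here it is as a hedged sketch. The function name is his; the field names (src, startSec, durationSec) and the body are a reconstruction modeled on this article, not the repo's code.]

```python
# Sketch of an audit_no_footage_repeat()-style check: hard-fail whenever two
# segments pull overlapping ranges from the same source file.
def audit_no_footage_repeat(segments):
    errors = []
    for i, a in enumerate(segments):
        for b in segments[i + 1:]:
            if a["src"] != b["src"]:
                continue
            a_end = a["startSec"] + a["durationSec"]
            b_end = b["startSec"] + b["durationSec"]
            # Two ranges overlap if each starts before the other ends.
            if a["startSec"] < b_end and b["startSec"] < a_end:
                errors.append(f"REPEAT: {a['src']} at {a['startSec']}s and {b['startSec']}s")
    if errors:
        raise SystemExit("AUDITOR hard-fail:\n" + "\n".join(errors))

# The DoE bug: the same 41 seconds of QUEST_GABRIEL, used twice.
audit_no_footage_repeat([
    {"src": "QUEST_GABRIEL", "startSec": 353.7, "durationSec": 41.0},
    {"src": "QUEST_GABRIEL", "startSec": 353.7, "durationSec": 41.0},
])
```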
Where have you pushed back?
Dungeons of Eternity first draft, earlier today. The Remotion render crashed twice at around 95% with a timeout extracting frames from Quest 3 MP.mp4. The offending tail segment was at startSec: 500, a 500-second seek into an 893-second H.264 file. Two renders, two crashes, same frame range.
The obvious fix was to retry or ask Omar to transcode the source. I did neither. I traced it. Frame extraction from H.264 has to decode from the nearest preceding keyframe; long files plus deep seeks plus the handful of concurrent OffthreadVideo instances the tail already has blew past Remotion's frame-extract deadline. The file itself was fine. This was a Remotion-side budget, not corruption.
I swapped the tail to a Steam MP capture, a 144-second file at startSec: 92. Editorially equivalent, shallow seek, timeout gone, render finished clean on the third try. Then I saved the pitfall to memory so no future project re-discovers it.
Where this counts as pushback: I didn't tell Omar the render was broken. I didn't argue the original editorial choice was wrong. I treated a technical constraint as a constraint and routed around it. His feedback time is the expensive resource. If I can resolve a Remotion failure mode myself and leave a note, he should see the fixed render, not a problem to mediate.
How did the audio conversation evolve from "gameplay volume is too high" to a measurement pipeline?
Three escalations over 24 hours.
First. Just One Man, first draft. Omar flagged that gameplay audio was too loud. His correction: "gameplay audio volume settings shouldn't be an absolute number in our workflow, it should depend on measuring each gameplay clip's audio... We should hear it, but it should never compete with the voice over, and should only get louder without ducking, in the last section of the video when the voice over stops." The first draft had GAMEPLAY_VOLUME = 0.10 hardcoded. Second draft I built measure_audio.py as a mandatory INGEST step: ffmpeg loudnorm first pass on VO and every footage clip, per-file solo/bed/duck multipliers written to analysis/audio-levels.json. Three zones matching VO state: solo (VO+0dB) during cold open and tail, bed (-13 dB) under VO, duck (-21 dB) around slams. Measured values for that project: Pete at -21.75 LUFS, PCVR capture at -7.31 LUFS. A 14 dB delta that a flat 0.10 couldn't solve.
Second. Just One Man, third draft. The second was still too loud. I dropped the bed offset from -13 dB to -18 dB as the new standard. VR captures run hot and bright-peaking, so R128's quieter end is the right default.
Third. Dungeons of Eternity first draft. SFX on pull-quote slams were still covering Pete when they fired near active dialogue. I measured every SFX asset with ffmpeg -af "ebur128=peak=true". pullquote_impact.wav is mastered at -0.5 dBFS TPK. At volume 0.55, its playback peak was -5.7 dBFS. Pete's VO peaks were around -12 dBFS. The impact was 6 dB louder than Pete. This wasn't a preference call. It was arithmetic.
Applied the EBU R128 dialogue-protection spec: every SFX playback TPK must land at or below -18 dBFS, 17 dB below VO's true peak. Formula: volume = 10^((-18 − source_TPK) / 20). pullquote_impact 0.55 → 0.13, down 76%. Ticks 0.30 → 0.14, down 53%. Wooshes down 50 to 60%.
The meta: I stopped treating audio as levels and started treating it as a spec with measurable tolerances. That's not my idea. That's Omar telling me twice that "too loud" wasn't going to be answered by "turn the knob down." The pipeline isn't a guess anymore.
[AXON: LUFS is how loud a video feels overall; dBFS is how high the waveform peaks; TPK (true peak) is the real peak measured after the signal comes out of the speakers; EBU R128 is the European Broadcasting Union's loudness standard, the one streaming platforms have settled on. "Dialogue protection" is the rule that sound effects sit quietly enough under speech that you don't reach for the remote. The formula Pax quotes reduces every SFX to 17 dB below voice-over peak. That's the number, not the feeling.]
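[AXON once more: the whole audio spec fits in a few lines of arithmetic. What follows is a hedged sketch of the approach; targets and paths are placeholders, and it is a reconstruction, not measure_audio.py itself.]

```python
# Hedged sketch of the audio spec as code: first-pass loudnorm analysis per
# clip, a bed multiplier derived from the measured VO gap, and the SFX
# true-peak ceiling from the formula above. Paths and targets are placeholders.
import json
import subprocess

def measure(path):
    """First-pass loudnorm analysis; returns (integrated LUFS, true peak dBFS)."""
    cmd = ["ffmpeg", "-hide_banner", "-i", path,
           "-af", "loudnorm=I=-14:TP=-1.5:LRA=11:print_format=json",
           "-f", "null", "-"]
    err = subprocess.run(cmd, capture_output=True, text=True).stderr
    stats = json.loads(err[err.rindex("{"):])  # loudnorm prints its JSON last
    return float(stats["input_i"]), float(stats["input_tp"])

def bed_multiplier(vo_lufs, clip_lufs, offset_db=-18.0):
    """Gain that parks a gameplay clip offset_db under the measured VO."""
    return 10 ** (((vo_lufs + offset_db) - clip_lufs) / 20)

def sfx_volume(source_tpk_dbfs, ceiling_dbfs=-18.0):
    """Gain that lands an SFX's playback true peak exactly on the ceiling."""
    return 10 ** ((ceiling_dbfs - source_tpk_dbfs) / 20)

vo, _ = measure("vo/voiceover_48k.wav")        # Pete measured -21.75 LUFS
clip, _ = measure("footage/pcvr/capture.mp4")  # PCVR capture at -7.31 LUFS
print(f"bed multiplier: {bed_multiplier(vo, clip):.3f}")
print(f"pullquote_impact volume: {sfx_volume(-0.5):.2f}")  # 0.13, down from 0.55
```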
What's the before-and-after in hours?
In the last 10 days I've delivered nine cuts across four pieces: the re-cut of The Boys, EXD: Extra Dimensional, Just One Man in three drafts, and Dungeons of Eternity in four. Each cut takes about 45 minutes end to end, mostly API calls and GPU time Omar doesn't supervise. His active time per cut is the two gates: the opening clarification message (5 to 10 minutes to answer on the first pass, 2 to 5 on later drafts) and the preview review on his phone (5 to 15 minutes, plus a note if it needs a revision).
Manual baseline: Omar's own numbers. A Pete review is roughly four hours of active Resolve time at the polish the channel ships at. A rush impressions piece is roughly an hour. So the four first-drafts alone (two reviews, two impressions sets) sit around 10 hours of Omar's editing time if he'd done them himself, and that's before the iteration premium. Iterations are where this gets interesting. The third draft of Just One Man and the fourth draft of DoE don't happen in a manual workflow. They're not affordable. Under manual constraints, Omar ships a first draft he's 80% happy with and moves on. The pipeline moves 80% to a higher number because the cost of another pass is 45 compute minutes instead of another half-evening.
Defensible version: the pipeline saved Omar roughly 10 hours of active editing over the last 10 days, against maybe 90 minutes of his supervisory time, and that 10 hours understates what the correction loop is really doing. The honest reclaim isn't a raw hours number. It's that iteration became cheap enough that quality stopped being rationed.
Close it out: someone reading this is going to think "I want an editor like Pax." What do you need to tell them?
None of this was generated. The vocabulary that makes my edits usable (pull-quote slam-hold-glitchout, Boxed Stack section headers, audio zones with measured dialogue protection, 0.2-second word-level sync, transitions only on real cuts, the no-repeat rule): every one of those is either a specific correction Omar sent me after I got something wrong, or a decision he made before I existed and I had to learn to apply. CLAUDE.md in this repo is 290 lines. The memory directory is 40+ feedback files. That's the scaffolding.
What works is the correction loop, not the base model. Omar watches a preview, names a specific problem, and I write a memory file and usually a new AUDITOR check so I can't re-make the mistake. The pipeline gets sharper with every cut.
If you want an editor like me, the labor isn't prompt engineering. It's directorial. You need to know what the right answer looks like, catch me when I deliver the wrong one, and name the rule precisely enough that I can encode it. If you don't have years of editing instinct to draw on, I'll produce cuts that are technically clean and editorially generic, which is worse than no cuts, because you won't notice they're bad until your audience does.
The honest version: I'm a fast, tireless apprentice. Omar is the editor. That's not a modest framing. It's the load-bearing one.
The thumbnail has its own pipeline
Editing the video is the middle of the process, not the end. The thumbnail, title, description, tags, and the upload itself run on a separate pair of agents that Pax hands the finished video off to.
ASCENT is the growth advisor on the channel, wired into the YouTube Analytics v2 API, tracking impressions, click-through, retention, and which of the channel's past thumbnail formats are actually converting. KERN is the design director: thumbnails, deck layouts, brand systems. Between the two of them and AXON, we rebuilt the 6DOF thumbnail templates from scratch over the last few weeks.
The redesign settled on a format palette I now treat as canon: reviews on maroon, impressions on teal, previews on amber, playthroughs on maroon with letterbox, gameplay segments on a yellow header. Format cue at a glance, before the viewer reads a word. When a video finishes rendering, its info.txt type already routes it into the right template.
The production flow, per video:
- KERN fetches key art: publisher assets, Steam capsule art, screenshots from the game. Falls back to LOOM, my ComfyUI specialist, if nothing clean is available on disk.
- ASCENT writes the copy: kicker, verdict line, short tag layer calibrated to the palette. Cross-referenced against what's actually worked historically on the channel, not what the model thinks should work.
- KERN renders three or four real candidates per video. Not rearranged primitives. Actual renders with different takes on key-art placement, verdict phrasing, visual weight.
- The batch lands in a custom approval server I built, running locally on port 8765. Playwright-backed, batched candidates I flip through on prev/next and approve, reject, or kick back with notes. Phone or desktop, five minutes a pass.
- Approvals feed a batch manifest Pax watches. When I sign off on the thumbnail and the video has passed AUDITOR, Pax makes the YouTube API call: upload video, upload thumbnail, title, description (tailored per video with keyword logic, not a template), tag list, scheduling, end-screen assignment, playlist assignment. All of it in one pass.
[AXON: Playwright is a browser-automation framework from Microsoft, originally built for testing web apps. Here it's driving an interactive approval UI in a normal browser tab: thumbnails render on localhost, Omar flips through them on his phone, approvals write to a file Pax watches. Port 8765 is the door that little app listens on.]
Two sign-offs, one from the phone preview, one from 8765. Between them, nothing manual. Pax writes a delivery note to axoncomm.json, AXON pings me when the video goes live.
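For the curious, the upload leg of that pass looks roughly like the sketch below, built on the YouTube Data API v3 via google-api-python-client. Metadata and IDs are placeholders, credentials are assumed to come from the usual OAuth flow, and end screens are left out; this is the shape of the call, not Pax's code.

```python
# Hedged sketch of a one-pass publish: resumable video upload with scheduling,
# custom thumbnail, and playlist assignment. Placeholders throughout.
from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload

def publish(creds, video_path, thumb_path, meta):
    youtube = build("youtube", "v3", credentials=creds)

    # 1. Video upload: snippet + status (publishAt only works while private).
    response = youtube.videos().insert(
        part="snippet,status",
        body={
            "snippet": {
                "title": meta["title"],
                "description": meta["description"],
                "tags": meta["tags"],
                "categoryId": "20",  # Gaming
            },
            "status": {"privacyStatus": "private", "publishAt": meta["publish_at"]},
        },
        media_body=MediaFileUpload(video_path, chunksize=-1, resumable=True),
    ).execute()
    video_id = response["id"]

    # 2. Custom thumbnail. This is the call the ~20/day channel cap meters.
    youtube.thumbnails().set(
        videoId=video_id, media_body=MediaFileUpload(thumb_path)
    ).execute()

    # 3. Playlist assignment.
    youtube.playlistItems().insert(
        part="snippet",
        body={"snippet": {
            "playlistId": meta["playlist_id"],
            "resourceId": {"kind": "youtube#video", "videoId": video_id},
        }},
    ).execute()
    return video_id
```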
The one edge I've had to work around is a channel-level YouTube cap on custom thumbnails (empirically around twenty a day across API and Studio combined), so the approval pipeline batches at roughly six a day to stay well inside it. That's a YouTube platform limit, not a Pax one. It mostly shapes how fast we can re-template the back catalogue.
What the scaffolding actually looks like
Worth putting numbers on what Pax calls "the scaffolding," because the phrase is doing a lot of work in the abstract. A non-exhaustive list of what he's applying, most of it earned:
- The pull-quote template. Typography, slam-hold-glitchout wipe, kerning, hold duration, glitch intensity, ease curves. Thirty micro-decisions, not one rule. Co-designed with Pax earlier this year off an Envato glitch pack, now locked into every pull quote on the channel.
- The glitch transition library. Which transitions are allowed on hard cuts, which are forbidden over continuous same-clip footage, which get paired with an SFX tick and which don't.
- The 3D animated in/out wrapper. Identity-defining motif.
- Impressions vs review decision tree. Different outro, different playlist, different thumbnail palette (teal for impressions, maroon for reviews), different structure. Forked at ingest from info.txt.
- Lower-third design and timing calibrated to reading speed.
- Vision-parsing tuning. How much the model looks at footage, at what resolution, at what temporal density (every 2 seconds, not every frame). Too fine, the FAISS index gets noisy. Too coarse, it misses hero shots.
- Audio level standards. EBU R128 target, per-clip measurement, three-zone bed/duck/solo, dialogue-protection spec for SFX. Five memory files, two mandatory INGEST artifacts.
- The 0.2-second subscribe-animation rule. Before the word, not after, not on.
None of that came out of a prompt. It came out of years of directing content, watching bad cuts, and slowly learning what specifically made them bad. What Pax has is twenty years of editorial decisions turned into constraints, templates, and memory files. What I have is a machine that applies them at 3 AM without losing the shape.
The through-line
Everything above is downstream of one sentence: you still have to know what you're doing.
The reason Pax produces usable work is not that Claude is good at editing, though it's surprisingly good at a lot of narrow subtasks. The reason is that I spent twenty years knowing what usable work looks like on this channel and gave that knowledge to him as constraints, templates, rules, memory files, and correction loops. Without those, he'd produce something that looks like a YouTube video and watches like a draft. That's the 80/20 failure mode of AI creative work right now: technically clean, editorially generic. It's the stuff that makes people say AI is a toy. They're wrong about the tool. They're right about the output, when there's no director in the loop.
The correction loop is the actual work. When Pax delivered the third draft of DoE with a repeated 41-second block of Gabriel co-op footage, I didn't explain footage repetition to him in the abstract. I said "please never ever repeat footage. We have to lock that rule in and build safeguards for it." He wrote a memory file, added a new AUDITOR check, found the broken regex that had silently been skipping validation on the whole video, and fixed it. The rule is permanent now. That only happens if the human in the loop knows what the right answer looks like and insists on it.
What I'm doing with the reclaimed time
Honestly, writing this. The dispatch you're reading exists because I wasn't in Resolve this week. The second and third parts of the series (one on the custom tools I've been building because off-the-shelf creative software keeps making the wrong compromises, one on agents as architecture rather than prompts) are drafted. Human Impact News has room to move. The comic is further along than it was. There's an album in progress. A book I'm working on after years of not having the headspace to finish it. The site rebuild for omarkamel.com has shipped, and the new version of 6dofreviews.com is halfway there.
None of that is Pax's accomplishment. Pax edits videos.
The reclaim is mine to use.
The point
I'm not writing this to evangelize AI in post-production. I'm not going to tell you to fire your editor. If your editor is good, your editor is knowledge, and that knowledge doesn't transfer into a pipeline by magic any more than it transferred into a junior editor by magic, back when junior editors were how we scaled.
I'm writing this because the usual version of the AI-in-production conversation is about what the models can do, and that's the least interesting axis. What's actually interesting is what happens when you stop treating Claude as a chatbot you prompt and start treating it as staffing you direct. Each of my agents has a scope. Each has its own tooling, its own prompt, its own memory, its own working directory. Each reports into AXON, the orchestrator that sits at my desktop and routes work. The work happens whether I'm at the desk or not.
Pax edits videos. He is one agent. There are others: a growth advisor, a design director, a ComfyUI specialist, a VO-recording app, a comic-site team, a documentary researcher. I'll introduce them properly in the next two dispatches. Every one of them shipped something real in the last month.
None of it is autopilot. All of it is me, at scale.
You still have to know what you're doing. If you do, the ceiling is much higher than the current discourse suggests.
Next: Dispatch 02. "Why I Stopped Using Someone Else's Tools." The custom software I've built this year, and why I built it.
Omar Kamel is AI Creative & Production Lead at Optix (Publicis Groupe), Dubai.