
September 15, 2023
Running Your Own Models: SDXL and the Case for Local AI
Why serious creative production requires hardware you control.
Two months ago, Stability AI released SDXL 1.0. If you're doing creative work with AI and you haven't run it locally yet, you're missing something important. Not because it's the shiniest new toy (it is), but because the experience of running your own models changes how you think about this entire technology.
I've been running Stable Diffusion locally since the first release. I run it for actual production work, for clients in Dubai who would not be thrilled to learn their unreleased visuals were sitting on someone else's servers. And SDXL represents the first time a locally-run open model can credibly compete with the cloud services for professional output quality.
This isn't a hobbyist flex. It's a production argument.
What SDXL Actually Changed
Let's start with what matters. Stable Diffusion 1.5 generated at 512x512 natively. You could push it higher with tricks, but that was the sweet spot. SD 2.1 bumped to 768x768 and somehow managed to be worse at following prompts, which was an impressive failure of priorities.
SDXL generates at 1024x1024 natively. That's four times the pixel count of SD 1.5. But resolution alone isn't the story. The composition is dramatically better. Hands are still occasionally wrong, but they're wrong less often. Text rendering, while not perfect, went from completely unusable to occasionally legible. And the model understands spatial relationships, lighting, and style prompts with a sophistication that SD 1.5 never achieved.
The architecture is different too. SDXL uses a two-stage process: a base model generates the initial image, then a refiner model adds detail. This is why it needs more VRAM than its predecessors (more on hardware in a moment), but the quality jump justifies it. The difference between SD 1.5 output and SDXL output is not incremental. It's generational. Think of it as the gap between a decent point-and-shoot and a proper full-frame camera. Both take photos. One of them takes photos you can actually use.
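The base-to-refiner handoff is usually expressed as a fraction of the denoising schedule (diffusers exposes this as `denoising_end` on the base pipeline and `denoising_start` on the refiner). A minimal sketch of the split, using 0.8 as an assumed typical handoff fraction:

```python
def split_denoising(total_steps: int, handoff: float) -> tuple[int, int]:
    """Split a denoising schedule between the SDXL base and refiner.

    `handoff` is the fraction of steps the base model runs before
    passing its still-noisy latent to the refiner.
    """
    base_steps = round(total_steps * handoff)
    return base_steps, total_steps - base_steps

# With 30 steps and a 0.8 handoff, the base runs 24 steps
# and the refiner finishes the last 6.
print(split_denoising(30, 0.8))
```

In practice the refiner receives the base model's latent, not a decoded image, which is why the two stages must share the same latent space.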
For production work specifically, the improvements that matter most are:
- Native resolution. 1024x1024 means less upscaling, fewer artifacts, and output that's actually usable for digital campaigns without immediately reaching for Real-ESRGAN.
- Prompt adherence. SDXL follows complex prompts more reliably. You describe a scene, you get something recognizably close to what you described. SD 1.5 sometimes felt like giving directions to someone wearing headphones.
- Style range. The model handles photorealism, illustration, and everything in between without the extreme prompt engineering that 1.5 demanded.
- Better training data curation. Stability AI was more deliberate about SDXL's training set, which shows in the output's reduced tendency toward uncanny artifacts.
The Tools: Automatic1111 vs. ComfyUI
If you're running models locally, you need a frontend. Right now, the dominant option is Automatic1111's Stable Diffusion WebUI. It has the largest community, the most extensions, and the most documentation. If you google a question about running Stable Diffusion locally, the answer assumes you're using A1111.
For good reason. A1111 is approachable. It's a web interface with sliders, text boxes, and dropdown menus. You can be generating images within minutes of installation if your hardware cooperates. The extension ecosystem is massive, covering everything from ControlNet to face restoration to tiling to prompt manipulation tools that would take pages to list. For someone getting started with local generation, A1111 is the correct choice.
But there's a newer tool gaining traction that I think will eventually become the professional standard: ComfyUI.
ComfyUI is a node-based interface. If you've ever used Nuke, Houdini, or even Unreal Engine's Blueprints, you'll understand the paradigm immediately. Instead of a linear interface where you fill in parameters and hit generate, you build a visual graph of nodes. Each node represents a step: load a model, encode a prompt, run a sampler, decode the latent, save the image. You connect them with wires.
This sounds more complicated. It is more complicated. That's the point.
Here's what a node-based workflow gives you that a linear interface can't:
Branching and merging. You can take a single generation, split it into multiple processing paths, apply different operations to each, and combine the results. Try doing that in A1111. You'll be clicking between tabs and saving intermediate files like it's 2005.
Reproducibility. A ComfyUI workflow is a graph. You can save it, share it, version-control it. When someone asks "how did you make that?" the answer is a JSON file, not a paragraph of instructions and screenshots. For production work where consistency matters, where a client needs the same look across fifty assets, this is not a nice-to-have. It's essential.
Transparency. Every step is visible. You can see exactly what's happening to your image at every stage of the pipeline. With A1111, the extensions sometimes interact in ways that are opaque. With ComfyUI, if something goes wrong, you can trace the graph and find exactly where.
Native SDXL support. ComfyUI handles the base-plus-refiner architecture naturally, because you literally wire the base model's output into the refiner's input. A1111 supports SDXL, but the two-stage workflow feels bolted on rather than native.
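The wiring is easiest to see in ComfyUI's API-format JSON, where each node references another node's output by id. This fragment is illustrative, not a complete workflow; node ids and exact input names here are assumptions and vary between ComfyUI versions:

```json
{
  "4": {
    "class_type": "KSamplerAdvanced",
    "inputs": {
      "model": ["base_checkpoint", 0],
      "latent_image": ["empty_latent", 0],
      "steps": 30, "start_at_step": 0, "end_at_step": 24,
      "return_with_leftover_noise": "enable"
    }
  },
  "5": {
    "class_type": "KSamplerAdvanced",
    "inputs": {
      "model": ["refiner_checkpoint", 0],
      "latent_image": ["4", 0],
      "steps": 30, "start_at_step": 24, "end_at_step": 30,
      "add_noise": "disable"
    }
  }
}
```

The refiner's `latent_image` input points at node `"4"`, the base sampler's output. That line is the entire two-stage architecture, made visible.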
I'm not abandoning A1111. I still use it for quick experiments and one-offs. But for building repeatable production workflows, I'm increasingly spending my time in ComfyUI. The learning curve is steeper. The payoff is worth it.
LoRA Training: Why Brand Consistency Demands Local
Here's where the local argument gets concrete.
LoRA (Low-Rank Adaptation) is a technique for fine-tuning a model on a small dataset. In practical terms: you take a couple dozen images of a specific product, character, or style, and you train a small adapter that teaches the model what that thing looks like. The result is a file, typically 10-200MB, that you can load alongside SDXL to generate images that are consistent with your training data.
For professional production, this is transformative. A client gives you their product photography or visual references. You train a LoRA on it. Now you can generate that product in any context, any lighting, any composition, while maintaining the visual identity. Need the same perfume bottle on a marble countertop at golden hour, on a glass shelf in a minimalist boutique, held by a hand against a desert sunset? Train the LoRA once, generate forever.
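The reason a LoRA file is so small is the low-rank factorization itself: instead of storing a full weight delta for each layer, it stores two thin matrices whose product reconstructs the delta at load time. A toy sketch with numpy (the dimensions are illustrative, not SDXL's actual layer sizes):

```python
import numpy as np

d, rank = 1024, 8                     # layer width and LoRA rank (illustrative)
A = np.random.randn(rank, d) * 0.01   # "down" projection, trained
B = np.random.randn(d, rank) * 0.01   # "up" projection, trained
alpha = 1.0                           # scaling factor

delta = alpha * (B @ A)               # full-size weight update, rebuilt on load
full_params = d * d
lora_params = 2 * d * rank
print(f"stored {lora_params} params instead of {full_params} "
      f"({lora_params / full_params:.1%})")
```

At inference the reconstructed delta is added to (or fused with) the base weights, which is why one base model can host many interchangeable LoRAs.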
But here's the thing: LoRA training requires your training data to live on whatever machine does the training. If you're using a cloud service for this, your client's unreleased product shots, their embargoed campaign imagery, their trade-secret packaging designs are sitting on someone else's infrastructure.
I work with clients in the Middle East who take IP protection seriously. Production companies and brands with legal teams that would have cardiac events if they discovered unreleased assets had been uploaded to a third-party AI service. Running your own hardware isn't paranoia. It's due diligence.
Kohya's sd-scripts (usually driven through the kohya_ss GUI) is the standard tool for LoRA training. It's not user-friendly. The documentation assumes you know what a learning rate scheduler is and have opinions about optimizers. The process involves selecting training images, creating caption files (either manually or with a BLIP/CLIP captioning tool), configuring training parameters, and then waiting while your GPU earns its electricity bill. A typical SDXL LoRA training run takes 30-90 minutes on a modern GPU, depending on dataset size and training steps.
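The caption-file step is the most mechanical part and the easiest to script: Kohya's trainers expect a text file sharing each image's basename. This sketch writes a uniform trigger-word caption for every image in a folder; the trigger word and caption template are assumptions you'd tailor per dataset:

```python
from pathlib import Path

IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp"}

def write_captions(image_dir: str, trigger: str) -> int:
    """Write a <basename>.txt caption next to each training image."""
    written = 0
    # sorted() materializes the listing before we start adding .txt files
    for img in sorted(Path(image_dir).iterdir()):
        if img.suffix.lower() in IMAGE_EXTS:
            img.with_suffix(".txt").write_text(
                f"{trigger}, product photo, studio lighting", encoding="utf-8")
            written += 1
    return written
```

In real datasets you'd vary the captions per image (that's what the BLIP/CLIP captioners are for), keeping only the trigger word constant.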
The result is worth the pain. I have LoRAs trained on specific visual styles and subjects that let me generate consistent imagery across projects. The alternative, fighting Midjourney's prompt system to achieve visual consistency across dozens of outputs, is like trying to conduct an orchestra by shouting general vibes at the musicians.
ControlNet: Compositional Control That Production Demands
ControlNet solves a problem that anyone doing production work hits immediately: AI generates beautiful images that have nothing to do with the layout you need.
You've locked a composition. The subject goes here, the text goes there, the figure faces this direction, the background has this depth of field. That's the plan. Now try getting Midjourney to hit that exact composition through prompt engineering alone. Good luck.
ControlNet takes a conditioning image (a depth map, a pose skeleton, a canny edge detection output, a scribble) and uses it to guide the generation. Draw a rough sketch of your layout, run it through a canny edge preprocessor, feed it to ControlNet, and SDXL will generate an image that follows your composition while filling in the creative details.
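The preprocessing step is just image-to-image: a photo or sketch in, a white-on-black edge map out, and that edge map becomes the conditioning image. In production you'd use a real Canny pass (OpenCV's `cv2.Canny`, or the preprocessor nodes bundled with the ControlNet extensions); this numpy-only sketch shows the idea with a crude gradient threshold:

```python
import numpy as np

def rough_edge_map(gray: np.ndarray, threshold: float = 0.2) -> np.ndarray:
    """Crude gradient-magnitude edge map; a stand-in for a proper Canny pass."""
    gy, gx = np.gradient(gray.astype(float))
    mag = np.hypot(gx, gy)
    if mag.max() > 0:
        mag = mag / mag.max()
    # ControlNet expects white edges on a black background
    return (mag > threshold).astype(np.uint8) * 255

# A hard vertical boundary produces a vertical line of edge pixels.
img = np.zeros((64, 64))
img[:, 32:] = 1.0
edges = rough_edge_map(img)
```

The generation then follows wherever those white pixels sit, which is what turns a rough layout sketch into an enforceable composition.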
The SDXL-compatible ControlNet models are still catching up (SD 1.5 has a more mature ControlNet ecosystem as I write this), but they're arriving quickly. And the combination of ControlNet for composition plus LoRA for brand consistency plus SDXL's improved base quality gets you closer to a production-ready pipeline than anything that existed six months ago.
In A1111, ControlNet is an extension. In ComfyUI, it's a set of nodes you wire into your graph. Both work. ComfyUI gives you more granular control over how and when the ControlNet conditioning is applied, which matters when you're stacking multiple ControlNet models (pose plus depth plus edge, for example).
The Hardware Question
Everyone asks: what GPU do I need?
For SDXL, the honest answer is: an NVIDIA GPU with at least 8GB of VRAM, and ideally 12GB or more. That means:
- RTX 3060 12GB: The budget entry point. It works. Generation times are slow (30-60 seconds per image at 1024x1024 with 30 steps), but it works. The 12GB VRAM is the key spec here, not the 3060 part.
- RTX 3090 / RTX 4090: The serious option. 24GB of VRAM means you can run SDXL with the refiner, stack ControlNet models, and still have headroom. Generation at 1024x1024 drops to 10-15 seconds on a 4090. If you're doing this professionally, this is what you want.
- RTX 4080 / 3080: 16GB on the 4080, 10-12GB on the 3080 variants. Workable, with some compromises on batch size and stacking complexity.
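The 8GB floor falls out of napkin math on the published parameter counts (roughly 3.5B for the base model and 6.6B for the full base-plus-refiner ensemble, per Stability AI's release announcement):

```python
base_params = 3.5e9       # SDXL base model, published approximate figure
ensemble_params = 6.6e9   # base + refiner pipeline, published approximate figure
bytes_per_param = 2       # fp16

base_gb = base_params * bytes_per_param / 1024**3
ensemble_gb = ensemble_params * bytes_per_param / 1024**3
print(f"base alone: {base_gb:.1f} GB; base + refiner: {ensemble_gb:.1f} GB")
```

So the base model's weights alone want roughly 6.5GB in fp16, before activations, the VAE decode, or any ControlNet stacking, and holding both stages resident pushes past 12GB. That's why 8GB cards work but end up swapping the refiner in and out, and why 24GB feels like headroom rather than luxury.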
AMD? Technically possible. Practically painful. The entire Stable Diffusion ecosystem is built on NVIDIA's CUDA. AMD support exists through ROCm, but you'll spend more time debugging driver issues than generating images. Maybe that changes. It hasn't yet.
LoRA training is more demanding. You can train on 8GB VRAM with aggressive memory optimization, but 12GB is comfortable and 24GB means you can train without thinking about it. Training is also where an RTX 4090 pays for itself, because what takes 90 minutes on a 3060 takes 25 minutes on a 4090.
Total investment for a serious local AI workstation: $1,500-3,000, depending on whether you build around a 3090 (used prices have dropped significantly as 40-series cards arrived) or a new 4090. That's real money. It's also less than three months of a Midjourney team subscription, and unlike a subscription, it doesn't disappear when you stop paying.
Why Midjourney Isn't Enough
I use Midjourney. I like Midjourney. V5.2 produces gorgeous images with minimal effort. For exploration, mood boarding, and concept development, it's unbeatable. The Discord interface is limiting, but the quality-to-effort ratio is the best in the industry.
But Midjourney has fundamental limitations for production work:
No reproducibility guarantees. Yes, you can use seeds. But Midjourney's model updates can change outputs for the same prompt and seed. Your carefully curated generation from last month might not reproduce today.
No fine-tuning. You cannot train Midjourney on your client's products. You're working with the model as-is. For one-off creative exploration, that's fine. For maintaining brand consistency across a campaign with 40 deliverables, it's a problem.
No pipeline integration. Midjourney lives in Discord (or now, on their alpha web interface). It doesn't plug into a ComfyUI workflow. It doesn't accept ControlNet conditioning. It doesn't fit into an automated production pipeline. Every image requires manual prompting and manual downloading.
Everything lives on their servers. Every prompt, every generation, every reference image you upload is on Midjourney's infrastructure. For personal projects, who cares. For client work under NDA, this is a non-starter.
No ControlNet, no compositional control. You can describe a composition in words. Words are imprecise. When the layout is approved and the product needs to be in the upper third of the frame facing camera left, "describe it in words and hope" is not a production methodology.
DALL-E 2 has similar limitations, with the additional constraint that OpenAI's content policies are aggressive enough to reject perfectly legitimate commercial concepts. Adobe Firefly is interesting for its "commercially safe" training data angle, but the output quality, as of today, isn't competitive with SDXL or Midjourney for most use cases.
The cloud services are good at what they're good at: fast, accessible, high-quality generation for people who need images and don't need to control the pipeline. That describes a lot of use cases. It doesn't describe professional creative production.
The Tension, and Where This Goes
There's a real tension between cloud convenience and local control, and pretending it doesn't exist would be dishonest.
Running local models means maintaining hardware, troubleshooting Python environments, keeping up with a community that moves at a pace that makes the JavaScript ecosystem look stable, and occasionally staring at a CUDA out-of-memory error at 2 AM wondering why you didn't just use Midjourney.
Cloud services mean someone else handles all of that. You type a prompt. You get an image. It works. The cost is that you surrender control over your pipeline, your data, your reproducibility, and your ability to customize.
For hobbyists, explorers, and people who need occasional AI imagery, cloud services are the right answer. For production studios, agencies, and anyone doing serious creative work at scale, local infrastructure isn't optional. It's the cost of doing the job properly.
I don't think the future is entirely local or entirely cloud. I think it's hybrid. Run your fine-tuned models and sensitive client work locally. Use cloud services for exploration and concepting. Use GPT-4 or Claude 2 for the text side of your creative process. Use Runway Gen-2 for quick video experiments. Use Midjourney for mood boards. Use your own hardware for everything that touches a client's IP.
Meta's Llama 2 proved that open models can compete with closed ones in the LLM space. SDXL proved it for image generation. The trend is clear: open models are getting good enough that running your own infrastructure isn't a compromise. It's a competitive advantage.
The tools are here. The hardware is affordable (not cheap, but affordable for a professional tool). The community is building at an extraordinary pace. The question isn't whether local AI belongs in your production pipeline. It's how long you can afford to pretend it doesn't.
Omar Kamel is an AI creative lead with two decades of experience in TV, film, and content production, based in Dubai.