Most developers hit the same wall: Cursor AI is brilliant, but the moment you’re coding something sensitive — proprietary logic, client data, internal tooling — sending every prompt to Anthropic or OpenAI servers feels wrong. And honestly, it should.
Running Cursor AI with Ollama locally fixes that completely. Your code never leaves your machine. No API bills. No rate limits cutting you off mid-session.
Here’s the exact setup, what actually works in June 2026, and the mistakes that waste your afternoon.
Why Local First Makes Sense for Cursor in 2026
Before the how, you need to understand why this setup works the way it does — because if you skip this part, you’ll misconfigure things and wonder why responses are slow or broken.
Cursor AI isn’t a standalone LLM. It’s a code editor built on top of VS Code that routes your prompts to AI backends. By default, those backends are cloud models — Claude 3.5 Sonnet, GPT-4o, stuff like that. What most people don’t realize is Cursor also supports custom OpenAI-compatible API endpoints. That’s the door Ollama walks through.
Ollama runs local language models on your own hardware and exposes them through a local server at http://localhost:11434. That server talks the same API language as OpenAI. So Cursor thinks it’s hitting an OpenAI endpoint — except the endpoint is your own machine, serving a model you downloaded yourself.
This is why the setup works at all. You’re not hacking Cursor. You’re using a feature it already has, just pointed at a local target instead of a cloud one.
The tradeoff? Local models are slower and generally less capable than GPT-4o or Claude Sonnet for complex reasoning. But for autocomplete, refactoring, explaining code, and writing boilerplate — models like Qwen2.5-Coder 7B or DeepSeek-Coder-V2 are genuinely competitive. I’ve been coding full projects this way for months and the gap is smaller than you’d think for day-to-day work.
What You Actually Need Before Starting
Don’t skip this. Half the “it’s not working” posts on Reddit come from people who jumped into the setup without checking these first.
Hardware baseline:
- At least 8GB RAM for a 7B model (16GB is more comfortable)
- 16GB+ for anything in the 13B range
- An Apple Silicon Mac (M1/M2/M3/M4) or a machine with a dedicated NVIDIA GPU will give you usable speed — CPU-only on Intel/AMD works but expect 15–30 second waits per response
Software you need installed:
- Cursor AI (download from cursor.com — you need version 0.40 or later for reliable custom endpoint support)
- Ollama (ollama.com — install takes under two minutes)
- That’s it
Models worth downloading: For coding specifically, don’t just grab whatever’s trending. The models that actually perform well inside Cursor for code tasks in 2026:
qwen2.5-coder:7b— Best balance of speed and code quality for most machinesdeepseek-coder-v2:16b— Better output, needs 16GB+ RAMcodellama:13b— Solid fallback, widely tested with Cursorllama3.1:8b— Good for chat/explain tasks, weaker on complex code generation
I’ve run all four. For most people on a MacBook Pro M2 or M3, qwen2.5-coder:7b hits the sweet spot. DeepSeek-Coder-V2 is noticeably better at reasoning through multi-file problems, but the wait time goes up.
Step 1 — Install Ollama and Pull Your Model
Go to ollama.com and install for your OS. Mac gets a .dmg, Linux gets a one-line curl command, Windows gets an installer.
Once installed, open your terminal and run:
ollama pull qwen2.5-coder:7b
This downloads the model. Size is around 4.7GB for the 7B version, so give it a few minutes depending on your connection.
When it’s done, start Ollama’s local server:
ollama serve
You’ll see output confirming it’s running on port 11434. Keep this terminal open — Ollama needs to stay running while you use Cursor.
Quick sanity check before touching Cursor. Run this:
curl http://localhost:11434/api/generate -d '{
"model": "qwen2.5-coder:7b",
"prompt": "Write a Python hello world",
"stream": false
}'
If you get a JSON response with code in it, Ollama is working. If you get a connection refused error, Ollama isn’t running — go back and run ollama serve again.
Step 2 — Configure Cursor to Use Ollama
Open Cursor. Go to Settings (gear icon bottom left, or Cmd+, on Mac / Ctrl+, on Windows).
Navigate to Models in the left sidebar.
You’re looking for two things:
- The OpenAI API Key field
- The Override OpenAI Base URL field (sometimes labeled “OpenAI Base URL” depending on your Cursor version)
Here’s what you enter:
API Key: ollama (literally just type the word ollama — Ollama doesn’t use real API keys but Cursor requires something in this field)
Base URL: http://localhost:11434/v1
The /v1 at the end matters. Without it, the API calls won’t format correctly and you’ll get errors or empty responses.
Now scroll down to the model list. You need to add your Ollama model here. Click “Add Model” and type the exact model name you pulled — qwen2.5-coder:7b in this example. Make sure it matches exactly what you used in the ollama pull command.
Save settings.
Step 3 — Test It Inside Cursor
Open any project in Cursor. Open the AI chat panel (Cmd+L or Ctrl+L).
In the model selector dropdown at the top of the chat panel, switch to the model you just added. It should appear in the list.
Type something simple: “Explain what this file does” and point it at a file you have open.
If it works, you’ll see a response streaming in from your local model. No internet required. No API costs.
If you get an error like “model not found” — double check that the model name in Cursor settings exactly matches what Ollama has installed. Run ollama list in your terminal to see exactly what’s available.
The Part That Trips Most People Up
Real talk: the most common failure point isn’t the setup — it’s the model switcher in Cursor.
Cursor has two different AI modes: the Chat panel (Cmd+L) and Inline Edit (Cmd+K). They can use different models. If you set your Ollama model in chat but forget to switch it in inline edit, you’ll get errors or it’ll default back to a cloud model.
Check both. In the inline edit popup, there’s a small model selector in the top right corner of that floating panel. Switch that one too.
The second common issue: Ollama stops serving if you close the terminal or your machine sleeps. You’ll see a “connection refused” error in Cursor. The fix is just to open terminal and run ollama serve again. If you want it to run automatically on startup, on Mac you can add it as a login item or run ollama serve & in your shell profile.
Making Autocomplete Work Locally (Tab Completion)
This is where it gets interesting — and where most guides just stop.
Cursor’s tab autocomplete (the inline ghost-text suggestions as you type) is a separate system from the chat panel. By default, it uses Cursor’s own cloud model. Getting that to use a local model requires one more step.
Go to Cursor Settings → Features → Cursor Tab.
You’ll see an option for the completion model. In some versions of Cursor, you can override this with your custom model. In others (particularly older builds), this option is locked to Cursor’s hosted model.
As of June 2026, Cursor version 0.44+ allows you to use custom models for tab completion. If you’re on an older version, update first.
Set the completion model to your Ollama model the same way you did for chat. The latency will be higher than cloud completions — expect 1–3 seconds instead of near-instant. Some people find this annoying and keep tab completion on the cloud model while routing chat/inline edit locally. That’s a completely valid hybrid setup.
For privacy-sensitive work, I turn off tab completion entirely (Settings → Features → disable Cursor Tab) and only use local models for explicit chat and inline edit commands. This gives you the best of both: no passive data leakage, full control over what gets sent where.
Choosing the Right Model for Your Use Case
Not all local models perform the same inside Cursor. Here’s what I’ve actually found after running these setups across different project types:
For React/Next.js/TypeScript: qwen2.5-coder:7b handles JSX and TypeScript types well. It occasionally hallucinates import paths but catches itself if you ask it to review its own output.
For Python data science / ML code: deepseek-coder-v2:16b is noticeably better here. Python is heavily represented in its training data and it handles NumPy, pandas, and scikit-learn idioms cleanly.
For explaining legacy code you inherited: llama3.1:8b is actually decent for this — it’s better at natural language reasoning and will give you readable explanations instead of just rewriting the code.
For Rust or Go: Honestly, the local models struggle more here. The specialized syntax and strict compiler requirements mean you’ll get more errors. Not unusable, but expect to correct more output than with Python or JS.
If you’re working on a privacy-sensitive project but need better code quality, the local models with Cursor guide covers a few additional model options worth looking at.
Performance: What to Actually Expect
I want to be straight with you here because most guides paint an unrealistically rosy picture.
On an M2 MacBook Pro 16GB, qwen2.5-coder:7b generates responses in roughly 3–8 seconds for typical coding prompts. Tab completion latency is around 1.5–2.5 seconds. That’s noticeable compared to cloud models, but not workflow-breaking once you adjust.
On an M3 Max with 36GB RAM, you can comfortably run deepseek-coder-v2:16b and get responses in 4–10 seconds. The quality jump is real for complex refactoring tasks.
On a Windows machine with an RTX 3080 (10GB VRAM): qwen2.5-coder:7b runs in roughly 2–4 seconds — GPU acceleration makes a real difference. Ollama automatically detects and uses CUDA if it’s available.
CPU-only Intel/AMD machines: Expect 20–60 seconds per response on 7B models. Functional, but you’ll feel every second. If you’re in this situation, keep your prompts tight and specific rather than open-ended.
The honest downside? For complex multi-file refactoring or architecture decisions, cloud models like Claude Sonnet still beat local 7B models by a wide margin. Local is for privacy, cost control, and offline use — not raw capability. If you need both, Cursor lets you switch models per-session, so you can go local for sensitive files and cloud for the tough reasoning tasks.
Privacy: What’s Actually Happening to Your Code
Here’s what most setup guides don’t explain clearly.
When you run Ollama locally and point Cursor at it, your prompts go from Cursor → localhost:11434 → Ollama → back to Cursor. Nothing hits the internet. Ollama has no telemetry on your prompts.
But Cursor itself still runs. And Cursor does have its own telemetry about editor usage — not your code content, but usage patterns. If you want zero telemetry even from the editor itself, you’d need to look at something like VS Code with a local LLM extension instead (Continue.dev is the popular option here).
For most developers working on client code or internal tools, running Ollama locally through Cursor is sufficient. Your actual code content and prompts don’t leave the machine. That’s the critical bit.
Cursor also has a Privacy Mode option in Settings. Enabling this tells Cursor not to store your code on their servers. With Privacy Mode on + Ollama local model, you get meaningful privacy without switching editors entirely.
If you’re curious about how other privacy-respecting AI tools compare, Venice AI’s approach to uncensored local models is worth a look for non-coding tasks.
Troubleshooting the Most Common Errors
“Model not found” in Cursor chat: Run ollama list in terminal. Copy the exact model name shown there. Paste it into Cursor’s model settings. One typo breaks it.
Empty responses / spinner that never stops: Usually means Ollama isn’t running. Open terminal, run ollama serve. Also check that your Base URL is http://localhost:11434/v1 with the /v1 — missing this is a very common mistake.
“Connection refused” error: Same as above — Ollama stopped. Restart it. If it keeps stopping, check if your machine’s firewall is blocking localhost connections (rare but happens on some Windows setups).
Responses start then cut off mid-sentence: This is usually a context length issue. The model hit its token limit. Try with a smaller code snippet, or pull a model with a longer context window. qwen2.5-coder:7b supports up to 32K context, but Ollama’s default max_tokens setting can be lower. You can set this in Cursor’s model config or in Ollama’s Modelfile.
Extremely slow responses even on good hardware: Check if Ollama is using GPU. Run ollama ps while a generation is happening. You’ll see memory usage split between GPU and CPU. If everything is on CPU and you have a GPU, make sure your CUDA/Metal drivers are current. On Mac, this shouldn’t be an issue — Ollama uses Metal automatically.
Cursor reverts to cloud model after restart: This happens if you don’t save your model settings properly. In Cursor Settings, after adding the model and entering your Base URL, make sure you hit Save or click out of the field before closing settings. Sometimes it doesn’t persist if you close the window mid-edit.
Running Multiple Models: When and Why
Once the basic setup is working, you might want more than one model available. You can pull multiple models with Ollama and they all show up in your Cursor model list once added.
Practical reasons to do this:
Keep a fast 7B for autocomplete-style tasks and a larger 13B or 16B for bigger refactoring jobs. Switch manually based on what you need.
Keep a general-purpose model alongside a code-specific one. When I need to write documentation or explain something in plain English, I’ll switch to llama3.1:8b — it’s better at natural language. For actual code writing, back to qwen2.5-coder:7b.
Switching between them takes two seconds in the model dropdown. There’s no restart needed, no reconfiguration. Ollama loads the model into memory when you first use it, then keeps it cached until you run something else.
One thing to know: Ollama can only run one model at a time by default. If you have qwen2.5-coder:7b loaded and then request llama3.1:8b, it unloads the first one. This takes a few seconds. Not a big deal in practice, but worth knowing so you don’t think something broke.
The Hybrid Setup (Local + Cloud, Strategically)
Here’s an approach that a lot of experienced Cursor users land on after a while.
You don’t have to go all-in on local. Cursor lets you have both configured simultaneously and switch per-session.
The workflow I use:
- Local Ollama model for anything involving client code, proprietary logic, or sensitive data
- Cloud model (Claude Sonnet or GPT-4o) for complex architecture decisions, debugging novel problems, or when I need the best possible reasoning and the code isn’t sensitive
This way you’re not sacrificing quality on the hard problems while still protecting the stuff that matters.
If you’re thinking about building more private, local AI workflows beyond just coding, uncensored local AI image generation setups follow a similar philosophy — local for privacy, cloud for capability when needed.
Keeping Everything Updated
Ollama updates are simple:
ollama pull qwen2.5-coder:7b
Running this again pulls the latest version of the model if one is available. It won’t re-download if nothing changed.
For Cursor, just update through the app’s built-in updater. Custom endpoint settings persist through updates, so you won’t need to reconfigure.
One thing to watch: Cursor updates occasionally change how custom models are configured in the UI. If something breaks after a Cursor update, the first thing to check is whether the Base URL field or model name field got reset. Takes 30 seconds to verify and fix if it happened.
What This Setup Won’t Do
Worth being honest about the limits:
It won’t give you Cursor’s full context awareness on local models. Some of Cursor’s more advanced features — like codebase indexing and multi-file understanding — still route certain metadata through Cursor’s servers. With Privacy Mode on, this is minimized, but the editor experience with local models isn’t 100% identical to the cloud version.
It won’t match GPT-4o on hard reasoning tasks. A 7B parameter model running on your laptop is not as capable as a frontier model running on a data center. For most daily coding tasks, the gap is manageable. For “architect this entire system from scratch” type requests, you’ll notice the difference.
It won’t work offline if Ollama isn’t running. The model runs locally, but Cursor still needs Ollama’s local server active. If you’re coding on a plane and forgot to start Ollama before takeoff, you’ll fall back to cloud or get errors. Get in the habit of starting Ollama when you start your machine.
Advanced: Customizing Ollama Models for Code Tasks
If you want to squeeze more out of your local models, Ollama lets you create custom Modelfiles — basically configuration files that set system prompts, context length, and generation parameters for a specific model.
Create a file called Modelfile in any directory:
FROM qwen2.5-coder:7b
SYSTEM """
You are a senior software engineer. When writing code:
- Always include error handling
- Prefer readability over cleverness
- Add brief comments only where logic isn't obvious
- Ask clarifying questions if requirements are ambiguous
"""
PARAMETER num_ctx 16384
PARAMETER temperature 0.2
Then run:
ollama create cursor-coder -f Modelfile
Now cursor-coder appears as a model option in Ollama. Add it to Cursor the same way you added your original model. Lower temperature (0.1–0.3) gives you more consistent, deterministic code output — good for coding tasks. Higher temperature (0.7–0.9) is better for brainstorming or documentation.
The system prompt here is the thing most people skip and then wonder why their local model gives generic responses. Giving it a strong coding-focused system prompt meaningfully improves output quality. Tested this across probably 50+ sessions — the difference with a well-crafted system prompt vs. the default is noticeable within the first few responses.
One More Thing Before You Finish
If you want to see what else is possible with local AI models beyond coding, the Agent Zero AI setup guide covers running autonomous AI agents locally — same privacy-first philosophy, extended to actual task automation.
And if you’re comparing local model options for other use cases beyond code, Grok’s free tier and Venice AI’s free plan are worth knowing about for when local compute isn’t the right fit.
Start with ollama pull qwen2.5-coder:7b, point Cursor at http://localhost:11434/v1, add the model in settings, and you’re running. Everything else in this guide is optimization — but that core setup takes under ten minutes and works on the first try if you follow the steps in order.

