llama.cpp on Windows: Your Complete Guide to Local Python Code Completion
Stop paying for GitHub Copilot. Stop sending your code to the cloud. Run a powerful code completion AI entirely on your Windows laptop — free, private, and offline.
This guide focuses on one specific goal: getting llama.cpp running on Windows for Python code completions. No fluff. Just step-by-step instructions that work.
Why llama.cpp for Code Completion?
Before we dive in, let’s be clear about what you’re building:
✅ What you get:
• AI suggests code as you type (like Copilot)
• Works entirely offline — no internet needed
• Your code never leaves your machine
• Completely free (no subscription)
• Runs on consumer hardware (8GB+ RAM recommended)
❌ What you don't get:
• Cloud sync across devices
• Team collaboration features
• The absolute latest models (local models lag by 6-12 months)
• Perfect suggestions (local models are smaller but still capable)
The Trade-off: Local code completion uses smaller models (3B-7B parameters) compared to cloud alternatives. But modern small models are incredibly capable for code completion.
Prerequisites
Hardware Requirements
- RAM: 8GB minimum, 16GB+ recommended
- Storage: 10GB free (models are 2-8GB each)
- CPU: Any modern CPU, 4+ cores recommended
- GPU: Optional; an NVIDIA card with 6GB+ VRAM gives a 5-10x speedup (a quick hardware check script follows this list)
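Not sure what your machine has? Here is a minimal check (the check_hardware.py name is just a suggestion) using only the Python standard library. It is Windows-only because it calls the Win32 memory API, and it assumes Python from Step 1 is already installed:
# check_hardware.py: report CPU core count and total RAM on Windows
import ctypes
import os

class MEMORYSTATUSEX(ctypes.Structure):
    _fields_ = [
        ("dwLength", ctypes.c_ulong),
        ("dwMemoryLoad", ctypes.c_ulong),
        ("ullTotalPhys", ctypes.c_ulonglong),
        ("ullAvailPhys", ctypes.c_ulonglong),
        ("ullTotalPageFile", ctypes.c_ulonglong),
        ("ullAvailPageFile", ctypes.c_ulonglong),
        ("ullTotalVirtual", ctypes.c_ulonglong),
        ("ullAvailVirtual", ctypes.c_ulonglong),
        ("ullAvailExtendedVirtual", ctypes.c_ulonglong),
    ]

stat = MEMORYSTATUSEX()
stat.dwLength = ctypes.sizeof(stat)
ctypes.windll.kernel32.GlobalMemoryStatusEx(ctypes.byref(stat))

print(f"CPU cores: {os.cpu_count()}")
print(f"Total RAM: {stat.ullTotalPhys / 2**30:.1f} GB")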
Software Requirements
- Windows 10 or 11 (64-bit)
- Python 3.9+
- VS Code (recommended) or PyCharm
Step-by-Step Setup
Step 1: Install Python
# Check if Python is installed:
python --version
# If not installed, download from python.org
# IMPORTANT: Check "Add Python to PATH" during installation
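If several Python installations are on the machine, a two-line script (name it anything, e.g. check_python.py) confirms that the interpreter on PATH meets the 3.9+ requirement:
# check_python.py: confirm the interpreter on PATH is new enough
import sys

print(sys.version)
assert sys.version_info >= (3, 9), "Python 3.9 or newer is required"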
Step 2: Create Project Folder
mkdir C:\llm-code-helper
cd C:\llm-code-helper
# Create virtual environment
python -m venv venv
venv\Scripts\activate
Step 3: Install llama-cpp-python
pip install llama-cpp-python
# For NVIDIA GPU support:
# pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
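To confirm the install worked, a quick import check (run it inside the activated virtual environment; the script name is arbitrary):
# check_install.py: verify that llama-cpp-python is importable
import llama_cpp

print("llama-cpp-python version:", llama_cpp.__version__)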
Step 4: Download a Code Model
# Install huggingface CLI
pip install huggingface-hub
# Download CodeLlama-7B-Python (recommended for beginners)
huggingface-cli download TheBloke/CodeLlama-7B-Python-GGUF codellama-7b-python.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
# Alternative: Download DeepSeek-Coder-6.7B
# huggingface-cli download TheBloke/DeepSeek-Coder-6.7B-GGUF deepseek-coder-6.7b.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
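If the CLI gives you trouble, the same file can be fetched from Python with the huggingface_hub API, using the same repo and filename as the command above:
# download_model.py: fetch the GGUF file via the huggingface_hub Python API
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="TheBloke/CodeLlama-7B-Python-GGUF",
    filename="codellama-7b-python.Q4_K_M.gguf",
    local_dir=".",   # save next to your scripts
)
print("Model saved to:", path)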
Model Size Reference:
- CodeLlama-7B-Python: ~4GB, needs 8GB RAM
- DeepSeek-Coder-6.7B: ~4GB, excellent Python performance
- Phi-2 (2.7B): ~1.6GB, for low-end hardware
Step 5: Test the Model
# Create test_llm.py
from llama_cpp import Llama

print("Loading model...")
llm = Llama(
    model_path="codellama-7b-python.Q4_K_M.gguf",
    n_ctx=4096,      # context window size
    n_threads=4,     # CPU threads used for generation
    verbose=False
)

print("Generating completion...")
prompt = '''def calculate_fibonacci(n):
    """Calculate the nth Fibonacci number."""'''
output = llm(prompt, max_tokens=200, temperature=0.2)
print(output['choices'][0]['text'])
Run: python test_llm.py
Step 6: Create Completion Server
# completion_server.py
from llama_cpp import Llama
from http.server import HTTPServer, BaseHTTPRequestHandler
import json

print("Loading model...")
llm = Llama(
    model_path="codellama-7b-python.Q4_K_M.gguf",
    n_ctx=4096,
    n_threads=4,
    verbose=False
)
print("Model loaded!")

class CompletionHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != '/complete':
            self.send_error(404)
            return
        content_length = int(self.headers['Content-Length'])
        post_data = self.rfile.read(content_length)
        data = json.loads(post_data)
        code = data.get('code', '')
        cursor_pos = data.get('cursor_pos', len(code))
        # Everything before the cursor becomes the prompt
        prompt = f"Complete this Python code:\n{code[:cursor_pos]}\n"
        output = llm(prompt, max_tokens=100, temperature=0.1)
        completion = output['choices'][0]['text']
        self.send_response(200)
        self.send_header('Content-type', 'application/json')
        self.end_headers()
        self.wfile.write(json.dumps({'completion': completion}).encode())

server = HTTPServer(('localhost', 5000), CompletionHandler)
print("Server running at http://localhost:5000")
server.serve_forever()
Run: python completion_server.py
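Before wiring up an editor, you can check that the server responds with a small standard-library client (a sketch; it assumes the server above is running on port 5000, and the test_client.py name is just a suggestion):
# test_client.py: send one request to the local completion server
import json
import urllib.request

payload = json.dumps({"code": "def add(a, b):\n    ", "cursor_pos": 19}).encode()
req = urllib.request.Request(
    "http://localhost:5000/complete",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["completion"])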
Step 7: VS Code Integration
Option 1: Continue Extension (Recommended)
- Install “Continue” extension in VS Code
- Open Continue settings → config.json
- Add configuration for local model
{
  "models": [{
    "title": "Local CodeLlama",
    "provider": "llama.cpp",
    "apiBase": "http://localhost:5000"
  }]
}
Note: Continue's llama.cpp provider is written against the native llama.cpp server API, not the minimal /complete endpoint from Step 6. If suggestions never appear, run the official llama.cpp server (the llama-server binary, which defaults to port 8080) and point apiBase at it instead.
Usage: Press Ctrl+L to open the Continue chat, or start typing for inline autocomplete suggestions.
Optimization Tips
Speed Up Generation
# Use more threads (match your CPU cores)
n_threads=8
# Larger batch size
n_batch=512
# Lower temperature for more deterministic, focused output
temperature=0.1
# Limit max tokens
max_tokens=50
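Putting these together: load-time options such as n_threads and n_batch belong in the Llama() constructor, while temperature and max_tokens are per-call arguments. A sketch of a tuned setup reusing the model file from Step 4 (the example prompt is arbitrary):
from llama_cpp import Llama

llm = Llama(
    model_path="codellama-7b-python.Q4_K_M.gguf",
    n_ctx=4096,
    n_threads=8,        # match your physical core count
    n_batch=512,        # larger batches speed up prompt processing
    # n_gpu_layers=-1,  # uncomment if you installed the CUDA build to offload all layers
    verbose=False
)

output = llm("def parse_config(path):", max_tokens=50, temperature=0.1)
print(output["choices"][0]["text"])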
Reduce Memory
# Use smaller quantization:
# Q4_K_M → Q4_0 → Q3_K_M (smaller, faster, lower quality)
# Or smaller models:
# Phi-2 (1.6GB) for very low-end hardware
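In code, the only changes are the file model_path points at and, optionally, a smaller n_ctx, since the context window also consumes RAM. A sketch assuming a Phi-2 GGUF has been downloaded (the filename is illustrative; use whatever you actually downloaded):
from llama_cpp import Llama

llm = Llama(
    model_path="phi-2.Q4_K_M.gguf",  # illustrative filename; use your actual file
    n_ctx=2048,                      # smaller context window means a smaller KV cache in RAM
    n_threads=4,
    verbose=False
)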
Quick Reference
# Activate environment
cd C:\llm-code-helper
venv\Scripts\activate
# Start server
python completion_server.py
# Test completion
curl -X POST http://localhost:5000/complete -H "Content-Type: application/json" -d "{\"code\": \"def \", \"cursor_pos\": 4}"
# Stop server
Ctrl + C
Closing Thought
You now have AI-powered code completion running entirely on your Windows laptop. No subscriptions. No data leaving your machine. Just you and your code, with a helpful AI assistant that works offline.
It’s not perfect — local models are smaller than cloud alternatives. But for day-to-day Python development, it’s surprisingly capable. And the privacy? That’s priceless.
Part of the LLM Tools series under AI Foundations.