llama.cpp on Windows: Your Complete Guide to Local Python Code Completion
Stop paying for GitHub Copilot. Stop sending your code to the cloud. Run a powerful code completion AI entirely on your Windows laptop — free, private, and offline.
This guide focuses on one specific goal: getting llama.cpp running on Windows for Python code completions. No fluff. Just step-by-step instructions that work.
Why llama.cpp for Code Completion?
Before we dive in, let’s be clear about what you’re building:
✅ What you get:
• AI suggests code as you type (like Copilot)
• Works entirely offline — no internet needed
• Your code never leaves your machine
• Completely free (no subscription)
• Runs on consumer hardware (8GB+ RAM recommended)
❌ What you don't get:
• Cloud sync across devices
• Team collaboration features
• The absolute latest models (local models lag by 6-12 months)
• Perfect suggestions (local models are smaller but still capable)
The Trade-off: Local code completion uses smaller models (3B-7B parameters) compared to cloud alternatives. But modern small models are incredibly capable for code completion.
Prerequisites
Hardware Requirements
- RAM: 8GB minimum, 16GB+ recommended
- Storage: 10GB free (models are 2-8GB each)
- CPU: Any modern CPU, 4+ cores recommended
- GPU: Optional; an NVIDIA card with 6GB+ VRAM gives a 5-10x speedup (a quick hardware check script follows this list)
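Not sure what your machine has? Here is a minimal check (the check_hardware.py name is just a suggestion) using only the Python standard library. It is Windows-only because it calls the Win32 memory API, and it assumes Python from Step 1 is already installed:
# check_hardware.py: report CPU core count and total RAM on Windows
import ctypes
import os

class MEMORYSTATUSEX(ctypes.Structure):
    _fields_ = [
        ("dwLength", ctypes.c_ulong),
        ("dwMemoryLoad", ctypes.c_ulong),
        ("ullTotalPhys", ctypes.c_ulonglong),
        ("ullAvailPhys", ctypes.c_ulonglong),
        ("ullTotalPageFile", ctypes.c_ulonglong),
        ("ullAvailPageFile", ctypes.c_ulonglong),
        ("ullTotalVirtual", ctypes.c_ulonglong),
        ("ullAvailVirtual", ctypes.c_ulonglong),
        ("ullAvailExtendedVirtual", ctypes.c_ulonglong),
    ]

stat = MEMORYSTATUSEX()
stat.dwLength = ctypes.sizeof(stat)
ctypes.windll.kernel32.GlobalMemoryStatusEx(ctypes.byref(stat))

print(f"CPU cores: {os.cpu_count()}")
print(f"Total RAM: {stat.ullTotalPhys / 2**30:.1f} GB")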
Software Requirements
- Windows 10 or 11 (64-bit)
- Python 3.9+
- VS Code (recommended) or PyCharm
Step-by-Step Setup
Step 1: Install Python
# Check if Python is installed:
python --version
# If not installed, download from python.org
# IMPORTANT: Check "Add Python to PATH" during installation
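If several Python installations are on the machine, a two-line script (name it anything, e.g. check_python.py) confirms that the interpreter on PATH meets the 3.9+ requirement:
# check_python.py: confirm the interpreter on PATH is new enough
import sys

print(sys.version)
assert sys.version_info >= (3, 9), "Python 3.9 or newer is required"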
Step 2: Create Project Folder
mkdir C:\llm-code-helper
cd C:\llm-code-helper
# Create virtual environment
python -m venv venv
venv\Scripts\activate
Step 3: Install llama-cpp-python
pip install llama-cpp-python
# For NVIDIA GPU support:
# pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
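To confirm the install worked, a quick import check (run it inside the activated virtual environment; the script name is arbitrary):
# check_install.py: verify that llama-cpp-python is importable
import llama_cpp

print("llama-cpp-python version:", llama_cpp.__version__)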
Step 4: Download a Code Model
# Install huggingface CLI
pip install huggingface-hub
# Download CodeLlama-7B-Python (recommended for beginners)
huggingface-cli download TheBloke/CodeLlama-7B-Python-GGUF codellama-7b-python.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
# Alternative: Download DeepSeek-Coder-6.7B
# huggingface-cli download TheBloke/DeepSeek-Coder-6.7B-GGUF deepseek-coder-6.7b.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
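If the CLI gives you trouble, the same file can be fetched from Python with the huggingface_hub API, using the same repo and filename as the command above:
# download_model.py: fetch the GGUF file via the huggingface_hub Python API
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="TheBloke/CodeLlama-7B-Python-GGUF",
    filename="codellama-7b-python.Q4_K_M.gguf",
    local_dir=".",   # save next to your scripts
)
print("Model saved to:", path)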
Model Size Reference:
- CodeLlama-7B-Python: ~4GB, needs 8GB RAM
- DeepSeek-Coder-6.7B: ~4GB, excellent Python performance
- Phi-2 (2.7B): ~1.6GB, for low-end hardware
Step 5: Test the Model
# Create test_llm.py
from llama_cpp import Llama

print("Loading model...")
llm = Llama(
    model_path="codellama-7b-python.Q4_K_M.gguf",
    n_ctx=4096,      # context window size
    n_threads=4,     # CPU threads used for generation
    verbose=False
)

print("Generating completion...")
prompt = '''def calculate_fibonacci(n):
    """Calculate the nth Fibonacci number."""'''
output = llm(prompt, max_tokens=200, temperature=0.2)
print(output['choices'][0]['text'])
Run: python test_llm.py
Step 6: Create Completion Server
# completion_server.py
from llama_cpp import Llama
from http.server import HTTPServer, BaseHTTPRequestHandler
import json

print("Loading model...")
llm = Llama(
    model_path="codellama-7b-python.Q4_K_M.gguf",
    n_ctx=4096,
    n_threads=4,
    verbose=False
)
print("Model loaded!")

class CompletionHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != '/complete':
            self.send_error(404)
            return
        content_length = int(self.headers['Content-Length'])
        post_data = self.rfile.read(content_length)
        data = json.loads(post_data)
        code = data.get('code', '')
        cursor_pos = data.get('cursor_pos', len(code))
        # Everything before the cursor becomes the prompt
        prompt = f"Complete this Python code:\n{code[:cursor_pos]}\n"
        output = llm(prompt, max_tokens=100, temperature=0.1)
        completion = output['choices'][0]['text']
        self.send_response(200)
        self.send_header('Content-type', 'application/json')
        self.end_headers()
        self.wfile.write(json.dumps({'completion': completion}).encode())

server = HTTPServer(('localhost', 5000), CompletionHandler)
print("Server running at http://localhost:5000")
server.serve_forever()
Run: python completion_server.py
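Before wiring up an editor, you can check that the server responds with a small standard-library client (a sketch; it assumes the server above is running on port 5000, and the test_client.py name is just a suggestion):
# test_client.py: send one request to the local completion server
import json
import urllib.request

payload = json.dumps({"code": "def add(a, b):\n    ", "cursor_pos": 19}).encode()
req = urllib.request.Request(
    "http://localhost:5000/complete",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["completion"])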
Step 7: VS Code Integration
Option 1: Continue Extension (Recommended)
- Install “Continue” extension in VS Code
- Open Continue settings → config.json
- Add configuration for local model
{
  "models": [{
    "title": "Local CodeLlama",
    "provider": "llama.cpp",
    "apiBase": "http://localhost:5000"
  }]
}
Note: Continue's llama.cpp provider is written against the native llama.cpp server API, not the minimal /complete endpoint from Step 6. If suggestions never appear, run the official llama.cpp server (the llama-server binary, which defaults to port 8080) and point apiBase at it instead.
Usage: Press Ctrl+L to open the Continue chat, or start typing for inline autocomplete suggestions.
Optimization Tips
Speed Up Generation
# Use more threads (match your CPU cores)
n_threads=8
# Larger batch size
n_batch=512
# Lower temperature for more deterministic, focused output
temperature=0.1
# Limit max tokens
max_tokens=50
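Putting these together: load-time options such as n_threads and n_batch belong in the Llama() constructor, while temperature and max_tokens are per-call arguments. A sketch of a tuned setup reusing the model file from Step 4 (the example prompt is arbitrary):
from llama_cpp import Llama

llm = Llama(
    model_path="codellama-7b-python.Q4_K_M.gguf",
    n_ctx=4096,
    n_threads=8,        # match your physical core count
    n_batch=512,        # larger batches speed up prompt processing
    # n_gpu_layers=-1,  # uncomment if you installed the CUDA build to offload all layers
    verbose=False
)

output = llm("def parse_config(path):", max_tokens=50, temperature=0.1)
print(output["choices"][0]["text"])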
Reduce Memory
# Use smaller quantization:
# Q4_K_M → Q4_0 → Q3_K_M (smaller, faster, lower quality)
# Or smaller models:
# Phi-2 (1.6GB) for very low-end hardware
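In code, the only changes are the file model_path points at and, optionally, a smaller n_ctx, since the context window also consumes RAM. A sketch assuming a Phi-2 GGUF has been downloaded (the filename is illustrative; use whatever you actually downloaded):
from llama_cpp import Llama

llm = Llama(
    model_path="phi-2.Q4_K_M.gguf",  # illustrative filename; use your actual file
    n_ctx=2048,                      # smaller context window means a smaller KV cache in RAM
    n_threads=4,
    verbose=False
)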
Quick Reference
# Activate environment
cd C:\llm-code-helper
venv\Scripts\activate
# Start server
python completion_server.py
# Test completion
curl -X POST http://localhost:5000/complete -H "Content-Type: application/json" -d "{\"code\": \"def \", \"cursor_pos\": 4}"
# Stop server
Ctrl + C
Closing Thought
You now have AI-powered code completion running entirely on your Windows laptop. No subscriptions. No data leaving your machine. Just you and your code, with a helpful AI assistant that works offline.
It’s not perfect — local models are smaller than cloud alternatives. But for day-to-day Python development, it’s surprisingly capable. And the privacy? That’s priceless.
Part of the LLM Tools series under AI Foundations.