
May 8, 2026

Running LLMs Locally: A Practical Guide to Ollama and LM Studio

Why I run AI models on my own hardware, the honest tradeoffs, and step-by-step setup for the two tools I reach for most.

cybersecurity, ai

Cloud AI is convenient. But every prompt you send to a hosted API leaves your machine. For some work, like summarizing public articles or brainstorming, that’s fine. For other work, like analyzing a phishing sample, reviewing internal code, or triaging logs, it’s a problem.

That’s why I run LLMs locally. The tooling has gotten good enough that you don’t need a research lab or a $5,000 GPU rig to get started. Here’s the practical case for local inference, the honest tradeoffs, and step-by-step setup for the two tools I reach for most: Ollama and LM Studio.

The case for local

Four reasons local LLMs earn a place in my stack:

Privacy and data residency. Anything you send to a cloud API hits someone else’s logs, retention policies, and jurisdiction. Local inference means your prompts and outputs never leave your machine. For security work involving malware samples, internal documentation, or regulated data, that’s the difference between a usable tool and a compliance violation.

Cost. No per-token billing. Once you own the hardware, inference costs nothing beyond electricity. If you use AI heavily, the savings add up fast.

Offline capability. Planes, hotel Wi-Fi, air-gapped lab environments. Local models keep working when the network doesn’t.

Learning and control. Running models locally forces you to understand quantization, context windows, prompt formats, and inference parameters. You stop treating AI as a magic box. For anyone serious about working with AI, this fluency compounds.

The honest tradeoffs

Local isn’t a free lunch.

Capability gap. Frontier cloud models still beat the best open-weight models on hard reasoning, coding, and long-context tasks. A 7B model running on your laptop is impressive, but it’s not Claude Opus or GPT-5. Set expectations accordingly: local models are excellent for bulk simple tasks and decent for medium-complexity work, but you’ll still reach for cloud for the hardest problems.

Hardware demands. Running a 7B-parameter model at Q4 quantization needs roughly 6 to 8GB of RAM or VRAM. 13B wants around 16GB. 70B realistically needs 48GB+ or aggressive quantization that hurts quality. Apple Silicon has a real advantage here: unified memory means a 32GB M-series Mac can run models that would require a $1,500 GPU on a Windows box.
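
If you want to sanity-check those numbers, the back-of-envelope math is simple: quantized weights take roughly parameters × bits-per-weight ÷ 8 bytes, and the KV cache plus runtime overhead add a few gigabytes on top. A rough sketch in Python (the ~4.5 bits per weight for Q4_K_M and the flat overhead allowance are ballpark assumptions, not exact figures):

def estimate_memory_gb(params_billion, bits_per_weight=4.5, overhead_gb=2.0):
    # Very rough: quantized weight size plus a flat allowance for KV cache
    # and runtime overhead (both numbers are ballpark assumptions)
    return params_billion * bits_per_weight / 8 + overhead_gb

for size in (3, 7, 13, 70):
    print(f"{size}B at ~Q4: ~{estimate_memory_gb(size):.0f} GB")
# Roughly 4, 6, 9, and 41 GB -- model footprints only; the RAM figures above
# add headroom for the OS, longer contexts, and everything else you're running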

Friction. Cloud APIs are one HTTP call. Local setup involves model selection, quantization choices, context length tuning, and occasional driver headaches. The tools below minimize this, but you’ll still spend an afternoon getting comfortable.

Hardware reality check

Before you install anything, check what you can actually run:

  • 8GB RAM: Small models only (1B to 3B parameters). Useful for simple tasks.
  • 16GB RAM: Comfortable with 7B models at Q4. This is the sweet spot for most users.
  • 32GB+ RAM or 16GB+ VRAM: 13B models run well. 70B is borderline.
  • Apple Silicon M-series: Unified memory makes these punch above their weight.
  • NVIDIA GPU with 8GB+ VRAM: Significant speedup. CUDA support is mature.
  • AMD GPU: ROCm support is improving but still rougher than NVIDIA.

If you’re starting from scratch and want one recommendation: a 16GB Mac mini or a Windows machine with a used RTX 3060 (12GB) is enough to do real work.

Ollama: the CLI-first option

Ollama is the path of least resistance. It’s a command-line tool that handles model downloads, quantization, GPU offloading, and serving behind a single interface. It also exposes an OpenAI-compatible REST API on port 11434, which means existing scripts written for cloud APIs work locally with a one-line change.

Install on macOS:

brew install ollama
# or download the installer from ollama.com

Install on Linux:

curl -fsSL https://ollama.com/install.sh | sh

Install on Windows:

Download the installer from ollama.com or run:

winget install Ollama.Ollama

Verify and run your first model:

ollama --version
ollama pull llama3.1:8b
ollama run llama3.1:8b

That’s it. You’re chatting with an 8B model locally. Type /bye to exit.

Useful commands:

ollama list              # show installed models
ollama pull qwen2.5:7b   # pull a different model
ollama rm llama3.1:8b    # free up disk space
ollama serve             # start the API server (runs automatically on macOS/Windows)

Hit the API from Python using the OpenAI client:

from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Summarize TCP/IP in three sentences."}]
)
print(response.choices[0].message.content)

This is the workflow I use for automation: any script I’d write against a cloud API works against Ollama by changing the base_url.

LM Studio: the GUI-first option

LM Studio is the polished desktop app. If you’d rather click than type, this is your tool. It includes a built-in model browser that pulls from Hugging Face, a chat interface with parameter sliders, document chat with built-in RAG, and a local server mode that exposes the same OpenAI-compatible API as Ollama.

Install:

  1. Go to lmstudio.ai/download and grab the installer for your OS (macOS, Windows, or Linux).
  2. Run the installer. On macOS, drag the app to Applications. On Windows, run the .exe. On Linux, the AppImage runs directly.
  3. Launch the app.

Download a model:

  1. Click the search icon in the left sidebar.
  2. Search for something like Qwen2.5 7B Instruct or Llama 3.1 8B.
  3. Pick a GGUF quantization. Q4_K_M is the standard balance of size and quality.
  4. Click download. Models are typically 4 to 8GB.

Chat with the model:

  1. Click the chat icon, then click “Select a model to load” at the top.
  2. Pick your downloaded model and adjust GPU offload if needed.
  3. Use the right-side panel to set system prompts, temperature, and context length.

Run the local API server:

  1. Click the Developer tab on the left.
  2. Toggle “Status: Running” at the top.
  3. The API now runs at http://localhost:1234/v1 and accepts OpenAI-format requests.
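
From code, this is the same pattern as the Ollama example earlier; only the base URL and model identifier change. A minimal sketch, assuming the server is running on the default port (the model name below is a placeholder; use whatever identifier LM Studio shows for the model you loaded):

from openai import OpenAI
# Same OpenAI client, pointed at LM Studio's local server instead of Ollama's
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="qwen2.5-7b-instruct",  # placeholder: match the model loaded in LM Studio
    messages=[{"role": "user", "content": "Summarize TCP/IP in three sentences."}]
)
print(response.choices[0].message.content)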

LM Studio’s killer feature for me is the document chat. Drag a PDF or text file into a conversation and the model reasons over it locally. No upload to a third-party RAG service.

Other tools worth knowing

I focus on Ollama and LM Studio because they cover 90% of cases. But three others deserve mention:

  • Jan: A polished desktop app similar in spirit to LM Studio, but fully open source.
  • GPT4All: The most beginner-friendly option. Quick install, simple UI, low ceiling but very low floor.
  • llama.cpp: The C++ inference engine that powers most of the above. Use it directly when you need maximum control or custom builds.

Cybersecurity use cases

This is where local LLMs earn their keep for security work:

Phishing email analysis. Drop a suspicious email body into a local model and ask it to flag social engineering patterns, suspicious URLs, and impersonation indicators. The sample never leaves your machine.

Log triage. Pipe Suricata alerts, auth logs, or EDR telemetry into a local model to summarize patterns, group by indicator, and flag anomalies. At volume, this is significantly cheaper than cloud APIs and avoids exfiltrating sensitive infrastructure data.
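
As a concrete sketch of that pattern, here's a minimal example that feeds the tail of an auth log to the local Ollama endpoint for summarization. The log path and prompt are illustrative, not a production pipeline:

from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")
# Illustrative path -- point this at whatever log source you're triaging
with open("/var/log/auth.log") as f:
    chunk = "".join(f.readlines()[-200:])  # keep the chunk within the model's context window
response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content":
        "Summarize these auth log entries. Group repeated events, list source "
        "IPs by frequency, and flag anything anomalous:\n\n" + chunk}]
)
print(response.choices[0].message.content)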

CVE and advisory summarization. Feed daily NVD or vendor advisories into a local model to extract affected products, CVSS context, and exploitability notes. Build it into a cron job and you’ve got a private threat intel pipeline.

CTF and home lab support. Working through a box on HackTheBox or running tools against a vulnerable VM? A local model can explain output, suggest next steps, and walk through concepts without sending lab traffic patterns to a third party.

Code review on internal repos. Point a local model at proprietary code for security review. No NDA concerns, no data exposure, no enterprise approval cycles.

The pattern: anywhere you’d hesitate to paste content into a cloud API, local inference removes the hesitation.

Key Takeaways

  • Local LLMs solve real problems for security work (privacy, offline capability, cost) but they don’t replace frontier cloud models for the hardest tasks.
  • Hardware sets the ceiling. 16GB RAM runs 7B models comfortably. Apple Silicon punches above its weight thanks to unified memory.
  • Ollama is the right starting point for terminal users and anyone building automation. One command to install, one command to run a model.
  • LM Studio is the right starting point for GUI users and anyone wanting model browsing, document chat, and parameter tuning without writing code.
  • Both expose OpenAI-compatible APIs, so any script that works against a cloud endpoint works against your local server with a base URL change.
  • Build the local stack alongside cloud, not instead of it. Use local for sensitive and high-volume work, cloud for the hardest reasoning tasks.