The game-changer
In late 2023, something happened that most business owners missed entirely. Meta, Mistral AI, and a dozen research groups released their model weights openly — meaning the actual AI brain, not just an API to call. Suddenly, anyone could download a model performing close to GPT-3.5 and run it on their own computer.
It sounded academic. It wasn't. It was the starting point for a movement that now, over a year later, allows medium-sized Swedish companies to systematically build AI infrastructure without sending a single word to OpenAI, Google, or Anthropic.
The most important piece of that puzzle is called Ollama.
What is Ollama — and why does it matter?
Ollama is an open-source tool that lets you download and run large language models locally with a single command. You install it on your server or workstation, type ollama run llama3 and, once the model has downloaded, have a functioning AI system that starts answering within seconds: offline, without subscription fees, and without data leaving your machine.
It is not a product you pay for. It is an infrastructure tool, much like Docker — you install it, configure it, and then it's just there. It runs as a local API server that you then connect to your own applications.
Once configured, local AI is invisible infrastructure, like electricity in the wall, but for intelligence.
The models you can run via Ollama include Llama 3 (Meta), Mistral and Mixtral (Mistral AI), Qwen (Alibaba), Phi-3 (Microsoft) and a growing list of specialized variants. There are models optimized for code, for medical text, for legal documents, for Swedish—and combinations thereof.
Why do Swedish companies choose local AI?
There are three reasons that come up time and again when I talk to Swedish business leaders about this. They aren't technical. They are business-oriented.
1. GDPR and data minimization
Every time you send data to ChatGPT, Claude, or Gemini, you are sending it to a third-party server, in most cases operated by a US company. That isn't illegal, but it requires you to be careful about what data you send, to have a DPA (Data Processing Agreement) with the provider, and to be able to explain to your customers what you do with their information.
With local AI, the answer is trivially simple: data does not leave the office. There is no transfer to explain. There is no third party. There is no provider changing its terms or acceptable use policy without warning. Especially for companies handling sensitive information (health data, legal documents, financial information, trade secrets), this is an argument that is hard to ignore.
2. Cost control at high volume
API prices for cloud-based AI look reasonable per request. They stop looking reasonable when you calculate 100,000 requests per month. Claude Sonnet costs about 3 dollars per million input tokens. A local Llama 3 instance on a used workstation with an RTX 4090 costs electricity—about 50 öre per hour. For continuous workflows, the math isn't complicated.
It is not always the right choice. If you run AI rarely and need top-5% performance every time, the cloud is still best. But for high-volume, lower-stakes tasks—classification, summarization, data normalization, internal search—local AI is often cheaper per request after 3–6 months.
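The break-even reasoning can be sketched as simple arithmetic. The API price, the electricity cost, and the hardware price range (from the workstation example later in this article) come from the text. The exchange rate, the number of tokens per request, and the decision to ignore output tokens and staff time are my illustrative assumptions, so treat the result as an order-of-magnitude estimate rather than a quote.

```python
# Rough break-even sketch: cloud API cost vs. a local workstation.
# Values marked ASSUMPTION are illustrative and not taken from the article.

API_USD_PER_MILLION_INPUT_TOKENS = 3.0  # from the article (Claude Sonnet, input tokens)
ELECTRICITY_SEK_PER_HOUR = 0.50         # from the article (about 50 öre per hour)
HARDWARE_SEK = 40_000                   # midpoint of the article's 30,000-50,000 SEK range
SEK_PER_USD = 10.5                      # ASSUMPTION: exchange rate
TOKENS_PER_REQUEST = 3_000              # ASSUMPTION: average input tokens per request


def monthly_cloud_cost_sek(requests_per_month: int) -> float:
    """Input-token cost only; output tokens are ignored (ASSUMPTION)."""
    tokens = requests_per_month * TOKENS_PER_REQUEST
    return tokens / 1_000_000 * API_USD_PER_MILLION_INPUT_TOKENS * SEK_PER_USD


def monthly_local_cost_sek(hours_per_day: float = 24.0, days: int = 30) -> float:
    """Electricity only; the hardware is treated as a one-off investment."""
    return ELECTRICITY_SEK_PER_HOUR * hours_per_day * days


def break_even_months(requests_per_month: int) -> float:
    """Months until the hardware is paid off; infinity if the cloud is cheaper."""
    saving = monthly_cloud_cost_sek(requests_per_month) - monthly_local_cost_sek()
    return HARDWARE_SEK / saving if saving > 0 else float("inf")


if __name__ == "__main__":
    for volume in (1_000, 10_000, 100_000):
        print(volume, "requests/month ->", round(break_even_months(volume), 1), "months")
```

With these assumptions, low volumes never pay off the hardware, which is the same point as above: the case for local AI is a volume case.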
3. Control and customization
With a local model, you can fine-tune on your own data. You can add domain-specific terminology, train the model to follow your specific formats and rules, and build something that actually understands your industry, not just general text. With the large cloud models, customization is limited to whatever fine-tuning options the provider chooses to offer, and the resulting model still runs on their servers.
What do you need to run it?
The short answer: it depends on which model and how critical the response time is. The long answer is in the overview below.
Small model (7B parameters)
Requirement: 8 GB RAM (CPU inference) or 8 GB VRAM (GPU inference)
Performance: ~15–30 tokens/second on CPU, 60–120 on GPU
Suitable for: summarization, classification, simple Q&A
Not for: complex reasoning, code generation, legal analysis
Medium-sized model (30–70B parameters)
Requirement: 48–96 GB RAM (CPU) or 24–48 GB VRAM (GPU/multi-GPU)
Performance: 5–15 tokens/second on CPU, 30–80 on GPU
Suitable for: complex analysis, code generation, longer documents
Hardware: a good workstation with an RTX 4090 (24 GB) is sufficient for a 32B model at 4-bit quantization
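A rule of thumb behind these numbers can be sketched in a few lines: the memory for the weights is roughly the parameter count times the bits per weight, divided by eight. The 4-bit default and the 20% allowance for context cache and runtime buffers are my assumptions, not measured values, so use the result for a first sanity check only.

```python
def estimate_memory_gb(params_billion: float, bits_per_weight: float = 4.0,
                       overhead: float = 1.2) -> float:
    """Approximate memory needed to load a model, in GB.

    Weights take parameters * bits_per_weight / 8 bytes. `overhead` is an
    ASSUMED 20% allowance for context cache and runtime buffers.
    """
    weight_gb = params_billion * bits_per_weight / 8  # billions of bytes = GB
    return weight_gb * overhead


def fits(params_billion: float, available_gb: float,
         bits_per_weight: float = 4.0) -> bool:
    """True if the estimated footprint fits in the available RAM or VRAM."""
    return estimate_memory_gb(params_billion, bits_per_weight) <= available_gb


if __name__ == "__main__":
    for size in (7, 32, 70):
        print(f"{size}B at 4-bit: ~{estimate_memory_gb(size):.1f} GB")
```

Under these assumptions a 32B model fits on a 24 GB card, while a 70B model needs the 48 GB class or a split across GPU and RAM.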
A practical starting scenario for a small to medium-sized business: a used HP Z8 workstation with dual Xeon, 256 GB RAM and an NVIDIA RTX 4090 costs approximately 30,000–50,000 SEK. It handles Qwen2.5 32B well in 24/7 operation, and pays for itself compared to cloud prices if you process more than about 500 documents per day.
Which models perform best for Swedish texts?
This is a common question, and the answer has shifted quickly over the last year. Early open models were very poor at Swedish. Today's models are markedly better.
Recommended models for Swedish (May 2026)
Qwen2.5:32b — excellent Swedish, strong reasoning, requires GPU with 24 GB VRAM
Mistral-Nemo:12b — good Swedish, fast, works with 16 GB VRAM or 32 GB RAM
Llama3.1:8b — acceptable Swedish, low requirements, good starting point
Codestral (code): code generation with instructions and comments in Swedish
Nomic-embed-text (embeddings): search across Swedish documents, RAG pipelines
The most important thing for Swedish: avoid models under the 7B class if quality in Swedish is a requirement. Smaller models tend to drift toward English or mix in anglicisms in a way that looks unprofessional in a business context.
Four concrete use cases for Swedish SMEs
1. Internal document search (RAG)
You have 10 years of internal routine descriptions, quotes, minutes, and project documents. No one can find anything. With a local RAG pipeline (Retrieval-Augmented Generation), consisting of an embedding model, your local LLM, and a vector database, anyone can ask questions in natural Swedish and get answers grounded in your actual documents. Everything stays with you.
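The three parts of such a pipeline can be sketched with the Python standard library and Ollama's REST API. This is a minimal illustration, not a production design: the in-memory list stands in for a real vector database, the model names are examples, and I assume an Ollama version that provides the /api/embed endpoint (older versions used /api/embeddings). The embedding function is passed in as a parameter so the retrieval logic can be tried without a running server.

```python
import json
import math
import urllib.request
from typing import Callable, List, Sequence, Tuple

OLLAMA_URL = "http://localhost:11434"  # Ollama's default local address
EMBED_MODEL = "nomic-embed-text"       # embedding model named in this article
CHAT_MODEL = "qwen2.5:7b"              # example; any locally pulled chat model

Embedder = Callable[[str], List[float]]
Index = List[Tuple[str, List[float]]]


def _post(path: str, payload: dict) -> dict:
    """POST JSON to the local Ollama server and decode the JSON reply."""
    request = urllib.request.Request(
        OLLAMA_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return json.loads(reply.read().decode("utf-8"))


def ollama_embed(text: str) -> List[float]:
    """Embed one text via /api/embed (assumes a recent Ollama version)."""
    return _post("/api/embed", {"model": EMBED_MODEL, "input": text})["embeddings"][0]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_index(docs: List[str], embed: Embedder) -> Index:
    """In-memory stand-in for a vector database: (text, vector) pairs."""
    return [(doc, embed(doc)) for doc in docs]


def retrieve(question: str, index: Index, embed: Embedder, k: int = 2) -> List[str]:
    """Return the k documents most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


def answer(question: str, index: Index, embed: Embedder = ollama_embed) -> str:
    """Ask the local chat model, restricted to the retrieved context."""
    context = "\n---\n".join(retrieve(question, index, embed))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    reply = _post("/api/chat", {
        "model": CHAT_MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # return one JSON object instead of a stream
    })
    return reply["message"]["content"]
```

In practice you would build the index once with build_index(documents, ollama_embed), store it, and call answer() per question. Splitting long documents into chunks before indexing is left out here.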
2. Automatic categorization of incoming email
A small locally running model can read subject line + sender + the first 200 words of every email and categorize it as "order", "complaint", "supplier invoice" or "other" — with 90–95% accuracy. It doesn't require GPT-4. It requires Mistral 7B and a couple of hours of configuration.
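A minimal sketch of such a classifier, assuming a Mistral 7B model has been pulled under the tag mistral:7b and Ollama is listening on its default port. The prompt wording and the fallback rules are my own illustration. The accuracy you get depends on your mail and your prompt, so measure it on a labeled sample before trusting it.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # Ollama's default local address
MODEL = "mistral:7b"  # ASSUMPTION: tag of a locally pulled Mistral 7B model
CATEGORIES = ("order", "complaint", "supplier invoice", "other")


def build_prompt(subject: str, sender: str, body: str, max_words: int = 200) -> str:
    """Combine subject, sender and the first `max_words` words of the body."""
    excerpt = " ".join(body.split()[:max_words])
    return (
        "Classify this email as exactly one of: " + ", ".join(CATEGORIES) + ".\n"
        "Reply with the category name only.\n\n"
        f"Subject: {subject}\nFrom: {sender}\nBody: {excerpt}"
    )


def parse_label(raw: str) -> str:
    """Map the model's free-text reply to a known category, else 'other'."""
    cleaned = raw.strip().strip(".\"'").lower()
    if cleaned in CATEGORIES:
        return cleaned
    # Tolerate replies such as "Category: complaint" with a substring match.
    for category in CATEGORIES[:-1]:
        if category in cleaned:
            return category
    return "other"


def classify_email(subject: str, sender: str, body: str) -> str:
    """Send the prompt to the local model and return one category."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": build_prompt(subject, sender, body)}],
        "stream": False,                # one JSON object instead of a stream
        "options": {"temperature": 0},  # keep labels as repeatable as possible
    }
    request = urllib.request.Request(
        OLLAMA_CHAT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        content = json.loads(reply.read().decode("utf-8"))["message"]["content"]
    return parse_label(content)
```

Anything the model answers outside the four categories ends up as "other", which is the safe default for a mailbox that a person still reviews.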
3. Contract analysis without outside eyes
No law firm wants you copy-pasting its drafts into ChatGPT. With a local model, you can analyze contracts, identify unusual clauses, compare against templates, and flag deviations without a single word leaving the company. It does not replace the lawyer, but it is a first filter that saves billable hours.
4. Customer service bot with your product catalog
Fine-tune a small model on your product descriptions, common questions, and return policy. Answer 80% of incoming chat queries automatically: in Swedish, with low latency, without cost per interaction, and without sharing your customer conversations with a third party.
What local AI is not suitable for
It is important to be honest about the limitations. Local AI is not always the right choice, and real infrastructure planning requires you to understand the trade-offs.
Tasks requiring top performance: The most difficult reasoning tasks — complex legal arguments, advanced medical diagnostics, mathematical proofs — still require the large cloud models. A local 32B model is impressive but not in the same league as GPT-4o or Claude 3.5 Sonnet for truly complex multi-step problems.
Multimodal input: If you need to analyze images, redraw diagrams, or understand complex tables from scanned PDFs, the vision capabilities of cloud models are still superior to most local alternatives.
Low volume + high requirements: If you run AI ten times a week and need the best possible answer every time — pay for the cloud. It is cheaper and simpler.
Local AI is not for everyone. It is for those who have the volume, data protection requirements, or the need for control that justifies the investment.
How to start — in three steps
Step 1: Install Ollama and test locally
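# Linux install script; on macOS and Windows, use the installer from the Ollama website
curl -fsSL https://ollama.com/install.sh | sh
ollama run qwen2.5:7b
# The model downloads automatically on first run (~5 GB); then type your first question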
It takes about ten minutes. You now have a working local AI system. Try asking questions about your own work and see how it performs compared to ChatGPT on your typical tasks.
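If you would rather verify the installation from a script than from the terminal, a small check like the following should do. It is a sketch that assumes Ollama is listening on its default port 11434 and uses the /api/tags endpoint, which lists the models you have downloaded.

```python
import json
import urllib.request
from typing import List

TAGS_URL = "http://localhost:11434/api/tags"  # lists locally downloaded models


def model_names(tags_payload: dict) -> List[str]:
    """Extract model names from the JSON returned by /api/tags."""
    return [entry["name"] for entry in tags_payload.get("models", [])]


def installed_models(url: str = TAGS_URL, timeout: float = 3.0) -> List[str]:
    """Return installed model names; raises OSError if the server is unreachable."""
    with urllib.request.urlopen(url, timeout=timeout) as reply:
        return model_names(json.loads(reply.read().decode("utf-8")))


if __name__ == "__main__":
    try:
        print("Ollama is running. Models:", installed_models())
    except OSError as error:  # urllib's URLError is a subclass of OSError
        print("Could not reach Ollama on port 11434:", error)
```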
Step 2: Identify a high-volume task
Choose a task you do often and repetitively: categorize customer inquiries, summarize meeting minutes, extract data from PDFs. It doesn't have to be the most complex task — on the contrary, simple repetitive tasks are those that provide the best ROI with local AI.
Step 3: Build an API call, not an integration
Ollama exposes a REST API on port 11434. Your existing web app, your Python script, or your Node backend can call it much like it calls OpenAI, often by changing only the base URL. This is intentional: Ollama provides an OpenAI-compatible endpoint under /v1.
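from openai import OpenAI

# Point the standard OpenAI client at the local Ollama server.
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # the client requires a value; Ollama ignores it
)
response = client.chat.completions.create(
    model="qwen2.5:7b",
    messages=[{"role": "user", "content": "Summarize: ..."}],
)
print(response.choices[0].message.content)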
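One way to make the URL swap explicit is to read the endpoint from the environment, so the same application code can run against Ollama or a cloud provider. The variable names LLM_BASE_URL, LLM_API_KEY and LLM_MODEL are hypothetical, chosen for this sketch, and the summarize helper assumes the openai package is installed.

```python
import os
from typing import Mapping

LOCAL_DEFAULTS = {
    "base_url": "http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
    "api_key": "ollama",                      # placeholder; Ollama ignores it
    "model": "qwen2.5:7b",
}


def llm_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Read endpoint settings from hypothetical LLM_* variables, defaulting to local."""
    return {
        "base_url": env.get("LLM_BASE_URL", LOCAL_DEFAULTS["base_url"]),
        "api_key": env.get("LLM_API_KEY", LOCAL_DEFAULTS["api_key"]),
        "model": env.get("LLM_MODEL", LOCAL_DEFAULTS["model"]),
    }


def summarize(text: str, env: Mapping[str, str] = os.environ) -> str:
    """Same application code for local and cloud; only the settings differ."""
    from openai import OpenAI  # imported here so llm_settings has no dependency

    settings = llm_settings(env)
    client = OpenAI(base_url=settings["base_url"], api_key=settings["api_key"])
    response = client.chat.completions.create(
        model=settings["model"],
        messages=[{"role": "user", "content": "Summarize: " + text}],
    )
    return response.choices[0].message.content
```

With no variables set, everything stays on localhost. Setting the three variables is then a deliberate, visible decision to send data elsewhere.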
Local AI is no longer a solution just for nerds with servers in the basement. It is a serious business strategy for Swedish companies that want control, predictable costs, and a GDPR-compliant setup that doesn't require legal consultation. Get started with Ollama, test a real use case, and see if the math works for you.
If you want help building a local AI infrastructure tailored to your company, contact me directly.