Homomorphic Encryption in Practice: TFHE, Concrete-ML, and ML Inference on Encrypted Data
Homomorphic Encryption Principles
Regular encryption is like a locked safe. You can store things in it and take things out, but you can't modify the contents without unlocking it first. Homomorphic encryption is like a safe with built-in manipulator arms — you can perform computations on the contents while they're still locked inside.
Formally: given encryptions E(a) and E(b), you can compute E(a + b) and E(a × b) without knowing a or b. That's it. Any computation expressible as a fixed circuit of additions and multiplications can be evaluated this way, which covers most of what a model does at inference time, though not unbounded loops or branching on encrypted values. The catch is performance.
There are three levels:
- Partially Homomorphic Encryption (PHE): Supports unlimited operations of ONE type (addition OR multiplication). Textbook (unpadded) RSA is multiplicatively homomorphic. Paillier is additively homomorphic. Fast, but limited.
- Somewhat Homomorphic Encryption (SHE): Supports both addition and multiplication, but only a limited number before the noise overwhelms the result.
- Fully Homomorphic Encryption (FHE): Supports unlimited additions AND multiplications via "bootstrapping" — a technique that refreshes the ciphertext noise. This is what TFHE provides.
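To make the PHE case concrete, here is textbook RSA's multiplicative homomorphism with toy parameters (no padding, tiny primes; insecure, for illustration only):

```python
# Textbook RSA with toy parameters: multiplying ciphertexts multiplies plaintexts.
p, q = 61, 53
n = p * q              # modulus: 3233
e, d = 17, 2753        # public/private exponents (e*d ≡ 1 mod φ(n) = 3120)

def enc(m):
    return pow(m, e, n)

def dec(c):
    return pow(c, d, n)

a, b = 7, 9
product_ct = (enc(a) * enc(b)) % n   # operate on ciphertexts only
print(dec(product_ct))               # 63, i.e. a * b
```

Addition of plaintexts has no such ciphertext-side counterpart in RSA, which is exactly the PHE limitation.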
TFHE: The Scheme Behind Concrete-ML
TFHE (Torus FHE) represents encrypted data as polynomials over a torus (a circle group). The key innovation is fast bootstrapping — TFHE can refresh ciphertext noise in ~10ms, compared to minutes for older schemes like BGV/BFV.
Each TFHE ciphertext contains:
- The encrypted value (a polynomial)
- Noise (essential for security, but grows with computation)
- Parameters (determine the security level and noise capacity)
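A toy sketch of the torus idea (not real TFHE, and with zero security): encode the message at one of a few evenly spaced points on [0, 1), add a little noise, and recover it by rounding. Decryption works only while the noise stays below half the gap between points, which is exactly why noise growth matters.

```python
import random

N_MESSAGES = 4  # message space {0, 1, 2, 3}, spaced 0.25 apart on the torus

def encode_noisy(message, noise_scale=0.01):
    # Place the message on the torus [0, 1) and add small Gaussian noise.
    return (message / N_MESSAGES + random.gauss(0, noise_scale)) % 1.0

def decode(t):
    # Round to the nearest message slot; the modulo wraps around the circle.
    return round(t * N_MESSAGES) % N_MESSAGES

random.seed(42)
print([decode(encode_noisy(m)) for m in range(4)])  # [0, 1, 2, 3]
```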
The noise grows with every operation:
- Addition: Noise grows linearly (cheap)
- Multiplication: Noise grows much faster — exponentially in the multiplicative depth (expensive)
- Bootstrapping: Resets noise to a fixed level (costs ~10ms but enables further computation)
This is why multiplicative depth matters. A model with 10 sequential multiplications needs 10 bootstrapping operations minimum. At 10ms each, that's 100ms of pure bootstrapping overhead before counting the actual computation.
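The bookkeeping can be sketched as a toy noise tracker (made-up constants, not TFHE's actual noise model): square the noise on each multiplication and bootstrap only when the budget runs out. Note that real TFHE circuits typically bootstrap at every nonlinear step rather than lazily like this.

```python
FRESH_NOISE = 2.0     # noise level of a fresh or freshly bootstrapped ciphertext
NOISE_LIMIT = 2**20   # past this, decryption would fail
BOOTSTRAP_MS = 10     # per-bootstrap cost quoted above

def run_mul_chain(depth):
    """Simulate `depth` sequential multiplications with lazy bootstrapping.
    Returns (number of bootstraps, total bootstrap time in ms)."""
    noise, bootstraps = FRESH_NOISE, 0
    for _ in range(depth):
        noise *= noise                 # multiplication: noise grows fast
        if noise > NOISE_LIMIT:
            noise = FRESH_NOISE        # bootstrap: reset the noise
            bootstraps += 1
    return bootstraps, bootstraps * BOOTSTRAP_MS

print(run_mul_chain(10))  # (2, 20): one bootstrap every 5 multiplications here
```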
Concrete-ML: FHE for Machine Learning
Zama's Concrete-ML wraps TFHE in a scikit-learn-compatible API. You train a model on plaintext data, then "compile" it into an FHE circuit:
from concrete.ml.sklearn import LogisticRegression
import numpy as np
# Train on plaintext data (this step is normal ML)
X_train = np.load("features_train.npy") # Shape: (10000, 20)
y_train = np.load("labels_train.npy") # Shape: (10000,)
model = LogisticRegression(n_bits=8) # Quantize to 8-bit integers
model.fit(X_train, y_train)
# Evaluate in plaintext first (sanity check)
X_test = np.load("features_test.npy")  # held-out set (file names assumed to mirror the training files)
y_test = np.load("labels_test.npy")
plaintext_accuracy = model.score(X_test, y_test)
print(f"Plaintext accuracy: {plaintext_accuracy:.4f}")
# Output: Plaintext accuracy: 0.9234
# Compile the model into an FHE circuit
fhe_circuit = model.compile(X_train)
# Now we can do inference on encrypted data
# Step 1: Client encrypts their data
encrypted_input = fhe_circuit.encrypt(X_test[0:1])
# Step 2: Server runs inference on encrypted data
encrypted_result = fhe_circuit.run(encrypted_input)
# Step 3: Client decrypts the result
decrypted_result = fhe_circuit.decrypt(encrypted_result)
print(f"FHE prediction: {decrypted_result}")
# Output: FHE prediction: [1] (same as plaintext prediction)
The n_bits=8 parameter is critical. It controls quantization — how many bits represent each value in the model. Lower bits = faster FHE but lower accuracy:
| n_bits | Plaintext Accuracy | FHE Inference Time | Speedup vs. 16-bit |
|--------|--------------------|--------------------|--------------------|
| 4      | 0.8912             | 45ms               | 18x                |
| 6      | 0.9156             | 95ms               | 8.5x               |
| 8      | 0.9234             | 180ms              | 4.5x               |
| 12     | 0.9241             | 520ms              | 1.6x               |
| 16     | 0.9243             | 810ms              | 1x                 |
Going from 16-bit to 8-bit, we lose 0.09 percentage points of accuracy but gain 4.5x speed. That's the tradeoff that makes FHE practical.
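The mechanism behind that table is plain uniform quantization: fewer bits means a coarser grid and larger rounding error. A minimal sketch (not Concrete-ML's actual quantizer, and the weight values are made up):

```python
def quantize(values, n_bits):
    # Symmetric uniform quantization to signed n-bit integers, loosely
    # mirroring what an n_bits setting controls.
    qmax = 2 ** (n_bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    return [round(v / scale) for v in values], scale

weights = [0.731, -0.215, 0.058, -0.904, 0.402]  # made-up model weights
max_error = {}
for n_bits in (4, 6, 8):
    q, scale = quantize(weights, n_bits)
    restored = [x * scale for x in q]
    max_error[n_bits] = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"{n_bits}-bit max reconstruction error: {max_error[n_bits]:.4f}")
```

The error roughly halves per extra bit, while FHE cost grows much faster than linearly in bit-width, which is why 8 bits is such a common sweet spot.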
The Credit Scoring Pipeline
Here's the architecture I built for the bank:
[Customer App] → [Encrypt features locally] → [Send ciphertext to API]
                                                         ↓
                                          [FHE Inference Server (AWS)]
                                                         ↓
                                                [Encrypted result]
                                                         ↓
[Customer App] ← [Decrypt result locally] ← [Return encrypted prediction]
The customer's financial data (income, debt-to-income ratio, credit history length, etc.) never leaves their device in plaintext. The bank's model weights are embedded in the compiled FHE circuit, but the customer can't extract them — the circuit is a one-way computation.
# Server-side: Load the compiled model (no access to cleartext data)
from fastapi import FastAPI, Request, Response
from concrete.ml.deployment import FHEModelServer

app = FastAPI()
server = FHEModelServer(path_dir="./deployed_model")

@app.post("/predict")
async def predict(request: Request):
    encrypted_input = await request.body()
    # Run inference on ciphertext — server never sees plaintext
    encrypted_result = server.run(
        serialized_encrypted_quantized_data=encrypted_input,
        # get_evaluation_keys: app-specific helper returning this client's cached keys
        serialized_evaluation_keys=get_evaluation_keys(request.headers["client-id"]),
    )
    return Response(content=encrypted_result, media_type="application/octet-stream")
# Client-side: Encrypt, send, decrypt
import numpy as np
from concrete.ml.deployment import FHEModelClient

client = FHEModelClient(path_dir="./client_model", key_dir="./keys")
client.generate_private_and_evaluation_keys()
# Prepare the data
clear_input = np.array([[65000, 0.32, 7, 3, 720, 12000, 0, 1]])
encrypted_input = client.quantize_encrypt_serialize(clear_input)
# Send evaluation keys (public, needed by server for computation)
evaluation_keys = client.get_serialized_evaluation_keys()
# ... send encrypted_input and evaluation_keys to server ...
# Receive and decrypt
encrypted_result = response.content
decrypted_prediction = client.deserialize_decrypt_dequantize(encrypted_result)
print(f"Credit score category: {decrypted_prediction}")
Custom Neural Networks with Concrete-ML
For more complex models, you can compile PyTorch modules:
import torch
from concrete.ml.torch.compile import compile_torch_model
class CreditNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(20, 64)
        self.fc2 = torch.nn.Linear(64, 32)
        self.fc3 = torch.nn.Linear(32, 2)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)
# Train normally in PyTorch
model = CreditNet()
# ... training loop ...
# Compile for FHE
quantized_model = compile_torch_model(
    model,
    X_train,
    n_bits=6,                   # Aggressive quantization for speed
    rounding_threshold_bits=6,  # Additional optimization
)
# Check the circuit complexity
print(f"Max bit-width: {quantized_model.fhe_circuit.graph.maximum_integer_bit_width()}")
# `complexity` is an abstract cost estimate (useful for comparing circuits), not milliseconds
print(f"Estimated complexity: {quantized_model.fhe_circuit.complexity}")
The model architecture matters enormously for FHE performance. Rules I learned the hard way:
- ReLU is expensive — each ReLU requires a comparison, which in FHE means a programmable bootstrapping operation (~10ms each). A network with 96 ReLU activations has ~960ms of bootstrapping overhead.
- Width over depth — a 3-layer network with 64 neurons per layer is much slower than a 2-layer network with 128 neurons, because depth adds sequential bootstrapping.
- Avoid max/min/argmax — these require multiple comparisons. Use average pooling instead of max pooling.
- Batch normalization is free — it fuses into the preceding linear layer during compilation. Always use it.
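The batch-norm fusion in the last bullet is just algebra: a linear layer followed by batch norm collapses into a single linear layer with rescaled weight and bias. Scalar sketch with made-up parameter values:

```python
import math

w, b = 0.8, 0.1                    # linear layer: y = w*x + b
gamma, beta = 1.5, -0.2            # batch norm scale and shift
mean, var, eps = 0.4, 0.25, 1e-5   # BN running statistics

def linear_then_bn(x):
    y = w * x + b
    return gamma * (y - mean) / math.sqrt(var + eps) + beta

# Fold BN into the linear layer: same function, one op, no extra FHE cost
s = gamma / math.sqrt(var + eps)
w_fused, b_fused = w * s, (b - mean) * s + beta

def fused(x):
    return w_fused * x + b_fused

print(abs(linear_then_bn(1.7) - fused(1.7)) < 1e-12)  # True: identical outputs
```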
The Performance Reality
Let me be completely transparent about what's fast and what's not:
import time
from concrete.ml.sklearn import (
LogisticRegression, DecisionTreeClassifier,
XGBClassifier, LinearSVR
)
models = {
"Logistic Regression (8-bit)": LogisticRegression(n_bits=8),
"Decision Tree (depth=5, 8-bit)": DecisionTreeClassifier(n_bits=8, max_depth=5),
"XGBoost (10 trees, depth=4)": XGBClassifier(n_bits=6, max_depth=4, n_estimators=10),
"Linear SVR (8-bit)": LinearSVR(n_bits=8),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    circuit = model.compile(X_train)

    # Benchmark FHE inference
    encrypted = circuit.encrypt(X_test[0:1])
    start = time.time()
    result = circuit.run(encrypted)
    fhe_time = time.time() - start

    # Compare with plaintext
    start = time.time()
    _ = model.predict(X_test[0:1])
    plain_time = time.time() - start

    accuracy = model.score(X_test, y_test)
    print(f"{name}: acc={accuracy:.4f}, FHE={fhe_time*1000:.0f}ms, plain={plain_time*1000:.2f}ms")
Results on an AWS c6i.2xlarge (8 vCPUs, no GPU acceleration):
| Model | Accuracy | FHE Inference | Plaintext | Slowdown |
|-------|----------|---------------|-----------|----------|
| Logistic Regression | 0.923 | 85ms | 0.02ms | 4,250x |
| Decision Tree (depth=5) | 0.941 | 210ms | 0.01ms | 21,000x |
| XGBoost (10 trees, depth=4) | 0.956 | 1,240ms | 0.08ms | 15,500x |
| Linear SVR | 0.917 | 72ms | 0.01ms | 7,200x |
| Custom NN (2-layer, 64 units) | 0.948 | 3,400ms | 0.15ms | 22,667x |
That 4,250x slowdown for logistic regression translates to 85ms — completely viable for real-time API calls. The XGBoost at 1.2 seconds is fine for batch processing. The neural network at 3.4 seconds is borderline — acceptable for some use cases, too slow for others.
What's Not Possible (Yet)
Let me save you weeks of wasted effort:
- Large language models: A GPT-2 size model would take hours per token. Not happening.
- Image classification with deep CNNs: ResNet-50 inference would be ~45 minutes. Even a tiny MNIST classifier takes 8-12 seconds.
- Anything requiring dynamic control flow: FHE circuits are static. You can't branch on encrypted values (that would leak information).
- Large feature spaces: 1000+ features means 1000+ encrypted values to process. Memory becomes the bottleneck before compute.
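The no-branching rule has a standard workaround worth knowing: replace `if` with an arithmetic multiplexer built from the additions and multiplications FHE does support. Plain-integer sketch of the idea:

```python
def fhe_style_select(cond_bit, a, b):
    # cond_bit is 0 or 1 (conceptually encrypted); branching on it is not
    # allowed, so select via cond*a + (1-cond)*b — additions and
    # multiplications only, same operations regardless of the condition.
    return cond_bit * a + (1 - cond_bit) * b

print(fhe_style_select(1, 10, 20))  # 10
print(fhe_style_select(0, 10, 20))  # 20
```

Both branches are always computed, so the cost is the sum of both paths; this is also why data-dependent early exits give no speedup under FHE.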
Key Sizes and Network Transfer
FHE has a data expansion problem. A 20-dimensional feature vector that's 160 bytes in plaintext becomes:
- Encrypted input: ~42 KB
- Evaluation keys: ~158 MB (sent once per client, cached by server)
- Encrypted output: ~4 KB
That 158 MB evaluation key is the elephant in the room. It needs to be sent from the client to the server once per session. On a fast connection, that's a few seconds of upload. On mobile, it's a deal-breaker without aggressive key compression:
# Enable key compression — reduces eval keys by ~4x
from concrete import fhe

circuit = model.compile(
    X_train,
    configuration=fhe.Configuration(
        compress_evaluation_keys=True,
    ),
)
# Compressed eval keys: ~40 MB (still large, but manageable)
The Architecture That Works
After iterating through several designs, here's what worked for the bank deployment:
# Deployment architecture
"""
1. Client generates keys once, stores locally
2. Evaluation keys uploaded to server once (cached with TTL)
3. Each inference request sends only encrypted input (~42 KB)
4. Server returns encrypted result (~4 KB)
5. Client decrypts locally
Total per-request data transfer: ~46 KB
Latency budget:
- Network (encrypted input): 15ms
- FHE inference (server): 340ms
- Network (encrypted result): 5ms
- Client decrypt: 2ms
Total: ~362ms
"""
The 340ms inference time is for the XGBoost model that the bank actually chose — 15 trees at depth 4, quantized to 6 bits, processing 20 financial features. Accuracy is 95.2% on their holdout set, compared to 96.1% for the plaintext model. The bank accepted that 0.9% accuracy trade-off because FHE let them process customer data from a partner institution without data sharing agreements, which would have taken 8 months of legal review.
Where This Is Going
TFHE hardware acceleration is coming. Zama is building custom ASICs that promise 100-1000x speedup over CPU-based FHE. Intel's HEXL library already provides 2-4x speedups using AVX-512 instructions. GPU acceleration through CUDA-based TFHE implementations shows 10-50x improvements.
When hardware catches up — and it will, because the market demand is clear — a 340ms inference will become 3ms. At that point, FHE becomes viable for real-time applications across the board: encrypted search, private recommendation systems, confidential medical diagnosis, and yes, even large model inference.
For now, the sweet spot is clear: small-to-medium models (logistic regression, gradient boosting, shallow networks) on structured data with fewer than 100 features. If your use case fits that profile and you need mathematical privacy guarantees stronger than "we promise not to look at your data," homomorphic encryption is ready.