Homomorphic Encryption in Practice: TFHE, Concrete-ML, and ML Inference on Encrypted Data
Homomorphic Encryption Principles
Regular encryption is like a locked safe. You can store things in it and take things out, but you can't modify the contents without unlocking it first. Homomorphic encryption is like a safe with built-in manipulator arms — you can perform computations on the contents while they're still locked inside.
Formally: given encryptions E(a) and E(b), you can compute E(a + b) and E(a × b) without knowing a or b. That's it. Any computation expressible as a fixed circuit of additions and multiplications can be evaluated this way, which covers most of what a model does at inference time, though not unbounded loops or branching on encrypted values. The catch is performance.
There are three levels:
- Partially Homomorphic Encryption (PHE): Supports unlimited operations of ONE type (addition OR multiplication). Textbook (unpadded) RSA is multiplicatively homomorphic. Paillier is additively homomorphic. Fast, but limited.
- Somewhat Homomorphic Encryption (SHE): Supports both addition and multiplication, but only a limited number before the noise overwhelms the result.
- Fully Homomorphic Encryption (FHE): Supports unlimited additions AND multiplications via "bootstrapping" — a technique that refreshes the ciphertext noise. This is what TFHE provides.
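To make the PHE case concrete, here is textbook RSA's multiplicative homomorphism with toy parameters (no padding, tiny primes; insecure, for illustration only):

```python
# Textbook RSA with toy parameters: multiplying ciphertexts multiplies plaintexts.
p, q = 61, 53
n = p * q              # modulus: 3233
e, d = 17, 2753        # public/private exponents (e*d ≡ 1 mod φ(n) = 3120)

def enc(m):
    return pow(m, e, n)

def dec(c):
    return pow(c, d, n)

a, b = 7, 9
product_ct = (enc(a) * enc(b)) % n   # operate on ciphertexts only
print(dec(product_ct))               # 63, i.e. a * b
```

Addition of plaintexts has no such ciphertext-side counterpart in RSA, which is exactly the PHE limitation.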
TFHE: The Scheme Behind Concrete-ML
TFHE (Torus FHE) represents encrypted data as polynomials over a torus (a circle group). The key innovation is fast bootstrapping — TFHE can refresh ciphertext noise in ~10ms, compared to minutes for older schemes like BGV/BFV.
Each TFHE ciphertext contains:
- The encrypted value (a polynomial)
- Noise (essential for security, but grows with computation)
- Parameters (determine the security level and noise capacity)
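A toy sketch of the torus idea (not real TFHE, and with zero security): encode the message at one of a few evenly spaced points on [0, 1), add a little noise, and recover it by rounding. Decryption works only while the noise stays below half the gap between points, which is exactly why noise growth matters.

```python
import random

N_MESSAGES = 4  # message space {0, 1, 2, 3}, spaced 0.25 apart on the torus

def encode_noisy(message, noise_scale=0.01):
    # Place the message on the torus [0, 1) and add small Gaussian noise.
    return (message / N_MESSAGES + random.gauss(0, noise_scale)) % 1.0

def decode(t):
    # Round to the nearest message slot; the modulo wraps around the circle.
    return round(t * N_MESSAGES) % N_MESSAGES

random.seed(42)
print([decode(encode_noisy(m)) for m in range(4)])  # [0, 1, 2, 3]
```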
The noise grows with every operation:
- Addition: Noise grows linearly (cheap)
- Multiplication: Noise grows much faster — exponentially in the multiplicative depth (expensive)
- Bootstrapping: Resets noise to a fixed level (costs ~10ms but enables further computation)
This is why multiplicative depth matters. A model with 10 sequential multiplications needs 10 bootstrapping operations minimum. At 10ms each, that's 100ms of pure bootstrapping overhead before counting the actual computation.
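The bookkeeping can be sketched as a toy noise tracker (made-up constants, not TFHE's actual noise model): square the noise on each multiplication and bootstrap only when the budget runs out. Note that real TFHE circuits typically bootstrap at every nonlinear step rather than lazily like this.

```python
FRESH_NOISE = 2.0     # noise level of a fresh or freshly bootstrapped ciphertext
NOISE_LIMIT = 2**20   # past this, decryption would fail
BOOTSTRAP_MS = 10     # per-bootstrap cost quoted above

def run_mul_chain(depth):
    """Simulate `depth` sequential multiplications with lazy bootstrapping.
    Returns (number of bootstraps, total bootstrap time in ms)."""
    noise, bootstraps = FRESH_NOISE, 0
    for _ in range(depth):
        noise *= noise                 # multiplication: noise grows fast
        if noise > NOISE_LIMIT:
            noise = FRESH_NOISE        # bootstrap: reset the noise
            bootstraps += 1
    return bootstraps, bootstraps * BOOTSTRAP_MS

print(run_mul_chain(10))  # (2, 20): one bootstrap every 5 multiplications here
```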
Concrete-ML: FHE for Machine Learning
Zama's Concrete-ML wraps TFHE in a scikit-learn-compatible API. You train a model on plaintext data, then "compile" it into an FHE circuit:
from concrete.ml.sklearn import LogisticRegression
import numpy as np
# Train on plaintext data (this step is normal ML)
X_train = np.load("features_train.npy") # Shape: (10000, 20)
y_train = np.load("labels_train.npy") # Shape: (10000,)
model = LogisticRegression(n_bits=8) # Quantize to 8-bit integers
model.fit(X_train, y_train)
# Evaluate in plaintext first (sanity check)
X_test = np.load("features_test.npy")  # held-out set (file names assumed to mirror the training files)
y_test = np.load("labels_test.npy")
plaintext_accuracy = model.score(X_test, y_test)
print(f"Plaintext accuracy: {plaintext_accuracy:.4f}")
# Output: Plaintext accuracy: 0.9234
# Compile the model into an FHE circuit
fhe_circuit = model.compile(X_train)
# Now we can do inference on encrypted data
# Step 1: Client encrypts their data
encrypted_input = fhe_circuit.encrypt(X_test[0:1])
# Step 2: Server runs inference on encrypted data
encrypted_result = fhe_circuit.run(encrypted_input)
# Step 3: Client decrypts the result
decrypted_result = fhe_circuit.decrypt(encrypted_result)
print(f"FHE prediction: {decrypted_result}")
# Output: FHE prediction: [1] (same as plaintext prediction)
The n_bits=8 parameter is critical. It controls quantization — how many bits represent each value in the model. Lower bits = faster FHE but lower accuracy:
| n_bits | Plaintext Accuracy | FHE Inference Time | Speedup vs. 16-bit |
|--------|--------------------|--------------------|--------------------|
| 4      | 0.8912             | 45ms               | 18x                |
| 6      | 0.9156             | 95ms               | 8.5x               |
| 8      | 0.9234             | 180ms              | 4.5x               |
| 12     | 0.9241             | 520ms              | 1.6x               |
| 16     | 0.9243             | 810ms              | 1x                 |
Going from 16-bit to 8-bit, we lose 0.09 percentage points of accuracy but gain 4.5x speed. That's the tradeoff that makes FHE practical.
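The mechanism behind that table is plain uniform quantization: fewer bits means a coarser grid and larger rounding error. A minimal sketch (not Concrete-ML's actual quantizer, and the weight values are made up):

```python
def quantize(values, n_bits):
    # Symmetric uniform quantization to signed n-bit integers, loosely
    # mirroring what an n_bits setting controls.
    qmax = 2 ** (n_bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    return [round(v / scale) for v in values], scale

weights = [0.731, -0.215, 0.058, -0.904, 0.402]  # made-up model weights
max_error = {}
for n_bits in (4, 6, 8):
    q, scale = quantize(weights, n_bits)
    restored = [x * scale for x in q]
    max_error[n_bits] = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"{n_bits}-bit max reconstruction error: {max_error[n_bits]:.4f}")
```

The error roughly halves per extra bit, while FHE cost grows much faster than linearly in bit-width, which is why 8 bits is such a common sweet spot.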
The Credit Scoring Pipeline
Here's the architecture I built for the bank:
[Customer App] → [Encrypt features locally] → [Send ciphertext to API]
                                                         ↓
                                          [FHE Inference Server (AWS)]
                                                         ↓
                                                [Encrypted result]
                                                         ↓
[Customer App] ← [Decrypt result locally] ← [Return encrypted prediction]
The customer's financial data (income, debt-to-income ratio, credit history length, etc.) never leaves their device in plaintext. The bank's model weights are embedded in the compiled FHE circuit, but the customer can't extract them — the circuit is a one-way computation.
# Server-side: Load the compiled model (no access to cleartext data)
from fastapi import FastAPI, Request, Response
from concrete.ml.deployment import FHEModelServer

app = FastAPI()
server = FHEModelServer(path_dir="./deployed_model")

@app.post("/predict")
async def predict(request: Request):
    encrypted_input = await request.body()
    # Run inference on ciphertext — server never sees plaintext
    encrypted_result = server.run(
        serialized_encrypted_quantized_data=encrypted_input,
        # get_evaluation_keys: app-specific helper returning this client's cached keys
        serialized_evaluation_keys=get_evaluation_keys(request.headers["client-id"]),
    )
    return Response(content=encrypted_result, media_type="application/octet-stream")
# Client-side: Encrypt, send, decrypt
import numpy as np
from concrete.ml.deployment import FHEModelClient

client = FHEModelClient(path_dir="./client_model", key_dir="./keys")
client.generate_private_and_evaluation_keys()
# Prepare the data
clear_input = np.array([[65000, 0.32, 7, 3, 720, 12000, 0, 1]])
encrypted_input = client.quantize_encrypt_serialize(clear_input)
# Send evaluation keys (public, needed by server for computation)
evaluation_keys = client.get_serialized_evaluation_keys()
# ... send encrypted_input and evaluation_keys to server ...
# Receive and decrypt
encrypted_result = response.content
decrypted_prediction = client.deserialize_decrypt_dequantize(encrypted_result)
print(f"Credit score category: {decrypted_prediction}")
Custom Neural Networks with Concrete-ML
For more complex models, you can compile PyTorch modules:
import torch
from concrete.ml.torch.compile import compile_torch_model
class CreditNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(20, 64)
        self.fc2 = torch.nn.Linear(64, 32)
        self.fc3 = torch.nn.Linear(32, 2)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)
# Train normally in PyTorch
model = CreditNet()
# ... training loop ...
# Compile for FHE
quantized_model = compile_torch_model(
    model,
    X_train,
    n_bits=6,                   # Aggressive quantization for speed
    rounding_threshold_bits=6,  # Additional optimization
)
# Check the circuit complexity
print(f"Max bit-width: {quantized_model.fhe_circuit.graph.maximum_integer_bit_width()}")
# `complexity` is an abstract cost estimate (useful for comparing circuits), not milliseconds
print(f"Estimated complexity: {quantized_model.fhe_circuit.complexity}")
The model architecture matters enormously for FHE performance. Rules I learned the hard way:
- ReLU is expensive — each ReLU requires a comparison, which in FHE means a programmable bootstrapping operation (~10ms each). A network with 96 ReLU activations has ~960ms of bootstrapping overhead.
- Width over depth — a 3-layer network with 64 neurons per layer is much slower than a 2-layer network with 128 neurons, because depth adds sequential bootstrapping.
- Avoid max/min/argmax — these require multiple comparisons. Use average pooling instead of max pooling.
- Batch normalization is free — it fuses into the preceding linear layer during compilation. Always use it.
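The batch-norm fusion in the last bullet is just algebra: a linear layer followed by batch norm collapses into a single linear layer with rescaled weight and bias. Scalar sketch with made-up parameter values:

```python
import math

w, b = 0.8, 0.1                    # linear layer: y = w*x + b
gamma, beta = 1.5, -0.2            # batch norm scale and shift
mean, var, eps = 0.4, 0.25, 1e-5   # BN running statistics

def linear_then_bn(x):
    y = w * x + b
    return gamma * (y - mean) / math.sqrt(var + eps) + beta

# Fold BN into the linear layer: same function, one op, no extra FHE cost
s = gamma / math.sqrt(var + eps)
w_fused, b_fused = w * s, (b - mean) * s + beta

def fused(x):
    return w_fused * x + b_fused

print(abs(linear_then_bn(1.7) - fused(1.7)) < 1e-12)  # True: identical outputs
```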
The Performance Reality
Let me be completely transparent about what's fast and what's not:
import time
from concrete.ml.sklearn import (
LogisticRegression, DecisionTreeClassifier,
XGBClassifier, LinearSVR
)
models = {
"Logistic Regression (8-bit)": LogisticRegression(n_bits=8),
"Decision Tree (depth=5, 8-bit)": DecisionTreeClassifier(n_bits=8, max_depth=5),
"XGBoost (10 trees, depth=4)": XGBClassifier(n_bits=6, max_depth=4, n_estimators=10),
"Linear SVR (8-bit)": LinearSVR(n_bits=8),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    circuit = model.compile(X_train)

    # Benchmark FHE inference
    encrypted = circuit.encrypt(X_test[0:1])
    start = time.time()
    result = circuit.run(encrypted)
    fhe_time = time.time() - start

    # Compare with plaintext
    start = time.time()
    _ = model.predict(X_test[0:1])
    plain_time = time.time() - start

    accuracy = model.score(X_test, y_test)
    print(f"{name}: acc={accuracy:.4f}, FHE={fhe_time*1000:.0f}ms, plain={plain_time*1000:.2f}ms")
Results on an AWS c6i.2xlarge (8 vCPUs, no GPU acceleration):
| Model | Accuracy | FHE Inference | Plaintext | Slowdown |
|-------|----------|---------------|-----------|----------|
| Logistic Regression | 0.923 | 85ms | 0.02ms | 4,250x |
| Decision Tree (depth=5) | 0.941 | 210ms | 0.01ms | 21,000x |
| XGBoost (10 trees, depth=4) | 0.956 | 1,240ms | 0.08ms | 15,500x |
| Linear SVR | 0.917 | 72ms | 0.01ms | 7,200x |
| Custom NN (2-layer, 64 units) | 0.948 | 3,400ms | 0.15ms | 22,667x |
That 4,250x slowdown for logistic regression translates to 85ms — completely viable for real-time API calls. The XGBoost at 1.2 seconds is fine for batch processing. The neural network at 3.4 seconds is borderline — acceptable for some use cases, too slow for others.
What's Not Possible (Yet)
Let me save you weeks of wasted effort:
- Large language models: A GPT-2 size model would take hours per token. Not happening.
- Image classification with deep CNNs: ResNet-50 inference would be ~45 minutes. Even a tiny MNIST classifier takes 8-12 seconds.
- Anything requiring dynamic control flow: FHE circuits are static. You can't branch on encrypted values (that would leak information).
- Large feature spaces: 1000+ features means 1000+ encrypted values to process. Memory becomes the bottleneck before compute.
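The no-branching rule has a standard workaround worth knowing: replace `if` with an arithmetic multiplexer built from the additions and multiplications FHE does support. Plain-integer sketch of the idea:

```python
def fhe_style_select(cond_bit, a, b):
    # cond_bit is 0 or 1 (conceptually encrypted); branching on it is not
    # allowed, so select via cond*a + (1-cond)*b — additions and
    # multiplications only, same operations regardless of the condition.
    return cond_bit * a + (1 - cond_bit) * b

print(fhe_style_select(1, 10, 20))  # 10
print(fhe_style_select(0, 10, 20))  # 20
```

Both branches are always computed, so the cost is the sum of both paths; this is also why data-dependent early exits give no speedup under FHE.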
Key Sizes and Network Transfer
FHE has a data expansion problem. A 20-dimensional feature vector that's 160 bytes in plaintext becomes:
- Encrypted input: ~42 KB
- Evaluation keys: ~158 MB (sent once per client, cached by server)
- Encrypted output: ~4 KB
That 158 MB evaluation key is the elephant in the room. It needs to be sent from the client to the server once per session. On a fast connection, that's a few seconds of upload. On mobile, it's a deal-breaker without aggressive key compression:
# Enable key compression — reduces eval keys by ~4x
from concrete import fhe

circuit = model.compile(
    X_train,
    configuration=fhe.Configuration(
        compress_evaluation_keys=True,
    ),
)
# Compressed eval keys: ~40 MB (still large, but manageable)
The Architecture That Works
After iterating through several designs, here's what worked for the bank deployment:
# Deployment architecture
"""
1. Client generates keys once, stores locally
2. Evaluation keys uploaded to server once (cached with TTL)
3. Each inference request sends only encrypted input (~42 KB)
4. Server returns encrypted result (~4 KB)
5. Client decrypts locally
Total per-request data transfer: ~46 KB
Latency budget:
- Network (encrypted input): 15ms
- FHE inference (server): 340ms
- Network (encrypted result): 5ms
- Client decrypt: 2ms
Total: ~362ms
"""
The 340ms inference time is for the XGBoost model that the bank actually chose — 15 trees at depth 4, quantized to 6 bits, processing 20 financial features. Accuracy is 95.2% on their holdout set, compared to 96.1% for the plaintext model. The bank accepted that 0.9% accuracy trade-off because FHE let them process customer data from a partner institution without data sharing agreements, which would have taken 8 months of legal review.
Where This Is Going
TFHE hardware acceleration is coming. Zama is building custom ASICs that promise 100-1000x speedup over CPU-based FHE. Intel's HEXL library already provides 2-4x speedups using AVX-512 instructions. GPU acceleration through CUDA-based TFHE implementations shows 10-50x improvements.
When hardware catches up — and it will, because the market demand is clear — a 340ms inference will become 3ms. At that point, FHE becomes viable for real-time applications across the board: encrypted search, private recommendation systems, confidential medical diagnosis, and yes, even large model inference.
For now, the sweet spot is clear: small-to-medium models (logistic regression, gradient boosting, shallow networks) on structured data with fewer than 100 features. If your use case fits that profile and you need mathematical privacy guarantees stronger than "we promise not to look at your data," homomorphic encryption is ready.