Embedding with MiniLM
Why MiniLM?
MiniLM is a compact family of Transformer models designed to capture the meaning of text in a lightweight way. The version I use, MiniLM-L6-v2, is a smaller cousin of BERT (a well-known language model). Instead of hundreds of millions of parameters, it runs on only 6 layers and produces a 384-dimensional vector (a list of 384 numbers) for each token.
Because it is tuned for sentence-sized text, MiniLM is especially good at representing short pieces of information, like a function signature or DocC comments. It also does a decent job with unusual words, such as function or variable names in code.
I’m using it to implement semantic search for Swift and Markdown in this application. In my tests, I’ve found that it produces more relevant results than OpenAI’s largest model.
Prepare MiniLM
Convert to Core ML
We could run the model directly using MLTensor, but converting it to Core ML makes it more suitable to run on GPU/ANE (ANE = Apple Neural Engine).
To convert it, I lock sequence/batch to static shapes and use FP16 for speed, then drop unused paths via tracing. Here are the highlights from prepare-model.py:
MAX_SEQ = 512: the maximum input size.
MAX_SEQ is the maximum number of tokens the model was trained to handle in a single input. Transformers are trained with a fixed sequence length, chosen to balance accuracy with speed. For MiniLM that limit is 512, so we lock Core ML inputs to [batch, 512]. This way Swift can preallocate MLMultiArray objects once, with no dynamic resizing or guesswork about padding.
Related concepts:
- Static shape: fixed input size.
- Attention mask: an array that tells the model which token positions to attend to (1) and which to ignore (0). Ignored positions are discarded when computing attention.
- Computing attention: a step in the transformer pipeline where the model decides how much each token should focus on other tokens.
ct.RangeDim(lower_bound=1, upper_bound=16, default=16): the batch size.
This tells Core ML to handle up to 16 inputs in one go. For each input there are multiple steps: tokenize, copy into an MLMultiArray, run Core ML, return the result. This carries an overhead of memory transfer, kernel launches, and scheduling that is best paid once rather than 16 times. Therefore we tell Core ML to expect inputs as an MLMultiArray of shape [16, 512].
In embedding and semantic search we don’t expect more than 16 × 512 = 8,192 tokens per batch, so 16 is a good balance.
compute_units=ct.ComputeUnit.ALL: use CPU, GPU, or ANE, whichever is free.
compute_precision=ct.precision.FLOAT16: use 16-bit numbers.
A Core ML model package is mostly tensors: big tables of numbers that represent what the model has learned. Each number is a weight: it nudges the model toward the right output.
These numbers can be stored with different precision:
- 32-bit (“full precision”): very exact, but heavier and slower.
- 16-bit (“half precision”): a little less exact, but much lighter and faster.
- 4-bit, 8-bit: these are usually quantized models, reduced to smaller sizes so they run on less powerful hardware.
When using embeddings we are usually interested in their similarity. We calculate this with cosine similarity, which compares the angle between two vectors in space. If the vectors point in the same direction, they’re considered similar, regardless of their length (magnitude). Because orientation doesn’t depend on tiny decimals, 16-bit is good enough.
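To make that concrete, here is a small NumPy sketch (not part of prepare-model.py; the vectors are random, purely for illustration) that compares the cosine similarity of the same two 384-dimensional vectors stored as 32-bit and as 16-bit floats:
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity: dot product divided by the product of magnitudes.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
a = rng.standard_normal(384).astype(np.float32)
b = rng.standard_normal(384).astype(np.float32)

# The same vectors, stored at half precision and widened back for the math.
a16 = a.astype(np.float16).astype(np.float32)
b16 = b.astype(np.float16).astype(np.float32)

print("float32:", cosine(a, b))
print("float16:", cosine(a16, b16))  # differs only far past the decimals that affect ranking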
minimum_deployment_target=ct.target.macOS14: generate an ML Program.
torch.jit.trace: record the exact steps the model takes when you run it once.
torch.jit.trace is a tool that runs the model once with example inputs. PyTorch watches every calculation that happens (all the matrix math, all the transformations) and records them step by step. That recording becomes the converted model.
For MiniLM we only care about turning text into embeddings, so we set it up to output only the embeddings and trace just that path; everything else the model can do is discarded. The result is a smaller, faster model.
This provides several benefits:
- Smaller file: No classification heads or extra outputs.
- Faster to load: Less stuff to initialize on device
- Better optimized: Core ML can optimize a fixed structure more effectively for consumer hardware
The following script does the download, tracing, and conversion in one go:
mkdir -p MiniLM && cd MiniLM
python -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install torch==2.5.0 torchvision torchaudio
pip install scikit-learn==1.5.1
pip install coremltools transformers sentence-transformers
python prepare-model.py
The script downloads all-MiniLM-L6-v2, disables a PyTorch fastpath that confuses the converter, fixes the sequence length to 512 tokens, and creates a MiniLM_L6_v2.mlpackage
that can be loaded from the bundle.
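In essence, the conversion part of prepare-model.py looks roughly like the sketch below. Treat it as an approximation: the Embedder wrapper and variable names are mine, and the real script also disables a PyTorch attention fastpath before tracing, which is omitted here.
import numpy as np
import torch
import coremltools as ct
from transformers import AutoModel

MAX_SEQ = 512

# Wrapper that exposes only the path we care about: tokens in, hidden states out.
class Embedder(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, input_ids, attention_mask):
        return self.model(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state

base = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2").eval()
embedder = Embedder(base)

# Trace once with example inputs of the fixed [1, 512] shape.
example_ids = torch.zeros((1, MAX_SEQ), dtype=torch.int64)
example_mask = torch.ones((1, MAX_SEQ), dtype=torch.int64)
traced = torch.jit.trace(embedder, (example_ids, example_mask))

batch = ct.RangeDim(lower_bound=1, upper_bound=16, default=16)
mlmodel = ct.convert(
    traced,
    inputs=[
        ct.TensorType(name="input_ids", shape=(batch, MAX_SEQ), dtype=np.int32),
        ct.TensorType(name="attention_mask", shape=(batch, MAX_SEQ), dtype=np.int32),
    ],
    outputs=[ct.TensorType(name="last_hidden_state")],
    compute_units=ct.ComputeUnit.ALL,
    compute_precision=ct.precision.FLOAT16,
    minimum_deployment_target=ct.target.macOS14,  # produces an ML Program
)
mlmodel.save("MiniLM_L6_v2.mlpackage")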
Using fixed versions and settings makes the resulting MiniLM_L6_v2.mlpackage
reproducible. Leaving them at their defaults could change tensor shapes or precision and break compatibility with embeddings you generated earlier.
Bundle Tokenizer Assets
The training process for the model creates a correspondence between words, tokens, and number IDs. Example:
"settings" -> ["set", "##ting", "##s"] -> [1012, 4567, 302]
The ID for each token is specific to this model and means nothing to any other model. Therefore, along with the model, we have to download the following files from Hugging Face:
- config.json: model architecture config
- tokenizer_config.json: tokenizer behavior settings
- tokenizer.json: configuration + vocabulary + splitting rules
- vocab.txt: the vocabulary lookup table
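If you want to see the word-to-token-to-ID mapping for yourself, a short snippet along these lines (using the transformers package installed by the script above) prints the subword pieces and their IDs and writes the tokenizer assets to disk; the directory name is just an example.
from transformers import AutoConfig, AutoTokenizer

model_id = "sentence-transformers/all-MiniLM-L6-v2"

tokenizer = AutoTokenizer.from_pretrained(model_id)
pieces = tokenizer.tokenize("settings")
print(pieces, tokenizer.convert_tokens_to_ids(pieces))  # subwords and their IDs in this vocabulary

# Writes tokenizer_config.json, tokenizer.json, vocab.txt, ... for bundling with the app.
tokenizer.save_pretrained("MiniLM/assets")

# config.json describes the model architecture and is fetched separately.
AutoConfig.from_pretrained(model_id).save_pretrained("MiniLM/assets")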
Looking at the configuration, MiniLM-L6 is:
- BERT: a standard Transformer encoder stack with self-attention and feed-forward blocks
- Uncased: everything gets lowercased, which degrades signal for code
- 6 layers: six Transformer encoder blocks
- Hidden size 384: each token is represented as a 384-dimensional vector
- Vocabulary size 30,522: the set of known WordPiece (the BERT tokenization algorithm) tokens; words/subwords outside this set become [UNK]
- Max sequence length 512: the longest tokenized input it can handle in one pass; shorter inputs are padded, longer ones truncated
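A quick way to confirm these values is to read them back from config.json; the attribute names below are the Hugging Face BertConfig fields that correspond to it.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
print(cfg.num_hidden_layers)        # 6 encoder blocks
print(cfg.hidden_size)              # 384-dimensional token vectors
print(cfg.vocab_size)               # 30,522 WordPiece tokens
print(cfg.max_position_embeddings)  # 512-token maximum sequence length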
At runtime HFTokenizerAdapter loads the tokenizer and vocabulary through Hugging Face’s Swift AutoTokenizer. Then we call encodeWithPadding(text:), which:
- Splits text into WordPiece subwords using the same rules as the original model.
- Inserts [CLS] at the front and [SEP] at the end.
- Truncates or pads to 512 tokens.
- Builds an attention mask with 1s for real tokens and 0s for padding.
The resulting arrays are wrapped in MLMultiArray and passed to model.prediction(…)
to obtain one vector per token position. Specifically we get a 3D tensor shaped [batch, seq_len, hidden_size]
where hidden_size
is 384 for MiniLM.
Predicting with Swift
1) Load the Core ML model
We exported to mlpackage, which once compiled becomes a .mlmodelc
file.
// Load the Core ML model
func loadMiniLM() throws -> MLModel {
guard let modelURL = Bundle.main.url(forResource: "MiniLM_L6_v2", withExtension: "mlmodelc") else {
throw NSError(domain: "MiniLM", code: 1, userInfo: [NSLocalizedDescriptionKey: "MiniLM_L6_v2.mlmodelc not found in bundle"])
}
let cfg = MLModelConfiguration()
cfg.computeUnits = .all
return try MLModel(contentsOf: modelURL, configuration: cfg)
}
2) Tokenize the input
// Tokenize the input (BERT WordPiece, uncased)
func tokenize(_ text: String, maxLen: Int = 512) throws -> (ids: [Int], mask: [Int]) {
// Assumes you’ve configured the tokenizer with MiniLM’s vocab + lowercasing.
let tokenizer = try BERTWordPieceTokenizer()
let (ids, mask) = tokenizer.encodeWithPadding(text: text, maxLength: maxLen)
return (ids, mask) // ids/mask are length == maxLen (e.g., 512)
}
This step performs several operations that convert user text into tokens. The result is a list of integer IDs (from the vocabulary) plus a mask telling the model which positions are real words and which are padding to reach 512 tokens.
- Split words into subword tokens: e.g. settings becomes ["set", "##ting", "##s"].
- Mark begin/end: BERT-style models use [CLS]/[SEP] for this. The input sequence becomes [CLS] + tokens_for_text + [SEP] + padding.
- Resize the input to 512 tokens: the sequence is truncated or padded so it results in exactly 512 tokens. To avoid computing attention for padding tokens, a companion mask of 512 values is generated, where 0 marks the padding positions.
- Normalize tokens: MiniLM normalizes everything to lowercase.
- Replace unknown tokens with [UNK]: if a token is not present in the vocabulary it is replaced with [UNK]. This is rare, because even function identifiers are decomposed into known tokens, for instance scanForChanges() -> [scan, ##For, ##Change, ##s] and HTTPRequest2XX -> [HTTP, ##Request, ##2, ##XX].
3) Wrap input in MLMultiArray
// Creates one batch of 512 tokens.
func makeInt32Array(_ ints: [Int]) throws -> MLMultiArray {
let arr = try MLMultiArray(
shape: [1, NSNumber(value: ints.count)],
dataType: .int32
)
let ptr = arr.dataPointer.assumingMemoryBound(to: Int32.self)
for (i, v) in ints.enumerated() { ptr[i] = Int32(v) }
return arr
}
Core ML’s MLMultiArray is a container for tensors (multi-dimensional arrays of numbers). It lets us do the math (masking, summing, normalization) directly on-device, in a fast, vectorized way. Here we are just creating the array and putting the tokens inside.
4) Run prediction and grab token-level hidden states
// Run prediction and grab token-level hidden states
func tokenHiddenStates(model: MLModel, inputIDs: MLMultiArray, attentionMask: MLMultiArray) throws -> MLMultiArray {
let features = try MLDictionaryFeatureProvider(dictionary: [
"input_ids": inputIDs,
"attention_mask": attentionMask
])
let out = try model.prediction(from: features)
// Prefer "last_hidden_state"
// otherwise fall back to the first multiArray output.
if let hs = out.featureValue(for: "last_hidden_state")?.multiArrayValue {
return hs // [1, seqLen, hidden]
}
// Fallback: find any multiArray output
for name in out.featureNames {
if let arr = out.featureValue(for: name)?.multiArrayValue, arr.shape.count == 3 {
return arr
}
}
throw NSError(
domain: "MiniLM",
code: 2,
userInfo: [NSLocalizedDescriptionKey: "No 3D hidden-state output found"]
)
}
Calling model.prediction(…)
runs the Core ML model with our inputs. Each token is turned into a hidden state: a vector of numbers that captures the token’s meaning in context. For MiniLM L6, each hidden state has 384 numbers. Together they form a 3D tensor of shape [batch, seq_len, hidden_size]
where:
- batch: number of inputs processed at once (here, 1).
- seq_len: the number of tokens (here, 512).
- hidden_size: the length of each token vector (384 for MiniLM).
So the output shape [1, 512, 384]
means: one input, with 512 tokens, each represented as a 384-dimensional vector. The model may expose this as last_hidden_state
or hidden_states
.
5) Masked mean pooling
// Masked mean pooling (to sentence/chunk vector)
func maskedMeanPool(hs: MLMultiArray, attentionMask: MLMultiArray) -> MLXArray {
// hs: [1, seqLen, hidden], mask: [1, seqLen]
let hsArr = MLXArray(mlMultiArray: hs).astype(.float32)
let maskArr = MLXArray(mlMultiArray: attentionMask).astype(.float32)
let seqLen = maskArr.shape.count > 1 ? maskArr.shape[1] : 1
let mask3D = mlx.reshape(maskArr, [1, seqLen, 1]) // [1, seqLen, 1]
let masked = hsArr * mask3D // broadcast
let sumVec = mlx.sum(masked, axes: [1]) // [1, hidden]
let counts = mlx.maximum(mlx.sum(mask3D, axes: [1]), MLXArray(1e-6 as Float)) // [1,1]
return sumVec / counts // [1, hidden]
}
After prediction, we have one vector (384 numbers) for each token. But we want a single vector that represents the entire sentence or snippet. To do this we use mean pooling: take the average of all the real token vectors (the ones marked with 1 in the mask). This step is called masked mean pooling.
The result is a single 384-dimensional vector that captures the meaning of the whole input. This is the embedding we’ll normalize and use for semantic search. At this point we have one embedding that represents the entire chunk of text.
6) L2 normalize
// L2 normalize (cosine-ready)
func l2normalize(_ vec: MLXArray) -> [Float] {
let eps = MLXArray(1e-12 as Float)
let norm = mlx.sqrt(
mlx.maximum(
mlx.sum(vec * vec, axes: [1], keepDims: true),
eps
)
)
return (vec / norm).toArray() // [Float], length == hidden (≈384)
}
After pooling, we have one vector for the whole input (≈384 numbers). The last step is L2 normalization. This means we scale the vector so that its length (magnitude) becomes 1, without changing its direction.
Why do this? In semantic search we compare embeddings by angle (cosine similarity). If two vectors point in the same direction, they are considered similar, no matter how long they are. Normalizing makes every embedding the same length, so comparisons depend only on direction.
The result is a single, normalized 384-dimensional embedding that represents the text. This is the vector you store and use for similarity search.
We can now call every function above to produce an embedding for the input text.
// End-to-end: text -> normalized embedding
func miniLMEmbedding(_ text: String, maxLen: Int = 512) throws -> [Float] {
let model = try loadMiniLM()
let (ids, mask) = try tokenize(text, maxLen: maxLen)
let inputIDs = try makeInt32Array(ids)
let attention = try makeInt32Array(mask)
let hs = try tokenHiddenStates(
model: model,
inputIDs: inputIDs,
attentionMask: attention
)
let pooled = maskedMeanPool(hs: hs, attentionMask: attention)
return l2normalize(pooled)
}
// Example:
let vec = try miniLMEmbedding("SwiftUI view that handles user input")
print("dim:", vec.count) // ~384
MLX keeps the post-processing on-device and vectorized: multiply by the mask, reduce across the sequence axis, divide by the live-token count, then L2 normalize. Rudrank Riyam has a book on MLX if you are interested.
Power Chart
The first conversion was running entirely on the CPU at around 13 W. Core ML defaulted there because it wasn’t sure the GPU or Neural Engine could handle certain features. For instance:
- Flexible input shapes. If the model accepts variable-length inputs, Core ML can’t pre-plan memory well, so it keeps work on the CPU (which can handle any shape). Until recently, any model with dynamic dimensions (like RangeDim or 512×N) was forced to CPU. The converter now accepts them, but the runtime still prefers CPU unless you fix the shapes in advance (e.g. always 512 tokens).
- Mixed precision. If some layers want 32-bit floats and others 16-bit, accelerators struggle to coordinate the work.
- Attention layers. Transformers rely on operations such as attention, layer norm, and GELU. On macOS, the ANE doesn’t fully support these yet.
By giving Core ML a fixed input size (512 tokens), using FP16 (16-bit floats), and setting computeUnits = .all
, the runtime can safely schedule the work on GPU.
With those tweaks, power use dropped to ≈ 5–6 W on CPU, ≈ 0.3–0.4 W on GPU, and 0 W on ANE. This shows how much more efficient the GPU is for transformer workloads: ≈10x less power than the CPU for the same job.
I think this is as good as it gets. MiniLM won’t run on the Neural Engine as-is. The ANE is more specialized than the GPU and only supports a limited set of operations. Transformers like MiniLM rely on others (attention, layer norm, GELU) that it doesn’t accelerate. To change that, Apple would need to design and train a new Transformer variant built only from ANE-friendly ops. Apple’s own Foundation models may be examples of this.
Dependencies
Swift dependencies I used:
- mlx-swift (MLX, MLXNN, MLXFast, MLXLinalg, MLXRandom): Apple’s Swift bindings for MLX, providing the tensor math, neural-net layers, linear algebra, random number generation, and fast kernels used for pooling and normalization.
- swift-transformers (Transformers): Hugging Face’s Swift port, used here for tokenizer adapters.
In the Python script, torch==2.5.0, coremltools, transformers, and sentence-transformers are the pieces that actually export the graph, while scikit-learn satisfies a dependency pulled in by sentence-transformers. Pin these versions if you want to experiment with conversions and regenerate compatible packages later; unpinned upgrades can silently change the output.