Some LLM background info I wrote for the article Semantic Search.

LLM Primer

What is an LLM?

A Large Language Model is an artificial intelligence designed to recognize patterns in language and produce contextually appropriate responses to queries.

  • They are Large because they have billions of parameters and are trained on curated data amounting to billions of words – the equivalent of reading millions of books, articles, and documents.
  • They are Language Models because they have been trained to understand human-created language.

Although LLMs are trained on text, the same principles can be applied to train models on movement recordings, musical notes, images, or any other form of sequential data. The key is whether there are patterns that express relationships the model can learn from.

Training

Training a large language model (LLM) means teaching it to predict the next token (roughly, the next word) in a sequence of text. The process involves several key steps:

  1. Tokenization
  2. Embeddings
  3. Training loop on a neural network

Tokenization: the text is broken down into tokens, which are numerical representations of words or subwords. Tokens are the basic unit of text that AI models process - smaller than words but larger than individual characters. For instance, the word “hamburger” might be split into tokens like ham–burg–er. Most LLMs have a maximum number of tokens they can handle in a single conversation (the context window). A context window of 32k tokens corresponds to roughly 20–25k English words.
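
You can see tokenization in action with a tokenizer library. This is a minimal sketch using tiktoken (the tokenizer package used with OpenAI models), assuming it is installed:

import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

tokens = enc.encode("hamburger")
print(tokens)                              # a short list of integer token IDs
print([enc.decode([t]) for t in tokens])   # the text piece behind each ID

# Counting tokens is how you estimate context window usage:
print(len(enc.encode("A context window of 32k tokens is roughly 20-25k English words.")))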

Embeddings are arrays of numbers (vectors) that can represent almost any kind of data: words, images, code, musical notes, etc. Think of them as arrows pointing to a point in space. These vectors are designed so that “similar” pieces of data have embeddings close to each other, making it easy to retrieve related information by comparing vectors.
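
As a toy illustration of “close to each other”, here is cosine similarity computed on made-up 4-dimensional vectors; real embeddings have hundreds or thousands of dimensions and are produced by a model, not written by hand:

import numpy as np

def cosine_similarity(a, b):
    # 1.0 means the arrows point the same way; values near 0 mean unrelated directions.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up embeddings for illustration only; a real embedding model produces these.
dog     = np.array([0.9, 0.1, 0.0, 0.3])
puppy   = np.array([0.8, 0.2, 0.1, 0.4])
invoice = np.array([0.0, 0.9, 0.8, 0.1])

print(cosine_similarity(dog, puppy))    # high: related concepts
print(cosine_similarity(dog, invoice))  # lower: unrelated concepts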

The Training loop is the iterative process through which a neural network learns from data by repeatedly processing examples, comparing predictions to correct answers, and adjusting its parameters to improve performance. Here are the key steps (a code sketch follows below):

  1. Initialize
    • Weights start with random values.
    • Biases start at zero or small random values.
  2. Forward Pass
    • Input data flows through the network.
    • The model makes a prediction using the current weights and biases.
  3. Calculate Error (Loss)
    • Compare the model’s prediction to the correct answer.
    • The difference is the “loss.”
  4. Backpropagation
    • The loss is propagated backward through the network.
    • Each weight and bias receives a gradient indicating how it should change to reduce the loss.
  5. Optimization Step
    • An optimizer uses these gradients to update each parameter.
    • The learning rate decides how big these updates are.
  6. Repeat
    • This cycle runs many times over the training data.
    • Over time, parameters converge to values that yield accurate predictions.

It’s like turning lots of small knobs (the weights and biases) a little at a time to achieve the best outcome.
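
To make this concrete, here is a minimal sketch in Python/NumPy that runs the same six steps to fit a single weight and bias on a toy dataset. Real LLM training follows the same cycle, just with billions of parameters and a far more complex network:

import numpy as np

# Toy data: the correct answers follow y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2 * x + 1

# 1. Initialize: random weight, zero bias.
rng = np.random.default_rng(0)
w, b = rng.normal(), 0.0
learning_rate = 0.01

for step in range(2000):                    # 6. Repeat over the training data.
    y_pred = w * x + b                      # 2. Forward pass: current prediction.
    loss = np.mean((y_pred - y) ** 2)       # 3. Loss: how wrong the prediction is.
    grad_w = np.mean(2 * (y_pred - y) * x)  # 4. Gradients, computed by hand here;
    grad_b = np.mean(2 * (y_pred - y))      #    backpropagation automates this.
    w -= learning_rate * grad_w             # 5. Optimization step: nudge each
    b -= learning_rate * grad_b             #    parameter against its gradient.

print(w, b)  # ends up close to 2 and 1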

Quick summary: chop text into little pieces, place them in a coordinate space, and create a function that looks up your queries in that space to produce an answer. That’s the (simplified) idea. Vectors are really arrows in space, except they may have 300 or 1500 dimensions instead of two or three.

Fine-Tuning

Fine-tuning is the process of further training a pre-trained model on a smaller dataset to adapt it for a particular domain while preserving its general capabilities.

Fine-tuning modifies the model’s weights/parameters themselves through additional training on your specific data. This means the model learns patterns and knowledge from your training data and incorporates them into its base capabilities. The fine-tuning process will likely capture general patterns and repeated structures, but it may not reliably memorize every specific detail of every function, especially rarely seen ones. The agent will still have to use its context window as working memory for the details it has not internalized.
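
As a sketch of what the training data looks like, here is one example in the chat-style JSONL format that OpenAI’s fine-tuning API expects (each line is one training example); the project name, question, and answer are made up for illustration:

import json

# Hypothetical prompt/answer pairs drawn from your own domain.
examples = [
    {"messages": [
        {"role": "system", "content": "You are an assistant for the ACME iOS codebase."},
        {"role": "user", "content": "How do I create a NetworkClient?"},
        {"role": "assistant", "content": "Call NetworkClient(configuration: .default) and inject it where needed."},
    ]},
]

# Fine-tuning services typically expect one JSON object per line (JSONL).
with open("training_data.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")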

Chat Completion

A chat completion is a response generated by an LLM in the context of a conversation. The model keeps track of previous messages to maintain context and generate coherent replies.

The context window is the maximum number of tokens a model can handle per conversation.

  • Each user message and any additional information (e.g., instructions, attached files) consume tokens, all of which must fit into the context window.
  • Language models can’t exceed their context window because they’re architecturally designed and trained to handle only a specific maximum length of text.
  • The attention mechanism itself (how tokens relate to each other) grows quadratically (n²) with context length, and the overall cost of training with longer contexts grows even faster in practice because of compounding factors like data requirements, stability issues, and hardware needs.
  • Modern models may offer context windows of 200k tokens or more, meaning they can handle a large chunk of text, but nothing beyond it.

A model that performs chat completion only has awareness of what is inside its context window. Creating a project with files in Claude or GPT is equivalent to copy-pasting those files at the beginning of the conversation. Each subsequent message from the user re-sends the whole conversation up to that point, increasing token consumption. The initial files are therefore “paid for” multiple times in terms of token usage, since they’re sent with each new message.
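
This is visible at the API level: every request carries the full message history, so earlier messages and any pasted files are billed again on every turn. A minimal sketch using the OpenAI Python SDK (the model name and file are illustrative):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The "project files" are just text prepended to the conversation.
history = [
    {"role": "system", "content": "You are a coding assistant."},
    {"role": "user", "content": "Here is the file I am working on:\n" + open("main.swift").read()},
]

def ask(question):
    history.append({"role": "user", "content": question})
    # The whole history is re-sent on every call, so earlier tokens are paid for again.
    response = client.chat.completions.create(model="gpt-4o-mini", messages=history)
    answer = response.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    return answer

print(ask("What does this file do?"))
print(ask("Refactor the networking code."))  # main.swift is sent again here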

Once the conversation grows too large, some earlier context may be “pushed out” or truncated, making it inaccessible to the model. Another issue is that LLM performance degrades in long conversations: it’s as if the model spends most of its attention reading the whole conversation instead of solving the problem. One solution is to ask the model to summarize the problem and paste that summary into a new conversation.

A workaround is to implement “sparse attention”, where a token doesn’t attend to every other token, avoiding the n² problem at the cost of some quality. Another is to compact the conversation, keeping only the parts relevant to the ongoing work. But it remains true that attention is limited.

Context in Big Codebases

LLMs have practical limits. They can only “see” as much text as fits into their context window, and they’re not automatically aware of everything in a massive code repository. Context windows can be easily exceeded by large application codebases.

With these limits in mind, there are three main ways to work with LLMs today: files, embeddings, and fine-tuning.

Files

For trivial cases, select only the minimal set of files needed to provide context about the problem. This reduces context window usage and ensures the model focuses on the most relevant code. Local reasoning and modularity are highly desirable here: they let you drop an entire library into the best-performing model and ask for architectural advice.

For instance, the following script concatenates all .swift files in the current folder and its subfolders, then stores the result in a single file in the parent folder:

find . -name "*.swift" -type f -print0 | xargs -0 cat > ../concatenated.swift

Embeddings

If you have sections of code that rarely change—such as a library—you can store them as embeddings. Then, when you need to query the agent about a specific part of that code, you retrieve the relevant chunks and include them in your query. This approach avoids re-uploading the entire library and helps keep the context window usage low.

Here is the workflow:

  1. Break down the codebase
    • Split code into smaller chunks, ideally chunks that make sense independently.
    • Convert each chunk into an embedding vector. Vectors that end up close to each other represent related content.
    • Store the vectors in a regular database (SQLite) or a vector database (Pinecone, Weaviate, Qdrant, Milvus, Redis).
  2. Form the Query
    • Convert your user query into an embedding vector.
  3. Retrieve & Provide Context
    • Fetch the most relevant chunks from the database.
    • Feed these chunks, alongside your question, into the model’s context. The products listed above provide APIs you can use to fetch the chunks for a query; you then send them, locally, to the chat completion model.

Hopefully, this results in low token consumption while still providing enough relevant code to answer the query. The model only sees the most relevant code sections, even when the overall codebase is too large to fit in the context window. For instance, when prompting about a class, the retrieval step may also fetch surrounding code so the agent can make sense of the problem.
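
A bare-bones sketch of the whole workflow, using the OpenAI embeddings endpoint and an in-memory list instead of a real vector database; the model name, file names, and one-chunk-per-file splitting are simplifying assumptions:

import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    # Returns one vector per input text.
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [np.array(item.embedding) for item in response.data]

# 1. Break down the codebase: here, naively one chunk per file.
chunks = [open(path).read() for path in ["NetworkClient.swift", "Cache.swift"]]
chunk_vectors = embed(chunks)   # in a real setup these go into SQLite or a vector DB

# 2. Form the query.
query = "How does the app cache network responses?"
query_vector = embed([query])[0]

# 3. Retrieve the most relevant chunks by cosine similarity, then send them
#    along with the question to the chat completion model.
def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

ranked = sorted(zip(chunks, chunk_vectors), key=lambda pair: cosine(query_vector, pair[1]), reverse=True)
context = ranked[0][0]  # top chunk; a real system would take the top k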

Fine-Tuning

If your project is especially domain-specific or you repeatedly need the model to handle the same queries, you might consider fine-tuning the model on your codebase or on specific usage patterns. This can help the model internalize common functions and code structures. However, fine-tuning alone will not memorize every detail of a massive repository. You’ll likely still need embeddings for deeper searches.

Cost comparison

  • Files: no upfront cost, but it can become expensive when you repeatedly send large amounts of text.
  • Embeddings: one-time cost in time, practically free in money.
  • Fine-tuning: high upfront cost in time and money. Lowest token usage per query since the model “knows” your code.

When to use them

  • Files: casual queries where you manually select the context.
  • Embeddings: ongoing queries for mostly stable files. Requires setting up a database and embedding pipeline.
  • Fine-tuning: domains that rarely change, where you expect enough usage to amortize the cost over time.