

I seldom, if ever, write technical posts. There are just too many excellent resources out there, and my goal isn't to explain technical things to a technical audience, much as I enjoy teaching from time to time or giving the odd guest lecture.

But I made an exception whilst perusing the literature on Transformers, the architecture behind Large Language Models. This was motivated by research into the potential use of Transformers for a novel non-language use case. I wanted to experiment with a few ideas, so I decided to build a base of various notebooks and code libraries from which to set out on my research path.

Secondly, I wanted to ground myself in sufficient code-level mechanics to explore a set of questions I have about what, exactly, Transformers are doing in relation to semantics. This was in order to understand how world models might get incorporated into LLM schemas, beginning with Transformers. Of course, this isn't a novel question, but I need my own research bed.

There is also the interesting question of why depth (the number of Transformer layers) is a key predictor of language-modeling performance (against the various benchmarks out there). I felt that by playing with the transformations, layer by layer, I might find insights and/or confirm various similar explorations already in the literature.
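One way to start poking at the layer-by-layer question empirically is to record the representation after each block and measure how much each successive layer moves it. The sketch below is purely illustrative scaffolding (random-weight toy residual blocks in numpy, not a trained model and not from any particular notebook); with a real model one would capture its actual hidden states, but the probing loop looks the same:

```python
import numpy as np

rng = np.random.default_rng(42)
d_model, n_layers, seq_len = 32, 6, 10  # hypothetical toy sizes

def toy_block(x, W1, W2):
    # Stand-in for one transformer layer: a 2-layer MLP plus a
    # residual connection (attention omitted for brevity).
    h = np.maximum(x @ W1, 0.0)  # ReLU
    return x + h @ W2            # residual connection

x = rng.normal(size=(seq_len, d_model))
states = [x]  # representation after each "layer", starting with the input
for _ in range(n_layers):
    W1 = rng.normal(scale=0.1, size=(d_model, d_model))
    W2 = rng.normal(scale=0.1, size=(d_model, d_model))
    x = toy_block(x, W1, W2)
    states.append(x)

# How far does each layer move the representation?
deltas = [np.linalg.norm(states[i + 1] - states[i]) for i in range(n_layers)]
print([round(d, 2) for d in deltas])
```

Swapping the random blocks for a trained model's layers (and the norm for, say, a probe trained on each layer's output) turns this toy loop into the kind of layer-wise exploration described above.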

A question that also piqued my interest is what the opposite of a Large Language Model (LLM) would look like: a "Small", yet nonetheless still performant, LM. This was from the point of view not only of understanding what a more parsimonious system might look like, but also of understanding the scope of independent research that might be achievable without Google or OpenAI resources ($$$$$).

Anyway, these are not the subjects of this blog post. Rather, this post is merely a link to a notebook that I published, which is a more fine-grained explanation of an existing code-level explanation of an Encoder-Decoder Transformer (the original architecture; BERT-like models, by contrast, are encoder-only).

Per the notebook intro:

I wanted to add some missing details to the standard transformer annotations often found in courses and texts. But not wanting to reinvent the wheel with yet another explanation, the accompanying notebooks are mostly an expansion of Chapter 11 of the Dive Into Deep Learning open-source book — a valuable resource for learners. (We are all learners.)
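For readers who have not yet opened the notebook, the core mechanic every such code-level walkthrough builds on is scaled dot-product attention. Here is a minimal numpy sketch of that one operation (toy dimensions chosen for illustration; this is not code from the notebook itself):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 query positions, d_k = 8
K = rng.normal(size=(6, 8))   # 6 key positions
V = rng.normal(size=(6, 16))  # 6 values, d_v = 16

out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 16): one weighted mixture of values per query
```

Multi-head attention, positional encodings, and the encoder-decoder wiring are all layered on top of this single primitive, which is what the notebooks walk through in detail.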