It is common to use heuristics to frame innovation. Whilst this can easily tend towards a kind of bias wherein the heuristic becomes more precious than the execution, it is hard to cope with lots of ideas and data without narratives. One such heuristic is the maxim of 70-20-10, which refers to the percentages of R&D investment in incremental, evolutionary and radical (or disruptive) innovation categories. McKinsey-ists might call these Horizon 1, 2 and 3. Not quite the same, but you get the idea. I don’t attempt to define the categories here.
My argument is that AI is so foundational to the future of most industries and processes that a starting point for thinking about the 10% disruption is that 100% of it ought to be concentrated upon AI. Hence the name: 100 of 10.
As an interpretation of this maxim, you might consider that if you cannot find reason, or sufficient resource/opportunity to occupy 100%, then you are quite possibly doing something wrong — most likely you haven’t really understood the scope of AI.
At the very least, this maxim is a warning bell to business leaders whose interest in AI amounts to no more than reading airport books about the subject and then instructing their digital transformation folks to “do more AI”. Alas, I believe this is common and that it often amounts to ticking of boxes versus any meaningful doing.
This post is an attempt to offer a glimmer of insight as to why the 100-of-10 rule ought to be taken seriously by all business leaders.
[Note: all text in red is generated by GPT-3 using the DaVinci-002 model.]
The Foundation Revolution
I seldom write to explain how some algorithm works mathematically, mostly because there are already so many great resources. And we live in a world where many explainers are adroit at explaining as part of a seemingly full-time commitment — one that I cannot match, nor attempt to. My value is in interpretation, which I shall attempt here.
Having worked on fundamentals across a swathe of technical disciplines, from chip design to 3D compression to racing cars to computational aesthetics, I often ask probing questions. Here I am asking: why pay so much attention to AI?
There are perhaps hundreds of ways to answer that, but I will attempt only one for now.
Leaving aside what I have said before about the importance of definitions, I don’t attempt to define AI here. What I will do is attempt to interpret the broader implications of this paper: “On the Opportunities and Risks of Foundation Models” written by The Center for Research on Foundation Models (CRFM) at the Stanford Institute for Human-Centered Artificial Intelligence (HAI). (Note that my overview of how Foundation Models work below is somewhat analogical rather than technically precise, in order to explain to a non-technical readership.)
The paper summarizes the new AI landscape in which the power of Deep Learning models appears to have reached a critical phase.
Per the abstract:
AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks
Note that since the most recent revision of this paper (August ’21), the example models cited have already been surpassed by more powerful versions, possibly reflecting a Power Law in performance gains (against model scale). We shall return to this subject, but first, let’s dwell upon the phrase: downstream tasks.
Until recently, training an AI model to understand language or images — i.e. deep semantic fields — required that the model be tuned to the specific task at hand, such as, say, answering questions, or summarizing text. However, with recent attempts to train models at much greater scale (i.e. by training them on large corpora of text) something peculiar emerged. I use the term emerged deliberately as it is my belief (not discussed in the paper) that the true nature of language, in its cultural context within a technological society, is that it constitutes a Complex System.
I will not dwell upon that claim here as it might entail wandering into the (controversial) domain of philosophy of language, but I will only mention that a characteristic of Complex Systems is that new, and dominant, properties can emerge from the unplanned interaction of myriad system components.
In this case, the unplanned interaction is the interaction of vast numbers of component functions within the model to produce novel capabilities that were not the target of training the model. There has been mention by some commentators of this as a “Phase Change”, akin to similar phase changes in large physics models, although I do not favor this interpretation.
By emergent property I mean that, unbeknownst to the model architects, large-scale models display a post-hoc ability to solve problems without having received specialized training in those problem domains. For example, a model trained to understand language in the context of translation also acquires abilities that allow it to answer questions or summarize text.
What is even stranger is that these abilities can be invoked by merely prompting the model via a textual input. As an example, if I take the second paragraph from my summary above and feed it into GPT-3 with the preceding text prompt “Summarize for a second grader” (a well-known prompt from the GPT-3 playground), it remarkably produces a condensed version.
This person is saying that we should be concentrating on AI 100% because it is so important for the future.
Yes, that is, kind of, what I am saying.
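For the curious, invoking this kind of prompt programmatically is straightforward. Here is a minimal sketch using the (2022-era) OpenAI Python library; the sampling parameters are illustrative choices, not prescriptions, and you would need your own API key.

```python
import os

def build_prompt(text: str) -> str:
    # The well-known playground instruction, prepended to the passage.
    return f"Summarize for a second grader:\n\n{text}\n"

def summarize_for_second_grader(text: str) -> str:
    # Lazy import so the prompt builder can be used without the
    # openai package installed.
    import openai
    openai.api_key = os.environ["OPENAI_API_KEY"]
    response = openai.Completion.create(
        model="text-davinci-002",  # the DaVinci-002 model used in this post
        prompt=build_prompt(text),
        max_tokens=100,
        temperature=0.7,
    )
    return response["choices"][0]["text"].strip()
```

Note that the completion you get back will vary run to run (a temperature above zero makes sampling stochastic), which is worth remembering when comparing prompt variants.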
To spell out what is happening here, the model was not trained to summarize texts for second graders. However, due to the scale of the model and the ability of its components to discriminate language relationships over deep hierarchies (via a mechanism called Self Attention), the model has discovered a function that can map one text passage to another in a way that satisfies the prompt “Summarize for a second grader”.
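The Self Attention computation itself can be sketched in a few lines: each token’s representation is rebuilt as a weighted mixture of every token’s, with the weights derived from query/key comparisons. A toy, single-head sketch with random (untrained) weights, purely to show the shape of the computation — the real model stacks many such layers with learned weights, plus much else:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence of
    token vectors X (seq_len x d). Weights would normally be learned."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])         # pairwise relevance
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over each row
    return weights @ V                             # mix values by relevance

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))          # 5 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)  # same shape as X: one vector per token
```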
This has happened because within the large corpus of input data there were enough examples of patterns wherein a summarized version of text can be related to an original text (not necessarily within the same document) via a term similar in semantic meaning to the prompt. This pattern was presumably of use to the model in completing its original task even though the summarization task was not part of its objective.
We can think of this summarization task as a kind of “language understanding primitive” that would conceivably help a translator to translate. And we can think of the AI model as a machine that can learn, or generate, a set of such primitives (without knowing what they are) which when combined can help to solve the original problem the machine was being asked to solve (i.e. language translation).
As an analogy, imagine that we wanted to train a machine that has an adding primitive to multiply numbers and we gave it lots of examples that it could rote learn, like 8×8=64. And then imagine that we also gave it lots of examples of serial addition, like 2+2+2+2 = 8, and that, in the training data, we made mention that 2+2+2+2 can be expressed as “4 lots of 2” or “4 times 2” or “4×2”. We can imagine that the machine might learn that one way to multiply, as indeed was the way CPUs used to do it, is to iteratively add. Then if we gave it a novel problem, like 9×9 (i.e. not present in the training data) it might use its acquired “iterative adding” primitive to solve the problem (e.g. by constructing an iterator, keeping in mind that the machine does not have a native multiplier).
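To make the analogy concrete, here is the acquired primitive written out as code: multiplication reduced to the only operation our imaginary machine natively has, addition.

```python
def multiply_by_iterated_addition(a: int, b: int) -> int:
    """Multiply two non-negative integers using only an adding primitive:
    add b to a running total, a times over."""
    total = 0
    for _ in range(a):
        total += b  # the only arithmetic primitive we allow ourselves
    return total

# The novel problem from the text, solved without a native multiplier:
result = multiply_by_iterated_addition(9, 9)  # 81
```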
[Note: you might be wondering if, indeed, GPT-3 can “do math”. Kind of. See this paper.]
The large-scale language model acquires all kinds of such primitives, beginning with relatively simple word successions, like the word “Labor” often precedes the word “Union”, or that the word “barks” is often found in sentences with the word “Dog” and not “Cat”. Over a large set of data, this might allow the model to “translate” the sentence “Dogs will bark to alert their owners” to “Cats will meow to alert their owners”. These primitives are acquired via a dense set of layers of the Self-Attention mechanism allowing the model to both acquire many primitives and to acquire “nested” ones that can span fairly large blocks of text.
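A crude stand-in for these pair-wise “what goes with what” primitives is a simple co-occurrence count over a corpus. The real mechanism (stacked Self-Attention) is vastly richer, but the counting intuition holds:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(sentences):
    """Count pairwise word co-occurrence within each sentence: a toy
    stand-in for the 'what words go with what' primitives the model
    acquires at far greater depth via Self-Attention."""
    counts = Counter()
    for sentence in sentences:
        words = sorted(set(sentence.lower().split()))
        for a, b in combinations(words, 2):
            counts[(a, b)] += 1
    return counts

corpus = [
    "the dog barks at night",
    "a dog barks to alert its owner",
    "the cat meows at night",
]
counts = cooccurrence_counts(corpus)
# ("barks", "dog") co-occurs twice; ("barks", "cat") never does.
```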
To ask a model like GPT-3 to solve the above Dog-to-Cat translation is not like asking it to translate “My name is Paul” to French, which I can do with a prompt like:
English: My name is Paul
And it will respond with (GPT-3 answer in red):
English: My name is Paul
French: Je m’appelle Paul.
There is no “Cat language” that would make such a prompt useful. One has to learn what kind of prompt structure is likely to invoke the model’s innate primitives, abilities it acquired during training without any such prompting. As an example, here’s what I entered:
Alert their owner: dog barks
Alert their owner: cat
And here’s how it completed the prompt:
Alert their owner: dog barks
Alert their owner: cat meows
Alert their owner: baby cries
Note here how the model understood that it was a mapping problem (via the colon) of the form Bark is to Dog (within the context “Alert owner”) as Meow is to Cat. Interestingly, the model went on to offer another example of that form related to babies. From this we can surmise that across the large-scale training data there were plenty of examples of cats meowing, dogs barking and babies crying within a similar context of alerting. Of course, due to the low-level primitive of understanding what words go with what (pair-wise and even wider spans), the model has an “innate” synonyms primitive, so we don’t have to rely uniquely upon the word “alert” to signify the context. Any synonym, or similar semantics, will do. We can attempt that next:
get attention: dog barks
get attention: cat meows
get attention: person shouts
This ability greatly amplifies the power of the model because much of language involves the processing of synonyms, homonyms and so on: relationships that the AI figures out relatively easily because such patterns are much more uniformly available throughout the corpus.
Let’s consider a slightly different prompt to illustrate the capabilities further whilst hinting at something else to come.
Alert the neighbor: dog barks
Alert the neighbor: cat meows
If your dog is barking excessively, you may want to alert your neighbor so they are aware of the situation. Similarly, if your cat is meowing excessively, you may also want to alert your neighbor.
Note that the model did successfully complete the translation task: meows. However, it went on to produce a run-on sentence that gives us a window into the kind of context the model was using as the conjoining “relationship primitive”. What I had in mind was that the animal was alerting its neighbor, like it previously alerted its owner. However, the follow-on sentence seems to suggest that the pattern the model selected as most probable for context was the occasion of a cat or dog making excessive noise so as to warrant alerting the neighbor.
[Side note: when I say “what I had in mind” — this ought to make us think of the Theory of Mind wherein when humans communicate they constantly attempt to approximate what each other are thinking. This is missing from AI assistants and why they can be very frustrating at times. This immediately hints at possible architectures, but that’s a topic for another time.]
We can see from this previous example that whilst the model is powerful, sparse prompts give it very little information to go on, and it may not interpret them as we intended.
If I revise the prompt, I get a different response:
In order to alert the neighbor: a dog barks
In order to alert the neighbor: a cat meows
In order to alert the neighbor: a person shouts
Here we see a successful prompt completion and a very analogous follow-on example that seems to indicate that the context the model has selected is close to what I had in mind. I did this by engineering the prompt.
Now, I could write reams about how this model works (procedurally) and it could be a fascinating diversion into language and AI algorithms, but let us return to the phrase: downstream tasks.
This ability for the model to respond to prompts with only a single example to go by is called Single (or One) Shot Learning. The model was trained to translate English to French (or multiple dialects) but it has managed to solve a novel task (call it “Modes of alerting”) with only one example.
Some tasks require a few examples and this is called Few Shot Learning. The model can do so because within the context of the original problem solution the model learned certain primitives that can be used to solve a much broader class of problems, or downstream tasks.
Now, within a broad range of enterprise activities, the core of those activities is related to understanding language within context, setting and scope. And, there are probably many such activities where the primitives of a large-scale model could be applied without the need to build a dedicated model. One example might be the oft-mentioned sentiment analysis:
Paul thumped the desk: angry
Paul jumped with joy: happy
Paul didn’t say much: nonplussed
Paul’s eyes lit up: excited
Note this is an example of Few Shot Learning, plus I added a preceding text string to provide an additional prompt to the model as to what kind of problem context I had in mind. In effect, this is like a kind of “command” to the model to complete the Few Shot examples using emotional cues.
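Assembling such Few Shot prompts is mechanical enough to wrap in a helper. A sketch, where the instruction line and examples are whatever the task calls for:

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a Few Shot prompt: an optional instruction line, several
    'input: label' examples, then the query left open for the model to
    complete. All the strings here are illustrative."""
    lines = [instruction] if instruction else []
    lines += [f"{text}: {label}" for text, label in examples]
    lines.append(f"{query}:")
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Label each action with the emotion it suggests.",
    [("Paul thumped the desk", "angry"),
     ("Paul jumped with joy", "happy"),
     ("Paul didn't say much", "nonplussed")],
    "Paul's eyes lit up",
)
```

The model then completes the final open-ended line; everything before it is, in effect, the “program”.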
This One/Few-Shot Learning capability of large language models has a profound impact because it would appear that certain AI models are equipped to provide a foundation for a good deal of related tasks, hence the term Foundation Models.
The opportunity for the use of Foundation Models should be obvious — if many business processes are solved or accelerated using language understanding, then the prospect of automating such processes using pre-trained foundation models is seductive. This is especially so because the solution of downstream tasks does not require the kinds of AI expertise and resources that were used to train the original models. All that is required is access to the model and the skill to figure out a viable means of prompting.
As you might have noticed, the method of prompting does affect what the model does and how well it does it. This takes time, skill and effort, but nothing compared to the skill and effort of training a custom AI model. This new skill of finding a suitable prompt has been called Prompt Engineering.
That said, there doesn’t appear to be much engineering involved. Indeed, some of the best “prompt engineers” out there appear to be folks with good language skills, such as writers. We can take the above example and change it from emotional cues to financial status thus:
Paul thumped the desk: broke
Paul jumped with joy: bonus
Paul didn’t say much: low balance
Paul looked worried: overdraft
Paul’s financial status is broke.
This is an interesting example. The model has indeed understood that I am trying to translate my actions to financial status, say after looking at my bank balance. We shall return to this example — perhaps you are already noticing some pitfalls?
Now, I am skipping an important step called fine-tuning. Recall that the original model is pre-trained on a large text corpus. This is typically a “general knowledge” corpus, such as online forums, Wikipedia, and so on. If a particular use case involves more specialized language, such as relating to a particular product line or business context, then the existing model can be fine-tuned by showing it lots of examples of texts from that specific domain. For example, if I am an online lending company and want to solve a downstream task related to customer queries about loans, I might fine-tune the model on a set of examples of customer dialogs wherein the language is more tuned to what gets said within the context of applying for a loan.
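At the time of writing, OpenAI’s fine-tuning endpoint accepts training data as JSONL prompt/completion pairs. A sketch of preparing such a file for the hypothetical lending use case (the dialog snippets, rates and filename are invented for illustration):

```python
import json

# Each line of the JSONL file is one training example: the prompt is the
# dialog so far, the completion is the agent turn we want the model to learn.
dialogs = [
    ("Customer: What rate do I get on a 12-month loan?\nAgent:",
     " Rates depend on your credit profile; 12-month terms start at 6.9% APR."),
    ("Customer: Can I repay my loan early?\nAgent:",
     " Yes, early repayment is allowed with no penalty."),
]

with open("loan_dialogs.jsonl", "w") as f:
    for prompt, completion in dialogs:
        f.write(json.dumps({"prompt": prompt, "completion": completion}) + "\n")
```

The resulting file would then be uploaded to the fine-tuning endpoint; consult the provider’s documentation for the exact invocation, which has changed over time.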
Within the rubric of the above example, which is contrived and perhaps of limited practical use, we might imagine that a fine-tuned model more familiar with the actions of people in relation to their financial status might well be able to predict with some accuracy a person’s financial status from certain cues. As an aside, I previously researched how it might be possible to detect personality (according to some psychometric model, like the Big Five) mapped to financial behaviors. This work was motivated by research originally reported by Visa. As with so many of my older AI projects, I believe that application of SOTA Foundation Model techniques would have significantly enhanced the results, notwithstanding some of the challenging nuances of this particular problem (and I am not suggesting such an approach works, although my research indicated it might).
Transformers: Functions in Disguise
With transformers and large-scale models, the opportunity is far greater than merely applying pre-trained models to downstream processes. There are many opportunities to adapt the original transformer architecture, or similar, to a broad range of business problems wherever the underlying data is likely to incorporate patterns that might be interpreted using hierarchies of Self Attention.
What does this mean?
Language, as we all know, is a system of rules. Without rules, we would just have noise — or incoherence. I cannot write a sentence like: “Dog did it fly back over sundown?” Well, I cannot write such a sentence with any conventional widespread meaning, even though language can — and does — often deviate from the official rules of “proper” grammar. More than likely, I have violated a number of those rules in this post, never mind in the shorthand methods we all now use in our text and Slack messages.
The transformer model is attempting to decode these rules via construction of a set of word and sentence relationships (via probabilities of detecting those patterns in the data) — i.e. the primitives I referred to earlier. It is able to do so because it can exploit self-supervised, autoregressive training — namely, the dataset itself provides everything we need to know to build and test our model. We can test because if we ask the model to translate, say, “My name is Paul” to French, we already have plenty of previous examples in French of this pattern (“Je m’appelle X”) to guide the self-learning of the model — the corpus contains the training data without experts having to supervise via examples.
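This is why no labeling army is needed: an ordinary corpus labels itself, because every next word is the training target for the words that precede it. A sketch of carving text into such self-labeled pairs:

```python
def next_token_pairs(text):
    """Split text into (context, target) training pairs: each word is the
    label for the words that precede it, so the corpus supervises itself."""
    tokens = text.split()
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

pairs = next_token_pairs("my name is Paul")
# e.g. the final pair is (["my", "name", "is"], "Paul")
```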
There are perhaps many such datasets or circumstances. One example that I have looked at is the design of webpages, meaning not just what the text says, but the images that are present and the layout of the information, all combining with some kind of coherence to produce a final page design (and content). Whilst there might be plenty of aesthetic variations in page design, by and large many websites conform to similar design patterns — indeed, designers like to acknowledge, pioneer, adopt and share such patterns. For example, if I see a Contact Us page, then likely it will have contact information, office locations, perhaps a link to an FAQ, etc.
We can think of pages as having widespread similarity (across the entire Worldwide Web) that is typically coherent and follows rules (of design, related to communication of language). These rules have structure and hierarchy, positional and semantic, as content often does, both by design convention and the inherent coherence of language (given that a page design is often influenced by language structure — i.e. to “tell a story”). Therefore, it is plausible that we could train a transformer architecture to look for these patterns and end up unearthing a set of “design primitives” to model page design. Indeed, here is such an experiment. Note here that the transformer is attempting to understand a number of modes: text, image and page morphology, hence it goes by the name “Multimodal Transformer”.
Upon reflection, there are likely to be a number of systems that are amenable to multimodal or mono-modal transformer modeling. In other words, all kinds of human systems that were hitherto seemingly immune to automation might prove accessible to automation via large-scale modeling techniques, especially if the system relies upon coherence strongly correlated to language rules and primitives.
Note that it is not a requirement to train a model from scratch. The more that the problem can be mapped to the patterns of coherence in language, the more that the pre-trained language model can be exploited to boost learning of the novel problem. This capability might not be obvious. Indeed, its full scope and potential are still being unmasked. When we talk of transfer learning — i.e. taking a model pre-trained upon a general language corpus and applying it to a more specific problem — we don’t necessarily mean that the specific problem has to take the form of a standard language problem, like finding an answer to a question.
In organized industrial life there are many processes, as encapsulated in some data, that are the product of human thought and organization. Now, without offending some philosophy of language experts, let us assume that Chomsky is correct when he says that the language faculty is not “designed” to be an instrument of communication (yes, I know that sounds crazy) but is originally an instrument of thought. Indeed, it is for this reason, he argues (in a way that I will do great injustice to with this simplification) that it is very difficult to uncover the internal mechanics of language in terms of fundamental symbolic processes. Namely, the outward form of language (as in sentences uttered and written) does not reveal enough evidence on the surface to reverse-engineer the modes of production — i.e. how the brain’s “language organ” works.
Indeed, Chomsky has argued (i.e. against the likes of Norvig, and others) that statistical methods of language modeling — i.e. what transformers do — are insufficient to understand language. His point is spelled out via analogy, which is that were an AI to watch (via a sensor through a window) objects, like leaves or apples, falling, it might eventually predict with great accuracy the likelihoods of objects falling and their dynamics (e.g. velocity, position, acceleration), but never discover Newton’s Laws of Mechanics, or the existence of an invisible force called gravity. (Actually, this is an interesting prospect and I wonder if anyone has attempted such an experiment.)
Nonetheless, transformers at very large scale have, using statistics, unearthed various sets of language primitives (without “knowing” what they are). And, in a sense, because these primitives appear to be able to solve various language problems, even novel ones (via One-Shot Learning), we might assume that the models are, in some way (to be determined) “mimicking” the thought processes used by humans within the language faculty. Indeed, again in some sense, perhaps these primitives are a mix of “language statistics” and “thought functions”. Well, don’t dwell on that too long, but hopefully you get the point.
I am taking you down this path in order to help you appreciate that the potential here for “transfer learning” is, in a way, the potential for some kind of programmatic solution to problems that are the output of human thought, in ways that could radically affect the automation of various human enterprises. This, indeed, is the opportunity. And it is so potentially powerful that it is as if we have discovered electricity — i.e. something capable of touching most human productivity tasks. This is how foundational these new models are, or could be within a short space of time.
Can these systems reason?
I believe this is a poor question because it is immediately loaded with philosophical baggage. But perhaps I can illuminate a potential answer by way of demonstrating that these models are indeed unearthing “thought functions”.
Consider a person who decides to name conference rooms after the names of English rivers (as indeed, all the rooms were in the Motorola R&D lab I first worked in and where I filed my first neural-network patent in 1994). This is not a feature of language per se, nor a common rule of business. But it is an idea — i.e. an output of thought. Someone thought to name conference rooms after rivers.
Let me try this prompt with GPT-3:
Conference rooms named after English rivers;
Conference room one: Avon
Conference room two: Thames
Conference room three: Severn
Quite correctly, the model has provided a viable suggestion for our third conference room: Severn. (Well, I apologize to my Welsh friends who might rightly consider this a Welsh river, given that it rises in Wales before flowing through England.)
We can propose in some sense that the model has “understood” the intention, or thought, to name conference rooms after rivers. It has then deftly adopted that pattern in suggesting the next name. I could put this function in a black box and henceforth call it “Meeting Room Namer”, and now I have a business function that has nothing to do with question answering, or language translation.
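That “black box” framing can be made literal: wrap the prompt construction and the model call behind one function and you have a business API. In this sketch the model call is stubbed with a canned answer mirroring the example above, since the point is the shape of the wrapper, not the completion itself:

```python
def complete(prompt: str) -> str:
    """Stand-in for a call to a large language model such as GPT-3.
    A real implementation would send the prompt to the model API."""
    if prompt.endswith("Conference room 3:"):
        return " Severn"  # canned answer mirroring the example in the text
    return ""

def meeting_room_namer(theme, existing):
    """The 'Meeting Room Namer' business function: build a few-shot prompt
    from the rooms named so far and let the model suggest the next one."""
    lines = [f"Conference rooms named after {theme}:"]
    lines += [f"Conference room {i + 1}: {name}" for i, name in enumerate(existing)]
    lines.append(f"Conference room {len(existing) + 1}:")
    return complete("\n".join(lines)).strip()

suggestion = meeting_room_namer("English rivers", ["Avon", "Thames"])  # "Severn"
```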
Here is a related prompt:
Conference rooms in buildings often have on their door:
The name of the room
The capacity of the room
The technology in the room
And so we see that, in a sense, the model has already understood that conference rooms have names. It is easy to see how a certain thread of prompts might have led to the naming of conference rooms. Let’s imagine we linked this to a building layout, as understood by a multimodal transformer. It is conceivable that some kind of prompt, like “Office design for large software team using funky vibes”, might have produced a viable plan along with room names, snack suggestions, decor suggestions, amenities and so on.
Building an “office design generator” might otherwise be difficult — I know, because I have worked on decor engines via computational aesthetics. It might require considerable training and data whereas using pre-trained models we can possibly get there faster. The question is how?
Well, that is the subject of another post, or — let’s be clear — the subject of many future IPOs in the AI space! Or, more worryingly for some, the subject of someone’s efforts about to bring about the demise of your business (looking ahead at risks).
What should be clear is that the wave of foundational models presents a major opportunity to revolutionize human productivity, perhaps as significant as the introduction of digital computing, in ways previously hard to penetrate. This is because these models, by modeling language dynamics at scale are, in a way, generating an “understanding” akin to some aspects of human thought, which, per Chomsky, is closely related to the function of language. Many business systems exhibit coherence stemming from human thought and so are amenable to interventions by foundational models.
Now, I don’t have time — nor all the answers — to spell out the significance of this to your industry, but the anecdotal warning is that if your business is run using human thought patterns, then large scale models might well be able to “understand” those patterns in ways previously unthinkable. What was originally conceived of as “Natural Language Processing” (NLP) is morphing into something far more significant and interesting: “Human Process Understanding” (HPU). Yes, I made that acronym up, but you get the idea.
Well, did I mention that large-scale models can also understand code and things like database queries? This is a fascinating development regarding HPU because humans have built mechanisms (called software and CPUs) to translate thought into digital functions, ones that can be actuated into physical actions (via sensors and actuators of all kinds).
We are on the cusp of figuring out how to pivot large-scale models into HPU in ways that could have foundational impact on all digital systems. I believe — and hopefully have demonstrated — that this impact could be so profound that it warrants an urgent strategic imperative of dedicating 100% of your 10% of “disruptive” resourcing to understanding, developing and extending this HPU capability to your business. The risks of not doing so should be obvious — if your competitors are doing this, they could stand to leapfrog your performance in entirely novel ways.
I should add that I believe this approach has many caveats and limitations when it comes to application to many business contexts that are, in fact, dynamical systems most likely characterized by Complexity. Put simply, this means that the methods of statistical estimation used by these large-scale models break down (in the face of Complexity). However, consider this a musing that does not detract from the opportunity at hand.
Even if you cannot sign up for 100-of-10, you should at least have a position paper or strategy that could, in theory, do so. The opportunity is simply too big, and the risk of ignoring it too great.
But there is another risk, one that I hinted at earlier, which is related to model bias. I don’t mean here what often gets discussed, as in racial profiling, sexist leanings, etc. Whilst these subjects are critically important to society, especially to those affected, there is another form of bias that often gets overlooked. Consider the following prompt and response:
salesforce vs. adobe
Salesforce and Adobe are both CRM (customer relationship management) software. They are both used to manage customer data and customer interactions. However, there are some key differences between the two.
Salesforce is a cloud-based CRM software, while Adobe is a desktop-based CRM software
From a very simple prompt, the model has made what might be considered a “statement of facts” whereas it’s, well — what is it? An opinion? Can a model have an opinion? Well, seemingly so. Would Adobe be happy about this “opinion”? Is it even an opinion that GPT-3 really “holds?” Hmmm. Let’s find out…
what is adobe experience cloud?
Adobe Experience Cloud is a set of integrated solutions from Adobe … [truncated]
It seems like GPT-3 doesn’t “believe” that Adobe is a desktop-based solution after all. The clue is kind of in the name: Adobe Experience *Cloud*.
Clearly, if you’re not the maker of this model, you might be concerned about what the model is saying about you, your business or your business interests, unfairly. Moreover, there have been noises that the folks at Google Research believe this kind of QA response is the “future of search”. (Note this rebuttal.)
This kind of risk is discussed widely in the Stanford paper on Foundation Models. The point is that many orgs will be using models, or products built upon models, that are built by other orgs who have the considerable resources needed to build such models. GPT-3 is alleged to cost tens of millions of dollars each time to train. And a supercomputer developed for OpenAI by Microsoft has more than 285,000 CPU cores, 10,000 GPUs and 400 gigabits per second of network connectivity for each GPU server. That’s not exactly the average computer sitting around in most orgs.
So here we see another set of risks to those whose products, claims and ideas are being filtered via foundational models beyond their reach to explain, refute and modify. This is to say nothing of how such a model might be selectively trained in the first place to hold certain “positions”.
Well, this post is already way too long, so let me conclude.
Perhaps the most fitting conclusion is to end with the summary that GPT-3 generated. Regarding the overall claim:
This person (me) is saying that we should be concentrating on AI 100% because it is so important for the future.
The opportunity is profound:
The wave of foundational models is an opportunity to revolutionize human productivity in ways that were previously hard to penetrate.
Regarding the risks of not paying attention to the foundational model wave (emphasis mine):
People are working on a way to use big models to figure out how to do things better in the digital world. If we do this right, it could have a really big impact on everything that happens in the digital world. But we have to act now, because if we don’t, our competitors might …
Regarding the risks of bias:
In other words, there is a risk that the makers of a model like GPT-3 could use it to make unfair decisions about things like whether or not Adobe is a desktop-based solution.
If you have any questions, comments, suggestions, then mention @pgolding via twitter.