Good mathematicians see analogies between theorems; great mathematicians see analogies between analogies.

This remark by S. Banach (quoted by S. Ulam, 1957) caught my attention while reading Probability Theory: The Logic of Science by E. T. Jaynes. I took a note of it a few years ago, and then forgot about it.

To my own surprise, I was reminded of it after the recent GPT-3 Twitter storm (check out the mid-July spike on Google Trends for related search queries). The release of another deep neural net is hardly an event of immediate importance to probability theory, but nonetheless, as I was looking through the relevant literature and the GPT-3 API use cases published on Twitter (see Gwern's overview), it became self-evident that the likelihood of substantial medium-term consequences from OpenAI's work is high!

The question is: what are these consequences going to be? Now, I don't have the resources to train such a model and test its limits (I wish), but I do have a keyboard to write with! So here is my speculation.

The generative pre-trained transformer (GPT) most certainly follows in the footsteps of AlphaGo in teaching Rich Sutton's bitter lesson to the skeptics. Yet the learning objective, as well as the overall scaling, is far simpler and, as Gwern's post points out, cheaper compared to AlphaGo (and similar projects).

AlphaGo seems to have totally original moves it creates itself.

AlphaGo received an honorary 9-dan title for exhibiting creative skill and pushing forward the game's progress. It inspired multiple professional players to reflect on their styles and strategies, and to learn new moves. If that has been the outcome of AlphaGo, then who will be inspired by GPT-like models, and how?

When I am learning, let's say, a concept in linear algebra, and I really want to understand it, I refer to multiple textbooks: Linear Algebra and Its Applications by G. Strang, then Linear Algebra Done Right by S. Axler, and finally Evan Chen's Napkin Project. Then at some point I see the concept being used in an applied research paper, and it finally clicks!

What caught my attention is that the data used to train GPT-3 includes very few books, textbooks, and scientific journal articles. If a GPT-? model could be trained on everything from children's textbooks to graduate course textbooks across a wide variety of subjects, I suspect it would have a strong foundation in scientific reasoning.

Moreover, the data is mostly in English. Scientific concepts are the same across languages, but the same concepts expressed in different languages would provide a stronger learning signal. Google, for example, demonstrated how BERT sentence embeddings improve across all languages when trained on low- and high-resource languages simultaneously.

Finally, while textbooks would provide a strong foundation, there is an enormous abundance of scientific papers published every year. As a Nature article from 2016 puts it:

Recent bibliometrics show that the number of published scientific papers has climbed by 8–9% each year over the past several decades. In the biomedical field alone, more than 1 million papers pour into the PubMed database each year — about two papers per minute.
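To put that growth rate in perspective, a quick back-of-the-envelope calculation (taking 8.5% as the midpoint of the quoted 8-9% range; the exact rate is only illustrative):

```python
import math

# If the number of papers grows ~8.5% per year, the literature doubles
# roughly every ln(2) / ln(1.085) years.
growth = 1.085
doubling_time = math.log(2) / math.log(growth)
print(f"Doubling time at 8.5% growth: {doubling_time:.1f} years")
```

In other words, at the quoted rate the scientific literature roughly doubles every eight to nine years, which is what makes keeping up across fields so hard.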

There is an argument about the quality of all this research, or about some of it even being wrong. But the overall trajectory of scientific progress has been upward, and that is key. It means that, on average, the use of statistical models, experimental design, proofs, and reasoning within the scientific literature is correct. A model trained on all of this knowledge would likely pick that up, as consistent use of the right scientific tools and reasoning will lead to a consistent reduction of the unsupervised loss.

Henri Poincaré is often referred to as the last polymath within mathematics, a person who excelled in all fields of mathematics that existed during his lifetime. How likely is there to be a polymath in any field of study given the current rate of scientific output? Quite unlikely, it seems, which is precisely what reminded me of S. Banach's quote: we are missing out on the analogies that could inspire solutions to existing problems!

Most certainly a GPT-like model will not be an AGI, but a model trained on all of the textual knowledge written by humans, potentially with some supervised regularisation, will be very useful. As Shahine helped me to summarise: "Just as AlphaGo pushed the boundaries of human knowledge in board games, where the best players in Go, Chess, and other games now learn from a neural network, so will a GPT trained on all of scientific knowledge advance science, for example by showing signs of inter-field extrapolation of scientific methods."


Quick summary of GPT-3

At the core of the learning objective is language modeling: the unsupervised objective of estimating a distribution from a set of examples (x_1, x_2, ..., x_n), each composed of a variable-length sequence of symbols (s_1, s_2, ..., s_n), factored as the product of conditional probabilities over symbols:

p(x) = ∏_{i=1}^{n} p(s_i | s_1, ..., s_{i−1})

i.e. the task is simply to predict the next word given the context. What could be simpler? There are complexities in sampling from the trained model, but I will review those later, one day..
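To make the factorization concrete, here is a toy sketch with a one-character context: a bigram count model standing in for the transformer. The corpus and all names are illustrative, but the scoring is the same product of conditionals written above, just in log space.

```python
import math
from collections import defaultdict

# Toy character-level language model: estimate p(next char | previous char)
# from counts, then score a sequence as the sum of log conditionals.
# GPT does exactly this factorization, but with a learned transformer
# conditioning on thousands of tokens instead of one character.

def train_bigram(corpus):
    counts = defaultdict(lambda: defaultdict(int))
    for prev, nxt in zip(corpus, corpus[1:]):
        counts[prev][nxt] += 1
    return counts

def log_prob(counts, seq):
    total = 0.0
    for prev, nxt in zip(seq, seq[1:]):
        row = counts[prev]
        total += math.log(row[nxt] / sum(row.values()))  # log p(s_i | s_{i-1})
    return total

corpus = "the cat sat on the mat the cat sat"
counts = train_bigram(corpus)
print(log_prob(counts, "the cat"))
```

Sequences whose transitions are common in the corpus get a higher (less negative) log-probability; minimising the training loss means pushing these conditionals toward the data distribution.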

From Brown et al. Language Models are Few-Shot Learners (2020).

The key to the model's success is the dataset size and the model size. As demonstrated above, the training data is huge. Interestingly, though, a substantial amount of training time is spent on more informative text data like Wikipedia. As shown in the figure below, the larger the model, the lower the GPT loss on validation data. Hence the larger the model, the better the performance on downstream tasks, as Brown et al. Language Models are Few-Shot Learners (2020) show.

From Brown et al. Language Models are Few-Shot Learners (2020).

As Kaplan et al. Scaling Laws for Neural Language Models (2020) show, and as Gwern's post rightly elaborates, we are only about halfway in reducing the validation loss!

If we see such striking gains in halving the validation loss but with so far left to go, what is left to emerge as we third or halve again?
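The power-law form reported by Kaplan et al. can be sketched numerically. This is a minimal illustration using the approximate fitted constants from the paper (N_c ≈ 8.8e13 non-embedding parameters, α_N ≈ 0.076); treat the numbers as rounded, illustrative values rather than exact predictions.

```python
# Sketch of the parameter-count scaling law from Kaplan et al. (2020):
#   L(N) ≈ (N_c / N) ** alpha_N
# where N is the number of non-embedding parameters.

def loss(n_params, n_c=8.8e13, alpha=0.076):
    return (n_c / n_params) ** alpha

for n in [1e8, 1e9, 1e10, 1.75e11]:  # up to roughly GPT-3 scale
    print(f"{n:.0e} params -> predicted loss {loss(n):.3f}")
```

Each order of magnitude in parameters shaves a roughly constant factor off the loss, which is why the curve keeps rewarding scale far beyond GPT-3's size.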

As Radford et al. Language Models are Unsupervised Multitask Learners (2019) beautifully put it: "Since the supervised objective is the same as unsupervised objective but only evaluated on a subset of the sequence, the global minimum of the unsupervised objective is also the global minimum of the supervised objective". Language models, as a result, do not need explicit supervision! As shown below, a plain text prompt is enough to steer GPT-3 toward a specific task:

From Brown et al. Language Models are Few-Shot Learners (2020).
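A minimal sketch of what such a prompt looks like, following the translation example format from Brown et al.; the API call itself is omitted, the model would simply be asked to continue the string.

```python
# "In-context learning": the task is specified purely in the prompt text,
# with a few input => output examples, and the model is asked to continue
# the pattern. No parameters are updated.

prompt = """Translate English to French:
sea otter => loutre de mer
peppermint => menthe poivrée
cheese =>"""

# A model like GPT-3 would be given this string and asked to complete it;
# a good continuation here is the French word for "cheese".
print(prompt)
```

The same format covers question answering, arithmetic, and so on: only the few-shot examples in the text change, never the weights.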


OpenAI papers and posts in chronological order:

1 Radford et al. Improving Language Understanding by Generative Pre-Training (2018) & OpenAI’s blog-post

Building on previous work from Google on generative pre-training and neural architectures for text data, the OpenAI team showed that pre-training a large neural network on a large text dataset and then fine-tuning it for specific language tasks improves performance on standard natural language benchmarks (examples). The pre-training is very simple: the neural network is given a context paragraph or sentence and has to predict the next word.

This simple training method is used in all of the GPT models released by OpenAI. Key to the progress appear to be large datasets (mostly mined from the internet) and large neural networks; OpenAI appears to have realised that by the next paper.

In terms of the neural architecture, what changed is that previously people used recurrent neural networks for language modeling. These models are fed every word of a sentence one by one, and the network has to remember every previous word it was fed; as a result, these networks were not easy to optimise. Transformers, a type of neural network developed by researchers at Google and used by OpenAI, have no recurrence. A transformer treats a sentence or paragraph of text as an ordered set of its constituent words/tokens: the network takes in the context as a whole, with every word/token fed in at once.
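A minimal sketch of the attention operation at the heart of a transformer, in pure Python with toy 2-dimensional vectors; multiple heads, learned projections, layer norms, and masking are all omitted.

```python
import math

# Scaled dot-product attention (Vaswani et al., 2017). Every position
# attends to every other position in a single step -- no recurrence,
# unlike an RNN that must consume tokens one at a time.

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    d = len(K[0])
    out = []
    for q in Q:  # each query position looks at all keys at once
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)  # attention weights sum to 1
        out.append([sum(wj * v[i] for wj, v in zip(w, V))
                    for i in range(len(V[0]))])
    return out

# Three token positions represented by toy 2-dim vectors
Q = K = V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(attention(Q, K, V))
```

Because every position sees the whole context in parallel, the computation maps cleanly onto GPUs, which is a large part of why transformers scale so much better than recurrent networks.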

2 Radford et al. Language Models are Unsupervised Multitask Learners (2019) & OpenAI’s blog-post

The same simple pre-training scheme of predicting the next word, but with a further expanded dataset and increased model size. Key to this work is the realisation that they do not need to fine-tune on specific benchmark datasets at all, as was done in the previous work from 2018. Essentially, they show that for language modeling the unsupervised objective (no explicit target for a given input) is the same as the supervised objective (explicitly specified input-target pairs) - I highly recommend reading section 2 of the paper. Instead, the problem is constructing a sufficiently powerful neural network such that the training loss over the data can be continuously reduced. In this case the training data is a pre-processed dataset of web scrapes (note that they removed Wikipedia articles from training).

This work produced GPT-2, which caused a scandal last year. The outputs from the model were very impressive: realistic paragraphs, news articles, and question answering. But there were clear artifacts, and it was easy to spot that a text was generated by the model.

3 Kaplan et al. Scaling Laws for Neural Language Models (2020)

Any machine learning model, besides having parameters inside the model, also has hyper-parameters, such as how big the model should be, how long it should be trained for, and how big the sample sizes should be during each update.

Given the two previous works, and having realised the potential of larger models and larger datasets, OpenAI wanted to investigate how far the model size needs to be increased, which hyper-parameters are important, and what language-modeling loss they could achieve. So they fit multiple models with different hyper-parameters and estimated the relationship between model size, dataset size, and the loss they could reach. In the next work, on GPT-3, it is clear how they use this work as guidance to determine the model size and training hyper-parameters.

4 Brown et al. Language Models are Few-Shot Learners (2020)

This is the paper describing the recently released GPT-3 and what it can do - and it is impressive! At first I was very sceptical, but the recent tweet-storm made me curious to review all of the related papers and articles, which is how I arrived at this short summary.

Essentially, in this work they push their work from 2019 further, with a larger model and expanded datasets. This time, besides a filtered Common Crawl dataset, they include Wikipedia and two datasets of books, although they do not specify which books.

The model generates far more realistic text. Where GPT-2 was known to generate easily detectable news articles, in this study OpenAI shows that humans' ability to classify GPT-3-generated vs human-written text is no better than chance!

Another impressive aspect of the model is its ability to learn from examples shown to it in text, i.e. telling it "do this and that", giving a few examples, and then the model does it! It is able to do translation, question answering, and several tasks that require reasoning, like arithmetic, unscrambling words, or using a novel word in a sentence. And there have been many more examples of what it can do on Twitter.

Other resources:

  1. Mitchell M. Can GPT-3 Make Analogies?
  2. Ruder S. Why You Should Do NLP Beyond English
  3. Philosophers On GPT-3 & Hardmaru’s tweet
  4. Roziere et al. Unsupervised Translation of Programming Languages & Lample’s thread
  5. Clark J. Delegation Machines
  6. Gwern on GPT-3 & Gwern’s GPT-3 Creative Fiction
  7. Harvard nlp The Annotated Transformer
  8. Are we in an AI overhang?
  9. Karpathy strikes again with minGPT

Cite as:

  @article{gamper2020polymath,
    title   = "The last polymath, a neural network",
    author  = "Gamper, Jevgenij",
    journal = "",
    year    = "2020",
    url     = ""
  }