Experiments with GPT-3

Intro

GPT-3 is an AI model that can generate realistic and, in some cases, human-like natural language text. It is roughly 100x larger than its predecessor GPT-2 (175 billion parameters versus 1.5 billion), achieves state-of-the-art results on many benchmarks, and can handle a variety of tasks, such as translation and question answering, with no task-specific training (fine-tuning). It's probably fair to say that, as of this writing, GPT-3 is the most capable natural language model on the market.

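In this blog post, I subject GPT-3 to a series of tests of my own: prime number generation, Matt Levine articles, writing proofs, solving Leetcode problems, and generating recommendations.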

TLDR

  • GPT-3 did poorly at the first four tasks. Although the output text is clearly in the right domain, GPT-3 is unable to go beyond that, and a domain expert would quickly distinguish GPT-3 output from real output.

  • GPT-3 does quite well at generating recommendations from natural language. It has a lot of real-world knowledge (from reading the internet) and can combine this with an understanding of the query to produce relevant results fairly robustly: it doesn't work every time, but it works well most of the time and occasionally amazingly well.

High Level Overview

You can interact with GPT-3 by giving it a "prompt", which is a prepared chunk of text that GPT-3 will try to continue as sensibly as it can. Here is an example:

(Prompt:)

Q: What is the color of the sky?

A:

(Response:)

A: It's blue.

Q: Is it true that you can't hear anything in space?

A: No, I can hear something. It's a little like that.

Q: What's that sound?

A: It's the sound of the stars.

Q: How many planets are there?

A: There are twelve planets.

Q: What are the names of the planets?

A: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune, Pluto, Charon, Titan, Eris.

Here GPT-3 inferred that we are doing some kind of question-and-answer session and happily went on to make up questions as well as answers.

By changing the wording of the prompt, you can get GPT-3 to solve different tasks.

For the remainder of the article, whenever you see a group of two quoted literal blocks with no exposition in between like the above example, you can assume the first was a prompt crafted by a human (me), and the second was GPT-3's response.
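The Q/A prompt format used in the overview can be assembled programmatically. The sketch below is only an illustration: `build_prompt` is a hypothetical helper, not something from the experiments, and no call to GPT-3 is made because this post does not show how the model was queried. It formats solved question/answer pairs and leaves the final answer open for the model to continue.

```python
from typing import List, Tuple


def build_prompt(examples: List[Tuple[str, str]], question: str) -> str:
    """Format solved (question, answer) pairs, then leave the last answer open."""
    blocks = [f"Q: {q}\n\nA: {a}" for q, a in examples]
    # The trailing "A:" is where the model is expected to continue.
    blocks.append(f"Q: {question}\n\nA:")
    return "\n\n".join(blocks)


prompt = build_prompt(
    [("What is the color of the sky?", "It's blue.")],
    "How many planets are there?",
)
print(prompt)
```

Changing the example pairs is what changes the task: the same helper would work for the prime-labelling prompt in Experiment #1 by passing numbers as questions and 0/1 as answers.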

Experiment #1: Prime Number Generation

I looked up a table of prime numbers and randomly sampled some to construct the prompt:

(Prompt:)

Q: 278624653

A: 1

Q: 63756283

A: 1

Q: 531821493

A: 0

Q: 971041741

A: 1

Q: 921407897

A: 0

Q: 423272049

A: 0

Q: 179126797

A: 0

Q: 846392893

A: 0

Q: 511198183

A: 1

Q: 314065811

A: 1

(Response:)

Q: 314067641

A: 1

Q: 371955591

A: 0

Q: 314065605

A: 0

Q: 314065407

A: 1

Q: 673074643

A: 1

Q: 723586481

A: 1

Q: 722678889

A: 1

Q: 723502438

A: 1

Q: 685119799

A: 0

Q: 371954277

A: 1

Q: 232522645

A: 1

I got excited when the first result checked out as a prime, but it turned out to be a fluke. Most numbers are not prime, and so most numbers in the returned list are not prime either. Although GPT-3 mostly (but not entirely) picks up that we don't care about even numbers, it doesn't pick up, for example, that a number ending in 5 (other than 5 itself) can't be prime.

This is not too surprising, given that GPT-3 struggles with adding five-digit numbers.
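Checking the response by hand is tedious, so here is a minimal sketch of how the labels could be verified. It assumes plain trial division is good enough (it is for 9-digit inputs); `is_prime` and `claimed_primes` are names introduced here for illustration, and the list simply collects the numbers GPT-3 labelled "1" in the response above.

```python
def is_prime(n: int) -> bool:
    """Trial division; fine for 9-digit inputs (at most ~16,000 odd divisors)."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


# Numbers GPT-3 labelled "1" (prime) in the response above.
claimed_primes = [
    314067641, 314065407, 673074643, 723586481,
    722678889, 723502438, 371954277, 232522645,
]

for n in claimed_primes:
    print(n, is_prime(n))
```

Several of these can be ruled out without running anything: 723502438 is even, 232522645 ends in 5, and 722678889 has a digit sum of 57, so it is divisible by 3.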

Experiment Outcome: Failure

Experiment #2: Matt Levine Articles

Salesforce's CTRL model was able to generate reasonable-sounding news articles given just the URL, so I looked at some Matt Levine articles and grabbed what I thought was the most relevant link (there are often a few) to a particular story. I planned to experiment with adding an excerpt or summary of the URL's full text to the prompt to increase the chances that the generated Matt Levine commentary would stay on point, but didn't get around to it.

Prompt: https://www.bloomberg.com/news/articles/2020-07-06/tesla-adds-14-billion-a-day-to-valuation-as-big-money-dives-in

Response: Man, this is how you give a quote about a stock:

    “Tesla’s valuation doesn’t make sense by any traditional measure,” said Ivan Feinseth of Tigress Financial Partners. However, “it is not a traditional company, so how do you put a traditional measure to it?”

See, saying something like that to a reporter is kind of a bearish bet. If Tesla keeps climbing, becomes the world’s biggest company, and ends up making all the cars and solving climate change and running the transportation infrastructure for all of our space colonies, then no one is going to be like “oh yeah Feinseth called it” or whatever. (In fact, he didn’t call it.) But if Tesla goes to zero tomorrow, if Chief Executive Officer Elon Musk is caught stuffing all the corporate cash into his novelty branded short shorts and trying to flee to Mars, then that’s gonna be the quote for Tesla. That will be the “permanently high plateau” of the Tesla boom, and Ivan Feinseth will be famous for a hundred years. “Hahaha back in 2020 people thought that just because Tesla was ‘not a traditional company’ it could be worth whatever you want,” people will say, in 2120, complaining about some other alleged bubble.

Prompt: https://investor.uber.com/news-events/news/press-release-details/2020/Uber-to-Acquire-Postmates/default.aspx

Response: There is a cartoon version of certain gig-economy startup industries that goes like this. Several companies get into the market for, say, car services or food delivery or whatever. They compete for market share by, basically, losing a lot of money: You pay drivers $20 per trip, which is more than they could get elsewhere; you charge riders $5 per trip, which is less than they’d pay elsewhere; you make up the difference by raising money from venture capitalists or SoftBank. Your competitors do the same thing, and you collectively spend billions of dollars of venture money delivering people cheap burritos. You promise your investors “don’t worry, after a few more cheap burritos we will have driven our competitors out of business and we’ll be able to jack up the burrito prices to cover our expenses,” but your competitors are promising their investors the same thing—sometimes it’s the same investors!—so they all keep hanging around.

Prompt: https://www.bloomberg.com/news/articles/2020-05-27/steep-u-s-stock-correlations-show-virus-death-grip-on-markets

Response:
This was a really fun assignment because the lede was basically just “Well, markets are correlated now, because everything is terrible, but we’re not sure what the culprit is. There are lots of ideas for why this might be the case, so we’re just going to make a list of a lot of possible reasons and let the reader sort it out.”


“Most people agree that rising stock markets are the main driver of this—the same underlying economic growth can lift all boats,” said Joe Quinlan, chief market strategist at U.S. Trust. “Technology spending is an input into all stocks.”

Other possibilities include more investor money pouring into the U.S. because of the growing global trade war, or the growth of passive investing, which may make stocks less volatile but also more likely to move in sync.

GPT-3 was able to quickly pick up that the task has to do with news, though the generated text does not really match up with Matt Levine's characteristic style.

Experiment Outcome: Matt Levine's job cannot be automated yet

Experiment #3: Writing Proofs

To construct the prompt, I thought it would be good to take some excerpts from Euclid's Elements. I stuck with number theory because I thought it would be easier for an AI to understand than geometry, when that AI can only manipulate words/symbols and can't visualize anything.

However, Euclid's Elements isn't written in modern language, so it was a pain to find examples stated in words that a modern reader can easily understand.

Prompt: If both a and b are relatively prime to c, then ab is also relatively prime to c.

Response: Assume that two numbers a and b are each relatively prime to a third number c.

Suppose their product ab is not relatively prime to c. Then there is some number e (greater than 1) that divides both ab and c. Now, since e divides c, and c is relatively prime to a, therefore, by VII.23, e is also relatively prime to a.

Let f be the number ab/e. Then e : a = b : f. Since e and a are relatively prime, then e : a is in lowest terms. Therefore, e divides b. But then e divides both b and c contradicting the assumption that b and c are relatively prime.

Therefore, the product ab is also relatively prime to c.

Prompt:
If a and b are relatively prime to each other, then their greatest common divisor must be 1.

Response: Let a and b be relatively prime, so that their greatest common divisor is 1. Then every common divisor of a and b is a divisor of 1, since, by VII.20, every common divisor of a and b is a divisor of their greatest common divisor.

Therefore, by II.9, every common divisor of a and b is either 1 or a common divisor of 1.

But by III.4, every common divisor of 1 is a divisor of 1. Therefore, every common divisor of a and b is a divisor of 1. Therefore, every common divisor of a and b is a common divisor of 1.

VII.27 If p is any prime number, then 2p – 1 is a multiple of p.

Here I'm letting GPT-3 pick what statement it wants to prove, since I only have one example so far. It decided to pick a statement that is true by definition, then said a bunch of things that are true but don't go anywhere, and then threw in a curveball out of nowhere: "If p is any prime number, then 2p – 1 is a multiple of p". That last statement is false for every prime, since 2p – 1 leaves a remainder of p – 1 when divided by p.

Prompt: If both a and b are relatively prime to c, then ab is also relatively prime to c.

Response: Assume that two numbers a and b are each relatively prime to a third number c.

Suppose their product ab is not relatively prime to c. Then there is some number e (greater than 1) that divides both ab and c. Now, since e divides c, and c is relatively prime to a, therefore, by VII.23, e is also relatively prime to a.

Let f be the number ab/e. Then e : a = b : f. Since e and a are relatively prime, then e : a is in lowest terms. Therefore, e divides b. But then e divides both b and c contradicting the assumption that b and c are relatively prime.

Therefore, the product ab is also relatively prime to c.

Prompt: A number n is a prime number if and only if (n − 1)! + 1 is divisible by n.

Response:
Suppose n is a prime number. Then, by IV.4, (n − 1)! + 1 is divisible by n. Therefore, this is a prime number.

Now suppose (n − 1)! + 1 is a prime number. We will show it is divisible by n.

First, consider the factors of (n − 1)! + 1 that are less than or equal to n. These are (n − 1)! and 1. They are certainly prime. Now consider the factors of (n − 1)! + 1 that are greater than n. It is clear that this is a subset of the first group and, thus, either has the same number of factors, or has fewer. If it has the same number of factors, then we have already shown that (n − 1)! + 1 is prime.

If it has fewer factors, then by the product of three primes is not divisible by 3, we have a contradiction. So it must have the same number of factors as the first group, that is, the same number of factors as n. Therefore, (n − 1)! + 1 is divisible by n

This is basically a bunch of gibberish. Also, GPT-3 doesn't seem to understand that "if and only if" requires proving two things. I tried multiple runs and it always thinks it's done after proving one direction.
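The statement in the prompt is Wilson's theorem, and it is true in both directions even though GPT-3's argument establishes neither. A small numeric check makes the "if and only if" concrete. This is a sanity check over a finite range, not a proof, and the helper names `is_prime` and `wilson` are introduced here for illustration.

```python
from math import factorial, isqrt


def is_prime(n: int) -> bool:
    """Reference primality test by trial division."""
    if n < 2:
        return False
    return all(n % d != 0 for d in range(2, isqrt(n) + 1))


def wilson(n: int) -> bool:
    """True when (n - 1)! + 1 is divisible by n."""
    return (factorial(n - 1) + 1) % n == 0


# Both directions of the "if and only if" must hold for every n checked:
# primes must pass the divisibility test, and composites must fail it.
mismatches = [n for n in range(2, 200) if wilson(n) != is_prime(n)]
print(mismatches)
```

An empty list means the two tests agree on every n in the range, which is what the theorem predicts. A proof would need to argue each direction separately, which is the step GPT-3 kept skipping.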

Prompt: If both a and b are relatively prime to c, then ab is also relatively prime to c.

Response: Assume that two numbers a and b are each relatively prime to a third number c.

Suppose their product ab is not relatively prime to c. Then there is some number e (greater than 1) that divides both ab and c. Now, since e divides c, and c is relatively prime to a, therefore, by VII.23, e is also relatively prime to a.

Let f be the number ab/e. Then e : a = b : f. Since e and a are relatively prime, then e : a is in lowest terms. Therefore, e divides b. But then e divides both b and c contradicting the assumption that b and c are relatively prime.

Therefore, the product ab is also relatively prime to c.

Prompt: If p is a prime number, then for any integer a, the number a^p − a is an integer multiple of p.

Response:
Suppose that a is an integer that is not a multiple of p. Then the number a^p − a is not an integer multiple of p.

But if a is not a multiple of p, then a cannot be equal to 1 (since 1 is a multiple of every prime).

Suppose that a is equal to some integer b. Then a^p − a = a^b − a = 1^b − a = b. But then b is an integer multiple of p. Therefore, a is also a multiple of p.
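GPT-3's response opens by asserting the opposite of what it was asked to prove. The statement in the prompt (Fermat's little theorem) is true, and a quick numeric check over small cases illustrates it. This is a sketch rather than a proof; `fermat_holds` is a name introduced here for illustration.

```python
def fermat_holds(a: int, p: int) -> bool:
    """True when a**p - a is an integer multiple of p."""
    # pow(a, p, p) computes a**p mod p without building the huge power.
    return (pow(a, p, p) - a) % p == 0


primes = [2, 3, 5, 7, 11, 13]
print(all(fermat_holds(a, p) for p in primes for a in range(-20, 21)))
```

Note that a = 2, p = 3 already contradicts the first line of the response: 2 is not a multiple of 3, yet 2^3 − 2 = 6 is.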

Just for fun:

The following is a proof that no odd perfect numbers exist:
Let \(p_n\) be a sequence of prime numbers. We want to show that the sequence \(p_n\) is infinite, i.e., there is no \(m\) such that \(p_m\) is the last number in the sequence.

Assume \(m\) is the last number in the sequence \(p_n\). Then \(p_m\) is the smallest prime number greater than \(p_n\). But \(p_m\) is a prime number greater than \(p_n\) (by definition), and hence it cannot be the smallest prime number greater than \(p_n\). This is a contradiction, and so we must conclude that the sequence \(p_n\) is infinite.

Note that if \(p_m\) is the smallest prime number greater than \(p_n\), then \(p_m\) is not the last number in the sequence \(p_n\). If \(p_m\) is not the last number in the sequence \(p_n\), then \(p_m\) is the first number in the sequence \(p_{p_n}\) (because \(p_n\) is the smallest prime number greater than \(p_m\)). Hence, \(p_{p_n}\) is a sequence of prime numbers, and so we can apply the above argument to conclude that \(p_{p_n}\) is infinite.

We can apply the above argument to conclude that \(p_{p_{p_n}}\) is infinite, and so on. Hence, there are infinitely many prime numbers, and so the prime number theorem is false.

Experiment Outcome: GPT-3 is quite poor at proofs, underperforming undergrads

Experiment #4: Solving Leetcode Problems

(These are all Leetcode easy problems.)

Prompt:

Given an array of integers, return indices of the two numbers such that they add up to a specific target.

You may assume that each input would have exactly one solution, and you may not use the same element twice.

Response:

class Solution:
  def twoSum(self, nums: List[int], target: int) -> List[int]:
    for i, x in enumerate(nums):
      for j, y in enumerate(nums):
        if j > i and x + y == target:
          return [i, j]
    return None

Prompt:

Determine whether an integer is a palindrome. An integer is a palindrome when it reads the same backward as forward.

Response:

class Solution:
    def isPalindrome(self, x: int) -> bool:
      dumb = str(x)
      n = len(dumb)
      for i, c in enumerate(dumb):
        if c != dumb[n-1 - i]:
          return False
      return True

Prompt:

Write a function to find the longest common prefix string amongst an array of strings.

If there is no common prefix, return an empty string "".

Response:
class Solution:    def longestCommonPrefix(self, s1: List[str], s2: List[str]) -> str:    n = len(s1)    l1 = [s1[i] for i in range(n)]    l2 = [s2[i] for i in range(n)]        for i, l1_str, l2_str in enumerate(zip(l1, l2)):      if l1_str == l2_str:         return l1_str      return None

After formatting:

class Solution:
  def longestCommonPrefix(self, s1: List[str], s2: List[str]) -> str:
    n = len(s1)
    l1 = [s1[i] for i in range(n)]
    l2 = [s2[i] for i in range(n)]
    for i, l1_str, l2_str in enumerate(zip(l1, l2)):
      if l1_str == l2_str:
        return l1_str
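      return None

Problems with this output:

  • The function signature is wrong: the problem takes a single array of strings, not two

  • There is no need to create l1 and l2

  • for i, l1_str, l2_str in enumerate(zip(l1, l2)): is not a syntax error, but it fails at runtime with a ValueError whenever the lists are non-empty, because enumerate yields 2-tuples of (index, pair); the loop target would need to be i, (l1_str, l2_str)

  • If s1 and s2 were strings and it was looking for the longest common prefix of just that pair, the logic in the loop should do something once l1_str and l2_str differ, not once they are equal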
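For contrast with GPT-3's attempt, here is a minimal sketch of one conventional way to solve the problem. It is written for this comparison and is not GPT-3 output; the function name `longest_common_prefix` and the sample inputs are chosen here for illustration. The idea is to start with the first string as the candidate prefix and shrink it until every other string starts with it.

```python
from typing import List


def longest_common_prefix(strs: List[str]) -> str:
    """Return the longest prefix shared by every string in strs."""
    if not strs:
        return ""
    prefix = strs[0]
    for s in strs[1:]:
        # Shrink the candidate until s starts with it; the empty
        # string always matches, so the loop terminates.
        while not s.startswith(prefix):
            prefix = prefix[:-1]
    return prefix


print(longest_common_prefix(["flower", "flow", "flight"]))  # fl
print(longest_common_prefix(["dog", "racecar", "car"]))     # (empty string)
```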

Prompt:

Given an array of integers, return indices of the two numbers such that they add up to a specific target.

You may assume that each input would have exactly one solution, and you may not use the same element twice.

Response:

class Solution:
  def twoSum(self, nums: List[int], target: int) -> List[int]:
    for i, x in enumerate(nums):
      for j, y in enumerate(nums):
        if j > i and x + y == target:
          return [i, j]
    return None

Prompt:

Determine whether an integer is a palindrome. An integer is a palindrome when it reads the same backward as forward.

Response:

class Solution:
    def isPalindrome(self, x: int) -> bool:
      dumb = str(x)
      n = len(dumb)
      for i, c in enumerate(dumb):
        if c != dumb[n-1 - i]:
          return False
      return True

Prompt:

Given a non-empty array of digits representing a non-negative integer, plus one to the integer.

The digits are stored such that the most significant digit is at the head of the list, and each element in the array contain a single digit.

You may assume the integer does not contain any leading zero, except the number 0 itself.

Response:
class Solution:   def increment(self, digits: List[int], value: int) -> List[int]:     n = len(digits)    i = 0    result = []    while i < n:       d = digits[i]       if d == 0:        continue      result.append(value)    i += 1    return result

Prompt:

Given an array of integers, determine the largest power of two such that all the elements in the array is less than or equal to that power of two.

You may assume the power of two is an element of the array.

Response:

class Solution:   def powerOfTwo(self, nums: List[int], power: int) -> List[int]:     return [i for i in range(1, len(nums), power) if nums[i] <= nums[i - 1]]

After formatting:

class Solution:
  def increment(self, digits: List[int], value: int) -> List[int]:
    n = len(digits)
    i = 0
    result = []
    while i < n:
      d = digits[i]
      if d == 0:
        continue
        result.append(value)
      i += 1
    return result

Prompt:

Given an array of integers, determine the largest power of two such that all the elements in the array is less than or equal to that power of two.

You may assume the power of two is an element of the array.

Response:

class Solution:
  def powerOfTwo(self, nums: List[int], power: int) -> List[int]:
    return [i for i in range(1, len(nums), power) if nums[i] <= nums[i - 1]]

Overall, it feels similar to the proof output: the output is syntactically correct but it's clear there's nobody driving. The code doesn't do anything useful.
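To make the gap concrete, here is a minimal sketch of what a working answer to the plus-one problem could look like. It is written for this comparison and is not GPT-3 output; the name `plus_one` is chosen here for illustration. It walks from the least significant digit, carrying while it sees 9s.

```python
from typing import List


def plus_one(digits: List[int]) -> List[int]:
    """Add one to a number stored as a most-significant-first digit list."""
    result = digits[:]  # leave the caller's list untouched
    for i in range(len(result) - 1, -1, -1):
        if result[i] < 9:
            result[i] += 1
            return result
        result[i] = 0  # 9 + 1 carries into the next digit to the left
    # Every digit was 9, e.g. [9, 9] -> [1, 0, 0].
    return [1] + result


print(plus_one([1, 2, 3]))
print(plus_one([9, 9]))
```

Unlike GPT-3's increment, this takes only the digit list as input and always terminates; the generated version loops forever on any input containing a 0 because continue skips the i += 1.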

I know that other people are building applications on top of GPT-3 that write code, but I personally wasn't able to get it to solve Leetcode problems in the limited time I spent exploring this domain.

Experiment Outcome: I was not able to get GPT-3 to solve Leetcode problems

Experiment #5: Generating Recommendations

This was the experiment I ran last. Check out this blog post on some interesting results.