tl;dr: Take 5 minutes to skim. It could be one of the most impactful things you do this year.

Black Friday encourages us to buy things in person. Cyber Monday encourages us to buy things on the internet. In response to this focus on consumerism, Giving Tuesday was born. Instead of buying things for people you know personally, take a day to think about improving your community and causes you care about. A few years ago, I’d only ever donated to the Wikimedia Foundation, thanks to “A Personal Appeal From Jimmy Wales.” Since then I’ve learned a lot about…

Is my backyard in San Francisco utopia?

A map of the world that does not include Utopia is not worth even glancing at, for it leaves out the one country at which Humanity is always landing. And when Humanity lands there, it looks out, and, seeing a better country, sets sail. Progress is the realisation of Utopias.
Oscar Wilde

A friend recently asked, “Do you believe in utopia?” My short answer is yes, but the devil is in the details. I don’t claim to be an expert, but I’ll share my amateur opinions here. As we build the future, a clear vision can be our blueprint.

The experience machine

It’s tough to understand all the numbers on COVID developments. I made a Colab notebook to try to answer the question “are things getting better or worse?”
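The core of the notebook boils down to a simple comparison. Here’s a minimal sketch of that kind of “better or worse?” computation; the function name and the numbers are made up for illustration (the real notebook pulls live data):

```python
# Hypothetical sketch: estimate whether cases are growing or shrinking
# by comparing week-over-week totals.

def weekly_growth_ratio(daily_cases):
    """Ratio of the most recent 7-day total to the prior 7-day total.
    > 1 means things are getting worse; < 1 means better."""
    if len(daily_cases) < 14:
        raise ValueError("need at least 14 days of data")
    recent = sum(daily_cases[-7:])
    prior = sum(daily_cases[-14:-7])
    return recent / prior

cases = [100, 110, 120, 130, 140, 150, 160,
         170, 180, 190, 200, 210, 220, 230]
print(weekly_growth_ratio(cases))  # > 1: cases still rising
```

Using weekly totals smooths over day-of-week reporting artifacts, which is one reason raw daily numbers are so hard to read.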

“Why didn’t anyone tell me this before??” This is my feeling about most of the content in this list when I first encountered it. I’ve aggregated it here roughly in order of impact.

1. Become a rationalist and effective altruist

2. Better mental health

3. Learn to communicate

On Halloween, I started taking Lexapro, an SSRI, to treat my depression. On the first day, I noticed an obvious and immediate effect as soon as it kicked in. I felt like someone had flipped a light switch in my mind. I felt good in a way that I have rarely ever felt. In the last two months, I’ve felt this way most days. I often ask myself why it took me 31 years to try an SSRI. Mostly it was due to stigma and an incorrect impression of the research. …

Reading about how Leonardo da Vinci journaled so prolifically but only published a tiny fraction made me feel both sad and inspired. I was sad that his incredible observations and discoveries lay dormant in his private collection. For example, he was the first person to document all the kinds of teeth and their exact layout in the human head. If he had published that, he might have been considered the father of dentistry. He was also the first to pith a frog, exploring the details of how nerves control muscles. He could have started a whole movement around neuroscience. Instead…

500MB of PyTorch on App Engine and Kubernetes

I recently deployed a 500MB PyTorch model. It was surprisingly hard! In this post, I document the pitfalls and the tradeoffs I made.

Running on CPU was nearly as fast as GPU for non-batch processing, so I recommend starting with that if you can.

Easy ways fail

TensorFlow Serving seemed OK, but converting our model from PyTorch to ONNX might have been difficult. We also wanted to keep the local code as simple as possible for ease of development. To make sure the server came up quickly, I copied the model into the codebase with a .gitignore entry. I added pytorch-pretrained-bert to my requirements.txt…
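A minimal sketch of the layout described above. The file names and the version pin are illustrative, not the exact ones from the deployment:

```text
# requirements.txt (version pin is illustrative)
pytorch-pretrained-bert==0.6.2
torch

# .gitignore -- keep the large weights file out of version control
# while still shipping it with the deployed codebase
model/weights.bin
```

Keeping the weights next to the code means the server can load them from local disk at startup instead of fetching them over the network on every cold start.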

An exploration of standard sampling techniques and the new nucleus sampling

Humans often choose words that surprise language models (Holtzman et al., 2019)

Causal language models like GPT-2 are trained to predict the probability of the next word given some context. For example, given “I ate a delicious hot ___”, the model may predict “dog” with 80% probability, “pancake” with 5% probability, etc. The cool thing about this structure is that it can be used to generate sequences of arbitrary length. I can give the model “I ate,” sample a token from the resulting distribution to get “I ate a”, then put that through the model again to get another distribution and resulting token. We can repeat this as long as we like. It turns out that this…
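Nucleus (top-p) sampling, the subject of this post, can be sketched in a few lines of pure Python over a toy distribution like the hot-dog example above. This is a simplified illustration: real implementations operate on the model’s full logit vector, and the distribution here is made up.

```python
import random

def nucleus_sample(probs, p=0.9, rng=random):
    """Nucleus (top-p) sampling: keep the smallest set of top tokens whose
    cumulative probability reaches p, then sample from that set.
    `probs` maps token -> probability."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, total = [], 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        total += prob
        if total >= p:
            break  # the "nucleus" is complete; discard the long tail
    tokens, weights = zip(*nucleus)
    # random.choices renormalizes the weights for us
    return rng.choices(tokens, weights=weights)[0]

dist = {"dog": 0.80, "spring": 0.10, "pancake": 0.05, "tub": 0.05}
print(nucleus_sample(dist, p=0.9))  # "dog" or "spring"; tail is cut off
```

With p=0.9 the nucleus is {"dog", "spring"}; the unlikely tail ("pancake", "tub") can never be sampled, which is exactly how nucleus sampling avoids the degenerate rambling that pure sampling can produce.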

Ben Mann, Yaroslav Bulatov, Darius Lam

TL;DR: we made Transformer-XL train efficiently on 128 GPUs on AWS. The code is available at

We achieved almost linear throughput scaling on AWS p3dn.24xlarge instances, each with eight 32GB V100 GPUs, on a 100Gbps network


One of the difficulties of researching language models is that you often don’t know if your ideas work until you try them on real-world datasets. However, training on such datasets on a single machine can take weeks.

Fortunately, there’s a straightforward recipe to speed up this process:

  1. Find a good single machine model
  2. Run N copies of the model on N machines in parallel, synchronizing at each step
  3. Solve all remaining technical challenges
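Step 2 of the recipe can be sketched with a toy simulation: each of N workers computes a gradient on its shard of the batch, the workers average their gradients, and every copy applies the same update. The loss, data, and hyperparameters below are made up for illustration; the point is that the synchronized step matches what a single machine would compute on the full batch.

```python
def grad(example, w):
    # Toy per-example gradient for the loss (w - x)^2 / 2: d/dw = w - x
    return w - example

def data_parallel_step(data, n_workers, w=0.0, lr=0.1):
    """One synchronous data-parallel step: shard, compute, average, update."""
    shard_size = len(data) // n_workers
    shards = [data[i * shard_size:(i + 1) * shard_size]
              for i in range(n_workers)]
    # Each worker averages gradients over its own shard...
    worker_grads = [sum(grad(x, w) for x in s) / len(s) for s in shards]
    # ...then all workers average their results (the synchronization step).
    g = sum(worker_grads) / n_workers
    return w - lr * g

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
print(data_parallel_step(data, n_workers=4))  # 0.45, same as one machine
```

Because averaging over equal shards and then averaging across workers equals averaging over the whole batch, the result is identical to a single-machine step on all 8 examples, just computed N times faster (modulo step 3, the remaining technical challenges).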

We used this recipe…

The latest technique for distributed training of large deep learning models

In software engineering, decreasing cycle time has a super-linear effect on progress. In modern deep learning, cycle time is often on the order of hours or days. The easiest way to speed up training, data parallelism, is to distribute copies of the model across GPUs and machines and have each copy compute the loss on a shard of the training data. The gradients from these losses can then be accumulated using a single parameter server or something fancier like ring all-reduce (the default in PyTorch). After applying the accumulated gradients, repeat with the updated model. …

Ben Mann

Software engineer, tinkerer, aspiring mad scientist
