tl;dr: Take 5 minutes to skim. It could be one of the most impactful things you do this year.

Black Friday encourages us to buy things in person. Cyber Monday encourages us to buy things on the internet. In response to this focus on consumerism, Giving Tuesday was born…

Is my backyard in San Francisco utopia?

A map of the world that does not include Utopia is not worth even glancing at, for it leaves out the one country at which Humanity is always landing. And when Humanity lands there, it looks out, and, seeing a better country, sets sail. Progress is the realisation of Utopias…

It’s tough to understand all the numbers on COVID developments. I made a Colab notebook to try to answer the question “are things getting better or worse?”
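One simple way to frame that question (a minimal sketch, not the notebook's actual code) is to compare the most recent 7-day average of new cases against the previous week's: a ratio above 1 means things are getting worse, below 1 means better.

```python
def weekly_trend(daily_cases):
    """Ratio of this week's average daily cases to last week's.

    > 1.0 means cases are rising (worse); < 1.0 means falling (better).
    Assumes `daily_cases` has at least 14 entries, oldest first.
    """
    this_week = sum(daily_cases[-7:]) / 7
    last_week = sum(daily_cases[-14:-7]) / 7
    return this_week / last_week

# Hypothetical daily case counts, oldest first
cases = [100, 110, 120, 115, 130, 125, 140,   # last week
         135, 130, 125, 120, 110, 105, 100]   # this week
print(weekly_trend(cases))  # < 1.0 here, so things are improving
```

Smoothing over a week cancels out day-of-week reporting artifacts, which is why raw daily numbers are so hard to read directly.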

“Why didn’t anyone tell me this before??” This is my feeling about most of the content in this list when I first encountered it. I’ve aggregated it here roughly in order of impact.

1. Become a rationalist and effective altruist

On Halloween, I started taking Lexapro, an SSRI, to treat my depression. On the first day, I noticed an obvious and immediate effect as soon as it kicked in. I felt like someone had flipped a light switch in my mind. I felt good in a way that I have…

Reading about how Leonardo da Vinci journaled so prolifically but only published a tiny fraction made me feel both sad and inspired. I was sad that his incredible observations and discoveries lay dormant in his private collection. For example, he was the first person to document all the kinds of…

500MB of Pytorch on AppEngine and Kubernetes

I recently deployed a 500MB Pytorch model. It was surprisingly hard! In this post, I document the pitfalls and tradeoffs I made.

Running on CPU was nearly as fast as GPU for non-batch processing, so I recommend starting with that if you can.

Easy ways fail

Tensorflow Serving seemed ok, but converting our…

An exploration of standard sampling techniques and the new nucleus sampling

Humans often choose words that surprise language models (Holtzman et al., 2019)

Causal language models like GPT-2 are trained to predict the probability of the next word given some context. For example, given “I ate a delicious hot ___”, the model may predict “dog” with 80% probability, “pancake” with 5% probability, etc. The cool thing about this structure is that they can be used…
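To make the nucleus-sampling idea from the title concrete, here is a toy sketch (my own illustration, not code from the post): keep only the smallest set of tokens whose cumulative probability exceeds a threshold p, renormalize, and sample from that truncated distribution. The probabilities below are the hypothetical ones from the hot-dog example.

```python
def nucleus(probs, p=0.9):
    """Return the top-p 'nucleus' of a next-token distribution, renormalized.

    `probs` maps token -> probability. We take tokens in descending
    probability until their cumulative mass reaches p, then rescale so
    the kept tokens sum to 1. Sampling would then draw from this dict.
    """
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for token, prob in ranked:
        kept.append(token)
        total += prob
        if total >= p:
            break
    return {t: probs[t] / total for t in kept}

# Hypothetical model outputs for "I ate a delicious hot ___"
probs = {"dog": 0.80, "pancake": 0.05, "meal": 0.05,
         "stone": 0.04, "sky": 0.03, "the": 0.03}
print(nucleus(probs, p=0.9))
```

The low-probability tail (“stone”, “sky”, “the”) is cut entirely, which is the point: unlike a fixed top-k cutoff, the nucleus adapts its size to how peaked the distribution is.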

Ben Mann, Yaroslav Bulatov, Darius Lam

TL;DR: we made Transformer-XL train efficiently on 128 GPUs on AWS. The code is available at

We achieved almost-linear throughput scaling on AWS p3dn.24xlarge instances, each with 8 V100 32GB GPUs, on a 100Gbps network.


One of the difficulties of researching language models is that you often don’t know if your ideas work until you try them on real-world datasets. …

The latest technique for distributed training of large deep learning models

In software engineering, decreasing cycle time has a super-linear effect on progress. In modern deep learning, cycle time is often on the order of hours or days. The easiest way to speed up training, data parallelism, is to distribute copies of the model across GPUs and machines and have each…
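The core of data parallelism can be sketched in a few lines of plain Python (an illustrative simulation under my own assumptions, not the authors' training code): each worker computes gradients on its shard of the batch, the gradients are averaged across workers (what an all-reduce does in practice), and every replica applies the same update.

```python
def grad(w, x, y):
    # Gradient of the squared error 0.5 * (w*x - y)**2 with respect to w
    return (w * x - y) * x

def data_parallel_step(w, batch, n_workers, lr=0.1):
    """One synchronous data-parallel SGD step on a 1-D linear model."""
    shard_size = len(batch) // n_workers
    shards = [batch[i * shard_size:(i + 1) * shard_size]
              for i in range(n_workers)]
    # Each worker averages gradients over its local shard of the batch
    local_grads = [sum(grad(w, x, y) for x, y in s) / len(s) for s in shards]
    # "All-reduce": average the local gradients across workers
    g = sum(local_grads) / n_workers
    return w - lr * g

# Toy dataset where y = 2*x, so the model should learn w ≈ 2
batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w = 0.0
for _ in range(50):
    w = data_parallel_step(w, batch, n_workers=2)
print(round(w, 3))  # converges toward 2.0
```

Because every replica sees the same averaged gradient, the result matches single-worker training on the full batch; the win is that the per-worker gradient computation (the expensive part at scale) happens in parallel.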

Ben Mann

Software engineer, tinkerer, aspiring mad scientist
