
Image and especially video compression is a problem of obvious real-life significance, with video accounting for over 60% of all internet traffic. Increasingly better lossless and lossy codecs have been developed over the years to combat this problem. Relatively recently, a nascent research domain on using auto-encoder style neural networks for compression has emerged. This post goes through the evolution of the field and highlights where it might be heading.

To understand neural compression, we first need to understand the ubiquitous JPEG standard which has been around since 1992. Lossless codecs exploit statistical redundancy in data: if you can predict the next byte conditional on the previous ones, then using the same statistical model in both encoder and decoder sides increases compression ratio by assigning smaller codes to more likely bytes. Lossy codecs go further by disregarding some “less important” data entirely.

There are many great tutorials on JPEG online, e.g. this, but the most important takeaway is that human eyes are much more sensitive to low frequency changes such as gradual shifts in brightness and overall shapes than to high frequency changes such as elaborate textures. Artists exploit this by creating an “illusion of detail” instead of actually drawing every individual blade of grass.

*Understanding frequency in images*

JPEG exploits this by transforming the image data to the frequency domain using the discrete cosine transform, expressing each 8x8 pixel block of an image as a sum of cosine functions with different frequencies and amplitudes. Instead of $H \times W$ spatial pixels, we are left with $H \times W$ DCT coefficients, where each 8x8 block of coefficients corresponds to a block of the image.

Next, the 8x8 block of values is element-wise divided by an 8x8 *quantisation table*, and the results are rounded (this is what makes JPEG lossy). Intuitively, lowering the quality setting when saving a JPEG file increases the values of the quantisation table, producing more zeros in the final representation of the image and making it easier to compress. It’s important to note that the values in the quantisation table are neither uniform nor random; they are carefully hand-picked to provide pleasing visual results. In fact, there is nothing special about these tables insofar as any software or user can change their values for desired effects.

The resulting quantised frequency values are the compressed representation of the image. In practice, this is further compressed with lossless encoding (all the rounded zeros are redundant), which we’ll skip this time. But what’s important to note is that the DCT and the division by the quantisation table are invertible, so the decoder can undo them (the rounding itself cannot be undone, which is exactly where the loss occurs), resulting in a lossy reconstruction of the original image.
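The whole block pipeline fits in a few lines of numpy. Below is a minimal sketch: the quantisation table is the standard luminance table from Annex K of the JPEG specification, and a random block of level-shifted pixels stands in for real image data.

```python
import numpy as np

def dct_matrix(n=8):
    # Orthonormal DCT-II basis: C @ C.T = I, so the transform is invertible
    k, i = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    C[0, :] /= np.sqrt(2.0)
    return C

C = dct_matrix()

# Standard JPEG luminance quantisation table (Annex K of the spec)
Q = np.array([
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
], dtype=float)

rng = np.random.default_rng(0)
block = rng.integers(0, 256, size=(8, 8)).astype(float) - 128  # level-shifted pixels

coeffs = C @ block @ C.T               # 2D DCT (encoder)
quantised = np.round(coeffs / Q)       # the only lossy step
dequantised = quantised * Q            # decoder inverts the division...
reconstruction = C.T @ dequantised @ C # ...and the DCT
```

Each dequantised coefficient is within `Q/2` of the original, which is why larger table values (lower quality) cause larger reconstruction errors.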

*High level diagram of JPEG internals*

The above slide from Johannes Ballé’s 2018 talk summarises JPEG on a high level. It’s striking how closely this resembles a neural network diagram - we have a natural loss function and two (linear) transformations and their inverses. A natural question is whether such a system could be trained end to end by parametrising the transforms, like this:

*High level diagram of learned image codec internals*

So why can’t we just train any autoencoder to minimise MSE and be done with? By projecting the image data to some low-dimensional bottleneck, we could use that directly as the encoded representation, right?

This is pretty much what the first iterations tried to do. For example, in [7] the authors additionally constrain the bottleneck to be a binary representation. This has the extra benefit that the vector can be directly serialised and the bitrate can be estimated directly from the bottleneck size.

*Progressive bitwise encoding with LSTMs, [7]*

Here, the autoencoder recursively generates a 4-bit (keep in mind the authors were compressing tiny 32x32 images!) representation of the image, with the residual of that reconstruction being fed into the next layer etc. By repeating this e.g. 16 times, we can generate a 64-bit vector to be transmitted to the receiver. This also makes the method one of the first with adaptive rate control, which is a desirable property for a codec.
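The mechanics of this progressive refinement can be sketched without any learned components. In the toy below, the paper’s LSTM encoder is replaced by a single sign bit per element (bit-plane-style successive refinement), which is only an analogue, but it shows how each stage encodes the residual of the previous ones and the receiver can stop after any stage:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=64)      # stand-in for an image, values in [-1, 1]
x_hat = np.zeros_like(x)
step = 1.0
bitstream = []

for stage in range(8):
    residual = x - x_hat             # what the previous stages failed to capture
    bits = np.sign(residual)         # 1 "bit" per element per stage
    bitstream.append(bits)
    x_hat = x_hat + bits * step / 2  # refine the reconstruction
    step /= 2

# More stages transmitted => smaller residual => better reconstruction.
```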

Unfortunately, it does not achieve the best compression ratios in practice. That’s because actual compression rates mostly depend on the Shannon entropy of the source file. Since this autoencoder has no incentive to reduce the entropy of the latent vector, we might not be compressing optimally in terms of the bitrate-distortion trade-off.

A seminal breakthrough came in 2017 when [2] designed the first end-to-end trained model that directly optimises for the rate-distortion trade-off. Their key insights were the following:

- Using a loss of the form $L = R + \lambda D $, where R stands for bitrate and D for distortion (e.g. MSE), achieves an optimal trade-off between the two (for a given value of $\lambda$, which weights the trade-off).
- To estimate $ R $ with a neural network, we can use the fact that actual compression rates can come very close to the entropy of the source. We thus only need to train a probabilistic model of our quantised latent variable $ \hat{y} = \mathrm{round}(y) $, whose entropy gives the bitrate directly: $ L = -\mathbf{E}[\log_2 P_{\hat{y}}] + \lambda \mathbf{E}[D]$ where $P_{\hat{y}}$ is the probability distribution of $\hat{y}$. Note that both terms in the loss depend on the quantised latent $\hat{y}$: the rate through the probability model, and the distortion because $\hat{y}$ is what gets decoded into the image.
- A specific type of lossless coding - adaptive arithmetic coding - is used to actually compress $ \hat{y} $ into a bitstream. Without going into too much detail, the key insight is that both the arithmetic encoder and decoder depend on the probability distribution of symbols (a symbol here is a possible value of $\hat{y}$).
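The rate term of the loss can be sketched in numpy. The example below assumes a fixed zero-mean Gaussian entropy model with a hand-picked scale (in the actual papers the model is learned, and rounding is relaxed with uniform noise during training to keep gradients flowing):

```python
import numpy as np
from math import erf, sqrt

def normal_cdf(x, sigma):
    # CDF of a zero-mean Gaussian, via the error function
    return 0.5 * (1.0 + np.vectorize(erf)(x / (sigma * sqrt(2.0))))

rng = np.random.default_rng(0)
y = rng.normal(0.0, 2.0, size=10_000)    # latents produced by the analysis transform
y_hat = np.round(y)                      # quantisation

sigma = 2.0                              # scale of the (here fixed) entropy model
# P(y_hat = k): probability mass of the Gaussian in the bin [k - 0.5, k + 0.5)
p = normal_cdf(y_hat + 0.5, sigma) - normal_cdf(y_hat - 0.5, sigma)

rate = -np.mean(np.log2(p))              # estimated bits per latent
distortion = np.mean((y - y_hat) ** 2)   # here just the quantisation error
lam = 0.01
loss = rate + lam * distortion
```

The better the probability model matches the true distribution of $\hat{y}$, the lower the rate term, which is exactly what the arithmetic coder then achieves in practice.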

This new model achieved better compression ratios than JPEG or even JPEG2000.

Once the $ R + \lambda D $ loss became prevalent, it became clear that the best way to improve compression rates is to improve the entropy (probability) estimation. Another way to look at this is through the arithmetic coder: imagine we were encoding English text. For the hackneyed example input “a quick brown fox jumps o”, if we use a static probability model, then the most likely next character is always space. However, if we had access to some GPT-style context-aware next character predictor, it could condition on the already available information and achieve far better compression ratios. Note that both the encoder and decoder would need to use the same model.
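The gain from conditioning can be shown with a toy character model. Using a repeated pangram as stand-in text (an assumption purely for illustration), a model that keeps a separate distribution per previous character needs fewer total bits than a single static distribution:

```python
from collections import Counter
import math

text = "the quick brown fox jumps over the lazy dog " * 20

def entropy_bits(symbols):
    # Shannon entropy (bits/symbol) of the empirical distribution
    counts = Counter(symbols)
    total = len(symbols)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

# Static model: one distribution over all characters
static_bits = entropy_bits(text) * len(text)

# Context-aware model: a separate distribution per previous character
pairs = list(zip(text, text[1:]))
ctx_bits = 0.0
for prev in set(text):
    following = [b for a, b in pairs if a == prev]
    if following:
        ctx_bits += entropy_bits(following) * len(following)

# Conditioning never increases entropy, so ctx_bits <= static_bits;
# e.g. after "q" the next character is always "u", costing ~0 bits.
```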

[1] does something similar. Where the previous probability model was fixed (essentially reflecting the expected distribution of latents over the whole training set), we now emit an additional quantised latent $\hat{z}$ that is also sent as part of the bitstream. The receiving end first decodes $ \hat{z} $, reconstructs the distributional parameters of $ y $ from it (in practice, each element $y_i$ is modelled as a zero-mean Gaussian whose scale is predicted from $\hat{z}$), and then decodes $ \hat{y} $ *conditional* on these parameters - meaning the probability model becomes image-specific.

*Seminal hyperprior network, [1]. In the above, GDN is just an alternative non-linearity; ReLU could be used instead*

I have played around a bit with the hyperprior model using the popular CompressAI library - you can find a notebook here. Using a toy dataset of 100k 96x96 images (STL-10), I train a tiny toy model at different $ \lambda $ values and log the average entropy and MSE loss for each. While this is not yet enough to beat JPEG, it showcases how rate-distortion training happens in practice.

Here are some fun images of a ram progressively improving during the training process:

*Reconstructing a ram*

As expected, a lower value of $ \lambda $ incentivises the model to focus on compression, while the higher one gives us a rather realistic reconstruction of the original image (top).

The hyperprior showed that feeding additional information to the arithmetic coder’s probability model improves compression rates, namely sending image-specific $ \hat{z} $. But there is other information we are not using. For example, let’s say we are in the process of decoding $\hat{y}$ pixel-by-pixel. After the first half, it becomes clear that we are looking at half of a dog. It is then very likely that the remaining half is the latter half of the dog. We can update the probability model with this information - as long as both the encoder and decoder sides do this kind of synchronised sequential processing, we can achieve a better compression ratio.

This is pretty much the idea behind the autoregressive prior: instead of decoding all latents in one or two passes (like the hyperprior), we recursively predict the probability distribution of the next latent conditional on the previous ones. The arithmetic decoder can adapt in each iteration to optimally pack/unpack a single latent. In practice, instead of using all the previous latents, only a local neighbourhood (e.g. 5x5) of the pixel is used, which provides 12 previously decoded pixels when processed in raster order:
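Where the 12 comes from is easy to verify: in raster order, the already-decoded positions inside a 5x5 window are those on an earlier row, or on the same row to the left of the centre:

```python
import numpy as np

# Causal mask of a 5x5 neighbourhood around the current latent
context = np.zeros((5, 5), dtype=bool)
centre = (2, 2)
for r in range(5):
    for c in range(5):
        # Lexicographic order on (row, col) is exactly raster order
        context[r, c] = (r, c) < centre

# Two full rows (10) plus two positions to the left (2) = 12 decoded neighbours
```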

*Autoregressive image decoding [8]*

This method added another step improvement to learned image compression. Unfortunately, as you might be able to tell, iterating over the entire image pixel by pixel is computationally very inefficient. I recall seeing papers claiming that more than 95% of the decoding time was spent in the autoregressive model! This made the method mostly one of academic interest, used in competitions where only rate-distortion performance is measured, without computational constraints. Further work has tried to optimise this obvious bottleneck. For example, the checkerboard context model can speed up decoding more than 40 times, without much of a drop in performance:
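The checkerboard trick is easy to sketch: latents are split into two interleaved sets, so instead of one sequential step per latent, decoding takes two highly parallelisable passes:

```python
import numpy as np

H, W = 16, 16
i, j = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
anchors = (i + j) % 2 == 0   # pass 1: decoded using the hyperprior alone
others = ~anchors            # pass 2: conditioned on the decoded anchor neighbours

serial_steps = H * W         # full raster-order autoregression
checkerboard_steps = 2       # two passes, each fully parallel within itself
```

Every non-anchor latent has its four direct neighbours in the anchor set, which is what allows the second pass to still use spatial context.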

*Checkerboard context for arithmetic decoding [4]*

A glaring weakness of models trained with the $ R + \lambda D $ loss is that we need to train a separate model for each point of the rate-distortion curve. That would mean that if you are streaming video, the service would need to keep N different models in storage, ready to serve depending on your (dynamic) bandwidth - which is absurd. Traditional codecs can adapt the rate on the fly, like the quality setting in JPEG.

Luckily, there’s a nifty little trick to support variable rate within one model. It’s fairly obvious that to support this, we should use variable values of $ \lambda $ during training, but these should be coupled with some input feature that tells the model which rate-distortion regime we are in. In [3], this is done quite elegantly by introducing a “gain” unit and its inverse. This is a learnable embedding-like vector that multiplies the encoded latent $y$ before it goes into quantisation - decoding performs the opposite.

*Learned variable rate control [3]*

If we recall JPEG, this is very similar to using different quantisation tables for different desired quality values, except it has the added benefit that the gain and inverse-gain coefficients are trained end to end. During training, we can discretise the range of $\lambda$ and gain vectors, say to 100 values. Each sample gets assigned a uniformly random index from 1 to 100, and uses the corresponding gain vectors and $\lambda$ coefficient. At inference time, we can then vary which of the 100 gain vectors we wish to use, and due to the way they have been trained, we effectively have a 1-100 quality index like JPEG!
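The mechanism can be sketched with scalar stand-ins for the learned gain vectors of [3] (the real ones are per-channel and trained, which is an assumption this toy drops): a larger gain stretches the latents before rounding, giving a finer effective quantisation grid and hence lower distortion at a higher rate.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(0.0, 4.0, size=10_000)   # encoder output

gains = {"low_quality": 0.25, "high_quality": 4.0}

def roundtrip(y, g):
    y_hat = np.round(y * g)             # gain, then quantise (encoder side)
    return y_hat / g                    # inverse gain (decoder side)

mse = {name: np.mean((y - roundtrip(y, g)) ** 2) for name, g in gains.items()}
# mse["high_quality"] < mse["low_quality"]: same model, different quality index
```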

In image reconstruction, pixel-wise MSE is usually used as the first simple loss function. However, this causes unwanted artifacts at low bitrates: let’s say you have a perfect reconstruction of the image but it’s shifted 5 pixels to the right. MSE would incentivise the model to prefer a blended nothingness over such slightly misaligned reconstructions. This would cause the model to blend out fine textures at low bitrates. Due to this and other issues, the MSE-based PSNR metric does not correlate well with ground-truth subjective test scores, unlike more advanced metrics such as VMAF. A fun fact is that VMAF has won an Emmy award for technology and engineering :)

What are some of the more perceptual loss functions that have been used? One fairly simple improvement is weighting MSE by a saliency map of the input: it’s reasonable to assume that viewers care more about the reconstruction quality of human faces compared to minute background details. Moreover, the pixel-wise L1 loss and Charbonnier loss have been tried to reduce blurring artifacts caused by MSE.
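Saliency weighting is a one-liner on top of MSE. A minimal sketch (the saliency map would in practice come from a face detector or a saliency model, which is assumed away here):

```python
import numpy as np

def saliency_weighted_mse(x, x_hat, saliency):
    # Normalise so that a uniform map reduces this to plain MSE;
    # salient regions (faces etc.) get weights above 1.
    w = saliency / saliency.mean()
    return np.mean(w * (x - x_hat) ** 2)

x = np.zeros(4)
x_hat = np.array([1.0, 0.0, 0.0, 0.0])   # all the error sits in one spot
uniform = saliency_weighted_mse(x, x_hat, np.ones(4))
focused = saliency_weighted_mse(x, x_hat, np.array([3.0, 1.0, 1.0, 1.0]))
# focused > uniform: errors in salient regions are penalised more
```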

Since deep neural networks extract semantic information from images during their training process, it’s possible to pass images through a pretrained network, extract deep feature activations and compare their distances to get a semantic similarity measure. This is what the LPIPS loss does, originally using the VGG network for extracting deep features.

*LPIPS feature similarity [9]*

A similar alternative to the LPIPS loss is the style loss borrowed from the style-transfer literature. It likewise extracts intermediate-layer activations from a pretrained network such as VGG, but the metric is calculated in a somewhat different way, focusing on global statistics of the feature maps, meant to measure differences in style.
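The “global statistics” in question are Gram matrices of the feature maps. Below is a minimal sketch with random arrays standing in for real VGG activations (an assumption for self-containedness):

```python
import numpy as np

def gram(features):
    # features: (C, H, W) activations from some layer of a pretrained net
    C, H, W = features.shape
    F = features.reshape(C, H * W)
    return F @ F.T / (C * H * W)   # (C, C) channel co-activation statistics

rng = np.random.default_rng(0)
f_original = rng.normal(size=(8, 4, 4))
f_reconstruction = rng.normal(size=(8, 4, 4))

# Style loss: distance between Gram matrices, ignoring spatial layout
style_loss = np.mean((gram(f_original) - gram(f_reconstruction)) ** 2)
```

Because the Gram matrix averages over all spatial positions, the loss compares texture statistics rather than exact pixel placement, which is why it tolerates the small misalignments that break MSE.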

In [2], the authors show that image compression models can be thought of as generative variational autoencoders as $ \lambda \to \infty$. It is therefore not surprising that many methods useful in that field have also been adopted in image compression. Another NeurIPS paper [5] showed that a classic image-generation tool, the GAN discriminator, is useful for training the image compression model. In fact, based on human evaluations, a model trained with a joint MSE, LPIPS and GAN loss was deemed better than BPG (a then state-of-the-art image compressor based on the HEVC video codec) at half the bitrate.

*HiFiC [5] vs BPG at similar file size*

*Comparison of image codecs, CompressAI*

The above image from CompressAI shows the progress of various learned image codecs against their traditional benchmarks, such as VTM, AV1, BPG. While these codecs are already somewhat outdated, it’s visible that generally the best learned and traditional codecs are neck and neck.

This is confirmed by the CLIC compression challenge where, especially in the video track, winning submissions are often hybrid approaches in which a traditional codec is combined with a neural post-processing layer (in-loop filter). The efficacy of such ensemble methods shows that neural methods do not dominate sufficiently to make traditional codecs obsolete. In fact, it seems that in the recent CLIC2024 competition, both the image and video track winners used the state of the art ECM codec.

But more than accuracy, the main concern of ML codecs is their computational cost, highlighted best by this concluding slide of the CLIC2022 challenge:

*Challenges in incorporating ML in a mainstream nextgen video codec*

We see the difference between mainstream codecs and learned ones can be up to two orders of magnitude in the number of calculations. In the CLIC challenge, it’s common to see neural codecs have 10x longer decoding times, even if their inference can be run on a GPU.

This might mean that, at least for the time being, lightweight hybrid neural approaches are the best way to enhance image and video compression. In the long run, though, bitter-lesson effects make it likely that neural codecs will prevail: they are conceptually much simpler (JVET codec documentation can easily span hundreds of pages :)) and are meant to run on generic neural hardware, which is becoming more common. At least it seems safe to predict that a good compression system of the future will have at least some learned components.

[1] Ballé, Johannes, et al. “Variational image compression with a scale hyperprior.” arXiv preprint arXiv:1802.01436 (2018).

[2] Ballé, Johannes, Valero Laparra, and Eero P. Simoncelli. “End-to-end optimized image compression.” arXiv preprint arXiv:1611.01704 (2016).

[3] Cui, Ze, et al. “Asymmetric gained deep image compression with continuous rate adaptation.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.

[4] He, Dailan, et al. “Checkerboard context model for efficient learned image compression.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.

[5] Mentzer, Fabian, et al. “High-fidelity generative image compression.” Advances in Neural Information Processing Systems 33 (2020): 11913-11924.

[6] Minnen, David, Johannes Ballé, and George D. Toderici. “Joint autoregressive and hierarchical priors for learned image compression.” Advances in neural information processing systems 31 (2018).

[7] Toderici, George, et al. “Variable rate image compression with recurrent neural networks.” arXiv preprint arXiv:1511.06085 (2015).

[8] Van den Oord, Aaron, et al. “Conditional image generation with pixelcnn decoders.” Advances in neural information processing systems 29 (2016).

[9] Zhang, Richard, et al. “The unreasonable effectiveness of deep features as a perceptual metric.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.

Research is an ambiguous term in industry, so for the following let’s define it as both:

- Research science - actively publishing novel research for expanding the knowledge frontier, working with ideas and prototypes
- Research engineering - implementing and optimising solutions near the knowledge frontier, working with production grade systems

It seems that the first one is almost exclusively dominated by big tech companies:

There are other slightly smaller companies that can still produce NeurIPS level research like Spotify or Uber, but the smaller the market cap of the company, the less likely it is that their research scientists are working full time near the knowledge frontier.

Why is this?

Publication is a non-direct benefit to the employer. There is no obvious revenue/profit contribution from publishing at a top
journal or conference. There are many indirect benefits, such as attracting top talent, better brand recognition and driving innovation.
However, fledgling startups have 99 problems and these kind of soft benefits tend to not make it into the priority shortlist.
This means that if you are dreaming of working on cutting edge solutions *and* have publishing be part of your expected work, then big tech might be the only feasible place.
Somewhere between 50-100B market cap seems to me to be the starting point where you could expect research scientist to be an actual open role.

So that leaves us with research engineering. Surely one can do this at a small company, right?

The average lifecycle of an ML project looks something like the above (where ML denotes machine learning and DL deep learning). In practice, you may use versioning lingo such as “V1” and “V2” for the different stages of a model. There is only one important thing to note about the function: it is increasing and concave ($ f'(x) > 0,\ f''(x) < 0,\ \forall x$). In other words, there are diminishing returns from investing more effort into the approach. For any given model solving some task, we expect to improve it more rapidly during the initial “low hanging fruit” stage than after someone has already spent 3 years working on it.

Imagine a rapid grocery delivery startup that has hired a team of ambitious data scientists to solve various machine learning problems at the company. Let’s say there are 5 main opportunities that have been identified by management:

- Dynamic pricing to balance demand/supply (e.g. increase courier fees during peak hours)
- Improving the dispatching engine (ML arrival time predictions + optimal order batching)
- Computer vision for automated courier/restaurant on-boarding
- Personalised recommendations in the user app
- Using Gen AI to produce synthetic menu images

To simulate this scenario, we can generate a number of effort-accuracy curves from any concave function, for below I use $ f(x) = a \log(bx)$.

As I’ve argued before, companies solve an optimisation problem to maximise returns given some fixed effort budget. The only remaining problem is that on our y-axis we currently have accuracy, not (monetary) returns. The company optimises with respect to $ \frac{\partial Returns}{\partial Effort} = \frac{\partial Returns}{\partial Accuracy} \frac{\partial Accuracy}{\partial Effort} $. However, let’s assume without loss of generality that $ \frac{\partial Returns}{\partial Accuracy} $ is linear with the same coefficient for each project - then the optimisation problem boils down to maximising the sum of accuracies. If anything, $ \frac{\partial Returns}{\partial Accuracy} $ is likely itself concave (going from 50->60 pp accuracy is perhaps more business-enabling than going from 90->100), which would make $ \frac{\partial Returns}{\partial Effort} $ even more concave, but linearity doesn’t change the nature of the problem either.

If the projects are independent, the optimal allocation can be approximated with the following algorithm:

```
import numpy as np
from collections import defaultdict

# Concave effort-accuracy curves, here f(x) = a * log(b * x + 1)
curve_params = [(1.0, 0.9), (0.8, 0.7), (0.7, 0.5), (0.5, 0.4), (0.4, 0.3)]
effort_curves = [lambda x, a=a, b=b: a * np.log(b * x + 1) for a, b in curve_params]

effort_per_project = defaultdict(int)
total_effort = 100
for _ in range(total_effort):
    # Empirical derivative: the gain from one more unit of effort
    marginal_gains = [
        curve(effort_per_project[i] + 1) - curve(effort_per_project[i])
        for i, curve in enumerate(effort_curves)
    ]
    # Allocate 1 unit of effort to the project with the highest gain
    project_to_allocate = int(np.argmax(marginal_gains))
    effort_per_project[project_to_allocate] += 1
```

Above, the optimal allocation for each project is shown in red dots, and it is clear that when facing diminishing returns, it is not rational to invest too heavily in any one project. However, the juxtaposition with the first image shows that this means not investing further than some basic feature engineering improvements! We shouldn’t even start looking into deep learning architectures, let alone the research frontier implementations.

This visualises the fundamental tradeoff between industry priorities (impact) and academic priorities (novelty). Even if the data scientists would prefer to work on the top-importance project and bring it into SotA territory - and with a budget of 100 `total_effort` this would be feasible - they are not able to, because various other projects require keeping their lights on.

There are two possible necessary conditions to get to the “SotA research approach” area of the curve for any given project.

Firstly, let’s look at the same optimisation solution, but this time expand `total_effort` to 300. Intuitively, this corresponds to the growth of the company: the most valuable ML projects remain more or less the same, but with improved resources, we can invest more. An alternative way to look at this is that an increasing effort surplus enables further specialisation. This is similar to how human societies have evolved - at first, almost everyone was a farmer, but as food surpluses accumulated, you could have marketers, social media specialists and prompt engineers as well. 🙂

We see now that it is optimal for the company to invest into research frontier level improvements to the first project. So if you work at Google, your solution for the quarterly fraud detection model task might be novel enough to warrant publishing in KDD.

The second way to have an optimal allocation nearing the SotA region for the top project is to change the relative importance of projects. In real life, different companies have a different distribution of ML priorities. Some, which I denote as “ML-first” product companies, have their core product success hinging on some ML model. Take, for example, Starship Technologies, developing cute autonomous sidewalk robots. They needed to build up a perception team relatively early, because without perception, you have no autonomy, and thus no product. Other possible ML use cases, such as travel time prediction, fraud prevention or customer segmentation, came much later. Similarly, if you’re selling an OCR service, or a foundational LLM, the core business success is intimately tied to the ML solution. We can represent this graphically by making one of the projects much more prominent:

The final takeaway is simple: if you want to publish, you will mostly be disappointed at a startup. If you want to work on cutting edge engineering, then a startup will work if its product is ML-first. Be sure to ask in interviews how critical the ML solution is to the company; if it falls into the nice-to-have category, it will likely not have a SotA solution. Otherwise, the amount of interesting ML engineering work scales with the size of the company.

After working for 5 years on ML projects at Bolt, I was asked by my manager (only somewhat seriously) to give a farewell presentation on how to increase one’s impact as an individual contributor (IC) in a large product organisation. This got me thinking about how we measure, attribute and plan for impact in technology companies. One thing led to another and here we are, with Monte Carlo simulations for career impact. But there’s nothing like writing a simulator to prove fairly commonly known truths, right?

The senior IC career path is still a relative novelty in the grand scheme of things. While the traditional approach
to promotion pushes high performers onto the management path, where their individual impact becomes muddled and difficult to measure,
ICs are evaluated on their raw contributions alone. Usually (at least in the ML field), this is measured in units of
some business metric that their projects hopefully help to improve. Whether you like it or not, this
sort of *impact* assessment is going on in the background each time promotions, salary reviews or re-organisations
happen. So in order to maximise your overall impact in the organisation, it helps to understand how an organisation itself
operates.

Most big technology companies operate on a quarterly planning cycle. Optimally solving this is equivalent to solving
the knapsack problem - given a fixed resource count
(usually denoted as *human weeks*), choose from a set of possible projects to maximise some business metric
(usually revenue, GMV, profit etc).

A spreadsheet used for quarterly planning of a food delivery platform might look something like this:

Project | Description | Benefit | Cost | ROI (kEUR/w)
---|---|---|---|---
A | Fix payment button | 500k EUR | 2 human-weeks | 250
B | Adopt new UI framework | 275k EUR | 5 human-weeks | 55
C | Simplify restaurant on-boarding | 630k EUR | 6 human-weeks | 105
D | New dispatching algorithm | 2.5M EUR | 13 human-weeks | 192
… | … | … | … | …

With only one employee, we have 13 weeks available in a quarter. Given this constraint, it would be best to work only on project D, and nothing else.
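This selection can be checked with a brute-force knapsack search over the table (fine for a handful of projects; dynamic programming would be needed for a real roadmap):

```python
from itertools import combinations

# Projects from the table above: name -> (benefit in kEUR, cost in weeks)
projects = {"A": (500, 2), "B": (275, 5), "C": (630, 6), "D": (2500, 13)}
budget = 13  # human-weeks in the quarter

best_value, best_set = 0, set()
for r in range(len(projects) + 1):
    for combo in combinations(projects, r):
        cost = sum(projects[p][1] for p in combo)
        value = sum(projects[p][0] for p in combo)
        if cost <= budget and value > best_value:
            best_value, best_set = value, set(combo)
# D alone (2.5M EUR) beats A+B+C combined (1.405M EUR for the same 13 weeks)
```

Note that greedily picking by ROI would start with A (250 kEUR/w) and lock D out of the budget, which is why planning by ROI ranking alone can be suboptimal.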

We can extend this simple model by taking the engineer’s perspective into account. Instead of taking ranked roadmap items as a given (“just solving tickets”), it seems preferable to spend some effort to do research and re-assess the proposed items. That’s because the benefit and cost are not deterministic, they are only estimates for a random variable from some distribution.

Secondly, the benefits of a software project accrue into the future rather than being a lump sum. Usually these scale with business growth - we can think of project benefits relative to the overall revenue or profit. In an ideal world, cost is a one-time thing - you build something and it just works - but realistically, there is some future maintenance cost to new features.

Finally, there’s a trade-off between implementation cost and future maintenance costs - we might decide to do things somewhat sloppily initially to ship faster. Then, in the future, we have the option to reduce the tech debt with further investment into the project.

Putting all of this together, the rational engineer would solve something like a dynamic stochastic knapsack problem with additional choice variables for tech debt, research effort and refactoring. I’m not going to bother with setting this up formally, let alone solving it, mostly because I doubt anyone in the real world does anything close to it. I’ve yet to see anyone take the future maintenance costs of solutions into account in quarterly planning, not to mention how much people dislike stochasticity. Instead, I simulate a few scenarios with baseline “sensible” strategies and see if they reveal anything about the trade-offs. What are these sensible strategies?

- not spending any effort on exploration vs spending *some*
- never taking on tech debt (purist) vs sometimes doing it (necessary evil)
- varying the overall level of recurring maintenance cost (a proxy for engineering quality?)

For the following, I wrote a simple simulator (code here) with the following conditions:

- Each quarter, there are $ N $ possible projects to work on, which have a random benefit (normal) and cost (log-normal). The log-normal distribution of costs is based on an interesting argument made by Erik Bernhardsson.
- The planner chooses the top K ones to work on that fit into the time budget, ranking them heuristically by initial benefit and cost ratio. These initial estimates are unbiased (more on this later).
- While projects keep accruing benefits every period, there is also some maintenance cost involved. However, this is not accounted for at project selection time.
- The maintenance cost is a function of initial project cost and how much tech debt was taken on implementation.
- Before choosing new projects, the planner can refactor existing projects to reduce their tech debt.
- Before choosing new projects, the planner can explore them in-depth, expanding time budget, but reducing the variation of benefit/cost estimates.
- Many arbitrary constants and heuristics - keep in mind this is just a thought exercise. 🙂

Let’s play around with the choice variables to see what would be a good strategy to approach this problem with.

As a brief aside, note that there is a crucial difference in how we model the planner’s knowledge of the estimates of benefit and cost. Currently, we’re assuming that the planner has knowledge of the full distribution. To illustrate, consider having to choose the best 2 projects out of A, B and C with the following expected rewards (100K, 130K and 180K):

If we really know the full distribution (which here means the mean and variance), then it’s straightforward that we should
pick B and C. However, this implies that the estimate in our planning spreadsheet is an unbiased estimate for the mean.
For more complex distributions like the lognormal, we’d be assuming that our spreadsheet number corresponds to $ \mathbf{E}(x) = \exp \left( \mu + \frac{\sigma^2}{2} \right) $ with correct
estimates for $\mu$ and $\sigma$! This seems highly suspect.
It seems more likely that the number in our spreadsheet is only an imperfect estimate for $ \mathbf{E}(x) $. Perhaps we
have worked on similar problems before, and can infer the mean based on historical data. In the absence of that, our
estimate might be *really* noisy, though. We can model this extreme scenario by drawing the estimate as just one sample
from the same distribution. This is the fundamental difference between *known* and *unknown* unknowns. If we know that a
project will have a high variance of outcomes, we are still able to plan for it, taking the variance into account. If, however,
both the mean and the variance we estimate are themselves noisy, then we run into the reversion to the mean
problem. This is illustrated below, where for 100 time periods, we select the top 5 projects from 20 randomly generated ones,
and then compare the distribution of actual rewards to the estimated ones. Since we’re doing the selection on a random variable,
the likelihood of selecting an extreme value is increased. The variance of the estimates is significantly lower for the
same reason.
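A minimal sketch of this selection effect (the exact parameters here are invented for illustration and use a normal distribution, not the ones behind the post's figure):

```python
import numpy as np

rng = np.random.default_rng(0)
N_PERIODS, N_PROJECTS, TOP_K = 100, 20, 5

selected_estimates, selected_actuals = [], []
for _ in range(N_PERIODS):
    true_means = rng.normal(100, 30, N_PROJECTS)  # true expected rewards
    estimates = rng.normal(true_means, 30)        # noisy spreadsheet estimates
    top = np.argsort(estimates)[-TOP_K:]          # select on the noisy estimate
    actuals = rng.normal(true_means, 30)          # realised rewards
    selected_estimates.extend(estimates[top])
    selected_actuals.extend(actuals[top])

# Selecting on a noisy estimate inflates the estimates relative to the outcomes
print(np.mean(selected_estimates) - np.mean(selected_actuals))
```

The gap is systematically positive: conditioning on being in the top 5 selects for positive estimation noise, which the realised rewards do not share.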

Real life is probably somewhere in between the known- and unknown-unknowns territory, so we should expect our spreadsheet ROI estimates to be overestimates, and our headcount budgets to often explode when dealing with uncertain projects. Note that this is not an artifact of the lognormal distribution, as here we used a normal distribution for the rewards. The root cause is selection on a random variable.

Coming back to the simulation model, the first fairly obvious observation is that if existing projects introduce maintenance costs, then the rate of shipping new things
is going to slow down over time. So the main question becomes: how do we keep up as high a pace as possible when juggling
new and old work? Here we compare the outcomes after 4 years for different rates of *recurring cost*, i.e. the fraction of
initial development cost that needs to be spent each quarter just to keep something up and running.
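To make the mechanics concrete, here is a stripped-down sketch of the budget dynamics (all constants are invented; the post's actual simulation is richer): each quarter a fixed effort budget is first spent maintaining everything already shipped, and the remainder goes into new projects.

```python
def projects_shipped(recurring_rate, budget=10.0, cost_per_project=2.0, quarters=16):
    """Count projects shipped over `quarters` quarters, when each live project
    eats `recurring_rate` * its initial cost from the budget every quarter."""
    live = 0
    for _ in range(quarters):
        maintenance = live * cost_per_project * recurring_rate
        free_budget = max(budget - maintenance, 0.0)
        live += int(free_budget // cost_per_project)  # ship what we can afford
    return live

for rate in (0.05, 0.15, 0.30):
    print(rate, projects_shipped(rate))
```

Even in this toy version, the shipping rate decays towards a maintenance-dominated plateau, and the plateau arrives much sooner at higher recurring-cost rates.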

Similarly, we can plot the total reward for the cost rates.

All the curves are exponentially growing - if you add more projects with a positive impact, you are adding on top of the impact of the projects you’re already maintaining. However, we see that the difference between outcomes is quite drastic. If we take the maintenance fraction to reflect the overall level of tech debt, then the first takeaway is that technical debt will need to be paid back, but it’s paid back from your own impact and career growth potential. Often I’ve seen tech debt arguments boil down to something like “will the system still scale in 5 years?” where the average engineer doesn’t care much about what happens after 5 years, but here we see it can start playing a role within the immediate tenure time as well.

Alternatively, we could view tech debt as a dynamic lever: there probably exists a successful strategy of taking on tech debt
when the (expected) value of current possible tasks is high and paying it off when the value is low. We could set some
threshold on the average ROI expectation of this quarter’s projects and cut corners if it’s high. Intuitively, in case of urgency
(we are not likely to see as high expected rewards at other times), it makes sense to relax the quality standards somewhat,
even if this locks future productive resources into maintaining the horrible mess we’ve made. However, at times of low
opportunity cost, we can spend time to fix it, freeing up resources for future projects. While this provides another
optimisation avenue, the winnings from it are generally much smaller than from being able to reduce your overall maintenance rate.
This is because a) given enough projects, the budget is almost entirely exhausted by maintenance, and tech debt adds to that cost by
making a noisy gamble on the reward, and b) there is only a very limited range of parameter combinations where taking on tech debt is
more advantageous than not, because the full additional cost of tech debt is the increased maintenance cost *and* the eventual refactoring cost, unless we discount the future heavily (which we currently don’t do in the 12 quarter simulation).

We’d have to make quite generous assumptions on how tech debt works to see a significant difference, for example with:

- Assuming that cutting $x$% of corners reduces initial implementation cost by $x$%, and increases maintenance cost by only $\frac{x}{10}$%
- The refactoring/rewriting cost is just 30% of the initial implementation cost (unlikely!)

the results are not too different:

This could clearly look different with different parameters assumed; nonetheless, dynamic technical debt leveraging seems to be a second order effect compared to the overall maintenance rate.

Finally, plotting the average outcomes over different values of exploration rate, the classic exploration-exploitation dilemma becomes visible: if we do too little exploration, then we become “ticket solvers” or just “building cool stuff”, while the opposite end of the spectrum is usually called “analysis paralysis”. There is a sweet spot which depends on the variance of the benefit/cost estimates and how much the exploration will reduce uncertainty. As expected, for a fixed level of rewards, higher uncertainty means higher required exploration rate, but also worse outcomes. I think the reason ML teams tend to spend more time on exploration is directly tied to the higher uncertainty of the nature of the projects. And clearly, you should only use ML when you expect high returns from a project.

- Planning estimates are almost surely overly optimistic due to selection on noisy estimates. This is why hand-wavy rules like “gut feeling estimates times 3” are often quite accurate, and it provides another mechanism to explain budgeting failures, as this is not an artifact of the log-normal distribution only.
- Working on the right thing can often be more important than the work itself. The higher the uncertainty of a project (or alternatively, the more you can expect to reduce it), the more should be invested into exploration. This is why having dedicated data analysts for planning and impact estimation becomes important at scale.
- To increase your technical impact within an organisation, it’s important to not “get stuck” with poorly engineered legacy projects - 5% maintenance cost on average ships 80% more projects over 3 years than 15%. I’d wager that the number of successfully shipped projects might even be the best proxy for recognition, as people tend to be biased towards new and flashy projects.

While the above model is general enough to reflect any engineering projects, it’s worth pointing out that machine learning projects have some idiosyncrasies:

- It’s possible that in your company there’s a separate team of data analysts for impact estimation (exploration), so it’s not the actual engineer doing impact estimations. However, it comes out of the company’s resource budget nonetheless. If we think of analysts as a leveraging factor on engineers, this resource might be better used elsewhere (opportunity costs). Quite often, especially in the ML/DS field, it is the engineers themselves who need to estimate the impact.
- Another way to write the estimate of monetary impact of a project is $dB = \frac{dB}{dA} \frac{dA}{dE} dE$ where $dE$ denotes the effort spent, $\frac{dA}{dE}$ shows how much a model’s accuracy improves from the effort, and $\frac{dB}{dA}$ shows how much the business metric improves from the accuracy improvement. ML practitioners are in double trouble here: not only is it very hard to understand the effect on business metrics, often we don’t know how much a model can be improved until we actually try to do it.
- Probably *the most* useful uncertainty reduction tool I’ve encountered is building demos and sharing these with stakeholders ASAP. On the one hand, being able to ship something like this fast significantly reduces the effort estimate - especially given that ML tends to be viewed as “researchy” - while it also sets some lower bound on the rewards, at least the $\frac{dA}{dE}$ element.
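The “double trouble” in the decomposition above can be made concrete with a tiny Monte Carlo sketch: if both $\frac{dB}{dA}$ and $\frac{dA}{dE}$ are uncertain, the uncertainty of their product compounds (all distributions and numbers below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical uncertain beliefs about both factors (lognormal spreads)
dB_dA = rng.lognormal(mean=np.log(50_000), sigma=0.5, size=10_000)  # value per accuracy pp
dA_dE = rng.lognormal(mean=np.log(2.0), sigma=0.8, size=10_000)     # accuracy pp per month
dE = 3                                                              # months of effort

dB = dB_dA * dA_dE * dE  # dB = (dB/dA) * (dA/dE) * dE, sampled
print(np.median(dB), np.percentile(dB, [10, 90]))
```

With these (made-up) spreads, the 10th-90th percentile range of the benefit estimate spans roughly an order of magnitude - the multiplicative structure is exactly why ML impact estimates are so wide.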

LLM stochasticity poses a problem for evaluating whether new commits improve a project’s performance or not - the solution is to run a paired difference test on repeated measures of the same eval set for maximal statistical power.

OpenAI’s GPT-4 outputs are not deterministic. [1][2] Leaving aside anecdotal performance changes over long time horizons, simply sending the same API request 10 times in a row can give you 10 different answers. Working with GPT-4 at Bolt, we noticed that these differences are not just mere paraphrasing - it could also be that an AutoGPT-like agent chooses an entirely different set of actions across multiple requests. It looks to be caused by the nature of how log probabilities are stored and used in parallel GPU calculations. Thus, as a fundamental problem, it might affect all modern instruction tuned AGI models.

However, as most practitioners also know, LLM development needs to be evals-driven (EDD) - that means every prompt, code or model modification should run against some fixed set of task-specific test cases that have a quantified success measure to either greenlight or block new commits. For simplification, assume we have a simple binary accuracy measure. This is fairly realistic if your LLM is deployed as an agent and needs to choose from a discrete set of actions / API calls to perform.

Let’s say by anecdotal inspection you discover a systematic issue in the LLM outputs and want to fix it with improved prompt engineering. As prompts tend to be not the most modular things, it’s hard to know whether these changes hurt overall accuracy. Therefore, we need to measure both the old (A) and new (B) branches on our evals set and compare the accuracies. Alright, we did that - the values are 73% and 76% respectively. Great! Does that mean we are safe to roll it out? This brings us back to the original problem of stochasticity. On our Bolt evals set we saw that, with a sample of 64 cases, the accuracy could vary in consecutive A/A eval runs from 65% to 75%! This renders a naive comparison of accuracies essentially useless, as unless the changes have a dramatically large impact, they can get overshadowed by background variation.

Luckily estimating unknown population parameters from a noisy sample is the fundamental problem of statistics, so let’s see if developers can use a few tricks from the field.

More stats-savvy readers might lament: why not just increase the sample size? It’s true that if we increase it to infinity, by the law of large numbers we will converge to the true accuracy ratio in either branch. Alas, collecting curated validation samples can be costly, and sets of thousands of cases also become slow and cost-prohibitive to evaluate with SotA models such as GPT-4.

However, even with a limited sample size, we can use statistical hypothesis testing between the samples from the two branches. For a binary conversion metric such as accuracy, there are multiple suitable tests, but below I use the two-sample t-test and the chi-square contingency table test.

We can simulate the scenario described above in Python:

```
import numpy as np
from scipy.stats import truncnorm, ttest_ind, chi2_contingency
import pandas as pd
from tqdm import tqdm

np.random.seed(1337)

TRUE_EFFECT_SIZE = 0.05  # This is the percentage point effect size treatment gives
N_TEST_CASES = 100

## First, simulate "true" values of accuracy for the test cases.
# This is the probability that a given LLM call will give the correct answer -
# and directly related to the difficulty of the test case.
# This distribution is a parameter of the simulation and could be changed
# for uniform, truncated normal etc.
values = [0.15, 0.5, 0.9]
weights = [0.25, 0.15, 0.6]
weights = np.array(weights) / sum(weights)
probabilities_control = np.random.choice(values, size=N_TEST_CASES, p=weights)
probabilities_treatment = np.clip(probabilities_control + TRUE_EFFECT_SIZE, 0, 1)

## Second, simulate the model's stochastic behaviour
def llm_call(probabilities):
    return np.array([np.random.choice([0, 1], p=[1 - p, p]) for p in probabilities])

## Finally, we can simulate the results of LLM eval calls
MC_TRIALS = 500

# Arrays to hold p-values
p_values_t_test = np.zeros(MC_TRIALS)
p_values_chi_sq = np.zeros(MC_TRIALS)

# Monte Carlo trials
for i in tqdm(range(MC_TRIALS)):
    results_control = llm_call(probabilities_control)
    results_treatment = llm_call(probabilities_treatment)
    data = pd.DataFrame({
        'result': np.concatenate([results_control, results_treatment]),
        'treatment': np.concatenate([np.zeros(N_TEST_CASES), np.ones(N_TEST_CASES)]),
        'test_case': np.concatenate([np.arange(N_TEST_CASES), np.arange(N_TEST_CASES)])
    })
    # Perform a t-test
    t_test_result = ttest_ind(data[data['treatment'] == 0]['result'],
                              data[data['treatment'] == 1]['result'])
    p_values_t_test[i] = t_test_result.pvalue
    # Perform a chi-square test
    contingency_table = pd.crosstab(data['treatment'], data['result'])
    _, p, _, _ = chi2_contingency(contingency_table)
    p_values_chi_sq[i] = p

# Compare power
power_t_test = np.mean(p_values_t_test < 0.05)
power_chi_sq = np.mean(p_values_chi_sq < 0.05)
print(f"Power of the t-test for each iteration: {power_t_test}")
print(f"Power of the chi-squared test for each iteration: {power_chi_sq}")

>>Power of the t-test for each iteration: 0.046
>>Power of the chi-squared test for each iteration: 0.022
```

The results do not look too promising! 😱 For an accuracy baseline of roughly 67% and a true effect size of 5pp, t-test detects a significant effect only 5% of the times, and chi-squared test does even worse at 2%!

Fortunately there is another source of variation we can exploit - namely that the test cases are different. Some are very difficult - and therefore have a lower probability of being correct. Technically, we are dealing with a paired observation here - each test case is measured for both branches. So to achieve better statistical power, we can use a paired t-test instead. This is as simple as adding the following code:

```
from scipy.stats import ttest_rel
...
p_values_paired_t_test = np.zeros(MC_TRIALS)
...
    # Perform a paired t-test
    paired_t_test_result = ttest_rel(data[data['treatment'] == 0]['result'],
                                     data[data['treatment'] == 1]['result'])
    p_values_paired_t_test[i] = paired_t_test_result.pvalue
...
power_paired_t_test = np.mean(p_values_paired_t_test < 0.05)
print(f"Power of the paired t-test for each iteration: {power_paired_t_test}")

>>Power of the paired t-test for each iteration: 0.154
```

This yields an improvement as expected, but still does not seem too good.

The paired t-test is usually used for paired data in the special case where we have exactly two measurements per pair (for example a before/after measurement in a drug trial). However, why limit ourselves to 2? Since the number of LLM eval calls is entirely in our control, we could run the eval for each case an arbitrary number of times $N$, and this should provide us with even more variation to exploit.

Instead of the paired t-test, with $N$ measurements (where $N$ can actually differ between control and treatment), StackExchange guided me towards the mixed effects model to estimate the treatment coefficient. This model seems almost perfectly suited for our use case, as we know there should be some background variation of accuracy depending on the test case specifics, but we don’t necessarily want to explicitly estimate a coefficient (fixed effect) for it - we only care about the treatment effect. To implement this in Python, we need to add an additional `N_ITERATIONS` loop within a trial, aggregate trial data over iterations, and add a statsmodels fit. For fun, let’s also add some benchmarks - the fixed effects (dummy variable) model, naive linear regression, and a paired t-test on data that has been averaged within each group.

```
import statsmodels.formula.api as smf  # needed for the model fits below

# Arrays to hold p-values
p_values_t_test = np.zeros(MC_TRIALS)
p_values_chi_sq = np.zeros(MC_TRIALS)
p_values_paired_t_test = np.zeros(MC_TRIALS)
p_values_avg_paired_t_test = np.zeros(MC_TRIALS)
p_values_mixed_effects = np.zeros(MC_TRIALS)
p_values_fixed_effects = np.zeros(MC_TRIALS)
p_values_ols = np.zeros(MC_TRIALS)

N_ITERATIONS = 5

for i in tqdm(range(MC_TRIALS)):
    trial_datas = []
    for _ in range(N_ITERATIONS):
        results_control = llm_call(probabilities_control)
        results_treatment = llm_call(probabilities_treatment)
        # Create a DataFrame for the mixed effects model
        trial_datas.append(pd.DataFrame({
            'result': np.concatenate([results_control, results_treatment]),
            'treatment': np.concatenate([np.zeros(N_TEST_CASES), np.ones(N_TEST_CASES)]),
            'test_case': np.concatenate([np.arange(N_TEST_CASES), np.arange(N_TEST_CASES)])
        }))
    data = pd.concat(trial_datas)
    # Fit the mixed effects model
    model = smf.mixedlm("result ~ treatment", data, groups=data['test_case'])
    model_fit = model.fit()
    p_values_mixed_effects[i] = model_fit.pvalues['treatment']
    # Fit fixed effects model
    model = smf.ols('result ~ treatment + C(test_case)', data=data)
    model_fit = model.fit()
    p_values_fixed_effects[i] = model_fit.pvalues['treatment']
    # Fit OLS
    model = smf.ols('result ~ treatment', data=data)
    model_fit = model.fit()
    p_values_ols[i] = model_fit.pvalues['treatment']
    # Perform a paired t-test
    paired_t_test_result = ttest_rel(data[data['treatment'] == 0]['result'],
                                     data[data['treatment'] == 1]['result'])
    p_values_paired_t_test[i] = paired_t_test_result.pvalue
    # Perform a paired t-test on averaged data
    data_avg = data.groupby(["test_case", "treatment"])["result"].mean().reset_index()
    paired_t_test_avg_result = ttest_rel(data_avg[data_avg['treatment'] == 0]['result'],
                                         data_avg[data_avg['treatment'] == 1]['result'])
    p_values_avg_paired_t_test[i] = paired_t_test_avg_result.pvalue
    # Perform a t-test
    t_test_result = ttest_ind(data[data['treatment'] == 0]['result'],
                              data[data['treatment'] == 1]['result'])
    p_values_t_test[i] = t_test_result.pvalue
    # Perform a chi-square test
    contingency_table = pd.crosstab(data['treatment'], data['result'])
    _, p, _, _ = chi2_contingency(contingency_table)
    p_values_chi_sq[i] = p

# Compare power
power_t_test = np.mean(p_values_t_test < 0.05)
power_chi_sq = np.mean(p_values_chi_sq < 0.05)
power_ols = np.mean(p_values_ols < 0.05)
power_paired_t_test = np.mean(p_values_paired_t_test < 0.05)
power_avg_paired_t_test = np.mean(p_values_avg_paired_t_test < 0.05)
power_mixed_effects = np.mean(p_values_mixed_effects < 0.05)
power_fixed_effects = np.mean(p_values_fixed_effects < 0.05)
print(f"Power of the t-test for each iteration: {power_t_test}")
print(f"Power of the chi-squared test for each iteration: {power_chi_sq}")
print(f"Power of the OLS model for each iteration: {power_ols}")
print(f"Power of the paired t-test for each iteration: {power_paired_t_test}")
print(f"Power of the avg paired t-test for each iteration: {power_avg_paired_t_test}")
print(f"Power of the mixed effects model for each iteration: {power_mixed_effects}")
print(f"Power of the fixed effects model for each iteration: {power_fixed_effects}")

>>Power of the t-test for each iteration: 0.368
>>Power of the chi-squared test for each iteration: 0.336
>>Power of the OLS model for each iteration: 0.368
>>Power of the paired t-test for each iteration: 0.592
>>Power of the avg paired t-test for each iteration: 0.59
>>Power of the mixed effects model for each iteration: 0.584
>>Power of the fixed effects model for each iteration: 0.584
```

A couple of points are immediately striking:

- The power of all tests increases when increasing `N_ITERATIONS`. Even if we are not sure about the optimal test setup, scaling repeated measurements seems useful. For a sanity check, the interested reader can verify that power is near zero for all tests if we set `TRUE_EFFECT_SIZE` equal to 0.
- t-test, chi-squared and OLS are clearly worse than the others - this is because they do not leverage the paired nature of the data.
- Paired t-test, averaged paired t-test, mixed effects and fixed effects models have very similar results, with the latter two being identical.

This raises a few interesting questions:

Isn’t averaging multiple measurements before a paired comparison bad? Wouldn’t you lose useful variation? It turns out that yes - the averaged power is always slightly less than the paired one - but the difference is very small, which might partly be an artefact of the simulation setup.

There are research papers written about the subtle differences between mixed effects and fixed effects models. How can they be identical here? It turns out that if you are only interested in the *treatment effect* rather than the group effects (here, the estimated accuracy rate of each individual test case) and you also have a balanced dataset with no missing values, then a paired t-test, mixed effects and fixed effects are actually **all** equivalent! Marco Laube has a nice proof of this. In the appendix, I extend it to our use case with multiple measurements.

To me personally, the equivalence of a simple paired test and fixed effects was not intuitive at first. In the first case, we simply stack repeated measurements on top of each other in *long* format and treat them as independent pairs, while in the latter, if we measured the performance on 2 different test cases 1 million times, we would get a nice estimate of the case-specific performance. It boils down to what we want to measure - since it’s the treatment effect $\hat{\beta}_1$ rather than $\hat{u}_2$, the choice between methods does not matter that much.

Finally, it’s worth noting that scaling `N_ITERATIONS` is no magic solution for small evals samples. If we managed to set `N_ITERATIONS=int(1e3)`, our $\hat{\beta}_1$ *will* converge to the population treatment effect value. However, the interpretation is not necessarily the one we’re interested in. It will only tell us that a change to the prompts caused an effect of size $\hat{\beta}_1$ *on the limited set of test cases*. This does not necessarily transfer to good generalisation ability to other requests the model might see in production - in the worst case, we’ve chosen completely irrelevant test cases and the exercise is useless. So it is always a good idea to also scale `N_TEST_CASES` and make sure the cases correspond to real problems you expect your LLM to solve.

Reusing Marco Laube’s notation - if our model is given by

\[y_{ijk} = \beta_{0} + \beta_{1} \mathbf{1}_{j=2} + u_{i} + \epsilon_{ijk}\]

where $j \in \{1, 2\}$ denotes treatment and $k$ indexes the repeated measurements per group (the example below uses $k=2$ and $n=2$), we can write:

\[\begin{bmatrix} y_{111} \\ y_{112} \\ y_{121} \\ y_{122} \\ y_{211} \\ y_{212} \\ y_{221} \\ y_{222} \\ \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \\ \end{bmatrix} \times \begin{bmatrix} \beta_0 \\ \beta_1 \\ u_2 \\ \end{bmatrix}\]

We then get that $k$ scales the matrix $X^TX$ (makes sense, as we’re essentially duplicating data in $X$).

\[X^{T}X = k \begin{bmatrix} 2n & n & 2 \\ n & n & 1 \\ 2 & 1 & 2 \\ \end{bmatrix}\]

and the matrix inverse is divided by $k$:

\[(X^{T}X)^{-1} = \frac{1}{k} \begin{bmatrix} \frac{n+1}{2n} & -\frac{1}{n} & -\frac{1}{2} \\[6pt] -\frac{1}{n} & \frac{2}{n} & 0 \\[6pt] -\frac{1}{2} & 0 & 1 \\[6pt] \end{bmatrix}\]

Therefore

\[(X^{T}X)^{-1} X^T = \frac{1}{k} \begin{bmatrix} \frac{n+1}{2n} & \frac{n+1}{2n} & \frac{n-1}{2n} & \frac{n-1}{2n} & \frac{1}{2n} & \frac{1}{2n} & -\frac{1}{2n} & -\frac{1}{2n} \\[6pt] -\frac{1}{n} & -\frac{1}{n} & \frac{1}{n} & \frac{1}{n} & -\frac{1}{n} & -\frac{1}{n} & \frac{1}{n} & \frac{1}{n} \\[6pt] -\frac{1}{2} & -\frac{1}{2} & -\frac{1}{2} & -\frac{1}{2} & \frac{1}{2} & \frac{1}{2} & \frac{1}{2} & \frac{1}{2} \\[6pt] \end{bmatrix} \in \mathbf{R}^{3\times 8}\]

Note that we will have $k$ duplicate columns - in this case 2. The OLS estimator is given by:

\[\hat{\beta} = (X^{T}X)^{-1} X^Ty = \frac{1}{k} \begin{bmatrix} \frac{n+1}{2n}(y_{111} + y_{112}) + \frac{n-1}{2n}(y_{121} + y_{122}) + \frac{1}{2n}(y_{211}+y_{212}) - \frac{1}{2n}(y_{221} + y_{222}) \\[6pt] \frac{1}{n} \big( (y_{121} + y_{122}) - (y_{111} + y_{112}) + (y_{221}+y_{222}) - (y_{211}+y_{212}) \big) \\[6pt] \frac{1}{2} \big( (y_{211}+y_{212}+y_{221}+y_{222}) - (y_{111} + y_{112} + y_{121} + y_{122}) \big) \end{bmatrix}\]

We see that \(\hat{\beta}_1 = \frac{1}{n} \sum_{i=1}^n \frac{1}{k} \sum_{m=1}^k \big( y_{i2m} - y_{i1m} \big)\)

Let’s denote \(\gamma_{ij} = \sum_{m=1}^k y_{ijm}\); then we have \(\hat{\beta}_1 = \frac{1}{n} \sum_{i=1}^n \frac{1}{k} \big( \gamma_{i2} - \gamma_{i1} \big)\)

We see that the treatment effect estimator simply becomes the average (over cases) of the average (over $k$) differences of the dependent variable. The rest of the proof should follow Marco Laube’s derivations. I barely trust my own math though, so here’s a sanity check based on our simulation data from above: if the estimators are equivalent, we should:

- see a near 1 correlation in their p-values.
- see a very similar distribution of their p-values.
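As an additional, more direct check, the balanced-design algebra can be verified with a few lines of numpy (invented data, arbitrary small $n$ and $k$): the treatment coefficient of the dummy-variable OLS should equal the mean of the per-case averaged differences.

```python
import numpy as np

rng = np.random.default_rng(42)
n, k = 4, 3                      # test cases, repeated measurements
y = rng.normal(size=(n, 2, k))   # y[i, j, m]: case i, branch j, measurement m

# Dummy-variable (fixed effects) design: intercept, treatment, case dummies 2..n
X_rows, y_rows = [], []
for i in range(n):
    for j in range(2):
        for m in range(k):
            case_dummies = [1.0 if i == c else 0.0 for c in range(1, n)]
            X_rows.append([1.0, float(j)] + case_dummies)
            y_rows.append(y[i, j, m])
X, yv = np.array(X_rows), np.array(y_rows)
beta = np.linalg.lstsq(X, yv, rcond=None)[0]

# Mean over cases of the (k-averaged) treatment-minus-control differences
beta1_paired = np.mean(y[:, 1, :].mean(axis=1) - y[:, 0, :].mean(axis=1))
print(np.isclose(beta[1], beta1_paired))  # True
```

The two numbers agree to floating point precision, matching the derivation.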

Let’s check the full correlation matrix of all estimators and the corresponding medians:

| | t-test | chi-sq | OLS | paired t-test | paired avg t-test | mixed effects | fixed effects |
|---|---|---|---|---|---|---|---|
| t-test | 1 | 0.9996 | 1 | 0.9746 | 0.9721 | 0.9748 | 0.9749 |
| chi-sq | | 1 | 0.9996 | 0.9686 | 0.9663 | 0.9687 | 0.9689 |
| OLS | | | 1 | 0.9746 | 0.9721 | 0.9748 | 0.9749 |
| paired t-test | | | | 1 | 0.9956 | 0.9995 | 0.9995 |
| paired avg t-test | | | | | 1 | 0.9954 | 0.9954 |
| mixed effects | | | | | | 1 | 1 |
| fixed effects | | | | | | | 1 |

We see that:

- t-test is very similar to chi-squared, and exactly equivalent to OLS. The latter is a fairly well known result from statistics.
- Indeed, the correlation between paired t-test, mixed effects and fixed effects is near 1.

Since in hypothesis testing we also care about the absolute levels rather than only correlations, we can also look at the median values for an estimator’s p-value distribution:

| Estimator | Median p-value |
|---|---|
| t-test | 0.0959 |
| chi-sq | 0.1104 |
| OLS | 0.0959 |
| paired t-test | 0.0266 |
| paired avg t-test | 0.031 |
| mixed effects | 0.0247 |
| fixed effects | 0.025 |

We see that:

- Even though highly correlated, chi-squared test is worse than t-test in this case.
- Even though paired t-test was slightly more sensitive when we were looking at the 5% threshold, based on the medians, mixed and fixed effects are better. However, the differences are very small, small enough to put down to implementation differences between the libraries, so we can indeed say that the methods are equivalent 🙂 Since paired t-test is easier to do, I’d recommend this approach.

The goal of the competition was the following: given a snapshot of the state of a road network, we would like to predict travel times (proxied by a red/yellow/green congestion class label here) on each road (edge) for 15 minutes into the future. This has practical applications for vehicle routing, which is essentially just shortest path finding through a weighted graph, where the weights can be exactly our predicted travel times.

However, an important crux was that the car counter input data (on the nodes) was sparse. This makes sense: unless you are Google, detailed information on all the cars currently on the road network is not available. Instead, there may only be some fixed number of vehicle counters at intersections, as happened to be the case in London, Melbourne and Madrid.

As a spatiotemporal edge prediction task based on sparse vertex data, it also has applicability to research on the spread of COVID and malicious software, or the temporal dynamics of cryptocurrency networks. [1]

NeurIPS is first and foremost a deep learning conference. Also, since the road network is naturally represented as a graph, IARAI presented this as a graph neural network challenge, with the example notebooks providing various torch dataloaders and GNN utilities. Since I am no GNN expert, and the dummy example notebook would have taken 2 days to train for an epoch on CPU and seemed to crash my SageMaker notebook’s VRAM with a tiny batch size, I started looking into other, more familiar solutions. My thinking was that if our approach is significantly worse than the competition, we learn something new and if it’s better, then that’s also great, so it’s a guaranteed win 😀.

Despite how hard some deep learning aficionados wish for it not to be the case, it’s common knowledge in data science that gradient boosting methods win competitions on tabular data. A 2022 survey confirms this is still the case. There are two reasons for this: firstly, they can fit a complex mix of feature types (categorical, ordinal, numeric) with good accuracy and zero fine-tuning, and secondly, depending on your DL architecture, they can be 100-1000x faster to train.

Most tabular DL papers only account for the first point and try to engineer architectures or architecture ensembles (I saw a paper which trained 10x neural nets with random seeds and averaged their predictions, justifying it by noting that GBM is also an ensemble of models 🤯). However, being 100x faster is beneficial for ML competitions and real life alike, as increased iteration speed lets you test more hypotheses and improve the model more.

In addition to tabular data, competitions such as M5 show that LGB can be the best solution even for large scale multivariate problems. We think that our approach and the scoreboard show that traffic forecasting is “close enough” to a tabular problem that it can also be competitively solved with gradient boosting. More generally, the takeaway seems to be that even if you have *structured data*, if there exists a feasible feature engineering transformation into tabular format, gradient boosting might still be the most practical algorithm. The value proposition of deep learning is of course the reduced need for feature engineering; this is, however, no free lunch, due to the immensely increased computation cost. So choose your models wisely.

Intuitively there are a few different data sources for predicting the speed on a road segment in the future:

- speed on the same segment in the past
- overall “state” of traffic/speeds in the city currently
- “state” of traffic/speeds in the near vicinity of road of interest
- position of the road relative to others
- static attributes of each road, e.g. number of lanes, pavement type etc

The current state of traffic is provided by the car counter data. If we view this as a regular multivariate time series, we get a $k \times t$ matrix $C$ where $k$ is the number of counters and $t$ the number of observations. The main innovation we deployed was reducing this matrix to the first $p$ principal components and using those as time series features directly. Intuitively, this provides a compressed form of “traffic”. We confirm that this is the case by visualising the first two principal components grouped by time of week - they are clearly separable. Note that unlike the winning team, we did not use any external lookup approaches to retrieve exact temporal features, so the principal components are the only proxy for time that our model has.
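The PCA step can be sketched as follows (synthetic data and shapes for illustration; the real pipeline lives in the linked repo and may differ, e.g. here scikit-learn stands in for whatever implementation was used):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
k_counters, t_obs = 200, 1000
# C[i, j]: vehicle count at counter i during time slot j (fake data here)
C = rng.poisson(lam=20, size=(k_counters, t_obs)).astype(float)

# Each time slot becomes one sample described by all counter readings;
# the first p components act as a compressed "state of traffic" feature vector.
p = 2
pca = PCA(n_components=p)
traffic_state = pca.fit_transform(C.T)  # shape (t_obs, p)
print(traffic_state.shape)
```

Each row of `traffic_state` can then be joined onto every road-level training example from the corresponding time slot.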

To encode current “nearby” traffic, we tried a somewhat heuristic approach: weighting the counter matrix $C$ with a row-normalised symmetric weight matrix. More details can be found in our paper but intuitively, this generates a feature which is the distance-weighted mean of nearby counters. This can also be thought of as a hardcoded proxy for a graph convolution layer.
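A rough sketch of that weighting (invented coordinates; the inverse-distance kernel here is an illustrative choice, the paper's exact kernel may differ): build a symmetric weight matrix from pairwise distances, row-normalise it, and multiply with the counter matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
k, t = 50, 100
coords = rng.uniform(0, 10, size=(k, 2))        # fake counter positions
C = rng.poisson(20, size=(k, t)).astype(float)  # counter readings over time

# Symmetric inverse-distance weights between counters (illustrative kernel)
d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
W = 1.0 / (1.0 + d)
W = W / W.sum(axis=1, keepdims=True)  # row-normalise so each row averages

nearby = W @ C  # distance-weighted mean of nearby counters, per location/time
print(nearby.shape)
```

Each row of `nearby` is a smoothed version of the raw counter series, dominated by close-by counters - the hardcoded analogue of one graph convolution.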

To encode historical traffic for roads, one can use embeddings for neural networks or target encoding for gradient boosting. If your entities (roads) have a lot of data density on them, these methods are roughly equivalent; with sparse data, target encodings should do slightly better, but they are also dangerously leak-prone.
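A minimal target-encoding sketch with smoothing toward the global mean (toy data; the smoothing constant `m` is an illustrative choice, and to avoid the leakage mentioned above the encoding should be fit on training folds only):

```python
import pandas as pd

df = pd.DataFrame({
    "road": ["a", "a", "a", "b", "b", "c"],
    "congested": [1, 1, 0, 0, 0, 1],
})

global_mean = df["congested"].mean()
stats = df.groupby("road")["congested"].agg(["mean", "count"])

# Shrink sparse roads toward the global mean (m controls smoothing strength)
m = 5.0
encoding = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
df["road_te"] = df["road"].map(encoding)
print(df[["road", "road_te"]])
```

Roads with few observations end up close to the global mean, while dense roads keep their own historical rate - exactly the behaviour you want for sparse entities.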

The best thing about gradient boosting is that you can throw in lat/lng coordinates without any preprocessing. So that was how we encoded road position (note that in OpenStreetMap, road segments are quite short, so there’s not much loss of precision from just using a fixed point per line). Finally, road attributes are simply tabular features which are easy to handle with any architecture.

We used LightGBM as our library of choice, because we’re familiar with it and experiments with CatBoost or XGBoost seemed to take 5-10x longer with standard hyperparameters. We employ two tricks for optimising LGB performance: by using the *init_score* argument with target encoded features, we significantly reduce training time, as the model can start from a much more accurate base weak learner. We also increase the complexity hyperparameter *num_leaves* to unintuitively high levels like 5k or 10k; in practice, on our large dataset, this seems to converge faster without hurting holdout accuracy.
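A minimal sketch of the *init_score* trick (hypothetical numbers, assuming a binary objective): LightGBM takes the init score on the raw margin scale, so target-encoded probabilities need a logit transform before being passed in, e.g. via `lightgbm.Dataset(X, label=y, init_score=init)`.

```python
import numpy as np

# Target-encoded per-road probabilities (hypothetical values), clipped so the
# logit transform stays finite
p = np.clip(np.array([0.9, 0.55, 0.12, 0.5]), 1e-6, 1 - 1e-6)

# For a binary objective, the init score lives in log-odds space:
#   lgb.Dataset(X, label=y, init_score=init)
init = np.log(p / (1 - p))
print(init.round(3))
```

Boosting then only has to fit the residual structure on top of an already-informative baseline, which is what cuts the training time.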

Computationally, our solution for extended track could be run on a 32GB Macbook, and for the core track training models end to end on a larger server took just a few hours. The trained models take <100MB on disk. The code can be found on GitHub and the paper on arXiv.

Let’s have a look at the top final results of the core track (which has about 10x more data than the extended track, so is arguably harder for GBM approaches).

Team | Approach | Library | Secret Sauce | Score
---|---|---|---|---
ustc-gobbler | GNN | torch | GATv2Conv + external time lookup | 0.8431
Bolt | GBM | LGB | PCA + target encoding | 0.8497
oahciy | GBM | XGB/LGB | XGBoost/LightGBM ensemble, target encoding | 0.8504
GongLab | GNN | torch | target encoding | 0.8560
TSE | GBM | LGB | k-NN target encoding | 0.8736
discovery | GNN | torch | hierarchical graph representation | 0.8759
ywei | GNN | torch | MLP-based GNN | 0.8779

First, note how small the differences between the top 4 teams are. From a practical perspective, the models are all basically the same, so in a real world scenario the simplicity of the solution would also matter. While I think our PCA-based compression approach is quite nice, I have to acknowledge that the winning team’s GNN approach is also very elegant, as they do not do much feature engineering, instead letting the GATv2Conv layer learn relevant correlations. However, they also reverse engineer the exact time features, so it is unclear whether the GNN would still be more accurate without them.

Looking at the other teams, it seems that GBM generally does better than GNN; in the extended track, too, 3 of the top 4 teams used GBM. I believe that over the long run we might see GNN models achieving state of the art results on (fixed) graph-based traffic benchmarks, including being able to generalise to unseen graphs, i.e. perform *inductive* graph learning, pooling training data from multiple different city graphs, etc. But this competition showcases that under a limited time budget, given a new ML prediction task, GBMs can be competitive even on non-tabular datasets. It might be that this limited scenario reflects real life projects better.

I often see Kaggle as a bonus requirement for DS/ML roles on LinkedIn. While some exposure can be beneficial, e.g. understanding what are the best ML models for different problems, there is almost no real life benefit from getting to Grandmaster level.

Experience from t4c22 corroborates this. For the first two weeks we were exploring the data, writing utility functions for “wrangling” it and training some dummy models to see whether there is any signal at all. This is very similar to how new ML projects happen in the industry. Once confident that your model is doing at least something useful, you try to come up with better feature engineering and tune the parameters - this is equivalent to a “v2” model in a real life project.

But everything that happens after that has no resemblance to real problems. For the last month of the competition, we were mostly doing two things:

1. Trying to engineer target-encoded features based on historical signals, in a way that minimizes leakage.
2. Overfitting to the test set by brute forcing solutions with increasing model complexity.

While 2. was mostly an artifact of the competition design - there was no second hidden test set, so all teams spent too much time gaming the leaderboard - the first is something that you’d never do in industry. That’s because in real life, you’re maximizing a different metric than just accuracy on test set. Your solution should be robust, readable, maintainable and have good latency. So you almost never see ensembling multiple models, stacking or leaky feature engineering, as they would simply generate too much risk.

As wise Pareto foresaw, 95% of the time spent in ML competitions goes into optimizing the last 5% of accuracy, taking on increasing technical debt and complexity. In real life, due to opportunity costs you’d surely switch to a new, more worthwhile task before that.

*Model stacking: Ensembling technique where ML model predictions made on K holdout folds are used as features for a final model*

If you’re curious, here is what we did for 1. In addition to the historical travel time on a road segment (classic target encoding), it’s good to also take into account the travel time *conditional* on the city’s overall traffic. You can think of it as categorising overall vehicle counts into two $state$ buckets, $[“low”, “high”]$, and then calculating a separate target-encoded feature for each. Since we want to predict the congestion class, we get features such as $proba_{e, green, low} = \mathbb{E} [ \frac{count_{e, green, low}}{count_{e, total, low}}]$ for each edge $e$, where the expectation is taken over the training set.
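As a toy sketch of how these conditional frequencies can be computed with pandas (column names and the two-state bucketing are illustrative):

```python
import pandas as pd

# Historical observations: each row is one (edge, traffic state, congestion class) event.
df = pd.DataFrame({
    "edge": ["e1", "e1", "e1", "e2", "e2"],
    "state": ["low", "low", "high", "low", "high"],
    "congestion": ["green", "red", "red", "green", "green"],
})

# count_{e, class, state} / count_{e, total, state}, estimated over the training set
proba = (
    df.groupby(["edge", "state"])["congestion"]
      .value_counts(normalize=True)     # within-group class frequencies
      .rename("proba")
      .reset_index()
)
```

The resulting frame has one row per (edge, state, congestion class) with its empirical probability, ready to be joined back as features.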

But there is a risk of leakage here. Since the label data was also sparse (in the next 15 minutes, you will only observe true values for edges that saw cars on them), it might be that we have edges with very few samples in the historical data, and any target encoding starts leaking signal. To combat that, for sparse edges we replace $proba_{e, class, state}$ with the median probabilities taken over *all* sparse edges, so that a high capacity model such as LGB cannot infer the signal directly.

For the extended track, where predictions were made on supersegments, we had no sparsity problem, so of course, there’s no reason to stop at just 2 traffic state categories. We experimented with 10, 30, even 50, and the accuracy kept increasing! But this represents exactly the kind of try-hard feature engineering that you wouldn’t see in most real life projects!

[1] Traffic4cast at NeurIPS 2022, Neun et al., 2023

]]>Tabular data is one of the last few machine learning strongholds where deep learning does not reign supreme. This is not for lack of trying, as there have been multiple proposed architectures, perhaps the best known being TabNet [1]. What the field lacks, though, are generic go-to implementations that achieve competitive performance on a range of benchmarks, in the way gradient boosted tree ensembles (XGBoost, LightGBM, CatBoost - GBT for short) do. Indeed, Shwartz-Ziv & Armon [5] show that the proposed tabular deep learning methods often outperform others only on the datasets proposed in their own papers, generally losing to GBT methods.

Why should we care?

While the GBT vs DL debate can become emotionally loaded at times, there are some objective benefits to the latter approach:

- Seamless online learning. As we generally train deep learning models using minibatches, it is straightforward to extend this to learning continuously on new data, without requiring a full retraining. GBTs cannot fully replicate this, as they build an increasingly complex additive function. To update it, you’d have to either prune the function somehow (e.g. keep only the first N trees and add new ones) or keep the tree structure and only update leaf values. Neither can guarantee optimality, whereas a neural network is simply a stack of differentiable weight layers, so you can fine-tune the whole model.
- Transferability. Closely related to the last point, it means we can reuse a learned model structure and use it for other tasks, or predict multiple things jointly.
- Multimodal inputs. You had a tabular problem? But what if you could use some image data or free form text to improve accuracy? DL allows training these kinds of models jointly, end to end. That’s much more elegant than combining multiple models (with different APIs) for the same task.

I’ll use the NYC taxi trip duration dataset from Kaggle and try to predict `trip_duration` purely as a function of the pickup and dropoff coordinates. This exemplifies a relatively difficult type of tabular problem, as coordinates are numeric features, but we (a) expect to have a non-monotonic relationship between them and the target, with (b) potentially very fine-grained decision boundaries (imagine a city block which has exceptionally restrictive routing), and (c) the target is defined by the interaction of both inputs - trip duration is mostly a function of the distance between pickup and dropoff, so we’re expecting our models to learn this. Additionally, it’s a relevant real life problem for dispatching and logistics planning.
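Since trip duration is mostly a function of distance, the core signal the models must recover is essentially the great-circle distance between pickup and dropoff. A quick haversine sketch (not a feature we feed to the models - the whole point is to let them learn it from raw coordinates):

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    R = 6371.0  # mean Earth radius, km
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * R * np.arcsin(np.sqrt(a))

# roughly Midtown Manhattan to JFK airport
d = haversine_km(40.75, -73.99, 40.64, -73.78)
```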

The dataset:

```
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1458644 entries, 0 to 1458643
Data columns (total 11 columns):
 #   Column              Non-Null Count    Dtype
---  ------              --------------    -----
 0   id                  1458644 non-null  object
 1   vendor_id           1458644 non-null  int64
 2   pickup_datetime     1458644 non-null  object
 3   dropoff_datetime    1458644 non-null  object
 4   passenger_count     1458644 non-null  int64
 5   pickup_longitude    1458644 non-null  float64
 6   pickup_latitude     1458644 non-null  float64
 7   dropoff_longitude   1458644 non-null  float64
 8   dropoff_latitude    1458644 non-null  float64
 9   store_and_fwd_flag  1458644 non-null  object
 10  trip_duration       1458644 non-null  int64
dtypes: float64(4), int64(3), object(4)
memory usage: 122.4+ MB
```

The regression models are trained for MAE, using early stopping. For the neural networks, learning rate reduction on reaching a plateau is applied. The holdout test set is split by timestamp, to avoid unexpected leakage.
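A minimal sketch of such a timestamp-based split (the column name and holdout fraction are illustrative):

```python
import pandas as pd

def time_based_split(df: pd.DataFrame, ts_col: str, test_frac: float = 0.2):
    """Split a frame by timestamp: the most recent test_frac of rows become the holdout.

    Splitting on time rather than at random avoids leaking future information
    about the same locations into training.
    """
    cutoff = df[ts_col].quantile(1 - test_frac)
    train = df[df[ts_col] < cutoff]
    test = df[df[ts_col] >= cutoff]
    return train, test

df = pd.DataFrame({
    "pickup_datetime": pd.date_range("2016-01-01", periods=10, freq="D"),
    "y": range(10),
})
train, test = time_based_split(df, "pickup_datetime", test_frac=0.2)
```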

The interaction-based coordinate problem does not really pose a problem for a GBT model. It can perform splits on the lat/lng space as it would on any other numeric feature, and by alternating between splits on pickup and dropoff, it encodes the interaction seamlessly. The only drawbacks of GBTs for this problem are that the binary splits can only be horizontal or vertical, and that you’d need 4 depth-wise splits to model a square (which in our problem could be a relevant neighbourhood). It should be noted that there has been work extending GBT to spatial splits, e.g. [3], but that’s not available to the lay practitioner.

The obvious argument speaking for models like XGBoost or LightGBM is how easy it is to get started with them and achieve competitive accuracy. Just look at the code below, you do not need to be a data scientist to run this!

```
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_valid, y_valid, reference=lgb_train)
model = lgb.train(
    params,
    lgb_train,
    num_boost_round=num_estimators,
    valid_sets=[lgb_train, lgb_eval],
    early_stopping_rounds=50,
)
```

where X_train is the DataFrame with the 4 coordinate columns. No feature engineering, no standardization or scaling, nor discretization is needed to get started! This is the closest you can come to not knowing anything about machine learning, but being able to train a model. We often (half-)joke with colleagues that the relative ease of using powerful libraries such as LightGBM has made the theory behind data science redundant.

I’ll use Keras for the following, to iterate fast over the relatively simple network architectures. First, let’s initiate a base class that simplifies training.

```
class KerasModel:
    """Base class for training all Keras models"""

    def __init__(self, hyperparams: dict, binary: bool = False):
        self.model = None
        self.hyperparams = hyperparams
        self.optimizer = RMSprop(self.hyperparams["starting_lr"])
        self.feedforward_layer = self.create_feedforward_layers(hidden_units=[300, 100], dropout_rate=0.2)
        if binary:
            self.output_layer = Dense(1, activation="sigmoid")
            self.loss = "binary_crossentropy"
            self.metrics = ["binary_crossentropy"]
        else:
            self.output_layer = Dense(1)
            self.loss = "mae"
            self.metrics = ["mae"]
        self.callback_early_stopping = keras.callbacks.EarlyStopping(
            monitor=f"val_{self.metrics[0]}", patience=5, restore_best_weights=True)
        self.callback_decrease_lr = keras.callbacks.ReduceLROnPlateau(
            monitor=f"val_{self.metrics[0]}",
            factor=0.3,
            patience=2,
            min_lr=1e-6)

    @staticmethod
    def create_feedforward_layers(hidden_units, dropout_rate, name=None):
        fnn_layers = []
        for units in hidden_units:
            fnn_layers.append(Dropout(dropout_rate))
            fnn_layers.append(Dense(units, activation=tf.nn.gelu))
            fnn_layers.append(BatchNormalization())
        return keras.Sequential(fnn_layers, name=name)

    def train(self, x_train, y_train, x_valid, y_valid, *args, **kwargs):
        raise NotImplementedError()

    def predict(self, x_test):
        raise NotImplementedError()
```

I’ll reuse the same feedforward layer throughout different models.

First, to prove what is probably obvious to people with DL experience: you can’t simply pass complex numeric features into a fully connected layer. We’ll try the following model, which passes the inputs directly to `self.feedforward_layer`. While MLPs are in theory universal function approximators, it’s inefficient to force them to learn the relevant “splits” from scalar inputs.

```
class MLPModel(KerasModel):
    def train(self, x_train, y_train, x_valid, y_valid, *args, **kwargs):
        dense_inputs = keras.Input(shape=(x_train.shape[1],), name="raw_inputs")
        x = self.feedforward_layer(dense_inputs)
        outputs = self.output_layer(x)
        nn_model = keras.Model(inputs=dense_inputs, outputs=outputs, name="mlp_raw")
        print(nn_model.summary())
        nn_model.compile(optimizer=self.optimizer, loss=self.loss, metrics=self.metrics)
        nn_model.fit(
            x=x_train,
            y=y_train,
            validation_data=(x_valid, y_valid),
            epochs=self.hyperparams["epochs"],
            batch_size=self.hyperparams["batch_size"],
            callbacks=[self.callback_early_stopping, self.callback_decrease_lr],
        )
        self.model = nn_model

    def predict(self, x_test):
        return np.squeeze(self.model.predict(x_test))
```

We can significantly help the model learn by discretizing the input data. A clever trick from [4] is to notice that a trained tree model can be expressed as a simple neural network: the inputs are one hot encoded according to which interval in the sorted list of splits they fall into, the first layer of weights represents the connectivity of splits and terminal leaves, and the final output is simply the binary activation of one leaf node. Inspired by this, they propose to discretize the numeric features into bins based on quantiles, one hot encode these, and only then pass them to fully connected layers. I try this with the coordinate data - calculating the quantile thresholds on the training set and applying one hot encoding based on them to the whole dataset.
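A minimal sketch of this quantile binning plus one hot encoding (function and variable names are mine, not from [4]):

```python
import numpy as np

def quantile_one_hot(train_col, col, n_bins: int = 10) -> np.ndarray:
    """One hot encode a numeric column by which training-set quantile bin each value falls into.

    The thresholds come from the training data only; the same edges are then
    applied to validation and test data.
    """
    # n_bins - 1 interior thresholds at equally spaced quantiles
    edges = np.quantile(train_col, np.linspace(0, 1, n_bins + 1)[1:-1])
    bin_idx = np.searchsorted(edges, col)   # bin index in 0..n_bins-1
    return np.eye(n_bins)[bin_idx]          # one hot rows

rng = np.random.default_rng(2)
train_col = rng.normal(size=1000)
oh = quantile_one_hot(train_col, train_col[:5], n_bins=10)
```

In the actual models the one hot index feeds an `Embedding` layer rather than being used directly, but the binning logic is the same.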

Note that here all 4 coordinate inputs are embedded separately - implicitly we are assuming that the feedforward layer in the end is powerful enough to learn interaction effects from the concatenated embeddings vector.

```
class EmbeddedBinModel(KerasModel):
    def __init__(self, numeric_features, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.numeric_features = numeric_features

    def create_inputs_and_embeddings(self, discrete_bin_vocab_size: int):
        inputs = []
        input_features = []
        for cf in self.numeric_features:
            cf_input = keras.Input(shape=(1,), name=f"{cf}_discrete")
            cf_feature = Embedding(discrete_bin_vocab_size + 1, 100, name=f"{cf}_embedding")(cf_input)
            inputs.append(cf_input)
            input_features.append(cf_feature)
        return inputs, input_features

    def train(self, x_train, y_train, x_valid, y_valid, *args, **kwargs):
        inputs, input_features = self.create_inputs_and_embeddings(kwargs["discrete_bin_vocab_size"])
        all_embeddings = tf.concat(input_features, axis=1)
        all_embeddings = Flatten()(all_embeddings)
        x = self.feedforward_layer(all_embeddings)
        outputs = self.output_layer(x)
        nn_model = keras.Model(inputs=inputs, outputs=outputs, name="quantised_bin_embeddings")
        print(nn_model.summary())
        nn_model.compile(optimizer=self.optimizer, loss=self.loss, metrics=self.metrics)
        model_inputs = {f"{cf}_discrete": x_train[cf] for cf in self.numeric_features}
        valid_inputs = {f"{cf}_discrete": x_valid[cf] for cf in self.numeric_features}
        nn_model.fit(
            x=model_inputs,
            y=y_train,
            validation_data=(valid_inputs, y_valid),
            epochs=self.hyperparams["epochs"],
            batch_size=self.hyperparams["batch_size"],
            callbacks=[self.callback_early_stopping, self.callback_decrease_lr],
        )
        self.model = nn_model

    def predict(self, x_test):
        test_inputs = {f"{cf}_discrete": x_test[cf] for cf in self.numeric_features}
        pred = np.squeeze(self.model.predict(test_inputs))
        return pred
```

Since a location is defined by 2 coordinates, perhaps embedding them separately is not optimal. A more natural discretization of space could be provided by a geospatial indexing system, such as H3. Similarly to [6] I calculate separate embeddings for H3 cell levels 4-10 (hexagon edge length from ~22.6km to ~66m). This is a useful trick, as there may be both high level and hyperlocal dynamics in the data. As the number of grid cells increases exponentially in levels, this does significantly increase the number of trainable parameters. [6] shows how hashing tricks can be used to combat this, but it’s not really needed for the current data, as NYC is relatively small.

```
class EmbeddedH3Model(KerasModel):
    def __init__(self, h3_resolutions: list, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.h3_resolutions = h3_resolutions

    def create_inputs_and_embeddings(self, embedding_vocab_size):
        h3_token_inputs = []
        h3_token_features = []
        for point in ["src", "dst"]:
            for h3_res in self.h3_resolutions:
                token_inputs = keras.Input(shape=(1,), name=f"spatial_tokens_h3_{point}_{h3_res}")
                token_features = Embedding(embedding_vocab_size[point][h3_res] + 1, 100,
                                           name=f"h3_embedding_{point}_{h3_res}")(token_inputs)
                h3_token_inputs.append(token_inputs)
                h3_token_features.append(token_features)
        return h3_token_inputs, h3_token_features

    def create_data_for_model(self, x: pd.DataFrame):
        train_inputs = {}
        for point in ["src", "dst"]:
            train_inputs_point = {f"spatial_tokens_h3_{point}_{k}": x[f"h3_hash_index_{point}_{k}"]
                                  for k in self.h3_resolutions}
            train_inputs = {**train_inputs, **train_inputs_point}
        return train_inputs

    def train(self, x_train, y_train, x_valid, y_valid, *args, **kwargs):
        all_token_inputs, all_token_features = self.create_inputs_and_embeddings(kwargs["embedding_vocab_size"])
        all_embeddings = tf.concat(all_token_features, axis=1)
        all_embeddings = Flatten()(all_embeddings)
        x = self.feedforward_layer(all_embeddings)
        outputs = self.output_layer(x)
        nn_model = keras.Model(inputs=all_token_inputs, outputs=outputs, name="h3_embedding_model")
        print(nn_model.summary())
        nn_model.compile(optimizer=self.optimizer, loss=self.loss, metrics=self.metrics)
        training_feature_inputs = []
        for dataset in [x_train, x_valid]:
            training_feature_inputs.append(self.create_data_for_model(dataset))
        nn_model.fit(
            x=training_feature_inputs[0],
            y=y_train,
            validation_data=(training_feature_inputs[1], y_valid),
            epochs=self.hyperparams["epochs"],
            batch_size=self.hyperparams["batch_size"],
            callbacks=[self.callback_early_stopping, self.callback_decrease_lr],
        )
        self.model = nn_model

    def predict(self, x_test):
        test_inputs = self.create_data_for_model(x_test)
        pred = np.squeeze(self.model.predict(test_inputs))
        return pred
```
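The `h3_hash_index_*` columns above are precomputed H3 cell ids. To illustrate the multi-resolution idea without pulling in the h3 dependency, here is a hypothetical stand-in that indexes a point into square grid cells of several sizes; the real model instead uses hexagonal H3 ids at levels 4-10:

```python
import numpy as np

def multi_resolution_cells(lat, lon, cell_sizes_deg=(0.2, 0.05, 0.01)):
    """Index a (lat, lon) point into grid cells at several resolutions.

    A toy stand-in for H3: square cells of decreasing size illustrate how
    coarse city-level context and hyperlocal context are encoded jointly,
    each resolution getting its own embedding table in the model.
    """
    return [
        (int(lat // s), int(lon // s))  # integer cell coordinates at this resolution
        for s in cell_sizes_deg
    ]

cells = multi_resolution_cells(40.7580, -73.9855)  # one cell id per resolution
```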

There is a shortcoming of the embedded quantile bins method: it loses all information on the order of values. While this might not matter too much for our use case - coordinate range is not related to the target monotonically - it can help with more traditional numeric features, where you’d like to model both the ordinal properties, but also enable bin-level differences.

An elegant solution I discovered from [2] is to instead apply a *piecewise linear* encoding, which works as follows: if you have 5 equal-width bins over a range from 0 to 10, i.e. [0, 2), [2, 4), [4, 6), [6, 8), [8, 10], then for an input value of 7, you encode it as [1, 1, 1, 0.5, 0] instead of [0, 0, 0, 1, 0] (the one hot encoding). This is order-preserving.
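A sketch of this encoding in numpy (the function name is mine; see [2] for the full scheme):

```python
import numpy as np

def piecewise_linear_encode(v: float, edges) -> np.ndarray:
    """Piecewise linear encoding: each bin contributes the fraction of the bin
    that v has 'filled', clipped to [0, 1]. Unlike plain one hot encoding,
    this preserves the order of values."""
    left = np.asarray(edges[:-1], dtype=float)
    right = np.asarray(edges[1:], dtype=float)
    return np.clip((v - left) / (right - left), 0.0, 1.0)

# five equal-width bins over [0, 10]; the value 7 fills the first three bins
# entirely and half of the fourth, giving [1, 1, 1, 0.5, 0]
enc = piecewise_linear_encode(7.0, [0, 2, 4, 6, 8, 10])
```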

This advanced embedding method was not implemented here, as simple one hot encoding did the trick, seen below. However, it can be useful for more difficult numeric problems. Indeed, [2] provides a whole array of further more advanced embedding methods, of which most are shown to improve deep learning model performance.

The source code for all models can be found here.

Model | MAE | MedianAE | MSE | Median | R2
---|---|---|---|---|---
Single decision tree (depth=5) | 411.71 | 260.0 | 373643.57 | 408.0 | 0.23
GBT (depth=5, trees=10) | 382.47 | 302.73 | 291745.26 | 708.35 | 0.4
GBT (depth=5, early stopping) | 249.02 | 161.22 | 156070.27 | 681.95 | 0.68
Embedding quantiles | 247.49 | 144.8 | 167238.83 | 633.25 | 0.65
Embedding H3 cells | 247.35 | 143.22 | 168387.11 | 632.08 | 0.65

We see that both embedding-based models reach parity with fully trained GBT, which is what we were aiming for.

It’s interesting to visualize how the models “see” the coordinate space. For this, we can plot a 2-dimensional decision boundary on an arbitrarily framed binary classification problem - e.g. whether trip duration is going to be more than 5 minutes. For reference, both models achieve about 83% accuracy on this task. We’d expect different shapes of the decision boundaries, as GBT can only perform vertical and horizontal splits, while the embedded H3 model uses a combination of multi level hexagons.

The code below can be used to create the decision boundary, which can be visualized over a basemap, using the excellent mplleaflet library.

```
def create_decision_boundary(model, X, xlim=(-74.005, -73.96), ylim=(40.73, 40.78), pickup_lon=None, pickup_lat=None):
    # Create a grid for plotting the decision boundary. We fix the pickup coordinates to be able to visualise in 2D
    if not pickup_lon:
        pickup_lon = X["pickup_longitude"].median()
    if not pickup_lat:
        pickup_lat = X["pickup_latitude"].median()
    X = X.to_numpy()
    x_min, x_max = np.percentile(X[:, 0], 0.1), np.percentile(X[:, 0], 99.99)
    y_min, y_max = np.percentile(X[:, 1], 0.1), np.percentile(X[:, 1], 99.99)
    grid_size = 0.0001
    xx, yy = np.meshgrid(np.arange(x_min, x_max, grid_size), np.arange(y_min, y_max, grid_size))
    xx_ravel = xx.ravel()
    yy_ravel = yy.ravel()
    pred_array = np.c_[np.repeat(pickup_lon, len(xx_ravel)), np.repeat(pickup_lat, len(yy_ravel)), xx_ravel, yy_ravel]
    preds = model.predict(pred_array)
    preds = preds.reshape(xx.shape).round()
    return xx, yy, preds
```

Fixing the pickup coordinates at the medians (which is somewhere around 5th Avenue and 42nd Street), we get the following decision boundary over dropoff coordinates for LightGBM:

Here the blue region corresponds to where the model thinks we can get within 5 minutes of driving, starting from the red point - an isochrone map. It’s significant that the model has learned such a realistic interpretation of the physical world: as travel time is mostly a function of distance driven, it is intuitive that we’d see an area centered around the pickup point. Purely from the coordinate signal, the model has learned that this area is entirely homogenous, except near the border. This is quite remarkable, as generally for isochrone mapping you’d require a strong routing engine such as OSRM or Valhalla and an underlying road network graph.

We can sample arbitrary pickup points to make sure the good properties of the model are not simply an artifact of modelling the neighbourhood of the median well:

Here we see the model learning that driving towards Upper East Side on Park Avenue is faster than west-east, resulting in an oval-like isochrone.

And finally, we find a failure case: here the model has correctly inferred that you can go over the bridge into Queens (and that area is blue), but we’d probably want it to understand that in this case the whole route to Queens should be blue. Most likely it’s caused by the fact that we don’t have enough trips with dropoff on or near the Queensboro bridge :) We can also see that the horizontal-vertical split rule can look a bit clumsy.

The corresponding decision boundary around the median pickup for embedded H3 model looks like this:

Indeed, the neural network model sees the map in hexagons, and we can recognize cell outlines straight away. Additionally, the boundary looks less like the uniform blob we saw for the GBT, and is capable of more complexity. This behaviour is intuitive, as GBT requires more splits to model exclusion (4 splits for a square), while the H3 embedding based model can just turn any embedding’s key-value vector “on” to change the target prediction.

An interesting property of NYC the model seems to have learned - if I’m not overanalyzing this - is that getting to the area immediately behind (from the direction of the red point) Penn Station takes a longer time. As the busiest transportation facility in the Western Hemisphere, you’d certainly expect tough traffic there! With its horizontal/vertical splits and requiring multiple splits to model a neighbourhood, the GBT model has less incentive to capture this anomalous *pocket* and we see that on the GBT map there is no exception near Penn Station.

[1] TabNet: Attentive Interpretable Tabular Learning, Arik & Pfister, 2020

[2] On Embeddings for Numerical Features in Tabular Deep Learning, Gorishniy, Rubachev & Babenko, 2022

[3] How and why we built a custom gradient boosted-tree package, Lyft, 2021

[4] Gradient boosted decision tree neural network, Saberian, Delgado & Raimond, 2019

[5] Tabular Data: Deep Learning is Not All You Need, Shwartz-Ziv & Armon, 2021

[6] DeepETA, Uber, 2022

]]>Consider for a moment the principle of state sovereignty - a principle dating back to the Westphalian peace of 1648 and today encoded in the United Nations Charter - that holds that each state has the supreme power of authority in its own borders. This would also apply to the decision of joining NATO, which has been the free choice of all the countries involved in the eastwards expansion.

Now, one might say that a small country like Estonia does have *some* sovereignty, but they also need to take into account the geopolitical realities of being located next to a major power. Let’s look at the populations of the countries who’ve joined NATO since 1997.

*Expansion of NATO, BBC*

- Estonia - 1.3M
- Latvia - 1.9M
- Lithuania - 2.8M
- Poland - 38M
- Czech - 10M
- Slovakia - 5.5M
- Hungary - 9.75M
- Romania - 20M
- Slovenia - 2M
- Croatia - 4M
- Montenegro - 0.6M
- Albania - 2.8M
- North Macedonia - 2M
- Bulgaria - 7M

Total: 107M people. That’s roughly the same ballpark as the Russian population, with Ukraine included it’d be more than that.

Notice how all these countries made the same decision - rather than wanting to join e.g. the Russian-backed CSTO. Practically, this means that Russia has already lost the economic and cultural competition with the West in Eastern Europe, and can now only resort to brute force in dominating its neighbours (e.g. Moldova, Georgia, Ukraine already before 2022). That these countries belong to NATO takes away even that ability, which is what all the fuss is about. It’s obvious that the Baltics don’t threaten Russia in any way even in NATO, the real issue is that Russia’s power projection on these vestigial pieces of the old Empire is now significantly less effective.

So agreeing with the first argument above is essentially the same as saying that Russia’s interests outside their borders, their right to maintain a *sphere of influence* overrides the interests of a multitude of sovereign nations (similarly sized as Russia on aggregate) on choosing their future. It’s saying that the *national humiliation* from losing an empire is something we should empathize with. It’s saying that a bully has the right to strongarm other kids on the playground just because he’s bigger. Obviously there’s some geopolitical reality to this, but that doesn’t mean we should morally normalize it.

Call me an idealist, but I don’t think such a cynical argument should have credence in 21st century Europe.

]]>*Disclaimer: the following is some combination of amateur economic history, Fermi estimates and general ramblings. It should in no case be considered to be anything similar to real research.*

As I’m currently visiting Lisbon, I couldn’t help but notice the rich maritime history the city celebrates. Any way you look, there is evidence of the illustrious past of the Portuguese Empire, whether it be caravel themed Azulejo tiles, monuments to Prince Henry the Navigator or the splendor of the Praça do Comércio. And it is no wonder, as this was probably the zenith of Portugal’s “glory” by the standards of that era. Even across the whole world’s history, there are few examples of small countries achieving such levels of power projection as the Portuguese did. By the combination of some novel technical innovations and rather brute gunboat diplomacy, they managed to establish strongholds across the African coast and all around the Indian Ocean.

*Portuguese discoveries 1415–1543; in green = in the Reign of D. João III, Wikimedia Commons*

It’s well known that the Age of Exploration was mostly driven by the need to find alternative trade routes (of spices, mainly pepper, cinnamon, ginger, cloves) to Asia, as the existing land-based route was quite costly. Effectively, early explorers such as Columbus and da Gama were akin to budding entrepreneurs, who were looking to leverage venture capital (mostly the money of royal courts, but later also private markets with the advent of the Dutch East India Company and the likes) to finance their risky undertakings.

This made me wonder about how valuable the spice trade actually was, especially when put into the context of today’s consumer goods and corporations.

*A nau (the bulkier cousin of a caravel) themed Azulejo tile I bought*

To start with, the best readily available online resource on in-depth analysis of historic spice prices was provided by professor John Munro (note how even he considered himself to be an amateur on the subject, despite being an economic historian… seems like a fun academic field).

Based on that, in 1438 (50 years before the Portuguese rounded the Cape of Good Hope), it took a master carpenter in Antwerp about half a day’s wages to buy 100 grams of black pepper, and a day and a half’s wages to buy the same amount of cloves. For reference, in 2019 the average net salary in Belgium was 2442 euros. A master carpenter is probably paid better than average, so let’s assume 3000 net per month, or roughly 17.6 euros per hour assuming a 170 hour working month.

A mid range new iPhone costs about 850 euros, so (assuming 8 hour days) the average master carpenter in Belgium would need to work 6 days for an iPhone. In equivalent purchasing power parity terms, that’s the same as about 400 grams of cloves in 1438! That puts into perspective why an individual Portuguese trader would be eager to get into the Indian Ocean trade routes. Ivan Obolensky makes the excellent point that due to the transaction costs of Age of Sail spice trade, the price markup from Goa to Venice was 27 times, while the modern cocaine markup is only 13 times. No wonder that one of the suggested etymological roots for the word *drug* is the Middle Dutch *droge*, for dry goods such as spices.

It’s also interesting to note that the sailors themselves were partly compensated with *future* pepper cargo. This could be considered similar to today’s equity options provided to tech workers. If you were willing to take part in the risky venture (according to Oliveira Martins, the ratio of safe return to being lost at sea was about 6:1) and if you did not start a mutiny or die on the way (as about one third to a half of the crew did even if the trip was successful), a sailor was awarded more than half a ton of pepper, or 436 iPhone equivalents (worth 370K iPhone equivalent EUR), in addition to the regular salary and chance at plunder. Those are odds that people might take up even today!

I found some estimates on Wikipedia that claimed that in 1506, taxes on overseas activities accounted for 65% of the state income. Considering that the first overseas territory Ceuta was only conquered in 1415 and the Guinean Gold Coast reached in about 1470 (where the real profitability showed), that means roughly tripling the state income in 37 years between 1470 and 1506, or an annualised growth rate of 3%. That might not seem like a lot by today’s standards, but one needs to take into account that historical real economic growth rates are low, for example this study found that the pre 1660 real per capita GDP growth is zero, and only after the industrial revolution did we see a 1.25% pc growth (or 2-3% if we don’t account for population growth). That means that the Portuguese managed to kickstart something akin to the post Industrial Revolution level of growth, at least as far as their royal treasury was concerned.
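The back-of-envelope arithmetic behind that figure, assuming domestic income stayed flat while overseas taxes grew to 65% of the total:

```python
# If overseas taxes make up 65% of the new state income and domestic income
# is unchanged, total income grew by a factor of 1 / (1 - 0.65) ~ 2.9x,
# i.e. roughly tripled.
domestic = 1.0                       # pre-expansion state income, normalised
total = domestic / (1 - 0.65)        # income once overseas taxes are 65% of the total
years = 37                           # ~1470 to 1506
annual_growth = total ** (1 / years) - 1   # ~0.03, i.e. about 3% per year
```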

Furthermore, we can estimate the annual revenue from the spice trade alone - the original question I hoped to answer with a quick Google search. There was a nice periodicity to Portuguese travel, as the ships made use of the monsoon winds to get back to Europe. Additionally, due to piracy, the ships had to travel in convoys consisting of more agile caravels and bulkier *nau*s (an Iberian enhanced version of a carrack). As these were royal court sponsored convoys embarking from Lisbon, we have quite good records on them available. Moreover, we can take the 1606 shipwreck of a typical nau off the coast of Lisbon as an example of a ship in the armadas sent to India. This ship was carrying 250 tons of black pepper, and additional tons of various spices, so let’s round it to about 300 tons of pepper equivalent and assume that every nau was carrying such cargo on average.

PS: I later found this Wikipedia article putting the average cargo of a nau at 382 tonnes, which is not too far off my 300 tonne assumption.

There were 806 departures of naus to India between 1497 and 1612. That’s about 7 ships per year, but it’s an average over a time period that includes the very first voyages to discover the trade routes. During the heyday of Portuguese domination (before the 1590s competition from the English and Dutch), the number was likely somewhat higher, say 10 ships per year. For example, the exemplar 1606 nau was part of a fleet of six naus and four galleons. The galleons served as more manoeuvrable fighting ships protecting the convoy. Let’s assume that this is an average fleet composition and that none of the galleons were carrying cargo. Using the universal currency of a labour day, one cargo ship was then carrying 300 tons of pepper equivalent, or 1.34 million days’ wages worth of pepper, or 223 thousand iPhones, worth 189 million euros today. For reference, a modern cargo ship carrying ~10 thousand containers at an average value of $40,000 per container is carrying 400 million euros worth of goods, weighing about 240,000 tonnes. Given their value in labour terms and their value density, no wonder the Portuguese ships required significant protection! It might be that a Portuguese nau laden with pepper in 1510, waiting for the winter monsoon to start the journey home, was the most valuable asset in the world economy at that point.
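The conversion chain above can be sketched as a back-of-the-envelope calculation. All figures are taken from the text (the iPhone price and wage ratio from the opening paragraph); the rounding is mine.

```python
# Back-of-the-envelope value of one nau's cargo, using figures from the text.
IPHONE_EUR = 850         # assumed price of a mid-range iPhone
DAYS_PER_IPHONE = 6      # master carpenter work days per iPhone

pepper_days_wages = 1_340_000          # 300 t of pepper equivalent, in days' wages
iphones = pepper_days_wages / DAYS_PER_IPHONE
value_eur = iphones * IPHONE_EUR       # value of one ship's cargo

print(f"{iphones:,.0f} iPhone equivalents per ship")
print(f"{value_eur / 1e6:,.0f}M EUR per ship")

# Six cargo-carrying naus per year in the perfect case:
annual_eur = 6 * value_eur
print(f"{annual_eur / 1e9:.2f}B EUR annual spice trade revenue")
```

Running this reproduces the ~223 thousand iPhones and ~190 million euros per ship quoted above, and six successful naus per year indeed lands just above 1.1 billion euros.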

If we assume the perfect case where all 6 naus make it home that year, and using my suspect labour-to-iPhone-to-EUR conversion from above, this puts the annual current value of Portuguese India spice trade revenue just above 1.1B euros, which is only about 0.5% of the current GDP of Portugal. But that’s mostly just a reflection of how rich the world is today. Taking the estimated 1.2B GDP (1990 USD adjusted for inflation and converted to EUR) of Portugal in current euros at face value, we’d have an import-to-GDP ratio of almost 1, which would easily put exploration age Portugal among the top countries by trading activity in the world, and this is only considering the spice trade.

Finally, to estimate how much this foreign trade helped GDP per capita growth in Portugal, rather than just the king’s personal budget, I refer to real economists, who link roughly a fifth of GDP to the empire. But the very fact that you’d need a complicated model to estimate this shows that the spice trade was in no way a silver bullet raising Portugal’s wealth far above its neighbours’, as plotting the GDP per capita growth against a few European countries clearly shows:

*Based on Maddison*

So far I’ve tried to estimate the size of the spice trade by revenue, but its attractiveness is determined by profitability. Maintaining a multicontinental supply line to launch, water and restock the ships, building a factory system to accumulate and store goods without blowing up the prices, and defending the cargo at all times is very expensive. It is telling that the Portuguese trading centre in Antwerp went bankrupt as early as 1549, decades before the Dutch and English started to contest the Cape Route, and the Lisbon-based Casa da Índia followed in 1560. As a small country, Portugal seriously overextended itself trying to maintain market power in the spice trade, and had to rely on external financing and services.

It was the extensively private-capital-backed English East India Company and the Dutch VOC that managed to truly dominate the Indian shipping routes and drive the Portuguese out with superior organisation and military might. Over the very long run, one might thus say that the spice trade was an example of a market with a second-mover advantage, where being the innovator did not necessarily pay off in the end. As a matter of fact, the VOC went on to become the most valuable company in the world, and in many respects was the world’s first conglomerate.

To conclude, the spice trade was valuable enough to Portugal to multiply the royal treasury’s income a few times over. It probably made some top merchants and royals incredibly wealthy during the good years. However, due to the volatile nature of the business and a lack of capital, the Portuguese were surpassed by foes with deeper pockets. The colonial enterprises were not enough to greatly enhance the economic prosperity of Portugal as a whole.

Some interesting points I discovered when reading up on this topic:

- The Portuguese literally had no goods of value to offer for the spices in India, so they opted for silver, copper and gold (from their new-found African resources).
- The demand for spices inexplicably declined in the mid-17th century (after all the trouble of setting up the sea-based trade routes), and the best hypothesis for the reason is that the elites started to disdain goods now available to the middle classes.
- Early navigation off the coasts relied on the magnetic compass and the astrolabe, the latter used to determine latitude from the angle of the sun at noon. The measurements were compared to precomputed tabular values. Determining longitude was much trickier and was likely done by dead reckoning, essentially inferring the current location from the previous location, direction and speed.
- The famed caravel of the Portuguese is hard to concretely distinguish from other ship types - neither lateen sails nor carvel-builds were novel. As ship building is an evolutionary craftsman’s art honed over generations, the caravel had natural similarities to nearby Arabic and European ship types. I’d guess a better explanation for the Portuguese ship building prowess is their geographical location - by having a much shorter feedback loop for which ships worked off the coast of Africa and which didn’t, they were able to adapt fast.

I find it hard to come by good podcasts. While there’s a wealth of material out there, especially if you’re after entertainment, I mostly miss works with significant depth. By depth, I mean the quality of systematically dissecting a topic and tackling it to its very core. Just as bland self-help books and 10-minute executive summaries of those books are overcrowding book publishing, the same seems to be happening to podcasts.

Thus I am always positively thrilled whenever I find a gem in this space. And the greatest find so far happened a year ago, when I started listening to Hardcore History by Dan Carlin. Being a history buff made it a natural entry point for me. But I soon realised that HH goes much further than your average history podcast. Having been labelled one of the greatest storytellers in the world, Dan Carlin uses his experience as a former radio host to give us (ordinary) first-person accounts from historically significant events. In contrast to, say, the monotonous narration of Ken Burns documentaries, Dan Carlin’s stylised delivery creates drama and empathy whether the events happened 100 or 2000 years ago. A self-titled “fan of history”, he takes some liberties that true historians could not. But even this somewhat simplified history goes into an order of magnitude more depth than any history resource you would find on YouTube or Netflix.

For example, the first series I got into - Supernova in the East - has 6 episodes totalling over 26 hours (!) of audio. Who has time to listen to a 26-hour podcast, I thought. My morning commute was a mere 20 minutes. But sure enough, once I started listening, I also managed to find more activities during which to listen. Having now gone through almost all episodes, I decided to rank my favourite ones.

There is a general pattern with Hardcore History: it keeps getting better over time, but the episodes also get longer. For example, the first 20 shows are mostly under an hour. I personally feel that because of this they remain fun historical trivia, as the format is too short for the more advanced narrative building we get in longer episodes. They are still great shows in their own right, but don’t stand up to the heavyweight series of HH. Nonetheless, I rank the short episodes in a separate category, as they are just different.

It would be fair to note that “short” is defined here as relative to audiobooks, as even these episodes are longer than 4 hours.

There is an almost fairy tale like quality to the HH shows about earlier eras, no doubt due to Dan Carlin having some more leeway for creative writing and story building. The one image that haunts me from this episode is the possible rate of collapse of world leading civilisations. Think finding the well preserved ruins of some great modern city in a desolate desert in the future is unlikely? Well, it already happened once before, and it happened in a few generations. The natural continuation for this topic is the three part series King of Kings on Persian history.

This episode is just so… weird. A detailed account of the Anabaptist rebellion in Münster, an episode of history that most people have probably never heard of. Starting from the familiar story of Martin Luther and Reformation, it gradually escalates into a tale of a Woodstock-esque proto-communist enclave of barricaded insane people trying to withstand a professional army multiple times its size. There are some amazing characters in this one, in addition to the psychological case study of how delusion can reinforce itself in isolated groups.

It’s becoming more accepted over time that “great men” have not necessarily been the driving forces of history and that, quite to the contrary, some of them might have been quite nasty individuals (this is still a work in progress: e.g. there’s still a statue of Leopold II - architect of a corporate state that dismembered kids in Africa - in central Brussels). Yet such moral standards are rarely applied to the heroes of antiquity, the Alexanders and Caesars who’ve inspired generations of later generals and statesmen. How much can be explained away by historical context remains an open question, which Dan Carlin repeatedly tackles here. Great content for Roman history buffs, of course.

This series deserves a spot at the top if only for the relative underappreciation the Mongol conquests receive in Western historiography. Probably the strongest military relative to its contemporaries before the 20th century, the Mongol steamroller crushed armies across the known world at will. European listeners in particular will be dismayed by how puny and insignificant their armies were in comparison. With all the atrocities the Mongols committed, this story leans a bit in the Lord of the Rings direction on the good-vs-evil dimension. Some of the areas that were devastated have *still* not recovered, which to me seemed unbelievable.

The Roman Empire gets most of the limelight, but for me, the story of the collapse of the Roman Republic is more interesting. Which fundamental institutions need to change as a country scales from an insignificant province to a world empire, and how does that affect the qualities that made the province so tough to begin with? This series has the best character building of the whole show - the same Caesar you would hate after listening to The Celtic Holocaust astounds here with his sheer brilliance and work ethic.

Dan Carlin’s interpretation of World War I takes the top spot for me. On the one hand, it is probably the most significant historical event ever, yet it is far less represented in media than WWII. A thorough walk-through of the main events and the leaders’ thinking is interesting on its own. Why would you ever charge a trench - don’t you know there will be machine guns waiting?! But the series also sparked so many profound ideas that completely changed my understanding of WWI. For example:

- The entire military build-up leading up to the war was essentially a game-theoretic game of chicken between European states that had overstretched themselves in a complex web of alliances. And the only person capable of managing this web had retired. This immediately reminded me of how a company might suffer if its most brilliant engineer tries to make themselves irreplaceable.
- Meritocracy vs autocracy in government becomes increasingly important as the destructive power of weapons increases, all it takes is one idiot in one government to ruin a continent for decades.
- The rapid deterioration of 19th century aristocratic ethics in the first years of the war to be replaced by the animalistic trench warfare was just scary. Morals are for the good times.
- I always wondered why shell shock was such a big issue for WWI soldiers, yet gets little mention in later 20th century warfare. It’s because WWI was objectively much worse for the soldiers, whether you look at the casualty rate per day or the living conditions in the trenches near Verdun/Somme.
- Obviously, the horrific descriptions from Verdun or Somme can create pacifist leanings in anyone, especially if confronted with the futility of how little these battles gained (and again, that the reason for these deaths was some incompetent ruler).

*Rush hour in Stockholm, Gemma Evans, Unsplash*

I’ve been adhering this year to a personal challenge to cycle as my main mode of inner-city transport. This coincided with a massive resurgence of cycling popularity across Europe, as cities like Paris and Berlin have made significant investments to boost the modal share of cycling after the lockdowns. Even in my home city of Tallinn, where supposedly only 1% of trips are made by bike, it became a dominant topic in this year’s local elections. Since any politicised debate tends to drift from facts to convenient narratives, I started wondering if there’s an objective, data-based view that would help evaluate and rank the efforts of different cities in making their environments more bike-friendly. After all, every city probably looks up to Copenhagen or Amsterdam in this respect, but how well are the others doing? OpenStreetMap, the world’s largest open-source geographic database, might provide some of these answers.

There already are a few similar rankings that I could find. For example, the Copenhagenize Index gives subjective scores in a variety of areas such as streetscape, culture and ambition. Its candidate list comprises about 600 cities worldwide with over 600 000 inhabitants. Another one is the Bicycle Cities Index by the digital insurance company Coya. Its city selection is arbitrary, but the measurement is done on a number of objective indicators sourced from across the internet, such as road infrastructure, bicycle usage, number of fatalities etc.

The reasons for developing yet another one:

- Copenhagenize Index and Bicycle Cities Index were last updated in 2019
- We want to treat fairly both the city selection (based on population size) and the metric measurement, refraining from subjective expert-assigned scores or manual lookup of data from different sources
- Using a standardised methodology based on OpenStreetMap makes it possible to repeat the experiment at a later time and to scale it to an arbitrary number of cities with little additional manual effort

To measure the cyclability of cities, I calculate the share of road infrastructure that has been marked explicitly as cycling-friendly (either a bike lane or bike road) on OpenStreetMap (OSM). The entire navigable length is considered, thus a one-way road has half the length of a two-way. OSM has been deemed a fairly reliable data source both overall and for cycling.

The motivation behind measuring cycling path length is twofold: firstly, it has been shown to be correlated to the popularity of cycling and secondly, the presence of dedicated bike lanes is correlated with increased safety, which should have a further reinforcement effect on cycling popularity. While building many cycling lanes alone is not enough to facilitate a transition to widespread cycling, it is probably a necessary condition.

To qualify as a cycling road, an OpenStreetMap way has to either:

- have lanes dedicated to cyclists or shared with buses (counted as lanes)
- have a sidewalk that explicitly allows cycling (counted as lanes)
- be a separate track for cyclists, a cycle street (mostly Belgium/Netherlands) or a bicycle road (mostly Germany)
- be a path or footway that has been designated to cyclists by signs

It is worth noting that this definition of a cycling road is far more restrictive than what a router such as OSRM or Google Maps would use. That’s because, to achieve high connectivity and actually calculate sensible routes between any points A and B, routers often guide cyclists through ordinary, potentially unsafe streets. As a matter of fact, the part of the road network considered in this study is only roughly 10% of what OSRM considers cyclable.

As the split between cycle lanes and cycle tracks (including segregated ones) is the only distinction we make from OpenStreetMap data, it’s not possible to quantify all the nuances of these roads. For example, a bike lane could be very well designed: narrower adjacent car lanes, speed limits or other speed reduction methods, penalties for parking on the lane etc. But it could also be just some red paint on a sidewalk that runs into lighting poles and bus stops. Even if some of these fine-grained details could be captured from OSM, it’s doubtful that they would be logged with similar quality across Europe. In addition, there are probably cultural factors that affect both the road network and cycling convenience but are not included in the OSM data model.

A possible caveat of the above approach is the topology of the city. Theoretically, there could be many recreational bike roads near the outskirts of the city, where space is abundant, but nothing in the city centre. This would not facilitate cycling as a viable means of transportation. For this reason, I apply exponential decay weights to the road lengths, where the weights are a function of the road’s distance from the city centre. More precisely, the weighting formula is:

`exp(decay_coef * min(max(distance_from_centre - t_min, 0), t_max - t_min))`

where t_min = the 10th percentile of road distances from the centre and t_max = min(90th percentile of distances from the centre, 15 km). I calibrate the decay coefficient so that the formula equals 0.1 at t_max and 1.0 at t_min. This guarantees two things:

- A road in the city centre has 10 times higher weight than a road on the 90th percentile of distance from city centre.
- Censoring t_max at 15km makes sure we do not put weight on faraway roads, in case the city polygon is very large.
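The weighting scheme fits in a few lines. The `t_min`/`t_max` values below are illustrative, Tallinn-like distances in kilometres, not the per-city percentiles from the actual data.

```python
import math

def road_weight(distance_km: float, t_min: float, t_max: float) -> float:
    """Exponential decay weight: 1.0 up to t_min, 0.1 at and beyond t_max."""
    # Calibrate so that exp(decay_coef * (t_max - t_min)) == 0.1
    decay_coef = math.log(0.1) / (t_max - t_min)
    clamped = min(max(distance_km - t_min, 0.0), t_max - t_min)
    return math.exp(decay_coef * clamped)

t_min, t_max = 1.5, 8.0  # illustrative 10th/90th percentile distances, km
print(road_weight(0.5, t_min, t_max))   # 1.0: inside the centre
print(road_weight(t_max, t_min, t_max)) # 0.1: at the 90th percentile
print(road_weight(20.0, t_min, t_max))  # 0.1: censored far away
```

The weighted cycle road share is then the weighted sum of cycling-road lengths divided by the weighted sum of all navigable road lengths.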

Some sample cycling paths with their weights are visualised below on the example of Tallinn, where the circle around the centroid with a radius of 8km (the 90th percentile) denotes the distance at which a road has 10 times less weight than a road in the city centre.

While the main ranking is based on the above, I also calculate a few auxiliary metrics:

- The share of segregated cycling tracks (a subset of all tracks). This should be the gold standard of a cycling road.
- The share of cycling lanes (as opposed to tracks).
- The number of parking spots per km2.

While these are presented in the full results, they’re excluded from the ranking calculation due to two reasons:

- There are some regional idiosyncrasies between how bike infra is built and logged to OpenStreetMap. While the overall navigable bike road share seems a fairly universal metric, the others yielded unintuitive rankings of the cities.
- I’m not qualified to weigh different types of roads - what’s the value of a bike lane vs a segregated track?

- OpenStreetMap to download the country maps from Geofabrik
- Osmium tool to extract the city maps from country files and parse map elements
- polygons.openstreetmap.fr to fetch GeoJSON boundaries for cities
- Nominatim to look up city boundary IDs
- Pyrosm and mplleaflet to visualise OSM elements in a notebook
- CyclOSM was used as a sanity check visualisation tool

The base map files were downloaded between September 19 and October 5, 2021.

Copenhagen’s neighbour Malmö takes the top spot this time, with 17.3% of its entire navigable road network (16.2% weighted) explicitly marked for bikes. Most of the other cities on the list - from the Netherlands, Belgium and Scandinavia - are not surprising either. I was not familiar with cycling in Germany, but this source claims that Hanover, Bremen and Munich have the highest bike modal split there, and all three make the list. Out of the larger cities, Paris was already mentioned above. Thus it seems the ranking is capturing what we’d expect it to.

City | OSM id | Area (km2) | Navigable road length (km) | Cycle road share (weighted) | Rank |
---|---|---|---|---|---|
Malmo | 10663667 | 86.457 | 4427.65 | 0.162 | 1 |
Copenhagen | 2192363 | 108.951 | 4616.33 | 0.136 | 2 |
Valencia | 344953 | 139.082 | 4143.81 | 0.134 | 3 |
Helsinki | 34914 | 717.645 | 14339.5 | 0.134 | 4 |
Antwerp | 59518 | 203.696 | 5693.08 | 0.133 | 5 |
Hanover | 59418 | 204.013 | 8361.35 | 0.13 | 6 |
Rotterdam | 1411101 | 128.844 | 5962.23 | 0.125 | 7 |
Utrecht | 1433619 | 75.032 | 3609.5 | 0.125 | 8 |
Stockholm | 398021 | 215.754 | 10363.5 | 0.124 | 9 |
Gothenburg | 935611 | 1093.63 | 12253.4 | 0.124 | 10 |
Nantes | 59874 | 65.795 | 2780.18 | 0.122 | 11 |
Munster | 62591 | 303.304 | 7347.75 | 0.122 | 12 |
Bremen | 62559 | 326.285 | 10271.5 | 0.121 | 13 |
Amsterdam | 271110 | 219.504 | 8661.92 | 0.12 | 14 |
Aarhus | 1784663 | 471.397 | 10146.2 | 0.112 | 15 |
Reykjavik | 2580605 | 244.465 | 4330.58 | 0.108 | 16 |
Bologna | 43172 | 140.667 | 3739.61 | 0.108 | 17 |
Toulouse | 35738 | 118.029 | 5297.21 | 0.108 | 18 |
Lyon | 120965 | 47.981 | 2494.39 | 0.108 | 19 |
Mannheim | 62691 | 144.978 | 6018.61 | 0.101 | 20 |
Cologne | 62578 | 405.011 | 14756.9 | 0.101 | 21 |
Hague | 192736 | 98.144 | 4693.85 | 0.1 | 22 |
Bonn | 62508 | 141.067 | 5416.14 | 0.1 | 23 |
Seville | 342563 | 141.287 | 4250.41 | 0.096 | 24 |
Dusseldorf | 62539 | 217.488 | 8323.37 | 0.096 | 25 |
Munich | 62428 | 310.712 | 16984.8 | 0.095 | 26 |
Nuremberg | 62780 | 187.351 | 7610.35 | 0.091 | 27 |
Vienna | 109166 | 414.863 | 16373.9 | 0.09 | 28 |
Leicester | 162353 | 73.389 | 3073.72 | 0.089 | 29 |
Paris | 7444 | 105.391 | 6174.52 | 0.089 | 30 |
Full results can be found here, including the auxiliary metrics.

Here, point size is proportional to the cycling road share. The map shows that bad weather, at least, seems to have little (or even an inverse) correlation with cycling infrastructure :)

A more convincing case could be made for a relationship with population size. Indeed, most of the top cycling cities are smaller cities. Using the populations of European cities from Wikipedia and plotting them against our metric shows that the largest city above a 12% (weighted) share has a population of just ~exp(13.8) ≈ 1 million. However, over the whole dataset there seems to be no relationship between population and cycle road share (the dashed line is the best linear fit). Even if some natural scaling factors reduce the proportion of bike roads in the few very largest cities, there are clearly many smaller cities with few bike roads that could still better realise their potential.

To see whether the ranking is capturing the intended concepts, it is useful to visualise the cycling infrastructure on a map and compare it to our quantitative measurements. I use CyclOSM, which seems to be the best such tool.

Firstly, comparing a top-ranking city with a bottom one (in CyclOSM maps, the more blue the better) shows that indeed, the overall level of infrastructure is captured:

Nantes is an interesting outlier - it makes the top overall, but almost half of its share comes from bike lanes. Indeed, in CyclOSM it visibly has predominantly cycling lanes (marked by dashed lines) rather than separate cycleways.

The same city planning model seems to be used in Lyon.

Contrasting Nantes is Barcelona - anyone who’s cycled there knows the convenience of their designated cycleways that have little overlap with car traffic. While Barcelona does not do too well in the overall ranking, it is at the top when measured by segregated cycling tracks.

Finally, to visualise the benefit of the spatial weighting of roads, we can look at examples of cities that would gain or lose the most in the ranking if we did not do any weighting. For example, Milan would gain 29 places (from 86 to 57) and visibly this is because central Milan has almost no bike roads.

Meanwhile, Zaragoza *only* has bike roads in the very centre, and would lose 35 places (from 57 to 92) if there were no weighting.