Mixture of Experts, Sparsity, and Megablocks
A neural network learns features that attempt to map input to output. A network with 3 features is almost certainly less powerful than one with 10,000 features. Features are composed of parameters and represented by neurons (albeit often in incredibly complicated and uninterpretable ways), so increasing the number of features in a neural net generally requires increasing its number of parameters. This presents an obvious problem: to make our neural net more expressive, we have to make it larger by growing its parameter matrices, and therefore increase our computational cost.
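To make that coupling concrete, here is a minimal sketch (plain Python, with hypothetical dimensions chosen purely for illustration) of how widening the hidden layer of a dense two-matrix MLP grows both its parameter count and its per-token compute in lockstep:

```python
# Hypothetical model width, chosen only for illustration.
d_model = 512

def dense_layer_cost(d_hidden: int) -> tuple[int, int]:
    """Parameters and FLOPs/token for a d_model -> d_hidden -> d_model MLP."""
    # Two weight matrices (biases omitted for simplicity).
    params = d_model * d_hidden + d_hidden * d_model
    # Each parameter participates in one multiply-add, i.e. ~2 FLOPs per token.
    flops_per_token = 2 * params
    return params, flops_per_token

for d_hidden in (3, 10_000):
    params, flops = dense_layer_cost(d_hidden)
    print(f"d_hidden={d_hidden:>6}: {params:,} params, ~{flops:,} FLOPs/token")
```

In a dense network, every parameter is touched on every forward pass, so compute scales directly with size; this is exactly the tension that mixture-of-experts sparsity is meant to break.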