[Stuff I read] Kolmogorov-Arnold Networks
- Proposed by Max Tegmark's group; see Liu et al., "KAN: Kolmogorov-Arnold Networks".
- A hot topic on Twitter and the subject of much debate.
The paper "KAN: Kolmogorov-Arnold Networks" proposes Kolmogorov-Arnold Networks (KANs) as an alternative to Multi-Layer Perceptrons (MLPs). The core idea behind KANs is inspired by the Kolmogorov-Arnold representation theorem, which states that any multivariate continuous function can be written as a finite composition of continuous univariate functions and addition. These notes summarize the technical details of the paper, focusing on the mathematical formulations.
1. Introduction
The motivation for KANs stems from the limitations of MLPs: fixed activation functions on nodes and learnable linear weights on edges. MLPs rely on the universal approximation theorem, but their structure can be parameter-inefficient and hard to interpret. KANs instead place learnable activation functions on the edges and replace linear weights entirely with univariate functions parametrized as splines; nodes simply sum their incoming signals.
2. Kolmogorov-Arnold Representation Theorem
The Kolmogorov-Arnold representation theorem states that any continuous function $f : [0,1]^n \to \mathbb{R}$ can be written as

$$
f(x_1, \ldots, x_n) = \sum_{q=1}^{2n+1} \Phi_q\!\left( \sum_{p=1}^{n} \phi_{q,p}(x_p) \right),
$$

where $\phi_{q,p} : [0,1] \to \mathbb{R}$ and $\Phi_q : \mathbb{R} \to \mathbb{R}$ are continuous univariate functions. The catch is that these univariate functions can be highly non-smooth, which is why the theorem was long considered impractical for machine learning; the paper's answer is to relax the depth-2, width-$(2n+1)$ structure rather than insist on the exact representation.
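To make the structure of the representation concrete, here is a minimal Python sketch that just evaluates the double sum above for arbitrary univariate callables. The placeholder functions (`np.sin`, `np.cos`, `np.tanh`) are mine for illustration only; the theorem's actual inner and outer functions are generally non-smooth and not given in closed form.

```python
import numpy as np

def ka_representation(x, inner, outer):
    """Evaluate f(x) = sum_q Phi_q( sum_p phi_{q,p}(x_p) ) for a single input vector x.
    `inner[q][p]` and `outer[q]` are arbitrary univariate callables."""
    return sum(Phi_q(sum(inner[q][p](x[p]) for p in range(len(x))))
               for q, Phi_q in enumerate(outer))

# Toy instantiation with n = 2 inputs and 2n + 1 = 5 outer functions.
n = 2
inner = [[np.sin, np.cos] for _ in range(2 * n + 1)]   # placeholder univariate functions
outer = [np.tanh] * (2 * n + 1)
print(ka_representation(np.array([0.3, -0.7]), inner, outer))
```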
3. KAN Architecture
KANs generalize the representation theorem to arbitrary depths and widths. Each weight parameter in KANs is replaced by a learnable 1D function (spline).
3.1. Mathematical Formulation of KANs
Define a KAN layer with $n_{\text{in}}$-dimensional inputs and $n_{\text{out}}$-dimensional outputs as a matrix of learnable 1D functions

$$
\Phi = \{\phi_{q,p}\}, \qquad p = 1, \ldots, n_{\text{in}}, \quad q = 1, \ldots, n_{\text{out}}.
$$

Activation function on edge: each $\phi(x)$ is the sum of a base function and a spline,

$$
\phi(x) = w\left( b(x) + \mathrm{spline}(x) \right),
$$

where $b(x) = \mathrm{silu}(x) = x / (1 + e^{-x})$ and $\mathrm{spline}(x) = \sum_i c_i B_i(x)$ is a linear combination of B-spline basis functions with trainable coefficients $c_i$.

The output of each layer is computed as:

$$
x_{l+1, j} = \sum_{i=1}^{n_l} \phi_{l,j,i}(x_{l,i}), \qquad j = 1, \ldots, n_{l+1},
$$

or in matrix form:

$$
\mathbf{x}_{l+1} = \Phi_l\, \mathbf{x}_l,
$$

where $\Phi_l$ is the function matrix of the $l$-th layer (each entry is applied to its input before summation). A full KAN is the composition $\mathrm{KAN}(\mathbf{x}) = (\Phi_{L-1} \circ \Phi_{L-2} \circ \cdots \circ \Phi_0)(\mathbf{x})$.
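Below is a minimal forward-pass sketch of one such layer in NumPy/SciPy, following the definitions above. This is not the authors' implementation (they release an official library, pykan); class and function names such as `ToyKANLayer` and `bspline_basis`, the grid range $[-1, 1]$, and the initialization are my own assumptions for illustration.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(grid, k, x):
    """All k-th order B-spline basis functions on `grid`, evaluated at points x.
    Returns an array of shape (len(x), grid_size + k)."""
    t = np.concatenate([np.full(k, grid[0]), grid, np.full(k, grid[-1])])  # clamped knots
    n_basis = len(t) - k - 1
    return np.column_stack([BSpline(t, np.eye(n_basis)[i], k)(x) for i in range(n_basis)])

def silu(x):
    return x / (1.0 + np.exp(-x))

class ToyKANLayer:
    """Forward pass only: edge (i -> j) carries phi_{j,i}(x) = w_{j,i} * (silu(x) + spline_{j,i}(x));
    node j outputs x_{l+1,j} = sum_i phi_{j,i}(x_{l,i})."""
    def __init__(self, n_in, n_out, grid_size=5, k=3, seed=0):
        rng = np.random.default_rng(seed)
        self.grid, self.k = np.linspace(-1.0, 1.0, grid_size + 1), k
        self.coef = rng.normal(0.0, 0.1, size=(n_out, n_in, grid_size + k))  # spline coefficients c_i
        self.w = np.ones((n_out, n_in))                                      # per-edge scale w

    def __call__(self, x):                                                   # x: (batch, n_in)
        B = np.stack([bspline_basis(self.grid, self.k, x[:, i])
                      for i in range(x.shape[1])], axis=1)                   # (batch, n_in, n_basis)
        spline = np.einsum('bim,jim->bji', B, self.coef)                     # (batch, n_out, n_in)
        phi = self.w[None] * (silu(x)[:, None, :] + spline)                  # phi_{j,i}(x_i)
        return phi.sum(axis=-1)                                              # (batch, n_out)

layer = ToyKANLayer(n_in=3, n_out=2)
print(layer(np.random.default_rng(1).uniform(-1, 1, size=(4, 3))).shape)    # (4, 2)
```

Stacking several such layers (feeding one layer's output into the next) gives a deep KAN of arbitrary width and depth.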
4. Approximation Abilities and Scaling Laws
KANs approximate functions by decomposing a high-dimensional problem into a composition of 1D problems, which lets them sidestep the curse of dimensionality when the target function admits a smooth compositional (Kolmogorov-Arnold) structure.
Theorem 2.1: Approximation Bound
Let $f(\mathbf{x})$ admit a representation $f = (\Phi_{L-1} \circ \Phi_{L-2} \circ \cdots \circ \Phi_0)\,\mathbf{x}$ in which every constituent univariate function $\phi_{l,i,j}$ is $(k+1)$-times continuously differentiable.

For each $\phi_{l,i,j}$, replace it with its $k$-th order B-spline approximation $\phi^G_{l,i,j}$ on a grid of size $G$. Then there exists a constant $C$ (depending on $f$ and its representation) such that, for any $0 \le m \le k$,

$$
\left\| f - (\Phi^G_{L-1} \circ \Phi^G_{L-2} \circ \cdots \circ \Phi^G_0)\,\mathbf{x} \right\|_{C^m} \le C\, G^{-k-1+m},
$$

where the $C^m$ norm measures derivatives up to order $m$. Crucially, the bound does not depend on the input dimension $n$.
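As a quick numerical sanity check of the $G^{-(k+1)}$ rate for $m = 0$, the snippet below fits a single 1D cubic spline (the building block of the bound) to a smooth function at increasing grid sizes; doubling $G$ should shrink the max error by roughly $2^4 = 16$. The target function and grid sizes here are my own choices, not from the paper.

```python
import numpy as np
from scipy.interpolate import make_interp_spline

f = np.sin                                   # smooth 1D target
x_dense = np.linspace(0.0, np.pi, 10_000)    # evaluation points for the max error

for G in (5, 10, 20, 40, 80):
    knots = np.linspace(0.0, np.pi, G + 1)
    spline = make_interp_spline(knots, f(knots), k=3)   # cubic spline on a grid of size G
    err = np.max(np.abs(spline(x_dense) - f(x_dense)))
    print(f"G = {G:3d}   max error = {err:.2e}")        # drops ~16x per doubling of G
```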
5. Grid Extension Technique
KANs can increase accuracy by refining the grid used in splines: a spline trained on a coarse grid with coefficients $\{c_i\}$ is transferred to a finer grid by solving the least-squares problem

$$
\{c'_j\} = \arg\min_{\{c'_j\}} \; \mathbb{E}_{x \sim p(x)} \left( \sum_j c'_j B'_j(x) - \sum_i c_i B_i(x) \right)^2,
$$

so training can continue on the finer grid without starting from scratch. This lets an already-trained KAN be made more accurate simply by extending the grid, a form of fine-graining not available to MLPs.
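A minimal sketch of that coefficient transfer for a single edge function, assuming a least-squares fit over sampled inputs (grid sizes, the sampling of $x$, and the random "learned" coefficients are placeholders of mine):

```python
import numpy as np
from scipy.interpolate import BSpline

def basis_matrix(grid, k, x):
    """All k-th order B-spline basis functions on `grid`, evaluated at x."""
    t = np.concatenate([np.full(k, grid[0]), grid, np.full(k, grid[-1])])
    n = len(t) - k - 1
    return np.column_stack([BSpline(t, np.eye(n)[i], k)(x) for i in range(n)])

rng = np.random.default_rng(0)
k, G_coarse, G_fine = 3, 5, 20
coarse_grid = np.linspace(-1, 1, G_coarse + 1)
fine_grid = np.linspace(-1, 1, G_fine + 1)
c_coarse = rng.normal(size=G_coarse + k)                 # pretend these were learned

# Fit fine-grid coefficients by least squares over sampled inputs:
#   {c'_j} = argmin E_x ( sum_j c'_j B'_j(x) - sum_i c_i B_i(x) )^2
x = np.linspace(-1, 1, 512)
target = basis_matrix(coarse_grid, k, x) @ c_coarse
c_fine, *_ = np.linalg.lstsq(basis_matrix(fine_grid, k, x), target, rcond=None)

residual = np.max(np.abs(basis_matrix(fine_grid, k, x) @ c_fine - target))
print(f"max transfer error: {residual:.2e}")             # tiny: the fine grid nests the coarse one
```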
6. Simplification Techniques
KANs can be made more interpretable through sparsification, pruning, and symbolification (snapping the learned 1D functions to symbolic forms such as sin, exp, or x²). An L1 penalty on the activation functions combined with entropy regularization is used to sparsify the network before pruning.
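The sketch below shows one plausible form of that per-layer penalty, assuming each edge's "L1 norm" is the mean absolute activation over a batch and the entropy is taken over the normalized edge norms; the weights `mu1`, `mu2` and the example magnitudes are placeholders of mine.

```python
import numpy as np

def sparsity_penalty(phi_abs_mean, mu1=1.0, mu2=1.0, eps=1e-12):
    """Sparsification penalty for one KAN layer.
    phi_abs_mean[j, i] = mean |phi_{j,i}(x)| over a batch (the 'L1 norm' of that edge).
    Returns mu1 * (sum of edge L1 norms) + mu2 * (entropy of the normalized edge norms)."""
    l1 = phi_abs_mean.sum()
    p = phi_abs_mean / (l1 + eps)            # relative importance of each edge
    entropy = -(p * np.log(p + eps)).sum()
    return mu1 * l1 + mu2 * entropy

# Hypothetical edge magnitudes for a 3-input, 2-output layer: mostly two strong edges.
phi_abs_mean = np.array([[0.90, 0.02, 0.01],
                         [0.03, 0.85, 0.02]])
print(sparsity_penalty(phi_abs_mean))        # added to the prediction loss with overall weight lambda
```

Edges whose L1 norm stays small after training with this penalty are pruned, and the surviving 1D functions can then be matched against a library of symbolic candidates.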
7. Toy Examples and Empirical Results
KANs were shown to have better scaling laws than MLPs, achieving lower test losses with fewer parameters on various toy datasets and special functions.
Example Functions
- Bessel function: $f(x) = J_0(20x)$
- High-dimensional function: $f(x_1, \ldots, x_{100}) = \exp\!\left( \frac{1}{100} \sum_{i=1}^{100} \sin^2\!\left(\frac{\pi x_i}{2}\right) \right)$
On such targets, KANs can achieve scaling exponents close to the theoretical value $\alpha = k + 1 = 4$ (for cubic splines, $k = 3$), i.e. test loss falling roughly as $N^{-4}$ in the parameter count $N$, whereas MLPs scale far more slowly in practice; a sketch for generating these toy targets follows.
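A small sketch generating the two toy targets listed above (sample counts and the uniform $[-1, 1]$ input range are assumptions of mine, not taken from the paper's experimental setup):

```python
import numpy as np
from scipy.special import j0

rng = np.random.default_rng(0)

# Bessel target: f(x) = J_0(20 x)
x_1d = rng.uniform(-1.0, 1.0, size=(3000, 1))
y_bessel = j0(20.0 * x_1d[:, 0])

# 100-dimensional target: f(x) = exp( (1/100) * sum_i sin^2(pi * x_i / 2) )
x_hd = rng.uniform(-1.0, 1.0, size=(3000, 100))
y_hd = np.exp(np.mean(np.sin(np.pi * x_hd / 2.0) ** 2, axis=1))

print(x_1d.shape, y_bessel.shape, x_hd.shape, y_hd.shape)
```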
Conclusion
KANs provide a novel approach to neural network design, leveraging the Kolmogorov-Arnold representation theorem to achieve better performance and interpretability compared to traditional MLPs. The use of learnable activation functions on edges and splines allows for greater flexibility and efficiency in function approximation.