[Stuff I read] Kolmogorov-Arnold Networks

  • Proposed by Max Tegmark’s group; see Liu et al., “KAN: Kolmogorov–Arnold Networks”.
  • A hot topic on Twitter and the subject of considerable debate.

The paper “KAN: Kolmogorov–Arnold Networks” proposes Kolmogorov-Arnold Networks (KANs) as an alternative to Multi-Layer Perceptrons (MLPs). The core idea behind KANs is inspired by the Kolmogorov-Arnold representation theorem, which states that any multivariate continuous function on a bounded domain can be written as a finite composition of continuous univariate functions and the operation of addition. This section summarizes the technical details of the paper, focusing on the mathematical formulations.

1. Introduction

The motivation for KANs stems from limitations of MLPs: fixed activation functions on nodes and linear weight matrices on edges. MLPs are justified by the universal approximation theorem, but their structure can be parameter-inefficient and hard to interpret. KANs instead place learnable activation functions on edges, replacing each linear weight with a univariate function parametrized as a spline.

2. Kolmogorov-Arnold Representation Theorem

The Kolmogorov-Arnold representation theorem states:

$$f(x) = \sum_{q=1}^{2n+1} \Phi_q\!\left(\sum_{p=1}^{n} \phi_{q,p}(x_p)\right)$$

where $\phi_{q,p} : [0,1] \to \mathbb{R}$ and $\Phi_q : \mathbb{R} \to \mathbb{R}$.
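
As a simple illustration (my example, not the paper's): multiplication is already of this form for positive inputs, since a bivariate function is reduced to univariate functions combined by addition:

$$xy = \exp\big(\ln x + \ln y\big), \qquad x, y > 0.$$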

3. KAN Architecture

KANs generalize the representation theorem to arbitrary depths and widths. Each weight parameter in KANs is replaced by a learnable 1D function (spline).

3.1. Mathematical Formulation of KANs

Define a KAN layer with $n_{\mathrm{in}}$-dimensional inputs and $n_{\mathrm{out}}$-dimensional outputs as a matrix of 1D functions:

$$\Phi = \{\phi_{q,p}\}, \qquad p = 1, 2, \ldots, n_{\mathrm{in}}, \quad q = 1, 2, \ldots, n_{\mathrm{out}}$$

The activation function on the edge $\phi_{l,j,i}$, connecting node $i$ of layer $l$ to node $j$ of layer $l+1$, is given by:

$$\phi_{l,j,i}(x) = w\big(b(x) + \mathrm{spline}(x)\big)$$

where $b(x) = \mathrm{silu}(x) = \frac{x}{1 + e^{-x}}$ is a fixed residual basis function and $\mathrm{spline}(x) = \sum_i c_i B_i(x)$ is a linear combination of B-spline basis functions with trainable coefficients $c_i$.
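
To make this concrete, here is a minimal NumPy sketch of a single edge activation, with the B-spline basis evaluated by the Cox-de Boor recursion. The function names (`bspline_basis`, `edge_activation`), the uniform extended knot grid, and the random initialization are my illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def bspline_basis(x, grid, k):
    """Evaluate all order-k B-spline basis functions B_b(x) on the knot vector
    `grid` via the Cox-de Boor recursion. Returns an (N, len(grid)-k-1) matrix."""
    x = x[:, None]
    # degree-0 bases: indicator of each knot interval
    B = ((x >= grid[:-1]) & (x < grid[1:])).astype(float)
    for d in range(1, k + 1):
        left = (x - grid[:-(d + 1)]) / (grid[d:-1] - grid[:-(d + 1)]) * B[:, :-1]
        right = (grid[d + 1:] - x) / (grid[d + 1:] - grid[1:-d]) * B[:, 1:]
        B = left + right
    return B

def edge_activation(x, w, c, grid, k=3):
    """phi(x) = w * (silu(x) + sum_b c_b B_b(x)) for a single edge."""
    silu = x / (1.0 + np.exp(-x))              # b(x) = silu(x)
    spline = bspline_basis(x, grid, k) @ c     # spline(x) = sum_b c_b B_b(x)
    return w * (silu + spline)

# example: coarse grid with G=5 intervals on [-1, 1], cubic splines (k=3)
G, k, lo, hi = 5, 3, -1.0, 1.0
h = (hi - lo) / G
grid = lo + h * np.arange(-k, G + k + 1)       # uniform knot vector, extended by k on each side
c = np.random.default_rng(0).normal(size=G + k) * 0.1
y = edge_activation(np.linspace(-0.9, 0.9, 7), w=1.0, c=c, grid=grid, k=k)
```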

The output of each layer is computed as:

$$x_{l+1,j} = \sum_{i=1}^{n_l} \phi_{l,j,i}(x_{l,i})$$

or, in matrix form:

$$x_{l+1} = \Phi_l\, x_l$$

where $\Phi_l$ is the function matrix of layer $l$.
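
A vectorized layer follows directly: stack the per-edge spline coefficients into a tensor and sum over incoming edges. This sketch reuses `bspline_basis` (and the NumPy import) from the snippet above; the class name, shapes, and initialization scheme are again assumptions for illustration.

```python
class KANLayer:
    """Minimal KAN layer sketch: each edge (j, i) carries
    phi_{j,i}(x) = w_{j,i} * (silu(x) + sum_b c_{j,i,b} B_b(x))."""

    def __init__(self, n_in, n_out, G=5, k=3, lo=-1.0, hi=1.0, seed=0):
        rng = np.random.default_rng(seed)
        h = (hi - lo) / G
        self.k = k
        self.grid = lo + h * np.arange(-k, G + k + 1)          # shared knot vector
        self.w = rng.normal(size=(n_out, n_in))                # per-edge scale w
        self.c = 0.1 * rng.normal(size=(n_out, n_in, G + k))   # spline coefficients

    def __call__(self, x):                                     # x: (N, n_in)
        silu = x / (1.0 + np.exp(-x))
        # B[n, i, b] = B_b(x_{n,i}) for every input dimension i
        B = np.stack([bspline_basis(x[:, i], self.grid, self.k)
                      for i in range(x.shape[1])], axis=1)
        spline = np.einsum('nib,oib->noi', B, self.c)          # spline_{j,i}(x_{l,i})
        phi = self.w[None] * (silu[:, None, :] + spline)       # phi_{l,j,i}(x_{l,i})
        return phi.sum(axis=-1)                                # x_{l+1,j} = sum_i phi_{l,j,i}

# a [2, 5, 1] KAN as a composition of two layers
layers = [KANLayer(2, 5), KANLayer(5, 1)]
x = np.random.default_rng(1).uniform(-0.9, 0.9, size=(4, 2))
for layer in layers:
    x = layer(x)          # x is now the (4, 1) network output
```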

4. Approximation Abilities and Scaling Laws

KANs approximate functions by decomposing a high-dimensional problem into compositions of 1D problems; when the target function admits such a smooth compositional structure, this lets KANs escape the curse of dimensionality.

Theorem 2.1: Approximation Bound

Let $f(x)$ be represented as:

$$f = (\Phi_{L-1} \circ \Phi_{L-2} \circ \cdots \circ \Phi_1 \circ \Phi_0)\, x$$

For each $\Phi_{l,i,j}$, there exist $k$-th order B-spline functions $\Phi_{l,i,j}^G$ such that:

$$\left\| f - (\Phi_{L-1}^G \circ \Phi_{L-2}^G \circ \cdots \circ \Phi_1^G \circ \Phi_0^G)\, x \right\|_{C^m} \le C\, G^{-k-1+m}$$

where G is the grid size and C depends on f and its representation.
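
Reading the bound at $m = 0$ gives the expected scaling of the approximation error with grid size,

$$\text{error} \lesssim C\, G^{-(k+1)},$$

so with cubic splines ($k = 3$) the predicted scaling exponent is $k + 1 = 4$, which is the $\alpha = 4$ quoted in the empirical results below.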

5. Grid Extension Technique

KANs can be made more accurate by refining the spline grid: a spline trained on a coarse grid with $G_1$ intervals is transferred to a finer grid with $G_2$ intervals by fitting the new coefficients $\{c'_j\}$ so that the fine-grid spline matches the coarse-grid one in expectation:

$$\{c'_j\} = \underset{\{c'_j\}}{\arg\min}\; \mathbb{E}_{x \sim p(x)} \left( \sum_{j=0}^{G_2 + k - 1} c'_j B'_j(x) - \sum_{i=0}^{G_1 + k - 1} c_i B_i(x) \right)^{2}$$
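
A least-squares version of this fit, continuing the NumPy sketch above (it reuses `bspline_basis` plus the `c` and `grid` variables from the edge-activation snippet; approximating $p(x)$ by uniform samples on the grid's domain is my assumption):

```python
def extend_grid(c_coarse, grid_coarse, G2, k=3, lo=-1.0, hi=1.0, n_samples=1000):
    """Fit fine-grid coefficients c' so the new spline matches the old one
    in least squares over samples x ~ p(x) (here: uniform on [lo, hi])."""
    x = np.random.default_rng(0).uniform(lo, hi, size=n_samples)
    y_coarse = bspline_basis(x, grid_coarse, k) @ c_coarse     # old spline values
    h2 = (hi - lo) / G2
    grid_fine = lo + h2 * np.arange(-k, G2 + k + 1)            # finer knot vector
    B_fine = bspline_basis(x, grid_fine, k)                    # (n_samples, G2 + k)
    c_fine, *_ = np.linalg.lstsq(B_fine, y_coarse, rcond=None)
    return c_fine, grid_fine

# example: refine the G=5 spline from the edge-activation sketch to G2=20
c_fine, grid_fine = extend_grid(c, grid, G2=20)
```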

6. Simplification Techniques

KANs can be made more interpretable through sparsification, pruning, and symbolification (replacing learned splines with symbolic formulas where possible). Sparsification combines an L1 norm on the edge activation functions with an entropy regularizer; edges and nodes with low importance are then pruned away.
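
A sketch of such a per-layer regularizer, assuming the paper's definition of an edge's L1 norm as the mean magnitude of its activation over a batch; `phi` is the per-edge activation tensor computed inside `KANLayer.__call__` above (before summing over $i$), and the weighting constants are illustrative.

```python
def sparsity_regularizer(phi, lam_l1=1.0, lam_entropy=1.0, eps=1e-8):
    """phi: (N, n_out, n_in) edge activations phi_{l,j,i}(x_{l,i}) for a batch.
    Returns an L1 + entropy penalty encouraging a sparse, prunable layer."""
    edge_l1 = np.abs(phi).mean(axis=0)           # |phi_{j,i}|_1: mean magnitude per edge
    l1_term = edge_l1.sum()
    p = edge_l1 / (edge_l1.sum() + eps)          # normalized edge importances
    entropy_term = -(p * np.log(p + eps)).sum()  # entropy over edges
    return lam_l1 * l1_term + lam_entropy * entropy_term
```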

7. Toy Examples and Empirical Results

KANs were shown to have better scaling laws than MLPs, achieving lower test losses with fewer parameters on a range of toy datasets and special functions.

Example Functions

  1. Bessel function: $f(x) = J_0(20x)$
  2. High-dimensional function: $f(x_1, \ldots, x_{100}) = \exp\!\left(\frac{1}{100} \sum_{i=1}^{100} \sin^2\!\left(\frac{\pi x_i}{2}\right)\right)$
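
For reference, the two target functions are easy to reproduce; this sketch assumes inputs sampled uniformly from $[-1, 1]$ and uses `scipy.special.j0` for the Bessel function $J_0$.

```python
import numpy as np
from scipy.special import j0   # Bessel function of the first kind, order 0

def bessel_target(x):
    """f(x) = J_0(20 x) for scalar or array x."""
    return j0(20.0 * x)

def high_dim_target(X):
    """f(x_1, ..., x_100) = exp((1/100) * sum_i sin^2(pi * x_i / 2)); X: (N, 100)."""
    return np.exp(np.mean(np.sin(np.pi * X / 2.0) ** 2, axis=1))

rng = np.random.default_rng(0)
y1 = bessel_target(rng.uniform(-1, 1, size=1000))
y2 = high_dim_target(rng.uniform(-1, 1, size=(1000, 100)))
```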

Empirically, KANs achieve scaling exponents close to the theoretical value $\alpha = 4$, outperforming MLPs in both accuracy and parameter efficiency.

Conclusion

KANs provide a novel approach to neural network design, leveraging the Kolmogorov-Arnold representation theorem to achieve better performance and interpretability than traditional MLPs. Placing learnable, spline-parametrized activation functions on edges allows for greater flexibility and efficiency in function approximation.