Quantifying the Effects of Training Data Complexity on Network Pruneability for Single-Head Decoder-Only Transformers
As neural networks continue to grow blindly, the ability to predict how many parameters are required to properly model a particular problem eludes researchers. This pattern contributes to the unnecessarily high environmental impact of deep learning applications and widens the divide between researchers according to their computational capacity, diminishing the repeatability and rigor of published research. This work attempts to address these concerns by quantifying training data complexity and relating that quantity to the number of parameters required for a simple sequence modeling task. In our case, the task was the synthesis of handwritten digits, trained on Edwin de Jong’s MNIST Sequence Dataset.
How can we know, without training, whether a solution is overparameterized, and if so, by how much?
All of the “handwritten” digits here were generated by our AI model.
We relate training data complexity for a sequence modeling task to the number of unused parameters in the trained transformer model, defining “use” through an arbitrary magnitude constraint, ε. We believe that methods similar to ours can help future researchers restore mathematical rigor to their deep learning papers, mitigate the replication crisis, and reduce the environmental impact of ever-growing deep learning solutions.
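As a concrete illustration of this criterion, the sketch below counts the fraction of a model’s weights whose magnitude exceeds a threshold ε. The placeholder model and the particular value of ε are illustrative assumptions, not the exact configuration from our report.

```python
import torch
import torch.nn as nn

def used_parameter_fraction(model: nn.Module, eps: float = 1e-3) -> float:
    """Fraction of parameters whose magnitude exceeds eps.

    Weights at or below the threshold are treated as "unused" under the
    arbitrary magnitude constraint described above.
    """
    used, total = 0, 0
    for p in model.parameters():
        used += (p.detach().abs() > eps).sum().item()
        total += p.numel()
    return used / total

# Toy usage with a placeholder model; eps = 1e-3 is an illustrative choice.
model = nn.Linear(64, 64)
print(f"used fraction: {used_parameter_fraction(model):.3f}")
```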
Our hypothesis is that as training data complexity increases, the number of used parameters should increase, since the model requires more of its capacity to reproduce the task at hand. We establish two measures of complexity and check the proportion of “used” parameters at the end of the first lottery-ticket pruning round. We posit that well-trained networks of any complexity should learn efficient representations of their target distributions and organically zero out unnecessary parameters. We provide preliminary evidence for this claim by implementing a vanilla transformer network, termed “tinyDOT” (Tiny Decoder-Only Transformer), to model the handwriting sequences needed to produce the digits found in the MNIST dataset.
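For readers unfamiliar with the lottery-ticket procedure, here is a rough sketch of what a single pruning round can look like in PyTorch: train, globally prune the smallest weights by magnitude, and rewind the survivors to their initialization. The 20% pruning amount, the Linear-only module selection, and the user-supplied `train_fn` are illustrative assumptions rather than our exact tinyDOT setup.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

def one_pruning_round(model: nn.Module, train_fn, amount: float = 0.2) -> nn.Module:
    """One lottery-ticket-style round (Frankle & Carbin, 2019): train,
    globally prune the smallest weights by magnitude, then rewind the
    surviving weights to their initial values."""
    init_state = copy.deepcopy(model.state_dict())   # snapshot of the initialization
    train_fn(model)                                  # user-supplied training loop

    # Globally prune the smallest `amount` fraction of Linear weights by |w|.
    to_prune = [(m, "weight") for m in model.modules() if isinstance(m, nn.Linear)]
    prune.global_unstructured(
        to_prune, pruning_method=prune.L1Unstructured, amount=amount
    )

    # Rewind unpruned weights to initialization; the pruning masks remain.
    with torch.no_grad():
        for name, module in model.named_modules():
            if isinstance(module, nn.Linear):
                key = f"{name}.weight" if name else "weight"
                module.weight_orig.copy_(init_state[key])
    return model

# Example (hypothetical): one_pruning_round(tiny_dot_model, train_fn=my_train_loop)
```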
Defining Training Data Complexity
In many deep learning applications, defining data complexity is a non-trivial task that depends on the problem domain. In our case, we define data complexity in two forms: L, the sequence length, and S, the mean spectral density of the sample’s optical flow vectors (see the full report for details).
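Since the precise definitions live in the full report, the sketch below shows only one plausible way to compute these two quantities for a single sample: L as the number of steps in the stroke sequence, and S as the mean power spectral density of a (T, 2) array of flow vectors. The FFT-based estimator and the toy data are assumptions for illustration.

```python
import numpy as np

def sequence_length(sample: np.ndarray) -> int:
    """L: number of steps in the sample's stroke sequence (one row per step)."""
    return sample.shape[0]

def mean_spectral_density(flow: np.ndarray) -> float:
    """S: mean power spectral density of a (T, 2) array of flow vectors,
    averaged over frequencies and both components. This is an illustrative
    reading of the metric; the exact definition is in the report."""
    centered = flow - flow.mean(axis=0)              # remove the DC component
    spectrum = np.fft.rfft(centered, axis=0)         # one spectrum per component
    psd = (np.abs(spectrum) ** 2) / flow.shape[0]    # periodogram estimate
    return float(psd.mean())

# Toy example: a random 50-step flow sequence.
flow = np.random.randn(50, 2)
print(sequence_length(flow), mean_spectral_density(flow))
```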
In our experiment, we find a clear relationship between both measures of data complexity and the number of unused parameters at the end of the first pruning round. We estimate this relationship with two OLS linear models, one per complexity measure, regressing the pruneable parameter count on the complexity value.
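An OLS fit of this kind is straightforward to reproduce with standard tooling; the sketch below regresses the number of pruneable parameters on a single complexity measure using statsmodels, with synthetic numbers standing in for our measurements.

```python
import numpy as np
import statsmodels.api as sm

# Synthetic stand-ins for the real measurements: one complexity value per
# trained model (e.g. sequence length L or spectral density S) and the
# corresponding count of pruneable parameters.
complexity = np.array([20, 35, 50, 65, 80, 95], dtype=float)
pruneable = np.array([9100, 8400, 7800, 7000, 6500, 5800], dtype=float)

X = sm.add_constant(complexity)       # adds the intercept column
fit = sm.OLS(pruneable, X).fit()
print(fit.params)                     # [intercept, slope]
print(fit.rsquared)
```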