Ozzie AI - The Transformer AI Model

The Transformer model became popular after the paper "Attention Is All You Need" was released in 2017. The paper states some amazing things, statements that we simply take for granted these days:

In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention.

The graphical model of the Transformer is shown here:

The attention mechanism is described in the following diagram:
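
In formula form, the scaled dot-product attention and its multi-head extension from the paper can be written as below. Here d_k is the per-head key dimension (called _dHead in the code further down) and h is the number of heads; the symbols follow the paper's notation rather than anything specific to this C# code.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V

\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O},
\qquad
\mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q},\; K W_i^{K},\; V W_i^{V})
```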

NOTE:

In my opinion, C# is an excellent programming language for Machine Learning and AI programming! Why it's not more widely used, I think, comes down to stubborn people not wanting to change! The memory management makes it ideal, and the speed is pretty good nowadays! Truly, if you're using a graph-based model and handing off all matrix calculations to the GPU, then it's no slower than any other language!

Written in C#, a basic outline of the code could look like so:

Attention.cs Class

namespace AI
{



    #region Using Statements:



    using System;
    using System.Linq; // needed for the Any(...) calls in Forward
    using System.Threading;
    using System.Threading.Tasks;



    #endregion




    /// <summary>
    /// The Attention Mechanism supporting Multi-head, Self-Attention and Cross-Attention.
    /// </summary>
    [Serializable]
    public class Attention
    {



        #region Fields:



        /// <summary>
        /// Dimensionality of the model.
        /// </summary>
        private readonly int _dModel;

        /// <summary>
        /// Number of attention heads.
        /// </summary>
        private readonly int _numHeads;

        /// <summary>
        /// Dimensionality per attention head.
        /// </summary>
        private readonly int _dHead;

        /// <summary>
        /// Query weight matrix [dModel, dModel].
        /// </summary>
        private Matrix<double> _Wq;

        /// <summary>
        /// Key weight matrix [dModel, dModel].
        /// </summary>
        private Matrix<double> _Wk;

        /// <summary>
        /// Value weight matrix [dModel, dModel].
        /// </summary>
        private Matrix<double> _Wv;

        /// <summary>
        /// Output weight matrix [dModel, dModel].
        /// </summary>
        private Matrix<double> _Wo;

        /// <summary>
        /// Gradient of query weight matrix.
        /// </summary>
        private Matrix<double> _dWq;

        /// <summary>
        /// Gradient of key weight matrix.
        /// </summary>
        private Matrix<double> _dWk;

        /// <summary>
        /// Gradient of value weight matrix.
        /// </summary>
        private Matrix<double> _dWv;

        /// <summary>
        /// Gradient of output weight matrix.
        /// </summary>
        private Matrix<double> _dWo;

        /// <summary>
        /// Maximum sequence length.
        /// </summary>
        private readonly int _maxSeqLen;

        // Adam optimizer state variables
        /// <summary>
        /// First moment estimate for _Wq.
        /// </summary>
        private Matrix<double> _mWq;

        /// <summary>
        /// Second moment estimate for _Wq.
        /// </summary>
        private Matrix<double> _vWq;

        /// <summary>
        /// First moment estimate for _Wk.
        /// </summary>
        private Matrix<double> _mWk;

        /// <summary>
        /// Second moment estimate for _Wk.
        /// </summary>
        private Matrix<double> _vWk;

        /// <summary>
        /// First moment estimate for _Wv.
        /// </summary>
        private Matrix<double> _mWv;

        /// <summary>
        /// Second moment estimate for _Wv.
        /// </summary>
        private Matrix<double> _vWv;

        /// <summary>
        /// First moment estimate for _Wo.
        /// </summary>
        private Matrix<double> _mWo;

        /// <summary>
        /// Second moment estimate for _Wo.
        /// </summary>
        private Matrix<double> _vWo;



        #endregion



        /// <summary>
        /// Initializes multi-head attention with specified dimensions.
        /// </summary>
        /// <param name="dModel">Dimensionality of the model.</param>
        /// <param name="numHeads">Number of attention heads.</param>
        /// <param name="maxSeqLen">Maximum sequence length.</param>
        /// <exception cref="ArgumentException">Thrown if dModel is not divisible by numHeads.</exception>
        public Attention(int dModel, int numHeads, int maxSeqLen)
        {
            
            if (dModel % numHeads != 0)
                throw new ArgumentException("dModel must be divisible by numHeads.");
            _dModel = dModel;
            _numHeads = numHeads;
            _dHead = dModel / numHeads;
            _maxSeqLen = maxSeqLen;
            _Wq = Matrix<double>.InitializeXavier(_dModel, _dModel);
            _Wk = Matrix<double>.InitializeXavier(_dModel, _dModel);
            _Wv = Matrix<double>.InitializeXavier(_dModel, _dModel);
            _Wo = Matrix<double>.InitializeXavier(_dModel, _dModel);
            // Initialize Adam state
            _mWq = new Matrix<double>(_Wq.Rows, _Wq.Columns);
            _vWq = new Matrix<double>(_Wq.Rows, _Wq.Columns);
            _mWk = new Matrix<double>(_Wk.Rows, _Wk.Columns);
            _vWk = new Matrix<double>(_Wk.Rows, _Wk.Columns);
            _mWv = new Matrix<double>(_Wv.Rows, _Wv.Columns);
            _vWv = new Matrix<double>(_Wv.Rows, _Wv.Columns);
            _mWo = new Matrix<double>(_Wo.Rows, _Wo.Columns);
            _vWo = new Matrix<double>(_Wo.Rows, _Wo.Columns);
        }



        /// <summary>
        /// Forward pass through multi-head attention.
        /// </summary>
        /// <param name="q">Query matrix [batchSize * qSeqLen, dModel].</param>
        /// <param name="k">Key matrix [batchSize * kSeqLen, dModel].</param>
        /// <param name="v">Value matrix [batchSize * kSeqLen, dModel].</param>
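        /// <param name="logger">Optional callback that receives error and warning messages; may be null.</param>
        /// <param name="mask">Optional additive mask [qSeqLen, kSeqLen], e.g., causal mask.</param>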
        /// <returns>Attention output [batchSize * qSeqLen, dModel].</returns>
        public Matrix<double> Forward(Matrix<double> q, Matrix<double> k, Matrix<double> v, Action<string> logger, Matrix<double> mask = null)
        {

            // Input validation
            if (q == null || k == null || v == null || q.Columns != _dModel || k.Columns != _dModel || v.Columns != _dModel)
            {
                logger?.Invoke($"Error: Invalid inputs (q={q?.Rows}x{q?.Columns}, k={k?.Rows}x{k?.Columns}, v={v?.Rows}x{v?.Columns}, expected columns={_dModel}). Returning zero matrix.");
                return new Matrix<double>(q?.Rows ?? 0, _dModel);
            }

            if (q.Rows == 0 || k.Rows == 0 || v.Rows == 0)
            {
                logger?.Invoke($"Warning: Empty inputs (q.Rows={q.Rows}, k.Rows={k.Rows}, v.Rows={v.Rows}). Returning zero matrix.");
                return new Matrix<double>(q.Rows, _dModel);
            }

            // logger?.Invoke($"Input shapes: q=[{q.Rows},{q.Columns}], k=[{k.Rows},{k.Columns}], v=[{v.Rows},{v.Columns}], mask=[{mask?.Rows ?? -1},{mask?.Columns ?? -1}], _maxSeqLen={_maxSeqLen}");

            // Determine sequence lengths and batch size
            int qSeqLen, kSeqLen, batchSize;

            if (mask != null)
            {
                if (mask.Rows <= 0 || mask.Columns <= 0)
                {
                    logger?.Invoke($"Error: Mask invalid (Rows={mask.Rows}, Columns={mask.Columns}). Returning zero matrix.");
                    return new Matrix<double>(q.Rows, _dModel);
                }

                qSeqLen = mask.Rows;
                kSeqLen = mask.Columns;
                batchSize = q.Rows / qSeqLen;

                if (q.Rows % qSeqLen != 0 || batchSize <= 0 || k.Rows != batchSize * kSeqLen || v.Rows != batchSize * kSeqLen)
                {
                    logger?.Invoke($"Error: Dimension mismatch (q.Rows={q.Rows}, k.Rows={k.Rows}, v.Rows={v.Rows}, mask={mask.Rows}x{mask.Columns}). Returning zero matrix.");
                    return new Matrix<double>(q.Rows, _dModel);
                }
            }
            else
            {
                if (_maxSeqLen <= 0)
                {
                    logger?.Invoke($"Error: _maxSeqLen ({_maxSeqLen}) invalid. Returning zero matrix.");
                    return new Matrix<double>(q.Rows, _dModel);
                }

                batchSize = Math.Max(1, Math.Min(q.Rows, k.Rows) / _maxSeqLen);
                qSeqLen = q.Rows / batchSize;
                kSeqLen = k.Rows / batchSize;

                if (qSeqLen <= 0 || qSeqLen > _maxSeqLen || q.Rows % batchSize != 0 ||
                    kSeqLen <= 0 || kSeqLen > _maxSeqLen || k.Rows % batchSize != 0 || v.Rows != k.Rows)
                {
                    logger?.Invoke($"Warning: Inconsistent dimensions (q.Rows={q.Rows}, k.Rows={k.Rows}, v.Rows={v.Rows}, _maxSeqLen={_maxSeqLen}). Falling back to batchSize=1.");
                    batchSize = 1;
                    qSeqLen = q.Rows;
                    kSeqLen = k.Rows;

                    if (qSeqLen > _maxSeqLen || kSeqLen > _maxSeqLen || v.Rows != k.Rows)
                    {
                        logger?.Invoke($"Error: Sequence lengths exceed _maxSeqLen (qSeqLen={qSeqLen}, kSeqLen={kSeqLen}, _maxSeqLen={_maxSeqLen}). Returning zero matrix.");
                        return new Matrix<double>(q.Rows, _dModel);
                    }
                }
            }

            // logger?.Invoke($"Computed: batchSize={batchSize}, qSeqLen={qSeqLen}, kSeqLen={kSeqLen}");

            // Compute Q, K, V
            Matrix<double> Q = q * _Wq;
            Matrix<double> K = k * _Wk;
            Matrix<double> V = v * _Wv;

            if (Q.Rows != q.Rows || K.Rows != k.Rows || V.Rows != v.Rows || Q.Columns != _dModel || K.Columns != _dModel || V.Columns != _dModel)
            {
                logger?.Invoke($"Error: QKV computation failed (Q=[{Q.Rows},{Q.Columns}], K=[{K.Rows},{K.Columns}], V=[{V.Rows},{V.Columns}]). Returning zero matrix.");
                return new Matrix<double>(q.Rows, _dModel);
            }

            var headOutputs = new Matrix<double>[_numHeads];
            Parallel.For(0, _numHeads, h =>
            {
                try
                {
                    var qHead = Q.SubMatrix(0, Q.Rows, h * _dHead, _dHead);
                    var kHead = K.SubMatrix(0, K.Rows, h * _dHead, _dHead);
                    var vHead = V.SubMatrix(0, V.Rows, h * _dHead, _dHead);
                    var output = new Matrix<double>(q.Rows, _dHead);

                    Parallel.For(0, batchSize, b =>
                    {
                        try
                        {
                            var qBatch = qHead.SubMatrix(b * qSeqLen, qSeqLen, 0, _dHead);
                            var kBatch = kHead.SubMatrix(b * kSeqLen, kSeqLen, 0, _dHead);
                            var scoresBatch = (qBatch * kBatch.Transpose()) * (1.0 / Math.Sqrt(_dHead));

                            if (mask != null)
                            {
                                if (scoresBatch.Rows != mask.Rows || scoresBatch.Columns != mask.Columns)
                                {
                                    logger?.Invoke($"Error: Mask size mismatch (scores=[{scoresBatch.Rows},{scoresBatch.Columns}], mask=[{mask.Rows},{mask.Columns}]) in head {h}, batch {b}.");
                                    return;
                                }
                                scoresBatch = scoresBatch + mask;
                            }

                            var weights = Softmax(scoresBatch);
                            var vBatch = vHead.SubMatrix(b * kSeqLen, kSeqLen, 0, _dHead);
                            var outputBatch = weights * vBatch;

                            if (outputBatch.Rows != qSeqLen || outputBatch.Columns != _dHead)
                            {
                                logger?.Invoke($"Error: Output batch size incorrect (outputBatch=[{outputBatch.Rows},{outputBatch.Columns}], expected=[{qSeqLen},{_dHead}]) in head {h}, batch {b}.");
                                return;
                            }

                            for (int i = 0; i < qSeqLen; i++)
                                for (int j = 0; j < _dHead; j++)
                                    output[b * qSeqLen + i, j] = outputBatch[i, j];
                        }
                        catch (Exception ex)
                        {
                            logger?.Invoke($"Error in inner Parallel.For (head={h}, batch={b}): {ex.Message}");
                        }
                    });

                    headOutputs[h] = output;
                }
                catch (Exception ex)
                {
                    logger?.Invoke($"Error in outer Parallel.For (head={h}): {ex.Message}");
                    headOutputs[h] = null;
                }
            });

            if (headOutputs.Any(h => h == null))
            {
                logger?.Invoke("Error: One or more attention heads failed to compute. Returning zero matrix.");
                return new Matrix<double>(q.Rows, _dModel);
            }

            var combined = ConcatenateHeads(headOutputs);
            var result = combined * _Wo;

            if (result.Rows != q.Rows || result.Columns != _dModel)
            {
                logger?.Invoke($"Error: Final output shape incorrect (result=[{result.Rows},{result.Columns}], expected=[{q.Rows},{_dModel}]). Returning zero matrix.");
                return new Matrix<double>(q.Rows, _dModel);
            }

            if (result.Any(double.IsNaN) || result.Any(double.IsInfinity))
            {
                logger?.Invoke("Warning: Output contains NaN or Infinity values.");
            }

            // logger?.Invoke($"Output shape: [{result.Rows},{result.Columns}]");
        
            return result;
        }



        /// <summary>
        /// Backward pass through multi-head attention.
        /// </summary>
        /// <param name="dOutput">Gradient w.r.t. output [batchSize * qSeqLen, dModel].</param>
        /// <param name="q">Query matrix [batchSize * qSeqLen, dModel].</param>
        /// <param name="k">Key matrix [batchSize * kSeqLen, dModel].</param>
        /// <param name="v">Value matrix [batchSize * kSeqLen, dModel].</param>
        /// <param name="mask">Optional mask [qSeqLen, kSeqLen].</param>
        /// <returns>Tuple of gradients w.r.t. query, key, and value matrices.</returns>
        public (Matrix<double> dQ, Matrix<double> dK, Matrix<double> dV) Backward(
            Matrix<double> dOutput, Matrix<double> q, Matrix<double> k, Matrix<double> v, Matrix<double> mask = null)
        {
            
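            // Mirror Forward: batchSize must never drop to 0 (e.g. when q has fewer rows
            // than _maxSeqLen), otherwise the divisions below would throw.
            int batchSize = mask != null
                ? Math.Max(1, q.Rows / mask.Rows)
                : Math.Max(1, Math.Min(q.Rows, k.Rows) / _maxSeqLen);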
            int qSeqLen = q.Rows / batchSize;
            int kSeqLen = k.Rows / batchSize;

            var dConcat = dOutput * _Wo.Transpose();

            var heads = new Matrix<double>[_numHeads];
            Parallel.For(0, _numHeads, h =>
            {
                heads[h] = dConcat.SubMatrix(0, dConcat.Rows, h * _dHead, _dHead);
            });

            var dQProj = new Matrix<double>(q.Rows, _dModel);
            var dKProj = new Matrix<double>(k.Rows, _dModel);
            var dVProj = new Matrix<double>(v.Rows, _dModel);

            // Concatenated head outputs of the forward pass, recomputed here because the
            // gradient of _Wo needs the activations (concat^T * dOutput), not dConcat.
            var concat = new Matrix<double>(q.Rows, _dModel);

            var Q = q * _Wq;
            var K = k * _Wk;
            var V = v * _Wv;

            Parallel.For(0, _numHeads, h =>
            {
                var dHead = heads[h];
                var qHead = Q.SubMatrix(0, Q.Rows, h * _dHead, _dHead);
                var kHead = K.SubMatrix(0, K.Rows, h * _dHead, _dHead);
                var vHead = V.SubMatrix(0, V.Rows, h * _dHead, _dHead);

                for (int b = 0; b < batchSize; b++)
                {
                    int startIdxQ = b * qSeqLen;
                    int startIdxK = b * kSeqLen;
                    var dHeadBatch = dHead.SubMatrix(startIdxQ, qSeqLen, 0, _dHead);
                    var qHeadBatch = qHead.SubMatrix(startIdxQ, qSeqLen, 0, _dHead);
                    var kHeadBatch = kHead.SubMatrix(startIdxK, kSeqLen, 0, _dHead);
                    var vHeadBatch = vHead.SubMatrix(startIdxK, kSeqLen, 0, _dHead);

                    var scoresBatch = (qHeadBatch * kHeadBatch.Transpose()) * (1.0 / Math.Sqrt(_dHead));
                    if (mask != null)
                        scoresBatch = scoresBatch + mask;
                    var attnBatch = Softmax(scoresBatch);

                    // Forward-pass output of this head and batch, needed for _dWo.
                    var headOutBatch = attnBatch * vHeadBatch;

                    var dVHeadBatch = attnBatch.Transpose() * dHeadBatch;
                    var dAttnBatch = dHeadBatch * vHeadBatch.Transpose();
                    var dScoresBatch = SoftmaxBackward(dAttnBatch, attnBatch);

                    var dQHeadBatch = (dScoresBatch * kHeadBatch) * (1.0 / Math.Sqrt(_dHead));
                    var dKHeadBatch = (dScoresBatch.Transpose() * qHeadBatch) * (1.0 / Math.Sqrt(_dHead));

                    lock (dQProj)
                    {
                        for (int i = 0; i < qSeqLen; i++)
                            for (int j = 0; j < _dHead; j++)
                            {
                                dQProj[startIdxQ + i, h * _dHead + j] += dQHeadBatch[i, j];
                                concat[startIdxQ + i, h * _dHead + j] = headOutBatch[i, j];
                            }
                    }
                    lock (dKProj)
                    {
                        for (int i = 0; i < kSeqLen; i++)
                            for (int j = 0; j < _dHead; j++)
                                dKProj[startIdxK + i, h * _dHead + j] += dKHeadBatch[i, j];
                    }
                    lock (dVProj)
                    {
                        for (int i = 0; i < kSeqLen; i++)
                            for (int j = 0; j < _dHead; j++)
                                dVProj[startIdxK + i, h * _dHead + j] += dVHeadBatch[i, j];
                    }
                }
            });

            var dQ = dQProj * _Wq.Transpose();
            var dK = dKProj * _Wk.Transpose();
            var dV = dVProj * _Wv.Transpose();

            _dWq = q.Transpose() * dQProj;
            _dWk = k.Transpose() * dKProj;
            _dWv = v.Transpose() * dVProj;
            _dWo = concat.Transpose() * dOutput; // [dModel, dModel]

            return (dQ, dK, dV);
        }


        /// <summary>
        /// Updates attention weight matrices with Adam optimization.
        /// </summary>
        /// <param name="adam">The Adam optimizer instance.</param>
        /// <param name="t">The current timestep for bias correction.</param>
        public void UpdateParametersWithAdam(AdamOptimizer adam, int t)
        {
            
            (_mWq, _vWq, _Wq) = adam.Update(_Wq, _dWq, _mWq, _vWq, t);
            (_mWk, _vWk, _Wk) = adam.Update(_Wk, _dWk, _mWk, _vWk, t);
            (_mWv, _vWv, _Wv) = adam.Update(_Wv, _dWv, _mWv, _vWv, t);
            (_mWo, _vWo, _Wo) = adam.Update(_Wo, _dWo, _mWo, _vWo, t);
        }



        /// <summary>
        /// Computes the gradient norm for attention weights.
        /// </summary>
        /// <returns>Gradient norm as a scalar value.</returns>
        public double GetGradientNorm() => Math.Sqrt(_dWq.ElementWiseMultiply(_dWq).Sum() +
                                                    _dWk.ElementWiseMultiply(_dWk).Sum() +
                                                    _dWv.ElementWiseMultiply(_dWv).Sum() +
                                                    _dWo.ElementWiseMultiply(_dWo).Sum());



        /// <summary>
        /// Concatenates outputs from all attention heads.
        /// </summary>
        /// <param name="heads">Array of head outputs [numHeads][batchSize * qSeqLen, dHead].</param>
        /// <returns>Concatenated matrix [batchSize * qSeqLen, dModel].</returns>
        private Matrix<double> ConcatenateHeads(Matrix<double>[] heads)
        {
            
            var result = new Matrix<double>(heads[0].Rows, _dModel);
            Parallel.For(0, heads[0].Rows, i =>
            {
                for (int h = 0; h < _numHeads; h++)
                    for (int j = 0; j < _dHead; j++)
                        result[i, h * _dHead + j] = heads[h][i, j];
            });
            return result;
        }



        /// <summary>
        /// Applies softmax to attention scores along each row.
        /// </summary>
        /// <param name="input">Attention scores [qSeqLen, kSeqLen].</param>
        /// <returns>Softmax probabilities [qSeqLen, kSeqLen].</returns>
        private Matrix<double> Softmax(Matrix<double> input)
        {
           
            var result = new Matrix<double>(input.Rows, input.Columns);
            Parallel.For(0, input.Rows, i =>
            {
                double max = input.Row(i).Max();
                var exp = input.Row(i).AddScalar(-max).Exp();
                double sum = exp.Sum() + 1e-10; // Small epsilon guards against division by zero
                for (int j = 0; j < input.Columns; j++)
                    result[i, j] = exp[0, j] / sum;
            });
            return result;
        }



        /// <summary>
        /// Computes the gradient of the softmax operation.
        /// </summary>
        /// <param name="dOutput">Gradient w.r.t. softmax output [qSeqLen, kSeqLen].</param>
        /// <param name="probs">Softmax probabilities [qSeqLen, kSeqLen].</param>
        /// <returns>Gradient w.r.t. softmax input [qSeqLen, kSeqLen].</returns>
        private Matrix<double> SoftmaxBackward(Matrix<double> dOutput, Matrix<double> probs)
        {
        
            var dScores = new Matrix<double>(probs.Rows, probs.Columns);
            Parallel.For(0, probs.Rows, i =>
            {
                for (int j = 0; j < probs.Columns; j++)
                {
                    double sum = 0;
                    for (int k = 0; k < probs.Columns; k++)
                        sum += dOutput[i, k] * probs[i, k] * ((j == k ? 1 : 0) - probs[i, j]);
                    dScores[i, j] = sum;
                }
            });
            return dScores;
        }



        /// <summary>
        /// Scales all gradients in the multi-head attention mechanism, including query, key, value, 
        /// and output weight gradients.
        /// </summary>
        /// <param name="scale">The scaling factor to apply to all gradients (e.g., for gradient clipping).</param>
        /// <exception cref="ArgumentException">Thrown if scale is NaN, infinite, or negative.</exception>
        public void ScaleGradients(double scale)
        {
            
            if (double.IsNaN(scale) || double.IsInfinity(scale) || scale < 0)
                throw new ArgumentException("Scale must be a non-negative finite number.", nameof(scale));

            _dWq *= scale;
            _dWk *= scale;
            _dWv *= scale;
            _dWo *= scale;
        }
    }
}
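
To see the core of the Forward method without the batching, head splitting and logging, here is a minimal single-head sketch on plain double[,] arrays. It is an illustration only and not part of the classes above: the class name AttentionSketch and its methods are made up for this example, and it assumes an additive mask (0 for allowed positions, negative infinity for blocked ones), which is one common way to build a causal mask.

```csharp
using System;

// Hypothetical helper for illustration; not part of the AI namespace above.
public static class AttentionSketch
{
    // Row-wise softmax with max subtraction for numerical stability.
    public static double[,] Softmax(double[,] scores)
    {
        int rows = scores.GetLength(0), cols = scores.GetLength(1);
        var result = new double[rows, cols];
        for (int i = 0; i < rows; i++)
        {
            double max = double.NegativeInfinity;
            for (int j = 0; j < cols; j++) max = Math.Max(max, scores[i, j]);
            double sum = 0.0;
            for (int j = 0; j < cols; j++)
            {
                result[i, j] = Math.Exp(scores[i, j] - max);
                sum += result[i, j];
            }
            for (int j = 0; j < cols; j++) result[i, j] /= sum;
        }
        return result;
    }

    // Attention weights softmax(Q * K^T / sqrt(d) + mask) for one head; mask may be null.
    public static double[,] Weights(double[,] q, double[,] k, double[,] mask)
    {
        int n = q.GetLength(0), m = k.GetLength(0), d = q.GetLength(1);
        double scale = 1.0 / Math.Sqrt(d);
        var scores = new double[n, m];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < m; j++)
            {
                double dot = 0.0;
                for (int c = 0; c < d; c++) dot += q[i, c] * k[j, c];
                scores[i, j] = dot * scale + (mask != null ? mask[i, j] : 0.0);
            }
        return Softmax(scores);
    }

    // Head output = weights * V.
    public static double[,] Apply(double[,] weights, double[,] v)
    {
        int n = weights.GetLength(0), m = weights.GetLength(1), d = v.GetLength(1);
        var output = new double[n, d];
        for (int i = 0; i < n; i++)
            for (int c = 0; c < d; c++)
            {
                double s = 0.0;
                for (int j = 0; j < m; j++) s += weights[i, j] * v[j, c];
                output[i, c] = s;
            }
        return output;
    }

    // Additive causal mask: 0 on and below the diagonal, negative infinity above it.
    public static double[,] CausalMask(int seqLen)
    {
        var mask = new double[seqLen, seqLen];
        for (int i = 0; i < seqLen; i++)
            for (int j = i + 1; j < seqLen; j++)
                mask[i, j] = double.NegativeInfinity;
        return mask;
    }
}
```

With a causal mask, the first position can only attend to itself, so its weight row puts all of its mass on the first key and the first output row is simply the first row of V. Every row of the weight matrix sums to 1.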

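The double loop in SoftmaxBackward evaluates the softmax Jacobian directly. Writing p for one row of probs, g for the matching row of dOutput and s for the pre-softmax scores, the same quantity has a well-known closed form. This is a standard identity, shown here as a worked step rather than as a change to the code:

```latex
\frac{\partial L}{\partial s_j}
  = \sum_{k} g_k \, p_k \, (\delta_{jk} - p_j)
  = p_j \, g_j - p_j \sum_{k} g_k \, p_k
  = p_j \Big( g_j - \sum_{k} g_k \, p_k \Big)
```

The inner sum does not depend on j, so it can be computed once per row, which brings the cost per row down from quadratic to linear in the number of columns.
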
FeedForward.cs Class

namespace AI
{
    #region Using Statements:

    using System;
    using System.Threading.Tasks;

    #endregion

    /// <summary>
    /// Feed-forward network with two linear layers and ReLU activation.
    /// </summary>
    [Serializable]
    public class FeedForward
    {



        #region Fields:



        /// <summary>
        /// First layer weight matrix [dModel, dFF].
        /// </summary>
        private Matrix<double> _w1;



        /// <summary>
        /// First layer bias [1, dFF].
        /// </summary>
        private Matrix<double> _b1;



        /// <summary>
        /// Second layer weight matrix [dFF, dModel].
        /// </summary>
        private Matrix<double> _w2;



        /// <summary>
        /// Second layer bias [1, dModel].
        /// </summary>
        private Matrix<double> _b2;



        /// <summary>
        /// Gradient of first layer weights.
        /// </summary>
        private Matrix<double> _dW1;



        /// <summary>
        /// Gradient of first layer bias.
        /// </summary>
        private Matrix<double> _dB1;



        /// <summary>
        /// Gradient of second layer weights.
        /// </summary>
        private Matrix<double> _dW2;



        /// <summary>
        /// Gradient of second layer bias.
        /// </summary>
        private Matrix<double> _dB2;



        /// <summary>
        /// Cached hidden layer output [batchSize * seqLen, dFF].
        /// </summary>
        private Matrix<double> _hidden;



        /// <summary>
        /// Cached input [batchSize * seqLen, dModel].
        /// </summary>
        private Matrix<double> _input;



        // Adam optimizer state variables
        /// <summary>
        /// First moment estimate for _w1.
        /// </summary>
        private Matrix<double> _mW1;



        /// <summary>
        /// Second moment estimate for _w1.
        /// </summary>
        private Matrix<double> _vW1;



        /// <summary>
        /// First moment estimate for _b1.
        /// </summary>
        private Matrix<double> _mb1;



        /// <summary>
        /// Second moment estimate for _b1.
        /// </summary>
        private Matrix<double> _vb1;



        /// <summary>
        /// First moment estimate for _w2.
        /// </summary>
        private Matrix<double> _mW2;



        /// <summary>
        /// Second moment estimate for _w2.
        /// </summary>
        private Matrix<double> _vW2;



        /// <summary>
        /// First moment estimate for _b2.
        /// </summary>
        private Matrix<double> _mb2;



        /// <summary>
        /// Second moment estimate for _b2.
        /// </summary>
        private Matrix<double> _vb2;



        #endregion



        #region Properties:



        #endregion




        /// <summary>
        /// Initializes the feed-forward network.
        /// </summary>
        /// <param name="dModel">Input and output dimensionality.</param>
        /// <param name="dFF">Hidden layer dimensionality.</param>
        public FeedForward(int dModel, int dFF)
        {
            
            _w1 = Matrix<double>.InitializeHe(dModel, dFF);
            _b1 = Matrix<double>.Random(1, dFF, 0.01);
            _w2 = Matrix<double>.InitializeHe(dFF, dModel);
            _b2 = Matrix<double>.Random(1, dModel, 0.01);
            // Initialize Adam state
            _mW1 = new Matrix<double>(_w1.Rows, _w1.Columns);
            _vW1 = new Matrix<double>(_w1.Rows, _w1.Columns);
            _mb1 = new Matrix<double>(_b1.Rows, _b1.Columns);
            _vb1 = new Matrix<double>(_b1.Rows, _b1.Columns);
            _mW2 = new Matrix<double>(_w2.Rows, _w2.Columns);
            _vW2 = new Matrix<double>(_w2.Rows, _w2.Columns);
            _mb2 = new Matrix<double>(_b2.Rows, _b2.Columns);
            _vb2 = new Matrix<double>(_b2.Rows, _b2.Columns);
        }



        /// <summary>
        /// Forward pass through the feed-forward network.
        /// </summary>
        /// <param name="input">Input matrix [batchSize * seqLen, dModel].</param>
        /// <returns>Output matrix [batchSize * seqLen, dModel].</returns>
        public Matrix<double> Forward(Matrix<double> input)
        {
            
            _input = input;
            _hidden = (input * _w1).AddBias(_b1);
            _hidden = ReLU(_hidden);
            return (_hidden * _w2).AddBias(_b2);
        }



        /// <summary>
        /// Backward pass through the feed-forward network.
        /// </summary>
        /// <param name="dOutput">Gradient w.r.t. output [batchSize * seqLen, dModel].</param>
        /// <returns>Gradient w.r.t. input [batchSize * seqLen, dModel].</returns>
        public Matrix<double> Backward(Matrix<double> dOutput)
        {
            
            _dW2 = _hidden.Transpose() * dOutput; // [dFF, dModel]
            _dB2 = dOutput.SumRows(); // [1, dModel], matches _b2
            var dHidden = dOutput * _w2.Transpose(); // [batchSize * seqLen, dFF]
            var dPreActivation = dHidden.ElementWiseMultiply(ReLUBackward(_hidden)); // [batchSize * seqLen, dFF]
            _dW1 = _input.Transpose() * dPreActivation; // [dModel, dFF]

            // Bias gradient: sum the pre-activation gradient over all rows.
            _dB1 = dPreActivation.SumRows(); // [1, dFF]
            if (_dB1.Rows != _b1.Rows || _dB1.Columns != _b1.Columns)
                throw new InvalidOperationException("Bias gradient _dB1 shape mismatch");

            return dPreActivation * _w1.Transpose(); // [batchSize * seqLen, dModel]
        }



        /// <summary>
        /// Updates feed-forward parameters with Adam optimization.
        /// </summary>
        /// <param name="adam">The Adam optimizer instance.</param>
        /// <param name="t">The current timestep for bias correction.</param>
        public void UpdateParametersWithAdam(AdamOptimizer adam, int t)
        {
            
            (_mW1, _vW1, _w1) = adam.Update(_w1, _dW1, _mW1, _vW1, t); // [dModel, dFF]
            (_mb1, _vb1, _b1) = adam.Update(_b1, _dB1, _mb1, _vb1, t); // [1, dFF]
            (_mW2, _vW2, _w2) = adam.Update(_w2, _dW2, _mW2, _vW2, t); // [dFF, dModel]
            (_mb2, _vb2, _b2) = adam.Update(_b2, _dB2, _mb2, _vb2, t); // [1, dModel]
        }



        /// <summary>
        /// Computes the gradient norm for this layer.
        /// </summary>
        /// <returns>Gradient norm as a scalar value.</returns>
        public double GetGradientNorm() => Math.Sqrt(_dW1.ElementWiseMultiply(_dW1).Sum() +
                                                    _dB1.ElementWiseMultiply(_dB1).Sum() +
                                                    _dW2.ElementWiseMultiply(_dW2).Sum() +
                                                    _dB2.ElementWiseMultiply(_dB2).Sum());



        /// <summary>
        /// Applies ReLU activation element-wise.
        /// </summary>
        /// <param name="input">Input matrix.</param>
        /// <returns>Matrix with ReLU applied.</returns>
        private Matrix<double> ReLU(Matrix<double> input)
        {
        
            var result = input.Clone();
            Parallel.For(0, input.Rows, i =>
            {
                for (int j = 0; j < input.Columns; j++)
                    result[i, j] = Math.Max(0, result[i, j]);
            });
            return result;
        }



        /// <summary>
        /// Computes the gradient of the ReLU activation.
        /// </summary>
        /// <param name="input">Pre- or post-ReLU activations; the sign test yields the same mask for either.</param>
        /// <returns>Gradient matrix (1 if input > 0, 0 otherwise).</returns>
        private Matrix<double> ReLUBackward(Matrix<double> input)
        {
            
            var result = new Matrix<double>(input.Rows, input.Columns);
            Parallel.For(0, input.Rows, i =>
            {
                for (int j = 0; j < input.Columns; j++)
                    result[i, j] = input[i, j] > 0 ? 1.0 : 0.0;
            });
            return result;
        }



        /// <summary>
        /// Scales all gradients in the feed-forward network, including weights and biases for both layers.
        /// </summary>
        /// <param name="scale">The scaling factor to apply to all gradients (e.g., for gradient clipping).</param>
        /// <exception cref="ArgumentException">Thrown if scale is NaN, infinite, or negative.</exception>
        public void ScaleGradients(double scale)
        {
            
            if (double.IsNaN(scale) || double.IsInfinity(scale) || scale < 0)
                throw new ArgumentException("Scale must be a non-negative finite number.", nameof(scale));

            _dW1 *= scale;
            _dB1 *= scale;
            _dW2 *= scale;
            _dB2 *= scale;
        }
    }
}
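
A quick way to convince yourself that a backward pass like the one above is right is a finite-difference check. The sketch below redoes the forward pass and the first-layer weight gradient for a single input row on plain arrays, so it runs without the Matrix class. The class name FeedForwardSketch and its methods are hypothetical and exist only for this example; the loss is assumed to be the sum of the outputs, so dY is all ones.

```csharp
using System;

// Hypothetical helper for illustration; not part of the AI namespace above.
public static class FeedForwardSketch
{
    // y = max(0, x * W1 + b1) * W2 + b2 for a single input row x.
    public static double[] Forward(double[] x, double[,] w1, double[] b1,
                                   double[,] w2, double[] b2, out double[] hidden)
    {
        int dModel = x.Length, dFF = b1.Length;
        hidden = new double[dFF];
        for (int j = 0; j < dFF; j++)
        {
            double s = b1[j];
            for (int i = 0; i < dModel; i++) s += x[i] * w1[i, j];
            hidden[j] = Math.Max(0.0, s); // ReLU
        }
        var y = new double[b2.Length];
        for (int o = 0; o < y.Length; o++)
        {
            double s = b2[o];
            for (int j = 0; j < dFF; j++) s += hidden[j] * w2[j, o];
            y[o] = s;
        }
        return y;
    }

    // Gradient of the loss w.r.t. W1, given dY = dLoss/dy.
    // The post-ReLU hidden values act as the gate, as in the Backward method above.
    public static double[,] GradW1(double[] x, double[] hidden, double[,] w2, double[] dY)
    {
        int dModel = x.Length, dFF = hidden.Length;
        var dW1 = new double[dModel, dFF];
        for (int j = 0; j < dFF; j++)
        {
            double dHidden = 0.0;
            for (int o = 0; o < dY.Length; o++) dHidden += dY[o] * w2[j, o];
            double dPre = hidden[j] > 0 ? dHidden : 0.0; // ReLU passes gradient only where active
            for (int i = 0; i < dModel; i++) dW1[i, j] = x[i] * dPre;
        }
        return dW1;
    }

    // Central finite difference of sum(y) w.r.t. one entry of W1.
    public static double NumericGradW1(double[] x, double[,] w1, double[] b1,
                                       double[,] w2, double[] b2, int i, int j, double eps)
    {
        double original = w1[i, j];
        w1[i, j] = original + eps;
        double plus = Sum(Forward(x, w1, b1, w2, b2, out _));
        w1[i, j] = original - eps;
        double minus = Sum(Forward(x, w1, b1, w2, b2, out _));
        w1[i, j] = original; // restore the weight
        return (plus - minus) / (2 * eps);
    }

    private static double Sum(double[] values)
    {
        double s = 0.0;
        foreach (double value in values) s += value;
        return s;
    }
}
```

Comparing GradW1 against NumericGradW1 for a few entries should give values that agree to several decimal places; a hidden unit that ReLU switched off should show a gradient of zero in both.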

DecoderLayer.cs Class

namespace AI
{
    #region Using Statements:

    using System;
    using System.Threading;

    #endregion

    /// <summary>
    /// Decoder layer with masked self-attention, cross-attention, and feed-forward network.
    /// </summary>
    [Serializable]
    public class DecoderLayer
    {



        #region Fields:



        /// <summary>
        /// Multi-head self-attention for the decoder input.
        /// </summary>
        private readonly Attention _selfAttention;



        /// <summary>
        /// Multi-head attention for encoder-decoder cross-attention.
        /// </summary>
        private readonly Attention _encDecAttention;



        /// <summary>
        /// Feed-forward network.
        /// </summary>
        private readonly FeedForward _ffn;



        /// <summary>
        /// First layer normalization after self-attention.
        /// </summary>
        private readonly LayerNormalization _norm1;



        /// <summary>
        /// Second layer normalization after cross-attention.
        /// </summary>
        private readonly LayerNormalization _norm2;



        /// <summary>
        /// Third layer normalization after feed-forward network.
        /// </summary>
        private readonly LayerNormalization _norm3;



        /// <summary>
        /// Dropout rate for regularization.
        /// </summary>
        private readonly double _dropoutRate;



        /// <summary>
        /// A thread-safe random number generator for dropout and initialization.
        /// </summary>
        [NonSerialized]
        private readonly ThreadLocal<Random> _rand = new ThreadLocal<Random>(() => new Random());



        /// <summary>
        /// Cached input for self-attention backward pass [batchSize * targetSeqLen, dModel].
        /// </summary>
        private Matrix<double> _lastInput;



        /// <summary>
        /// Cached output of first normalization for cross-attention backward pass [batchSize * targetSeqLen, dModel].
        /// </summary>
        private Matrix<double> _lastNorm1Output;



        #endregion



        #region Properties:



        #endregion



        /// <summary>
        /// Initializes a decoder layer with attention and feed-forward components.
        /// </summary>
        /// <param name="dModel">Dimensionality of embeddings.</param>
        /// <param name="numHeads">Number of attention heads.</param>
        /// <param name="maxSeqLen">Maximum sequence length.</param>
        /// <param name="dFF">Feed-forward network hidden size.</param>
        /// <param name="dropoutRate">Dropout probability.</param>
        /// <param name="rand">Random number generator.</param>
        public DecoderLayer(int dModel, int numHeads, int maxSeqLen, int dFF, double dropoutRate, ThreadLocal<Random> rand)
        {
        
            _selfAttention = new Attention(dModel, numHeads, maxSeqLen);
            _encDecAttention = new Attention(dModel, numHeads, maxSeqLen);
            _ffn = new FeedForward(dModel, dFF);
            _norm1 = new LayerNormalization(dModel);
            _norm2 = new LayerNormalization(dModel);
            _norm3 = new LayerNormalization(dModel);
            _dropoutRate = dropoutRate;
            _rand = rand;
        }



        /// <summary>
        /// Forward pass through the decoder layer.
        /// </summary>
        /// <param name="input">Decoder input [batchSize * targetSeqLen, dModel].</param>
        /// <param name="encoderOutput">Encoder output [batchSize * inputSeqLen, dModel].</param>
        /// <param name="logger">Logging callback passed through to the attention layers.</param>
        /// <param name="mask">Causal mask [targetSeqLen, targetSeqLen].</param>
        /// <returns>Output matrix [batchSize * targetSeqLen, dModel].</returns>
        public Matrix<double> Forward(Matrix<double> input, Matrix<double> encoderOutput, Action<string> logger, Matrix<double> mask)
        {
            
            _lastInput = input;
            var selfAttn = _selfAttention.Forward(input, input, input, logger, mask); // Masked self-attention
            var norm1Output = _norm1.Forward(input + ApplyDropout(selfAttn));
            _lastNorm1Output = norm1Output;
            var encDecAttn = _encDecAttention.Forward(norm1Output, encoderOutput, encoderOutput, logger); // Cross-attention
            var norm2Output = _norm2.Forward(norm1Output + ApplyDropout(encDecAttn));
            var ffnOutput = _ffn.Forward(norm2Output);
            return _norm3.Forward(norm2Output + ApplyDropout(ffnOutput));
        }



        /// <summary>
        /// Backward pass through the decoder layer.
        /// </summary>
        /// <param name="dOutput">Gradient w.r.t. output [batchSize * targetSeqLen, dModel].</param>
        /// <param name="input">Decoder input [batchSize * targetSeqLen, dModel]. Currently unused; the input cached by Forward is used instead.</param>
        /// <param name="encoderOutput">Encoder output [batchSize * inputSeqLen, dModel].</param>
        /// <param name="mask">Causal mask [targetSeqLen, targetSeqLen].</param>
        /// <returns>Tuple of gradients w.r.t. input and encoder output.</returns>
        public (Matrix<double> dInput, Matrix<double> dEncoderOutput) Backward(Matrix<double> dOutput, Matrix<double> input, Matrix<double> encoderOutput, Matrix<double> mask)
        {
            
            var dNorm3 = _norm3.Backward(dOutput);
            var dFFN = _ffn.Backward(dNorm3);
            var dNorm2 = _norm2.Backward(dNorm3 + dFFN);
            var (dQ_encdec, dK_encdec, dV_encdec) = _encDecAttention.Backward(
                dNorm2, _lastNorm1Output, encoderOutput, encoderOutput);
            var dNorm1 = _norm1.Backward(dNorm2 + dQ_encdec);
            var (dQ_self, dK_self, dV_self) = _selfAttention.Backward(
                dNorm1, _lastInput, _lastInput, _lastInput, mask);
            var dInput = dNorm1 + dQ_self + dK_self + dV_self;
            var dEncoderOutput = dK_encdec + dV_encdec;
            return (dInput, dEncoderOutput);
        }



        /// <summary>
        /// Updates parameters of attention, feed-forward, and normalization components with Adam optimization.
        /// </summary>
        /// <param name="adam">The Adam optimizer instance.</param>
        /// <param name="t">The current timestep for bias correction.</param>
        public void UpdateParametersWithAdam(AdamOptimizer adam, int t)
        {
           
            _selfAttention.UpdateParametersWithAdam(adam, t);
            _encDecAttention.UpdateParametersWithAdam(adam, t);
            _ffn.UpdateParametersWithAdam(adam, t);
            _norm1.UpdateParametersWithAdam(adam, t);
            _norm2.UpdateParametersWithAdam(adam, t);
            _norm3.UpdateParametersWithAdam(adam, t);
        }



        /// <summary>
        /// Computes the gradient norm for this layer.
        /// </summary>
        /// <returns>Gradient norm as a scalar value.</returns>
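        public double GetGradientNorm()
        {

            // Component norms are assumed to be L2 norms (as in LayerNormalization),
            // so they are combined as the square root of the sum of their squares.
            double sumSq = 0;
            foreach (double n in new[] { _selfAttention.GetGradientNorm(), _encDecAttention.GetGradientNorm(),
                                         _ffn.GetGradientNorm(), _norm1.GetGradientNorm(),
                                         _norm2.GetGradientNorm(), _norm3.GetGradientNorm() })
                sumSq += n * n;
            return Math.Sqrt(sumSq);
        }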



        /// <summary>
        /// Applies dropout to the input matrix during training.
        /// </summary>
        /// <param name="input">Input matrix to apply dropout to.</param>
        /// <returns>Matrix with dropout applied.</returns>
        private Matrix<double> ApplyDropout(Matrix<double> input)
        {

            if (!Xformer.IsTraining) return input;
            var result = input.Clone();
            Parallel.For(0, input.Rows, i =>
            {
                for (int j = 0; j < input.Columns; j++)
                    if (_rand.Value.NextDouble() < _dropoutRate)
                        result[i, j] = 0;
                    else
                        result[i, j] /= 1 - _dropoutRate;
            });
            return result;
        }



        /// <summary>
        /// Scales all gradients in the decoder layer, including self-attention, encoder-decoder attention, 
        /// feed-forward, and normalization components.
        /// </summary>
        /// <param name="scale">The scaling factor to apply to all gradients (e.g., for gradient clipping).</param>
        /// <exception cref="ArgumentException">Thrown if scale is NaN, infinite, or negative.</exception>
        public void ScaleGradients(double scale)
        {

            if (double.IsNaN(scale) || double.IsInfinity(scale) || scale < 0)
                throw new ArgumentException("Scale must be a non-negative finite number.", nameof(scale));

            _selfAttention.ScaleGradients(scale);
            _encDecAttention.ScaleGradients(scale);
            _ffn.ScaleGradients(scale);
            _norm1.ScaleGradients(scale);
            _norm2.ScaleGradients(scale);
            _norm3.ScaleGradients(scale);
        }
    }
}
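
ApplyDropout above uses what is usually called inverted dropout: dropped elements become zero and kept elements are divided by 1 - rate. Below is a minimal sketch of the same idea on a plain array. It also returns the mask, on the assumption that a backward pass would want to reuse it. DropoutSketch and Apply are hypothetical names, not part of the original classes:

```csharp
using System;

public static class DropoutSketch
{
    // Inverted dropout on a plain array. Returns the output together with the
    // mask, so a backward pass could multiply incoming gradients by the same mask.
    public static (double[] output, double[] mask) Apply(double[] input, double rate, Random rng)
    {
        if (double.IsNaN(rate) || rate < 0 || rate >= 1)
            throw new ArgumentOutOfRangeException(nameof(rate), "Rate must be in [0, 1).");

        double keepScale = 1.0 / (1.0 - rate);
        var output = new double[input.Length];
        var mask = new double[input.Length];
        for (int i = 0; i < input.Length; i++)
        {
            // Dropped elements get mask 0; kept elements are scaled up so the
            // expected value of the output matches the input.
            mask[i] = rng.NextDouble() < rate ? 0.0 : keepScale;
            output[i] = input[i] * mask[i];
        }
        return (output, mask);
    }
}
```

The range check also rules out a rate of exactly 1, which would otherwise divide by zero.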

EncoderLayer.cs Class

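namespace AI
{



    #region Using Statements:



    using System;
    using System.Threading;
    using System.Threading.Tasks;



    #endregion


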
    /// <summary>
    /// Encoder layer implementing multi-head self-attention and feed-forward network.
    /// </summary>
    [Serializable]
    public class EncoderLayer
    {



        #region Fields:



        /// <summary>
        /// Multi-head self-attention mechanism.
        /// </summary>
        private readonly Attention _attention;



        /// <summary>
        /// Feed-forward network.
        /// </summary>
        private readonly FeedForward _ffn;



        /// <summary>
        /// First layer normalization applied after attention.
        /// </summary>
        private readonly LayerNormalization _norm1;



        /// <summary>
        /// Second layer normalization applied after feed-forward network.
        /// </summary>
        private readonly LayerNormalization _norm2;



        /// <summary>
        /// Dropout rate for regularization.
        /// </summary>
        private readonly double _dropoutRate;



        /// <summary>
        /// A thread-safe random number generator for dropout and initialization.
        /// </summary>
        [NonSerialized]
        private readonly ThreadLocal<Random> _rand = new ThreadLocal<Random>(() => new Random());



        /// <summary>
        /// Cached input for backward pass [batchSize * seqLen, dModel].
        /// </summary>
        private Matrix<double> _lastInput;



        #endregion



        #region Properties:



        #endregion



        /// <summary>
        /// Initializes an encoder layer with attention and feed-forward components.
        /// </summary>
        /// <param name="dModel">Dimensionality of embeddings.</param>
        /// <param name="numHeads">Number of attention heads.</param>
        /// <param name="maxSeqLen">Maximum sequence length.</param>
        /// <param name="dFF">Feed-forward network hidden size.</param>
        /// <param name="dropoutRate">Dropout probability.</param>
        /// <param name="rand">Random number generator.</param>
        public EncoderLayer(int dModel, int numHeads, int maxSeqLen, int dFF, double dropoutRate, ThreadLocal<Random> rand)
        {
            
            _attention = new Attention(dModel, numHeads, maxSeqLen);
            _ffn = new FeedForward(dModel, dFF);
            _norm1 = new LayerNormalization(dModel);
            _norm2 = new LayerNormalization(dModel);
            _dropoutRate = dropoutRate;
            _rand = rand;
        }



        /// <summary>
        /// Forward pass through the encoder layer.
        /// </summary>
        /// <param name="input">Input matrix [batchSize * seqLen, dModel].</param>
        /// <param name="logger">Logging callback passed through to the attention layer.</param>
        /// <returns>Output matrix [batchSize * seqLen, dModel].</returns>
        public Matrix<double> Forward(Matrix<double> input, Action<string> logger)
        {
            
            _lastInput = input;
            var attnOutput = _attention.Forward(input, input, input, logger); // Self-attention
            var norm1Output = _norm1.Forward(input + ApplyDropout(attnOutput));
            var ffnOutput = _ffn.Forward(norm1Output);
            return _norm2.Forward(norm1Output + ApplyDropout(ffnOutput));
        }



        /// <summary>
        /// Backward pass through the encoder layer.
        /// </summary>
        /// <param name="dOutput">Gradient w.r.t. output [batchSize * seqLen, dModel].</param>
        /// <returns>Gradient w.r.t. input [batchSize * seqLen, dModel].</returns>
        public Matrix<double> Backward(Matrix<double> dOutput)
        {
            
            var dNorm2 = _norm2.Backward(dOutput);
            var dFFN = _ffn.Backward(dNorm2);
            var dNorm1 = _norm1.Backward(dNorm2 + dFFN);
            var (dQ, dK, dV) = _attention.Backward(dNorm1, _lastInput, _lastInput, _lastInput);
            return dNorm1 + dQ + dK + dV; // Self-attention: q = k = v
        }



        /// <summary>
        /// Updates parameters of attention, feed-forward, and normalization components with Adam optimization.
        /// </summary>
        /// <param name="adam">The Adam optimizer instance.</param>
        /// <param name="t">The current timestep for bias correction.</param>
        public void UpdateParametersWithAdam(AdamOptimizer adam, int t)
        {
            
            _attention.UpdateParametersWithAdam(adam, t);
            _ffn.UpdateParametersWithAdam(adam, t);
            _norm1.UpdateParametersWithAdam(adam, t);
            _norm2.UpdateParametersWithAdam(adam, t);
        }



        /// <summary>
        /// Computes the gradient norm for this layer.
        /// </summary>
        /// <returns>Gradient norm as a scalar value.</returns>
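        public double GetGradientNorm()
        {

            // Component norms are assumed to be L2 norms (as in LayerNormalization),
            // so they are combined as the square root of the sum of their squares.
            double sumSq = 0;
            foreach (double n in new[] { _attention.GetGradientNorm(), _ffn.GetGradientNorm(),
                                         _norm1.GetGradientNorm(), _norm2.GetGradientNorm() })
                sumSq += n * n;
            return Math.Sqrt(sumSq);
        }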



        /// <summary>
        /// Applies dropout to the input matrix during training.
        /// </summary>
        /// <param name="input">Input matrix to apply dropout to.</param>
        /// <returns>Matrix with dropout applied.</returns>
        private Matrix<double> ApplyDropout(Matrix<double> input)
        {

            if (!Xformer.IsTraining) return input;
            var result = input.Clone();
            Parallel.For(0, input.Rows, i =>
            {
                for (int j = 0; j < input.Columns; j++)
                    if (_rand.Value.NextDouble() < _dropoutRate)
                        result[i, j] = 0;
                    else
                        result[i, j] /= 1 - _dropoutRate;
            });
            return result;
        }



        /// <summary>
        /// Scales all gradients in the encoder layer, including attention, feed-forward, and normalization components.
        /// </summary>
        /// <param name="scale">The scaling factor to apply to all gradients (e.g., for gradient clipping).</param>
        /// <exception cref="ArgumentException">Thrown if scale is NaN, infinite, or negative.</exception>
        public void ScaleGradients(double scale)
        {
            
            if (double.IsNaN(scale) || double.IsInfinity(scale) || scale < 0)
                throw new ArgumentException("Scale must be a non-negative finite number.", nameof(scale));

            _attention.ScaleGradients(scale);
            _ffn.ScaleGradients(scale);
            _norm1.ScaleGradients(scale);
            _norm2.ScaleGradients(scale);
        }
    }
}

LayerNormalization.cs Class

namespace AI
{



    #region Using Statements:



    using System;
    using System.Threading.Tasks;



    #endregion



    /// <summary>
    /// Layer normalization for stabilizing training dynamics.
    /// </summary>
    [Serializable]
    public class LayerNormalization
    {



        #region Fields:



        /// <summary>
        /// Scale parameter [1, dModel].
        /// </summary>
        private Matrix<double> _gamma;



        /// <summary>
        /// Shift parameter [1, dModel].
        /// </summary>
        private Matrix<double> _beta;



        /// <summary>
        /// Gradient of scale parameter.
        /// </summary>
        private Matrix<double> _dGamma;



        /// <summary>
        /// Gradient of shift parameter.
        /// </summary>
        private Matrix<double> _dBeta;



        /// <summary>
        /// Dimensionality of the model.
        /// </summary>
        private readonly int _dModel;



        /// <summary>
        /// Cached mean statistics [batchSize * seqLen, 1].
        /// </summary>
        private Matrix<double> _mean;



        /// <summary>
        /// Cached standard deviation statistics [batchSize * seqLen, 1].
        /// </summary>
        private Matrix<double> _stdDev;



        /// <summary>
        /// Cached input for backward pass [batchSize * seqLen, dModel].
        /// </summary>
        private Matrix<double> _lastInput;



        // Adam optimizer state variables
        /// <summary>
        /// First moment estimate for _gamma.
        /// </summary>
        private Matrix<double> _mGamma;



        /// <summary>
        /// Second moment estimate for _gamma.
        /// </summary>
        private Matrix<double> _vGamma;



        /// <summary>
        /// First moment estimate for _beta.
        /// </summary>
        private Matrix<double> _mBeta;



        /// <summary>
        /// Second moment estimate for _beta.
        /// </summary>
        private Matrix<double> _vBeta;



        /// <summary>
        /// Small constant added to the variance for numerical stability.
        /// </summary>
        private readonly double _epsilon;



        #endregion



        #region Properties:



        #endregion



        /// <summary>
        /// Initializes layer normalization with given dimensionality.
        /// </summary>
        /// <param name="dModel">Dimensionality of the input.</param>
        /// <param name="epsilon">Small constant for numerical stability, default is 1e-6.</param>
        public LayerNormalization(int dModel, double epsilon = 1e-6)
        {
            
            _dModel = dModel;
            _epsilon = epsilon;
            _gamma = Matrix<double>.Ones(1, dModel);
            _beta = Matrix<double>.Zeros(1, dModel);
            // Initialize Adam state
            _mGamma = new Matrix<double>(_gamma.Rows, _gamma.Columns);
            _vGamma = new Matrix<double>(_gamma.Rows, _gamma.Columns);
            _mBeta = new Matrix<double>(_beta.Rows, _beta.Columns);
            _vBeta = new Matrix<double>(_beta.Rows, _beta.Columns);
        }



        /// <summary>
        /// Forward pass through layer normalization.
        /// </summary>
        /// <param name="input">Input matrix [batchSize * seqLen, dModel].</param>
        /// <returns>Normalized output [batchSize * seqLen, dModel].</returns>
        public Matrix<double> Forward(Matrix<double> input)
        {
           
            int batchSeqLen = input.Rows;
            var output = new Matrix<double>(batchSeqLen, _dModel);
            _lastInput = input.Copy();
            _mean = new Matrix<double>(batchSeqLen, 1);
            _stdDev = new Matrix<double>(batchSeqLen, 1);

            // Per-row mean and standard deviation over the dModel features.
            Parallel.For(0, batchSeqLen, i =>
            {
                double sum = 0;
                for (int j = 0; j < _dModel; j++)
                    sum += input[i, j];
                _mean[i, 0] = sum / _dModel;

                double sumSqDiff = 0;
                for (int j = 0; j < _dModel; j++)
                {
                    double diff = input[i, j] - _mean[i, 0];
                    sumSqDiff += diff * diff;
                }
                // Use the epsilon passed to the constructor instead of a hard-coded constant.
                _stdDev[i, 0] = Math.Sqrt(sumSqDiff / _dModel + _epsilon);
            });

            // Normalize each row, then apply the learned scale (gamma) and shift (beta).
            Parallel.For(0, batchSeqLen, i =>
            {
                for (int j = 0; j < _dModel; j++)
                    output[i, j] = _gamma[0, j] * (input[i, j] - _mean[i, 0]) / _stdDev[i, 0] + _beta[0, j];
            });

            return output;
        }



        /// <summary>
        /// Backward pass through layer normalization.
        /// </summary>
        /// <param name="dOutput">Gradient w.r.t. output [batchSize * seqLen, dModel].</param>
        /// <returns>Gradient w.r.t. input [batchSize * seqLen, dModel].</returns>
        public Matrix<double> Backward(Matrix<double> dOutput)
        {
            
            int batchSeqLen = _lastInput.Rows;
            var dInput = new Matrix<double>(batchSeqLen, _dModel);
            _dGamma = Matrix<double>.Zeros(1, _dModel);
            _dBeta = Matrix<double>.Zeros(1, _dModel);

            // Parameter gradients: each column j is owned by exactly one iteration,
            // so the sums over rows need no locking.
            Parallel.For(0, _dModel, j =>
            {
                double dGamma = 0;
                double dBeta = 0;
                for (int i = 0; i < batchSeqLen; i++)
                {
                    double normVal = (_lastInput[i, j] - _mean[i, 0]) / _stdDev[i, 0];
                    dGamma += dOutput[i, j] * normVal;
                    dBeta += dOutput[i, j];
                }
                _dGamma[0, j] = dGamma;
                _dBeta[0, j] = dBeta;
            });

            // Input gradients, computed independently per row.
            Parallel.For(0, batchSeqLen, i =>
            {
                double invStdDev = 1.0 / _stdDev[i, 0];
                double dNormSum = 0;
                double dVarSum = 0;

                for (int j = 0; j < _dModel; j++)
                {
                    double dNorm = dOutput[i, j] * _gamma[0, j];
                    dNormSum += dNorm;
                    dVarSum += dNorm * (_lastInput[i, j] - _mean[i, 0]);
                }

                // dVarTerm already carries stdDev^-3 and dMeanTerm already carries 1 / stdDev.
                double dVarTerm = dVarSum * (-0.5) * Math.Pow(_stdDev[i, 0], -3);
                double dMeanTerm = -dNormSum * invStdDev;

                for (int j = 0; j < _dModel; j++)
                {
                    double dNorm = dOutput[i, j] * _gamma[0, j];
                    // Only the direct term is divided by stdDev; scaling the other two
                    // terms again would apply 1 / stdDev twice.
                    dInput[i, j] = dNorm * invStdDev +
                                   dVarTerm * 2 * (_lastInput[i, j] - _mean[i, 0]) / _dModel +
                                   dMeanTerm / _dModel;
                }
            });

            return dInput;
        }



        /// <summary>
        /// Updates normalization parameters with Adam optimization.
        /// </summary>
        /// <param name="adam">The Adam optimizer instance.</param>
        /// <param name="t">The current timestep for bias correction.</param>
        public void UpdateParametersWithAdam(AdamOptimizer adam, int t)
        {
            
            (_mGamma, _vGamma, _gamma) = adam.Update(_gamma, _dGamma, _mGamma, _vGamma, t);
            (_mBeta, _vBeta, _beta) = adam.Update(_beta, _dBeta, _mBeta, _vBeta, t);
        }



        /// <summary>
        /// Computes the gradient norm for this layer.
        /// </summary>
        /// <returns>Gradient norm as a scalar value.</returns>
        public double GetGradientNorm() => Math.Sqrt(_dGamma.ElementWiseMultiply(_dGamma).Sum() +
                                                    _dBeta.ElementWiseMultiply(_dBeta).Sum());



        /// <summary>
        /// Scales all gradients in the layer normalization, including the scale (gamma) and shift (beta) gradients.
        /// </summary>
        /// <param name="scale">The scaling factor to apply to all gradients (e.g., for gradient clipping).</param>
        /// <exception cref="ArgumentException">Thrown if scale is NaN, infinite, or negative.</exception>
        public void ScaleGradients(double scale)
        {
           
            if (double.IsNaN(scale) || double.IsInfinity(scale) || scale < 0)
                throw new ArgumentException("Scale must be a non-negative finite number.", nameof(scale));

            _dGamma *= scale;
            _dBeta *= scale;
        }
    }
}
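
A hand-written backward pass like the one in LayerNormalization is easy to get subtly wrong, so it helps to compare it against a finite-difference gradient. The sketch below does that for a single row on plain arrays. It assumes a simple test loss of sum(w[j] * y[j]), and LayerNormSketch with all of its members are hypothetical names, not part of the original code:

```csharp
using System;

public static class LayerNormSketch
{
    // Mean and standard deviation (epsilon inside the square root) of one row.
    private static (double mean, double std) Stats(double[] x, double epsilon)
    {
        double mean = 0.0;
        foreach (double v in x) mean += v;
        mean /= x.Length;

        double variance = 0.0;
        foreach (double v in x) variance += (v - mean) * (v - mean);
        variance /= x.Length;

        return (mean, Math.Sqrt(variance + epsilon));
    }

    // y[j] = gamma[j] * (x[j] - mean) / std + beta[j]
    public static double[] Forward(double[] x, double[] gamma, double[] beta, double epsilon)
    {
        var (mean, std) = Stats(x, epsilon);
        var y = new double[x.Length];
        for (int j = 0; j < x.Length; j++)
            y[j] = gamma[j] * (x[j] - mean) / std + beta[j];
        return y;
    }

    // Analytic gradient of the loss w.r.t. x, given dOutput = dLoss/dy.
    public static double[] Backward(double[] x, double[] gamma, double[] dOutput, double epsilon)
    {
        int n = x.Length;
        var (mean, std) = Stats(x, epsilon);

        double dNormSum = 0.0, dVarSum = 0.0;
        for (int j = 0; j < n; j++)
        {
            double dNorm = dOutput[j] * gamma[j];
            dNormSum += dNorm;
            dVarSum += dNorm * (x[j] - mean);
        }
        double dVar = -0.5 * dVarSum / (std * std * std);
        double dMean = -dNormSum / std;

        var dx = new double[n];
        for (int j = 0; j < n; j++)
            dx[j] = dOutput[j] * gamma[j] / std + dVar * 2.0 * (x[j] - mean) / n + dMean / n;
        return dx;
    }

    // Central-difference gradient of the test loss sum(w[j] * y[j]) w.r.t. x.
    public static double[] NumericalGradient(double[] x, double[] gamma, double[] beta,
                                             double[] w, double epsilon, double h = 1e-5)
    {
        var grad = new double[x.Length];
        for (int j = 0; j < x.Length; j++)
        {
            var plus = (double[])x.Clone();
            var minus = (double[])x.Clone();
            plus[j] += h;
            minus[j] -= h;
            grad[j] = (Loss(plus, gamma, beta, w, epsilon) - Loss(minus, gamma, beta, w, epsilon)) / (2 * h);
        }
        return grad;
    }

    private static double Loss(double[] x, double[] gamma, double[] beta, double[] w, double epsilon)
    {
        double[] y = Forward(x, gamma, beta, epsilon);
        double loss = 0.0;
        for (int j = 0; j < y.Length; j++)
            loss += w[j] * y[j];
        return loss;
    }
}
```

Calling Backward with dOutput set to w and comparing the result element by element against NumericalGradient should give agreement to several decimal places if the analytic formula is right.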

AdamOptimizer.cs Class

namespace AI
{



    #region Using Statements:



    using System;
    using System.Threading.Tasks;



    #endregion



    /// <summary>
    /// Adam optimizer for adaptive parameter updates.
    /// </summary>
    [Serializable]
    public class AdamOptimizer
    {



        #region Fields:



        /// <summary>
        /// Learning rate for parameter updates.
        /// </summary>
        private readonly double _lr;



        /// <summary>
        /// Exponential decay rate for the first moment estimates.
        /// </summary>
        private readonly double _beta1;



        /// <summary>
        /// Exponential decay rate for the second moment estimates.
        /// </summary>
        private readonly double _beta2;



        /// <summary>
        /// Small constant for numerical stability.
        /// </summary>
        private readonly double _epsilon;



        #endregion



        #region Properties:



        #endregion




        /// <summary>
        /// Initializes the Adam optimizer with specified hyperparameters.
        /// </summary>
        /// <param name="learningRate">Learning rate for updates.</param>
        /// <param name="beta1">Decay rate for first moment (typically 0.9).</param>
        /// <param name="beta2">Decay rate for second moment (typically 0.999).</param>
        /// <param name="epsilon">Stability constant (typically 1e-8).</param>
        public AdamOptimizer(double learningRate, double beta1, double beta2, double epsilon)
        {
            
            _lr = learningRate;
            _beta1 = beta1;
            _beta2 = beta2;
            _epsilon = epsilon;
        }



        /// <summary>
        /// Updates a parameter using the Adam optimization algorithm.
        /// </summary>
        /// <param name="param">Parameter matrix to update.</param>
        /// <param name="grad">Gradient of the parameter.</param>
        /// <param name="m">First moment estimate (moving average of gradients).</param>
        /// <param name="v">Second moment estimate (moving average of squared gradients).</param>
        /// <param name="t">Timestep for bias correction.</param>
        /// <returns>Tuple of updated (m, v, param).</returns>
        /// <exception cref="ArgumentNullException">Thrown if any input matrix is null.</exception>
        /// <exception cref="ArgumentException">Thrown if matrix dimensions do not match or numerical issues occur.</exception>
        public (Matrix<double> m, Matrix<double> v, Matrix<double> param) Update(Matrix<double> param, Matrix<double> grad, Matrix<double> m, Matrix<double> v, int t)
        {

            // Null checks
            if (param == null) throw new ArgumentNullException(nameof(param), "Parameter matrix is null.");
            if (grad == null) throw new ArgumentNullException(nameof(grad), "Gradient matrix is null.");
            if (m == null) throw new ArgumentNullException(nameof(m), "First moment matrix is null.");
            if (v == null) throw new ArgumentNullException(nameof(v), "Second moment matrix is null.");

            // Log caller and dimensions
            var stackFrame = new System.Diagnostics.StackFrame(1);
            var caller = stackFrame.GetMethod()?.DeclaringType?.Name + "." + stackFrame.GetMethod()?.Name;
            Console.WriteLine($"Update called from {caller}: " +
                $"param[{param.Rows},{param.Columns}], grad[{grad.Rows},{grad.Columns}], " +
                $"m[{m.Rows},{m.Columns}], v[{v.Rows},{v.Columns}]");

            // Dimension validation
            int rows = param.Rows, cols = param.Columns;
            if (grad.Rows != rows || grad.Columns != cols ||
                m.Rows != rows || m.Columns != cols ||
                v.Rows != rows || v.Columns != cols)
            {
                throw new ArgumentException($"Matrix dimension mismatch in {caller}: " +
                    $"param[{rows},{cols}], grad[{grad.Rows},{grad.Columns}], " +
                    $"m[{m.Rows},{m.Columns}], v[{v.Rows},{v.Columns}]");
            }

            // Update the biased first and second moment estimates.
            m = m * _beta1 + grad * (1 - _beta1);
            v = v * _beta2 + grad.ElementWiseMultiply(grad) * (1 - _beta2);

            // Bias correction. t must be >= 1; at t = 0 both powers equal 1 and the correction would divide by zero.
            double beta1PowT = Math.Pow(_beta1, t);
            double beta2PowT = Math.Pow(_beta2, t);
            if (double.IsNaN(beta1PowT) || double.IsNaN(beta2PowT) || beta1PowT >= 1.0 || beta2PowT >= 1.0)
                throw new ArgumentException($"Invalid bias correction at t={t}: beta1^t={beta1PowT}, beta2^t={beta2PowT}");
            var mHat = m * (1.0 / (1 - beta1PowT));
            var vHat = v * (1.0 / (1 - beta2PowT));

            // Denominator sqrt(vHat) + epsilon, using the epsilon passed to the constructor.
            var sqrtVHat = vHat.ElementWiseSqrt();
            var denominator = sqrtVHat.AddScalar(_epsilon);
            if (denominator.Any(x => double.IsNaN(x) || x <= 0))
                throw new ArgumentException("NaN or non-positive denominator detected in Adam update.");

            // param <- param - lr * mHat / (sqrt(vHat) + epsilon)
            var invDenominator = denominator.ElementWiseInverse();
            var updateTerm = mHat.ElementWiseMultiply(invDenominator);
            var scaledUpdate = updateTerm * _lr;
            param = param - scaledUpdate;
            return (m, v, param);
        }
    }
}
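
To make the Adam update easier to follow, here is a minimal sketch of a single step for one scalar parameter, following the same order of operations as Update above. AdamSketch and Step are hypothetical names, and the default hyperparameters are just the typical values mentioned in the constructor comments, with an assumed learning rate of 0.001:

```csharp
using System;

public static class AdamSketch
{
    // One Adam step for a single scalar parameter.
    // Returns the updated (m, v, param), mirroring the tuple order of Update.
    public static (double m, double v, double param) Step(
        double param, double grad, double m, double v, int t,
        double lr = 0.001, double beta1 = 0.9, double beta2 = 0.999, double epsilon = 1e-8)
    {
        if (t < 1)
            throw new ArgumentOutOfRangeException(nameof(t), "Timestep must start at 1.");

        m = beta1 * m + (1 - beta1) * grad;          // biased first moment
        v = beta2 * v + (1 - beta2) * grad * grad;   // biased second moment
        double mHat = m / (1 - Math.Pow(beta1, t));  // bias-corrected first moment
        double vHat = v / (1 - Math.Pow(beta2, t));  // bias-corrected second moment
        param -= lr * mHat / (Math.Sqrt(vHat) + epsilon);
        return (m, v, param);
    }
}
```

With m and v starting at zero, the first step (t = 1) moves the parameter by roughly lr in the direction opposite to the gradient, almost independently of the gradient's size, because mHat / sqrt(vHat) reduces to the sign of the gradient.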

Model.cs Class

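namespace AI
{



    #region Using Statements:



    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Threading;



    #endregion


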
    /// <summary>
    /// Manages the encoder and decoder stacks of the Transformer model.
    /// </summary>
    [Serializable]
    public class Model
    {



        #region Fields:



        /// <summary>
        /// Dimensionality of embeddings and hidden states.
        /// </summary>
        public readonly int _dModel;



        /// <summary>
        /// List of encoder layers.
        /// </summary>
        private readonly List<EncoderLayer> _encoders;



        /// <summary>
        /// List of decoder layers.
        /// </summary>
        private readonly List<DecoderLayer> _decoders;



        /// <summary>
        /// Dropout rate for regularization.
        /// </summary>
        private readonly double _dropoutRate;



        /// <summary>
        /// A thread-safe random number generator for dropout and initialization.
        /// </summary>
        [NonSerialized]
        private readonly ThreadLocal<Random> _rand = new ThreadLocal<Random>(() => new Random());



        /// <summary>
        /// Cached encoder layer outputs for backward pass.
        /// </summary>
        private List<Matrix<double>> _encoderOutputs;



        /// <summary>
        /// Cached decoder layer outputs for backward pass.
        /// </summary>
        private List<Matrix<double>> _decoderOutputs;



        #endregion



        #region Properties:



        /// <summary>
        /// Final decoder output from the last forward pass [batchSize * targetSeqLen, dModel].
        /// </summary>
        public Matrix<double> LastDecoderOutput { get; private set; }



        #endregion



        /// <summary>
        /// Initializes the model with specified encoder and decoder layers.
        /// </summary>
        /// <param name="dModel">Dimensionality of embeddings and hidden states.</param>
        /// <param name="numHeads">Number of attention heads.</param>
        /// <param name="dFF">Feed-forward network hidden size.</param>
        /// <param name="numLayers">Number of encoder and decoder layers.</param>
        /// <param name="maxSeqLen">Maximum sequence length.</param>
        /// <param name="dropoutRate">Dropout probability.</param>
        /// <param name="rand">Random number generator.</param>
        public Model(int dModel, int numHeads, int dFF, int numLayers, int maxSeqLen, double dropoutRate, ThreadLocal<Random> rand)
        {
            
            _dModel = dModel;
            _dropoutRate = dropoutRate;
            _rand = rand;
            _encoders = Enumerable.Range(0, numLayers)
                .Select(_ => new EncoderLayer(dModel, numHeads, maxSeqLen, dFF, dropoutRate, rand))
                .ToList();
            _decoders = Enumerable.Range(0, numLayers)
                .Select(_ => new DecoderLayer(dModel, numHeads, maxSeqLen, dFF, dropoutRate, rand))
                .ToList();
            _encoderOutputs = new List<Matrix<double>>();
            _decoderOutputs = new List<Matrix<double>>();
        }



        /// <summary>
        /// Forward pass through the encoder stack.
        /// </summary>
        /// <param name="input">Embedded input [batchSize * inputSeqLen, dModel].</param>
        /// <param name="logger">Logging callback passed through to each encoder layer.</param>
        /// <returns>Encoder output [batchSize * inputSeqLen, dModel].</returns>
        public Matrix<double> ForwardEncoder(Matrix<double> input, Action<string> logger)
        {
           
            _encoderOutputs.Clear();
            var output = input;
            _encoderOutputs.Add(output);
            foreach (var encoder in _encoders)
            {
                output = encoder.Forward(output, logger);
                _encoderOutputs.Add(output);
            }
            return output;
        }



        /// <summary>
        /// Forward pass through the decoder stack with causal masking.
        /// </summary>
        /// <param name="input">Embedded target input [batchSize * targetSeqLen, dModel].</param>
        /// <param name="encoderOutput">Encoder output [batchSize * inputSeqLen, dModel].</param>
        /// <param name="logger">Logging delegate passed to each decoder layer.</param>
        /// <param name="mask">Causal mask [targetSeqLen, targetSeqLen].</param>
        /// <returns>Decoder output [batchSize * targetSeqLen, dModel].</returns>
        public Matrix<double> ForwardDecoder(Matrix<double> input, Matrix<double> encoderOutput, Action<string> logger, Matrix<double> mask)
        {

            _decoderOutputs.Clear();
            var output = input;
            _decoderOutputs.Add(output);
            foreach (var decoder in _decoders)
            {
                output = decoder.Forward(output, encoderOutput, logger, mask);
                _decoderOutputs.Add(output);
            }
            LastDecoderOutput = output;
            return output;
        }



        /// <summary>
        /// Backward pass through decoder and encoder stacks.
        /// </summary>
        /// <param name="dDecoderOutput">Gradient w.r.t. decoder output [batchSize * targetSeqLen, dModel].</param>
        /// <param name="inputIds">Input token indices [batchSize][inputSeqLen].</param>
        /// <param name="targetIds">Target token indices [batchSize][targetSeqLen].</param>
        /// <returns>Tuple of gradients w.r.t. encoder input and decoder input embeddings.</returns>
        public (Matrix<double> dEncoderInput, Matrix<double> dDecoderInput) Backward(Matrix<double> dDecoderOutput, int[][] inputIds, int[][] targetIds)
        {
            
            int batchSize = inputIds.Length;
            int inputSeqLen = inputIds[0].Length;
            int targetSeqLen = targetIds[0].Length;
            var mask = Matrix<double>.CreateCausalMask(targetSeqLen);

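            // Backprop through decoder layers.
            // _encoderOutputs holds the embedded input plus one entry per encoder layer,
            // so index _encoders.Count is the final encoder output used by every decoder layer.
            var encoderOutput = _encoderOutputs[_encoders.Count];
            var dDecoderInput = dDecoderOutput;
            var dEncoderOutput = Matrix<double>.Zeros(batchSize * inputSeqLen, _dModel);
            for (int i = _decoders.Count - 1; i >= 0; i--)
            {
                var layerInput = _decoderOutputs[i]; // Input that layer i received in the forward pass
                (dDecoderInput, var dEnc) = _decoders[i].Backward(dDecoderInput, layerInput, encoderOutput, mask);
                dEncoderOutput += dEnc; // Accumulate the encoder-side gradient from every decoder layer
            }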

            // Backprop through encoder layers
            var dEncoderInput = dEncoderOutput;
            for (int i = _encoders.Count - 1; i >= 0; i--)
            {
                dEncoderInput = _encoders[i].Backward(dEncoderInput);
            }

            return (dEncoderInput, dDecoderInput);
        }



        /// <summary>
        /// Updates parameters of all encoder and decoder layers with Adam optimization.
        /// </summary>
        /// <param name="adam">The Adam optimizer instance.</param>
        /// <param name="t">The current timestep for bias correction.</param>
        public void UpdateParametersWithAdam(AdamOptimizer adam, int t)
        {
            
            foreach (var encoder in _encoders)
                encoder.UpdateParametersWithAdam(adam, t);
            foreach (var decoder in _decoders)
                decoder.UpdateParametersWithAdam(adam, t);
        }



        /// <summary>
        /// Computes the total gradient norm across all layers.
        /// </summary>
        /// <returns>Total gradient norm as a scalar value.</returns>
        public double GetGradientNorm()
        {

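            // Accumulate squared per-layer norms; each layer's norm is computed once.
            double norm = 0;
            foreach (var encoder in _encoders)
            {
                double layerNorm = encoder.GetGradientNorm();
                norm += layerNorm * layerNorm;
            }
            foreach (var decoder in _decoders)
            {
                double layerNorm = decoder.GetGradientNorm();
                norm += layerNorm * layerNorm;
            }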

            if (double.IsNaN(norm) || double.IsInfinity(norm))
            {
                Console.WriteLine("Warning: Gradient norm is NaN or Infinity.");
            }

            return Math.Sqrt(norm);
        }



        /// <summary>
        /// Scales all gradients in the encoder and decoder layers by a specified factor.
        /// </summary>
        /// <param name="scale">The scaling factor to apply to all gradients (e.g., for gradient clipping).</param>
        /// <exception cref="ArgumentException">Thrown if scale is NaN, infinite, or negative.</exception>
        public void ScaleGradients(double scale)
        {
            if (double.IsNaN(scale) || double.IsInfinity(scale) || scale < 0)
                throw new ArgumentException("Scale must be a non-negative finite number.", nameof(scale));

            foreach (var encoder in _encoders)
            {
                encoder.ScaleGradients(scale);
            }

            foreach (var decoder in _decoders)
            {
                decoder.ScaleGradients(scale);
            }
        }
    }
}
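
As an aside, the GetGradientNorm and ScaleGradients methods above are the two halves of clipping by global norm: measure one L2 norm over every gradient, and if it exceeds a threshold, multiply all gradients by threshold / norm. The sketch below is only an illustration of that idea. It uses plain double[] arrays instead of the Matrix<double> type from this article, and the names GradientClipping, GlobalNorm and ClipByGlobalNorm are made up for the example.

```csharp
using System;
using System.Collections.Generic;

public static class GradientClipping
{
    // L2 norm over all gradient arrays, treated as one long vector.
    public static double GlobalNorm(IEnumerable<double[]> grads)
    {
        double sumSquares = 0.0;
        foreach (var g in grads)
            foreach (var x in g)
                sumSquares += x * x;
        return Math.Sqrt(sumSquares);
    }

    // Rescales every gradient in place so that the global norm is at most maxNorm.
    // Returns the norm measured before clipping.
    public static double ClipByGlobalNorm(IList<double[]> grads, double maxNorm)
    {
        double norm = GlobalNorm(grads);
        if (norm > maxNorm && norm > 0.0)
        {
            double scale = maxNorm / norm;
            foreach (var g in grads)
                for (int i = 0; i < g.Length; i++)
                    g[i] *= scale;
        }
        return norm;
    }
}
```

With the two gradients {3, 0} and {0, 4} the global norm is 5, so clipping to 1.0 multiplies every entry by 0.2 and leaves a global norm of 1. Note that squared norms are what get summed across groups; the square root is taken once at the end.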

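The Train method of the Transformer class listed next does not use a fixed learning rate. It follows the warmup schedule from the Attention paper: lr = dModel^-0.5 * min(step^-0.5, step * warmupSteps^-1.5), which rises linearly for the first warmupSteps updates and then decays with the inverse square root of the step. Here is that formula on its own as a rough sketch; the WarmupSchedule class and its LearningRate method are hypothetical names used only for this example.

```csharp
using System;

public static class WarmupSchedule
{
    // Learning rate for a 1-based optimizer step.
    public static double LearningRate(int step, int dModel, double warmupSteps = 1000.0)
    {
        if (step < 1)
            throw new ArgumentOutOfRangeException(nameof(step), "Steps are counted from 1.");

        double decay = 1.0 / Math.Sqrt(step);               // dominates after warmup
        double warmup = step / Math.Pow(warmupSteps, 1.5);  // dominates during warmup
        return Math.Min(decay, warmup) / Math.Sqrt(dModel);
    }
}
```

The two terms meet at step == warmupSteps, which is where the rate peaks. For dModel = 64 and 1000 warmup steps the peak is 1 / (8 * sqrt(1000)), roughly 0.00395, and by step 4000 the rate has fallen to half of that.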
Transformer.cs Class

namespace AI
{



    #region Using Statements:



    using System;
    using System.IO;
    using System.Linq;
    using System.Text.Json;
    using System.Threading;
    using System.Threading.Tasks;
    using System.Collections.Generic;
    using System.Runtime.Serialization;



    #endregion




    /// <summary>
    /// A thread-safe Transformer model for sequence-to-sequence tasks, featuring modular architecture,
    /// robust backpropagation, scaling, statistics tracking, and serialization support.
    /// </summary>
    [Serializable]
    public class Transformer
    {



        #region Fields

        /// <summary>
        /// The core Transformer model containing encoder and decoder stacks.
        /// </summary>
        private readonly Model _model;

        /// <summary>
        /// The size of the vocabulary (number of unique tokens).
        /// </summary>
        private readonly int _vocabSize;

        /// <summary>
        /// The maximum sequence length supported by the model.
        /// </summary>
        private readonly int _maxSeqLen;

        /// <summary>
        /// The learning rate used for parameter updates during training.
        /// </summary>
        private readonly double _learningRate;

        /// <summary>
        /// The dropout probability used for regularization to prevent overfitting.
        /// </summary>
        private readonly double _dropoutRate;

        /// <summary>
        /// A thread-safe random number generator for dropout and initialization.
        /// </summary>
        [NonSerialized]
        private readonly ThreadLocal<Random> _rand = new ThreadLocal<Random>(() => new Random());

        /// <summary>
        /// Utility for normalizing input data to a specified range.
        /// </summary>
        private readonly Scaling _scaler;

        /// <summary>
        /// Tracker for loss and timing statistics during training.
        /// </summary>
        private readonly Statistics<double> _stats;

        /// <summary>
        /// The embedding matrix mapping token indices to dense vectors [vocabSize, dModel].
        /// </summary>
        private Matrix<double> _embedding;

        /// <summary>
        /// The positional encoding matrix adding sequence position information [maxSeqLen, dModel].
        /// </summary>
        private Matrix<double> _posEncoding;

        /// <summary>
        /// The output projection matrix mapping decoder outputs to vocabulary logits [dModel, vocabSize].
        /// </summary>
        private Matrix<double> _outputProjection;

        /// <summary>
        /// Logging delegate for outputting training progress and diagnostics.
        /// </summary>
        private Action<string> _logger;

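        /// <summary>
        /// Indicates whether a training pass is in progress. Train sets it to true and switches it
        /// off for validation passes and when training finishes. The flag is static, so it is shared
        /// by every Transformer instance.
        /// </summary>
        public static bool IsTraining { get; private set; }

        /// <summary>
        /// Hidden layer size of the feed-forward network.
        /// </summary>
        private readonly int _dFF;

        /// <summary>
        /// Dimensionality of embeddings and hidden states.
        /// </summary>
        private readonly int _dModel;

        /// <summary>
        /// Number of attention heads.
        /// </summary>
        private readonly int _numHeads;

        /// <summary>
        /// Number of encoder layers and of decoder layers.
        /// </summary>
        private readonly int _numLayers;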

        #endregion



        /// <summary>
        /// Initializes a new Transformer model with the specified configuration.
        /// </summary>
        /// <param name="vocabSize">Number of unique tokens in the vocabulary.</param>
        /// <param name="dModel">Dimensionality of embeddings and hidden states.</param>
        /// <param name="numHeads">Number of attention heads in multi-head attention.</param>
        /// <param name="dFF">Hidden layer size of the feed-forward network.</param>
        /// <param name="numLayers">Number of encoder and decoder layers.</param>
        /// <param name="maxSeqLen">Maximum sequence length supported.</param>
        /// <param name="dropoutRate">Dropout probability (0 to 1), default is 0.1.</param>
        /// <param name="learningRate">Learning rate for gradient updates, default is 0.001.</param>
        /// <param name="logger">Optional logging action for progress output, defaults to console.</param>
        /// <exception cref="ArgumentException">Thrown if any parameter is invalid (e.g., non-positive sizes, invalid dropout rate).</exception>
        public Transformer(int vocabSize, int dModel, int numHeads, int dFF, int numLayers, int maxSeqLen, double dropoutRate = 0.1, double learningRate = 0.001, Action<string> logger = null)
        {
            
            // Validate input parameters
            if (vocabSize <= 0 || dModel <= 0 || numHeads <= 0 || dFF <= 0 || numLayers <= 0 || maxSeqLen <= 0)
                throw new ArgumentException("All size parameters must be positive.");
            if (dModel % numHeads != 0)
                throw new ArgumentException("dModel must be divisible by numHeads.");
            if (dropoutRate < 0 || dropoutRate > 1)
                throw new ArgumentException("Dropout rate must be between 0 and 1.");
            if (learningRate <= 0)
                throw new ArgumentException("Learning rate must be positive.");

            // Initialize fields
            _dFF = dFF;
            _dModel = dModel;
            _numLayers = numLayers;
            _numHeads = numHeads;
            _vocabSize = vocabSize;
            _maxSeqLen = maxSeqLen;
            _learningRate = learningRate;
            _dropoutRate = dropoutRate;
            _logger = logger ?? Console.WriteLine;
            _scaler = new Scaling(-1.0, 1.0);
            _stats = new Statistics<double>();

            // Initialize matrices with appropriate initialization methods
            _embedding = Matrix<double>.InitializeXavier(vocabSize, dModel);
            _posEncoding = GeneratePositionalEncoding(maxSeqLen, dModel);
            _outputProjection = Matrix<double>.InitializeXavier(dModel, vocabSize);

            // Instantiate the core model with encoder and decoder stacks
            _model = new Model(dModel, numHeads, dFF, numLayers, maxSeqLen, dropoutRate, _rand);
        }



        /// <summary>
        /// Trains the Transformer model on the provided input and target sequences over multiple epochs.
        /// This method implements a batched training loop with Adam optimization, gradient clipping,
        /// and a warmup learning rate schedule. It tracks and logs loss, accuracy, and a "percent above random"
        /// metric to evaluate performance relative to random guessing.
        /// </summary>
        /// <param name="inputIds">Array of tokenized input sequences.</param>
        /// <param name="targetIds">Array of tokenized target sequences.</param>
        /// <param name="epochs">Number of training epochs.</param>
        /// <param name="batchSize">Size of each training batch (default: 32).</param>
        /// <param name="clipScale">Gradient clipping threshold (default: 5.0).</param>
        /// <returns>The average loss across all epochs.</returns>
        /// <exception cref="ArgumentException">Thrown if inputs are invalid (e.g., mismatched lengths, null, or exceed maxSeqLen).</exception>
        public double Train(int[][] inputIds, int[][] targetIds, int epochs, int batchSize = 32, double clipScale = 5.0)
        {
            
            IsTraining = true;
            ValidateInput(inputIds, targetIds);

            // Split train/validation
            int trainSize = (int)(inputIds.Length * 0.8);
            int valSize = inputIds.Length - trainSize;
            var trainInput = inputIds.Take(trainSize).ToArray();
            var trainTarget = targetIds.Take(trainSize).ToArray();
            var valInput = inputIds.Skip(trainSize).Take(valSize).ToArray();
            var valTarget = targetIds.Skip(trainSize).Take(valSize).ToArray();

            var metricsHistory = new List<EpochMetrics>();
            double totalEpochLoss = 0;
            var sw = new System.Diagnostics.Stopwatch();

            var mEmbedding = new Matrix<double>(_embedding.Rows, _embedding.Columns);
            var vEmbedding = new Matrix<double>(_embedding.Rows, _embedding.Columns);
            var mOutputProj = new Matrix<double>(_outputProjection.Rows, _outputProjection.Columns);
            var vOutputProj = new Matrix<double>(_outputProjection.Rows, _outputProjection.Columns);

            int globalStep = 0;
            const double warmupSteps = 1000.0;
            double dModelFactor = 1.0 / Math.Sqrt(_dModel); // Same value that was passed to Model

            double learnableParams = Statistics<double>.EstimateDatasetLearnableParameters(inputIds, targetIds, _vocabSize, _maxSeqLen);
            int modelParams = CalculateModelParameters();
            _logger($"Estimated Learnable Parameters: {learnableParams:F0}");
            _logger($"Model Parameters: {modelParams}");
            _logger($"Ratio: {(double)modelParams / learnableParams:F2}");
            double estEntropy = EstimateTokenEntropy(inputIds, targetIds);
            _logger($"Estimated Token Entropy: {estEntropy:F3} bits");
            _logger($"Suggested Pairs: {(int)(modelParams / (estEntropy * 2 * 15 + 2 * Math.Log2(_vocabSize) * 7.5))}");

            for (int epoch = 0; epoch < epochs; epoch++)
            {
                double epochLoss = 0;
                int correctSequences = 0;
                int totalSequences = 0;
                double gradNormSum = 0;
                int numBatches = (int)Math.Ceiling((double)trainSize / batchSize);

                for (int batchStart = 0; batchStart < trainSize; batchStart += batchSize)
                {
                    globalStep++;
                    int currentBatchSize = Math.Min(batchSize, trainSize - batchStart);
                    int[][] batchInput = trainInput.Skip(batchStart).Take(currentBatchSize).ToArray();
                    int[][] batchTarget = trainTarget.Skip(batchStart).Take(currentBatchSize).ToArray();

                    sw.Start();
                    var logits = Forward(batchInput, batchTarget, _logger); // Full encoder-decoder forward pass
                    double loss = _stats.CrossEntropyLoss(logits, batchTarget);
                    if (double.IsNaN(loss) || logits.All(m => m == 0))
                    {
                        _logger($"Batch {batchStart / batchSize + 1} skipped due to invalid logits.");
                        sw.Reset();
                        continue;
                    }

                    epochLoss += loss * currentBatchSize;
                    var dLogits = ComputeCrossEntropyGradients(logits, batchTarget);
                    var (dEmbedding, dOutputProj) = Backward(dLogits, batchInput, batchTarget);

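                    // Global L2 norm over every gradient. GetGradientNorm() already returns a square
                    // root, so square it again before adding it to the other sums of squares.
                    double modelGradNorm = _model.GetGradientNorm();
                    double gradNorm = Math.Sqrt(
                        dEmbedding.ElementWiseMultiply(dEmbedding).Sum() +
                        dOutputProj.ElementWiseMultiply(dOutputProj).Sum() +
                        modelGradNorm * modelGradNorm
                    );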
                    gradNormSum += gradNorm;
                    if (gradNorm > clipScale)
                    {
                        double scale = clipScale / gradNorm;
                        dEmbedding *= scale;
                        dOutputProj *= scale;
                        _model.ScaleGradients(scale);
                    }

                    // double gradNorm = _model.GetGradientNorm();
                    // _logger($"Epoch {epoch + 1}/{epochs}, Batch {batchStart / batchSize + 1}/{numBatches} - Loss: {loss:F6} - GradNorm: {gradNorm:F2}");
                    if (double.IsNaN(loss) || double.IsInfinity(loss))
                    {
                        _logger("Loss is NaN or Infinity!");
                        break;
                    }

                    double stepFactor = 1.0 / Math.Sqrt(Math.Max(globalStep, 1));
                    double warmupFactor = globalStep / Math.Pow(warmupSteps, 1.5);
                    double lr = dModelFactor * Math.Min(stepFactor, warmupFactor);

                    var adam = new AdamOptimizer(lr, beta1: 0.9, beta2: 0.999, epsilon: 1e-8);
                    (mEmbedding, vEmbedding, _embedding) = adam.Update(_embedding, dEmbedding, mEmbedding, vEmbedding, globalStep);
                    (mOutputProj, vOutputProj, _outputProjection) = adam.Update(_outputProjection, dOutputProj, mOutputProj, vOutputProj, globalStep);
                    _model.UpdateParametersWithAdam(adam, globalStep);

                    for (int b = 0; b < currentBatchSize; b++)
                    {
                        int[] predicted = new int[batchTarget[b].Length];
                        for (int j = 0; j < batchTarget[b].Length; j++)
                            predicted[j] = Matrix<double>.GetNextToken(logits, b * batchTarget[0].Length + j);
                        if (predicted.SequenceEqual(batchTarget[b]))
                            correctSequences++;
                    }

                    totalSequences += currentBatchSize;

                    double batchTimeMs = sw.Elapsed.TotalMilliseconds;
                    sw.Reset();

                    if (batchStart % (10 * batchSize) == 0 || batchStart + currentBatchSize >= trainSize)
                    {
                        _logger($"Epoch {epoch + 1}/{epochs}, Batch {batchStart / batchSize + 1}/{numBatches} - Loss: {loss:F6} - Time: {batchTimeMs:F2}ms");
                    }
                }

                double trainLoss = epochLoss / trainSize;
                double trainAccuracy = (double)correctSequences / totalSequences * 100;

                // Validation pass
                IsTraining = false;
                var valLogits = Forward(valInput, valTarget, _logger); // [valSize * seqLen, vocabSize], e.g., [822, 39]
                double valLoss = _stats.CrossEntropyLoss(valLogits, valTarget);
                int valCorrect = 0;
                for (int b = 0; b < valSize; b++)
                {
                    int[] predicted = new int[valTarget[b].Length];
                    for (int j = 0; j < valTarget[b].Length; j++)
                        predicted[j] = Matrix<double>.GetNextToken(valLogits, b * valTarget[0].Length + j);
                    if (predicted.SequenceEqual(valTarget[b])) valCorrect++;
                }
                double valAccuracy = (double)valCorrect / valSize * 100;
                IsTraining = true;

                metricsHistory.Add(new EpochMetrics(epoch + 1, trainLoss, trainAccuracy, valLoss, valAccuracy));
                totalEpochLoss += trainLoss; // Record this epoch's loss before any early stop
                var (isOverfitting, message) = OverfittingDetector.OverFitted(metricsHistory);
                _logger(message);
                if (isOverfitting) break;

                var (actualPercentAboveRandom, targetPercentAboveRandom) = Statistics<double>.CalculatePerformanceRatioToRandom(correctSequences, totalSequences, _vocabSize, _maxSeqLen);
                _logger($"Epoch {epoch + 1}/{epochs} Summary - " +
                        $"True Learning: {actualPercentAboveRandom:F2}%, " + // Actual Percent Above Random:
                        $"TL Target: {targetPercentAboveRandom:F2}%, " + // Target Percent Above Random:
                        $"Avg Loss: {trainLoss:F6}, Accuracy: {trainAccuracy:F2}%, " +
                        $"Val Loss: {valLoss:F6}, Val Acc: {valAccuracy:F2}%");
            }

            IsTraining = false;
            // Average over the epochs that actually ran; early stopping can end training before `epochs`.
            double avgEpochLoss = totalEpochLoss / Math.Max(metricsHistory.Count, 1);
            _logger($"Training Complete - Avg Loss Across Epochs: {avgEpochLoss:F6}");
            return avgEpochLoss;
        }



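        /// <summary>
        /// Checks the most recent epoch metrics for signs of overfitting, a common problem in machine learning.
        /// </summary>
        /// <param name="metrics">Per-epoch training and validation metrics, oldest first. Accuracies are in percent (0 to 100), as recorded by Train.</param>
        /// <param name="minEpochs">Number of most recent epochs to analyse; no verdict is given with fewer.</param>
        /// <param name="lossSpikeThreshold">Factor by which training loss must grow from one epoch to the next to count as a spike.</param>
        /// <param name="accuracyDropThreshold">Drop in training accuracy, in percentage points, that counts as significant.</param>
        /// <returns>A flag indicating whether overfitting was detected, and a message describing the finding.</returns>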
        public static (bool isOverfitting, string message) OverFitted(List<EpochMetrics> metrics, int minEpochs = 3, double lossSpikeThreshold = 1.5, double accuracyDropThreshold = 5.0) // Accuracies are in percent, so 5.0 means five percentage points
        {
            if (metrics.Count < minEpochs)
                return (false, $"Not enough epochs ({metrics.Count} < {minEpochs}) to assess overfitting.");

            // Get the last few epochs for trend analysis
            var recentMetrics = metrics.TakeLast(minEpochs).ToList();
            var latest = recentMetrics.Last();
            var previous = recentMetrics[^2]; // Second-to-last

            // Check 1: Loss divergence (train down, val up, or sudden spike)
            bool trainLossDecreasing = recentMetrics.Select(m => m.TrainLoss).IsMonotonicDecreasing();
            bool valLossIncreasing = recentMetrics.Select(m => m.ValLoss).IsMonotonicIncreasing();
            bool lossSpike = latest.TrainLoss > previous.TrainLoss * lossSpikeThreshold;

            // Check 2: Accuracy divergence (train up, val down or flat)
            bool trainAccIncreasing = recentMetrics.Select(m => m.TrainAccuracy).IsMonotonicIncreasing();
            bool valAccDecreasing = recentMetrics.Select(m => m.ValAccuracy).IsMonotonicDecreasing();
            bool valAccStalled = recentMetrics.Max(m => m.ValAccuracy) - recentMetrics.Min(m => m.ValAccuracy) < 1.0; // Less than 1 percentage point of change (accuracies are stored in percent)
            bool accDrop = previous.TrainAccuracy - latest.TrainAccuracy > accuracyDropThreshold;

            // Overfitting conditions
            if (lossSpike)
                return (true, $"Overfitting detected at Epoch {latest.Epoch}: Training loss spiked from {previous.TrainLoss:F6} to {latest.TrainLoss:F6}.");
            if (trainLossDecreasing && valLossIncreasing)
                return (true, $"Overfitting detected at Epoch {latest.Epoch}: Training loss decreasing ({previous.TrainLoss:F6} → {latest.TrainLoss:F6}), validation loss increasing ({previous.ValLoss:F6} → {latest.ValLoss:F6}).");
            if (trainAccIncreasing && (valAccDecreasing || valAccStalled))
                return (true, $"Overfitting detected at Epoch {latest.Epoch}: Training accuracy increasing ({previous.TrainAccuracy:F2}% → {latest.TrainAccuracy:F2}%), validation accuracy {(valAccDecreasing ? "decreasing" : "stalled")} ({previous.ValAccuracy:F2}% → {latest.ValAccuracy:F2}%).");
            if (accDrop)
                return (true, $"Overfitting detected at Epoch {latest.Epoch}: Training accuracy dropped significantly ({previous.TrainAccuracy:F2}% → {latest.TrainAccuracy:F2}%).");

            return (false, $"No overfitting detected at Epoch {latest.Epoch}. Train Loss: {latest.TrainLoss:F6}, Val Loss: {latest.ValLoss:F6}, Train Acc: {latest.TrainAccuracy:F2}%, Val Acc: {latest.ValAccuracy:F2}%");
        }



        /// <summary>
        /// Generates a sequence of tokens autoregressively given an input sequence.
        /// </summary>
        /// <param name="inputIds">Input token indices [1][inputSeqLen], batch size must be 1.</param>
        /// <param name="maxLength">Maximum number of tokens to generate.</param>
        /// <param name="logger">Logging delegate for diagnostics during generation.</param>
        /// <returns>Array of generated token indices including BOS (0) and EOS (1) tokens.</returns>
        /// <exception cref="ArgumentException">Thrown if batch size is not 1 or input is invalid.</exception>
        public int[] Generate(int[][] inputIds, int maxLength, Action<string> logger)
        {
            
            if (inputIds.Length != 1)
                throw new ArgumentException("Generate supports batchSize=1 only.");
            ValidateInput(inputIds, null);

            var encoderOutput = Encode(inputIds, logger);
            var outputIds = new List<int> { 0 }; // BOS token assumed as 0
            int[][] currentIds = new int[1][] { outputIds.ToArray() };

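            // Stop at maxLength, or once the sequence fills _maxSeqLen positions: the positional
            // encoding is built for _maxSeqLen rows and Embed throws on anything longer.
            for (int t = 0; t < maxLength && outputIds.Count < _maxSeqLen; t++)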
            {
                var decoderOutput = Decode(currentIds, encoderOutput, logger);
                var logits = decoderOutput * _outputProjection; // [batchSize * seqLen, vocabSize]
                if (logits.Columns != _vocabSize)
                {
                    throw new InvalidOperationException(
                        $"Logits have {logits.Columns} columns, but expected {_vocabSize} based on vocabulary size."
                    );
                }

                int nextToken = Matrix<double>.GetNextToken(logits, logits.Rows - 1);
                if (nextToken < 0 || nextToken >= _vocabSize)
                {
                    logger($"Error: Generated token ID {nextToken} at step {t} is out of bounds for vocabulary size {_vocabSize}.");
                    nextToken = 0; // Fall back to index 0, which this listing uses for both BOS and <UNK>
                }

                outputIds.Add(nextToken);
                if (nextToken == 1) break;
                currentIds[0] = outputIds.ToArray();
            }

            return outputIds.ToArray();
        }



        /// <summary>
        /// Saves the model to a file using JSON serialization.
        /// </summary>
        /// <param name="filePath">Destination file path for saving the model.</param>
        /// <exception cref="IOException">Thrown if file writing fails.</exception>
        /// <exception cref="JsonException">Thrown if serialization fails.</exception>
        public void Save(string filePath)
        {
            
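            // Note: System.Text.Json writes public properties and, with IncludeFields, public fields only.
            // The private fields of this class (embeddings, projections, the inner model) are skipped
            // unless they are exposed through public members or handled by a custom converter.
            var options = new JsonSerializerOptions { WriteIndented = true, IncludeFields = true };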
            string json = JsonSerializer.Serialize(this, options);
            File.WriteAllText(filePath, json);
        }



        /// <summary>
        /// Loads a model from a file using JSON deserialization.
        /// </summary>
        /// <param name="filePath">Source file path containing the serialized model.</param>
        /// <returns>A new Transformer instance loaded from the file.</returns>
        /// <exception cref="FileNotFoundException">Thrown if the file does not exist.</exception>
        /// <exception cref="JsonException">Thrown if deserialization fails.</exception>
        public static Transformer Load(string filePath)
        {
            
            if (!File.Exists(filePath))
                throw new FileNotFoundException($"Model file not found at {filePath}");

            string json = File.ReadAllText(filePath);
            var options = new JsonSerializerOptions { IncludeFields = true };
            return JsonSerializer.Deserialize<Transformer>(json, options);
        }



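        /// <summary>
        /// Estimates the number of trainable parameters in the model.
        /// </summary>
        /// <returns>A rough parameter count used for logging.</returns>
        private int CalculateModelParameters()
        {
            
            int embeddingParams = _vocabSize * _dModel;
            int outputParams = _dModel * _vocabSize;
            // The Q, K and V weight matrices are each [dModel, dModel] for all heads combined,
            // so the attention count does not scale with the number of heads.
            int attentionParams = _numLayers * (_dModel * _dModel * 3);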
            int ffParams = _numLayers * (_dModel * _dFF + _dFF * _dModel);
            int normParams = _numLayers * (2 * _dModel * 2);
            return embeddingParams + outputParams + attentionParams + ffParams + normParams;
        }



        /// <summary>
        /// Estimates the Shannon entropy, in bits, of the token distribution over all input and target sequences.
        /// </summary>
        /// <param name="inputs">Tokenized input sequences.</param>
        /// <param name="targets">Tokenized target sequences.</param>
        /// <returns>Entropy in bits per token.</returns>
        private double EstimateTokenEntropy(int[][] inputs, int[][] targets)
        {
            
            var counts = new Dictionary<int, int>();
            int total = 0; // Count the tokens actually seen, so the probabilities sum to 1
            foreach (var seq in inputs.Concat(targets))
                foreach (var token in seq)
                {
                    counts[token] = counts.GetValueOrDefault(token) + 1;
                    total++;
                }
            double entropy = 0;
            foreach (var count in counts.Values)
            {
                double p = (double)count / total;
                entropy -= p * Math.Log2(p);
            }
            return entropy;
        }



        /// <summary>
        /// Validates input and target token indices for consistency and constraints.
        /// </summary>
        /// <param name="inputIds">Input token indices to validate [batchSize][inputSeqLen].</param>
        /// <param name="targetIds">Target token indices to validate [batchSize][targetSeqLen], nullable.</param>
        /// <exception cref="ArgumentException">Thrown if validation fails (e.g., null, inconsistent lengths, out-of-range tokens).</exception>
        private void ValidateInput(int[][] inputIds, int[][] targetIds)
        {
            
            if (inputIds == null || inputIds.Length == 0 || !inputIds.Any() || inputIds.Any(row => row == null || row.Length == 0 || row.Length > _maxSeqLen))
                throw new ArgumentException("Input IDs must be non-null, non-empty, and within maxSeqLen.");
            int batchSize = inputIds.Length;
            int inputSeqLen = inputIds[0].Length;
            if (inputIds.Any(row => row.Length != inputSeqLen))
                throw new ArgumentException("All input sequences must have the same length.");
            if (inputIds.Any(row => row.Any(id => id < 0 || id >= _vocabSize)))
                throw new ArgumentException("Input token IDs must be within vocabulary range.");

            if (targetIds != null)
            {
                if (targetIds.Length != batchSize || !targetIds.Any() || targetIds.Any(row => row == null || row.Length == 0 || row.Length > _maxSeqLen))
                    throw new ArgumentException("Target IDs must match input batch size and be within maxSeqLen.");
                int targetSeqLen = targetIds[0].Length;
                if (targetIds.Any(row => row.Length != targetSeqLen))
                    throw new ArgumentException("All target sequences must have the same length.");
                if (targetIds.Any(row => row.Any(id => id < 0 || id >= _vocabSize)))
                    throw new ArgumentException("Target token IDs must be within vocabulary range.");
            }
        }



        /// <summary>
        /// Performs a forward pass through the Transformer model.
        /// </summary>
        /// <param name="inputIds">Input token indices [batchSize][inputSeqLen].</param>
        /// <param name="targetIds">Target token indices [batchSize][targetSeqLen], nullable.</param>
        /// <param name="logger">Logging delegate passed to the encoder and decoder.</param>
        /// <returns>Logits over vocabulary [batchSize * targetSeqLen, vocabSize].</returns>
        public Matrix<double> Forward(int[][] inputIds, int[][] targetIds, Action<string> logger)
        {
            
            var encoderOutput = Encode(inputIds, logger);
            if (targetIds == null)
                return encoderOutput * _outputProjection; // For encoding-only scenarios

            var decoderOutput = Decode(targetIds, encoderOutput, logger); // Use the caller's logger, as Encode does
            return decoderOutput * _outputProjection;
        }
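


        // Teacher forcing, sketched. Forward above hands targetIds to the decoder unchanged, and Train
        // scores the resulting logits against the same targetIds. Sequence-to-sequence training usually
        // feeds the decoder the target shifted one step to the right (BOS first, last token dropped), so
        // that position j has to predict token j from the tokens before it. If the data set is not
        // already shifted, a helper along these lines could build the decoder input. ShiftRight and
        // bosId are hypothetical names, not part of the original listing; BOS = 0 follows the
        // assumption made in Generate.
        private static int[][] ShiftRight(int[][] targetIds, int bosId = 0)
        {
            var shifted = new int[targetIds.Length][];
            for (int b = 0; b < targetIds.Length; b++)
            {
                int len = targetIds[b].Length; // ValidateInput guarantees len >= 1
                shifted[b] = new int[len];
                shifted[b][0] = bosId; // The decoder starts from BOS
                Array.Copy(targetIds[b], 0, shifted[b], 1, len - 1); // Followed by tokens 0 .. len - 2
            }
            return shifted;
        }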



        /// <summary>
        /// Encodes input sequences through the encoder stack.
        /// </summary>
        /// <param name="inputIds">Input token indices [batchSize][inputSeqLen].</param>
        /// <returns>Encoder output [batchSize * inputSeqLen, dModel].</returns>
        private Matrix<double> Encode(int[][] inputIds, Action<string> logger)
        {
            
            var inputEmbedding = Embed(inputIds);
            return _model.ForwardEncoder(inputEmbedding, logger);
        }



        /// <summary>
        /// Decodes target sequences using encoder output with causal masking.
        /// </summary>
        /// <param name="targetIds">Target token indices [batchSize][targetSeqLen].</param>
        /// <param name="encoderOutput">Encoder output [batchSize * inputSeqLen, dModel].</param>
        /// <returns>Decoder output [batchSize * targetSeqLen, dModel].</returns>
        private Matrix<double> Decode(int[][] targetIds, Matrix<double> encoderOutput, Action<string> logger)
        {
            
            var targetEmbedding = Embed(targetIds);
            var causalMask = Matrix<double>.CreateCausalMask(targetIds[0].Length);
            return _model.ForwardDecoder(targetEmbedding, encoderOutput, logger, causalMask);
        }



        /// <summary>
        /// Embeds token indices with positional encodings and applies scaling.
        /// </summary>
        /// <param name="ids">Token indices [batchSize][seqLen].</param>
        /// <returns>Embedded matrix [batchSize * seqLen, dModel].</returns>
        private Matrix<double> Embed(int[][] ids)
        {
            
            int batchSize = ids.Length;
            int seqLen = ids[0].Length;

            // Validate positional encoding dimensions
            if (seqLen > _posEncoding.Rows)
            {
                throw new InvalidOperationException(
                    $"Sequence length {seqLen} exceeds positional encoding matrix with {_posEncoding.Rows} rows."
                );
            }

            // Remap out-of-range token IDs to <UNK>. Note: this overwrites entries in the caller's ids array.
            const int unkTokenId = 0; // Assume <UNK> token is at index 0
            for (int b = 0; b < batchSize; b++)
            {
                for (int t = 0; t < seqLen; t++)
                {
                    int tokenId = ids[b][t];
                    if (tokenId < 0 || tokenId >= _embedding.Rows)
                    {
                        Console.WriteLine(
                            $"Warning: Token ID {tokenId} at batch {b}, position {t} is out of bounds for embedding matrix with {_embedding.Rows} rows. Mapping to <UNK> token."
                        );
                        ids[b][t] = unkTokenId; // Map to <UNK>
                    }
                }
            }

            var result = new Matrix<double>(batchSize * seqLen, _embedding.Columns);

            Parallel.For(0, batchSize, b =>
            {
                for (int t = 0; t < seqLen; t++)
                {
                    int idx = b * seqLen + t;
                    result.SetRow(idx, _embedding.Row(ids[b][t]) + _posEncoding.Row(t));
                }
            });

            return _scaler.Scale(new[] { result })[0];
        }



        /// <summary>
        /// Computes gradients via backpropagation for all trainable parameters.
        /// </summary>
        /// <param name="logits">Model output logits [batchSize * targetSeqLen, vocabSize].</param>
        /// <param name="inputIds">Input token indices [batchSize][inputSeqLen].</param>
        /// <param name="targetIds">Target token indices [batchSize][targetSeqLen].</param>
        /// <returns>Tuple containing gradients for embedding and output projection matrices.</returns>
        private (Matrix<double> dEmbedding, Matrix<double> dOutputProjection) Backward(Matrix<double> logits, int[][] inputIds, int[][] targetIds)
        {
            
            int batchSize = targetIds.Length;
            int targetSeqLen = targetIds[0].Length;
            int inputSeqLen = inputIds[0].Length;

            // Gradient of cross-entropy loss w.r.t. logits
            var dLogits = ComputeCrossEntropyGradients(logits, targetIds);

            // Backprop through output projection
            var dDecoderOutput = dLogits * _outputProjection.Transpose();
            var dOutputProjection = _model.LastDecoderOutput.Transpose() * dLogits;

            // Backprop through encoder and decoder
            var (dEncoderInput, dDecoderInput) = _model.Backward(dDecoderOutput, inputIds, targetIds);

            // Backprop through embedding (shared between encoder and decoder)
            var dEmbeddingFromEncoder = BackwardEmbedding(dEncoderInput, inputIds);
            var dEmbeddingFromDecoder = BackwardEmbedding(dDecoderInput, targetIds);
            var dEmbedding = dEmbeddingFromEncoder + dEmbeddingFromDecoder;

            return (dEmbedding, dOutputProjection);
        }



        /// <summary>
        /// Computes gradients for the embedding matrix based on input gradients and token indices.
        /// </summary>
        /// <param name="dInput">Gradient w.r.t. embedded input [batchSize * seqLen, dModel].</param>
        /// <param name="ids">Token indices [batchSize][seqLen].</param>
        /// <returns>Gradient w.r.t. embedding matrix [vocabSize, dModel].</returns>
        private Matrix<double> BackwardEmbedding(Matrix<double> dInput, int[][] ids)
        {
            
            int batchSize = ids.Length;
            int seqLen = ids[0].Length;
            int embedDim = _embedding.Columns;

            // Shared result; each worker merges its private accumulator into it under a lock
            var dEmbedding = new Matrix<double>(_vocabSize, embedDim);
            var mergeLock = new object();

            // ManagedThreadId % ProcessorCount does not give each thread a unique slot, so two
            // workers could end up writing to the same matrix at once. The localInit/localFinally
            // overload of Parallel.For gives every worker its own accumulator instead.
            Parallel.For(0, batchSize * seqLen,
                () => new Matrix<double>(_vocabSize, embedDim),
                (idx, loopState, localDEmbedding) =>
                {
                    int b = idx / seqLen;
                    int t = idx % seqLen;
                    int token = ids[b][t];
                    var gradRow = dInput.Row(idx);

                    // Accumulate gradients in the worker-private matrix (no lock needed here)
                    for (int d = 0; d < embedDim; d++)
                    {
                        localDEmbedding.AddInPlace(token, d, gradRow[0, d]);
                    }
                    return localDEmbedding;
                },
                localDEmbedding =>
                {
                    // Reduce: one merge per worker, serialized by the lock
                    lock (mergeLock)
                    {
                        dEmbedding += localDEmbedding; // Assumes Matrix<double> supports += operator
                    }
                });

            return dEmbedding;
        }



        /// <summary>
        /// Computes gradients of cross-entropy loss with respect to logits.
        /// </summary>
        /// <param name="logits">Predicted logits [batchSize * seqLen, vocabSize].</param>
        /// <param name="targetIds">Target token indices [batchSize][seqLen].</param>
        /// <returns>Gradient matrix [batchSize * seqLen, vocabSize].</returns>
        private Matrix<double> ComputeCrossEntropyGradients(Matrix<double> logits, int[][] targetIds)
        {
            
            int batchSize = targetIds.Length;
            int seqLen = targetIds[0].Length;
            var dLogits = new Matrix<double>(logits.Rows, logits.Columns);

            for (int b = 0; b < batchSize; b++)
            {
                for (int t = 0; t < seqLen; t++)
                {
                    int idx = b * seqLen + t;
                    var row = logits.Row(idx);
                    double maxLogit = row.Max();
                    var expLogits = row.AddScalar(-maxLogit).Exp();
                    double sumExp = expLogits.Sum();
                    if (sumExp < 1e-10) sumExp = 1e-10; // Defensive floor; after subtracting the max, sumExp is normally at least 1
                    var probs = expLogits * (1.0 / sumExp);

                    for (int v = 0; v < _vocabSize; v++)
                        dLogits[idx, v] = probs[0, v] - (v == targetIds[b][t] ? 1.0 : 0.0);
                }
            }

            return dLogits * (1.0 / (batchSize * seqLen));
        }



        /// <summary>
        /// Generates sinusoidal positional encodings for sequence positions.
        /// </summary>
        /// <param name="maxSeqLen">Maximum sequence length.</param>
        /// <param name="dModel">Embedding dimensionality.</param>
        /// <returns>Positional encoding matrix [maxSeqLen, dModel].</returns>
        private Matrix<double> GeneratePositionalEncoding(int maxSeqLen, int dModel)
        {
            
            var pe = new Matrix<double>(maxSeqLen, dModel);
            Parallel.For(0, maxSeqLen, pos =>
            {
                for (int i = 0; i < dModel; i += 2)
                {
                    double divTerm = Math.Pow(10000, (double)i / dModel);
                    pe[pos, i] = Math.Sin(pos / divTerm);
                    if (i + 1 < dModel)
                        pe[pos, i + 1] = Math.Cos(pos / divTerm);
                }
            });
            return pe;
        }
    }
}
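
For reference, the loop in `GeneratePositionalEncoding` above corresponds to the sinusoidal scheme from the original paper. Written out with `i` as the even column index used in the code (i = 0, 2, 4, ...), and `d_model` standing for the `dModel` argument, the two assignments are:

```latex
PE(pos,\, i) = \sin\!\left(\frac{pos}{10000^{\, i / d_{\mathrm{model}}}}\right),
\qquad
PE(pos,\, i+1) = \cos\!\left(\frac{pos}{10000^{\, i / d_{\mathrm{model}}}}\right)
```

Each sine/cosine pair shares one wavelength, and the wavelengths grow geometrically with the column index, so nearby positions differ mostly in the low columns and distant positions differ in the high ones.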

Here’s a textual diagram of the `Transformer` architecture, reflecting the flow from input to output:

+---------------------------+
| Input Tokens              |  [batchSize, inputSeqLen]
| (inputIds)                |
+---------------------------+
              |
              v
+---------------------------+
| Embedding Layer           |  _embedding: [vocabSize, dModel]
| (Token -> Dense Vector)   |
+---------------------------+
              |
              v
+---------------------------+
| Positional Encoding       |  _posEncoding: [maxSeqLen, dModel]
| (Added to Embedding)      |
+---------------------------+
              |
              v
+---------------------------+
| Encoder Stack             |  numLayers x EncoderLayer
|                           |
|   +-------------------+   |
|   | EncoderLayer      |   |
|   | - MultiHeadAttn   |   |  Self-Attention (numHeads)
|   | - FeedForward     |   |  [dModel -> dFF -> dModel]
|   | - LayerNorm (x2)  |   |
|   | - Dropout         |   |
|   +-------------------+   |
+---------------------------+
              |
              v
+---------------------------+
| Encoder Output            |  [batchSize * inputSeqLen, dModel]
+---------------------------+
              |
              |----------------> (to Decoder)
              v
+---------------------------+
| Target Tokens             |  [batchSize, targetSeqLen]
| (targetIds)               |
+---------------------------+
              |
              v
+---------------------------+
| Embedding Layer           |  Shared _embedding
| (Token -> Dense Vector)   |
+---------------------------+
              |
              v
+---------------------------+
| Positional Encoding       |  Shared _posEncoding
| (Added to Embedding)      |
+---------------------------+
              |
              v
+---------------------------+
| Decoder Stack             |  numLayers x DecoderLayer
|                           |
|   +-------------------+   |
|   | DecoderLayer      |   |
|   | - MultiHeadAttn   |   |  Masked Self-Attention (causalMask)
|   | - MultiHeadAttn   |   |  Cross-Attention (to Encoder Output)
|   | - FeedForward     |   |  [dModel -> dFF -> dModel]
|   | - LayerNorm (x3)  |   |
|   | - Dropout         |   |
|   +-------------------+   |
+---------------------------+
              |
              v
+---------------------------+
| Decoder Output            |  [batchSize * targetSeqLen, dModel]
+---------------------------+
              |
              v
+---------------------------+
| Output Projection         |  _outputProjection: [dModel, vocabSize]
| (Logits over Vocab)       |
+---------------------------+
              |
              v
+---------------------------+
| Logits                    |  [batchSize * targetSeqLen, vocabSize]
+---------------------------+
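
The masked self-attention step in the decoder relies on the causal mask that `Decode` requests from `Matrix<double>.CreateCausalMask`. That method is not shown in this section, so the following is only a minimal sketch of what such a mask usually looks like. The class name `CausalMaskSketch`, the plain `double[,]` return type, and the additive convention (0 for allowed, negative infinity for blocked) are all assumptions made for illustration; the real implementation may use a different convention.

```csharp
using System;

// Hypothetical helper, not part of the original code base.
public static class CausalMaskSketch
{
    // Builds an additive attention mask of shape [seqLen, seqLen].
    // Entry [query, key] is 0 when the key position is at or before the
    // query position, and negative infinity when it lies in the future.
    public static double[,] Create(int seqLen)
    {
        if (seqLen <= 0)
            throw new ArgumentOutOfRangeException(nameof(seqLen));

        var mask = new double[seqLen, seqLen];
        for (int query = 0; query < seqLen; query++)
        {
            for (int key = 0; key < seqLen; key++)
            {
                mask[query, key] = key <= query ? 0.0 : double.NegativeInfinity;
            }
        }
        return mask;
    }
}
```

Adding a mask like this to the attention scores before the softmax drives the weight of every future position to zero, which is what stops the decoder from looking ahead in the target sequence.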

The Transformer model we have developed is a custom implementation of the classic transformer architecture, originally introduced in the seminal 2017 paper "Attention Is All You Need" by Vaswani et al. This model is designed as a sequence-to-sequence framework, featuring both an encoder and a decoder, making it well-suited for tasks that involve transforming one sequence into another. Built in C# with a focus on modularity and thread-safety, the Transformer model encapsulates the core principles of attention mechanisms, enabling it to handle complex relationships within sequential data. Its architecture includes multi-head self-attention, feed-forward neural networks, and positional encodings, all of which are critical for capturing dependencies in sequences without relying on recurrent structures.

At its core, the Transformer model consists of several key components initialized in its constructor: a vocabulary size of 39, an embedding dimension (`dModel`) of 96, 4 attention heads, a feed-forward hidden size (`dFF`) of 48, and 4 layers each for the encoder and decoder. The embedding matrix maps token indices to dense vectors, while sinusoidal positional encodings provide the model with information about token positions in the sequence. The output projection layer transforms the decoder’s output into logits over the vocabulary, facilitating token prediction. With 488,256 parameters, the model is tiny compared to industrial-scale transformers like GPT-3. The reported parameter-to-learnable-parameter ratio of 1.00 means that every parameter is trainable; it does not by itself show that the model is well matched to the dataset. With only 2055 training pairs there are roughly 238 parameters per pair, which leaves plenty of room for overfitting.

The Transformer’s primary use case, as implemented, is sequence-to-sequence prediction, exemplified by the task defined in the `GenerateTrainingPairs` method. Here, the model takes an input sequence of length 2 and predicts a target sequence of the same length, where the target is a shifted version of the input with a new random token. This setup mimics tasks like machine translation or summarization, where an input sequence (e.g., a sentence in one language) is transformed into a target sequence (e.g., a sentence in another language). The model’s `Train` method employs a cross-entropy loss to optimize its parameters, using the Adam optimizer with a warmup learning rate schedule, gradient clipping, and dropout for regularization, achieving a peak validation accuracy of 11.44% after 4 epochs before overfitting was detected.
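
To make the task concrete, here is a minimal sketch of how one such pair could be produced. `GenerateTrainingPairs` itself is not shown in this section, so the class name `TrainingPairSketch`, the method signature, and the exact reading of "shifted with a new random token" (drop the first input token, append a fresh random one) are assumptions.

```csharp
using System;

// Hypothetical helper illustrating the task described above.
public static class TrainingPairSketch
{
    // Returns an input of random tokens and a target that is the input
    // shifted left by one position, with a new random token appended.
    public static (int[] Input, int[] Target) Next(Random rng, int vocabSize, int seqLen)
    {
        if (seqLen < 1) throw new ArgumentOutOfRangeException(nameof(seqLen));
        if (vocabSize < 1) throw new ArgumentOutOfRangeException(nameof(vocabSize));

        var input = new int[seqLen];
        for (int i = 0; i < seqLen; i++)
            input[i] = rng.Next(vocabSize);

        var target = new int[seqLen];
        Array.Copy(input, 1, target, 0, seqLen - 1); // shifted part, predictable from the input
        target[seqLen - 1] = rng.Next(vocabSize);    // fresh token, independent of the input

        return (input, target);
    }
}
```

Under this reading, with a sequence length of 2 only the first target token can be predicted from the input; the second is independent noise. With a vocabulary of 39, a model that always gets the first token right and guesses the second uniformly would reach about (1 + 1/39) / 2 ≈ 51.3% token accuracy, which gives a rough ceiling to compare the reported accuracy against.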

Efficiency-wise, the Transformer model demonstrates a balance between computational complexity and performance, though it has room for improvement. Training on a dataset of 2055 pairs with a batch size of 32, the model processes each epoch in approximately 26 minutes on a CPU, with batch times ranging from 13 to 44 seconds. This is relatively slow for a model of its size, primarily due to the CPU-based implementation and the use of `Parallel.For` loops for parallelization. The results are further limited by the small dataset and the difficulty of the task: predicting a sequence over a vocabulary of 39 tokens with a nearly uniform token distribution (entropy of 5.279 bits, against a maximum of log2(39) ≈ 5.285 bits) proved challenging, as evidenced by the low peak accuracy.
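
The entropy figure can be checked with a few lines of code. This is a minimal sketch that assumes the token frequencies are available as an array of counts; the class name `EntropySketch` is hypothetical and not part of the original code.

```csharp
using System;

// Hypothetical helper for checking how uniform a token distribution is.
public static class EntropySketch
{
    // Shannon entropy in bits of the distribution given by raw token counts.
    public static double BitsFromCounts(int[] counts)
    {
        long total = 0;
        foreach (int c in counts)
        {
            if (c < 0) throw new ArgumentException("Counts must be non-negative.");
            total += c;
        }
        if (total == 0) throw new ArgumentException("At least one count must be positive.");

        double entropy = 0.0;
        foreach (int c in counts)
        {
            if (c == 0) continue;              // 0 * log(0) is taken as 0
            double p = (double)c / total;
            entropy -= p * Math.Log(p, 2.0);   // log base 2 gives bits
        }
        return entropy;
    }
}
```

For a perfectly uniform distribution over 39 tokens this returns log2(39), about 5.285 bits, so a measured 5.279 bits means the tokens are very close to uniform and frequency alone gives the model almost nothing to exploit.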

The Transformer’s training performance highlights both its strengths and limitations. Over 5 epochs, the training loss decreased from 4.298823 to 2.540405, and the validation loss dropped to 2.122244, indicating that the model learned meaningful patterns. However, the accuracy remained low (peaking at 6.51% for training and 11.44% for validation), and the training code flagged overfitting in Epoch 5 after a significant drop in training accuracy (from 6.51% to 2.98%). A drop in training accuracy is more typical of unstable optimization than of overfitting in the usual sense, where training metrics keep improving while validation metrics get worse, so this signal should be read with some caution. Either way, the model can learn, but its generalization is limited by the small dataset, the task’s difficulty, and the encoder-decoder architecture, which may be overkill for the given task.
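
To put the loss values in context, it helps to write down what `ComputeCrossEntropyGradients` is differentiating. The loss function itself is not shown in this section, so the following assumes the usual mean cross-entropy over all N = batchSize * seqLen positions, computed with the natural logarithm. Here z is the logit matrix, y_n is the target token at position n, and p is the max-shifted softmax used in the code:

```latex
\mathcal{L} = -\frac{1}{N} \sum_{n=1}^{N} \log p_{n,\,y_n},
\qquad
p_{n,v} = \frac{e^{\,z_{n,v} - \max_u z_{n,u}}}{\sum_{w} e^{\,z_{n,w} - \max_u z_{n,u}}},
\qquad
\frac{\partial \mathcal{L}}{\partial z_{n,v}} = \frac{p_{n,v} - \mathbb{1}[v = y_n]}{N}
```

Under that assumption, a model that predicts every one of the 39 tokens with equal probability has a loss of ln(39) ≈ 3.664. A starting loss above that value and a final validation loss of about 2.12 below it are consistent with the model moving from worse-than-uniform initial predictions to something better than guessing.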

Looking to future use cases, the Transformer model holds potential for a variety of applications beyond its current sequence-to-sequence task. With modifications, it could be adapted for natural language processing tasks such as text generation, sentiment analysis, or even question answering, especially if transformed into a decoder-only architecture like GPT. For instance, by removing the encoder and focusing on autoregressive generation, the model could be used for chatbot applications, generating coherent responses based on user input. Its small size makes it suitable for deployment on resource-constrained devices, such as edge devices in IoT systems, where it could perform tasks like real-time translation or data summarization.

Another promising use case is in time-series analysis, where the Transformer’s attention mechanism can capture long-range dependencies in sequential data. For example, it could be applied to financial forecasting, predicting stock prices based on historical data, or in healthcare, analyzing patient data sequences to predict disease progression. The model’s ability to handle variable-length sequences and its lack of reliance on recurrent structures make it particularly effective for such tasks, where traditional RNNs often struggle with vanishing gradients.

In educational settings, the Transformer could serve as a teaching tool for understanding attention-based models. Its implementation in C# provides a clear, readable codebase that students and researchers can use to experiment with transformer architectures. By modifying hyperparameters like the number of layers, attention heads, or embedding dimensions, users can study the impact of these changes on model performance, making it a valuable resource for learning about deep learning concepts.

The model’s current implementation also opens the door for domain-specific applications. For instance, in bioinformatics, it could be used to predict protein sequences from DNA sequences, leveraging its sequence-to-sequence capabilities. In gaming, it could generate procedural content, such as dialogue or level sequences, enhancing the player experience. These use cases would require retraining the model on domain-specific datasets and potentially increasing its capacity to handle larger vocabularies and longer sequences.

One of the most significant areas for future improvement is the implementation of GPU-based math to enhance the model’s efficiency. Currently, the Transformer relies on CPU-based computations, using `Parallel.For` loops to parallelize operations like embedding lookups and matrix multiplications. This approach is suboptimal for deep learning, where matrix operations, the backbone of transformer models, are highly parallelizable and benefit immensely from GPU acceleration. By porting the model to a GPU-accelerated framework, such as CUDA.NET, or by integrating with libraries like TensorFlow.NET or PyTorch (via Python interop), training times could likely be reduced substantially.

GPU-based math would leverage the parallel architecture of GPUs to perform matrix operations in parallel across thousands of cores. For example, the matrix multiplication in the `Forward` method (`decoderOutput * _outputProjection`) and the attention computations in the `Model` class could be executed much faster on a GPU. With a typical mid-range GPU like an NVIDIA RTX 3060, batch processing times could plausibly drop from around 30 seconds to well under a second, which would bring epoch times down from 26 minutes to perhaps under a minute. These figures are estimates rather than measurements, and the use of `double` matters here, because consumer GPUs run double-precision math far slower than single-precision math. A speedup of this kind would enable training on larger datasets and for more epochs, improving the model’s accuracy and generalization.

Implementing GPU support would require refactoring the `Matrix<double>` class to use GPU-accelerated operations. One approach is to integrate with a library like Math.NET Numerics, which supports CUDA if configured properly, or to rewrite critical operations using CUDA directly. For instance, the `Embed` method’s loop over batches and sequence positions could be parallelized across GPU threads, with each thread handling a single token’s embedding lookup and positional encoding addition. Similarly, the attention mechanism’s dot products and softmax operations could be offloaded to the GPU, leveraging optimized CUDA kernels.

The benefits of GPU acceleration extend beyond training speed. During inference, such as in the `Generate` method, GPU-based computations would enable faster autoregressive generation, making the model more practical for real-time applications like chatbots or interactive systems. For example, generating a sequence of 10 tokens currently takes multiple seconds per token due to CPU bottlenecks; with a GPU, this could be reduced to milliseconds, enabling seamless user interactions.
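
The per-token cost comes from the shape of the generation loop itself. The `Generate` method is not shown in this section, so the following is only a sketch of plain greedy decoding; the class name `GreedyDecodeSketch`, the `nextTokenLogits` delegate (standing in for a full forward pass over the current prefix), and the start/end token parameters are all assumptions.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of greedy autoregressive decoding.
public static class GreedyDecodeSketch
{
    // nextTokenLogits receives the tokens generated so far and returns one
    // logit per vocabulary entry for the next position.
    public static List<int> Generate(
        Func<IReadOnlyList<int>, double[]> nextTokenLogits,
        int startToken,
        int endToken,
        int maxNewTokens)
    {
        var tokens = new List<int> { startToken };
        for (int step = 0; step < maxNewTokens; step++)
        {
            // One full model evaluation per generated token
            double[] logits = nextTokenLogits(tokens);

            // Greedy choice: take the index of the largest logit
            int best = 0;
            for (int v = 1; v < logits.Length; v++)
            {
                if (logits[v] > logits[best]) best = v;
            }

            tokens.Add(best);
            if (best == endToken) break;
        }
        return tokens;
    }
}
```

Because every new token triggers another pass through the model, generating 10 tokens costs roughly 10 forward passes, which is why any speedup in the underlying matrix math carries over directly to generation latency.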

However, implementing GPU support comes with challenges. It requires access to compatible hardware (e.g., an NVIDIA GPU with CUDA support) and familiarity with GPU programming. The C# ecosystem is less mature for GPU computing compared to Python, where frameworks like PyTorch and TensorFlow provide out-of-the-box GPU support. One solution is to create a hybrid implementation, where the core model logic remains in C#, but matrix operations are offloaded to a Python-based backend via interop, using libraries like Python.NET to call PyTorch functions.

That said, I have already started working on GPU offloading directly from C#, using OpenCL interop:

using System;
using System.Runtime.InteropServices;

public class OpenCLInterop : IDisposable
{
    private const int CL_SUCCESS = 0;
    // cl_mem_flags bit values from the OpenCL headers (1 << 0 is CL_MEM_READ_WRITE)
    private const int CL_MEM_WRITE_ONLY = 1 << 1;
    private const int CL_MEM_READ_ONLY = 1 << 2;

    [DllImport("OpenCL.dll", EntryPoint = "clGetPlatformIDs")]
    private static extern int clGetPlatformIDs(int numEntries, out IntPtr platforms, out int numPlatforms);
    // cl_device_type is a 64-bit cl_bitfield in the OpenCL headers, so it is marshalled as ulong
    [DllImport("OpenCL.dll", EntryPoint = "clGetDeviceIDs")]
    private static extern int clGetDeviceIDs(IntPtr platform, ulong deviceType, int numEntries, out IntPtr devices, out int numDevices);
    [DllImport("OpenCL.dll", EntryPoint = "clCreateContext")]
    private static extern IntPtr clCreateContext(IntPtr properties, int numDevices, ref IntPtr devices, IntPtr pfnNotify, IntPtr userData, out int errcodeRet);
    [DllImport("OpenCL.dll", EntryPoint = "clCreateCommandQueue")]
    private static extern IntPtr clCreateCommandQueue(IntPtr context, IntPtr device, long properties, out int errcodeRet);
    [DllImport("OpenCL.dll", EntryPoint = "clCreateProgramWithSource")]
    private static extern IntPtr clCreateProgramWithSource(IntPtr context, int count, string[] strings, IntPtr[] lengths, out int errcodeRet);
    [DllImport("OpenCL.dll", EntryPoint = "clBuildProgram")]
    private static extern int clBuildProgram(IntPtr program, int numDevices, ref IntPtr device, string options, IntPtr pfnNotify, IntPtr userData);
    [DllImport("OpenCL.dll", EntryPoint = "clCreateKernel")]
    private static extern IntPtr clCreateKernel(IntPtr program, string kernelName, out int errcodeRet);
    // cl_mem_flags is a 64-bit cl_bitfield in the OpenCL headers, so it is marshalled as ulong
    [DllImport("OpenCL.dll", EntryPoint = "clCreateBuffer")]
    private static extern IntPtr clCreateBuffer(IntPtr context, ulong flags, IntPtr size, IntPtr hostPtr, out int errcodeRet);
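    [DllImport("OpenCL.dll", EntryPoint = "clSetKernelArg")]
    private static extern int clSetKernelArg(IntPtr kernel, int argIndex, IntPtr argSize, ref IntPtr argValue);
    // Overload for scalar int kernel arguments (rowsA, colsA, colsB); a ref int cannot bind to ref IntPtr
    [DllImport("OpenCL.dll", EntryPoint = "clSetKernelArg")]
    private static extern int clSetKernelArg(IntPtr kernel, int argIndex, IntPtr argSize, ref int argValue);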
    [DllImport("OpenCL.dll", EntryPoint = "clEnqueueWriteBuffer")]
    private static extern int clEnqueueWriteBuffer(IntPtr commandQueue, IntPtr buffer, bool blockingWrite, IntPtr offset, IntPtr size, IntPtr ptr, int numEvents, IntPtr eventWaitList, out IntPtr evt);
    [DllImport("OpenCL.dll", EntryPoint = "clEnqueueReadBuffer")]
    private static extern int clEnqueueReadBuffer(IntPtr commandQueue, IntPtr buffer, bool blockingRead, IntPtr offset, IntPtr size, IntPtr ptr, int numEvents, IntPtr eventWaitList, out IntPtr evt);
    [DllImport("OpenCL.dll", EntryPoint = "clEnqueueNDRangeKernel")]
    private static extern int clEnqueueNDRangeKernel(IntPtr commandQueue, IntPtr kernel, int workDim, IntPtr globalWorkOffset, IntPtr[] globalWorkSize, IntPtr[] localWorkSize, int numEvents, IntPtr eventWaitList, out IntPtr evt);
    [DllImport("OpenCL.dll", EntryPoint = "clReleaseMemObject")]
    private static extern int clReleaseMemObject(IntPtr memobj);
    [DllImport("OpenCL.dll", EntryPoint = "clReleaseKernel")]
    private static extern int clReleaseKernel(IntPtr kernel);
    [DllImport("OpenCL.dll", EntryPoint = "clReleaseProgram")]
    private static extern int clReleaseProgram(IntPtr program);
    [DllImport("OpenCL.dll", EntryPoint = "clReleaseCommandQueue")]
    private static extern int clReleaseCommandQueue(IntPtr commandQueue);
    [DllImport("OpenCL.dll", EntryPoint = "clReleaseContext")]
    private static extern int clReleaseContext(IntPtr context);

    private IntPtr _context;
    private IntPtr _commandQueue;
    private IntPtr _program;
    private IntPtr _kernel;
    private IntPtr _device;

    public OpenCLInterop(string kernelSource, string kernelName)
    {
        int err = clGetPlatformIDs(1, out IntPtr platform, out int numPlatforms);
        if (err != CL_SUCCESS || numPlatforms == 0) throw new Exception($"clGetPlatformIDs failed: {err}");

        err = clGetDeviceIDs(platform, 4 /* CL_DEVICE_TYPE_GPU */, 1, out _device, out int numDevices);
        if (err != CL_SUCCESS || numDevices == 0) throw new Exception($"clGetDeviceIDs failed: {err}");

        err = 0;
        _context = clCreateContext(IntPtr.Zero, 1, ref _device, IntPtr.Zero, IntPtr.Zero, out err);
        if (err != CL_SUCCESS) throw new Exception($"clCreateContext failed: {err}");

        err = 0;
        _commandQueue = clCreateCommandQueue(_context, _device, 0, out err);
        if (err != CL_SUCCESS) throw new Exception($"clCreateCommandQueue failed: {err}");

        string[] source = { kernelSource };
        err = 0;
        _program = clCreateProgramWithSource(_context, 1, source, null, out err);
        if (err != CL_SUCCESS) throw new Exception($"clCreateProgramWithSource failed: {err}");

        err = clBuildProgram(_program, 1, ref _device, "", IntPtr.Zero, IntPtr.Zero);
        if (err != CL_SUCCESS) throw new Exception($"clBuildProgram failed: {err}");

        err = 0;
        _kernel = clCreateKernel(_program, kernelName, out err);
        if (err != CL_SUCCESS) throw new Exception($"clCreateKernel failed: {err}");
    }

    public void MatrixMultiply(double[] A, double[] B, double[] C, int rowsA, int colsA, int colsB)
    {
        int err;
        IntPtr bufferA = IntPtr.Zero, bufferB = IntPtr.Zero, bufferC = IntPtr.Zero;
        try
        {
            // Check every buffer allocation before the handle is used
            bufferA = clCreateBuffer(_context, CL_MEM_READ_ONLY, (IntPtr)(A.Length * sizeof(double)), IntPtr.Zero, out err);
            if (err != CL_SUCCESS) throw new Exception($"clCreateBuffer (A) failed: {err}");
            bufferB = clCreateBuffer(_context, CL_MEM_READ_ONLY, (IntPtr)(B.Length * sizeof(double)), IntPtr.Zero, out err);
            if (err != CL_SUCCESS) throw new Exception($"clCreateBuffer (B) failed: {err}");
            bufferC = clCreateBuffer(_context, CL_MEM_WRITE_ONLY, (IntPtr)(C.Length * sizeof(double)), IntPtr.Zero, out err);
            if (err != CL_SUCCESS) throw new Exception($"clCreateBuffer (C) failed: {err}");

            // Pin the managed arrays so the GC cannot move them during the blocking copies
            GCHandle handleA = GCHandle.Alloc(A, GCHandleType.Pinned);
            GCHandle handleB = GCHandle.Alloc(B, GCHandleType.Pinned);
            try
            {
                err = clEnqueueWriteBuffer(_commandQueue, bufferA, true, IntPtr.Zero, (IntPtr)(A.Length * sizeof(double)), handleA.AddrOfPinnedObject(), 0, IntPtr.Zero, out _);
                if (err != CL_SUCCESS) throw new Exception($"clEnqueueWriteBuffer (A) failed: {err}");
                err = clEnqueueWriteBuffer(_commandQueue, bufferB, true, IntPtr.Zero, (IntPtr)(B.Length * sizeof(double)), handleB.AddrOfPinnedObject(), 0, IntPtr.Zero, out _);
                if (err != CL_SUCCESS) throw new Exception($"clEnqueueWriteBuffer (B) failed: {err}");
            }
            finally
            {
                handleA.Free();
                handleB.Free();
            }

            // Buffer handles are pointer-sized. IntPtr.Size is used because sizeof(IntPtr) only compiles in an unsafe context.
            err = clSetKernelArg(_kernel, 0, (IntPtr)IntPtr.Size, ref bufferA);
            err |= clSetKernelArg(_kernel, 1, (IntPtr)IntPtr.Size, ref bufferB);
            err |= clSetKernelArg(_kernel, 2, (IntPtr)IntPtr.Size, ref bufferC);
            err |= clSetKernelArg(_kernel, 3, (IntPtr)sizeof(int), ref rowsA);
            err |= clSetKernelArg(_kernel, 4, (IntPtr)sizeof(int), ref colsA);
            err |= clSetKernelArg(_kernel, 5, (IntPtr)sizeof(int), ref colsB);
            if (err != CL_SUCCESS) throw new Exception($"clSetKernelArg failed: {err}");

            // One work item per output element: dimension 0 is the row, dimension 1 is the column
            IntPtr[] globalWorkSize = new IntPtr[] { (IntPtr)rowsA, (IntPtr)colsB };
            err = clEnqueueNDRangeKernel(_commandQueue, _kernel, 2, IntPtr.Zero, globalWorkSize, null, 0, IntPtr.Zero, out _);
            if (err != CL_SUCCESS) throw new Exception($"clEnqueueNDRangeKernel failed: {err}");

            GCHandle handleC = GCHandle.Alloc(C, GCHandleType.Pinned);
            try
            {
                err = clEnqueueReadBuffer(_commandQueue, bufferC, true, IntPtr.Zero, (IntPtr)(C.Length * sizeof(double)), handleC.AddrOfPinnedObject(), 0, IntPtr.Zero, out _);
                if (err != CL_SUCCESS) throw new Exception($"clEnqueueReadBuffer failed: {err}");
            }
            finally
            {
                handleC.Free();
            }
        }
        finally
        {
            // Release device buffers even when one of the steps above throws
            if (bufferA != IntPtr.Zero) clReleaseMemObject(bufferA);
            if (bufferB != IntPtr.Zero) clReleaseMemObject(bufferB);
            if (bufferC != IntPtr.Zero) clReleaseMemObject(bufferC);
        }
    }

    public void Dispose()
    {
        if (_kernel != IntPtr.Zero) clReleaseKernel(_kernel);
        if (_program != IntPtr.Zero) clReleaseProgram(_program);
        if (_commandQueue != IntPtr.Zero) clReleaseCommandQueue(_commandQueue);
        if (_context != IntPtr.Zero) clReleaseContext(_context);
    }
}
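
The `OpenCLInterop` class above expects a kernel source string and a kernel name, but neither is shown. The following is a minimal sketch of a kernel that would match the argument order set in `MatrixMultiply` (A, B, C, rowsA, colsA, colsB) and its two-dimensional work size. The class name `MatMulKernelSketch`, the kernel name `matmul`, and the row-major layout are assumptions, and the kernel needs a device that supports double precision (the `cl_khr_fp64` extension). A plain CPU version is included so GPU results can be compared against a known-good answer.

```csharp
using System;

// Hypothetical companion to OpenCLInterop; not part of the original code.
public static class MatMulKernelSketch
{
    public const string KernelName = "matmul";

    // OpenCL C source. Each work item computes one element of C = A * B,
    // with all three matrices stored row-major in flat arrays.
    public const string Source = @"
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
__kernel void matmul(__global const double* A,
                     __global const double* B,
                     __global double* C,
                     const int rowsA,
                     const int colsA,
                     const int colsB)
{
    int row = get_global_id(0);
    int col = get_global_id(1);
    if (row >= rowsA || col >= colsB) return;

    double sum = 0.0;
    for (int k = 0; k < colsA; k++)
        sum += A[row * colsA + k] * B[k * colsB + col];
    C[row * colsB + col] = sum;
}";

    // CPU reference with the same layout, useful for validating GPU output.
    public static double[] MultiplyOnCpu(double[] a, double[] b, int rowsA, int colsA, int colsB)
    {
        if (a.Length != rowsA * colsA || b.Length != colsA * colsB)
            throw new ArgumentException("Array lengths do not match the given dimensions.");

        var c = new double[rowsA * colsB];
        for (int row = 0; row < rowsA; row++)
        {
            for (int col = 0; col < colsB; col++)
            {
                double sum = 0.0;
                for (int k = 0; k < colsA; k++)
                    sum += a[row * colsA + k] * b[k * colsB + col];
                c[row * colsB + col] = sum;
            }
        }
        return c;
    }
}
```

With this in place, the interop class would be constructed as `new OpenCLInterop(MatMulKernelSketch.Source, MatMulKernelSketch.KernelName)`, and the output of `MatrixMultiply` could be checked element by element against `MultiplyOnCpu` within a small tolerance.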

Another future enhancement is scaling the Transformer model for larger tasks. Currently, its vocabulary size of 39 and sequence length of 2 limit its applicability to small, synthetic tasks. By increasing the vocabulary size (e.g., to roughly 50,000, as in GPT-3) and supporting longer sequences (e.g., 512 tokens), the model could handle real-world NLP tasks like document summarization or machine translation. This would require a much larger dataset, potentially billions of tokens, and a more powerful model with increased `dModel`, `numLayers`, and `numHeads`, which in turn makes GPU acceleration necessary for training to be feasible.

The Transformer’s modularity also makes it a candidate for transfer learning. By pre-training on a large, general-purpose dataset (e.g., a corpus of text), the model could learn broad language patterns, which could then be fine-tuned for specific tasks like sentiment classification or named entity recognition. This approach mirrors how models like BERT and GPT are used in practice, where pre-training on vast datasets enables strong performance on downstream tasks with limited labeled data.

In the context of future use cases, the Transformer could be integrated into production systems with proper optimization. For instance, in customer service, it could power a translation system that converts user queries from one language to another in real time, leveraging GPU acceleration for low-latency inference. In autonomous systems, it could process sensor data sequences to predict future states, such as vehicle trajectories, enhancing decision-making capabilities.

The model’s current implementation also highlights the importance of addressing overfitting and task design. Future iterations should incorporate stronger regularization techniques, such as higher dropout rates, weight decay, and data augmentation, to improve generalization. Additionally, redesigning the task to introduce learnable patterns (e.g., making the target sequence a function of the input sequence) could make the model more effective, as the current task’s partial randomness limits its learnability.

Finally, the Transformer model serves as a foundation for exploring advanced transformer variants. Future work could incorporate techniques like sparse attention (e.g., from the Longformer) to handle longer sequences efficiently, or integrate ideas from vision transformers (ViT) to process image sequences. By continuing to evolve the model, it can remain relevant in the rapidly advancing field of deep learning, where transformers continue to dominate due to their flexibility and performance.

In summary, the Transformer model we’ve created is a versatile, albeit small-scale, implementation of the transformer architecture, with applications in sequence-to-sequence tasks and potential for expansion into broader domains. Its current efficiency is limited by CPU-based computations and a challenging task design, but with GPU acceleration, architectural optimizations, and task redesign, it can become a powerful tool for future use cases, from real-time NLP systems to time-series analysis and beyond.

The Transformer Model is one of the best AI Models in use today. The Industry is evolving and new models are being proposed; some say they are better than the Transformer Model, but few have taken the industry by storm like the Transformer Model has!