Hierarchical Reasoning Model - HRM

Reasoning, the cognitive process of devising and executing complex, goal-oriented action sequences, has long been a cornerstone of human intelligence and a persistent challenge for artificial intelligence (AI).

Edit | Back to AI


The Hierarchical Reasoning Model (HRM): A Novel Architecture for Advanced Reasoning in Artificial Intelligence

Introduction

Reasoning, the cognitive process of devising and executing complex, goal-oriented action sequences, has long been a cornerstone of human intelligence and a persistent challenge for artificial intelligence (AI). Traditional large language models (LLMs), despite their remarkable capabilities in natural language processing, often struggle with tasks requiring multi-step reasoning, such as solving complex Sudoku puzzles or navigating large mazes. These models typically rely on Chain-of-Thought (CoT) techniques, which, while effective in certain contexts, suffer from brittle task decomposition, high data requirements, and significant computational latency. The Hierarchical Reasoning Model (HRM), introduced in a recent research paper by Guan Wang et al. (2025), presents a groundbreaking approach to overcoming these limitations. Inspired by the hierarchical and multi-timescale processing of the human brain, HRM offers a novel recurrent architecture that achieves exceptional computational depth, training stability, and efficiency. This essay explores the architecture, mechanisms, performance, and broader implications of HRM, drawing on insights from the arXiv paper and the associated GitHub repository.

Background: The Challenges of Reasoning in AI

To appreciate the significance of HRM, it is essential to understand the limitations of current AI systems in reasoning tasks. Most LLMs, such as those powering modern chatbots, excel in pattern recognition and language generation but falter when faced with tasks requiring sequential, multi-step reasoning. For instance, solving a complex Sudoku puzzle involves not only recognizing patterns but also planning a sequence of moves while maintaining an understanding of the puzzle’s global constraints. Similarly, tasks like pathfinding in large mazes or tackling the Abstraction and Reasoning Corpus (ARC) benchmark demand abstract planning and detailed execution, areas where LLMs often require extensive fine-tuning or large datasets to perform adequately.

Chain-of-Thought (CoT) prompting, a popular technique, encourages LLMs to break down problems into intermediate steps, mimicking human reasoning processes. However, CoT has notable drawbacks: it requires explicit supervision of intermediate steps, demands large amounts of training data, and introduces high latency due to iterative processing. These limitations highlight the need for a new architectural paradigm capable of efficient, unsupervised reasoning. The Hierarchical Reasoning Model addresses these challenges by drawing inspiration from the human brain’s ability to process information hierarchically, balancing high-level strategic planning with low-level detailed computations.

The Hierarchical Reasoning Model: Architecture and Design

The HRM, developed by researchers from Sapient Intelligence and collaborators, is a recurrent neural architecture designed to emulate the hierarchical and multi-timescale processing observed in the human brain. Unlike traditional LLMs that rely on transformer-based architectures, HRM introduces a dual-module system comprising two interdependent recurrent modules: a high-level module for slow, abstract planning and a low-level module for rapid, detailed computations. This structure allows HRM to execute complex sequential reasoning tasks in a single forward pass, eliminating the need for explicit supervision of intermediate steps.

High-Level Module: Abstract Planning

The high-level module serves as the “CEO” of the HRM, analogous to the prefrontal cortex in the human brain, which is responsible for executive functions such as planning and decision-making. This module operates on a slower timescale, focusing on abstract representations of the task at hand. For example, in a Sudoku puzzle, the high-level module might determine the overall strategy, such as prioritizing the elimination of candidates in a specific row or column. By maintaining a global view of the problem, this module ensures that the reasoning process remains coherent and goal-oriented.

The high-level module’s recurrent nature allows it to maintain a memory of previous planning steps, enabling it to adapt strategies dynamically as the task progresses. This is a significant departure from CoT-based approaches, which often require predefined prompts or templates to guide intermediate steps. By contrast, the high-level module in HRM learns to generate abstract plans autonomously, reducing the need for external supervision.

Low-Level Module: Detailed Computations

Complementing the high-level module, the low-level module handles rapid, fine-grained computations required to execute the plans devised by the high-level module. Continuing the Sudoku example, this module would perform tasks such as identifying specific cell values or checking constraints within a given row, column, or block. Operating on a faster timescale, the low-level module ensures that detailed computations are performed efficiently, feeding results back to the high-level module to refine the overall strategy.

The interdependence between the two modules is a key feature of HRM. The high-level module provides strategic guidance, while the low-level module supplies detailed feedback, creating a feedback loop that mirrors the dynamic interplay between planning and execution in human cognition. This hierarchical structure allows HRM to achieve significant computational depth—processing complex tasks with multiple layers of abstraction—while maintaining training stability and efficiency.

Single Forward Pass and Efficiency

One of HRM’s most remarkable features is its ability to execute sequential reasoning tasks in a single forward pass. Unlike CoT-based models, which require multiple iterations to decompose and solve a problem, HRM integrates planning and execution into a unified process. This is achieved through the recurrent architecture, which allows the model to iterate internally without external prompting. The result is a significant reduction in computational latency, making HRM suitable for real-time applications where speed is critical.

Moreover, HRM achieves this efficiency with a remarkably small parameter count of just 27 million parameters. In contrast, modern LLMs often have billions of parameters, requiring substantial computational resources for training and inference. HRM’s lean architecture demonstrates that advanced reasoning capabilities can be achieved without the need for massive scale, challenging the prevailing trend of ever-larger models in AI research.

Performance and Evaluation

The HRM has demonstrated exceptional performance across a range of complex reasoning tasks, as detailed in the arXiv paper. With only 1000 training samples and no pre-training or CoT data, HRM achieves near-perfect performance on tasks such as solving complex Sudoku puzzles and finding optimal paths in large mazes. These tasks require both abstract reasoning and precise execution, highlighting the model’s ability to balance high-level planning with low-level computation.

Perhaps most notably, HRM outperforms much larger models on the Abstraction and Reasoning Corpus (ARC), a benchmark designed to measure artificial general intelligence (AGI) capabilities. The ARC tasks require models to identify abstract patterns and apply them to novel scenarios, a challenge that has proven difficult for even the most advanced LLMs. HRM’s reported performance on ARC-AGI-2, with a self-reported score of 55% using 1120 training examples, significantly surpasses baseline LLMs, which score in the 1–4% range. However, it’s worth noting that this result remains unverified by independent leaderboard submissions, and critics have pointed out that the evaluation methodology could be clearer.

The model’s ability to achieve such results with minimal training data underscores its data efficiency, a critical advantage in scenarios where labeled data is scarce or expensive. Furthermore, HRM’s performance without pre-training suggests that its architecture is inherently suited to reasoning tasks, reducing reliance on large-scale, general-purpose datasets.

Implementation and Open-Source Availability

The HRM is not only a theoretical advancement but also a practical tool, as evidenced by its open-source release on GitHub. The repository, maintained by Sapient Intelligence, provides the code for the Hierarchical Reasoning Model, allowing researchers and developers to explore, replicate, and extend the model. The open-source nature of HRM is a significant step toward democratizing access to advanced AI technologies, enabling a broader community to contribute to its development and apply it to diverse use cases.

Additionally, independent implementations, such as the one by lucidrains, have made the model more accessible by integrating it into user-friendly libraries. These efforts have sparked interest in the AI community, with discussions on platforms like Reddit highlighting HRM’s potential as a “game changer” for reasoning tasks. However, some community feedback has raised concerns about the lack of trained checkpoints or detailed evaluation logs in the official repository, which could hinder independent verification of the model’s performance.

Critical Analysis: Strengths and Limitations

HRM’s strengths lie in its innovative architecture, efficiency, and performance on complex reasoning tasks. By emulating the hierarchical processing of the human brain, HRM addresses key limitations of CoT-based approaches, offering a more robust and scalable solution for sequential reasoning. Its small parameter count and data efficiency make it an attractive option for resource-constrained environments, such as edge devices or low-data regimes. Furthermore, its open-source availability fosters collaboration and innovation, positioning HRM as a catalyst for further advancements in AI reasoning.

However, HRM is not without its limitations. The self-reported 55% score on ARC-AGI-2, while impressive, has not been independently verified, raising questions about reproducibility. The lack of detailed evaluation artifacts, such as trained checkpoints or per-task outputs, limits the ability of third parties to confirm the model’s performance. Additionally, some critics have noted that the paper’s methodology could be clearer, particularly regarding the training process. For instance, while the authors claim that HRM was not trained on test sets, the description of the training process has been criticized as ambiguous, potentially undermining confidence in the results.

Another consideration is the model’s scope. While HRM excels in tasks like Sudoku and maze navigation, its applicability to broader domains, such as natural language reasoning or real-world decision-making, remains underexplored. Community discussions suggest interest in extending HRM to other use cases, but further research is needed to evaluate its generalizability.

Implications for AI Research and Applications

The introduction of HRM has far-reaching implications for AI research and its practical applications. By demonstrating that a small, efficient model can outperform larger LLMs on reasoning tasks, HRM challenges the paradigm of scaling up model size to achieve better performance. This shift could lead to more sustainable AI development, reducing the computational and environmental costs associated with training massive models.

In practical terms, HRM’s efficiency and low data requirements make it suitable for applications in resource-constrained settings, such as autonomous systems, robotics, or embedded devices. For example, its ability to solve complex pathfinding problems could be applied to autonomous navigation in drones or self-driving vehicles. Similarly, its proficiency in abstract reasoning could enhance decision-making systems in fields like logistics, finance, or healthcare.

Moreover, HRM’s open-source release encourages community-driven innovation. Researchers and developers can build on the model to explore new applications, such as integrating it with scalable memory systems or adapting it for multi-modal tasks. The enhanced version proposed by Siddhesh2377, which includes scalable memory and JSON metadata, illustrates the potential for extending HRM’s capabilities to handle persistent knowledge and context-aware reasoning.

Ethical and Societal Considerations

As with any AI advancement, HRM raises important ethical and societal considerations. Its efficiency and accessibility could democratize AI, enabling smaller organizations and individuals to leverage advanced reasoning capabilities. However, this also increases the risk of misuse, such as deploying HRM in applications that require robust safety guarantees, like autonomous weapons or critical infrastructure management. Ensuring that HRM is used responsibly will require clear guidelines and oversight.

Additionally, the model’s reliance on minimal training data raises questions about bias and fairness. While HRM’s data efficiency is a strength, it also means that the quality and representativeness of the training data are critical. If the training data is biased or incomplete, HRM’s reasoning outputs could reflect those limitations, potentially leading to unintended consequences in real-world applications.

Future Directions

The HRM opens several avenues for future research. First, addressing the concerns about evaluation transparency and reproducibility will be crucial for establishing trust in the model’s capabilities. Providing trained checkpoints, detailed evaluation logs, and clear documentation of the training process would enable independent validation and broader adoption.

Second, exploring HRM’s applicability to a wider range of tasks, such as natural language reasoning, multi-modal processing, or real-time decision-making, could expand its impact. For instance, integrating HRM with language models or vision systems could create hybrid architectures capable of tackling complex, multi-domain problems.

Finally, investigating the theoretical underpinnings of HRM’s hierarchical architecture could yield insights into the nature of reasoning itself. By studying how the high-level and low-level modules interact, researchers may uncover principles that apply not only to AI but also to cognitive science and neuroscience, fostering interdisciplinary advancements.

Conclusion

The Hierarchical Reasoning Model represents a significant leap forward in AI’s ability to perform complex, sequential reasoning tasks. By drawing inspiration from the human brain’s hierarchical processing, HRM achieves remarkable efficiency and performance with a lean architecture and minimal training data. Its dual-module design, single-pass execution, and open-source availability make it a promising tool for both research and real-world applications. While challenges remain, particularly regarding evaluation transparency and generalizability, HRM’s potential to transform AI reasoning is undeniable. As the AI community continues to explore and extend this model, HRM may pave the way for more efficient, scalable, and general-purpose reasoning systems, bringing us closer to the goal of artificial general intelligence.

namespace AI.Math
{



    #region Using Statements:



    using System;
    using System.Linq;
    using System.Runtime.InteropServices;
    using System.Text.Json.Serialization;
    using System.Threading.Tasks;



    #endregion



    /// <summary>
    /// Represents a matrix of numerical values for efficient linear algebra operations, optimized for use in machine learning models like Transformers.
    /// Supports generic numeric types (float or double) with parallelized operations for speed and thread safety.
    /// </summary>
    /// <typeparam name="T">The numeric type, constrained to float or double, implementing IComparable and IEquatable for comparisons and equality checks.</typeparam>
    /// <remarks>
    /// This matrix is a 2D array internally, designed for operations like addition, multiplication, transposition, and normalization.
    /// It uses jagged arrays for memory efficiency and parallel processing for performance, with thread-safe random number generation where applicable.
    /// </remarks>
    public class Matrix<T> where T : struct, IComparable<T>, IEquatable<T>
    {



        #region Fields:



        /// <summary>
        /// Indicates whether T is float, used for type-specific operations to optimize performance.
        /// </summary>
        [JsonInclude]
        internal static readonly bool IsFloat = typeof(T) == typeof(float);



        /// <summary>
        /// Indicates whether T is double, used for type-specific operations to optimize precision.
        /// </summary>
        [JsonInclude]
        internal static readonly bool IsDouble = typeof(T) == typeof(double);



        /// <summary>
        /// Gets the number of rows in the matrix.
        /// </summary>
        [JsonInclude]
        public int Rows => MatrixRows;



        /// <summary>
        /// Gets the number of columns in the matrix.
        /// </summary>
        [JsonInclude]
        public int Columns => MatrixColumns;



        /// <summary>
        /// _sync Lock object.
        /// </summary>
        [NonSerialized]
        private readonly object _syncLock = new object();



        /// <summary>
        /// A thread-safe random number generator for dropout and initialization.
        /// </summary>
        [NonSerialized]
        private static readonly ThreadLocal<Random> _random = new ThreadLocal<Random>(() => new Random(Guid.NewGuid().GetHashCode()));



        #endregion



        #region Properties:



        /// <summary>
        /// Threshold for switching to parallel operations; tune based on testing for optimal performance.
        /// </summary>
        [JsonInclude]
        public int ParallelThreshold { get; private set; } = 100;



        /// <summary>
        /// Internal jagged array storing the matrix data, where each row is an array of type T.
        /// </summary>
        [JsonInclude]
        public T[][] Data { get; private set; }



        /// <summary>
        /// Number of rows in the matrix.
        /// </summary>
        [JsonInclude]
        public int MatrixRows { get; private set; }



        /// <summary>
        /// Number of columns in the matrix.
        /// </summary>
        [JsonInclude]
        public int MatrixColumns { get; private set; }



        /// <summary>
        /// Gets or sets the value at the specified row and column index.
        /// </summary>
        /// <param name="row">Row index, must be between 0 and Rows-1.</param>
        /// <param name="col">Column index, must be between 0 and Columns-1.</param>
        /// <returns>The value at the specified position.</returns>
        /// <exception cref="IndexOutOfRangeException">Thrown if row or col is out of bounds.</exception>
        /// <remarks>
        /// Provides direct access to matrix elements with bounds checking for safety.
        /// </remarks>
        public T this[int row, int col]
        {
            get
            {
                if (row < 0 || row >= MatrixRows || col < 0 || col >= MatrixColumns)
                    throw new IndexOutOfRangeException($"Index [{row}, {col}] is out of bounds for matrix [{MatrixRows}, {MatrixColumns}].");
                return Data[row][col];
            }
            set
            {
                if (row < 0 || row >= MatrixRows || col < 0 || col >= MatrixColumns)
                    throw new IndexOutOfRangeException($"Index [{row}, {col}] is out of bounds for matrix [{MatrixRows}, {MatrixColumns}].");
                lock (_syncLock)
                    Data[row][col] = value;
            }
        }



        #endregion



        /// <summary>
        /// Initializes a new matrix with the specified dimensions, filled with zeros.
        /// </summary>
        /// <remarks>
        /// Default Constructor for Json Serialization/Deserialization..
        /// </remarks>
        public Matrix()
        {
        }



        /// <summary>
        /// Initializes a new matrix with the specified dimensions, filled with zeros.
        /// </summary>
        /// <param name="rows">Number of rows, must be positive.</param>
        /// <param name="cols">Number of columns, must be positive.</param>
        /// <exception cref="ArgumentException">Thrown if rows or cols are less than or equal to zero.</exception>
        /// <remarks>
        /// Uses Parallel.For to initialize rows efficiently, ensuring thread safety as each row is independent.
        /// </remarks>
        public Matrix(int rows, int cols)
        {

            if (rows <= 0 || cols <= 0)
                throw new ArgumentException($"Matrix dimensions must be positive, got rows={rows}, cols={cols}.");

            MatrixRows = rows;
            MatrixColumns = cols;
            Data = new T[rows][];
            Parallel.For(0, rows, i => Data[i] = new T[cols]);
        }



        /// <summary>
        /// Initializes a matrix from a provided 2D jagged array.
        /// </summary>
        /// <param name="data">Jagged array of values, must be non-null, non-empty, and rectangular.</param>
        /// <exception cref="ArgumentNullException">Thrown if data is null.</exception>
        /// <exception cref="ArgumentException">Thrown if data is empty or has inconsistent row lengths.</exception>
        /// <remarks>
        /// Copies the input data in parallel to ensure the internal array is independent of the input, avoiding external modifications.
        /// </remarks>
        public Matrix(T[][] data)
        {

            if (data == null)
                throw new ArgumentNullException(nameof(data), "Input data cannot be null.");
            if (data.Length == 0 || data[0].Length == 0)
                throw new ArgumentException("Input data must have at least one row and column.");
            if (!data.All(row => row.Length == data[0].Length))
                throw new ArgumentException("All rows in the input data must have the same length.");

            MatrixRows = data.Length;
            MatrixColumns = data[0].Length;
            Data = new T[MatrixRows][];
            Parallel.For(0, MatrixRows, i =>
            {
                Data[i] = new T[MatrixColumns];
                Array.Copy(data[i], Data[i], MatrixColumns);
            });
        }



        /// <summary>
        /// Clips all elements of the matrix to be within the specified range [min, max].
        /// Each element is set to min if it is less than min, max if it is greater than max,
        /// or remains unchanged if it is within the range.
        /// </summary>
        /// <param name="min">The lower bound of the clipping range, of type T (float or double).</param>
        /// <param name="max">The upper bound of the clipping range, of type T (float or double).</param>
        /// <returns>A new Matrix<T> with the same dimensions, where all elements are clipped to [min, max].</returns>
        /// <exception cref="InvalidOperationException">Thrown if the matrix type T is not float or double.</exception>
        /// <exception cref="ArgumentException">Thrown if max is less than min.</exception>
        /// <remarks>
        /// This method performs an element-wise clipping operation, ensuring each element satisfies min <= element <= max.
        /// It is optimized with parallel processing for large matrices (when Rows * Columns >= ParallelThreshold),
        /// ensuring efficient computation while maintaining thread safety by performing read-only operations on the input matrix
        /// and writing to a new result matrix. The method uses type-safe comparisons via Compare to handle float or double types.
        /// This operation is commonly used in machine learning for gradient clipping to prevent exploding gradients during training.
        /// </remarks>
        public Matrix<T> Clip(T min, T max)
        {
            if (!IsFloat && !IsDouble)
                throw new InvalidOperationException("Clip is only supported for Matrix<float> or Matrix<double>.");
            if (Compare(max, min) < 0)
                throw new ArgumentException("Maximum value must be greater than or equal to minimum value.", nameof(max));

            var result = new Matrix<T>(Rows, Columns);
            bool useParallel = Rows * Columns >= ParallelThreshold;

            if (useParallel)
            {
                Parallel.For(0, Rows, i =>
                {
                    for (int j = 0; j < Columns; j++)
                    {
                        T value = Data[i][j];
                        result.Data[i][j] = Compare(value, min) < 0 ? min : Compare(value, max) > 0 ? max : value;
                    }
                });
            }
            else
            {
                for (int i = 0; i < Rows; i++)
                {
                    for (int j = 0; j < Columns; j++)
                    {
                        T value = Data[i][j];
                        result.Data[i][j] = Compare(value, min) < 0 ? min : Compare(value, max) > 0 ? max : value;
                    }
                }
            }

            return result;
        }



        /// <summary>
        /// Applies the softmax function to each row of the matrix, normalizing the values to sum to 1.
        /// The softmax function is computed as exp(x - max(x)) / sum(exp(x - max(x))) for each row,
        /// ensuring numerical stability by subtracting the maximum value to prevent overflow.
        /// </summary>
        /// <returns>A new Matrix<T> of the same dimensions where each row sums to 1.</returns>
        /// <exception cref="InvalidOperationException">Thrown if the matrix type T is not float or double.</exception>
        /// <remarks>
        /// This method is thread-safe as it performs read-only operations on the input matrix and writes to a new result matrix.
        /// Parallelization is used for large matrices (Rows * Columns >= ParallelThreshold) to improve performance.
        /// A small constant (1e-6 for float, 1e-10 for double) is used to prevent division by zero.
        /// The method is optimized with a two-pass approach: one to compute max and exp/sum, and one to normalize.
        /// </remarks>
        public Matrix<T> CustomSoftmax()
        {
            if (!IsFloat && !IsDouble)
                throw new InvalidOperationException("CustomSoftmax is only supported for Matrix<float> or Matrix<double>.");

            var result = new Matrix<T>(Rows, Columns);
            T[] rowSums = new T[Rows]; // Store sums for each row
            T[] rowMaxes = new T[Rows]; // Store max values for each row
            T epsilon = IsFloat ? (T)(object)1e-6f : (T)(object)1e-10;

            // Determine if parallelization is beneficial based on matrix size
            bool useParallel = Rows * Columns >= ParallelThreshold; // Align with class's threshold (default 100)

            if (useParallel)
            {
                Parallel.For(0, Rows, i =>
                {
                    // Find max value in the row for numerical stability
                    T max = Data[i][0];
                    for (int j = 1; j < Columns; j++)
                        if (Compare(Data[i][j], max) > 0)
                            max = Data[i][j];
                    rowMaxes[i] = max;

                    // Compute exp(x - max) and sum
                    T sum = Zero();
                    for (int j = 0; j < Columns; j++)
                    {
                        T val = Exp(Subtract(Data[i][j], max));
                        result.Data[i][j] = val;
                        sum = Add(sum, val);
                    }
                    rowSums[i] = Compare(sum, epsilon) <= 0 ? epsilon : sum; // Prevent division by zero
                });

                // Normalize each row
                Parallel.For(0, Rows, i =>
                {
                    for (int j = 0; j < Columns; j++)
                        result.Data[i][j] = Divide(result.Data[i][j], rowSums[i]);
                });
            }
            else
            {
                // Sequential processing for small matrices
                for (int i = 0; i < Rows; i++)
                {
                    // Find max value in the row
                    T max = Data[i][0];
                    for (int j = 1; j < Columns; j++)
                        if (Compare(Data[i][j], max) > 0)
                            max = Data[i][j];
                    rowMaxes[i] = max;

                    // Compute exp(x - max) and sum
                    T sum = Zero();
                    for (int j = 0; j < Columns; j++)
                    {
                        T val = Exp(Subtract(Data[i][j], max));
                        result.Data[i][j] = val;
                        sum = Add(sum, val);
                    }
                    rowSums[i] = Compare(sum, epsilon) <= 0 ? epsilon : sum;

                    // Normalize
                    for (int j = 0; j < Columns; j++)
                        result.Data[i][j] = Divide(result.Data[i][j], rowSums[i]);
                }
            }

            return result;
        }



        /// <summary>
        /// Computes the mean along the specified axis (0 for rows, 1 for columns).
        /// </summary>
        /// <param name="axis">Axis along which to compute the mean: 0 (mean of each column, returns [1, Columns]), 1 (mean of each row, returns [Rows, 1]).</param>
        /// <returns>A new Matrix<T> containing the mean values along the specified axis.</returns>
        /// <exception cref="InvalidOperationException">Thrown if the matrix type T is not float or double.</exception>
        /// <exception cref="ArgumentException">Thrown if axis is not 0 or 1.</exception>
        /// <remarks>
        /// This method computes the mean of elements along the specified axis, producing a matrix of shape [1, Columns] for axis=0 or [Rows, 1] for axis=1.
        /// The operation is optimized with parallel processing for large matrices (Rows * Columns >= ParallelThreshold) and uses Kahan summation for numerical stability.
        /// Thread-safe as it performs read-only operations on the input matrix and writes to a new result matrix.
        /// </remarks>
        public Matrix<T> Mean(int axis)
        {
            if (!IsFloat && !IsDouble)
                throw new InvalidOperationException("Mean is only supported for Matrix<float> or Matrix<double>.");
            if (axis != 0 && axis != 1)
                throw new ArgumentException("Axis must be 0 (rows) or 1 (columns).");

            if (Rows == 0 || Columns == 0)
                throw new InvalidOperationException("Cannot compute mean of an empty matrix.");

            bool useParallel = Rows * Columns >= ParallelThreshold;

            if (axis == 0)
            {
                var result = new Matrix<T>(1, Columns);
                if (useParallel)
                {
                    Parallel.For(0, Columns, j =>
                    {
                        T sum = Zero();
                        T compensation = Zero(); // Kahan summation for numerical stability
                        for (int i = 0; i < Rows; i++)
                        {
                            T y = Subtract(Data[i][j], compensation);
                            T t = Add(sum, y);
                            compensation = Subtract(Subtract(t, sum), y);
                            sum = t;
                        }
                        result.Data[0][j] = Divide(sum, (T)(object)(IsFloat ? (float)Rows : (double)Rows));
                    });
                }
                else
                {
                    for (int j = 0; j < Columns; j++)
                    {
                        T sum = Zero();
                        T compensation = Zero();
                        for (int i = 0; i < Rows; i++)
                        {
                            T y = Subtract(Data[i][j], compensation);
                            T t = Add(sum, y);
                            compensation = Subtract(Subtract(t, sum), y);
                            sum = t;
                        }
                        result.Data[0][j] = Divide(sum, (T)(object)(IsFloat ? (float)Rows : (double)Rows));
                    }
                }
                return result;
            }
            else // axis == 1
            {
                var result = new Matrix<T>(Rows, 1);
                if (useParallel)
                {
                    Parallel.For(0, Rows, i =>
                    {
                        T sum = Zero();
                        T compensation = Zero();
                        for (int j = 0; j < Columns; j++)
                        {
                            T y = Subtract(Data[i][j], compensation);
                            T t = Add(sum, y);
                            compensation = Subtract(Subtract(t, sum), y);
                            sum = t;
                        }
                        result.Data[i][0] = Divide(sum, (T)(object)(IsFloat ? (float)Columns : (double)Columns));
                    });
                }
                else
                {
                    for (int i = 0; i < Rows; i++)
                    {
                        T sum = Zero();
                        T compensation = Zero();
                        for (int j = 0; j < Columns; j++)
                        {
                            T y = Subtract(Data[i][j], compensation);
                            T t = Add(sum, y);
                            compensation = Subtract(Subtract(t, sum), y);
                            sum = t;
                        }
                        result.Data[i][0] = Divide(sum, (T)(object)(IsFloat ? (float)Columns : (double)Columns));
                    }
                }
                return result;
            }
        }



        /// <summary>
        /// Computes the minimum value across all elements in the matrix.
        /// </summary>
        /// <returns>The smallest value in the matrix as type T.</returns>
        /// <exception cref="InvalidOperationException">Thrown if the matrix is empty (has no rows or columns).</exception>
        /// <remarks>
        /// Iterates over all elements in parallel to find the minimum value, leveraging the Compare method for type safety.
        /// Useful for determining the lower bound of matrix values in optimization or normalization tasks.
        /// </remarks>
        public T Min()
        {

            if (Rows == 0 || Columns == 0)
                throw new InvalidOperationException("Cannot compute minimum of an empty matrix.");

            T globalMin = default; // Will be set in first iteration
            object lockObj = new object();
            bool first = true;
            Parallel.For(0, Rows, () => Data[0][0], (i, state, localMin) =>
            {
                T min = Data[i][0];
                for (int j = 1; j < Columns; j++)
                    if (Compare(Data[i][j], min) < 0)
                        min = Data[i][j];
                return min;
            }, localMin =>
            {
                lock (lockObj)
                {
                    if (first) { globalMin = localMin; first = false; }
                    else if (Compare(localMin, globalMin) < 0)
                        globalMin = localMin;
                }
            });
            return globalMin;
        }



        /// <summary>
        /// Computes the maximum value across all elements in the matrix.
        /// </summary>
        /// <returns>The largest value in the matrix as type T.</returns>
        /// <exception cref="InvalidOperationException">Thrown if the matrix is empty (has no rows or columns).</exception>
        /// <remarks>
        /// Iterates over all elements in parallel to find the maximum value, leveraging the Compare method for type safety.
        /// Useful for determining the upper bound of matrix values in optimization or normalization tasks.
        /// </remarks>
        public T Max()
        {

            if (Rows == 0 || Columns == 0)
                throw new InvalidOperationException("Cannot compute maximum of an empty matrix.");

            T[] rowMaxes = new T[Rows];
            Parallel.For(0, Rows, i =>
            {
                T max = Data[i][0];
                for (int j = 1; j < Columns; j++)
                    if (Compare(Data[i][j], max) > 0)
                        max = Data[i][j];
                rowMaxes[i] = max;
            });

            T globalMax = rowMaxes[0];
            for (int i = 1; i < Rows; i++)
                if (Compare(rowMaxes[i], globalMax) > 0)
                    globalMax = rowMaxes[i];
            return globalMax;
        }



        /// <summary>
        /// Retrieves the token index with the highest logit value for a specific sequence position.
        /// </summary>
        /// <param name="logits">Matrix of logit values, where rows represent sequence positions and columns represent token probabilities.</param>
        /// <param name="currentSeqLen">The current sequence length, used to determine the row to inspect (1-based index).</param>
        /// <returns>The column index (token ID) of the maximum logit value in the specified row, or -1 if an error occurs.</returns>
        /// <exception cref="ArgumentNullException">Thrown if logits is null.</exception>
        /// <exception cref="ArgumentOutOfRangeException">Thrown if currentSeqLen is less than 1 or exceeds the number of rows in logits.</exception>
        /// <remarks>
        /// Uses ArgMaxInRow to find the maximum value's index in the row at (currentSeqLen - 1). Includes additional validation and logging for debugging.
        /// </remarks>
        public static int GetNextToken(Matrix<double> logits, int currentSeqLen)
        {

            if (logits == null)
                throw new ArgumentNullException(nameof(logits), "Logits matrix cannot be null.");

            if (logits.Rows == 0 || logits.Columns == 0)
            {
                Console.WriteLine($"Error: Logits matrix is empty [Rows={logits.Rows}, Columns={logits.Columns}]");
                return -1; // Fallback value
            }

            if (currentSeqLen < 0 || currentSeqLen > logits.Rows)
            {
                Console.WriteLine($"Error: currentSeqLen={currentSeqLen} is out of range for logits.Rows={logits.Rows}");
                return -1; // Fallback value
            }

            try
            {
                int[] maxIndices = logits.ArgMaxInRow();
                if (maxIndices == null || maxIndices.Length != logits.Rows)
                {
                    Console.WriteLine($"Error: ArgMaxInRow returned invalid result (null or length mismatch, expected {logits.Rows}, got {maxIndices?.Length})");
                    return -1;
                }

                int rowIndex = currentSeqLen;
                return maxIndices[rowIndex];
            }
            catch (Exception ex)
            {
                Console.WriteLine($"Exception in GetNextToken: {ex.Message} at row {currentSeqLen - 1}, logits dimensions [{logits.Rows}, {logits.Columns}]");
                return -1; // Fallback value
            }
        }



        /// <summary>
        /// Finds the index of the maximum value in each row of the matrix.
        /// </summary>
        /// <returns>An array of integers, where each element is the column index of the maximum value in the corresponding row.</returns>
        /// <exception cref="InvalidOperationException">Thrown if the matrix has no rows.</exception>
        /// <remarks>
        /// Parallelized over rows for efficiency, using Compare for type-safe comparisons.
        /// Identical to ArgMax but explicitly named ArgMaxInRow for clarity and to match original naming conventions.
        /// Commonly used in classification tasks to identify the predicted class index per row.
        /// </remarks>
        public int[] ArgMaxInRow()
        {

            if (Rows == 0)
                throw new InvalidOperationException("Cannot compute ArgMaxInRow on an empty matrix.");

            int[] result = new int[Rows];
            Parallel.For(0, Rows, i =>
            {
                int maxIdx = 0;
                T maxVal = Data[i][0];
                for (int j = 1; j < Columns; j++)
                {
                    if (Compare(Data[i][j], maxVal) > 0)
                    {
                        maxVal = Data[i][j];
                        maxIdx = j;
                    }
                }
                result[i] = maxIdx;
            });
            return result;
        }



        /// <summary>
        /// Normalizes an array of values to have an L2 norm of 1, scaling all elements proportionally.
        /// </summary>
        /// <param name="array">The input array to normalize, must be non-null and non-empty.</param>
        /// <returns>A new array with the same length, normalized to unit length.</returns>
        /// <exception cref="ArgumentNullException">Thrown if the input array is null.</exception>
        /// <exception cref="ArgumentException">Thrown if the input array is empty.</exception>
        /// <remarks>
        /// Computes the L2 norm (sqrt(sum of squares)) and divides each element by it.
        /// Returns a zero array if the norm is zero to avoid division issues. Used internally for vector normalization.
        /// </remarks>
        public static T[] NormalizeArray(T[] array)
        {

            if (array == null)
                throw new ArgumentNullException(nameof(array), "Input array cannot be null.");
            if (array.Length == 0)
                throw new ArgumentException("Input array must not be empty.", nameof(array));

            T sumOfSquares = Zero();
            object lockObj = new object();
            Parallel.For(0, array.Length, () => Zero(), (i, state, localSum) =>
                Add(localSum, Multiply(array[i], array[i])), localSum =>
                {
                    lock (lockObj)
                        sumOfSquares = Add(sumOfSquares, localSum);
                });

            T norm = Sqrt(sumOfSquares);
            T[] result = new T[array.Length];

            if (Compare(norm, Zero()) == 0)
                return result;

            Parallel.For(0, array.Length, i => result[i] = Divide(array[i], norm));
            return result;
        }



        /// <summary>
        /// Initializes a matrix using Xavier (Glorot) initialization for neural network weights.
        /// </summary>
        /// <param name="rows">Number of rows in the matrix, typically the output size.</param>
        /// <param name="cols">Number of columns in the matrix, typically the input size.</param>
        /// <returns>A new Matrix<T> initialized with Xavier values.</returns>
        /// <exception cref="ArgumentException">Thrown if rows or cols are less than or equal to zero.</exception>
        /// <remarks>
        /// Uses a uniform distribution scaled by sqrt(6 / (rows + cols)) to initialize weights, promoting stable gradients.
        /// Thread-safe random number generation with lock ensures consistency in parallel contexts.
        /// Commonly used in deep learning to initialize weight matrices for layers.
        /// </remarks>
        public static Matrix<T> InitializeXavier(int rows, int cols)
        {

            if (rows <= 0) throw new ArgumentException("Number of rows must be positive.", nameof(rows));
            if (cols <= 0) throw new ArgumentException("Number of columns must be positive.", nameof(cols));

            double limit = Math.Sqrt(6.0 / (rows + cols));
            var result = new Matrix<T>(rows, cols);
            Parallel.For(0, rows, i =>
            {
                Random rand = _random.Value;
                for (int j = 0; j < cols; j++)
                {
                    double val = rand.NextDouble() * 2 * limit - limit;
                    result.Data[i][j] = IsFloat ? (T)(object)(float)val : (T)(object)val;
                }
            });
            return result;
        }



        /// <summary>
        /// Initializes a matrix using He initialization for neural network weights, optimized for ReLU activations.
        /// </summary>
        /// <param name="rows">Number of rows in the matrix, typically the output size.</param>
        /// <param name="cols">Number of columns in the matrix, typically the input size.</param>
        /// <returns>A new Matrix<T> initialized with He values.</returns>
        /// <exception cref="ArgumentException">Thrown if rows or cols are less than or equal to zero.</exception>
        /// <remarks>
        /// Uses a normal distribution with mean 0 and standard deviation sqrt(2 / cols) to initialize weights, promoting stable gradients.
        /// Thread-safe random number generation with lock ensures consistency in parallel contexts.
        /// Commonly used in deep learning to initialize weight matrices for layers with ReLU or similar activations.
        /// </remarks>
        public static Matrix<T> InitializeHe(int rows, int cols)
        {

            if (rows <= 0) throw new ArgumentException("Number of rows must be positive.", nameof(rows));
            if (cols <= 0) throw new ArgumentException("Number of columns must be positive.", nameof(cols));

            double stdDev = Math.Sqrt(2.0 / cols);
            var result = new Matrix<T>(rows, cols);
            Parallel.For(0, rows, i =>
            {
                Random rand = _random.Value;
                for (int j = 0; j < cols; j++)
                {
                    double u1 = 1.0 - rand.NextDouble();
                    double u2 = rand.NextDouble();
                    double normal = Math.Sqrt(-2.0 * Math.Log(u1)) * Math.Cos(2.0 * Math.PI * u2);
                    double val = normal * stdDev;
                    result.Data[i][j] = IsFloat ? (T)(object)(float)val : (T)(object)val;
                }
            });
            return result;
        }



        /// <summary>
        /// Sets the values of a specific row in the matrix to the values of another matrix row.
        /// </summary>
        /// <param name="rowIndex">The index of the row to set, must be between 0 and Rows-1.</param>
        /// <param name="rowValues">A Matrix<T> with 1 row and matching column count containing the values to set.</param>
        /// <exception cref="IndexOutOfRangeException">Thrown if rowIndex is out of bounds.</exception>
        /// <exception cref="ArgumentNullException">Thrown if rowValues is null.</exception>
        /// <exception cref="ArgumentException">Thrown if rowValues does not have 1 row or its column count does not match this matrix's column count.</exception>
        public void SetRow(int rowIndex, Matrix<T> rowValues)
        {

            if (rowIndex < 0 || rowIndex >= MatrixRows)
                throw new IndexOutOfRangeException($"Row index {rowIndex} is out of bounds for matrix with {Rows} rows.");
            if (rowValues == null)
                throw new ArgumentNullException(nameof(rowValues), "Row values matrix cannot be null.");
            if (rowValues.Rows != 1)
                throw new ArgumentException($"Row values must have exactly 1 row, got {rowValues.Rows}.");
            if (rowValues.Columns != MatrixColumns)
                throw new ArgumentException($"Row values columns ({rowValues.Columns}) must match matrix columns ({Columns}).");

            lock (_syncLock)
                Array.Copy(rowValues.Data[0], Data[rowIndex], MatrixColumns);
        }



        /// <summary>
        /// Adds a value to an existing element in the matrix, performing an in-place increment.
        /// </summary>
        /// <param name="row">The row index of the element to increment, must be between 0 and Rows-1.</param>
        /// <param name="col">The column index of the element to increment, must be between 0 and Columns-1.</param>
        /// <param name="value">The value to add to the existing element.</param>
        /// <exception cref="IndexOutOfRangeException">Thrown if row or col is out of bounds.</exception>
        public void AddInPlace(int row, int col, T value)
        {

            if (row < 0 || row >= MatrixRows)
                throw new IndexOutOfRangeException($"Row index {row} is out of bounds for matrix with {Rows} rows.");
            if (col < 0 || col >= MatrixColumns)
                throw new IndexOutOfRangeException($"Column index {col} is out of bounds for matrix with {Columns} columns.");
            lock (_syncLock)
                Data[row][col] = Add(Data[row][col], value);
        }



        /// <summary>
        /// Gets the values of a specific row as an array.
        /// </summary>
        /// <param name="rowIndex">The index of the row to retrieve, must be between 0 and Rows-1.</param>
        /// <returns>An array of type T containing the row values.</returns>
        /// <exception cref="IndexOutOfRangeException">Thrown if rowIndex is out of bounds.</exception>
        public T[] GetRowArray(int rowIndex)
        {

            if (rowIndex < 0 || rowIndex >= MatrixRows)
                throw new IndexOutOfRangeException($"Row index {rowIndex} is out of bounds for matrix with {Rows} rows.");

            T[] result = new T[MatrixColumns];
            Array.Copy(Data[rowIndex], result, MatrixColumns);
            return result;
        }



        /// <summary>
        /// Creates a deep copy of the matrix.
        /// </summary>
        /// <returns>A new Matrix<T> with identical values.</returns>
        public Matrix<T> Copy()
        {

            var copy = new Matrix<T>(Rows, Columns);
            Parallel.For(0, Rows, i => { for (int j = 0; j < Columns; j++) copy[i, j] = this[i, j]; });
            return copy;
        }



        /// <summary>
        /// Returns a new matrix representing the specified row as a 1xColumns matrix.
        /// </summary>
        /// <param name="rowIndex">The index of the row to extract, must be between 0 and Rows-1.</param>
        /// <returns>A new Matrix<T> with shape [1, Columns] containing the row data.</returns>
        /// <exception cref="IndexOutOfRangeException">Thrown if rowIndex is out of bounds.</exception>
        public Matrix<T> Row(int rowIndex)
        {

            if (rowIndex < 0 || rowIndex >= MatrixRows)
                throw new IndexOutOfRangeException($"Row index {rowIndex} is out of bounds for matrix with {Rows} rows.");

            var result = new Matrix<T>(1, MatrixColumns);
            Array.Copy(Data[rowIndex], result.Data[0], MatrixColumns);
            return result;
        }



        /// <summary>
        /// Returns a new matrix representing the specified column as a Rowsx1 matrix.
        /// </summary>
        /// <param name="colIndex">The index of the column to extract, must be between 0 and Columns-1.</param>
        /// <returns>A new Matrix<T> with shape [Rows, 1] containing the column data.</returns>
        /// <exception cref="IndexOutOfRangeException">Thrown if colIndex is out of bounds.</exception>
        public Matrix<T> Column(int colIndex)
        {

            if (colIndex < 0 || colIndex >= MatrixColumns)
                throw new IndexOutOfRangeException($"Column index {colIndex} is out of bounds for matrix with {Columns} columns.");

            var result = new Matrix<T>(MatrixRows, 1);
            Parallel.For(0, MatrixRows, i => result.Data[i][0] = Data[i][colIndex]);
            return result;
        }



        /// <summary>
        /// Checks if all elements in the matrix satisfy the given predicate.
        /// </summary>
        /// <param name="predicate">The condition to test each element against.</param>
        /// <returns>True if all elements satisfy the predicate, false otherwise.</returns>
        public bool All(Func<T, bool> predicate)
        {

            bool allTrue = true;
            Parallel.For(0, Rows, (i, state) =>
            {
                if (allTrue)
                {
                    for (int j = 0; j < Columns; j++)
                    {
                        if (!predicate(Data[i][j]))
                        {
                            allTrue = false;
                            state.Stop(); // Early exit
                            break;
                        }
                    }
                }
            });
            return allTrue;
        }



        /// <summary>
        /// Adds a scalar value to every element of the matrix, returning a new matrix.
        /// </summary>
        /// <param name="value">The scalar value to add, converted to type T (float or double).</param>
        /// <returns>A new Matrix<T> with the scalar added to each element.</returns>
        /// <exception cref="InvalidOperationException">Thrown if T is not float or double.</exception>
        /// <remarks>
        /// Creates a new matrix of the same shape [Rows, Columns] with each element Data[i][j] + value.
        /// Optimized with parallel processing for large matrices (Rows * Columns >= ParallelThreshold).
        /// Thread-safe as it performs read-only operations on the input matrix and writes to a new result matrix.
        /// </remarks>
        public Matrix<T> Add(T value)
        {
            if (!IsFloat && !IsDouble)
                throw new InvalidOperationException("Add is only supported for Matrix<float> or Matrix<double>.");

            var result = new Matrix<T>(Rows, Columns);
            bool useParallel = Rows * Columns >= ParallelThreshold;

            if (useParallel)
            {
                Parallel.For(0, Rows, i =>
                {
                    for (int j = 0; j < Columns; j++)
                    {
                        if (IsFloat)
                            result.Data[i][j] = (T)(object)((float)(object)Data[i][j] + (float)(object)value);
                        else // IsDouble
                            result.Data[i][j] = (T)(object)((double)(object)Data[i][j] + (double)(object)value);
                    }
                });
            }
            else
            {
                for (int i = 0; i < Rows; i++)
                    for (int j = 0; j < Columns; j++)
                    {
                        if (IsFloat)
                            result.Data[i][j] = (T)(object)((float)(object)Data[i][j] + (float)(object)value);
                        else // IsDouble
                            result.Data[i][j] = (T)(object)((double)(object)Data[i][j] + (double)(object)value);
                    }
            }

            return result;
        }



        /// <summary>
        /// Checks if any element in the matrix satisfies the given predicate.
        /// </summary>
        /// <param name="predicate">The condition to test each element against.</param>
        /// <returns>True if at least one element satisfies the predicate, false otherwise.</returns>
        public bool Any(Func<T, bool> predicate)
        {

            bool found = false;
            Parallel.For(0, Rows, (i, state) =>
            {
                if (!found) // Avoid unnecessary work after finding a match
                {
                    for (int j = 0; j < Columns; j++)
                    {
                        if (predicate(Data[i][j]))
                        {
                            found = true;
                            state.Stop(); // Early exit
                            break;
                        }
                    }
                }
            });
            return found;
        }



        /// <summary>
        /// Creates a causal mask to prevent attending to future tokens in decoding.
        /// </summary>
        /// <param name="seqLen">Sequence length.</param>
        /// <returns>Causal mask [seqLen, seqLen] with -infinity for future positions.</returns>
        public static Matrix<double> CreateCausalMask(int seqLen)
        {

            var mask = new Matrix<double>(seqLen, seqLen);
            Parallel.For(0, seqLen, i =>
            {
                for (int j = 0; j < seqLen; j++)
                    mask[i, j] = j > i ? -1e9 : 0.0;
            });
            return mask;
        }



        /// <summary>
        /// Normalizes the matrix to have an L2 norm of 1, scaling all elements proportionally.
        /// </summary>
        /// <returns>A new matrix with the same shape, normalized to unit length.</returns>
        /// <exception cref="InvalidOperationException">Thrown if the matrix is empty.</exception>
        public Matrix<T> Normalize()
        {

            if (Rows == 0 || Columns == 0)
                throw new InvalidOperationException("Cannot normalize an empty matrix.");

            T sumOfSquares = this.ElementWiseMultiply(this).Sum();
            if (double.IsNaN((double)(object)sumOfSquares))
                throw new InvalidOperationException("Sum of squares is NaN in Normalize.");

            T norm = Sqrt(sumOfSquares);
            var result = new Matrix<T>(Rows, Columns);

            T epsilon = IsFloat ? (T)(object)1e-6f : (T)(object)1e-6;
            if (Compare(norm, epsilon) <= 0)
                return result; // Return zeros if norm is too small

            if (Rows * Columns < ParallelThreshold)
            {
                for (int i = 0; i < Rows; i++)
                    for (int j = 0; j < Columns; j++)
                        result.Data[i][j] = Divide(Data[i][j], norm);
            }
            else
            {
                Parallel.For(0, Rows, i =>
                {
                    for (int j = 0; j < Columns; j++)
                        result.Data[i][j] = Divide(Data[i][j], norm);
                });
            }
            return result;
        }



        /// <summary>
        /// Computes the multiplicative inverse (1/x) of each element in the matrix.
        /// </summary>
        /// <returns>A new matrix with the same dimensions where each element is the inverse of the corresponding element.</returns>
        /// <exception cref="ArgumentException">Thrown if any element is zero or near-zero.</exception>
        public Matrix<T> ElementWiseInverse()
        {

            var result = new Matrix<T>(Rows, Columns);
            T epsilon = IsFloat ? (T)(object)1e-6f : (T)(object)1e-6;

            Parallel.For(0, Rows, i =>
            {
                for (int j = 0; j < Columns; j++)
                {
                    T value = Data[i][j];
                    T absValue = IsFloat ? (T)(object)Math.Abs((float)(object)value) : (T)(object)Math.Abs((double)(object)value);
                    if (Compare(absValue, epsilon) < 0)
                        result[i, j] = Zero(); // Return 0 for zero or near-zero values
                    else
                        result[i, j] = Divide(One(), value);
                }
            });
            return result;
        }



        /// <summary>
        /// Computes the square root of each element in the matrix.
        /// </summary>
        /// <returns>A new matrix with the same dimensions where each element is the square root of the corresponding element.</returns>
        /// <exception cref="ArgumentException">Thrown if any element is negative.</exception>
        public Matrix<T> ElementWiseSqrt()
        {

            var result = new Matrix<T>(Rows, Columns);
            Parallel.For(0, Rows, i =>
            {
                for (int j = 0; j < Columns; j++)
                {
                    T value = Data[i][j];
                    if (Compare(value, Zero()) < 0)
                        throw new ArgumentException($"Cannot compute square root of negative value {value} at position [{i}, {j}].");
                    result[i, j] = Sqrt(value);
                }
            });
            return result;
        }



        /// <summary>
        /// Computes the mean of each row in the matrix.
        /// For a matrix of shape [numRows, numCols], this method returns a matrix of shape [1, numRows],
        /// where each entry [0, i] is the mean of the i-th row.
        /// </summary>
        /// <returns>A matrix of shape [1, numRows] containing the mean of each row.</returns>
        /// <exception cref="InvalidOperationException">Thrown if the matrix is empty (has 0 rows or 0 columns).</exception>
        public Matrix<T> MeanRows()
        {
            if (Rows == 0 || Columns == 0)
                throw new InvalidOperationException("Cannot compute mean of an empty matrix.");

            var result = new Matrix<T>(1, Rows);
            for (int i = 0; i < Rows; i++)
            {
                T sum = default(T);
                for (int j = 0; j < Columns; j++)
                {
                    sum = (dynamic)sum + this[i, j];
                }
                result[0, i] = (dynamic)sum / Columns;
            }
            return result;
        }



        /// <summary>
        /// Adds a bias vector (1 row) to each row of the matrix.
        /// </summary>
        /// <param name="bias">Matrix with 1 row and matching columns to apply as bias.</param>
        /// <returns>A new matrix with bias added to each row.</returns>
        /// <exception cref="ArgumentNullException">Thrown if bias is null.</exception>
        /// <exception cref="ArgumentException">Thrown if bias dimensions are incompatible.</exception>
        public Matrix<T> AddBias(Matrix<T> bias)
        {

            if (bias == null)
                throw new ArgumentNullException(nameof(bias), "Bias matrix cannot be null.");
            if (bias.Rows != 1)
                throw new ArgumentException($"Bias must have 1 row, got {bias.Rows}.");
            if (bias.Columns != this.Columns)
                throw new ArgumentException($"Bias columns ({bias.Columns}) must match matrix columns ({this.Columns}).");

            var result = new Matrix<T>(this.Rows, this.Columns);
            Parallel.For(0, this.Rows, i =>
            {
                for (int j = 0; j < this.Columns; j++)
                    result.Data[i][j] = Add(this.Data[i][j], bias.Data[0][j]);
            });
            return result;
        }



        /// <summary>
        /// Applies the exponential function to each element of the matrix.
        /// </summary>
        /// <returns>A new matrix with exponential values.</returns>
        public Matrix<T> Exp()
        {

            var result = new Matrix<T>(Rows, Columns);
            if (Rows * Columns < ParallelThreshold)
            {
                for (int i = 0; i < Rows; i++)
                    for (int j = 0; j < Columns; j++)
                        result.Data[i][j] = Exp(Data[i][j]);
            }
            else
            {
                Parallel.For(0, Rows, i =>
                {
                    for (int j = 0; j < Columns; j++)
                        result.Data[i][j] = Exp(Data[i][j]);
                });
            }
            return result;
        }



        /// <summary>
        /// Converts token indices to a one-hot encoded matrix.
        /// </summary>
        /// <param name="ids">Token indices [batchSize][seqLen].</param>
        /// <param name="vocabSize">Size of the vocabulary.</param>
        /// <returns>One-hot matrix [batchSize * seqLen, vocabSize].</returns>
        private Matrix<double> OneHotEncode(int[][] ids, int vocabSize)
        {

            int batchSize = ids.Length;
            int seqLen = ids[0].Length;
            var result = new Matrix<double>(batchSize * seqLen, vocabSize);
            Parallel.For(0, batchSize, b =>
            {
                for (int t = 0; t < seqLen; t++)
                {
                    int idx = b * seqLen + t;
                    result[idx, ids[b][t]] = 1.0;
                }
            });
            return result;
        }



        /// <summary>
        /// Computes the sum of all elements in the matrix.
        /// </summary>
        /// <returns>The total sum as type T.</returns>
        public T Sum()
        {

            T sum = Zero();
            T compensation = Zero(); // Kahan compensation term
            object lockObj = new object();
            Parallel.For(0, Rows, () => (Zero(), Zero()), (i, state, local) =>
            {
                T localSum = local.Item1;
                T localComp = local.Item2;
                for (int j = 0; j < Columns; j++)
                {
                    T y = Subtract(Data[i][j], localComp);
                    T t = Add(localSum, y);
                    localComp = Subtract(Subtract(t, localSum), y);
                    localSum = t;
                }
                return (localSum, localComp);
            }, local =>
            {
                lock (lockObj)
                {
                    T y = Subtract(local.Item1, compensation);
                    T t = Add(sum, y);
                    compensation = Subtract(Subtract(t, sum), y);
                    sum = t;
                }
            });
            return sum;
        }



        /// <summary>
        /// Sums all elements across rows, returning a matrix with 1 row and the same number of columns.
        /// </summary>
        /// <returns>A new Matrix<T> of shape [1, Columns] where each element is the sum of the corresponding column.</returns>
        public Matrix<T> SumRows()
        {

            var result = new Matrix<T>(1, Columns);
            Parallel.For(0, Columns, j =>
            {
                T sum = Zero();
                for (int i = 0; i < Rows; i++)
                    sum = Add(sum, Data[i][j]);
                result.Data[0][j] = sum;
            });
            return result;
        }



        /// <summary>
        /// Adds a scalar to every element of the matrix.
        /// </summary>
        /// <param name="scalar">Value to add to each element.</param>
        /// <returns>A new matrix with the scalar added to all elements.</returns>
        public Matrix<T> AddScalar(T scalar)
        {

            var result = new Matrix<T>(Rows, Columns);
            if (Rows * Columns < ParallelThreshold)
            {
                for (int i = 0; i < Rows; i++)
                    for (int j = 0; j < Columns; j++)
                        result.Data[i][j] = Add(Data[i][j], scalar);
            }
            else
            {
                Parallel.For(0, Rows, i =>
                {
                    for (int j = 0; j < Columns; j++)
                        result.Data[i][j] = Add(Data[i][j], scalar);
                });
            }
            return result;
        }



        /// <summary>
        /// Finds the index of the maximum value in each row.
        /// </summary>
        /// <returns>An array of integers, length equal to Rows, where each element is the column index of the maximum value in that row.</returns>
        public int[] ArgMax()
        {

            int[] result = new int[Rows];
            Parallel.For(0, Rows, i =>
            {
                int maxIdx = 0;
                T maxVal = Data[i][0];
                for (int j = 1; j < Columns; j++)
                {
                    if (Compare(Data[i][j], maxVal) > 0)
                    {
                        maxVal = Data[i][j];
                        maxIdx = j;
                    }
                }
                result[i] = maxIdx;
            });
            return result;
        }



        /// <summary>
        /// Performs element-wise multiplication of two matrices.
        /// </summary>
        /// <param name="other">Matrix to multiply element-wise with this matrix.</param>
        /// <returns>A new matrix with element-wise products.</returns>
        /// <exception cref="ArgumentNullException">Thrown if other is null.</exception>
        /// <exception cref="ArgumentException">Thrown if sizes mismatch.</exception>
        public Matrix<T> ElementWiseMultiply(Matrix<T> other)
        {

            CheckSizes(this, other, "element-wise multiply");
            var result = new Matrix<T>(Rows, Columns);
            if (Rows * Columns < ParallelThreshold)
            {
                for (int i = 0; i < Rows; i++)
                    for (int j = 0; j < Columns; j++)
                        result.Data[i][j] = Multiply(Data[i][j], other.Data[i][j]);
            }
            else
            {
                Parallel.For(0, Rows, i =>
                {
                    for (int j = 0; j < Columns; j++)
                        result.Data[i][j] = Multiply(Data[i][j], other.Data[i][j]);
                });
            }
            return result;
        }



        /// <summary>
        /// Transposes the matrix, swapping rows and columns.
        /// </summary>
        /// <returns>A new matrix with dimensions [Columns, Rows].</returns>
        /// <remarks>
        /// Parallelized over rows of the original matrix to efficiently build the transposed matrix.
        /// Restored from original Matrix<T> implementation.
        /// </remarks>
        public Matrix<T> Transpose()
        {

            var result = new Matrix<T>(Columns, Rows);
            Parallel.For(0, Rows, i =>
            {
                for (int j = 0; j < Columns; j++)
                    result.Data[j][i] = Data[i][j];
            });
            return result;
        }



        /// <summary>
        /// Computes the variance of each row in the matrix, given the mean of each row.
        /// For a matrix of shape [numRows, numCols], this method returns a matrix of shape [1, numRows],
        /// where each entry [0, i] is the variance of the i-th row, computed as (1/numCols) * Σ(x - mean)^2.
        /// </summary>
        /// <param name="mean">A matrix of shape [1, numRows] containing the mean of each row,
        /// typically computed by MeanRows().</param>
        /// <returns>A matrix of shape [1, numRows] containing the variance of each row.</returns>
        /// <exception cref="ArgumentException">Thrown if the mean matrix does not have shape [1, numRows]
        /// or if the matrix is empty (has 0 rows or 0 columns).</exception>
        public Matrix<T> VarianceRows(Matrix<T> mean)
        {
            if (Rows == 0 || Columns == 0)
                throw new InvalidOperationException("Cannot compute variance of an empty matrix.");
            if (mean.Rows != 1 || mean.Columns != Rows)
                throw new ArgumentException($"Mean matrix must have shape [1, {Rows}], but got [{mean.Rows}, {mean.Columns}].");

            var result = new Matrix<T>(1, Rows);
            for (int i = 0; i < Rows; i++)
            {
                T sumSquaredDiff = default(T);
                for (int j = 0; j < Columns; j++)
                {
                    T diff = (dynamic)this[i, j] - mean[0, i];
                    sumSquaredDiff = (dynamic)sumSquaredDiff + ((dynamic)diff * (dynamic)diff);
                }
                result[0, i] = (dynamic)sumSquaredDiff / Columns;
            }
            return result;
        }



        /// <summary>
        /// Extracts a submatrix from the current matrix starting at the specified row and column, with the given size.
        /// </summary>
        /// <param name="startRow">The starting row index (inclusive) of the submatrix, must be between 0 and Rows-1.</param>
        /// <param name="rowCount">The number of rows to include in the submatrix, must be positive and within bounds.</param>
        /// <param name="startCol">The starting column index (inclusive) of the submatrix, must be between 0 and Columns-1.</param>
        /// <param name="colCount">The number of columns to include in the submatrix, must be positive and within bounds.</param>
        /// <returns>A new Matrix<T> containing the specified submatrix [rowCount, colCount].</returns>
        /// <exception cref="ArgumentOutOfRangeException">Thrown if the specified indices or counts are invalid or exceed matrix dimensions.</exception>
        /// <remarks>
        /// Parallelized for efficient extraction, used in multi-head attention to split matrices into head-specific portions.
        /// Restored from original Matrix<T> implementation.
        /// </remarks>
        public Matrix<T> SubMatrix(int startRow, int rowCount, int startCol, int colCount)
        {

            if (startRow < 0 || startRow >= Rows)
                throw new ArgumentOutOfRangeException(nameof(startRow), $"Start row {startRow} must be between 0 and {Rows - 1}.");
            if (startCol < 0 || startCol >= Columns)
                throw new ArgumentOutOfRangeException(nameof(startCol), $"Start column {startCol} must be between 0 and {Columns - 1}.");
            if (rowCount <= 0)
                throw new ArgumentOutOfRangeException(nameof(rowCount), $"Row count {rowCount} must be positive.");
            if (colCount <= 0)
                throw new ArgumentOutOfRangeException(nameof(colCount), $"Column count {colCount} must be positive.");
            if (startRow + rowCount > Rows)
                throw new ArgumentOutOfRangeException(nameof(rowCount), $"Row range {startRow} to {startRow + rowCount - 1} exceeds matrix rows {Rows}.");
            if (startCol + colCount > Columns)
                throw new ArgumentOutOfRangeException(nameof(colCount), $"Column range {startCol} to {startCol + colCount - 1} exceeds matrix columns {Columns}.");

            var result = new Matrix<T>(rowCount, colCount);
            Parallel.For(0, rowCount, i =>
            {
                for (int j = 0; j < colCount; j++)
                    result[i, j] = this[startRow + i, startCol + j];
            });
            return result;
        }



        /// <summary>
        /// Adds another matrix to this matrix in-place, element-wise.
        /// </summary>
        /// <param name="a">The matrix to add to (modified in-place), must be non-null.</param>
        /// <param name="b">The matrix to add, must be non-null and have the same dimensions as <paramref name="a"/>.</param>
        /// <returns>The modified matrix <paramref name="a"/>.</returns>
        /// <exception cref="ArgumentNullException">Thrown if either matrix <paramref name="a"/> or <paramref name="b"/> is null.</exception>
        /// <exception cref="ArgumentException">Thrown if the matrices have different dimensions or if <paramref name="b"/> is not a valid matrix for element-wise addition.</exception>
        /// <exception cref="InvalidOperationException">Thrown if the matrix type T is not float or double.</exception>
        /// <remarks>
        /// This method performs an element-wise in-place addition, computing a[i,j] += b[i,j] for all elements.
        /// The operation is optimized with parallel processing for large matrices (when the total number of elements exceeds <see cref="Matrix{T}.ParallelThreshold"/>),
        /// ensuring efficient computation while maintaining thread safety by modifying only the input matrix <paramref name="a"/>.
        /// The method leverages the <see cref="Matrix{T}.Add"/> method for type-safe arithmetic, ensuring compatibility with float or double types.
        /// This operation is useful in machine learning tasks such as gradient updates or accumulating outputs in neural network layers.
        /// </remarks>
        public static Matrix<T> PlusEquals(Matrix<T> a, Matrix<T> b)
        {
            if (a == null)
                throw new ArgumentNullException(nameof(a), "Matrix to modify cannot be null for in-place addition.");
            if (b == null)
                throw new ArgumentNullException(nameof(b), "Matrix to add cannot be null.");
            if (!Matrix<T>.IsFloat && !Matrix<T>.IsDouble)
                throw new InvalidOperationException("In-place addition is only supported for Matrix<float> or Matrix<double>.");
            if (a.Rows != b.Rows || a.Columns != b.Columns)
                throw new ArgumentException($"Matrix dimensions must match for in-place addition: a[{a.Rows},{a.Columns}] vs b[{b.Rows},{b.Columns}].");

            bool useParallel = a.Rows * a.Columns >= a.ParallelThreshold;

            if (useParallel)
            {
                Parallel.For(0, a.Rows, i =>
                {
                    for (int j = 0; j < a.Columns; j++)
                        a.Data[i][j] = Matrix<T>.Add(a.Data[i][j], b.Data[i][j]);
                });
            }
            else
            {
                for (int i = 0; i < a.Rows; i++)
                {
                    for (int j = 0; j < a.Columns; j++)
                        a.Data[i][j] = Matrix<T>.Add(a.Data[i][j], b.Data[i][j]);
                }
            }

            return a;
        }

        /// <summary>
        /// Subtracts another matrix from this matrix in-place, element-wise.
        /// </summary>
        /// <param name="a">The matrix to subtract from (modified in-place), must be non-null.</param>
        /// <param name="b">The matrix to subtract, must be non-null and have the same dimensions as <paramref name="a"/>.</param>
        /// <returns>The modified matrix <paramref name="a"/>.</returns>
        /// <exception cref="ArgumentNullException">Thrown if either matrix <paramref name="a"/> or <paramref name="b"/> is null.</exception>
        /// <exception cref="ArgumentException">Thrown if the matrices have different dimensions or if <paramref name="b"/> is not a valid matrix for element-wise subtraction.</exception>
        /// <exception cref="InvalidOperationException">Thrown if the matrix type T is not float or double.</exception>
        /// <remarks>
        /// This method performs an element-wise in-place subtraction, computing a[i,j] -= b[i,j] for all elements.
        /// The operation is optimized with parallel processing for large matrices (when the total number of elements exceeds <see cref="Matrix{T}.ParallelThreshold"/>),
        /// ensuring efficient computation while maintaining thread safety by modifying only the input matrix <paramref name="a"/>.
        /// The method leverages the <see cref="Matrix{T}.Subtract"/> method for type-safe arithmetic, ensuring compatibility with float or double types.
        /// This operation is useful in machine learning tasks such as gradient descent updates or residual computations in neural network layers.
        /// </remarks>
        public static Matrix<T> MinusEquals(Matrix<T> a, Matrix<T> b)
        {
            if (a == null)
                throw new ArgumentNullException(nameof(a), "Matrix to modify cannot be null for in-place subtraction.");
            if (b == null)
                throw new ArgumentNullException(nameof(b), "Matrix to subtract cannot be null.");
            if (!Matrix<T>.IsFloat && !Matrix<T>.IsDouble)
                throw new InvalidOperationException("In-place subtraction is only supported for Matrix<float> or Matrix<double>.");
            if (a.Rows != b.Rows || a.Columns != b.Columns)
                throw new ArgumentException($"Matrix dimensions must match for in-place subtraction: a[{a.Rows},{a.Columns}] vs b[{b.Rows},{b.Columns}].");

            bool useParallel = a.Rows * a.Columns >= a.ParallelThreshold;

            if (useParallel)
            {
                Parallel.For(0, a.Rows, i =>
                {
                    for (int j = 0; j < a.Columns; j++)
                        a.Data[i][j] = Matrix<T>.Subtract(a.Data[i][j], b.Data[i][j]);
                });
            }
            else
            {
                for (int i = 0; i < a.Rows; i++)
                {
                    for (int j = 0; j < a.Columns; j++)
                        a.Data[i][j] = Matrix<T>.Subtract(a.Data[i][j], b.Data[i][j]);
                }
            }

            return a;
        }



        /// <summary>
        /// Subtracts another vector from this vector in-place, element-wise.
        /// </summary>
        /// <param name="a">The vector to subtract from (modified in-place).</param>
        /// <param name="b">The vector to subtract.</param>
        /// <returns>The modified vector a.</returns>
        /// <exception cref="ArgumentNullException">Thrown if either vector is null.</exception>
        /// <exception cref="ArgumentException">Thrown if vector lengths mismatch.</exception>
        public static Vector<double> MinusEquals(Vector<double> a, Vector<double> b)
        {
            if (a == null)
                throw new ArgumentNullException(nameof(a));
            if (b == null)
                throw new ArgumentNullException(nameof(b));
            if (a.Length != b.Length)
                throw new ArgumentException("Vector lengths must match for subtraction.");

            Parallel.For(0, a.Length, i =>
            {
                a[i] -= b[i];
            });
            return a;
        }

        /// <summary>
        /// Adds another vector to this vector in-place, element-wise.
        /// </summary>
        /// <param name="a">The vector to add to (modified in-place).</param>
        /// <param name="b">The vector to add.</param>
        /// <returns>The modified vector a.</returns>
        /// <exception cref="ArgumentNullException">Thrown if either vector is null.</exception>
        /// <exception cref="ArgumentException">Thrown if vector lengths mismatch.</exception>
        public static Vector<double> PlusEquals(Vector<double> a, Vector<double> b)
        {
            if (a == null)
                throw new ArgumentNullException(nameof(a));
            if (b == null)
                throw new ArgumentNullException(nameof(b));
            if (a.Length != b.Length)
                throw new ArgumentException("Vector lengths must match for addition.");

            Parallel.For(0, a.Length, i =>
            {
                a[i] += b[i];
            });
            return a;
        }



        /// <summary>
        /// Adds a scalar value to each element of the matrix, returning a new matrix.
        /// </summary>
        /// <param name="a">The input matrix to which the scalar is added, must be non-null.</param>
        /// <param name="scalar">The scalar value to add to each element, of type T (float or double).</param>
        /// <returns>A new Matrix<T> with the same dimensions as the input matrix, where each element is the sum of the corresponding input element and the scalar.</returns>
        /// <exception cref="ArgumentNullException">Thrown if the input matrix <paramref name="a"/> is null.</exception>
        /// <exception cref="InvalidOperationException">Thrown if the matrix type T is not float or double.</exception>
        /// <remarks>
        /// This method performs an element-wise addition, computing result[i,j] = a[i,j] + scalar for all elements.
        /// The operation is optimized with parallel processing for large matrices (when the total number of elements exceeds <see cref="Matrix{T}.ParallelThreshold"/>),
        /// ensuring efficient computation while maintaining thread safety by performing read-only operations on the input matrix and writing to a new result matrix.
        /// The method leverages the <see cref="Matrix{T}.Add"/> method for type-safe arithmetic, ensuring compatibility with float or double types.
        /// This operation is useful in machine learning tasks such as bias adjustments, gradient updates, or scaling operations in neural network layers.
        /// </remarks>
        public static Matrix<T> operator +(Matrix<T> a, T scalar)
        {
            if (a == null)
                throw new ArgumentNullException(nameof(a), "Input matrix cannot be null for scalar addition.");
            if (!Matrix<T>.IsFloat && !Matrix<T>.IsDouble)
                throw new InvalidOperationException("Scalar addition is only supported for Matrix<float> or Matrix<double>.");

            var result = new Matrix<T>(a.Rows, a.Columns);
            bool useParallel = a.Rows * a.Columns >= a.ParallelThreshold;

            if (useParallel)
            {
                Parallel.For(0, a.Rows, i =>
                {
                    for (int j = 0; j < a.Columns; j++)
                        result.Data[i][j] = Matrix<T>.Add(a.Data[i][j], scalar);
                });
            }
            else
            {
                for (int i = 0; i < a.Rows; i++)
                {
                    for (int j = 0; j < a.Columns; j++)
                        result.Data[i][j] = Matrix<T>.Add(a.Data[i][j], scalar);
                }
            }

            return result;
        }



        /// <summary>
        /// Adds two matrices element-wise, or adds a bias vector to each row if applicable.
        /// </summary>
        /// <param name="a">First matrix or minuend.</param>
        /// <param name="b">Second matrix, bias vector, or subtrahend.</param>
        /// <returns>A new matrix with the sum.</returns>
        /// <exception cref="ArgumentNullException">Thrown if either matrix is null.</exception>
        /// <exception cref="ArgumentException">Thrown if sizes mismatch for element-wise addition.</exception>
        public static Matrix<T> operator +(Matrix<T> a, Matrix<T> b)
        {

            if (b.Rows == 1 && b.Columns == a.Columns)
                return a.AddBias(b);

            CheckSizes(a, b, "add");
            var result = new Matrix<T>(a.Rows, a.Columns);
            Parallel.For(0, a.Rows, i =>
            {
                for (int j = 0; j < a.Columns; j++)
                    result.Data[i][j] = Add(a.Data[i][j], b.Data[i][j]);
            });
            return result;
        }



        /// <summary>
        /// Subtracts a double value from each element of a matrix.
        /// </summary>
        /// <param name="a">The matrix to subtract from.</param>
        /// <param name="value">The double value to subtract.</param>
        /// <returns>A new matrix with the difference.</returns>
        /// <exception cref="ArgumentNullException">Thrown if the matrix is null.</exception>
        public static Matrix<T> operator -(Matrix<T> a, double value)
        {
            if (a == null)
                throw new ArgumentNullException(nameof(a));

            var result = new Matrix<T>(a.Rows, a.Columns);
            Parallel.For(0, a.Rows, i =>
            {
                for (int j = 0; j < a.Columns; j++)
                    result.Data[i][j] = Subtract(a.Data[i][j], (T)Convert.ChangeType(value, typeof(T)));
            });
            return result;
        }

        private static T Subtract<T>(T a, T b)
        {
            dynamic x = a;
            dynamic y = b;
            return x - y;
        }



        /// <summary>
        /// Subtracts one matrix from another element-wise.
        /// </summary>
        /// <param name="a">Minuend matrix.</param>
        /// <param name="b">Subtrahend matrix.</param>
        /// <returns>A new matrix with the difference.</returns>
        /// <exception cref="ArgumentNullException">Thrown if either matrix is null.</exception>
        /// <exception cref="ArgumentException">Thrown if sizes mismatch.</exception>
        public static Matrix<T> operator -(Matrix<T> a, Matrix<T> b)
        {

            CheckSizes(a, b, "subtract");
            var result = new Matrix<T>(a.Rows, a.Columns);
            Parallel.For(0, a.Rows, i =>
            {
                for (int j = 0; j < a.Columns; j++)
                    result.Data[i][j] = Subtract(a.Data[i][j], b.Data[i][j]);
            });
            return result;
        }



        /// <summary>
        /// Multiplies the matrix by a scalar (left operand).
        /// </summary>
        /// <param name="scalar">Scalar value to multiply each element by.</param>
        /// <param name="m">Matrix to scale.</param>
        /// <returns>A new scaled matrix.</returns>
        public static Matrix<T> operator *(T scalar, Matrix<T> m)
        {

            var result = new Matrix<T>(m.Rows, m.Columns);
            Parallel.For(0, m.Rows, i =>
            {
                for (int j = 0; j < m.Columns; j++)
                    result.Data[i][j] = Multiply(scalar, m.Data[i][j]);
            });
            return result;
        }



        /// <summary>
        /// Divides each element of the matrix by a scalar value.
        /// </summary>
        /// <param name="m">The matrix to divide, must be non-null.</param>
        /// <param name="scalar">The scalar value to divide each element by, must be non-zero to avoid division by zero.</param>
        /// <returns>A new matrix with each element equal to the corresponding element in the input matrix divided by the scalar.</returns>
        /// <exception cref="ArgumentNullException">Thrown if the input matrix <paramref name="m"/> is null.</exception>
        /// <exception cref="ArgumentException">Thrown if the scalar value is zero or effectively zero (below a small epsilon threshold).</exception>
        /// <remarks>
        /// This method performs an element-wise division of the matrix by the scalar, creating a new matrix with the same dimensions.
        /// The operation is optimized with parallel processing for large matrices (when the total number of elements exceeds <see cref="ParallelThreshold"/>),
        /// ensuring efficient computation while maintaining thread safety. The scalar is checked against a small epsilon (1e-6 for float, 1e-6 for double)
        /// to prevent division by zero, returning a zero matrix in such cases to maintain numerical stability.
        /// This operation is useful in normalization tasks, scaling adjustments, or gradient computations in machine learning models.
        /// </remarks>
        public static Matrix<T> operator /(Matrix<T> m, T scalar)
        {
            if (m == null)
                throw new ArgumentNullException(nameof(m), "Matrix cannot be null for division operation.");

            // Check for division by zero
            T epsilon = Matrix<T>.IsFloat ? (T)(object)1e-6f : (T)(object)1e-6;
            T absScalar = Matrix<T>.IsFloat ? (T)(object)Math.Abs((float)(object)scalar) : (T)(object)Math.Abs((double)(object)scalar);
            if (Matrix<T>.Compare(absScalar, epsilon) <= 0)
                throw new ArgumentException("Scalar value must not be zero or effectively zero to avoid division by zero.", nameof(scalar));

            var result = new Matrix<T>(m.Rows, m.Columns);
            if (m.Rows * m.Columns < m.ParallelThreshold)
            {
                for (int i = 0; i < m.Rows; i++)
                    for (int j = 0; j < m.Columns; j++)
                        result.Data[i][j] = Matrix<T>.Divide(m.Data[i][j], scalar);
            }
            else
            {
                Parallel.For(0, m.Rows, i =>
                {
                    for (int j = 0; j < m.Columns; j++)
                        result.Data[i][j] = Matrix<T>.Divide(m.Data[i][j], scalar);
                });
            }
            return result;
        }



        /// <summary>
        /// Divides a scalar value by each element of the matrix, returning a new matrix.
        /// </summary>
        /// <param name="scalar">The scalar value (numerator) to divide by each element, of type T (float or double).</param>
        /// <param name="m">The input matrix (denominator), must be non-null and contain no zero or near-zero elements to avoid division by zero.</param>
        /// <returns>A new Matrix<T> with the same dimensions as the input matrix, where each element is the result of dividing the scalar by the corresponding matrix element (scalar / m[i,j]).</returns>
        /// <exception cref="ArgumentNullException">Thrown if the input matrix <paramref name="m"/> is null.</exception>
        /// <exception cref="InvalidOperationException">Thrown if the matrix type T is not float or double.</exception>
        /// <exception cref="ArgumentException">Thrown if any element in the matrix is zero or effectively zero (below a small epsilon threshold), which would cause division by zero.</exception>
        /// <remarks>
        /// This method performs an element-wise division, computing result[i,j] = scalar / m[i,j] for all elements, mimicking the behavior in expressions like (1.0 / (logits.Exp() + 1e-10)).
        /// To prevent division by zero, each matrix element is checked against a small epsilon (1e-6 for float, 1e-6 for double); if an element is too close to zero, an exception is thrown for clarity.
        /// The operation is optimized with parallel processing for large matrices (when the total number of elements exceeds <see cref="Matrix{T}.ParallelThreshold"/>),
        /// ensuring efficient computation while maintaining thread safety by performing read-only operations on the input matrix and writing to a new result matrix.
        /// The method leverages the <see cref="Matrix{T}.Divide"/> method for type-safe arithmetic, ensuring compatibility with float or double types.
        /// This operation is useful in machine learning tasks such as computing inverses in normalization, loss functions (e.g., cross-entropy), or gradient computations.
        /// </remarks>
        public static Matrix<T> operator /(T scalar, Matrix<T> m)
        {
            if (m == null)
                throw new ArgumentNullException(nameof(m), "Input matrix cannot be null for scalar division.");
            if (!Matrix<T>.IsFloat && !Matrix<T>.IsDouble)
                throw new InvalidOperationException("Scalar division is only supported for Matrix<float> or Matrix<double>.");

            // Check for zero or near-zero elements in the matrix
            T epsilon = Matrix<T>.IsFloat ? (T)(object)1e-6f : (T)(object)1e-6;
            for (int i = 0; i < m.Rows; i++)
            {
                for (int j = 0; j < m.Columns; j++)
                {
                    T absValue = Matrix<T>.IsFloat ? (T)(object)Math.Abs((float)(object)m[i, j]) : (T)(object)Math.Abs((double)(object)m[i, j]);
                    if (Matrix<T>.Compare(absValue, epsilon) <= 0)
                        throw new ArgumentException($"Element at [{i}, {j}] in matrix is zero or effectively zero, which would cause division by zero.", nameof(m));
                }
            }

            var result = new Matrix<T>(m.Rows, m.Columns);
            bool useParallel = m.Rows * m.Columns >= m.ParallelThreshold;

            if (useParallel)
            {
                Parallel.For(0, m.Rows, i =>
                {
                    for (int j = 0; j < m.Columns; j++)
                        result.Data[i][j] = Matrix<T>.Divide(scalar, m.Data[i][j]);
                });
            }
            else
            {
                for (int i = 0; i < m.Rows; i++)
                {
                    for (int j = 0; j < m.Columns; j++)
                        result.Data[i][j] = Matrix<T>.Divide(scalar, m.Data[i][j]);
                }
            }

            return result;
        }



        /// <summary>
        /// Performs element-wise division of two matrices, dividing each element of the first matrix by the corresponding element of the second matrix.
        /// </summary>
        /// <param name="a">The numerator matrix, must be non-null and have the same dimensions as <paramref name="b"/>.</param>
        /// <param name="b">The denominator matrix, must be non-null, have the same dimensions as <paramref name="a"/>, and contain no zero or near-zero elements to avoid division by zero.</param>
        /// <returns>A new matrix with the same dimensions as the input matrices, where each element is the result of dividing the corresponding elements of <paramref name="a"/> by <paramref name="b"/>.</returns>
        /// <exception cref="ArgumentNullException">Thrown if either <paramref name="a"/> or <paramref name="b"/> is null.</exception>
        /// <exception cref="ArgumentException">Thrown if the matrices have different dimensions or if any element in <paramref name="b"/> is zero or effectively zero (below a small epsilon threshold).</exception>
        /// <remarks>
        /// This method computes an element-wise division, i.e., result[i,j] = a[i,j] / b[i,j], producing a new matrix without modifying the inputs.
        /// The operation is optimized with parallel processing for large matrices (when the total number of elements exceeds <see cref="ParallelThreshold"/>),
        /// ensuring efficient computation while maintaining thread safety. To prevent division by zero, each element of <paramref name="b"/> is checked against
        /// a small epsilon (1e-6 for float, 1e-6 for double); if any element is too close to zero, an exception is thrown for clarity.
        /// This operation is useful in normalization, scaling, or gradient computations where element-wise division of matrices is required.
        /// Note that this is not matrix division in the linear algebra sense (e.g., involving inverses), but rather an element-wise operation.
        /// </remarks>
        public static Matrix<T> operator /(Matrix<T> a, Matrix<T> b)
        {
            if (a == null)
                throw new ArgumentNullException(nameof(a), "Numerator matrix cannot be null for division operation.");
            if (b == null)
                throw new ArgumentNullException(nameof(b), "Denominator matrix cannot be null for division operation.");
            if (a.Rows != b.Rows || a.Columns != b.Columns)
                throw new ArgumentException($"Matrices must have the same dimensions for element-wise division, got [{a.Rows}, {a.Columns}] vs [{b.Rows}, {b.Columns}].");

            // Check for zero or near-zero elements in b
            T epsilon = Matrix<T>.IsFloat ? (T)(object)1e-6f : (T)(object)1e-6;
            for (int i = 0; i < b.Rows; i++)
            {
                for (int j = 0; j < b.Columns; j++)
                {
                    T absValue = Matrix<T>.IsFloat ? (T)(object)Math.Abs((float)(object)b[i, j]) : (T)(object)Math.Abs((double)(object)b[i, j]);
                    if (Matrix<T>.Compare(absValue, epsilon) <= 0)
                        throw new ArgumentException($"Element at [{i}, {j}] in denominator matrix is zero or effectively zero, which would cause division by zero.", nameof(b));
                }
            }

            var result = new Matrix<T>(a.Rows, a.Columns);
            if (a.Rows * a.Columns < a.ParallelThreshold)
            {
                for (int i = 0; i < a.Rows; i++)
                    for (int j = 0; j < a.Columns; j++)
                        result[i, j] = Matrix<T>.Divide(a[i, j], b[i, j]);
            }
            else
            {
                Parallel.For(0, a.Rows, i =>
                {
                    for (int j = 0; j < a.Columns; j++)
                        result[i, j] = Matrix<T>.Divide(a[i, j], b[i, j]);
                });
            }

            return result;
        }



        /// <summary>
        /// Multiplies the matrix by a scalar (right operand).
        /// </summary>
        /// <param name="m">Matrix to scale.</param>
        /// <param name="scalar">Scalar value to multiply each element by.</param>
        /// <returns>A new scaled matrix.</returns>
        public static Matrix<T> operator *(Matrix<T> m, T scalar) => scalar * m;



        /// <summary>
        /// Performs matrix multiplication between two matrices.
        /// </summary>
        /// <param name="a">Left matrix (minuend).</param>
        /// <param name="b">Right matrix (subtrahend).</param>
        /// <returns>A new matrix, the product of a and b.</returns>
        /// <exception cref="ArgumentException">Thrown if a.Columns does not match b.Rows.</exception>
        public static Matrix<T> operator *(Matrix<T> a, Matrix<T> b)
        {

            if (a == null || b == null)
                throw new ArgumentNullException("Matrices cannot be null.");
            if (a.Columns != b.Rows)
                throw new ArgumentException($"Matrix multiplication requires {a.Columns} columns to match {b.Rows} rows.");

            var result = new Matrix<T>(a.Rows, b.Columns);
            int totalOps = a.Rows * a.Columns * b.Columns;
            int numThreads = Environment.ProcessorCount;
            int minOpsPerThread = 1000; // Tune based on profiling

            // Dynamic threshold: Use parallelism if enough work per thread
            if (totalOps < numThreads * minOpsPerThread)
            {
                // Sequential for very small matrices
                for (int i = 0; i < a.Rows; i++)
                {
                    for (int j = 0; j < b.Columns; j++)
                    {
                        T sum = Zero();
                        for (int k = 0; k < a.Columns; k++)
                            sum = Add(sum, Multiply(a.Data[i][k], b.Data[k][j]));
                        result.Data[i][j] = sum;
                    }
                }
            }
            else
            {
                // Adaptive block size based on cache (e.g., L1 cache ~32KB, 4096 doubles)
                const int cacheSizeElements = 4096;
                int blockSize = Math.Min(Math.Max(16, cacheSizeElements / Math.Max(b.Columns, a.Columns)), a.Rows);
                int rowBlocks = (a.Rows + blockSize - 1) / blockSize;

                Parallel.For(0, rowBlocks, new ParallelOptions { MaxDegreeOfParallelism = numThreads }, ib =>
                {
                    int startRow = ib * blockSize;
                    int endRow = Math.Min(startRow + blockSize, a.Rows);
                    // Transpose b locally for better cache access (optional, costs memory)
                    T[][] bT = b.Transpose().Data; // [b.Columns, b.Rows]

                    for (int i = startRow; i < endRow; i++)
                    {
                        for (int j = 0; j < b.Columns; j++)
                        {
                            T sum = Zero();
                            T compensation = Zero(); // Kahan summation
                            for (int k = 0; k < a.Columns; k++)
                            {
                                T product = Multiply(a.Data[i][k], bT[j][k]); // Use transposed b
                                T y = Subtract(product, compensation);
                                T t = Add(sum, y);
                                compensation = Subtract(Subtract(t, sum), y);
                                sum = t;
                            }
                            result.Data[i][j] = sum;
                        }
                    }
                });
            }

            return result;
        }



        /// <summary>
        /// Creates a matrix filled with ones of the specified dimensions.
        /// </summary>
        /// <param name="rows">The number of rows in the matrix.</param>
        /// <param name="cols">The number of columns in the matrix.</param>
        /// <returns>A new Matrix<T> of shape [rows, cols] where all elements are 1.</returns>
        /// <exception cref="ArgumentException">Thrown if rows or cols are less than or equal to zero.</exception>
        public static Matrix<T> Ones(int rows, int cols)
        {

            if (rows <= 0)
                throw new ArgumentException("Number of rows must be positive.", nameof(rows));
            if (cols <= 0)
                throw new ArgumentException("Number of columns must be positive.", nameof(cols));

            var matrix = new Matrix<T>(rows, cols);
            Parallel.For(0, rows, i =>
            {
                for (int j = 0; j < cols; j++)
                    matrix[i, j] = One();
            });
            return matrix;
        }



        /// <summary>
        /// Creates a matrix filled with zeros of the specified dimensions.
        /// </summary>
        /// <param name="rows">The number of rows in the matrix.</param>
        /// <param name="cols">The number of columns in the matrix.</param>
        /// <returns>A new Matrix<T> of shape [rows, cols] where all elements are 0.</returns>
        /// <exception cref="ArgumentException">Thrown if rows or cols are less than or equal to zero.</exception>
        public static Matrix<T> Zeros(int rows, int cols)
        {

            if (rows <= 0)
                throw new ArgumentException("Number of rows must be positive.", nameof(rows));
            if (cols <= 0)
                throw new ArgumentException("Number of columns must be positive.", nameof(cols));

            return new Matrix<T>(rows, cols); // Default initialization to zero
        }



        /// <summary>
        /// Creates a matrix with random values between -maxValue and +maxValue using uniform distribution.
        /// </summary>
        /// <param name="rows">Number of rows in the resulting matrix.</param>
        /// <param name="cols">Number of columns in the resulting matrix.</param>
        /// <param name="maxValue">Maximum absolute value for random entries.</param>
        /// <returns>A new matrix filled with random values.</returns>
        public static Matrix<T> Random(int rows, int cols, T maxValue, int? seed = null)
        {

            var rand = seed.HasValue ? new Random(seed.Value) : _random.Value;
            var result = new Matrix<T>(rows, cols);
            Parallel.For(0, rows, i =>
            {
                Random threadRand = seed.HasValue ? new Random(rand.Next()) : _random.Value;
                for (int j = 0; j < cols; j++)
                {
                    T val = Multiply((T)(object)(2.0 * threadRand.NextDouble() - 1.0), maxValue);
                    result.Data[i][j] = val;
                }
            });
            return result;
        }



        /// <summary>
        /// Creates an identity matrix with ones on the diagonal and zeros elsewhere.
        /// </summary>
        /// <param name="size">Size of the square matrix (rows = columns).</param>
        /// <returns>A new identity matrix of shape [size, size].</returns>
        public static Matrix<T> Identity(int size)
        {

            var result = new Matrix<T>(size, size);
            Parallel.For(0, size, i => result.Data[i][i] = One());
            return result;
        }



        /// <summary>
        /// Validates that two matrices have the same dimensions for element-wise operations.
        /// </summary>
        /// <param name="a">First matrix to check.</param>
        /// <param name="b">Second matrix to check.</param>
        /// <param name="operation">Description of the operation for error messages.</param>
        /// <exception cref="ArgumentNullException">Thrown if either matrix is null.</exception>
        /// <exception cref="ArgumentException">Thrown if dimensions mismatch.</exception>
        private static void CheckSizes(Matrix<T> a, Matrix<T> b, string operation)
        {

            if (a == null || b == null)
                throw new ArgumentNullException($"Both matrices must be non-null for {operation}.");
            if (a.Rows != b.Rows || a.Columns != b.Columns)
                throw new ArgumentException($"For {operation}, matrices must have same dimensions, got [{a.Rows}, {a.Columns}] vs [{b.Rows}, {b.Columns}].");
        }



        /// <summary>
        /// Creates a deep copy of the matrix.
        /// </summary>
        /// <returns>A new matrix with identical values.</returns>
        public Matrix<T> Clone()
        {

            var result = new Matrix<T>(Rows, Columns);
            Parallel.For(0, Rows, i => Array.Copy(Data[i], result.Data[i], Columns));
            return result;
        }



        /// <summary>
        /// Applies softmax normalization with adjacency masking.
        /// </summary>
        public static Matrix<double> Softmax(Matrix<double> scores, Matrix<double> adjacency)
        {

            var expScores = scores.Exp();
            var sumExp = expScores.SumRows();
            var result = new Matrix<double>(scores.Rows, scores.Columns);
            for (int i = 0; i < scores.Rows; i++)
                for (int j = 0; j < scores.Columns; j++)
                    result[i, j] = adjacency[i, j] == 0 ? 0.0 : expScores[i, j] / (sumExp[0, i] + 1e-10);
            return result;
        }



        /// <summary>
        /// Converts the matrix to a flat 1D array in row-major order.
        /// </summary>
        /// <returns>A 1D array of type T containing all elements of the matrix, ordered by rows.</returns>
        /// <exception cref="InvalidOperationException">Thrown if the matrix is empty (has no rows or columns).</exception>
        /// <remarks>
        /// Flattens the matrix into a single array by concatenating all rows in row-major order.
        /// For a matrix of shape [Rows, Columns], the resulting array has length Rows * Columns.
        /// Uses parallel processing for large matrices to improve performance, ensuring efficient memory copying.
        /// The method employs Parallel.For when the total number of elements exceeds the ParallelThreshold,
        /// copying each row into the appropriate segment of the flat array to minimize contention.
        /// For smaller matrices, a sequential approach is used to avoid the overhead of parallelism.
        /// This method is particularly useful for interfacing with APIs that require contiguous memory layouts,
        /// such as GPU operations, statistical computations (e.g., correlation matrices in visualization),
        /// or when interfacing with libraries that expect flat arrays.
        /// </remarks>
        public T[] ToFlatArray()
        {

            if (Rows == 0 || Columns == 0)
                throw new InvalidOperationException("Cannot flatten an empty matrix.");

            T[] flat = new T[Rows * Columns];
            if (Rows * Columns < ParallelThreshold)
            {
                int idx = 0;
                for (int i = 0; i < Rows; i++)
                    for (int j = 0; j < Columns; j++)
                        flat[idx++] = Data[i][j];
            }
            else
            {
                Parallel.For(0, Rows, i =>
                {
                    int startIdx = i * Columns;
                    Array.Copy(Data[i], 0, flat, startIdx, Columns);
                });
            }
            return flat;
        }



        /// <summary>
        /// Returns a string representation of the matrix.
        /// </summary>
        /// <returns>A formatted string showing the matrix rows, each enclosed in brackets and comma-separated.</returns>
        public override string ToString()
        {

            return string.Join("\n", Data.Select(row => $"[{string.Join(", ", row)}]"));
        }



        /// <summary>
        /// Checks if another object is an identical matrix.
        /// </summary>
        /// <param name="obj">Object to compare with this matrix.</param>
        /// <returns>True if the matrices are equal in size and element values, false otherwise.</returns>
        public override bool Equals(object obj)
        {

            if (obj is not Matrix<T> other || Rows != other.Rows || Columns != other.Columns)
                return false;

            for (int i = 0; i < Rows; i++)
                for (int j = 0; j < Columns; j++)
                    if (!Equals(Data[i][j], other.Data[i][j]))
                        return false;
            return true;
        }



        /// <summary>
        /// Generates a hash code for the matrix.
        /// </summary>
        /// <returns>A hash code based on all elements, suitable for use in hash-based collections.</returns>
        public override int GetHashCode()
        {

            int hash = 17;
            for (int i = 0; i < Rows; i++)
                for (int j = 0; j < Columns; j++)
                    hash = hash * 23 + Data[i][j].GetHashCode();
            return hash;
        }



        /// <summary>
        /// Adds two values of type T.
        /// </summary>
        /// <param name="a">First operand.</param>
        /// <param name="b">Second operand.</param>
        /// <returns>The sum of a and b as type T.</returns>
        public static T Add(T a, T b) => IsFloat ? (T)(object)((float)(object)a + (float)(object)b) : (T)(object)((double)(object)a + (double)(object)b);



        /// <summary>
        /// Subtracts one value from another.
        /// </summary>
        /// <param name="a">Minuend.</param>
        /// <param name="b">Subtrahend.</param>
        /// <returns>The difference a - b as type T.</returns>
        public static T Subtract(T a, T b) => IsFloat ? (T)(object)((float)(object)a - (float)(object)b) : (T)(object)((double)(object)a - (double)(object)b);



        /// <summary>
        /// Multiplies two values of type T.
        /// </summary>
        /// <param name="a">First operand.</param>
        /// <param name="b">Second operand.</param>
        /// <returns>The product of a and b as type T.</returns>
        public static T Multiply(T a, T b) => IsFloat ? (T)(object)((float)(object)a * (float)(object)b) : (T)(object)((double)(object)a * (double)(object)b);



        /// <summary>
        /// Divides one value by another, with epsilon to prevent division by zero.
        /// </summary>
        /// <param name="a">Numerator.</param>
        /// <param name="b">Denominator.</param>
        /// <returns>The quotient a / b as type T, with a small epsilon if b is zero.</returns>
        public static T Divide(T a, T b)
        {

            T epsilon = IsFloat ? (T)(object)1e-6f : (T)(object)1e-6;
            T absB = IsFloat ? (T)(object)Math.Abs((float)(object)b) : (T)(object)Math.Abs((double)(object)b);
            T safeB = Compare(absB, epsilon) < 0 ? (Compare(b, Zero()) < 0 ? Subtract(Zero(), epsilon) : epsilon) : b;
            return IsFloat ? (T)(object)((float)(object)a / (float)(object)safeB)
                           : (T)(object)((double)(object)a / (double)(object)safeB);
        }



        /// <summary>
        /// Computes the exponential of a value.
        /// </summary>
        /// <param name="x">The exponent.</param>
        /// <returns>e^x as type T.</returns>
        public static T Exp(T x) => IsFloat ? (T)(object)(float)Math.Exp((float)(object)x) : (T)(object)Math.Exp((double)(object)x);



        /// <summary>
        /// Computes the square root of a value.
        /// </summary>
        /// <param name="x">Non-negative value to compute the square root of.</param>
        /// <returns>The square root of x as type T.</returns>
        /// <exception cref="ArgumentException">Thrown if x is negative.</exception>
        public static T Sqrt(T x)
        {

            if (Compare(x, Zero()) < 0)
                throw new ArgumentException("Cannot compute square root of a negative number.");
            return IsFloat ? (T)(object)(float)Math.Sqrt((float)(object)x) : (T)(object)Math.Sqrt((double)(object)x);
        }



        /// <summary>
        /// Compares two values of type T.
        /// </summary>
        /// <param name="a">First value.</param>
        /// <param name="b">Second value.</param>
        /// <returns>-1 if a < b, 0 if a == b, 1 if a > b.</returns>
        public static int Compare(T a, T b) => IsFloat ? ((float)(object)a).CompareTo((float)(object)b) : ((double)(object)a).CompareTo((double)(object)b);



        /// <summary>
        /// Returns zero for type T.
        /// </summary>
        /// <returns>0f for float, 0.0 for double.</returns>
        public static T Zero() => IsFloat ? (T)(object)0f : (T)(object)0.0;



        /// <summary>
        /// Returns one for type T.
        /// </summary>
        /// <returns>1f for float, 1.0 for double.</returns>
        public static T One() => IsFloat ? (T)(object)1f : (T)(object)1.0;
    }



    /// <summary>
    /// Represents a vector, a specialized matrix with either one row (row vector) or one column (column vector).
    /// Inherits from Matrix<T> to leverage its functionality while providing vector-specific operations.
    /// </summary>
    /// <typeparam name="T">The numeric type, constrained to float or double, implementing IComparable and IEquatable.</typeparam>
    public class Vector<T> : Matrix<T> where T : struct, IComparable<T>, IEquatable<T>
    {



        #region Fields:



        /// <summary>
        /// Indicates whether the vector is a row vector (1 row, n columns) or column vector (n rows, 1 column).
        /// </summary>
        private readonly bool _isRowVector;



        /// <summary>
        /// Gets the length of the vector (number of elements).
        /// </summary>
        public int Length => _isRowVector ? Columns : Rows;



        #endregion



        #region Properties:



        /// <summary>
        /// Gets or sets the value at the specified index in the vector.
        /// </summary>
        /// <param name="index">Index of the element, must be between 0 and Length-1.</param>
        /// <returns>The value at the specified index.</returns>
        /// <exception cref="IndexOutOfRangeException">Thrown if index is out of bounds.</exception>
        public T this[int index]
        {

            get => _isRowVector ? this[0, index] : this[index, 0];
            set
            {
                if (_isRowVector) this[0, index] = value;
                else this[index, 0] = value;
            }
        }



        #endregion



        /// <summary>
        /// Initializes a vector with the specified length, treated as a column vector by default.
        /// </summary>
        /// <remarks>
        /// Default Constructor for Json Serialization/Deserialization..
        /// </remarks>
        public Vector()
        {
        }



        /// <summary>
        /// Initializes a vector with the specified length, treated as a column vector by default.
        /// </summary>
        /// <param name="length">Number of elements, must be positive.</param>
        /// <param name="isRowVector">If true, creates a row vector [1, length]; otherwise, a column vector [length, 1].</param>
        /// <exception cref="ArgumentException">Thrown if length is less than or equal to zero.</exception>
        public Vector(int length, bool isRowVector = false) : base(isRowVector ? 1 : length, isRowVector ? length : 1)
        {

            if (length <= 0)
                throw new ArgumentException("Vector length must be positive.", nameof(length));
            _isRowVector = isRowVector;
        }



        /// <summary>
        /// Initializes a vector from an array of values, treated as a column vector by default.
        /// </summary>
        /// <param name="data">Array of values, must be non-null and non-empty.</param>
        /// <param name="isRowVector">If true, creates a row vector [1, n]; otherwise, a column vector [n, 1].</param>
        /// <exception cref="ArgumentNullException">Thrown if data is null.</exception>
        /// <exception cref="ArgumentException">Thrown if data is empty.</exception>
        public Vector(T[] data, bool isRowVector = false) : base(isRowVector ? new T[][] { data } : data.Select(x => new T[] { x }).ToArray())
        {

            if (data == null)
                throw new ArgumentNullException(nameof(data), "Vector data cannot be null.");
            if (data.Length == 0)
                throw new ArgumentException("Vector data must not be empty.", nameof(data));
            _isRowVector = isRowVector;
        }



        /// <summary>
        /// Computes the dot product of this vector with another vector.
        /// </summary>
        /// <param name="other">The other vector, must have the same length.</param>
        /// <returns>The dot product as type T.</returns>
        /// <exception cref="ArgumentNullException">Thrown if other is null.</exception>
        /// <exception cref="ArgumentException">Thrown if vectors have different lengths or incompatible orientations.</exception>
        public T Dot(Vector<T> other)
        {

            if (other == null)
                throw new ArgumentNullException(nameof(other));
            if (Length != other.Length)
                throw new ArgumentException($"Vectors must have same length, got {Length} vs {other.Length}.");
            if (_isRowVector && !other._isRowVector && Columns == other.Rows)
                return (this * other)[0, 0];

            T sum = Zero();
            T comp = Zero();
            object lockObj = new object();
            Parallel.For(0, Length, () => (Zero(), Zero()), (i, state, local) =>
            {
                T localSum = local.Item1;
                T localComp = local.Item2;
                T y = Subtract(Multiply(this[i], other[i]), localComp);
                T t = Add(localSum, y);
                localComp = Subtract(Subtract(t, localSum), y);
                localSum = t;
                return (localSum, localComp);
            }, local =>
            {
                lock (lockObj)
                {
                    T y = Subtract(local.Item1, comp);
                    T t = Add(sum, y);
                    comp = Subtract(Subtract(t, sum), y);
                    sum = t;
                }
            });
            return sum;
        }



        /// <summary>
        /// Computes the magnitude (L2 norm) of the vector.
        /// </summary>
        /// <returns>The magnitude as type T.</returns>
        public T Magnitude()
        {

            return Sqrt(this.ElementWiseMultiply(this).Sum());
        }



        /// <summary>
        /// Normalizes the vector to have a magnitude of 1.
        /// </summary>
        /// <returns>A new Vector<T> with the same direction but unit length.</returns>
        /// <exception cref="InvalidOperationException">Thrown if the magnitude is zero.</exception>
        public Vector<T> Normalize()
        {

            var matrix = this.Normalize();
            return new Vector<T>(matrix.GetRowArray(0), _isRowVector);
        }



        /// <summary>
        /// Converts the vector to an array.
        /// </summary>
        /// <returns>An array of type T containing the vector elements.</returns>
        public T[] ToArray()
        {

            return _isRowVector ? GetRowArray(0) : Enumerable.Range(0, Rows).Select(i => this[i, 0]).ToArray();
        }



        /// <summary>
        /// Converts the vector to a row vector if it isn’t already.
        /// </summary>
        /// <returns>A new Vector<T> as a row vector [1, Length].</returns>
        public Vector<T> ToRowVector()
        {

            if (_isRowVector) return this;
            return new Vector<T>(Enumerable.Range(0, Rows).Select(i => this[i, 0]).ToArray(), true);
        }



        /// <summary>
        /// Converts the vector to a column vector if it isn’t already.
        /// </summary>
        /// <returns>A new Vector<T> as a column vector [Length, 1].</returns>
        public Vector<T> ToColumnVector()
        {

            if (!_isRowVector) return this;
            return new Vector<T>(GetRowArray(0), false);
        }



        /// <summary>
        /// Converts a Matrix<T> to a Vector<T>, assuming the matrix represents a single row or column.
        /// </summary>
        /// <typeparam name="T">The numeric type, constrained to struct, IComparable<T>, and IEquatable<T>.</typeparam>
        /// <param name="matrix">The source matrix to convert, must have either 1 row or 1 column.</param>
        /// <param name="asRowVector">If true, treats the result as a row vector [1, n]; if false, as a column vector [n, 1]. 
        /// Defaults to false (column vector).</param>
        /// <returns>A new Vector<T> containing the matrix elements.</returns>
        /// <exception cref="ArgumentNullException">Thrown if the matrix is null.</exception>
        /// <exception cref="ArgumentException">Thrown if the matrix does not have exactly 1 row or 1 column.</exception>
        /// <remarks>
        /// This method facilitates conversion from a Matrix<T> to a Vector<T> when the matrix represents a vector quantity.
        /// - If the matrix has 1 row and n columns, it can be converted to either a row vector [1, n] or column vector [n, 1].
        /// - If the matrix has n rows and 1 column, it can be converted to either a column vector [n, 1] or row vector [1, n].
        /// The method extracts the single row or column and constructs a Vector<T> with the specified orientation.
        /// Useful in scenarios where matrix operations result in a vector-like structure needing vector-specific functionality.
        /// </remarks>
        public static Vector<T> ToVector<T>(Matrix<T> matrix, bool asRowVector = false) where T : struct, IComparable<T>, IEquatable<T>
        {
            // Validate input
            if (matrix == null)
                throw new ArgumentNullException(nameof(matrix), "Cannot convert a null matrix to a vector.");

            // Check matrix dimensions
            if (matrix.Rows != 1 && matrix.Columns != 1)
                throw new ArgumentException(
                    $"Matrix must have exactly 1 row or 1 column to convert to a vector, got [{matrix.Rows}, {matrix.Columns}].",
                    nameof(matrix));

            // Handle single row matrix
            if (matrix.Rows == 1)
            {
                T[] data = matrix.GetRowArray(0);
                return new Vector<T>(data, asRowVector);
            }

            // Handle single column matrix
            if (matrix.Columns == 1)
            {
                T[] data = new T[matrix.Rows];
                for (int i = 0; i < matrix.Rows; i++)
                    data[i] = matrix[i, 0];
                return new Vector<T>(data, asRowVector);
            }

            // This line should never be reached due to prior checks, but included for completeness
            throw new InvalidOperationException("Unexpected matrix dimensions encountered during conversion.");
        }



        /// <summary>
        /// Converts the vector to a flat 1D array containing all elements in their natural order.
        /// </summary>
        /// <returns>A 1D array of type T containing all elements of the vector, with length equal to the vector's Length.</returns>
        /// <exception cref="InvalidOperationException">Thrown if the vector is empty (has length 0).</exception>
        /// <remarks>
        /// This method flattens the vector into a 1D array, preserving the order of elements as they are stored internally.
        /// - For a row vector (shape [1, Length]), the elements are ordered as _data[0][0], _data[0][1], ..., _data[0][Length-1].
        /// - For a column vector (shape [Length, 1]), the elements are ordered as _data[0][0], _data[1][0], ..., _data[Length-1][0].
        /// In both cases, the resulting array has length equal to the vector's Length, and the elements are copied directly
        /// from the internal _data array, ensuring a contiguous memory layout.
        /// The method avoids parallelization since vectors in typical machine learning contexts (e.g., hidden activations)
        /// are often small, and the overhead of parallelism would outweigh the benefits.
        /// This method is functionally equivalent to ToArray() but is named ToFlatArray to align with Matrix<T> terminology
        /// and to emphasize its use in contexts requiring a flat array, such as GPU operations or statistical computations.
        /// </remarks>
        public new T[] ToFlatArray()
        {

            if (Length == 0)
                throw new InvalidOperationException("Cannot flatten an empty vector.");

            T[] flat = new T[Length];
            if (_isRowVector)
            {
                // Row vector: [1, Length], copy _data[0] directly
                Array.Copy(Data[0], flat, Length);
            }
            else
            {
                // Column vector: [Length, 1], extract _data[i][0] for each i
                for (int i = 0; i < Length; i++)
                    flat[i] = Data[i][0];
            }
            return flat;
        }



        /// <summary>
        /// Multiplies the matrix by a scalar (left operand).
        /// </summary>
        /// <param name="scalar">Scalar value to multiply each element by.</param>
        /// <param name="m">Matrix to scale.</param>
        /// <returns>A new scaled matrix.</returns>
        public static Vector<T> operator *(T scalar, Vector<T> m)
        {

            var result = new Vector<T>(m.Rows);
            Parallel.For(0, m.Rows, i =>
            {
                result[i] = Multiply(scalar, m[i]);
            });
            return result;
        }
    }
}
    public class HRModel
    {
        #region Fields

        private readonly int _vocabSize;
        private readonly int _dModel;
        private readonly int _nHeads;
        private readonly int _dHead;
        private readonly int _dFF;
        private readonly int _maxSeqLen;
        private readonly int _lCycles;
        private readonly double _dropoutRate;
        private readonly Matrix<double> _tokenEmbeddings;
        private readonly Matrix<double> _positionEmbeddings;
        private readonly Matrix<double> _puzzleEmbeddings;
        private readonly Matrix<double>[] _attnWeightsHigh;
        private readonly Matrix<double>[] _attnBiasesHigh;
        private readonly Matrix<double>[] _attnWeightsLow;
        private readonly Matrix<double>[] _attnBiasesLow;
        private readonly Matrix<double> _rnnWHigh;
        private readonly Matrix<double> _rnnBHigh;
        private readonly Matrix<double> _rnnWLow;
        private readonly Matrix<double> _rnnBLow;
        private readonly Matrix<double> _ffnW1;
        private readonly Matrix<double> _ffnB1;
        private readonly Matrix<double> _ffnW2;
        private readonly Matrix<double> _ffnB2;
        private readonly Matrix<double> _outputW;
        private readonly Matrix<double> _outputB;
        private readonly object _syncLock = new object();
        private readonly ThreadLocal<Random> _random = new ThreadLocal<Random>(() => new Random(Guid.NewGuid().GetHashCode()));

        #endregion

        #region Constructor

        public HRModel(int vocabSize, int dModel, int nHeads, int dFF, int maxSeqLen, int lCycles = 8, double dropoutRate = 0.1)
        {
            if (vocabSize <= 0) throw new ArgumentException("Vocabulary size must be positive.", nameof(vocabSize));
            if (dModel <= 0) throw new ArgumentException("Model dimension must be positive.", nameof(dModel));
            if (nHeads <= 0) throw new ArgumentException("Number of heads must be positive.", nameof(nHeads));
            if (dModel % nHeads != 0) throw new ArgumentException("dModel must be divisible by nHeads.", nameof(nHeads));
            if (dFF <= 0) throw new ArgumentException("Feed-forward dimension must be positive.", nameof(dFF));
            if (maxSeqLen <= 0) throw new ArgumentException("Maximum sequence length must be positive.", nameof(maxSeqLen));
            if (lCycles <= 0) throw new ArgumentException("Number of cycles must be positive.", nameof(lCycles));
            if (dropoutRate < 0 || dropoutRate > 1) throw new ArgumentException("Dropout rate must be between 0 and 1.", nameof(dropoutRate));

            _vocabSize = vocabSize;
            _dModel = dModel;
            _nHeads = nHeads;
            _dHead = dModel / nHeads;
            _dFF = dFF;
            _maxSeqLen = maxSeqLen;
            _lCycles = lCycles;
            _dropoutRate = dropoutRate;

            _tokenEmbeddings = Matrix<double>.InitializeXavier(vocabSize, dModel);
            _positionEmbeddings = Matrix<double>.InitializeXavier(maxSeqLen, dModel);
            _puzzleEmbeddings = Matrix<double>.InitializeXavier(vocabSize, dModel);
            _attnWeightsHigh = new Matrix<double>[nHeads];
            _attnBiasesHigh = new Matrix<double>[nHeads];
            _attnWeightsLow = new Matrix<double>[nHeads];
            _attnBiasesLow = new Matrix<double>[nHeads];
            for (int h = 0; h < nHeads; h++)
            {
                _attnWeightsHigh[h] = Matrix<double>.InitializeXavier(dModel, _dHead);
                _attnBiasesHigh[h] = Matrix<double>.Zeros(1, _dHead);
                _attnWeightsLow[h] = Matrix<double>.InitializeXavier(dModel, _dHead);
                _attnBiasesLow[h] = Matrix<double>.Zeros(1, _dHead);
            }
            _rnnWHigh = Matrix<double>.InitializeXavier(dModel, dModel);
            _rnnBHigh = Matrix<double>.Zeros(1, dModel);
            _rnnWLow = Matrix<double>.InitializeXavier(dModel, dModel);
            _rnnBLow = Matrix<double>.Zeros(1, dModel);
            _ffnW1 = Matrix<double>.InitializeXavier(dModel, dFF);
            _ffnB1 = Matrix<double>.Zeros(1, dFF);
            _ffnW2 = Matrix<double>.InitializeXavier(dFF, dModel);
            _ffnB2 = Matrix<double>.Zeros(1, dModel);
            _outputW = Matrix<double>.InitializeXavier(dModel, vocabSize);
            _outputB = Matrix<double>.Zeros(1, vocabSize);
        }

        #endregion

        #region Forward Pass

        public Matrix<double> Forward(int[][] inputIds, int[][] puzzleIds = null)
        {
            if (inputIds == null)
                throw new ArgumentNullException(nameof(inputIds), "Input IDs cannot be null.");
            if (inputIds.Length == 0 || inputIds[0].Length == 0)
                throw new ArgumentException("Input IDs must have at least one sequence and one token.", nameof(inputIds));
            if (inputIds.Any(seq => seq.Length != inputIds[0].Length))
                throw new ArgumentException("All sequences must have the same length.", nameof(inputIds));
            if (inputIds[0].Length > _maxSeqLen)
                throw new ArgumentException($"Sequence length {inputIds[0].Length} exceeds maximum {_maxSeqLen}.", nameof(inputIds));
            if (puzzleIds != null && (puzzleIds.Length != inputIds.Length || puzzleIds[0].Length != inputIds[0].Length))
                throw new ArgumentException("Puzzle IDs must match input IDs dimensions.", nameof(puzzleIds));

            int batchSize = inputIds.Length;
            int seqLen = inputIds[0].Length;

            // Embeddings
            var embeddings = OneHotEncode(inputIds, _vocabSize) * _tokenEmbeddings; // [batchSize * seqLen, dModel]
            if (puzzleIds != null)
            {
                var puzzleEmb = OneHotEncode(puzzleIds, _vocabSize) * _puzzleEmbeddings; // [batchSize * seqLen, dModel]
                embeddings = embeddings + puzzleEmb;
            }
            var posEmbed = _positionEmbeddings.SubMatrix(0, seqLen, 0, _dModel); // [seqLen, dModel]
            var tiledPosEmbed = new Matrix<double>(batchSize * seqLen, _dModel);
            Parallel.For(0, batchSize, b =>
            {
                for (int t = 0; t < seqLen; t++)
                {
                    int idx = b * seqLen + t;
                    Array.Copy(posEmbed.Data[t], tiledPosEmbed.Data[idx], _dModel);
                }
            });
            embeddings = embeddings + tiledPosEmbed; // [batchSize * seqLen, dModel]

            // Recurrent processing
            var stateHigh = Matrix<double>.Zeros(batchSize * seqLen, _dModel);
            var stateLow = Matrix<double>.Zeros(batchSize * seqLen, _dModel);
            Matrix<double>[] prevAttnWeightsHigh = null; // [nHeads, batchSize * seqLen, batchSize * seqLen]
            const int haltMaxSteps = 8;
            const double haltThreshold = 1e-4;

            for (int cycle = 0; cycle < _lCycles; cycle++)
            {
                // High-level module: Abstract planning with attention
                var highInput = embeddings + stateHigh; // [batchSize * seqLen, dModel]
                var (highAttnWeights, _, _, _, highOutput) = MultiHeadAttention(highInput, seqLen, batchSize, _attnWeightsHigh, _attnBiasesHigh); // [batchSize * seqLen, dModel]

                // Check for halting (based on attention weight stability)
                if (cycle > 0 && prevAttnWeightsHigh != null)
                {
                    double attnDiff = 0.0;
                    for (int h = 0; h < _nHeads; h++)
                    {
                        var diff = highAttnWeights[h] - prevAttnWeightsHigh[h];
                        attnDiff += diff.ElementWiseMultiply(diff).Sum();
                    }
                    if (attnDiff < haltThreshold && cycle < haltMaxSteps)
                        break;
                }
                prevAttnWeightsHigh = highAttnWeights.Select(m => m.Copy()).ToArray();

                stateHigh = (stateHigh + highOutput) * _rnnWHigh + _rnnBHigh; // [batchSize * seqLen, dModel]
                stateHigh = stateHigh.ElementWiseTanh();
                if (_dropoutRate > 0)
                    stateHigh = ApplyDropout(stateHigh, _dropoutRate);

                // Low-level module: Detailed computation
                var lowInput = embeddings + stateHigh; // [batchSize * seqLen, dModel]
                var (_, _, _, _, lowOutput) = MultiHeadAttention(lowInput, seqLen, batchSize, _attnWeightsLow, _attnBiasesLow); // [batchSize * seqLen, dModel]
                stateLow = (stateLow + lowOutput) * _rnnWLow + _rnnBLow; // [batchSize * seqLen, dModel]
                stateLow = stateLow.ElementWiseTanh();
                if (_dropoutRate > 0)
                    stateLow = ApplyDropout(stateLow, _dropoutRate);
            }

            // Feed-forward and output
            var ffnOutput = FeedForward(stateLow); // [batchSize * seqLen, dModel]
            var logits = ffnOutput * _outputW + _outputB; // [batchSize * seqLen, vocabSize]

            return logits;
        }

        private (Matrix<double>[], Matrix<double>[], Matrix<double>[], Matrix<double>[], Matrix<double>) MultiHeadAttention(Matrix<double> input, int seqLen, int batchSize, Matrix<double>[] attnWeights, Matrix<double>[] attnBiases)
        {
            var qs = new Matrix<double>[_nHeads];
            var ks = new Matrix<double>[_nHeads];
            var vs = new Matrix<double>[_nHeads];
            var outputs = new Matrix<double>[_nHeads];
            var attnWeightsOut = new Matrix<double>[_nHeads];
            Parallel.For(0, _nHeads, h =>
            {
                qs[h] = input * attnWeights[h]; // [batchSize * seqLen, dHead]
                ks[h] = input * attnWeights[h]; // [batchSize * seqLen, dHead]
                vs[h] = input * attnWeights[h]; // [batchSize * seqLen, dHead]
                qs[h] = qs[h] + BroadcastBias(attnBiases[h], batchSize * seqLen);
                ks[h] = ks[h] + BroadcastBias(attnBiases[h], batchSize * seqLen);
                vs[h] = vs[h] + BroadcastBias(attnBiases[h], batchSize * seqLen);
                var (attn, output) = ScaledDotProductAttention(qs[h], ks[h], vs[h], seqLen, batchSize); // [batchSize * seqLen, dHead]
                outputs[h] = output;
                attnWeightsOut[h] = attn;
            });
            return (attnWeightsOut, qs, ks, vs, ConcatenateHeads(outputs, batchSize, seqLen)); // [batchSize * seqLen, dModel]
        }

        private Matrix<double> BroadcastBias(Matrix<double> bias, int rows)
        {
            var result = new Matrix<double>(rows, bias.Columns);
            Parallel.For(0, rows, i =>
            {
                Array.Copy(bias.Data[0], result.Data[i], bias.Columns);
            });
            return result;
        }

        private (Matrix<double>, Matrix<double>) ScaledDotProductAttention(Matrix<double> q, Matrix<double> k, Matrix<double> v, int seqLen, int batchSize)
        {
            var scores = q * k.Transpose() / System.Math.Sqrt(_dHead); // [batchSize * seqLen, batchSize * seqLen]
            var causalMask = Matrix<double>.CreateCausalMask(batchSize * seqLen); // [batchSize * seqLen, batchSize * seqLen]
            scores = scores + causalMask;
            var attnWeights = scores.CustomSoftmax(); // [batchSize * seqLen, batchSize * seqLen]
            if (_dropoutRate > 0)
                attnWeights = ApplyDropout(attnWeights, _dropoutRate);
            var output = attnWeights * v; // [batchSize * seqLen, dHead]
            return (attnWeights, output);
        }

        private Matrix<double> ConcatenateHeads(Matrix<double>[] heads, int batchSize, int seqLen)
        {
            var result = new Matrix<double>(batchSize * seqLen, _dModel);
            Parallel.For(0, batchSize * seqLen, i =>
            {
                int offset = 0;
                for (int h = 0; h < _nHeads; h++)
                {
                    Array.Copy(heads[h].Data[i], 0, result.Data[i], offset, _dHead);
                    offset += _dHead;
                }
            });
            return result;
        }

        private Matrix<double> FeedForward(Matrix<double> input)
        {
            var hidden = input * _ffnW1 + _ffnB1; // [batchSize * seqLen, dFF]
            hidden = hidden.ElementWiseTanh();
            if (_dropoutRate > 0)
                hidden = ApplyDropout(hidden, _dropoutRate);
            var output = hidden * _ffnW2 + _ffnB2; // [batchSize * seqLen, dModel]
            return output;
        }

        private Matrix<double> OneHotEncode(int[][] ids, int vocabSize)
        {
            int batchSize = ids.Length;
            int seqLen = ids[0].Length;
            var result = new Matrix<double>(batchSize * seqLen, vocabSize);
            Parallel.For(0, batchSize, b =>
            {
                for (int t = 0; t < seqLen; t++)
                {
                    int idx = b * seqLen + t;
                    if (ids[b][t] >= 0 && ids[b][t] < vocabSize)
                        result[idx, ids[b][t]] = 1.0;
                    else
                        throw new ArgumentException($"Token ID {ids[b][t]} at batch {b}, position {t} is out of vocabulary range [0, {vocabSize - 1}].");
                }
            });
            return result;
        }

        #endregion

        #region Training

        public double Train(int[][] inputs, int[][] targets, double learningRate, int[][] puzzleIds = null)
        {
            if (inputs == null || targets == null)
                throw new ArgumentNullException("Inputs and targets cannot be null.");
            if (inputs.Length != targets.Length)
                throw new ArgumentException($"Inputs and targets must have same batch size, got {inputs.Length} vs {targets.Length}.");
            if (inputs.Length == 0 || inputs[0].Length == 0)
                throw new ArgumentException("Inputs must have at least one sequence and one token.");
            if (inputs[0].Length != targets[0].Length)
                throw new ArgumentException($"Inputs and targets must have same sequence length, got {inputs[0].Length} vs {targets[0].Length}.");
            if (inputs[0].Length > _maxSeqLen)
                throw new ArgumentException($"Sequence length {inputs[0].Length} exceeds maximum {_maxSeqLen}.");
            if (puzzleIds != null && (puzzleIds.Length != inputs.Length || puzzleIds[0].Length != inputs[0].Length))
                throw new ArgumentException("Puzzle IDs must match input IDs dimensions.", nameof(puzzleIds));
            if (learningRate <= 0)
                throw new ArgumentException("Learning rate must be positive.", nameof(learningRate));

            int batchSize = inputs.Length;
            int seqLen = inputs[0].Length;

            // Forward pass
            var embeddings = OneHotEncode(inputs, _vocabSize) * _tokenEmbeddings; // [batchSize * seqLen, dModel]
            if (puzzleIds != null)
            {
                var puzzleEmb = OneHotEncode(puzzleIds, _vocabSize) * _puzzleEmbeddings; // [batchSize * seqLen, dModel]
                embeddings = embeddings + puzzleEmb;
            }
            var posEmbed = _positionEmbeddings.SubMatrix(0, seqLen, 0, _dModel); // [seqLen, dModel]
            var tiledPosEmbed = new Matrix<double>(batchSize * seqLen, _dModel);
            Parallel.For(0, batchSize, b =>
            {
                for (int t = 0; t < seqLen; t++)
                {
                    int idx = b * seqLen + t;
                    Array.Copy(posEmbed.Data[t], tiledPosEmbed.Data[idx], _dModel);
                }
            });
            embeddings = embeddings + tiledPosEmbed; // [batchSize * seqLen, dModel]

            // Recurrent processing
            var stateHigh = Matrix<double>.Zeros(batchSize * seqLen, _dModel);
            var stateLow = Matrix<double>.Zeros(batchSize * seqLen, _dModel);
            var qsHigh = new Matrix<double>[_lCycles][];
            var ksHigh = new Matrix<double>[_lCycles][];
            var vsHigh = new Matrix<double>[_lCycles][];
            var attnWeightsHigh = new Matrix<double>[_lCycles][];
            var attnOutputsHigh = new Matrix<double>[_lCycles][];
            var qsLow = new Matrix<double>[_lCycles][];
            var ksLow = new Matrix<double>[_lCycles][];
            var vsLow = new Matrix<double>[_lCycles][];
            var attnWeightsLow = new Matrix<double>[_lCycles][];
            var attnOutputsLow = new Matrix<double>[_lCycles][];
            var highOutputs = new Matrix<double>[_lCycles];
            var lowOutputs = new Matrix<double>[_lCycles];
            Matrix<double>[] prevAttnWeightsHigh = null; // [nHeads, batchSize * seqLen, batchSize * seqLen]
            const int haltMaxSteps = 8;
            const double haltThreshold = 1e-4;
            int actualCycles = _lCycles;

            for (int cycle = 0; cycle < _lCycles; cycle++)
            {
                // Initialize arrays for this cycle
                qsHigh[cycle] = new Matrix<double>[_nHeads];
                ksHigh[cycle] = new Matrix<double>[_nHeads];
                vsHigh[cycle] = new Matrix<double>[_nHeads];
                attnWeightsHigh[cycle] = new Matrix<double>[_nHeads];
                attnOutputsHigh[cycle] = new Matrix<double>[_nHeads];
                qsLow[cycle] = new Matrix<double>[_nHeads];
                ksLow[cycle] = new Matrix<double>[_nHeads];
                vsLow[cycle] = new Matrix<double>[_nHeads];
                attnWeightsLow[cycle] = new Matrix<double>[_nHeads];
                attnOutputsLow[cycle] = new Matrix<double>[_nHeads];

                // High-level module
                var highInput = embeddings + stateHigh;
                (attnWeightsHigh[cycle], qsHigh[cycle], ksHigh[cycle], vsHigh[cycle], var highOutput) = MultiHeadAttention(highInput, seqLen, batchSize, _attnWeightsHigh, _attnBiasesHigh);
                highOutputs[cycle] = highOutput;

                // Check for halting
                if (cycle > 0 && prevAttnWeightsHigh != null)
                {
                    double attnDiff = 0.0;
                    for (int h = 0; h < _nHeads; h++)
                    {
                        var diff = attnWeightsHigh[cycle][h] - prevAttnWeightsHigh[h];
                        attnDiff += diff.ElementWiseMultiply(diff).Sum();
                    }
                    if (attnDiff < haltThreshold && cycle < haltMaxSteps)
                    {
                        actualCycles = cycle + 1;
                        break;
                    }
                }
                prevAttnWeightsHigh = attnWeightsHigh[cycle].Select(m => m.Copy()).ToArray();

                stateHigh = (stateHigh + highOutput) * _rnnWHigh + _rnnBHigh;
                stateHigh = stateHigh.ElementWiseTanh();
                if (_dropoutRate > 0)
                    stateHigh = ApplyDropout(stateHigh, _dropoutRate);

                // Low-level module
                var lowInput = embeddings + stateHigh;
                (attnWeightsLow[cycle], qsLow[cycle], ksLow[cycle], vsLow[cycle], var lowOutput) = MultiHeadAttention(lowInput, seqLen, batchSize, _attnWeightsLow, _attnBiasesLow);
                lowOutputs[cycle] = lowOutput;
                stateLow = (stateLow + lowOutput) * _rnnWLow + _rnnBLow;
                stateLow = stateLow.ElementWiseTanh();
                if (_dropoutRate > 0)
                    stateLow = ApplyDropout(stateLow, _dropoutRate);
            }

            // Feed-forward and output
            var hidden = stateLow * _ffnW1 + _ffnB1; // [batchSize * seqLen, dFF]
            hidden = hidden.ElementWiseTanh();
            if (_dropoutRate > 0)
                hidden = ApplyDropout(hidden, _dropoutRate);
            var ffnOutput = hidden * _ffnW2 + _ffnB2; // [batchSize * seqLen, dModel]
            var logits = ffnOutput * _outputW + _outputB; // [batchSize * seqLen, vocabSize]
            var probs = logits.CustomSoftmax(); // [batchSize * seqLen, vocabSize]

            // Backward pass
            var targetMatrix = OneHotEncode(targets, _vocabSize); // [batchSize * seqLen, vocabSize]
            double loss = CrossEntropyLoss(probs, targetMatrix);

            var gradLogits = probs - targetMatrix; // [batchSize * seqLen, vocabSize]
            var gradOutputW = ffnOutput.Transpose() * gradLogits; // [dModel, vocabSize]
            var gradOutputB = gradLogits.SumRows(); // [1, vocabSize]
            var gradFFnOutput = gradLogits * _outputW.Transpose(); // [batchSize * seqLen, dModel]
            var gradFFnW2 = hidden.Transpose() * gradFFnOutput; // [dFF, dModel]
            var gradFFnB2 = gradFFnOutput.SumRows(); // [1, dModel]
            var gradHidden = gradFFnOutput * _ffnW2.Transpose(); // [batchSize * seqLen, dFF]
            gradHidden = gradHidden.ElementWiseMultiply(Matrix<double>.Ones(batchSize * seqLen, _dFF) - hidden.ElementWiseMultiply(hidden)); // [batchSize * seqLen, dFF]
            var gradFFnW1 = stateLow.Transpose() * gradHidden; // [dModel, dFF]
            var gradFFnB1 = gradHidden.SumRows(); // [1, dFF]
            var gradStateLow = gradHidden * _ffnW1.Transpose(); // [batchSize * seqLen, dModel]

            // Backward through recurrent cycles
            var gradStateHigh = Matrix<double>.Zeros(batchSize * seqLen, _dModel);
            var gradEmbeddings = Matrix<double>.Zeros(batchSize * seqLen, _dModel);
            var gradPuzzleEmb = puzzleIds != null ? Matrix<double>.Zeros(batchSize * seqLen, _dModel) : null;
            var gradRnnWHigh = Matrix<double>.Zeros(_dModel, _dModel);
            var gradRnnBHigh = Matrix<double>.Zeros(1, _dModel);
            var gradRnnWLow = Matrix<double>.Zeros(_dModel, _dModel);
            var gradRnnBLow = Matrix<double>.Zeros(1, _dModel);
            var gradAttnWeightsHigh = new Matrix<double>[_nHeads];
            var gradAttnBiasesHigh = new Matrix<double>[_nHeads];
            var gradAttnWeightsLow = new Matrix<double>[_nHeads];
            var gradAttnBiasesLow = new Matrix<double>[_nHeads];
            for (int h = 0; h < _nHeads; h++)
            {
                gradAttnWeightsHigh[h] = Matrix<double>.Zeros(_dModel, _dHead);
                gradAttnBiasesHigh[h] = Matrix<double>.Zeros(1, _dHead);
                gradAttnWeightsLow[h] = Matrix<double>.Zeros(_dModel, _dHead);
                gradAttnBiasesLow[h] = Matrix<double>.Zeros(1, _dHead);
            }

            for (int cycle = actualCycles - 1; cycle >= 0; cycle--)
            {
                // Low-level module backward
                var gradRnnLow = gradStateLow.ElementWiseMultiply(Matrix<double>.Ones(batchSize * seqLen, _dModel) - stateLow.ElementWiseMultiply(stateLow));
                lock (_syncLock)
                {
                    gradRnnWLow += (stateLow + lowOutputs[cycle]).Transpose() * gradRnnLow; // [dModel, dModel]
                    gradRnnBLow += gradRnnLow.SumRows(); // [1, dModel]
                }
                var gradLowOutput = gradRnnLow * _rnnWLow.Transpose(); // [batchSize * seqLen, dModel]
                var gradAttnOutputsLow = SplitHeads(gradLowOutput, batchSize, seqLen); // [nHeads, batchSize * seqLen, dHead]
                Parallel.For(0, _nHeads, h =>
                {
                    var gradAttnScores = gradAttnOutputsLow[h] * vsLow[cycle][h].Transpose(); // [batchSize * seqLen, batchSize * seqLen]
                    var gradVsLow = attnWeightsLow[cycle][h].Transpose() * gradAttnOutputsLow[h]; // [batchSize * seqLen, dHead]
                    var gradScores = attnWeightsLow[cycle][h].ElementWiseMultiply(Matrix<double>.Ones(batchSize * seqLen, batchSize * seqLen) - attnWeightsLow[cycle][h]).ElementWiseMultiply(gradAttnScores);
                    var gradQsLow = gradScores * ksLow[cycle][h] / System.Math.Sqrt(_dHead); // [batchSize * seqLen, dHead]
                    var gradKsLow = gradScores.Transpose() * qsLow[cycle][h] / System.Math.Sqrt(_dHead); // [batchSize * seqLen, dHead]
                    var gradInputLow = gradQsLow * _attnWeightsLow[h].Transpose() + gradKsLow * _attnWeightsLow[h].Transpose() + gradVsLow * _attnWeightsLow[h].Transpose(); // [batchSize * seqLen, dModel]
                    var gradAttnW = (embeddings + stateHigh).Transpose() * (gradQsLow + gradKsLow + gradVsLow); // [dModel, dHead]
                    lock (_syncLock)
                    {
                        gradAttnWeightsLow[h] += gradAttnW;
                        gradAttnBiasesLow[h] += (gradQsLow + gradKsLow + gradVsLow).SumRows(); // [1, dHead]
                        gradStateHigh += gradInputLow;
                        gradEmbeddings += gradInputLow;
                        if (puzzleIds != null)
                            gradPuzzleEmb += gradInputLow;
                    }
                });

                // High-level module backward
                var gradRnnHigh = gradStateHigh.ElementWiseMultiply(Matrix<double>.Ones(batchSize * seqLen, _dModel) - stateHigh.ElementWiseMultiply(stateHigh));
                lock (_syncLock)
                {
                    gradRnnWHigh += (stateHigh + highOutputs[cycle]).Transpose() * gradRnnHigh; // [dModel, dModel]
                    gradRnnBHigh += gradRnnHigh.SumRows(); // [1, dModel]
                }
                var gradHighOutput = gradRnnHigh * _rnnWHigh.Transpose(); // [batchSize * seqLen, dModel]
                var gradAttnOutputsHigh = SplitHeads(gradHighOutput, batchSize, seqLen); // [nHeads, batchSize * seqLen, dHead]
                Parallel.For(0, _nHeads, h =>
                {
                    var gradAttnScores = gradAttnOutputsHigh[h] * vsHigh[cycle][h].Transpose(); // [batchSize * seqLen, batchSize * seqLen]
                    var gradVsHigh = attnWeightsHigh[cycle][h].Transpose() * gradAttnOutputsHigh[h]; // [batchSize * seqLen, dHead]
                    var gradScores = attnWeightsHigh[cycle][h].ElementWiseMultiply(Matrix<double>.Ones(batchSize * seqLen, batchSize * seqLen) - attnWeightsHigh[cycle][h]).ElementWiseMultiply(gradAttnScores);
                    var gradQsHigh = gradScores * ksHigh[cycle][h] / System.Math.Sqrt(_dHead); // [batchSize * seqLen, dHead]
                    var gradKsHigh = gradScores.Transpose() * qsHigh[cycle][h] / System.Math.Sqrt(_dHead); // [batchSize * seqLen, dHead]
                    var gradInputHigh = gradQsHigh * _attnWeightsHigh[h].Transpose() + gradKsHigh * _attnWeightsHigh[h].Transpose() + gradVsHigh * _attnWeightsHigh[h].Transpose(); // [batchSize * seqLen, dModel]
                    var gradAttnW = (embeddings + stateHigh).Transpose() * (gradQsHigh + gradKsHigh + gradVsHigh); // [dModel, dHead]
                    lock (_syncLock)
                    {
                        gradAttnWeightsHigh[h] += gradAttnW;
                        gradAttnBiasesHigh[h] += (gradQsHigh + gradKsHigh + gradVsHigh).SumRows(); // [1, dHead]
                        gradEmbeddings += gradInputHigh;
                        if (puzzleIds != null)
                            gradPuzzleEmb += gradInputHigh;
                    }
                });
            }

            var gradTokenEmbeddings = OneHotEncode(inputs, _vocabSize).Transpose() * gradEmbeddings; // [vocabSize, dModel]
            var gradPuzzleEmbMatrix = puzzleIds != null ? OneHotEncode(puzzleIds, _vocabSize).Transpose() * gradPuzzleEmb : null;
            var gradPosEmbed = new Matrix<double>(seqLen, _dModel);
            Parallel.For(0, batchSize, b =>
            {
                for (int t = 0; t < seqLen; t++)
                {
                    int idx = b * seqLen + t;
                    lock (_syncLock)
                    {
                        for (int d = 0; d < _dModel; d++)
                            gradPosEmbed[t, d] += gradEmbeddings[idx, d];
                    }
                }
            });

            // Update weights with gradient clipping
            UpdateWeights(learningRate, gradTokenEmbeddings, gradPosEmbed, gradPuzzleEmbMatrix, gradOutputW, gradOutputB, gradFFnW1, gradFFnB1, gradFFnW2, gradFFnB2,
                          gradAttnWeightsHigh, gradAttnBiasesHigh, gradAttnWeightsLow, gradAttnBiasesLow, gradRnnWHigh, gradRnnBHigh, gradRnnWLow, gradRnnBLow);

            return loss;
        }

        private Matrix<double>[] SplitHeads(Matrix<double> input, int batchSize, int seqLen)
        {
            var result = new Matrix<double>[_nHeads];
            Parallel.For(0, _nHeads, h =>
            {
                result[h] = input.SubMatrix(0, batchSize * seqLen, h * _dHead, _dHead); // [batchSize * seqLen, dHead]
            });
            return result;
        }

        private Matrix<double> ApplyDropout(Matrix<double> input, double dropoutRate)
        {
            var result = new Matrix<double>(input.Rows, input.Columns);
            Parallel.For(0, input.Rows, i =>
            {
                var rand = _random.Value;
                for (int j = 0; j < input.Columns; j++)
                {
                    result[i, j] = rand.NextDouble() < dropoutRate ? 0.0 : input[i, j] / (1.0 - dropoutRate);
                }
            });
            return result;
        }

        private double CrossEntropyLoss(Matrix<double> probs, Matrix<double> targets)
        {
            double loss = 0.0;
            object lockObj = new object();
            Parallel.For(0, probs.Rows, () => 0.0, (i, state, localLoss) =>
            {
                for (int j = 0; j < probs.Columns; j++)
                {
                    if (targets[i, j] > 0)
                    {
                        double prob = System.Math.Max(probs[i, j], 1e-10);
                        localLoss -= targets[i, j] * System.Math.Log(prob);
                    }
                }
                return localLoss;
            }, localLoss =>
            {
                lock (lockObj)
                    loss += localLoss;
            });
            return loss / probs.Rows;
        }

        private void UpdateWeights(double learningRate, Matrix<double> gradTokenEmbeddings, Matrix<double> gradPosEmbed, Matrix<double> gradPuzzleEmb,
            Matrix<double> gradOutputW, Matrix<double> gradOutputB, Matrix<double> gradFFnW1, Matrix<double> gradFFnB1,
            Matrix<double> gradFFnW2, Matrix<double> gradFFnB2, Matrix<double>[] gradAttnWeightsHigh, Matrix<double>[] gradAttnBiasesHigh,
            Matrix<double>[] gradAttnWeightsLow, Matrix<double>[] gradAttnBiasesLow, Matrix<double> gradRnnWHigh, Matrix<double> gradRnnBHigh,
            Matrix<double> gradRnnWLow, Matrix<double> gradRnnBLow)
        {
            const double clipValue = 1.0; // Gradient clipping threshold

            // Clip gradients
            gradTokenEmbeddings = gradTokenEmbeddings.Clip(-clipValue, clipValue);
            gradPosEmbed = gradPosEmbed.Clip(-clipValue, clipValue);
            if (gradPuzzleEmb != null)
                gradPuzzleEmb = gradPuzzleEmb.Clip(-clipValue, clipValue);
            gradOutputW = gradOutputW.Clip(-clipValue, clipValue);
            gradOutputB = gradOutputB.Clip(-clipValue, clipValue);
            gradFFnW1 = gradFFnW1.Clip(-clipValue, clipValue);
            gradFFnB1 = gradFFnB1.Clip(-clipValue, clipValue);
            gradFFnW2 = gradFFnW2.Clip(-clipValue, clipValue);
            gradFFnB2 = gradFFnB2.Clip(-clipValue, clipValue);
            gradRnnWHigh = gradRnnWHigh.Clip(-clipValue, clipValue);
            gradRnnBHigh = gradRnnBHigh.Clip(-clipValue, clipValue);
            gradRnnWLow = gradRnnWLow.Clip(-clipValue, clipValue);
            gradRnnBLow = gradRnnBLow.Clip(-clipValue, clipValue);
            for (int h = 0; h < _nHeads; h++)
            {
                gradAttnWeightsHigh[h] = gradAttnWeightsHigh[h].Clip(-clipValue, clipValue);
                gradAttnBiasesHigh[h] = gradAttnBiasesHigh[h].Clip(-clipValue, clipValue);
                gradAttnWeightsLow[h] = gradAttnWeightsLow[h].Clip(-clipValue, clipValue);
                gradAttnBiasesLow[h] = gradAttnBiasesLow[h].Clip(-clipValue, clipValue);
            }

            Parallel.For(0, _tokenEmbeddings.Rows, i =>
            {
                for (int j = 0; j < _tokenEmbeddings.Columns; j++)
                    _tokenEmbeddings[i, j] -= learningRate * gradTokenEmbeddings[i, j];
            });
            Parallel.For(0, gradPosEmbed.Rows, i =>
            {
                for (int j = 0; j < gradPosEmbed.Columns; j++)
                    _positionEmbeddings[i, j] -= learningRate * gradPosEmbed[i, j] / _maxSeqLen;
            });
            if (gradPuzzleEmb != null)
            {
                Parallel.For(0, _puzzleEmbeddings.Rows, i =>
                {
                    for (int j = 0; j < _puzzleEmbeddings.Columns; j++)
                        _puzzleEmbeddings[i, j] -= learningRate * gradPuzzleEmb[i, j];
                });
            }
            Parallel.For(0, _outputW.Rows, i =>
            {
                for (int j = 0; j < _outputW.Columns; j++)
                    _outputW[i, j] -= learningRate * gradOutputW[i, j];
            });
            Parallel.For(0, _outputB.Columns, j =>
            {
                _outputB[0, j] -= learningRate * gradOutputB[0, j];
            });
            Parallel.For(0, _ffnW1.Rows, i =>
            {
                for (int j = 0; j < _ffnW1.Columns; j++)
                    _ffnW1[i, j] -= learningRate * gradFFnW1[i, j];
            });
            Parallel.For(0, _ffnB1.Columns, j =>
            {
                _ffnB1[0, j] -= learningRate * gradFFnB1[0, j];
            });
            Parallel.For(0, _ffnW2.Rows, i =>
            {
                for (int j = 0; j < _ffnW2.Columns; j++)
                    _ffnW2[i, j] -= learningRate * gradFFnW2[i, j];
            });
            Parallel.For(0, _ffnB2.Columns, j =>
            {
                _ffnB2[0, j] -= learningRate * gradFFnB2[0, j];
            });
            Parallel.For(0, _nHeads, h =>
            {
                for (int i = 0; i < _attnWeightsHigh[h].Rows; i++)
                    for (int j = 0; j < _attnWeightsHigh[h].Columns; j++)
                        _attnWeightsHigh[h][i, j] -= learningRate * gradAttnWeightsHigh[h][i, j];
                for (int j = 0; j < _attnBiasesHigh[h].Columns; j++)
                    _attnBiasesHigh[h][0, j] -= learningRate * gradAttnBiasesHigh[h][0, j];
                for (int i = 0; i < _attnWeightsLow[h].Rows; i++)
                    for (int j = 0; j < _attnWeightsLow[h].Columns; j++)
                        _attnWeightsLow[h][i, j] -= learningRate * gradAttnWeightsLow[h][i, j];
                for (int j = 0; j < _attnBiasesLow[h].Columns; j++)
                    _attnBiasesLow[h][0, j] -= learningRate * gradAttnBiasesLow[h][0, j];
            });
            Parallel.For(0, _rnnWHigh.Rows, i =>
            {
                for (int j = 0; j < _rnnWHigh.Columns; j++)
                    _rnnWHigh[i, j] -= learningRate * gradRnnWHigh[i, j];
            });
            Parallel.For(0, _rnnBHigh.Columns, j =>
            {
                _rnnBHigh[0, j] -= learningRate * gradRnnBHigh[0, j];
            });
            Parallel.For(0, _rnnWLow.Rows, i =>
            {
                for (int j = 0; j < _rnnWLow.Columns; j++)
                    _rnnWLow[i, j] -= learningRate * gradRnnWLow[i, j];
            });
            Parallel.For(0, _rnnBLow.Columns, j =>
            {
                _rnnBLow[0, j] -= learningRate * gradRnnBLow[0, j];
            });
        }

        #endregion

        #region Inference

        public int[] GenerateNextToken(int[][] inputIds, int[][] puzzleIds = null)
        {
            var logits = Forward(inputIds, puzzleIds); // [batchSize * seqLen, vocabSize]
            int batchSize = inputIds.Length;
            int seqLen = inputIds[0].Length;
            int[] nextTokens = new int[batchSize];
            Parallel.For(0, batchSize, b =>
            {
                int idx = b * seqLen + (seqLen - 1);
                var row = logits.Row(idx);
                nextTokens[b] = row.ArgMax()[0];
            });
            return nextTokens;
        }

        #endregion
    }
    class Program
    {
        static void Main(string[] args)
        {
            // Initialize model
            Console.WriteLine("Initializing HRModel...");
            var model = new HRModel(vocabSize: 100, dModel: 64, nHeads: 4, dFF: 128, maxSeqLen: 10, lCycles: 4);

            // Generate synthetic training data (example: arithmetic sequences)
            Console.WriteLine("Generating synthetic training data...");
            int batchSize = 2;
            int numSequences = 62; // For 31 batches
            int[][] inputs = new int[numSequences][];
            int[][] targets = new int[numSequences][];
            Random rand = new Random();
            for (int i = 0; i < numSequences; i++)
            {
                int start = rand.Next(0, 90);
                inputs[i] = new int[] { start, start + 1, start + 2 };
                targets[i] = new int[] { start + 1, start + 2, start + 3 };
            }

            double learningRate = 0.01; // Increased for faster convergence
            int epochs = 50; // Reduced for testing
            int numBatches = numSequences / batchSize;

            while (true)
            {
                Console.WriteLine("\nHRModel Menu:");
                Console.WriteLine("1. Train the model");
                Console.WriteLine("2. Perform inference");
                Console.WriteLine("3. Exit");
                Console.Write("Select an option (1-3): ");

                string choice = Console.ReadLine();
                if (!int.TryParse(choice, out int option) || option < 1 || option > 3)
                {
                    Console.WriteLine("Invalid option. Please enter 1, 2, or 3.");
                    continue;
                }

                if (option == 3)
                {
                    Console.WriteLine("Exiting program...");
                    break;
                }

                if (option == 1)
                {
                    // Training logic
                    Console.WriteLine("\nStarting training...");
                    for (int epoch = 0; epoch < epochs; epoch++)
                    {
                        double totalLoss = 0.0;
                        for (int batch = 0; batch < numBatches; batch++)
                        {
                            int startIdx = batch * batchSize;
                            int[][] batchInputs = inputs.Skip(startIdx).Take(batchSize).ToArray();
                            int[][] batchTargets = targets.Skip(startIdx).Take(batchSize).ToArray();
                            double loss = model.Train(batchInputs, batchTargets, learningRate);
                            totalLoss += loss;
                            if ((batch + 1) % 3 == 0 || batch == numBatches - 1)
                            {
                                Console.WriteLine($"Epoch {epoch + 1}, Batch {(batch + 1)}/{numBatches}, Batch Loss: {loss:F4}");
                            }
                        }
                        double avgLoss = totalLoss / numBatches;
                        Console.WriteLine($"Epoch {epoch + 1} completed. Average Loss: {avgLoss:F4}");
                        // Test sample prediction
                        int[] predicted = model.GenerateNextToken(new int[][] { inputs[0] });
                        Console.WriteLine($"Sample Input: [{string.Join(", ", inputs[0])}], Predicted: {predicted[0]}");
                    }
                    Console.WriteLine("Training completed.");
                }
                else if (option == 2)
                {
                    // Inference logic
                    Console.WriteLine("\nEnter a sequence of token IDs (comma-separated, e.g., 0,1,2):");
                    string inputLine = Console.ReadLine();
                    int[] inputSeq;
                    try
                    {
                        inputSeq = inputLine.Split(',')
                            .Select(s => int.Parse(s.Trim()))
                            .ToArray();
                        if (inputSeq.Length == 0 || inputSeq.Length > model.MaxSeqLen)
                        {
                            Console.WriteLine($"Sequence length must be between 1 and {model.MaxSeqLen}.");
                            continue;
                        }
                        if (inputSeq.Any(id => id < 0 || id >= model.VocabSize))
                        {
                            Console.WriteLine($"Token IDs must be between 0 and {model.VocabSize - 1}.");
                            continue;
                        }
                    }
                    catch (FormatException)
                    {
                        Console.WriteLine("Invalid input. Please enter comma-separated integers.");
                        continue;
                    }

                    // Optional puzzle IDs
                    Console.WriteLine("Enter puzzle IDs (comma-separated, same length as input, or press Enter to skip):");
                    string puzzleLine = Console.ReadLine();
                    int[][] puzzleIds = null;
                    if (!string.IsNullOrWhiteSpace(puzzleLine))
                    {
                        try
                        {
                            int[] puzzleSeq = puzzleLine.Split(',')
                                .Select(s => int.Parse(s.Trim()))
                                .ToArray();
                            if (puzzleSeq.Length != inputSeq.Length)
                            {
                                Console.WriteLine("Puzzle IDs must have the same length as input sequence.");
                                continue;
                            }
                            if (puzzleSeq.Any(id => id < 0 || id >= model.VocabSize))
                            {
                                Console.WriteLine($"Puzzle IDs must be between 0 and {model.VocabSize - 1}.");
                                continue;
                            }
                            puzzleIds = new int[][] { puzzleSeq };
                        }
                        catch (FormatException)
                        {
                            Console.WriteLine("Invalid puzzle IDs. Please enter comma-separated integers or skip.");
                            continue;
                        }
                    }

                    // Perform inference
                    int[][] inputBatch = new int[][] { inputSeq };
                    try
                    {
                        int[] nextTokens = model.GenerateNextToken(inputBatch, puzzleIds);
                        Console.WriteLine($"Input sequence: [{string.Join(", ", inputSeq)}]");
                        if (puzzleIds != null)
                        {
                            Console.WriteLine($"Puzzle IDs: [{string.Join(", ", puzzleIds[0])}]");
                        }
                        Console.WriteLine($"Predicted next token: {nextTokens[0]}");
                    }
                    catch (Exception ex)
                    {
                        Console.WriteLine($"Inference error: {ex.Message}");
                    }
                }
            }
        }
    }

    public static class HRModelExtensions
    {
        public static int VocabSize(this HRModel model) => model.GetType().GetField("_vocabSize", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance).GetValue(model) as int? ?? 100;
        public static int MaxSeqLen(this HRModel model) => model.GetType().GetField("_maxSeqLen", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance).GetValue(model) as int? ?? 10;
    }
}