Absolute Zero: Reinforced Self-play Reasoning with Zero Data

Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning capabilities of large language models by learning directly from outcome-based rewards. Recent RLVR works that operate under the zero setting avoid supervision in labeling the reasoning process, but still depend on manually curated collections of questions and answers for training.


Understanding the Absolute Zero Reasoner: A New Way for AI to Learn Without Human Help

Imagine teaching a child to solve puzzles without ever showing them a single example of a completed puzzle. Instead, you give them a box of puzzle pieces, a way to check if the pieces fit together correctly, and let them figure it out through trial and error. Over time, the child not only learns to solve puzzles but also starts creating their own puzzles to challenge themselves further. This is, in essence, what the Absolute Zero Reasoner (AZR) does for artificial intelligence (AI). It’s a groundbreaking approach to training AI models to think and reason without relying on human-provided data, instructions, or examples.
The research behind AZR, detailed in the GitHub repository and the arXiv paper titled Absolute Zero: Reinforced Self-play Reasoning with Zero Data, introduces a new method called the Absolute Zero paradigm. This method allows an AI to teach itself how to solve complex problems, like coding or math, by generating its own tasks, solving them, and learning from the results—all without any external data. Let’s break this down step by step, explore how it works, provide some examples, and discuss what this could mean for the future.

The Problem with Traditional AI Training

To understand why AZR is such a big deal, we first need to look at how AI models are typically trained. Most AI systems, like those powering chatbots or code-writing tools, rely on supervised learning or reinforcement learning (RL) with human-curated data. In supervised learning, humans provide the AI with examples of problems and their correct solutions—like showing it thousands of math problems with answers or code snippets with expected outputs. The AI learns by mimicking these examples.
In reinforcement learning, the AI is given a goal and learns by trying different actions, receiving rewards or penalties based on how well it performs. For example, an AI learning to play chess might get a reward for winning a game. However, even in RL, humans often need to define the tasks (e.g., “play chess”) and provide a dataset of questions or scenarios for the AI to practice on.
Here’s the catch: creating these datasets is time-consuming, expensive, and requires human expertise. For instance, to train an AI to solve math problems, someone has to write thousands of problems and their solutions. As AI systems get smarter, the demand for high-quality data grows, and it’s becoming harder for humans to keep up. Plus, if AI ever surpasses human intelligence, human-created tasks might not be challenging enough for it to keep learning.
The Absolute Zero paradigm tackles this problem head-on. It proposes that an AI can learn to reason without any human-provided data by creating its own tasks and learning from them. This is like the child inventing their own puzzles instead of relying on a puzzle book written by someone else.

What Is the Absolute Zero Paradigm?

The Absolute Zero paradigm is a new way of training AI models using Reinforcement Learning with Verifiable Rewards (RLVR). RLVR is a technique where an AI learns by receiving feedback (rewards) based on whether its actions lead to correct outcomes, like solving a problem accurately. Unlike traditional RLVR, which still needs humans to provide tasks or questions, Absolute Zero takes it a step further: the AI generates its own tasks, solves them, and uses a system to check if the solutions are correct, all without human input.
Here’s how it works in simple terms:
  1. Task Creation (Propose Phase): The AI, acting as a “proposer,” comes up with its own problems to solve. These could be math problems, coding challenges, or logical puzzles. The AI designs these tasks to be useful for learning, meaning they’re challenging but not impossible.
  2. Task Validation: The AI uses a tool, like a code executor (a program that runs code and checks if it works), to verify whether the tasks it created are valid. For example, if the AI generates a coding problem, the executor checks if the problem makes sense and has a clear solution.
  3. Solving the Tasks (Solve Phase): The AI, now acting as a “solver,” tries to answer the problems it created. It generates solutions and tests them using the same tool (e.g., the code executor) to see if they’re correct.
  4. Learning from Feedback: The AI gets two types of rewards:
    • Learnability Reward: This measures how good the generated task is for learning. A task that’s too easy or too hard might get a low reward, while a moderately challenging task gets a high reward.
    • Accuracy Reward: This measures whether the AI solved the task correctly. A correct solution gets a high reward, while an incorrect one gets a low or no reward.
  5. Self-Improvement: The AI uses these rewards to improve both its ability to create tasks and solve them. It keeps repeating this cycle, getting better with each round.
This process is called self-play, because the AI is both the teacher (creating tasks) and the student (solving them). The “Absolute Zero” name comes from the fact that it starts with zero external data—no human-written problems, no pre-existing datasets, just the AI and a way to check its work.
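The five steps above can be sketched as a toy Python loop. This is a simplified illustration, not the actual AZR implementation: `propose_task` and `solve_task` stand in for the language model's two roles, and the learnability curve (peaking at a 50% solve rate) is one plausible shaping choice, assumed here for concreteness.

```python
import random

def propose_task(rng):
    """Toy proposer: invents an arithmetic question (stands in for the LLM)."""
    a, b = rng.randint(1, 9), rng.randint(1, 9)
    return {"question": (a, b), "answer": a + b}

def solve_task(task):
    """Toy solver: attempts the question (stands in for the same LLM)."""
    a, b = task["question"]
    return a + b

def learnability_reward(solve_rate):
    """Peaks when the solver succeeds about half the time; tasks that are
    always solved (too easy) or never solved (too hard) score near zero."""
    return 1.0 - abs(2.0 * solve_rate - 1.0)

# One propose -> solve -> verify -> learn iteration of the self-play loop.
rng = random.Random(0)
task = propose_task(rng)
accuracy = 1.0 if solve_task(task) == task["answer"] else 0.0
print(accuracy)                     # the toy solver is always right: 1.0
print(learnability_reward(0.5))     # moderately hard task: 1.0
print(learnability_reward(1.0))     # trivially easy task: 0.0
```

In the real system both rewards feed a single policy update, so the same model gets better at inventing useful tasks and at solving them.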

The Absolute Zero Reasoner (AZR)

The Absolute Zero Reasoner (AZR) is the AI system built to implement this paradigm, specifically for code-based reasoning and mathematical reasoning. AZR uses a large language model (LLM)—a type of AI designed to understand and generate text—as its core. This model plays both the proposer and solver roles, and it interacts with a code executor to validate tasks and solutions.
Here’s a more technical look at how AZR operates:
  • Model Architecture: AZR is built on a unified language model, meaning the same model handles both task generation and problem-solving. It is trained with TRR++ (Task-Relative REINFORCE++), a multitask variant of the REINFORCE++ reinforcement learning algorithm that maintains a separate reward baseline for each combination of task type and role.
  • Code Executor: The executor is a program that runs Python code and checks its output. For example, if AZR generates a coding problem like “Write a function to calculate the factorial of a number,” the executor runs the code to ensure it produces the correct results (e.g., 5! = 120).
  • Task Types: AZR frames each task around a (program, input, output) triple and generates tasks in three reasoning categories:
    • Deduction: predicting the output of a program given the program and its input; in classical terms, inferring a conclusion from given rules (e.g., "If all cats are mammals and Fluffy is a cat, is Fluffy a mammal?").
    • Abduction: inferring a plausible input that would produce a given output; classically, inferring the cause of an observation (e.g., "Fluffy is a mammal. Could Fluffy be a cat?").
    • Induction: synthesizing a program from input/output examples; classically, generalizing from specific cases (e.g., "Fluffy and Whiskers are cats and mammals. Are all cats mammals?").
  • Training Environment: AZR uses a framework called veRL (a reinforcement learning toolkit) and vLLM (a library for running large language models efficiently). The GitHub repository provides scripts to set up this environment, including commands to install dependencies like Python 3.10, CUDA for GPU support, and specific versions of libraries like transformers and vllm.
  • Performance Metrics: AZR is tested on standard coding benchmarks (such as HumanEval+ and MBPP+) and mathematical reasoning benchmarks (such as AIME and MATH500). It achieves state-of-the-art (SOTA) performance among models trained in the zero setting, outperforming models that rely on tens of thousands of human-curated examples.
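Concretely, all three task types can be built from a single (program, input, output) triple by hiding a different element each time, and the code executor checks answers by running the program. A toy Python illustration (the program and examples here are invented for this sketch, not taken from the paper):

```python
# The full triple: a program, an input, and the output of running it.
program = lambda x: 2 * x + 1
i = 5
o = program(i)                        # o == 11

# Deduction: given the program and input, predict the output.
predicted_output = 2 * i + 1
assert predicted_output == o

# Abduction: given the program and output, infer an input that produces it.
guessed_input = (o - 1) // 2
assert program(guessed_input) == o    # verified by re-running the program

# Induction: given input/output examples, synthesize the program itself.
examples = [(1, 3), (2, 5), (5, 11)]
guessed_program = lambda x: 2 * x + 1
assert all(guessed_program(a) == b for a, b in examples)
print("all three modes verified")
```

Note that verification is always the same operation, executing the program, which is what makes the rewards objective regardless of which element was hidden.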

Examples of AZR in Action

To make this concrete, let’s walk through two simplified examples of how AZR might work.

Example 1: Coding Problem

  1. Propose Phase: AZR generates a coding task: “Write a Python function to find the sum of all even numbers in a list.”
  2. Validation: The code executor creates a test case, like input = [1, 2, 3, 4, 5], and checks if the expected output (2 + 4 = 6) is clear and testable. The task is valid, so it gets a high learnability reward.
  3. Solve Phase: AZR writes a solution:
    def sum_even_numbers(numbers):
        return sum(num for num in numbers if num % 2 == 0)
     
  4. Verification: The executor runs the code with test cases (e.g., [1, 2, 3, 4, 5] → 6, [2, 4, 6] → 12). The solution is correct, so AZR gets a high accuracy reward.
  5. Learning: AZR adjusts its internal parameters to generate similar tasks and improve its coding skills.
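The verification step in this walkthrough is easy to reproduce. The sketch below runs the generated solution against the example's test cases the way the executor would; this is a simplification, since the real executor is a sandboxed Python runner:

```python
def sum_even_numbers(numbers):
    """The solver's candidate solution from step 3."""
    return sum(num for num in numbers if num % 2 == 0)

# The executor runs the candidate on each test case and awards the
# accuracy reward only when every expected output matches.
test_cases = [([1, 2, 3, 4, 5], 6), ([2, 4, 6], 12)]
passed = all(sum_even_numbers(inp) == expected for inp, expected in test_cases)
accuracy_reward = 1.0 if passed else -1.0
print(accuracy_reward)  # 1.0: the solution is correct, so the solver is rewarded
```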

Example 2: Math Problem

  1. Propose Phase: AZR creates a math problem: “Solve for x in the equation 2x + 3 = 11.”
  2. Validation: The executor checks if the problem has a unique solution (x = 4) by solving it programmatically. The task is valid and moderately challenging, earning a good learnability reward.
  3. Solve Phase: AZR generates a solution:
    • Step 1: Subtract 3 from both sides: 2x = 8.
    • Step 2: Divide both sides by 2: x = 4.
  4. Verification: The executor confirms x = 4 satisfies the equation, giving a high accuracy reward.
  5. Learning: AZR learns to create and solve similar algebraic equations, gradually tackling more complex ones.
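The executor's check in step 4 amounts to substituting the candidate back into the equation, which a few lines of Python capture (a toy check, not the actual verifier):

```python
# Solve 2x + 3 = 11 following the solver's two steps, then verify the
# candidate by substitution, as the executor does.
candidate = (11 - 3) / 2          # step 1: subtract 3; step 2: divide by 2
is_correct = (2 * candidate + 3 == 11)
print(candidate, is_correct)      # 4.0 True
```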
These examples show how AZR operates in a loop, constantly generating, solving, and learning from its own tasks. Over time, it becomes better at both creating challenging problems and solving them accurately.

Why Is AZR a Big Deal?

AZR’s ability to learn without human data is a game-changer for several reasons:
  1. Scalability: Traditional AI training relies on human effort to create datasets, which is a bottleneck. AZR eliminates this need, allowing AI to scale its learning indefinitely as long as it has computational resources.
  2. Autonomy: By generating its own tasks, AZR can explore new problem domains without human guidance. This makes it more adaptable to new challenges.
  3. Performance: Despite starting with zero data, AZR achieves top performance on coding and math benchmarks, surpassing models trained on large human-curated datasets. This suggests that self-play can produce highly capable AI systems.
  4. Future-Proofing: In a future where AI might outsmart humans, human-created tasks could become too simple. AZR’s self-evolving approach ensures it can keep challenging itself, even beyond human capabilities.
From a technical perspective, AZR’s success highlights the power of reinforcement learning with verifiable rewards. By grounding its learning in a real environment (the code executor), AZR avoids common RL pitfalls like “reward hacking,” where an AI finds shortcuts to maximize rewards without actually learning (e.g., exploiting bugs in a reward system). The use of a code executor as a verifiable feedback source ensures that rewards are objective and reliable.

Technical Details for the Curious
For those interested in the nuts and bolts, here’s a bit more tech talk about AZR’s implementation, based on the GitHub repository and paper:
  • Training Setup: AZR requires significant computational power. Smaller models (3 billion parameters) need two 80GB GPUs, while larger ones (14 billion parameters) need eight. The repository includes scripts for setting up the environment, such as:
    conda create -n azr python=3.10
    conda activate azr
    pip install -r requirements.txt
     
    These commands install dependencies like vllm==0.7.3 and transformers==4.47.1.
  • Data Construction: While AZR doesn’t rely on external data for training, the repository provides seed datasets to help users replicate the process. These datasets are generated by prompting other AI models and are stored in JSONL format (e.g., data/<new_ded_abd_seed_data_name>.jsonl).
  • Self-Play Scripts: The repository includes scripts for running self-play experiments, such as:
    bash scripts/selfplay/7b.sh
     
    Users can customize these scripts by specifying their own seed datasets or resuming previous runs with a Weights & Biases (W&B) run ID.
  • Model Conversion: After training, AZR’s checkpoints can be converted to Hugging Face format for easier deployment:
    python -m absolute_zero_reasoner.utils.convert2hf <veRL_ckpt_path>/actor <veRL_ckpt_path>/actor/huggingface/ <hf_ckpt_path>
     
  • Reward System: AZR’s reward system is customizable. Users can add new rewards (e.g., for task diversity or complexity) to the configuration file azr.reward.generation_reward_config.
These details show that AZR is not just a theoretical concept but a practical system that researchers can experiment with, provided they have the computational resources.

Ideas for the Future: Where Could This Lead?

The Absolute Zero paradigm and AZR open up exciting possibilities for the future of AI. Here are some ideas for where this research could take us:
  1. Expanding to New Domains:
    • Science and Engineering: AZR’s code-based reasoning could extend to scientific simulations, like designing experiments or solving physics problems. For example, it could generate and solve problems in quantum mechanics or fluid dynamics.
    • Creative Problem-Solving: AZR could be adapted for creative tasks, like generating and solving design challenges (e.g., creating architectural blueprints and verifying their structural integrity).
    • Natural Language Reasoning: While AZR focuses on code and math, it could be extended to logical reasoning in text, such as generating and solving legal arguments or philosophical debates.
  2. General Intelligence:
    • AZR’s self-play approach is a step toward artificial general intelligence (AGI), where AI can learn any intellectual task without human supervision. By combining AZR with multimodal models (handling text, images, and more), we could create AI that learns across diverse domains autonomously.
    • For example, an AGI inspired by AZR could generate tasks in vision (e.g., “Identify objects in this image”), language (e.g., “Summarize this article”), and reasoning (e.g., “Solve this logic puzzle”), learning from its own feedback.
  3. Education and Training:
    • AZR could power personalized learning systems that generate custom exercises for students and verify their solutions. For instance, a math tutoring AI could create problems tailored to a student’s skill level and provide feedback without needing a human teacher to write the problems.
    • In professional training, AZR could generate coding challenges for software engineers or case studies for business students, adapting to their progress.
  4. Autonomous Research:
    • AZR could become a tool for scientific discovery, generating hypotheses and testing them through simulations. For example, in drug discovery, it could propose molecular structures, simulate their interactions, and refine its proposals based on results.
    • This could lead to AI-driven labs where machines conduct experiments independently, accelerating innovation in fields like medicine or materials science.
  5. Ethical and Safety Considerations:
    • As AZR-like systems become more autonomous, we’ll need robust safeguards to ensure they don’t generate harmful tasks or solutions. For example, a code executor could be designed to detect and block malicious code.
    • Research could focus on aligning AZR’s self-generated tasks with human values, ensuring it prioritizes beneficial outcomes.
  6. Integration with Other AI Advances:
    • Combining AZR with multimodal models (like X-Reasoner) could enable it to handle images, audio, or video, generating tasks like “Design a logo and verify its aesthetic appeal.”
    • Pairing AZR with tool-integrated models (like START) could allow it to use external tools (e.g., calculators, databases) to validate tasks in real-world scenarios.
These possibilities suggest that AZR is not just a one-off innovation but a foundation for building AI systems that are more autonomous, versatile, and capable of driving progress across industries.

Conclusion

The Absolute Zero Reasoner and its underlying paradigm represent a bold leap forward in AI research. By enabling an AI to generate its own tasks, solve them, and learn from the results without any human data, AZR challenges the traditional reliance on human-curated datasets. Its ability to achieve state-of-the-art performance in coding and mathematical reasoning, using nothing but self-play and a code executor, demonstrates the power of this approach.
Through examples like generating coding problems or solving math equations, we’ve seen how AZR operates in a self-improving loop, constantly refining its skills. The technical details, from its use of reinforcement learning to its customizable reward system, show that it’s a practical system that researchers can build upon. Looking ahead, AZR could pave the way for AI that learns across diverse domains, powers autonomous research, and even approaches general intelligence—all while raising important questions about safety and ethics.
In a world where data is often called the “new oil,” AZR suggests that AI might not need oil at all. Instead, it can drill its own wells, refine its own fuel, and keep driving forward. This research is a glimpse into a future where AI doesn’t just follow human instructions but charts its own path to discovery.

 

White Paper: Absolute Zero: Reinforced Self-play Reasoning with Zero Data

GitHub: Absolute-Zero-Reasoner

 

Here is a .NET Framework 4.8 console application, written in C#, that sketches the AZR propose-solve-verify loop:

namespace AbsoluteZeroReasoner
{



    #region Using Statements:



    using System;
    using System.IO;
    using System.Linq;
    using Microsoft.CSharp;
    using System.Reflection;
    using System.CodeDom.Compiler;
    using System.Collections.Generic;



    #endregion




    /// <summary>
    /// A reasoning system that trains on reasoning tasks using a language model and reinforcement learning.
    /// It generates, solves, and verifies tasks, optimizing for learnability and accuracy.
    /// </summary>
    public class AbsoluteZeroReasoner
    {



        #region Fields:



        private readonly ILanguageModel _languageModel;



        private readonly IReinforcementLearner _rlAgent;



        private readonly CodeExecutor _codeExecutor;



        private readonly Random _random;



        private readonly double _learnabilityRewardWeight;



        private readonly double _accuracyRewardWeight;



        private readonly int _maxIterations;



        #endregion




        /// <summary>
        /// Initializes a new instance of the <see cref="AbsoluteZeroReasoner"/> class.
        /// </summary>
        /// <param name="languageModel">The language model used to generate tasks and solutions.</param>
        /// <param name="rlAgent">The reinforcement learning agent to track rewards and task history.</param>
        /// <param name="learnabilityRewardWeight">Weight for learnability reward in total reward calculation (default: 0.5).</param>
        /// <param name="accuracyRewardWeight">Weight for accuracy reward in total reward calculation (default: 0.5).</param>
        /// <param name="maxIterations">Maximum number of training iterations (default: 1000).</param>
        /// <exception cref="ArgumentNullException">Thrown when <paramref name="languageModel"/> or <paramref name="rlAgent"/> is null.</exception>
        public AbsoluteZeroReasoner(
            ILanguageModel languageModel,
            IReinforcementLearner rlAgent,
            double learnabilityRewardWeight = 0.5,
            double accuracyRewardWeight = 0.5,
            int maxIterations = 1000)
        {

            _languageModel = languageModel ?? throw new ArgumentNullException(nameof(languageModel));
            _rlAgent = rlAgent ?? throw new ArgumentNullException(nameof(rlAgent));
            _codeExecutor = new CodeExecutor();
            _random = new Random();
            _learnabilityRewardWeight = learnabilityRewardWeight;
            _accuracyRewardWeight = accuracyRewardWeight;
            _maxIterations = maxIterations;
        }



        /// <summary>
        /// Trains the reasoner by processing tasks from a file or generating tasks autonomously.
        /// </summary>
        /// <param name="taskFilePath">Path to a JSON file containing tasks (optional).</param>
        public void Train(string taskFilePath = null)
        {

            List<ReasoningTask> fileTasks = taskFilePath != null ? LoadTasksFromFile(taskFilePath) : null;
            int taskIndex = 0;

            for (int iteration = 0; iteration < _maxIterations; iteration++)
            {
                Console.WriteLine($"\n=== Iteration {iteration + 1} ===");

                ReasoningTask task;
                double learnabilityReward;
                if (fileTasks != null && taskIndex < fileTasks.Count)
                {
                    task = fileTasks[taskIndex];
                    learnabilityReward = ComputeLearnabilityReward(task);
                    taskIndex++;
                }
                else
                {
                    var proposedTask = ProposeTask();
                    var (proposedTaskObj, proposedLearnability) = ValidateProposedTask(proposedTask);
                    if (proposedTaskObj == null)
                    {
                        Console.WriteLine("Invalid task proposed.");
                        _rlAgent.Update(learnabilityReward: -1.0, accuracyReward: 0.0, taskId: null);
                        continue;
                    }
                    task = proposedTaskObj;
                    learnabilityReward = proposedLearnability;
                }

                Console.WriteLine($"Proposed Task: {task.Question}, Expected Output: {task.ExpectedOutput}");

                var solution = SolveTask(task);
                Console.WriteLine($"Generated Solution: {solution}");

                var (actualOutput, accuracyReward, error) = VerifySolution(task, solution);
                Console.WriteLine($"Actual Output: {(error != null ? $"Error: {error}" : actualOutput ?? "null")}");
                Console.WriteLine($"Accuracy Reward: {accuracyReward}");

                double totalReward = (_learnabilityRewardWeight * learnabilityReward) +
                                    (_accuracyRewardWeight * accuracyReward);

                _rlAgent.Update(learnabilityReward, accuracyReward, task.Id);

                Console.WriteLine($"Learnability Reward: {learnabilityReward}, Total Reward: {totalReward}");
            }
        }



        /// <summary>
        /// Loads reasoning tasks from a JSON file.
        /// </summary>
        /// <param name="filePath">Path to the JSON file containing tasks.</param>
        /// <returns>A list of <see cref="ReasoningTask"/> objects, or null if loading fails.</returns>
        private List<ReasoningTask> LoadTasksFromFile(string filePath)
        {

            if (!File.Exists(filePath))
            {
                Console.WriteLine($"File {filePath} not found. Falling back to autonomous task generation.");
                return null;
            }

            try
            {
                string json = File.ReadAllText(filePath);
                Console.WriteLine($"Raw JSON: {json.Substring(0, Math.Min(100, json.Length))}..."); // Log JSON snippet
                var tasks = new List<ReasoningTask>();
                // Split JSON array into individual task objects
                string[] taskStrings = json.Trim('[', ']').Split(new[] { "},{" }, StringSplitOptions.None);
                foreach (string taskStr in taskStrings)
                {
                    string cleaned = taskStr.Trim('{', '}').Replace("},", "").Replace("{", "");
                    // Split by commas, but handle quoted strings carefully
                    var parts = new List<string>();
                    bool inQuotes = false;
                    string currentPart = "";
                    for (int i = 0; i < cleaned.Length; i++)
                    {
                        if (cleaned[i] == '"') inQuotes = !inQuotes;
                        else if (cleaned[i] == ',' && !inQuotes)
                        {
                            parts.Add(currentPart.Trim());
                            currentPart = "";
                            continue;
                        }
                        currentPart += cleaned[i];
                    }
                    if (!string.IsNullOrEmpty(currentPart)) parts.Add(currentPart.Trim());

                    var dict = new Dictionary<string, string>();
                    foreach (var part in parts)
                    {
                        var kv = part.Split(new[] { ":" }, 2, StringSplitOptions.None);
                        if (kv.Length == 2)
                        {
                            string key = kv[0].Trim('"');
                            string value = kv[1].Trim().Trim('"'); // trim whitespace first so surrounding quotes are actually stripped
                            if (!dict.ContainsKey(key)) // Avoid duplicate keys
                                dict[key] = value;
                        }
                    }

                    if (dict.ContainsKey("id") && dict.ContainsKey("question") && dict.ContainsKey("expected_output"))
                    {
                        tasks.Add(new ReasoningTask
                        {
                            Id = dict["id"],
                            Question = dict["question"],
                            ExpectedOutput = dict["expected_output"]
                        });
                    }
                }
                Console.WriteLine($"Loaded {tasks.Count} tasks from {filePath}");
                return tasks.Count > 0 ? tasks : null;
            }
            catch (Exception ex)
            {
                Console.WriteLine($"Error reading tasks from {filePath}: {ex.Message}. Falling back to autonomous task generation.");
                return null;
            }
        }



        /// <summary>
        /// Proposes a new reasoning task using the language model.
        /// </summary>
        /// <returns>A JSON string representing the proposed task.</returns>
        private string ProposeTask()
        {

            string[] taskTypes = { "deduction", "abduction", "induction" };
            string taskType = taskTypes[_random.Next(taskTypes.Length)];

            string prompt = $"Generate a {taskType} reasoning task in the form of a C# coding problem. " +
                           "Return the task as a JSON string with 'id' (unique string), 'question' (problem description), and " +
                           "'expected_output' (expected result).";
            return _languageModel.Generate(prompt);
        }



        /// <summary>
        /// Validates a proposed task and computes its learnability reward.
        /// </summary>
        /// <param name="taskJson">JSON string representing the proposed task.</param>
        /// <returns>A tuple containing the parsed <see cref="ReasoningTask"/> and its learnability reward, or (null, -1.0) if invalid.</returns>
        private (ReasoningTask Task, double LearnabilityReward) ValidateProposedTask(string taskJson)
        {

            try
            {
                var task = ParseTaskJson(taskJson);
                if (string.IsNullOrEmpty(task.Id) || string.IsNullOrEmpty(task.Question) || string.IsNullOrEmpty(task.ExpectedOutput))
                {
                    return (null, -1.0);
                }

                bool isValid = _codeExecutor.IsValidTask(task);
                double learnabilityReward = isValid ? ComputeLearnabilityReward(task) : -1.0;

                return (isValid ? task : null, learnabilityReward);
            }
            catch (Exception ex)
            {
                Console.WriteLine($"Task validation error: {ex.Message}");
                return (null, -1.0);
            }
        }



        /// <summary>
        /// Generates a solution for a given reasoning task using the language model.
        /// </summary>
        /// <param name="task">The reasoning task to solve.</param>
        /// <returns>The generated C# solution code as a string.</returns>
        private string SolveTask(ReasoningTask task)
        {

            string prompt = $"Solve the following C# coding problem:\n{task.Question}\n" +
                           "Provide the solution as a C# statement that assigns to a variable.";
            return _languageModel.Generate(prompt);
        }



        /// <summary>
        /// Verifies a solution by executing it and comparing its output to the expected output.
        /// </summary>
        /// <param name="task">The reasoning task being verified.</param>
        /// <param name="solution">The generated solution code.</param>
        /// <returns>A tuple containing the actual output, accuracy reward (1.0 for correct, -1.0 for incorrect), and any error message.</returns>
        private (string ActualOutput, double AccuracyReward, string Error) VerifySolution(ReasoningTask task, string solution)
        {

            try
            {
                string actualOutput = _codeExecutor.ExecuteSolution(solution);
                // Normalize whitespace, case, and stray quotes on BOTH sides so
                // outputs like "120" and 120 compare equal.
                string normalizedActual = actualOutput?.Trim().ToLowerInvariant().Replace("\"", "");
                string normalizedExpected = task.ExpectedOutput?.Trim().ToLowerInvariant().Replace("\"", "");
                bool isCorrect = string.Equals(normalizedActual, normalizedExpected);
                Console.WriteLine($"Comparing: Actual='{normalizedActual}', Expected='{normalizedExpected}', IsCorrect={isCorrect}");
                return (actualOutput, isCorrect ? 1.0 : -1.0, null);
            }
            catch (Exception ex)
            {
                return (null, -1.0, ex.Message);
            }
        }



        /// <summary>
        /// Computes the learnability reward for a task based on its complexity.
        /// </summary>
        /// <param name="task">The reasoning task to evaluate.</param>
        /// <returns>A learnability reward value between 0.0 and 1.0.</returns>
        private double ComputeLearnabilityReward(ReasoningTask task)
        {

            int complexity = task.Question.Length;
            return Math.Min(1.0, complexity / 100.0);
        }



        /// <summary>
        /// Parses a JSON string into a <see cref="ReasoningTask"/> object.
        /// </summary>
        /// <param name="json">The JSON string to parse.</param>
        /// <returns>A <see cref="ReasoningTask"/> object, or an empty task if parsing fails.</returns>
        private ReasoningTask ParseTaskJson(string json)
        {

            try
            {
                // Minimal hand-rolled parser for flat task objects; values that
                // contain commas are not supported by this split.
                var parts = json.Replace("{", "").Replace("}", "").Split(',');
                string id = parts.FirstOrDefault(p => p.Contains("id"))?.Split(new[] { ':' }, 2)[1].Trim().Trim('"');
                string question = parts.FirstOrDefault(p => p.Contains("question"))?.Split(new[] { ':' }, 2)[1].Trim().Trim('"');
                string expectedOutput = parts.FirstOrDefault(p => p.Contains("expected_output"))?.Split(new[] { ':' }, 2)[1].Trim().Trim('"').Replace("\"", "");
                return new ReasoningTask { Id = id, Question = question, ExpectedOutput = expectedOutput };
            }
            catch
            {
                return new ReasoningTask();
            }
        }
    }



    /// <summary>
    /// Represents a reasoning task with an identifier, question, and expected output.
    /// </summary>
    public class ReasoningTask
    {



        /// <summary>
        /// Gets or sets the unique identifier of the task.
        /// </summary>
        public string Id { get; set; }



        /// <summary>
        /// Gets or sets the problem description or question.
        /// </summary>
        public string Question { get; set; }



        /// <summary>
        /// Gets or sets the expected output of the task.
        /// </summary>
        public string ExpectedOutput { get; set; }
    }

    /// <summary>
    /// Defines the interface for a language model that generates tasks and solutions.
    /// </summary>
    public interface ILanguageModel
    {



        /// <summary>
        /// Generates content based on a given prompt.
        /// </summary>
        /// <param name="prompt">The input prompt for content generation.</param>
        /// <returns>The generated content as a string.</returns>
        string Generate(string prompt);



        /// <summary>
        /// Generates a dataset of tasks and saves it to a file.
        /// </summary>
        /// <param name="taskCount">The number of tasks to generate.</param>
        /// <param name="filePath">The file path to save the dataset.</param>
        void GenerateDataset(int taskCount, string filePath);
    }



    /// <summary>
    /// Defines the interface for a reinforcement learning agent that tracks rewards and task history.
    /// </summary>
    public interface IReinforcementLearner
    {



        /// <summary>
        /// Updates the agent's state with rewards and task information.
        /// </summary>
        /// <param name="learnabilityReward">The learnability reward for the task.</param>
        /// <param name="accuracyReward">The accuracy reward for the task.</param>
        /// <param name="taskId">The unique identifier of the task.</param>
        void Update(double learnabilityReward, double accuracyReward, string taskId);



        /// <summary>
        /// Retrieves the task history, including attempts and successes.
        /// </summary>
        /// <returns>A dictionary mapping task IDs to their attempt and success counts.</returns>
        Dictionary<string, (int Attempts, int Successes)> GetTaskHistory();
    }



    /// <summary>
    /// Executes and validates C# code for reasoning tasks.
    /// </summary>
    public class CodeExecutor
    {



        /// <summary>
        /// Validates whether a reasoning task is well-formed.
        /// </summary>
        /// <param name="task">The reasoning task to validate.</param>
        /// <returns>True if the task is valid; otherwise, false.</returns>
        public bool IsValidTask(ReasoningTask task)
        {

            return !string.IsNullOrEmpty(task.Id) &&
                   !string.IsNullOrEmpty(task.Question) &&
                   !string.IsNullOrEmpty(task.ExpectedOutput);
        }



        /// <summary>
        /// Executes a C# solution code and returns its output.
        /// </summary>
        /// <param name="code">The C# solution code to execute.</param>
        /// <returns>The output of the executed code as a string.</returns>
        /// <exception cref="Exception">Thrown if compilation or execution fails.</exception>
        public string ExecuteSolution(string code)
        {

            try
            {
                CSharpCodeProvider provider = new CSharpCodeProvider();
                CompilerParameters parameters = new CompilerParameters
                {
                    GenerateInMemory = true,
                    GenerateExecutable = false
                };
                parameters.ReferencedAssemblies.Add("System.dll");

                string wrappedCode = @"
                    using System;
                    public class Solution {
                        public static string Run() {
                            " + code + @"
                            return result.ToString();
                        }
                    }";

                Console.WriteLine($"Compiling code:\n{wrappedCode}");

                CompilerResults results = provider.CompileAssemblyFromSource(parameters, wrappedCode);

                if (results.Errors.HasErrors)
                {
                    string errors = string.Join("\n", results.Errors.Cast<CompilerError>().Select(e => e.ToString()));
                    throw new Exception($"Compilation error: {errors}");
                }

                var assembly = results.CompiledAssembly;
                var type = assembly.GetType("Solution");
                var method = type.GetMethod("Run");
                var result = method.Invoke(null, null);

                return result?.ToString();
            }
            catch (Exception ex)
            {
                throw new Exception($"Execution error: {ex.Message}");
            }
        }
    }



    /// <summary>
    /// A stub implementation of <see cref="ILanguageModel"/> that dynamically generates tasks and solutions.
    /// </summary>
    public class DynamicStubLanguageModel : ILanguageModel
    {



        #region Fields:



        private readonly Random _random;


        private int _taskCounter;


        private readonly Dictionary<string, (int Attempts, int Successes)> _taskHistory;


        private readonly List<string> _recentTasks;



        #endregion



        /// <summary>
        /// Initializes a new instance of the <see cref="DynamicStubLanguageModel"/> class.
        /// </summary>
        /// <param name="rlAgent">The reinforcement learning agent to access task history.</param>
        public DynamicStubLanguageModel(IReinforcementLearner rlAgent)
        {

            _random = new Random();
            _taskCounter = 0;
            _taskHistory = rlAgent.GetTaskHistory();
            _recentTasks = new List<string>();
        }



        /// <summary>
        /// Generates content based on a prompt, such as a task or solution.
        /// </summary>
        /// <param name="prompt">The input prompt for content generation.</param>
        /// <returns>The generated content as a string (e.g., JSON task or C# solution).</returns>
        public string Generate(string prompt)
        {

            if (prompt.Contains("Generate a"))
            {
                _taskCounter++;
                string taskId = $"task_{_taskCounter}";
                string taskType = prompt.Contains("deduction") ? "deduction" :
                                  prompt.Contains("abduction") ? "abduction" : "induction";

                var (question, expectedOutput, _) = GenerateNewTask(taskType);
                return $"{{\"id\": \"{taskId}\", \"question\": \"{question} ({taskType})\", \"expected_output\": \"{expectedOutput}\"}}";
            }
            else if (prompt.Contains("Solve the following"))
            {
                string question = prompt.Split('\n')[1].Split('(')[0].Trim();
                return GenerateSolutionForQuestion(question);
            }
            return "";
        }



        /// <summary>
        /// Generates a dataset of tasks and saves it to a JSON file.
        /// </summary>
        /// <param name="taskCount">The number of tasks to generate.</param>
        /// <param name="filePath">The file path to save the dataset.</param>
        public void GenerateDataset(int taskCount, string filePath)
        {

            var tasks = new List<Dictionary<string, string>>();
            int tempCounter = _taskCounter;
            _taskCounter = 0;

            for (int i = 0; i < taskCount; i++)
            {
                _taskCounter++;
                string taskId = $"task_{_taskCounter}";
                string taskType = new[] { "deduction", "abduction", "induction" }[_random.Next(3)];
                var (question, expectedOutput, solution) = GenerateNewTask(taskType);

                tasks.Add(new Dictionary<string, string>
                {
                    { "id", taskId },
                    { "question", $"{question} ({taskType})" },
                    { "expected_output", expectedOutput },
                    { "solution", solution }
                });
            }

            // Escape embedded quotes (e.g. in string-concatenation solutions) so the
            // emitted file is valid JSON.
            string Escape(string s) => s.Replace("\"", "\\\"");
            string json = "[" + string.Join(",", tasks.Select(t =>
                $"{{ \"id\": \"{t["id"]}\", \"question\": \"{Escape(t["question"])}\", \"expected_output\": \"{Escape(t["expected_output"])}\", \"solution\": \"{Escape(t["solution"])}\" }}"))
                + "]";

            File.WriteAllText(filePath, json);
            Console.WriteLine($"Generated {taskCount} tasks and saved to {filePath}");
            Console.WriteLine($"Generated JSON: {json.Substring(0, Math.Min(100, json.Length))}..."); // Log JSON snippet
            _taskCounter = tempCounter;
        }



        /// <summary>
        /// Generates a new reasoning task based on the specified task type.
        /// </summary>
        /// <param name="taskType">The type of task (deduction, abduction, or induction).</param>
        /// <returns>A tuple containing the question, expected output, and solution.</returns>
        private (string Question, string ExpectedOutput, string Solution) GenerateNewTask(string taskType)
        {

            var templates = new List<(string QuestionTemplate, Func<int, int, (string ExpectedOutput, string Solution)>)>
            {
                ("Write a method to compute {0} factorial", (n, _) => {
                    int fact = 1;
                    for (int i = 1; i <= n; i++) fact *= i;
                    return ($"{fact}", $"int result = 1; for (int i = 1; i <= {n}; i++) result *= i;");
                }),
                ("Write a method to compute {0} + {1}", (a, b) => ($"{a + b}", $"int result = {a} + {b};")),
                ("Write a method to compute the square of {0}", (n, _) => ($"{n * n}", $"int result = {n} * {n};")),
                ("Write a method to concatenate '{0}' and '{1}'", (a, b) => ($"{a}{b}", $"string result = \"{a}\" + \"{b}\";")),
                ("Write a method to check if {0} is even", (n, _) => ($"{n % 2 == 0}", $"bool result = {n} % 2 == 0;"))
            };

            var weights = templates.Select((t, i) =>
                t.QuestionTemplate.Contains("square") ? 2.0 :
                t.QuestionTemplate.Contains("is even") ? 1.5 :
                1.0 / (1 + _taskHistory.Values.Sum(h => h.Attempts - h.Successes) * 3.0)).ToList();

            // Select only among templates not used recently, mapping the chosen
            // filtered index back into the full template list.
            var availableIndices = Enumerable.Range(0, templates.Count)
                .Where(i => !_recentTasks.Contains(templates[i].QuestionTemplate.Split('{')[0]))
                .ToList();
            if (!availableIndices.Any())
            {
                _recentTasks.Clear();
                availableIndices = Enumerable.Range(0, templates.Count).ToList();
            }

            int templateIndex = _taskCounter == 10
                ? 2
                : availableIndices[SelectWeightedIndex(
                    availableIndices.Select(i => weights[i]).ToList(),
                    availableIndices.Select(i => templates[i]).ToList())];
            var template = templates[templateIndex];

            int param1 = _taskCounter == 10 ? 5 : _random.Next(1, template.QuestionTemplate.Contains("factorial") ? 7 : 10);
            int param2 = _random.Next(1, 100);
            string question = string.Format(template.QuestionTemplate, param1, param2);

            _recentTasks.Add(template.QuestionTemplate.Split('{')[0]);
            if (_recentTasks.Count > 3) _recentTasks.RemoveAt(0);

            var (expectedOutput, solution) = template.Item2(param1, param2);
            return (question, expectedOutput, solution);
        }



        /// <summary>
        /// Generates a C# solution for a given question.
        /// </summary>
        /// <param name="question">The task question to solve.</param>
        /// <returns>The generated C# solution code as a string.</returns>
        private string GenerateSolutionForQuestion(string question)
        {

            var templates = new List<(string Keyword, Func<string, (string Param1, string Param2, string Solution)>)>
            {
                ("factorial", q => {
                    string[] parts = q.Split(' ');
                    string n = parts.FirstOrDefault(p => int.TryParse(p, out _)) ?? _random.Next(1, 7).ToString();
                    return (n, "0", $"int result = 1; for (int i = 1; i <= {n}; i++) result *= i;");
                }),
                (" + ", q => {
                    string[] parts = q.Split('+').Select(p => p.Trim()).ToArray();
                    string a = parts.Length > 0 && int.TryParse(parts[0].Split(' ').LastOrDefault(p => int.TryParse(p, out _)), out int n) ? n.ToString() : _random.Next(1, 10).ToString();
                    string b = parts.Length > 1 && int.TryParse(parts[1].Split(' ').FirstOrDefault(p => int.TryParse(p, out _)), out int m) ? m.ToString() : _random.Next(1, 100).ToString();
                    return (a, b, $"int result = {a} + {b};");
                }),
                ("square of", q => {
                    string[] parts = q.Split(' ');
                    string n = parts.LastOrDefault(p => int.TryParse(p, out _)) ?? _random.Next(1, 10).ToString();
                    if (_taskCounter == 10) {
                        return (n, "0", $"int result = {n} + {n};"); // Wrong: n + n instead of n * n
                    }
                    return (n, "0", $"int result = {n} * {n};");
                }),
                ("concatenate", q => {
                    string[] parts = q.Split('\'');
                    string s1 = parts.Length > 1 ? parts[1] : _random.Next(1, 100).ToString();
                    string s2 = parts.Length > 3 ? parts[3] : _random.Next(1, 100).ToString();
                    return (s1, s2, $"string result = \"{s1}\" + \"{s2}\";");
                }),
                ("is even", q => {
                    string[] parts = q.Split(' ');
                    string n = parts.FirstOrDefault(p => int.TryParse(p, out _)) ?? _random.Next(1, 10).ToString();
                    return (n, "0", $"bool result = {n} % 2 == 0;");
                })
            };

            foreach (var template in templates)
            {
                if (question.Contains(template.Keyword))
                {
                    var (param1, param2, solution) = template.Item2(question);
                    Console.WriteLine($"Parsed: Question='{question}', Param1='{param1}', Param2='{param2}', Solution='{solution}'");
                    return solution;
                }
            }
            return "int result = 0;";
        }



        /// <summary>
        /// Selects a template index based on weighted probabilities.
        /// </summary>
        /// <param name="weights">The weights for each template.</param>
        /// <param name="templates">The list of task templates.</param>
        /// <returns>The selected template index.</returns>
        private int SelectWeightedIndex(IList<double> weights, IList<(string, Func<int, int, (string, string)>)> templates)
        {

            // Force a square task after iteration 10 if task_10 failed
            if (_taskCounter > 10 && _taskHistory.ContainsKey("task_10") && _taskHistory["task_10"].Successes == 0)
            {
                for (int i = 0; i < templates.Count; i++)
                {
                    if (templates[i].Item1.Contains("square")) return i;
                }
            }

            double total = weights.Sum();
            double r = _random.NextDouble() * total;
            double sum = 0;
            for (int i = 0; i < weights.Count; i++)
            {
                sum += weights[i];
                if (r <= sum) return i;
            }
            return weights.Count - 1;
        }
    }



    /// <summary>
    /// A stub implementation of <see cref="IReinforcementLearner"/> that tracks task history and rewards.
    /// </summary>
    public class DynamicStubReinforcementLearner : IReinforcementLearner
    {



        #region Fields:



        private readonly Dictionary<string, (int Attempts, int Successes)> _taskHistory;



        private double _cumulativeReward;



        private int _successfulTasks;



        #endregion



        /// <summary>
        /// Initializes a new instance of the <see cref="DynamicStubReinforcementLearner"/> class.
        /// </summary>
        public DynamicStubReinforcementLearner()
        {

            _taskHistory = new Dictionary<string, (int, int)>();
            _cumulativeReward = 0.0;
            _successfulTasks = 0;
        }



        /// <summary>
        /// Updates the agent's state with rewards and task information.
        /// </summary>
        /// <param name="learnabilityReward">The learnability reward for the task.</param>
        /// <param name="accuracyReward">The accuracy reward for the task.</param>
        /// <param name="taskId">The unique identifier of the task.</param>
        public void Update(double learnabilityReward, double accuracyReward, string taskId)
        {

            _cumulativeReward += (learnabilityReward + accuracyReward);
            if (accuracyReward > 0)
            {
                _successfulTasks++;
            }

            if (taskId != null)
            {
                if (!_taskHistory.ContainsKey(taskId))
                {
                    _taskHistory[taskId] = (0, 0);
                }
                var (attempts, successes) = _taskHistory[taskId];
                _taskHistory[taskId] = (attempts + 1, successes + (accuracyReward > 0 ? 1 : 0));
            }

            Console.WriteLine($"RL Update: Learnability = {learnabilityReward}, Accuracy = {accuracyReward}, " +
                              $"Cumulative Reward = {_cumulativeReward}, Successful Tasks = {_successfulTasks}");
            Console.WriteLine($"Task History: {string.Join(", ", _taskHistory.Select(kv => $"{kv.Key}: {kv.Value.Successes}/{kv.Value.Attempts}"))}");
        }



        /// <summary>
        /// Retrieves the task history, including attempts and successes.
        /// </summary>
        /// <returns>A dictionary mapping task IDs to their attempt and success counts.</returns>
        public Dictionary<string, (int Attempts, int Successes)> GetTaskHistory()
        {

            return _taskHistory;
        }
    }



    /// <summary>
    /// The entry point for the AbsoluteZeroReasoner application.
    /// </summary>
    public class Program
    {

        /// <summary>
        /// The main method that initializes and runs the reasoner.
        /// </summary>
        public static void Main()
        {

            var rlAgent = new DynamicStubReinforcementLearner();
            var languageModel = new DynamicStubLanguageModel(rlAgent);
            languageModel.GenerateDataset(100, "tasks.json");

            var reasoner = new AbsoluteZeroReasoner(languageModel, rlAgent, maxIterations: 15);
            reasoner.Train("tasks.json");

            Console.ReadLine();
        }
    }
}