RecursiveIntelligence.io

Practical AI Methodology Meets Cognitive Science | Looking for Ricursive (the AI chip design company)? You want ricursive.com

AI/ML Reading List

Curated links with summaries.

  • Anthropic’s Claude Mythos found 10,000 critical vulnerabilities in one month. The patches can’t keep up.

    Anthropic's Project Glasswing cybersecurity program discovered 10,000+ vulnerability candidates in critical infrastructure software in its first month, with 1,094 confirmed as high/critical severity. This demonstrates AI-assisted vulnerability discovery at scale, relevant to practitioners evaluating LLM capabilities in security applications, though the story would benefit from a direct Anthropic statement or technical documentation.

    ai-ml research
  • Please help with tensor dock [d]
    ai-ml community
  • Sponsio: Deterministic Contract Layer for LLM Agents [P]

    Sponsio is an open-source contract layer for LLM agents that enforces deterministic tool-call boundaries (sequencing, retry limits, approval gates) at runtime via YAML rules, closing a critical gap between demo safety and production reliability. This matters because LLM agent deployment has hit a wall on tool-call safety, where prompt engineering and post-hoc auditing both fail at scale, and a lightweight, composable enforcement layer that works with existing frameworks (LangGraph) unlocks broader adoption of agentic patterns in high-stakes workflows.

    ai-ml community
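
    To make the idea of a deterministic tool-call contract concrete, here is a minimal Python sketch. It is not Sponsio's API or rule format: the `ToolContract` class, its field names, and the tool names are all hypothetical, and the rules are written as plain Python values rather than YAML. It only illustrates the three kinds of boundary the summary mentions (sequencing, call limits, approval gates) being checked before a call runs.

```python
from dataclasses import dataclass, field


@dataclass
class ToolContract:
    """Hypothetical runtime contract for an agent's tool calls (not Sponsio's API)."""
    must_follow: dict = field(default_factory=dict)   # tool -> tool that must have run earlier
    max_calls: dict = field(default_factory=dict)     # tool -> maximum number of allowed calls
    needs_approval: set = field(default_factory=set)  # tools gated behind explicit approval
    history: list = field(default_factory=list)       # tools allowed so far, in order

    def check(self, tool: str, approved: bool = False) -> tuple:
        """Return (allowed, reason); record the call only when it is allowed."""
        prerequisite = self.must_follow.get(tool)
        if prerequisite is not None and prerequisite not in self.history:
            return False, f"{tool} requires {prerequisite} first"
        limit = self.max_calls.get(tool)
        if limit is not None and self.history.count(tool) >= limit:
            return False, f"{tool} exceeded limit of {limit} calls"
        if tool in self.needs_approval and not approved:
            return False, f"{tool} needs approval"
        self.history.append(tool)
        return True, "ok"


# Made-up rules for an illustrative refund workflow.
contract = ToolContract(
    must_follow={"refund": "lookup_order"},
    max_calls={"lookup_order": 2},
    needs_approval={"refund"},
)
results = [
    contract.check("refund"),                 # blocked: sequencing rule
    contract.check("lookup_order"),           # allowed
    contract.check("refund"),                 # blocked: approval gate
    contract.check("refund", approved=True),  # allowed
    contract.check("lookup_order"),           # allowed (second call)
    contract.check("lookup_order"),           # blocked: call limit reached
]
```

    Because every decision depends only on the rules and the recorded history, the same sequence of calls always yields the same verdicts, which is the property a prompt-only safeguard cannot guarantee.
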
  • When Is Next-Token Prediction Useful? Marginalization, Ergodicity, Mixture Identifiability, Local Sufficiency, RAG, Tools, and Programming

    A new arXiv paper formalizes the gap between next-token prediction and actual language generation by distinguishing the full conditional process (conditioned on latent context), the marginal text-only process, and the model-learned distribution. The work shows that marginal text-only prediction requires strong stationarity and ergodicity assumptions that fail on heterogeneous corpora, and that usefulness depends on whether observed text is sufficient to predict the next token given omitted circumstances—reframing RAG and tool use as mechanisms for achieving conditional sufficiency.

    ai-ml research
  • Fast-dDrive: Efficient Block-Diffusion VLM for Autonomous Driving

    Fast-dDrive introduces a block-diffusion VLA that balances trajectory planning fidelity with inference efficiency for autonomous driving, using structured scaffolding and speculative decoding to achieve 12× speedup over autoregressive baselines while improving accuracy on standard benchmarks. This matters because it directly addresses the speed-accuracy tradeoff blocking deployment of high-capacity VLMs on edge hardware for real-time driving applications—a key bottleneck in practical autonomous systems.

    ai-ml research
  • Sparse Autoencoders Map Brain-LLM Alignment onto Cortical Semantic Topography

    Researchers used sparse autoencoders to decompose GPT-2 XL and Llama-3.1-8B into interpretable semantic features that align with known cortical semantic organization, recovering 94% of peak brain encoding performance and predicting both cortical topography and human reading times. This work directly addresses the long-standing question of why intermediate LLM layers best predict human brain responses, providing mechanistic evidence for brain-LLM alignment and demonstrating that interpretability techniques can recover neuroscience-validated organizational principles.

    ai-ml research long-signal:rdd
  • Can AI Guess What You Know? Performance Comparison of Large Language Models for Human Domain Knowledge Estimation From Communication Logs

    Researchers benchmarked seven LLMs (Gemini, Claude, GPT variants) on inferring employee domain knowledge from Slack logs, finding Gemini 2.5 Flash most accurate (MAE 21.13%) and revealing that message volume weakly correlates with inference quality. The work validates LLM capability for organizational knowledge mapping while flagging privacy and representation challenges—relevant to practitioners building enterprise AI systems and researchers studying LLM reasoning over real-world conversational data.

    ai-ml research
  • Evaluating Memory Structure in LLM Agents

    Researchers propose StructMemEval, a benchmark that moves beyond simple fact-recall to test whether LLM agents can organize long-term memory into meaningful structures (ledgers, hierarchies, etc.). Findings show modern LLMs struggle with this task without explicit prompting, identifying a concrete research direction for improving memory-augmented agents and training procedures.

    ai-ml research
  • SciHorizon-GENE: Benchmarking LLM for Life Sciences Inference from Gene Knowledge to Functional Understanding

    SciHorizon-GENE is a large-scale gene-centric benchmark with 540K questions across 190K human genes, systematically evaluating LLMs on gene-to-function reasoning with explicit focus on hallucination, completeness, and literature grounding. This work provides practitioners and researchers concrete evaluation criteria for LLM adoption in biomedical interpretation pipelines and establishes an important failure-mode analysis framework for domain-specific LLM deployment.

    ai-ml research
  • ThoughtTrace: Understanding User Thoughts in Real-World LLM Interactions

    ThoughtTrace is a new 10K-annotation dataset pairing real LLM conversations with users' self-reported thoughts across 1,058 users and 20 models, showing thoughts are semantically distinct from messages and difficult for frontier LLMs to infer. The work establishes user cognition as a trainable signal for personalization and alignment, with demonstrated utility for behavior prediction and thought-guided fine-tuning—opening a new modality for studying latent user goals in human-AI interaction.

    ai-ml research
  • SemEval-2026 Task 6: CLARITY -- Unmasking Political Question Evasions

    SemEval-2026 Task 6 (CLARITY) introduces a benchmark for detecting political question evasion through two-level classification: clarity (Clear/Ambivalent/Non-Reply) and fine-grained evasion strategies, with 124 teams evaluated on U.S. presidential interviews. Relevant to NLP practitioners building discourse analysis systems and researchers studying strategic language, though narrower in scope than foundational model or architecture advances.

    ai-ml research
  • TurkicNLP: An NLP Toolkit for Turkic Languages

    TurkicNLP is an open-source Python library providing unified NLP tooling for Turkic languages across four script families and eight core NLP tasks, filling a significant resource gap for 200+ million speakers. The modular architecture and CoNLL-U standard compliance make it valuable for practitioners building downstream applications and researchers studying morphologically rich, under-resourced language families.

    ai-ml research
  • Improving Sampling for Masked Diffusion Models via Information Gain

    Researchers propose Info-Gain Sampler, a training-free decoding method that improves masked diffusion model sampling by balancing immediate uncertainty against information gain across remaining positions, outperforming greedy baselines by 2.9–11.6% on reasoning tasks and reaching a 62.8% win rate on creative writing. Relevant to practitioners building or optimizing diffusion-based systems for reasoning, coding, and generation tasks.

    ai-ml research
  • Decomposing Queries into Tool Calls for Long-Video Keyframe Retrieval

    ToolMerge proposes LLM-based query decomposition with multi-tool rank merging for keyframe retrieval in long-video QA, and introduces the M2M benchmark, in which questions are temporally anchored. Relevant to practitioners building video understanding systems and researchers optimizing retrieval for multimodal LLM tasks; the code is open source.

    ai-ml research velocity:hn-medium
  • RADAR: Relative Angular Divergence Across Representations

    RADAR is a new transferability metric that uses layer-wise representation geometry to predict when cross-domain data transfer will help or hurt downstream performance. The work addresses negative transfer in foundation models across vision and NLP, offering practitioners a tool to screen dataset combinations before expensive retraining—relevant to anyone building multi-modal or cross-domain systems at scale.

    ai-ml research
  • ChartFI: Benchmarking Faithfulness and Insightfulness of Chart Descriptions from Multimodal Large Language Models

    ChartFI-Bench is a new evaluation framework for assessing how faithfully and insightfully multimodal LLMs describe charts, addressing gaps in existing benchmarks through 896 high-quality complex chart-description pairs and four aligned metrics (Faithfulness, Coverage, Informativeness, Acuity). This matters because chart understanding is a real multimodal capability gap, and systematic benchmarking of description quality directly informs which MLLMs are suitable for accessibility and data communication applications.

    ai-ml research
  • When Symptoms Are Not Enough: Evidence-Weighting Patterns in Large Language Model Psychiatric Screening

    Researchers evaluated five state-of-the-art LLMs on psychiatric screening across anxiety, depression, and PTSD using clinician-validated interview transcripts, finding accuracy ranging from 0.49 to 0.86 and evidence-weighting biases that systematically downweight symptom evidence when protective factors are present. The findings are critical for practitioners and clinicians considering LLM deployment in mental health triage, revealing systematic failure modes that require mitigation before clinical use.

    ai-ml research
  • Learnability-Informed Fine-Tuning of Diffusion Language Models

    Researchers propose LIFT, a learnability-informed fine-tuning algorithm that improves reasoning in diffusion language models by aligning token difficulty with masking context across diffusion steps, achieving 3x gains on AIME benchmarks. This addresses a foundational question about why standard SFT underperforms on non-autoregressive architectures and provides practitioners with a concrete, open-sourced technique to improve DLM post-training.

    ai-ml research
  • How Mobile World Model Guides GUI Agents?

    Researchers trained mobile world models across four modalities (delta text, full text, diffusion images, renderable code) and evaluated which representations best guide GUI agents on long-horizon tasks. Key finding: renderable code excels in-distribution while text handles OOD robustly; generated trajectories improve training but world models work better as priors than post-hoc verifiers—directly informing how to architect embodied AI systems.

    ai-ml research velocity:hn-high
  • Entropy-Aware On-Policy Distillation of Language Models

    Researchers propose Entropy-Aware On-Policy Distillation, which adaptively switches between reverse KL and forward KL divergence based on teacher entropy to improve knowledge transfer in language model distillation. The method sustains generation diversity while improving alignment, showing +1.37 to +5.05 Pass@8 gains across math reasoning benchmarks—directly relevant to practitioners building efficient smaller models and researchers optimizing distillation pipelines.

    ai-ml research
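
    The switching rule described above can be sketched per token position in plain Python. This is an illustration under stated assumptions, not the paper's implementation: the summary does not say which divergence applies in which entropy regime, so the sketch assumes forward KL when the teacher is uncertain and reverse KL when it is confident. The `threshold` value, the function names, and the toy distributions are all made up.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(x * math.log(x / max(y, eps)) for x, y in zip(p, q) if x > 0)


def token_loss(teacher, student, threshold=0.5):
    """Choose the divergence for one token position from the teacher's entropy.

    Assumption (not stated in the summary): forward KL, KL(teacher || student),
    when teacher entropy exceeds the threshold; reverse KL, KL(student || teacher),
    otherwise. The threshold is a hypothetical hyperparameter.
    """
    if entropy(teacher) > threshold:
        return "forward", kl(teacher, student)
    return "reverse", kl(student, teacher)


# Toy next-token distributions over a four-word vocabulary.
confident_teacher = [0.97, 0.01, 0.01, 0.01]
uncertain_teacher = [0.40, 0.30, 0.20, 0.10]
student = [0.70, 0.10, 0.10, 0.10]

mode_low, loss_low = token_loss(confident_teacher, student)
mode_high, loss_high = token_loss(uncertain_teacher, student)
```

    Forward KL penalizes the student for missing any mass the teacher assigns, which tends to preserve diversity; reverse KL concentrates the student on the teacher's dominant modes. That trade-off is the usual motivation for mixing the two.
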
  • SciNet: Evaluating AI Agents in Relation-Aware Scientific Literature Retrieval

    SciNet is a new dataset of 8,940 tasks evaluating AI agents on relation-aware scientific paper retrieval across 269M papers, revealing that current agents score <20% on relational tasks but achieve a 25.3% improvement in literature review quality when trained on relation-aware data. The work addresses a real gap in information retrieval systems (understanding citation networks, conflicts, and technological lineages) and releases the benchmark openly, making it immediately useful for practitioners building research tools.

    ai-ml research
  • LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models

    LambdaPO introduces a pairwise preference-based advantage estimation framework that replaces GRPO's scalar baseline with fine-grained reward differentials, augmented by semantic density rewards for reasoning tasks. This addresses a fundamental information-theoretic limitation in modern RL alignment for LLMs and demonstrates improvements on math reasoning and QA benchmarks.

    ai-ml research velocity:hn-high
  • Efficient and Transferable Agentic Knowledge Graph RAG via Reinforcement Learning

    KG-R1 introduces a single-agent RL framework for knowledge-graph RAG that replaces multi-module LLM pipelines, achieving better accuracy with fewer tokens on smaller models while maintaining performance across unseen knowledge graphs without retraining. This addresses a practical pain point in production RAG systems—inference cost and schema rigidity—with demonstrated transferability and open-source implementation, making it directly applicable to practitioners building grounded reasoning systems.

    ai-ml research velocity:hn-medium
  • InfiGFusion: Graph-on-Logits Distillation via Efficient Gromov-Wasserstein for Model Fusion

    InfiGFusion proposes a structure-aware LLM fusion framework using Graph-on-Logits Distillation to model semantic dependencies across vocabulary dimensions, with a scalable Gromov-Wasserstein approximation. The work addresses a practitioner-relevant challenge in multi-model systems with strong benchmark results (+35.6 on reasoning tasks), making it relevant to teams exploring model merging and ensemble approaches.

    ai-ml research velocity:hn-medium
  • CoSPlay: Cooperative Self-Play at Test-Time with Self-Generated Code and Unit Test

    CoSPlay introduces a training-free, test-time framework that jointly improves code generation and unit test quality through cooperative self-play, eliminating the ground-truth test bottleneck that limits current RLVR/TTS methods. It demonstrates substantial improvements (22.1% → 33.2% BoN on Qwen2.5-7B) and generalizes across backbones, offering a scalable inference strategy for production code generation systems.

    ai-ml research
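
    As a rough illustration of selecting code with self-generated tests instead of ground-truth ones, the Python sketch below scores candidates against tests and tests against candidates. It is not the CoSPlay algorithm, whose details the summary does not give: weighting each test by the share of candidates that pass it is an assumption made here, and the candidate functions and tests are hand-written stand-ins for model output.

```python
def pass_matrix(candidates, tests):
    """matrix[i][j] is True when candidate i passes test j (exceptions count as failures)."""
    matrix = []
    for fn in candidates:
        row = []
        for args, expected in tests:
            try:
                row.append(fn(*args) == expected)
            except Exception:
                row.append(False)
        matrix.append(row)
    return matrix


def select_best(candidates, tests):
    """Weight each test by the share of candidates passing it, then score candidates."""
    matrix = pass_matrix(candidates, tests)
    n = len(candidates)
    # Assumption: a test that few candidates pass is more likely to be wrong itself.
    weights = [sum(row[j] for row in matrix) / n for j in range(len(tests))]
    scores = [sum(w for w, ok in zip(weights, row) if ok) for row in matrix]
    best = max(range(n), key=scores.__getitem__)
    return best, scores


# Stand-ins for generated solutions to "absolute value" and for generated tests.
candidates = [
    lambda x: x if x > 0 else -x,  # correct
    lambda x: x,                   # wrong for negative inputs
    lambda x: -x,                  # wrong for positive inputs
]
tests = [((3,), 3), ((-2,), 2), ((0,), 0), ((-5,), -5)]  # the last test is itself wrong

best_index, scores = select_best(candidates, tests)
```

    In this toy run the faulty test is passed by only one candidate, so it carries little weight and the correct candidate still ranks first.
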
  • Autonomous Frontier-Based Exploration with VLM Guidance

    Researchers propose a training-free pipeline that uses Vision-Language Models to perform high-level strategic decision-making for autonomous robot exploration, replacing geometric heuristics with contextual spatial reasoning and achieving up to 24% improvement in map coverage. This work is relevant to practitioners building embodied AI systems and demonstrates a practical application of VLM reasoning beyond language tasks, with deployment simplicity (standard sensors + internet connection) that lowers barriers to adoption.

    ai-ml research
  • Is a Document Educational or Just Wikipedia-Style? -- Pitfalls of Classifier-Based Quality Filtering

    Researchers demonstrate that simple Wikipedia-style reformatting can fool classifier-based quality filters used in LLM pre-training, causing ~7% of low-quality documents to bypass the FineWeb-Edu CQF model. This exposes a critical robustness gap in a technique now foundational to corpus construction across major LLM development efforts, with direct implications for practitioners designing training pipelines.

    ai-ml research
  • Naturalistic measure of social norms alignment

    Researchers propose a framework and 3k-dilemma Danish dataset for measuring social norm alignment in LLMs through naturalistic free-form responses, comparing LLM-to-human, LLM-to-LLM, and human-to-human agreement. This work advances the field's ability to evaluate value and behavioral alignment beyond synthetic benchmarks, with implications for culturally aware deployment and understanding how models reason about socially contingent problems.

    ai-ml research long-signal:rdd
  • Convergence Without Understanding: When Language Models Agree on Representations but Disagree on Reasoning

    A study of 16 LLMs across 8 families reveals that representational convergence on reasoning tasks masks fundamental divergence in reasoning strategies: models align on shared failures but diverge on solutions, and shared representations lack causal influence on predictions. The findings bear directly on ensemble design, interpretability transfer assumptions, and how practitioners should evaluate model similarity claims.

    ai-ml research
  • Memorization Dynamics of Fill-in-the-Middle Pretraining

    Researchers conducted controlled experiments pretraining Llama 3.2 models with fill-in-the-middle (FIM) vs. standard left-to-right objectives to characterize how FIM affects verbatim memorization of training data. The findings reveal FIM recovers shorter partial spans than LTR, memorization grows linearly with repetitions, and suffix context alone is insufficient—insights with direct implications for practitioners tuning pretraining objectives and evaluating model memorization risks.

    ai-ml research
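
    For readers unfamiliar with the fill-in-the-middle objective, the Python sketch below shows the usual data transformation: a document is cut into prefix, middle, and suffix, then reordered so the middle comes last and can be predicted from both sides. The sentinel strings `<PRE>`, `<SUF>`, and `<MID>` are placeholders rather than the tokens used in the study, and the character-level random cut points are an assumption for illustration.

```python
import random


def to_fim(document: str, rng: random.Random,
           pre: str = "<PRE>", suf: str = "<SUF>", mid: str = "<MID>") -> str:
    """Rearrange a non-empty document into prefix-suffix-middle order.

    The sentinel strings are placeholders; real tokenizers define their own.
    """
    # Choose two distinct cut points, then split into prefix / middle / suffix.
    i, j = sorted(rng.sample(range(len(document) + 1), 2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    # The model reads prefix and suffix first and is trained to generate the middle.
    return f"{pre}{prefix}{suf}{suffix}{mid}{middle}"


def from_fim(example: str, pre: str = "<PRE>", suf: str = "<SUF>", mid: str = "<MID>") -> str:
    """Invert to_fim, recovering the original left-to-right document."""
    body = example[len(pre):]
    prefix, rest = body.split(suf, 1)
    suffix, middle = rest.split(mid, 1)
    return prefix + middle + suffix


doc = "def add(a, b):\n    return a + b\n"
fim_example = to_fim(doc, random.Random(0))
```

    A left-to-right objective would train on `doc` unchanged, so the only difference between the two setups in this sketch is the ordering of the same characters.
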
  • PROGRESSLM: Towards Progress Reasoning in Vision-Language Models

    Researchers introduce Progress-Bench and ProgressLM-45K to evaluate whether vision-language models can reason about task progress over long horizons rather than just describe static visual content. Findings reveal most VLMs fail at progress estimation and are sensitive to viewpoint/modality changes, but training-based ProgressLM-3B shows generalizable improvements—signaling a capability frontier where VLMs need architectural or training innovations to move beyond recognition to temporal reasoning.

    ai-ml research
  • MAS-Orchestra: Understanding and Improving Multi-Agent Reasoning Through Holistic Orchestration and Controlled Benchmarks

    MAS-Orchestra proposes a training-time framework that treats multi-agent orchestration as a reinforcement learning problem, enabling holistic system-level reasoning while abstracting agents as callable functions; MASBENCH provides controlled benchmarking across five task dimensions to rigorously characterize when multi-agent systems outperform single-agent approaches. This matters because it challenges the assumption that multi-agent coordination is universally beneficial, provides principled methods for designing and evaluating MAS, and demonstrates 10x efficiency gains—directly advancing how practitioners build and reason about multi-agent intelligence.

    ai-ml research
  • More Context, Larger Models, or Moral Knowledge? A Systematic Study of Schwartz Value Detection in Political Texts

    Researchers systematically compared context window sizes, RAG with moral knowledge, and model scaling for detecting Schwartz values in political text, finding that larger models and longer context don't uniformly improve performance—retrieved knowledge is more consistent than scale. The work challenges assumptions about bigger-is-better in value-sensitive NLP and has implications for practitioners building interpretable political text systems.

    ai-ml research velocity:hn-medium
  • Fine-grained Claim-level RAG Benchmark for Law

    Researchers introduce ClaimRAG-LAW, a bilingual (French/English) benchmark dataset for evaluating retrieval-augmented generation systems in legal QA, with fine-grained claim-level analysis that separates retrieval and generation performance. This addresses a critical gap in legal AI evaluation—existing benchmarks lack granularity for high-stakes domains where hallucination rates vary significantly, and the dataset covers both expert and non-expert query patterns, making it immediately useful for practitioners and researchers building production legal RAG systems.

    ai-ml research
  • CoFrGeNet: Continued Fraction Architectures for Language Generation

    Researchers propose CoFrGeNets, a continued-fraction-inspired architecture that replaces attention and feed-forward layers in transformers while reducing parameters by 33-50% with competitive or superior downstream performance. This is relevant to practitioners seeking efficient alternatives to standard transformers and signals a credible path to parameter-efficient generative models at scale.

    ai-ml research