Practical AI Methodology Meets Cognitive Science

AI/ML Reading List

Curated links with summaries.

If you use NVIDIA Isaac Sim for reinforcement learning, do you use Isaac Lab with it? Just want to get a sense of what the status quo is. [D]
May 25, 2026 | ai-ml | community
SEO teams are tracking keywords. But are they tracking what ChatGPT says about their brand?
May 25, 2026 | ai-ml | research
Berlin’s Peec AI more than doubled revenue to $10M ARR in six months. Its product helps brands show up in ChatGPT.
May 25, 2026 | ai-ml | research
Anthropic’s Claude Mythos found 10,000 critical vulnerabilities in one month. The patches can’t keep up.
Anthropic's Project Glasswing cybersecurity program discovered 10,000+ vulnerability candidates in critical infrastructure software in its first month, with 1,094 confirmed as high or critical severity. This demonstrates AI-assisted vulnerability discovery at scale, which is relevant to practitioners evaluating LLM capabilities in security applications, though the story would benefit from a direct Anthropic statement or technical documentation.
May 25, 2026 | ai-ml | research
The US blacklisted Anthropic as a security threat. Its spy agencies are using Claude anyway.
May 25, 2026 | ai-ml | research
SoftBank hits a fresh record as Tokyo bets the OpenAI IPO is finally coming
May 25, 2026 | ai-ml | research
From the Vatican stage, Anthropic’s Chris Olah says AI cannot be steered by AI labs alone
May 25, 2026 | ai-ml | research
PapersWithCode new features - week 1 [P]
May 25, 2026 | ai-ml | community
Working on a cgo-free CUDA binding in Go for ML stuff Week 3 - open source [P]
May 25, 2026 | ai-ml | community | boost:open-source
Thermocompute constant time inference [P]
May 25, 2026 | ai-ml | community
Testing a Cold War-Era AI on Satellite Image Datasets
May 25, 2026 | ai-ml | community
How do ML practitioners select hyperparameters, architectures, etc for self-supervised representation learning when the loss is non-monotonic? [D]
May 25, 2026 | ai-ml | community
MergeNB: An intuitive merge conflict resolver built for Jupyter notebooks in VS Code [P]
May 25, 2026 | ai-ml | community
"AI solved one of math's greatest challenges, but it cannot add two numbers reliably?!" [D]
May 25, 2026 | ai-ml | community
Please help with tensor dock [d]
May 25, 2026 | ai-ml | community
Sponsio: Deterministic Contract Layer for LLM Agents [P]
Sponsio is an open-source contract layer for LLM agents that enforces deterministic tool-call boundaries (sequencing, retry limits, approval gates) at runtime via YAML rules, closing a critical gap between demo safety and production reliability. This matters because LLM agent deployment has hit a wall on tool-call safety: prompt engineering and post-hoc auditing both fail at scale. A lightweight, composable enforcement layer that works with existing frameworks (LangGraph) unlocks broader adoption of agentic patterns in high-stakes workflows.
May 25, 2026 | ai-ml | community
Call for Papers - Workshop on Unlearning and Model Editing U&ME at ECCV 2026 [R]
May 25, 2026 | ai-ml | community
When Is Next-Token Prediction Useful? Marginalization, Ergodicity, Mixture Identifiability, Local Sufficiency, RAG, Tools, and Programming
A new arXiv paper formalizes the gap between next-token prediction and actual language generation by distinguishing the full conditional process (conditioned on latent context), the marginal text-only process, and the model-learned distribution. The work shows that marginal text-only prediction requires strong stationarity and ergodicity assumptions that fail on heterogeneous corpora, and that usefulness depends on whether observed text is sufficient to predict the next token given omitted circumstances—reframing RAG and tool use as mechanisms for achieving conditional sufficiency.
May 25, 2026 | ai-ml | research
Fast-dDrive: Efficient Block-Diffusion VLM for Autonomous Driving
Fast-dDrive introduces a block-diffusion VLA that balances trajectory planning fidelity with inference efficiency for autonomous driving, using structured scaffolding and speculative decoding to achieve 12× speedup over autoregressive baselines while improving accuracy on standard benchmarks. This matters because it directly addresses the speed-accuracy tradeoff blocking deployment of high-capacity VLMs on edge hardware for real-time driving applications—a key bottleneck in practical autonomous systems.
May 25, 2026 | ai-ml | research
Sparse Autoencoders Map Brain-LLM Alignment onto Cortical Semantic Topography
Researchers used sparse autoencoders to decompose GPT-2 XL and Llama-3.1-8B into interpretable semantic features that align with known cortical semantic organization, recovering 94% of peak brain encoding performance and predicting both cortical topography and human reading times. This work directly addresses the long-standing question of why intermediate LLM layers best predict human brain responses, providing mechanistic evidence for brain-LLM alignment and demonstrating that interpretability techniques can recover neuroscience-validated organizational principles.
May 25, 2026 | ai-ml | research | long-signal:rdd
Can AI Guess What You Know? Performance Comparison of Large Language Models for Human Domain Knowledge Estimation From Communication Logs
Researchers benchmarked seven LLMs (Gemini, Claude, GPT variants) on inferring employee domain knowledge from Slack logs, finding Gemini 2.5 Flash most accurate (MAE 21.13%) and revealing that message volume weakly correlates with inference quality. The work validates LLM capability for organizational knowledge mapping while flagging privacy and representation challenges—relevant to practitioners building enterprise AI systems and researchers studying LLM reasoning over real-world conversational data.
May 25, 2026 | ai-ml | research
Evaluating Memory Structure in LLM Agents
Researchers propose StructMemEval, a benchmark that moves beyond simple fact-recall to test whether LLM agents can organize long-term memory into meaningful structures (ledgers, hierarchies, etc.). Findings show modern LLMs struggle with this task without explicit prompting, identifying a concrete research direction for improving memory-augmented agents and training procedures.
May 25, 2026 | ai-ml | research
SciHorizon-GENE: Benchmarking LLM for Life Sciences Inference from Gene Knowledge to Functional Understanding
SciHorizon-GENE is a large-scale gene-centric benchmark with 540K questions across 190K human genes, systematically evaluating LLMs on gene-to-function reasoning with explicit focus on hallucination, completeness, and literature grounding. This work provides practitioners and researchers concrete evaluation criteria for LLM adoption in biomedical interpretation pipelines and establishes an important failure-mode analysis framework for domain-specific LLM deployment.
May 25, 2026 | ai-ml | research
ThoughtTrace: Understanding User Thoughts in Real-World LLM Interactions
ThoughtTrace is a new 10K-annotation dataset pairing real LLM conversations with users' self-reported thoughts across 1,058 users and 20 models, showing thoughts are semantically distinct from messages and difficult for frontier LLMs to infer. The work establishes user cognition as a trainable signal for personalization and alignment, with demonstrated utility for behavior prediction and thought-guided fine-tuning—opening a new modality for studying latent user goals in human-AI interaction.
May 25, 2026 | ai-ml | research
SemEval-2026 Task 6: CLARITY -- Unmasking Political Question Evasions
SemEval-2026 Task 6 (CLARITY) introduces a benchmark for detecting political question evasion through two-level classification: clarity (Clear/Ambivalent/Non-Reply) and fine-grained evasion strategies, with 124 participating teams and data drawn from U.S. presidential interviews. Relevant to NLP practitioners building discourse analysis systems and researchers studying strategic language, though narrower in scope than foundational model or architecture advances.
May 25, 2026 | ai-ml | research
TurkicNLP: An NLP Toolkit for Turkic Languages
TurkicNLP is an open-source Python library providing unified NLP tooling for Turkic languages across four script families and eight core NLP tasks, filling a significant resource gap for 200+ million speakers. The modular architecture and CoNLL-U standard compliance make it valuable for practitioners building downstream applications and researchers studying morphologically rich, under-resourced language families.
May 25, 2026 | ai-ml | research
Improving Sampling for Masked Diffusion Models via Information Gain
Researchers propose Info-Gain Sampler, a training-free decoding method that improves masked diffusion model sampling by balancing immediate uncertainty against information gain across remaining positions, outperforming greedy baselines by 2.9–11.6% on reasoning tasks and reaching a 62.8% win rate on creative writing. Relevant to practitioners building or optimizing diffusion-based systems for reasoning, coding, and generation tasks.
May 25, 2026 | ai-ml | research
Decomposing Queries into Tool Calls for Long-Video Keyframe Retrieval
ToolMerge proposes LLM-based query decomposition with merged multi-tool rankings for keyframe retrieval in long-video QA, and introduces the M2M benchmark, in which questions are temporally anchored. Relevant to practitioners building video understanding systems and researchers optimizing retrieval for multimodal LLM tasks; the code is open source.
May 25, 2026 | ai-ml | research | velocity:hn-medium
RADAR: Relative Angular Divergence Across Representations
RADAR is a new transferability metric that uses layer-wise representation geometry to predict when cross-domain data transfer will help or hurt downstream performance. The work addresses negative transfer in foundation models across vision and NLP, offering practitioners a tool to screen dataset combinations before expensive retraining—relevant to anyone building multi-modal or cross-domain systems at scale.
May 25, 2026 | ai-ml | research
ChartFI: Benchmarking Faithfulness and Insightfulness of Chart Descriptions from Multimodal Large Language Models
ChartFI-Bench is a new evaluation framework for assessing how faithfully and insightfully multimodal LLMs describe charts, addressing gaps in existing benchmarks through 896 high-quality complex chart-description pairs and four aligned metrics (Faithfulness, Coverage, Informativeness, Acuity). This matters because chart understanding is a real multimodal capability gap, and systematic benchmarking of description quality directly informs which MLLMs are suitable for accessibility and data communication applications.
May 25, 2026 | ai-ml | research
When Symptoms Are Not Enough: Evidence-Weighting Patterns in Large Language Model Psychiatric Screening
Researchers evaluated five state-of-the-art LLMs on psychiatric screening across anxiety, depression, and PTSD using clinician-validated interview transcripts, finding accuracy ranging from 0.49 to 0.86 and evidence-weighting biases that systematically downweight symptom evidence when protective factors are present. The findings are critical for practitioners and clinicians considering LLM deployment in mental health triage, revealing systematic failure modes that require mitigation before clinical use.
May 25, 2026 | ai-ml | research
Learnability-Informed Fine-Tuning of Diffusion Language Models
Researchers propose LIFT, a learnability-informed fine-tuning algorithm that improves reasoning in diffusion language models by aligning token difficulty with masking context across diffusion steps, achieving 3x gains on AIME benchmarks. This addresses a foundational question about why standard SFT underperforms on non-autoregressive architectures and provides practitioners with a concrete, open-sourced technique to improve DLM post-training.
May 25, 2026 | ai-ml | research
How Mobile World Model Guides GUI Agents?
Researchers trained mobile world models across four modalities (delta text, full text, diffusion images, renderable code) and evaluated which representations best guide GUI agents on long-horizon tasks. Key finding: renderable code excels in-distribution while text handles OOD robustly; generated trajectories improve training but world models work better as priors than post-hoc verifiers—directly informing how to architect embodied AI systems.
May 25, 2026 | ai-ml | research | velocity:hn-high
Entropy-Aware On-Policy Distillation of Language Models
Researchers propose Entropy-Aware On-Policy Distillation, which adaptively switches between reverse KL and forward KL divergence based on teacher entropy to improve knowledge transfer in language model distillation. The method sustains generation diversity while improving alignment, showing +1.37 to +5.05 Pass@8 gains across math reasoning benchmarks—directly relevant to practitioners building efficient smaller models and researchers optimizing distillation pipelines.
May 25, 2026 | ai-ml | research
SciNet: Evaluating AI Agents in Relation-Aware Scientific Literature Retrieval
SciNet is a new dataset of 8,940 tasks evaluating AI agents on relation-aware scientific paper retrieval across 269M papers, revealing current agents score <20% on relational tasks but achieve 25.3% improvement in literature review quality when trained on relation-aware data. The work addresses a real gap in information retrieval systems—understanding citation networks, conflicts, and technological lineages—and releases the benchmark openly, making it immediately useful for practitioners building research tools.
May 25, 2026 | ai-ml | research
LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models
LambdaPO introduces a pairwise preference-based advantage estimation framework that replaces GRPO's scalar baseline with fine-grained reward differentials, augmented by semantic density rewards for reasoning tasks. This addresses a fundamental information-theoretic limitation in modern RL alignment for LLMs and demonstrates improvements on math reasoning and QA benchmarks.
May 25, 2026 | ai-ml | research | velocity:hn-high
Efficient and Transferable Agentic Knowledge Graph RAG via Reinforcement Learning
KG-R1 introduces a single-agent RL framework for knowledge-graph RAG that replaces multi-module LLM pipelines, achieving better accuracy with fewer tokens on smaller models while maintaining performance across unseen knowledge graphs without retraining. This addresses a practical pain point in production RAG systems—inference cost and schema rigidity—with demonstrated transferability and open-source implementation, making it directly applicable to practitioners building grounded reasoning systems.
May 25, 2026 | ai-ml | research | velocity:hn-medium
InfiGFusion: Graph-on-Logits Distillation via Efficient Gromov-Wasserstein for Model Fusion
InfiGFusion proposes a structure-aware LLM fusion framework using Graph-on-Logits Distillation to model semantic dependencies across vocabulary dimensions, with a scalable Gromov-Wasserstein approximation. The work addresses a practitioner-relevant challenge in multi-model systems with strong benchmark results (+35.6 on reasoning tasks), making it relevant to teams exploring model merging and ensemble approaches.
May 25, 2026 | ai-ml | research | velocity:hn-medium
CoSPlay: Cooperative Self-Play at Test-Time with Self-Generated Code and Unit Test
CoSPlay introduces a training-free, test-time framework that jointly improves code generation and unit test quality through cooperative self-play, eliminating the ground-truth test bottleneck that limits current RLVR/TTS methods. It demonstrates substantial improvements (22.1% → 33.2% BoN on Qwen2.5-7B) and generalizes across backbones, offering a scalable inference strategy for production code generation systems.
May 25, 2026 | ai-ml | research
Autonomous Frontier-Based Exploration with VLM Guidance
Researchers propose a training-free pipeline that uses Vision-Language Models to perform high-level strategic decision-making for autonomous robot exploration, replacing geometric heuristics with contextual spatial reasoning and achieving up to 24% improvement in map coverage. This work is relevant to practitioners building embodied AI systems and demonstrates a practical application of VLM reasoning beyond language tasks, with deployment simplicity (standard sensors + internet connection) that lowers barriers to adoption.
May 25, 2026 | ai-ml | research
Is a Document Educational or Just Wikipedia-Style? -- Pitfalls of Classifier-Based Quality Filtering
Researchers demonstrate that simple Wikipedia-style reformatting can fool classifier-based quality filters used in LLM pre-training, causing ~7% of low-quality documents to bypass the FineWeb-Edu CQF model. This exposes a critical robustness gap in a technique now foundational to corpus construction across major LLM development efforts, with direct implications for practitioners designing training pipelines.
May 25, 2026 | ai-ml | research
Naturalistic measure of social norms alignment
Researchers propose a framework and 3k-dilemma Danish dataset for measuring social norm alignment in LLMs through naturalistic free-form responses, comparing LLM-to-human, LLM-to-LLM, and human-to-human agreement. This work advances the field's ability to evaluate value and behavioral alignment beyond synthetic benchmarks, with implications for culturally-aware deployment and understanding how models reason about socially-contingent problems.
May 25, 2026 | ai-ml | research | long-signal:rdd
Convergence Without Understanding: When Language Models Agree on Representations but Disagree on Reasoning
A study of 16 LLMs across 8 families reveals that representational convergence on reasoning tasks masks fundamental divergence in reasoning strategies: models align on shared failures but diverge on solutions, and shared representations lack causal influence on predictions. The findings bear directly on ensemble design, interpretability transfer assumptions, and how practitioners should evaluate model similarity claims.
May 25, 2026 | ai-ml | research
Memorization Dynamics of Fill-in-the-Middle Pretraining
Researchers conducted controlled experiments pretraining Llama 3.2 models with fill-in-the-middle (FIM) vs. standard left-to-right objectives to characterize how FIM affects verbatim memorization of training data. The findings reveal FIM recovers shorter partial spans than LTR, memorization grows linearly with repetitions, and suffix context alone is insufficient—insights with direct implications for practitioners tuning pretraining objectives and evaluating model memorization risks.
May 25, 2026 | ai-ml | research
PROGRESSLM: Towards Progress Reasoning in Vision-Language Models
Researchers introduce Progress-Bench and ProgressLM-45K to evaluate whether vision-language models can reason about task progress over long horizons rather than just describe static visual content. Findings reveal most VLMs fail at progress estimation and are sensitive to viewpoint/modality changes, but training-based ProgressLM-3B shows generalizable improvements—signaling a capability frontier where VLMs need architectural or training innovations to move beyond recognition to temporal reasoning.
May 25, 2026 | ai-ml | research
MAS-Orchestra: Understanding and Improving Multi-Agent Reasoning Through Holistic Orchestration and Controlled Benchmarks
MAS-Orchestra proposes a training-time framework that treats multi-agent orchestration as a reinforcement learning problem, enabling holistic system-level reasoning while abstracting agents as callable functions; MASBENCH provides controlled benchmarking across five task dimensions to rigorously characterize when multi-agent systems outperform single-agent approaches. This matters because it challenges the assumption that multi-agent coordination is universally beneficial, provides principled methods for designing and evaluating MAS, and demonstrates 10x efficiency gains—directly advancing how practitioners build and reason about multi-agent intelligence.
May 25, 2026 | ai-ml | research
More Context, Larger Models, or Moral Knowledge? A Systematic Study of Schwartz Value Detection in Political Texts
Researchers systematically compared context window sizes, RAG with moral knowledge, and model scaling for detecting Schwartz values in political text, finding that larger models and longer context don't uniformly improve performance—retrieved knowledge is more consistent than scale. The work challenges assumptions about bigger-is-better in value-sensitive NLP and has implications for practitioners building interpretable political text systems.
May 25, 2026 | ai-ml | research | velocity:hn-medium
Fine-grained Claim-level RAG Benchmark for Law
Researchers introduce ClaimRAG-LAW, a bilingual (French/English) benchmark dataset for evaluating retrieval-augmented generation systems in legal QA, with fine-grained claim-level analysis that separates retrieval and generation performance. This addresses a critical gap in legal AI evaluation: existing benchmarks lack the granularity needed for high-stakes domains where hallucination rates vary significantly. The dataset covers both expert and non-expert query patterns, making it immediately useful for practitioners and researchers building production legal RAG systems.
May 25, 2026 | ai-ml | research
CoFrGeNet: Continued Fraction Architectures for Language Generation
Researchers propose CoFrGeNets, a continued-fraction-inspired architecture that replaces attention and feed-forward layers in transformers while reducing parameters by 33-50% with competitive or superior downstream performance. This is relevant to practitioners seeking efficient alternatives to standard transformers and signals a credible path to parameter-efficient generative models at scale.
May 25, 2026 | ai-ml | research
