🤖 AI Quick Overview

Today’s main theme is AI moving from demonstrations to real-world processes: enterprise agents are beginning to focus on learning, auditing, and reliability, and edge models are nearing production readiness; simultaneously, Claude Fable 5, Codex, Kimi, and others are heating up programming …
📋 Article Metadata
Published: 2026-06-13
Type: ai-daily
Word count: 9167
Reading time: 44 min

2026-06-13 AI Daily | Agents Enter Organizational Workflows as Edge Models and AI Programming Costs Heat Up

Today’s main theme is the shift of AI from demos to real-world workflows: enterprise agents are beginning to focus on learning, auditing, and reliability, while edge models are approaching production readiness. Meanwhile, models like Claude Fable 5, Codex, and Kimi are fueling the rise of programming agents, but token costs, verifiability, and software engineering constraints are emerging as new key variables.

📖 In-Depth Guide to This Issue’s Watch List

The most important theme to explore today is “agents moving from demos to organizational workflows.” OpenAI Academy has explicitly integrated learning into the deployment process. Arbor and ToolSense are addressing reliability concerns for enterprise adoption by focusing on tree-search cognition and tool knowledge auditing, respectively. This is highly recommended reading for engineering and product teams.

A second theme is that AI evaluation is becoming more “scenario-based”: multi-turn reasoning for shopping, adversarial AI peer reviews, and causal analysis of latent reasoning models all remind us that elegant observable patterns do not equate to real capabilities.

Finally, a concentration of papers on healthcare, low-resource languages, and science communication indicates that model applications are penetrating high-value, highly constrained domains. EDEN, AfriSUD, MentalMARBERT, and the generation of videos from scientific charts will be of interest to readers focused on vertical data and multimodal knowledge representation.

🌐 AI Hot Topics on X

Topic 1: Developers Share Weekend Builds Powered by OpenAI’s Codex

  • Category: AI · News
  • Overview: Trending Time: not listed, Related Posts: 47
  • What it is: Developers on X have been sharing applications, tools, and prototypes they built over the weekend using OpenAI’s Codex.
  • Why it matters: This demonstrates that code generation models are lowering the barrier to entry for software development and accelerating the iteration cycle from idea to runnable product, a key signal for the practical application of AI programming assistants.
  • Discussion summary: The discussion focuses on Codex’s development efficiency, code quality, and suitable use cases. Supporters argue it significantly boosts prototyping speed, while critics raise concerns about the reliability and security of the generated code and whether it diminishes developers’ control over the underlying implementation.

Topic 2: SpaceX Launches Record $75 Billion IPO, Creating Thousands of Employee Millionaires

  • Category: AI · News
  • Overview: Trending Time: 12 hours ago, Related Posts: 805,000
  • What it is: SpaceX has reportedly launched a record-breaking IPO with an estimated valuation of $75 billion, drawing attention to its wealth effect on employees and the future of commercial spaceflight.
  • Why it matters: SpaceX’s satellite internet, launch capabilities, and computing infrastructure could influence the long-term strategies of AI companies regarding global connectivity, edge computing, and space-based data acquisition.
  • Discussion summary: Discussions on X center on whether the IPO valuation is justified, the wealth effect of employees becoming millionaires, the potential link between SpaceX and AI infrastructure, and whether commercial spaceflight is forming a new tech bubble.

Topic 3: Elon Musk Becomes World’s First Trillionaire After SpaceX IPO

  • Category: AI · News
  • Overview: Trending Time: not listed, Related Posts: 10,000
  • What it is: Following its reported record-breaking IPO, SpaceX’s valuation is said to have surpassed approximately $2 trillion, making Elon Musk the world’s first trillionaire due to the increased value of his holdings.
  • Why it matters: This event highlights the massive scale of capitalization in aerospace, satellite internet, electric vehicles, and AI-related infrastructure. It also shows a further concentration of influence over AI computing power, communication networks, and the frontier technology ecosystem in the hands of a few tech entrepreneurs.
  • Discussion summary: The discussion on X is divided into two main camps: supporters view it as a triumph of innovation, venture capital, and market returns, while critics focus on wealth inequality, the role of government contracts and subsidies in creating private fortunes, and whether there should be increased regulation and taxation of the ultra-wealthy and key technology platforms.

Topic 4: Claude Fable 5 Tops Coding Benchmarks Amid Access Frustrations

  • Category: AI · News
  • Overview: Trending Time: 1 day ago, Related Posts: 9,300
  • What it is: Anthropic’s Claude Fable 5 is reported to be leading in several coding benchmarks, but users are expressing frustration with its limited access and availability.
  • Why it matters: Coding capability is a key metric for the commercialization and developer adoption of large models. If Claude Fable 5’s performance is confirmed, it will intensify the competition among AI programming assistants and foundational models.
  • Discussion summary: Discussions on X focus on whether its benchmark scores translate to real-world development efficiency, whether access barriers and rate limits are harming the user experience, and whether Anthropic should prioritize expanding availability over emphasizing leaderboard rankings.

Topic 5: Moonshot AI Releases Kimi-K2.7-Code, Top Open-Source Coding Model

  • Category: AI · News
  • Overview: Trending Time: 8 hours ago, Related Posts: 2,900
  • What it is: Moonshot AI released Kimi-K2.7-Code, claiming it to be the current top-performing open-source code model.
  • Why it matters: This indicates the growing competitiveness of Chinese AI companies in code generation and agent programming models. It could also further drive the application of open-source coding models in development tools, enterprise automation, and AI Agent scenarios.
  • Discussion Overview: Discussions on X are mainly focused on whether its benchmarks are truly leading, the gap with models like DeepSeek, Qwen, and Claude Code, its open-source license and commercial viability, and whether the actual programming experience matches the official claims.

Topic 6: Hundreds Queue for Cursor AI Hackathon at a16z San Francisco

  • Category: AI · News
  • Overview: Trending Time: 19 hours ago, Related Posts: 397
  • What it is: Hundreds of developers queued up at the a16z office in San Francisco for the Cursor AI Hackathon, highlighting the high level of enthusiasm for community events related to AI programming tools.
  • Why it matters: This reflects that AI programming assistants like Cursor are evolving from simple tools into developer ecosystems and entry points for startups, further driving changes in software development workflows and the competition for talent.
  • Discussion Overview: Discussions on X are centered on whether AI programming tools have become the next-generation developer infrastructure. Supporters argue that the long queue reflects genuine demand and entrepreneurial energy, while skeptics believe the hype may be inflated by VCs and social media, with long-term retention and actual productivity improvements yet to be proven.

Topic 7: Mexico Beats South Africa 2-0 in 2026 World Cup Opener

  • Category: AI · Entertainment
  • Overview: Trending Time: 2 days ago, Related Posts: 482,000
  • What it is: In the 2026 World Cup opener, host nation Mexico defeated South Africa 2-0, sparking extensive shares and discussions on X.
  • Why it matters: While the event itself is not AI technology news, the real-time broadcasting, automated clip generation, data analysis, personalized recommendations, and content moderation for major sports events amplify the role of AI in media distribution and online entertainment.
  • Discussion Overview: Discussions on X focused on Mexico’s victory, the game’s trajectory after a South African player was sent off, street celebrations by fans, and the buzz around the broadcast and short-form video content. Some users also questioned the appropriateness of classifying this topic as AI-related.

Topic 8: Yemeni ‘Spider-Man’ Climber Dies in Volcanic Crater Fall

  • Category: AI · Sports
  • Overview: Trending Time: not listed, Related Posts: 766
  • What it is: A well-known Yemeni climber, nicknamed ‘Spider-Man’, fell to his death while climbing in a volcanic crater, drawing attention on the X platform.
  • Why it matters: The incident itself is not an AI breakthrough, but its rapid dissemination highlights the need for governance on social platforms regarding algorithmic recommendations, video verification, flagging of high-risk content, and safety advisories for minors in response to sudden death events.
  • Discussion Overview: Discussions on X centered on the details of the accident and accountability, whether extreme sports should be incentivized by online traffic, whether the related videos should continue to be circulated, and whether platform algorithms amplify high-risk challenge content.

Topic 9: Real Madrid Hires Mourinho and Signs Bernardo Silva in Double Move

  • Category: AI · Sports
  • Overview: Trending Time: 22 hours ago, Related Posts: 158,000
  • What it is: Real Madrid officially announced the reappointment of Mourinho as manager and the signing of midfielder Bernardo Silva from Manchester City.
  • Why it matters: As a trending topic at the intersection of AI and sports, this highlights how AI-generated or virtual soccer news is entering the public consciousness on a massive scale, raising concerns about information authenticity and the role of AI in sports narratives.
  • Discussion Overview: The discussion is focused on whether the event is fake transfer news generated by AI, the ethical boundaries of AI in sports journalism, and the information-verification vulnerabilities exposed by the viral spread of such fabricated content on social media.

Topic 10: Real Madrid Appoints Mourinho and Signs Bernardo Silva in Stunning Double Move

  • Category: AI · Other
  • Overview: Trending Time: 2 days ago, Related Posts: 333,000
  • Summary: Real Madrid Appoints Mourinho and Signs Bernardo Silva in Stunning Double Move

Topic 11: Thomas Partey Misses Ghana’s World Cup Opener After Canada Visa Denial

  • Category: AI · Sports
  • Overview: Trending Time: 6 hours ago, Related Posts: 109,000
  • Summary: Thomas Partey Misses Ghana’s World Cup Opener After Canada Visa Denial

Topic 12: AI Builds Full Hogwarts Replica in One Prompt

  • Category: AI · Other
  • Overview: Trending Time: not listed, Related Posts: 101
  • What it is: A buzz on X about “using a single prompt to have AI build a complete replica of Hogwarts.” However, the information mainly comes from topic headlines, lacking specific demonstration details and source verification.
  • Why it matters: This topic reflects the imaginative potential of generative AI in complex scene modeling, interactive content generation, and creative workflow automation. It also highlights the public’s interest in the boundaries of the “single-sentence generation of large virtual worlds” capability.
  • Discussion Summary: The discussion focuses on whether this achievement is real and reproducible, which models or tools were used, and whether it’s merely a clipped demonstration or exaggerated marketing. It also touches upon the potential impact of such capabilities on game development, film and television asset production, and copyright compliance.

Topic 13: Tesla FSD 14.3.4 Delivers Superhuman Reactions and European Rollout

  • Category: AI · News
  • Overview: Trending Time: not listed, Related Posts: 117
  • What it is: Tesla released/pushed FSD 14.3.4, which users claim has driving reaction speeds approaching or surpassing humans, sparking attention around its European market rollout.
  • Why it matters: The iteration of FSD demonstrates the continuous progress of end-to-end autonomous driving models in real-world road scenarios. If it can be expanded in Europe, it will be a significant signal for the commercialization and regulatory implementation of AI-driven autonomous driving.
  • Discussion Summary: Discussions on X are centered on whether the new version’s reaction speed, disengagement rate, and safety are genuinely superior to humans. Users are also focused on the European launch timeline, regulatory approvals, adaptation to road rules, and the question of whether Tesla is engaged in excessive marketing.

Topic 14: BTS Launches ARIRANG Tour Homecoming in Busan with Electric Energy

  • Category: AI · Entertainment
  • Overview: Trending Time: 1 day ago, Related Posts: 732,000
  • What it is: BTS kicked off their ARIRANG Homecoming Tour in Busan, using augmented reality technology to create a stunning electronic music atmosphere.
  • Why it matters: This performance deeply integrates AI-driven augmented reality into a large-scale live show, validating the scalable application of real-time rendering, spatial computing, and audience interaction, setting a new benchmark for the fusion of virtual and real in the entertainment industry.
  • Discussion Summary: On X, the debate revolves around the smoothness of the technical implementation, whether the AR effects enhanced the live experience or were a distraction, and whether this technological experience can be replicated in other tour locations. Fans are divided on the balance between pure live emotion and technological intervention.

Summary of AI Public Opinion on X Today

Today’s main narrative is that AI is moving from demonstration to infrastructure, with programming tools such as Codex, Claude, Kimi, and Cursor at the center. Developers widely agree that AI is accelerating prototyping, lowering the barrier to building software, and creating a new developer ecosystem, and that code models, autonomous driving, AR entertainment, content generation, and space communication are the next frontiers for technological competition and commercialization. Disagreements center on whether benchmark results translate into real-world productivity, whether popular projects are overhyped by investors and social media, and whether access, availability, and long-term retention live up to the marketing claims.

A second thread is the tension between AI and information authenticity: from fabricated sports news and the Hogwarts generation demo to non-AI events being algorithmically pushed into AI topics, user skepticism about platform categorization, source verification, and the boundaries of generated content is rising. Potential risks include the security and maintainability of generated code, public safety issues from the over-marketing of autonomous driving, the algorithmic amplification of extreme content and graphic images of death, and the increasing concentration of wealth, computing power, communication networks, and key tech platforms, as exemplified by the narrative surrounding SpaceX and Musk. Overall, sentiment is largely one of excitement, but concerns about trust, regulation, verifiability, and the concentration of power are growing alongside the technology.

💡 Influencer Insights

The following is an in-depth analysis report of tweets from several AI influencers on X over the past 24 hours.


AI Industry Daily: On-Device Intelligence Explodes, Fable 5 Controversy, and Software Engineering Reimagined

1.1 The Full Rise of On-Device Models and the Establishment of a New Paradigm

This week’s core narrative is undoubtedly “on-device models.” From major tech companies to independent developers, there’s a consensus that locally deployed AI models have reached a tipping point of being “usable” and even “good.”

  • Performance Leap: @zhixianio conducted a “monk-like” experiment, forcing himself to work using only a local model (Qwen3.6-35B-A3B-oQ6-fp16-mtp). In both coding and personal assistant (PA) scenarios, responses were faster than with remote LLMs, the quality was on point, and the native multimodal experience was even better than that of DeepSeek V4 Pro.
  • Quantization Breakthrough: Google’s Gemma 4 QAT (Quantization-Aware Training) model, announced via @googledevs, has drawn significant attention. @zhixianio sees it as a new approach to on-device optimization: because training already accounts for the fact that the model will be quantized, memory usage drops significantly and local inference gets faster. He believes this signals that powerful models will soon be built into Android devices.
  • Concrete Application Scenarios: On-device models are no longer just for benchmarking. @zhixianio demonstrated using a Mac to run a model to “thaw a rice ball” and produced episode E5 of the podcast “Cognitive County,” using on-device TTS to generate the host’s voice. This marks the transition of on-device intelligence from a geek’s toy to a near-production tool.
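
The quantization idea mentioned above can be illustrated with a minimal sketch. This is not Gemma's actual training recipe; it only shows the "fake quantization" rounding step that quantization-aware training typically simulates in the forward pass, plus the storage arithmetic behind the memory savings. The function names, the sample weights, and the 1-billion-parameter figure are all illustrative assumptions.

```python
def fake_quantize(weights, bits=4):
    """Round weights to a symmetric signed integer grid, then map back to floats.

    QAT applies this kind of rounding during training so the weights
    learn to tolerate it; here we only show the rounding itself.
    """
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # float value of one grid step
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return [q * scale for q in quantized], scale


def weight_memory_gb(num_params, bits):
    """Storage needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits / 8 / 1e9


# Hypothetical sample weights, chosen only for illustration.
weights = [0.82, -0.31, 0.05, -0.77, 0.44]
dequantized, scale = fake_quantize(weights, bits=4)
max_error = max(abs(w - d) for w, d in zip(weights, dequantized))

# Hypothetical 1-billion-parameter model at 16-bit vs. 4-bit weights.
fp16_gb = weight_memory_gb(1_000_000_000, 16)
int4_gb = weight_memory_gb(1_000_000_000, 4)
print(f"max rounding error: {max_error:.4f} (grid step {scale:.4f})")
print(f"weights: {fp16_gb} GB at 16-bit vs {int4_gb} GB at 4-bit")
```

The rounding error per weight is bounded by half a grid step, which is the error the model is trained to absorb; the fourfold drop in weight storage is what makes larger models fit on local hardware.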

1.2 Claude Fable 5: A Dichotomy of Extreme Intelligence and “Token Incinerator”

Anthropic’s new model, Claude Fable 5, has become the undisputed focus of the community, but opinions are polarized.

  • Powerful Logic and Planning Capabilities: @zhixianio was amazed that it independently completed 70% of a development task in 40 minutes and proactively corrected flaws in the human design. @vista8 and @dotey confirmed its depth of thought, noting it could ponder an idea for 15 minutes before taking action, generating extremely high-quality code.
  • The High Cost: @dotey and @Pluvio9yte pointed out its biggest pain point: massive token consumption. @dotey quoted a tweet from @jerryjliu0 stating that a team member consumed $1,500 worth of tokens in 10 hours, and commented, “More and more companies are finally discovering that AI is more expensive than employees!” @Pluvio9yte advised rational consumption and pointed out a hidden switch in /effort max.
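
To make the cost claim above concrete, here is a small back-of-the-envelope sketch. The $1,500 and 10-hour figures come from the quoted tweet; the engineer hourly cost is a purely hypothetical placeholder, so the resulting ratio is illustrative only.

```python
def hourly_burn(total_cost_usd: float, hours: float) -> float:
    """Average spend per hour for a token bill accumulated over `hours`."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return total_cost_usd / hours


# Figures quoted in the tweet: $1,500 of tokens in 10 hours.
agent_rate = hourly_burn(1500, 10)

# ASSUMPTION: a hypothetical fully loaded engineer cost, not from the source.
HYPOTHETICAL_ENGINEER_HOURLY_USD = 100.0
ratio = agent_rate / HYPOTHETICAL_ENGINEER_HOURLY_USD

print(f"agent burn rate: ${agent_rate:.2f}/hour")
print(f"ratio to hypothetical engineer cost: {ratio:.2f}x")
```

Whether the comparison favors the agent depends on how much work it completes in that hour, which a burn rate alone does not capture.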

1.3 OpenAI Codex: Long-Duration Task (Goal) Mechanism and Tokenomics Iteration

OpenAI’s programming agent, Codex, has shown surprising stability in long-duration tasks.

  • “Farming-Style” Development: @vista8 shared a /goal command for Codex that allows the AI to automatically develop and iterate on a website while he sleeps, running for up to 10 hours. It achieved full automation of the entire process, from code generation and testing to deployment. He wrote a dedicated PRD generation Skill for Codex to adapt to this new development paradigm.
  • Token Reset Gamification: To address users’ token anxiety, OpenAI introduced a feature that refreshes weekly limits when users invite friends. @dotey joked that Codex has taken the token reset mechanism to a new level, even allowing users to save reset opportunities for later. @ruanyf mentioned that his friend’s company significantly reduced API costs by leveraging a cloud provider’s compliance caching mechanism.

1.4 YouMind 1.0 Released: A Coming-of-Age for AI-Native Creation Tools

YouMind, created by Yubo (@lifesinger), the former Head of Product for Feishu, officially released version 1.0 and received congratulations from industry leaders like @dotey, @vista8, and @gefei55. @vista8 pointed out that its traffic has surged over the last six months, proving that even in the age of AI, a creation tool polished with dedication for two years can still find a huge market. @gefei55 sees it as a prime example of a successful growth transition for a tech professional from a major company.


2. Noteworthy Perspectives and Industry Foresight

2.1 AI and Software Engineering: Reshaping, Not Replacing

@dotey pushed back on the view that “AI is redefining software engineering,” arguing instead that “AI doesn’t redefine software engineering; it magnifies its importance.” Meanwhile, @Pluvio9yte shared his painful evolution from a “Vibe Coder” to a full-stack engineer and proposed a Contract First development philosophy: in the chaotic process of AI coding, defining APIs and data contracts is the prerequisite for all other work. On that basis, he forked and customized the OpenSpec framework.
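
As a rough illustration of what "contract first" can look like in practice, the sketch below fixes the data shapes and the service interface before any implementation exists. It is not OpenSpec's format or @Pluvio9yte's actual setup; every name here (`CreateUserRequest`, `UserService`, `register`, and so on) is hypothetical.

```python
from dataclasses import dataclass
from typing import Protocol


# Step 1: the data contract, written before any implementation.
@dataclass(frozen=True)
class CreateUserRequest:
    email: str
    display_name: str


@dataclass(frozen=True)
class CreateUserResponse:
    user_id: int
    email: str


# Step 2: the API contract that every implementation must satisfy.
class UserService(Protocol):
    def create_user(self, request: CreateUserRequest) -> CreateUserResponse: ...


# Step 3: an implementation (human- or AI-written) coded against the contract.
class InMemoryUserService:
    def __init__(self) -> None:
        self._next_id = 1
        self._users = {}

    def create_user(self, request: CreateUserRequest) -> CreateUserResponse:
        if "@" not in request.email:
            raise ValueError("email must contain '@'")
        user_id = self._next_id
        self._users[user_id] = request
        self._next_id += 1
        return CreateUserResponse(user_id=user_id, email=request.email)


def register(service: UserService, email: str, display_name: str) -> CreateUserResponse:
    """Caller code depends only on the contract, not on a concrete class."""
    return service.create_user(CreateUserRequest(email, display_name))


response = register(InMemoryUserService(), "ada@example.com", "Ada")
print(response)
```

Because callers depend only on `UserService`, generated code can be regenerated or replaced freely as long as it still satisfies the same request and response types.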

2.2 The AI Efficiency Paradox and Economic Burden

@ruanyf pointed out that while AI boosts individual productivity, at the scale of the OpenClaw founder’s $1.3 million monthly token consumption, enterprise-level AI programming would be far more expensive than hiring human programmers. He also raised a deeper contradiction in AI’s cost-reduction and efficiency narrative: if AI can complete a week’s work quickly, should employees get time off? If AI integration means no raises and no time off, what is the point for employees?

2.3 The Endgame of Robotics and World Models

@AI_Jasonyu highlighted the aggressive predictions of Professor Huang (@huang_biwei) for robot development. Professor Huang believes the current VLA+stack data path is unfeasible and asserts that robots will reach their “GPT-3 moment” by early 2027. His core argument divides world models into three stages: rendering (Sora) -> simulation (Fei-Fei Li) -> imagination (causal large models), and holds that “compression is intelligence” should be elevated to “structured compression is intelligence.”

2.4 AI’s Impact on the Outsourcing Industry and Organizational Transformation

@dotey observed that real estate giant OpenDoor laid off its entire offshore team in India (over 200 people), turning instead to build a local, AI-native team in the US. This sends a stern signal: AI is not only replacing basic labor but also beginning to disrupt the global outsourcing industry driven by labor cost differentials.

2.5 Deep Dive: The False Metric of Tokens

@lijigang issued a philosophical warning, reminding developers not to fall into anxiety over token consumption bills. He believes that token consumption is a “false metric,” and whether the problem is solved is the “true metric.” This perspective is a calm reflection on the exorbitant consumption observed in current models like Fable 5.


3.1 Programming Development and AI Agents

  • oMLX v0.4.0: Released by @jundotkim with a native Swift macOS interface, it is a powerful tool for running large models on Mac devices. @zhixianio uses it regularly and highly recommends it.
  • Codex Skills:
    • Qiaomu Goal Meta Skill (@vista8): A tool that transforms a single-sentence requirement into a Codex /goal instruction, increasing the success rate of AI long-term tasks.
    • 10 Chinese Creator Codex Skills (@wsl8297): A comprehensive automation toolkit covering writing, de-AI-fying, image pairing, and creating Xiaohongshu cards.
  • AI Video Subtitle Tool: Open-sourced by @xiaohu and recommended by @Pluvio9yte, this is a local one-stop video processing tool (download > transcription > translation > polishing > subtitle burning).

3.2 Product Design and Frontend

  • baoyu-design skill (@dotey): Major functional update, supporting the import of Figma local files (*.fig), allowing for the reconstruction of a complete design system within a conversation.
  • Online Logo Design Tool: A non-AI logo designer implemented in pure HTML+JS, generated by @vista8 using Fable 5, demonstrating Fable’s potential in graphical programming.

3.3 Knowledge Management and AI Reading

  • Shadow Book Reading Method: A new AI reading paradigm proposed by @lijigang, which uses AI to analyze the author’s unwritten subtext, opposing viewpoints, and intellectual heritage, expanding reading from a two-dimensional plane to a multi-dimensional space.
  • OfoxAI: A cost-effective relay station recommended by @AI_Jasonyu, offering discounted APIs for models like GPT-5.5, known for stability and direct, low-latency connections.

3.4 Hardware and Going Global

  • Giffgaff SIM card for keeping number active: @AI_Jasonyu once again highlighted its advantage as a 0-monthly-fee, permanent overseas mobile number, suitable for long-term overseas account registration.
  • Mac Productivity Toolkit: @Pluvio9yte recommended Bartender 6 (status bar management), Maccy (open-source clipboard), and Mos (mouse scrolling optimization).

📚 Appendix: Today’s Watch List Update Sources

Time Window: Last 3 days; Covering 22 sources; Total 34 updates

All-In Podcast (A_full)

  • All-In’s Best Ideas Pitch Competition: 4 Investors Present Their Top Trades Live
    • Published: 2026-06-12 09:25 Beijing Time
    • Summary:
      • EY - EY helps private equity firms translate market insights into action, navigate complexity, and unlock new paths to growth and long-term value.
      • New York Stock Exchange - Thank you to our partners at the New York Stock Exchange - a modern market and exchange dedicated to building the future.
      • Plaud, our official wearable AI note-taking partner at the All-In Liquidity Summit, captured every insight.
      • All-In Best Ideas Pitch Competition: 4 Investors Present Their Top Trades Live.
    • EN Key Points:
      • (0:00) Chamath explains the Best Ideas format
      • (2:31) Suvretta Capital Management’s Aaron Cowen pitches MGM Resorts
      • (13:07) Bornite Capital’s Dan Dreyfus pitches Talen Energy
      • (27:19) EcoR1 Capital’s Oleg Nodelman pitches Aktis Oncology

Stratechery by Ben Thompson (A_full)

  • 2026.24: Hey Siri, Tell Me a Fable
    • Published: 2026-06-13 01:00 Beijing Time
    • Summary:
      • Welcome back to This Week in Stratechery!
      • As a reminder, each week, every Friday, we’re sending out this overview of content in the Stratechery bundle; highlighted links are free for everyone.
      • Additionally, you have complete control over what we send to you.
      • With that said, here are a few of our favorites from the week.
      • Apple Finally Releases Intelligence. Tim Cook’s last WWDC as CEO was largely about cleaning up a mess Apple made two years ago, and while Cook didn’t drive the Siri AI demo — that was engineering head and now Siri chief Mike Rockwell — the final product felt like a fitting send-off as his tenure nears its end.
    • EN Key Points:
      • Welcome back to This Week in Stratechery
      • As a reminder, each week, every Friday, we’re sending out this overview of content in the Stratechery bundle; highlighted links are free for everyone
      • Additionally, you have complete control over what we send to you
      • If you don’t want to receive This Week in Stratechery emails (there is no podcast), please uncheck the box in your delivery settings

OpenAI Blog (A_full)

  • New OpenAI Academy courses for the next era of work

    • Published: 2026-06-12 18:00 Beijing Time
    • Summary:
      • Artificial intelligence is giving organizations new capacity for action.
      • Work that once required scarce time or expertise can be moved forward continuously by AI.
      • But this promise is only realized when people know how to apply these tools in their work and turn successful uses into repeatable ways of working.
      • At OpenAI, we see learning as part of deployment.
      • We build models and products and work closely with organizations applying them in their business.
    • EN Key Points:
      • OpenAI introduces three Academy courses that help people build practical AI skills, create repeatable workflows, and apply agents in everyday work.
  • How Preply combines AI and human tutors to personalize learning

    • Published: 2026-06-12 08:00 Beijing Time
    • Summary:
      • With personalized 1-on-1 teaching in over 90 languages, Preply’s mission is to make quality language education accessible to anyone, anywhere.
      • Language learning is inherently human.
      • It requires dialogue, confidence, motivation, and cultural understanding.
      • While Preply tutors provide learners with irreplaceable energy, motivation, cultural nuance, and human connection, they also face repetitive tasks: writing personalized plans and session notes.
      • Meanwhile, students typically need a clear sense of progress to stay highly engaged.
    • EN Key Points:
      • Preply uses OpenAI to launch AI-generated lesson summaries, providing personalised feedback and language learning exercises.

ArXiv cs.AI (B_intro+search)

  • ToolSense: A Diagnostic Framework for Auditing Parametric Tool Knowledge in LLMs

    • Published: 2026-06-12 12:00 Beijing Time
    • Abstract (arXiv:2606.12451v1, announce type: new):
      • Large language models deployed as agents over large tool catalogs face a critical tool-retrieval bottleneck.
      • Embedding-based retrieval approaches rely on compact encoders that may under-capture specialized tool semantics. Parametric tool retrieval addresses this by encoding each tool as a virtual token appended to the LLM’s vocabulary and fine-tuning the LLM in two stages (memorization, then retrieval SFT) so that it acts as the retriever, achieving strong performance on the standard ToolBench retrieval benchmark.
      • However, these benchmarks use verbose, fully-specified queries, and their evaluation applies constrained decoding that restricts outputs to valid token paths, neither of which reveals whether the model truly understands its tools.
    • EN Key Points:
      • arXiv:2606.12451v1 Announce Type: new
      • Abstract: Large language models deployed as agents over large tool catalogs face a critical tool-retrieval bottleneck
      • As embedding-based retrieval approaches rely on compact encoders that may under-capture specialized tool semantics, parametric tool retrieval addresses this by…
      • Yet these benchmarks use verbose, fully-specified queries, and their evaluation applies constrained decoding that restricts outputs to valid token paths, neithe…
  • Arbor: Tree Search as a Cognition Layer for Autonomous Agents

    • Published: 2026-06-12 12:00 Beijing Time
    • Abstract (arXiv:2606.12563v1, announce type: new):
      • Arbor is a multi-agent framework that introduces structured tree search as a cognition layer for autonomous agents operating in large, stateful action spaces.
      • Prior autonomous optimization systems operate on isolated targets with stateless evaluation.
      • Arbor instead maintains an explicit search tree of scored hypotheses that serves as the shared working memory across agents, evolving with every measurement, treating failures as diagnostic signals to reshape subsequent exploration, and expanding as prior successes shift the bottleneck distribution.
    • EN Key Points:
      • arXiv:2606.12563v1 Announce Type: new
      • Abstract: Arbor is a multi-agent framework that introduces structured tree search as a cognition layer for autonomous agents operating in large, stateful action…
      • Prior autonomous optimization systems operate on isolated targets with stateless evaluation
      • Arbor instead maintains an explicit search tree of scored hypotheses that serves as the shared working memory across agents, evolving with every measurement, tr…
  • Strategic Decision Support for AI Agents

    • Published: 2026-06-12 12:00 Beijing Time
    • Abstract (arXiv:2606.12587v1, announce type: new):
      • Traditionally, decision support studies how humans use machine learning models to make better decisions.
      • In modern agentic systems, this division of roles is increasingly reversed: AI agents act on behalf of users, while humans and tools become support mechanisms around them.
      • This role reversal brings reliability concerns to the forefront, since agentic errors can be consequential and agent behavior must remain aligned with human goals and constraints.
    • EN Key Points:
      • arXiv:2606.12587v1 Announce Type: new
      • Abstract: Traditionally, decision support studies how humans use machine learning models to make better decisions
      • In modern agentic systems, this division of roles is increasingly reversed: AI agents act on behalf of users, while humans and tools become support mechanisms…
      • This role reversal brings reliability concerns to the forefront, since agentic errors can be consequential and agent behavior must remain aligned with human goa…
  • Pythagoras-Prover: Advancing Efficient Formal Proving via Augmented Lean Formalisation

    • Publication Time: 2026-06-12 12:00 Beijing Time
    • Abstract: - arXiv:2606.12594v1 Announce Type: new.
      • Abstract: Modern Lean theorem provers achieve strong performance only with substantial training and inference compute, driven in part by scarce verified proof data and the long inference traces of formal proof search, which makes both supervised fine-tuning (SFT) and sampling expensive.
      • We introduce Pythagoras-Prover, a compute-efficient open-source family of Lean theorem provers built for practical compute budgets.
      • The family spans two generation paradigms: autoregressive models at 4B and 32B parameters, and a first proof-of-concept diffusion-based prover (4B) that iteratively refines Lean proofs at inference time.
    • EN Key Points:
      • arXiv:2606.12594v1 Announce Type: new
      • Abstract: Modern Lean theorem provers achieve strong performance only with substantial training and inference compute, driven in part by scarce verified proof d…
      • We introduce Pythagoras-Prover, a compute-efficient open-source family of Lean theorem provers built for practical compute budgets
      • The family spans two generation paradigms: autoregressive models at 4B and 32B parameters, and a first proof-of-concept diffusion-based prover (4B) that iterati…
  • PersonaDrive: Human-Style Retrieval-Augmented VLA Agents for Closed-Loop Driving Simulation

    • Publication Time: 2026-06-12 12:00 Beijing Time
    • Abstract: - arXiv:2606.12616v1 Announce Type: new.
      • Abstract: Closed-loop driving simulators typically populate their environments with non-ego traffic agents that behave largely the same way, produced either by rule-based traffic managers or learned models trained for a single behavioral pattern.
      • Recent work introduces style variation through post-hoc labels on observational data or LLM-inferred reward weights, but these signals act as proxies for what a style should reward, rather than demonstrations from humans explicitly asked to drive in that style.
      • We introduce PersonaDrive, a pipeline that conditions a vision-language-action (VLA) driving agent on retrieved demonstrations from a style-instructed human driving dataset, where participants drove CARLA leaderboard routes on a driver-in-the-loop rig following aggressive, neutral, and conservative instructions.
    • EN Highlights:
      • arXiv:2606.12616v1 Announce Type: new
      • Abstract: Closed-loop driving simulators typically populate their environments with non-ego traffic agents that behave largely the same way, produced either by…
      • Recent work introduces style variation through post-hoc labels on observational data or LLM-inferred reward weights, but these signals act as proxies for what a…
      • We introduce PersonaDrive, a pipeline that conditions a vision-language-action (VLA) driving agent on retrieved demonstrations from a style-instructed human dri…
  • “Did you lie?” Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms

    • Release Time: 2026-06-12 12:00 Beijing Time
    • Abstract: - arXiv:2606.12618v1 Announce Type: new.
      • Abstract: Robust lie detectors for language models could enable powerful techniques for auditing, monitoring, and post-hoc investigation of model behaviour, but evaluating them requires testbeds where it is verifiable that the model is not telling the truth.
      • We show that existing trained model organisms often fail this requirement, leaving prior positive and negative detection results difficult to interpret.
      • We address this with 13 reasoning model organisms whose hidden beliefs are verified in chain-of-thought and shown to generalise to held-out tasks, alongside a variety of deceptions, a prompt-based lying testbed that covers a wide range of lie-inducing motivations.
    • EN Highlights:
      • arXiv:2606.12618v1 Announce Type: new
      • Abstract: Robust lie detectors for language models could enable powerful techniques for auditing, monitoring, and post-hoc investigation of model behaviour, but…
      • We show that existing trained model organisms often fail this requirement, leaving prior positive and negative detection results difficult to interpret
      • We address this with 13 reasoning model organisms whose hidden beliefs are verified in chain-of-thought and shown to generalise to held-out tasks, alongside Var…
  • TrajGenAgent: A Hierarchical LLM Agent for Human Mobility Trajectory Generation

    • Release Time: 2026-06-12 12:00 Beijing Time
    • Abstract: - arXiv:2606.12657v1 Announce Type: new.
      • Abstract: Human mobility data is crucial for transportation, urban planning, and epidemic control, but large-scale trajectory collection is often costly and privacy-constrained, which has motivated the generation of realistic synthetic trajectories.
      • Existing LLM-based generators typically rely on either prompt engineering (which preserves zero-shot reasoning but lacks fine-grained spatiotemporal grounding) or trajectory-level fine-tuning (which improves statistical accuracy but incurs significant computational costs and may weaken general reasoning).
      • We propose TrajGenAgent, a semantic-aware hierarchical LLM agent framework for generating human mobility trajectories without model fine-tuning.
    • EN Highlights:
      • arXiv:2606.12657v1 Announce Type: new
      • Abstract: Human mobility data is important for transportation, urban planning, and epidemic control, but large-scale trajectory collection is often costly and p…
      • Existing LLM-based generators typically rely on either prompt engineering, which preserves zero-shot reasoning but lacks fine-grained spatiotemporal grounding,…
      • We propose TrajGenAgent, a semantic-aware hierarchical LLM-agent framework for human mobility trajectory generation without model fine-tuning
  • Evoflux: Inference-Time Evolution of Executable Tool Workflows for Compact Agents

    • Release Time: 2026-06-12 12:00 Beijing Time
    • Abstract: - arXiv:2606.12674v1 Announce Type: new.
      • Abstract: Compact language models (LMs) reduce the cost, latency, and deployment risk for tool agents.
      • However, MCP-style tool use requires more than just isolated function calls: agents must discover tools from live catalogs, satisfy schemas, preserve dependencies between intermediate outputs, and provide a final response grounded in evidence of execution.
      • Small planners often generate plausible workflow graphs that fail under tool resolution, parameter validation, dependency tracking, or execution.
    • EN Highlights:
      • arXiv:2606.12674v1 Announce Type: new
      • Abstract: Compact language models (LMs) reduce cost, latency, and deployment risk for tool agents
      • Yet MCP-style tool use requires more than isolated function calling: an agent must discover tools from live catalogs, satisfy schemas, preserve dependencies acr…
      • Small planners often generate plausible workflow graphs that fail under tool resolution, parameter validation, dependency tracking, or execution
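The failure modes listed in the Evoflux abstract (tool resolution, parameter validation, dependency tracking) can be made concrete with a small pre-execution check. The sketch below is not the paper's method: the `CATALOG` contents, the step format, and the `validate_workflow` function are hypothetical, chosen only to show what "a plausible workflow graph that fails validation" might look like.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical catalog: tool name -> set of required parameter names.
CATALOG = {
    "search": {"query"},
    "fetch": {"url"},
    "summarize": {"text"},
}


def validate_workflow(steps):
    """Return a list of problems found in a workflow; an empty list means it passed.

    `steps` maps a step id to a dict with keys "tool", "params", and "deps".
    """
    problems = []
    for step_id, step in steps.items():
        required = CATALOG.get(step["tool"])
        if required is None:
            # Tool resolution failure: the planner named a tool that does not exist.
            problems.append(f"{step_id}: unknown tool {step['tool']!r}")
        else:
            # Parameter validation: every required parameter must be supplied.
            missing = required - set(step["params"])
            if missing:
                problems.append(f"{step_id}: missing params {sorted(missing)}")
        for dep in step["deps"]:
            if dep not in steps:
                problems.append(f"{step_id}: depends on undefined step {dep!r}")
    # Dependency tracking: the known dependencies must form an acyclic graph.
    graph = {sid: [d for d in s["deps"] if d in steps] for sid, s in steps.items()}
    try:
        list(TopologicalSorter(graph).static_order())
    except CycleError:
        problems.append("dependency cycle detected")
    return problems


good = {
    "s1": {"tool": "search", "params": {"query": "tool agents"}, "deps": []},
    "s2": {"tool": "fetch", "params": {"url": "$s1.url"}, "deps": ["s1"]},
}
bad = {
    "s1": {"tool": "fetch", "params": {}, "deps": ["s2"]},
    "s2": {"tool": "translate", "params": {}, "deps": ["s1"]},
}
good_problems = validate_workflow(good)
bad_problems = validate_workflow(bad)
```

The `bad` workflow looks reasonable at a glance but fails three separate checks, which is the kind of gap between plausibility and executability the abstract describes.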
  • From AGI to ASI

    • Release Time: 2026-06-12 12:00 Beijing Time
    • Abstract: - arXiv:2606.12683v1 Announce Type: new.
      • Abstract: Over the past decade, building human-level artificial general intelligence has shifted from a distant speculation to a concrete goal for the next decade for many of the largest AI organizations.
      • Achieving this goal will have profound impacts on human society, raising many complex questions for the coming decade.
      • This report investigates how AI itself might continue to develop along the continuum of machine intelligence in a post-AGI world.
    • EN Highlights:
      • arXiv:2606.12683v1 Announce Type: new
      • Abstract: Over the last decade, building human-level artificial general intelligence has moved from far-fetched speculation to being a concrete next-decade targ…
      • Achieving this goal would have profound and far-reaching impacts on human society, which raises many complex questions for the decade ahead
      • This report investigates how AI itself might continue to develop in a post-AGI world along the continuum of machine intelligence
  • Deployment-Centered Evaluation: Predicting Query-Level Rejection Risk in a Clinical LLM System

    • Published: 2026-06-12 12:00 Beijing Time
    • Abstract: - arXiv:2606.12702v1 Announce Type: new.
      • Abstract: Large language models (LLMs) are increasingly integrated into clinical systems, making it essential to evaluate the real-world utility of these systems.
      • However, static benchmarks tend to measure correctness rather than user acceptance, aggregate performance across queries, and require densely annotated datasets—leading to major blind spots in evaluating clinical systems.
      • In this work, we perform a deployment-centered evaluation of an LLM system embedded within the electronic health records of an academic medical center, where user feedback is sparse but closely reflects deployment conditions.
    • EN Highlights:
      • arXiv:2606.12702v1 Announce Type: new
      • Abstract: Large language models (LLMs) are increasingly integrated into clinical systems, making it essential to evaluate the real-world utility of these system…
      • However, static benchmarks tend to measure correctness rather than user acceptance, aggregate performance across queries, and require densely annotated datasets…
      • In this work, we perform a deployment-centered evaluation of an LLM system embedded within electronic health records at an academic medical center, where user f…

ArXiv cs.CL (B_intro+search) Link to heading

  • EDEN: A Large-Scale Corpus of Clinical Notes for Italian

    • Published: 2026-06-12 12:00 Beijing Time
    • Abstract: - arXiv:2606.12569v1 Announce Type: new.
      • Abstract: We present EDEN (Emergency Department Electronic Notes), a new and unique large-scale corpus of clinical notes from the Emergency Departments of Italian hospitals.
      • The current version of the corpus consists of approximately 4 million fully anonymized clinical records, covering different stages of patient care during their stay in the Emergency Department.
      • In addition, a subset of about 6,000 notes has been manually annotated by clinical experts using a structured Case Report Form (CRF) with 132 items related to two patient conditions in the Emergency Department: dyspnea and loss of consciousness.
    • EN Highlights:
      • arXiv:2606.12569v1 Announce Type: new
      • Abstract: We present EDEN (Emergency Department Electronic Notes), a new and unique large-scale corpus of clinical notes produced in Emergency Departments of It…
      • The corpus, in its current version, is composed of approximately 4 million clinical notes fully anonymized, covering diverse phases of patient care during the s…
      • In addition, a subset of about six thousand notes has been manually annotated by clinical experts through a structured Case Report Form (CRF) containing 132 ite…
  • Helping Figures Tell their Story! Paper-Grounded Video Generation Explaining Complex Scientific Figures

    • Publication Time: 2026-06-12 12:00 Beijing Time
    • Abstract: - arXiv:2606.12576v1 Announce Type: new.
      • Abstract: Scientific figures compress complex pipelines into a single canvas, yet understanding them requires paper-grounded, step-by-step narration aligned with visual highlights, a capability that current video generation systems and benchmarks lack.
      • To address this, we introduce paper-grounded figure-to-video generation: generating narrated, region-grounded walkthrough videos from a figure and its paper.
      • We propose MINARD (Multimodal Interpretation of Narrated Architecture via Region Decomposition), a pipeline that generates paper-grounded narrations and sequentially grounds them on figure regions.
    • EN Key Points:
      • arXiv:2606.12576v1 Announce Type: new
      • Abstract: Scientific figures compress complex pipelines into a single canvas, yet understanding them requires paper-grounded, step-by-step narration aligned wit…
      • To address this, we introduce paper-grounded figure-to-video generation: generating narrated, region-grounded walkthrough videos from a figure and its paper
      • We propose MINARD (Multimodal Interpretation of Narrated Architecture via Region Decomposition), a pipeline that generates paper-grounded narrations and sequent…
  • MARD: Mirror-Augmented Reasoning Distillation for Mechanism-Level Drug-Drug Interaction Prediction

    • Publication Time: 2026-06-12 12:00 Beijing Time
    • Abstract: - arXiv:2606.12578v1 Announce Type: new.
      • Abstract: Mechanism-level Drug-Drug Interaction (DDI) prediction requires identifying which enzyme or pharmacodynamic axis is involved, the direction, and the evidence, not merely whether two drugs interact.
      • We introduce a reproducible mechanism-level DDI labeling and evaluation protocol with a structured 7-family/147-subtype taxonomy, a leak-safe cold-split protocol, and auditable reasoning metrics for evaluating pharmacological predictions beyond flat interaction classification.
      • We propose a pipeline that yields a 7B reasoning model, MARD (Mirror-Augmented Reasoning Distillation), which combines three training innovations: single-token KL-divergence on direction labels that links model predictions, per-loss PRM-weighted DPO with programmatic hard negatives, and a leak-safe, mechanism-aware retrieval passage.
    • EN Key Points:
      • arXiv:2606.12578v1 Announce Type: new
      • Abstract: Mechanism-level drug-drug interaction (DDI) prediction requires identifying which enzyme or pharmacodynamic axis is implicated, in which direction, an…
      • We introduce a reproducible mechanism-level DDI labelling and evaluation protocol with a structured 7-family/147-subtype taxonomy, leakage-safe cold-split proto…
      • We propose a pipeline that produces a 7B reasoning MARD (Mirror-Augmented Reasoning Distillation), combining three training innovations: a single-token KL diver…
  • Constrained Semantic Decompression in LLMs through Persian Proverb-Conditioned Story Generation

    • Publication Date: 2026-06-12 12:00 Beijing Time
    • Abstract: - arXiv:2606.12599v1 Announce Type: new.
      • Abstract: Transforming dense, abstract proverbs into engaging and morally faithful narratives requires deep cultural understanding and robust semantic grounding.
      • We frame this problem as a constrained semantic decompression task and study proverb-conditioned story generation as a testbed for abstraction-to-realization in Large Language Models (LLMs).
      • Focusing on Persian, we introduce the Proverb Aligned Narrative Dataset (PAND), pairing proverbs with human-written stories and explicit meanings.
    • EN Key Points:
      • arXiv:2606.12599v1 Announce Type: new
      • Abstract: Transforming a dense, abstract proverb into an engaging and morally faithful narrative requires deep cultural understanding and robust semantic ground…
      • We frame this problem as a constrained semantic decompression task and study proverb-conditioned story generation as a testbed for abstraction-to-realiza…
      • Focusing on Persian, we introduce the Proverb Aligned Narrative Dataset (PAND), pairing proverbs with human-written stories and explicit meanings
  • Shopping Reasoning Bench: An Expert-Authored Benchmark for Multi-Turn Conversational Shopping Assistants

    • Publication Date: 2026-06-12 12:00 Beijing Time
    • Abstract: - arXiv:2606.12608v1 Announce Type: new.
      • Abstract: Conversational shopping assistants now serve hundreds of millions of customers, but no existing benchmark jointly evaluates the open-ended multi-turn reasoning, domain expertise, and production-grade quality required for real shopping conversations.
      • Shopping reasoning is unique among language model applications.
      • Unlike factual Q&A or verifiable code generation, it requires balancing subjective preferences, budget constraints, and cross-product tradeoffs over multiple turns, a capability missing from previous e-commerce and general benchmarks.
    • EN Key Points:
      • arXiv:2606.12608v1 Announce Type: new
      • Abstract: Conversational shopping assistants now serve hundreds of millions of customers, yet no existing benchmark jointly evaluates the open-ended multi-turn…
      • Shopping reasoning is unique among language model applications
      • Unlike factual question answering or verifiable code generation, it requires balancing subjective preferences, budget constraints, and cross-product trade-offs…
  • MentalMARBERT: Domain-Adaptive Pre-training and Two-Stage Fine-Tuning for Arabic Mental Health Disorders Detection

    • Published: 2026-06-12 12:00 Beijing Time
    • Abstract: - arXiv:2606.12649v1 Announce Type: new.
      • Abstract: Detecting mental health disorders from Arabic social media text remains challenging due to dialectal variation, informal language, limited high-quality annotated resources, and severe class imbalance.
      • While English mental health natural language processing (NLP) has progressed substantially, Arabic multi-class disorder classification remains insufficiently studied.
      • This study proposes a two-phase framework for Arabic mental health text classification.
    • EN Highlights:
      • arXiv:2606.12649v1 Announce Type: new
      • Abstract: Detecting mental health disorders from Arabic social media text remains challenging due to dialectal variation, informal language, limited high-qualit…
      • While English mental health natural language processing (NLP) has progressed substantially, Arabic multi-class disorder classification remains insufficiently st…
      • This study proposes a two-phase framework for Arabic mental health text classification
  • Observable Patterns Are Not Explanations: A Causal-Geometric Analysis of Latent Reasoning Models

    • Published: 2026-06-12 12:00 Beijing Time
    • Abstract: - arXiv:2606.12689v1 Announce Type: new.
      • Abstract: Latent reasoning models (LRMs) replace explicit chain-of-thought with continuous thoughts.
      • Recent work treats observable latent-state patterns (e.g., BFS-like frontiers and decodable arithmetic computation) as evidence for internal reasoning mechanisms.
      • Evaluating two LRMs (Coconut and CODI) against controls that lack the proposed recurrence or curriculum, we find that these patterns also appear in the controls and do not always have a causal effect on behavior.
    • EN Highlights:
      • arXiv:2606.12689v1 Announce Type: new
      • Abstract: Latent reasoning models (LRMs) replace explicit chain-of-thought with continuous thoughts
      • Recent work treats observable latent-state patterns, such as BFS-like frontiers and decodable arithmetic computation, as evidence for internal reasoning mechani…
      • Evaluating two LRMs (Coconut and CODI) against controls lacking the proposed recurrence or curriculum, we find these patterns also appear in the controls and do…
  • AfriSUD: A Dependency Treebank Collection for Evaluating Models on African Languages

    • Publication Time: 2026-06-12 12:00 Beijing Time
    • Abstract: - arXiv:2606.12708v1 Announce Type: new.
      • Abstract: Despite their linguistic diversity and global significance, African languages remain underrepresented in research and resources to support NLP.
      • Our goal is to bridge this gap by introducing AfriSUD, the first large-scale collection of syntactically annotated treebanks for nine diverse African languages, covering major language families and regions of sub-Saharan Africa.
      • Using the Surface-Syntactic Universal Dependencies (SUD) framework, our community-led effort provides high-quality, native-speaker-verified data that captures typologically key features such as agglutination and tone.
    • EN Key Points:
      • arXiv:2606.12708v1 Announce Type: new
      • Abstract: Despite their linguistic diversity and global significance, African languages remain underrepresented in research and resources to support NLP
      • We aim to bridge this gap by introducing AfriSUD, the first large-scale collection of syntactically annotated treebanks for nine diverse African languages spann…
      • Using the Surface-Syntactic Universal Dependencies (SUD) framework, our community-led effort provides high-quality, native-speaker verified data that capture ty…
  • Does AI Reviewer See the Full Picture? Attacking and Defending Multimodal Peer Review

    • Publication Time: 2026-06-12 12:00 Beijing Time
    • Abstract: - arXiv:2606.12716v1 Announce Type: new.
      • Abstract: The integration of Large Language Models (LLMs) and Multimodal LLMs (MLLMs) into scientific peer-review workflows introduces novel and significant risks for adversarial manipulation, particularly given the multimodal nature of scientific papers where graphics (and not just text) convey core evidence.
      • This creates a significant gap: current robustness studies on AI peer-review are overwhelmingly text-only.
      • Furthermore, this problem differs from standard jailbreaking, as peer review attacks aim to induce domain-specific, targeted failures (e.g., “inflate this score”) rather than general security policy violations, for which no practical defenses currently exist.
    • EN Key Points:
      • arXiv:2606.12716v1 Announce Type: new
      • Abstract: The integration of Large Language Models (LLMs) and Multimodal LLMs (MLLMs) into scientific peer-review workflows introduces novel and significant ris…
      • This creates a significant gap: current robustness studies on AI peer-review are overwhelmingly text-only
      • Moreover, the problem is distinct from standard jailbreaking, as a peer-review attack seeks to induce a domain-specific, targeted failure (e.g., “inflate this s…
  • Agent-based models for the evolution of morphological alternation patterns

    • Published: 2026-06-12 12:00 Beijing Time
    • Abstract: - arXiv:2606.12748v1 Announce Type: new.
      • Abstract: Why is the past of English “go” the apparently unrelated “went”?
      • Such alternations are frequent in languages.
      • They neither aid communication nor learnability, yet they can be persistent, surviving over centuries or millennia.
    • EN Key Points:
      • arXiv:2606.12748v1 Announce Type: new
      • Abstract: Why is the past of English “go” the apparently unrelated “went”
      • Such alternations are frequent in languages
      • They neither aid communication nor learnability, yet they can be persistent, surviving over centuries or millennia

ArXiv cs.LG (B_intro+search) Link to heading

  • Restless bandits with imperfect binary feedback: PCL-indexability analysis and computation

    • Published: 2026-06-12 12:00 Beijing Time
    • Abstract: - arXiv:2606.11192v1 Announce Type: new.
      • Abstract: We study restless bandits with binary latent states and imperfect binary feedback, motivated by opportunistic spectrum access with sensing errors.
      • For the associated belief-state model, we develop a partial conservation laws (PCL)-based analytical and computational framework for establishing indexability and evaluating the Whittle index, building upon the verification theorem for real-state discounted restless bandits.
      • The framework analyzes the stochastic dynamics via an associated deterministic skeleton, renewal decompositions, and combinatorics on words.
    • EN Key Points:
      • arXiv:2606.11192v1 Announce Type: new
      • Abstract: We study restless bandits with binary latent states and imperfect binary feedback, motivated by opportunistic spectrum access with sensing errors
      • For the associated belief-state model, we develop a partial conservation laws (PCL)-based analytical and computational framework for establishing indexability a…
      • The framework analyzes the stochastic dynamics via an associated deterministic skeleton, renewal decompositions, and combinatorics on words
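To make the "belief-state model" in this abstract more tangible, the following Python sketch shows a standard Bayesian belief update for a two-state Markov channel observed through an error-prone binary sensor. It does not compute the Whittle index or reproduce the paper's PCL analysis. The parameter names (`p01`, `p11`, `false_alarm`, `miss`) and the convention that the belief is the probability of being in state 1 are assumptions chosen for illustration.

```python
def passive_update(belief, p01, p11):
    """Propagate the belief one step when the arm is not observed.

    belief: probability that the latent state is 1.
    p01, p11: probability of moving to state 1 from state 0 and from state 1.
    """
    return belief * p11 + (1.0 - belief) * p01


def belief_update(belief, observation, p01, p11, false_alarm, miss):
    """Bayes-correct the belief with a noisy binary observation, then propagate it.

    false_alarm: P(observe 1 | state 0); miss: P(observe 0 | state 1).
    Assumes the observation has nonzero probability under the current belief.
    """
    likelihood_1 = (1.0 - miss) if observation == 1 else miss
    likelihood_0 = false_alarm if observation == 1 else (1.0 - false_alarm)
    evidence = belief * likelihood_1 + (1.0 - belief) * likelihood_0
    posterior = belief * likelihood_1 / evidence
    return passive_update(posterior, p01, p11)


# With a perfect sensor, observing 1 pins the state, so the next belief is p11.
perfect = belief_update(0.5, 1, p01=0.2, p11=0.8, false_alarm=0.0, miss=0.0)
# With sensing errors, the same observation moves the belief less far.
noisy = belief_update(0.5, 1, p01=0.2, p11=0.8, false_alarm=0.1, miss=0.2)
```

The gap between `perfect` and `noisy` is the effect of imperfect feedback: the belief never collapses to a known state, so the index has to be defined over a continuum of belief values.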
  • To Intervene or Not: Guiding Inference-time Alignment with Probabilistic Model Blending

    • Published: 2026-06-12 12:00 Beijing Time
    • Abstract: - arXiv:2606.11201v1 Announce Type: new.
      • Abstract: The widespread deployment of LLMs makes model alignment necessary to enable newly trained models to respond safely and effectively to user instructions.
      • Among different methods, inference-time alignment is often cheaper as it only intervenes (i.e., provides guidance) during output generation.
      • Existing proposals apply guidance extracted from certain aligned models without properly assessing their reliability.
    • EN Key Points:
      • arXiv:2606.11201v1 Announce Type: new
      • Abstract: The wide deployment of LLMs has made model alignment necessary to make newly trained models safely and effectively respond to user instructions
      • Among different methods, inference-time alignment is often cheaper as it intervenes (i.e., offers guidances) only during output generation
      • Existing proposals apply guidances extracted from certain aligned models without properly assessing their reliability
  • Dual-Stance Evaluation of Sycophancy: The Structure of Agreement and the Limits of Intervention

    • Release Date: 2026-06-12 12:00 Beijing Time
    • Abstract: - arXiv:2606.11205v1 Announce Type: new.
      • Abstract: Activation steering can shift LLM behavior, but standard evaluations do not typically test whether a sycophancy-reduction direction also suppresses agreement with factually correct statements.
      • We introduce dual-stance evaluation, which tests both stances of each topic, and apply it to centroid-difference steering on Llama-3-8B-Instruct.
      • We find a dissociation: the model represents sycophantic and factual agreement in geometrically distinct subspaces, yet the steering direction projects equally into both, and cannot differentially target either.
    • EN Key Points:
      • arXiv:2606.11205v1 Announce Type: new
      • Abstract: Activation steering can shift LLM behaviour, but standard evaluations do not typically test whether a sycophancy-reduction direction also suppresses a…
      • We introduce dual-stance evaluation, which tests both stances of each topic, and apply it to centroid-difference steering on Llama-3-8B-Instruct
      • We find a dissociation: The model represents sycophantic and factual agreement in geometrically distinct subspaces, yet the steering direction projects equally…
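As a rough illustration of centroid-difference steering, the Python sketch below builds a steering direction from two sets of activation vectors and measures how strongly another vector projects onto it. The toy two-dimensional vectors, the function names, and the subtraction-based `steer` rule are assumptions for illustration; the paper's actual layers, scaling, and data are not reproduced here.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def centroid_difference(positive, negative):
    """Steering direction: centroid of `positive` minus centroid of `negative`."""
    return [p - q for p, q in zip(centroid(positive), centroid(negative))]


def steer(activation, direction, alpha):
    """Shift an activation against the direction by strength alpha."""
    return [a - alpha * d for a, d in zip(activation, direction)]


def projection(vector, direction):
    """Scalar projection of `vector` onto the (normalized) direction."""
    norm = sum(d * d for d in direction) ** 0.5
    return sum(v * d for v, d in zip(vector, direction)) / norm


# Toy activations (hypothetical): "sycophantic" vs. "neutral" responses.
sycophantic = [[1.0, 0.0], [3.0, 0.0]]
neutral = [[0.0, 0.0], [0.0, 2.0]]
direction = centroid_difference(sycophantic, neutral)
steered = steer([2.0, 0.0], direction, alpha=1.0)
```

A dual-stance check in this setting would compare `projection(x, direction)` for activations from sycophantic agreement and from factually correct agreement; similar projections for both would mean the direction cannot target one without affecting the other.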
  • Few-Shot Resampling for Scalable Statistically-Sound Data Mining

    • Release Date: 2026-06-12 12:00 Beijing Time
    • Abstract: - arXiv:2606.11235v1 Announce Type: new.
      • Abstract: A key step in knowledge discovery is the evaluation of data mining results.
      • In various applications, including pattern mining, graph analysis, etc., this step involves assessing the statistical significance of results to avoid spurious findings due solely to noise or random fluctuations in the data.
      • Although specialized procedures have been developed for certain specific applications, resampling-based methods are widely used, especially for complex analyses where analytical results cannot be derived.
    • EN Key Points:
      • arXiv:2606.11235v1 Announce Type: new
      • Abstract: A key step in knowledge discovery is the evaluation of data mining results
      • In several applications, including pattern mining, graph analysis, and others, this step includes the evaluation of the statistical significance of the results,…
      • While specialized procedures have been developed for some specific applications, resampling-based approaches are widely used, in particular for complex analyses…
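For context on what a "resampling-based approach" looks like, here is a minimal Python sketch of a conventional permutation test, which is the kind of baseline this line of work builds on. It is not the paper's few-shot procedure. The test statistic (absolute difference in means), the default of 999 resamples, and the function name are assumptions for illustration.

```python
import random


def permutation_p_value(group_a, group_b, n_resamples=999, seed=0):
    """Estimate a p-value for the difference in means by shuffling group labels.

    Uses the standard (1 + hits) / (1 + n_resamples) estimator, which never
    returns exactly zero.
    """
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    k = len(group_a)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # random relabelling under the null hypothesis
        stat = abs(sum(pooled[:k]) / k - sum(pooled[k:]) / (len(pooled) - k))
        if stat >= observed:
            hits += 1
    return (1 + hits) / (1 + n_resamples)


# Identical groups: every resample is at least as extreme as the observed zero.
p_same = permutation_p_value([1, 2, 3], [1, 2, 3])
# Clearly separated groups: very few resamples reach the observed gap.
p_far = permutation_p_value([0] * 10, [10] * 10)
```

The cost of this baseline grows linearly with `n_resamples`, since each resample recomputes the statistic on the full data, which is the scalability pressure that motivates using fewer resamples.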

  • ProHiFlo: Hierarchical Flow Matching with Functional Guidance for De Novo Protein Generation

    • Publication Time: 2026-06-12 12:00 Beijing Time
    • Abstract: - arXiv:2606.11243v1 Announce Type: new.
      • Abstract: De novo protein generation holds transformative potential in therapeutic design, enzyme engineering, and synthetic biology.
      • While methods based on diffusion and flow matching have made progress, they typically operate at a single resolution and lack mechanisms to incorporate functional constraints.
      • We introduce ProHiFlo, a hierarchical flow matching framework with three innovations: (1) coarse-to-fine generation that models backbone geometry before refining to all-atom coordinates, thereby reducing computational cost while maintaining accuracy; (2) functional guidance that uses pre-trained predictors to steer generation toward desired properties without retraining; and (3) an adaptive SE(3)-equivariant architecture for efficient multi-scale processing.
    • EN Highlights:
      • arXiv:2606.11243v1 Announce Type: new
      • Abstract: De novo protein generation has transformative potential in therapeutic design, enzyme engineering, and synthetic biology
      • While diffusion-based and flow matching approaches have achieved progress, they typically operate at single resolution and lack mechanisms for incorporating fun…
      • We introduce ProHiFlo, a hierarchical flow matching framework with three innovations: (1) coarse-to-fine generation that models backbone geometry before refinin…
  • Physics-informed generative AI for semiconductor manufacturing: Enforcing hard physical constraints in generative models by construction

    • Publication Time: 2026-06-12 12:00 Beijing Time
    • Abstract: - arXiv:2606.11247v1 Announce Type: new.
      • Abstract: Generative models are increasingly used to propose designs, data, and control actions for physical systems, but many such systems are governed by strict physical constraints rather than perceptual plausibility.
      • Semiconductor manufacturing provides a strict test case: generated masks, layouts, synthetic defect data, and process recipes must adhere to lithographic, transport, reaction, and device physics constraints, as physically invalid samples are not just low-quality but unusable.
      • This perspective argues that semiconductor manufacturing presents a broader challenge for computational science, where generative AI for constrained physical domains must be physics-informed by construction, not just corrected by post-hoc filtering.
    • EN Highlights:
      • arXiv:2606.11247v1 Announce Type: new
      • Abstract: Generative models are increasingly used to propose designs, data, and control actions for physical systems, yet many such systems are governed by hard…
      • Semiconductor manufacturing provides a demanding test case: generated masks, layouts, synthetic defect data, and process recipes must obey lithography, transpor…
      • This Perspective argues that semiconductor manufacturing exposes a broader computational-science challenge, namely that generative AI for constrained physical d…
  • Mechanical Field Networks: Structured Neural Dynamics for Multivariate Systems

    • Publication Time: 2026-06-12 12:00 Beijing Time
    • Abstract: - arXiv:2606.11251v1 Announce Type: new.
      • Abstract: Many multivariate dynamical systems are observed only through trajectories, so the mechanisms governing their joint dynamics remain hidden.
      • Existing methods can impose interpretable dynamics or learn flexible state transitions, but the resulting interaction structure is typically either specified in advance or implicit in the learned dynamics.
      • We introduce MF-Net, a recurrent dynamical model that represents all variables in a shared field state and updates this state through a learned relational law.
    • EN Key Points:
      • arXiv:2606.11251v1 Announce Type: new
      • Abstract: Many multivariate dynamical systems are observed only through trajectories, leaving the mechanisms governing their joint dynamics hidden
      • Existing approaches can impose interpretable dynamics or learn flexible state transitions, yet the resulting interaction structure is typically either specified…
      • We introduce MF-Net, a recurrent dynamical model that represents all variables in a shared field state and updates this state through a learned relation law
  • Bernstein-Schur Kernels: Random Features by Sketched Modulation and Radial Randomization

    • Publication Time: 2026-06-12 12:00 Beijing Time
    • Abstract: - arXiv:2606.11255v2 Announce Type: new.
      • Abstract: Bernstein–Schur kernels are products of a finite-feature kernel and a completely monotone shift-invariant kernel: nonstationary kernels that fall between the shift-invariant and dot-product templates used by random features, so neither Bochner sampling nor polynomial sketching can be applied directly to the full kernel.
      • We give one random-feature construction for the whole class that randomizes both factors: it sketches the finite modulation and samples the radial factor's one-dimensional Bernstein-Widder scale before applying Gaussian random Fourier features, giving a feature dimension of $Dm$ that is independent of the $O(d^2)$ size of the exact modulation features.
      • With the modulation kept exact (the $m\to\infty$ limit), we prove unbiasedness, an exact variance, and a matrix-Bernstein operator-norm bound controlled by the top kernel and modulation eigenvalues and by an intrinsic dimension, rather than by the $N\max_{ij}$ route.
    • EN Key Points:
      • arXiv:2606.11255v2 Announce Type: new
      • Abstract: Bernstein–Schur kernels are products of a finite-feature kernel and a completely monotone shift-invariant kernel: nonstationary kernels falling betwe…
      • We give one random-feature construction for the whole class that randomizes both factors: it sketches the finite modulation and samples the radial factor’s one-…
      • With the modulation kept exact (the $m\to\infty$ limit), we prove unbiasedness, an exact variance, and a matrix-Bernstein operator-norm bound controlled by the…
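    • Illustrative sketch (assumptions labeled, not the paper's implementation): the "sample the scale, then apply Gaussian random Fourier features" idea can be tried on one concrete kernel. The sketch assumes the radial factor is $f(\|x-y\|^2) = 1/(1+\|x-y\|^2)$, whose Bernstein-Widder measure is Exp(1), and assumes a linear modulation kernel $x^\top y$ kept exact rather than sketched. The function names are hypothetical.

```python
import numpy as np


def radial_features(X, D, rng):
    """Random features for 1 / (1 + ||x - y||^2).

    Uses 1/(1+t) = E_s[exp(-s t)] with s ~ Exp(1), and
    exp(-s ||x - y||^2) = E[cos(w . (x - y))] with w ~ N(0, 2 s I).
    """
    n, d = X.shape
    s = rng.exponential(scale=1.0, size=D)             # sampled 1D scales
    W = rng.normal(size=(D, d)) * np.sqrt(2.0 * s)[:, None]
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)          # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W.T + b)


def product_features(X, D, rng):
    """Features for (x . y) / (1 + ||x - y||^2) via a tensor product.

    The modulation features are exact here (phi(x) = x), an assumption
    made to keep the toy short; the output dimension is d * D.
    """
    Z = radial_features(X, D, rng)
    return (X[:, :, None] * Z[:, None, :]).reshape(X.shape[0], -1)


def exact_kernel(X):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return (X @ X.T) / (1.0 + sq)


X = np.array([[0.5, -0.2, 0.1],
              [0.3, 0.4, -0.6],
              [-0.7, 0.2, 0.5],
              [0.1, 0.1, 0.9]])
F = product_features(X, 20000, np.random.default_rng(0))
print("max abs error:", np.max(np.abs(F @ F.T - exact_kernel(X))))
```

    • The inner product of the tensor-product features factorizes into the modulation inner product times the radial estimate, which is why the estimate is unbiased for the product kernel in this toy setting.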

  • Loss Landscape Diagnosis for Gradient-Based Gray-Scott System Inversion: Disentangling the Roles of PINN Components

    • Publication Time: 2026-06-12 12:00 Beijing Time
    • Abstract: - arXiv:2606.11258v1 Announce Type: new.
      • Abstract: Gradient-based inversion of reaction-diffusion systems is typically approached via surrogate models or physics-informed neural networks (PINNs), while the most direct route, backpropagation through the PDE structure itself, is largely avoided.
      • We pursue this direct route as a diagnostic probe, backpropagating a steady-state loss through unrolled Gray-Scott simulation to recover its parameters, with no surrogates or neural network augmentation.
      • Optimization fails to converge, and plotting the landscape directly locates the failure in its geometry: flat plateaus with no gradient signal, bounded by sharp cliffs aligned with bifurcation boundaries. This structure recurs in the loss function and is inherited wherever gradients are routed to the parameters.
    • EN Key Points:
      • arXiv:2606.11258v1 Announce Type: new
      • Abstract: Gradient-based inversion of reaction-diffusion systems is typically approached via surrogate models or physics-informed neural networks (PINNs), while…
      • We pursue this direct route as a diagnostic probe, backpropagating a steady-state loss through unrolled Gray-Scott simulation to recover its parameters, with no…
      • Optimization fails to converge, and plotting the landscape directly locates the failure in its geometry – flat plateaus with no gradient signal, bounded by sha…
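    • Illustrative sketch (not the paper's code): a one-dimensional slice of such a loss landscape can be computed by running an unrolled Gray-Scott simulation for each candidate feed rate and comparing against a target state. This sketch scans the loss only and does not backpropagate. The grid size, step count, diffusion coefficients, seed pattern, and parameter values are assumptions chosen for a small explicit-Euler toy.

```python
import numpy as np


def laplacian(a):
    """Five-point Laplacian with periodic boundaries (grid spacing 1)."""
    return (np.roll(a, 1, 0) + np.roll(a, -1, 0)
            + np.roll(a, 1, 1) + np.roll(a, -1, 1) - 4.0 * a)


def simulate(F, k, steps=300, n=32, Du=0.16, Dv=0.08, dt=1.0):
    """Unrolled explicit-Euler Gray-Scott simulation; returns (u, v)."""
    u = np.ones((n, n))
    v = np.zeros((n, n))
    c = slice(n // 2 - 3, n // 2 + 3)
    u[c, c], v[c, c] = 0.5, 0.25       # deterministic square seed
    for _ in range(steps):
        uvv = u * v * v
        u, v = (u + dt * (Du * laplacian(u) - uvv + F * (1.0 - u)),
                v + dt * (Dv * laplacian(v) + uvv - (F + k) * v))
    return u, v


def loss(F, k, target_v):
    """Mean squared error between the final v field and the target."""
    _, v = simulate(F, k)
    return float(np.mean((v - target_v) ** 2))


def scan_feed_rate(F_grid, k, target_v):
    """Evaluate the loss along a 1D slice of parameter space."""
    return np.array([loss(F, k, target_v) for F in F_grid])


TRUE_F, TRUE_K = 0.035, 0.065          # hypothetical ground-truth parameters
_, target_v = simulate(TRUE_F, TRUE_K)
F_grid = np.linspace(0.02, 0.06, 9)
for F, value in zip(F_grid, scan_feed_rate(F_grid, TRUE_K, target_v)):
    print(f"F={F:.3f}  loss={value:.6f}")
```

    • Plotting the scanned values against `F_grid` gives the kind of direct landscape view the abstract describes; whether plateaus and cliffs appear depends on the parameter range and simulation length chosen.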
  • PermDoRA – Understanding Adapter Interference in Language Models: Limits of Parameter-Space Geometry

    • Publication Time: 2026-06-12 12:00 Beijing Time
    • Abstract: - arXiv:2606.11262v1 Announce Type: new.
      • Abstract: Access control in large language models (LLMs) requires modular mechanisms to enable domain-specific behavior without retraining or cross-domain interference.
      • A common assumption is that interference during adapter composition arises from overlapping linear parameter updates, suggesting that enforcing orthogonality or directional independence should improve multi-domain performance.
      • We test this hypothesis using DoRA-RBAC, a hierarchical adapter composition framework based on weight-decomposed low-rank adaptation.
    • EN Key Points:
      • arXiv:2606.11262v1 Announce Type: new
      • Abstract: Access control in large language models (LLMs) requires modular mechanisms to enable domain-specific behavior without retraining or cross-domain inter…
      • A common hypothesis is that interference during adapter composition arises from overlap in linear parameter updates, suggesting that enforcing orthogonality or…
      • We test this hypothesis using DoRA-RBAC, a hierarchical adapter composition framework based on weight-decomposed low-rank adaptation
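    • Illustrative sketch (a generic toy, not DoRA-RBAC or the paper's procedure): the hypothesis under test treats interference as overlap between linear parameter updates. One simple way to quantify and remove that overlap is a Frobenius cosine similarity followed by a single Gram-Schmidt step. The shapes, ranks, and function names are assumptions for illustration.

```python
import numpy as np


def lowrank_update(rng, d_out, d_in, rank):
    """Random low-rank update dW = B @ A, standing in for a trained adapter."""
    B = rng.normal(size=(d_out, rank))
    A = rng.normal(size=(rank, d_in))
    return B @ A


def overlap(dW1, dW2):
    """Cosine similarity of two updates under the Frobenius inner product."""
    return float(np.sum(dW1 * dW2)
                 / (np.linalg.norm(dW1) * np.linalg.norm(dW2)))


def orthogonalize(dW2, dW1):
    """Remove from dW2 its component along dW1 (one Gram-Schmidt step).

    The result is orthogonal to dW1 in parameter space, but its rank
    may exceed the original adapter rank.
    """
    coeff = np.sum(dW1 * dW2) / np.sum(dW1 * dW1)
    return dW2 - coeff * dW1


rng = np.random.default_rng(0)
dW_a = lowrank_update(rng, 16, 16, 4)
dW_b = 0.5 * dW_a + lowrank_update(rng, 16, 16, 4)  # deliberately overlapping
print("overlap before:", overlap(dW_a, dW_b))
print("overlap after: ", overlap(dW_a, orthogonalize(dW_b, dW_a)))
```

    • This measures geometry in parameter space only; it says nothing about whether the composed model's behavior is free of interference, which is the question the paper's title raises.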