Microsoft 365 Copilot Blog

Researcher agent in Microsoft 365 Copilot

Mar 26, 2025

Your assistant for deep research at work

Gaurav Anand, CVP, Microsoft 365 Engineering

 

Recent advancements in reasoning models are transforming chain-of-thought-based iterative reasoning, enabling AI systems to distill vast amounts of data into well-founded conclusions. While some web-centric deep research tools have emerged, modern information workers need these models to reason across both enterprise and web data. For Microsoft 365 users, producing thorough, accurate, and deeply contextualized research reports is crucial, as these reports can influence market-entry strategies, sales pitches, and R&D investments.

Researcher addresses this gap by navigating and reasoning over enterprise data sources such as emails, chats, meeting recordings, documents, and ISV/LOB applications. Although these workflows are slower than the near real-time performance of Microsoft 365 Copilot Chat, the resulting depth and accuracy save employees hours of time and effort.

Our Approach

Our approach mirrors the methodology a human would take when tasked with researching a subject: seek any needed clarification, devise a higher-order plan, and then break the problem into subtasks. They would then begin an iterative loop of Reason → Retrieve → Review for each subtask, collecting findings on a scratch pad until further research is unlikely to yield new information, at which point they would synthesize the final report. We instilled these behaviors into Researcher with a structured, multi-phase process.

Initial planning phase (P0)

The agent analyzes the user utterance and context to formulate a high-level plan. During this phase, the agent might ask the user clarifying questions to ensure the final output aligns with user expectations in both content and format. We define insights from this phase as I0.

Iterative research phase

The Researcher agent then loops through iterative cycles until it hits diminishing returns, starting with j = 1.

  • Reasoning (Rj): deep analysis to identify which subtask to tackle and what specific details are missing
  • Retrieval (Tj): search across documents, emails, messages, calendar items, transcripts, and/or web data to fetch the missing details
  • Review (Vj): evaluation of the collected evidence, computing its relevance to the original user utterance and preserving the findings on a “scratch pad”

We define ΔIj to be the new insights gained in iteration j from Rj, Tj, and Vj. These are added to the prior knowledge: Ij = Ij−1 ∪ ΔIj.

Note that with each cycle, the marginal insight ΔIj tends to diminish. The agent monitors this and implements a stopping check: research concludes at iteration m when ΔIm < ε.
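To make the loop concrete, here is a minimal Python sketch of the control flow described above. It is purely illustrative: the sources dictionary, the keyword-overlap relevance score, and the round-robin subtask selection are hypothetical stand-ins for the model's actual reasoning, retrieval, and review steps, which are not published.

```python
def keyword_overlap(text: str, query: str) -> float:
    """Naive relevance score: fraction of query words that appear in the text."""
    words = set(query.lower().split())
    return sum(w in text.lower() for w in words) / max(len(words), 1)


def research(query: str, sources: dict[str, list[str]],
             epsilon: int = 1, max_iters: int = 10) -> str:
    """Reason -> Retrieve -> Review loop with a diminishing-returns stop."""
    insights: set[str] = set()            # I_0, after the planning phase
    scratch_pad: list[str] = []
    names = list(sources)
    for j in range(max_iters):
        # R_j: decide which subtask/source to tackle next (round-robin stand-in)
        name = names[j % len(names)]
        # T_j: retrieve candidate evidence from that source
        evidence = sources[name]
        # V_j: review the evidence, keeping only query-relevant passages
        findings = {p for p in evidence if keyword_overlap(p, query) > 0.2}
        delta = findings - insights       # ΔI_j: insights new to this iteration
        insights |= delta                 # I_j = I_{j-1} ∪ ΔI_j
        scratch_pad.extend(sorted(delta))
        if len(delta) < epsilon:          # conclude research once ΔI_m < ε
            break
    # Synthesis phase: consolidate scratch-pad findings into a report
    return "\n".join(f"- {finding}" for finding in scratch_pad)
```

The stopping condition mirrors the ΔIm < ε check: once an iteration contributes fewer than epsilon new findings, the loop ends and synthesis begins.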

Synthesis phase

The agent synthesizes the aggregate Im by consolidating findings, analyzing patterns, drawing conclusions, and drafting a coherent report. The output includes explanations and cites sources to provide traceability.

The Researcher agent in action 

To illustrate, if a user asks, “How did our Product P perform in Q4 compared to industry trends?”, the phases would be as follows.

Planning

Identifying subtasks: (1) get internal Q4 sales numbers for Product P; (2) find industry news or analyst reports on Q4 trends.  

Asking clarifying questions, e.g., about a specific region or competitor focus.

Iterative research

In iteration 1, it: 

  • Reasons: starts with internal sales data
  • Retrieves: pulls the Q4 sales report
  • Reviews: observes the contribution of feature F in driving Product P’s sales growth

In iteration 2, it: 

  • Reasons: adapts the plan to explore feature F
  • Retrieves: pulls internal and external communications about F; searches the web for competitor offerings
  • Reviews: notes customer reception of F and related industry news

Iteration by iteration, it gathers pieces of the puzzle until new iterations yield only minor details. 

Synthesis

Researcher then drafts a report detailing a thorough comparison of Product P’s Q4 performance against the market, citing the internal sales numbers and external industry analysis and highlighting feature F as a competitive differentiator.

Technical Implementation

Our current implementation leverages OpenAI’s deep research model, powered by a version of the upcoming OpenAI o3 model trained specifically for research tasks. Performance benchmarks highlight its efficacy, achieving 26.6% accuracy on Humanity’s Last Exam (HLE) and an average score of 72.6% on the GAIA reasoning benchmark¹.

Included below are a few technical approaches that were employed to build Researcher: 

Reasoning over enterprise data

We have expanded the model’s toolkit with Copilot tools that can retrieve both first-party enterprise data (such as meetings, events, and internal documents) and third-party content brought in through Graph connectors, such as shared company wikis and integrated CRM systems. These tools are part of the Copilot Control System, which allows IT administrators and security professionals to secure, manage, and analyze the use of Researcher. The Copilot tools are exposed to the model through a familiar interface it was trained on, such as the ability to “open” a document and “scroll” or “find” information within it.
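As a rough illustration of what such a browse-style tool surface might look like, here is a hypothetical sketch; the class and method names are invented for illustration and are not the actual Copilot tool API.

```python
class DocumentTool:
    """Hypothetical browse-style tool surface over a document, mirroring the
    open/scroll/find affordances described above."""

    PAGE_SIZE = 20  # lines shown per "screen"

    def __init__(self, lines: list[str]):
        self.lines = lines
        self.cursor = 0

    def open(self) -> str:
        """Show the first page of the document."""
        self.cursor = 0
        return self._page()

    def scroll(self, pages: int = 1) -> str:
        """Move the viewport forward (or backward) by whole pages."""
        self.cursor = max(0, self.cursor + pages * self.PAGE_SIZE)
        return self._page()

    def find(self, term: str) -> str:
        """Jump to the first line at or after the cursor containing the term."""
        for i, line in enumerate(self.lines[self.cursor:], start=self.cursor):
            if term.lower() in line.lower():
                self.cursor = i
                return self._page()
        return "term not found"

    def _page(self) -> str:
        return "\n".join(self.lines[self.cursor:self.cursor + self.PAGE_SIZE])
```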

We have experimented with different techniques to address deviations from the distribution of the model’s original training data, which arise from inherent differences between web and enterprise research queries. Internal evaluations revealed that Researcher typically requires 30–50% more iterations to achieve equivalent coverage on enterprise-specific queries than on public web data.

Personalization with enterprise context 

Unlike web research where results are uniform regardless of user, Researcher produces highly personalized results. It leverages the enterprise knowledge graph to integrate user and organizational context, including details about people, projects, products, and the unique interplay of these entities within the user's work. 

For instance, when a user says, “Help me learn more about Olympus,” the system quickly identifies that Olympus is an internal AI initiative and understands that the user's team plans to take a dependency on it. This rich contextualization enables the system to: 

  • Ask more nuanced clarifying questions, such as: “Should we focus on the foundational research aspects of Olympus, or are you more interested in integration details?” 
  • Tailor the starting condition (P0) for the deep research model so it’s not only precise but also personalized, thereby mitigating the model’s lack of familiarity with company-specific jargon (sketched below)
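A minimal sketch of this grounding step is shown below. The entity records and helper names are hypothetical; in practice this context would come from the enterprise knowledge graph, not a hard-coded dictionary.

```python
# Hypothetical entity records; in production this would be a query against
# the enterprise knowledge graph rather than a local dictionary.
KNOWLEDGE_GRAPH = {
    "olympus": {
        "type": "internal AI initiative",
        "relationship": "your team plans to take a dependency on it",
        "facets": ["foundational research", "integration details"],
    },
}


def personalize_p0(utterance: str) -> tuple[str, list[str]]:
    """Ground ambiguous terms in enterprise context before research begins."""
    context_notes: list[str] = []
    questions: list[str] = []
    for token in utterance.lower().split():
        entity = KNOWLEDGE_GRAPH.get(token.strip(".,?!"))
        if entity:
            context_notes.append(
                f"'{token}' is an {entity['type']}; {entity['relationship']}."
            )
            questions.append(f"Should we focus on {' or '.join(entity['facets'])}?")
    # P0: the original ask, enriched with resolved enterprise context
    p0 = utterance + "\n\nContext:\n" + "\n".join(context_notes)
    return p0, questions


p0, clarifying = personalize_p0("Help me learn more about Olympus")
```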

Deep retrieval complementing deep reasoning

For each query, Researcher retrieves a broad set of results, and for each returned document it retrieves semantic passages, increasing the insights gained per retrieval step Tj.

Instead of a serial iterative approach, Researcher first performs broad but shallow retrieval across heterogeneous data sources and then lets the model decide which domains and entities to zoom into.
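A hypothetical sketch of this broad-then-deep pattern follows, assuming per-domain search functions and a model-driven domain selector (both stand-ins invented here):

```python
from typing import Callable, Iterable

SearchFn = Callable[[str, int], list[str]]


def broad_then_deep(query: str,
                    sources: dict[str, SearchFn],
                    choose_domains: Callable[[str, dict], Iterable[str]],
                    shallow_k: int = 3, deep_k: int = 25) -> list[str]:
    """Broad-but-shallow first pass, then deep retrieval where it pays off."""
    # Pass 1: a few passages from every heterogeneous source
    preview = {name: search(query, shallow_k) for name, search in sources.items()}
    # The model inspects the previews and picks domains/entities to zoom into;
    # choose_domains stands in for that reasoning step.
    selected = choose_domains(query, preview)
    # Pass 2: deep retrieval restricted to the selected domains
    results: list[str] = []
    for name in selected:
        results.extend(sources[name](query, deep_k))
    return results
```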

Integrating specialized agents

In enterprise contexts, interpreting data often demands the nuanced perspective of domain-specific experts. That’s why agents are a critical part of the Microsoft 365 Copilot ecosystem.  

Researcher is being extended to integrate seamlessly with other agents. For instance, Researcher can leverage the Sales Agent to apply advanced time-series modeling and surface an insight like, “Sales in Europe are expected to be 5% above quota, driven by product X.”

Moreover, these tools and agents can be chained together. For example, if a user asks, “Help me prepare for my customer meetings next week,” the system first employs calendar search to identify the relevant customers; then, in addition to searching over recent communications, it also retrieves CRM information from the Sales Agent.

By allowing Researcher to delegate complex subtasks to these specialists, we help compress multi-step reasoning iterations into a single step and complement Researcher’s intelligence with specialist knowledge.
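The example above might be wired together roughly as follows. Every helper function and the Sales Agent stub here is a hypothetical placeholder for the real Copilot tools and agents:

```python
# All helpers below are hypothetical stand-ins for Copilot tools and agents.
def calendar_search(week: str) -> list[dict]:
    return [{"customer": "Contoso", "time": "Tue 10:00"}]


def search_communications(customer: str) -> list[str]:
    return [f"Latest email thread with {customer}"]


class SalesAgent:
    """Stand-in for a specialized agent that owns CRM data and models."""
    def get_account(self, customer: str) -> dict:
        return {"pipeline": "renewal", "quota_attainment": "105%"}


sales_agent = SalesAgent()


def prepare_for_customer_meetings(week: str) -> dict:
    """Chain: calendar search -> communications search -> CRM via Sales Agent."""
    briefing = {}
    for meeting in calendar_search(week):            # step 1: find the customers
        customer = meeting["customer"]
        briefing[customer] = {
            "meeting": meeting["time"],
            "recent_comms": search_communications(customer),  # step 2
            "crm": sales_agent.get_account(customer),         # step 3: delegated
        }
    return briefing
```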

Results and Impact

Even in early testing, Researcher has demonstrated tangible benefits.

Response quality

We evaluated Researcher extensively in early trials, focusing on complex prompts that require consulting multiple sources. For quality assessment, we employed a framework called ACRU, which rates each answer on four dimensions:

  • Accuracy (factual correctness)
  • Completeness (coverage of all key points)
  • Relevance (focus on the user’s query without extraneous information)
  • Usefulness (utility of the answer for accomplishing the task)

Each dimension is scored from 1 (very poor) to 5 (excellent) by both human and LLM-based reviewers.
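As a concrete illustration, rubric scores from multiple reviewers might be aggregated along these four dimensions as in the sketch below; the simple averaging scheme is our assumption, as the post does not specify how scores are combined.

```python
from statistics import mean

ACRU_DIMENSIONS = ("accuracy", "completeness", "relevance", "usefulness")


def acru_score(ratings: list[dict[str, int]]) -> dict[str, float]:
    """Average per-dimension 1-5 ratings across human and LLM reviewers."""
    for rating in ratings:
        assert all(1 <= rating[d] <= 5 for d in ACRU_DIMENSIONS), "scores are 1-5"
    return {d: mean(r[d] for r in ratings) for d in ACRU_DIMENSIONS}


# e.g., one human reviewer and one LLM judge rating the same answer
print(acru_score([
    {"accuracy": 5, "completeness": 4, "relevance": 5, "usefulness": 4},
    {"accuracy": 4, "completeness": 4, "relevance": 5, "usefulness": 5},
]))
```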

When we compared Researcher’s performance against our baseline Microsoft 365 Copilot Chat on a diverse set of 1K queries, we saw an 88.5% increase in accuracy, a 70.4% increase in completeness, a 25.9% increase in relevance, and a 22.2% increase in usefulness.

It is worth noting that the agent’s improved accuracy comes from its ability to double-check facts: in the evaluation above, it cited on average ~10.1 sources per response. 61.5% of answers included at least one enterprise document as a source, 58.5% included a web page, 55.4% cited an email, and 33.8% pulled in a snippet from a meeting transcript.

Time savings

For this measurement, we surveyed two groups of internal users:

  • 22 Product Managers responsible for crafting product strategy documents and project updates to align stakeholders
  • 12 Account Managers interacting with Microsoft customers, writing client proposals, and maintaining clear communication with stakeholders

The feedback from both groups has been extremely positive. Users reported tasks that previously took days of manual research could be completed in minutes with the agent’s help. Overall, our pilot users estimated that Researcher saved them 6–8 hours per week, essentially eliminating an entire day’s worth of drudgery.

Here is a verbatim quote from a product manager: “it even found data in an archive I wouldn’t have checked. Knowing the AI searched everywhere—my meeting transcripts, shared files, the web—makes me trust the final recommendation much more.” I have found myself using Researcher daily; its ability to reason and connect the dots leads to magical moments. Below is a snippet from a report it produced to prepare for my upcoming meetings.

The appointment at 11:30am was a placeholder for me to send out broad communication to the team with some survey results. Researcher identified that I had done this already and encouraged me to use the time instead to collect feedback from the team.

What's Next

Reinforcement Learning

We will continue to improve the quality of Researcher to make reports more complete, accurate and useful. The next phase of adaptation to enterprise data will involve post-training reasoning models on real-world, multi-step work tasks using reinforcement learning.

This will involve learning a policy function π(s) → a, which picks the next step a as a function of the current state s to maximize the cumulative reward:

  • Steps are the range of actions available to the model (reasoning, tool calls, synthesis)
  • State encapsulates the user’s initial utterance and the insights Ii accumulated thus far
  • Reward function evaluates output quality at each decision point

Formally, we interleave internal reasoning and actions to build the cumulative insight Ii = Ii−1 + R(si, ai), where R(si, ai) denotes the reward obtained by taking action ai in state si. Through successive iterations, the model learns an optimized policy π(s).
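The episode structure this describes might look like the following toy sketch. The action space, random policy, and reward function here are illustrative stand-ins, not the training setup itself:

```python
import random

ACTIONS = ("reason", "retrieve", "synthesize")  # the step space described above


def rollout(policy, reward_fn, utterance: str, max_steps: int = 10):
    """One episode: the policy picks actions; insight accumulates as
    I_i = I_{i-1} + R(s_i, a_i)."""
    state = {"utterance": utterance, "insight": 0.0, "history": []}
    for _ in range(max_steps):
        action = policy(state)              # a_i = π(s_i)
        reward = reward_fn(state, action)   # R(s_i, a_i): output-quality signal
        state["insight"] += reward          # I_i = I_{i-1} + R(s_i, a_i)
        state["history"].append(action)
        if action == "synthesize":          # episode ends once the report drafts
            break
    return state["history"], state["insight"]


# Toy stand-ins: a random policy and a reward that favors retrieval
random_policy = lambda s: random.choice(ACTIONS)
toy_reward = lambda s, a: 1.0 if a == "retrieve" else 0.1
trajectory, total_insight = rollout(random_policy, toy_reward, "Research Olympus")
```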

To achieve this, we will focus on creating datasets for high quality research reports and investing in robust evaluation metrics and benchmarks.

User control

Researcher reasons across the knowledge sources the user has access to and finds the most useful nuggets of information. However, we understand that our users and enterprises often need more control over information sources. To this end, Researcher will allow “steerability” over the sources from which the report is created. Below is an early visual of what this could look like.

 

Agentic orchestration

Agentic orchestration is a core capability of Researcher. We have already integrated a few Microsoft agents, and we will generalize this capability. Moreover, we will afford end users and admins the ability to customize Researcher by bringing their own agents into the Researcher workflow.

For example, imagine a law firm has created an agent to format reports into legal briefs. We will allow the output of Researcher to be chained into this custom agent to produce the final, formatted brief.

Conclusion

Researcher can significantly transform knowledge workers’ everyday tasks. Early results show that users trust the agent to deliver factually accurate and detailed reports that save time and drive productivity. As we expand the capabilities of Researcher, improve quality and allow deeper customization, we envision a future where Researcher evolves into a trusted and indispensable tool in the workplace.    

For additional details on Researcher, including rollout and availability for customers, please also check out our blog post highlighting reasoning agents within M365 Copilot and more.

 


 

¹ Introducing deep research | OpenAI
