Agentic RAG: Company Knowledge Slack Agents

I would have figured that most companies would have built or implemented their own RAG agents by now.

An AI knowledge agent can dig through internal documentation — websites, PDFs, random docs — and answer employees in Slack (or Teams/Discord) within a few seconds. These bots should significantly reduce the time employees spend sifting through information.

I’ve seen a few of these in bigger tech companies, like AskHR from IBM, but they aren’t all that mainstream yet.

If you’re keen to understand how they are built and how much time and money it takes to build one, this article is for you.

Example agent in Slack for a company

I’ll go through the tools, techniques, and architecture involved, while also looking at the economics of building something like this. I’ll also include a section on what you’ll end up focusing the most on.

If you’re already familiar with RAG, feel free to skip the next section — it’s just a bit of repetitive stuff around agents and RAG.

What is RAG and Agentic RAG?

Retrieval-Augmented Generation (RAG) is a way to fetch information that gets fed into the large language model (LLM) before it answers the user’s question.

This allows us to provide relevant information from various documents to the bot in real time so it can answer the user correctly.

Example of a RAG pipeline

This retrieval system is doing more than simple keyword search, as it finds similar matches rather than just exact ones. For example, if someone asks about fonts, a similarity search might return documents on typography.
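
To make the idea concrete, here is a minimal sketch of what that similarity search does under the hood, using OpenAI's text-embedding-3-small and plain cosine similarity (the in-memory list is just for illustration; in practice this lives in a vector database):

import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    # One embedding call per text; real pipelines batch these
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def top_k(query: str, chunks: list[str], k: int = 3) -> list[str]:
    # Rank stored chunks by cosine similarity to the query embedding
    q = embed(query)
    scored = []
    for chunk in chunks:
        c = embed(chunk)
        scored.append((float(np.dot(q, c) / (np.linalg.norm(q) * np.linalg.norm(c))), chunk))
    return [chunk for _, chunk in sorted(scored, reverse=True)[:k]]

# A question about "fonts" will rank a typography chunk highly,
# even though the exact keyword never appears in it.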

Many would say that RAG is a fairly simple concept to understand, but how you store information, how you fetch it, and what kind of embedding models you use still matter a lot.

If you’re keen to learn more about embeddings and retrieval, I’ve written about this here.

Today, people have gone further and primarily work with agent systems.

In agent systems, the LLM can decide where and how it should fetch information, rather than just having content dumped into its context before generating a response.

Agentic RAG

It’s important to remember that just because more advanced tools exist doesn’t mean you should always use them. You want to keep the system intuitive and also keep API calls to a minimum.

With agent systems, the number of API calls increases, since the model needs to call at least one tool and then make another call to generate a response.

That said, I really like the user experience of the bot “going somewhere” — to a tool — to look something up. Seeing that flow in Slack helps the user understand what’s happening.

But going with an agent or using a full framework isn’t necessarily the better choice. I’ll elaborate on this as we continue.

Technical Stack

I did a bit of research before picking an agent framework, vector database, and deployment option, so I’ll go through some choices.

For the deployment option, since we’re working with Slack webhooks, we’re dealing with event-driven architecture where the code only runs when there’s a question in Slack.

To keep costs to a minimum, we can use serverless functions. The choice is either going with AWS Lambda or picking a newer vendor.

Lambda vs Modal

Platforms like Modal are technically built to serve LLM models, but they work well for long-running ETL processes, and for LLM apps in general.

Modal hasn’t been battle-tested as much, and you’ll notice that in terms of latency, but it’s very smooth and offers super cheap CPU pricing.

I should note though that when setting this up with Modal on the free tier, I’ve had a few 500 errors, but that might be expected.
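
Whichever provider you pick, the handler itself looks roughly the same: Slack POSTs an event, you acknowledge it within a few seconds, and the actual agent work happens afterwards. Here's a rough, generic sketch of that pattern using FastAPI (the endpoint path and the answer_in_slack helper are my own placeholders, not Modal- or Lambda-specific code):

from fastapi import FastAPI, Request, BackgroundTasks

app = FastAPI()

@app.post("/slack/events")
async def slack_events(request: Request, background_tasks: BackgroundTasks):
    payload = await request.json()

    # Slack sends a one-time challenge when you first register the endpoint
    if payload.get("type") == "url_verification":
        return {"challenge": payload["challenge"]}

    event = payload.get("event", {})
    if event.get("type") == "app_mention":
        # Hand the question off so we can return a 200 within Slack's time limit
        background_tasks.add_task(answer_in_slack, event)

    return {"ok": True}

def answer_in_slack(event: dict) -> None:
    # Placeholder: run the RAG agent and post the answer back to the thread
    ...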

As for how to pick the agent framework, this is completely optional. I did a comparison piece a few weeks ago on open-source agentic frameworks that you can find here, and the one I left out was LlamaIndex.

So I decided to give it a try here.

The last thing you need to pick is a vector database, or a database that supports vector search. This is where we store the embeddings and other metadata, so we can perform similarity search when a user’s query comes in.

There are a lot of options out there, but I think the ones with the highest potential are Weaviate, Milvus, pgvector, Redis, and Qdrant.

Vector databases

Both Qdrant and Milvus have pretty generous free tiers for their cloud options. Qdrant, I know, allows us to store both dense and sparse vectors. LlamaIndex, along with most agent frameworks, supports many different vector databases, so any of them can work.

I’ll try Milvus more in the future to compare performance and latency, but for now, Qdrant works well.

Redis is a solid pick too, or really any vector extension of your existing database.

Cost & time to build

In terms of time and cost, you have to account for engineering hours, cloud, embedding, and LLM costs.

It doesn’t take that much time to boot up a framework to run something minimal. What takes time is connecting the content properly, prompting the system, parsing the outputs, and making sure it runs fast enough.

But if we turn to overhead costs, the cloud cost of running the agent system is minimal for a single bot at one company using serverless functions, as you saw in the table in the last section.

However, for the vector databases, it will get more expensive the more data you store.

Both Zilliz and Qdrant Cloud have fairly generous free tiers for your first 1 to 5 GB of data, so unless you go beyond a few thousand chunks, you may not pay anything at all.

Cost Vector Databases

You will start paying once you go beyond that point, though, with Weaviate being the most expensive of the vendors above.

As for the embeddings, these are generally very cheap.

You can see a table below on the cost of using OpenAI’s text-embedding-3-small with chunks of different sizes when embedding 1 to 10 million texts.

Cost Embeddings

When people start optimizing for embeddings and storage, they’ve usually moved beyond embedding millions of texts.

The one thing that matters the most though is what large language model (LLM) you use. You need to think about API prices, since an agent system will typically call an LLM two to four times per run.

Cost LLM

For this system, I’m using GPT-4o-mini or Gemini 2.0 Flash, which are among the cheapest options.

So let’s say a company uses the bot a few hundred times per day and each run costs us 2–4 API calls; we might end up at less than a dollar per day, or around $10–50 per month.

You can see that switching to a more expensive model would increase the monthly bill by 10x to 100x. Using ChatGPT is mostly subsidized for free users, but when you build your own applications you’ll be financing it.

There will be smarter and cheaper models in the future, so whatever you build now will likely improve over time. But start small, because costs add up and for simple systems like this you don’t need them to be exceptional.

The next section will get into how to build this system.

The architecture (processing documents)

The system has two parts. The first is how we split up documents — what we call chunking — and embed them. This first part is very important, as it will dictate how the agent answers later.

Chunking documents

So, to make sure you’re preparing all the sources properly, you need to think carefully about how to chunk them.

If you look at the document above, you can see that we can miss context if we split the document purely on headings, or on character count, where the paragraphs attached to the first heading are split up for being too long.

Chunking documents without headings

You need to be smart about ensuring each chunk has enough context (but not too much). You also need to make sure the chunk is attached to metadata so it’s easy to trace back to where it was found.

Metadata attached to chunks differs by source

This is where you’ll spend the most time, and honestly, I think there should be better tools out there to do this intelligently.

I ended up using Docling for PDFs, building it out to attach elements based on headings and paragraph sizes. For web pages, I built a crawler that looked over page elements to decide whether to chunk based on anchor tags, headings, or general content.

Remember, if the bot is supposed to cite sources, each chunk needs to be attached to URLs, anchor tags, page numbers, block IDs, or permalinks so the system can trace the information it uses back to the right place.
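
As a rough illustration of what I mean, here's the kind of chunk structure you end up with, where the metadata fields differ depending on the source (the field names are illustrative, not a fixed schema):

from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    source_type: str              # "pdf", "web", "notion", ...
    metadata: dict = field(default_factory=dict)

pdf_chunk = Chunk(
    text="Expense reports are submitted through the finance portal...",
    source_type="pdf",
    metadata={"file": "employee-handbook.pdf", "page": 12, "heading": "Expenses"},
)

web_chunk = Chunk(
    text="To request API access, open a ticket with the platform team...",
    source_type="web",
    metadata={"url": "https://docs.example.com/api-access", "anchor": "#requesting-access"},
)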

Since most of the content you’re working with is scattered and often low quality, I also decided to summarize texts using an LLM. These summaries were given different labels with higher authority, which meant they were prioritized during retrieval.

Summarizing pages with higher authority
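
One simple way to express that label is as plain metadata on the summary chunks, which retrieval can then prioritize or filter on. A small sketch using LlamaIndex's TextNode (the authority field and its value are my own convention, not a LlamaIndex feature):

from llama_index.core.schema import TextNode

summary_node = TextNode(
    text="Summary: the VPN guide covers installing the client, requesting access, "
         "and troubleshooting the most common connection errors.",
    metadata={
        "authority": "high",      # custom label used to prioritize summaries
        "doc_type": "summary",
        "source_url": "https://intranet.example.com/vpn-setup",
    },
)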

There is also the option to push the summaries into their own tool and keep the deep-dive information separate, letting the agent decide which one to use. But this can look strange to users, as it isn’t intuitive behavior.

Still, I have to stress that if the quality of the source information is poor, it’s hard to make the system work well. For example, if a user asks how an API request should be made and there are four different web pages giving different answers, the bot won’t know which one is most relevant.

To demo this, I had to do some manual review. I also had AI do deeper research around the company to help fill in gaps, and then I embedded that too.

In the future, I think I’ll build something better for document ingestion — probably with the help of a language model.

The architecture (the agent)

For the second part, where we connect to this data, we need to build a system where an agent can connect to different tools that contain different amounts of data from our vector database.

We keep to one agent only to make it easy enough to control. This one agent can decide what information it needs based on the user’s question.

Agentic RAG

It’s good not to complicate things and build it out to use too many agents, or you’ll run into issues, especially with these smaller models.

Although this may go against my own recommendations, I did set up a first LLM function that decides if we need to run the agent at all.

Small initial LLM call

This was primarily for the user experience, as it takes a few extra seconds to boot up the agent (even when starting it as a background task when the container starts).
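
That gate is just a small, cheap LLM call with a constrained output. A sketch of what it might look like (the prompt wording and labels are illustrative):

from openai import OpenAI

client = OpenAI()

def needs_agent(user_msg: str) -> bool:
    # Cheap classification call: only boot the full agent when it's needed
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        max_tokens=5,
        messages=[
            {"role": "system", "content": (
                "Reply with exactly AGENT if the message needs internal company "
                "knowledge to answer, otherwise reply DIRECT."
            )},
            {"role": "user", "content": user_msg},
        ],
    )
    return resp.choices[0].message.content.strip().upper().startswith("AGENT")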

As for how to build the agent itself, this is easy, as LlamaIndex does most of the work for us. For this, you can use the FunctionAgent, passing in different tools when setting it up.

# Only runs if the first LLM thinks it is necessary
from llama_index.core.agent.workflow import FunctionAgent

# Each tool wraps retrieval over its own slice of the vector database
access_links_tool = get_access_links_tool()
public_docs_tool = get_public_docs_tool()
onboarding_tool = get_onboarding_information_tool()
general_info_tool = get_general_info_tool()

formatted_system_prompt = get_system_prompt(team_name)

agent = FunctionAgent(
    tools=[onboarding_tool, public_docs_tool, access_links_tool, general_info_tool],
    llm=global_llm,
    system_prompt=formatted_system_prompt,
)

The tools have access to different data from the vector database, and they are wrappers around the CitationQueryEngine. This engine helps cite the source nodes in the text. We can access the source nodes at the end of the agent run and attach them to the message, for example in a footer.
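
To show what one of those wrappers might look like, here's a hedged sketch of a tool built from a CitationQueryEngine over an index. My actual helpers take no arguments; the index is passed in here just to keep the sketch self-contained, and the name and description are illustrative.

from llama_index.core.query_engine import CitationQueryEngine
from llama_index.core.tools import QueryEngineTool

def get_onboarding_information_tool(index):
    # Query engine that returns an answer plus numbered source citations
    engine = CitationQueryEngine.from_args(index, similarity_top_k=5)
    return QueryEngineTool.from_defaults(
        query_engine=engine,
        name="onboarding_information",
        description="Answers questions about onboarding, accounts, and first-week setup.",
    )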

To make sure the user experience is good, you can tap into the event stream to send updates back to Slack.

from llama_index.core.agent.workflow import ToolCall, ToolCallResult

handler = agent.run(user_msg=full_msg, ctx=ctx, memory=memory)

# Stream intermediate events so the user sees progress in Slack
async for event in handler.stream_events():
    if isinstance(event, ToolCall):
        display_tool_name = format_tool_name(event.tool_name)
        post_thinking(f"✅ Checking {display_tool_name}")
    elif isinstance(event, ToolCallResult):
        post_thinking("✅ Done checking...")

final_output = await handler
final_text = str(final_output)  # agent output -> plain text for Slack
blocks = build_slack_blocks(final_text, mention)

post_to_slack(
    channel_id=channel_id,
    blocks=blocks,
    timestamp=initial_message_ts,
    client=client,
)

Make sure to format the messages and Slack blocks well, and refine the system prompt for the agent so it formats the messages correctly based on the information that the tools will return.
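
For reference, a variant of that block builder might look like the sketch below: answer text in a section block and cited sources in a context block. The optional sources argument is my own addition for illustration; the real helper only takes the text and the mention.

def build_slack_blocks(answer: str, mention: str, sources: list[dict] | None = None) -> list[dict]:
    # Section block for the answer, context block for source links
    blocks = [
        {"type": "section", "text": {"type": "mrkdwn", "text": f"{mention} {answer}"}},
    ]
    if sources:
        links = " | ".join(f"<{s['url']}|{s['title']}>" for s in sources)
        blocks.append({
            "type": "context",
            "elements": [{"type": "mrkdwn", "text": f"Sources: {links}"}],
        })
    return blocks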

The architecture should be easy enough to understand, but there are still some retrieval techniques we should dig into.

Techniques you can try

A lot of people will emphasize certain techniques when building RAG systems, and they’re partially right. You should use hybrid search along with some kind of re-ranking.

Agentic RAG

The first technique I’ll mention is hybrid search during retrieval.

I mentioned that we use semantic similarity to fetch chunks of data in the various tools, but you also need to account for cases where exact keyword search is required.

Just imagine a user asking for a specific certificate name, like CAT-00568. In that case, the system needs to find exact matches just as much as fuzzy ones.

With hybrid search, supported by both Qdrant and LlamaIndex, we use both dense and sparse vectors.

# When setting up the vector store (both for embedding and fetching)
from llama_index.vector_stores.qdrant import QdrantVectorStore

vector_store = QdrantVectorStore(
    client=client,
    aclient=async_client,
    collection_name="knowledge_bases",
    enable_hybrid=True,
    fastembed_sparse_model="Qdrant/bm25",
)

Sparse is perfect for exact keywords but blind to synonyms, whereas dense is great for “fuzzy” matches (“benefits policy” matches “employee perks”) but they can miss literal strings like CAT-00568.

Once the results are fetched, it’s useful to apply deduplication and re-ranking to filter out irrelevant chunks before sending them to the LLM for citation and synthesis.

from llama_index.core.postprocessor import LLMRerank, SimilarityPostprocessor
from llama_index.core.query_engine import CitationQueryEngine
from llama_index.core.schema import MetadataMode
from llama_index.llms.openai import OpenAI

# Drop nodes scoring below the cutoff, then let a small LLM re-rank and keep the top 5
dedup = SimilarityPostprocessor(similarity_cutoff=0.9)
reranker = LLMRerank(llm=OpenAI(model="gpt-3.5-turbo"), top_n=5)

engine = CitationQueryEngine(
    retriever=retriever,
    node_postprocessors=[dedup, reranker],
    metadata_mode=MetadataMode.ALL,
)

This part wouldn’t be necessary if your data were exceptionally clean, which is why it shouldn’t be your main focus. It adds overhead and another API call.

It’s also not necessary to use a large model for re-ranking, but you’ll need to do some research on your own to figure out your options.

These techniques are easy to understand and quick to set up, so they aren’t where you’ll spend most of your time.

What you’ll actually spend time on

Most of the things you’ll spend time on aren’t so sexy. It’s prompting, reducing latency, and chunking documents correctly.

Before you start, you should look into different prompt templates from various frameworks to see how they prompt the models. You’ll spend quite a bit of time making sure the system prompt is well-crafted for the LLM you choose.

The second thing you’ll spend most of your time on is making it fast. I’ve looked into internal tools from tech companies building AI knowledge agents and found they usually respond in about 8 to 13 seconds.

So, you need something in that range.

Using a serverless provider can be a problem here because of cold starts. LLM providers also introduce their own latency, which is hard to control.

Latency in API calls

That said, you can look into spinning up resources before they’re used, switching to lower-latency models, skipping frameworks to reduce overhead, and generally decreasing the number of API calls per run.

The last thing, which takes a huge amount of work and which I’ve mentioned before, is chunking documents.

If you had exceptionally clean data with clear headers and separations, this part would be easy. But more often, you’ll be dealing with poorly structured HTML, PDFs, raw text files, Notion boards, and Confluence notes — often scattered and formatted inconsistently.

The challenge is figuring out how to programmatically ingest these documents so the system gets the full information needed to answer a question.

Just working with PDFs, for example, you’ll need to extract tables and images properly, separate sections by page numbers or layout elements, and trace each source back to the correct page.

You want enough context, but not chunks that are too large, or it will be harder to retrieve the right info later.

This kind of stuff isn’t well generalized. You can’t just push it in and expect the system to understand it — you have to think it through before you build it.

How to build it out further

At this point, it works well for what it’s supposed to do, but there are a few pieces I should cover (or people will think I’m simplifying too much). You’ll want to implement caching, a way to update the data, and long-term memory.

Caching isn’t essential, but you can at least cache the query’s embedding in larger systems to speed up retrieval, and store recent source results for follow-up questions. I don’t think LlamaIndex helps much here, but you should be able to intercept the QueryTool on your own.
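
A minimal version of that query-embedding cache is just a hash of the normalized query mapped to its vector. A sketch below (in-memory for simplicity, though Redis or similar makes more sense in production):

import hashlib

_embedding_cache: dict[str, list[float]] = {}

def cached_query_embedding(query: str, embed_fn) -> list[float]:
    # Key on a hash of the normalized query so repeated questions skip the embedding call
    key = hashlib.sha256(query.strip().lower().encode()).hexdigest()
    if key not in _embedding_cache:
        _embedding_cache[key] = embed_fn(query)
    return _embedding_cache[key]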

You’ll also want a way to continuously update information in the vector databases. This is the biggest headache — it’s hard to know when something has changed, so you need some kind of change-detection method along with an ID for each chunk.

You could just use periodic re-embedding strategies where you update a chunk with different meta tags altogether (this is my preferred approach because I’m lazy).
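
A sketch of that idea: give each chunk a deterministic ID from its source location, store a hash of its content, and only re-embed and upsert when the hash changes (not tied to any specific vector DB client):

import hashlib

def chunk_id(source: str, position: int) -> str:
    # Stable ID so the same chunk always maps to the same record
    return f"{source}::{position}"

def needs_reembedding(text: str, source: str, position: int, stored_hashes: dict[str, str]) -> bool:
    cid = chunk_id(source, position)
    new_hash = hashlib.sha256(text.encode()).hexdigest()
    if stored_hashes.get(cid) == new_hash:
        return False              # unchanged, skip it
    stored_hashes[cid] = new_hash
    return True                   # new or changed, re-embed and upsert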

The last thing I want to mention is long-term memory for the agent, so it can understand conversations you’ve had in the past. For that, I’ve implemented some state by fetching history from the Slack API. This lets the agent see around 3–6 previous messages when responding.

We don’t want to push in too much history, since the context window grows — which not only increases cost but also tends to confuse the agent.
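
Pulling that bit of state from Slack is straightforward with slack_sdk's conversations_replies; here's a sketch that keeps only the tail of the thread (the cap of six messages is my own choice):

from slack_sdk import WebClient

def recent_thread_history(client: WebClient, channel_id: str, thread_ts: str, max_messages: int = 6) -> str:
    # Fetch the thread and keep only the last few messages as lightweight memory
    resp = client.conversations_replies(channel=channel_id, ts=thread_ts, limit=20)
    messages = resp["messages"][-max_messages:]
    return "\n".join(f"{m.get('user', 'bot')}: {m.get('text', '')}" for m in messages)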

That said, there are better ways to handle long-term memory using external tools. I’m keen to write more on that in the future.

Learnings and so on

After doing this for a bit, I have a few notes to share about working with frameworks and keeping things simple (advice I personally don’t always follow).

You learn a lot from using a framework, especially how to prompt well and how to structure the code. But at some point, working around the framework adds overhead.

For instance, in this system, I’m bypassing the framework a bit by adding an initial API call that decides whether to move on to the agent and responds to the user quickly.

If I had built this without a framework, I think I could have handled that kind of logic better, with the first model deciding which tool to call right away.

Can remove certain API calls if built on your own

I haven’t tried this but I’m assuming this would be cleaner.

Also, LlamaIndex optimizes the user query, which it should, before retrieval.

But sometimes it reduces the query too much, and I need to go in and fix it. The citation synthesizer doesn’t have access to the conversation history, so with that overly simplified query, it doesn’t always answer well.

LlamaIndex abstractions cause issues

With a framework, it’s also hard to trace where latency is coming from in the workflow since you can’t always see everything, even with observation tools.

Most developers recommend using frameworks for quick prototyping or bootstrapping, then rewriting the core logic with direct calls in production.

It’s not because the frameworks aren’t useful, but because at some point it’s better to write something you fully understand that only does what you need.

The general recommendation is to keep things as simple as possible and minimize LLM calls (which I am not even fully doing myself here).

But if all you need is RAG and not an agent, stick with that.

You can create a simple LLM call that sets the right parameters in the vector DB. From the user’s point of view, it’ll still look like the system is “looking into the database” and returning relevant info.
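
A hedged sketch of that simpler setup: one cheap LLM call with structured output picks the retrieval parameters (here just a collection name; the names are made up), and then you run a single vector search and answer call with them.

import json
from openai import OpenAI

client = OpenAI()

def pick_collection(question: str) -> str:
    # One small call decides where to look; no agent loop involved
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": (
                'Return JSON like {"collection": "..."} choosing one of: '
                '"onboarding", "public_docs", "access_links", "general".'
            )},
            {"role": "user", "content": question},
        ],
    )
    return json.loads(resp.choices[0].message.content)["collection"]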

Some Notes

If you’re going down the same path, I hope this was useful.

There is a bit more to it, though. You’ll want to implement some kind of evaluation, guardrails, and monitoring (I’ve used Phoenix here).

A friend and I are looking into building a plug-and-play tool that lets you crawl your company's docs and build a Slack bot without having to set up the architecture yourself. It helps us with clients, and we can scale it across several teams.

We’re also building it so the bot can act on your behalf, meaning it has access to APIs, not just data. We’re still testing that and having some fun with it.

If you’re curious about what we’re working on or want to follow my writing, you can find me here, on Medium, or on LinkedIn.

I’ll try to dive deeper into agentic memory, evals, and prompting over the summer.