Empowering Sales with Elastic GenAI

How is GenAI used to create Sales Proposals at iZeno?

Imagine you’re creating a proposal for a potential client, armed with the opportunity to dazzle them with your company’s expertise and solutions. But wait – where did that write-up about your company’s ESG plan go? And didn’t someone mention a similar project delivered two years ago? Cue the frantic hunt through your company’s internal Google Drive of past proposals, projects, and internal knowledge bases.

If you’re a sales or presales person working at a company like iZeno, with over 600 projects delivered in 20 years and thousands of proposals created during the same period, this scenario may sound very familiar to you. When you factor in the technical documentation from our technology partners and their continually evolving products, you’ll quickly find yourself buried in information.

Well, fear not, because today we’re diving into the world of Elastic and how we’ve used it internally to improve the way we submit our proposals. Plus, we’ll sprinkle in a little magic called Retrieval Augmented Generation (RAG) to take our searches to a whole new level of efficiency. Disclaimer: Elastic is a technology partner of iZeno, and this is our way of ‘dogfooding’ the technology and testing the limits of their product capabilities.

RAG = Retrieval (Search) + Generation (AI Generated Responses)

First things first, let’s talk about Elastic. If you haven’t heard of it, think of Elastic (more specifically, their product, Elasticsearch) as a powerful search engine that helps you quickly sift through massive amounts of data to find exactly what you’re looking for. Whether you’re searching through documents, application logs, or any data that is generally unstructured or semi-structured and doesn’t fit neatly into traditional databases with rows and columns, Elastic can handle the search with lightning speed.
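To make that concrete, here is a minimal sketch of what a full-text query against Elasticsearch looks like. The index and field names (`proposals`, `content`) are hypothetical; the body is only constructed here, not sent to a cluster.

```python
# Build a basic match query body in the shape Elasticsearch expects.
# "proposals" and "content" are illustrative names, not real indices.
query_body = {
    "query": {
        "match": {
            "content": "ESG plan write-up"  # free-text search terms
        }
    },
    "size": 5,  # return the top 5 matching documents
}

# With the official elasticsearch-py client, this body would be sent as:
#   es.search(index="proposals", body=query_body)
```

The same body works whether the documents are proposals, logs, or any other semi-structured text: Elastic analyzes the text at index time and scores matches at query time.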

Imagine coupling that search prowess with the power of Generative AI and Retrieval Augmented Generation (RAG). As its name suggests, RAG has two main jobs: fetching information from a big database (the retrieval part) and using that information to create a coherent response (the generation part).

So, how does it all work? Imagine you’re searching for information on a specific project within your company’s knowledge base. With Elastic, you have a search engine tailored to your internal documents, enabling you to locate relevant content quickly. But here’s where RAG kicks in: not only does it surface those documents, but it also goes the extra mile by providing concise summaries or answers to your questions, saving you precious time and effort otherwise spent reading through each document.
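The retrieve-then-generate loop described above can be sketched in miniature. Everything here is a toy stand-in: the two-document corpus, the keyword-overlap scoring (Elastic would do real relevance ranking), and the prompt template that would be handed to an LLM.

```python
# Toy stand-in for an indexed knowledge base.
CORPUS = {
    "crm-2022": "CRM migration project delivered for a regional bank in 2022.",
    "esg-plan": "Company ESG plan covering emissions reporting and governance.",
}

def retrieve(question: str, k: int = 1) -> list[str]:
    """Retrieval step: rank documents by naive keyword overlap with the question."""
    q_words = set(question.lower().split())
    scored = sorted(
        CORPUS.items(),
        key=lambda item: len(q_words & set(item[1].lower().split())),
        reverse=True,
    )
    return [text for _, text in scored[:k]]

def build_prompt(question: str, contexts: list[str]) -> str:
    """Generation step: augment the question with retrieved context for the LLM."""
    context_block = "\n".join(contexts)
    return f"Answer using only this context:\n{context_block}\n\nQuestion: {question}"

question = "What is in the company ESG plan?"
prompt = build_prompt(question, retrieve(question))
# `prompt` would now be sent to an LLM to produce the concise summary.
```

In the real pipeline, `retrieve` is an Elasticsearch query and `build_prompt`'s output goes to the LLM API; the structure of the flow is the same.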

RAG, Elastic, or both? Elastic doesn’t replace RAG. Rather, the RAG approach is enhanced by Elastic. Ultimately, the effectiveness of the response hinges on the quality of the initial search result. Without delving into the intricacies of semantic search, vector search, or ranking techniques, the most effective way to assess information retrieval is by asking ourselves: “Is the response I received relevant and accurate?” With our list of test questions, the answer was a definite yes.

With the search part of the equation settled, let’s focus on the AI-generated response. This step entails integrating an LLM to produce a polished and concise reply. When integrating external LLM API services, such as GPT or Claude models, two primary considerations arise: cost and response times. This is where Elastic’s caching and chunking mechanisms come into play.

Implementing caching mechanisms can reduce API call frequency, effectively lowering API usage expenses. Additionally, Elastic’s similarity threshold parameter provides a convenient way to specify how similar a new question must be to a previously answered one before the cached response is reused. Serving from the cache also significantly improves our response times.

LLM Caching Benefits:

Increase Response Speed

When users frequently ask the same or similar questions against data that doesn’t change, there’s no need to call an LLM to regenerate the same answer.

Reduce LLM Token Costs

Each duplicate call to an LLM has a token cost associated with it. When you simply need an already generated response, you can avoid the token fee.

Pre-Vet Generative Responses

By pre-caching answers to the most common questions before users ask them, you have the opportunity to validate the generative AI’s responses.

Another strategy involves chunking: breaking down large documents into smaller, more manageable chunks of text. Elastic then passes only the most relevant chunks through to the LLM. The outcome? By reducing the number of input tokens needed to generate a coherent response, we cut the cost of each API call by up to 90%. With an efficient chunking strategy, we not only optimize costs but also uphold the quality and reliability of our responses.
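A minimal sketch of the chunking idea: split a long document into fixed-size chunks, keep only the chunks most relevant to the question, and send just those to the LLM. The chunk size and the keyword-overlap scoring are illustrative; real pipelines typically chunk by tokens or sentences and score with semantic relevance.

```python
def chunk(text: str, size: int = 7) -> list[str]:
    """Split text into chunks of `size` words each."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def top_chunks(question: str, chunks: list[str], k: int = 1) -> list[str]:
    """Keep the k chunks sharing the most words with the question."""
    q_words = set(question.lower().split())
    return sorted(
        chunks,
        key=lambda c: len(q_words & set(c.lower().split())),
        reverse=True,
    )[:k]

doc = ("The project kickoff was held in March. "
       "Deployment used Kubernetes on a managed cloud. "
       "The total contract value was finalised after procurement review.")

# Only the deployment chunk reaches the LLM; the rest of the document
# never consumes input tokens.
relevant = top_chunks("What did deployment use?", chunk(doc))
```

The cost saving falls out directly: instead of paying input-token fees for the whole document, you pay only for the handful of chunks that actually inform the answer.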

In summary, Elastic’s mature information retrieval capability effectively fulfills the retrieval aspect of RAG, benefiting from its balanced approach to optimizing search accuracy and cost efficiency. As for the response generation part of RAG, Elastic’s efficient caching and chunking mechanisms play a critical role in significantly reducing LLM inference costs and improving response times.

Role-Based Access Control (RBAC) with Elastic

While we’ve covered Elastic’s features related to RAG, there’s still much more to explore about Elastic as an enterprise-grade data store. For instance, Elastic provides a robust and easily configurable set of Role-Based Access Control (RBAC) features, allowing users to fine-tune access permissions, including field and document-level restrictions. These RBAC functionalities ensure that sensitive documents are shielded from unauthorized access and comply with iZeno’s compliance policies.
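A role with document- and field-level restrictions can be expressed in the shape of Elastic's role API (`PUT _security/role/<name>`). The index pattern, classification query, and field names below are hypothetical; consult Elastic's security documentation for the full schema.

```python
# Illustrative role body: a sales role that can read proposal indices,
# but only sees documents classified "internal" (document-level security)
# and never sees pricing fields (field-level security).
sales_role = {
    "indices": [
        {
            "names": ["proposals-*"],          # hypothetical index pattern
            "privileges": ["read"],
            # Document-level security: filter which docs this role can see.
            "query": {"term": {"classification": "internal"}},
            # Field-level security: grant all fields except pricing.
            "field_security": {"grant": ["*"], "except": ["pricing.*"]},
        }
    ]
}
```

Because the restriction lives in the role rather than the application, every search (including RAG retrieval) is automatically filtered to what the user is allowed to see.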

Continuing our quest to maximize productivity at iZeno

By tapping into our Google Drive repository, we’re ingesting all our past proposals, project deliverables, and other valuable details our teams have produced over the years. Additionally, we’ve incorporated information such as proposal guidelines and commercial terms stored in Confluence via Elastic’s connector. But our efforts don’t stop there. We continue to crawl and sync the latest documentation across our multiple technology vendor partners using Elastic’s web crawler.

For those curious about the finer modeling details, we chose Elastic’s ELSER v2 model for embeddings and pass the retrieved context into GPT-3.5 Turbo for response generation, balancing precision and cost efficiency throughout the process. We also explored importing a model from Hugging Face through the Eland Python client, but encountered challenges: that model’s results were not immediately usable.
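For reference, a semantic query against an ELSER v2 field looks roughly like the `text_expansion` query introduced in Elastic 8.x. The field name (`ml.tokens`) and index layout are assumptions based on a typical ELSER ingest-pipeline setup; only the query body is built here.

```python
# Illustrative ELSER v2 query body. ".elser_model_2" is Elastic's ELSER v2
# model id; "ml.tokens" is the (assumed) field holding the sparse token
# weights produced at ingest time.
elser_query = {
    "query": {
        "text_expansion": {
            "ml.tokens": {
                "model_id": ".elser_model_2",
                "model_text": "past CRM projects for banks",  # the user's question
            }
        }
    }
}

# The text of the top hits would then be passed as context to the LLM
# (GPT-3.5 Turbo in our case) to generate the final answer.
```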

The result? We now have an assistant that generates an accurate and concise summary in our search results for any topic. No more hunting for critical info buried in some old document. Proposal generation and RFP responses are now 20% faster, and we’re looking to do even better. This lets our sales and presales teams spend less time responding to proposals and more time directly engaging with the customers and partnerships that move the needle.

But we’re just getting started – we have bigger plans to make this assistant an indispensable part of our daily workflows. The next phase? Deeper integration with Google Workspace tools.

Imagine this: you’re working in Google Sheets, responding to a Statement of Compliance document. With a few keystrokes, you’ll be able to summon RAG summaries and respond to each line item with answers perfectly tailored to the question. The same goes for creating proposals in Google Docs: just feed it the basics and watch as it pulls pre-approved content, technical specs, case study details, and more into a complete narrative. Operating across five countries, our teams deliver proposals in multiple languages, so our ultimate goal is to localize and translate content autonomously, further streamlining our operations.

The future’s bright: the days of doing things the painfully manual way are ending. Soon, this assistant will help our teams operate at maximum efficiency while bringing the full weight of our company’s knowledge to every opportunity. Hang on tight: generative AI is going to be one heck of a ride!

iZeno is an Elastic Partner.

With solutions in Enterprise Search, Observability, and Security, we help enhance customer and employee search experiences, keep mission-critical applications running smoothly, and protect against cyber threats. Delivered wherever data lives, in one cloud, across many clouds, or on-prem, Elastic enables more than 50% of the Fortune 500, and 17,000+ customers including Netflix, Uber, Slack, and Microsoft, to achieve new levels of success at scale and on a single platform.