Ready to Experiment with AI?

Artificial Intelligence (AI) is rapidly changing the landscape, but many organizations struggle with adoption. Common hurdles include demonstrating AI's value, skills gaps, lack of confidence in the technology, data limitations, and aligning AI with business use cases. Early, systematic experimentation can overcome many of these challenges and increase the chances of moving AI projects from proof-of-concept to production. But where do you start?  

This guide, inspired by "An AI Experimentation Guide for the Curious", outlines a clear, four-step methodology to help you begin experimenting with AI more systematically and find useful applications.  

The 4-Step Methodology for AI Experimentation

  1. Identify Outcomes & Tasks: Determine the goals you want to achieve and the specific tasks where AI might help.  

  2. Create a Gold Standard & Gather Data: Establish how you'll measure success and collect the necessary data.  

  3. Experiment & Pilot: Test different AI models on your chosen task.  

  4. Build & Deploy: Integrate the successful AI solution into workflows.  

Let's dive into each step.

Step 1: Identify the Right Outcomes and Tasks

The key is to focus your efforts. Start by thinking about your team's current workflows:  

  • Are there repetitive tasks?  

  • Do any tasks involve unstructured data (like text, images, audio)?  

  • Could automating or speeding up a task free up time for more creative or strategic work?  

  • Is the task well-understood? Do you know what defines success and what constitutes a bad, good, or great outcome?  

You are the expert on the problems you face daily, which makes you perfectly positioned to identify opportunities.  

Consider categorizing potential tasks:  

  • Extraction: Reading, summarizing, finding keywords, routing information, reasoning, or analyzing varied sources.  

  • Expansion: Writing or generating content like emails, code, or marketing copy based on a prompt.  

Create a wishlist or backlog of potential tasks. Then, start building intuition by trying out readily available commercial AI models (like Gemini, ChatGPT, Claude) with some simple prompts related to your tasks. See what they are good at, where they struggle, and if providing more context helps. Based on this initial exploration, narrow down your list to 1-2 tasks that seem feasible and offer a good potential return on investment.  
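The shortlisting step above can be sketched in a few lines. This is a minimal, illustrative triage: the task names and the 1–5 feasibility and ROI scores are hypothetical, and the scoring rule (feasibility × ROI) is just one reasonable way to rank a backlog.

```python
# Hypothetical backlog of candidate tasks, scored 1-5 on feasibility
# and potential return on investment.
backlog = [
    {"task": "summarize support tickets", "category": "extraction",
     "feasibility": 4, "roi": 4},
    {"task": "draft marketing emails", "category": "expansion",
     "feasibility": 3, "roi": 2},
    {"task": "route inbound requests", "category": "extraction",
     "feasibility": 2, "roi": 5},
]

# Rank by a simple feasibility * ROI score and keep the top one or two.
shortlist = sorted(backlog, key=lambda t: t["feasibility"] * t["roi"],
                   reverse=True)[:2]
for t in shortlist:
    print(t["task"], "score:", t["feasibility"] * t["roi"])
```

Even a rough scoring like this forces the conversation about which tasks are both doable and worth doing.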

Step 2: Create a Gold Standard Evaluation Set & Gather Data

Before serious experimentation, you need a reliable way to measure performance. This involves creating a "gold standard" evaluation set.  

  • Expert-Created: This dataset should be built by someone who deeply understands the task.  

  • Sufficient Examples: Aim for around 100 diverse examples representing the task accurately.  

  • Structure: Each example should ideally include the input, the desired output (perhaps rated as bad, good, great), and comments on the specific behavior being tested.  

  • Held Back: This data is only for testing (inference) – models should never see it during training.  

For initial, smaller-scale tests, you can track evaluations manually (e.g., in a table), noting the input, target output, actual model output, and other relevant factors. Remember to track all experiments systematically.  

Also, gather any data that could provide useful context for the AI, such as templates, style guides, or examples of similar completed tasks.  
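As a sketch of what one gold-standard record might look like, the snippet below captures the structure described above: input, desired output, rating notes, and a comment on the behavior being tested. The field names and example content are illustrative assumptions, not prescribed by the guide.

```python
import csv
from dataclasses import dataclass, asdict, fields

# One gold-standard example. Field names here are illustrative.
@dataclass
class EvalExample:
    input_text: str       # what the model will be given
    target_output: str    # the expert-written desired output
    rating_notes: str     # what separates bad / good / great here
    behavior_tested: str  # which specific behavior this example probes

examples = [
    EvalExample(
        input_text="Summarize this support ticket: ...",
        target_output="Customer reports login failures after the 2.3 update.",
        rating_notes="great = names cause and version; good = cause only",
        behavior_tested="faithful summarization without invented details",
    ),
]

# Write the set to a CSV so each run can be tracked manually in a table.
with open("gold_standard.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=[fld.name for fld in fields(EvalExample)]
    )
    writer.writeheader()
    for ex in examples:
        writer.writerow(asdict(ex))
```

A plain CSV or spreadsheet is enough at this scale; the point is that every example carries its own definition of success.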

Step 3: Set Up Experiments and Pilot Models

Now it's time for systematic testing:

  1. Test Models: Select a few promising AI models (commercial ones are a good starting point) and test them against your gold standard evaluation set.  

  2. Vary Prompts: Try different ways of asking the model to perform the task.  

  3. Add Context: Experiment with providing the relevant data you gathered (templates, examples) as context to the models. See if varying the amount of context helps.  

  4. Document Everything: Keep detailed records of your prompts, the models used, the data provided, and the results.  
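The four steps above amount to a grid search over models, prompts, and context, with every run logged. The sketch below shows that loop; `call_model` is a hypothetical stand-in for whatever provider SDK you use (Gemini, ChatGPT, Claude, etc.), and the templates and context strings are illustrative.

```python
import itertools
import json

# Hypothetical stand-in for a real model API call.
def call_model(model: str, prompt: str) -> str:
    return f"[{model}] stub output"

models = ["model-a", "model-b"]                  # models under test
prompt_templates = [                             # prompt variations
    "Summarize: {input}",
    "You are a support analyst. Summarize briefly: {input}",
]
contexts = [None, "Style guide: keep summaries under 25 words."]

eval_set = [{"input": "ticket text ...", "target": "expected summary ..."}]

results = []
for model, template, context in itertools.product(
    models, prompt_templates, contexts
):
    for example in eval_set:
        prompt = template.format(input=example["input"])
        if context:
            prompt = context + "\n\n" + prompt
        results.append({
            "model": model,
            "template": template,
            "with_context": context is not None,
            "input": example["input"],
            "target": example["target"],
            "output": call_model(model, prompt),
        })

# Persist every run so experiments stay comparable across sessions.
with open("experiment_log.json", "w") as f:
    json.dump(results, f, indent=2)
```

Because every combination is recorded with the same fields, you can later compare models and prompt variants against the gold standard without guessing which settings produced which output.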

Analyze your findings:  

  • Performance: Did the models perform well enough to potentially augment or automate the workflow? Can you quantify the potential time savings or other benefits (ROI)?  

  • Weaknesses: Where did the models fail? Did they hallucinate or make consistent errors? Understanding error patterns is crucial.  

  • Scalability: If the results are promising, could this solution benefit the broader team or organization?  

By the end of this step, you should have a clear analysis of model performance, potential business value, and a recommendation on whether to proceed.  

Step 4: Build Out the Application and Deploy

If your pilot is successful, the next stage involves scaling the solution:

  1. Collaborate: Work with engineers, product managers, and other relevant teams to refine requirements and estimate the effort needed to build a robust application.  

  2. Design Integration: Plan how the AI capability will fit naturally into existing workflows. This might involve designing user interfaces or even developing AI agents that can perform sequences of tasks.  

  3. Develop & Evaluate: Start with a Proof of Concept (POC) implementation and evaluate its effectiveness in a real-world context.  

  4. Deploy: If the POC is successful, move towards implementing the full application or agent workflow into production, including monitoring its ongoing performance.  

The Best Way to Learn Is to Experiment

Following a structured approach—identifying tasks, creating evaluation standards, experimenting systematically, and then building thoughtfully—demystifies AI adoption. It allows you to learn quickly, demonstrate value, and build confidence in using AI to solve real problems.

Reach out to us at KAMI Think Tank (info@kamithinktank.com) if you want more detailed guidance on this experimentation methodology.
