Model evaluation using Durable Functions (Python)

About sample

This sample demonstrates how to use Durable Functions to call multiple models in parallel to quickly get the best response to a user's query. It uses three models (GPT-3.5-turbo, GPT-4o-mini, and Phi-4) to answer a query. After getting the responses, it uses GPT-4o to evaluate and score them against a set of criteria.

There's no particular reason for choosing the models used in this sample. The key is to demonstrate how the Durable Functions fan-out/fan-in pattern makes this scenario easy to implement.

About Durable Functions

Durable Functions is part of the Azure Functions offering. It helps orchestrate long-running, stateful logic and provides reliable execution. For example, when there's an infrastructure failure (process crash, VM restart, etc.), the framework rebuilds the application state and resumes from the point of failure instead of starting over from the beginning. This saves time and money, especially for expensive operations like LLM calls. Common scenarios where Durable Functions is useful include agentic workflows, data processing, asynchronous APIs, batch processing, and infrastructure management.
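
To make the "resume from the point of failure" idea concrete, here is a toy sketch in plain Python. It is not how Durable Functions is implemented and it uses none of its APIs. The names `expensive_call`, `run_workflow`, and `history` are hypothetical, and the sketch only illustrates the general idea of saving completed results so that a rerun skips work that already finished.

```python
# Toy illustration only: this is NOT the Durable Functions implementation.
# It shows the general idea of recording completed steps so that a rerun
# skips work that already finished.

call_count = {"n": 0}  # counts how many times the expensive step really runs


def expensive_call(prompt: str) -> str:
    """Stand-in for a costly operation such as an LLM call."""
    call_count["n"] += 1
    return f"answer to: {prompt}"


def run_workflow(prompts: list, history: dict, fail_at=None) -> list:
    """Run one step per prompt, reusing results already saved in `history`."""
    results = []
    for index, prompt in enumerate(prompts):
        if index not in history:
            if index == fail_at:
                raise RuntimeError("simulated crash")
            history[index] = expensive_call(prompt)  # save the completed result
        results.append(history[index])
    return results


history = {}
prompts = ["q1", "q2", "q3"]
try:
    run_workflow(prompts, history, fail_at=2)  # crashes before the third step
except RuntimeError:
    pass

# The rerun reuses the two saved results and only executes the third step.
results = run_workflow(prompts, history)
```

In this toy version the expensive step runs three times in total rather than five, because the first two results survive the simulated crash.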

Durable Functions needs a backend provider to persist application states. This sample uses the new Durable Task Scheduler backend that's currently in preview.

Important

This sample creates several resources. Delete the resource group after testing to minimize charges.

Run in your local environment

The project is designed to run on your local computer, provided you have met the required prerequisites. You can run the project locally in these environments:

Using Visual Studio Code
Using Azure Functions Core Tools (CLI)

Prerequisites

Deploy language models

Create an Azure subscription with a valid payment method. Free or trial Azure subscriptions won't work.
Create a project in Azure AI Foundry.
Go to Model catalog in the left menu, then search for and deploy the following models:
- GPT-4o
- GPT-3.5-turbo
- GPT-4o-mini
- Phi-4 (small language model by Microsoft that has advanced reasoning capabilities in areas like math and science)

Get endpoints and keys for models

You'll need the model API key and endpoint for the next step.

Go to the Overview tab of the project where the models are deployed. The API key is at the top of the page.

To get the endpoint, click Azure AI inference under "Included capabilities".

Set up Durable Task Scheduler emulator

Pull Docker image:

docker pull mcr.microsoft.com/dts/dts-emulator:v0.0.5

Run Docker image:

docker run -d -p 8080:8080 -p 8082:8082 mcr.microsoft.com/dts/dts-emulator:v0.0.5

The emulator exposes two ports:

8080: gRPC endpoint that allows the app to connect to the scheduler
8082: endpoint for monitoring dashboard
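
If the app can't connect to the emulator, a quick TCP check can confirm whether both ports are reachable. The following is a minimal sketch using only the Python standard library. The helper name `is_port_open` is hypothetical, and the check only confirms that something is listening on the port, not that it is the emulator.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Port numbers taken from the docker run command above.
    for name, port in [("gRPC endpoint", 8080), ("dashboard", 8082)]:
        status = "reachable" if is_port_open("localhost", port) else "not reachable"
        print(f"{name} on port {port}: {status}")
```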

Run app using Visual Studio Code

Open app folder in a new terminal
Open VS Code by entering code . in the terminal

In the root folder, create a file named local.settings.json with the following, filling in connection information from the previous step:

{
  "IsEncrypted": false,
  "Values": {
      "AzureWebJobsStorage": "UseDevelopmentStorage=true",
      "BLOB_STORAGE_ENDPOINT": "DefaultEndpointsProtocol=http;AccountName=devstoreaccount1;AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==;BlobEndpoint=http://127.0.0.1:10000/devstoreaccount1;",
      "MODELS_ENDPOINT": "https://<resource name>.services.ai.azure.com/models",
      "AZURE_AI_API_KEY": "<api key>", 
      "DURABLE_TASK_SCHEDULER_CONNECTION_STRING": "Endpoint=http://localhost:8080;Authentication=None",
      "TASKHUB_NAME": "default",
      "FUNCTIONS_WORKER_RUNTIME": "python"
  }
}

[!NOTE] The value shown for BLOB_STORAGE_ENDPOINT is the default value for Azurite (Azure Storage emulator) - it's not a private key.
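
A missing or unfilled setting is an easy mistake to make here. As a rough sketch (not part of the sample), the following helper checks that every key from the example above is present and that no `<placeholder>` text remains. The function name `find_setting_problems` is hypothetical.

```python
import json

# Keys taken from the local.settings.json example above.
REQUIRED_KEYS = [
    "AzureWebJobsStorage",
    "BLOB_STORAGE_ENDPOINT",
    "MODELS_ENDPOINT",
    "AZURE_AI_API_KEY",
    "DURABLE_TASK_SCHEDULER_CONNECTION_STRING",
    "TASKHUB_NAME",
    "FUNCTIONS_WORKER_RUNTIME",
]


def find_setting_problems(settings_text: str) -> list:
    """Return human-readable problems; an empty list means no problem was found."""
    values = json.loads(settings_text).get("Values", {})
    problems = []
    for key in REQUIRED_KEYS:
        value = values.get(key, "")
        if not value:
            problems.append(f"{key} is missing or empty")
        elif "<" in value and ">" in value:
            problems.append(f"{key} still contains a <placeholder>")
    return problems
```

To use it, read the file and pass its contents, for example `find_setting_problems(open("local.settings.json").read())`.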

Start Azurite by running:
```
azurite start --skipApiVersionCheck
```
Run the project with debugging (press F5).
To test, open the test.http file and click "Send Request". This file contains POST requests that ask different questions. For example:

"What is the value proposition of Azure Durable Functions and what is it used for?"

The request returns an HTTP response with URLs for managing the orchestration, but this sample doesn't use them.
The model evaluation result is stored in a container called results, which you can view with Azure Storage Explorer. Open the explorer and click Emulator & Attached > Storage Accounts > (Emulator - Default Ports)(Key) > Blob Containers > results. Double-click a .txt file to see the evaluation result for a specific prompt.
View orchestration details in the dashboard by going to http://localhost:8082 and clicking the "default" task hub.

Inspect the solution

Take a look at orchestrator_function to see how Durable Functions lets you write code that runs in parallel. The function adds the activity calls to the language models to a list and then calls context.task_all(tasks), which schedules the activity functions to run in parallel. You don't have to track when each activity function finishes or whether one fails midway: Durable Functions handles the "fan-in" and the automatic retries. Simply take the results and continue with your business logic.

@app.orchestration_trigger(context_name="context")
def orchestrator_function(context):
  # Previous logic (defines user_prompt, system_prompt, and retry_options)
  
  # Run all tasks in parallel
  tasks = [
    context.call_activity_with_retry("get_gpt35_result", retry_options, [user_prompt, system_prompt]),
    context.call_activity_with_retry("get_gpt4omini_result", retry_options, [user_prompt, system_prompt]),
    context.call_activity_with_retry("get_phi4_result", retry_options, [user_prompt, system_prompt])
  ]
  
  # Wait for all the parallel tasks to complete before continuing
  results = yield context.task_all(tasks)

  # Other business logic
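
For readers new to the pattern, the fan-out/fan-in shape of the orchestrator can be mimicked in plain Python with a thread pool. This is only an analogy under stated assumptions: it uses none of the Durable Functions APIs, it has no durability or retries, and `fake_model` and `fan_out_fan_in` are hypothetical names standing in for the activity functions and the orchestrator.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_model(name: str, prompt: str) -> list:
    """Hypothetical stand-in for an activity that calls a language model."""
    return [f"{name} says: {prompt}", name]


def fan_out_fan_in(prompt: str, model_names: list) -> list:
    """Start one call per model (fan-out) and wait for all results (fan-in)."""
    with ThreadPoolExecutor(max_workers=len(model_names)) as pool:
        futures = [pool.submit(fake_model, name, prompt) for name in model_names]
        # Collect the results in submission order once every call has finished.
        return [future.result() for future in futures]


results = fan_out_fan_in("hello", ["gpt-35-turbo", "gpt-4o-mini", "phi-4"])
```

The difference in the real sample is that the framework persists progress, so completed activity calls don't have to be repeated after a failure.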

Each of the get_<model>_result activity functions calls the corresponding language model. For example, get_gpt35_result looks like this:

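import os
from datetime import datetime

from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import SystemMessage, UserMessage
from azure.core.credentials import AzureKeyCredential

@app.activity_trigger(input_name="prompts")
def get_gpt35_result(prompts: list):
    # The orchestrator passes [user_prompt, system_prompt]
    user_prompt, system_prompt = prompts[0], prompts[1]

    # Setting names match the ones defined in local.settings.json
    client = ChatCompletionsClient(
        endpoint=os.environ["MODELS_ENDPOINT"],
        credential=AzureKeyCredential(os.environ["AZURE_AI_API_KEY"]),
    )
    response = client.complete(
        model="gpt-35-turbo", # model deployment name
        messages=[
            SystemMessage(content=system_prompt),
            UserMessage(content=user_prompt)
        ],
        temperature=0
    )

    # Return the answer text, the model name, and a timestamp
    return [response.choices[0].message.content, "gpt-35-turbo", datetime.now().strftime("%Y-%m-%d %H:%M:%S")]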
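
Once the three activities return, their results have to be handed to the evaluator model. The sample's actual evaluation prompt and criteria aren't shown here, so the following is only a sketch of one way to format the results. The function name `build_evaluation_prompt` and the scoring instruction are assumptions. Only the `[answer, model name, timestamp]` result shape comes from the activity function above.

```python
def build_evaluation_prompt(question: str, results: list) -> str:
    """Format model answers into one prompt for an evaluator model.

    Each item in `results` follows the activity's return shape:
    [answer_text, model_name, timestamp].
    """
    lines = [f"Question: {question}", "", "Candidate answers:"]
    for number, (answer, model_name, _timestamp) in enumerate(results, start=1):
        lines.append(f"{number}. [{model_name}] {answer}")
    lines.append("")
    # The scoring instruction below is an assumption, not the sample's criteria.
    lines.append("Score each answer from 1 to 10 and name the best one.")
    return "\n".join(lines)
```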

Run app using Azure Functions Core Tools (CLI)

Make sure Azurite is started before proceeding.
Open the cloned repo in a new terminal and navigate to the app directory:

cd app

Create and activate the virtual environment:

python3 -m venv .venv

source .venv/bin/activate

Install required packages:

python3 -m pip install -r requirements.txt

Add local.settings.json to the root directory (app).
Start the function app:

func start

Deploy and run app on Azure

Follow instructions to create the required resources on Azure. One of the resources created is an Azure Storage account, which is used by the Function App for deployment purposes. The sample uses this same storage account to store the model evaluation results.
In the Azure portal, add these environment variables to the Function App by going to Settings > Environment variables:
- MODELS_ENDPOINT
- AZURE_AI_API_KEY
- BLOB_STORAGE_ENDPOINT
The value of BLOB_STORAGE_ENDPOINT should be the same as the AzureWebJobsStorage variable, which should be set automatically.
Deploy the app.

Run the following command to get the endpoint of the HTTP trigger after deployment:

az functionapp function list --resource-group <YOUR_RESOURCE_GROUP_NAME> --name <YOUR_FUNCTION_APP_NAME>  --query '[].{Function:name, URL:invokeUrlTemplate}' --output json
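
Because of the `--query` expression, the command prints a JSON array of objects with `Function` and `URL` keys. If you want to pick out the URLs in a script, a small helper like the one below could do it. This is a sketch: the name `extract_function_urls` is hypothetical, and it assumes functions without an invoke URL have an empty or null `URL` value.

```python
import json


def extract_function_urls(cli_output: str) -> dict:
    """Map each function name to its invoke URL from the command's JSON output."""
    entries = json.loads(cli_output)
    # Skip entries with no URL (assumed to be functions without an HTTP endpoint).
    return {entry["Function"]: entry["URL"] for entry in entries if entry.get("URL")}
```

Pass it the text that the command writes to standard output, for example after saving that output to a file.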

Update test.http with the right endpoint to send a POST request.
Go to the Azure Storage account used by the Function App and find Data storage > Containers. Click on the container named results. This container stores the results of evaluations.

Resources

For more information on Durable Functions, see the following:

Durable Functions overview
Durable Task Scheduler samples
Order processing workflow with Durable Functions: Python sample, C# sample
