Prompt an LLM. Embed a document. Process a video. It's all just a DataFrame.
Daft is a data engine that treats AI operations as first-class citizens. Calling an LLM, generating embeddings, reading a PDF — these aren't afterthoughts bolted onto a DataFrame library. They're column expressions, right in the query plan.
This repo contains 90+ runnable examples that show you how.
import daft
from daft.functions import embed_text, prompt

df = daft.from_pydict({"text": ["Einstein was a brilliant scientist.", "Mozart was a brilliant pianist."]})
df = df.with_column("summary", prompt(df["text"], model="openai/gpt-4.1-mini"))
df = df.with_column("embedding", embed_text(df["text"], model="Qwen/Qwen3-Embedding-0.6B"))

git clone https://github.com/Eventual-Inc/daft-examples.git
cd daft-examples
make setup  # installs deps, copies .env.example → .env

Add your OpenAI key to .env, then:

uv run quickstart/01_hello_world_prompt.py

That's it. Every script in this repo is self-contained: dependencies are declared inline via PEP 723, so uv run handles everything.
Five scripts, each under 30 seconds. Start here.
| # | Script | What it does |
|---|---|---|
| 01 | Hello World | Classify text with an LLM prompt |
| 02 | Semantic Search | PDF → embeddings → vector search → Turbopuffer |
| 03 | Data Enrichment | ETL pipeline with LLM-based enrichment |
| 04 | Audio Files | Read audio metadata, resample with daft.File |
| 05 | Video Files | Extract video frames and metadata |
Small, focused scripts. One concept each.
| Script | What it shows |
|---|---|
| prompt.py | Basic classification — one function call |
| prompt_structured_outputs.py | Pydantic models for type-safe LLM output |
| prompt_chat_completions.py | Chat-style completions with system personas |
| prompt_files_images.py | Multimodal — send images and PDFs to the model |
| prompt_pdfs.py | Feed entire PDFs into the prompt |
| prompt_openai_web_search.py | Web search tool integration |
| prompt_qa.py | Synthetic Q&A generation with LLM-as-judge |
| prompt_session.py | Stateful prompt sessions |
| prompt_unity_catalog.py | Prompt over Unity Catalog tables |
| prompt_gemini3_code_review.py | Automated code review with Gemini |
| Script | What it shows |
|---|---|
| embed_text.py | Text embeddings at multiple dimensions |
| embed_images.py | Image embeddings with Apple AIMv2 |
| embed_text_providers.py | Compare embedding providers side by side |
| embed_video_frames.py | Embed individual video frames |
| cosine_similarity.py | Semantic similarity search |
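Semantic similarity search rests on cosine similarity between embedding vectors. The sketch below shows the computation in plain Python, independent of Daft; the three-dimensional vectors and the helper names are made up for illustration, and cosine_similarity.py may structure this differently.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def rank_by_similarity(query: list[float], corpus: dict[str, list[float]]) -> list[str]:
    """Return corpus keys sorted from most to least similar to the query."""
    return sorted(corpus, key=lambda k: cosine_similarity(query, corpus[k]), reverse=True)


# Toy 3-dimensional "embeddings"; real models emit hundreds of dimensions.
corpus = {
    "physics": [0.9, 0.1, 0.0],
    "music": [0.1, 0.9, 0.0],
    "cooking": [0.0, 0.1, 0.9],
}
ranking = rank_by_similarity([1.0, 0.2, 0.0], corpus)
print(ranking)
```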
| Script | What it shows |
|---|---|
| daft_file.py | daft.File basics |
| daft_audiofile.py | Audio metadata, resampling |
| daft_audiofile_udf.py | Custom audio processing UDF |
| daft_videofile.py | Video metadata and keyframes |
| daft_videofile_stream.py | Streaming video frame extraction |
| daft_file_pdf.py | PDF parsing and page extraction |
| daft_file_code.py | Source code analysis |
| Script | What it shows |
|---|---|
| daft_func.py | Simple @daft.func UDF |
| daft_func_async.py | Async UDFs for I/O-bound work |
| daft_func_batch.py | Batch-mode UDFs |
| daft_cls_model.py | @daft.cls — load a model once, run it on every row |
| daft_cls_with_types.py | Class UDFs with TypedDict and Pydantic |
| daft_cls_async_client.py | Async class UDFs with persistent clients |
| Script | What it shows |
|---|---|
| stocks.py | Window functions on real stock data — moving averages, rankings, Golden Cross detection |
| Script | What it shows |
|---|---|
| classify_text.py | Text classification |
| classify_image.py | Image classification |
| Script | What it shows |
|---|---|
| read_pdfs.py | Discover and read PDFs from remote storage |
| read_video_files.py | Frame-level video reading with daft.read_video_frames |
| Script | What it shows |
|---|---|
| cc_show.py | Browse Common Crawl data |
| cc_chunk_embed.py | Chunk and embed web pages |
| cc_wet_paragraph_dedupe.py | Paragraph-level deduplication at scale |
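Paragraph-level deduplication can be sketched as hashing each normalized paragraph and keeping only the first occurrence. This is an assumption about the general technique (exact-match hashing), not a description of what cc_wet_paragraph_dedupe.py does; the sample pages and function names are invented.

```python
import hashlib


def normalize(paragraph: str) -> str:
    """Collapse whitespace and lowercase so trivial variants hash identically."""
    return " ".join(paragraph.split()).lower()


def dedupe_paragraphs(documents: list[str]) -> list[str]:
    """Keep the first occurrence of each paragraph across all documents."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in documents:
        for paragraph in doc.split("\n\n"):
            key_text = normalize(paragraph)
            if not key_text:
                continue  # skip empty paragraphs
            digest = hashlib.sha256(key_text.encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                unique.append(paragraph.strip())
    return unique


# Two invented pages that share a boilerplate paragraph.
pages = [
    "Accept all cookies.\n\nDaft reads WET files.",
    "accept  all cookies.\n\nA different paragraph.",
]
unique = dedupe_paragraphs(pages)
print(unique)
```

At Common Crawl scale the `seen` set would not fit in one process; the same idea is usually expressed as a group-by on the hash column.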
End-to-end workflows. These are where things get interesting.
| Pipeline | What it does |
|---|---|
| rag.py | Minimal RAG — embed, retrieve, generate |
| full_rag.py | Full RAG — PDF extraction, PyMuPDF UDF, cross-join ranking, generation |
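The embed, retrieve, generate flow can be outlined without any model calls. In this sketch the "embedding" is a bag-of-words counter standing in for a real model, retrieval is a cross join of questions and chunks ranked by score, and generation is reduced to assembling the prompt that would be sent to an LLM. All names and data are hypothetical.

```python
from collections import Counter
from itertools import product


def embed(text: str) -> Counter:
    """Stand-in embedding: a bag-of-words count vector (not a real model)."""
    return Counter(text.lower().split())


def score(a: Counter, b: Counter) -> int:
    """Dot product of two sparse count vectors."""
    return sum(a[token] * b[token] for token in a)


def retrieve(questions: list[str], chunks: list[str], top_k: int = 1) -> dict[str, list[str]]:
    """Cross-join questions with chunks, score every pair, keep top_k per question."""
    pairs = [(q, c, score(embed(q), embed(c))) for q, c in product(questions, chunks)]
    results: dict[str, list[str]] = {}
    for q in questions:
        ranked = sorted((p for p in pairs if p[0] == q), key=lambda p: p[2], reverse=True)
        results[q] = [chunk for _, chunk, _ in ranked[:top_k]]
    return results


def build_prompt(question: str, context: list[str]) -> str:
    """Assemble the text that would be passed to the generation step."""
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {question}"


chunks = ["daft is a data engine", "mozart was a pianist"]
question = "who was mozart"
retrieved = retrieve([question], chunks)
rag_prompt = build_prompt(question, retrieved[question])
print(rag_prompt)
```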
| Pipeline | What it does |
|---|---|
| lambda_mapreduce.py | 6 long-context reasoning patterns as native query plans (search, summarize, classify, extract, QA, analyze) |
| chunking_strategies.py | Compare fixed-size, sentence, and paragraph chunking |
| few_shot_example_selection.py | Embedding-based few-shot selection |
| llm_judge_elo.py | LLM-as-judge with ELO ranking |
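For readers new to Elo ranking, the sketch below applies the standard Elo update to a list of pairwise verdicts, as an LLM judge might produce them. The K-factor of 32, the starting rating of 1000, and the verdict data are assumptions for illustration and may not match llm_judge_elo.py.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return new ratings; score_a is 1.0 (A wins), 0.5 (tie), or 0.0 (B wins)."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Hypothetical judge verdicts: (winner, loser) pairs between model outputs.
verdicts = [("model_a", "model_b"), ("model_a", "model_c"), ("model_c", "model_b")]
ratings = {"model_a": 1000.0, "model_b": 1000.0, "model_c": 1000.0}
for winner, loser in verdicts:
    ratings[winner], ratings[loser] = update_elo(ratings[winner], ratings[loser], 1.0)

leaderboard = sorted(ratings, key=ratings.get, reverse=True)
print(leaderboard)
```

Each update moves the same number of points from loser to winner, so the total rating in the pool stays constant.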
| Pipeline | What it does |
|---|---|
| voice_ai_analytics.py | Transcription → summarization → translation → embeddings → RAG over transcripts |
| key_moments_extraction.py | Extract and clip key moments from audio transcripts |
| shot_boundary_detection.py | Video scene detection with frame embeddings |
| Pipeline | What it does |
|---|---|
| ai_search.py | PDF search with Turbopuffer |
| embed_docs.py | Codebase analysis with SpaCy chunking and embeddings |
| data_enrichment.py | LLM-powered data enrichment pipeline |
| Pipeline | What it does |
|---|---|
| prompt_github.py | Prompt over GitHub repos |
| cursor.py | Code analysis pipeline |
Processing patterns for real public datasets — not toy data.
| Dataset | Scripts | What you'll process |
|---|---|---|
| Common Crawl | WARC, WAT, WET parsing, text deduplication, chunk & embed | Billions of web pages |
| LAION | Image-text pairs, CLIP training data, metadata | 5B+ image-text pairs |
| Open Images | Image loading, processing, vision models | 9M annotated images |
| TPC-H | SQL queries, performance benchmarks | Industry-standard analytical benchmark |
- Python 3.12
- uv
- FFmpeg (for audio/video examples)
Most examples need an OpenAI key. Some need more. Copy the example and fill in what you have:
cp .env.example .env

| Key | What uses it |
|---|---|
| OPENAI_API_KEY | Most prompt, embed, and RAG examples |
| OPENROUTER_API_KEY | Multi-model and structured output examples |
| TURBOPUFFER_API_KEY | Vector search pipelines |
| AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY | Common Crawl, TPC-H, Open Images |
| HF_TOKEN | Private HuggingFace datasets |
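Before running an example, it can help to check which keys are actually set. The helper below is a hypothetical sketch, not part of this repo; it assumes the values from .env have already been loaded into the process environment, and it takes the key names from the table above.

```python
import os

# Key names copied from the table above.
KNOWN_KEYS = [
    "OPENAI_API_KEY",
    "OPENROUTER_API_KEY",
    "TURBOPUFFER_API_KEY",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "HF_TOKEN",
]


def missing_keys(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
fake_env = {"OPENAI_API_KEY": "sk-placeholder", "HF_TOKEN": ""}
print(missing_keys(["OPENAI_API_KEY", "HF_TOKEN"], fake_env))
```

Calling `missing_keys(KNOWN_KEYS)` with no second argument checks the real environment.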
uv run quickstart/01_hello_world_prompt.py
uv run examples/prompt/prompt.py
uv run pipelines/rag/full_rag.py

Every script declares its own dependencies. No extras to install.
make format         # auto-format with ruff
make lint           # lint check
make precommit      # lint + format check (runs on git commit)
make test           # run all tests
make test-no-creds  # run tests that don't need API keys

See CONTRIBUTING.md for guidelines on adding new examples.
Apache 2.0