 ---
-title: "Data Processing & ETL"
-description: "Build reliable data processing and ETL pipelines with automatic retries, progress tracking, and no timeout limits using Trigger.dev"
+title: "Data processing & ETL workflows"
+sidebarTitle: "Data processing & ETL"
+description: "Learn how to use Trigger.dev for data processing and ETL, including web scraping, database synchronization, batch enrichment, and streaming analytics workflows"
 ---

 import UseCasesCards from "/snippets/use-cases-cards.mdx";

 ## Overview

-Data processing and ETL (Extract, Transform, Load) workflows require handling large datasets, complex transformations, and reliable data movement between systems. Build robust data pipelines in TypeScript with automatic retries, progress tracking, and no timeout limits; perfect for web scraping, database synchronization, real-time analytics, and large-scale data transformation.
+Build data pipelines that process large datasets without timeouts. Handle streaming analytics, batch enrichment, web scraping, database sync, and file processing with automatic retries and progress tracking.

-## Basic data processing and ETL workflow implementation
+## Featured examples

-A typical ETL pipeline:
-
-1. **Extract**: Pull from APIs, databases, S3, or web scraping
-2. **Transform**: Clean, validate, enrich data
-3. **Load**: Write to warehouse, database, or storage
-4. **Monitor**: Track progress, handle failures
-
-Each step is durable and retryable—if transformation fails, Trigger.dev automatically retries without re-extracting source data thanks to [checkpoint-resume](/how-it-works#the-checkpoint-resume-system) and [idempotency keys](/idempotency).
-
-Trigger.dev is ideal for ETL pipelines because there are no [timeout limits](/runs/max-duration) (process datasets for hours or days), [batchTriggerAndWait()](/triggering#yourtask-batchtriggerandwait) parallelizes across thousands of records with [queue.concurrencyLimit](/queue-concurrency) to respect API rate limits, [metadata](/runs/metadata) + [realtime](/realtime) stream row-by-row progress to dashboards, and [schedules.task()](/tasks/scheduled) handles recurring jobs with cron syntax.
-
-## Data processing workflow examples
-
-<CardGroup cols={2}>
+<CardGroup cols={3}>
   <Card
     title="Realtime CSV importer"
     icon="book"
     href="/guides/example-projects/realtime-csv-importer"
   >
-    Import CSV files with progress tracking streamed to the frontend.
+    Import CSV files with progress streamed live to the frontend.
   </Card>
   <Card title="Web scraper with BrowserBase" icon="book" href="/guides/examples/scrape-hacker-news">
-    Scrape Hacker News using BrowserBase and Puppeteer, summarize with ChatGPT.
-  </Card>
-  <Card title="Firecrawl" icon="book" href="/guides/examples/firecrawl-url-crawl">
-    Crawl URLs and return LLM-ready markdown using Firecrawl.
+    Scrape websites using BrowserBase and Puppeteer.
   </Card>
   <Card
     title="Supabase database operations"
     icon="book"
     href="/guides/examples/supabase-database-operations"
   >
-    Run CRUD operations on a Supabase database table.
-  </Card>
-  <Card title="Sequin database triggers" icon="book" href="/guides/frameworks/sequin">
-    Trigger tasks from database changes using Sequin's CDC platform.
-  </Card>
-  <Card
-    title="Sync Vercel environment variables"
-    icon="book"
-    href="/guides/examples/vercel-sync-env-vars"
-  >
-    Automatically sync environment variables from Vercel projects.
+    Run CRUD operations on Supabase database tables.
   </Card>
 </CardGroup>

-## Production use cases
-
-<Card title="Papermark customer story" href="https://trigger.dev/customers/papermark-customer-story">
-
-Read how Papermark processes thousands of documents per month using Trigger.dev.
-
-</Card>
-
-## Common data processing patterns
-
-### Scheduled Data Syncs
-
-Run ETL jobs on a schedule to keep systems in sync:
-
-- Daily database exports and backups
-- Hourly API data pulls and transformations
-- Real-time webhook processing and routing
-- Periodic data warehouse updates
-
-### Event-Driven Processing
-
-Respond to data events with automated workflows:
-
-- Process new database records as they're created
-- Transform uploaded files immediately
-- React to webhook events from external systems
-- Handle real-time data streams
-
-### Batch Processing
-
-Process large datasets efficiently:
-
-- Import CSV files with thousands of rows
-- Bulk update records across systems
-- Process queued data in parallel batches
-- Generate reports from aggregated data
-
-### Pipeline Orchestration
-
-Chain multiple processing steps together:
-
-- Extract from API → Transform → Load to database
-- Web scraping → Data cleaning → Analysis → Storage
-- File upload → Validation → Processing → Notification
-- Multi-source data aggregation and enrichment
+## Why Trigger.dev for data processing
+
+**Process datasets for hours without timeouts**
+
+Handle multi-hour transformations, large file processing, or complete database exports. No execution time limits.
+
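+A minimal sketch of a long-running task, assuming the v3 SDK (`@trigger.dev/sdk/v3`); the task id, payload shape, and `maxDuration` value are illustrative:
+
+```ts
+import { task } from "@trigger.dev/sdk/v3";
+
+export const exportTable = task({
+  id: "export-table", // hypothetical id
+  maxDuration: 60 * 60 * 8, // allow up to 8 hours (value in seconds)
+  run: async (payload: { tableName: string }) => {
+    // Stream rows out of the database and write them to storage.
+    // This loop can run for hours; the run is not subject to
+    // serverless-style request timeouts.
+  },
+});
+```
+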
+**Parallel processing with built-in rate limiting**
+
+Process thousands of records simultaneously while respecting API rate limits. Scale efficiently without overwhelming downstream services.
+
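+A sketch of fan-out with a queue concurrency limit, again assuming the v3 SDK; the task ids, payload shapes, and the limit of 10 are illustrative:
+
+```ts
+import { task } from "@trigger.dev/sdk/v3";
+
+// Child task: at most 10 runs execute at once, keeping the
+// downstream API under its rate limit.
+export const enrichRecord = task({
+  id: "enrich-record",
+  queue: { concurrencyLimit: 10 },
+  run: async (payload: { recordId: string }) => {
+    // Call the rate-limited enrichment API for one record here.
+  },
+});
+
+// Parent task: fans out one child run per record and waits for all of them.
+export const enrichAll = task({
+  id: "enrich-all",
+  run: async (payload: { recordIds: string[] }) => {
+    return await enrichRecord.batchTriggerAndWait(
+      payload.recordIds.map((recordId) => ({ payload: { recordId } }))
+    );
+  },
+});
+```
+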
+**Stream progress to your users in real-time**
+
+Show row-by-row processing status updating live in your dashboard. Users see exactly where processing stands and how much time remains.
+
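+A sketch of progress reporting with run metadata; the `progress` key is an arbitrary name, and a frontend would subscribe to it with Trigger.dev's Realtime hooks:
+
+```ts
+import { task, metadata } from "@trigger.dev/sdk/v3";
+
+export const importRows = task({
+  id: "import-rows", // hypothetical id
+  run: async (payload: { rows: string[] }) => {
+    for (let i = 0; i < payload.rows.length; i++) {
+      // ...process payload.rows[i]...
+
+      // Update run metadata; subscribers see this change in real time.
+      metadata.set("progress", (i + 1) / payload.rows.length);
+    }
+  },
+});
+```
+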
+## Common workflows
+
+Here are some common data processing and ETL workflows; a minimal pipeline sketch follows the tabs:
+
+<Tabs>
+  <Tab title="ETL pipeline">
+    <Steps>
+      <Step title="Extract">Pull from APIs, databases, S3, or web scraping</Step>
+      <Step title="Transform">Clean, validate, enrich data</Step>
+      <Step title="Load">Write to warehouse, database, or storage</Step>
+      <Step title="Monitor">Track progress, handle failures</Step>
+    </Steps>
+  </Tab>
+  <Tab title="Web scraping">
+    <Steps>
+      <Step title="Navigate">Load target pages with headless browser</Step>
+      <Step title="Extract">Pull content, links, structured data</Step>
+      <Step title="Transform">Clean HTML, parse JSON, normalize data</Step>
+      <Step title="Store">Save to database or file storage</Step>
+    </Steps>
+  </Tab>
+  <Tab title="Batch enrichment">
+    <Steps>
+      <Step title="Query">Fetch records needing enrichment</Step>
+      <Step title="Enrich">Call external APIs in parallel batches</Step>
+      <Step title="Validate">Check data quality and completeness</Step>
+      <Step title="Update">Write enriched data back to database</Step>
+    </Steps>
+  </Tab>
+  <Tab title="File processing">
+    <Steps>
+      <Step title="Upload">Receive file via webhook or storage event</Step>
+      <Step title="Parse">Read CSV, JSON, XML, or binary format</Step>
+      <Step title="Process">Transform, validate, chunk large files</Step>
+      <Step title="Import">Bulk insert to database or data warehouse</Step>
+    </Steps>
+  </Tab>
+</Tabs>
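+
+For example, the ETL shape above might look like this as a single task. A minimal sketch: `sourceUrl`, the transform rules, and `loadToWarehouse` are illustrative placeholders for your own source, validation, and loader:
+
+```ts
+import { task } from "@trigger.dev/sdk/v3";
+
+export const etlPipeline = task({
+  id: "etl-pipeline",
+  retry: { maxAttempts: 3 }, // failed attempts retry automatically
+  run: async (payload: { sourceUrl: string }) => {
+    // Extract: pull raw records from the source API.
+    const response = await fetch(payload.sourceUrl);
+    const rawRecords: Array<Record<string, unknown>> = await response.json();
+
+    // Transform: drop invalid rows and normalize fields.
+    const cleaned = rawRecords
+      .filter((record) => record.id != null)
+      .map((record) => ({
+        id: String(record.id),
+        name: String(record.name ?? "").trim(),
+      }));
+
+    // Load: write the cleaned batch to your warehouse.
+    // await loadToWarehouse(cleaned); // placeholder loader
+
+    return { extracted: rawRecords.length, loaded: cleaned.length };
+  },
+});
+```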

 <UseCasesCards />