Commit 55197c5: init (initial commit, 0 parents)

28 files changed: 4059 additions, 0 deletions
.gitignore

Lines changed: 17 additions & 0 deletions
```
# Mintlify
.mintlify/

# Dependencies
node_modules/

# OS
.DS_Store

# IDE
.idea/
.vscode/
*.swp
*.swo

# Claude
.claude/
```

README.md

Lines changed: 21 additions & 0 deletions
# reader-docs

Documentation for [Reader](https://github.com/vakra-dev/reader) - open-source web scraping for LLMs.

Built with [Mintlify](https://mintlify.com).

## Development

```bash
npx mintlify dev
```

Open [http://localhost:3000](http://localhost:3000).

## Deployment

Push to GitHub and connect the repository to the Mintlify Dashboard for automatic deployments.

## License

Apache 2.0

api-reference/crawl-options.mdx

Lines changed: 161 additions & 0 deletions
---
title: CrawlOptions
description: Options for the crawl() function
---

## Type Definition

```typescript
interface CrawlOptions {
  // Required
  url: string;

  // Crawl limits
  depth?: number;
  maxPages?: number;

  // Scraping
  scrape?: boolean;
  scrapeConcurrency?: number;
  formats?: Array<"markdown" | "html">;

  // Rate limiting
  delayMs?: number;
  timeoutMs?: number;

  // URL filtering
  includePatterns?: string[];
  excludePatterns?: string[];

  // Request configuration
  userAgent?: string;
  proxy?: ProxyConfig;

  // Debugging
  verbose?: boolean;
  showChrome?: boolean;
}
```

## Options Reference

### Required Options

| Option | Type     | Description                     |
| ------ | -------- | ------------------------------- |
| `url`  | `string` | Seed URL to start crawling from |

### Crawl Limit Options

| Option     | Type     | Default | Description               |
| ---------- | -------- | ------- | ------------------------- |
| `depth`    | `number` | `1`     | Maximum crawl depth       |
| `maxPages` | `number` | `20`    | Maximum pages to discover |

### Scraping Options

| Option              | Type                          | Default                | Description                              |
| ------------------- | ----------------------------- | ---------------------- | ---------------------------------------- |
| `scrape`            | `boolean`                     | `false`                | Also scrape content of discovered pages  |
| `scrapeConcurrency` | `number`                      | `2`                    | Concurrent scraping threads              |
| `formats`           | `Array<"markdown" \| "html">` | `["markdown", "html"]` | Output formats when scraping             |

### Rate Limiting Options

| Option      | Type     | Default     | Description                  |
| ----------- | -------- | ----------- | ---------------------------- |
| `delayMs`   | `number` | `1000`      | Delay between requests (ms)  |
| `timeoutMs` | `number` | `undefined` | Total timeout for crawl (ms) |

### URL Filtering Options

| Option            | Type       | Default     | Description                     |
| ----------------- | ---------- | ----------- | ------------------------------- |
| `includePatterns` | `string[]` | `undefined` | URL patterns to include (regex) |
| `excludePatterns` | `string[]` | `undefined` | URL patterns to exclude (regex) |

### Request Configuration Options

| Option      | Type          | Default     | Description              |
| ----------- | ------------- | ----------- | ------------------------ |
| `userAgent` | `string`      | `undefined` | Custom user agent string |
| `proxy`     | `ProxyConfig` | `undefined` | Proxy configuration      |

### Debugging Options

| Option       | Type      | Default | Description            |
| ------------ | --------- | ------- | ---------------------- |
| `verbose`    | `boolean` | `false` | Enable verbose logging |
| `showChrome` | `boolean` | `false` | Show browser window    |
91+
## Examples
92+
93+
### Basic Crawl
94+
95+
```typescript
96+
await reader.crawl({
97+
url: "https://example.com",
98+
depth: 2,
99+
maxPages: 50,
100+
});
101+
```
102+
103+
### Crawl with Scraping
104+
105+
```typescript
106+
await reader.crawl({
107+
url: "https://example.com",
108+
depth: 2,
109+
maxPages: 50,
110+
scrape: true,
111+
scrapeConcurrency: 5,
112+
formats: ["markdown"],
113+
});
114+
```
115+
116+
### With URL Filtering
117+
118+
```typescript
119+
await reader.crawl({
120+
url: "https://example.com",
121+
depth: 3,
122+
maxPages: 100,
123+
includePatterns: ["^/docs/", "^/guides/"],
124+
excludePatterns: ["^/admin/", "^/api/"],
125+
});
126+
```
127+
128+
### With Rate Limiting
129+
130+
```typescript
131+
await reader.crawl({
132+
url: "https://example.com",
133+
depth: 2,
134+
delayMs: 2000, // 2 seconds between requests
135+
timeoutMs: 300000, // 5 minute total timeout
136+
});
137+
```
138+
139+
### Full Options
140+
141+
```typescript
142+
await reader.crawl({
143+
url: "https://example.com",
144+
depth: 3,
145+
maxPages: 100,
146+
scrape: true,
147+
scrapeConcurrency: 5,
148+
formats: ["markdown"],
149+
delayMs: 1000,
150+
timeoutMs: 600000,
151+
includePatterns: ["^/docs/"],
152+
excludePatterns: ["^/docs/legacy/"],
153+
proxy: {
154+
host: "proxy.example.com",
155+
port: 8080,
156+
username: "user",
157+
password: "pass",
158+
},
159+
verbose: true,
160+
});
161+
```

api-reference/crawl-result.mdx

Lines changed: 141 additions & 0 deletions
---
title: CrawlResult
description: Result structure from the crawl() function
---

## Type Definition

```typescript
interface CrawlResult {
  urls: CrawlUrl[];
  scraped?: ScrapeResult;
  metadata: CrawlMetadata;
}
```

## CrawlUrl

Information about each discovered URL:

```typescript
interface CrawlUrl {
  url: string;
  title: string;
  description: string | null;
}
```
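
Because `description` is `string | null`, guard against the `null` case when printing. A small sketch, assuming `result` is a `CrawlResult` returned by `reader.crawl()`:

```typescript
// Fall back to a placeholder when a page has no description.
for (const page of result.urls) {
  console.log(`${page.title}: ${page.description ?? "(no description)"}`);
}
```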

## CrawlMetadata

Metadata about the crawl operation:

```typescript
interface CrawlMetadata {
  totalUrls: number;
  maxDepth: number;
  totalDuration: number; // Milliseconds
  seedUrl: string;
}
```

## ScrapeResult

When `scrape: true`, the `scraped` property contains the scraped content:

```typescript
interface ScrapeResult {
  data: WebsiteScrapeResult[];
  batchMetadata: BatchMetadata;
}
```

See [ScrapeResult](/api-reference/scrape-result) for the full structure.

## Examples

### Access Discovered URLs

```typescript
const result = await reader.crawl({
  url: "https://example.com",
  depth: 2,
  maxPages: 50,
});

console.log(`Found ${result.urls.length} pages`);

result.urls.forEach((page) => {
  console.log(`- ${page.title}`);
  console.log(`  URL: ${page.url}`);
  console.log(`  Description: ${page.description}`);
});
```

### Access Crawl Metadata

```typescript
const result = await reader.crawl({
  url: "https://example.com",
  depth: 2,
});

const { metadata } = result;

console.log("Seed URL:", metadata.seedUrl);
console.log("Total URLs:", metadata.totalUrls);
console.log("Max Depth:", metadata.maxDepth);
console.log("Duration:", metadata.totalDuration, "ms");
```

### Access Scraped Content

```typescript
const result = await reader.crawl({
  url: "https://example.com",
  depth: 2,
  scrape: true,
});

if (result.scraped) {
  console.log(`Scraped ${result.scraped.batchMetadata.successfulUrls} pages`);

  result.scraped.data.forEach((page) => {
    console.log(`Title: ${page.metadata.website.title}`);
    console.log(`Content: ${page.markdown?.substring(0, 200)}...`);
  });
}
```

### Full Example

```typescript
const result = await reader.crawl({
  url: "https://docs.example.com",
  depth: 3,
  maxPages: 100,
  scrape: true,
  formats: ["markdown"],
});

console.log("=== Crawl Summary ===");
console.log(`Seed URL: ${result.metadata.seedUrl}`);
console.log(`Pages discovered: ${result.metadata.totalUrls}`);
console.log(`Duration: ${(result.metadata.totalDuration / 1000).toFixed(1)}s`);

console.log("\n=== Discovered URLs ===");
result.urls.forEach((page, i) => {
  console.log(`${i + 1}. ${page.title}`);
  console.log(`   ${page.url}`);
});

if (result.scraped) {
  console.log("\n=== Scraped Content ===");
  console.log(`Success: ${result.scraped.batchMetadata.successfulUrls}`);
  console.log(`Failed: ${result.scraped.batchMetadata.failedUrls}`);

  result.scraped.data.forEach((page) => {
    console.log(`\n--- ${page.metadata.website.title} ---`);
    console.log(page.markdown?.substring(0, 500));
  });
}
```
