chore: update skill

zhjwpku · zhjwpku · commit d7f5d395d944 · 2026-03-20T23:22:07.000+08:00
diff --git a/.cursor/skills/pgweekly-blog-generation/SKILL.md b/.cursor/skills/pgweekly-blog-generation/SKILL.md
@@ -9,26 +9,29 @@ Generates English and Chinese technical blog posts from PostgreSQL mailing list
 
 ## Quick Workflow
 
-1. **Fetch** thread data (required; do not skip): run the fetch script so that the thread HTML, Markdown, and **all patch attachments** are downloaded and saved under `data/threads/`:
+1. **Fetch** thread data (required; do not skip):
    ```bash
    python3 tools/fetch_data.py --thread-id "{THREAD_ID_OR_URL}"
    ```
-   This creates `data/threads/YYYY-MM-DD/<sanitized-thread-id>/` and downloads every `.patch` (and other allowed attachments) into `data/threads/YYYY-MM-DD/<sanitized-thread-id>/attachments/`. Always run this step before writing the blog.
+   - **Wait for the command to finish** (check exit code is 0). Do not proceed if fetch failed.
+   - This creates `data/threads/YYYY-MM-DD/<sanitized-thread-id>/` and downloads attachments into `attachments/`.
+   - The `YYYY-MM-DD` in the path is the **fetch date** (when you ran the script), NOT the thread date—do not use it for year/week.
 
 2. **Locate** fetched content in `data/threads/YYYY-MM-DD/<thread-id>/`:
    - `thread.html` - Original HTML
    - `thread.md` - Converted Markdown
-   - `metadata.txt` - Thread info (use for year/week)
+   - `metadata.txt` - Thread info
    - `attachments/` - **Downloaded patches** (e.g. `.patch` files from the mailing list)
    - `attachments.txt` - List of downloaded attachment filenames
 
-3. **Verify** all patch set versions are downloaded (required before analyze):
-   - Read `thread.md` and `thread.html` to identify all patch versions referenced in the thread (e.g. v1, v2, v3, v4, v5…; also patterns like `0001-`, `0002-` in patch series)
-   - List files in `attachments/` and compare: every referenced version must have a corresponding downloaded file
-   - If any referenced version is missing:
-     - Run `python3 tools/fetch_data.py --thread-dir "data/threads/YYYY-MM-DD/<thread-id>"` to retry downloading missing attachments
-     - If still missing, do not proceed with analysis; report the missing versions and ask the user to verify the thread or manually add the patches
-   - Only proceed to analyze/generate once all referenced patch versions are present in `attachments/`
+3. **Verify** all patch set versions are downloaded — **MANDATORY GATE; do not skip**:
+   - Read `thread.md` and `thread.html` to identify **all** patch versions referenced (v1, v2, v3…; or `0001-`, `0002-` in patch series)
+   - Run `ls data/threads/YYYY-MM-DD/<thread-id>/attachments/` and compare with the list of referenced versions
+   - **If any referenced version is missing:**
+     - Run `python3 tools/fetch_data.py --thread-dir "data/threads/YYYY-MM-DD/<thread-id>"` to retry
+     - Re-verify; if still missing, **STOP** — report missing versions to the user and do not write the blog
+   - **If the thread has no patches**, verification passes (nothing to check).
+   - **CRITICAL:** Do not proceed to step 4 (Analyze) until you have explicitly confirmed: "Referenced versions: [list] ✓ All present in attachments/". Only then may you analyze the thread and write the blog.
 
 4. **Analyze** content:
    - If multiple patch versions (v1, v2, v3...), run `diff -u` between versions to explain evolution
@@ -48,7 +51,7 @@ Generates English and Chinese technical blog posts from PostgreSQL mailing list
    - Chinese: `src/cn/{year}/{week}/{descriptive-filename}.md`
    - Filename: kebab-case from main topic (e.g. `planner-count-optimization`)
 
-7. **Update** SUMMARY.md and year READMEs:
+7. **Update** `src/SUMMARY.md` and year READMEs:
    - Add entries under both `# 🇬🇧 English` and `# 🇨🇳 中文`
    - Follow existing hierarchy: year → week → link to article
    - **Put the new week/article at the top** (newest first): insert the new week immediately after the year line, so the latest week appears first in the list.
@@ -57,7 +60,12 @@ Generates English and Chinese technical blog posts from PostgreSQL mailing list
 
 ## Year/Week
 
-Determine from `metadata.txt` (thread date) or use current date. Use ISO week number (e.g. 06 for week 6).
+**Use the blog writing date (the day you write the blog) as the source of truth.** This determines which week the article is filed under.
+
+**Rules:**
+- Compute ISO year and ISO week from **today's date** (the date when the blog is being written).
+- Example: if writing on 2026-03-20, use year=2026, week=12 (from `datetime(2026, 3, 20).isocalendar()`).
+- **Do NOT use** the thread date, anything in `metadata.txt` (including its `Thread date:`, `ISO year:`/`ISO week:`, and `Downloaded:` lines), or the directory name `YYYY-MM-DD` (fetch date) for year/week.
 
 ## Writing Guidelines
 
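Note on the Year/Week rule above: a minimal sketch of deriving the ISO year and zero-padded ISO week from the blog writing date. The helper name `iso_year_week` is hypothetical and not part of this repository; the `src/cn/{year}/{week}/` layout is taken from the skill text.

```python
from datetime import date
from typing import Optional, Tuple


def iso_year_week(today: Optional[date] = None) -> Tuple[str, str]:
    """Return (ISO year, zero-padded ISO week) for the blog writing date.

    Hypothetical helper for illustration; defaults to the current date.
    """
    d = today or date.today()
    iso = d.isocalendar()  # (ISO year, ISO week, ISO weekday)
    return str(iso[0]), f"{iso[1]:02d}"


# The example from the skill text: writing on 2026-03-20.
year, week = iso_year_week(date(2026, 3, 20))
print(f"src/cn/{year}/{week}/")  # prints "src/cn/2026/12/"
```

The ISO year can differ from the calendar year around New Year (for example, 2027-01-01 falls in ISO week 53 of 2026), so the year in the path should come from `isocalendar()` rather than from `date.year`.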
diff --git a/tools/fetch_data.py b/tools/fetch_data.py
@@ -5,6 +5,7 @@
 import re
 from datetime import datetime
 from pathlib import Path
+from email.utils import parsedate_to_datetime
 import urllib.request
 from html.parser import HTMLParser
 
@@ -51,6 +52,37 @@ def extract_title(html: str) -> str:
     return "PostgreSQL Thread Summary"
 
 
+def extract_thread_date(html: str) -> str | None:
+    """Extract the first/original message date from thread HTML for year/week determination.
+    Returns YYYY-MM-DD or None if not found.
+    """
+    # RFC 2822 style: "Tue, 20 Jan 2026 12:00:00 +0000" or "Date: Tue, 20 Jan 2026..."
+    rfc2822 = re.findall(
+        r'(?:Date:\s*)?([A-Za-z]{3},\s*\d{1,2}\s+[A-Za-z]{3}\s+\d{4}\s+\d{1,2}:\d{2}(?::\d{2})?\s*[+-]\d{4})',
+        html
+    )
+    for s in rfc2822:
+        try:
+            dt = parsedate_to_datetime(s.strip())
+            return dt.strftime("%Y-%m-%d")
+        except (ValueError, TypeError):
+            continue
+    # "On Tue, Jan 20, 2026 at 12:00 PM" style
+    on_wrote = re.findall(
+        r'On\s+([A-Za-z]{3}),\s*([A-Za-z]{3})\s+(\d{1,2}),?\s+(\d{4})',
+        html
+    )
+    if on_wrote:
+        try:
+            # Use first (original) message date
+            _, month_str, day, year = on_wrote[0]
+            dt = datetime.strptime(f"{month_str} {day} {year}", "%b %d %Y")
+            return dt.strftime("%Y-%m-%d")
+        except ValueError:
+            pass
+    return None
+
+
 def html_to_markdown(html: str) -> str:
     """Convert HTML to Markdown using html2text if available."""
     if HAS_HTML2TEXT:
@@ -297,16 +329,31 @@ def main() -> None:
         print("  No attachments found")
 
     # Step 6: Create metadata file
-    metadata_path = thread_dir / "metadata.txt"
-    metadata_content = "\n".join([
+    thread_date_str = extract_thread_date(html)
+    iso_year, iso_week = "", ""
+    if thread_date_str:
+        try:
+            dt = datetime.strptime(thread_date_str, "%Y-%m-%d")
+            iso_year = str(dt.isocalendar()[0])
+            iso_week = f"{dt.isocalendar()[1]:02d}"
+        except ValueError:
+            pass
+
+    metadata_lines = [
         f"Thread ID: {thread_id}",
         f"Title: {title}",
         f"Downloaded: {datetime.now().isoformat()}",
         f"HTML Size: {len(html)} bytes",
         f"Markdown Size: {len(markdown_content)} chars",
         f"Attachments: {len(attachments) if attachments else 0}",
-    ])
-    metadata_path.write_text(metadata_content, encoding="utf-8")
+    ]
+    if thread_date_str:
+        metadata_lines.insert(2, f"Thread date: {thread_date_str}")
+        if iso_year and iso_week:
+            metadata_lines.insert(3, f"ISO year: {iso_year}, ISO week: {iso_week}")
+
+    metadata_path = thread_dir / "metadata.txt"
+    metadata_path.write_text("\n".join(metadata_lines), encoding="utf-8")
 
     print(f"\n✅ Done! All files saved to: {thread_dir.resolve()}")
     print(f"\nContents:")
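Note on the new metadata fields: the RFC 2822 branch and the ISO week formatting can be exercised in isolation. The sketch below is an illustration only; `metadata_date_lines` is a hypothetical helper and the sample header value is made up, but the two output lines follow the `Thread date:` and `ISO year: ..., ISO week: ...` formats written by this change.

```python
from email.utils import parsedate_to_datetime
from typing import List


def metadata_date_lines(header_value: str) -> List[str]:
    """Build the two date-related metadata lines from an RFC 2822 date string.

    Hypothetical helper; mirrors the formatting used for metadata.txt.
    """
    dt = parsedate_to_datetime(header_value.strip())
    iso = dt.isocalendar()  # (ISO year, ISO week, ISO weekday)
    return [
        f"Thread date: {dt.strftime('%Y-%m-%d')}",
        f"ISO year: {iso[0]}, ISO week: {iso[1]:02d}",
    ]


# Made-up sample value, not taken from a real thread.
for line in metadata_date_lines("Tue, 20 Jan 2026 12:00:00 +0000"):
    print(line)
# prints:
# Thread date: 2026-01-20
# ISO year: 2026, ISO week: 04
```

`parsedate_to_datetime` keeps the UTC offset given in the header, so the extracted date is the sender's local date, which can differ from the UTC date for messages sent near midnight.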