From 08a3786049a45617df2b98c3b88ca1ba6e712ce1 Mon Sep 17 00:00:00 2001
From: "github-actions[bot]" <41898282+github-actions[bot]@users.noreply.github.com>
Date: Thu, 16 Apr 2026 16:09:09 -0400
Subject: [PATCH] fix(vendor): harden firecrawl trust center crawling
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* fix(vendor): harden firecrawl trust center crawling

* refactor(vendor): export TRUSTED_PORTAL_DOMAINS and add host check helper

Co-Authored-By: Claude Sonnet 4.6

* feat(vendor): add trust portal section-url discovery helper

Co-Authored-By: Claude Sonnet 4.6

* feat(vendor): add certification merge helper with status priority

Pure mergeCertifications function dedupes by canonical slug and resolves
status via verified > expired > unknown > not_certified priority,
preferring core URL/dates on ties.

Co-Authored-By: Claude Sonnet 4.6

* feat(vendor): scaffold trust portal deep-scrape orchestrator with gate

Co-Authored-By: Claude Sonnet 4.6

* feat(vendor): implement trust portal deep-scrape orchestrator

Clicks through SPA sidebar sections, concatenates markdown from each,
and extracts certifications via Claude Sonnet 4.6.

* fix(vendor): escape CSS selector values and cover concurrency bound

Add cssEscapeAttr helper to sanitize `\` and `"` inside CSS
double-quoted attribute values in buildSectionScrapeOptions, preventing
silent selector no-ops for anchor slugs containing CSS-reserved
characters.

Add two new tests: one verifying the escaping (using `\` which survives
URL normalization) and one confirming mapWithConcurrency covers all
items when section count (8) exceeds SECTION_CONCURRENCY (5).

Co-Authored-By: Claude Sonnet 4.6

* feat(vendor): run trust portal deep-scrape after core agent

Resolves a source URL (trust center -> security page -> verified cert
url), runs deepScrapeTrustPortal, and merges certifications before
returning.
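
For reviewers, the merge step described above (dedupe by canonical slug,
status priority verified > expired > unknown > not_certified, core wins
ties) can be sketched roughly as follows. This is an illustration only,
not the code in this patch: the Certification shape, the canonicalSlug
rule, and the (core, deep) argument order are assumptions.

```typescript
// Hypothetical shapes; the real Certification type is not shown in this message.
type CertStatus = "verified" | "expired" | "unknown" | "not_certified";

interface Certification {
  type: string; // e.g. "SOC 2"
  status: CertStatus;
  url?: string;
  issuedAt?: string;
  expiresAt?: string;
}

// Higher number wins: verified > expired > unknown > not_certified.
const STATUS_PRIORITY: Record<CertStatus, number> = {
  verified: 3,
  expired: 2,
  unknown: 1,
  not_certified: 0,
};

// Assumed slug rule: lowercase, collapse non-alphanumeric runs to "-".
function canonicalSlug(type: string): string {
  return type
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Pure function: returns a new array and does not mutate its inputs.
function mergeCertifications(
  core: Certification[],
  deep: Certification[],
): Certification[] {
  const merged = new Map<string, Certification>();
  // Core entries are visited first, so they are kept on status ties.
  for (const cert of core.concat(deep)) {
    const slug = canonicalSlug(cert.type);
    const existing = merged.get(slug);
    // Replace only on a strictly higher status priority.
    if (
      !existing ||
      STATUS_PRIORITY[cert.status] > STATUS_PRIORITY[existing.status]
    ) {
      merged.set(slug, cert);
    }
  }
  return Array.from(merged.values());
}
```

In this sketch a deep-scrape row replaces a core row only when its
status ranks strictly higher, which is one simple way to get "core
URL/dates on ties"; the real helper may merge individual fields instead.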

* refactor(vendor): extract pickDeepScrapeSourceUrl and tighten extraction prompt

Move pickDeepScrapeSourceUrl into its own module with unit tests so
firecrawl-agent-core.ts drops below the 300-line limit. Also hoist the
Firecrawl Agent JSON schema into firecrawl-agent-schema-json.ts for the
same reason. Tighten the Sonnet 4.6 extraction prompt to explicitly
require evidence_snippet so Claude doesn't silently drop rows.

* feat(vendor): log Agent snapshot, deep-scrape decision, and persisted certs

Adds three diagnostic logs so a trigger.dev run tells the full story:

- "Firecrawl Agent returned — pre-deep-scrape snapshot" dumps the raw
  Agent links, normalized links, and cert types/statuses before the
  deep-scrape decision. Exposes what the LLM actually found.
- Deep-scrape branch logs either "source URL resolved" + merged types,
  "returned no certifications", or "skipped: no usable URL on vendor
  domain" with available links + verified certs, so gate decisions are
  no longer silent.
- "Risk level and badges extracted" now includes the full compliance
  badge payload and the certifications array being persisted to the
  vendor record, so DB-write state is inspectable from logs.

* fix(vendor): json-stringify complex diagnostic log fields

Trigger.dev's OpenTelemetry attribute pipeline strips nested objects and
arrays, keeping only top-level scalars, so rich log payloads like
rawAgentLinks, normalizedLinks, and complianceBadges were being silently
discarded. Serialize them to JSON strings so they survive the OTel
export and surface in the dashboard / MCP span details.

* feat(vendor): rewrite Firecrawl Agent prompt — URL-discovery first

The prior prompt treated trust_center_url as just another field, so when
the Agent failed to extract certifications from a JavaScript SPA (e.g.
ui.com/trust-center) it abandoned the whole output, including the URL
the downstream deep-scrape needs.

The new prompt reframes the mission:

- Primary goal: return trust_center_url even when page content is empty
  or SPA-only. Deep-scrape handles rendering; the Agent just has to
  find the URL.
- Explicit numbered URL paths to try when nav discovery fails,
  including third-party portals keyed off the vendor slug.
- Explicit instruction to return URLs of SPA-only pages rather than
  discarding them.
- Stricter output contract marking trust_center_url as REQUIRED when
  any trust/security/compliance surface exists on the vendor domain.
- Bumped maxCredits 2500 → 4000 to give the Agent headroom on sites
  that require multi-hop discovery.

Prompt extracted into firecrawl-agent-prompt.ts to keep the core
orchestrator under the 300-line limit.

* chore(vendor): log raw firecrawl agent response for ui.com diagnosis

Adds temporary diagnostic logs capturing:

- agentResponse.success / status / error / keys (before schema parse)
- first 4KB of the raw agentResponse JSON
- first 4KB of parsed.data JSON, plus security_assessment and risk_level

The agent is returning links: null for Ubiquiti even after the
URL-first prompt rewrite. We need to see what it IS returning to
understand whether it's a fetch block, a model compliance issue, or a
parse path we're missing.

Pushes the file to 315 lines; will roll back once diagnosed.

* fix(vendor): handle firecrawl agent processing status + extend timeouts

Discovered via new diagnostic log: the Firecrawl SDK's agent call was
returning status="processing" on ui.com because its internal poll timed
out (360s) before the agent job completed on Firecrawl's side. Our code
only guarded against status="failed", so it silently parsed the empty
response as success, leaving vendor records with no certifications even
when the agent could have found them given more time.

Changes:

- Guard on status !== "completed" instead of just "failed"; log clearly
  when the SDK returns while the job is still processing so timeouts
  are visible instead of silent.
- Bump agent SDK timeout 360s -> 1500s (25 min) so slow SPA trust
  centers like Ubiquiti have room to finish.
- Bump task maxDuration 10 min -> 30 min to accommodate the longer
  agent call plus deep-scrape + DB writes.

* fix(vendor): score agent payload candidates by populated fields

The firecrawl agent response has a nested shape:

  { success, status, data: { links, certifications, ... }, ... }

extractAgentPayloadCandidates returns [wrapper, wrapper.data] in that
order, and every field in vendorRiskAssessmentAgentSchema is optional.
The wrapper therefore parsed successfully as an empty object and won the
first-match .find() lookup, even though it contained no real fields. The
actual .data payload (with trust_center_url, security page, privacy
policy, etc.) was silently discarded.

Pick the candidate with the most populated schema fields instead of the
first success. This has been a latent bug on main: the Ubiquiti run on
v20260415.12 showed the same "found 0 links, 0 certifications" symptom.

* fix(vendor): remove invalid maxCredits from scrape calls

Firecrawl's v2 /scrape endpoint rejects maxCredits; that option belongs
to the Agent API, not scrape. We were passing it on both the initial
scrape and the per-section scrapes, and Firecrawl was returning
"Unrecognized key in body", causing the deep-scrape pass to fail on its
very first call.

Replace with `timeout` (2 min per scrape, within Firecrawl's 5-min cap)
which is the scrape v2 equivalent of "budget per call."

* chore(vendor): log raw initial scrape output for section discovery diag

The Ubiquiti run finished with sectionCount=0 even though the initial
scrape returned 9891 chars of markdown. Need to see what
firecrawlClient.scrape actually returned in `links` to understand
whether the sidebar items are missing from the response or whether
discoverSectionUrls is wrongly filtering them out.

Logs the first 50 links and the first 2KB of markdown from the initial
scrape.
Temporary diagnostic, will trim once the sidebar discovery strategy is
fixed.

* feat(vendor): llm-driven tab discovery for spa trust portals

Ubiquiti's trust center sidebar items are