Skip to content

reglab/procurement-ca

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

procurement-ca

A repository of two components from Stanford RegLab's California procurement research. It is shared so partners can see how the work operates — it is not meant to be run end-to-end.

The two components are:

  1. Market Matching Pipeline (pipeline/) — a multi-stage LLM system that takes a government purchase line item and finds comparable real-world market listings so the purchase price can be benchmarked.
  2. LPA Attachment Downloader (lpa/) — a browser automation script that downloads the attachment files (contract PDFs, price lists, user instructions) for Leveraged Procurement Agreement (LPA) contracts from California's public eProcurement portal.

Repository layout

procurement-odi/
├── README.md
├── requirements.txt
├── .env.example              # template for API keys (pipeline only)
├── helpers/
│   └── db.py                 # PostgreSQL layer (in our actual code, we uploaded data from Fi$Cal SCPRS to a DB)
├── pipeline/                 # System 1 — market matching pipeline
│   ├── pipeline.py           # entry point: orchestrates the 5 phases, handles resume
│   ├── run.py                # the per-phase implementations
│   ├── batch.py              # OpenAI Batch API helpers (submit / poll / parse)
│   ├── callers.py            # direct (non-batch) LLM call helpers
│   └── scraping.py           # Google Search + page fetch via ScraperAPI
├── prompts/                  # the LLM system prompts (one per reasoning phase)
│   ├── query_generation.txt
│   ├── field_extraction.txt
│   ├── conflict_detection.txt
│   └── price_normalization.txt
└── lpa/                      # System 2 — LPA attachment downloader
    └── download_lpa_attachments.py

System 1 — Market Matching Pipeline

What it does

Given a CSV of procurement line items, the pipeline produces, for each item, a set of real market listings that plausibly match it, with prices normalized to a common unit so the government's paid price can be compared to the market. The hard parts it solves are (a) turning a terse, often cryptic line-item description into a good web search, (b) reading messy vendor pages to extract structured product fields, (c) deciding whether a candidate listing is really the same product, and (d) making prices comparable across different units/pack sizes.

The five phases

The pipeline runs as a sequence of phases. Each phase writes its output to the run directory, so an interrupted run resumes from where it stopped.

 input CSV (line items)
        │
 [Step 1] Query generator   → filter out non-goods, extract structured fields from the
        │                 description, and generate a web search query
 [Step 2] Scraping          → run the query via Google Search (ScraperAPI), then fetch the
        │                 text of the top N result pages concurrently
 [Step 3] Extractor         → for each scraped page, extract structured product fields
        │                 (LLM, run in batches, one round per result position)
 [Step 4] Conflict detector → compare each candidate listing first against the reference item
                          and against one another, keeping only genuine matches (reject the rest with a reason)
 [Step 5] Price normalization → convert each surviving candidate's price to the reference
        │                 item's unit so prices are directly comparable
        ▼
 results (output.json and/or PostgreSQL)

The key design idea is a progressively expanding schema. There is no fixed list of fields. The system starts by extracting every identifying field it can find in the line item description [Step 1]. Then, after candidate listings are scraped [Step 2], the system processes each candidate listing one-by-one, doing two things for each listing [Step 3]: (1) Extracts values for every field that has already been seen (2) Adds new identifying fields that the listing reveals. What this means is that the structured field set grows richer with each listing. Then, the system runs conflict detection in two passes [Step 4]. (1) First, each listing is compared against the item description: a listing is dropped if it has a non-null value that disagrees with the description's value for that same field. (2) Then, the surviving listings are compared against one another. If their fields are all mutually consistent, they're returned as matches. If conflicts remain between them, it signals that the original item description is ambiguous enough to plausibly match multiple distinct products. In this case, the system returns no listing in order to avoid false matches.

How the code is organized

  • pipeline/pipeline.py — the entry point. Loads the input CSV and the four prompt files, then calls the five phase functions in order and assembles the final results. It is designed to be resumable: every phase checkpoints to the run directory, and already-processed items are skipped on re-run.

  • pipeline/run.py — the implementation of each phase (run_query_generator_batch, run_scraping, run_extractor_batch, run_conflict_detector_batch, run_price_normalization_batch).

  • pipeline/batch.py — helpers for the OpenAI Batch API (submitting a JSONL job, polling until complete, and parsing results). The LLM phases use batch mode for throughput and cost.

  • pipeline/callers.py — helpers for direct, synchronous LLM calls.

  • pipeline/scraping.py — Google Search and page fetching through ScraperAPI, including a blacklist of consumer-marketplace domains (eBay, Facebook, Craigslist, etc.) that aren't useful price references.

  • prompts/*.txt — the system prompts that define each LLM phase's behavior. These are the heart of the pipeline's reasoning and are the most interesting files to read to understand how matching decisions are made. [Step 1] Query generator

  • helpers/db.py — optional PostgreSQL persistence. Results are mirrored to a small set of tables (pipeline_results, listings, normalized_prices, …). In this shared copy the connection function is a placeholder, so the pipeline writes output.json instead — the DB is not required to understand or run the logic.

Input and output

Input CSV columns (names are configurable via CLI flags): detail_id (an integer id per line item), item_description, unit_price, unit_of_measure, vendor.

Output is a JSON array — one object per input item — roughly:

{
  "detail_id": 123,
  "input": "item description",
  "unit_price": 42.0,
  "vendor": "Some Vendor",
  "output": {
    "category": "sufficient | non-good | insufficient_description | ...",
    "line_item_fields": { "...": "..." },
    "search_query": "the generated query",
    "surviving_candidates": [
      { "url": "...", "seller": "...", "unit_price": 39.5,
        "extracted_fields": { "...": "..." },
        "normalized_price": 39.5, "normalization_explanation": "..." }
    ],
    "removed_candidates": [ { "url": "...", "removal_reason": "..." } ]
  }
}

External services

  • OpenAI — the LLM for phases 1, 3, 4, 5 (default models gpt-5.2 and gpt-5-mini).
  • ScraperAPI — Google Search results and fetching candidate listing pages.

System 2 — LPA Attachment Downloader

What it does

lpa/download_lpa_attachments.py downloads every attachment file for a set of LPA contracts from California's eProcurement portal (caleprocure.ca.gov).

How it handles the portal

The contract pages are a PeopleSoft application, which makes naive scraping unreliable. The script's docstring documents the approach in detail; the key points:

  • The site blocks obvious headless browsers, so the script sends a realistic desktop User-Agent.
  • Clicking a "View" button triggers a JSON response containing a one-time signed download URL (/psc/.../view/...). Rather than fight the popup → "Download Attachment" modal → inline-PDF-viewer flow (which is flaky in headless mode), the script captures that URL from the JSON and fetches the file bytes directly with the browser's request API. This works uniformly for PDFs, Excel, and Word files.
  • Filenames are read from the Content-Disposition header (URL-decoded), and downloaded bytes are validated so an HTML error page is never saved as a .pdf.

Contract-id rule

Some contract ids carry a trailing version suffix (e.g. 4-16-71-0013G.17). The .17 is stripped to form the page URL (4-16-71-0013G). Ids that share a base after stripping are scraped once and the files are copied into the sibling folders, so the portal is not hit repeatedly for the same documents.

Usage

python lpa/download_lpa_attachments.py                 # all folders under data/contracts/
python lpa/download_lpa_attachments.py --only 7-21-99-41-01
python lpa/download_lpa_attachments.py --headed        # watch the browser

Setup

python3.11 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
playwright install chromium          # for the LPA downloader

cp .env.example .env                 # then add your own OpenAI + ScraperAPI keys

Run the pipeline from the repo root (prompts are loaded by relative path):

python pipeline/pipeline.py --csv your_items.csv --run-name demo --workers 5

Running the pipeline requires your own OpenAI and ScraperAPI keys and incurs cost. PostgreSQL is optional (see helpers/db.py); without it, results go to output.json.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages