A repository of two components from Stanford RegLab's California procurement research. It is shared so partners can see how the work operates — it is not meant to be run end-to-end.
The two components are:
- Market Matching Pipeline (pipeline/): a multi-stage LLM system that takes a government purchase line item and finds comparable real-world market listings so the purchase price can be benchmarked.
- LPA Attachment Downloader (lpa/): a browser automation script that downloads the attachment files (contract PDFs, price lists, user instructions) for Leveraged Procurement Agreement (LPA) contracts from California's public eProcurement portal.
procurement-odi/
├── README.md
├── requirements.txt
├── .env.example # template for API keys (pipeline only)
├── helpers/
│ └── db.py # PostgreSQL layer (in our actual code, we uploaded data from Fi$Cal SCPRS to a DB)
├── pipeline/ # System 1 — market matching pipeline
│ ├── pipeline.py # entry point: orchestrates the 5 phases, handles resume
│ ├── run.py # the per-phase implementations
│ ├── batch.py # OpenAI Batch API helpers (submit / poll / parse)
│ ├── callers.py # direct (non-batch) LLM call helpers
│ └── scraping.py # Google Search + page fetch via ScraperAPI
├── prompts/ # the LLM system prompts (one per reasoning phase)
│ ├── query_generation.txt
│ ├── field_extraction.txt
│ ├── conflict_detection.txt
│ └── price_normalization.txt
└── lpa/ # System 2 — LPA attachment downloader
    └── download_lpa_attachments.py
Given a CSV of procurement line items, the pipeline produces, for each item, a set of real market listings that plausibly match it, with prices normalized to a common unit so the government's paid price can be compared to the market. The hard parts it solves are (a) turning a terse, often cryptic line-item description into a good web search, (b) reading messy vendor pages to extract structured product fields, (c) deciding whether a candidate listing is really the same product, and (d) making prices comparable across different units/pack sizes.
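To make point (d) concrete, here is a minimal sketch of the arithmetic behind unit normalization. In the pipeline itself this step is handled by an LLM prompt (prompts/price_normalization.txt), so the function name normalize_price and the pack-size table below are hypothetical and only illustrate the idea.

```python
# Hypothetical illustration; the real pipeline delegates this reasoning to an LLM.
# Number of base units ("each") contained in one of the given selling units.
UNITS_PER_PACK = {"each": 1, "dozen": 12, "box_of_100": 100}


def normalize_price(price: float, listing_unit: str, reference_unit: str) -> float:
    """Convert a listing price into the price per reference unit."""
    per_each = price / UNITS_PER_PACK[listing_unit]
    return per_each * UNITS_PER_PACK[reference_unit]


# A box of 100 gloves listed at $25.00, compared with a line item bought per dozen.
print(normalize_price(25.00, "box_of_100", "dozen"))  # 3.0
```

Once both prices are expressed per dozen, the government's paid price and the market price can be compared directly.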
The pipeline runs as a sequence of phases. Each phase writes its output to the run directory, so an interrupted run resumes from where it stopped.
input CSV (line items)
│
[Step 1] Query generator → filter out non-goods, extract structured fields from the
│ description, and generate a web search query
[Step 2] Scraping → run the query via Google Search (ScraperAPI), then fetch the
│ text of the top N result pages concurrently
[Step 3] Extractor → for each scraped page, extract structured product fields
│ (LLM, run in batches, one round per result position)
[Step 4] Conflict detector → compare each candidate listing first against the reference item,
│ then against one another, keeping only genuine matches (reject the rest with a reason)
[Step 5] Price normalization → convert each surviving candidate's price to the reference
│ item's unit so prices are directly comparable
▼
results (output.json and/or PostgreSQL)
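The resume behavior described above (each phase writes its output to the run directory) can be sketched as follows. This is a minimal illustration under assumptions, not the code in pipeline/pipeline.py: the function name run_phase, the one-JSON-file-per-phase layout, and the process callback are all hypothetical.

```python
import json
from pathlib import Path
from typing import Callable


def run_phase(run_dir: Path, phase_name: str, items: dict[int, str],
              process: Callable[[str], str]) -> dict[int, str]:
    """Run one phase, skipping items already present in its checkpoint file."""
    checkpoint = run_dir / f"{phase_name}.json"  # hypothetical file layout
    done: dict[str, str] = {}
    if checkpoint.exists():
        done = json.loads(checkpoint.read_text())
    for item_id, payload in items.items():
        key = str(item_id)  # JSON object keys are always strings
        if key in done:
            continue  # already processed in an earlier run
        done[key] = process(payload)
        checkpoint.write_text(json.dumps(done))  # persist after every item
    return {int(k): v for k, v in done.items()}
```

Calling run_phase a second time with the same run directory only invokes process for items that are not yet in the checkpoint, which is what makes an interrupted run cheap to restart.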
The key design idea is a progressively expanding schema: there is no fixed list of fields. The system starts by extracting every identifying field it can find in the line item description [Step 1]. After candidate listings are scraped [Step 2], it processes them one by one, doing two things for each listing [Step 3]:
(1) It extracts values for every field that has already been seen.
(2) It adds any new identifying fields that the listing reveals.
As a result, the structured field set grows richer with each listing.
The system then runs conflict detection in two passes [Step 4]:
(1) Each listing is compared against the item description. A listing is dropped if it has a non-null value that disagrees with the description's value for the same field.
(2) The surviving listings are compared against one another. If their fields are all mutually consistent, they are returned as matches. If conflicts remain, the original item description is ambiguous enough to plausibly match multiple distinct products, so the system returns no listings in order to avoid false matches.
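The two-pass conflict rule can be sketched in a few lines. This is a simplified illustration, not the repository's implementation: the actual comparison is made by an LLM guided by prompts/conflict_detection.txt, whereas this sketch assumes exact string equality, and the names conflicts and detect_matches are hypothetical.

```python
from typing import Optional

Fields = dict[str, Optional[str]]


def conflicts(a: Fields, b: Fields) -> bool:
    """True if a and b disagree on any field where both have a non-null value."""
    return any(
        a[k] is not None and b[k] is not None and a[k] != b[k]
        for k in a.keys() & b.keys()
    )


def detect_matches(reference: Fields, listings: list[Fields]) -> list[Fields]:
    # Pass 1: drop listings that contradict the line-item description.
    survivors = [lst for lst in listings if not conflicts(reference, lst)]
    # Pass 2: if any two survivors contradict each other, the description is
    # ambiguous, so return nothing rather than risk a false match.
    for i, a in enumerate(survivors):
        for b in survivors[i + 1:]:
            if conflicts(a, b):
                return []
    return survivors
```

For example, if the description only gives a brand, two listings of that brand in different sizes both survive the first pass, then conflict with each other in the second, and the item is returned with no matches.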
- pipeline/pipeline.py: the entry point. Loads the input CSV and the four prompt files, then calls the five phase functions in order and assembles the final results. It is designed to be resumable: every phase checkpoints to the run directory, and already-processed items are skipped on re-run.
- pipeline/run.py: the implementation of each phase (run_query_generator_batch, run_scraping, run_extractor_batch, run_conflict_detector_batch, run_price_normalization_batch).
- pipeline/batch.py: helpers for the OpenAI Batch API (submitting a JSONL job, polling until complete, and parsing results). The LLM phases use batch mode for throughput and cost.
- pipeline/callers.py: helpers for direct, synchronous LLM calls.
- pipeline/scraping.py: Google Search and page fetching through ScraperAPI, including a blacklist of consumer-marketplace domains (eBay, Facebook, Craigslist, etc.) that aren't useful price references.
- prompts/*.txt: the system prompts that define each LLM phase's behavior. These are the heart of the pipeline's reasoning and are the most interesting files to read to understand how matching decisions are made.
- helpers/db.py: optional PostgreSQL persistence. Results are mirrored to a small set of tables (pipeline_results, listings, normalized_prices, …). In this shared copy the connection function is a placeholder, so the pipeline writes output.json instead; the DB is not required to understand or run the logic.
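As an illustration of the domain blacklist mentioned for pipeline/scraping.py, here is a minimal sketch of how search results might be filtered by host. The set contents and the function name is_blacklisted are assumptions made for this example; the repository's actual list and matching logic may differ.

```python
from urllib.parse import urlparse

# Hypothetical blacklist; the repository's actual list lives in pipeline/scraping.py.
BLACKLISTED_DOMAINS = {"ebay.com", "facebook.com", "craigslist.org"}


def is_blacklisted(url: str) -> bool:
    """True if the URL's host is a blacklisted domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in BLACKLISTED_DOMAINS)


results = [
    "https://www.ebay.com/itm/123",
    "https://www.grainger.com/product/456",
]
# Keep only results whose host is not on the blacklist.
print([u for u in results if not is_blacklisted(u)])
```

Matching on the parsed hostname rather than on a substring of the URL avoids rejecting unrelated sites whose names merely contain a blacklisted string.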
Input CSV columns (names are configurable via CLI flags): detail_id (an integer id
per line item), item_description, unit_price, unit_of_measure, vendor.
Output is a JSON array — one object per input item — roughly:
{
"detail_id": 123,
"input": "item description",
"unit_price": 42.0,
"vendor": "Some Vendor",
"output": {
"category": "sufficient | non-good | insufficient_description | ...",
"line_item_fields": { "...": "..." },
"search_query": "the generated query",
"surviving_candidates": [
{ "url": "...", "seller": "...", "unit_price": 39.5,
"extracted_fields": { "...": "..." },
"normalized_price": 39.5, "normalization_explanation": "..." }
],
"removed_candidates": [ { "url": "...", "removal_reason": "..." } ]
}
}
The pipeline relies on two external services:
- OpenAI: the LLM for phases 1, 3, 4, and 5 (default models gpt-5.2 and gpt-5-mini).
- ScraperAPI: Google Search results and fetching candidate listing pages.
lpa/download_lpa_attachments.py downloads every attachment file for a set of LPA
contracts from California's eProcurement portal (caleprocure.ca.gov).
The contract pages are a PeopleSoft application, which makes naive scraping unreliable. The script's docstring documents the approach in detail; the key points:
- The site blocks obvious headless browsers, so the script sends a realistic desktop User-Agent.
- Clicking a "View" button triggers a JSON response containing a one-time signed download URL (/psc/.../view/...). Rather than fight the popup → "Download Attachment" modal → inline-PDF-viewer flow (which is flaky in headless mode), the script captures that URL from the JSON and fetches the file bytes directly with the browser's request API. This works uniformly for PDFs, Excel, and Word files.
- Filenames are read from the Content-Disposition header (URL-decoded), and downloaded bytes are validated so an HTML error page is never saved as a .pdf.
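The last point, filename recovery and byte validation, can be sketched with the standard library alone. The helper names below are hypothetical and the specific checks are assumptions for illustration; the script's own docstring remains the authoritative description of what it does.

```python
from email.message import Message
from typing import Optional
from urllib.parse import unquote


def filename_from_header(content_disposition: str) -> Optional[str]:
    """Extract and URL-decode the filename parameter of a Content-Disposition header."""
    msg = Message()
    msg["Content-Disposition"] = content_disposition
    name = msg.get_filename()
    return unquote(name) if name else None


def looks_like_html(data: bytes) -> bool:
    """Heuristic: does the payload begin like an HTML document?"""
    head = data[:512].lstrip().lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")


def validate_download(filename: str, data: bytes) -> None:
    """Raise ValueError if the bytes cannot be the file the name claims."""
    if looks_like_html(data):
        raise ValueError(f"{filename}: got an HTML page instead of a file")
    # Every PDF file starts with the magic bytes "%PDF-".
    if filename.lower().endswith(".pdf") and not data.startswith(b"%PDF-"):
        raise ValueError(f"{filename}: missing PDF signature")
```

Validating before writing to disk means a failed or expired download URL surfaces as an error instead of as a corrupt file in the contract folder.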
Some contract ids carry a trailing version suffix (e.g. 4-16-71-0013G.17). The .17 is
stripped to form the page URL (4-16-71-0013G). Ids that share a base after stripping are
scraped once and the files are copied into the sibling folders, so the portal is not
hit repeatedly for the same documents.
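The suffix handling can be sketched as below. This assumes a version suffix is always a trailing dot followed by digits, as in the example above; the function names are hypothetical and the script may handle additional cases.

```python
import re
from collections import defaultdict

# Assumption: a version suffix is a trailing dot followed by digits, e.g. ".17".
_VERSION_SUFFIX = re.compile(r"\.\d+$")


def base_contract_id(contract_id: str) -> str:
    """Strip the trailing version suffix, if any, to get the id used in the page URL."""
    return _VERSION_SUFFIX.sub("", contract_id)


def group_by_base(contract_ids: list[str]) -> dict[str, list[str]]:
    """Group ids by base id so each base is scraped once and copied to its siblings."""
    groups: dict[str, list[str]] = defaultdict(list)
    for cid in contract_ids:
        groups[base_contract_id(cid)].append(cid)
    return dict(groups)


ids = ["4-16-71-0013G.17", "4-16-71-0013G.18", "7-21-99-41-01"]
print(group_by_base(ids))
```

Each key of the resulting mapping corresponds to one portal page to scrape, and each value lists the folders that should receive a copy of the downloaded files.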
python lpa/download_lpa_attachments.py # all folders under data/contracts/
python lpa/download_lpa_attachments.py --only 7-21-99-41-01
python lpa/download_lpa_attachments.py --headed # watch the browser
python3.11 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
playwright install chromium # for the LPA downloader
cp .env.example .env # then add your own OpenAI + ScraperAPI keys
Run the pipeline from the repo root (prompts are loaded by relative path):
python pipeline/pipeline.py --csv your_items.csv --run-name demo --workers 5
Running the pipeline requires your own OpenAI and ScraperAPI keys and incurs cost.
PostgreSQL is optional (see helpers/db.py); without it, results go to output.json.