SA-693/two step classify#29

Open
dstewartons wants to merge 28 commits into main from SA-693/two-step-classify

@dstewartons dstewartons commented May 20, 2026

📌 Pull Request Template

Please complete all sections

✨ Summary

Adds two-step SOC LLM methods for Survey Assist classify, mirroring SIC in sic-classification-utils: unambiguous_soc_code (step 1) and formulate_open_question (step 2 when not codable). Introduces SOC_PROMPT_UNAMBIGUOUS and SOC_PROMPT_OPENFOLLOWUP. Keeps sa_rag_soc_code for demos and other callers; classify in survey-assist-api must use the two-step methods only.

Companion PR: survey-assist-api SA-693 — connects classify to these methods and removes SA-673 clear-winner logic from the API.

Release: version bumped to 0.1.4 — tag and publish after merge so downstream repos can pin the Git tag (replacing local path dependencies).

📜 Changes Introduced

  • Feature implementation (feat:) / bug fix (fix:) / refactoring (chore:) / documentation (docs:) / testing (test:)

  • Updates to tests and/or documentation

  • Terraform changes (if applicable)

  • llm/llm.py: unambiguous_soc_code, formulate_open_question (SIC-shaped chains, logging, correlation_id); sa_rag_soc_code aligned with sa_rag_sic_code (signature, parse fallbacks, prompt candidate metadata).

  • llm/prompt.py: SOC_PROMPT_UNAMBIGUOUS, SOC_PROMPT_OPENFOLLOWUP; SA_SOC_PROMPT_RAG wording aligned with SIC (no SA-673 clear-winner instructions).

  • models/response_model.py: UnambiguousResponse / OpenFollowUp docs aligned with SIC two-step naming.

  • utils/constants.py: SOC prompt name constants for config consumers.

  • tests/test_llm.py: mocked tests for unambiguous_soc_code and formulate_open_question (typed responses, call dict, job title normalisation).

  • demos/llm/: example script and shortlist JSON (moved from src/.../llm_embedding_example.py).

  • README.md: documents two-step vs single-shot RAG.

  • pyproject.toml: version 0.1.4.

✅ Checklist

Please confirm you've completed these checks before requesting a review.

  • Code is formatted using Black
  • Imports are sorted using isort
  • Code passes linting with Ruff, Pylint, and Mypy
  • Security checks pass using Bandit
  • API and Unit tests are written and pass using pytest
  • Terraform files (if applicable) follow best practices and have been validated (terraform fmt & terraform validate)
  • DocStrings follow Google-style and are added as per Pylint recommendations
  • Documentation has been updated if needed

🔍 How to Test

1. Unit tests (no GCP required)

cd soc-classification-utils
poetry install
make all-tests
make check-python-nofix

Verified: 36 passed; make check-python-nofix clean.

Mocked tests in tests/test_llm.py cover unambiguous_soc_code and formulate_open_question (typed responses and call dicts).

2. Live LLM demo (this repo only)

What this proves: unambiguous_soc_code calls Gemini and returns valid JSON (step 1 only). Uses a fake shortlist from the JSON file — not the API or vector store.

Requires Vertex / Gemini credentials. No vector store or survey-assist-api.

cd soc-classification-utils
poetry install
poetry run python demos/llm/llm_embedding_example.py

The script uses a fixed shortlist in demos/llm/data/school_embed_short_list_soc.json and prints three JSON blocks (legacy get_soc_code, sa_rag_soc_code, then unambiguous_soc_code).

Check the third block has codable, class_code, alt_candidates, and reasoning. With the bundled school-teacher inputs and a shortlist of primary / secondary / special-needs codes, the model often returns codable: false and class_code: null, with 2313 as the leading entry in alt_candidates (ambiguous; step 2 would be formulate_open_question in classify). Exact values can vary with the LLM; do not expect a fixed class_code every run.

Verified: the third block returned codable: false, class_code: null, with top candidate 2313 at likelihood 0.9.
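The manual check on the third block can be sketched as a small script. This is a minimal illustration rather than part of the PR: the helper name check_unambiguous_block is hypothetical, and the assumption that each alt_candidates entry carries class_code and likelihood keys is inferred from the verified output above, not confirmed against the response model.

```python
import json

# Keys the PR description says the third printed block should contain.
REQUIRED_KEYS = {"codable", "class_code", "alt_candidates", "reasoning"}


def check_unambiguous_block(raw: str) -> dict:
    """Parse one printed JSON block and confirm the expected keys are present.

    Hypothetical helper for illustration; not part of soc-classification-utils.
    """
    data = json.loads(raw)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if not isinstance(data["codable"], bool):
        raise ValueError("codable should be a JSON boolean")
    if not isinstance(data["alt_candidates"], list):
        raise ValueError("alt_candidates should be a JSON array")
    return data


# Illustrative payload mirroring the verified run; the entry keys inside
# alt_candidates and the reasoning text are assumptions, not real output.
sample = json.dumps(
    {
        "codable": False,
        "class_code": None,
        "alt_candidates": [{"class_code": "2313", "likelihood": 0.9}],
        "reasoning": "(model reasoning text)",
    }
)
checked = check_unambiguous_block(sample)
```

Because the values vary between LLM runs, the sketch only checks for keys and basic types and does not compare against a fixed class_code.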

Full POST /classify checks (vector store + API, farm hand / manager curls) belong in the survey-assist-api companion PR — not repeated here.

Notes

  • After merge, publish Git tag v0.1.4 so survey-assist-api and soc-classification-vector-store can pin it instead of path = "../soc-classification-utils".
  • Classify consumers must call unambiguous_soc_code, then formulate_open_question when not codable, not sa_rag_soc_code.
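The calling order in the last note can be sketched as follows. The real signatures of unambiguous_soc_code and formulate_open_question are not shown in this description, so the sketch takes two zero-argument callables as stand-ins and works on plain dicts; classify_two_step and both fake_ functions are hypothetical names used only for illustration.

```python
from typing import Any, Callable, Dict


def classify_two_step(
    run_step_one: Callable[[], Dict[str, Any]],
    run_step_two: Callable[[], Dict[str, Any]],
) -> Dict[str, Any]:
    """Run step 1, and only run step 2 when step 1 is not codable."""
    first = run_step_one()
    if first.get("codable"):
        # Codable: return the code and skip the follow-up question entirely.
        return {
            "codable": True,
            "class_code": first.get("class_code"),
            "alt_candidates": first.get("alt_candidates", []),
            "followup": None,
        }
    # Not codable: ask step 2 for an open follow-up question.
    second = run_step_two()
    return {
        "codable": False,
        "class_code": None,
        "alt_candidates": first.get("alt_candidates", []),
        "followup": second.get("followup"),
    }


# Stand-ins for the two LLM methods (assumed shapes, no real LLM call).
def fake_unambiguous_soc_code() -> Dict[str, Any]:
    return {
        "codable": False,
        "class_code": None,
        "alt_candidates": [{"class_code": "2313", "likelihood": 0.9}],
    }


def fake_formulate_open_question() -> Dict[str, Any]:
    return {"followup": "(follow-up question text from the model)"}


outcome = classify_two_step(fake_unambiguous_soc_code, fake_formulate_open_question)
```

Passing callables keeps the ordering rule testable without Vertex credentials; in the companion API the two callables would presumably wrap the real methods.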

…ep helpers

- default model and Vertex region from get_config, share code_digits and candidates_limit for RAG and unambiguous shortlists
- rework unambiguous_soc_code and formulate_open_question to match SIC tracing, truncate_identifier and parse-retry logging
- import PydanticOutputParser from langchain_core and apply config defaults to sa_rag_soc_code
…classify

- define SOC_PROMPT_UNAMBIGUOUS and SOC_PROMPT_OPENFOLLOWUP with pydantic partial format instructions
- use UnambiguousResponse and OpenFollowUp as the parser pydantic shapes for those templates
- import PydanticOutputParser from langchain_core instead of langchain.output_parsers
…lassify

- use langchain_core PydanticOutputParser and partial_variables so each template embeds the matching UnambiguousResponse or OpenFollowUp JSON schema text
…sify

- define the same pydantic fields as sic-classification-utils uses for unambiguous and open follow-up outputs (codable, class_code, class_descriptive, alt_candidates, reasoning, followup) and constrain alt_candidates length with field_validator and MAX_ALT_CANDIDATES
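The length constraint on alt_candidates can be illustrated with a standard-library stand-in. The PR itself uses pydantic's field_validator; this sketch swaps in a dataclass with __post_init__ so it runs without third-party packages. The value of MAX_ALT_CANDIDATES and the choice to raise rather than truncate are assumptions, and UnambiguousResponseSketch is a hypothetical name.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

# Hypothetical cap for illustration; the real constant's value is not stated.
MAX_ALT_CANDIDATES = 5


@dataclass
class UnambiguousResponseSketch:
    """Dataclass stand-in for the step 1 response shape named in the commit."""

    codable: bool
    class_code: Optional[str] = None
    class_descriptive: Optional[str] = None
    alt_candidates: List[Dict[str, Any]] = field(default_factory=list)
    reasoning: str = ""

    def __post_init__(self) -> None:
        # Assumption: over-long candidate lists are rejected, not truncated.
        if len(self.alt_candidates) > MAX_ALT_CANDIDATES:
            raise ValueError(
                f"alt_candidates has {len(self.alt_candidates)} entries; "
                f"maximum is {MAX_ALT_CANDIDATES}"
            )


ok = UnambiguousResponseSketch(
    codable=False,
    alt_candidates=[{"class_code": "2313", "likelihood": 0.9}],
)
```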
…ging

- same truncation behaviour as industrial_classification_utils.utils.constants (empty or None becomes an empty string, longer strings cut at max length with an ellipsis suffix)
- make get_config call get_default_config from utils.constants and drop the inline default dict
- read EmbeddingHandler defaults from config embedding keys (model name, db dir, k_matches)
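The truncation behaviour described in the first note of this commit can be sketched as below. The function name, the default max_length, and the use of three ASCII dots as the ellipsis are assumptions for illustration; the real helper in utils.constants may differ on all three.

```python
from typing import Optional


def truncate_identifier_sketch(value: Optional[str], max_length: int = 12) -> str:
    """Shorten an identifier for logging.

    Assumed behaviour, taken from the commit note: None or empty input becomes
    an empty string, and longer strings are cut at max_length with an ellipsis
    suffix appended after the cut.
    """
    if not value:
        return ""
    if len(value) <= max_length:
        return value
    return value[:max_length] + "..."


short = truncate_identifier_sketch("abc")
long_id = truncate_identifier_sketch("abcdefghijklmnop", max_length=12)
```

Returning an empty string for None keeps log formatting code free of None checks, which is the likely reason for that rule.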
…e sic-classification-utils

- add EmbeddingConfig for embedding_model_name, db_dir and k_matches
- narrow LLMConfig to llm_model_name, model_location, code_digits and candidates_limit only
- return embedding, llm and lookups defaults in one dict that satisfies FullConfig
- set generative defaults to gemini-2.5-flash, europe-west2, four-digit code_digits and candidates_limit 10
- add mock_soc fixtures and tests for unambiguous_soc_code and formulate_open_question
- use pytest-mock with example lookup codes in sa_rag shortlists
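The generative defaults listed in the config commit above, and the way a method might fall back to them, can be sketched like this. Only the four LLM keys and their values come from the commit notes; default_llm_config and resolve_shortlist_settings are hypothetical names, and the embedding and lookups sections are omitted because their default values are not stated here.

```python
from typing import Optional, Tuple, TypedDict


class LLMConfig(TypedDict):
    """The four LLM keys named in the commit note."""

    llm_model_name: str
    model_location: str
    code_digits: int
    candidates_limit: int


def default_llm_config() -> LLMConfig:
    """Return the generative defaults quoted in the commit note."""
    return {
        "llm_model_name": "gemini-2.5-flash",
        "model_location": "europe-west2",
        "code_digits": 4,
        "candidates_limit": 10,
    }


def resolve_shortlist_settings(
    code_digits: Optional[int] = None,
    candidates_limit: Optional[int] = None,
) -> Tuple[int, int]:
    """Use explicit arguments when given, otherwise fall back to the defaults."""
    defaults = default_llm_config()
    return (
        defaults["code_digits"] if code_digits is None else code_digits,
        defaults["candidates_limit"] if candidates_limit is None else candidates_limit,
    )


from_defaults = resolve_shortlist_settings()
overridden = resolve_shortlist_settings(candidates_limit=5)
```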
@dstewartons dstewartons changed the title from "Sa 693/two step classify" to "SA-693/two step classify" May 20, 2026
@dstewartons dstewartons force-pushed the SA-693/two-step-classify branch from e772cd6 to 9a35f47 Compare May 20, 2026 18:45
- add Details from group_description and Includes from unit-group tasks
- remove unused expand_search_terms parameter and tidy docstring
- match untyped inner helper and keyword args when building the open follow-up call dict
- align class doc and field descriptions with sic unambiguousresponse parity