SA-693/two step classify#29
Open
dstewartons wants to merge 28 commits into
…ep helpers
- default model and Vertex region from get_config, share code_digits and candidates_limit for RAG and unambiguous shortlists
- rework unambiguous_soc_code and formulate_open_question to match SIC tracing, truncate_identifier and parse-retry logging
- import PydanticOutputParser from langchain_core and apply config defaults to sa_rag_soc_code

…classify
- define SOC_PROMPT_UNAMBIGUOUS and SOC_PROMPT_OPENFOLLOWUP with pydantic partial format instructions
- use UnambiguousResponse and OpenFollowUp as the parser pydantic shapes for those templates
- import PydanticOutputParser from langchain_core instead of langchain.output_parsers

…lassify
- use langchain_core PydanticOutputParser and partial_variables so each template embeds the matching UnambiguousResponse or OpenFollowUp JSON schema text

…sify
- define the same pydantic fields as sic-classification-utils uses for unambiguous and open follow-up outputs (codable, class_code, class_descriptive, alt_candidates, reasoning, followup) and constrain alt_candidates length with field_validator and MAX_ALT_CANDIDATES
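
A minimal sketch of the response shapes this commit describes, assuming pydantic v2. The field names come from the commit message; the field types, the `Candidate` helper model, the value of `MAX_ALT_CANDIDATES`, and the choice to reject (rather than truncate) an over-long list are assumptions for illustration, not the repository's actual definitions.

```python
from typing import List, Optional

from pydantic import BaseModel, Field, ValidationError, field_validator

# Assumed limit; the real MAX_ALT_CANDIDATES value is not given in the PR.
MAX_ALT_CANDIDATES = 10


class Candidate(BaseModel):
    """Hypothetical shape for one alternative candidate."""

    class_code: str
    class_descriptive: str
    likelihood: float = Field(ge=0.0, le=1.0)


class UnambiguousResponse(BaseModel):
    """Step 1 output: a single code, or a ranked list of alternatives."""

    codable: bool
    class_code: Optional[str] = None
    class_descriptive: Optional[str] = None
    alt_candidates: List[Candidate] = Field(default_factory=list)
    reasoning: str = ""

    @field_validator("alt_candidates")
    @classmethod
    def check_alt_candidates_length(cls, value: List[Candidate]) -> List[Candidate]:
        # Assumption: reject over-long lists; the real validator may truncate instead.
        if len(value) > MAX_ALT_CANDIDATES:
            raise ValueError(f"at most {MAX_ALT_CANDIDATES} alt_candidates allowed")
        return value


class OpenFollowUp(BaseModel):
    """Step 2 output: an open question to ask the respondent (types assumed)."""

    followup: Optional[str] = None
    reasoning: str = ""
```

Constructing `UnambiguousResponse` with more than `MAX_ALT_CANDIDATES` entries raises `ValidationError`, so an over-long shortlist from the LLM is caught at parse time in this sketch.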
…ging
- same truncation behaviour as industrial_classification_utils.utils.constants (empty or None becomes an empty string, longer strings cut at max length with an ellipsis suffix)
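
The truncation rule described here can be sketched as follows. This is an illustration only: the function signature, the default `max_length`, the three-dot ASCII ellipsis, and whether the ellipsis counts towards the limit are assumptions, not the `truncate_identifier` implementation in either repository.

```python
from typing import Optional


def truncate_identifier(value: Optional[str], max_length: int = 12) -> str:
    """Shorten an identifier for logging (sketch; signature and default are assumed)."""
    if not value:
        # None and "" both become an empty string.
        return ""
    if len(value) <= max_length:
        return value
    # Assumption: keep the first max_length characters, then append the ellipsis.
    return value[:max_length] + "..."
```

Returning an empty string for `None` keeps log formatting code free of `None` checks at each call site.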
- make get_config call get_default_config from utils.constants and drop the inline default dict
- read EmbeddingHandler defaults from config embedding keys (model name, db dir, k_matches)

…e sic-classification-utils
- add EmbeddingConfig for embedding_model_name, db_dir and k_matches
- narrow LLMConfig to llm_model_name, model_location, code_digits and candidates_limit only

- return embedding, llm and lookups defaults in one dict that satisfies FullConfig
- set generative defaults to gemini-2.5-flash, europe-west2, four-digit code_digits and candidates_limit 10
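
As a rough sketch of the generative defaults listed above: the key names and values are taken from the commit messages, while the use of `TypedDict` and the helper name `get_default_llm_config` are assumptions. The embedding and lookups sections are omitted because their default values are not stated in this PR.

```python
from typing import TypedDict


class LLMConfig(TypedDict):
    """Assumed typing for the narrowed LLM section of the config."""

    llm_model_name: str
    model_location: str
    code_digits: int
    candidates_limit: int


def get_default_llm_config() -> LLMConfig:
    """Return the generative defaults named in the commit (hypothetical helper)."""
    return {
        "llm_model_name": "gemini-2.5-flash",
        "model_location": "europe-west2",
        "code_digits": 4,
        "candidates_limit": 10,
    }
```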
- add mock_soc fixtures and tests for unambiguous_soc_code and formulate_open_question
- use pytest-mock with example lookup codes in sa_rag shortlists

…ion-utils
- replace src package example with JSON shortlist demo for get_soc_code, sa_rag_soc_code and unambiguous_soc_code

- provide code, title and distance entries for demos/llm/llm_embedding_example.py

- drop the old package example now covered by demos/llm/llm_embedding_example.py

…arity
- add the same discriminator ordering, quality standards, and edge cases as SIC with SOC-specific occupation wording
e772cd6 to 9a35f47
- add Details from group_description and Includes from unit-group tasks
- remove unused expand_search_terms parameter and tidy docstring
- match untyped inner helper and keyword args when building the open follow-up call dict
- align class doc and field descriptions with sic unambiguousresponse parity
📌 Pull Request Template

✨ Summary

Adds two-step SOC LLM methods for Survey Assist classify, mirroring SIC in `sic-classification-utils`: `unambiguous_soc_code` (step 1) and `formulate_open_question` (step 2 when not codable). Introduces `SOC_PROMPT_UNAMBIGUOUS` and `SOC_PROMPT_OPENFOLLOWUP`. Keeps `sa_rag_soc_code` for demos and other callers; classify in survey-assist-api must use the two-step methods only.

Companion PR: survey-assist-api SA-693, which connects classify to these methods and removes SA-673 clear-winner logic from the API.

Release: version bumped to 0.1.4. Tag and publish after merge so downstream repos can pin the Git tag (replacing local `path` dependencies).

📜 Changes Introduced
- Feature implementation (feat:) / bug fix (fix:) / refactoring (chore:) / documentation (docs:) / testing (test:)
- Updates to tests and/or documentation
- Terraform changes (if applicable)

- `llm/llm.py`: `unambiguous_soc_code`, `formulate_open_question` (SIC-shaped chains, logging, `correlation_id`); `sa_rag_soc_code` aligned with `sa_rag_sic_code` (signature, parse fallbacks, prompt candidate metadata).
- `llm/prompt.py`: `SOC_PROMPT_UNAMBIGUOUS`, `SOC_PROMPT_OPENFOLLOWUP`; `SA_SOC_PROMPT_RAG` wording aligned with SIC (no SA-673 clear-winner instructions).
- `models/response_model.py`: `UnambiguousResponse` / `OpenFollowUp` docs aligned with SIC two-step naming.
- `utils/constants.py`: SOC prompt name constants for config consumers.
- `tests/test_llm.py`: mocked tests for `unambiguous_soc_code` and `formulate_open_question` (typed responses, call dict, job title normalisation).
- `demos/llm/`: example script and shortlist JSON (moved from `src/.../llm_embedding_example.py`).
- `README.md`: documents two-step vs single-shot RAG.
- `pyproject.toml`: version `0.1.4`.

✅ Checklist

`terraform fmt` & `terraform validate`)

🔍 How to Test
1. Unit tests (no GCP required)

    cd soc-classification-utils
    poetry install
    make all-tests
    make check-python-nofix

Verified: 36 passed; `make check-python-nofix` clean.

Mocked tests in `tests/test_llm.py` cover `unambiguous_soc_code` and `formulate_open_question` (typed responses and call dicts).

2. Live LLM demo (this repo only)
What this proves: `unambiguous_soc_code` calls Gemini and returns valid JSON (step 1 only). Uses a fake shortlist from the JSON file, not the API or vector store.

Requires Vertex / Gemini credentials. No vector store or survey-assist-api.

    cd soc-classification-utils
    poetry install
    poetry run python demos/llm/llm_embedding_example.py

The script uses a fixed shortlist in `demos/llm/data/school_embed_short_list_soc.json` and prints three JSON blocks (legacy `get_soc_code`, `sa_rag_soc_code`, then `unambiguous_soc_code`).

Check the third block has `codable`, `class_code`, `alt_candidates`, and `reasoning`. With the bundled school-teacher inputs and a shortlist of primary / secondary / special-needs codes, the model often returns `codable: false` and `class_code: null`, with `2313` as the leading entry in `alt_candidates` (ambiguous; step 2 would be `formulate_open_question` in classify). Exact values can vary with the LLM; do not expect a fixed `class_code` every run.

Verified third block: `codable: false`, `class_code: null`, top candidate `2313` at likelihood `0.9`.

Full `POST /classify` checks (vector store + API, farm hand / manager curls) belong in the survey-assist-api companion PR and are not repeated here.

Notes
- `v0.1.4` so survey-assist-api and soc-classification-vector-store can pin it instead of `path = "../soc-classification-utils"`.
- `unambiguous_soc_code` then `formulate_open_question` when not codable, not `sa_rag_soc_code`.
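
The two-step flow described in these notes can be sketched as below. This is a stand-alone illustration of the control flow only: `Step1Result`, `ClassifyOutcome`, `classify_two_step`, and the two stub callables are hypothetical stand-ins for `unambiguous_soc_code` and `formulate_open_question`, whose real signatures and return types are not shown in this PR description. No LLM is called.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step1Result:
    """Hypothetical stand-in for the step 1 (unambiguous) response."""

    codable: bool
    class_code: Optional[str] = None


@dataclass
class ClassifyOutcome:
    """Hypothetical combined result: a code, or a follow-up question."""

    class_code: Optional[str]
    followup: Optional[str]


def classify_two_step(
    job_title: str,
    step1: Callable[[str], Step1Result],
    step2: Callable[[str, Step1Result], str],
) -> ClassifyOutcome:
    """Run step 1; only ask a follow-up question (step 2) when not codable."""
    result = step1(job_title)
    if result.codable:
        return ClassifyOutcome(class_code=result.class_code, followup=None)
    return ClassifyOutcome(class_code=None, followup=step2(job_title, result))


# Stubs used only to exercise the control flow; they do not call an LLM.
def fake_step1(job_title: str) -> Step1Result:
    return Step1Result(codable=False)


def fake_step2(job_title: str, result: Step1Result) -> str:
    return f"Can you tell us more about your work as a {job_title}?"


outcome = classify_two_step("school teacher", fake_step1, fake_step2)
```

Passing the two steps in as callables keeps the branching testable without credentials, in the same spirit as the mocked tests in `tests/test_llm.py`.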