SA-693/two step classify by dstewartons · Pull Request #29 · ONSdigital/soc-classification-utils

dstewartons · 2026-05-20T11:49:40Z

📌 Pull Request Template

Please complete all sections

✨ Summary

Adds two-step SOC LLM methods for Survey Assist classify, mirroring SIC in sic-classification-utils: unambiguous_soc_code (step 1) and formulate_open_question (step 2 when not codable). Introduces SOC_PROMPT_UNAMBIGUOUS and SOC_PROMPT_OPENFOLLOWUP. Keeps sa_rag_soc_code for demos and other callers; classify in survey-assist-api must use the two-step methods only.

Companion PR: survey-assist-api SA-693 — connects classify to these methods and removes SA-673 clear-winner logic from the API.

Release: version bumped to 0.1.4 — tag and publish after merge so downstream repos can pin the Git tag (replacing local path dependencies).

📜 Changes Introduced

Feature implementation (feat:) / bug fix (fix:) / refactoring (chore:) / documentation (docs:) / testing (test:)
Updates to tests and/or documentation
Terraform changes (if applicable)
llm/llm.py: unambiguous_soc_code, formulate_open_question (SIC-shaped chains, logging, correlation_id); sa_rag_soc_code aligned with sa_rag_sic_code (signature, parse fallbacks, prompt candidate metadata).
llm/prompt.py: SOC_PROMPT_UNAMBIGUOUS, SOC_PROMPT_OPENFOLLOWUP; SA_SOC_PROMPT_RAG wording aligned with SIC (no SA-673 clear-winner instructions).
models/response_model.py: UnambiguousResponse / OpenFollowUp docs aligned with SIC two-step naming.
utils/constants.py: SOC prompt name constants for config consumers.
tests/test_llm.py: mocked tests for unambiguous_soc_code and formulate_open_question (typed responses, call dict, job title normalisation).
demos/llm/: example script and shortlist JSON (moved from src/.../llm_embedding_example.py).
README.md: documents two-step vs single-shot RAG.
pyproject.toml: version 0.1.4.

✅ Checklist

Please confirm you've completed these checks before requesting a review.

Code is formatted using Black
Imports are sorted using isort
Code passes linting with Ruff, Pylint, and Mypy
Security checks pass using Bandit
API and Unit tests are written and pass using pytest
Terraform files (if applicable) follow best practices and have been validated (terraform fmt & terraform validate)
DocStrings follow Google-style and are added as per Pylint recommendations
Documentation has been updated if needed

🔍 How to Test

1. Unit tests (no GCP required)

cd soc-classification-utils
poetry install
make all-tests
make check-python-nofix

Verified: 36 passed; make check-python-nofix clean.

Mocked tests in tests/test_llm.py cover unambiguous_soc_code and formulate_open_question (typed responses and call dicts).

2. Live LLM demo (this repo only)

What this proves: unambiguous_soc_code calls Gemini and returns valid JSON (step 1 only). Uses a fake shortlist from the JSON file — not the API or vector store.

Requires Vertex / Gemini credentials. No vector store or survey-assist-api.

cd soc-classification-utils
poetry install
poetry run python demos/llm/llm_embedding_example.py

The script uses a fixed shortlist in demos/llm/data/school_embed_short_list_soc.json and prints three JSON blocks (legacy get_soc_code, sa_rag_soc_code, then unambiguous_soc_code).

Check the third block has codable, class_code, alt_candidates, and reasoning. With the bundled school-teacher inputs and a shortlist of primary / secondary / special-needs codes, the model often returns codable: false and class_code: null, with 2313 as the leading entry in alt_candidates (ambiguous; step 2 would be formulate_open_question in classify). Exact values can vary with the LLM; do not expect a fixed class_code every run.

Verified: the third block returned codable: false and class_code: null, with top candidate 2313 at likelihood 0.9.
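The shape check on the third block can be sketched as below. This is only an illustration, not code from this repo: `check_step_one_block` is a hypothetical helper, the four key names come from the description above, and the layout of each alt_candidates entry (class_code plus likelihood) is an assumption. The sample values mirror the verified run, with placeholder reasoning text.

```python
import json

# Key names taken from the PR description; everything else is illustrative.
REQUIRED_KEYS = {"codable", "class_code", "alt_candidates", "reasoning"}


def check_step_one_block(raw: str) -> dict:
    """Parse a printed step-1 JSON block and confirm the expected keys exist.

    Hypothetical helper: it checks shape only and deliberately does not
    assert a fixed class_code, because LLM output can vary between runs.
    """
    block = json.loads(raw)
    missing = REQUIRED_KEYS - block.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return block


# Sample mirroring the verified run; the alt_candidates entry layout is assumed.
sample = json.dumps(
    {
        "codable": False,
        "class_code": None,
        "alt_candidates": [{"class_code": "2313", "likelihood": 0.9}],
        "reasoning": "placeholder reasoning text",
    }
)

block = check_step_one_block(sample)
print(block["codable"], block["class_code"], block["alt_candidates"][0])
```

A check like this confirms the structure while leaving the actual code and likelihood free to change from run to run.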

Full POST /classify checks (vector store + API, farm hand / manager curls) belong in the survey-assist-api companion PR — not repeated here.

Notes

After merge, publish Git tag v0.1.4 so survey-assist-api and soc-classification-vector-store can pin it instead of path = "../soc-classification-utils".
Classify consumers must call unambiguous_soc_code, then formulate_open_question when not codable, not sa_rag_soc_code.
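The call order in the note above can be sketched roughly as follows. This is an assumption-heavy illustration rather than the library's API: `classify_two_step`, its parameter names, and the two stub functions are hypothetical, and plain dicts stand in for the typed UnambiguousResponse and OpenFollowUp results. In real use the two callables would be the ClassificationLLM methods, whose actual signatures are not shown in this PR description.

```python
from typing import Any, Callable


def classify_two_step(
    step_one: Callable[..., dict[str, Any]],
    step_two: Callable[..., dict[str, Any]],
    job_title: str,
    job_description: str,
    shortlist: list[dict[str, Any]],
) -> dict[str, Any]:
    """Run step 1; only when the result is not codable, run step 2."""
    first = step_one(
        job_title=job_title,
        job_description=job_description,
        shortlist=shortlist,
    )
    if first["codable"]:
        # Step 1 produced a code, so no follow-up question is needed.
        return {"class_code": first["class_code"], "followup": None}
    second = step_two(
        job_title=job_title,
        job_description=job_description,
        candidates=first["alt_candidates"],
    )
    return {"class_code": None, "followup": second["followup"]}


calls: list[str] = []


def fake_step_one(**kwargs: Any) -> dict[str, Any]:
    """Stub standing in for unambiguous_soc_code (no LLM call)."""
    calls.append("step_one")
    return {
        "codable": False,
        "class_code": None,
        "alt_candidates": [{"class_code": "2313", "likelihood": 0.9}],
    }


def fake_step_two(**kwargs: Any) -> dict[str, Any]:
    """Stub standing in for formulate_open_question (no LLM call)."""
    calls.append("step_two")
    return {"followup": "Which age group do you mainly teach?"}


outcome = classify_two_step(
    fake_step_one, fake_step_two, "teacher", "teaches pupils at a school", []
)
print(outcome)
```

Passing the two steps in as callables keeps the sketch runnable without Vertex credentials.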

…ep helpers - default model and Vertex region from get_config, share code_digits and candidates_limit for RAG and unambiguous shortlists - rework unambiguous_soc_code and formulate_open_question to match SIC tracing, truncate_identifier and parse-retry logging - import PydanticOutputParser from langchain_core and apply config defaults to sa_rag_soc_code

…classify - define SOC_PROMPT_UNAMBIGUOUS and SOC_PROMPT_OPENFOLLOWUP with pydantic partial format instructions - use UnambiguousResponse and OpenFollowUp as the parser pydantic shapes for those templates - import PydanticOutputParser from langchain_core instead of langchain.output_parsers

…sify - define the same pydantic fields as sic-classification-utils uses for unambiguous and open follow-up outputs (codable, class_code, class_descriptive, alt_candidates, reasoning, followup) and constrain alt_candidates length with field_validator and MAX_ALT_CANDIDATES

dstewartons added 9 commits May 20, 2026 08:07

add SOC_PROMPT_UNAMBIGUOUS and SOC_PROMPT_OPENFOLLOWUP for two-step c…

3eee807

…lassify - use langchain_core PydanticOutputParser and partial_variables so each template embeds the matching UnambiguousResponse or OpenFollowUp JSON schema text

add truncate_identifier and DEFAULT_TRUNCATE_LEN for privacy-safe log…

14139cc

…ging - same truncation behaviour as industrial_classification_utils.utils.constants (empty or None becomes an empty string, longer strings cut at max length with an ellipsis suffix)
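The truncation behaviour described in this commit message could look roughly like the sketch below. It is not the repo's implementation: the value of DEFAULT_TRUNCATE_LEN is not stated in the PR, so 12 is an assumed placeholder, and appending the ellipsis after max_len characters (rather than counting it within the limit) is also an assumption.

```python
from typing import Optional

# Assumed value; the PR names DEFAULT_TRUNCATE_LEN but does not state it.
DEFAULT_TRUNCATE_LEN = 12


def truncate_identifier(
    value: Optional[str], max_len: int = DEFAULT_TRUNCATE_LEN
) -> str:
    """Shorten an identifier before it is written to logs.

    Empty or None input becomes an empty string. Longer strings are cut
    at max_len characters and given an ellipsis suffix (the suffix is
    added on top of max_len in this sketch, which is an assumption).
    """
    if not value:
        return ""
    if len(value) <= max_len:
        return value
    return value[:max_len] + "..."


print(repr(truncate_identifier(None)))
print(repr(truncate_identifier("short-id")))
print(repr(truncate_identifier("a-very-long-correlation-identifier")))
```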

align embed/embedding.py with shared get_default_config

b63faff

- make get_config call get_default_config from utils.constants and drop the inline default dict - read EmbeddingHandler defaults from config embedding keys (model name, db dir, k_matches)

split SOC FullConfig into embedding and generative llm TypedDicts lik…

7a796d6

…e sic-classification-utils - add EmbeddingConfig for embedding_model_name, db_dir and k_matches - narrow LLMConfig to llm_model_name, model_location, code_digits and candidates_limit only

add get_default_config alongside truncate helpers in utils.constants

911aec3

- return embedding, llm and lookups defaults in one dict that satisfies FullConfig - set generative defaults to gemini-2.5-flash, europe-west2, four-digit code_digits and candidates_limit 10
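A rough sketch of a default config with this shape is shown below. The field names and the generative defaults (gemini-2.5-flash, europe-west2, four-digit codes, candidates_limit 10) come from the commit messages. The top-level key names, the type of lookups, and all three embedding values are assumptions used only as placeholders.

```python
from typing import TypedDict


class EmbeddingConfig(TypedDict):
    embedding_model_name: str
    db_dir: str
    k_matches: int


class LLMConfig(TypedDict):
    llm_model_name: str
    model_location: str
    code_digits: int
    candidates_limit: int


class FullConfig(TypedDict):
    # Top-level key names are assumed from the commit wording.
    embedding: EmbeddingConfig
    llm: LLMConfig
    lookups: dict[str, str]  # assumed type; the PR does not describe it


def get_default_config() -> FullConfig:
    """Return embedding, llm and lookups defaults in one dict."""
    return {
        "embedding": {
            # Placeholder values: the real defaults are not given in the PR.
            "embedding_model_name": "placeholder-embedding-model",
            "db_dir": "path/to/vector-db",
            "k_matches": 5,
        },
        "llm": {
            # Generative defaults as listed in the commit message.
            "llm_model_name": "gemini-2.5-flash",
            "model_location": "europe-west2",
            "code_digits": 4,
            "candidates_limit": 10,
        },
        "lookups": {},
    }


config = get_default_config()
print(config["llm"])
```

Splitting the embedding and generative settings into separate TypedDicts lets each consumer depend only on the keys it reads.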

extend test_llm coverage for two-step ClassificationLLM flows

dd05424

- add mock_soc fixtures and tests for unambiguous_soc_code and formulate_open_question - use pytest-mock with example lookup codes in sa_rag shortlists

dstewartons changed the title ~~Sa 693/two step classify~~ SA-693/two step classify May 20, 2026

dstewartons added 12 commits May 20, 2026 13:22

move SOC llm embedding example to demos/llm to mirror sic-classificat…

ea52bff

…ion-utils - replace src package example with JSON shortlist demo for get_soc_code, sa_rag_soc_code and unambiguous_soc_code

add mock SOC embed shortlist fixture for the school teacher demo

be67bd2

- provide code, title and distance entries for demos/llm/llm_embedding_example.py

remove legacy src llm embedding example after move to demos/llm

db9a2e3

- drop the old package example now covered by demos/llm/llm_embedding_example.py

expand SOC open follow-up prompt to mirror SIC structure for SA-693 p…

20636ae

…arity - add the same discriminator ordering, quality standards, and edge cases as SIC with SOC-specific occupation wording

update version for release

121a69a

log raw job fields in SOC formulate_open_question to match SIC

59fa681

remove unused mock_llm = mock.MagicMock()

f858bac

align SA_SOC_PROMPT_RAG with SIC and remove SA-673 clear-winner promp…

ea396c5

…t wording

update SA_SOC_PROMPT_RAG test to assert SIC-style follow-up wording

b554268

align sa_rag_soc_code chain and parse error responses with sa_rag_sic…

5d187c8

…_code

default SocResponse fields used when sa_rag_soc_code parsing fails

5f330f4

align formulate_open_question and _prompt_candidate_list with SIC in …

9a35f47

…soc llm

dstewartons force-pushed the SA-693/two-step-classify branch from e772cd6 to 9a35f47 May 20, 2026 18:45

dstewartons added 7 commits May 20, 2026 20:02

implement _prompt_candidate include_all metadata like SIC

510038e

- add Details from group_description and Includes from unit-group tasks

add test_prompt_candidate_include_all to cover SOC include_all prompt…

09bc7e9

… metadata

pylint tweak to stop warning about calling internal method

324eeb7

align sa_rag_soc_code signature with sa_rag_sic_code

4b267a9

- remove unused expand_search_terms parameter and tidy docstring

align formulate_open_question prep_call_dict with sic equivalent

653a495

- match untyped inner helper and keyword args when building the open follow-up call dict

update unambiguousresponse docs to match generic sic two-step naming

804b7de

- align class doc and field descriptions with sic unambiguousresponse parity

update read me about reranker

c1dad6f

dstewartons requested a review from gibbardsteve May 20, 2026 23:04

dstewartons mentioned this pull request May 21, 2026

SA-693/two step classify ONSdigital/survey-assist-api#66

Open

11 tasks

SA-693/two step classify #29

dstewartons wants to merge 28 commits into main from SA-693/two-step-classify
