chore(deps-dev): bump unstructured from 0.18.14 to 0.23.1 in /backend by dependabot[bot] · Pull Request #3280 · ModelEngine-Group/nexent

dependabot · 2026-06-19T15:43:45Z

Bumps unstructured from 0.18.14 to 0.23.1.

Release notes

0.23.1

What's Changed

feat: extract filled AcroForm field text in PDF partitioning by @badGarnet in Unstructured-IO/unstructured#4372

Full Changelog: Unstructured-IO/unstructured@0.23.0...0.23.1

0.23.0

What's Changed

fix: stop decimating embedded text on dense PDF pages by @badGarnet in Unstructured-IO/unstructured#4368

fix: keep extracted text aligned with rotated PDF page images in hi_res by @badGarnet in Unstructured-IO/unstructured#4367

feat: add enrichment origins metadata field by @badGarnet in Unstructured-IO/unstructured#4370

Full Changelog: Unstructured-IO/unstructured@0.22.32...0.23.0

0.22.32

What's Changed

fix(hi_res): recover text inside PDF figure overlays by @qued in Unstructured-IO/unstructured#4363

Full Changelog: Unstructured-IO/unstructured@0.22.31...0.22.32

0.22.31

What's Changed

fix: rename isolate_tables chunking option to isolate_table by @badGarnet in Unstructured-IO/unstructured#4355

Full Changelog: Unstructured-IO/unstructured@0.22.30...0.22.31

0.22.30

What's Changed

feat: add option for table chunking by @badGarnet in Unstructured-IO/unstructured#4354

Full Changelog: Unstructured-IO/unstructured@0.22.29...0.22.30

0.22.29

What's Changed

fix: handle text too long for spacy issue by @badGarnet in Unstructured-IO/unstructured#4353

Full Changelog: Unstructured-IO/unstructured@0.22.28...0.22.29

0.22.28

What's Changed

fix: chunking dropping table content by @badGarnet in Unstructured-IO/unstructured#4352

Full Changelog: Unstructured-IO/unstructured@0.22.27...0.22.28

... (truncated)

Changelog

Sourced from unstructured's changelog.

0.23.1

Enhancements

Extract filled AcroForm field values as text: values typed into fillable PDF form fields live in widget annotations rather than the page content stream, so pdfminer's text pass missed them. They are now recovered for both the fast and hi_res strategies and emitted as elements alongside the content-stream text.

Fixes

Fix inferred/extracted layout merge skipping subregion removal for single-region pages: the rule that removes an inferred box overlapping an extracted region was gated on any(extracted_to_keep), which evaluated False when the only kept extracted region was at index 0. On single-region pages (e.g. a PDF whose only text is one filled form field) this left a duplicate element; the guard now checks the array size.
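
The pitfall behind this fix is easy to reproduce in plain Python. The following is a minimal sketch, assuming `extracted_to_keep` holds integer indices of the kept regions; the changelog does not show the actual code, so the data layout here is an assumption.

```python
# Hypothetical illustration: indices of the extracted regions to keep.
extracted_to_keep = [0]  # the only kept region sits at index 0

# any() tests the truthiness of the *values*; 0 is falsy, so this is False
# even though one region is kept.
buggy_guard = any(extracted_to_keep)

# Checking the container size asks the intended question: "is anything kept?"
fixed_guard = len(extracted_to_keep) > 0

print(buggy_guard, fixed_guard)
```

The same trap applies to any sequence of indices or counts in which 0 is a valid member.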

0.23.0

Enhancements

Add enrichment_origins metadata field for per-attribute model provenance: ElementMetadata gains a serialized enrichment_origins field mapping a written attribute name (e.g. text, text_as_html, embeddings) to a list of records {"type", "provider", "model"}, in application order. Enrichment producers stamp which model wrote (or contributed to) each attribute; authoring enrichments overwrite the list while additive ones append, preserving the prior author. A new ConsolidationStrategy.DICT_LIST_UNIQUE merges these dicts across elements during chunking (union keys, concatenate then dedupe records, preserving first-seen order).
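
The merge rule described for `ConsolidationStrategy.DICT_LIST_UNIQUE` (union keys, concatenate, dedupe, keep first-seen order) can be sketched in a few lines. This is an illustration of the stated rule only, not the library's implementation; `merge_origins` and the provider and model values are hypothetical.

```python
from typing import Dict, List

Record = Dict[str, str]
Origins = Dict[str, List[Record]]


def merge_origins(per_element: List[Origins]) -> Origins:
    """Union the keys, concatenate record lists, and drop repeated records
    while preserving first-seen order (hypothetical helper)."""
    merged: Origins = {}
    for origins in per_element:
        for attr, records in origins.items():
            bucket = merged.setdefault(attr, [])
            for record in records:
                if record not in bucket:  # dict equality compares all fields
                    bucket.append(record)
    return merged


# Made-up records for two elements that end up in the same chunk.
rec_a = {"type": "ocr", "provider": "example-provider", "model": "model-a"}
rec_b = {"type": "llm", "provider": "example-provider", "model": "model-b"}

first = {"text": [rec_a]}
second = {"text": [rec_a, rec_b], "text_as_html": [rec_b]}

merged = merge_origins([first, second])
print(merged)
```

Here `rec_a` appears once under `text` even though both elements carry it, and `text_as_html` is kept because keys are unioned.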

0.22.34

Fixes

Keep extracted text aligned with rotated PDF page images in hi_res: when unstructured-inference rotates a rendered page image to make its text upright, the same rotation is now mirrored onto the pdfminer-extracted coordinates so the extracted-text layer and the object-detection layer share one coordinate frame and merge correctly. Previously the two layers could be off by the page's /Rotate, scattering extracted text in the merged output.
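
To make the idea of "mirroring the rotation onto the coordinates" concrete, here is a minimal sketch of rotating one bounding box by 90 degrees clockwise. It assumes a top-left-origin frame with y pointing down and boxes given as `(x0, y0, x1, y1)`; the function name and conventions are assumptions for illustration, and the library's actual coordinate handling may differ.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1), top-left origin


def rotate_bbox_90_cw(bbox: BBox, page_height: float) -> BBox:
    """Rotate a box 90 degrees clockwise together with its page.

    A point (x, y) on the original page lands at (page_height - y, x) on the
    rotated page, whose width equals the original page height.
    """
    x0, y0, x1, y1 = bbox
    return (page_height - y1, x0, page_height - y0, x1)


# A 20x20 box near the top-left corner of a page that is 200 units tall.
rotated = rotate_bbox_90_cw((10.0, 20.0, 30.0, 40.0), page_height=200.0)
print(rotated)
```

Applying the same transform to every extracted box keeps the text layer in the frame of the rotated image, which is the property the fix restores.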

0.22.33

Fixes

Fix over-aggressive de-duplication of embedded text on dense PDF pages: remove_duplicate_elements chunks its IoU computation for pages with more than ~2000 extracted elements, but the per-chunk "keep" mask was not offset by the chunk's global start index. As a result, every element in a chunk after the first was compared against itself (and earlier elements) and wrongly dropped, so dense pages (large tables, engineering drawings) lost a large fraction of their extracted text. The diagonal offset is now applied per chunk so only genuine later-duplicate boxes are removed.
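
The effect of a missing chunk offset can be shown with a simplified pure-Python model. This is not the library's code (the real implementation is vectorized); `dedupe`, `iou`, and the `offset_fix` flag are hypothetical names used only to contrast the two behaviours.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0


def dedupe(boxes: List[Box], chunk_size: int = 2, threshold: float = 0.9,
           offset_fix: bool = True) -> List[Box]:
    """Drop boxes that duplicate an earlier box, processing rows in chunks."""
    n = len(boxes)
    keep = [True] * n
    for start in range(0, n, chunk_size):
        for local_row, box in enumerate(boxes[start:start + chunk_size]):
            # Only columns *after* this row may be dropped. Without the
            # offset, the chunk-local row index is used, so rows in later
            # chunks are compared against themselves and earlier boxes.
            row = start + local_row if offset_fix else local_row
            for col in range(row + 1, n):
                if iou(box, boxes[col]) > threshold:
                    keep[col] = False
    return [b for b, k in zip(boxes, keep) if k]


boxes = [(0, 0, 1, 1), (2, 0, 3, 1), (4, 0, 5, 1), (4, 0, 5, 1), (6, 0, 7, 1)]
fixed = dedupe(boxes, offset_fix=True)
buggy = dedupe(boxes, offset_fix=False)
print(len(fixed), len(buggy))
```

With the offset, only the one genuine duplicate is removed. Without it, every box outside the first chunk matches itself and is dropped, which mirrors the text loss described for dense pages.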

0.22.32

Fixes

Recover text inside PDF figure overlays in hi_res: hi_res pdfminer extraction only pulled text from objects exposing get_text (e.g. LTTextBox), and extract_text_objects only collected LTTextLine. Text held as loose LTChars inside an LTFigure — for example text drawn into a figure/XObject overlay rather than the main content stream — was dropped from the output. hi_res now groups such loose characters into text lines, inserting spaces on wide inter-character gaps and skipping hidden (render mode 3) and rotated characters.
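
Grouping loose characters into a line with gap-based spacing can be sketched as follows. This is a simplified model under stated assumptions: characters are already known to share one baseline, hidden and rotated characters have been filtered out, and the `Char` type, `join_chars`, and the `gap_ratio` threshold are hypothetical rather than taken from the library.

```python
from typing import List, NamedTuple


class Char(NamedTuple):
    text: str
    x0: float  # left edge
    x1: float  # right edge


def join_chars(chars: List[Char], gap_ratio: float = 0.3) -> str:
    """Join characters left to right, inserting a space where the horizontal
    gap is wide relative to the previous character's width."""
    pieces: List[str] = []
    prev = None
    for ch in sorted(chars, key=lambda c: c.x0):
        if prev is not None:
            gap = ch.x0 - prev.x1
            if gap > gap_ratio * (prev.x1 - prev.x0):
                pieces.append(" ")
        pieces.append(ch.text)
        prev = ch
    return "".join(pieces)


# Characters supplied out of order, with a wide gap between "i" and "y".
loose = [Char("y", 12, 17), Char("H", 0, 5), Char("o", 17, 22), Char("i", 5, 7)]
line = join_chars(loose)
print(line)
```

A real implementation would also need to split characters into separate lines by vertical position before joining them.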

0.22.31

Enhancements

Rename isolate_tables chunking option to isolate_table: the option added in 0.22.30 has been renamed for naming consistency. Callers passing isolate_tables= must update to isolate_table=.
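
For a codebase that builds chunking options as a dict, crossing this rename during the upgrade could be handled with a small shim. This is a sketch only: `migrate_chunking_kwargs` is a hypothetical helper in the calling project, not part of unstructured.

```python
import warnings
from typing import Any, Dict


def migrate_chunking_kwargs(kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of kwargs with the 0.22.30 spelling `isolate_tables`
    renamed to `isolate_table` (hypothetical caller-side helper)."""
    migrated = dict(kwargs)
    if "isolate_tables" in migrated:
        if "isolate_table" in migrated:
            raise TypeError("pass only isolate_table, not both spellings")
        warnings.warn(
            "isolate_tables was renamed to isolate_table in unstructured 0.22.31",
            DeprecationWarning,
            stacklevel=2,
        )
        migrated["isolate_table"] = migrated.pop("isolate_tables")
    return migrated


options = migrate_chunking_kwargs({"isolate_tables": False, "max_characters": 500})
print(options)
```

The migrated dict can then be passed to the project's chunking call with `**options`. Once all call sites use the new name, the shim can be deleted.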

0.22.30

Enhancements

Toggle table isolation in chunking: Add isolate_tables to basic/title chunking options. Defaults to True (the post-#4307 behavior: Table/TableChunk elements always staged alone). Set to False to allow tables to share pre-chunks with adjacent non-table elements and be combined by PreChunkCombiner.

0.22.29

Fixes

... (truncated)

Commits

5ead69a feat: extract filled AcroForm field text in PDF partitioning (#4372)
dacae2c feat: add enrichment origins metadata field (#4370)
dedf144 fix: keep extracted text aligned with rotated PDF page images in hi_res (#4367)
19857c1 fix: stop decimating embedded text on dense PDF pages (#4368)
620cde7 fix(hi_res): recover text inside PDF figure overlays (#4363)
7c8f675 fix: rename isolate_tables chunking option to isolate_table (#4355)
346cfff feat: add option for table chunking (#4354)
bfd78b2 fix: handle text too long for spacy issue (#4353)
238657f fix: chunking dropping table content (#4352)
8daa154 fix: ndjson file type detection (#4349)
Additional commits viewable in compare view

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

@dependabot rebase will rebase this PR
@dependabot recreate will recreate this PR, overwriting any edits that have been made to it
@dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
@dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
@dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
@dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

Bumps [unstructured](https://github.com/Unstructured-IO/unstructured) from 0.18.14 to 0.23.1.
- [Release notes](https://github.com/Unstructured-IO/unstructured/releases)
- [Changelog](https://github.com/Unstructured-IO/unstructured/blob/main/CHANGELOG.md)
- [Commits](Unstructured-IO/unstructured@unstructured_0.18.14...0.23.1)

---
updated-dependencies:
- dependency-name: unstructured
  dependency-version: 0.23.1
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

dependabot · 2026-06-19T15:43:45Z

Labels

The following labels could not be found: dependencies. Please create it before Dependabot can add it to a pull request.

Please fix the above issues or remove invalid values from dependabot.yml.

dependabot Bot requested review from Dallas98 and WMC001 as code owners June 19, 2026 15:43

dependabot[bot] wants to merge 1 commit into main from dependabot/pip/backend/unstructured-0.23.1
