chore(deps-dev): bump unstructured from 0.18.14 to 0.23.1 in /backend #3280

Open
dependabot[bot] wants to merge 1 commit into main from dependabot/pip/backend/unstructured-0.23.1

Conversation

dependabot[bot] commented on behalf of github on Jun 19, 2026

Bumps unstructured from 0.18.14 to 0.23.1.

Release notes

Sourced from unstructured's releases.

0.23.1

What's Changed

Full Changelog: Unstructured-IO/unstructured@0.23.0...0.23.1

0.23.0

What's Changed

Full Changelog: Unstructured-IO/unstructured@0.22.32...0.23.0

0.22.32

What's Changed

Full Changelog: Unstructured-IO/unstructured@0.22.31...0.22.32

0.22.31

What's Changed

Full Changelog: Unstructured-IO/unstructured@0.22.30...0.22.31

0.22.30

What's Changed

Full Changelog: Unstructured-IO/unstructured@0.22.29...0.22.30

0.22.29

What's Changed

Full Changelog: Unstructured-IO/unstructured@0.22.28...0.22.29

0.22.28

What's Changed

Full Changelog: Unstructured-IO/unstructured@0.22.27...0.22.28

... (truncated)

Changelog

Sourced from unstructured's changelog.

0.23.1

Enhancements

  • Extract filled AcroForm field values as text: values typed into fillable PDF form fields live in widget annotations rather than the page content stream, so pdfminer's text pass missed them. They are now recovered for both the fast and hi_res strategies and emitted as elements alongside the content-stream text.

Fixes

  • Fix inferred/extracted layout merge skipping subregion removal for single-region pages: the rule that removes an inferred box overlapping an extracted region was gated on any(extracted_to_keep), which evaluated False when the only kept extracted region was at index 0. On single-region pages (e.g. a PDF whose only text is one filled form field) this left a duplicate element; the guard now checks the array size.
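The truthiness pitfall behind this fix can be reproduced in a few lines. This is a minimal sketch, assuming `extracted_to_keep` holds the indices of the kept extracted regions in a NumPy array; the changelog names the variable but does not show its type, so the array contents here are illustrative.

```python
import numpy as np

# Assumption: indices of the extracted regions to keep.
# A single-region page keeps only the region at index 0.
extracted_to_keep = np.array([0])

# Old guard: any() tests the truthiness of each element, and index 0 is falsy,
# so the guard is False even though one region is kept.
old_guard = any(extracted_to_keep)

# New guard: check whether the array has any entries at all.
new_guard = extracted_to_keep.size > 0

print(old_guard, new_guard)  # False True
```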

0.23.0

Enhancements

  • Add enrichment_origins metadata field for per-attribute model provenance: ElementMetadata gains a serialized enrichment_origins field mapping a written attribute name (e.g. text, text_as_html, embeddings) to a list of records {"type", "provider", "model"}, in application order. Enrichment producers stamp which model wrote (or contributed to) each attribute; authoring enrichments overwrite the list while additive ones append, preserving the prior author. A new ConsolidationStrategy.DICT_LIST_UNIQUE merges these dicts across elements during chunking (union keys, concatenate then dedupe records, preserving first-seen order).
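The merge rule described for `DICT_LIST_UNIQUE` (union keys, concatenate, dedupe, keep first-seen order) could look roughly like the sketch below. The function name `merge_enrichment_origins` and the provider and model values are hypothetical, and this is not the library's implementation; only the record keys `type`, `provider`, and `model` come from the changelog.

```python
from typing import Dict, List

Record = Dict[str, str]
Origins = Dict[str, List[Record]]


def merge_enrichment_origins(per_element: List[Origins]) -> Origins:
    """Merge per-element origin maps into one map for a chunk (hypothetical helper)."""
    merged: Origins = {}
    for origins in per_element:
        for attribute, records in origins.items():
            # Union of keys: create the bucket the first time an attribute is seen.
            bucket = merged.setdefault(attribute, [])
            for record in records:
                # Dedupe while preserving first-seen order.
                if record not in bucket:
                    bucket.append(record)
    return merged


# Illustrative records; the provider and model names are made up.
ocr = {"type": "ocr", "provider": "example", "model": "model-a"}
table = {"type": "table", "provider": "example", "model": "model-b"}

merged = merge_enrichment_origins([
    {"text": [ocr]},
    {"text": [ocr, table], "text_as_html": [table]},
])
print(merged)
```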

0.22.34

Fixes

  • Keep extracted text aligned with rotated PDF page images in hi_res: when unstructured-inference rotates a rendered page image to make its text upright, the same rotation is now mirrored onto the pdfminer-extracted coordinates so the extracted-text layer and the object-detection layer share one coordinate frame and merge correctly. Previously the two layers could be off by the page's /Rotate, scattering extracted text in the merged output.

0.22.33

Fixes

  • Fix over-aggressive de-duplication of embedded text on dense PDF pages: remove_duplicate_elements chunks its IoU computation for pages with more than ~2000 extracted elements, but the per-chunk "keep" mask was not offset by the chunk's global start index. As a result, every element in a chunk after the first was compared against itself (and earlier elements) and wrongly dropped, so dense pages (large tables, engineering drawings) lost a large fraction of their extracted text. The diagonal offset is now applied per chunk so only genuine later-duplicate boxes are removed.
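The role of the per-chunk diagonal offset can be illustrated with a small NumPy sketch. It is not the code of `remove_duplicate_elements`: the helper names, the IoU threshold, the tiny chunk size, and the `offset_diagonal` switch are all assumptions made for the demonstration. Each chunk of rows is compared against all boxes, and a box is dropped only when a strictly earlier box overlaps it.

```python
import numpy as np


def pairwise_iou(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """IoU of every box in a (n, 4) against every box in b (m, 4); boxes are x1, y1, x2, y2."""
    x1 = np.maximum(a[:, None, 0], b[None, :, 0])
    y1 = np.maximum(a[:, None, 1], b[None, :, 1])
    x2 = np.minimum(a[:, None, 2], b[None, :, 2])
    y2 = np.minimum(a[:, None, 3], b[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    union = area_a[:, None] + area_b[None, :] - inter
    return inter / np.maximum(union, 1e-12)


def later_duplicates(boxes, threshold=0.5, chunk_size=2, offset_diagonal=True):
    """Return a boolean mask of boxes that duplicate an earlier box."""
    n = len(boxes)
    drop = np.zeros(n, dtype=bool)
    for start in range(0, n, chunk_size):
        rows = boxes[start:start + chunk_size]
        overlaps = pairwise_iou(rows, boxes) > threshold
        # Local row r is global row start + r, so "strictly later column"
        # means column >= r + start + 1. Using k=1 in every chunk reproduces the bug.
        k = start + 1 if offset_diagonal else 1
        drop |= np.triu(overlaps, k=k).any(axis=0)
    return drop


boxes = np.array([
    [0, 0, 1, 1],
    [2, 2, 3, 3],
    [4, 4, 5, 5],
    [0, 0, 1, 1],  # genuine duplicate of the first box
], dtype=float)

fixed = later_duplicates(boxes)
buggy = later_duplicates(boxes, offset_diagonal=False)
print(fixed.tolist())  # [False, False, False, True]
print(buggy.tolist())  # [False, False, True, True]
```

Without the offset, the third box is compared against itself in the second chunk and dropped, which matches the failure mode described above.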

0.22.32

Fixes

  • Recover text inside PDF figure overlays in hi_res: hi_res pdfminer extraction only pulled text from objects exposing get_text (e.g. LTTextBox), and extract_text_objects only collected LTTextLine. Text held as loose LTChars inside an LTFigure — for example text drawn into a figure/XObject overlay rather than the main content stream — was dropped from the output. hi_res now groups such loose characters into text lines, inserting spaces on wide inter-character gaps and skipping hidden (render mode 3) and rotated characters.
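Grouping loose characters into a line with gap-based spacing can be sketched as follows. This is an illustration under stated assumptions, not the library's logic: the `Char` type, the `join_chars` helper, and the `gap_ratio` threshold are hypothetical, the characters are assumed to already belong to one line, and the hidden and rotated character filtering mentioned above is omitted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Char:
    """Hypothetical stand-in for a loose character with horizontal extents."""
    text: str
    x0: float
    x1: float


def join_chars(chars: List[Char], gap_ratio: float = 0.3) -> str:
    """Join characters left to right, inserting a space on wide gaps."""
    pieces: List[str] = []
    previous = None
    for char in sorted(chars, key=lambda c: c.x0):
        if previous is not None:
            gap = char.x0 - previous.x1
            width = previous.x1 - previous.x0
            # Assumed heuristic: a gap wider than a fraction of the previous
            # character's width counts as a word break.
            if gap > gap_ratio * width:
                pieces.append(" ")
        pieces.append(char.text)
        previous = char
    return "".join(pieces)


line = join_chars([
    Char("y", 3.0, 4.0),
    Char("H", 0.0, 1.0),
    Char("i", 1.05, 1.5),
])
print(line)  # Hi y
```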

0.22.31

Enhancements

  • Rename isolate_tables chunking option to isolate_table: the option added in 0.22.30 has been renamed for naming consistency. Callers passing isolate_tables= must update to isolate_table=.
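Callers that still pass the old spelling could be migrated with a small shim while call sites are updated. The helper below is hypothetical and not part of unstructured; it only rewrites a keyword dictionary before it is forwarded to a chunking function.

```python
import warnings
from typing import Any, Dict


def migrate_chunking_kwargs(kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of kwargs with isolate_tables renamed to isolate_table (hypothetical helper)."""
    migrated = dict(kwargs)
    if "isolate_tables" in migrated:
        if "isolate_table" in migrated:
            raise TypeError("pass only isolate_table, not both spellings")
        warnings.warn(
            "isolate_tables was renamed to isolate_table in unstructured 0.22.31",
            DeprecationWarning,
            stacklevel=2,
        )
        migrated["isolate_table"] = migrated.pop("isolate_tables")
    return migrated


# Keyword arguments that are already up to date pass through unchanged.
print(migrate_chunking_kwargs({"isolate_table": False}))
```

The result can then be forwarded as `**migrated` to whichever chunking call the backend uses.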

0.22.30

Enhancements

  • Toggle table isolation in chunking: Add isolate_tables to basic/title chunking options. Defaults to True (the post-#4307 behavior: Table/TableChunk elements always staged alone). Set to False to allow tables to share pre-chunks with adjacent non-table elements and be combined by PreChunkCombiner.

0.22.29

Fixes

... (truncated)

Commits
  • 5ead69a feat: extract filled AcroForm field text in PDF partitioning (#4372)
  • dacae2c feat: add enrichment origins metadata field (#4370)
  • dedf144 fix: keep extracted text aligned with rotated PDF page images in hi_res (#4367)
  • 19857c1 fix: stop decimating embedded text on dense PDF pages (#4368)
  • 620cde7 fix(hi_res): recover text inside PDF figure overlays (#4363)
  • 7c8f675 fix: rename isolate_tables chunking option to isolate_table (#4355)
  • 346cfff feat: add option for table chunking (#4354)
  • bfd78b2 fix: handle text too long for spacy issue (#4353)
  • 238657f fix: chunking dropping table content (#4352)
  • 8daa154 fix: ndjson file type detection (#4349)
  • Additional commits viewable in compare view

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

  • @dependabot rebase will rebase this PR
  • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
  • @dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
  • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

Bumps [unstructured](https://github.com/Unstructured-IO/unstructured) from 0.18.14 to 0.23.1.
- [Release notes](https://github.com/Unstructured-IO/unstructured/releases)
- [Changelog](https://github.com/Unstructured-IO/unstructured/blob/main/CHANGELOG.md)
- [Commits](Unstructured-IO/unstructured@unstructured_0.18.14...0.23.1)

---
updated-dependencies:
- dependency-name: unstructured
  dependency-version: 0.23.1
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

dependabot[bot] commented on behalf of github on Jun 19, 2026

Labels

The following labels could not be found: dependencies. Please create it before Dependabot can add it to a pull request.

Please fix the above issues or remove invalid values from dependabot.yml.

dependabot[bot] requested review from Dallas98 and WMC001 as code owners on June 19, 2026, 15:43
