
feat: add multimodal image support to chat input #2305

Closed
riyo264 wants to merge 55 commits into arc53:main from riyo264:multimodal-clean


Conversation


riyo264 commented Mar 16, 2026

🚀 feat: Multimodal Vision Support (Gemini & OpenAI)

📝 Summary
This PR introduces multimodal capabilities to DocsGPT, allowing users to upload images alongside their text queries. The system can now analyze visual data (diagrams, screenshots, etc.) while leveraging the existing RAG pipeline to provide context-aware answers.

🛠️ Key Changes

  1. Backend (Python/Flask)
    multimodal_service.py: A new centralized service to handle routing between Google Gemini and OpenAI Vision models.

Docling/RAG Integration: The service explicitly uses the docs_together context retrieved from the Docling pipeline, ensuring visual analysis is grounded in the provided documentation.

API Normalization: Added normalize_question_payload to handle both camelCase and snake_case keys and support legacy JSON-string formats for backward compatibility.

Stable API Handshaking: Targeted the Gemini v1 stable endpoint to ensure production reliability and prevent v1beta handshake errors.

  2. Frontend (React/Redux)
    State Persistence: Updated sharedConversationSlice.ts and conversationSlice.ts to store imageBase64 within the Query objects. This ensures that uploaded images persist in the chat history and remain visible when switching between conversations.

API Handlers: Updated conversationHandlers.ts to pass multimodal data through both standard and streaming paths.

Models: Expanded RetrievalPayload and Query interfaces in conversationModels.ts to support image data.

  3. Dependencies
    Updated requirements.txt to include langchain-google-genai>=1.0.0.
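For illustration, the key normalization described under Backend might look something like the sketch below. It is not the PR's code: the function body, the `_to_snake` helper, and the JSON-string branch are assumptions, and the review further down disputes that any legacy string format actually exists.

```python
import json
import re
from typing import Any, Mapping, Union

# Matches the position before every uppercase letter except at the start.
_CAMEL_BOUNDARY = re.compile(r"(?<!^)(?=[A-Z])")


def _to_snake(key: str) -> str:
    """Convert a camelCase key such as 'imageBase64' to 'image_base64'."""
    return _CAMEL_BOUNDARY.sub("_", key).lower()


def normalize_question_payload(raw: Union[str, Mapping[str, Any]]) -> dict:
    """Return a copy of the payload with snake_case keys (hypothetical sketch).

    Accepts either a dict or a JSON string encoding one. When both spellings
    of a key are present, the snake_case value wins regardless of order.
    """
    if isinstance(raw, str):
        raw = json.loads(raw)
    if not isinstance(raw, Mapping):
        raise TypeError("payload must be a JSON object")
    normalized: dict = {}
    for key, value in raw.items():
        snake = _to_snake(key)
        # Skip a camelCase duplicate if the snake_case key was already seen.
        if snake in normalized and key != snake:
            continue
        normalized[snake] = value
    return normalized
```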

⚙️ Environment Configuration
To enable this feature, the following variables should be added to the .env file:

GOOGLE_API_KEY: Required for Gemini support.

LLM_PROVIDER: Set to google or openai.

LLM_NAME: Set to a vision-capable model (e.g., gemini-1.5-flash).
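A minimal startup check for these variables might look like this, assuming they are read from the process environment. `check_vision_env` is a hypothetical helper, not part of DocsGPT; since the PR does not name an OpenAI key variable, only the Google key is checked.

```python
import os
from typing import List, Mapping, Optional

SUPPORTED_PROVIDERS = {"google", "openai"}


def check_vision_env(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return a list of configuration problems (empty when the config looks usable)."""
    env = os.environ if env is None else env
    problems: List[str] = []
    provider = env.get("LLM_PROVIDER", "").strip().lower()
    if provider not in SUPPORTED_PROVIDERS:
        problems.append(
            f"LLM_PROVIDER must be one of {sorted(SUPPORTED_PROVIDERS)}, got {provider!r}"
        )
    # Gemini requests cannot be made without a Google key.
    if provider == "google" and not env.get("GOOGLE_API_KEY"):
        problems.append("GOOGLE_API_KEY is required when LLM_PROVIDER=google")
    if not env.get("LLM_NAME"):
        problems.append("LLM_NAME should name a vision-capable model, e.g. gemini-1.5-flash")
    return problems
```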

🧪 Testing & Verification
Full-Stack Data Flow: Confirmed that images are correctly encoded to Base64 on the frontend and decoded in the service layer.

Isolation: Verified that the multimodal path only triggers when an image is present, ensuring zero impact on standard text-only RAG or OCR/Docling ingestion paths.

State Integrity: Confirmed that images remain associated with their respective queries when navigating the conversation history.
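The Base64 round trip mentioned under Testing can be sketched as follows, assuming the frontend sends a standard `data:<mime>;base64,<payload>` URL. `decode_image_data_url` is hypothetical, and defaulting a bare Base64 string to `image/png` is an arbitrary illustrative choice.

```python
import base64
import binascii
import re
from typing import Tuple

_DATA_URL = re.compile(
    r"^data:(?P<mime>[\w.+-]+/[\w.+-]+);base64,(?P<data>.*)$", re.DOTALL
)


def decode_image_data_url(value: str) -> Tuple[str, bytes]:
    """Split a 'data:<mime>;base64,<payload>' string into (mime, raw bytes)."""
    match = _DATA_URL.match(value)
    if match:
        mime, payload = match.group("mime"), match.group("data")
    else:
        # Assumption for illustration: treat a bare Base64 string as PNG.
        mime, payload = "image/png", value
    if not mime.startswith("image/"):
        raise ValueError(f"not an image MIME type: {mime}")
    try:
        # validate=True rejects characters outside the Base64 alphabet.
        raw = base64.b64decode(payload, validate=True)
    except binascii.Error as exc:
        raise ValueError("payload is not valid Base64") from exc
    return mime, raw
```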

This change was needed to:

  • Bridge the Visual Gap: Allow users to troubleshoot "real-world" scenarios by uploading screenshots of error messages or configurations.

  • Enhance Docling Context: Complement the Docling ingestion pipeline by allowing the LLM to perform native visual reasoning on diagrams that are difficult to represent through OCR alone.

  • Future-Proof the Architecture: Update the project's state management and routing "plumbing" to handle multimodal data (Base64/MIME), setting the foundation for future media-rich interactions.

Note:

During local testing, some 404 NOT_FOUND errors were observed due to local API key/region versioning quirks (v1beta). This has been mitigated in the code by explicitly targeting the stable v1 endpoint, which is the standard for production environments.

Fixes #1451


vercel Bot commented Mar 16, 2026

@riyo264 is attempting to deploy a commit to the Arc53 Team on Vercel.

A member of the Team first needs to authorize it.

riyo264 added 5 commits April 4, 2026 01:11
github-actions bot added the tests label Apr 4, 2026
riyo264 and others added 18 commits April 4, 2026 08:04
Updated mock StreamProcessor to return new values for pre-fetch methods.
Updated mock StreamProcessor to return modified values for pre_fetch_docs and pre_fetch_tools.
Bumps [i18next-browser-languagedetector](https://github.com/i18next/i18next-browser-languageDetector) from 8.2.0 to 8.2.1.
- [Changelog](https://github.com/i18next/i18next-browser-languageDetector/blob/master/CHANGELOG.md)
- [Commits](i18next/i18next-browser-languageDetector@v8.2.0...v8.2.1)

---
updated-dependencies:
- dependency-name: i18next-browser-languagedetector
  dependency-version: 8.2.1
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [react-i18next](https://github.com/i18next/react-i18next) from 16.2.4 to 16.5.8.
- [Changelog](https://github.com/i18next/react-i18next/blob/master/CHANGELOG.md)
- [Commits](i18next/react-i18next@v16.2.4...v16.5.8)

---
updated-dependencies:
- dependency-name: react-i18next
  dependency-version: 16.5.8
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [prettier-plugin-tailwindcss](https://github.com/tailwindlabs/prettier-plugin-tailwindcss) from 0.7.1 to 0.7.2.
- [Release notes](https://github.com/tailwindlabs/prettier-plugin-tailwindcss/releases)
- [Changelog](https://github.com/tailwindlabs/prettier-plugin-tailwindcss/blob/main/CHANGELOG.md)
- [Commits](tailwindlabs/prettier-plugin-tailwindcss@v0.7.1...v0.7.2)

---
updated-dependencies:
- dependency-name: prettier-plugin-tailwindcss
  dependency-version: 0.7.2
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [@typescript-eslint/eslint-plugin](https://github.com/typescript-eslint/typescript-eslint/tree/HEAD/packages/eslint-plugin) from 8.46.3 to 8.57.1.
- [Release notes](https://github.com/typescript-eslint/typescript-eslint/releases)
- [Changelog](https://github.com/typescript-eslint/typescript-eslint/blob/main/packages/eslint-plugin/CHANGELOG.md)
- [Commits](https://github.com/typescript-eslint/typescript-eslint/commits/v8.57.1/packages/eslint-plugin)

---
updated-dependencies:
- dependency-name: "@typescript-eslint/eslint-plugin"
  dependency-version: 8.57.1
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
dependabot Bot and others added 25 commits April 19, 2026 18:01
Bumps [langchain-core](https://github.com/langchain-ai/langchain) from 1.2.23 to 1.2.26.
- [Release notes](https://github.com/langchain-ai/langchain/releases)
- [Commits](langchain-ai/langchain@langchain-core==1.2.23...langchain-core==1.2.26)

---
updated-dependencies:
- dependency-name: langchain-core
  dependency-version: 1.2.26
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [tzdata](https://github.com/python/tzdata) from 2025.3 to 2026.1.
- [Release notes](https://github.com/python/tzdata/releases)
- [Changelog](https://github.com/python/tzdata/blob/master/NEWS.md)
- [Commits](python/tzdata@2025.3...2026.1)

---
updated-dependencies:
- dependency-name: tzdata
  dependency-version: '2026.1'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps the pip group with 1 update in the /application directory: [cryptography](https://github.com/pyca/cryptography).


Updates `cryptography` from 46.0.6 to 46.0.7
- [Changelog](https://github.com/pyca/cryptography/blob/main/CHANGELOG.rst)
- [Commits](pyca/cryptography@46.0.6...46.0.7)

---
updated-dependencies:
- dependency-name: cryptography
  dependency-version: 46.0.7
  dependency-type: direct:production
  dependency-group: pip
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [@babel/core](https://github.com/babel/babel/tree/HEAD/packages/babel-core) from 7.24.6 to 7.29.0.
- [Release notes](https://github.com/babel/babel/releases)
- [Changelog](https://github.com/babel/babel/blob/main/CHANGELOG.md)
- [Commits](https://github.com/babel/babel/commits/v7.29.0/packages/babel-core)

---
updated-dependencies:
- dependency-name: "@babel/core"
  dependency-version: 7.29.0
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [mermaid](https://github.com/mermaid-js/mermaid) from 11.13.0 to 11.14.0.
- [Release notes](https://github.com/mermaid-js/mermaid/releases)
- [Commits](https://github.com/mermaid-js/mermaid/compare/mermaid@11.13.0...mermaid@11.14.0)

---
updated-dependencies:
- dependency-name: mermaid
  dependency-version: 11.14.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [tailwindcss](https://github.com/tailwindlabs/tailwindcss/tree/HEAD/packages/tailwindcss) from 4.2.1 to 4.2.2.
- [Release notes](https://github.com/tailwindlabs/tailwindcss/releases)
- [Changelog](https://github.com/tailwindlabs/tailwindcss/blob/main/CHANGELOG.md)
- [Commits](https://github.com/tailwindlabs/tailwindcss/commits/v4.2.2/packages/tailwindcss)

---
updated-dependencies:
- dependency-name: tailwindcss
  dependency-version: 4.2.2
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [@tailwindcss/postcss](https://github.com/tailwindlabs/tailwindcss/tree/HEAD/packages/@tailwindcss-postcss) from 4.1.16 to 4.2.2.
- [Release notes](https://github.com/tailwindlabs/tailwindcss/releases)
- [Changelog](https://github.com/tailwindlabs/tailwindcss/blob/main/CHANGELOG.md)
- [Commits](https://github.com/tailwindlabs/tailwindcss/commits/v4.2.2/packages/@tailwindcss-postcss)

---
updated-dependencies:
- dependency-name: "@tailwindcss/postcss"
  dependency-version: 4.2.2
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [react-dom](https://github.com/facebook/react/tree/HEAD/packages/react-dom) and [@types/react-dom](https://github.com/DefinitelyTyped/DefinitelyTyped/tree/HEAD/types/react-dom). These dependencies needed to be updated together.

Updates `react-dom` from 19.2.0 to 19.2.5
- [Release notes](https://github.com/facebook/react/releases)
- [Changelog](https://github.com/facebook/react/blob/main/CHANGELOG.md)
- [Commits](https://github.com/facebook/react/commits/v19.2.5/packages/react-dom)

Updates `@types/react-dom` from 19.2.2 to 19.2.3
- [Release notes](https://github.com/DefinitelyTyped/DefinitelyTyped/releases)
- [Commits](https://github.com/DefinitelyTyped/DefinitelyTyped/commits/HEAD/types/react-dom)

---
updated-dependencies:
- dependency-name: react-dom
  dependency-version: 19.2.5
  dependency-type: direct:production
  update-type: version-update:semver-patch
- dependency-name: "@types/react-dom"
  dependency-version: 19.2.3
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [@babel/preset-env](https://github.com/babel/babel/tree/HEAD/packages/babel-preset-env) from 7.24.6 to 7.29.2.
- [Release notes](https://github.com/babel/babel/releases)
- [Changelog](https://github.com/babel/babel/blob/main/CHANGELOG.md)
- [Commits](https://github.com/babel/babel/commits/v7.29.2/packages/babel-preset-env)

---
updated-dependencies:
- dependency-name: "@babel/preset-env"
  dependency-version: 7.29.2
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [eslint-plugin-n](https://github.com/eslint-community/eslint-plugin-n) from 17.23.1 to 17.24.0.
- [Release notes](https://github.com/eslint-community/eslint-plugin-n/releases)
- [Changelog](https://github.com/eslint-community/eslint-plugin-n/blob/master/CHANGELOG.md)
- [Commits](eslint-community/eslint-plugin-n@v17.23.1...v17.24.0)

---
updated-dependencies:
- dependency-name: eslint-plugin-n
  dependency-version: 17.24.0
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [svgo](https://github.com/svg/svgo) from 3.3.3 to 4.0.1.
- [Release notes](https://github.com/svg/svgo/releases)
- [Commits](svg/svgo@v3.3.3...v4.0.1)

---
updated-dependencies:
- dependency-name: svgo
  dependency-version: 4.0.1
  dependency-type: direct:development
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [eslint-plugin-prettier](https://github.com/prettier/eslint-plugin-prettier) from 5.5.4 to 5.5.5.
- [Release notes](https://github.com/prettier/eslint-plugin-prettier/releases)
- [Changelog](https://github.com/prettier/eslint-plugin-prettier/blob/main/CHANGELOG.md)
- [Commits](prettier/eslint-plugin-prettier@v5.5.4...v5.5.5)

---
updated-dependencies:
- dependency-name: eslint-plugin-prettier
  dependency-version: 5.5.5
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [lucide-react](https://github.com/lucide-icons/lucide/tree/HEAD/packages/lucide-react) from 0.562.0 to 1.8.0.
- [Release notes](https://github.com/lucide-icons/lucide/releases)
- [Commits](https://github.com/lucide-icons/lucide/commits/1.8.0/packages/lucide-react)

---
updated-dependencies:
- dependency-name: lucide-react
  dependency-version: 1.8.0
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
* feat: postgres tests

* feat: mongo cutoff

* feat: mongo cutoff

* feat: adjust docs and compose files

* fix: mini code mongo removals

* fix: tests and k8s mongo stuff

* feat: test fixes

* fix: ruff

* fix: vale

* Potential fix for pull request finding 'CodeQL / Clear-text logging of sensitive information'

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

* fix: mini suggestions

* vale lint fix 2

* fix: codeql columns thing

* fix: test mongo

* fix: tests coverage

* feat: better tests 4

* feat: more tests

* feat: decent coverage

* fix: ruff fixes

* fix: remove mongo mock

* feat: enhance workflow engine and API routes; add document retrieval and source handling

* feat: e2e tests

* fix: mcp, mongo and more

* fix: mini codeql warning

* fix: agent chunk view

* fix: mini issues

* fix: more pg fixes

* feat: postgres prep on start

* feat: qa tests

* fix: mini improvements

* fix: tests

---------

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Co-authored-by: Siddhant Rai <siddhant.rai.5686@gmail.com>
Bumps [react-router-dom](https://github.com/remix-run/react-router/tree/HEAD/packages/react-router-dom) from 7.13.1 to 7.14.1.
- [Release notes](https://github.com/remix-run/react-router/releases)
- [Changelog](https://github.com/remix-run/react-router/blob/main/packages/react-router-dom/CHANGELOG.md)
- [Commits](https://github.com/remix-run/react-router/commits/react-router-dom@7.14.1/packages/react-router-dom)

---
updated-dependencies:
- dependency-name: react-router-dom
  dependency-version: 7.14.1
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
ManishMadan2882 (Collaborator) left a comment


Thanks for the effort here, @riyo264 - sharing concrete feedback on design and scope so a follow-up can land cleanly.

Design critique

1. This duplicates an existing feature instead of extending it.
DocsGPT already has a working multimodal attachment pipeline (PR #1733):

  • POST /api/store_attachment -> Celery attachment_worker -> attachments table
  • Poll /api/task_status -> get attachment_id
  • Pass attachments: ["<id>"] to /stream (or docsgpt.attachments on /v1/chat/completions)

Images already flow through the provider layer (_upload_attachment_to_google), with test coverage and docs in place. Introducing a separate image_base64 path that lives in Redux and bypasses Celery, StreamProcessor, and the provider abstraction creates a parallel system that will drift over time.
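The existing three-step flow can be driven from a client roughly like this. The endpoint paths come from the list above; the response keys (`task_id`, `status`, `result.attachment_id`), the `task_id` query parameter, the multipart field name, and the `question` body key are assumptions for illustration, so check them against the actual API before relying on this. HTTP calls are injected as `post`/`get` callables so the sketch has no third-party dependencies.

```python
import time
from typing import Any, Callable, Dict, List

# Hypothetical transport signatures: post(path, **kwargs) and get(path, params).
PostFn = Callable[..., Dict[str, Any]]
GetFn = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def upload_attachment(post: PostFn, get: GetFn, filename: str, data: bytes,
                      poll_interval: float = 1.0, timeout: float = 60.0) -> str:
    """Store a file via /api/store_attachment and wait for its attachment id."""
    task = post("/api/store_attachment", files={"file": (filename, data)})
    task_id = task["task_id"]  # assumed response key
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get("/api/task_status", {"task_id": task_id})  # assumed query param
        if status.get("status") == "SUCCESS":
            return status["result"]["attachment_id"]  # assumed result shape
        if status.get("status") == "FAILURE":
            raise RuntimeError(f"attachment task {task_id} failed")
        time.sleep(poll_interval)
    raise TimeoutError(f"attachment task {task_id} did not finish in {timeout}s")


def build_stream_payload(question: str, attachment_ids: List[str]) -> Dict[str, Any]:
    """Body for /stream that references attachments by id instead of inline Base64."""
    return {"question": question, "attachments": list(attachment_ids)}
```

In a real client, `post` and `get` would wrap an HTTP library; keeping them injectable makes the polling logic easy to test without a running server.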

2. multimodal_service.run_multimodal_completion bypasses core architecture.
This sidesteps LLMCreator, prompt rendering, tool execution, usage tracking, persistence, and streaming. That leads to:

  • Hardcoded system prompt overriding agent configuration
  • Ignoring prompt templates (docs_together injected as raw text)
  • Non-standard SSE events (missing conversation_id, sources, etc.)
  • No persistence despite claims in the PR
  • Usage not recorded (billing and limits bypassed)
  • No fallback models or API key resolution

The correct integration point here is StreamProcessor, not a parallel execution path.
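One consequence of routing everything through a single streaming path is that every response ends with the same metadata events. As a rough illustration of standard SSE framing (one `data:` line per JSON payload, terminated by a blank line), here is a sketch; the event names and field layout are assumptions, not DocsGPT's documented protocol.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List


def sse_event(payload: Dict[str, Any]) -> str:
    """Frame one JSON payload as a Server-Sent Events 'data:' message.

    json.dumps escapes embedded newlines, so each payload stays on one line.
    """
    return f"data: {json.dumps(payload)}\n\n"


def stream_answer(chunks: Iterable[str], sources: List[Dict[str, Any]],
                  conversation_id: str) -> Iterator[str]:
    """Emit answer chunks, then the metadata events a client would expect.

    The 'answer', 'source', 'id', and 'end' event types are illustrative.
    """
    for chunk in chunks:
        yield sse_event({"type": "answer", "answer": chunk})
    yield sse_event({"type": "source", "source": sources})
    yield sse_event({"type": "id", "id": conversation_id})
    yield sse_event({"type": "end"})
```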

3. Control-flow regression in routes.
In both answer.py and stream.py, the multimodal branch is added before the existing flow rather than replacing it. For non-image requests this causes:

  • Duplicate prefetch and usage checks
  • Duplicate stream execution (first result discarded)
  • Two agent instances per request, with inconsistent state

This is a silent regression affecting all traffic, not just multimodal. It needs to be fixed before anything ships.
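The fix amounts to making the branch exclusive: decide once, run exactly one path, and return. A minimal sketch, with hypothetical handler names and the PR's `image_base64` field as the trigger:

```python
from typing import Any, Callable, Dict

Handler = Callable[[Dict[str, Any]], str]


def handle_request(request: Dict[str, Any], run_standard: Handler,
                   run_multimodal: Handler) -> str:
    """Run exactly one execution path per request (hypothetical sketch).

    The review describes the new branch running ahead of the existing flow
    rather than instead of it. Here the branch returns early, so a text-only
    request reaches run_standard once and nothing is computed and discarded.
    """
    if request.get("image_base64"):
        return run_multimodal(request)
    return run_standard(request)
```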

4. Base64 images in Redux are the wrong abstraction.
Storing multi-MB blobs per message in client state does not scale. The current UUID-based attachment model offloads storage to backend or CDN. This change increases memory usage and bloats persisted state.
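To put numbers on the memory cost: Base64 emits 4 characters for every 3 input bytes (plus padding), so each image grows by about a third before it is stored in client state. A small helper makes the arithmetic concrete:

```python
import base64


def base64_length(n_bytes: int) -> int:
    """Length of the padded Base64 encoding of n_bytes raw bytes."""
    # Every started 3-byte group becomes 4 output characters.
    return 4 * ((n_bytes + 2) // 3)


# Sanity check against the standard library encoder.
assert all(base64_length(n) == len(base64.b64encode(b"x" * n)) for n in range(64))
```

For example, `base64_length(3 * 1024 * 1024)` is 4,194,304, so a 3 MiB screenshot becomes 4 MiB of text, and ten such screenshots in one conversation would hold roughly 40 MiB of Base64 in Redux.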

5. Provider routing is reimplemented (and diverges).
run_multimodal_completion reintroduces provider selection and even overrides model_id implicitly. This logic already exists in LLMCreator. Duplicating it guarantees drift, and silently changing models is not a safe default.
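Rather than re-deriving provider choice inside a new service, both paths can share one factory. The sketch below is a hypothetical illustration of that idea, not `LLMCreator`'s actual interface: unknown providers fail loudly, and `model_id` is passed through untouched.

```python
from typing import Callable, Dict

# Hypothetical factories standing in for real provider classes.
ProviderFactory = Callable[[str], object]
_REGISTRY: Dict[str, ProviderFactory] = {}


def register_provider(name: str, factory: ProviderFactory) -> None:
    """Register a provider under a case-insensitive name."""
    _REGISTRY[name.lower()] = factory


def create_llm(provider: str, model_id: str) -> object:
    """Single entry point for provider selection.

    Raises instead of falling back, and never rewrites model_id.
    """
    try:
        factory = _REGISTRY[provider.lower()]
    except KeyError:
        raise ValueError(f"unsupported LLM provider: {provider!r}") from None
    return factory(model_id)
```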

6. Additional issues compounding the above:

  • ChatOpenAI(thinking_budget=0) -> invalid arg, will error
  • normalize_question_payload introduces a "legacy" format that does not exist
  • Duplicate extract_markdown_image_urls, with one unused
  • Missing dependency (langchain-google-genai)

Scope creep

This PR goes well beyond multimodal input and introduces unrelated regressions:

  • application/llm/google_ai.py: breaks tool-calling and streaming behavior on Google provider
  • settings.py: unrelated config churn (likely rebase residue)
  • requirements.txt: unnecessary dependency changes
  • Frontend files: multiple unrelated edits
  • ConversationBubble.tsx: massive reformat obscuring the actual change
  • MessageInput.tsx: breaking API changes and removal of cleanup logic
  • conversationHandlers.ts: breaking function signature change
  • No tests for new logic
  • Lint, build, and test issues in current state

This makes the PR hard to review and risky to merge.

Leaning towards closing this PR.

riyo264 (Author) commented Apr 20, 2026

Thank you so much for the detailed breakdown, @ManishMadan2882. You are completely right: I fundamentally misunderstood the existing attachment pipeline and ended up building a parallel system that bypasses the core architecture. I'm going to close this PR, keep everything you mentioned in mind, and start fresh. I really appreciate the guidance on the architecture.

riyo264 closed this Apr 20, 2026

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

🚀 Feature Extract and process images from source uploads

4 participants