An iOS accessibility app that helps visually impaired users understand their surroundings through real-time on-device image analysis and spatial audio feedback.
VisionVoice is an iOS accessibility application built in Swift 6 for visually impaired and low-vision users. It uses Apple's Vision framework, AVFoundation, and a custom SoundscapeEngine to analyze the live camera feed and return spoken and spatial audio feedback — with no internet connection required, ensuring complete privacy.
The app offers three analysis modes, spatial audio positioning for detected objects, live text-to-speech narration, and a safe Demo Mode that works without a physical camera — all built with WCAG 2.1 AA accessibility standards in mind.
| Feature | Description |
|---|---|
| 🎙️ Scene Description | Narrates a natural-language description of your surroundings using on-device Vision analysis |
| 📄 Read Text | OCR text recognition from signs, labels, books, and documents |
| 📦 Object Guide | Identifies and names up to 10 objects or people in frame |
| 🔊 Spatial Audio | 3D positional tones map detected objects to left/right/up/down in your audio space |
| 🗣️ Text-to-Speech | All results spoken aloud via AVSpeechSynthesizer at a comfortable pace |
| 🧪 Demo Mode | Fully functional without a camera — analyzes a built-in sample image |
| 🔒 100% On-Device | All processing runs locally — no data sent to any server |
| ♿ WCAG 2.1 AA | VoiceOver-compatible, accessibility labels and hints on every control |
Toggle Demo Mode in the app to analyze a built-in sample image without needing camera access.
VisionVoice follows a modular pipeline architecture with clear separation of concerns:

    ┌─────────────────────────────────────────────────────────┐
    │                    ContentView (UI)                     │
    │   Mode Selector │ Camera Display │ Status Card │ TTS    │
    └────────────────────────┬────────────────────────────────┘
                             │
                  ┌──────────▼──────────┐
                  │   CameraPipeline    │  ← Orchestrates all subsystems
                  └──┬──────────────┬───┘
                     │              │
        ┌────────────▼──┐     ┌─────▼────────────┐
        │  CameraModel  │     │ VisionRecognizer │
        │  AVCapture    │     │ Vision + OCR     │
        │  Session Mgmt │     │ On-Device        │
        └───────────────┘     └────────┬─────────┘
                                       │
                            ┌──────────▼──────────┐
                            │  SoundscapeEngine   │
                            │  AVAudioEngine +    │
                            │  3D Spatial Audio   │
                            └─────────────────────┘

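
The flow in the diagram (frame source, then recognizer, then audio output) can be sketched as a small orchestrator. This is a minimal illustration, not the project's `CameraPipeline`: the names `PipelineSketch` and `Detection` and the closure-based wiring are assumptions made so the example stays self-contained.

```swift
/// A detection reduced to what the audio stage needs. Hypothetical shape for this sketch.
struct Detection {
    var label: String
    var centerX: Double
    var centerY: Double
}

/// Wires a frame source, a recognizer, and an audio sink together.
/// Each stage is injected as a closure so it can be replaced by a stub.
final class PipelineSketch<Frame> {
    private let latestFrame: () -> Frame?
    private let recognize: (Frame) -> [Detection]
    private let play: ([Detection]) -> Void

    init(latestFrame: @escaping () -> Frame?,
         recognize: @escaping (Frame) -> [Detection],
         play: @escaping ([Detection]) -> Void) {
        self.latestFrame = latestFrame
        self.recognize = recognize
        self.play = play
    }

    /// Runs one analysis pass. Returns nil when no frame is available yet.
    @discardableResult
    func analyzeOnce() -> [Detection]? {
        guard let frame = latestFrame() else { return nil }
        let detections = recognize(frame)
        play(detections)
        return detections
    }
}
```

Injecting the stages keeps the orchestrator testable without a camera, which is also roughly what a demo mode needs: a frame source that returns a fixed image.
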
VisionVoice/
├── MyApp.swift # App entry point (@main)
├── ContentView.swift # Root SwiftUI view, mode switching, TTS orchestration
├── CameraModel.swift # AVCaptureSession management, pixel buffer streaming
├── CameraPipeline.swift # Connects CameraModel → VisionRecognizer → SoundscapeEngine
├── CameraPreview.swift # UIViewRepresentable for live camera preview layer
├── Visionrecognizer.swift # Vision inference: scene description, OCR, object detection
├── SoundscapeEngine.swift # AVAudioEngine + 3D spatial tones per detected entity
├── Models.swift # Shared data models: AppMode, DetectedThing, VisionResult
├── infoview.swift # In-app help/about sheet
├── MoreInfo.plist # App metadata / info plist
├── Package.swift # Swift Package Manager manifest (iOS 16+, Swift 6)
├── Contents.json # Asset catalog metadata
├── AppIcon.png # Application icon
└── demo.jpeg # Sample image used in Demo Mode
CameraModel sets up a high-quality AVCaptureSession, streams frames as CVPixelBuffer on a dedicated DispatchQueue, and exposes the latest frame thread-safely via an NSLock. CameraPreview renders the live feed using a CALayer-backed UIViewRepresentable.
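
The "latest frame behind an `NSLock`" idea described above can be sketched as follows. This is an assumption about the general pattern, not the project's code: `LatestFrameBox` is a hypothetical name, and the generic `Frame` parameter stands in for `CVPixelBuffer` so the example compiles without AVFoundation.

```swift
import Foundation

/// Holds only the most recent value written by a producer (for example the
/// capture queue) and lets any other thread read it safely.
final class LatestFrameBox<Frame>: @unchecked Sendable {
    private let lock = NSLock()
    private var frame: Frame?

    /// Called for every new frame; the previous frame is simply overwritten.
    func store(_ newFrame: Frame) {
        lock.lock()
        defer { lock.unlock() }
        frame = newFrame
    }

    /// Called from the analysis path; returns nil until the first frame arrives.
    func latest() -> Frame? {
        lock.lock()
        defer { lock.unlock() }
        return frame
    }
}
```

Keeping only one frame means analysis always sees the current view and memory use stays flat, at the cost of silently dropping frames that were never analyzed.
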
On each analysis tap, a CVPixelBuffer is passed to VisionRecognizer, which runs:
- Object & Person Detection: Apple's Vision framework classifiers
- OCR: `VNRecognizeTextRequest` for reading text from images at the `.accurate` recognition level
- Scene Description: Constructs a natural-language sentence from detection results, including position descriptors (e.g., "left", "center", "upper right")
All processing runs on-device via the Vision framework with no network calls.
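
To illustrate how position descriptors such as "left" or "upper right" might be derived, here is a minimal sketch. It assumes the center of each bounding box is available in Vision's normalized coordinates (0 to 1, origin at the lower left). The one-third thresholds, the sentence template, and the names `NormalizedPoint`, `positionDescriptor`, and `sceneSentence` are illustrative assumptions, not the app's actual logic.

```swift
/// Normalized center of a bounding box: x and y in 0...1, origin at the lower left.
struct NormalizedPoint {
    var x: Double
    var y: Double
}

/// Maps a normalized center to a phrase such as "left", "center", or "upper right".
/// The frame is split into thirds on each axis (an assumption for this sketch).
func positionDescriptor(for center: NormalizedPoint) -> String {
    let horizontal = center.x < 1.0 / 3.0 ? "left" : (center.x > 2.0 / 3.0 ? "right" : "center")
    let vertical = center.y < 1.0 / 3.0 ? "lower" : (center.y > 2.0 / 3.0 ? "upper" : "")
    return vertical.isEmpty ? horizontal : "\(vertical) \(horizontal)"
}

/// Builds one spoken sentence from (label, center) pairs.
func sceneSentence(for items: [(label: String, center: NormalizedPoint)]) -> String {
    guard !items.isEmpty else { return "Nothing detected." }
    let parts = items.map { "\($0.label) at \(positionDescriptor(for: $0.center))" }
    return "I see " + parts.joined(separator: ", ") + "."
}
```
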
| Mode | Behavior |
|---|---|
| Scene Description | Speaks a full natural-language scene summary + activates spatial audio tones |
| Read Text | Reads out all detected lines of text joined with pauses |
| Object Guide | Lists detected objects/people by name (up to 10) + activates spatial audio |
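
The mode table can be read as a single function from a mode and an analysis result to spoken text plus a spatial-audio flag. The sketch below is one possible shape: `Mode`, `AnalysisResult`, the fallback phrases, and the use of ". " as the pause between text lines are all assumptions for illustration.

```swift
enum Mode {
    case sceneDescription, readText, objectGuide
}

/// Hypothetical container for what one analysis pass produced.
struct AnalysisResult {
    var sceneSummary: String
    var textLines: [String]
    var objectNames: [String]
}

/// Returns the text to speak and whether spatial tones should accompany it.
func narration(for mode: Mode, result: AnalysisResult) -> (speech: String, playsSpatialAudio: Bool) {
    switch mode {
    case .sceneDescription:
        return (result.sceneSummary, true)
    case .readText:
        // Joining with a period and a space is one simple way to get a pause between lines.
        let speech = result.textLines.isEmpty
            ? "No text found."
            : result.textLines.joined(separator: ". ")
        return (speech, false)
    case .objectGuide:
        // Cap the list at 10 names, matching the limit stated in the table.
        let names = result.objectNames.prefix(10)
        let speech = names.isEmpty ? "No objects found." : names.joined(separator: ", ")
        return (speech, true)
    }
}
```
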
Each detected entity type is assigned a distinct sine-wave tone frequency:
| Entity | Frequency | Character |
|---|---|---|
| Person | 440 Hz (A4) | Warm, human |
| Object | 330 Hz (E4) | Neutral |
| Text | 550 Hz (≈ C#5) | High, attention |
| Surface | 220 Hz (A3) | Low, grounding |
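
As a rough illustration of the table above, the following sketch maps each entity type to its frequency and generates raw sine samples. The frequencies come from the table; the `EntityKind` name, the 44.1 kHz sample rate, and the default amplitude are assumptions, and a real engine would copy such samples into an audio buffer rather than return an array.

```swift
import Foundation

enum EntityKind: CaseIterable {
    case person, object, text, surface

    /// Tone frequency in hertz, taken from the table.
    var frequency: Double {
        switch self {
        case .person:  return 440
        case .object:  return 330
        case .text:    return 550
        case .surface: return 220
        }
    }
}

/// Generates `duration` seconds of a sine tone as Float samples
/// in the range -amplitude...amplitude.
func sineSamples(frequency: Double, duration: Double,
                 sampleRate: Double = 44_100, amplitude: Float = 0.25) -> [Float] {
    let count = Int(duration * sampleRate)
    return (0..<count).map { index in
        let phase = 2.0 * Double.pi * frequency * Double(index) / sampleRate
        return amplitude * Float(sin(phase))
    }
}
```
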
AVAudioEnvironmentNode positions each tone in 3D space based on the object's bounding box center in the camera frame — objects on the left sound from the left, objects above sound from above. Volume scales with confidence score and bounding box size.
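
One way to turn a bounding box into a 3D position and a volume is sketched below. It assumes normalized box coordinates (0 to 1, origin at the lower left) and a listener at the origin facing the negative z axis, with +x to the right and +y up. The spread, distance, and volume formula are illustrative assumptions, not the values `SoundscapeEngine` uses.

```swift
/// Plain stand-in for a 3D audio position so the sketch needs no AVFoundation import.
struct Point3D {
    var x: Float
    var y: Float
    var z: Float
}

/// Maps a normalized box center to a point in front of the listener.
/// A center of (0.5, 0.5) lands straight ahead; x < 0.5 moves the tone left, y > 0.5 moves it up.
func audioPosition(centerX: Float, centerY: Float,
                   spread: Float = 2, distance: Float = 1) -> Point3D {
    Point3D(x: (centerX - 0.5) * spread,
            y: (centerY - 0.5) * spread,
            z: -distance)
}

/// Scales volume with detection confidence and box size, clamped to 0...1.
/// The square root keeps small boxes audible; the 0.3 floor is an arbitrary choice.
func toneVolume(confidence: Float, boxArea: Float) -> Float {
    let sizeFactor = max(boxArea, 0).squareRoot()
    let raw = confidence * (0.3 + 0.7 * sizeFactor)
    return min(max(raw, 0), 1)
}
```

The resulting `Point3D` could then be copied into an `AVAudio3DPoint` and assigned to the `position` of the player node feeding the environment node.
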
All results are spoken via AVSpeechSynthesizer with:
- Rate: `0.5` (comfortable listening pace)
- Voice: `en-US`
- Audio session category: `.playback` with `.duckOthers`, so VisionVoice automatically ducks background audio
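
The speech settings listed above could be applied roughly as follows. This is a hedged sketch: `SpeechSettings`, `configureAudioSession`, and `makeUtterance` are hypothetical names, and the AVFoundation parts are wrapped in conditional compilation so the file still compiles on platforms without `AVAudioSession`.

```swift
#if canImport(AVFoundation)
import AVFoundation
#endif

/// The values listed above, gathered in one place.
struct SpeechSettings {
    var rate: Float = 0.5
    var languageCode = "en-US"
}

#if canImport(AVFoundation) && os(iOS)
/// Sets the playback category and asks the system to duck other audio while active.
func configureAudioSession() throws {
    let session = AVAudioSession.sharedInstance()
    try session.setCategory(.playback, mode: .default, options: [.duckOthers])
    try session.setActive(true)
}

/// Builds an utterance with the configured rate and voice.
/// The voice initializer is failable; a nil voice falls back to the system default.
func makeUtterance(_ text: String,
                   settings: SpeechSettings = SpeechSettings()) -> AVSpeechUtterance {
    let utterance = AVSpeechUtterance(string: text)
    utterance.rate = settings.rate
    utterance.voice = AVSpeechSynthesisVoice(language: settings.languageCode)
    return utterance
}
#endif
```

The utterance would then be passed to `speak(_:)` on an `AVSpeechSynthesizer` that the app keeps alive for as long as speech should continue.
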
| Requirement | Value |
|---|---|
| iOS | 16.0+ |
| Swift | 6.0 |
| Xcode | 15.0+ (Swift Playgrounds 4.5+ compatible) |
| Device | iPhone or iPad with rear camera |
| Permissions | Camera |
| Network | ❌ Not required — fully offline |
- Clone or download this repository.
- Open Swift Playgrounds on iPad or Mac.
- Tap "Open" and select the `VisionVoice` folder.
- Tap ▶️ Run. The app launches instantly.
- Grant camera permission when prompted.
git clone https://github.com/Apple-beep/VisionVoice.git
cd VisionVoice
- Open `Package.swift` in Xcode 15+.
- Select your target device or simulator.
- Build & Run (⌘R).
- Grant camera permission on first launch.

Note: Demo Mode works on the simulator; live camera requires a physical device.

🎮 Usage

- Launch the app. A welcome message is spoken automatically.
- Select a mode at the bottom: Scene, Read, or Objects.
- Point your camera at your surroundings.
- Tap the camera display to analyze the current frame.
- Listen to the spoken result and the spatial audio tones.
- Toggle Demo Mode at the bottom to test without camera access.

♿ Accessibility

VisionVoice is designed accessibility-first:

- Every interactive control has `.accessibilityLabel` and `.accessibilityHint`
- Mode buttons announce "Double tap to switch to [mode] mode"
- Welcome message is spoken automatically on first launch
- Audio session uses `.duckOthers` so it does not conflict with other assistive audio
- The entire UI is navigable via VoiceOver
- Compliant with WCAG 2.1 AA standards
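
A mode button carrying the label and hint described in this list might look like the sketch below. `ModeButton` and `modeSwitchHint` are hypothetical names; only the hint wording comes from the list, and the label text is an assumption.

```swift
#if canImport(SwiftUI)
import SwiftUI
#endif

/// Builds the VoiceOver hint quoted in the list for a given mode name.
func modeSwitchHint(for modeName: String) -> String {
    "Double tap to switch to \(modeName) mode"
}

#if canImport(SwiftUI)
/// A button that exposes both a label (what it is) and a hint (what it does).
struct ModeButton: View {
    let title: String
    let action: () -> Void

    var body: some View {
        Button(title, action: action)
            .accessibilityLabel("\(title) mode")
            .accessibilityHint(modeSwitchHint(for: title))
    }
}
#endif
```
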

🛠️ Tech Stack

- SwiftUI: Declarative UI with `@StateObject`, `@Published`, `@MainActor`
- AVFoundation: Camera session, speech synthesis, 3D audio engine
- Vision Framework: On-device object detection and text recognition (OCR)
- AVAudioEngine + AVAudioEnvironmentNode: Spatial 3D soundscape
- Swift 6 Concurrency: `async`/`await`, `Task.detached`, `@MainActor` for thread safety
- Swift Package Manager: Dependency and build management

🔐 Privacy

VisionVoice processes all data entirely on-device:

- ❌ No analytics
- ❌ No external APIs
- ❌ No data collection
- ✅ Camera frames are analyzed in memory and immediately discarded
- ✅ Works fully offline

🤝 Contributing

Contributions are welcome! Here's how to get started:

```bash
git checkout -b feature/your-feature-name
git commit -m "Add: your feature description"
git push origin feature/your-feature-name
```

Please ensure your code:

- Follows Swift 6 concurrency rules (`@MainActor`, `Sendable`)
- Maintains or improves accessibility coverage
- Runs cleanly on iOS 16+

📄 License

This project is licensed under the MIT License. See the LICENSE file for details.

👤 Author

Musharaf Khan Pathan
Illinois Institute of Technology, Chicago
GitHub: @Apple-beep

Built with ❤️ for the visually impaired community — because everyone deserves to experience the world.
