Sourish1997/accountant

CGST Accountant

An AI-powered application for querying Central Goods and Services Tax (CGST) documentation. Built with AWS Bedrock Knowledge Base, Claude Sonnet 4.5, and React, it answers GST-related questions with citations to the source PDFs and automatically highlights the cited passages in the document viewer.

Features

  • 🤖 AI-Powered Chat: Natural language queries powered by Claude Sonnet 4.5
  • 📚 Knowledge Base: Semantic search across CGST documentation using AWS Bedrock
  • 📄 PDF Highlighting: Automatic fuzzy text highlighting in source documents
  • 🔐 Secure Authentication: AWS Cognito with managed login UI
  • ⚡ Real-time Streaming: Server-sent events for instant AI responses
  • 🎯 Citation Tracking: Transparent sourcing with relevance scores
  • ☁️ Serverless Architecture: Fully managed AWS infrastructure
  • 🌐 Custom Domains: Production deployment with CloudFront CDN

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                         User Browser                            │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  React SPA (TypeScript + Vite)                           │   │
│  │  - ChatInterface  - PDFViewer  - AuthContext             │   │
│  └──────────────────────────────────────────────────────────┘   │
└──────────────────┬──────────────────────────────────────────────┘
                   │
                   ↓
┌──────────────────────────────────────────────────────────────────┐
│                    CloudFront CDN                                │
│  - Custom Domain: accountant.sourish-banerjee.com                │
│  - S3 Static Hosting  - Origin Access Control                    │
└──────────────────┬───────────────────────────────────────────────┘
                   │
         ┌─────────┴─────────┬───────────────────────┬────────────┐
         ↓                   ↓                       ↓            ↓
┌────────────────┐  ┌────────────────┐  ┌──────────────────┐  ┌──────────┐
│   Cognito      │  │  API Gateway   │  │  Streaming       │  │   S3     │
│                │  │  (Custom       │  │  Lambda          │  │  PDFs    │
│  - User Pool   │  │   Domain)      │  │  (Function URL)  │  │          │
│  - Identity    │  │                │  │                  │  │          │
│    Pool        │  │  /presign      │  │  Claude Sonnet   │  │          │
│  - OAuth       │  │  /retrieve     │  │  4.5             │  │          │
└────────────────┘  └────────┬───────┘  └──────────────────┘  └──────────┘
                             │
                    ┌────────┴────────┐
                    ↓                 ↓
            ┌───────────────┐  ┌──────────────────┐
            │  Presign      │  │  Retrieve        │
            │  Lambda       │  │  Lambda          │
            │               │  │                  │
            │  Generate     │  │  Query KB        │
            │  S3 URLs      │  │                  │
            └───────────────┘  └────────┬─────────┘
                                        ↓
                              ┌──────────────────────┐
                              │  Bedrock Knowledge   │
                              │  Base                │
                              │                      │
                              │  - Titan Embeddings  │
                              │  - OpenSearch        │
                              │    Serverless        │
                              └──────────────────────┘
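
The Retrieve Lambda box in the diagram can be pictured with a minimal sketch like the one below. It is an illustration only, not the repository's handler: the function name `retrieve_passages` and the shape of the returned dictionaries are assumptions, and `client` is assumed to behave like `boto3.client("bedrock-agent-runtime")`, whose `retrieve` call is used as documented.

```python
from typing import Any


def retrieve_passages(client: Any, kb_id: str, query: str, top_k: int = 5) -> list[dict]:
    """Query a Bedrock Knowledge Base and flatten the results.

    `client` is expected to behave like boto3.client("bedrock-agent-runtime").
    The returned dict shape (text/score/source) is a hypothetical convenience format.
    """
    response = client.retrieve(
        knowledgeBaseId=kb_id,
        retrievalQuery={"text": query},
        retrievalConfiguration={
            "vectorSearchConfiguration": {"numberOfResults": top_k}
        },
    )
    passages = []
    for result in response.get("retrievalResults", []):
        passages.append(
            {
                "text": result["content"]["text"],
                # Relevance score reported by the Knowledge Base
                "score": result.get("score"),
                # S3 URI of the source PDF, if the result carries one
                "source": result.get("location", {}).get("s3Location", {}).get("uri"),
            }
        )
    return passages
```

Passing the client in as a parameter keeps the function easy to exercise with a stub, without AWS credentials.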

Project Structure

accountant/
├── backend/                    # Python backend (AWS CDK + Lambda)
│   ├── cdk/                       # CDK infrastructure
│   │   ├── app.py                    # CDK app entry point
│   │   └── stacks/                   # Stack definitions
│   │       ├── bedrock_stack.py         # Knowledge Base + OpenSearch
│   │       ├── auth_stack.py            # Cognito authentication
│   │       ├── streaming_stack.py       # Claude streaming Lambda
│   │       ├── api_stack.py             # API Gateway + integrations
│   │       └── web_stack.py             # CloudFront + S3 hosting
│   ├── lambda/                    # Lambda functions
│   │   ├── presign/                  # S3 presigned URL generation
│   │   ├── retrieve/                 # Bedrock KB retrieval
│   │   └── stream/                   # Claude streaming (Docker)
│   ├── scripts/                   # Utility scripts
│   │   └── generate_metadata.py      # KB metadata generation
│   └── copilot/                   # Scratch workspace (gitignored)
│       ├── kb_source/                # Knowledge Base PDFs
│       └── summaries/                # Task documentation
├── frontend/                   # React frontend (TypeScript + Vite)
│   ├── src/
│   │   ├── components/              # React components
│   │   │   ├── ChatInterface.tsx       # Main chat UI
│   │   │   ├── PDFViewer.tsx          # PDF rendering with highlighting
│   │   │   ├── LoginPage.tsx          # Cognito authentication UI
│   │   │   └── ...
│   │   ├── contexts/                # React contexts
│   │   │   └── AuthContext.tsx        # Authentication state
│   │   ├── utils/                   # Utility functions
│   │   │   ├── bedrock.ts             # KB retrieval API
│   │   │   ├── anthropic.ts           # Claude streaming
│   │   │   ├── s3.ts                  # S3 presigned URLs
│   │   │   └── auth.ts                # AWS SigV4 signing
│   │   ├── aws-config.ts            # AWS Amplify config
│   │   └── main.tsx                 # App entry point
│   ├── .env.dev                    # Development config (gitignored)
│   ├── .env.prod                   # Production config (gitignored)
│   └── copilot/                    # Scratch workspace (gitignored)
└── README.md                   # This file

Technology Stack

Frontend

  • React 18 - UI framework
  • TypeScript - Type-safe JavaScript
  • Vite - Build tool and dev server
  • AWS Amplify - Authentication integration
  • react-pdf - PDF rendering
  • pdfjs-dist - PDF parsing and text extraction

Backend

  • AWS CDK - Infrastructure as code (Python)
  • AWS Lambda - Serverless compute
  • API Gateway - RESTful API with custom domain
  • Bedrock Knowledge Base - Semantic document search
  • Bedrock Runtime - Claude Sonnet 4.5 integration
  • OpenSearch Serverless - Vector database (1024-dim embeddings)
  • Cognito - User authentication and authorization
  • CloudFront - CDN and static hosting
  • S3 - Object storage for PDFs and website
  • Secrets Manager - API key management
  • Route53 - DNS and custom domains
  • ACM - SSL/TLS certificates

Infrastructure

  • Python 3.13 - Backend runtime
  • uv - Fast Python package manager
  • Docker - Container runtime for streaming Lambda
  • Lambda Web Adapter - HTTP server in Lambda
  • FastAPI - Python web framework for streaming
  • Uvicorn - ASGI server

Quick Start

Prerequisites

  • Python 3.13+ with uv installed
  • Node.js 18+ and npm
  • AWS CLI configured with credentials
  • AWS CDK installed globally (npm install -g aws-cdk)
  • Docker for Lambda container builds
  • Anthropic API key for Claude access

Installation

  1. Clone repository:

    git clone <repository-url>
    cd accountant
  2. Install backend dependencies:

    cd backend
    uv sync
  3. Install frontend dependencies:

    cd ../frontend
    npm install

Development Setup

  1. Create Anthropic API key secret:

    aws secretsmanager create-secret \
        --name anthropic-api-key \
        --secret-string "your-anthropic-api-key"
  2. Bootstrap CDK (one-time per AWS account/region):

    cd backend/cdk
    uv run cdk bootstrap
  3. Deploy infrastructure:

    uv run cdk deploy --all
  4. Create frontend environment files:

    cd ../../frontend
    cp .env.example .env.dev
    # Edit .env.dev with CDK stack outputs
  5. Start development server:

    npm run dev

Visit http://localhost:5173

Production Deployment

  1. Configure custom domains in backend/cdk/app.py:

    DOMAIN_NAME = "accountant.sourish-banerjee.com"
    API_DOMAIN_NAME = "api.accountant.sourish-banerjee.com"
  2. Deploy infrastructure:

    cd backend/cdk
    uv run cdk deploy --all
  3. Upload Knowledge Base PDFs:

    cd ../copilot/kb_source
    # Add PDF files
    cd ../..
    python scripts/generate_metadata.py
    
    # Upload to S3 (get bucket name from BedrockStack outputs)
    aws s3 sync copilot/kb_source/ s3://<kb-bucket-name>/
  4. Build and deploy frontend:

    cd ../frontend
    # Update .env.prod with stack outputs
    npm run build
    
    cd ../backend/cdk
    uv run cdk deploy AccountantWebStack
  5. Access application at your custom domain

Configuration

Environment Variables

Frontend (.env.dev / .env.prod)

# Cognito Configuration
VITE_USER_POOL_ID=us-east-1_XXXXXXXXX
VITE_USER_POOL_CLIENT_ID=xxxxxxxxxxxxxxxxxxxx
VITE_IDENTITY_POOL_ID=us-east-1:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
VITE_OAUTH_DOMAIN=accountant-XXXXX.auth.us-east-1.amazoncognito.com

# OAuth URLs
VITE_REDIRECT_SIGN_IN=https://accountant.sourish-banerjee.com
VITE_REDIRECT_SIGN_OUT=https://accountant.sourish-banerjee.com

# API Configuration
VITE_API_BASE_URL=https://api.accountant.sourish-banerjee.com
VITE_STREAMING_LAMBDA_URL=https://xxxxxx.lambda-url.us-east-1.on.aws

# AWS Region
VITE_AWS_REGION=us-east-1
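
Since every value above has to be copied by hand from the CDK stack outputs, a small check can catch a missing or empty entry before `npm run build`. The sketch below is a hypothetical helper, not part of the repository; it only assumes the simple `KEY=value` format shown above.

```python
# Keys taken from the example environment file above.
REQUIRED_KEYS = (
    "VITE_USER_POOL_ID",
    "VITE_USER_POOL_CLIENT_ID",
    "VITE_IDENTITY_POOL_ID",
    "VITE_OAUTH_DOMAIN",
    "VITE_REDIRECT_SIGN_IN",
    "VITE_REDIRECT_SIGN_OUT",
    "VITE_API_BASE_URL",
    "VITE_STREAMING_LAMBDA_URL",
    "VITE_AWS_REGION",
)


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=value lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(text: str) -> list[str]:
    """Return required keys that are absent or empty."""
    values = parse_env(text)
    return [key for key in REQUIRED_KEYS if not values.get(key)]
```

For example, `missing_keys(open(".env.prod").read())` would return an empty list when the file is complete.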

Backend (Set by CDK automatically)

Lambda environment variables are configured by CDK stacks. No manual configuration needed.

Custom Domains

To use custom domains:

  1. Create Route53 hosted zone for your domain
  2. Update domain names in backend/cdk/app.py
  3. Deploy stacks - CDK creates ACM certificates automatically
  4. Wait for DNS propagation (can take 30+ minutes)

To deploy without custom domains:

# In backend/cdk/app.py
DOMAIN_NAME = None
API_DOMAIN_NAME = None

Documentation

Development Workflow

Backend Changes

cd backend/cdk
uv run cdk diff              # Review changes
uv run cdk deploy --all      # Deploy infrastructure

Frontend Changes

cd frontend
npm run dev                  # Test locally
npm run build               # Build for production
cd ../backend/cdk
uv run cdk deploy AccountantWebStack  # Deploy to CloudFront

Lambda Changes

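Lambda code changes are detected automatically by CDK the next time the owning stack is deployed:

# After editing lambda/*/handler.py
cd backend/cdk
uv run cdk deploy AccountantApiStack        # Lambdas behind API Gateway
uv run cdk deploy AccountantStreamingStack  # Streaming Lambda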

Knowledge Base Updates

# Add new PDFs to copilot/kb_source/
cd backend
python scripts/generate_metadata.py

# Upload to S3
aws s3 sync copilot/kb_source/ s3://<kb-bucket-name>/

# Sync in Bedrock console (manual step)
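
The manual console sync could also be scripted. The sketch below is one possible approach, not something the repository ships: `sync_data_source` is a hypothetical name, the Knowledge Base and data source IDs must be looked up separately, and `client` is assumed to behave like `boto3.client("bedrock-agent")`, whose `start_ingestion_job` and `get_ingestion_job` calls are used here.

```python
import time
from typing import Any

# Ingestion job states after which polling should stop.
TERMINAL_STATES = {"COMPLETE", "FAILED", "STOPPED"}


def sync_data_source(
    client: Any,
    kb_id: str,
    data_source_id: str,
    poll_seconds: float = 10.0,
    max_polls: int = 60,
) -> str:
    """Start an ingestion job and poll until it finishes; return the last status seen."""
    job = client.start_ingestion_job(
        knowledgeBaseId=kb_id, dataSourceId=data_source_id
    )["ingestionJob"]
    status = job["status"]
    polls = 0
    while status not in TERMINAL_STATES and polls < max_polls:
        time.sleep(poll_seconds)
        job = client.get_ingestion_job(
            knowledgeBaseId=kb_id,
            dataSourceId=data_source_id,
            ingestionJobId=job["ingestionJobId"],
        )["ingestionJob"]
        status = job["status"]
        polls += 1
    return status
```

A return value other than `COMPLETE` means the job failed, was stopped, or was still running when the poll limit was reached.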

Cost Breakdown

Estimated monthly costs for a production deployment (USD):

Service                 Cost        Notes
OpenSearch Serverless   $700        2 OCU minimum (fixed cost)
CloudFront              $20-50      Varies with traffic
Lambda                  $5-10       Based on invocations
API Gateway             $3-5        Based on requests
Bedrock KB Queries      $10-20      ~$0.001 per query
Anthropic API           $20-100     Based on usage
Cognito                 $0-5        Free tier: 50,000 MAU
S3 Storage              $1-3        PDFs + static assets
Secrets Manager         $0.40       Per secret
Route53                 $0.50       Per hosted zone
Total                   $760-894    Per month (sum of the rows above)

Note: OpenSearch Serverless is the primary cost driver. Consider alternatives for lower-traffic applications.
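
As a sanity check on the estimate, the per-service ranges listed above can be summed directly. The figures in this sketch are copied from the cost rows; the dictionary layout and function names are illustrative only.

```python
# (low, high) monthly estimate in USD per service, copied from the cost rows above.
COSTS = {
    "OpenSearch Serverless": (700.00, 700.00),
    "CloudFront": (20.00, 50.00),
    "Lambda": (5.00, 10.00),
    "API Gateway": (3.00, 5.00),
    "Bedrock KB Queries": (10.00, 20.00),
    "Anthropic API": (20.00, 100.00),
    "Cognito": (0.00, 5.00),
    "S3 Storage": (1.00, 3.00),
    "Secrets Manager": (0.40, 0.40),
    "Route53": (0.50, 0.50),
}


def total_range(costs: dict[str, tuple[float, float]]) -> tuple[float, float]:
    """Sum the low and high ends of every service estimate."""
    low = sum(lo for lo, _ in costs.values())
    high = sum(hi for _, hi in costs.values())
    return round(low, 2), round(high, 2)


def share_of_total(costs: dict[str, tuple[float, float]], service: str) -> tuple[float, float]:
    """Fraction of the bill taken by one service at the high and low ends of the total."""
    low, high = total_range(costs)
    fixed = costs[service][0]
    return round(fixed / high, 2), round(fixed / low, 2)
```

With these figures the range comes to roughly $760 to $894 per month, and OpenSearch Serverless accounts for about 78% to 92% of it, which is why it dominates the bill.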

Security

  • Authentication: AWS Cognito with SRP protocol
  • Authorization: Cognito User Pool authorizer on API Gateway
  • API Security: IAM authentication with SigV4 signing
  • Data Encryption: At rest (S3, OpenSearch) and in transit (HTTPS)
  • Secret Management: AWS Secrets Manager for API keys
  • Network Security: Private S3 buckets with Origin Access Control
  • CORS: Strict origin validation
  • PDF Access: Time-limited presigned URLs (1-hour expiration)
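
The time-limited PDF access in the last item can be sketched as follows. This is an assumption about how the Presign Lambda might work rather than its actual code: `presign_pdf` is a hypothetical name, the `.pdf` check is an added illustration, and `s3_client` is assumed to behave like `boto3.client("s3")`, whose `generate_presigned_url` method is used.

```python
from typing import Any

ONE_HOUR_SECONDS = 3600  # matches the 1-hour expiration stated above


def presign_pdf(
    s3_client: Any, bucket: str, key: str, expires_in: int = ONE_HOUR_SECONDS
) -> str:
    """Return a time-limited GET URL for a PDF stored in a private bucket."""
    if not key.lower().endswith(".pdf"):
        # Illustrative guard: only hand out links to PDF objects.
        raise ValueError(f"not a PDF object key: {key}")
    return s3_client.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in,
    )
```

The bucket itself stays private; only holders of the signed URL can fetch the object, and only until it expires.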

Key Features

PDF Fuzzy Highlighting

Advanced text matching algorithm:

  • Exact match with special character cleanup
  • Levenshtein distance for fuzzy matching
  • Multi-page search with automatic navigation
  • Handles OCR artifacts and extraction inconsistencies
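
The matching steps above can be sketched as follows. The project's highlighter lives in the TypeScript frontend; this Python version only illustrates the idea, and the function names, the word-window strategy, and the 20% distance threshold are all assumptions.

```python
import re
from typing import Optional


def normalize(text: str) -> str:
    """Lower-case, drop special characters, and collapse whitespace."""
    text = re.sub(r"[^0-9a-z\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (two-row dynamic programming)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def find_best_page(
    pages: list[str], needle: str, max_ratio: float = 0.2
) -> Optional[tuple[int, int]]:
    """Return (page_index, distance) for the best match, or None if nothing is close."""
    target = normalize(needle)
    if not target:
        return None
    window_size = len(target.split())
    best: Optional[tuple[int, int]] = None
    for index, page in enumerate(pages):
        haystack = normalize(page)
        if target in haystack:  # exact match after cleanup
            return index, 0
        words = haystack.split()
        # Slide a window of the same word count across the page text.
        for start in range(max(1, len(words) - window_size + 1)):
            window = " ".join(words[start:start + window_size])
            distance = levenshtein(window, target)
            if best is None or distance < best[1]:
                best = (index, distance)
    if best is not None and best[1] <= max_ratio * len(target):
        return best
    return None
```

The returned page index is what a viewer would use to navigate before highlighting; an OCR slip such as `avai1able` for `available` costs a single edit and still matches.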

Streaming Responses

Real-time AI responses using:

  • Lambda Function URL with response streaming
  • Lambda Web Adapter for HTTP server
  • Server-sent events (SSE) protocol
  • FastAPI + Uvicorn for async streaming
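
To make the SSE part concrete, here is a minimal, dependency-free sketch of how streamed tokens could be framed as server-sent events. The event names (`token`, `done`), the JSON payload shape, and the `[DONE]` sentinel are assumptions for illustration, not the project's actual wire format.

```python
import json
from typing import Iterable, Iterator, Optional


def format_sse(data: str, event: Optional[str] = None) -> str:
    """Frame one message in the text/event-stream format."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # Multi-line payloads are sent as one "data:" field per line.
    for part in data.split("\n"):
        lines.append(f"data: {part}")
    # A blank line terminates the event.
    return "\n".join(lines) + "\n\n"


def stream_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield one SSE frame per model token, then a final 'done' frame."""
    for token in tokens:
        yield format_sse(json.dumps({"text": token}), event="token")
    yield format_sse("[DONE]", event="done")
```

In a FastAPI app, a generator like this would typically be wrapped in `StreamingResponse(..., media_type="text/event-stream")` so each frame is flushed to the browser as it is produced.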

Citation System

Transparent sourcing:

  • JSON-formatted citations from Claude
  • Relevance scores from Knowledge Base
  • Clickable citations load PDFs
  • Automatic text highlighting in documents
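
One way to combine the first two items is to parse the model's JSON citations and attach the Knowledge Base relevance score for each cited source. The schema below (`source`, `quote`, `score`) is hypothetical, since the actual citation format is not documented here.

```python
import json


def attach_scores(citations_json: str, kb_results: list[dict]) -> list[dict]:
    """Parse model-emitted citations and attach the best KB score per source.

    Assumed (hypothetical) shapes:
      citations_json: '[{"source": "...", "quote": "..."}, ...]'
      kb_results:     [{"source": "...", "score": 0.0-1.0}, ...]
    """
    best_score: dict[str, float] = {}
    for result in kb_results:
        source = result["source"]
        best_score[source] = max(result["score"], best_score.get(source, 0.0))

    citations = []
    for item in json.loads(citations_json):
        citations.append(
            {
                "source": item["source"],
                "quote": item["quote"],
                # None when the KB never returned this source
                "score": best_score.get(item["source"]),
            }
        )
    # Most relevant first; unscored citations sort last.
    return sorted(citations, key=lambda c: c["score"] or 0.0, reverse=True)
```

A citation whose score is `None` cites a source the Knowledge Base did not return, which a UI might flag rather than link.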

Troubleshooting

Common Issues

Lambda Cold Starts:

  • Expected 1-3 second delay on first request
  • Consider provisioned concurrency for production

CORS Errors:

  • Verify allowed origins in Lambda/API Gateway
  • Check request includes correct origin header

PDF Not Loading:

  • Check S3 bucket permissions
  • Verify presigned URL hasn't expired
  • Ensure object exists in S3

Knowledge Base No Results:

  • Verify PDFs uploaded to S3
  • Trigger data source sync in Bedrock console
  • Wait 5-10 minutes for indexing

Authentication Issues:

  • Check Cognito User Pool configuration
  • Verify OAuth redirect URLs match exactly
  • Clear browser cache and cookies

Logs

View CloudWatch logs:

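# aws logs tail needs an exact log group name (no wildcards), so resolve it from its prefix first.
# API Lambda logs
aws logs tail "$(aws logs describe-log-groups \
    --log-group-name-prefix /aws/lambda/AccountantApiStack-RetrieveFunction \
    --query 'logGroups[0].logGroupName' --output text)" --follow

# Streaming Lambda logs
aws logs tail "$(aws logs describe-log-groups \
    --log-group-name-prefix /aws/lambda/AccountantStreamingStack-StreamingFunction \
    --query 'logGroups[0].logGroupName' --output text)" --follow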

Contributing

This is a private project. For questions or issues, contact the repository owner.

Development Tips

Faster Iteration

  1. Test Lambda locally before deploying
  2. Use CDK watch for auto-deployment: uv run cdk watch
  3. Deploy single stack instead of all: uv run cdk deploy AccountantApiStack
  4. Cache presigned URLs in frontend to reduce API calls
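
The caching idea in tip 4 can be sketched as a small time-to-live cache. The real frontend is TypeScript; the Python below only illustrates the logic, and the class name, the injected `fetch` callable, and the 55-minute default (chosen to stay under a 1-hour URL lifetime) are assumptions.

```python
import time
from typing import Callable


class PresignedUrlCache:
    """Cache presigned URLs per object key until shortly before they expire."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float = 55 * 60,  # drop entries before a 1-hour URL expires
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[str, float]] = {}

    def get(self, key: str) -> str:
        """Return a cached URL if still fresh, otherwise fetch and store a new one."""
        now = self._clock()
        cached = self._entries.get(key)
        if cached is not None and cached[1] > now:
            return cached[0]
        url = self._fetch(key)
        self._entries[key] = (url, now + self._ttl)
        return url
```

Here `fetch` stands in for whatever call requests a fresh URL from the `/presign` endpoint; repeated clicks on the same citation then reuse the cached URL.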

Debugging

  1. CloudWatch Logs - Check Lambda execution logs
  2. CDK Diff - Review infrastructure changes before deploying
  3. Browser DevTools - Network tab for API debugging
  4. React DevTools - Component state inspection

Code Quality

  • Backend: Use Python type hints and write docstrings
  • Frontend: Use TypeScript strict mode
  • Lambda: Keep functions small and focused
  • CDK: Use stack outputs for cross-stack references

Roadmap

Future enhancements:

  • Multi-user support with user-specific data
  • Conversation history persistence
  • Export chat transcripts to PDF
  • Advanced filtering by document type/year
  • Notification system for new CGST updates
  • Mobile-responsive UI improvements
  • Cost optimization with caching layer
  • Batch document upload interface

License

Private project - All rights reserved

Acknowledgments

  • AWS Bedrock - Knowledge Base and Claude integration
  • Anthropic - Claude Sonnet 4.5 AI model
  • AWS Lambda Web Adapter - HTTP streaming in Lambda
  • react-pdf - PDF rendering library
  • AWS CDK - Infrastructure as code framework

Built with ❤️ for Indian GST professionals
