LLM Research Engine

A LangChain-based autonomous web research engine that takes a plain-language question, searches the web, scrapes and summarizes the results, and produces a structured research report using an LLM.

How It Works

Question
   │
   ▼
Assistant Selection (LLM)
   │  Selects the right research persona based on the topic
   ▼
Search Query Generation (LLM)
   │  Generates N targeted web search queries
   ▼
Web Search (DuckDuckGo → Serper → Google CSE)
   │  Retrieves top URLs per query
   ▼
Async Web Scraping (httpx + asyncio)
   │  Scrapes all URLs concurrently in a single batch
   ▼
Summarization (LLM)
   │  Summarizes each scraped page in parallel
   ▼
Research Report (LLM)
   │  Synthesizes all summaries into a structured APA-format report
   ▼
Final Report (Markdown)
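The stages above can be sketched as one sequential pass (a simplified illustration with hypothetical function names; the repository's actual chains live in chain_1_2.py through chain_5_1.py and are built with LCEL):

```python
def run_pipeline(question, llm, search, scrape_all, n_queries=3):
    """Sketch of the research flow: persona -> queries -> URLs ->
    batch scrape -> summaries -> final report. The callables are
    injected stand-ins for the repo's LLM, search, and scraper."""
    persona = llm(f"Pick the best research persona for: {question}")
    queries = [llm(f"Write web search query {i + 1} for: {question}")
               for i in range(n_queries)]
    urls = [u for q in queries for u in search(q)]
    pages = scrape_all(urls)                     # one concurrent batch
    summaries = [llm(f"Summarize: {p}") for p in pages]
    return llm(f"As {persona}, write an APA-style report from:\n"
               + "\n".join(summaries))
```

In the real pipeline each `llm(...)` step is an LCEL chain rather than a bare function call, but the data flow is the same.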

Features

  • Multi-backend search — tries DuckDuckGo first, falls back to Serper.dev, then Google Custom Search Engine
  • Async scraping — all URLs scraped concurrently using httpx + asyncio.gather, not sequentially
  • Smart content extraction — prefers <article> / <main> tags to skip navbars, footers, and boilerplate
  • Singleton LLM — single ChatOpenAI instance shared across the entire pipeline
  • Retry logic — exponential backoff on search failures; graceful degradation across backends
  • Empty result guard — fails fast with a clear error message if no search results are retrieved, instead of sending empty data to the LLM
  • LangChain LCEL — entire pipeline built with LangChain Expression Language (LCEL) chains
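The fallback, retry, and empty-result-guard behaviour described above can be sketched roughly as follows (a simplified illustration, not the actual code in web_searching.py; backend callables are assumed to return a list of results and may raise on failure):

```python
import time

def search_with_fallback(query, backends, max_retries=3, base_delay=1.0):
    """Try each backend in order (e.g. DuckDuckGo, then Serper, then
    Google CSE), retrying with exponential backoff before falling
    through to the next backend."""
    for backend in backends:
        delay = base_delay
        for attempt in range(max_retries):
            try:
                results = backend(query)
                if results:              # empty-result guard
                    return results
            except Exception:
                pass                     # treat errors like empty results
            if attempt < max_retries - 1:
                time.sleep(delay)
                delay *= 2               # exponential backoff
    raise RuntimeError(f"No search results for {query!r} from any backend")
```

Raising instead of returning an empty list is what lets the pipeline fail fast rather than hand empty data to the LLM.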

Project Structure

├── llm_models.py          # Singleton ChatOpenAI instance
├── prompts.py             # All prompt templates
├── utilities.py           # JSON parsing with markdown fence stripping
├── web_searching.py       # Multi-backend search (DDG / Serper / Google CSE)
├── web_scraping.py        # Async batch scraping with httpx
├── chain_1_2.py           # Chain 1: assistant selection
├── chain_2_1.py           # Chain 2: search query generation
├── chain_3_1.py           # Chain 3: URL retrieval per query
├── chain_4_1.py           # Chain 4: batch scrape + summarize
├── chain_5_1.py           # Chain 5: full research pipeline
├── research_engine_seq.py # Sequential (non-chain) version for reference
├── chain_try_*.py         # Individual chain test scripts
├── .env.example           # Environment variable template
└── .gitignore
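The async batch scraping in web_scraping.py boils down to a single asyncio.gather call. This sketch injects the fetch coroutine (the real module uses httpx.AsyncClient) so the concurrency pattern stands alone:

```python
import asyncio

async def scrape_all(urls, fetch):
    """Scrape every URL concurrently in one batch. `fetch` is an
    injected coroutine returning page text; failures are dropped
    instead of aborting the whole batch."""
    results = await asyncio.gather(
        *(fetch(u) for u in urls),
        return_exceptions=True,   # a failed URL yields an exception object
    )
    return [r for r in results if isinstance(r, str)]
```

`return_exceptions=True` is what keeps one dead URL from sinking the entire batch, which matters when scraping arbitrary search results.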

Setup

1. Clone the repository

git clone https://github.com/bdeva1975/llm-research-engine.git
cd llm-research-engine

2. Create and activate a virtual environment

python -m venv env
# Windows
env\Scripts\activate
# macOS/Linux
source env/bin/activate

3. Install dependencies

pip install -r requirements.txt

4. Configure environment variables

Copy .env.example to .env and fill in your API keys:

cp .env.example .env

Then set the following in .env:

OPENAI_API_KEY=your_openai_api_key_here
SERPER_API_KEY=your_serper_api_key_here      # recommended fallback if DDG is blocked
GOOGLE_API_KEY=your_google_api_key_here      # optional second fallback
GOOGLE_CSE_ID=your_google_cse_id_here        # optional second fallback


Usage

Run the full research pipeline:

python chain_try_5_1.py

The default question is:

What can I see and do in the Spanish town of Astorga?

To change the question, edit chain_try_5_1.py:

question = 'Your question here'

Running individual chains for testing

python chain_try_1_2.py   # test assistant selection
python chain_try_2_1.py   # test search query generation
python chain_try_3_1.py   # test URL retrieval
python chain_try_4_1.py   # test scraping + summarization
python chain_try_5_1.py   # test full pipeline

Dependencies

Key packages (see requirements.txt for full list):

Package              Purpose
-------              -------
langchain            LCEL chain orchestration
langchain-openai     ChatOpenAI LLM
langchain-community  DuckDuckGo wrapper
duckduckgo-search    DDG search backend
httpx                Async HTTP for web scraping
beautifulsoup4       HTML parsing
requests             HTTP for Serper / Google CSE
python-dotenv        .env file loading

Notes

  • DuckDuckGo search may be rate-limited or blocked in some corporate/institutional networks. In that case, configure SERPER_API_KEY in your .env — Serper is the recommended fallback.
  • The pipeline makes approximately 9 LLM calls per question (1 assistant selection + 1 query generation + 6 summaries + 1 final report). Latency depends on your LLM provider and network speed.
  • RESULT_TEXT_MAX_CHARACTERS in chain_4_1.py controls how much scraped text is fed per summarization call (default: 5000 characters). Increase for more detail, decrease for faster/cheaper calls.
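The truncation mentioned in the last note amounts to a simple slice before each summarization call (a sketch; the actual constant lives in chain_4_1.py):

```python
RESULT_TEXT_MAX_CHARACTERS = 5000  # default used by chain_4_1.py

def truncate_for_summary(text, limit=RESULT_TEXT_MAX_CHARACTERS):
    """Cap the scraped text fed to a single summarization call,
    trading detail for cheaper and faster LLM requests."""
    return text[:limit]
```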

License

MIT
