LLM Research Engine

A LangChain-based autonomous web research engine that takes a plain-language question, searches the web, scrapes and summarizes the results, and produces a structured research report using an LLM.

How It Works

Question
   │
   ▼
Assistant Selection (LLM)
   │  Selects the right research persona based on the topic
   ▼
Search Query Generation (LLM)
   │  Generates N targeted web search queries
   ▼
Web Search (DuckDuckGo → Serper → Google CSE)
   │  Retrieves top URLs per query
   ▼
Async Web Scraping (httpx + asyncio)
   │  Scrapes all URLs concurrently in a single batch
   ▼
Summarization (LLM)
   │  Summarizes each scraped page in parallel
   ▼
Research Report (LLM)
   │  Synthesizes all summaries into a structured APA-format report
   ▼
Final Report (Markdown)
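The stages above can be sketched as one sequential pass (a simplified illustration with hypothetical function names; the repository's actual chains live in chain_1_2.py through chain_5_1.py and are built with LCEL):

```python
def run_pipeline(question, llm, search, scrape_all, n_queries=3):
    """Sketch of the research flow: persona -> queries -> URLs ->
    batch scrape -> summaries -> final report. The callables are
    injected stand-ins for the repo's LLM, search, and scraper."""
    persona = llm(f"Pick the best research persona for: {question}")
    queries = [llm(f"Write web search query {i + 1} for: {question}")
               for i in range(n_queries)]
    urls = [u for q in queries for u in search(q)]
    pages = scrape_all(urls)                     # one concurrent batch
    summaries = [llm(f"Summarize: {p}") for p in pages]
    return llm(f"As {persona}, write an APA-style report from:\n"
               + "\n".join(summaries))
```

In the real pipeline each `llm(...)` step is an LCEL chain rather than a bare function call, but the data flow is the same.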

Features

  • Multi-backend search — tries DuckDuckGo first, falls back to Serper.dev, then Google Custom Search Engine
  • Async scraping — all URLs scraped concurrently using httpx + asyncio.gather, not sequentially
  • Smart content extraction — prefers <article> / <main> tags to skip navbars, footers, and boilerplate
  • Singleton LLM — single ChatOpenAI instance shared across the entire pipeline
  • Retry logic — exponential backoff on search failures; graceful degradation across backends
  • Empty result guard — fails fast with a clear error message if no search results are retrieved, instead of sending empty data to the LLM
  • LangChain LCEL — entire pipeline built with LangChain Expression Language (LCEL) chains
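The fallback, retry, and empty-result-guard behaviour described above can be sketched roughly as follows (a simplified illustration, not the actual code in web_searching.py; backend callables are assumed to return a list of results and may raise on failure):

```python
import time

def search_with_fallback(query, backends, max_retries=3, base_delay=1.0):
    """Try each backend in order (e.g. DuckDuckGo, then Serper, then
    Google CSE), retrying with exponential backoff before falling
    through to the next backend."""
    for backend in backends:
        delay = base_delay
        for attempt in range(max_retries):
            try:
                results = backend(query)
                if results:              # empty-result guard
                    return results
            except Exception:
                pass                     # treat errors like empty results
            if attempt < max_retries - 1:
                time.sleep(delay)
                delay *= 2               # exponential backoff
    raise RuntimeError(f"No search results for {query!r} from any backend")
```

Raising instead of returning an empty list is what lets the pipeline fail fast rather than hand empty data to the LLM.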

Project Structure

├── llm_models.py          # Singleton ChatOpenAI instance
├── prompts.py             # All prompt templates
├── utilities.py           # JSON parsing with markdown fence stripping
├── web_searching.py       # Multi-backend search (DDG / Serper / Google CSE)
├── web_scraping.py        # Async batch scraping with httpx
├── chain_1_2.py           # Chain 1: assistant selection
├── chain_2_1.py           # Chain 2: search query generation
├── chain_3_1.py           # Chain 3: URL retrieval per query
├── chain_4_1.py           # Chain 4: batch scrape + summarize
├── chain_5_1.py           # Chain 5: full research pipeline
├── research_engine_seq.py # Sequential (non-chain) version for reference
├── chain_try_*.py         # Individual chain test scripts
├── .env.example           # Environment variable template
└── .gitignore
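The async batch scraping in web_scraping.py boils down to a single asyncio.gather call. This sketch injects the fetch coroutine (the real module uses httpx.AsyncClient) so the concurrency pattern stands alone:

```python
import asyncio

async def scrape_all(urls, fetch):
    """Scrape every URL concurrently in one batch. `fetch` is an
    injected coroutine returning page text; failures are dropped
    instead of aborting the whole batch."""
    results = await asyncio.gather(
        *(fetch(u) for u in urls),
        return_exceptions=True,   # a failed URL yields an exception object
    )
    return [r for r in results if isinstance(r, str)]
```

`return_exceptions=True` is what keeps one dead URL from sinking the entire batch, which matters when scraping arbitrary search results.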

Setup

1. Clone the repository

git clone https://github.com/bdeva1975/llm-research-engine.git
cd llm-research-engine

2. Create and activate a virtual environment

python -m venv env
# Windows
env\Scripts\activate
# macOS/Linux
source env/bin/activate

3. Install dependencies

pip install -r requirements.txt

4. Configure environment variables

Copy .env.example to .env and fill in your API keys:

cp .env.example .env

Then set the following in .env:

OPENAI_API_KEY=your_openai_api_key_here
SERPER_API_KEY=your_serper_api_key_here      # recommended fallback if DDG is blocked
GOOGLE_API_KEY=your_google_api_key_here      # optional second fallback
GOOGLE_CSE_ID=your_google_cse_id_here        # optional second fallback


Usage

Run the full research pipeline:

python chain_try_5_1.py

The default question is:

What can I see and do in the Spanish town of Astorga?

To change the question, edit chain_try_5_1.py:

question = 'Your question here'

Running individual chains for testing

python chain_try_1_2.py   # test assistant selection
python chain_try_2_1.py   # test search query generation
python chain_try_3_1.py   # test URL retrieval
python chain_try_4_1.py   # test scraping + summarization
python chain_try_5_1.py   # test full pipeline

Dependencies

Key packages (see requirements.txt for full list):

Package              Purpose
-------              -------
langchain            LCEL chain orchestration
langchain-openai     ChatOpenAI LLM
langchain-community  DuckDuckGo wrapper
duckduckgo-search    DDG search backend
httpx                Async HTTP for web scraping
beautifulsoup4       HTML parsing
requests             HTTP for Serper / Google CSE
python-dotenv        .env file loading

Notes

  • DuckDuckGo search may be rate-limited or blocked in some corporate/institutional networks. In that case, configure SERPER_API_KEY in your .env — Serper is the recommended fallback.
  • The pipeline makes approximately 9 LLM calls per question (1 assistant selection + 1 query generation + 6 summaries + 1 final report). Latency depends on your LLM provider and network speed.
  • RESULT_TEXT_MAX_CHARACTERS in chain_4_1.py controls how much scraped text is fed per summarization call (default: 5000 characters). Increase for more detail, decrease for faster/cheaper calls.
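The truncation mentioned in the last note amounts to a simple slice before each summarization call (a sketch; the actual constant lives in chain_4_1.py):

```python
RESULT_TEXT_MAX_CHARACTERS = 5000  # default used by chain_4_1.py

def truncate_for_summary(text, limit=RESULT_TEXT_MAX_CHARACTERS):
    """Cap the scraped text fed to a single summarization call,
    trading detail for cheaper and faster LLM requests."""
    return text[:limit]
```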

License

MIT
