A LangChain-based autonomous web research engine that takes a plain question, searches the web, scrapes and summarizes results, and produces a structured research report using an LLM.
```
Question
   │
   ▼
Assistant Selection (LLM)
   │  Selects the right research persona based on the topic
   ▼
Search Query Generation (LLM)
   │  Generates N targeted web search queries
   ▼
Web Search (DuckDuckGo → Serper → Google CSE)
   │  Retrieves top URLs per query
   ▼
Async Web Scraping (httpx + asyncio)
   │  Scrapes all URLs concurrently in a single batch
   ▼
Summarization (LLM)
   │  Summarizes each scraped page in parallel
   ▼
Research Report (LLM)
   │  Synthesizes all summaries into a structured APA-format report
   ▼
Final Report (Markdown)
```
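The stages above can be sketched as one sequential pipeline of plain Python functions (all names below are illustrative stand-ins; the actual project wires these steps together as LangChain LCEL chains and real LLM calls):

```python
# Minimal sketch of the pipeline stages; each step is a stand-in for
# an LLM call or I/O operation in the real project.
def select_assistant(question: str) -> str:
    # Stand-in for the LLM persona-selection call
    return "travel researcher"

def generate_queries(question: str, persona: str, n: int = 3) -> list[str]:
    # Stand-in for LLM search-query generation
    return [f"{question} (query {i + 1})" for i in range(n)]

def search(query: str) -> list[str]:
    # Stand-in for the multi-backend web search
    return [f"https://example.com/{abs(hash(query)) % 100}"]

def scrape_and_summarize(urls: list[str]) -> list[str]:
    # Stand-in for async scraping plus per-page summarization
    return [f"summary of {u}" for u in urls]

def write_report(question: str, summaries: list[str]) -> str:
    # Stand-in for the final report-synthesis LLM call
    return f"# Report: {question}\n" + "\n".join(f"- {s}" for s in summaries)

def research(question: str) -> str:
    persona = select_assistant(question)
    queries = generate_queries(question, persona)
    urls = [u for q in queries for u in search(q)]
    summaries = scrape_and_summarize(urls)
    return write_report(question, summaries)
```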
- Multi-backend search — tries DuckDuckGo first, falls back to Serper.dev, then Google Custom Search Engine
- Async scraping — all URLs scraped concurrently using `httpx` + `asyncio.gather`, not sequentially
- Smart content extraction — prefers `<article>`/`<main>` tags to skip navbars, footers, and boilerplate
- Singleton LLM — a single `ChatOpenAI` instance shared across the entire pipeline
- Retry logic — exponential backoff on search failures; graceful degradation across backends
- Empty result guard — fails fast with a clear error message if no search results are retrieved, instead of sending empty data to the LLM
- LangChain LCEL — entire pipeline built with LangChain Expression Language (LCEL) chains
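As a rough illustration, the retry-with-fallback behavior could look like the sketch below (the helper name, retry counts, and delays are illustrative, not the project's actual values):

```python
import time

def search_with_fallback(query, backends, retries=3, base_delay=1.0):
    """Try each search backend in order; retry a failing backend with
    exponential backoff before degrading to the next one."""
    for backend in backends:
        for attempt in range(retries):
            try:
                return backend(query)
            except Exception:
                time.sleep(base_delay * (2 ** attempt))  # e.g. 1s, 2s, 4s, ...
    # Empty-result guard: fail fast with a clear message rather than
    # passing nothing on to the LLM.
    raise RuntimeError(f"All search backends failed for: {query!r}")
```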
```
├── llm_models.py           # Singleton ChatOpenAI instance
├── prompts.py              # All prompt templates
├── utilities.py            # JSON parsing with markdown fence stripping
├── web_searching.py        # Multi-backend search (DDG / Serper / Google CSE)
├── web_scraping.py         # Async batch scraping with httpx
├── chain_1_2.py            # Chain 1: assistant selection
├── chain_2_1.py            # Chain 2: search query generation
├── chain_3_1.py            # Chain 3: URL retrieval per query
├── chain_4_1.py            # Chain 4: batch scrape + summarize
├── chain_5_1.py            # Chain 5: full research pipeline
├── research_engine_seq.py  # Sequential (non-chain) version for reference
├── chain_try_*.py          # Individual chain test scripts
├── .env.example            # Environment variable template
└── .gitignore
```
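The fence-stripping JSON parser in `utilities.py` presumably does something like the following (a self-contained sketch, not the project's exact implementation):

```python
import json
import re

def parse_llm_json(raw: str):
    """Parse JSON from an LLM response, tolerating ```json ... ``` fences
    that models often wrap around structured output."""
    text = raw.strip()
    # Drop an opening fence such as ``` or ```json, then the closing ```
    text = re.sub(r"^```[a-zA-Z]*\s*", "", text)
    text = re.sub(r"\s*```$", "", text)
    return json.loads(text)
```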
```
git clone https://github.com/bdeva1975/llm-research-engine.git
cd llm-research-engine
```

Create and activate a virtual environment:

```
python -m venv env

# Windows
env\Scripts\activate

# macOS/Linux
source env/bin/activate
```

Install dependencies:

```
pip install -r requirements.txt
```

Copy `.env.example` to `.env` and fill in your API keys:

```
cp .env.example .env
```

```
OPENAI_API_KEY=your_openai_api_key_here
SERPER_API_KEY=your_serper_api_key_here    # recommended fallback if DDG is blocked
GOOGLE_API_KEY=your_google_api_key_here    # optional second fallback
GOOGLE_CSE_ID=your_google_cse_id_here      # optional second fallback
```
Getting API keys:
- OpenAI: https://platform.openai.com/api-keys
- Serper (free — 2500 searches/month): https://serper.dev
- Google CSE (free — 100 queries/day): https://programmablesearchengine.google.com
Run the full research pipeline:
```
python chain_try_5_1.py
```

The default question is:

What can I see and do in the Spanish town of Astorga?

To change the question, edit `chain_try_5_1.py`:

```python
question = 'Your question here'
```

Test the chains individually:

```
python chain_try_1_2.py   # test assistant selection
python chain_try_2_1.py   # test search query generation
python chain_try_3_1.py   # test URL retrieval
python chain_try_4_1.py   # test scraping + summarization
python chain_try_5_1.py   # test full pipeline
```

Key packages (see `requirements.txt` for the full list):
| Package | Purpose |
|---|---|
| `langchain` | LCEL chain orchestration |
| `langchain-openai` | `ChatOpenAI` LLM |
| `langchain-community` | DuckDuckGo wrapper |
| `duckduckgo-search` | DDG search backend |
| `httpx` | Async HTTP for web scraping |
| `beautifulsoup4` | HTML parsing |
| `requests` | HTTP for Serper / Google CSE |
| `python-dotenv` | `.env` file loading |
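The concurrent-scraping pattern from `web_scraping.py` looks roughly like this (a simulated fetch stands in for a real `httpx` request so the sketch stays self-contained):

```python
import asyncio

async def fetch(url: str) -> str:
    # In the real project this is an httpx.AsyncClient GET plus HTML
    # extraction; here a tiny sleep stands in for network latency.
    await asyncio.sleep(0.01)
    return f"<html>content of {url}</html>"

async def scrape_all(urls: list[str]) -> list[str]:
    # One batch: every URL is fetched concurrently, not one after another.
    return await asyncio.gather(*(fetch(u) for u in urls))

pages = asyncio.run(scrape_all([f"https://example.com/{i}" for i in range(6)]))
```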
- DuckDuckGo search may be rate-limited or blocked in some corporate/institutional networks. In that case, configure `SERPER_API_KEY` in your `.env` — Serper is the recommended fallback.
- The pipeline makes approximately 9 LLM calls per question (1 assistant selection + 1 query generation + 6 summaries + 1 final report). Latency depends on your LLM provider and network speed.
- `RESULT_TEXT_MAX_CHARACTERS` in `chain_4_1.py` controls how much scraped text is fed per summarization call (default: 5000 characters). Increase for more detail, decrease for faster/cheaper calls.
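Trimming scraped text before summarization can be as simple as a character slice cut back to a word boundary (a sketch: the constant name comes from `chain_4_1.py`, but the helper below is hypothetical):

```python
RESULT_TEXT_MAX_CHARACTERS = 5000  # project default

def truncate_for_summary(text: str, limit: int = RESULT_TEXT_MAX_CHARACTERS) -> str:
    """Cap scraped text at `limit` characters, avoiding a mid-word cut."""
    if len(text) <= limit:
        return text
    cut = text[:limit]
    # Back off to the last space so the summary prompt doesn't end mid-word
    return cut.rsplit(" ", 1)[0] if " " in cut else cut
```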
MIT