Observe the surface. Dive for the signal.
A research tool that analyzes public GitHub repositories to extract engineering and security-relevant development metrics.
“Biguá” is the Portuguese name for a cormorant, a diving bird commonly found along Brazilian coasts and rivers.
Cormorants are known for carefully observing their surroundings and diving beneath the surface to find what is hidden. In a similar way, bigua-analyzer inspects public repositories and dives into their history and structure to uncover patterns in how software is built.
The name reflects this idea: observing the ecosystem and extracting insights that are not immediately visible on the surface.
`bigua-analyzer` inspects public GitHub repositories and extracts a set of engineering and development signals that help reveal real-world software development patterns.
The analyzer focuses exclusively on publicly available repository metadata and commit history.
For the full metric catalogue and the traffic-light signal quality layer (green/yellow/orange/red), see `bigua_project_docs/metrics.md`.
- Total number of commits
- Commit frequency over time
- Commit burst patterns
- Time between commits
- Repository age
- Total number of contributors
- Contribution distribution (top contributors vs long tail)
- Bus factor estimation
- New contributor arrival rate
- Maintainer activity patterns
- Repository size
- File count
- Directory depth
- Language distribution
- Presence of dependency declaration files (package.json, requirements.txt, pom.xml, etc.)
- Pull request frequency
- Merge latency
- Commit message patterns
- Code churn over time
- Branching activity
- Presence of security-related files (SECURITY.md, CODEOWNERS)
- Dependency update patterns
- Signals of automated tooling (CI/CD, linters, security scanners)
- Indicators associated with security maturity
These metrics can be aggregated across repositories to study large-scale patterns in open-source software development and engineering practices.
Clone the repository and install dependencies:

```bash
git clone https://github.com/icidade/bigua-analyzer.git
cd bigua-analyzer
pip install -e .
```

`bigua-analyzer` uses subcommands:
| Subcommand | Purpose |
|---|---|
| `analyze` | Clone repositories and extract metrics to CSV/JSONL across human, hybrid, or AI-aware SDLC modes |
| `analyze-report` | Generate an AI-assisted Markdown + HTML report from a metrics CSV |
In addition to subcommands, `bigua-analyzer` also supports a direct visualization mode:

```bash
bigua-analyzer --plots --input <metrics.csv> --out <plots_dir>
```
The core analysis pipeline is illustrated below:
```bash
bigua-analyzer analyze https://github.com/microsoft/vscode
```

This analyzes the default branch (usually `main` or `master`) and writes results to `out/results.csv` and `out/results.jsonl`.
Create a CSV file `repos.csv` with repository URLs:

```csv
url
https://github.com/microsoft/vscode
https://github.com/facebook/react
https://github.com/golang/go
```

Then run:
```bash
bigua-analyzer analyze --dataset repos.csv --out analysis-results
```

- Specify a branch/tag/SHA: `--ref main`
- Limit number of repos: `--max-repos 10`
- Parallel processing: `--max-workers 8` (default 4)
- Analysis depth: `--mode full|fast` (default `full`)
- Scope history by date: `--since YYYY-MM-DD`
- Scope history by relative window: `--time-window 365`
- Limit analyzed commits inside the selected scope: `--sample-size 240`
- Disable persistent scope cache: `--no-analysis-cache`
- SDLC analysis mode: `--sdlc-mode auto|human|hybrid|ai` (default `auto`)
- Output format: `--format csv`, `--format jsonl`, or `--format both` (default `both`)
- Custom cache directory: `--cache-dir /path/to/cache`
`--mode fast` keeps the default output schema but scopes the expensive history-based calculations to a recent window and an optional sampled commit subset.

- Default fast-mode window: last 365 days
- Default fast-mode sample size: 240 commits
- Sampling strategy: time-bucketed commit sampling to preserve chronological coverage
- Persistent cache: scoped commit listings and AI scan inputs are cached under the analysis cache directory inside `--cache-dir`

Use `full` when you need maximum fidelity over the entire repository history. Use `fast` when you need a materially quicker approximation on very large repositories.
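The time-bucketed sampling idea can be sketched in a few lines; this is an illustrative approximation (the function name and bucketing details are hypothetical, not the analyzer's actual implementation):

```python
from datetime import timedelta

def time_bucketed_sample(commit_times, sample_size):
    """Pick at most sample_size commits spread across equal-width time
    buckets, preserving chronological coverage. Hypothetical sketch."""
    ordered = sorted(commit_times)
    if len(ordered) <= sample_size:
        return ordered
    start, end = ordered[0], ordered[-1]
    bucket_width = (end - start) / sample_size or timedelta(seconds=1)
    picked = {}
    for t in ordered:
        idx = min(int((t - start) / bucket_width), sample_size - 1)
        picked.setdefault(idx, t)  # keep the earliest commit in each bucket
    return sorted(picked.values())
```

Because one commit survives per bucket, the sample stays spread across the whole selected window instead of clustering in the most recent weeks.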
`bigua-analyzer analyze` supports four SDLC modes:

- `auto`: computes a repository-level AI Influence Score and derives the effective mode automatically
- `human`: keeps traditional repository metrics as the primary analysis lens
- `hybrid`: combines traditional metrics with AI-aware metrics
- `ai`: prioritizes AI-aware metrics while preserving existing output fields for compatibility
When `--sdlc-mode auto` is used, the effective mode is resolved from the AI Influence Score:

- `< 0.30` → `human`
- `>= 0.30` and `< 0.60` → `hybrid`
- `>= 0.60` → `ai`
The AI Influence Score is repository-level and is based on normalized heuristics for commit patterns, temporal anomalies, style uniformity, and metadata signals. Repository age is used only as a weak contextual prior and is never sufficient on its own to classify a project as hybrid or ai.
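The threshold mapping can be expressed as a small sketch (the function name is hypothetical; the analyzer's internal code may differ):

```python
def resolve_effective_mode(ai_influence_score: float) -> str:
    """Resolve the effective SDLC mode from the AI Influence Score,
    following the documented thresholds."""
    if ai_influence_score < 0.30:
        return "human"
    if ai_influence_score < 0.60:
        return "hybrid"
    return "ai"
```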
- Quick analysis of a popular repo:

  ```bash
  bigua-analyzer analyze https://github.com/elastic/elasticsearch --ref v8.11.0
  ```

- Batch analysis with parallel processing:

  ```bash
  bigua-analyzer analyze --dataset all_repos.csv --max-workers 8 --out batch-results --format both
  ```

- Testing with a small dataset:

  ```bash
  bigua-analyzer analyze --dataset repos.csv --max-repos 5 --max-workers 2 --out test-output
  ```

- Hybrid SDLC analysis:

  ```bash
  bigua-analyzer analyze https://github.com/org/repo --sdlc-mode hybrid --out hybrid-results
  ```

- Auto-detect the effective SDLC mode:

  ```bash
  bigua-analyzer analyze https://github.com/org/repo --sdlc-mode auto --out auto-results
  ```

- Fast analysis for a very large repository:

  ```bash
  bigua-analyzer analyze https://github.com/hashicorp/terraform --mode fast --out terraform-fast
  ```

- Fast dataset run with an explicit time scope and sample cap:

  ```bash
  bigua-analyzer analyze --dataset all_repos.csv --mode fast --time-window 180 --sample-size 120 --max-workers 2 --out fast-results
  ```

- Full analysis constrained to recent history only:

  ```bash
  bigua-analyzer analyze https://github.com/org/repo --mode full --since 2024-01-01 --out scoped-full-results
  ```
Takes a metrics CSV produced by `analyze` and sends it to the selected LLM provider to generate a professional Markdown analysis report, optionally rendered as HTML.

Supported providers include `openai-compatible`, `openai`, `xai`, `gemini`, and local models via `ollama`.

Privacy warning: repository metadata and derived metrics may be sent to an external LLM API when using `analyze-report`. If you are working with private or sensitive repositories, verify your organization's policy before running this command.

`analyze-report` asks for interactive confirmation before sending data to the configured LLM backend. Use `--yes` to bypass this prompt in CI or scripted runs.
```text
metrics CSV
    ↓
prompt builder (metrics + derived signals)
    ↓
LLM (provider-selected: cloud or local)
    ↓
analysis_report.md
    ↓
analysis_report.html
```
Derived signals are computed automatically before the prompt is sent:
| Signal | Logic |
|---|---|
| `contribution_concentration` | `high` if `gini_coefficient > 0.7`, else `moderate` |
| `bus_factor_risk` | `high` if `bus_factor < 2`, else `moderate` |
| `contributor_stability` | `unstable` if `developer_turnover > 0.5`, else `stable` |
| `change_velocity` | `high` if `code_churn > 1000`, else `normal` |
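The derived-signal logic can be mirrored in a few lines of Python; this sketch assumes a metrics row given as a dict keyed by the documented column names (the helper itself is hypothetical):

```python
def derive_signals(row: dict) -> dict:
    """Compute derived signals from a metrics row, applying the
    documented thresholds. Illustrative helper, not the shipped code."""
    return {
        "contribution_concentration": "high" if row["gini_coefficient"] > 0.7 else "moderate",
        "bus_factor_risk": "high" if row["bus_factor"] < 2 else "moderate",
        "contributor_stability": "unstable" if row["developer_turnover"] > 0.5 else "stable",
        "change_velocity": "high" if row["code_churn"] > 1000 else "normal",
    }
```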
If the input CSV contains SDLC context fields such as `sdlc_mode`, `effective_sdlc_mode`, and `ai_influence_score`, `analyze-report` includes that context in the LLM prompt and asks the model to interpret the repository according to human, hybrid, or AI-assisted development conditions.
```bash
# 1. Set your API key
export LLM_API_KEY=sk-...

# 2. Run the analyzer
bigua-analyzer analyze https://github.com/org/repo --out out/results

# 3. Generate the AI report
bigua-analyzer analyze-report \
  --csv out/results.csv \
  --out-md analysis_report.md \
  --out-html analysis_report.html
```

This outputs `analysis_report.md` and `analysis_report.html`.
If `--repo-url` is omitted and the CSV contains multiple rows, `analyze-report` runs in batch mode and generates one report per repository under `analysis_reports/` by default.
| Flag | Default | Description |
|---|---|---|
| `--csv` | (required) | Path to the metrics CSV |
| `--repo-url` | auto | Filter to a specific URL; omit it to generate one report per CSV row |
| `--out-dir` | `analysis_reports` | Output directory for batch mode |
| `--out-md` | `analysis_report.md` | Markdown output path |
| `--out-html` | `analysis_report.html` | HTML output path (pass `""` to skip) |
| `--llm` | `openai-compatible` | LLM adapter: `openai-compatible`, `openai`, `xai`, `gemini`, `ollama` (`--provider` is an alias) |
| `--model` | provider-specific | LLM model name |
| `--base-url` | provider-specific | API base URL override |
| `--api-key` | provider-specific | API key override |
| `--temperature` | `0.2` | Sampling temperature |
| `--top-p` | `0.9` | Nucleus sampling probability mass |
| `--max-tokens` | `4096` | Max tokens in the LLM response |
| `--suppress-external-llm-warning` | `false` | Suppress the runtime privacy warning |
| `--yes` | `false` | Bypass the interactive confirmation prompt |
The LLM call uses a system/user split for better output consistency:
- System role: persona, rules, and output constraints (static)
- User role: repository metadata, metrics, derived signals, and format instructions (dynamic)
Defaults are tuned for factual, analytical output: `temperature=0.2`, `top_p=0.9`.
Provider-specific environment variables are respected automatically:
- Generic (all providers): `LLM_API_KEY`, `LLM_BASE_URL`, `LLM_MODEL`
- `openai` / `openai-compatible`: `OPENAI_BASE_URL`, `OPENAI_MODEL` (legacy `OPENAI_API_KEY` also supported)
- `xai`: `XAI_API_KEY`, `XAI_BASE_URL`, `XAI_MODEL`
- `gemini`: `GEMINI_API_KEY`, `GEMINI_BASE_URL`, `GEMINI_MODEL`
- `ollama`: `OLLAMA_BASE_URL`, `OLLAMA_MODEL` (no API key required by default)
```bash
# Ollama native API
bigua-analyzer analyze-report \
  --csv out/results.csv \
  --llm ollama \
  --base-url http://localhost:11434 \
  --model llama3.1 \
  --yes

# OpenAI-compatible endpoint
bigua-analyzer analyze-report \
  --csv out/results.csv \
  --llm openai-compatible \
  --base-url http://localhost:11434/v1 \
  --model llama3 \
  --api-key dummy

# Gemini native API
bigua-analyzer analyze-report \
  --csv out/results.csv \
  --llm gemini \
  --model gemini-2.0-flash

# xAI (Grok via OpenAI-compatible API)
bigua-analyzer analyze-report \
  --csv out/results.csv \
  --llm xai \
  --model grok-2-latest

# Non-interactive mode (CI/scripts)
bigua-analyzer analyze-report \
  --csv out/results.csv \
  --yes
```

Generate one Markdown and HTML report per repository from a multi-row CSV:

```bash
bigua-analyzer analyze-report \
  --csv out/results-parallel.csv \
  --out-dir out/analysis_reports \
  --yes
```

This writes files such as `owner__repo_analysis_report.md` and `owner__repo_analysis_report.html` under the chosen output directory.
Generate research-ready PNG charts directly from the analyzer CSV output without interactive rendering.
```bash
bigua-analyzer --plots --input out/results.csv --out plots
```

This mode runs headless (no interactive window) and writes PNG files such as:

- `gini_vs_bus_factor.png`
- `ai_influence_distribution.png`
- `repo_classification.png`
- `turnover_vs_contributors.png`
- `release_vs_ai.png`
- `radar_aggregate_by_traffic_light.png`
- `radar_top8_ai_influence.png`
- `radar_median_plus_outliers.png`
- `radar_<repo_id>.png` (one per repository)
Radar plots use a shared normalized [0, 1] scale across all generated radar views so profiles remain directly comparable. The standardized axes are:

- Distribution
- Bus Factor
- Contributors (log)
- Stability
- AI Influence
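A shared [0, 1] scale of this kind is typically a min-max normalization computed across all repositories; a sketch under that assumption (the function and axis keys are illustrative, not the plotting code itself):

```python
def normalize_axes(rows):
    """Min-max normalize each axis to [0, 1] across all repositories so
    every radar profile shares one scale. Illustrative sketch."""
    keys = rows[0].keys()
    lo = {k: min(r[k] for r in rows) for k in keys}
    hi = {k: max(r[k] for r in rows) for k in keys}
    return [
        {k: (r[k] - lo[k]) / (hi[k] - lo[k]) if hi[k] > lo[k] else 0.0
         for k in keys}
        for r in rows
    ]
```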
HTML output is always available. By default, bigua-analyzer uses a built-in Markdown converter that covers the subset typically produced by LLMs: headings, bold, italic, inline code, fenced code blocks, lists, and links.
For higher-fidelity rendering — including tables and extended Markdown syntax — install the optional markdown package:
```bash
# If installed from source
pip install -e ".[ai]"

# If installed from PyPI
pip install "bigua-analyzer[ai]"
```

Results are saved as CSV and/or JSONL files. Each row/object contains:
- Repository metadata (URL, ref, repo_id)
- Success status and error messages (if any)
- All calculated metrics
When SDLC-aware analysis is enabled, outputs also include additive fields such as:
- `sdlc_mode`
- `effective_sdlc_mode`
- `analysis_mode`
- `analysis_since`
- `analysis_time_window_days`
- `analysis_sample_size`
- `analysis_sampling_strategy`
- `analysis_cache_enabled`
- `analysis_cache_hit`
- `commit_scope_total_commits`
- `commit_scope_analyzed_commits`
- `commit_scope_is_approximate`
- `ai_influence_score`
- `ai_weighted_base_score`
- `ai_temporal_adoption_prior`
- `ai_temporal_anomaly_weight`
- `ai_commit_pattern_score`
- `ai_temporal_anomaly_score_raw`
- `ai_temporal_anomaly_score`
- `ai_style_uniformity_score`
- `ai_metadata_signal_score`
- `ai_influence_rationale`
- `traffic_light`
- `traffic_light_score`
- `is_research_grade`
The `commit_scope_*` and `analysis_*` fields make scoped runs explicit, so downstream analysis can distinguish full-history metrics from fast approximations.
In JSONL output, AI-aware metrics are preserved under `metrics.ai_metrics`. In CSV output, nested AI-aware metrics are flattened into columns such as `ai_metrics_aidr`, `ai_metrics_cbf`, `ai_metrics_amr`, `ai_metrics_aich`, and `ai_metrics_aci`.
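The flattening from the JSONL shape to CSV columns can be sketched like this (an illustrative helper; the analyzer's actual column handling may differ):

```python
def flatten_ai_metrics(metrics: dict) -> dict:
    """Flatten a nested ai_metrics dict into CSV-style columns named
    ai_metrics_<key>, mirroring the documented layout."""
    flat = {k: v for k, v in metrics.items() if k != "ai_metrics"}
    for key, value in metrics.get("ai_metrics", {}).items():
        flat[f"ai_metrics_{key}"] = value
    return flat
```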
Note: for the `analyze` command, if `--out` is provided as a bare filename (no directory), output is written under `out/` by default (e.g., `--out results` → `out/results.csv`). If you specify a path (e.g., `--out data/results`), that path is used as given.
See `bigua_project_docs/metrics.md` for detailed metric definitions.
Datasets can be CSV or JSONL:

CSV:

```csv
url,ref,repo_id
https://github.com/org/repo,main,my-repo
```

JSONL:

```jsonl
{"url": "https://github.com/org/repo", "ref": "main", "repo_id": "my-repo"}
```

- The first run clones repositories (cached for future runs).
- Use `--max-workers` for parallel processing on multi-core systems.
- Large repos may take time; start with small datasets for testing.
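Both dataset formats can be read with the standard library alone; a minimal loader sketch (a hypothetical helper, not part of the CLI):

```python
import csv
import json

def load_dataset(path: str) -> list:
    """Load repository entries from a CSV or JSONL dataset file,
    following the documented input formats."""
    if path.endswith(".jsonl"):
        with open(path) as f:
            return [json.loads(line) for line in f if line.strip()]
    with open(path) as f:
        return list(csv.DictReader(f))
```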
- See CONTRIBUTING.md for contribution workflow and pull request guidance.
- See CODE_OF_CONDUCT.md for community expectations.
- See SECURITY.md for vulnerability reporting guidance.
- See CHANGELOG.md for notable release history.
This repository includes:

- A CodeQL static analysis workflow at `.github/workflows/codeql.yml`
- Dependabot updates at `.github/dependabot.yml`
- A local pre-release audit script at `scripts/security_audit.py`
Run the local audit before publishing:

```bash
python scripts/security_audit.py
```

Optional flags:

- Scan fewer commits in history: `python scripts/security_audit.py --history-commits 100`
- Skip the commit history scan: `python scripts/security_audit.py --no-history`
- Include a broad keyword scan (more false positives): `python scripts/security_audit.py --include-generic-keyword-scan`
If this research is useful for your work or organization, consider supporting its development:
