Bug fixes and improvements. by newmana · Pull Request #20 · saketkc/pySCTransform

newmana · 2026-02-02T05:21:21Z

Updated to support the following:

Fixed bugs with GLM and missing parameters for multiple samples (Fixes There is an error in this code, when I run SCTransform() #17).
Fixed theta parameter generation - now returns a list for all genes rather than a single value.
Improved performance of robust_scale_binned (25% faster).
Added tests to compare R outputs with Python.
Fix downsampling logic.
Better handling of sparse matrices.
Also supports pandas 2.x for (Fixes Pandas 2.x compatibility issues #19).
Gene deduplication.
Adding batch support - passing batch variables through and per batch smoothing.

Summary by Sourcery

Align variance-stabilizing transformation, theta estimation, and residual calculations with the R sctransform implementation while improving performance and API ergonomics.

Bug Fixes:

Correct theta maximum-likelihood estimation and return full mean vectors instead of scalars to match R behavior.
Fix downsampling logic to only subsample when requested counts are below dataset size and clamp requests above size.
Correct handling of sparse inputs for geometric mean, variance, and Pearson residuals to avoid dense-only assumptions and shape bugs.
Align GLM/GLMGP parameter handling and bandwidth selection with R outputs, including proper handling of theta and offsets.

Enhancements:

Speed up robust_scale_binned and related robust scaling by replacing DataFrame groupby logic with vectorized bin-wise operations.
Introduce a generic variance helper that supports both dense and sparse matrices and use it in gene-level statistics.
Refine batch-effect and latent-variable modeling in vst/SCTransform, including optional batch covariates and passing extra cell attributes.
Expose plotting utilities via the public package API and clean up imports, logging, and minor numerical guard logic across modules.

Build:

Move dependency and packaging configuration into setup.cfg, modernize Python version support to 3.10+, and declare runtime and dev dependencies explicitly.

Tests:

Add stepwise regression, regularization, and residual-comparison tests against R reference outputs, including PBMC3k data download utilities.
Add targeted tests for downsampling behavior, robust_scale_binned performance, theta_ml correctness, and sparse/dense geometric mean equivalence.

Summary by Sourcery

Align variance-stabilizing transformation, theta estimation, and residual calculations with the R sctransform implementation while improving performance, sparse support, and batch-aware modeling, and modernizing packaging and dependencies.

New Features:

Add batch-aware SCTransform/vst support with batch covariates in the model matrix and per-batch parameter smoothing.
Expose plotting utilities via the public package API for easier downstream visualization.
Provide scripts to generate and compare Python outputs against R sctransform references, including integrated and batch-reference workflows.

Bug Fixes:

Correct negative binomial theta maximum-likelihood estimation to return stable, R-consistent values and full mean vectors.
Fix downsampling logic to avoid subsampling when requested sizes exceed dataset dimensions and to clamp requests appropriately.
Resolve shape and type issues when handling sparse matrices in geometric mean, variance, and residual computations.
Ensure residual calculations correctly handle sparse inputs and clipping behavior, and match R sctransform statistics.

Enhancements:

Unify variance computation for dense and sparse inputs and optimize row geometric mean and robust scaling operations for better performance.
Refine model matrix construction, parameter fitting, and regularization (including bandwidth selection and outlier handling) to better match R sctransform behavior and support batch terms.
Improve GLM/GLMGP fitting routines, parameter passing, and internal logging/warning configuration for more robust numerical behavior and diagnostics.

Build:

Move dependency specification into setup.cfg, modernize supported Python versions to 3.10+, and declare core and dev dependencies explicitly.
Update setup.py metadata, console entry point definition, and package description to reflect the current project scope and Python support matrix.

Tests:

Add extensive stepwise tests for theta_ml, per-gene parameter estimation, regularization (with and without batch), and residual calculations against R reference outputs.
Introduce tests for batch model matrix structure, downsampling parameter logic, robust_scale_binned correctness and performance, and sparse/dense row geometric mean equivalence.
Add pytest configuration, shared test utilities, and fixtures for downloading and caching PBMC3k data and loading R-generated reference CSVs.

…- don't turn into dense.

…d all. And remove duplication.

sourcery-ai

Sorry @newmana, your pull request is larger than the review limit of 150000 diff characters

…iables.

This reverts commit c0a7f6c.

sourcery-ai

Hey - I've found 8 issues, and left some high level feedback:

In SCTransform, the line method = "theta_ml" unconditionally overwrites the user-supplied method argument (except for vst_flavor == "v2"), which makes the method parameter effectively useless; consider only overriding it when vst_flavor requires a specific method or when method is None.
In get_regularized_params, the list comprehension that builds batch_cell_idx recreates set(batch_cells_step1) on every iteration, which is unnecessarily expensive for large datasets; precompute the set once outside the comprehension and reuse it.

Prompt for AI Agents

Please address the comments from this code review:

## Overall Comments
- In SCTransform, the line `method = "theta_ml"` unconditionally overwrites the user-supplied `method` argument (except for `vst_flavor == "v2"`), which makes the `method` parameter effectively useless; consider only overriding it when `vst_flavor` requires a specific method or when `method` is None.
- In get_regularized_params, the list comprehension that builds `batch_cell_idx` recreates `set(batch_cells_step1)` on every iteration, which is unnecessarily expensive for large datasets; precompute the set once outside the comprehension and reuse it.

## Individual Comments

### Comment 1
<location> `pysctransform/pysctransform.py:54-58` </location>
<code_context>
+        return npy.var(X, axis=axis)
+

+def bw_silverman(x, bw_adjust=3):
+    """
+    Bandwidth selection using Silverman's rule of thumb.

-def bwSJ(genes_log10_gmean_step1, bw_adjust=3):
-    # See https://kdepy.readthedocs.io/en/latest/bandwidth.html
-    fit = FFTKDE(kernel="gaussian", bw="ISJ").fit(npy.asarray(genes_log10_gmean_step1))
-    _ = fit.evaluate()
-    bw = fit.bw * bw_adjust
-    return npy.array([bw], dtype=float)
+    This matches R's bw.SJ more closely than KDEpy's ISJ.
+    """
+    x = npy.asarray(x)
</code_context>

<issue_to_address>
**issue:** Docstring description is misleading relative to the actual Silverman bandwidth implementation.

The code correctly implements Silverman’s rule-of-thumb (std + IQR), but the docstring incorrectly claims it matches R’s `bw.SJ`. Since `bw.SJ` uses a different pilot/bias-reduction procedure, please either remove the `bw.SJ` comparison or clarify that this is Silverman’s rule-of-thumb, which may approximate `bw.SJ` empirically but is a different estimator.
</issue_to_address>

### Comment 2
<location> `pysctransform/pysctransform.py:561-570` </location>
<code_context>
+def get_downsampling_params(n_cells, n_genes, total_cells, total_genes):
</code_context>

<issue_to_address>
**issue:** Downsampling helper does not guard against non-positive n_cells/n_genes values.

If `n_cells` or `n_genes` are `<= 0`, they’re effectively clamped to 0 while still setting `downsample_* = True`, so later calls like `np.random.choice(..., size=n_cells, replace=False)` will run with empty selections. It would be safer to validate these inputs (e.g., require `None` or positive integers and raise a clear error, or clamp to a minimum of 1) to avoid this confusing behavior.
</issue_to_address>

### Comment 3
<location> `pysctransform/pysctransform.py:476-485` </location>
<code_context>
     return pearson_residuals


 def deviance_residual(y, mu, theta, weight=1):
-    theta = npy.tile(theta.reshape(-1, 1), y.shape[1])
-    L = npy.multiply((y + theta), npy.log((y + theta) / (mu + theta)))
-    log_mu = npy.log(mu)
-    log_y = npy.log(y.maximum(1).todense())
-    r = npy.multiply(y.todense(), log_y - log_mu)
-    r = 2 * weight * (r - L)
-    return npy.multiply(npy.sqrt(r), npy.sign(y - mu))
+    if sparse.issparse(y):
+        y = y.toarray()
+    theta = theta.reshape(-1, 1)
+    y_safe = npy.maximum(y, 1)
+
+    # Unit deviance: d_i = 2 * [y*log(y/μ) - (y+θ)*log((y+θ)/(μ+θ))]
+    unit_deviance = 2 * (
+            y * npy.log(y_safe / mu)
+            - (y + theta) * npy.log((y + theta) / (mu + theta))
+    )
+
+    return npy.sqrt(weight * unit_deviance) * npy.sign(y - mu)


</code_context>

<issue_to_address>
**issue:** Deviance residual computation can hit numerical issues when mu is zero or very small.

`np.log(y_safe / mu)` and `np.log((y + theta) / (mu + theta))` are evaluated without guarding against `mu == 0` or very small `mu`, which can yield `inf`/`nan` deviance values and corrupt downstream residuals. Please clip `mu` to a reasonable lower bound (e.g. `mu_clipped = np.clip(mu, np.finfo(float).tiny, None)`) and use that in the log/ratio terms, analogous to `y_safe`. A similar guard is needed in `pearson_residual`, where `variance` can become zero if both `mu` and `theta` are small.
</issue_to_address>

### Comment 4
<location> `pysctransform/fit.py:109-118` </location>
<code_context>
+def theta_ml(y, mu, limit=10, eps=1e-4):
</code_context>

<issue_to_address>
**suggestion:** New theta_ml implementation changes behavior; consider guarding edge cases and documenting differences from the previous version.

The new Newton-like `theta_ml` returns `inf` in several failure cases (non-finite/negative initial estimate, zero information, negative final estimate), which can materially change fitted thetas, especially when `var(y) ≈ mu` or with small counts. I suggest: (1) explicitly handling cases where `var(y) <= mu` before computing the initial estimate; and (2) documenting in the docstring when `theta_ml` returns `inf` and how callers should interpret that (e.g., as Poisson), so downstream behavior remains predictable and debuggable.

Suggested implementation:

```python
def theta_ml(y, mu, limit=10, eps=1e-4):
    """
    Maximum likelihood estimation of theta for negative binomial.
    Matches R's MASS::theta.ml.

    Parameters
    ----------
    y : array
        Count data
    mu : array
        Fitted mean values corresponding to ``y``.

    Notes
    -----
    This implementation returns ``np.inf`` (interpreted as a Poisson mean-variance
    relationship / no extra-Poisson dispersion) in several situations:

    * when the sample variance of ``y`` is less than or equal to the mean implied
      by ``mu`` (i.e. there is no empirical evidence for over-dispersion and the
      negative binomial effectively reduces to a Poisson model);
    * when the initial moment-based estimate of ``theta`` is non-finite or negative;
    * when the observed Fisher information is zero during the Newton iterations; or
    * when the final Newton-updated estimate would be negative.

    Callers should treat a return value of ``np.inf`` as indicating that a Poisson
    mean-variance relationship is adequate and that no reliable finite dispersion
    estimate could be obtained.

```

To fully implement the behavior you described, you should also:

1. Add an explicit guard before computing the initial moment-based estimate of ``theta`` inside ``theta_ml``:

```python
    # Before computing the initial estimate of theta, check for cases
    # where the variance does not exceed the mean implied by mu. In that
    # situation, there is no evidence for over-dispersion and the NB
    # reduces to Poisson, which we represent by theta = np.inf.
    var_y = np.var(y, ddof=1)
    mean_mu = np.mean(mu)

    if not np.isfinite(var_y) or not np.isfinite(mean_mu):
        return np.inf

    if var_y <= mean_mu:
        return np.inf
```

Place this block immediately after the docstring and before any existing code that computes the initial ``theta`` estimate.

2. Ensure that:

   * ``numpy`` is imported as ``np`` at the top of the module (if it is not already).
   * Any existing early-return paths that currently return ``float("inf")`` are either kept as-is or updated to return ``np.inf`` consistently with the docstring language.
   * The Newton-like update loop continues to return ``np.inf`` in the failure cases you mentioned (non-finite/negative initial estimate, zero information, negative final estimate), so the behavior matches the documentation you just added.
</issue_to_address>

### Comment 5
<location> `tests/test_robust_scale_binned.py:18-27` </location>
<code_context>
+    }
+
+
+class TestRobustScaleBinned:
+
+    def test_preserves_input_order(self):
+        y = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])
+        x = np.array([0.9, 0.1, 0.9, 0.1, 0.9, 0.1])
+        breaks = [0, 0.5, 1.0]
+
+        result = robust_scale_binned(y, x, breaks)
+
+        # Check that high-x values (indices 0, 2, 4) are scaled together
+        # and low-x values (indices 1, 3, 5) are scaled together
+        high_bin = result[[0, 2, 4]]
+        low_bin = result[[1, 3, 5]]
+
+        # Within each bin, relative ordering should be preserved
+        assert high_bin[0] < high_bin[1] < high_bin[2]
+        assert low_bin[0] < low_bin[1] < low_bin[2]
+
+    def test_output_length_matches_input(self, sample_data):
+        result = robust_scale_binned(
+            sample_data['y'],
+            sample_data['x'],
+            sample_data['breaks'],
+        )
+        assert len(result) == len(sample_data['y'])
+
+    def test_no_nans_in_output(self, sample_data):
</code_context>

<issue_to_address>
**suggestion (testing):** Extend `robust_scale_binned` tests to cover NaNs and empty bins

Current tests only cover inputs where all `x` fall into populated bins and contain no NaNs. Please add:

- A case where some `x` are NaN to capture and lock in the current behavior (entries with `code == -1` remain NaN / unchanged).
- A case where the specified breaks create empty bins (e.g. all `x` in [0, 0.5] but breaks span [0, 1]) to confirm the function returns finite values for populated bins and does not raise.

These will exercise the `np.unique(bin_codes)` loop and protect against regressions in binning behavior.
</issue_to_address>

### Comment 6
<location> `tests/test_batch_clamping.py:4-8` </location>
<code_context>

 # Distribution / packaging
 .Python
</code_context>

<issue_to_address>
**nitpick (testing):** Fix test class name typo and consider parametrizing the batch clamping scenarios

`TestBatchCamping` appears to be a typo for `TestBatchClamping`, which makes the test intent less clear in reports; renaming would help. Since the three tests only differ in which indices are zeroed or amplified while sharing `_make_batch_inputs`, you could combine them into a single parametrized test (e.g. with `@pytest.mark.parametrize`) to cut duplication and explicitly document each scenario of zero/high-expression gene combinations.
</issue_to_address>

### Comment 7
<location> `README.rst:89` </location>
<code_context>
+Development
+===========
+
+Code style chec:
+
+.. code-block:: bash
</code_context>

<issue_to_address>
**issue (typo):** Fix typo in "Code style chec:" heading.

```suggestion
Code style check:
```
</issue_to_address>

### Comment 8
<location> `HISTORY.rst:11-12` </location>
<code_context>
 * Fix for "TypeError: unsupported operand type(s) for -: 'IntVector' and 'int'" `#8 <https://github.com/saketkc/pySCTransform/pull/8>`_
+* Fix broken syntax in GLM design matrix code generation - where each batch can have its own intercept and slope.
+* Fix fit.py returning a single mu value - generating incorrect theta estimates,
+* Fixed theta_ml to match scTransform implementation - 9 iterations by default,
+* Use Silverman's bandwidth selection to be more like R scTransform.
+
+Features:
</code_context>

<issue_to_address>
**suggestion (typo):** Normalize capitalization of "scTransform" to match the R package name.

Here it’s referring to the R package, which is named “SCTransform.” Please update these references to use “SCTransform” for consistency with both the R package and the project name “pySCTransform.”

Suggested implementation:

```
* Fixed theta_ml to match SCTransform implementation - 9 iterations by default,

```

```
* Use Silverman's bandwidth selection to be more like R SCTransform.

```
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨

_{Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.}

* Compute od_factor before outlier detection so it gets screened along with other columns * Use batch-specific gene means

newmana added 27 commits January 20, 2026 09:10

Update.

8e455f2

Speed up certain methods.

e4628d4

fix batch var construction.

1b84d03

Fix batch_var.

b0dcfb3

Simpler formulae.

20f764b

Improve forumale.

c25c09c

Have just one method for calculating mean - handle both sparse and not.

90cc8ed

Add check for sparse.

a8ebc5d

Handle both sparse and dense.

3597c76

Adding benchmark and tests for robust_scale_binned

55284f1

Add test for row_gmean and refactor to improve sparse implementation …

ada9ca2

…- don't turn into dense.

Ignore mean return value.

149f358

Fix flake8 errors.

1f74d6b

Remove outputs.

c0a7f6c

Make fit 100% like R's. Add tests for theta and step 1.

d4cc872

Replace bwSJ with more correct version. Add csv generation for fit an…

673ba7f

…d all. And remove duplication.

Check step 3.

25972c6

Remove KDEpy.

f6e6f75

Put back KDEpy.

fae5ed4

Fix.

f885320

Fix missing export.

dcf1a6c

Update history, fix mistake in doco.

e15dbed

Update with more excludes - but put back data.

87c82f9

Add data files generated by R.

2c70817

Finish documentation of changes.

8eb98ae

Remove warning and check impelementation against R more closely.

e39e740

.

c38159c

sourcery-ai Bot reviewed Feb 2, 2026

View reviewed changes

newmana added 2 commits February 2, 2026 15:47

Fix down sampling of number of genes and cells and cleanup unused var…

5932e99

…iables.

Revert "Remove outputs."

ced695b

This reverts commit c0a7f6c.

newmana added 8 commits February 2, 2026 19:56

Better docstring.

b7a8ac2

Push threads parameter into get_model_params_allgene_glmgp

50b743b

Mark tests with @pytest.mark.network

59209ee

Reformat.

389b682

Use better index syntax.

411a489

More cleanup - fix indexing.

f1febe8

Remove extra step.

1c23209

Restate deviance_residual more clsoely with formulae.

67b4256

newmana marked this pull request as draft February 4, 2026 01:24

newmana added 7 commits February 4, 2026 11:55

Fix formula construction and add tests.

badba1e

.

5361875

Fix batch correction.

5ae214e

.

062bfee

Add batch mode output from R SCTransform.

1c6265c

Cleanup tests.

7091018

For batch fix clamping of values and regularization.

eda4a53

newmana marked this pull request as ready for review February 11, 2026 04:28

sourcery-ai Bot reviewed Feb 11, 2026

View reviewed changes

newmana added 12 commits February 11, 2026 14:51

Fix.

196f337

Fix.

ef362d8

Fix.

6d3436a

Add test to check for NaN values and empty bin support.

02f7a62

Don't create the set every time in the list comprehension.

8c2c43e

Missed some pandas 2 support - fix and instructions on testing.

a24348c

Remove groupby.apply.

73cb070

Oops.

9d1c494

Fix import.

829850b

Fix not getting batch variable specific mean values.

a971b07

* Return inf for degenerate near-zero theta

943d2e8

* Compute od_factor before outlier detection so it gets screened along with other columns * Use batch-specific gene means

Add bw_adjust to parameters.

aa4ef5f

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug fixes and improvements.#20

Bug fixes and improvements.#20
newmana wants to merge 59 commits into
saketkc:developfrom
newmana:develop

newmana commented Feb 2, 2026 •

edited by sourcery-ai Bot

Loading

Uh oh!

sourcery-ai Bot left a comment

Uh oh!

sourcery-ai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

newmana commented Feb 2, 2026 • edited by sourcery-ai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by Sourcery

Summary by Sourcery

Uh oh!

sourcery-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

sourcery-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

newmana commented Feb 2, 2026 •

edited by sourcery-ai Bot

Loading