Skip to content

Bug fixes and improvements.#20

Open
newmana wants to merge 59 commits into
saketkc:developfrom
newmana:develop
Open

Bug fixes and improvements.#20
newmana wants to merge 59 commits into
saketkc:developfrom
newmana:develop

Conversation

@newmana

@newmana newmana commented Feb 2, 2026

Copy link
Copy Markdown

Updated to support the following:

  • Fixed bugs with GLM and missing parameters for multiple samples (Fixes There is an error in this code, when I run SCTransform() #17).
  • Fixed theta parameter generation - now returns a list for all genes rather than a single value.
  • Improved performance of robust_scale_binned (25% faster).
  • Added tests to compare R outputs with Python.
  • Fix downsampling logic.
  • Better handling of sparse matrices.
  • Also supports pandas 2.x for (Fixes Pandas 2.x compatibility issues #19).
  • Gene deduplication.
  • Adding batch support - passing batch variables through and per batch smoothing.

Summary by Sourcery

Align variance-stabilizing transformation, theta estimation, and residual calculations with the R sctransform implementation while improving performance and API ergonomics.

Bug Fixes:

  • Correct theta maximum-likelihood estimation and return full mean vectors instead of scalars to match R behavior.
  • Fix downsampling logic to only subsample when requested counts are below dataset size and clamp requests above size.
  • Correct handling of sparse inputs for geometric mean, variance, and Pearson residuals to avoid dense-only assumptions and shape bugs.
  • Align GLM/GLMGP parameter handling and bandwidth selection with R outputs, including proper handling of theta and offsets.

Enhancements:

  • Speed up robust_scale_binned and related robust scaling by replacing DataFrame groupby logic with vectorized bin-wise operations.
  • Introduce a generic variance helper that supports both dense and sparse matrices and use it in gene-level statistics.
  • Refine batch-effect and latent-variable modeling in vst/SCTransform, including optional batch covariates and passing extra cell attributes.
  • Expose plotting utilities via the public package API and clean up imports, logging, and minor numerical guard logic across modules.

Build:

  • Move dependency and packaging configuration into setup.cfg, modernize Python version support to 3.10+, and declare runtime and dev dependencies explicitly.

Tests:

  • Add stepwise regression, regularization, and residual-comparison tests against R reference outputs, including PBMC3k data download utilities.
  • Add targeted tests for downsampling behavior, robust_scale_binned performance, theta_ml correctness, and sparse/dense geometric mean equivalence.

Summary by Sourcery

Align variance-stabilizing transformation, theta estimation, and residual calculations with the R sctransform implementation while improving performance, sparse support, and batch-aware modeling, and modernizing packaging and dependencies.

New Features:

  • Add batch-aware SCTransform/vst support with batch covariates in the model matrix and per-batch parameter smoothing.
  • Expose plotting utilities via the public package API for easier downstream visualization.
  • Provide scripts to generate and compare Python outputs against R sctransform references, including integrated and batch-reference workflows.

Bug Fixes:

  • Correct negative binomial theta maximum-likelihood estimation to return stable, R-consistent values and full mean vectors.
  • Fix downsampling logic to avoid subsampling when requested sizes exceed dataset dimensions and to clamp requests appropriately.
  • Resolve shape and type issues when handling sparse matrices in geometric mean, variance, and residual computations.
  • Ensure residual calculations correctly handle sparse inputs and clipping behavior, and match R sctransform statistics.

Enhancements:

  • Unify variance computation for dense and sparse inputs and optimize row geometric mean and robust scaling operations for better performance.
  • Refine model matrix construction, parameter fitting, and regularization (including bandwidth selection and outlier handling) to better match R sctransform behavior and support batch terms.
  • Improve GLM/GLMGP fitting routines, parameter passing, and internal logging/warning configuration for more robust numerical behavior and diagnostics.

Build:

  • Move dependency specification into setup.cfg, modernize supported Python versions to 3.10+, and declare core and dev dependencies explicitly.
  • Update setup.py metadata, console entry point definition, and package description to reflect the current project scope and Python support matrix.

Tests:

  • Add extensive stepwise tests for theta_ml, per-gene parameter estimation, regularization (with and without batch), and residual calculations against R reference outputs.
  • Introduce tests for batch model matrix structure, downsampling parameter logic, robust_scale_binned correctness and performance, and sparse/dense row geometric mean equivalence.
  • Add pytest configuration, shared test utilities, and fixtures for downloading and caching PBMC3k data and loading R-generated reference CSVs.

@sourcery-ai sourcery-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry @newmana, your pull request is larger than the review limit of 150000 diff characters

@newmana newmana marked this pull request as draft February 4, 2026 01:24
@newmana newmana marked this pull request as ready for review February 11, 2026 04:28

@sourcery-ai sourcery-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 8 issues, and left some high level feedback:

  • In SCTransform, the line method = "theta_ml" unconditionally overwrites the user-supplied method argument (except for vst_flavor == "v2"), which makes the method parameter effectively useless; consider only overriding it when vst_flavor requires a specific method or when method is None.
  • In get_regularized_params, the list comprehension that builds batch_cell_idx recreates set(batch_cells_step1) on every iteration, which is unnecessarily expensive for large datasets; precompute the set once outside the comprehension and reuse it.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In SCTransform, the line `method = "theta_ml"` unconditionally overwrites the user-supplied `method` argument (except for `vst_flavor == "v2"`), which makes the `method` parameter effectively useless; consider only overriding it when `vst_flavor` requires a specific method or when `method` is None.
- In get_regularized_params, the list comprehension that builds `batch_cell_idx` recreates `set(batch_cells_step1)` on every iteration, which is unnecessarily expensive for large datasets; precompute the set once outside the comprehension and reuse it.

## Individual Comments

### Comment 1
<location> `pysctransform/pysctransform.py:54-58` </location>
<code_context>
+        return npy.var(X, axis=axis)
+

+def bw_silverman(x, bw_adjust=3):
+    """
+    Bandwidth selection using Silverman's rule of thumb.

-def bwSJ(genes_log10_gmean_step1, bw_adjust=3):
-    # See https://kdepy.readthedocs.io/en/latest/bandwidth.html
-    fit = FFTKDE(kernel="gaussian", bw="ISJ").fit(npy.asarray(genes_log10_gmean_step1))
-    _ = fit.evaluate()
-    bw = fit.bw * bw_adjust
-    return npy.array([bw], dtype=float)
+    This matches R's bw.SJ more closely than KDEpy's ISJ.
+    """
+    x = npy.asarray(x)
</code_context>

<issue_to_address>
**issue:** Docstring description is misleading relative to the actual Silverman bandwidth implementation.

The code correctly implements Silverman’s rule-of-thumb (std + IQR), but the docstring incorrectly claims it matches R’s `bw.SJ`. Since `bw.SJ` uses a different pilot/bias-reduction procedure, please either remove the `bw.SJ` comparison or clarify that this is Silverman’s rule-of-thumb, which may approximate `bw.SJ` empirically but is a different estimator.
</issue_to_address>

### Comment 2
<location> `pysctransform/pysctransform.py:561-570` </location>
<code_context>
+def get_downsampling_params(n_cells, n_genes, total_cells, total_genes):
</code_context>

<issue_to_address>
**issue:** Downsampling helper does not guard against non-positive n_cells/n_genes values.

If `n_cells` or `n_genes` are `<= 0`, they’re effectively clamped to 0 while still setting `downsample_* = True`, so later calls like `np.random.choice(..., size=n_cells, replace=False)` will run with empty selections. It would be safer to validate these inputs (e.g., require `None` or positive integers and raise a clear error, or clamp to a minimum of 1) to avoid this confusing behavior.
</issue_to_address>

### Comment 3
<location> `pysctransform/pysctransform.py:476-485` </location>
<code_context>
     return pearson_residuals


 def deviance_residual(y, mu, theta, weight=1):
-    theta = npy.tile(theta.reshape(-1, 1), y.shape[1])
-    L = npy.multiply((y + theta), npy.log((y + theta) / (mu + theta)))
-    log_mu = npy.log(mu)
-    log_y = npy.log(y.maximum(1).todense())
-    r = npy.multiply(y.todense(), log_y - log_mu)
-    r = 2 * weight * (r - L)
-    return npy.multiply(npy.sqrt(r), npy.sign(y - mu))
+    if sparse.issparse(y):
+        y = y.toarray()
+    theta = theta.reshape(-1, 1)
+    y_safe = npy.maximum(y, 1)
+
+    # Unit deviance: d_i = 2 * [y*log(y/μ) - (y+θ)*log((y+θ)/(μ+θ))]
+    unit_deviance = 2 * (
+            y * npy.log(y_safe / mu)
+            - (y + theta) * npy.log((y + theta) / (mu + theta))
+    )
+
+    return npy.sqrt(weight * unit_deviance) * npy.sign(y - mu)


</code_context>

<issue_to_address>
**issue:** Deviance residual computation can hit numerical issues when mu is zero or very small.

`np.log(y_safe / mu)` and `np.log((y + theta) / (mu + theta))` are evaluated without guarding against `mu == 0` or very small `mu`, which can yield `inf`/`nan` deviance values and corrupt downstream residuals. Please clip `mu` to a reasonable lower bound (e.g. `mu_clipped = np.clip(mu, np.finfo(float).tiny, None)`) and use that in the log/ratio terms, analogous to `y_safe`. A similar guard is needed in `pearson_residual`, where `variance` can become zero if both `mu` and `theta` are small.
</issue_to_address>

### Comment 4
<location> `pysctransform/fit.py:109-118` </location>
<code_context>
+def theta_ml(y, mu, limit=10, eps=1e-4):
</code_context>

<issue_to_address>
**suggestion:** New theta_ml implementation changes behavior; consider guarding edge cases and documenting differences from the previous version.

The new Newton-like `theta_ml` returns `inf` in several failure cases (non-finite/negative initial estimate, zero information, negative final estimate), which can materially change fitted thetas, especially when `var(y) ≈ mu` or with small counts. I suggest: (1) explicitly handling cases where `var(y) <= mu` before computing the initial estimate; and (2) documenting in the docstring when `theta_ml` returns `inf` and how callers should interpret that (e.g., as Poisson), so downstream behavior remains predictable and debuggable.

Suggested implementation:

```python
def theta_ml(y, mu, limit=10, eps=1e-4):
    """
    Maximum likelihood estimation of theta for negative binomial.
    Matches R's MASS::theta.ml.

    Parameters
    ----------
    y : array
        Count data
    mu : array
        Fitted mean values corresponding to ``y``.

    Notes
    -----
    This implementation returns ``np.inf`` (interpreted as a Poisson mean-variance
    relationship / no extra-Poisson dispersion) in several situations:

    * when the sample variance of ``y`` is less than or equal to the mean implied
      by ``mu`` (i.e. there is no empirical evidence for over-dispersion and the
      negative binomial effectively reduces to a Poisson model);
    * when the initial moment-based estimate of ``theta`` is non-finite or negative;
    * when the observed Fisher information is zero during the Newton iterations; or
    * when the final Newton-updated estimate would be negative.

    Callers should treat a return value of ``np.inf`` as indicating that a Poisson
    mean-variance relationship is adequate and that no reliable finite dispersion
    estimate could be obtained.

```

To fully implement the behavior you described, you should also:

1. Add an explicit guard before computing the initial moment-based estimate of ``theta`` inside ``theta_ml``:

```python
    # Before computing the initial estimate of theta, check for cases
    # where the variance does not exceed the mean implied by mu. In that
    # situation, there is no evidence for over-dispersion and the NB
    # reduces to Poisson, which we represent by theta = np.inf.
    var_y = np.var(y, ddof=1)
    mean_mu = np.mean(mu)

    if not np.isfinite(var_y) or not np.isfinite(mean_mu):
        return np.inf

    if var_y <= mean_mu:
        return np.inf
```

Place this block immediately after the docstring and before any existing code that computes the initial ``theta`` estimate.

2. Ensure that:

   * ``numpy`` is imported as ``np`` at the top of the module (if it is not already).
   * Any existing early-return paths that currently return ``float("inf")`` are either kept as-is or updated to return ``np.inf`` consistently with the docstring language.
   * The Newton-like update loop continues to return ``np.inf`` in the failure cases you mentioned (non-finite/negative initial estimate, zero information, negative final estimate), so the behavior matches the documentation you just added.
</issue_to_address>

### Comment 5
<location> `tests/test_robust_scale_binned.py:18-27` </location>
<code_context>
+    }
+
+
+class TestRobustScaleBinned:
+
+    def test_preserves_input_order(self):
+        y = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])
+        x = np.array([0.9, 0.1, 0.9, 0.1, 0.9, 0.1])
+        breaks = [0, 0.5, 1.0]
+
+        result = robust_scale_binned(y, x, breaks)
+
+        # Check that high-x values (indices 0, 2, 4) are scaled together
+        # and low-x values (indices 1, 3, 5) are scaled together
+        high_bin = result[[0, 2, 4]]
+        low_bin = result[[1, 3, 5]]
+
+        # Within each bin, relative ordering should be preserved
+        assert high_bin[0] < high_bin[1] < high_bin[2]
+        assert low_bin[0] < low_bin[1] < low_bin[2]
+
+    def test_output_length_matches_input(self, sample_data):
+        result = robust_scale_binned(
+            sample_data['y'],
+            sample_data['x'],
+            sample_data['breaks'],
+        )
+        assert len(result) == len(sample_data['y'])
+
+    def test_no_nans_in_output(self, sample_data):
</code_context>

<issue_to_address>
**suggestion (testing):** Extend `robust_scale_binned` tests to cover NaNs and empty bins

Current tests only cover inputs where all `x` fall into populated bins and contain no NaNs. Please add:

- A case where some `x` are NaN to capture and lock in the current behavior (entries with `code == -1` remain NaN / unchanged).
- A case where the specified breaks create empty bins (e.g. all `x` in [0, 0.5] but breaks span [0, 1]) to confirm the function returns finite values for populated bins and does not raise.

These will exercise the `np.unique(bin_codes)` loop and protect against regressions in binning behavior.
</issue_to_address>

### Comment 6
<location> `tests/test_batch_clamping.py:4-8` </location>
<code_context>

 # Distribution / packaging
 .Python
</code_context>

<issue_to_address>
**nitpick (testing):** Fix test class name typo and consider parametrizing the batch clamping scenarios

`TestBatchCamping` appears to be a typo for `TestBatchClamping`, which makes the test intent less clear in reports; renaming would help. Since the three tests only differ in which indices are zeroed or amplified while sharing `_make_batch_inputs`, you could combine them into a single parametrized test (e.g. with `@pytest.mark.parametrize`) to cut duplication and explicitly document each scenario of zero/high-expression gene combinations.
</issue_to_address>

### Comment 7
<location> `README.rst:89` </location>
<code_context>
+Development
+===========
+
+Code style chec:
+
+.. code-block:: bash
</code_context>

<issue_to_address>
**issue (typo):** Fix typo in "Code style chec:" heading.

```suggestion
Code style check:
```
</issue_to_address>

### Comment 8
<location> `HISTORY.rst:11-12` </location>
<code_context>
 * Fix for "TypeError: unsupported operand type(s) for -: 'IntVector' and 'int'" `#8 <https://github.com/saketkc/pySCTransform/pull/8>`_
+* Fix broken syntax in GLM design matrix code generation - where each batch can have its own intercept and slope.
+* Fix fit.py returning a single mu value - generating incorrect theta estimates,
+* Fixed theta_ml to match scTransform implementation - 9 iterations by default,
+* Use Silverman's bandwidth selection to be more like R scTransform.
+
+Features:
</code_context>

<issue_to_address>
**suggestion (typo):** Normalize capitalization of "scTransform" to match the R package name.

Here it’s referring to the R package, which is named “SCTransform.” Please update these references to use “SCTransform” for consistency with both the R package and the project name “pySCTransform.”

Suggested implementation:

```
* Fixed theta_ml to match SCTransform implementation - 9 iterations by default,

```

```
* Use Silverman's bandwidth selection to be more like R SCTransform.

```
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment thread pysctransform/pysctransform.py
Comment thread pysctransform/pysctransform.py
Comment thread pysctransform/pysctransform.py
Comment thread pysctransform/fit.py
Comment thread tests/test_robust_scale_binned.py
Comment thread tests/test_batch_clamping.py Outdated
Comment thread README.rst Outdated
Comment thread HISTORY.rst Outdated
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Pandas 2.x compatibility issues There is an error in this code, when I run SCTransform()

1 participant