Bug fixes and improvements.#20
Open
newmana wants to merge 59 commits into
Open
Conversation
…- don't turn into dense.
…d all. And remove duplication.
There was a problem hiding this comment.
Sorry @newmana, your pull request is larger than the review limit of 150000 diff characters
This reverts commit c0a7f6c.
There was a problem hiding this comment.
Hey - I've found 8 issues, and left some high level feedback:
- In SCTransform, the line
method = "theta_ml"unconditionally overwrites the user-suppliedmethodargument (except forvst_flavor == "v2"), which makes themethodparameter effectively useless; consider only overriding it whenvst_flavorrequires a specific method or whenmethodis None. - In get_regularized_params, the list comprehension that builds
batch_cell_idxrecreatesset(batch_cells_step1)on every iteration, which is unnecessarily expensive for large datasets; precompute the set once outside the comprehension and reuse it.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In SCTransform, the line `method = "theta_ml"` unconditionally overwrites the user-supplied `method` argument (except for `vst_flavor == "v2"`), which makes the `method` parameter effectively useless; consider only overriding it when `vst_flavor` requires a specific method or when `method` is None.
- In get_regularized_params, the list comprehension that builds `batch_cell_idx` recreates `set(batch_cells_step1)` on every iteration, which is unnecessarily expensive for large datasets; precompute the set once outside the comprehension and reuse it.
## Individual Comments
### Comment 1
<location> `pysctransform/pysctransform.py:54-58` </location>
<code_context>
+ return npy.var(X, axis=axis)
+
+def bw_silverman(x, bw_adjust=3):
+ """
+ Bandwidth selection using Silverman's rule of thumb.
-def bwSJ(genes_log10_gmean_step1, bw_adjust=3):
- # See https://kdepy.readthedocs.io/en/latest/bandwidth.html
- fit = FFTKDE(kernel="gaussian", bw="ISJ").fit(npy.asarray(genes_log10_gmean_step1))
- _ = fit.evaluate()
- bw = fit.bw * bw_adjust
- return npy.array([bw], dtype=float)
+ This matches R's bw.SJ more closely than KDEpy's ISJ.
+ """
+ x = npy.asarray(x)
</code_context>
<issue_to_address>
**issue:** Docstring description is misleading relative to the actual Silverman bandwidth implementation.
The code correctly implements Silverman’s rule-of-thumb (std + IQR), but the docstring incorrectly claims it matches R’s `bw.SJ`. Since `bw.SJ` uses a different pilot/bias-reduction procedure, please either remove the `bw.SJ` comparison or clarify that this is Silverman’s rule-of-thumb, which may approximate `bw.SJ` empirically but is a different estimator.
</issue_to_address>
### Comment 2
<location> `pysctransform/pysctransform.py:561-570` </location>
<code_context>
+def get_downsampling_params(n_cells, n_genes, total_cells, total_genes):
</code_context>
<issue_to_address>
**issue:** Downsampling helper does not guard against non-positive n_cells/n_genes values.
If `n_cells` or `n_genes` are `<= 0`, they’re effectively clamped to 0 while still setting `downsample_* = True`, so later calls like `np.random.choice(..., size=n_cells, replace=False)` will run with empty selections. It would be safer to validate these inputs (e.g., require `None` or positive integers and raise a clear error, or clamp to a minimum of 1) to avoid this confusing behavior.
</issue_to_address>
### Comment 3
<location> `pysctransform/pysctransform.py:476-485` </location>
<code_context>
return pearson_residuals
def deviance_residual(y, mu, theta, weight=1):
- theta = npy.tile(theta.reshape(-1, 1), y.shape[1])
- L = npy.multiply((y + theta), npy.log((y + theta) / (mu + theta)))
- log_mu = npy.log(mu)
- log_y = npy.log(y.maximum(1).todense())
- r = npy.multiply(y.todense(), log_y - log_mu)
- r = 2 * weight * (r - L)
- return npy.multiply(npy.sqrt(r), npy.sign(y - mu))
+ if sparse.issparse(y):
+ y = y.toarray()
+ theta = theta.reshape(-1, 1)
+ y_safe = npy.maximum(y, 1)
+
+ # Unit deviance: d_i = 2 * [y*log(y/μ) - (y+θ)*log((y+θ)/(μ+θ))]
+ unit_deviance = 2 * (
+ y * npy.log(y_safe / mu)
+ - (y + theta) * npy.log((y + theta) / (mu + theta))
+ )
+
+ return npy.sqrt(weight * unit_deviance) * npy.sign(y - mu)
</code_context>
<issue_to_address>
**issue:** Deviance residual computation can hit numerical issues when mu is zero or very small.
`np.log(y_safe / mu)` and `np.log((y + theta) / (mu + theta))` are evaluated without guarding against `mu == 0` or very small `mu`, which can yield `inf`/`nan` deviance values and corrupt downstream residuals. Please clip `mu` to a reasonable lower bound (e.g. `mu_clipped = np.clip(mu, np.finfo(float).tiny, None)`) and use that in the log/ratio terms, analogous to `y_safe`. A similar guard is needed in `pearson_residual`, where `variance` can become zero if both `mu` and `theta` are small.
</issue_to_address>
### Comment 4
<location> `pysctransform/fit.py:109-118` </location>
<code_context>
+def theta_ml(y, mu, limit=10, eps=1e-4):
</code_context>
<issue_to_address>
**suggestion:** New theta_ml implementation changes behavior; consider guarding edge cases and documenting differences from the previous version.
The new Newton-like `theta_ml` returns `inf` in several failure cases (non-finite/negative initial estimate, zero information, negative final estimate), which can materially change fitted thetas, especially when `var(y) ≈ mu` or with small counts. I suggest: (1) explicitly handling cases where `var(y) <= mu` before computing the initial estimate; and (2) documenting in the docstring when `theta_ml` returns `inf` and how callers should interpret that (e.g., as Poisson), so downstream behavior remains predictable and debuggable.
Suggested implementation:
```python
def theta_ml(y, mu, limit=10, eps=1e-4):
"""
Maximum likelihood estimation of theta for negative binomial.
Matches R's MASS::theta.ml.
Parameters
----------
y : array
Count data
mu : array
Fitted mean values corresponding to ``y``.
Notes
-----
This implementation returns ``np.inf`` (interpreted as a Poisson mean-variance
relationship / no extra-Poisson dispersion) in several situations:
* when the sample variance of ``y`` is less than or equal to the mean implied
by ``mu`` (i.e. there is no empirical evidence for over-dispersion and the
negative binomial effectively reduces to a Poisson model);
* when the initial moment-based estimate of ``theta`` is non-finite or negative;
* when the observed Fisher information is zero during the Newton iterations; or
* when the final Newton-updated estimate would be negative.
Callers should treat a return value of ``np.inf`` as indicating that a Poisson
mean-variance relationship is adequate and that no reliable finite dispersion
estimate could be obtained.
```
To fully implement the behavior you described, you should also:
1. Add an explicit guard before computing the initial moment-based estimate of ``theta`` inside ``theta_ml``:
```python
# Before computing the initial estimate of theta, check for cases
# where the variance does not exceed the mean implied by mu. In that
# situation, there is no evidence for over-dispersion and the NB
# reduces to Poisson, which we represent by theta = np.inf.
var_y = np.var(y, ddof=1)
mean_mu = np.mean(mu)
if not np.isfinite(var_y) or not np.isfinite(mean_mu):
return np.inf
if var_y <= mean_mu:
return np.inf
```
Place this block immediately after the docstring and before any existing code that computes the initial ``theta`` estimate.
2. Ensure that:
* ``numpy`` is imported as ``np`` at the top of the module (if it is not already).
* Any existing early-return paths that currently return ``float("inf")`` are either kept as-is or updated to return ``np.inf`` consistently with the docstring language.
* The Newton-like update loop continues to return ``np.inf`` in the failure cases you mentioned (non-finite/negative initial estimate, zero information, negative final estimate), so the behavior matches the documentation you just added.
</issue_to_address>
### Comment 5
<location> `tests/test_robust_scale_binned.py:18-27` </location>
<code_context>
+ }
+
+
+class TestRobustScaleBinned:
+
+ def test_preserves_input_order(self):
+ y = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])
+ x = np.array([0.9, 0.1, 0.9, 0.1, 0.9, 0.1])
+ breaks = [0, 0.5, 1.0]
+
+ result = robust_scale_binned(y, x, breaks)
+
+ # Check that high-x values (indices 0, 2, 4) are scaled together
+ # and low-x values (indices 1, 3, 5) are scaled together
+ high_bin = result[[0, 2, 4]]
+ low_bin = result[[1, 3, 5]]
+
+ # Within each bin, relative ordering should be preserved
+ assert high_bin[0] < high_bin[1] < high_bin[2]
+ assert low_bin[0] < low_bin[1] < low_bin[2]
+
+ def test_output_length_matches_input(self, sample_data):
+ result = robust_scale_binned(
+ sample_data['y'],
+ sample_data['x'],
+ sample_data['breaks'],
+ )
+ assert len(result) == len(sample_data['y'])
+
+ def test_no_nans_in_output(self, sample_data):
</code_context>
<issue_to_address>
**suggestion (testing):** Extend `robust_scale_binned` tests to cover NaNs and empty bins
Current tests only cover inputs where all `x` fall into populated bins and contain no NaNs. Please add:
- A case where some `x` are NaN to capture and lock in the current behavior (entries with `code == -1` remain NaN / unchanged).
- A case where the specified breaks create empty bins (e.g. all `x` in [0, 0.5] but breaks span [0, 1]) to confirm the function returns finite values for populated bins and does not raise.
These will exercise the `np.unique(bin_codes)` loop and protect against regressions in binning behavior.
</issue_to_address>
### Comment 6
<location> `tests/test_batch_clamping.py:4-8` </location>
<code_context>
# Distribution / packaging
.Python
</code_context>
<issue_to_address>
**nitpick (testing):** Fix test class name typo and consider parametrizing the batch clamping scenarios
`TestBatchCamping` appears to be a typo for `TestBatchClamping`, which makes the test intent less clear in reports; renaming would help. Since the three tests only differ in which indices are zeroed or amplified while sharing `_make_batch_inputs`, you could combine them into a single parametrized test (e.g. with `@pytest.mark.parametrize`) to cut duplication and explicitly document each scenario of zero/high-expression gene combinations.
</issue_to_address>
### Comment 7
<location> `README.rst:89` </location>
<code_context>
+Development
+===========
+
+Code style chec:
+
+.. code-block:: bash
</code_context>
<issue_to_address>
**issue (typo):** Fix typo in "Code style chec:" heading.
```suggestion
Code style check:
```
</issue_to_address>
### Comment 8
<location> `HISTORY.rst:11-12` </location>
<code_context>
* Fix for "TypeError: unsupported operand type(s) for -: 'IntVector' and 'int'" `#8 <https://github.com/saketkc/pySCTransform/pull/8>`_
+* Fix broken syntax in GLM design matrix code generation - where each batch can have its own intercept and slope.
+* Fix fit.py returning a single mu value - generating incorrect theta estimates,
+* Fixed theta_ml to match scTransform implementation - 9 iterations by default,
+* Use Silverman's bandwidth selection to be more like R scTransform.
+
+Features:
</code_context>
<issue_to_address>
**suggestion (typo):** Normalize capitalization of "scTransform" to match the R package name.
Here it’s referring to the R package, which is named “SCTransform.” Please update these references to use “SCTransform” for consistency with both the R package and the project name “pySCTransform.”
Suggested implementation:
```
* Fixed theta_ml to match SCTransform implementation - 9 iterations by default,
```
```
* Use Silverman's bandwidth selection to be more like R SCTransform.
```
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
* Compute od_factor before outlier detection so it gets screened along with other columns * Use batch-specific gene means
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Updated to support the following:
Summary by Sourcery
Align variance-stabilizing transformation, theta estimation, and residual calculations with the R sctransform implementation while improving performance and API ergonomics.
Bug Fixes:
Enhancements:
Build:
Tests:
Summary by Sourcery
Align variance-stabilizing transformation, theta estimation, and residual calculations with the R sctransform implementation while improving performance, sparse support, and batch-aware modeling, and modernizing packaging and dependencies.
New Features:
Bug Fixes:
Enhancements:
Build:
Tests: