
Some suggestions about OV #48

Open
lisch7 opened this issue Jan 10, 2024 · 10 comments
lisch7 commented Jan 10, 2024

Hi,

Firstly, I'd like to commend the OV project for its contributions to scRNA analysis. I have a few suggestions that could potentially enhance its utility:

  1. Flexibility in ov.pp.preprocess: This function integrates several key processing steps. However, some steps like robust gene identification and gene filtering are mandatory. It might be beneficial to offer more control here. For instance, adding a parameter such as robust_gene=True, threshold=0.05 could provide users with the option to toggle this feature. Similarly, the mandatory use of sc.pp.normalize_total(..., exclude_highly_expressed=True...) could be made optional with a control parameter.

  2. Expanding regress Functionality: Currently, the regress function seems limited to specific parameters like mito_perc and nUMIs. It would be advantageous to allow regression on other variables as per user requirements.

  3. Concerns with regress_and_scale: In the current implementation, I'm wondering if replacing adata = sc.pp.regress_out(adata, ['mito_perc', 'nUMIs']) with adata_mock = sc.pp.scale(adata_mock) at line 471 might be more appropriate. This change could potentially improve the function's performance or accuracy.

I believe these enhancements could make OV even more flexible and user-friendly for diverse scRNA analysis scenarios. Looking forward to your thoughts on this.
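To illustrate point 1, here is a hypothetical pure-NumPy mock-up of what such toggles could look like. This is not the actual ov.pp.preprocess API; the parameter names robust_gene and robust_threshold are invented for the example, and the normalization mimics scanpy's normalize_total semantics on a dense matrix:

```python
import numpy as np

def preprocess_sketch(X, robust_gene=True, robust_threshold=0.05,
                      exclude_highly_expressed=True, max_fraction=0.2,
                      target_sum=1e4):
    """Toy pipeline on a dense cells-x-genes counts matrix."""
    X = np.asarray(X, dtype=float)
    if robust_gene:
        # keep genes detected in at least `robust_threshold` of cells
        X = X[:, (X > 0).mean(axis=0) >= robust_threshold]
    counts = X.sum(axis=1, keepdims=True)
    if exclude_highly_expressed:
        # exclude genes exceeding `max_fraction` of any cell's total
        # counts from the size-factor computation (scanpy's semantics)
        high = (X > max_fraction * counts).any(axis=0)
        counts = X[:, ~high].sum(axis=1, keepdims=True)
    return X / counts * target_sum
```

With exclude_highly_expressed=False every row sums to target_sum; with it enabled, highly expressed genes no longer dominate the size factors.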


lisch7 commented Jan 10, 2024

Moreover, why not incorporate Celltypist and BBKNN into OV? I think some of your previous excellent strategies for batch correction and annotation, published on your WeChat Official Account, could also be integrated into the OV workflow.

@Starlitnightly (Owner)

Thank you for your suggestions; here are my responses:

  1. More parameters will indeed be made optional in preprocess. It is worth noting that we auto-calculate the highly variable genes, but instead of auto-filtering, we store the raw values in adata.layers['counts']. So you can use adata.layers['counts'] to recover the raw values if you need to.
  2. Regression is deliberately not implemented in omicverse. Regression interpolates over the dropout phenomenon, and it is doubtful whether that interpolation is really correct; much of the literature expresses concern about this.
  3. Instead of modifying the original data, we store the scaled values in adata.layers['scaled'] and use them only for computing PCA.
  4. Celltypist is a good automatic annotation algorithm, but its installation dependencies may conflict with omicverse's, so we suggest that users install it on their own; we may add it in a future release.
  5. BBKNN is an older batch-removal algorithm whose performance is comparatively poor. We have compared the results of different batch-correction algorithms in detail in the batch correction tutorial: the best algorithm without a GPU is Harmony, and the best with a GPU is scVI.

Thanks again for your suggestions.
Zehua
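The layers convention described in point 3 can be illustrated with a small NumPy sketch: a z-scored copy is kept separately and used only for the PCA step, while the working matrix stays untouched. This is a generic illustration of the pattern, not omicverse's actual code:

```python
import numpy as np

def pca_on_scaled_copy(X, n_comps=2):
    """Z-score a copy of X, run PCA on the copy, leave X unchanged."""
    X = np.asarray(X, dtype=float)
    scaled = (X - X.mean(axis=0)) / X.std(axis=0)   # the 'scaled' layer
    # PCA via SVD of the scaled matrix
    U, S, Vt = np.linalg.svd(scaled, full_matrices=False)
    X_pca = U[:, :n_comps] * S[:n_comps]
    return scaled, X_pca
```

Here scaled plays the role of adata.layers['scaled'] and X_pca the role of adata.obsm['X_pca']; adata.X is never overwritten.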


lisch7 commented Jan 11, 2024

Thanks for your kind reply, but it seems I didn't express myself clearly. I appreciate the opportunity to clarify my suggestions:

  1. Flexibility in ov.pp.preprocess.
    I noticed that in the preprocessing step (ov.pp.preprocess), certain parameters are set by default, such as in the snippet from lines 372-379 in _preprocess.py:
sc.pp.normalize_total(
    adata, 
    target_sum=target_sum,
    exclude_highly_expressed=True,
    max_fraction=0.2,
)

Here, exclude_highly_expressed=True is applied automatically, whereas scanpy's default for exclude_highly_expressed is False (https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.normalize_total.html#scanpy.pp.normalize_total). Setting it to True may well be justified by your experience, but staying consistent with the classic tutorial may also be appropriate. My suggestion is therefore to let users adjust this setting, perhaps through an additional parameter in ov.pp.preprocess. This would provide more control over preprocessing based on the characteristics of specific data.

  2. Expanding regress functionality.
    In the regress function (lines 437-452 in _preprocess.py), the parameters mito_perc and nUMIs are fixed targets for regression. I propose making this function more flexible by allowing users to specify which variables to regress out. This flexibility could be crucial for analyses where other covariates are more relevant.

  3. Concerns with regress_and_scale.
    Regarding the regress_and_scale function, particularly at line 471 in _preprocess.py, I wonder if adata_mock = scale(adata_mock) should be adata_mock = sc.pp.scale(adata_mock). As someone relatively new to Python, I'm not sure whether this change is appropriate or more efficient, and would appreciate your insight.

Looking forward to your thoughts on this.
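As a sketch of the generalized regression proposed in point 2 (hypothetical, using plain NumPy least squares rather than scanpy's implementation): each gene is regressed on a user-chosen set of covariates and replaced by its residuals, which is essentially what sc.pp.regress_out does with its fixed keys:

```python
import numpy as np

def regress_out(X, covariates):
    """Return residuals of each gene (column of X) after linear
    regression on the given covariate vectors (one per covariate)."""
    X = np.asarray(X, dtype=float)
    # design matrix: intercept plus each covariate as a column
    C = np.column_stack([np.ones(len(X))] +
                        [np.asarray(c, dtype=float) for c in covariates])
    beta, *_ = np.linalg.lstsq(C, X, rcond=None)  # fit all genes at once
    return X - C @ beta                           # residuals
```

Called as regress_out(adata.X, [adata.obs['mito_perc'], adata.obs['nUMIs']]) it reproduces the current fixed behaviour, but any list of obs columns could be passed.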

@Starlitnightly (Owner)


Thanks for your advice; we will add more parameters in the next version.

Zehua

@Liang-Wu-01

Following this issue, I have one question. Do the authors have any advice for saving RAM when using OV to analyze huge datasets? I'm not sure which statistics in pp.preprocess take up so much RAM; can gc.collect be used to release the memory?


lisch7 commented Jan 11, 2024

Additionally, there are some mistakes in the tutorials (https://omicverse.readthedocs.io/en/latest/Tutorials-single/t_single_batch/).

An example:
[screenshot of the tutorial error]

A similar problem occurs in the other batch-correction tutorials.


Starlitnightly commented Jan 11, 2024

> Do the authors have any advice for saving RAM when using OV to analyze huge datasets? Can gc.collect be used to release the memory?

If you want to reduce RAM usage, you might consider setting the argument backed='r' when reading h5ad files with sc.read or ov.read.
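The idea behind backed='r' is that the matrix stays on disk and only the slices you touch are loaded into memory. The same principle can be shown with a plain NumPy memmap (used here only because it needs no extra dependencies; for real data you would pass backed='r' to sc.read or ov.read):

```python
import os
import tempfile
import numpy as np

# write a small matrix to disk
path = os.path.join(tempfile.mkdtemp(), "X.dat")
np.arange(12, dtype=np.float32).reshape(4, 3).tofile(path)

# open it memory-mapped: nothing is bulk-loaded into RAM
Xm = np.memmap(path, dtype=np.float32, mode="r", shape=(4, 3))
row = np.array(Xm[1])  # only the touched slice is materialized
```

Note that downstream steps which densify the whole matrix (as parts of preprocessing may) will still pull it into memory, which is consistent with the observation below that backed mode mainly helps at the reading stage.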

@Liang-Wu-01

Liang-Wu-01 commented Jan 12, 2024

> If you want to reduce RAM usage, you might consider setting the argument backed='r' when reading h5ad files with sc.read or ov.read.

Can backed='r' help save memory during pp.preprocess? I thought it only speeds up reading the files. In any case, I will try it.

@colinwxl

[screenshot QQ20240918-164142]
There are bugs in the function ov.pp.regress_and_scale.

@colinwxl

[screenshot of the relevant code]
I think the re-calculation of scaled|original|X_pca should be removed: the function batch_correction does not ask us to input 'log-transformed counts' or 'scaled data', and the latter input would lead to the data being scaled twice.
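The double-scaling concern can be made concrete: plain z-scoring is idempotent, but scanpy-style scaling with a max_value clip is not, so re-scaling already-scaled data changes the values. A small NumPy illustration (the clip threshold is chosen arbitrarily for the example):

```python
import numpy as np

def scale_clip(x, max_value=1.0):
    """Z-score then clip, mimicking sc.pp.scale(..., max_value=...)."""
    z = (x - x.mean()) / x.std()
    return np.clip(z, None, max_value)

x = np.array([0.0, 1.0, 2.0, 100.0])
once = scale_clip(x)
twice = scale_clip(once)  # 'double scaled': clipping shifts mean/std,
                          # so the second pass produces different values
```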
