WIP ENH add censored quadratic df #250

mathurinm · 2024-05-27T16:20:41Z

closes #249

mathurinm · 2024-05-27T17:13:02Z

I have an issue with the datafit update step, because the gradient of 0.5 || Xw - y + a 1_n|| with respect to a is n a - (Xw - y)^t 1_n, so to compute it one needs y.sum()

@shz9 do people usually fit an intercept with such models ? How do they do it ?

EDIT: ok I made it work with the user passing y_mean. If not, I don't think it's possible to compute an intercept.

shz9 · 2024-05-27T17:38:37Z

We usually assume $y$ is centered. So, you could go with the implementation with y_mean, but set it to 0 by default?

Badr-MOUFAD · 2024-05-28T08:23:07Z

skglm/datafits/single_task.py

+    def __init__(self, Xty, y_mean):
+        self.Xty = Xty
+        self.y_mean = y_mean


We better leave the constructor to define the Datafit hyper-parameters.
We can move Xty and y_mean to the initialize method

this would break the existing code: all solvers call datafit.initialize(X, y) internally

maybe a better API would be to instantiate all datafits with X, y and whatever they need (each has its own API)
then call datafit.initialize() that would use all stored attributes

eg

datafit = Quadratic(X, y) penalty = L1(alpha=0.1) solver = AndersonCD() solver.solve(datafit, penalty)

It would give more freedom to each datafit, to require various quantities.

To me this makes more sense because we have datafits like Cox depneding on more than just X and y.

Wdyt @BadrMOUFAD ?

Edit: this may break GeneralizedEstimator, it would need X and y to be instantiated, not at fit time

QB3 · 2024-06-27T10:43:00Z

I just added a way to load the design matrix in a sparse scipy format from magenpy.
Would show us how to load X.T y from magenpy @shz9 ?

shz9 · 2024-06-27T21:33:02Z

Sure, here's a snippet that you can easily integrate for the purposes of testing:

import magenpy as mgp

# Initialize simulator object:
g_sim = mgp.PhenotypeSimulator(mgp.tgp_eur_data_path(), h2=0.5, pi=0.1)
# Simulate outcome (phenotype):
g_sim.simulate()

# Access X:
X_csr = g_sim.genotype[22].to_csr()
# Convert to csc:
X_csc = X_csr.tocsc()
# Access y:
y = g_sim.to_phenotype_table()['phenotype'].values

# Sample size:
n = X_csr.shape[0]

# To get X'y (you may need to standardize both X and y before this step):

Xty = X_csr.T.dot(y)

# For a more general case that will be useful for GWAS settings:

# (1) Perform GWAS:
g_sim.perform_gwas()

# (2) Extract marginal betas (Xty, assumes both x and y are standardized column-wise):
g_sim.sumstats_table[22].standardized_marginal_beta

shz9 · 2024-06-27T21:34:20Z

Would it be possible to have interface take $X^\top X$ instead of $X$ in some instances?

QB3 · 2024-07-02T15:01:00Z

Thanks a lot for the code snippet!

Would it be possible to have the interface take X.T X instead of X in some instances?

This should take more time because we will have to implement a new solver.
We will start with making the code work for X, and then X.T X.

shz9 · 2024-07-02T20:00:02Z

OK, thanks. By the way, I found some interesting statistics literature on losses of the form that we're working on here, going back to Loh and Wainwright (2012):

These papers were focused on the regime of missing/noisy data in $X$ and they proposed losses similar to the ones we're discussing. So, the work here could have wide applicability if we manage to provide efficient and robust solvers for these losses.

This literature also provides some important caveats and considerations to keep in mind, for example, the fact that $X^\top X$ has to be positive (semi-)definite.

mathurinm added 3 commits May 27, 2024 18:20

censored quadratic df

4a8fccb

it compiles but it fails

5c60dcc

without fit_intercept it works

923a8dc

mathurinm mentioned this pull request May 27, 2024

FEAT - Add quadratic datafit with no access to the target #249

Open

make it work with intercept if one passes y_mean

a1f4acc

Badr-MOUFAD reviewed May 28, 2024

View reviewed changes

[ci skip] loaded X_csc magenpy data

a354535

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WIP ENH add censored quadratic df #250

WIP ENH add censored quadratic df #250

mathurinm commented May 27, 2024

mathurinm commented May 27, 2024 •

edited

Loading

shz9 commented May 27, 2024

Badr-MOUFAD May 28, 2024

mathurinm May 28, 2024

mathurinm May 28, 2024 •

edited

Loading

QB3 commented Jun 27, 2024

shz9 commented Jun 27, 2024

shz9 commented Jun 27, 2024

QB3 commented Jul 2, 2024

shz9 commented Jul 2, 2024

WIP ENH add censored quadratic df #250

Are you sure you want to change the base?

WIP ENH add censored quadratic df #250

Conversation

mathurinm commented May 27, 2024

mathurinm commented May 27, 2024 • edited Loading

shz9 commented May 27, 2024

Badr-MOUFAD May 28, 2024

Choose a reason for hiding this comment

mathurinm May 28, 2024

Choose a reason for hiding this comment

mathurinm May 28, 2024 • edited Loading

Choose a reason for hiding this comment

QB3 commented Jun 27, 2024

shz9 commented Jun 27, 2024

shz9 commented Jun 27, 2024

QB3 commented Jul 2, 2024

shz9 commented Jul 2, 2024

mathurinm commented May 27, 2024 •

edited

Loading

mathurinm May 28, 2024 •

edited

Loading