Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP ENH add censored quadratic df #250

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

mathurinm
Copy link
Collaborator

closes #249

@mathurinm
Copy link
Collaborator Author

mathurinm commented May 27, 2024

I have an issue with the datafit update step, because the gradient of 0.5 || Xw - y + a 1_n|| with respect to a is n a - (Xw - y)^t 1_n, so to compute it one needs y.sum()

@shz9 do people usually fit an intercept with such models ? How do they do it ?

EDIT: ok I made it work with the user passing y_mean. If not, I don't think it's possible to compute an intercept.

@shz9
Copy link

shz9 commented May 27, 2024

We usually assume $y$ is centered. So, you could go with the implementation with y_mean, but set it to 0 by default?

Comment on lines +130 to +132
def __init__(self, Xty, y_mean):
self.Xty = Xty
self.y_mean = y_mean
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We better leave the constructor to define the Datafit hyper-parameters.
We can move Xty and y_mean to the initialize method

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this would break the existing code: all solvers call datafit.initialize(X, y) internally

Copy link
Collaborator Author

@mathurinm mathurinm May 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe a better API would be to instantiate all datafits with X, y and whatever they need (each has its own API)
then call datafit.initialize() that would use all stored attributes

eg

datafit = Quadratic(X, y)
penalty = L1(alpha=0.1)
solver = AndersonCD()
solver.solve(datafit, penalty)

It would give more freedom to each datafit, to require various quantities.

To me this makes more sense because we have datafits like Cox depneding on more than just X and y.

Wdyt @BadrMOUFAD ?

Edit: this may break GeneralizedEstimator, it would need X and y to be instantiated, not at fit time

@QB3
Copy link
Collaborator

QB3 commented Jun 27, 2024

I just added a way to load the design matrix in a sparse scipy format from magenpy.
Would show us how to load X.T y from magenpy @shz9 ?

@shz9
Copy link

shz9 commented Jun 27, 2024

Sure, here's a snippet that you can easily integrate for the purposes of testing:

import magenpy as mgp

# Initialize simulator object:
g_sim = mgp.PhenotypeSimulator(mgp.tgp_eur_data_path(), h2=0.5, pi=0.1)
# Simulate outcome (phenotype):
g_sim.simulate()

# Access X:
X_csr = g_sim.genotype[22].to_csr()
# Convert to csc:
X_csc = X_csr.tocsc()
# Access y:
y = g_sim.to_phenotype_table()['phenotype'].values

# Sample size:
n = X_csr.shape[0]

# To get X'y (you may need to standardize both X and y before this step):

Xty = X_csr.T.dot(y)

# For a more general case that will be useful for GWAS settings:

# (1) Perform GWAS:
g_sim.perform_gwas()

# (2) Extract marginal betas (Xty, assumes both x and y are standardized column-wise):
g_sim.sumstats_table[22].standardized_marginal_beta

@shz9
Copy link

shz9 commented Jun 27, 2024

Would it be possible to have interface take $X^\top X$ instead of $X$ in some instances?

@QB3
Copy link
Collaborator

QB3 commented Jul 2, 2024

Thanks a lot for the code snippet!

Would it be possible to have the interface take X.T X instead of X in some instances?

This should take more time because we will have to implement a new solver.
We will start with making the code work for X, and then X.T X.

@shz9
Copy link

shz9 commented Jul 2, 2024

OK, thanks. By the way, I found some interesting statistics literature on losses of the form that we're working on here, going back to Loh and Wainwright (2012):

These papers were focused on the regime of missing/noisy data in $X$ and they proposed losses similar to the ones we're discussing. So, the work here could have wide applicability if we manage to provide efficient and robust solvers for these losses.

This literature also provides some important caveats and considerations to keep in mind, for example, the fact that $X^\top X$ has to be positive (semi-)definite.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

FEAT - Add quadratic datafit with no access to the target
4 participants