-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP ENH add censored quadratic df #250
base: main
Are you sure you want to change the base?
Conversation
I have an issue with the datafit update step, because the gradient of @shz9 do people usually fit an intercept with such models ? How do they do it ? EDIT: ok I made it work with the user passing |
We usually assume |
def __init__(self, Xty, y_mean): | ||
self.Xty = Xty | ||
self.y_mean = y_mean |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We better leave the constructor to define the Datafit hyper-parameters.
We can move Xty
and y_mean
to the initialize
method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this would break the existing code: all solvers call datafit.initialize(X, y)
internally
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe a better API would be to instantiate all datafits with X, y and whatever they need (each has its own API)
then call datafit.initialize()
that would use all stored attributes
eg
datafit = Quadratic(X, y)
penalty = L1(alpha=0.1)
solver = AndersonCD()
solver.solve(datafit, penalty)
It would give more freedom to each datafit, to require various quantities.
To me this makes more sense because we have datafits like Cox depneding on more than just X and y.
Wdyt @BadrMOUFAD ?
Edit: this may break GeneralizedEstimator, it would need X and y to be instantiated, not at fit time
I just added a way to load the design matrix in a sparse scipy format from |
Sure, here's a snippet that you can easily integrate for the purposes of testing: import magenpy as mgp
# Initialize simulator object:
g_sim = mgp.PhenotypeSimulator(mgp.tgp_eur_data_path(), h2=0.5, pi=0.1)
# Simulate outcome (phenotype):
g_sim.simulate()
# Access X:
X_csr = g_sim.genotype[22].to_csr()
# Convert to csc:
X_csc = X_csr.tocsc()
# Access y:
y = g_sim.to_phenotype_table()['phenotype'].values
# Sample size:
n = X_csr.shape[0]
# To get X'y (you may need to standardize both X and y before this step):
Xty = X_csr.T.dot(y)
# For a more general case that will be useful for GWAS settings:
# (1) Perform GWAS:
g_sim.perform_gwas()
# (2) Extract marginal betas (Xty, assumes both x and y are standardized column-wise):
g_sim.sumstats_table[22].standardized_marginal_beta |
Would it be possible to have interface take |
Thanks a lot for the code snippet!
This should take more time because we will have to implement a new solver. |
OK, thanks. By the way, I found some interesting statistics literature on losses of the form that we're working on here, going back to Loh and Wainwright (2012):
These papers were focused on the regime of missing/noisy data in This literature also provides some important caveats and considerations to keep in mind, for example, the fact that |
closes #249