openalexR and parallel::mclapply(): Multicore API calls fail when no single-core API call was issued before. #189

Closed
rkrug opened this issue Nov 8, 2023 · 7 comments

Comments

@rkrug

rkrug commented Nov 8, 2023

Hi

I am using parallel::mclapply() to make parallel API calls, and these fail when no single-core call has been issued before:

library(openalexR)

## This fails:

parallel::mclapply(1:2, function(x){oa_request(oa_query("biodiversity"), count_only = TRUE)})

## Here is the single core call

oa_request(oa_query("biodiversity"), count_only = TRUE)

## Now it works:

parallel::mclapply(1:2, function(x){oa_request(oa_query("biodiversity"), count_only = TRUE)})

# And this works

The error message is:

objc[80975]: +[__NSPlaceholderDate initialize] may have been in progress in another thread when fork() was called.
objc[80976]: +[__NSPlaceholderDate initialize] may have been in progress in another thread when fork() was called.
objc[80975]: +[__NSPlaceholderDate initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.
objc[80976]: +[__NSPlaceholderDate initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.
[[1]]
NULL

[[2]]
NULL

Warning message:
In parallel::mclapply(1:2, function(x) { :
  scheduled cores 1, 2 did not deliver results, all values of the jobs will be affected

It might be necessary to have an OpenAlex Premium key for testing.

But if you have an idea, I would be happy to test.

@yjunechoe
Collaborator

This is an interaction between {progress} and {parallel}. We use {progress} to print the progress bar, and the progress bar is stateful - I don't know the internals of {parallel}, but my suspicion is that you have a race condition with each thread updating the same progress state.

I think this should go away if you disable the progress bar, but now I also realize that oa_request() still creates a progress object even with verbose = FALSE. Maybe this is trivial but - @trangdata was there a reason why the progress bar's creation is outside the verbose if-clause?

openalexR/R/oa_fetch.R

Lines 369 to 378 in 32855b6

if (verbose) {
  message(
    "Getting ", n_pages, pg_plural, " of results",
    " with a total of ", n_items, " records..."
  )
}
pb <- oa_progress(n = n_pages, text = "OpenAlex downloading")

@trangdata
Collaborator

@yjunechoe you're right. oa_progress should be inside the if clause.

@rkrug
Author

rkrug commented Nov 10, 2023

Thanks for looking into this - I will try it out as soon as it is changed.

trangdata added a commit that referenced this issue Nov 11, 2023
@trangdata
Collaborator

So it looks like oa_progress is actually in some other functions outside of verbose, such as oa_ngrams. Should we wrap it in an if (verbose){} clause @yjunechoe?

@yjunechoe
Collaborator

Yeah I think that'd be safest!
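
The change agreed on here, creating the progress bar only when verbose is TRUE, might look roughly like the sketch below. It is an illustration, not openalexR's actual code: fetch_pages and its loop body are hypothetical, and utils::txtProgressBar stands in for oa_progress, whose internals are not shown in this thread. It assumes n_pages is at least 1.

```r
# Hypothetical stand-in for a paginated fetch; not openalexR's actual code.
fetch_pages <- function(n_pages, verbose = FALSE) {
  pb <- NULL
  if (verbose) {
    message("Getting ", n_pages, " page(s) of results...")
    # Create the progress bar only when output was requested.
    pb <- utils::txtProgressBar(min = 0, max = n_pages, style = 3)
    on.exit(close(pb), add = TRUE)
  }
  results <- vector("list", n_pages)
  for (i in seq_len(n_pages)) {
    results[[i]] <- i  # placeholder for the real page request
    # Update the bar only if one was created.
    if (!is.null(pb)) utils::setTxtProgressBar(pb, i)
  }
  results
}
```

With verbose = FALSE no progress object is ever constructed, so nothing progress-related exists in the parent process or in any forked child.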

@rkrug
Author

rkrug commented Nov 12, 2023

Unfortunately, this did not solve the issue. I installed from GitHub and it still crashes:

r$> parallel::mclapply(1:10, function(x){oa_request(oa_query("biodiversity"), count_only = TRUE, verbose = FALSE)})
objc[9825]: +[__NSPlaceholderDate initialize] may have been in progress in another thread when fork() was called.
objc[9824]: +[__NSPlaceholderDate initialize] may have been in progress in another thread when fork() was called.
objc[9825]: +[__NSPlaceholderDate initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.
objc[9824]: +[__NSPlaceholderDate initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.

Just to be sure, I used debugonce(openalexR:::oa_progress) before running the single-core call, and it did not enter that function. So the problem must be somewhere else.

@rkrug
Author

rkrug commented Nov 12, 2023

OK - the problem is upstream in httr:

library(httr)
parallel::mclapply(1:2, function(x){httr::GET("http://google.com/", path = "search")})

It is independent of https://community.rstudio.com/t/running-parallel-on-mac/142580/6 (although I don't know if it only affects M1 Macs). I filed a bug at r-lib/httr#749.

I do not know if the error occurs on Intel Macs, Windows or Linux - I have an M1 Mac.

It also occurs in httr2, which superseded httr:

r$> library(httr2)
    req <- httr2::request("http://google.com")
    parallel::mclapply(1:2, function(x){httr2::req_perform(req)})
objc[50637]: +[__NSPlaceholderDate initialize] may have been in progress in another thread when fork() was called.
objc[50637]: +[__NSPlaceholderDate initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.
objc[50638]: +[__NSPlaceholderDate initialize] may have been in progress in another thread when fork() was called.
objc[50638]: +[__NSPlaceholderDate initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.
[[1]]
NULL

[[2]]
NULL

Warning message:
In parallel::mclapply(1:2, function(x) { :
  scheduled cores 1, 2 did not deliver results, all values of the jobs will be affected
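
The objc messages above point at fork() as the trigger, so one possible workaround (untested here and not confirmed anywhere in this thread) is to avoid forking and run the calls on a socket cluster instead. The sketch below uses parallel::makeCluster() with PSOCK workers, which are fresh R sessions rather than fork() children. The helper name run_in_psock is hypothetical and not part of openalexR, httr or httr2.

```r
library(parallel)

# Hypothetical helper: runs `fun` over `xs` on PSOCK workers,
# which are separate R sessions started without fork().
run_in_psock <- function(xs, fun, n_workers = 2) {
  cl <- makeCluster(n_workers, type = "PSOCK")
  # Always shut the workers down, even if a call fails.
  on.exit(stopCluster(cl), add = TRUE)
  parLapply(cl, xs, fun)
}

# Usage mirroring the failing call from this thread (needs network access):
# run_in_psock(1:2, function(x) {
#   openalexR::oa_request(openalexR::oa_query("biodiversity"), count_only = TRUE)
# })
```

Package functions are referenced with `::` inside the worker function because PSOCK workers do not inherit the packages attached in the parent session.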

rkrug closed this as completed Mar 7, 2024