ConnectionRefusedError: [Errno 111] Connection refused TPU #8308

Open
huzama opened this issue Oct 24, 2024 · 0 comments

🐛 Bug Report

Description

Initializing a process group on a TPU pod with dist.init_process_group fails with ConnectionRefusedError: [Errno 111] Connection refused TPU.

To Reproduce

Steps to reproduce the behavior:

  1. Set Up TPU Pod: Ensure that your TPU pod is properly configured and active.
  2. Install Dependencies: Ensure torch >= v2.4.1 and torch_xla >= v2.4.0 are installed.
  3. Run the Following Script:
import torch.distributed as dist
import torch_xla.runtime as xr
import torch_xla.distributed.xla_backend  # Import to register the `xla://` init_method

xr.use_spmd()
dist.init_process_group("gloo", init_method="xla://")

Expected Behavior

The script should run without errors, initializing the process group correctly on the TPU pod.

Environment

  • XLA Backend: TPU
  • Torch XLA Version: 2.4.0
  • Torch Version: 2.4.1 or greater

Additional Context

  • The issue does not occur with torch and torch_xla versions <= 2.4.0.
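
Because the failure appears to depend on the installed versions, it may help to record them on each host when comparing a working and a failing environment. The following is a minimal sketch using only the standard library. It assumes the installed distribution names are torch and torch_xla, and the helper name installed_versions is hypothetical, not part of either package.

```python
from importlib import metadata


def installed_versions(packages=("torch", "torch_xla")):
    """Return a dict mapping each distribution name to its version string.

    A package that is not installed maps to None instead of raising.
    The default names are an assumption about how the packages were installed.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Running this on every worker of the pod and attaching the output would show whether all hosts have the same version combination.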

Error Message

ConnectionRefusedError: [Errno 111] Connection refused TPU
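
On Linux, errno 111 is ECONNREFUSED, meaning nothing accepted the TCP connection at the address the client tried to reach. As a rough way to narrow this down, the sketch below probes a host and port from a worker and reports whether the connection is accepted or refused. This report does not state the address involved, so the host and port must be supplied by the user. The helper name probe is hypothetical, and nothing here reflects how torch_xla sets up its rendezvous internally.

```python
import socket
import sys


def probe(host, port, timeout=3.0):
    """Try one TCP connection and return "open", "refused", or "error: ...".

    "refused" corresponds to ConnectionRefusedError (errno 111 on Linux).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except OSError as exc:
        # Covers timeouts, unreachable hosts, and name-resolution failures.
        return f"error: {exc}"


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage (placeholder arguments): python probe.py <host> <port>
    print(probe(sys.argv[1], int(sys.argv[2])))
```

A "refused" result from a worker while the coordinating host is reachable would suggest that no process is listening on that port yet, rather than a network problem.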