Releases · ml-energy/zeus

10 Sep 22:01

jaywonchung

zeus-v0.10.1

360d0ee

Zeus v0.10.1 Latest

This is a maintenance release aimed at enhancing usability and fixing small bugs.

What's Changed

Feat: Catch PermissionError and raise with more information by @wbjin in #111
Feat: Alternative RAPL directory inside Docker containers by @wbjin in #115
Feat: added utility function to retrieve CPU index from PID by @danielhou0515 in #117
Docs: More documentation on CPU monitoring by @wbjin in #118
Feat: python -m zeus.show_env by @jaywonchung in #119
Feat: getAverageMemoryPowerUsage by @jaywonchung in #122
Fix: Add getAverageMemoryPowerUsage to GPUs as well by @jaywonchung in #124

New Contributors 🎉

@danielhou0515 made their first contribution in #117

Full Changelog: zeus-v0.10.0...zeus-v0.10.1

Contributors

jaywonchung, danielhou0515, and wbjin

16 Aug 20:49

jaywonchung

zeus-v0.10.0

d480b3e

Zeus v0.10.0: Broader support

What's New

CPU and DRAM energy measurement

We implemented support for Intel RAPL, which allows CPU and DRAM energy measurement on supported CPUs.
Generally speaking, most Intel CPUs support both, while some AMD CPUs support RAPL but only for CPU measurement.
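As a rough illustration of what measuring with RAPL involves: on Linux, the powercap interface exposes a cumulative energy counter (`energy_uj`) together with its range (`max_energy_range_uj`), and the counter wraps around once it reaches that range. The sketch below computes the energy between two readings under the assumption that the counter wrapped at most once and restarts from zero. The function name is hypothetical and this is not Zeus's actual implementation.

```python
def rapl_energy_delta_uj(start_uj: int, end_uj: int, max_range_uj: int) -> int:
    """Energy in microjoules consumed between two RAPL counter readings.

    Assumption: the counter wrapped at most once between the two readings
    and restarts from zero after reaching max_range_uj.
    """
    if end_uj >= start_uj:
        # No wraparound: the counter only moved forward.
        return end_uj - start_uj
    # Wraparound: energy up to the top of the range, plus energy since the reset.
    return (max_range_uj - start_uj) + end_uj


# Hypothetical readings, in microjoules.
no_wrap = rapl_energy_delta_uj(100, 250, max_range_uj=1000)
wrapped = rapl_energy_delta_uj(900, 50, max_range_uj=1000)
```

If more than one wraparound can happen between readings, a single subtraction like this undercounts, which is why the counter has to be polled often enough.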

JAX support

We added preliminary JAX support. Check out our full example here.

API usage is mostly identical:

from zeus.monitor import ZeusMonitor

monitor = ZeusMonitor(sync_execution_with="jax")  # JAX!

monitor.begin_window("computations")
# Run computation
measurement = monitor.end_window("computations")

Zeus Daemon

Our energy optimizers require changing settings on the GPU, including the power limit and frequency. This requires admin privileges. More details in our docs.

Zeus Daemon lets you avoid this. It runs as a standalone daemon process on the node and performs privileged operations on your behalf, so you don't have to give the entire Zeus-integrated application admin privileges.

We wrote the Zeus Daemon in Rust: Check out the source code and crates.io for details.

Breaking Changes

ZeusMonitor.begin_window and ZeusMonitor.end_window's second parameter sync_cuda was renamed to sync_execution.
This is because JAX asynchronously runs CPU code as well, and we would like to synchronize both CUDA and CPU computations. This created the need to generalize sync_cuda to sync_execution.

Changelog

Docs: Add warnings about instantiating ZeusMonitor as a global variable. by @jaywonchung in #68
Docs: Fix typo by @Sunt-ing in #69
Docs: Improve the GPU energy monitoring demo by @Sunt-ing in #70
Feat: Detect and reject unofficial pynvml bindings by @jaywonchung in #71
Fix: Pandas warnings from PowerMonitor by @jaywonchung in #75
Feat: Zeus daemon by @jaywonchung in #81
Test: Allow zeusd dev and testing on MacOS by @jaywonchung in #82
Refactor: Reorg zeus.device.gpu by @jaywonchung in #83
Feat: Integrate zeusd into zeus.device.gpu by @jaywonchung in #85
Chore: Fix typo in GitHub Actions by @jaywonchung in #86
Chore: zeusd debug outputs and doc comments by @jaywonchung in #87
Feat: Add CPU measurement (via Intel RAPL) to ZeusMonitor by @wbjin in #90
Fix: RAPL DRAM measurements not to be included in package measurements by @wbjin in #92
Chore: Run checks in PRs from forks by @jaywonchung in #95
Docs: Fix attribute name in ZeusMonitor example by @HGangloff in #96
Feat: Add zero energy warning in ZeusMonitor by @sharonsyh in #93
Feat: Add jax support in CUDA sync by @HGangloff in #97
Docs: Refine JAX integration and example by @jaywonchung in #99
Feat: Multi arch docker build by @sharonsyh in #104
News: Add Perseus news and write Perseus blog by @jaywonchung in #107
Feat: Multi-Arch Docker Build - Pushing to symbioticlab/zeus and mlenergy/zeus by @sharonsyh in #106
Feat: RAPL Monitor for monitoring wraparounds for a rapl file by @wbjin in #105
Test: Tests for CPU monitoring on ZeusMonitor by @wbjin in #100
Chore: Fix lint warnings from ruff by @wbjin in #108

New Contributors 🎉

@Sunt-ing made their first contribution in #69
@wbjin made their first contribution in #90
@HGangloff made their first contribution in #96
@sharonsyh made their first contribution in #93

Full Changelog: v0.9.1...zeus-v0.10.0

Contributors

HGangloff, jaywonchung, and 3 other contributors

07 May 04:07

jaywonchung

v0.9.1

cf8324c

v0.9.1

What's new

For GPU power draw, we use nvmlDeviceGetFieldValues, which gives us instant power draw (instead of average power draw) for any microarchitecture.

06 May 16:07

jaywonchung

v0.9.0

0ae4de1

v0.9.0: Batch size optimizer and big cleanups

What's new

The batch size optimizer is now a full-fledged server that can be deployed independently, with Docker Compose, or on Kubernetes + KubeFlow.
GPU abstraction: We created an abstraction layer over GPU vendors (NVIDIA and AMD). We're on our way to supporting AMD GPUs.
Completely revamped documentation under https://ml.energy/zeus.

Deprecated

See #20 (ZeusDataLoader, ZeusMaster, and the C++ Zeus monitor)

13 Oct 21:34

jaywonchung

v0.8.0

076df3d

v0.8.0: Energy-efficient large model training

This release features Perseus, an optimizer for energy-efficient large model training.

See the Perseus docs for details.

24 Sep 04:10

jaywonchung

v0.7.1

6082db4

v0.7.1: Moved to under `ml-energy`!

We moved our repository under ml-energy. No feature changes :)

24 Aug 21:22

jaywonchung

v0.7.0

3ce9012

v0.7.0: Python-based power monitor

What's New

We used to have a C++ power monitor under zeus_monitor, but we've deprecated that. There's no need for high-speed polling because NVML power counters do not update that quickly anyway.
- In order to poll power consumption programmatically, use zeus.monitor.power.PowerMonitor.
CLI power & energy monitor:
- python -m zeus.monitor power
- python -m zeus.monitor energy
We switched from the old setup.py to the new package metadata standard pyproject.toml.
Docker image sizes are drastically smaller now! The compressed image used to be 8.48 GB, but now it's down to 2.71 GB.

07 Aug 21:18

jaywonchung

v0.6.1

e610849

v0.6.1: `approx_instant_energy`

What's New

approx_instant_energy in ZeusMonitor

Sometimes, the NVML energy counter update period is longer than the measurement window, in which case energy consumption may be returned as 0.0. In this case, when approx_instant_energy=True, ZeusMonitor will approximate the energy consumption of the window as instant power consumption multiplied by the duration of the measurement window: $$\textrm{Energy} = \int_0^T \textrm{Power}(t) \, dt \approx \textrm{Power}(T) \cdot T$$
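The fallback described above can be sketched as follows. This is a minimal illustration of the approximation only, not ZeusMonitor's actual code. The function name and its parameters are hypothetical, and only the approx_instant_energy flag comes from the release note.

```python
def window_energy_joules(
    counter_delta_j: float,
    instant_power_w: float,
    duration_s: float,
    approx_instant_energy: bool = False,
) -> float:
    """Energy attributed to one measurement window, in joules.

    counter_delta_j: difference of the energy counter over the window.
    instant_power_w: power reading taken at the end of the window, Power(T).
    duration_s: length of the window, T.
    """
    if counter_delta_j == 0.0 and approx_instant_energy:
        # The counter did not update within the window, so fall back to
        # Energy ~= Power(T) * T.
        return instant_power_w * duration_s
    return counter_delta_j


# Hypothetical short window in which the counter did not tick.
approximated = window_energy_joules(0.0, 200.0, 0.5, approx_instant_energy=True)
unchanged = window_energy_joules(0.0, 200.0, 0.5)
```

With the flag off, the window still reports 0.0, so existing behavior is preserved.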

28 Jul 21:03

jaywonchung

v0.6.0

b2469c9

v0.6.0: `OptimumSelector`

What's New

OptimumSelector

Until now, the optimal power limit for GlobalPowerLimitOptimizer was the one that minimizes the Zeus time-energy cost. Not everyone would want that.
Now, OptimumSelector is an abstract base class with which you can implement your own optimal power limit selection policy.
Pre-implemented ones are Time, Energy, ZeusCost, and MaxSlowdownConstraint. These are thoroughly tested.
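To give a feel for what a selection policy looks like, here is a self-contained sketch. It does not import Zeus: every class, field, and method name below is hypothetical and only mimics the idea of an abstract selector with pluggable policies. The slowdown policy reflects one plausible reading of a max-slowdown constraint, namely picking the lowest power limit whose iteration time stays within a given factor of the fastest one.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class PowerLimitSample:
    """Hypothetical profiling result for one power limit."""

    power_limit_w: int
    energy_j: float
    time_s: float


class SelectorSketch(ABC):
    """Hypothetical stand-in for an abstract selection policy."""

    @abstractmethod
    def select(self, samples: list[PowerLimitSample]) -> int:
        """Return the chosen power limit in watts."""


class MinEnergySketch(SelectorSketch):
    """Pick the power limit that consumed the least energy."""

    def select(self, samples: list[PowerLimitSample]) -> int:
        return min(samples, key=lambda s: s.energy_j).power_limit_w


class MaxSlowdownSketch(SelectorSketch):
    """Pick the lowest power limit within `factor` times the fastest time."""

    def __init__(self, factor: float) -> None:
        self.factor = factor

    def select(self, samples: list[PowerLimitSample]) -> int:
        fastest = min(s.time_s for s in samples)
        feasible = [s for s in samples if s.time_s <= self.factor * fastest]
        return min(s.power_limit_w for s in feasible)


# Made-up profiling data for four power limits.
samples = [
    PowerLimitSample(300, 1000.0, 10.0),
    PowerLimitSample(250, 900.0, 10.5),
    PowerLimitSample(200, 850.0, 12.0),
    PowerLimitSample(150, 880.0, 15.0),
]
energy_choice = MinEnergySketch().select(samples)
slowdown_choice = MaxSlowdownSketch(factor=1.1).select(samples)
```

Keeping the policy behind a single method means the optimizer that profiles power limits does not need to know which trade-off the user cares about.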

wait_steps

Now, you can specify wait_steps in GlobalPowerLimitOptimizer, and it'll wait for the specified number of steps before profiling and optimizing.
wait_steps is set to 1 by default because users may have torch.backends.cudnn.benchmark = True and DataLoader workers usually need time to warm up before ramping up to their normal fetch throughput.

Breaking Changes

GlobalPowerLimitOptimizer now takes an instance of OptimumSelector in its constructor, instead of eta_knob. If you want to recover the functionality of v0.5.0, modify your code like this:

# Before
plo = GlobalPowerLimitOptimizer(..., eta_knob=0.5, ...)

# After
from zeus.optimizer.power_limit import ZeusCost

plo = GlobalPowerLimitOptimizer(..., optimum_selector=ZeusCost(eta_knob=0.5), ...)

12 Jul 03:34

jaywonchung

v0.5.0

54a9b2e

v0.5.0: Big refactor, `GlobalPowerLimitOptimizer`

What's New

Callback-based architecture

zeus.callback.Callback is the new backbone for Zeus components
GlobalPowerLimitOptimizer is the shiny new way to online-profile and optimize the power limit of DNN training.
EarlyStopController monitors and manages all sorts of conditions to determine whether training should stop.

Extensive testing

tests/ is richer than ever. With deep component tests with exhaustive parametrization, there are now around 1500 test cases.
In particular, zeus.util.testing.ReplayZeusMonitor exposes the same public API as ZeusMonitor but replays the measurement window logs produced by ZeusMonitor, instead of doing actual measurement. With this, Zeus can now be tested without any actual GPUs.
