Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RollingFileAppender rotates files incorrectly on ext4 due to filesystem timestamp caching #3068

Open
kelunik opened this issue Oct 7, 2024 · 9 comments
Assignees
Labels
appenders:Rolling Affects log file rolling functionality bug Incorrect, unexpected, or unintended behavior of existing code waiting-for-maintainer
Milestone

Comments

@kelunik
Copy link

kelunik commented Oct 7, 2024

Description

RollingFile rotates files incorrectly if an event is logged shortly (in our case up to 10ms) after the rollover time and the application is restarted within the same rollover period on ext4 (Linux).

Configuration

Version: 2.24.1

Operating system: Debian 11 / Linux 5.10.0-32-amd64

JDK: 21.0.4 Corretto

<RollingFile name="app" fileName="app.log" filePattern="app-%d{yyyy-MM-dd}.log">
   <EcsLayout serviceName="foobar"/><!-- layout doesn't matter -->
   <Policies>
      <TimeBasedTriggeringPolicy modulate="true" interval="1"/>
   </Policies>
</RollingFile>

Reproduction

We've been into a deep rabbit hole to analyze these conditions, but here's what happens and how to reproduce:

  • Let's say we use daily rotation and an event gets logged at 2024-10-07 00:00:000.000000000 (you can also use minute based rotations for easier reproduction)
  • TimeBasedTriggeringPolicy#isTriggeringEvent will return true, which results in a rotation
    • The current app.log file will rotate to app-2024-10-07.log
    • RollingFileManager#createFileAfterRollover will create a new app.log (in FileManager#createOutputStream)
    • The new file apparently gets a creationTime set, but that call is ignored on ext3 / ext4 filesystems.
    • Instead, the created file will get its creation timestamp assigned by the OS, which happens to use current_fs_time() on ext4 – current_fs_time() returns a cached timestamp that's only updated on kernel clock ticks (see getconf CLK_TCK)
    • The default frequency for kernel clock ticks is 100 / second, so the cached timestamp from current_fs_time() might lag behind up to 10 milliseconds
    • We end up with a file creation time of 2024-10-06 23:59:59.997423036 (use stat app.log to confirm)
  • After the night, we restart the application (e.g. due to a deployment)
  • TimeBasedTriggeringPolicy#initialize reads RollingFileManager#getFileTime (initalized by RollingFileManager#initialFileTime), which returns the above borked timestamp.
  • It will also calculate the next rollover to be 2024-10-07 00:00:000.000000000, which in turn will trigger a new rollover right on application start.
  • This "unplanned" rollover will move the current app.log to app-2024-10-06.log, therefore leading to:
    • Loss of the original app-2024-10-06.log file contents
    • Log events of 2024-10-07 00:00:00 until the restart being unexpectedly in app-2024-10-06.log instead of app.log (if still on the same day) or app-2024-10-07.log (later)
    • Unexpected file rotation, causing tail -f app.log to hang (our original symptom) – tail will (by default) follow the file descriptor on the rename to app-2024-10-06.log and new events will be written to app.log instead.
@ppkarwasz
Copy link
Contributor

Hi @kelunik,

Thank you for the very detailed analysis.

Unfortunately creationTime is not a reliable source of information, since it does not exist on many POSIX systems and as far as I have seen, JDK does not have the functionality to do it.

We probably would need to modify the RollingFileManager logic to only work with lastModificationTime, which can be set on any filesystem.

  • This "unplanned" rollover will move the current app.log to app-2024-10-06.log, therefore leading to:

    • Loss of the original app-2024-10-06.log file contents
    • Log events of 2024-10-07 00:00:00 until the restart being unexpectedly in app-2024-10-06.log instead of app.log (if still on the same day) or app-2024-10-07.log (later)
    • Unexpected file rotation, causing tail -f app.log to hang (our original symptom) – tail will (by default) follow the file descriptor on the rename to app-2024-10-06.log and new events will be written to app.log instead.

This is IMHO a bug, which is due to the usage of REPLACE_EXISTING option during the file renaming phase. As I mentioned in #3021 (comment), I plan to modify that.
The rolling appender should be able to correctly move files out of the way, so they don't get overwritten. If he can not do it, the file rotation should fail and events should be appended to the current log file.

@ppkarwasz ppkarwasz added bug Incorrect, unexpected, or unintended behavior of existing code appenders:Rolling Affects log file rolling functionality and removed waiting-for-maintainer labels Oct 8, 2024
@ppkarwasz ppkarwasz added this to the 2.x milestone Oct 8, 2024
@kelunik
Copy link
Author

kelunik commented Oct 8, 2024

This is IMHO a bug, which is due to the usage of REPLACE_EXISTING option during the file renaming phase. As I mentioned in #3021 (comment), I plan to modify that.

This is another problem by itself, but the fact that a rollover happens, is itself a bug.

Using the last-modified time would likely work (and is what java.nio.file.attribute.BasicFileAttributes#creationTime usually falls back to), however it might run into a similar edge case as well, I guess:

If there's just a single log event logged at 00:00:00.000000000 and then no other log event within the rollover period, I guess the last-modified has the same issue with lagging up to 10ms behind.

This condition is, however, a lot less likely than the current edge case.

My original proposed fix was rounding up to the nearest second, but then I learned that rotation based on milliseconds is supported, where this would break.

@ppkarwasz
Copy link
Contributor

Using the last-modified time would likely work (and is what java.nio.file.attribute.BasicFileAttributes#creationTime usually falls back to), however it might run into a similar edge case as well, I guess:

If there's just a single log event logged at 00:00:00.000000000 and then no other log event within the rollover period, I guess the last-modified has the same issue with lagging up to 10ms behind.

The last-modified time can be set on virtually any OS and filesystem, so we can override the value assigned by the OS. There are some Windows file systems with a terrible timestamp resolution (see Windows-compatible filesystems' file time resolutions), but hopefully these are at least capable of precisely representing full minutes, hours and days.

Ideally we could set the last-modified time to the theoretical time of the rollover, which should be close to the time the file was actually modified.

@kelunik
Copy link
Author

kelunik commented Oct 8, 2024

Ideally we could set the last-modified time to the theoretical time of the rollover, which should be close to the time the file was actually modified.

We need to set this on the current log file, not on rollover on the rolled away file, so setting this on every write doesn't seem like a performant approach? Any write to the file will make the OS change the value again.

@ppkarwasz
Copy link
Contributor

We need to set this on the current log file, not on rollover on the rolled away file, so setting this on every write doesn't seem like a performant approach? Any write to the file will make the OS change the value again.

Right, I was looking at this wrong. If the last modification time of the file is taken into account for a rotation, then we probably don't need to do anything, because the natural last modification time will be well after midnight. Only the creation time can be before midnight.

@kelunik
Copy link
Author

kelunik commented Oct 10, 2024

@rgoers Do you remember why you changed from modified time to creation time in e392c79?

Calling file.lastModified() in initialFileTime if the file doesn't exist doesn't seem to make sense.

My fix proposal would be to revert this back to lastModified() and keep the > 0 check and maybe (?) add a check that it's <= System.currentTimeMillis().

@ppkarwasz
Copy link
Contributor

That commit is a fix to LOG4J2-1906 and is probably related to #2297, where a regression occurred.

Anyway I believe that those issues are independent of this one:

  • your problem affects only the first rollover after the JVM starts. At this time Log4j Core uses the timestamp on the current log file to deduce when the last rollover must have occurred.
  • those issues deal with the timestamp that is used in all rollovers. After the first rollover, Log4j Core does not need to check the filesystem to find out when the last rollover happened. 😉

@kelunik
Copy link
Author

kelunik commented Oct 11, 2024

@ppkarwasz Yes, I've of course looked at that ticket, but that didn't make it clear to me why modification time got changed to creation time.

your problem affects only the first rollover after the JVM starts. At this time Log4j Core uses the timestamp on the current log file to deduce when the last rollover must have occurred.

Yes, I know, that's why I want to change the initial time back to the modification time instead of the creation time, but I don't want to cause a regression with that, that's why I want to understand the reasons for the original change there.

@kelunik
Copy link
Author

kelunik commented Oct 18, 2024

I have been thinking about this some more, and the modified time seems to be broken as well, imagine:

  • Log event gets logged at 23:59:59
  • GC starts stop-the-world event for 2 seconds
  • Log event gets written to file at 00:00:01

Then the date is also wrong. However, it would be newer than expected and avoid a rotation instead of causing an additional rotation.

This still leaves the edge case if events are only logged in the first 10ms of the rotation period, but none after that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
appenders:Rolling Affects log file rolling functionality bug Incorrect, unexpected, or unintended behavior of existing code waiting-for-maintainer
Projects
None yet
Development

No branches or pull requests

2 participants