
Update Grub on component devices if /boot is on mdraid device #1093

Merged
merged 2 commits into oamg:master on Jul 17, 2023

Conversation

matejmatuska
Member

@matejmatuska matejmatuska commented Jun 22, 2023

On BIOS systems, previously, if /boot was on an md device such as a RAID array consisting of multiple partitions on different MBR/GPT-partitioned drives, the part of GRUB residing in the gap after the MBR (the first 512 B of the disk) was only updated on one of the drives. A similar situation occurred on GPT-partitioned drives with a BIOS boot partition. This resulted in an outdated GRUB core image on the remaining drives, which could leave the system unbootable.

Now, GRUB is updated on all the component devices of an md array, provided GRUB was already installed on them before the upgrade.

Jira: OAMG-7835
BZ#2219544
BZ#2140011
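The fix can be sketched roughly as follows. This is a simplified Python illustration with hypothetical helper names, not the actual patch (the real logic lives in leapp's grub library): enumerate the whole disks backing the md array, then reinstall the GRUB core image on each of them.

```python
import subprocess

def parse_component_disks(lsblk_output):
    """Given the output of `lsblk -spnlo NAME,TYPE <md-device>`, return
    the unique whole disks backing the array, preserving order."""
    disks = []
    for line in lsblk_output.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[1] == "disk" and fields[0] not in disks:
            disks.append(fields[0])
    return disks

def update_grub_on_array(md_device):
    """Reinstall the GRUB core image on every disk backing md_device,
    instead of on a single drive only (hypothetical helper)."""
    out = subprocess.run(
        ["lsblk", "-spnlo", "NAME,TYPE", md_device],
        check=True, capture_output=True, text=True,
    ).stdout
    for disk in parse_component_disks(out):
        subprocess.run(["grub2-install", disk, "-v"], check=True)
```

Enumerating parents with `lsblk -s` prints each partition followed by its disk, so filtering on the `disk` type is enough to collect the unique drives.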

@github-actions

Thank you for contributing to the Leapp project!

Please note that every PR needs to comply with the Leapp Guidelines and must pass all tests in order to be mergeable.
If you want to request a review or rebuild a package in copr, you can use the following commands in a comment:

  • review please @oamg/developers to notify leapp developers of the review request
  • /packit copr-build to submit a public copr build using packit

Packit will automatically schedule regression tests for this PR's build and latest upstream leapp build. If you need a different version of leapp from PR#42, use /packit test oamg/leapp#42

To launch regression testing public members of oamg organization can leave the following comment:

  • /rerun to schedule basic regression tests using this pr build and latest upstream leapp build as artifacts
  • /rerun 42 to schedule basic regression tests using this pr build and the leapp PR42 build as artifacts
  • /rerun-sst to schedule sst tests using this pr build and latest upstream leapp build as artifacts
  • /rerun-sst 42 to schedule sst tests using this pr build and the leapp PR42 build as artifacts

Please open a ticket in case you experience a technical problem with the CI. (RH internal only)

Note: In case there are problems with tests not being triggered automatically on new PR/commit or pending for a long time, please contact leapp-infra.

Member

@fernflower fernflower left a comment

Please see some questions/remarks inline.
BTW, would it be possible to add some unit tests? It seems like a significant change that definitely deserves some :)

@matejmatuska matejmatuska force-pushed the update-grub-raid branch 4 times, most recently from 3f48a58 to a9e00e3 Compare June 25, 2023 09:05
@matejmatuska
Member Author

Tested on RHEL 7.9->8.8 and 8.8->9.2 with the following disk configuration:

# lsblk -so "NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT"
NAME     SIZE TYPE  FSTYPE            MOUNTPOINT
md126     12G raid5 xfs               /
├─vda2     4G part  linux_raid_member
│ └─vda    5G disk
├─vdb2     4G part  linux_raid_member
│ └─vdb    5G disk
├─vdc2     4G part  linux_raid_member
│ └─vdc    5G disk
└─vdd2     4G part  linux_raid_member
  └─vdd    5G disk
md127   1023M raid1 xfs               /boot
├─vda1     1G part  linux_raid_member
│ └─vda    5G disk
├─vdb1     1G part  linux_raid_member
│ └─vdb    5G disk
├─vdc1     1G part  linux_raid_member
│ └─vdc    5G disk
└─vdd1     1G part  linux_raid_member
  └─vdd    5G disk
sr0     1024M rom

(4 drives, /boot on RAID1, / on RAID5)

Without this patch, the upgrades failed to boot into the system after the initramfs phase; GRUB dropped into the rescue shell with a symbol ... not found error.

With this patch applied, this problem is fixed on both RHEL versions. In the scan phase the
scangrubdevice actor correctly identifies all boot devices:
leapp.workflow.FactsCollection.scan_grub_device_name: GRUB is installed on /dev/vdc,/dev/vdb,/dev/vda,/dev/vdd

And in the RPMUpgrade phase grub2-install is correctly called on each of those
devices:

leapp.workflow.RPMUpgrade.update_grub_core: External command has started: ['grub2-install', u'/dev/vdc', '-v']
...
leapp.workflow.RPMUpgrade.update_grub_core: External command has finished: ['grub2-install', u'/dev/vdc', '-v']
leapp.workflow.RPMUpgrade.update_grub_core: External command has started: ['grub2-install', u'/dev/vdc', '-v']
...
leapp.workflow.RPMUpgrade.update_grub_core: External command has finished: ['grub2-install', u'/dev/vdc', '-v']
leapp.workflow.RPMUpgrade.update_grub_core: External command has started: ['grub2-install', u'/dev/vdc', '-v']
...
leapp.workflow.RPMUpgrade.update_grub_core: External command has finished: ['grub2-install', u'/dev/vdc', '-v']
leapp.workflow.RPMUpgrade.update_grub_core: External command has started: ['grub2-install', u'/dev/vdc', '-v']
...
leapp.workflow.RPMUpgrade.update_grub_core: External command has finished: ['grub2-install', u'/dev/vdc', '-v']

The system then properly boots and the upgrade continues smoothly.

@matejmatuska matejmatuska marked this pull request as ready for review June 30, 2023 09:25
@matejmatuska
Member Author

Re-tested on RHEL 7.9->8.8 and 8.8->9.2 with the same disk configuration and results as in my previous comment above.

This should still be up to date.

Member

@pirat89 pirat89 left a comment

@matejmatuska I have discussed the PR with @pholica and we realized that it actually fixes different things than those it refers to, and the commit message should also be different. Let's sync about it later.

@pholica

pholica commented Jul 4, 2023

Going through the code quickly, it looks like this should fix the situation when /boot is on RAID1 on a BIOS system with an MBR partition table. However, as @pirat89 said, that is neither the situation of https://en.wikipedia.org/wiki/BIOS_boot_partition (the Jira task referenced by this PR) nor /boot and /boot/efi being on RAID1 on UEFI (which are the other parts of "SW RAID support" that need to be implemented to fix the referenced bug).

It sounds to me like the only problematic thing about this PR (please correct me if I'm wrong) is the bug/issue references and the lack of mention of BIOS and MBR in the description.

@matejmatuska
Member Author

matejmatuska commented Jul 4, 2023

> Going through the code quickly, it looks like this should fix the situation when /boot is on RAID1 on a BIOS system with an MBR partition table. However, as @pirat89 said, that is neither the situation of https://en.wikipedia.org/wiki/BIOS_boot_partition (the Jira task referenced by this PR) nor /boot and /boot/efi being on RAID1 on UEFI (which are the other parts of "SW RAID support" that need to be implemented to fix the referenced bug).
>
> It sounds to me like the only problematic thing about this PR (please correct me if I'm wrong) is the bug/issue references and the lack of mention of BIOS and MBR in the description.

@pholica @pirat89 I see the problem now; for a moment I missed the fact that there are GPT disks in the BZ. Yes, this should solve the problem for MBR-partitioned disks on BIOS, and I will update the description. However, it seems the BZ reporter fixed the bug in a similar fashion, so I will also do some testing with GPT disks and see if we can fix that here too.

@pirat89
Member

pirat89 commented Jul 4, 2023

> Going through the code quickly, it looks like this should fix the situation when /boot is on RAID1 on a BIOS system with an MBR partition table. However, as @pirat89 said, that is neither the situation of https://en.wikipedia.org/wiki/BIOS_boot_partition (the Jira task referenced by this PR) nor /boot and /boot/efi being on RAID1 on UEFI (which are the other parts of "SW RAID support" that need to be implemented to fix the referenced bug).
> It sounds to me like the only problematic thing about this PR (please correct me if I'm wrong) is the bug/issue references and the lack of mention of BIOS and MBR in the description.
>
> @pholica @pirat89 I see the problem now; for a moment I missed the fact that there are GPT disks in the BZ. Yes, this should solve the problem for MBR-partitioned disks on BIOS, and I will update the description. However, it seems the BZ reporter fixed the bug in a similar fashion, so I will also do some testing with GPT disks and see if we can fix that here too.

For GPT, it would need more work, as currently we do not even have detection of GRUB for GPT-partitioned disks at all.
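For context, detection on BIOS/MBR disks can be illustrated by a heuristic that looks for GRUB's signature in the first sector. This is only a simplified sketch, not leapp's actual detection code:

```python
def looks_like_grub(device_path):
    """Heuristic sketch: on BIOS/MBR disks the GRUB boot image embeds
    the ASCII string 'GRUB' within the first sector (512 bytes).
    Simplified illustration only, not leapp's real implementation."""
    with open(device_path, "rb") as f:
        return b"GRUB" in f.read(512)
```

On GPT disks the core image lives in a dedicated BIOS boot partition instead, which is why a different detection approach would be needed there.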

@matejmatuska
Member Author

matejmatuska commented Jul 4, 2023

@pirat89 can you please address the TODOs in the PR description? I am not sure what the process is there.

It makes more sense to me to remove them now, but I just want to be sure.

@matejmatuska matejmatuska changed the title Update Grub on component devices if /boot is on md device Update Grub on component MBR drives if /boot is on mdraid device Jul 4, 2023
@matejmatuska matejmatuska force-pushed the update-grub-raid branch 2 times, most recently from 27d837f to 026ead8 Compare July 7, 2023 11:15
@pirat89
Member

pirat89 commented Jul 17, 2023

Manual testing for https://bugzilla.redhat.com/show_bug.cgi?id=2219544:

  • orig:
----------------------------------------------------------------------
Stamp: 2023-07-17T09:56:02.523801Z
Actor: scan_grub_device_name
Phase: FactsCollection
Type: GrubInfo
Message_data:
{
    "orig_device_name": "/dev/vdb"
}

  • new msg:
Actor: scan_grub_device_name
Phase: FactsCollection
Type: GrubInfo
Message_data:
{
    "orig_device_name": null, 
    "orig_devices": [
        "/dev/vdb", 
        "/dev/vda"
    ]
}
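The new message keeps the deprecated scalar field alongside the list. A consumer that handles both shapes could be sketched as follows (hypothetical helper; only the field names are taken from the messages above):

```python
def grub_devices(grub_info):
    """Prefer the new `orig_devices` list and fall back to the
    deprecated scalar `orig_device_name` for older producers
    (hypothetical consumer code, not leapp's actual API)."""
    devices = getattr(grub_info, "orig_devices", None)
    if devices:
        return list(devices)
    name = getattr(grub_info, "orig_device_name", None)
    return [name] if name else []
```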

@pirat89
Member

pirat89 commented Jul 17, 2023

@matejmatuska on normal systems it now produces 2 GrubInfo msgs, which is wrong. There must not be more than one:

######################################################################
                          PRODUCED MESSAGES                           
######################################################################
Stamp: 2023-07-17T11:00:21.381393Z
Actor: scan_grub_device_name
Phase: FactsCollection
Type: GrubInfo
Message_data:
{
    "orig_device_name": "/dev/vda", 
    "orig_devices": [
        "/dev/vda"
    ]
}
----------------------------------------------------------------------
Stamp: 2023-07-17T11:00:21.355611Z
Actor: scan_grub_device_name
Phase: FactsCollection
Type: GrubInfo
Message_data:
{
    "orig_device_name": "/dev/vda", 
    "orig_devices": [
        "/dev/vda"
    ]
}
######################################################################

Update: this is a bug in the framework: oamg/leapp#836. In the meantime, let's check whether the file exists manually. Proof that it works now:

######################################################################
                          PRODUCED MESSAGES                           
######################################################################
Stamp: 2023-07-17T16:55:53.879339Z
Actor: scan_grub_device_name
Phase: FactsCollection
Type: GrubInfo
Message_data:
{
    "orig_device_name": "/dev/vda", 
    "orig_devices": [
        "/dev/vda"
    ]
}
######################################################################

@matejmatuska matejmatuska changed the title Update Grub on component MBR drives if /boot is on mdraid device Update Grub on component devices if /boot is on mdraid device Jul 17, 2023
Previously, the check was implemented using the OSError raised by the `run`
function. However, in this particular case that is not safe and leads
to unexpected behaviour. Instead, check the existence of the file explicitly
before the `run` function is called.

Update the existing unit tests and extend the test case for when mdadm
is not installed.
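A minimal sketch of the approach described in the commit message; the default mdadm path is an assumption, and the helper name is hypothetical:

```python
import os

def has_mdadm(path="/usr/sbin/mdadm"):
    """Check for the mdadm binary explicitly instead of interpreting
    any OSError raised by run() as "mdadm is not installed".
    Sketch only; the path default is an assumption."""
    return os.path.exists(path)
```

Callers can then skip the RAID scan entirely when the binary is absent, rather than catching an exception that may have an unrelated cause.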
@pirat89
Member

pirat89 commented Jul 17, 2023

Tested manually, works as expected. Merging.

Member

@pirat89 pirat89 left a comment

LGTM and works (see my comments)

@pirat89 pirat89 merged commit 2e85af5 into oamg:master Jul 17, 2023
11 of 15 checks passed
@pirat89 pirat89 added the bug Something isn't working label Jul 17, 2023
@pirat89 pirat89 added this to the 8.9/9.3 milestone Jul 17, 2023
@pirat89 pirat89 added the changelog-checked The merger/reviewer checked the changelog draft document and updated it when relevant label Jul 17, 2023
@matejmatuska matejmatuska added the deprecation Any change in the set of deprecation functionality. label Aug 23, 2023
pirat89 added a commit to pirat89/leapp that referenced this pull request Aug 23, 2023
pirat89 added a commit to pirat89/leapp-repository that referenced this pull request Aug 23, 2023
## Packaging
- Requires leapp-framework 5.0

## Upgrade handling
### Fixes
- Add el8toel9 actor to handle directory -> symlink with ruby IRB. (oamg#1076)
- Do not try to update GRUB core on IBM Z systems (oamg#1117)
- Fix failing upgrades with devtmpfs file systems specified in FSTAB (oamg#1090)
- Fix the calculation of the required free space on each partitions/volume for the upgrade transactions (oamg#1097)
- Fix the generation of the report about hybrid images (oamg#1064)
- Handle correctly the installed certificates to allow upgrades with custom repositories using HTTPs with enabled SSL verification (oamg#1106)
- Minor improvements and fixes of various reports (oamg#1066, oamg#1067, oamg#1085)
- Update error messages about leapp data files to inform user how to obtain valid data files (oamg#1121)
- Update links in various reports (oamg#1062, oamg#1086)
- Update the repomap data to cover changed repoids in RHUI Azure (oamg#1087)
- [IPU 7 -> 8] Fix false positive report about invalid symlinks on RHEL 7 (oamg#1052)
- [IPU 8 -> 9] Inhibit the upgrade when unsupported x86-64 microarchitecture is detected (oamg#1059)

### Enhancements
- Include updated leapp data files in the RPM (oamg#1046, oamg#1092, oamg#1119)
- Update the set of supported upgrade paths (oamg#1077):
  - RHEL with SAP HANA 7.9 -> 8.6, 8.8 (default: 8.6)
  - RHEL with SAP HANA 8.8 -> 9.2
- Introduce new upgrade paths:
  - RHEL 7.9 -> 8.9 (default)
  - RHEL 8.9 -> 9.3
- Correctly update grub2 when /boot resides on multiple devices aggregated in RAID (oamg#1093, oamg#1115)
- Enable upgrades for machines using RHUI on AlibabaCloud (oamg#1088)
- Introduce possibility to add kernel drivers to initramfs (oamg#1081)
- Redesign handling of information about kernel (booted and target) in preparation for new changes in RHEL 9 (oamg#1107)
- Redesign source system overlay to use disk images backed by sparse files to optimize disk space consumption (oamg#1097, oamg#1103)
- Requires leapp-framework 5.0 (oamg#1061, oamg#1116)
- Use new leapp CLI API which provides better report summary output (oamg#1061, oamg#1116)
- [IPU 8 -> 9] Detect and report use of deprecated Xorg drivers (oamg#1078)
- [IPU 8 -> 9] Introduce IPU for systems with FIPS enabled (oamg#1053)

## Additional changes interesting for devels
- Deprecated `GrubInfo.orig_device_name` field in the `GrubInfo` model (replaced by `GrubInfo.orig_devices`) (oamg#1093)
- Deprecated `InstalledTargetKernelVersion` model (replaced by `InstalledTargetKernelInfo`) (oamg#1107)
- Deprecated `leapp.libraries.common.config.version.is_rhel_realtime` (check the type in msg `KernelInfo`, field `type`) (oamg#1107)
- Deprecated `leapp.libraries.common.grub.get_grub_device()` (replaced by `leapp.libraries.common.grub.get_grub_devices()`) (oamg#1093)
- Introduced new devel envar LEAPP_DEVEL_KEEP_DISK_IMGS=1 to skip the removal of the created disk images for OVL. That's sometimes handy for the debugging. (oamg#1097)
@pirat89 pirat89 mentioned this pull request Aug 23, 2023
pirat89 added a commit to pirat89/leapp that referenced this pull request Aug 23, 2023
Rezney pushed a commit that referenced this pull request Aug 23, 2023