ext4: invalid extent block on imx7

Wed Mar 25 21:17:39 CET 2020

On 25.03.20 21:01, Stephen Warren wrote:
> On 3/25/20 1:11 PM, Jan Kiszka wrote:
>> On 25.03.20 16:00, Tom Rini wrote:
>>> On Wed, Mar 25, 2020 at 07:32:30AM +0100, Jan Kiszka wrote:
>>>> On 20.03.20 19:21, Tom Rini wrote:
>>>>> On Mon, Mar 16, 2020 at 08:09:53PM +0100, Jan Kiszka wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> => ls mmc 0:1 /usr/lib/linux-image-4.9.11-1.3.0-dirty
>>>>>> CACHE: Misaligned operation at range [bdfff998, bdfffd98]
>>>>>> CACHE: Misaligned operation at range [bdfff998, bdfffd98]
>>>>>> CACHE: Misaligned operation at range [bdfff998, bdfffd98]
>>>>>> CACHE: Misaligned operation at range [bdfff998, bdfffd98]
>>>>>> invalid extent block
>>>>>>
>>>>>> I'm using master (50be9f0e1ccc) on the MCIMX7SABRE, defconfig.
>>>>>>
>>>>>> What could this be? The filesystem is fine from Linux POV.
>>>>>
>>>>> Use tune2fs -l and see if there's any new'ish features enabled that we
>>>>> need some sort of check-and-reject for would be my first guess.
>>>>>
>>>>
>>>> Here are the reported feature flags:
>>>>
>>>> has_journal ext_attr resize_inode dir_index filetype extent 64bit
>>>> flex_bg
>>>> sparse_super large_file huge_file dir_nlink extra_isize metadata_csum
>>>
>>> Of that, only metadata_csum means that you can't write to that image,
>>> but you're just trying to read and that should be fine.  Can you go back
>>> in time a little and see if this problem persists or if it's been
>>> introduced of late?  Or recreate it on other platforms/SoCs?  Thanks!
>>>
>>
>> Bisected, regression of d5aee659f217 ("fs: ext4: cache extent data").
>> Reverting this commit over master resolves the issue.
>>
>> Any idea what could be wrong? What I noticed is that the extent has a
>> zeroed magic when things go wrong, so maybe it is falsely considered to
>> be cached?
> 
> This is puzzling. I took another look at that patch and I don't see
> anything wrong. My guess would be:
> 
> - Some unrelated memory corruption bug was exposed simply because this
> patch uses dynamic memory or stack slightly differently than before.
> 
> - Something writes to the cached block, whereas the cache code assumes
> the buffer is read-only.
> 
> The cache metadata exists on the stack and so only lasts for the
> duration of read_allocated_block() or ext4fs_read_file(), so there's no
> issue with re-using the cache across different devices, or persisting
> across an ext4 write operation or anything like that. Is this easy to
> reproduce; is there a small disk image that shows the problem?
> 

Found it: an alignment issue, apparently surfaced by your change when
it switched from zalloc (which does cacheline alignment?) to malloc.
Could this sensitivity be SoC-specific?
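
To illustrate the suspicion, here is a minimal host-side sketch, not the
actual U-Boot code. It assumes a 64-byte cache line (in U-Boot the
relevant constant would be ARCH_DMA_MINALIGN); range_is_cache_aligned()
and alloc_cache_aligned_zeroed() are hypothetical names made up for this
example. The first helper checks a range the way the "CACHE: Misaligned
operation" warning presumably does, the second shows an allocation that
keeps buffers on cache-line boundaries, which plain malloc does not
guarantee.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Assumption: 64-byte cache lines; the real value is SoC-specific. */
#define CACHE_LINE 64u

static_assert((CACHE_LINE & (CACHE_LINE - 1)) == 0,
	      "cache line size must be a power of two");

/* Returns 1 if both ends of [start, end) lie on a cache-line boundary. */
static int range_is_cache_aligned(uintptr_t start, uintptr_t end)
{
	return (start % CACHE_LINE == 0) && (end % CACHE_LINE == 0);
}

/* Hypothetical helper: zeroed buffer aligned to a cache line. */
static void *alloc_cache_aligned_zeroed(size_t size)
{
	/* aligned_alloc() wants the size to be a multiple of the alignment. */
	size_t rounded = (size + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;
	void *p = aligned_alloc(CACHE_LINE, rounded);

	if (p)
		memset(p, 0, rounded);
	return p;
}
```

With these assumptions, the range from the log above, [bdfff998,
bdfffd98], fails the check because neither end is a multiple of 64,
while any buffer returned by alloc_cache_aligned_zeroed() starts on a
cache-line boundary.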

Jan

-- 
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux