Correcting ext3 File Corruption?

Posted by Cliff on Wednesday July 24, 2002 @08:16PM from the has-anyone-else-seen-this-before dept.

An anonymous reader asks: "I am looking for an ext2/ext3 expert. I have a small file (1395 bytes) that appears HUGE when running ls -l (70368744179059 bytes [yes, that's 70 terabytes]). This causes a problem because tar wants to back up all those extra bytes. We have backups of the file elsewhere, but I'm afraid to delete it. When I remove it, what is going to happen to the file system? (Kernel version is 2.4.18 on i686.) This seems to be a pretty bad math error on the part of the file system. This is a really weird error, but could just be the issue of a corrupted sector on the drive. Has anyone else seen this before and have any ideas as to whether such files can be recovered? Is this problem just a small glitch or an omen of an impending filesystem crash?

"Here's what the files look like on the system:

[root@secure parse]# ls -l HTMLFrameSet.class
-rw-rw-r-- 1 root devel 70368744179059 Mar 20 09:05 HTMLFrameSet.class

[root@secure parse]# wc HTMLFrameSet.class
15 58 1395 HTMLFrameSet.class

...and the error message from tar:

tar: HTMLFrameSet.class: File shrank by 70368744169331 bytes; padding with zeros

No wonder my backups didn't finish! :-)"

fsck by SpatchMonkey · 2002-07-24 20:24 · Score: 1, Informative

Does the fsck.ext3 program help at all?
And when you run "fsck"? by Zocalo · 2002-07-24 20:33 · Score: 5, Informative

Since EXT3 is just EXT2 with a journal tacked on, there is no reason why you can't run the EXT2 fsck utility across it in the normal way. You are obviously worried about losing the entire file system, so you probably want to start by running fsck with the verbose (-V) and interactive (-r) options to see exactly what is going on and have the ability to prevent unwanted changes being made.
Since you appear to use tar for backups, you could also back up the affected filesystem using the exclude (-X [filename]) option first, which might be a *really* good idea. ;)

--
UNIX? They're not even circumcised! Savages!
1. Re: And when you run "fsck"? by Omniscient+Ferret · 2002-07-24 23:04 · Score: 2, Informative
  
  I'd like to add two things:
  You can back up the drive image too, so if the file is irreplaceable and corrupted, you can try more than one recovery method safely.
  Also, to fsck /, "touch /forcefsck" and reboot.
2. Re:And when you run "fsck"? by Linux_ho · 2002-07-25 03:11 · Score: 3, Informative
  
  I'd like to add that fsck is ext3-aware. If the journal looks OK, it might not actually check the filesystem unless you tack on the -f option to force the issue.
  
  --
  include $sig;
  1;
Before you try to recover.... by bartjan · 2002-07-24 20:36 · Score: 2, Informative

Make a copy of the /dev device itself, if you have the space for that on another partition.
Then use that backup-file to try out whatever other posters here suggest.
Re:that one is EASY to fix... by linzeal · 2002-07-24 20:57 · Score: 1

Dude I just backed up our entire network!

--
An Education is the Font of All Liberty
Re:Try accessing from another machine.... by transiit · 2002-07-24 21:04 · Score: 1

This is the right idea, but a lot more complex than it needs to be. Considering that the correct file size is already known, you should be able to use dd to specify how many blocks actually get written. So the steps are 'make a backup with dd, specifying length', 'test the backup thoroughly with whatever created it or can validate it', 'once you're certain the original is no longer needed, get rid of it'.

problem solved?

-transiit
Re:Try accessing from another machine.... by transiit · 2002-07-24 22:11 · Score: 1

no, not really. 1395 bytes? dd it off to a floppy drive, or another partition, or some other removable media. Where's the risk?

-transiit
Have you contacted SCT? The Creator of Ext3? by LWolenczak · 2002-07-24 22:44 · Score: 2

Have you contacted SCT? The Creator of Ext3?
dd by Yarn · 2002-07-24 22:46 · Score: 3, Interesting

As you know how long it is supposed to be:

dd if=[file] of=[new file] bs=1 count=[length]

I strongly suggest rebuilding the affected filesystem, that kinda weirdness can be indicative of deeper problems.
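
For anyone who wants to try the dd approach on a throwaway file first, here is a minimal sketch. It assumes a POSIX shell with mktemp available; the temporary directory and the ".rescued" file name are made up for illustration, and the input is fabricated zero bytes, not the real class file. Only the 1395-byte length comes from the original post.

```shell
# Work in a scratch directory so nothing real is touched.
workdir=$(mktemp -d)
src="$workdir/HTMLFrameSet.class"          # stand-in for the damaged file
dst="$workdir/HTMLFrameSet.class.rescued"  # hypothetical output name
length=1395                                # true size, as counted by wc

# Fabricate 4096 bytes of input so there is something longer than $length.
dd if=/dev/zero of="$src" bs=1 count=4096 2>/dev/null

# Copy exactly $length bytes: block size 1, count = number of bytes wanted.
dd if="$src" of="$dst" bs=1 count="$length" 2>/dev/null

# Check the size of the copy before trusting it.
copied=$(wc -c < "$dst" | tr -d ' ')
echo "copied $copied bytes"

rm -rf "$workdir"
```

With bs=1 the count is simply the byte length, which is slow for big files but harmless for something this small. On the real system you would compare the copy against a known-good backup before removing anything.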

--
-Yarn - Rio Karma: Excellent
"make test" in Perl builds used to do this.. by pedro · 2002-07-24 22:49 · Score: 2

IE: creating a nonexistent HUGE file that normal measures would not delete.
Try this:
cat /dev/null > /pathto/peskyfile
worked for me in vanilla ext2.
Should (?!?) work in ext3.

--
Brak: What's THAT?
Thundercleese: A light switch.. of TOTAL DEVASTATION!
1. Re:"make test" in Perl builds used to do this.. by unitron · 2002-07-25 09:17 · Score: 2
  
  For a moment I misread that as Internet Explorer having created a nonexistent file, and as I'm constantly mistaking IE for (File) Explorer (giving two different programs the same name may not be Gates's greatest sin, but it's right up near the top of the list) it came as no surprise to me that such a thing could happen.
  
  --
  I see even classic Slashdot is now pretty much unusable on dial up anymore.
Sparse file? by Tony-A · 2002-07-24 23:01 · Score: 3, Informative

from man tar
-S, --sparse
handle sparse files efficiently

I'm not really familiar with them, but haven't seen any other mention here.
I know it's possible to put a file on a floppy that won't fit on your hard drive.
1. Re:Sparse file? by the+way,+what're+you · 2002-07-26 02:21 · Score: 1
  
  cat $FILE > $NEWFILE
  Does Redhat not ship with 'cp' anymore?
  
  --
  example.org - powered by Linux!
2. Re:Sparse file? by Kredal · 2002-07-27 23:30 · Score: 2
  
  cp would probably preserve the mega-huge filesize, so you wouldn't get anywhere. cat file > newfile *should* ignore all the empty space after the 1500 or so bytes that he wants to keep.
  
  Of course, I have no way to test this right now, so I leave it as an exercise to the reader. (:
  
  --
  Whoever stated that signature sizes should be limited to one hundred and twenty characters can just go ahead and kiss my
hex by Merlin42 · 2002-07-24 23:28 · Score: 5, Insightful

I don't know enough about filesystems to say what the implications are but:
the reported size in hex is
0x400000000573
and the actual size in hex is
0x573

Looks like a single extra bit got flipped when the size was stored.
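
The arithmetic is easy to check from a shell prompt. This is a small sketch that assumes the shell does 64-bit integer arithmetic (true for bash and dash on typical 64-bit Linux systems); the variable names are mine, and the two sizes are the ones from the original post.

```shell
reported=70368744179059   # size shown by ls -l
actual=1395               # size counted by wc

printf 'reported: 0x%x\n' "$reported"
printf 'actual:   0x%x\n' "$actual"

# XOR keeps only the bits that differ between the two sizes.
diff_bits=$(( reported ^ actual ))
printf 'differing bits: 0x%x\n' "$diff_bits"

# A value with exactly one bit set satisfies x & (x - 1) == 0.
if [ $(( diff_bits & (diff_bits - 1) )) -eq 0 ]; then
    echo "exactly one bit differs"
fi
```

The XOR comes out as 0x400000000000, which is 2^46, so the two sizes differ only in bit 46 (counting from zero).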

--
Thoughts on tech, Software Engineering, and stuff
1. Re:hex by psychosis · 2002-07-25 02:43 · Score: 1
  
  Wow.... That's really cool.
  Mad props to thinking in Hex. I have a hard enough time getting by in decimal.
This is a sparse file.... by weave · 2002-07-25 00:06 · Score: 5, Informative

It has holes in it. We once ran a medical package 10 years ago that did this on purpose. A 40 gig file took about 4 megs on disk.
This is easy to simulate by writing a small program that scribbles a few bytes to offset zero, then does an fseek out to some insanely high offset, then scribbles a few bytes there. Close, do an ls, see the huge file, but then note it only takes the space of two blocks on your file system. Imagine the fun you can have with this trick at parties!
Every UNIX file system I've ever dealt with handles this the same way.
tar and other programs should have switches to deal with sparse files correctly.
If you're concerned about what's in it, cat it to od. I believe od is smart enough to collapse zero blocks in its display. That way you can see if there is any real data at some pointer far into the file.
If this is a commercial closed-source package where you can't verify what it's doing, I'd strongly suggest leaving it alone and contacting the vendor to see if this behavior is normal.
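
The simulation described above does not even need a compiled program; dd can do the seek. This is a rough sketch assuming a POSIX shell, mktemp, and a filesystem that supports holes. The 100 MiB offset and the file name are arbitrary choices for illustration.

```shell
workdir=$(mktemp -d)
f="$workdir/sparse.demo"

# Scribble a few bytes at offset zero.
printf 'head' > "$f"

# Seek 100 MiB into the file and scribble a few more bytes there.
# conv=notrunc keeps the bytes already written at the start.
offset=104857600
printf 'tail' | dd of="$f" bs=1 seek="$offset" conv=notrunc 2>/dev/null

apparent=$(wc -c < "$f" | tr -d ' ')   # logical length, as ls -l would show
on_disk_kb=$(du -k "$f" | cut -f1)     # space actually allocated, in KiB
echo "apparent size: $apparent bytes"
echo "allocated:     $on_disk_kb KiB"

rm -rf "$workdir"
```

On a filesystem with sparse file support the allocated figure should be tiny compared with the apparent size. Note that wc counts the full apparent length here, because the holes read back as zero bytes.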
1. Re:This is a sparse file.... by ivan256 · 2002-07-25 03:27 · Score: 4, Interesting
  
  While what you say about sparse files is generally true, that's probably not what this is. This is probably a single bit error on this guy's hard drive. There's probably more of them, but he noticed this one because it popped up in a noticeable location. The hard drive is probably on the way out, or he's got some faulty memory (if it's ECC, otherwise this could just be a fluke).
  
  Tar does deal with sparse files correctly, and if this were one, he wouldn't be having trouble.
2. Re:This is a sparse file.... by qurob · 2002-07-25 05:22 · Score: 1
  
  The hard drive is probably on the way out
  
  Well, he did say it was an IBM Laptop....go figure
3. Re:This is a sparse file.... by n9hmg · 2002-07-25 06:47 · Score: 2, Informative
  
  He demonstrated that it was not a sparse file, by using the wc command on it. A sparse file treats all the empty space as nulls on reading, so he would have gotten the big size if it were sparse. It's a single-bit error, probably bad media that got past the ECC on the drive, but maybe just a plain corruption. I'd suggest copying it somewhere safe and running an fsck, if you can afford the downtime.
what could have happened ? by phanki · 2002-07-25 01:06 · Score: 1

The author was reporting that the size of the file is all bloated up. Of course there was a reply explaining how it can be done. But can someone reflect on the forensics and guess what *could* be the reason that this particular file got bloated. Any pennies for the thoughts ;-)
Try the mailing list by Outland+Traveller · 2002-07-25 02:12 · Score: 5, Informative

Why don't you try the ext3 mailing list instead of Ask Slashdot? I lurk on the list and I've seen a number of questions extremely similar to yours, with answers. The list gurus will even help you track down the problem.

https://listman.redhat.com/pipermail/ext3-users/2002-July/thread.html#383
another ext3 question by superid · 2002-07-25 03:42 · Score: 4, Funny

I know this isn't an ext3 help channel...but I haven't gotten a satisfactory answer elsewhere (usually it just consists of a "*shrug*" on the #redhat channel)

I've got a thinkpad running RH 7.3 with two ext3 partitions. Being a laptop it has occasionally had its batteries die and been shut down improperly. Invariably, there has been a subsequent long fsck .... long....like 10 minutes....once I even was dropped to the maintenance shell to run it manually (yes, yes, yes, yes, yes, yes, yes...when the hell would I *NOT* want to fix the non-zero dtime????)

Isn't the whole point of ext3 so I don't have to go through this pain? This was an extremely generic installation of 7.3, why am I seeing no benefit to ext3?

Thx,

SuperID
1. Re:another ext3 question by Anonymous Coward · 2002-07-25 07:23 · Score: 1
  
  It could be that when your laptop dies, the HDD loses what is in its cache. That would be a drive settings type error - NOT an EXT3 error. Check your drive's specs.
  
  I've had several experiences with power outages due to storms and id10ts blowing breakers. I've never once had an issue with EXT3 - every system started right back up, no prob. (And, yes, we now have UPSes. ;)
  
  Here's an idea - shut down your machine just before the battery dies. Or call IBM and tell them they need to replace your battery...
2. Re:another ext3 question by superid · 2002-07-25 12:57 · Score: 2
  
  Yes, ext3 was (incorrectly) a module rather than statically in the kernel........thx to everyone !
Lossy compression by MrResistor · 2002-07-25 03:43 · Score: 2

What was that lossy compression scheme mentioned a while back? lzip, I think? Sounds like that's what you need here...

--
Under capitalism man exploits man. Under communism it's the other way around.
ext3 by ldexter · 2002-07-25 03:49 · Score: 1

Search the ext3-users list archive, I'm sure I've seen this reported before.

--
Hello world!
EXT3 has failed me as well. by SaDan · 2002-07-25 03:50 · Score: 1

EXT3 journaling is a joke. I've had RH 7.2 workstations that lost power lose an entire filesystem, just because they weren't shut down properly.

This has happened more than once too... I can't believe people actually use EXT3, and think their data is safe.

Where I work, we have machines running XFS, JFS, EXT3, and ReiserFS. EXT3 is the only filesystem we have problems with.

I especially like the 1.5 hour long fsck runs on one machine with its 120gig data partition.
1. Re:EXT3 has failed me as well. by crisco · 2002-07-25 04:47 · Score: 2
  
  A developer I'm working with on a RH7.3 system had the following to say after seeing the ext3 filesystem perform a fsck after a dirty shutdown:
  I looked into the kernel and noticed the ext3 module wasn't statically compiled into the kernel!
  No word yet on whether or not that solved things...
  
  --
  Bleh!
2. Re:EXT3 has failed me as well. by XO · 2002-07-25 05:44 · Score: 1
  
  yes, if you have EXT3 built as a module rather than compiled into the kernel directly, and your root partition is ext3, then it will fall back to ext2 (which presumably you DO have compiled in, otherwise the boot would fail completely). Ext3/Ext2 compatibility is a good thing, but any system that's ALWAYS performing a full fsck on dirty shutdown is probably not actually loading with ext3.
  or perhaps your journal is screwed up, and you might need to rebuild it with -whatever command it is to rebuild the journal- .. i've been using ext3 on my webserver box since it became part of the kernel, and have had MANY power failures since then, and have not had ONE full fsck ..
  
  I'm also going to go along with the hypothesis that likely one bit on this guy's drive is screwed up, and that he should probably back up everything but that file (and this way he would also find any other files that might be affected in a similar way), do a full fsck, and perhaps even completely reformat that partition, doing a bad block check.
  
  --
  "Champagne for my real friends - and real pain for my sham friends!" http://ericblade.postalboard.com/
3. Re:EXT3 has failed me as well. by Rheingold · 2002-07-25 06:06 · Score: 1
  
  This isn't correct. The ext3 module should be in the initrd, which means it doesn't need to be statically compiled in for the initial rootfs mounting. It may be that the mkinitrd isn't adding the module as it should.
  
  --
  Wil
  wiki
4. Re:EXT3 has failed me as well. by haplo21112 · 2002-07-25 07:07 · Score: 2
  
  I too have been running ext3 since it became part of the kernel, actually since it was fairly stable before it got into the kernel proper, around 2.4.12 I think...I have not once had an issue with it since I began using it, and I have the machine go down when power goes out at least once every month or two...ext3 even saved my ass, when after installing a new drive and then having the controller decide on the first boot after install that it didn't like the drive. I had already put a large amount of data on the drive before that boot...the system came up and spewed errors all over...I rebooted, put it on a different controller in the system, all came up well, ext3 recovered from the journal and all was well, NO data loss...I asked a friend if the same would have happened with ext2 and he said he doubted it...the journal was the saving grace in this situation.
  
  --
  Power Corrupts,Absolute Power Corrupts Absolutely, leaving one person(group)in charge is absolutely corrupt.
5. Re:EXT3 has failed me as well. by XO · 2002-07-25 09:01 · Score: 1
  
  care to explain how it might access the init if ext3 isn't compiled in?
  
  --
  "Champagne for my real friends - and real pain for my sham friends!" http://ericblade.postalboard.com/
6. Re:EXT3 has failed me as well. by unitron · 2002-07-25 09:32 · Score: 2
  
  "...and my wife is too cheap to let me get a UPS..."
  Is she too cheap to let you get life insurance? Medical? Comprehensive on the car? If not, explain to her that protection from data loss or not having to reboot after a power failure or glitch is just a fringe benefit, the real reason for the UPS is that it protects your expensive-to-replace electronic equipment from damage due to the electrical, thermal, and mechanical shock caused by glitchy power.
  You can probably convince her that you need a second one for the TV and VCR.
  
  --
  I see even classic Slashdot is now pretty much unusable on dial up anymore.
7. Re:EXT3 has failed me as well. by SaDan · 2002-07-25 16:13 · Score: 1
  
  Yeah... did all of that (bad block checking, swapped in new drives, yadda). All drives are 100% functional.
  
  It's not the controllers, it's not the cables. These drives all ran EXT2 just fine for months. EXT3 just can't handle the amount of data we're mashing through this machine.
8. Re:EXT3 has failed me as well. by Rheingold · 2002-07-25 20:23 · Score: 1
  
  What are you talking about? It's initrd, loaded by the boot-loader, not /sbin/init.
  
  --
  Wil
  wiki
Just delete it. by Tom7 · 2002-07-25 04:12 · Score: 2

Oh, come on, be a man. Backup if you need to, and delete the thing.
I've seen this and a WARNING by Rheingold · 2002-07-25 05:58 · Score: 1

I've seen this. In my case, it was fixed by unmounting and mounting the filesystem again. I've also seen files that one command (like find or rm -rf) would see as a directory and another would see as a file. I don't understand how there can be differences, given that they should all be using the same C library interfaces. These have always been recoverable, however.

Also, I experienced something considerably more distressing: data corruption. After reading the benchmarks comparing ReiserFS and ext3 mounted with 'data=ordered' and 'data=writeback', I decided to try writeback mode. It seemed okay for a while, but lately because of the heat my computer has been shutting itself down. Once I came back and found that after hitting the reset button, my Mozilla bookmarks were reduced to a small portion of what they ought to have been. An image I had been working on and saved had been replaced by the content of several e-mail messages. rxvt would no longer start correctly from the KDE panel, even though checking through the properties it looked okay. I re-added the button and it started correctly. There were other things awry too, and probably things I haven't found.

I was using the "official" kernel from Red Hat for 7.3, 2.4.18-5. In summary, DO NOT USE data=writeback for now.

--
Wil
wiki
1. Re:I've seen this and a WARNING by Rheingold · 2002-07-25 06:09 · Score: 1
  
  I should add that this is a SCSI drive, not a funky IDE drive with a non-disableable (!!) write cache.
  
  --
  Wil
  wiki
2. Re:I've seen this and a WARNING by mmontour · 2002-07-25 07:05 · Score: 1
  
  After reading the benchmarks comparing ReiserFS and ext3 mounted with 'data=ordered' and 'data=writeback', I decided to try writeback mode. [...] An image I had been working on and saved had been replaced by the content of several e-mail messages. rxvt would no longer start correctly from the KDE panel, even though checking through the properties it looked okay.
  
  Um, yes, that's what Writeback does. From the mount(8) manpage:
  
  Data ordering is not preserved - data may be written into the main file system after its metadata has been committed to the journal. This is rumoured to be the highest-throughput option. It guarantees internal file system integrity, however it can allow old data to appear in files after a crash and journal recovery.
  
  BTW, I've had the same thing happen to me on Reiserfs.
3. Re:I've seen this and a WARNING by Rheingold · 2002-07-25 07:15 · Score: 1
  
  Yes, I know. The thing was, though, that much of this data should have already been committed--the image I saved 10 minutes or so before I left, which means it should have been flushed from the cache. I can understand volatile data like my bookmarks being lost, but not the image file.
  
  --
  Wil
  wiki
You've got it backed up - by qurob · 2002-07-25 06:48 · Score: 1

So you're about 1 step ahead of 90% of the rest of the world.

Your next step is to blow the disk away and restore.

By the time you get a coherent answer from us, you'd be back up and running.

Alternatively, if you bought the retail version of RedHat you could call them, or there's always the free newsgroups and messageboards. Give them a shot.
Deletion question by obtuse · 2002-07-25 08:11 · Score: 3, Interesting

I think you're right about the flipped bit. Copy the file with dd, specifying the right output size.

I'd bet there are problems with the whole filesystem, but to continue with what he asked:

It seems to me that he should be able to rm the file without any worries, after making a good copy. Only the inode that points to the falsely enlarged file will be removed, and the data blocks won't be touched, right?

If there is other data in the misallocated blocks, that dat should either have its own references, or it's already as good as deleted anyway.

--
Assembly is the reverse of disassembly.
Re:another solution by unitron · 2002-07-25 09:11 · Score: 2

Apparently one man's irony is another man's flamebait.

--
I see even classic Slashdot is now pretty much unusable on dial up anymore.
Sure it's ext3? by salimma · 2002-07-25 11:25 · Score: 1

I don't mean to be patronising, and apologies beforehand, but a rather common problem with people who upgrade to RH 7.2/7.3 is that their partitions have not actually been converted to ext3.
This will certainly explain why it fsck'es all the time after reboots - run 'mount' without any parameter and check /proc/mounts (I think - not in front of Linux right now) and see if they both say ext3?
Hope that helps,
Michel
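
A quick way to do that check, as a rough sketch: the helper name show_fs_types is made up, and the sample table with its /dev/hda devices is hypothetical data standing in for a real /proc/mounts so the snippet can be run anywhere.

```shell
# Print "mountpoint fstype" for each line of a mounts table.
# Defaults to /proc/mounts; another file can be passed for testing.
show_fs_types() {
    awk '{ print $2, $3 }' "${1:-/proc/mounts}"
}

# Hypothetical sample table in /proc/mounts format.
sample=$(mktemp)
cat > "$sample" <<'EOF'
/dev/hda1 / ext2 rw 0 0
/dev/hda2 /home ext3 rw 0 0
EOF

result=$(show_fs_types "$sample")
echo "$result"
rm -f "$sample"
```

On a live system, calling show_fs_types with no argument reads /proc/mounts. A partition that was meant to be ext3 but shows up as ext2 there is being mounted without its journal.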

--
Michel
Fedora Project Contribut
Same type of thing happened to me... by Discordia · 2002-07-25 12:39 · Score: 1

A couple of years ago I had a brain fart where I was (humorously enough) making a boot floppy so I could convert my root fs from ext2 to ext3.

Instead of dd if=/tmp/imagefile.img of=/dev/fd0 bs=1440k,
I did dd if=/tmp/imagefile.img of=/dev/hda bs=1440k

Whoops. After restoring my MBR and partition table, I still had to deal with the fact that I overwrote the first 1438KB of my root filesystem with effectively random data.

e2fsck -y /dev/hda run about 10 times took care of the filesystem's integrity, but I still had about 200 "files" in /lost+found of all sorts of random sizes, names, and types (pipes, fifos, regular, dev entries). The problem was that when I'd try to perform a file operation on any of the files, the kernel would get pissed off saying that the file size was too large, since the inode had random data listed as the filesize, and the operation would fail.

The way I finally fixed it was by running debugfs and removing the file by hand. It's fairly straightforward, since debugfs has an interface similar to file navigation from a shell prompt (ls, cd, etc). Just navigate to the target directory and remove the inode listed (by ls) as the inode associated with the file in question. You probably want to run e2fsck one more time to be sure.

Happy ending: I'm still using the filesystem that dd stomped all over and luckily lost only a handful of unimportant files.

Hope this helps...

-Fat Fingers
funny. by GiMP · 2002-07-25 15:40 · Score: 2

With reiserfs, I had a file that would reboot the system if I read, wrote, or deleted the file. I rebuilt the journal and everything was ok. Imagine if it was a production system!

stick with a real filesystem, get a Sun, HP, IBM, or SGI and use their journaling filesystems.. you'll never want to use ext* again.
Use Ghost for backup before you touch it by Korth · 2002-07-26 06:43 · Score: 1

I recommend plugging in an extra hard drive, and using Norton Ghost, or one of the alternatives to back up the partition, before touching it. Since the filesystem is corrupt, you'll probably have to do a bit-to-bit copy for it to work.

Afterwards, you can do whatever experiments you want with it, and still be on the safe side.
1. Re:Use Ghost for backup before you touch it by BJH · 2002-07-28 18:57 · Score: 1
  
  Why use Ghost when dd will do the same job quite nicely?
have you tried.. by majorluser · 2002-07-26 11:57 · Score: 1

Try running ls -i (that's a small i). This will list the inodes of the files. You probably have some sort of corruption.
Re:that one is EASY to fix... by Kredal · 2002-07-27 23:35 · Score: 2

I back up my entire network nightly to /dev/null. It takes almost no time at all, and I don't otherwise use the 'mv' command nearly enough.

I read on the interweb that that's how you're supposed to do it... They wouldn't lie to me, would they?

--
Whoever stated that signature sizes should be limited to one hundred and twenty characters can just go ahead and kiss my