This page was exported from phaq
[ http://phaq.phunsites.net ] Export date: Tue Mar 28 23:55:34 2023 / +0000 GMT |
FreeBSD's UFS usually does an excellent job of preventing file system corruption. But even the best system messes up once in a while. One thing you may eventually stumble across are so-called mangled entries, which fsck usually cannot fix and which result in kernel panics upon access.

These are usually a sign of severe file system corruption, often caused by hardware faults such as bad memory modules, a faulty disk controller or even a defective hard drive. Consider checking and replacing your hardware if you encounter mangled entries repeatedly. You may well succeed in fixing the entry by following the steps outlined below, but it is very likely to happen again if the hardware is faulty. In the end you would be curing the symptoms and not the actual cause, which may in turn lead to other, even more critical problems.

On the other hand, if you see a corrupted file system like this only very rarely (as in "about once in a decade"; it happened to me only three times in 10 years of working on some 200-300 servers in total), you may risk fixing it with the file system debugger. I define this as a "minor corruption" to which the following usually applies:
When the error happens

A typical error message thrown at you in this case may look like this (some output omitted):
The message gives some essential information: the file system concerned (the actual mount point, not the device name itself) and the inode of the directory or file.

First steps in recovery

The next best thing to do in this situation is to reboot into single user mode. From there, have fsck inspect the device first.
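As a rough sketch of that first inspection: the helper name check_fs is hypothetical, and the device name /dev/da1s1a is only an assumption derived from the mount point in the message, so substitute your own device.

```shell
# Hypothetical helper: run fsck on one device from single user mode.
check_fs() {
    # Bail out early if the device node does not exist on this machine.
    if [ ! -e "$1" ]; then
        echo "device $1 not found, skipping fsck" >&2
        return 1
    fi
    # -f asks fsck to check the file system even if it is flagged clean.
    fsck -f "$1"
}

# Assumed device name; uncomment once it matches your system.
# check_fs /dev/da1s1a
```

Without -y, fsck asks before each repair, which suits a first look at a file system you do not yet trust.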
Since mangled entries are usually not fixed by fsck, the message "FILE SYSTEM MARKED CLEAN" should not be trusted. You may risk bringing your system back up without any further work; however, if it panics again with the same message (mind the inode number), you most likely have corruption that fsck cannot fix.

Optional but recommended: Try to crash the machine again

Personally, I always try to crash the system before I touch the file system with the debugger, though not without taking a current backup first if at all possible. The reasons for doing so are simple:
The easiest way to find the corrupted entry is to walk the directory structure. A simple command line like this usually works well enough. Run it from single user mode and only against the target device mounted read-only, to keep the impact to a minimum.
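A minimal sketch of such a walk, assuming the affected file system is mounted read-only at /mnt/da1s1a (the mount point from the message; adjust as needed). The helper name walk_tree is hypothetical. It prints each directory before listing it, so the last path on the console hints at where a panic occurred.

```shell
# Hypothetical helper: visit every directory below a mount point.
walk_tree() {
    # -xdev keeps find on the file system it started on.
    # -print shows the directory name, then ls reads and stats its
    # entries; the listing itself is discarded.
    find "$1" -xdev -type d -print \
        -exec sh -c 'ls -la "$1" >/dev/null' sh {} \;
}

# Assumed mount point; uncomment once it matches your system.
# walk_tree /mnt/da1s1a
```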
This will usually cause the system to panic again when accessing the corrupted directory. If it does not, this method may:
If this still doesn't work, you may mount the device read-write so that the aforementioned commands can actually touch the file system and update file access times. If even that fails, creating a dummy file inside each directory will do for sure:
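One way this could be sketched, again assuming the mount point /mnt/da1s1a and this time a read-write mount. The helper name probe_dirs and the dummy file name .fs-probe.PID are hypothetical. Removing the dummy file right away is an addition of this sketch so that no clutter is left behind.

```shell
# Hypothetical helper: create and remove a dummy file in every directory.
probe_dirs() {
    # Print each directory first so the last line shows where a panic hit.
    find "$1" -xdev -type d -print -exec sh -c '
        f="$1/.fs-probe.$$"
        # Never touch a file that already exists.
        [ -e "$f" ] || { : > "$f" && rm -f "$f"; }
    ' sh {} \;
}

# Assumed mount point; uncomment once it matches your system.
# probe_dirs /mnt/da1s1a
```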
Note that doing this on an already corrupted read-write file system _is_ dangerous. I cannot stress this enough: Don't take the risk if you don't have a backup! Don't take the risk if you're not aware of the consequences! Don't take the risk if you're a newbie! A panic in this situation could make things even worse!

So, the system panics again...

Let's assume the system panics again with the same error message. If you were lucky, you even saw which directory was accessed last before the panic. This may be valuable to know if you run a certain type of application, and it could reveal as yet unknown application errors or even vulnerabilities such as temporary file creation race conditions.

/mnt/da1s1a: bad dir ino 16392 AT OFFSET 512: MANGLED ENTRY

You now have proof that there is (still) unfixed corruption on the file system. You also have proof that it happened at the same inode as before. If it's not the same inode, then you know for sure that there's either another corruption or faulty hardware causing excessive errors. For the latter case, remember what I wrote earlier about faulty hardware.

Right, now how to fix it?

To fix it, go back to single user mode and re-run fsck just to make sure. Keep your device mounted read-only. Then start the file system debugger, fsdb:
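A sketch of starting the debugger, assuming the device behind the mount point is /dev/da1s1a (an assumption; use your own device). The helper name open_fsdb is hypothetical.

```shell
# Hypothetical helper: open the file system debugger on one device.
open_fsdb() {
    # Bail out early if the device node does not exist on this machine.
    if [ ! -e "$1" ]; then
        echo "device $1 not found, not starting fsdb" >&2
        return 1
    fi
    # No -r here: fsdb must be able to write in order to clear an inode.
    fsdb "$1"
}

# Assumed device name; uncomment once it matches your system.
# open_fsdb /dev/da1s1a
```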
Now go to the inode that was mentioned in the kernel panic to get some additional information.
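The inode number can be read straight off the panic message. As a small sketch, the following pulls it out of the message quoted earlier and prints the matching fsdb command. The helper name inode_from_msg is hypothetical.

```shell
# Message as logged by the kernel in the example above.
MSG='/mnt/da1s1a: bad dir ino 16392 AT OFFSET 512: MANGLED ENTRY'

# Hypothetical helper: extract the number that follows "ino".
inode_from_msg() {
    printf '%s\n' "$1" | sed -n 's/.* ino \([0-9][0-9]*\) .*/\1/p'
}

# Print the command to type at the fsdb prompt.
echo "inode $(inode_from_msg "$MSG")"
```

In fsdb, the inode command selects the given inode as the current one, so that its details can be examined.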
Even if it results in data loss, clearing the inode is the way to go to get rid of this.
Then exit the debugger:
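Taken together, the input typed at the fsdb prompt for this case might look like the following. This is a sketch only: 16392 is the inode number from the panic message above, and the commands are shown without the prompt or any of fsdb's output. Bear in mind that clri discards whatever the inode referred to.

```text
inode 16392
clri 16392
quit
```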
Run fsck as told:
This is it?

Basically yes. However, I recommend rebooting the system once more into single user mode and rerunning 'find'. This will reveal whether there is any further corruption. The reboot also ensures that the operating system re-reads the disklabel and file system properly, which is especially important after messing around with the file system debugger. For the same reason, run fsck once more just to make sure the file system is really clean. Also try to keep to these premises:
Remember: The file system is at the heart of your server. Messing it up could compromise your data, your users and even your job. So care for it!