For my first post here I want to talk a bit about file systems, specifically NTFS, because that is what system administrators in the Microsoft world primarily deal with. I also want to talk about chkdsk. I’m sure that at least once in your life you’ve seen chkdsk triggered by some corruption on your drive, so let’s explore what is going on when that happens.
At my last job we had a situation where one of our servers spontaneously rebooted. It was a mission-critical server, so we knew about it immediately. Pulling up the display showed that chkdsk was running, and that chkdsk was not happy with what it was seeing. There were entries scrolling by like “Replacing invalid security id with default security id for file 1461234” and “Deleting an index entry with Id 8447 from index $SII of file 9.” Naturally this was rather alarming. A rather heated argument ensued about what was really going on, with some people not even sure that this was chkdsk running. Sadly, in the heat of the moment I was not able to articulate what was going on as well as I could have. This post gives me an opportunity to remedy that. Looking at what is happening here, though, requires delving into some of the structure of the NTFS file system itself.
In an NTFS partition just about everything is a file, and that includes the places where the metadata is stored. NTFS starts out with a boot sector, which is contained in a non-relocatable file with the fairly self-evident name $Boot. The boot sector records where the Master File Table (MFT) starts; the MFT itself lives in the file $MFT. There is also a mirrored copy of the master file table in $MFTMirr. The $MFTMirr only contains the first 4 entries of the MFT, which are, in order, $MFT, $MFTMirr, $LogFile, and $Volume. By the way, the default reservation for the MFT is 12.5% of the volume. There are cases where you may wish to increase the reservation size to allow for more file references. As the name suggests, the MFT contains an entry for every file on your drive. Most entries refer to data stored elsewhere on the volume, but really tiny files can be stored within the MFT itself. So where do your security attributes go? $Secure.
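To make the layout a little more concrete, here is a minimal Python sketch. It is only an illustration, not anything NTFS or Windows ships with: the helper name `mft_reservation_bytes` and the 500 GB volume size are made up for the example. The record numbers are the well-known reserved MFT entries. Note that record 9 is $Secure, which is the “file 9” in the chkdsk message quoted above.

```python
# Well-known NTFS metadata files, keyed by their MFT record number.
# (Only the ones mentioned in this post are listed.)
SYSTEM_FILES = {
    0: "$MFT",
    1: "$MFTMirr",
    2: "$LogFile",
    3: "$Volume",
    7: "$Boot",
    9: "$Secure",
}

# $MFTMirr duplicates the first four MFT records.
MIRRORED_RECORDS = [SYSTEM_FILES[n] for n in range(4)]


def mft_reservation_bytes(volume_bytes, fraction=0.125):
    """Space set aside for the MFT, assuming the default 12.5% reservation."""
    return int(volume_bytes * fraction)


if __name__ == "__main__":
    volume = 500 * 10**9  # hypothetical 500 GB volume
    print(MIRRORED_RECORDS)
    print(mft_reservation_bytes(volume))  # 62500000000 bytes, i.e. 62.5 GB
```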
$Secure contains all of your security attributes. When you add permissions to a file or directory, this is where those permissions are stored. They are indexed by $SDH and $SII. When NTFS checks the security of an object, it uses $SII to do a quick lookup of the object's security descriptor. $SDH is used for sharing security descriptors and storing new ones. Whenever a new file is created, it is assigned a standard security descriptor that contains the default security attributes. That leads us to what ties all of this to chkdsk: $Volume. $Volume contains the dirty bit. When you boot up and mount the partition, the dirty bit gets set in $Volume. When you shut down or reboot, one of the last things the OS does is reset the dirty bit. If the dirty bit is not reset, then when the OS boots back up it sees that the volume was not dismounted cleanly. Therefore it fires up chkdsk to make sure that everything was written cleanly. So that explains why we had chkdsk starting up: a spontaneous reboot naturally does not do a clean dismount. But what was chkdsk doing?
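The dirty-bit logic described above can be sketched in a few lines of Python. This is a toy model of the simplified behavior in this post, not how the file system driver is actually implemented; the `Volume` class and `boot` function are hypothetical names used only for illustration.

```python
class Volume:
    """Toy stand-in for the state kept in $Volume (hypothetical model)."""

    def __init__(self):
        self.dirty = False

    def mount(self):
        # Remember whether the previous session ended without a clean
        # dismount, then mark the volume as in use.
        was_dirty = self.dirty
        self.dirty = True
        return was_dirty

    def dismount(self):
        # A clean shutdown or reboot resets the dirty bit.
        self.dirty = False


def boot(volume):
    """Mount the volume; return True if chkdsk would be triggered."""
    return volume.mount()


if __name__ == "__main__":
    vol = Volume()
    print(boot(vol))   # False: fresh volume, nothing to check
    vol.dismount()     # clean shutdown
    print(boot(vol))   # False: dirty bit was reset
    # Spontaneous reboot: no dismount() call happens here.
    print(boot(vol))   # True: dirty bit still set, so chkdsk runs
```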
Chkdsk runs through three stages in this type of situation: verifying the files, then the indexes, and finally the security descriptors. In the first stage, verifying files, chkdsk compares which clusters are actually in use against what the MFT claims. If there are discrepancies, entries in the MFT are reset or added. Then chkdsk moves on to the next stage, checking indexes. This basically makes sure that you can get to every file through a directory. If there are legitimate files but no directories leading to them, they are orphaned. Chkdsk tries to figure out where each orphan should go; if it cannot, the orphan is put in a special directory at the root of the volume. Finally we hit stage three, the security descriptors. Chkdsk checks that all directories and files have consistent security descriptors. If a descriptor is marked as inconsistent, which can be caused by the security descriptor block not having a 20 byte padding at the end of the block, chkdsk resets the security descriptor to the default. This explains a number of the entries that we saw scrolling by, as mentioned earlier.
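As a rough illustration of the three stages, here is a toy Python model. It is a sketch under heavy assumptions and bears no resemblance to chkdsk's real implementation: the data structures, the function names, and the `DEFAULT_SECURITY_ID` value are all hypothetical. `found.000` is the kind of folder name chkdsk uses at the root of the volume for recovered orphans.

```python
DEFAULT_SECURITY_ID = 256   # hypothetical placeholder value
ORPHAN_DIR = "found.000"    # root-level folder for recovered orphans


def stage1_files(mft, bitmap):
    """Stage 1: compare clusters the MFT claims with clusters marked in use."""
    claimed = set()
    for record in mft.values():
        claimed |= record["clusters"]
    return {
        "claimed_but_free": claimed - bitmap,    # MFT says used, bitmap says free
        "in_use_but_unowned": bitmap - claimed,  # bitmap says used, no file owns it
    }


def stage2_indexes(mft, directories):
    """Stage 2: move files that no directory references into ORPHAN_DIR."""
    indexed = set()
    for entries in directories.values():
        indexed |= entries
    orphans = sorted(set(mft) - indexed)
    if orphans:
        directories.setdefault(ORPHAN_DIR, set()).update(orphans)
    return orphans


def stage3_security(mft, known_ids):
    """Stage 3: replace security ids missing from the index with the default."""
    fixed = []
    for number, record in sorted(mft.items()):
        if record["security_id"] not in known_ids:
            record["security_id"] = DEFAULT_SECURITY_ID
            fixed.append(number)
    return fixed


if __name__ == "__main__":
    mft = {
        30: {"clusters": {100, 101}, "security_id": 257},
        31: {"clusters": {102}, "security_id": 999},  # id not in the index
        32: {"clusters": {103}, "security_id": 257},  # not in any directory
    }
    bitmap = {100, 101, 102, 104}
    directories = {"docs": {30, 31}}

    print(stage1_files(mft, bitmap))
    print(stage2_indexes(mft, directories))      # [32]
    print(stage3_security(mft, {256, 257}))      # [31]
```

In this model, file 31 is the analogue of the “Replacing invalid security id with default security id” messages: its security id does not exist in the index, so it falls back to the default.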
So a question that came out of this discussion was: should we interrupt chkdsk while it is running? The answer is an emphatic no. Interrupting chkdsk in the middle of a repair can make a bad file system even worse. If you must interrupt it, be prepared to pull out those backups, as you may need to restore the system. Always make sure to keep good backups. In the case of our system, fortunately, we had a disaster recovery site that we could draw upon. That was a good thing, as the volume was discovered to be completely hosed. All data was lost. Hopefully you now understand the process a bit better. You will be able to explain to your supervisor what is going on and why you should not stop that system, and you will have the documentation to point to for why.
Windows Forensics, the Field Guide for Conducting Corporate Computer Investigations by Chad Steel
An explanation of chkdsk and the new /C and /I switches
Chkdsk Finds Incorrect Security IDs After You Restore or Copy a Lot of Data
How NTFS Works
Inside Win2K NTFS, Part 1