Tag Archives: corruption

The Case of the Mysterious Crashing Application

A recent client had migrated off their terminal server and onto a virtualized 2008 R2 RDS server. It was actually a farm of them, but for this case that did not matter. Their previous setup had been contained entirely on one 2003 server, which also ran their AD, print server, and whatever else was crammed into the kitchen sink. The new setup had proper separation and centralized storage, all on 2008 servers. For all of their data and programs they would now reach into a file share on the SAN. This worked great except for one program that kept crashing unless its data files were local to the server. The event log showed the following two events, one immediately following the other:

Event ID 1000

Application Error

Description:

Faulting application name: OMNIS7.exe, version: 8.0.0.0, time stamp: 0x3bb82293

Faulting module name: OMNIS7.exe, version: 8.0.0.0, time stamp: 0x3bb82293

Exception code: 0xc0000006

Event ID 1005

Application Error

Description:

Windows cannot access the file  for one of the following reasons: there is a problem with the network connection, the disk that the file is stored on, or the storage drivers installed on this computer; or the disk is missing. Windows closed the program Omnis 7 core executable because of this error.

Program: Omnis 7 core executable

File:

The error value is listed in the Additional Data section.

User Action

1. Open the file again. This situation might be a temporary problem that corrects itself when the program runs again.

2. If the file still cannot be accessed and

– It is on the network, your network administrator should verify that there is not a problem with the network and that the server can be contacted.

– It is on a removable disk, for example, a floppy disk or CD-ROM, verify that the disk is fully inserted into the computer.

3. Check and repair the file system by running CHKDSK. To run CHKDSK, click Start, click Run, type CMD, and then click OK. At the command prompt, type CHKDSK /F, and then press ENTER.

4. If the problem persists, restore the file from a backup copy.

5. Determine whether other files on the same disk can be opened. If not, the disk might be damaged. If it is a hard disk, contact your administrator or computer hardware vendor for further assistance.

Additional Data

Error value: C00000C4

Disk type: 0

It was very swiftly recognized that the situation was not a temporary problem, though unfortunately it was a sporadic one. The pattern noted was that the program crashed most often in the morning, when everyone was signing on, and in the afternoon, when everyone was closing out. Now 0xC00000C4 is STATUS_UNEXPECTED_NETWORK_ERROR, but that doesn’t provide much to go on. Performance logs showed that there shouldn’t be a network problem bandwidth-wise either. The first thing tried was disabling RSS (receive-side scaling) and offloading, but that did not help matters. Doing more research, I was led to believe that the problem was being caused by oplocks.
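The two hex values in the event log entries can be taken apart with a few lines of Python. This is a minimal sketch: the bit layout is the documented NTSTATUS format, the two names come from Microsoft's NTSTATUS list, and `decode_ntstatus` is just a helper name chosen for illustration. The exception code from event 1000, 0xc0000006, is STATUS_IN_PAGE_ERROR, which fits the picture of a page of a network-hosted file that could not be read in.

```python
# Names for the two codes that appear in the event log entries.
KNOWN_STATUS = {
    0xC0000006: "STATUS_IN_PAGE_ERROR",
    0xC00000C4: "STATUS_UNEXPECTED_NETWORK_ERROR",
}

SEVERITY = {0: "success", 1: "informational", 2: "warning", 3: "error"}


def decode_ntstatus(value):
    """Split a 32-bit NTSTATUS value into its documented bit fields."""
    return {
        "name": KNOWN_STATUS.get(value, "unknown"),
        "severity": SEVERITY[(value >> 30) & 0x3],  # bits 31-30
        "customer": bool((value >> 29) & 0x1),      # bit 29
        "facility": (value >> 16) & 0xFFF,          # bits 27-16
        "code": value & 0xFFFF,                     # bits 15-0
    }


# The strings are copied from the events: exception code, then error value.
for raw in ("0xc0000006", "C00000C4"):
    print(raw, decode_ntstatus(int(raw, 16)))
```

Both values decode to severity "error" with facility 0, so neither one narrows the cause down beyond "the read over the network failed".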

Oplocks, short for Opportunistic Locking, is a process in the SMB protocol that was designed to allow multiple processes to lock a file while providing client-side caching. The purpose of this is to improve performance for the local clients on the network. For more reading on this consult this document and this document. So all the crashing basically came down to cache integrity, since the database used by the client was a flat file instead of a transactional database.

To disable oplocks on the server you go into this key: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters

Set EnableOplocks to 0. If the value is not there, create it as a REG_DWORD. Reboot for the change to take effect.

Unfortunately Server 2008 introduces a new problem. Server 2008 will communicate via SMB2 with any client running Vista or newer, and SMB2 does not allow oplocks to be disabled. The workaround is that if SMB2 is disabled on either the client or the server, communication falls back to the original SMB protocol. The easiest fix, then, is to disable SMB2 on the server.

HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\LanmanServer\Parameters

Create a REG_DWORD named SMB2 and set it to 0. Reboot to take effect.
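For anyone who would rather import the two server-side values than click through regedit, they can be written out as .reg file text. This is only a sketch: `reg_file` is a helper name made up for this example, and it does nothing but build the text, so the output can be reviewed before it goes anywhere near the file server.

```python
SERVER_KEY = (
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet"
    r"\Services\LanmanServer\Parameters"
)


def reg_file(key_path, dword_values):
    """Render REG_DWORD values as the text of a .reg file."""
    lines = ["Windows Registry Editor Version 5.00", "", f"[{key_path}]"]
    for name, value in dword_values.items():
        # .reg files write a DWORD as eight hexadecimal digits.
        lines.append(f'"{name}"=dword:{value:08x}')
    return "\r\n".join(lines) + "\r\n"


# Both values described above: oplocks off, SMB2 off.
text = reg_file(SERVER_KEY, {"EnableOplocks": 0, "SMB2": 0})
print(text)
```

Saved with a .reg extension and imported on the server, this should have the same effect as making the two edits by hand, and the reboot is still needed.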

You may notice that the server takes substantially longer to start up after making this change. The delays were severe enough that I decided to test an alternative method for disabling SMB2. Since communication falls back to SMB if either the client or the server does not support SMB2, SMB2 can be disabled on the client side instead. Disabling it on the client is a bit different, since you’re actually disabling a service.

HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\LanmanWorkstation

You may want to back up this key for easy restoration. Then edit DependOnService and remove MRxSmb20.

HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\mrxsmb20

You may want to back up this key as well. In this key set Start to 4. Reboot your client and SMB2 will now be disabled.
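With more than a handful of clients, the two client-side edits could be scripted. The sketch below uses Python's standard winreg module and makes a few assumptions: the function names and the `--apply` switch are invented for this example, the sample dependency list is illustrative rather than read from a real machine, and the script has to run elevated on the client itself. The list-filtering helper is kept separate so it can be tried anywhere.

```python
import sys

WORKSTATION_KEY = r"SYSTEM\CurrentControlSet\Services\LanmanWorkstation"
MRXSMB20_KEY = r"SYSTEM\CurrentControlSet\Services\mrxsmb20"
SERVICE_DISABLED = 4  # the Start value used in the steps above


def without_dependency(depends, name="MRxSmb20"):
    """Return the dependency list minus `name`, compared case-insensitively."""
    return [d for d in depends if d.lower() != name.lower()]


def disable_smb2_client():
    """Apply both registry edits. Windows only; needs admin rights."""
    import winreg  # imported here so the helper above works on any OS

    access = winreg.KEY_READ | winreg.KEY_SET_VALUE
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, WORKSTATION_KEY, 0, access) as key:
        depends, kind = winreg.QueryValueEx(key, "DependOnService")
        winreg.SetValueEx(key, "DependOnService", 0, kind,
                          without_dependency(depends))
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, MRXSMB20_KEY, 0,
                        winreg.KEY_SET_VALUE) as key:
        winreg.SetValueEx(key, "Start", 0, winreg.REG_DWORD, SERVICE_DISABLED)


# Example list only; check what DependOnService holds on your own client.
example = ["Bowser", "MRxSmb10", "MRxSmb20", "NSI"]
print(without_dependency(example))

# Nothing is changed unless the script is run on Windows with --apply.
if sys.platform == "win32" and "--apply" in sys.argv:
    disable_smb2_client()
```

The script does not back up either key, so export them first as suggested above, and the reboot is still required afterwards.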

Ever since implementing these changes the client’s applications have been running solid as a rock. For oplocks reading from Microsoft check here.

NTFS, CHKDSK, and You!

For my first post here I want to talk a bit about file systems. Specifically about NTFS because that is what system administrators in the Microsoft world deal with primarily. I also want to talk about chkdsk. I’m sure at least once in your life you’ve seen a chkdsk triggered because of some corruption on your drive. So let’s explore a bit about what is going on there. This is why I want to talk about NTFS.

At my last job we had a situation where one of our servers spontaneously rebooted. It was a mission critical server, so we knew about it immediately. Pulling up the display on it showed that chkdsk was running, and that chkdsk was not happy with what it was seeing. There were entries scrolling by like “Replacing invalid security id with default security id for file 1461234” and “Deleting an index entry with Id 8447 from index $SII of file 9.” Naturally this was rather alarming. A rather heated argument ensued about what was really going on, with some people not even sure that this was chkdsk running. In the heat of the moment, sadly, I was not able to articulate what was going on as well as I could have. This gives me an opportunity to remedy that. Taking a look at what is happening here, though, requires delving into some of the structure of the NTFS file system itself.

In an NTFS partition just about everything is a file. This includes where the metadata is stored as well. NTFS starts out with a boot sector, and this boot sector is contained in a non-relocatable file, $Boot, whose purpose is pretty self-evident. The boot sector records where the Master File Table starts, and the table itself lives in the file $MFT. There is also a mirrored copy of the master file table contained in $MFTMirr. $MFTMirr only contains the first 4 entries of the MFT, which are, in order, $MFT, $MFTMirr, $LogFile, and $Volume. The default reservation for the MFT is 12.5% of the volume, by the way. There are cases where you may wish to increase the reservation size so as to allow for more file references. Now this MFT, as the name suggests, contains an entry for every file on your drive. File contents are held by reference, except for really tiny files that can be stored within the MFT entry itself. So where do your security attributes go? $Secure.
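To make "the boot sector records where the MFT starts" concrete, here is a small sketch that reads those fields out of a 512-byte boot sector. The offsets are the standard NTFS boot sector layout (OEM ID at 0x03, bytes per sector at 0x0B, sectors per cluster at 0x0D, MFT and MFT mirror cluster numbers at 0x30 and 0x38). The function name is made up, the sample sector is synthetic rather than taken from a real disk, and the sectors-per-cluster byte is treated as a plain count, which I am assuming is fine for ordinary cluster sizes.

```python
import struct


def parse_ntfs_boot_sector(sector):
    """Pull the fields that locate the MFT out of an NTFS boot sector."""
    if len(sector) < 512 or sector[3:11] != b"NTFS    ":
        raise ValueError("not an NTFS boot sector")
    bytes_per_sector, sectors_per_cluster = struct.unpack_from("<HB", sector, 0x0B)
    mft_lcn, mftmirr_lcn = struct.unpack_from("<QQ", sector, 0x30)
    cluster_size = bytes_per_sector * sectors_per_cluster
    return {
        "cluster_size": cluster_size,
        "mft_offset": mft_lcn * cluster_size,        # byte offset of $MFT
        "mftmirr_offset": mftmirr_lcn * cluster_size,  # byte offset of $MFTMirr
    }


# Synthetic sector: 512-byte sectors, 8 sectors per cluster,
# $MFT at cluster 786432 and $MFTMirr at cluster 2.
sector = bytearray(512)
sector[3:11] = b"NTFS    "
struct.pack_into("<HB", sector, 0x0B, 512, 8)
struct.pack_into("<QQ", sector, 0x30, 786432, 2)
print(parse_ntfs_boot_sector(bytes(sector)))
```

On a live system the 512 bytes would come from the first sector of the raw volume, which needs administrative rights and is not shown here.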

$Secure contains all of your security attributes. When you add permissions to a file or directory, this is where all of those permissions are stored. They are stored in the indexes $SDH and $SII. When NTFS is checking the security of an object, it uses $SII to do a quick lookup of the object's security descriptor. $SDH is used for sharing security descriptors and storing new ones. Whenever a new file is created, it is assigned a standard security descriptor that contains the default security attributes. Which leads to what ties us to chkdsk: $Volume. $Volume contains the dirty bit. When you boot up and mount the partition, the dirty bit gets set in $Volume. When you shut down or reboot, one of the last things the OS does is reset the dirty bit. If the dirty bit is not reset, then when the OS boots back up it sees that the volume was not dismounted cleanly. Therefore it fires up chkdsk to make sure that everything was written cleanly. So that explains why we had chkdsk starting up. A spontaneous reboot naturally does not do a clean dismount. But what was chkdsk doing?
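The chkdsk message quoted earlier talks about "index $SII of file 9", and that 9 is an MFT record number. The metadata files sit in fixed slots at the start of the MFT, so a small lookup is enough to show that file 9 is $Secure. The table below is the standard NTFS assignment for records 0 through 11; the function name and the fallback wording are my own.

```python
import re

# Record numbers of the NTFS metadata files at the start of the MFT.
SYSTEM_FILES = {
    0: "$MFT", 1: "$MFTMirr", 2: "$LogFile", 3: "$Volume",
    4: "$AttrDef", 5: ". (root directory)", 6: "$Bitmap", 7: "$Boot",
    8: "$BadClus", 9: "$Secure", 10: "$UpCase", 11: "$Extend",
}


def name_file_in_message(message):
    """Return (record number, name) for the last 'file N' in a chkdsk line."""
    matches = re.findall(r"\bfile (\d+)", message)
    if not matches:
        return None
    number = int(matches[-1])
    return number, SYSTEM_FILES.get(number, "not one of the named metadata files")


print(name_file_in_message(
    "Deleting an index entry with Id 8447 from index $SII of file 9."))
print(name_file_in_message(
    "Replacing invalid security id with default security id for file 1461234"))
```

So the first message is chkdsk cleaning up the $SII index inside $Secure itself, while the second is about a regular file whose record pointed at a security ID that no longer checked out.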

Chkdsk runs through three stages in this type of situation: verifying the files, then the indexes, and finally the security descriptors. In the first stage, verifying files, chkdsk compares which clusters are actually in use against what the MFT claims. If there are discrepancies, entries in the MFT will be reset or added. Then chkdsk moves on to the next stage of checking indexes. This basically makes sure that you can get to every file through a directory. If there are legitimate files but no directories leading to them, that leaves them orphaned. Chkdsk tries to figure out where the orphan should go, and if it cannot, the orphan is put in a special directory at the root of the volume. Then finally we hit stage three, our security descriptors. Chkdsk checks that we have consistent security descriptors for all directories and files. If a descriptor is marked as inconsistent, which can be caused by the security descriptor block not having a 20 byte padding at the end of the block, chkdsk resets the security descriptor to the default. This explains a number of the entries that we saw scrolling by as mentioned earlier.
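The orphan check in the index stage can be pictured with a toy model. This is emphatically not how chkdsk is implemented; it is a simplified sketch in which the volume is reduced to two dictionaries that I made up for the example: one mapping each in-use record to the parent directory it claims, and one mapping each directory to the records its index actually lists.

```python
ROOT = 5  # record number of the root directory on NTFS


def find_orphans(parents, indexes):
    """Return in-use records that no directory index points to.

    parents: record number -> record number of the claimed parent directory
    indexes: directory record number -> set of child record numbers it lists
    """
    listed = set()
    for children in indexes.values():
        listed |= children
    return sorted(r for r in parents if r != ROOT and r not in listed)


# Directory 40 holds files 41 and 42, but its index lost the entry for 42.
parents = {5: 5, 40: 5, 41: 40, 42: 40}
indexes = {5: {40}, 40: {41}}
print(find_orphans(parents, indexes))  # -> [42]
```

In this model record 42 still names directory 40 as its parent, which is the kind of clue that lets an orphan be put back where it belongs. If that clue were gone too, the orphan would be the case that ends up in the special directory at the root.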

So a question that resulted from this discussion was this: should we interrupt chkdsk when it is running? The answer is an emphatic no. Interrupting chkdsk can make the damaged file system it is in the middle of repairing even worse. If you must interrupt it, be prepared to pull out those backups, as you may need to restore that system. Always make sure to keep good backups. In the case of our system, fortunately, we had a disaster recovery site that we could draw upon. Which was a good thing, as that volume was discovered to be completely hosed. All data was lost. Hopefully you now understand the process a bit more. Then you will be able to explain to your supervisor what is going on and why you should not stop that system, and you will have the documentation to point to for why.

Sources:
Windows Forensics: The Field Guide for Conducting Corporate Computer Investigations by Chad Steel
An explanation of chkdsk and the new /C and /I switches
Chkdsk Finds Incorrect Security IDs After You Restore or Copy a Lot of Data
How NTFS Works
Inside Win2K NTFS, Part 1
