A while back I got involved in an issue where a company had written an application for their own internal use. This application was intended to connect to a web page served out by IIS and then proceed with some other action. They wisely chose to make sure that the initial connections it used were secure on the network. A method they wanted to use for that was secure socket layer (SSL) communication.
This company’s code ran within the IIS application pool and simply called into the underlying Windows SSPI in order to kick things off. We’ll go over a little of this in case you’ve never had the opportunity to delve into the details of how this works. SSL and transport layer security (TLS) typically use public key interchange (PKI) based certificates in Microsoft’s and most other implementations to provide the private and secure pieces of data to essentially guarantee the communicator’s authenticity and security. In other words, the certificates that are present are important.
What our customer saw-the reason they were calling us-was a consistent approximate 20 second delay on initial connection to the web page. This would only happen the first time a user connected to the page; subsequent connections in the same user logon session of that workstation saw no delay. Were they to log off and back on however they would see the problem recur.
Network traces were the first thing we wanted to look at. Filtering for authentication based traffic can be a tricky thing since authentication can be used for a variety of other network protocols. Not all network traffic analyzers will filter for the authentication used and thus catch all traffic for it regardless of the protocol which is needing to be authenticated. The common thing we see this in is filtering for Kerberos authentication, but it applies here as well.
In this scenario, we filtered for SSL.
The traffic we saw showed a distinct 20 seconds in delay after a server packet arrived to the client, before the client responded. A little background understanding…
Since SSL derives this base security from certificates, there is an exchange that happens as client provides the basics on what certificates it has to the server, and server to the client, so that they can decide on what certificates can be used to establish this secure connection. Keep in mind that not all certificates have the same capabilities, so some may not work well. In addition, they often need to be checked for revocation. Also keep in mind that ideally, an SSL connection would like to use both a client and a server certificate to for this security, but if necessary only server side will be used if client doesn’t “qualify” or exist.
Naturally our first thought was that perhaps we were checking for the certificate revocation list for installed certificates on the client (since the delay was so clearly seen on the client side in the network traffic). For those installed certificates, however, there was no outbound traffic for the defined CRL in the certificates. So, that wasn’t it.
At this point we did a lot of head scratching.
Our next thought was to use Filemon.exe and Regmon.exe on that client as it did this. It was a little tough to filter but we saw that it was trying to access the local certificate store on the computer. This is a file location of %userprofile%Application DataMicrosoftCrypto for user certificates, and All UsersApplication DataMicrosoftCrypto for the computer enrolled certificates. These tools really didn’t give us a thorough enough picture to conclude anything. By the way, here’s what regmon showed us:
7943 7:51:31 PM w3wp.exe:4080 OPEN C:Documents and SettingsAll UsersApplication
-43cf-bb20-39161a0a0108 ACCESS DENIED NT AUTHORITYNETWORK SERVICE
7944 7:51:31 PM w3wp.exe:4080 OPEN C:Documents and SettingsAll UsersApplication
-43cf-bb20-39161a0a0108 NOT FOUND Options: Open Sequential Access: All
Our next venture was to use DebugDiag, an IIS specific tr
oubleshooting tool, so we could see what is happening under the hood. I don’t profess to be an expert in using this tool, or even to be someone who uses it often, but it helped us here. Here’s a link on how to get this tool, and what to do with it:
This tool helped us on the client side by capturing lsass.exe. When we looked at those little mini dumps we saw that we would consistently see a thread looking for objects in the All UsersApplication DataMicrosoftCrypto path. This was still not a smoking gun, though, since as you all know, a dump is just a capture of what is happening at that nanosecond. So, like a snapshot picture taken too late at a race finish, we could’ve missed the good shot.
But seeing this made me think of a hypothesis: what if the delay is a timeout period in our code when there is a problem accessing the local certificate store for some (unknown) reason? First culprit to check would be permissions if that was it.
So, we dumped the permissions of that directory and subdirectories using SUBINACL.EXE. You can get that tool from the link below, and I believe it is in the resource kits as well.
The application was running in the IIS app pool (w3wp.exe process). Things running in the app pool apparently will use the context of NT AuthorityNetwork Service to do the SSL session setup. Particularly in later versions and service packs. In the permissions export, you will want to look for the application pool context (I suggest starting with NT AuthorityNetwork Service) and see if those permissions are present. To prevent blinding headaches I suggest dumping the same permissions on a default install client and see what they show for comparison purposes.
Our conclusion was that SSL session setup does a check of the local certificate store for the computer and that the 20 second delay was the amount of time which the CryptoAPI will take before timing out in trying to access the local computer’s machine certificate store and doing a server based SSL session rather than client-server SSL.
The customer had removed the default permission on this directory; they removed the Network Service’s permissions altogether. Although I was never clear on why that was done, there was no glaringly obvious reason not to do that. At least, until we figured this issue out.
As you’ve probably concluded by now, we put these permissions back to default and immediately saw the difference.
For those out there who are trying to match this to their own real world scenario I’d like to point out that there can be variations on this theme. The certificate can be a user based one and it could be trying to reach the cached CRL URL objects, or any of the associated objects in these child directories. The key to start with on this type of thing is:
-Do a trace to see whether the delay is client side or server side.
-Once determined, do filemon.exe as you see the difficulty.
Maybe the trace will send you to another route, like a just-in-time reply from a CRL. Who knows? That’s the cool thing about what we do-each issue is different. But you can apply these techniques, tools, steps to whatever similar issue comes across your desk and maybe that’ll help get things moving smoothly for you.
Happy New Year!