I’ve spent the last few days debugging a strange issue with a service we built using the Documentum Foundation Services (DFS) framework. After a period of use we began receiving sporadic authorization-failure messages when folks attempted to use our custom service, and, to a lesser degree, when they tried to use the DFS core services.
In this case our UI is an XML editor built on top of XMetaL. We have a COM component, written in C#, that consumes our custom service and also makes calls to DFS core services. The test team discovered that they could set up a search and click the search button repeatedly; sometimes the search would work and sometimes they would receive authorization errors.
It seems obvious now, but I proved without a shadow of a doubt that these requests were all valid coming from our COM component into the DFS framework via SOAP (i.e. the credentials were valid and were not getting munged by the network).
We have also been seeing a few other issues that now appear to be related: we have been running out of file handles in our WebLogic instance (which hosts DFS and our custom service), and we have been running out of database connections in the Oracle instance that hosts our docbases.
One of WebLogic’s startup scripts sets the file handle ulimit to 1024 on Unix boxes. As an aside, this seems a tad low for Webtop/DAM/WebPublisher/etc implementations, especially if you have several webapps deployed under a single managed container. We upped this to 2048 and it seemed to help with the file handles issue.
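For the record, the change amounts to something like the following in the shell environment that launches WebLogic (the script name and the 2048 value reflect our setup; treat this as a sketch, not WebLogic gospel):

```shell
# Show the current soft limit on open file descriptors
# (WebLogic's stock startup script was pinning this to 1024).
ulimit -Sn

# Raise the soft limit before the JVM starts; it cannot exceed the
# hard limit reported by `ulimit -Hn`. We added a line like this to
# our domain start script (startWebLogic.sh in our setup).
ulimit -Sn 2048 || echo "hard limit too low; raise it in /etc/security/limits.conf"
```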
Also, because we have two docbases sharing the same Oracle database, we upped the process limit in Oracle (more on this in a bit). Again, this helped some, but still didn’t fully resolve the issue.
While the error message being surfaced to the end user was invariably about authorization failure, there were several different causes reported in the “caused by” sections of the stack traces in the server logs. When we could identify a root cause from the stack trace, it often had to do with running out of docbase sessions or with database connectivity problems.
Keep in mind that DFS is supposed to be essentially stateless, so you have to re-auth prior to any transaction you wish to perform. As such, any connectivity issue is likely to be reported as an authorization failure. If you run out of docbase sessions, you get an auth failure. If you run out of database processes, you get an auth failure. If the network has a hiccup, you get an auth failure.
It turns out there was a bug in our code. We were (tsk-tsk) storing an IDfSession in a member variable for the duration of the service handler’s lifetime (several minutes, for an interaction with the docbase that typically lasts < 1 second). This was of course easy to resolve by requesting the session on demand from the session manager that DFS hands out, and then releasing said session when we are done with it.
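The fix is easy to sketch. I can’t reproduce the DFC here, so the classes below are toy stand-ins for IDfSessionManager/IDfSession (the real methods we call are getSession(...) and release(...)); the point is the shape: hold the manager in a member variable, never a session, and borrow/release a session per request in a finally block:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy stand-in for IDfSession (hypothetical, for illustration only).
class Session {
}

// Toy stand-in for IDfSessionManager; real one lives in com.documentum.fc.client.
class SessionManager {
    private final Deque<Session> pool = new ArrayDeque<>();
    private int liveCount = 0; // sessions currently checked out

    // Analogous to IDfSessionManager.getSession(docbaseName).
    Session getSession() {
        liveCount++;
        return pool.isEmpty() ? new Session() : pool.pop();
    }

    // Analogous to IDfSessionManager.release(session).
    void release(Session s) {
        liveCount--;
        pool.push(s);
    }

    int liveSessions() { return liveCount; }
    int pooledSessions() { return pool.size(); }
}

class ServiceHandler {
    // Keep the *manager* in a field for the handler's lifetime -- never a session.
    private final SessionManager manager;

    ServiceHandler(SessionManager manager) {
        this.manager = manager;
    }

    // Each request borrows a session and returns it in a finally block,
    // so sequential requests re-use one pooled session instead of
    // pinning a session (and its TCP connection) for minutes.
    String handleRequest() {
        Session s = manager.getSession();
        try {
            return "ok"; // ... talk to the docbase here ...
        } finally {
            manager.release(s);
        }
    }
}
```

With this discipline a burst of sequential requests settles at a pool of roughly one session, instead of each handler pinning a session for its whole lifetime.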
The following para is speculation, somewhat borne out by experimentation: It seems that WebLogic was killing the service handler after some period of inactivity, and that doing so resulted in the TCP connection backing the stored session going into a CLOSE_WAIT state indefinitely (i.e. we were leaking TCP file handles). Unfortunately these were not getting re-used. This, plus the 1024 file handle limit WebLogic imposes on itself by default, seems to explain why we were running out of file handles.
Not storing a session in a member variable went a long way towards resolving our issue. However, since I had built a test harness for pushing lots of requests through DFS and our custom service, I did some additional stress testing and discovered that even with this bug resolved it was pretty easy to run out of docbase sessions, and that we were still running out of database processes.
First, let’s talk about the database:
Experimental evidence showed that a single docbase, just after startup and before any user sessions are opened, uses about 22 database connections (Oracle processes), and that each docbase session accounts for one additional database connection. [SQL used: select count(*) from v$session]
We had a vanilla setup, so the docbases were configured (via server.ini) to allow up to 100 concurrent sessions, and the Oracle database was configured to allow up to 150 processes.
Since we have two docbases pointing at the same database, we had to up the database process limit to allow for both docbases to max out on concurrent sessions, plus the baseline connections introduced by the database itself and the docbase processes. For folks who are messing with the max concurrent docbase sessions, it seems that you ought to set your number of processes to be greater than: (35 × number-of-docbases) + SUM(over docbases, max-concurrent-sessions). [Yes, there is some fudge factor in there.]
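To make the arithmetic concrete, here’s the rule of thumb as a tiny helper (my own encoding of the formula above, nothing from the products). With our two docbases configured for 100 concurrent sessions each, it says we need more than 270 Oracle processes; our original limit of 150 never stood a chance:

```java
class OracleSizing {
    // Encodes the rule of thumb above: set the Oracle "processes"
    // parameter to something GREATER than the value returned here.
    // The 35 per docbase covers the ~22 baseline connections we
    // observed, plus fudge factor.
    static int minOracleProcesses(int[] maxConcurrentPerDocbase) {
        int sum = 0;
        for (int m : maxConcurrentPerDocbase) sum += m;
        return 35 * maxConcurrentPerDocbase.length + sum;
    }
}
```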
Now let’s talk about docbase connections:
I’m not really sure why, but it seems to be pretty easy to get DFS (or services built on top of DFS) to max out concurrent docbase connections unless you do some extra work. This is exceedingly bad because once you run out of concurrent connections, you can’t get additional sessions for users or even run administrative tools like iapi and idql.
As far as I can tell there is one IDfSessionManager associated with each service handler instance. I don’t understand the internals of the session manager, but I would expect sequential requests (where each request does a getSession() and a releaseSession()) to result in a session pool of roughly one session (in my test case all requests were made with the same login credentials). Instead, I have seen these session pools grow pretty much without bound until you run out of concurrent docbase sessions.
After re-reading the Fundamentals doc and the dfcfull.properties file, I began to understand more about how the Documentum session managers work.
The scheme is as follows. The session manager maintains a level 1 cache of sessions. Once a session has been released into the cache, a subsequent request for a session using the same credentials will result in the session being re-used. If session pooling is enabled (I turned it on explicitly in dfc.properties because I couldn’t tell whether it would be enabled by default; the docs seem to be conflicting) and a session has been in the level 1 cache for a timeout interval (also configured in dfc.properties), then the session will migrate into the level 2 cache. Sessions in the level 2 cache will be rebound, so they can be used even if the session request is using different credentials than what is stored in the cached session instance.
It appears that by default a single DFC instance will not limit the number of sessions it can hold until the server’s max concurrent sessions has been reached. If the server max is reached, there does seem to be some re-jiggering, but it’s hard to say exactly what.
The moral of the story here appears to be that if you have an app that’s likely to build up a backlog of sessions in the session pool (DFS, and services created with the DFS framework, appear to fit this bill), you should use dfc.properties to limit the maximum number of sessions for each such application.
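Concretely, these knobs live in dfc.properties. The key names below are my reading of dfcfull.properties; double-check them against the copy that ships with your DFC version, since I’m not certain about the defaults:

```properties
# Turn session pooling on explicitly (I couldn't tell whether it is
# enabled by default; the docs seem to conflict).
dfc.session.pool.enable = true

# How long a released session sits in the level 1 cache before it
# migrates to the level 2 (shared) cache, in seconds.
dfc.session.pool.expiration_interval = 5

# Cap the number of sessions this DFC instance will hold, so a single
# app cannot starve the docbase of concurrent sessions.
dfc.session.max_count = 20
```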
I’ve heard rumblings recently that MTS may also fit the category of an app that will backlog too many sessions (thus starving other clients).
For now I think we are going to try 20 sessions for DFS and 20 sessions for our custom service. I’m not sure whether it’s worth limiting Webtop and DA, or whether we’re better off letting those apps manage their own session pools, since they seem to do a good job of this and it’s hard to say how to balance sessions between them. We will of course have to up our max concurrent docbase sessions so that we can have a good number of users on the system at once, and we will have to up the Oracle process limit accordingly. We will also have to keep an eye on the number of file handles used by WebLogic to make sure it doesn’t use more handles than it has allocated itself.