Sometimes you run across something and figure it won’t ever happen again. Then it does, repeatedly. You are reminded that any random event becomes possible when given enough opportunities to occur.
Well, I’ve been living in that world for a while, and I think I finally got to the root of the problem: a bug in Documentum. Not just any bug (or design constraint), but one that requires high throughput and a little luck to reproduce. The existence of a bug really isn’t the issue; all large systems have them. It is the journey to discovery that is the “fun” part.
Creating a Problem Through Solving Another
On one of my projects, we ingest lots of documents. They are organized by a unique numerical identifier. While we technically didn’t need to put them into multiple folders, we decided to do so in order to simplify things for users who wanted to browse to the records. This also helped prevent Documentum from having a conniption when we queried for the contents of a folder with a million or so items.
The next question was, “What kind of structure?” Well, it needed to be simple. To illustrate, I’ll use Social Security Numbers as an example, though that is not the actual number in question.
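As a rough sketch of the scheme (my reconstruction for illustration, including a hypothetical ‘/Records’ cabinet, not the project’s actual code), each level of the hierarchy is a two-digit slice of the number, and each folder is named with the cumulative prefix up to that level:

Declare @ssn varchar(9) = '123456789'
Select '/Records/' + Substring(@ssn, 1, 2)
     + '/' + Substring(@ssn, 1, 4)
     + '/' + Substring(@ssn, 1, 6)
     + '/' + Substring(@ssn, 1, 8) As folder_path
-- Returns /Records/12/1234/123456/12345678, where Record 123456789 would live

With two-digit slices, no level ever needs more than 100 subfolders.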
We could have broken it by any number, but 100 items per level seemed reasonable. Besides, in theory, who would really care?
Documentum, that’s who.
The problem arose when we had two concurrent processes trying to create Records 123456789 and 123456889, which share the same folder structure for several levels. As each process identified missing folders, it tried to create them. This led to locks and failures.
This was bad.
So we came up with a two-fold solution. The first part was to pre-create the first two levels in the repository, about 10,000 folders. That took care of some of the immediate pain; for some high-density numbers, we went a layer deeper. The second part was to retry the creation of a folder after a short delay if it failed.
We rolled that out a year and a half ago and everything seemed okay. We were wrong.
The Existential Crisis
Here is what ended up happening when concurrent creation was attempted. Using the structure above, one process created folder 123456 correctly, with no problems. The other process appeared to have created folder 123456, but it actually hadn’t. That didn’t stop the second process from returning a non-existent object id, which the code then used to create the final subfolders in the hierarchy.
The end result was that the code thought everything was okay. The audit trail showed object creation. We could retrieve the record just fine through search. We couldn’t browse to it, but since anyone actually trying to was extremely rare, nobody noticed. As far as any users were concerned, the issue didn’t exist.
At this point, I’ve had my two concurrent “creations”. Now I have a third record, 123456881, that wants to live in the same folder structure. It works fine until it looks for folder 12345688, the one “created” as part of the second process above. A simple check with the Documentum DFC (Documentum Foundation Classes) shows that the folder doesn’t exist, so the creation of the “missing” folder is initiated. The save of the “new” folder then fails because, within the database, that folder DOES exist.
This is all crazy stuff. My mind is just twisting trying to explain it clearly. Let’s talk about what we found.
Database in Chaos
I did some digging and learned a few important things. The folder 12345688 did actually exist, but it had invalid values in two attributes, i_folder_id and i_ancestor_id. Both referenced the object id of the non-existent ‘123456’ folder, which is the root of the actual problem.
This made the fix simple: go into the database and correct the values in the hierarchy of folders that point to the invalid object. I even created a query to find all instances of this issue:
Select * From dm_folder_r with (NOLOCK)
Where i_ancestor_id Not In
(Select r_object_id From dm_folder_s with (NOLOCK))
And i_ancestor_id IS NOT NULL
* Note that this is from a SQL Server installation, and I used the NOLOCK hint so these investigative, read-only queries would not take locks on a busy system.
This gave me a list of objects that needed to be fixed. Once I had the list, I could validate the non-existence of each referenced object with the following query, using the i_ancestor_id value as the parameter. You will want to check and double-check everything in this process, because you do not want to implement an incorrect fix.
Select * From dm_sysobject_s with (NOLOCK)
Where r_object_id in ('0b01a47c83447536')
At this point, I retrieve the r_object_id of the valid folder that I need in order to fix the system:
Select r_object_id From dm_sysobject_s with (NOLOCK)
Where object_name = '123456'
Using this information, I can update the necessary rows. For the ‘12345688’ object, you need to update the i_folder_id in the dm_sysobject_r table. For both the ‘12345688’ and ‘Record – 123456889’ objects, the i_ancestor_id needs to be updated in the dm_folder_r table. You are essentially replacing the invalid object id with the valid one.
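To make that concrete, here is a minimal sketch of the repair. One value is hypothetical: ‘0b01a47c83447536’ is the invalid object id confirmed as non-existent above, while ‘0b01a47c83441111’ merely stands in for the r_object_id of the valid ‘123456’ folder returned by the previous query.

-- Re-point the parent folder link of the broken '12345688' folder
Update dm_sysobject_r
Set i_folder_id = '0b01a47c83441111'
Where i_folder_id = '0b01a47c83447536'

-- Re-point the ancestor links that still reference the phantom folder
Update dm_folder_r
Set i_ancestor_id = '0b01a47c83441111'
Where i_ancestor_id = '0b01a47c83447536'

As with any direct database change, back up the affected rows and verify the row counts before and after.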
Wrapping It Up
I hope that you never encounter this, but if you do, hopefully this will help you through it. EMC support seemed to indicate that this is limited to SQL Server installs due to database constraint limitations, but given the difficulty of reproducing and testing it, they can only hypothesize.
In the meantime, we can fix it. As long as we catch it before trying to ingest, in this example, Record ‘123456881’, we will be okay. We can run that first query on a regular basis to find any issues. The problem does show up in the data inconsistency report, but running the query ourselves helps more, because we can start working on the solution right away.
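If you want to run that check on a schedule, a count-only variant of the same query (my own tweak, not from the original write-up) is enough to flag whether anything needs attention:

Select Count(*) From dm_folder_r with (NOLOCK)
Where i_ancestor_id Not In
(Select r_object_id From dm_folder_s with (NOLOCK))
And i_ancestor_id IS NOT NULL

Anything greater than zero means it is time to start the repair process described above.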