Recently, Sumanth Molakala sent me a comment regarding the best place to store metadata.
Here is the relevant portion of his comment:
…one of the key items I am trying to address in it is managing meta-data. I am sure you are aware of the two schools of thought – One – save content in content server and meta-data outside of the content server in a “custom” meta-data repository (assuming that the world doesn’t revolve around Documentum). Two – the traditional approach to save content and meta-data in the content server…
My quick answer…It depends! Before you shoot me, read on…
Why Have Metadata?
Before we answer this question, it is important to remember why we have metadata in an ECM system. Originally, metadata in the ECM world was used primarily to find information in document management systems. Part of the whole Knowledge Management problem is finding information. Metadata helped to address that problem.
As those document management systems evolved into the current crop of ECM systems, metadata began storing other information. Business applications were built on top of ECM systems. Sometimes they stood alone; other times they were tightly coupled with another system. Metadata began to store information about the business functions in which the content was involved. Sometimes, it just replicated information from another system so that the user could work effectively in both systems.
Of course, the world is changing. So the question remains, where should it reside?
Metadata in the ECM 2.0 World
In the ECM 2.0 world, the ECM system is the backbone of an Enterprise Architecture. It stores content for multiple systems and provides content-specific services such as Records Management. In this world, there are several questions to ask.
- Will users access the content directly from the ECM system? If no, then only the information needed to support the access of data is required. This isn’t as simple as a case number or object id. If users are going to use a search from the business application, then it is important to automatically populate the relevant fields from the application to the ECM system in order to facilitate search. If yes, then you need more information depending on what tasks the user will be performing when they directly access the ECM system.
- Are you going to enforce any compliance rules on the content? If yes, then you need all the information necessary for the retention policies to determine how the content is to be treated. If no, get a good lawyer. 😉
- Are you going to just use the ECM vendor for every business solution? If yes, then it needs everything, for performance and CYA if nothing else. This is also ECM 1.0 thinking. That still works, and is necessary for a while longer, but isn’t the world that people are moving towards. Oh, you will also need some good ECM consultants, so drop me a note.
In the ideal world, the answers are No, Yes, and No. The nice thing about that approach is that a lot of the metadata is automatic. The metadata comes from the business application automatically or is derived from the environment. The user selects the type of document and the name, and the system takes it from there.
I do have one system where we are going to have all of the metadata. We have a Web Service that receives metadata for every case and stores it in the repository. Then, as documents are added to the system, they are associated with a case and all the relevant metadata is there, as read-only. This allows users to research multiple cases in a stand-alone ECM portal with all data available to them. The case system is in flux, unlike our ECM system, and isn’t ready to surface the content directly in their application, so we took this approach. While not the long-term approach, it works well for now.
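The arrangement described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the `Repository` class, its method names, and the field names are all hypothetical stand-ins, not the API of any ECM product or of the actual system.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Dict, List, Mapping


@dataclass
class Document:
    name: str
    doc_type: str
    case_id: str  # the only case information stored on the document itself


class Repository:
    """Toy stand-in for the ECM repository (hypothetical, not a vendor API)."""

    def __init__(self) -> None:
        self._cases: Dict[str, Dict[str, str]] = {}
        self._documents: List[Document] = []

    def receive_case_metadata(self, case_id: str, metadata: Mapping[str, str]) -> None:
        # What the web service would call each time the case system pushes data.
        self._cases[case_id] = dict(metadata)

    def add_document(self, name: str, doc_type: str, case_id: str) -> Document:
        # A document must be associated with a case that is already known.
        if case_id not in self._cases:
            raise KeyError(f"unknown case: {case_id}")
        doc = Document(name=name, doc_type=doc_type, case_id=case_id)
        self._documents.append(doc)
        return doc

    def case_metadata(self, doc: Document) -> Mapping[str, str]:
        # Read-only view: users can see the case data but cannot change it here.
        return MappingProxyType(self._cases[doc.case_id])

    def documents_for_case(self, case_id: str) -> List[Document]:
        return [d for d in self._documents if d.case_id == case_id]
```

The point of the design is that the case metadata is stored once per case and only referenced by each document, so an update from the case system is visible on every associated document without touching the documents themselves.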
Back to the Question
My answer is simple. I am firmly against a stand-alone metadata repository. I have no problem with the metadata being stored outside of the ECM system as described above. My point is that the metadata about the context should be stored in the business application and the document-specific data should be in the ECM system. What is that document-specific data?
- Name
- Audit information (dates, users)
- Security (need to keep it secure)
- Source information (for example, if scanned, the where, who, and original location)
- Business Application link (may not be necessary, but is always useful)
- Searchable metadata fields (optional, but allows for better searching within the repository)
- Retention metadata fields (any fields that could impact the retention or records policy)
That may seem like a lot, but take a background investigation system. The system would have lots of information on the person being investigated, interviews being conducted, forms submitted, approvals, reviews, and notes from the investigator. The submitted forms, such as the fingerprints, might be scanned into the system and stored in an ECM system. The ECM system needs to know that they are fingerprints, the investigation number, and when the investigation is completed. The name doesn’t matter and could be automatically generated. The investigation completion date will trigger the beginning of the retention policy. Users should spend their time outside of the ECM system to do their work.
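As a rough sketch of the fingerprint example, here is what the document-side record might look like in Python. The class, its field names, and the retention period passed in are assumptions made for illustration; real retention rules come from your records policy, not from this code.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class DocumentRecord:
    name: str                   # may be auto-generated
    doc_type: str               # e.g. "fingerprints"
    investigation_number: str   # link back to the business application
    created_by: str             # audit information
    created_on: date
    investigation_completed_on: Optional[date] = None  # retention trigger

    def disposal_date(self, retention_years: int) -> Optional[date]:
        """Return when retention ends, or None while the investigation is open."""
        start = self.investigation_completed_on
        if start is None:
            return None
        try:
            return start.replace(year=start.year + retention_years)
        except ValueError:
            # Feb 29 with a non-leap target year: fall back to Feb 28.
            return start.replace(year=start.year + retention_years, day=28)
```

Everything else about the investigation (the person, the interviews, the notes) stays in the business application; the record above carries only what the ECM system needs to identify, secure, and retain the document.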
It was a little rambling, but those are my thoughts on the subject. I, and Sumanth, would love to hear what your experiences are.
“Depends” is the correct answer. In the end, it depends on what you are going to do with all that metadata. In my mind, a properly maintained and organized collection of metadata can serve one of two purposes.
The first is using the metadata to tag, label and describe documents. Perfect for searching, organizing and maintaining. Very ECM 1.0. This is all kept within the ECM system.
The second is to take the metadata a step further and almost completely detach the metadata from the documentation. The document is still a record, but more of an “instance” of the metadata. The metadata can become a master data repository. Rather than many documents that have the same metadata entry, the model is entirely flipped. The *metadata* entry now has links to many documents.
A practical example is having 100 documents with the author field value of “John Doe”. The typical query to find all documents written by John is: select document_name from dm_document where author = 'John Doe'. If you’ve taken the ECM 2.0 approach, you instead query the metadata with: select document_name from author.John_Doe (or something along those lines). That can be a major performance enhancement.
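To make the “flipped” model concrete, here is a small Python sketch contrasting the two lookups. The sample data and function names are invented for illustration, and it says nothing about how any particular product implements this; a relational database with an index on the author column gets a similar effect.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (document_name, author) pairs standing in for rows in the repository.
documents: List[Tuple[str, str]] = [
    ("spec.doc", "John Doe"),
    ("plan.doc", "Jane Roe"),
    ("memo.doc", "John Doe"),
]


def find_by_scan(author: str) -> List[str]:
    # Document-centric model: check every document's author field.
    return [name for name, doc_author in documents if doc_author == author]


def build_author_index(docs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    # Metadata-centric model: each author entry links to its documents.
    index: Dict[str, List[str]] = defaultdict(list)
    for name, doc_author in docs:
        index[doc_author].append(name)
    return dict(index)


author_index = build_author_index(documents)
```

With the index built, `author_index.get("John Doe", [])` is a single dictionary lookup, while `find_by_scan("John Doe")` has to visit every document. The cost is that the index must be kept in step with the documents whenever an author value changes.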
So where do you put this ECM 2.0 metadata? It’s one of the big reasons that EMC purchased XHive and is incorporating this into Documentum as fast as possible. XHive (as I see it) is going to be positioned as a way for other applications to access metadata while keeping it under the comfy warm security blanket of Documentum.
If you don’t have XHive, then you’re looking at another solution. Depending on your industry, compliance rules, and metadata usage, the answer can change. If you’re selling simple Widgets and using the metadata across your entire company, from customer information to billing to manufacturing, then perhaps your metadata would be better stored in an ERP application like OneWorld or SAP. Financial institutions, medical facilities, and universities can probably stick with keeping everything within their ECM.
Personally, I’m weighing the options, but taking a real close look at XHive as something that bridges the gap nicely.
Great comment, Chris. Let us not forget the old standby. The investigation is represented by a folder with custom metadata, and the documents in the folder are the case documents.
Simple, yet effective.
-Pie
When possible, we always store as much metadata with the content as is manageable.
One of the reasons is that we often have requests from customers to build the business application itself on top of ECM and then integrate with other parallel business apps.
Having metadata inside ECM helps also with the “central repository of information” marketing/sales pitch.
We also have several solutions where metadata is kept to a minimum (reference id, security, retention info).
I would support the idea of having information outside the ECM store only if we can think of these as 2 decoupled systems in terms of business process (not technology).
So, it depends. IT is art, not science.
I have worked on a number of solutions that keep meta-data outside the content repository, and a lot more that keep it inside. For me, “depends” is the correct answer, as it really does come down to what you are using your repository for…
A number of projects I have worked on used a repository simply to image-enable another system, e.g. CRM. Retention periods could still be set, but the system was never used for anything else (well, not for that class of documents). In this case, I had no problem with not storing meta-data in the system.
Another example was a case where the meta-data needing to be stored was simply too complex for the content management system to handle (a failing of their chosen provider, to be honest). However, in this system I did insist that certain meta-data, key to the files, was also stored. This was simply because the retrieval system was still being used, if only from time to time, to gain direct access to these files, not just from the third-party system.
So for me, there isn’t a hard rule to follow. You need to make sure your TA or ETA has a good understanding of what the business requires and the way in which the system will be used, not just now, but also in the future. If you don’t have a good ETA, then you may well have some trouble.