Metadata
To understand metadata, ask yourself this question the next time you go shopping: can I get all the information I need from this product I’m holding to make a well-informed purchase decision?
If you have all the information you need, then you most probably got it from the product's label. The label shows the brand, material, price, and any applicable discounts. The information on the label is metadata.
It is the same with digital information. If you look at a document and get the Title, Author, Description, and Date it was published, you’re looking at the document’s metadata.
There are two things you need to know about metadata:
- Metadata is not usually part of the document body (just as the label is not part of the product)
- Metadata contains rich information that identifies the document (e.g., Title, Author, Publish Date)
Here’s a question: What happens if 10 documents use an Author field and 10 other documents use a Creator field to refer to the same thing, the person who authored the document? How will you find all 20 documents?
The search algorithm must know that Author = Creator to give good results.
Now, imagine the search engine's overhead if there are hundreds of such lookups. It makes the use of metadata seem inefficient.
A simple solution exists: be consistent with metadata.
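One way to picture consistency is to map variant field names to a single canonical name once, when a document is stored, so that search never has to know that Author = Creator. The following is only a minimal sketch: the alias table, the `normalize_fields` helper, and the sample records are all hypothetical.

```python
# Hypothetical alias table: variant field names mapped to one canonical name.
FIELD_ALIASES = {
    "creator": "author",
    "written by": "author",
}


def normalize_fields(record: dict) -> dict:
    """Return a copy of the record with field names mapped to canonical names."""
    normalized = {}
    for name, value in record.items():
        key = name.strip().lower()
        normalized[FIELD_ALIASES.get(key, key)] = value
    return normalized


# Two sample documents that label the same information differently.
docs = [
    {"Title": "Trip report", "Author": "A. Tan"},
    {"Title": "Country report", "Creator": "A. Tan"},
]

normalized_docs = [normalize_fields(d) for d in docs]

# After normalization, a single lookup on "author" finds both documents.
by_author = [d["title"] for d in normalized_docs if d.get("author") == "A. Tan"]
print(by_author)
```

If authors use the same field name from the start, the alias table is not needed at all, which is the point of being consistent.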
Another question: Is it possible to get consistent metadata values that can be applied to all types of information in the company?
Yes. But there’s more to it.
- Not all information requires all metadata. Draft or personal documents, for example, will need little metadata, if any.
- We can start with a generic metadata set for information that needs metadata. You’ve already seen glimpses of it in the previous paragraphs. Title, Author, Description, Publish Date, etc., are generic metadata elements that can be applied to any type of information.
Such metadata is so generic that a standard known as Dublin Core describes 15 metadata elements that can be used for all types of information. The 15 elements are Contributor, Coverage, Creator, Date, Description, Format, Identifier, Language, Publisher, Relation, Rights, Source, Subject, Title, and Type.
After the generic set, we’ll need another set specific to the company's needs. It covers metadata such as Approvals, Workflow, Security and Preservation. This requires a bit of research and study. An external consultant is usually sought for this kind of work.
So there you have it: metadata = generic set + specific set.
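The formula can be sketched as a schema built from two lists. The 15 Dublin Core names come from the standard; the company-specific names are just the examples mentioned above, and the `unknown_elements` helper is a hypothetical check, not part of any standard or product.

```python
# The 15 Dublin Core elements (the generic set).
DUBLIN_CORE = (
    "Contributor", "Coverage", "Creator", "Date", "Description",
    "Format", "Identifier", "Language", "Publisher", "Relation",
    "Rights", "Source", "Subject", "Title", "Type",
)

# Illustrative company-specific set; a real one comes out of research.
COMPANY_SPECIFIC = ("Approvals", "Workflow", "Security", "Preservation")

# metadata = generic set + specific set
METADATA_SCHEMA = DUBLIN_CORE + COMPANY_SPECIFIC


def unknown_elements(record: dict) -> list:
    """List the field names in a record that are not part of the schema."""
    return sorted(name for name in record if name not in METADATA_SCHEMA)


record = {
    "Title": "Trip report",
    "Creator": "A. Tan",
    "Security": "Internal",
    "Author": "A. Tan",  # not in the schema: Dublin Core uses Creator
}
print(unknown_elements(record))
```

A check like this catches the Author versus Creator inconsistency at the moment a document is saved rather than at search time.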
Assigning metadata
Now that the company's metadata set is specified, how will staff assign metadata to the information they create? The assignment is usually done using the content management system (CMS) you use for the website or intranet.
Assigning metadata can be frustrating if not appropriately planned. When staff complain that “the system puts up this screen and expects me to fill out 20 fields,” they are really complaining about the metadata assignment strategy.
There are many ways to make the assignment simple and usable. We’ll cover three here:
- System-assigned: The CMS can automatically assign some values. For example, the system can automatically apply the Author, Publish Date, and File type.
- Smart defaults: Smart defaults try to predict the user's context and present the values most likely to be relevant. For example, the CMS might surface the user’s commonly used metadata values, or only the metadata applicable to the user’s department or job function.
- Smart templates: Templates can have pre-assigned metadata. Authors only need to fill out the remaining metadata. This can speed up the authoring process.
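The three strategies can be combined so that the author types as little as possible. The sketch below is an assumption about how a CMS might layer them, not a description of any particular product; the template names, department defaults, and the `build_metadata` function are hypothetical.

```python
from datetime import date

# Hypothetical pre-assigned metadata per template (smart templates).
TEMPLATES = {
    "trip_report": {"Type": "Trip report", "Language": "en"},
}

# Hypothetical defaults per department (smart defaults).
DEPARTMENT_DEFAULTS = {
    "research": {"Purpose": "Research collaboration"},
}


def build_metadata(template, user, department, filename, today, user_input):
    """Layer template values, defaults, and user input, then add system values."""
    metadata = {}
    metadata.update(TEMPLATES.get(template, {}))
    metadata.update(DEPARTMENT_DEFAULTS.get(department, {}))
    # Whatever the author typed overrides templates and defaults.
    metadata.update(user_input)
    # System-assigned values are filled in last and never typed by the author.
    metadata["Author"] = user
    metadata["Publish Date"] = today.isoformat()
    if "." in filename:
        metadata["File type"] = filename.rsplit(".", 1)[-1].lower()
    else:
        metadata["File type"] = "unknown"
    return metadata


result = build_metadata(
    template="trip_report",
    user="A. Tan",
    department="research",
    filename="visit-notes.DOCX",
    today=date(2024, 1, 15),
    user_input={"Title": "Visit to partner lab"},
)
print(result)
```

In this example the author supplies only the Title; the remaining six values come from the template, the department defaults, and the system.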
Metadata and controlled vocabularies
Let’s say you ask staff to fill out the Purpose of the trip when submitting trip reports. What will they fill out? Some might put ‘business’ while others might put ‘JE’ for joint exercises. You see the problem. There will be no consistency in what staff enter as metadata. How can we then get consistent metadata values?
By controlling what staff can choose.
A controlled vocabulary is a controlled list of terms. Let’s say you give staff only three options to fill out the Purpose of the trip.
- Academic conferences
- Joint exercises
- Research collaboration
Staff can only select from these three options. The values are controlled.
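In a form this is usually a drop-down list. Behind the form, the same rule can be enforced with a simple membership check. This is a minimal sketch; the `validate_purpose` function is a hypothetical name.

```python
# The controlled vocabulary for the Purpose of the trip.
TRIP_PURPOSES = (
    "Academic conferences",
    "Joint exercises",
    "Research collaboration",
)


def validate_purpose(value: str) -> str:
    """Accept a value only if it is one of the controlled terms."""
    if value not in TRIP_PURPOSES:
        allowed = ", ".join(TRIP_PURPOSES)
        raise ValueError(f"'{value}' is not allowed. Choose one of: {allowed}")
    return value


print(validate_purpose("Joint exercises"))

try:
    validate_purpose("JE")
except ValueError as error:
    print(error)
```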
But what if the controlled vocabulary is itself confusing and ambiguous? That is a serious problem, and it points to the lack of a metadata strategy.
A controlled vocabulary can do more than control authors' term selection. It can also enable some backend magic during search and retrieval.
Let’s say a staff member searches for the term “JE,” the informal form of “Joint Exercises.” Trip reports, being formal documents, do not contain the term “JE.” They are assigned the metadata term “Joint Exercises.” The result is that the search engine will not find the documents, even though they exist.
This is where an expanded controlled vocabulary can shine. Instead of just listing controlled terms, it can also include the following:
JE USE Joint Exercises
The statement tells the search engine to use the term “Joint Exercises” when it encounters the term “JE.” Problem solved.
You can imagine more situations where this simple expansion could help, such as homonyms, misspellings and abbreviations.
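A USE statement can be pictured as a lookup that runs before the search itself. The sketch below assumes a tiny in-memory document list and a hypothetical `search` function; the misspelling entry is an invented example of the kind of variant mentioned above.

```python
# USE table: non-preferred term (lowercased) mapped to the preferred term.
USE = {
    "je": "Joint Exercises",                # abbreviation, from the example above
    "joint excercises": "Joint Exercises",  # hypothetical misspelling
}


def resolve(term: str) -> str:
    """Replace a non-preferred term with its preferred term, if one is listed."""
    cleaned = term.strip()
    return USE.get(cleaned.lower(), cleaned)


# Hypothetical trip reports with their assigned Purpose metadata.
documents = [
    {"title": "Trip report 12", "purpose": "Joint Exercises"},
    {"title": "Trip report 13", "purpose": "Academic conferences"},
]


def search(term: str) -> list:
    """Find documents whose Purpose matches the resolved search term."""
    preferred = resolve(term).lower()
    return [d["title"] for d in documents if d["purpose"].lower() == preferred]


print(search("JE"))
```

The documents never contain "JE", yet the search finds the report because the query is rewritten to the preferred term first.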
But wait. There’s more!
A controlled vocabulary can be expanded in more ways:
- NT = narrower term (Marketing NT Corporate Marketing)
- BT = broader term (Invitation to Quote BT Corporate Procurement)
- RT = related term (Trip report RT Country Report)
We are structuring the flat list of terms in the controlled vocabulary: NT and BT arrange terms into a hierarchy, and RT links terms that are associated but not hierarchical. This kind of controlled vocabulary is called a thesaurus.
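The three relationships can be sketched as a small data structure. This is an illustrative model only, using the example terms above; the `Thesaurus` class and its method names are hypothetical. It treats BT as the inverse of NT and RT as symmetric.

```python
from collections import defaultdict


class Thesaurus:
    """A controlled vocabulary with NT, BT, and RT relationships."""

    def __init__(self):
        self.narrower = defaultdict(set)
        self.broader = defaultdict(set)
        self.related = defaultdict(set)

    def add_nt(self, term, narrower_term):
        # "term NT narrower_term" also implies "narrower_term BT term".
        self.narrower[term].add(narrower_term)
        self.broader[narrower_term].add(term)

    def add_bt(self, term, broader_term):
        # "term BT broader_term" is the same link seen from the other side.
        self.add_nt(broader_term, term)

    def add_rt(self, term, other):
        # Related terms point at each other.
        self.related[term].add(other)
        self.related[other].add(term)

    def expand(self, term):
        """Return the term plus every narrower term beneath it."""
        found = {term}
        stack = [term]
        while stack:
            current = stack.pop()
            for child in self.narrower.get(current, ()):
                if child not in found:
                    found.add(child)
                    stack.append(child)
        return found


thesaurus = Thesaurus()
thesaurus.add_nt("Marketing", "Corporate Marketing")
thesaurus.add_bt("Invitation to Quote", "Corporate Procurement")
thesaurus.add_rt("Trip report", "Country Report")

print(sorted(thesaurus.expand("Marketing")))
```

A search engine could use `expand` to include documents tagged with narrower terms when a user searches for a broader one, and use the related terms to suggest "see also" results.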