Taxonomy Size: No Single Size Fits All

Filiberto Emanuele - 3 February 2017

Some data analysts think that a bigger taxonomy is better than a smaller one: more categories, more information. However, most know that it’s more important for a taxonomy to be correctly designed. Size isn’t a goal in itself; the right size follows from the subject that the taxonomy has to cover.

What many don’t always consider is the way that a taxonomy will be used. This topic (and many others) has a lot to do with a dichotomy that’s always in the background: a general-purpose taxonomy vs. one that is function-specific.

Samuel Johnson said that “Many things difficult to design prove easy to performance.” Perfect design is achieved when a tool doesn’t just do its job, but is also easy to operate and integrates gracefully with the other objects around it. In other words, a lamp’s job is not only to light a room; it also has to look like a lamp. It should be easy to place in a room without disrupting the rest of the furniture and, finally, it should be obvious where the switch is. Clearly, none of this is achievable with just one lamp, since you also want one that is perfect for your bedside table (typically smaller) and perhaps one that sits beside your armchair for when you’re reading a book (a floor lamp). If your floor lamp is designed correctly, it’ll probably have its switch at about the height of your armrest so that you can effortlessly switch it on while seated.

Over the years, while designing taxonomies for many clients, I’ve noticed how close a taxonomy needs to be to the problem it is meant to solve. Even for clients in the same industry, I still had to redesign each taxonomy (sometimes radically) simply because the client had different uses in mind. And, when the general structure changes, not even the order of “child” and “parent” will necessarily stay the same. We all know the concepts of “being a type of” and “being a part of.” An apple is a fruit, so “apple” is going to be the child of “fruit”, not the other way around. No such natural order exists when you’re linking completely different ideas and the link only makes sense within a specific application.

For example, let’s say that we’re working on a taxonomy to classify events. Here, we have concerts and races and festivals, and we also have the type of audience these events may be best suited for (family, young, etc.). Types of events and types of people are two unrelated concepts that we want to link in this taxonomy for the very specific purpose of offering interested users some sort of navigation in a content directory. So, which is better? Go to “concerts” and then drill down into “family”, or go to “family” and then drill down into “concerts”? In a few cases you may discover that there is no best way, just preferences. However, most of the time there is a clear choice based on what users will find obvious. Occasionally you may notice that, if you turn your parent-child structure upside-down, people will find it surprisingly difficult to navigate your tree.

Just for kicks, I’m going to answer which way I think is better here: I’d say placing the events below the types of people is more practical, since users who return to your service tend to stay in the same category (“young”, for example). That category becomes a sort of personal “root” in your tree, so they can focus on the event choices that are offered. If this were reversed, then we’d have a user going into “concerts” and then into “young,” and if they’re not so sure about those concerts they would go back to “festivals” and then into “young” (again). You can understand how having to repeat this selection in every branch can become annoying and feel useless to the user.
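To make the comparison concrete, here is a minimal sketch that counts the selections a returning user makes under each ordering. The category names come from the example above; the dictionary structure, the function names and the click-counting rule (one selection per parent change, one per child) are assumptions made purely for illustration.

```python
# Illustrative only: the names below come from the example in the text,
# while the data structure and the counting rule are assumptions.
AUDIENCES = ["family", "young"]
EVENT_TYPES = ["concerts", "races", "festivals"]


def build_tree(parents, children):
    """Return a two-level taxonomy as {parent: [child, ...]}."""
    return {parent: list(children) for parent in parents}


def clicks_to_browse(tree, visits):
    """Count selections needed to visit each (parent, child) pair in order.

    Staying under the same parent costs one selection (the child);
    switching parent costs two (the parent, then the child).
    """
    clicks = 0
    current_parent = None
    for parent, child in visits:
        if child not in tree[parent]:
            raise KeyError(f"{child!r} is not under {parent!r}")
        if parent != current_parent:
            clicks += 1  # select the parent node
            current_parent = parent
        clicks += 1  # select the child node
    return clicks


audience_first = build_tree(AUDIENCES, EVENT_TYPES)
event_first = build_tree(EVENT_TYPES, AUDIENCES)

# A returning "young" user looks at concerts, then at festivals.
audience_first_clicks = clicks_to_browse(
    audience_first, [("young", "concerts"), ("young", "festivals")]
)
event_first_clicks = clicks_to_browse(
    event_first, [("concerts", "young"), ("festivals", "young")]
)
print(audience_first_clicks, event_first_clicks)  # 3 4
```

The gap grows with every additional event type the user compares, because the event-first tree forces the audience choice to be repeated in each branch.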

A single design is almost never right for every purpose. Yes, standard taxonomies exist, and they exist for a simple reason: sometimes different systems need to speak a common language. This means that standard taxonomies have another goal, one that has nothing to do with the actual nodes of the taxonomy tree or with the concept of correct design for a domain-specific taxonomy. The main goal of a standard taxonomy is to facilitate communication. They’re very useful for this reason, but they can also be misleading if used as a reference for designing a new taxonomy. In fact, standard taxonomies are rarely the best way to represent a field of knowledge, and companies that use them typically also adopt an internal taxonomy.

The overall size of a taxonomy is tied even more closely to application-centric design. Among the many reasons that push a company to introduce classification into its processes (or products), several have to do with macro-clustering. Put simply, the company wants to quickly determine which documents fall into which “super-classes.” For example, in Media we’re used to reading our news on sites that show macro-classes (usually called “channels” or “sections”) such as “Sports”, “Finance”, “Politics”, “Technology,” etc. These categories are allowed to overlap (an article in the Sports section can also talk about technology), and they don’t need to go into great detail (at this level we don’t even worry about the difference between “Football” and “Golf”; it’s just “Sports”). This kind of categorization uses a very small taxonomy tree (often with just a couple of levels). It has numerous applications and is frequently preferred to the kind of detail a larger taxonomy would offer. And it’s not just about performance (which, we should remember, always has to be taken into consideration). It’s more that a larger tree will have a different purpose, and thus a different design, so things may end up arranged in a way that is not suited to the “news channels” type of clustering.
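As a rough illustration of macro-clustering, the sketch below rolls fine-grained categories up to their top-level channel. The “Sports”, “Technology”, “Football” and “Golf” labels come from the text; “Gadgets”, the parent-link table and the function names are hypothetical and exist only to make the example runnable.

```python
# Hypothetical child -> parent links; None marks a top-level channel.
# "Gadgets" is an invented category used only for this illustration.
PARENT = {
    "Sports": None,
    "Technology": None,
    "Football": "Sports",
    "Golf": "Sports",
    "Gadgets": "Technology",
}


def channel_of(category):
    """Walk up the parent links until a top-level channel is reached."""
    while PARENT[category] is not None:
        category = PARENT[category]
    return category


def channels_for(categories):
    """Return the sorted set of channels covered by detailed categories."""
    return sorted({channel_of(category) for category in categories})


# An article about golf equipment may carry two detailed tags.
article_tags = ["Golf", "Gadgets"]
article_channels = channels_for(article_tags)
print(article_channels)  # ['Sports', 'Technology']
```

The article lands in two channels at once, which is the overlap described above, and the distinction between “Football” and “Golf” disappears at this level.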

Small taxonomies are crucial in some subdomains of Artificial Intelligence where simplification is paramount. This is true for reasoning processes in some Fuzzy Logic applications as well as for optimization purposes (self-driving cars are going to process a great deal of information per second, and their processing power is limited).

At the other end of the spectrum, good arguments can be made for large taxonomies. Massive standard classification systems like IPTC (for Media & Publishing) and NAICS (for business establishments across industries) are used to harmonize language in their respective fields, so that, for instance, a government official can quickly find out what an establishment manufactures just by looking at a NAICS code number. Following the NAICS example, we’ll notice that there are separate codes for Chocolate Confectionery Manufacturing and Non-chocolate Confectionery Manufacturing. Clearly, in this case there is a need for greater detail.
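Part of what makes a code like this quick to read is that NAICS codes are hierarchical: each additional digit narrows the classification, from a two-digit sector down to a six-digit national industry. The sketch below only splits a code into those prefixes. The function name is hypothetical, and the sample code 311340 is used purely as an input string; check it against the official NAICS tables before relying on what it denotes.

```python
# Level names follow the published NAICS digit structure.
# The function name and the sample code are illustrative assumptions.
NAICS_LEVELS = [
    (2, "sector"),
    (3, "subsector"),
    (4, "industry group"),
    (5, "NAICS industry"),
    (6, "national industry"),
]


def naics_prefixes(code):
    """Split a six-digit NAICS code into its hierarchical prefixes."""
    if len(code) != 6 or not code.isdigit():
        raise ValueError("expected a six-digit NAICS code")
    return {level: code[:digits] for digits, level in NAICS_LEVELS}


levels = naics_prefixes("311340")
for level, prefix in levels.items():
    print(f"{level}: {prefix}")
```

Two establishments whose codes share a longer prefix are closer in the taxonomy, so the level of detail an application needs can be chosen simply by deciding how many digits to compare.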

In conclusion, when it comes to taxonomies, I believe that no single size fits all, and bigger isn’t necessarily better. We should always think about the final application, its purpose and its most functional design.

Written by Filiberto Emanuele, Director of Technology, Expert System