Data Mining. If you’re not familiar with the term, you soon will be. Just like the gold rushes from centuries ago, there’s a new commodity that everyone is after and it goes by the name of “data”.
Five of the most valuable businesses in the world, namely Amazon, Apple, Facebook, Google (Alphabet) and Microsoft, are dominating their industries and generating unprecedented revenue streams. Google’s search engine, Amazon’s personalised and targeted shopping, Facebook’s advertising… they’re all driven by data.
Data is everywhere, and the businesses and organisations that are growing, strengthening and evolving are all doing it through data. Perhaps you’re thinking that’s all well and good for them, as they’ve found new revenue streams and services to offer, but you aren’t changing any of the products you offer. Maybe you’re a government department mandated to provide the same services, just in a different way?
Well, this is where Data Mining comes in. IBM have claimed that they foresee over 700,000 new recruits being hired into data-driven roles by 2020. A recent study by the Talent Garden Innovation School (who offer master’s degrees in business data analysis) found that 50% of Small and Medium Enterprises plan to recruit a data analyst in the next three years.
So, what are you looking for? Firstly, it’s imperative to employ a range of approaches. No single approach will work, especially if the data is unstructured. You need a combination of tools to cover areas such as:
- Data Sourcing: Connecting outside sources to your data repositories. Ideally, find something that can automate this process and interact with the APIs of your services to collect that data.
- Data Structuring: It’s all very well collecting lots of data, but you need to be able to work with it. Solutions such as Data Warehouses and Data Lakes can help structure and standardise that data.
- Data Storage & Security: You need an easily accessible, flexible, secure storage solution for your data to exist in. No point putting it in a big box, locking it and throwing away the key. Use the cloud, pay only for what you use, rather than having a big SAN running at only 50% capacity.
- Data Visualisation: If you’re unable to see the data presented in different ways, from different sources, it’s hard to understand what you’re working with. The same data displayed in three different charts or graphs can sometimes tell three different stories.
- Data Insights: Intelligent dashboards can build upon the visualisations to present data in real time, re-structured to report on outcomes and objectives, highlight forecasts vs actuals and much more.
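To make the sourcing and structuring steps a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the two sources (`crm_rows` and `billing_rows`), their field names and the shared record layout are hypothetical, not taken from any particular product.

```python
from datetime import date

# Hypothetical rows from two sources that describe the same customer
# using different field names, units and date formats.
crm_rows = [{"CustomerID": "C-001", "Spend": "120.50", "Date": "2019-03-01"}]
billing_rows = [{"cust": "c-001", "amount_pence": 9900, "day": "04/03/2019"}]


def from_crm(row):
    """Map a CRM row onto the shared layout (ISO dates, amounts in pounds)."""
    return {
        "customer_id": row["CustomerID"].upper(),
        "amount": float(row["Spend"]),
        "when": date.fromisoformat(row["Date"]),
    }


def from_billing(row):
    """Map a billing row onto the shared layout (day/month/year dates, pence)."""
    day, month, year = (int(part) for part in row["day"].split("/"))
    return {
        "customer_id": row["cust"].upper(),
        "amount": row["amount_pence"] / 100,
        "when": date(year, month, day),
    }


# Once every source shares one layout, the data can be combined and queried.
records = [from_crm(r) for r in crm_rows] + [from_billing(r) for r in billing_rows]

total_by_customer = {}
for rec in records:
    key = rec["customer_id"]
    total_by_customer[key] = total_by_customer.get(key, 0.0) + rec["amount"]
```

In practice a warehouse or integration tool would do this mapping for you, but the principle is the same: agree one structure, then translate each source into it.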
It’s vital to remember that to drive more value, more data is needed. Take a credit check, for example. If you only had six months’ worth of history for the person in question, would you know enough to lend them large sums of money? Maybe they only have one loan, right? What if that loan was used to pay off another, which in turn had paid off the one before it? More data will provide better insights and visualisations, ultimately painting a truer picture.
Now, you’re either trying to find something specific, or you’re trying to find something that you’re not specifically looking for. Sound confusing? Well, Data Mining is no different to traditional mining. You might be digging for gold and come across diamonds. Maybe you’re looking for oil, but stumble across natural gas reserves. Anything and everything valuable that can be found is worth harvesting.
Data Miners and Data Scientists employ a variety of methods during their analysis, such as:
- Classification methods: Grouping data into logical units based upon shared characteristics, such as age, spending habits, locations etc.
- Association methods: Finding trends and making predictions based upon recurring instances or habits, e.g. customers exceeding their payment terms on invoices of certain amounts.
- Cluster methods: More of a visualisation approach than anything else, but important for mapping and understanding where there are trends. Below you see an example of mapping the locality of field engineers regarding the assignments they’re sent to, against the cost of travel expenses they accrue. You’d typically expect to see most objects grouped in identifiable patterns.
Using the above example, perhaps the original hypothesis was that the closer assignments are located to the engineer, the lower the cost. For the most part, this is true, but the business can also now start to delve deeper into why some of the engineers are running up large expenses despite not travelling a correspondingly large distance to the assignment. What is causing those anomalies?
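One simple way to surface those anomalies, sketched here in Python, is to compare each engineer’s cost per mile with the typical rate. The names, distances, expenses and the threshold of twice the median are all made up for illustration; a real analysis would use your own data and a threshold that suits it.

```python
from statistics import median

# Hypothetical data, one row per engineer:
# (engineer, distance to assignment in miles, travel expenses in pounds)
assignments = [
    ("Asha", 10, 25.0),
    ("Ben", 40, 95.0),
    ("Cara", 15, 140.0),
    ("Dev", 80, 210.0),
    ("Eli", 25, 60.0),
]


def flag_anomalies(rows, factor=2.0):
    """Return engineers whose cost per mile exceeds `factor` times the median rate."""
    rates = {name: cost / distance for name, distance, cost in rows}
    typical = median(rates.values())
    return [name for name, rate in rates.items() if rate > factor * typical]


anomalies = flag_anomalies(assignments)
print(anomalies)
```

With this sample data only Cara is flagged: her expenses are high even though her assignment is close by, which is exactly the kind of case worth investigating further.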
Using Business Intelligence tools makes it easy to pull in new data sets, cross reference these against other visualisations, overlay them and so on. Where one organisation might stop at proving the original hypothesis and understanding those anomalies, another might keep mining…
What if you have this data available going back quite some time? It wouldn’t be particularly challenging to build on the association analysis and start to predict the typical expenses cost based on distance (predicting a numeric value, such as cost, from another variable, such as distance, is referred to as “Regression Analysis”). Perhaps if you incorporated the location distance of an engineer’s past 10 assignments into the visualisations, you could identify whether some engineers are persistently being placed on non-local assignments, at greater cost to the organisation. Maybe a reshuffle of assignments is needed, or maybe it’s actually more cost effective to recruit someone closer to the farther-reaching assignments, rather than to keep sending the current resources.
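As a rough illustration of predicting expenses from distance, the sketch below fits a straight line using ordinary least squares in plain Python. The figures are invented and deliberately tidy so the result is easy to check; real expenses data would be noisier, and a straight line is only one of many possible models.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical history: distance in miles and expenses in pounds.
distances = [10, 20, 30, 40, 50]
expenses = [30.0, 50.0, 70.0, 90.0, 110.0]

slope, intercept = fit_line(distances, expenses)

# Estimate the expenses for a 60 mile assignment from the fitted line.
predicted = slope * 60 + intercept
print(f"Estimated expenses for 60 miles: £{predicted:.2f}")
```

Here the fitted line works out at £2 per mile plus a £10 fixed cost, so a 60 mile assignment is estimated at £130. Engineers whose actual expenses sit well above the line are candidates for a closer look.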
The key point to take on board is that by mining more and more through the data, bringing in larger volumes and different types, and experimenting with overlaying these in various ways, you can do much more with it. Better insights, better predictions, even the ability to change services. Amazon’s “items you might be interested in” is one of the simplest examples. They added more data to each product by categorising and tagging it, then simply logged the categories and types you purchased and fed that into a service to search the rest of the catalogue for similar tags. Simple, but hugely effective. I wonder how many millions of products they’ve sold from clicks on those items, eh?
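The tag matching idea can be sketched in a few lines of Python. To be clear, this is not how Amazon’s service is actually built; the catalogue, tags and scoring rule below are hypothetical and only illustrate the general approach of recommending items that share tags with past purchases.

```python
from collections import Counter

# Hypothetical catalogue: each product carries a set of tags.
catalogue = {
    "tent": {"outdoors", "camping"},
    "sleeping bag": {"outdoors", "camping", "bedding"},
    "duvet": {"bedding", "home"},
    "kettle": {"kitchen", "home"},
    "camping stove": {"outdoors", "camping", "kitchen"},
}


def recommend(purchased, catalogue, limit=3):
    """Rank unpurchased items by how many tags they share with past purchases."""
    # Count how often each tag appears across the items already bought.
    tag_counts = Counter(tag for item in purchased for tag in catalogue[item])
    scores = {
        item: sum(tag_counts[tag] for tag in tags)
        for item, tags in catalogue.items()
        if item not in purchased
    }
    # Highest score first, with names as a tie-break so the order is stable.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, score in ranked if score > 0][:limit]


suggestions = recommend(["tent"], catalogue)
print(suggestions)
```

Someone who bought a tent is shown the camping stove and the sleeping bag, while the duvet and kettle share no tags and are left out. The richer the tagging, the better the suggestions become, which is the point about adding more data.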