So, since you’ve invested in this course to understand CRISP-DM better, which will help you in your data science career, I think I can safely say that you love understanding how things work. You love the idea of working through a process. You like creating clarity out of chaos. And you’re probably okay with a little bit of ambiguity.
The business understanding phase, the first phase of CRISP-DM, is crucial because the only way you can bring value to the business, which is what matters at the end of the day, is to truly understand the business.
Businesses also have constraints; the primary constraints will always be time and money. There’s never enough time, and there are never enough resources to get things done. The way you approach a problem or goal will depend on the problem itself, but it will also depend on the timeline. If you don’t have a lot of time, the way you approach that goal or problem will be different than if it’s a long-term project that you have a lot of time for.
In the second lesson, we’re going to talk about the different kinds of problem-solving, or ways of reasoning, that you have at your disposal. We’ll also cover the importance of specific soft skills, particularly for this first phase, business understanding.
So, one thing that commonly happens in data science is that a business has this final product. The product is a sexy analytical solution, but unfortunately, it adds little business value. And it adds little business value because the people who led the project never really got down to truly understanding the company’s problem and goal. So, we’ll get into the specifics in a bit, but you do work backward. You have the goal, it’s a clear goal, and then you go back and see what you need to make that goal happen. To do this effectively, you need context; you need to know the business and what’s important.
Before I show you how to establish a goal and the objectives based on that goal, I want to first talk about the Information-Value chain. The idea is that information can only create value if it goes through a series of steps, and the result of those steps is some action back in the real world. And the first step in this chain is some real-world event or characteristic, and the last step is that action.
Now, on your screen you only see the first four steps, the ones that happen before any analysis occurs. I wanted to focus on these steps for the sake of connecting them to business understanding.
So let me explain the first four steps of the information value chain; as I said, these are the steps that happen before analysis. The first step is real-world events or characteristics that we can turn into data. So, let’s say a person goes online, uses their browser to go to the Walmart website or the Amazon website, and makes a purchase. That is a real-world event, something that happened in the world, and it gets captured somehow. The capturing is the second step, the system data capture, and we’ll get to that in just a second. I just wanted to give you an example so you understand what I mean by a real-world event.
And you can break these real-world events and characteristics into different categories. You can think of people as the first category. People interact with various institutions, such as governments, banks, and schools, and as a result, these institutions have information about us, for example, demographic information. So, they have characteristics about us, and this gets turned into data. Now, people also move through physical and virtual environments. For example, if I go on the Amazon website, my behavior is being tracked. Amazon is looking at the actions I took before I made the purchase. Did I go to the left-hand menus and click on them, or did I use the search? How long did I spend on the website before I made a purchase? So, there are all kinds of information about people: their characteristics, and their actions, like their movements through the physical and virtual environment.
Then we have objects. Think of different products that get sold; they have sizes, colors, and functions. These products have a path, a physical path along which they get delivered. They’re also sold via specific channels: online, in person, and so forth. These things are especially relevant in industries like logistics, supply chain, and transportation. And then, of course, there’s the fact that some objects do things. They’re not just a toy or a piece of clothing; they monitor things. For example, if you own a home, at the back of your home you have a meter that tracks how much hydro you’re using. That’s very important for that specific industry because how much you use determines your hydro bill. And then finally we have environmental events. These are things like the weather or temperature, earthquakes, and other natural disasters. These are also real-world events that occur, and we turn them into data.
And the second step is the system data capture. There are different systems, meaning mechanisms that capture or track physical or digital actions or information. Examples here are core enterprise systems, which follow the company’s financial operations, billing, invoicing, and accounting. Then you have customer or people systems, so something like the customer relationship management (CRM) system. These are systems where you interact with customers. They might hold email conversations, chat conversations, or phone conversations that you’ve had with customers. Then you have product or presence systems, such as product or content management systems. This would be a place where you have information about all your products, let’s say. Then you have capture systems of a technical nature, which monitor some process, like software operations. And finally, you have what are called external sources.
But basically, the idea is that we have these real-world events or characteristics, and they get captured by some system. For example, when a customer makes a purchase, it’s essential to understand that some system captures it, and to know the different categories of systems that could capture it, so you have some idea of where you would go to get the pertinent information. Is that data in some accessible location or storage? Often, it’s not very easy to access the data. Sometimes capturing systems are transactional, meaning they capture transactions one at a time, or they might capture the data in an unstructured way. Essentially, the data has been captured, but it’s not in a form that’s ready for analysis.
And sometimes, especially for large companies, data is in multiple systems, so you’ve got some information in the core enterprise system, some in the people systems or customer systems, and some in the operations systems that we discussed.
So, the way to bring that together is something called the data warehouse, sometimes also called an enterprise data warehouse. And this is where all the data is brought together from all these different data capture systems. Relationships get established between the various data, and transformations occur to make the data usable and ready for the final step before any analysis occurs. That is data extraction for analysis.
Let’s say all this data is now in a data warehouse. As an analyst or data scientist, you would use SQL to get only the information you need for your purposes. And this is why data analysts and scientists are so valuable: having the skill to extract from a warehouse only the stuff you need to get going and do analysis is just super helpful.
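To make the extraction step concrete, here is a minimal sketch in Python that uses an in-memory SQLite database as a stand-in for the data warehouse. The table name `retail_purchases`, its columns, and the sample rows are all invented for illustration; a real warehouse would have its own schema and connection details.

```python
import sqlite3

# In-memory SQLite database standing in for a data warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE retail_purchases (
        purchase_id INTEGER PRIMARY KEY,
        customer_id INTEGER,
        division    TEXT,
        channel     TEXT,
        amount      REAL
    )
    """
)

# Hypothetical sample rows, invented for this example.
conn.executemany(
    "INSERT INTO retail_purchases VALUES (?, ?, ?, ?, ?)",
    [
        (1, 101, "health", "online", 25.00),
        (2, 101, "food", "in-store", 12.50),
        (3, 102, "health", "mobile", 40.00),
        (4, 103, "beauty", "online", 18.75),
    ],
)

# Extract only what the analysis needs: three columns, health division only.
health_rows = conn.execute(
    """
    SELECT customer_id, channel, amount
    FROM retail_purchases
    WHERE division = ?
    ORDER BY purchase_id
    """,
    ("health",),
).fetchall()

print(health_rows)
```

The point of the sketch is the `SELECT ... WHERE` query: the warehouse holds everything, and the analyst pulls out just the columns and rows that the question requires.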
Now, with some of the context out of the way, let me get back to showing you how to establish a goal for the business. How do we get clarity about what they want?
Imagine there’s this company called ‘Dedicated Supply.’ They are a retail chain, let’s say a regional retail chain. And they sell food but also beauty and health products.
The first thing you must do, without a doubt, is get background information about the business. You must figure out the current business situation. What are the available resources? The problems? Their goals? You must determine the organization or company structure, and the divisions, departments, and project groups involved. Who are the key individuals? Who is providing the finances? How about the domain expertise?
You should also describe the problem area. Is it in health, food, or beauty? Describe the problem in general terms. Consider the motivations for the project. Has the business used any analytic solutions in the past? If not, does there need to be some education for different stakeholders and an outline of the analytical solution’s benefits?
Also, if there is a current solution, you need to describe that current solution. What solutions are currently addressing the problem? What are the positives and negatives of that current solution? What has the level of acceptance been for this solution within the organization?
At the top left of your screen, it says health division, marketing department. So, when compiling the background information of the business, we decided that the problem area is this division and this specific department.
Now, even though you’ve done all this work and have a lot of information about the business, its background, resources, and problems, the goal might still be unclear. Here’s an example of a vague question. The decision-maker or stakeholder asks how to decrease the costs of acquiring customers. This kind of question needs clarification because it’s not clear whether the goal is to improve the efficiency of acquiring customers or to improve the profitability of the business.
Because really, the marketing department could decrease the cost of acquiring customers just by using the current resources and strategies more effectively. That would reduce acquisition costs, but it wouldn’t necessarily improve profits.
It’s essential to understand the purpose and motivations behind the project because it helps you understand what the decision-makers and stakeholders are after.
Here is a more defined, straightforward question: what customers should we focus on to increase profits?
So now the question is much more concrete and more precise. The decision-makers are interested in increasing profits, and they want to understand their customers better.
Before I go to the objectives, I want to rewind. Remember, when compiling the background information, we picked a specific division and department as the problem area. To ask practical questions, you need to understand how that business or, in this case, the particular department works. You must learn about it because it helps you recognize what questions are worth asking and thinking about, and a conceptual business model allows you to do that. This is an example for the retail case, the Dedicated Supply retailer.
Now imagine you’re part of the data science team trying to help the health division’s marketing department with their problem, and they provide you with this conceptual business model. It’s just a set of different tables that record data on customers, and these tables have relationships with each other.
This arrangement is how relational databases work. As an example, we have the customers table in the top left, and that has information like the customer ID. Every customer will only have one customer ID. However, look at the table connected to it, the retail purchases table. While there can only be one customer with a given customer ID, that one customer can have many purchases. So, suppose they buy three different things. In that case, they’re going to have three different purchase product IDs. This relationship between the customers table and the retail purchases table is a one-to-many relationship, because there can only be one customer ID but there can be many purchase or product IDs.
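The one-to-many relationship can be sketched in a few lines of Python with SQLite. The two table names follow the ones mentioned in the lesson, but the columns and rows here are assumptions made up for the example, not the actual Dedicated Supply model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,  -- exactly one row per customer
        name        TEXT
    );
    CREATE TABLE retail_purchases (
        purchase_id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        product     TEXT
    );
    """
)

# Hypothetical data: customer 1 buys three things, customer 2 buys one.
conn.execute("INSERT INTO customers VALUES (1, 'Customer A')")
conn.execute("INSERT INTO customers VALUES (2, 'Customer B')")
conn.executemany(
    "INSERT INTO retail_purchases VALUES (?, ?, ?)",
    [(10, 1, "vitamins"), (11, 1, "shampoo"), (12, 1, "granola"), (13, 2, "soap")],
)

# The "one" side: a second customer row with the same ID is rejected.
try:
    conn.execute("INSERT INTO customers VALUES (1, 'Duplicate')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# The "many" side: count the purchases attached to each customer ID.
purchases_per_customer = conn.execute(
    """
    SELECT c.customer_id, COUNT(p.purchase_id)
    FROM customers AS c
    LEFT JOIN retail_purchases AS p ON p.customer_id = c.customer_id
    GROUP BY c.customer_id
    ORDER BY c.customer_id
    """
).fetchall()

print(duplicate_rejected, purchases_per_customer)
```

The primary key on `customers` is what keeps each customer ID unique, and the `customer_id` column in `retail_purchases` is what lets many purchase rows point back to the same customer.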
Going back to the information-value chain, think about the events and characteristics captured by this division and marketing department. This approach will make it much easier to think about the problem at hand. Then think about how they capture these different events or characteristics. Then, how do they store them? Are they accessible? So, you always want to be provided with a model like this, a business model of how things work, so you can ask the right questions.
So, if you look at this, you might be able to ask things like, what’s the most popular channel that customers use when buying? You can ask that because one table registers whether the customer bought in-store, on mobile, or online. There’s a promotions table, and you can see that there’s a product purchases table up here as well, so you can ask how a promotion influences the number of products purchased or the purchase amount. Does the channel influence payment type? So, you know what’s available to you, and you know what kind of questions you can ask. If they didn’t track the channel, well, there are specific questions that you certainly wouldn’t be able to ask.
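As a rough illustration of how two of these questions could be answered once the data has been extracted, here is a small Python sketch. The field names (`channel`, `promotion`, `items`, `amount`) and all the records are hypothetical, and comparing averages like this only describes an association; it does not show that a promotion caused the difference.

```python
from collections import Counter
from statistics import mean

# Hypothetical purchase records, invented for this example.
purchases = [
    {"channel": "online", "promotion": True, "items": 3, "amount": 45.0},
    {"channel": "in-store", "promotion": False, "items": 1, "amount": 12.0},
    {"channel": "online", "promotion": False, "items": 2, "amount": 20.0},
    {"channel": "mobile", "promotion": True, "items": 4, "amount": 60.0},
    {"channel": "online", "promotion": True, "items": 5, "amount": 75.0},
    {"channel": "in-store", "promotion": False, "items": 1, "amount": 8.0},
]

# Question 1: what is the most popular channel?
channel_counts = Counter(p["channel"] for p in purchases)
most_popular_channel, _ = channel_counts.most_common(1)[0]


def average_items(with_promotion):
    """Average number of items per purchase, with or without a promotion."""
    return mean(p["items"] for p in purchases if p["promotion"] == with_promotion)


# Question 2: do purchases made under a promotion contain more items?
avg_items_promo = average_items(True)
avg_items_no_promo = average_items(False)

print(most_popular_channel, avg_items_promo, round(avg_items_no_promo, 2))
```

Notice that the first question can only be asked because a `channel` field exists at all, which is the point made above about knowing what is tracked.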
So then, going back to the goal and the objectives, now that we have a very defined question, a clear goal, the defined question being once again, what customers should we focus on to increase profits, we can make objectives.
We have two objectives. The first is to use clustering to find the segments of customers most likely to increase profits. Say we find that one specific segment, women from 31 to 36 years old with incomes of 100K or more, is spending more, with more margin per purchase. The idea would be that if we use the same amount of resources as we did last year but focus more on this specific segment, we would make more profit than last time around. The second objective is to understand the combination of events and preferences that leads to greater per-customer profit, and this seems like a very reasonable way to potentially increase profits.
But again, ask yourself, are these events, characteristics, or preferences captured? You would have to know this ahead of time. What system is capturing them? Is it easily accessible? So, before you come up with these objectives based on the goal, know if all this stuff is in place, and if it is not, ask whether it is worth it to start doing that.
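To give a feel for the first objective, here is a minimal clustering sketch in plain Python. The lesson does not name a specific algorithm, so k-means is my own choice for illustration, and the two features (average spend and margin per purchase, both in dollars) and all the numbers are made up. A real project would normally use a library implementation, more features, and feature scaling.

```python
from math import dist
from statistics import mean


def k_means(points, k, iterations=20):
    """Very small k-means: returns a cluster label per point and the centroids."""
    # Naive deterministic start: the first k points act as initial centroids.
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(tuple(mean(axis) for axis in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids


# Hypothetical customers: (average spend per purchase, margin per purchase).
customers = [
    (20.0, 2.0), (80.0, 16.0), (22.0, 2.5),
    (85.0, 18.0), (18.0, 1.5), (90.0, 20.0),
]

labels, centroids = k_means(customers, k=2)

# The segment worth focusing on is the one with the highest average margin.
best_segment = max(range(len(centroids)), key=lambda c: centroids[c][1])

print(labels, centroids, best_segment)
```

The clustering only finds the groups; deciding which group deserves the marketing focus, here the one with the higher margin per purchase, is still a business judgment tied back to the goal.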
And then, you want to determine how you will judge success, both objectively and subjectively. An objective criterion, for example, is that if profits increase by 10 percent, the project will have been a success. You might also want some subjective criteria, which is okay as long as it is clear who the decision-maker will be. Who will be assessing whether that thing was a success or not? If a criterion is not objective, you’re going to need a decision-maker who will make that subjective assessment, but make sure you have both.
The second success criterion is that the marketing budget doesn’t exceed last year’s. Now, this might be reasonable or unreasonable depending on the situation, and this is where tactics like negotiation come into play, because you might need to make the case that the marketing budget will have to go up. After all, you might find out that it’s more expensive to advertise or promote to women who earn 100K or more. It will take more budget to promote to them, but it will be worth it because of the profit gains. Make sure that all stakeholders agree upon the criteria for success.
And then a third one might be that all the components of the project need to be completed by a specific date.
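The three objective criteria above can be written down as a simple check, which is one way to make sure everyone agrees on what they mean before the project starts. This is only a sketch: the function name, the dollar figures, and the dates are hypothetical, and the 10 percent threshold is the example figure from the lesson.

```python
from datetime import date


def evaluate_success(profit_last, profit_now, budget_last, budget_now,
                     finished_on, deadline, min_growth=0.10):
    """Check the three objective success criteria and report each one."""
    growth = (profit_now - profit_last) / profit_last
    return {
        "profit_growth_met": growth >= min_growth,   # profits up by 10% or more
        "budget_met": budget_now <= budget_last,     # budget not above last year's
        "deadline_met": finished_on <= deadline,     # all components done on time
    }


# Hypothetical figures, invented for this example.
result = evaluate_success(
    profit_last=1_000_000,
    profit_now=1_120_000,
    budget_last=200_000,
    budget_now=215_000,
    finished_on=date(2024, 11, 15),
    deadline=date(2024, 12, 1),
)

print(result)
```

In this made-up case the profit and deadline criteria are met but the budget criterion is not, which is exactly the kind of situation where the budget criterion would need to have been negotiated up front.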
The final thing that’s important to understand about business is that the type of problems you have and the objectives you ultimately decide on influence the analytic approach taken.
I purposely picked two objectives aligned with the goal that you can solve with a descriptive-analytic approach, but that might not always be the case.
Typically, whenever you’re interested in looking at relationships between things, you will be interested in descriptive analytics. If you can cluster based on events or preferences and come up with the different segments, you are describing relationships among things. As a result, you would use something like clustering, a descriptive analytics approach. And something as simple as figuring out a mean or standard deviation also falls under descriptive analytics. It’s beneficial for day-to-day business decisions, and it’s also much easier to do. It requires less expertise than the predictive and prescriptive analytics approaches.
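As a tiny example of the simplest kind of descriptive analytics, here is how a mean and a standard deviation could be computed with Python’s standard library. The purchase amounts are made up for illustration.

```python
from statistics import mean, stdev

# Hypothetical purchase amounts in dollars for one customer segment.
purchase_amounts = [10.0, 20.0, 30.0, 40.0, 50.0]

average_amount = mean(purchase_amounts)
amount_spread = stdev(purchase_amounts)  # sample standard deviation

print(f"mean = {average_amount:.2f}, standard deviation = {amount_spread:.2f}")
```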
On the other hand, whenever you’re interested in probabilities or in being able to predict what a specific customer will do, let’s say whether a particular customer is going to buy during a specific type of promotion, you need predictive analytics, and this will require using more advanced techniques.
Finally, we have the prescriptive approach. This approach is useful when you need to optimize in order to make a specific decision. So, let’s say we have 4 to 5 years’ worth of historical data about a promotion that we’ve run. And for each year, we have a data point on whether a specific customer bought during that particular promotion or didn’t. When we do predictive analytics, we can apply probabilities to predict whether that customer will accept that same promotion next year.
But prescriptive analytics allows us to go one step further. Let’s say we want to optimize our marketing dollars and focus only on the customers that are the most likely to buy.
So, let’s say last year, when we were doing just predictive analytics, some customers with high probabilities of buying didn’t work out. This year we want to be even sharper and make sure that we only focus on the people that will be buying, and this is where prescriptive analytics comes into play.
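Here is a rough sketch of the difference between the two steps. The predictive part uses a deliberately naive estimate, the share of past years in which a customer bought during the promotion, where a real project would use a proper model. The prescriptive part then decides who to contact under a budget. The customer names, the cost per contact, the margin per sale, and the budget are all assumptions invented for the example.

```python
# Hypothetical history: 1 = bought during the promotion that year, 0 = did not.
history = {
    "cust_a": [1, 1, 1, 1, 0],
    "cust_b": [0, 0, 1, 0, 0],
    "cust_c": [1, 1, 1, 1, 1],
    "cust_d": [0, 1, 0, 1, 1],
}

# Predictive step (naive): estimated probability of buying next year.
purchase_probability = {
    customer: sum(years) / len(years) for customer, years in history.items()
}

COST_PER_CONTACT = 5.0   # assumed cost of promoting to one customer
MARGIN_PER_SALE = 20.0   # assumed margin if the customer buys
BUDGET = 10.0            # assumed budget, enough for two contacts


def choose_targets(probabilities, cost, margin, budget):
    """Prescriptive step: pick who to contact, most likely buyers first."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    chosen, spent = [], 0.0
    for customer, probability in ranked:
        expected_profit = probability * margin - cost
        # Skip customers who are not worth contacting or no longer affordable.
        if expected_profit <= 0 or spent + cost > budget:
            continue
        chosen.append(customer)
        spent += cost
    return chosen


targets = choose_targets(
    purchase_probability, COST_PER_CONTACT, MARGIN_PER_SALE, BUDGET
)

print(purchase_probability, targets)
```

The predictive step only produces probabilities; the prescriptive step turns them into a decision by weighing expected profit against the budget, which is the "one step further" described above.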
I always tell people there are three big questions you should ask yourself during this business understanding phase.
The first one is to ask yourself: if I perform the analysis, will my findings lead to a meaningful decision? In other words, if you think to yourself, I will end up making the same decision whether I find out x, y, or z, then you shouldn’t do the analysis. The point is, you don’t want to run studies or projects just because you’re curious about something or you want to find something out; it must be related to getting value for the business. It might also sometimes indicate that you’re looking at the wrong question, so you need to go back to the drawing board.
The second question I want you to ask yourself is: what kind of analysis, methods, and tools will you be using? Remember that you will have to explain this to stakeholders and decision-makers, so keep in mind that more complex doesn’t necessarily mean better. They need to be able to understand what’s going on. Suppose your analysis, methods, and tools are so complicated that you can’t communicate them well. In that case, stakeholders might not buy in, because they might not have the background to understand what you did or why it’s going to be helpful, so please keep that in mind as well.
The third and final question is where the data will come from. You also want to check that the analysis hasn’t already taken place; in very large organizations, the probability that it has increases. You don’t want to duplicate work. Maybe you’re lucky, and all the data is in the data warehouse that we discussed earlier.
And the final thing that I want to cover in this first lesson is the common problems we see.
In the business understanding phase of CRISP-DM, the first and by far the most common problem is a lack of clarity about the business problem to be solved. Often, analytic teams do not understand the business objective of the project. They’re very interested in getting to the exciting bit of the project, which is analyzing the data and coming up with some data science products, some models. Again, these are often attractive models, but they don’t meet the actual needs. And often, the excuse given for the model or product not lining up with the true objective of the business is that the model is performing well. So, let’s say a predictive model is very accurate. The team will look at that model and say, well, hey, this model we have for you is performing exceptionally well. But if it is performing well and is not related to the objective, then we have a problem. You are addressing some concern that was not part of the project.
The second common problem is that analysts, data scientists, or the data science team in general do not do a good job of communicating with IT. This is in part because a lot of data science people don’t feel that it’s their job or their role to worry so much about deployment, so they see that as more of an IT role. But you have to be mindful of how difficult this model will be to deploy, or whether it might even be impossible to implement. Is it going to be usable once deployed? Some people might say, well, that’s IT’s problem and not mine, but it certainly is in part the problem of the data science team. Whenever you’re coming up with solutions, you should always communicate and collaborate with the IT department and say, I’m doing this and this; do you think this will be easy to deploy, or is it maybe impossible? You want to know this ahead of time, obviously before it’s time to deploy. So yes, you might not be doing the technical nitty-gritty stuff, that is the role of the IT department, but still, you must make sure that it is possible to implement this. The final thing I’m going to mention is that you need to know how your model will be updated, whether it’s easy to do, and so forth.
The final problem is a failure to iterate. From the free workshop at the beginning of this journey, you might remember that CRISP-DM is an iterative process, and data science professionals often forget this. Some models are old. They’re not monitored and maintained, and not surprisingly, this is usually connected to the first common problem, not understanding the business objectives. If you don’t understand the business objective in the first place, how will you know how to monitor that model to make sure that it’s still meeting the business objective? So again, the goal and purpose need to be clear from the beginning. This clarity would tell you that a model is not good anymore or needs to be updated.
And another problem usually related to this is the fact that it’s more fun to start on new problems and work on new projects than it is to maintain something that you already did in the past, but of course, this should not be an excuse not to maintain older models.
That will do it for lesson one, and since you’ve invested in this course, you should already have access to lesson 2. That lesson is quite a bit different. It focuses on problem-solving soft skills and how you can use knowledge from your past field in data science.