Welcome back! On this week's episode of Data Science Wednesday, Decisive Data's Lead Data Scientist Tessa Jones takes us through a basic understanding of the data science life cycle. Check back next Wednesday for an all-new DSW video!
What is the life cycle of Data Science?
Today, we're going to talk about the data science life cycle, which is very important to understand because if you're going to engage in a data science project, it's a good thing to understand how your data scientist thinks.
Six phases of the Data Science Life Cycle
Phase 1: Exploring
The goal in the first phase is to define a question to answer. To do this, it's important that the data scientist and the business person work together. A few questions they might ask one another are:
- What is the main objective of the business?
- What outcomes does the business want?
- What are the pain points of the business?
Phase 2: Data Discovery
It's important to be very thorough in this phase. Getting further along in the project and then discovering that the data won't answer the question will lead to trouble. Data due diligence is a key factor in data science success.
Phase 3: Data Prep
Once you know you have the data you need, and you have a good question, then you go into data prep. Depending on how sophisticated your business is, this could be a very big step, or it could be a very small step. In the ideal situation, you're just taking a couple of different tables, joining them together, and organizing them in the way that the data scientist would like. Then we get into the data science process part of things.
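To make the "joining a couple of tables" step concrete, here is a minimal sketch in plain Python. The table names, columns, and values are all made up for illustration, and a real project would more likely use a database or a data frame library than hand-written joins.

```python
# Hypothetical tables: every name and value here is made up for illustration.
products = [
    {"product_id": 1, "name": "hammer", "category": "tools"},
    {"product_id": 2, "name": "paint", "category": "decor"},
]
sales = [
    {"product_id": 1, "store_id": 10, "price": 12.0, "units": 40},
    {"product_id": 2, "store_id": 10, "price": 25.0, "units": 15},
    {"product_id": 1, "store_id": 11, "price": 13.0, "units": 35},
    {"product_id": 3, "store_id": 11, "price": 5.0, "units": 8},  # no matching product
]


def inner_join(left, right, key):
    """Join two lists of dict rows on a shared key, keeping only matching rows."""
    # Index the right-hand table by key so each lookup is a dictionary access.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)

    joined = []
    for row in left:
        for match in index.get(row[key], []):
            # Merge the two rows into one wide row.
            joined.append({**row, **match})
    return joined


# One row per sale, enriched with product details, ready for analysis.
model_table = inner_join(sales, products, "product_id")
for row in model_table:
    print(row)
```

The sale with no matching product is dropped by the inner join. Noticing rows like that is part of the data due diligence described in the previous phase.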
Phase 4: Exploratory Analysis
First, we go into the exploratory analysis phase. If you've watched our past videos, you may have seen the episode we did on diagnostic analytics, and that's very reminiscent of what's going on here. We're really trying to figure out how our different variables are related to one another, what their distributions look like, and what correlates with what, so that we get a good view of what's going on with our data.
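As a rough illustration of this phase, the sketch below summarizes two variables and measures how strongly they move together. The price and sales numbers are invented for the example, and Pearson correlation is just one assumed choice among many ways to look at relationships.

```python
from statistics import mean, median, stdev

# Made-up weekly observations for one product: price charged and units sold.
prices = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]
units = [52, 49, 45, 40, 38, 33]


def describe(values):
    """Return a few basic facts about how a variable is distributed."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


summary = {"price": describe(prices), "units": describe(units)}
r = pearson(prices, units)

print(summary)
print(f"correlation between price and units: {r:.3f}")
```

A value of r close to -1 would suggest that units sold fall as price rises, which is the kind of relationship you want to know about before designing a model.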
Phase 5: Model Design
Once we understand that, we go into model design, which is where we really put our thinking hats on, as we want to define the mathematical approach that will most likely give us the results that we want. You need to consider everything you've learned in the earlier phases to do a good design.
Phase 6: Build It!
After we have it designed, we go ahead and build the model. Usually, we program this in a language like R or Python, and once we have that built out, we deploy it into the business process so it can be used by the business person.
All of this really fits nicely with the scientific method, because it allows us to iterate through. The scientific method is all about hypothesis testing, so if you have a question and you think that this particular variable is going to give you the answer that you need, you go through this cycle to test it. You say, "Well, I think it's going to work." And then you build it in the model, and then you see whether it did or did not work, or if it gave you the results that you were trying to get.
Example
For example, let’s take Tammy. Tammy's working with a data scientist, and she's an executive at a retail store chain. What she really wants to do is optimize her prices, because she really wants to drive her revenue up and get the most bang for her buck for every product that she has. She wants to define this for thousands of products over hundreds of stores, so this is no small task. Once you figure all this out with her, and you know what she wants, you find the data that you want and prep it, and then go through this process.
Oftentimes, you think that you know what's going to be important in terms of how to optimize pricing, and it just won't give you the answer that you need. Or you go through this process, and you find out that, "Wow, we have different customers that need to be treated differently within the model, but we don't have enough information about that." You might have to go all the way back to the first phase and discuss with Tammy, "Okay, how do these different clients look? Do-it-yourselfers versus people who are buying in bulk, like contractors and others?"
Conclusion
You really need to get a definition of those customer types, and then you go through and find the data, and then go through the whole hypothesis testing again to see if you are right. Is there, in fact, a big difference between customer types, and how will it impact the model results?
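One way to check whether customer types really differ is a permutation test, sketched below. This is only an assumed approach for illustration, not necessarily what a data scientist would pick for Tammy's project, and the order sizes are invented.

```python
import random
from statistics import mean

# Made-up units per order for two hypothetical customer types.
diy_orders = [2, 3, 1, 2, 4, 3, 2, 1]
contractor_orders = [20, 25, 18, 30, 22, 27, 19, 24]


def permutation_p_value(a, b, n_permutations=2000, seed=0):
    """Estimate how often a random relabeling of customers produces a
    difference in mean order size at least as large as the observed one."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)

    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            extreme += 1

    # Add one to both counts so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


p_value = permutation_p_value(diy_orders, contractor_orders)
print(f"estimated p-value: {p_value:.4f}")
```

A small p-value suggests the gap between the two groups is unlikely to be chance, which would support treating them differently within the model.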
Once you are satisfied with the model results, you go ahead and deploy the model. You integrate it into the business, and then you're ready to go. You can roll out the optimal prices to hundreds of stores for thousands of products, which is pretty amazing.
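To give a feel for what "optimal prices" could mean, here is a deliberately simple sketch. It assumes demand falls in a straight line as price rises, which is an assumption made only for this example and not the model described in the video. The product names and numbers are made up.

```python
from statistics import mean

# Made-up (price, units sold) observations per product; names are hypothetical.
observations = {
    "hammer": [(10.0, 52), (11.0, 49), (12.0, 45), (13.0, 40), (14.0, 38), (15.0, 33)],
    "paint": [(20.0, 30), (22.0, 27), (24.0, 25), (26.0, 21), (28.0, 19)],
}


def fit_line(points):
    """Least-squares fit of units = intercept + slope * price."""
    xs = [price for price, _ in points]
    ys = [units for _, units in points]
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    intercept = my - slope * mx
    return intercept, slope


def revenue_maximizing_price(points):
    """Price that maximizes price * (intercept + slope * price).

    Returns None when demand does not fall with price, because the
    assumed linear model has no revenue peak in that case.
    """
    intercept, slope = fit_line(points)
    if slope >= 0:
        return None
    return -intercept / (2 * slope)


optimal_prices = {
    name: revenue_maximizing_price(points) for name, points in observations.items()
}
for name, price in optimal_prices.items():
    print(f"{name}: {price:.2f}")
```

The same loop would run over thousands of products and hundreds of stores in a real deployment, with a richer model and with checks that each suggested price stays within a range the business is comfortable with.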
Posted by Gage Peake