You’ve probably heard about Data Science, Machine Learning, AI and all those cool buzzwords. You have heard that some of those technologies can help you use your data to predict the future: forecast your demand, avoid fraud and recommend products to customers. It’s great that you are thinking about starting one of these projects.
Most of the time, teams start by thinking about the how. They focus on the algorithm and on the kind of problem they are trying to solve: is this a classification, clustering or regression problem? Starting there can hurt the project. The team may optimize the how but forget about the fuel that makes it all run: data.
Before starting, you should work on a Data Quality report. It does not need to be very complex, but it needs to answer the following questions:
- Can I trust this data? Do I know where this data comes from? How was it collected? How much human intervention has it gone through? Is it external or from one of my own systems?
- Does someone on the team understand this data? Are we sure we know the units of measurement this data should have? Does the frequency make sense for the kind of problem we are trying to solve? Can somebody tell if a specific data point is off the charts? (Imagine having a negative number in a sales table.)
- Is the data good? There are lots of frameworks to evaluate this, but you can start with simple questions. Do I have a lot of empty rows? Do I have a lot of repeated data? Are the data types correct across my whole dataset?
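The simple questions in that last group are easy to automate. Here is a minimal sketch in plain Python, assuming the data is already loaded as a list of dictionaries. The `quality_report` function, the `sales` sample and its column names are all made up for illustration; they are not taken from any particular tool or dataset.

```python
from collections import Counter


def quality_report(rows, expected_types):
    """Count empty values, repeated rows and type mismatches.

    rows: list of dicts, one per record (hypothetical layout).
    expected_types: dict mapping column name -> expected Python type.
    """
    empty = Counter()
    wrong_type = Counter()
    seen = set()
    duplicates = 0

    for row in rows:
        # A row counts as repeated when every expected column matches
        # a row we have already seen.
        key = tuple(row.get(col) for col in expected_types)
        if key in seen:
            duplicates += 1
        seen.add(key)

        for col, expected in expected_types.items():
            value = row.get(col)
            if value is None or value == "":
                empty[col] += 1
            elif not isinstance(value, expected):
                wrong_type[col] += 1

    return {
        "rows": len(rows),
        "duplicate_rows": duplicates,
        "empty": dict(empty),
        "wrong_type": dict(wrong_type),
    }


# Made-up sample: one repeated row, one row with empty values,
# one amount stored as text, and a negative sale.
sales = [
    {"order_id": 1, "amount": 19.99, "store": "north"},
    {"order_id": 2, "amount": -5.0, "store": "south"},
    {"order_id": 2, "amount": -5.0, "store": "south"},
    {"order_id": 3, "amount": None, "store": ""},
    {"order_id": 4, "amount": "12", "store": "east"},
]

report = quality_report(
    sales, {"order_id": int, "amount": float, "store": str}
)

# Domain check: a sales amount should never be negative.
negative_orders = [
    row["order_id"]
    for row in sales
    if isinstance(row["amount"], float) and row["amount"] < 0
]

print(report)
print("Orders with negative amounts:", negative_orders)
```

The counts alone will not tell you whether the data is usable, but they give you concrete numbers to bring to the people who know the data.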
Once you are able to answer these three groups of questions, you will be in a better position to start your project. During this exercise you might find that you won’t be able to use some of the data, or by asking around you might find out about new data that you hadn’t even thought about.