We all know that wrong data leads to wrong analysis and hence can cost a lot of money. In this post I am highlighting 5 issues that lead to wrong data and hence wrong analysis. Make sure you take care of these issues before spending any time on analyzing and putting your presentation together.
- Manual entry by end users – When you rely on the end users/customer to enter the data in the free form you will get lot of variations of the data. For example, when you ask them to enter the city, you can get variations such as Redmond, Redmond WA. Redmond (Seattle), Remond etc. Doing any analysis on the city level will pose an issues since you have several variation of the same city. Unless you have thoroughly cleaned and accounted for every possible variation you will not have the right analysis. In order to avoid such issues, wherever possible provide the choices via drop down or auto-fill rather than letting users type in the answers.
- Manual entry by someone in the organization – Though this is more controlled than letting end users enter the data, you still have issues similar to number 1 above.
- Excel sheets (calculations) – This becomes an issue when you have lot of calculated columns and some of them are dependent on the other calculated columns and sheets. One simple mistake can cascade to multiple columns and can mess up all your analysis. Whenever possible, do not rely on calculated columns (prepared by someone else) in excel sheet, just use the raw data and do your own calculations so that you can stand behind them.
- Data Imports, connections and processing – In most companies the data resides in various places – databases, excel sheets, flat files Hadoop, 3rd parties etc. For you to get a 360 views you will need to collect and combine the data for all the sources. The data corruption can occur at several places including but not limited to mapping wrong keys, missing some key data, writing wrong queries, importing partial data etc. You should always verify the data that you are getting and make sure that it is clean and complete. Since there are several owners of the data this is not always an easy task, particularly in the large organizations. There is not much that you can do on daily basis to verify the quality of the data, but make sure you understand the underlying data sources and have good understanding of the process that is used to combine the data. Make sure that you in loop on any changes that are being made to the data sources and the process. You can not afford for your raw material (data) to be of a low quality.
- Visualization tools – You expect these tools to work, don’t you? However be careful and double check your calculations before you present your data. I recently had two issues that made me believe that there are many of you who will run into these issues and might present the wrong analysis.
- When you use averages makes sure that you are using the right columns. Average of an average is not the same as average by summing the values in individual rows. Recently I ran into a situation where CPM (Cost Per Thousand Impressions) was already calculated in the excel sheet (see Issue no. 3 above). When that data was brought over in Tableau, the analyst, in an effort to find average CPM, used the calculated column to compute average CPM. Everything looked good on the surface but it was wrong since it calculated average of an average.
- When using a map make sure you use the correct values of Longitude and Latitude. I saw an example where the map showed up perfectly, however a quick quality check showed that the average value for a state was more than the individual values of all cities in that state, which is not possible. On further investigation we found that the issue was with the way Longitude and Latitude were used to render that map. Once the issue was fixed, everything worked fine.
I would love to hear from you if you have encountered these or any other sources of data quality issues.