The first step is to identify where your data comes from, what format it is in, and what type of data it is. For example, you might have data from a spreadsheet, a database, a web page, a PDF file, or an API. You might also have data that is numerical, categorical, textual, spatial, or temporal. Knowing your data sources, formats, and types will help you determine how to access, store, and process them.
-
Data elements may have the same names, types, formats, and precision; however, those elements may have different meanings. To illustrate this, consider “journal account balance”. The meaning of that data element depends on the accounting method to which the organization adheres. In other words, the data has to be known.
-
Recognise the project's needs: Start with a clear grasp of the project's goals, the desired results, and the precise questions the analysis needs to answer. This will help you determine the kinds of data and the sources your study requires. List all the data sources pertinent to your project and collect them. These could include databases, spreadsheets, APIs, logs, web scraping, and other data repositories. Determine the types and formats of data each source stores.
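As a minimal sketch of determining the types a source stores, the snippet below guesses a coarse type for each column of a CSV extract. The sample data, the column names, the `infer_type` and `profile_columns` helpers, and the assumed `YYYY-MM-DD` date format are all hypothetical, chosen only for illustration.

```python
import csv
import io
from datetime import datetime

# Hypothetical sample standing in for one source file.
SAMPLE_CSV = """customer_id,signup_date,region,monthly_spend
101,2023-01-15,North,49.90
102,2023-02-03,South,
103,2023-02-20,North,120.00
"""


def infer_type(values):
    """Guess a coarse type for a column from its non-empty values."""
    present = [v.strip() for v in values if v.strip()]
    if not present:
        return "empty"
    # Try the strictest parser first; assumes dates look like YYYY-MM-DD.
    candidates = (
        ("integer", int),
        ("float", float),
        ("date", lambda v: datetime.strptime(v, "%Y-%m-%d")),
    )
    for label, parser in candidates:
        try:
            for v in present:
                parser(v)
            return label
        except ValueError:
            continue
    return "text"


def profile_columns(text):
    """Map each column name in a CSV string to its inferred type."""
    rows = list(csv.DictReader(io.StringIO(text)))
    return {name: infer_type([row[name] for row in rows]) for name in rows[0]}


column_types = profile_columns(SAMPLE_CSV)
print(column_types)
```

A profile like this is only a starting point: a column that parses as an integer may still be an identifier rather than a quantity, which is why the meaning of each element has to be confirmed separately.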
The next step is to assess the quality of your data, which can affect the validity and reliability of your analysis. You should check for issues such as missing values, outliers, duplicates, inconsistencies, errors, or biases. You should also verify that your data is relevant, accurate, complete, and timely for your analysis objectives. You can use various tools and techniques to perform data quality checks, such as summary statistics, visualizations, or data profiling.
-
Analyse the data quality of each source by looking at things like completeness, correctness, consistency, and relevance. Understanding your data's limitations and potential biases can help you decide whether preprocessing or data cleaning is necessary.
-
Data quality is the nemesis of any data analytics project. Without quality data there is no trust in the accuracy, completeness, and validity of the data, and hence the data has zero value to the consumer. Ensuring that data is of high quality can be done by working closely with the data producers and data governance teams (data dictionary, data domains) to understand the data and to know what is and is not expected from it. Once you know the data, assessing whether it is fit for purpose involves designing and implementing data quality rules that check the different data quality dimensions. Monitoring the results of your data quality checks is equally important, as it shows you trends in which data elements fall short of expectations.
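The idea of data quality rules per dimension can be sketched as follows. The records, the field names, the specific rules (for example, that amounts must not be negative and country codes must be upper case), and the `run_quality_rules` helper are assumptions made for illustration; real rules would come from the data dictionary and the data producers.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"order_id": 1, "amount": 25.0, "country": "US"},
    {"order_id": 2, "amount": None, "country": "US"},
    {"order_id": 2, "amount": 40.0, "country": "us"},
    {"order_id": 3, "amount": -5.0, "country": "DE"},
]


def run_quality_rules(rows):
    """Return the number of rows that violate each (assumed) rule."""
    seen_ids = set()
    failures = {"completeness": 0, "uniqueness": 0, "validity": 0, "consistency": 0}
    for row in rows:
        if row["amount"] is None:
            failures["completeness"] += 1  # required value is missing
        elif row["amount"] < 0:
            failures["validity"] += 1  # assumed rule: amounts are non-negative
        if row["order_id"] in seen_ids:
            failures["uniqueness"] += 1  # order_id appeared earlier
        seen_ids.add(row["order_id"])
        if row["country"] != row["country"].upper():
            failures["consistency"] += 1  # assumed rule: upper-case codes
    return failures


quality_report = run_quality_rules(records)
print(quality_report)
```

Storing each run's report alongside a run date would give the trend over time that the monitoring step relies on.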
-
This critical step can be accomplished using automated tools (e.g. Tableau Prep), scripts (written in Python, Perl, C, etc.), and specialty tools. One risk lies with the documentation: trust and verify. Data dictionaries and descriptions can be wrong, and even slight deviations can have a profound impact on the quality and usefulness of end products. Consider the impact of an empty field, a field with blanks, and a field with “0”. In practice, this step combines tooling with hands-on exploration of the data.
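To make the empty, blank, and “0” distinction concrete, here is a small sketch that labels raw text fields so the three cases are never silently merged. The `classify_field` helper and its labels are hypothetical; which of these cases counts as "missing" depends on the source and has to be confirmed with its owners.

```python
def classify_field(raw):
    """Label a raw field so that empty, blank, and zero stay distinct."""
    if raw is None or raw == "":
        return "empty"  # nothing was recorded
    if raw.strip() == "":
        return "blank"  # only whitespace was recorded
    try:
        if float(raw) == 0:
            return "zero"  # an explicit numeric zero
    except ValueError:
        pass  # not numeric; fall through to a generic value
    return "value"


samples = ["", "   ", "0", "0.00", "17", "n/a"]
labels = [classify_field(s) for s in samples]
for sample, label in zip(samples, labels):
    print(repr(sample), "->", label)
```

Counting how often each label occurs per column is a quick way to spot a field where, say, "0" was used as a placeholder for "unknown".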
-
Considering relevance in the context of data quality is critical. Collection methods, scope, business definitions, etc. can change over time, and the unfortunate reality is that best practices are not always maintained over the course of the data's lifecycle, leading to irrelevant data that may be difficult for an analyst to pinpoint. Consulting data dictionaries or subject-matter experts (SMEs) can help you determine whether and how irrelevant data could impact your analysis.
The third step is to transform your data into a format and type that is suitable for your analysis. This might involve converting, cleaning, filtering, aggregating, merging, or reshaping your data. For example, you might need to convert your data from JSON to CSV, or from text to numeric, or from wide to long format. You might also need to clean your data by removing or imputing missing values, or by standardizing or normalizing your data. You can use various tools and techniques to perform data transformation, such as programming languages, libraries, or frameworks.
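As an illustration of two of the transformations mentioned above (wide to long, and text to numeric), the sketch below reshapes a small table using only the Python standard library. The store and month data and the `wide_to_long` helper are hypothetical; keeping a missing value as `None` rather than imputing it is a choice made for this example only.

```python
# Hypothetical wide table: one row per store, one text column per month.
wide_rows = [
    {"store": "A", "jan": "100", "feb": "120"},
    {"store": "B", "jan": "80", "feb": ""},
]


def wide_to_long(rows, id_field, value_fields, var_name="month", value_name="sales"):
    """Emit one record per (id, variable) pair, converting text to numbers."""
    long_rows = []
    for row in rows:
        for field in value_fields:
            raw = row[field].strip()
            long_rows.append({
                id_field: row[id_field],
                var_name: field,
                value_name: float(raw) if raw else None,  # keep missing as None
            })
    return long_rows


long_rows = wide_to_long(wide_rows, "store", ["jan", "feb"])
for record in long_rows:
    print(record)
```

The long format has one observation per row, which is usually easier to filter, aggregate, and join in later steps.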
-
I’ve been using the Power Query Excel add-in to connect/combine multiple data sources into my preferred format. Check it out! It will save you loads of time!! You can refresh all of your individual data sets simultaneously, which will refresh the data output format with the new data (once those data sets have been updated and are connected). No need to modify your raw data sets - they can be flat files and Power Query will then modify them with formulas.
-
Theodor Soneriu
Analytical Software Engineer with a Talent for turning Complex Data into Action
It is important to define the data structure and meaning, such as phone number formats. Will numbers use hyphens or parentheses, and will you need to include international numbers? Is it necessary to distinguish between mobile, home, and business phone numbers? Defining this up front will make querying, interpreting, and analyzing the data easier in the future.
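One way to settle the phone number question is to pick a single stored form and normalize everything to it. The sketch below is an assumption-heavy example, not a complete validator: the `normalize_phone` helper is hypothetical, numbers without a leading `+` are assumed to be national 10-digit numbers under a default country code, and the 8 to 15 digit bounds for international numbers are a simplification chosen for this sketch.

```python
import re


def normalize_phone(raw, default_country_code="1"):
    """Return '+<country><number>' with digits only, or None if unusable.

    Assumption for this sketch: numbers without a leading '+' are national
    10-digit numbers belonging to `default_country_code`.
    """
    text = raw.strip()
    digits = re.sub(r"\D", "", text)  # drop hyphens, spaces, parentheses
    if text.startswith("+") and 8 <= len(digits) <= 15:
        return "+" + digits
    if len(digits) == 10:
        return "+" + default_country_code + digits
    return None  # leave for manual review rather than guessing


examples = ["(555) 123-4567", "555-123-4567", "+44 20 7946 0958", "12345"]
normalized = [normalize_phone(e) for e in examples]
print(normalized)
```

If the mobile, home, or business distinction matters, it is usually cleaner to store it in its own field next to the normalized number than to encode it in the number's format.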
-
Data extraction and transformation: Combine pertinent information from each source into a single format or structure. This might entail data cleansing, resolving missing values, normalising formats, and, if necessary, merging databases. Tools like Python, R, or SQL may be helpful for extracting, cleansing, and transforming data.
-
Data transformation is one of the most important parts of any data pipeline that feeds data to a dashboard or a report. It matters because source data arrives in varied formats. Structured data is easy to work with. The challenge usually comes with semi-structured data (JSON, Parquet, etc.) and unstructured data (text, video, etc.). In my experience dealing with semi-structured data, Python is a great tool for standardizing it, and libraries such as pandas and PySpark, among others, are great for data transformation. Another common challenge is ensuring that your transformed data is output in the right format for downstream applications.
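A minimal sketch of standardizing semi-structured data: nested JSON records with uneven keys are flattened into a fixed set of columns and written as CSV for a downstream consumer. It deliberately uses only the standard library so it runs anywhere; libraries such as pandas cover the same ground. The input records, the dotted column naming, and the choice to join list values with `|` are assumptions for illustration.

```python
import csv
import io
import json

# Hypothetical semi-structured input: nested objects with uneven keys.
RAW_JSON = """[
  {"id": 1, "user": {"name": "Ana", "address": {"city": "Lisbon"}}},
  {"id": 2, "user": {"name": "Ben"}, "tags": ["new", "trial"]}
]"""


def flatten(obj, prefix=""):
    """Flatten nested dicts into a single dict with dotted keys."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        elif isinstance(value, list):
            flat[name] = "|".join(str(v) for v in value)  # lists joined as text
        else:
            flat[name] = value
    return flat


flat_rows = [flatten(item) for item in json.loads(RAW_JSON)]
# Use the union of all keys so every row has the same columns.
fieldnames = sorted({key for row in flat_rows for key in row})

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
writer.writeheader()
writer.writerows(flat_rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Fixing the column set up front is what keeps the output stable for downstream applications even when individual records omit fields.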
The fourth step is to integrate your data from different sources into a single or multiple data sets that can be analyzed together. This might involve joining, appending, or blending your data based on common attributes or keys. For example, you might need to join your data from a spreadsheet and a database based on a customer ID, or append your data from a web page and an API based on a date. You should also ensure that your data is consistent, aligned, and compatible across different sources. You can use various tools and techniques to perform data integration, such as SQL, ETL, or BI tools.
-
Integrate the data to build a complete dataset for analysis after the data has been translated into a consistent format. You might need to use joins, merges, or lookups to integrate the data based on shared keys or properties.
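The join on a shared key can be sketched as a left join built from a lookup. The two extracts, their field names, and the `left_join` helper are hypothetical. The function also reports keys that appear only on the right side, since unmatched keys are often the first sign that two sources are not aligned.

```python
# Hypothetical extracts: one from a spreadsheet, one from a database.
spreadsheet_rows = [
    {"customer_id": "C1", "name": "Acme"},
    {"customer_id": "C2", "name": "Globex"},
]
database_rows = [
    {"customer_id": "C1", "order_total": 250.0},
    {"customer_id": "C1", "order_total": 75.5},
    {"customer_id": "C3", "order_total": 10.0},
]


def left_join(left, right, key):
    """Keep every left row; attach each matching right row, or None fields."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    right_fields = [f for f in right[0] if f != key] if right else []
    unmatched_right = set(index)
    joined = []
    for row in left:
        matches = index.get(row[key], [])
        unmatched_right.discard(row[key])
        if not matches:
            joined.append({**row, **{f: None for f in right_fields}})
        for match in matches:
            joined.append({**row, **match})
    return joined, sorted(unmatched_right)


joined_rows, orphan_keys = left_join(spreadsheet_rows, database_rows, "customer_id")
for row in joined_rows:
    print(row)
print("keys only in the database extract:", orphan_keys)
```

Note that customer C1 produces two joined rows because it has two orders; checking row counts before and after a join is a simple guard against accidental duplication.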
The fifth step is to analyze your data using appropriate methods and techniques to answer your research questions or hypotheses. This might involve descriptive, exploratory, inferential, or predictive analysis, depending on your goals and objectives. For example, you might need to use descriptive analysis to summarize your data, exploratory analysis to discover patterns or trends, inferential analysis to test hypotheses or relationships, or predictive analysis to forecast or classify outcomes. You can use various tools and techniques to perform data analysis, such as statistics, machine learning, or data mining.
-
Apply numerous analysis techniques to the integrated dataset, such as statistical analysis, data visualisation, machine learning, or any other techniques appropriate for your project requirements. Make good use of the right tools and libraries when doing your analysis.
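As a small example of the descriptive end of that range, the sketch below summarizes a numeric column and flags potential outliers with Python's `statistics` module. The `daily_orders` values are made up, and the 1.5 × IQR fence is one common convention for flagging outliers, not the only valid choice.

```python
import statistics

# Hypothetical integrated dataset: daily order counts.
daily_orders = [12, 15, 11, 14, 13, 48, 12, 16]

summary = {
    "count": len(daily_orders),
    "mean": statistics.mean(daily_orders),
    "median": statistics.median(daily_orders),
    "stdev": statistics.stdev(daily_orders),  # sample standard deviation
}

# Quartiles, then the conventional 1.5 * IQR fences for flagging outliers.
q1, _, q3 = statistics.quantiles(daily_orders, n=4)
iqr = q3 - q1
outliers = [x for x in daily_orders if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

print(summary)
print("possible outliers:", outliers)
```

The gap between the mean and the median here is itself informative: a single extreme day pulls the mean up, which is worth knowing before choosing an inferential or predictive method.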
The final step is to interpret your data and communicate your findings and insights to your audience. This might involve creating reports, dashboards, or presentations that convey your results in a clear, concise, and compelling way. You should also explain the assumptions, limitations, and implications of your analysis, and provide recommendations or actions based on your findings. You can use various tools and techniques to perform data interpretation, such as storytelling, visualization, or narration.
-
Know your audience, know the story and know your top impactful insights. Only explain what is essential to portray the information but in a clear manner that doesn't require knowledge of the data to understand - always remember that other people (who have not been analysing the data with you) do not know the data like you do.
-
It’s important to really know the data. It’s not “just a data file”. Know exactly what is being collected, from whom (or from which process), when, how often, what types of errors could occur, and what the data means. It’s important to understand the problem domain. Consider a simple example: produce a chart of public expenditures for each US congressional district over the last 50 years. Include the political party of the Representative and track the dates that Representative served. Add (sub)committee assignments. Next consider the non-obvious issues. Congressional districts change. Committees and subcommittees change. And…there are many colors of money, and money arrives via many channels.
-
Two ideas here. Firstly, sometimes there is no substitute for manually combining your data files and cleaning them. Especially if your data set isn’t too large (under 2,000 observations total) it can be tedious, but manageable. And secondly, if the first idea isn’t possible, your developer and coder colleagues are your friends because they can help create or identify solutions that work.