We all hear a lot about big data these days, but exactly how big are we talking? According to the International Data Corporation (IDC), more than 59 zettabytes (ZB) of data will be created, captured, copied, and consumed in 2020. And it’s not going to stop there: IDC forecasts that the data created over the next three years will be more than that created over the past 30 years.
This data can be hugely valuable to businesses and organizations, but first, it needs to be processed. And that’s where data science comes into play. Simply put, data science is the extraction of knowledge, insight, and ultimately value from raw data. It involves using a wide range of scientific methods and processes to turn raw data into something that can inform strategic decision-making and deliver positive outcomes. It’s also key to predictive analytics and is at the core of how companies like Netflix, Spotify, and Amazon sometimes seem to know you better than you know yourself. Data science also underpins the growing use of machine learning and artificial intelligence to drive radical advances in everything from cybersecurity to driverless vehicles and healthcare.
Thinking about it this way, it’s perhaps not so surprising that in 2012 the Harvard Business Review called data scientist the “sexiest job of the 21st century”. Or that Glassdoor ranked data scientist as the “best job in America” in 2019 for the fourth consecutive year. But what does a data scientist actually do?
The Data Science Life Cycle
Data science is an umbrella term that covers a number of disciplines, including math, statistical analysis, computer science, and business strategy. The broad challenge is to build a bridge between the data and business worlds, turning raw information into actionable insights. Though the processes can vary, there are typically six key steps in the data science life cycle:
1. Set Goals/Requirements: To deliver added value a data scientist needs to know what the specific business problem or objective is. Put another way, to find the answers the business is looking for, the data scientist must first be clear on what questions need to be asked.
2. Data Collecting/Mining: This may seem like a relatively basic task, but small errors made here can end up causing bigger problems down the line. What data is required? Where is it available? Is it structured or unstructured? How should it be stored and classified so that it is both useful and secure? Remember, if the data that goes into the process is garbage, the results of the analysis will likely be too.
3. Data Cleaning/Wrangling: This is another vital preparatory step, and often the most time-consuming. Data scientists need to order the data, check for inconsistencies in the data, fix corrupt data, find missing data… basically address anything that could undermine the functionality and effectiveness of the model used to extract insight.
4. Data Exploration: Now is the time to dive into the data and find the patterns, trends, and correlations that will help answer the specific questions formed at the beginning of the process. Given the potentially huge volumes of data to be explored, it becomes clear at this stage why establishing those specific requirements at the start is so important.
5. Data Interpretation/Modeling: This is where the magic happens. It’s when data scientists can process all the information and variables selected using a model that will (hopefully) answer the original question(s). In data science, this often involves a predictive model. For example, it could be used to identify which people are more likely to do something, whether that is buying a software upgrade, listening to a certain album, or developing high blood pressure. (A short sketch of what steps 3 to 5 can look like in Python follows this list.)
6. Communicate Insights: The data model may have thrown out some fascinating results, but they’re still largely meaningless until they are used to convince key stakeholders to make data-driven strategic decisions that benefit the organization. This is when data visualization (tables, charts, infographics, etc) is used as a powerful way to summarize the key results of the analysis in a manner that is relevant to the business. It is the opportunity for the data scientist to tell the story that triggers a positive response.
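To make steps 3 through 5 a little more concrete, here is a minimal Python sketch using pandas and scikit-learn. The file name customers.csv and its columns (age, monthly_spend, upgraded) are hypothetical placeholders, not a real dataset; the point is simply to show cleaning, a quick exploration, and a simple predictive model in a few lines.

```python
# A minimal sketch of steps 3-5 (cleaning, exploration, modeling).
# "customers.csv" and its columns are hypothetical placeholders.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Step 3: cleaning/wrangling - remove duplicates and fill in missing values
df = pd.read_csv("customers.csv")
df = df.drop_duplicates()
df["monthly_spend"] = df["monthly_spend"].fillna(df["monthly_spend"].median())

# Step 4: exploration - how do the features differ between the two groups?
print(df.groupby("upgraded")[["age", "monthly_spend"]].mean())

# Step 5: modeling - predict which customers are likely to buy the upgrade
X = df[["age", "monthly_spend"]]
y = df["upgraded"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression()
model.fit(X_train, y_train)
print(f"Accuracy on held-out data: {model.score(X_test, y_test):.2f}")
```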
Why Python is a Data Scientist’s Best Friend...
Of course, to perform these tasks well a data scientist needs a good toolkit. This will invariably include Python, a high-level and open-source programming language that for years has been the most popular choice among data scientists. In fact, a recent survey revealed that 87% of data scientists used Python regularly, far ahead of the next most used languages, SQL (44%) and R (31%).
Python’s popularity stems from its relative simplicity, flexibility, and widespread community participation. This makes it accessible to beginner data scientists and full of useful resources, such as ‘libraries’: essentially collections of ready-made functions and tools that can be dropped into data science projects without having to write new code. There are estimated to be more than 137,000 Python libraries out there today, but here are some of the most important open-source ones for data science:
Pandas - A must-have Python library for data wrangling and analysis, especially when dealing with large data sets. It provides data structures such as DataFrames, with built-in methods for grouping, filtering, and combining data, as well as time-series functionality. (Short sketches of several of these libraries in action follow this list.)
NumPy - Another key library for data processing, NumPy is primarily used for scientific computing and performing basic and advanced array operations. It’s used as a foundation for other Python libraries and its popularity means there is strong community support.
SciPy - The SciPy library provides data scientists with many user-friendly and advanced numerical routines, including routines for numerical integration, interpolation, optimization, linear algebra, and statistics. It builds on NumPy’s arrays and is part of a wider SciPy stack that includes a range of additional tools/libraries.
Matplotlib - This is a powerful tool for data visualization, giving data scientists the opportunity to tell the story they want with the insights they’ve extracted from the raw data. Graphs can be static, interactive, or animated, and can be easily embedded into other applications. It is part of the SciPy stack.
Scikit-learn - This is a popular machine learning and data modeling tool that is also part of the SciPy stack. It is regularly used by data scientists for tasks such as clustering, regression, model selection, dimensionality reduction, and classification.
TensorFlow - Developed by Google, this library provides a framework for machine learning and deep learning, which are driving advances in predictive analytics. It is commonly used to underpin text-based applications (like Google Translate) or for facial and voice recognition. TensorFlow is updated continuously and enjoys strong community support.
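To give a feel for what working with a few of these libraries looks like, here are some minimal sketches. First, pandas and NumPy together: NumPy generates some synthetic numbers, and pandas wraps them in a time-indexed DataFrame for filtering, grouping, and time-series work. All of the values are made up for illustration.

```python
# Pandas + NumPy: a synthetic daily revenue series, filtered and resampled.
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)                        # reproducible random numbers
dates = pd.date_range("2020-01-01", periods=90, freq="D")
sales = pd.DataFrame({"revenue": rng.normal(1000, 120, size=90)}, index=dates)

strong_days = sales[sales["revenue"] > 1000]               # filtering
monthly_avg = sales["revenue"].resample("M").mean()        # time-series grouping
print(len(strong_days), "days with revenue above 1,000")
print(monthly_avg)
```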
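SciPy’s numerical routines are just as easy to call. This tiny example runs a numerical integration and a simple one-dimensional optimization.

```python
# SciPy: numerical integration and a simple optimization.
import numpy as np
from scipy import integrate, optimize

# Integrate e^(-x^2) from 0 to infinity; the exact answer is sqrt(pi)/2 ≈ 0.8862
area, _ = integrate.quad(lambda x: np.exp(-x**2), 0, np.inf)

# Find the x that minimizes a simple quadratic function
result = optimize.minimize_scalar(lambda x: (x - 3) ** 2 + 1)

print(f"Integral ≈ {area:.4f}, minimum at x ≈ {result.x:.2f}")
```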
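Scikit-learn and Matplotlib pair naturally: a model finds the structure, and a chart tells the story. This sketch clusters some synthetic points with KMeans and saves a scatter plot of the result.

```python
# Scikit-learn + Matplotlib: cluster synthetic points and visualize them.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)    # three loose groups of 2-D points
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels, s=15)                   # one color per cluster
plt.title("Clusters found by KMeans in synthetic data")
plt.savefig("clusters.png")                                     # or plt.show() for an interactive window
```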
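And finally, a minimal TensorFlow/Keras sketch: a single-neuron network that learns the line y = 2x - 1 from six example points. Real deep learning models are far larger, but the basic workflow (define, compile, fit, predict) is the same.

```python
# TensorFlow/Keras: the smallest possible model, learning y = 2x - 1.
import numpy as np
import tensorflow as tf

x = np.array([-1.0, 0.0, 1.0, 2.0, 3.0, 4.0]).reshape(-1, 1)
y = 2 * x - 1

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(1,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="sgd", loss="mean_squared_error")
model.fit(x, y, epochs=200, verbose=0)

print(model.predict(np.array([[10.0]])))   # should be close to 19
```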
These are just a handful of widely used, fundamental libraries. If you’re a data scientist, get in touch and let us know what your go-to Python library is.
--
If you want to stay up to date with all the new content we publish on our blog, share your email and hit the subscribe button.
Also, feel free to browse through the other sections of the blog, where you can find many other amazing articles on Programming, IT, Outsourcing, and even Management.
Andres was born in Quito, Ecuador, where he was raised with an appreciation for cultural exchange. After graduating from Universidad San Francisco de Quito, he worked for a number of companies in the US, before earning his MBA from Fordham University in New York City. While a student, he noticed there was a shortage of good programmers in the United States and an abundance of talented programmers in South America. So he bet everything on South American talent and founded Jobsity -- an innovative company that helps US companies hire and retain Latin American programmers.