
A Glimpse Into the World of Data Science – Part I

Posted by Srinivas Dasu

Data never sleeps, and it is being generated rapidly. How much data does each of us generate every day? An enormous amount! From tweets and swipes to video calls, likes, and shares on social channels, the digital world is simply exploding. If individuals can generate such massive amounts of data, imagine how much an entire organization produces! With enormous volumes of data being created across the enterprise, it is imperative for businesses to tap into the potential it presents. Data Science gives organizations the means to capture and evaluate this data.

This is a two-part series. In this part, we dive into the world of Data Science to learn about the phenomenon, and in the following part, we will discuss its significance in the Telecom domain.

Introduction

Data Science involves various scientific methods, statistical techniques, and theories to analyze, refine, organize, and visualize data for informed decision making. One of its interesting aspects is that its findings and results are applicable across verticals. It leverages techniques and theories sourced from mathematics, statistics, information science, and computer science. It is thus safe to say that Data Science encapsulates programming skills, statistical readiness, visualization techniques, and business sense.

Image Source: https://d2h0cx97tjks2p.cloudfront.net/blogs/wp-content/uploads/sites/2/2019/03/What-is-Data-Science.jpg

Data Science Process Flow

This simple flow diagram explains the process of Data Science.

Summarizing the diagram, the Data Science process encompasses data munging, data mining, and delivering actionable insights. The Python and R programming languages, along with Tableau for visualization and SQL for databases, play a major role in Data Science work.

Let us understand all the steps involved in the Data Science process flow.

Data Gathering

The Data Science process begins by extracting data from multiple sources. The data can be in raw, semi-structured, anonymized, or documented form.

Data Cleansing

Data cleansing helps remove anomalies in the data. Post cleansing, a dataset is consistent with the related datasets in operation. Data cleansing is one of the most crucial activities at the organization level. It involves:
1. Removing duplicate, irrelevant, or redundant records
2. Handling missing data
3. Fixing data errors
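The three cleansing steps above can be sketched in plain Python; the records and field names ("id", "age") are invented for illustration, and a real pipeline would typically use a library such as pandas.

```python
# Hypothetical raw records with a duplicate, a missing value, and an error.
raw = [
    {"id": 1, "age": "29"},
    {"id": 1, "age": "29"},   # duplicate record
    {"id": 2, "age": None},   # missing value
    {"id": 3, "age": "-5"},   # data error: impossible age
]

# 1. Remove duplicate records.
seen, deduped = set(), []
for row in raw:
    key = tuple(sorted(row.items()))
    if key not in seen:
        seen.add(key)
        deduped.append(row)

# 2. Handle missing data (here: fill with a default of 0).
# 3. Fix data errors (here: discard impossible ages).
cleaned = []
for row in deduped:
    age = int(row["age"]) if row["age"] is not None else 0
    if age >= 0:
        cleaned.append({"id": row["id"], "age": age})

print(cleaned)
```

The choice of default value and rejection rule depends entirely on the business context; filling with 0 is only one of several possible strategies.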

Data Retrieval (Querying)

SQL or another database query language is required to retrieve and work with the data stored in a database. We cannot do anything with data that sits in a database until we retrieve it via queries.
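As a minimal sketch of querying, the example below uses Python's built-in sqlite3 module; the "customers" table and its columns are assumptions made for illustration.

```python
import sqlite3

# Build a throwaway in-memory database with sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, city TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Pune"), (2, "Mumbai"), (3, "Pune")])

# Retrieve only the rows we need via a parameterized query.
rows = conn.execute(
    "SELECT id FROM customers WHERE city = ?", ("Pune",)
).fetchall()
print(rows)  # [(1,), (3,)]
conn.close()
```

The same SELECT pattern applies unchanged to production systems such as SQL Server, MySQL, or Oracle, only the connection library differs.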

Data Formatting

Data arrives in many forms and formats, including CSV, XLSX, DOCX, PDF, ZIP, plain text (TXT), JSON, XML, HTML, images, MP3, and MP4, so data formatting is essential: it brings uniformity to the elements.
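A small sketch of that uniformity, using only the standard library: records from a CSV source and a JSON source are normalized into one consistent shape. The field names ("name", "score") are invented for the example.

```python
import csv
import io
import json

csv_text = "name,score\nAsha,91\nRavi,78\n"
json_text = '[{"name": "Meera", "score": 85}]'

records = []
# CSV cells arrive as strings, so cast the numeric field explicitly.
for row in csv.DictReader(io.StringIO(csv_text)):
    records.append({"name": row["name"], "score": int(row["score"])})
# JSON already carries types, so rows can be taken as-is.
for row in json.loads(json_text):
    records.append({"name": row["name"], "score": row["score"]})

print(records)  # one consistent list of dicts, regardless of source format
```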

Data Processing

Data processing is the conversion of raw data into a readable format. Data processing operations include calculation, classification, interpretation, organization, sorting, transformation, and validation of data. Data processing can be performed via both manual and automated modes.
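Several of the operations listed above (validation, classification, sorting, and calculation) can be sketched in a few lines; the readings and the threshold of 10 are made up for illustration.

```python
raw_readings = ["12.5", "7.0", "oops", "19.3"]

# Validation: keep only values that actually parse as numbers.
values = []
for r in raw_readings:
    try:
        values.append(float(r))
    except ValueError:
        pass  # discard unreadable entries

# Classification: label each value against a threshold.
labeled = [("high" if v >= 10 else "low", v) for v in values]

# Sorting and calculation.
labeled.sort(key=lambda pair: pair[1])
average = sum(values) / len(values)
print(labeled, round(average, 2))
```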

Data Exploring

Data exploring, as the name suggests, is about searching for information in the data. Users have to explore huge, often unstructured data sets. They generally summarize a data set's main characteristics, including its size, accuracy, and initial patterns, among other attributes. The motive behind the data exploration process is not to reveal every bit of information a dataset holds but to form a broader picture of important trends and major points to study in detail. Data exploration involves a combination of manual methods and automated tools, such as data visualizations, charts, and reports.
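A first exploratory pass often amounts to summary statistics; the sketch below uses Python's standard statistics module on an invented column of ages.

```python
import statistics

ages = [23, 31, 27, 45, 31, 38, 29]

# Summarize the main characteristics before any detailed study.
summary = {
    "count": len(ages),
    "min": min(ages),
    "max": max(ages),
    "mean": round(statistics.mean(ages), 1),
    "mode": statistics.mode(ages),
}
print(summary)
```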

Apply Algorithms & Techniques

After gathering, cleansing, retrieving, formatting, processing, and exploring the data, the next step is to apply scientific methods or statistical techniques/algorithms to gain in-depth knowledge of what the data holds. Some of the major techniques include:
• Linear regression
• Classification
• Resampling Methods
• Subset Selection
• Shrinkage
• Dimension Reduction
• Non-Linear Models
• Tree-based Methods
• Support Vector Machines
• Unsupervised Learning
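Of the techniques listed above, linear regression is the simplest to sketch. The example below fits one feature with closed-form least squares in pure Python; the data points are invented, and a real project would usually reach for a library such as scikit-learn instead.

```python
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.1, 5.9, 8.2, 9.8]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# slope = covariance(x, y) / variance(x); intercept from the means.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x
print(round(slope, 2), round(intercept, 2))
```

The fitted line can then predict unseen values, which is the point of applying such techniques: turning historical data into forward-looking estimates.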

Building the Data Model

A data model is crucial for understanding how data is stored and retrieved in a Relational Database Management System (RDBMS), such as SQL Server, MySQL, or Oracle. Data modeling allows us to query data from the database and derive various reports, indirectly contributing to data analysis. These reports help improve the quality and productivity of the project. Data modeling also improves business intelligence, since data modelers work closely with the ground realities of the project, which include gathering data from multiple unstructured sources, reporting requirements, spending patterns, and so on.
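To make the idea concrete, here is a toy data model in SQLite: two related tables and a report query joined across them. The schema (regions, sales) and figures are assumptions for this sketch only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE regions (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sales (
        id INTEGER PRIMARY KEY,
        region_id INTEGER REFERENCES regions(id),
        amount REAL
    );
    INSERT INTO regions VALUES (1, 'North'), (2, 'South');
    INSERT INTO sales VALUES (1, 1, 120.0), (2, 1, 80.0), (3, 2, 50.0);
""")

# With the relationships modeled, a report query is straightforward.
report = conn.execute("""
    SELECT r.name, SUM(s.amount)
    FROM sales s JOIN regions r ON r.id = s.region_id
    GROUP BY r.name ORDER BY r.name
""").fetchall()
print(report)  # [('North', 200.0), ('South', 50.0)]
conn.close()
```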

Visualize and Communicate for Informed Decision Making

Here we get the result of all the hard work done in the previous steps. Leveraging data visualization tools, we can represent the data in the form of charts, graphs, maps, etc. that help us make informed business decisions.
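In practice this step uses tools like Tableau or charting libraries; as a minimal dependency-free sketch of the same idea, the snippet below renders invented quarterly figures as a text bar chart.

```python
revenue = {"Q1": 12, "Q2": 18, "Q3": 9, "Q4": 15}

# One bar per category: label, a run of '#' marks, and the value.
lines = []
for quarter, value in revenue.items():
    lines.append(f"{quarter} | {'#' * value} {value}")
chart = "\n".join(lines)
print(chart)
```

Even this crude chart makes the Q2 peak and Q3 dip visible at a glance, which is exactly the kind of signal decision makers need from a visualization.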

This concludes the first blog in the series. We hope you now have a better understanding of the Data Science process flow. Let us know your views in the comments below, and stay tuned for the second blog in the series, which discusses Data Science in Telecom.

Until next time!
