Review These 7 Data Science Topics (Domains) – A Beginner’s Guide
Introduction by Ermin Dedic, Founder of mydatacareer.com
Zahin will introduce 7 data science topics that you should know about, but I want you to keep something important in mind.
First and foremost, don’t focus on the details. Breadth over depth at this early stage. I want you to get a general sense of the possibilities and opportunities.
You should add a layer of depth once you have some idea of your end-goal. One great question to determine your end-goal is: Do you see yourself as a problem-solving businessperson or a number/stats person? The latter requires a lot more technical expertise.
Lastly, additional layers of depth can only happen once you start a job. Before you get a job don’t attempt to know every detail about every subject. That is not the expectation.
Data Science is a Hot Topic
“Data is to this century what oil was to the last one”, claimed The Economist in 2017, referring to data being the most valuable asset on earth. Four years on, and you’ll be hard-pressed to find anyone who disagrees with that. I have many people asking me “what topics would you choose to study for data science if you were starting now”?
But if you research the skills required for a Data Scientist role, the common reaction is one of panic! I get it.
Data science covers so many topics that it can be quite overwhelming for new aspirants. The terminology often appears in similar contexts, which makes it hard to grasp the nuances between terms. I can completely empathize, given the sizeable overlap and intertwinement between many of these sub-domains within Data Science. You are not alone – it is fair to say that anyone starting out in the field goes through this exact same phase.
The 7 Data Science Topics (Knowledge Domains)
I feel these 7 areas do the best job of breaking down Data Science into its bare bones as discretely as possible. The 7 knowledge domains are:
- Mathematics
- Databases
- Data Mining
- Data Visualization
- Modelling & Machine Learning
- Computer Programming
- Subject Matter Expertise
What are the most important mathematical topics in data science?
Mathematics is the technical backbone of Data Science – after all, in its truest sense, it’s all numbers! There are three sub-domains of mathematics that form the relevant foundations: statistics, algebra, and calculus.
What statistics topics are useful for data science?
A good grasp of statistics is vital for a Data Scientist. Learn descriptive statistics concepts like mean, median, mode, variance, and standard deviation.
You should also gain familiarity with the common probability distributions, samples and populations, the central limit theorem, skewness and kurtosis, and inferential statistics such as hypothesis testing (p-values) and confidence intervals. One of the more important aspects of your statistics knowledge will be understanding when the different techniques are (or aren’t) a valid approach – essential for experimental design. This involves setting up experiments, dealing with sample sizes, control groups, and more.
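As a quick illustration of the descriptive side, Python’s standard `statistics` module can compute these measures directly. The visit counts below are made up for the example:

```python
import statistics

# Hypothetical sample of daily website visits (illustrative data only)
visits = [120, 135, 118, 142, 130, 128, 500, 125, 132, 127]

mean = statistics.mean(visits)
median = statistics.median(visits)
stdev = statistics.stdev(visits)  # sample standard deviation

# The single outlier (500) pulls the mean well above the median -
# one reason descriptive statistics should be read together, not in isolation.
print(f"mean={mean:.1f}, median={median:.1f}, stdev={stdev:.1f}")
```

Comparing the mean and median like this is often the quickest first check for skew or outliers in a new dataset.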
Statistics are important in data-driven companies where stakeholders will depend on your insights to design/evaluate experiments and models and make strategic decisions.
What Algebra and Calculus topics are useful for data science?
The end uses of linear algebra and calculus are quite similar. They are mostly geared towards machine and deep learning model development and algorithm optimization.
In terms of algebra, you will need to learn about matrices and matrix manipulation and transformation. You will constantly find yourself working with high-dimensionality data (essentially, large data sets with many variables). You will also need familiarity with eigenvalues and eigenvectors, which will help you understand how dimensionality reduction and principal component analysis (PCA) work in feature engineering. Furthermore, an understanding of matrix transformations is needed to appreciate data wrangling and preparation.
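To make eigenvalues a little more concrete, here is a small sketch using the closed-form solution for a symmetric 2×2 matrix; the covariance numbers are invented for illustration:

```python
import math

def eig_2x2_symmetric(a, b, d):
    """Eigenvalues of the symmetric matrix [[a, b], [b, d]], largest first."""
    centre = (a + d) / 2
    delta = math.sqrt(((a - d) / 2) ** 2 + b ** 2)
    return centre + delta, centre - delta

# Hypothetical 2-variable covariance matrix [[4, 2], [2, 3]]
lam1, lam2 = eig_2x2_symmetric(4.0, 2.0, 3.0)

# In PCA terms, lam1 / (lam1 + lam2) is the share of total variance
# captured by the first principal component.
explained = lam1 / (lam1 + lam2)
```

For real, higher-dimensional data you would of course use a library routine (e.g. NumPy’s `numpy.linalg.eigh`) rather than a hand-rolled formula.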
Linear algebra is also used in formulating loss functions and regularization functions, which are methods of optimizing machine learning models. It has specific use cases in Natural Language Processing (NLP) and Computer Vision (CV), two major areas of AI.
Some basic multivariable calculus helps you understand how machine learning models are tuned. For example, multivariate calculus underpins gradient descent, an optimization algorithm that finds a good solution for a machine learning model by iteratively minimizing error.
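The idea behind gradient descent can be sketched in a few lines of Python. This toy example (my own illustration, not any library’s implementation) minimizes f(w) = (w − 3)², whose derivative is 2(w − 3):

```python
def gradient_descent(lr=0.1, steps=100, w=0.0):
    """Minimize f(w) = (w - 3)**2 by repeatedly stepping against the gradient."""
    for _ in range(steps):
        grad = 2 * (w - 3)   # df/dw at the current point
        w -= lr * grad       # move a small step downhill
    return w

w_opt = gradient_descent()   # converges towards the true minimum at w = 3
```

In a real model, w would be a vector of thousands or millions of parameters and the gradient would come from the loss over the training data, but the update rule is the same.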
Calculus topics such as curvature, divergence, and quadratic approximations become increasingly useful in more advanced implementations – for instance, in cost functions, sigmoid functions, and the minimization and maximization of functions.
Databases
Data Engineers and Database Architects design, deploy, and maintain databases to support high-volume, complex data transactions. As a Data Scientist, you don’t need the comprehensive understanding of databases that a Data Engineer or Architect has, but familiarity is paramount.
A Database Management System (DBMS) essentially consists of a group of linked programs that can edit, index, and manipulate a database. The DBMS accepts requests for data extraction and instructs the operating system to provide the specified data. In large systems, a DBMS lets users store, retrieve, and update data in near real-time.
As a Data Scientist, you need to be proficient in SQL. SQL is specifically designed to help you access, query, and work with relational data – a data format that is used almost universally. Learning SQL will help you better understand relational databases and boost your credibility as a Data Scientist.
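For a taste of SQL without installing a database server, Python’s built-in `sqlite3` module works well. The orders table below is invented purely for illustration:

```python
import sqlite3

# In-memory SQLite database - a lightweight stand-in for MySQL/PostgreSQL
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 40.0), ("bob", 25.0), ("alice", 10.0)],
)

# Aggregate revenue per customer - the kind of query run daily in practice
rows = conn.execute(
    "SELECT customer, SUM(amount) FROM orders "
    "GROUP BY customer ORDER BY customer"
).fetchall()
# rows == [("alice", 50.0), ("bob", 25.0)]
```

The same `SELECT … GROUP BY` pattern carries over almost unchanged to the production DBMSs listed below.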
Some popular DBMS include: MySQL, SQL Server, PostgreSQL, Oracle, and NoSQL databases (MongoDB, Cassandra, etc.).
Data Mining
Often referred to as data exploration, data mining is the process of “exploring” data in order to extract important information.
I am sure you all want to deploy those awesome machine learning predictive models and put together intuitive data visualizations. However, none of that can happen until you have performed the “dirty” work.
This might be the least ‘sexy’ of the data science topics, but it arguably takes up the majority of a data scientist’s day-to-day workload! And this makes sense – the New York Times reported that Data Scientists often spend as much as 80% of their time collecting and preparing data, i.e., in the data mining phase.
Data Scraping and Extraction
Data scraping is a technique in which computer programs read and collect data from websites. This is done using special libraries (like BeautifulSoup or Selenium in Python) or simple APIs.
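Here is a minimal scraping sketch using only Python’s standard-library `html.parser` (the HTML snippet is made up; in a real scraper it would come from an HTTP request, e.g. via `urllib` or the `requests` library):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in an HTML document."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")

# A made-up HTML snippet standing in for a fetched web page
html = '<p><a href="/home">Home</a> and <a href="/about">About</a></p>'
parser = LinkCollector()
parser.feed(html)
# parser.links == ["/home", "/about"]
```

Libraries like BeautifulSoup wrap this same parsing idea in a much friendlier API, which is why they are the usual choice in practice.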
Data extraction involves extracting data from databases using querying tools like SQL, as we covered above in Databases.
Often the data a business receives is not ready for modeling and analysis. It is often messy and difficult to work with. Therefore, it is imperative to know how to deal with data imperfections. Some examples of imperfections include missing values, outliers, inconsistent string formatting (e.g., “United Kingdom” vs. “UK” vs. “U.K.”), and date formatting (‘2021-01-01’ vs. ‘01/01/2021’ vs. UNIX time).
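A small sketch of handling two of these imperfections in pure Python – the alias table and date formats below are illustrative, not exhaustive:

```python
from datetime import datetime

# Hypothetical lookup table mapping inconsistent spellings to one canonical form
COUNTRY_ALIASES = {
    "uk": "United Kingdom",
    "u.k.": "United Kingdom",
    "united kingdom": "United Kingdom",
}
DATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y"]

def clean_country(raw):
    """Normalize a country string to its canonical spelling where known."""
    key = raw.strip().lower()
    return COUNTRY_ALIASES.get(key, raw.strip())

def parse_date(raw):
    """Try each known date format in turn until one parses."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date: {raw!r}")

country = clean_country(" U.K. ")   # -> "United Kingdom"
day = parse_date("01/01/2021")      # same day as parse_date("2021-01-01")
```

In practice pandas offers vectorized versions of these operations (`str` methods, `to_datetime`), but the underlying logic is the same.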
Data wrangling, or manipulation, is the process where you prepare your data for further analysis – for example, transforming and mapping raw data from one form to another to “clean” it into a format that can be analyzed in the next stages. The next stage may be finding actionable insights from the data or feeding the data into a machine learning model as input.
Data Analysis is the step where you “feel” and learn about the data. It gives you the initial appreciation of what you are dealing with and provides the context of where you want to take the data, and further dissect it if needed. It often utilizes some of the statistical concepts that we touched on earlier to get some basic parameters and distribution of the data.
You can carry out this work using libraries such as Pandas in Python, or even Excel or SQL before importing the data over.
Data Visualization
So, what does data visualization mean? It can mean different things to different data practitioners.
For me, it is a visual representation of the findings from the data, used to communicate those findings and insights effectively. It gives me the power to craft a compelling story with the data and create impactful presentations. It’s not just about presenting the final results, but also about understanding and learning about the data and its potential vulnerabilities.
At the end of the day, decision-makers often won’t know what is meant by a covariance matrix, p-values, or a classification F-score. You need to show them visually what those terms represent and mean in the context of your results. When I create visualizations, I make sure they convey meaningful information that can influence the system and its decision-makers.
To start, you must be familiar with the basics: histograms, bar charts, pie charts, and scatter plots. You should then move on to advanced charts like waterfall charts, thermometer charts, heat maps, and 3-D plots. These plots come in very handy during the exploratory data analysis stage, and Python’s matplotlib, seaborn, and ggplot are great libraries for that.
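As a starting point, a basic histogram with matplotlib might look like the sketch below (this assumes matplotlib is installed; the data are simulated, not real):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, so no display is needed
import matplotlib.pyplot as plt
import random

random.seed(0)
# Simulated metric: 500 draws from a normal-ish distribution
values = [random.gauss(100, 15) for _ in range(500)]

fig, ax = plt.subplots()
ax.hist(values, bins=20, color="steelblue", edgecolor="white")
ax.set_xlabel("Value")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of a simulated metric")
fig.savefig("histogram.png")
```

A few lines like these are usually the first thing run during exploratory data analysis, before any modelling begins.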
For intuitive and sleek data viz that will take you a step ahead and impress your peers, Tableau and Looker are the better options in my opinion.
Some of the popular Data Visualization tools include: Tableau, PowerBI, QlikView, Plotly, SAS, and Python (matplotlib, seaborn, and ggplot libraries).
Modelling and Machine Learning
What is Machine Learning?
Now, this section itself can be a full article (or two, for that matter). So, I’ll keep it short.
Machine Learning (ML) is a vast field within AI with multiple subsets within itself. At its core, ML is used to build predictive models.
For a Data Scientist, ML is a core skill to have and is deeply complemented by the other domains within data science (i.e., Mathematics, Data Mining, and Computer Programming).
The range of use cases for ML is virtually limitless! You can explore the many use cases at this link: Machine Learning Use Cases
How does ML work and What is Modeling?
ML enables computers to learn from data and apply that learning without human intervention. Machine learning can be categorized into supervised, unsupervised, and reinforcement learning based on the type of dataset, and broadly classified into classical machine learning or deep learning based on the algorithm. At its core, an ML model is an algorithm designed to recognize patterns and calculate the probability of a certain outcome occurring. To build a model, it is first trained on data called the training data, which allows it to identify patterns by learning from examples. Finally, predictions are made on unseen data, the test data, using the trained algorithm.
The accuracy of the model is largely determined based on the algorithm used, its parameters, its suitability for the use case, model optimization/tuning, and even the applicability of the accuracy metric that you are using to measure its performance.
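To make the train/test idea concrete, here is a toy 1-nearest-neighbour classifier in pure Python – my own illustrative sketch with made-up data, not a production algorithm – that scores its accuracy on held-out test points:

```python
def predict(train, point):
    """Return the label of the training example closest to `point` (1-NN)."""
    nearest = min(
        train,
        key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], point)),
    )
    return nearest[1]

# (features, label) pairs forming two well-separated clusters
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
         ((5.0, 5.0), "B"), ((4.8, 5.2), "B")]
# Held-out test data the model never saw during "training"
test = [((0.9, 1.1), "A"), ((5.1, 4.9), "B")]

correct = sum(predict(train, x) == y for x, y in test)
accuracy = correct / len(test)
```

Real projects would reach for a library such as scikit-learn, but the workflow – fit on training data, evaluate an accuracy metric on test data – is exactly the one described above.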
There are tons of modelling algorithms whose use cases are unique to your objectives, computation power, and the type of problem you are dealing with. But that may be out of the scope of this article – we will reserve that rundown for another day.
Programming and what topics in Python are necessary for data science?
If you plan on pursuing a career in data science, you’ll need to code, and code well! Like most things in life, it can be self-taught, especially with the inherent open-source nature of learning materials for software programming and coding.
To start, knowing the basics of some software engineering subjects – such as the lifecycle of software development projects, data types, and compilers – can be incredibly useful. Python, Scala, SAS, and R are some of the languages that you can expect to use to develop and deploy machine learning models.
Python is by far the most utilized language due to its versatility and simplicity. Scala is used by individuals who have some background in Java. SAS and R are usually reserved for data science applications that are weighted towards statistical analysis and visualization. SAS leads the advanced analytics market. Read more here if SAS programming is an interest.
Writing programs to automate tasks not only saves you valuable time but also makes your code much easier to debug, read, and maintain. Writing efficient and clean code will help you in the long run and improve team collaboration, efficiency, code scalability, and eventual commercial deployment.
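As a small example of that kind of automation, a reusable helper like the one below (the names and data are hypothetical) can replace a repetitive manual task such as totalling a weekly report:

```python
import csv
import io

def summarize_amounts(csv_text, column):
    """Total a numeric column of a CSV - a small, reusable automation helper."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sum(float(row[column]) for row in reader)

# Hypothetical report that would otherwise be totalled by hand each week
report = "region,sales\nnorth,120.5\nsouth,99.5\n"
total = summarize_amounts(report, "sales")
```

Because the logic lives in one small, documented function, it is easy to test, reuse on next week’s file, and hand over to a teammate.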
Subject Matter Expertise
Arguably, the most under-appreciated aspect of Data Science is Subject Matter Expertise. It is the glue that holds together all knowledge domains within Data Science. In fact, I believe its significance will only increase in the coming years.
Professionals from all disciplines now use data in their business. As a result, it is imperative for data practitioners to understand the industry and domain they are working in. A common subject matter expertise would be in operations (i.e., logistics). On the other hand, it could also be in the more technical areas such as Finance, Healthcare, and Energy Management, among others.
To be a Data Scientist, you need a passion and zeal for playing with data, and a desire to make numbers talk and paint a story with your data. It is a multi-faceted role, and there is a plethora of knowledge domains and skills to master, as discussed in this article.
In fact, if you’re intent on becoming a data scientist, you will eventually want to learn all the data science topics discussed.
Given the enormity of its applications and the perpetually evolving landscape of toolkits, software, and their functionalities, a continuous-learning mindset is imperative.
The knowledge domains discussed above are massively interdependent and leverage one another’s techniques and functionalities. Aside from these technical knowledge domains, there are plenty of behavioral attributes needed to be a successful Data Scientist – but that is a discussion for another day.
Hopefully you have a better idea of where you can focus as you pursue your data science career journey!
Author: Zahin Rahman (Bachelors & Masters in Data Science & Engineering (University of Toronto))
With 6+ years of technical and leadership industry experience and a Bachelors and Masters in Engineering, I bring a unique combination of professional conduct, hands-on subject area knowledge, and technical academic foundation. Leveraging my diversified skillset and knowledge base with my industry experience, I am confident of bringing meaningful and tangible value to your business vision and professional endeavors. I’m passionate about AI, machine learning, consumer technology, cloud computing, sustainable energy, oil and gas, aerospace, automotive, and anything technical, really. Being a philomath with endless curiosity, I can devour new information very quickly and swiftly get up to speed on the unique technical specifics that are relevant to your industry and needs.