There is no fixed standard that makes any particular set of skills compulsory for getting started in data science. The concepts discussed below are the ones I started with. For example, Python is the most widely used language, but if you prefer something else, that is fine too. The field draws on mathematics (basic linear algebra including singular value decomposition, optimization, probability theory) and computer science (data structures, algorithms, and so on). These things come with practice, and not every concept below matters in every role. Still, to become a data scientist, you should know each of them.

Let’s discuss each of the components one by one.

1. Education

Data scientists tend to be highly educated. Around 80% hold a Master's degree and approximately 45% have PhDs. A strong educational background helps build the depth of knowledge the field demands. The basic requirement for starting in data science is a bachelor's degree in computer science, statistics, physical science, social science, or a similar field.

Here is the subject-wise breakdown:

Mathematics & Statistics: 32%
Computer science: 19%
Engineering: 16%

Apart from classroom learning, you can practice what you have learned by building an app, starting a blog, or exploring data analysis projects, all of which help you learn more.

2. Python or R programming

Python and R are each used by roughly 40% of people doing data science programming, and the rest use other languages such as Matlab, C, C++, Java, or Perl. Python can take various formats of data, and you can easily import SQL tables into your code. It also lets you create data sets, and you can find almost any type of dataset you need through Google.
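To make the point about data formats concrete, here is a minimal sketch that loads the same kind of records from CSV text, JSON text, and a SQL table using only the Python standard library. The sample data, the `scores` table, and all function names are made up for illustration. In real projects many people use a library such as pandas for this step instead.

```python
import csv
import io
import json
import sqlite3

# Hypothetical sample data; in practice these would be files or a real database.
CSV_TEXT = "name,score\nalice,90\nbob,75\n"
JSON_TEXT = '[{"name": "carol", "score": 82}]'


def load_csv(text):
    """Parse CSV text into a list of dicts with integer scores."""
    reader = csv.DictReader(io.StringIO(text))
    return [{"name": row["name"], "score": int(row["score"])} for row in reader]


def load_json(text):
    """Parse a JSON array of records."""
    return json.loads(text)


def build_demo_db():
    """Create an in-memory SQLite database with one small table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
    conn.execute("INSERT INTO scores VALUES ('dave', 68)")
    conn.commit()
    return conn


def load_sql(conn):
    """Import every row of the SQL table as a dict."""
    cursor = conn.execute("SELECT name, score FROM scores ORDER BY name")
    return [{"name": name, "score": score} for name, score in cursor.fetchall()]


def load_all():
    """Combine records from all three sources into one list."""
    conn = build_demo_db()
    try:
        return load_csv(CSV_TEXT) + load_json(JSON_TEXT) + load_sql(conn)
    finally:
        conn.close()


if __name__ == "__main__":
    for record in load_all():
        print(record)
```

Whatever the source format, the records end up in the same shape, so the rest of the analysis does not need to care where they came from.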

You should also know that R is best suited to statistical analysis. Although it is a bit more complex than Python, there are many resources available online where you can start practicing.

The IDEs and platforms generally used to write code are Google Colab, Anaconda, and the widely used Jupyter. These tools differ somewhat from general-purpose IDEs, but a little practice will make you comfortable with them.

3. Hadoop

Although Hadoop isn’t always a requirement, it is heavily preferred in many cases. Having experience with Hive or Pig is also a strong selling point. Familiarity with cloud tools such as Amazon S3 can also be beneficial. A survey on LinkedIn ranked it as the second most important skill for a data scientist, with a 49% rating.

As a data scientist, you may encounter a situation where the volume of data exceeds the memory of your system, or where you need to send data to different servers. This is where Hadoop comes in. You can use Hadoop to quickly convey data to various points on a system. That’s not all: you can also use it for data exploration, data filtration, data sampling, and summarization.
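Hadoop's processing model is MapReduce, and the idea behind it can be sketched in a few lines of plain Python. This is not Hadoop code and does not use any Hadoop API. It is a single-machine illustration of the map, shuffle, and reduce steps applied to a word-count summary, with the data split into chunks the way a cluster would split it across machines. All function names are made up for the example.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of lines."""
    for line in chunk:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: collapse each key's values into a single summary number."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks):
    """Run map over every chunk, then shuffle and reduce the combined output."""
    pairs = (pair for chunk in chunks for pair in map_phase(chunk))
    return reduce_phase(shuffle(pairs))


if __name__ == "__main__":
    # Each inner list stands in for a block of data stored on a different node.
    chunks = [["big data is big"], ["data is everywhere"]]
    print(word_count(chunks))
```

Because each chunk is mapped independently, no single step needs the whole data set in memory at once, which is the property that matters when the data outgrows one machine.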

4. SQL Database + Coding

I assume you know the basics of SQL. The question is how it is useful in data science, given that SQL works on structured data while we often have to deal with big data or unstructured datasets.

The answer: even though NoSQL and Hadoop have become a large component of data science, a candidate is still expected to be able to write and execute complex queries in SQL. SQL helps in restructuring unstructured data so that analytical functions can be carried out on it.
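As an illustration of the kind of query this means, here is a small sketch that joins two tables, aggregates per customer, and filters on the aggregate. It runs against an in-memory SQLite database from Python. The `customers` and `orders` tables and their contents are invented for the example.

```python
import sqlite3


def build_orders_db():
    """Create a small in-memory database with hypothetical customers and orders."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
        INSERT INTO customers VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
        INSERT INTO orders VALUES (1, 1, 120.0), (2, 1, 80.0), (3, 2, 40.0), (4, 3, 300.0);
        """
    )
    return conn


def top_customers(conn, min_total):
    """Return (name, order count, total spent) for customers spending at least min_total."""
    query = """
        SELECT c.name, COUNT(o.id) AS n_orders, SUM(o.amount) AS total
        FROM customers AS c
        JOIN orders AS o ON o.customer_id = c.id
        GROUP BY c.id, c.name
        HAVING SUM(o.amount) >= ?
        ORDER BY total DESC
    """
    return conn.execute(query, (min_total,)).fetchall()


if __name__ == "__main__":
    conn = build_orders_db()
    for row in top_customers(conn, 100):
        print(row)
    conn.close()
```

The `?` placeholder passes the threshold as a parameter instead of pasting it into the query string, which is the safer habit once queries are built from user input.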

5. Apache Spark

Let me give you a brief idea of Spark. It is a widely used big data technology and, like Hadoop, a big data computation framework. The main reason it is in demand is its speed in comparison with Hadoop. Technically speaking, Hadoop reads and writes data from disk, whereas Spark caches its computation in memory.
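The difference between re-reading from disk and caching in memory can be shown with a plain Python analogy. This is not Spark code and uses no Spark API. It only counts how often a toy iterative job touches the disk with and without an in-memory cache. The class and function names are made up for the example.

```python
import os
import tempfile


class DiskBackedData:
    """Re-reads the file on every pass, like a job that goes back to disk each step."""

    def __init__(self, path):
        self.path = path
        self.disk_reads = 0

    def values(self):
        self.disk_reads += 1
        with open(self.path) as handle:
            return [float(line) for line in handle]


class CachedData(DiskBackedData):
    """Reads the file once, then serves every later pass from memory."""

    def __init__(self, path):
        super().__init__(path)
        self._cache = None

    def values(self):
        if self._cache is None:
            self._cache = super().values()
        return self._cache


def iterative_mean(data, passes):
    """Toy iterative job: move an estimate halfway toward the mean on each pass."""
    estimate = 0.0
    for _ in range(passes):
        vals = data.values()
        estimate += 0.5 * (sum(vals) / len(vals) - estimate)
    return estimate


def demo(passes=5):
    """Run the same job both ways and report the disk reads and results."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
        handle.write("\n".join(str(x) for x in range(1, 101)))
        path = handle.name
    try:
        disk, cached = DiskBackedData(path), CachedData(path)
        disk_result = iterative_mean(disk, passes)
        cached_result = iterative_mean(cached, passes)
        return disk.disk_reads, cached.disk_reads, disk_result, cached_result
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

Both versions compute the same answer, but the cached one reads the file once regardless of the number of passes. Iterative workloads, which are common in machine learning, benefit most from this.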

A major advantage of Spark is that it helps prevent loss of data in data science work. The strength of Apache Spark lies in its speed and its platform, which make it easy to carry out data science projects. With Apache Spark, you can carry out analytics from data intake through to distributed computing.

6. Machine learning & AI

These are the backbone of data science. They include neural networks, reinforcement learning, adversarial learning, etc. If you want to stand out from other data scientists, you need to know machine learning techniques such as supervised machine learning, decision trees, and logistic regression. These skills will help you solve data science problems that are based on predictions of major organizational outcomes.

Other topics worth knowing include unsupervised machine learning, time series, natural language processing, outlier detection, computer vision, recommendation engines, survival analysis, reinforcement learning, and adversarial learning.

Data science involves working with large data sets, so familiarity with machine learning is valuable.
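To give a feel for what a supervised technique looks like in code, here is a minimal sketch of logistic regression trained with batch gradient descent, written in plain Python. The learning rate, the epoch count, and the toy "hours studied versus passed" data are assumptions chosen for illustration. In practice you would normally reach for a library such as scikit-learn.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, features):
    """Predicted probability that the label is 1."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


def train_logistic_regression(X, y, lr=0.1, epochs=5000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_samples, n_features = len(X), len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for features, label in zip(X, y):
            error = predict_proba(weights, bias, features) - label
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        # Step against the average gradient.
        weights = [w - lr * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n_samples
    return weights, bias


def predict(weights, bias, features):
    """Classify as 1 when the predicted probability is at least 0.5."""
    return 1 if predict_proba(weights, bias, features) >= 0.5 else 0


if __name__ == "__main__":
    # Hypothetical data: hours studied -> passed (1) or failed (0).
    X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
    y = [0, 0, 0, 1, 1, 1]
    weights, bias = train_logistic_regression(X, y)
    print(predict(weights, bias, [1.0]), predict(weights, bias, [6.0]))
```

The model learns from labeled examples and then predicts the label for inputs it has not seen, which is the defining pattern of supervised learning.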

7. Data visualization

The business world produces vast amounts of data all the time. This data needs to be translated into a format that is easy to comprehend. People naturally understand pictures such as charts and graphs better than raw data. As the idiom goes, “A picture is worth a thousand words”.

As a data scientist, you must be able to visualize data with the aid of data visualization tools such as ggplot, d3.js, Matplotlib, and Tableau. These tools help you convert complex results from your projects into a format that is easy to comprehend. The thing is, a lot of people do not understand serial correlation or p values. You need to show them visually what those terms represent in your results.
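As a small illustration of showing serial correlation visually, the sketch below computes the autocorrelation of a series at a few lags and draws it as a text bar chart. It deliberately uses only the standard library so it runs anywhere. With Matplotlib or one of the other tools named above you would draw a proper chart instead. The function names and the sample series are made up, and the series is assumed to be non-constant.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a non-constant series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    covariance = sum(
        (series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n)
    )
    return covariance / variance


def text_bar_chart(values, width=20):
    """Render (lag, correlation) pairs as one line per lag with a bar of '#'."""
    lines = []
    for lag, value in values:
        bar = "#" * round(abs(value) * width)
        sign = "+" if value >= 0 else "-"
        lines.append(f"lag {lag}: {sign}{abs(value):.2f} {bar}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical steadily rising series, e.g. monthly sales.
    sales = [1, 2, 3, 4, 5, 6, 7, 8]
    correlations = [(lag, autocorrelation(sales, lag)) for lag in (1, 2, 3)]
    print(text_bar_chart(correlations))
```

Long bars at small lags tell a non-technical reader at a glance that each value depends strongly on the one before it, without requiring them to know the formula.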

Data visualization gives organizations the opportunity to work with data directly. They can quickly grasp insights that help them act on new business opportunities and stay ahead of the competition.

8. Unstructured data

It is critical that a data scientist be able to work with unstructured data. Unstructured data is content that does not fit into database tables. Examples include videos, blog posts, customer reviews, social media posts, video feeds, and audio. Much of it is heavy text lumped together. Sorting this type of data is difficult because it is not streamlined.

Many people refer to unstructured data as “dark analytics” because of its complexity. Working with unstructured data helps you unravel insights that can be useful for decision making. As a data scientist, you must have the ability to understand and manipulate unstructured data from different platforms.
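A first step in working with unstructured text is often to pull a few structured fields out of it. The sketch below does this for free-form customer reviews: it extracts a rating written as "4/5" when one is present and counts the most frequent words. The rating format, the stop-word list, and the function names are assumptions made for this example. Real review data is far messier.

```python
import re
from collections import Counter

# Hypothetical conventions for this example only.
RATING_PATTERN = re.compile(r"(\d)\s*/\s*5")
WORD_PATTERN = re.compile(r"[a-z']+")
STOPWORDS = {"the", "a", "an", "and", "is", "it", "was", "but", "i", "to", "of"}


def structure_review(text):
    """Turn one free-text review into a record with a few structured fields."""
    match = RATING_PATTERN.search(text)
    rating = int(match.group(1)) if match else None
    words = [w for w in WORD_PATTERN.findall(text.lower()) if w not in STOPWORDS]
    return {
        "rating": rating,
        "n_words": len(words),
        "top_words": Counter(words).most_common(3),
    }


if __name__ == "__main__":
    reviews = [
        "Great phone, great battery. 4/5",
        "The screen cracked and support was slow.",
    ]
    for review in reviews:
        print(structure_review(review))
```

Once each review is a record with the same fields, it can be stored in a table and analyzed with the same tools used for structured data.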