Introduction to Data Science: A Starter Kit for Aspiring Data Scientists
Data Science Projects for Beginners to do During this Lockdown
With the profound growth in the field of data science, it has become a lucrative career option for professionals today. There are a lot of free online data science courses available.
Over the last three years, data science jobs have grown by nearly 37%, with healthcare, banking and financial services, insurance, retail, and telecom being the top sectors hiring data science professionals. With the current pandemic, many companies have shifted to working from home, and recruitment patterns have changed significantly as well. Certificates now carry less weight, while the demand for demonstrable skills has grown sharply.
As a beginner in data science aiming for a position such as data analyst or data scientist, you must have heard the advice “do data science projects” over a thousand times.
Not only are open source data science projects a great learning experience, but they also help you stand out from the crowd of data science enthusiasts looking to break into the field. A couple of data science projects on your resume also draw the recruiter's attention and give you an edge when applying for your dream job!
As a beginner in data science, you can work on three types of projects:
- Visualization projects
- Exploratory data analysis (EDA) projects
- Predictive modelling projects
1. Visualization Projects:
Data visualization is an art, and mastering it requires a lot of patience, effort, and time. You can use this extended lockdown to build your data visualization skills by working on some cool projects. Below are four project ideas, each with a dataset that you can use to create some intriguing visualizations to add to your portfolio.
Coronavirus Visualizations:
COVID-19 went from an epidemic to a pandemic. From the first identified case in December 2019, how did the virus spread so fast and so widely? As a beginner, you can use data visualization to draw insights from the available data sources. For example, you can visualize the number of deaths, the number of people who recovered, and the current active cases in each state of India, and extract many more such insights from the data!
E.g., Dataset Link: https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset
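Before plotting, the per-state numbers usually have to be derived from cumulative counts. The sketch below shows one way to do that, assuming each row holds a date, a state, and cumulative confirmed, death, and recovered counts. The rows and the function name `latest_active_by_state` are made up for illustration, and the real dataset's column names may differ.

```python
# Made-up illustrative rows: (date, state, confirmed, deaths, recovered).
# Counts are assumed to be cumulative, not daily increments.
rows = [
    ("2020-04-01", "State A", 100, 2, 10),
    ("2020-04-02", "State A", 150, 3, 20),
    ("2020-04-01", "State B", 40, 1, 5),
    ("2020-04-02", "State B", 70, 1, 9),
]


def latest_active_by_state(records):
    """Return {state: active cases} using each state's most recent row."""
    latest = {}
    for date, state, confirmed, deaths, recovered in records:
        # ISO-formatted dates compare correctly as plain strings.
        if state not in latest or date > latest[state][0]:
            latest[state] = (date, confirmed, deaths, recovered)
    return {
        state: confirmed - deaths - recovered
        for state, (_, confirmed, deaths, recovered) in latest.items()
    }


active = latest_active_by_state(rows)
# Sort states from most to fewest active cases, ready for a bar chart.
for state, count in sorted(active.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{state}: {count} active cases")
```

The sorted result can be passed directly to a bar chart in a plotting library such as Matplotlib.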
Map Data Visualization Topics:
Everyone uses Google Maps, right? Maps are an integral part of our daily lives, and they are used in almost every sector for data analysis and reporting because they are appealing and easy for a wide audience to understand. As the importance of location data continues to grow, so do the ways you can visualize this information. As a beginner, you can also try some data visualization using map datasets.
E.g., Dataset Link: https://www.kaggle.com/benhamner/2016-us-election
Australian Wildfire Visualizations:
The 2019–2020 bushfire season, also known as the Black Summer, consisted of several extreme wildfires starting in June 2019. The fires burnt an estimated 18.6 million hectares and destroyed over 5,900 buildings!
This makes for an interesting project! Use Plotly or Matplotlib to show the magnitude and geographical impact of the wildfires. You can also use Matplotlib and seaborn to find patterns in the number of wildfires by time of year and by state, along with other insights!
E.g., Dataset Link (this dataset covers forest fires in Brazil, which you can explore in the same way): https://www.kaggle.com/gustavomodelli/forest-fires-in-brazil
Time-Series Plot Visualizations:
Visualization plays an important role in time series analysis and forecasting. Plots of the raw sample data can provide valuable diagnostics to identify temporal structures like trends, cycles, and seasonality that can influence the choice of model. A problem is that many novices in the field of time series forecasting stop with line plots!
As a beginner, you can go further and visualize time series with a variety of plots, such as line plots, histograms and density plots, box-and-whisker plots, heat maps, lag plots (scatter plots of each value against an earlier one), and many more!
E.g., Dataset Link: https://www.kaggle.com/gauravsahani/timeseries-analysis-for-whether-dataset
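Two of the ideas above can be sketched without any plotting library: a rolling mean that exposes the trend, and the (value, later value) pairs that a lag plot would scatter. This is a minimal sketch on a synthetic series, not the linked dataset. The helper names `rolling_mean` and `lag_pairs` are hypothetical.

```python
import math

# Synthetic daily series: a linear trend plus a 7-day cycle (not real data).
series = [0.5 * t + 3.0 * math.sin(2 * math.pi * t / 7) for t in range(56)]


def rolling_mean(values, window):
    """Trailing moving average; smooths short cycles so the trend stands out."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


def lag_pairs(values, lag=1):
    """Return (y[t], y[t + lag]) pairs, the points a lag plot scatters (lag >= 1)."""
    return list(zip(values[:-lag], values[lag:]))


# A window equal to the cycle length averages the weekly cycle away.
trend = rolling_mean(series, window=7)
# Pairing each value with the one a full cycle later highlights seasonality.
pairs = lag_pairs(series, lag=7)

print(f"first trend values: {[round(v, 2) for v in trend[:3]]}")
print(f"number of lag-7 pairs: {len(pairs)}")
```

Plotting `trend` as a line and `pairs` as a scatter plot gives a smoothed line plot and a lag plot. A tight diagonal band in the lag plot suggests strong dependence at that lag.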
For the next set of projects, let’s dive into exploratory data analysis (EDA)!
2. Exploratory Data Analysis (EDA) Projects:
Exploratory Data Analysis (EDA), also known as data exploration, is a step in the data analysis process in which a number of techniques are used to better understand the dataset and summarize its main characteristics. It is considered a crucial step in any data science project!
In classical analysis, data collection is followed by imposing a model (with assumptions such as linearity or normality), and the analysis, estimation, and testing then focus on the parameters of that model. In EDA, by contrast, data collection is followed immediately by analysis, with the goal of inferring which model would be appropriate.
One of the projects you can work on is:
New York Airbnb Data Exploration:
Since 2008, guests and hosts have used Airbnb to expand travelling possibilities and present more personalized ways of experiencing the world. This dataset contains information on 2019 listings in New York, including their geographical information, prices, number of reviews, and more. With this data, you can try to answer questions like 'Which hosts are the busiest and why?', 'What areas have more traffic than others and why is that the case?', and 'Are there any relationships between prices, the number of reviews, and the number of days that a given listing is booked?'
EDA is all about exploring: the more you explore, the more genuine insights you extract from the data!
Dataset Link: https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data
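A question like 'What areas have more traffic than others?' usually starts with a group-by summary. The sketch below uses a handful of made-up listings. The field names (`area`, `price`, `reviews`) are assumptions, and using total reviews as a stand-in for traffic is also an assumption, so treat it as a starting point rather than the definitive analysis.

```python
from collections import defaultdict

# Made-up listings for illustration; field names are assumptions,
# not the actual column names of the linked dataset.
listings = [
    {"area": "Manhattan", "price": 200, "reviews": 30},
    {"area": "Manhattan", "price": 100, "reviews": 10},
    {"area": "Brooklyn", "price": 90, "reviews": 50},
    {"area": "Brooklyn", "price": 110, "reviews": 70},
    {"area": "Queens", "price": 60, "reviews": 5},
]


def summarize_by_area(rows):
    """Group listings by area and compute simple summary statistics."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["area"]].append(row)
    return {
        area: {
            "listings": len(members),
            "mean_price": sum(m["price"] for m in members) / len(members),
            "total_reviews": sum(m["reviews"] for m in members),
        }
        for area, members in groups.items()
    }


summary = summarize_by_area(listings)
# Treat total reviews as a rough proxy for traffic (an assumption).
busiest = max(summary, key=lambda area: summary[area]["total_reviews"])

for area, stats in summary.items():
    print(area, stats)
print(f"Most reviewed area: {busiest}")
```

The same grouping idea extends to hosts: group by a host identifier instead of the area to look for the busiest hosts.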
Exploring Factors of Life Expectancy:
The World Health Organization (WHO) created a dataset on the health status of all countries over time, which includes statistics on life expectancy, adult mortality, and more. Using this dataset, explore the relationships between the variables. What has the biggest impact on life expectancy?
You can explore the questions below and present your answers as effective visuals: 'Do the predicting factors that were chosen initially really affect life expectancy?', 'Which predicting variables actually affect life expectancy?', 'How do infant and adult mortality rates affect life expectancy?', 'What is the impact of schooling on the lifespan of humans?', etc.
Dataset Link: https://www.kaggle.com/kumarajarshi/life-expectancy-who
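A simple first pass at 'which variables matter?' is to compute the Pearson correlation between each factor and life expectancy. The sketch below does this on made-up numbers. The factor names are assumptions chosen for illustration and may not match the dataset's columns.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values for six hypothetical countries (not WHO data).
life_expectancy = [55.0, 60.0, 65.0, 70.0, 75.0, 80.0]
factors = {
    "schooling_years": [5.0, 7.0, 9.0, 11.0, 13.0, 15.0],
    "adult_mortality": [400.0, 330.0, 270.0, 200.0, 140.0, 80.0],
}

correlations = {
    name: pearson(values, life_expectancy) for name, values in factors.items()
}
# Rank factors by the strength of the linear relationship, ignoring sign.
for name, r in sorted(correlations.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{name}: r = {r:+.3f}")
```

A strong correlation only shows that two variables move together. It does not by itself show that one causes the other, so follow up with scatter plots and domain reasoning.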
Finally, let’s dive into the last section: predictive modelling projects!
3. Predictive Modelling Projects:
Predictive analytics is a branch of advanced analytics which is used to make predictions about unknown future events. It encompasses a variety of statistical techniques from predictive modelling, machine learning, and data mining that analyze current and historical facts to identify risks and opportunities.
Forecast Model Projects:
One of the most widely used predictive analytics models, the forecast model deals in metric value prediction, estimating a numeric value for new data based on learnings from historical data. This model can be applied wherever historical numerical data is available.
The forecast model also considers multiple input parameters. If a restaurant owner wants to predict the number of customers she is likely to receive in the following week, the model will take into account factors that could affect this, such as 'Is there an event close by?', 'What is the weather forecast?', and 'Is there an illness going around?'.
You can experiment with time series forecasting on the energy consumption dataset linked below, which is composed of power consumption data from PJM’s website. PJM is a regional transmission organization in the United States. Using this dataset, see if you can build a time series model to predict energy consumption. In addition, see if you can find trends around hours of the day, holiday energy usage, and long-term trends!
Dataset Link: https://www.kaggle.com/robikscube/hourly-energy-consumption
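Before building a sophisticated model, it helps to have a baseline to beat. The sketch below implements a seasonal-naive forecast (each hour is predicted as the value 24 hours earlier) and scores it with mean absolute error. It is one common baseline rather than the method this article prescribes. The load values are synthetic, not PJM data, and the function names are hypothetical.

```python
import math

PERIOD = 24  # hours in one daily cycle

# Synthetic hourly load: a daily cycle plus a slow upward drift (not PJM data).
load = [
    100 + 20 * math.sin(2 * math.pi * h / PERIOD) + 0.1 * h
    for h in range(PERIOD * 14)
]


def seasonal_naive(history, horizon, period=PERIOD):
    """Forecast each future step as the value observed one period earlier."""
    start = len(history) - period
    # Reuse the last observed cycle, repeating it if horizon exceeds one period.
    return [history[start + (step % period)] for step in range(horizon)]


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hold out the final day for evaluation and forecast it from the earlier data.
train, test = load[:-PERIOD], load[-PERIOD:]
prediction = seasonal_naive(train, horizon=len(test))
mae = mean_absolute_error(test, prediction)
print(f"Seasonal-naive MAE on the held-out day: {mae:.2f}")
```

Any model you build for the energy dataset should achieve a lower error than a baseline like this on held-out data. Otherwise the extra complexity is not paying off.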
Regression Analysis Projects:
The purpose of regression analysis is to predict an outcome based on historical data. Regression analysis is a robust statistical method that allows the examination of the relationship between two or more variables of interest. While there are many types of regression analysis, at their core, all of them examine the influence of one or more independent variables on a target (dependent) variable.
Here are a few topics you can work on: Walmart sales data (predicting the sales of a store), Boston housing data (predicting the median value of owner-occupied homes), wine quality prediction (predicting the quality of a wine), and Black Friday sales prediction (predicting the purchase amount for a household). You can use algorithms like linear regression and decision trees (CART models); logistic regression fits when the target is a category rather than a number.
E.g., Dataset Link: https://www.kaggle.com/shree1992/housedata
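To make the idea of one independent variable influencing a target concrete, here is a minimal sketch of simple linear regression fitted with the closed-form least-squares formulas. The living-area and price values are made up, and the names `fit_line` and `predict` are hypothetical. For a real project you would likely use a library and more than one feature.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Predict the target for a new value of the independent variable."""
    return slope * x + intercept


# Made-up data: living area in square feet vs. price in thousands.
area = [1000, 1500, 2000, 2500, 3000]
price = [200, 260, 290, 360, 390]

slope, intercept = fit_line(area, price)
estimate = predict(2200, slope, intercept)
print(f"price = {slope:.3f} * area + {intercept:.1f}")
print(f"Estimated price for 2200 sq ft: {estimate:.1f}k")
```

The slope is the estimated change in the target per unit change in the independent variable, which is the "influence" that regression analysis examines.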
Well, that was all from our side. Thank you for reading! We hope you enjoyed the article and found it helpful. Do let us know which projects you are looking forward to learning or doing over this lockdown in your data science journey!