Data mining is a technique for extracting insight from raw data and converting it into useful information that improves our decision making. Data can come from many different sources, e.g. relational databases, NoSQL databases, etc. After extracting the data from these sources, it can be cleaned and transformed so that we can find the underlying trends in it. There are different tools for data analysis, such as R, Tableau, RapidMiner, and Python.
We will use Python for this analysis because it provides a range of analytics libraries, e.g. Pandas, Matplotlib, NumPy, and Seaborn.
We are analysing a weather dataset. Weather data is one of the easier kinds of data to find online: many sites provide historical data on meteorological parameters such as pressure, temperature, humidity, wind speed, and visibility. Our SuvenML team has downloaded one such weather dataset from Kaggle. (Source URL: https://www.kaggle.com/muthuj7/weather-dataset)
The IDE we are using for this project is Jupyter Notebook, which provides a very user-friendly interface and is widely used by data analysts.
The data mining techniques we will use in this project are:
- Data cleaning
- Data preparation
- Tracking patterns
- Statistical techniques
- Data Visualization
In this analysis, we have to perform exploratory data analysis, clean the data, and test a hypothesis. The null hypothesis H0 is: “Do the apparent temperature and humidity, compared monthly across 10 years of data, indicate an increase due to global warming?”
1. Import all the required libraries for analysis.
Pandas: Used for data analysis; it lets us import and export data and handles data manipulation and data wrangling.
NumPy: Used for fast mathematical operations on arrays.
Matplotlib: Used for graphs and data visualization.
Seaborn: A data visualization library based on Matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
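The import cell for these four libraries would typically look like the sketch below. The aliases (`np`, `pd`, `plt`, `sns`) are the usual community conventions and an assumption here, since the original notebook cell is not reproduced in the text.

```python
import numpy as np                 # fast numerical operations on arrays
import pandas as pd                # loading, cleaning, and reshaping tabular data
import matplotlib.pyplot as plt    # plotting interface
import seaborn as sns              # statistical plots built on Matplotlib

# Print the installed versions so the notebook is easier to reproduce later.
print("numpy", np.__version__)
print("pandas", pd.__version__)
print("seaborn", sns.__version__)
```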
2. Data reading, cleaning, and exploration.
We can see that our dataset has 96453 rows and 11 columns.
As you can see in fig 1.3, we create two new columns, “Year” and “Month”, by extracting substrings from the “Formatted Date” column, and we drop the columns we do not need.
We have to do the monthly analysis for 10 years.
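The reading and cleaning step can be sketched as follows. This is a minimal illustration rather than the original notebook code: the helper name `load_and_clean` is hypothetical, the column names “Apparent Temperature (C)” and “Humidity” and the timestamp layout are assumed to match the Kaggle file, and the three sample rows are made up so that the snippet runs without the download.

```python
import io

import pandas as pd

# Tiny made-up sample in the assumed layout of the Kaggle CSV.
SAMPLE_CSV = """Formatted Date,Summary,Apparent Temperature (C),Humidity
2006-04-01 00:00:00.000 +0200,Partly Cloudy,7.39,0.89
2006-04-01 01:00:00.000 +0200,Partly Cloudy,7.23,0.86
2007-01-01 00:00:00.000 +0100,Overcast,1.50,0.92
"""


def load_and_clean(source):
    """Read the CSV, add Year/Month columns, and keep only what we analyse."""
    df = pd.read_csv(source)
    # The timestamp is text, so the year and month are fixed-position substrings.
    df["Year"] = df["Formatted Date"].str[0:4].astype(int)
    df["Month"] = df["Formatted Date"].str[5:7].astype(int)
    # Drop every column that the monthly analysis does not use.
    keep = ["Year", "Month", "Apparent Temperature (C)", "Humidity"]
    return df[keep]


weather = load_and_clean(io.StringIO(SAMPLE_CSV))
print(weather.shape)  # (rows, columns) of the cleaned frame
print(weather.head())
```

With the real file, the call would take the path to the downloaded CSV in place of the in-memory sample.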
3. Extracting data by months.
4. Calculating the mean value for every month and year.
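Steps 3 and 4 can be combined with a single `groupby`. The sketch below assumes the cleaned frame has the columns “Year”, “Month”, “Apparent Temperature (C)”, and “Humidity”; the helper name `monthly_means` and the sample values are illustrative only.

```python
import pandas as pd

# Made-up readings standing in for the cleaned dataset.
weather = pd.DataFrame({
    "Year": [2006, 2006, 2007, 2007, 2006, 2007],
    "Month": [1, 1, 1, 1, 2, 2],
    "Apparent Temperature (C)": [1.0, 3.0, 4.0, 6.0, 2.0, 5.0],
    "Humidity": [0.80, 0.90, 0.70, 0.90, 0.85, 0.75],
})


def monthly_means(df):
    """Average temperature and humidity for every (month, year) pair."""
    value_cols = ["Apparent Temperature (C)", "Humidity"]
    return df.groupby(["Month", "Year"], as_index=False)[value_cols].mean()


means = monthly_means(weather)

# Extract one calendar month: one row per year, ready for plotting.
january = means[means["Month"] == 1]
print(january)
```

Filtering `means` on other values of “Month” gives the equivalent table for each of the remaining months.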
5. Plotting the average temperature and average humidity for each month for about 10 years.
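A plot of this kind could be drawn as in the sketch below. It is one possible layout, not necessarily the one used for the original figures: temperature and humidity share the year axis but get separate y-axes because their scales differ. The helper name `plot_month` and the data values are made up.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; omit this line inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Made-up yearly averages for a single calendar month.
january = pd.DataFrame({
    "Year": [2006, 2007, 2008, 2009],
    "Apparent Temperature (C)": [-2.1, 3.4, 0.5, -1.8],
    "Humidity": [0.84, 0.82, 0.85, 0.83],
})


def plot_month(month_df, month_name):
    """Plot yearly average temperature and humidity for one month."""
    fig, ax_temp = plt.subplots(figsize=(8, 4))
    ax_temp.plot(month_df["Year"], month_df["Apparent Temperature (C)"],
                 marker="o", color="tab:red")
    ax_temp.set_xlabel("Year")
    ax_temp.set_ylabel("Average apparent temperature (C)", color="tab:red")

    # Humidity lies between 0 and 1, so it gets its own y-axis on the right.
    ax_hum = ax_temp.twinx()
    ax_hum.plot(month_df["Year"], month_df["Humidity"],
                marker="s", color="tab:blue")
    ax_hum.set_ylabel("Average humidity", color="tab:blue")

    ax_temp.set_title(f"{month_name}: average temperature and humidity by year")
    fig.tight_layout()
    return fig


fig = plot_month(january, "January")
# In a notebook, plt.show() would display the figure.
```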
In the plot above, we can see that the temperature rises sharply in January 2007 and then follows a decreasing trend. Humidity shows no notable change; it stays roughly constant across all 10 years.
The graphs above show the trends of average monthly temperature and humidity for some of the months.
In the same way, we have plotted graphs for all 12 months across the 10 years.
We know that global warming is a very dangerous problem for the world. It is increasing significantly, which may lead to loss of sea ice, stress on ecosystems, changes in climate conditions, etc.
From this analysis, we gained a lot of useful and important insights into weather conditions over roughly 10 years (2006–2016).
The analysis suggests that the temperature is increasing steadily, especially in June, July, and August. Humidity shows a slight increase in some years but mostly remains constant throughout.
Hypothesis result: “Apparent temperature and humidity compared monthly across 10 years of the data indicate an increase due to Global warming and it is having a very adverse effect on the environment”.
Hence our Null hypothesis is true.
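A visual reading like this can be cross-checked numerically by fitting a straight line to the yearly averages of one month and looking at the sign of the slope. The sketch below is an assumption about how such a check might be done, not part of the original analysis; the helper name `yearly_trend` and the July values are made up.

```python
import numpy as np


def yearly_trend(years, values):
    """Least-squares slope of values against years (units per year)."""
    years = np.asarray(years, dtype=float)
    values = np.asarray(values, dtype=float)
    # Centre the years so the fit is numerically well conditioned.
    slope, _intercept = np.polyfit(years - years.mean(), values, 1)
    return slope


# Made-up July averages that rise by 0.5 degrees per year.
years = [2006, 2007, 2008, 2009, 2010]
july_temp = [22.0, 22.5, 23.0, 23.5, 24.0]

slope = yearly_trend(years, july_temp)
print(f"July trend: {slope:.2f} degrees C per year")
```

A positive slope indicates a rising trend in the sample and a slope near zero indicates a flat one; the slope describes the data only and does not by itself identify a cause.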
We can conclude that global warming is having a very bad effect on nature. To avoid the major effects of global warming, everyone should take precautions such as reducing water waste and using energy efficiently.