Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Some say climate change is the biggest threat of our age while others say it’s a myth based on dodgy science. We are turning some of the data over to you so you can form your own view.
Even more than with other data sets that Kaggle has featured, there’s a huge amount of data cleaning and preparation that goes into putting together a long-time study of climate trends. Early data was collected by technicians using mercury thermometers, where any variation in the visit time impacted measurements. In the 1940s, the construction of airports caused many weather stations to be moved. In the 1980s, there was a move to electronic thermometers that are said to have a cooling bias.
Given this complexity, there are a range of organizations that collate climate trends data. The three most cited land and ocean temperature data sets are NOAA’s MLOST, NASA’s GISTEMP and the UK’s HadCrut.
We have repackaged the data from a newer compilation put together by the Berkeley Earth, which is affiliated with Lawrence Berkeley National Laboratory. The Berkeley Earth Surface Temperature Study combines 1.6 billion temperature reports from 16 pre-existing archives. It is nicely packaged and allows for slicing into interesting subsets (for example by country). They publish the source data and the code for the transformations they applied. They also use methods that allow weather observations from shorter time series to be included, meaning fewer observations need to be thrown away.
In this dataset, we have include several files:
Global Land and Ocean-and-Land Temperatures (GlobalTemperatures.csv):
Other files include:
The raw data comes from the Berkeley Earth data page.
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Global Surface Summary of the Day is derived from The Integrated Surface Hourly (ISH) dataset. The ISH dataset includes global data obtained from the USAF Climatology Center, located in the Federal Climate Complex with NCDC. The latest daily summary data are normally available 1-2 days after the date-time of the observations used in the daily summaries.
Over 9000 stations' data are typically available.
The daily elements included in the dataset (as available from each station) are: Mean temperature (.1 Fahrenheit) Mean dew point (.1 Fahrenheit) Mean sea level pressure (.1 mb) Mean station pressure (.1 mb) Mean visibility (.1 miles) Mean wind speed (.1 knots) Maximum sustained wind speed (.1 knots) Maximum wind gust (.1 knots) Maximum temperature (.1 Fahrenheit) Minimum temperature (.1 Fahrenheit) Precipitation amount (.01 inches) Snow depth (.1 inches)
Indicator for occurrence of: Fog, Rain or Drizzle, Snow or Ice Pellets, Hail, Thunder, Tornado/Funnel
You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.github_repos.[TABLENAME]
. Fork this kernel to get started to learn how to safely manage analyzing large BigQuery datasets.
This public dataset was created by the National Oceanic and Atmospheric Administration (NOAA) and includes global data obtained from the USAF Climatology Center. This dataset covers GSOD data between 1929 and present, collected from over 9000 stations. Dataset Source: NOAA
Use: This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source — http://www.data.gov/privacy-policy#data_policy — and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.
Photo by Allan Nygren on Unsplash
High precision and reliable wind speed forecasting is a challenge for meteorologists. Severe wind due to convective storms, causes considerable damages (large scale forest damage, outage, buildings/houses damage, etc.). Convective events such as thunderstorms, tornadoes as well as large hail, strong winds, are natural hazards that have the potential to disrupt daily life, especially over complex terrain favoring the initiation of convection. Even ordinary convective events produce severe winds which causes fatal and costly damages. Therefore, wind speed prediction is an important task to get advanced severe weather warning. This dataset contains the responses of a weather sensor that collected different weather variables such as temperatures and precipitation.
The dataset contains 6574 instances of daily averaged responses from an array of 5 weather variables sensors embedded in a meteorological station. The device was located on the field in a significantly empty area, at 21M. Data were recorded from January 1961 to December 1978 (17 years). Ground Truth daily averaged precipitations, maximum and minimum temperatures, and grass minimum temperature were provided.
If you want to cite this data:
fedesoriano. (April 2022). Wind Speed Prediction Dataset. Retrieved [Date Retrieved] from https://www.kaggle.com/datasets/fedesoriano/wind-speed-prediction-dataset
This dataset was created by saumyadeepm
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Detroit Daily Temperatures with Artificial Warming’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/agajorte/detroit-daily-temperatures-with-artificial-warming on 14 February 2022.
--- Dataset description provided by original source is as follows ---
Who among us doesn't talk a little about the weather now and then? Will it rain tomorrow and get so cold to shake your chin or will it make that cracking sun? Does global warming exist?
With this dataset, you can apply machine learning tools to predict the average temperature of Detroit city based on historical data collected over 5 years.
The given data set was produced from the Historical Hourly Weather Data [https://www.kaggle.com/selfishgene/historical-hourly-weather-data], which consists of about 5 years of hourly measurements of various weather attributes (eg. temperature, humidity, air pressure) from 30 US and Canadian cities.
From this rich database, a cutout was made by selecting only the city of Detroit (USA), highlighting only the temperature, converting it to Celsius degrees and keeping only one value for each date (corresponding to the average daytime temperature - from 9am to 5pm).
In addition, temperature values were artificially and gradually increased by a few Celsius degrees over the available period. This will simulate a small global warming (or is it local?)...
In summary, the available dataset contains the average daily temperatures (collected during the day), artificially increased by a certain value, for the city of Detroit from October 2012 to November 2017.
The purpose of this dataset is to apply forecasting models in order to predict the value of the artificially warmed average daily temperature of Detroit.
See graph in the following image: black dots refer to the actual data and the blue line represents the predictive model (including a confidence area).
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F3089313%2Faf9614514242dfb6164a08c013bf6e35%2Fplot-ts2.png?generation=1567827710930876&alt=media" alt="">
This dataset wouldn't be possible without the previous work in Historical Hourly Weather Data.
What are the best forecasting models to address this particular problem? TBATS, ARIMA, Prophet? You tell me!
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Temperature Time-Series for some Brazilian cities’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/volpatto/temperature-timeseries-for-some-brazilian-cities on 28 January 2022.
--- Dataset description provided by original source is as follows ---
Do you ever wonder how are temperatures in Brazilian cities? Too hot? Cold weather sometimes? And what about climate changes? Is Brazil getting hotter?
This is your chance to check it out!
This datasets are collected in order to provide some answers for the above question through Data Analysis. Maybe you want to try some Machine Learning model in order to practice and predict the evolution of temperature in some Brazilian cities.
The content is provided by NOAA GHCN v4 and post-processed by NASA's GISTEMP v4.
In summary, each data file contains a temperature time series for a station named according to the city. The time series provides temperature records by month for each year. Some mean measurement is calculated, like metANN
and D-J-F
. I can't give details about these quantities, nor how they are calculated. Please refer for NASA GISTEMP website in this regard. The most important seems to be metANN
, which is an annual temperature mean.
These datasets are provided through NASA's GISTEMP v4 and recorded by NOAA GHCN v4. Thanks for researchers and staffs for the really nice work!
--- Original source retains full ownership of the source dataset ---
Includes Average Temperature of US States from Jan 1950 - Aug 2022
Source: https://www.ncei.noaa.gov/cag/statewide/mapping/110/tavg/202208/1/value
References: NOAA National Centers for Environmental information, Climate at a Glance: Statewide Mapping, Average Temperature, published September 2022, retrieved on October 8, 2022 from https://www.ncdc.noaa.gov/cag/
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Temperatures of INDIA’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/venky73/temperatures-of-india on 28 January 2022.
--- Dataset description provided by original source is as follows ---
The Data is of INDIAN GOVT. collected from indian govt website.
Data consists of temperatures of INDIA averaging the temperatures of all places. Recent updated temperature data is 2017. Temperatures values are recorded in CELSIUS
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Daily minimum temperatures’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/suprematism/daily-minimum-temperatures on 13 February 2022.
--- Dataset description provided by original source is as follows ---
This data set contains information about daily minimum temperatures from 1981 to 1990.
The original data can be found here: https://github.com/upul/WhiteBoard/blob/master/data/daily-minimum-temperatures-in-me.csv
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
If you use the dataset, cite the paper: https://doi.org/10.1016/j.eswa.2022.117541
The most comprehensive dataset to date regarding climate change and human opinions via Twitter. It has the heftiest temporal coverage, spanning over 13 years, includes over 15 million tweets spatially distributed across the world, and provides the geolocation of most tweets. Seven dimensions of information are tied to each tweet, namely geolocation, user gender, climate change stance and sentiment, aggressiveness, deviations from historic temperature, and topic modeling, while accompanied by environmental disaster events information. These dimensions were produced by testing and evaluating a plethora of state-of-the-art machine learning algorithms and methods, both supervised and unsupervised, including BERT, RNN, LSTM, CNN, SVM, Naive Bayes, VADER, Textblob, Flair, and LDA.
The following columns are in the dataset:
➡ created_at: The timestamp of the tweet. ➡ id: The unique id of the tweet. ➡ lng: The longitude the tweet was written. ➡ lat: The latitude the tweet was written. ➡ topic: Categorization of the tweet in one of ten topics namely, seriousness of gas emissions, importance of human intervention, global stance, significance of pollution awareness events, weather extremes, impact of resource overconsumption, Donald Trump versus science, ideological positions on global warming, politics, and undefined. ➡ sentiment: A score on a continuous scale. This scale ranges from -1 to 1 with values closer to 1 being translated to positive sentiment, values closer to -1 representing a negative sentiment while values close to 0 depicting no sentiment or being neutral. ➡ stance: That is if the tweet supports the belief of man-made climate change (believer), if the tweet does not believe in man-made climate change (denier), and if the tweet neither supports nor refuses the belief of man-made climate change (neutral). ➡ gender: Whether the user that made the tweet is male, female, or undefined. ➡ temperature_avg: The temperature deviation in Celsius and relative to the January 1951-December 1980 average at the time and place the tweet was written. ➡ aggressiveness: That is if the tweet contains aggressive language or not.
Since Twitter forbids making public the text of the tweets, in order to retrieve it you need to do a process called hydrating. Tools such as Twarc or Hydrator can be used to hydrate tweets.
This dataset was created by Johar M. Ashfaque
This dataset was created by Aziza Afrin
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Precipitation Prediction in LA’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/varunnagpalspyz/precipitation-prediction-in-la on 13 February 2022.
--- Dataset description provided by original source is as follows ---
This Dataset is part of a basic DIY Machine Learning project offered by my college, Indian Institute of Technology, Guwahati (IIT G). The main aim of this project was to get familiar with the workflow and various techniques involved in a Machine Learning project.
The dataset is fairly simple and contains various features regarding precipitation. PRCP = Precipitation (tenths of mm) TMAX = Maximum temperature (tenths of degrees C) TMIN = Minimum temperature (tenths of degrees C) PGTM = Peak gust time (hours and minutes, i.e., HHMM) AWND = Average daily wind speed (tenths of meters per second) TAVG = Average temperature (tenths of degrees C) WDFx = Direction of fastest x-minute wind (degrees) WSFx = Fastest x-minute wind speed (tenths of meters per second) WT = Weather Type
All Credits go to the Coding Club of Indian Institute of Technology, Guwahati (IIT Guwahati). Instagram: https://www.instagram.com/codingclubiitg/ LinkedIn : https://www.linkedin.com/company/coding-club-iitg/
Hope that this dataset + my notebook (https://www.kaggle.com/varunnagpalspyz/precipitation-prediction/notebook) helps all beginners like me.
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Daily Minimum Temperatures in Melbourne’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/paulbrabban/daily-minimum-temperatures-in-melbourne on 28 January 2022.
--- No further description of dataset provided by original source ---
--- Original source retains full ownership of the source dataset ---
This dataset was created by Thanseefudheen
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Crop Recommendation Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/siddharthss/crop-recommendation-dataset on 28 January 2022.
--- Dataset description provided by original source is as follows ---
THE INFORMATION IN THE DATASET IS PROVIDED TO THE BEST OF KNOWLEDGE OF ICAR. THE BELOW DATA CAN BE USED PUBLICALLY UNDER ALL PUBLIC AND PRIVATE UNDERTAKINGS
Context Precision agriculture is in trend nowadays. It helps the farmers to get informed decision about the farming strategy. Here, we present you a dataset which would allow the users to build a predictive model to recommend the most suitable crops to grow in a particular farm based on various parameters.**
Source This dataset was build by augmenting datasets of rainfall, climate and fertilizer data available for India. Gathered over the period by ICFA, India.
Data fields N - ratio of Nitrogen content in soil P - ratio of Phosphorous content in soil K - ratio of Potassium content in soil temperature - temperature in degree Celsius humidity - relative humidity in % ph - ph value of the soil rainfall - rainfall in mm
COPYRIGHT: Indian Chamber of Food and Agriculture https://www.icfa.org.in/
https://www.google.com/url?sa=i&url=https%3A%2F%2F10times.com%2Fcompany%2Findian-council-of-food-and-agriculture&psig=AOvVaw0S9UpuXsVmmje0SgSjybK5&ust=1622035121203000&source=images&cd=vfe&ved=0CAIQjRxqFwoTCNie_uv15PACFQAAAAAdAAAAABAn" alt="">
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘ 🚴 Bike Sharing Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/bike-sharing-datasete on 13 February 2022.
--- Dataset description provided by original source is as follows ---
Source:
Hadi Fanaee-T
Laboratory of Artificial Intelligence and Decision Support (LIAAD), University of PortoINESC Porto, Campus da FEUPRua Dr. Roberto Frias, 3784200 - 465 Porto, Portugal
Original Source:
http://capitalbikeshare.com/system-data
Weather Information:
http://www.freemeteo.com
Holiday Schedule:
http://dchr.dc.gov/page/holiday-scheduleData Set Information:
Bike sharing systems are new generation of traditional bike rentals where whole process from membership, rental and return back has become automatic. Through these systems, user is able to easily rent a bike from a particular position and return back at another position. Currently, there are about over 500 bike-sharing programs around the world which is composed of over 500 thousands bicycles. Today, there exists great interest in these systems due to their important role in traffic, environmental and health issues.
Apart from interesting real world applications of bike sharing systems, the characteristics of data being generated by these systems make them attractive for the research. Opposed to other transport services such as bus or subway, the duration of travel, departure and arrival position is explicitly recorded in these systems. This feature turns bike sharing system into a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that most of important events in the city could be detected via monitoring these data.Attribute Information:
Both hour.csv and day.csv have the following fields, except hr which is not available in day.csv
- instant: record index - dteday : date - season : season (1:springer, 2:summer, 3:fall, 4:winter) - yr : year (0: 2011, 1:2012) - mnth : month ( 1 to 12) - hr : hour (0 to 23) - holiday : weather day is holiday or not (extracted from ) - weekday : day of the week - workingday : if day is neither weekend nor holiday is 1, otherwise is 0. + weathersit :
- 1: Clear, Few clouds, Partly cloudy, Partly cloudy - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds - 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog - temp : Normalized temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-8, t_max=+39 (only in hourly scale) - atemp: Normalized feeling temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-16, t_max=+50 (only in hourly scale) - hum: Normalized humidity. The values are divided to 100 (max) - windspeed: Normalized wind speed. The values are divided to 67 (max) - casual: count of casual users - registered: count of registered users - cnt: count of total rental bikes including both casual and registeredRelevant Papers:
- Fanaee-T, Hadi, and Gama, Joao, 'Event labeling combining ensemble detectors and background knowledge', Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg, .
Citation Request:
Fanaee-T, Hadi, and Gama, Joao, 'Event labeling combining ensemble detectors and background knowledge', Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg, .
@article{ year={2013}, issn={2192-6352}, journal={Progress in Artificial Intelligence}, doi={10.1007/s13748-013-0040-3}, title={Event labeling combining ensemble detectors and background knowledge}, url={ }, publisher={Springer Berlin Heidelberg}, keywords={Event labeling; Event detection; Ensemble learning; Background knowledge}, author={Fanaee-T, Hadi and Gama, Joao}, pages={1-15}}Source: http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset
This dataset was created by UCI and contains around 20000 samples along with Dteday, Windspeed, technical information and other features such as: - Registered - Cnt - and more.
- Analyze Weekday in relation to Casual
- Study the influence of Season on Holiday
- More datasets
If you use this dataset in your research, please credit UCI
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
2023
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
The International Comprehensive Ocean-Atmosphere Data Set (ICOADS) is a global ocean marine meteorological and surface ocean dataset. It is formed by merging many national and international data sources that contain measurements and visual observations from ships (merchant, navy, research), moored and drifting buoys, coastal stations, and other marine and near-surface ocean platforms. Each marine report contains individual observations of meteorological and oceanographic variables, such as sea surface and air temperatures, wind, pressure, humidity, and cloudiness. The coverage is global and sampling density varies depending on date and geographic position relative to shipping routes and ocean observing systems.
The ICOADS dataset contains global marine data from ships (merchant, navy, research) and buoys, each capturing details according to the current weather or ocean conditions (wave height, sea temperature, wind speed, and so on). Each record contains the exact location of the observation which is great for visualizations. The historical depth of the data is quite comprehensive — There are records going back to 1662!
You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.github_repos.[TABLENAME]
. Fork this kernel to get started to learn how to safely manage analyzing large BigQuery datasets.
Dataset Source: NOAA Category: Meteorological, Climate, Transportation
Citation: National Centers for Environmental Information/NESDIS/NOAA/U.S. Department of Commerce, Research Data Archive/Computational and Information Systems Laboratory/National Center for Atmospheric Research/University Corporation for Atmospheric Research, Earth System Research Laboratory/NOAA/U.S. Department of Commerce, Cooperative Institute for Research in Environmental Sciences/University of Colorado, National Oceanography Centre/Natural Environment Research Council/United Kingdom, Met Office/Ministry of Defence/United Kingdom, Deutscher Wetterdienst (German Meteorological Service)/Germany, Department of Atmospheric Science/University of Washington, and Center for Ocean-Atmospheric Prediction Studies/Florida State University. 2016, updated monthly. International Comprehensive Ocean-Atmosphere Data Set (ICOADS) Release 3, Individual Observations. Research Data Archive at the National Center for Atmospheric Research, Computational and Information Systems Laboratory: https://doi.org/10.5065/D6ZS2TR3. Accessed 01 04 2017.
Use: This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - http://www.data.gov/privacy-policy#data_policy — and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.
Photo by Gleb Kozenko on Unsplash
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Walmart Dataset (Retail)’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/rutuspatel/walmart-dataset-retail on 28 January 2022.
--- Dataset description provided by original source is as follows ---
Dataset Description :
This is the historical data that covers sales from 2010-02-05 to 2012-11-01, in the file Walmart_Store_sales. Within this file you will find the following fields:
Store - the store number
Date - the week of sales
Weekly_Sales - sales for the given store
Holiday_Flag - whether the week is a special holiday week 1 – Holiday week 0 – Non-holiday week
Temperature - Temperature on the day of sale
Fuel_Price - Cost of fuel in the region
CPI – Prevailing consumer price index
Unemployment - Prevailing unemployment rate
Holiday Events Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13 Labour Day: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13 Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13 Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13
Analysis Tasks
Basic Statistics tasks
1) Which store has maximum sales
2) Which store has maximum standard deviation i.e., the sales vary a lot. Also, find out the coefficient of mean to standard deviation
3) Which store/s has good quarterly growth rate in Q3’2012
4) Some holidays have a negative impact on sales. Find out holidays which have higher sales than the mean sales in non-holiday season for all stores together
5) Provide a monthly and semester view of sales in units and give insights
Statistical Model
For Store 1 – Build prediction models to forecast demand
Linear Regression – Utilize variables like date and restructure dates as 1 for 5 Feb 2010 (starting from the earliest date in order). Hypothesize if CPI, unemployment, and fuel price have any impact on sales.
Change dates into days by creating new variable.
Select the model which gives best accuracy.
--- Original source retains full ownership of the source dataset ---
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Some say climate change is the biggest threat of our age while others say it’s a myth based on dodgy science. We are turning some of the data over to you so you can form your own view.
Even more than with other data sets that Kaggle has featured, there’s a huge amount of data cleaning and preparation that goes into putting together a long-time study of climate trends. Early data was collected by technicians using mercury thermometers, where any variation in the visit time impacted measurements. In the 1940s, the construction of airports caused many weather stations to be moved. In the 1980s, there was a move to electronic thermometers that are said to have a cooling bias.
Given this complexity, there are a range of organizations that collate climate trends data. The three most cited land and ocean temperature data sets are NOAA’s MLOST, NASA’s GISTEMP and the UK’s HadCrut.
We have repackaged the data from a newer compilation put together by the Berkeley Earth, which is affiliated with Lawrence Berkeley National Laboratory. The Berkeley Earth Surface Temperature Study combines 1.6 billion temperature reports from 16 pre-existing archives. It is nicely packaged and allows for slicing into interesting subsets (for example by country). They publish the source data and the code for the transformations they applied. They also use methods that allow weather observations from shorter time series to be included, meaning fewer observations need to be thrown away.
In this dataset, we have include several files:
Global Land and Ocean-and-Land Temperatures (GlobalTemperatures.csv):
Other files include:
The raw data comes from the Berkeley Earth data page.