Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains information on the Surface Soil Moisture (SM) content derived from satellite observations in the microwave domain.
A description of this dataset, including the methodology and validation results, is available at:
Preimesberger, W., Stradiotti, P., and Dorigo, W.: ESA CCI Soil Moisture GAPFILLED: An independent global gap-free satellite climate data record with uncertainty estimates, Earth Syst. Sci. Data Discuss. [preprint], https://doi.org/10.5194/essd-2024-610, in review, 2025.
ESA CCI Soil Moisture is a multi-satellite climate data record that consists of harmonized, daily observations coming from 19 satellites (as of v09.1) operating in the microwave domain. The wealth of satellite information, particularly over the last decade, facilitates the creation of a data record with the highest possible data consistency and coverage.
However, data gaps are still found in the record. This is particularly notable in earlier periods when a limited number of satellites were in operation, but can also arise from various retrieval issues, such as frozen soils, dense vegetation, and radio frequency interference (RFI). These data gaps present a challenge for many users, as they have the potential to obscure relevant events within a study area or are incompatible with (machine learning) software that often relies on gap-free inputs.
Since the requirement of a gap-free ESA CCI SM product was identified, various studies have demonstrated the suitability of different statistical methods to achieve this goal. A fundamental feature of such gap-filling methods is that they rely only on the original observational record, without the need for ancillary variables or model-based information. Due to the intrinsic challenge, no global, long-term univariate gap-filled product has been available until now. In this version of the record, data gaps due to missing satellite overpasses and invalid measurements are filled using the Discrete Cosine Transform (DCT) Penalized Least Squares (PLS) algorithm (Garcia, 2010). A linear interpolation is applied over periods of (potentially) frozen soils with little to no variability in (frozen) soil moisture content. Uncertainty estimates are based on models calibrated in experiments to fill satellite-like gaps introduced to GLDAS Noah reanalysis soil moisture (Rodell et al., 2004), and consider the gap size and local vegetation conditions as parameters that affect the gapfilling performance.
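As an illustration of the simpler of the two fill steps mentioned above (the linear interpolation, not the DCT-PLS algorithm), the following is a minimal sketch that assumes a daily series with gaps marked as NaN. The function name fill_linear and the sample values are hypothetical and are not taken from the product's processing chain.

```python
import math

def fill_linear(values):
    """Linearly interpolate interior runs of NaN between valid neighbours.

    Leading and trailing NaNs are left untouched because they have
    no valid neighbour on one side.
    """
    filled = list(values)
    last_valid = None  # index of the most recent non-NaN value
    for i, v in enumerate(values):
        if math.isnan(v):
            continue
        if last_valid is not None and i - last_valid > 1:
            start, end = values[last_valid], v
            span = i - last_valid
            for k in range(1, span):
                filled[last_valid + k] = start + (end - start) * k / span
        last_valid = i
    return filled

nan = float("nan")
series = [0.20, nan, nan, 0.26, nan, 0.30]  # hypothetical soil moisture values
filled = fill_linear(series)
print(filled)
```

The sketch only shows what "linear interpolation over a gap" means; how the product delimits frozen periods is not reproduced here.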
You can use command line tools such as wget or curl to download (and extract) data for multiple years. The following script will download and extract the complete data set to the local directory ~/Downloads on Linux or macOS systems.
#!/bin/bash
# Set download directory and make sure it exists
DOWNLOAD_DIR=~/Downloads
base_url="https://researchdata.tuwien.at/records/3fcxr-cde10/files"
mkdir -p "$DOWNLOAD_DIR"
# Loop through years 1991 to 2023 and download & extract data
for year in {1991..2023}; do
    echo "Downloading $year.zip..."
    # Only extract and clean up if the download succeeded
    if wget -q -P "$DOWNLOAD_DIR" "$base_url/$year.zip"; then
        unzip -o "$DOWNLOAD_DIR/$year.zip" -d "$DOWNLOAD_DIR"
        rm "$DOWNLOAD_DIR/$year.zip"
    else
        echo "Download of $year.zip failed" >&2
    fi
done
The dataset provides global daily estimates for the 1991-2023 period at 0.25° (~25 km) horizontal grid resolution. Daily images are grouped by year (YYYY), each subdirectory containing one netCDF image file for a specific day (DD), month (MM) in a 2-dimensional (longitude, latitude) grid system (CRS: WGS84). The file name has the following convention:
ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_GAPFILLED-YYYYMMDD000000-fv09.1r1.nc
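The naming convention above can be turned into a small helper for locating the file of a given day. This is a sketch: the function names are hypothetical, and the YYYY/ subdirectory layout is assumed from the description of the yearly grouping.

```python
from datetime import date

def gapfilled_filename(day: date) -> str:
    """Build the file name for one daily image from the stated convention."""
    return (
        "ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_GAPFILLED-"
        f"{day:%Y%m%d}000000-fv09.1r1.nc"
    )

def relative_path(day: date) -> str:
    """Assumed layout: one subdirectory per year containing the daily files."""
    return f"{day:%Y}/{gapfilled_filename(day)}"

print(relative_path(date(2005, 7, 4)))
```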
Each netCDF file contains 3 coordinate variables (WGS84 longitude, latitude and time stamp), as well as the following data variables:
Additional information for each variable is given in the netCDF attributes.
Changes in v9.1r1 (previous version was v09.1):
These data can be read by any software that supports Climate and Forecast (CF) conform metadata standards for netCDF files, such as:
The following records are all part of the Soil Moisture Climate Data Records from satellites community
1 | ESA CCI SM MODELFREE Surface Soil Moisture Record | https://doi.org/10.48436/svr1r-27j77
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
This dataset contains the Coded Surface Bulletin (CSB) dataset reformatted as netCDF-4 files. The CSB dataset is a collection of ASCII files containing the locations of weather fronts, troughs, high pressure centers, and low pressure centers as determined by National Weather Service meteorologists at the Weather Prediction Center (WPC) during the surface analysis they do every three hours. Each bulletin is broadcast on the NOAAPort service, and has been available since 2003.
Each netCDF file contains one year of CSB fronts data represented as spatial map data grids. The times and geospatial locations for the data grid cells are also included. The front data is stored in a netCDF variable with dimensions (time, front type, y, x), where x and y are geospatial dimensions. There is a 2D geospatial data grid for each time step for each of the 4 front types—cold, warm, stationary, and occluded. The front polylines from the CSB dataset are rasterized into the appropriate data grids. Each file conforms to the Climate and Forecast Metadata Conventions.
There are two large groupings of the CSB netCDF files. One group uses a data grid based on the North American Regional Reanalysis (NARR) grid, which is a Lambert Conformal Conic projection coordinate reference system (CRS) centered over North America. The NARR grid is quite close to the spatial range of data displayed on the WPC workstations used to perform surface analysis and identify front locations. The native NARR grid has grid cells which are 32 km on each side. Our grid covers the same extents with cells that are 96 km on each side.
The other group uses a 1° latitude/longitude data grid centered over North America with extents 171W – 31W / 10N – 77N. The files in this group are identified by the name MERRA2, because they were used with data from the NASA MERRA-2 dataset, which uses a latitude/longitude data grid.
There are a number of files within each group. The files all follow the naming convention codsus_[masked]_.nc, where [masked] indicates that the presence of the word masked is optional and is either merra2-1deg or narr-96km. The element is either the word mask or the sequence wide_, where is the front width and is the year for the data stored in the file.
The codsus_mask.nc file contains a single data grid that delineates the envelope of the geospatial region where there are, on average, 40 or more front crossings of any type per year. The WPC meteorologists don't attempt to provide equal levels of attention to every grid cell displayed on their workstations. The files of the form codsus_masked_wide_.nc have all had the mask described above applied to exclude parts of fronts that extend past the envelope. The files of the form codsus_wide_.nc have no masking applied.
The wide portion of the file names takes two forms: 1wide and 3wide. The fronts in the 1wide files were rasterized by drawing the front polylines with a width of one grid cell. The fronts in the 3wide files were rasterized by drawing the front polylines with a width of 3 grid cells.
Within each grid group, there are five subsets of files:
codsus_masked_1wide_.nc
codsus_masked_3wide_.nc
codsus_1wide_.nc
codsus_3wide_.nc
codsus_mask.nc
The primary source for this dataset is an internal archive maintained by personnel at the WPC and provided to the author. It is also provided at DOI 10.5281/zenodo.2642801. Some bulletins missing from the WPC archive were filled in with data acquired from the Iowa Environmental Mesonet.
The Global Forecast System (GFS) is a weather forecast model produced by the National Centers for Environmental Prediction (NCEP). Dozens of atmospheric and land-soil variables are available through this dataset, from temperatures, winds, and precipitation to soil moisture and atmospheric ozone concentration. The GFS data files stored here can be used immediately with OAR/ARL’s NOAA-EPA Atmosphere-Chemistry Coupler Cloud (NACC-Cloud) tool, and are in Network Common Data Form (netCDF), a format widely used across the scientific community. These particular GFS files contain a comprehensive set of global atmosphere/land variables at a relatively high spatiotemporal resolution (approximately 13x13 km horizontal, 127 vertical levels, and hourly). They are not only necessary for the NACC-Cloud tool to adequately drive community air quality applications (e.g., U.S. EPA’s Community Multiscale Air Quality model; https://www.epa.gov/cmaq), but can also be very useful for a myriad of other applications in the Earth system modeling communities (e.g., atmosphere, hydrosphere, pedosphere, etc.). While many other data file and record formats are available for Earth system and climate research (e.g., GRIB, HDF, GeoTIFF), the netCDF files here are advantageous to the larger community because of the comprehensive, high-resolution spatiotemporal information they contain, and because they are more scalable, appendable, shareable, self-describing, and community-friendly (i.e., many tools are available to the community of users). Out of the four operational GFS forecast cycles per day (at 00Z, 06Z, 12Z and 18Z), this particular netCDF dataset is updated daily (/inputs/yyyymmdd/) for the 12Z cycle and includes 24-hr output for both 2D (gfs.t12z.sfcf$0hh.nc) and 3D variables (gfs.t12z.atmf$0hh.nc).
Also available are netCDF formatted Global Land Surface Datasets (GLSDs) developed by Hung et al. (2024). The GLSDs are based on numerous satellite products, and have been gridded to match the GFS spatial resolution (~13x13 km). These GLSDs contain vegetation canopy data (e.g., land surface type, vegetation clumping index, leaf area index, vegetative canopy height, and green vegetation fraction) that are supplemental to and can be combined with the GFS meteorological netCDF data for various applications, including NOAA-ARL's canopy-app. The canopy data variables are climatological, based on satellite data from the year 2020, combined with GFS meteorology for the year 2022, and are created at a daily temporal resolution (/inputs/geo-files/gfs.canopy.t12z.2022mmdd.sfcf000.global.nc).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
The twin satellites of the Gravity Recovery and Climate Experiment (GRACE), launched in March of 2002, are making detailed monthly measurements of Earth's gravity field changes. These observations can detect regional mass changes of Earth's water reservoirs over land, ice and oceans. GRACE measures gravity variations by relating them to the distance variations between the two satellites, which fly in the same orbit, separated by about 240 km at an altitude of ~450 km. The monthly land mass grids contain terrestrial water storage anomalies (in aquifers, river basins, etc.) from GRACE time-variable gravity data relative to a time-mean. The storage anomalies are given in 'equivalent water thickness' (in NetCDF format). The time coverage for the monthly grids is determined by GRACE months. For the list of GRACE month dates visit http://grace.jpl.nasa.gov/data/grace-months/ . For more information please visit http://grace.jpl.nasa.gov/data/get-data/monthly-mass-grids-land/ .
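To make "anomalies relative to a time-mean" concrete, the sketch below subtracts the mean over a baseline period from a monthly series for a single grid cell. The function name, the sample values, and the choice of baseline months are hypothetical; the baseline period actually used for the GRACE grids is not stated here.

```python
def anomalies(series, baseline_indices):
    """Subtract the mean over the baseline months from every value."""
    baseline_mean = sum(series[i] for i in baseline_indices) / len(baseline_indices)
    return [value - baseline_mean for value in series]

# Hypothetical monthly storage values for one grid cell (arbitrary units)
monthly = [10.0, 12.0, 14.0, 16.0]
# Assumption for illustration: the first three months form the baseline
result = anomalies(monthly, range(3))
print(result)
```

By construction, the anomalies average to zero over the baseline months, so positive values indicate more stored water than the reference mean.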
The TIGER/Line shapefiles and related database files (.dbf) are an extract of selected geographic and cartographic information from the U.S. Census Bureau's Master Address File / Topologically Integrated Geographic Encoding and Referencing (MAF/TIGER) Database (MTDB). The MTDB represents a seamless national file with no overlaps or gaps between parts. However, each TIGER/Line shapefile is designed to stand alone as an independent data set, or the shapefiles can be combined to cover the entire nation. The Feature Names Relationship File (FEATNAMES.dbf) contains a record for each feature name and any attributes associated with it. Each feature name can be linked to the corresponding edges that make up that feature in the All Lines Shapefile (EDGES.shp), to the corresponding address range or ranges in the Address Ranges Relationship File (ADDR.dbf) where applicable, or to both files. Although this file includes feature names for all linear features, not just road features, the primary purpose of this relationship file is to identify all street names associated with each address range. An edge can have several feature names; an address range located on an edge can be associated with one or any combination of the available feature names (an address range can be linked to multiple feature names). The address range is identified by the address range identifier (ARID) attribute, which can be used to link to the Address Ranges Relationship File (ADDR.dbf). The linear feature is identified by the linear feature identifier (LINEARID) attribute, which can be used to relate the address range back to the name attributes of the feature in the Feature Names Relationship File or to the feature record in the Primary Roads, Primary and Secondary Roads, or All Roads Shapefiles. The edge to which a feature name applies can be determined by linking the feature name record to the All Lines Shapefile (EDGES.shp) using the permanent edge identifier (TLID) attribute.
The address range identifier(s) (ARID) for a specific linear feature can be found by using the linear feature identifier (LINEARID) from the Feature Names Relationship File (FEATNAMES.dbf) through the Address Range / Feature Name Relationship File (ADDRFN.dbf).
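The linkage described above can be sketched with plain in-memory records standing in for the rows of FEATNAMES.dbf and ADDRFN.dbf. All values are hypothetical, the FULLNAME column is assumed for illustration, and reading the actual .dbf files is left out.

```python
# Hypothetical rows; only TLID, LINEARID and ARID are named in the description.
featnames = [
    {"TLID": 101, "LINEARID": "1101", "FULLNAME": "Main St"},
    {"TLID": 101, "LINEARID": "1102", "FULLNAME": "State Rte 5"},
]
addrfn = [
    {"ARID": "A1", "LINEARID": "1101"},
    {"ARID": "A1", "LINEARID": "1102"},
    {"ARID": "A2", "LINEARID": "1101"},
]

def arids_for_feature(linearid, addrfn_rows):
    """All address range identifiers linked to one linear feature."""
    return sorted({r["ARID"] for r in addrfn_rows if r["LINEARID"] == linearid})

def names_for_address_range(arid, addrfn_rows, featnames_rows):
    """All feature names associated with one address range."""
    linearids = {r["LINEARID"] for r in addrfn_rows if r["ARID"] == arid}
    return sorted({r["FULLNAME"] for r in featnames_rows if r["LINEARID"] in linearids})

print(arids_for_feature("1101", addrfn))
print(names_for_address_range("A1", addrfn, featnames))
```

The second lookup shows the many-to-many case: one address range can carry more than one feature name.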
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
Data Summary: US states grid mask file and NOAA climate regions grid mask file, both compatible with the 12US1 modeling grid domain.
Note: The datasets are on a Google Drive. The metadata associated with this DOI contain the link to the Google Drive folder and instructions for downloading the data.
These files can be used with CMAQ-ISAMv5.3 to track state- or region-specific emissions. See Chapter 11 and Appendix B.4 in the CMAQ User's Guide for further information on how to use the ISAM control file with GRIDMASK files. The files can also be used for state- or region-specific scaling of emissions using the CMAQv5.3 DESID module. See the DESID Tutorial and Appendix B.4 in the CMAQ User's Guide for further information on how to use the Emission Control File to scale emissions in predetermined geographical areas.
File Location and Download Instructions:
Link to GRIDMASK files
Link to README text file with information on how these files were created
File Format: The grid masks are stored as netCDF-formatted files using I/O API data structures (https://www.cmascenter.org/ioapi/). Information on the model projection and grid structure is contained in the header information of the netCDF file. The output files can be opened and manipulated using I/O API utilities (e.g. M3XTRACT, M3WNDW) or other software programs that can read and write netCDF-formatted files (e.g. Fortran, R, Python).
File descriptions: These GRIDMASK files can be used with the 12US1 modeling grid domain (grid origin x = -2556000 m, y = -1728000 m; N columns = 459, N rows = 299).
GRIDMASK_STATES_12US1.nc - This file contains 49 variables for the 48 states in the conterminous U.S. plus DC. Each state variable (e.g., AL, AZ, AR, etc.) is a 2D array (299 x 459) providing the fractional area of each grid cell that falls within that state.
GRIDMASK_CLIMATE_REGIONS_12US1.nc - This file contains 9 variables for 9 NOAA climate regions based on the Karl and Koss (1984) definition of climate regions. Each climate region variable (e.g., CLIMATE_REGION_1, CLIMATE_REGION_2, etc.) is a 2D array (299 x 459) providing the fractional area of each grid cell that falls within that climate region.
NOAA Climate regions:
CLIMATE_REGION_1: Northwest (OR, WA, ID)
CLIMATE_REGION_2: West (CA, NV)
CLIMATE_REGION_3: West North Central (MT, WY, ND, SD, NE)
CLIMATE_REGION_4: Southwest (UT, AZ, NM, CO)
CLIMATE_REGION_5: South (KS, OK, TX, LA, AR, MS)
CLIMATE_REGION_6: Central (MO, IL, IN, KY, TN, OH, WV)
CLIMATE_REGION_7: East North Central (MN, IA, WI, MI)
CLIMATE_REGION_8: Northeast (MD, DE, NJ, PA, NY, CT, RI, MA, VT, NH, ME) + Washington, D.C.*
CLIMATE_REGION_9: Southeast (VA, NC, SC, GA, AL, FL)
*Note that Washington, D.C. is not included in any of the climate regions on the website but was included with the “Northeast” region for the generation of this GRIDMASK file.
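Because each mask variable holds the fraction of a grid cell that lies inside a state or region, attributing a gridded quantity to that area amounts to a cell-by-cell weighted sum. The following is a minimal sketch on a tiny 2 x 3 grid rather than the real 299 x 459 arrays; the function name and all numbers are hypothetical, and it does not reproduce how CMAQ-ISAM or DESID apply the masks internally.

```python
def masked_total(field, fraction):
    """Sum a gridded field weighted by the fractional-area mask of one region."""
    if len(field) != len(fraction) or any(
        len(f_row) != len(m_row) for f_row, m_row in zip(field, fraction)
    ):
        raise ValueError("field and mask must have the same shape")
    return sum(
        value * frac
        for f_row, m_row in zip(field, fraction)
        for value, frac in zip(f_row, m_row)
    )

# Hypothetical emissions per grid cell (rows x columns)
emissions = [[10.0, 20.0, 30.0],
             [40.0, 50.0, 60.0]]
# Hypothetical mask: 1.0 = cell fully inside the state, 0.0 = fully outside
state_fraction = [[1.0, 0.5, 0.0],
                  [1.0, 0.25, 0.0]]

total = masked_total(emissions, state_fraction)
print(total)
```

Cells on a border contribute only the share of their value that falls inside the area, which is why the masks store fractions rather than 0/1 flags.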
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
A hindcast reanalysis of ocean circulation in the Mid-Atlantic Bight and Gulf of Maine has been computed using the Regional Ocean Modeling System (ROMS) with 4-dimensional variational (4D-Var) assimilation of data from satellites, land-based ocean surface current measuring radar, and all available in situ observations from the MARACOOS (maracoos.org) and NERACOOS (neracoos.org) regional associations of the U.S. Integrated Ocean Observing System (IOOS). This reanalysis is version dopanv2r3-ini2007 (version 2, release 3, initialized January 2007).
The analysis covers the period 2-Jan-2007 to 30-Aug-2021 on a 7-km horizontal grid with 40 vertical terrain-following s-coordinate levels. Ocean state variables computed are sea level, velocity, temperature, and salinity. Air-sea fluxes of heat and momentum, and surface and bottom stresses, are included.
Results are provided on the ROMS model native 3-dimensional grid as (i) 1-hourly interval snapshots (ROMS “history” files), (ii) 1-day averages, (iii) monthly averages, (iv) yearly averages, and (v) ensemble monthly averages (i.e., the mean of all days in the same month from all years). The output files are in netCDF format, and data and metadata follow CF-1.4 conventions for the description of coordinates and variables.
The files uploaded here are examples of one time record from each of these 5 collections. Outputs for the full reanalysis, which comprises 6.8 terabytes of data, are made available for download via a THREDDS (Thematic Real-time Environmental Distributed Data Services) web service to facilitate user geospatial or temporal sub-setting.
The THREDDS catalog URLs and example filenames available here, for the respective collections, are:
1-hourly history snapshots, 2007-01-02 01:00 through 2021-08-31 00:00: https://tds.marine.rutgers.edu/thredds/roms/doppio/catalog.html?dataset=dopanv2r3-ini2007_da_history Example file uploaded here is his_dopanv2r3_20140516t0100.nc for 2014-05-06 01:00.
24-hour averages, 2007-01-02 12:00 through 2021-08-30 12:00: https://tds.marine.rutgers.edu/thredds/roms/doppio/catalog.html?dataset=dopanv2r3-ini2007_da_average Example file uploaded here is avg_dopanv2r3_20140516t1200.nc for 2014-05-06.
Monthly averages, 2007-01-17 through 2020-12-16: https://tds.marine.rutgers.edu/thredds/roms/doppio/catalog.html?dataset=dopanv2r3-ini2007_da_monthly_averages Example file uploaded here is mon_dopanv2r3_201405.nc for 2014-05.
Yearly averages, 2007 through 2020: https://tds.marine.rutgers.edu/thredds/roms/doppio/catalog.html?dataset=dopanv2r3-ini2007_da_yearly_averages Example file uploaded here is year_dopanv2r3_2014.nc for 2014.
Monthly ensemble averages: https://tds.marine.rutgers.edu/thredds/roms/doppio/catalog.html?dataset=dopanv2r3-ini2007_da_monthly_ensemble_means Example file uploaded here is ensmon_dopanv2r3_05.nc for May.
The underlying ocean circulation model configuration is described by López et al. (2020). The observations that are assimilated and the error hypotheses and other aspects of the 4D-Var assimilation implementation are described by Levin et al. (2020; 2021).
López, A. G., J. L. Wilkin and J. C. Levin (2020), Doppio – a ROMS (v3.6)-based circulation model for the Mid-Atlantic Bight and Gulf of Maine: configuration and comparison to integrated coastal observing network observations, Geosci. Model Dev., 13, 3709–3729, doi: 10.5194/gmd-13-3709-2020
Levin, J., H. Arango, B. Laughlin, E. Hunter, J. Wilkin and A. Moore (2020), Observation impacts on the Mid-Atlantic Bight front and cross-shelf transport in 4D-Var ocean state estimates, Part I – Multiplatform analysis, Ocean Modelling, 156, 101721, doi: 10.1016/j.ocemod.2020.101721
Levin, J., H. G. Arango, B. Laughlin, J. Wilkin and A. M. Moore (2021), The impact of remote sensing observations on cross-shelf transport estimates from 4D-Var analyses of the Mid-Atlantic Bight, Advances in Space Research, 68, 553-570, doi: 10.1016/j.asr.2019.09.012
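The ensemble monthly averages of this reanalysis are described as the mean of all days in the same month from all years. The sketch below shows that grouping for a single scalar time series; the function name and sample values are hypothetical, and the actual product applies the same idea to full 3-dimensional model fields.

```python
from collections import defaultdict
from datetime import date

def monthly_ensemble_means(daily):
    """Average all daily values that share a calendar month, across all years.

    `daily` is an iterable of (date, value) pairs.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for day, value in daily:
        sums[day.month] += value
        counts[day.month] += 1
    return {month: sums[month] / counts[month] for month in sorted(sums)}

# Hypothetical daily values from two different years
daily = [
    (date(2007, 5, 1), 10.0),
    (date(2007, 5, 2), 12.0),
    (date(2008, 5, 1), 14.0),
    (date(2008, 6, 1), 20.0),
]
means = monthly_ensemble_means(daily)
print(means)
```

Note that this weights every day equally, so a year with more valid days in a month contributes more to that month's ensemble mean.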
The table NORTH CAROLINA is part of the dataset L2 Voter File, available at https://redivis.com/datasets/ey62-9t0gpyvbg. It contains 169832561 rows across 38 variables.
The 2023 cartographic boundary shapefiles are simplified representations of selected geographic areas from the U.S. Census Bureau's Master Address File / Topologically Integrated Geographic Encoding and Referencing (MAF/TIGER) Database (MTDB). These boundary files are specifically designed for small-scale thematic mapping. When possible, generalization is performed with the intent to maintain the hierarchical relationships among geographies and to maintain the alignment of geographies within a file set for a given year. Geographic areas may not align with the same areas from another year. Some geographies are available as nation-based files while others are available only as state-based files. The cartographic boundary files include both incorporated places (legal entities) and census designated places or CDPs (statistical entities). An incorporated place is established to provide governmental functions for a concentration of people as opposed to a minor civil division (MCD), which generally is created to provide services or administer an area without regard, necessarily, to population. Places always nest within a state, but may extend across county and county subdivision boundaries. An incorporated place usually is a city, town, village, or borough, but can have other legal descriptions. CDPs are delineated for the decennial census as the statistical counterparts of incorporated places. CDPs are delineated to provide data for settled concentrations of population that are identifiable by name, but are not legally incorporated under the laws of the state in which they are located. The boundaries for CDPs often are defined in partnership with state, local, and/or tribal officials and usually coincide with visible features or the boundary of an adjacent incorporated place or another legal entity. 
CDP boundaries often change from one decennial census to the next with changes in the settlement pattern and development; a CDP with the same name as in an earlier census does not necessarily have the same boundary. The only population/housing size requirement for CDPs is that they must contain some housing and population. The generalized boundaries of most incorporated places in this file are based on those as of January 1, 2023, as reported through the Census Bureau's Boundary and Annexation Survey (BAS). The generalized boundaries of all CDPs are based on those delineated or updated as part of the 2023 BAS or the Census Bureau's Participant Statistical Areas Program (PSAP) for the 2020 Census.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
This repository was created for my Master's thesis in Computational Intelligence and Internet of Things at the University of Córdoba, Spain. The purpose of this repository is to store the datasets found that were used in some of the studies that served as research material for this Master's thesis. Also, the datasets used in the experimental part of this work are included.
Below are the datasets specified, along with the details of their references, authors, and download sources.
----------- STS-Gold Dataset ----------------
The dataset consists of 2026 tweets. The file consists of 3 columns: id, polarity, and tweet. The three columns denote the unique id, polarity index of the text and the tweet text respectively.
Reference: Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.
File name: sts_gold_tweet.csv
----------- Amazon Sales Dataset ----------------
This dataset contains the ratings and reviews of 1K+ Amazon products, as listed on the official Amazon website. The data was scraped from that website in January 2023.
Owner: Karkavelraja J., Postgraduate student at Puducherry Technological University (Puducherry, Puducherry, India)
Features:
License: CC BY-NC-SA 4.0
File name: amazon.csv
----------- Rotten Tomatoes Reviews Dataset ----------------
This rating inference dataset is a sentiment classification dataset, containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. On average, these reviews consist of 21 words. The first 5,331 rows contain only negative samples and the last 5,331 rows contain only positive samples, so the data should be shuffled before use.
This data is collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/ as a txt file and converted into a csv file. The file consists of 2 columns: reviews and labels (1 for fresh (good) and 0 for rotten (bad)).
Reference: Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics
File name: data_rt.csv
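Since the rows of the Rotten Tomatoes file are ordered by class, a seeded shuffle before splitting keeps both labels in every subset and makes the order reproducible. This is a minimal sketch on stand-in rows; the function name and seed are hypothetical, and loading data_rt.csv (for example with the csv module) is left out.

```python
import random

def shuffled(rows, seed=42):
    """Return a shuffled copy of the rows; the input list is left unchanged."""
    rng = random.Random(seed)  # fixed seed so the order can be reproduced
    out = list(rows)
    rng.shuffle(out)
    return out

# Stand-in for the file layout: all negatives (0) first, then all positives (1)
rows = [(f"negative review {i}", 0) for i in range(5)]
rows += [(f"positive review {i}", 1) for i in range(5)]

mixed = shuffled(rows)
print([label for _, label in mixed])
```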
----------- Preprocessed Dataset Sentiment Analysis ----------------
Preprocessed Amazon product review data for the Gen3EcoDot (Alexa), scraped entirely from amazon.in
Stemmed and lemmatized using nltk.
Sentiment labels are generated using TextBlob polarity scores.
The file consists of 4 columns: index, review (stemmed and lemmatized review using nltk), polarity (score) and division (categorical label generated using polarity score).
DOI: 10.34740/kaggle/dsv/3877817
Citation: @misc{pradeesh arumadi_2022, title={Preprocessed Dataset Sentiment Analysis}, url={https://www.kaggle.com/dsv/3877817}, DOI={10.34740/KAGGLE/DSV/3877817}, publisher={Kaggle}, author={Pradeesh Arumadi}, year={2022} }
This dataset was used in the experimental phase of my research.
File name: EcoPreprocessed.csv
----------- Amazon Earphones Reviews ----------------
This dataset consists of 9930 Amazon reviews and star ratings for the 10 latest (as of mid-2019) Bluetooth earphone devices, intended for learning how to train a machine learning model for sentiment analysis.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 5 columns: ReviewTitle, ReviewBody, ReviewStar, Product and division (manually added - categorical label generated using ReviewStar score)
License: U.S. Government Works
Source: www.amazon.in
File name (original): AllProductReviews.csv (contains 14337 reviews)
File name (edited - used for my research) : AllProductReviews2.csv (contains 9930 reviews)
----------- Amazon Musical Instruments Reviews ----------------
This dataset contains 7137 comments/reviews of different musical instruments coming from Amazon.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 10 columns: reviewerID, asin (ID of the product), reviewerName, helpful (helpfulness rating of the review), reviewText, overall (rating of the product), summary (summary of the review), unixReviewTime (time of the review - unix time), reviewTime (time of the review - raw) and division (manually added - categorical label generated using overall score).
Source: http://jmcauley.ucsd.edu/data/amazon/
File name (original): Musical_instruments_reviews.csv (contains 10261 reviews)
File name (edited - used for my research) : Musical_instruments_reviews2.csv (contains 7137 reviews)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
BAyeSian Integrated and Consolidated (BASIC) composite ozone time-series dataset built from a Bayesian joint self-calibration analysis of multiple composite ozone datasets. The construction of the BASIC composite is described in detail in the paper:
Ball et al, Reconciling differences in stratospheric ozone composites, ACP (2017).
If you use the BASIC dataset, please cite both the DOI for this data page and Ball et al 2017 (ACP).
The netCDF file includes variables for time, pressure and latitude giving the Julian dates* and pressure and latitude grid respectively. The ozone time-series data is given in the variable o3[time, pressure, latitude] and associated (time-varying) 1-sigma uncertainties are given in sigma_o3[time, pressure, latitude].
BASIC_V1_swooshV2.6_gozcardsV1.0_sbuvmodV8.6_sbuvmer.nc is built from SWOOSH v2.6, GOZCARDS v1.0, SBUV-MOD v8.6 and SBUV-MER (as described in Tummon et al 2015). This corresponds to the BASIC composite presented in Ball et al 2017 (ACP); the data runs up until Dec 2012.
BASIC_V1_swooshV2.6_gozcardsV2.20.nc is built from SWOOSH v2.6 and GOZCARDS v2.20; the updated data runs up until Dec 2018. This data was used in the revised version of Ball et al, Continuous decline in lower stratospheric ozone offsets ozone layer recovery, 2017 (ACPD) (referred to as merged-swoosh/gozcards in that paper).
*00:00:00.0 on 1/1/1980=2444239.5
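The reference point in the footnote is enough to convert the Julian dates in the time variable to calendar dates. The following is a minimal sketch using only the Python standard library; the function names are illustrative, and treating the dates as UTC is an assumption that the description does not state.

```python
from datetime import datetime, timedelta, timezone

# Reference point taken from the footnote: 00:00:00.0 on 1/1/1980 = 2444239.5
JD_REFERENCE = 2444239.5
# Assumption: the reference instant is interpreted as UTC
DATE_REFERENCE = datetime(1980, 1, 1, tzinfo=timezone.utc)


def julian_to_datetime(jd: float) -> datetime:
    """Convert a Julian date to a datetime by offsetting from the 1980 reference."""
    return DATE_REFERENCE + timedelta(days=jd - JD_REFERENCE)


def datetime_to_julian(dt: datetime) -> float:
    """Convert a timezone-aware datetime back to a Julian date."""
    return JD_REFERENCE + (dt - DATE_REFERENCE) / timedelta(days=1)


if __name__ == "__main__":
    print(julian_to_datetime(2444239.5).isoformat())
    # 1980 is a leap year, so 366 days later is 1 January 1981
    print(julian_to_datetime(2444239.5 + 366).date())
```

The same offset can be applied to every element of the time variable once it has been read from the netCDF file.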
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the netCDF files used to run mizuRoute across continental Chile.
Each folder includes the following files:
This research was funded by the Fondecyt Project 11200142 “Robust estimates of current and future water resources across a hydroclimatic gradient in Chile” (Principal Investigator: Pablo A. Mendoza).
The use of these files requires citing this dataset, and the paper that describes the approach used to produce the data:
Cortés-Salazar, N., Vásquez, N., Mizukami, N., Mendoza, P. A., & Vargas, X. (2023). To what extent does river routing matter in hydrological modeling?. Hydrology and Earth System Sciences, 27(19), 3505-3524. (doi.org/10.5194/hess-27-3505-2023).
The TIGER/Line shapefiles and related database files (.dbf) are an extract of selected geographic and cartographic information from the U.S. Census Bureau's Master Address File / Topologically Integrated Geographic Encoding and Referencing (MAF/TIGER) Database (MTDB). The MTDB represents a seamless national file with no overlaps or gaps between parts, however, each TIGER/Line shapefile is designed to stand alone as an independent data set, or they can be combined to cover the entire nation. The Address Ranges Relationship File (ADDR.dbf) contains the attributes of each address range. Each address range applies to a single edge and has a unique address range identifier (ARID) value. The edge to which an address range applies can be determined by linking the address range to the All Lines Shapefile (EDGES.shp) using the permanent topological edge identifier (TLID) attribute. Multiple address ranges can apply to the same edge since an edge can have multiple address ranges. Note that the most inclusive address range associated with each side of a street edge already appears in the All Lines Shapefile (EDGES.shp). The TIGER/Line Files contain potential address ranges, not individual addresses. The term "address range" refers to the collection of all possible structure numbers from the first structure number to the last structure number and all numbers of a specified parity in between along an edge side relative to the direction in which the edge is coded. The address ranges in the TIGER/Line Files are potential ranges that include the full range of possible structure numbers even though the actual structures may not exist.
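The definition of an address range above (first structure number, last structure number, and every number of the same parity in between, in the direction the edge is coded) can be illustrated with a short sketch. The function name is hypothetical and the sketch assumes purely numeric structure numbers; it is not part of any Census Bureau tooling.

```python
def potential_structure_numbers(first: int, last: int) -> list[int]:
    """Enumerate the potential structure numbers of one address range.

    The range runs from `first` to `last` in the order given, which stands in
    for the direction in which the edge is coded, and keeps a single parity.
    """
    if (first - last) % 2 != 0:
        raise ValueError("first and last structure numbers must share a parity")
    step = 2 if last >= first else -2
    # Extend the stop value by one unit in the step direction so `last` is included
    return list(range(first, last + step // 2, step))


if __name__ == "__main__":
    print(potential_structure_numbers(101, 109))
    print(potential_structure_numbers(208, 200))
```

As the description notes, these are potential numbers only: a value in the list does not imply that a structure with that number exists.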
The TIGER/Line shapefiles and related database files (.dbf) are an extract of selected geographic and cartographic information from the U.S. Census Bureau's Master Address File / Topologically Integrated Geographic Encoding and Referencing (MAF/TIGER) Database (MTDB). The MTDB represents a seamless national file with no overlaps or gaps between parts, however, each TIGER/Line shapefile is designed to stand alone as an independent data set, or they can be combined to cover the entire nation. The Address Range / Feature Name Relationship File (ADDRFN.dbf) contains a record for each address range / linear feature name relationship. The purpose of this relationship file is to identify all street names associated with each address range. An edge can have several feature names; an address range located on an edge can be associated with one or any combination of the available feature names (an address range can be linked to multiple feature names). The address range is identified by the address range identifier (ARID) attribute that can be used to link to the Address Ranges Relationship File (ADDR.dbf). The linear feature name is identified by the linear feature identifier (LINEARID) attribute that can be used to link to the Feature Names Relationship File (FEATNAMES.dbf).
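The linking described above (address range to edge via TLID, address range to feature name via ARID and LINEARID) amounts to a pair of key lookups. The sketch below uses small in-memory records in place of the .dbf files; all record values are hypothetical, and the FULLNAME field is an assumed name for the street-name attribute, which the description does not specify.

```python
from collections import defaultdict

# Hypothetical sample records standing in for rows of the .dbf files
ADDR = [
    {"ARID": "A1", "TLID": 1001},
    {"ARID": "A2", "TLID": 1001},  # several address ranges may share one edge
    {"ARID": "A3", "TLID": 1002},
]
ADDRFN = [
    {"ARID": "A1", "LINEARID": "L10"},
    {"ARID": "A1", "LINEARID": "L11"},  # one range linked to two feature names
    {"ARID": "A3", "LINEARID": "L10"},
]
# FULLNAME is an assumed attribute name used for illustration only
FEATNAMES = [
    {"LINEARID": "L10", "FULLNAME": "Main St"},
    {"LINEARID": "L11", "FULLNAME": "State Rte 9"},
]


def group_by(records, key):
    """Index a list of records by the value of one attribute."""
    index = defaultdict(list)
    for record in records:
        index[record[key]].append(record)
    return index


def address_ranges_on_edge(tlid):
    """Return the ARIDs of all address ranges that apply to one edge."""
    return [r["ARID"] for r in group_by(ADDR, "TLID")[tlid]]


def feature_names_for_range(arid):
    """Return every feature name linked to one address range."""
    names = {r["LINEARID"]: r["FULLNAME"] for r in FEATNAMES}
    return [names[r["LINEARID"]] for r in group_by(ADDRFN, "ARID")[arid]]


if __name__ == "__main__":
    print(address_ranges_on_edge(1001))
    print(feature_names_for_range("A1"))
```

With real data, the three record lists would be read from ADDR.dbf, ADDRFN.dbf and FEATNAMES.dbf with any DBF reader, and the same keys would be used for the joins.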
Single-beam bathymetry, gravity, and magnetic data, along with DGPS navigation data, were collected as part of field activity L-4-77-NC in Northern California from 05/10/1977 to 05/21/1977 (http://walrus.wr.usgs.gov/infobank/l/l477nc/html/l-4-77-nc.meta.html). These data are reformatted from space-delimited ASCII text files located in the Coastal and Marine Geology Program (CMGP) InfoBank field activity catalog at http://walrus.wr.usgs.gov/infobank/l/l477nc/html/l-4-77-nc.bath.html, http://walrus.wr.usgs.gov/infobank/l/l477nc/html/l-4-77-nc.grav.html, and http://walrus.wr.usgs.gov/infobank/l/l477nc/html/l-4-77-nc.mag.html into the MGD77T format provided by NOAA's National Geophysical Data Center (NGDC). The MGD77T format includes a header (documentation) file (.h77t) and a data file (.m77t). More information regarding this format can be found in the publication listed in the Cross_reference section of this metadata file.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description: There are 21 files in the directory. The file "TREFHT_1980-1999.nc" (in netCDF format) contains the 2-meter air temperature (128 lon x 64 lat x 240 months) for 1980-1999. The same data are also provided as 20 ASCII files, x1-x20, one for each year from 1980 to 1999. Each ASCII file contains 128x64x12 elements, which can be read into R as an array with, for example, array(scan("x1"), dim=c(128,64,12)). The data were generated from the NCAR Climate System Model. A part of this data set was used in Shen et al. (2002). Reference:
This dataset contains monthly precipitation data for China, with a spatial resolution of 0.0083333° (about 1 km) and a time range of 1901.1-2023.12. The data are stored in netCDF (.nc) format. The dataset was generated for China using the Delta spatial downscaling scheme, based on the global 0.5° climate dataset released by CRU and the global high-resolution climate dataset released by WorldClim. In addition, data from 496 independent meteorological observation points were used for verification, and the verification results are reliable. The dataset covers the main land areas of China (including Hong Kong, Macao and Taiwan), excluding islands and reefs in the South China Sea. To facilitate storage, all data are of type int16 and stored in nc files, with precipitation in units of 0.1 mm. The nc data can be mapped using ArcMap, and Matlab can also be used for extraction and processing. Matlab provides built-in functions for reading and writing nc files. To read data, switch to the folder containing the nc file and call ncread('XXX.nc', 'var', [i j t], [leni lenj lent]), where XXX.nc is the file name and var is the name of the variable to read from it; both are strings and must be enclosed in single quotes. i, j and t are the starting row, column and time index of the data to read, and leni, lenj and lent are the lengths of the data to read along the row, column and time dimensions, respectively. In this way, the function can be used to read any region and any time period within the study area. The Matlab help describes many further commands for nc data. WGS84 is recommended as the data coordinate system.
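For users who work outside Matlab, the start/length arguments of ncread and the 0.1 mm storage unit translate into two small helpers. This is a sketch using only the Python standard library; the function names are illustrative, and it assumes that ncread start indices are 1-based, as is usual in Matlab. The order of dimensions seen by a Python netCDF reader may differ from the order seen in Matlab, so the slices may need to be reordered for a given file.

```python
def hyperslab_to_slices(start, count):
    """Translate ncread-style start/length vectors (1-based) into 0-based slices."""
    if len(start) != len(count):
        raise ValueError("start and count must have the same number of dimensions")
    return tuple(slice(s - 1, s - 1 + n) for s, n in zip(start, count))


def to_millimetres(raw_value):
    """Convert a stored int16 value (units of 0.1 mm) to millimetres."""
    return raw_value / 10


if __name__ == "__main__":
    # Equivalent of ncread(..., [3 1 13], [2 4 12]): 2 rows, 4 columns, 12 months
    rows, cols, months = hyperslab_to_slices([3, 1, 13], [2, 4, 12])
    print(rows, cols, months)
    print(to_millimetres(123))
```

The resulting slices can be used to index the variable object of a netCDF reader directly, after which the raw integers are divided by 10 to obtain precipitation in mm.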
The project studying the evolution pattern and development trend of the arid environment in western China was a major research component of the project Environmental and Ecological Science for West China, which was funded by the National Natural Science Foundation of China. The leading executive of the project was Academician Zhisheng An from the Institute of Earth Environment of the Chinese Academy of Sciences. The project ran from January 2002 to December 2004. The data collected by the project include the following:
1. History and variability data for arid regions in western China:
1) Chinese Loess Plateau mass accumulation rate data (3600-0 kyr BP): Fields include age and mass accumulation rate (MAR) (txt file).
2) Chinese Loess Plateau grain size and magnetic susceptibility data (3600-0 kyr BP): Fields include age, stacked mean grain size, and stacked magnetic susceptibility (txt file).
2. Sporopollen content data of different loess strata since 12 kyr BP in the Yaozhou District of Shanxi Province (excel table): The distributions of 27 species of sporopollen (0-397 cm) from 67 different layers of loess samples are included.
3. 10Be record data (table): 10Be concentration, magnetic susceptibility and bulk density data of loess with different thicknesses (79.67-0.09 kyr BP).
4. Simulation data on the modulation of the East Asian monsoon resulting from orbital variability driven by the uplift of the Tibetan Plateau (nc files): ah0-sum.nc, hh0-sum.nc, jfh0-sum.nc, kdh0-sum.nc, lfh0-sum.nc, mask.nc, phis.nc.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains information on the Surface Soil Moisture (SM) content derived from satellite observations in the microwave domain.
A description of this dataset, including the methodology and validation results, is available at:
Preimesberger, W., Stradiotti, P., and Dorigo, W.: ESA CCI Soil Moisture GAPFILLED: An independent global gap-free satellite climate data record with uncertainty estimates, Earth Syst. Sci. Data Discuss. [preprint], https://doi.org/10.5194/essd-2024-610, in review, 2025.
ESA CCI Soil Moisture is a multi-satellite climate data record that consists of harmonized, daily observations coming from 19 satellites (as of v09.1) operating in the microwave domain. The wealth of satellite information, particularly over the last decade, facilitates the creation of a data record with the highest possible data consistency and coverage.
However, data gaps are still found in the record. This is particularly notable in earlier periods when a limited number of satellites were in operation, but can also arise from various retrieval issues, such as frozen soils, dense vegetation, and radio frequency interference (RFI). These data gaps present a challenge for many users, as they have the potential to obscure relevant events within a study area or are incompatible with (machine learning) software that often relies on gap-free inputs.
Since the requirement of a gap-free ESA CCI SM product was identified, various studies have demonstrated the suitability of different statistical methods to achieve this goal. A fundamental feature of such gap-filling methods is that they rely only on the original observational record, without the need for ancillary variables or model-based information. Due to the intrinsic challenge, no global, long-term univariate gap-filled product has been available until now. In this version of the record, data gaps due to missing satellite overpasses and invalid measurements are filled using the Discrete Cosine Transform (DCT) Penalized Least Squares (PLS) algorithm (Garcia, 2010). A linear interpolation is applied over periods of (potentially) frozen soils with little to no variability in (frozen) soil moisture content. Uncertainty estimates are based on models calibrated in experiments to fill satellite-like gaps introduced to GLDAS Noah reanalysis soil moisture (Rodell et al., 2004), and consider the gap size and local vegetation conditions as parameters that affect the gap-filling performance.
You can use command line tools such as wget or curl to download (and extract) data for multiple years. The following script downloads and extracts the complete data set to the local directory ~/Downloads on Linux or macOS systems.
#!/bin/bash
# Set download directory
DOWNLOAD_DIR=~/Downloads
base_url="https://researchdata.tuwien.at/records/3fcxr-cde10/files"
# Make sure the download directory exists
mkdir -p "$DOWNLOAD_DIR"
# Loop through years 1991 to 2023 and download & extract data
for year in {1991..2023}; do
    echo "Downloading $year.zip..."
    # Extract and delete the archive only if the download succeeded
    if wget -q -P "$DOWNLOAD_DIR" "$base_url/$year.zip"; then
        unzip -o "$DOWNLOAD_DIR/$year.zip" -d "$DOWNLOAD_DIR"
        rm "$DOWNLOAD_DIR/$year.zip"
    else
        echo "Failed to download $year.zip" >&2
    fi
done
The dataset provides global daily estimates for the 1991-2023 period at 0.25° (~25 km) horizontal grid resolution. Daily images are grouped by year (YYYY), each subdirectory containing one netCDF image file for a specific day (DD), month (MM) in a 2-dimensional (longitude, latitude) grid system (CRS: WGS84). The file name has the following convention:
ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_GAPFILLED-YYYYMMDD000000-fv09.1r1.nc
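Because the file name encodes the date, the name for any given day can be built, or the date recovered from a name, with a few lines of code. The following is a minimal sketch using the Python standard library; the function names are illustrative, and the relative path assumes one subdirectory per year named YYYY, as described above.

```python
from datetime import date, datetime

# Fixed parts of the file name convention given above
PREFIX = "ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_GAPFILLED-"
SUFFIX = "000000-fv09.1r1.nc"


def filename_for(day: date) -> str:
    """Build the netCDF file name for one day."""
    return f"{PREFIX}{day:%Y%m%d}{SUFFIX}"


def relative_path_for(day: date) -> str:
    """Build the path relative to the data root, assuming YYYY subdirectories."""
    return f"{day.year}/{filename_for(day)}"


def date_from_filename(name: str) -> date:
    """Recover the date encoded in a file name that follows the convention."""
    if not (name.startswith(PREFIX) and name.endswith(SUFFIX)):
        raise ValueError(f"unexpected file name: {name}")
    stamp = name[len(PREFIX):-len(SUFFIX)]
    return datetime.strptime(stamp, "%Y%m%d").date()


if __name__ == "__main__":
    print(relative_path_for(date(1991, 1, 1)))
    print(date_from_filename(filename_for(date(2023, 12, 31))))
```

Iterating over a range of dates with relative_path_for gives the list of files needed for a study period without scanning the directories.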
Each netCDF file contains 3 coordinate variables (WGS84 longitude, latitude and time stamp), as well as the following data variables:
Additional information for each variable is given in the netCDF attributes.
Changes in v09.1r1 (previous version was v09.1):
These data can be read by any software that supports Climate and Forecast (CF) conform metadata standards for netCDF files, such as:
The following records are all part of the Soil Moisture Climate Data Records from satellites community
1 |
ESA CCI SM MODELFREE Surface Soil Moisture Record | <a href="https://doi.org/10.48436/svr1r-27j77" target="_blank" |