Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
R code used for each data set to perform negative binomial regression, calculate the overdispersion statistic, generate summary statistics, and remove outliers
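The entry above mentions calculating an overdispersion statistic. A common choice (an assumption here, not necessarily this data set's exact definition) is the Pearson chi-square statistic divided by the residual degrees of freedom; a minimal Python sketch of that check, illustrative only since the entry's actual code is in R:

```python
def overdispersion(observed, fitted, n_params):
    """Pearson chi-square / residual df for a Poisson-type count model.

    Values well above 1 suggest overdispersion, motivating a
    negative binomial model instead of Poisson.
    """
    pearson = sum((y - mu) ** 2 / mu for y, mu in zip(observed, fitted))
    return pearson / (len(observed) - n_params)

# Tiny illustrative example: 2 observations, 1 fitted parameter
print(overdispersion([4, 0], [2.0, 2.0], 1))  # 4.0
```

A ratio near 1 is consistent with a Poisson fit; here the ratio of 4 would point toward the negative binomial model.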
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset contains hourly electrical energy time series from 44 small PV (household) units located in the same region, with installed capacity ranging between 1.1 and 3.7 kWp. The data was collected by smart meters as 15-minute time series and used in the FP7 SuSTAINABLE project. The dataset contains:
- admm_functions.R: R script with the ADMM algorithm implementation.
- clear_sky_functions.R: R script to estimate clear-sky solar power generation with the model described in Bacher et al. (2009).
- coef_generator.R: R script with the functions for generating VAR model coefficients, according to the implementation in (Virolainen, 2020).
- run_experiments.R: R script with the commands for generating the results of Section 3.3.1.
- c_sky.csv: estimated clear-sky solar power generation.
- normalized_PVdata.csv: normalized (with clear-sky model) solar power time series data.
- PVdata.csv: solar power time series data.
See [A] for more details about the R code. [A] C. Gonçalves, R.J. Bessa, P. Pinson, "A critical overview of privacy-preserving approaches for collaborative forecasting," International Journal of Forecasting. DOI:10.1016/j.ijforecast.2020.06.003
Market basket analysis with Apriori algorithm
The retailer wants to target customers with suggestions on the itemsets they are most likely to purchase. I was given a dataset containing a retailer's transaction data, which records all transactions that occurred over a period of time. The retailer will use the results to grow the business: providing customers with itemset suggestions, increasing customer engagement, improving customer experience, and identifying customer behavior. I will solve this problem using Association Rules, an unsupervised learning technique that checks for the dependency of one data item on another.
Association rule mining is most useful when you want to discover associations between different objects in a set, and it works well for finding frequent patterns in a transaction database. It can tell you which items customers frequently buy together, allowing the retailer to identify relationships between items.
Assume there are 100 customers: 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule "bought computer mouse => bought mouse mat": - support = P(mouse & mat) = 8/100 = 0.08 - confidence = support/P(computer mouse) = 0.08/0.10 = 0.8 - lift = confidence/P(mouse mat) = 0.8/0.09 ≈ 8.9. This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
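Using the standard definitions (confidence divides the joint support by the antecedent's support; lift divides confidence by the consequent's support), the metrics above can be computed directly from raw counts. A minimal Python sketch, illustrative only since the analysis in this post is done in R:

```python
def rule_metrics(n_total, n_antecedent, n_consequent, n_both):
    """Support, confidence, and lift for the rule antecedent => consequent."""
    support = n_both / n_total                    # P(A & B)
    confidence = n_both / n_antecedent            # P(B | A)
    lift = confidence / (n_consequent / n_total)  # confidence / P(B)
    return support, confidence, lift

# 100 customers: 10 bought a mouse, 9 bought a mat, 8 bought both
s, c, l = rule_metrics(100, 10, 9, 8)
print(round(s, 2), round(c, 2), round(l, 1))  # 0.08 0.8 8.9
```

A lift well above 1, as here, indicates the two items co-occur far more often than if they were bought independently.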
Number of Attributes: 7
https://user-images.githubusercontent.com/91852182/145270162-fc53e5a3-4ad1-4d06-b0e0-228aabcf6b70.png
First, we need to load the required libraries. Below I briefly describe each library.
https://user-images.githubusercontent.com/91852182/145270210-49c8e1aa-9753-431b-a8d5-99601bc76cb5.png
Next, we need to load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.
https://user-images.githubusercontent.com/91852182/145270229-514f0983-3bbb-4cd3-be64-980e92656a02.png
https://user-images.githubusercontent.com/91852182/145270251-6f6f6472-8817-435c-a995-9bc4bfef10d1.png
Next we will clean our data frame and remove missing values.
https://user-images.githubusercontent.com/91852182/145270286-05854e1a-2b6c-490e-ab30-9e99e731eacb.png
To apply Association Rule mining, we need to convert the data frame into transaction data, so that all items bought together in one invoice will be in ...
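As a language-neutral illustration of that conversion (the post itself does it in R), grouping line items by invoice number into per-basket transactions can be sketched as follows; the invoice IDs and items here are hypothetical:

```python
# Hypothetical (invoice, item) rows, as read from the retail spreadsheet
rows = [
    ("536365", "MUG"), ("536365", "LAMP"),
    ("536366", "MUG"), ("536366", "TEAPOT"), ("536366", "LAMP"),
]

# Group items by invoice: one "transaction" (basket) per invoice
transactions = {}
for invoice, item in rows:
    transactions.setdefault(invoice, []).append(item)

print(transactions["536366"])  # ['MUG', 'TEAPOT', 'LAMP']
```

Each resulting list is one transaction, the unit over which support and confidence are computed.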
The Clear Sky Mask product contains an image in the form of a binary cloud mask that identifies pixels within a coverage region as clear or cloudy. The production of the clear sky mask is an important step in the processing of many other Advanced Baseline Imager (ABI) Level 2+ products that use the information generated in the production of the clear sky mask to determine the presence of a cloud. The product includes data quality information for the binary cloud mask data values for on-earth pixels. The binary cloud mask value is a dimensionless quantity. The Clear Sky Mask product image is provided at 2 km resolution on the ABI fixed grid for Full Disk, CONUS, and Mesoscale coverage regions from GOES East and West. Product data is produced for geolocated source data to local zenith angles of 90 degrees for both daytime and nighttime conditions.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset consists of part 2 of the data associated with the publication "Data-driven Direct Diagnosis of PV Connected Batteries".
The synthetic cycles were generated using the mechanistic modeling approach. See “Big data training data for artificial intelligence-based Li-ion diagnosis and prognosis“ (Journal of Power Sources, Volume 479, 15 December 2020, 228806) and "Analysis of Synthetic Voltage vs. Capacity Datasets for Big Data Li-ion Diagnosis and Prognosis" (Energies 2021, 14, 2371 ) for more details.
Two sets of data are available, one for training and one for validation. Training dataset: MEDB_PI folder, clear-sky irradiance, 0.025 triplet resolution up to 50% degradation with 2% increment. Validation dataset: MEDB_Cloud folder, 18 different cloudy days, 0.05 triplet resolution up to 50% degradation with 2% increment.
All datasets were generated with slightly different cell parameters to account for cell-to-cell variations. Details are available in the publication. For each duty cycle, three sets of files are provided: the *_V files contain V vs. Q data, the *_t files contain V vs. time data, and the *_R files contain rate vs. Q data.
For each file, each column in the volt, voltT, or rate variable corresponds to one degradation path; the 1001 lines correspond to the resolution in variable Q (for the capacity-based data) or timenorm (for the time-based data). Details of each duty cycle are provided in the pathinfo variable, with headers in pathinfo_index (1 - % LLI, 2 - % LAMPE, 3 - % LAMNE, 4 - Capacity, 5 - DOD).
All simulations were performed with the 2022 version of the alawa toolbox. Voltage and kinetics of electrodes from different manufacturers, with different composition, or with different architecture might differ significantly.
MEDB_irradiancedata.mat contains data gathered for 2 years at MEDB site (see publication for details). Data provided courtesy of HNEI’s Severine Busquet, Jonathan Kobayashi, and Richard Rocheleau
This Matlab structure contains the following variables:
- dn: time from Matlab reference time
- dh: hour of the day
- doy: day of the year
- dm: month of the year
- dy: year
- ghi: global irradiance (W/m2)
- class: clear sky yes/no
- perc_clear: clear sky percentage (%)
- meta: panel metadata
- tot_POA: clear sky irradiance at POA (W/m2)
- inc: solar angle of incidence relative to the POA (degree)
- tot_horz: clear sky horizontal irradiance (W/m2)
This data release contains lake and reservoir water surface temperature summary statistics calculated from Landsat 8 Analysis Ready Dataset (ARD) images available within the Conterminous United States (CONUS) from 2013-2023. All zip files within this data release contain nested directories using .parquet files to store the data. The file example_script_for_using_parquet.R contains example code for using the R arrow package (Richardson and others, 2024) to open and query the nested .parquet files.
Limitations of this dataset include:
- All biases inherent to the Landsat Surface Temperature product are retained in this dataset, which can produce unrealistically high or low estimates of water temperature. This is observed to happen, for example, in cases with partial cloud coverage over a waterbody.
- Some waterbodies are split between multiple Landsat Analysis Ready Data tiles or orbit footprints. In these cases, multiple waterbody-wide statistics may be reported, one for each data tile. The deepest point values will be extracted and reported for the tile covering the deepest point. A total of 947 waterbodies are split between multiple tiles (see the multiple_tiles = "yes" column of site_id_tile_hv_crosswalk.csv).
- Temperature data were not extracted from satellite images with more than 90% cloud cover.
- Temperature data represent skin temperature at the water surface and may differ from temperature observations below the water surface.
Potential methods for addressing limitations of this dataset:
- Identifying and removing unrealistic temperature estimates: calculate the total percentage of cloud pixels over a given waterbody as percent_cloud_pixels = wb_dswe9_pixels/(wb_dswe9_pixels + wb_dswe1_pixels), and filter percent_cloud_pixels by a desired percentage of cloud coverage.
- Remove lakes with a limited number of water pixel values available (wb_dswe1_pixels < 10).
- Filter waterbodies where the deepest point is identified as water (dp_dswe = 1).
- Handling waterbodies split between multiple tiles: these waterbodies can be identified using the site_id_tile_hv_crosswalk.csv file (column multiple_tiles = "yes"). A user could combine sections of the same waterbody by spatially weighting the values using the number of water pixels available within each section (wb_dswe1_pixels). This should be done with caution, as some sections of the waterbody may have data available on different dates.
File descriptions:
- "year_byscene=XXXX.zip" - includes temperature summary statistics for individual waterbodies and the deepest points (the furthest point from land within a waterbody) within each waterbody by the scene_date (when the satellite passed over). Individual waterbodies are identified by the National Hydrography Dataset (NHD) permanent_identifier included within the site_id column. Some of the .parquet files in the _byscene datasets may only include one dummy row of data (identified by tile_hv="000-000"). This happens when no tabular data is extracted from the raster images because of clouds obscuring the image, a tile that covers mostly ocean with a very small amount of land, or other possible reasons. An example file path for this dataset: year_byscene=2023/tile_hv=002-001/part-0.parquet
- "year=XXXX.zip" - includes the summary statistics for individual waterbodies and the deepest points within each waterbody by year (dataset=annual), month (year=0, dataset=monthly), and year-month (dataset=yrmon).
The year_byscene=XXXX datasets are used as input for generating these summary tables, which aggregate temperature data by year, month, and year-month. Aggregated data are not available for the following tiles: 001-004, 001-010, 002-012, 028-013, and 029-012, because these tiles primarily cover ocean with limited land and no output data were generated. An example file path for this dataset: year=2023/dataset=lakes_annual/tile_hv=002-001/part-0.parquet
- "example_script_for_using_parquet.R" - this script includes code to download zip files directly from ScienceBase, identify HUC04 basins within a desired Landsat ARD grid tile, download NHDPlus High Resolution data for visualization, use the R arrow package to compile .parquet files in nested directories, and create example static and interactive maps.
- "nhd_HUC04s_ingrid.csv" - this crosswalk file identifies the HUC04 watersheds within each Landsat ARD tile grid.
- "site_id_tile_hv_crosswalk.csv" - this crosswalk file identifies the site_id (nhdhr_{permanent_identifier}) within each Landsat ARD tile grid. It also includes a column (multiple_tiles) to identify site_ids that fall within multiple Landsat ARD tile grids.
- "lst_grid.png" - a map of the Landsat grid tiles labelled by the horizontal-vertical ID.
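The cloud-filtering recommendation above can be expressed directly in code. The column names follow the release's documentation, while the 50% threshold and the sample rows are illustrative assumptions (the release itself ships R examples via the arrow package):

```python
def percent_cloud(wb_dswe9_pixels, wb_dswe1_pixels):
    """Fraction of cloud pixels over a waterbody, per the release's formula."""
    return wb_dswe9_pixels / (wb_dswe9_pixels + wb_dswe1_pixels)

# Illustrative rows: (wb_dswe9_pixels, wb_dswe1_pixels, mean_temp_C)
rows = [(25, 75, 18.2), (90, 10, 31.5), (0, 40, 16.9)]

# Keep rows with <= 50% cloud cover and at least 10 water pixels,
# applying both recommended filters from the release notes
kept = [r for r in rows if percent_cloud(r[0], r[1]) <= 0.5 and r[1] >= 10]
print(len(kept))  # 2
```

The second row (90% cloud pixels) is dropped as a likely unrealistic temperature estimate.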
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset illustrates the median household income in Clear Lake, spanning the years from 2010 to 2023, with all figures adjusted to 2023 inflation-adjusted dollars. Based on the latest 2019-2023 5-Year Estimates from the American Community Survey, it displays how income varied over the last decade. The dataset can be utilized to gain insights into median household income trends and explore income variations.
Key observations:
From 2010 to 2023, the median household income for Clear Lake decreased by $2,003 (2.97%), as per the American Community Survey estimates. In comparison, median household income for the United States increased by $5,602 (7.68%) between 2010 and 2023.
Analyzing the trend in median household income between the years 2010 and 2023, spanning 13 annual cycles, we observed that median household income, when adjusted for 2023 inflation using the Consumer Price Index retroactive series (R-CPI-U-RS), experienced growth year by year for 5 years and declined for 8 years.
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates. All incomes have been adjusted for inflation and are presented in 2023 inflation-adjusted dollars.
Years for which data is available:
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for any of your research projects, reports, or presentations, you can contact our research staff at research@neilsberg.com to discuss the feasibility of a custom tabulation on a fee-for-service basis.
The Neilsberg Research Team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is part of the main dataset for Clear Lake median household income. You can refer to it here.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Included are:
1. The raw data (before cleaning and preprocessing), found in the files ending "Raw3". The codebooks for each of these data files end in "codebook". These will enable the user to identify the statements that are associated with the items EU1 … 7, Eco1 … 7, Cul1 … 7, AD1 and AD2 that are used in the manuscript.
2. The R codes ending cleaning_plus.R, used to a) clean the datasets according to the procedure outlined in the online Appendix and b) remove entries with missing values for any of the variables that are used in the calibration process to produce balanced datasets (age, education, gender, political interest). Because of step b), the new datasets generated will be smaller than the clean datasets listed in Table 1 of the Appendix.
3. For the balancing and calibrating (pre-processing), we use a) the datasets for each country generated by 2 above (the files followed by the suffix "_clean"), b) the file drop.py, which is the code (in Python) for the balancing algorithm that is based on the principle of raking (see the online Appendix), c) the R files that are used to generate the new calibrated datasets that will be used in the Mokken Scale analysis in 5 below (followed by the suffix "balCode"), and d) a set of files ending in the suffix "estimates" that contain the joint distributions derived from the ESS data (i) for age, below versus above the median age, and (ii) for education, degree versus no degree, as well as the marginal distributions for gender and political interest. The median ages of the voting population derived from ESS are as follows: Austria: 50, Bulgaria: 52, Croatia: 52, Cyprus: 47, Czech Republic: 50, Denmark: 50, England: 53, Estonia: 50, Finland: 54, France: 55, Germany: 53, Greece: 50, Hungary: 49, Ireland: 50, Italy: 50, Lithuania: 53, Poland: 50, Portugal: 52, Romania: 46, Slovakia: 52, Slovenia: 52, Spain: 50.
4. A set of data files with the suffix myBal, which contain the new calibrated datasets that will be used in the Mokken Scale analysis in 5 (below).
5. A set of R codes for each country, beginning with the prefix "RCodes", that are used to generate the findings on dimensionality that are presented in the manuscript.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
unofficial mirror of FPT Open Speech Dataset (FOSD)
released publicly in 2018 by FPT Corporation
100h, 25.9k samples
official link (dead): https://fpt.ai/fpt-open-speech-data/
mirror: https://data.mendeley.com/datasets/k9sxg2twv4/4
DOI: 10.17632/k9sxg2twv4.4
pre-process:
remove non-sense strings: -N \r
remove 4 files because missing transcription: Set001_V0.1_008210.mp3 Set001_V0.1_010753.mp3 Set001_V0.1_011477.mp3 Set001_V0.1_011841.mp3
need to do: check misspelling usage… See the full description on the dataset page: https://huggingface.co/datasets/doof-ferb/fpt_fosd.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Original dataset
The original year-2019 dataset was downloaded from the World Bank DataBank using the following approach on July 23, 2022.
Database: "World Development Indicators" Country: 266 (all available) Series: "CO2 emissions (kt)", "GDP (current US$)", "GNI, Atlas method (current US$)", and "Population, total" Time: 1960, 1970, 1980, 1990, 2000, 2010, 2017, 2018, 2019, 2020, 2021 Layout: Custom -> Time: Column, Country: Row, Series: Column Download options: Excel
Preprocessing
With libreoffice,
remove non-country entries (rows after Zimbabwe), and shorten column names for easier processing: Country Name -> Country, Country Code -> Code, "XXXX ... GNI ..." -> GNI_1990, etc. (note '_', not '-', for R).
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Missing values in proteomic data sets have real consequences on downstream data analysis and reproducibility. Although several imputation methods exist to handle missing values, no single imputation method is best suited for a diverse range of data sets, and no clear strategy exists for evaluating imputation methods for clinical DIA-MS data sets, especially at different levels of protein quantification. To navigate through the different imputation strategies available in the literature, we have established a strategy to assess imputation methods on clinical label-free DIA-MS data sets. We used three DIA-MS data sets with real missing values to evaluate eight imputation methods with multiple parameters at different levels of protein quantification: a dilution series data set, a small pilot data set, and a clinical proteomic data set comparing paired tumor and stroma tissue. We found that imputation methods based on local structures within the data, like local least-squares (LLS) and random forest (RF), worked well in our dilution series data set, whereas imputation methods based on global structures within the data, like BPCA, performed well in the other two data sets. We also found that imputation at the most basic protein quantification level (fragment level) improved accuracy and the number of proteins quantified. With this analytical framework, we quickly and cost-effectively evaluated different imputation methods using two smaller complementary data sets to narrow down to the larger proteomic data set's most accurate methods. This acquisition strategy allowed us to provide reproducible evidence of the accuracy of the imputation method, even in the absence of a ground truth. Overall, this study indicates that the most suitable imputation method relies on the overall structure of the data set and provides an example of an analytic framework that may assist in identifying the most appropriate imputation strategies for the differential analysis of proteins.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Identification of errors or anomalous values, collectively considered outliers, assists in exploring data or through removing outliers improves statistical analysis. In biomechanics, outlier detection methods have explored the ‘shape’ of the entire cycles, although exploring fewer points using a ‘moving-window’ may be advantageous. Hence, the aim was to develop a moving-window method for detecting trials with outliers in intra-participant time-series data. Outliers were detected through two stages for the strides (mean 38 cycles) from treadmill running. Cycles were removed in stage 1 for one-dimensional (spatial) outliers at each time point using the median absolute deviation, and in stage 2 for two-dimensional (spatial–temporal) outliers using a moving window standard deviation. Significance levels of the t-statistic were used for scaling. Fewer cycles were removed with smaller scaling and smaller window size, requiring more stringent scaling at stage 1 (mean 3.5 cycles removed for 0.0001 scaling) than at stage 2 (mean 2.6 cycles removed for 0.01 scaling with a window size of 1). Settings in the supplied Matlab code should be customised to each data set, and outliers assessed to justify whether to retain or remove those cycles. The method is effective in identifying trials with outliers in intra-participant time series data.
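Stage 1 of the method flags one-dimensional (spatial) outliers at each time point using the median absolute deviation. A simplified per-time-point version, in Python rather than the authors' supplied Matlab code and with an illustrative scaling factor, can be sketched as:

```python
import statistics

def mad_outliers(values, scale=3.0):
    """Indices of values whose absolute deviation from the median
    exceeds `scale` times the median absolute deviation (MAD)."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    return [i for i, v in enumerate(values) if abs(v - med) > scale * mad]

# Hypothetical joint angles (deg) across cycles at one time point;
# the sixth cycle is clearly aberrant
print(mad_outliers([10.0, 11.0, 10.0, 12.0, 11.0, 50.0]))  # [5]
```

As the paper notes for its own settings, the scaling factor controls how many cycles are removed and should be customised to each data set.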
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Free Universal Sound Separation (FUSS) Dataset is a database of arbitrary sound mixtures and source-level references, for use in experiments on arbitrary sound separation.
This is the official sound separation data for the DCASE2020 Challenge Task 4: Sound Event Detection and Separation in Domestic Environments.
Citation: If you use the FUSS dataset or part of it, please cite our paper describing the dataset and baseline [1]. FUSS is based on FSD data so please also cite [2]:
Overview: FUSS audio data is sourced from a pre-release of the Freesound Dataset known as FSD50K, a sound event dataset composed of Freesound content annotated with labels from the AudioSet Ontology. Using the FSD50K labels, these source files have been screened such that they likely only contain a single type of sound. Labels are not provided for these source files and are not considered part of the challenge. For the purpose of the DCASE Task 4 Sound Separation and Event Detection challenge, systems should not use FSD50K labels, even though they may become available upon FSD50K release.
To create mixtures, 10 second clips of sources are convolved with simulated room impulse responses and added together. Each 10 second mixture contains between 1 and 4 sources. Source files longer than 10 seconds are considered "background" sources. Every mixture contains one background source, which is active for the entire duration. We provide: a software recipe to create the dataset, the room impulse responses, and the original source audio.
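The mixing step described above (convolve each source with a room impulse response, then sum) can be sketched in plain Python. Real FUSS processing operates on 16 kHz audio arrays via the released recipe scripts; the tiny signals below are purely illustrative:

```python
def convolve(x, h):
    """Full discrete convolution of signal x with impulse response h."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def mix(sources, rirs):
    """Reverberate each source with its own RIR, then sum into a mixture."""
    reverberated = [convolve(s, r) for s, r in zip(sources, rirs)]
    n = max(len(r) for r in reverberated)
    return [sum(r[i] if i < len(r) else 0.0 for r in reverberated)
            for i in range(n)]

# A unit impulse through a two-tap "room" keeps the RIR shape
print(convolve([1.0, 0.0, 0.0], [0.5, 0.25]))  # [0.5, 0.25, 0.0, 0.0]
```

Summing the reverberated sources, rather than reverberating the sum, is what lets each source keep its own spatial location, as the dataset description states.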
Motivation for use in DCASE2020 Challenge Task 4: This dataset provides a platform to investigate how source separation may help with event detection and vice versa. Previous work has shown that universal sound separation (separation of arbitrary sounds) is possible [3], and that event detection can help with universal sound separation [4]. It remains to be seen whether sound separation can help with event detection. Event detection is more difficult in noisy environments, and so separation could be a useful pre-processing step. Data with strong labels for event detection are relatively scarce, especially when restricted to specific classes within a domain. In contrast, source separation data needs no event labels for training, and may be more plentiful. In this setting, the idea is to utilize larger unlabeled separation data to train separation systems, which can serve as a front-end to event-detection systems trained on more limited data.
Room simulation: Room impulse responses are simulated using the image method with frequency-dependent walls. Each impulse corresponds to a rectangular room of random size with random wall materials, where a single microphone and up to 4 sources are placed at random spatial locations.
Recipe for data creation: The data creation recipe starts with scripts, based on scaper, to generate mixtures of events with random timing of source events, along with a background source that spans the duration of the mixture clip. The scripts for this are at this GitHub repo.
The data are reverberated using a different room simulation for each mixture. In this simulation each source has its own reverberation corresponding to a different spatial location. The reverberated mixtures are created by summing over the reverberated sources. The dataset recipe scripts support modification, so that participants may remix and augment the training data as desired.
The constituent source files for each mixture are also generated for use as references for training and evaluation.
Note: no attempt was made to remove digital silence from the freesound source data, so some reference sources may include digital silence, and there are a few mixtures where the background reference is all digital silence. Digital silence can also be observed in the event recognition public evaluation data, so it is important to be able to handle this in practice. Our evaluation scripts handle it by ignoring any reference sources that are silent.
Format: All audio clips are provided as uncompressed PCM 16 bit, 16 kHz, mono audio files.
Data split: The FUSS dataset is partitioned into "train", "validation", and "eval" sets, following the same splits used in FSD data. Specifically, the train and validation sets are sourced from the FSD50K dev set, and we have ensured that clips in train come from different uploaders than the clips in validation. The eval set is sourced from the FSD50K eval split.
Baseline System: A baseline system for the FUSS dataset is available at dcase2020_fuss_baseline.
License: All audio clips (i.e., in FUSS_fsd_data.tar.gz) used in the preparation of Free Universal Source Separation (FUSS) dataset are designated Creative Commons (CC0) and were obtained from freesound.org. The source data in FUSS_fsd_data.tar.gz were selected using labels from the FSD50K corpus, which is licensed as Creative Commons Attribution 4.0 International (CC BY 4.0) License.
The FUSS dataset as a whole, is a curated, reverberated, mixed, and partitioned preparation, and is released under the Creative Commons Attribution 4.0 International (CC BY 4.0) License. This license is specified in the `LICENSE-DATASET` file downloaded with the `FUSS_license_doc.tar.gz` file.
Note the links to the github repo in FUSS_license_doc/README.md are currently out of date, so please refer to FUSS_license_doc/README.md in this GitHub repo which is more recently updated.
https://creativecommons.org/publicdomain/zero/1.0/
Given a blurred image, image deblurring aims to produce a clear, high-quality image that accurately represents the original scene. Blurring can be caused by various factors such as camera shake, fast motion, out-of-focus objects, etc. making it a particularly challenging computer vision problem. This has led to the recent development of a large spectrum of deblurring models and unique datasets.
Despite the rapid advancement in image deblurring, the process of finding and pre-processing a number of datasets for training and testing purposes has been both time exhaustive and unnecessarily complicated for both experts and non-experts alike. Moreover, there is a serious lack of ready-to-use domain-specific datasets such as face and text deblurring datasets.
To this end, the following card contains a curated list of ready-to-use image deblurring datasets for training and testing various deblurring models. Additionally, we have created an extensive, highly customizable python package for single image deblurring called DBlur that can be used to train and test various SOTA models on the given datasets just with 2-3 lines of code.
Following is a list of the datasets that are currently provided:
- GoPro: The GoPro dataset for deblurring consists of 3,214 blurred images with a size of 1,280×720 that are divided into 2,103 training images and 1,111 test images.
- HIDE: HIDE is a motion-blurred dataset that includes 2025 blurred images for testing. It mainly focuses on pedestrians and street scenes.
- RealBlur: The RealBlur testing dataset consists of two subsets. The first is RealBlur-J, consisting of 1900 camera JPEG outputs. The second is RealBlur-R, consisting of 1900 RAW images. The RAW images are generated by using white balance, demosaicking, and denoising operations.
- CelebA: A face deblurring dataset created using the CelebA dataset, consisting of 2,000,000 training images, 1,299 validation images, and 1,300 testing images. The blurred images were created using the blur kernels provided by Shen et al. 2018.
- Helen: A face deblurring dataset created using the Helen dataset, consisting of 2,000 training images, 155 validation images, and 155 testing images. The blurred images were created using the blur kernels provided by Shen et al. 2018.
- Wider-Face: A face deblurring dataset created using the Wider-Face dataset, consisting of 4,080 training images, 567 validation images, and 567 testing images. The blurred images were created using the blur kernels provided by Shen et al. 2018.
- TextOCR: A text deblurring dataset created using the TextOCR dataset, consisting of 5,000 training images, 500 validation images, and 500 testing images. The blurred images were created using the blur kernels provided by Shen et al. 2018.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
Data points present in this dataset were obtained following these steps: To assess the secretion efficiency of the constructs, 96 colonies from the selection plates were evaluated using the workflow presented in Figure Workflow. We picked transformed colonies and cultured them in 400 μL TAP medium for 7 days in deep-well plates (Corning Axygen®, No.: PDW500CS, Thermo Fisher Scientific Inc., Waltham, MA), covered with Breathe-Easy® film (Sigma-Aldrich®). Cultivation was performed on a rotary shaker set to 150 rpm, under constant illumination (50 μmol photons/m2s). Then 100 μL samples were transferred to a clear-bottom 96-well plate (Corning Costar, Tewksbury, MA, USA) and fluorescence was measured using an Infinite® M200 PRO plate reader (Tecan, Männedorf, Switzerland) at excitation 575/9 nm and emission 608/20 nm. Supernatant samples were obtained by spinning the deep-well plates at 3000 × g for 10 min and transferring 100 μL from each well to the clear-bottom 96-well plate, followed by fluorescence measurement. To compare the constructs, R version 3.3.3 was used to perform one-way ANOVA (with Tukey's test); for testing statistical hypotheses, the significance level was set at 0.05. Graphs were generated in RStudio v1.0.136. The codes are deposited herein.
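The comparison above uses one-way ANOVA in R. As a language-neutral illustration of the F statistic that test computes (not the authors' deposited code, which also runs Tukey's post-hoc test), a minimal pure-Python version is:

```python
def one_way_anova_f(groups):
    """F statistic for one-way ANOVA over a list of sample groups."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    # Between-group sum of squares: spread of group means around the grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups)
    # Within-group sum of squares: spread of samples around their group mean
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g)
                    for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Two hypothetical fluorescence groups (arbitrary units)
print(one_way_anova_f([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]]))  # 1.5
```

The F value is then compared against the F distribution at the chosen significance level (0.05 in the study) to decide whether construct means differ.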
Info
ANOVA_Turkey_Sub.R -> code for the ANOVA analysis in R 3.3.3
barplot_R.R -> code to generate bar plots in R 3.3.3
boxplotv2.R -> code to generate boxplots in R 3.3.3
pRFU_+_bk.csv -> relative supernatant mCherry fluorescence dataset of positive colonies, blanked with parental wild-type cc1690 cells of Chlamydomonas reinhardtii
sup_+_bl.csv -> supernatant mCherry fluorescence dataset of positive colonies, blanked with parental wild-type cc1690 cells of Chlamydomonas reinhardtii
sup_raw.csv -> supernatant mCherry fluorescence dataset of 96 colonies for each construct
who_+_bl2.csv -> whole-culture mCherry fluorescence dataset of positive colonies, blanked with parental wild-type cc1690 cells of Chlamydomonas reinhardtii
who_raw.csv -> whole-culture mCherry fluorescence dataset of 96 colonies for each construct
who_+_Chlo.csv -> whole-culture chlorophyll fluorescence dataset of 96 colonies for each construct
Anova_Output_Summary_Guide.pdf -> explains the content of the ANOVA output files
ANOVA_pRFU_+_bk.doc -> ANOVA of the relative supernatant mCherry fluorescence dataset of positive colonies, blanked with parental wild-type cc1690 cells of Chlamydomonas reinhardtii
ANOVA_sup_+_bk.doc -> ANOVA of the supernatant mCherry fluorescence dataset of positive colonies, blanked with parental wild-type cc1690 cells of Chlamydomonas reinhardtii
ANOVA_who_+_bk.doc -> ANOVA of the whole-culture mCherry fluorescence dataset of positive colonies, blanked with parental wild-type cc1690 cells of Chlamydomonas reinhardtii
ANOVA_Chlo.doc -> ANOVA of the whole-culture chlorophyll fluorescence of all constructs, plus average and standard deviation values
Consider citing our work.
Molino JVD, de Carvalho JCM, Mayfield SP (2018) Comparison of secretory signal peptides for heterologous protein expression in microalgae: Expanding the secretion portfolio for Chlamydomonas reinhardtii. PLoS ONE 13(2): e0192433. https://doi.org/10.1371/journal.pone.0192433
The GOES-R Advanced Baseline Imager (ABI) Snow Cover product contains an image with pixel values identifying the fraction of each pixel's area covered by snow. The product includes data quality information that provides an assessment of the snow cover data values for on-earth pixels. The unit of measure for the snow cover value is percent. The Snow Cover product image is produced on the ABI fixed grid at 2 km resolution for the Full Disk, CONUS, and Mesoscale coverage regions from GOES East and West. Product data is produced under the following conditions: existence of land; clear sky; geolocated source data to local zenith angles of 90 degrees and solar zenith angles of 90 degrees.
The DQN Replay Dataset was collected as follows: We first train a DQN agent on all 60 Atari 2600 games with sticky actions enabled for 200 million frames (standard protocol) and save all of the experience tuples of (observation, action, reward, next observation) (approximately 50 million) encountered during training.
This logged DQN data can be found in the public GCP bucket gs://atari-replay-datasets which can be downloaded using gsutil. To install gsutil, follow the instructions here.
After installing gsutil, run the command to copy the entire dataset:
gsutil -m cp -R gs://atari-replay-datasets/dqn
To download the dataset for a specific Atari 2600 game only (e.g., replace [GAME_NAME] with Pong to download the logged DQN replay data for Pong), run the command:
gsutil -m cp -R gs://atari-replay-datasets/dqn/[GAME_NAME]
This data can be generated by running the online agents using batch_rl/baselines/train.py for 200 million frames (standard protocol). Note that the dataset consists of approximately 50 million experience tuples due to frame skipping (i.e., repeating a selected action for k consecutive frames) of 4. The stickiness parameter is set to 0.25, i.e., there is 25% chance at every time step that the environment will execute the agent's previous action again, instead of the agent's new action.
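The sticky-action and frame-skipping behavior described above can be sketched as an environment wrapper. This is an illustrative pure-Python sketch, not the actual batch_rl/Dopamine code; it assumes a hypothetical env whose `step(action)` returns `(obs, reward, done)`:

```python
import random

class StickyFrameSkip:
    """With probability `stickiness`, repeat the previous action instead of
    the newly selected one; hold each executed action for `skip` frames."""

    def __init__(self, env, stickiness=0.25, skip=4, seed=None):
        self.env = env
        self.stickiness = stickiness
        self.skip = skip
        self.prev_action = None
        self.rng = random.Random(seed)

    def step(self, action):
        # sticky actions: the emulator sometimes repeats the last action
        if self.prev_action is not None and self.rng.random() < self.stickiness:
            action = self.prev_action
        self.prev_action = action
        total_reward = 0.0
        for _ in range(self.skip):  # frame skipping: repeat for k frames
            obs, reward, done = self.env.step(action)
            total_reward += reward
            if done:
                break
        return obs, total_reward, done
```

This also makes the tuple count concrete: with a frame skip of 4, the agent logs roughly one transition per 4 emulator frames, so 200 million frames yield approximately 50 million experience tuples.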
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Synthetic data for DCASE 2019 task 4
Freesound dataset [1,2]: A subset of FSD is used as foreground sound events for the synthetic subset of the dataset for DCASE 2019 task 4. FSD is a large-scale, general-purpose audio dataset composed of Freesound content annotated with labels from the AudioSet Ontology [3].
SINS dataset [4]: The derivative of the SINS dataset used for DCASE2018 task 5 is used as background for the synthetic subset of the dataset for DCASE 2019 task 4. The SINS dataset contains a continuous recording of one person living in a vacation home over a period of one week. It was collected using a network of 13 microphone arrays distributed over the entire home. The microphone array consists of 4 linearly arranged microphones.
The synthetic set is composed of 10-second audio clips generated with Scaper [5]. The foreground events are obtained from FSD. Each event audio clip was verified manually to ensure that the sound quality and the event-to-background ratio were sufficient for it to be used as an isolated event. We also verified that the event was actually dominant in the clip, and we checked whether the event onset and offset were present in the clip. Each selected clip was then segmented, when needed, to remove silences before and after the event, and between events when the file contained multiple occurrences of the event class.
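Scaper handles the soundscape synthesis itself; the event-to-background ratio mentioned above amounts to scaling the foreground event so its RMS level sits a target number of dB above the background before mixing. A minimal illustrative sketch (not Scaper's API):

```python
import math

def rms(signal):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(s * s for s in signal) / len(signal))

def mix_at_ratio(background, event, ratio_db):
    """Scale `event` so its RMS is `ratio_db` dB above the background RMS,
    then mix sample by sample (signals assumed equal length)."""
    gain = (10 ** (ratio_db / 20)) * rms(background) / rms(event)
    return [b + g for b, g in zip(background, (gain * e for e in event))]
```

At 0 dB the scaled event matches the background level; positive ratios make the event dominant, which is the regime the manual verification step checks for.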
License:
All sounds coming from FSD are released under Creative Commons licenses. Synthetic sounds can only be used for competition purposes until the full CC license list is made available at the end of the competition.
Further information is available on the DCASE website.
References:
[1] F. Font, G. Roma & X. Serra. Freesound technical demo. In Proceedings of the 21st ACM international conference on Multimedia. ACM, 2013.
[2] E. Fonseca, J. Pons, X. Favory, F. Font, D. Bogdanov, A. Ferraro, S. Oramas, A. Porter & X. Serra. Freesound Datasets: A Platform for the Creation of Open Audio Datasets. In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017.
[3] Jort F. Gemmeke and Daniel P. W. Ellis and Dylan Freedman and Aren Jansen and Wade Lawrence and R. Channing Moore and Manoj Plakal and Marvin Ritter. Audio Set: An ontology and human-labeled dataset for audio events. In Proceedings IEEE ICASSP 2017, New Orleans, LA, 2017.
[4] Gert Dekkers, Steven Lauwereins, Bart Thoen, Mulu Weldegebreal Adhana, Henk Brouckxon, Toon van Waterschoot, Bart Vanrumste, Marian Verhelst, and Peter Karsmakers. The SINS database for detection of daily activities in a home environment using an acoustic sensor network. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), 32–36. November 2017.
[5] J. Salamon, D. MacConnell, M. Cartwright, P. Li, and J. P. Bello. Scaper: A library for soundscape synthesis and augmentation. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, Oct. 2017.
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Scripts used for analysis of V1 and V2 datasets.
seurat_v1.R - initialize a Seurat object from 10X Genomics Cell Ranger outputs. Includes filtering, normalization, regression, variable gene identification, PCA analysis, clustering, and tSNE visualization. Used for v1 datasets.
merge_seurat.R - merge two or more Seurat objects into one. Performs linear regression to remove batch effects between the separate objects. Used for v1 datasets.
subcluster_seurat_v1.R - subcluster clusters of interest from a Seurat object. Determines variable genes, performs regression and PCA. Used for v1 datasets.
seurat_v2.R - initialize a Seurat object from 10X Genomics Cell Ranger outputs. Includes filtering, normalization, regression, variable gene identification, and PCA analysis. Used for v2 datasets.
clustering_markers_v2.R - clustering and tSNE visualization for v2 datasets.
subcluster_seurat_v2.R - subcluster clusters of interest from a Seurat object. Determines variable genes, performs regression and PCA analysis. Used for v2 datasets.
seurat_object_analysis_v1_and_v2.R - downstream analysis and plotting functions for the Seurat object created by seurat_v1.R or seurat_v2.R.
merge_clusters.R - merge clusters that do not meet a gene threshold. Used for both v1 and v2 datasets.
prepare_for_monocle_v1.R - subcluster cells of interest and perform linear regression, but not scaling, in order to input normalized, regressed values into Monocle with monocle_seurat_input_v1.R.
monocle_seurat_input_v1.R - Monocle script using Seurat batch-corrected values as input for the v1 merged timecourse datasets.
monocle_lineage_trace.R - Monocle script using nUMI as input for the v2 lineage-traced dataset.
monocle_object_analysis.R - downstream analysis for the Monocle object: BEAM and plotting.
CCA_merging_v2.R - script for merging v2 endocrine datasets with canonical correlation analysis and determining the number of CCs to include in downstream analysis.
CCA_alignment_v2.R - script for downstream alignment, clustering, tSNE visualization, and differential gene expression analysis.
GOES-16 (G16) is the first satellite in the US NOAA third generation of Geostationary Operational Environmental Satellites (GOES), a.k.a. the GOES-R series (which will also include -S, -T, and -U). G16 was launched on 19 Nov 2016 and initially placed in an interim position at 89.5-deg W, between GOES-East and -West. Upon completion of Cal/Val in Dec 2018, it was moved to its permanent position at 75.2-deg W, and declared NOAA operational GOES-East on 18 Dec 2018. NOAA is responsible for all GOES-R products, including Sea Surface Temperature (SST) from the Advanced Baseline Imager (ABI). The ABI offers vastly enhanced capabilities for SST retrievals over the heritage GOES-I/P Imager, including five narrow bands (centered at 3.9, 8.4, 10.3, 11.2, and 12.3 um) out of 16 that can be used for SST, as well as accurate sensor calibration, image navigation and co-registration, spectral fidelity, and sophisticated pre-processing (geo-rectification, radiance equalization, and mapping). From an altitude of 35,800 km, G16/ABI can accurately map SST in a Full Disk (FD) area from 15-135-deg W and 60S-60N, with spatial resolution 2 km at nadir (degrading to 15 km at a view zenith angle of 67-deg) and temporal sampling of 10 min (15 min prior to 2 Apr 2019). The Level 2 Preprocessed (L2P) SST product is derived at the native sensor resolution using the NOAA Advanced Clear-Sky Processor for Ocean (ACSPO) system. ACSPO first processes each 10-min FD scene: SSTs are derived from BTs using the ACSPO clear-sky mask (ACSM; Petrenko et al., 2010) and the Non-Linear SST (NLSST) algorithm (Petrenko et al., 2014). Currently, only the four longwave bands centered at 8.4, 10.3, 11.2, and 12.3 um are used (the 3.9 um band was initially excluded, to minimize possible discontinuities in the diurnal cycle). The regression is tuned against quality-controlled in situ SSTs from drifting and tropical mooring buoys in the NOAA iQuam system (Xu and Ignatov, 2014).
The 10-min FD data are subsequently collated in time to produce a 1-hr L2P product, with improved coverage and reduced cloud leakage and image noise compared to each individual 10-min image. In the collated L2P, SSTs and BTs are only reported in clear-sky water pixels (defined as ocean, sea, lake or river, and up to 5 km inland), with fill values elsewhere. The L2P is reported in netCDF4 GHRSST Data Specification version 2 (GDS2) format, 24 granules per day, with a total data volume of 0.6 GB/day. In addition to SST, ACSPO files also include sun-sensor geometry, four BTs in ABI bands 11 (8.4 um), 13 (10.3 um), 14 (11.2 um), and 15 (12.3 um), and two reflectances in bands 2 and 3 (0.64 um and 0.86 um; used for cloud identification). The l2p_flags layer includes day/night, land, ice, twilight, and glint flags. Other variables include NCEP wind speed and ACSPO SST minus reference SST (Canadian Met Centre 0.1-deg L4 SST; available at https://podaac.jpl.nasa.gov/dataset/CMC0.1deg-CMC-L4-GLOB-v3.0). Pixel-level earth locations are not reported in the granules, as they remain unchanged from granule to granule. To obtain them, users have a choice of a flat lat-lon file or a Python script, both available at ftp://ftp.star.nesdis.noaa.gov/pub/socd4/coastwatch/sst/nrt/abi/nav/. Per GDS2 specifications, two additional Sensor-Specific Error Statistics layers (SSES bias and standard deviation) are reported in each pixel. The ACSPO ABI L2P product is monitored and validated against in situ data (Xu and Ignatov, 2014) using the Satellite Quality Monitor SQUAM (Dash et al., 2010), and BTs are validated against RTM simulation in MICROS (Liang and Ignatov, 2011). A reduced-size (0.2 GB/day), equal-angle gridded (0.02-deg resolution) ACSPO L3C product is also available at https://podaac.jpl.nasa.gov/dataset/ABI_G16-STAR-L3C-v2.70, where gridded L2P SSTs are reported and the BT layers are omitted.
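For orientation only, the classic two-band NLSST regression form can be sketched as below. This is an illustrative sketch of the general NLSST functional form, not ACSPO's operational ABI algorithm, which uses four longwave bands and its own coefficients tuned against iQuam in situ SSTs; the coefficient values here are placeholders:

```python
import math

def nlsst(t11, t12, t_first_guess, sat_zenith_deg, coeffs):
    """Classic two-band Non-Linear SST regression form (illustrative only).

    t11, t12: brightness temperatures in the split-window bands (K);
    t_first_guess: reference (first-guess) SST; sat_zenith_deg: view zenith
    angle; coeffs: regression coefficients (a0, a1, a2, a3), normally fitted
    against quality-controlled in situ SSTs.
    """
    a0, a1, a2, a3 = coeffs
    secant_term = 1.0 / math.cos(math.radians(sat_zenith_deg)) - 1.0
    return (a0
            + a1 * t11
            + a2 * (t11 - t12) * t_first_guess
            + a3 * (t11 - t12) * secant_term)
```

The (t11 - t12) difference carries the atmospheric water-vapor correction, the first-guess term makes that correction state-dependent (the "non-linear" part), and the secant term compensates for the longer atmospheric path at large view zenith angles.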