68 datasets found

TREC 2022 Deep Learning test collection
catalog.data.gov
data.nist.gov
Updated May 9, 2023
Cite
National Institute of Standards and Technology (2023). TREC 2022 Deep Learning test collection [Dataset]. https://catalog.data.gov/dataset/trec-2022-deep-learning-test-collection
Dataset updated
May 9, 2023
Dataset provided by
National Institute of Standards and Technology (http://www.nist.gov/)
Description
This is a test collection for passage and document retrieval, produced in the TREC 2022 Deep Learning track. The Deep Learning Track studies information retrieval in a large training data regime. This is the case where the number of training queries with at least one positive label is at least in the tens of thousands, if not hundreds of thousands or more. This corresponds to real-world scenarios such as training based on click logs and training based on labels from shallow pools (such as the pooling in the TREC Million Query Track or the evaluation of search engines based on early precision).

Certain machine-learning-based methods, such as methods based on deep learning, are known to require very large datasets for training. Lack of such large-scale datasets has been a limitation for developing such methods for common information retrieval tasks, such as document ranking. The Deep Learning Track organized in the previous years aimed at providing large-scale datasets to TREC and creating a focused research effort with a rigorous blind evaluation of rankers for the passage ranking and document ranking tasks.

Similar to the previous years, one of the main goals of the track in 2022 is to study what methods work best when a large amount of training data is available. For example, do the same methods that work on small data also work on large data? How much do methods improve when given more training data? What external data and models can be brought to bear in this scenario, and how useful is it to combine full supervision with other forms of supervision?

The collection contains 12 million web pages, 138 million passages from those web pages, search queries, and relevance judgments for the queries.
COVID-19 County-Wide Test, Case, and Death Trends
hub.mph.in.gov
Updated May 14, 2020
Cite
(2020). COVID-19 County-Wide Test, Case, and Death Trends [Dataset]. https://hub.mph.in.gov/dataset/covid-19-county-wide-test-case-and-death-trends
Dataset updated
May 14, 2020
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Archived as of 11/15/2023: With the end of the federal emergency and reporting requirements continuing to evolve, the Indiana Department of Health will no longer publish and refresh the COVID-19 datasets after November 15, 2023 - one final dataset publication will continue to be available as an archival copy. Number of COVID-19 cases, tests, and deaths by report date, by county. New positive cases, deaths and tests have occurred over a range of dates but were reported to ISDH in the last 24 hours. All data displayed is preliminary and subject to change as more information is reported to ISDH. Tests are displayed by the date the test was performed and deaths are displayed by the date the death occurred. Expect historical data to change as data is reported to ISDH.
InductiveQE Datasets
zenodo.org
zip
Updated Nov 9, 2022
Cite
Mikhail Galkin (2022). InductiveQE Datasets [Dataset]. http://doi.org/10.5281/zenodo.7306046
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.7306046
Dataset updated
Nov 9, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Mikhail Galkin
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
InductiveQE datasets

UPD 2.0: Regenerated datasets free of potential test set leakages

UPD 1.1: Added train_answers_val.pkl files to all freebase-derived datasets - answers of training queries on larger validation graphs

This repository contains 10 inductive complex query answering datasets published in "Inductive Logical Query Answering in Knowledge Graphs" (NeurIPS 2022). Nine datasets (106-550) were created from FB15k-237; the wikikg dataset was created from the OGB WikiKG 2 graph. In all datasets, the inference graphs extend the training graphs and include new nodes and edges. Dataset numbers indicate the size of the inference graph relative to the training graph: in 175, for example, the inference graph has 175% as many nodes as the training graph. The higher the ratio, the more unseen nodes appear at inference time and the harder the task. The Wikikg split has a fixed 133% ratio.
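
The size ratio behind the dataset numbers can be illustrated with a short sketch. It assumes that the .txt graph files store one edge per line as whitespace-separated head, relation, and tail; that layout is an assumption, not something stated in this description, and the helper names are hypothetical.

```python
def read_triples(lines):
    """Parse edges, assuming one 'head relation tail' triple per line (assumed layout)."""
    triples = []
    for line in lines:
        parts = line.split()
        if len(parts) == 3:  # skip blank or malformed lines
            triples.append(tuple(parts))
    return triples


def node_set(triples):
    """Collect every node that appears as a head or a tail."""
    nodes = set()
    for head, _relation, tail in triples:
        nodes.add(head)
        nodes.add(tail)
    return nodes


def inference_ratio(train_triples, inference_triples):
    """Inference-graph node count as a percentage of the training-graph node count."""
    train_nodes = node_set(train_triples)
    inference_nodes = node_set(inference_triples)
    return 100.0 * len(inference_nodes) / len(train_nodes)
```

With real files, the same functions could be fed open file handles, for example `read_triples(open("train_graph.txt"))`, provided the files really follow the assumed layout.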

Each dataset is a zip archive containing 17 files:

train_graph.txt (pt for wikikg) - original training graph

val_inference.txt (pt) - inference graph (validation split), new nodes in validation are disjoint with the test inference graph

val_predict.txt (pt) - missing edges in the validation inference graph to be predicted.

test_inference.txt (pt) - inference graph (test split), new nodes in test are disjoint with the validation inference graph

test_predict.txt (pt) - missing edges in the test inference graph to be predicted.

train/valid/test_queries.pkl - queries of the respective split, 14 query types for fb-derived datasets, 9 types for Wikikg (EPFO-only)

*_answers_easy.pkl - easy answers to respective queries that do not require predicting missing links but only edge traversal

*_answers_hard.pkl - hard answers to respective queries that DO require predicting missing links and against which the final metrics will be computed

train_answers_val.pkl - the extended set of answers for training queries on the bigger validation graph; most training queries have at least one new answer. This is intended as an inference-only dataset for measuring the faithfulness of trained models

train_answers_test.pkl - the extended set of answers for training queries on the bigger test graph; most training queries have at least one new answer. This is intended as an inference-only dataset for measuring the faithfulness of trained models

og_mappings.pkl - contains entity2id / relation2id dictionaries mapping local node/relation IDs from a respective dataset to the original fb15k237 / wikikg2

stats.txt - a small file with dataset stats

Overall unzipped size of all datasets combined is about 10 GB. Please refer to the paper for the sizes of graphs and the number of queries per graph.

The Wikikg dataset is meant to be evaluated in the inference-only regime, with models pre-trained solely on simple link prediction, because the number of complex training queries is not enough for such a large dataset.

Paper pre-print: https://arxiv.org/abs/2210.08008

The full source code of training/inference models is available at https://github.com/DeepGraphLearning/InductiveQE
Airline Dataset
kaggle.com
Updated Sep 26, 2023
Cite
Sourav Banerjee (2023). Airline Dataset [Dataset]. https://www.kaggle.com/datasets/iamsouravbanerjee/airline-dataset
Dataset updated
Sep 26, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Sourav Banerjee
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

Airline data holds immense importance as it offers insights into the functioning and efficiency of the aviation industry. It provides valuable information about flight routes, schedules, passenger demographics, and preferences, which airlines can leverage to optimize their operations and enhance customer experiences. By analyzing data on delays, cancellations, and on-time performance, airlines can identify trends and implement strategies to improve punctuality and mitigate disruptions. Moreover, regulatory bodies and policymakers rely on this data to ensure safety standards, enforce regulations, and make informed decisions regarding aviation policies. Researchers and analysts use airline data to study market trends, assess environmental impacts, and develop strategies for sustainable growth within the industry. In essence, airline data serves as a foundation for informed decision-making, operational efficiency, and the overall advancement of the aviation sector.

Content

This dataset comprises diverse parameters relating to airline operations on a global scale. The dataset prominently incorporates fields such as Passenger ID, First Name, Last Name, Gender, Age, Nationality, Airport Name, Airport Country Code, Country Name, Airport Continent, Continents, Departure Date, Arrival Airport, Pilot Name, and Flight Status. These columns collectively provide comprehensive insights into passenger demographics, travel details, flight routes, crew information, and flight statuses. Researchers and industry experts can leverage this dataset to analyze trends in passenger behavior, optimize travel experiences, evaluate pilot performance, and enhance overall flight operations.

Dataset Glossary (Column-wise)

Passenger ID - Unique identifier for each passenger

First Name - First name of the passenger

Last Name - Last name of the passenger

Gender - Gender of the passenger

Age - Age of the passenger

Nationality - Nationality of the passenger

Airport Name - Name of the airport where the passenger boarded

Airport Country Code - Country code of the airport's location

Country Name - Name of the country the airport is located in

Airport Continent - Continent where the airport is situated

Continents - Continents involved in the flight route

Departure Date - Date when the flight departed

Arrival Airport - Destination airport of the flight

Pilot Name - Name of the pilot operating the flight

Flight Status - Current status of the flight (e.g., on-time, delayed, canceled)
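
As a minimal sketch of how the glossary columns might be used, the snippet below counts flight statuses per airport continent with Python's standard csv module. The column names come from the glossary above; the sample rows and their values are made up for illustration and are not taken from the dataset.

```python
import csv
import io
from collections import Counter

# Illustrative rows only: the values are invented and are not part of the dataset.
SAMPLE_CSV = (
    "Passenger ID,Airport Continent,Flight Status\n"
    "p1,Europe,delayed\n"
    "p2,Europe,on-time\n"
    "p3,Asia,canceled\n"
    "p4,Europe,delayed\n"
)


def status_counts_by_continent(csv_file):
    """Count rows per (Airport Continent, Flight Status) pair."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        counts[(row["Airport Continent"], row["Flight Status"])] += 1
    return counts


counts = status_counts_by_continent(io.StringIO(SAMPLE_CSV))
```

The same function accepts an open file handle for the downloaded CSV, assuming its header matches the glossary names exactly.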

Structure of the Dataset

https://i.imgur.com/cUFuMeU.png

Acknowledgement

The dataset provided here is a simulated example and was generated using the online platform found at Mockaroo. This web-based tool offers a service that enables the creation of customizable Synthetic datasets that closely resemble real data. It is primarily intended for use by developers, testers, and data experts who require sample data for a range of uses, including testing databases, filling applications with demonstration data, and crafting lifelike illustrations for presentations and tutorials. To explore further details, you can visit their website.

Cover Photo by: Kevin Woblick on Unsplash

Thumbnail by: Airplane icons created by Freepik - Flaticon
COVID-19 case rate per 100,000 population and percent test positivity in the...
catalog.data.gov
data.ct.gov
Updated Aug 12, 2023
Cite
data.ct.gov (2023). COVID-19 case rate per 100,000 population and percent test positivity in the last 7 days by town - ARCHIVE [Dataset]. https://catalog.data.gov/dataset/covid-19-case-rate-per-100000-population-and-percent-test-positivity-in-the-last-7-days-by
Dataset updated
Aug 12, 2023
Dataset provided by
data.ct.gov
Description
DPH note about change from 7-day to 14-day metrics: As of 10/15/2020, this dataset is no longer being updated. Starting on 10/15/2020, these metrics will be calculated using a 14-day average rather than a 7-day average. The new dataset using 14-day averages can be accessed here: https://data.ct.gov/Health-and-Human-Services/COVID-19-case-rate-per-100-000-population-and-perc/hree-nys2 As you know, we are learning more about COVID-19 all the time, including the best ways to measure COVID-19 activity in our communities. CT DPH has decided to shift to 14-day rates because these are more stable, particularly at the town level, as compared to 7-day rates. In addition, since the school indicators were initially published by DPH last summer, CDC has recommended 14-day rates and other states (e.g., Massachusetts) have started to implement 14-day metrics for monitoring COVID transmission as well. With respect to geography, we also have learned that many people are looking at the town-level data to inform decision making, despite emphasis on the county-level metrics in the published addenda. This is understandable as there has been variation within counties in COVID-19 activity (for example, rates that are higher in one town than in most other towns in the county). This dataset includes a weekly count and weekly rate per 100,000 population for COVID-19 cases, a weekly count of COVID-19 PCR diagnostic tests, and a weekly percent positivity rate for tests among people living in community settings. Dates are based on date of specimen collection (cases and positivity). A person is considered a new case only upon their first COVID-19 testing result because a case is defined as an instance or bout of illness. If they are tested again subsequently and are still positive, it still counts toward the test positivity metric but they are not considered another case. 
These case and test counts do not include cases or tests among people residing in congregate settings, such as nursing homes, assisted living facilities, or correctional facilities. These data are updated weekly; the previous week period for each dataset is the previous Sunday-Saturday, known as an MMWR week (https://wwwn.cdc.gov/nndss/document/MMWR_week_overview.pdf). The date listed is the date the dataset was last updated and corresponds to a reporting period of the previous MMWR week. For instance, the data for 8/20/2020 corresponds to a reporting period of 8/9/2020-8/15/2020. Notes: 9/25/2020: Data for Mansfield and Middletown for the week of Sept 13-19 were unavailable at the time of reporting due to delays in lab reporting.
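
The reporting-period rule and the two metrics described above can be sketched as follows. This is an illustration under stated assumptions, not DPH's actual calculation: the function names are hypothetical, the rate is taken as cases per 100,000 residents, percent positivity is taken as positive tests divided by total tests, and an update published on a Saturday is assumed to report the week that ended seven days earlier.

```python
from datetime import date, timedelta


def reporting_period(update_date):
    """Return (Sunday, Saturday) of the last full Sunday-Saturday week before update_date."""
    # Python's weekday(): Monday=0 ... Saturday=5, Sunday=6.
    days_back = (update_date.weekday() - 5) % 7
    if days_back == 0:
        # Assumption: a Saturday update reports the week that ended 7 days earlier.
        days_back = 7
    saturday = update_date - timedelta(days=days_back)
    return saturday - timedelta(days=6), saturday


def rate_per_100k(case_count, population):
    """Weekly case rate per 100,000 population (assumed formula)."""
    return case_count * 100_000 / population


def percent_positivity(positive_tests, total_tests):
    """Share of tests that were positive, in percent (assumed formula)."""
    return positive_tests * 100 / total_tests
```

For the example in the description, `reporting_period(date(2020, 8, 20))` yields 8/9/2020 through 8/15/2020.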
Data from: ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction
data.niaid.nih.gov
zenodo.org
Updated Jan 27, 2022
Cite
Keshavarz, Hossein (2022). ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5907001
Dataset updated
Jan 27, 2022
Dataset provided by
Nagappan, Meiyappan
Keshavarz, Hossein
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction

This archive contains the ApacheJIT dataset presented in the paper "ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction" as well as the replication package. The paper is submitted to MSR 2022 Data Showcase Track.

The datasets are available under directory dataset. There are 4 datasets in this directory.

apachejit_total.csv: This file contains the entire dataset. Commits are specified by their identifier and a set of commit metrics that are explained in the paper are provided as features. Column buggy specifies whether or not the commit introduced any bug into the system.

apachejit_train.csv: This file is a subset of the entire dataset. It provides a balanced set that we recommend for models that are sensitive to class imbalance. This set is obtained from the first 14 years of data (2003 to 2016).

apachejit_test_large.csv: This file is a subset of the entire dataset. The commits in this file are the commits from the last 3 years of data. This set is not balanced to represent a real-life scenario in a JIT model evaluation where the model is trained on historical data to be applied on future data without any modification.

apachejit_test_small.csv: This file is a subset of the test file explained above. Since the test file has more than 30,000 commits, we also provide a smaller test set which is still unbalanced and from the last 3 years of data.
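
A quick way to see the difference between the balanced training file and the unbalanced test files is to count the values in the buggy column. The sketch below uses only the standard library. The commit_id column name and the "True"/"False" encoding are assumptions for illustration, and the sample rows are made up.

```python
import csv
import io
from collections import Counter

# Made-up rows; the real files may name and encode these columns differently.
BALANCED_SAMPLE = "commit_id,buggy\nc1,True\nc2,False\nc3,True\nc4,False\n"
UNBALANCED_SAMPLE = "commit_id,buggy\nc1,False\nc2,False\nc3,False\nc4,True\n"


def label_counts(csv_file, label_column="buggy"):
    """Count how often each label value occurs in the given column."""
    return Counter(row[label_column] for row in csv.DictReader(csv_file))


def is_balanced(counts, tolerance=0.05):
    """True if at least two classes exist and their shares differ by at most `tolerance`."""
    total = sum(counts.values())
    if total == 0 or len(counts) < 2:
        return False
    shares = [n / total for n in counts.values()]
    return max(shares) - min(shares) <= tolerance
```

Running `label_counts` on apachejit_train.csv and on apachejit_test_large.csv would show whether the class shares match the description, assuming the label column is read as plain strings.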

In addition to the dataset, we also provide the scripts using which we built the dataset. These scripts are written in Python 3.8. Therefore, Python 3.8 or above is required. To set up the environment, we have provided a list of required packages in file requirements.txt. Additionally, one filtering step requires GumTree [1]. For Java, GumTree requires Java 11. For other languages, external tools are needed. Installation guide and more details can be found here.

The scripts consist of Python scripts under directory src and Python notebooks under directory notebooks. The Python scripts are mainly responsible for conducting the GitHub search via the GitHub search API and collecting commits through the PyDriller package [2]. The notebooks link the fixed issue reports with their corresponding fixing commits and apply some filtering steps. The bug-inducing candidates are then filtered again using the gumtree.py script, which uses the GumTree package. Finally, the remaining bug-inducing candidates are combined with the clean commits in the dataset_construction notebook to form the entire dataset.

More specifically, git_token.py handles the GitHub API token that is necessary for requests to the GitHub API. Script collector.py performs the GitHub search. Tracing changed lines and git annotate are done in gitminer.py using PyDriller. Finally, gumtree.py applies 4 filtering steps (number of lines, number of files, language, and change significance).

References:

[1] GumTree. https://github.com/GumTreeDiff/gumtree

Jean-Rémy Falleri, Floréal Morandat, Xavier Blanc, Matias Martinez, and Martin Monperrus. 2014. Fine-grained and accurate source code differencing. In ACM/IEEE International Conference on Automated Software Engineering, ASE ’14, Vasteras, Sweden - September 15 - 19, 2014. 313–324

[2] PyDriller. https://pydriller.readthedocs.io/en/latest/

Davide Spadini, Maurício Aniche, and Alberto Bacchelli. 2018. PyDriller: Python Framework for Mining Software Repositories. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Lake Buena Vista, FL, USA) (ESEC/FSE 2018). Association for Computing Machinery, New York, NY, USA, 908–911

SMDG, A Standardized Fundus Glaucoma Dataset

kaggle.com

Updated Apr 23, 2023

Cite

Riley Kiefer (2023). SMDG, A Standardized Fundus Glaucoma Dataset [Dataset]. http://doi.org/10.34740/kaggle/ds/2329670

Unique identifier

https://doi.org/10.34740/kaggle/ds/2329670

Dataset updated

Apr 23, 2023

Dataset provided by

Kaggle (http://kaggle.com/)

Authors

Riley Kiefer

Description

Standardized Multi-Channel Dataset for Glaucoma (SMDG-19), a standardization of 19 public glaucoma datasets for AI applications.

Standardized Multi-Channel Dataset for Glaucoma (SMDG-19) is a collection and standardization of 19 public datasets, comprising full-fundus glaucoma images, associated image metadata such as optic disc segmentation, optic cup segmentation, and blood vessel segmentation, and any provided per-instance text metadata such as sex and age. This dataset is designed to be exploratory and open-ended, with multiple use cases and no established training/validation/test splits. This dataset is the largest public repository of fundus images with glaucoma.

Citation

Please cite at least the first work in academic publications:
1. Kiefer, Riley, et al. "A Catalog of Public Glaucoma Datasets for Machine Learning Applications: A detailed description and analysis of public glaucoma datasets available to machine learning engineers tackling glaucoma-related problems using retinal fundus images and OCT images." Proceedings of the 2023 7th International Conference on Information System and Data Mining. 2023.
2. R. Kiefer, M. Abid, M. R. Ardali, J. Steen and E. Amjadian, "Automated Fundus Image Standardization Using a Dynamic Global Foreground Threshold Algorithm," 2023 8th International Conference on Image, Vision and Computing (ICIVC), Dalian, China, 2023, pp. 460-465, doi: 10.1109/ICIVC58118.2023.10270429.
3. R. Kiefer, J. Steen, M. Abid, M. R. Ardali and E. Amjadian, "A Survey of Glaucoma Detection Algorithms using Fundus and OCT Images," 2022 IEEE 13th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), Vancouver, BC, Canada, 2022, pp. 0191-0196, doi: 10.1109/IEMCON56893.2022.9946629.

Please also see the following optometry abstract publications:
1. A Comprehensive Survey of Publicly Available Glaucoma Datasets for Automated Glaucoma Detection; AAO 2022; https://aaopt.org/past-meeting-abstract-archives/?SortBy=ArticleYear&ArticleType=&ArticleYear=2022&Title=&Abstract=&Authors=&Affiliation=&PROGRAMNUMBER=225129
2. Standardized and Open-Access Glaucoma Dataset for Artificial Intelligence Applications; ARVO 2023; https://iovs.arvojournals.org/article.aspx?articleid=2790420
3. Ground truth validation of publicly available datasets utilized in artificial intelligence models for glaucoma detection; ARVO 2023; https://iovs.arvojournals.org/article.aspx?articleid=2791017

Please also see the DOI citations for this and related datasets:
1. SMDG; @dataset{smdg, title={SMDG, A Standardized Fundus Glaucoma Dataset}, url={https://www.kaggle.com/ds/2329670}, DOI={10.34740/KAGGLE/DS/2329670}, publisher={Kaggle}, author={Riley Kiefer}, year={2023} }
2. EyePACS-light-v1; @dataset{eyepacs-light-v1, title={Glaucoma Dataset: EyePACS AIROGS - Light}, url={https://www.kaggle.com/ds/3222646}, DOI={10.34740/KAGGLE/DS/3222646}, publisher={Kaggle}, author={Riley Kiefer}, year={2023} }
3. EyePACS-light-v2; @dataset{eyepacs-light-v2, title={Glaucoma Dataset: EyePACS-AIROGS-light-V2}, url={https://www.kaggle.com/dsv/7300206}, DOI={10.34740/KAGGLE/DSV/7300206}, publisher={Kaggle}, author={Riley Kiefer}, year={2023} }

Dataset Objective

The objective of this dataset is to provide a machine-learning-ready dataset for glaucoma-related applications. With the help of the community, new open-source glaucoma datasets will be reviewed for standardization and inclusion in this dataset.

Data Standardization

Full fundus images (and corresponding segmentation maps) are standardized using a novel algorithm (Citation 1) by cropping the background, centering the fundus image, padding missing information, and resizing to 512x512 pixels. This standardization ensures that as much foreground information as possible is retained during resizing, so the images are ready for machine-learning image processing.
Each available metadata text is standardized by representing each fundus image as a row and each fundus attribute as a column in a CSV file.
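
The four standardization steps (crop, center, pad, resize) can be sketched in plain Python on a grayscale image stored as a list of rows. This is not the authors' algorithm: it swaps in a fixed intensity threshold for foreground detection and nearest-neighbour resizing, and all function names are hypothetical.

```python
def crop_to_foreground(img, threshold=0):
    """Crop to the bounding box of pixels brighter than `threshold` (fixed threshold: an assumption)."""
    rows = [r for r, row in enumerate(img) if any(p > threshold for p in row)]
    cols = [c for c in range(len(img[0])) if any(row[c] > threshold for row in img)]
    if not rows or not cols:
        return [row[:] for row in img]  # no foreground found: return an unchanged copy
    return [row[cols[0]:cols[-1] + 1] for row in img[rows[0]:rows[-1] + 1]]


def pad_to_square(img, fill=0):
    """Center the image on a square canvas, filling the missing area with `fill`."""
    height, width = len(img), len(img[0])
    size = max(height, width)
    top, left = (size - height) // 2, (size - width) // 2
    canvas = [[fill] * size for _ in range(size)]
    for r in range(height):
        for c in range(width):
            canvas[top + r][left + c] = img[r][c]
    return canvas


def resize_nearest(img, size):
    """Resize a square image to size x size with nearest-neighbour sampling."""
    n = len(img)
    return [[img[r * n // size][c * n // size] for c in range(size)] for r in range(size)]


def standardize(img, size=512, threshold=0):
    """Crop the background, center and pad to a square, then resize."""
    return resize_nearest(pad_to_square(crop_to_foreground(img, threshold)), size)
```

Cropping before resizing is what keeps the foreground large in the output: without it, a wide black border would take up part of the 512x512 grid.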

Dataset Instance	Original Fundus	Standardized Fundus Image
sjchoi86-HRF	https://user-images.githubusercontent.com/65875562/204170005-2d4dd051-0032-40c8-ba0b-390b6080bb69.png	https://user-images.githubusercontent.com/65875562/204170011-51b7d001-4d43-4f0d-835e-984d45116b18.png
BEH	https://user-images.githubusercontent.com/65875562/211052753-93f8a3aa-cc65-4790-8da6-229f512a6afb.PNG	<img src="htt...

Kinetics-400-[test-set]
kaggle.com
Updated Sep 11, 2023
Cite
Innat (2023). Kinetics-400-[test-set] [Dataset]. https://www.kaggle.com/datasets/ipythonx/k4testset
Dataset updated
Sep 11, 2023
Dataset provided by
Kaggle
Authors
Innat
Description
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F1984321%2Fee10abf5409ea4eaaad3dfaa9514a4bb%2FScreenshot_2021-08-06_at_16.15.03.png?generation=1694441423300452&alt=media

Video Action Recognition : Kinetics 400

The dataset contains 400 human action classes, with at least 400 video clips for each action. Each clip lasts around 10s and is taken from a different YouTube video. The actions are human focussed and cover a broad range of classes including human-object interactions such as playing instruments, as well as human-human interactions such as shaking hands. Homepage.

License

The Kinetics dataset is licensed by Google Inc. under a Creative Commons Attribution 4.0 International License. Published May 22, 2017.
WMT17 Quality Estimation Shared Test Data - Dataset - B2FIND
b2find.eudat.eu
Updated Apr 30, 2023
Cite
(2023). WMT17 Quality Estimation Shared Test Data - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/8c74aaf0-218b-5eb3-a31e-3af48933619f
Dataset updated
Apr 30, 2023
Description
Test data for the WMT17 QE task. Train data can be downloaded from http://hdl.handle.net/11372/LRT-1974.

This shared task will build on its previous five editions to further examine automatic methods for estimating the quality of machine translation output at run-time, without relying on reference translations. We include word-level, phrase-level and sentence-level estimation. All tasks will make use of a large dataset produced from post-editions by professional translators. The data will be domain-specific (IT and Pharmaceutical domains) and substantially larger than in previous years.

In addition to advancing the state of the art at all prediction levels, our goals include:
- To test the effectiveness of larger (domain-specific and professionally annotated) datasets. We will do so by increasing the size of one of last year's training sets.
- To study the effect of language direction and domain. We will do so by providing two datasets created in similar ways, but for different domains and language directions.
- To investigate the utility of detailed information logged during post-editing. We will do so by providing post-editing time, keystrokes, and actual edits.

This year's shared task provides new training and test datasets for all tasks, and allows participants to explore any additional data and resources deemed relevant. An in-house MT system was used to produce translations for all tasks. MT system-dependent information can be made available upon request. The data is publicly available, but since it has been provided by our industry partners it is subject to specific terms and conditions. However, these have no practical implications for the use of this data for research purposes.
Yarmouth County Water Quality Data
cioosatlantic.ca
dev.cioosatlantic.ca
Updated Jan 20, 2021
Cite
CMAR (2021). Yarmouth County Water Quality Data [Dataset]. https://cioosatlantic.ca/erddap/info/9qw2-yb2f/index.html
Dataset updated
Jan 20, 2021
Dataset provided by
Centre for Marine Applied Research (CMAR)
Authors
CMAR
Time period covered
Feb 19, 2016 - Aug 7, 2024
Variables measured
time, depth, lease, station, latitude, salinity, longitude, waterbody, sensor_type, temperature, and 11 more
Description
The Centre for Marine Applied Research (CMAR) provides high resolution ocean data from around the coast of Nova Scotia through their Coastal Monitoring Program. Through the Water Quality Branch of the program, CMAR collects temperature, dissolved oxygen, and salinity data using sensors deployed on stationary moorings. A typical mooring consists of a line anchored to the sea floor and suspended by a sub-surface buoy, with sensors attached at various depths. Alternatively, sensors may be attached to structures including buoys, docks, or aquaculture equipment. Sensors are deployed for several months, and data are measured every 1 minute to 1 hour. Station locations, summary reports, and data collection methods are available on the CMAR website (https://cmar.ca/coastal-monitoring-program/). Datasets and reports may be revised pending ongoing data collection and analyses.

Automated Quality Control tests were applied to the data to identify outlying and unexpected observations. The results of these tests are summarized in the “qc_flag” columns of the dataset. Observations flagged as “Pass” passed all tests, while observations flagged as “Fail” failed at least one test and should be excluded from most analyses. “Suspect/Of Interest” flags highlight unusual events or poor quality data, and “Not Evaluated” flags indicate at least one test was not applied to the observation. Flags should be used as a guide only, and users are responsible for evaluating the data quality prior to use. For technical details on the Quality Control tests, visit the CMAR Data Governance website (https://dempsey-cmar.github.io/cmp-data-governance/pages/cmp_about.html).

Other data quality considerations:
- Through calibration-validation procedures, CMAR has discovered that the VR2AR temperature sensors typically record 0.5 – 1 °C lower than other temperature sensors. This is not corrected for or flagged in the datasets but may be in the future.
- Sensor drift is not flagged in the datasets.
- The sensor_depth_at_low_tide_m is an estimate and should be compared to sensor_depth_measured_m when possible. Note the mooring can get “knocked down” by currents or sink from biofouling. Large discrepancies between the estimated depth and the minimum recorded depth are flagged in the column depth_crosscheck_flag.

The Coastal Monitoring Program Water Quality data is organized by county. These datasets are very large, typically exceeding the number of rows that can be viewed in Excel. CMAR recommends filtering the data to the waterbody, station, depth, quality control flag, and/or time period of interest before exporting. Take care when exporting data filtered on quality control columns, because the whole row will be filtered (i.e., all other variables measured at that timestamp will also be excluded).

If you have accessed any Coastal Monitoring Program data, CMAR would appreciate your feedback: https://forms.gle/AyD7Vi3BpKGe6ueYA. Please acknowledge the Centre for Marine Applied Research in any published material that uses this data. Contact info@cmar.ca for more information.
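
Because filtering on a quality control column drops the whole row, one alternative is to blank out only the flagged variable and keep the other measurements at that timestamp. The sketch below illustrates this on rows held as dictionaries. The flag column names follow the qc_flag_<variable> pattern listed in the dataset metadata; the function name and the sample values in the usage note are hypothetical.

```python
def mask_flagged(rows, variable, bad_flags=("Fail",)):
    """Return copies of rows with `variable` set to None where its QC flag is in `bad_flags`.

    Other variables in the same row are left untouched, unlike a row-level filter.
    """
    flag_column = f"qc_flag_{variable}"
    masked = []
    for row in rows:
        row = dict(row)  # copy so the caller's data is not modified
        if row.get(flag_column) in bad_flags:
            row[variable] = None
        masked.append(row)
    return masked
```

For example, a row with temperature flagged “Fail” and salinity flagged “Pass” keeps its salinity value after `mask_flagged(rows, "temperature")`. Passing `bad_flags=("Fail", "Suspect/Of Interest")` gives a stricter mask.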
cdm_data_type=TimeSeries cdm_timeseries_variables=waterbody,station,sensor_type,sensor_serial_number contributor_name=Centre for Marine Applied Research (CMAR) contributor_role=owner Conventions=COARDS, CF-1.6, ACDD-1.3 defaultDataQuery=&time>=min(time) Easternmost_Easting=-65.83446 featureType=TimeSeries geospatial_lat_max=43.93093 geospatial_lat_min=43.67901 geospatial_lat_units=degrees_north geospatial_lon_max=-65.83446 geospatial_lon_min=-66.17321 geospatial_lon_units=degrees_east geospatial_vertical_max=15.0 geospatial_vertical_min=1.0 geospatial_vertical_positive=down geospatial_vertical_units=m infoUrl=https://cmar.ca/coastal-monitoring-program/ institution=Centre for Marine Applied Research (CMAR) instrument=hobo-10194899,hobo-10226050,hobo-10777109,hobo-10194911,hobo-10194912,hobo-10777103,hobo-10034865,hobo-10194877,hobo-10778922,hobo-10034851,hobo-10755201,hobo-10755232,hobo-20291436,hobo-20291456,hobo-20291476,aquameasure-680251,hobo-20291444,vr2ar-547086,aquameasure-675009,hobo-10755242,vr2ar-547099,aquameasure-675286,vr2ar-547109,aquameasure-680324,aquameasure-670373,hobo-20495250,vr2ar-545777,aquameasure-680326,aquameasure-670383,aquameasure-671046,aquameasure-671044,aquameasure-670380,vr2ar-548039,hobo-20495248,hobo-20900985,hobo-21043067,hobo-21082791,aquameasure-670354,aquameasure-675011,aquameasure-680360,vr2ar-548038,vr2ar-549340,aquameasure-686013,aquameasure-671011,hobo-20308045,hobo-21152408,vr2ar-551263,hobo-20900974,vr2ar-551264,hobo-20900987,aquameasure-686255,aquameasure-671022,hobo-21083050,vr2ar-549342,hobo-21043083,aquameasure-686011,vr2ar-547115,aquameasure-670367,aquameasure-680325,vr2ar-551261,aquameasure-675014,aquameasure-686256,aquameasure-671331,hobo-20291446,vr2ar-548559,vr2ar-548597,aquameasure-671188,hobo-21152407,hobo-21650150,hobo-20330413,hobo-20804688,vr2ar-548586,aquameasure-671185,hobo-20291480,hobo-20820380,vr2ar-548563 Northernmost_Northing=43.93093 sourceUrl=(local files) Southernmost_Northing=43.67901 
standard_name_vocabulary=CF Standard Name Table v55 subsetVariables=waterbody, station, sensor_type, sensor_serial_number,lease,string_configuration,qc_flag_dissolved_oxygen,qc_flag_salinity,qc_flag_sensor_depth_measured,qc_flag_temperature,depth_crosscheck_flag time_coverage_end=2024-08-07T17:41:20Z time_coverage_start=2016-02-19T17:00:00Z Westernmost_Easting=-66.17321
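The caution in the description above about exporting data filtered on quality control columns can be illustrated with a minimal sketch. The rows, the flag value "Fail", and the variable column names (`temperature`, `dissolved_oxygen`) are assumptions made for illustration; only the `qc_flag_*` column names come from the metadata above. The sketch contrasts dropping whole rows with masking only the flagged variable.

```python
# Hypothetical rows: values and the "Pass"/"Fail" labels are invented for illustration.
rows = [
    {"station": "A", "temperature": 4.1, "qc_flag_temperature": "Pass",
     "dissolved_oxygen": 98.0, "qc_flag_dissolved_oxygen": "Fail"},
    {"station": "A", "temperature": 4.3, "qc_flag_temperature": "Pass",
     "dissolved_oxygen": 97.5, "qc_flag_dissolved_oxygen": "Pass"},
]


def drop_failed_rows(rows, flag_column):
    """Row filter: removes every variable measured at a flagged timestamp."""
    return [r for r in rows if r[flag_column] != "Fail"]


def mask_failed_values(rows, variable):
    """Value mask: blanks only the flagged variable and keeps the rest of the row."""
    flag_column = f"qc_flag_{variable}"
    masked = []
    for r in rows:
        copy = dict(r)
        if copy[flag_column] == "Fail":
            copy[variable] = None
        masked.append(copy)
    return masked


filtered = drop_failed_rows(rows, "qc_flag_dissolved_oxygen")
masked = mask_failed_values(rows, "dissolved_oxygen")
```

With the row filter, the valid temperature reading in the first row is lost along with the failed oxygen value; with the mask, it is kept.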
Data variable definition and description.
plos.figshare.com
xls
Updated Apr 23, 2024
Cite
Maryam Motamedi; Jessica Dawson; Na Li; Douglas G. Down; Nancy M. Heddle (2024). Data variable definition and description. [Dataset]. http://doi.org/10.1371/journal.pone.0297391.t001
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0297391.t001
Dataset updated
Apr 23, 2024
Dataset provided by
PLOS ONE
Authors
Maryam Motamedi; Jessica Dawson; Na Li; Douglas G. Down; Nancy M. Heddle
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Platelet products are expensive and have very short shelf lives. As usage rates for platelets are highly variable, the effective management of platelet demand and supply is very important yet challenging. The primary goal of this paper is to present an efficient forecasting model for platelet demand at Canadian Blood Services (CBS). To accomplish this goal, five different demand forecasting methods, ARIMA (Auto Regressive Integrated Moving Average), Prophet, lasso regression (least absolute shrinkage and selection operator), random forest, and LSTM (Long Short-Term Memory) networks, are utilized and evaluated via a rolling window method. We use a large clinical dataset for a centralized blood distribution centre for four hospitals in Hamilton, Ontario, spanning from 2010 to 2018 and consisting of daily platelet transfusions along with information such as the product specifications, the recipients’ characteristics, and the recipients’ laboratory test results. This study is the first to utilize different methods, from statistical time series models to data-driven regression and machine learning techniques, for platelet transfusion using clinical predictors and with different amounts of data. We find that the multivariable approaches have the highest accuracy in general; however, if sufficient data are available, a simpler time series approach appears to be sufficient. We also comment on the approach to choose predictors for the multivariable models.
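The rolling window evaluation mentioned in the description can be sketched as follows. This is a minimal illustration and not the authors' code: the demand values are invented, and a simple window-mean forecaster stands in for the five models named above.

```python
def rolling_window_mae(series, window, forecast):
    """Slide a fixed-size training window over the series and
    score one-step-ahead forecasts with mean absolute error."""
    errors = []
    for t in range(window, len(series)):
        train = series[t - window:t]      # most recent `window` observations
        prediction = forecast(train)      # fit and predict on this window only
        errors.append(abs(series[t] - prediction))
    return sum(errors) / len(errors)


def mean_forecast(train):
    """Placeholder model (an assumption): predict the window mean."""
    return sum(train) / len(train)


# Invented daily demand values, for illustration only.
daily_demand = [10, 12, 11, 13, 12, 14, 13, 15]
mae = rolling_window_mae(daily_demand, window=3, forecast=mean_forecast)
```

Any model that maps a training window to a one-step forecast could be passed in place of `mean_forecast`, which keeps the comparison across methods on identical windows.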
Retail Transactions Dataset
kaggle.com
Updated May 18, 2024
Cite
Prasad Patil (2024). Retail Transactions Dataset [Dataset]. https://www.kaggle.com/datasets/prasad22/retail-transactions-dataset
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
May 18, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Prasad Patil
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset was created to simulate a market basket dataset, providing insights into customer purchasing behavior and store operations. The dataset facilitates market basket analysis, customer segmentation, and other retail analytics tasks. Here's more information about the context and inspiration behind this dataset:

Context:

Retail businesses, from supermarkets to convenience stores, are constantly seeking ways to better understand their customers and improve their operations. Market basket analysis, a technique used in retail analytics, explores customer purchase patterns to uncover associations between products, identify trends, and optimize pricing and promotions. Customer segmentation allows businesses to tailor their offerings to specific groups, enhancing the customer experience.

Inspiration:

The inspiration for this dataset comes from the need for accessible and customizable market basket datasets. While real-world retail data is sensitive and often restricted, synthetic datasets offer a safe and versatile alternative. Researchers, data scientists, and analysts can use this dataset to develop and test algorithms, models, and analytical tools.

Dataset Information:

The columns provide information about the transactions, customers, products, and purchasing behavior, making the dataset suitable for various analyses, including market basket analysis and customer segmentation. Here's a brief explanation of each column in the Dataset:

Transaction_ID: A unique identifier for each transaction, represented as a 10-digit number. This column is used to uniquely identify each purchase.

Date: The date and time when the transaction occurred. It records the timestamp of each purchase.

Customer_Name: The name of the customer who made the purchase. It provides information about the customer's identity.

Product: A list of products purchased in the transaction. It includes the names of the products bought.

Total_Items: The total number of items purchased in the transaction. It represents the quantity of products bought.

Total_Cost: The total cost of the purchase, in currency. It represents the financial value of the transaction.

Payment_Method: The method used for payment in the transaction, such as credit card, debit card, cash, or mobile payment.

City: The city where the purchase took place. It indicates the location of the transaction.

Store_Type: The type of store where the purchase was made, such as a supermarket, convenience store, department store, etc.

Discount_Applied: A binary indicator (True/False) representing whether a discount was applied to the transaction.

Customer_Category: A category representing the customer's background or age group.

Season: The season in which the purchase occurred, such as spring, summer, fall, or winter.

Promotion: The type of promotion applied to the transaction, such as "None," "BOGO (Buy One Get One)," or "Discount on Selected Items."

Use Cases:

Market Basket Analysis: Discover associations between products and uncover buying patterns.

Customer Segmentation: Group customers based on purchasing behavior.

Pricing Optimization: Optimize pricing strategies and identify opportunities for discounts and promotions.

Retail Analytics: Analyze store performance and customer trends.

Note: This dataset is entirely synthetic and was generated using the Python Faker library, which means it doesn't contain real customer data. It's designed for educational and research purposes.
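As a minimal sketch of the market basket analysis use case listed above, the snippet below counts how often pairs of products appear together. The baskets are invented stand-ins for the `Product` column, and pair support is only one simple association measure, chosen here for brevity.

```python
from collections import Counter
from itertools import combinations

# Invented baskets standing in for the Product column (one list per transaction).
transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "eggs"],
]


def pair_support(transactions):
    """Return the fraction of transactions containing each product pair."""
    counts = Counter()
    for basket in transactions:
        # sorted(set(...)) removes repeats and gives each pair a canonical order
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    total = len(transactions)
    return {pair: count / total for pair, count in counts.items()}


support = pair_support(transactions)
```

Pairs with high support are candidates for further association-rule measures such as confidence or lift.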
Data from: Chemical Topic Modeling: Exploring Molecular Data Sets Using a...
acs.figshare.com
zip
Updated Jun 2, 2023
Cite
Nadine Schneider; Nikolas Fechner; Gregory A. Landrum; Nikolaus Stiefl (2023). Chemical Topic Modeling: Exploring Molecular Data Sets Using a Common Text-Mining Approach [Dataset]. http://doi.org/10.1021/acs.jcim.7b00249.s002
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.1021/acs.jcim.7b00249.s002
Dataset updated
Jun 2, 2023
Dataset provided by
ACS Publications
Authors
Nadine Schneider; Nikolas Fechner; Gregory A. Landrum; Nikolaus Stiefl
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Big data is one of the key transformative factors which increasingly influences all aspects of modern life. Although this transformation brings vast opportunities, it also generates novel challenges, not the least of which is organizing and searching this data deluge. The field of medicinal chemistry is no different: more and more data are being generated, for instance, by technologies such as DNA encoded libraries, peptide libraries, text mining of large literature corpora, and new in silico enumeration methods. Handling those huge sets of molecules effectively is quite challenging and requires compromises that often come at the expense of the interpretability of the results. In order to find an intuitive and meaningful approach to organizing large molecular data sets, we adopted a probabilistic framework called “topic modeling” from the text-mining field. Here we present the first chemistry-related implementation of this method, which allows large molecule sets to be assigned to “chemical topics” and the relationships between those topics to be investigated. In this first study, we thoroughly evaluate this novel method in different experiments and discuss both its disadvantages and advantages. We show very promising results in reproducing human-assigned concepts using the approach to identify and retrieve chemical series from sets of molecules. We have also created an intuitive visualization of the chemical topics output by the algorithm. This is a huge benefit compared to other unsupervised machine-learning methods, like clustering, which are commonly used to group sets of molecules. Finally, we applied the new method to the 1.6 million molecules of the ChEMBL22 data set to test its robustness and efficiency. In about 1 h we built a 100-topic model of this large data set in which we could identify interesting topics like “proteins”, “DNA”, or “steroids”.
Along with this publication we provide our data sets and an open-source implementation of the new method (CheTo) which will be part of an upcoming version of the open-source cheminformatics toolkit RDKit.
WMT17 Quality Estimation Shared Task Training and Development Data - Dataset...
b2find.eudat.eu
Updated Oct 28, 2023
Cite
(2023). WMT17 Quality Estimation Shared Task Training and Development Data - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/6684e44e-a5b0-522b-8fa2-10bdd37b8fc9
Explore at:
Dataset updated
Oct 28, 2023
Description
Training and development data for the WMT17 QE task. Test data will be published as a separate item. This shared task will build on its previous five editions to further examine automatic methods for estimating the quality of machine translation output at run-time, without relying on reference translations. We include word-level, phrase-level and sentence-level estimation. All tasks will make use of a large dataset produced from post-editions by professional translators. The data will be domain-specific (IT and Pharmaceutical domains) and substantially larger than in previous years. In addition to advancing the state of the art at all prediction levels, our goals include: (1) To test the effectiveness of larger (domain-specific and professionally annotated) datasets. We will do so by increasing the size of one of last year's training sets. (2) To study the effect of language direction and domain. We will do so by providing two datasets created in similar ways, but for different domains and language directions. (3) To investigate the utility of detailed information logged during post-editing. We will do so by providing post-editing time, keystrokes, and actual edits. This year's shared task provides new training and test datasets for all tasks, and allows participants to explore any additional data and resources deemed relevant. An in-house MT system was used to produce translations for all tasks. MT system-dependent information can be made available upon request. The data is publicly available, but since it has been provided by our industry partners it is subject to specific terms and conditions. However, these have no practical implications for the use of this data for research purposes.
Dataset: A Systematic Literature Review on the topic of High-value datasets
zenodo.org
data.niaid.nih.gov
bin, png, txt
Updated Jul 11, 2024
Cite
Anastasija Nikiforova; Nina Rizun; Magdalena Ciesielska; Charalampos Alexopoulos; Andrea Miletič (2024). Dataset: A Systematic Literature Review on the topic of High-value datasets [Dataset]. http://doi.org/10.5281/zenodo.8075918
Explore at:
Available download formats: png, bin, txt
Unique identifier
https://doi.org/10.5281/zenodo.8075918
Dataset updated
Jul 11, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Anastasija Nikiforova; Nina Rizun; Magdalena Ciesielska; Charalampos Alexopoulos; Andrea Miletič
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains data collected during a study ("Towards High-Value Datasets determination for data-driven development: a systematic literature review") conducted by Anastasija Nikiforova (University of Tartu), Nina Rizun, Magdalena Ciesielska (Gdańsk University of Technology), Charalampos Alexopoulos (University of the Aegean), and Andrea Miletič (University of Zagreb).
It is being made public both to act as supplementary data for the "Towards High-Value Datasets determination for data-driven development: a systematic literature review" paper (a pre-print is available in Open Access at https://arxiv.org/abs/2305.10234) and so that other researchers can use these data in their own work.

The protocol is intended for the systematic literature review (SLR) on the topic of high-value datasets, with the aim of gathering information on how the topic of high-value datasets (HVD) and their determination has been reflected in the literature over the years and what has been found by these studies to date, incl. the indicators used in them, involved stakeholders, data-related aspects, and frameworks. The data in this dataset were collected as a result of the SLR over Scopus, Web of Science, and Digital Government Research library (DGRL) in 2023.

***Methodology***

To understand how HVD determination has been reflected in the literature over the years and what has been found by these studies to date, all relevant literature covering this topic has been studied. To this end, the SLR was carried out by searching digital libraries covered by Scopus, Web of Science (WoS), and Digital Government Research library (DGRL).

These databases were queried for the keywords ("open data" OR "open government data") AND ("high-value data*" OR "high value data*"), which were applied to the article title, keywords, and abstract to limit the number of papers to those where these objects were primary research objects rather than merely mentioned in the body, e.g., as future work. After deduplication, 11 unique articles were found and were further checked for relevance. As a result, a total of 9 articles were further examined. Each study was independently examined by at least two authors.
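The search and deduplication step described above can be approximated locally with a short sketch. The records are invented, the regular expressions are only an approximation of the databases' wildcard syntax, and deduplicating on a normalized title is an assumption rather than the authors' documented procedure.

```python
import re

# Local approximation of:
# ("open data" OR "open government data") AND ("high-value data*" OR "high value data*")
GROUP_A = [r"open data", r"open government data"]
GROUP_B = [r"high-value data\w*", r"high value data\w*"]


def matches_query(record):
    """Apply the query to title, keywords, and abstract only (not the body)."""
    text = " ".join([record["title"], record["keywords"], record["abstract"]]).lower()
    has_a = any(re.search(pattern, text) for pattern in GROUP_A)
    has_b = any(re.search(pattern, text) for pattern in GROUP_B)
    return has_a and has_b


def deduplicate(records):
    """Keep the first record per normalized title (assumed deduplication key)."""
    seen, unique = set(), []
    for record in records:
        key = record["title"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Invented records, e.g. the same paper returned by two databases.
records = [
    {"title": "Identifying high-value datasets in open data portals",
     "keywords": "open data", "abstract": "A case study."},
    {"title": "Identifying High-Value Datasets in Open Data Portals",
     "keywords": "open data", "abstract": "A case study."},
    {"title": "Open data quality",
     "keywords": "open government data", "abstract": "We discuss metadata."},
]
hits = deduplicate([r for r in records if matches_query(r)])
```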

To attain the objective of our study, we developed the protocol, where the information on each selected study was collected in four categories: (1) descriptive information, (2) approach- and research design- related information, (3) quality-related information, (4) HVD determination-related information.

***Test procedure***
Each study was independently examined by at least two authors; after an in-depth examination of the full text of the article, the structured protocol was filled in for each study.
The structure of the survey is available in the supplementary files (see Protocol_HVD_SLR.odt, Protocol_HVD_SLR.docx).
The data collected for each study by two researchers were then synthesized into one final version by a third researcher.

***Description of the data in this data set***

Protocol_HVD_SLR provides the structure of the protocol
Spreadsheet #1 provides the filled protocol for relevant studies.
Spreadsheet #2 provides the list of results after the search over three indexing databases, i.e. before filtering out irrelevant studies

The information on each selected study was collected in four categories:
(1) descriptive information,
(2) approach- and research design- related information,
(3) quality-related information,
(4) HVD determination-related information

Descriptive information
1) Article number - a study number, corresponding to the study number assigned in an Excel worksheet
2) Complete reference - the complete source information to refer to the study
3) Year of publication - the year in which the study was published
4) Journal article / conference paper / book chapter - the type of the paper -{journal article, conference paper, book chapter}
5) DOI / Website- a link to the website where the study can be found
6) Number of citations - the number of citations of the article in Google Scholar, Scopus, Web of Science
7) Availability in OA - availability of an article in the Open Access
8) Keywords - keywords of the paper as indicated by the authors
9) Relevance for this study - what is the relevance level of the article for this study? {high / medium / low}

Approach- and research design-related information
10) Objective / RQ - the research objective / aim, established research questions
11) Research method (including unit of analysis) - the methods used to collect data, including the unit of analysis (country, organisation, specific unit that has been analysed, e.g., the number of use-cases, scope of the SLR etc.)
12) Contributions - the contributions of the study
13) Method - whether the study uses a qualitative, quantitative, or mixed methods approach?
14) Availability of the underlying research data - whether there is a reference to the publicly available underlying research data e.g., transcriptions of interviews, collected data, or explanation why these data are not shared?
15) Period under investigation - period (or moment) in which the study was conducted
16) Use of theory / theoretical concepts / approaches - does the study mention any theory / theoretical concepts / approaches? If any theory is mentioned, how is theory used in the study?

Quality- and relevance- related information
17) Quality concerns - whether there are any quality concerns (e.g., limited information about the research methods used)?
18) Primary research object - is the HVD a primary research object in the study? (primary - the paper is focused around the HVD determination, secondary - mentioned but not studied (e.g., as part of discussion, future work etc.))

HVD determination-related information
19) HVD definition and type of value - how is the HVD defined in the article and / or any other equivalent term?
20) HVD indicators - what are the indicators to identify HVD? How were they identified? (components & relationships, “input -> output")
21) A framework for HVD determination - is there a framework presented for HVD identification? What components does it consist of and what are the relationships between these components? (detailed description)
22) Stakeholders and their roles - what stakeholders or actors does HVD determination involve? What are their roles?
23) Data - what data do HVD cover?
24) Level (if relevant) - what is the level of the HVD determination covered in the article? (e.g., city, regional, national, international)

***Format of the file***
.xls, .csv (for the first spreadsheet only), .odt, .docx

***Licenses or restrictions***
CC-BY

For more info, see README.txt
Richmond County Water Quality Data (Données de qualité de l'eau du comté de Richmond)
catalogue.cioosatlantic.ca
catalogue.cioos.ca
erddap, html
Updated Feb 10, 2025
Cite
Centre for Marine Applied Research (CMAR) (2025). Données de qualité de l'eau du comté de Richmond [Dataset]. https://catalogue.cioosatlantic.ca/en/dataset/ca-cioos_01fd41f3-d775-379d-ae08-05f9d5e7b538
Explore at:
Available download formats: html, erddap
Dataset updated
Feb 10, 2025
Dataset provided by
CMAR
Authors
Centre for Marine Applied Research (CMAR)
Time period covered
Nov 26, 2015 - Present
Area covered

Variables measured
Oxygen, Subsurface Salinity, Subsurface Temperature
Description
The Centre for Marine Applied Research (CMAR) provides high-resolution ocean data from the coast of Nova Scotia through its Coastal Monitoring Program. Through the Water Quality branch of the program, CMAR collects temperature, dissolved oxygen, and salinity data using sensors deployed on stationary moorings. A typical mooring consists of a line anchored to the seafloor and suspended by a sub-surface buoy, with sensors attached at various depths. Alternatively, sensors may be attached to structures including buoys, wharves, or aquaculture gear. Sensors are deployed for several months, and data are measured every 1 minute to 1 hour. Station locations, summary reports, and data collection methods are available on the CMAR website (https://cmar.ca/coastal-monitoring-program/). Datasets and reports may be revised pending ongoing data collection and analyses. Automated quality control tests were applied to the data to identify outlying and unexpected observations. The results of these tests are summarized in the "qc_flag" columns of the dataset. Observations flagged as "Pass" passed all tests, while observations flagged as "Fail" failed at least one test and should be excluded from most analyses. "Suspect/Of Interest" flags highlight unusual events or poor-quality data, and "Not Evaluated" flags indicate that at least one test was not applied to the observation. Flags should be used only as a guide, and users are responsible for evaluating data quality before use. For more details on the quality control tests, visit the CMAR Data Governance website (https://dempsey-cmar.github.io/cmp-data-governance/pages/cmp_about.html). Other data quality considerations: - Through calibration validation procedures, CMAR discovered that VR2AR temperature sensors typically record 0.5 to 1 °C lower than other temperature sensors. This is not corrected or flagged in the datasets, but may be in the future. - Sensor drift is not flagged in the datasets. - The sensor_depth_at_low_tide_m is an estimate and should be compared to sensor_depth_measured_m when possible. Note the mooring can get "knocked down" by currents or sink from biofouling. Large discrepancies between the estimated depth and the minimum recorded depth are flagged in the column depth_crosscheck_flag. The Coastal Monitoring Program Water Quality data is organized by county. These datasets are very large, typically exceeding the number of rows that can be viewed in Excel. CMAR recommends filtering the data to the waterbody, station, depth, quality control flag, and/or time period of interest before exporting. Take care when exporting data filtered on quality control columns, because the whole row will be filtered (i.e., all other variables measured at that timestamp will also be excluded). If you have accessed any Coastal Monitoring Program data, CMAR would appreciate your feedback: https://forms.gle/AyD7Vi3BpKGe6ueYA. Please acknowledge the Centre for Marine Applied Research in any published material that uses this data. Contact info@cmar.ca for more information.
Multimodal WEDAR dataset for attention regulation behaviors, self-reported...
data.4tu.nl
zip
Updated May 9, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yoon Lee; Marcus Specht (2023). Multimodal WEDAR dataset for attention regulation behaviors, self-reported distractions, reaction time, and knowledge gain in e-reading [Dataset]. http://doi.org/10.4121/8f730aa3-ad04-4419-8a5b-325415d2294b.v1
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.4121/8f730aa3-ad04-4419-8a5b-325415d2294b.v1
Dataset updated
May 9, 2023
Dataset provided by
4TU.ResearchData
Authors
Yoon Lee; Marcus Specht
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Diverse learning theories have been constructed to understand learners' internal states through various tangible predictors. We focus on self-regulatory actions that are subconscious and habitual actions triggered by behavior agents' 'awareness' of their attention loss. We hypothesize that self-regulatory behaviors (i.e., attention regulation behaviors) also occur in e-reading as 'regulators' as found in other behavior models (Ekman, P., & Friesen, W. V., 1969). In this work, we try to define the types and frequencies of attention regulation behaviors in e-reading. We collected various cues that reflect learners' moment-to-moment and page-to-page cognitive states to understand the learners' attention in e-reading.
The text 'How to make the most of your day at Disneyland Resort Paris' has been implemented on a screen-based e-reader, which we developed in a pdf-reader format. An informative, entertaining text was adopted to capture learners' attentional shifts during knowledge acquisition. The text has 2685 words, distributed over ten pages, with one subtopic on each page. A built-in webcam on Mac Pro and a mouse have been used for the data collection, aiming for real-world implementation only with essential computational devices. A height-adjustable laptop stand has been used to compensate for participants' eye levels.
Thirty learners in higher education have been invited for a screen-based e-reading task (M=16.2, SD=5.2 minutes). A pre-test questionnaire with ten multiple-choice questions was given before the reading to check their prior knowledge level about the topic. There was no specific time limit to finish the questionnaire. We collected cues that reflect learners' moment-to-moment and page-to-page cognitive states to understand the learners' attention in e-reading. Learners were asked to report their distractions on two levels during the reading: 1) In-text distraction (e.g., still reading the text with low attentiveness) or 2) out-of-text distraction (e.g., thinking of something else while not reading the text anymore). We implemented two noticeably-designed buttons on the right-hand side of the screen interface to minimize possible distraction from the reporting task. After triggering a new page, we implemented blur stimuli on the text in the random range of 20 seconds. It ensures that the blur stimuli occur at least once on each page. Participants were asked to click the de-blur button on the text area of the screen to proceed with the reading. The button has been implemented in the whole text area, so participants can minimize the effort to find and click the button. Reaction time for de-blur has been measured, too, to grasp the arousal of learners during the reading. We asked participants to answer pre-test and post-test questionnaires about the reading material. Participants were given ten multiple-choice questions before the session, while the same set of questions was given after the reading session (i.e., formative questions) with added subtopic summarization questions (i.e., summative questions). It can provide insights into the quantitative and qualitative knowledge gained through the session and different learning outcomes based on individual differences. 
A video dataset of 931,440 frames has been annotated with the attention regulator behaviors using an annotation tool that plays the long sequence clip by clip, which contains 30 frames. Two annotators (doctoral students) have done two stages of labeling. In the first stage, the annotators were trained on the labeling criteria and annotated the attention regulator behaviors separately based on their judgments. The labels were summarized and cross-checked in the second round to address the inconsistent cases, resulting in five attention regulation behaviors and one neutral state. See WEDAR_readme.csv for detailed descriptions of features.
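The clip-by-clip annotation described above implies a simple partition of the frame sequence. The sketch below works through that arithmetic using the two figures stated in the description (931,440 frames, 30 frames per clip); the function name and the half-open index convention are illustrative assumptions, not the annotation tool's actual interface.

```python
TOTAL_FRAMES = 931_440   # stated in the description
FRAMES_PER_CLIP = 30     # stated in the description


def clip_ranges(total_frames, frames_per_clip):
    """Yield half-open (start, end) frame index ranges, one per clip."""
    for start in range(0, total_frames, frames_per_clip):
        yield start, min(start + frames_per_clip, total_frames)


# Number of full clips and any leftover frames that do not fill a clip.
n_clips, leftover = divmod(TOTAL_FRAMES, FRAMES_PER_CLIP)
first_clip = next(clip_ranges(TOTAL_FRAMES, FRAMES_PER_CLIP))
```

Under these figures the sequence divides evenly, so each annotation label would correspond to exactly one 30-frame clip.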
The dataset has been uploaded in two forms: 1) raw data, in the form in which we collected it, and 2) preprocessed data, from which we extracted useful features for further learning analytics based on real-time and post-hoc data.
Reference
Ekman, P., & Friesen, W. V. (1969). The repertoire of nonverbal behavior: Categories, origins, usage, and coding. Semiotica, 1(1), 49-98.
Fecal occult blood test (FOBT) obtained in past 2 years or, colonoscopy or...
www150.statcan.gc.ca
datasets.ai
+2more
Updated Jun 25, 2009
Cite
Government of Canada, Statistics Canada (2009). Fecal occult blood test (FOBT) obtained in past 2 years or, colonoscopy or sigmoidoscopy obtained in last 5 years [Dataset]. http://doi.org/10.25318/1310045901-eng
Explore at:
Unique identifier
https://doi.org/10.25318/1310045901-eng
Dataset updated
Jun 25, 2009
Dataset provided by
Government of Canada, Statistics Canada
Area covered
Canada
Description
Fecal occult blood test (FOBT) obtained in past 2 years or, colonoscopy or sigmoidoscopy obtained in last 5 years, by age group and sex, aged 50 or older, Canada, provinces, territories, health regions (2007 boundaries) and peer groups.
Model performance with different training window sizes and retraining...
plos.figshare.com
xls
Updated Apr 23, 2024
Cite
Maryam Motamedi; Jessica Dawson; Na Li; Douglas G. Down; Nancy M. Heddle (2024). Model performance with different training window sizes and retraining periods. [Dataset]. http://doi.org/10.1371/journal.pone.0297391.t003
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0297391.t003
Dataset updated
Apr 23, 2024
Dataset provided by
PLOS ONE
Authors
Maryam Motamedi; Jessica Dawson; Na Li; Douglas G. Down; Nancy M. Heddle
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Model performance with different training window sizes and retraining periods.
Open Famous People Faces
kaggle.com
Updated May 23, 2024
Cite
Yves Romero (2024). Open Famous People Faces [Dataset]. http://doi.org/10.34740/kaggle/dsv/8500944
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Unique identifier
https://doi.org/10.34740/kaggle/dsv/8500944
Dataset updated
May 23, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Yves Romero
License
http://www.gnu.org/licenses/lgpl-3.0.html
Description
This dataset was created to compare methods for face reidentification, that is, given an image and a name of a person, check if that image belongs to that person. But it also can be used to test face recognition algorithms, since the dataset has been categorized.

The authors have made a great effort to collect as many images as they could for all classes inside the dataset. Faces were aligned using eye position alignment and then cropped using landmarks to find the region of interest.

The Open Famous People Faces dataset contains 258 classes with at least 5 images per class. Images have different sizes, some are low quality and small sized images, others are high quality and big sized images. We have images from the same person at different ages.
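The reidentification task described above (given an image and a name, check whether the image belongs to that person) can be sketched as a similarity check. Everything here is an illustrative assumption: the three-dimensional vectors stand in for embeddings from some face model, and cosine similarity with a 0.8 threshold is one common choice, not a method the dataset prescribes.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical gallery: name -> reference embeddings (invented values).
gallery = {
    "person_a": [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    "person_b": [[0.0, 1.0, 0.0]],
}


def verify(embedding, name, gallery, threshold=0.8):
    """Accept the claimed name if any reference for it is similar enough."""
    references = gallery.get(name, [])
    return any(cosine_similarity(embedding, ref) >= threshold for ref in references)


query = [0.95, 0.05, 0.0]
accepted = verify(query, "person_a", gallery)
rejected = verify(query, "person_b", gallery)
```

Because the dataset is organized into classes by person, the same gallery structure could also serve a recognition test by checking the query against every name.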


