This is a test collection for passage and document retrieval, produced in the TREC 2023 Deep Learning track. The Deep Learning Track studies information retrieval in a large training data regime. This is the case where the number of training queries with at least one positive label is at least in the tens of thousands, if not hundreds of thousands or more. This corresponds to real-world scenarios such as training based on click logs and training based on labels from shallow pools (such as the pooling in the TREC Million Query Track or the evaluation of search engines based on early precision).
Certain machine learning based methods, such as methods based on deep learning, are known to require very large datasets for training. The lack of such large-scale datasets has been a limitation for developing such methods for common information retrieval tasks, such as document ranking. The Deep Learning Track organized in previous years aimed at providing large-scale datasets to TREC and creating a focused research effort with a rigorous blind evaluation of rankers for the passage ranking and document ranking tasks.
Similar to previous years, one of the main goals of the track in 2022 is to study what methods work best when a large amount of training data is available. For example, do the same methods that work on small data also work on large data? How much do methods improve when given more training data? What external data and models can be brought to bear in this scenario, and how useful is it to combine full supervision with other forms of supervision?
The collection contains 12 million web pages, 138 million passages from those web pages, search queries, and relevance judgments for the queries.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Archived as of 11/15/2023: With the end of the federal emergency and reporting requirements continuing to evolve, the Indiana Department of Health will no longer publish and refresh the COVID-19 datasets after November 15, 2023 - one final dataset publication will continue to be available as an archival copy. Number of COVID-19 cases, tests, and deaths by report date, by county. New positive cases, deaths and tests have occurred over a range of dates but were reported to ISDH in the last 24 hours. All data displayed is preliminary and subject to change as more information is reported to ISDH. Tests are displayed by the date the test was performed and deaths are displayed by the date the death occurred. Expect historical data to change as data is reported to ISDH.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
InductiveQE datasets
UPD 2.0: Regenerated datasets free of potential test set leakages
UPD 1.1: Added train_answers_val.pkl files to all freebase-derived datasets - answers of training queries on larger validation graphs
This repository contains 10 inductive complex query answering datasets published in "Inductive Logical Query Answering in Knowledge Graphs" (NeurIPS 2022). 9 datasets (106-550) were created from FB15k-237, the wikikg dataset was created from OGB WikiKG 2 graph. In the datasets, all inference graphs extend training graphs and include new nodes and edges. Dataset numbers indicate a relative size of the inference graph compared to the training graph, e.g., in 175, the number of nodes in the inference graph is 175% compared to the number of nodes in the training graph. The higher the ratio, the more new unseen nodes appear at inference time, the more complex the task is. The Wikikg split has a fixed 133% ratio.
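The naming convention can be illustrated with a short calculation. This is only a sketch: the node counts below are invented for the example and are not statistics from the paper, and the function name is hypothetical.

```python
def inference_ratio_percent(num_train_nodes: int, num_inference_nodes: int) -> int:
    """Size of the inference graph relative to the training graph, in percent."""
    return round(100 * num_inference_nodes / num_train_nodes)


# Hypothetical node counts, chosen only to illustrate the "175" naming.
train_nodes = 4_000
inference_nodes = 7_000

ratio = inference_ratio_percent(train_nodes, inference_nodes)
# Nodes that appear only at inference time (the inference graph extends the training graph).
new_nodes = inference_nodes - train_nodes
print(ratio, new_nodes)
```

With these made-up counts the dataset would be labelled 175, and 3,000 of its inference-time nodes would be unseen during training.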
Each dataset is a zip archive containing 17 files:
Overall unzipped size of all datasets combined is about 10 GB. Please refer to the paper for the sizes of graphs and the number of queries per graph.
The Wikikg dataset is intended to be evaluated in the inference-only regime, with models pre-trained solely on simple link prediction, because the number of complex training queries is not enough for such a large dataset.
Paper pre-print: https://arxiv.org/abs/2210.08008
The full source code of training/inference models is available at https://github.com/DeepGraphLearning/InductiveQE
https://creativecommons.org/publicdomain/zero/1.0/
Airline data holds immense importance as it offers insights into the functioning and efficiency of the aviation industry. It provides valuable information about flight routes, schedules, passenger demographics, and preferences, which airlines can leverage to optimize their operations and enhance customer experiences. By analyzing data on delays, cancellations, and on-time performance, airlines can identify trends and implement strategies to improve punctuality and mitigate disruptions. Moreover, regulatory bodies and policymakers rely on this data to ensure safety standards, enforce regulations, and make informed decisions regarding aviation policies. Researchers and analysts use airline data to study market trends, assess environmental impacts, and develop strategies for sustainable growth within the industry. In essence, airline data serves as a foundation for informed decision-making, operational efficiency, and the overall advancement of the aviation sector.
This dataset comprises diverse parameters relating to airline operations on a global scale. The dataset prominently incorporates fields such as Passenger ID, First Name, Last Name, Gender, Age, Nationality, Airport Name, Airport Country Code, Country Name, Airport Continent, Continents, Departure Date, Arrival Airport, Pilot Name, and Flight Status. These columns collectively provide comprehensive insights into passenger demographics, travel details, flight routes, crew information, and flight statuses. Researchers and industry experts can leverage this dataset to analyze trends in passenger behavior, optimize travel experiences, evaluate pilot performance, and enhance overall flight operations.
The dataset provided here is a simulated example and was generated using the online platform Mockaroo. This web-based tool enables the creation of customizable synthetic datasets that closely resemble real data. It is primarily intended for developers, testers, and data experts who require sample data for a range of uses, including testing databases, filling applications with demonstration data, and crafting lifelike illustrations for presentations and tutorials. To explore further details, you can visit their website.
Cover Photo by: Kevin Woblick on Unsplash
Thumbnail by: Airplane icons created by Freepik - Flaticon
DPH note about change from 7-day to 14-day metrics: As of 10/15/2020, this dataset is no longer being updated. Starting on 10/15/2020, these metrics will be calculated using a 14-day average rather than a 7-day average. The new dataset using 14-day averages can be accessed here: https://data.ct.gov/Health-and-Human-Services/COVID-19-case-rate-per-100-000-population-and-perc/hree-nys2 As you know, we are learning more about COVID-19 all the time, including the best ways to measure COVID-19 activity in our communities. CT DPH has decided to shift to 14-day rates because these are more stable, particularly at the town level, as compared to 7-day rates. In addition, since the school indicators were initially published by DPH last summer, CDC has recommended 14-day rates and other states (e.g., Massachusetts) have started to implement 14-day metrics for monitoring COVID transmission as well. With respect to geography, we also have learned that many people are looking at the town-level data to inform decision making, despite emphasis on the county-level metrics in the published addenda. This is understandable as there has been variation within counties in COVID-19 activity (for example, rates that are higher in one town than in most other towns in the county). This dataset includes a weekly count and weekly rate per 100,000 population for COVID-19 cases, a weekly count of COVID-19 PCR diagnostic tests, and a weekly percent positivity rate for tests among people living in community settings. Dates are based on date of specimen collection (cases and positivity). A person is considered a new case only upon their first COVID-19 testing result because a case is defined as an instance or bout of illness. If they are tested again subsequently and are still positive, it still counts toward the test positivity metric but they are not considered another case. 
These case and test counts do not include cases or tests among people residing in congregate settings, such as nursing homes, assisted living facilities, or correctional facilities. These data are updated weekly; the previous week period for each dataset is the previous Sunday-Saturday, known as an MMWR week (https://wwwn.cdc.gov/nndss/document/MMWR_week_overview.pdf). The date listed is the date the dataset was last updated and corresponds to a reporting period of the previous MMWR week. For instance, the data for 8/20/2020 corresponds to a reporting period of 8/9/2020-8/15/2020. Notes: 9/25/2020: Data for Mansfield and Middletown for the week of Sept 13-19 were unavailable at the time of reporting due to delays in lab reporting.
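The reporting-period rule and the weekly metrics described above can be sketched as follows. This is an illustrative reading of the text, not DPH's own code: the function names are hypothetical, and the positivity formula (positive tests divided by all tests) is an assumption based on the description.

```python
from datetime import date, timedelta
from typing import Tuple


def previous_mmwr_week(update_date: date) -> Tuple[date, date]:
    """Return (Sunday, Saturday) of the last full week before the week of update_date."""
    # date.weekday() uses Monday=0 ... Sunday=6; shift so that Sunday=0.
    days_since_sunday = (update_date.weekday() + 1) % 7
    current_week_start = update_date - timedelta(days=days_since_sunday)
    return current_week_start - timedelta(days=7), current_week_start - timedelta(days=1)


def rate_per_100k(cases: int, population: int) -> float:
    """Weekly case rate per 100,000 population."""
    return cases / population * 100_000


def percent_positivity(positive_tests: int, total_tests: int) -> float:
    """Assumed definition: share of PCR tests that were positive, in percent."""
    return 100 * positive_tests / total_tests


start, end = previous_mmwr_week(date(2020, 8, 20))
print(start, end)  # 2020-08-09 2020-08-15, matching the example in the text
```

Note that a repeat positive test for the same person would add to `positive_tests` but not to `cases`, in line with the case definition given above.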
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction
This archive contains the ApacheJIT dataset presented in the paper "ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction" as well as the replication package. The paper is submitted to MSR 2022 Data Showcase Track.
The datasets are available under directory dataset. There are 4 datasets in this directory.
In addition to the dataset, we also provide the scripts we used to build it. These scripts are written in Python 3.8, so Python 3.8 or above is required. To set up the environment, we have provided a list of required packages in the file requirements.txt. Additionally, one filtering step requires GumTree [1]. For Java, GumTree requires Java 11. For other languages, external tools are needed. Installation guide and more details can be found here.
The scripts consist of Python scripts under directory src and Python notebooks under directory notebooks. The Python scripts are mainly responsible for conducting GitHub search via the GitHub search API and collecting commits through the PyDriller package [2]. The notebooks link the fixed issue reports with their corresponding fixing commits and apply some filtering steps. The bug-inducing candidates are then filtered again using the gumtree.py script, which utilizes the GumTree package. Finally, the remaining bug-inducing candidates are combined with the clean commits in the dataset_construction notebook to form the entire dataset.
More specifically, git_token.py handles GitHub API token that is necessary for requests to GitHub API. Script collector.py performs GitHub search. Tracing changed lines and git annotate is done in gitminer.py using PyDriller. Finally, gumtree.py applies 4 filtering steps (number of lines, number of files, language, and change significance).
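The four filtering criteria named above can be pictured with a small stand-alone sketch. It does not reproduce gumtree.py: the record type, the thresholds, and the allowed-language set are all hypothetical, and the change-significance check (which the real pipeline delegates to GumTree) is represented by a precomputed boolean.

```python
from dataclasses import dataclass


@dataclass
class CandidateCommit:
    """Hypothetical record for one bug-inducing candidate."""
    sha: str
    lines_changed: int
    files_changed: int
    language: str
    significant_change: bool  # stand-in for the result of a GumTree-based check


# Assumed values for illustration only; the actual limits are not stated here.
MAX_LINES = 1000
MAX_FILES = 10
ALLOWED_LANGUAGES = {"Java"}


def passes_filters(commit: CandidateCommit) -> bool:
    """Apply the four criteria: lines, files, language, change significance."""
    return (
        commit.lines_changed <= MAX_LINES
        and commit.files_changed <= MAX_FILES
        and commit.language in ALLOWED_LANGUAGES
        and commit.significant_change
    )


candidates = [
    CandidateCommit("a1", 40, 2, "Java", True),
    CandidateCommit("b2", 5000, 3, "Java", True),    # too many lines
    CandidateCommit("c3", 12, 1, "Python", True),    # language not allowed
    CandidateCommit("d4", 8, 1, "Java", False),      # change not significant
]
kept = [c.sha for c in candidates if passes_filters(c)]
print(kept)
```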
References:
[1] Jean-Rémy Falleri, Floréal Morandat, Xavier Blanc, Matias Martinez, and Martin Monperrus. 2014. Fine-grained and accurate source code differencing. In ACM/IEEE International Conference on Automated Software Engineering, ASE ’14, Vasteras, Sweden - September 15 - 19, 2014. 313–324
[2] Davide Spadini, Maurício Aniche, and Alberto Bacchelli. 2018. PyDriller: Python Framework for Mining Software Repositories. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Lake Buena Vista, FL, USA) (ESEC/FSE 2018). Association for Computing Machinery, New York, NY, USA, 908–911
Standardized Multi-Channel Dataset for Glaucoma (SMDG-19) is a collection and standardization of 19 public datasets, comprising full-fundus glaucoma images, associated image metadata such as optic disc segmentation, optic cup segmentation, and blood vessel segmentation, and any provided per-instance text metadata such as sex and age. This dataset is designed to be exploratory and open-ended, with multiple use cases and no established training/validation/test splits. This dataset is the largest public repository of fundus images with glaucoma.
Please cite at least the first work in academic publications:
1. Kiefer, Riley, et al. "A Catalog of Public Glaucoma Datasets for Machine Learning Applications: A detailed description and analysis of public glaucoma datasets available to machine learning engineers tackling glaucoma-related problems using retinal fundus images and OCT images." Proceedings of the 2023 7th International Conference on Information System and Data Mining. 2023.
2. R. Kiefer, M. Abid, M. R. Ardali, J. Steen and E. Amjadian, "Automated Fundus Image Standardization Using a Dynamic Global Foreground Threshold Algorithm," 2023 8th International Conference on Image, Vision and Computing (ICIVC), Dalian, China, 2023, pp. 460-465, doi: 10.1109/ICIVC58118.2023.10270429.
3. R. Kiefer, J. Steen, M. Abid, M. R. Ardali and E. Amjadian, "A Survey of Glaucoma Detection Algorithms using Fundus and OCT Images," 2022 IEEE 13th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), Vancouver, BC, Canada, 2022, pp. 0191-0196, doi: 10.1109/IEMCON56893.2022.9946629.
Please also see the following optometry abstract publications:
1. A Comprehensive Survey of Publicly Available Glaucoma Datasets for Automated Glaucoma Detection; AAO 2022; https://aaopt.org/past-meeting-abstract-archives/?SortBy=ArticleYear&ArticleType=&ArticleYear=2022&Title=&Abstract=&Authors=&Affiliation=&PROGRAMNUMBER=225129
2. Standardized and Open-Access Glaucoma Dataset for Artificial Intelligence Applications; ARVO 2023; https://iovs.arvojournals.org/article.aspx?articleid=2790420
3. Ground truth validation of publicly available datasets utilized in artificial intelligence models for glaucoma detection; ARVO 2023; https://iovs.arvojournals.org/article.aspx?articleid=2791017
Please also see the DOI citations for this and related datasets:
1. SMDG: @dataset{smdg, title={SMDG, A Standardized Fundus Glaucoma Dataset}, url={https://www.kaggle.com/ds/2329670}, DOI={10.34740/KAGGLE/DS/2329670}, publisher={Kaggle}, author={Riley Kiefer}, year={2023} }
2. EyePACS-light-v1: @dataset{eyepacs-light-v1, title={Glaucoma Dataset: EyePACS AIROGS - Light}, url={https://www.kaggle.com/ds/3222646}, DOI={10.34740/KAGGLE/DS/3222646}, publisher={Kaggle}, author={Riley Kiefer}, year={2023} }
3. EyePACS-light-v2: @dataset{eyepacs-light-v2, title={Glaucoma Dataset: EyePACS-AIROGS-light-V2}, url={https://www.kaggle.com/dsv/7300206}, DOI={10.34740/KAGGLE/DSV/7300206}, publisher={Kaggle}, author={Riley Kiefer}, year={2023} }
The objective of this dataset is to provide a machine learning-ready dataset for glaucoma-related applications. With the help of the community, new open-source glaucoma datasets will be reviewed for standardization and inclusion in this dataset.
Dataset Instance | Original Fundus | Standardized Fundus Image |
---|---|---|
sjchoi86-HRF | ||
BEH | <img src="htt... |
The dataset contains 400 human action classes, with at least 400 video clips for each action. Each clip lasts around 10s and is taken from a different YouTube video. The actions are human focussed and cover a broad range of classes including human-object interactions such as playing instruments, as well as human-human interactions such as shaking hands.
The Kinetics dataset is licensed by Google Inc. under a Creative Commons Attribution 4.0 International License. Published May 22, 2017.
Test data for the WMT17 QE task. Train data can be downloaded from http://hdl.handle.net/11372/LRT-1974
This shared task will build on its previous five editions to further examine automatic methods for estimating the quality of machine translation output at run-time, without relying on reference translations. We include word-level, phrase-level and sentence-level estimation. All tasks will make use of a large dataset produced from post-editions by professional translators. The data will be domain-specific (IT and Pharmaceutical domains) and substantially larger than in previous years. In addition to advancing the state of the art at all prediction levels, our goals include:
- To test the effectiveness of larger (domain-specific and professionally annotated) datasets. We will do so by increasing the size of one of last year's training sets.
- To study the effect of language direction and domain. We will do so by providing two datasets created in similar ways, but for different domains and language directions.
- To investigate the utility of detailed information logged during post-editing. We will do so by providing post-editing time, keystrokes, and actual edits.
This year's shared task provides new training and test datasets for all tasks, and allows participants to explore any additional data and resources deemed relevant. An in-house MT system was used to produce translations for all tasks. MT system-dependent information can be made available upon request. The data is publicly available, but since it has been provided by our industry partners it is subject to specific terms and conditions. However, these have no practical implications for the use of this data for research purposes.
The Centre for Marine Applied Research (CMAR) provides high resolution ocean data from around the coast of Nova Scotia through their Coastal Monitoring Program. Through the Water Quality Branch of the program, CMAR collects temperature, dissolved oxygen, and salinity data using sensors deployed on stationary moorings. A typical mooring consists of a line anchored to the sea floor and suspended by a sub-surface buoy, with sensors attached at various depths. Alternatively, sensors may be attached to structures including buoys, docks, or aquaculture equipment. Sensors are deployed for several months, and data are measured every 1 minute to 1 hour. Station locations, summary reports, and data collection methods are available on the CMAR website (https://cmar.ca/coastal-monitoring-program/). Datasets and reports may be revised pending ongoing data collection and analyses. Automated Quality Control tests were applied to the data to identify outlying and unexpected observations. The results of these tests are summarized in the “qc_flag” columns of the dataset. Observations flagged as “Pass” passed all tests, while observations flagged as “Fail” failed at least one test and should be excluded from most analyses. “Suspect/Of Interest” flags highlight unusual events or poor quality data, and “Not Evaluated” flags indicate at least one test was not applied to the observation. Flags should be used as a guide only, and users are responsible for evaluating the data quality prior to use. For technical details on the Quality Control tests, visit the CMAR Data Governance website (https://dempsey-cmar.github.io/cmp-data-governance/pages/cmp_about.html). Other data quality considerations: - Through calibration-validation procedures, CMAR has discovered that the VR2AR temperature sensors typically record 0.5 – 1 °C lower than other temperature sensors. This is not corrected for or flagged in the datasets but may be in the future. - Sensor drift is not flagged in the datasets. 
- The sensor_depth_at_low_tide_m is an estimate and should be compared to sensor_depth_measured_m when possible. Note the mooring can get “knocked down” by currents or sink from biofouling. Large discrepancies between the estimated depth and the minimum recorded depth are flagged in the column depth_crosscheck_flag. The Coastal Monitoring Program Water Quality data is organized by county. These datasets are very large, typically exceeding the number of rows that can be viewed in Excel. CMAR recommends filtering the data to the waterbody, station, depth, quality control flag, and/or time period of interest before exporting. Take care when exporting data filtered on quality control columns, because the whole row will be filtered (i.e., all other variables measured at that timestamp will also be excluded). If you have accessed any Coastal Monitoring Program data, CMAR would appreciate your feedback: https://forms.gle/AyD7Vi3BpKGe6ueYA. Please acknowledge the Centre for Marine Applied Research in any published material that uses this data. Contact info@cmar.ca for more information. 
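One way to avoid losing a whole row when only one variable failed Quality Control is to blank out just the failed value. The sketch below assumes a CSV export with the value columns `temperature` and `salinity` and a `timestamp` column (these names and the sample values are assumptions for illustration); the `qc_flag_temperature` and `qc_flag_salinity` column names follow the dataset metadata.

```python
import csv
import io

# Tiny made-up sample standing in for an exported file.
SAMPLE = """timestamp,temperature,qc_flag_temperature,salinity,qc_flag_salinity
2020-01-01T00:00:00Z,4.1,Pass,30.2,Fail
2020-01-01T01:00:00Z,4.0,Fail,30.1,Pass
2020-01-01T02:00:00Z,3.9,Pass,30.3,Pass
"""


def mask_failed(rows, variables):
    """Set a variable to None where its own qc flag is "Fail"; keep the rest of the row."""
    cleaned = []
    for row in rows:
        row = dict(row)  # copy so the input rows are left untouched
        for var in variables:
            if row[f"qc_flag_{var}"] == "Fail":
                row[var] = None
        cleaned.append(row)
    return cleaned


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
cleaned = mask_failed(rows, ["temperature", "salinity"])

# Temperature values that did not fail QC, still as strings from the CSV.
temps = [r["temperature"] for r in cleaned if r["temperature"] is not None]
print(temps)
```

In this sample the first row keeps its temperature even though its salinity failed, which a row-level filter on `qc_flag_salinity` would have discarded.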
cdm_data_type=TimeSeries cdm_timeseries_variables=waterbody,station,sensor_type,sensor_serial_number contributor_name=Centre for Marine Applied Research (CMAR) contributor_role=owner Conventions=COARDS, CF-1.6, ACDD-1.3 defaultDataQuery=&time>=min(time) Easternmost_Easting=-65.83446 featureType=TimeSeries geospatial_lat_max=43.93093 geospatial_lat_min=43.67901 geospatial_lat_units=degrees_north geospatial_lon_max=-65.83446 geospatial_lon_min=-66.17321 geospatial_lon_units=degrees_east geospatial_vertical_max=15.0 geospatial_vertical_min=1.0 geospatial_vertical_positive=down geospatial_vertical_units=m infoUrl=https://cmar.ca/coastal-monitoring-program/ institution=Centre for Marine Applied Research (CMAR) instrument=hobo-10194899,hobo-10226050,hobo-10777109,hobo-10194911,hobo-10194912,hobo-10777103,hobo-10034865,hobo-10194877,hobo-10778922,hobo-10034851,hobo-10755201,hobo-10755232,hobo-20291436,hobo-20291456,hobo-20291476,aquameasure-680251,hobo-20291444,vr2ar-547086,aquameasure-675009,hobo-10755242,vr2ar-547099,aquameasure-675286,vr2ar-547109,aquameasure-680324,aquameasure-670373,hobo-20495250,vr2ar-545777,aquameasure-680326,aquameasure-670383,aquameasure-671046,aquameasure-671044,aquameasure-670380,vr2ar-548039,hobo-20495248,hobo-20900985,hobo-21043067,hobo-21082791,aquameasure-670354,aquameasure-675011,aquameasure-680360,vr2ar-548038,vr2ar-549340,aquameasure-686013,aquameasure-671011,hobo-20308045,hobo-21152408,vr2ar-551263,hobo-20900974,vr2ar-551264,hobo-20900987,aquameasure-686255,aquameasure-671022,hobo-21083050,vr2ar-549342,hobo-21043083,aquameasure-686011,vr2ar-547115,aquameasure-670367,aquameasure-680325,vr2ar-551261,aquameasure-675014,aquameasure-686256,aquameasure-671331,hobo-20291446,vr2ar-548559,vr2ar-548597,aquameasure-671188,hobo-21152407,hobo-21650150,hobo-20330413,hobo-20804688,vr2ar-548586,aquameasure-671185,hobo-20291480,hobo-20820380,vr2ar-548563 Northernmost_Northing=43.93093 sourceUrl=(local files) Southernmost_Northing=43.67901 
standard_name_vocabulary=CF Standard Name Table v55 subsetVariables=waterbody, station, sensor_type, sensor_serial_number,lease,string_configuration,qc_flag_dissolved_oxygen,qc_flag_salinity,qc_flag_sensor_depth_measured,qc_flag_temperature,depth_crosscheck_flag time_coverage_end=2024-08-07T17:41:20Z time_coverage_start=2016-02-19T17:00:00Z Westernmost_Easting=-66.17321
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Platelet products are both expensive and have very short shelf lives. As usage rates for platelets are highly variable, the effective management of platelet demand and supply is very important yet challenging. The primary goal of this paper is to present an efficient forecasting model for platelet demand at Canadian Blood Services (CBS). To accomplish this goal, five different demand forecasting methods, ARIMA (Auto Regressive Integrated Moving Average), Prophet, lasso regression (least absolute shrinkage and selection operator), random forest, and LSTM (Long Short-Term Memory) networks, are utilized and evaluated via a rolling window method. We use a large clinical dataset for a centralized blood distribution centre for four hospitals in Hamilton, Ontario, spanning from 2010 to 2018 and consisting of daily platelet transfusions along with information such as the product specifications, the recipients’ characteristics, and the recipients’ laboratory test results. This study is the first to utilize different methods, from statistical time series models to data-driven regression and machine learning techniques, for platelet transfusion using clinical predictors and with different amounts of data. We find that the multivariable approaches have the highest accuracy in general; however, if sufficient data are available, a simpler time series approach appears to be sufficient. We also comment on the approach to choosing predictors for the multivariable models.
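As a generic illustration of rolling window evaluation (not the paper's actual setup), the sketch below refits a forecaster on the most recent `window` observations and scores its one-step-ahead prediction, sliding forward one day at a time. The two baseline forecasters and the demand values are invented for the example; the study's models (ARIMA, Prophet, lasso, random forest, LSTM) would take their place.

```python
def rolling_window_mae(series, window, forecast):
    """Mean absolute error of one-step-ahead forecasts over a sliding window."""
    errors = []
    for end in range(window, len(series)):
        history = series[end - window:end]  # only the most recent `window` points
        errors.append(abs(forecast(history) - series[end]))
    return sum(errors) / len(errors)


def mean_forecast(history):
    """Baseline: predict the average of the window."""
    return sum(history) / len(history)


def last_value_forecast(history):
    """Baseline: predict the most recent observation."""
    return history[-1]


# Made-up daily demand counts, for illustration only.
demand = [10, 12, 9, 11, 13, 12, 10, 14]

mae_mean = rolling_window_mae(demand, window=3, forecast=mean_forecast)
mae_last = rolling_window_mae(demand, window=3, forecast=last_value_forecast)
print(round(mae_mean, 3), round(mae_last, 3))
```

Varying `window` is one simple way to probe how accuracy changes with the amount of data available to each method.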
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created to simulate a market basket dataset, providing insights into customer purchasing behavior and store operations. The dataset facilitates market basket analysis, customer segmentation, and other retail analytics tasks. Here's more information about the context and inspiration behind this dataset:
Context:
Retail businesses, from supermarkets to convenience stores, are constantly seeking ways to better understand their customers and improve their operations. Market basket analysis, a technique used in retail analytics, explores customer purchase patterns to uncover associations between products, identify trends, and optimize pricing and promotions. Customer segmentation allows businesses to tailor their offerings to specific groups, enhancing the customer experience.
Inspiration:
The inspiration for this dataset comes from the need for accessible and customizable market basket datasets. While real-world retail data is sensitive and often restricted, synthetic datasets offer a safe and versatile alternative. Researchers, data scientists, and analysts can use this dataset to develop and test algorithms, models, and analytical tools.
Dataset Information:
The columns provide information about the transactions, customers, products, and purchasing behavior, making the dataset suitable for various analyses, including market basket analysis and customer segmentation. Here's a brief explanation of each column in the dataset:
Use Cases:
Note: This dataset is entirely synthetic and was generated using the Python Faker library, which means it doesn't contain real customer data. It's designed for educational and research purposes.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Big data is one of the key transformative factors which increasingly influences all aspects of modern life. Although this transformation brings vast opportunities it also generates novel challenges, not the least of which is organizing and searching this data deluge. The field of medicinal chemistry is not different: more and more data are being generated, for instance, by technologies such as DNA encoded libraries, peptide libraries, text mining of large literature corpora, and new in silico enumeration methods. Handling those huge sets of molecules effectively is quite challenging and requires compromises that often come at the expense of the interpretability of the results. In order to find an intuitive and meaningful approach to organizing large molecular data sets, we adopted a probabilistic framework called “topic modeling” from the text-mining field. Here we present the first chemistry-related implementation of this method, which allows large molecule sets to be assigned to “chemical topics” and investigating the relationships between those. In this first study, we thoroughly evaluate this novel method in different experiments and discuss both its disadvantages and advantages. We show very promising results in reproducing human-assigned concepts using the approach to identify and retrieve chemical series from sets of molecules. We have also created an intuitive visualization of the chemical topics output by the algorithm. This is a huge benefit compared to other unsupervised machine-learning methods, like clustering, which are commonly used to group sets of molecules. Finally, we applied the new method to the 1.6 million molecules of the ChEMBL22 data set to test its robustness and efficiency. In about 1 h we built a 100-topic model of this large data set in which we could identify interesting topics like “proteins”, “DNA”, or “steroids”. 
Along with this publication we provide our data sets and an open-source implementation of the new method (CheTo) which will be part of an upcoming version of the open-source cheminformatics toolkit RDKit.
Training and development data for the WMT17 QE task. Test data will be published as a separate item.
This shared task will build on its previous five editions to further examine automatic methods for estimating the quality of machine translation output at run-time, without relying on reference translations. We include word-level, phrase-level and sentence-level estimation. All tasks will make use of a large dataset produced from post-editions by professional translators. The data will be domain-specific (IT and Pharmaceutical domains) and substantially larger than in previous years. In addition to advancing the state of the art at all prediction levels, our goals include:
- To test the effectiveness of larger (domain-specific and professionally annotated) datasets. We will do so by increasing the size of one of last year's training sets.
- To study the effect of language direction and domain. We will do so by providing two datasets created in similar ways, but for different domains and language directions.
- To investigate the utility of detailed information logged during post-editing. We will do so by providing post-editing time, keystrokes, and actual edits.
This year's shared task provides new training and test datasets for all tasks, and allows participants to explore any additional data and resources deemed relevant. An in-house MT system was used to produce translations for all tasks. MT system-dependent information can be made available upon request. The data is publicly available, but since it has been provided by our industry partners it is subject to specific terms and conditions. However, these have no practical implications for the use of this data for research purposes.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains data collected during a study ("Towards High-Value Datasets determination for data-driven development: a systematic literature review") conducted by Anastasija Nikiforova (University of Tartu), Nina Rizun, Magdalena Ciesielska (Gdańsk University of Technology), Charalampos Alexopoulos (University of the Aegean) and Andrea Miletič (University of Zagreb).
It is being made public both to act as supplementary data for the "Towards High-Value Datasets determination for data-driven development: a systematic literature review" paper (pre-print is available in Open Access here -> https://arxiv.org/abs/2305.10234) and in order for other researchers to use these data in their own work.
The protocol is intended for the systematic literature review (SLR) on the topic of high-value datasets (HVD), with the aim of gathering information on how the topic of HVD and their determination has been reflected in the literature over the years and what has been found by these studies to date, incl. the indicators used in them, involved stakeholders, data-related aspects, and frameworks. The data in this dataset were collected as a result of the SLR over Scopus, Web of Science, and the Digital Government Research library (DGRL) in 2023.
***Methodology***
To understand how HVD determination has been reflected in the literature over the years and what has been found by these studies to date, all relevant literature covering this topic has been studied. To this end, the SLR was carried out by searching the digital libraries covered by Scopus, Web of Science (WoS), and the Digital Government Research library (DGRL).
These databases were queried for the keywords ("open data" OR "open government data") AND ("high-value data*" OR "high value data*"), which were applied to the article title, keywords, and abstract to limit the number of papers to those where these objects were primary research objects rather than merely mentioned in the body, e.g., as future work. After deduplication, 11 unique articles were found and were further checked for relevance. As a result, a total of 9 articles were further examined. Each study was independently examined by at least two authors.
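The search and deduplication steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record fields (`title`, `keywords`, `abstract`), the sample records, and the title-based deduplication key are hypothetical, and the trailing `*` in the query is treated as a simple prefix wildcard. The actual searches were run in the databases' own interfaces.

```python
import re

# The two OR-groups of the query; "data*" is treated as a prefix wildcard.
OPEN_DATA = re.compile(r"open (government )?data", re.IGNORECASE)
HVD = re.compile(r"high[- ]value data\w*", re.IGNORECASE)


def matches_query(record):
    """True if title, keywords, and abstract together satisfy both OR-groups."""
    text = " ".join([record["title"], record["keywords"], record["abstract"]])
    return bool(OPEN_DATA.search(text)) and bool(HVD.search(text))


def deduplicate(records):
    """Keep the first record seen for each normalised title (assumed key)."""
    seen, unique = set(), []
    for rec in records:
        key = re.sub(r"\W+", " ", rec["title"]).strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


# Hypothetical export merged from several indexing databases.
records = [
    {"title": "High-value datasets in open government data",
     "keywords": "open data", "abstract": "We study HVD determination."},
    {"title": "High-Value Datasets in Open Government Data",
     "keywords": "open data", "abstract": "We study HVD determination."},
    {"title": "Open data portals",
     "keywords": "transparency", "abstract": "We study portals."},
]

hits = deduplicate([r for r in records if matches_query(r)])
print(len(hits), "unique matching record(s)")
```

Matching on the metadata fields only, rather than the full text, is what restricts the result set to papers where HVD is a primary research object.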
To attain the objective of our study, we developed the protocol, where the information on each selected study was collected in four categories: (1) descriptive information, (2) approach- and research design- related information, (3) quality-related information, (4) HVD determination-related information.
***Test procedure***
Each study was independently examined by at least two authors: after an in-depth examination of the full text of the article, the structured protocol was filled in for each study.
The structure of the survey is available in the supplementary file (see Protocol_HVD_SLR.odt, Protocol_HVD_SLR.docx).
The data collected for each study by two researchers were then synthesized into one final version by a third researcher.
***Description of the data in this data set***
Protocol_HVD_SLR provides the structure of the protocol.
Spreadsheet #1 provides the filled protocol for relevant studies.
Spreadsheet #2 provides the list of results after the search over the three indexing databases, i.e. before filtering out irrelevant studies.
The information on each selected study was collected in four categories:
(1) descriptive information,
(2) approach- and research design- related information,
(3) quality-related information,
(4) HVD determination-related information
Descriptive information
1) Article number - a study number, corresponding to the study number assigned in an Excel worksheet
2) Complete reference - the complete source information to refer to the study
3) Year of publication - the year in which the study was published
4) Journal article / conference paper / book chapter - the type of the paper {journal article, conference paper, book chapter}
5) DOI / Website - a link to the website where the study can be found
6) Number of citations - the number of citations of the article in Google Scholar, Scopus, Web of Science
7) Availability in OA - availability of an article in the Open Access
8) Keywords - keywords of the paper as indicated by the authors
9) Relevance for this study - what is the relevance level of the article for this study? {high / medium / low}
Approach- and research design-related information
10) Objective / RQ - the research objective / aim, established research questions
11) Research method (including unit of analysis) - the methods used to collect data, including the unit of analysis (country, organisation, specific unit that has been analysed, e.g., the number of use-cases, scope of the SLR etc.)
12) Contributions - the contributions of the study
13) Method - whether the study uses a qualitative, quantitative, or mixed methods approach
14) Availability of the underlying research data - whether there is a reference to the publicly available underlying research data (e.g., transcriptions of interviews, collected data) or an explanation of why these data are not shared
15) Period under investigation - period (or moment) in which the study was conducted
16) Use of theory / theoretical concepts / approaches - does the study mention any theory / theoretical concepts / approaches? If any theory is mentioned, how is theory used in the study?
Quality- and relevance- related information
17) Quality concerns - whether there are any quality concerns (e.g., limited information about the research methods used)
18) Primary research object - is the HVD a primary research object in the study? (primary - the paper is focused around the HVD determination, secondary - mentioned but not studied (e.g., as part of discussion, future work etc.))
HVD determination-related information
19) HVD definition and type of value - how is the HVD defined in the article and / or any other equivalent term?
20) HVD indicators - what are the indicators to identify HVD? How were they identified? (components & relationships, "input -> output")
21) A framework for HVD determination - is there a framework presented for HVD identification? What components does it consist of and what are the relationships between these components? (detailed description)
22) Stakeholders and their roles - what stakeholders or actors does HVD determination involve? What are their roles?
23) Data - what data do HVD cover?
24) Level (if relevant) - what is the level of the HVD determination covered in the article? (e.g., city, regional, national, international)
***Format of the file***
.xls, .csv (for the first spreadsheet only), .odt, .docx
***Licenses or restrictions***
CC-BY
For more info, see README.txt
The Center for Marine Applied Research (CMAR) provides high-resolution ocean data from the coast of Nova Scotia through its Coastal Monitoring Program. Through the Water Quality branch of the program, CMAR collects temperature, dissolved oxygen, and salinity data using sensors deployed on stationary moorings. A typical mooring consists of a line anchored to the seafloor and suspended by a sub-surface buoy, with sensors attached at various depths. Alternatively, sensors may be attached to structures such as buoys, wharves, or aquaculture equipment. Sensors are deployed for several months, and data are measured every 1 minute to 1 hour. Station locations, summary reports, and data collection methods are available on the CMAR website (https://cmar.ca/coastal-monitoring-program/). Datasets and reports may be revised pending ongoing data collection and analyses.
Automated quality control tests have been applied to the data to identify outlying and unexpected observations. The results of these tests are summarized in the "QC_FLAG" columns of the dataset. Observations flagged "Pass" passed all tests, while observations flagged "Fail" failed at least one test and should be excluded from most analyses. "Suspect/Of Interest" flags highlight unusual events or poor-quality data, and "Not Evaluated" flags indicate that at least one test was not applied to the observation. The flags should be used only as a guide, and users are responsible for assessing data quality before use. For more details on the quality control tests, visit the CMAR Data Governance website (https://dempsey-cmar.github.io/cmp-data-governance/pages/cmp_about.html).
Other data quality considerations:
- Through calibration validation procedures, CMAR has found that VR2AR temperature sensors typically record 0.5 to 1 °C lower than other temperature sensors. This is not corrected or flagged in the datasets, but may be in the future.
- Sensor drift is not flagged in the datasets.
- The sensor_depth_at_low_tide_m value is an estimate and should be compared to Sensor_Depth_Measured_M where possible. Note that the mooring may be "knocked down" by currents or by a build-up of biofouling. Large discrepancies between the estimated depth and the minimum recorded depth are flagged in the Depth_Crosscheck_Flag column.
The Coastal Monitoring Program Water Quality data are organized by county. These datasets are very large, typically exceeding the number of rows that can be viewed in Excel. CMAR recommends filtering the data by waterbody, station, depth, quality control flag, and/or period of interest before exporting. Take care when exporting data filtered on quality control columns, because the entire row will be filtered out (i.e., all other variables measured at that timestamp will also be excluded).
If you have accessed the Coastal Monitoring Program data, CMAR would appreciate your feedback: https://forms.gle/ayd7vi3bpkge6ueya. Please acknowledge the Center for Marine Applied Research in any published material that uses these data. Contact info@cmar.ca for more information.
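The caveat about filtering on quality control columns can be illustrated with a minimal Python sketch. The column names (`temperature`, `qc_flag_temperature`, `dissolved_oxygen`, `qc_flag_dissolved_oxygen`), the sample values, and the one-row-per-timestamp layout are assumptions made for illustration only; check the actual dataset headers before adapting it.

```python
# Hypothetical rows: one row per timestamp, one flag column per variable.
rows = [
    {"timestamp": "2022-01-01T00:00", "temperature": 4.1,
     "qc_flag_temperature": "Pass",
     "dissolved_oxygen": 98.0, "qc_flag_dissolved_oxygen": "Pass"},
    {"timestamp": "2022-01-01T01:00", "temperature": 19.7,
     "qc_flag_temperature": "Fail",
     "dissolved_oxygen": 97.5, "qc_flag_dissolved_oxygen": "Pass"},
    {"timestamp": "2022-01-01T02:00", "temperature": 4.0,
     "qc_flag_temperature": "Pass",
     "dissolved_oxygen": 97.8, "qc_flag_dissolved_oxygen": "Pass"},
]


def drop_failed_rows(rows, flag_column):
    """Row-wise filter: removes the whole timestamp, including other variables."""
    return [r for r in rows if r[flag_column] != "Fail"]


def mask_failed_values(rows, variable, flag_column):
    """Per-variable mask: blanks only the failed value and keeps the row."""
    return [
        dict(r, **{variable: None}) if r[flag_column] == "Fail" else dict(r)
        for r in rows
    ]


filtered = drop_failed_rows(rows, "qc_flag_temperature")
masked = mask_failed_values(rows, "temperature", "qc_flag_temperature")

print(len(filtered), "rows remain after row-wise filtering")
print("dissolved oxygen kept at 01:00:", masked[1]["dissolved_oxygen"])
```

The row-wise filter discards the valid dissolved oxygen reading at 01:00 along with the failed temperature, whereas the per-variable mask keeps it.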
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Diverse learning theories have been constructed to understand learners' internal states through various tangible predictors. We focus on self-regulatory actions, which are subconscious, habitual actions triggered by behavior agents' 'awareness' of their attention loss. We hypothesize that self-regulatory behaviors (i.e., attention regulation behaviors) also occur in e-reading as 'regulators', as found in other behavior models (Ekman & Friesen, 1969). In this work, we try to define the types and frequencies of attention regulation behaviors in e-reading. We collected various cues that reflect learners' moment-to-moment and page-to-page cognitive states to understand the learners' attention in e-reading.
The text 'How to make the most of your day at Disneyland Resort Paris' was presented on a screen-based e-reader, which we developed in a pdf-reader format. An informative, entertaining text was adopted to capture learners' attentional shifts during knowledge acquisition. The text has 2685 words, distributed over ten pages, with one subtopic on each page. A built-in webcam on a Mac Pro and a mouse were used for the data collection, aiming for real-world implementation with only essential computational devices. A height-adjustable laptop stand was used to adjust for participants' eye levels.
Thirty learners in higher education were invited to a screen-based e-reading task (M=16.2, SD=5.2 minutes). A pre-test questionnaire with ten multiple-choice questions was given before the reading to check their prior knowledge of the topic. There was no specific time limit for finishing the questionnaire. Learners were asked to report their distractions on two levels during the reading: 1) in-text distraction (e.g., still reading the text with low attentiveness) or 2) out-of-text distraction (e.g., thinking of something else while no longer reading the text). We implemented two noticeably designed buttons on the right-hand side of the screen interface to minimize possible distraction from the reporting task. After each new page was triggered, a blur stimulus was applied to the text at a random time within a 20-second range, which ensures that the blur stimulus occurs at least once on each page. Participants were asked to click the de-blur button on the text area of the screen to proceed with the reading. The button covers the whole text area, so participants need minimal effort to find and click it. The reaction time for de-blurring was also measured to gauge the arousal of learners during the reading. Participants answered the same set of ten multiple-choice questions again after the reading session (i.e., formative questions), together with added subtopic summarization questions (i.e., summative questions). Comparing pre-test and post-test answers can provide insights into the quantitative and qualitative knowledge gained through the session and into different learning outcomes based on individual differences.
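The blur stimulus and reaction-time measurement described above can be sketched as follows. This is a minimal illustration, not the e-reader's actual implementation: the function names are hypothetical, and drawing the onset uniformly from the 20-second range is an assumption, since the description only says the onset is random within that range.

```python
import random

BLUR_WINDOW_S = 20.0  # blur onset falls within this range after a page opens


def schedule_blur(page_open_time, rng=random):
    """Return the time at which the text is blurred (uniform draw is assumed)."""
    return page_open_time + rng.uniform(0.0, BLUR_WINDOW_S)


def reaction_time(blur_onset, deblur_click):
    """Seconds between the blur appearing and the participant's de-blur click."""
    if deblur_click < blur_onset:
        raise ValueError("de-blur click cannot precede the blur onset")
    return deblur_click - blur_onset


# Example with a seeded generator so the run is repeatable.
rng = random.Random(0)
onset = schedule_blur(page_open_time=100.0, rng=rng)
rt = reaction_time(onset, onset + 1.5)
print(f"blur after {onset - 100.0:.2f} s on page, reaction time {rt:.2f} s")
```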
A video dataset of 931,440 frames has been annotated with the attention regulator behaviors using an annotation tool that plays the long sequence clip by clip, with each clip containing 30 frames. Two annotators (doctoral students) carried out two stages of labeling. In the first stage, the annotators were trained on the labeling criteria and annotated the attention regulator behaviors separately, based on their own judgments. In the second stage, the labels were summarized and cross-checked to resolve inconsistent cases, resulting in five attention regulation behaviors and one neutral state. See WEDAR_readme.csv for detailed descriptions of the features.
The dataset has been uploaded in two forms: 1) raw data, in the form in which we collected them, and 2) preprocessed data, from which we extracted useful features for further learning analytics based on real-time and post-hoc data.
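As a quick check of the figures above, and as an illustration of the second-stage cross-check, the following Python sketch computes the number of 30-frame clips and lists the clips on which two annotators disagree. The label strings and the helper name are hypothetical; the actual label set is described in WEDAR_readme.csv.

```python
FRAMES_TOTAL = 931_440
FRAMES_PER_CLIP = 30

# Number of clips the annotation tool steps through, and any leftover frames.
n_clips, leftover = divmod(FRAMES_TOTAL, FRAMES_PER_CLIP)


def disagreements(labels_a, labels_b):
    """Indices of clips whose labels differ between the two annotators."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same clips")
    return [i for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a != b]


# Hypothetical first-stage labels for four clips.
annotator_1 = ["neutral", "behavior_1", "behavior_2", "neutral"]
annotator_2 = ["neutral", "behavior_1", "behavior_3", "neutral"]

to_review = disagreements(annotator_1, annotator_2)
print(n_clips, "clips;", "clips to cross-check:", to_review)
```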
Reference
Ekman, P., & Friesen, W. V. (1969). The repertoire of nonverbal behavior: Categories, origins, usage, and coding. Semiotica, 1(1), 49-98.
Fecal occult blood test (FOBT) obtained in the past 2 years, or colonoscopy or sigmoidoscopy obtained in the last 5 years, by age group and sex, aged 50 or older, Canada, provinces, territories, health regions (2007 boundaries) and peer groups.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Model performance with different training window sizes and retraining periods.
http://www.gnu.org/licenses/lgpl-3.0.html
This dataset was created to compare methods for face reidentification, that is, given an image and the name of a person, checking whether that image belongs to that person. It can also be used to test face recognition algorithms, since the dataset has been categorized.
The authors have made a great effort to collect as many images as they could for all classes in the dataset. Faces were aligned using eye position alignment and then cropped using landmarks to find the region of interest.
The Open Famous People Faces dataset contains 258 classes with at least 5 images per class. The images vary in size: some are small and of low quality, others are large and of high quality. The dataset includes images of the same person at different ages.