https://rightsstatements.org/page/InC/1.0/
The archived data set consists of 19 interview transcriptions from thematic interviews with 11 professionals working in early childhood education and care (ECEC) and 8 teachers from basic education. The participants were recruited from different Finnish cities through email requests to teacher networks, individuals, and local education authorities. The main focus of the interviews was on different ways of implementing language education (e.g. principles, goals, new/innovative approaches, collaboration). The interviews were conducted in various contexts of language education (e.g. language-aware teaching, foreign language teaching, bilingual education). Most of the interviews were conducted in Finnish. More detailed information about the metadata and interviews can be found in the metadata files. The interviews can be used for studies on teacher perspectives and reflections within different language education contexts. The dataset includes accounts of individual and community innovation and development. The data should be used in accordance with the IKI privacy notice based on the JYU guidelines (2018-2019). The IKI research plan is available as part of the metadata files. Please note that neither the teachers nor the interaction between the interviewers and participants should be evaluated. The data was gathered in 2018-2021 and is part of the larger IKI dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
According to the distribution of the ethnic minority population in Guangxi, 157,598 parents of preschool children from 12 ethnic autonomous counties in Guangxi participated in the assessment. Data on the physical and mental development of preschool children were collected to analyze the distribution characteristics of five major domains: health and physical fitness, language and communication, sociality and emotion, exploration and cognition, and aesthetics and performance. These characteristics were analyzed based on dimensions such as ethnicity, gender, grade, urban and rural areas, and kindergarten attributes. This data collection aims to accumulate resources for the formal implementation of brain and intellectual development assessment and to conduct preliminary exploration. The "Assessment Scale for the Physical and Mental Development of Preschool Children" was used for the assessment. This tool comprehensively covers the five major domains of preschool children's physical and mental development (health and physical fitness, language and communication, sociality and emotion, exploration and cognition, aesthetics and performance) and has high internal consistency reliability (Cronbach's α > 0.85). Data analysis and visualization were conducted using R language tools such as psych, psychtool, and dplyr. The data is a 157,598 × 18 data frame, with each row representing a subject record. The first to eighteenth columns represent: individual id, gender, scale code, scores of each dimension of the scale (V1-V7), role of the subject, city address, preschool, city address, grade, urban and rural series, kindergarten attribute, mean, and ethnicity.
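As a rough illustration of the grouped summaries described above, the following Python/pandas sketch assumes a CSV export of the data frame with column names suggested by this description (the original analysis used R packages such as psych and dplyr); all file and column names are assumptions.

```python
import pandas as pd

# Assumed CSV export of the 157,598 x 18 data frame described above (file name hypothetical).
df = pd.read_csv("guangxi_preschool_assessment.csv")

# Dimension scores of the assessment scale as named in the description.
domain_scores = ["V1", "V2", "V3", "V4", "V5", "V6", "V7"]

# Mean domain scores broken down by the grouping dimensions mentioned above
# (column names are assumptions and may differ in the actual file).
for group in ["ethnicity", "gender", "grade", "urban_rural_series", "kindergarten_attribute"]:
    print(df.groupby(group)[domain_scores].mean().round(2))
```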
This is a test collection for passage and document retrieval, produced in the TREC 2023 Deep Learning track. The Deep Learning Track studies information retrieval in a large training data regime. This is the case where the number of training queries with at least one positive label is at least in the tens of thousands, if not hundreds of thousands or more. This corresponds to real-world scenarios such as training based on click logs and training based on labels from shallow pools (such as the pooling in the TREC Million Query Track or the evaluation of search engines based on early precision). Certain machine learning based methods, such as methods based on deep learning, are known to require very large datasets for training. Lack of such large-scale datasets has been a limitation for developing such methods for common information retrieval tasks, such as document ranking. The Deep Learning Track organized in the previous years aimed at providing large-scale datasets to TREC and creating a focused research effort with a rigorous blind evaluation of rankers for the passage ranking and document ranking tasks. Similar to the previous years, one of the main goals of the track in 2022 is to study what methods work best when a large amount of training data is available. For example, do the same methods that work on small data also work on large data? How much do methods improve when given more training data? What external data and models can be brought to bear in this scenario, and how useful is it to combine full supervision with other forms of supervision? The collection contains 12 million web pages, 138 million passages from those web pages, search queries, and relevance judgments for the queries.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains 833 brain MRI images (T1w and T2w) from infancy and early childhood. The age of the subjects is between 0 months and 36 months. It contains a wide range of pathologies as well as healthy subjects. It is a quite diverse dataset acquired in clinical routine over several years (images acquired with the same scanner, but different protocols).
The T1w images are resampled to the shape of the T2w images. Then both are skull stripped.
All details about this dataset can be found in the paper "Development and Evaluation of Deep Learning Models for Automated Estimation of Myelin Maturation Using Pediatric Brain MRI Scans". If you use this dataset please cite our paper: https://pubs.rsna.org/doi/10.1148/ryai.220292
The metadata can be found in the table meta.csv.
Description of columns:
myelinisation: myelin maturation status (delayed, normal, or accelerated) according to evaluation by an expert radiologist. For more detail please see the paper.
age: the chronological age (in months) since birth.
age_corrected: the corrected chronological age (in months), which adjusts for premature babies by subtracting the number of months the baby was born before 37 weeks of gestation; hence a preterm newborn gets a negative corrected age.
doctor_predicted_age: the predicted age (in months) of the myelin maturation by an expert radiologist (subjects with delayed myelin maturation will get lower values than their chronological age).
diagnosis: list of pathologies found in this dataset according to expert radiology reports.
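A minimal Python/pandas sketch of working with meta.csv, assuming the column names listed above appear verbatim in the file; the derived quantities are illustrative only.

```python
import pandas as pd

# Load the metadata table described above (column names assumed from this description).
meta = pd.read_csv("meta.csv")

# Myelin maturation lag: chronological age minus the radiologist's predicted myelin age.
# Subjects with delayed maturation should tend to have a positive lag.
meta["myelin_lag_months"] = meta["age"] - meta["doctor_predicted_age"]
print(meta.groupby("myelinisation")["myelin_lag_months"].describe())

# Corrected age sanity check: preterm subjects can have negative corrected ages near birth.
print(meta.loc[meta["age_corrected"] < 0, ["age", "age_corrected"]].head())
```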
The Meta-Dataset benchmark is a large few-shot learning benchmark consisting of multiple datasets with different data distributions. It does not restrict few-shot tasks to have fixed ways and shots, thus representing a more realistic scenario. It consists of 10 datasets from diverse domains:
ILSVRC-2012 (the ImageNet dataset, consisting of natural images with 1000 categories); Omniglot (hand-written characters, 1623 classes); Aircraft (dataset of aircraft images, 100 classes); CUB-200-2011 (dataset of birds, 200 classes); Describable Textures (different kinds of texture images with 43 categories); Quick Draw (black and white sketches of 345 different categories); Fungi (a large dataset of mushrooms with 1500 categories); VGG Flower (dataset of flower images with 102 categories); Traffic Signs (German traffic sign images with 43 classes); and MSCOCO (images collected from Flickr, 80 classes).
All datasets except Traffic signs and MSCOCO have a training, validation and test split (proportioned roughly into 70%, 15%, 15%). The datasets Traffic Signs and MSCOCO are reserved for testing only.
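Because ways and shots are not fixed, an episode sampler has to draw them per task. The Python sketch below is a minimal, self-contained illustration of that idea and is not the official Meta-Dataset sampling pipeline; all bounds are arbitrary.

```python
import random

def sample_episode(examples_by_class, max_ways=50, max_support=10, max_query=10, rng=random):
    """Sample a few-shot episode with a variable number of ways and shots,
    in the spirit of Meta-Dataset's non-fixed episode construction.
    examples_by_class maps class name -> list of example identifiers."""
    classes = list(examples_by_class)
    ways = rng.randint(min(5, len(classes)), min(max_ways, len(classes)))  # variable ways
    support, query = {}, {}
    for c in rng.sample(classes, ways):
        pool = rng.sample(examples_by_class[c],
                          min(len(examples_by_class[c]), max_support + max_query))
        shots = rng.randint(1, max(1, min(max_support, len(pool) - 1)))  # variable shots
        support[c], query[c] = pool[:shots], pool[shots:shots + max_query]
    return support, query
```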
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The number of children in early childhood education and care by age group and type of care since 2002, the attendance days of municipal day care centres from 2005 onwards and the number of employees since 2008, and the cost of the early childhood education system from 2009 in the six largest cities in Finland.
The reviews of early childhood education and care monitor the use and costs of early childhood education and care provided by the municipalities themselves and as outsourced services, of early childhood education and care arranged through private care support and service vouchers, as well as the use and costs of child home care support. The review also includes pre-primary education in accordance with the Basic Education Act and open early childhood education activities in accordance with the Act on Early Childhood Education.
The Six Cities are the six most populous cities in Finland: in order of population, Helsinki, Espoo, Tampere, Vantaa, Turku and Oulu. The Six Cities working groups compare the cities' social and health services and early childhood education and care services. Data on customer numbers, performances, personnel and costs are mainly compiled from the municipalities' own information systems and financial statements. City experts agree on definitions that are as uniform as possible for the data collection and implement the data collection in practice.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Over the last ten years, social media has become a crucial data source for businesses and researchers, providing a space where people can express their opinions and emotions. To analyze this data and classify emotions and their polarity in texts, natural language processing (NLP) techniques such as emotion analysis (EA) and sentiment analysis (SA) are employed. However, the effectiveness of these tasks using machine learning (ML) and deep learning (DL) methods depends on large labeled datasets, which are scarce in languages like Spanish. To address this challenge, researchers use data augmentation (DA) techniques to artificially expand small datasets. This study aims to investigate whether DA techniques can improve classification results using ML and DL algorithms for sentiment and emotion analysis of Spanish texts. Various text manipulation techniques were applied, including transformations, paraphrasing (back-translation), and text generation using generative adversarial networks, to small datasets such as song lyrics, social media comments, headlines from national newspapers in Chile, and survey responses from higher education students. The findings show that the Convolutional Neural Network (CNN) classifier achieved the most significant improvement, with an 18% increase using the Generative Adversarial Networks for Sentiment Text (SentiGan) on the Aggressiveness (Seriousness) dataset. Additionally, the same classifier model showed an 11% improvement using the Easy Data Augmentation (EDA) on the Gender-Based Violence dataset. The performance of the Bidirectional Encoder Representations from Transformers (BETO) also improved by 10% on the back-translation augmented version of the October 18 dataset, and by 4% on the EDA augmented version of the Teaching survey dataset. These results suggest that data augmentation techniques enhance performance by transforming text and adapting it to the specific characteristics of the dataset. Through experimentation with various augmentation techniques, this research provides valuable insights into the analysis of subjectivity in Spanish texts and offers guidance for selecting algorithms and techniques based on dataset features.
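As a simple illustration of the lighter-weight augmentation family referenced above (EDA-style random word operations), the Python sketch below applies random deletion and random swaps to a sentence; it is not the study's implementation, and full EDA additionally uses synonym replacement and random insertion, which require a lexical resource.

```python
import random

def eda_augment(sentence, p_delete=0.1, n_swaps=1, rng=random):
    """Tiny EDA-style augmenter: random word deletion plus random position swaps."""
    words = sentence.split()
    # Random deletion: drop each word with probability p_delete, but keep at least one word.
    kept = [w for w in words if rng.random() > p_delete] or [rng.choice(words)]
    # Random swap: exchange the positions of two words, n_swaps times.
    for _ in range(n_swaps):
        if len(kept) > 1:
            i, j = rng.sample(range(len(kept)), 2)
            kept[i], kept[j] = kept[j], kept[i]
    return " ".join(kept)

print(eda_augment("la letra de esta canción expresa una emoción muy intensa"))
```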
https://academictorrents.com/nolicensespecified
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
According to the distribution of ethnic minority populations in Guangxi, 320,425 parents of preschool children from 12 autonomous counties participated in the assessment. Data were collected on parenting styles and the physical and mental development of preschool children, including six parenting style factors: "humiliation vs. respect", "rejection vs. acceptance", "punishment vs. motivation", "dictatorship vs. democracy", "indulgence (leniency) vs. control", and "rudeness vs. protection (civilization)". Each factor has 10 items, for a total of 60 items. The distribution characteristics of physical and mental development cover five major areas: health and physical fitness, language and communication, sociality and emotion, exploration and cognition, and aesthetics and performance. These characteristics were analyzed based on dimensions such as ethnicity, gender, grade level, urban-rural area, and kindergarten attributes. The "Parental Rearing Style Scale" and the "Preschool Children's Physical and Mental Development Assessment Scale" were used for the evaluation; both have high internal consistency reliability (Cronbach's α > 0.85). R language tools such as psych, psychtool, and dplyr were used for data analysis and visualization. The data is a 320,425 × 21 data frame, with each row representing a subject record. The columns include: personal ID, gender, scale code, scores for each dimension of the scale, subject role, city address, preschool class, city address, grade level, urban-rural series, kindergarten attributes, mean, and ethnicity.
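A minimal Python sketch of how the reported internal consistency of one 10-item factor could be checked (the original analysis used R packages); the item-level file and column names are hypothetical, since the 320,425 × 21 frame itself stores dimension scores rather than item responses.

```python
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha for a block of item columns (rows = respondents)."""
    items = items.dropna()
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances / total_variance)

# Hypothetical item-level export for the "rejection vs. acceptance" factor (10 items).
items = pd.read_csv("parenting_style_items.csv")[[f"acceptance_item_{i}" for i in range(1, 11)]]
print(f"Cronbach's alpha: {cronbach_alpha(items):.2f}")  # reported to exceed 0.85
```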
Maximilian B. Kiss, Sophia B. Coban, K. Joost Batenburg, Tristan van Leeuwen, and Felix Lucka "2DeteCT - A large 2D expandable, trainable, experimental Computed Tomography dataset for machine learning", Sci Data 10, 576 (2023) or arXiv:2306.05907 (2023)
Abstract: "Recent research in computational imaging largely focuses on developing machine learning (ML) techniques for image reconstruction, which requires large-scale training datasets consisting of measurement data and ground-truth images. However, suitable experimental datasets for X-ray Computed Tomography (CT) are scarce, and methods are often developed and evaluated only on simulated data. We fill this gap by providing the community with a versatile, open 2D fan-beam CT dataset suitable for developing ML techniques for a range of image reconstruction tasks. To acquire it, we designed a sophisticated, semi-automatic scan procedure that utilizes a highly-flexible laboratory X-ray CT setup. A diverse mix of samples with high natural variability in shape and density was scanned slice-by-slice (5000 slices in total) with high angular and spatial resolution and three different beam characteristics: A high-fidelity, a low-dose and a beam-hardening-inflicted mode. In addition, 750 out-of-distribution slices were scanned with sample and beam variations to accommodate robustness and segmentation tasks. We provide raw projection data, reference reconstructions and segmentations based on an open-source data processing pipeline."
The data collection has been acquired using a highly flexible, programmable and custom-built X-ray CT scanner, the FleX-ray scanner, developed by TESCAN-XRE NV, located in the FleX-ray Lab at the Centrum Wiskunde & Informatica (CWI) in Amsterdam, Netherlands. It consists of a cone-beam microfocus X-ray point source (limited to 90 kV and 90 W) that projects polychromatic X-rays onto a 14-bit CMOS (complementary metal-oxide semiconductor) flat panel detector with CsI(Tl) scintillator (Dexella 1512NDT) and 1536-by-1944 pixels, each. To create a 2D dataset, a fan-beam geometry was mimicked by only reading out the central row of the detector. Between source and detector there is a rotation stage, upon which samples can be mounted. The machine components (i.e., the source, the detector panel, and the rotation stage) are mounted on translation belts that allow the moving of the components independently from one another.
Please refer to the paper for all further technical details.
The complete data collection can be found via the following links: 1-1,000, 1,001-2,000, 2,001-3,000, 3,001-4,000, 4,001-5,000, 5,521-6,370.
Each slice folder ‘slice00001’ - ‘slice05000’ and ‘slice05521’ - ‘slice06370’ contains three folders, one for each mode: ‘mode1’, ‘mode2’, ‘mode3’. In the raw data archives, each of these folders contains the sinogram, the dark-field, and the two flat-fields; in the reconstruction archives, it contains just the reconstructions and, for mode2, the additional reference segmentation.
The corresponding reference reconstructions and segmentations can be found via the following links: 1-1,000, 1,001-2,000, 2,001-3,000, 3,001-4,000, 4,001-5,000, 5,521-6,370.
The corresponding Python scripts for loading, pre-processing, reconstructing and segmenting the projection data in the way described in the paper can be found on GitHub. A machine-readable file with the used scanning parameters and instrument data for each acquisition mode, as well as a script for loading it, can be found in the GitHub repository as well.
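For orientation only, the sketch below shows how one mode folder's raw data could be flat-/dark-field corrected in Python; the file names and formats are assumptions, and the official scripts on GitHub are the authoritative way to load and pre-process the data.

```python
import numpy as np
import imageio.v2 as imageio

# Hypothetical file names inside one slice/mode folder; check the GitHub loading scripts
# for the actual layout and formats.
slice_dir = "slice00001/mode1"
sino = imageio.imread(f"{slice_dir}/sinogram.tif").astype(np.float32)
dark = imageio.imread(f"{slice_dir}/dark.tif").astype(np.float32)
flat1 = imageio.imread(f"{slice_dir}/flat1.tif").astype(np.float32)
flat2 = imageio.imread(f"{slice_dir}/flat2.tif").astype(np.float32)

# Standard flat-/dark-field correction followed by the negative log transform used in
# absorption CT: I_corr = (I - dark) / (flat - dark), p = -log(I_corr).
flat = 0.5 * (flat1 + flat2)
corrected = (sino - dark) / np.clip(flat - dark, 1e-6, None)
projections = -np.log(np.clip(corrected, 1e-6, None))
print(projections.shape)
```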
Note: It is advisable to use the graphical user interface when decompressing the .zip archives. If you experience a zipbomb error when unzipping a file on a Linux system, rerun the command with the UNZIP_DISABLE_ZIPBOMB_DETECTION=TRUE environment variable set, for example by adding "export UNZIP_DISABLE_ZIPBOMB_DETECTION=TRUE" to your .bashrc.
For more information or guidance in using the data collection, please get in touch with
Maximilian.Kiss [at] cwi.nl
Felix.Lucka [at] cwi.nl
Patterns of educational attainment vary greatly across countries, and across population groups within countries. In some countries, virtually all children complete basic education whereas in others large groups fall short. The primary purpose of this database, and the associated research program, is to document and analyze these differences using a compilation of a variety of household-based data sets: Demographic and Health Surveys (DHS); Multiple Indicator Cluster Surveys (MICS); Living Standards Measurement Study Surveys (LSMS); as well as country-specific Integrated Household Surveys (IHS) such as Socio-Economic Surveys. As shown at the website associated with this database, there are dramatic differences in attainment by wealth. When households are ranked according to their wealth status (or more precisely, a proxy based on the assets owned by members of the household), there are striking differences in the attainment patterns of children from the richest 20 percent compared to the poorest 20 percent. In Mali in 2012 only 34 percent of 15 to 19 year olds in the poorest quintile had completed grade 1, whereas 80 percent of the richest quintile had done so. In many countries, for example Pakistan, Peru and Indonesia, almost all the children from the wealthiest households have completed at least one year of schooling. In some countries, like Mali and Pakistan, wealth gaps are evident from grade 1 on; in other countries, like Peru and Indonesia, wealth gaps emerge later in the school system. The EdAttain website allows a visual exploration of gaps in attainment and enrollment within and across countries, based on the international database, which spans multiple years from over 120 countries and includes indicators disaggregated by wealth, gender and urban/rural location. The database underlying that site can be downloaded from here.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The BuildingsBench datasets consist of:
Buildings-900K: a large-scale dataset of 900K buildings for pretraining models on the task of short-term load forecasting (STLF). Buildings-900K is statistically representative of the entire U.S. building stock.
7 real residential and commercial building datasets for benchmarking two downstream tasks evaluating generalization: zero-shot STLF and transfer learning for STLF.
Buildings-900K can be used for pretraining models on day-ahead STLF for residential and commercial buildings. The specific gap it fills is the lack of large-scale and diverse time series datasets of sufficient size for studying pretraining and finetuning with scalable machine learning models. Buildings-900K consists of synthetically generated energy consumption time series. It is derived from the NREL End-Use Load Profiles (EULP) dataset (see the link to this database further below). However, the EULP was not originally developed for the purpose of STLF. Rather, it was developed to "...help electric utilities, grid operators, manufacturers, government entities, and research organizations make critical decisions about prioritizing research and development, utility resource and distribution system planning, and state and local energy planning and regulation." Similar to the EULP, Buildings-900K is a collection of Parquet files and it follows nearly the same Parquet dataset organization as the EULP. As it only contains a single energy consumption time series per building, it is much smaller (~110 GB).
BuildingsBench also provides an evaluation benchmark that is a collection of various open-source residential and commercial real building energy consumption datasets. The evaluation datasets, which are provided alongside Buildings-900K below, are collections of CSV files which contain annual energy consumption. The size of the evaluation datasets altogether is less than 1 GB. They are listed below:
ElectricityLoadDiagrams20112014
Building Data Genome Project-2
Individual household electric power consumption (Sceaux)
Borealis
SMART
IDEAL
Low Carbon London
A README file providing details about how the data is stored and describing the organization of the datasets can be found within each data lake version under BuildingsBench.
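A minimal Python sketch of reading one building's Parquet time series and framing a day-ahead STLF example; the path and column names are hypothetical, and the README shipped with each data lake version documents the actual organization.

```python
import pandas as pd

# Hypothetical path and columns; consult the per-version README for the real layout.
ts = pd.read_parquet("buildings_900k/building_0001.parquet")
ts = ts.set_index(pd.to_datetime(ts["timestamp"]))["energy_consumption"]

# Day-ahead STLF framing: one day of hourly history as input, the next 24 hours as target.
history = ts.loc["2018-06-01"]
target = ts.loc["2018-06-02"]
print(history.shape, target.shape)  # expect (24,) and (24,) for hourly data
```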
Attribution-NonCommercial-NoDerivs 3.0 (CC BY-NC-ND 3.0): https://creativecommons.org/licenses/by-nc-nd/3.0/
License information was derived automatically
This dataset contains cross-sectional data collected on child development outcomes, child characteristics, and parental and home characteristics for a sample of 1,311 children aged 6-42 months living in a representative sample of low- and low-middle-income households in Bogota, Colombia. This is the sample used for the analysis in the paper "Concurrent Validity and Feasibility of Short Tests Currently Used to Measure Early Childhood Development in Large Scale Studies" by Marta Rubio-Codina, M. Caridad Araujo, Orazio Attanasio, Pablo Muñoz and Sally Grantham-McGregor, forthcoming at PLOS ONE. The dataset and do files shared allow replication of the results in the paper. Please note that these data can only be used for non-commercial research purposes, given the IDB data sharing standards, and in order to comply with the commitment acquired by the researchers with the study participants by means of the informed consent.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Big-Math: A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models
Big-Math is the largest open-source dataset of high-quality mathematical problems, curated specifically for reinforcement learning (RL) training in language models. With over 250,000 rigorously filtered and verified problems, Big-Math bridges the gap between quality and quantity, establishing a robust foundation for advancing reasoning in LLMs.
Request Early Access to Private… See the full description on the dataset page: https://huggingface.co/datasets/SynthLabsAI/Big-Math-RL-Verified.
https://spdx.org/licenses/CC0-1.0.html
Offline reinforcement learning (RL) is a promising direction that allows RL agents to be pre-trained from large datasets, avoiding recurrence of expensive data collection. To advance the field, it is crucial to generate large-scale datasets. Compositional RL is particularly appealing for generating such large datasets, since 1) it permits creating many tasks from few components, and 2) the task structure may enable trained agents to solve new tasks by combining relevant learned components. This submission provides four offline RL datasets for simulated robotic manipulation created using the 256 tasks from CompoSuite (Mendez et al., 2022). In every task in CompoSuite, a robot arm is used to manipulate an object to achieve an objective, all while trying to avoid an obstacle. There are four components for each of these four axes that can be combined arbitrarily, leading to a total of 256 tasks. The component choices are:
Robot: IIWA, Jaco, Kinova3, Panda
Object: Hollow box, box, dumbbell, plate
Objective: Push, pick and place, put in shelf, put in trashcan
Obstacle: None, wall between robot and object, wall between goal and object, door between goal and object
The four included datasets are collected using separate agents, each trained to a different degree of performance, and each dataset consists of 256 million transitions. The degrees of performance are expert data, medium data, warmstart data and replay data:
Expert dataset: Transitions from an expert agent that was trained to achieve 90% success on every task.
Medium dataset: Transitions from a medium agent that was trained to achieve 30% success on every task.
Warmstart dataset: Transitions from a Soft Actor-Critic agent trained for a fixed duration of one million steps.
Medium-replay-subsampled dataset: Transitions that were stored during the training of a medium agent up to 30% success.
These datasets are intended for the combined study of compositional generalization and offline reinforcement learning.
Methods: The datasets were collected using several deep reinforcement learning agents trained to the various degrees of performance described above on the CompoSuite benchmark (https://github.com/Lifelong-ML/CompoSuite), which builds on top of robosuite (https://github.com/ARISE-Initiative/robosuite) and uses the MuJoCo simulator (https://github.com/deepmind/mujoco). During reinforcement learning training, we stored the data collected by each agent in a separate buffer for post-processing. Then, after training, to collect the expert and medium datasets, we ran the trained agents for 2000 trajectories of length 500 online in the CompoSuite benchmark and stored the trajectories. These add up to a total of 1 million state-transition tuples per task, totalling a full 256 million datapoints per dataset. The warmstart and medium-replay-subsampled datasets contain trajectories from the stored training buffer of the SAC agent trained for a fixed duration and of the medium agent, respectively. For the medium-replay-subsampled data, we uniformly sample trajectories from the training buffer until we reach more than 1 million transitions. Since some of the tasks have termination conditions, some of these trajectories are truncated and not of length 500. This sometimes results in a number of sampled transitions larger than 1 million. Therefore, after sub-sampling, we artificially truncate the last trajectory and place a timeout at the final position.
This can in some rare cases lead to one incorrect trajectory if the datasets are used for finite horizon experimentation. However, this truncation is required to ensure consistent dataset sizes, easy data readability and compatibility with other standard code implementations. The four datasets are split into four tar.gz folders each yielding a total of 12 compressed folders. Every sub-folder contains all the tasks for one of the four robot arms for that dataset. In other words, every tar.gz folder contains a total of 64 tasks using the same robot arm and four tar.gz files form a full dataset. This is done to enable people to only download a part of the dataset in case they do not need all 256 tasks. For every task, the data is separately stored in an hdf5 file allowing for the usage of arbitrary task combinations and mixing of data qualities across the four datasets. Every task is contained in a folder that is named after the CompoSuite elements it uses. In other words, every task is represented as a folder named
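As an orientation for working with the per-task HDF5 files, the Python sketch below opens one task and reads its transitions with h5py; the folder/file naming and the dataset keys (observations, actions, rewards, terminals) follow common offline RL conventions and are assumptions rather than the confirmed layout.

```python
import h5py
import numpy as np

# Hypothetical task path assembled from the four CompoSuite components of one task.
task_path = "expert/IIWA/IIWA_Box_Push_None.hdf5"

with h5py.File(task_path, "r") as f:
    # Assumed D4RL-style keys; inspect f.keys() on the real files for the actual structure.
    observations = np.asarray(f["observations"])
    actions = np.asarray(f["actions"])
    rewards = np.asarray(f["rewards"])
    terminals = np.asarray(f["terminals"])

print(observations.shape, actions.shape)  # roughly 1 million transitions per task expected
print("mean reward:", rewards.mean())     # quick check on the data quality tier
```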
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A heterogeneous big dataset is presented in this work: electrocardiogram (ECG) signals, blood pressure signals, oxygen saturation (SpO2) signals, and text input. This work is an extended version of our earlier dataset formulation presented in [1], and a trustworthy and relevant medical dataset library (PhysioNet [2]) was used to acquire these signals. The dataset includes medical features from heterogeneous sources (sensory and non-sensory data). First, the ECG sensor signals, which contain QRS width, ST elevation, peak numbers, and cycle interval. Second, the SpO2 level from the SpO2 sensor signals. Third, the blood pressure sensor signals, which contain high (systolic) and low (diastolic) values. Finally, the text input, which constitutes the non-sensory data; the text inputs were formulated based on doctors' diagnostic procedures for chronic heart diseases. A Python software environment was used, and the simulated big data is presented along with analyses.
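To make the heterogeneous structure concrete, the sketch below builds one illustrative record that combines the sensory features and the non-sensory text input described above; the field names and values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Illustrative record mixing sensory features with the non-sensory text input.
record = {
    "qrs_width_ms": 95,           # from the ECG signal
    "st_elevation_mv": 0.1,       # from the ECG signal
    "peak_count": 72,             # from the ECG signal
    "cycle_interval_ms": 830,     # from the ECG signal
    "spo2_percent": 97,           # from the SpO2 signal
    "systolic_mmhg": 128,         # from the blood pressure signal
    "diastolic_mmhg": 82,         # from the blood pressure signal
    "diagnosis_text": "exertional chest discomfort, suspected chronic heart disease",
}
df = pd.DataFrame([record])
print(df.dtypes)  # mixed numeric and text (object) columns in a single frame
```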
The Medium-Term Effects of Home-based Early Childhood Development Intervention Impact Evaluation (ECDIIE) covered 96 small towns in central Colombia, representing a large number of small communities across a relatively big geographical area. It exploited structures in place from the government's Conditional Cash Transfer Programme, Familias en Accion (FeA), which targets the poorest 20% of households in the country.
There are currently three waves of data, a baseline, pre-intervention wave collected between February and June 2010, and a follow-up wave 18 months later between September and December 2011, at the end of the intervention period. The second wave of follow-up data collection occurred 2 years after the first follow-up data collection between September and December 2013.
The beneficiaries of FeA periodically elect a female representative, called the Madre Lider (ML). We randomly selected three MLs from each town (municipality), and then from the families represented by each ML we randomly selected 5 children aged 12 to 24 months to be eligible for the intervention. Within each municipality, eligible households were randomly allocated (at the municipality level) to each of the following treatment arms:
Control
Stimulation + Supplementation
Stimulation
Supplementation
The stimulation intervention consisted of weekly visits to the homes of the target children, each visit lasting around one hour. The home visitors received a three-week training programme in activities designed to stimulate children at different ages. They also received a weekly curriculum as a guide, and a set of locally produced materials (homemade toys from recycling material, picture books, puzzles, etc.).
The supplementation arm consisted of providing daily sachets of multiple micronutrient powder to mothers, via the home visitors, to add to the target child's food. Sachets were designed to provide iron (12.5mg), zinc (5mg), Vitamin A (300 µg retinol equivalent), Vitamin C (30mg) and folic acid (160 µg) for the children targeted.
Sample survey data [ssd]
The survey used a randomized experimental design to obtain rigorous and unbiased estimates of the impact of the stimulation and nutrition interventions, and of their interaction.
Computer Assisted Personal Interview [capi]
At follow-up, a variety of developmental indicators were collected, including the Bayley test (for cognitive, language and motor development), the MacArthur Communicative Development Inventories (for vocabulary and expressive language), the Bates Infant Characteristics Questionnaire (for temperament), and the Rothbart Infant Behaviour Questionnaires (for attention focusing, inhibitory control and sociability, amongst other socio-emotional traits). These data were again complemented by an extensive socio-economic questionnaire which included information on parental investments, time use and so on.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Child-staff ratios are a key quality indicator in early childhood education and care (ECEC) programs. Better ratios are believed to improve child outcomes by increasing opportunities for individual interactions and educational instruction from staff. The purpose of this systematic review, and where possible, meta-analysis, was to evaluate the association between child-staff ratios in preschool ECEC programs and children's outcomes. Searches of Medline, PsycINFO, ERIC, websites of large datasets, and reference sections of all retrieved articles were conducted up to July 3, 2015. Cross-sectional or longitudinal studies that evaluated the relationship between child-staff ratios in ECEC classrooms serving preschool-aged children and child outcomes were independently identified by two reviewers. Data were independently extracted from included studies by two raters and differences between raters were resolved by consensus. Searches revealed 29 eligible studies (31 samples). Child-staff ratios ranged from 5 to 14.5 preschool-aged children per adult, with a mean of 8.65. All 29 studies were included in the systematic review. However, the only meta-analysis that could be conducted was based on three studies that explored associations between ratios and children's receptive language. Results of this meta-analysis were not significant. Results of the qualitative systematic review revealed few significant relationships between child-staff ratios and child outcomes construed broadly. Thus, the available literature reveals few, if any, relationships between child-staff ratios in preschool ECEC programs and children's developmental outcomes. Substantial heterogeneity in the assessment of ratios, outcomes measured, and statistics used to capture associations limited quantitative synthesis. Other methodological limitations of the research integrated in this synthesis are discussed.