44 datasets found

f
Restaurant Menu (Data Cleaning)
rochester.figshare.com
txt
Updated Sep 17, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Aabha Pandit; Alois Romanowski; Heather Owen (2025). Restaurant Menu (Data Cleaning) [Dataset]. http://doi.org/10.60593/ur.d.26462404.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.60593/ur.d.26462404.v1
Dataset updated
Sep 17, 2025
Dataset provided by
University of Rochester
Authors
Aabha Pandit; Alois Romanowski; Heather Owen
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Restaurant Menu DatasetWith approximately 45,000 menus dating from the 1840s to the present, The New York Public Library’s restaurant menu collection is one of the largest in the world. The menu data has been transcribed, dish by dish, into this dataset. For more information, please see http://menus.nypl.org/about.This dataset is not clean and contains many missing values, making it perfect to practice data cleaning tools and techniques.Dataset Variables:id: identifier for menuname: sponsor: who sponsored the meal (organizations, people, name of restaurant)event: categoryvenue: type of place (commercial, social, professional)place: where the meal took place (often a geographic location)physical_description: dimension and material description of the menuoccasion: occasion of the meal (holidays, anniversaries, daily)notes: notes by librarians about the original materialcall_number: call number of the menukeywords: language: date: date of the menulocation: organization or business who produced the menulocation_typecurrency: system of money the menu uses (dollars, etc)currency_symbol: symbol for the currency ($, etc)status: completeness of the menu transcription (transcribed, under review, etc)page_count: how many pages the menu hasdish_count: how many dishes the menu has
B
Data Cleaning Sample
borealisdata.ca
dataone.org
Updated Jul 13, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.5683/SP3/ZCN177
Dataset updated
Jul 13, 2023
Dataset provided by
Borealis
Authors
Rong Luo
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Sample data for exercises in Further Adventures in Data Cleaning.
MESS MDIS MAP PROJ HIGH-INCIDENCE BASEMAP EAST RDR V1.0
catalog.data.gov
s.cnmilf.com
+2more
Updated Aug 23, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
National Aeronautics and Space Administration (2025). MESS MDIS MAP PROJ HIGH-INCIDENCE BASEMAP EAST RDR V1.0 [Dataset]. https://catalog.data.gov/dataset/mess-mdis-map-proj-high-incidence-basemap-east-rdr-v1-0-6a07a
Explore at:
Dataset updated
Aug 23, 2025
Dataset provided by
NASAhttp://nasa.gov/
Description
Abstract ======== The Mercury Dual Imaging System (MDIS) consists of two cameras, a Wide Angle Camera (WAC) and a Narrow Angle Camera (NAC), mounted on a common pivot platform. This dataset includes Map Projected High- Incidence Angle Basemap Illuminated from the East RDRs (HIEs) which comprise a global map of I/F measured by the NAC or WAC filter 7 (both centered near 750 nm) during the the Extended Mission at high incidence angles to accentuate subtle topography, photometrically normalized to a solar incidence angle (i) = 30 degrees, emission angle (e) = 0 degrees, and phase angle (g) = 30 degrees at a spatial sampling of 256 pixels per degree. The HIE data set is a companion to the Map Projected High-Incidence Angle Basemap Illluminated from the West RDR (HIW) data set. Together the two data sets are intended to detect and allow the mapping of subtle topography. They complement a Basemap Data Record (BDR) data set also composed of WAC filter 7 and NAC images acquired at moderate/high solar incidence angles centered near 68 degrees (changed to 74 degrees in the final end-of-mission data delivery), and a Low Incidence Angle (LOI) data set also composed of WAC filter 7 and NAC images acquired at lower incidence centered near 45 degrees, analogous to the geometry used for color imaging. The map is divided into 54 'tiles', each representing the NW, NE, SW, or SE quadrant of one of the 13 non-polar or one of the 2 polar quadrangles or 'Mercury charts' already defined by the USGS. Each tile also contains 5 backplanes: observation ID; BDR metric, a metric used to determine the stacking order of component images, modified for the higher incidence angle centered near 78 degrees; solar incidence angle; emission angle; and phase angle.
e
Knowledge Complexity - Dataset - B2FIND
b2find.eudat.eu
Updated Jul 24, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2025). Knowledge Complexity - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/830eedee-2595-5f27-88a8-e7201250e11a
Explore at:
Dataset updated
Jul 24, 2025
Description
KPLEX is funded under the European Commission’s Horizon 2020 research programme to undertake a 15-month investigation of the ways in which a focus on ‘big data’ in ICT research elides important issues about the information environment that we live in. While the phrase may sound inclusive and integrative, in fact, ‘big data’ approaches are highly selective, excluding any input that cannot be effectively structured, represented, or, indeed, digitised.Data of this messy, dirty sort is precisely the kind that humanities and cultural researchers deal with best. It will therefore be the contribution of the KPLEX project to investigate these elements of humanities and cultural data, and the strategies researchers have developed to deal with them. In doing so it will remain at the margins of ICT so as to better shed light on the gap between analogue or augmented digital practices and fully computational ones. As such, it will expand our awareness of the risks inherent in big data and to suggest ways in which phenomena that resist datafication can still be represented (if only by their absence) in knowledge creation approaches reliant upon the interrogation of large data corpora.KPLEX approaches this challenge in a comparative, multidisciplinary and multisectoral fashion, focusing on 3 key challenges to the knowledge creation capacity of big data approaches: the manner in which data that are not digitised or shared become ‘hidden’ from aggregation systems; the fact that data is human created, and lacks the objectivity often ascribed to the term; the subtle ways in which data that are complex almost always become simplified before they can be aggregated. It will approach these questions via a humanities research perspective, but using social science research tools to look at both the humanistic and computer science approaches to the term ‘data’ and its many possible meanings and implications.As such, KPLEX project defines and describes key aspects of data that are at risk of being left out of our knowledge creation processes in a system where large scale data aggregation is becoming ever more the gold standard.
Data from: LifeSnaps: a 4-month multi-modal dataset capturing unobtrusive...
zenodo.org
data.europa.eu
zip
Updated Oct 20, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sofia Yfantidou; Sofia Yfantidou; Christina Karagianni; Stefanos Efstathiou; Stefanos Efstathiou; Athena Vakali; Athena Vakali; Joao Palotti; Joao Palotti; Dimitrios Panteleimon Giakatos; Dimitrios Panteleimon Giakatos; Thomas Marchioro; Thomas Marchioro; Andrei Kazlouski; Elena Ferrari; Šarūnas Girdzijauskas; Šarūnas Girdzijauskas; Christina Karagianni; Andrei Kazlouski; Elena Ferrari (2022). LifeSnaps: a 4-month multi-modal dataset capturing unobtrusive snapshots of our lives in the wild [Dataset]. http://doi.org/10.5281/zenodo.6832242
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.6832242
Dataset updated
Oct 20, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Sofia Yfantidou; Sofia Yfantidou; Christina Karagianni; Stefanos Efstathiou; Stefanos Efstathiou; Athena Vakali; Athena Vakali; Joao Palotti; Joao Palotti; Dimitrios Panteleimon Giakatos; Dimitrios Panteleimon Giakatos; Thomas Marchioro; Thomas Marchioro; Andrei Kazlouski; Elena Ferrari; Šarūnas Girdzijauskas; Šarūnas Girdzijauskas; Christina Karagianni; Andrei Kazlouski; Elena Ferrari
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
LifeSnaps Dataset Documentation

Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet, limited data exist on the association between in the wild large-scale physical activity patterns, sleep, stress, and overall health, and behavioral patterns and psychological measurements due to challenges in collecting and releasing such datasets, such as waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset, a multi-modal, longitudinal, and geographically-distributed dataset, containing a plethora of anthropological data, collected unobtrusively for the total course of more than 4 months by n=71 participants, under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types from second to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to make these data available openly to empower future research. We envision that releasing this large-scale dataset of multi-modal real-world data, will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction.

The following instructions will get you started with the LifeSnaps dataset and are complementary to the original publication.

Data Import: Reading CSV

For ease of use, we provide CSV files containing Fitbit, SEMA, and survey data at daily and/or hourly granularity. You can read the files via any programming language. For example, in Python, you can read the files into a Pandas DataFrame with the pandas.read_csv() command.

Data Import: Setting up a MongoDB (Recommended)

To take full advantage of the LifeSnaps dataset, we recommend that you use the raw, complete data via importing the LifeSnaps MongoDB database.

To do so, open the terminal/command prompt and run the following command for each collection in the DB. Ensure you have MongoDB Database Tools installed from here.

For the Fitbit data, run the following:

mongorestore --host localhost:27017 -d rais_anonymized -c fitbit

For the SEMA data, run the following:

mongorestore --host localhost:27017 -d rais_anonymized -c sema

For surveys data, run the following:

mongorestore --host localhost:27017 -d rais_anonymized -c surveys

If you have access control enabled, then you will need to add the --username and --password parameters to the above commands.

Data Availability

The MongoDB database contains three collections, fitbit, sema, and surveys, containing the Fitbit, SEMA3, and survey data, respectively. Similarly, the CSV files contain related information to these collections. Each document in any collection follows the format shown below:

{ _id:
e
MESS: Multi-purpose Exoplanet Simulation System - Dataset - B2FIND
b2find.eudat.eu
Updated Oct 22, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2023). MESS: Multi-purpose Exoplanet Simulation System - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/f12e6f7a-1f5b-58e6-9c7e-f94f437457d1
Explore at:
Dataset updated
Oct 22, 2023
Description
The high number of planet discoveries made in the last years provides a good sample for statistical analysis, leading to some clues on the distributions of planet parameters, like masses and periods, at least in close proximity to the host star. We likely need to wait for the extremely large telescopes (ELTs) to have an overall view of the extrasolar planetary systems. In this context it would be useful to have a tool that can be used for the interpretation of the present results, and also to predict what the outcomes would be of the future instruments. For this reason we built MESS: a Monte Carlo simulation code which uses either the results of the statistical analysis of the properties of discovered planets, or the results of the planet formation theories, to build synthetic planet populations fully described in terms of frequency, orbital elements and physical properties. They can then be used to either test the consistency of their properties with the observed population of planets given different detection techniques or to actually predict the expected number of planets for future surveys. In addition to the code description, we present here some of its applications to actually probe the physical and orbital properties of a putative companion within the circumstellar disk of a given star and to test constrain the orbital distribution properties of a potential planet population around the members of the TW Hydrae association. Finally, using in its predictive mode, the synergy of future space and ground-based telescopes instrumentation has been investigated to identify the mass-period parameter space that will be probed in future surveys for giant and rocky planets.
h
Data from: imdb
huggingface.co
Updated Aug 3, 2003
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Stanford NLP (2003). imdb [Dataset]. https://huggingface.co/datasets/stanfordnlp/imdb
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 3, 2003
Dataset authored and provided by
Stanford NLP
License
https://choosealicense.com/licenses/other/https://choosealicense.com/licenses/other/
Description
Dataset Card for "imdb"

Dataset Summary

Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.

Supported Tasks and Leaderboards

More Information Needed

Languages

More Information Needed

Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/imdb.
Walmart Dataset
kaggle.com
Updated Dec 26, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
M Yasser H (2021). Walmart Dataset [Dataset]. https://www.kaggle.com/datasets/yasserh/walmart-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Dec 26, 2021
Dataset provided by
Kagglehttp://kaggle.com/
Authors
M Yasser H
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
https://raw.githubusercontent.com/Masterx-AI/Project_Retail_Analysis_with_Walmart/main/Wallmart1.jpg" alt="">

Description:

One of the leading retail stores in the US, Walmart, would like to predict the sales and demand accurately. There are certain events and holidays which impact sales on each day. There are sales data available for 45 stores of Walmart. The business is facing a challenge due to unforeseen demands and runs out of stock some times, due to the inappropriate machine learning algorithm. An ideal ML algorithm will predict demand accurately and ingest factors like economic conditions including CPI, Unemployment Index, etc.

Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of all, which are the Super Bowl, Labour Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks. Part of the challenge presented by this competition is modeling the effects of markdowns on these holiday weeks in the absence of complete/ideal historical data. Historical sales data for 45 Walmart stores located in different regions are available.

Acknowledgements

The dataset is taken from Kaggle.

Objective:

Understand the Dataset & cleanup (if required).

Build Regression models to predict the sales w.r.t single & multiple features.

Also evaluate the models & compare their respective scores like R2, RMSE, etc.
HOSPI-Tools Dataset - DSLR
zenodo.org
zip
Updated Jul 5, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mark Rodrigues; Mark Rodrigues (2022). HOSPI-Tools Dataset - DSLR [Dataset]. http://doi.org/10.5281/zenodo.5895068
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.5895068
Dataset updated
Jul 5, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Mark Rodrigues; Mark Rodrigues
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We are working to develop a comprehensive dataset of surgical tools based on specialities, with a hierarchical structure – speciality, pack, set and tool. We belive that this dataset can be useful for computer vision and deep learning research into surgical tool tracking, management and surgical training and audit. We have therefore created an initial dataset of surgical tool (instrument and implant) images, captured using under different lighting conditions and with different backgrounds. We captured RGB images of surgical tools using a DSLR camera and webcam on site in a major hospital under realistic conditions and with the surgical tools currently in use. Image backgrounds in our initial dataset were essentially flat colours, even though different colour backgrounds were used. As we further developed our dataset, we will try to include much greater occlusions, illumination changes, and the presence of blood, tissue and smoke in the images which would be more reflective of crowded, messy, real-world conditions.

Illumination sources included natural light – direct sunlight and shaded light – LED, halogen and fluorescent lighting, and this accurately reflected the illumination working conditions within the hospital. Distances of the surgical tools to the camera to the object ranged from 60 to 150 cms., and the average class size was 74 images. Images captured included individual object images as well as cluttered, clustered and occluded objects. Our initial focus was on Orthopaedics and General Surgery, two out of the 14 surgical specialities. We selected these specialities since general surgery instruments are the most commonly used tools across all surgeries and provide instrument volume, while orthopaedics provides variety and complexity given the wide range of procedures, instruments and implants used in orthopaedic surgery. We will add other specialities as we develop this dataset, to reflect the complexities inherent in each of the surgical specialities. This dataset was designed to offer a large variety of tools, arranged hierarchically to reflect how surgical tools are organised in real-world conditions.

If you do find our dataset useful, please cite our papers in your work:

Rodrigues, M., Mayo, M, and Patros, P. (2022). OctopusNet: Machine Learning for Intelligent Management of Surgical Tools. Published in “Smart Health”, Volume 23, 2022. https://doi.org/10.1016/j.smhl.2021.100244

Rodrigues, M., Mayo, M, and Patros, P. (2021). Evaluation of Deep Learning Techniques on a Novel Hierarchical Surgical Tool Dataset. Accepted paper at The 2021 Australasian Joint Conference on Artificial Intelligence. 2021. To be Published in Lecture Notes in Computer Science series.

Rodrigues, M., Mayo, M, and Patros, P. (2021). Interpretable deep learning for surgical tool management. In M. Reyes, P. Henriques Abreu, J. Cardoso, M. Hajij, G. Zamzmi, P. Rahul, and L. Thakur (Eds.), Proc 4th International Workshop on Interpretability of Machine Intelligence in Medical Image Computing (iMIMIC 2021) LNCS 12929 (pp. 3-12). Cham: Springer.
R
Solar_panel_combine Dataset
universe.roboflow.com
zip
Updated Sep 4, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
home (2022). Solar_panel_combine Dataset [Dataset]. https://universe.roboflow.com/home-ocdun/solar_panel_combine/dataset/1
Explore at:
zipAvailable download formats
Dataset updated
Sep 4, 2022
Dataset authored and provided by
home
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Solar Panel Solar Panel1 Bounding Boxes
Description
Here are a few use cases for this project:

Solar Panel Maintenance: The model could be used by solar panel service providers to automate the process of assessment and maintenance. By analyzing the state of the panels (clean, unclean, or dusty) it can help them identify which panels need immediate cleaning or service.

Industrial Inspection: In facilities with a large number of solar panels such as solar farms, the model could assist in streamlining routine checks. Rather than manual inspection, images can be taken and analyzed for cleanliness, helping to efficiently allocate cleaning resources and maintain optimum efficiency.

Home Automation Systems: The model could be integrated into smart home systems to alert homeowners when their solar panels are dirty or dusty. It can act as a smart tool for homes using solar energy as one of their primary energy sources.

Drone-based Inspection: For large scale solar installations in hard-to-reach areas (e.g. large roofs, deserts), drones equipped with cameras and the computer vision model can perform inspections. This can be safer and more effective, with the AI determining the status of each panel.

Educational Purposes: This computer vision model could be used as a teaching tool in educational institutions for courses related to renewable energy. It can demonstrate the importance of solar panel cleanliness in energy efficiency, encouraging students to engage with practical, real-world issues in their learning.
MISR Aerosol Climatology Product V001 - Dataset - NASA Open Data Portal
data.nasa.gov
data.staging.idas-ds1.appdat.jsc.nasa.gov
Updated Apr 1, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
nasa.gov (2025). MISR Aerosol Climatology Product V001 - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/misr-aerosol-climatology-product-v001-e3cea
Explore at:
Dataset updated
Apr 1, 2025
Dataset provided by
NASAhttp://nasa.gov/
Description
MIANACP_1 is the Multi-angle Imaging SpectroRadiometer (MISR) Aerosol Climatology Product version 1. It is 1) the microphysical and scattering characteristics of pure aerosol upon which routine retrievals are based, 2) mixtures of pure aerosol to be compared with MISR observations, and 3) the likelihood value assigned to each mode geographically. The ACP describes mixtures of up to three component aerosol types from a list of eight components in varying proportions. ACP component aerosol particle data quality depends on the ACP input data, which are based on aerosol particles described in the literature and consider MISR-specific sensitivity to particle size, single-scattering albedo, and shape, and shape - roughly: small, medium, and large; dirty and clean; spherical and nonspherical [Kahn et al., 1998; 2001]. Also reported in the ACP are the mixtures of these components used by the retrieval algorithm. The MISR instrument consists of nine push-broom cameras that measure radiance in four spectral bands. Global coverage is achieved in nine days. The cameras are arranged with one camera pointing toward the nadir, four forward, and four aftward. It takes seven minutes for all nine cameras to view the same surface location. The view angles relative to the surface reference ellipsoid are 0, 26.1, 45.6, 60.0, and 70.5 degrees. The spectral band shapes are nominally Gaussian, centered at 443, 555, 670, and 865 nm.
Z
Data from: CURE-OR: Challenging Unreal and Real Environments for Object...
data.niaid.nih.gov
Updated Jun 28, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dogancan Temel (2020). CURE-OR: Challenging Unreal and Real Environments for Object Recognition [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3888508
Explore at:
Dataset updated
Jun 28, 2020
Dataset provided by
Jinsol Lee
Ghassan AlRegib
Dogancan Temel
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
As one of the research directions at OLIVES Lab @ Georgia Tech, we focus on the robustness of data-driven algorithms under diverse challenging conditions where trained models can possibly be depolyed. To achieve this goal, we introduced a large-sacle (1.M images) object recognition dataset (CURE-OR) which is among the most comprehensive datasets with controlled synthetic challenging conditions. In CURE-OR dataset, there are 1,000,000 images of 100 objects with varying size, color, and texture, captured with multiple devices in different setups. The majority of images in the dataset were acquired with smartphones and tested with off-the-shelf applications to benchmark the recognition performance of devices and applications that are used in our daily lives. Please refer to our GitHub page for code, papers, and more information. Some data specifications are provided below:

Image Name Format :

"backgroundID_deviceID_objectOrientationID_objectID_challengeType_challengeLevel.jpg"

Background ID:

1: White 2: Texture 1 - living room 3: Texture 2 - kitchen 4: 3D 1 - living room 5: 3D 2 – office

Object Orientation ID:

1: Front (0 º) 2: Left side (90 º) 3: Back (180 º) 4: Right side (270 º) 5: Top

Object ID:

1-100

Challenge Type:

No challenge 02: Resize 03: Underexposure 04: Overexposure 05: Gaussian blur 06: Contrast 07: Dirty lens 1 08: Dirty lens 2 09: Salt & pepper noise 10: Grayscale 11: Grayscale resize 12: Grayscale underexposure 13: Grayscale overexposure 14: Grayscale gaussian blur 15: Grayscale contrast 16: Grayscale dirty lens 1 17: Grayscale dirty lens 2 18: Grayscale salt & pepper noise

Challenge Level:

A number between [0, 5], where 0 indicates no challenge, 1 the least severe and 5 the most severe challenge. Challenge type 1 (no challenge) and 10 (grayscale) has a level of 0 only. Challenge types 2 (resize) and 11 (grayscale resize) has 4 levels (1 through 4). All other challenges have levels 1 to 5.
h
peoples_speech_v1.0
huggingface.co
Updated Nov 12, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
MLCommons (2022). peoples_speech_v1.0 [Dataset]. https://huggingface.co/datasets/MLCommons/peoples_speech_v1.0
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Nov 12, 2022
Dataset authored and provided by
MLCommons
License
Attribution 2.0 (CC BY 2.0)https://creativecommons.org/licenses/by/2.0/
License information was derived automatically
Description
Dataset Card for People's Speech

Dataset Summary

The People's Speech Dataset is among the world's largest English speech recognition corpus today that is licensed for academic and commercial usage under CC-BY-SA and CC-BY 4.0. It includes 30,000+ hours of transcribed speech in English languages with a diverse set of speakers. This open dataset is large enough to train speech-to-text systems and crucially is available with a permissive license.

Supported Tasks… See the full description on the dataset page: https://huggingface.co/datasets/MLCommons/peoples_speech_v1.0.
e
Interview data from 'New norms and forms of development' - Dataset - B2FIND
b2find.eudat.eu
Updated Dec 24, 2014
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2014). Interview data from 'New norms and forms of development' - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/fc9397e8-acdd-599d-89c2-a3755dbfd35c
Explore at:
Dataset updated
Dec 24, 2014
Description
This data contains findings of the study on outsourcing of external development assistance in maternal and child health (MCH) in Malawi and Nepal. It outlines the institutional modalities and norms guiding the financing and delivery of MCH projects and programmes. First, our study of external development assistance reveals a messy assemblage of actors, institutional arrangements and activities informed by the norms: ‘value for money’ and ‘measurable results’. Second, we found that for development assistance to function effectively it is not just about the flow of financial resources to a project or a programme but also about networks and key personal and institutional relationships. Third, we found that there is increasing political pressure to show that the disbursement of resources are linked to the achievement of measurable results.Donors and international organizations involved in dispersing foreign aid now routinely employ contracts with service providers, both for-profit and non-profit, to carry out functions relating to international health service development and delivery. This outsourcing of foreign aid via contractual arrangements and partnerships is linked to a discourse on public sector reform in order to secure value for money, enhance aid efficiency and achieve the most impact with limited resources. These intermediaries include non-profits, private contractors, management consultancies, advocacy groups, research organizations, think tanks and educational institutions among others. They employ tens of thousands of expert professionals, operating within the state apparatus or as outside technical support, and advise, consult and serve in various official capacities and contribute to health service development and to the delivery of projects. They occupy and link the space between the funders and beneficiaries/target groups translating the meanings and processes of development. Sceptics have argued that much of foreign development aid is actually a giveaway to large contractors and sub-contractors. However, these intermediaries are the key actors whose function is critical in bringing together innovation, expertise, resources and political networks from different institutions to contribute to global development objectives such as the Millennium Development Goals (MDG). Through its focus on the role and functions of different types of institutions and professionals who broker health sector development projects and programmes, the research aims to understand the nature of mediation and translation involved in that process and the difference these actors make in meeting the global health development objectives. In this research we explore this phenomena for Maternal and Child Health, comparing the processes in the countries of Malawi and Nepal. Both of these countries have achieved strides towards achieving their goals for MDG 5 (focused on Improving Maternal Health by reducing by three quarters the maternal mortality ratio, and achieving universal access to reproductive health), and have been the focus of sustained resource input from both USAID and UKaid for these aims. To do this, first we will map the institutional terrain around this, and then explore - using ethnographic techniques and semi-structured interviews with those involved in the delivery of the MCH programmes - to compare and contrast this developmental landscape. A key aim is to use the research to inform policy makers in the donor community, and the respective governments, of the best institutional relations for this; in short, what works best, and what less well. To do this, we have brought together a team of researchers and in country research partners with significant expertise and experience in carrying out research and public engagement in the health sector development in Nepal and Malawi. Central to this research, we will run inception workshops in both countries to inform the aims of the research, and define the research direction. Continued engagement with the key stakeholders will culminate in dissemination workshops designed to inform policy and future direction in the arena of MCH. The research involved two key stages, mapping (of current Maternal and Child Health Service Providers and preparing a database or a list of intermediary institutions in Nepal and Malawi) and case studies. For the case studies, we interviewed both senior members of these organizations and the people who do the on the ground work, in order to find out the ‘how and why’ they work the way they do, and the problems they face in doing so. For more information, please see the Project information file attached.
f
Data from: A Data Science Course for Undergraduates: Thinking With Data
tandf.figshare.com
pdf
Updated Jun 2, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ben Baumer (2023). A Data Science Course for Undergraduates: Thinking With Data [Dataset]. http://doi.org/10.6084/m9.figshare.1568372.v2
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.1568372.v2
Dataset updated
Jun 2, 2023
Dataset provided by
Taylor & Francis
Authors
Ben Baumer
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data science is an emerging interdisciplinary field that combines elements of mathematics, statistics, computer science, and knowledge in a particular application domain for the purpose of extracting meaningful information from the increasingly sophisticated array of data available in many settings. These data tend to be nontraditional, in the sense that they are often live, large, complex, and/or messy. A first course in statistics at the undergraduate level typically introduces students to a variety of techniques to analyze small, neat, and clean datasets. However, whether they pursue more formal training in statistics or not, many of these students will end up working with data that are considerably more complex, and will need facility with statistical computing techniques. More importantly, these students require a framework for thinking structurally about data. We describe an undergraduate course in a liberal arts environment that provides students with the tools necessary to apply data science. The course emphasizes modern, practical, and useful skills that cover the full data analysis spectrum, from asking an interesting question to acquiring, managing, manipulating, processing, querying, analyzing, and visualizing data, as well communicating findings in written, graphical, and oral forms. Supplementary materials for this article are available online. [Received June 2014. Revised July 2015.]
Student Admission Records
kaggle.com
Updated Nov 8, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zeeshan Ahmad (2024). Student Admission Records [Dataset]. https://www.kaggle.com/datasets/zeeshier/student-admission-records
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Nov 8, 2024
Dataset provided by
Kaggle
Authors
Zeeshan Ahmad
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset is crafted for beginners to practice data cleaning and preprocessing techniques in machine learning. It contains 157 rows of student admission records, including duplicate rows, missing values, and some data inconsistencies (e.g., outliers, unrealistic values). It’s ideal for practicing common data preparation steps before applying machine learning algorithms.

The dataset simulates a university admission record system, where each student’s admission profile includes test scores, high school percentages, and admission status. The data contains realistic flaws often encountered in raw data, offering hands-on experience in data wrangling.

The dataset contains the following columns:

Name: Student's first name (Pakistani names). Age: Age of the student (some outliers and missing values). Gender: Gender (Male/Female). Admission Test Score: Score obtained in the admission test (includes outliers and missing values). High School Percentage: Student's high school final score percentage (includes outliers and missing values). City: City of residence in Pakistan. Admission Status: Whether the student was accepted or rejected.
h
toxic-chat
huggingface.co
Updated Jan 25, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Large Model Systems Organization (2024). toxic-chat [Dataset]. https://huggingface.co/datasets/lmsys/toxic-chat
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jan 25, 2024
Dataset authored and provided by
Large Model Systems Organization
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Update

[01/31/2024] We update the OpenAI Moderation API results for ToxicChat (0124) based on their updated moderation model on on Jan 25, 2024.[01/28/2024] We release an official T5-Large model trained on ToxicChat (toxicchat0124). Go and check it for you baseline comparision![01/19/2024] We have a new version of ToxicChat (toxicchat0124)!

Content

This dataset contains toxicity annotations on 10K user prompts collected from the Vicuna online demo. We utilize a human-AI… See the full description on the dataset page: https://huggingface.co/datasets/lmsys/toxic-chat.
Z
CURE-TSR: Challenging Unreal and Real Environments for Traffic Sign...
data.niaid.nih.gov
zenodo.org
Updated Jun 28, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mohit Prabhushankar (2020). CURE-TSR: Challenging Unreal and Real Environments for Traffic Sign Recognition [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3903065
Explore at:
Dataset updated
Jun 28, 2020
Dataset provided by
Ghassan AlRegib
Gukyeong Kwon
Mohit Prabhushankar
Dogancan Temel
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
As one of the research directions at OLIVES Lab @ Georgia Tech, we focus on the robustness of data-driven algorithms under diverse challenging conditions where trained models can possibly be depolyed. To achieve this goal, we introduced a large-sacle (>2M images) traffic sign recognition dataset (CURE-TSR) which is among the most comprehensive datasets with controlled synthetic challenging conditions. Traffic sign images in the CURE-TSR dataset were cropped from the CURE-TSD dataset, which includes around 1.7 million real-world and simulator images with more than 2 million traffic sign instances. Real-world images were obtained from the BelgiumTS video sequences and simulated images were generated with the Unreal Engine 4 game development tool. Sign types include speed limit, goods vehicles, no overtaking, no stopping, no parking, stop, bicycle, hump, no left, no right, priority to, no entry, yield, and parking. Unreal and real sequences were processed with state-of-the-art visual effect software Adobe(c) After Effects to simulate challenging conditions, which include rain, snow, haze, shadow, darkness, brightness, blurriness, dirtiness, colorlessness, sensor and codec errors. Please refer to our GitHub page for code, papers, and more information.

Instructions:

The name format of the provided images are as follows: "sequenceType_signType_challengeType_challengeLevel_Index.bmp"

sequenceType: 01 - Real data 02 - Unreal data

signType: 01 - speed_limit 02 - goods_vehicles 03 - no_overtaking 04 - no_stopping 05 - no_parking 06 - stop 07 - bicycle 08 - hump 09 - no_left 10 - no_right 11 - priority_to 12 - no_entry 13 - yield 14 - parking

challengeType: 00 - No challenge 01 - Decolorization 02 - Lens blur 03 - Codec error 04 - Darkening 05 - Dirty lens 06 - Exposure 07 - Gaussian blur 08 - Noise 09 - Rain 10 - Shadow 11 - Snow 12 - Haze

challengeLevel: A number in between [01-05] where 01 is the least severe and 05 is the most severe challenge.

Index: A number shows different instances of traffic signs in the same conditions.
g
CARMA, United States Power Plant Emissions, United States, 2000/2007/Future
geocommons.com
Updated May 2, 2008
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
data (2008). CARMA, United States Power Plant Emissions, United States, 2000/2007/Future [Dataset]. http://geocommons.com/search.html
Explore at:
Dataset updated
May 2, 2008
Dataset provided by
CARMA
data
Description
All the data for this dataset is provided from CARMA: Data from CARMA (www.carma.org) This dataset provides information about Power Plant emissions in the USA. Power Plant emissions from all power plants in the United Staes were obtained by CARMA for the past (2000 Annual Report), the present (2007 data), and the future. CARMA determine data presented for the future to reflect planned plant construction, expansion, and retirement. The dataset provides the name, company, parent company, city, state, zip, county, metro area, lat/lon, and plant id for each individual power plant. The dataset reports for the three time periods: Intensity: Pounds of CO2 emitted per megawatt-hour of electricity produced. Energy: Annual megawatt-hours of electricity produced. Carbon: Annual carbon dioxide (CO2) emissions. The units are short or U.S. tons. Multiply by 0.907 to get metric tons. Carbon Monitoring for Action (CARMA) is a massive database containing information on the carbon emissions of over 50,000 power plants and 4,000 power companies worldwide. Power generation accounts for 40% of all carbon emissions in the United States and about one-quarter of global emissions. CARMA is the first global inventory of a major, sector of the economy. The objective of CARMA.org is to equip individuals with the information they need to forge a cleaner, low-carbon future. By providing complete information for both clean and dirty power producers, CARMA hopes to influence the opinions and decisions of consumers, investors, shareholders, managers, workers, activists, and policymakers. CARMA builds on experience with public information disclosure techniques that have proven successful in reducing traditional pollutants. Please see carma.org for more information http://carma.org/region/detail/202
m
Warpaint London PLC - Total-Long-Term-Debt
macro-rankings.com
csv, excel
Updated Sep 18, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
macro-rankings (2025). Warpaint London PLC - Total-Long-Term-Debt [Dataset]. https://www.macro-rankings.com/markets/stocks/w7l-lse/balance-sheet/total-long-term-debt
Explore at:
csv, excelAvailable download formats
Dataset updated
Sep 18, 2025
Dataset authored and provided by
macro-rankings
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
uk
Description
Total-Long-Term-Debt Time Series for Warpaint London PLC. Warpaint London PLC, together with its subsidiaries, produces and sells cosmetics. It operates through two segments, Branded and Close-Out. The company offers cosmetic, female beauty, male grooming, self-tan, and skincare products under the W7, Technic, Man'stuff, Body Collection, Chit Chat, Skin & Tan, Super Facialist, Dirty Works, Root Perfect, Fish Soho, and MR Solutions brand names. It also provides supply chain management services, as well as is involved in holding company activities and wholesale business. Warpaint London PLC sells its products to retailers, distributors, supermarkets, and retail chains. The company operates in the United Kingdom, the rest of Europe, Spain, Denmark, the United States, Australia, New Zealand, and internationally. The company was founded in 1992 and is headquartered in Iver, the United Kingdom.

Facebook

Twitter

Click to copy link

Link copied

Cite

Aabha Pandit; Alois Romanowski; Heather Owen (2025). Restaurant Menu (Data Cleaning) [Dataset]. http://doi.org/10.60593/ur.d.26462404.v1

Restaurant Menu (Data Cleaning)

Explore at:

txtAvailable download formats

Unique identifier

https://doi.org/10.60593/ur.d.26462404.v1

Dataset updated

Sep 17, 2025

Dataset provided by

University of Rochester

Authors

Aabha Pandit; Alois Romanowski; Heather Owen

License

CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically

Description

Restaurant Menu DatasetWith approximately 45,000 menus dating from the 1840s to the present, The New York Public Library’s restaurant menu collection is one of the largest in the world. The menu data has been transcribed, dish by dish, into this dataset. For more information, please see http://menus.nypl.org/about.This dataset is not clean and contains many missing values, making it perfect to practice data cleaning tools and techniques.Dataset Variables:id: identifier for menuname: sponsor: who sponsored the meal (organizations, people, name of restaurant)event: categoryvenue: type of place (commercial, social, professional)place: where the meal took place (often a geographic location)physical_description: dimension and material description of the menuoccasion: occasion of the meal (holidays, anniversaries, daily)notes: notes by librarians about the original materialcall_number: call number of the menukeywords: language: date: date of the menulocation: organization or business who produced the menulocation_typecurrency: system of money the menu uses (dollars, etc)currency_symbol: symbol for the currency ($, etc)status: completeness of the menu transcription (transcribed, under review, etc)page_count: how many pages the menu hasdish_count: how many dishes the menu has

Clear search

Close search

Google apps

Main menu

Restaurant Menu (Data Cleaning)

Data Cleaning Sample

MESS MDIS MAP PROJ HIGH-INCIDENCE BASEMAP EAST RDR V1.0

Knowledge Complexity - Dataset - B2FIND

Data from: LifeSnaps: a 4-month multi-modal dataset capturing unobtrusive...

MESS: Multi-purpose Exoplanet Simulation System - Dataset - B2FIND

Data from: imdb

Walmart Dataset

Description:

Acknowledgements

Objective:

HOSPI-Tools Dataset - DSLR

Solar_panel_combine Dataset

MISR Aerosol Climatology Product V001 - Dataset - NASA Open Data Portal

Data from: CURE-OR: Challenging Unreal and Real Environments for Object...

peoples_speech_v1.0

Interview data from 'New norms and forms of development' - Dataset - B2FIND

Data from: A Data Science Course for Undergraduates: Thinking With Data

Student Admission Records

toxic-chat

CURE-TSR: Challenging Unreal and Real Environments for Traffic Sign...

CARMA, United States Power Plant Emissions, United States, 2000/2007/Future

Warpaint London PLC - Total-Long-Term-Debt

Restaurant Menu (Data Cleaning)