100+ datasets found
  1. 🔍 Diverse CSV Dataset Samples

    • kaggle.com
    Updated Nov 6, 2023
    Samy Baladram (2023). 🔍 Diverse CSV Dataset Samples [Dataset]. https://www.kaggle.com/datasets/samybaladram/multidisciplinary-csv-datasets-collection/code
    Dataset updated
    Nov 6, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Samy Baladram
    License

    http://www.gnu.org/licenses/lgpl-3.0.html

    Description


    Overview

    The dataset provided here is a rich compilation of various data files gathered to support diverse analytical challenges and education in data science. It is especially curated to provide researchers, data enthusiasts, and students with real-world data across different domains, including biostatistics, travel, real estate, sports, media viewership, and more.

    Files

    Below is a brief overview of what each CSV file contains:

    • Addresses: Practical examples of string manipulation and address data formatting in CSV.
    • Air Travel: Historical dataset suitable for analyzing trends in air travel over a period of three years.
    • Biostats: A dataset of office workers' biometrics, ideal for introductory statistics and biology.
    • Cities: Geographic and administrative data for urban analysis or socio-demographic studies.
    • Car Crashes in Catalonia: Weekly traffic accident data from Catalonia, providing a base for public policy research.
    • De Niro's Film Ratings: Analyze trends in film ratings over time with this entertainment-focused dataset.
    • Ford Escort Sales: Pre-owned vehicle sales data, perfect for regression analysis or price prediction models.
    • Old Faithful Geyser: Geological data for pattern recognition and prediction in natural phenomena.
    • Freshman Year Weights and BMIs: Dataset depicting weight and BMI changes for health and lifestyle studies.
    • Grades: Education performance data which can be correlated with demographics or study patterns.
    • Home Sales: A dataset reflecting the housing market dynamics, useful for economic analysis or real estate appraisal.
    • Hooke's Law Demonstration: Physics data illustrating the classic principle of elasticity in springs.
    • Hurricanes and Storm Data: Climate data on hurricane and storm frequency for environmental risk assessments.
    • Height and Weight Measurements: Public health research dataset on anthropometric data.
    • Lead Shot Specs: Detailed engineering data for material sciences and manufacturing studies.
    • Alphabet Letter Frequency: Text analysis dataset for frequency distribution studies in large text samples.
    • MLB Player Statistics: Comprehensive athletic data set for analysis of performance metrics in sports.
    • MLB Teams' Seasonal Performance: A dataset combining financial and sports performance data from the 2012 MLB season.
    • TV News Viewership: Media consumption data which can be used to analyze viewing patterns and trends.
    • Historical Nile Flood Data: A unique environmental dataset for historical trend analysis in flood levels.
    • Oscar Winner Ages: A dataset to explore age trends among Oscar-winning actors and actresses.
    • Snakes and Ladders Statistics: Data from the game outcomes useful in studying probability and game theory.
    • Tallahassee Cab Fares: Price modeling data from the real-world pricing of taxi services.
    • Taxable Goods Data: A snapshot of economic data concerning taxation impact on prices.
    • Tree Measurements: Ecological and environmental science data related to tree growth and forest management.
    • Real Estate Prices from Zillow: Market analysis dataset for those interested in housing price determinants.

    Format

    The enclosed data respect the comma-separated values (CSV) file format standards, ensuring compatibility with most data processing libraries in Python, R, and other languages. The datasets are ready for import into Jupyter notebooks, RStudio, or any other integrated development environment (IDE) used for data science.

    Quality Assurance

    The data is pre-checked for common issues such as missing values, duplicate records, and inconsistent entries, offering a clean and reliable dataset for various analytical exercises. With initial header lines in some CSV files, users can easily identify dataset fields and start their analysis without additional data cleaning for headers.
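
    The checks mentioned above can be repeated after import. The following is a minimal sketch, assuming pandas is installed; the inline sample and its column names are invented for illustration and do not come from any file in the collection.

```python
import io

import pandas as pd

# Hypothetical sample standing in for one of the CSV files; the
# column names and values are illustrative only.
SAMPLE_CSV = """name,height_in,weight_lb
Alex,74,170
Bert,68,166
Bert,68,166
Carl,,155
"""


def quality_report(df: pd.DataFrame) -> dict:
    """Count rows, missing cells and fully duplicated rows."""
    return {
        "rows": len(df),
        "missing_cells": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
    }


df = pd.read_csv(io.StringIO(SAMPLE_CSV))
report = quality_report(df)
print(report)
```

    Replacing the in-memory sample with a path to a downloaded file should give the same kind of report for that file.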

    Acknowledgements

    The dataset adheres to the GNU LGPL license, making it freely available for modification and distribution, provided that the original source is cited. This opens up possibilities for educators to integrate real-world data into curricula, researchers to validate models against diverse datasets, and practitioners to refine their analytical skills with hands-on data.

    This dataset has been compiled from https://people.sc.fsu.edu/~jburkardt/data/csv/csv.html, with gratitude to the authors and maintainers for their dedication to providing open data resources for educational and research purposes.

  2. Global Real World Evidence Solutions Market By Data Source (Electronic...

    • verifiedmarketresearch.com
    Updated Jul 16, 2024
    VERIFIED MARKET RESEARCH (2024). Global Real World Evidence Solutions Market By Data Source (Electronic Health Records, Claims Data, Registries, Medical Devices), By Therapeutic Area (Oncology, Cardiovascular Diseases, Neurology, Rare Diseases), By Application (Drug Development, Clinical Decision Support, Epidemiological Studies, Post-Marketing Surveillance), By Geographic Scope and Forecast [Dataset]. https://www.verifiedmarketresearch.com/product/real-world-evidence-solutions-market/
    Dataset updated
    Jul 16, 2024
    Dataset authored and provided by
    VERIFIED MARKET RESEARCH
    License

    https://www.verifiedmarketresearch.com/privacy-policy/

    Time period covered
    2024 - 2031
    Area covered
    Global
    Description

    Real World Evidence Solutions Market size was valued at USD 1.30 Billion in 2024 and is projected to reach USD 3.71 Billion by 2031, growing at a CAGR of 13.92% during the forecast period 2024-2031.
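
    Growth figures of this kind are usually related by the compound annual growth rate formula, end = start × (1 + r)^n. The sketch below only illustrates that formula; the number of periods n is an assumption, because the description does not say how the forecast years are counted, and the computed rate depends on that choice.

```python
def cagr(start_value: float, end_value: float, periods: int) -> float:
    """Rate r such that end_value = start_value * (1 + r) ** periods."""
    if start_value <= 0 or periods <= 0:
        raise ValueError("start_value and periods must be positive")
    return (end_value / start_value) ** (1.0 / periods) - 1.0


# Market sizes quoted in the description (USD billion).
START, END = 1.30, 3.71

# The period count is an assumption; both readings of "2024-2031" are shown.
for n in (7, 8):
    print(f"periods={n}: CAGR = {cagr(START, END, n):.2%}")
```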

    Global Real World Evidence Solutions Market Drivers

    The market drivers for the Real World Evidence Solutions Market can be influenced by various factors. These may include:

    • Growing Need for Evidence-Based Healthcare: Real-world evidence (RWE) is becoming more and more important in healthcare decision-making, according to stakeholders such as payers, providers, and regulators. In addition to traditional clinical trial data, RWE solutions offer important insights into the efficacy, safety, and value of healthcare interventions in real-world situations.
    • Growing Use of RWE by Pharmaceutical Companies: RWE solutions are being used by pharmaceutical companies to assist with market entry, post-marketing surveillance, and drug development initiatives. Pharmaceutical businesses can find new indications for their current medications, improve clinical trial designs, and convince payers and providers of the worth of their products with the use of RWE.
    • Increasing Priority for Value-Based Healthcare: The emphasis on proving the cost- and benefit-effectiveness of healthcare interventions in real-world settings is growing as value-based healthcare models gain traction. To assist value-based decision-making, RWE solutions are essential in evaluating the economic effect and real-world consequences of healthcare interventions.
    • Technological and Data Analytics Advancements: RWE solutions are becoming more capable due to advances in machine learning, artificial intelligence, and big data analytics. With the use of these technologies, healthcare stakeholders can obtain actionable insights from the analysis of vast and varied datasets, including patient-generated data, claims data, and electronic health records.
    • Regulatory Support for RWE Integration: RWE is being progressively integrated into regulatory decision-making processes by regulatory organisations including the European Medicines Agency (EMA) and the U.S. Food and Drug Administration (FDA). The FDA's Real-World Evidence Programme and the EMA's Adaptive Pathways and PRIority MEdicines (PRIME) programme are two examples of initiatives that are making it easier to incorporate RWE into regulatory submissions and drug development.
    • Increasing Emphasis on Patient-Centric Healthcare: The value of patient-reported outcomes and real-world experiences in healthcare decision-making is becoming more widely acknowledged. RWE technologies facilitate the collection and examination of patient-centered data, offering valuable insights into treatment efficacy, patient inclinations, and quality of life consequences.
    • Extension of RWE Use Cases: RWE solutions are being used in medication development, post-market surveillance, health economics and outcomes research (HEOR), comparative effectiveness research, and market access, among other healthcare fields. The necessity for a variety of RWE solutions catered to the needs of different stakeholders is being driven by the expansion of RWE use cases.

  3. Customer Shopping Trends Dataset

    • kaggle.com
    Updated Oct 5, 2023
    Sourav Banerjee (2023). Customer Shopping Trends Dataset [Dataset]. https://www.kaggle.com/datasets/iamsouravbanerjee/customer-shopping-trends-dataset
    Dataset updated
    Oct 5, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sourav Banerjee
    Description

    Context

    The Customer Shopping Preferences Dataset offers valuable insights into consumer behavior and purchasing patterns. Understanding customer preferences and trends is critical for businesses to tailor their products, marketing strategies, and overall customer experience. This dataset captures a wide range of customer attributes including age, gender, purchase history, preferred payment methods, frequency of purchases, and more. Analyzing this data can help businesses make informed decisions, optimize product offerings, and enhance customer satisfaction. The dataset stands as a valuable resource for businesses aiming to align their strategies with customer needs and preferences. Note that this is a synthetic dataset created for beginners to learn more about data analysis and machine learning.

    Content

    This dataset encompasses various features related to customer shopping preferences, gathering essential information for businesses seeking to enhance their understanding of their customer base. The features include customer age, gender, purchase amount, preferred payment methods, frequency of purchases, and feedback ratings. Additionally, data on the type of items purchased, shopping frequency, preferred shopping seasons, and interactions with promotional offers is included. With a collection of 3900 records, this dataset serves as a foundation for businesses looking to apply data-driven insights for better decision-making and customer-centric strategies.

    Dataset Glossary (Column-wise)

    • Customer ID - Unique identifier for each customer
    • Age - Age of the customer
    • Gender - Gender of the customer (Male/Female)
    • Item Purchased - The item purchased by the customer
    • Category - Category of the item purchased
    • Purchase Amount (USD) - The amount of the purchase in USD
    • Location - Location where the purchase was made
    • Size - Size of the purchased item
    • Color - Color of the purchased item
    • Season - Season during which the purchase was made
    • Review Rating - Rating given by the customer for the purchased item
    • Subscription Status - Indicates if the customer has a subscription (Yes/No)
    • Shipping Type - Type of shipping chosen by the customer
    • Discount Applied - Indicates if a discount was applied to the purchase (Yes/No)
    • Promo Code Used - Indicates if a promo code was used for the purchase (Yes/No)
    • Previous Purchases - The total count of transactions concluded by the customer at the store, excluding the ongoing transaction
    • Payment Method - Customer's most preferred payment method
    • Frequency of Purchases - Frequency at which the customer makes purchases (e.g., Weekly, Fortnightly, Monthly)
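
    A short sketch of how the glossary columns might be used, assuming pandas is installed. The three rows are invented for illustration (including the category names) and only a subset of the columns is shown; real values should be read from the dataset's CSV file.

```python
import pandas as pd

# Three invented rows using a subset of the glossary's column names.
rows = [
    {"Customer ID": 1, "Category": "Clothing", "Purchase Amount (USD)": 50,
     "Subscription Status": "Yes", "Frequency of Purchases": "Weekly"},
    {"Customer ID": 2, "Category": "Footwear", "Purchase Amount (USD)": 90,
     "Subscription Status": "No", "Frequency of Purchases": "Monthly"},
    {"Customer ID": 3, "Category": "Clothing", "Purchase Amount (USD)": 70,
     "Subscription Status": "No", "Frequency of Purchases": "Fortnightly"},
]
df = pd.DataFrame(rows)

# Map the Yes/No flag to a boolean so it can be filtered or counted.
df["Subscribed"] = df["Subscription Status"].map({"Yes": True, "No": False})

# Mean purchase amount per item category.
mean_by_category = df.groupby("Category")["Purchase Amount (USD)"].mean()
print(mean_by_category)
```

    The same Yes/No mapping would apply to the Discount Applied and Promo Code Used columns.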

    Structure of the Dataset

    [Image: structure of the dataset, https://i.imgur.com/6UEqejq.png]

    Acknowledgement

    This dataset is a synthetic creation generated using ChatGPT to simulate a realistic customer shopping experience. Its purpose is to provide a platform for beginners and data enthusiasts, allowing them to create, enjoy, practice, and learn from a dataset that mirrors real-world customer shopping behavior. The aim is to foster learning and experimentation in a simulated environment, encouraging a deeper understanding of data analysis and interpretation in the context of consumer preferences and retail scenarios.

    Cover Photo by: Freepik

    Thumbnail by: Clothing icons created by Flat Icons - Flaticon

  4. ORBIT: A real-world few-shot dataset for teachable object recognition...

    • city.figshare.com
    bin
    Updated May 31, 2023
    Daniela Massiceti; Lida Theodorou; Luisa Zintgraf; Matthew Tobias Harris; Simone Stumpf; Cecily Morrison; Edward Cutrell; Katja Hofmann (2023). ORBIT: A real-world few-shot dataset for teachable object recognition collected from people who are blind or low vision [Dataset]. http://doi.org/10.25383/city.14294597.v3
    Available download formats: bin
    Dataset updated
    May 31, 2023
    Dataset provided by
    City, University of London
    Authors
    Daniela Massiceti; Lida Theodorou; Luisa Zintgraf; Matthew Tobias Harris; Simone Stumpf; Cecily Morrison; Edward Cutrell; Katja Hofmann
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Object recognition predominantly still relies on many high-quality training examples per object category. In contrast, learning new objects from only a few examples could enable many impactful applications from robotics to user personalization. Most few-shot learning research, however, has been driven by benchmark datasets that lack the high variation that these applications will face when deployed in the real world. To close this gap, we present the ORBIT dataset, grounded in a real-world application of teachable object recognizers for people who are blind/low vision. We provide a full, unfiltered dataset of 4,733 videos of 588 objects recorded by 97 people who are blind/low-vision on their mobile phones, and a benchmark dataset of 3,822 videos of 486 objects collected by 77 collectors. The code for loading the dataset, computing all benchmark metrics, and running the baseline models is available at https://github.com/microsoft/ORBIT-Dataset.

    This version comprises several zip files:

    • train, validation, test: benchmark dataset, organised by collector, with raw videos split into static individual frames in jpg format at 30FPS
    • other: data not in the benchmark set, organised by collector, with raw videos split into static individual frames in jpg format at 30FPS (please note that the train, validation, test, and other files make up the unfiltered dataset)
    • *_224: as for the benchmark, but static individual frames are scaled down to 224 pixels.
    • *_unfiltered_videos: full unfiltered dataset, organised by collector, in mp4 format.

  5. ImageNet-A Dataset

    • paperswithcode.com
    Dan Hendrycks; Kevin Zhao; Steven Basart; Jacob Steinhardt; Dawn Song, ImageNet-A Dataset [Dataset]. https://paperswithcode.com/dataset/imagenet-a
    Authors
    Dan Hendrycks; Kevin Zhao; Steven Basart; Jacob Steinhardt; Dawn Song
    Description

    The ImageNet-A dataset consists of real-world, unmodified, and naturally occurring examples that are misclassified by ResNet models.

  6. Dataset of development of business during the COVID-19 crisis

    • data.mendeley.com
    • narcis.nl
    Updated Nov 9, 2020
    Tatiana N. Litvinova (2020). Dataset of development of business during the COVID-19 crisis [Dataset]. http://doi.org/10.17632/9vvrd34f8t.1
    Dataset updated
    Nov 9, 2020
    Authors
    Tatiana N. Litvinova
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    To create the dataset, the countries among the top 10 by COVID-19 incidence worldwide as of October 22, 2020 (on the eve of the second wave of the pandemic) that are represented in the Global 500 ranking for 2020 were selected: USA, India, Brazil, Russia, Spain, France and Mexico. For each of these countries, up to 10 of the largest transnational corporations included in the Global 500 ranking were selected, separately for 2020 and 2019. Arithmetic averages and their change (increase) were calculated for indicators such as the profitability of enterprises, their ranking position (competitiveness), asset value and number of employees. The arithmetic means of these indicators across all countries in the sample were then found, characterizing the situation in international entrepreneurship as a whole during the COVID-19 crisis in 2020, on the eve of the second wave of the pandemic. The data are collected in a single Microsoft Excel table.

    The dataset is a unique database that combines COVID-19 statistics and entrepreneurship statistics. It is flexible and can be supplemented with data from other countries and newer statistics on the COVID-19 pandemic. Because the data in the dataset are stored as formulas rather than fixed numbers, adding or changing values in the original table at the beginning of the dataset automatically recalculates most of the subsequent tables and updates the graphs. This allows the dataset to be used not just as an array of data, but as an analytical tool for automating research on the impact of the COVID-19 pandemic and crisis on international entrepreneurship. The dataset includes not only tabular data but also charts that visualize the data.

    The dataset contains not only actual but also forecast data on morbidity and mortality from COVID-19 for the period of the second wave of the pandemic in 2020. The forecasts are presented as a normal distribution of predicted values together with the probability of their occurrence in practice. This allows for a broad scenario analysis of the impact of the COVID-19 pandemic and crisis on international entrepreneurship: various predicted morbidity and mortality rates can be substituted into the risk assessment tables to obtain automatically calculated consequences (changes) for the characteristics of international entrepreneurship. It is also possible to substitute the actual values observed during and after the second wave of the pandemic in order to check the reliability of the forecasts and conduct a plan-fact analysis. The dataset contains not only the numerical values of the initial and predicted indicators but also their qualitative interpretation, reflecting the presence and level of risks that the pandemic and the COVID-19 crisis pose to international entrepreneurship.
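
    The averaging step described above can be illustrated outside the spreadsheet. This is a minimal sketch under stated assumptions: the indicator values are invented, and the dataset's own Excel formulas may be organised differently.

```python
from statistics import mean

# Hypothetical indicator values (number of employees, in thousands)
# for the same three companies in two consecutive rankings.
employees_2019 = [120.0, 80.0, 100.0]
employees_2020 = [126.0, 76.0, 104.0]

mean_2019 = mean(employees_2019)
mean_2020 = mean(employees_2020)

# Absolute and relative change of the averaged indicator.
absolute_change = mean_2020 - mean_2019
relative_change = absolute_change / mean_2019

print(f"mean 2019 = {mean_2019:.1f}, mean 2020 = {mean_2020:.1f}")
print(f"change = {absolute_change:+.1f} ({relative_change:+.1%})")
```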

  7. Data from: A large synthetic dataset for machine learning applications in...

    • zenodo.org
    csv, json, png, zip
    Updated Mar 25, 2025
    Marc Gillioz; Guillaume Dubuis; Philippe Jacquod (2025). A large synthetic dataset for machine learning applications in power transmission grids [Dataset]. http://doi.org/10.5281/zenodo.13378476
    Available download formats: zip, png, csv, json
    Dataset updated
    Mar 25, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Marc Gillioz; Guillaume Dubuis; Philippe Jacquod
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    With the ongoing energy transition, power grids are evolving fast. They operate more and more often close to their technical limit, under more and more volatile conditions. Fast, essentially real-time computational approaches to evaluate their operational safety, stability and reliability are therefore highly desirable. Machine Learning methods have been advocated to solve this challenge; however, they are heavy consumers of training and testing data, while historical operational data for real-world power grids are hard if not impossible to access.

    This dataset contains long time series for production, consumption, and line flows, amounting to 20 years of data with a time resolution of one hour, for several thousands of loads and several hundreds of generators of various types representing the ultra-high-voltage transmission grid of continental Europe. The synthetic time series have been statistically validated against real-world data.

    Data generation algorithm

    The algorithm is described in a Nature Scientific Data paper. It relies on the PanTaGruEl model of the European transmission network -- the admittance of its lines as well as the location, type and capacity of its power generators -- and aggregated data gathered from the ENTSO-E transparency platform, such as power consumption aggregated at the national level.

    Network

    The network information is encoded in the file europe_network.json. It is given in PowerModels format, which is itself derived from MatPower and compatible with PandaPower. The network features 7822 power lines and 553 transformers connecting 4097 buses, to which are attached 815 generators of various types.

    Time series

    The time series forming the core of this dataset are given in CSV format. Each CSV file is a table with 8736 rows, one for each hourly time step of a 364-day year. All years are truncated to exactly 52 weeks of 7 days, and start on a Monday (the load profiles are typically different during weekdays and weekends). The number of columns depends on the type of table: there are 4097 columns in load files, 815 for generators, and 8375 for lines (including transformers). Each column is described by a header corresponding to the element identifier in the network file. All values are given in per-unit, both in the model file and in the tables, i.e. they are multiples of a base unit taken to be 100 MW.
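
    Since all values are expressed in per-unit with a base of 100 MW, converting a table to megawatts is a single multiplication. The sketch below assumes pandas is installed and uses a tiny in-memory stand-in for a load table; its column names are hypothetical identifiers, not taken from the network file.

```python
import io

import pandas as pd

BASE_MW = 100.0  # base unit stated in the dataset description

# Tiny stand-in for a load table; the real files have 8736 rows and
# thousands of columns. Column names here are hypothetical identifiers.
SAMPLE = """1,2,3
0.50,1.20,0.00
0.55,1.10,0.02
"""

loads_pu = pd.read_csv(io.StringIO(SAMPLE))

# Convert every per-unit value to megawatts.
loads_mw = loads_pu * BASE_MW
print(loads_mw)
```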

    There are 20 tables of each type, labeled with a reference year (2016 to 2020) and an index (1 to 4), zipped into archive files arranged by year. This amounts to a total of 20 years of synthetic data. When using loads, generators, and lines profiles together, it is important to use the same label: for instance, the files loads_2020_1.csv, gens_2020_1.csv, and lines_2020_1.csv represent the same year of the dataset, whereas gens_2020_2.csv is unrelated (it actually shares some features, such as nuclear profiles, but it is based on a dispatch with distinct loads).
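
    One way to avoid mixing labels is to build the three file names from a single (year, index) pair. The helper below is a hypothetical convenience, not part of the dataset's code; it only follows the naming pattern of the file names quoted above.

```python
from itertools import product

YEARS = range(2016, 2021)   # reference years 2016 to 2020
INDICES = range(1, 5)       # indices 1 to 4


def matching_files(year: int, index: int) -> dict:
    """Return the loads, gens and lines file names sharing one label."""
    label = f"{year}_{index}"
    return {kind: f"{kind}_{label}.csv" for kind in ("loads", "gens", "lines")}


# One entry per synthetic year: 5 reference years times 4 indices.
all_labels = [matching_files(y, i) for y, i in product(YEARS, INDICES)]
print(len(all_labels))
print(matching_files(2020, 1))
```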

    Usage

    The time series can be used without a reference to the network file, simply using all or a selection of columns of the CSV files, depending on the needs. We show below how to select series from a particular country, or how to aggregate hourly time steps into days or weeks. These examples use Python and the data analysis library pandas, but other frameworks can be used as well (Matlab, Julia). Since all the yearly time series are periodic, it is always possible to define a coherent time window modulo the length of the series.

    Selecting a particular country

    This example illustrates how to select generation data for Switzerland in Python. This can be done without parsing the network file, but using instead gens_by_country.csv, which contains a list of all generators for any country in the network. We start by importing the pandas library, and read the column of the file corresponding to Switzerland (country code CH):

    import pandas as pd
    CH_gens = pd.read_csv('gens_by_country.csv', usecols=['CH'], dtype=str)

    The object created in this way is a DataFrame with some null values (not all countries have the same number of generators). It can be turned into a list with:

    CH_gens_list = CH_gens.dropna().squeeze().to_list()

    Finally, we can import all the time series of Swiss generators from a given data table with

    pd.read_csv('gens_2016_1.csv', usecols=CH_gens_list)

    The same procedure can be applied to loads using the list contained in the file loads_by_country.csv.

    Averaging over time

    This second example shows how to change the time resolution of the series. Suppose that we are interested in all the loads from a given table, which are given by default with a one-hour resolution:

    hourly_loads = pd.read_csv('loads_2018_3.csv')

    To get a daily average of the loads, we can use:

    daily_loads = hourly_loads.groupby([t // 24 for t in range(24 * 364)]).mean()

    This results in series of length 364. To average further over entire weeks and get series of length 52, we use:

    weekly_loads = hourly_loads.groupby([t // (24 * 7) for t in range(24 * 364)]).mean()
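
    Because the yearly series are periodic, a time window may also run past the end of a table and continue from its start. The following is a minimal sketch of such a wrap-around selection, assuming pandas and NumPy are installed; periodic_window is a hypothetical helper and the stand-in table replaces a real CSV file.

```python
import numpy as np
import pandas as pd

HOURS_PER_YEAR = 24 * 364  # 8736 rows per table


def periodic_window(table: pd.DataFrame, start: int, length: int) -> pd.DataFrame:
    """Select `length` consecutive rows from `start`, wrapping around the end."""
    positions = np.arange(start, start + length) % len(table)
    return table.iloc[positions]


# Stand-in table with one hypothetical column holding the row number.
table = pd.DataFrame({"demo": np.arange(HOURS_PER_YEAR)})

# A one-week window that starts three days before the end of the year.
window = periodic_window(table, start=HOURS_PER_YEAR - 72, length=24 * 7)
print(window["demo"].iloc[[0, 71, 72, -1]].to_list())
```

    The returned rows keep their original index labels, so calling reset_index(drop=True) on the result may be convenient before further grouping.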

    Source code

    The code used to generate the dataset is freely available at https://github.com/GeeeHesso/PowerData. It consists of two packages and several documentation notebooks. The first package, written in Python, provides functions to handle the data and to generate synthetic series based on historical data. The second package, written in Julia, is used to perform the optimal power flow. The documentation in the form of Jupyter notebooks contains numerous examples on how to use both packages. The entire workflow used to create this dataset is also provided, starting from raw ENTSO-E data files and ending with the synthetic dataset given in the repository.

    Funding

    This work was supported by the Cyber-Defence Campus of armasuisse and by an internal research grant of the Engineering and Architecture domain of HES-SO.

  8. Data from: LifeSnaps: a 4-month multi-modal dataset capturing unobtrusive...

    • zenodo.org
    • explore.openaire.eu
    zip
    Updated Oct 20, 2022
    Sofia Yfantidou; Christina Karagianni; Stefanos Efstathiou; Athena Vakali; Joao Palotti; Dimitrios Panteleimon Giakatos; Thomas Marchioro; Andrei Kazlouski; Elena Ferrari; Šarūnas Girdzijauskas (2022). LifeSnaps: a 4-month multi-modal dataset capturing unobtrusive snapshots of our lives in the wild [Dataset]. http://doi.org/10.5281/zenodo.6832242
    Available download formats: zip
    Dataset updated
    Oct 20, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sofia Yfantidou; Christina Karagianni; Stefanos Efstathiou; Athena Vakali; Joao Palotti; Dimitrios Panteleimon Giakatos; Thomas Marchioro; Andrei Kazlouski; Elena Ferrari; Šarūnas Girdzijauskas
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    LifeSnaps Dataset Documentation

    Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet, limited data exist on the association between in the wild large-scale physical activity patterns, sleep, stress, and overall health, and behavioral patterns and psychological measurements due to challenges in collecting and releasing such datasets, such as waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset, a multi-modal, longitudinal, and geographically-distributed dataset, containing a plethora of anthropological data, collected unobtrusively for the total course of more than 4 months by n=71 participants, under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types from second to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to make these data available openly to empower future research. We envision that releasing this large-scale dataset of multi-modal real-world data, will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction.

    The following instructions will get you started with the LifeSnaps dataset and are complementary to the original publication.

    Data Import: Reading CSV

    For ease of use, we provide CSV files containing Fitbit, SEMA, and survey data at daily and/or hourly granularity. You can read the files via any programming language. For example, in Python, you can read the files into a Pandas DataFrame with the pandas.read_csv() command.
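
    As an illustration of the CSV route, the sketch below reads an hourly table with pandas.read_csv() and aggregates it to daily granularity. The inline sample and its column names (id, timestamp, steps) are assumptions made for this example and may not match the actual LifeSnaps files.

```python
import io

import pandas as pd

# Hypothetical hourly extract; the column names are assumptions and may
# not match the actual LifeSnaps CSV files.
SAMPLE = """id,timestamp,steps
p01,2021-06-01 08:00:00,300
p01,2021-06-01 09:00:00,450
p01,2021-06-02 08:00:00,200
p02,2021-06-01 08:00:00,100
"""

hourly = pd.read_csv(io.StringIO(SAMPLE), parse_dates=["timestamp"])

# Sum hourly step counts into one row per participant and calendar day.
daily = (
    hourly.groupby(["id", hourly["timestamp"].dt.date])["steps"]
    .sum()
    .reset_index()
)
print(daily)
```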

    Data Import: Setting up a MongoDB (Recommended)

    To take full advantage of the LifeSnaps dataset, we recommend that you use the raw, complete data via importing the LifeSnaps MongoDB database.

    To do so, open the terminal/command prompt and run the following command for each collection in the DB. Ensure you have the MongoDB Database Tools installed.

    For the Fitbit data, run the following:

    mongorestore --host localhost:27017 -d rais_anonymized -c fitbit 

    For the SEMA data, run the following:

    mongorestore --host localhost:27017 -d rais_anonymized -c sema 

    For surveys data, run the following:

    mongorestore --host localhost:27017 -d rais_anonymized -c surveys 

    If you have access control enabled, then you will need to add the --username and --password parameters to the above commands.

    Data Availability

    The MongoDB database contains three collections, fitbit, sema, and surveys, containing the Fitbit, SEMA3, and survey data, respectively. Similarly, the CSV files contain related information to these collections. Each document in any collection follows the format shown below:

    {
      _id: 
  9. realworldqa

    • huggingface.co
    Updated Apr 13, 2024
    Cite
    Nirajan Dhakal (2024). realworldqa [Dataset]. https://huggingface.co/datasets/nirajandhakal/realworldqa
    Explore at:
    Dataset updated
    Apr 13, 2024
    Authors
    Nirajan Dhakal
    License

    Attribution-NoDerivs 4.0 (CC BY-ND 4.0): https://creativecommons.org/licenses/by-nd/4.0/
    License information was derived automatically

    Description

    Real World QA Dataset

    This is a benchmark dataset released by xAI under the CC BY-ND 4.0 license alongside the Grok-1.5 Vision announcement. The benchmark is designed to evaluate the basic real-world spatial understanding capabilities of multimodal models. While many of the examples in the current benchmark are relatively easy for humans, they often pose a challenge for frontier models. This release of RealWorldQA consists of 765 images, with a question and easily verifiable answer for… See the full description on the dataset page: https://huggingface.co/datasets/nirajandhakal/realworldqa.

  10. LAS&T: Large Shape And Texture Dataset

    • zenodo.org
    jpeg, zip
    Updated May 26, 2025
    Cite
    Sagi Eppel (2025). LAS&T: Large Shape And Texture Dataset [Dataset]. http://doi.org/10.5281/zenodo.15453634
    Explore at:
    jpeg, zip (available download formats)
    Dataset updated
    May 26, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sagi Eppel
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Large Shape And Texture Dataset (LAS&T)

    LAS&T is the largest and most diverse dataset for shape, texture and material recognition and retrieval in 2D and 3D with 650,000 images, based on real world shapes and textures.

    Overview

    The LAS&T Dataset aims to test the most basic aspect of vision in the most general way: the ability to identify any shape, texture, and material in any setting and environment, without being limited to specific types or classes of objects, materials, and environments. For shapes, this means identifying and retrieving any shape in 2D or 3D with every element of the shape changed between images, including the shape material and texture, orientation, size, and environment. For textures and materials, the goal is to recognize the same texture or material when it appears on different objects, in different environments, and under different light conditions. The dataset relies on shapes, textures, and materials extracted from real-world images, leading to an almost unlimited quantity and diversity of real-world natural patterns. Each section of the dataset (shapes and textures) contains 3D parts that rely on physics-based scenes with realistic light, materials, and object simulation, as well as abstract 2D parts. In addition, a real-world benchmark for 3D shapes is provided.

    Main Dataset webpage

    The dataset contains four parts:

    3D shape recognition and retrieval.

    2D shape recognition and retrieval.

    3D Materials recognition and retrieval.

    2D Texture recognition and retrieval.

    Each can be used independently for training and testing.

    Additional assets are a set of 350,000 natural 2D shapes extracted from real-world images (SHAPES_COLLECTION_350k.zip)

    3D shape recognition real-world images benchmark

    The scripts used to generate and test the dataset are supplied in the SCRIPT** files.

    Shapes Recognition and Retrieval:

    For shape recognition, the goal is to identify the same shape in different images, where the material/texture/color of the shape is changed, the shape is rotated, and the background is replaced. Hence, only the shape remains the same in both images. This is tested for 3D shapes/objects with realistic light simulation and for 2D shapes. All files with 3D shapes contain samples of the 3D shape dataset. All files with 2D shapes contain samples of the 2D shape dataset. Example files contain images with examples for each set.

    Main files:

    Real_Images_3D_shape_matching_Benchmarks.zip contains real-world image benchmarks for 3D shapes.

    3D_Shape_Recognition_Synthethic_GENERAL_LARGE_SET_76k.zip: A large number of synthetic examples of 3D shapes with maximum variability, which can be used for training/testing 3D shape/object recognition/retrieval.

    2D_Shapes_Recognition_Textured_Synthetic_Resize2_GENERAL_LARGE_SET_61k.zip: A large number of synthetic examples of 2D shapes with maximum variability, which can be used for training/testing 2D shape recognition/retrieval.

    SHAPES_2D_365k.zip 365,000 2D shapes extracted from real-world images saved as black and white .png image files.

    File structure:

    All jpg images that are in the exact same subfolder contain the exact same shape (but with different texture/color/background/orientation).
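    The folder convention above can be turned into matching groups with a few lines of Python. This is a minimal sketch, not the authors' tooling: the function name and the tiny made-up directory tree are hypothetical, and it only assumes that images sharing a subfolder show the same shape.

```python
import tempfile
from collections import defaultdict
from pathlib import Path


def group_by_subfolder(root):
    """Map each subfolder (relative to root) to the sorted .jpg names it holds."""
    groups = defaultdict(list)
    for image in sorted(Path(root).rglob("*.jpg")):
        key = image.parent.relative_to(root).as_posix()
        groups[key].append(image.name)
    return dict(groups)


# Tiny made-up directory tree standing in for an unzipped archive.
with tempfile.TemporaryDirectory() as tmp:
    for folder, name in [("shape_a", "1.jpg"), ("shape_a", "2.jpg"), ("shape_b", "1.jpg")]:
        path = Path(tmp, folder)
        path.mkdir(exist_ok=True)
        (path / name).touch()
    demo_groups = group_by_subfolder(tmp)

print(demo_groups)
```

    Images within one group can serve as positive pairs for retrieval, and images from different groups as negatives.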

    Textures and Materials Recognition and Retrieval

    For textures and materials, the goal is to identify and match images containing the same material or texture; however, the shape/object on which the material or texture is applied is different, and so are the background and light.

    This is done for physics-based material in 3D and abstract 2D textures.

    3D_Materials_PBR_Synthetic_GENERAL_LARGE_SET_80K.zip: A large number of examples of 3D materials in physics-grounded scenes, which can be used for training or testing material recognition/retrieval.

    2D_Textures_Recogition_GENERAL_LARGE_SET_Synthetic_53K.zip: A large number of images of 2D textures with maximum variability of settings, which can be used for training/testing 2D texture recognition/retrieval.

    File structure:

    All jpg images that are in the exact same subfolder contain the exact same texture/material (but overlaid on different objects with different background, illumination, and orientation).

    Data Generation:

    The images in the synthetic part of the dataset were created by automatically extracting shapes and textures from natural images and combining them in synthetic images. This created synthetic images that completely rely on real-world patterns, making extremely diverse and complex shapes and textures. As far as we know, this is the largest and most diverse shape and texture recognition/retrieval dataset. 3D data was generated using physics-based materials and rendering (Blender), making the images physically grounded and enabling use of the data to train for real-world examples. The scripts for generating the data are supplied in files with the word SCRIPTS* in their names.

    Real-world image data:

    For 3D shape recognition and retrieval, we also supply a real-world natural image benchmark, with a variety of natural images containing the exact same 3D shape but made of or coated with different materials and placed in different environments and orientations. The goal is again to identify the same shape in different images. The benchmark is available at: Real_Images_3D_shape_matching_Benchmarks.zip

    File structure:

    Files whose names contain 'GENERAL_LARGE_SET' hold synthetic images that can be used for training or testing; the type of data (2D shapes, 3D shapes, 2D textures, 3D materials) and the number of images appear in the file name. Files whose names contain MultiTests hold a number of different tests in which only a single aspect of the instance is changed (for example, only the background). Files whose names contain "SCRIPTS" hold the data generation and testing scripts. Images whose names contain "examples" are examples of each test.
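    The naming convention can be read programmatically. The sketch below is only an illustration under stated assumptions: the regular expression and the function name are hypothetical, and it assumes the trailing "k"/"K" suffix means an approximate count in thousands.

```python
import re

# Assumed pattern: a 2D/3D prefix, the GENERAL_LARGE_SET marker, and a
# trailing count such as "_76k.zip" (interpreted here as thousands).
NAME_PATTERN = re.compile(
    r"^(?P<dim>[23]D)_.*GENERAL_LARGE_SET.*?_(?P<count>\d+)[kK]\.zip$"
)


def describe_archive(filename):
    """Return (dimensionality, approximate image count) or None if no match."""
    match = NAME_PATTERN.match(filename)
    if match is None:
        return None
    return match.group("dim"), int(match.group("count")) * 1000


for name in [
    "3D_Shape_Recognition_Synthethic_GENERAL_LARGE_SET_76k.zip",
    "2D_Textures_Recogition_GENERAL_LARGE_SET_Synthetic_53K.zip",
    "SHAPES_COLLECTION_350k.zip",
]:
    print(name, describe_archive(name))
```

    Archives that do not follow the GENERAL_LARGE_SET convention, such as the shapes collection, return None rather than a guess.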

    Shapes Collections

    The file SHAPES_COLLECTION_350k.zip contains 350,000 2D shapes extracted from natural images and used for the dataset generation.

    Evaluating and Testing

    For evaluating and testing see: SCRIPTS_Testing_LVLM_ON_LAST_VQA.zip
    This can be used to test leading LVLMs through an API, create human tests, and in general turn the dataset into multiple-choice question images similar to the ones in the paper.

  11. Open Data Training Workshop: Case Studies in Open Data for Qualitative and...

    • borealisdata.ca
    • search.dataone.org
    Updated Apr 18, 2023
    Cite
    Srinvivas Murthy; Maggie Woo Kinshella; Jessica Trawin; Teresa Johnson; Niranjan Kissoon; Matthew Wiens; Gina Ogilvie; Gurm Dhugga; J Mark Ansermino (2023). Open Data Training Workshop: Case Studies in Open Data for Qualitative and Quantitative Clinical Research [Dataset]. http://doi.org/10.5683/SP3/BNNAE7
    Explore at:
    Dataset updated
    Apr 18, 2023
    Dataset provided by
    Borealis
    Authors
    Srinvivas Murthy; Maggie Woo Kinshella; Jessica Trawin; Teresa Johnson; Niranjan Kissoon; Matthew Wiens; Gina Ogilvie; Gurm Dhugga; J Mark Ansermino
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Dataset funded by
    Digital Research Alliance of Canada
    Description

    Objective(s): Momentum for open access to research is growing. Funding agencies and publishers are increasingly requiring researchers to make their data and research outputs open and publicly available. However, clinical researchers struggle to find real-world examples of Open Data sharing. The aim of this 1 hr virtual workshop is to provide real-world examples of Open Data sharing for both qualitative and quantitative data. Specifically, participants will learn:
    1. Primary challenges and successes when sharing quantitative and qualitative clinical research data.
    2. Platforms available for open data sharing.
    3. Ways to troubleshoot data sharing and publish from open data.

    Workshop Agenda:
    1. “Data sharing during the COVID-19 pandemic” - Speaker: Srinivas Murthy, Clinical Associate Professor, Department of Pediatrics, Faculty of Medicine, University of British Columbia. Investigator, BC Children's Hospital
    2. “Our experience with Open Data for the 'Integrating a neonatal healthcare package for Malawi' project.” - Speaker: Maggie Woo Kinshella, Global Health Research Coordinator, Department of Obstetrics and Gynaecology, BC Children’s and Women’s Hospital and University of British Columbia

    This workshop draws on work supported by the Digital Research Alliance of Canada.

    Data Description: Presentation slides, Workshop Video, and Workshop Communication
    Srinivas Murthy: Data sharing during the COVID-19 pandemic presentation and accompanying PowerPoint slides.
    Maggie Woo Kinshella: Our experience with Open Data for the 'Integrating a neonatal healthcare package for Malawi' project presentation and accompanying PowerPoint slides.

    This workshop was developed as part of Dr. Ansermino's Data Champions Pilot Project supported by the Digital Research Alliance of Canada.

    NOTE for restricted files: If you are not yet a CoLab member, please complete our membership application survey to gain access to restricted files within 2 business days.
    Some files may remain restricted to CoLab members. These files are deemed more sensitive by the file owner and are meant to be shared on a case-by-case basis. Please contact the CoLab coordinator on this page under "collaborate with the pediatric sepsis colab."

  12. 2021 Amazon Last Mile Routing Research Challenge Dataset

    • registry.opendata.aws
    Updated Sep 16, 2022
    Cite
    Amazon (2022). 2021 Amazon Last Mile Routing Research Challenge Dataset [Dataset]. https://registry.opendata.aws/amazon-last-mile-challenges/
    Explore at:
    Dataset updated
    Sep 16, 2022
    Dataset provided by
    Amazon.com (http://amazon.com/)
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The 2021 Amazon Last Mile Routing Research Challenge was an innovative research initiative led by Amazon.com and supported by the Massachusetts Institute of Technology’s Center for Transportation and Logistics. Over a period of 4 months, participants were challenged to develop innovative machine learning-based methods to enhance classic optimization-based approaches to solve the travelling salesperson problem, by learning from historical routes executed by Amazon delivery drivers. The primary goal of the Amazon Last Mile Routing Research Challenge was to foster innovative applied research in route planning, building on recent advances in predictive modeling, and using a real-world problem and data. The dataset released for the research challenge includes route-, stop-, and package-level features for 9,184 historical routes performed by Amazon drivers in 2018 in five metropolitan areas in the United States. This real-world dataset excludes any personally identifiable information (all route and package identifiers have been randomly regenerated and related location data have been obfuscated to ensure anonymity). Although multiple synthetic benchmark datasets are available in the literature, the dataset of the 2021 Amazon Last Mile Routing Research Challenge is the first large and publicly available dataset to include instances based on real-world operational routing data. The dataset is fully described and formally introduced in the following Transportation Science article: https://pubsonline.informs.org/doi/10.1287/trsc.2022.1173

  13. Real-World Evidence Solutions Market Analysis Americas, EMEA, APAC - US, UK,...

    • technavio.com
    Updated Oct 16, 2019
    Cite
    Technavio (2019). Real-World Evidence Solutions Market Analysis Americas, EMEA, APAC - US, UK, Germany, France, Japan - Size and Forecast 2020-2024 [Dataset]. https://www.technavio.com/report/real-world-evidence-solutions-market-industry-analysis
    Explore at:
    Dataset updated
    Oct 16, 2019
    Dataset provided by
    TechNavio
    Authors
    Technavio
    Time period covered
    2021 - 2025
    Area covered
    United States, Global
    Description

    Real-World Evidence Solutions Market Analysis Report 2020-2024:

    The real-world evidence solutions market size has the potential to grow by USD 697.92 mn during 2020-2024, and the market’s growth momentum will accelerate during the forecast period.

    This report provides a detailed analysis of the market by market landscape (oncology, immunology, cardiology, neurology, and others) and geography (Americas, EMEA, and APAC). Also, the report analyzes the market’s competitive landscape and offers information on several market vendors, including Clinigen Group Plc, ICON Plc, International Business Machines Corp., IQVIA Inc., Laboratory Corp. of America Holdings, Parexel International Corp., PerkinElmer Inc., Pharmaceutical Product Development LLC, SAS Institute Inc., and Syneos Health Inc.

    Market Competitive Analysis

    The market is fragmented. Clinigen Group Plc, ICON Plc, International Business Machines Corp., IQVIA Inc., Laboratory Corp. of America Holdings, Parexel International Corp., PerkinElmer Inc., Pharmaceutical Product Development LLC, SAS Institute Inc., and Syneos Health Inc. are some of the major market participants. Factors such as the benefits of real-world evidence solutions will offer immense growth opportunities. To make the most of the opportunities, vendors should focus on growth prospects in the fast-growing segments, while maintaining their positions in the slow-growing segments.

    To help clients improve their market position, this real-world evidence solutions market forecast report provides a detailed analysis of the market leaders and offers information on the competencies and capacities of these companies. The report also covers details on the market’s competitive landscape and offers information on the products offered by various companies. Moreover, this real-world evidence solutions market analysis report provides information on the upcoming trends and challenges that will influence market growth. This will help companies create strategies to make the most of future growth opportunities.

    This report provides information on the production, sustainability, and prospects of several leading companies, including:

    Clinigen Group Plc
    ICON Plc
    International Business Machines Corp.
    IQVIA Inc.
    Laboratory Corp. of America Holdings
    Parexel International Corp.
    PerkinElmer Inc.
    Pharmaceutical Product Development LLC
    SAS Institute Inc.
    Syneos Health Inc.
    

    Real-World Evidence Solutions Market: Segmentation by Geography

    The report offers an up-to-date analysis of the current global market scenario, the latest trends and drivers, and the overall market environment. The Americas will offer several growth opportunities to market vendors during the forecast period. The integration of EHRs with electronic data collection systems will significantly influence the growth of the real-world evidence solutions market in this region.

    52% of the market’s growth will originate from the Americas during the forecast period. The US is the key market for real-world evidence solutions in the Americas. This report provides an accurate prediction of the contribution of all segments to the growth of the real-world evidence solutions market size.

    Real-World Evidence Solutions Market: Key Highlights of the Report for 2020-2024

    CAGR of the market during the forecast period 2020-2024
    Detailed information on factors that will drive real-world evidence solutions market growth during the next five years
    Precise estimation of the real-world evidence solutions market size and its contribution to the parent market
    Accurate predictions on upcoming trends and changes in consumer behavior
    The growth of the real-world evidence solutions industry across Americas, EMEA, and APAC
    A thorough analysis of the market’s competitive landscape and detailed information on vendors
    Comprehensive details of factors that will challenge the growth of real-world evidence solutions market vendors
    

    Real-World Evidence Solutions Market Scope

    Report Coverage: Details
    Page number: 120
    Base year: 2019
    Forecast period: 2020-2024
    Growth momentum & CAGR: Accelerate at a CAGR of 13%
    Market growth 2020-2024: USD 697.92 million
    Market structure: Fragmented
    YoY growth (%): 10.53
    Regional analysis: Americas, EMEA, and APAC
    Performing market contribution: Americas at
    
  14. SCG Dataset

    • paperswithcode.com
    Updated Nov 12, 2024
    Cite
    (2024). SCG Dataset [Dataset]. https://paperswithcode.com/dataset/scg
    Explore at:
    Dataset updated
    Nov 12, 2024
    Description

    Abstract: Graph Neural Networks (GNNs) have recently gained traction in transportation, bioinformatics, language and image processing, but research on their application to supply chain management remains limited. Supply chains are inherently graph-like, making them ideal for GNN methodologies, which can optimize and solve complex problems. The barriers include a lack of proper conceptual foundations, familiarity with graph applications in SCM, and real-world benchmark datasets for GNN-based supply chain research. To address this, we discuss and connect supply chains with graph structures for effective GNN application, providing detailed formulations, examples, mathematical definitions, and task guidelines. Additionally, we present a multi-perspective real-world benchmark dataset from a leading FMCG company in Bangladesh, focusing on supply chain planning. We discuss various supply chain tasks using GNNs and benchmark several state-of-the-art models on homogeneous and heterogeneous graphs across six supply chain analytics tasks. Our analysis shows that GNN-based models consistently outperform statistical ML and other deep learning models by around 10-30% in regression, 10-30% in classification and detection tasks, and 15-40% in anomaly detection tasks on designated metrics. With this work, we lay the groundwork for solving supply chain problems using GNNs, supported by conceptual discussions, methodological insights, and a comprehensive dataset.

  15. Annotated Benchmark of Real-World Data for Approximate Functional Dependency...

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Jul 1, 2023
    Cite
    Marcel Parciak; Sebastiaan Weytjens; Frank Neven; Niel Hens; Liesbet M. Peeters; Stijn Vansummeren (2023). Annotated Benchmark of Real-World Data for Approximate Functional Dependency Discovery [Dataset]. http://doi.org/10.5281/zenodo.8098909
    Explore at:
    csv (available download formats)
    Dataset updated
    Jul 1, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Marcel Parciak; Sebastiaan Weytjens; Frank Neven; Niel Hens; Liesbet M. Peeters; Stijn Vansummeren
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Annotated Benchmark of Real-World Data for Approximate Functional Dependency Discovery

    This collection consists of ten open access relations commonly used by the data management community. In addition to the relations themselves (please take note of the references to the original sources below), we added three lists in this collection that describe approximate functional dependencies found in the relations. These lists are the result of a manual annotation process performed by two independent individuals by consulting the respective schemas of the relations and identifying column combinations where one column implies another based on its semantics. As an example, in the claims.csv file, the AirportCode implies AirportName, as each code should be unique for a given airport.

    The file ground_truth.csv is a comma separated file containing approximate functional dependencies. table describes the relation we refer to, lhs and rhs reference two columns of those relations where semantically we found that lhs implies rhs.

    The files excluded_candidates.csv and included_candidates.csv list all column combinations that were excluded or included in the manual annotation, respectively. We excluded a candidate if there was no tuple where both attributes had a value or if the g3_prime value was too small.
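    To make the idea of an approximate functional dependency concrete, the sketch below computes the classic g3 error for a column pair: the smallest fraction of rows that would have to be removed for lhs to determine rhs exactly. This is an assumption-laden illustration, not the authors' code. It is not necessarily the g3_prime variant mentioned above, and the sample rows are made up rather than taken from claims.csv.

```python
import pandas as pd


def g3_error(df, lhs, rhs):
    """Classic g3 error: minimum fraction of rows to drop so that lhs -> rhs holds."""
    pairs = df[[lhs, rhs]].dropna()
    if pairs.empty:
        return float("nan")
    # Count each (lhs, rhs) combination, then keep the most common rhs per lhs.
    counts = pairs.groupby([lhs, rhs]).size()
    kept = counts.groupby(level=0).max().sum()
    return 1.0 - kept / len(pairs)


# Made-up rows using the AirportCode / AirportName example from the text.
claims = pd.DataFrame(
    {
        "AirportCode": ["JFK", "JFK", "JFK", "LAX"],
        "AirportName": ["John F Kennedy", "John F Kennedy", "JFK Intl", "Los Angeles Intl"],
    }
)

print(g3_error(claims, "AirportCode", "AirportName"))
```

    An error of 0 means the dependency holds exactly; small positive values indicate a dependency that holds up to a few dirty rows.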

    Dataset References

  16. Lisbon, Portugal, hotel’s customer dataset with three years of personal,...

    • data.mendeley.com
    Updated Nov 18, 2020
    Cite
    Nuno Antonio (2020). Lisbon, Portugal, hotel’s customer dataset with three years of personal, behavioral, demographic, and geographic information [Dataset]. http://doi.org/10.17632/j83f5fsh6c.1
    Explore at:
    Dataset updated
    Nov 18, 2020
    Authors
    Nuno Antonio
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Portugal, Lisbon
    Description

    Hotel customer dataset with 31 variables describing a total of 83,590 instances (customers). It covers three full years of customer behavioral data. In addition to personal and behavioral information, the dataset also contains demographic and geographical information. This dataset contributes to reducing the lack of real-world business data that can be used for educational and research purposes. The dataset can be used in data mining, machine learning, and other analytical problems within the scope of data science. Due to its unit of analysis, it is especially suitable for building customer segmentation models, including clustering and RFM (Recency, Frequency, and Monetary value) models, but it can also be used in classification and regression problems.
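    As a rough illustration of the RFM models mentioned above, the sketch below scores customers from 1 (worst) to 3 (best) on each dimension using rank-based bins. The column names, the sample values, and the choice of three bins are all hypothetical; the actual dataset's variable names are not assumed here.

```python
import pandas as pd


def rfm_scores(df, recency, frequency, monetary, bins=3):
    """Score each customer 1..bins on recency, frequency, and monetary value."""
    labels = list(range(1, bins + 1))
    # Ranking first makes the values unique, so qcut never sees duplicate edges.
    # A lower recency (a more recent visit) is better, hence the descending rank.
    ranks = {
        "R": df[recency].rank(method="first", ascending=False),
        "F": df[frequency].rank(method="first"),
        "M": df[monetary].rank(method="first"),
    }
    scores = pd.DataFrame(index=df.index)
    for name, rank in ranks.items():
        scores[name] = pd.qcut(rank, bins, labels=labels).astype(int)
    return scores


# Hypothetical customer-level columns and values.
customers = pd.DataFrame(
    {
        "days_since_last_stay": [5, 40, 100, 200, 10, 365],
        "bookings": [10, 3, 1, 2, 8, 1],
        "revenue": [900, 300, 50, 120, 700, 20],
    }
)

scores = rfm_scores(customers, "days_since_last_stay", "bookings", "revenue")
print(scores)
```

    The three scores can then be concatenated or summed into segments, or passed as features to a clustering model.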

  17. the stack processed try version

    • kaggle.com
    Updated May 8, 2025
    Cite
    Vincenzo_Gallo (2025). the stack processed try version [Dataset]. https://www.kaggle.com/datasets/vincenzomcgiurre/the-stack-processed-try-version
    Explore at:
    Dataset updated
    May 8, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Vincenzo_Gallo
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    🚀 What Makes This Dataset Unique?

    Exclusive Enrichment: Every file has been carefully selected, annotated, and enriched for superior quality, far beyond standard code dumps.
    Covers 45+ Languages: From modern Python to Rust, Go, TypeScript, Java, C++, and many more.
    Current Frameworks & Libraries: Includes real-world examples and best practices for React, TensorFlow, PyTorch, Node.js, Django, and others.
    Built-in Security: 150+ real vulnerability/fix pairs to help train safer, more security-aware models.
    Optimization & Performance: 200+ optimization patterns with benchmarks and explanations, so your models generate not just working code, but efficient code.
    Superior Documentation: 30,000+ files with detailed comments, docstrings, parameter explanations, edge case notes, and return value documentation.
    AI-Ready: Perfect for training, validating, or testing code generation, code review, code search, and programming assistant models.
    Simple Structure: Files are organized by language, category, and framework, so they are easy to integrate into any machine learning pipeline.
    Future Updates Included: Buy now and receive all future versions and enrichments of the dataset for free.
    Direct Support: Email support for technical questions, integration, or dataset customization.

    🎯 Who Is This For?

    AI/ML researchers and developers
    Startups and companies building code assistants, code review, or code search tools
    Universities and research centers
    Bootcamps and advanced training platforms
    Anyone seeking a high-quality, ready-to-use code dataset

    💡 Example Use Cases

    Training LLMs for code generation
    Developing automated code review tools
    Security analysis and automated refactoring
    Benchmarking AI models on modern languages and frameworks
    Creating demos, tutorials, and educational content

    🔒 Flexible Licensing

    Personal and commercial use included
    No royalty fees
    Custom licenses available for enterprise clients

    📦 What You Get

    1 ZIP file (20GB) with the entire organized dataset
    Detailed README with structure, statistics, and usage tips
    Access to all future updates and enrichments

    🔥 Launch Offer

    Special discount for early buyers!
    7-day money-back guarantee
    Contact me for questions, demos, or custom solutions!

  18. Multi-Class Driver Behavior Image Dataset

    • data.mendeley.com
    Updated Nov 15, 2024
    Cite
    Arafat Sahin Afridi (2024). Multi-Class Driver Behavior Image Dataset [Dataset]. http://doi.org/10.17632/mzb4b6dff3.1
    Explore at:
    Dataset updated
    Nov 15, 2024
    Authors
    Arafat Sahin Afridi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Distracted driving-related accidents are a critical global issue, especially as road traffic increases in densely populated areas. To address the challenge of driver distraction, we introduce a novel dataset that supports the development of real-time monitoring and detection systems by capturing authentic driver behaviors. Collected in Ashulia, Dhaka, Bangladesh, in October 2024, this dataset includes images captured under real-world driving conditions within both private vehicles and public buses. The photos were taken using personal mobile phones, ensuring a realistic and diverse set of visual data.

    This dataset spans a wide range of driving behaviors, including safe driving, turning, texting, talking on the phone, and other potentially risky behaviors, such as drowsy driving. By depicting these behaviors in everyday driving scenarios, the dataset serves as a valuable resource for training and evaluating models designed to detect unsafe driving practices in real time. The dataset includes high-resolution photos taken inside public buses and personal cars in Ashulia, Dhaka, Bangladesh, under actual driving circumstances. The photographs, which were taken using the cameras on personal cell phones, offer a genuine and varied collection of visual information under normal driving circumstances. The following five behavioral classes comprise the dataset:

    I. Safe Driving: Images showing a driver who appears to be paying attention to the road, with both hands on the wheel, or with one hand on the steering wheel and the other on the gear stick. This is the reference example of driving without distractions.
    II. Turning: Photographs that show drivers changing direction during turns by moving their heads or full bodies. This behavior is crucial for figuring out how focused the driver is on everyday tasks like rotating the steering wheel.
    III. Texting Phone: Pictures of drivers using their phones, whether to type messages or to interact with the screen. Since texting while driving is one of the main causes of distracted driving, this class is very important for identifying it.
    IV. Talking Phones: Images of drivers talking on their phones or holding them up to their ears while driving a vehicle. This category aids in identifying actions connected to phone conversations, which are another frequent source of distraction.
    V. Others: Contains any actions that go against safe driving practices, like drinking water or another beverage while driving, sleeping while driving, or talking with someone behind while driving.

    Relevant photos are included in each session, and they differ in terms of vehicle type and illumination to represent the variety of driving situations found in the real world. Because the images are unprocessed and unannotated, there is freedom in how machine learning applications pre-process them.

  19. TREC 2022 Deep Learning test collection

    • catalog.data.gov
    • data.nist.gov
    Updated May 9, 2023
    Cite
    National Institute of Standards and Technology (2023). TREC 2022 Deep Learning test collection [Dataset]. https://catalog.data.gov/dataset/trec-2022-deep-learning-test-collection
    Dataset updated
    May 9, 2023
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    This is a test collection for passage and document retrieval, produced in the TREC 2022 Deep Learning track. The Deep Learning Track studies information retrieval in a large training data regime. This is the case where the number of training queries with at least one positive label is at least in the tens of thousands, if not hundreds of thousands or more. This corresponds to real-world scenarios such as training based on click logs and training based on labels from shallow pools (such as the pooling in the TREC Million Query Track or the evaluation of search engines based on early precision).

    Certain machine learning based methods, such as methods based on deep learning, are known to require very large datasets for training. The lack of such large-scale datasets has been a limitation for developing such methods for common information retrieval tasks, such as document ranking. The Deep Learning Track organized in previous years aimed at providing large-scale datasets to TREC and creating a focused research effort with a rigorous blind evaluation of rankers for the passage ranking and document ranking tasks.

    As in previous years, one of the main goals of the track in 2022 is to study what methods work best when a large amount of training data is available. For example, do the same methods that work on small data also work on large data? How much do methods improve when given more training data? What external data and models can be brought to bear in this scenario, and how useful is it to combine full supervision with other forms of supervision?

    The collection contains 12 million web pages, 138 million passages from those web pages, search queries, and relevance judgments for the queries.
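The description mentions relevance judgments and evaluation based on early precision. As a rough illustration only, the sketch below computes precision at rank k from judgments written in the common TREC qrels layout (`query_id iteration doc_id relevance`). The sample lines, the document identifiers, the function names, and the `min_rel` threshold are all assumptions made for this example; the listing does not state the file layout or the metrics used for this collection.

```python
def parse_qrels(lines):
    """Parse qrels-style lines: 'query_id iteration doc_id relevance'."""
    qrels = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        qid, _iteration, doc_id, rel = parts
        qrels.setdefault(qid, {})[doc_id] = int(rel)
    return qrels


def precision_at_k(ranked_doc_ids, judged, k, min_rel=1):
    """Fraction of the top-k results whose judged relevance is >= min_rel.

    Unjudged documents count as non-relevant (an assumption of this sketch).
    """
    top = ranked_doc_ids[:k]
    hits = sum(1 for doc_id in top if judged.get(doc_id, 0) >= min_rel)
    return hits / k


# Hypothetical judgments and ranking, invented for illustration.
sample_qrels = [
    "q1 0 d1 2",
    "q1 0 d2 0",
    "q1 0 d3 1",
]
qrels = parse_qrels(sample_qrels)
ranking = ["d3", "d9", "d1", "d2"]

p_at_2 = precision_at_k(ranking, qrels["q1"], k=2)
p_at_4 = precision_at_k(ranking, qrels["q1"], k=4)
print(p_at_2, p_at_4)
```

Raising `min_rel` makes the metric stricter when judgments are graded rather than binary.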

  20. RadarScenes Dataset

    • paperswithcode.com
    Cite
    Ole Schumann; Markus Hahn; Nicolas Scheiner; Fabio Weishaupt; Julius F. Tilly; Jürgen Dickmann; Christian Wöhler, RadarScenes Dataset [Dataset]. https://paperswithcode.com/dataset/radarscenes
    Authors
    Ole Schumann; Markus Hahn; Nicolas Scheiner; Fabio Weishaupt; Julius F. Tilly; Jürgen Dickmann; Christian Wöhler
    Description

    RadarScenes is a real-world radar point cloud dataset for automotive applications.

    It consists of measurements and point-wise annotations from more than four hours of driving collected by four series radar sensors mounted on one test vehicle. Individual detections of dynamic objects were manually grouped to clusters and labeled afterwards. The purpose of this data set is to enable the development of novel (machine learning-based) radar perception algorithms with the focus on moving road users. Images of the recorded sequences were captured using a documentary camera.

Samy Baladram (2023). 🔍 Diverse CSV Dataset Samples [Dataset]. https://www.kaggle.com/datasets/samybaladram/multidisciplinary-csv-datasets-collection/code

🔍 Diverse CSV Dataset Samples

Real-world stats for data practice

Dataset updated
Nov 6, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Samy Baladram
License

http://www.gnu.org/licenses/lgpl-3.0.html

Description

Overview

The dataset provided here is a rich compilation of various data files gathered to support diverse analytical challenges and education in data science. It is especially curated to provide researchers, data enthusiasts, and students with real-world data across different domains, including biostatistics, travel, real estate, sports, media viewership, and more.

Files

Below is a brief overview of what each CSV file contains:
- Addresses: Practical examples of string manipulation and address data formatting in CSV.
- Air Travel: Historical dataset suitable for analyzing trends in air travel over a period of three years.
- Biostats: A dataset of office workers' biometrics, ideal for introductory statistics and biology.
- Cities: Geographic and administrative data for urban analysis or socio-demographic studies.
- Car Crashes in Catalonia: Weekly traffic accident data from Catalonia, providing a base for public policy research.
- De Niro's Film Ratings: Analyze trends in film ratings over time with this entertainment-focused dataset.
- Ford Escort Sales: Pre-owned vehicle sales data, perfect for regression analysis or price prediction models.
- Old Faithful Geyser: Geological data for pattern recognition and prediction in natural phenomena.
- Freshman Year Weights and BMIs: Dataset depicting weight and BMI changes for health and lifestyle studies.
- Grades: Education performance data which can be correlated with demographics or study patterns.
- Home Sales: A dataset reflecting the housing market dynamics, useful for economic analysis or real estate appraisal.
- Hooke's Law Demonstration: Physics data illustrating the classic principle of elasticity in springs.
- Hurricanes and Storm Data: Climate data on hurricane and storm frequency for environmental risk assessments.
- Height and Weight Measurements: Public health research dataset on anthropometric data.
- Lead Shot Specs: Detailed engineering data for material sciences and manufacturing studies.
- Alphabet Letter Frequency: Text analysis dataset for frequency distribution studies in large text samples.
- MLB Player Statistics: Comprehensive athletic data set for analysis of performance metrics in sports.
- MLB Teams' Seasonal Performance: A dataset combining financial and sports performance data from the 2012 MLB season.
- TV News Viewership: Media consumption data which can be used to analyze viewing patterns and trends.
- Historical Nile Flood Data: A unique environmental dataset for historical trend analysis in flood levels.
- Oscar Winner Ages: A dataset to explore age trends among Oscar-winning actors and actresses.
- Snakes and Ladders Statistics: Data from the game outcomes useful in studying probability and game theory.
- Tallahassee Cab Fares: Price modeling data from the real-world pricing of taxi services.
- Taxable Goods Data: A snapshot of economic data concerning taxation impact on prices.
- Tree Measurements: Ecological and environmental science data related to tree growth and forest management.
- Real Estate Prices from Zillow: Market analysis dataset for those interested in housing price determinants.

Format

The enclosed files follow the comma-separated values (CSV) format, ensuring compatibility with most data processing libraries in Python, R, and other languages. The datasets are ready for import into Jupyter notebooks, RStudio, or any other integrated development environment (IDE) used for data science.
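As a minimal sketch of such an import, the snippet below parses CSV text with Python's standard `csv` module. The sample text, its column names, and the `load_rows` helper are hypothetical and not taken from the dataset. The `skipinitialspace=True` option is included on the assumption that some files put a space after each comma; it is harmless when they do not.

```python
import csv
import io

# Hypothetical sample in the style of a small CSV file with a header line.
SAMPLE = '''"Name", "Age", "Height (in)"
"Alex", 41, 74
"Bert", 42, 68
'''


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by the header names."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    return [dict(row) for row in reader]


rows = load_rows(SAMPLE)
print(rows[0]["Name"], rows[0]["Age"])
```

To read a downloaded file instead, the same reader can be given a file object opened with `open(path, newline="")`. All values come back as strings, so numeric columns need an explicit conversion.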

Quality Assurance

The data is pre-checked for common issues such as missing values, duplicate records, and inconsistent entries, offering a clean and reliable dataset for various analytical exercises. Some CSV files begin with a header line that names the dataset fields, so users can identify the fields and start their analysis without additional cleaning of the headers.
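Readers who want to repeat these checks themselves could start from something like the sketch below. It is not the procedure used by the dataset author; the sample text and the `audit` helper are hypothetical. It flags records with empty cells, exact duplicate records, and records whose field count differs from the first row. Because only some files have a header line, the caller states whether the first row is a header.

```python
import csv
import io

# Hypothetical sample with one missing value and one duplicated record.
SAMPLE = """name,age,city
Ana,34,Lisbon
Ben,,Oslo
Ana,34,Lisbon
"""


def audit(text, has_header=True):
    """Return 0-based record indices with missing cells, duplicates, or odd widths."""
    rows = list(csv.reader(io.StringIO(text), skipinitialspace=True))
    width = len(rows[0]) if rows else 0
    records = rows[1:] if has_header else rows

    missing = [i for i, rec in enumerate(records)
               if any(cell.strip() == "" for cell in rec)]
    ragged = [i for i, rec in enumerate(records) if len(rec) != width]

    seen, duplicates = set(), []
    for i, rec in enumerate(records):
        key = tuple(rec)
        if key in seen:
            duplicates.append(i)  # a later copy of an earlier record
        seen.add(key)

    return {"missing": missing, "duplicates": duplicates, "ragged": ragged}


report = audit(SAMPLE)
print(report)
```

The indices count data records only, so with `has_header=True` index 0 is the first line after the header.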

Acknowledgements

The dataset adheres to the GNU LGPL license, making it freely available for modification and distribution, provided that the original source is cited. This opens up possibilities for educators to integrate real-world data into curricula, researchers to validate models against diverse datasets, and practitioners to refine their analytical skills with hands-on data.

This dataset has been compiled from https://people.sc.fsu.edu/~jburkardt/data/csv/csv.html, with gratitude to the authors and maintainers for their dedication to providing open data resources for educational and research purposes.
