100+ datasets found
  1. Best Books Ever Dataset

    • zenodo.org
    csv
    Updated Nov 10, 2020
    Cite
    Lorena Casanova Lozano; Sergio Costa Planells (2020). Best Books Ever Dataset [Dataset]. http://doi.org/10.5281/zenodo.4265096
    Available download formats: csv
    Dataset updated
    Nov 10, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Lorena Casanova Lozano; Sergio Costa Planells
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The dataset was collected as part of Prac1 of the subject Typology and Data Life Cycle in the Master's Degree in Data Science at the Universitat Oberta de Catalunya (UOC).

    The dataset contains 25 variables and 52,478 records corresponding to books on the GoodReads Best Books Ever list (the largest list on the site).

    The original code used to retrieve the dataset can be found in the GitHub repository: github.com/scostap/goodreads_bbe_dataset

    The data was retrieved in two sets: the first 30,000 books, then the remaining 22,478. Dates were not parsed and reformatted for the second chunk, so publishDate and firstPublishDate are represented in mm/dd/yyyy format for the first 30,000 records and in "Month Day Year" format for the rest.

    Book cover images can be optionally downloaded from the URL in the 'coverImg' field. Python code for doing so, along with an example, can be found in the GitHub repo.
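
    The repo's script is the authoritative version; the following is a minimal sketch of the same idea, assuming the CSV has been saved locally as books.csv (an assumed file name) and that the requests package is installed:

    ```python
    import csv
    import pathlib

    import requests

    out_dir = pathlib.Path("covers")
    out_dir.mkdir(exist_ok=True)

    with open("books.csv", newline="", encoding="utf-8") as f:
        for i, row in enumerate(csv.DictReader(f)):
            url = row.get("coverImg", "")
            if not url.startswith("http"):
                continue  # coverImg is ~99% complete; skip empty values
            resp = requests.get(url, timeout=10)
            if resp.ok:
                (out_dir / f"{row['bookId']}.jpg").write_bytes(resp.content)
            if i >= 9:  # demo: only the first ten rows
                break
    ```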

    The 25 fields of the dataset are:

    | Attribute | Definition | Completeness (%) |
    | ------------- | ------------- | ------------- | 
    | bookId | Book Identifier as in goodreads.com | 100 |
    | title | Book title | 100 |
    | series | Series Name | 45 |
    | author | Book's Author | 100 |
    | rating | Global goodreads rating | 100 |
    | description | Book's description | 97 |
    | language | Book's language | 93 |
    | isbn | Book's ISBN | 92 |
    | genres | Book's genres | 91 |
    | characters | Main characters | 26 |
    | bookFormat | Type of binding | 97 |
    | edition | Type of edition (ex. Anniversary Edition) | 9 |
    | pages | Number of pages | 96 |
    | publisher | Publishing house | 93 |
    | publishDate | Publication date | 98 |
    | firstPublishDate | Publication date of first edition | 59 |
    | awards | List of awards | 20 |
    | numRatings | Number of total ratings | 100 |
    | ratingsByStars | Number of ratings by stars | 97 |
    | likedPercent | Derived field, percent of ratings above 2 stars (as in GoodReads) | 99 |
    | setting | Story setting | 22 |
    | coverImg | URL to cover image | 99 |
    | bbeScore | Score in Best Books Ever list | 100 |
    | bbeVotes | Number of votes in Best Books Ever list | 100 |
    | price | Book's price (extracted from Iberlibro) | 73 |
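
    The completeness column can be spot-checked directly from the CSV. A sketch with pandas (same assumed local file name; note that notna() misses empty strings and '[]' placeholders, so the figures are approximate):

    ```python
    import pandas as pd

    df = pd.read_csv("books.csv")
    completeness = df.notna().mean().mul(100).round().astype(int)
    print(completeness.sort_values())  # 'edition' should come out lowest (~9)
    ```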

  2. Data from: NICHE: A Curated Dataset of Engineered Machine Learning Projects in Python

    • figshare.com
    txt
    Updated May 30, 2023
    Cite
    Ratnadira Widyasari; Zhou YANG; Ferdian Thung; Sheng Qin Sim; Fiona Wee; Camellia Lok; Jack Phan; Haodi Qi; Constance Tan; Qijin Tay; David LO (2023). NICHE: A Curated Dataset of Engineered Machine Learning Projects in Python [Dataset]. http://doi.org/10.6084/m9.figshare.21967265.v1
    Available download formats: txt
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Ratnadira Widyasari; Zhou YANG; Ferdian Thung; Sheng Qin Sim; Fiona Wee; Camellia Lok; Jack Phan; Haodi Qi; Constance Tan; Qijin Tay; David LO
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Machine learning (ML) has gained much attention and has been incorporated into our daily lives. While there are numerous publicly available ML projects on open source platforms such as GitHub, there have been limited attempts at filtering those projects to curate ML projects of high quality. The limited availability of such high-quality datasets poses an obstacle to understanding ML projects. To help clear this obstacle, we present NICHE, a manually labelled dataset consisting of 572 ML projects. Based on evidence of good software engineering practices, we label 441 of these projects as engineered and 131 as non-engineered. In this repository we provide the "NICHE.csv" file, which contains the list of the project names along with their labels, descriptive information for every dimension, and several basic statistics, such as the number of stars and commits. This dataset can help researchers understand the practices that are followed in high-quality ML projects. It can also be used as a benchmark for classifiers designed to identify engineered ML projects.

    GitHub page: https://github.com/soarsmu/NICHE
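
    As a hedged sketch of how the file might be loaded (the "Label" column name is an assumption; check the schema in the repository):

    ```python
    import pandas as pd

    niche = pd.read_csv("NICHE.csv")
    print(niche.columns.tolist())         # inspect the actual schema first
    print(niche["Label"].value_counts())  # expected split: 441 engineered, 131 not
    ```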

  3. TREC 2022 Deep Learning test collection

    • catalog.data.gov
    • data.nist.gov
    Updated May 9, 2023
    Cite
    National Institute of Standards and Technology (2023). TREC 2022 Deep Learning test collection [Dataset]. https://catalog.data.gov/dataset/trec-2022-deep-learning-test-collection
    Dataset updated
    May 9, 2023
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    This is a test collection for passage and document retrieval, produced in the TREC 2022 Deep Learning track. The Deep Learning Track studies information retrieval in a large-training-data regime: the number of training queries with at least one positive label is at least in the tens of thousands, if not hundreds of thousands or more. This corresponds to real-world scenarios such as training based on click logs and training based on labels from shallow pools (such as the pooling in the TREC Million Query Track or the evaluation of search engines based on early precision).

    Certain machine learning methods, such as those based on deep learning, are known to require very large datasets for training. The lack of such large-scale datasets has been a limitation in developing such methods for common information retrieval tasks, such as document ranking. The Deep Learning Track organized in previous years aimed at providing large-scale datasets to TREC, and at creating a focused research effort with a rigorous blind evaluation of rankers for the passage ranking and document ranking tasks.

    Similar to previous years, one of the main goals of the track in 2022 is to study what methods work best when a large amount of training data is available. For example, do the same methods that work on small data also work on large data? How much do methods improve when given more training data? What external data and models can be brought to bear in this scenario, and how useful is it to combine full supervision with other forms of supervision?

    The collection contains 12 million web pages, 138 million passages from those web pages, search queries, and relevance judgments for the queries.

  4. Malware Analysis Datasets: Top-1000 PE Imports

    • ieee-dataport.org
    Updated Nov 8, 2019
    Cite
    Angelo Oliveira (2019). Malware Analysis Datasets: Top-1000 PE Imports [Dataset]. https://ieee-dataport.org/open-access/malware-analysis-datasets-top-1000-pe-imports
    Dataset updated
    Nov 8, 2019
    Authors
    Angelo Oliveira
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is part of my PhD research on malware detection and classification using Deep Learning. It contains static analysis data: the top 1,000 imported functions extracted from the 'pe_imports' elements of Cuckoo Sandbox reports. PE malware examples were downloaded from virusshare.com. PE goodware examples were downloaded from portableapps.com and from Windows 7 x86 directories.
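
    A minimal sketch of the extraction step, assuming the usual Cuckoo report layout (a top-level 'static' object whose 'pe_imports' list holds one entry per DLL; verify against your Cuckoo version):

    ```python
    import json
    from collections import Counter

    with open("report.json", encoding="utf-8") as f:
        report = json.load(f)

    counts = Counter()
    for dll in report.get("static", {}).get("pe_imports", []):
        for imp in dll.get("imports", []):
            if imp.get("name"):
                counts[imp["name"]] += 1

    print(counts.most_common(20))  # candidates for a Top-1000 feature vector
    ```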

  5. Car Acceptability Dataset for Machine Learning

    • kaggle.com
    Updated Sep 11, 2023
    Cite
    ismetgocer (2023). Car Acceptability Dataset for Machine Learning [Dataset]. https://www.kaggle.com/datasets/ismetgocer/car-acceptability
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 11, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    ismetgocer
    Description

    Are you ready for a good challenge? There are many missing values, unusual symbols, and other problems. Let's start some wonderful analyses!

    There are 1,729 rows and 7 columns in this dataset.

    Features:

    • buying: Buying price of the car (v-high, high, med, low). Examples: low, med, high
    • maint: Price of the maintenance of the car (v-high, high, med, low). Examples: low, med, high
    • doors: Number of doors (2, 3, 4, 5-more). Examples: 2, 3, 4
    • persons: Capacity in terms of persons to carry (2, 4, more). Examples: 2, 4, more
    • lug_boot: The size of the luggage boot (small, med, big). Examples: small, med, big
    • safety: Estimated safety of the car (low, med, high). Examples: low, med, high

    Target:

    • class: Car acceptability (unacc: unacceptable, acc: acceptable, good: good, v-good: very good). Examples: unacc, acc, good
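
    Since every feature is an ordered category, an explicit ordinal encoding preserves the low < med < high ordering. A sketch with scikit-learn (the file name car_acceptability.csv is an assumption, and unknown_value=-1 absorbs the missing values and stray symbols the description warns about):

    ```python
    import pandas as pd
    from sklearn.preprocessing import OrdinalEncoder

    # Explicit category orders so low < med < high < v-high is preserved.
    orders = {
        "buying":   ["low", "med", "high", "v-high"],
        "maint":    ["low", "med", "high", "v-high"],
        "doors":    ["2", "3", "4", "5-more"],
        "persons":  ["2", "4", "more"],
        "lug_boot": ["small", "med", "big"],
        "safety":   ["low", "med", "high"],
    }

    df = pd.read_csv("car_acceptability.csv").astype(str)  # assumed file name
    enc = OrdinalEncoder(
        categories=[orders[c] for c in orders],
        handle_unknown="use_encoded_value",
        unknown_value=-1,  # missing values / odd symbols map to -1
    )
    X = enc.fit_transform(df[list(orders)])
    y = df["class"]
    ```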

  6. Raw dataset

    • plos.figshare.com
    txt
    Updated Jun 21, 2023
    Cite
    Vincent Arnaud; François Pellegrino; Sumir Keenan; Xavier St-Gelais; Nicolas Mathevon; Florence Levréro; Christophe Coupé (2023). Raw dataset. [Dataset]. http://doi.org/10.1371/journal.pcbi.1010325.s001
    Available download formats: txt
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    PLOS Computational Biology
    Authors
    Vincent Arnaud; François Pellegrino; Sumir Keenan; Xavier St-Gelais; Nicolas Mathevon; Florence Levréro; Christophe Coupé
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Despite the accumulation of data and studies, deciphering animal vocal communication remains challenging. In most cases, researchers must deal with the sparse recordings composing Small, Unbalanced, Noisy, but Genuine (SUNG) datasets. SUNG datasets are characterized by a limited number of recordings, most often noisy, and unbalanced in number between the individuals or categories of vocalizations. SUNG datasets therefore offer a valuable but inevitably distorted vision of communication systems. Adopting the best practices in their analysis is essential to effectively extract the available information and draw reliable conclusions.

    Here we show that the most recent advances in machine learning applied to a SUNG dataset succeed in unraveling the complex vocal repertoire of the bonobo, and we propose a workflow that can be effective with other animal species. We implement acoustic parameterization in three feature spaces and run a Supervised Uniform Manifold Approximation and Projection (S-UMAP) to evaluate how call types and individual signatures cluster in the bonobo acoustic space. We then implement three classification algorithms (Support Vector Machine, xgboost, neural networks) and their combination to explore the structure and variability of bonobo calls, as well as the robustness of the individual signature they encode. We underscore how classification performance is affected by the feature set and identify the most informative features. In addition, we highlight the need to address data leakage in the evaluation of classification performance to avoid misleading interpretations.

    Our results lead to identifying several practical approaches that are generalizable to any other animal communication system. To improve the reliability and replicability of vocal communication studies with SUNG datasets, we thus recommend: i) comparing several acoustic parameterizations; ii) visualizing the dataset with supervised UMAP to examine the species acoustic space; iii) adopting Support Vector Machines as the baseline classification approach; iv) explicitly evaluating data leakage and possibly implementing a mitigation strategy.
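
    Recommendations ii) and iii) are straightforward to set up. A sketch using the umap-learn package, with synthetic stand-in data in place of real acoustic features:

    ```python
    import umap  # umap-learn package
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # Stand-in for acoustic features X and call-type labels y.
    X, y = make_classification(n_samples=400, n_features=30, n_informative=8,
                               n_classes=4, random_state=0)

    # ii) supervised UMAP: passing y makes the projection label-aware.
    embedding = umap.UMAP(n_neighbors=15, random_state=0).fit_transform(X, y)

    # iii) SVM baseline on scaled features.
    svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    print(cross_val_score(svm, X, y, cv=5).mean())
    # iv) with real recordings, split folds by individual to avoid leakage.
    ```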

  7. Table1_Enhancing biomechanical machine learning with limited data: generating realistic synthetic posture data using generative artificial intelligence

    • frontiersin.figshare.com
    pdf
    Updated Feb 14, 2024
    Cite
    Carlo Dindorf; Jonas Dully; Jürgen Konradi; Claudia Wolf; Stephan Becker; Steven Simon; Janine Huthwelker; Frederike Werthmann; Johanna Kniepert; Philipp Drees; Ulrich Betz; Michael Fröhlich (2024). Table1_Enhancing biomechanical machine learning with limited data: generating realistic synthetic posture data using generative artificial intelligence.pdf [Dataset]. http://doi.org/10.3389/fbioe.2024.1350135.s001
    Available download formats: pdf
    Dataset updated
    Feb 14, 2024
    Dataset provided by
    Frontiers
    Authors
    Carlo Dindorf; Jonas Dully; Jürgen Konradi; Claudia Wolf; Stephan Becker; Steven Simon; Janine Huthwelker; Frederike Werthmann; Johanna Kniepert; Philipp Drees; Ulrich Betz; Michael Fröhlich
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Objective: Biomechanical Machine Learning (ML) models, particularly deep-learning models, demonstrate the best performance when trained using extensive datasets. However, biomechanical data are frequently limited due to diverse challenges. Effective methods for augmenting data in developing ML models, specifically in the human posture domain, are scarce. Therefore, this study explored the feasibility of leveraging generative artificial intelligence (AI) to produce realistic synthetic posture data by utilizing three-dimensional posture data.

    Methods: Data were collected from 338 subjects through surface topography. A Variational Autoencoder (VAE) architecture was employed to generate and evaluate synthetic posture data, examining its distinguishability from real data by domain experts, ML classifiers, and Statistical Parametric Mapping (SPM). The benefits of incorporating augmented posture data into the learning process were exemplified by a deep autoencoder (AE) for automated feature representation.

    Results: Our findings highlight the challenge of differentiating synthetic data from real data for both experts and ML classifiers, underscoring the quality of synthetic data. This observation was also confirmed by SPM. By integrating synthetic data into AE training, the reconstruction error can be reduced compared to using only real data samples. Moreover, this study demonstrates the potential for reduced latent dimensions, while maintaining a reconstruction accuracy comparable to AEs trained exclusively on real data samples.

    Conclusion: This study emphasizes the prospects of harnessing generative AI to enhance ML tasks in the biomechanics domain.
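
    The distinguishability check described in the Methods can be framed as a real-vs-synthetic classification task: accuracy near chance (0.5) means the synthetic samples are hard to tell apart. A self-contained sketch with random stand-in arrays in place of posture data:

    ```python
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X_real = rng.normal(size=(338, 50))   # stand-in for measured postures
    X_synth = rng.normal(size=(338, 50))  # stand-in for VAE samples

    X = np.vstack([X_real, X_synth])
    y = np.r_[np.zeros(len(X_real)), np.ones(len(X_synth))]

    acc = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
    print(acc.mean())  # ~0.5 here, since both stand-ins share one distribution
    ```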

  8. Steam Video Game and Bundle Data

    • cseweb.ucsd.edu
    json
    + more versions
    Cite
    UCSD CSE Research Project, Steam Video Game and Bundle Data [Dataset]. https://cseweb.ucsd.edu/~jmcauley/datasets.html
    Available download formats: json
    Dataset authored and provided by
    UCSD CSE Research Project
    Description

    These datasets contain reviews from the Steam video game platform, and information about which games were bundled together.

    Metadata includes:

    • reviews

    • purchases, plays, recommends (likes)

    • product bundles

    • pricing information

    Basic Statistics:

    • Reviews: 7,793,069

    • Users: 2,567,538

    • Items: 15,474

    • Bundles: 615
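
    These files are commonly distributed as one Python-literal dict per line rather than strict JSON, so ast.literal_eval is the usual parser. A sketch (the file name is an assumption, and a plain-JSON download would use json.loads instead):

    ```python
    import ast
    import gzip

    def parse(path):
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                yield ast.literal_eval(line)

    for i, review in enumerate(parse("steam_reviews.json.gz")):
        if i == 0:
            print(sorted(review))  # inspect the available fields first
        if i >= 4:
            break
    ```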

  9. Data from: GIS Resource Compilation Map Package - Applications of Machine Learning Techniques to Geothermal Play Fairway Analysis in the Great Basin Region, Nevada

    • catalog.data.gov
    • data.openei.org
    • +4 more
    Updated Jan 20, 2025
    Cite
    Nevada Bureau of Mines and Geology (2025). GIS Resource Compilation Map Package - Applications of Machine Learning Techniques to Geothermal Play Fairway Analysis in the Great Basin Region, Nevada [Dataset]. https://catalog.data.gov/dataset/gis-resource-compilation-map-package-applications-of-machine-learning-techniques-to-geothe-8f3ee
    Dataset updated
    Jan 20, 2025
    Dataset provided by
    Nevada Bureau of Mines and Geology
    Area covered
    Nevada, Great Basin
    Description

    This submission contains an ESRI map package (.mpk) with an embedded geodatabase for GIS resources used or derived in the Nevada Machine Learning project, meant to accompany the final report. The package includes layer descriptions, layer grouping, and symbology. Layer groups include: new/revised datasets (paleo-geothermal features, geochemistry, geophysics, heat flow, slip and dilation, potential structures, geothermal power plants, positive and negative test sites), machine learning model input grids, machine learning models (Artificial Neural Network (ANN), Extreme Learning Machine (ELM), Bayesian Neural Network (BNN), Principal Component Analysis (PCA/PCAk), Non-negative Matrix Factorization (NMF/NMFk) - supervised and unsupervised), original NV Play Fairway data and models, and NV cultural/reference data. See layer descriptions for additional metadata. Smaller GIS resource packages (by category) can be found in the related datasets section of this submission. A submission linking the full codebase for generating machine learning output models is available through the "Related Datasets" link on this page, and contains results beyond the top picks present in this compilation.

  10. Geograph Machine Learning Dataset 2 - location based

    • data.geograph.org.uk
    Cite
    Geograph Britain and Ireland and Contributors, Geograph Machine Learning Dataset 2 - location based [Dataset]. https://data.geograph.org.uk/datasets.html
    Dataset provided by
    Geograph Britain and Ireland (http://www.geograph.org.uk/)
    Authors
    Geograph Britain and Ireland and Contributors
    License

    Attribution-ShareAlike 2.0 (CC BY-SA 2.0): https://creativecommons.org/licenses/by-sa/2.0/
    License information was derived automatically

    Time period covered
    Dec 24, 1949 - Nov 14, 2021
    Area covered
    United Kingdom, British Isles
    Description

    A sample selection of 10k images from Geograph Britain and Ireland, randomly distributed for good geographical spread. Pre-sized for use in machine learning / image vision processing.

  11. U.S. AI Training Dataset Market Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated May 19, 2025
    Cite
    Archive Market Research (2025). U.S. AI Training Dataset Market Report [Dataset]. https://www.archivemarketresearch.com/reports/us-ai-training-dataset-market-4957
    Available download formats: doc, ppt, pdf
    Dataset updated
    May 19, 2025
    Dataset authored and provided by
    Archive Market Research
    License

    https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    United States
    Variables measured
    Market Size
    Description

    The U.S. AI Training Dataset Market size was valued at USD 590.4 million in 2023 and is projected to reach USD 1,880.70 million by 2032, exhibiting a CAGR of 18.0% during the forecast period. The U.S. AI training dataset market deals with the generation, selection, and organization of datasets used in training artificial intelligence. These datasets contain the requisite information that machine learning algorithms need to infer and learn from. Applications span the development and improvement of AI solutions across business fields such as transport, medical analysis, natural language processing, and financial metrics, and include training models for activities such as image classification, predictive modeling, and natural language interfaces. Emerging trends include the shift toward higher-quality, more varied, and annotated data to improve model efficiency, synthetic data generation to address data shortages, and growing attention to data confidentiality and ethics in dataset management. Furthermore, with emerging technologies in artificial intelligence and machine learning, there is noticeable growth in building and using such datasets.

    Recent developments include: In February 2024, Google struck a deal worth USD 60 million per year with Reddit that will give the former real-time access to the latter's data and use Google AI to enhance Reddit's search capabilities. In February 2024, Microsoft announced an investment of around USD 2.1 billion in Mistral AI to expedite the growth and deployment of large language models. The U.S. giant is expected to underpin Mistral AI with Azure AI supercomputing infrastructure to provide top-notch scale and performance for AI training and inference workloads.

  12. Educational Attainment in North Carolina Public Schools: Use of statistical modeling, data mining techniques, and machine learning algorithms to explore 2014-2017 North Carolina Public School datasets

    • data.mendeley.com
    Updated Nov 14, 2018
    Cite
    Scott Herford (2018). Educational Attainment in North Carolina Public Schools: Use of statistical modeling, data mining techniques, and machine learning algorithms to explore 2014-2017 North Carolina Public School datasets. [Dataset]. http://doi.org/10.17632/6cm9wyd5g5.1
    Dataset updated
    Nov 14, 2018
    Authors
    Scott Herford
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The purpose of data mining analysis is always to find patterns in the data, using techniques such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset. Before doing any work on the data, the data has to be pre-processed, and this process normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. Based on our project, after using clustering prior to classification, the performance did not improve much. The reason may be that the features we selected to perform clustering on are not well suited for it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics.

    From the dimensionality reduction perspective: clustering is different from Principal Component Analysis, which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters as a technique for reducing the data dimension will lose a lot of information, since clustering techniques are based on a metric of 'distance', and at high dimensions Euclidean distance loses pretty much all meaning. Therefore, "reducing" dimensionality by mapping data points to cluster numbers is not always good, since you may lose almost all the information.

    From the creating-new-features perspective: clustering analysis creates labels based on the patterns of the data, which brings uncertainty into the data. When using clustering prior to classification, the decision on the number of clusters will highly affect the performance of the clustering, and in turn the performance of the classification. If the subset of features we apply clustering techniques to is well suited for it, the overall classification performance might increase. For example, if the features we use k-means on are numerical and the dimension is small, the overall classification performance may be better.

    We did not lock in the clustering outputs using a random_state, in an effort to see if they were stable. Our assumption was that if the results vary highly from run to run, which they definitely did, maybe the data just does not cluster well with the selected methods at all. Basically, the ramification we saw was that our results are not much better than random when applying clustering to the data preprocessing.

    Finally, it is important to ensure a feedback loop is in place to continuously collect the same data, in the same format, from which the models were created. This feedback loop can be used to measure the models' real-world effectiveness and to continue to revise the models from time to time as things change.
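
    The clustering-as-feature-engineering setup described above is easy to reproduce. A sketch on synthetic stand-in data, appending a k-means cluster id as one extra feature and comparing cross-validated accuracy with and without it (drop the fixed random_state to see the run-to-run instability discussed above):

    ```python
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

    base = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()

    clusters = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X)
    X_aug = np.column_stack([X, clusters])
    aug = cross_val_score(LogisticRegression(max_iter=1000), X_aug, y, cv=5).mean()

    print(f"baseline={base:.3f}  with-cluster-feature={aug:.3f}")
    ```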

  13. Product Exchange/Bartering Data

    • cseweb.ucsd.edu
    json
    Cite
    UCSD CSE Research Project, Product Exchange/Bartering Data [Dataset]. https://cseweb.ucsd.edu/~jmcauley/datasets.html
    Available download formats: json
    Dataset authored and provided by
    UCSD CSE Research Project
    Description

    These datasets contain peer-to-peer trades from various recommendation platforms.

    Metadata includes:

    • peer-to-peer trades

    • have and want lists

    • image data (tradesy)

  14. VegNet: Vegetable Dataset with quality (Good & Bad quality)

    • data.mendeley.com
    Updated Jul 11, 2022
    + more versions
    Cite
    Kailas PATIL (2022). VegNet: Vegetable Dataset with quality (Good & Bad quality) [Dataset]. http://doi.org/10.17632/73n5hrn8hh.3
    Dataset updated
    Jul 11, 2022
    Authors
    Kailas PATIL
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A neat and clean dataset is the elementary requirement for building accurate and robust machine learning models for a real-time environment. With this objective we created an image dataset, with quality parameters, of four Indian vegetables that are highly consumed or exported: Bell Pepper, Tomato, Chili Pepper, and New Mexico Chile. The dataset is organized into 4 subfolders, one per vegetable. Each vegetable folder in turn contains four subfolders for the classes 1) Multiple Mature, 2) Single Mature, 3) Multiple Over Mature, and 4) Single Over Mature. In total, 6,793 images are available in the dataset. We strongly believe the proposed dataset will be very helpful for training, testing, and validation of vegetable classification or recognition machine learning models.
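
    The folder-per-class layout described above matches torchvision's ImageFolder convention, so loading one vegetable's maturity classes is short. A sketch (the exact folder names are assumptions; adjust to the archive's actual structure):

    ```python
    import torch
    from torchvision import datasets, transforms

    tfm = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ])

    # e.g. VegNet/bell_pepper/multiple_mature/img001.jpg
    ds = datasets.ImageFolder("VegNet/bell_pepper", transform=tfm)
    loader = torch.utils.data.DataLoader(ds, batch_size=32, shuffle=True)
    print(ds.classes)  # the four maturity classes
    ```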

  15. ‘Deep Learning A-Z - ANN dataset’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Nov 21, 2021
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2021). ‘Deep Learning A-Z - ANN dataset’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-deep-learning-a-z-ann-dataset-e900/a6df077a/?iid=030-068&v=presentation
    Dataset updated
    Nov 21, 2021
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Deep Learning A-Z - ANN dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/filippoo/deep-learning-az-ann on 21 November 2021.

    --- Dataset description provided by original source is as follows ---

    Context

    This is the dataset used in the section "ANN (Artificial Neural Networks)" of the Udemy course from Kirill Eremenko (Data Scientist & Forex Systems Expert) and Hadelin de Ponteves (Data Scientist), called Deep Learning A-Z™: Hands-On Artificial Neural Networks. The dataset is very useful for beginners in Machine Learning, and a simple playground in which to compare several techniques/skills.

    It can be freely downloaded here: https://www.superdatascience.com/deep-learning/

    The story: A bank is investigating a very high rate of customers leaving the bank. Here is a 10,000-record dataset to investigate and predict which customers are more likely to leave the bank soon.

    The story of the story: I'd like to compare several techniques (ideally not alone, but with the experience of several Kaggle users) to improve my basic knowledge of Machine Learning.

    Content

    I will write more later, but the column names are self-explanatory.

    Acknowledgements

    Udemy instructors Kirill Eremenko (Data Scientist & Forex Systems Expert) and Hadelin de Ponteves (Data Scientist), and their efforts to provide this dataset to their students.

    Inspiration

    Which methods score best with this dataset? Which are fastest (or at least executable in a decent time)? Which are the basic steps with such a simple dataset, most useful to beginners?
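
    A quick baseline comparison along those lines, assuming the common Churn_Modelling.csv layout of this dataset (an 'Exited' target plus RowNumber/CustomerId/Surname identifier columns; verify locally):

    ```python
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    df = pd.read_csv("Churn_Modelling.csv")
    y = df["Exited"]
    X = pd.get_dummies(
        df.drop(columns=["Exited", "RowNumber", "CustomerId", "Surname"]),
        drop_first=True,  # one-hot encode the categorical columns
    )

    for model in (LogisticRegression(max_iter=1000), RandomForestClassifier()):
        score = cross_val_score(model, X, y, cv=5).mean()
        print(type(model).__name__, round(score, 3))
    ```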

    --- Original source retains full ownership of the source dataset ---

  16. Face-Detection-Dataset

    • kaggle.com
    • gts.ai
    Updated Jun 10, 2023
    Cite
    Fares Elmenshawii (2023). Face-Detection-Dataset [Dataset]. https://www.kaggle.com/datasets/fareselmenshawii/face-detection-dataset
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 10, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Fares Elmenshawii
    License

    Public Domain Dedication (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The dataset comprises 16.7k images and 2 annotation files, each in a distinct format. The first file, labeled "Label," contains annotations with the original scale, while the second file, named "yolo_format_labels," contains annotations in YOLO format. The dataset was obtained by employing the OIDv4 toolkit, specifically designed for scraping data from Google Open Images. Notably, this dataset exclusively focuses on face detection.

    This dataset offers a highly suitable resource for training deep learning models specifically designed for face detection tasks. The images within the dataset exhibit exceptional quality and have been meticulously annotated with bounding boxes encompassing the facial regions. The annotations are provided in two formats: the original scale, denoting the pixel coordinates of the bounding boxes, and the YOLO format, representing the bounding box coordinates in normalized form.

    The dataset was meticulously curated by scraping relevant images from Google Open Images through the use of the OIDv4 toolkit. Only images that are pertinent to face detection tasks have been included in this dataset. Consequently, it serves as an ideal choice for training deep learning models that specifically target face detection tasks.
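
    YOLO-format labels store each box as "class cx cy w h", with all four coordinates normalized to [0, 1], so converting back to pixel corners is a small calculation. A sketch:

    ```python
    def yolo_to_pixels(line: str, img_w: int, img_h: int):
        """Convert one YOLO label line to (class_id, (x1, y1, x2, y2)) in pixels."""
        cls, cx, cy, w, h = line.split()
        cx, cy, w, h = (float(v) for v in (cx, cy, w, h))
        x1 = (cx - w / 2) * img_w
        y1 = (cy - h / 2) * img_h
        x2 = (cx + w / 2) * img_w
        y2 = (cy + h / 2) * img_h
        return int(cls), (round(x1), round(y1), round(x2), round(y2))

    print(yolo_to_pixels("0 0.5 0.5 0.25 0.4", 640, 480))
    # -> (0, (240, 144, 400, 336))
    ```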

  17. Meta Kaggle Code

    • kaggle.com
    zip
    Updated Sep 11, 2025
    Cite
    Kaggle (2025). Meta Kaggle Code [Dataset]. https://www.kaggle.com/datasets/kaggle/meta-kaggle-code/code
    Available download formats: zip (156,566,588,558 bytes)
    Dataset updated
    Sep 11, 2025
    Dataset authored and provided by
    Kaggle (http://kaggle.com/)
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Explore our public notebook content!

    Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0 licensed Python and R notebook versions on Kaggle, used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.

    Why we’re releasing this dataset

    By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.

    Meta Kaggle for Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.

    The best part is Meta Kaggle enriches Meta Kaggle for Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code’s author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!

    Sensitive data

    While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.

    Joining with Meta Kaggle

    The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.

    File organization

    The files are organized into a two-level directory structure. Each top level folder contains up to 1 million files, e.g. - folder 123 contains all versions from 123,000,000 to 123,999,999. Each sub folder contains up to 1 thousand files, e.g. - 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.
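
    That scheme maps a KernelVersions id to a path deterministically. A hedged sketch (the file extension is an assumption; notebooks may be .py, .r, or .ipynb):

    ```python
    def version_path(version_id: int, ext: str = "py") -> str:
        top = version_id // 1_000_000        # top-level folder of up to 1M files
        sub = (version_id // 1_000) % 1_000  # subfolder of up to 1K files
        return f"{top}/{sub}/{version_id}.{ext}"

    print(version_path(123_456_789))  # -> '123/456/123456789.py'
    ```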

    The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays

    Questions / Comments

    We love feedback! Let us know in the Discussion tab.

    Happy Kaggling!

  18. Factori Machine Learning (ML) Data | 247 Countries Coverage | 5.2 B Event per Day

    • datarade.ai
    .csv
    Cite
    Factori, Factori Machine Learning (ML) Data | 247 Countries Coverage | 5.2 B Event per Day [Dataset]. https://datarade.ai/data-products/factori-ai-ml-training-data-web-data-machine-learning-d-factori
    Available download formats: .csv
    Dataset authored and provided by
    Factori
    Area covered
    Sweden, Japan, Egypt, Faroe Islands, Taiwan, Turks and Caicos Islands, Austria, Cameroon, Palestine, Uzbekistan
    Description

    Factori's AI & ML training data is thoroughly tested and reviewed to ensure that what you receive on your end is of the best quality.

    Integrate the comprehensive AI & ML training data provided by Grepsr and develop a superior AI & ML model.

    Whether you're training algorithms for natural language processing, sentiment analysis, or any other AI application, we can deliver comprehensive datasets tailored to fuel your machine learning initiatives.

    Enhanced Data Quality: We have rigorous data validation processes and also conduct quality assurance checks to guarantee the integrity and reliability of the training data for you to develop the AI & ML models.

    Gain a competitive edge, drive innovation, and unlock new opportunities by leveraging the power of tailored Artificial Intelligence and Machine Learning training data with Factori.

    We offer web activity data of users that are browsing popular websites around the world. This data can be used to analyze web behavior across the web and build highly accurate audience segments based on web activity for targeting ads based on interest categories and search/browsing intent.

    Web Data Reach: Our reach data represents the total number of data counts available within various categories and comprises attributes such as Country, Anonymous ID, IP addresses, Search Query, and so on.

    Data Export Methodology: Since we collect data dynamically, we provide the most updated data and insights via a best-suited method at a suitable interval (daily/weekly/monthly).

    Data Attributes: Anonymous_id, IDType, Timestamp, Estid, Ip, userAgent, browserFamily, deviceType, Os, Url_metadata_canonical_url, Url_metadata_raw_query_params, refDomain, mappedEvent, Channel, searchQuery, Ttd_id, Adnxs_id, Keywords, Categories, Entities, Concepts

  19. Imbalanced dataset for benchmarking

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip
    Updated Jan 24, 2020
    Cite
    Guillaume Lemaitre; Fernando Nogueira; Christos K. Aridas; Dayvid V. R. Oliveira (2020). Imbalanced dataset for benchmarking [Dataset]. http://doi.org/10.5281/zenodo.61452
    Available download formats: application/gzip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Guillaume Lemaitre; Fernando Nogueira; Christos K. Aridas; Dayvid V. R. Oliveira
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    Imbalanced dataset for benchmarking
    =======================

    The different algorithms of the `imbalanced-learn` toolbox are evaluated on a set of common datasets with varying degrees of class imbalance. These benchmarks were proposed in [1]. The following section presents the main characteristics of this benchmark.

    Characteristics
    -------------------

    |ID |Name |Repository & Target |Ratio |# samples| # features |
    |:---:|:----------------------:|--------------------------------------|:------:|:-------------:|:--------------:|
    |1 |Ecoli |UCI, target: imU |8.6:1 |336 |7 |
    |2 |Optical Digits |UCI, target: 8 |9.1:1 |5,620 |64 |
    |3 |SatImage |UCI, target: 4 |9.3:1 |6,435 |36 |
    |4 |Pen Digits |UCI, target: 5 |9.4:1 |10,992 |16 |
    |5 |Abalone |UCI, target: 7 |9.7:1 |4,177 |8 |
    |6 |Sick Euthyroid |UCI, target: sick euthyroid |9.8:1 |3,163 |25 |
    |7 |Spectrometer |UCI, target: >=44 |11:1 |531 |93 |
    |8 |Car_Eval_34 |UCI, target: good, v good |12:1 |1,728 |6 |
    |9 |ISOLET |UCI, target: A, B |12:1 |7,797 |617 |
    |10 |US Crime |UCI, target: >0.65 |12:1 |1,994 |122 |
    |11 |Yeast_ML8 |LIBSVM, target: 8 |13:1 |2,417 |103 |
    |12 |Scene |LIBSVM, target: >one label |13:1 |2,407 |294 |
    |13 |Libras Move |UCI, target: 1 |14:1 |360 |90 |
    |14 |Thyroid Sick |UCI, target: sick |15:1 |3,772 |28 |
    |15 |Coil_2000 |KDD, CoIL, target: minority |16:1 |9,822 |85 |
    |16 |Arrhythmia |UCI, target: 06 |17:1 |452 |279 |
    |17 |Solar Flare M0 |UCI, target: M->0 |19:1 |1,389 |10 |
    |18 |OIL |UCI, target: minority |22:1 |937 |49 |
    |19 |Car_Eval_4 |UCI, target: vgood |26:1 |1,728 |6 |
    |20 |Wine Quality |UCI, wine, target: <=4 |26:1 |4,898 |11 |
    |21 |Letter Img |UCI, target: Z |26:1 |20,000 |16 |
    |22 |Yeast_ME2 |UCI, target: ME2 |28:1 |1,484 |8 |
    |23 |Webpage |LIBSVM, w7a, target: minority|33:1 |49,749 |300 |
    |24 |Ozone Level |UCI, ozone, data |34:1 |2,536 |72 |
    |25 |Mammography |UCI, target: minority |42:1 |11,183 |6 |
    |26 |Protein homo. |KDD CUP 2004, minority |111:1|145,751 |74 |
    |27 |Abalone_19 |UCI, target: 19 |130:1|4,177 |8 |
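
    imbalanced-learn ships a fetcher for this exact benchmark, keyed by lowercase dataset names as in the table above. A sketch (targets are encoded as -1/1 for the majority/minority class):

    ```python
    from collections import Counter

    from imblearn.datasets import fetch_datasets

    datasets = fetch_datasets(filter_data=("ecoli", "abalone_19"))
    for name, bunch in datasets.items():
        print(name, bunch.data.shape, Counter(bunch.target))
    ```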

    References
    ----------
    [1] Ding, Zejin, "Diversified Ensemble Classifiers for Highly Imbalanced Data Learning and their Application in Bioinformatics." Dissertation, Georgia State University, (2011).

    [2] Blake, Catherine, and Christopher J. Merz. "UCI Repository of machine learning databases." (1998).

    [3] Chang, Chih-Chung, and Chih-Jen Lin. "LIBSVM: a library for support vector machines." ACM Transactions on Intelligent Systems and Technology (TIST) 2.3 (2011): 27.

    [4] Caruana, Rich, Thorsten Joachims, and Lars Backstrom. "KDD-Cup 2004: results and analysis." ACM SIGKDD Explorations Newsletter 6.2 (2004): 95-108.

  20. Best Machine Learning coding books

    • kaggle.com
    Updated Jan 25, 2022
    Cite
    Coding Vidya (2022). Best Machine Learning coding books [Dataset]. https://www.kaggle.com/codingvidya/machine-learning-books/metadata
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 25, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Coding Vidya
    Description

    Dataset

    This dataset was created by Coding Vidya
