100+ datasets found
  1. Machine Learning Dataset

    • brightdata.com
    .json, .csv, .xlsx
    Updated Jun 19, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Bright Data (2024). Machine Learning Dataset [Dataset]. https://brightdata.com/products/datasets/machine-learning
    Explore at:
    .json, .csv, .xlsxAvailable download formats
    Dataset updated
    Jun 19, 2024
    Dataset authored and provided by
    Bright Datahttps://brightdata.com/
    License

    https://brightdata.com/licensehttps://brightdata.com/license

    Area covered
    Worldwide
    Description

    Utilize our machine learning datasets to develop and validate your models. Our datasets are designed to support a variety of machine learning applications, from image recognition to natural language processing and recommendation systems. You can access a comprehensive dataset or tailor a subset to fit your specific requirements, using data from a combination of various sources and websites, including custom ones. Popular use cases include model training and validation, where the dataset can be used to ensure robust performance across different applications. Additionally, the dataset helps in algorithm benchmarking by providing extensive data to test and compare various machine learning algorithms, identifying the most effective ones for tasks such as fraud detection, sentiment analysis, and predictive maintenance. Furthermore, it supports feature engineering by allowing you to uncover significant data attributes, enhancing the predictive accuracy of your machine learning models for applications like customer segmentation, personalized marketing, and financial forecasting.

  2. d

    Machine Learning (ML) Data | 800M+ B2B Profiles | AI-Ready for Deep Learning...

    • datarade.ai
    .json, .csv
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Xverum, Machine Learning (ML) Data | 800M+ B2B Profiles | AI-Ready for Deep Learning (DL), NLP & LLM Training [Dataset]. https://datarade.ai/data-products/xverum-company-data-b2b-data-belgium-netherlands-denm-xverum
    Explore at:
    .json, .csvAvailable download formats
    Dataset provided by
    Xverum LLC
    Authors
    Xverum
    Area covered
    Norway, Sint Maarten (Dutch part), Cook Islands, Jordan, Western Sahara, Oman, Dominican Republic, Barbados, United Kingdom, India
    Description

    Xverum’s AI & ML Training Data provides one of the most extensive datasets available for AI and machine learning applications, featuring 800M B2B profiles with 100+ attributes. This dataset is designed to enable AI developers, data scientists, and businesses to train robust and accurate ML models. From natural language processing (NLP) to predictive analytics, our data empowers a wide range of industries and use cases with unparalleled scale, depth, and quality.

    What Makes Our Data Unique?

    Scale and Coverage: - A global dataset encompassing 800M B2B profiles from a wide array of industries and geographies. - Includes coverage across the Americas, Europe, Asia, and other key markets, ensuring worldwide representation.

    Rich Attributes for Training Models: - Over 100 fields of detailed information, including company details, job roles, geographic data, industry categories, past experiences, and behavioral insights. - Tailored for training models in NLP, recommendation systems, and predictive algorithms.

    Compliance and Quality: - Fully GDPR and CCPA compliant, providing secure and ethically sourced data. - Extensive data cleaning and validation processes ensure reliability and accuracy.

    Annotation-Ready: - Pre-structured and formatted datasets that are easily ingestible into AI workflows. - Ideal for supervised learning with tagging options such as entities, sentiment, or categories.

    How Is the Data Sourced? - Publicly available information gathered through advanced, GDPR-compliant web aggregation techniques. - Proprietary enrichment pipelines that validate, clean, and structure raw data into high-quality datasets. This approach ensures we deliver comprehensive, up-to-date, and actionable data for machine learning training.

    Primary Use Cases and Verticals

    Natural Language Processing (NLP): Train models for named entity recognition (NER), text classification, sentiment analysis, and conversational AI. Ideal for chatbots, language models, and content categorization.

    Predictive Analytics and Recommendation Systems: Enable personalized marketing campaigns by predicting buyer behavior. Build smarter recommendation engines for ecommerce and content platforms.

    B2B Lead Generation and Market Insights: Create models that identify high-value leads using enriched company and contact information. Develop AI systems that track trends and provide strategic insights for businesses.

    HR and Talent Acquisition AI: Optimize talent-matching algorithms using structured job descriptions and candidate profiles. Build AI-powered platforms for recruitment analytics.

    How This Product Fits Into Xverum’s Broader Data Offering Xverum is a leading provider of structured, high-quality web datasets. While we specialize in B2B profiles and company data, we also offer complementary datasets tailored for specific verticals, including ecommerce product data, job listings, and customer reviews. The AI Training Data is a natural extension of our core capabilities, bridging the gap between structured data and machine learning workflows. By providing annotation-ready datasets, real-time API access, and customization options, we ensure our clients can seamlessly integrate our data into their AI development processes.

    Why Choose Xverum? - Experience and Expertise: A trusted name in structured web data with a proven track record. - Flexibility: Datasets can be tailored for any AI/ML application. - Scalability: With 800M profiles and more being added, you’ll always have access to fresh, up-to-date data. - Compliance: We prioritize data ethics and security, ensuring all data adheres to GDPR and other legal frameworks.

    Ready to supercharge your AI and ML projects? Explore Xverum’s AI Training Data to unlock the potential of 800M global B2B profiles. Whether you’re building a chatbot, predictive algorithm, or next-gen AI application, our data is here to help.

    Contact us for sample datasets or to discuss your specific needs.

  3. Datasets

    • figshare.com
    zip
    Updated May 31, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Bastian Eichenberger; YinXiu Zhan (2023). Datasets [Dataset]. http://doi.org/10.6084/m9.figshare.12958037.v1
    Explore at:
    zipAvailable download formats
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figsharehttp://figshare.com/
    Authors
    Bastian Eichenberger; YinXiu Zhan
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    The benchmarking datasets used for deepBlink. The npz files contain train/valid/test splits inside and can be used directly. The files belong to the following challenges / classes:- ISBI Particle tracking challenge: microtubule, vesicle, receptor- Custom synthetic (based on http://smal.ws): particle- Custom fixed cell: smfish- Custom live cell: suntagThe csv files are to determine which image in the test splits correspond to which original image, SNR, and density.

  4. e

    SYNERGY - Open machine learning dataset on study selection in systematic...

    • b2find.eudat.eu
    Updated Jul 21, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2024). SYNERGY - Open machine learning dataset on study selection in systematic reviews - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/1bea4d3c-ceef-5f63-89ed-80aeab18f601
    Explore at:
    Dataset updated
    Jul 21, 2024
    Description

    SYNERGY is a free and open dataset on study selection in systematic reviews, comprising 169,288 academic works from 26 systematic reviews. Only 2,834 (1.67%) of the academic works in the binary classified dataset are included in the systematic reviews. This makes the SYNERGY dataset a unique dataset for the development of information retrieval algorithms, especially for sparse labels. Due to the many available variables available per record (i.e. titles, abstracts, authors, references, topics), this dataset is useful for researchers in NLP, machine learning, network analysis, and more. In total, the dataset contains 82,668,134 trainable data points. The easiest way to get the SYNERGY dataset is via the synergy-dataset Python package. See https://github.com/asreview/synergy-dataset for all information. The recommended way to work with the SYNERGY dataset is via the Python package "synergy-dataset". This flexible package downloads and builds the dataset.

  5. G

    Artificial Intelligence (AI) Training Dataset Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Aug 29, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Growth Market Reports (2025). Artificial Intelligence (AI) Training Dataset Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/artificial-intelligence-training-dataset-market-global-industry-analysis
    Explore at:
    csv, pptx, pdfAvailable download formats
    Dataset updated
    Aug 29, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Artificial Intelligence (AI) Training Dataset Market Outlook



    According to our latest research, the global Artificial Intelligence (AI) Training Dataset market size reached USD 3.15 billion in 2024, reflecting robust industry momentum. The market is expanding at a notable CAGR of 20.8% and is forecasted to attain USD 20.92 billion by 2033. This impressive growth is primarily attributed to the surging demand for high-quality, annotated datasets to fuel machine learning and deep learning models across diverse industry verticals. The proliferation of AI-driven applications, coupled with rapid advancements in data labeling technologies, is further accelerating the adoption and expansion of the AI training dataset market globally.




    One of the most significant growth factors propelling the AI training dataset market is the exponential rise in data-driven AI applications across industries such as healthcare, automotive, retail, and finance. As organizations increasingly rely on AI-powered solutions for automation, predictive analytics, and personalized customer experiences, the need for large, diverse, and accurately labeled datasets has become critical. Enhanced data annotation techniques, including manual, semi-automated, and fully automated methods, are enabling organizations to generate high-quality datasets at scale, which is essential for training sophisticated AI models. The integration of AI in edge devices, smart sensors, and IoT platforms is further amplifying the demand for specialized datasets tailored for unique use cases, thereby fueling market growth.




    Another key driver is the ongoing innovation in machine learning and deep learning algorithms, which require vast and varied training data to achieve optimal performance. The increasing complexity of AI models, especially in areas such as computer vision, natural language processing, and autonomous systems, necessitates the availability of comprehensive datasets that accurately represent real-world scenarios. Companies are investing heavily in data collection, annotation, and curation services to ensure their AI solutions can generalize effectively and deliver reliable outcomes. Additionally, the rise of synthetic data generation and data augmentation techniques is helping address challenges related to data scarcity, privacy, and bias, further supporting the expansion of the AI training dataset market.




    The market is also benefiting from the growing emphasis on ethical AI and regulatory compliance, particularly in data-sensitive sectors like healthcare, finance, and government. Organizations are prioritizing the use of high-quality, unbiased, and diverse datasets to mitigate algorithmic bias and ensure transparency in AI decision-making processes. This focus on responsible AI development is driving demand for curated datasets that adhere to strict quality and privacy standards. Moreover, the emergence of data marketplaces and collaborative data-sharing initiatives is making it easier for organizations to access and exchange valuable training data, fostering innovation and accelerating AI adoption across multiple domains.



    As the AI training dataset market continues to evolve, the role of Perception Dataset Management Platforms is becoming increasingly crucial. These platforms are designed to handle the complexities of managing large-scale datasets, ensuring that data is not only collected and stored efficiently but also annotated and curated to meet the specific needs of AI models. By providing tools for data organization, quality control, and collaboration, these platforms enable organizations to streamline their data management processes and enhance the overall quality of their AI training datasets. This is particularly important as the demand for diverse and high-quality datasets grows, driven by the expanding scope of AI applications across various industries.




    From a regional perspective, North America currently dominates the AI training dataset market, accounting for the largest revenue share in 2024, driven by significant investments in AI research, a mature technology ecosystem, and the presence of leading AI companies and data annotation service providers. Europe and Asia Pacific are also witnessing rapid growth, with increasing government support for AI initiatives, expanding digital infrastructure, and a rising number of AI startups. While North America sets the pace in terms of technological

  6. Walmart products free dataset

    • crawlfeeds.com
    csv, zip
    Updated Apr 27, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Crawl Feeds (2025). Walmart products free dataset [Dataset]. https://crawlfeeds.com/datasets/walmart-products-free-dataset
    Explore at:
    zip, csvAvailable download formats
    Dataset updated
    Apr 27, 2025
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policyhttps://crawlfeeds.com/privacy_policy

    Description

    Discover the Walmart Products Free Dataset, featuring 2,000 records in CSV format. This dataset includes detailed information about various Walmart products, such as names, prices, categories, and descriptions.

    It’s perfect for data analysis, e-commerce research, and machine learning projects. Download now and kickstart your insights with accurate, real-world data.

  7. f

    Data from: Count-Based Morgan Fingerprint: A More Efficient and...

    • acs.figshare.com
    xlsx
    Updated Jul 5, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Shifa Zhong; Xiaohong Guan (2023). Count-Based Morgan Fingerprint: A More Efficient and Interpretable Molecular Representation in Developing Machine Learning-Based Predictive Regression Models for Water Contaminants’ Activities and Properties [Dataset]. http://doi.org/10.1021/acs.est.3c02198.s002
    Explore at:
    xlsxAvailable download formats
    Dataset updated
    Jul 5, 2023
    Dataset provided by
    ACS Publications
    Authors
    Shifa Zhong; Xiaohong Guan
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    In this study, we introduce the count-based Morgan fingerprint (C-MF) to represent chemical structures of contaminants and develop machine learning (ML)-based predictive models for their activities and properties. Compared with the binary Morgan fingerprint (B-MF), C-MF not only qualifies the presence or absence of an atom group but also quantifies its counts in a molecule. We employ six different ML algorithms (ridge regression, SVM, KNN, RF, XGBoost, and CatBoost) to develop models on 10 contaminant-related data sets based on C-MF and B-MF to compare them in terms of the model’s predictive performance, interpretation, and applicability domain (AD). Our results show that C-MF outperforms B-MF in nine of 10 data sets in terms of model predictive performance. The advantage of C-MF over B-MF is dependent on the ML algorithm, and the performance enhancements are proportional to the difference in the chemical diversity of data sets calculated by B-MF and C-MF. Model interpretation results show that the C-MF-based model can elucidate the effect of atom group counts on the target and have a wider range of SHAP values. AD analysis shows that C-MF-based models have an AD similar to that of B-MF-based ones. Finally, we developed a “ContaminaNET” platform to deploy these C-MF-based models for free use.

  8. d

    FileMarket | 20,000 photos | AI Training Data | Large Language Model (LLM)...

    • datarade.ai
    Updated Jun 28, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FileMarket (2024). FileMarket | 20,000 photos | AI Training Data | Large Language Model (LLM) Data | Machine Learning (ML) Data | Deep Learning (DL) Data | [Dataset]. https://datarade.ai/data-products/filemarket-ai-training-data-large-language-model-llm-data-filemarket
    Explore at:
    .bin, .json, .xml, .csv, .xls, .sql, .txtAvailable download formats
    Dataset updated
    Jun 28, 2024
    Dataset authored and provided by
    FileMarket
    Area covered
    China, Brazil, Benin, Antigua and Barbuda, Central African Republic, Colombia, Saint Kitts and Nevis, French Southern Territories, Saudi Arabia, Papua New Guinea
    Description

    FileMarket provides premium Large Language Model (LLM) Data designed to support and enhance a wide range of AI applications. Our globally sourced LLM Data sets are meticulously curated to ensure high quality, diversity, and accuracy, making them ideal for training robust and reliable language models. In addition to LLM Data, we also offer comprehensive datasets across Object Detection Data, Machine Learning (ML) Data, Deep Learning (DL) Data, and Biometric Data. Each dataset is carefully crafted to meet the specific needs of cutting-edge AI and machine learning projects.

    Key use cases of our Large Language Model (LLM) Data:

    Text generation Chatbots and virtual assistants Machine translation Sentiment analysis Speech recognition Content summarization Why choose FileMarket's data:

    Object Detection Data: Essential for training AI in image and video analysis. Machine Learning (ML) Data: Ideal for a broad spectrum of applications, from predictive analysis to NLP. Deep Learning (DL) Data: Designed to support complex neural networks and deep learning models. Biometric Data: Specialized for facial recognition, fingerprint analysis, and other biometric applications. FileMarket's premier sources for top-tier Large Language Model (LLM) Data and other specialized datasets ensure your AI projects drive innovation and achieve success across various applications.

  9. Electronics Shop Dataset

    • kaggle.com
    Updated Nov 15, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Shafii Rajabu (2024). Electronics Shop Dataset [Dataset]. https://www.kaggle.com/datasets/shafiirajabu/electronics-shop-dataset
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 15, 2024
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Shafii Rajabu
    Description

    This dataset represents simulated sales data for an electronics shop operating in the United States from 2024 (January to November). It is designed for individuals who want to practice data analysis, visualization, and machine learning techniques. The dataset reflects real-world sales scenarios, including various products, customer information, order statuses, and sales channels. It is ideal for learning and experimenting with data analytics, business insights, and visualization tools like Power BI, Tableau, or Python libraries.

    Dataset Features ProductID: Unique identifier for each product. ProductName: Name of the electronic product (e.g., Phone, Laptop, Drone). ProductPrice: The price of the product is in USD. OrderedQuantity: Number of units ordered by the customer. OrderStatus: Status of the order (e.g., Delivered, In Process, On Hold, Canceled). CustomerName: Name of the customer who placed the order. State: State of the customer in the United States (e.g., California, Texas). City: City of the customer within the state. Latitude & Longitude: Geographic coordinates of the customer's location for mapping purposes. OrderChannel: Channel through which the order was placed (e.g., Website, Phone, Physical Store, Social Media). OrderDate: Date of the order (range: January 1, 2024, to November 30, 2024).

    Potential Use Cases

    Exploratory Data Analysis (EDA): Analyze sales trends across months, states, or product categories. Identify the most popular sales channels or products. Examine the distribution of order statuses.

    Data Visualization: Create dashboards to visualize sales performance, customer demographics, and geographic distribution. Plot order locations on a map using latitude and longitude.

    Machine Learning: Predict future sales trends using historical data. Classify order statuses based on product and order details. Cluster customers based on purchase behavior or location.

    Business Insights: Analyze revenue contributions from different states or cities. Understand customer preferences across product categories.

    Technical Details File Format: Excel (with a .xlsx extension) Number of Rows: 11000 Period: January 1, 2024, to November 30, 2024 Simulated Data: The data is entirely synthetic and does not represent real customers or transactions.

    Why Use This Dataset? This dataset is tailored for individuals and students interested in: Building their data analysis and visualization skills. Learning how to work with real-world-like business datasets. Practicing machine learning with structured data. Acknowledgment This dataset was generated to mimic real-world sales data scenarios for educational and research purposes. Feel free to use it for learning and projects, and share your insights with the community!

  10. u

    PDMX

    • cseweb.ucsd.edu
    json
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    UCSD CSE Research Project, PDMX [Dataset]. https://cseweb.ucsd.edu/~jmcauley/datasets.html
    Explore at:
    jsonAvailable download formats
    Dataset authored and provided by
    UCSD CSE Research Project
    Description

    We introduce PDMX: a Public Domain MusicXML dataset for symbolic music processing, including over 250k musical scores in MusicXML format. PDMX is the largest publicly available, copyright-free MusicXML dataset in existence. PDMX includes genre, tag, description, and popularity metadata for every file.

  11. i

    Labeled Image Datasets for AI & Computer Vision

    • images.cv
    Updated Apr 26, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Images.cv (2024). Labeled Image Datasets for AI & Computer Vision [Dataset]. https://images.cv/
    Explore at:
    Dataset updated
    Apr 26, 2024
    Dataset provided by
    Images.cv
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Explore and download labeled image datasets for AI, ML, and computer vision. Find datasets for object detection, image classification, and image segmentation.

  12. Network Traffic Dataset

    • kaggle.com
    Updated Oct 31, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ravikumar Gattu (2023). Network Traffic Dataset [Dataset]. https://www.kaggle.com/datasets/ravikumargattu/network-traffic-dataset
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 31, 2023
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Ravikumar Gattu
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    The data presented here was obtained in a Kali Machine from University of Cincinnati,Cincinnati,OHIO by carrying out packet captures for 1 hour during the evening on Oct 9th,2023 using Wireshark.This dataset consists of 394137 instances were obtained and stored in a CSV (Comma Separated Values) file.This large dataset could be used utilised for different machine learning applications for instance classification of Network traffic,Network performance monitoring,Network Security Management , Network Traffic Management ,network intrusion detection and anomaly detection.

    The dataset can be used for a variety of machine learning tasks, such as network intrusion detection, traffic classification, and anomaly detection.

    Content :

    This network traffic dataset consists of 7 features.Each instance contains the information of source and destination IP addresses, The majority of the properties are numeric in nature, however there are also nominal and date kinds due to the Timestamp.

    The network traffic flow statistics (No. Time Source Destination Protocol Length Info) were obtained using Wireshark (https://www.wireshark.org/).

    Dataset Columns:

    No : Number of Instance. Timestamp : Timestamp of instance of network traffic Source IP: IP address of Source Destination IP: IP address of Destination Portocol: Protocol used by the instance Length: Length of Instance Info: Information of Traffic Instance

    Acknowledgements :

    I would like thank University of Cincinnati for giving the infrastructure for generation of network traffic data set.

    Ravikumar Gattu , Susmitha Choppadandi

    Inspiration : This dataset goes beyond the majority of network traffic classification datasets, which only identify the type of application (WWW, DNS, ICMP,ARP,RARP) that an IP flow contains. Instead, it generates machine learning models that can identify specific applications (like Tiktok,Wikipedia,Instagram,Youtube,Websites,Blogs etc.) from IP flow statistics (there are currently 25 applications in total).

    **Dataset License: ** CC0: Public Domain

    Dataset Usages : This dataset can be used for different machine learning applications in the field of cybersecurity such as classification of Network traffic,Network performance monitoring,Network Security Management , Network Traffic Management ,network intrusion detection and anomaly detection.

    ML techniques benefits from this Dataset :

    This dataset is highly useful because it consists of 394137 instances of network traffic data obtained by using the 25 applications on a public,private and Enterprise networks.Also,the dataset consists of very important features that can be used for most of the applications of Machine learning in cybersecurity.Here are few of the potential machine learning applications that could be benefited from this dataset are :

    1. Network Performance Monitoring : This large network traffic data set can be utilised for analysing the network traffic to identifying the network patterns in the network .This help in designing the network security algorithms for minimise the network probelms.

    2. Anamoly Detection : Large network traffic dataset can be utilised training the machine learning models for finding the irregularitues in the traffic which could help identify the cyber attacks.

    3.Network Intrusion Detection : This large dataset could be utilised for machine algorithms training and designing the models for detection of the traffic issues,Malicious traffic network attacks and DOS attacks as well.

  13. u

    Steam Video Game and Bundle Data

    • cseweb.ucsd.edu
    json
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    UCSD CSE Research Project, Steam Video Game and Bundle Data [Dataset]. https://cseweb.ucsd.edu/~jmcauley/datasets.html
    Explore at:
    jsonAvailable download formats
    Dataset authored and provided by
    UCSD CSE Research Project
    Description

    These datasets contain reviews from the Steam video game platform, and information about which games were bundled together.

    Metadata includes

    • reviews

    • purchases, plays, recommends (likes)

    • product bundles

    • pricing information

    Basic Statistics:

    • Reviews: 7,793,069

    • Users: 2,567,538

    • Items: 15,474

    • Bundles: 615

  14. d

    Factori Machine Learning (ML) Data | 247 Countries Coverage | 5.2 B Event...

    • datarade.ai
    .csv
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Factori, Factori Machine Learning (ML) Data | 247 Countries Coverage | 5.2 B Event per Day [Dataset]. https://datarade.ai/data-products/factori-ai-ml-training-data-web-data-machine-learning-d-factori
    Explore at:
    .csvAvailable download formats
    Dataset authored and provided by
    Factori
    Area covered
    Uzbekistan, Cameroon, Sweden, Palestine, Japan, Egypt, Faroe Islands, Turks and Caicos Islands, Austria, Taiwan
    Description

    Factori's AI & ML training data is thoroughly tested and reviewed to ensure that what you receive on your end is of the best quality.

    Integrate the comprehensive AI & ML training data provided by Grepsr and develop a superior AI & ML model.

    Whether you're training algorithms for natural language processing, sentiment analysis, or any other AI application, we can deliver comprehensive datasets tailored to fuel your machine learning initiatives.

    Enhanced Data Quality: We have rigorous data validation processes and also conduct quality assurance checks to guarantee the integrity and reliability of the training data for you to develop the AI & ML models.

    Gain a competitive edge, drive innovation, and unlock new opportunities by leveraging the power of tailored Artificial Intelligence and Machine Learning training data with Factori.

    We offer web activity data of users that are browsing popular websites around the world. This data can be used to analyze web behavior across the web and build highly accurate audience segments based on web activity for targeting ads based on interest categories and search/browsing intent.

    Web Data Reach: Our reach data represents the total number of data counts available within various categories and comprises attributes such as Country, Anonymous ID, IP addresses, Search Query, and so on.

    Data Export Methodology: Since we collect data dynamically, we provide the most updated data and insights via a best-suited method at a suitable interval (daily/weekly/monthly).

    Data Attributes: Anonymous_id IDType Timestamp Estid Ip userAgent browserFamily deviceType Os Url_metadata_canonical_url Url_metadata_raw_query_params refDomain mappedEvent Channel searchQuery Ttd_id Adnxs_id Keywords Categories Entities Concepts

  15. Face-Detection-Dataset

    • kaggle.com
    • gts.ai
    Updated Jun 10, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Fares Elmenshawii (2023). Face-Detection-Dataset [Dataset]. https://www.kaggle.com/datasets/fareselmenshawii/face-detection-dataset
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 10, 2023
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Fares Elmenshawii
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The dataset comprises 16.7k images and 2 annotation files, each in a distinct format. The first file, labeled "Label," contains annotations with the original scale, while the second file, named "yolo_format_labels," contains annotations in YOLO format. The dataset was obtained by employing the OIDv4 toolkit, specifically designed for scraping data from Google Open Images. Notably, this dataset exclusively focuses on face detection.

    This dataset offers a highly suitable resource for training deep learning models specifically designed for face detection tasks. The images within the dataset exhibit exceptional quality and have been meticulously annotated with bounding boxes encompassing the facial regions. The annotations are provided in two formats: the original scale, denoting the pixel coordinates of the bounding boxes, and the YOLO format, representing the bounding box coordinates in normalized form.

    The dataset was meticulously curated by scraping relevant images from Google Open Images through the use of the OIDv4 toolkit. Only images that are pertinent to face detection tasks have been included in this dataset. Consequently, it serves as an ideal choice for training deep learning models that specifically target face detection tasks.

  16. i

    Network dataset

    • ieee-dataport.org
    Updated Jul 21, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Theyazn Aldhyani (2020). Network dataset [Dataset]. https://ieee-dataport.org/documents/network-dataset
    Explore at:
    Dataset updated
    Jul 21, 2020
    Authors
    Theyazn Aldhyani
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Intelligent Hybrid model to Enhance Time Series Models for Predicting Network Traffic

  17. d

    Re-ID Data | 600,000 ID | CCTV Data |Computer Vision Data| Identity Data| AI...

    • datarade.ai
    Updated Dec 8, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Nexdata (2023). Re-ID Data | 600,000 ID | CCTV Data |Computer Vision Data| Identity Data| AI Datasets [Dataset]. https://datarade.ai/data-products/nexdata-re-id-data-60-000-id-image-video-ai-ml-train-nexdata
    Explore at:
    .bin, .json, .xml, .csv, .xls, .sql, .txtAvailable download formats
    Dataset updated
    Dec 8, 2023
    Dataset authored and provided by
    Nexdata
    Area covered
    Trinidad and Tobago, Turkmenistan, Russian Federation, United Arab Emirates, Luxembourg, Ecuador, Cuba, Sri Lanka, Portugal, Bolivia (Plurinational State of)
    Description
    1. Specifications Data size : 60,000 ID

    Population distribution : the race distribution is Asians, Caucasians and black people, the gender distribution is male and female, the age distribution is from children to the elderly

    Collecting environment : including indoor and outdoor scenes (such as supermarket, mall and residential area, etc.)

    Data diversity : different ages, different time periods, different cameras, different human body orientations and postures, different ages collecting environment

    Device : surveillance cameras, the image resolution is not less than 1,9201,080

    Data format : the image data format is .jpg, the annotation file format is .json

    Annotation content : human body rectangular bounding boxes, 15 human body attributes

    Quality Requirements : A rectangular bounding box of human body is qualified when the deviation is not more than 3 pixels, and the qualified rate of the bounding boxes shall not be lower than 97%;Annotation accuracy of attributes is over 97%

    1. About Nexdata Nexdata owns off-the-shelf PB-level Large Language Model(LLM) Data, 3 million hours of Audio Data and 800TB of Annotated Imagery Data.These ready-to-go Identity Data support instant delivery, quickly improve the accuracy of AI models. For more details, please visit us at https://www.nexdata.ai/datasets/computervision?source=Datarade
  18. Best Books Ever Dataset

    • zenodo.org
    csv
    Updated Nov 10, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Lorena Casanova Lozano; Sergio Costa Planells; Lorena Casanova Lozano; Sergio Costa Planells (2020). Best Books Ever Dataset [Dataset]. http://doi.org/10.5281/zenodo.4265096
    Explore at:
    csvAvailable download formats
    Dataset updated
    Nov 10, 2020
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Lorena Casanova Lozano; Sergio Costa Planells; Lorena Casanova Lozano; Sergio Costa Planells
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The dataset has been collected in the frame of the Prac1 of the subject Tipology and Data Life Cycle of the Master's Degree in Data Science of the Universitat Oberta de Catalunya (UOC).

    The dataset contains 25 variables and 52478 records corresponding to books on the GoodReads Best Books Ever list (the larges list on the site).

    Original code used to retrieve the dataset can be found on github repository: github.com/scostap/goodreads_bbe_dataset

    The data was retrieved in two sets, the first 30000 books and then the remainig 22478. Dates were not parsed and reformated on the second chunk so publishDate and firstPublishDate are representet in a mm/dd/yyyy format for the first 30000 records and Month Day Year for the rest.

    Book cover images can be optionally downloaded from the url in the 'coverImg' field. Python code for doing so and an example can be found on the github repo.

    The 25 fields of the dataset are:

    | Attributes | Definition | Completeness |
    | ------------- | ------------- | ------------- | 
    | bookId | Book Identifier as in goodreads.com | 100 |
    | title | Book title | 100 |
    | series | Series Name | 45 |
    | author | Book's Author | 100 |
    | rating | Global goodreads rating | 100 |
    | description | Book's description | 97 |
    | language | Book's language | 93 |
    | isbn | Book's ISBN | 92 |
    | genres | Book's genres | 91 |
    | characters | Main characters | 26 |
    | bookFormat | Type of binding | 97 |
    | edition | Type of edition (ex. Anniversary Edition) | 9 |
    | pages | Number of pages | 96 |
    | publisher | Editorial | 93 |
    | publishDate | publication date | 98 |
    | firstPublishDate | Publication date of first edition | 59 |
    | awards | List of awards | 20 |
    | numRatings | Number of total ratings | 100 |
    | ratingsByStars | Number of ratings by stars | 97 |
    | likedPercent | Derived field, percent of ratings over 2 starts (as in GoodReads) | 99 |
    | setting | Story setting | 22 |
    | coverImg | URL to cover image | 99 |
    | bbeScore | Score in Best Books Ever list | 100 |
    | bbeVotes | Number of votes in Best Books Ever list | 100 |
    | price | Book's price (extracted from Iberlibro) | 73 |

  19. US Deep Learning Market Analysis, Size, and Forecast 2025-2029

    • technavio.com
    pdf
    Updated Jul 8, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Technavio (2025). US Deep Learning Market Analysis, Size, and Forecast 2025-2029 [Dataset]. https://www.technavio.com/report/us-deep-learning-market-industry-analysis
    Explore at:
    pdfAvailable download formats
    Dataset updated
    Jul 8, 2025
    Dataset provided by
    TechNavio
    Authors
    Technavio
    Time period covered
    2025 - 2029
    Description

    Snapshot img

    US Deep Learning Market Size 2025-2029

    The deep learning market size in US is forecast to increase by USD 5.02 billion at a CAGR of 30.1% between 2024 and 2029.

    The deep learning market is experiencing robust growth, driven by the increasing adoption of artificial intelligence (AI) in various industries for advanced solutioning. This trend is fueled by the availability of vast amounts of data, which is a key requirement for deep learning algorithms to function effectively. Industry-specific solutions are gaining traction, as businesses seek to leverage deep learning for specific use cases such as image and speech recognition, fraud detection, and predictive maintenance. Alongside, intuitive data visualization tools are simplifying complex neural network outputs, helping stakeholders understand and validate insights. 
    
    
    However, challenges remain, including the need for powerful computing resources, data privacy concerns, and the high cost of implementing and maintaining deep learning systems. Despite these hurdles, the market's potential for innovation and disruption is immense, making it an exciting space for businesses to explore further. Semi-supervised learning, data labeling, and data cleaning facilitate efficient training of deep learning models. Cloud analytics is another significant trend, as companies seek to leverage cloud computing for cost savings and scalability. 
    

    What will be the Size of the market During the Forecast Period?

    Request Free Sample

    Deep learning, a subset of machine learning, continues to shape industries by enabling advanced applications such as image and speech recognition, text generation, and pattern recognition. Reinforcement learning, a type of deep learning, gains traction, with deep reinforcement learning leading the charge. Anomaly detection, a crucial application of unsupervised learning, safeguards systems against security vulnerabilities. Ethical implications and fairness considerations are increasingly important in deep learning, with emphasis on explainable AI and model interpretability. Graph neural networks and attention mechanisms enhance data preprocessing for sequential data modeling and object detection. Time series forecasting and dataset creation further expand deep learning's reach, while privacy preservation and bias mitigation ensure responsible use.

    In summary, deep learning's market dynamics reflect a constant pursuit of innovation, efficiency, and ethical considerations. The Deep Learning Market in the US is flourishing as organizations embrace intelligent systems powered by supervised learning and emerging self-supervised learning techniques. These methods refine predictive capabilities and reduce reliance on labeled data, boosting scalability. BFSI firms utilize AI image recognition for various applications, including personalizing customer communication, maintaining a competitive edge, and automating repetitive tasks to boost productivity. Sophisticated feature extraction algorithms now enable models to isolate patterns with high precision, particularly in applications such as image classification for healthcare, security, and retail.

    How is this market segmented and which is the largest segment?

    The market research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.

    Application
    
      Image recognition
      Voice recognition
      Video surveillance and diagnostics
      Data mining
    
    
    Type
    
      Software
      Services
      Hardware
    
    
    End-user
    
      Security
      Automotive
      Healthcare
      Retail and commerce
      Others
    
    
    Geography
    
      North America
    
        US
    

    By Application Insights

    The Image recognition segment is estimated to witness significant growth during the forecast period. In the realm of artificial intelligence (AI) and machine learning, image recognition, a subset of computer vision, is gaining significant traction. This technology utilizes neural networks, deep learning models, and various machine learning algorithms to decipher visual data from images and videos. Image recognition is instrumental in numerous applications, including visual search, product recommendations, and inventory management. Consumers can take photographs of products to discover similar items, enhancing the online shopping experience. In the automotive sector, image recognition is indispensable for advanced driver assistance systems (ADAS) and autonomous vehicles, enabling the identification of pedestrians, other vehicles, road signs, and lane markings.

    Furthermore, image recognition plays a pivotal role in augmented reality (AR) and virtual reality (VR) applications, where it tracks physical objects and overlays digital content onto real-world scenarios. The model training process involves the backpropagation algorithm, which calculates the loss fu

  20. g

    Academic datasets

    • generated.photos
    Updated Jun 26, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Generated Media, Inc. (2024). Academic datasets [Dataset]. https://generated.photos/datasets/academic
    Explore at:
    Dataset updated
    Jun 26, 2024
    Dataset authored and provided by
    Generated Media, Inc.
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    AI-generated images for academic studies and machine learning research. Various demographic and age groups. Free for a link and a citation or another mention in the research paper.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Bright Data (2024). Machine Learning Dataset [Dataset]. https://brightdata.com/products/datasets/machine-learning
Organization logo

Machine Learning Dataset

Explore at:
.json, .csv, .xlsxAvailable download formats
Dataset updated
Jun 19, 2024
Dataset authored and provided by
Bright Datahttps://brightdata.com/
License

https://brightdata.com/licensehttps://brightdata.com/license

Area covered
Worldwide
Description

Utilize our machine learning datasets to develop and validate your models. Our datasets are designed to support a variety of machine learning applications, from image recognition to natural language processing and recommendation systems. You can access a comprehensive dataset or tailor a subset to fit your specific requirements, using data from a combination of various sources and websites, including custom ones. Popular use cases include model training and validation, where the dataset can be used to ensure robust performance across different applications. Additionally, the dataset helps in algorithm benchmarking by providing extensive data to test and compare various machine learning algorithms, identifying the most effective ones for tasks such as fraud detection, sentiment analysis, and predictive maintenance. Furthermore, it supports feature engineering by allowing you to uncover significant data attributes, enhancing the predictive accuracy of your machine learning models for applications like customer segmentation, personalized marketing, and financial forecasting.

Search
Clear search
Close search
Google apps
Main menu