100+ datasets found
  1. Machine Learning Basics for Beginners🤖🧠

    • kaggle.com
    zip
    Updated Jun 22, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Bhanupratap Biswas (2023). Machine Learning Basics for Beginners🤖🧠 [Dataset]. https://www.kaggle.com/datasets/bhanupratapbiswas/machine-learning-basics-for-beginners
    Explore at:
    zip (492015 bytes); available download formats
    Dataset updated
    Jun 22, 2023
    Authors
    Bhanupratap Biswas
    License

    ODC Public Domain Dedication and Licence (PDDL) v1.0, http://www.opendatacommons.org/licenses/pddl/1.0/
    License information was derived automatically

    Description

    This dataset provides an introduction to machine learning basics for beginners. Machine learning is a subfield of artificial intelligence (AI) that focuses on enabling computers to learn and make predictions or decisions without being explicitly programmed. Here are some key concepts and terms to help you get started:

    1. Supervised Learning: In supervised learning, the machine learning algorithm learns from labeled training data. The training data consists of input examples and their corresponding correct output or target values. The algorithm learns to generalize from this data and make predictions or classify new, unseen examples.

    2. Unsupervised Learning: Unsupervised learning involves learning patterns and relationships from unlabeled data. Unlike supervised learning, there are no target values provided. Instead, the algorithm aims to discover inherent structures or clusters in the data.

    3. Training Data and Test Data: Machine learning models require a dataset to learn from. The dataset is typically split into two parts: the training data and the test data. The model learns from the training data, and the test data is used to evaluate its performance and generalization ability.

    4. Features and Labels: In supervised learning, the input examples are often represented by features or attributes. For example, in a spam email classification task, features might include the presence of certain keywords or the length of the email. The corresponding output or target values are called labels, indicating the class or category to which the example belongs (e.g., spam or not spam).

    5. Model Evaluation Metrics: To assess the performance of a machine learning model, various evaluation metrics are used. Common metrics include accuracy (the proportion of correctly predicted examples), precision (the proportion of true positives among all positive predictions), recall (the proportion of actual positives that are correctly identified), and F1 score (the harmonic mean of precision and recall). A short worked sketch follows at the end of this list.

    6. Overfitting and Underfitting: Overfitting occurs when a model becomes too complex and learns to memorize the training data instead of generalizing well to unseen examples. On the other hand, underfitting happens when a model is too simple and fails to capture the underlying patterns in the data. Balancing the complexity of the model is crucial to achieve good generalization.

    7. Feature Engineering: Feature engineering involves selecting or creating relevant features that can help improve the performance of a machine learning model. It often requires domain knowledge and creativity to transform raw data into a suitable representation that captures the important information.

    8. Bias and Variance Trade-off: The bias-variance trade-off is a fundamental concept in machine learning. Bias refers to the errors introduced by the model's assumptions and simplifications, while variance refers to the model's sensitivity to small fluctuations in the training data. Reducing bias may increase variance and vice versa. Finding the right balance is important for building a well-performing model.

    9. Supervised Learning Algorithms: There are various supervised learning algorithms, including linear regression, logistic regression, decision trees, random forests, support vector machines (SVM), and neural networks. Each algorithm has its own strengths, weaknesses, and specific use cases.

    10. Unsupervised Learning Algorithms: Unsupervised learning algorithms include clustering algorithms like k-means clustering and hierarchical clustering, dimensionality reduction techniques like principal component analysis (PCA) and t-SNE, and anomaly detection algorithms, among others.

    These concepts provide a starting point for understanding the basics of machine learning. As you delve deeper, you can explore more advanced topics such as deep learning, reinforcement learning, and natural language processing. Remember to practice hands-on with real-world datasets to gain practical experience and further refine your skills.
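
    To make items 3 and 5 concrete, here is a minimal sketch using scikit-learn (a library choice assumed for illustration; it is not part of this dataset) on synthetic data:

      # Item 3: split data into training and test sets; item 5: compute metrics.
      from sklearn.datasets import make_classification
      from sklearn.model_selection import train_test_split
      from sklearn.linear_model import LogisticRegression
      from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

      # Synthetic stand-in for a real labeled dataset (features + binary labels).
      X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

      # Hold out 20% of the examples as the test set.
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

      model = LogisticRegression().fit(X_train, y_train)  # a supervised learner (item 9)
      y_pred = model.predict(X_test)

      print("accuracy: ", accuracy_score(y_test, y_pred))
      print("precision:", precision_score(y_test, y_pred))
      print("recall:   ", recall_score(y_test, y_pred))
      print("F1:       ", f1_score(y_test, y_pred))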

  2. AI/ML Youtube Videos

    • kaggle.com
    Updated Oct 31, 2023
    Cite
    Asmaa Hadir (2023). AI/ML Youtube Videos [Dataset]. https://www.kaggle.com/datasets/asmaahadir/aiml-youtube-channels-content-2018-2019
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more about it at mlcommons.org/croissant.
    Dataset updated
    Oct 31, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Asmaa Hadir
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Area covered
    YouTube
    Description

    I created this dataset as part of a data analysis project and concluded that it might be relevant for others interested in analyzing content on YouTube. This dataset is a collection of over 6,000 videos with the following columns:

    • Channel: video's channel
    • Title: video title
    • PublishedDate: date the video was uploaded
    • Likes: likes count for the video
    • Views: views count for the video
    • Comments: comments count for the video

    Through the YouTube API and using Python, I collected data about videos from some of these popular channels that provide educational content about Machine Learning and Data Science, in order to extract insights about which topics have been popular within the last couple of years (a sketch of this collection approach appears after the channel list below). Featured in the dataset are the following creators:

    • Krish Naik

    • Nicholas Renotte

    • Sentdex

    • DeepLearningAI

    • Artificial Intelligence — All in One

    • Siraj Raval

    • Jeremy Howard

    • Applied AI Course

    • Daniel Bourke

    • Jeff Heaton

    • DeepLearning.TV

    • Arxiv Insights

    These channels are featured in multiple lists of top AI channels to subscribe to, and have seen significant growth on YouTube in the last couple of years. All of them were created in or before 2018.
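
    As an illustration of the collection approach described above, a rough sketch with the YouTube Data API v3 (via google-api-python-client) follows; the API key and channel ID are placeholders, and the fields mirror the column list rather than the author's actual script:

      # Hedged sketch: fetch video statistics for one channel via the
      # YouTube Data API v3. API_KEY and CHANNEL_ID are placeholders.
      from googleapiclient.discovery import build

      API_KEY = "YOUR_API_KEY"
      CHANNEL_ID = "UC..."  # placeholder channel id

      youtube = build("youtube", "v3", developerKey=API_KEY)

      # Find recent video ids for the channel (one page of results only).
      search = youtube.search().list(
          part="id", channelId=CHANNEL_ID, type="video", maxResults=50
      ).execute()
      video_ids = [item["id"]["videoId"] for item in search["items"]]

      # Fetch title, publish date, and like/view/comment counts per video.
      stats = youtube.videos().list(
          part="snippet,statistics", id=",".join(video_ids)
      ).execute()
      for v in stats["items"]:
          s = v["statistics"]
          print(v["snippet"]["title"], v["snippet"]["publishedAt"],
                s.get("likeCount"), s.get("viewCount"), s.get("commentCount"))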

  3. Cone-Beam X-Ray CT Data Collection Designed for Machine Learning: Samples...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Mar 12, 2020
    + more versions
    Cite
    Henri Der Sarkissian; Felix Lucka; Maureen CWI van Eijnatten; Giulia Colacicco; Sophia Bethany Coban; K. Joost Batenburg (2020). Cone-Beam X-Ray CT Data Collection Designed for Machine Learning: Samples 17-24 [Dataset]. http://doi.org/10.5281/zenodo.2687387
    Explore at:
    zip; available download formats
    Dataset updated
    Mar 12, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Henri Der Sarkissian; Felix Lucka; Maureen CWI van Eijnatten; Giulia Colacicco; Sophia Bethany Coban; K. Joost Batenburg
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This upload contains samples 17 - 24 from the data collection described in

    Henri Der Sarkissian, Felix Lucka, Maureen van Eijnatten, Giulia Colacicco, Sophia Bethany Coban, Kees Joost Batenburg, "A Cone-Beam X-Ray CT Data Collection Designed for Machine Learning", Sci Data 6, 215 (2019). https://doi.org/10.1038/s41597-019-0235-y or arXiv:1905.04787 (2019)

    Abstract:
    "Unlike previous works, this open data collection consists of X-ray cone-beam (CB) computed tomography (CT) datasets specifically designed for machine learning applications and high cone-angle artefact reduction: Forty-two walnuts were scanned with a laboratory X-ray setup to provide not only data from a single object but from a class of objects with natural variability. For each walnut, CB projections on three different orbits were acquired to provide CB data with different cone angles as well as being able to compute artefact-free, high-quality ground truth images from the combined data that can be used for supervised learning. We provide the complete image reconstruction pipeline: raw projection data, a description of the scanning geometry, pre-processing and reconstruction scripts using open software, and the reconstructed volumes. Due to this, the dataset can not only be used for high cone-angle artefact reduction but also for algorithm development and evaluation for other tasks, such as image reconstruction from limited or sparse-angle (low-dose) scanning, super resolution, or segmentation."

    The scans are performed using a custom-built, highly flexible X-ray CT scanner, the FleX-ray scanner, developed by XRE nv and located in the FleX-ray Lab at the Centrum Wiskunde & Informatica (CWI) in Amsterdam, Netherlands. The general purpose of the FleX-ray Lab is to conduct proof-of-concept experiments directly accessible to researchers in the fields of mathematics and computer science. The scanner consists of a cone-beam microfocus X-ray point source that projects polychromatic X-rays onto a 1536-by-1944 pixel, 14-bit flat panel detector (Dexella 1512NDT), with a rotation stage in between, upon which a sample is mounted. All three components are mounted on translation stages which allow them to move independently from one another.

    Please refer to the paper for all further technical details.

    The complete data set can be found via the following links: 1-8, 9-16, 17-24, 25-32, 33-37, 38-42

    The corresponding Python scripts for loading, pre-processing and reconstructing the projection data in the way described in the paper can be found on GitHub.
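
    The authors' scripts on GitHub are the authoritative reference; purely as an illustration of the standard pre-processing step such pipelines include, here is a generic flat- and dark-field correction sketch (the file names are hypothetical placeholders, not the dataset's actual naming):

      # Standard CT projection pre-processing: flat-/dark-field correction
      # followed by the negative log transform (Beer-Lambert law).
      # File names below are hypothetical placeholders.
      import numpy as np
      import imageio.v2 as imageio

      dark = imageio.imread("dark_field.tif").astype(np.float32)
      flat = imageio.imread("flat_field.tif").astype(np.float32)
      proj = imageio.imread("projection_0000.tif").astype(np.float32)

      # Normalize the raw projection to transmission, then to attenuation.
      transmission = (proj - dark) / np.clip(flat - dark, 1e-6, None)
      attenuation = -np.log(np.clip(transmission, 1e-6, None))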

    For more information or guidance in using this dataset, please get in touch with:

    • henri.dersarkissian [at] gmail.com
    • Felix.Lucka [at] cwi.nl
  4. Factori AI & ML Training Data | People Data | USA | Machine Learning Data

    • datarade.ai
    .json, .csv
    Updated Jul 23, 2022
    Cite
    Factori (2022). Factori AI & ML Training Data | People Data | USA | Machine Learning Data [Dataset]. https://datarade.ai/data-products/factori-ai-ml-training-data-consumer-data-usa-machine-factori
    Explore at:
    .json, .csv; available download formats
    Dataset updated
    Jul 23, 2022
    Dataset authored and provided by
    Factori
    Area covered
    United States of America
    Description

    Our People data is gathered and aggregated via surveys, digital services, and public data sources. We use powerful profiling algorithms to collect and ingest only fresh and reliable data points.

    Our comprehensive data enrichment solution includes a variety of data sets that can help you address gaps in your customer data, gain a deeper understanding of your customers, and power superior client experiences.

    1. Geography - City, State, ZIP, County, CBSA, Census Tract, etc.
    2. Demographics - Gender, Age Group, Marital Status, Language etc.
    3. Financial - Income Range, Credit Rating Range, Credit Type, Net worth Range, etc.
    4. Persona - Consumer type, Communication preferences, Family type, etc.
    5. Interests - Content, Brands, Shopping, Hobbies, Lifestyle etc.
    6. Household - Number of Children, Number of Adults, IP Address, etc.
    7. Behaviours - Brand Affinity, App Usage, Web Browsing etc.
    8. Firmographics - Industry, Company, Occupation, Revenue, etc.
    9. Retail Purchase - Store, Category, Brand, SKU, Quantity, Price etc.
    10. Auto - Car Make, Model, Type, Year, etc.
    11. Housing - Home type, Home value, Renter/Owner, Year Built etc.

    People Data Schema & Reach: Our data reach represents the total number of counts available within various categories, comprising attributes such as country location, MAU, DAU, and monthly location pings.

    Data Export Methodology: Since we collect data dynamically, we provide the most updated data and insights via a best-suited method on a suitable interval (daily/weekly/monthly).

    People data Use Cases:

    • 360-Degree Customer View: Get a comprehensive picture of customers by aggregating internal and external data.
    • Data Enrichment: Leverage online-to-offline consumer profiles to build holistic audience segments and improve campaign targeting.
    • Fraud Detection: Use multiple digital (web and mobile) identities to verify real users and detect anomalies or fraudulent activity.
    • Advertising & Marketing: Understand audience demographics, interests, lifestyle, hobbies, and behaviors to build targeted marketing campaigns.

    Here's the schema of People Data: person_id, first_name, last_name, age, gender, linkedin_url, twitter_url, facebook_url, city, state, address, zip, zip4, country, delivery_point_bar_code, carrier_route, walk_seuqence_code, fips_state_code, fips_country_code, country_name, latitude, longtiude, address_type, metropolitan_statistical_area, core_based+statistical_area, census_tract, census_block_group, census_block, primary_address, pre_address, streer, post_address, address_suffix, address_secondline, address_abrev, census_median_home_value, home_market_value, property_build+year, property_with_ac, property_with_pool, property_with_water, property_with_sewer, general_home_value, property_fuel_type, year, month, household_id, Census_median_household_income, household_size, marital_status, length+of_residence, number_of_kids, pre_school_kids, single_parents, working_women_in_house_hold, homeowner, children, adults, generations, net_worth, education_level, occupation, education_history, credit_lines, credit_card_user, newly_issued_credit_card_user, credit_range_new, credit_cards, loan_to_value, mortgage_loan2_amount, mortgage_loan_type, mortgage_loan2_type, mortgage_lender_code, mortgage_loan2_render_code, mortgage_lender, mortgage_loan2_lender, mortgage_loan2_ratetype, mortgage_rate, mortgage_loan2_rate, donor, investor, interest, buyer, hobby, personal_email, work_email, devices, phone, employee_title, employee_department, employee_job_function, skills, recent_job_change, company_id, company_name, company_description, technologies_used, office_address, office_city, office_country, office_state, office_zip5, office_zip4, office_carrier_route, office_latitude, office_longitude, office_cbsa_code, office_census_block_group, office_census_tract, office_county_code, company_phone, company_credit_score, company_csa_code, company_dpbc, company_franchiseflag, company_facebookurl, company_linkedinurl, company_twitterurl, company_website, company_fortune_rank, company_government_type, company_headquarters_branch, company_home_business, company_industry, company_num_pcs_used, company_num_employees, company_firm_individual, company_msa, company_msa_name, company_naics_code, company_naics_description, company_naics_code2, company_naics_description2, company_sic_code2, company_sic_code2_description, company_sic_code4, company_sic_code4_description, company_sic_code6, company_sic_code6_description, company_sic_code8, company_sic_code8_description, company_parent_company, company_parent_company_location, company_public_private, company_subsidiary_company, company_residential_business_code, company_revenue_at_side_code, company_revenue_range, company_revenue, company_sales_volume, company_small_business, company_stock_ticker, company_year_founded, company_minorityowned, company_female_owned_or_operated, company_franchise_code, company_dma, company_dma_name, company_hq_address, company_hq_city, company_hq_duns, company_hq_state, company_hq_zip5, company_hq_zip4, company_se...

  5. Data from: Training dataset for NABat Machine Learning V1.0

    • catalog.data.gov
    • data.usgs.gov
    • +1 more
    Updated Nov 26, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Training dataset for NABat Machine Learning V1.0 [Dataset]. https://catalog.data.gov/dataset/training-dataset-for-nabat-machine-learning-v1-0
    Explore at:
    Dataset updated
    Nov 26, 2025
    Dataset provided by
    U.S. Geological Survey
    Description

    Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess the status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully-automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm; however, the test set (i.e., holdout dataset) has been excluded from this release.

    These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of "A Plan for the North American Bat Monitoring Program" (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in "A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program" (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format.

    From the available files in the NABat database, we considered files from 35 classes (34 species and a noise class). Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N = 3; Eumops floridanus, N = 3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N = 11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reached 1,250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
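
    As a small illustration, the released folder layout (one folder per four-letter species code, WAV files inside) could be indexed with Python's standard library alone; the root path here is a hypothetical local copy of this release:

      # Index the audio files, grouped into folders named by species code.
      import wave
      from pathlib import Path

      root = Path("nabat_training_data")  # hypothetical local path to this release

      for wav_path in sorted(root.glob("*/*.wav")):
          species_code = wav_path.parent.name  # four-letter code from folder name
          with wave.open(str(wav_path), "rb") as w:
              duration_s = w.getnframes() / w.getframerate()
          print(species_code, wav_path.name, f"{duration_s:.2f}s")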

  6. Educational Attainment in North Carolina Public Schools: Use of statistical...

    • data.mendeley.com
    Updated Nov 14, 2018
    Cite
    Scott Herford (2018). Educational Attainment in North Carolina Public Schools: Use of statistical modeling, data mining techniques, and machine learning algorithms to explore 2014-2017 North Carolina Public School datasets. [Dataset]. http://doi.org/10.17632/6cm9wyd5g5.1
    Explore at:
    Dataset updated
    Nov 14, 2018
    Authors
    Scott Herford
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The purpose of data mining analysis is always to find patterns in the data using certain kinds of techniques, such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset: before doing any work on the data, it has to be pre-processed, and this process normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. In our project, using clustering prior to classification did not improve performance much; a likely reason is that the features we selected for clustering are not well suited to it. Given the nature of the data, classification tasks provide more information to work with in terms of improving knowledge and overall performance metrics.

    From the dimensionality reduction perspective: clustering is different from Principal Component Analysis, which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters to reduce the data dimension loses a lot of information, since clustering techniques are based on a metric of 'distance', and at high dimensions Euclidean distance loses pretty much all meaning. Therefore, "reducing" dimensionality by mapping data points to cluster numbers is not always good, since you may lose almost all the information.

    From the perspective of creating new features: clustering analysis creates labels based on the patterns of the data, which brings uncertainty into the data. When clustering is used prior to classification, the choice of the number of clusters strongly affects the performance of the clustering, and in turn the performance of the classification. If the subset of features we apply clustering to is well suited to it, it might increase the overall classification performance; for example, if the features we use k-means on are numerical and the dimension is small, the overall classification performance may be better. We did not lock in the clustering outputs with a random_state, in order to see whether they were stable. Our assumption was that if the results vary highly from run to run, which they definitely did, the data may simply not cluster well with the selected methods. In short, the outcome we observed was that applying clustering in data preprocessing left our results not much better than random.

    Finally, it is important to ensure a feedback loop is in place to continuously collect the same data, in the same format, from which the models were created. This feedback loop can be used to measure the models' real-world effectiveness and to revise the models from time to time as things change.
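
    A compact sketch of the experiment described above, on synthetic stand-in data: a classifier is scored with and without a k-means cluster label appended as a feature, and the clustering is deliberately left without a random_state, mirroring the stability check:

      from sklearn.datasets import make_classification
      from sklearn.model_selection import cross_val_score
      from sklearn.cluster import KMeans
      from sklearn.ensemble import RandomForestClassifier
      import numpy as np

      # Synthetic stand-in for the school datasets used in the project.
      X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

      base = RandomForestClassifier(random_state=0)
      print("plain features:      ", cross_val_score(base, X, y, cv=5).mean())

      # No random_state on KMeans: re-running shows whether the clusters
      # (and thus the downstream score) are stable. For full rigor the
      # clustering would be fit inside each CV training fold.
      clusters = KMeans(n_clusters=8, n_init=10).fit_predict(X)
      X_aug = np.column_stack([X, clusters])
      print("with cluster feature:", cross_val_score(base, X_aug, y, cv=5).mean())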

  7. Data from: Hyperspectral reflectance and machine learning for multi-site...

    • catalog.data.gov
    • datasetcatalog.nlm.nih.gov
    • +1 more
    Updated Jun 5, 2025
    + more versions
    Cite
    Agricultural Research Service (2025). Data from: Hyperspectral reflectance and machine learning for multi-site monitoring of cotton growth [Dataset]. https://catalog.data.gov/dataset/data-from-hyperspectral-reflectance-and-machine-learning-for-multi-site-monitoring-of-cott
    Explore at:
    Dataset updated
    Jun 5, 2025
    Dataset provided by
    Agricultural Research Service (https://www.ars.usda.gov/)
    Description

    Two field research projects focused on cotton were conducted at the El Reno, OK and Temple, TX USDA-ARS research stations. The study at the Southern Plains research station took place in 2019, where cotton cultivars included FiberMax 1830 GLT, FiberMax 1888 GL, FiberMax 2498 GLT, Phytogen 300 W3FE, Phytogen 350 W3FE, and Phytogen 490 W3FE. The experiment took place on Norge silt loam soil. The experimental design was a randomized complete block design with three replications and four nitrogen treatments: unfertilized, or cover crop (Lathyrus sativus) with 0, 30, or 60 kg/ha of inorganic nitrogen. The experiment was planted at a seeding rate of 12 seeds/m on May 28, 2019. The study at the Texas Gulf station incorporated four cultivars: DP 1646 B2XF, DP 2012 B3XF, DP 2239 B3XF, and DP 1646 B2XF. The experiment took place on Houston Black clay soils. The experimental design included 6-row strip plots with 0.76 m row spacing and approximately 0.1 m plant spacing, at a seeding rate of 13 seeds per meter. The experiment also included a multi-planting-date design, with planting on March 28, April 11, April 22, and May 11.

    These two stations collected in-situ measurements of height, node count, leaf area index (LAI), canopy cover percentage, and chlorophyll content. LAI data were collected using the LAI-2200C Plant Canopy Analyzer (LI-COR, Lincoln, NE, USA). Canopy cover data were collected using the app "Canopeo", in which a ratio of plant to ground pixels is calculated. Chlorophyll content data were collected using the Chlorophyll Content Meter-300 (Opti-Sciences, Hudson, NH, USA), with mg/m2 as the unit of measure. Directly following these measurements, a spectroradiometer (FieldSpec Pro FR: Malvern Panalytical, Westborough, MA, USA) was used to collect hyperspectral data from 350 nm to 2500 nm. The locations of both the in-situ and hyperspectral data collections were chosen at random throughout the field projects, which took place in 2019 (OK) and 2022 (TX), with specific collection dates included in the database.
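
    The Canopeo measurement mentioned above is a plant-to-ground pixel ratio; a rough sketch of that idea follows (the threshold is illustrative only, not Canopeo's actual calibration, and the photo name is a placeholder):

      # Crude canopy-cover estimate: fraction of pixels where green
      # dominates red and blue. The 1.1 threshold is illustrative only.
      import numpy as np
      import imageio.v2 as imageio

      img = imageio.imread("plot_photo.jpg").astype(np.float32)  # placeholder file
      r, g, b = img[..., 0], img[..., 1], img[..., 2]

      green_mask = (g > 1.1 * r) & (g > 1.1 * b)
      print(f"canopy cover: {100.0 * green_mask.mean():.1f}%")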

  8. CSIRO Sentinel-1 SAR image dataset of oil- and non-oil features for machine...

    • data.csiro.au
    • researchdata.edu.au
    Updated Dec 15, 2022
    Cite
    David Blondeau-Patissier; Thomas Schroeder; Foivos Diakogiannis; Zhibin Li (2022). CSIRO Sentinel-1 SAR image dataset of oil- and non-oil features for machine learning ( Deep Learning ) [Dataset]. http://doi.org/10.25919/4v55-dn16
    Explore at:
    Dataset updated
    Dec 15, 2022
    Dataset provided by
    CSIRO (http://www.csiro.au/)
    Authors
    David Blondeau-Patissier; Thomas Schroeder; Foivos Diakogiannis; Zhibin Li
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Time period covered
    May 1, 2015 - Aug 31, 2022
    Area covered
    Dataset funded by
    ESA
    CSIRO (http://www.csiro.au/)
    Description

    What this collection is: a curated, binary-classified dataset of grayscale (1-band), 400 x 400-pixel image chips in JPEG format, extracted from processed Sentinel-1 Synthetic Aperture Radar (SAR) satellite scenes acquired over various regions of the world, and featuring clear open-ocean chips, look-alike chips (wind or biogenic features), and oil slick chips.

    This binary dataset contains chips labelled as:

    • "0" for chips not containing any oil features (look-alikes or clean seas)
    • "1" for chips containing oil features

    This binary dataset is imbalanced, and biased towards "0" labelled chips (i.e., no oil features), which correspond to 66% of the dataset. Chips containing oil features, labelled "1", correspond to 34% of the dataset.

    Why: This dataset can be used for training, validation and/or testing of machine learning, including deep learning, algorithms for the detection of oil features in SAR imagery. Directly applicable for algorithm development for the European Space Agency Sentinel-1 SAR mission (https://sentinel.esa.int/web/sentinel/missions/sentinel-1 ), it may be suitable for the development of detection algorithms for other SAR satellite sensors.

    Overview of this dataset: the total number of chips (both classes) is N = 5,630.

    Class 0 (no oil): 3,725 chips
    Class 1 (oil): 1,905 chips

    Further information and description is found in the ReadMe file provided (ReadMe_Sentinel1_SAR_OilNoOil_20221215.txt)
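
    Given the 66/34 imbalance noted above, a common first step when training on these chips is to weight the classes; here is a sketch with scikit-learn (a framework assumption, not part of the dataset) using the counts from the overview:

      # Derive "balanced" class weights from the chip counts above so a
      # classifier does not simply favor the majority (no-oil) class.
      import numpy as np
      from sklearn.utils.class_weight import compute_class_weight

      labels = np.array([0] * 3725 + [1] * 1905)
      weights = compute_class_weight(class_weight="balanced",
                                     classes=np.array([0, 1]), y=labels)
      print(dict(zip([0, 1], weights)))  # roughly {0: 0.76, 1: 1.48}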

  9. Dataset for "Exploratory Proof-of-Concept: Predicting the Outcome of Tennis...

    • researchdata.bath.ac.uk
    txt, zip
    Updated Oct 14, 2025
    Cite
    Gustav Durlind; Uriel Martinez-Hernandez; Tareq Assaf (2025). Dataset for "Exploratory Proof-of-Concept: Predicting the Outcome of Tennis Serves Using Motion Capture and Deep Learning" [Dataset]. http://doi.org/10.15125/BATH-01454
    Explore at:
    txt, zip; available download formats
    Dataset updated
    Oct 14, 2025
    Dataset provided by
    University of Bath
    Authors
    Gustav Durlind; Uriel Martinez-Hernandez; Tareq Assaf
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Dataset funded by
    University of Bath
    Description

    Dataset used for training a machine learning model to classify tennis serves as “In”, “Out” or “Net”, as well as to predict the outcome coordinates of the serve. A marker-based motion capture system provided and operated by the University of Bath's Applied BioMechanics Suite was used to collect spatio-temporal data on participants completing tennis serves, whilst a high-speed video camera recorded the outcome. The dataset has two parts: 1) the spatio-temporal data used for training and validation; 2) the serve outcomes and coordinates used for labelling the data.

  10. Replication Package for 'How do Machine Learning Models Change?'

    • zenodo.org
    Updated Nov 13, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Joel Castaño Fernández; Rafael Cabañas; Antonio Salmerón; Lo David; Silverio Martínez-Fernández (2024). Replication Package for 'How do Machine Learning Models Change?' [Dataset]. http://doi.org/10.5281/zenodo.14128997
    Explore at:
    Dataset updated
    Nov 13, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Joel Castaño Fernández; Rafael Cabañas; Antonio Salmerón; Lo David; Silverio Martínez-Fernández
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    This replication package accompanies the paper "How Do Machine Learning Models Change?" In this study, we conducted a comprehensive analysis of over 200,000 commits and 1,200 releases across more than 50,000 models on the Hugging Face (HF) platform. Our goal was to understand how machine learning (ML) models evolve over time by classifying commit types based on an extended ML change taxonomy and analyzing patterns in commit and release activities using Bayesian networks.

    Our research addresses three main aspects:

    1. Categorization of Commit Changes: We classified over 200,000 commits on HF using an extended ML change taxonomy, providing a detailed breakdown of change types and their distribution across models.
    2. Analysis of Commit Sequences: We examined the sequence and dependencies of commit types using Bayesian networks to identify temporal patterns and common progression paths in model changes.
    3. Release Analysis: We investigated the distribution and evolution of release types, analyzing how model attributes and metadata change across successive releases.

    This replication package contains all the necessary code, datasets, and documentation to reproduce the results presented in the paper.

    Data Collection and Preprocessing

    Data Collection

    We collected data from the Hugging Face platform using the Hugging Face Hub API and the `HfApi` class. The data extraction was performed on November 6th, 2023. The collected data includes:

    • Model Information: Details of over 380,000 models, including dataset sizes, training hardware, evaluation metrics, model file sizes, number of downloads and likes, tags, and the raw text of model cards.
    • Commit Histories: Comprehensive commit details, including commit messages, dates, authors, and the list of files edited in each commit.
    • Release Information: Information on model releases marked by tags in their repositories.

    To enrich the commit data with detailed file change information, we integrated the PyDriller framework within the HFCommunity dataset.

    Data Preprocessing

    Commit Diffs

    We computed the differences between commits for key files, specifically JSON configuration files (e.g., `config.json`). For each commit that modifies these files, we compared the changes with the previous commit affecting the same file to identify added, deleted, and updated keys.

    Commit Classification

    We classified each commit according to Bhatia et al.'s ML change taxonomy using the Gemini 1.5 Flash Large Language Model (LLM). This classification, using LLMs to apply Bhatia et al.'s taxonomy on a large-scale ML repository, is one of the main contributions of our paper. We ensured the correctness of the classification by achieving a Cohen's kappa coefficient ≥ 0.9 through iterative validation. In addition, we performed classification based on Swanson's categories using a simpler neural network approach, following methods from prior work. This classification has less impact compared to the detailed classification using Bhatia et al.'s taxonomy.
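
    The agreement check described above can be reproduced with scikit-learn's cohen_kappa_score; the labels below are toy stand-ins for the manual and LLM-produced commit classifications:

      from sklearn.metrics import cohen_kappa_score

      # Toy stand-ins for human vs. LLM commit-type labels.
      human = ["add layer", "update params", "fix config", "update params", "add layer"]
      llm   = ["add layer", "update params", "fix config", "update params", "fix config"]

      # Iterate on prompt/validation until kappa >= 0.9, as described above.
      print(cohen_kappa_score(human, llm))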

    Model Metadata

    We extracted detailed metadata from the model files of selected releases, focusing on attributes such as the number of parameters, tensor shapes, etc. We also calculated the differences between the metadata of successive releases.

    Folder Structure

    The replication package is organized as follows:

    - code/: Contains the Jupyter notebooks with the data extraction, preprocessing, analysis, and model training scripts.

    • Collection/: Contains two Jupyter notebooks for data collection:
      • HFTotalExtraction.ipynb: Script for collecting data on the entire Hugging Face platform.
      • HFReleasesExtraction.ipynb: Script for collecting data on models that contain releases.
    • Preprocessing/: Contains preprocessing scripts:
      • HFTotalPreprocessing.ipynb: Preprocesses the dataset obtained from `HFTotalExtraction.ipynb`.
      • HFCommitsPreprocessing.ipynb: Processes commit data, including:
        • Retrieval of diff information between commits.
        • Classification of commits following Bhatia et al.'s taxonomy using LLMs.
        • Extension and adaptation of the final commits dataset, including additional variables for Bayesian network analysis.
      • HFReleasesPreprocessing.ipynb: Processes release data, including classification and preparation for analysis.
    • Analysis/: Contains three Jupyter notebooks with the analysis for each research question:
      • RQ1_Analysis.ipynb: Analysis for RQ1.
      • RQ2_Analysis.ipynb: Analysis for RQ2.
      • RQ3_Analysis.ipynb: Analysis for RQ3.

    - datasets/: Contains the raw, processed, and manually curated datasets used for the analysis.

    • Main Datasets:
      • HFCommits_50K_RANDOM.csv: Contains the commits of 50,000 randomly sampled models from HF with the classification based on Bhatia et al.'s taxonomy.
      • HFCommits_MultipleCommits.csv: Contains the commits of 10,000 models with at least 10 commits, used for analyzing commit sequences.
      • HFReleases.csv: Contains over 1,200 releases from 127 models, classified using Bhatia et al.'s taxonomy.
      • model_metadata_with_diff.csv: Contains the metadata of releases from 27 models, including differences between successive releases.
      • These datasets correspond to the following dataset splits:
        • +200,000 commits from 50,000 models: Used for RQ1. Provides a broad overview of commit types and patterns across diverse models.
        • +200,000 commits from 10,000 models: Used for RQ2. Focuses on models with at least 10 commits for detailed evolutionary study.
        • +1,200 releases from 127 models: Used for RQ3.1, RQ3.2, and RQ3.3. Facilitates the investigation of release patterns and their evolution.
        • Metadata of 173 releases from 27 models: Used for RQ3.4. Analyzes the evolution of model parameters and configurations.
    • Additional Datasets:
      • HF_Total_Raw.csv: Contains a snapshot of the entire Hugging Face platform with over 380,000 models, as obtained from HFTotalExtraction.ipynb.
      • HF_Total_Preprocessed.csv: Contains the preprocessed version of the entire HF dataset, as obtained from HFTotalPreprocessing.ipynb. This dataset is needed for the commits preprocessing.
      • Auxiliary datasets generated during processing are also included to facilitate reproduction of specific parts of the code without time-consuming steps.

    - metadata/: Contains the tags_metadata.yaml file used during preprocessing.

    - models/: Contains the model trained to classify commit messages into corrective, perfective, and adaptive types based on Swanson's traditional software maintenance categories.

    - requirements.txt: Lists the required Python packages to set up the environment and run the code.

    Setup and Execution

    Prerequisites

    • Python 3.10.11 or later.
    • Jupyter Notebook or JupyterLab.

    Installation

    1. Download and Extract the Replication Package
    2. Create a Virtual Environment (Recommended):
       python -m venv venv
       source venv/bin/activate # On Windows, use venv\Scripts\activate
    3. Install Required Packages:
       pip install -r requirements.txt

    Notes

    - LLM Usage: The classification of commits using the Gemini 1.5 Flash LLM requires access to the model. Ensure you have the necessary permissions and API keys to use the model.

    - Computational Resources: Processing large datasets and running Bayesian network analyses may require significant computational resources. It is recommended to use a machine with ample memory and processing power.

    - Reproducing Results: The auxiliary datasets included can be used to reproduce specific parts of the code without re-running the entire data collection and preprocessing pipeline.

    Additional Information

    Contact: If you have any questions or encounter issues, please contact the authors at joel.castano@upc.edu.

    This README provides detailed instructions and information to reproduce and understand the analyses performed in the paper. If you find this package useful, please cite our work.

  11. Physiological signals during activities for daily life: Dataset

    • zenodo.org
    • data.niaid.nih.gov
    • +1 more
    zip
    Updated Mar 29, 2022
    Cite
    Eduardo Gutierrez Maestro (2022). Physiological signals during activities for daily life: Dataset [Dataset]. http://doi.org/10.5281/zenodo.6391454
    Explore at:
    zip; available download formats
    Dataset updated
    Mar 29, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Eduardo Gutierrez Maestro
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset used in this work is composed of four participants: two men and two women. Each of them carried the wearable device Empatica E4 for a total of 15 days. They carried the wearable during the day, and during the nights we asked participants to charge it and load the data into an external memory unit. During these days, participants were asked to answer EMA questionnaires, which are used to label our data. However, some participants could not complete the full experiment, or some of their days were discarded due to data corruption. Specific demographic information, total sampling days, and the total number of EMA answers can be found in Table I.

                        Participant 1   Participant 2   Participant 3   Participant 4
    Age                            67              55              60              63
    Gender                       Male          Female            Male          Female
    Final Valid Days                9              15              12              13
    Total EMAs                     42              57              64              46

    Table I. Summary of participants' collected data.

    This dataset provides three different types of labels. Activeness and happiness are two of them: they are the answers to the EMA questionnaires that participants reported during their daily activities, given as numbers between 0 and 4. These labels are used to interpolate the mental well-being state according to [1]. We report in our dataset a total of eight emotional states: (1) pleasure, (2) excitement, (3) arousal, (4) distress, (5) misery, (6) depression, (7) sleepiness, and (8) contentment.

    The data we provide in this repository consists of two types of files:

    • CSV files: These files contain the physiological signals recorded during the data collection process. The first line of each CSV file gives the timestamp at which sampling started; the second line gives the sampling frequency used for gathering the signal; from the third line to the end of the file are the sampled datapoints (see the parser sketch after the note below).
    • Excel files: These files contain the labels obtained from the EMA answers, together with the timestamp at which each answer was registered. Labels for pleasure, activeness and mood can be found in this file.

    NOTE: Files are numbered according to each specific sampling day. For example, ACC1.csv corresponds to the ACC signal for sampling day 1. The same applies to the Excel files.
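
    A small parser for the CSV layout described above; this is an illustrative sketch, not code from the accompanying repository (the file name follows the ACC1.csv convention):

      # Layout per the description: line 1 = start timestamp, line 2 =
      # sampling frequency (Hz), remaining lines = sampled datapoints.
      import csv

      def read_signal(path):
          with open(path) as f:
              rows = list(csv.reader(f))
          t0 = float(rows[0][0])   # timestamp at which sampling started
          fs = float(rows[1][0])   # sampling frequency in Hz
          samples = [[float(v) for v in row] for row in rows[2:]]
          return t0, fs, samples

      t0, fs, acc = read_signal("ACC1.csv")  # signal ACC, sampling day 1
      print(f"start={t0}, fs={fs} Hz, n={len(acc)} samples")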

    Code and a tutorial on how to label the data and extract features can be found in this repository: https://github.com/edugm94/temporal-feat-emotion-prediction

    References:

    [1] J. A. Russell, "A circumplex model of affect," Journal of Personality and Social Psychology, vol. 39, no. 6, p. 1161, 1980.

  12. Film Dataset for AI-Generated Music (Machine Learning (ML) Data)

    • datarade.ai
    .json, .csv, .xls
    Updated Feb 14, 2024
    + more versions
    Cite
    Rightsify (2024). Film Dataset for AI-Generated Music (Machine Learning (ML) Data) [Dataset]. https://datarade.ai/data-products/film-dataset-for-ai-generated-music-machine-learning-ml-data-rightsify
    Explore at:
    .json, .csv, .xls; available download formats
    Dataset updated
    Feb 14, 2024
    Dataset authored and provided by
    Rightsify
    Area covered
    Falkland Islands (Malvinas), Bermuda, Luxembourg, Denmark, Cuba, Guinea, Antarctica, Kuwait, Tokelau, Moldova (Republic of)
    Description

    The Film dataset is a large collection of audio files with full metadata, including chords, instrumentation, key, tempo, and timestamps. This dataset is designed for machine learning applications and serves as a reliable resource for generative AI music, Music Information Retrieval (MIR), and source separation. With an emphasis on supporting machine learning work, the dataset allows researchers to delve into the complexities of film music, enabling the development of algorithms capable of generating creative compositions that genuinely represent the emotive nuances of various genres.

    Film music, an essential component of cinematic storytelling, plays an important role in increasing spectator engagement and emotional resonance. Composers work collaboratively with filmmakers to create music that enhances visual aspects, sets the tone, and reinforces story themes.

    Training models on this cinema dataset allows researchers to better grasp and mimic these artistic details, extending the bounds of AI-generated music and contributing to advances in MIR and source separation.

  13. Data from: Subsurface Characterization and Machine Learning Predictions at...

    • catalog.data.gov
    • data.openei.org
    • +2 more
    Updated Jan 20, 2025
    + more versions
    Cite
    National Renewable Energy Laboratory (2025). Subsurface Characterization and Machine Learning Predictions at Brady Hot Springs [Dataset]. https://catalog.data.gov/dataset/subsurface-characterization-and-machine-learning-predictions-at-brady-hot-springs-0f304
    Explore at:
    Dataset updated
    Jan 20, 2025
    Dataset provided by
    National Renewable Energy Laboratory
    Description

    Subsurface data analysis, reservoir modeling, and machine learning (ML) techniques have been applied to the Brady Hot Springs (BHS) geothermal field in Nevada, USA, to further characterize the subsurface and assist with optimizing reservoir management. Hundreds of reservoir simulations have been conducted in TETRAD-G and CMG STARS to explore different injection and production fluid flow rates and allocations and to develop a training data set for ML. This process included simulating the historical injection and production since 1979 and predicting future performance through 2040.

    ML networks were created and trained using TensorFlow based on multilayer perceptron, long short-term memory, and convolutional neural network architectures. These networks took as input selected flow rates, injection temperatures, and historical field operation data, and produced estimates of future production temperatures. This approach was first successfully tested on a simplified single-fracture doublet system, followed by application to the BHS reservoir. Using an initial BHS data set with 37 simulated scenarios, the trained and validated network predicted the production temperature for six production wells with a mean absolute percentage error of less than 8%.

    In a complementary analysis effort, a principal component analysis applied to 13 BHS geological parameters revealed that vertical fracture permeability shows the strongest correlation with fault density and fault intersection density. A new BHS reservoir model was developed considering the fault intersection density as a proxy for permeability; this new reservoir model helps to explore underexploited zones in the reservoir. A data gathering plan to obtain additional subsurface data was developed; it includes temperature surveying for three idle injection wells at which the reservoir simulations indicate high bottom-hole temperatures. The collected data assist with calibrating the reservoir model. Data gathering activities are planned for the first quarter of 2021.

    This GDR submission includes a preprint of the paper titled "Subsurface Characterization and Machine Learning Predictions at Brady Hot Springs" presented at the 46th Stanford Geothermal Workshop (SGW) on Geothermal Reservoir Engineering, February 16-18, 2021.
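
    As a rough, illustrative stand-in for the multilayer perceptron variant described (not the authors' actual network; the input width, layer sizes, and random data are assumptions), a Keras model mapping operational inputs to six production-well temperatures might look like:

      import numpy as np
      import tensorflow as tf

      n_features = 16                     # e.g., flow rates + injection temperatures
      X = np.random.rand(37, n_features)  # 37 simulated scenarios, as in the text
      y = np.random.rand(37, 6)           # future temperatures for six wells

      model = tf.keras.Sequential([
          tf.keras.Input(shape=(n_features,)),
          tf.keras.layers.Dense(64, activation="relu"),
          tf.keras.layers.Dense(64, activation="relu"),
          tf.keras.layers.Dense(6),       # one output per production well
      ])
      model.compile(optimizer="adam", loss="mae")
      model.fit(X, y, epochs=10, verbose=0)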

  14. Data from: FISBe: A real-world benchmark dataset for instance segmentation...

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    • +1 more
    Updated Apr 2, 2024
    Cite
    Mais, Lisa; Hirsch, Peter; Managan, Claire; Kandarpa, Ramya; Rumberger, Josef Lorenz; Reinke, Annika; Maier-Hein, Lena; Ihrke, Gudrun; Kainmueller, Dagmar (2024). FISBe: A real-world benchmark dataset for instance segmentation of long-range thin filamentous structures [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10875062
    Explore at:
    Dataset updated
    Apr 2, 2024
    Dataset provided by
    German Cancer Research Center
    Howard Hughes Medical Institute - Janelia Research Campus
    Max Delbrück Center for Molecular Medicine
    Max Delbrück Center
    Authors
    Mais, Lisa; Hirsch, Peter; Managan, Claire; Kandarpa, Ramya; Rumberger, Josef Lorenz; Reinke, Annika; Maier-Hein, Lena; Ihrke, Gudrun; Kainmueller, Dagmar
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    General

    For more details and the most up-to-date information please consult our project page: https://kainmueller-lab.github.io/fisbe.

    Summary

    A new dataset for neuron instance segmentation in 3d multicolor light microscopy data of fruit fly brains

    30 completely labeled (segmented) images

    71 partly labeled images

    altogether comprising ∼600 expert-labeled neuron instances (labeling a single neuron takes between 30-60 min on average, yet a difficult one can take up to 4 hours)

    To the best of our knowledge, the first real-world benchmark dataset for instance segmentation of long thin filamentous objects

    A set of metrics and a novel ranking score for respective meaningful method benchmarking

    An evaluation of three baseline methods in terms of the above metrics and score

    Abstract

    Instance segmentation of neurons in volumetric light microscopy images of nervous systems enables groundbreaking research in neuroscience by facilitating joint functional and morphological analyses of neural circuits at cellular resolution. Yet said multi-neuron light microscopy data exhibits extremely challenging properties for the task of instance segmentation: Individual neurons have long-ranging, thin filamentous and widely branching morphologies, multiple neurons are tightly inter-weaved, and partial volume effects, uneven illumination and noise inherent to light microscopy severely impede local disentangling as well as long-range tracing of individual neurons. These properties reflect a current key challenge in machine learning research, namely to effectively capture long-range dependencies in the data. While respective methodological research is buzzing, to date methods are typically benchmarked on synthetic datasets. To address this gap, we release the FlyLight Instance Segmentation Benchmark (FISBe) dataset, the first publicly available multi-neuron light microscopy dataset with pixel-wise annotations. In addition, we define a set of instance segmentation metrics for benchmarking that we designed to be meaningful with regard to downstream analyses. Lastly, we provide three baselines to kick off a competition that we envision to both advance the field of machine learning regarding methodology for capturing long-range data dependencies, and facilitate scientific discovery in basic neuroscience.

    Dataset documentation:

    We provide a detailed documentation of our dataset, following the Datasheet for Datasets questionnaire:

    FISBe Datasheet

    Our dataset originates from the FlyLight project, where the authors released a large image collection of nervous systems of ~74,000 flies, available for download under CC BY 4.0 license.

    Files

    fisbe_v1.0_{completely,partly}.zip

    contains the image and ground truth segmentation data; there is one zarr file per sample, see below for more information on how to access zarr files.

    fisbe_v1.0_mips.zip

    maximum intensity projections of all samples, for convenience.

    sample_list_per_split.txt

    a simple list of all samples and the subset they are in, for convenience.

    view_data.py

    a simple python script to visualize samples, see below for more information on how to use it.

    dim_neurons_val_and_test_sets.json

    a list of instance ids per sample that are considered to be of low intensity/dim; can be used for extended evaluation.

    Readme.md

    general information

    How to work with the image files

    Each sample consists of a single 3d MCFO image of neurons of the fruit fly. For each image, we provide a pixel-wise instance segmentation for all separable neurons. Each sample is stored as a separate zarr file (zarr is a file storage format for chunked, compressed, N-dimensional arrays based on an open-source specification). The image data ("raw") and the segmentation ("gt_instances") are stored as two arrays within a single zarr file. The segmentation mask for each neuron is stored in a separate channel. The order of dimensions is CZYX.

    We recommend working in a virtual environment, e.g., using conda:

    conda create -y -n flylight-env -c conda-forge python=3.9
    conda activate flylight-env

    How to open zarr files

    Install the python zarr package:

    pip install zarr

    Open a zarr file with:

    import zarr
    raw = zarr.open("<sample>.zarr", mode='r', path="volumes/raw")
    seg = zarr.open("<sample>.zarr", mode='r', path="volumes/gt_instances")

    Optional:

    import numpy as np
    raw_np = np.array(raw)

    Zarr arrays are read lazily on-demand. Many functions that expect numpy arrays also work with zarr arrays. Optionally, the arrays can also explicitly be converted to numpy arrays.

    How to view zarr image files

    We recommend using napari to view the image data.

    Install napari:

    pip install "napari[all]"

    Save the following Python script:

    import zarr, sys, napari

    raw = zarr.load(sys.argv[1], mode='r', path="volumes/raw")
    gts = zarr.load(sys.argv[1], mode='r', path="volumes/gt_instances")

    viewer = napari.Viewer(ndisplay=3)
    for idx, gt in enumerate(gts):
        viewer.add_labels(gt, rendering='translucent', blending='additive', name=f'gt_{idx}')
    viewer.add_image(raw[0], colormap="red", name='raw_r', blending='additive')
    viewer.add_image(raw[1], colormap="green", name='raw_g', blending='additive')
    viewer.add_image(raw[2], colormap="blue", name='raw_b', blending='additive')
    napari.run()

    Execute:

    python view_data.py /R9F03-20181030_62_B5.zarr

    Metrics

    S: Average of avF1 and C

    avF1: Average F1 Score

    C: Average ground truth coverage

    clDice_TP: Average true positives clDice

    FS: Number of false splits

    FM: Number of false merges

    tp: Relative number of true positives

    For more information on our selected metrics and formal definitions please see our paper.

    Baseline

    To showcase the FISBe dataset together with our selection of metrics, we provide evaluation results for three baseline methods, namely PatchPerPix (ppp), Flood Filling Networks (FFN) and a non-learnt, application-specific color clustering from Duan et al. For detailed information on the methods and the quantitative results, please see our paper.

    License

    The FlyLight Instance Segmentation Benchmark (FISBe) dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

    Citation

    If you use FISBe in your research, please use the following BibTeX entry:

    @misc{mais2024fisbe, title = {FISBe: A real-world benchmark dataset for instance segmentation of long-range thin filamentous structures}, author = {Lisa Mais and Peter Hirsch and Claire Managan and Ramya Kandarpa and Josef Lorenz Rumberger and Annika Reinke and Lena Maier-Hein and Gudrun Ihrke and Dagmar Kainmueller}, year = 2024, eprint = {2404.00130}, archivePrefix ={arXiv}, primaryClass = {cs.CV} }

    Acknowledgments

    We thank Aljoscha Nern for providing unpublished MCFO images, as well as Geoffrey W. Meissner and the entire FlyLight Project Team for valuable discussions. P.H., L.M. and D.K. were supported by the HHMI Janelia Visiting Scientist Program. This work was co-funded by Helmholtz Imaging.

    Changelog

    There have been no changes to the dataset so far. All future changes will be listed on the changelog page.

    Contributing

    If you would like to contribute, have encountered any issues or have any suggestions, please open an issue for the FISBe dataset in the accompanying github repository.

    All contributions are welcome!

  15. The manifest and store data of 870,515 Android mobile applications

    • dataverse.nl
    zip
    Updated Jun 9, 2022
    Cite
    Fadi Mohsen; Fadi Mohsen; Dimka Karastoyanova; Dimka Karastoyanova; George Azzopardi; George Azzopardi (2022). The manifest and store data of 870,515 Android mobile applications [Dataset]. http://doi.org/10.34894/H0YJFT
    Explore at:
    zip(202636617)Available download formats
    Dataset updated
    Jun 9, 2022
    Dataset provided by
    DataverseNL
    Authors
    Fadi Mohsen; Fadi Mohsen; Dimka Karastoyanova; Dimka Karastoyanova; George Azzopardi; George Azzopardi
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Time period covered
    Apr 15, 2017 - Jun 17, 2019
    Description

    We built a crawler to collect data from the Google Play store, including each application's metadata and APK file. The manifest files were extracted from the APK files and then processed to extract the features. The data set is composed of 870,515 records/apps, and for each app we produced 48 features. The data set was used to build and test two bootstrap-aggregated (bagging) ensembles of XGBoost machine learning classifiers, as sketched below. The apps were collected between April 2017 and November 2018. We then checked the status of these applications on three different occasions: December 2018, February 2019, and May-June 2019.
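
    As a rough illustration of such an ensemble, here is a minimal sketch using scikit-learn's BaggingClassifier around XGBoost; the hyperparameters and variable names are placeholders, not the authors' configuration:

    from sklearn.ensemble import BaggingClassifier
    from xgboost import XGBClassifier

    # Bootstrap-aggregated ensemble: each XGBoost model is fit on a
    # bootstrap sample of the 48-feature app records.
    bagged_xgb = BaggingClassifier(
        estimator=XGBClassifier(n_estimators=100),  # scikit-learn >= 1.2 API
        n_estimators=10,   # number of bagged models (placeholder)
        max_samples=0.8,   # fraction of records per bootstrap sample (placeholder)
    )
    # bagged_xgb.fit(X_train, y_train); y_pred = bagged_xgb.predict(X_test)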

  16. Cultural Heritage Learning Dataset for IoT

    • kaggle.com
    zip
    Updated Dec 31, 2024
    Cite
    Ziya (2024). Cultural Heritage Learning Dataset for IoT [Dataset]. https://www.kaggle.com/datasets/ziya07/cultural-heritage-learning-dataset-for-iot
    Explore at:
    zip(11051 bytes)Available download formats
    Dataset updated
    Dec 31, 2024
    Authors
    Ziya
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The dataset is designed to facilitate the development and evaluation of machine learning models for managing and distributing traditional cultural and educational resources, leveraging IoT data and deep learning techniques. The goal is to provide a rich set of features for exploring user interactions with cultural artifacts and historical sites, with the support of IoT-enabled sensors and virtual learning experiences.

    Columns Overview:

    User_ID: A unique identifier for each user interacting with the system. This helps track individual behavior patterns and interactions.

    Artifact_ID: The unique identifier for each artifact (e.g., a cultural artifact or historical item) that is part of the learning experience.

    Artifact_Name: The name of the artifact. This could range from physical objects like sculptures, to digital representations of ancient items or sites.

    Artifact_Type: A categorical field indicating the type of artifact, such as sculpture, musical instrument, religious artifact, etc.

    Artifact_Location: The physical or virtual location of the artifact. This could be a museum, historical site, or an online platform.

    Artifact_Historical_Significance: A textual description of the cultural and historical importance of the artifact. This can be used to provide more context for the artifact’s significance.

    Digital_Representation_URL: A URL pointing to the digital representation of the artifact, which could be an image, video, or audio file related to the artifact.

    Artifact_Tags: Tags that describe the artifact in terms of categories like "Ancient", "Religious", "Musical", etc. These can be used for classification or recommendations.

    Site_ID: The unique identifier for each site (e.g., the location where a cultural artifact is located or a historical site).

    Site_Name: The name of the site, such as "Great Wall of China" or "Colosseum".

    Site_Location: The geographical coordinates (latitude and longitude) of the site.

    Site_Cultural_Importance: A description of the cultural importance of the site.

    Virtual_Tour_URL: A URL linking to a virtual tour of the site or artifact, offering an interactive learning experience.

    IoT_Sensor_ID: The identifier of the IoT sensor collecting data from the artifact or site. These sensors track various environmental factors such as temperature, humidity, and motion.

    IoT_Sensor_Type: The type of sensor used to collect data, such as "Temperature", "Humidity", "Motion", or "Light".

    IoT_Sensor_Location: The location of the IoT sensor in relation to the artifact or site.

    Sensor_Timestamp: The date and time when the sensor data was recorded.

    Sensor_Reading: The actual value recorded by the IoT sensor. This could represent temperature, humidity, motion detection, or other environmental factors.

    User_Interaction_Type: The type of interaction the user has with the artifact or site. This can include actions like "View", "Rate", or "Click".

    User_Interaction_Timestamp: The timestamp of when the user interaction took place. This allows us to track the time of day and the frequency of interactions.

    Interaction_Duration: The duration of the user’s interaction with the artifact or site, measured in minutes. This helps in analyzing user engagement levels.

    User_Feedback: The feedback provided by the user after interacting with the artifact or site, such as "Liked", "Not Interested", or "Neutral".

    Target_Column: The target variable for machine learning models. This column indicates whether the user had a positive interaction (1) or a negative interaction (0). A positive interaction is defined as a "view" action, while negative interactions include other actions like "rate" or "click".
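
    A minimal sketch of deriving the target from the interaction type, assuming a CSV export with the column names listed above (the filename is a placeholder):

    import pandas as pd

    # Positive interaction (1) = "View"; other actions ("Rate", "Click", ...) = 0.
    df = pd.read_csv("cultural_heritage_learning.csv")  # placeholder filename
    df["Target_Column"] = (df["User_Interaction_Type"] == "View").astype(int)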

  17. AAMOS-00 Study: Predicting Asthma Attacks Using Connected Mobile Devices and...

    • find.data.gov.scot
    • dtechtive.com
    csv, docx, xlsx
    Updated Oct 20, 2022
    Cite
    University of Edinburgh. Edinburgh Medical School. Usher Institute (2022). AAMOS-00 Study: Predicting Asthma Attacks Using Connected Mobile Devices and Machine Learning [Dataset]. http://doi.org/10.7488/ds/3775
    Explore at:
    csv(3.229 MB), csv(0.018 MB), docx(0.0229 MB), csv(0.0155 MB), csv(30.96 MB), csv(0.0613 MB), xlsx(0.0235 MB), csv(0.0039 MB), csv(0.0509 MB), csv(30.7 MB), csv(0.0652 MB), csv(0.1885 MB)Available download formats
    Dataset updated
    Oct 20, 2022
    Dataset provided by
    University of Edinburgh. Edinburgh Medical School. Usher Institute
    Area covered
    UNITED KINGDOM
    Description

    Monitoring asthma condition is essential to asthma self-management. However, traditional methods of monitoring require high levels of active engagement, and patients may regard this level of monitoring as tedious. Passive monitoring with mobile health devices, especially when combined with machine learning, provides an avenue to dramatically reduce management burden. However, data for developing machine learning algorithms are scarce, and gathering new data is expensive. A few asthma mHealth datasets are publicly available, but they lack the objective and passively collected data which may enhance asthma attack prediction systems. To fill this gap, we carried out the 2-phase AAMOS-00 observational study to collect data about asthma status using three smart monitoring devices (smart peak flow meter, smart inhaler, smartwatch) and daily symptom questionnaires. Combined with localised weather, pollen, and air quality reports, we have collected a rich longitudinal dataset to explore the feasibility of passive monitoring and asthma attack prediction. Conducting phase 2 of device monitoring over 12 months, from June 2021 to June 2022 and during the COVID-19 pandemic, 22 participants across the UK provided 2,054 unique patient-days of data. This valuable anonymised dataset has been made publicly available with the consent of participants. Ethics approval was provided by the East of England - Cambridge Central Research Ethics Committee; IRAS project ID: 285505, with governance approval from ACCORD (Academic and Clinical Central Office for Research and Development), project number AC20145. The study sponsor was ACCORD, the University of Edinburgh. The anonymised dataset was produced with statistical advice from Aryelly Rodriguez, Statistician, Edinburgh Clinical Trials Unit, University of Edinburgh.

    Protocol: 'Predicting asthma attacks using connected mobile devices and machine learning; the AAMOS-00 observational study protocol', BMJ Open, DOI: 10.1136/bmjopen-2022-064166

    Thesis: Tsang, Kevin CH, 'Application of data-driven technologies for asthma self-management' (2022) [Doctoral Thesis], University of Edinburgh, https://era.ed.ac.uk/handle/1842/40547

    The dataset also relates to the publication: K.C.H. Tsang, H. Pinnock, A.M. Wilson, D. Salvi and S.A. Shah (2023). 'Home monitoring with connected mobile devices for asthma attack prediction with machine learning', Scientific Data 10 (https://doi.org/10.1038/s41597-023-02241-9).

  18. Data for: Hit2flux: A machine learning framework for boiling heat flux...

    • search.dataone.org
    • resodate.org
    Updated Jun 11, 2025
    Cite
    Christy Dunlap; Changgen Li; Hari Pandey; Han Hu (2025). Data for: Hit2flux: A machine learning framework for boiling heat flux prediction using hit-based acoustic emission sensing [Dataset]. http://doi.org/10.5061/dryad.g79cnp628
    Explore at:
    Dataset updated
    Jun 11, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Christy Dunlap; Changgen Li; Hari Pandey; Han Hu
    Description

    This paper presents Hit2Flux, a machine learning framework for boiling heat flux prediction using acoustic emission (AE) hits generated through threshold-based transient sampling. Unlike continuously sampled data, AE hits are recorded when the signal exceeds a predefined threshold and are thus discontinuous in nature. Meanwhile, each hit represents a waveform at a high sampling frequency (∼1 MHz). In order to capture the features of both the high-frequency waveforms and the temporal distribution of hits, Hit2Flux involves i) feature extraction by transforming AE hits into the frequency domain and organizing these spectra into sequences using a rolling window to form “sequences of sequences”, and ii) heat flux prediction using a long short-term memory (LSTM) network with these sequences of sequences (see the sketch at the end of this entry). The model is trained on AE hits recorded during pool boiling experiments using an AE sensor attached to the boiling chamber. Continuously sampled acoustic data using a hydrophone were also collect...

    Data for: Hit2flux: A machine learning framework for boiling heat flux prediction using hit-based acoustic emission sensing

    Dataset DOI: 10.5061/dryad.g79cnp628

    Description of the data and file structure

    This dataset includes acoustic emission hit data and waveforms, hydrophone data, pressure, and temperature data from transient pool boiling tests on copper microchannels. The pool boiling test facility includes (a) a heating element that consists of a copper block and cartridge heaters; (b) a closed chamber with flow loops for a chiller (Thermo Scientific Polar ACCEL 500 Low/EA) connecting a Graham condenser (Ace glass 5953-106) with an adapter (Ace glass 5838-76) and an in-house built coiled copper condenser; and (c) a synchronized multimodal sensing system. The copper block is submerged in deionized water and heated by nine cartridge heaters (Omega Engineering HDC19102), each with a power rating of 50 W, inserted from the bottom. The cartridge heaters ...
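
    To make the “sequences of sequences” construction concrete, here is a minimal sketch; the window length, step, and FFT-based spectra are assumptions, not the authors' exact pipeline:

    import numpy as np

    def hits_to_sequences(hit_waveforms, window=16, step=1):
        # hit_waveforms: list of equal-length 1-D AE hit waveforms.
        # Each hit becomes a magnitude spectrum; a rolling window then
        # groups consecutive spectra into one LSTM input sample.
        spectra = [np.abs(np.fft.rfft(w)) for w in hit_waveforms]
        samples = [np.stack(spectra[i:i + window])
                   for i in range(0, len(spectra) - window + 1, step)]
        return np.array(samples)  # shape: (n_samples, window, n_freq_bins)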

  19. Data from: Dataset of Student Level Prediction in UAE

    • data.mendeley.com
    Updated Dec 18, 2020
    Cite
    Shatha Ghareeb (2020). Dataset of Student Level Prediction in UAE [Dataset]. http://doi.org/10.17632/3g8dtwbjjy.1
    Explore at:
    Dataset updated
    Dec 18, 2020
    Authors
    Shatha Ghareeb
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    United Arab Emirates
    Description

    The dataset comprises novel aspects, specifically in terms of student grading in diverse educational cultures across multiple countries; researchers and other education sectors will be able to see the impact of having varied curriculums within one country. The dataset compares different levelling cases when students transfer from one curriculum to another, and the unreliable levelling criteria currently set by international schools. The collected data can be used with intelligent algorithms, specifically machine learning and pattern-analysis methods, to develop an intelligent framework applicable in multi-cultural educational systems: one that aids a smooth transition (“levelling”, hereafter) of students who relocate from a particular education curriculum to another, and that minimizes the impact of switching on the students' educational performance. The preliminary variables considered when deciding which data to collect depended on the study variables. The UAE is a multicultural country with many expats relocating from regions such as Asia, Europe and America. To meet expats' needs, the UAE has established many international private schools; it was therefore chosen as the location of study, based on the many levelling cases and struggles declared by the Ministry of Education and schools. For the first time, we present this dataset comprising students' records for two academic years, covering math, English, and science over 3 terms. The selection of subject areas and number of terms was informed by other researchers in similar subject matters.

  20. PSYCHE-D: predicting change in depression severity using person-generated...

    • zenodo.org
    • data.niaid.nih.gov
    bin, pdf
    Updated Jul 18, 2024
    Cite
    Mariko Makhmutova; Mariko Makhmutova; Raghu Kainkaryam; Raghu Kainkaryam; Marta Ferreira; Marta Ferreira; Jae Min; Jae Min; Martin Jaggi; Martin Jaggi; Ieuan Clay; Ieuan Clay (2024). PSYCHE-D: predicting change in depression severity using person-generated health data (DATASET) [Dataset]. http://doi.org/10.5281/zenodo.5085146
    Explore at:
    pdf, binAvailable download formats
    Dataset updated
    Jul 18, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Mariko Makhmutova; Mariko Makhmutova; Raghu Kainkaryam; Raghu Kainkaryam; Marta Ferreira; Marta Ferreira; Jae Min; Jae Min; Martin Jaggi; Martin Jaggi; Ieuan Clay; Ieuan Clay
    Description

    This dataset is made available under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). See LICENSE.pdf for details.

    Dataset description

    Parquet file, with:

    • 35694 rows
    • 154 columns

    The file is indexed on [participant]_[month], such that 34_12 means month 12 from participant 34. All participant IDs have been replaced with randomly generated integers and the conversion table deleted.
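
    A minimal sketch of loading the file and splitting the index (the filename is a placeholder):

    import pandas as pd

    # The index has the form "<participant>_<month>", e.g. "34_12".
    df = pd.read_parquet("psyche_d_features.parquet")  # placeholder filename
    parts = df.index.to_series().str.split("_", expand=True)
    df["participant"] = parts[0].astype(int)
    df["month"] = parts[1].astype(int)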

    Column names and explanations are included as a separate tab-delimited file. Detailed descriptions of feature engineering are available from the linked publications.

    The file contains an aggregated, derived feature matrix describing person-generated health data (PGHD) captured as part of the DiSCover Project (https://clinicaltrials.gov/ct2/show/NCT03421223). The matrix focuses on individual changes in depression status over time, as measured by PHQ-9.

    The DiSCover Project is a 1-year long longitudinal study consisting of 10,036 individuals in the United States, who wore consumer-grade wearable devices throughout the study and completed monthly surveys about their mental health and/or lifestyle changes, between January 2018 and January 2020.

    The data subset used in this work comprises the following:

    • Wearable PGHD: step and sleep data from the participants’ consumer-grade wearable devices (Fitbit) worn throughout the study
    • Screener survey: prior to the study, participants self-reported socio-demographic information, as well as comorbidities
    • Lifestyle and medication changes (LMC) survey: every month, participants were requested to complete a brief survey reporting changes in their lifestyle and medication over the past month
    • Patient Health Questionnaire (PHQ-9) score: every 3 months, participants were requested to complete the PHQ-9, a 9-item questionnaire that has proven reliable and valid for measuring depression severity

    From these input sources we define a range of input features, both static (defined once, remain constant for all samples from a given participant throughout the study, e.g. demographic features) and dynamic (varying with time for a given participant, e.g. behavioral features derived from consumer-grade wearables).

    The dataset contains a total of 35,694 rows, one for each month of data collection from a participant. We can generate 3-month-long, non-overlapping, independent samples to capture changes in depression status over time with PGHD. We use the notation ‘SM0’ (sample month 0), ‘SM1’, ‘SM2’ and ‘SM3’ to refer to relative time points within each sample. Each 3-month sample consists of: PHQ-9 survey responses at SM0 and SM3, one set of screener survey responses, LMC survey responses at SM3 (as well as SM1 and SM2, if available), and wearable PGHD for SM3 (and SM1, SM2, if available). The wearable PGHD includes data collected from 8 to 14 days prior to the PHQ-9 label generation date at SM3. Doing this generates a total of 10,866 samples from 4,036 unique participants.
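
    A minimal sketch of forming such non-overlapping 3-month samples, continuing the snippet above; the selection logic is an assumption, not the authors' exact procedure:

    # Walk each participant's months and emit (SM0, SM3) pairs that are
    # 3 months apart, advancing past accepted samples so windows never overlap.
    samples = []
    for pid, grp in df.groupby("participant"):
        months = set(grp["month"])
        m = min(months)
        while m + 3 <= max(months):
            if m in months and m + 3 in months:  # PHQ-9 needed at SM0 and SM3
                samples.append((pid, m, m + 3))
                m += 3
            else:
                m += 1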
