8 datasets found
  1. Smartwatch Purchase Data

    • kaggle.com
    Updated Dec 30, 2022
    Cite
    Aayush Chourasiya (2022). Smartwatch Purchase Data [Dataset]. https://www.kaggle.com/datasets/albedo0/smartwatch-purchase-data/versions/2
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 30, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Aayush Chourasiya
    Description

    Disclaimer: This is artificially generated data, produced by a Python script based on the arbitrary assumptions listed below.

    The data consists of 100,000 examples of training data and 10,000 examples of test data, each representing a user who may or may not buy a smart watch.

    ----- Version 1 -------

    trainingDataV1.csv, testDataV1.csv (or trainingData.csv, testData.csv). The data includes the following features for each user:
    1. age: The age of the user (integer, 18-70)
    2. income: The income of the user (integer, 25,000-200,000)
    3. gender: The gender of the user (string, "male" or "female")
    4. maritalStatus: The marital status of the user (string, "single", "married", or "divorced")
    5. hour: The hour of the day (integer, 0-23)
    6. weekend: A boolean indicating whether it is the weekend (True or False)

    The data also includes a label for each user indicating whether they are likely to buy a smart watch (string, "yes" or "no"). The label is determined by the following arbitrary conditions, evaluated in order:
    - If the user is divorced and a random number generated by the script is less than 0.4, the label is "no" (i.e., assuming 40% of divorcees are not likely to buy a smart watch).
    - If it is the weekend and a random number generated by the script is less than 1.3, the label is "yes" (intended to make sales 30% more likely on weekends; note that a random number in [0, 1) is always less than 1.3, so this condition fires for every remaining weekend user).
    - If the user is male and under 30 with an income over 75,000, the label is "yes".
    - If the user is female and 30 or over with an income over 100,000, the label is "yes".
    - Otherwise, the label is "no".

    The training data is intended to be used to build and train a classification model, and the test data is intended to be used to evaluate the performance of the trained model.

    The following Python script was used to generate this dataset:

    import random
    import csv
    
    # Set the number of examples to generate
    numExamples = 100000
    
    # Generate the training data
    with open("trainingData.csv", "w", newline="") as csvfile:
      fieldnames = ["age", "income", "gender", "maritalStatus", "hour", "weekend", "buySmartWatch"]
      writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    
      writer.writeheader()
    
      for i in range(numExamples):
        age = random.randint(18, 70)
        income = random.randint(25000, 200000)
        gender = random.choice(["male", "female"])
        maritalStatus = random.choice(["single", "married", "divorced"])
        hour = random.randint(0, 23)
        weekend = random.choice([True, False])
    
        # Randomly assign the label based on some arbitrary conditions
        # assuming 40% of divorcees won't buy a smart watch
        if maritalStatus == "divorced" and random.random() < 0.4:
          buySmartWatch = "no"
        # assuming sales are 30% more likely to occur on weekends.
        # (note: random.random() returns values in [0, 1), so "< 1.3" is
        # always true and every remaining weekend user is labelled "yes")
        elif weekend == True and random.random() < 1.3:
          buySmartWatch = "yes"
        elif gender == "male" and age < 30 and income > 75000:
          buySmartWatch = "yes"
        elif gender == "female" and age >= 30 and income > 100000:
          buySmartWatch = "yes"
        else:
          buySmartWatch = "no"
    
        writer.writerow({
          "age": age,
          "income": income,
          "gender": gender,
          "maritalStatus": maritalStatus,
          "hour": hour,
          "weekend": weekend,
          "buySmartWatch": buySmartWatch
        })
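The labelling logic above can be factored into a standalone function for testing; a sketch mirroring the script's branches (the function name and the `rng` parameter are ours, not part of the original script):

```python
import random

def label_user(age, income, gender, marital_status, weekend, rng=random):
    """Mirror the generation script's arbitrary labelling rules, in order."""
    if marital_status == "divorced" and rng.random() < 0.4:
        return "no"
    # rng.random() returns values in [0, 1), so "< 1.3" is always true:
    # every weekend user reaching this branch is labelled "yes".
    elif weekend and rng.random() < 1.3:
        return "yes"
    elif gender == "male" and age < 30 and income > 75000:
        return "yes"
    elif gender == "female" and age >= 30 and income > 100000:
        return "yes"
    return "no"

print(label_user(25, 80000, "male", "single", weekend=False))  # yes
```

Keeping the rules in one pure function makes the class balance of the generated data easy to audit before training.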
    

    ----- Version 2 -------

    trainingDataV2.csv, testDataV2.csv. The data includes the following features for each user:
    1. age: The age of the user (integer, 18-70)
    2. income: The income of the user (integer, 25,000-200,000)
    3. gender: The gender of the user (string, "male" or "female")
    4. maritalStatus: The marital status of the user (string, "single", "married", or "divorced")
    5. educationLevel: The education level of the user (string, "high school", "associate's degree", "bachelor's degree", "master's degree", or "doctorate")
    6. occupation: The occupation of the user (string, "tech worker", "manager", "executive", "sales", "customer service", "creative", "manual labor", "healthcare", "education", "government", "unemployed", or "student")
    7. familySize: The number of people in the user's family (integer, 1-5)
    8. fitnessInterest: A boolean indicating whether the user is interested in fitness (True or False)
    9. priorSmartwatchOwnership: A boolean indicating whether the user has owned a smartwatch in the past (True or False)
    10. hour: The hour of the day when the user was surveyed (integer, 0-23)
    11. weekend: A boolean indicating whether the user was surveyed on a weekend (True or False)
    12. buySmartWatch: A boolean indicating whether the user purchased a smartwatch (True or False)

    Python script used to generate the data:

    import random
    import csv
    
    # Set the number of examples to generate
    numExamples = 100000
    
    with open("t...
    
  2. Rescaled Fashion-MNIST dataset

    • zenodo.org
    Updated Jun 27, 2025
    Cite
    Andrzej Perzanowski; Andrzej Perzanowski; Tony Lindeberg; Tony Lindeberg (2025). Rescaled Fashion-MNIST dataset [Dataset]. http://doi.org/10.5281/zenodo.15187793
    Explore at:
    Dataset updated
    Jun 27, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andrzej Perzanowski; Andrzej Perzanowski; Tony Lindeberg; Tony Lindeberg
    Time period covered
    Apr 10, 2025
    Description

    Motivation

    The goal of introducing the Rescaled Fashion-MNIST dataset is to provide a dataset that contains scale variations (up to a factor of 4), to evaluate the ability of networks to generalise to scales not present in the training data.

    The Rescaled Fashion-MNIST dataset was introduced in the paper:

    [1] A. Perzanowski and T. Lindeberg (2025) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, Journal of Mathematical Imaging and Vision, 67(29), https://doi.org/10.1007/s10851-025-01245-x.

    with a pre-print available at arXiv:

    [2] Perzanowski and Lindeberg (2024) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, arXiv preprint arXiv:2409.11140.

    Importantly, the Rescaled Fashion-MNIST dataset is more challenging than the MNIST Large Scale dataset, introduced in:

    [3] Y. Jansson and T. Lindeberg (2022) "Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales", Journal of Mathematical Imaging and Vision, 64(5): 506-536, https://doi.org/10.1007/s10851-022-01082-2.

    Access and rights

    The Rescaled Fashion-MNIST dataset is provided on the condition that you provide proper citation for the original Fashion-MNIST dataset:

    [4] Xiao, H., Rasul, K., and Vollgraf, R. (2017) “Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms”, arXiv preprint arXiv:1708.07747

    and also for this new rescaled version, using the reference [1] above.

    The dataset is made available on request. If you are interested in trying it out, please make a request in the system below, and we will grant you access as soon as possible.

    The dataset

    The Rescaled Fashion-MNIST dataset is generated by rescaling 28×28 grey-scale images of clothes from the original Fashion-MNIST dataset [4]. The scale variations are up to a factor of 4, and the images are embedded within black images of size 72×72, with the object in the frame always centred. The imresize() function in Matlab was used for the rescaling, with default anti-aliasing turned on, and bicubic interpolation overshoot removed by clipping to the [0, 255] range. The details of how the dataset was created can be found in [1].

    There are 10 different classes in the dataset: “T-shirt/top”, “trouser”, “pullover”, “dress”, “coat”, “sandal”, “shirt”, “sneaker”, “bag” and “ankle boot”. In the dataset, these are represented by integer labels in the range [0, 9].

    The dataset is split into 50 000 training samples, 10 000 validation samples and 10 000 testing samples. The training dataset is generated using the initial 50 000 samples from the original Fashion-MNIST training set. The validation dataset, on the other hand, is formed from the final 10 000 images of that same training set. For testing, all test datasets are built from the 10 000 images contained in the original Fashion-MNIST test set.

    The h5 files containing the dataset

    The training dataset file (~2.9 GB) for scale 1, which also contains the corresponding validation and test data for the same scale, is:

    fashionmnist_with_scale_variations_tr50000_vl10000_te10000_outsize72-72_scte1p000_scte1p000.h5

    Additionally, for the Rescaled Fashion-MNIST dataset, there are 9 datasets (~415 MB each) for testing scale generalisation at scales not present in the training set. Each of these datasets is rescaled using a different image scaling factor, 2^(k/4), with k an integer in the range [-4, 4]:

    fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p500.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p595.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p707.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p841.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p000.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p189.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p414.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p682.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte2p000.h5
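The scale factors encoded in these file names follow the 2^(k/4) progression directly; a quick check (rounded to three decimals, as in the "scte" tags):

```python
# The nine test-set scale factors are 2**(k/4) for k = -4..4,
# matching the "scte0p500" ... "scte2p000" tags in the file names above.
factors = [round(2 ** (k / 4), 3) for k in range(-4, 5)]
print(factors)
# [0.5, 0.595, 0.707, 0.841, 1.0, 1.189, 1.414, 1.682, 2.0]
```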

    These dataset files were used for the experiments presented in Figures 6, 7, 14, 16, 19 and 23 in [1].

    Instructions for loading the data set

    The datasets are saved in HDF5 format, with the partitions in the respective h5 files named as
    ('/x_train', '/x_val', '/x_test', '/y_train', '/y_test', '/y_val'); which ones exist depends on which data split is used.

    The training dataset can be loaded in Python as:

    import h5py
    import numpy as np

    with h5py.File("fashionmnist_with_scale_variations_tr50000_vl10000_te10000_outsize72-72_scte1p000_scte1p000.h5", "r") as f:
        x_train = np.array(f["/x_train"], dtype=np.float32)
        x_val = np.array(f["/x_val"], dtype=np.float32)
        x_test = np.array(f["/x_test"], dtype=np.float32)
        y_train = np.array(f["/y_train"], dtype=np.int32)
        y_val = np.array(f["/y_val"], dtype=np.int32)
        y_test = np.array(f["/y_test"], dtype=np.int32)

    We also need to permute the data, since PyTorch uses the format [num_samples, channels, width, height], while the data is saved as [num_samples, width, height, channels]:

    x_train = np.transpose(x_train, (0, 3, 1, 2))
    x_val = np.transpose(x_val, (0, 3, 1, 2))
    x_test = np.transpose(x_test, (0, 3, 1, 2))

    The test datasets can be loaded in Python as (shown for the scale-0.5 file; substitute the desired scale factor):

    import h5py
    import numpy as np

    with h5py.File("fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p500.h5", "r") as f:
        x_test = np.array(f["/x_test"], dtype=np.float32)
        y_test = np.array(f["/y_test"], dtype=np.int32)

    The test datasets can be loaded in Matlab as (again shown for the scale-0.5 file):

    x_test = h5read('fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p500.h5', '/x_test');

    The images are stored as [num_samples, x_dim, y_dim, channels] in HDF5 files. The pixel intensity values are not normalised, and are in a [0, 255] range.
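Since the intensities are stored unnormalised, a typical preprocessing step before feeding the data to a PyTorch model looks as follows (a sketch using NumPy only; the random array stands in for a loaded x_train batch):

```python
import numpy as np

# Synthetic stand-in for x_train as stored on disk:
# [num_samples, x_dim, y_dim, channels], intensities in [0, 255]
x = np.random.randint(0, 256, size=(8, 72, 72, 1)).astype(np.float32)

x = x / 255.0                      # scale intensities to [0, 1]
x = np.transpose(x, (0, 3, 1, 2))  # move channels to PyTorch's second axis

print(x.shape)  # (8, 1, 72, 72)
```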

    There is also a closely related Fashion-MNIST with translations dataset, which in addition to scaling variations also comprises spatial translations of the objects.

  3. Integrated Agent-based Modelling and Simulation of Transportation Demand and...

    • data.niaid.nih.gov
    Updated Jun 19, 2024
    Cite
    Sprei, Frances (2024). Integrated Agent-based Modelling and Simulation of Transportation Demand and Mobility Patterns in Sweden [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10648077
    Explore at:
    Dataset updated
    Jun 19, 2024
    Dataset provided by
    Ghosh, Kaniska
    Dhamal, Swapnil
    Tozluoğlu, Çağlar
    Sprei, Frances
    Liao, Yuan
    Yeh, Sonia
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Sweden
    Description

    About

    The Synthetic Sweden Mobility (SySMo) model provides a simplified yet statistically realistic microscopic representation of the real population of Sweden. The agents in this synthetic population contain socioeconomic attributes, household characteristics, and corresponding activity plans for an average weekday. This agent-based modelling approach derives the transportation demand from the agents’ planned activities using various transport modes (e.g., car, public transport, bike, and walking).

    This open data repository contains four datasets:

    (1) Synthetic Agents,

    (2) Activity Plans of the Agents,

    (3) Travel Trajectories of the Agents, and

    (4) Road Network (EPSG: 3006)

    (OpenStreetMap data were retrieved on August 28, 2023, from https://download.geofabrik.de/europe.html, and GTFS data were retrieved on September 6, 2023 from https://samtrafiken.se/)

    The database can serve as input to assess the potential impacts of new transportation technologies, infrastructure changes, and policy interventions on the mobility patterns of the Swedish population.

    Methodology

    This dataset contains 10.2 million statistically simulated agents representing the population of Sweden, their socio-economic characteristics, and their activity plans for an average weekday. To prepare the data for the MATSim simulation, we randomly divided all the agents into 10 batches. Each batch's agents were then simulated in MATSim using a multi-modal network combining road networks and public transit data in Sweden, built with the package pt2matsim (https://github.com/matsim-org/pt2matsim).

    The agents' daily activity plans, along with the road network, serve as the primary inputs to the MATSim environment, which performs iterative replanning while aiming for convergence towards optimal activity plans for all agents. The individual mobility trajectories of the agents are then retrieved from the MATSim simulation.

    The activity plans of the individual agents extracted from the MATSim simulation output are then further processed. Agents with a negative utility score and a negative activity time for at least one activity are filtered out as ‘infeasible’. The dataset ‘Synthetic Agents’ contains all synthetic agents regardless of their feasibility (0 = excluded from, 1 = included in, the plans and trajectory datasets). In the other datasets, only agents with feasible activity plans are included.

    The simulation setup adheres to the MATSim 13.0 benchmark scenario, with slight adjustments. The replanning strategy integrates BestScore (60%), TimeAllocationMutator (30%), and ReRoute (10%); the percentages denote the proportion of agents utilizing each strategy. In each iteration of the simulation, the agents adopt these strategies to adjust their activity plans. The "BestScore" strategy retains the plan with the highest score from the previous iteration, selecting the most successful plan an agent has employed up to that point. The "TimeAllocationMutator" modifies the end times of activities by introducing random shifts within a specified range, allowing the exploration of different schedules. The "ReRoute" strategy enables agents to alter their current routes, potentially optimizing travel based on updated information or preferences. These strategies are detailed further in the work of Axhausen et al. (2016), which provides comprehensive insights into their implementation and impact in the context of transport simulation modeling.
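These replanning weights correspond to strategy settings in the MATSim configuration file; a sketch in MATSim's config_v2 XML format (module and parameter names follow MATSim conventions; everything beyond the three stated weights is an assumption):

```xml
<module name="strategy">
  <parameterset type="strategysettings">
    <param name="strategyName" value="BestScore" />
    <param name="weight" value="0.6" />
  </parameterset>
  <parameterset type="strategysettings">
    <param name="strategyName" value="TimeAllocationMutator" />
    <param name="weight" value="0.3" />
  </parameterset>
  <parameterset type="strategysettings">
    <param name="strategyName" value="ReRoute" />
    <param name="weight" value="0.1" />
  </parameterset>
</module>
```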

    Data Description

    (1) Synthetic Agents

    This dataset contains all agents in Sweden and their socioeconomic characteristics.

    The attribute ‘feasibility’ has two categories: feasible agents (73%) and infeasible agents (27%). Infeasible agents have a negative utility score and a negative activity time for at least one activity.

    File name: 1_syn_pop_all.parquet

    Column | Description | Data type | Unit
    PId | Agent ID | Integer | -
    Deso | Zone code of Demographic statistical areas (DeSO)1 | String | -
    kommun | Municipality code | Integer | -
    marital | Marital status (single/couple/child) | String | -
    sex | Gender (0 = Male, 1 = Female) | Integer | -
    age | Age | Integer | -
    HId | A unique identifier for households | Integer | -
    HHtype | Type of household (single/couple/other) | String | -
    HHsize | Number of people living in the household | Integer | -
    num_babies | Number of children less than six years old in the household | Integer | -
    employment | Employment status (0 = Not Employed, 1 = Employed) | Integer | -
    studenthood | Studenthood status (0 = Not Student, 1 = Student) | Integer | -
    income_class | Income class (0 = No Income, 1 = Low Income, 2 = Lower-middle Income, 3 = Upper-middle Income, 4 = High Income) | Integer | -
    num_cars | Number of cars owned by an individual | Integer | -
    HHcars | Number of cars in the household | Integer | -
    feasibility | Status of the individual (1 = feasible, 0 = infeasible) | Integer | -
    1 https://www.scb.se/vara-tjanster/oppna-data/oppna-geodata/deso--demografiska-statistikomraden/
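A minimal sketch of working with this table in pandas (the read_parquet call is shown as a comment since it needs the downloaded file; the tiny stand-in frame and its values below are ours):

```python
import pandas as pd

# In practice: agents = pd.read_parquet("1_syn_pop_all.parquet")
# Tiny stand-in frame with a few of the documented columns:
agents = pd.DataFrame({
    "PId": [1, 2, 3],
    "age": [34, 61, 25],
    "sex": [0, 1, 1],          # 0 = Male, 1 = Female
    "feasibility": [1, 0, 1],  # 1 = feasible, 0 = infeasible
})

# Keep only agents with feasible activity plans, as used in datasets (2)-(3)
feasible = agents[agents["feasibility"] == 1]
print(len(feasible))  # 2
```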

    (2) Activity Plans of the Agents

    The dataset contains the car agents’ (agents that use cars on the simulated day) activity plans for a simulated average weekday.

    File name: 2_plans_i.parquet, i = 0, 1, 2, ..., 8, 9. (10 files in total)

    Column | Description | Data type | Unit
    act_purpose | Activity purpose (work/home/school/other) | String | -
    PId | Agent ID | Integer | -
    act_end | End time of activity (0:00:00 - 23:59:59) | String | hour:minute:second
    act_id | Activity index of each agent | Integer | -
    mode | Transport mode to reach the activity location | String | -
    POINT_X | Coordinate X of activity location (SWEREF99TM) | Float | metre
    POINT_Y | Coordinate Y of activity location (SWEREF99TM) | Float | metre
    dep_time | Departure time (0:00:00 - 23:59:59) | String | hour:minute:second
    score | Utility score of the simulation day as obtained from MATSim | Float | -
    trav_time | Travel time to reach the activity location | String | hour:minute:second
    trav_time_min | Travel time in decimal minutes | Float | minute
    act_time | Activity duration in decimal minutes | Float | minute
    distance | Travel distance between the origin and the destination | Float | km
    speed | Travel speed to reach the activity location | Float | km/h

    (3) Travel Trajectories of the Agents

    This dataset contains the driving trajectories of all the agents on the road network, and the public transit vehicles used by these agents, including buses, ferries, trams, etc. The files are produced by MATSim simulations and organised into 10 *.parquet files (representing different batches of simulation), corresponding to each plan file.

    File name: 3_events_i.parquet, i = 0, 1, 2, ..., 8, 9. (10 files in total)

    Column | Description | Data type | Unit
    time | Time in seconds within a simulation day (0-86399) | Integer | second
    type | Event type defined by MATSim simulation* | String | -
    person | Agent ID | Integer | -
    link | Nearest road link consistent with the road network | String | -
    vehicle | Vehicle ID (identical to person) | Integer | -
    from_node | Start node of the link | Integer | -
    to_node | End node of the link | Integer | -

    * One typical episode of MATSim simulation events: activity ends (actend) -> agent's vehicle enters traffic (vehicle enters traffic) -> agent's vehicle moves from the previous road segment to the next connected one (left link) -> agent's vehicle leaves traffic for an activity (vehicle leaves traffic) -> activity starts (actstart)

    (4) Road Network

    This dataset contains the road network.

    File name: 4_network.shp

    Column | Description | Data type | Unit
    length | The length of the road link | Float | metre
    freespeed | Free speed | Float | km/h
    capacity | Number of vehicles | Integer | -
    permlanes | Number of lanes | Integer | -
    oneway | Whether the segment is one-way (0 = no, 1 = yes) | Integer | -
    modes | Transport mode | String | -
    from_node | Start node of the link | Integer | -
    to_node | End node of the link | Integer | -
    geometry | LINESTRING (SWEREF99TM) | geometry | metre

    Additional Notes

    This research is funded by the RISE Research Institutes of Sweden, the Swedish Research Council for Sustainable Development (Formas, project number 2018-01768), and Transport Area of Advance, Chalmers.

    Contributions

    YL designed the simulation, analyzed the simulation data, and, along with CT, executed the simulation. CT, SD, FS, and SY conceptualized the model (SySMo), with CT and SD further developing the model to produce agents and their activity plans. KG wrote the data document. All authors reviewed, edited, and approved the final document.

  4. Dataset used in the publication entitled "Application of machine learning to...

    • zenodo.org
    • data.niaid.nih.gov
    bin, txt
    Updated Jan 31, 2024
    Cite
    Biaobiao Yang; Biaobiao Yang; Valentin Vassilev-Galindo; Valentin Vassilev-Galindo; Javier Llorca; Javier Llorca (2024). Dataset used in the publication entitled "Application of machine learning to assess the influence of microstructure on twin nucleation in Mg alloys" [Dataset]. http://doi.org/10.5281/zenodo.10225600
    Explore at:
    bin, txt (available download formats)
    Dataset updated
    Jan 31, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Biaobiao Yang; Biaobiao Yang; Valentin Vassilev-Galindo; Valentin Vassilev-Galindo; Javier Llorca; Javier Llorca
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Documentation for the Dataset used in the publication entitled "Application of machine learning to assess the influence of microstructure on twin nucleation in Mg alloys"
    ** These datasets comprise the 2D EBSD data acquired in the Mg-1Al (at.%) alloy and AZ31 Mg alloy, analyzed with MTEX 7.0 software. **
    ** More details about the experimental techniques can be found in the publication "Biaobiao Yang, Valentin Vassilev-Galindo, Javier Llorca, Application of machine learning to assess the influence of microstructure on twin nucleation in Mg alloys. npj Computational Materials, 2024." **

    1. AZ31_ML.xlsx
    - Description: Both twin and grain data were acquired by EBSD from AZ31 Mg sample before and after deformation at the same area
    - Number of grains: 2640 (rows == grains) corresponding to three samples deformed in different orientations: S0, S45, and S90
    - Number of analyzed variables (features): 31 (columns == grain characteristics)

    - Variable description by columns:
    1- (Twinned) - type: boolean
    Description: Indicates if the grain twinned or not after deformation
    0: non-twinned grain
    1: twinned grain
    2- (Orientation) - type: numerical (integer)
    Description: The loading (tensile) direction with respect to the c axis of lattice
    3- (Strain_level) - type: numerical (float)
    Description: The maximum strain level after deformation
    4- (Grain_size) - type: numerical (float)
    Description: The equivalent circle diameter (in micrometers) of the grain before deformation.
    5- (Triple_points) - type: numerical (integer)
    Description: The number of triple points of the grain before deformation
    6- (Near_edge) - type: boolean
    Description: Indicates if the grain is located near the edge of the 2D EBSD or not. This feature was used to filter out from the final dataset the grains near the edge of the sample. Hence, only those entries with Near_edge value of 0 were used to train and test the machine learning models.
    0: not near the EBSD edge
    1: near the EBSD edge
    7-12- (T_SF*) - type: numerical (float)
    Description: The twinning Schmid factor based on the loading condition, orientation of parent grain and twin variants information.
    T_SF1: The highest Schmid factor of extension twinning
    T_SF2: The 2nd highest Schmid factor of extension twinning
    T_SF3: The 3rd highest
    T_SF4: The 4th highest
    T_SF5: The 5th highest
    T_SF6: The lowest Schmid factor of extension twinning
    13-15- (S_SF*) - type: numerical (float)
    Description: The Schmid factor for basal slip based on the loading condition, orientation of parent grain, and slip system information. Only the basal slip system is considered because it is the dominant deformation slip system in Mg during deformation.
    S_SF1: The highest Schmid factor of basal slip
    S_SF2: The second highest (middle) Schmid factor of basal slip
    S_SF3: The lowest Schmid factor of basal slip
    16- (Neighbor_grain_n) - type: numerical (integer)
    Description: The number of neighbors of the grain before deformation.
    17-19- (B-b_m) - type: numerical (float)
    Description: The Luster-Morris geometric compatibility factor (m') between the basal slip systems of the grain and its neighbors. Although there are 3 possible basal slip systems, only the one with the highest Schmid factor was considered to compute m'. Only maximum, minimum, and mean values were included in the dataset.
    (Max_B-b_m): The highest basal - basal m' between the grain and its neighbors
    (Min_B-b_m): The lowest basal - basal m' between the grain and its neighbors
    (Mean_B-b_m): The average basal - basal m' between the grain and its neighbors
    20-22- (B-t_m) - type: numerical (float)
    Description: The Luster-Morris geometric compatibility factor (m') between the 6 extension twin variants of the grain and the basal slip systems of its neighbors. Although there are 3 possible basal slip systems, only the one with the highest Schmid factor was considered to compute m'. However, all 6 twinning variants have been considered, given that slip-induced twinning is a localized process. Only maximum, minimum, and mean values were included in the dataset.
    (Max_B-t_m): The highest basal - twin m' between the grain and its neighbors
    (Min_B-t_m): The lowest basal - twin m' between the grain and its neighbors
    (Mean_B-t_m): The average basal - twin m' between the grain and its neighbors
    23-25- (GB_misang) - type: numerical (float)
    Description: The misorientation angle (in º) between the grain and its neighbors. In fact, disorientation angle is used for the misorientation angle. Only maximum, minimum, and mean values were included in the dataset.
    (Max_GBmisang): The highest GB misorientation angle between the grain and its neighbors
    (Min_GBmisang): The lowest GB misorientation angle between the grain and its neighbors
    (Mean_GBmisang): The average GB misorientation angle between the grain and its neighbors
    26-28- (delta_Gs) - type: numerical (float)
    Description: Grain size difference (in micrometers) between a given grain and its neighbors. The grain size is calculated as the diameter of a circular grain with the same area of the grain. Only maximum, minimum, and mean values were included in the dataset.
    (Max_deltaGs): The highest grain size difference between the grain and its neighbors
    (Min_deltaGs): The smallest grain size difference between the grain and its neighbors
    (Mean_deltaGs): The average grain size difference between the grain and its neighbors
    29-31- (delta_BSF) - type: numerical (float)
    Description: The difference in the basal slip Schmid factor between a given grain and its neighbors. Only the highest basal slip Schmid factor is considered. Only maximum, minimum, and mean values were included in the dataset.
    (Max_deltaBSF): The highest basal SF difference between the grain and its neighbors
    (Min_deltaBSF): The smallest basal SF difference between the grain and its neighbors
    (Mean_deltaBSF): The average basal SF difference between the grain and its neighbors

    2. Mg1Al_ML.xlsx
    - Description: Both twin and grain data were acquired by EBSD from Mg-1Al (at.%) sample before and after deformation at the same area
    - Number of grains: 1496 (rows == grains) corresponding to two true strain levels: ~6%, and ~10%.
    - Number of analyzed variables (features): 31 (columns == grain characteristics)

    - Variable descriptions by columns are the same as those of AZ31_ML.xlsx
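A minimal sketch of preparing these tables for model training (pandas assumed; read_excel requires openpyxl and the downloaded file, so a tiny stand-in frame with invented values is used here; column names follow the description above):

```python
import pandas as pd

# In practice: df = pd.read_excel("AZ31_ML.xlsx")
# Stand-in rows with a few of the 31 documented columns:
df = pd.DataFrame({
    "Twinned":    [1, 0, 1, 0],
    "Near_edge":  [0, 0, 1, 0],
    "Grain_size": [12.5, 8.1, 20.3, 5.7],
    "T_SF1":      [0.48, 0.12, 0.45, 0.30],
})

# As described above, grains near the EBSD edge are filtered out
# before training and testing the machine learning models.
df = df[df["Near_edge"] == 0]
X = df.drop(columns=["Twinned", "Near_edge"])  # features
y = df["Twinned"]                              # binary target
print(X.shape)  # (3, 2)
```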

  5. Three-dimensional dataset of hydrating cement paste (CEM I Ladce, 273 m^2/kg...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jan 23, 2023
    + more versions
    Cite
    Michal Hlobil; Michal Hlobil; Ivana Kumpová; Ivana Kumpová (2023). Three-dimensional dataset of hydrating cement paste (CEM I Ladce, 273 m^2/kg Blaine, w/c=0.50) in TIFF format [Dataset]. http://doi.org/10.5281/zenodo.7275174
    Explore at:
    zip (available download formats)
    Dataset updated
    Jan 23, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Michal Hlobil; Michal Hlobil; Ivana Kumpová; Ivana Kumpová
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A detailed description of this dataset can be found in https://doi.org/10.1016/j.dib.2023.108903.

    This dataset contains a collection of digitized three-dimensional hardened cement paste microstructures obtained from X-ray micro-computed tomography, screened after approx. 1, 2, 3, 4, 7, 14, and 28 days of elapsed hydration at 20˚C in saturated conditions. Each paste specimen had a cylindrical shape (with a diameter of ~1 mm) and was screened at a designated time (as specified in the file name, e.g. “t23hrs”=23 hours of elapsed hydration) and finally saved as an uncompressed and unprocessed *.tif greyscale image data file in 16-bit image depth (as unsigned integers) using a little-endian byte sequence.

    The dataset contains two sets of images:

    - “full-sized” digital images stored in a three-dimensional voxel-based matrix with a fixed size of 1100×1100×1100 voxels, denoted as “CEM_I_Ladce_*” in the file name; each file size amounts to ~2.5 GB and contains the whole screened specimen with a variable voxel size in the range 1.0913 − 1.1174 µm depending on the particular specimen (as specified in the file name, e.g. “1d1174um”=1.1174 µm/voxel)

    - smaller image subvolumes, denoted as Region Of Interest (ROI), extracted from an arbitrary location in the interior of the full-sized specimen, and denoted as “filteredROI_*” in the file name; this cropped ROI has a cubic shape and stores a three-dimensional voxel-based matrix with a fixed size of 500×500×500 µm³ constituted by a variable voxel count (given the fluctuating voxel size for each specimen, see above). Both the exact voxel count (i.e. three-dimensional matrix dimensions) and voxel size are further specified in each file name. A sequence of imaging filters was applied to this ROI to further enhance the contrast among the different microstructural phases; see https://doi.org/10.1016/j.cemconcomp.2022.104798 for details.

    Note that the same dataset stored in *raw format is available from https://doi.org/10.5281/zenodo.7193819
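
    As a rough sketch (not part of the dataset itself), a *.raw volume from the Zenodo mirror could be loaded with NumPy, assuming each file stores the full 1100x1100x1100 volume as unsigned 16-bit little-endian integers as described above; the file name below is hypothetical:

```python
import numpy as np

def load_ct_volume(path, shape=(1100, 1100, 1100)):
    """Load an unprocessed CT volume stored as little-endian
    unsigned 16-bit integers ('<u2'), one value per voxel."""
    vol = np.fromfile(path, dtype="<u2")
    return vol.reshape(shape)

# Hypothetical usage:
# vol = load_ct_volume("CEM_I_Ladce_t23hrs.raw")
# print(vol.shape)  # (1100, 1100, 1100)
```

    The *.tif variants carry the same bit depth and byte order, so a TIFF reader (e.g. the tifffile package) should yield an equivalent array.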

  6. SAFARI 2000 Tree Ring Data, Mongu, Zambia, Dry Season 2000 - Dataset - NASA...

    • data.staging.idas-ds1.appdat.jsc.nasa.gov
    • data.nasa.gov
    Updated Mar 20, 2025
    Cite
    nasa.gov (2025). SAFARI 2000 Tree Ring Data, Mongu, Zambia, Dry Season 2000 - Dataset - NASA Open Data Portal [Dataset]. https://data.staging.idas-ds1.appdat.jsc.nasa.gov/dataset/safari-2000-tree-ring-data-mongu-zambia-dry-season-2000-0a3a3
    Explore at:
    Dataset updated
    Mar 20, 2025
    Dataset provided by
    NASAhttp://nasa.gov/
    Area covered
    Mongu, Zambia
    Description

    This data set contains tree ring data from three sites located about 25 km from the meteorological station at Mongu, Zambia. Data from about 50 individual trees are reported. In addition, chronologies (or site mean curves), which better represent common influences (in this study, the climatic signal), were developed for each site from the individual data (Trouet, 2004; Trouet et al., 2001). The series cover a maximum of 46 years, although most do not extend beyond 30 years. The data were collected during the SAFARI 2000 Dry Season Field Campaign of August 2000.

    Ten to 23 samples were taken at each site. Brachystegia bakeriana was sampled at site 1, and Brachystegia spiciformis at sites 2 and 3. The vegetation at all sites had undergone primitive harvesting for subsistence earlier the same year, so samples could be taken from freshly cut trees and no living trees had to be cut. At all sites, samples consisted of full stem discs, taken at breast height (1.3 m) or slightly lower where possible. Growth ring widths were measured to the nearest 0.01 mm using LINTAB equipment and TSAP software (Rinn and Jakel, 1997), with four radii measured per sample disc. Cross-dating and response function analyses were performed using routine dendrochronological techniques. There are two files for each site: one containing integer values representing tree ring widths (raw data) and the other containing standardized values (chronologies) for each year. The data are stored as ASCII tables in comma-separated-value (.csv) format with column headers.
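
    A site chronology of this kind is essentially a year-by-year mean of the standardized individual series. A minimal stdlib sketch, assuming a hypothetical column layout (one column per tree; the actual SAFARI 2000 file headers may differ):

```python
import csv
from io import StringIO

# Hypothetical excerpt of a standardized-values file; column names
# are assumptions for illustration, not the actual file layout.
sample = """year,tree_01,tree_02,tree_03
1998,0.91,1.10,0.95
1999,1.05,0.98,1.12
2000,1.02,1.07,0.99
"""

def site_chronology(csv_text):
    """Average the standardized series across trees, year by year."""
    reader = csv.DictReader(StringIO(csv_text))
    chronology = {}
    for row in reader:
        values = [float(v) for k, v in row.items() if k != "year"]
        chronology[int(row["year"])] = sum(values) / len(values)
    return chronology
```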

  7. Spreadsheet Implementations for Linking Multi-Level Contribution Margin...

    • data.mendeley.com
    Updated Apr 26, 2021
    + more versions
    Cite
    Michael Gutiérrez (2021). Spreadsheet Implementations for Linking Multi-Level Contribution Margin Accounting with Multi-Level Fixed-Charge Problems [Dataset]. http://doi.org/10.17632/s6pswx23yx.4
    Explore at:
    Dataset updated
    Apr 26, 2021
    Authors
    Michael Gutiérrez
    License

    Attribution-NonCommercial 3.0 (CC BY-NC 3.0)https://creativecommons.org/licenses/by-nc/3.0/
    License information was derived automatically

    Description

    This site provides the data and spreadsheet implementations for linking multi-level contribution margin accounting as a subsystem in cost accounting with several versions of a multi-level fixed-charge problem (MLFCP), the latter based on the optimization approach in operations research. For the data, plausible fictitious values have been assumed taking into consideration the calculation principles in cost accounting where applicable. They include resource-related data, market-related data, and data from cost accounting. While the deterministic version of the data does not consider uncertainty, the stochastic/robust versions assume probability distributions and rank correlations for part of the data.

    Spreadsheets

    The data and the above-mentioned linkage are implemented in three spreadsheet files, including versions for deterministic optimization, stochastic optimization, and robust optimization:

    • MLFCP deterministic.xlsx
    • MLFCP stochastic.xlsx
    • MLFCP robust.xlsx

    For a detailed description of the spreadsheet implementations and information on the software required to use them, see the associated data article published in Data in Brief. For the conceptual framework, mathematical formulation of the optimization model (MLFCP), findings, and discussion, see the associated research article published in Heliyon. (The links to both articles can be found on this page).

    Big Picture

    Furthermore, an overview (“big picture”) of the data flows between the various worksheets is provided in three main versions which correspond to the deterministic, stochastic, and robust versions of the MLFCP:

    • Overview of data flows - deterministic
    • Overview of data flows - stochastic (with three sub-variants)
    • Overview of data flows - robust

    Within each version/sub-variant of the overview, two file formats (PDF and PNG) are available. These are oversize graphics; please scale up appropriately to see the details.

    (Remark on version numbers and dates: The version numbers reported within the files might be lower than the version number of the entire dataset in case particular files remain unchanged in an update. The same might analogously apply to the dates.)

  8. Geospatial Dataset of GNSS Anomalies and Political Violence Events

    • zenodo.org
    csv
    Updated Jun 14, 2025
    + more versions
    Cite
    Eugene Pik; João S. D. Garcia; Matthew Berra; Timothy Smith; Ibrahim Kocaman (2025). Geospatial Dataset of GNSS Anomalies and Political Violence Events [Dataset]. http://doi.org/10.5281/zenodo.15665065
    Explore at:
    Available download formats: csv
    Dataset updated
    Jun 14, 2025
    Dataset provided by
    Zenodo
    Authors
    Eugene Pik; João S. D. Garcia; Matthew Berra; Timothy Smith; Ibrahim Kocaman
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jun 14, 2025
    Description

    Geospatial Dataset of GNSS Anomalies and Political Violence Events

    Overview

    The Geospatial Dataset of GNSS Anomalies and Political Violence Events is a collection of data that integrates aircraft flight information, GNSS (Global Navigation Satellite System) anomalies, and political violence events from the ACLED (Armed Conflict Location & Event Data Project) database.

    Dataset Files

    The dataset consists of three CSV files:

    1. Daily_GNSS_Anomalies_and_ACLED-2023-V1.csv
      • Description: Contains all grids and dates that had aircraft traffic during 2023.
      • Number of Records: 6,777,228
      • Purpose: Provides a complete view of aircraft movements and associated data, including grids without any GNSS anomalies.
    2. Daily_GNSS_Anomalies_and_ACLED-2023-V2.csv
      • Description: A filtered version of V1, including only the grids and dates where GNSS anomalies (jumps or gaps) were reported.
      • Number of Records: 718,237
      • Purpose: Focuses on areas and times with GNSS anomalies for targeted analysis.
    3. Monthly_GNSS_Anomalies_and_ACLED-2023-V9.csv
      • Description: Contains aggregated monthly data for each grid cell, combining GNSS anomalies and ACLED political violence events. Summarizes aircraft traffic, anomaly counts, and conflict activity at a monthly resolution.
      • Number of Records: 25,770
      • Purpose: Enables temporal trend analysis and spatial correlation studies between GNSS interference and political violence, using reduced data volume suitable for modeling and visualization.

    Data Fields: Daily_GNSS_Anomalies_and_ACLED-2023-V1.csv and Daily_GNSS_Anomalies_and_ACLED-2023-V2.csv

    1. grid_id
      • Description: Unique identifier for a grid cell on Earth measuring 0.5 degrees latitude by 0.5 degrees longitude.
      • Format: String combining latitude and longitude (e.g., -10.0_-36.0).
    2. day
      • Description: Date of the recorded data.
      • Format: YYYY-MM-DD (e.g., 2023-03-28).
    3. geometry
      • Description: Polygon coordinates of the grid cell in Well-Known Text (WKT) format.
      • Format: POLYGON((longitude latitude, ...)) (e.g., POLYGON((-36.0 -10.0, -35.5 -10.0, -35.5 -9.5, -36.0 -9.5, -36.0 -10.0))).
    4. flights
      • Description: Number of aircraft flights that passed through the grid on that day.
      • Format: Integer (e.g., 28).
    5. GPS_jumps
      • Description: Number of reported GNSS "jump" anomalies (possible spoofing incidents) in the grid on that day.
      • Format: Integer (e.g., 1).
    6. GPS_gaps
      • Description: Number of reported GNSS "gap" anomalies, indicating gaps in aircraft routes, in the grid on that day.
      • Format: Integer (e.g., 0).
    7. gaps_density
      • Description: Density of GNSS gaps, calculated as the number of gaps divided by the number of flights.
      • Format: Decimal (e.g., 0).
    8. jumps_density
      • Description: Density of GNSS jumps, calculated as the number of jumps divided by the number of flights.
      • Format: Decimal (e.g., 0.035714286).
    9. event_id_cnty
      • Description: ACLED event ID corresponding to political violence events in the grid on that day.
      • Format: String (e.g., BRA69267).
    10. disorder_type
      • Description: Type of disorder as classified by ACLED (e.g., "Political violence").
      • Format: String.
    11. event_type
      • Description: General category of the event according to ACLED (e.g., "Violence against civilians").
      • Format: String.
    12. sub_event_type
      • Description: Specific subtype of the event as per ACLED classification (e.g., "Attack").
      • Format: String.
    13. acled_count
      • Description: Number of ACLED events in the grid on that day.
      • Format: Integer (e.g., 1).
    14. acled_flag
      • Description: Indicator of ACLED event presence in the grid on that day (0 for no events, 1 for one or more events).
      • Format: Integer (0 or 1).
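
    The grid_id, geometry, and density fields above follow directly from the 0.5-degree grid convention. A short sketch (function names are mine, not part of the dataset) that reproduces the documented formats:

```python
import math

CELL = 0.5  # grid resolution in degrees

def grid_cell(lat, lon):
    """Snap a coordinate to the lower-left corner of its 0.5-degree cell
    and return (grid_id, WKT polygon) in the formats described above."""
    lat0 = math.floor(lat / CELL) * CELL
    lon0 = math.floor(lon / CELL) * CELL
    grid_id = f"{lat0:.1f}_{lon0:.1f}"
    corners = [
        (lon0, lat0),
        (lon0 + CELL, lat0),
        (lon0 + CELL, lat0 + CELL),
        (lon0, lat0 + CELL),
        (lon0, lat0),  # close the ring
    ]
    wkt = "POLYGON((" + ", ".join(f"{x:.1f} {y:.1f}" for x, y in corners) + "))"
    return grid_id, wkt

def density(anomalies, flights):
    """gaps_density / jumps_density: anomaly count per flight."""
    return anomalies / flights if flights else 0.0

# A point inside the example cell from the field descriptions:
gid, wkt = grid_cell(-9.7, -35.8)
# gid -> "-10.0_-36.0"
# wkt -> "POLYGON((-36.0 -10.0, -35.5 -10.0, -35.5 -9.5, -36.0 -9.5, -36.0 -10.0))"
```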

    Data Fields: Monthly_GNSS_Anomalies_and_ACLED-2023-V9.csv

    The file contains monthly aggregated GNSS anomaly and ACLED event data per grid cell. The structure and meaning of each field are detailed below:

    1. grid_id
      • Description: Unique identifier for a grid cell on Earth measuring 0.5° latitude by 0.5° longitude.
      • Format: String combining latitude and longitude (e.g., -0.5_-79.0).
    2. year_month
      • Description: Month and year of the aggregated data.
      • Format: String in Mon-YY format (e.g., Jan-23).
    3. geometry
      • Description: Polygon coordinates of the grid cell in Well-Known Text (WKT) format.
      • Format: POLYGON((longitude latitude, ...))
        (e.g., POLYGON((-79.0 -0.5, -78.5 -0.5, -78.5 0.0, -79.0 0.0, -79.0 -0.5))).
    4. flights
      • Description: Total number of aircraft flights that passed through the grid cell during the month.
      • Format: Integer (e.g., 1230).
    5. GPS_jumps
      • Description: Total number of GNSS "jump" anomalies (possible spoofing events) in the grid cell during the month.
      • Format: Integer (e.g., 13).
    6. GPS_gaps
      • Description: Total number of GNSS "gap" anomalies, indicating interruptions in aircraft routes, during the month.
      • Format: Integer (e.g., 0).
    7. event_id_cnty
      • Description: Semicolon-separated list of ACLED event IDs associated with the grid cell during the month.
      • Format: String (e.g., ECU3151;ECU3158;ECU3150).
    8. disorder_type
      • Description: Semicolon-separated list of disorder types (e.g., "Political violence", "Demonstrations") reported by ACLED in that grid cell during the month.
      • Format: String.
    9. event_type
      • Description: Semicolon-separated list of high-level ACLED event types (e.g., "Riots", "Protests").
      • Format: String.
    10. sub_event_type
      • Description: Semicolon-separated list of detailed subtypes of ACLED events (e.g., "Mob violence", "Armed clash").
      • Format: String.
    11. acled_count
      • Description: Total number of ACLED conflict events in the grid cell during the month.
      • Format: Integer (e.g., 2).
    12. acled_flag
      • Description: Conflict presence indicator: 1 if any ACLED event occurred in the grid cell during the month, otherwise 0.
      • Format: Integer (0 or 1).
    13. gaps_density
      • Description: Monthly density of GNSS gaps, calculated as GPS_gaps / flights.
      • Format: Decimal (e.g., 0.0).
    14. jumps_density
      • Description: Monthly density of GNSS jumps, calculated as GPS_jumps / flights.
      • Format: Decimal (e.g., 0.0106).
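
    The monthly file can in principle be reproduced from the daily records by grouping on grid cell and month. A hedged stdlib sketch (the actual aggregation pipeline is not published here, and the helper name is mine):

```python
from collections import defaultdict
from datetime import datetime

def aggregate_monthly(daily_rows):
    """Roll daily records up to (grid_id, year_month): sum the counts,
    concatenate ACLED event IDs with semicolons, recompute densities."""
    groups = defaultdict(lambda: {"flights": 0, "GPS_jumps": 0,
                                  "GPS_gaps": 0, "event_ids": []})
    for row in daily_rows:
        month = datetime.strptime(row["day"], "%Y-%m-%d").strftime("%b-%y")
        g = groups[(row["grid_id"], month)]
        g["flights"] += row["flights"]
        g["GPS_jumps"] += row["GPS_jumps"]
        g["GPS_gaps"] += row["GPS_gaps"]
        if row.get("event_id_cnty"):
            g["event_ids"].append(row["event_id_cnty"])
    out = []
    for (grid_id, month), g in groups.items():
        flights = g["flights"]
        out.append({
            "grid_id": grid_id,
            "year_month": month,  # Mon-YY, e.g. Jan-23
            "flights": flights,
            "GPS_jumps": g["GPS_jumps"],
            "GPS_gaps": g["GPS_gaps"],
            "event_id_cnty": ";".join(g["event_ids"]),
            "acled_count": len(g["event_ids"]),
            "acled_flag": int(bool(g["event_ids"])),
            "jumps_density": g["GPS_jumps"] / flights if flights else 0.0,
            "gaps_density": g["GPS_gaps"] / flights if flights else 0.0,
        })
    return out
```

    Note that acled_count is approximated here as the number of collected event IDs; the published file may count events differently.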

    Data Sources

    • GNSS Anomalies Data:
      • Calculated from ADS-B (Automatic Dependent Surveillance-Broadcast) messages obtained via the OpenSky Network's Trino database.
      • GNSS anomalies include "jumps" (potential spoofing incidents) and "gaps" (interruptions in aircraft route data).

    • Political Violence Events Data:
      • Sourced from the ACLED database, which provides detailed information on political violence and protest events worldwide.

    Temporal and Spatial Coverage

    • Temporal Coverage:
      • From January 1, 2023, to December 31, 2023.
      • Daily records provide temporal granularity for time-series analysis.
    • Spatial Coverage:
      • Global coverage with grid cells measuring 0.5 degrees latitude by 0.5 degrees longitude.
      • Each grid cell represents an area on Earth's surface, facilitating spatial



Smartwatch Purchase Data

Smartwatch sales prediction: An artificial dataset of 100,000 customer profiles

Description

Disclaimer: This is artificially generated data, produced by a Python script based on the arbitrary assumptions listed below.

The data consists of 100,000 examples of training data and 10,000 examples of test data, each representing a user who may or may not buy a smart watch.

----- Version 1 -------

trainingDataV1.csv, testDataV1.csv (or trainingData.csv, testData.csv)

The data includes the following features for each user:

1. age: The age of the user (integer, 18-70)
2. income: The income of the user (integer, 25,000-200,000)
3. gender: The gender of the user (string, "male" or "female")
4. maritalStatus: The marital status of the user (string, "single", "married", or "divorced")
5. hour: The hour of the day (integer, 0-23)
6. weekend: A boolean indicating whether it is the weekend (True or False)

The data also includes a label for each user, buySmartWatch, indicating whether they are likely to buy a smart watch (string, "yes" or "no"). The label is determined by the following arbitrary conditions:

- If the user is divorced and a random number generated by the script is less than 0.4, the label is "no" (i.e., assuming 40% of divorcees are not likely to buy a smart watch).
- If it is the weekend and a random number generated by the script is less than 1.3, the label is "yes" (intended to make sales 30% more likely on weekends; note that random.random() always returns values below 1.3, so every weekend user reaching this rule is labeled "yes").
- If the user is male and under 30 with an income over 75,000, the label is "yes".
- If the user is female and 30 or over with an income over 100,000, the label is "yes".
- Otherwise, the label is "no".

The training data is intended to be used to build and train a classification model, and the test data is intended to be used to evaluate the performance of the trained model.

The following Python script was used to generate this dataset:

import random
import csv

# Set the number of examples to generate
numExamples = 100000

# Generate the training data
with open("trainingData.csv", "w", newline="") as csvfile:
  fieldnames = ["age", "income", "gender", "maritalStatus", "hour", "weekend", "buySmartWatch"]
  writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

  writer.writeheader()

  for i in range(numExamples):
    age = random.randint(18, 70)
    income = random.randint(25000, 200000)
    gender = random.choice(["male", "female"])
    maritalStatus = random.choice(["single", "married", "divorced"])
    hour = random.randint(0, 23)
    weekend = random.choice([True, False])

    # Randomly assign the label based on some arbitrary conditions
    # assuming 40% of divorcees won't buy a smart watch
    if maritalStatus == "divorced" and random.random() < 0.4:
      buySmartWatch = "no"
    # assuming sales are 30% more likely to occur on weekends
    # (note: random.random() < 1.3 is always True, so every weekend user
    # reaching this branch is labeled "yes")
    elif weekend and random.random() < 1.3:
      buySmartWatch = "yes"
    elif gender == "male" and age < 30 and income > 75000:
      buySmartWatch = "yes"
    elif gender == "female" and age >= 30 and income > 100000:
      buySmartWatch = "yes"
    else:
      buySmartWatch = "no"

    writer.writerow({
      "age": age,
      "income": income,
      "gender": gender,
      "maritalStatus": maritalStatus,
      "hour": hour,
      "weekend": weekend,
      "buySmartWatch": buySmartWatch
    })

----- Version 2 -------

trainingDataV2.csv, testDataV2.csv

The data includes the following features for each user:

1. age: The age of the user (integer, 18-70)
2. income: The income of the user (integer, 25,000-200,000)
3. gender: The gender of the user (string, "male" or "female")
4. maritalStatus: The marital status of the user (string, "single", "married", or "divorced")
5. educationLevel: The education level of the user (string, "high school", "associate's degree", "bachelor's degree", "master's degree", or "doctorate")
6. occupation: The occupation of the user (string, "tech worker", "manager", "executive", "sales", "customer service", "creative", "manual labor", "healthcare", "education", "government", "unemployed", or "student")
7. familySize: The number of people in the user's family (integer, 1-5)
8. fitnessInterest: A boolean indicating whether the user is interested in fitness (True or False)
9. priorSmartwatchOwnership: A boolean indicating whether the user has owned a smartwatch in the past (True or False)
10. hour: The hour of the day when the user was surveyed (integer, 0-23)
11. weekend: A boolean indicating whether the user was surveyed on a weekend (True or False)
12. buySmartWatch: A boolean indicating whether the user purchased a smartwatch (True or False)

The following Python script was used to generate the data:

import random
import csv

# Set the number of examples to generate
numExamples = 100000

with open("t...