Disclaimer: This is artificially generated data, produced by a Python script based on the arbitrary assumptions listed below.
The data consists of 100,000 examples of training data and 10,000 examples of test data, each representing a user who may or may not buy a smart watch.
----- Version 1 -------
trainingDataV1.csv, testDataV1.csv (or trainingData.csv, testData.csv). The data includes the following features for each user:
1. age: The age of the user (integer, 18-70)
2. income: The income of the user (integer, 25,000-200,000)
3. gender: The gender of the user (string, "male" or "female")
4. maritalStatus: The marital status of the user (string, "single", "married", or "divorced")
5. hour: The hour of the day (integer, 0-23)
6. weekend: A boolean indicating whether it is the weekend (True or False)
The data also includes a label for each user indicating whether they are likely to buy a smart watch or not (string, "yes" or "no"). The label is determined by the following arbitrary conditions, evaluated in order:
- If the user is divorced and a random number generated by the script is less than 0.4, the label is "no" (i.e., assuming 40% of divorcees are not likely to buy a smart watch).
- If it is the weekend and a random number generated by the script is less than 1.3, the label is "yes" (intended to make sales 30% more likely on weekends; note that random.random() always returns a value below 1.3, so in practice every weekend example reaching this rule is labelled "yes").
- If the user is male and under 30 with an income over 75,000, the label is "yes".
- If the user is female and 30 or over with an income over 100,000, the label is "yes".
- Otherwise, the label is "no".
The training data is intended to be used to build and train a classification model, and the test data is intended to be used to evaluate the performance of the trained model.
The following Python script was used to generate this dataset:
import random
import csv

# Set the number of examples to generate
numExamples = 100000

# Generate the training data
with open("trainingData.csv", "w", newline="") as csvfile:
    fieldnames = ["age", "income", "gender", "maritalStatus", "hour", "weekend", "buySmartWatch"]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for i in range(numExamples):
        age = random.randint(18, 70)
        income = random.randint(25000, 200000)
        gender = random.choice(["male", "female"])
        maritalStatus = random.choice(["single", "married", "divorced"])
        hour = random.randint(0, 23)
        weekend = random.choice([True, False])
        # Randomly assign the label based on some arbitrary conditions
        # assuming 40% of divorcees won't buy a smart watch
        if maritalStatus == "divorced" and random.random() < 0.4:
            buySmartWatch = "no"
        # intended to make weekend sales 30% more likely; note that
        # random.random() < 1.3 is always True, so every weekend example
        # reaching this branch is labelled "yes"
        elif weekend and random.random() < 1.3:
            buySmartWatch = "yes"
        elif gender == "male" and age < 30 and income > 75000:
            buySmartWatch = "yes"
        elif gender == "female" and age >= 30 and income > 100000:
            buySmartWatch = "yes"
        else:
            buySmartWatch = "no"
        writer.writerow({
            "age": age,
            "income": income,
            "gender": gender,
            "maritalStatus": maritalStatus,
            "hour": hour,
            "weekend": weekend,
            "buySmartWatch": buySmartWatch,
        })
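As a minimal sketch of the intended use (building a classifier on the training file and evaluating it on the test file), the following assumes pandas and scikit-learn are available; the model choice is illustrative, not part of the dataset:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load the generated CSVs (file names as produced by the script above)
train = pd.read_csv("trainingData.csv")
test = pd.read_csv("testData.csv")

# One-hot encode the categorical columns; align test columns to the training columns
X_train = pd.get_dummies(train.drop(columns=["buySmartWatch"]))
X_test = pd.get_dummies(test.drop(columns=["buySmartWatch"])).reindex(columns=X_train.columns, fill_value=0)
y_train = train["buySmartWatch"]
y_test = test["buySmartWatch"]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))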
----- Version 2 -------
trainingDataV2.csv, testDataV2.csv. The data includes the following features for each user:
1. age: The age of the user (integer, 18-70)
2. income: The income of the user (integer, 25,000-200,000)
3. gender: The gender of the user (string, "male" or "female")
4. maritalStatus: The marital status of the user (string, "single", "married", or "divorced")
5. educationLevel: The education level of the user (string, "high school", "associate's degree", "bachelor's degree", "master's degree", or "doctorate")
6. occupation: The occupation of the user (string, "tech worker", "manager", "executive", "sales", "customer service", "creative", "manual labor", "healthcare", "education", "government", "unemployed", or "student")
7. familySize: The number of people in the user's family (integer, 1-5)
8. fitnessInterest: A boolean indicating whether the user is interested in fitness (True or False)
9. priorSmartwatchOwnership: A boolean indicating whether the user has owned a smartwatch in the past (True or False)
10. hour: The hour of the day when the user was surveyed (integer, 0-23)
11. weekend: A boolean indicating whether the user was surveyed on a weekend (True or False)
12. buySmartWatch: A boolean indicating whether the user purchased a smartwatch (True or False)
Python script used to generate the data:
import random
import csv

# Set the number of examples to generate
numExamples = 100000

with open("t...
The goal of introducing the Rescaled Fashion-MNIST dataset is to provide a dataset that contains scale variations (up to a factor of 4), to evaluate the ability of networks to generalise to scales not present in the training data.
The Rescaled Fashion-MNIST dataset was introduced in the paper:
[1] A. Perzanowski and T. Lindeberg (2025) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations", Journal of Mathematical Imaging and Vision, 67(29), https://doi.org/10.1007/s10851-025-01245-x.
with a pre-print available at arXiv:
[2] A. Perzanowski and T. Lindeberg (2024) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations", arXiv preprint arXiv:2409.11140.
Importantly, the Rescaled Fashion-MNIST dataset is more challenging than the MNIST Large Scale dataset, introduced in:
[3] Y. Jansson and T. Lindeberg (2022) "Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales", Journal of Mathematical Imaging and Vision, 64(5): 506-536, https://doi.org/10.1007/s10851-022-01082-2.
The Rescaled Fashion-MNIST dataset is provided on the condition that you provide proper citation for the original Fashion-MNIST dataset:
[4] Xiao, H., Rasul, K., and Vollgraf, R. (2017) "Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms", arXiv preprint arXiv:1708.07747.
and also for this new rescaled version, using the reference [1] above.
The data set is made available on request. If you are interested in trying out this data set, please make a request in the system below, and we will grant you access as soon as possible.
The Rescaled Fashion-MNIST dataset is generated by rescaling 28×28 gray-scale images of clothes from the original Fashion-MNIST dataset [4]. The scale variations are up to a factor of 4, and the images are embedded within black images of size 72×72, with the object always centred in the frame. The imresize() function in Matlab was used for the rescaling, with default anti-aliasing turned on, and with bicubic interpolation overshoot removed by clipping to the [0, 255] range. The details of how the dataset was created can be found in [1].
There are 10 different classes in the dataset: “T-shirt/top”, “trouser”, “pullover”, “dress”, “coat”, “sandal”, “shirt”, “sneaker”, “bag” and “ankle boot”. In the dataset, these are represented by integer labels in the range [0, 9].
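Assuming the listed order corresponds to labels 0 through 9 (as in the original Fashion-MNIST convention [4]), the mapping can be written in Python as:

class_names = {
    0: "T-shirt/top", 1: "trouser", 2: "pullover", 3: "dress", 4: "coat",
    5: "sandal", 6: "shirt", 7: "sneaker", 8: "bag", 9: "ankle boot",
}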
The dataset is split into 50 000 training samples, 10 000 validation samples and 10 000 testing samples. The training dataset is generated using the initial 50 000 samples from the original Fashion-MNIST training set. The validation dataset, on the other hand, is formed from the final 10 000 images of that same training set. For testing, all test datasets are built from the 10 000 images contained in the original Fashion-MNIST test set.
The training dataset file (~2.9 GB) for scale 1, which also contains the corresponding validation and test data for the same scale, is:
fashionmnist_with_scale_variations_tr50000_vl10000_te10000_outsize72-72_scte1p000_scte1p000.h5
Additionally, for the Rescaled Fashion-MNIST dataset, there are 9 datasets (~415 MB each) for testing scale generalisation at scales not present in the training set. Each of these datasets is rescaled using a different image scaling factor 2^(k/4), with k an integer in the range [-4, 4]:
fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p500.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p595.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p707.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p841.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p000.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p189.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p414.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p682.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte2p000.h5
These dataset files were used for the experiments presented in Figures 6, 7, 14, 16, 19 and 23 in [1].
The datasets are saved in HDF5 format, with the partitions in the respective h5 files named as
('/x_train', '/x_val', '/x_test', '/y_train', '/y_test', '/y_val'); which ones exist depends on which data split is used.
The training dataset can be loaded in Python as (using the training file name given above):

import h5py
import numpy as np

with h5py.File("fashionmnist_with_scale_variations_tr50000_vl10000_te10000_outsize72-72_scte1p000_scte1p000.h5", "r") as f:
    x_train = np.array(f["/x_train"], dtype=np.float32)
    x_val = np.array(f["/x_val"], dtype=np.float32)
    x_test = np.array(f["/x_test"], dtype=np.float32)
    y_train = np.array(f["/y_train"], dtype=np.int32)
    y_val = np.array(f["/y_val"], dtype=np.int32)
    y_test = np.array(f["/y_test"], dtype=np.int32)
We also need to permute the data, since PyTorch uses the format [num_samples, channels, width, height], while the data is saved as [num_samples, width, height, channels]:
x_train = np.transpose(x_train, (0, 3, 1, 2))
x_val = np.transpose(x_val, (0, 3, 1, 2))
x_test = np.transpose(x_test, (0, 3, 1, 2))
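As a usage sketch, the permuted arrays can then be wrapped into a PyTorch DataLoader; this assumes PyTorch is installed, and the division by 255 is our own normalisation choice, since the stored intensities are raw values in [0, 255]:

import torch
from torch.utils.data import TensorDataset, DataLoader

# Scale the raw [0, 255] intensities to [0, 1] (our choice, not prescribed by the dataset)
train_dataset = TensorDataset(torch.from_numpy(x_train / 255.0), torch.from_numpy(y_train).long())
train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)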
The test datasets can be loaded in Python as (using the scale 0.5 test file as an example):

import h5py
import numpy as np

with h5py.File("fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p500.h5", "r") as f:
    x_test = np.array(f["/x_test"], dtype=np.float32)
    y_test = np.array(f["/y_test"], dtype=np.int32)
The test datasets can be loaded in Matlab as (again using the scale 0.5 test file as an example):

x_test = h5read('fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p500.h5', '/x_test');
y_test = h5read('fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p500.h5', '/y_test');
The images are stored as [num_samples, x_dim, y_dim, channels] in HDF5 files. The pixel intensity values are not normalised, and are in a [0, 255] range.
There is also a closely related Fashion-MNIST with translations dataset, which in addition to scaling variations also comprises spatial translations of the objects.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
About
The Synthetic Sweden Mobility (SySMo) model provides a simplified yet statistically realistic microscopic representation of the real population of Sweden. The agents in this synthetic population contain socioeconomic attributes, household characteristics, and corresponding activity plans for an average weekday. This agent-based modelling approach derives the transportation demand from the agents’ planned activities using various transport modes (e.g., car, public transport, bike, and walking).
This open data repository contains four datasets:
(1) Synthetic Agents,
(2) Activity Plans of the Agents,
(3) Travel Trajectories of the Agents, and
(4) Road Network (EPSG: 3006)
(OpenStreetMap data were retrieved on August 28, 2023, from https://download.geofabrik.de/europe.html, and GTFS data were retrieved on September 6, 2023 from https://samtrafiken.se/)
The database can serve as input to assess the potential impacts of new transportation technologies, infrastructure changes, and policy interventions on the mobility patterns of the Swedish population.
Methodology
This dataset contains 10.2 million statistically simulated agents representing the population of Sweden, their socio-economic characteristics, and their activity plans for an average weekday. To prepare the data for the MATSim simulation, we randomly divided all agents into 10 batches. The agents in each batch are then simulated in MATSim using the multi-modal network combining road networks and public transit data in Sweden, built with the package pt2matsim (https://github.com/matsim-org/pt2matsim).
The agents' daily activity plans along with the road network serve as the primary inputs in the MATSim environment which ensures iterative replanning while aiming for a convergence on optimal activity plans for all the agents. Subsequently, the individual mobility trajectories of the agents from the MATSim simulation are retrieved.
The activity plans of the individual agents extracted from the MATSim simulation output data are then further processed. All agents with negative utility score and negative activity time corresponding to at least one activity are filtered out as the ‘infeasible’ agents. The dataset ‘Synthetic Agents’ contains all synthetic agents regardless of their ‘feasibility’ (0=excluded & 1=included in plans and trajectories). In the other datasets, only agents with feasible activity plans are included.
The simulation setup adheres to the MATSim 13.0 benchmark scenario, with slight adjustments. The replanning strategy integrates BestScore (60%), TimeAllocationMutator (30%), and ReRoute (10%), where the percentages denote the proportion of agents using each strategy. In each iteration of the simulation, the agents adopt these strategies to adjust their activity plans. The "BestScore" strategy retains the plan with the highest score from the previous iteration, selecting the most successful plan an agent has employed up to that point. The "TimeAllocationMutator" modifies the end times of activities by introducing random shifts within a specified range, allowing the exploration of different schedules. The "ReRoute" strategy enables agents to alter their current routes, potentially optimizing travel based on updated information or preferences. These strategies are detailed further in Axhausen et al. (2016), which provides comprehensive insight into their implementation and impact in transport simulation modelling.
Data Description
(1) Synthetic Agents
This dataset contains all agents in Sweden and their socioeconomic characteristics.
The attribute ‘feasibility’ has two categories: feasible agents (73%), and infeasible agents (27%). Infeasible agents are agents with negative utility score and negative activity time corresponding to at least one activity.
File name: 1_syn_pop_all.parquet
Columns (name: description; data type and unit where given):
- PId: Agent ID (Integer)
- Deso: Zone code of Demographic statistical areas (DeSO)¹
- kommun: Municipality code
- marital: Marital status (single/ couple/ child)
- sex: Gender (0 = Male, 1 = Female)
- age: Age
- HId: A unique identifier for households
- HHtype: Type of household (single/ couple/ other)
- HHsize: Number of people living in the household
- num_babies: Number of children less than six years old in the household
- employment: Employment status (0 = Not Employed, 1 = Employed)
- studenthood: Studenthood status (0 = Not Student, 1 = Student)
- income_class: Income class (0 = No Income, 1 = Low Income, 2 = Lower-middle Income, 3 = Upper-middle Income, 4 = High Income)
- num_cars: Number of cars owned by an individual
- HHcars: Number of cars in the household
- feasibility: Status of the individual (1 = feasible, 0 = infeasible)

¹ https://www.scb.se/vara-tjanster/oppna-data/oppna-geodata/deso--demografiska-statistikomraden/
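A minimal loading sketch, assuming pandas with a parquet engine such as pyarrow (file name as above):

import pandas as pd

agents = pd.read_parquet("1_syn_pop_all.parquet")
feasible = agents[agents["feasibility"] == 1]  # ~73% of agents; the subset used in the plan and trajectory files
print(len(agents), len(feasible))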
(2) Activity Plans of the Agents
The dataset contains the car agents’ (agents that use cars on the simulated day) activity plans for a simulated average weekday.
File name: 2_plans_i.parquet, i = 0, 1, 2, ..., 8, 9. (10 files in total)
Columns (name: description; data type; unit):
- act_purpose: Activity purpose (work/ home/ school/ other); String; -
- PId: Agent ID; Integer; -
- act_end: End time of activity (0:00:00 – 23:59:59); String; hour:minute:second
- act_id: Activity index of each agent; Integer; -
- mode: Transport mode to reach the activity location; String; -
- POINT_X: Coordinate X of activity location (SWEREF99TM); Float; metre
- POINT_Y: Coordinate Y of activity location (SWEREF99TM); Float; metre
- dep_time: Departure time (0:00:00 – 23:59:59); String; hour:minute:second
- score: Utility score of the simulation day as obtained from MATSim; Float; -
- trav_time: Travel time to reach the activity location; String; hour:minute:second
- trav_time_min: Travel time in decimal minutes; Float; minute
- act_time: Activity duration in decimal minutes; Float; minute
- distance: Travel distance between the origin and the destination; Float; km
- speed: Travel speed to reach the activity location; Float; km/h
(3) Travel Trajectories of the Agents
This dataset contains the driving trajectories of all the agents on the road network, as well as the public transit vehicles used by these agents (buses, ferries, trams, etc.). The files are produced by the MATSim simulations and organised into 10 *.parquet files (representing different batches of the simulation), corresponding to each plan file.
File name: 3_events_i.parquet, i = 0, 1, 2, ..., 8, 9. (10 files in total)
Columns (name: description; data type; unit):
- time: Time in seconds within a simulation day (0-86399); Integer; second
- type: Event type defined by the MATSim simulation*; String; -
- person: Agent ID; Integer; -
- link: Nearest road link consistent with the road network; String; -
- vehicle: Vehicle ID, identical to person; Integer; -
- from_node: Start node of the link; Integer; -
- to_node: End node of the link; Integer; -
*One typical episode of MATSim simulation events: Activity ends (actend) -> Agent’s vehicle enters traffic (vehicle enters traffic) -> Agent’s vehicle moves from the previous road segment to its next connected one (left link) -> Agent’s vehicle leaves traffic for an activity (vehicle leaves traffic) -> Activity starts (actstart)
(4) Road Network
This dataset contains the road network.
File name: 4_network.shp
Columns (name: description; data type; unit):
- length: The length of the road link; Float; metre
- freespeed: Free speed; Float; km/h
- capacity: Number of vehicles; Integer; -
- permlanes: Number of lanes; Integer; -
- oneway: Whether the segment is one-way (0 = no, 1 = yes); Integer; -
- modes: Transport mode; String; -
- from_node: Start node of the link; Integer; -
- to_node: End node of the link; Integer; -
- geometry: LINESTRING (SWEREF99TM); geometry; metre
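As a minimal sketch for loading the four datasets together, assuming pandas and geopandas are available (batch index 0 is chosen for illustration):

import pandas as pd
import geopandas as gpd

agents = pd.read_parquet("1_syn_pop_all.parquet")
plans = pd.read_parquet("2_plans_0.parquet")    # one of the 10 plan batch files
events = pd.read_parquet("3_events_0.parquet")  # event file for the same batch
network = gpd.read_file("4_network.shp")        # road network, EPSG:3006

# Example: attach socioeconomic attributes to a batch of activity plans
plans_with_attrs = plans.merge(agents, on="PId", how="left")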
Additional Notes
This research is funded by the RISE Research Institutes of Sweden, the Swedish Research Council for Sustainable Development (Formas, project number 2018-01768), and Transport Area of Advance, Chalmers.
Contributions
YL designed the simulation, analyzed the simulation data, and, along with CT, executed the simulation. CT, SD, FS, and SY conceptualized the model (SySMo), with CT and SD further developing the model to produce agents and their activity plans. KG wrote the data document. All authors reviewed, edited, and approved the final document.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Documentation for the Dataset used in the publication entitled "Application of machine learning to assess the influence of microstructure on twin nucleation in Mg alloys"
These datasets comprise the 2D EBSD data acquired in the Mg-1Al (at.%) alloy and the AZ31 Mg alloy, analyzed with MTEX 7.0 software.
More details about the experimental techniques can be found in the publication: Biaobiao Yang, Valentin Vassilev-Galindo, Javier Llorca, "Application of machine learning to assess the influence of microstructure on twin nucleation in Mg alloys", npj Computational Materials, 2024.
1. AZ31_ML.xlsx
- Description: Both twin and grain data were acquired by EBSD from the AZ31 Mg sample, before and after deformation, over the same area
- Number of grains: 2640 (rows == grains) corresponding to three samples deformed in different orientations: S0, S45, and S90
- Number of analyzed variables (features): 31 (columns == grain characteristics)
- Variable description by columns:
1- (Twinned) - type: boolean
Description: Indicates if the grain twinned or not after deformation
0: non-twinned grain
1: twinned grain
2- (Orientation) - type: numerical (integer)
Description: The loading (tensile) direction with respect to the c axis of the lattice
3- (Strain_level) - type: numerical (float)
Description: The maximum strain level after deformation
4- (Grain_size) - type: numerical (float)
Description: The equivalent circle diameter (in micrometers) of the grain before deformation.
5- (Triple_points) - type: numerical (integer)
Description: The number of triple points of the grain before deformation
6- (Near_edge) - type: boolean
Description: Indicates if the grain is located near the edge of the 2D EBSD or not. This feature was used to filter out from the final dataset the grains near the edge of the sample. Hence, only those entries with Near_edge value of 0 were used to train and test the machine learning models.
0: not near the EBSD edge
1: near the EBSD edge
7-12- (T_SF*) - type: numerical (float)
Description: The twinning Schmid factor based on the loading condition, the orientation of the parent grain, and the twin variant information.
T_SF1: The highest Schmid factor of extension twinning
T_SF2: The 2nd highest Schmid factor of extension twinning
T_SF3: The 3rd highest Schmid factor of extension twinning
T_SF4: The 4th highest Schmid factor of extension twinning
T_SF5: The 5th highest Schmid factor of extension twinning
T_SF6: The lowest Schmid factor of extension twinning
13-15- (S_SF*) - type: numerical (float)
Description: The Schmid factor for basal slip based on the loading condition, orientation of parent grain, and slip system information. Only the basal slip system is considered because it is the dominant deformation slip system in Mg during deformation.
S_SF1: The highest Schmid factor of basal slip
S_SF2: The second highest or the middle Schmid factor of basal slip
S_SF3: The lowest Schmid factor of basal slip
16- (Neighbor_grain_n) - type: numerical (integer)
Description: The number of neighbors of the grain before deformation.
17-19- (B-b_m) - type: numerical (float)
Description: The Luster-Morris geometric compatibility factor (m') between the basal slip systems of the grain and its neighbors. Although there are 3 possible basal slip systems, only the one with the highest Schmid factor was considered to compute m′. Only maximum, minimum, and mean values were included in the dataset.
(Max_B-b_m): The highest basal - basal m' between the grain and its neighbors
(Min_B-b_m): The lowest basal - basal m' between the grain and its neighbors
(Mean_B-b_m): The average basal - basal m' between the grain and its neighbors
20-22- (B-t_m) - type: numerical (float)
Description: The Luster-Morris geometric compatibility factor (m') between the 6 extension twin variants of the grain and the basal slip systems of its neighbors. Although there are 3 possible basal slip systems, only the one with the highest Schmid factor was considered to compute m'. However, all 6 twinning variants have been considered, given that slip induced twinning is a localized process. Only maximum, minimum, and mean values were included in the dataset.
(Max_B-t_m): The highest basal - twin m' between the grain and its neighbors
(Min_B-t_m): The lowest basal - twin m' between the grain and its neighbors
(Mean_B-t_m): The average basal - twin m' between the grain and its neighbors
23-25- (GB_misang) - type: numerical (float)
Description: The misorientation angle (in °) between the grain and its neighbors. More precisely, the disorientation angle is used for the misorientation angle. Only maximum, minimum, and mean values were included in the dataset.
(Max_GBmisang): The highest GB misorientation angle between the grain and its neighbors
(Min_GBmisang): The lowest GB misorientation angle between the grain and its neighbors
(Mean_GBmisang): The average GB misorientation angle between the grain and its neighbors
26-28- (delta_Gs) - type: numerical (float)
Description: Grain size difference (in micrometers) between a given grain and its neighbors. The grain size is calculated as the diameter of a circle with the same area as the grain. Only maximum, minimum, and mean values were included in the dataset.
(Max_deltaGs): The highest grain size difference between the grain and its neighbors
(Min_deltaGs): The smallest grain size difference between the grain and its neighbors
(Mean_deltaGs): The average grain size difference between the grain and its neighbors
29-31- (delta_BSF) - type: numerical (float)
Description: The difference in the basal slip Schmid factor between a given grain and its neighbors. Only the highest basal slip Schmid factor is considered. Only maximum, minimum, and mean values were included in the dataset.
(Max_deltaBSF): The highest basal SF difference between the grain and its neighbors
(Min_deltaBSF): The smallest basal SF difference between the grain and its neighbors
(Mean_deltaBSF): The average basal SF difference between the grain and its neighbors
2. Mg1Al_ML.xlsx
- Description: Both twin and grain data were acquired by EBSD from the Mg-1Al (at.%) sample, before and after deformation, over the same area
- Number of grains: 1496 (rows == grains) corresponding to two true strain levels: ~6% and ~10%.
- Number of analyzed variables (features): 31 (columns == grain characteristics)
- Variable descriptions by columns are the same as those of AZ31_ML.xlsx
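A minimal preparation sketch, assuming pandas (with openpyxl for .xlsx files) and scikit-learn; the column names follow the list above, and the Near_edge filtering follows the description in column 6:

import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_excel("AZ31_ML.xlsx")

# Keep only grains away from the EBSD edge, as described for the Near_edge column
data = data[data["Near_edge"] == 0]

X = data.drop(columns=["Twinned", "Near_edge"])  # the remaining grain characteristics
y = data["Twinned"]                              # boolean twinning label
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)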
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
A detailed description of this dataset can be found in https://doi.org/10.1016/j.dib.2023.108903.
This dataset contains a collection of digitized three-dimensional hardened cement paste microstructures obtained from X-ray micro-computed tomography, screened after approx. 1, 2, 3, 4, 7, 14, and 28 days of elapsed hydration at 20 °C in saturated conditions. Each paste specimen had a cylindrical shape (with a diameter of ~1 mm) and was screened at a designated time (as specified in the file name, e.g. “t23hrs” = 23 hours of elapsed hydration), then saved as an uncompressed and unprocessed *.tif greyscale image data file in 16-bit image depth (as unsigned integers) using a little-endian byte sequence.
The dataset contains two sets of images:
- “full-sized” digital images stored in a three-dimensional voxel-based matrix with a fixed size of 1100×1100×1100 voxels, denoted as “CEM_I_Ladce_*” in the file name; each file amounts to ~2.5 GB and contains the whole screened specimen, with a variable voxel size in the range 1.0913–1.1174 µm depending on the particular specimen (as specified in the file name, e.g. “1d1174um” = 1.1174 µm/voxel)
- smaller image subvolumes, denoted as Region Of Interest (ROI) and marked “filteredROI_*” in the file name, extracted from an arbitrary location in the interior of the full-sized specimen; each cropped ROI has a cubic shape and stores a three-dimensional voxel-based matrix covering a fixed volume of 500×500×500 µm³, constituted by a variable voxel count (given the fluctuating voxel size for each specimen, see above). Both the exact voxel count (i.e., the three-dimensional matrix dimensions) and the voxel size are specified in each file name. A sequence of imaging filters was applied to each ROI to further enhance the contrast among the different microstructural phases; see 10.1016/j.cemconcomp.2022.104798 for details.
Note that the same dataset stored in *.raw format is available from https://doi.org/10.5281/zenodo.7193819
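A minimal loading sketch in Python, assuming the tifffile package is installed; the file name below is hypothetical, composed from the naming scheme described above:

import tifffile

# 16-bit unsigned, little-endian greyscale volume; the file name is hypothetical
volume = tifffile.imread("CEM_I_Ladce_t23hrs_1d1174um.tif")
print(volume.shape, volume.dtype)  # expected: (1100, 1100, 1100) uint16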
This data set contains tree ring data from three sites located about 25 km from the meteorological station at Mongu, Zambia. Data from about 50 individual trees are reported. In addition, chronologies (or site mean curves), which better represent common influences (in this study, the climatic signal), were developed for each site based on the individual data (Trouet, 2004; Trouet et al., 2001). The series cover a maximum of 46 years, although most series do not extend beyond 30 years. The data were collected during the SAFARI 2000 Dry Season Field Campaign of August 2000. Ten to 23 samples were taken at each site. Brachystegia bakeriana was sampled at site 1, and Brachystegia spiciformis at sites 2 and 3. The vegetation at all sites underwent primitive harvesting for subsistence earlier the same year, so samples could be taken from freshly cut trees and no living trees were cut. At all sites, samples consisted of full stem discs. Where possible, samples were taken at breast height (1.3 m) or slightly lower. Growth ring widths were measured to the nearest 0.01 mm using LINTAB equipment and TSAP software (Rinn and Jakel, 1997). Four radii per sample disc were measured. Cross-dating and response function analyses were performed by routine dendrochronological techniques. There are two files for each site, one containing integer values representing tree ring widths (raw data), and the other containing standardized values (chronologies), for each year. The data are stored as ASCII table files in comma-separated-value (.csv) format, with column headers.
License: Attribution-NonCommercial 3.0 (CC BY-NC 3.0), https://creativecommons.org/licenses/by-nc/3.0/
This site provides the data and spreadsheet implementations for linking multi-level contribution margin accounting as a subsystem in cost accounting with several versions of a multi-level fixed-charge problem (MLFCP), the latter based on the optimization approach in operations research. For the data, plausible fictitious values have been assumed taking into consideration the calculation principles in cost accounting where applicable. They include resource-related data, market-related data, and data from cost accounting. While the deterministic version of the data does not consider uncertainty, the stochastic/robust versions assume probability distributions and rank correlations for part of the data.
Spreadsheets
The data and the above-mentioned linkage are implemented in three spreadsheet files, including versions for deterministic optimization, stochastic optimization, and robust optimization:
For a detailed description of the spreadsheet implementations and information on the software required to use them, see the associated data article published in Data in Brief. For the conceptual framework, mathematical formulation of the optimization model (MLFCP), findings, and discussion, see the associated research article published in Heliyon. (The links to both articles can be found on this page).
Big Picture
Furthermore, an overview (“big picture”) of the data flows between the various worksheets is provided in three main versions which correspond to the deterministic, stochastic, and robust versions of the MLFCP:
Within each version/sub-variant of the overview, two file formats (PDF and PNG) are available. These are oversize graphics; please scale up appropriately to see the details.
(Remark on version numbers and dates: The version numbers reported within the files might be lower than the version number of the entire dataset in case particular files remain unchanged in an update. The same might analogously apply to the dates.)
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Geospatial Dataset of GNSS Anomalies and Political Violence Events
Overview
The Geospatial Dataset of GNSS Anomalies and Political Violence Events is a collection of data that integrates aircraft flight information, GNSS (Global Navigation Satellite System) anomalies, and political violence events from the ACLED (Armed Conflict Location & Event Data Project) database.
Dataset Files
The dataset consists of three CSV files:
- Daily_GNSS_Anomalies_and_ACLED-2023-V1.csv
- Daily_GNSS_Anomalies_and_ACLED-2023-V2.csv
- Monthly_GNSS_Anomalies_and_ACLED-2023-V9.csv
The monthly file contains monthly aggregated GNSS anomaly and ACLED event data per grid cell. The structure and meaning of each field are detailed below:
Data Sources
Temporal and Spatial Coverage