Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The complete dataset used in the analysis comprises 36 samples, each described by 11 numeric features and 1 target. The attributes considered were caspase 3/7 activity, Mitotracker red CMXRos area and intensity (3 h and 24 h incubations with both compounds), Mitosox oxidation (3 h incubation with both compounds) and oxidation rate, DCFDA fluorescence (3 h and 24 h incubations with either compound) and oxidation rate, and DQ BSA hydrolysis. The target of each instance corresponds to one of the 9 possible classes (4 samples per class): Control, 6.25, 12.5, 25 and 50 µM for 6-OHDA, and 0.03, 0.06, 0.125 and 0.25 µM for rotenone. The dataset is balanced, contains no missing values, and the data were standardized across features. The small number of samples precluded a comprehensive statistical analysis of the results; nevertheless, it allowed relevant hidden patterns and trends to be identified.
Exploratory data analysis, information gain, hierarchical clustering, and supervised predictive modeling were performed using Orange Data Mining version 3.25.1 [41]. Hierarchical clustering was performed using the Euclidean distance metric and weighted linkage. Cluster maps were plotted to relate the features with the highest mutual information (in rows) to the instances (in columns), with the color of each cell representing the normalized level of a particular feature in a specific instance. The information is grouped both in rows and in columns by a two-way hierarchical clustering method using Euclidean distances and average linkage. Stratified cross-validation was used to train the supervised decision tree. A set of preliminary empirical experiments was performed to choose the best parameters for each algorithm, and we verified that moderate variations in these settings did not significantly change the outcome. The following settings were adopted for the decision tree algorithm: minimum number of samples in leaves: 2; minimum number of samples required to split an internal node: 5; stop splitting when the majority reaches 95%; criterion: gain ratio. The performance of the supervised model was assessed using accuracy, precision, recall, F-measure and area under the ROC curve (AUC) metrics.
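For readers who want to approximate this setup outside Orange, the sketch below uses scikit-learn on synthetic stand-in data shaped like the dataset (36 samples, 11 standardized features, 9 balanced classes). Note that scikit-learn offers no gain-ratio criterion or 95%-majority stopping rule, so the entropy criterion and the leaf/split minima are only a rough analogue of the settings listed above, not the study's exact configuration.

import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(36, 11))   # stand-in for the 36 x 11 standardized feature matrix
y = np.repeat(np.arange(9), 4)  # 9 classes, 4 samples per class

tree = DecisionTreeClassifier(criterion="entropy", min_samples_leaf=2,
                              min_samples_split=5, random_state=0)
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
print("stratified CV accuracy:", cross_val_score(tree, X, y, cv=cv).mean())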
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
To create the dataset, the top 10 countries leading in the incidence of COVID-19 in the world were selected as of October 22, 2020 (on the eve of the second wave of the pandemic), which are presented in the Global 500 ranking for 2020: USA, India, Brazil, Russia, Spain, France and Mexico. For each of these countries, no more than 10 of the largest transnational corporations included in the Global 500 rating for 2020 and 2019 were selected separately. The arithmetic averages of, and the change (increase) in, indicators such as the profitability of enterprises, their ranking position (competitiveness), asset value and number of employees were calculated. The arithmetic mean values of these indicators across all countries of the sample were found, characterizing the situation in international entrepreneurship as a whole in the context of the COVID-19 crisis in 2020 on the eve of the second wave of the pandemic. The data are collected in a general Microsoft Excel table. The dataset is a unique database that combines COVID-19 statistics and entrepreneurship statistics, and it is flexible: it can be supplemented with data from other countries and newer statistics on the COVID-19 pandemic. Because the dataset contains formulas rather than ready-made numbers, adding or changing values in the original table at the beginning of the dataset automatically recalculates most of the subsequent tables and updates the graphs. This allows the dataset to be used not just as an array of data, but as an analytical tool for automating scientific research on the impact of the COVID-19 pandemic and crisis on international entrepreneurship. The dataset includes not only tabular data but also charts that provide data visualization. It contains not only actual but also forecast data on morbidity and mortality from COVID-19 for the period of the second wave of the pandemic in 2020. The forecasts are presented in the form of a normal distribution of predicted values and the probability of their occurrence in practice. This allows for a broad scenario analysis of the impact of the COVID-19 pandemic and crisis on international entrepreneurship, by substituting various predicted morbidity and mortality rates in the risk assessment tables and obtaining automatically calculated consequences (changes) for the characteristics of international entrepreneurship. It is also possible to substitute the actual values identified during and following the second wave of the pandemic, to check the reliability of pre-made forecasts and conduct a plan-fact analysis. The dataset contains not only the numerical values of the initial and predicted indicators, but also their qualitative interpretation, reflecting the presence and level of risks of the pandemic and COVID-19 crisis for international entrepreneurship.
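Because the forecasts are described as a normal distribution of predicted values with associated probabilities, a scenario value for a chosen probability level can be read off that distribution. The sketch below illustrates the idea in Python with invented parameters; the actual means and standard deviations live in the Excel workbook.

from scipy.stats import norm
mean_cases, sd_cases = 60000.0, 8000.0  # hypothetical forecast parameters, not taken from the workbook
for p in (0.25, 0.50, 0.75, 0.95):
    print(f"P={p:.2f} scenario: {norm.ppf(p, loc=mean_cases, scale=sd_cases):,.0f} cases/day")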
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
It is a widely accepted fact that evolving software systems change and grow. However, it is less well understood how change is distributed over time, specifically in object-oriented software systems. The patterns and techniques used to measure growth permit developers to identify specific releases where significant change took place, as well as to inform them of the longer term trend in the distribution profile. This knowledge assists developers in recording systemic and substantial changes to a release, as well as providing useful information as input into a potential release retrospective. However, these analysis methods can only be applied after a mature release of the code has been developed. But in order to manage the evolution of complex software systems effectively, it is important to identify change-prone classes as early as possible. Specifically, developers need to know where they can expect change, the likelihood of a change, and the magnitude of these modifications in order to take proactive steps and mitigate any potential risks arising from these changes. Previous research into change-prone classes has identified some common aspects, with different studies suggesting that complex and large classes tend to undergo more changes, and that classes that changed recently are likely to undergo modifications in the near future. Though the guidance provided is helpful, developers need more specific guidance in order for it to be applicable in practice. Furthermore, the information needs to be available at a level that can help in developing tools that highlight and monitor evolution-prone parts of a system as well as support effort estimation activities. The specific research questions that we address in this chapter are: (1) What is the likelihood that a class will change from a given version to the next? (a) Does this probability change over time? (b) Is this likelihood project specific, or general? (2) How is modification frequency distributed for classes that change? (3) What is the distribution of the magnitude of change? Are most modifications minor adjustments, or substantive modifications? (4) Does structural complexity make a class susceptible to change? (5) Does popularity make a class more change-prone? We make recommendations that can help developers to proactively monitor and manage change. These are derived from a statistical analysis of change in approximately 55,000 unique classes across all projects under investigation. The analysis methods that we applied took into consideration the highly skewed nature of the metric data distributions. The raw metric data (4 .txt files and 4 .log files in a .zip archive measuring ~2 MB in total) is provided in comma-separated values (CSV) format, and the first line of each CSV file contains the header. A detailed output of the statistical analysis undertaken is provided as log files generated directly from Stata (statistical analysis software).
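As a minimal, hypothetical sketch of working with the provided CSV data (the file and column names below are placeholders; the real header is given in the first line of each CSV file), the distribution of a change-count metric can be summarized with quantiles rather than means, which suits the highly skewed distributions mentioned above.

import pandas as pd
df = pd.read_csv("change_metrics.csv")             # placeholder file name
counts = df["modification_count"]                  # placeholder column name
print(counts.describe(percentiles=[0.5, 0.9, 0.99]))
print("share of classes never modified:", (counts == 0).mean())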
The main objective of the HEIS survey is to obtain detailed data on household expenditure and income, linked to various demographic and socio-economic variables, to enable computation of poverty indices, determine the characteristics of the poor and prepare poverty maps. Therefore, to achieve these goals, the sample had to be representative at the sub-district level. The raw survey data provided by the Statistical Office was cleaned and harmonized by the Economic Research Forum, in the context of a major research project to develop and expand knowledge on equity and inequality in the Arab region. The main focus of the project is to measure the magnitude and direction of change in inequality and to understand the complex contributing social, political and economic forces influencing its levels. However, the measurement and analysis of the magnitude and direction of change in this inequality cannot be consistently carried out without harmonized and comparable micro-level data on income and expenditures. Therefore, one important component of this research project is securing and harmonizing household surveys from as many countries in the region as possible, adhering to international standards for statistics on household living standards. Once the dataset has been compiled, the Economic Research Forum makes it available, subject to confidentiality agreements, to all researchers and institutions concerned with data collection and issues of inequality.
Data collected through the survey helped in achieving the following objectives:
1. Provide data weights that reflect the relative importance of consumer expenditure items used in the preparation of the consumer price index
2. Study the consumer expenditure pattern prevailing in the society and the impact of demographic and socio-economic variables on those patterns
3. Calculate the average annual income of the household and the individual, and assess the relationship between income and different economic and social factors, such as profession and educational level of the head of the household and other indicators
4. Study the distribution of individuals and households by income and expenditure categories and analyze the factors associated with it
5. Provide the necessary data for the national accounts related to overall consumption and income of the household sector
6. Provide the necessary income data to serve in calculating poverty indices and identifying the poor characteristics as well as drawing poverty maps
7. Provide the data necessary for the formulation, follow-up and evaluation of economic and social development programs, including those addressed to eradicate poverty
National
Sample survey data [ssd]
The Household Expenditure and Income survey sample for 2010 was designed to serve the basic objectives of the survey by providing a relatively large sample in each sub-district to enable drawing a poverty map of Jordan. The General Census of Population and Housing in 2004 provided a detailed framework for housing and households at different administrative levels in the country. Jordan is administratively divided into 12 governorates, each governorate is composed of a number of districts, and each district (Liwa) includes one or more sub-districts (Qada). In each sub-district, there are a number of communities (cities and villages), and each community was divided into a number of blocks. In each block, the number of houses ranged between 60 and 100. Nomads and persons living in collective dwellings such as hotels, hospitals and prisons were excluded from the survey framework.
A two-stage stratified cluster sampling technique was used. In the first stage, a cluster sample was selected with probability proportional to size, where the number of households in each cluster was used as the cluster's weight. In the second stage, a sample of 8 households was selected from each cluster, in addition to another 4 households selected as a backup for the basic sample, using a systematic sampling technique. Those 4 households were sampled to be used during the first visit to the block in case a visit to an originally selected household was not possible for any reason. For the purposes of this survey, each sub-district was considered a separate stratum to ensure the possibility of producing results at the sub-district level. In this respect, the survey adopted the framework provided by the 2004 General Census of Population and Housing in dividing the sample strata. To estimate the sample size, the coefficient of variation and the design effect of the expenditure variable from the Household Expenditure and Income Survey for the year 2008 were calculated for each sub-district. These results were used to estimate the sample size at the sub-district level so that the coefficient of variation of the expenditure variable in each sub-district is less than 10%, with a minimum number of clusters per sub-district (6 clusters). This ensures adequate representation of clusters in different administrative areas and enables drawing an indicative poverty map.
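The two-stage design can be illustrated with a short, purely illustrative Python sketch (synthetic block sizes, 6 clusters per stratum, 8 main plus 4 backup households per cluster): stage one approximates probability-proportional-to-size cluster selection and stage two draws a systematic sample within each selected cluster.

import numpy as np
rng = np.random.default_rng(1)
block_sizes = rng.integers(60, 101, size=200)                        # synthetic blocks of 60-100 houses
p = block_sizes / block_sizes.sum()
clusters = rng.choice(len(block_sizes), size=6, replace=False, p=p)  # approximate PPS draw of 6 clusters
for c in clusters:
    n, step = block_sizes[c], block_sizes[c] // 12
    start = rng.integers(0, step)
    drawn = np.arange(start, n, step)[:12]                           # systematic sample of 12 houses
    main, backup = drawn[:8], drawn[8:]
    print(f"cluster {c}: main={main.tolist()}, backup={backup.tolist()}")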
It should be noted that, in addition to the standard non-response rate assumed, higher rates were expected in areas of major cities where poor households are concentrated. These were therefore taken into consideration during the sampling design phase, and a higher number of households was selected from those areas to ensure good coverage of all regions where poverty is concentrated.
Face-to-face [f2f]
Raw Data:
- Organizing forms/questionnaires: A compatible archive system was used to classify the forms according to the different rounds throughout the year. A registry was prepared to track the stages of data checking, coding and entry until forms were returned to the archive system.
- Data office checking: This phase was carried out concurrently with the data collection phase in the field, where questionnaires completed in the field were immediately sent to the data office checking phase.
- Data coding: A team was trained to work on the data coding phase, which in this survey is limited to education specialization, profession and economic activity. International classifications were used for these items, while coding for the rest of the questions was predefined during the design phase.
- Data entry/validation: A team consisting of system analysts, programmers and data entry personnel worked on the data at this stage. System analysts and programmers started by identifying the survey framework and questionnaire fields to help build computerized data entry forms. A set of validation rules was added to the entry forms to ensure the accuracy of the data entered. A team was then trained to complete the data entry process. Forms prepared for data entry were provided by the archive department to ensure forms were correctly extracted and put back in the archive system. A data validation process was run on the data to ensure the data entered were free of errors.
- Results tabulation and dissemination: After the completion of all data processing operations, ORACLE was used to tabulate the survey final results. Those results were further checked against similar outputs from SPSS to ensure that the tabulations produced were correct. A check was also run on each table to guarantee consistency of the figures presented, together with the required editing of table titles and report formatting.
Harmonized Data:
- The Statistical Package for Social Science (SPSS) was used to clean and harmonize the datasets.
- The harmonization process started with cleaning all raw data files received from the Statistical Office.
- Cleaned data files were then merged to produce one data file on the individual level containing all variables subject to harmonization.
- A country-specific program was generated for each dataset to generate/compute/recode/rename/format/label harmonized variables.
- A post-harmonization cleaning process was run on the data.
- Harmonized data was saved on the household as well as the individual level, in SPSS and converted to STATA format.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This paper explores a unique dataset of all the SET ratings provided by students of one university in Poland at the end of the winter semester of the 2020/2021 academic year. The SET questionnaire used by this university is presented in Appendix 1. The dataset is unique for several reasons. It covers all SET surveys filled in by students in all fields and levels of study offered by the university. In the period analysed, the university operated entirely in the online regime amid the Covid-19 pandemic. While the expected learning outcomes formally have not been changed, the online mode of study could have affected the grading policy and could have implications for some of the studied SET biases. This Covid-19 effect is captured by the econometric models and discussed in the paper. The average SET scores were matched with the characteristics of the teacher (degree, seniority, gender, and SET scores in the past six semesters); the course characteristics (time of day, day of the week, course type, course breadth, class duration, and class size); the attributes of the SET survey responses (the percentage of students providing SET feedback); and the grades of the course (mean, standard deviation, and percentage failed). Data on course grades are also available for the previous six semesters. This rich dataset allows many of the biases reported in the literature to be tested for and new hypotheses to be formulated, as presented in the introduction section. The unit of observation, i.e. a single row in the data set, is identified by three parameters: teacher unique id (j), course unique id (k) and the question number in the SET questionnaire (n ϵ {1, 2, 3, 4, 5, 6, 7, 8, 9}). This means that for each pair (j, k) we have nine rows, one for each SET survey question, or sometimes fewer when students did not answer one of the SET questions at all. For example, the dependent variable SET_score_avg(j,k,n) for the triplet (j=John Smith, k=Calculus, n=2) is calculated as the average of all Likert-scale answers to question no. 2 in the SET survey distributed to all students that took the Calculus course taught by John Smith. The data set has 8,015 such observations or rows. The full list of variables or columns in the data set included in the analysis is presented in the attached file section. Their description refers to the triplet (teacher id = j, course id = k, question number = n). When the last value of the triplet (n) is dropped, it means that the variable takes the same values for all n ϵ {1, 2, 3, 4, 5, 6, 7, 8, 9}.
Two attachments:
- Word file with variables description
- RData file with the data set (for the R language)
Appendix 1. The SET questionnaire used for this paper.
Evaluation survey of the teaching staff of [university name]. Please complete the following evaluation form, which aims to assess the lecturer's performance. Only one answer should be indicated for each question. The answers are coded in the following way: 5 - I strongly agree; 4 - I agree; 3 - Neutral; 2 - I don't agree; 1 - I strongly don't agree. Questions:
1. I learnt a lot during the course.
2. I think that the knowledge acquired during the course is very useful.
3. The professor used activities to make the class more engaging.
4. If it was possible, I would enroll for the course conducted by this lecturer again.
5. The classes started on time.
6. The lecturer always used time efficiently.
7. The lecturer delivered the class content in an understandable and efficient way.
8. The lecturer was available when we had doubts.
9. The lecturer treated all students equally regardless of their race, background and ethnicity.
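As a small illustration of how the dependent variable described above is constructed (the column names below are hypothetical, not the dataset's actual variable names), SET_score_avg(j, k, n) is simply the mean Likert answer per (teacher, course, question) triplet:

import pandas as pd
answers = pd.DataFrame({                      # toy individual-level responses
    "teacher_id": ["John Smith"] * 4,
    "course_id": ["Calculus"] * 4,
    "question_no": [2, 2, 2, 2],
    "likert_answer": [5, 4, 4, 3],
})
set_score_avg = (answers
                 .groupby(["teacher_id", "course_id", "question_no"])["likert_answer"]
                 .mean()
                 .rename("SET_score_avg"))
print(set_score_avg)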
The open science movement produces vast quantities of openly published data connected to journal articles, creating an enormous resource for educators to engage students in current topics and analyses. However, educators face challenges using these materials to meet course objectives. I present a case study using open science (published articles and their corresponding datasets) and open educational practices in a capstone course. While engaging in current topics of conservation, students trace connections in the research process, learn statistical analyses, and recreate analyses using the programming language R. I assessed the presence of best practices in open articles and datasets, examined student selection in the open grading policy, surveyed students on their perceived learning gains, and conducted a thematic analysis on student reflections. First, articles and datasets met just over half of the assessed fairness practices, but this increased with the publication date. There was a...
Article and dataset fairness: To assess the utility of open articles and their datasets as an educational tool in an undergraduate academic setting, I measured the congruence of each pair to a set of best practices and guiding principles. I assessed ten guiding principles and best practices (Table 1), where each category was scored '1' or '0' based on whether it met that criterion, with a total possible score of ten.
Open grading policies: Students were allowed to specify the percentage weight for each assessment category in the course, including 1) six coding exercises (Exercises), 2) one lead exercise (Lead Exercise), 3) fourteen annotation assignments of readings (Annotations), 4) one final project (Final Project), 5) five discussion board posts and a statement of learning reflection (Discussion), and 6) attendance and participation (Participation). I examined if assessment categories (independent variable) were weighted (dependent variable) differently by students using an analysis of ...
Data for: Integrating open education practices with data analysis of open science in an undergraduate course
Author: Marja H Bakermans Affiliation: Worcester Polytechnic Institute, 100 Institute Rd, Worcester, MA 01609 USA ORCID: https://orcid.org/0000-0002-4879-7771 Institutional IRB approval: IRB-24–0314
The full dataset file called OEPandOSdata (.xlsx extension) contains 8 files. Below are descriptions of the name and contents of each file. NA = not applicable or no data available
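A minimal sketch for loading the workbook in Python (assuming the file sits in the working directory and that the 8 files correspond to 8 worksheets; pandas requires the openpyxl engine for .xlsx files):

import pandas as pd
sheets = pd.read_excel("OEPandOSdata.xlsx", sheet_name=None)  # dict of sheet name -> DataFrame
for name, df in sheets.items():
    print(name, df.shape)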
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This database was first created for the scientific article entitled: "Reviewing Machine Learning of corrosion prediction: a data-oriented perspective"
L.B. Coelho (1), D. Zhang (2), Y.V. Ingelgem (1), D. Steckelmacher (3), A. Nowé (3), H.A. Terryn (1)
(1) Department of Materials and Chemistry, Research Group Electrochemical and Surface Engineering, Vrije Universiteit Brussel, Brussels, Belgium; (2) Beijing Advanced Innovation Center for Materials Genome Engineering, National Materials Corrosion and Protection Data Center, Institute for Advanced Materials and Technology, University of Science and Technology Beijing, Beijing, China; (3) VUB Artificial Intelligence Lab, Vrije Universiteit Brussel, Brussels, Belgium
Different metrics can be used to evaluate the prediction accuracy of regression models; however, only papers providing relative metrics (MAPE, R²) were included in this database. We tried as much as possible to include descriptors of all major ML procedure steps, including data collection (“Data acquisition”), data cleaning, feature engineering (“Feature reduction”), model validation (“Train-Test split”*), etc.
*the total dataset is typically split into training sets and testing (unknown data) sets for performance evaluation of the model. Nonetheless, sometimes only the training or the testing performances were reported (“?” marks were added in the respective evaluation metric field(s)). The “Average R²” was sometimes considered for studies employing “CV” (cross-validation) on the dataset. For a detailed description of the ML basic procedures, the reader could refer to the References topic in the Review article.
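For reference, the two relative metrics used as the inclusion criterion can be computed as follows (a plain NumPy sketch with toy numbers, not values from the database):

import numpy as np
y_true = np.array([1.2, 0.8, 2.5, 1.9])   # illustrative observed values
y_pred = np.array([1.0, 0.9, 2.7, 1.7])   # illustrative model predictions
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
r2 = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)
print(f"MAPE = {mape:.1f}%  R² = {r2:.3f}")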
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.
With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.
We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.
Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.
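The following Python sketch illustrates the APP idea (it is not the authors' implementation, and the smoothness measure is only a plausible stand-in for the criterion used in APP-OQ): a flat Dirichlet distribution, alpha = (1, ..., 1), draws every label distribution on the probability simplex with equal probability, and APP-OQ then keeps the smoothest 20% of those samples.

import numpy as np
rng = np.random.default_rng(42)
n_classes = 5
prevalences = rng.dirichlet(np.ones(n_classes), size=1000)            # 1000 APP samples
smoothness = -np.abs(np.diff(prevalences, n=2, axis=1)).sum(axis=1)   # higher = smoother (stand-in criterion)
app_oq = prevalences[smoothness >= np.quantile(smoothness, 0.8)]      # smoothest 20% of samples
print(app_oq.shape)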
Usage
You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.
Preliminaries: You have to have a working Julia installation. We have used Julia v1.6.5 in our experiments.
Data Extraction: In your terminal, you can call either
make
(recommended), or
julia --project="." --eval "using Pkg; Pkg.instantiate()"
julia --project="." extract-oq.jl
Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.
Further Reading
Implementation of our experiments: https://github.com/mirkobunse/regularized-oq
https://creativecommons.org/publicdomain/zero/1.0/
Dataset Description
- Customer Demographics: Includes FullName, Gender, Age, CreditScore, and MonthlyIncome. These variables provide a demographic snapshot of the customer base, allowing for segmentation and targeted marketing analysis.
- Geographical Data: Comprising Country, State, and City, this section facilitates location-based analytics, market penetration studies, and regional sales performance.
- Product Information: Details like Category, Product, Cost, and Price enable product trend analysis, profitability assessment, and inventory optimization.
- Transactional Data: Captures the customer journey through SessionStart, CartAdditionTime, OrderConfirmation, OrderConfirmationTime, PaymentMethod, and SessionEnd. This rich temporal data can be used for funnel analysis, conversion rate optimization, and customer behavior modeling.
- Post-Purchase Details: With OrderReturn and ReturnReason, analysts can delve into return rate calculations, post-purchase satisfaction, and quality control.
Types of Analysis
- Descriptive Analytics: Understand basic metrics like average monthly income, most common product categories, and typical credit scores.
- Predictive Analytics: Use machine learning to predict credit risk or the likelihood of a purchase based on demographics and session activity.
- Customer Segmentation: Group customers by demographics or purchasing behavior to tailor marketing strategies.
- Geospatial Analysis: Examine sales distribution across different regions and optimize logistics.
- Time Series Analysis: Study the seasonality of purchases and session activities over time.
- Funnel Analysis: Evaluate the customer journey from session start to order confirmation and identify drop-off points (see the sketch after this list).
- Cohort Analysis: Track customer cohorts over time to understand retention and repeat purchase patterns.
- Market Basket Analysis: Discover product affinities and develop cross-selling strategies.
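A minimal funnel computation for the columns described above might look as follows (the file name is hypothetical, and it is assumed that a stage's timestamp is empty when that stage was not reached):

import pandas as pd
df = pd.read_csv("ecommerce_sessions.csv")                     # hypothetical export of this dataset
stages = ["SessionStart", "CartAdditionTime", "OrderConfirmationTime"]
funnel = {stage: int(df[stage].notna().sum()) for stage in stages}
print(funnel)
print("cart-to-order conversion:",
      funnel["OrderConfirmationTime"] / max(funnel["CartAdditionTime"], 1))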
Curious about how I created the data? Feel free to click here and take a peek! 😉
📊🔍 Good Luck and Happy Analysing 🔍📊
The Forest Inventory and Analysis (FIA) research program has been in existence since mandated by Congress in 1928. FIA's primary objective is to determine the extent, condition, volume, growth, and depletion of timber on the Nation's forest land. Before 1999, all inventories were conducted on a periodic basis. The passage of the 1998 Farm Bill requires FIA to collect data annually on plots within each State. This kind of up-to-date information is essential to frame realistic forest policies and programs. Summary reports for individual States are published but the Forest Service also provides data collected in each inventory to those interested in further analysis. Data is distributed via the FIA DataMart in a standard format. This standard format, referred to as the Forest Inventory and Analysis Database (FIADB) structure, was developed to provide users with as much data as possible in a consistent manner among States. A number of inventories conducted prior to the implementation of the annual inventory are available in the FIADB. However, various data attributes may be empty or the items may have been collected or computed differently. Annual inventories use a common plot design and common data collection procedures nationwide, resulting in greater consistency among FIA work units than earlier inventories. Links to field collection manuals and the FIADB user's manual are provided in the FIA DataMart.
https://data.norge.no/nlod/en/2.0/
The data sets provide an overview of selected data on waterworks registered with the Norwegian Food Safety Authority. The information has been reported by the waterworks through application processing or other reporting to the Norwegian Food Safety Authority. The drinking water regulations require, among other things, annual reporting, and the Norwegian Food Safety Authority has created a separate form service for such reporting. The data sets include public or private waterworks that supply 50 people or more. In addition, all municipally owned businesses with their own water supply are included regardless of size. The data sets also contain decommissioned facilities, for those who wish to view historical data, i.e. data for previous years. There are data sets for the following supervisory objects: 1. Water supply system (also includes analysis of drinking water); 2. Transport system; 3. Treatment facility; 4. Entry point (also includes analysis of the water source). Below you will find datasets for: 1. Water supply system_reporting. In addition, there is a file (information.txt) that provides an overview of when the extracts were produced and how many lines there are in the individual files. The extracts are produced weekly. Furthermore, for the data sets water supply system, transport system and entry point it is possible to see historical data on what is included in the annual reporting. To make use of that information, the file must be linked to the "mother" (parent) file to get names and other static information; these files have the _reporting ending in the file name. Descriptions of the data fields (i.e. metadata) in the individual data sets appear in separate files, available in pdf format. If you double-click the csv file and it opens directly in Excel, the Norwegian characters æøå will not be displayed correctly. To see the character set correctly in Excel, you must:
- start Excel and a new spreadsheet
- select Data and then From Text, and press Import
- select delimited data and file origin 65001: Unicode (UTF-8), tick "My data has headers" and press Next
- remove tab as separator, select semicolon as separator, and press Next
- otherwise, complete the import
Alternatively, the data sets can be imported into a separate database and compiled as desired. There are link keys in the files that make it possible to link the files together. The waterworks are responsible for the quality of the datasets.
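As an alternative to the Excel import steps above, the extracts can be read directly in Python with pandas (the file name below is illustrative); specifying the semicolon separator and UTF-8 encoding preserves the æøå characters:

import pandas as pd
df = pd.read_csv("water_supply_system_reporting.csv", sep=";", encoding="utf-8")
print(df.head())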
—
Purpose: Make data for drinking water supply available to the public.
This project focused specifically on design treatments that can be used to improve travel time reliability. The objectives of this research were to (1) identify the full range of possible roadway design features used by transportation agencies to improve travel time reliability and reduce delays from key causes of nonrecurrent congestion, (2) assess their costs and operational and safety effectiveness, and (3) provide recommendations for their use and eventual incorporation into appropriate design guides. This research generated two companion products that allow transportation agencies and professionals to apply these research findings effectively in daily practice. These products are the Design Guide for Addressing Nonrecurrent Congestion, which is a catalogue of the design elements and their associated use information, and the Analysis Tool for Design Treatments to Address Nonrecurring Congestion, which is a tool to execute the various analysis procedures and models to measure the effectiveness of a design element on travel time reliability. This zip file contains comma-separated value (.csv) files of data to support SHRP 2 Report S2-L07-RR-1, Identification and Evaluation of the Cost-Effectiveness of Highway Design Features to Reduce Nonrecurrent Congestion (https://rosap.ntl.bts.gov/view/dot/4040). The compressed zip file is 12 MB. These files can be unzipped using any zip compression/decompression software. The .csv files can be read with any basic text editor.
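Beyond a text editor, the .csv files can also be read programmatically without unpacking the archive; a short Python sketch (the archive name is illustrative, and the first .csv member is picked arbitrarily):

import zipfile
import pandas as pd
with zipfile.ZipFile("SHRP2_L07_data.zip") as z:
    members = [m for m in z.namelist() if m.endswith(".csv")]
    with z.open(members[0]) as f:
        df = pd.read_csv(f)
print(members[0], df.shape)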
This private company dataset provides an in-depth view of any specific company’s truck-based supply chain and its relationships with other facilities and companies within the continental US.
Also, using robust supply chain data you will be able to map US facilities (including factories, warehouses, and retail outlets).
With this private company dataset, it is possible to track the movement of trucks and devices between locations to identify supply chain connections and company data insights.
Our machine learning algorithms ingest 7-15 billion daily events to estimate the volume of goods transported between locations. Consequently, we can map supply chain connections between:
• Different companies (expressed as a percentage of volume transported).
• Locations owned by the same company (e.g. warehouse to shop).
With this novel geolocation approach, it is possible to "draw" a knowledge graph of any private or public company's relations with other companies within the country.
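Such a knowledge graph can be represented as a weighted directed graph; the sketch below uses networkx with an invented edge list purely for illustration (edge weights stand for the share of outbound truck volume):

import networkx as nx
edges = [("Factory A", "Warehouse B", 0.42),   # 42% of A's outbound volume goes to B
         ("Factory A", "Retailer C", 0.18),
         ("Warehouse B", "Retailer C", 0.61)]
g = nx.DiGraph()
g.add_weighted_edges_from(edges, weight="volume_share")
print(sorted(g.out_edges("Factory A", data=True)))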
Use cases:
Identification and understanding of relations company-to-company: It helps to identify and infer relationships and connections between specific companies or facilities and between sectors/industries.
Identification and understanding of relations place-to-place: A logistics and domestic distribution supply chain can be mapped, both nationwide and state-wide in the US, and across countries in Europe.
Visualization and mapping of an entire supply chain network.
Tracking of products in any distribution or supply chain.
Risk assessment
Correlation analysis.
Disruption analysis.
Analysis of illicit networks and tracking of illegal use of corporate assets.
Improvement of casualty risk management.
Optimization of supply chain risk management.
Security and compliance.
Identification of not only the first tier of suppliers in the value chain, but also 2nd and 3rd tier suppliers, and more.
Current largest use case: global corporation using it to model risk at a facility level (+100,000 locations).
Why should you trust PREDIK Data-Driven? In 2023, we were listed among Datarade's top providers. Why? Our solutions for private company data, supply chain data, and B2B data adapt to the specific needs of companies, and the PREDIK methodology focuses on the client and the elements necessary for the success of their projects.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”
A peer-reviewed data paper for this dataset is under review for publication in NECSUS_European Journal of Media Studies, an open access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org
Please cite this when using the dataset.
Detailed description of the dataset:
1 Film Dataset: Festival Programs
The Film Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.
The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.
The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.
The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.
2 Survey Dataset
The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.
The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.
The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.
The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.
3 IMDb & Scripts
The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.
The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.
The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.
The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.
The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.
The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to each crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.
The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.
The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.
The csv file “3_imdb-dataset_release-info_long” contains data about non-festival release (e.g., theatrical, digital, tv, dvd/blueray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.
The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.
The dataset includes 8 text files containing the script for webscraping. They were written using the R-3.6.3 version for Windows.
The R script “r_1_unite_data” demonstrates the structure of the dataset, that we use in the following steps to identify, scrape, and match the film data.
The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then, if no matches are found in the advanced search, potentially using an alternative title and a basic search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records on the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, with a fuzzy matching approach based on two methods, “cosine” and “osa”: the cosine similarity is used to match titles with a high degree of similarity, and the OSA algorithm is used to match titles that may have typos or minor variations.
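As a rough illustration of this fuzzy-matching step (the pipeline itself uses R string-distance methods, “cosine” and “osa”; the Python snippet below uses difflib's similarity ratio only as a stand-in), candidate IMDb titles can be scored against a core-dataset title and the best-scoring candidate kept for manual review:

import difflib
core_title = "The Farewell"                                  # invented example title
candidates = ["The Farewell (I)", "Farewell", "A Fare to Remember"]
scores = {c: difflib.SequenceMatcher(None, core_title.lower(), c.lower()).ratio()
          for c in candidates}
best = max(scores, key=scores.get)
print(scores, "->", best)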
The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of the following five categories: a) 100% match: perfect match on title, year, and director; b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible doubles in the dataset and identifies them for a manual check.
The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.
The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. This script does that for the first 100 films, to check if everything works. Scraping the entire dataset took a few hours; therefore, a test with a subsample of 100 films is advisable.
The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.
The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.
The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the amount of missing values and errors.
4 Festival Library Dataset
The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.
The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, and festival categories,
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Three experimental data sets (WNRA0103, WNRA0305 and WNRA0506) involving three grapevine varieties and a range of deficit irrigation and pruning treatments are described. The purpose for obtaining the data sets was two-fold, (1) to meet the research goals of the Cooperative Research Centre for Viticulture (CRCV) during its tenure 1999-2006, and (2) to test the capacity of the VineLOGIC grapevine growth and development model to predict timing of bud burst, flowering, veraison and harvest, yield and yield components, berry attributes and components of water balance. A test script, included with the VineLOGIC source code publication (https://doi.org/10.25919/5eb3536b6a8a8), enables comparison between model predicted and measured values for key variables. Key references relating to the model and data sets are provided under Related Links. A description of selected terms and outcomes of regression analysis between values predicted by the model and observed values are provided under Supporting Files. Version 3 included the following amendments: (1) to WNRA0103 – alignment of settings for irrigation simulation control and initial soil water contents for soil layers with those in WNRA0305 and WNRA0506, and addition of missing berry anthocyanin data for season 2002-03; (2) to WNRA0305 - minor corrections to values for berry and bunch number and weight, and correction of target Brix value for harvest to 24.5 Brix; (3) minor corrections to some measured berry anthocyanin concentrations as mg/g fresh weight; minor amendments to treatment names for consistency across data sets, and to the name for irrigation type to improve clarity; and (4) update of regression analysis between VineLOGIC-predicted versus observed values for key variables. Version 4 (this version) includes a metadata only amendment with two additions to Related links: ‘VineLOGIC View’ and a recent publication. Lineage: The data sets were obtained at a commercial wine company vineyard in the Mildura region of north western Victoria, Australia. Vines were spaced 2.4 m within rows and 3 m between rows, trained to a two-wire vertical trellis and drip irrigated. The soil was a Nookamka sandy loam. Data Set 1 (WNRA0103): An experiment comparing the effects on grapevine growth and development of three pruning treatments, spur, light mechanical hedging and minimal pruning, involving Shiraz on Schwarzmann rootstock, irrigated with industry standard drip irrigation and collected over three seasons 2000-01, 2001-02 and 2002-03. The experiment was established and conducted by Dr Rachel Ashley with input from Peter Clingeleffer (CSIRO), Dr Bob Emmett (Department of Primary Industries, Victoria) and Dr Peter Dry (University of Adelaide). Seasons in the southern hemisphere span two calendar years, with budburst in the second half of the first calendar year and harvest in the first half of the second calendar year. Data Set 2 (WNRA0305): An experiment comparing the effects of three irrigation treatments, industry standard drip, Regulated Deficit (RDI) and Prolonged Deficit (PD) irrigation involving Cabernet Sauvignon on own roots and pruned by light mechanical hedging, over three seasons 2002-03, 2003-04 and 2004-05. The RDI treatment involved application of a water deficit in the post-fruit set to pre-veraison period. The PD treatment was initially the same as RDI but with an extended period of extreme deficit (no irrigation) after the RDI stress period until veraison. 
The experiment was established and conducted by Dr Nicola Cooley with input from Peter Clingeleffer and Dr Rob Walker (CSIRO). Data Set 3 (WNRA0506): Compared basic grapevine growth, development and berry maturation post fruit set at three Trial Sites over two seasons 2004-05 and 2005-06. Trial Site one is the same site used to collect Data Set 1. Data were collected from all three pruning treatments in season 2004-05 but only from the spur and light mechanical hedging treatments in season 2005-06. Trial Site two involved comparison of two scions, Chardonnay and Shiraz, both on Schwarzmann rootstock, irrigated with industry standard drip irrigation and pruned using light mechanical hedging. Data were collected in season 2004-05. Trial Site three is the same site used to collect Data Set 2. Data were collected from all three irrigation treatments in season 2004-05 but only from the industry standard drip and PD treatments in 2005-06. Establishment and conduct of experiments at Trial Sites one, two and three was by Dr Anne Pellegrino and Deidre Blackmore with input from Peter Clingeleffer and Dr Rob Walker. The decision to develop Data Set 3 followed a mid-term CRCV review and analysis of available Australian data sets and relevant literature, which identified the need to obtain a data set covering all of the required variables necessary to run VineLOGIC and in particular, to obtain data on berry development commencing as soon as possible after fruit set. Most prior data sets were from veraison onwards, which is later than desirable from a modelling perspective. Data Set 1, 2 and 3 compilation for VineLOGIC was by Deidre Blackmore with input from Dr Doug Godwin. Review and testing of the Data Sets with VineLOGIC was conducted by David Benn with input from Dr Paul Petrie (South Australian Research and Development Institute), Dr Vinay Pagay (University of Adelaide) and Drs Everard Edwards and Rob Walker (CSIRO). A collaboration agreement with University of Adelaide established in 2017 enabled further input to review of the Data Sets and their testing with VineLOGIC by Dr Sam Culley.
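The regression analysis between VineLOGIC-predicted and observed values mentioned above is, in essence, an ordinary least-squares fit of observed against predicted values per variable; a minimal Python sketch with made-up numbers (not taken from the data sets) is shown below.

import numpy as np
from scipy import stats
predicted = np.array([7.8, 10.9, 11.5, 9.9, 10.6])   # illustrative model predictions (e.g. yield, t/ha)
observed = np.array([8.1, 10.4, 12.0, 9.3, 11.1])    # illustrative measured values
fit = stats.linregress(predicted, observed)
print(f"slope={fit.slope:.2f} intercept={fit.intercept:.2f} R²={fit.rvalue**2:.3f}")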
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Interpolated Strontium Values dataset Ver. 3.1 presents interpolated strontium isotope data for the southern Trans-Urals, based on the data gathered in 2020-2022. The current dataset consists of five sets of files for five different interpolations: based on grass, mollusk, soil, and water samples, as well as the average of the three (excluding the mollusk dataset). Each of the five sets consists of a CSV file and a KML file in which the interpolated values are presented for use with GIS software (ordinary kriging, 5000 m x 5000 m grid). In addition, two GeoTIFF files are provided for each set as a visual reference.
Average 5000 m interpolated points.kml / csv: these files contain averaged values of all three sample types.
Grass 5000 m interpolated points.kml / csv: these files contain data interpolated from the grass sample dataset.
Mollusks 5000 m interpolated points.kml / csv: these files contain data interpolated from the mollusk sample dataset.
Soil 5000 m interpolated points.kml / csv: these files contain data interpolated from the soil sample dataset.
Water 5000 m interpolated points.kml / csv: these files contain data interpolated from the water sample dataset.
The current version is also supplemented with GeoTiff raster files where the same interpolated values are color-coded. These files can be added to Google Earth or any GIS software together with KML files for better interpretation and comparison.
Averaged 5000 m interpolation raster.tif: this file contains a raster representing the averaged values of all three sample types.
Grass 5000 m interpolation raster.tif: this file contains a raster representing the data interpolated from the grass sample dataset.
Mollusks 5000 m interpolation raster.tif: this file contains a raster representing the data interpolated from the mollusk sample dataset.
Soil 5000 m interpolation raster.tif: this file contains a raster representing the data interpolated from the soil sample dataset.
Water 5000 m interpolation raster.tif: this file contains a raster representing the data interpolated from the water sample dataset
In addition, the cross-validation rasters created during the interpolation process are also provided. They can be used as a visual reference for the interpolation reliability. The grey areas on the raster represent areas where expected values do not differ from interpolated values by more than 0.001. The red areas represent areas where the error exceeds 0.001 and, thus, the interpolation is not reliable.
How to use it?
The data provided can be used to access interpolated background values of bioavailable strontium in the area of interest. Note that a single value is not a good enough predictor and should never be used as a proxy. Always calculate a mean of 4-6 (or more) nearby values to achieve the best estimate possible. Never calculate averages from a single dataset; always cross-validate by comparing data from all five datasets. Check the cross-validation rasters to make sure that the interpolation is reliable for the area of interest.
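A small Python sketch of this recommendation (the site coordinates and column names are assumptions; check the actual CSV header before use): load one of the interpolated point files, average the five nearest 87Sr/86Sr values around a site of interest, and then repeat with the other datasets for cross-validation.

import pandas as pd
from scipy.spatial import cKDTree
pts = pd.read_csv("Average 5000 m interpolated points.csv")      # assumed columns: longitude, latitude, sr_value
tree = cKDTree(pts[["longitude", "latitude"]].to_numpy())
site = (60.50, 53.20)                                            # hypothetical site of interest (lon, lat)
_, idx = tree.query(site, k=5)
print("local mean 87Sr/86Sr:", pts.iloc[idx]["sr_value"].mean())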
References
The interpolated datasets are based upon the actual measured values published as follows:
Epimakhov, Andrey; Kisileva, Daria; Chechushkov, Igor; Ankushev, Maksim; Ankusheva, Polina (2022): Strontium isotope ratios (87Sr/86Sr) analysis from various sources the southern Trans-Urals. PANGAEA, https://doi.pangaea.de/10.1594/PANGAEA.950380
Description of the original dataset of measured strontium isotopic values
The present dataset contains measurements of bioavailable strontium isotopes (87Sr/86Sr) gathered in the southern Trans-Urals. There are four sample types, wormwood (n = 103), leached soil (n = 103), water (n = 101), and freshwater mollusks (n = 80), collected to measure bioavailable strontium isotopes. The analysis of Sr isotopic composition was carried out in the cleanrooms (ISO classes 6 and 7) of the Geoanalitik shared research facilities of the Institute of Geology and Geochemistry, Ural Branch of the Russian Academy of Sciences (Ekaterinburg). Mollusk shell samples, preliminarily cleaned with acetic acid, as well as vegetation samples rinsed with deionized water and ashed, were dissolved by open digestion in concentrated HNO3 with the addition of H2O2 on a hotplate at 150°C. Water samples were acidified with concentrated nitric acid and filtered. To obtain aqueous leachates, pre-ground soil samples weighing 1 g were placed into polypropylene containers, 10 ml of ultrapure water was added, and the containers were shaken for 1 hour, after which the samples were filtered through membrane cellulose acetate filters with a pore diameter of 0.2 μm. In all samples, the strontium content was determined by ICP-MS (NexION 300S). Then the sample volume corresponding to a Sr content of 600 ng was evaporated on a hotplate at 120°C, and the precipitate was dissolved in 7M HNO3. Sample solutions were centrifuged at 6000 rpm, and strontium was chromatographically isolated using SR resin (Triskem). The strontium isotopic composition was measured on a Neptune Plus multicollector inductively coupled plasma mass spectrometer (MC-ICP-MS). To correct for mass bias, a combination of bracketing and internal normalization according to the exponential law with 88Sr/86Sr = 8.375209 was used. The results were additionally bracketed using the NIST SRM 987 strontium carbonate reference material, using an average deviation from the reference value of 0.710245 for every two samples bracketed between NIST SRM 987 measurements. The long-term reproducibility of the strontium isotopic analysis was evaluated using repeated measurements of NIST SRM 987 during 2020-2022 and yielded 87Sr/86Sr = 0.71025, 2SD = 0.00012 (104 measurements in two replicates). The within-laboratory standard uncertainty (2σ) obtained for SRM 987 was ± 0.003%.
https://creativecommons.org/publicdomain/zero/1.0/
This synthetic dataset simulates daily-level FMCG sales transactions for three consecutive years (2022, 2023, 2024), designed for practicing time series forecasting, demand planning, and machine learning in realistic business conditions.
Inspired by real-world scenarios (e.g. Nestlé, Unilever, P&G), it includes:
- Product hierarchy: SKU → Brand → Segment → Category
- Sales channels: Retail / Discount / E-commerce
- Regions: Central, North, and South (Poland)
- Daily sales quantities, prices, promotions, stock, delivery lag (lead time)
- Pack types: Single / Multipack / Carton
- Seasonality and product introductions: new SKUs are introduced in 2024 only, and prices gradually increase over the years
Possible Use Cases (see the sketch after this list):
- Weekly sales forecasting
- Promotion effect analysis
- Seasonality and trend modeling
- New product forecasting (cold start)
- Feature engineering for ML models
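As a minimal, hypothetical starting point for the weekly forecasting use case, the snippet below aggregates the daily records to weekly SKU-level sales; the file name and column names (date, sku_id, channel, units_sold) are assumptions, not the dataset's documented schema:

```python
# Sketch only: weekly SKU-level aggregation of the daily FMCG sales data.
# File name and column names are assumptions about the schema.
import pandas as pd

daily = pd.read_csv("fmcg_sales_2022_2024.csv", parse_dates=["date"])

weekly = (
    daily.groupby(["sku_id", "channel", pd.Grouper(key="date", freq="W-MON")])
    ["units_sold"].sum()
    .reset_index()
    .rename(columns={"date": "week"})
)
print(weekly.head())
```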
Created by: Beata Faron
Data Scientist working on demand forecasting, NLP, and business-oriented ML.
https://www.usa.gov/government-works
Note: Reporting of new COVID-19 Case Surveillance data will be discontinued July 1, 2024, to align with the process of removing SARS-CoV-2 infections (COVID-19 cases) from the list of nationally notifiable diseases. Although these data will continue to be publicly available, the dataset will no longer be updated.
Authorizations to collect certain public health data expired at the end of the U.S. public health emergency declaration on May 11, 2023. The following jurisdictions discontinued COVID-19 case notifications to CDC: Iowa (11/8/21), Kansas (5/12/23), Kentucky (1/1/24), Louisiana (10/31/23), New Hampshire (5/23/23), and Oklahoma (5/2/23). Please note that these jurisdictions will not routinely send new case data after the dates indicated. As of 7/13/23, case notifications from Oregon will only include pediatric cases resulting in death.
This case surveillance public use dataset has 19 elements for all COVID-19 cases shared with CDC and includes demographics, geography (county and state of residence), any exposure history, disease severity indicators and outcomes, and presence of any underlying medical conditions and risk behaviors.
Currently, CDC provides the public with three versions of COVID-19 case surveillance line-listed data: this 19 data element dataset with geography, a 12 data element public use dataset, and a 33 data element restricted access dataset.
The following apply to the public use datasets and the restricted access dataset:
Overview
The COVID-19 case surveillance database includes individual-level data reported to U.S. states and autonomous reporting entities, including New York City and the District of Columbia (D.C.), as well as U.S. territories and affiliates. On April 5, 2020, COVID-19 was added to the Nationally Notifiable Condition List and classified as “immediately notifiable, urgent (within 24 hours)” by a Council of State and Territorial Epidemiologists (CSTE) Interim Position Statement (Interim-20-ID-01). CSTE updated the position statement on August 5, 2020, to clarify the interpretation of antigen detection tests and serologic test results within the case classification (Interim-20-ID-02). The statement also recommended that all states and territories enact laws to make COVID-19 reportable in their jurisdiction, and that jurisdictions conducting surveillance should submit case notifications to CDC. COVID-19 case surveillance data are collected by jurisdictions and reported voluntarily to CDC.
For more information:
NNDSS Supports the COVID-19 Response | CDC.
COVID-19 Case Reports COVID-19 case reports are routinely submitted to CDC by public health jurisdictions using nationally standardized case reporting forms. On April 5, 2020, CSTE released an Interim Position Statement with national surveillance case definitions for COVID-19. Current versions of these case definitions are available at: https://ndc.services.cdc.gov/case-definitions/coronavirus-disease-2019-2021/. All cases reported on or after were requested to be shared by public health departments to CDC using the standardized case definitions for lab-confirmed or probable cases. On May 5, 2020, the standardized case reporting form was revised. States and territories continue to use this form.
Access Addressing Gaps in Public Health Reporting of Race and Ethnicity for COVID-19, a report from the Council of State and Territorial Epidemiologists, to better understand the challenges in completing race and ethnicity data for COVID-19 and recommendations for improvement.
To learn more about the limitations in using case surveillance data, visit FAQ: COVID-19 Data and Surveillance.
CDC’s Case Surveillance Section routinely performs data quality assurance procedures (i.e., ongoing corrections and logic checks to address data errors). To date, the following data cleaning steps have been implemented:
To prevent the release of data that could be used to identify people, data cells are suppressed for low-frequency values (fewer than 11 COVID-19 case records with a given combination of values). Suppression covers low-frequency combinations of case month, geographic characteristics (county and state of residence), and demographic characteristics (sex, age group, race, and ethnicity). Suppressed values are re-coded to the NA answer option; records with suppressed data are never removed.
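The suppression rule can be illustrated with a short, hypothetical sketch (column names are illustrative, not the dataset's actual schema):

```python
# Sketch only: recode identifying fields to NA when a combination of case month,
# geography, and demographics occurs fewer than 11 times. Column names are illustrative.
import numpy as np
import pandas as pd

KEY = ["case_month", "res_state", "res_county", "sex", "age_group", "race", "ethnicity"]

def suppress_low_frequency(df: pd.DataFrame, threshold: int = 11) -> pd.DataFrame:
    counts = df.groupby(KEY, dropna=False)[KEY[0]].transform("size")
    out = df.copy()
    out.loc[counts < threshold, KEY] = np.nan  # records are kept, values are suppressed
    return out
```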
COVID-19 data are available to the public as summary or aggregate count files, including total counts of cases and deaths by state and by county. These and other COVID-19 data are available from multiple public locations: COVID Data Tracker; United States COVID-19 Cases and Deaths by State; COVID-19 Vaccination Reporting Data Systems; and COVID-19 Death Data and Resources.
Notes:
March 1, 2022: The "COVID-19 Case Surveillance Public Use Data with Geography" will be updated on a monthly basis.
April 7, 2022: An adjustment was made to CDC’s cleaning algorithm for COVID-19 line level case notification data. An assumption in CDC's algorithm led to misclassifying deaths that were not COVID-19 related. The algorithm has since been revised, and this dataset update reflects corrected individual level information about death status for all cases collected to date.
June 25, 2024: An adjustment
https://creativecommons.org/publicdomain/zero/1.0/
The New 7 Wonders of the World was a campaign started in 2000 to choose Wonders of the World from a selection of 200 existing monuments. The popularity poll via free Web-based voting and small amounts of telephone voting was led by Canadian-Swiss Bernard Weber and organized by the New 7 Wonders Foundation (N7W) based in Zurich, Switzerland, with winners announced on 7 July 2007 in Lisbon, at Estádio da Luz. The poll was considered unscientific partly because it was possible for people to cast multiple votes.
If we ever plan to go on a world tour, there will surely be a bucket list of wonders and places around the world that we wish to visit. Here we have a set of "Wonders of the World" images scraped from Google Images. Let us use our deep learning skills to build a multiclass classifier that identifies the place shown in an image.
This dataset contains a total of 3,846 images organized into folders, with each folder representing one of the New 7 Wonders of the World. Below is the list of wonders whose images were extracted from Google Images.
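A minimal sketch of loading such a folder-per-class image set for multiclass classification is shown below (the root directory name is an assumption; TensorFlow 2.10+ is assumed for subset="both"):

```python
# Sketch only: build training/validation datasets from the class-per-folder layout.
# The directory name "wonders_of_world" is an assumption; requires TensorFlow >= 2.10.
import tensorflow as tf

train_ds, val_ds = tf.keras.utils.image_dataset_from_directory(
    "wonders_of_world",
    validation_split=0.2,
    subset="both",          # returns (training, validation) datasets
    seed=42,
    image_size=(224, 224),
    batch_size=32,
)
print(train_ds.class_names)  # one class per sub-folder
```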
The basic goal of this survey is to provide the database necessary for formulating national policies at various levels. It captures the contribution of the household sector to the Gross National Product (GNP). Household surveys also help in determining the incidence of poverty and in providing the weights that reflect the relative importance of consumption items, used to set the benchmark for rates and prices of items and services. Generally, the Household Expenditure and Consumption Survey is a fundamental cornerstone in the process of studying the nutritional status in the Palestinian Territory.
The raw survey data provided by the Statistical Office was cleaned and harmonized by the Economic Research Forum, in the context of a major research project to develop and expand knowledge on equity and inequality in the Arab region. The main focus of the project is to measure the magnitude and direction of change in inequality and to understand the complex contributing social, political and economic forces influencing its levels. However, the measurement and analysis of the magnitude and direction of change in this inequality cannot be consistently carried out without harmonized and comparable micro-level data on income and expenditures. Therefore, one important component of this research project is securing and harmonizing household surveys from as many countries in the region as possible, adhering to international statistics on household living standards distribution. Once the dataset has been compiled, the Economic Research Forum makes it available, subject to confidentiality agreements, to all researchers and institutions concerned with data collection and issues of inequality. Data is a public good, in the interest of the region, and it is consistent with the Economic Research Forum's mandate to make micro data available, aiding regional research on this important topic.
The survey data covers urban, rural and camp areas in the West Bank and Gaza Strip.
1- Household/families. 2- Individuals.
The survey covered all Palestinian households whose usual place of residence is in the Palestinian Territory.
Sample survey data [ssd]
The sampling frame consists of all enumeration areas enumerated in 1997; each enumeration area consists of buildings and housing units and contains, on average, about 150 households. The enumeration areas were used as primary sampling units (PSUs) in the first stage of sample selection. The enumeration areas of the master sample were updated in 2003.
The sample is a stratified, clustered, systematic random sample selected in two stages. The calculated sample size is 1,616 households; 1,281 households completed the interview (847 in the West Bank and 434 in the Gaza Strip). First stage: selection of a systematic random sample of 120 enumeration areas. Second stage: selection of a systematic random sample of 12-18 households from each enumeration area chosen in the first stage.
The population was stratified by: 1- Region (North West Bank, Middle West Bank, South West Bank, Gaza Strip); 2- Type of locality (urban, rural, refugee camps).
The target cluster size or "sample-take" is the average number of households to be selected per PSU. In this survey, the sample take is around 12 households.
The calculated sample size is 1,616 households; 1,281 households completed the interview (847 in the West Bank and 434 in the Gaza Strip).
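The two-stage systematic selection described above can be sketched as follows (the frame sizes used here are synthetic and only illustrate the mechanics; this is not the statistical office's sampling code):

```python
# Sketch only: two-stage systematic selection (120 EAs, then ~12-18 households per EA).
# The frame sizes are synthetic.
import numpy as np

rng = np.random.default_rng(1)

def systematic_sample(frame_size: int, sample_size: int) -> np.ndarray:
    """Indices of a systematic random sample with a random start."""
    step = frame_size / sample_size
    start = rng.uniform(0, step)
    return np.floor(start + step * np.arange(sample_size)).astype(int)

selected_eas = systematic_sample(3000, 120)    # stage 1: enumeration areas (PSUs)
households = {ea: systematic_sample(150, 13)   # stage 2: households within each EA
              for ea in selected_eas}
```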
Face-to-face [f2f]
The PECS questionnaire consists of two main sections:
First section: Certain parts of the form are filled in at the beginning of the month and the remainder at the end of the month. The questionnaire includes the following parts:
Cover sheet: contains the particulars of the family, the date of visit, the particulars of the field/office work team, and the number and sex of the family members.
Statement of the family members: Contains social, economic and demographic particulars of the selected family.
Statement of long-lasting commodities and income-generating activities: includes a number of basic and indispensable items (e.g., livestock or agricultural land).
Housing Characteristics: Includes information and data pertaining to the housing conditions, including type of house, number of rooms, ownership, rent, water, electricity supply, connection to the sewer system, source of cooking and heating fuel, and remoteness/proximity of the house to education and health facilities.
Monthly and Annual Income: Data pertaining to the income of the family is collected from different sources at the end of the registration / recording period.
Assistance and poverty: includes questions about household conditions and the assistance received during the past month.
Second section: The second section of the questionnaire contains a list of 55 consumption and expenditure groups, itemized and numbered in order of their importance to the family. Each group contains the important commodities; across all groups there are 667 commodity and service items. Groups 1-21 cover food, drink, and cigarettes. Group 22 covers homemade commodities. Groups 23-45 cover all items other than food, drink and cigarettes. Groups 50-55 cover long-lasting commodities. Data on each group were collected over different intervals of time so as to reflect expenditure over a full year, except for the cars group, for which data were collected for the three previous years. These data were obtained from the recording book, which covers a period of one month for each household.
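As a small illustrative aid (not part of the survey documentation), the group numbering can be expressed as a simple lookup:

```python
# Sketch only: map a group number to the broad category described above.
# Groups 46-49 are not described in the documentation and are left undocumented here.
def expenditure_category(group: int) -> str:
    if 1 <= group <= 21:
        return "food, drink and cigarettes"
    if group == 22:
        return "homemade commodities"
    if 23 <= group <= 45:
        return "non-food items"
    if 50 <= group <= 55:
        return "long-lasting commodities"
    return "undocumented group"
```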
Data editing took place through a number of stages, including: 1. Office editing and coding; 2. Data entry; 3. Checking of structure and completeness; 4. Structural checking of the SPSS data files.
The survey sample consists of 1,616 households interviewed over a twelve-month period (January 2006 to January 2007). Of these, 1,281 households completed the interview, 847 in the West Bank and 434 in the Gaza Strip, corresponding to a response rate of 79.3% in the Palestinian Territory.
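For reference, the quoted response rate follows directly from the completed and calculated sample sizes:

\[ \text{response rate} = \frac{1281}{1616} \approx 0.793 = 79.3\%. \]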
Generally, survey samples are exposed to two types of errors. Statistical (sampling) errors, the first type, result from studying only a part of the population rather than all of its sections. Since the Household Expenditure and Consumption Surveys are conducted on a sample basis, such errors are unavoidable; therefore, a probability sample with a suitable design was employed, giving each unit of the population a chance of selection. Calculation of the bias rate in this survey indicated that the data are of high quality. The second type, non-statistical errors, relates to the design of the survey, the mechanisms of data collection, and the management and analysis of the data. Members of the field team were trained on all possible mechanisms to tackle such problems, including how to handle non-response cases (which represented 9.6%).