100+ datasets found

Sample data analysis
kaggle.com
zip
Updated Apr 28, 2023
Cite
Abdul Hamith (2023). Sample data analysis [Dataset]. https://www.kaggle.com/datasets/abdulhamith/sample-data-analysis
Explore at:
Available download formats: zip (998859 bytes)
Dataset updated
Apr 28, 2023
Authors
Abdul Hamith
Description
Dataset

This dataset was created by Abdul Hamith

Contents
Data from: PISA Data Analysis Manual: SPSS, Second Edition
catalog.data.gov
s.cnmilf.com
Updated Mar 30, 2021
+ more versions
Cite
U.S. Department of State (2021). PISA Data Analysis Manual: SPSS, Second Edition [Dataset]. https://catalog.data.gov/dataset/pisa-data-analysis-manual-spss-second-edition
Explore at:
Dataset updated
Mar 30, 2021
Dataset provided by
United States Department of State (http://state.gov/)
Description
The OECD Programme for International Student Assessment (PISA) surveys collected data on students’ performances in reading, mathematics and science, as well as contextual information on students’ background, home characteristics and school factors which could influence performance. This publication includes detailed information on how to analyse the PISA data, enabling researchers to both reproduce the initial results and to undertake further analyses. In addition to the inclusion of the necessary techniques, the manual also includes a detailed account of the PISA 2006 database and worked examples providing full syntax in SPSS.
Orange dataset table
figshare.com
xlsx
Updated Mar 4, 2022
Cite
Rui Simões (2022). Orange dataset table [Dataset]. http://doi.org/10.6084/m9.figshare.19146410.v1
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.6084/m9.figshare.19146410.v1
Dataset updated
Mar 4, 2022
Dataset provided by
Figshare (http://figshare.com/)
Authors
Rui Simões
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The complete dataset used in the analysis comprises 36 samples, each described by 11 numeric features and 1 target. The attributes considered were caspase 3/7 activity, Mitotracker red CMXRos area and intensity (3 h and 24 h incubations with both compounds), Mitosox oxidation (3 h incubation with the referred compounds) and oxidation rate, DCFDA fluorescence (3 h and 24 h incubations with either compound) and oxidation rate, and DQ BSA hydrolysis. The target of each instance corresponds to one of the 9 possible classes (4 samples per class): Control, 6.25, 12.5, 25 and 50 µM for 6-OHDA and 0.03, 0.06, 0.125 and 0.25 µM for rotenone. The dataset is balanced, contains no missing values, and was standardized across features. The small number of samples prevented a full and strong statistical analysis of the results. Nevertheless, it allowed the identification of relevant hidden patterns and trends.
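The description says the data was standardized across features but does not name the method. A minimal sketch, assuming "standardized" means per-feature z-scoring (zero mean, unit variance); the function name and the toy values are hypothetical and not taken from the dataset:

```python
from statistics import mean, pstdev

def standardize_columns(rows):
    """Z-score each feature (column) of a list of equal-length numeric rows."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Population standard deviation; a constant column is mapped to zeros.
    stds = [pstdev(col) for col in columns]
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in rows
    ]

# Hypothetical toy data: 4 samples, 2 features (assumption, not real measurements).
toy = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
scaled = standardize_columns(toy)
```

After this transformation every feature has mean 0 and standard deviation 1, so features measured on different scales contribute comparably to distance-based methods such as hierarchical clustering.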

Exploratory data analysis, information gain, hierarchical clustering, and supervised predictive modeling were performed using Orange Data Mining version 3.25.1 [41]. Hierarchical clustering was performed using the Euclidean distance metric and weighted linkage. Cluster maps were plotted to relate the features with higher mutual information (in rows) with instances (in columns), with the color of each cell representing the normalized level of a particular feature in a specific instance. The information is grouped both in rows and in columns by a two-way hierarchical clustering method using the Euclidean distances and average linkage. Stratified cross-validation was used to train the supervised decision tree. A set of preliminary empirical experiments was performed to choose the best parameters for each algorithm, and we verified that, within moderate variations, there were no significant changes in the outcome. The following settings were adopted for the decision tree algorithm: minimum number of samples in leaves: 2; minimum number of samples required to split an internal node: 5; stop splitting when majority reaches: 95%; criterion: gain ratio. The performance of the supervised model was assessed using accuracy, precision, recall, F-measure and area under the ROC curve (AUC) metrics.
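The decision tree above uses gain ratio as its splitting criterion. The following is a rough sketch of the textbook definition (information gain divided by split information), not of Orange's internal implementation; the function names and the toy labels are assumptions made for illustration:

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())

def gain_ratio(labels, groups):
    """Gain ratio of a candidate split.

    labels: class label of every sample.
    groups: branch identifier assigned to every sample by the split.
    """
    total = len(labels)
    branches = {}
    for label, group in zip(labels, groups):
        branches.setdefault(group, []).append(label)
    # Expected entropy of the class labels after the split.
    conditional = sum(len(b) / total * entropy(b) for b in branches.values())
    info_gain = entropy(labels) - conditional
    # Split information penalizes splits with many small branches.
    split_info = entropy(groups)
    return info_gain / split_info if split_info > 0 else 0.0

# Hypothetical toy example: a split that separates the two classes perfectly.
perfect = gain_ratio(["a", "a", "b", "b"], ["L", "L", "R", "R"])
# A split that leaves both branches as mixed as the parent node.
useless = gain_ratio(["a", "b", "a", "b"], ["L", "L", "R", "R"])
```

Dividing by the split information is what distinguishes gain ratio from plain information gain: it discourages splits that fragment the data into many branches.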
Dataset for Privacy Exercises
kaggle.com
zip
Updated Apr 9, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shining (2024). Dataset for Privacy Exercises [Dataset]. https://www.kaggle.com/datasets/shiningana/dataset-for-privacy-exercises
Explore at:
Available download formats: zip (7327312 bytes)
Dataset updated
Apr 9, 2024
Authors
Shining
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
This dataset gives some data of a hypothetical business that can be used to practice your privacy data transformation and analysis skills.

The dataset contains the following files/tables:
1. customer_orders_for_privacy_exercises.csv contains data of a business about customer orders (columns separated by commas)
2. users_web_browsing_for_privacy_exercises.csv contains data collected by the business website about its users (columns separated by commas)
3. iot_example.csv contains data collected by a smart device on users' bio-metric data (columns separated by commas)
4. members.csv contains data collected by a library on its users (columns separated by commas)
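One common privacy transformation that can be practiced on comma-separated files like these is pseudonymization of a direct identifier. The sketch below is only an illustration: the column names (`email`, `order_total`), the sample rows, and the salt are hypothetical, and the real columns in the listed CSV files may differ.

```python
import csv
import hashlib
import io

def pseudonymize(rows, column, salt):
    """Replace the values in `column` with truncated salted SHA-256 digests."""
    out = []
    for row in rows:
        row = dict(row)  # copy so the input rows are left untouched
        digest = hashlib.sha256((salt + row[column]).encode("utf-8")).hexdigest()
        row[column] = digest[:16]
        out.append(row)
    return out

# Hypothetical sample standing in for one of the CSV files (assumed columns).
sample_csv = (
    "email,order_total\n"
    "alice@example.com,20.5\n"
    "bob@example.com,13.0\n"
    "alice@example.com,7.25\n"
)
rows = list(csv.DictReader(io.StringIO(sample_csv)))
masked = pseudonymize(rows, "email", salt="demo-salt")
```

The same input value always maps to the same token, so per-customer analysis still works on the masked table. Pseudonymization on its own is not anonymization, since the remaining columns may still allow re-identification.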
Political Analysis Using R: Example Code and Data, Plus Data for Practice...
dataverse.harvard.edu
search.dataone.org
Updated Apr 28, 2020
Cite
Jamie Monogan (2020). Political Analysis Using R: Example Code and Data, Plus Data for Practice Problems [Dataset]. http://doi.org/10.7910/DVN/ARKOTI
Explore at:
Croissant: Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/ARKOTI
Dataset updated
Apr 28, 2020
Dataset provided by
Harvard Dataverse
Authors
Jamie Monogan
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Each R script replicates all of the example code from one chapter from the book. All required data for each script are also uploaded, as are all data used in the practice problems at the end of each chapter. The data are drawn from a wide array of sources, so please cite the original work if you ever use any of these data sets for research purposes.
Dataset for Exploring case-control samples with non-targeted analysis
catalog.data.gov
datasets.ai
Updated Nov 12, 2020
Cite
U.S. EPA Office of Research and Development (ORD) (2020). Dataset for Exploring case-control samples with non-targeted analysis [Dataset]. https://catalog.data.gov/dataset/dataset-for-exploring-case-control-samples-with-non-targeted-analysis
Explore at:
Dataset updated
Nov 12, 2020
Dataset provided by
United States Environmental Protection Agency (http://www.epa.gov/)
Description
These data contain the results of GC-MS, LC-MS and immunochemistry analyses of mask sample extracts. The data include tentatively identified compounds through library searches and compound abundance. This dataset is not publicly accessible because: EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. It can be accessed through the following means: The data can not be accessed. Format: The dataset contains the identification of compounds found in the mask samples as well as the abundance of those compounds for individuals who participated in the trial. This dataset is associated with the following publication: Pleil, J., M. Wallace, J. McCord, M. Madden, J. Sobus, and G. Ferguson. How do cancer-sniffing dogs sort biological samples? Exploring case-control samples with non-targeted LC-Orbitrap, GC-MS, and immunochemistry methods. Journal of Breath Research. Institute of Physics Publishing, Bristol, UK, 14(1): 016006, (2019).
UCI and OpenML Data Sets for Ordinal Quantification
zenodo.org
data.niaid.nih.gov
+1more
zip
Updated Jul 25, 2023
+ more versions
Cite
Mirko Bunse; Alejandro Moreo; Fabrizio Sebastiani; Martin Senz (2023). UCI and OpenML Data Sets for Ordinal Quantification [Dataset]. http://doi.org/10.5281/zenodo.8177302
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.8177302
Dataset updated
Jul 25, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Mirko Bunse; Alejandro Moreo; Fabrizio Sebastiani; Martin Senz
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.

With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.

We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.

Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.
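The two protocols can be sketched as follows. This is an illustrative approximation, not the authors' implementation (which is linked under Further Reading): it draws label distributions uniformly from the probability simplex and keeps the smoothest 20%, where smoothness is measured, as an assumption, by the sum of squared second differences of the distribution. All function names here are hypothetical.

```python
import random

def sample_prevalences(n_classes, rng):
    """Draw a label distribution uniformly from the probability simplex.

    Normalizing independent exponential draws is equivalent to sampling
    from a flat Dirichlet distribution.
    """
    draws = [rng.expovariate(1.0) for _ in range(n_classes)]
    total = sum(draws)
    return [d / total for d in draws]

def jaggedness(p):
    """Sum of squared second differences; lower means a smoother histogram.

    Assumed smoothness measure, chosen for illustration.
    """
    return sum((p[i - 1] - 2 * p[i] + p[i + 1]) ** 2 for i in range(1, len(p) - 1))

def smoothest_fraction(samples, fraction=0.2):
    """Keep the given fraction of samples with the lowest jaggedness."""
    keep = max(1, int(len(samples) * fraction))
    return sorted(samples, key=jaggedness)[:keep]

rng = random.Random(0)  # fixed seed so the draw is reproducible
app_samples = [sample_prevalences(5, rng) for _ in range(100)]  # APP
app_oq_samples = smoothest_fraction(app_samples)  # APP-OQ variant
```

In the published data the samples are given as indices of data items rather than as prevalence vectors, so this sketch only illustrates how the target label distributions of the two protocols differ.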

Usage

You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.

Preliminaries: You have to have a working Julia installation. We have used Julia v1.6.5 in our experiments.

Data Extraction: In your terminal, you can call either

make

(recommended), or

julia --project="." --eval "using Pkg; Pkg.instantiate()"
julia --project="." extract-oq.jl

Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.
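Since quantification targets the distribution of labels rather than individual predictions, a natural first check on an extracted file is its class prevalence. A minimal sketch, assuming the "class_label" column holds integer-coded classes; the feature column name and the values in the toy file are placeholders, not real data:

```python
import csv
import io
from collections import Counter

def class_prevalences(csv_file):
    """Return {class_label: relative frequency} for an extracted CSV file."""
    reader = csv.DictReader(csv_file)  # the first row is the header
    counts = Counter(int(row["class_label"]) for row in reader)
    total = sum(counts.values())
    return {label: counts[label] / total for label in sorted(counts)}

# Hypothetical stand-in for one of the four extracted CSV files.
toy_csv = io.StringIO("class_label,feature_1\n1,0.3\n2,0.1\n2,0.7\n3,0.9\n")
prevalences = class_prevalences(toy_csv)
```

For a real file, the `io.StringIO` object would be replaced by an open file handle, for example `open(path, newline="")`.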

Further Reading

Implementation of our experiments: https://github.com/mirkobunse/regularized-oq
Sentiment Analysis Dataset
cubig.ai
zip
Updated May 20, 2025
Cite
CUBIG (2025). Sentiment Analysis Dataset [Dataset]. https://cubig.ai/store/products/270/sentiment-analysis-dataset
Explore at:
Available download formats: zip
Dataset updated
May 20, 2025
Dataset authored and provided by
CUBIG
License
https://cubig.ai/store/terms-of-service
Measurement technique
Privacy-preserving data transformation via differential privacy, Synthetic data generation using AI techniques for model training
Description
1) Data Introduction
• The Sentiment Analysis Dataset is a dataset for emotional analysis, including large-scale tweet text collected from Twitter and emotional polarity (0=negative, 2=neutral, 4=positive) labels for each tweet, featuring automatic labeling based on emoticons.

2) Data Utilization
(1) Sentiment Analysis Dataset has characteristics that:
• Each sample consists of six columns: emotional polarity, tweet ID, date of writing, search word, author, and tweet body, and is suitable for training natural language processing and classification models using tweet text and emotion labels.
(2) Sentiment Analysis Dataset can be used to:
• Emotional Classification Model Development: Using tweet text and emotional polarity labels, we can build positive, negative, and neutral emotional automatic classification models with various machine learning and deep learning models such as logistic regression, SVM, RNN, and LSTM.
• Analysis of SNS public opinion and trends: By analyzing the distribution of emotions by time series and keywords, you can explore changes in public opinion on specific issues or brands, positive and negative trends, and key emotional keywords.
Self-citation analysis data based on PubMed Central subset (2002-2005)
databank.illinois.edu
aws-databank-alb.library.illinois.edu
Cite
Shubhanshu Mishra; Brent D Fegley; Jana Diesner; Vetle I. Torvik, Self-citation analysis data based on PubMed Central subset (2002-2005) [Dataset]. http://doi.org/10.13012/B2IDB-9665377_V1
Explore at:
Unique identifier
https://doi.org/10.13012/B2IDB-9665377_V1
Authors
Shubhanshu Mishra; Brent D Fegley; Jana Diesner; Vetle I. Torvik
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset funded by
U.S. National Institutes of Health (NIH)
U.S. National Science Foundation (NSF)
Description
Self-citation analysis data based on PubMed Central subset (2002-2005)
----------------------------------------------------------------------
Created by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik on April 5th, 2018

## Introduction
This is a dataset created as part of the publication titled: Mishra S, Fegley BD, Diesner J, Torvik VI (2018) Self-Citation is the Hallmark of Productive Authors, of Any Gender. PLOS ONE. It contains files for running the self-citation analysis on articles published in PubMed Central between 2002 and 2005, collected in 2015. The dataset is distributed in the form of the following tab-separated text files:
* Training_data_2002_2005_pmc_pair_First.txt (1.2G) - Data for first authors
* Training_data_2002_2005_pmc_pair_Last.txt (1.2G) - Data for last authors
* Training_data_2002_2005_pmc_pair_Middle_2nd.txt (964M) - Data for middle 2nd authors
* Training_data_2002_2005_pmc_pair_txt.header.txt - Header for the data
* COLUMNS_DESC.txt file - Descriptions of all columns
* model_text_files.tar.gz - Text files containing model coefficients and scores for model selection.
* results_all_model.tar.gz - Model coefficient and result files in numpy format used for plotting purposes. v4.reviewer contains models for analysis done after reviewer comments.
* README.txt file

## Dataset creation
Our experiments relied on data from multiple sources, including proprietary data from Thomson Reuters' (now Clarivate Analytics) Web of Science collection of MEDLINE citations. Authors interested in reproducing our experiments should personally request this data from Clarivate Analytics. However, we do make available a similar but open dataset based on citations from PubMed Central, which can be used to get results similar to those reported in our analysis. Furthermore, we have also freely shared our datasets, which can be used along with the citation datasets from Clarivate Analytics to re-create the dataset used in our experiments. These datasets are listed below. If you wish to use any of those datasets, please make sure you cite both the dataset and the paper introducing the dataset.
* MEDLINE 2015 baseline: https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html
* Citation data from PubMed Central (original paper includes additional citations from Web of Science)
* Author-ity 2009 dataset:
  - Dataset citation: Torvik, Vetle I.; Smalheiser, Neil R. (2018): Author-ity 2009 - PubMed author name disambiguated dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4222651_V1
  - Paper citation: Torvik, V. I., & Smalheiser, N. R. (2009). Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3), 1–29. https://doi.org/10.1145/1552303.1552304
  - Paper citation: Torvik, V. I., Weeber, M., Swanson, D. R., & Smalheiser, N. R. (2004). A probabilistic similarity metric for Medline records: A model for author name disambiguation. Journal of the American Society for Information Science and Technology, 56(2), 140–158. https://doi.org/10.1002/asi.20105
* Genni 2.0 + Ethnea for identifying author gender and ethnicity:
  - Dataset citation: Torvik, Vetle (2018): Genni + Ethnea for the Author-ity 2009 dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-9087546_V1
  - Paper citation: Smith, B. N., Singh, M., & Torvik, V. I. (2013). A search engine approach to estimating temporal changes in gender orientation of first names. In Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries - JCDL ’13. ACM Press. https://doi.org/10.1145/2467696.2467720
  - Paper citation: Torvik VI, Agarwal S. Ethnea -- an instance-based ethnicity classifier based on geo-coded author names in a large-scale bibliographic database. International Symposium on Science of Science March 22-23, 2016 - Library of Congress, Washington DC, USA. http://hdl.handle.net/2142/88927
* MapAffil for identifying article country of affiliation:
  - Dataset citation: Torvik, Vetle I. (2018): MapAffil 2016 dataset -- PubMed author affiliations mapped to cities and their geocodes worldwide. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4354331_V1
  - Paper citation: Torvik VI. MapAffil: A Bibliographic Tool for Mapping Author Affiliation Strings to Cities and Their Geocodes Worldwide. D-Lib magazine : the magazine of the Digital Library Forum. 2015;21(11-12):10.1045/november2015-torvik
* IMPLICIT journal similarity:
  - Dataset citation: Torvik, Vetle (2018): Author-implicit journal, MeSH, title-word, and affiliation-word pairs based on Author-ity 2009. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4742014_V1
* Novelty dataset for identifying article-level novelty:
  - Dataset citation: Mishra, Shubhanshu; Torvik, Vetle I. (2018): Conceptual novelty scores for PubMed articles. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-5060298_V1
  - Paper citation: Mishra S, Torvik VI. Quantifying Conceptual Novelty in the Biomedical Literature. D-Lib magazine : The Magazine of the Digital Library Forum. 2016;22(9-10):10.1045/september2016-mishra
  - Code: https://github.com/napsternxg/Novelty
* Expertise dataset for identifying author expertise on articles:
* Source code provided at: https://github.com/napsternxg/PubMed_SelfCitationAnalysis

Note: The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in the first week of October, 2016. Check here for information to get PubMed/MEDLINE, and NLM's data Terms and Conditions. Additional data-related updates can be found at Torvik Research Group.

## Acknowledgments
This work was made possible in part with funding to VIT from NIH grant P01AG039347 and NSF grant 1348742. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

## License
Self-citation analysis data based on PubMed Central subset (2002-2005) by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik is licensed under a Creative Commons Attribution 4.0 International License. Permissions beyond the scope of this license may be available at https://github.com/napsternxg/PubMed_SelfCitationAnalysis.
Survival Analysis - Example Data
kaggle.com
zip
Updated Apr 27, 2023
Cite
Lambros Georgiades (2023). Survival Analysis - Example Data [Dataset]. https://www.kaggle.com/datasets/lambrosgeorgiades/sa-data
Explore at:
Available download formats: zip (565 bytes)
Dataset updated
Apr 27, 2023
Authors
Lambros Georgiades
Description
This is a made-up dataset in the context of a test group vs placebo group study that is used in a report introducing the Kaplan-Meier estimation and the Cox proportional hazards model.
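For readers unfamiliar with the Kaplan-Meier estimation mentioned above, here is a minimal sketch of the product-limit estimator in plain Python. The function name and the toy durations are hypothetical and unrelated to the dataset's contents; censored observations tied with an event time are counted as still at risk at that time, which is the usual convention.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival probability)] at each distinct event time.

    durations: follow-up time of each subject.
    events: True if the event was observed, False if the subject was censored.
    """
    data = sorted(zip(durations, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        observed = 0   # events observed exactly at time t
        leaving = 0    # subjects leaving the risk set at time t
        while i < len(data) and data[i][0] == t:
            observed += 1 if data[i][1] else 0
            leaving += 1
            i += 1
        if observed:
            # Product-limit step: multiply by the conditional survival at t.
            survival *= 1 - observed / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve

# Hypothetical toy data: five subjects, two of them censored.
curve = kaplan_meier([1, 2, 2, 3, 4], [True, True, False, True, False])
```

On this toy input the curve steps down at times 1, 2 and 3 to roughly 0.8, 0.6 and 0.3; the censored subject at time 4 produces no step.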
Job Postings Dataset for Labour Market Research and Insights
datarade.ai
Updated Sep 20, 2023
Cite
Oxylabs (2023). Job Postings Dataset for Labour Market Research and Insights [Dataset]. https://datarade.ai/data-products/job-postings-dataset-for-labour-market-research-and-insights-oxylabs
Explore at:
Available download formats: .json, .xml, .csv, .xls
Dataset updated
Sep 20, 2023
Dataset authored and provided by
Oxylabs
Area covered
British Indian Ocean Territory, Tajikistan, Anguilla, Zambia, Switzerland, Luxembourg, Jamaica, Togo, Sierra Leone, Kyrgyzstan
Description
Introducing Job Posting Datasets: Uncover labor market insights!

Elevate your recruitment strategies, forecast future labor industry trends, and unearth investment opportunities with Job Posting Datasets.

Job Posting Datasets Source:

Indeed: Access datasets from Indeed, a leading employment website known for its comprehensive job listings.

Glassdoor: Receive ready-to-use employee reviews, salary ranges, and job openings from Glassdoor.

StackShare: Access StackShare datasets to make data-driven technology decisions.

Job Posting Datasets provide meticulously acquired and parsed data, freeing you to focus on analysis. You'll receive clean, structured, ready-to-use job posting data, including job titles, company names, seniority levels, industries, locations, salaries, and employment types.

Choose your preferred dataset delivery options for convenience:

Receive datasets in various formats, including CSV, JSON, and more. Opt for storage solutions such as AWS S3, Google Cloud Storage, and more. Customize data delivery frequencies, whether one-time or per your agreed schedule.

Why Choose Oxylabs Job Posting Datasets:

Fresh and accurate data: Access clean and structured job posting datasets collected by our seasoned web scraping professionals, enabling you to dive into analysis.

Time and resource savings: Focus on data analysis and your core business objectives while we efficiently handle the data extraction process cost-effectively.

Customized solutions: Tailor our approach to your business needs, ensuring your goals are met.

Legal compliance: Partner with a trusted leader in ethical data collection. Oxylabs is a founding member of the Ethical Web Data Collection Initiative, aligning with GDPR and CCPA best practices.

Pricing Options:

Standard Datasets: choose from various ready-to-use datasets with standardized data schemas, priced from $1,000/month.

Custom Datasets: Tailor datasets from any public web domain to your unique business needs. Contact our sales team for custom pricing.

Experience a seamless journey with Oxylabs:

Understanding your data needs: We work closely to understand your business nature and daily operations, defining your unique data requirements.

Developing a customized solution: Our experts create a custom framework to extract public data using our in-house web scraping infrastructure.

Delivering data sample: We provide a sample for your feedback on data quality and the entire delivery process.

Continuous data delivery: We continuously collect public data and deliver custom datasets per the agreed frequency.

Effortlessly access fresh job posting data with Oxylabs Job Posting Datasets.
Data Sheet 2_Visual analysis of multi-omics data.csv
frontiersin.figshare.com
csv
Updated Sep 10, 2024
+ more versions
Cite
Austin Swart; Ron Caspi; Suzanne Paley; Peter D. Karp (2024). Data Sheet 2_Visual analysis of multi-omics data.csv [Dataset]. http://doi.org/10.3389/fbinf.2024.1395981.s002
Explore at:
Available download formats: csv
Unique identifier
https://doi.org/10.3389/fbinf.2024.1395981.s002
Dataset updated
Sep 10, 2024
Dataset provided by
Frontiers
Authors
Austin Swart; Ron Caspi; Suzanne Paley; Peter D. Karp
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We present a tool for multi-omics data analysis that enables simultaneous visualization of up to four types of omics data on organism-scale metabolic network diagrams. The tool’s interactive web-based metabolic charts depict the metabolic reactions, pathways, and metabolites of a single organism as described in a metabolic pathway database for that organism; the charts are constructed using automated graphical layout algorithms. The multi-omics visualization facility paints each individual omics dataset onto a different “visual channel” of the metabolic-network diagram. For example, a transcriptomics dataset might be displayed by coloring the reaction arrows within the metabolic chart, while a companion proteomics dataset is displayed as reaction arrow thicknesses, and a complementary metabolomics dataset is displayed as metabolite node colors. Once the network diagrams are painted with omics data, semantic zooming provides more details within the diagram as the user zooms in. Datasets containing multiple time points can be displayed in an animated fashion. The tool will also graph data values for individual reactions or metabolites designated by the user. The user can interactively adjust the mapping from data value ranges to the displayed colors and thicknesses to provide more informative diagrams.
[Dataset] Advanced Single Cell Analysis tutorial - Complete downstream...
data.niaid.nih.gov
Updated Mar 7, 2024
Cite
Soraggi, Samuele; Andersen, Stig Uggerhøj; Fechete, Lavinia Ioana; Tedeschi, Francesca; Frank, Manuel (2024). [Dataset] Advanced Single Cell Analysis tutorial - Complete downstream analysis across conditions [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10782589
Explore at:
Dataset updated
Mar 7, 2024
Dataset provided by
Aarhus University
BiRC (Bioinformatics Research Center, Aarhus University)
Authors
Soraggi, Samuele; Andersen, Stig Uggerhøj; Fechete, Lavinia Ioana; Tedeschi, Francesca; Frank, Manuel
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
Datasets and metadata used for the full streamlined analysis of plant data under different conditions of infection. The tutorial is an example of an analysis that can be useful in multiple scenarios where comparisons are needed (healthy and sick patients, for example). You can find the tutorial at our website https://hds-sandbox.github.io/AdvancedSingleCell

Usage notes:

All files are ready to use, except for control1.tar.gz, which is a folder that needs to be decompressed.
Sample Sales Dataset
cubig.ai
zip
Updated Jun 15, 2025
Cite
CUBIG (2025). Sample Sales Dataset [Dataset]. https://cubig.ai/store/products/477/sample-sales-dataset
Explore at:
Available download formats: zip
Dataset updated
Jun 15, 2025
Dataset authored and provided by
CUBIG
License
https://cubig.ai/store/terms-of-service
Measurement technique
Synthetic data generation using AI techniques for model training, Privacy-preserving data transformation via differential privacy
Description
1) Data Introduction
• The Sample Sales Data is a retail sales dataset of 2,823 orders and 25 columns that includes a variety of sales-related data, including order numbers, product information, quantity, unit price, sales, order date, order status, customer and delivery information.

2) Data Utilization
(1) Sample Sales Data has characteristics that:
• This dataset consists of numerical (sales, quantity, unit price, etc.), categorical (product, country, city, customer name, transaction size, etc.), and date (order date) variables, with missing values in some columns (STATE, ADDRESSLINE2, POSTALCODE, etc.).
(2) Sample Sales Data can be used to:
• Analysis of sales trends and performance by product: Key variables such as order date, product line, and country can be used to visualize and analyze monthly and yearly sales trends, the proportion of sales by product line, and top sales by country and region.
• Segmentation and marketing strategies: Segmentation of customer groups based on customer information, transaction size, and regional data, and use them to design targeted marketing and customized promotion strategies.
Dataset of development of business during the COVID-19 crisis
data.mendeley.com
narcis.nl
Updated Nov 9, 2020
Cite
Tatiana N. Litvinova (2020). Dataset of development of business during the COVID-19 crisis [Dataset]. http://doi.org/10.17632/9vvrd34f8t.1
Explore at:
Unique identifier
https://doi.org/10.17632/9vvrd34f8t.1
Dataset updated
Nov 9, 2020
Authors
Tatiana N. Litvinova
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
To create the dataset, the top 10 countries leading in the incidence of COVID-19 in the world were selected as of October 22, 2020 (on the eve of the second wave of the pandemic), which are presented in the Global 500 ranking for 2020: USA, India, Brazil, Russia, Spain, France and Mexico. For each of these countries, no more than 10 of the largest transnational corporations included in the Global 500 rating for 2020 and 2019 were selected separately. The arithmetic averages and the change (increase) were calculated for indicators such as the profitability of enterprises, their ranking position (competitiveness), asset value and number of employees. The arithmetic mean values of these indicators for all countries of the sample were found, characterizing the situation in international entrepreneurship as a whole in the context of the COVID-19 crisis in 2020 on the eve of the second wave of the pandemic. The data is collected in a general Microsoft Excel table. The dataset is a unique database that combines COVID-19 statistics and entrepreneurship statistics. The dataset is flexible and can be supplemented with data from other countries and newer statistics on the COVID-19 pandemic. Because the data in the dataset are formulas rather than ready-made numbers, adding or changing values in the original table at the beginning of the dataset automatically recalculates most of the subsequent tables and updates the graphs. This allows the dataset to be used not just as an array of data, but as an analytical tool for automating scientific research on the impact of the COVID-19 pandemic and crisis on international entrepreneurship. The dataset includes not only tabular data, but also charts that provide data visualization. The dataset contains not only actual, but also forecast data on morbidity and mortality from COVID-19 for the period of the second wave of the pandemic in 2020.
The forecasts are presented in the form of a normal distribution of predicted values and the probability of their occurrence in practice. This allows for a broad scenario analysis of the impact of the COVID-19 pandemic and crisis on international entrepreneurship, substituting various predicted morbidity and mortality rates in risk assessment tables and obtaining automatically calculated consequences (changes) on the characteristics of international entrepreneurship. It is also possible to substitute the actual values identified during and after the second wave of the pandemic to check the reliability of the earlier forecasts and conduct a plan-fact analysis. The dataset contains not only the numerical values of the initial and predicted values of the set of studied indicators, but also their qualitative interpretation, reflecting the presence and level of risks of a pandemic and COVID-19 crisis for international entrepreneurship.
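To make the idea of a forecast expressed as a normal distribution concrete, the following sketch uses Python's standard `statistics.NormalDist`. The mean and standard deviation are hypothetical placeholders, not figures from the dataset, and the Excel workbook itself may compute its probabilities differently.

```python
from statistics import NormalDist

# Hypothetical forecast of an indicator (assumed values, for illustration only).
forecast = NormalDist(mu=1000.0, sigma=100.0)

# Probability that the realized value falls within one standard deviation
# of the forecast mean (about 0.68 for any normal distribution).
p_within = forecast.cdf(1100.0) - forecast.cdf(900.0)

# Value that the indicator is forecast to stay below with 95% probability.
upper_95 = forecast.inv_cdf(0.95)

# Plan-fact check: how extreme is a (hypothetical) observed value?
observed = 1250.0
z_score = (observed - forecast.mean) / forecast.stdev
```

A scenario analysis in this spirit substitutes different plausible values, such as `upper_95`, into the downstream risk tables and compares the resulting changes.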
Science Education Research Topic Modeling Dataset
zenodo.org
data.niaid.nih.gov
bin, html +2
Updated Oct 9, 2024
Cite
Tor Ole B. Odden; Alessandro Marin; John L. Rudolph (2024). Science Education Research Topic Modeling Dataset [Dataset]. http://doi.org/10.5281/zenodo.4094974
Explore at:
Available download formats: bin, txt, html, text/x-python
Unique identifier
https://doi.org/10.5281/zenodo.4094974
Dataset updated
Oct 9, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Tor Ole B. Odden; Alessandro Marin; John L. Rudolph
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
This dataset contains scraped and processed text from roughly 100 years of articles published in the Wiley journal Science Education (formerly General Science Quarterly). This text has been cleaned and filtered in preparation for analysis using natural language processing techniques, particularly topic modeling with latent Dirichlet allocation (LDA). We also include a Jupyter Notebook illustrating how one can use LDA to analyze this dataset and extract latent topics from it, as well as analyze the rise and fall of those topics over the history of the journal.

The articles were downloaded and scraped in December of 2019. Only non-duplicate articles with a listed author (according to the CrossRef metadata database) were included, and due to missing data and text recognition issues we excluded all articles published prior to 1922. This resulted in 5577 articles in total being included in the dataset. The text of these articles was then cleaned in the following way:

We removed duplicated text from each article: prior to 1969, articles in the journal were published in a magazine format in which the end of one article and the beginning of the next would share the same page, so we developed an automated detection of article beginnings and endings that was able to remove any duplicate text.

We removed the reference sections of the articles, as well as headings (in all caps) such as “ABSTRACT”.

We reunited any partial words that were separated due to line breaks, text recognition issues, or British vs. American spellings (for example, converting “per cent” to “percent”).

We removed all numbers, symbols, special characters, and punctuation, and lowercased all words.

We removed all stop words, which are words without any semantic meaning on their own (“the”, “in”, “if”, “and”, “but”, etc.), and all single-letter words.

We lemmatized all words, with the added step of including a part-of-speech tagger so our algorithm would only aggregate and lemmatize words from the same part of speech (e.g., nouns vs. verbs).

We detected and created bi-grams, sets of words that frequently co-occur and carry additional meaning together. These words were combined with an underscore: for example, “problem_solving” and “high_school”.
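The cleaning steps above can be roughly illustrated with a short Python sketch. This is not the authors' pipeline: the stop-word list and the bigram set are tiny hypothetical stand-ins, the function names are invented for illustration, and lemmatization and statistical bigram detection are left out entirely.

```python
import re

# Tiny illustrative stop-word list; the authors' actual list is not given.
STOP_WORDS = {"the", "in", "if", "and", "but", "of", "a", "to", "is"}

# Hypothetical bigram vocabulary. The description says bigrams were detected
# from co-occurrence, which this sketch does not attempt.
BIGRAMS = {("problem", "solving"), ("high", "school")}


def clean_tokens(text):
    """Lowercase, strip non-letters, drop stop words and single-letter words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # numbers, symbols, punctuation -> spaces
    return [w for w in text.split() if len(w) > 1 and w not in STOP_WORDS]


def join_bigrams(tokens, bigrams=BIGRAMS):
    """Merge adjacent token pairs found in `bigrams` into one underscore-joined token."""
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in bigrams:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


tokens = join_bigrams(clean_tokens("Problem solving in the high school: 42 per cent!"))
print(tokens)  # ['problem_solving', 'high_school', 'per', 'cent']
```

Note that “per” and “cent” stay separate here; reuniting split words, as described in the steps above, would need its own rule.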

After filtering, each document was turned into a list of individual words (or tokens), which were then collected and saved (using the Python pickle format) into the file scied_words_bigrams_V5.pkl.
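A minimal sketch of reading such a file might look like the following. It assumes the pickle holds a list of token lists (one list per article), which matches the description but has not been checked against the file itself; `load_documents` and `top_tokens` are hypothetical helper names. Pickle files can execute code when loaded, so only open files from a source you trust.

```python
import pickle
from collections import Counter


def load_documents(path):
    """Load a pickled list of token lists (assumed layout of the main data file)."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def top_tokens(documents, n=10):
    """Count token frequencies across all documents and return the n most common."""
    counts = Counter(tok for doc in documents for tok in doc)
    return counts.most_common(n)


# Example usage, assuming the file has been downloaded to the working directory:
# docs = load_documents("scied_words_bigrams_V5.pkl")
# print(len(docs), top_tokens(docs, 5))
```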

In addition to this file, we have also included the following files:

SciEd_paper_names_weights.pkl: A file containing limited metadata (title, author, year published, and DOI) for each of the papers, in the same order as they appear within the main data file. This file also includes the weights assigned by an LDA model used to analyze the data.

Science Education LDA Notebook.ipynb: A notebook file that replicates our LDA analysis, with a written explanation of all of the steps and suggestions on how to explore the results.

Supporting files for the notebook. These include the requirements, the README, a helper script with functions for plotting that were too long to include in the notebook, and two HTML graphs that are embedded into the notebook.

This dataset is shared under the terms of the Wiley Text and Data Mining Agreement, which allows users to share text and data mining output for non-commercial research purposes. Any questions or comments can be directed to Tor Ole Odden, t.o.odden@fys.uio.no.
Data from the Chemical Analysis of Archived Stream-Sediment Samples, Alaska
catalog.data.gov
data.usgs.gov
+2 more
Updated Nov 13, 2025
+ more versions
Cite
U.S. Geological Survey (2025). Data from the Chemical Analysis of Archived Stream-Sediment Samples, Alaska [Dataset]. https://catalog.data.gov/dataset/data-from-the-chemical-analysis-of-archived-stream-sediment-samples-alaska
Explore at:
Dataset updated
Nov 13, 2025
Dataset provided by
United States Geological Survey (http://www.usgs.gov/)
Area covered
Alaska
Description
This data release contains the elemental concentration data for more than 1700 archived stream-sediment samples collected in Alaska. Samples were retrieved from the USGS Mineral Program's sample archive in Denver, CO, and the Alaska Division of Geological and Geophysical Surveys Geologic Materials Center in Anchorage, AK. All samples were analyzed using a multi-element analytical method involving fusion of the sample by sodium peroxide, dissolution of the fusion cake by nitric acid, and elemental analysis by inductively coupled plasma-optical emission spectroscopy (ICP-OES) and inductively coupled plasma-mass spectroscopy (ICP-MS). Additionally, 106 samples from the Nixon Fork area were analyzed by a second multi-element method in which the samples are decomposed by a mixture of hydrochloric, nitric, perchloric, and hydrofluoric acids and the elemental composition is determined by ICP-OES and ICP-MS. New Hg (mercury) concentrations, determined by cold-vapor atomic absorption spectrometry, are reported for 296 samples from southeast Alaska.
Home Depot products dataset
crawlfeeds.com
csv, zip
Updated Jun 13, 2025
+ more versions
Cite
Crawl Feeds (2025). Home Depot products dataset [Dataset]. https://crawlfeeds.com/datasets/home-depot-products-dataset
Explore at:
Available download formats: zip, csv
Dataset updated
Jun 13, 2025
Dataset authored and provided by
Crawl Feeds
License
https://crawlfeeds.com/privacy_policy
Description
Unlock valuable insights with our comprehensive Home Depot product dataset. This dataset is meticulously curated, offering detailed information on a wide range of products available at Home Depot.

Whether you're conducting market research, enhancing your e-commerce platform, or analyzing retail trends, this dataset is an invaluable resource. It includes product names, descriptions, prices, categories, and more. Optimize your projects with high-quality, structured data from one of the largest home improvement retailers in the world.

Stay ahead in the competitive market with accurate and up-to-date product information.

The latest Home Depot products dataset has around 2 million records. Get in touch with Crawl Feeds to request any updates to the dataset.

For a closer look at the product-level data we’ve extracted from Home Depot, including pricing, stock status, and detailed specifications, visit the Home Depot dataset page. You can explore sample records and submit a request for tailored extracts directly from there.
Streaming Service Data
kaggle.com
Updated Dec 19, 2024
Cite
Chad Wambles (2024). Streaming Service Data [Dataset]. https://www.kaggle.com/datasets/chadwambles/streaming-service-data
Explore at:
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Dec 19, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Chad Wambles
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
A dataset I generated to showcase a sample set of user data for a fictional streaming service. This data is great for practicing SQL, Excel, Tableau, or Power BI.

1000 rows and 25 columns of connected data.

See below for column descriptions.

Enjoy :)
Language Dataset
data.ncl.ac.uk
json
Updated Nov 30, 2023
Cite
David Towers; Rob Geada; Amir Atapour-Abarghouei; Andrew Stephen McGough (2023). Language Dataset [Dataset]. http://doi.org/10.25405/data.ncl.24574729.v1
Explore at:
Available download formats: json
Unique identifier
https://doi.org/10.25405/data.ncl.24574729.v1
Dataset updated
Nov 30, 2023
Dataset provided by
Newcastle University
Authors
David Towers; Rob Geada; Amir Atapour-Abarghouei; Andrew Stephen McGough
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset containing the images and labels for the Language data used in the CVPR NAS workshop Unseen-data challenge under the codename "LaMelo". The Language dataset is a constructed dataset using words from aspell dictionaries. The intention of this dataset is to require machine learning models to not only perform image classification but also linguistic analysis to figure out which letter frequency is associated with each language. For each Language image we selected four six-letter words using the standard Latin alphabet and removed any words with letters that used diacritics (such as é or ü) or included ‘y’ or ‘z’. We encode these words on a graph with one axis representing the index of the 24 character long string (the four words joined together) and the other representing the letter (going A-X). The data is in a channels-first format with a shape of (n, 1, 24, 24) where n is the number of samples in the corresponding set (50,000 for training, 10,000 for validation, and 10,000 for testing). There are ten classes in the dataset, with 7,000 examples of each, distributed evenly between the three subsets. The ten classes and corresponding numerical label are as follows: English: 0, Dutch: 1, German: 2, Spanish: 3, French: 4, Portuguese: 5, Swahili: 6, Zulu: 7, Finnish: 8, Swedish: 9
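The word-to-image encoding described above can be sketched in plain Python as follows. This is only one possible reading: the description does not say which axis holds the letter and which holds the string position, so the sketch assumes rows are letters (A-X) and columns are positions. The function name `encode_words` and the example words are hypothetical.

```python
import string

LETTERS = string.ascii_uppercase[:24]  # 'A'..'X'; 'Y' and 'Z' are excluded


def encode_words(words):
    """Encode four six-letter words as a 24x24 one-hot grid.

    Assumption: rows index the letter (A-X) and columns index the position
    in the joined 24-character string.
    """
    if len(words) != 4 or any(len(w) != 6 for w in words):
        raise ValueError("expected four six-letter words")
    joined = "".join(words).upper()
    grid = [[0] * 24 for _ in range(24)]
    for position, letter in enumerate(joined):
        row = LETTERS.index(letter)  # raises ValueError for letters outside A-X
        grid[row][position] = 1
    return grid


grid = encode_words(["garden", "planet", "window", "forest"])
print(sum(map(sum, grid)))  # 24: one cell set per character position
```

Wrapping a single grid as `[grid]` gives the (1, 24, 24) channels-first layout for one sample.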