100+ datasets found

Orange dataset table
figshare.com
xlsx
Updated Mar 4, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rui Simões (2022). Orange dataset table [Dataset]. http://doi.org/10.6084/m9.figshare.19146410.v1
Explore at:
xlsxAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.19146410.v1
Dataset updated
Mar 4, 2022
Dataset provided by
figshare
Figsharehttp://figshare.com/
Authors
Rui Simões
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The complete dataset used in the analysis comprises 36 samples, each described by 11 numeric features and 1 target. The attributes considered were caspase 3/7 activity, Mitotracker red CMXRos area and intensity (3 h and 24 h incubations with both compounds), Mitosox oxidation (3 h incubation with the referred compounds) and oxidation rate, DCFDA fluorescence (3 h and 24 h incubations with either compound) and oxidation rate, and DQ BSA hydrolysis. The target of each instance corresponds to one of the 9 possible classes (4 samples per class): Control, 6.25, 12.5, 25 and 50 µM for 6-OHDA and 0.03, 0.06, 0.125 and 0.25 µM for rotenone. The dataset is balanced, it does not contain any missing values and data was standardized across features. The small number of samples prevented a full and strong statistical analysis of the results. Nevertheless, it allowed the identification of relevant hidden patterns and trends.

Exploratory data analysis, information gain, hierarchical clustering, and supervised predictive modeling were performed using Orange Data Mining version 3.25.1 [41]. Hierarchical clustering was performed using the Euclidean distance metric and weighted linkage. Cluster maps were plotted to relate the features with higher mutual information (in rows) with instances (in columns), with the color of each cell representing the normalized level of a particular feature in a specific instance. The information is grouped both in rows and in columns by a two-way hierarchical clustering method using the Euclidean distances and average linkage. Stratified cross-validation was used to train the supervised decision tree. A set of preliminary empirical experiments were performed to choose the best parameters for each algorithm, and we verified that, within moderate variations, there were no significant changes in the outcome. The following settings were adopted for the decision tree algorithm: minimum number of samples in leaves: 2; minimum number of samples required to split an internal node: 5; stop splitting when majority reaches: 95%; criterion: gain ratio. The performance of the supervised model was assessed using accuracy, precision, recall, F-measure and area under the ROC curve (AUC) metrics.
Covid-19 variants survival data
kaggle.com
zip
Updated Jan 2, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Massock Batalong Maurice Blaise (2025). Covid-19 variants survival data [Dataset]. https://www.kaggle.com/datasets/lumierebatalong/covid-19-variants-survival-data
Explore at:
zip(216589 bytes)Available download formats
Dataset updated
Jan 2, 2025
Authors
Massock Batalong Maurice Blaise
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Overview:

This dataset provides a unique resource for researchers and data scientists interested in the global dynamics of the COVID-19 pandemic. It focuses on the impact of different SARS-CoV-2 variants and mutations on the duration of local epidemics. By combining variant information with epidemiological data, this dataset allows for a comprehensive analysis of factors influencing the trajectory of the pandemic.

Key Features:

Global Coverage: Includes data from multiple countries.

Variant-Specific Information: Detailed records for various SARS-CoV-2 variants.

Epidemic Duration: Data on the duration of local epidemics, accounting for right-censoring.

Epidemiological Variables: Includes mortality rates, a proxy for R0, transmission proxies, and other pertinent variables.

Geographical characteristics: Include a continent variable for exploring geographical patterns

Time varying variables: Include the number of waves and the number of variants in the different countries for more in-depth exploration.

Data Source: The data combines information from the Johns Hopkins University COVID-19 dataset (confirmed_cases.csv and deaths_cases.csv) and the covariants.org dataset (variants.csv). The dataset you see here is the combination of two datasets from Johns Hopkins University and covariants.org.

Questions to Inspire Users:

This dataset is designed for a diverse set of analytical questions. Here are some ideas to inspire the Kaggle community:

Survival Analysis:

How do different SARS-CoV-2 variants influence the duration of local epidemics?

Which factors (mortality, R0, etc.) are most strongly associated with shorter or longer epidemic durations?

Does the type of variant/mutation (mutation,S, Omicron, Delta, Other) have a significant impact on epidemic duration?

Is there a geographical pattern to the duration of epidemics?

Epidemiological Analysis:

How do local transmission rates (represented by our proxy of R0) affect the duration of an epidemic?

Do countries with higher mortality rates have different patterns of epidemic progression?

How can we predict the duration of an epidemic based on its initial characteristics?

How does the number of epidemic waves impact the duration of an epidemic?

Does the number of variants in a country affect the duration of an épidémie?

Data Science/Machine Learning:

Can we develop a machine learning model to predict the duration of an epidemic?

What features have the best predictive power ?

Can we identify clusters of variants/regions with similar epidemic patterns?

Are there interactions between variables that can explain the non-linearities that we have identified ?
u
Amazon review data 2018
cseweb.ucsd.edu
nijianmo.github.io
+1more
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
UCSD CSE Research Project, Amazon review data 2018 [Dataset]. https://cseweb.ucsd.edu/~jmcauley/datasets/amazon_v2/
Explore at:
Dataset authored and provided by
UCSD CSE Research Project
Description
Context

This Dataset is an updated version of the Amazon review dataset released in 2014. As in the previous version, this dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs). In addition, this version provides the following features:

More reviews:

The total number of reviews is 233.1 million (142.8 million in 2014).

New reviews:

Current data includes reviews in the range May 1996 - Oct 2018.

Metadata: - We have added transaction metadata for each review shown on the review page.

Added more detailed metadata of the product landing page.

Acknowledgements

If you publish articles based on this dataset, please cite the following paper:

Jianmo Ni, Jiacheng Li, Julian McAuley. Justifying recommendations using distantly-labeled reviews and fined-grained aspects. EMNLP, 2019.
Exploring E-commerce Trends⭐️⭐️⭐️
kaggle.com
zip
Updated Jul 8, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Muhammad Roshan Riaz (2024). Exploring E-commerce Trends⭐️⭐️⭐️ [Dataset]. https://www.kaggle.com/datasets/muhammadroshaanriaz/e-commerce-trends-a-guide-to-leveraging-dataset
Explore at:
zip(51169 bytes)Available download formats
Dataset updated
Jul 8, 2024
Authors
Muhammad Roshan Riaz
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Exploring E-commerce Trends: A Guide to Leveraging Dummy Dataset

Introduction: In the world of e-commerce, data is a powerful asset that can be leveraged to understand customer behavior, improve sales strategies, and enhance overall business performance. This guide explores how to effectively utilize a dummy dataset generated to simulate various aspects of an e-commerce platform. By analyzing this dataset, businesses can gain valuable insights into product trends, customer preferences, and market dynamics.

Dataset Overview: The dummy dataset contains information on 1000 products across different categories such as electronics, clothing, home & kitchen, books, toys & games, and more. Each product is associated with attributes such as price, rating, number of reviews, stock quantity, discounts, sales, and date added to inventory. This comprehensive dataset provides a rich source of information for analysis and exploration.

Data Analysis: Using tools like Pandas, NumPy, and visualization libraries like Matplotlib or Seaborn, businesses can perform in-depth analysis of the dataset. Key insights such as top-selling products, popular product categories, pricing trends, and seasonal variations can be extracted through exploratory data analysis (EDA). Visualization techniques can be employed to create intuitive graphs and charts for better understanding and communication of findings.

Machine Learning Applications: The dataset can be used to train machine learning models for various e-commerce tasks such as product recommendation, sales prediction, customer segmentation, and sentiment analysis. By applying algorithms like linear regression, decision trees, or neural networks, businesses can develop predictive models to optimize inventory management, personalize customer experiences, and drive sales growth.

Testing and Prototyping: Businesses can utilize the dummy dataset to test new algorithms, prototype new features, or conduct A/B testing experiments without impacting real user data. This enables rapid iteration and experimentation to validate hypotheses and refine strategies before implementation in a live environment.

Educational Resources: The dummy dataset serves as an invaluable educational resource for students, researchers, and professionals interested in learning about e-commerce data analysis and machine learning. Tutorials, workshops, and online courses can be developed using the dataset to teach concepts such as data manipulation, statistical analysis, and model training in the context of e-commerce.

Decision Support and Strategy Development: Insights derived from the dataset can inform strategic decision-making processes and guide business strategy development. By understanding customer preferences, market trends, and competitor behavior, businesses can make informed decisions regarding product assortment, pricing strategies, marketing campaigns, and resource allocation.

Conclusion: In conclusion, the dummy dataset provides a versatile and valuable resource for exploring e-commerce trends, understanding customer behavior, and driving business growth. By leveraging this dataset effectively, businesses can unlock actionable insights, optimize operations, and stay ahead in today's competitive e-commerce landscape
A
Artificial Intelligence Training Dataset Report
archivemarketresearch.com
doc, pdf, ppt
Updated Feb 21, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Archive Market Research (2025). Artificial Intelligence Training Dataset Report [Dataset]. https://www.archivemarketresearch.com/reports/artificial-intelligence-training-dataset-38645
Explore at:
pdf, ppt, docAvailable download formats
Dataset updated
Feb 21, 2025
Dataset authored and provided by
Archive Market Research
License
https://www.archivemarketresearch.com/privacy-policyhttps://www.archivemarketresearch.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The global Artificial Intelligence (AI) Training Dataset market is projected to reach $1605.2 million by 2033, exhibiting a CAGR of 9.4% from 2025 to 2033. The surge in demand for AI training datasets is driven by the increasing adoption of AI and machine learning technologies in various industries such as healthcare, financial services, and manufacturing. Moreover, the growing need for reliable and high-quality data for training AI models is further fueling the market growth. Key market trends include the increasing adoption of cloud-based AI training datasets, the emergence of synthetic data generation, and the growing focus on data privacy and security. The market is segmented by type (image classification dataset, voice recognition dataset, natural language processing dataset, object detection dataset, and others) and application (smart campus, smart medical, autopilot, smart home, and others). North America is the largest regional market, followed by Europe and Asia Pacific. Key companies operating in the market include Appen, Speechocean, TELUS International, Summa Linguae Technologies, and Scale AI. Artificial Intelligence (AI) training datasets are critical for developing and deploying AI models. These datasets provide the data that AI models need to learn, and the quality of the data directly impacts the performance of the model. The AI training dataset market landscape is complex, with many different providers offering datasets for a variety of applications. The market is also rapidly evolving, as new technologies and techniques are developed for collecting, labeling, and managing AI training data.
Association rule mining data for census tract chemical exposure analysis
catalog.data.gov
s.cnmilf.com
+1more
Updated Nov 12, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. EPA Office of Research and Development (ORD) (2020). Association rule mining data for census tract chemical exposure analysis [Dataset]. https://catalog.data.gov/dataset/association-rule-mining-data-for-census-tract-chemical-exposure-analysis
Explore at:
Dataset updated
Nov 12, 2020
Dataset provided by
United States Environmental Protection Agencyhttp://www.epa.gov/
Description
Chemical concentration, exposure, and health risk data for U.S. census tracts from National Scale Air Toxics Assessment (NATA). This dataset is associated with the following publication: Huang, H., R. Tornero-Velez, and T. Barzyk. Associations between socio-demographic characteristics and chemical concentrations contributing to cumulative exposures in the United States. Journal of Exposure Science and Environmental Epidemiology. Nature Publishing Group, London, UK, 27(6): 544-550, (2017).
m
Global Burden of Disease analysis dataset of BMI and CVD outcomes, risk...
data.mendeley.com
Updated Aug 17, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
David Cundiff (2021). Global Burden of Disease analysis dataset of BMI and CVD outcomes, risk factors, and SAS codes [Dataset]. http://doi.org/10.17632/g6b39zxck4.6
Explore at:
Unique identifier
https://doi.org/10.17632/g6b39zxck4.6
Dataset updated
Aug 17, 2021
Authors
David Cundiff
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This formatted dataset originates from raw data files from the Institute of Health Metrics and Evaluation Global Burden of Disease (GBD2017). It is population weighted worldwide data on male and female cohorts ages 15-69 years including body mass index (BMI) and cardiovascular disease (CVD) and associated dietary, metabolic and other risk factors. The purpose of creating this formatted database is to explore the univariate and multiple regression correlations of BMI and CVD and other health outcomes with risk factors. Our research hypothesis is that we can successfully apply artificial intelligence to model BMI and CVD risk factors and health outcomes. We derived a BMI multiple regression risk factor formula that satisfied all nine Bradford Hill causality criteria for epidemiology research. We found that animal products and added fats are negatively correlated with CVD early deaths worldwide but positively correlated with CVD early deaths in high quantities. We interpret this as showing that optimal cardiovascular outcomes come with moderate (not low and not high) intakes of animal foods and added fats.

For questions, please email davidkcundiff@gmail.com. Thanks.
Data from: SMEX04 Soil Climate Analysis Network (SCAN) Data: Arizona,...
data.nasa.gov
datasets.ai
+6more
Updated Apr 1, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
nasa.gov (2025). SMEX04 Soil Climate Analysis Network (SCAN) Data: Arizona, Version 1 [Dataset]. https://data.nasa.gov/dataset/smex04-soil-climate-analysis-network-scan-data-arizona-version-1-69d65
Explore at:
Dataset updated
Apr 1, 2025
Dataset provided by
NASAhttp://nasa.gov/
Description
Notice to Data Users: The documentation for this data set was provided solely by the Principal Investigator(s) and was not further developed, thoroughly reviewed, or edited by NSIDC. Thus, support for this data set may be limited.This data set contains measurements taken during the Soil Moisture Experiment 2004 (SMEX04) in southern Arizona, USA. The SCAN station houses numerous sensors which were used to automatically record the data.
Social Media Datasets
promptcloud.com
csv
Updated Jul 28, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
PromptCloud (2025). Social Media Datasets [Dataset]. https://www.promptcloud.com/dataset/social-media/
Explore at:
csvAvailable download formats
Dataset updated
Jul 28, 2025
Dataset authored and provided by
PromptCloud
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Social media datasets provide real-time insight into public opinion, trending topics, user behavior, sentiment, and global events as reflected on platforms like Twitter (X), Facebook, and Instagram. These datasets are crucial for marketing analysts, newsrooms, political strategists, crisis response teams, and brand managers to monitor discourse and take data-driven action. Extracted from live user-generated content, […]
Seasonal analysis data
s.cnmilf.com
datasets.ai
+1more
Updated Nov 6, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. EPA Office of Research and Development (ORD) (2022). Seasonal analysis data [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/seasonal-analysis-data
Explore at:
Dataset updated
Nov 6, 2022
Dataset provided by
United States Environmental Protection Agencyhttp://www.epa.gov/
Description
This data set contains seasonal number for fecal coliform, E. coli, and enterococci. This dataset is associated with the following publication: Selvakumar, A., and T. Oconnor. Seasonal Variation in Indicator Organisms Infiltrating from Permeable Pavement Parking Lots at the Edison Environmental Center, New Jersey. WATER RESEARCH. Elsevier Science Ltd, New York, NY, USA, 94(9): e10791, (2022).
U
Geospatial Datasets associated with Topographic Change Analysis in Sleeping...
data.usgs.gov
datasets.ai
+1more
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jessica Dewitt, Geospatial Datasets associated with Topographic Change Analysis in Sleeping Bear Dunes National Lakeshore [Dataset]. http://doi.org/10.5066/P938WSV3
Explore at:
Unique identifier
https://doi.org/10.5066/P938WSV3
Dataset provided by
United States Geological Surveyhttp://www.usgs.gov/
Authors
Jessica Dewitt
License
U.S. Government Workshttps://www.usa.gov/government-works
License information was derived automatically
Time period covered
Apr 1, 1955 - Jan 1, 2015
Area covered
Sleeping Bear Dunes National Lakeshore
Description
Landing page for datasets associated with a study of topographic change in Sleeping Bear Dunes National Lakeshore. 3 datasets are published on this page - a 1955 digital elevation model (DEM) created from historical aerial photos, a 1977 DEM created from historical aerial photos, and a topographic change DEM of Difference (DoD) dataset describing the elevation difference between 1955 and the 1m lidar-based DEM available for Leelanau, Benzie, and Grand Traverse Counties (available from the National Map).
I
Conceptual novelty scores for PubMed articles
databank.illinois.edu
aws-databank-alb.library.illinois.edu
Updated Feb 1, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shubhanshu Mishra; Vetle I. Torvik (2024). Conceptual novelty scores for PubMed articles [Dataset]. http://doi.org/10.13012/B2IDB-5060298_V1
Explore at:
Unique identifier
https://doi.org/10.13012/B2IDB-5060298_V1
Dataset updated
Feb 1, 2024
Authors
Shubhanshu Mishra; Vetle I. Torvik
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset funded by
U.S. National Institutes of Health (NIH)
U.S. National Science Foundation (NSF)
Description
Conceptual novelty analysis data based on PubMed Medical Subject Headings ---------------------------------------------------------------------- Created by Shubhanshu Mishra, and Vetle I. Torvik on April 16th, 2018 ## Introduction This is a dataset created as part of the publication titled: Mishra S, Torvik VI. Quantifying Conceptual Novelty in the Biomedical Literature. D-Lib magazine : the magazine of the Digital Library Forum. 2016;22(9-10):10.1045/september2016-mishra. It contains final data generated as part of our experiments based on MEDLINE 2015 baseline and MeSH tree from 2015. The dataset is distributed in the form of the following tab separated text files: * PubMed2015_NoveltyData.tsv - Novelty scores for each paper in PubMed. The file contains 22,349,417 rows and 6 columns, as follow: - PMID: PubMed ID - Year: year of publication - TimeNovelty: time novelty score of the paper based on individual concepts (see paper) - VolumeNovelty: volume novelty score of the paper based on individual concepts (see paper) - PairTimeNovelty: time novelty score of the paper based on pair of concepts (see paper) - PairVolumeNovelty: volume novelty score of the paper based on pair of concepts (see paper) * mesh_scores.tsv - Temporal profiles for each MeSH term for all years. The file contains 1,102,831 rows and 5 columns, as follow: - MeshTerm: Name of the MeSH term - Year: year - AbsVal: Total publications with that MeSH term in the given year - TimeNovelty: age (in years since first publication) of MeSH term in the given year - VolumeNovelty: : age (in number of papers since first publication) of MeSH term in the given year * meshpair_scores.txt.gz (36 GB uncompressed) - Temporal profiles for each MeSH term for all years - Mesh1: Name of the first MeSH term (alphabetically sorted) - Mesh2: Name of the second MeSH term (alphabetically sorted) - Year: year - AbsVal: Total publications with that MeSH pair in the given year - TimeNovelty: age (in years since first publication) of MeSH pair in the given year - VolumeNovelty: : age (in number of papers since first publication) of MeSH pair in the given year * README.txt file ## Dataset creation This dataset was constructed using multiple datasets described in the following locations: * MEDLINE 2015 baseline: https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html * MeSH tree 2015: ftp://nlmpubs.nlm.nih.gov/online/mesh/2015/meshtrees/ * Source code provided at: https://github.com/napsternxg/Novelty Note: The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in the first week of October, 2016. Check here for information to get PubMed/MEDLINE, and NLMs data Terms and Conditions: Additional data related updates can be found at: Torvik Research Group ## Acknowledgments This work was made possible in part with funding to VIT from NIH grant P01AG039347 and NSF grant 1348742 . The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. ## License Conceptual novelty analysis data based on PubMed Medical Subject Headings by Shubhanshu Mishra, and Vetle Torvik is licensed under a Creative Commons Attribution 4.0 International License. Permissions beyond the scope of this license may be available at https://github.com/napsternxg/Novelty
JPX_Dataset_001
kaggle.com
zip
Updated Jun 30, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Anthony Mulcahy (2022). JPX_Dataset_001 [Dataset]. https://www.kaggle.com/datasets/aemulcahy/jpx-dataset-001
Explore at:
zip(555293453 bytes)Available download formats
Dataset updated
Jun 30, 2022
Authors
Anthony Mulcahy
Description
Topologicial Data Analysis (TDA) uses techniques from topology to analyse datasets.

This dataset contains the files required to install and additional files for a demonstration notebook.
D
Freight Analysis Framework - All FAF summary datasets
data.transportation.gov
data.virginia.gov
+1more
csv, xlsx, xml
Updated Dec 17, 2018
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2018). Freight Analysis Framework - All FAF summary datasets [Dataset]. https://data.transportation.gov/Roadways-and-Bridges/Freight-Analysis-Framework-All-FAF-summary-dataset/miub-cu89
Explore at:
xlsx, xml, csvAvailable download formats
Dataset updated
Dec 17, 2018
Description
The Freight Analysis Framework (FAF) integrates data from a variety of sources to create a comprehensive picture of freight movement among states and major metropolitan areas by all modes of transportation. With data from the 2007 Commodity Flow Survey and additional sources, FAF version 3 (FAF3) provides estimates for tonnage, value, and domestic ton-miles by region of origin and destination, commodity type, and mode for 2007, the most recent year, and forecasts through 2040. Also included are state-to-state flows for these years plus 1997 and 2002, summary statistics, and flows by truck assigned to the highway network for 2007 and 2040.
H
Replication Data for: Revisiting 'The Rise and Decline' in a Population of...
dataverse.harvard.edu
search.dataone.org
Updated May 5, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nathan TeBlunthuis; Aaron Shaw; Benjamin Mako Hill (2020). Replication Data for: Revisiting 'The Rise and Decline' in a Population of Peer Production Projects [Dataset]. http://doi.org/10.7910/DVN/SG3LP1
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/SG3LP1
Dataset updated
May 5, 2020
Dataset provided by
Harvard Dataverse
Authors
Nathan TeBlunthuis; Aaron Shaw; Benjamin Mako Hill
License
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/2.2/customlicense?persistentId=doi:10.7910/DVN/SG3LP1https://dataverse.harvard.edu/api/datasets/:persistentId/versions/2.2/customlicense?persistentId=doi:10.7910/DVN/SG3LP1
Description
This archive contains code and data for reproducing the analysis for “Replication Data for Revisiting ‘The Rise and Decline’ in a Population of Peer Production Projects”. Depending on what you hope to do with the data you probabbly do not want to download all of the files. Depending on your computation resources you may not be able to run all stages of the analysis. The code for all stages of the analysis, including typesetting the manuscript and running the analysis, is in code.tar. If you only want to run the final analysis or to play with datasets used in the analysis of the paper, you want intermediate_data.7z or the uncompressed tab and csv files. The data files are created in a four-stage process. The first stage uses the program “wikiq” to parse mediawiki xml dumps and create tsv files that have edit data for each wiki. The second stage generates all.edits.RDS file which combines these tsvs into a dataset of edits from all the wikis. This file is expensive to generate and at 1.5GB is pretty big. The third stage builds smaller intermediate files that contain the analytical variables from these tsv files. The fourth stage uses the intermediate files to generate smaller RDS files that contain the results. Finally, knitr and latex typeset the manuscript. A stage will only run if the outputs from the previous stages do not exist. So if the intermediate files exist they will not be regenerated. Only the final analysis will run. The exception is that stage 4, fitting models and generating plots, always runs. If you only want to replicate from the second stage onward, you want wikiq_tsvs.7z. If you want to replicate everything, you want wikia_mediawiki_xml_dumps.7z.001 wikia_mediawiki_xml_dumps.7z.002, and wikia_mediawiki_xml_dumps.7z.003. These instructions work backwards from building the manuscript using knitr, loading the datasets, running the analysis, to building the intermediate datasets. Building the manuscript using knitr This requires working latex, latexmk, and knitr installations. Depending on your operating system you might install these packages in different ways. On Debian Linux you can run apt install r-cran-knitr latexmk texlive-latex-extra. Alternatively, you can upload the necessary files to a project on Overleaf.com. Download code.tar. This has everything you need to typeset the manuscript. Unpack the tar archive. On a unix system this can be done by running tar xf code.tar. Navigate to code/paper_source. Install R dependencies. In R. run install.packages(c("data.table","scales","ggplot2","lubridate","texreg")) On a unix system you should be able to run make to build the manuscript generalizable_wiki.pdf. Otherwise you should try uploading all of the files (including the tables, figure, and knitr folders) to a new project on Overleaf.com. Loading intermediate datasets The intermediate datasets are found in the intermediate_data.7z archive. They can be extracted on a unix system using the command 7z x intermediate_data.7z. The files are 95MB uncompressed. These are RDS (R data set) files and can be loaded in R using the readRDS. For example newcomer.ds <- readRDS("newcomers.RDS"). If you wish to work with these datasets using a tool other than R, you might prefer to work with the .tab files. Running the analysis Fitting the models may not work on machines with less than 32GB of RAM. If you have trouble, you may find the functions in lib-01-sample-datasets.R useful to create stratified samples of data for fitting models. See line 89 of 02_model_newcomer_survival.R for an example. Download code.tar and intermediate_data.7z to your working folder and extract both archives. On a unix system this can be done with the command tar xf code.tar && 7z x intermediate_data.7z. Install R dependencies. install.packages(c("data.table","ggplot2","urltools","texreg","optimx","lme4","bootstrap","scales","effects","lubridate","devtools","roxygen2")). On a unix system you can simply run regen.all.sh to fit the models, build the plots and create the RDS files. Generating datasets Building the intermediate files The intermediate files are generated from all.edits.RDS. This process requires about 20GB of memory. Download all.edits.RDS, userroles_data.7z,selected.wikis.csv, and code.tar. Unpack code.tar and userroles_data.7z. On a unix system this can be done using tar xf code.tar && 7z x userroles_data.7z. Install R dependencies. In R run install.packages(c("data.table","ggplot2","urltools","texreg","optimx","lme4","bootstrap","scales","effects","lubridate","devtools","roxygen2")). Run 01_build_datasets.R. Building all.edits.RDS The intermediate RDS files used in the analysis are created from all.edits.RDS. To replicate building all.edits.RDS, you only need to run 01_build_datasets.R when the intermediate RDS files and all.edits.RDS files do not exist in the working directory. all.edits.RDS is generated from the tsv files generated by wikiq. This may take several hours. By default building the dataset will...
f
Data from: hccTAAb Atlas: An Integrated Knowledge Database for...
acs.figshare.com
datasetcatalog.nlm.nih.gov
zip
Updated Dec 29, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Tiandong Li; Peng Wang; Guiying Sun; Yuanlin Zou; Yifan Cheng; Han Wang; Yin Lu; Jianxiang Shi; Keyan Wang; Qiang Zhang; Hua Ye (2023). hccTAAb Atlas: An Integrated Knowledge Database for Tumor-Associated Autoantibodies in Hepatocellular Carcinoma [Dataset]. http://doi.org/10.1021/acs.jproteome.3c00579.s001
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.1021/acs.jproteome.3c00579.s001
Dataset updated
Dec 29, 2023
Dataset provided by
ACS Publications
Authors
Tiandong Li; Peng Wang; Guiying Sun; Yuanlin Zou; Yifan Cheng; Han Wang; Yin Lu; Jianxiang Shi; Keyan Wang; Qiang Zhang; Hua Ye
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Tumor-associated autoantibodies (TAAbs) have demonstrated potential as biomarkers for cancer detection. However, the understanding of their role in hepatocellular carcinoma (HCC) remains limited. In this study, we aimed to systematically collect and standardize information about these TAAbs and establish a comprehensive database as a platform for in-depth research. A total of 170 TAAbs were identified from published papers retrieved from PubMed, Web of Science, and Embase. Following normative reannotation, these TAAbs were referred to as 162 official symbols. The hccTAAb (tumor-associated autoantibodies in hepatocellular carcinoma) atlas was developed using the R Shiny framework and incorporating literature-based and multiomics data sets. This comprehensive online resource provides key information such as sensitivity, specificity, and additional details such as official symbols, official full names, UniProt, NCBI, HPA, neXtProt, and aliases through hyperlinks. Additionally, hccTAAb offers six analytical modules for visualizing expression profiles, survival analysis, immune infiltration, similarity analysis, DNA methylation, and DNA mutation analysis. Overall, the hccTAAb Atlas provides valuable insights into the mechanisms underlying TAAb and has the potential to enhance the diagnosis and treatment of HCC using autoantibodies. The hccTAAb Atlas is freely accessible at https://nscc.v.zzu.edu.cn/hccTAAb/.
Forest Inventory and Analysis Database
ngda-land-use-land-cover-geoplatform.hub.arcgis.com
datasets.ai
+8more
Updated Apr 14, 2017
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Forest Service (2017). Forest Inventory and Analysis Database [Dataset]. https://ngda-land-use-land-cover-geoplatform.hub.arcgis.com/datasets/forest-inventory-and-analysis-database
Explore at:
Dataset updated
Apr 14, 2017
Dataset provided by
U.S. Department of Agriculture Forest Servicehttp://fs.fed.us/
Authors
U.S. Forest Service
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered

Description
The Forest Inventory and Analysis (FIA) research program has been in existence since mandated by Congress in 1928. FIA's primary objective is to determine the extent, condition, volume, growth, and depletion of timber on the Nation's forest land. Before 1999, all inventories were conducted on a periodic basis. The passage of the 1998 Farm Bill requires FIA to collect data annually on plots within each State. This kind of up-to-date information is essential to frame realistic forest policies and programs. Summary reports for individual States are published but the Forest Service also provides data collected in each inventory to those interested in further analysis. Data is distributed via the FIA DataMart in a standard format. This standard format, referred to as the Forest Inventory and Analysis Database (FIADB) structure, was developed to provide users with as much data as possible in a consistent manner among States. A number of inventories conducted prior to the implementation of the annual inventory are available in the FIADB. However, various data attributes may be empty or the items may have been collected or computed differently. Annual inventories use a common plot design and common data collection procedures nationwide, resulting in greater consistency among FIA work units than earlier inventories. Links to field collection manuals and the FIADB user's manual are provided in the FIA DataMart.
N
Comprehensive Median Household Income and Distribution Dataset for Brevard...
neilsberg.com
Updated Jan 11, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Neilsberg Research (2024). Comprehensive Median Household Income and Distribution Dataset for Brevard County, FL: Analysis by Household Type, Size and Income Brackets [Dataset]. https://www.neilsberg.com/research/datasets/cd8d6902-b041-11ee-aaca-3860777c1fe6/
Explore at:
Dataset updated
Jan 11, 2024
Dataset authored and provided by
Neilsberg Research
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Brevard County, Florida
Dataset funded by
Neilsberg Research
Description
About this dataset

Context

The dataset tabulates the median household income in Brevard County. It can be utilized to understand the trend in median household income and to analyze the income distribution in Brevard County by household type, size, and across various income brackets.

Content

The dataset will have the following datasets when applicable

Please note: The 2020 1-Year ACS estimates data was not reported by the Census Bureau due to the impact on survey collection and analysis caused by COVID-19. Consequently, median household income data for 2020 is unavailable for large cities (population 65,000 and above).

Brevard County, FL Median Household Income Trends (2010-2021, in 2022 inflation-adjusted dollars)

Median Household Income Variation by Family Size in Brevard County, FL: Comparative analysis across 7 household sizes

Income Distribution by Quintile: Mean Household Income in Brevard County, FL

Brevard County, FL households by income brackets: family, non-family, and total, in 2022 inflation-adjusted dollars

Good to know

Margin of Error

Data in the dataset are based on the estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presening these estimates in your research.

Custom data

If you do need custom data for any of your research project, report or presentation, you can contact our research staff at research@neilsberg.com for a feasibility of a custom tabulation on a fee-for-service basis.

Inspiration

Neilsberg Research Team curates, analyze and publishes demographics and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights is made available for free download at https://www.neilsberg.com/research/.

Interested in deeper insights and visual analysis?

Explore our comprehensive data analysis and visual representations for a deeper understanding of Brevard County median household income. You can refer the same here
N
Polk County, IA Annual Population and Growth Analysis Dataset: A...
neilsberg.com
csv, json
Updated Jul 30, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Neilsberg Research (2024). Polk County, IA Annual Population and Growth Analysis Dataset: A Comprehensive Overview of Population Changes and Yearly Growth Rates in Polk County from 2000 to 2023 // 2024 Edition [Dataset]. https://www.neilsberg.com/insights/polk-county-ia-population-by-year/
Explore at:
json, csvAvailable download formats
Dataset updated
Jul 30, 2024
Dataset authored and provided by
Neilsberg Research
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Polk County, Iowa
Variables measured
Annual Population Growth Rate, Population Between 2000 and 2023, Annual Population Growth Rate Percent
Measurement technique
The data presented in this dataset is derived from the 20 years data of U.S. Census Bureau Population Estimates Program (PEP) 2000 - 2023. To measure the variables, namely (a) population and (b) population change in ( absolute and as a percentage ), we initially analyzed and tabulated the data for each of the years between 2000 and 2023. For further information regarding these estimates, please feel free to reach out to us via email at research@neilsberg.com.
Dataset funded by
Neilsberg Research
Description
About this dataset

Context

The dataset tabulates the Polk County population over the last 20 plus years. It lists the population for each year, along with the year on year change in population, as well as the change in percentage terms for each year. The dataset can be utilized to understand the population change of Polk County across the last two decades. For example, using this dataset, we can identify if the population is declining or increasing. If there is a change, when the population peaked, or if it is still growing and has not reached its peak. We can also compare the trend with the overall trend of United States population over the same period of time.

Key observations

In 2023, the population of Polk County was 505,255, a 0.81% increase year-by-year from 2022. Previously, in 2022, Polk County population was 501,184, an increase of 0.67% compared to a population of 497,842 in 2021. Over the last 20 plus years, between 2000 and 2023, population of Polk County increased by 129,528. In this period, the peak population was 505,255 in the year 2023. The numbers suggest that the population has not reached its peak yet and is showing a trend of further growth. Source: U.S. Census Bureau Population Estimates Program (PEP).

Content

When available, the data consists of estimates from the U.S. Census Bureau Population Estimates Program (PEP).

Data Coverage:

From 2000 to 2023

Variables / Data Columns

Year: This column displays the data year (Measured annually and for years 2000 to 2023)

Population: The population for the specific year for the Polk County is shown in this column.

Year on Year Change: This column displays the change in Polk County population for each year compared to the previous year.

Change in Percent: This column displays the year on year change as a percentage. Please note that the sum of all percentages may not equal one due to rounding of values.

Good to know

Margin of Error

Data in the dataset are based on the estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presening these estimates in your research.

Custom data

If you do need custom data for any of your research project, report or presentation, you can contact our research staff at research@neilsberg.com for a feasibility of a custom tabulation on a fee-for-service basis.

Inspiration

Neilsberg Research Team curates, analyze and publishes demographics and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights is made available for free download at https://www.neilsberg.com/research/.

Recommended for further research

This dataset is a part of the main dataset for Polk County Population by Year. You can refer the same here
Experimental statistics: fostering care datasets
data.wu.ac.at
html
Updated May 9, 2014
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ofsted (2014). Experimental statistics: fostering care datasets [Dataset]. https://data.wu.ac.at/schema/data_gov_uk/YjJkNzFhNjctOGQ3ZS00OGUwLTgyYmQtY2QyZGJkY2FlZGE4
Explore at:
htmlAvailable download formats
Dataset updated
May 9, 2014
Dataset provided by
Ofstedhttps://gov.uk/ofsted
License
Open Government Licence 3.0http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
Description
This is the experiemental fostering care publication comprising of datasets.

Source agency: Office for Standards in Education, Children's Services and Skills

Designation: Experimental Official Statistics

Language: English

Alternative title: Experimental statistics: fostering care datasets

Facebook

Twitter

Click to copy link

Link copied

Cite

Rui Simões (2022). Orange dataset table [Dataset]. http://doi.org/10.6084/m9.figshare.19146410.v1

Orange dataset table

Explore at:

3 scholarly articles cite this dataset (View in Google Scholar)

xlsxAvailable download formats

Unique identifier

https://doi.org/10.6084/m9.figshare.19146410.v1

Dataset updated

Mar 4, 2022

Dataset provided by

figshare
Figsharehttp://figshare.com/

Authors

Rui Simões

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

The complete dataset used in the analysis comprises 36 samples, each described by 11 numeric features and 1 target. The attributes considered were caspase 3/7 activity, Mitotracker red CMXRos area and intensity (3 h and 24 h incubations with both compounds), Mitosox oxidation (3 h incubation with the referred compounds) and oxidation rate, DCFDA fluorescence (3 h and 24 h incubations with either compound) and oxidation rate, and DQ BSA hydrolysis. The target of each instance corresponds to one of the 9 possible classes (4 samples per class): Control, 6.25, 12.5, 25 and 50 µM for 6-OHDA and 0.03, 0.06, 0.125 and 0.25 µM for rotenone. The dataset is balanced, it does not contain any missing values and data was standardized across features. The small number of samples prevented a full and strong statistical analysis of the results. Nevertheless, it allowed the identification of relevant hidden patterns and trends.

Exploratory data analysis, information gain, hierarchical clustering, and supervised predictive modeling were performed using Orange Data Mining version 3.25.1 [41]. Hierarchical clustering was performed using the Euclidean distance metric and weighted linkage. Cluster maps were plotted to relate the features with higher mutual information (in rows) with instances (in columns), with the color of each cell representing the normalized level of a particular feature in a specific instance. The information is grouped both in rows and in columns by a two-way hierarchical clustering method using the Euclidean distances and average linkage. Stratified cross-validation was used to train the supervised decision tree. A set of preliminary empirical experiments were performed to choose the best parameters for each algorithm, and we verified that, within moderate variations, there were no significant changes in the outcome. The following settings were adopted for the decision tree algorithm: minimum number of samples in leaves: 2; minimum number of samples required to split an internal node: 5; stop splitting when majority reaches: 95%; criterion: gain ratio. The performance of the supervised model was assessed using accuracy, precision, recall, F-measure and area under the ROC curve (AUC) metrics.

Clear search

Close search

Google apps

Main menu

Orange dataset table

Covid-19 variants survival data

Overview:

Key Features:

Questions to Inspire Users:

Survival Analysis:

Epidemiological Analysis:

Data Science/Machine Learning:

Amazon review data 2018

Context

Acknowledgements

Exploring E-commerce Trends⭐️⭐️⭐️

Artificial Intelligence Training Dataset Report

Association rule mining data for census tract chemical exposure analysis

Global Burden of Disease analysis dataset of BMI and CVD outcomes, risk...

Data from: SMEX04 Soil Climate Analysis Network (SCAN) Data: Arizona,...

Social Media Datasets

Seasonal analysis data

Geospatial Datasets associated with Topographic Change Analysis in Sleeping...

Conceptual novelty scores for PubMed articles

JPX_Dataset_001

Freight Analysis Framework - All FAF summary datasets

Replication Data for: Revisiting 'The Rise and Decline' in a Population of...

Data from: hccTAAb Atlas: An Integrated Knowledge Database for...

Forest Inventory and Analysis Database

Comprehensive Median Household Income and Distribution Dataset for Brevard...

About this dataset

Content

Inspiration

Interested in deeper insights and visual analysis?

Polk County, IA Annual Population and Growth Analysis Dataset: A...

About this dataset

Content

Inspiration

Recommended for further research

Experimental statistics: fostering care datasets

Orange dataset table