100+ datasets found
  1. Orange dataset table

    • figshare.com
    xlsx
    Updated Mar 4, 2022
    Cite
    Rui Simões (2022). Orange dataset table [Dataset]. http://doi.org/10.6084/m9.figshare.19146410.v1
    Explore at:
    xlsx
    Dataset updated
    Mar 4, 2022
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Rui Simões
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The complete dataset used in the analysis comprises 36 samples, each described by 11 numeric features and 1 target. The attributes considered were caspase 3/7 activity; MitoTracker Red CMXRos area and intensity (3 h and 24 h incubations with both compounds); MitoSOX oxidation (3 h incubation with the aforementioned compounds) and oxidation rate; DCFDA fluorescence (3 h and 24 h incubations with either compound) and oxidation rate; and DQ BSA hydrolysis. The target of each instance corresponds to one of 9 possible classes (4 samples per class): control, 6.25, 12.5, 25, and 50 µM for 6-OHDA, and 0.03, 0.06, 0.125, and 0.25 µM for rotenone. The dataset is balanced, contains no missing values, and was standardized across features. The small number of samples precluded a full statistical analysis of the results; nevertheless, it allowed the identification of relevant hidden patterns and trends.

    Exploratory data analysis, information gain, hierarchical clustering, and supervised predictive modeling were performed using Orange Data Mining version 3.25.1 [41]. Hierarchical clustering was performed using the Euclidean distance metric and weighted linkage. Cluster maps were plotted to relate the features with higher mutual information (in rows) with instances (in columns), with the color of each cell representing the normalized level of a particular feature in a specific instance. The information is grouped both in rows and in columns by a two-way hierarchical clustering method using Euclidean distances and average linkage. Stratified cross-validation was used to train the supervised decision tree. A set of preliminary empirical experiments was performed to choose the best parameters for each algorithm, and we verified that, within moderate variations, there were no significant changes in the outcome. The following settings were adopted for the decision tree algorithm: minimum number of samples in leaves: 2; minimum number of samples required to split an internal node: 5; stop splitting when majority reaches: 95%; criterion: gain ratio. The performance of the supervised model was assessed using accuracy, precision, recall, F-measure, and area under the ROC curve (AUC) metrics.
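The gain-ratio criterion named in these settings is information gain normalized by the entropy of the split itself. A minimal plain-Python sketch (toy labels, not the study's data):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(parent, partitions):
    """Information gain of a split, divided by the split's intrinsic entropy."""
    n = len(parent)
    gain = entropy(parent) - sum(len(p) / n * entropy(p) for p in partitions)
    split_info = -sum((len(p) / n) * log2(len(p) / n) for p in partitions if p)
    return gain / split_info if split_info else 0.0

# A perfectly separating binary split of a balanced two-class sample:
parent = ["a"] * 4 + ["b"] * 4
left, right = ["a"] * 4, ["b"] * 4
print(gain_ratio(parent, [left, right]))  # 1.0
```

This is the same quantity Orange optimizes when `criterion: gain ratio` is selected, though Orange computes it internally.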

  2. Covid-19 variants survival data

    • kaggle.com
    zip
    Updated Jan 2, 2025
    Cite
    Massock Batalong Maurice Blaise (2025). Covid-19 variants survival data [Dataset]. https://www.kaggle.com/datasets/lumierebatalong/covid-19-variants-survival-data
    Explore at:
    zip (216,589 bytes)
    Dataset updated
    Jan 2, 2025
    Authors
    Massock Batalong Maurice Blaise
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Overview:

    This dataset provides a unique resource for researchers and data scientists interested in the global dynamics of the COVID-19 pandemic. It focuses on the impact of different SARS-CoV-2 variants and mutations on the duration of local epidemics. By combining variant information with epidemiological data, this dataset allows for a comprehensive analysis of factors influencing the trajectory of the pandemic.

    Key Features:

    • Global Coverage: Includes data from multiple countries.
    • Variant-Specific Information: Detailed records for various SARS-CoV-2 variants.
    • Epidemic Duration: Data on the duration of local epidemics, accounting for right-censoring.
    • Epidemiological Variables: Includes mortality rates, a proxy for R0, transmission proxies, and other pertinent variables.
    • Geographical Characteristics: includes a continent variable for exploring geographical patterns.
    • Time-Varying Variables: includes the number of waves and the number of variants in the different countries for more in-depth exploration.

    Data Source: The data combines information from the Johns Hopkins University COVID-19 dataset (confirmed_cases.csv and deaths_cases.csv) and the covariants.org dataset (variants.csv).

    Questions to Inspire Users:

    This dataset is designed for a diverse set of analytical questions. Here are some ideas to inspire the Kaggle community:

    Survival Analysis:

    1. How do different SARS-CoV-2 variants influence the duration of local epidemics?
    2. Which factors (mortality, R0, etc.) are most strongly associated with shorter or longer epidemic durations?
    3. Does the type of variant/mutation (mutation, S, Omicron, Delta, Other) have a significant impact on epidemic duration?
    4. Is there a geographical pattern to the duration of epidemics?

    Epidemiological Analysis:

    1. How do local transmission rates (represented by our proxy of R0) affect the duration of an epidemic?
    2. Do countries with higher mortality rates have different patterns of epidemic progression?
    3. How can we predict the duration of an epidemic based on its initial characteristics?
    4. How does the number of epidemic waves impact the duration of an epidemic?
    5. Does the number of variants in a country affect the duration of an epidemic?

    Data Science/Machine Learning:

    1. Can we develop a machine learning model to predict the duration of an epidemic?
    2. Which features have the best predictive power?
    3. Can we identify clusters of variants/regions with similar epidemic patterns?
    4. Are there interactions between variables that can explain the non-linearities that we have identified?
  3. Amazon review data 2018

    • cseweb.ucsd.edu
    • nijianmo.github.io
    • +1more
    Cite
    UCSD CSE Research Project, Amazon review data 2018 [Dataset]. https://cseweb.ucsd.edu/~jmcauley/datasets/amazon_v2/
    Explore at:
    Dataset authored and provided by
    UCSD CSE Research Project
    Description

    Context

    This dataset is an updated version of the Amazon review dataset released in 2014. As in the previous version, this dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also-viewed/also-bought graphs). In addition, this version provides the following features:

    • More reviews:

      • The total number of reviews is 233.1 million (142.8 million in 2014).
    • New reviews:

      • Current data includes reviews in the range May 1996 - Oct 2018.
    • Metadata:

      • We have added transaction metadata for each review shown on the review page.
      • Added more detailed metadata of the product landing page.

    Acknowledgements

    If you publish articles based on this dataset, please cite the following paper:

    • Jianmo Ni, Jiacheng Li, Julian McAuley. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. EMNLP, 2019.
  4. Exploring E-commerce Trends⭐️⭐️⭐️

    • kaggle.com
    zip
    Updated Jul 8, 2024
    Cite
    Muhammad Roshan Riaz (2024). Exploring E-commerce Trends⭐️⭐️⭐️ [Dataset]. https://www.kaggle.com/datasets/muhammadroshaanriaz/e-commerce-trends-a-guide-to-leveraging-dataset
    Explore at:
    zip (51,169 bytes)
    Dataset updated
    Jul 8, 2024
    Authors
    Muhammad Roshan Riaz
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Exploring E-commerce Trends: A Guide to Leveraging Dummy Dataset

    Introduction: In the world of e-commerce, data is a powerful asset that can be leveraged to understand customer behavior, improve sales strategies, and enhance overall business performance. This guide explores how to effectively utilize a dummy dataset generated to simulate various aspects of an e-commerce platform. By analyzing this dataset, businesses can gain valuable insights into product trends, customer preferences, and market dynamics.

    1. Dataset Overview: The dummy dataset contains information on 1000 products across different categories such as electronics, clothing, home & kitchen, books, toys & games, and more. Each product is associated with attributes such as price, rating, number of reviews, stock quantity, discounts, sales, and date added to inventory. This comprehensive dataset provides a rich source of information for analysis and exploration.

    2. Data Analysis: Using tools like Pandas, NumPy, and visualization libraries like Matplotlib or Seaborn, businesses can perform in-depth analysis of the dataset. Key insights such as top-selling products, popular product categories, pricing trends, and seasonal variations can be extracted through exploratory data analysis (EDA). Visualization techniques can be employed to create intuitive graphs and charts for better understanding and communication of findings.
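The EDA step described above — e.g. ranking product categories by total sales — needs nothing beyond the standard library. A small sketch (the column names and values here are assumptions, not the dataset's actual schema):

```python
from collections import defaultdict

# Toy rows standing in for the dummy product table
# ("category" and "sales" are assumed column names).
rows = [
    {"category": "electronics", "sales": 120},
    {"category": "books", "sales": 45},
    {"category": "electronics", "sales": 80},
    {"category": "toys & games", "sales": 60},
]

# Aggregate sales per category.
sales_by_category = defaultdict(int)
for row in rows:
    sales_by_category[row["category"]] += row["sales"]

# Rank categories by total sales, highest first.
top = sorted(sales_by_category.items(), key=lambda kv: kv[1], reverse=True)
print(top[0])  # ('electronics', 200)
```

With the real CSV, the same group-and-rank pattern is usually one line in Pandas (`df.groupby("category")["sales"].sum()`), and Matplotlib or Seaborn can plot the result.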

    3. Machine Learning Applications: The dataset can be used to train machine learning models for various e-commerce tasks such as product recommendation, sales prediction, customer segmentation, and sentiment analysis. By applying algorithms like linear regression, decision trees, or neural networks, businesses can develop predictive models to optimize inventory management, personalize customer experiences, and drive sales growth.

    4. Testing and Prototyping: Businesses can utilize the dummy dataset to test new algorithms, prototype new features, or conduct A/B testing experiments without impacting real user data. This enables rapid iteration and experimentation to validate hypotheses and refine strategies before implementation in a live environment.

    5. Educational Resources: The dummy dataset serves as an invaluable educational resource for students, researchers, and professionals interested in learning about e-commerce data analysis and machine learning. Tutorials, workshops, and online courses can be developed using the dataset to teach concepts such as data manipulation, statistical analysis, and model training in the context of e-commerce.

    6. Decision Support and Strategy Development: Insights derived from the dataset can inform strategic decision-making processes and guide business strategy development. By understanding customer preferences, market trends, and competitor behavior, businesses can make informed decisions regarding product assortment, pricing strategies, marketing campaigns, and resource allocation.

    Conclusion: In conclusion, the dummy dataset provides a versatile and valuable resource for exploring e-commerce trends, understanding customer behavior, and driving business growth. By leveraging this dataset effectively, businesses can unlock actionable insights, optimize operations, and stay ahead in today's competitive e-commerce landscape.

  5. Artificial Intelligence Training Dataset Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Feb 21, 2025
    Cite
    Archive Market Research (2025). Artificial Intelligence Training Dataset Report [Dataset]. https://www.archivemarketresearch.com/reports/artificial-intelligence-training-dataset-38645
    Explore at:
    pdf, ppt, doc
    Dataset updated
    Feb 21, 2025
    Dataset authored and provided by
    Archive Market Research
    License

    https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global Artificial Intelligence (AI) Training Dataset market is projected to reach $1,605.2 million by 2033, exhibiting a CAGR of 9.4% from 2025 to 2033. The surge in demand for AI training datasets is driven by the increasing adoption of AI and machine learning technologies in various industries such as healthcare, financial services, and manufacturing. Moreover, the growing need for reliable and high-quality data for training AI models is further fueling market growth. Key market trends include the increasing adoption of cloud-based AI training datasets, the emergence of synthetic data generation, and the growing focus on data privacy and security.

    The market is segmented by type (image classification dataset, voice recognition dataset, natural language processing dataset, object detection dataset, and others) and application (smart campus, smart medical, autopilot, smart home, and others). North America is the largest regional market, followed by Europe and Asia Pacific. Key companies operating in the market include Appen, Speechocean, TELUS International, Summa Linguae Technologies, and Scale AI.

    Artificial Intelligence (AI) training datasets are critical for developing and deploying AI models: these datasets provide the data that AI models need to learn, and the quality of the data directly impacts the performance of the model. The AI training dataset market landscape is complex, with many different providers offering datasets for a variety of applications, and the market is rapidly evolving as new technologies and techniques are developed for collecting, labeling, and managing AI training data.

  6. Association rule mining data for census tract chemical exposure analysis

    • catalog.data.gov
    • s.cnmilf.com
    • +1more
    Updated Nov 12, 2020
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). Association rule mining data for census tract chemical exposure analysis [Dataset]. https://catalog.data.gov/dataset/association-rule-mining-data-for-census-tract-chemical-exposure-analysis
    Explore at:
    Dataset updated
    Nov 12, 2020
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    Chemical concentration, exposure, and health risk data for U.S. census tracts from National Scale Air Toxics Assessment (NATA). This dataset is associated with the following publication: Huang, H., R. Tornero-Velez, and T. Barzyk. Associations between socio-demographic characteristics and chemical concentrations contributing to cumulative exposures in the United States. Journal of Exposure Science and Environmental Epidemiology. Nature Publishing Group, London, UK, 27(6): 544-550, (2017).

  7. Global Burden of Disease analysis dataset of BMI and CVD outcomes, risk...

    • data.mendeley.com
    Updated Aug 17, 2021
    + more versions
    Cite
    David Cundiff (2021). Global Burden of Disease analysis dataset of BMI and CVD outcomes, risk factors, and SAS codes [Dataset]. http://doi.org/10.17632/g6b39zxck4.6
    Explore at:
    Dataset updated
    Aug 17, 2021
    Authors
    David Cundiff
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This formatted dataset originates from raw data files from the Institute for Health Metrics and Evaluation Global Burden of Disease study (GBD2017). It comprises population-weighted worldwide data on male and female cohorts ages 15-69 years, including body mass index (BMI), cardiovascular disease (CVD), and associated dietary, metabolic, and other risk factors. The purpose of creating this formatted database is to explore the univariate and multiple regression correlations of BMI, CVD, and other health outcomes with risk factors. Our research hypothesis is that we can successfully apply artificial intelligence to model BMI and CVD risk factors and health outcomes. We derived a BMI multiple regression risk factor formula that satisfied all nine Bradford Hill causality criteria for epidemiology research. We found that animal products and added fats are negatively correlated with CVD early deaths worldwide but positively correlated with CVD early deaths at high intakes. We interpret this as showing that optimal cardiovascular outcomes come with moderate (not low and not high) intakes of animal foods and added fats.
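The univariate correlations described above reduce to Pearson's r; a minimal stdlib sketch (illustrative numbers, not GBD values):

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # ~1.0 for a perfectly linear pair
```

Multiple regression over many risk factors, as in the study, would in practice use a statistics package (e.g. SAS, as the dataset title suggests, or statsmodels in Python) rather than hand-rolled code.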

    For questions, please email davidkcundiff@gmail.com. Thanks.

  8. Data from: SMEX04 Soil Climate Analysis Network (SCAN) Data: Arizona,...

    • data.nasa.gov
    • datasets.ai
    • +6more
    Updated Apr 1, 2025
    + more versions
    Cite
    nasa.gov (2025). SMEX04 Soil Climate Analysis Network (SCAN) Data: Arizona, Version 1 [Dataset]. https://data.nasa.gov/dataset/smex04-soil-climate-analysis-network-scan-data-arizona-version-1-69d65
    Explore at:
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    Notice to Data Users: The documentation for this data set was provided solely by the Principal Investigator(s) and was not further developed, thoroughly reviewed, or edited by NSIDC; thus, support for this data set may be limited. This data set contains measurements taken during the Soil Moisture Experiment 2004 (SMEX04) in southern Arizona, USA. The SCAN station houses numerous sensors, which were used to automatically record the data.

  9. Social Media Datasets

    • promptcloud.com
    csv
    Updated Jul 28, 2025
    Cite
    PromptCloud (2025). Social Media Datasets [Dataset]. https://www.promptcloud.com/dataset/social-media/
    Explore at:
    csv
    Dataset updated
    Jul 28, 2025
    Dataset authored and provided by
    PromptCloud
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Social media datasets provide real-time insight into public opinion, trending topics, user behavior, sentiment, and global events as reflected on platforms like Twitter (X), Facebook, and Instagram. These datasets are crucial for marketing analysts, newsrooms, political strategists, crisis response teams, and brand managers to monitor discourse and take data-driven action. Extracted from live user-generated content, […]

  10. Seasonal analysis data

    • s.cnmilf.com
    • datasets.ai
    • +1more
    Updated Nov 6, 2022
    Cite
    U.S. EPA Office of Research and Development (ORD) (2022). Seasonal analysis data [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/seasonal-analysis-data
    Explore at:
    Dataset updated
    Nov 6, 2022
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    This data set contains seasonal counts for fecal coliform, E. coli, and enterococci. This dataset is associated with the following publication: Selvakumar, A., and T. Oconnor. Seasonal Variation in Indicator Organisms Infiltrating from Permeable Pavement Parking Lots at the Edison Environmental Center, New Jersey. WATER RESEARCH. Elsevier Science Ltd, New York, NY, USA, 94(9): e10791, (2022).

  11. Geospatial Datasets associated with Topographic Change Analysis in Sleeping...

    • data.usgs.gov
    • datasets.ai
    • +1more
    Cite
    Jessica Dewitt, Geospatial Datasets associated with Topographic Change Analysis in Sleeping Bear Dunes National Lakeshore [Dataset]. http://doi.org/10.5066/P938WSV3
    Explore at:
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Authors
    Jessica Dewitt
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Time period covered
    Apr 1, 1955 - Jan 1, 2015
    Area covered
    Sleeping Bear Dunes National Lakeshore
    Description

    Landing page for datasets associated with a study of topographic change in Sleeping Bear Dunes National Lakeshore. Three datasets are published on this page: a 1955 digital elevation model (DEM) created from historical aerial photos, a 1977 DEM created from historical aerial photos, and a topographic-change DEM of Difference (DoD) dataset describing the elevation difference between the 1955 DEM and the 1 m lidar-based DEM available for Leelanau, Benzie, and Grand Traverse Counties (available from the National Map).
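Conceptually, a DEM of Difference is a cell-by-cell subtraction of two co-registered elevation grids; a toy sketch (2x2 grids of invented elevations, not the USGS rasters, which would be processed with GIS tooling):

```python
def dem_of_difference(newer, older):
    """Cell-wise elevation change between two aligned elevation grids
    (positive values indicate deposition, negative values erosion)."""
    return [
        [n - o for n, o in zip(new_row, old_row)]
        for new_row, old_row in zip(newer, older)
    ]

# Invented elevations in meters for a 2x2 patch.
dem_1955 = [[181.0, 182.5], [180.0, 179.5]]
dem_lidar = [[180.0, 183.0], [181.5, 179.5]]
dod = dem_of_difference(dem_lidar, dem_1955)
print(dod)  # [[-1.0, 0.5], [1.5, 0.0]]
```

Real DoD workflows also require the grids to share a projection and cell size, and typically propagate elevation uncertainty into the difference.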

  12. Conceptual novelty scores for PubMed articles

    • databank.illinois.edu
    • aws-databank-alb.library.illinois.edu
    Updated Feb 1, 2024
    Cite
    Shubhanshu Mishra; Vetle I. Torvik (2024). Conceptual novelty scores for PubMed articles [Dataset]. http://doi.org/10.13012/B2IDB-5060298_V1
    Explore at:
    Dataset updated
    Feb 1, 2024
    Authors
    Shubhanshu Mishra; Vetle I. Torvik
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Dataset funded by
    U.S. National Institutes of Health (NIH)
    U.S. National Science Foundation (NSF)
    Description

    Conceptual novelty analysis data based on PubMed Medical Subject Headings

    Created by Shubhanshu Mishra and Vetle I. Torvik on April 16th, 2018

    ## Introduction

    This is a dataset created as part of the publication titled: Mishra S, Torvik VI. Quantifying Conceptual Novelty in the Biomedical Literature. D-Lib Magazine: The Magazine of the Digital Library Forum. 2016;22(9-10):10.1045/september2016-mishra. It contains final data generated as part of our experiments based on the MEDLINE 2015 baseline and the MeSH tree from 2015. The dataset is distributed in the form of the following tab-separated text files:

    * PubMed2015_NoveltyData.tsv - Novelty scores for each paper in PubMed. The file contains 22,349,417 rows and 6 columns:
      - PMID: PubMed ID
      - Year: year of publication
      - TimeNovelty: time novelty score of the paper based on individual concepts (see paper)
      - VolumeNovelty: volume novelty score of the paper based on individual concepts (see paper)
      - PairTimeNovelty: time novelty score of the paper based on pairs of concepts (see paper)
      - PairVolumeNovelty: volume novelty score of the paper based on pairs of concepts (see paper)
    * mesh_scores.tsv - Temporal profiles for each MeSH term for all years. The file contains 1,102,831 rows and 5 columns:
      - MeshTerm: name of the MeSH term
      - Year: year
      - AbsVal: total publications with that MeSH term in the given year
      - TimeNovelty: age (in years since first publication) of the MeSH term in the given year
      - VolumeNovelty: age (in number of papers since first publication) of the MeSH term in the given year
    * meshpair_scores.txt.gz (36 GB uncompressed) - Temporal profiles for each MeSH pair for all years:
      - Mesh1: name of the first MeSH term (alphabetically sorted)
      - Mesh2: name of the second MeSH term (alphabetically sorted)
      - Year: year
      - AbsVal: total publications with that MeSH pair in the given year
      - TimeNovelty: age (in years since first publication) of the MeSH pair in the given year
      - VolumeNovelty: age (in number of papers since first publication) of the MeSH pair in the given year
    * README.txt

    ## Dataset creation

    This dataset was constructed using multiple datasets described in the following locations:

    * MEDLINE 2015 baseline: https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html
    * MeSH tree 2015: ftp://nlmpubs.nlm.nih.gov/online/mesh/2015/meshtrees/
    * Source code: https://github.com/napsternxg/Novelty

    Note: the dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in the first week of October, 2016. See NLM for information on obtaining PubMed/MEDLINE and NLM's data Terms and Conditions. Additional data-related updates can be found at the Torvik Research Group.

    ## Acknowledgments

    This work was made possible in part with funding to VIT from NIH grant P01AG039347 and NSF grant 1348742. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

    ## License

    Conceptual novelty analysis data based on PubMed Medical Subject Headings by Shubhanshu Mishra and Vetle Torvik is licensed under a Creative Commons Attribution 4.0 International License. Permissions beyond the scope of this license may be available at https://github.com/napsternxg/Novelty
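The per-paper scores file described above is plain tab-separated text, so it can be sampled with the standard csv module. A sketch (the rows are invented, and the presence of a header line is an assumption):

```python
import csv
import io

# Toy stand-in for a few rows of PubMed2015_NoveltyData.tsv
# (values invented; whether the real file carries a header row is an assumption).
sample = (
    "PMID\tYear\tTimeNovelty\tVolumeNovelty\tPairTimeNovelty\tPairVolumeNovelty\n"
    "100001\t1998\t2.0\t15.0\t1.0\t8.0\n"
    "100002\t2005\t0.5\t3.0\t0.2\t1.5\n"
)

reader = csv.DictReader(io.StringIO(sample), delimiter="\t")
rows = list(reader)

# Mean time-novelty score across the sample rows.
mean_time_novelty = sum(float(r["TimeNovelty"]) for r in rows) / len(rows)
print(mean_time_novelty)  # 1.25
```

For the full 22M-row file, the same `DictReader` loop works over an open file handle, streaming row by row instead of loading everything into memory.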

  13. JPX_Dataset_001

    • kaggle.com
    zip
    Updated Jun 30, 2022
    Cite
    Anthony Mulcahy (2022). JPX_Dataset_001 [Dataset]. https://www.kaggle.com/datasets/aemulcahy/jpx-dataset-001
    Explore at:
    zip (555,293,453 bytes)
    Dataset updated
    Jun 30, 2022
    Authors
    Anthony Mulcahy
    Description

    Topological Data Analysis (TDA) uses techniques from topology to analyse datasets.

    This dataset contains the files required for installation, along with additional files for a demonstration notebook.

  14. Freight Analysis Framework - All FAF summary datasets

    • data.transportation.gov
    • data.virginia.gov
    • +1more
    csv, xlsx, xml
    Updated Dec 17, 2018
    + more versions
    Cite
    (2018). Freight Analysis Framework - All FAF summary datasets [Dataset]. https://data.transportation.gov/Roadways-and-Bridges/Freight-Analysis-Framework-All-FAF-summary-dataset/miub-cu89
    Explore at:
    xlsx, xml, csv
    Dataset updated
    Dec 17, 2018
    Description

    The Freight Analysis Framework (FAF) integrates data from a variety of sources to create a comprehensive picture of freight movement among states and major metropolitan areas by all modes of transportation. With data from the 2007 Commodity Flow Survey and additional sources, FAF version 3 (FAF3) provides estimates for tonnage, value, and domestic ton-miles by region of origin and destination, commodity type, and mode for 2007, the most recent year, and forecasts through 2040. Also included are state-to-state flows for these years plus 1997 and 2002, summary statistics, and flows by truck assigned to the highway network for 2007 and 2040.

  15. Replication Data for: Revisiting 'The Rise and Decline' in a Population of...

    • dataverse.harvard.edu
    • search.dataone.org
    Updated May 5, 2020
    Cite
    Nathan TeBlunthuis; Aaron Shaw; Benjamin Mako Hill (2020). Replication Data for: Revisiting 'The Rise and Decline' in a Population of Peer Production Projects [Dataset]. http://doi.org/10.7910/DVN/SG3LP1
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 5, 2020
    Dataset provided by
    Harvard Dataverse
    Authors
    Nathan TeBlunthuis; Aaron Shaw; Benjamin Mako Hill
    License

    https://dataverse.harvard.edu/api/datasets/:persistentId/versions/2.2/customlicense?persistentId=doi:10.7910/DVN/SG3LP1

    Description

    This archive contains code and data for reproducing the analysis for “Replication Data for Revisiting ‘The Rise and Decline’ in a Population of Peer Production Projects”. Depending on what you hope to do with the data you probabbly do not want to download all of the files. Depending on your computation resources you may not be able to run all stages of the analysis. The code for all stages of the analysis, including typesetting the manuscript and running the analysis, is in code.tar. If you only want to run the final analysis or to play with datasets used in the analysis of the paper, you want intermediate_data.7z or the uncompressed tab and csv files. The data files are created in a four-stage process. The first stage uses the program “wikiq” to parse mediawiki xml dumps and create tsv files that have edit data for each wiki. The second stage generates all.edits.RDS file which combines these tsvs into a dataset of edits from all the wikis. This file is expensive to generate and at 1.5GB is pretty big. The third stage builds smaller intermediate files that contain the analytical variables from these tsv files. The fourth stage uses the intermediate files to generate smaller RDS files that contain the results. Finally, knitr and latex typeset the manuscript. A stage will only run if the outputs from the previous stages do not exist. So if the intermediate files exist they will not be regenerated. Only the final analysis will run. The exception is that stage 4, fitting models and generating plots, always runs. If you only want to replicate from the second stage onward, you want wikiq_tsvs.7z. If you want to replicate everything, you want wikia_mediawiki_xml_dumps.7z.001 wikia_mediawiki_xml_dumps.7z.002, and wikia_mediawiki_xml_dumps.7z.003. These instructions work backwards from building the manuscript using knitr, loading the datasets, running the analysis, to building the intermediate datasets. 
Building the manuscript using knitr This requires working latex, latexmk, and knitr installations. Depending on your operating system you might install these packages in different ways. On Debian Linux you can run apt install r-cran-knitr latexmk texlive-latex-extra. Alternatively, you can upload the necessary files to a project on Overleaf.com. Download code.tar. This has everything you need to typeset the manuscript. Unpack the tar archive. On a unix system this can be done by running tar xf code.tar. Navigate to code/paper_source. Install R dependencies. In R. run install.packages(c("data.table","scales","ggplot2","lubridate","texreg")) On a unix system you should be able to run make to build the manuscript generalizable_wiki.pdf. Otherwise you should try uploading all of the files (including the tables, figure, and knitr folders) to a new project on Overleaf.com. Loading intermediate datasets The intermediate datasets are found in the intermediate_data.7z archive. They can be extracted on a unix system using the command 7z x intermediate_data.7z. The files are 95MB uncompressed. These are RDS (R data set) files and can be loaded in R using the readRDS. For example newcomer.ds <- readRDS("newcomers.RDS"). If you wish to work with these datasets using a tool other than R, you might prefer to work with the .tab files. Running the analysis Fitting the models may not work on machines with less than 32GB of RAM. If you have trouble, you may find the functions in lib-01-sample-datasets.R useful to create stratified samples of data for fitting models. See line 89 of 02_model_newcomer_survival.R for an example. Download code.tar and intermediate_data.7z to your working folder and extract both archives. On a unix system this can be done with the command tar xf code.tar && 7z x intermediate_data.7z. Install R dependencies. install.packages(c("data.table","ggplot2","urltools","texreg","optimx","lme4","bootstrap","scales","effects","lubridate","devtools","roxygen2")). 
On a unix system you can then simply run regen.all.sh to fit the models, build the plots, and create the RDS files.

Generating datasets

Building the intermediate files: the intermediate files are generated from all.edits.RDS, a process that requires about 20GB of memory. Download all.edits.RDS, userroles_data.7z, selected.wikis.csv, and code.tar. Unpack code.tar and userroles_data.7z; on a unix system this can be done using tar xf code.tar && 7z x userroles_data.7z. Install the R dependencies: in R, run install.packages(c("data.table","ggplot2","urltools","texreg","optimx","lme4","bootstrap","scales","effects","lubridate","devtools","roxygen2")). Then run 01_build_datasets.R.

Building all.edits.RDS: the intermediate RDS files used in the analysis are created from all.edits.RDS. To replicate building all.edits.RDS, run 01_build_datasets.R when the intermediate RDS files and all.edits.RDS do not exist in the working directory. all.edits.RDS is generated from the tsv files produced by wikiq; this may take several hours. By default building the dataset will...
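The stage-skipping behavior described above (a stage runs only when its outputs are missing, while stage 4 always runs) can be sketched in shell. This is a minimal illustration, not the archive's actual driver script; the file and script names below are illustrative.

```shell
# Sketch of the stage-skipping logic: a stage runs only when its output
# file does not already exist. File/script names here are illustrative.
run_stage() {
  out="$1"; shift
  if [ -e "$out" ]; then
    echo "skip $out"        # output exists: the stage is not re-run
  else
    echo "build $out"       # here the real pipeline would run: "$@"
  fi
}

workdir="$(mktemp -d)"
cd "$workdir"
touch all.edits.RDS                                   # pretend stage 2 already ran
run_stage all.edits.RDS  Rscript combine_tsvs.R       # prints "skip all.edits.RDS"
run_stage newcomers.RDS  Rscript 01_build_datasets.R  # prints "build newcomers.RDS"
```

This is why deleting an intermediate file (or all.edits.RDS) is enough to force that stage, and everything downstream of it, to be rebuilt on the next run.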

  16. Data from: hccTAAb Atlas: An Integrated Knowledge Database for Tumor-Associated Autoantibodies in Hepatocellular Carcinoma

    • acs.figshare.com
    • datasetcatalog.nlm.nih.gov
    zip
    Updated Dec 29, 2023
    Tiandong Li; Peng Wang; Guiying Sun; Yuanlin Zou; Yifan Cheng; Han Wang; Yin Lu; Jianxiang Shi; Keyan Wang; Qiang Zhang; Hua Ye (2023). hccTAAb Atlas: An Integrated Knowledge Database for Tumor-Associated Autoantibodies in Hepatocellular Carcinoma [Dataset]. http://doi.org/10.1021/acs.jproteome.3c00579.s001
    Explore at:
    zipAvailable download formats
    Dataset updated
    Dec 29, 2023
    Dataset provided by
    ACS Publications
    Authors
    Tiandong Li; Peng Wang; Guiying Sun; Yuanlin Zou; Yifan Cheng; Han Wang; Yin Lu; Jianxiang Shi; Keyan Wang; Qiang Zhang; Hua Ye
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Tumor-associated autoantibodies (TAAbs) have demonstrated potential as biomarkers for cancer detection. However, understanding of their role in hepatocellular carcinoma (HCC) remains limited. In this study, we aimed to systematically collect and standardize information about these TAAbs and establish a comprehensive database as a platform for in-depth research. A total of 170 TAAbs were identified from published papers retrieved from PubMed, Web of Science, and Embase. Following normative reannotation, these TAAbs were consolidated under 162 official symbols. The hccTAAb (tumor-associated autoantibodies in hepatocellular carcinoma) atlas was developed using the R Shiny framework, incorporating literature-based and multiomics data sets. This comprehensive online resource provides key information such as sensitivity and specificity, together with official symbols, official full names, aliases, and hyperlinks to UniProt, NCBI, HPA, and neXtProt. Additionally, hccTAAb offers six analytical modules for visualizing expression profiles, survival analysis, immune infiltration, similarity analysis, DNA methylation, and DNA mutation analysis. Overall, the hccTAAb Atlas provides valuable insights into the mechanisms underlying TAAbs and has the potential to enhance the diagnosis and treatment of HCC using autoantibodies. The hccTAAb Atlas is freely accessible at https://nscc.v.zzu.edu.cn/hccTAAb/.

  17. Forest Inventory and Analysis Database

    • ngda-land-use-land-cover-geoplatform.hub.arcgis.com
    • datasets.ai
    • +8more
    Updated Apr 14, 2017
    + more versions
    U.S. Forest Service (2017). Forest Inventory and Analysis Database [Dataset]. https://ngda-land-use-land-cover-geoplatform.hub.arcgis.com/datasets/forest-inventory-and-analysis-database
    Explore at:
    Dataset updated
    Apr 14, 2017
    Dataset provided by
    U.S. Department of Agriculture Forest Servicehttp://fs.fed.us/
    Authors
    U.S. Forest Service
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Description

    The Forest Inventory and Analysis (FIA) research program has been in existence since mandated by Congress in 1928. FIA's primary objective is to determine the extent, condition, volume, growth, and depletion of timber on the Nation's forest land. Before 1999, all inventories were conducted on a periodic basis. The passage of the 1998 Farm Bill requires FIA to collect data annually on plots within each State. This kind of up-to-date information is essential to frame realistic forest policies and programs. Summary reports for individual States are published but the Forest Service also provides data collected in each inventory to those interested in further analysis. Data is distributed via the FIA DataMart in a standard format. This standard format, referred to as the Forest Inventory and Analysis Database (FIADB) structure, was developed to provide users with as much data as possible in a consistent manner among States. A number of inventories conducted prior to the implementation of the annual inventory are available in the FIADB. However, various data attributes may be empty or the items may have been collected or computed differently. Annual inventories use a common plot design and common data collection procedures nationwide, resulting in greater consistency among FIA work units than earlier inventories. Links to field collection manuals and the FIADB user's manual are provided in the FIA DataMart.

  18. Comprehensive Median Household Income and Distribution Dataset for Brevard County, FL: Analysis by Household Type, Size and Income Brackets

    • neilsberg.com
    Updated Jan 11, 2024
    + more versions
    Neilsberg Research (2024). Comprehensive Median Household Income and Distribution Dataset for Brevard County, FL: Analysis by Household Type, Size and Income Brackets [Dataset]. https://www.neilsberg.com/research/datasets/cd8d6902-b041-11ee-aaca-3860777c1fe6/
    Explore at:
    Dataset updated
    Jan 11, 2024
    Dataset authored and provided by
    Neilsberg Research
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Brevard County, Florida
    Dataset funded by
    Neilsberg Research
    Description
    About this dataset

    Context

    The dataset tabulates the median household income in Brevard County. It can be utilized to understand the trend in median household income and to analyze the income distribution in Brevard County by household type, size, and across various income brackets.

    Content

    When applicable, the dataset includes the following datasets:

    Please note: The 2020 1-Year ACS estimates data was not reported by the Census Bureau due to the impact on survey collection and analysis caused by COVID-19. Consequently, median household income data for 2020 is unavailable for large cities (population 65,000 and above).

    • Brevard County, FL Median Household Income Trends (2010-2021, in 2022 inflation-adjusted dollars)
    • Median Household Income Variation by Family Size in Brevard County, FL: Comparative analysis across 7 household sizes
    • Income Distribution by Quintile: Mean Household Income in Brevard County, FL
    • Brevard County, FL households by income brackets: family, non-family, and total, in 2022 inflation-adjusted dollars

    Good to know

    Margin of Error

    Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

    Custom data

    If you need custom data for your research project, report, or presentation, you can contact our research staff at research@neilsberg.com about the feasibility of a custom tabulation on a fee-for-service basis.

    Inspiration

    The Neilsberg Research team curates, analyzes, and publishes demographics and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

    Interested in deeper insights and visual analysis?

    Explore our comprehensive data analysis and visual representations for a deeper understanding of Brevard County median household income. You can refer to the same here

  19. Polk County, IA Annual Population and Growth Analysis Dataset: A Comprehensive Overview of Population Changes and Yearly Growth Rates in Polk County from 2000 to 2023 // 2024 Edition

    • neilsberg.com
    csv, json
    Updated Jul 30, 2024
    + more versions
    Neilsberg Research (2024). Polk County, IA Annual Population and Growth Analysis Dataset: A Comprehensive Overview of Population Changes and Yearly Growth Rates in Polk County from 2000 to 2023 // 2024 Edition [Dataset]. https://www.neilsberg.com/insights/polk-county-ia-population-by-year/
    Explore at:
    json, csvAvailable download formats
    Dataset updated
    Jul 30, 2024
    Dataset authored and provided by
    Neilsberg Research
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Polk County, Iowa
    Variables measured
    Annual Population Growth Rate, Population Between 2000 and 2023, Annual Population Growth Rate Percent
    Measurement technique
    The data presented in this dataset are derived from more than 20 years of U.S. Census Bureau Population Estimates Program (PEP) data, 2000-2023. To measure the variables, namely (a) population and (b) population change (in absolute terms and as a percentage), we analyzed and tabulated the data for each of the years between 2000 and 2023. For further information regarding these estimates, please feel free to reach out to us via email at research@neilsberg.com.
    Dataset funded by
    Neilsberg Research
    Description
    About this dataset

    Context

    The dataset tabulates the Polk County population over the last 20-plus years. It lists the population for each year, along with the year-on-year change in population, both in absolute terms and as a percentage. The dataset can be utilized to understand the population change in Polk County across the last two decades. For example, using this dataset we can identify whether the population is declining or increasing, when the population peaked, and whether it is still growing and has not yet reached its peak. We can also compare the trend with the overall trend of the United States population over the same period of time.

    Key observations

    In 2023, the population of Polk County was 505,255, a 0.81% increase year-by-year from 2022. Previously, in 2022, Polk County population was 501,184, an increase of 0.67% compared to a population of 497,842 in 2021. Over the last 20 plus years, between 2000 and 2023, population of Polk County increased by 129,528. In this period, the peak population was 505,255 in the year 2023. The numbers suggest that the population has not reached its peak yet and is showing a trend of further growth. Source: U.S. Census Bureau Population Estimates Program (PEP).

    Content

    When available, the data consists of estimates from the U.S. Census Bureau Population Estimates Program (PEP).

    Data Coverage:

    • From 2000 to 2023

    Variables / Data Columns

    • Year: This column displays the data year (Measured annually and for years 2000 to 2023)
    • Population: The population for the specific year for the Polk County is shown in this column.
    • Year on Year Change: This column displays the change in Polk County population for each year compared to the previous year.
    • Change in Percent: This column displays the year on year change as a percentage. Please note that the sum of all percentages may not equal one due to rounding of values.
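The year-on-year figures quoted in the key observations follow the usual percent-change formula, (current − previous) / previous × 100. As a minimal sketch (not part of the dataset itself), this can be checked in shell with awk, using the Polk County populations quoted above:

```shell
# Percent change between consecutive years: (curr - prev) / prev * 100.
# Input values are the Polk County populations quoted in the description.
pct_change() {
  awk -v prev="$1" -v curr="$2" 'BEGIN { printf "%.2f\n", (curr - prev) / prev * 100 }'
}

pct_change 501184 505255   # 2022 -> 2023, prints 0.81
pct_change 497842 501184   # 2021 -> 2022, prints 0.67
```

Both results match the 0.81% and 0.67% increases reported in the key observations.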

    Good to know

    Margin of Error

    Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

    Custom data

    If you need custom data for your research project, report, or presentation, you can contact our research staff at research@neilsberg.com about the feasibility of a custom tabulation on a fee-for-service basis.

    Inspiration

    The Neilsberg Research team curates, analyzes, and publishes demographics and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

    Recommended for further research

    This dataset is a part of the main dataset for Polk County Population by Year. You can refer to the same here

  20. Experimental statistics: fostering care datasets

    • data.wu.ac.at
    html
    Updated May 9, 2014
    + more versions
    Ofsted (2014). Experimental statistics: fostering care datasets [Dataset]. https://data.wu.ac.at/schema/data_gov_uk/YjJkNzFhNjctOGQ3ZS00OGUwLTgyYmQtY2QyZGJkY2FlZGE4
    Explore at:
    htmlAvailable download formats
    Dataset updated
    May 9, 2014
    Dataset provided by
    Ofstedhttps://gov.uk/ofsted
    License

    Open Government Licence 3.0http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
    License information was derived automatically

    Description

    This is the experimental fostering care publication, comprising datasets.

    Source agency: Office for Standards in Education, Children's Services and Skills

    Designation: Experimental Official Statistics

    Language: English

    Alternative title: Experimental statistics: fostering care datasets
