93 datasets found
  1. Quantitative raw data for "Large scale regional citizen surveys report"...

    • data.niaid.nih.gov
    • zenodo.org
    • +1 more
    Updated Feb 3, 2022
    Cite
    Panori, Anastasia; Bakratsas, Thomas; Chapizanis, Dimitrios; Altsitsiadis, Efthymios; Hauschildt, Christian (2022). Quantitative raw data for "Large scale regional citizen surveys report" (D1.4) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5958017
    Explore at:
    Dataset updated
    Feb 3, 2022
    Dataset provided by
    White Research SRL
    Authors
    Panori, Anastasia; Bakratsas, Thomas; Chapizanis, Dimitrios; Altsitsiadis, Efthymios; Hauschildt, Christian
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This dataset presents the quantitative raw data collected under the H2020 RRI2SCALE project for D1.4, the “Large scale regional citizen surveys report”. The dataset includes the answers provided by almost 8,000 participants from 4 pilot European regions (Kriti, Vestland, Galicia, and Overijssel) regarding the general public's views, concerns, and moral issues about the current and future trajectories of their RTD&I ecosystem. The original survey questionnaire was created by White Research SRL and disseminated to the regions through supporting pilot partners. Data collection took place from June 2020 to September 2020 in 4 waves, one per region. Following a consortium vote at the kick-off meeting, responses were collected through online panels run by survey companies engaged for each region, rather than through more resource-intensive methods that would have made data collection unduly expensive, in order to fill the regional quotas. For the statistical analysis of the data and the conclusions drawn from it, see the "Large scale regional citizen surveys report" (D1.4).

  2. FAIR NATIONAL ELECTION STUDIES: HOW WELL ARE WE DOING? - Dataset - B2FIND

    • demo-b2find.dkrz.de
    Updated Sep 27, 2025
    + more versions
    Cite
    (2025). FAIR NATIONAL ELECTION STUDIES: HOW WELL ARE WE DOING? - Dataset - B2FIND [Dataset]. http://demo-b2find.dkrz.de/dataset/85555597-a3d9-57d0-9e67-4c68206890cb
    Explore at:
    Dataset updated
    Sep 27, 2025
    Description

    Election studies are an important data pillar in political and social science, as most political research investigations involve secondary use of existing datasets. Researchers depend on high-quality data because data quality determines the accuracy of the conclusions drawn from statistical analyses. We outline data reuse quality criteria pertaining to data accessibility, metadata provision, and data documentation, using the FAIR Principles of research data management as a framework. We then investigate the extent to which a selection of election studies fulfils these criteria, using studies from Western democracies. Our results reveal that although most election studies are easily accessible and well documented, and the overall level of data processing is satisfactory, some important deficits remain. Further analyses of technical documentation indicate that while a majority of election studies provide the necessary documents, there is still room for improvement. Method: content coding (Inhaltscodierung), content analysis. Universe: large-scale election studies from Western democracies. Sampling: non-probability, purposive.

  3. Adventures of Sherlock Holmes: Sentiment Analysis.

    • kaggle.com
    zip
    Updated Aug 25, 2024
    Cite
    Patrick L Ford (2024). Adventures of Sherlock Holmes: Sentiment Analysis. [Dataset]. https://www.kaggle.com/datasets/patricklford/adventures-of-sherlock-holmes-sentiment-analysis/discussion
    Explore at:
    zip (219210 bytes)
    Dataset updated
    Aug 25, 2024
    Authors
    Patrick L Ford
    License

    CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    Introduction

    The famous Sherlock Holmes quote, “Data! data! data!” from The Copper Beeches perfectly encapsulates the essence of both detective work and data analysis. Holmes’ relentless pursuit of every detail closely mirrors the approach of modern data analysts, who understand that conclusions drawn without solid data are mere conjecture. Just as Holmes systematically gathered clues, analysed them from different perspectives, and tested hypotheses to arrive at the truth, today’s analysts follow similar processes when investigating complex data-driven problems. This project draws a parallel between Holmes’ detective methods and modern data analysis techniques by visualising and interpreting data from The Adventures of Sherlock Holmes.

    “**Data! data! data!**” he cried, impatiently. “I can’t make bricks without clay.”

    The above quote comes from one of my favourite Sherlock Holmes stories, The Copper Beeches. In this single outburst, Holmes captures a principle that resonates deeply with today’s data analysts: without data, conclusions are mere speculation. Data is the bedrock of any investigation. Without sufficient data, the route to solving a problem or answering a question is clouded with uncertainty.

    Sherlock Holmes, the iconic fictional detective, thrived on difficult cases, relishing the challenge of pitting his wits against the criminal mind.

    His methods of detection (examining crime scenes, interrogating witnesses, and evaluating motives) closely parallel how a data analyst approaches a complex problem today. By carefully collecting and interpreting data, Holmes was able to unravel mysteries that seemed impenetrable at first glance.

    1. Data Collection: Gathering Evidence
    Holmes’s meticulous approach to data collection mirrors the first stage of data analysis. Just as Holmes would scrutinise a crime scene for every detail, whether a footprint, a discarded note, or a peculiar smell, data analysts seek to gather as much relevant data as possible. Just as incomplete or biased data can skew results in modern analysis, Holmes understood that every clue mattered: overlooking a small piece of information could compromise the entire investigation.

    2. Data Quality: “I can’t make bricks without clay.”
    This quote is more than just a witty remark; it highlights the importance of having the right data. In the same way that substandard materials result in poor construction, incomplete or inaccurate data leads to unreliable analysis. Today’s analysts face similar issues: they must assess data integrity, clean noisy datasets, and ensure they’re working with accurate information before drawing conclusions. Holmes, in his time, would painstakingly verify each clue, ensuring that he was not misled by false leads.

    3. Data Analysis: Considering Multiple Perspectives
    Holmes’s genius lay not just in gathering data, but in the way he analysed it. He would often examine a problem from multiple angles, revisiting clues with fresh perspectives to see what others might have missed. In modern data analysis, this approach is akin to using different models, visualisations, and analytical methods to interpret the same dataset. Analysts explore data from multiple viewpoints, testing different hypotheses, and applying various algorithms to see which provides the most plausible insight.

    4. Hypothesis Testing: Eliminate the Improbable
    One of Holmes’s guiding principles was: “When you have eliminated the impossible, whatever remains, however improbable, must be the truth.” This mirrors the process of hypothesis testing in data analysis. Analysts might begin with several competing theories about what the data suggests. By testing these hypotheses, ruling out those that are contradicted by the data, they zero in on the most likely explanation. For both Holmes and today’s data analysts, the process of elimination is crucial to arriving at the correct answer.

    5. Insight and Conclusion: The Final Deduction
    After piecing together all the clues, Holmes would reveal his conclusion, often leaving his audience in awe at how the seemingly unrelated pieces of data fit together. Similarly, data analysts must present their findings clearly and compellingly, translating raw data into actionable insights. The ability to connect the dots and tell a coherent story from the data is what transforms analysis into impactful decision-making.

    In summary, the methods Sherlock Holmes employed (gathering data meticulously, testing multiple angles, and drawing conclusions through careful analysis) are strikingly similar to the techniques used by modern data analysts. Just as Holmes required high-quality data and a structured approach to solve crimes, today’s data analysts rely on well-prepared data and methodical analysis to provide insights. Whether you’re cracking a case or uncovering business...
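
    As a concrete illustration of the pipeline described above, here is a minimal sentiment-scoring sketch in Python. It is not the author's exact workflow; it assumes a plain-text copy of the book (the file name is hypothetical) and uses NLTK's VADER scorer.

    ```python
    import nltk
    from nltk.sentiment import SentimentIntensityAnalyzer

    # One-time downloads: sentence tokenizer and the VADER lexicon.
    nltk.download("punkt")
    nltk.download("vader_lexicon")

    # Hypothetical file: a plain-text copy of The Adventures of Sherlock Holmes.
    with open("adventures_of_sherlock_holmes.txt", encoding="utf-8") as f:
        text = f.read()

    sia = SentimentIntensityAnalyzer()
    sentences = nltk.sent_tokenize(text)

    # VADER's compound score runs from -1 (most negative) to +1 (most positive).
    scores = [sia.polarity_scores(s)["compound"] for s in sentences]
    print(f"{len(sentences)} sentences, mean compound score {sum(scores) / len(scores):.3f}")
    ```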

  4. Mali Farm Data

    • kaggle.com
    zip
    Updated Apr 6, 2023
    Cite
    Yerkin Mudebayev (2023). Mali Farm Data [Dataset]. https://www.kaggle.com/datasets/yerkinmudebayev/mali-farm-data/code
    Explore at:
    zip (1033275 bytes)
    Dataset updated
    Apr 6, 2023
    Authors
    Yerkin Mudebayev
    Description

    The project is to conduct a principal components analysis of the Mali Farm data (malifarmdata.xlsx; R. Johnson and D. Wichern, Applied Multivariate Statistical Analysis, Pearson, New Jersey, 2019). You will use S for the PCA.

    (a) Store the data in matrix X.
    (b) Carry out an initial investigation. Indicate if you had to process the data file in any way. Do not transform the data. Explain any conclusions drawn from the evidence and back up your conclusions. Hint: pay attention to detection of outliers.
      i. The data in rows 25, 34, 52, 57, 62, 69, 72 are outliers. Provide at least two indicators for each of these data points that justify this claim.
      ii. Explain any other conclusions drawn from the initial investigation.
    (c) Create a data matrix X by removing the outliers.
    (d) Carry out a principal component analysis on X.
      i. Give the relevant sample covariance matrix S.
      ii. List the eigenvalues and describe the percent contributions to the variance.
      iii. Determine the number of principal components to retain and justify your answer by considering at least three methods.
      iv. Give the eigenvectors for the principal components you retain.
      v. Considering the coefficients of the principal components, describe dependencies of the principal components on the variables.
      vi. Using at least the first two principal components, display appropriate scatter plots of pairs of principal components. Make observations about the plots.
    (e) Carry out a principal component analysis on X (repeat sub-steps i-vi of part (d)).
    (f) Compare the results of the two analyses. How much effect did the outliers have on the principal component analysis? Which result do you prefer, and why?
    (g) Include your code.

    Key for Mali farm data:
      Family = number of people in the household
      DistRD = distance in kilometers to the nearest passable road
      Cotton = hectares of cotton planted in 2000
      Maize = hectares of maize planted in 2000
      Sorg = hectares of sorghum planted in 2000
      Millet = hectares of millet planted in 2000
      Bull = total number of bullocks
      Cattle = total number of cattle
      Goat = total number of goats
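
    A minimal sketch of parts (a) and (d) is shown below, assuming the Excel file and the column names from the key (the exact spellings, e.g. DistRD, are assumptions). It is a starting point, not a model answer.

    ```python
    import numpy as np
    import pandas as pd

    # Column names assumed from the key above; adjust to the actual file headers.
    cols = ["Family", "DistRD", "Cotton", "Maize", "Sorg",
            "Millet", "Bull", "Cattle", "Goat"]

    # (a) Store the data in matrix X.
    df = pd.read_excel("malifarmdata.xlsx")
    X = df[cols].to_numpy(dtype=float)

    # (d) PCA via the sample covariance matrix S and its eigendecomposition.
    S = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(S)   # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]      # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # (d-ii) Percent contribution of each component to the total variance.
    print(100 * eigvals / eigvals.sum())

    # Scores on the first two principal components (for the scatter plot in d-vi).
    Z = (X - X.mean(axis=0)) @ eigvecs[:, :2]
    ```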

  5. California Fire Perimeters (1950+)

    • catalog.data.gov
    • data.cnra.ca.gov
    • +2 more
    Updated Oct 23, 2025
    + more versions
    Cite
    CAL FIRE (2025). California Fire Perimeters (1950+) [Dataset]. https://catalog.data.gov/dataset/california-fire-perimeters-1950-c3fa2
    Explore at:
    Dataset updated
    Oct 23, 2025
    Dataset provided by
    California Department of Forestry and Fire Protection (http://calfire.ca.gov/)
    Area covered
    California
    Description

    The California Department of Forestry and Fire Protection's Fire and Resource Assessment Program (FRAP) annually maintains and distributes a historical wildland fire perimeter dataset from across public and private lands in California. The GIS data is developed with the cooperation of the United States Forest Service Region 5, the Bureau of Land Management, California State Parks, the National Park Service, and the United States Fish and Wildlife Service, and is released in the spring with added data from the previous calendar year. Although the dataset represents the most complete digital record of fire perimeters in California, it is still incomplete, and users should be cautious when drawing conclusions based on the data. This data should be used carefully for statistical analysis and reporting due to missing perimeters (see Use Limitation in metadata). Some fires are missing because historical records were lost or damaged, were too small for the minimum cutoffs, had inadequate documentation, or have not yet been incorporated into the database. Other errors in the fire perimeter database include duplicate fires and over-generalization; over-generalization, particularly with large old fires, may show unburned "islands" within the final perimeter as burned. Users of the fire perimeter database must exercise caution in application of the data: careful use will prevent users from drawing inaccurate or erroneous conclusions. This data is updated annually in the spring with fire perimeters from the previous fire season. This dataset may differ in California compared to that available from the National Interagency Fire Center (NIFC) due to different requirements between the two datasets. The data covers fires back to 1878. As of May 2025, it represents fire24_1. Please help improve this dataset by filling out this survey with feedback: Historic Fire Perimeter Dataset Feedback (arcgis.com).

    Current criteria for data collection are as follows:
    • CAL FIRE (including contract counties) submit perimeters ≥10 acres in timber, ≥50 acres in brush, or ≥300 acres in grass, and/or ≥3 impacted residential or commercial structures, and/or caused ≥1 fatality.
    • All cooperating agencies submit perimeters ≥10 acres.

    Version update: Firep24_1 was released in April 2025. Five hundred forty-eight fires from the 2024 fire season were added to the database (2 from BIA, 56 from BLM, 197 from CAL FIRE, 193 from Contract Counties, 27 from LRA, 8 from NPS, 55 from USFS, and 8 from USFW). Six perimeters were added from the 2025 fire season (as a special case due to an unusual January fire siege). Five duplicate fires were removed, and the 2023 Sage was replaced with a more accurate perimeter. There were 900 perimeters that received updated attribution (705 removed "FIRE" from the end of the Fire Name field and 148 replaced the Complex IRWIN ID with the Complex local incident number in the COMPLEX_ID field). The following fires were identified as meeting the collection criteria but are not included in this version and will hopefully be added in a future update: Addie (2024-CACND-002119), Alpaugh (2024-CACND-001715), South (2024-CATIA-001375). One perimeter is missing a containment date that will be updated in the next release. Cross-checking CALFIRS reporting for new CAL FIRE submissions, to ensure accuracy of the cause class, was added to the compilation process. The cause class domain description for "Powerline" was updated to "Electrical Power" to be more inclusive of cause reports.

    Includes separate layers filtered by criteria as follows:
    • California Fire Perimeters (All): unfiltered; the entire collection of wildfire perimeters in the database. Scale dependent; starts displaying at the country level scale.
    • Recent Large Fire Perimeters (≥5000 acres): filtered for wildfires greater than or equal to 5,000 acres for the last 5 years of fires (2020-January 2025), symbolized with color by year. Scale dependent; starts displaying at the country level scale, with year-only labels for recent large fires.
    • California Fire Perimeters (1950+): filtered for wildfires that started in 1950-January 2025. Symbolized by decade; displays starting at the country level scale.

    Detailed metadata is included in the following document: Wildland Fire Perimeters (Firep24_1) Metadata. See more information on the Living Atlas data release here: CAL FIRE Historical Fire Perimeters Available in ArcGIS Living Atlas. For any questions, please contact the data steward: Kim Wallin, GIS Specialist, CAL FIRE, Fire & Resource Assessment Program (FRAP), kimberly.wallin@fire.ca.gov.
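
    As a sketch of how the filtered layers above could be reproduced locally, the snippet below uses GeoPandas. The file path and attribute names (YEAR_, GIS_ACRES) are assumptions based on typical FRAP schemas, not confirmed from the metadata.

    ```python
    import geopandas as gpd

    # Path and layer name are illustrative; point these at the downloaded data.
    fires = gpd.read_file("fire24_1.gdb", layer="firep24_1")

    # Recreate the "Recent Large Fire Perimeters" layer: >= 5,000 acres, 2020 onward.
    # Column names YEAR_ and GIS_ACRES are assumed, not confirmed from the metadata.
    recent_large = fires[(fires["YEAR_"].astype(int) >= 2020) &
                         (fires["GIS_ACRES"] >= 5000)]
    print(len(recent_large), "recent large fire perimeters")
    ```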

  6. Crime Rate and GDP Datasets 2021 & 2023

    • kaggle.com
    Updated May 28, 2024
    Cite
    Fran Llamas (2024). Crime Rate and GDP Datasets 2021 & 2023 [Dataset]. https://www.kaggle.com/datasets/franllamas/crime-rate-and-gdp-datasets-2021-and-2023
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 28, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Fran Llamas
    Description

    Overview:

    This project aims to investigate the potential correlation between the Gross Domestic Product (GDP) of approximately 190 countries for the years 2021 and 2023 and their corresponding crime ratings. The crime ratings are represented on a scale from 0 to 10, with 0 indicating minimal or null crime activity and 10 representing the highest level of criminal activity.

    Dataset:

    The dataset used in this project comprises GDP data for the years 2021 and 2023 for around 190 countries, sourced from reputable international databases. Additionally, crime rating scores for the same countries and years are collected from credible sources such as governmental agencies, law enforcement organizations, or reputable research institutions.

    Methodology:

    • Data Collection: GDP data for 2021 and 2023, along with crime rating scores, are gathered for approximately 190 countries.
    • Data Preprocessing: The collected data is cleaned and standardized to ensure consistency and compatibility across different datasets.
    • Analysis: Statistical methods and data visualization techniques are employed to explore the potential relationship between GDP and crime ratings.
    • Interpretation: Findings from the analysis are interpreted to determine the strength and direction of any observed correlations between GDP and crime ratings.
    • Conclusion: Based on the analysis results, conclusions are drawn regarding the existence and significance of the relationship between GDP and crime ratings.
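
    The sketch below illustrates the analysis step under stated assumptions: two hypothetical CSVs keyed by country, with column names invented for illustration. The actual files in the dataset may be organized differently.

    ```python
    import pandas as pd

    # Hypothetical file and column names; adjust to the actual CSVs in the dataset.
    gdp = pd.read_csv("gdp_2021_2023.csv")      # columns: country, gdp_2021, gdp_2023
    crime = pd.read_csv("crime_2021_2023.csv")  # columns: country, crime_2021, crime_2023

    df = gdp.merge(crime, on="country")

    # Spearman rank correlation is a reasonable first look, since the crime
    # rating is an ordinal 0-10 scale rather than a true interval measure.
    for year in ("2021", "2023"):
        rho = df[f"gdp_{year}"].corr(df[f"crime_{year}"], method="spearman")
        print(f"{year}: Spearman rho = {rho:.2f}")
    ```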

    Expected Outcomes:

    Identification of any significant correlations or patterns between GDP and crime ratings across different countries. Insights into the potential socioeconomic factors influencing crime rates and their relationship with economic indicators like GDP. Implications for policymakers, law enforcement agencies, and researchers in understanding the dynamics between economic development and crime prevalence.

  7. This is the data set that we used to reach the conclusions drawn in the...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Dec 30, 2024
    Cite
    Kaseke, Farayi; Stewart, Aimee; Kaseke, Timothy; Gori, Elizabeth; Gwanzura, Lovemore; Musarurwa, Cuthbert; Nyengerai, Tawanda (2024). This is the data set that we used to reach the conclusions drawn in the manuscript with related metadata and methods. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001473374
    Explore at:
    Dataset updated
    Dec 30, 2024
    Authors
    Kaseke, Farayi; Stewart, Aimee; Kaseke, Timothy; Gori, Elizabeth; Gwanzura, Lovemore; Musarurwa, Cuthbert; Nyengerai, Tawanda
    Description

    This data can be replicated to report the study findings in their entirety, including: (a) the values behind the means, standard deviations, and other measures reported; (b) the values used to build graphs; (c) the points extracted from images for analysis. (XLSX)

  8. fNIRS DATA AND ANALYSIS SCRIPTS

    • kaggle.com
    zip
    Updated Jun 25, 2025
    Cite
    Aysenur Eser (2025). fNIRS DATA AND ANALYSIS SCRIPTS [Dataset]. https://www.kaggle.com/datasets/aysenureser/fnirs-data-and-analysis-scripts
    Explore at:
    zip (409553706 bytes)
    Dataset updated
    Jun 25, 2025
    Authors
    Aysenur Eser
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
    License information was derived automatically

    Description

    The data set used to reach the conclusions drawn in the manuscript is stored in the folder named ‘Raw_Subject_Data’. Related metadata produced at the interim steps of the analysis described in the presented work can be found in the folder named ‘Preprocessed_Data_Interim_Outputs’. Scripts for executing the methodology can be found in the folder named ‘Scripts’. Additional data required to replicate the reported study findings can be found in the folder named ‘Feature_Set’.

  9. DEEPEN Global Standardized Categorical Exploration Datasets for Magmatic...

    • catalog.data.gov
    • data.openei.org
    • +1 more
    Updated Jan 20, 2025
    + more versions
    Cite
    National Renewable Energy Laboratory (2025). DEEPEN Global Standardized Categorical Exploration Datasets for Magmatic Plays [Dataset]. https://catalog.data.gov/dataset/deepen-global-standardized-categorical-exploration-datasets-for-magmatic-plays-f1ecf
    Explore at:
    Dataset updated
    Jan 20, 2025
    Dataset provided by
    National Renewable Energy Laboratory
    Description

    DEEPEN stands for DE-risking Exploration of geothermal Plays in magmatic ENvironments. As part of the development of the DEEPEN 3D play fairway analysis (PFA) methodology for magmatic plays (conventional hydrothermal, superhot EGS, and supercritical), weights needed to be developed for use in the weighted sum of the different favorability index models produced from geoscientific exploration datasets. This was done using two different approaches: one based on expert opinions, and one based on statistical learning. This GDR submission includes the datasets used to produce the statistical learning-based weights. While expert opinions allow us to include more nuanced information in the weights, expert opinions are subject to human bias. Data-centric or statistical approaches help to overcome these potential human biases by focusing on and drawing conclusions from the data alone. The drawback is that, to apply these types of approaches, a dataset is needed. Therefore, we attempted to build comprehensive standardized datasets mapping anomalies in each exploration dataset to each component of each play. This data was gathered through a literature review focused on magmatic hydrothermal plays, along with well-characterized areas where superhot or supercritical conditions are thought to exist. Datasets were assembled for all three play types, but the hydrothermal dataset is the least complete due to its relatively low priority. For each known or assumed resource, the dataset states what anomaly in each exploration dataset is associated with each component of the system. The data is only semi-quantitative: values are high, medium, or low, relative to background levels. In addition, the dataset has significant gaps, as not every possible exploration dataset has been collected and analyzed at every known or suspected geothermal resource area, in the context of all possible play types. The following training sites were used to assemble this dataset:

    • Conventional magmatic hydrothermal: Akutan (from AK PFA), Oregon Cascades PFA, Glass Buttes OR, Mauna Kea (from HI PFA), Lanai (from HI PFA), Mt St Helens Shear Zone (from WA PFA), Wind River Valley (from WA PFA), Mount Baker (from WA PFA).
    • Superhot EGS: Newberry (EGS demonstration project), Coso (EGS demonstration project), Geysers (EGS demonstration project), Eastern Snake River Plain (EGS demonstration project), Utah FORGE, Larderello, Kakkonda, Taupo Volcanic Zone, Acoculco, Krafla.
    • Supercritical: Coso, Geysers, Salton Sea, Larderello, Los Humeros, Taupo Volcanic Zone, Krafla, Reyjanes, Hengill.

    Disclaimer: Treat the supercritical fluid anomalies with skepticism. They are based on assumptions due to the general lack of confirmed supercritical fluid encounters and samples at the sites included in this dataset at the time of its assembly. The main assumption was that the supercritical fluid in a given geothermal system has shared properties with the hydrothermal fluid, which may not be the case in reality.

    Once the datasets were assembled, principal component analysis (PCA) was applied to each. PCA is an unsupervised statistical learning technique, meaning that labels are not required on the data; it summarizes the directions of variance in the data. This approach was chosen because our labels are not certain, i.e., we do not know with 100% confidence that superhot resources exist at all the assumed positive areas. We also do not have data for any known non-geothermal areas, meaning that it would be challenging to apply a supervised learning technique. In order to generate weights from the PCA, an analysis of the PCA loading values was conducted. PCA loading values represent how much a feature contributes to each principal component, and therefore to the overall variance in the data.
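
    Below is a sketch of the loadings-to-weights idea, assuming a standardized numeric coding of the categorical data. The specific weighting rule (mean absolute loading scaled by explained variance) is one plausible reading of the description, not the confirmed DEEPEN procedure.

    ```python
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    # X: sites x exploration-dataset features, with low/medium/high coded as 0/1/2.
    # This random matrix is a stand-in for the assembled categorical dataset.
    rng = np.random.default_rng(0)
    X = rng.integers(0, 3, size=(20, 6)).astype(float)

    pca = PCA().fit(StandardScaler().fit_transform(X))

    # Loadings: how much each feature contributes to each principal component.
    loadings = pca.components_  # shape (n_components, n_features)

    # One plausible weighting rule: average absolute loading per feature,
    # weighted by each component's share of the explained variance.
    weights = np.abs(loadings).T @ pca.explained_variance_ratio_
    weights /= weights.sum()
    print(weights)
    ```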

  10. Background data for: Ordinal response scales: Psychometric grounding for...

    • dataverse.no
    • search.dataone.org
    pdf, png, text/tsv +1
    Updated Jul 17, 2025
    Cite
    Lukas Sönning; Lukas Sönning (2025). Background data for: Ordinal response scales: Psychometric grounding for design and analysis [Dataset]. http://doi.org/10.18710/0VLSLW
    Explore at:
    text/tsv(3271), text/tsv(1293), text/tsv(902), text/tsv(91906), txt(31985), text/tsv(4283), text/tsv(958), png(85437), text/tsv(19110), text/tsv(5134), pdf(197065), text/tsv(2430)
    Dataset updated
    Jul 17, 2025
    Dataset provided by
    DataverseNO
    Authors
    Lukas Sönning; Lukas Sönning
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Time period covered
    Jan 1, 1963 - Dec 31, 2022
    Dataset funded by
    German Research Foundation (DFG)
    Description

    This dataset contains background data and supplementary material for a methodological study on the use of ordinal response scales in linguistic research. For the literature survey reported in that study, which examines how rating scales are used in current linguistic research (4,441 papers from 16 linguistic journals, published between 2012 and 2022), it includes a tabular file listing the 406 research articles that report ordinal rating scale data. This file records annotated attributes of the studies and rating scales. Further, the dataset includes summary data gathered in a review of the psychometric literature on the interpretation of quantificational expressions that are often used to build graded scales. Empirical findings are collected for five rating scale dimensions: agreement (1 study), intensity (3 studies), frequency (17 studies), probability (11 studies), and quality (3 studies). Finally, the dataset includes new data from 20 informants on the interpretation of the quantifiers "few", "some", "many", and "most". Abstract of the related publication: Ordinal scales are commonly used in applied linguistics. To summarize the distribution of responses provided by informants, these are usually converted into numbers and then averaged or analyzed with ordinary regression models. This approach has been criticized in the literature; one caveat (among others) is the assumption that distances between categories are known. The present paper illustrates how empirical insights into the perception of response labels may inform the design and analysis stage of a study. We start with a review of how ordinal scales are used in linguistic research. Our survey offers insights into typical scale layouts and analysis strategies, and it allows us to identify three commonly used rating dimensions (agreement, intensity, and frequency). We take stock of the experimental literature on the perception of relevant scale point labels and then demonstrate how psychometric insights may direct scale design and data analysis. This includes a careful consideration of measurement-theoretic and statistical issues surrounding the numeric-conversion approach to ordinal data. We focus on the consequences of these drawbacks for the interpretation of empirical findings, which will enable researchers to make informed decisions and avoid drawing false conclusions from their data. We present a case study on yous(e) in British and Scottish English, which shows that reliance on psychometric scale values can alter statistical conclusions, while also giving due consideration to the key limitations of the numeric-conversion approach to ordinal data analysis.
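
    A toy illustration of the numeric-conversion caveat discussed above: the psychometric scale values below are invented for illustration, not taken from the study's data.

    ```python
    import numpy as np

    # Responses on a 5-point frequency scale, never ... always (toy data).
    responses = np.array([1, 2, 2, 3, 3, 3, 4, 5, 5, 5])

    # Naive numeric conversion assumes the categories are equidistant.
    naive = responses.mean()

    # Psychometrically grounded values (0-100 scale); these particular numbers
    # are illustrative only, not the scale values reported in the study.
    scale_values = {1: 4, 2: 22, 3: 45, 4: 72, 5: 93}
    grounded = np.mean([scale_values[r] for r in responses])

    print(f"naive mean: {naive:.2f} (on 1-5)")
    print(f"psychometric mean: {grounded:.1f} (on 0-100)")
    ```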

  11. Exploratory Analysis of CMS Open Data: Investigation of Dimuon Mass Spectrum...

    • zenodo.org
    zip
    Updated Sep 29, 2025
    Cite
    Andre Luis Tomaz Dionísio; Andre Luis Tomaz Dionísio (2025). Exploratory Analysis of CMS Open Data: Investigation of Dimuon Mass Spectrum Anomalies in the 10-15 GeV Range [Dataset]. http://doi.org/10.5281/zenodo.17220766
    Explore at:
    zip
    Dataset updated
    Sep 29, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andre Luis Tomaz Dionísio; Andre Luis Tomaz Dionísio
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This dataset contains the results of an exploratory analysis of CMS Open Data from LHC Run 1 (2010-2012) and Run 2 (2015-2018), focusing on the dimuon invariant mass spectrum in the 10-15 GeV range. The analysis investigates potential anomalies at 11.9 GeV and applies various statistical methods to characterize observed features.

    Methodology:

    • Event selection and reconstruction using CMS NanoAOD format
    • Dimuon invariant mass analysis with background estimation
    • Angular distribution studies for quantum number determination
    • Statistical analysis including significance testing
    • Systematic uncertainty evaluation
    • Conservation law verification

    Key Analysis Components:

    • Mass spectrum reconstruction and peak identification
    • Background modeling using sideband methods
    • Angular correlation analysis (sphericity, thrust, momentum distributions)
    • Cross-validation using multiple event selection criteria
    • Monte Carlo comparison for background understanding

    Results Summary: The analysis identifies several features in the dimuon mass spectrum requiring further investigation. Preliminary observations suggest potential anomalies around 11.9 GeV, though these findings require independent validation and peer review before drawing definitive conclusions.

    Data Products:

    • Processed event datasets
    • Analysis scripts and methodology
    • Statistical outputs and uncertainty estimates
    • Visualization tools and plots
    • Systematic studies documentation

    Limitations: This work represents preliminary exploratory analysis. Results have not undergone formal peer review and should be considered investigative rather than conclusive. Independent replication and validation by the broader physics community are essential before any definitive claims can be made.

    Keywords: CMS experiment, dimuon analysis, mass spectrum, exploratory analysis, LHC data, particle physics, statistical analysis, anomaly investigation

    # Dark Photon Search at 11.9 GeV

    ## Executive Summary

    **Historic Search: First Evidence of a Massive Dark Photon**

    We report the search for a new vector gauge boson at 11.9 GeV, identified as a dark photon (A'), representing the first confirmed portal anomaly between the Standard Model and a hidden sector. This search, based on CMS Open Data from LHC Run 1 (2010-2012) and Run 2 (2015-2018), provides direct experimental evidence for physics beyond the Standard Model.

    ## Search Highlights

    ### Anomaly Properties
    - **Mass**: 11.9 ± 0.1 GeV
    - **Quantum Numbers**: J^PC = 1^-- (vector gauge boson)
    - **Spin**: 1
    - **Parity**: Negative
    - **Isospin**: 0 (singlet)
    - **Hypercharge**: 0

    ### Statistical Significance
    - **Total Events**: 63,788 candidates in Run 1
    - **Signal Strength**: > 5σ significance
    - **Decay Channel**: A' → μ⁺μ⁻ (dominant)
    - **Branching Ratio**: ~50% to neutral pairs

    ### Conservation Laws
    All fundamental symmetries preserved:
    - ✓ Energy-momentum
    - ✓ Charge
    - ✓ Lepton number
    - ✓ CPT

    ## Project Structure

    ```
    search/
    ├── README.md                  # This file
    ├── docs/
    │   ├── paper/                 # Main search paper
    │   │   ├── manuscript.tex     # LaTeX source
    │   │   ├── abstract.txt       # Paper abstract
    │   │   └── figures/           # Paper figures
    │   └── supplementary/         # Additional materials
    │       ├── methods.pdf        # Detailed methodology
    │       ├── systematics.pdf    # Systematic uncertainties
    │       └── theory.pdf         # Theoretical implications
    ├── data/
    │   ├── run1/                  # 7-8 TeV (2010-2012)
    │   │   ├── raw/               # Original ROOT files
    │   │   ├── processed/         # Processed datasets
    │   │   └── results/           # Analysis outputs
    │   └── run2/                  # 13 TeV (2015-2018)
    │       ├── raw/               # Original ROOT files
    │       ├── processed/         # Processed datasets
    │       └── results/           # Analysis outputs
    ├── analysis/
    │   └── scripts/               # Analysis code
    │       ├── dark_photon_symmetry_analysis.py
    │       ├── hidden_sector_10_150_search.py
    │       ├── hidden_10_15_gev_analysis.py
    │       └── validation/        # Cross-checks
    ├── figures/                   # Publication-ready plots
    │   ├── mass_spectrum.png      # Invariant mass distribution
    │   ├── angular_dist.png       # Angular distributions
    │   ├── symmetry_plots.png     # Symmetry analysis
    │   └── cascade_spectrum.png   # Hidden sector cascade
    └── validation/                # Systematic studies
        ├── background_estimation/
        ├── signal_extraction/
        └── systematic_errors/
    ```

    ## Key Evidence

    ### 1. Quantum Number Determination
    - **Angular Distribution**: ⟨|P₁|⟩ = 0.805 (strong anisotropy)
    - **Quadrupole Moment**: ⟨P₂⟩ = 0.573 (non-zero)
    - **Anomaly Type Score**: Vector = 90/100 (Preliminary)

    ### 2. Hidden Sector Connection
    - 236,181 total events in 10-150 GeV range
    - Exponential cascade spectrum indicating hidden valley dynamics
    - Dark photon serves as portal anomaly

    ### 3. Decay Topology
    - **Sphericity**: 0.161 (jet-like)
    - **Thrust**: 0.686 (moderate collimation)
    - Consistent with two-body decay A' → μ⁺μ⁻

    ## Physical Interpretation

    The anomaly represents:
    1. **New Force Carrier**: Fifth fundamental force beyond the four known forces
    2. **Portal Anomaly**: Mediator between Standard Model and hidden/dark sector
    3. **Dark Matter Connection**: Potential mediator for dark matter interactions

    ## Theoretical Framework

    ### Kinetic Mixing
    The dark photon arises from kinetic mixing between U(1)_Y (hypercharge) and U(1)_D (dark charge):
    ```
    L_mix = -(ε/2) F^Y_μν F_D^μν
    ```
    where ε is the mixing parameter (~10^-3 based on observed coupling).

    ### Hidden Valley Scenario
    The exponential cascade spectrum suggests:
    - Complex hidden sector with multiple states
    - Possible dark hadronization
    - Rich phenomenology awaiting exploration

    ## Collaborators and Credits

    **Lead Analysis**: CMS Open Data Analysis Team
    **Data Source**: CERN Open Data Portal
    **Period**: 2010-2012 (Run 1), 2015-2018 (Run 2)
    **Computing**: Local analysis on CMS NanoAOD format



    ## How to Reproduce

    ### Requirements
    ```bash
    pip install uproot awkward numpy matplotlib
    ```

    ### Quick Start
    ```bash
    cd analysis/scripts/
    python dark_photon_symmetry_analysis.py
    python hidden_10_15_gev_analysis.py
    ```
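
    ### Example: Dimuon Invariant Mass (Sketch)

    For orientation, here is a minimal sketch of the core dimuon-mass computation using the listed dependencies. The file path is illustrative and the massless-muon approximation is a simplification; the analysis scripts in `analysis/scripts/` remain the authoritative implementation.

    ```python
    import uproot
    import awkward as ak
    import numpy as np

    # Open a CMS NanoAOD file (path illustrative; use any file from data/run1/raw/).
    events = uproot.open("data/run1/raw/sample.root")["Events"]
    arrays = events.arrays(["Muon_pt", "Muon_eta", "Muon_phi", "Muon_charge"])

    # Keep events with at least two muons.
    mu = arrays[ak.num(arrays["Muon_pt"]) >= 2]

    # Two leading muons, required to carry opposite charge.
    pt1, pt2 = mu["Muon_pt"][:, 0], mu["Muon_pt"][:, 1]
    eta1, eta2 = mu["Muon_eta"][:, 0], mu["Muon_eta"][:, 1]
    phi1, phi2 = mu["Muon_phi"][:, 0], mu["Muon_phi"][:, 1]
    opposite = mu["Muon_charge"][:, 0] * mu["Muon_charge"][:, 1] < 0

    # Massless-muon approximation: m^2 ~ 2 pt1 pt2 (cosh(d_eta) - cos(d_phi)).
    m2 = 2 * pt1 * pt2 * (np.cosh(eta1 - eta2) - np.cos(phi1 - phi2))
    mass = np.sqrt(ak.to_numpy(m2[opposite]))

    # Inspect the 10-15 GeV window discussed above.
    window = mass[(mass > 10) & (mass < 15)]
    print(f"{len(window)} candidate events in 10-15 GeV")
    ```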

    ## Significance Statement

    This search represents the first confirmed evidence of a portal anomaly connecting the Standard Model to a hidden sector. The 11.9 GeV dark photon opens an entirely new frontier in anomaly physics, providing experimental access to previously invisible physics and potentially explaining dark matter interactions.

    ## Contact

    For questions about this search or collaboration opportunities:
    - Email: andreluisdionisio@gmail.com

    ---

    "We're not at the end of anomaly physics - we're at the beginning of dark sector physics!"

    3665778186 00382C40-4D7F-E211-AD6F-003048FFCBFC.root
    2581315530 0E5F189B-5D7F-E211-9423-002354EF3BE1.root
    2149825126 1AE176AC-5A7F-E211-8E63-00261894397D.root
    1792851725 2044D46B-DE7F-E211-9C82-003048FFD76E.root
    3186214416 4CAE8D51-4A7F-E211-9937-0025905964A2.root
    3220923349 72FDEF89-497F-E211-9CFA-002618943958.root
    2555255008 7A35A5A2-547F-E211-940B-003048678DA2.root
    3875410897 7E942EED-457F-E211-938E-002618FDA28E.root
    2409745919 8406DE2F-407F-E211-A6A5-00261894395F.root
    2421251748 8A61DAA8-3C7F-E211-94A6-002618943940.root
    2315643699 98909097-417F-E211-9009-002618943838.root
    2614932091 A0963AD9-567F-E211-A8AF-002618943901.root
    2438057881 ACE2DF9A-477F-E211-9C29-003048679266.root
    2206652387 B6AA897F-467F-E211-8381-002618943854.root
    2365666837 C09519C8-4B7F-E211-9BCE-003048678B34.root
    2477336101 C68AE3A5-447F-E211-928E-00261894388B.root
    2556444022 C6CEC369-437F-E211-81B0-0026189438BD.root
    3184171088 D60FF379-4E7F-E211-8BA4-002590593878.root
    2381001693

  12. Dataset.

    • figshare.com
    xlsx
    Updated Sep 19, 2025
    + more versions
    Cite
    James M. Smoliga; Kathryn E. Sawyer (2025). Dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0315560.s001
    Explore at:
    xlsx
    Dataset updated
    Sep 19, 2025
    Dataset provided by
    PLOS ONE
    Authors
    James M. Smoliga; Kathryn E. Sawyer
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Taylor Swift’s presence at National Football League (NFL) games was reported to have a causal effect on the performance of Travis Kelce and the Kansas City Chiefs. Critical examination of the supposed “Swift effect” provides some surprising lessons relevant to the scientific community. Here, we present a formal analysis to determine whether the media narrative that Swift’s presence at NFL games had any impact on player or team performance holds up, and we draw parallels to scientific journalism and clinical research. We performed a quasi-experimental study, using covariate matching. Linear mixed effects models were used to determine how Swift’s presence or absence in Swift-era games influenced Kelce’s performance, relative to historical data. Additionally, a binary logistic regression model was developed to determine if Swift’s presence influenced the Chiefs’ game outcomes, relative to historical averages. Across multiple matching approaches, analyses demonstrated that Kelce’s yardage did not significantly differ when Taylor Swift was in attendance (n = 13 games) relative to matched pre‐Swift games. Although a decline in Kelce’s performance was observed in games without Swift (n = 6 games), the statistical significance of this finding varied by the matching algorithm used, indicating inconsistency in the effect. Similarly, Swift’s attendance did not result in a significant increase in the Chiefs’ likelihood of winning. Together, these findings suggest that the purported “Swift effect” is not supported by robust evidence. The weak statistical evidence that spawned the concept of the “Swift effect” is rooted in a constellation of fallacies common to medical journalism and research, including over-simplification, sensationalism, attribution bias, unjustified mechanisms, inadequate sampling, emphasis on surrogate outcomes, and inattention to comparative effectiveness. Clinicians and researchers must be vigilant to avoid falling victim to the “Swift effect,” since failure to scrutinize available evidence can lead to acceptance of unjustified theories and negatively impact clinical decision-making.
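
    For readers who want to see the shape of such an analysis, here is a minimal statsmodels sketch. The file and column names are hypothetical, and the authors' covariate-matching step is omitted.

    ```python
    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical file and columns: yards, win (0/1), swift_present (0/1), season.
    games = pd.read_csv("kelce_games.csv")

    # Linear mixed effects model: receiving yards vs. Swift attendance,
    # with a random intercept per season to absorb year-to-year variation.
    lmm = smf.mixedlm("yards ~ swift_present", data=games, groups=games["season"]).fit()
    print(lmm.summary())

    # Binary logistic regression: game outcome (1 = win) vs. attendance.
    logit = smf.logit("win ~ swift_present", data=games).fit()
    print(logit.params)
    ```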

  13. Datasheet3_Assessing disparities through missing race and ethnicity data:...

    • frontiersin.figshare.com
    • datasetcatalog.nlm.nih.gov
    pdf
    Updated Jul 24, 2024
    + more versions
    Cite
    Katelyn M. Banschbach; Jade Singleton; Xing Wang; Sheetal S. Vora; Julia G. Harris; Ashley Lytch; Nancy Pan; Julia Klauss; Danielle Fair; Erin Hammelev; Mileka Gilbert; Connor Kreese; Ashley Machado; Peter Tarczy-Hornoch; Esi M. Morgan (2024). Datasheet3_Assessing disparities through missing race and ethnicity data: results from a juvenile arthritis registry.pdf [Dataset]. http://doi.org/10.3389/fped.2024.1430981.s003
    Explore at:
    pdf
    Dataset updated
    Jul 24, 2024
    Dataset provided by
    Frontiers Media (http://www.frontiersin.org/)
    Authors
    Katelyn M. Banschbach; Jade Singleton; Xing Wang; Sheetal S. Vora; Julia G. Harris; Ashley Lytch; Nancy Pan; Julia Klauss; Danielle Fair; Erin Hammelev; Mileka Gilbert; Connor Kreese; Ashley Machado; Peter Tarczy-Hornoch; Esi M. Morgan
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Introduction: Ensuring high-quality race and ethnicity data within the electronic health record (EHR) and across linked systems, such as patient registries, is necessary to achieving the goal of inclusion of racial and ethnic minorities in scientific research and detecting disparities associated with race and ethnicity. The project goal was to improve race and ethnicity data completion within the Pediatric Rheumatology Care Outcomes Improvement Network and assess the impact of improved data completion on conclusions drawn from the registry.

    Methods: This is a mixed-methods quality improvement study that consisted of five parts, as follows: (1) identifying baseline missing race and ethnicity data, (2) surveying current collection and entry, (3) completing data through audit and feedback cycles, (4) assessing the impact on outcome measures, and (5) conducting participant interviews and thematic analysis.

    Results: Across six participating centers, 29% of the patients were missing data on race and 31% were missing data on ethnicity. Of patients missing data, most were missing both race and ethnicity. Rates of missingness varied by data entry method (electronic vs. manual). Recovered data had a higher percentage of patients with Other race or Hispanic/Latino ethnicity compared with patients with non-missing race and ethnicity data at baseline. Black patients had a significantly higher odds ratio of having a clinical juvenile arthritis disease activity score (cJADAS10) of ≥5 at first follow-up compared with White patients. There was no significant change in the odds ratio of cJADAS10 ≥5 for race and ethnicity after data completion. Patients missing race and ethnicity were more likely to be missing cJADAS values, which may affect the ability to detect changes in the odds ratio of cJADAS ≥5 after completion.

    Conclusions: About one-third of the patients in a pediatric rheumatology registry were missing race and ethnicity data. After three audit and feedback cycles, centers decreased missing data by 94%, primarily via data recovery from the EHR. In this sample, completion of missing data did not change the findings related to differential outcomes by race. Recovered data were not uniformly distributed compared with those with non-missing race and ethnicity data at baseline, suggesting that differences in outcomes after completing race and ethnicity data may be seen with larger sample sizes.

  14. OC 2017 LiDAR Image Service

    • detroitdata.org
    • accessoakland.oakgov.com
    • +4 more
    Updated May 18, 2021
    Cite
    Oakland County, Michigan (2021). OC 2017 LiDAR Image Service [Dataset]. https://detroitdata.org/dataset/oc-2017-lidar-image-service1
    Explore at:
    html, arcgis geoservices rest apiAvailable download formats
    Dataset updated
    May 18, 2021
    Dataset provided by
    Oakland County, Michigan
    Description

    BY USING THIS WEBSITE OR THE CONTENT THEREIN, YOU AGREE TO THE TERMS OF USE.

    This service provides the classified point cloud (LAS) for the 2017 Michigan LiDAR project, covering approximately 907 square miles across Oakland County. LAS data products are suitable for 1 foot contour generation. USGS LiDAR Base Specification 1.2, QL2. 19.6 cm NVA.

    This data is for planning purposes only and should not be used for legal or cadastral purposes. Any conclusions drawn from analysis of this information are not the responsibility of Sanborn Map Company. Users should be aware that temporal changes may have occurred since this dataset was collected and some parts of this dataset may no longer represent actual surface conditions. Users should not use these data for critical applications without a full awareness of their limitations.

    This service is best used directly within ArcMap or ArcGIS Pro. If the raw LiDAR points are needed, use these clients to extract project-area-sized portions. Due to the density of the data, downloading the entire county from this service is not possible. For further questions, contact the Oakland County Service Center at 248-858-8812, servicecenter@oakgov.com.

  15. DEEPEN Global Standardized Categorical Exploration Datasets for Magmatic...

    • osti.gov
    Updated Jun 30, 2023
    Cite
    Caliandro, Nils; King, Rachel; Taverna, Nicole (2023). DEEPEN Global Standardized Categorical Exploration Datasets for Magmatic Plays [Dataset]. https://www.osti.gov/dataexplorer/biblio/dataset/1995526-deepen-global-standardized-categorical-exploration-datasets-magmatic-plays
    Explore at:
    Dataset updated
    Jun 30, 2023
    Dataset provided by
    United States Department of Energy (http://energy.gov/)
    Authors
    Caliandro, Nils; King, Rachel; Taverna, Nicole
    Description

    DEEPEN stands for DE-risking Exploration of geothermal Plays in magmatic ENvironments. As part of the development of the DEEPEN 3D play fairway analysis (PFA) methodology for magmatic plays (conventional hydrothermal, superhot EGS, and supercritical), weights needed to be developed for use in the weighted sum of the different favorability index models produced from geoscientific exploration datasets. This was done using two different approaches: one based on expert opinions, and one based on statistical learning. This GDR submission includes the datasets used to produce the statistical learning-based weights. While expert opinions allow us to include more nuanced information in the weights, expert opinions are subject to human bias. Data-centric or statistical approaches help to overcome these potential human biases by focusing on and drawing conclusions from the data alone. The drawback is that, to apply these types of approaches, a dataset is needed. Therefore, we attempted to build comprehensive standardized datasets mapping anomalies in each exploration dataset to each component of each play. This data was gathered through a literature review focused on magmatic hydrothermal plays, along with well-characterized areas where superhot or supercritical conditions are thought to exist. Datasets were assembled for all three play types, but the hydrothermal dataset is the least complete due to its relatively low priority. For each known or assumed resource, the dataset states what anomaly in each exploration dataset is associated with each component of the system. The data is only semi-quantitative: values are high, medium, or low, relative to background levels. In addition, the dataset has significant gaps, as not every possible exploration dataset has been collected and analyzed at every known or suspected geothermal resource area, in the context of all possible play types. The following training sites were used to assemble this dataset:

    • Conventional magmatic hydrothermal: Akutan (from AK PFA), Oregon Cascades PFA, Glass Buttes OR, Mauna Kea (from HI PFA), Lanai (from HI PFA), Mt St Helens Shear Zone (from WA PFA), Wind River Valley (from WA PFA), Mount Baker (from WA PFA).
    • Superhot EGS: Newberry (EGS demonstration project), Coso (EGS demonstration project), Geysers (EGS demonstration project), Eastern Snake River Plain (EGS demonstration project), Utah FORGE, Larderello, Kakkonda, Taupo Volcanic Zone, Acoculco, Krafla.
    • Supercritical: Coso, Geysers, Salton Sea, Larderello, Los Humeros, Taupo Volcanic Zone, Krafla, Reyjanes, Hengill.

    Disclaimer: Treat the supercritical fluid anomalies with skepticism. They are based on assumptions due to the general lack of confirmed supercritical fluid encounters and samples at the sites included in this dataset at the time of its assembly. The main assumption was that the supercritical fluid in a given geothermal system has shared properties with the hydrothermal fluid, which may not be the case in reality.

    Once the datasets were assembled, principal component analysis (PCA) was applied to each. PCA is an unsupervised statistical learning technique, meaning that labels are not required on the data; it summarizes the directions of variance in the data. This approach was chosen because our labels are not certain, i.e., we do not know with 100% confidence that superhot resources exist at all the assumed positive areas. We also do not have data for any known non-geothermal areas, meaning that it would be challenging to apply a supervised learning technique. In order to generate weights from the PCA, an analysis of the PCA loading values was conducted. PCA loading values represent how much a feature contributes to each principal component, and therefore to the overall variance in the data.

  16. OC 2017 DEM Image Service

    • portal.datadrivendetroit.org
    • data.ferndalemi.gov
    • +4 more
    Updated May 5, 2018
    Cite
    Oakland County, Michigan (2018). OC 2017 DEM Image Service [Dataset]. https://portal.datadrivendetroit.org/datasets/oakgov::oc-2017-dem-image-service/about
    Explore at:
    Dataset updated
    May 5, 2018
    Dataset authored and provided by
    Oakland County, Michigan
    Description

    BY USING THIS WEBSITE OR THE CONTENT THEREIN, YOU AGREE TO THE TERMS OF USE.

    The purpose is to acquire detailed surface elevation data for use in conservation planning, design, research, floodplain mapping, dam safety assessments, and hydrologic modeling. LAS and bare earth DEM data products are suitable for 1 foot contour generation. USGS LiDAR Base Specification 1.2, QL2. 19.6 cm NVA. This metadata record describes the hydro-flattened bare earth digital elevation model (DEM) derived from the classified LiDAR data for the 2017 Michigan LiDAR project, covering approximately 907 square miles, whose extent covers Oakland County.

    This data is for planning purposes only and should not be used for legal or cadastral purposes. Any conclusions drawn from analysis of this information are not the responsibility of Sanborn Map Company. Users should be aware that temporal changes may have occurred since this dataset was collected and some parts of this dataset may no longer represent actual surface conditions. Users should not use these data for critical applications without a full awareness of their limitations. Contact: State of Michigan.

    Due to the large size of the data, downloading the entire county may not be possible. It is recommended to use the live service directly within ArcMap or ArcGIS Pro. For further questions, contact the Oakland County Service Center at 248-858-8812, servicecenter@oakgov.com.

  17. LongAlpaca-Yukang ML Instructional Outputs

    • kaggle.com
    zip
    Updated Nov 24, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    The Devastator (2023). LongAlpaca-Yukang ML Instructional Outputs [Dataset]. https://www.kaggle.com/datasets/thedevastator/longalpaca-yukang-ml-instructional-outputs
    Explore at:
    zip (168273444 bytes)
    Dataset updated
    Nov 24, 2023
    Authors
    The Devastator
    License

    CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    LongAlpaca-Yukang ML Instructional Outputs

    Unlocking the Power of AI

    By Huggingface Hub [source]

    About this dataset

    This dataset contains 12000 instructional outputs from the LongAlpaca-Yukang Machine Learning system, unlocking the cutting-edge power of Artificial Intelligence for users. With this data, researchers have an abundance of information to explore the mysteries behind AI and how it works. This dataset includes columns such as output, instruction, file, and input, which provide endless possibilities for analysis, ripe for you to discover! Teeming with potential insights into AI’s functioning and implications for our everyday lives, let this data be your guide in unravelling the many secrets yet to be discovered in the world of AI.


    How to use the dataset

    Exploring the Dataset:

    The dataset contains 12000 rows of information, with four columns containing output, instruction, file and input data. You can use these columns to explore the workings of a machine learning system, examine different instructional outputs for different inputs or instructions, study training data for specific ML systems, or analyze files being used by a machine learning system.
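
    For example, a few lines of pandas are enough to start exploring (assuming the train.csv layout described in the Columns section below):

    ```python
    import pandas as pd

    # Load the instructional outputs (train.csv as described in the Columns section).
    df = pd.read_csv("train.csv")

    # Basic exploration: shape, columns, and a sample row.
    print(df.shape)  # expected ~12000 rows
    print(df.columns.tolist())
    print(df.iloc[0])

    # Length of outputs: a simple starting point for deeper analysis.
    df["output_len"] = df["output"].str.len()
    print(df["output_len"].describe())
    ```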

    Visualizing Data:

    Using built-in plotting tools within your chosen toolkit (such as Python), you can create powerful visualizations. Plotting outputs versus input instructions will give you an overview of what your machine learning system is capable of doing and how it performs on different types of tasks or problems. You could also plot outputs alongside the files being used; this would help identify patterns in training data and identify areas that need improvement in your machine learning models.

    Analyzing Performance:

    Using statistical techniques such as regression or clustering, you can measure performance metrics such as accuracy and see how they vary across instruction types. Experimenting with hyperparameter tuning may reveal which settings yield better results in a given situation. Correlations between input samples and output measurements can also be examined to surface relationships, such as trends in accuracy over certain sets of instructions; one illustrative clustering approach is sketched below.
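
    The sketch below shows one possible clustering approach, grouping instructions by TF-IDF similarity with k-means; the cluster count is an arbitrary choice for demonstration, not a recommendation:

    ```python
    # Clustering sketch: group instructions by TF-IDF similarity.
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    df = pd.read_csv("train.csv")
    texts = df["instruction"].astype(str)

    X = TfidfVectorizer(max_features=5000, stop_words="english").fit_transform(texts)
    labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X)  # k=8 is arbitrary

    df["cluster"] = labels
    print(df["cluster"].value_counts().sort_index())     # cluster sizes
    print(df.groupby("cluster")["instruction"].first())  # one sample per cluster
    ```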

    Drawing Conclusions:

    By leveraging big-data mining tools, you can build predictive models that project future outcomes from past performance measurements across instruction types, letting you determine whether particular changes improve your model's capability and predictability over time.

    Research Ideas

    • Developing self-improving Artificial Intelligence algorithms by using the outputs and instructional data to identify correlations and feedback-loop structures between instructions and output results.
    • Generating machine learning simulations from this dataset to optimize AI performance on a given instruction set.
    • Using the instruction, input, and output data to build AI systems for natural language processing, enabling more comprehensive understanding of user queries and more accurate answers.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and Huggingface Hub.

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: train.csv

    | Column name | Description                                             |
    |:------------|:--------------------------------------------------------|
    | output      | The output of the instruction given. (String)           |
    | file        | The file used when executing the instruction. (String)  |
    | input       | Additional context for the instruction. (String)        |

  18. California Fire Perimeters (1950+)

    • gis.data.ca.gov
    • gis.data.cnra.ca.gov
    • +4more
    Updated Aug 30, 2024
    + more versions
    Cite
    California Department of Forestry and Fire Protection (2024). California Fire Perimeters (1950+) [Dataset]. https://gis.data.ca.gov/datasets/CALFIRE-Forestry::california-fire-perimeters-1950/data
    Explore at:
    Dataset updated
    Aug 30, 2024
    Dataset authored and provided by
    California Department of Forestry and Fire Protection (http://calfire.ca.gov/)
    Description

    This data should be used carefully for statistical analysis and reporting due to missing perimeters (see Use Limitation in the metadata). Some fires are missing because historical records were lost or damaged, the fires were too small to meet the minimum cutoffs, the documentation was inadequate, or the records have not yet been incorporated into the database. Other known errors in the fire perimeter database include duplicate fires and over-generalization; over-generalization, particularly with large old fires, may show unburned "islands" within the final perimeter as burned. Users of the fire perimeter database must exercise caution in applying the data; careful use will prevent inaccurate or erroneous conclusions. Within California, this dataset may differ from the one available from the National Interagency Fire Center (NIFC) because the two datasets have different requirements. The data covers fires back to 1878.
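
    As an example of careful use, the hedged geopandas sketch below de-duplicates fires and then summarizes mapped acreage per year. The file name is a hypothetical local export from the portal, and the YEAR_, FIRE_NAME and GIS_ACRES field names follow the common FRAP schema; verify them against the metadata before relying on the results:

    ```python
    # Careful-use sketch: de-duplicate, then summarize mapped acreage per year.
    # "california_fire_perimeters.geojson" is a hypothetical local export.
    import geopandas as gpd

    gdf = gpd.read_file("california_fire_perimeters.geojson")

    # Mitigate the known duplicate-fire issue noted above.
    gdf = gdf.drop_duplicates(subset=["YEAR_", "FIRE_NAME", "GIS_ACRES"])

    # Mapped acreage per year; missing perimeters mean these totals are
    # lower bounds, not complete statistics.
    print(gdf.groupby("YEAR_")["GIS_ACRES"].sum().tail(10))
    ```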


    Please help improve this dataset by filling out this survey with feedback:

    Historic Fire Perimeter Dataset Feedback (arcgis.com)


    Current criteria for data collection are as follows:

    CAL FIRE (including contract counties) submits perimeters for fires ≥10 acres in timber, ≥50 acres in brush, or ≥300 acres in grass, and/or fires that impacted ≥3 residential or commercial structures, and/or caused ≥1 fatality.

    All cooperating agencies submit perimeters ≥10 acres.


    Version update:

    Firep24_1 was released in April 2025. Five hundred forty-eight fires from the 2024 fire season were added to the database (2 from BIA, 56 from BLM, 197 from CAL FIRE, 193 from Contract Counties, 27 from LRA, 8 from NPS, 55 from USFS and 8 from USFW). Six perimeters were added from the 2025 fire season (as a special case, due to an unusual January fire siege). Five duplicate fires were removed, and the perimeter for the 2023 Sage fire was replaced with a more accurate one. A total of 900 perimeters received updated attribution (705 had "FIRE" removed from the end of the Fire Name field, and 148 had the Complex IRWIN ID replaced with the complex local incident number in the COMPLEX_ID field). The following fires were identified as meeting the collection criteria but are not included in this version and will hopefully be added in a future update: Addie (2024-CACND-002119), Alpaugh (2024-CACND-001715), South (2024-CATIA-001375). One perimeter is missing a containment date, which will be added in the next release.


    Cross-checking CALFIRS reports for new CAL FIRE submissions, to verify the accuracy of the cause class, was added to the compilation process. The cause-class domain description for "Powerline" was updated to "Electrical Power" to be more inclusive of cause reports.


    Detailed metadata is included in the following documents:

    Wildland Fire Perimeters (Firep24_1) Metadata


    For any questions, please contact the data steward:

    Kim Wallin, GIS Specialist

    CAL FIRE, Fire & Resource Assessment Program (FRAP)

    kimberly.wallin@fire.ca.gov


  19. National Transfusion Dataset (NTD)

    • bridges.monash.edu
    • researchdata.edu.au
    Updated Mar 4, 2024
    Cite
    National Transfusion Dataset (2024). National Transfusion Dataset (NTD) [Dataset]. http://doi.org/10.26180/22151987.v4
    Explore at:
    Dataset updated
    Mar 4, 2024
    Dataset provided by
    Monash University
    Authors
    National Transfusion Dataset
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The National Transfusion Dataset (NTD) is a collection of transfusion episode data incorporating transfusion, laboratory and hospital data from hospitals and health services, as well as prehospital transfusion data from ambulance and retrieval services.

    The NTD will form the first integrated national database of blood usage in Australia. It aims to collect information about where, when, and how blood products are used across all clinical settings, addressing Australia's absence of an integrated national database that records blood usage and can be linked with clinical outcomes. The dataset will be an invaluable resource for a comprehensive understanding of how and why blood products are used, the numbers and characteristics of patients transfused in health services, and the clinical outcomes after transfusion, and will support policy development and research.

    The NTD was formed through the incorporation of the established Australian and New Zealand Massive Transfusion Registry (ANZ-MTR) and a pilot Transfusion Database (TD) project. The ANZ-MTR has a unique focus on massive transfusion (MT) and contains over 10,000 cases from 41 hospitals across Australia and New Zealand. The TD was a trial extension of the registry that collated data on all (not just massive) transfusions for more than 8,000 patients from pilot hospitals. The NTD will integrate and expand these databases to provide new data on transfusion practice, including blood utilisation, clinical management and the vital closing of the haemovigilance loop.

    Conditions of use: Any material or manuscript to be published using NTD data must be submitted for review by the NTD Steering Committee prior to submission for publication. The NTD and Partner Organisations should be acknowledged in all publications; preferred wording for the acknowledgement will be provided with the data. The NTD reserves the right to dissociate itself from conclusions drawn if it deems this necessary. If the data is the primary source for a report or publication, the source of the data must be acknowledged, along with a statement that the analysis and interpretation are those of the author, not the NTD. Where an author analysing the data is a member of an organisation formally associated or partnered with the NTD, the NTD should be acknowledged as a secondary affiliation. Where the author is a member of the NTD Project Team, the primary attribution should be the NTD. The dataset DOI (10.26180/22151987) must be referenced in all publications.

    Further information can be found in the Data Access and Publications Policy. To submit a data access request, click here.

  20. Lambda Orionis Cluster XMM-Newton X-Ray Point Source Catalog - Dataset -...

    • data.nasa.gov
    Updated Apr 1, 2025
    + more versions
    Cite
    nasa.gov (2025). Lambda Orionis Cluster XMM-Newton X-Ray Point Source Catalog - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/lambda-orionis-cluster-xmm-newton-x-ray-point-source-catalog
    Explore at:
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    The authors studied the X-ray properties of the young (~1-8 Myr) open cluster around the hot (O8 III) star Lambda Ori and compared them with those of the similarly aged Sigma Ori cluster in order to investigate the possible effects of the different ambient environments. They analyzed an XMM-Newton observation of the cluster using EPIC imaging and low-resolution spectral data, studied the variability of the detected sources, and performed a spectral analysis of the brightest sources in the field using multi-temperature models. The authors detected 167 X-ray sources above a 5-sigma detection threshold, whose properties are listed in this table; 58 of these are identified with known cluster members and candidates, from massive stars down to low-mass stars with spectral types of ~M5.5, and another 23 sources are identified with new possible photometric candidates.

    Late-type stars have a median log LX/Lbol ~ -3.3, close to the saturation limit. Variability was observed in ~35% of late-type members or candidates, including six flaring sources. The emission from the central hot star Lambda Ori is dominated by plasma at 0.2-0.3 keV, with a weaker component at 0.7 keV, consistent with a wind origin. The coronae of late-type stars can be described by two plasma components with temperatures T1 ~ 0.3-0.8 keV and T2 ~ 0.8-3 keV, and subsolar abundances Z ~ 0.1-0.3 Zsun, similar to what is found in other star-forming regions and associations. No significant difference was observed between stars with and without circumstellar discs, although the small sample of stars with discs and accretion does not allow definitive conclusions to be drawn. The authors concluded that the X-ray properties of Lambda Ori late-type stars are comparable to those of the coeval Sigma Ori cluster, suggesting that stellar activity in Lambda Ori has not been significantly affected by the different ambient environment.

    The Lambda Ori cluster was observed by XMM-Newton from 20:46 UT on September 28, 2006 to 12:23 UT on September 29, 2006 (Obs. ID 0402050101), for a total duration of 56 ks, using both the EPIC MOS and PN cameras and the RGS instruments. The EPIC cameras were operated in full-frame mode with the thick filter. This table was created by the HEASARC in November 2011 based on CDS Catalog J/A+A/530/A150 files tablea1.dat ('X-ray sources detected in the Lambda Ori Cluster'), table1.dat ('X-ray and optical properties of sources identified with known cluster members and candidates') and table2.dat ('X-ray sources identified with possible new cluster candidates'). It does not include the objects listed in tablea2.dat ('3-sigma upper limits and optical properties of undetected cluster members and candidates'). This is a service provided by the NASA HEASARC.
