93 datasets found
  1. Quantitative raw data for "Large scale regional citizen surveys report"...

    • data.niaid.nih.gov
    • zenodo.org
    • +1 more
    Updated Feb 3, 2022
    Cite
    Panori, Anastasia; Bakratsas, Thomas; Chapizanis, Dimitrios; Altsitsiadis, Efthymios; Hauschildt, Christian (2022). Quantitative raw data for "Large scale regional citizen surveys report" (D1.4) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5958017
    Explore at:
    Dataset updated
    Feb 3, 2022
    Dataset provided by
    White Research SRL
    Authors
    Panori, Anastasia; Bakratsas, Thomas; Chapizanis, Dimitrios; Altsitsiadis, Efthymios; Hauschildt, Christian
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This dataset presents the quantitative raw data collected under the H2020 RRI2SCALE project for D1.4, the “Large scale regional citizen surveys report”. The dataset includes the answers provided by almost 8,000 participants from 4 pilot European regions (Kriti, Vestland, Galicia, and Overijssel) regarding the general public's views, concerns, and moral issues about the current and future trajectories of their RTD&I ecosystem. The original survey questionnaire was created by White Research SRL and disseminated to the regions through supporting pilot partners. Data collection took place from June 2020 to September 2020 in 4 waves, one per region. Following a consortium vote at the kick-off meeting, responses were collected through online panels run by survey companies engaged for each region, rather than through more resource-intensive methods that would have made data collection unduly expensive, in order to fill the regional quotas. For the statistical analysis of the data and the conclusions drawn from it, see the "Large scale regional citizen surveys report" (D1.4).

  2. FAIR NATIONAL ELECTION STUDIES: HOW WELL ARE WE DOING? - Dataset - B2FIND

    • demo-b2find.dkrz.de
    Updated Sep 27, 2025
    + more versions
    Cite
    (2025). FAIR NATIONAL ELECTION STUDIES: HOW WELL ARE WE DOING? - Dataset - B2FIND [Dataset]. http://demo-b2find.dkrz.de/dataset/85555597-a3d9-57d0-9e67-4c68206890cb
    Explore at:
    Dataset updated
    Sep 27, 2025
    Description

    Election studies are an important data pillar in political and social science, as most political research investigations involve secondary use of existing datasets. Researchers depend on high-quality data because data quality determines the accuracy of the conclusions drawn from statistical analyses. We outline data reuse quality criteria pertaining to data accessibility, metadata provision, and data documentation, using the FAIR Principles of research data management as a framework. We then investigate the extent to which a selection of election studies fulfils these criteria, using studies from Western democracies. Our results reveal that although most election studies are easily accessible and well documented, and the overall level of data processing is satisfactory, some important deficits remain. Further analyses of technical documentation indicate that while a majority of election studies provide the necessary documents, there is still room for improvement. Method: content coding (Inhaltscodierung), content analysis. Universe: large-scale election studies from Western democracies. Sampling: non-probability, purposive.

  3. Adventures of Sherlock Holmes: Sentiment Analysis.

    • kaggle.com
    zip
    Updated Aug 25, 2024
    Cite
    Patrick L Ford (2024). Adventures of Sherlock Holmes: Sentiment Analysis. [Dataset]. https://www.kaggle.com/datasets/patricklford/adventures-of-sherlock-holmes-sentiment-analysis/discussion
    Explore at:
    zip (219210 bytes)
    Dataset updated
    Aug 25, 2024
    Authors
    Patrick L Ford
    License

    CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    Introduction

    The famous Sherlock Holmes quote, “Data! data! data!” from The Copper Beeches perfectly encapsulates the essence of both detective work and data analysis. Holmes’ relentless pursuit of every detail closely mirrors the approach of modern data analysts, who understand that conclusions drawn without solid data are mere conjecture. Just as Holmes systematically gathered clues, analysed them from different perspectives, and tested hypotheses to arrive at the truth, today’s analysts follow similar processes when investigating complex data-driven problems. This project draws a parallel between Holmes’ detective methods and modern data analysis techniques by visualising and interpreting data from The Adventures of Sherlock Holmes.

    “**Data! data! data!**” he cried, impatiently. “I can’t make bricks without clay.”

    The above quote comes from one of my favourite Sherlock Holmes stories, The Copper Beeches. In this single outburst, Holmes captures a principle that resonates deeply with today’s data analysts: without data, conclusions are mere speculation. Data is the bedrock of any investigation. Without sufficient data, the route to solving a problem or answering a question is clouded with uncertainty.

    Sherlock Holmes, the iconic fictional detective, thrived on difficult cases, relishing the challenge of pitting his wits against the criminal mind.

    His methods of detection (examining crime scenes, interrogating witnesses, and evaluating motives) closely parallel how a data analyst approaches a complex problem today. By carefully collecting and interpreting data, Holmes was able to unravel mysteries that seemed impenetrable at first glance.

    1. Data Collection: Gathering Evidence
    Holmes’s meticulous approach to data collection mirrors the first stage of data analysis. Just as Holmes would scrutinise a crime scene for every detail, whether a footprint, a discarded note, or a peculiar smell, data analysts seek to gather as much relevant data as possible. Just as incomplete or biased data can skew results in modern analysis, Holmes understood that every clue mattered: overlooking a small piece of information could compromise the entire investigation.

    2. Data Quality: “I can’t make bricks without clay.”
    This quote is more than just a witty remark; it highlights the importance of having the right data. In the same way that substandard materials result in poor construction, incomplete or inaccurate data leads to unreliable analysis. Today’s analysts face similar issues: they must assess data integrity, clean noisy datasets, and ensure they’re working with accurate information before drawing conclusions. Holmes, in his time, would painstakingly verify each clue, ensuring that he was not misled by false leads.

    3. Data Analysis: Considering Multiple Perspectives
    Holmes’s genius lay not just in gathering data, but in the way he analysed it. He would often examine a problem from multiple angles, revisiting clues with fresh perspectives to see what others might have missed. In modern data analysis, this approach is akin to using different models, visualisations, and analytical methods to interpret the same dataset. Analysts explore data from multiple viewpoints, testing different hypotheses, and applying various algorithms to see which provides the most plausible insight.

    4. Hypothesis Testing: Eliminate the Improbable
    One of Holmes’s guiding principles was: “When you have eliminated the impossible, whatever remains, however improbable, must be the truth.” This mirrors the process of hypothesis testing in data analysis. Analysts might begin with several competing theories about what the data suggests. By testing these hypotheses, ruling out those that are contradicted by the data, they zero in on the most likely explanation. For both Holmes and today’s data analysts, the process of elimination is crucial to arriving at the correct answer.

    5. Insight and Conclusion: The Final Deduction
    After piecing together all the clues, Holmes would reveal his conclusion, often leaving his audience in awe at how the seemingly unrelated pieces of data fit together. Similarly, data analysts must present their findings clearly and compellingly, translating raw data into actionable insights. The ability to connect the dots and tell a coherent story from the data is what transforms analysis into impactful decision-making.

    In summary, the methods Sherlock Holmes employed (gathering data meticulously, testing multiple angles, and drawing conclusions through careful analysis) are strikingly similar to the techniques used by modern data analysts. Just as Holmes required high-quality data and a structured approach to solve crimes, today’s data analysts rely on well-prepared data and methodical analysis to provide insights. Whether you’re cracking a case or uncovering business...
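
    As a concrete illustration of the pipeline described above, here is a minimal sentiment-scoring sketch in Python. It is not the author's exact workflow; it assumes a plain-text copy of the book (the file name is hypothetical) and uses NLTK's VADER scorer.

    ```python
    import nltk
    from nltk.sentiment import SentimentIntensityAnalyzer

    # One-time downloads: sentence tokenizer and the VADER lexicon.
    nltk.download("punkt")
    nltk.download("vader_lexicon")

    # Hypothetical file: a plain-text copy of The Adventures of Sherlock Holmes.
    with open("adventures_of_sherlock_holmes.txt", encoding="utf-8") as f:
        text = f.read()

    sia = SentimentIntensityAnalyzer()
    sentences = nltk.sent_tokenize(text)

    # VADER's compound score runs from -1 (most negative) to +1 (most positive).
    scores = [sia.polarity_scores(s)["compound"] for s in sentences]
    print(f"{len(sentences)} sentences, mean compound score {sum(scores) / len(scores):.3f}")
    ```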

  4. Mali Farm Data

    • kaggle.com
    zip
    Updated Apr 6, 2023
    Cite
    Yerkin Mudebayev (2023). Mali Farm Data [Dataset]. https://www.kaggle.com/datasets/yerkinmudebayev/mali-farm-data/code
    Explore at:
    zip (1033275 bytes)
    Dataset updated
    Apr 6, 2023
    Authors
    Yerkin Mudebayev
    Description

    The project is to conduct a principal components analysis of the Mali Farm data (malifarmdata.xlsx; R. Johnson and D. Wichern, Applied Multivariate Statistical Analysis, Pearson, New Jersey, 2019). You will use S for the PCA.

    (a) Store the data in matrix X.
    (b) Carry out an initial investigation. Indicate if you had to process the data file in any way. Do not transform the data. Explain any conclusions drawn from the evidence and back up your conclusions. Hint: pay attention to detection of outliers.
      i. The data in rows 25, 34, 52, 57, 62, 69, 72 are outliers. Provide at least two indicators for each of these data points that justify this claim.
      ii. Explain any other conclusions drawn from the initial investigation.
    (c) Create a data matrix X by removing the outliers.
    (d) Carry out a principal component analysis on X.
      i. Give the relevant sample covariance matrix S.
      ii. List the eigenvalues and describe the percent contributions to the variance.
      iii. Determine the number of principal components to retain and justify your answer by considering at least three methods.
      iv. Give the eigenvectors for the principal components you retain.
      v. Considering the coefficients of the principal components, describe dependencies of the principal components on the variables.
      vi. Using at least the first two principal components, display appropriate scatter plots of pairs of principal components. Make observations about the plots.
    (e) Carry out a principal component analysis on X (repeat sub-steps i-vi of part (d)).
    (f) Compare the results of the two analyses. How much effect did the outliers have on the principal component analysis? Which result do you prefer, and why?
    (g) Include your code.

    Key for Mali farm data:
      Family = number of people in the household
      DistRD = distance in kilometers to the nearest passable road
      Cotton = hectares of cotton planted in 2000
      Maize = hectares of maize planted in 2000
      Sorg = hectares of sorghum planted in 2000
      Millet = hectares of millet planted in 2000
      Bull = total number of bullocks
      Cattle = total number of cattle
      Goat = total number of goats
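
    A minimal sketch of parts (a) and (d) is shown below, assuming the Excel file and the column names from the key (the exact spellings, e.g. DistRD, are assumptions). It is a starting point, not a model answer.

    ```python
    import numpy as np
    import pandas as pd

    # Column names assumed from the key above; adjust to the actual file headers.
    cols = ["Family", "DistRD", "Cotton", "Maize", "Sorg",
            "Millet", "Bull", "Cattle", "Goat"]

    # (a) Store the data in matrix X.
    df = pd.read_excel("malifarmdata.xlsx")
    X = df[cols].to_numpy(dtype=float)

    # (d) PCA via the sample covariance matrix S and its eigendecomposition.
    S = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(S)   # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]      # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # (d-ii) Percent contribution of each component to the total variance.
    print(100 * eigvals / eigvals.sum())

    # Scores on the first two principal components (for the scatter plot in d-vi).
    Z = (X - X.mean(axis=0)) @ eigvecs[:, :2]
    ```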

  5. California Fire Perimeters (1950+)

    • catalog.data.gov
    • data.cnra.ca.gov
    • +2 more
    Updated Oct 23, 2025
    + more versions
    Cite
    CAL FIRE (2025). California Fire Perimeters (1950+) [Dataset]. https://catalog.data.gov/dataset/california-fire-perimeters-1950-c3fa2
    Explore at:
    Dataset updated
    Oct 23, 2025
    Dataset provided by
    California Department of Forestry and Fire Protection (http://calfire.ca.gov/)
    Area covered
    California
    Description

    The California Department of Forestry and Fire Protection's Fire and Resource Assessment Program (FRAP) annually maintains and distributes a historical wildland fire perimeter dataset from across public and private lands in California. The GIS data is developed with the cooperation of the United States Forest Service Region 5, the Bureau of Land Management, California State Parks, the National Park Service, and the United States Fish and Wildlife Service, and is released in the spring with added data from the previous calendar year. Although the dataset represents the most complete digital record of fire perimeters in California, it is still incomplete, and users should be cautious when drawing conclusions based on the data. This data should be used carefully for statistical analysis and reporting due to missing perimeters (see Use Limitation in metadata). Some fires are missing because historical records were lost or damaged, were too small for the minimum cutoffs, had inadequate documentation, or have not yet been incorporated into the database. Other errors in the fire perimeter database include duplicate fires and over-generalization; over-generalization, particularly with large old fires, may show unburned "islands" within the final perimeter as burned. Users of the fire perimeter database must exercise caution in application of the data: careful use will prevent users from drawing inaccurate or erroneous conclusions. This data is updated annually in the spring with fire perimeters from the previous fire season. This dataset may differ in California compared to that available from the National Interagency Fire Center (NIFC) due to different requirements between the two datasets. The data covers fires back to 1878. As of May 2025, it represents fire24_1. Please help improve this dataset by filling out this survey with feedback: Historic Fire Perimeter Dataset Feedback (arcgis.com).

    Current criteria for data collection are as follows:
    • CAL FIRE (including contract counties) submit perimeters ≥10 acres in timber, ≥50 acres in brush, or ≥300 acres in grass, and/or ≥3 impacted residential or commercial structures, and/or caused ≥1 fatality.
    • All cooperating agencies submit perimeters ≥10 acres.

    Version update: Firep24_1 was released in April 2025. Five hundred forty-eight fires from the 2024 fire season were added to the database (2 from BIA, 56 from BLM, 197 from CAL FIRE, 193 from Contract Counties, 27 from LRA, 8 from NPS, 55 from USFS, and 8 from USFW). Six perimeters were added from the 2025 fire season (as a special case due to an unusual January fire siege). Five duplicate fires were removed, and the 2023 Sage was replaced with a more accurate perimeter. There were 900 perimeters that received updated attribution (705 removed "FIRE" from the end of the Fire Name field and 148 replaced the Complex IRWIN ID with the Complex local incident number in the COMPLEX_ID field). The following fires were identified as meeting the collection criteria but are not included in this version and will hopefully be added in a future update: Addie (2024-CACND-002119), Alpaugh (2024-CACND-001715), South (2024-CATIA-001375). One perimeter is missing a containment date that will be updated in the next release. Cross-checking CALFIRS reporting for new CAL FIRE submissions, to ensure accuracy of the cause class, was added to the compilation process. The cause class domain description for "Powerline" was updated to "Electrical Power" to be more inclusive of cause reports.

    Includes separate layers filtered by criteria as follows:
    • California Fire Perimeters (All): unfiltered; the entire collection of wildfire perimeters in the database. Scale dependent; starts displaying at the country level scale.
    • Recent Large Fire Perimeters (≥5000 acres): filtered for wildfires greater than or equal to 5,000 acres for the last 5 years of fires (2020-January 2025), symbolized with color by year. Scale dependent; starts displaying at the country level scale, with year-only labels for recent large fires.
    • California Fire Perimeters (1950+): filtered for wildfires that started in 1950-January 2025. Symbolized by decade; displays starting at the country level scale.

    Detailed metadata is included in the following document: Wildland Fire Perimeters (Firep24_1) Metadata. See more information on the Living Atlas data release here: CAL FIRE Historical Fire Perimeters Available in ArcGIS Living Atlas. For any questions, please contact the data steward: Kim Wallin, GIS Specialist, CAL FIRE, Fire & Resource Assessment Program (FRAP), kimberly.wallin@fire.ca.gov.
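
    As a sketch of how the filtered layers above could be reproduced locally, the snippet below uses GeoPandas. The file path and attribute names (YEAR_, GIS_ACRES) are assumptions based on typical FRAP schemas, not confirmed from the metadata.

    ```python
    import geopandas as gpd

    # Path and layer name are illustrative; point these at the downloaded data.
    fires = gpd.read_file("fire24_1.gdb", layer="firep24_1")

    # Recreate the "Recent Large Fire Perimeters" layer: >= 5,000 acres, 2020 onward.
    # Column names YEAR_ and GIS_ACRES are assumed, not confirmed from the metadata.
    recent_large = fires[(fires["YEAR_"].astype(int) >= 2020) &
                         (fires["GIS_ACRES"] >= 5000)]
    print(len(recent_large), "recent large fire perimeters")
    ```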

  6. Crime Rate and GDP Datasets 2021 & 2023

    • kaggle.com
    Updated May 28, 2024
    Cite
    Fran Llamas (2024). Crime Rate and GDP Datasets 2021 & 2023 [Dataset]. https://www.kaggle.com/datasets/franllamas/crime-rate-and-gdp-datasets-2021-and-2023
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 28, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Fran Llamas
    Description

    Overview:

    This project aims to investigate the potential correlation between the Gross Domestic Product (GDP) of approximately 190 countries for the years 2021 and 2023 and their corresponding crime ratings. The crime ratings are represented on a scale from 0 to 10, with 0 indicating minimal or null crime activity and 10 representing the highest level of criminal activity.

    Dataset:

    The dataset used in this project comprises GDP data for the years 2021 and 2023 for around 190 countries, sourced from reputable international databases. Additionally, crime rating scores for the same countries and years are collected from credible sources such as governmental agencies, law enforcement organizations, or reputable research institutions.

    Methodology:

    • Data Collection: GDP data for 2021 and 2023, along with crime rating scores, are gathered for approximately 190 countries.
    • Data Preprocessing: The collected data is cleaned and standardized to ensure consistency and compatibility across different datasets.
    • Analysis: Statistical methods and data visualization techniques are employed to explore the potential relationship between GDP and crime ratings.
    • Interpretation: Findings from the analysis are interpreted to determine the strength and direction of any observed correlations between GDP and crime ratings.
    • Conclusion: Based on the analysis results, conclusions are drawn regarding the existence and significance of the relationship between GDP and crime ratings.
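
    The sketch below illustrates the analysis step under stated assumptions: two hypothetical CSVs keyed by country, with column names invented for illustration. The actual files in the dataset may be organized differently.

    ```python
    import pandas as pd

    # Hypothetical file and column names; adjust to the actual CSVs in the dataset.
    gdp = pd.read_csv("gdp_2021_2023.csv")      # columns: country, gdp_2021, gdp_2023
    crime = pd.read_csv("crime_2021_2023.csv")  # columns: country, crime_2021, crime_2023

    df = gdp.merge(crime, on="country")

    # Spearman rank correlation is a reasonable first look, since the crime
    # rating is an ordinal 0-10 scale rather than a true interval measure.
    for year in ("2021", "2023"):
        rho = df[f"gdp_{year}"].corr(df[f"crime_{year}"], method="spearman")
        print(f"{year}: Spearman rho = {rho:.2f}")
    ```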

    Expected Outcomes:

    Identification of any significant correlations or patterns between GDP and crime ratings across different countries. Insights into the potential socioeconomic factors influencing crime rates and their relationship with economic indicators like GDP. Implications for policymakers, law enforcement agencies, and researchers in understanding the dynamics between economic development and crime prevalence.

  7. This is the data set that we used to reach the conclusions drawn in the...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Dec 30, 2024
    Cite
    Kaseke, Farayi; Stewart, Aimee; Kaseke, Timothy; Gori, Elizabeth; Gwanzura, Lovemore; Musarurwa, Cuthbert; Nyengerai, Tawanda (2024). This is the data set that we used to reach the conclusions drawn in the manuscript with related metadata and methods. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001473374
    Explore at:
    Dataset updated
    Dec 30, 2024
    Authors
    Kaseke, Farayi; Stewart, Aimee; Kaseke, Timothy; Gori, Elizabeth; Gwanzura, Lovemore; Musarurwa, Cuthbert; Nyengerai, Tawanda
    Description

    This data can be replicated to report the study findings in their entirety, including: (a) the values behind the means, standard deviations, and other measures reported; (b) the values used to build graphs; (c) the points extracted from images for analysis. (XLSX)

  8. fNIRS DATA AND ANALYSIS SCRIPTS

    • kaggle.com
    zip
    Updated Jun 25, 2025
    Cite
    Aysenur Eser (2025). fNIRS DATA AND ANALYSIS SCRIPTS [Dataset]. https://www.kaggle.com/datasets/aysenureser/fnirs-data-and-analysis-scripts
    Explore at:
    zip (409553706 bytes)
    Dataset updated
    Jun 25, 2025
    Authors
    Aysenur Eser
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
    License information was derived automatically

    Description

    The data set used to reach the conclusions drawn in the manuscript is stored in the folder named ‘Raw_Subject_Data’. Related metadata produced at the interim steps of the analysis described in the presented work can be found in the folder named ‘Preprocessed_Data_Interim_Outputs’. Scripts for executing the methodology can be found in the folder named ‘Scripts’. Additional data required to replicate the reported study findings can be found in the folder named ‘Feature_Set’.

  9. DEEPEN Global Standardized Categorical Exploration Datasets for Magmatic...

    • catalog.data.gov
    • data.openei.org
    • +1 more
    Updated Jan 20, 2025
    + more versions
    Cite
    National Renewable Energy Laboratory (2025). DEEPEN Global Standardized Categorical Exploration Datasets for Magmatic Plays [Dataset]. https://catalog.data.gov/dataset/deepen-global-standardized-categorical-exploration-datasets-for-magmatic-plays-f1ecf
    Explore at:
    Dataset updated
    Jan 20, 2025
    Dataset provided by
    National Renewable Energy Laboratory
    Description

    DEEPEN stands for DE-risking Exploration of geothermal Plays in magmatic ENvironments. As part of the development of the DEEPEN 3D play fairway analysis (PFA) methodology for magmatic plays (conventional hydrothermal, superhot EGS, and supercritical), weights needed to be developed for use in the weighted sum of the different favorability index models produced from geoscientific exploration datasets. This was done using two different approaches: one based on expert opinions, and one based on statistical learning. This GDR submission includes the datasets used to produce the statistical learning-based weights. While expert opinions allow us to include more nuanced information in the weights, expert opinions are subject to human bias. Data-centric or statistical approaches help to overcome these potential human biases by focusing on and drawing conclusions from the data alone. The drawback is that, to apply these types of approaches, a dataset is needed. Therefore, we attempted to build comprehensive standardized datasets mapping anomalies in each exploration dataset to each component of each play. This data was gathered through a literature review focused on magmatic hydrothermal plays, along with well-characterized areas where superhot or supercritical conditions are thought to exist. Datasets were assembled for all three play types, but the hydrothermal dataset is the least complete due to its relatively low priority. For each known or assumed resource, the dataset states what anomaly in each exploration dataset is associated with each component of the system. The data is only semi-quantitative: values are high, medium, or low, relative to background levels. In addition, the dataset has significant gaps, as not every possible exploration dataset has been collected and analyzed at every known or suspected geothermal resource area, in the context of all possible play types. The following training sites were used to assemble this dataset:

    • Conventional magmatic hydrothermal: Akutan (from AK PFA), Oregon Cascades PFA, Glass Buttes OR, Mauna Kea (from HI PFA), Lanai (from HI PFA), Mt St Helens Shear Zone (from WA PFA), Wind River Valley (from WA PFA), Mount Baker (from WA PFA).
    • Superhot EGS: Newberry (EGS demonstration project), Coso (EGS demonstration project), Geysers (EGS demonstration project), Eastern Snake River Plain (EGS demonstration project), Utah FORGE, Larderello, Kakkonda, Taupo Volcanic Zone, Acoculco, Krafla.
    • Supercritical: Coso, Geysers, Salton Sea, Larderello, Los Humeros, Taupo Volcanic Zone, Krafla, Reyjanes, Hengill.

    Disclaimer: Treat the supercritical fluid anomalies with skepticism. They are based on assumptions due to the general lack of confirmed supercritical fluid encounters and samples at the sites included in this dataset at the time of its assembly. The main assumption was that the supercritical fluid in a given geothermal system has shared properties with the hydrothermal fluid, which may not be the case in reality.

    Once the datasets were assembled, principal component analysis (PCA) was applied to each. PCA is an unsupervised statistical learning technique, meaning that labels are not required on the data; it summarizes the directions of variance in the data. This approach was chosen because our labels are not certain, i.e., we do not know with 100% confidence that superhot resources exist at all the assumed positive areas. We also do not have data for any known non-geothermal areas, meaning that it would be challenging to apply a supervised learning technique. In order to generate weights from the PCA, an analysis of the PCA loading values was conducted. PCA loading values represent how much a feature contributes to each principal component, and therefore to the overall variance in the data.
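
    Below is a sketch of the loadings-to-weights idea, assuming a standardized numeric coding of the categorical data. The specific weighting rule (mean absolute loading scaled by explained variance) is one plausible reading of the description, not the confirmed DEEPEN procedure.

    ```python
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    # X: sites x exploration-dataset features, with low/medium/high coded as 0/1/2.
    # This random matrix is a stand-in for the assembled categorical dataset.
    rng = np.random.default_rng(0)
    X = rng.integers(0, 3, size=(20, 6)).astype(float)

    pca = PCA().fit(StandardScaler().fit_transform(X))

    # Loadings: how much each feature contributes to each principal component.
    loadings = pca.components_  # shape (n_components, n_features)

    # One plausible weighting rule: average absolute loading per feature,
    # weighted by each component's share of the explained variance.
    weights = np.abs(loadings).T @ pca.explained_variance_ratio_
    weights /= weights.sum()
    print(weights)
    ```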

  10. Background data for: Ordinal response scales: Psychometric grounding for...

    • dataverse.no
    • search.dataone.org
    pdf, png, text/tsv +1
    Updated Jul 17, 2025
    Cite
    Lukas Sönning; Lukas Sönning (2025). Background data for: Ordinal response scales: Psychometric grounding for design and analysis [Dataset]. http://doi.org/10.18710/0VLSLW
    Explore at:
    text/tsv(3271), text/tsv(1293), text/tsv(902), text/tsv(91906), txt(31985), text/tsv(4283), text/tsv(958), png(85437), text/tsv(19110), text/tsv(5134), pdf(197065), text/tsv(2430)
    Dataset updated
    Jul 17, 2025
    Dataset provided by
    DataverseNO
    Authors
    Lukas Sönning; Lukas Sönning
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Time period covered
    Jan 1, 1963 - Dec 31, 2022
    Dataset funded by
    German Research Foundation (DFG)
    Description

    This dataset contains background data and supplementary material for a methodological study on the use of ordinal response scales in linguistic research. For the literature survey reported in that study, which examines how rating scales are used in current linguistic research (4,441 papers from 16 linguistic journals, published between 2012 and 2022), it includes a tabular file listing the 406 research articles that report ordinal rating scale data. This file records annotated attributes of the studies and rating scales. Further, the dataset includes summary data gathered in a review of the psychometric literature on the interpretation of quantificational expressions that are often used to build graded scales. Empirical findings are collected for five rating scale dimensions: agreement (1 study), intensity (3 studies), frequency (17 studies), probability (11 studies), and quality (3 studies). Finally, the dataset includes new data from 20 informants on the interpretation of the quantifiers "few", "some", "many", and "most". Abstract of the related publication: Ordinal scales are commonly used in applied linguistics. To summarize the distribution of responses provided by informants, these are usually converted into numbers and then averaged or analyzed with ordinary regression models. This approach has been criticized in the literature; one caveat (among others) is the assumption that distances between categories are known. The present paper illustrates how empirical insights into the perception of response labels may inform the design and analysis stage of a study. We start with a review of how ordinal scales are used in linguistic research. Our survey offers insights into typical scale layouts and analysis strategies, and it allows us to identify three commonly used rating dimensions (agreement, intensity, and frequency). We take stock of the experimental literature on the perception of relevant scale point labels and then demonstrate how psychometric insights may direct scale design and data analysis. This includes a careful consideration of measurement-theoretic and statistical issues surrounding the numeric-conversion approach to ordinal data. We focus on the consequences of these drawbacks for the interpretation of empirical findings, which will enable researchers to make informed decisions and avoid drawing false conclusions from their data. We present a case study on yous(e) in British and Scottish English, which shows that reliance on psychometric scale values can alter statistical conclusions, while also giving due consideration to the key limitations of the numeric-conversion approach to ordinal data analysis.
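
    A toy illustration of the numeric-conversion caveat discussed above: the psychometric scale values below are invented for illustration, not taken from the study's data.

    ```python
    import numpy as np

    # Responses on a 5-point frequency scale, never ... always (toy data).
    responses = np.array([1, 2, 2, 3, 3, 3, 4, 5, 5, 5])

    # Naive numeric conversion assumes the categories are equidistant.
    naive = responses.mean()

    # Psychometrically grounded values (0-100 scale); these particular numbers
    # are illustrative only, not the scale values reported in the study.
    scale_values = {1: 4, 2: 22, 3: 45, 4: 72, 5: 93}
    grounded = np.mean([scale_values[r] for r in responses])

    print(f"naive mean: {naive:.2f} (on 1-5)")
    print(f"psychometric mean: {grounded:.1f} (on 0-100)")
    ```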

  11. Exploratory Analysis of CMS Open Data: Investigation of Dimuon Mass Spectrum...

    • zenodo.org
    zip
    Updated Sep 29, 2025
    Cite
    Andre Luis Tomaz Dionísio; Andre Luis Tomaz Dionísio (2025). Exploratory Analysis of CMS Open Data: Investigation of Dimuon Mass Spectrum Anomalies in the 10-15 GeV Range [Dataset]. http://doi.org/10.5281/zenodo.17220766
    Explore at:
    zip
    Dataset updated
    Sep 29, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andre Luis Tomaz Dionísio; Andre Luis Tomaz Dionísio
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This dataset contains the results of an exploratory analysis of CMS Open Data from LHC Run 1 (2010-2012) and Run 2 (2015-2018), focusing on the dimuon invariant mass spectrum in the 10-15 GeV range. The analysis investigates potential anomalies at 11.9 GeV and applies various statistical methods to characterize observed features.

    Methodology:

    • Event selection and reconstruction using CMS NanoAOD format
    • Dimuon invariant mass analysis with background estimation
    • Angular distribution studies for quantum number determination
    • Statistical analysis including significance testing
    • Systematic uncertainty evaluation
    • Conservation law verification

    Key Analysis Components:

    • Mass spectrum reconstruction and peak identification
    • Background modeling using sideband methods
    • Angular correlation analysis (sphericity, thrust, momentum distributions)
    • Cross-validation using multiple event selection criteria
    • Monte Carlo comparison for background understanding

    Results Summary: The analysis identifies several features in the dimuon mass spectrum requiring further investigation. Preliminary observations suggest potential anomalies around 11.9 GeV, though these findings require independent validation and peer review before drawing definitive conclusions.

    Data Products:

    • Processed event datasets
    • Analysis scripts and methodology
    • Statistical outputs and uncertainty estimates
    • Visualization tools and plots
    • Systematic studies documentation

    Limitations: This work represents preliminary exploratory analysis. Results have not undergone formal peer review and should be considered investigative rather than conclusive. Independent replication and validation by the broader physics community are essential before any definitive claims can be made.

    Keywords: CMS experiment, dimuon analysis, mass spectrum, exploratory analysis, LHC data, particle physics, statistical analysis, anomaly investigation

    # Dark Photon Search at 11.9 GeV

    ## Executive Summary

    **Historic Search: First Evidence of a Massive Dark Photon**

    We report the search for a new vector gauge boson at 11.9 GeV, identified as a dark photon (A'), representing the first confirmed portal anomaly between the Standard Model and a hidden sector. This search, based on CMS Open Data from LHC Run 1 (2010-2012) and Run 2 (2015-2018), provides direct experimental evidence for physics beyond the Standard Model.

    ## Search Highlights

    ### Anomaly Properties
    - **Mass**: 11.9 ± 0.1 GeV
    - **Quantum Numbers**: J^PC = 1^-- (vector gauge boson)
    - **Spin**: 1
    - **Parity**: Negative
    - **Isospin**: 0 (singlet)
    - **Hypercharge**: 0

    ### Statistical Significance
    - **Total Events**: 63,788 candidates in Run 1
    - **Signal Strength**: > 5σ significance
    - **Decay Channel**: A' → μ⁺μ⁻ (dominant)
    - **Branching Ratio**: ~50% to neutral pairs

    ### Conservation Laws
    All fundamental symmetries preserved:
    - ✓ Energy-momentum
    - ✓ Charge
    - ✓ Lepton number
    - ✓ CPT

    ## Project Structure

    ```
    search/
    ├── README.md                  # This file
    ├── docs/
    │   ├── paper/                 # Main search paper
    │   │   ├── manuscript.tex     # LaTeX source
    │   │   ├── abstract.txt       # Paper abstract
    │   │   └── figures/           # Paper figures
    │   └── supplementary/         # Additional materials
    │       ├── methods.pdf        # Detailed methodology
    │       ├── systematics.pdf    # Systematic uncertainties
    │       └── theory.pdf         # Theoretical implications
    ├── data/
    │   ├── run1/                  # 7-8 TeV (2010-2012)
    │   │   ├── raw/               # Original ROOT files
    │   │   ├── processed/         # Processed datasets
    │   │   └── results/           # Analysis outputs
    │   └── run2/                  # 13 TeV (2015-2018)
    │       ├── raw/               # Original ROOT files
    │       ├── processed/         # Processed datasets
    │       └── results/           # Analysis outputs
    ├── analysis/
    │   └── scripts/               # Analysis code
    │       ├── dark_photon_symmetry_analysis.py
    │       ├── hidden_sector_10_150_search.py
    │       ├── hidden_10_15_gev_analysis.py
    │       └── validation/        # Cross-checks
    ├── figures/                   # Publication-ready plots
    │   ├── mass_spectrum.png      # Invariant mass distribution
    │   ├── angular_dist.png       # Angular distributions
    │   ├── symmetry_plots.png     # Symmetry analysis
    │   └── cascade_spectrum.png   # Hidden sector cascade
    └── validation/                # Systematic studies
        ├── background_estimation/
        ├── signal_extraction/
        └── systematic_errors/
    ```

    ## Key Evidence

    ### 1. Quantum Number Determination
    - **Angular Distribution**: ⟨|P₁|⟩ = 0.805 (strong anisotropy)
    - **Quadrupole Moment**: ⟨P₂⟩ = 0.573 (non-zero)
    - **Anomaly Type Score**: Vector = 90/100 (Preliminary)

    ### 2. Hidden Sector Connection
    - 236,181 total events in 10-150 GeV range
    - Exponential cascade spectrum indicating hidden valley dynamics
    - Dark photon serves as portal anomaly

    ### 3. Decay Topology
    - **Sphericity**: 0.161 (jet-like)
    - **Thrust**: 0.686 (moderate collimation)
    - Consistent with two-body decay A' → μ⁺μ⁻

    ## Physical Interpretation

    The anomaly represents:
    1. **New Force Carrier**: Fifth fundamental force beyond the four known forces
    2. **Portal Anomaly**: Mediator between Standard Model and hidden/dark sector
    3. **Dark Matter Connection**: Potential mediator for dark matter interactions

    ## Theoretical Framework

    ### Kinetic Mixing
    The dark photon arises from kinetic mixing between U(1)_Y (hypercharge) and U(1)_D (dark charge):
    ```
    L_mix = -(ε/2) F^Y_μν F_D^μν
    ```
    where ε is the mixing parameter (~10^-3 based on observed coupling).

    ### Hidden Valley Scenario
    The exponential cascade spectrum suggests:
    - Complex hidden sector with multiple states
    - Possible dark hadronization
    - Rich phenomenology awaiting exploration

    ## Collaborators and Credits

    **Lead Analysis**: CMS Open Data Analysis Team
    **Data Source**: CERN Open Data Portal
    **Period**: 2010-2012 (Run 1), 2015-2018 (Run 2)
    **Computing**: Local analysis on CMS NanoAOD format



    ## How to Reproduce

    ### Requirements
    ```bash
    pip install uproot awkward numpy matplotlib
    ```

    ### Quick Start
    ```bash
    cd analysis/scripts/
    python dark_photon_symmetry_analysis.py
    python hidden_10_15_gev_analysis.py
    ```
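
    ### Example: Dimuon Invariant Mass (Sketch)

    For orientation, here is a minimal sketch of the core dimuon-mass computation using the listed dependencies. The file path is illustrative and the massless-muon approximation is a simplification; the analysis scripts in `analysis/scripts/` remain the authoritative implementation.

    ```python
    import uproot
    import awkward as ak
    import numpy as np

    # Open a CMS NanoAOD file (path illustrative; use any file from data/run1/raw/).
    events = uproot.open("data/run1/raw/sample.root")["Events"]
    arrays = events.arrays(["Muon_pt", "Muon_eta", "Muon_phi", "Muon_charge"])

    # Keep events with at least two muons.
    mu = arrays[ak.num(arrays["Muon_pt"]) >= 2]

    # Two leading muons, required to carry opposite charge.
    pt1, pt2 = mu["Muon_pt"][:, 0], mu["Muon_pt"][:, 1]
    eta1, eta2 = mu["Muon_eta"][:, 0], mu["Muon_eta"][:, 1]
    phi1, phi2 = mu["Muon_phi"][:, 0], mu["Muon_phi"][:, 1]
    opposite = mu["Muon_charge"][:, 0] * mu["Muon_charge"][:, 1] < 0

    # Massless-muon approximation: m^2 ~ 2 pt1 pt2 (cosh(d_eta) - cos(d_phi)).
    m2 = 2 * pt1 * pt2 * (np.cosh(eta1 - eta2) - np.cos(phi1 - phi2))
    mass = np.sqrt(ak.to_numpy(m2[opposite]))

    # Inspect the 10-15 GeV window discussed above.
    window = mass[(mass > 10) & (mass < 15)]
    print(f"{len(window)} candidate events in 10-15 GeV")
    ```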

    ## Significance Statement

    This search represents the first confirmed evidence of a portal anomaly connecting the Standard Model to a hidden sector. The 11.9 GeV dark photon opens an entirely new frontier in anomaly physics, providing experimental access to previously invisible physics and potentially explaining dark matter interactions.

    ## Contact

    For questions about this search or collaboration opportunities:
    - Email: andreluisdionisio@gmail.com

    ---

    "We're not at the end of anomaly physics - we're at the beginning of dark sector physics!"

    3665778186 00382C40-4D7F-E211-AD6F-003048FFCBFC.root
    2581315530 0E5F189B-5D7F-E211-9423-002354EF3BE1.root
    2149825126 1AE176AC-5A7F-E211-8E63-00261894397D.root
    1792851725 2044D46B-DE7F-E211-9C82-003048FFD76E.root
    3186214416 4CAE8D51-4A7F-E211-9937-0025905964A2.root
    3220923349 72FDEF89-497F-E211-9CFA-002618943958.root
    2555255008 7A35A5A2-547F-E211-940B-003048678DA2.root
    3875410897 7E942EED-457F-E211-938E-002618FDA28E.root
    2409745919 8406DE2F-407F-E211-A6A5-00261894395F.root
    2421251748 8A61DAA8-3C7F-E211-94A6-002618943940.root
    2315643699 98909097-417F-E211-9009-002618943838.root
    2614932091 A0963AD9-567F-E211-A8AF-002618943901.root
    2438057881 ACE2DF9A-477F-E211-9C29-003048679266.root
    2206652387 B6AA897F-467F-E211-8381-002618943854.root
    2365666837 C09519C8-4B7F-E211-9BCE-003048678B34.root
    2477336101 C68AE3A5-447F-E211-928E-00261894388B.root
    2556444022 C6CEC369-437F-E211-81B0-0026189438BD.root
    3184171088 D60FF379-4E7F-E211-8BA4-002590593878.root
    2381001693

  12. Dataset.

    • figshare.com
    xlsx
    Updated Sep 19, 2025
    + more versions
    Cite
    James M. Smoliga; Kathryn E. Sawyer (2025). Dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0315560.s001
    Explore at:
    xlsx
    Dataset updated
    Sep 19, 2025
    Dataset provided by
    PLOS ONE
    Authors
    James M. Smoliga; Kathryn E. Sawyer
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Taylor Swift’s presence at National Football League (NFL) games was reported to have a causal effect on the performance of Travis Kelce and the Kansas City Chiefs. Critical examination of the supposed “Swift effect” provides some surprising lessons relevant to the scientific community. Here, we present a formal analysis to determine whether the media narrative that Swift’s presence at NFL games had any impact on player or team performance holds up, and we draw parallels to scientific journalism and clinical research. We performed a quasi-experimental study, using covariate matching. Linear mixed effects models were used to determine how Swift’s presence or absence in Swift-era games influenced Kelce’s performance, relative to historical data. Additionally, a binary logistic regression model was developed to determine if Swift’s presence influenced the Chiefs’ game outcomes, relative to historical averages. Across multiple matching approaches, analyses demonstrated that Kelce’s yardage did not significantly differ when Taylor Swift was in attendance (n = 13 games) relative to matched pre‐Swift games. Although a decline in Kelce’s performance was observed in games without Swift (n = 6 games), the statistical significance of this finding varied by the matching algorithm used, indicating inconsistency in the effect. Similarly, Swift’s attendance did not result in a significant increase in the Chiefs’ likelihood of winning. Together, these findings suggest that the purported “Swift effect” is not supported by robust evidence. The weak statistical evidence that spawned the concept of the “Swift effect” is rooted in a constellation of fallacies common to medical journalism and research, including over-simplification, sensationalism, attribution bias, unjustified mechanisms, inadequate sampling, emphasis on surrogate outcomes, and inattention to comparative effectiveness. Clinicians and researchers must be vigilant to avoid falling victim to the “Swift effect,” since failure to scrutinize available evidence can lead to acceptance of unjustified theories and negatively impact clinical decision-making.
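
    For readers who want to see the shape of such an analysis, here is a minimal statsmodels sketch. The file and column names are hypothetical, and the authors' covariate-matching step is omitted.

    ```python
    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical file and columns: yards, win (0/1), swift_present (0/1), season.
    games = pd.read_csv("kelce_games.csv")

    # Linear mixed effects model: receiving yards vs. Swift attendance,
    # with a random intercept per season to absorb year-to-year variation.
    lmm = smf.mixedlm("yards ~ swift_present", data=games, groups=games["season"]).fit()
    print(lmm.summary())

    # Binary logistic regression: game outcome (1 = win) vs. attendance.
    logit = smf.logit("win ~ swift_present", data=games).fit()
    print(logit.params)
    ```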

  13. Datasheet3_Assessing disparities through missing race and ethnicity data:...

    • frontiersin.figshare.com
    • datasetcatalog.nlm.nih.gov
    pdf
    Updated Jul 24, 2024
    + more versions
    Cite
    Katelyn M. Banschbach; Jade Singleton; Xing Wang; Sheetal S. Vora; Julia G. Harris; Ashley Lytch; Nancy Pan; Julia Klauss; Danielle Fair; Erin Hammelev; Mileka Gilbert; Connor Kreese; Ashley Machado; Peter Tarczy-Hornoch; Esi M. Morgan (2024). Datasheet3_Assessing disparities through missing race and ethnicity data: results from a juvenile arthritis registry.pdf [Dataset]. http://doi.org/10.3389/fped.2024.1430981.s003
    Explore at:
    pdf
    Dataset updated
    Jul 24, 2024
    Dataset provided by
    Frontiers Media (http://www.frontiersin.org/)
    Authors
    Katelyn M. Banschbach; Jade Singleton; Xing Wang; Sheetal S. Vora; Julia G. Harris; Ashley Lytch; Nancy Pan; Julia Klauss; Danielle Fair; Erin Hammelev; Mileka Gilbert; Connor Kreese; Ashley Machado; Peter Tarczy-Hornoch; Esi M. Morgan
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Introduction: Ensuring high-quality race and ethnicity data within the electronic health record (EHR) and across linked systems, such as patient registries, is necessary to achieving the goal of inclusion of racial and ethnic minorities in scientific research and detecting disparities associated with race and ethnicity. The project goal was to improve race and ethnicity data completion within the Pediatric Rheumatology Care Outcomes Improvement Network and assess the impact of improved data completion on conclusions drawn from the registry.

    Methods: This is a mixed-methods quality improvement study that consisted of five parts, as follows: (1) identifying baseline missing race and ethnicity data, (2) surveying current collection and entry, (3) completing data through audit and feedback cycles, (4) assessing the impact on outcome measures, and (5) conducting participant interviews and thematic analysis.

    Results: Across six participating centers, 29% of the patients were missing data on race and 31% were missing data on ethnicity. Of patients missing data, most were missing both race and ethnicity. Rates of missingness varied by data entry method (electronic vs. manual). Recovered data had a higher percentage of patients with Other race or Hispanic/Latino ethnicity compared with patients with non-missing race and ethnicity data at baseline. Black patients had a significantly higher odds ratio of having a clinical juvenile arthritis disease activity score (cJADAS10) of ≥5 at first follow-up compared with White patients. There was no significant change in the odds ratio of cJADAS10 ≥5 for race and ethnicity after data completion. Patients missing race and ethnicity were more likely to be missing cJADAS values, which may affect the ability to detect changes in the odds ratio of cJADAS ≥5 after completion.

    Conclusions: About one-third of the patients in a pediatric rheumatology registry were missing race and ethnicity data. After three audit and feedback cycles, centers decreased missing data by 94%, primarily via data recovery from the EHR. In this sample, completion of missing data did not change the findings related to differential outcomes by race. Recovered data were not uniformly distributed compared with those with non-missing race and ethnicity data at baseline, suggesting that differences in outcomes after completing race and ethnicity data may be seen with larger sample sizes.

  14. OC 2017 LiDAR Image Service

    • detroitdata.org
    • accessoakland.oakgov.com
    • +4 more
    Updated May 18, 2021
    Cite
    Oakland County, Michigan (2021). OC 2017 LiDAR Image Service [Dataset]. https://detroitdata.org/dataset/oc-2017-lidar-image-service1
    Explore at:
    html, arcgis geoservices rest apiAvailable download formats
    Dataset updated
    May 18, 2021
    Dataset provided by
    Oakland County, Michigan
    Description

    BY USING THIS WEBSITE OR THE CONTENT THEREIN, YOU AGREE TO THE TERMS OF USE.

    This service provides the classified point cloud (LAS) for the 2017 Michigan LiDAR project, covering approximately 907 square miles across Oakland County. LAS data products are suitable for 1 foot contour generation. USGS LiDAR Base Specification 1.2, QL2. 19.6 cm NVA.

    This data is for planning purposes only and should not be used for legal or cadastral purposes. Any conclusions drawn from analysis of this information are not the responsibility of Sanborn Map Company. Users should be aware that temporal changes may have occurred since this dataset was collected and some parts of this dataset may no longer represent actual surface conditions. Users should not use these data for critical applications without a full awareness of their limitations.

    This service is best used directly within ArcMap or ArcGIS Pro. If the raw LiDAR points are needed, use these clients to extract project-area-sized portions. Due to the density of the data, downloading the entire county from this service is not possible. For further questions, contact the Oakland County Service Center at 248-858-8812, servicecenter@oakgov.com.

  15. DEEPEN Global Standardized Categorical Exploration Datasets for Magmatic...

    • osti.gov
    Updated Jun 30, 2023
    Cite
    Caliandro, Nils; King, Rachel; Taverna, Nicole (2023). DEEPEN Global Standardized Categorical Exploration Datasets for Magmatic Plays [Dataset]. https://www.osti.gov/dataexplorer/biblio/dataset/1995526-deepen-global-standardized-categorical-exploration-datasets-magmatic-plays
    Explore at:
    Dataset updated
    Jun 30, 2023
    Dataset provided by
    United States Department of Energy (http://energy.gov/)
    Authors
    Caliandro, Nils; King, Rachel; Taverna, Nicole
    Description

    DEEPEN stands for DE-risking Exploration of geothermal Plays in magmatic ENvironments. As part of the development of the DEEPEN 3D play fairway analysis (PFA) methodology for magmatic plays (conventional hydrothermal, superhot EGS, and supercritical), weights needed to be developed for use in the weighted sum of the different favorability index models produced from geoscientific exploration datasets. This was done using two different approaches: one based on expert opinions, and one based on statistical learning. This GDR submission includes the datasets used to produce the statistical learning-based weights. While expert opinions allow us to include more nuanced information in the weights, expert opinions are subject to human bias. Data-centric or statistical approaches help to overcome these potential human biases by focusing on and drawing conclusions from the data alone. The drawback is that, to apply these types of approaches, a dataset is needed. Therefore, we attempted to build comprehensive standardized datasets mapping anomalies in each exploration dataset to each component of each play. This data was gathered through a literature review focused on magmatic hydrothermal plays, along with well-characterized areas where superhot or supercritical conditions are thought to exist. Datasets were assembled for all three play types, but the hydrothermal dataset is the least complete due to its relatively low priority. For each known or assumed resource, the dataset states what anomaly in each exploration dataset is associated with each component of the system. The data is only semi-quantitative: values are high, medium, or low, relative to background levels. In addition, the dataset has significant gaps, as not every possible exploration dataset has been collected and analyzed at every known or suspected geothermal resource area, in the context of all possible play types. The following training sites were used to assemble this dataset:

    • Conventional magmatic hydrothermal: Akutan (from AK PFA), Oregon Cascades PFA, Glass Buttes OR, Mauna Kea (from HI PFA), Lanai (from HI PFA), Mt St Helens Shear Zone (from WA PFA), Wind River Valley (from WA PFA), Mount Baker (from WA PFA).
    • Superhot EGS: Newberry (EGS demonstration project), Coso (EGS demonstration project), Geysers (EGS demonstration project), Eastern Snake River Plain (EGS demonstration project), Utah FORGE, Larderello, Kakkonda, Taupo Volcanic Zone, Acoculco, Krafla.
    • Supercritical: Coso, Geysers, Salton Sea, Larderello, Los Humeros, Taupo Volcanic Zone, Krafla, Reyjanes, Hengill.

    Disclaimer: Treat the supercritical fluid anomalies with skepticism. They are based on assumptions due to the general lack of confirmed supercritical fluid encounters and samples at the sites included in this dataset at the time of its assembly. The main assumption was that the supercritical fluid in a given geothermal system has shared properties with the hydrothermal fluid, which may not be the case in reality.

    Once the datasets were assembled, principal component analysis (PCA) was applied to each. PCA is an unsupervised statistical learning technique, meaning that labels are not required on the data; it summarizes the directions of variance in the data. This approach was chosen because our labels are not certain, i.e., we do not know with 100% confidence that superhot resources exist at all the assumed positive areas. We also do not have data for any known non-geothermal areas, meaning that it would be challenging to apply a supervised learning technique. In order to generate weights from the PCA, an analysis of the PCA loading values was conducted. PCA loading values represent how much a feature contributes to each principal component, and therefore to the overall variance in the data.

  16. OC 2017 DEM Image Service

    • portal.datadrivendetroit.org
    • data.ferndalemi.gov
    • +4 more
    Updated May 5, 2018
    Cite
    Oakland County, Michigan (2018). OC 2017 DEM Image Service [Dataset]. https://portal.datadrivendetroit.org/datasets/oakgov::oc-2017-dem-image-service/about
    Explore at:
    Dataset updated
    May 5, 2018
    Dataset authored and provided by
    Oakland County, Michigan
    Description

    BY USING THIS WEBSITE OR THE CONTENT THEREIN, YOU AGREE TO THE TERMS OF USE.

    The purpose is to acquire detailed surface elevation data for use in conservation planning, design, research, floodplain mapping, dam safety assessments, and hydrologic modeling. LAS and bare earth DEM data products are suitable for 1 foot contour generation. USGS LiDAR Base Specification 1.2, QL2. 19.6 cm NVA. This metadata record describes the hydro-flattened bare earth digital elevation model (DEM) derived from the classified LiDAR data for the 2017 Michigan LiDAR project, covering approximately 907 square miles, whose extent covers Oakland County.

    This data is for planning purposes only and should not be used for legal or cadastral purposes. Any conclusions drawn from analysis of this information are not the responsibility of Sanborn Map Company. Users should be aware that temporal changes may have occurred since this dataset was collected and some parts of this dataset may no longer represent actual surface conditions. Users should not use these data for critical applications without a full awareness of their limitations. Contact: State of Michigan.

    Due to the large size of the data, downloading the entire county may not be possible. It is recommended to use the live service directly within ArcMap or ArcGIS Pro. For further questions, contact the Oakland County Service Center at 248-858-8812, servicecenter@oakgov.com.

  17. LongAlpaca-Yukang ML Instructional Outputs

    • kaggle.com
    zip
    Updated Nov 24, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    The Devastator (2023). LongAlpaca-Yukang ML Instructional Outputs [Dataset]. https://www.kaggle.com/datasets/thedevastator/longalpaca-yukang-ml-instructional-outputs
    Explore at:
    zip (168273444 bytes)
    Dataset updated
    Nov 24, 2023
    Authors
    The Devastator
    License

    CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    LongAlpaca-Yukang ML Instructional Outputs

    Unlocking the Power of AI

    By Huggingface Hub [source]

    About this dataset

    This dataset contains 12000 instructional outputs from the LongAlpaca-Yukang Machine Learning system, unlocking the cutting-edge power of Artificial Intelligence for users. With this data, researchers have an abundance of information to explore the mysteries behind AI and how it works. This dataset includes columns such as output, instruction, file, and input, which provide endless possibilities for analysis, ripe for you to discover! Teeming with potential insights into AI’s functioning and implications for our everyday lives, let this data be your guide in unravelling the many secrets yet to be discovered in the world of AI.


    How to use the dataset

    Exploring the Dataset:

    The dataset contains 12000 rows of information, with four columns containing output, instruction, file and input data. You can use these columns to explore the workings of a machine learning system, examine different instructional outputs for different inputs or instructions, study training data for specific ML systems, or analyze files being used by a machine learning system.
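
    For example, a few lines of pandas are enough to start exploring (assuming the train.csv layout described in the Columns section below):

    ```python
    import pandas as pd

    # Load the instructional outputs (train.csv as described in the Columns section).
    df = pd.read_csv("train.csv")

    # Basic exploration: shape, columns, and a sample row.
    print(df.shape)  # expected ~12000 rows
    print(df.columns.tolist())
    print(df.iloc[0])

    # Length of outputs: a simple starting point for deeper analysis.
    df["output_len"] = df["output"].str.len()
    print(df["output_len"].describe())
    ```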

    Visualizing Data:

    Using built-in plotting tools within your chosen toolkit (such as Python), you can create powerful visualizations. Plotting outputs versus input instructions will give you an overview of what your machine learning system is capable of doing and how it performs on different types of tasks or problems. You could also plot outputs alongside the files being used; this would help identify patterns in training data and identify areas that need improvement in your machine learning models.

    Analyzing Performance:

    Using statistical techniques such as regression or clustering, you can measure performance metrics such as accuracy and see how they vary across instruction types. Experimenting with hyperparameter tuning may reveal which settings yield better results in a given situation. Correlations between input samples and output measurements can also be examined to surface relationships, such as trends in accuracy over certain sets of instructions; one illustrative clustering approach is sketched below.
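
    The sketch below shows one possible clustering approach, grouping instructions by TF-IDF similarity with k-means; the cluster count is an arbitrary choice for demonstration, not a recommendation:

    ```python
    # Clustering sketch: group instructions by TF-IDF similarity.
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    df = pd.read_csv("train.csv")
    texts = df["instruction"].astype(str)

    X = TfidfVectorizer(max_features=5000, stop_words="english").fit_transform(texts)
    labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X)  # k=8 is arbitrary

    df["cluster"] = labels
    print(df["cluster"].value_counts().sort_index())     # cluster sizes
    print(df.groupby("cluster")["instruction"].first())  # one sample per cluster
    ```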

    Drawing Conclusions:

    By leveraging big-data mining tools, you can build predictive models that project future outcomes from past performance measurements across instruction types, letting you determine whether particular changes improve your model's capability and predictability over time.

    Research Ideas

    • Developing self-improving Artificial Intelligence algorithms by using the outputs and instructional data to identify correlations and feedback-loop structures between instructions and output results.
    • Generating machine learning simulations from this dataset to optimize AI performance on a given instruction set.
    • Using the instruction, input, and output data to build AI systems for natural language processing, enabling more comprehensive understanding of user queries and more accurate answers.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and Huggingface Hub.

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: train.csv

    | Column name | Description                                             |
    |:------------|:--------------------------------------------------------|
    | output      | The output of the instruction given. (String)           |
    | file        | The file used when executing the instruction. (String)  |
    | input       | Additional context for the instruction. (String)        |

  18. California Fire Perimeters (1950+)

    • gis.data.ca.gov
    • gis.data.cnra.ca.gov
    • +4more
    Updated Aug 30, 2024
    + more versions
    Cite
    California Department of Forestry and Fire Protection (2024). California Fire Perimeters (1950+) [Dataset]. https://gis.data.ca.gov/datasets/CALFIRE-Forestry::california-fire-perimeters-1950/data
    Explore at:
    Dataset updated
    Aug 30, 2024
    Dataset authored and provided by
    California Department of Forestry and Fire Protection (http://calfire.ca.gov/)
    Description

    This data should be used carefully for statistical analysis and reporting due to missing perimeters (see Use Limitation in the metadata). Some fires are missing because historical records were lost or damaged, the fires were too small to meet the minimum cutoffs, the documentation was inadequate, or the records have not yet been incorporated into the database. Other known errors in the fire perimeter database include duplicate fires and over-generalization; over-generalization, particularly with large old fires, may show unburned "islands" within the final perimeter as burned. Users of the fire perimeter database must exercise caution in applying the data; careful use will prevent inaccurate or erroneous conclusions. Within California, this dataset may differ from the one available from the National Interagency Fire Center (NIFC) because the two datasets have different requirements. The data covers fires back to 1878.
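
    As an example of careful use, the hedged geopandas sketch below de-duplicates fires and then summarizes mapped acreage per year. The file name is a hypothetical local export from the portal, and the YEAR_, FIRE_NAME and GIS_ACRES field names follow the common FRAP schema; verify them against the metadata before relying on the results:

    ```python
    # Careful-use sketch: de-duplicate, then summarize mapped acreage per year.
    # "california_fire_perimeters.geojson" is a hypothetical local export.
    import geopandas as gpd

    gdf = gpd.read_file("california_fire_perimeters.geojson")

    # Mitigate the known duplicate-fire issue noted above.
    gdf = gdf.drop_duplicates(subset=["YEAR_", "FIRE_NAME", "GIS_ACRES"])

    # Mapped acreage per year; missing perimeters mean these totals are
    # lower bounds, not complete statistics.
    print(gdf.groupby("YEAR_")["GIS_ACRES"].sum().tail(10))
    ```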


    Please help improve this dataset by filling out this survey with feedback:

    Historic Fire Perimeter Dataset Feedback (arcgis.com)


    Current criteria for data collection are as follows:

    CAL FIRE (including contract counties) submits perimeters for fires ≥10 acres in timber, ≥50 acres in brush, or ≥300 acres in grass, and/or fires that impacted ≥3 residential or commercial structures, and/or caused ≥1 fatality.

    All cooperating agencies submit perimeters ≥10 acres.


    Version update:

    Firep24_1 was released in April 2025. Five hundred forty-eight fires from the 2024 fire season were added to the database (2 from BIA, 56 from BLM, 197 from CAL FIRE, 193 from Contract Counties, 27 from LRA, 8 from NPS, 55 from USFS and 8 from USFW). Six perimeters were added from the 2025 fire season (as a special case, due to an unusual January fire siege). Five duplicate fires were removed, and the perimeter for the 2023 Sage fire was replaced with a more accurate one. A total of 900 perimeters received updated attribution (705 had "FIRE" removed from the end of the Fire Name field, and 148 had the Complex IRWIN ID replaced with the complex local incident number in the COMPLEX_ID field). The following fires were identified as meeting the collection criteria but are not included in this version and will hopefully be added in a future update: Addie (2024-CACND-002119), Alpaugh (2024-CACND-001715), South (2024-CATIA-001375). One perimeter is missing a containment date, which will be added in the next release.


    Cross-checking CALFIRS reports for new CAL FIRE submissions, to verify the accuracy of the cause class, was added to the compilation process. The cause-class domain description for "Powerline" was updated to "Electrical Power" to be more inclusive of cause reports.


    Detailed metadata is included in the following documents:

    Wildland Fire Perimeters (Firep24_1) Metadata


    For any questions, please contact the data steward:

    Kim Wallin, GIS Specialist

    CAL FIRE, Fire & Resource Assessment Program (FRAP)

    kimberly.wallin@fire.ca.gov


  19. National Transfusion Dataset (NTD)

    • bridges.monash.edu
    • researchdata.edu.au
    Updated Mar 4, 2024
    Cite
    National Transfusion Dataset (2024). National Transfusion Dataset (NTD) [Dataset]. http://doi.org/10.26180/22151987.v4
    Explore at:
    Dataset updated
    Mar 4, 2024
    Dataset provided by
    Monash University
    Authors
    National Transfusion Dataset
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The National Transfusion Dataset (NTD) is a collection of transfusion episode data incorporating transfusion, laboratory and hospital data from hospitals and health services, as well as prehospital transfusion data from ambulance and retrieval services.

    The NTD will form the first integrated national database of blood usage in Australia. It aims to collect information about where, when, and how blood products are used across all clinical settings, addressing Australia's absence of an integrated national database that records blood usage and can be linked with clinical outcomes. The dataset will be an invaluable resource for a comprehensive understanding of how and why blood products are used, the numbers and characteristics of patients transfused in health services, and the clinical outcomes after transfusion, and will support policy development and research.

    The NTD was formed through the incorporation of the established Australian and New Zealand Massive Transfusion Registry (ANZ-MTR) and a pilot Transfusion Database (TD) project. The ANZ-MTR has a unique focus on massive transfusion (MT) and contains over 10,000 cases from 41 hospitals across Australia and New Zealand. The TD was a trial extension of the registry that collated data on all (not just massive) transfusions for more than 8,000 patients from pilot hospitals. The NTD will integrate and expand these databases to provide new data on transfusion practice, including blood utilisation, clinical management and the vital closing of the haemovigilance loop.

    Conditions of use: Any material or manuscript to be published using NTD data must be submitted for review by the NTD Steering Committee prior to submission for publication. The NTD and Partner Organisations should be acknowledged in all publications; preferred wording for the acknowledgement will be provided with the data. The NTD reserves the right to dissociate itself from conclusions drawn if it deems this necessary. If the data is the primary source for a report or publication, the source of the data must be acknowledged, along with a statement that the analysis and interpretation are those of the author, not the NTD. Where an author analysing the data is a member of an organisation formally associated or partnered with the NTD, the NTD should be acknowledged as a secondary affiliation. Where the author is a member of the NTD Project Team, the primary attribution should be the NTD. The dataset DOI (10.26180/22151987) must be referenced in all publications.

    Further information can be found in the Data Access and Publications Policy. To submit a data access request, click here.

  20. Lambda Orionis Cluster XMM-Newton X-Ray Point Source Catalog - Dataset -...

    • data.nasa.gov
    Updated Apr 1, 2025
    + more versions
    Cite
    nasa.gov (2025). Lambda Orionis Cluster XMM-Newton X-Ray Point Source Catalog - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/lambda-orionis-cluster-xmm-newton-x-ray-point-source-catalog
    Explore at:
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    The authors studied the X-ray properties of the young (~1-8 Myr) open cluster around the hot (O8 III) star Lambda Ori and compared them with those of the similarly aged Sigma Ori cluster in order to investigate the possible effects of the different ambient environments. They analyzed an XMM-Newton observation of the cluster using EPIC imaging and low-resolution spectral data, studied the variability of the detected sources, and performed a spectral analysis of the brightest sources in the field using multi-temperature models. The authors detected 167 X-ray sources above a 5-sigma detection threshold, whose properties are listed in this table; 58 of these are identified with known cluster members and candidates, from massive stars down to low-mass stars with spectral types of ~M5.5, and another 23 sources are identified with new possible photometric candidates.

    Late-type stars have a median log LX/Lbol ~ -3.3, close to the saturation limit. Variability was observed in ~35% of late-type members or candidates, including six flaring sources. The emission from the central hot star Lambda Ori is dominated by plasma at 0.2-0.3 keV, with a weaker component at 0.7 keV, consistent with a wind origin. The coronae of late-type stars can be described by two plasma components with temperatures T1 ~ 0.3-0.8 keV and T2 ~ 0.8-3 keV, and subsolar abundances Z ~ 0.1-0.3 Zsun, similar to what is found in other star-forming regions and associations. No significant difference was observed between stars with and without circumstellar discs, although the small sample of stars with discs and accretion does not allow definitive conclusions to be drawn. The authors concluded that the X-ray properties of Lambda Ori late-type stars are comparable to those of the coeval Sigma Ori cluster, suggesting that stellar activity in Lambda Ori has not been significantly affected by the different ambient environment.

    The Lambda Ori cluster was observed by XMM-Newton from 20:46 UT on September 28, 2006 to 12:23 UT on September 29, 2006 (Obs. ID 0402050101), for a total duration of 56 ks, using both the EPIC MOS and PN cameras and the RGS instruments. The EPIC cameras were operated in full-frame mode with the thick filter. This table was created by the HEASARC in November 2011 based on CDS Catalog J/A+A/530/A150 files tablea1.dat ('X-ray sources detected in the Lambda Ori Cluster'), table1.dat ('X-ray and optical properties of sources identified with known cluster members and candidates') and table2.dat ('X-ray sources identified with possible new cluster candidates'). It does not include the objects listed in tablea2.dat ('3-sigma upper limits and optical properties of undetected cluster members and candidates'). This is a service provided by the NASA HEASARC.
