Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset presents the quantitative raw data that was collected under the H2020 RRI2SCALE project for D1.4 – “Large scale regional citizen surveys report”. The dataset includes the answers that were provided by almost 8,000 participants from 4 pilot European regions (Kriti, Vestland, Galicia, and Overijssel) regarding the general public's views, concerns, and moral issues about the current and future trajectories of their RTD&I ecosystem. The original survey questionnaire was created by White Research SRL and disseminated to the regions through supporting pilot partners. Data collection took place from June 2020 to September 2020 through 4 different waves – one for each region. Following a consortium vote at the kick-off meeting, it was decided that, instead of resource-intensive methods that would have made data collection unduly expensive, responses would be collected through online panels run by survey companies in each region in order to fill the quotas. For the statistical analysis of the data and the conclusions drawn from the analysis, you can access the “Large scale regional citizen surveys report” (D1.4).
Election studies are an important data pillar in political and social science, as most political research investigations involve secondary use of existing datasets. Researchers depend on high-quality data because data quality determines the accuracy of the conclusions drawn from statistical analyses. We outline data reuse quality criteria pertaining to data accessibility, metadata provision, and data documentation, using the FAIR Principles of research data management as a framework. We then investigate the extent to which a selection of election studies from Western democracies fulfils these criteria. Our results reveal that although most election studies are easily accessible and well documented, and the overall level of data processing is satisfactory, some important deficits remain. Further analyses of technical documentation indicate that while a majority of election studies provide the necessary documents, there is still room for improvement. Method: content coding / content analysis. Units: large-scale election studies from Western democracies. Sampling: non-probability, purposive.
https://creativecommons.org/publicdomain/zero/1.0/
The famous Sherlock Holmes quote, “Data! data! data!” from The Copper Beeches perfectly encapsulates the essence of both detective work and data analysis. Holmes’ relentless pursuit of every detail closely mirrors the approach of modern data analysts, who understand that conclusions drawn without solid data are mere conjecture. Just as Holmes systematically gathered clues, analysed them from different perspectives, and tested hypotheses to arrive at the truth, today’s analysts follow similar processes when investigating complex data-driven problems. This project draws a parallel between Holmes’ detective methods and modern data analysis techniques by visualising and interpreting data from The Adventures of Sherlock Holmes.
The above quote comes from one of my favourite Sherlock Holmes stories, The Copper Beeches. In this single outburst, Holmes captures a principle that resonates deeply with today’s data analysts: without data, conclusions are mere speculation. Data is the bedrock of any investigation. Without sufficient data, the route to solving a problem or answering a question is clouded with uncertainty.
Sherlock Holmes, the iconic fictional detective, thrived on difficult cases, relishing the challenge of pitting his wits against the criminal mind.
His methods of detection, which included examining crime scenes, interrogating witnesses, and evaluating motives, closely parallel how a data analyst approaches a complex problem today. By carefully collecting and interpreting data, Holmes was able to unravel mysteries that seemed impenetrable at first glance.
1. Data Collection: Gathering Evidence
Holmes’s meticulous approach to data collection mirrors the first stage of data analysis. Just as Holmes would scrutinise a crime scene for every detail, whether a footprint, a discarded note, or a peculiar smell, data analysts seek to gather as much relevant data as possible. Just as incomplete or biased data can skew results in modern analysis, Holmes understood that every clue mattered: overlooking a small piece of information could compromise the entire investigation.
2. Data Quality: “I can’t make bricks without clay.”
This quote is more than just a witty remark; it highlights the importance of having the right data. In the same way that substandard materials result in poor construction, incomplete or inaccurate data leads to unreliable analysis. Today’s analysts face similar issues: they must assess data integrity, clean noisy datasets, and ensure they’re working with accurate information before drawing conclusions. Holmes, in his time, would painstakingly verify each clue, ensuring that he was not misled by false leads.
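To make the "bricks without clay" point concrete, here is a minimal Python sketch of the kind of integrity checks and cleaning an analyst might run before drawing any conclusions. The file name (clues.csv) and the measurement column are purely illustrative.

```python
import pandas as pd

# Hypothetical evidence table; the file name and columns are illustrative only.
df = pd.read_csv("clues.csv")

# Assess integrity: how much is missing, and are there duplicate records?
print(df.isna().mean().sort_values(ascending=False))  # share of missing values per column
print(f"{df.duplicated().sum()} duplicate rows")

# Basic cleaning: drop exact duplicates, trim stray whitespace in text columns,
# and coerce a numeric column, turning unparseable entries into NaN for review.
df = df.drop_duplicates()
text_cols = df.select_dtypes(include="object").columns
df[text_cols] = df[text_cols].apply(lambda s: s.str.strip())
df["measurement"] = pd.to_numeric(df["measurement"], errors="coerce")
```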
3. Data Analysis: Considering Multiple Perspectives
Holmes’s genius lay not just in gathering data, but in the way he analysed it. He would often examine a problem from multiple angles, revisiting clues with fresh perspectives to see what others might have missed. In modern data analysis, this approach is akin to using different models, visualisations, and analytical methods to interpret the same dataset. Analysts explore data from multiple viewpoints, test different hypotheses, and apply various algorithms to see which provides the most plausible insight.
4. Hypothesis Testing: Eliminate the Improbable
One of Holmes’s guiding principles was: “When you have eliminated the impossible, whatever remains, however improbable, must be the truth.” This mirrors the process of hypothesis testing in data analysis. Analysts might begin with several competing theories about what the data suggests. By testing these hypotheses, ruling out those that are contradicted by the data, they zero in on the most likely explanation. For both Holmes and today’s data analysts, the process of elimination is crucial to arriving at the correct answer.
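As a small illustration of "eliminating the improbable" in statistical terms, the sketch below runs a two-sample test on simulated data; rejecting the null hypothesis is the analyst's version of ruling out an explanation. The numbers are synthetic and only meant to show the workflow.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Two hypothetical sets of observations, e.g. measurements under two competing theories.
group_a = rng.normal(loc=10.0, scale=2.0, size=50)
group_b = rng.normal(loc=11.5, scale=2.0, size=50)

# Null hypothesis: the groups share the same mean. A small p-value lets us
# "eliminate" that explanation; a large one means it survives for now.
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject the null: the difference is unlikely to be chance alone.")
else:
    print("Cannot reject the null: the data do not rule this explanation out.")
```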
5. Insight and Conclusion: The Final Deduction
After piecing together all the clues, Holmes would reveal his conclusion, often leaving his audience in awe at how the seemingly unrelated pieces of data fit together. Similarly, data analysts must present their findings clearly and compellingly, translating raw data into actionable insights. The ability to connect the dots and tell a coherent story from the data is what transforms analysis into impactful decision-making.
In summary, the methods Sherlock Holmes employed, gathering data meticulously, testing multiple angles, and drawing conclusions through careful analysis, are strikingly similar to the techniques used by modern data analysts. Just as Holmes required high-quality data and a structured approach to solve crimes, today’s data analysts rely on well-prepared data and methodical analysis to provide insights. Whether you’re cracking a case or uncovering business...
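For readers who want to try the project's visualisation idea themselves, a minimal sketch is shown below. It assumes a local plain-text copy of The Adventures of Sherlock Holmes (for example, downloaded from Project Gutenberg); the file name and stopword list are illustrative choices, not part of the original project.

```python
import re
from collections import Counter

import matplotlib.pyplot as plt

# Assumes a local plain-text copy of the book; the file name is illustrative.
with open("adventures_of_sherlock_holmes.txt", encoding="utf-8") as fh:
    text = fh.read().lower()

words = re.findall(r"[a-z']+", text)
stopwords = {"the", "and", "of", "to", "a", "i", "in", "that", "it", "he",
             "was", "you", "his", "is", "my", "have", "with", "as", "had"}
counts = Counter(w for w in words if w not in stopwords)

# Plot the 20 most frequent content words as a horizontal bar chart.
top = counts.most_common(20)
labels, values = zip(*top)
plt.barh(labels[::-1], values[::-1])
plt.title("Most frequent words in The Adventures of Sherlock Holmes")
plt.xlabel("Occurrences")
plt.tight_layout()
plt.show()
```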
The project is to conduct a principal components analysis of the Mali Farm data (malifarmdata.xlsx; R. Johnson and D. Wichern, Applied Multivariate Statistical Analysis, Pearson, New Jersey, 2019). You will use S for the PCA.
(a) Store the data in matrix X.
(b) Carry out an initial investigation. Indicate whether you had to process the data file in any way. Do not transform the data. Explain any conclusions drawn from the evidence and back up your conclusions. Hint: pay attention to the detection of outliers.
i. The data in rows 25, 34, 52, 57, 62, 69, and 72 are outliers. Provide at least two indicators for each of these rows that justify this claim.
ii. Explain any other conclusions drawn from the initial investigation.
(c) Create a second data matrix by removing the outliers from X.
(d) Carry out a principal component analysis on the original matrix X.
i. Give the relevant sample covariance matrix S.
ii. List the eigenvalues and describe their percent contributions to the variance.
iii. Determine the number of principal components to retain and justify your answer by considering at least three methods.
iv. Give the eigenvectors for the principal components you retain.
v. Considering the coefficients of the principal components, describe the dependencies of the principal components on the variables.
vi. Using at least the first two principal components, display appropriate scatter plots of pairs of principal components. Make observations about the plots.
(e) Repeat the principal component analysis of (d) on the outlier-removed data matrix from (c), addressing points i-vi above.
(f) Compare the results of the two analyses. How much effect did the outliers have on the principal component analysis? Which result do you prefer, and why?
(g) Include your code.
Key for the Mali farm data: Family = number of people in the household; Dist RD = distance in kilometers to the nearest passable road; Cotton = hectares of cotton planted in 2000; Maize = hectares of maize planted in 2000; Sorg = hectares of sorghum planted in 2000; Millet = hectares of millet planted in 2000; Bull = total number of bullocks; Cattle = total number of cattle; Goat = total number of goats.
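For orientation, the sketch below outlines parts (a)-(e) in Python via eigen-decomposition of the sample covariance matrix S rather than a packaged PCA routine. The spreadsheet layout, column types, and row indexing of malifarmdata.xlsx are assumptions, so adjust the outlier handling to the actual file.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Assumes all columns of the spreadsheet are numeric and the default 0-based index is used.
X = pd.read_excel("malifarmdata.xlsx")            # (a) store the data
print(X.describe())                                # (b) first look at ranges and outliers

outlier_rows = [25, 34, 52, 57, 62, 69, 72]        # 1-based rows flagged in the brief
X_clean = X.drop(index=[r - 1 for r in outlier_rows])   # (c) outlier-removed matrix


def pca_from_covariance(data: pd.DataFrame):
    """PCA via eigen-decomposition of the sample covariance matrix S."""
    S = np.cov(data.to_numpy(), rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(S)           # returned in ascending order
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    pct_var = 100 * eigvals / eigvals.sum()
    scores = (data - data.mean()).to_numpy() @ eigvecs
    return S, eigvals, eigvecs, pct_var, scores


# (d) and (e): run on both matrices so the outlier effect can be compared in (f).
for label, data in [("with outliers", X), ("outliers removed", X_clean)]:
    S, eigvals, eigvecs, pct_var, scores = pca_from_covariance(data)
    print(label, "percent variance:", np.round(pct_var, 1))
    plt.figure()
    plt.scatter(scores[:, 0], scores[:, 1])
    plt.xlabel("PC1")
    plt.ylabel("PC2")
    plt.title(f"PC1 vs PC2 ({label})")
plt.show()
```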
The California Department of Forestry and Fire Protection's Fire and Resource Assessment Program (FRAP) annually maintains and distributes a historical wildland fire perimeter dataset from across public and private lands in California. The GIS data is developed with the cooperation of the United States Forest Service Region 5, the Bureau of Land Management, California State Parks, the National Park Service, and the United States Fish and Wildlife Service, and is released in the spring with added data from the previous calendar year. Although the dataset represents the most complete digital record of fire perimeters in California, it is still incomplete, and users should be cautious when drawing conclusions based on the data. This data should be used carefully for statistical analysis and reporting due to missing perimeters (see Use Limitation in metadata). Some fires are missing because historical records were lost or damaged, were too small for the minimum cutoffs, had inadequate documentation, or have not yet been incorporated into the database. Other errors in the fire perimeter database include duplicate fires and over-generalization; over-generalization, particularly with large old fires, may show unburned “islands” within the final perimeter as burned. Users of the fire perimeter database must exercise caution in application of the data; careful use will prevent users from drawing inaccurate or erroneous conclusions from the data. This data is updated annually in the spring with fire perimeters from the previous fire season. This dataset may differ in California compared to that available from the National Interagency Fire Center (NIFC) due to different requirements between the two datasets. The data covers fires back to 1878. As of May 2025, it represents Firep24_1.
Please help improve this dataset by filling out this survey with feedback: Historic Fire Perimeter Dataset Feedback (arcgis.com)
Current criteria for data collection are as follows:
CAL FIRE (including contract counties) submit perimeters ≥10 acres in timber, ≥50 acres in brush, or ≥300 acres in grass, and/or ≥3 impacted residential or commercial structures, and/or caused ≥1 fatality.
All cooperating agencies submit perimeters ≥10 acres.
Version update: Firep24_1 was released in April 2025. Five hundred forty-eight fires from the 2024 fire season were added to the database (2 from BIA, 56 from BLM, 197 from CAL FIRE, 193 from Contract Counties, 27 from LRA, 8 from NPS, 55 from USFS and 8 from USFW). Six perimeters were added from the 2025 fire season (as a special case due to an unusual January fire siege). Five duplicate fires were removed, and the 2023 Sage was replaced with a more accurate perimeter. There were 900 perimeters that received updated attribution (705 removed “FIRE” from the end of the Fire Name field and 148 replaced the Complex IRWIN ID with the Complex local incident number in the COMPLEX_ID field). The following fires were identified as meeting our collection criteria but are not included in this version and will hopefully be added in a future update: Addie (2024-CACND-002119), Alpaugh (2024-CACND-001715), South (2024-CATIA-001375). One perimeter is missing a containment date that will be updated in the next release. Cross-checking CALFIRS reporting for new CAL FIRE submissions to ensure accuracy with cause class was added to the compilation process. The cause class domain description for “Powerline” was updated to “Electrical Power” to be more inclusive of cause reports.
Includes separate layers filtered by criteria as follows:
California Fire Perimeters (All): Unfiltered. The entire collection of wildfire perimeters in the database. It is scale dependent and starts displaying at the country level scale.
Recent Large Fire Perimeters (≥5000 acres): Filtered for wildfires greater than or equal to 5,000 acres for the last 5 years of fires (2020-January 2025), symbolized with color by year. It is scale dependent and starts displaying at the country level scale, with year-only labels for recent large fires.
California Fire Perimeters (1950+): Filtered for wildfires that started in 1950-January 2025. Symbolized by decade, and display starting at the country level scale.
Detailed metadata is included in the following documents: Wildland Fire Perimeters (Firep24_1) Metadata
See more information on our Living Atlas data release here: CAL FIRE Historical Fire Perimeters Available in ArcGIS Living Atlas
For any questions, please contact the data steward:
Kim Wallin, GIS Specialist
CAL FIRE, Fire & Resource Assessment Program (FRAP)
kimberly.wallin@fire.ca.gov
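As a rough illustration of reproducing the "Recent Large Fire Perimeters" filter locally, the sketch below uses GeoPandas. The file path, layer name, and field names (GIS_ACRES, YEAR_) are assumptions and may differ from the actual Firep24_1 release; check the dataset metadata before relying on them.

```python
import geopandas as gpd
import pandas as pd

# Read the perimeters layer; path, layer name, and field names are assumed.
perims = gpd.read_file("firep24_1.gdb", layer="firep24_1")

# Keep perimeters of 5,000 acres or more from 2020 onward.
year = pd.to_numeric(perims["YEAR_"], errors="coerce")
recent_large = perims[(perims["GIS_ACRES"] >= 5000) & (year >= 2020)]

print(len(recent_large), "perimeters of 5,000 acres or more since 2020")
recent_large.plot(column="YEAR_", legend=True)
```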
Overview:
This project aims to investigate the potential correlation between the Gross Domestic Product (GDP) of approximately 190 countries for the years 2021 and 2023 and their corresponding crime ratings. The crime ratings are represented on a scale from 0 to 10, with 0 indicating minimal or null crime activity and 10 representing the highest level of criminal activity.
Dataset:
The dataset used in this project comprises GDP data for the years 2021 and 2023 for around 190 countries, sourced from reputable international databases. Additionally, crime rating scores for the same countries and years are collected from credible sources such as governmental agencies, law enforcement organizations, or reputable research institutions.
Methodology:
Expected Outcomes:
- Identification of any significant correlations or patterns between GDP and crime ratings across different countries.
- Insights into the potential socioeconomic factors influencing crime rates and their relationship with economic indicators like GDP.
- Implications for policymakers, law enforcement agencies, and researchers in understanding the dynamics between economic development and crime prevalence.
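A minimal sketch of the correlation step is given below; the file names and column names are placeholders for whatever the assembled GDP and crime-rating tables actually use.

```python
import pandas as pd
from scipy import stats

# Placeholder files: one row per country-year.
gdp = pd.read_csv("gdp_2021_2023.csv")        # columns: country, year, gdp
crime = pd.read_csv("crime_ratings.csv")      # columns: country, year, crime_rating

merged = gdp.merge(crime, on=["country", "year"], how="inner")

# Report both Pearson (linear) and Spearman (rank) correlations per year.
for year, grp in merged.groupby("year"):
    r, p = stats.pearsonr(grp["gdp"], grp["crime_rating"])
    rho, p_s = stats.spearmanr(grp["gdp"], grp["crime_rating"])
    print(f"{year}: Pearson r={r:.2f} (p={p:.3f}), Spearman rho={rho:.2f} (p={p_s:.3f})")
```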
This data can be replicated to report the study findings in their entirety, including: a) the values behind the means, standard deviations and other measures reported; b) the values used to build graphs; c) the points extracted from images for analysis. (XLSX)
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The data set used to reach the conclusions drawn in the manuscript is stored under the folder named ‘Raw_Subject_Data’. Related metadata produced at the interim steps of the analysis described in the presented work can be found under the folder named ‘Preprocessed_Data_Interim_Outputs’. Scripts for executing the methodology can be found under the folder named ‘Scripts’. Additional data required to replicate the reported study findings can be found in the folder named ‘Feature_Set’.
DEEPEN stands for DE-risking Exploration of geothermal Plays in magmatic ENvironments. As part of the development of the DEEPEN 3D play fairway analysis (PFA) methodology for magmatic plays (conventional hydrothermal, superhot EGS, and supercritical), weights needed to be developed for use in the weighted sum of the different favorability index models produced from geoscientific exploration datasets. This was done using two different approaches: one based on expert opinions, and one based on statistical learning. This GDR submission includes the datasets used to produce the statistical learning-based weights. While expert opinions allow us to include more nuanced information in the weights, expert opinions are subject to human bias. Data-centric or statistical approaches help to overcome these potential human biases by focusing on and drawing conclusions from the data alone. The drawback is that, to apply these types of approaches, a dataset is needed. Therefore, we attempted to build comprehensive standardized datasets mapping anomalies in each exploration dataset to each component of each play. This data was gathered through a literature review focused on magmatic hydrothermal plays, along with well-characterized areas where superhot or supercritical conditions are thought to exist. Datasets were assembled for all three play types, but the hydrothermal dataset is the least complete due to its relatively low priority. For each known or assumed resource, the dataset states what anomaly in each exploration dataset is associated with each component of the system. The data is only semi-quantitative: values are either high, medium, or low, relative to background levels. In addition, the dataset has significant gaps, as not every possible exploration dataset has been collected and analyzed at every known or suspected geothermal resource area, in the context of all possible play types. The following training sites were used to assemble this dataset:
- Conventional magmatic hydrothermal: Akutan (from AK PFA), Oregon Cascades PFA, Glass Buttes OR, Mauna Kea (from HI PFA), Lanai (from HI PFA), Mt St Helens Shear Zone (from WA PFA), Wind River Valley (from WA PFA), Mount Baker (from WA PFA).
- Superhot EGS: Newberry (EGS demonstration project), Coso (EGS demonstration project), Geysers (EGS demonstration project), Eastern Snake River Plain (EGS demonstration project), Utah FORGE, Larderello, Kakkonda, Taupo Volcanic Zone, Acoculco, Krafla.
- Supercritical: Coso, Geysers, Salton Sea, Larderello, Los Humeros, Taupo Volcanic Zone, Krafla, Reykjanes, Hengill.
**Disclaimer: Treat the supercritical fluid anomalies with skepticism. They are based on assumptions due to the general lack of confirmed supercritical fluid encounters and samples at the sites included in this dataset at the time of assembling the dataset. The main assumption was that the supercritical fluid in a given geothermal system has shared properties with the hydrothermal fluid, which may not be the case in reality.
Once the datasets were assembled, principal component analysis (PCA) was applied to each. PCA is an unsupervised statistical learning technique, meaning that labels are not required on the data; it summarizes the directions of variance in the data. This approach was chosen because our labels are not certain, i.e., we do not know with 100% confidence that superhot resources exist at all the assumed positive areas.
We also do not have data for any known non-geothermal areas, meaning that it would be challenging to apply a supervised learning technique. In order to generate weights from the PCA, an analysis of the PCA loading values was conducted. PCA loading values represent how much a feature contributes to each principal component, and therefore to the overall variance in the data.
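The sketch below illustrates one way loading values can be turned into weights; the tiny table of sites and exploration datasets is invented for illustration, and the weighting scheme (absolute loadings weighted by explained variance) is a simple example rather than the exact DEEPEN procedure.

```python
import pandas as pd
from sklearn.decomposition import PCA

# Made-up stand-in for the training-site dataset: rows are sites, columns are
# exploration datasets, values are relative anomaly levels (low/medium/high).
levels = {"low": 1, "medium": 2, "high": 3}
raw = pd.DataFrame(
    {
        "resistivity": ["high", "high", "medium", "low"],
        "seismicity": ["medium", "high", "high", "low"],
        "heat_flow": ["high", "medium", "high", "medium"],
    },
    index=["site_a", "site_b", "site_c", "site_d"],
).replace(levels)

# Standardize the ordinal codes, then fit PCA.
X = (raw - raw.mean()) / raw.std(ddof=0)
pca = PCA().fit(X)

# Loadings: contribution of each exploration dataset to each principal component.
loadings = pd.DataFrame(pca.components_.T, index=raw.columns)

# One simple weighting scheme: absolute loadings, weighted by each component's
# explained variance ratio, summed and normalized to 1.
weights = (loadings.abs() * pca.explained_variance_ratio_).sum(axis=1)
weights /= weights.sum()
print(weights.round(3))
```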
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains background data and supplementary material for a methodological study on the use of ordinal response scales in linguistic research. For the literature survey reported in that study, which examines how rating scales are used in current linguistic research (4,441 papers from 16 linguistic journals, published between 2012 and 2022), it includes a tabular file listing the 406 research articles that report ordinal rating scale data. This file records annotated attributes of the studies and rating scales. Further, the dataset includes summary data gathered in a review of the psychometric literature on the interpretation of quantificational expressions that are often used to build graded scales. Empirical findings are collected for five rating scale dimensions: agreement (1 study), intensity (3 studies), frequency (17 studies), probability (11 studies), and quality (3 studies). Finally, the dataset includes new data from 20 informants on the interpretation of the quantifiers "few", "some", "many", and "most".
Abstract of the related publication: Ordinal scales are commonly used in applied linguistics. To summarize the distribution of responses provided by informants, these are usually converted into numbers and then averaged or analyzed with ordinary regression models. This approach has been criticized in the literature; one caveat (among others) is the assumption that distances between categories are known. The present paper illustrates how empirical insights into the perception of response labels may inform the design and analysis stage of a study. We start with a review of how ordinal scales are used in linguistic research. Our survey offers insights into typical scale layouts and analysis strategies, and it allows us to identify three commonly used rating dimensions (agreement, intensity, and frequency). We take stock of the experimental literature on the perception of relevant scale point labels and then demonstrate how psychometric insights may direct scale design and data analysis. This includes a careful consideration of measurement-theoretic and statistical issues surrounding the numeric-conversion approach to ordinal data. We focus on the consequences of these drawbacks for the interpretation of empirical findings, which will enable researchers to make informed decisions and avoid drawing false conclusions from their data. We present a case study on yous(e) in British and Scottish English, which shows that reliance on psychometric scale values can alter statistical conclusions, while also giving due consideration to the key limitations of the numeric-conversion approach to ordinal data analysis.
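The numeric-conversion caveat can be illustrated with a toy example: the same ordinal responses summarized with equidistant codes versus empirically derived scale values can yield different group differences. The scale values below are invented for illustration and are not taken from the psychometric studies in the dataset.

```python
import numpy as np

# Two coding schemes for the same frequency labels: equidistant integers versus
# hypothetical psychometrically derived values.
labels = ["never", "rarely", "sometimes", "often", "always"]
equidistant = {lab: i + 1 for i, lab in enumerate(labels)}        # 1, 2, 3, 4, 5
psychometric = {"never": 1.0, "rarely": 1.6, "sometimes": 2.9, "often": 4.3, "always": 5.0}

responses_a = ["sometimes", "often", "often", "always", "sometimes"]
responses_b = ["rarely", "sometimes", "often", "often", "often"]

for name, mapping in [("equidistant", equidistant), ("psychometric", psychometric)]:
    mean_a = np.mean([mapping[r] for r in responses_a])
    mean_b = np.mean([mapping[r] for r in responses_b])
    print(f"{name:>12}: group A = {mean_a:.2f}, group B = {mean_b:.2f}, diff = {mean_a - mean_b:.2f}")
```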
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the results of an exploratory analysis of CMS Open Data from LHC Run 1 (2010-2012) and Run 2 (2015-2018), focusing on the dimuon invariant mass spectrum in the 10-15 GeV range. The analysis investigates potential anomalies at 11.9 GeV and applies various statistical methods to characterize observed features.
Methodology:
Key Analysis Components:
Results Summary: The analysis identifies several features in the dimuon mass spectrum requiring further investigation. Preliminary observations suggest potential anomalies around 11.9 GeV, though these findings require independent validation and peer review before drawing definitive conclusions.
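For context, a minimal sketch of the basic spectrum step is shown below: computing the dimuon invariant mass from muon four-momenta and histogramming the 10-15 GeV window. The input file and column names are placeholders, not the actual CMS Open Data format.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder table of muon pairs with energy and momentum components (GeV).
df = pd.read_csv("dimuon_pairs.csv")

# Invariant mass m^2 = (E1+E2)^2 - |p1+p2|^2; clip tiny negative values from rounding.
m = np.sqrt(
    np.clip(
        (df.E1 + df.E2) ** 2
        - (df.px1 + df.px2) ** 2
        - (df.py1 + df.py2) ** 2
        - (df.pz1 + df.pz2) ** 2,
        0,
        None,
    )
)

window = m[(m >= 10.0) & (m <= 15.0)]
plt.hist(window, bins=100)
plt.axvline(11.9, linestyle="--", label="11.9 GeV region of interest")
plt.xlabel("Dimuon invariant mass [GeV]")
plt.ylabel("Candidates / bin")
plt.legend()
plt.show()
```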
Data Products:
Limitations: This work represents preliminary exploratory analysis. Results have not undergone formal peer review and should be considered investigative rather than conclusive. Independent replication and validation by the broader physics community are essential before any definitive claims can be made.
Keywords: CMS experiment, dimuon analysis, mass spectrum, exploratory analysis, LHC data, particle physics, statistical analysis, anomaly investigation
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Taylor Swift’s presence at National Football League (NFL) games was reported to have a causal effect on the performance of Travis Kelce and the Kansas City Chiefs. Critical examination of the supposed “Swift effect” provides some surprising lessons relevant to the scientific community. Here, we present a formal analysis to determine whether the media narrative that Swift’s presence at NFL games had any impact on player or team performance – and draw parallels to scientific journalism and clinical research. We performed a quasi-experimental study using covariate matching. Linear mixed effects models were used to determine how Swift’s presence or absence in Swift-era games influenced Kelce’s performance, relative to historical data. Additionally, a binary logistic regression model was developed to determine if Swift’s presence influenced the Chiefs’ game outcomes, relative to historical averages. Across multiple matching approaches, analyses demonstrated that Kelce’s yardage did not significantly differ when Taylor Swift was in attendance (n = 13 games) relative to matched pre‐Swift games. Although a decline in Kelce’s performance was observed in games without Swift (n = 6 games), the statistical significance of this finding varied by the matching algorithm used, indicating inconsistency in the effect. Similarly, Swift’s attendance did not result in a significant increase in the Chiefs’ likelihood of winning. Together, these findings suggest that the purported “Swift effect” is not supported by robust evidence. The weak statistical evidence that spawned the concept of the “Swift effect” is rooted in a constellation of fallacies common to medical journalism and research – including over-simplification, sensationalism, attribution bias, unjustified mechanisms, inadequate sampling, emphasis on surrogate outcomes, and inattention to comparative effectiveness. Clinicians and researchers must be vigilant to avoid falling victim to the “Swift effect,” since failure to scrutinize available evidence can lead to acceptance of unjustified theories and negatively impact clinical decision-making.
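As a rough illustration of the modelling approach described above (not the authors' code or data), the sketch below fits a logistic model for game outcome and a mixed model for Kelce's yardage using statsmodels; the game-level table and its column names (win, swift_present, kelce_yards, season) are assumptions.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical game-level table; one row per game.
games = pd.read_csv("chiefs_games.csv")

# Did attendance shift the Chiefs' win probability?
win_model = smf.logit("win ~ swift_present", data=games).fit()
print(win_model.summary())

# Did Kelce's receiving yards differ, allowing for season-level grouping?
yards_model = smf.mixedlm("kelce_yards ~ swift_present", data=games,
                          groups=games["season"]).fit()
print(yards_model.summary())
```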
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction: Ensuring high-quality race and ethnicity data within the electronic health record (EHR) and across linked systems, such as patient registries, is necessary to achieve the goal of inclusion of racial and ethnic minorities in scientific research and to detect disparities associated with race and ethnicity. The project goal was to improve race and ethnicity data completion within the Pediatric Rheumatology Care Outcomes Improvement Network and assess the impact of improved data completion on conclusions drawn from the registry.
Methods: This is a mixed-methods quality improvement study that consisted of five parts, as follows: (1) identifying baseline missing race and ethnicity data, (2) surveying current collection and entry, (3) completing data through audit and feedback cycles, (4) assessing the impact on outcome measures, and (5) conducting participant interviews and thematic analysis.
Results: Across six participating centers, 29% of the patients were missing data on race and 31% were missing data on ethnicity. Of patients missing data, most were missing both race and ethnicity. Rates of missingness varied by data entry method (electronic vs. manual). Recovered data had a higher percentage of patients with Other race or Hispanic/Latino ethnicity compared with patients with non-missing race and ethnicity data at baseline. Black patients had a significantly higher odds ratio of having a clinical juvenile arthritis disease activity score (cJADAS10) of ≥5 at first follow-up compared with White patients. There was no significant change in the odds ratio of cJADAS10 ≥5 for race and ethnicity after data completion. Patients missing race and ethnicity were more likely to be missing cJADAS values, which may affect the ability to detect changes in the odds ratio of cJADAS10 ≥5 after completion.
Conclusions: About one-third of the patients in a pediatric rheumatology registry were missing race and ethnicity data. After three audit and feedback cycles, centers decreased missing data by 94%, primarily via data recovery from the EHR. In this sample, completion of missing data did not change the findings related to differential outcomes by race. Recovered data were not uniformly distributed compared with those with non-missing race and ethnicity data at baseline, suggesting that differences in outcomes after completing race and ethnicity data may be seen with larger sample sizes.
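For readers unfamiliar with the outcome comparison, the sketch below shows how an odds ratio for cJADAS10 ≥5 by group can be computed from a 2x2 table; the counts are invented for illustration and do not reflect the registry data.

```python
import numpy as np
from scipy import stats

# Toy 2x2 table: rows are groups, columns are outcome categories.
#                 cJADAS10 >= 5   cJADAS10 < 5
table = np.array([[30, 70],       # group 1 (illustrative counts only)
                  [20, 110]])     # group 2 (illustrative counts only)

odds_ratio = (table[0, 0] * table[1, 1]) / (table[0, 1] * table[1, 0])

# 95% CI via the standard error of log(OR), plus an exact test p-value.
se_log_or = np.sqrt((1 / table).sum())
ci_low, ci_high = np.exp(np.log(odds_ratio) + np.array([-1.96, 1.96]) * se_log_or)
_, p_value = stats.fisher_exact(table)

print(f"OR = {odds_ratio:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f}), Fisher p = {p_value:.3f}")
```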
To acquire detailed surface elevation data for use in conservation planning, design, research, floodplain mapping, dam safety assessments, and hydrologic modeling. LAS and bare earth DEM data products are suitable for 1 foot contour generation. USGS LiDAR Base Specification 1.2, QL2. 19.6 cm NVA. This metadata record describes the hydro-flattened bare earth digital elevation model (DEM) derived from the classified LiDAR data for the 2017 Michigan LiDAR project covering approximately 907 square miles; its extent covers Oakland County.
This data is for planning purposes only and should not be used for legal or cadastral purposes. Any conclusions drawn from analysis of this information are not the responsibility of Sanborn Map Company. Users should be aware that temporal changes may have occurred since this dataset was collected and some parts of this dataset may no longer represent actual surface conditions. Users should not use these data for critical applications without a full awareness of their limitations. Contact: State of Michigan.
Due to the large size of the data, downloading the entire county may not be possible. It is recommended to use the live service directly within ArcMap or ArcGIS Pro. For further questions, contact the Oakland County Service Center at 248-858-8812, servicecenter@oakgov.com.
https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
This dataset contains 12,000 instructional outputs from the LongAlpaca (Yukang) machine learning system. It includes the columns output, instruction, file, and input, giving researchers ample material for exploring how instruction-tuned AI systems behave and what their behaviour implies for everyday applications.
Exploring the Dataset:
The dataset contains 12000 rows of information, with four columns containing output, instruction, file and input data. You can use these columns to explore the workings of a machine learning system, examine different instructional outputs for different inputs or instructions, study training data for specific ML systems, or analyze files being used by a machine learning system.
Visualizing Data:
Using built-in plotting tools within your chosen toolkit (such as Python), you can create powerful visualizations. Plotting outputs against their input instructions gives an overview of what the machine learning system is capable of doing and how it performs on different types of tasks or problems. You could also plot outputs alongside the files being used; this helps identify patterns in the training data and areas that need improvement in your machine learning models.
Analyzing Performance:
Using statistical analysis techniques such as regression or clustering algorithms, you can measure performance metrics such as accuracy and understand how they vary across instruction types. Experimenting with hyperparameter tuning may help you see which settings yield better results for any given situation. Additionally, correlations between input samples and output measurements can be examined to identify relationships, such as trends in accuracy over certain sets of instructions.
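A minimal exploration sketch along these lines is shown below; it assumes train.csv has the four columns described above and uses simple text-length features as stand-ins for richer performance metrics.

```python
import pandas as pd

# Assumes train.csv with columns: output, instruction, file, input.
df = pd.read_csv("train.csv")

# Simple derived features: how long are instructions and outputs?
df["instruction_len"] = df["instruction"].fillna("").str.len()
df["output_len"] = df["output"].fillna("").str.len()

print(df[["instruction_len", "output_len"]].describe())
print("Correlation:", df["instruction_len"].corr(df["output_len"]).round(3))

# Which source files contribute the most examples, and do their outputs differ in length?
print(df.groupby("file")["output_len"]
        .agg(["count", "mean"])
        .sort_values("count", ascending=False)
        .head(10))
```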
Drawing Conclusions:
By leveraging data mining tools, you can build predictive models that project future outcomes from past performance measurements across the various instruction types in the dataset, allowing you to determine whether particular changes improve your AI model's capability and predictability over time.
- Developing self-improving Artificial Intelligence algorithms by using the outputs and instructional data to identify correlations and feedback loop structures between instructions and output results.
- Generating Machine Learning simulations using this dataset to optimize AI performance based on given instruction set.
- Using the instructions, input, and output data in the dataset to build AI systems for natural language processing, enabling comprehensive understanding of user queries and providing more accurate answers accordingly
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv

| Column name | Description |
|:------------|:------------------------------------------------------|
| output      | The output of the instruction given. (String)          |
| instruction | The instruction provided to the model. (String)        |
| file        | The file used when executing the instruction. (String) |
| input       | Additional context for the instruction. (String)       |
If you use this dataset in your research, please credit the original authors and Huggingface Hub.
This data should be used carefully for statistical analysis and reporting due to missing perimeters (see Use Limitation in metadata). Some fires are missing because historical records were lost or damaged, were too small for the minimum cutoffs, had inadequate documentation or have not yet been incorporated into the database. Other known errors with the fire perimeter database include duplicate fires and over-generalization. Over-generalization, particularly with large old fires, may show unburned "islands" within the final perimeter as burned. Users of the fire perimeter database must exercise caution in application of the data. Careful use of the fire perimeter database will prevent users from drawing inaccurate or erroneous conclusions from the data. This dataset may differ in California compared to that available from the National Interagency Fire Center (NIFC) due to different requirements between the two datasets. The data covers fires back to 1878.
Please help improve this dataset by filling out this survey with feedback:
Historic Fire Perimeter Dataset Feedback (arcgis.com)
Current criteria for data collection are as follows:
CAL FIRE (including contract counties) submit perimeters ≥10 acres in timber, ≥50 acres in brush, or ≥300 acres in grass, and/or ≥3 impacted residential or commercial structures, and/or caused ≥1 fatality.
All cooperating agencies submit perimeters ≥10 acres.
Version update:
Firep24_1 was released in April 2025. Five hundred forty-eight fires from the 2024 fire season were added to the database (2 from BIA, 56 from BLM, 197 from CAL FIRE, 193 from Contract Counties, 27 from LRA, 8 from NPS, 55 from USFS and 8 from USFW). Six perimeters were added from the 2025 fire season (as a special case due to an unusual January fire siege). Five duplicate fires were removed, and the 2023 Sage was replaced with a more accurate perimeter. There were 900 perimeters that received updated attribution (705 removed “FIRE” from the end of Fire Name field and 148 replaced Complex IRWIN ID with Complex local incident number for COMPLEX_ID field). The following fires were identified as meeting our collection criteria but are not included in this version and will hopefully be added in a future update: Addie (2024-CACND-002119), Alpaugh (2024-CACND-001715), South (2024-CATIA-001375). One perimeter is missing containment date that will be updated in the next release.
Cross checking CALFIRS reporting for new CAL FIRE submissions to ensure accuracy with cause class was added to the compilation process. The cause class domain description for “Powerline” was updated to “Electrical Power” to be more inclusive of cause reports.
Detailed metadata is included in the following documents:
Wildland Fire Perimeters (Firep24_1) Metadata
For any questions, please contact the data steward:
Kim Wallin, GIS Specialist
CAL FIRE, Fire & Resource Assessment Program (FRAP)
kimberly.wallin@fire.ca.gov
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The National Transfusion Dataset (NTD) is a collection of transfusion episode data incorporating transfusion, laboratory and hospital data from hospitals and health services, as well as prehospital transfusion data from ambulance and retrieval services. The NTD will form the first integrated national database of blood usage in Australia. The NTD aims to collect information about where, when, and how blood products are used across all clinical settings. This will address Australia’s absence of an integrated national database to record blood usage with the ability to link with clinical outcomes. The dataset will be an invaluable resource towards a comprehensive understanding of how and why blood products are used, the numbers and characteristics of patients transfused in health services, and the clinical outcomes after transfusion, and will provide support to policy development and research. The NTD was formed through the incorporation of the established Australian and New Zealand Massive Transfusion Registry (ANZ-MTR) and a pilot Transfusion Database (TD) project. The ANZ-MTR has a unique focus on massive transfusion (MT) and contains over 10,000 cases from 41 hospitals across Australia and New Zealand. The TD was a trial extension of the registry that collated data on ALL (not just massive) transfusions on >8000 patients from pilot hospitals. The NTD will integrate and expand these databases to provide new data on transfusion practice, including blood utilisation, clinical management and the vital closing of the haemovigilance loop.
Conditions of use: Any material or manuscript to be published using NTD data must be submitted for review by the NTD Steering Committee prior to submission for publication. The NTD and Partner Organisations should be acknowledged in all publications. Preferred wording for the acknowledgement will be provided with the data. The NTD reserves the right to dissociate itself from conclusions drawn if it deems necessary. If the data is the primary source for a report or publication, the source of the data must be acknowledged, along with a statement that the analysis and interpretation are those of the author, not the NTD. Where an author analysing the data is a member of an organisation formally associated, or partnered, with the NTD, the NTD should be acknowledged as a secondary affiliation. Where the author is a member of the NTD Project Team, then the primary attribution should be the NTD. The dataset DOI (10.26180/22151987) must be referenced in all publications. Further information can be found in the Data Access and Publications Policy. To submit a data access request, click here.
The authors studied the X-ray properties of the young (~1-8 Myr) open cluster around the hot (O8 III) star Lambda Ori and compared them with those of the similarly aged Sigma Ori cluster in order to investigate the possible effects of the different ambient environments. They analyzed an XMM-Newton observation of the cluster using EPIC imaging and low-resolution spectral data. They studied the variability of the detected sources and performed a spectral analysis of the brightest sources in the field using multi-temperature models. The authors detected 167 X-ray sources above a 5-sigma detection threshold, the properties of which are listed in this table; 58 of these are identified with known cluster members and candidates, from massive stars down to low-mass stars with spectral types of ~M5.5. Another 23 sources were identified with new possible photometric candidates. Late-type stars have a median log LX/Lbol ~ -3.3, close to the saturation limit. Variability was observed in ~35% of late-type members or candidates, including six flaring sources. The emission from the central hot star Lambda Ori is dominated by plasma at 0.2-0.3 keV, with a weaker component at 0.7 keV, consistent with a wind origin. The coronae of late-type stars can be described by two plasma components with temperatures T1 ~ 0.3-0.8 keV and T2 ~ 0.8-3 keV, and subsolar abundances Z ~ 0.1-0.3 Zsun, similar to what is found in other star-forming regions and associations. No significant difference was observed between stars with and without circumstellar discs, although the smallness of the sample of stars with discs and accretion does not allow definitive conclusions to be drawn. The authors concluded that the X-ray properties of Lambda Ori late-type stars are comparable to those of the coeval Sigma Ori cluster, suggesting that stellar activity in Lambda Ori has not been significantly affected by the different ambient environment. The Lambda Ori cluster was observed by XMM-Newton from 20:46 UT on September 28, 2006 to 12:23 UT on September 29, 2006 (Obs. ID 0402050101), for a total duration of 56 ks, using both the EPIC MOS and PN cameras and the RGS instruments. The EPIC cameras were operated in full frame mode with the thick filter. This table was created by the HEASARC in November 2011 based on CDS Catalog J/A+A/530/A150 files tablea1.dat ('X-ray sources detected in the Lambda Ori Cluster'), table1.dat ('X-ray and optical properties of sources identified with known cluster members and candidates'), and table2.dat ('X-ray sources identified with possible new cluster candidates'). It does not include the objects listed in tablea2.dat ('3-sigma upper limits and optical properties of undetected cluster members and candidates'). This is a service provided by NASA HEASARC.