82 datasets found
  1. Dataset: A Systematic Literature Review on the topic of High-value datasets

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 23, 2023
    Cite
    Anastasija Nikiforova; Nina Rizun; Magdalena Ciesielska; Charalampos Alexopoulos; Andrea Miletič (2023). Dataset: A Systematic Literature Review on the topic of High-value datasets [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7944424
    Explore at:
    Dataset updated
    Jun 23, 2023
    Dataset provided by
    University of the Aegean
    University of Tartu
    University of Zagreb
    Gdańsk University of Technology
    Authors
    Anastasija Nikiforova; Nina Rizun; Magdalena Ciesielska; Charalampos Alexopoulos; Andrea Miletič
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains data collected during the study "Towards High-Value Datasets determination for data-driven development: a systematic literature review", conducted by Anastasija Nikiforova (University of Tartu), Nina Rizun, Magdalena Ciesielska (Gdańsk University of Technology), Charalampos Alexopoulos (University of the Aegean) and Andrea Miletič (University of Zagreb). It is made public both to serve as supplementary data for that paper (a pre-print is available in Open Access at https://arxiv.org/abs/2305.10234) and so that other researchers can use these data in their own work.

    The protocol is intended for the systematic literature review (SLR) on the topic of high-value datasets, with the aim of gathering information on how the topic of high-value datasets (HVD) and their determination has been reflected in the literature over the years and what these studies have found to date, including the indicators used, the stakeholders involved, data-related aspects, and frameworks. The data in this dataset were collected as a result of the SLR over Scopus, Web of Science, and the Digital Government Research library (DGRL) in 2023.

    Methodology

    To understand how HVD determination has been reflected in the literature over the years and what these studies have found to date, all relevant literature covering this topic was studied. To this end, the SLR was carried out by searching the digital libraries covered by Scopus, Web of Science (WoS), and the Digital Government Research library (DGRL).

    These databases were queried for the keywords ("open data" OR "open government data") AND ("high-value data*" OR "high value data*"), applied to the article title, keywords, and abstract to limit the results to papers in which these topics were primary research objects rather than merely mentioned in the body, e.g., as future work. After deduplication, 11 unique articles were found and checked for relevance; as a result, a total of 9 articles were examined further. Each study was independently examined by at least two authors.

    To attain the objective of our study, we developed a protocol in which the information on each selected study was collected in four categories: (1) descriptive information, (2) approach- and research design-related information, (3) quality-related information, (4) HVD determination-related information.

    Test procedure: each study was independently examined by at least two authors; after an in-depth examination of the full text of the article, the structured protocol was filled in for each study. The structure of the protocol is available in the supplementary files (see Protocol_HVD_SLR.odt, Protocol_HVD_SLR.docx). The data collected for each study by two researchers were then synthesized into one final version by a third researcher.

    Description of the data in this data set

    Protocol_HVD_SLR provides the structure of the protocol. Spreadsheet #1 provides the filled protocol for the relevant studies. Spreadsheet #2 provides the list of results of the search over the three indexing databases, i.e., before filtering out irrelevant studies.

    The information on each selected study was collected in four categories: (1) descriptive information, (2) approach- and research design- related information, (3) quality-related information, (4) HVD determination-related information

    Descriptive information
    1) Article number - a study number, corresponding to the study number assigned in an Excel worksheet
    2) Complete reference - the complete source information to refer to the study
    3) Year of publication - the year in which the study was published
    4) Journal article / conference paper / book chapter - the type of the paper {journal article, conference paper, book chapter}
    5) DOI / Website - a link to the website where the study can be found
    6) Number of citations - the number of citations of the article in Google Scholar, Scopus, Web of Science
    7) Availability in OA - availability of the article in Open Access
    8) Keywords - keywords of the paper as indicated by the authors
    9) Relevance for this study - the relevance level of the article for this study {high / medium / low}

    Approach- and research design-related information
    10) Objective / RQ - the research objective / aim, established research questions
    11) Research method (including unit of analysis) - the methods used to collect data, including the unit of analysis (country, organisation, specific unit that has been analysed, e.g., the number of use-cases, scope of the SLR, etc.)
    12) Contributions - the contributions of the study
    13) Method - whether the study uses a qualitative, quantitative, or mixed methods approach
    14) Availability of the underlying research data - whether there is a reference to the publicly available underlying research data, e.g., transcriptions of interviews, collected data, or an explanation of why these data are not shared
    15) Period under investigation - period (or moment) in which the study was conducted
    16) Use of theory / theoretical concepts / approaches - does the study mention any theory / theoretical concepts / approaches? If any theory is mentioned, how is theory used in the study?

    Quality- and relevance- related information
    17) Quality concerns - whether there are any quality concerns (e.g., limited information about the research methods used)
    18) Primary research object - is the HVD a primary research object in the study? (primary - the paper is focused around HVD determination; secondary - mentioned but not studied (e.g., as part of discussion, future work, etc.))

    HVD determination-related information
    19) HVD definition and type of value - how is the HVD defined in the article and / or any other equivalent term?
    20) HVD indicators - what are the indicators to identify HVD? How were they identified? (components & relationships, "input -> output")
    21) A framework for HVD determination - is there a framework presented for HVD identification? What components does it consist of and what are the relationships between these components? (detailed description)
    22) Stakeholders and their roles - what stakeholders or actors does HVD determination involve? What are their roles?
    23) Data - what data do HVD cover?
    24) Level (if relevant) - what is the level of the HVD determination covered in the article? (e.g., city, regional, national, international)
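
    The 24 protocol fields above map naturally onto one structured record per study. Below is a minimal sketch of such a record in Python; the field names are paraphrased from the list above and are not the actual spreadsheet column headers.

```python
from dataclasses import dataclass

@dataclass
class ProtocolRecord:
    """One protocol row per selected study, grouped by the four categories above."""
    # (1) Descriptive information
    article_number: int
    complete_reference: str
    year_of_publication: int
    paper_type: str              # journal article / conference paper / book chapter
    doi_or_website: str
    citations: dict              # e.g. {"google_scholar": 12, "scopus": 8, "wos": 5}
    open_access: bool
    keywords: list
    relevance: str               # high / medium / low
    # (2) Approach- and research design-related information
    objective_rq: str
    research_method: str
    contributions: str
    method_type: str             # qualitative / quantitative / mixed
    underlying_data_availability: str
    period_under_investigation: str
    use_of_theory: str
    # (3) Quality- and relevance-related information
    quality_concerns: str
    primary_research_object: str  # primary / secondary
    # (4) HVD determination-related information
    hvd_definition_and_value: str
    hvd_indicators: str
    hvd_framework: str
    stakeholders_and_roles: str
    data_covered: str
    level: str                    # e.g. city / regional / national / international
```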

    Format of the files: .xls, .csv (for the first spreadsheet only), .odt, .docx

    Licenses or restrictions: CC-BY

    For more info, see README.txt

  2. Data from: DE 2 Vector Electric Field Instrument, VEFI, Magnetometer, MAG-B,...

    • data.nasa.gov
    • s.cnmilf.com
    Updated Apr 8, 2025
    Cite
    nasa.gov (2025). DE 2 Vector Electric Field Instrument, VEFI, Magnetometer, MAG-B, Merged Magnetic and Electric Field Parameters, 62 ms Data [Dataset]. https://data.nasa.gov/dataset/de-2-vector-electric-field-instrument-vefi-magnetometer-mag-b-merged-magnetic-and-electric
    Explore at:
    Dataset updated
    Apr 8, 2025
    Dataset provided by
    NASA, http://nasa.gov/
    Description

    This Dynamics Explorer 2, DE 2, data set is a combination of the Vector Electric Field Instrument, VEFI, and Magnetometer-B, MAGB, high resolution data sets in spacecraft, SC, coordinates submitted to NSSDC. The following orbit-altitude, OA, parameters have been added to the data set: 1) Model magnetic field, SC coordinates 2) Satellite altitude 3) Geographic latitude and longitude 4) Magnetic local time 5) Invariant latitude. The VEFI data set is described in the file VEFIVOLDESC.SFD and the MAGB data set is described in the file MAGBVOLDESC.SFD; these files are portions of the Standard Format Data Unit, SFDU, metadata files submitted with the VEFI and MAGB data to NSSDC and are included in each volume of this data set.

    This data set consists of daily files from 1981-08-15, day of year 227, to 1983-02-16, day of year 47. Each file contains all the data available for a given day. During the merging of the data sets it was found that although VEFI and MAGB should cover the same time spans, they do not, perhaps because the original MAGB high resolution data set was created on the DE Sigma-9 in Sigma-9 format by using the DE telemetry tapes, while the VEFI high resolution data set was created on the DE MicroVAX system using the DE telemetry data base on optical disk. In order to keep the largest amount of data possible, the merged data set includes all the available VEFI and MAGB data. For those times when VEFI data was available but MAGB was not, 6.54% of the time spanned by this data product, a fill value of 9999999. was given to the MAGB data. Likewise, for those times when MAGB data was available but VEFI was not, 6.87% of the time, the fill value was assigned to the VEFI data. Times for which both VEFI and MAGB data were fill values in the original data sets were not included in the merged data set. There were also times when certain OA parameters were fill values in the OA data base and they are therefore also fill values in this merged data set. The model magnetic field had fill values for 8.55% of the data. Statistics were not kept for the other OA parameters. Each daily file contains a record per measurement. The total number of records in each file varies depending on the amount of data available for a given day.

    The DE 2 spacecraft, which was the low-altitude mission component, complemented the high-altitude mission DE 1 spacecraft and was placed into an orbit with a perigee sufficiently low to permit measurements of neutral composition, temperature, and wind. The apogee was high enough to permit measurements above the interaction regions of suprathermal ions, and also plasma flow measurements at the feet of the magnetospheric field lines. The general form of the spacecraft was a short polygon 137 cm in diameter and 115 cm high. The triaxial antennas were 23 m tip-to-tip. One 6 m boom was provided for remote measurements. The spacecraft weight was 403 kg. Power was supplied by a solar cell array, which charged two 6 ampere-hour nickel-cadmium batteries. The spacecraft was three-axis stabilized with the yaw axis aligned toward the center of the Earth to within 1°. The spin axis was normal to the orbit plane within 1° with a spin rate of one revolution per orbit. A single-axis scan platform was included in order to mount the low-altitude plasma instrument (ID: 81-070B-08). The platform rotated about the spin axis. A pulse code modulation telemetry data system was used that operated in real time or in a tape recorder mode.
    Data were acquired on a science-problem-oriented basis, with closely coordinated operations of the various instruments, both satellites, and supportive experiments. Measurements were temporarily stored on tape recorders before transmission at an 8:1 playback-to-record ratio. Since commands were also stored in a command memory unit, spacecraft operations were not real time. Additional details can be found in R.A. Hoffman et al., Space Sci. Instrum., 5(4), 349, 1981. DE-2 reentered the atmosphere on February 19, 1983.

    A triaxial fluxgate magnetometer onboard DE 2, MAG-B, similar to one on board DE 1 (ID: 81-070A-01), was used to obtain the magnetic field data needed to study the magnetosphere-ionosphere-atmosphere coupling. The primary objectives of this investigation were to measure field aligned currents in the auroral oval and over the polar cap at two different altitudes using the two spacecraft, and to correlate these measurements with observations of electric fields, plasma waves, suprathermal particles, thermal particles, and auroral images obtained from investigation (ID: 81-070A-03). The magnetometer had digital compensation of the ambient field in 8000 nT increments. The instrument incorporated its own 12-bit analog-to-digital, A/D, converter, a 4-bit digital compensation register for each axis, and a system control that generated a 48-bit data word consisting of a 16-bit representation of the field measured along each of three magnetometer axes. Track and hold modules were used to obtain simultaneous samples on all three axes. The instrument bandwidth was 25 Hz. The analog range was ±62000 nT, the accuracy was ±4 nT, and the resolution was 1.5 nT. The time resolution was 16 vector samples/s. More details can be found in W.H. Farthing et al., Space Sci. Instrum., 5(4), 551, 1981.

    The Vector Electric Field Instrument, VEFI, used flight-proven double-probe techniques with 20 m baselines to obtain measurements of DC electric fields. This electric field investigation had the following objectives: 1) obtain accurate and comprehensive triaxial DC electric field measurements at ionospheric altitudes in order to refine the basic spatial patterns, define the large-scale time history of these patterns, and study the small-scale temporal and spatial variations within the overall patterns; 2) study the degree to which and in what region the electric field projects to the equatorial plane; 3) obtain measurements of extreme low frequency, ELF, and lower frequency irregularity structures; and 4) perform numerous correlative studies. The VEFI instrument consisted of six cylindrical elements 11 m long and 28 mm in diameter. Each antenna was insulated from the plasma except for the outer 2 m. The baseline, or distance between the midpoints of these 2-m active elements, was 20 m. The antennas were interlocked along the edges to prevent oscillation and to increase their rigidity against drag forces. The basic electronic system was very similar in concept to those used on IMP-8 and ISEE 1, but modified for a three-axis measurement on a nonspinning spacecraft. At the core of the system were the high-impedance (10¹² ohm) preamplifiers, whose outputs were accurately subtracted and digitized with 14-bit A/D conversion for sensitivity to about 0.1 µV/m to maintain high resolution for subsequent removal of the cross product of the electric field, V, and magnetic field, B, vectors in data processing. This provided the basic DC measurement.
Other circuitry was used to aid in interpreting the DC data and to measure rapid variations in the signals detected by the antennas. The planned DC electric field range was ±1 V/m, the planned resolution was 0.1 mV/m, and the variational AC electric field was measured from 4 Hz to 1024 Hz. The DC electric field was measured at 16 samples/s. The AC electric field was measured from 1 µV/m to 10 mV/m root mean square, rms. Note that the VEFI antenna pair perpendicular to the orbit plane onboard DE 2 did not deploy. Additional details are found in N.C. Maynard et al., Space Sci. Instrum., 5(4), 523, 1981.
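
    When working with the merged daily files, the documented fill value of 9999999. has to be masked out before any averaging or plotting. A minimal sketch with NumPy (the array below is illustrative; the real record layout is described in the SFDU metadata files):

```python
import numpy as np

FILL_VALUE = 9999999.0  # fill value used in the merged VEFI/MAGB files

def mask_fill(values: np.ndarray) -> np.ma.MaskedArray:
    """Mask samples flagged with the documented fill value so they are
    excluded from averages, plots, etc."""
    return np.ma.masked_values(values, FILL_VALUE)

# Illustrative only: one magnetic-field component with a missing MAGB sample.
magb_bx = np.array([12.5, 9999999.0, 13.1])
print(mask_fill(magb_bx).mean())  # fill value ignored -> 12.8
```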

  3. Data from: Automatic Spectroscopic Data Categorization by Clustering...

    • acs.figshare.com
    zip
    Updated May 30, 2023
    Cite
    Xin Zou; Elaine Holmes; Jeremy K Nicholson; Ruey Leng Loo (2023). Automatic Spectroscopic Data Categorization by Clustering Analysis (ASCLAN): A Data-Driven Approach for Distinguishing Discriminatory Metabolites for Phenotypic Subclasses [Dataset]. http://doi.org/10.1021/acs.analchem.5b04020.s001
    Explore at:
    zip
    Dataset updated
    May 30, 2023
    Dataset provided by
    ACS Publications
    Authors
    Xin Zou; Elaine Holmes; Jeremy K Nicholson; Ruey Leng Loo
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    We propose a novel data-driven approach aiming to reliably distinguish discriminatory metabolites from nondiscriminatory metabolites for a given spectroscopic data set containing two biological phenotypic subclasses. The automatic spectroscopic data categorization by clustering analysis (ASCLAN) algorithm aims to categorize spectral variables within a data set into three clusters corresponding to noise, nondiscriminatory and discriminatory metabolite regions. This is achieved by clustering each spectral variable based on the r2 value representing the loading weight of each spectral variable as extracted from an orthogonal partial least-squares discriminant analysis (OPLS-DA) model of the data set. The variables are ranked according to r2 values and a series of principal component analysis (PCA) models are then built for subsets of these spectral data corresponding to ranges of r2 values. The Q2X value for each PCA model is extracted. K-means clustering is then applied to the Q2X values to generate two clusters based on a minimum Euclidean distance criterion. The cluster consisting of lower Q2X values is deemed devoid of metabolic information (noise), while the cluster consisting of higher Q2X values is further subclustered into two groups based on the r2 values. We considered the cluster with high Q2X but low r2 values as nondiscriminatory, and the cluster with high Q2X and high r2 values as discriminatory. The boundaries between these three clusters of spectral variables, on the basis of the r2 values, were considered the cut-off values for defining the noise, nondiscriminatory and discriminatory variables.

    We evaluated the ASCLAN algorithm using six simulated 1H NMR spectroscopic data sets representing small, medium and large data sets (N = 50, 500, and 1000 samples per group, respectively), each with a reduced and a full resolution set of variables (0.005 and 0.0005 ppm, respectively). ASCLAN correctly identified all discriminatory metabolites and showed zero false positives (100% specificity and positive predictive value) irrespective of the spectral resolution or the sample size in all six simulated data sets. This error rate was found to be superior to existing methods for ascertaining feature significance: univariate t test with Bonferroni correction (up to 10% false positive rate), Benjamini–Hochberg correction (up to 35% false positive rate) and metabolome-wide significance level (MWSL, up to 0.4% false positive rate), as well as various OPLS-DA parameters: variable importance to projection (up to 15% false positive rate), loading coefficients (up to 35% false positive rate), and regression coefficients (up to 39% false positive rate).

    The application of ASCLAN was further exemplified using a widely investigated renal toxin, mercury(II) chloride (HgCl2), in a rat model. ASCLAN successfully identified many of the known metabolites related to renal toxicity, such as increased excretion of urinary creatinine and different amino acids. The ASCLAN algorithm provides a framework for reliably differentiating discriminatory metabolites from nondiscriminatory metabolites in a biological data set without the need to set an arbitrary cut-off value, as applied in some of the conventional methods. This offers significant advantages over existing methods and the possibility of automating high-throughput screening in “omics” data.
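
    The two-stage clustering described above can be sketched as follows; this is an illustrative re-implementation, assuming the per-variable r2 and Q2X values have already been computed from the OPLS-DA and PCA models (it is not the authors' published code):

```python
import numpy as np
from sklearn.cluster import KMeans

def asclan_labels(r2: np.ndarray, q2x: np.ndarray) -> np.ndarray:
    """Label spectral variables as 0 = noise, 1 = nondiscriminatory, 2 = discriminatory.

    r2  : OPLS-DA loading weight per spectral variable
    q2x : Q2X associated with each variable (from the PCA models on r2-ranked subsets)
    """
    labels = np.zeros(len(r2), dtype=int)

    # Stage 1: K-means on Q2X separates noise from informative variables.
    km_q = KMeans(n_clusters=2, n_init=10, random_state=0).fit(q2x.reshape(-1, 1))
    informative = km_q.labels_ == int(np.argmax(km_q.cluster_centers_.ravel()))

    # Stage 2: subcluster the informative variables by r2 into
    # nondiscriminatory (low r2) and discriminatory (high r2).
    km_r = KMeans(n_clusters=2, n_init=10, random_state=0).fit(r2[informative].reshape(-1, 1))
    high_r = km_r.labels_ == int(np.argmax(km_r.cluster_centers_.ravel()))

    labels[informative] = np.where(high_r, 2, 1)
    return labels

# Synthetic example: random r2 and Q2X values for 200 spectral variables.
rng = np.random.default_rng(0)
print(asclan_labels(rng.random(200), rng.random(200))[:20])
```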

  4. A Simple Optimization Workflow to Enable Precise and Accurate Imputation of...

    • acs.figshare.com
    • datasetcatalog.nlm.nih.gov
    xlsx
    Updated Jun 11, 2023
    Cite
    Kruttika Dabke; Simion Kreimer; Michelle R. Jones; Sarah J. Parker (2023). A Simple Optimization Workflow to Enable Precise and Accurate Imputation of Missing Values in Proteomic Data Sets [Dataset]. http://doi.org/10.1021/acs.jproteome.1c00070.s003
    Explore at:
    xlsx
    Dataset updated
    Jun 11, 2023
    Dataset provided by
    ACS Publications
    Authors
    Kruttika Dabke; Simion Kreimer; Michelle R. Jones; Sarah J. Parker
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Missing values in proteomic data sets have real consequences on downstream data analysis and reproducibility. Although several imputation methods exist to handle missing values, no single imputation method is best suited for a diverse range of data sets, and no clear strategy exists for evaluating imputation methods for clinical DIA-MS data sets, especially at different levels of protein quantification. To navigate through the different imputation strategies available in the literature, we have established a strategy to assess imputation methods on clinical label-free DIA-MS data sets. We used three DIA-MS data sets with real missing values to evaluate eight imputation methods with multiple parameters at different levels of protein quantification: a dilution series data set, a small pilot data set, and a clinical proteomic data set comparing paired tumor and stroma tissue. We found that imputation methods based on local structures within the data, like local least-squares (LLS) and random forest (RF), worked well in our dilution series data set, whereas imputation methods based on global structures within the data, like BPCA, performed well in the other two data sets. We also found that imputation at the most basic protein quantification level (fragment level) improved accuracy and the number of proteins quantified. With this analytical framework, we quickly and cost-effectively evaluated different imputation methods using two smaller complementary data sets to narrow down to the larger proteomic data set’s most accurate methods. This acquisition strategy allowed us to provide reproducible evidence of the accuracy of the imputation method, even in the absence of a ground truth. Overall, this study indicates that the most suitable imputation method relies on the overall structure of the data set and provides an example of an analytic framework that may assist in identifying the most appropriate imputation strategies for the differential analysis of proteins.
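
    One generic way to compare imputers on a proteomic intensity matrix, in the spirit of the strategy described above (though not the authors' exact benchmarking procedure), is to hide a fraction of the observed values, impute them, and measure the error:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import KNNImputer, IterativeImputer
from sklearn.ensemble import RandomForestRegressor

def score_imputer(imputer, X: np.ndarray, mask_frac: float = 0.05, seed: int = 0) -> float:
    """Hide a fraction of the observed intensities, impute them, and report the RMSE."""
    rng = np.random.default_rng(seed)
    X_masked = X.copy()
    observed = ~np.isnan(X)
    hide = observed & (rng.random(X.shape) < mask_frac)
    X_masked[hide] = np.nan
    X_imputed = imputer.fit_transform(X_masked)
    return float(np.sqrt(np.mean((X_imputed[hide] - X[hide]) ** 2)))

# Toy matrix standing in for a samples x proteins intensity table.
X = np.random.default_rng(1).lognormal(size=(30, 40))
imputers = {
    "knn (local)": KNNImputer(n_neighbors=5),
    "iterative RF (global)": IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=20, random_state=0),
        max_iter=5, random_state=0),
}
for name, imp in imputers.items():
    print(name, round(score_imputer(imp, X), 3))
```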

  5. Optimized parameter values for play detection.

    • plos.figshare.com
    xls
    Updated Apr 18, 2024
    Cite
    Jonas Bischofberger; Arnold Baca; Erich Schikuta (2024). Optimized parameter values for play detection. [Dataset]. http://doi.org/10.1371/journal.pone.0298107.t004
    Explore at:
    xls
    Dataset updated
    Apr 18, 2024
    Dataset provided by
    PLOS, http://plos.org/
    Authors
    Jonas Bischofberger; Arnold Baca; Erich Schikuta
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    With recent technological advancements, quantitative analysis has become an increasingly important area within professional sports. However, the manual process of collecting data on relevant match events like passes, goals and tacklings comes with considerable costs and limited consistency across providers, affecting both research and practice. In football, while automatic detection of events from positional data of the players and the ball could alleviate these issues, it is not entirely clear what accuracy current state-of-the-art methods realistically achieve because there is a lack of high-quality validations on realistic and diverse data sets. This paper adds context to existing research by validating a two-step rule-based pass and shot detection algorithm on four different data sets using a comprehensive validation routine that accounts for the temporal, hierarchical and imbalanced nature of the task. Our evaluation shows that pass and shot detection performance is highly dependent on the specifics of the data set. In accordance with previous studies, we achieve F-scores of up to 0.92 for passes, but only when there is an inherent dependency between event and positional data. We find a significantly lower accuracy with F-scores of 0.71 for passes and 0.65 for shots if event and positional data are independent. This result, together with a critical evaluation of existing methodologies, suggests that the accuracy of current football event detection algorithms operating on positional data is currently overestimated. Further analysis reveals that the temporal extraction of passes and shots from positional data poses the main challenge for rule-based approaches. Our results further indicate that the classification of plays into shots and passes is a relatively straightforward task, achieving F-scores between 0.83 and 0.91 for rule-based classifiers and up to 0.95 for machine learning classifiers. We show that there exist simple classifiers that accurately differentiate shots from passes in different data sets using a low number of human-understandable rules. Operating on basic spatial features, our classifiers provide a simple, objective event definition that can be used as a foundation for more reliable event-based match analysis.
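
    The kind of compact, human-understandable rule set described above for separating shots from passes can be illustrated with a toy classifier; the features and thresholds below are hypothetical and are not taken from the paper:

```python
from dataclasses import dataclass

@dataclass
class Play:
    # Hypothetical spatial features computed from positional data.
    distance_to_goal_m: float    # ball-to-goal distance at the moment of release
    angle_to_goal_deg: float     # absolute angle between ball direction and goal centre
    ball_speed_mps: float        # initial ball speed

def classify_play(play: Play) -> str:
    """Toy rule-based shot/pass classifier; thresholds are illustrative only."""
    if (play.distance_to_goal_m < 30
            and play.angle_to_goal_deg < 25
            and play.ball_speed_mps > 15):
        return "shot"
    return "pass"

print(classify_play(Play(distance_to_goal_m=18, angle_to_goal_deg=10, ball_speed_mps=22)))  # shot
print(classify_play(Play(distance_to_goal_m=45, angle_to_goal_deg=60, ball_speed_mps=12)))  # pass
```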

  6. Discovering Hidden Trends in Global Video Games

    • kaggle.com
    zip
    Updated Dec 3, 2022
    Cite
    The Devastator (2022). Discovering Hidden Trends in Global Video Games [Dataset]. https://www.kaggle.com/datasets/thedevastator/discovering-hidden-trends-in-global-video-games
    Explore at:
    zip (57229 bytes)
    Dataset updated
    Dec 3, 2022
    Authors
    The Devastator
    Description

    Discovering Hidden Trends in Global Video Games Sales

    Platforms, Genres, and Profitable Regions

    By Andy Bramwell [source]

    About this dataset

    This dataset contains sales data for video games from all around the world, across different platforms, genres and regions. From the latest RPG releases to thrilling racing games, this database provides an insight into what constitutes a hit game in today’s gaming industry. Armed with this data and analysis, future developers can better understand what types of gameplay and mechanics resonate with players. Through its comprehensive coverage of game titles, genres and platforms, this dataset provides detailed insights into how video games achieve global success, as well as a window into the ever-changing trends of gaming culture.

    How to use the dataset

    This dataset can be used to uncover hidden trends in Global Video Games Sales. To make the most of this data, it is important to understand the different columns and their respective values.

    The 'Rank' column identifies each game's ranking according to its global sales (highest to lowest). This can help you identify which games are most popular globally. The 'Game Title' column contains the name of each video game, which allows you to easily discern one entry from another. The 'Platform' column lists the type of platform on which each game was released, e.g., PlayStation 4 or Xbox One, so that you can make comparisons between platforms as well as specific games for each platform. The 'Year' column provides an additional way of making year-on-year comparisons and tracking changes over time in global video game sales.
    In addition, this dataset contains metadata such as genre ('Genre'), publisher ('Publisher'), and review score ('Review') that add context when considering a particular title's performance in terms of global sales rankings. For example, it might be more compelling to compare two similar genres than two disparate ones when analyzing how successful a select set of titles has been at generating revenue in comparison with others released globally within that timeline. Last but not least are the variables dedicated exclusively to geographic breakdowns: 'North America', 'Europe', 'Japan', 'Rest of World', and 'Global'. These allow us to see how certain regions contribute, individually or collectively, towards a given title's overall sales figures; by comparing these metrics regionally or collectively, inferences about consumer preferences and supplier priorities emerge.

    Overall this powerful dataset allows researchers and marketers alike a deep dive into market performance for those persistent questions about demand patterns across demographics around the world!
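
    A quick way to explore these columns is with pandas; the column names below follow the descriptions above and should be checked against the actual CSV header:

```python
import pandas as pd

df = pd.read_csv("Video Games Sales.csv")

# Global sales by genre and platform: which combinations sell best worldwide?
top = (
    df.groupby(["Genre", "Platform"])["Global"]
      .sum()
      .sort_values(ascending=False)
      .head(10)
)
print(top)

# Regional breakdown for one genre, to compare consumer preferences across regions.
regions = ["North America", "Europe", "Japan", "Rest of World"]
print(df[df["Genre"] == "Racing"][regions].sum())    # "Racing" is just an example value
```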

    Research Ideas

    • Analyzing the effects of genre and platform on a game's success - By comparing different genres and platforms, one can get a better understanding of what type of games have the highest sales in different regions across the globe. This could help developers decide which type of gaming content to create in order to maximize their profits.
    • Tracking changes in global video game trends over time - This dataset could be used to analyze how elements such as genre or platform affect success across years, giving developers an inside look into what kinds of games are being favored at any given moment across the world.
    • Identifying highly successful games and their key elements - Developers could look at this data to find common factors, such as publisher or platform, shared by successful titles, to uncover characteristics that lead to a high rate of return when creating video games or other forms of media entertainment.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and the original data source.

    License

    See the dataset description for more information.

    Columns

    File: Video Games Sales.csv

    | Column name | Description |
    |:------------|:------------|
    | Rank | The ranking of the game in terms of global sales. (Integer) |
    | Game Title | The title of the game. (String) |
    | Platform | The platform the game was released on. (String) |
    ...

  7. Wine Quality Data Set (Red & White Wine)

    • kaggle.com
    zip
    Updated Nov 3, 2021
    Cite
    ruthgn (2021). Wine Quality Data Set (Red & White Wine) [Dataset]. https://www.kaggle.com/datasets/ruthgn/wine-quality-data-set-red-white-wine
    Explore at:
    zip (100361 bytes)
    Dataset updated
    Nov 3, 2021
    Authors
    ruthgn
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data Set Information

    This data set contains records related to red and white variants of the Portuguese Vinho Verde wine. It contains information from 1599 red wine samples and 4898 white wine samples. Input variables in the data set consist of the type of wine (either red or white) and metrics from objective tests (e.g. acidity levels, pH values, ABV, etc.), while the target/output variable is a numerical score based on sensory data (the median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent). Due to privacy and logistic issues, there is no data about grape types, wine brand, or wine selling price.

    This data set is a combined version of the two separate files (distinct red and white wine data sets) originally shared in the UCI Machine Learning Repository.

    The following are some existing data sets on Kaggle from the same source (with notable differences from this data set):
    - Red Wine Quality (contains red wine data only)
    - Wine Quality (combination of red and white wine data but with some values randomly removed)
    - Wine Quality (red and white wine data not combined)

    Contents

    Input variables:

    1 - type of wine: type of wine (categorical: 'red', 'white')

    (continuous variables based on physicochemical tests)

    2 - fixed acidity: The acids that naturally occur in the grapes used to ferment the wine and carry over into the wine. They mostly consist of tartaric, malic, citric or succinic acid and do not evaporate easily. (g / dm^3)

    3 - volatile acidity: Acids that evaporate at low temperatures—mainly acetic acid which can lead to an unpleasant, vinegar-like taste at very high levels. (g / dm^3)

    4 - citric acid: Citric acid is used as an acid supplement which boosts the acidity of the wine. It's typically found in small quantities and can add 'freshness' and flavor to wines. (g / dm^3)

    5 - residual sugar: The amount of sugar remaining after fermentation stops. It's rare to find wines with less than 1 gram/liter. Wines with a residual sugar level greater than 45 grams/liter are considered sweet. On the other end of the spectrum, a wine that does not taste sweet is considered dry. (g / dm^3)

    6 - chlorides: The amount of chloride salts (sodium chloride) present in the wine. (g / dm^3)

    7 - free sulfur dioxide: The free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine. All else constant, the higher the free sulfur dioxide content, the stronger the preservative effect. (mg / dm^3)

    8 - total sulfur dioxide: The amount of free and bound forms of S02; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine. (mg / dm^3)

    9 - density: The density of the wine, which depends on the percent alcohol and sugar content; it's typically similar to, but slightly higher than, that of water (wine is 'thicker'). (g / cm^3)

    10 - pH: A measure of the acidity of wine; most wines are between 3-4 on the pH scale. The lower the pH, the more acidic the wine is; the higher the pH, the less acidic the wine. (The pH scale technically is a logarithmic scale that measures the concentration of free hydrogen ions floating around in your wine. Each point of the pH scale is a factor of 10. This means a wine with a pH of 3 is 10 times more acidic than a wine with a pH of 4)

    11 - sulphates: Amount of potassium sulphate as a wine additive which can contribute to sulfur dioxide gas (SO2) levels; it acts as an antimicrobial and antioxidant agent. (g / dm^3)

    12 - alcohol: How much alcohol is contained in a given volume of wine (ABV). Wine generally contains between 5 and 15% alcohol. (% by volume)

    Output variable:

    13 - quality: score between 0 (very bad) and 10 (very excellent) by wine experts
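
    The 12 input variables and the quality score lend themselves to a quick baseline model; a minimal sketch (the file name is assumed, and the column names should be checked against the actual CSV):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

df = pd.read_csv("wine-quality.csv")                              # hypothetical file name
X = pd.get_dummies(df.drop(columns="quality"), columns=["type"])  # one-hot encode red/white
y = df["quality"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```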

    Acknowledgements

    Source: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

    Data credit goes to UCI. Visit their website to access the original data set directly: https://archive.ics.uci.edu/ml/datasets/wine+quality

    Context

    So much about wine making remains elusive—taste is very subjective, making it extremely challenging to predict exactly how consumers will react to a certain bottle of wine. There is no doubt that winemakers, connoisseurs, and scientists have greatly contributed their expertise to ...

  8. ARCADE Dataset

    • kaggle.com
    zip
    Updated Jul 3, 2025
    Cite
    Nirmal Gaud (2025). ARCADE Dataset [Dataset]. https://www.kaggle.com/datasets/nirmalgaud/arcade-dataset/code
    Explore at:
    zip (452162028 bytes)
    Dataset updated
    Jul 3, 2025
    Authors
    Nirmal Gaud
    Description

    ARCADE: Automatic Region-based Coronary Artery Disease diagnostics using x-ray angiography imagEs Dataset

    ARCADE: Automatic Region-based Coronary Artery Disease diagnostics using x-ray angiography imagEs Dataset Phase 2 consists of two folders with 300 images in each of them, as well as annotations.

    ARCADE: Automatic Region-based Coronary Artery Disease diagnostics using x-ray angiography imagEs Dataset Phase 1 consists of two datasets of XCA images, one for each of the two tasks of the ARCADE challenge. The first task includes in total 1200 coronary vessel tree images, which are divided into train (1000) and validation (200) groups; the training images are accompanied by annotations depicting the division of the heart into 26 different regions based on the Syntax Score methodology [1]. Similarly, the second task includes a different set of 1200 images with the same train-val split, with annotated regions containing atherosclerotic plaques. This dataset, carefully annotated by medical experts, enables scientists to actively contribute towards the advancement of an automated risk assessment system for patients with CAD.

    The dataset structure is as follows: top-level directories "syntax" and "stenosis" contain files for the two dataset objectives, namely: i) vessel branch classification according to the SYNTAX methodology; and ii) stenosis detection. Inside both directories, there are 3 subsets of the dataset, such as "train", "val", and "test". Inside each of those folders, there are 2 lower-level directories - "images", and "annotations". Inside the "images" folder there are images in ".png" format, extracted from DICOM recordings. The "annotations" folders contain single ".JSON" files, which are named in correspondence to the objective, i.e. "train.JSON", "val.JSON", and "test.JSON".

    The structure of ".JSON" contains three top-level fields: "images", "categories", and "annotations". The "images" field contains the unique "id" of the image in the dataset, its "width" and "height" in pixels, and the "file_name" sub-field, which contains specific information about the image. The "categories" field contains a unique "id" from 1 to 26, and a "name", relating it to the SYNTAX descriptions. The "annotations" field contains a unique "id" of the annotation, "image_id" value, relating it to the specific image from the "images" field, and a "category_id" relating it to the specific category from the "categories" field. The "segmentation" sub-field contains coordinates of mask edge points in "XYXY" format. Bounding box coordinates are given in the "bbox" field in the "XYWH" format, where the first 2 values represent the x and y coordinates of the left-most and top-most points in the segmentation mask. The height and width of the bounding box are determined by the difference between the right-most and bottom-most points and the first two values. Finally, the "area" field provides the total area of the bounding box, calculated as the area of a rectangle.
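
    The JSON layout described above is COCO-style, so it can be read with the standard library; a minimal sketch (the path assumes the "syntax"/"train" directory structure described above and would need adjusting for the stenosis, val or test subsets):

```python
import json
from collections import defaultdict

# Field names follow the structure described above.
with open("syntax/train/annotations/train.JSON") as f:
    coco = json.load(f)

categories = {c["id"]: c["name"] for c in coco["categories"]}
file_names = {im["id"]: im["file_name"] for im in coco["images"]}

# Group annotations by image and print each bounding box with its SYNTAX segment name.
per_image = defaultdict(list)
for ann in coco["annotations"]:
    per_image[ann["image_id"]].append(ann)

image_id, anns = next(iter(per_image.items()))
print(file_names[image_id])
for ann in anns:
    x, y, w, h = ann["bbox"]                 # XYWH, as described above
    print(categories[ann["category_id"]], (x, y, w, h), "area:", ann["area"])
```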

    The corresponding Dataset Article will be provided later.

    [1] Syntax score segment definitions. https://syntaxscore.org/index.php/tutorial/definitions/14-appendix-i-segment-definitions

  9. VOYAGER 1 SAT LOW ENERGY CHARGED PARTICLE CALIB. BR 15MIN - Dataset - NASA...

    • data.nasa.gov
    Updated Mar 31, 2025
    Cite
    nasa.gov (2025). VOYAGER 1 SAT LOW ENERGY CHARGED PARTICLE CALIB. BR 15MIN - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/voyager-1-sat-low-energy-charged-particle-calib-br-15min-c4d46
    Explore at:
    Dataset updated
    Mar 31, 2025
    Dataset provided by
    NASA, http://nasa.gov/
    License

    U.S. Government Works, https://www.usa.gov/government-works
    License information was derived automatically

    Description

    THIS BROWSE DATA CONSISTS OF RESAMPLED DATA FROM THE LOW ENERGY CHARGED PARTICLE (LECP) EXPERIMENT ON VOYAGER 1 WHILE THE SPACECRAFT WAS IN THE VICINITY OF SATURN. THIS INSTRUMENT MEASURES THE INTENSITIES OF IN-SITU CHARGED PARTICLES (>26 KEV ELECTRONS AND >30 KEV IONS) WITH VARIOUS LEVELS OF DISCRIMINATION BASED ON ENERGY, MASS SPECIES, AND ANGULAR ARRIVAL DIRECTION. A SUBSET OF ALMOST 100 LECP CHANNELS ARE INCLUDED WITH THIS DATA SET. THE LECP DATA ARE GLOBALLY CALIBRATED TO THE EXTENT POSSIBLE (SEE BELOW) AND THEY ARE TIME AVERAGED TO ABOUT 15 MINUTE TIME INTERVALS WITH THE EXACT BEGINNING AND ENDING TIMES FOR THOSE INTERVALS MATCHING THE LECP INSTRUMENTAL CYCLE PERIODS (THE ANGULAR SCANNING PERIODS). THE LECP INSTRUMENT HAS A ROTATING HEAD FOR OBTAINING ANGULAR ANISOTROPY MEASUREMENTS OF THE MEDIUM ENERGY CHARGED PARTICLES THAT IT MEASURES. THE CYCLE TIME FOR THE ROTATION IS VARIABLE, BUT DURING ENCOUNTERS IT IS ALWAYS FASTER THAN 15 MINUTES. FOR THIS BROWSE DATA SET ONLY SCAN AVERAGE DATA IS GIVEN (NO ANGULAR INFORMATION). THE DATA IS IN THE FORM OF 'RATE' DATA WHICH HAS NOT BEEN CONVERTED TO THE USUAL PHYSICAL UNITS. THE REASON IS THAT SUCH A CONVERSION WOULD DEPEND ON UNCERTAIN DETERMINATIONS SUCH AS THE MASS SPECIES OF THE PARTICLES AND THE LEVEL OF BACKGROUND. BOTH MASS SPECIES AND BACKGROUND ARE GENERALLY DETERMINED FROM CONTEXT DURING THE STUDY OF PARTICULAR REGIONS.

    TO CONVERT 'RATE' TO 'INTENSITY' FOR A PARTICULAR CHANNEL ONE PERFORMS THE FOLLOWING TASKS: 1) DECIDE ON THE LEVEL OF BACKGROUND CONTAMINATION AND SUBTRACT THAT OFF THE GIVEN RATE LEVEL. BACKGROUND IS TO BE DETERMINED FROM CONTEXT AND FROM MAKING USE OF SECTOR 8 RATES (SECTOR 8 HAS A 2 mm AL SHIELD COVERING IT). 2) DIVIDE THE BACKGROUND CORRECTED RATE BY THE CHANNEL GEOMETRIC FACTOR AND BY THE ENERGY BANDPASS OF THE CHANNEL. THE GEOMETRIC FACTOR IS FOUND IN ENTRY 'channel_geometric_factor' AS ASSOCIATED WITH EACH CHANNEL 'channel_id'. TO DETERMINE THE ENERGY BANDPASS, ONE MUST JUDGE THE MASS SPECIES OF THE DETECTED PARTICLES (FOR IONS BUT NOT FOR ELECTRONS). THE ENERGY BAND PASSES ARE GIVEN IN ENTRIES 'minimum_instrument_parameter' and 'maximum_instrument_parameter' IN TABLE 'FPLECPENERGY', AND ARE GIVEN IN THE FORM 'ENERGY/NUCLEON'. FOR CHANNELS THAT BEGIN THEIR NAMES WITH THE DESIGNATIONS 'CH' THESE BANDPASSES CAN BE USED ON MASS SPECIES THAT ARE ACCEPTED INTO THAT CHANNEL (SEE ENTRIES 'minimum_instrument_parameter' and 'maximum_instrument_parameter' IN TABLE 'FPLECPCHANZ', WHICH GIVE THE MINIMUM AND MAXIMUM 'Z' VALUE ACCEPTED -- THESE ENTRIES ARE BLANK FOR ELECTRON CHANNELS). FOR OTHER CHANNELS THE GIVEN BANDPASS REFERS ONLY TO THE LOWEST 'Z' VALUE ACCEPTED. THE BANDPASSES FOR OTHER 'Z' VALUES ARE NOT ALL KNOWN, BUT SOME ARE GIVEN IN THE LITERATURE (E.G. KRIMIGIS ET AL., 1979). THE FINAL PRODUCT OF THESE INSTRUCTIONS WILL BE THE PARTICLE INTENSITY WITH THE UNITS: COUNTS/(CM**2.STR.SEC.KEV).

    SOME CHANNELS ARE SUBJECT TO SERIOUS CONTAMINATIONS, AND MANY OF THESE CONTAMINATIONS CANNOT BE REMOVED EXCEPT WITH A REGION-BY-REGION ANALYSIS, WHICH HAS NOT BEEN DONE FOR THIS DATA. THUS, TO USE THIS DATA IT IS ABSOLUTELY VITAL THAT THE CONTAMINATION TYPES ('contamination_id', 'contamination_desc') AND THE LEVELS OF CONTAMINATION ('data_quality_id' CORRESPONDING TO THE DEFINITIONS 'data_quality_desc') BE CAREFULLY EXAMINED FOR ALL REGIONS OF STUDY. A DEAD TIME CORRECTION PROCEDURE HAS BEEN APPLIED IN AN ATTEMPT TO CORRECT THE LINEAR EFFECTS OF DETECTOR OVERDRIVE (PULSE-PILEUP).
THIS PROCEDURE DOES NOT FIX SEVERELY OVERDRIVEN DETECTORS. A PROCEDURE IS AVAILABLE FOR CORRECTING VOYAGER 2 LECP ELECTRON CONTAMINATION OF LOW ENERGY ION CHANNELS, BUT ITS EFFECTIVENESS HAS BEEN EVALUATED ONLY FOR THE URANUS DATA SET. THUS, CORRECTIONS HAVE BEEN APPLIED ONLY TO THE URANUS DATA SET.
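
    The two conversion steps above boil down to intensity = (rate - background) / (geometric factor * energy bandpass). A minimal sketch with purely illustrative numbers; real background levels, geometric factors and bandpasses must be taken from the tables named above:

```python
def rate_to_intensity(rate: float,
                      background: float,
                      geometric_factor: float,
                      energy_min_kev: float,
                      energy_max_kev: float) -> float:
    """Convert an LECP channel 'rate' to intensity, following the two steps above.

    Background must be judged from context (e.g. the shielded sector 8 rates) and
    the bandpass depends on the assumed mass species, so all numbers passed in
    here are the analyst's choices. Result units: counts/(cm**2 * sr * s * keV).
    """
    bandpass_kev = energy_max_kev - energy_min_kev
    return (rate - background) / (geometric_factor * bandpass_kev)

# Purely illustrative values, not taken from the data set:
print(rate_to_intensity(rate=120.0, background=5.0,
                        geometric_factor=0.1, energy_min_kev=30.0, energy_max_kev=53.0))
```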

  10. Gender, Age, and Emotion Detection from Voice

    • kaggle.com
    zip
    Updated May 29, 2021
    Cite
    Rohit Zaman (2021). Gender, Age, and Emotion Detection from Voice [Dataset]. https://www.kaggle.com/rohitzaman/gender-age-and-emotion-detection-from-voice
    Explore at:
    zip (967820 bytes)
    Dataset updated
    May 29, 2021
    Authors
    Rohit Zaman
    Description

    Context

    Our target was to predict gender, age and emotion from audio. We found labeled audio datasets from Mozilla and RAVDESS. Using the R programming language, 20 statistical features were extracted, and after adding the labels these datasets were formed. Audio files were collected from "Mozilla Common Voice" and the "Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)".

    Content

    The datasets contain 20 feature columns and 1 column denoting the label. The 20 statistical features were extracted through frequency spectrum analysis using the R programming language. They are:
    1) meanfreq - The mean frequency (in kHz), a pitch measure that assesses the center of the distribution of power across frequencies.
    2) sd - The standard deviation of frequency, a statistical measure that describes the dataset's dispersion relative to its mean, calculated as the square root of the variance.
    3) median - The median frequency (in kHz), the middle number in the sorted (ascending or descending) list of frequencies.
    4) Q25 - The first quartile (in kHz), referred to as Q1, the median of the lower half of the data set. About 25 percent of the values lie below Q1 and about 75 percent lie above it.
    5) Q75 - The third quartile (in kHz), referred to as Q3, the central point between the median and the highest values.
    6) IQR - The interquartile range (in kHz), a measure of statistical dispersion equal to the difference between the 75th and 25th percentiles, or between the upper and lower quartiles.
    7) skew - The skewness, the degree of distortion from the normal distribution. It measures the lack of symmetry in the data distribution.
    8) kurt - The kurtosis, a statistical measure that determines how much the tails of the distribution differ from the tails of a normal distribution. It is effectively a measure of the outliers present in the distribution.
    9) sp.ent - The spectral entropy, a measure of signal irregularity that sums up the normalized signal's spectral power.
    10) sfm - The spectral flatness or tonality coefficient, also known as Wiener entropy, a measure used in digital signal processing to characterize an audio spectrum. Spectral flatness is usually measured in decibels and offers a way to quantify how tone-like, rather than noise-like, a sound is.
    11) mode - The mode frequency, the most frequently observed value in the data set.
    12) centroid - The spectral centroid, a metric used to describe a spectrum in digital signal processing. It indicates where the spectrum's center of mass is located.
    13) meanfun - The average of the fundamental frequency measured across the acoustic signal.
    14) minfun - The minimum fundamental frequency measured across the acoustic signal.
    15) maxfun - The maximum fundamental frequency measured across the acoustic signal.
    16) meandom - The average of the dominant frequency measured across the acoustic signal.
    17) mindom - The minimum of the dominant frequency measured across the acoustic signal.
    18) maxdom - The maximum of the dominant frequency measured across the acoustic signal.
    19) dfrange - The range of the dominant frequency measured across the acoustic signal.
    20) modindx - The modulation index, which calculates the degree of frequency modulation, expressed numerically as the ratio of the frequency deviation to the frequency of the modulating signal for a pure tone modulation.
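
    A few of these features can be reproduced from the power spectrum of an audio file. The sketch below is an illustrative Python re-implementation; the dataset itself was built in R, so exact values may differ:

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import welch

def spectral_features(path: str) -> dict:
    """Compute a handful of the 20 features listed above (frequencies in kHz)."""
    sr, audio = wavfile.read(path)
    if audio.ndim > 1:                       # mix stereo down to mono
        audio = audio.mean(axis=1)
    freqs, power = welch(audio.astype(float), fs=sr, nperseg=2048)
    freqs_khz = freqs / 1000.0
    p = power / power.sum()                  # normalise the spectrum to a distribution

    mean = float(np.sum(freqs_khz * p))
    sd = float(np.sqrt(np.sum(p * (freqs_khz - mean) ** 2)))
    cum = np.cumsum(p)
    q25, median, q75 = (float(freqs_khz[np.searchsorted(cum, q)]) for q in (0.25, 0.5, 0.75))
    return {"meanfreq": mean, "sd": sd, "median": median,
            "Q25": q25, "Q75": q75, "IQR": q75 - q25}

# Usage with a hypothetical file:
# print(spectral_features("sample.wav"))
```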

    Acknowledgements

    Gender and age audio data source: https://commonvoice.mozilla.org/en
    Emotion audio data source: https://smartlaboratory.org/ravdess/

  11. Replication data for Hyperedge prediction and the statistical mechanisms of...

    • dataverse.csuc.cat
    csv, text/markdown +1
    Updated Oct 9, 2025
    Cite
    Marta Sales-Pardo; Marta Sales-Pardo; Aleix Mariné-Tena; Aleix Mariné-Tena; Roger Guimerà; Roger Guimerà (2025). Replication data for Hyperedge prediction and the statistical mechanisms of higher-order and lower-order interactions in complex networks [Dataset]. http://doi.org/10.34810/data2103
    Explore at:
    csv(1604664), csv(1604846), csv(1605402), text/x-fixed-field(1794022), text/x-fixed-field(448653), csv(1595947), text/x-fixed-field(448343), csv(1597579), text/x-fixed-field(448744), text/x-fixed-field(1793915), csv(1604906), text/x-fixed-field(448546), text/markdown(8235), text/x-fixed-field(1794286), csv(1603253), csv(1595165), csv(1604953), csv(1603250), text/x-fixed-field(1793824), csv(1604515), text/x-fixed-field(1794225), csv(1604703), csv(1599575), text/x-fixed-field(448282), csv(1604321), csv(1604786), csv(1597114), csv(1603797), csv(1604592), csv(1605683), csv(1600955)
    Dataset updated
    Oct 9, 2025
    Dataset provided by
    CORA.Repositori de Dades de Recerca
    Authors
    Marta Sales-Pardo; Marta Sales-Pardo; Aleix Mariné-Tena; Aleix Mariné-Tena; Roger Guimerà; Roger Guimerà
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains all the input data and results necessary to reproduce the experiments described in the associated publication on hyperedge prediction and statistical mechanisms in complex networks.

    ### Folder structure

    #### data/DATA_FOLDS/

    This directory contains the data used for 5-fold cross-validation of the model. Each fold consists of a pair of files:
    - trainX.dat: the training data for fold X (where X ∈ [0–4])
    - testX.dat: the corresponding test data for the same fold

    Each training set represents 80% of the original dataset, and each test set represents the remaining 20%. The data files are in .dat format and contain encoded hyperedges used to fit and evaluate the generative model.

    #### data/REDUCED_DEFINITIVE_RESULTS/

    This directory includes the prediction results for each of the five folds (fold0 to fold4) for different values of the model parameter K, which determines the number of latent groups:
    - K2_foldX.csv → prediction results for fold X using K = 2
    - K3_foldX.csv → results using K = 3
    - ...
    - K5_foldX.csv → results using K = 5

    All result files are in .csv format and contain metrics and predicted scores generated by the model.

    #### README.md

    This file describes the usage of the dataset and details the experimental setup.

    ### Notes

    The dataset is provided in support of reproducibility and open research. The full version of this dataset is also structured in a public GitHub repository: https://github.com/AleixMT/TrigenicInteractionPredictor. All data is released under a Creative Commons Attribution 4.0 International License (CC BY 4.0).
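
    A minimal sketch for loading the folds and result files using the directory and file naming scheme described above (the .dat encoding is not specified here, so those files are read as raw lines):

```python
from pathlib import Path
import pandas as pd

base = Path("data")

# Prediction results: one CSV per (K, fold) combination, as described above.
results = {}
for k in range(2, 6):
    for fold in range(5):
        path = base / "REDUCED_DEFINITIVE_RESULTS" / f"K{k}_fold{fold}.csv"
        if path.exists():
            results[(k, fold)] = pd.read_csv(path)
print(f"loaded {len(results)} result files")

# Cross-validation folds: train/test pairs in .dat format, read here as raw lines.
fold0_train = (base / "DATA_FOLDS" / "train0.dat").read_text().splitlines()
fold0_test = (base / "DATA_FOLDS" / "test0.dat").read_text().splitlines()
print(len(fold0_train), "training hyperedges,", len(fold0_test), "test hyperedges")
```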

  12. Data set for: Identification of Sindhi cows that are susceptible or...

    • data.mendeley.com
    Updated Jul 17, 2019
    Cite
    Cecilia Miraballes (2019). Data set for: Identification of Sindhi cows that are susceptible or resistant to Haematobia irritans [Dataset]. http://doi.org/10.17632/pwsgz5hp6p.2
    Explore at:
    Dataset updated
    Jul 17, 2019
    Authors
    Cecilia Miraballes
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The objective was to identify horn fly-susceptible and horn fly-resistant animals in a Sindhi herd by two different methods. The number of horn flies on 25 adult cows from a Sindhi herd was counted every 14 days. As it was an open herd, the trial period was divided into three stages based on cow composition, with the same cows maintained within each period: 2011-2012 (36 biweekly observations); 2012-2013 (26 biweekly observations); and 2013-2014 (22 biweekly observations). Only ten cows were present in the herd throughout the entire period from 2011-2014 (84 biweekly observations). The variables evaluated were the number of horn flies on the cows, the sampling date and a binary variable for rainy or dry season. Descriptive statistics were calculated, including the median, the interquartile range, and the minimum and maximum number of horn flies, for each observation day.

    For the present analysis, fly-susceptible cows were identified as those for which the infestation of flies appeared in the upper quartile for more than 50% of the weeks and in the lower quartile for less than 20% of the weeks. In contrast, fly-resistant cows were defined as those for which the fly counts appeared in the lower quartile for more than 50% of the weeks and in the upper quartile for less than 20% of the weeks.

    To identify resistant and susceptible cows for the best linear unbiased prediction (BLUP) analysis, three repeated-measures linear mixed models (one for each period) were constructed with cow as a random-effect intercept. The response variable was the log10-transformed count of horn flies per cow, and the explanatory variables were the observation date and season. As the trial took place in a semiarid region with two well-established seasons, season was evaluated monthly as a binary outcome: rainy if rainfall was greater than or equal to 50 mm, and dry if it was less than 50 mm. The standardized residuals and the BLUPs of the random effects were obtained and assessed for normality, heteroscedasticity and outlying observations. Each cow's BLUPs were plotted against the average quantile rank values, determined as the difference between the number of weeks in the high-risk quartile group and the number of weeks in the low-risk quartile group, averaged by the total number of weeks in each of the observation periods. A linear model was fit for the values of the BLUPs against the average rank values, and the correlation between the two methods was tested using Spearman's correlation coefficient. The animal effect values (BLUPs) were evaluated by percentiles, with 0 representing the lowest counts (more resistant cows) and 10 representing the highest counts (more susceptible cows). These BLUPs represented only the effect of cow and not the effect of day, season or other unmeasured confounders.
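
    The modelling approach described above (a random intercept per cow on log10-transformed counts) can be sketched with statsmodels; the file and column names below are hypothetical:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file and column names (cow, date, season, flies); check the actual data.
df = pd.read_csv("horn_fly_counts.csv")
df["log_flies"] = np.log10(df["flies"] + 1)          # log10 transform (+1 to guard against zeros; an assumption)
df["date"] = pd.to_datetime(df["date"])
df["day"] = (df["date"] - df["date"].min()).dt.days  # observation date as a numeric covariate

# Repeated-measures linear mixed model with a random intercept per cow.
model = smf.mixedlm("log_flies ~ day + C(season)", data=df, groups=df["cow"])
result = model.fit()

# The per-cow random intercepts play the role of the BLUPs discussed above.
blups = pd.Series({cow: re.iloc[0] for cow, re in result.random_effects.items()})
print(blups.rank(pct=True).sort_values())  # low ranks ~ more resistant, high ranks ~ more susceptible
```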

  13. Data from: Least Bell's Vireo Habitat Suitability Model for California...

    • catalog.data.gov
    • data.usgs.gov
    Updated Nov 26, 2025
    Cite
    U.S. Geological Survey (2025). Least Bell's Vireo Habitat Suitability Model for California (2019) [Dataset]. https://catalog.data.gov/dataset/least-bells-vireo-habitat-suitability-model-for-california-2019
    Explore at:
    Dataset updated
    Nov 26, 2025
    Dataset provided by
    United States Geological Survey, http://www.usgs.gov/
    Description

    This habitat model was developed to identify suitable habitat for the federally-endangered least Bell’s vireo (Vireo bellii pusillus) across its current and historic range in California. The vireo disappeared from most of its range by the 1980s, remaining only in small populations in southern California. Habitat protection and management since the mid-1980s has led to an increase in southern California vireo populations with small numbers of birds recently expanding into the historic range. Predictions from this model will be used to focus surveys in the historic range to determine where vireos are recolonizing and to track the status and distribution of populations over time. We used the Partitioned Mahalanobis D2 modeling technique to construct alternative models with different combinations of environmental variables. We developed calibration models for the current range in southern California using vireo locations recorded from 1990 to present. We selected spatially non-redundant observations reflecting average, below average and above average rainfall conditions. For each rainfall condition, we selected three to four years of spatially non-redundant location data from the period 1990-2013. We used this dataset to randomly select 70 percent of the observations for a calibration dataset and used the remaining 30 percent of observations as a validation dataset. We used supplementary validation datasets with observations from 2016, 2017 and 2018 representing average, above average, and below average rainfall conditions, respectively. We cross-walked and merged detailed digital vegetation maps for southern California and utilized the Fire Resource Assessment Program 2015 Vegetation Map as a base map for the rest of California. We used the Klausmeyer et al. (2016) Groundwater Dependent Ecosystems map to capture riparian areas not mapped with other source layers. We selected riparian vegetation types used by vireos to develop a grid of riparian points spaced 150m apart and buffered with 500m of adjacent non-riparian habitats. We calculated environmental variables at each grid point in the center of a 150m x 150m cell for the grid of points in this modeling landscape. Variables reflect various aspects of topography, climate, and land use (percent riparian vegetation and urbanization at 150m, 500m and 1km scales). We developed several Normalized Difference Vegetation Index (NDVI) variables based on means, maximums and percentages of pixels with a minimum specified value at the 150m and 500m spatial scales. We developed alternative calibration models with different combinations of environmental variables reflecting hypotheses about least Bell’s vireo habitat relationships. Due to spatial unevenness in vireo location data, we divided southern California into ten sampling regions and randomly subsampled 70 locations from each region. We repeated this process 1,000 times using a total of 2,270 spatially precise and non-redundant vireo locations in the calibration dataset. We model-averaged the results from sampling iterations to create a calibration model with partitions for each set of variables. We compared among these calibration model-partitions using the randomly selected validation dataset of 972 observations and the 2016, 2017 and 2018 validation datasets of 610, 1,066, and 882 observations, respectively. 
We created a presence and pseudo-absence dataset for evaluating each model-partition’s performance with the combined 3,530 observations in the validation datasets and 3,566 pseudo-absence points randomly selected from a grid of points encompassing the vireo’s current range in southern California. For every model-partition, we calculated Habitat Similarity Index (HSI) predictions for presence and pseudo-absence points ranging from Very High (0.75-1.00); High (0.50–0.74); Low to Moderate (0-0.49). Suitable habitat is identified as grid cells with HSI equal to or greater than 0.5. We calculated Area Under the Curve (AUC) values from a Receiver Operating Curve (ROC) to determine how well models distinguish between the combined presence and pseudo-absence points. We selected a set of best performing calibration model-partitions based upon median HSI calibration and validation values and AUC results. We then used these models to predict suitable habitat for the riparian grid across California, including the current and historic range. We qualitatively evaluated how well the model-partitions predicted suitable habitat in the historic range across California for historic and recent vireo records in the California Natural Diversity Database. We also used e-Bird observations to qualitatively assess how well the model predicted habitat at recently observed vireo locations in the historic range. Several top-performing model-partitions for southern California did not predict suitable habitat in the historic range. These models included climate variables, elevation, and NDVI variables, which vary widely between the current and historic ranges. Model 30 Partition 1 is the best model-partition for predicting habitat in both the current and historic ranges across California. This model-partition has an AUC of 0.98 and median calibration and random validation HSI values of 0.70. Supplementary validation datasets for 2016, 2017 and 2018 have median HSI values of 0.66, 0.64, and 0.63, respectively. This model includes the following variables: median slope, percent flat land, and percent riparian vegetation at the 150m-scale and distance from water (m). We mapped HSI predictions from this model for each cell in the 150m-scale grid across the California riparian study area to create the habitat suitability map.
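    The evaluation steps above (HSI suitability classes, the 0.5 suitability threshold, and AUC computed from combined presence and pseudo-absence points) can be reproduced in outline with standard tools. The sketch below is not the authors' code: the HSI arrays are randomly generated stand-ins for the real presence and pseudo-absence predictions, and only the scoring logic is illustrated.

```
# Minimal sketch: classify HSI predictions and compute AUC for presence vs.
# pseudo-absence points. The HSI values are simulated placeholders.
from collections import Counter

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
hsi_presence = rng.beta(5, 2, size=3530)  # stand-in HSI at vireo presence points
hsi_absence = rng.beta(2, 5, size=3566)   # stand-in HSI at pseudo-absence points

scores = np.concatenate([hsi_presence, hsi_absence])
labels = np.concatenate([np.ones_like(hsi_presence), np.zeros_like(hsi_absence)])

def hsi_class(h: float) -> str:
    # Suitability classes used in the description above.
    if h >= 0.75:
        return "Very High"
    if h >= 0.50:
        return "High"
    return "Low to Moderate"

print(Counter(hsi_class(h) for h in scores))
print("fraction suitable (HSI >= 0.5):", float((scores >= 0.5).mean()))
print("AUC:", roc_auc_score(labels, scores))
```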

  14. Tropical Australia Sentinel 2 Satellite Composite Imagery - Low Tide - 30th...

    • data.gov.au
    html, png, wms
    Updated Sep 28, 2025
    Australian Ocean Data Network (2025). Tropical Australia Sentinel 2 Satellite Composite Imagery - Low Tide - 30th percentile true colour and near infrared false colour (NESP MaC 3.17, AIMS) [Dataset]. https://www.data.gov.au/data/dataset/tropical-australia-sentinel-2-satellite-composite-imagery-low-tide-30th-percentile-true-colour-
    Explore at:
    Available download formats: png, html, wms
    Dataset updated
    Sep 28, 2025
    Dataset authored and provided by
    Australian Ocean Data Network
    Area covered
    Australia
    Description

    This dataset contains cloud free, low tide composite satellite images for the tropical Australia region based on 10 m resolution Sentinel 2 imagery from 2018 – 2023. This image collection was created as part of the NESP MaC 3.17 project and is intended to allow mapping of the reef features in tropical Australia. This collection contains composite imagery for 200 Sentinel 2 tiles around the tropical Australian coast. This dataset uses two styles: 1. a true colour contrast and colour enhancement style (TrueColour) using the bands B2 (blue), B3 (green), and B4 (red) 2. a near infrared false colour style (Shallow) using the bands B5 (red edge), B8 (near infrared), and B12 (short wave infrared). These styles are useful for identifying shallow features along the coastline. The Shallow false colour styling is optimised for viewing the first 3 m of the water column, providing an indication of water depth. This is because the different far red and near infrared bands used in this styling have limited penetration of the water column. In clear waters the maximum penetrations of each of the bands is 3-5 m for B5, 0.5 - 1 m for B8 and < 0.05 m for B12. As a result, the image changes in colour with the depth of the water with the following colours indicating the following different depths: - White, brown, bright green, red, light blue: dry land - Grey brown: damp intertidal sediment - Turquoise: 0.05 - 0.5 m of water - Blue: 0.5 - 3 m of water - Black: Deeper than 3 m In very turbid areas the visible limit will be slightly reduced. Change log: Changes to this dataset and metadata will be noted here: 2024-07-24 - Add tiles for the Great Barrier Reef 2024-05-22 - Initial release for low-tide composites using 30th percentile (Git tag: "low_tide_composites_v1") Methods: The satellite image composites were created by combining multiple Sentinel 2 images using the Google Earth Engine. The core algorithm was: 1. For each Sentinel 2 tile filter the "COPERNICUS/S2_HARMONIZED" image collection by - tile ID - maximum cloud cover 0.1% - date between '2018-01-01' and '2023-12-31' - asset_size > 100000000 (remove small fragments of tiles) 2. Remove high sun-glint images (see "High sun-glint image detection" for more information). 3. Split images by "SENSING_ORBIT_NUMBER" (see "Using SENSING_ORBIT_NUMBER for a more balanced composite" for more information). 4. Iterate over all images in the split collections to predict the tide elevation for each image from the image timestamp (see "Tide prediction" for more information). 5. Remove images where tide elevation is above mean sea level to make sure no high tide images are included. 6. Select the 10 images with the lowest tide elevation. 7. Combine SENSING_ORBIT_NUMBER collections into one image collection. 8. Remove sun-glint (true colour only) and apply atmospheric correction on each image (see "Sun-glint removal and atmospheric correction" for more information). 9. Duplicate image collection to first create a composite image without cloud masking and using the 30th percentile of the images in the collection (i.e. for each pixel the 30th percentile value of all images is used). 10. Apply cloud masking to all images in the original image collection (see "Cloud Masking" for more information) and create a composite by using the 30th percentile of the images in the collection (i.e. for each pixel the 30th percentile value of all images is used). 11. Combine the two composite images (no cloud mask composite and cloud mask composite). 
This solves the problem of some coral cays and islands being misinterpreted as clouds and therefore creating holes in the composite image. These holes are "plugged" with the underlying composite without cloud masking. (Lawrey et al. 2022) 12. The final composite was exported as cloud optimized 8 bit GeoTIFF Note: The following tiles were generated with different settings as they did not have enough images to create a composite with the standard settings: - 51KWA: no high sun-glint filter - 54LXP: maximum cloud cover set to 1% - 54LXP: maximum cloud cover set to 1% - 54LYK: maximum cloud cover set to 2% - 54LYM: maximum cloud cover set to 5% - 54LYN: maximum cloud cover set to 1% - 54LYQ: maximum cloud cover set to 5% - 54LYP: maximum cloud cover set to 1% - 54LZL: maximum cloud cover set to 1% - 54LZM: maximum cloud cover set to 1% - 54LZN: maximum cloud cover set to 1% - 54LZQ: maximum cloud cover set to 5% - 54LZP: maximum cloud cover set to 1% - 55LBD: maximum cloud cover set to 2% - 55LBE: maximum cloud cover set to 1% - 55LCC: maximum cloud cover set to 5% - 55LCD: maximum cloud cover set to 1% High sun-glint image detection: Images with high sun-glint can lead to lower quality composite images. To determine high sun-glint images, a mask is created for all pixels above a high reflectance threshold for the near-infrared and short-wave infrared bands. Then the proportion of this is calculated and compared against a sun-glint threshold. If the image exceeds this threshold, it is filtered out of the image collection. As we are only interested in the sun-glint on water pixels, a water mask is created using NDWI before creating the sun-glint mask. Sun-glint removal and atmospheric correction: Sun-glint was removed from the images using the infrared B8 band to estimate the reflection off the water from the sun-glint. B8 penetrates water less than 0.5 m and so in water areas it only detects reflections off the surface of the water. The sun-glint detected by B8 correlates very highly with the sun-glint experienced by the visible channels (B2, B3 and B4) and so the sun-glint in these channels can be removed by subtracting B8 from these channels. Eric Lawrey developed this algorithm by fine tuning the value of the scaling between the B8 channel and each individual visible channel (B2, B3 and B4) so that the maximum level of sun-glint would be removed. This work was based on a representative set of images, trying to determine a set of values that represent a good compromise across different water surface conditions. This algorithm is an adjustment of the algorithm already used in Lawrey et al. 2022 Tide prediction: To determine the tide elevation in a specific satellite image, we used a tide prediction model to predict the tide elevation for the image timestamp. After investigating and comparing a number of models, it was decided to use the empirical ocean tide model EOT20 (Hart-Davis et al., 2021). The model data can be freely accessed at https://doi.org/10.17882/79489 and works with the Python library pyTMD (https://github.com/tsutterley/pyTMD). In our comparison we found this model was able to predict accurately the tide elevation across multiple points along the study coastline when compared to historic Bureau of Meteorolgy and AusTide data. To determine the tide elevation of the satellite images we manually created a point dataset where we placed a central point on the water for each Sentinel tile in the study area . 
We used these points as centroids in the ocean models and calculated the tide elevation from the image timestamp. Using "SENSING_ORBIT_NUMBER" for a more balanced composite: Some of the Sentinel 2 tiles are made up of different sections depending on the "SENSING_ORBIT_NUMBER". For example, a tile could have a small triangle on the left side and a bigger section on the right side. If we filter an image collection and use a subset to create a composite, we could end up with a high number of images for one section (e.g. the left side triangle) and only few images for the other section(s). This would result in a composite image with a balanced section and other sections with a very low input. To avoid this issue, the initial unfiltered image collection is divided into multiple image collections by using the image property "SENSING_ORBIT_NUMBER". The filtering and limiting (max number of images in collection) is then performed on each "SENSING_ORBIT_NUMBER" image collection and finally, they are combined back into one image collection to generate the final composite. Cloud Masking: Each image was processed to mask out clouds and their shadows before creating the composite image. The cloud masking uses the COPERNICUS/S2_CLOUD_PROBABILITY dataset developed by SentinelHub (Google, n.d.; Zupanc, 2017). The mask includes the cloud areas, plus a mask to remove cloud shadows. The cloud shadows were estimated by projecting the cloud mask in the direction opposite the angle to the sun. The shadow distance was estimated in two parts. A low cloud mask was created based on the assumption that small clouds have a small shadow distance. These were detected using a 35% cloud probability threshold. These were projected over 400 m, followed by a 150 m buffer to expand the final mask. A high cloud mask was created to cover longer shadows created by taller, larger clouds. These clouds were detected based on an 80% cloud probability threshold, followed by an erosion and dilation of 300 m to remove small clouds. These were then projected over a 1.5 km distance followed by a 300 m buffer. The parameters for the cloud masking (probability threshold, projection distance and buffer radius) were determined through trial and error on a small number of scenes. As such there are probably significant potential improvements that could be made to this algorithm. Erosion, dilation and buffer operations were performed at a lower image resolution than the native satellite image resolution to improve the computational speed. The resolution of these operations was adjusted so that they were performed with approximately a 4 pixel resolution during these operations. This made the cloud mask significantly more spatially coarse than the 10 m Sentinel imagery. This resolution was chosen as a trade-off between the coarseness of the mask verse the processing time for these operations. With 4-pixel filter resolutions these operations were still using over 90% of the total
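    The core of the compositing step, filtering one Sentinel 2 tile and taking the per-pixel 30th percentile, can be sketched with the Earth Engine Python API. This is an illustration only, not the project pipeline: it omits the tide filtering, SENSING_ORBIT_NUMBER splitting, sun-glint removal and cloud masking described above, and the tile ID is just an example.

```
# Sketch of a 30th-percentile Sentinel 2 composite for a single MGRS tile.
# Assumes the Earth Engine Python API is installed and authenticated.
import ee

ee.Initialize()

tile_id = "54LZP"  # example tile; the real workflow covers 200 tiles

collection = (
    ee.ImageCollection("COPERNICUS/S2_HARMONIZED")
    .filter(ee.Filter.eq("MGRS_TILE", tile_id))
    .filterDate("2018-01-01", "2023-12-31")
    .filter(ee.Filter.lt("CLOUDY_PIXEL_PERCENTAGE", 0.1))
)

# Per-pixel 30th percentile across the image stack (in the real workflow this
# is applied to the ten lowest-tide images only); bands get a "_p30" suffix.
composite = collection.reduce(ee.Reducer.percentile([30]))

# True-colour style uses B4 (red), B3 (green) and B2 (blue).
true_colour = composite.select(["B4_p30", "B3_p30", "B2_p30"])
print(true_colour.bandNames().getInfo())
```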

  15. Data from: A Statistical Approach for Identifying the Best Combination of...

    • datasetcatalog.nlm.nih.gov
    • acs.figshare.com
    • +1more
    Updated Dec 11, 2024
    Jha, Girish Kumar; Mishra, Dwijesh Chandra; Sakthivel, Kabilan; Khan, Yasin Jeshima; Lal, Shashi Bhushan; Madival, Sharanbasappa D; Vaidhyanathan, Ramasubramanian; Chaturvedi, Krishna Kumar; Srivastava, Sudhir (2024). A Statistical Approach for Identifying the Best Combination of Normalization and Imputation Methods for Label-Free Proteomics Expression Data [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001385078
    Explore at:
    Dataset updated
    Dec 11, 2024
    Authors
    Jha, Girish Kumar; Mishra, Dwijesh Chandra; Sakthivel, Kabilan; Khan, Yasin Jeshima; Lal, Shashi Bhushan; Madival, Sharanbasappa D; Vaidhyanathan, Ramasubramanian; Chaturvedi, Krishna Kumar; Srivastava, Sudhir
    Description

    Label-free proteomics expression data sets often exhibit data heterogeneity and missing values, necessitating the development of effective normalization and imputation methods. The selection of appropriate normalization and imputation methods is inherently data-specific, and choosing the optimal approach from the available options is critical for ensuring robust downstream analysis. This study aimed to identify the most suitable combination of these methods for quality control and accurate identification of differentially expressed proteins. In this study, we developed nine combinations by integrating three normalization methods, locally weighted linear regression (LOESS), variance stabilization normalization (VSN), and robust linear regression (RLR) with three imputation methods: k-nearest neighbors (k-NN), local least-squares (LLS), and singular value decomposition (SVD). We utilized statistical measures, including the pooled coefficient of variation (PCV), pooled estimate of variance (PEV), and pooled median absolute deviation (PMAD), to assess intragroup and intergroup variation. The combinations yielding the lowest values corresponding to each statistical measure were chosen as the data set’s suitable normalization and imputation methods. The performance of this approach was tested using two spiked-in standard label-free proteomics benchmark data sets. The identified combinations returned a low NRMSE and showed better performance in identifying spiked-in proteins. The developed approach can be accessed through the R package named ’lfproQC’ and a user-friendly Shiny web application (https://dabiniasri.shinyapps.io/lfproQC and http://omics.icar.gov.in/lfproQC), making it a valuable resource for researchers looking to apply this method to their data sets.
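    As a rough illustration of how a pooled coefficient of variation can rank normalization and imputation combinations, the sketch below computes a mean per-protein, per-group CV on a hypothetical expression matrix. It is not the lfproQC implementation; in practice each of the nine combinations would be scored this way (together with PEV and PMAD) and the lowest value chosen.

```
# Sketch of a pooled coefficient of variation (PCV) on a simulated
# label-free expression matrix with two groups of three replicates.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
group_labels = ["A", "A", "A", "B", "B", "B"]
data = pd.DataFrame(
    rng.lognormal(mean=10, sigma=0.3, size=(200, 6)),
    index=[f"P{i}" for i in range(200)],
    columns=group_labels,
)

def pooled_cv(df: pd.DataFrame, labels: list) -> float:
    """Mean of per-protein, per-group coefficients of variation."""
    cvs = []
    for g in set(labels):
        cols = [i for i, lab in enumerate(labels) if lab == g]
        sub = df.iloc[:, cols]
        cvs.append((sub.std(axis=1) / sub.mean(axis=1)).to_numpy())
    return float(np.nanmean(np.concatenate(cvs)))

# In the real workflow this score is computed for each normalization x
# imputation combination and the combination with the lowest value is kept.
print(f"pooled CV: {pooled_cv(data, group_labels):.3f}")
```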

  16. Online appendix and simulated data sets for assesment of Birth-Death...

    • data-staging.niaid.nih.gov
    • search.dataone.org
    • +2more
    zip
    Updated Sep 11, 2023
    Anna Zhukova (2023). Online appendix and simulated data sets for assesment of Birth-Death Exposed-Infectious (BDEI) phylodynamic model estimators [Dataset]. http://doi.org/10.5061/dryad.r7sqv9sgx
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 11, 2023
    Dataset provided by
    Institut Pasteur
    Authors
    Anna Zhukova
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    The birth-death exposed-infectious (BDEI) phylodynamic model describes the transmission of pathogens featuring an incubation period (when there is a delay between the moment of infection and becoming infectious, as for Ebola and SARS-CoV-2), and permits estimation of this period, along with other parameters, from time-scaled phylogenetic trees. We implemented a highly parallelizable estimator for the BDEI model in a maximum likelihood framework (PyBDEI) using a combination of numerical analysis methods for efficient equation resolution. This dataset contains the assessment of PyBDEI in comparison with a Bayesian implementation in BEAST2 (mtbd package) and a deep learning estimator, PhyloDeep: the parameter values estimated by the 3 tools. PyBDEI and the theoretical findings behind it are described in A. Zhukova, F. Hecht, Y. Maday, and O. Gascuel, "Fast and Accurate Maximum-Likelihood Estimation of Multi-Type Birth-Death Epidemiological Models from Phylogenetic Trees", Syst Biol 2023. This dataset also contains the online Appendix (Fig S1-S3 and Table S1).

    Methods

    Simulated data: We assessed the performance of our estimator on two data sets from Voznica et al. 2021 (accessible at doi.org/10.5281/zenodo.7358555):

    medium, a data set of 100 medium-sized trees (200 − 500 tips),

    large, a data set of 100 large trees (5 000 − 10 000 tips)

    The data were downloaded from github.com/evolbioinfo/phylodeep (also accessible at doi.org/10.5281/zenodo.7358555, under the GNU GPL v3 licence). To produce the medium trees, Voznica et al. generated 10 000 trees with 200 − 500 tips under the BDEI model, with the parameter values sampled uniformly at random within the following boundaries:

    incubation period 1/µ ∈ [0.2, 50]

    basic reproductive number R_0 = λ/ψ ∈ [1, 5]

    infectious period 1/ψ ∈ [1, 10]

    They then randomly selected 100 out of those 10 000 trees to evaluate them with the gold standard method, BEAST2. To generate the 100 large trees, the same parameter values as for the 100 medium ones were used, but the tree size varied between 5 000 and 10 000 tips.

    Large forest data set

    To evaluate PyBDEI performance on forests, we additionally generated two types of forests for the large data set.

    Type 1 forests (e.g. health policy change): The first type of forests was produced by cutting the oldest (i.e., closest to the root) 25% of each full tree, and keeping the forest of bottom-75% subtrees (in terms of time). We hence obtained 100 forests representing sub-epidemics that all started at the same time. They can be found in the large/forests folder.

    Type 2 forests (e.g. multiple introductions to a country): The second type of forests represented epidemics that started with multiple introductions happening at different times. To generate them we:

    1) took the parameter values Θi corresponding to each tree Treei in the large dataset (i ∈ {1, . . . , 100}),

    2) calculated the time Ti between the start of the tree Treei and the time of its last sampled tip, and

    3) kept (i) uniformly drawing a time Ti,j ∈ [0, Ti], and (ii) generating a (potentially hidden) tree Treei,j under parameters Θi till reaching the time Ti,j.

    Steps (3.i) and (3.ii) were repeated till the total number of sampled tips in the generated trees reached at least 5 000: tips(Treei,j) ⩾ 5 000. The resulting forest Fi included those of the trees Treei,j that contained at least one sampled tip (i.e., observed trees). These forests can be found in the large/subepidemics folder. As the BDEI model requires one of the parameters to be fixed in order to become asymptotically identifiable, ρ was fixed to its real value. Data preparation and parameter estimation pipelines are available at github.com/evolbioinfo/bdei. This dataset contains:

    a forest of Type 1 for each large tree: large/forests/forest.[0-99].nwk

    a forest of Type 2 for each large tree: large/subepidemics/subepidemic.[0-99].nwk

    the estimated and real parameter values for fixed ρ: medium/estimates.tab and large/estimates.tab (tab-separated tables)
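    A minimal way to use the estimates tables is to compare estimated and true parameter values per tool. The column names below are assumptions (the actual headers of medium/estimates.tab and large/estimates.tab may differ), so adjust them to the real files before use.

```
# Sketch: median relative error per parameter and per tool from estimates.tab.
# Column names such as "R0_real" or "R0_PyBDEI" are assumed, not verified.
import pandas as pd

est = pd.read_csv("large/estimates.tab", sep="\t")

for param in ["R0", "infectious_period", "incubation_period"]:
    real = est[f"{param}_real"]
    for tool in ["PyBDEI", "BEAST2", "PhyloDeep"]:
        rel_err = ((est[f"{param}_{tool}"] - real) / real).abs()
        print(f"{param:20s} {tool:10s} median relative error: {rel_err.median():.3f}")
```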

  17. NYC PLUTO Lagged Longitudinal Residential Data

    • kaggle.com
    zip
    Updated Mar 23, 2022
    Oliver Shetler (2022). NYC PLUTO Lagged Longitudinal Residential Data [Dataset]. https://www.kaggle.com/datasets/olivershetler/pluto
    Explore at:
    Available download formats: zip (202232306 bytes)
    Dataset updated
    Mar 23, 2022
    Authors
    Oliver Shetler
    License

    https://www.usa.gov/government-works/

    Area covered
    New York
    Description

    Background

    This data set was engineered for the purpose of modeling apartment rent prices. See my pluto-modeling repository for more information on how this data was used for modeling, and why the target variable was chosen as a proxy for rent prices. For more information on how the data were engineered from the PLUTO data set, see my pluto-database repository.

    If you have requests or suggestions for improving this data set, please reach out to me on LinkedIn. I'm always happy to hear from people who use my creations and I'm glad to help you get what you need.

    Variables

    Identifiers

    These variables are primarily used to identify records in the data set. With the exception of year, they are not recommended for use in the modeling process.

    NOTE: This data set only contains BBL-identified records from residential buildings. Other building types are excluded, such as commercial, industrial, and parking lots.

    • year
      • the year of the record
      • (year, BBL) uniquely identify records in this data set and can be used as the primary key if the CSV files are imported into a database
    • bbl
      • BBL stands for "Borough, Block, Lot"
      • the BBL is a unique numeric identifier for each lot in the NYC building dataset
      • individual buildings are not identified directly in this data-set, but most lots contain only one building, and those that contain more usually contain only a few buildings
    • block
      • a code that identifies a block; unique only within its borough
    • zipcode
      • postal code

    Building (Lot) Level Features

    These variables capture building features at the lot level of precision. In most cases, they are an adequate substitute for direct building-level data, which are not available.

    Location

    • xcoord
      • gives the x-coordinate of the building in New York and Long Island Projection units
    • ycoord
      • gives the y-coordinate of the building in New York and Long Island Projection units

    Age and Alteration

    • yearbuilt
      • the year the building was built
    • yearalter
      • the year the building was last altered
      • an alteration is defined as a major renovation, such as a gut renovation, a core structural change, etc.
      • this variable is equal to year built if a building has not been altered
    • age
      • the age of the building in years (equal to year - yearbuilt)
    • build_alter_gap
      • the difference between the year built and the most recent alteration
    • alterage
      • the age of the most recent alteration in years (equal to year - yearalter)
      • this variable is equal to age if a building has not been altered (the same caveat applies to the squared and cubed variants of this variable)
    • alterage_squared
      • the age of the most recent alteration in years squared
      • the square of alterage has been added to the data set for linear modeling purposes (see note below)
    • alterage_cubed
      • the age of the most recent alteration in years cubed
      • the cube of alterage has been added to the data set for linear modeling purposes (see note below)

    NOTE: Regression models can account for non-linear effects by including squared and/or cubed versions of continuous variables. The intuition behind including the squared and cubed alterage variants is that the deterioration of a building matters most when it is either new or old. In general, if the influence of a variable X is expected to be non-linear, we include the squared and cubed versions of X in the model, because d/dX (B_1*X + B_2*X^2 + B_3*X^3) = B_1 + 2*B_2*X + 3*B_3*X^2, so the marginal effect of X is allowed to vary with X.
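    A brief sketch of how these derived terms can be built and used in a linear model is shown below. The column names follow the variables documented above; the input file name and the target column are placeholders, since the target variable is described in the linked pluto-modeling repository.

```
# Sketch: derive the age/alteration terms and fit a baseline linear model.
# "pluto_residential.csv" and the "target" column are placeholders.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("pluto_residential.csv")

df["age"] = df["year"] - df["yearbuilt"]
df["alterage"] = df["year"] - df["yearalter"]
df["alterage_squared"] = df["alterage"] ** 2
df["alterage_cubed"] = df["alterage"] ** 3

features = ["age", "alterage", "alterage_squared", "alterage_cubed",
            "elevator", "garage", "waterfront"]
model = LinearRegression().fit(df[features], df["target"])

# With B1*X + B2*X^2 + B3*X^3 in the model, the marginal effect of alterage is
# B1 + 2*B2*X + 3*B3*X^2, so it can differ for new and old alterations.
print(dict(zip(features, model.coef_.round(4))))
```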

    Building Class Features

    • elevator
      • 1 if the building has an elevator, 0 otherwise
    • commercial
      • 1 if the residential building also has stores or offices on premises, 0 otherwise
    • garage
      • 1 if the building has a garage, 0 otherwise
    • storage
      • 1 if the building has a storage space, 0 otherwise
    • basement
      • 1 if the building has a basement, 0 otherwise
    • waterfront
      • 1 if the building is on the waterfront, 0 otherwise
    • frontage
      • 1 if the building has a frontage (abuts at least one street), 0 otherwise
    • block_assmeblage
      • 1 if the building is in a block assemblage, 0 otherwise
    • cooperative
      • 1 if the building is managed as cooperative, 0 otherwise
    • conv_loft_wh
      • 1 if the building is converted from a loft or warehouse, 0 otherwise

    Walk-up Building Features

    • tenament
      • 1 if the building was originally constructed as a tenement, 0 otherwise
    • garden
      • 1 if the building is a garden community, 0 otherwise
        • garden communities are low-sitting buildings with a wide footprint
        • these buildings often have a courtyard with a garden and a large number of residential units

    elevator building featu...

  18. Homestays data

    • kaggle.com
    zip
    Updated May 25, 2024
    Priyanshu shukla (2024). Homestays data [Dataset]. https://www.kaggle.com/datasets/priyanshu594/homestays-data
    Explore at:
    Available download formats: zip (44330689 bytes)
    Dataset updated
    May 25, 2024
    Authors
    Priyanshu shukla
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Objective: Build a robust predictive model to estimate the log_price of homestay listings based on a comprehensive analysis of their characteristics, amenities, and host information. First make sure that the entire dataset is clean and ready to be used.

    1. Feature Engineering: Enhance the dataset by creating actionable and insightful features (a pandas sketch of this step and of step 6 appears after this list). Calculate Host_Tenure by determining the number of years from host_since to the current date, providing a measure of host experience. Generate Amenities_Count by counting the items listed in the amenities array to quantify property offerings. Determine Days_Since_Last_Review by calculating the days between last_review and today to assess listing activity and relevance.

    2. Exploratory Data Analysis (EDA): Conduct a deep dive into the dataset to uncover underlying patterns and relationships. Analyze how pricing (log_price) correlates with both categorical (such as room_type and property_type) and numerical features (like accommodates and number_of_reviews). Utilize statistical tools and visualizations such as correlation matrices, histograms for distribution analysis, and scatter plots to explore relationships between variables.

    3. Geospatial Analysis: Investigate the geographical data to understand regional pricing trends. Plot listings on a map using latitude and longitude data to visually assess price distribution. Examine whether certain neighbourhoods or proximity to city centres influence pricing, providing a spatial perspective to the pricing strategy.

    4. Sentiment Analysis on Textual Data: Apply natural language processing techniques to the description texts to extract sentiment scores. Use sentiment analysis tools to determine whether positive or negative descriptions influence listing prices, incorporating these findings as a feature in the predictive model being trained.

    5. Amenities Analysis: Thoroughly parse and analyse the amenities provided in the listings. Identify which amenities are most associated with higher or lower prices by applying statistical tests to determine correlations, thereby informing both pricing strategy and model inputs.

    6. Categorical Data Encoding: Convert categorical data into a format suitable for machine learning analysis. Apply one-hot encoding to variables like room_type, city, and property_type, ensuring that the model can interpret these as distinct features without any ordinal implication.

    7. Model Development and Training: Design and train predictive models to estimate log_price. Begin with a simple linear regression to establish a baseline, then explore more complex models such as RandomForest and GradientBoosting to better capture non-linear relationships and interactions between features. Document (briefly, within the Jupyter notebook itself) the model-building process, specifying the choice of algorithms and the rationale.

    8. Model Optimization and Validation: Systematically optimize the models to achieve the best performance. Employ techniques like grid search to experiment with different hyperparameter settings. Validate model choices through techniques like k-fold cross-validation, ensuring the model generalizes well to unseen data.

    9. Feature Importance and Model Insights: Analyze the trained models to identify which features most significantly impact log_price. Utilize model-specific methods like feature importance scores for tree-based models and SHAP values for an in-depth understanding of feature contributions.

    10. Predictive Performance Assessment: Critically evaluate the performance of the final model on a reserved test set. Use metrics such as Root Mean Squared Error (RMSE) and R-squared to assess accuracy and goodness of fit. Provide a detailed analysis of the residuals to check for any patterns that might suggest model biases or misfit.
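    The following pandas sketch illustrates tasks 1 and 6 (feature engineering and one-hot encoding). The column names come from the task description; the CSV path is a placeholder, and the amenities column is assumed to be stored as a braced, comma-separated string, which may differ in the actual file.

```
# Sketch of tasks 1 and 6: engineered features and one-hot encoding.
# "homestays.csv" is a placeholder path.
import pandas as pd

df = pd.read_csv("homestays.csv", parse_dates=["host_since", "last_review"])
today = pd.Timestamp.today()

# Task 1: Host_Tenure, Amenities_Count, Days_Since_Last_Review
df["Host_Tenure"] = (today - df["host_since"]).dt.days / 365.25
df["Amenities_Count"] = (
    df["amenities"].fillna("").str.strip("{}")
    .apply(lambda s: 0 if not s else len(s.split(",")))
)
df["Days_Since_Last_Review"] = (today - df["last_review"]).dt.days

# Task 6: one-hot encode categorical variables without ordinal implication
df = pd.get_dummies(df, columns=["room_type", "city", "property_type"])

print(df[["Host_Tenure", "Amenities_Count", "Days_Since_Last_Review"]].describe())
```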

  19. Life Expectancy - Data from 2000 to 2015

    • kaggle.com
    zip
    Updated Apr 23, 2020
    Vignesh Coumarane (2020). Life Expectancy - Data from 2000 to 2015 [Dataset]. https://www.kaggle.com/vignesh1694/who-life-expectancy
    Explore at:
    Available download formats: zip (395083 bytes)
    Dataset updated
    Apr 23, 2020
    Authors
    Vignesh Coumarane
    Description

    Context Many past studies of the factors affecting life expectancy have considered demographic variables, income composition and mortality rates, but the effects of immunization and the Human Development Index were not taken into account. In addition, much of the earlier research applied multiple linear regression to a single year of data for all countries. This motivates addressing both gaps by formulating a regression model based on a mixed-effects model and multiple linear regression, using data from 2000 to 2015 for all countries. Important immunizations such as Hepatitis B, Polio and Diphtheria are also considered. In a nutshell, this study focuses on immunization, mortality, economic, social and other health-related factors. Since the observations in this dataset come from different countries, it becomes easier for a country to determine which predictor contributes to a lower life expectancy, and hence which areas should be prioritized to efficiently improve the life expectancy of its population.

    Content The project relies on the accuracy of the data. The Global Health Observatory (GHO) data repository under the World Health Organization (WHO) keeps track of health status, as well as many other related factors, for all countries, and the data sets are made available to the public for the purpose of health data analysis. The data set on life expectancy and health factors for 193 countries was collected from the WHO data repository, and the corresponding economic data from the United Nations website. Among all categories of health-related factors, only the critical, more representative factors were chosen. In the past 15 years there has been substantial development in the health sector, resulting in improved mortality rates, especially in developing nations, compared with the previous 30 years. Therefore, this project considers data from 2000-2015 for 193 countries. The individual data files were merged into a single data set. Initial visual inspection of the data showed some missing values; as the data sets came from WHO, no evident errors were found. Missing data were handled in R using the missmap function, which indicated that most of the missing values concerned Population, Hepatitis B and GDP. The missing data came from lesser-known countries such as Vanuatu, Tonga, Togo and Cabo Verde; finding complete data for these countries was difficult, so they were excluded from the final model data set. The final merged file (final dataset) consists of 22 columns and 2938 rows, i.e. 20 predicting variables. The predicting variables were then divided into several broad categories: immunization-related factors, mortality factors, economic factors and social factors.
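    The missing-data inspection described above (done in R with missmap) can be approximated in Python as follows; the file path and the Country column name are assumptions about the merged data set.

```
# Sketch: per-column and per-country missingness on the merged table.
# "life_expectancy.csv" and the "Country" column name are assumed.
import pandas as pd

df = pd.read_csv("life_expectancy.csv")  # expected: 2938 rows x 22 columns

# Share of missing values per column, highest first (Population, Hepatitis B
# and GDP are reported above as the most affected).
print(df.isna().mean().sort_values(ascending=False).head(10))

# Average missingness per country, mirroring the decision to exclude a few
# small countries whose data could not be completed.
country_missing = df.isna().groupby(df["Country"]).mean().mean(axis=1)
keep = country_missing[country_missing < 0.2].index
df = df[df["Country"].isin(keep)]
print(df["Country"].nunique(), "countries retained")
```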

    Acknowledgements The data was collected from WHO and United Nations website with the help of Deeksha Russell and Duan Wang.

  20. MultiOrg

    • kaggle.com
    Updated May 28, 2024
    Christina Bukas (2024). MultiOrg [Dataset]. http://doi.org/10.34740/kaggle/ds/5097172
    Explore at:
    Croissant (a format for machine-learning datasets; learn more about this at mlcommons.org/croissant)
    Dataset updated
    May 28, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Christina Bukas
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    We release a large lung organoid 2D microscopy image dataset, for multi-rater benchmarking of object detection methods and to study uncertainty estimation. The dataset comprises more than 400 images of an entire microscopy plate well, along with more than 60,000 annotated organoids, deriving from different biological experimental setups, where two different types of organoids grew under varying conditions. The organoids in the dataset were annotated by two expert annotators by fitting the organoids within bounding boxes.

    Most importantly, we introduce three unique label sets for our test set images, which derive from the two annotators at different time points, allowing for quantification of label noise.

    Join our MultiOrg challenge now to develop annotation-noise-aware models!!

    Images

    All images are in the TIFF file format. Image resolution is 1.29um in x and y.

    Labels

    The labels correspond to bounding boxes fitted around each organoid in the image.

    Train set

    All annotations in the train set are in JSON format. The end of the filename gives information on the annotator who created the labels (all annotations in the train set are from time point t0). For example, image_1_Annotator_A.json means that image_1 (in the same directory) was labelled by Annotator A.

    The annotation file comprises a dictionary whose keys are the bounding box ids; each entry contains the four corner points of the bounding box, i.e. p0 = (x1, y1), p1 = (x2, y1), p2 = (x2, y2) and p3 = (x1, y2). Note that x1 corresponds to the minimum row value, y1 to the minimum column value, x2 to the maximum row value and y2 to the maximum column value.
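    The sketch below reads one train-set annotation file and converts each box to (x_min, y_min, x_max, y_max). The exact on-disk layout of the corner points (a list of [x, y] pairs versus a nested mapping) is an assumption, so the helper handles both forms; the path is an example that follows the documented folder structure.

```
# Sketch: load one annotation JSON and derive min/max box coordinates.
import json

path = "train/Macros/Plate_1/image_0/image_0_Annotator_A.json"  # example path

def corner_points(value):
    # Corners may be stored as a list of [x, y] pairs or as a nested mapping
    # like {"0": {"0": x, "1": y}, ...}; handle both forms.
    if isinstance(value, dict):
        value = [value[k] for k in sorted(value, key=int)]
    return [(pt["0"], pt["1"]) if isinstance(pt, dict) else (pt[0], pt[1])
            for pt in value]

with open(path) as f:
    annotations = json.load(f)

boxes = {}
for box_id, corners in annotations.items():
    pts = corner_points(corners)
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    boxes[box_id] = (min(xs), min(ys), max(xs), max(ys))

print(len(boxes), "organoid boxes read from", path)
```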

    Test set

    This dataset consists of three label sets for the test set:

    • test0: Annotations at time point t0; for each image, they can belong to Annotator A or B. Images with IDs 1-22 were annotated by A and 23-55 by B.
    • test1_A: Annotations at time point t1, annotated by Annotator A
    • test1_B: Annotations at time point t1, annotated by Annotator B

    While the label set test0 is made directly available here, to indirectly access the labels for test1_A and test1_B one must join our MultiOrg competition and submit solutions to the leaderboard!

    Object detection benchmark

    To run our object detection benchmark with MultiOrg you will need to run the notebooks we provide:

    • create-benchmark-dataset
    • multiorg-detection-benchmark

    Provided Data structure

    The dataset is structured in the following way:
```
├─ train -> The train set, consisting of 356 images
│ ├── Macros -> The experimental setup (Macros or Normal)
│ ├─── Plate_1 -> Contains all images from this plate, 26 plates, or experiments, are available in total
│ ├──── image_0 -> Contains all files related to this image
│ ├───── image_0.tiff -> The image in TIFF format
│ ├───── image_0_Annotator_A.json -> The annotation in JSON format, with information on the annotator (A or B) in the file name
│ ├── Normal -> The experimental setup (Macros or Normal)
│ └─ test -> The test set, consisting of 55 images, with annotations provided only for label set test0
│ ├── Macros
│ ├─── Plate_4
│ ├──── image_0
│ ├───── image_0.tiff
│ ├───── image_0_t0_A.json -> The annotation, with information on the time point, here t0, and the annotator (A or B)
│ ├── Normal
```
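    To iterate over the dataset, images can be paired with their annotation files by walking the layout above. The sketch assumes the documented folder naming and locates the JSON by pattern, since the annotator (and, in the test set, time point) suffix varies per file.

```
# Sketch: pair every image_*.tiff with its annotation JSON in the train set.
from pathlib import Path

root = Path("train")
pairs = []
for image_dir in sorted(root.glob("*/Plate_*/image_*")):
    tiff = image_dir / f"{image_dir.name}.tiff"
    jsons = sorted(image_dir.glob(f"{image_dir.name}_*.json"))
    if tiff.exists() and jsons:
        pairs.append((tiff, jsons[0]))

print(len(pairs), "image/annotation pairs found")
```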

Dataset: A Systematic Literature Review on the topic of High-value datasets

Description of the data in this data set

Protocol_HVD_SLR provides the structure of the protocol. Spreadsheet #1 provides the filled protocol for relevant studies. Spreadsheet #2 provides the list of results after the search over three indexing databases, i.e. before filtering out irrelevant studies.

The information on each selected study was collected in four categories: (1) descriptive information, (2) approach- and research design- related information, (3) quality-related information, (4) HVD determination-related information

Descriptive information
1) Article number - a study number, corresponding to the study number assigned in an Excel worksheet
2) Complete reference - the complete source information to refer to the study
3) Year of publication - the year in which the study was published
4) Journal article / conference paper / book chapter - the type of the paper - {journal article, conference paper, book chapter}
5) DOI / Website - a link to the website where the study can be found
6) Number of citations - the number of citations of the article in Google Scholar, Scopus, Web of Science
7) Availability in OA - availability of an article in the Open Access
8) Keywords - keywords of the paper as indicated by the authors
9) Relevance for this study - what is the relevance level of the article for this study? {high / medium / low}

Approach- and research design-related information
10) Objective / RQ - the research objective / aim, established research questions
11) Research method (including unit of analysis) - the methods used to collect data, including the unit of analysis (country, organisation, specific unit that has been analysed, e.g., the number of use-cases, scope of the SLR etc.)
12) Contributions - the contributions of the study
13) Method - whether the study uses a qualitative, quantitative, or mixed methods approach?
14) Availability of the underlying research data - whether there is a reference to the publicly available underlying research data, e.g., transcriptions of interviews, collected data, or explanation why these data are not shared?
15) Period under investigation - period (or moment) in which the study was conducted
16) Use of theory / theoretical concepts / approaches - does the study mention any theory / theoretical concepts / approaches? If any theory is mentioned, how is theory used in the study?

Quality- and relevance-related information
17) Quality concerns - whether there are any quality concerns (e.g., limited information about the research methods used)?
18) Primary research object - is the HVD a primary research object in the study? (primary - the paper is focused around the HVD determination, secondary - mentioned but not studied (e.g., as part of discussion, future work etc.))

HVD determination-related information
19) HVD definition and type of value - how is the HVD defined in the article and / or any other equivalent term?
20) HVD indicators - what are the indicators to identify HVD? How were they identified? (components & relationships, "input -> output")
21) A framework for HVD determination - is there a framework presented for HVD identification? What components does it consist of and what are the relationships between these components? (detailed description)
22) Stakeholders and their roles - what stakeholders or actors does HVD determination involve? What are their roles?
23) Data - what data do HVD cover?
24) Level (if relevant) - what is the level of the HVD determination covered in the article? (e.g., city, regional, national, international)

Format of the file .xls, .csv (for the first spreadsheet only), .odt, .docx

Licenses or restrictions CC-BY

For more info, see README.txt
