Objective: Daily COVID-19 data reported by the World Health Organization (WHO) may provide the basis for ad hoc political decisions, including travel restrictions. Data reported by countries, however, are heterogeneous, and metrics to evaluate their quality are scarce. In this work, we analyzed COVID-19 case counts provided by WHO and developed tools to evaluate country-specific reporting behaviors.
Methods: In this retrospective cross-sectional study, COVID-19 data reported daily to WHO from 3rd January 2020 until 14th June 2021 were analyzed. We proposed the concepts of binary reporting rate and relative reporting behavior and performed descriptive analyses for all countries with these metrics. We developed a score to evaluate the consistency of incidence and binary reporting rates. Further, we performed spectral clustering of the binary reporting rate and relative reporting behavior to identify salient patterns in these metrics.
Results: Our final analysis included 222 countries and regions....
Data collection: COVID-19 data were downloaded from WHO. Using a public repository, we added the countries' full names to the WHO data set, merging both data sets on the two-letter abbreviation of each country. The provided COVID-19 data cover January 2020 until June 2021. We uploaded the final data set used for the analyses of this paper.
Data processing: We processed the data using a Jupyter Notebook with a Python kernel and publicly available external libraries. This upload contains the required Jupyter Notebook (reporting_behavior.ipynb) with all analyses and some additional work, a README, and the conda environment yml (env.yml).
Any text editor or spreadsheet program, such as Microsoft Excel or its free alternatives, can open the uploaded CSV file. Any web browser and some code editors (like the freely available Visual Studio Code) can show the uploaded Jupyter Notebook if the required Python environment is set up correctly.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains >800K CSV files behind the GitTables 1M corpus.
For more information about the GitTables corpus, visit:
- our website for GitTables, or
This dataset was created by Moses Moncy
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.
With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.
We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.
Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.
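The two protocols can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the smoothness measure (sum of squared second differences) is an assumption, since this description does not state how "smoothest" is computed.

```python
import random

def sample_prevalences(n_classes, rng):
    """Draw one label distribution uniformly from the probability simplex."""
    # Normalized Exp(1) draws follow Dirichlet(1, ..., 1), which is uniform on the simplex.
    draws = [rng.expovariate(1.0) for _ in range(n_classes)]
    total = sum(draws)
    return [d / total for d in draws]

def jaggedness(p):
    """Assumed smoothness measure: sum of squared second differences (smaller is smoother)."""
    return sum((p[i - 1] - 2 * p[i] + p[i + 1]) ** 2 for i in range(1, len(p) - 1))

def app_oq(samples, keep_fraction=0.2):
    """Keep only the smoothest fraction of APP samples."""
    n_keep = max(1, int(len(samples) * keep_fraction))
    return sorted(samples, key=jaggedness)[:n_keep]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
app_samples = [sample_prevalences(5, rng) for _ in range(1000)]  # APP
app_oq_samples = app_oq(app_samples)                             # APP-OQ
```

To replicate the published evaluation exactly, use the provided index files rather than drawing new samples.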
Usage
You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.
Preliminaries: You have to have a working Julia installation. We have used Julia v1.6.5 in our experiments.
Data Extraction: In your terminal, you can call either
make
(recommended), or
julia --project="." --eval "using Pkg; Pkg.instantiate()"
julia --project="." extract-oq.jl
Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.
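Since quantification targets the distribution of labels rather than individual predictions, a first sanity check on an extracted file is to compute its class prevalences. The sketch below assumes only the "class_label" column described above; the "feature_1" column and the inline example data are hypothetical stand-ins for a real extracted file.

```python
import csv
import io
from collections import Counter

def class_prevalences(csv_file):
    """Return the relative frequency of each value in the class_label column."""
    reader = csv.DictReader(csv_file)  # the first row is the header
    counts = Counter(row["class_label"] for row in reader)
    total = sum(counts.values())
    return {label: n / total for label, n in sorted(counts.items())}

# Hypothetical stand-in for one of the extracted CSV files.
example = io.StringIO("class_label,feature_1\n1,0.3\n2,0.9\n2,0.4\n3,0.1\n")
prevalences = class_prevalences(example)
```

For a file on disk, pass an object opened with `open(path, newline="")` instead of the `io.StringIO` example.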
Further Reading
Implementation of our experiments: https://github.com/mirkobunse/regularized-oq
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
A CSV dataset containing the number of references of each bibliographic entity identified by an OMID in the OpenCitations Index (https://opencitations.net/index). The dataset is based on the last release of the OpenCitations Index (https://opencitations.net/download), dated November 2023. The size of the zipped archive is 0.35 GB, while the size of the unzipped CSV file is 1.7 GB. The CSV dataset contains the reference count of 71,805,806 bibliographic entities. The first column (omid) lists the entities, while the second column (references) indicates the corresponding number of outgoing references.
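Because the unzipped file is 1.7 GB, it is usually better to stream it row by row than to load it into memory at once. The following sketch assumes only the two column names given above (omid, references); the function name and the OMID values in the inline example are made up for illustration.

```python
import csv
import io

def summarize_reference_counts(csv_file):
    """Stream the CSV once and collect simple summary statistics."""
    reader = csv.DictReader(csv_file)
    n_entities = 0
    total_refs = 0
    max_row = None  # (omid, count) of the entity with the most references
    for row in reader:
        count = int(row["references"])
        n_entities += 1
        total_refs += count
        if max_row is None or count > max_row[1]:
            max_row = (row["omid"], count)
    return {"entities": n_entities, "total": total_refs, "max": max_row}

# Hypothetical three-row stand-in for the real file.
sample = io.StringIO("omid,references\nomid:br/061,12\nomid:br/062,0\nomid:br/063,30\n")
summary = summarize_reference_counts(sample)
```

The same function works on the real file when given a handle from `open(path, newline="")`, since only one row is held in memory at a time.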
This data set includes gravity measurements for the Island of Hawai`i collected as the source data for "Deep magmatic structures of Hawaiian volcanoes, imaged by three-dimensional gravity models" (Kauahikaua, Hildenbrand, and Webring, 2000). Data for 3,611 observations are stored as a single table and disseminated in .CSV format. Each observation record includes values for field station ID, latitude and longitude (in both Old Hawaiian and WGS84 projections), elevation, and Observed Gravity value. See associated publication for reduction and interpretation of these data.
Summary: The cumulative number of COVID-19 positive Maryland residents who have been released from home isolation.
Description: The MD COVID-19 - Total Number Released from Isolation data layer is a collection of the statewide cumulative total of individuals who tested positive for COVID-19 that have been reported each day by each local health department via the ESSENCE system as having been released from home isolation. As "recovery" can mean different things as people experience COVID-19 disease to varying degrees of severity, MDH reports on individuals released from isolation. "Released from isolation" refers to those who have met criteria and are well enough to be released from home isolation. Some of these individuals may have been hospitalized at some point.
COVID-19 is a disease caused by a respiratory virus first identified in Wuhan, Hubei Province, China in December 2019. COVID-19 is a new virus that hasn't caused illness in humans before. Worldwide, COVID-19 has resulted in thousands of infections, causing illness and in some cases death. Cases have spread to countries throughout the world, with more cases reported daily. The Maryland Department of Health reports daily on COVID-19 cases by county.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Annotated Benchmark of Real-World Data for Approximate Functional Dependency Discovery
This collection consists of ten open access relations commonly used by the data management community. In addition to the relations themselves (please take note of the references to the original sources below), we added three lists in this collection that describe approximate functional dependencies found in the relations. These lists are the result of a manual annotation process performed by two independent individuals by consulting the respective schemas of the relations and identifying column combinations where one column implies another based on its semantics. As an example, in the claims.csv file, the AirportCode implies AirportName, as each code should be unique for a given airport.
The file ground_truth.csv is a comma-separated file containing the annotated approximate functional dependencies. The column table names the relation we refer to; lhs and rhs name two columns of that relation for which we found that, semantically, lhs implies rhs.
The files excluded_candidates.csv and included_candidates.csv list all column combinations that were excluded from or included in the manual annotation, respectively. We excluded a candidate if there was no tuple where both attributes had a value or if the g3_prime value was too small.
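To make the notion of an approximate functional dependency concrete, the sketch below computes the commonly used g3 error of a candidate lhs → rhs: the smallest fraction of tuples that must be removed for the dependency to hold exactly. Note the assumptions: this is the standard g3 measure, not necessarily the g3_prime variant mentioned above, whose definition is not given here; the function name and the example rows are made up, and only the column names AirportCode and AirportName come from the description.

```python
from collections import Counter, defaultdict

def g3_error(rows, lhs, rhs):
    """Fraction of tuples to remove so that lhs -> rhs holds exactly (standard g3)."""
    groups = defaultdict(Counter)  # lhs value -> counts of co-occurring rhs values
    n = 0
    for row in rows:
        left, right = row.get(lhs), row.get(rhs)
        if left in (None, "") or right in (None, ""):
            continue  # skip tuples where either attribute has no value
        groups[left][right] += 1
        n += 1
    if n == 0:
        return None  # no tuple has both values, so the candidate cannot be assessed
    # For each lhs value, keep only the tuples carrying its most frequent rhs value.
    kept = sum(max(counter.values()) for counter in groups.values())
    return (n - kept) / n

# Made-up rows for illustration only.
rows = [
    {"AirportCode": "JFK", "AirportName": "John F. Kennedy International"},
    {"AirportCode": "JFK", "AirportName": "John F. Kennedy International"},
    {"AirportCode": "JFK", "AirportName": "JFK Intl"},
    {"AirportCode": "LAX", "AirportName": "Los Angeles International"},
]
error = g3_error(rows, "AirportCode", "AirportName")
```

An error of 0 means the dependency holds exactly; small positive values indicate a dependency that holds except for a few inconsistent tuples, such as the alternative spelling in the third row.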
Dataset References
https://creativecommons.org/publicdomain/zero/1.0/
Attrition analysis: Identify factors correlated with attrition like department, role, salary, etc. Segment high-risk employees. Predict future attrition.
Performance management: Analyze the relationship between metrics like ratings and salary increments. Recommend performance improvement programs.
Workforce planning: Forecast staffing needs based on historical hiring/turnover trends. Determine optimal recruitment strategies.
Compensation analysis: Benchmark salaries against performance and experience. Identify pay inequities. Inform compensation policies.
Diversity monitoring: Assess diversity metrics like gender ratio across roles and departments. Identify underrepresented groups.
Succession planning: Identify high-potential candidates and critical roles. Predict internal promotions/replacements in advance.
Given its longitudinal employee data and multiple variables, this dataset provides rich opportunities for exploration, predictive modeling, and actionable insights. With a large sample size, it can uncover subtle patterns. Cleaning the data and joining it with other contextual data sources can yield even deeper insights. This makes it a valuable starting point for many organizational studies and evidence-based decision-making.
This dataset contains information about different attributes of employees from a company. It includes 1000 employee records and 12 feature columns.
satisfaction_level: Employee satisfaction score (1-5 scale)
last_evaluation: Score on last evaluation (1-5 scale)
number_project: Number of projects employee worked on
average_monthly_hours: Average hours worked in a month
time_spend_company: Number of years spent with the company
work_accident: If an employee had a workplace accident (yes/no)
left: If an employee has left the company (yes/no)
promotion_last_5years: Number of promotions in last 5 years
Department: Department of the employee
Salary: Annual salary of employee
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Drug consumption database with the original values of the attributes. DescriptionDB.pdf contains a detailed description of the database.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The datasets were used to validate and test the data pipeline deployment following the RADON approach. The dataset consists of a CSV file that contains around 32,000 tweets. This single CSV file was split into 100 CSV files, each containing 320 tweets. Those 100 CSV files are used to validate and test (performance/load testing) the data pipeline components.
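The split of one CSV file into fixed-size parts can be sketched as below. This is an illustration under assumptions, not the tooling actually used: it assumes the source file has a header row that should be repeated in every part, and the column names and generated rows are hypothetical.

```python
import csv
import io

def split_rows(csv_file, rows_per_chunk=320):
    """Yield lists of rows (header first) with at most rows_per_chunk data rows each."""
    reader = csv.reader(csv_file)
    header = next(reader)  # assumed: the first row is a header
    chunk = []
    for row in reader:
        chunk.append(row)
        if len(chunk) == rows_per_chunk:
            yield [header] + chunk
            chunk = []
    if chunk:  # emit the remaining rows if the total is not a multiple of the chunk size
        yield [header] + chunk

# Hypothetical stand-in for the single source CSV with 32,000 rows.
source = io.StringIO("id,text\n" + "".join(f"{i},tweet {i}\n" for i in range(32000)))
chunks = list(split_rows(source))
```

Each chunk can then be written to its own file with `csv.writer(...).writerows(chunk)`.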
Summary: The cumulative number of confirmed COVID-19-related deaths among Maryland residents by race and ethnicity: African American; White; Hispanic; Asian; Other; Unknown.
Description: The MD COVID-19 - Confirmed Deaths by Race and Ethnicity Distribution data layer is a collection of the statewide confirmed and probable COVID-19 related deaths that have been reported each day by the Vital Statistics Administration by categories of race and ethnicity. A death is classified as confirmed if the person had a laboratory-confirmed positive COVID-19 test result. Some data on deaths may be unavailable due to the time lag between the death, typically reported by a hospital or other facility, and the submission of the complete death certificate. Probable deaths are available from the MD COVID-19 - Probable Deaths by Race and Ethnicity Distribution data layer.
COVID-19 is a disease caused by a respiratory virus first identified in Wuhan, Hubei Province, China in December 2019. COVID-19 is a new virus that hasn't caused illness in humans before. Worldwide, COVID-19 has resulted in thousands of infections, causing illness and in some cases death. Cases have spread to countries throughout the world, with more cases reported daily. The Maryland Department of Health reports daily on COVID-19 cases by county.
https://crawlfeeds.com/privacy_policy
Access our Trustpilot Reviews Data in CSV Format, offering a comprehensive collection of customer reviews from Trustpilot.
This dataset includes detailed reviews, ratings, and feedback across various industries and businesses. Available in a convenient CSV format, it is ideal for market research, sentiment analysis, and competitive benchmarking.
Leverage this data to gain insights into customer satisfaction, identify trends, and enhance your business strategies. Whether you're analyzing consumer sentiment or conducting competitive analysis, this dataset provides valuable information to support your needs.
A CSV file containing the tidal frequencies used for statistical analyses in the paper "Estimating Freshwater Flows From Tidally-Affected Hydrographic Data" by Dan Pagendam and Don Percival. The metadata and files (if any) are available to the public.
https://crawlfeeds.com/privacy_policy
This curated dataset contains only products from CultBeauty.com that include detailed ingredient information, ideal for brands, formulators, analysts, and researchers seeking transparency in cosmetics and skincare data.
It focuses on ingredient-rich listings, allowing deep analysis of formulation trends, compliance mapping, and clean beauty initiatives. Whether you're building an internal database or powering an AI model, this dataset offers a clean, structured foundation for insight.
Product Name
Brand
Full Ingredient List
Category
Product URL
Price (if available)
Description
Image links
Timestamps
Ingredient analysis for clean beauty scoring
Competitor formulation comparison
Cosmetic safety mapping (e.g., for allergen research)
Building training sets for AI/ML models in skincare
Trend monitoring across skincare and cosmetic products
Monthly or on demand
The Facility Registry System (FRS) identifies facilities, sites, or places subject to environmental regulation or of environmental interest to EPA programs or delegated states. Using vigorous verification and data management procedures, FRS integrates facility data from program national systems, state master facility records, tribal partners, and other federal agencies and provides the Agency with a centrally managed, single source of comprehensive and authoritative information on facilities.
This dataset was created by SHAMANTH
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Main CSV file extracted from the zip file download of the World Bank gender statistics file. Copy of data as of 25th September 2019.
Summary: The cumulative number of probable COVID-19 deaths among Maryland residents by age: 0-9; 10-19; 20-29; 30-39; 40-49; 50-59; 60-69; 70-79; 80+; Unknown.
Description: The MD COVID-19 - Probable Deaths by Age Distribution data layer is a collection of the statewide confirmed and probable COVID-19 related deaths that have been reported each day by the Vital Statistics Administration by designated age ranges. A death is classified as probable if the person's death certificate notes COVID-19 to be a probable, suspect or presumed cause or condition. Probable deaths have not yet been confirmed by a laboratory test. Some data on deaths may be unavailable due to the time lag between the death, typically reported by a hospital or other facility, and the submission of the complete death certificate. Confirmed deaths are available from the MD COVID-19 - Confirmed Deaths by Age Distribution data layer.
COVID-19 is a disease caused by a respiratory virus first identified in Wuhan, Hubei Province, China in December 2019. COVID-19 is a new virus that hasn't caused illness in humans before. Worldwide, COVID-19 has resulted in thousands of infections, causing illness and in some cases death. Cases have spread to countries throughout the world, with more cases reported daily. The Maryland Department of Health reports daily on COVID-19 cases by county.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This CSV dataset (files numbered 1–8) documents the construction of the regression models using machine learning methods; the files contain the data plotted in Figs. 2–7.
The CSV file 1.LSM_R^2 (Fig. 2) shows the relationship between estimated values and actual values when the least-squares method was used for model construction. In the CSV file 2.PCR_R^2 (Fig. 3), the number of principal components was varied from 1 to 5 during the construction of a model using principal component regression. The CSV file 3.SVR_R^2 (Fig. 4) contains the result of model construction using support vector regression; the hyperparameters were chosen by evaluating all combinations of the listed candidates and selecting the combination with the maximum R^2 value.
When a deep neural network was applied to the construction of a regression model, NNeur., NH.L. and NL.T. were varied. The CSV file 4.DNN_HL (Fig. 5a) shows the changes in the relationship between estimated values and actual values at each NH.L. Similarly, the CSV files 5.DNN_ Neur (Fig. 5b) and 6.DNN_LT (Fig. 5c) show the changes in this relationship when NNeur. or NL.T. was varied. The CSV file 7.DNN_R^2 (Fig. 6) contains the result obtained with the optimal NNeur., NH.L. and NL.T. In the CSV file 8.R^2 (Fig. 7), the validity of each machine learning method is compared by showing the optimal results for each method.
Experimental conditions
Supply volume of the raw material: 25–125 mL
Addition rate of TiO2: 5.0–15.0 wt%
Operation time: 1–15 min
Rotation speed: 2,200–5,700 min^-1
Temperature: 295–319 K
Nomenclature
NNeur.: the number of neurons
NH.L.: the number of hidden layers
NL.T.: the number of learning times