50 datasets found
  1. Empathy dataset

    • data.niaid.nih.gov
    Updated Dec 18, 2024
    Cite
    Mathematical Research Data Initiative (2024). Empathy dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7683906
    Dataset authored and provided by
    Mathematical Research Data Initiative
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The database for this study (Briganti et al. 2018; the same database was used for the Braun study analysis) was composed of 1973 French-speaking students in several universities or schools of higher education in the following fields: engineering (31%), medicine (18%), nursing school (16%), economic sciences (15%), physiotherapy (4%), psychology (11%), law school (4%) and dietetics (1%). The subjects were 17 to 25 years old (M = 19.6 years, SD = 1.6 years); 57% were female and 43% were male. Even though the full dataset was composed of 1973 participants, only 1270 answered the full questionnaire: missing data are handled using pairwise complete observations in estimating a Gaussian Graphical Model, meaning that all available information from every subject is used.

    The feature set is composed of 28 items meant to assess the four following components: fantasy, perspective taking, empathic concern and personal distress. In the questionnaire, the items are mixed; reversed items (items 3, 4, 7, 12, 13, 14, 15, 18, 19) are present. Items are scored from 0 to 4, where “0” means “Doesn’t describe me very well” and “4” means “Describes me very well”; reverse-scoring is calculated afterwards. The questionnaires were anonymized. The reanalysis of the database in this retrospective study was approved by the ethical committee of the Erasmus Hospital.
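
    A minimal Python sketch of the scoring and estimation steps described above (reverse-keying the listed items, then estimating a Gaussian Graphical Model from pairwise-complete correlations). The file name and item column names are illustrative assumptions, not part of the dataset.

    ```python
    # Sketch: reverse-score the items and estimate an unregularized GGM
    # from pairwise-complete Pearson correlations.
    import numpy as np
    import pandas as pd

    df = pd.read_csv("empathy.csv")                      # hypothetical file: 1973 rows x 28 items, 0-4, may contain NaN
    reversed_items = [3, 4, 7, 12, 13, 14, 15, 18, 19]   # reverse-keyed items listed in the description
    for i in reversed_items:
        df[f"item{i}"] = 4 - df[f"item{i}"]              # flip the 0-4 scale

    # Pairwise-complete correlations (pandas drops NaNs pair by pair)
    corr = df.corr(method="pearson").to_numpy()

    # Partial correlations (GGM edge weights) from the precision matrix
    prec = np.linalg.inv(corr)
    d = np.sqrt(np.diag(prec))
    partial_corr = -prec / np.outer(d, d)
    np.fill_diagonal(partial_corr, 0.0)
    print(partial_corr.shape)                            # (28, 28) network of conditional associations
    ```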

    Size: 1973 × 28 (participants × items)

    Number of features: 28

    Ground truth: No

    Type of Graph: Mixed graph

    The following gives the description of the variables:

    | Feature | FeatureLabel | Domain | Item meaning (from Davis 1980) |
    |:--------|:-------------|:-------|:-------------------------------|
    | 001 | 1FS | Green | I daydream and fantasize, with some regularity, about things that might happen to me. |
    | 002 | 2EC | Purple | I often have tender, concerned feelings for people less fortunate than me. |
    | 003 | 3PT_R | Yellow | I sometimes find it difficult to see things from the “other guy’s” point of view. |
    | 004 | 4EC_R | Purple | Sometimes I don’t feel very sorry for other people when they are having problems. |
    | 005 | 5FS | Green | I really get involved with the feelings of the characters in a novel. |
    | 006 | 6PD | Red | In emergency situations, I feel apprehensive and ill-at-ease. |
    | 007 | 7FS_R | Green | I am usually objective when I watch a movie or play, and I don’t often get completely caught up in it. (Reversed) |
    | 008 | 8PT | Yellow | I try to look at everybody’s side of a disagreement before I make a decision. |
    | 009 | 9EC | Purple | When I see someone being taken advantage of, I feel kind of protective towards them. |
    | 010 | 10PD | Red | I sometimes feel helpless when I am in the middle of a very emotional situation. |
    | 011 | 11PT | Yellow | I sometimes try to understand my friends better by imagining how things look from their perspective. |
    | 012 | 12FS_R | Green | Becoming extremely involved in a good book or movie is somewhat rare for me. (Reversed) |
    | 013 | 13PD_R | Red | When I see someone get hurt, I tend to remain calm. (Reversed) |
    | 014 | 14EC_R | Purple | Other people’s misfortunes do not usually disturb me a great deal. (Reversed) |
    | 015 | 15PT_R | Yellow | If I’m sure I’m right about something, I don’t waste much time listening to other people’s arguments. (Reversed) |
    | 016 | 16FS | Green | After seeing a play or movie, I have felt as though I were one of the characters. |
    | 017 | 17PD | Red | Being in a tense emotional situation scares me. |
    | 018 | 18EC_R | Purple | When I see someone being treated unfairly, I sometimes don’t feel very much pity for them. (Reversed) |
    | 019 | 19PD_R | Red | I am usually pretty effective in dealing with emergencies. (Reversed) |
    | 020 | 20FS | Green | I am often quite touched by things that I see happen. |
    | 021 | 21PT | Yellow | I believe that there are two sides to every question and try to look at them both. |
    | 022 | 22EC | Purple | I would describe myself as a pretty soft-hearted person. |
    | 023 | 23FS | Green | When I watch a good movie, I can very easily put myself in the place of a leading character. |
    | 024 | 24PD | Red | I tend to lose control during emergencies. |
    | 025 | 25PT | Yellow | When I’m upset at someone, I usually try to “put myself in his shoes” for a while. |
    | 026 | 26FS | Green | When I am reading an interesting story or novel, I imagine how I would feel if the events in the story were happening to me. |
    | 027 | 27PD | Red | When I see someone who badly needs help in an emergency, I go to pieces. |
    | 028 | 28PT | Yellow | Before criticizing somebody, I try to imagine how I would feel if I were in their place. |

    More information about the dataset is contained in the empathy_description.html file.

  2. ERA5 post-processed daily statistics on single levels from 1940 to present

    • cds.climate.copernicus.eu
    grib
    Updated Oct 2, 2025
    + more versions
    Cite
    ECMWF (2025). ERA5 post-processed daily statistics on single levels from 1940 to present [Dataset]. http://doi.org/10.24381/cds.4991cf48
    Dataset provided by
    European Centre for Medium-Range Weather Forecasts (http://ecmwf.int/)
    Authors
    ECMWF
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 1940 - Sep 26, 2025
    Description

    ERA5 is the fifth generation ECMWF reanalysis for the global climate and weather for the past 8 decades. Data is available from 1940 onwards. ERA5 replaces the ERA-Interim reanalysis. Reanalysis combines model data with observations from across the world into a globally complete and consistent dataset using the laws of physics. This principle, called data assimilation, is based on the method used by numerical weather prediction centres, where every so many hours (12 hours at ECMWF) a previous forecast is combined with newly available observations in an optimal way to produce a new best estimate of the state of the atmosphere, called analysis, from which an updated, improved forecast is issued. Reanalysis works in the same way, but at reduced resolution to allow for the provision of a dataset spanning back several decades. Reanalysis does not have the constraint of issuing timely forecasts, so there is more time to collect observations, and when going further back in time, to allow for the ingestion of improved versions of the original observations, which all benefit the quality of the reanalysis product. This catalogue entry provides post-processed ERA5 hourly single-level data aggregated to daily time steps. In addition to the data selection options found on the hourly page, the following options can be selected for the daily statistic calculation:

    • The daily aggregation statistic (daily mean, daily max, daily min, daily sum*)

    • The sub-daily frequency sampling of the original data (1 hour, 3 hours, 6 hours)

    • The option to shift to any local time zone relative to UTC (no shift means the statistic is computed from UTC+00:00)

    *The daily sum is only available for the accumulated variables (see ERA5 documentation for more details). Users should be aware that the daily aggregation is calculated during the retrieval process and is not part of a permanently archived dataset. For more details on how the daily statistics are calculated, including demonstrative code, please see the documentation. For more details on the hourly data used to calculate the daily statistics, please refer to the ERA5 hourly single-level data catalogue entry and the documentation found therein.
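
    As a rough illustration of how such a request could look with the CDS API client (cdsapi), here is a hedged sketch. The dataset identifier and request keys are assumptions based on the options listed above; check the catalogue entry and its demonstrative code for the exact names.

    ```python
    # Sketch of a daily-statistics retrieval via the CDS API (pip install cdsapi).
    import cdsapi

    client = cdsapi.Client()  # reads credentials from ~/.cdsapirc
    client.retrieve(
        "derived-era5-single-levels-daily-statistics",   # assumed dataset id
        {
            "variable": ["2m_temperature"],
            "year": "2020",
            "month": "01",
            "day": [f"{d:02d}" for d in range(1, 32)],
            "daily_statistic": "daily_mean",             # assumed key: daily_mean / daily_maximum / daily_minimum / daily_sum
            "frequency": "1_hourly",                     # assumed key: sampling of the underlying hourly data
            "time_zone": "utc+00:00",                    # assumed key: no time-zone shift
        },
        "era5_t2m_daily_mean_202001.grib",
    )
    ```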

  3. Evaluating FAIR Models for Rossmann Store Sales Prediction: Insights and...

    • test.researchdata.tuwien.at
    bin, csv, json, text/markdown
    Updated Apr 28, 2025
    Cite
    Dilara Çakmak (2025). Evaluating FAIR Models for Rossmann Store Sales Prediction: Insights and Performance Analysis [Dataset]. http://doi.org/10.70124/f5t2d-xt904
    Dataset provided by
    TU Wien
    Authors
    Dilara Çakmak
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 2025
    Description

    Context and Methodology

    Research Domain:
    The dataset is part of a project focused on retail sales forecasting. Specifically, it is designed to predict daily sales for Rossmann, a chain of over 3,000 drug stores operating across seven European countries. The project falls under the broader domain of time series analysis and machine learning applications for business optimization. The goal is to apply machine learning techniques to forecast future sales based on historical data, which includes factors like promotions, competition, holidays, and seasonal trends.

    Purpose:
    The primary purpose of this dataset is to help Rossmann store managers predict daily sales for up to six weeks in advance. By making accurate sales predictions, Rossmann can improve inventory management, staffing decisions, and promotional strategies. This dataset serves as a training set for machine learning models aimed at reducing forecasting errors and supporting decision-making processes across the company’s large network of stores.

    How the Dataset Was Created:
    The dataset was compiled from several sources, including historical sales data from Rossmann stores, promotional calendars, holiday schedules, and external factors such as competition. The data is split into multiple features, such as the store's location, promotion details, whether the store was open or closed, and weather information. The dataset is publicly available on platforms like Kaggle and was initially created for the Kaggle Rossmann Store Sales competition. The data is made accessible via an API for further analysis and modeling, and it is structured to help machine learning models predict future sales based on various input variables.

    Technical Details

    Dataset Structure:

    The dataset consists of three main files, each with its specific role:

    1. Train:
      This file contains the historical sales data, which is used to train machine learning models. It includes daily sales information for each store, as well as various features that could influence the sales (e.g., promotions, holidays, store type, etc.).

      https://handle.test.datacite.org/10.82556/yb6j-jw41
      PID: b1c59499-9c6e-42c2-af8f-840181e809db
    2. Test2:
      The test dataset mirrors the structure of train.csv but does not include the actual sales values (i.e., the target variable). This file is used for making predictions using the trained machine learning models. It is used to evaluate the accuracy of predictions when the true sales data is unknown.

      https://handle.test.datacite.org/10.82556/jerg-4b84
      PID: 7cbb845c-21dd-4b60-b990-afa8754a0dd9
    3. Store:
      This file provides metadata about each store, including information such as the store’s location, type, and assortment level. This data is essential for understanding the context in which the sales data is gathered.

      https://handle.test.datacite.org/10.82556/nqeg-gy34
      PID: 9627ec46-4ee6-4969-b14a-bda555fe34db

    Data Fields Description:

    • Id: A unique identifier for each (Store, Date) combination within the test set.

    • Store: A unique identifier for each store.

    • Sales: The daily turnover (target variable) for each store on a specific day (this is what you are predicting).

    • Customers: The number of customers visiting the store on a given day.

    • Open: An indicator of whether the store was open (1 = open, 0 = closed).

    • StateHoliday: Indicates if the day is a state holiday, with values like:

      • 'a' = public holiday,

      • 'b' = Easter holiday,

      • 'c' = Christmas,

      • '0' = no holiday.

    • SchoolHoliday: Indicates whether the store is affected by school closures (1 = yes, 0 = no).

    • StoreType: Differentiates between four types of stores: 'a', 'b', 'c', 'd'.

    • Assortment: Describes the level of product assortment in the store:

      • 'a' = basic,

      • 'b' = extra,

      • 'c' = extended.

    • CompetitionDistance: Distance (in meters) to the nearest competitor store.

    • CompetitionOpenSince[Month/Year]: The month and year when the nearest competitor store opened.

    • Promo: Indicates whether the store is running a promotion on a particular day (1 = yes, 0 = no).

    • Promo2: Indicates whether the store is participating in Promo2, a continuing promotion for some stores (1 = participating, 0 = not participating).

    • Promo2Since[Year/Week]: The year and calendar week when the store started participating in Promo2.

    • PromoInterval: Describes the months when Promo2 is active, e.g., "Feb,May,Aug,Nov" means the promotion starts in February, May, August, and November.
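
    A minimal pandas/scikit-learn sketch of how these fields might be combined into a first forecasting baseline. File names and the exact feature handling are illustrative assumptions, not the pipeline used in the accompanying notebook.

    ```python
    # Sketch: merge sales history with store metadata and fit a simple model.
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor

    train = pd.read_csv("train.csv", parse_dates=["Date"])   # assumed local file names
    store = pd.read_csv("store.csv")

    df = train.merge(store, on="Store", how="left")          # attach store metadata
    df = df[df["Open"] == 1]                                  # closed days have zero sales

    # Encode categoricals and add simple calendar features
    df["StateHoliday"] = df["StateHoliday"].astype(str)
    df = pd.get_dummies(df, columns=["StateHoliday", "StoreType", "Assortment"])
    df["Month"] = df["Date"].dt.month
    df["DayOfWeek"] = df["Date"].dt.dayofweek

    features = [c for c in df.columns if c not in ("Sales", "Customers", "Date", "PromoInterval")]
    X = df[features].fillna(0)
    y = df["Sales"]

    model = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=0)
    model.fit(X, y)
    ```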

    Software Requirements

    To work with this dataset, you will need to have specific software installed, including:

    • DBRepo Authorization: This is required to access the datasets via the DBRepo API. You may need to authenticate with an API key or login credentials to retrieve the datasets.

    • Python Libraries: Key libraries for working with the dataset include:

      • pandas for data manipulation,

      • numpy for numerical operations,

      • matplotlib and seaborn for data visualization,

      • scikit-learn for machine learning algorithms.

    Additional Resources

    Several additional resources are available for working with the dataset:

    1. Presentation:
      A presentation summarizing the exploratory data analysis (EDA), feature engineering process, and key insights from the analysis is provided. This presentation also includes visualizations that help in understanding the dataset’s trends and relationships.

    2. Jupyter Notebook:
      A Jupyter notebook, titled Retail_Sales_Prediction_Capstone_Project.ipynb, is provided, which details the entire machine learning pipeline, from data loading and cleaning to model training and evaluation.

    3. Model Evaluation Results:
      The project includes a detailed evaluation of various machine learning models, including their performance metrics like training and testing scores, Mean Absolute Percentage Error (MAPE), and Root Mean Squared Error (RMSE). This allows for a comparison of model effectiveness in forecasting sales.

    4. Trained Models (.pkl files):
      The models trained during the project are saved as .pkl files. These files contain the trained machine learning models (e.g., Random Forest, Linear Regression, etc.) that can be loaded and used to make predictions without retraining the models from scratch.

    5. sample_submission.csv:
      This file is a sample submission file that demonstrates the format of predictions expected when using the trained model. The sample_submission.csv contains predictions made on the test dataset using the trained Random Forest model. It provides an example of how the output should be structured for submission.

    These resources provide a comprehensive guide to implementing and analyzing the sales forecasting model, helping you understand the data, methods, and results in greater detail.

  4. Essential climate variables for water sector applications derived from...

    • cds.climate.copernicus.eu
    netcdf
    Updated Jan 31, 2025
    Cite
    ECMWF (2025). Essential climate variables for water sector applications derived from climate projections [Dataset]. http://doi.org/10.24381/cds.201321f6
    Dataset authored and provided by
    ECMWF
    License

    https://object-store.os-api.cci2.ecmwf.int:443/cci2-prod-catalogue/licences/esgf-cmip5/esgf-cmip5_1fe0fc3e6a6d03717651f8de7a111f80c75b5aef1d4e8989a8ccfb8f02b15ef2.pdf

    Time period covered
    Oct 12, 2000 - Oct 18, 2018
    Description

    This dataset contains 4 Essential Climate Variables (ECV) for the 18 bias-adjusted Global Climate Models (GCM) from CMIP5: daily precipitation rate, and daily mean, maximum and minimum temperatures. The data are bias adjusted using the Distribution Based Scaling (DBS) method against the global reference dataset HydroGFD2.0; both the bias adjustment method and the global reference dataset were developed by the Swedish Meteorological and Hydrological Institute (SMHI).

    The DBS method is a parametric quantile-mapping variant. This type of method fits a statistical distribution to the cumulative distribution function and uses the fitted distributions to conduct the quantile mapping. Here, we used a double-gamma distribution (i.e. separate gamma distributions for the bulk and the high tail) for precipitation and the normal distribution for all temperature variables. Temperature corrections were done conditional on the wet/dry state of the corresponding precipitation time series. The seasonal variations in the biases were represented by monthly parameter windows for precipitation and a smoothed seasonal cycle for the temperature distribution parameters. The smoothing was done using twelve harmonic components.

    There is some post-processing in place for the data set to be suitable for hydrological impact modeling. Bias adjustment of daily mean, maximum and minimum temperature using quantile mapping can in some cases lead to inconsistencies; for instance, maximum (minimum) temperature could be lower (higher) than mean temperature. If such inconsistencies occur, daily minimum and maximum temperatures are adjusted in such a way that the anomalies with respect to the daily mean temperature are in line with the climatological anomalies for the particular day in the seasonal cycle. This means, for example, that an inconsistency occurring on June 25 will be adjusted using the climatological anomalies for June 25, estimated by a moving window. The adjustment is also done conditional on the wet/dry state of the corresponding precipitation series. The climatology of the anomalies was derived from the HydroGFD2.0 dataset.

    In addition, the DBS method's limitations mean that some data are not bias-adjusted or fall beyond physically plausible ranges; in those cases, DBS gives missing values as output. Grid cells where the DBS method resulted in such missing values have been interpolated in time or space. If interpolation was not possible, the full time series from the nearest grid cell was copied to the corresponding grid point.
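
    For illustration, a simplified single-gamma quantile-mapping sketch in the spirit of the DBS description above. The production method uses a double-gamma distribution, wet/dry conditioning of temperature and seasonal parameter windows, none of which are reproduced here.

    ```python
    # Sketch: fit gamma distributions to wet-day precipitation in the model and
    # in the reference, then map model values through the two CDFs.
    import numpy as np
    from scipy import stats

    def quantile_map_precip(model_hist, reference, model_future, wet_threshold=0.1):
        """Map model precipitation onto the reference distribution (wet days only)."""
        wet_mod = model_hist[model_hist > wet_threshold]
        wet_ref = reference[reference > wet_threshold]

        # Fit gamma distributions with the location fixed at the wet-day threshold
        a_m, loc_m, scale_m = stats.gamma.fit(wet_mod, floc=wet_threshold)
        a_r, loc_r, scale_r = stats.gamma.fit(wet_ref, floc=wet_threshold)

        return np.where(
            model_future > wet_threshold,
            stats.gamma.ppf(
                stats.gamma.cdf(model_future, a_m, loc=loc_m, scale=scale_m),
                a_r, loc=loc_r, scale=scale_r,
            ),
            0.0,  # dry days stay dry in this simplified sketch
        )
    ```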

  5. Public Health Portfolio Dataset

    • nihr.opendatasoft.com
    csv, excel, json
    Updated Sep 26, 2025
    Cite
    (2025). Public Health Portfolio Dataset [Dataset]. https://nihr.opendatasoft.com/explore/dataset/phof-datase/
    License

    Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
    License information was derived automatically

    Description

    The NIHR is one of the main funders of public health research in the UK. Public health research falls within the remit of a range of NIHR Research Programmes, NIHR Centres of Excellence and Facilities, plus the NIHR Academy. NIHR awards from all NIHR Research Programmes and the NIHR Academy that were funded between January 2006 and the present extraction date are eligible for inclusion in this dataset. Agreed inclusion/exclusion criteria are used to categorise awards as public health awards (see below). Following inclusion in the dataset, public health awards are second-level coded to one of the four Public Health Outcomes Framework domains. These domains are: (1) wider determinants, (2) health improvement, (3) health protection, (4) healthcare and premature mortality. More information on the Public Health Outcomes Framework domains can be found here.

    This dataset is updated quarterly to include new NIHR awards categorised as public health awards. Please note that for those Public Health Research Programme projects showing an Award Budget of £0.00, the project is undertaken by an on-call team (for example PHIRST, Public Health Review Team, or Knowledge Mobilisation Team) as part of an ongoing programme of work.

    Inclusion criteria

    The NIHR Public Health Overview project team worked with colleagues across NIHR public health research to define the inclusion criteria for NIHR public health research awards. NIHR awards are categorised as public health awards if they are determined to be ‘investigations of interventions in, or studies of, populations that are anticipated to have an effect on health or on health inequity at a population level.’ This definition of public health is intentionally broad to capture the wide range of NIHR public health awards across prevention, health improvement, health protection, and healthcare services (both within and outside of NHS settings). This dataset does not reflect the NIHR’s total investment in public health research. The intention is to showcase a subset of the wider NIHR public health portfolio. This dataset includes NIHR awards categorised as public health awards from NIHR Research Programmes and the NIHR Academy. It does not currently include public health awards or projects funded by any of the three NIHR Research Schools or any of the NIHR Centres of Excellence and Facilities. Therefore, awards from the NIHR Schools for Public Health, Primary Care and Social Care, the NIHR Public Health Policy Research Unit and the NIHR Health Protection Research Units do not feature in this curated portfolio.

    Disclaimers

    Users of this dataset should acknowledge the broad definition of public health that has been used to develop the inclusion criteria for this dataset. This caveat applies to all data within the dataset irrespective of the funding NIHR Research Programme or NIHR Academy award. Please note that this dataset is currently subject to a limited data quality review. We are working to improve our data collection methodologies. Please also note that some awards may also appear in other NIHR curated datasets.

    Further information

    Further information on the individual awards shown in the dataset can be found on the NIHR’s Funding & Awards website here. Further information on individual NIHR Research Programmes’ decision-making processes for funding health and social care research can be found here. Further information on NIHR’s investment in public health research can be found as follows: NIHR School for Public Health here. NIHR Public Health Policy Research Unit here. NIHR Health Protection Research Units here. NIHR Public Health Research Programme Health Determinants Research Collaborations (HDRC) here. NIHR Public Health Research Programme Public Health Intervention Responsive Studies Teams (PHIRST) here.

  6. HR Dataset (Multinational Company)

    • kaggle.com
    Updated Aug 23, 2025
    Cite
    Data Science Lovers (2025). HR Dataset (Multinational Company) [Dataset]. https://www.kaggle.com/datasets/rohitgrewal/hr-data-mnc
    Available download formats: Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Data Science Lovers
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    📹Project Video available on YouTube - https://youtu.be/fykrwQD3HR4

    🖇️Connect with me on LinkedIn - https://www.linkedin.com/in/rohit-grewal

    Human Resource (HR) Data of a Multi-national Corporation (MNC) - 2 Million Records

    This dataset contains HR information for employees of a multinational corporation (MNC). It includes 2 Million (20 Lakhs) employee records with details about personal identifiers, job-related attributes, performance, employment status, and salary information. The dataset can be used for HR analytics, including workforce distribution, attrition analysis, salary trends, and performance evaluation.

    The data is available as a CSV file. We analyse this dataset using pandas; the analysis will be helpful for those working in the HR domain.

    Using this dataset, we answered multiple questions with Python in our Project.

    Q.1) What is the distribution of Employee Status (Active, Resigned, Retired, Terminated) ?

    Q.2) What is the distribution of work modes (On-site, Remote) ?

    Q.3) How many employees are there in each department ?

    Q.4) What is the average salary by Department ?

    Q.5) Which job title has the highest average salary ?

    Q.6) What is the average salary in different Departments based on Job Title ?

    Q.7) How many employees Resigned & Terminated in each department ?

    Q.8) How does salary vary with years of experience ?

    Q.9) What is the average performance rating by department ?

    Q.10) Which country has the highest concentration of employees ?

    Q.11) Is there a correlation between performance rating and salary ?

    Q.12) How has the number of hires changed over time (per year) ?

    Q.13) Compare salaries of Remote vs. On-site employees — is there a significant difference ?

    Q.14) Find the top 10 employees with the highest salary in each department.

    Q.15) Identify departments with the highest attrition rate (Resigned %).

    Enrol in our Udemy courses : 1. Python Data Analytics Projects - https://www.udemy.com/course/bigdata-analysis-python/?referralCode=F75B5F25D61BD4E5F161 2. Python For Data Science - https://www.udemy.com/course/python-for-data-science-real-time-exercises/?referralCode=9C91F0B8A3F0EB67FE67 3. Numpy For Data Science - https://www.udemy.com/course/python-numpy-exercises/?referralCode=FF9EDB87794FED46CBDF

    These are the main Features/Columns available in the dataset :

    1) Unnamed: 0 – Index column (auto-generated, not useful for analysis, will be deleted).

    2) Employee_ID – Unique identifier assigned to each employee (e.g., EMP0000001).

    3) Full_Name – Full name of the employee.

    4) Department – Department in which the employee works (e.g., IT, HR, Marketing, Operations).

    5) Job_Title – Designation or role of the employee (e.g., Software Engineer, HR Manager).

    6) Hire_Date – The date when the employee was hired by the company.

    7) Location – Geographical location of the employee (city, country).

    8) Performance_Rating – Performance evaluation score (numeric scale, higher is better).

    9) Experience_Years – Number of years of professional experience the employee has.

    10) Status – Current employment status (e.g., Active, Resigned).

    11) Work_Mode – Mode of working (e.g., On-site, Hybrid, Remote).

    12) Salary_INR – Annual salary of the employee in Indian Rupees.
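
    A minimal pandas sketch answering a few of the questions above using the listed columns; the CSV file name is an assumption.

    ```python
    # Sketch: basic HR analytics over the documented columns.
    import pandas as pd

    df = pd.read_csv("hr_data_mnc.csv")                       # assumed file name
    df = df.drop(columns=["Unnamed: 0"], errors="ignore")     # drop the auto-generated index column

    # Q1: distribution of employee status
    print(df["Status"].value_counts())

    # Q4: average salary by department
    print(df.groupby("Department")["Salary_INR"].mean().sort_values(ascending=False))

    # Q11: correlation between performance rating and salary
    print(df["Performance_Rating"].corr(df["Salary_INR"]))

    # Q14: top 10 highest-paid employees in each department
    top10 = (df.sort_values("Salary_INR", ascending=False)
               .groupby("Department")
               .head(10))
    print(top10[["Department", "Full_Name", "Salary_INR"]])
    ```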

  7. Dataset characteristics.

    • plos.figshare.com
    xls
    Updated Jul 14, 2025
    + more versions
    Cite
    Xuemei Bai; Yuqing Zhang; Chenjie Zhang; Zhijun Wang (2025). Dataset characteristics. [Dataset]. http://doi.org/10.1371/journal.pone.0328131.t003
    Dataset provided by
    PLOS ONE
    Authors
    Xuemei Bai; Yuqing Zhang; Chenjie Zhang; Zhijun Wang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Person re-identification (ReID) technology has many applications in intelligent surveillance and public safety. However, the difference between the source and target domains makes model generalization extremely challenging. To reduce the dependence on labeled data, Unsupervised Domain Adaptation (UDA) methods have become an effective way to address this problem; however, the noise in the pseudo-labels generated by existing UDA methods still has a significant influence on model training, resulting in limited performance on the target domain. For this reason, this paper proposes contrastive-learning-based pseudo-label refinement with probabilistic uncertainty for unsupervised domain-adaptive person re-identification. We first enhance the feature representation of the target domain samples using contrastive learning to improve their discrimination in the feature space, thereby enhancing the cross-domain transfer performance of the model. Subsequently, an innovative loss function is proposed to reduce the interference of label noise on the training process by refining the pseudo-label generation process, mitigating the negative impact of inaccurate pseudo-labels on model training. The method is validated on two large-scale public datasets, Market1501 and DukeMTMC, where its Rank-1 accuracy reaches 91.4% and 81.4% and its mean average precision (mAP) reaches 79.0% and 67.9%, respectively. These results show that this work provides a good solution for the person re-identification task, with effective technical support for label-noise handling and improved model generalization.

  8. Data set for "Freeze-casting uniformity and domains"

    • data.dtu.dk
    avi
    Updated Sep 19, 2024
    Cite
    Rasmus Bjørk; Peter Stanley Jørgensen; Cathrine D. Christiansen (2024). Data set for "Freeze-casting uniformity and domains" [Dataset]. http://doi.org/10.11583/DTU.26144113.v1
    Dataset provided by
    Technical University of Denmark
    Authors
    Rasmus Bjørk; Peter Stanley Jørgensen; Cathrine D. Christiansen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the data set for the paper [Jørgensen, P. S., Christiansen, C. D. & Bjørk, R., Freeze-casting uniformity and domains, Journal of the European Ceramic Society, 45, 116907, 2025]. The DOI for the publication is 10.1016/j.jeurceramsoc.2024.116907.

    This dataset contains three folders with the data shown in the article. Besides this, there is a movie showing the domain structure throughout the dataset, as mentioned in the article.

    The folder "Raw_reconstructions" contains the raw reconstructed TIF images of the high resolution cutout (4X) and the bottom and top scans in the respective subfolders. These are shown in Fig. 2.

    The folder "Segmentations" contains the segmented structure, i.e. split into pores (with a value of 0) and solid (with a value of 1), of the high resolution cutout (4X) and the bottom and top scans in the respective subfolders. The structures are stored as 3D logical arrays, corresponding to the physical directions x, y and z, in a Matlab file format. These are shown in Fig. 2 in the article.

    The folder "Computed_properties" contains the various computed properties illustrated in the article. The files are individually described below:

    • Porosity_4X.mat: Contains the porosity as a function of the vertical (z) distance for the high resolution cutout, shown in Fig. 4 in the article. The data is stored as two arrays in a Matlab file.

    • PSD_4X.mat: Contains two variables, the solid particle size distribution and the pore particle size distribution, shown in Fig. 3b in the article. These are two-dimensional arrays whose first column contains the particle diameter in um and whose second column contains the volume covered in percent.

    • Structure_tensor_bottom.mat: Contains three arrays for the structure tensor for the bottom scan. The first array is the voxel_size array, which simply gives the voxel size also stated in the article. The second array is the image_stack array with dimensions x by y by color by z; there are three colors (RGB) for each image. These are the images visualized in Fig. 6 in the article. The third array is theta_slice, which contains the local orientation angle with dimensions x, y and z. The latter array has been upsampled in x and y, as also described in the article, such that it has one third the dimensions of image_stack in x and y.

    • Structure_tensor_top.mat: Contains three arrays for the structure tensor for the top scan. The variables are identical to Structure_tensor_bottom.mat.

    • Structure_tensor_domains.mat: Contains two arrays for the structure tensor combined for the top and bottom scans. The array domain_sizes_slice contains the size of the domains, with dimensions slice number by number of color bins (taken to be five) by domain size in pixels. As an example, the entry domain_sizes_slice(100,1,50) contains the number of domains in color bin 1 with a size of 50 pixels in the 100th vertical slice of the data set. The array centroid_positions_slice has a similar structure: its first dimension is the slice number, its second dimension is the color bin, its third dimension is a running number for each domain present, and the final dimension of two contains the x and y coordinates in pixels of the centroid position of the respective domain. To make the structure an array, there can be empty entries for high running domain numbers. As an example, centroid_positions_slice(100,1,:,:) contains all domains in vertical slice 100 in color bin 1. There are 365 such domains, meaning that centroid_positions_slice(100,1,366,:) and onwards contains zeros.

    • Tau_por_dir_1_4X.mat: Contains an array with the relative local flux in the pores in the high resolution cutout in the x-direction, as visualized in Fig. 5 in the article.

    • Tau_por_dir_2_4X.mat: Contains an array with the relative local flux in the pores in the high resolution cutout in the y-direction, as visualized in Fig. 5 in the article.

    • Tau_por_dir_3_4X.mat: Contains an array with the relative local flux in the pores in the high resolution cutout in the z-direction, as visualized in Fig. 5 in the article.

    • Tau_sol_dir_1_4X.mat: Contains an array with the relative local flux in the solid in the high resolution cutout in the x-direction, as visualized in Fig. 5 in the article.

    • Tau_sol_dir_2_4X.mat: Contains an array with the relative local flux in the solid in the high resolution cutout in the y-direction, as visualized in Fig. 5 in the article.

    • Tau_sol_dir_3_4X.mat: Contains an array with the relative local flux in the solid in the high resolution cutout in the z-direction, as visualized in Fig. 5 in the article.
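
    A minimal Python sketch of reading these Matlab files with scipy.io.loadmat, using the array names described above. HDF5-based (v7.3) .mat files would need h5py instead, and the 1-based Matlab indices in the description become 0-based in Python.

    ```python
    # Sketch: inspect and index the domain arrays from Structure_tensor_domains.mat.
    import numpy as np
    from scipy.io import loadmat

    por = loadmat("Computed_properties/Porosity_4X.mat")
    print([k for k in por if not k.startswith("__")])     # list the variable names

    dom = loadmat("Computed_properties/Structure_tensor_domains.mat")
    domain_sizes = dom["domain_sizes_slice"]              # (slice, color bin, domain size in pixels)
    # Number of domains of size 50 px in color bin 1 of vertical slice 100
    print(domain_sizes[99, 0, 49])

    centroids = dom["centroid_positions_slice"]           # (slice, color bin, domain index, xy)
    slice100_bin1 = centroids[99, 0]                      # all centroids for that slice/bin
    valid = slice100_bin1[np.any(slice100_bin1 != 0, axis=1)]   # drop zero-padded entries
    print(valid.shape)
    ```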

  9. domain-translations

    • huggingface.co
    Updated Jul 15, 2025
    Cite
    HumbleWorth (2025). domain-translations [Dataset]. https://huggingface.co/datasets/humbleworth/domain-translations
    Dataset authored and provided by
    HumbleWorth
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Multilingual Domain Name Translations Dataset

      Dataset Description
    

    This dataset contains 155,004 domain names with their multilingual translations across 20 languages. Each domain has been segmented into constituent words and translated while preserving semantic meaning and commercial appeal. The dataset is particularly valuable for domain name research, multilingual NLP tasks, and understanding how brand names and concepts translate across languages.

      Dataset… See the full description on the dataset page: https://huggingface.co/datasets/humbleworth/domain-translations.
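
    A minimal sketch for loading the dataset with the Hugging Face datasets library; the split name and record layout are assumptions, so consult the dataset card for the actual schema.

    ```python
    # Sketch: load and inspect the domain-translations dataset.
    from datasets import load_dataset

    ds = load_dataset("humbleworth/domain-translations", split="train")   # split name assumed
    print(ds)          # features and number of rows (the card reports 155,004 domains)
    print(ds[0])       # one domain with its segmented words and per-language translations
    ```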
    
  10. Miami Isopycnic Coordinate Ocean Model Output

    • rda.ucar.edu
    + more versions
    Cite
    Miami Isopycnic Coordinate Ocean Model Output [Dataset]. https://rda.ucar.edu/#!lfd?nb=y&b=doi&v=Matching+Datasets
    Description

    This archive has three distinct Miami Isopycnic Coordinate Ocean Model (MICOM) outputs. The output grids are for isopycnal layers, at one-twelfth degree spatial and three-day temporal resolution. Water temperature, salinity, ... and velocity components are the primary output variables. The model domain is the Atlantic Ocean from 65-70N to 28S latitude and in two cases includes the Mediterranean Sea. The three cases are briefly outlined below and differ primarily in the atmospheric forcing that is used.

    • MICOM.ICOADS: International Comprehensive Ocean-Atmosphere Data Set (ICOADS) mean wind forcing; domain 28S to 65N, 98W to 17E; 16 isopycnal layers; six years of data.

    • MICOM.ECMWF: ECMWF mean wind forcing; domain 28S to 70N, 98W to 36E, includes the Mediterranean Sea; 20 isopycnal layers; three years of data.

    • MICOM.DAILY: ECMWF 6-hourly forcing (wind, surface radiation, air temperature and humidity); domain 28S to 70N, 98W to 36E, includes the Mediterranean Sea; 20 isopycnal layers; 1979-1986.

    Details about the model forcing, mixing parameterization, bottom topography, computational requirements, and browse graphics are available at the MICOM website.

  11. Overhead Wind Turbine Dataset (NAIP)

    • data.niaid.nih.gov
    Updated Dec 2, 2022
    Cite
    Kyle Bradbury (2022). Overhead Wind Turbine Dataset (NAIP) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7385226
    Dataset provided by
    Jordan Malof
    Saksham Jain
    Yuxi Long
    Frank Willard
    Kyle Bradbury
    Caroline Tang
    Caleb Kornfein
    Simiao Ren
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    1 - OVERVIEW

    This dataset contains overhead images of wind turbines from three regions of the United States – the Eastern Midwest (EM), Northwest (NW), and Southwest (SW). The images come from the National Agricultural Imagery Program and were extracted using Google Earth Engine and wind turbine latitude-longitude coordinates from the U.S. Wind Turbine Database. Overall, there are 2003 NAIP collected images, of which 988 images contain wind turbines and the other 1015 are background images (not containing wind turbines) collected from regions nearby the wind turbines. Labels are provided for all images containing wind turbines. We welcome uses of this dataset for object detection or other research purposes.

    2 - DATA DETAILS

    Each image is 608 x 608 pixels with a GSD of 1 m, so each image represents a frame of approximately 608 m x 608 m. Because images were collected directly over the exact wind turbine coordinates, they would otherwise be almost exactly centered on the turbines. To avoid this issue, images were randomly shifted by up to 75 m in two directions.

    We refer to images without turbines as "background images", and further split up the images with turbines into the training and testing set splits. We call the training images with turbines "real images" and the testing images "test images".

    Distribution of gathered images by region and type:

    | Domain | Real | Test | Background |
    |:-------|-----:|-----:|-----------:|
    | EM | 267 | 100 | 244 |
    | NW | 213 | 100 | 415 |
    | SW | 208 | 100 | 356 |

    Note that this dataset is part of a larger research project in Duke's 2021-2022 Bass Connections team, Creating Artificial Worlds with AI to Improve Energy Access Data. Our research proposes a technique to synthetically generate images with implanted energy infrastructure objects. We include the synthetic images we generated along with the NAIP collected images above. Generating synthetic images requires a training and testing domain, so for each pair of domains we include 173 synthetically generated images. For a fuller picture on our research, including additional image data from domain adaptation techniques we benchmark our method against, visit our github: https://github.com/energydatalab/closing-the-domain-gap. If you use this dataset, please cite the citation found in our Github README.

    3 - NAVIGATING THE DATASET

    Once the data is unzipped, you will see that the base level of the dataset contains an image and a labels folder, which have the exact same structure. Here is how the images directory is divided:

    | - images
    | | - SW
    | | | - Background
    | | | - Test
    | | | - Real
    | | - EM
    | | | - Background
    | | | - Test
    | | | - Real
    | | - NW
    | | | - Background
    | | | - Test
    | | | - Real
    | | - Synthetic
    | | | - s_EM_t_NW
    | | | - s_SW_t_NW
    | | | - s_NW_t_NW
    | | | - s_NW_t_EM
    | | | - s_SW_t_EM
    | | | - s_EM_t_SW
    | | | - s_NW_t_SW
    | | | - s_EM_t_EM
    | | | - s_SW_t_SW

    For example images/SW/Real has the 208 .jpg images from the Southwest that contain turbines. The synthetic subdirectory is structured such that for example images/Synthetic/s_EM_t_NW contains synthetic images using a source domain of Eastern Midwest and a target domain of Northwest, meaning the images were stylized to artificially look like Northwest images.

    Note that we also provide a domain_overview.json file at the top level to help you navigate the directory. The domain_overview.json file navigates the directory with keys, so if you load the file as f, then f['images']['SW']['Background'] should list all the background photos from the SW. The keys in the domain json are ordered in the order we used the images for our experiments. So if our experiment used 100 SW background images, we used the images corresponding to the first 100 keys.
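
    A minimal sketch of walking the dataset via domain_overview.json along the lines described above; the unzipped root path is an assumption.

    ```python
    # Sketch: list images in experiment order and locate a matching label file.
    import json
    from pathlib import Path

    root = Path("wind_turbine_dataset")                 # assumed unzip location
    with open(root / "domain_overview.json") as fh:
        f = json.load(fh)

    sw_background = list(f["images"]["SW"]["Background"])   # ordered names, per the description
    print(len(sw_background), sw_background[:3])
    first_100 = sw_background[:100]                          # e.g. the subset used in an experiment

    # Label files mirror the image tree: images/EM/Real/EM_136.jpg -> labels/EM/Real/EM_136.txt
    example = list(f["images"]["EM"]["Real"])[0]
    print(root / "labels" / "EM" / "Real" / (Path(example).stem + ".txt"))
    ```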

    Naming conventions:

    1 - Real and Test images:

    {DOMAIN}_{UNIQUE ID}.jpg

    For example 'EM_136.jpg' with corresponding label file 'EM_136.txt' refers to an image from the Eastern Midwest with unique ID 136.

    2 - Background images:

    Background images were collected in 3 waves with the purpose to create a set of images similar visually to real images, just without turbines:

    The first wave came from NAIP images at U.S. Wind Turbine Database coordinates where no wind turbine was present in the snapshot (NAIP images span a relatively large time, so wind turbines may be missing from the images). These images are labeled {DOMAIN}_{UNIQUE ID}_background.jpg, for example 'EM_1612_background.jpg'.

    The second wave was collected using wind turbine coordinates, with images randomly shifted either 4000 m Southeast or Northwest. These images are labeled {DOMAIN}_{UNIQUE_ID}_{SHIFT DIRECTION (SE or NW)}_background.jpg. For example, 'NW_12750_SE_background.jpg' refers to an image from the Northwest without turbines captured at a shift of 4000 m Southeast from a wind turbine with unique ID 12750.

    The third wave was collected in the same way at a shift of 6000 m Southeast or Northwest. These images are labeled {DOMAIN}_{UNIQUE_ID}_{SHIFT DIRECTION (SE or NW)}_6000_background.jpg, for example 'NW_12937_NW_6000_background.jpg'.

    3 - Synthetic images

    Each synthetic image takes in labeled wind turbine examples from the source domain, a background image from the target domain, and a mask. It uses the mask to place wind turbine examples and blends those examples onto the background image using GP-GAN. Thus, the naming conventions for synthetic images are:

    {BACKGROUND IMAGE NAME FROM TARGET DOMAIN}_{MASK NUMBER}.jpg.

    For example, images/Synthetic/s_NW_t_SW/SW_2246_m15.jpg corresponds to a synthetic image created using labeled wind turbine examples from the Northwest and stylized in the image of the Southwest using Southwest background image SW_2246 and mask 15.

    For any remaining questions, please reach out to the author point of contact at caleb.kornfein@gmail.com.

  12. Data for research Chinese Checkers as a Strategic Thinking Development

    • figshare.com
    pdf
    Updated Jan 5, 2025
    Cite
    mario de la puente (2025). Data for research Chinese Checkers as a Strategic Thinking Development [Dataset]. http://doi.org/10.6084/m9.figshare.28138928.v1
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    mario de la puente
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset comprises measurements from 179 participants divided into experimental (n=93) and control (n=86) groups, with scores recorded across five strategic thinking domains (Analytical Reasoning, Strategic Planning, Adaptive Decision-Making, Coalition Building, and Conflict Resolution) at three time points (Pre, Post, and Second assessment). For each domain, mean scores and standard deviations are provided for both groups at all time points, with values ranging from 3.0 to 4.7 on a 5-point scale. The experimental group consistently shows higher mean values (ranging from 3.1-4.7) compared to the control group (ranging from 3.0-4.0), with standard deviations varying between 0.32 and 0.45 across all measurements.

  13. County-Level Human Well-Being Index and Domain Scores (2000-2010) plus EQI...

    • catalog.data.gov
    Updated Apr 12, 2021
    Cite
    U.S. EPA Office of Research and Development (ORD) (2021). County-Level Human Well-Being Index and Domain Scores (2000-2010) plus EQI data set (2000-2005) [Dataset]. https://catalog.data.gov/dataset/county-level-human-well-being-index-and-domain-scores-2000-2010-plus-eqi-data-set-2000-200
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    The HWBI_Draft_1 is an internal map service being prepared for public release (early FY18). This map service contains mean county-level HWBI, domain, indicator and service scores related to research efforts completed in 2014. The EQI map service is a publicly accessible map service that contains average county-level results for 2000-2005. This dataset is associated with the following publication: Harwell, L., L. Smith, and K. Summers. Modified HWBI Model(s) Linking Service Flows to Well-Being Endpoints: Accounting for Environmental Quality. U.S. Environmental Protection Agency, Washington, DC, USA, 2017.

  14. Domain generalization results (%) in the low-data regime with a comparison...

    • plos.figshare.com
    xls
    Updated Sep 4, 2025
    Cite
    Sumaiya Zoha; Jeong-Gun Lee; Young-Woong Ko (2025). Domain generalization results (%) in the low-data regime with a comparison of various models in SSDG settings, evaluated on all datasets. Results are reported as mean ± standard deviation over 5 random seeds. Here, u denotes utilization of unlabeled data. Paired t-tests were conducted between CAT and other baselines, with p-values shown in the last row. [Dataset]. http://doi.org/10.1371/journal.pone.0329799.t004
    Dataset provided by
    PLOS ONE
    Authors
    Sumaiya Zoha; Jeong-Gun Lee; Young-Woong Ko
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Domain generalization results (%) in the low-data regime with a comparison of various models in SSDG settings, evaluated on all datasets. Results are reported as mean ± standard deviation over 5 random seeds. Here, u denotes utilization of unlabeled data. Paired t-tests were conducted between CAT and other baselines, with p-values shown in the last row.

  15. CMAQ Model Version 5.0.2 Output Data -- 2006 CONUS_12km

    • dataverse-staging.rdmc.unc.edu
    Updated Apr 25, 2019
    + more versions
    Cite
    UNC Dataverse (2019). CMAQ Model Version 5.0.2 Output Data -- 2006 CONUS_12km [Dataset]. http://doi.org/10.15139/S3/56JKOO
    Dataset provided by
    UNC Dataverse
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Data Summary: Community Multiscale Air Quality (CMAQ) Model Version 5.0.2 output data from a 2006 CONUS simulation. Note: the datasets are on a Google Drive. The metadata associated with this DOI contain the link to the Google Drive folder and instructions for downloading the data.

    File Location and Download Instructions: The 2006 model output are available in two forms. The hourly datasets are a set of monthly files with surface-layer hourly concentrations for a model domain that encompasses the contiguous U.S. with a horizontal grid resolution of 12 km x 12 km. The daily average dataset is a single file with a year of daily average data for the same domain. Link to hourly data. Link to daily average data. Download instructions.

    File Format: The 2006 model output are stored as netcdf formatted files using I/O API data structures (https://www.cmascenter.org/ioapi/). Information on the model projection and grid structure is contained in the header information of the netcdf file. The output files can be opened and manipulated using I/O API utilities (e.g. M3XTRACT, M3WNDW) or other software programs that can read and write netcdf formatted files (e.g. Fortran, R, Python).

    Model Variables

    Variable names in hourly data files:

    | Variable Name | Units | Variable Description |
    |:--------------|:------|:----------------------|
    | CO | ppb | carbon monoxide |
    | NO | ppb | nitric oxide |
    | NO2 | ppb | nitrogen dioxide |
    | O3 | ppb | ozone |
    | SO2 | ppb | sulfur dioxide |
    | SO2_UGM3 | micrograms/m^3 | sulfur dioxide |
    | AECIJ | micrograms/m^3 | aerosol elemental carbon (sum of i-mode and j-mode) * |
    | AOCIJ | micrograms/m^3 | aerosol organic carbon (sum of i-mode and j-mode) * |
    | ANO3IJ | micrograms/m^3 | aerosol nitrate (sum of i-mode and j-mode) * |
    | TNO3 | micrograms/m^3 | total nitrate = NO3 (ANO3IJ) + nitric acid (HNO3) |
    | ANH4IJ | micrograms/m^3 | aerosol ammonium (sum of i-mode and j-mode) * |
    | ASO4IJ | micrograms/m^3 | aerosol sulfate (sum of i-mode and j-mode) * |
    | PMIJ ** | micrograms/m^3 | total fine particulate matter (sum of i-mode and j-mode) * |
    | PM10 ** | micrograms/m^3 | total particulate matter (sum of i-mode, j-mode, k-mode) * |

    Variable names in daily data files (note: all daily averages are computed using Local Standard Time (LST)):

    | Variable Name | Units | Variable Description |
    |:--------------|:------|:----------------------|
    | CO_AVG | ppb | 24-hr average carbon monoxide |
    | NO_AVG | ppb | 24-hr average nitric oxide |
    | NO2_AVG | ppb | 24-hr average nitrogen dioxide |
    | O3_AVG | ppb | 24-hr average ozone |
    | O3_MDA8 | ppb | maximum daily 8-hr average ozone + |
    | SO2_AVG | ppb | 24-hr average sulfur dioxide |
    | SO2_UGM3_AVG | micrograms/m^3 | 24-hr average sulfur dioxide |
    | AECIJ_AVG | micrograms/m^3 | 24-hr average aerosol elemental carbon (sum of i-mode and j-mode) * |
    | AOCIJ_AVG | micrograms/m^3 | 24-hr average aerosol organic carbon (sum of i-mode and j-mode) * |
    | ANO3IJ_AVG | micrograms/m^3 | 24-hr average aerosol nitrate (sum of i-mode and j-mode) * |
    | TNO3_AVG | micrograms/m^3 | 24-hr average total nitrate = NO3 (ANO3IJ) + nitric acid (HNO3) |
    | ANH4IJ_AVG | micrograms/m^3 | 24-hr average aerosol ammonium (sum of i-mode and j-mode) * |
    | ASO4IJ_AVG | micrograms/m^3 | 24-hr average aerosol sulfate (sum of i-mode and j-mode) * |
    | PMIJ_AVG ** | micrograms/m^3 | 24-hr average total fine particulate matter (sum of i-mode and j-mode) * |
    | PM10_AVG ** | micrograms/m^3 | 24-hr average total particulate matter (sum of i-mode, j-mode, k-mode) * |

    + The calculation of the MDA8 O3 variable is based on the current ozone NAAQS and is derived from the highest of the 17 consecutive 8-hr averages beginning with the 8-hr period from 7:00am to 3:00pm LST and ending with the 8-hr period from 11pm to 7am the following day.

    * CMAQ represents PM using three interacting lognormal distributions, or modes. Two modes, Aitken (i-mode) and accumulation (j-mode), are generally less than 2.5 microns in diameter, while the coarse mode (k-mode) contains significant amounts of mass above 2.5 microns.

    ** Note that modeled size distributions can also be used to output PM species that represent the aerosol mass that falls below a specific diameter, e.g. 2.5 um or 10 um. The output variables that are based on the sharp cut-off method are typically very similar to the aggregate PMIJ (i+j mode) and PM10 (i+j+k mode) variables included in these files. Further information on particle size-composition distributions in CMAQv5.0 can be found in Nolte et al. (2015), https://doi.org/10.5194/gmd-8-2877-2015.

    Simulation Settings and Inputs:

    CMAQ Model
    Model version: 5.0.2
    Bi-directional NH3 air-surface exchange: Massad formulation
    Chemical mechanism: CB05TUCL
    Aerosol module: aero6
    Domain: Continental U.S. (CONUS) using a 12 km grid size and a Lambert Conformal projection assuming a spherical earth with radius 6370.0 km.
    Vertical Resolution: 35 layers from the surface to the top of the free troposphere with layer 1 nominally 19 m tall.

    Boundary Condition Inputs
    Hourly values from a 2006 simulation of GEOS-Chem v8-03-02 with GEOS-5 meteorology
    NLCD land cover used in WRF: 2006

    Emissions Inputs
    Anthropogenic emissions: Emission inventory label 2007ed_06. 2007/2008 modeling platform based on AQMEII phase 2 emissions:...
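
    A minimal sketch of reading one of these netCDF files from Python with the netCDF4 package; the local file name is an assumption, and variable names follow the tables above.

    ```python
    # Sketch: open a daily CMAQ output file and summarize MDA8 ozone.
    from netCDF4 import Dataset

    nc = Dataset("CMAQ_v502_2006_CONUS12km_daily.nc")   # assumed local file name
    print(list(nc.variables))                           # expect CO_AVG, O3_MDA8, PMIJ_AVG, ...

    mda8 = nc.variables["O3_MDA8"]
    print(mda8.dimensions, mda8.shape)                  # I/O API files are typically (TSTEP, LAY, ROW, COL)

    data = mda8[:]
    if data.ndim == 4:                                  # drop a layer dimension of size 1 if present
        data = data[:, 0, :, :]
    annual_mean = data.mean(axis=0)                     # per-grid-cell mean of daily MDA8 ozone, ppb
    print(annual_mean.max())
    ```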

  16. Car Insurance

    • kaggle.com
    Updated Nov 15, 2022
    Cite
    The Devastator (2022). Car Insurance [Dataset]. https://www.kaggle.com/datasets/thedevastator/insurance-companies-secret-sauce-finally-exposed
    Available download formats: Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset provided by
    Kaggle
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Allstate’s Car Insurance

    The Dataset From "Suckers List: How Allstate’s Secret Auto Insurance Algorithm"

    About this dataset

    This dataset contains insurance rates data from across the United States, providing insights into the premiums charged by insurers, the underlying factors that affect those rates, and claims history analysis. The data is designed to help researchers understand the inner workings of the insurance industry, and how rates are calculated. It includes information on premiums, underlying factors, current premium prices, indicated premium prices, selected premium prices, fixed expenses, and more

    How to use the dataset

    This dataset can be used to understand the inner workings of the insurance industry and how rates are calculated. The data include information on premiums, underlying rating factors, claims history analysis, and more, and can be used to research insurance rates across the United States and to understand how those rates are determined. A short loading sketch appears after the column descriptions below.

    Research Ideas

    • Understand the inner workings of the insurance industry, and how rates are calculated
    • Help insurance companies better understand their own pricing models
    • Help consumers understand how their own premiums are calculated

    Acknowledgements

    I would like to acknowledge The Markup for providing the data for this dataset.

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: cgr-definitions-table.csv

    | Column name | Description |
    |:------------|:---------------------------------|
    | cgr         | Combined grade rating. (Numeric) |
    | aa          | Average annual premium. (Numeric) |
    | bb          | Base premium. (Numeric) |
    | cc          | Cost of capital. (Numeric) |
    | va          | Value of assets. (Numeric) |
    | dd          | Direct written premium. (Numeric) |
    | hh          | Homeownership. (Categorical) |

    File: cgr-premiums-table.csv

    | Column name              | Description |
    |:-------------------------|:---------------------------------------------------|
    | territory                | The territory in which the person lives. (String)  |
    | gender                   | The person's gender. (String)                       |
    | birthdate                | The person's birthdate. (Date)                      |
    | ypc                      | The person's years of prior coverage. (Integer)     |
    | current_premium          | The person's current premium. (Float)               |
    | indicated_premium        | The person's indicated premium. (Float)             |
    | selected_premium         | The person's selected premium. (Float)              |
    | underlying_premium       | The person's underlying premium. (Float)            |
    | fixed_expenses           | The person's fixed expenses. (Float)                |
    | underlying_total_premium | The person's underlying total premium. (Float)      |
    | cgr_factor               | The person's CGR factor. (Float)                    |

    File: territory-definitions-table.csv

    | Column name | Description |
    |:------------|:---------------------------------------------------------------------|
    | territory   | The territory in which the person lives. (String)                     |
    | county      | The county in which the person lives. (String)                        |
    | county_code | The county code for the county in which the person lives. (String)    |
    | zipcode     | The zip code for the county in which the person lives. (String)       |
    | town        | The town in which the person lives. (String)                          |
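
    As a quick orientation to the files described above, here is a minimal pandas sketch. It assumes the CSV files have been downloaded locally and that `territory` is the key linking the premium and territory-definition tables; that join key is an assumption based on the column descriptions, not something stated in the listing.

    ```python
    import pandas as pd

    # File names come from the column descriptions above; the join key is assumed.
    premiums = pd.read_csv("cgr-premiums-table.csv")
    territories = pd.read_csv("territory-definitions-table.csv")

    # Attach county/zip/town information to each premium record.
    merged = premiums.merge(territories, on="territory", how="left")

    # Example question: how do selected premiums compare with indicated premiums by county?
    summary = (merged.groupby("county")[["indicated_premium", "selected_premium"]]
                     .mean()
                     .sort_values("selected_premium", ascending=False))
    print(summary.head())
    ```
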


  17. GloBCORD-HD: Global Bias-Corrected CORDEX Datasets at Half Degree Resolution...

    • b2find.eudat.eu
    Updated Aug 11, 2025
    + more versions
    Cite
    (2025). GloBCORD-HD: Global Bias-Corrected CORDEX Datasets at Half Degree Resolution - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/7ddb3d13-f3ec-587c-9062-7307d5828aa2
    Explore at:
    Dataset updated
    Aug 11, 2025
    Description

    Abstract: This dataset provides globally consistent, bias-corrected climate data at 0.5° spatial resolution, consisting of seven climate variables derived from three General Circulation Models (GCMs) participating in CMIP5, downscaled by 10 CORDEX Regional Climate Model (RCM) simulations and bias-corrected globally for the period 1950/1960–2099. It includes data for three climate change scenarios, namely RCP2.6, RCP4.5 and RCP8.5. The three GCMs are ICHEC-EC-EARTH, MPI-M-MPI-ESM-LR and NOAA-GFDL-GFDL-ESM2M. Data are originally available as one netCDF file per GCM (3) per variable (7; NOAA-GFDL-GFDL-ESM2M: 5) per run (4; NOAA-GFDL-GFDL-ESM2M: 3). Available here are zip archives of all netCDF files of one run, i.e. only rcp26 or only rcp45, per GCM (see Size for the overall sum per GCM).
    TableOfContents: daily mean 2m-air temperature (tas); daily minimum 2m-air temperature (tasmin); daily maximum 2m-air temperature (tasmax); daily sum of precipitation (pr); daily mean surface downwelling longwave radiation (rlds); daily mean 10m wind speed (sfcWind); daily mean relative humidity (hurs) *
    *: These variables are NOT included in the NOAA-GFDL-GFDL-ESM2M driven data.
    TechnicalInfo: dimension: 720 columns x 360 rows; temporalExtent_startDate_Historical: 1950-01-01 00:00:00; temporalExtent_endDate_Historical: 2019-12-31 23:59:59; temporalDuration_Historical: 70; temporalDurationUnit_Historical: a; temporalExtent_startDate_RCPs: 2020-01-01 00:00:00; temporalExtent_endDate_RCPs: 2099-12-31 23:59:59; temporalDuration_RCPs: 80; temporalDurationUnit_RCPs: a; temporalResolution: 1; temporalResolutionUnit: d; spatialResolution: 0.5; spatialResolutionUnit: degrees; horizontalResolutionXdirection: 0.5; horizontalResolutionXdirectionUnit: degrees; horizontalResolutionYdirection: 0.5; horizontalResolutionYdirectionUnit: degrees; verticalResolution: none; verticalResolutionUnit: none
    *) For MPI-M-MPI-ESM-LR: temporalExtent_startDate_Historical: 1960-01-01 00:00:00; temporalExtent_endDate_Historical: 2019-12-31 23:59:59; temporalDuration_Historical: 60
    Methods: The ISIMIP3BASD v2.5 bias correction method (see Lange [2019; 2021]) was applied to adjust systematic biases using the GSWP3-W5E53 observational dataset. The regional climate models (RCMs) used are (listed as institution/working group, RCM model, driving GCM):
    Climate Service Center Germany (GERICS), REMO2009, MPI-ESM-LR
    Swedish Meteorological and Hydrological Institute (SMHI), RCA4, MPI-ESM-LR
    Climate Limited-area Modelling Community (CLMcom), CCLM4-8-17-CLM3-5, MPI-ESM-LR
    Climate Limited-area Modelling Community (CLMcom), CCLM5-0-2, MPI-ESM-LR
    Universite du Quebec a Montreal, CRCM5, MPI-ESM-LR
    Swedish Meteorological and Hydrological Institute (SMHI), RCA4, ICHEC-EC-EARTH
    Climate Limited-area Modelling Community (CLMcom), CCLM4-8-17-CLM3-5, ICHEC-EC-EARTH
    Climate Limited-area Modelling Community (CLMcom), CCLM5-0-2, ICHEC-EC-EARTH
    Swedish Meteorological and Hydrological Institute (SMHI), RCA4, NOAA-GFDL-GFDL-ESM2M
    National Center for Atmospheric Research, WRF, NOAA-GFDL-GFDL-ESM2M
    The historical runs begin 1950-01-01 (ICHEC-EC-EARTH and NOAA-GFDL-GFDL-ESM2M) or 1960-01-01 (MPI-M-MPI-ESM-LR) and end 2005-12-31. Historical runs are appended by rcp85 runs for the years 2006-01-01 to 2019-12-31. All projection runs begin 2020-01-01 and end 2099-12-31.
    Quality: Not all of the domains have been downscaled by CORDEX RCMs. Therefore, data files for scenario rcp26 only contain 7 CORDEX domains; all other files contain 8 domains (see also https://cordex.org/domains/cordex-domain-description/).
    Units: K; K; K; kg m-2 s-1; W m-2; m s-1; percent
    GeoLocation: westBoundCoordinate: -180.0; westBoundCoordinateUnit: degrees East; eastBoundCoordinate: 180.0; eastBoundCoordinateUnit: degrees East; southBoundCoordinate: -90.0; southBoundCoordinateUnit: degrees North; northBoundCoordinate: 90.0; northBoundCoordinateUnit: degrees North
    Size: ICHEC-EC-EARTH: 137.7 GByte, MPI-M-MPI-ESM-LR: 130.6 GByte, NOAA-GFDL-GFDL-ESM2M: 55.5 GByte
    Format: netCDF
    DataSources: See the file "DataSources_RCM_Table.pdf"
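
    A minimal sketch of pulling one variable out of these netCDF files once a zip archive has been unpacked; the file name below is a hypothetical placeholder and the coordinate names (`lat`, `lon`, `time`) are assumptions, while the variable name `tas` comes from the table of contents above.

    ```python
    import xarray as xr

    # Hypothetical file name for one unpacked netCDF file from a GCM/run archive.
    ds = xr.open_dataset("tas_MPI-M-MPI-ESM-LR_rcp45_0.5deg.nc")

    # Daily mean 2m-air temperature (`tas`, in K per the Units line above),
    # nearest grid cell to a chosen point, converted to degrees Celsius.
    tas_point = ds["tas"].sel(lat=52.5, lon=13.4, method="nearest") - 273.15
    print(tas_point.sel(time=slice("2020-01-01", "2020-12-31")).mean().item())
    ```
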

  18. SSMI(S) Hydrological Products

    • catalog.data.gov
    Updated Sep 19, 2023
    + more versions
    Cite
    NOAA National Centers for Environmental Information (Point of Contact) (2023). SSMI(S) Hydrological Products [Dataset]. https://catalog.data.gov/dataset/ssmis-hydrological-products2
    Explore at:
    Dataset updated
    Sep 19, 2023
    Dataset provided by
    National Oceanic and Atmospheric Administration (http://www.noaa.gov/)
    National Centers for Environmental Information (https://www.ncei.noaa.gov/)
    Description

    The Special Sensor Microwave/Imager (SSM/I) and Special Sensor Microwave Imager-Sounder (SSMIS) have been making measurements of earth-emitted microwave radiation for over two decades. Antenna temperature data from these sensors are used to derive both atmospheric and surface hydrological parameters and to generate these global monthly mean products. Specifically, this includes monthly or 5-day estimates of rainfall and its frequency, cloud liquid water and cloud frequency, water vapor, liquid water path, snow cover frequency, satellite sampling frequency and various imagery. The data collection consists of five different products: the Monthly 2.5 Degree Gridded Dataset, the Monthly 1.0 Degree Gridded Dataset, the Pentad 2.5 Degree Gridded Dataset, Imagery, and specially formatted files used in the Global Precipitation Climatology Project (GPCP) Dataset. All data in this collection (except the Pentad and Imagery datasets) extend from July 1987 until present.
    Monthly 2.5 Degree Gridded Dataset: Products in this domain are taken from early orbits and include mean cloud fraction (cfr), mean sea-ice cover (ice), liquid water path (lwp), mean rain fraction (pf2), rainfall (pr1), snow cover fraction (snw), mean sampling fraction (ssa), and mean total precipitable water (wvp). Products are averaged to 2.5 degree lat x 2.5 degree long grids.
    Imagery: This dataset portrays monthly estimates of precipitable water (wvp), liquid water path (lwp), rainfall (pr1), and snow cover in both color and black and white. This domain of the SSM/I and SSMIS archive includes images of data from January 2006 to the present.
    GPCP input files: This time series domain of the SSM/I and SSMIS archive is formatted for the production of GPCP (Global Precipitation Climatology Project) datasets. The dataset is processed using early and late orbits plus a special "dual" orbit that combines the early and late satellite data. Products include rainfall (pr1 and pr2) and mean sampling fraction (ssa). The product is averaged to 2.5 degree lat x 2.5 degree long grids.
    Binary: All data are also available in a legacy binary format, with the addition of daily and pentad products. Products in the daily domain are taken from early orbits and include mean cloud fraction (cfr), mean sea-ice cover (ice), liquid water path (lwp), mean rain fraction (pf2), rainfall (pr1), snow cover fraction (snw), mean sampling fraction (ssa), and mean total precipitable water (wvp); these are averaged to 1 degree lat x 1 degree long grids. Products in the Pentad domain are taken from specific satellite orbits and include mean cloud fraction (cfr), mean sea-ice cover (ice), liquid water path (lwp), mean rain fraction (pf1 and pf2), rainfall (pr1 and pr2), snow cover fraction (snw), mean sampling fraction (ssa), and mean total precipitable water (wvp); these are averaged to 2.5 degree lat x 2.5 degree long grids using 5-day estimates of rainfall and its frequency, cloud liquid water and cloud frequency, water vapor, liquid water path, snow cover frequency, satellite sampling frequency and various imagery. Pentad data only include data from March 2008 to present. For a complete description of these products, please refer to all associated documentation.
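
    For working with the gridded products, it can help to see how a latitude/longitude pair maps onto a 2.5-degree global grid. The sketch below is illustrative only: the grid origin and row ordering assumed here (rows north to south, columns starting at 180W) are assumptions and should be checked against the actual file headers before use.

    ```python
    def grid_index_2p5(lat: float, lon: float) -> tuple[int, int]:
        """Map (lat, lon) in degrees to (row, col) on an assumed 2.5-degree global grid."""
        row = int((90.0 - lat) // 2.5)              # 0..71, rows assumed north-to-south
        col = int(((lon + 180.0) % 360.0) // 2.5)   # 0..143, columns assumed to start at 180W
        return row, col

    # Example: a point over Colorado.
    print(grid_index_2p5(40.0, -105.0))
    ```
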

  19. Temperature change

    • kaggle.com
    Updated Nov 2, 2024
    + more versions
    Cite
    Sevgi SY (2024). Temperature change [Dataset]. https://www.kaggle.com/datasets/sevgisarac/temperature-change/data
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 2, 2024
    Dataset provided by
    Kaggle
    Authors
    Sevgi SY
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    Context

    Data description

    The FAOSTAT Temperature Change domain disseminates statistics of mean surface temperature change by country, with annual updates. The current dissemination covers the period 1961–2023. Statistics are available for monthly, seasonal and annual mean temperature anomalies, i.e., temperature change with respect to a baseline climatology, corresponding to the period 1951–1980. The standard deviation of the temperature change of the baseline methodology is also available. Data are based on the publicly available GISTEMP data, the Global Surface Temperature Change data distributed by the National Aeronautics and Space Administration Goddard Institute for Space Studies (NASA-GISS).
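
    The FAOSTAT product already ships anomalies, but the baseline logic it describes (change relative to the 1951–1980 climatology) is easy to illustrate. The sketch below uses a synthetic monthly temperature series rather than any real data.

    ```python
    import numpy as np
    import pandas as pd

    # Synthetic monthly mean temperatures (degC), 1951-2023, as a placeholder series.
    idx = pd.date_range("1951-01-01", "2023-12-01", freq="MS")
    temps = pd.Series(10.0 + np.random.randn(len(idx)), index=idx)

    # Baseline climatology: one mean per calendar month over 1951-1980.
    base = temps.loc["1951":"1980"]
    climatology = base.groupby(base.index.month).mean()

    # Anomaly = observed monthly mean minus the baseline mean for that calendar month.
    anomaly = temps - climatology.reindex(temps.index.month).to_numpy()
    print(anomaly.resample("YS").mean().tail())  # annual mean anomalies
    ```
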

    Content

    Statistical concepts and definitions

    Statistical standards: Data in the Temperature Change domain are not an explicit SEEA variable. Nonetheless, country and regional calculations employ a definition of “Land area” consistent with SEEA Land Use definitions, specifically SEEA CF Table 5.11 “Land Use Classification” and SEEA AFF Table 4.8, “Physical asset account for land use.” The Temperature Change domain of the FAOSTAT Agri-Environmental Indicators section is compliant with the Framework for the Development of Environmental Statistics (FDES 2013), contributing to FDES Component 1: Environmental Conditions and Quality, Sub-component 1.1: Physical Conditions, Topic 1.1.1: Atmosphere, climate and weather, Core set/ Tier 1 statistics a.1.

    Statistical unit: Countries and Territories.

    Statistical population: Countries and Territories.

    Reference area: Area of all the Countries and Territories of the world. In 2019: 190 countries and 37 other territorial entities.

    Code - reference area: FAOSTAT, M49, ISO2 and ISO3 (http://www.fao.org/faostat/en/#definitions). FAO Global Administrative Unit Layer (GAUL, national level, reference year 2014). FAO Geospatial data repository GeoNetwork. Permanent address: http://www.fao.org:80/geonetwork?uuid=f7e7adb0-88fd-11da-a88f-000d939bc5d8.

    Code - Number of countries/areas covered: In 2019: 190 countries and 37 other territorial entities.

    Time coverage: 1961-2023

    Periodicity: Monthly, Seasonal, Yearly

    Base period: 1951-1980

    Unit of Measure: degrees Celsius (°C)

    Reference period: Months, Seasons, Meteorological year

    Acknowledgements

    Documentation on methodology: Details on the methodology can be accessed at the Related Documents section of the Temperature Change (ET) domain in the Agri-Environmental Indicators section of FAOSTAT.

    Quality documentation: For more information on the methods, coverage, accuracy and limitations of the Temperature Change dataset please refer to the NASA GISTEMP website: https://data.giss.nasa.gov/gistemp/

    Source: http://www.fao.org/faostat/en/#data/ET/metadata

    Inspiration

    Climate change is one of the most important issues facing the world in this technological era. The best proof of this is the historical record of temperature change. You can investigate whether there is any hope of stopping global warming :)

    • Can you find any correlation between temperature change and any other variable? (ISO3 codes make it possible to merge in other countries' datasets; see the sketch after this list.)

    • Prediction of temperature change: the country list also includes an overall world temperature change series labeled 'World'.
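
    A minimal merging sketch for the first idea above, assuming hypothetical column names (`ISO3`, `Year`, `TempChange`, `Value`); the actual layout of the Kaggle CSV files is not described in this listing, so treat these names as placeholders.

    ```python
    import pandas as pd

    # Hypothetical files and columns; replace with the real layouts after inspection.
    temp = pd.read_csv("temperature_change.csv")        # e.g. columns: ISO3, Year, TempChange
    other = pd.read_csv("some_other_country_data.csv")  # e.g. columns: ISO3, Year, Value

    # Merge on country code and year, then look at the correlation across country-years.
    merged = temp.merge(other, on=["ISO3", "Year"], how="inner")
    print(merged[["TempChange", "Value"]].corr())
    ```
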

  20. NOAA's Coastal Ocean Reanalysis (CORA) Dataset: 1979-2022

    • registry.opendata.aws
    Updated Jan 18, 2025
    Cite
    NOAA’s National Ocean Service, The Center for Operational Oceanographic Products and Services (CO-OPS) (2025). NOAA's Coastal Ocean Reanalysis (CORA) Dataset: 1979-2022 [Dataset]. https://registry.opendata.aws/noaa-nos-cora/
    Explore at:
    Dataset updated
    Jan 18, 2025
    Dataset provided by
    National Ocean Service (https://oceanservice.noaa.gov/)
    National Oceanic and Atmospheric Administration (http://www.noaa.gov/)
    Description

    NOAA's Coastal Ocean Reanalysis (CORA) for the Gulf, East Coast/Atlantic, and Caribbean (GEC) is produced using verified hourly water levels from the National Ocean Service's Center for Operational Oceanographic Products and Services (CO-OPS). The ADvanced CIRCulation (ADCIRC) and Simulating WAves Nearshore (SWAN) models are coupled to model coastal water levels and nearshore waves. Hourly water level observations are used for data assimilation and validation to improve the accuracy of the modeled water level and wave datasets.

    Additional Details:
    Metadata associated with model domain and time span:

    • Timeseries - 1979 to 2022
    • Size - Approx. 44.6 TB
    • Domain - Lat 5.8 to 45.8 ; Long -98.0 to -53.8
    • Nodes - CORA Metadata Library
    • Grid cells - CORA Metadata Library
    • Spatial Resolution:
      • Centroids: 300-400 meters
      • Gridded: 500 meters
      • Projection: 1983 Contiguous USA Albers projection (EPSG:5070)

    Datasets:
    Water level and wave datasets resulting from the computation, assimilation, validation, and optimization of the reanalysis. All products are available in NetCDF (.nc) format; a short reading sketch follows the derived products list below:
    • fort.63.nc - Water level elevation
    • fort.73.nc - Atmospheric pressure at sea level
    • fort.74.nc - Wind Velocity - 10 m elevation
    • maxele.63.nc - Maximum water elevation
    • swan_DIR.63.nc - Spectral mean wave direction
    • swan_TMM10.63.nc - Spectral mean wave period
    • swan_TPS.63.nc - Spectral peak wave period
    • swan_HS.63.nc - Spectral zeroth moment wave height
    • swan_HS_max.63.nc - Maximum spectral zeroth moment wave height

    Derived Products:
    Datasets resulting from the computation, modeling, or other processing using existing/collected data. All products are available in NetCDF (.nc) format:
    • CORA-V1.1-fort.63: Hourly water levels
    • CORA-V1.1-swan_DIR.63: Hourly mean wave direction
    • CORA-V1.1-swan_TPS.63: Hourly peak wave periods
    • CORA-V1.1-swan_HS.63: Hourly significant wave heights
    • CORA-V1.1-Grid: Hourly water levels interpolated from model nodes to uniform 500-meter resolution grid
