11 datasets found
  1. Enterprise-Driven Open Source Software

    • zenodo.org
    • opendatalab.com
    • +1 more
    application/gzip
    Updated Apr 22, 2020
    + more versions
    Cite
    Diomidis Spinellis; Zoe Kotti; Konstantinos Kravvaritis; Georgios Theodorou; Panos Louridas (2020). Enterprise-Driven Open Source Software [Dataset]. http://doi.org/10.5281/zenodo.3742962
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Apr 22, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Diomidis Spinellis; Zoe Kotti; Konstantinos Kravvaritis; Georgios Theodorou; Panos Louridas
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We present a dataset of open source software developed mainly by enterprises rather than volunteers. This can be used to address known generalizability concerns, and, also, to perform research on open source business software development. Based on the premise that an enterprise's employees are likely to contribute to a project developed by their organization using the email account provided by it, we mine domain names associated with enterprises from open data sources as well as through white- and blacklisting, and use them through three heuristics to identify 17,264 enterprise GitHub projects. We provide these as a dataset detailing their provenance and properties. A manual evaluation of a dataset sample shows an identification accuracy of 89%. Through an exploratory data analysis we found that projects are staffed by a plurality of enterprise insiders, who appear to be pulling more than their weight, and that in a small percentage of relatively large projects development happens exclusively through enterprise insiders.

    The main dataset is provided as a 17,264-record tab-separated file named enterprise_projects.txt with the following 29 fields; a minimal loading sketch in Python follows the field list.

    • url: the project's GitHub URL
    • project_id: the project's GHTorrent identifier
    • sdtc: true if selected using the same domain top committers heuristic (9,016 records)
    • mcpc: true if selected using the multiple committers from a valid enterprise heuristic (8,314 records)
    • mcve: true if selected using the multiple committers from a probable company heuristic (8,015 records)
    • star_number: number of GitHub watchers
    • commit_count: number of commits
    • files: number of files in current main branch
    • lines: corresponding number of lines in text files
    • pull_requests: number of pull requests
    • github_repo_creation: timestamp of the GitHub repository creation
    • earliest_commit: timestamp of the earliest commit
    • most_recent_commit: date of the most recent commit
    • committer_count: number of different committers
    • author_count: number of different authors
    • dominant_domain: the project's dominant email domain
    • dominant_domain_committer_commits: number of commits made by committers whose email matches the project's dominant domain
    • dominant_domain_author_commits: corresponding number for commit authors
    • dominant_domain_committers: number of committers whose email matches the project's dominant domain
    • dominant_domain_authors: corresponding number for commit authors
    • cik: SEC's EDGAR "central index key"
    • fg500: true if this is a Fortune Global 500 company (2,233 records)
    • sec10k: true if the company files SEC 10-K forms (4,180 records)
    • sec20f: true if the company files SEC 20-F forms (429 records)
    • project_name: GitHub project name
    • owner_login: GitHub project's owner login
    • company_name: company name as derived from the SEC and Fortune 500 data
    • owner_company: GitHub project's owner company name
    • license: SPDX license identifier
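
    For orientation, here is a minimal Python sketch of how the file could be loaded and the heuristic flags summarized; the field names come from the list above, while the local file path and the presence of a header row are assumptions.

    import pandas as pd

    # Assumes a local copy of enterprise_projects.txt downloaded from the record
    # above; the file is tab-separated and is assumed to carry a header row.
    projects = pd.read_csv("enterprise_projects.txt", sep="\t")

    # How many projects each selection heuristic contributed (flags can overlap);
    # assumes the sdtc/mcpc/mcve flags parse as booleans or 0/1 values.
    print(projects[["sdtc", "mcpc", "mcve"]].sum())

    # The most-watched projects together with their dominant enterprise domain.
    top = projects.sort_values("star_number", ascending=False)
    print(top[["url", "dominant_domain", "star_number", "commit_count"]].head(10))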

    The file cohost_project_details.txt provides the full set of 311,223 cohort projects that are not part of the enterprise data set, but have comparable quality attributes.

    • url: the project's GitHub URL
    • project_id: the project's GHTorrent identifier
    • stars: number of GitHub watchers
    • commit_count: number of commits
  2. Hospital Management Dataset

    • kaggle.com
    Updated May 30, 2025
    Cite
    Kanak Baghel (2025). Hospital Management Dataset [Dataset]. https://www.kaggle.com/datasets/kanakbaghel/hospital-management-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 30, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Kanak Baghel
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This is a structured, multi-table dataset designed to simulate a hospital management system. It is ideal for practicing data analysis, SQL, machine learning, and healthcare analytics.

    Dataset Overview

    This dataset includes five CSV files:

    1. patients.csv – Patient demographics, contact details, registration info, and insurance data

    2. doctors.csv – Doctor profiles with specializations, experience, and contact information

    3. appointments.csv – Appointment dates, times, visit reasons, and statuses

    4. treatments.csv – Treatment types, descriptions, dates, and associated costs

    5. billing.csv – Billing amounts, payment methods, and status linked to treatments

    📁 Files & Column Descriptions

    patients.csv

    Contains patient demographic and registration details.

    • patient_id -> Unique ID for each patient
    • first_name -> Patient's first name
    • last_name -> Patient's last name
    • gender -> Gender (M/F)
    • date_of_birth -> Date of birth
    • contact_number -> Phone number
    • address -> Address of the patient
    • registration_date -> Date of first registration at the hospital
    • insurance_provider -> Insurance company name
    • insurance_number -> Policy number
    • email -> Email address

    doctors.csv

    Details about the doctors working in the hospital.

    • doctor_id -> Unique ID for each doctor
    • first_name -> Doctor's first name
    • last_name -> Doctor's last name
    • specialization -> Medical field of expertise
    • phone_number -> Contact number
    • years_experience -> Total years of experience
    • hospital_branch -> Branch of hospital where doctor is based
    • email -> Official email address

    appointments.csv

    Records of scheduled and completed patient appointments.

    • appointment_id -> Unique appointment ID
    • patient_id -> ID of the patient
    • doctor_id -> ID of the attending doctor
    • appointment_date -> Date of the appointment
    • appointment_time -> Time of the appointment
    • reason_for_visit -> Purpose of visit (e.g., checkup)
    • status -> Status (Scheduled, Completed, Cancelled)

    treatments.csv

    Information about the treatments given during appointments.

    • treatment_id -> Unique ID for each treatment
    • appointment_id -> Associated appointment ID
    • treatment_type -> Type of treatment (e.g., MRI, X-ray)
    • description -> Notes or procedure details
    • cost -> Cost of treatment
    • treatment_date -> Date when treatment was given

    billing.csv

    Billing and payment details for treatments.

    • bill_id -> Unique billing ID
    • patient_id -> ID of the billed patient
    • treatment_id -> ID of the related treatment
    • bill_date -> Date of billing
    • amount -> Total amount billed
    • payment_method -> Mode of payment (Cash, Card, Insurance)
    • payment_status -> Status of payment (Paid, Pending, Failed)
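
    To illustrate how the tables link together, here is a hedged pandas sketch that joins billing to patients and treatments and totals the billed amounts per insurance provider; file and column names are taken from the descriptions above, and local copies of the CSVs are assumed.

    import pandas as pd

    # Assumes the CSV files sit in the working directory.
    patients = pd.read_csv("patients.csv")
    treatments = pd.read_csv("treatments.csv")
    billing = pd.read_csv("billing.csv")

    # billing links to patients via patient_id and to treatments via treatment_id.
    bills = (billing
             .merge(patients[["patient_id", "insurance_provider"]], on="patient_id", how="left")
             .merge(treatments[["treatment_id", "treatment_type"]], on="treatment_id", how="left"))

    # Total billed amount per insurance provider and payment status.
    summary = (bills.groupby(["insurance_provider", "payment_status"])["amount"]
                    .sum()
                    .reset_index())
    print(summary)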

    Possible Use Cases

    SQL queries and relational database design

    Exploratory data analysis (EDA) and dashboarding

    Machine learning projects (e.g., cost prediction, no-show analysis)

    Feature engineering and data cleaning practice

    End-to-end healthcare analytics workflows

    Recommended Tools & Resources

    SQL (joins, filters, window functions)

    Pandas and Matplotlib/Seaborn for EDA

    Scikit-learn for ML models

    Pandas Profiling for automated EDA

    Plotly for interactive visualizations

    Please note:

    All data is synthetically generated for educational and project use. No real patient information is included.

    If you find this dataset helpful, consider upvoting or sharing your insights by creating a Kaggle notebook.

  3. Enterprise-Driven Open Source Software

    • data.europa.eu
    unknown
    Updated Feb 7, 2020
    Cite
    Zenodo (2020). Enterprise-Driven Open Source Software [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-3653878?locale=en
    Explore at:
    Available download formats: unknown (8339687)
    Dataset updated
    Feb 7, 2020
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We present a dataset of open source software developed mainly by enterprises rather than volunteers. This can be used to address known generalizability concerns, and, also, to perform research on open source business software development. Based on the premise that an enterprise's employees are likely to contribute to a project developed by their organization using the email account provided by it, we mine domain names associated with enterprises from open data sources as well as through white- and blacklisting, and use them through three heuristics to identify 17,252 enterprise GitHub projects. We provide these as a dataset detailing their provenance and properties. A manual evaluation of a dataset sample shows an identification accuracy of 89%. Through an exploratory data analysis we found that projects are staffed by a plurality of enterprise insiders, who appear to be pulling more than their weight, and that in a small percentage of relatively large projects development happens exclusively through enterprise insiders.

    The main dataset is provided as a 17,252 record tab-separated file named enterprise_projects.txt with the following 27 fields.

    • url: the project's GitHub URL
    • project_id: the project's GHTorrent identifier
    • sdtc: true if selected using the same domain top committers heuristic (9,006 records)
    • mcpc: true if selected using the multiple committers from a valid enterprise heuristic (8,289 records)
    • mcve: true if selected using the multiple committers from a probable company heuristic (7,990 records)
    • star_number: number of GitHub watchers
    • commit_count: number of commits
    • files: number of files in current main branch
    • lines: corresponding number of lines in text files
    • pull_requests: number of pull requests
    • most_recent_commit: date of the most recent commit
    • committer_count: number of different committers
    • author_count: number of different authors
    • dominant_domain: the project's dominant email domain
    • dominant_domain_committer_commits: number of commits made by committers whose email matches the project's dominant domain
    • dominant_domain_author_commits: corresponding number for commit authors
    • dominant_domain_committers: number of committers whose email matches the project's dominant domain
    • dominant_domain_authors: corresponding number of commit authors
    • cik: SEC's EDGAR "central index key"
    • fg500: true if this is a Fortune Global 500 company (2,232 records)
    • sec10k: true if the company files SEC 10-K forms (4,178 records)
    • sec20f: true if the company files SEC 20-F forms (429 records)
    • project_name: GitHub project name
    • owner_login: GitHub project's owner login
    • company_name: company name as derived from the SEC and Fortune 500 data
    • owner_company: GitHub project's owner company name
    • license: SPDX license identifier

    The file cohost_project_details.txt provides the full set of 309,531 cohort projects that are not part of the enterprise data set, but have comparable quality attributes.

    • url: the project's GitHub URL
    • project_id: the project's GHTorrent identifier
    • stars: number of GitHub watchers
    • commit_count: number of commits

  4. College Student Placement Factors Dataset

    • kaggle.com
    Updated Jul 2, 2025
    Cite
    Sahil Islam007 (2025). College Student Placement Factors Dataset [Dataset]. https://www.kaggle.com/datasets/sahilislam007/college-student-placement-factors-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 2, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sahil Islam007
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    📘 College Student Placement Dataset

    A realistic, large-scale synthetic dataset of 10,000 students designed to analyze factors affecting college placements.

    📄 Dataset Description

    This dataset simulates the academic and professional profiles of 10,000 college students, focusing on factors that influence placement outcomes. It includes features like IQ, academic performance, CGPA, internships, communication skills, and more.

    The dataset is ideal for:

    • Predictive modeling of placement outcomes
    • Educational exercises in classification
    • Feature importance analysis
    • End-to-end machine learning projects

    📊 Columns Description

    • College_ID: Unique ID of the college (e.g., CLG0001 to CLG0100)
    • IQ: Student's IQ score (normally distributed around 100)
    • Prev_Sem_Result: GPA from the previous semester (range: 5.0 to 10.0)
    • CGPA: Cumulative Grade Point Average (range: ~5.0 to 10.0)
    • Academic_Performance: Annual academic rating (scale: 1 to 10)
    • Internship_Experience: Whether the student has completed any internship (Yes/No)
    • Extra_Curricular_Score: Involvement in extracurriculars (score from 0 to 10)
    • Communication_Skills: Soft skill rating (scale: 1 to 10)
    • Projects_Completed: Number of academic/technical projects completed (0 to 5)
    • Placement: Final placement result (Yes = Placed, No = Not Placed)

    🎯 Target Variable

    • Placement: This is the binary classification target (Yes/No) that you can try to predict based on the other features.
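
    As a rough sketch of that task (the CSV file name below is an assumption; the column names follow the table above), a baseline classifier could look like this:

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    df = pd.read_csv("college_student_placement_dataset.csv")  # assumed file name

    # Encode the Yes/No columns as 0/1.
    for col in ["Internship_Experience", "Placement"]:
        df[col] = df[col].map({"Yes": 1, "No": 0})

    features = ["IQ", "Prev_Sem_Result", "CGPA", "Academic_Performance",
                "Internship_Experience", "Extra_Curricular_Score",
                "Communication_Skills", "Projects_Completed"]
    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df["Placement"], test_size=0.2, random_state=42)

    # Simple baseline; swap in any other classifier for comparison.
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))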

    🧠 Use Cases

    • 📈 Classification Modeling (Logistic Regression, Decision Trees, Random Forest, etc.)
    • 🔍 Exploratory Data Analysis (EDA)
    • 🎯 Feature Engineering and Selection
    • 🧪 Model Evaluation Practice
    • 👩‍🏫 Academic Projects & Capstone Use

    📦 Dataset Size

    • Rows: 10,000
    • Columns: 10
    • File Format: .csv

    📚 Context

    This dataset was generated to resemble real-world data in academic institutions for research and machine learning use. While it is synthetic, the variables and relationships are crafted to mimic authentic trends observed in student placements.

    📜 License

    MIT

    🔗 Source

    Created using Python (NumPy, Pandas) with data logic designed for educational and ML experimentation purposes.

  5. Evaluating FAIR Models for Rossmann Store Sales Prediction: Insights and Performance Analysis

    • test.researchdata.tuwien.ac.at
    bin, csv, json +1
    Updated Apr 28, 2025
    Cite
    Dilara Çakmak (2025). Evaluating FAIR Models for Rossmann Store Sales Prediction: Insights and Performance Analysis [Dataset]. http://doi.org/10.70124/f5t2d-xt904
    Explore at:
    Available download formats: csv, text/markdown, json, bin
    Dataset updated
    Apr 28, 2025
    Dataset provided by
    TU Wien
    Authors
    Dilara Çakmak
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 2025
    Description

    Context and Methodology

    Research Domain:
    The dataset is part of a project focused on retail sales forecasting. Specifically, it is designed to predict daily sales for Rossmann, a chain of over 3,000 drug stores operating across seven European countries. The project falls under the broader domain of time series analysis and machine learning applications for business optimization. The goal is to apply machine learning techniques to forecast future sales based on historical data, which includes factors like promotions, competition, holidays, and seasonal trends.

    Purpose:
    The primary purpose of this dataset is to help Rossmann store managers predict daily sales for up to six weeks in advance. By making accurate sales predictions, Rossmann can improve inventory management, staffing decisions, and promotional strategies. This dataset serves as a training set for machine learning models aimed at reducing forecasting errors and supporting decision-making processes across the company’s large network of stores.

    How the Dataset Was Created:
    The dataset was compiled from several sources, including historical sales data from Rossmann stores, promotional calendars, holiday schedules, and external factors such as competition. The data is split into multiple features, such as the store's location, promotion details, whether the store was open or closed, and weather information. The dataset is publicly available on platforms like Kaggle and was initially created for the Kaggle Rossmann Store Sales competition. The data is made accessible via an API for further analysis and modeling, and it is structured to help machine learning models predict future sales based on various input variables.

    Technical Details

    Dataset Structure:

    The dataset consists of three main files, each with its specific role:

    1. Train:
      This file contains the historical sales data, which is used to train machine learning models. It includes daily sales information for each store, as well as various features that could influence the sales (e.g., promotions, holidays, store type, etc.).

      https://handle.test.datacite.org/10.82556/yb6j-jw41
      PID: b1c59499-9c6e-42c2-af8f-840181e809db
    2. Test2:
      The test dataset mirrors the structure of train.csv but does not include the actual sales values (i.e., the target variable). This file is used for making predictions using the trained machine learning models. It is used to evaluate the accuracy of predictions when the true sales data is unknown.

      https://handle.test.datacite.org/10.82556/jerg-4b84
      PID: 7cbb845c-21dd-4b60-b990-afa8754a0dd9
    3. Store:
      This file provides metadata about each store, including information such as the store’s location, type, and assortment level. This data is essential for understanding the context in which the sales data is gathered.

      https://handle.test.datacite.org/10.82556/nqeg-gy34
      PID: 9627ec46-4ee6-4969-b14a-bda555fe34db

    Data Fields Description:

    • Id: A unique identifier for each (Store, Date) combination within the test set.

    • Store: A unique identifier for each store.

    • Sales: The daily turnover (target variable) for each store on a specific day (this is what you are predicting).

    • Customers: The number of customers visiting the store on a given day.

    • Open: An indicator of whether the store was open (1 = open, 0 = closed).

    • StateHoliday: Indicates if the day is a state holiday, with values like:

      • 'a' = public holiday,

      • 'b' = Easter holiday,

      • 'c' = Christmas,

      • '0' = no holiday.

    • SchoolHoliday: Indicates whether the store is affected by school closures (1 = yes, 0 = no).

    • StoreType: Differentiates between four types of stores: 'a', 'b', 'c', 'd'.

    • Assortment: Describes the level of product assortment in the store:

      • 'a' = basic,

      • 'b' = extra,

      • 'c' = extended.

    • CompetitionDistance: Distance (in meters) to the nearest competitor store.

    • CompetitionOpenSince[Month/Year]: The month and year when the nearest competitor store opened.

    • Promo: Indicates whether the store is running a promotion on a particular day (1 = yes, 0 = no).

    • Promo2: Indicates whether the store is participating in Promo2, a continuing promotion for some stores (1 = participating, 0 = not participating).

    • Promo2Since[Year/Week]: The year and calendar week when the store started participating in Promo2.

    • PromoInterval: Describes the months when Promo2 is active, e.g., "Feb,May,Aug,Nov" means the promotion starts in February, May, August, and November.
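
    To make the file relationships concrete, here is a hedged pandas sketch that attaches the store metadata to the sales records; it assumes local copies named train.csv and store.csv with the fields described above (retrieval through the DBRepo API is not shown), and a Date column in the usual Kaggle Rossmann layout.

    import pandas as pd

    # Local copies are assumed; column layout may differ slightly in the DBRepo export.
    train = pd.read_csv("train.csv", parse_dates=["Date"])  # historical daily sales
    store = pd.read_csv("store.csv")                        # per-store metadata

    # Each sales row carries a Store id; attach the store attributes to it.
    data = train.merge(store, on="Store", how="left")

    # Example: average sales on open days, split by store type and promo flag.
    open_days = data[data["Open"] == 1]
    print(open_days.groupby(["StoreType", "Promo"])["Sales"].mean().round(1))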

    Software Requirements

    To work with this dataset, you will need to have specific software installed, including:

    • DBRepo Authorization: This is required to access the datasets via the DBRepo API. You may need to authenticate with an API key or login credentials to retrieve the datasets.

    • Python Libraries: Key libraries for working with the dataset include:

      • pandas for data manipulation,

      • numpy for numerical operations,

      • matplotlib and seaborn for data visualization,

      • scikit-learn for machine learning algorithms.

    Additional Resources

    Several additional resources are available for working with the dataset:

    1. Presentation:
      A presentation summarizing the exploratory data analysis (EDA), feature engineering process, and key insights from the analysis is provided. This presentation also includes visualizations that help in understanding the dataset’s trends and relationships.

    2. Jupyter Notebook:
      A Jupyter notebook, titled Retail_Sales_Prediction_Capstone_Project.ipynb, is provided, which details the entire machine learning pipeline, from data loading and cleaning to model training and evaluation.

    3. Model Evaluation Results:
      The project includes a detailed evaluation of various machine learning models, including their performance metrics like training and testing scores, Mean Absolute Percentage Error (MAPE), and Root Mean Squared Error (RMSE). This allows for a comparison of model effectiveness in forecasting sales.

    4. Trained Models (.pkl files):
      The models trained during the project are saved as .pkl files. These files contain the trained machine learning models (e.g., Random Forest, Linear Regression, etc.) that can be loaded and used to make predictions without retraining the models from scratch.

    5. sample_submission.csv:
      This file is a sample submission file that demonstrates the format of predictions expected when using the trained model. The sample_submission.csv contains predictions made on the test dataset using the trained Random Forest model. It provides an example of how the output should be structured for submission.

    These resources provide a comprehensive guide to implementing and analyzing the sales forecasting model, helping you understand the data, methods, and results in greater detail.

  6. Appalachian Basin Play Fairway Analysis: Thermal Quality Analysis in Low-Temperature Geothermal Play Fairway Analysis (GPFA-AB) ThermalQualityAnalysisThermalResourceInterpolationResultsDataFilesImages.zip

    • data.wu.ac.at
    zip
    Updated Mar 6, 2018
    + more versions
    Cite
    HarvestMaster (2018). Appalachian Basin Play Fairway Analysis: Thermal Quality Analysis in Low-Temperature Geothermal Play Fairway Analysis (GPFA-AB) ThermalQualityAnalysisThermalResourceInterpolationResultsDataFilesImages.zip [Dataset]. https://data.wu.ac.at/schema/geothermaldata_org/MjQ3ZDg1ZmEtMGJkZi00ZGQ5LTlhMjAtZDg1ZTBlOTZmOWMx
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 6, 2018
    Dataset provided by
    HarvestMaster
    Description

    This collection of files are part of a larger dataset uploaded in support of Low Temperature Geothermal Play Fairway Analysis for the Appalachian Basin (GPFA-AB, DOE Project DE-EE0006726). Phase 1 of the GPFA-AB project identified potential Geothermal Play Fairways within the Appalachian basin of Pennsylvania, West Virginia and New York. This was accomplished through analysis of 4 key criteria: thermal quality, natural reservoir productivity, risk of seismicity, and heat utilization. Each of these analyses represent a distinct project task, with the fifth task encompassing combination of the 4 risks factors. Supporting data for all five tasks has been uploaded into the Geothermal Data Repository node of the National Geothermal Data System (NGDS).

    This submission comprises the data for Thermal Quality Analysis (project task 1) and includes all of the necessary shapefiles, rasters, datasets, code, and references to code repositories that were used to create the thermal resource and risk factor maps as part of the GPFA-AB project. The identified Geothermal Play Fairways are also provided with the larger dataset. Figures (.png) are provided as examples of the shapefiles and rasters. The regional standardized 1 square km grid used in the project is also provided as points (cell centers), polygons, and as a raster. Two ArcGIS toolboxes are available: 1) RegionalGridModels.tbx for creating resource and risk factor maps on the standardized grid, and 2) ThermalRiskFactorModels.tbx for use in making the thermal resource maps and cross sections. These toolboxes contain item description documentation for each model within the toolbox, and for the toolbox itself. This submission also contains three R scripts: 1) AddNewSeisFields.R to add seismic risk data to attribute tables of seismic risk, 2) StratifiedKrigingInterpolation.R for the interpolations used in the thermal resource analysis, and 3) LeaveOneOutCrossValidation.R for the cross validations used in the thermal interpolations.

    Some file descriptions make reference to various 'memos'. These are contained within the final report submitted October 16, 2015.

    Each zipped file in the submission contains an 'about' document describing the full Thermal Quality Analysis content available, along with key sources, authors, citation, use guidelines, and assumptions, with the specific file(s) contained within the .zip file highlighted.

    UPDATE: A newer version of the Thermal Quality Analysis has been added here: https://gdr.openei.org/submissions/879 (also linked below). A newer version of the Combined Risk Factor Analysis has been added here: https://gdr.openei.org/submissions/880 (also linked below).

    This is one of sixteen associated .zip files relating to thermal resource interpolation results within the Thermal Quality Analysis task of the Low Temperature Geothermal Play Fairway Analysis for the Appalachian Basin. This file contains 6 images (.png) including predicted and associated error for surface heat flow, depth to 80 degrees C, depth to 100 degrees C, temperature at 1.5 km, temperature at 2.5 km and temperature at 3.5 km.

    The sixteen files contain the results of the thermal resource interpolation as binary grid (raster) files, images (.png) of the rasters, and toolbox of ArcGIS Models used. Note that raster files ending in “pred” are the predicted mean for that resource, and files ending in “err” are the standard error of the predicted mean for that resource. Leave one out cross validation results are provided for each thermal resource.

    Several models were built in order to process the well database with outliers removed. ArcGIS toolbox ThermalRiskFactorModels contains the ArcGIS processing tools used. First, the WellClipsToWormSections model was used to clip the wells to the worm sections (interpolation regions). Then, the 1 square km gridded regions (see series of 14 Worm Based Interpolation Boundaries .zip files) along with the wells in those regions were loaded into R using the rgdal package. Then, a stratified kriging algorithm implemented in the R gstat package was used to create rasters of the predicted mean and the standard error of the predicted mean. The code used to make these rasters is called StratifiedKrigingInterpolation.R. Details about the interpolation and exploratory data analysis on the well data are provided in 9_GPFA-AB_InterpolationThermalFieldEstimation.pdf (Smith, 2015), contained within the final report.

    The output rasters from R are brought into ArcGIS for further spatial processing. First, the BufferedRasterToClippedRaster tool is used to clip the interpolations back to the Worm Sections. Then, the Mosaic tool in ArcGIS is used to merge all predicted mean rasters into a single raster, and all error rasters into a single raster for each thermal resource.

    A leave one out cross validation was performed on each of the thermal resources. The code used to implement the cross validation is provided in the R script LeaveOneOutCrossValidation.R. The results of the cross validation are given for each thermal resource.
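
    The project's own interpolation and validation are implemented in the R scripts named above (gstat stratified kriging and LeaveOneOutCrossValidation.R). Purely to illustrate the leave-one-out idea, here is a rough Python analogue that uses a Gaussian-process regressor as a stand-in for kriging on entirely hypothetical well data; it is not the project's code.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.model_selection import LeaveOneOut

    # Hypothetical well data: x/y coordinates (km) and a thermal resource value.
    rng = np.random.default_rng(0)
    coords = rng.uniform(0, 100, size=(40, 2))
    values = 25 + 0.3 * coords[:, 0] + rng.normal(0, 1.5, size=40)

    errors = []
    for train_idx, test_idx in LeaveOneOut().split(coords):
        # Refit the interpolator with one well held out, then predict that well.
        gp = GaussianProcessRegressor(normalize_y=True).fit(coords[train_idx], values[train_idx])
        pred = gp.predict(coords[test_idx])
        errors.append(float(pred[0] - values[test_idx][0]))

    # Cross-validation summary: mean error (bias) and RMSE of the interpolator.
    print("bias:", np.mean(errors), "rmse:", np.sqrt(np.mean(np.square(errors))))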

    Other tools provided in this toolbox are useful for creating cross sections of the thermal resource. ExtractThermalPropertiesToCrossSection model extracts the predicted mean and the standard error of predicted mean to the attribute table of a line of cross section. The AddExtraInfoToCrossSection model is then used to add any other desired information, such as state and county boundaries, to the cross section attribute table. These two functions can be combined as a single function, as provided by the CrossSectionExtraction model.

  7. SQL Bike Stores

    • kaggle.com
    Updated Nov 21, 2024
    Cite
    Mohamed ZRIRAK (2024). SQL Bike Stores [Dataset]. https://www.kaggle.com/datasets/mohamedzrirak/sql-bkestores
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 21, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Mohamed ZRIRAK
    License

    CC0 1.0 Universal (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Download: SQL Query

    This SQL project is focused on analyzing sales data from a relational database to gain insights into customer behavior, store performance, product sales, and the effectiveness of sales representatives. By executing a series of complex SQL queries across multiple tables, the project aggregates key metrics, such as total units sold and total revenue, and links them with customer, store, product, and staff details.

    Key Objectives:

    • Customer Analysis: Understand customer purchasing patterns by analyzing the total number of units and revenue generated per customer.
    • Product and Category Insights: Evaluate product performance and its category's impact on overall sales.
    • Store Performance: Identify which stores generate the most revenue and handle the highest sales volume.
    • Sales Representative Effectiveness: Assess the performance of sales representatives by linking sales data with each representative's handled orders.

    Techniques Used:

    • SQL Joins: The project integrates data from multiple tables, including orders, customers, order_items, products, categories, stores, and staffs, using INNER JOIN to merge information from related tables.
    • Aggregation: SUM functions are used to compute total units sold and revenue generated by each order, providing valuable insights into sales performance.
    • Grouping: Data is grouped by order ID, customer, product, store, and sales representative, ensuring accurate and summarized sales metrics.

    Use Cases:

    • Business Decision-Making: The analysis can help businesses identify high-performing products and stores, optimize inventory, and evaluate the impact of sales teams.
    • Market Segmentation: Segment customers based on geographic location (city/state) and identify patterns in purchasing behavior.
    • Sales Strategy Optimization: Provide recommendations to improve sales strategies by analyzing product categories and sales rep performance.
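
    A hedged sketch of the kind of query described above: the join keys and the quantity/list_price columns follow the common BikeStores sample schema and are assumptions about this particular dump, as is the local SQLite copy of the database.

    import sqlite3
    import pandas as pd

    # Revenue and units per order, joined with customer and store details.
    query = """
    SELECT o.order_id,
           c.first_name || ' ' || c.last_name    AS customer,
           s.store_name,
           SUM(oi.quantity)                      AS total_units,
           SUM(oi.quantity * oi.list_price)      AS total_revenue
    FROM orders o
    JOIN customers   c  ON c.customer_id = o.customer_id
    JOIN stores      s  ON s.store_id    = o.store_id
    JOIN order_items oi ON oi.order_id   = o.order_id
    GROUP BY o.order_id, customer, s.store_name
    ORDER BY total_revenue DESC;
    """

    # Assumes a local SQLite copy of the bike-stores database.
    with sqlite3.connect("bike_stores.db") as conn:
        print(pd.read_sql(query, conn).head(10))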

  8. Plasma Proteomics Analysis of the Cooperative Health Research in South Tyrol (CHRIS) Study, QC Samples

    • ebi.ac.uk
    Updated Aug 7, 2024
    + more versions
    Cite
    Clemens Dierks (2024). Plasma Proteomics Analysis of the Cooperative Health Research in South Tyrol (CHRIS) Study, QC Samples [Dataset]. https://www.ebi.ac.uk/pride/archive/projects/PXD052892
    Explore at:
    Dataset updated
    Aug 7, 2024
    Authors
    Clemens Dierks
    Variables measured
    Proteomics
    Description

    The Cooperative Health Research in South Tyrol (CHRIS) study is a single-site population-based study aimed to investigate the genetic and molecular basis of common age-related chronic conditions and their interaction with lifestyle and environment. In recent work, we have been evaluating the impact of age, sex and diet, amongst others, on the human metabolome. Moreover, gene-metabolite associations, as well as genetic and metabolomic determinants of disease, have been investigated. Using Scanning SWATH acquisition on Triple TOF 6600 instruments (Sciex) we created a MS-based plasma proteomics data set with low technical variability for n = 3,632 CHRIS participants. We performed a general exploratory analysis of the data set to identify relevant factors affecting the plasma proteome, including commonly used drugs. We found hormonal contraceptives to be the main factor explaining the variation in this data set. Here we present the commercial plasma samples used as quality controls.

  9. Klib library python

    • kaggle.com
    Updated Jan 11, 2021
    Cite
    Sripaad Srinivasan (2021). Klib library python [Dataset]. https://www.kaggle.com/sripaadsrinivasan/klib-library-python/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 11, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sripaad Srinivasan
    Description

    The klib library enables us to quickly visualize missing data, perform data cleaning, plot data distributions, plot correlations, and visualize categorical column values. klib is a Python library for importing, cleaning, analyzing and preprocessing data. Explanations of key functionalities can be found on Medium / TowardsDataScience in the examples section or on YouTube (Data Professor).

    Original Github repo


    Usage

    !pip install klib

    import klib
    import pandas as pd

    df = pd.DataFrame(data)  # replace `data` with your own dataset

    # klib functions for visualizing datasets
    klib.cat_plot(df)         # visualizes the number and frequency of categorical features
    klib.corr_mat(df)         # returns a color-encoded correlation matrix
    klib.corr_plot(df)        # returns a color-encoded heatmap, ideal for correlations
    klib.dist_plot(df)        # returns a distribution plot for every numeric feature
    klib.missingval_plot(df)  # returns a figure containing information about missing values

    Examples

    Take a look at this starter notebook.

    Further examples, as well as applications of the functions can be found here.

    Contributing

    Pull requests and ideas, especially for further functions are welcome. For major changes or feedback, please open an issue first to discuss what you would like to change. Take a look at this Github repo.

    License

    MIT

  10. McKinsey Solve Assessment Data (2018–2025)

    • kaggle.com
    Updated May 7, 2025
    Cite
    Oluwademilade Adeniyi (2025). McKinsey Solve Assessment Data (2018–2025) [Dataset]. http://doi.org/10.34740/kaggle/dsv/11720554
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 7, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Oluwademilade Adeniyi
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    McKinsey Solve Global Assessment Dataset (2018–2025)

    🧠 Context

    McKinsey's Solve is a gamified problem-solving assessment used globally in the consulting firm’s recruitment process. This dataset simulates assessment results across geographies, education levels, and roles over a 7-year period. It aims to provide deep insights into performance trends, candidate readiness, resume quality, and cognitive task outcomes.

    📌 Inspiration & Purpose

    Inspired by McKinsey's real-world assessment framework, this dataset was designed to enable:

    • Exploratory Data Analysis (EDA)
    • Recruitment trend analysis
    • Gamified performance modelling
    • Dashboard development in Excel / Power BI
    • Resume and education impact evaluation
    • Regional performance benchmarking
    • Data storytelling for portfolio projects

    Whether you're building dashboards or training models, this dataset offers practical and relatable data for HR analytics and consulting use cases.

    🔍 Dataset Source

    • Data generated by Oluwademilade Adeniyi (Demibolt) with the assistance of ChatGPT by OpenAI
    • Structure and logic inspired by McKinsey's public-facing Solve information, including role categories, game types (Ecosystem, Redrock, Seawolf), education tiers, and global office locations
    • The entire dataset is synthetic and designed for analytical learning, ethical use, and professional development

    🧾 Dataset Structure

    This dataset includes 4,000 rows and the following columns:

    • Testtaker ID: Unique identifier
    • Country / Region: Geographic segmentation
    • Gender / Age: Demographics
    • Year: Assessment year (2018–2025)
    • Highest Level of Education: From high school to PhD / MBA
    • School or University Attended: Mapped to country and education level
    • First-generation University Student: Yes/No
    • Employment Status: Student, Employed, Unemployed
    • Role Applied For and Department / Interest: Business/tech disciplines
    • Past Test Taker: Indicates repeat attempts
    • Prepared with Online Materials: Indicates test prep involvement
    • Desired Office Location: Mapped to McKinsey's international offices
    • Ecosystem / Redrock / Seawolf (%): Game performance scores
    • Time Spent on Each Game (mins)
    • Total Product Score: Average of the 3 game scores
    • Process Score: A secondary assessment component
    • Resume Score: Scored based on education prestige, role fit, and clarity
    • Total Assessment Score (%): Final decision metric
    • Status (Pass/Fail): Based on total score ≥ 75%
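
    As one example of the benchmarking use case, here is a small pandas sketch that computes pass rates by region and education level; the CSV file name and exact column labels are assumptions based on the structure above.

    import pandas as pd

    # File name and exact column labels are assumptions; adjust to the actual CSV.
    df = pd.read_csv("mckinsey_solve_assessment.csv")

    # Pass rate by region and education level; the status column holds "Pass"/"Fail".
    df["passed"] = (df["Status"] == "Pass").astype(int)
    pass_rates = (df.groupby(["Region", "Highest Level of Education"])["passed"]
                    .mean()
                    .round(3)
                    .sort_values(ascending=False))
    print(pass_rates.head(10))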

    ✅ Why Use This Dataset

    • Benchmark educational and regional trends in global assessments
    • Build KPI cards, donut charts, histograms, or speedometer visuals
    • Train pass/fail classifiers or regression models
    • Segment job applicants by role, location, or game behaviour
    • Showcase portfolio skills across Excel, SQL, Power BI, Python, or R
    • Test dashboards or predictive logic in a business-relevant scenario

    💡 Credit & Collaboration

    • Data Creator: Oluwademilade Adeniyi (Me) (LinkedIn, Twitter, GitHub, Medium)
    • Collaborator: ChatGPT by OpenAI
    • Inspired by: McKinsey & Company’s Solve Assessment
  11. Reproductive Health In india

    • kaggle.com
    Updated Feb 15, 2025
    Cite
    AKshay (2025). Reproductive Health In india [Dataset]. https://www.kaggle.com/datasets/ak0212/reproductive-health-in-india/versions/1
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 15, 2025
    Dataset provided by
    Kaggle
    Authors
    AKshay
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Area covered
    India
    Description

    The "Reproductive Health" dataset on Kaggle provides an in-depth view of various factors impacting reproductive health across different populations. It includes demographic information such as age, marital status, and educational background, as well as health-related data like contraceptive use, medical conditions, and fertility history. This dataset is particularly useful for data analysts, researchers, and public health professionals aiming to understand trends in reproductive health and identify patterns or associations between lifestyle, medical history, and reproductive health outcomes.

    The dataset enables users to explore key questions in reproductive health, such as how socioeconomic factors influence family planning choices or how health conditions may correlate with fertility. It can be applied to various types of analysis, including statistical modeling, machine learning algorithms, and predictive analysis. For example, analysts can use this dataset to build classification models that predict contraceptive use or explore regression models to understand factors contributing to reproductive health outcomes.

    The dataset includes several attributes related to individual health profiles, such as whether individuals have previously experienced pregnancies, their contraceptive methods, and other relevant health conditions. It also provides valuable demographic details that can support intersectional analysis, examining how different factors like age, education, and income level impact reproductive health decisions.

    With this dataset, one can also conduct exploratory data analysis (EDA), build visualizations, and identify correlations between variables such as health conditions, lifestyle choices, and reproductive outcomes. Additionally, it can serve as a base for conducting hypothesis testing to validate assumptions about reproductive health patterns.

    For those interested in public health research or working on health data science projects, the dataset offers a comprehensive foundation for analyzing reproductive health issues. It can be particularly beneficial for projects focused on improving access to family planning services, promoting awareness of reproductive health issues, or creating predictive tools for healthcare interventions.

    The "Reproductive Health" dataset is a valuable resource for anyone involved in data-driven public health research, machine learning, or statistical modeling in the context of reproductive health. It is accessible for both beginner and advanced data scientists, offering diverse possibilities for analysis and insights that can have a real-world impact on public health policies and interventions.

