11 datasets found

Enterprise-Driven Open Source Software
zenodo.org
opendatalab.com
+1more
application/gzip
Updated Apr 22, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Diomidis Spinellis; Diomidis Spinellis; Zoe Kotti; Zoe Kotti; Konstantinos Kravvaritis; Konstantinos Kravvaritis; Georgios Theodorou; Georgios Theodorou; Panos Louridas; Panos Louridas (2020). Enterprise-Driven Open Source Software [Dataset]. http://doi.org/10.5281/zenodo.3742962
Explore at:
application/gzipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.3742962
Dataset updated
Apr 22, 2020
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Diomidis Spinellis; Diomidis Spinellis; Zoe Kotti; Zoe Kotti; Konstantinos Kravvaritis; Konstantinos Kravvaritis; Georgios Theodorou; Georgios Theodorou; Panos Louridas; Panos Louridas
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We present a dataset of open source software developed mainly by enterprises rather than volunteers. This can be used to address known generalizability concerns, and, also, to perform research on open source business software development. Based on the premise that an enterprise's employees are likely to contribute to a project developed by their organization using the email account provided by it, we mine domain names associated with enterprises from open data sources as well as through white- and blacklisting, and use them through three heuristics to identify 17,264 enterprise GitHub projects. We provide these as a dataset detailing their provenance and properties. A manual evaluation of a dataset sample shows an identification accuracy of 89%. Through an exploratory data analysis we found that projects are staffed by a plurality of enterprise insiders, who appear to be pulling more than their weight, and that in a small percentage of relatively large projects development happens exclusively through enterprise insiders.

The main dataset is provided as a 17,264 record tab-separated file named enterprise_projects.txt with the following 29 fields.

url: the project's GitHub URL

project_id: the project's GHTorrent identifier

sdtc: true if selected using the same domain top committers heuristic (9,016 records)

mcpc: true if selected using the multiple committers from a valid enterprise heuristic (8,314 records)

mcve: true if selected using the multiple committers from a probable company heuristic (8,015 records),

star_number: number of GitHub watchers

commit_count: number of commits

files: number of files in current main branch

lines: corresponding number of lines in text files

pull_requests: number of pull requests

github_repo_creation: timestamp of the GitHub repository creation

earliest_commit: timestamp of the earliest commit

most_recent_commit: date of the most recent commit

committer_count: number of different committers

author_count: number of different authors

dominant_domain: the projects dominant email domain

dominant_domain_committer_commits: number of commits made by committers whose email matches the project's dominant domain

dominant_domain_author_commits: corresponding number for commit authors

dominant_domain_committers: number of committers whose email matches the project's dominant domain

dominant_domain_authors: corresponding number for commit authors

cik: SEC's EDGAR "central index key"

fg500: true if this is a Fortune Global 500 company (2,233 records)

sec10k: true if the company files SEC 10-K forms (4,180 records)

sec20f: true if the company files SEC 20-F forms (429 records)

project_name: GitHub project name

owner_login: GitHub project's owner login

company_name: company name as derived from the SEC and Fortune 500 data

owner_company: GitHub project's owner company name

license: SPDX license identifier

The file cohost_project_details.txt provides the full set of 311,223 cohort projects that are not part of the enterprise data set, but have comparable quality attributes.

url: the project's GitHub URL

project_id: the project's GHTorrent identifier

stars: number of GitHub watchers

commit_count: number of commits
Hospital Management Dataset
kaggle.com
Updated May 30, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kanak Baghel (2025). Hospital Management Dataset [Dataset]. https://www.kaggle.com/datasets/kanakbaghel/hospital-management-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
May 30, 2025
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Kanak Baghel
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
This is a structured, multi-table dataset designed to simulate a hospital management system. It is ideal for practicing data analysis, SQL, machine learning, and healthcare analytics.

Dataset Overview

This dataset includes five CSV files:

patients.csv – Patient demographics, contact details, registration info, and insurance data

doctors.csv – Doctor profiles with specializations, experience, and contact information

appointments.csv – Appointment dates, times, visit reasons, and statuses

treatments.csv – Treatment types, descriptions, dates, and associated costs

billing.csv – Billing amounts, payment methods, and status linked to treatments

📁 Files & Column Descriptions

** patients.csv**

Contains patient demographic and registration details.

Column Description

patient_id -> Unique ID for each patient first_name -> Patient's first name last_name -> Patient's last name gender -> Gender (M/F) date_of_birth -> Date of birth contact_number -> Phone number address -> Address of the patient registration_date -> Date of first registration at the hospital insurance_provider -> Insurance company name insurance_number -> Policy number email -> Email address

** doctors.csv**

Details about the doctors working in the hospital.

Column Description

doctor_id -> Unique ID for each doctor first_name -> Doctor's first name last_name -> Doctor's last name specialization -> Medical field of expertise phone_number -> Contact number years_experience -> Total years of experience hospital_branch -> Branch of hospital where doctor is based email -> Official email address

appointments.csv

Records of scheduled and completed patient appointments.

Column Description

appointment_id -> Unique appointment ID patient_id -> ID of the patient doctor_id -> ID of the attending doctor appointment_date -> Date of the appointment appointment_time -> Time of the appointment reason_for_visit -> Purpose of visit (e.g., checkup) status -> Status (Scheduled, Completed, Cancelled)

treatments.csv

Information about the treatments given during appointments.

Column Description

treatment_id -> Unique ID for each treatment appointment_id -> Associated appointment ID treatment_type -> Type of treatment (e.g., MRI, X-ray) description -> Notes or procedure details cost -> Cost of treatment treatment_date -> Date when treatment was given

** billing.csv**

Billing and payment details for treatments.

Column Description

bill_id -> Unique billing ID patient_id -> ID of the billed patient treatment_id -> ID of the related treatment bill_date -> Date of billing amount -> Total amount billed payment_method -> Mode of payment (Cash, Card, Insurance) payment_status -> Status of payment (Paid, Pending, Failed)

Possible Use Cases

SQL queries and relational database design

Exploratory data analysis (EDA) and dashboarding

Machine learning projects (e.g., cost prediction, no-show analysis)

Feature engineering and data cleaning practice

End-to-end healthcare analytics workflows

Recommended Tools & Resources

SQL (joins, filters, window functions)

Pandas and Matplotlib/Seaborn for EDA

Scikit-learn for ML models

Pandas Profiling for automated EDA

Plotly for interactive visualizations

Please Note that :

All data is synthetically generated for educational and project use. No real patient information is included.

If you find this dataset helpful, consider upvoting or sharing your insights by creating a Kaggle notebook.
Enterprise-Driven Open Source Software
data.europa.eu
unknown
Updated Feb 7, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zenodo (2020). Enterprise-Driven Open Source Software [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-3653878?locale=en
Explore at:
unknown(8339687)Available download formats
Dataset updated
Feb 7, 2020
Dataset authored and provided by
Zenodohttp://zenodo.org/
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We present a dataset of open source software developed mainly by enterprises rather than volunteers. This can be used to address known generalizability concerns, and, also, to perform research on open source business software development. Based on the premise that an enterprise's employees are likely to contribute to a project developed by their organization using the email account provided by it, we mine domain names associated with enterprises from open data sources as well as through white- and blacklisting, and use them through three heuristics to identify 17,252 enterprise GitHub projects. We provide these as a dataset detailing their provenance and properties. A manual evaluation of a dataset sample shows an identification accuracy of 89%. Through an exploratory data analysis we found that projects are staffed by a plurality of enterprise insiders, who appear to be pulling more than their weight, and that in a small percentage of relatively large projects development happens exclusively through enterprise insiders. The main dataset is provided as a 17,252 record tab-separated file named enterprise_projects.txt with the following 27 fields. url: the project's GitHub URL project_id: the project's GHTorrent identifier sdtc: true if selected using the same domain top committers heuristic (9,006 records) mcpc: true if selected using the multiple committers from a valid enterprise heuristic (8,289 records) mcve: true if selected using the multiple committers from a probable company heuristic (7,990 records), star_number: number of GitHub watchers commit_count: number of commits files: number of files in current main branch lines: corresponding number of lines in text files pull_requests: number of pull requests most_recent_commit: date of the most recent commit committer_count: number of different committers author_count: number of different authors dominant_domain: the projects dominant email domain dominant_domain_committer_commits: number of commits made by committers whose email matches the project's dominant domain dominant_domain_author_commits: corresponding number for commit authors dominant_domain_committers: number of committers whose email matches the project's dominant domain dominant_domain_authors: corresponding number of commit authors cik: SEC's EDGAR "central index key" fg500: true if this is a Fortune Global 500 company (2,232 records) sec10k: true if the company files SEC 10-K forms (4,178 records) sec20f: true if the company files SEC 20-F forms (429 records) project_name: GitHub project name owner_login: GitHub project's owner login company_name: company name as derived from the SEC and Fortune 500 data owner_company: GitHub project's owner company name license: SPDX license identifier The file cohost_project_details.txt provides the full set of 309,531 cohort projects that are not part of the enterprise data set, but have comparable quality attributes. url: the project's GitHub URL project_id: the project's GHTorrent identifier stars: number of GitHub watchers commit_count: number of commits

College Student Placement Factors Dataset

kaggle.com

Updated Jul 2, 2025

Facebook

Twitter

Click to copy link

Link copied

Cite

Sahil Islam007 (2025). College Student Placement Factors Dataset [Dataset]. https://www.kaggle.com/datasets/sahilislam007/college-student-placement-factors-dataset

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Jul 2, 2025

Dataset provided by

Kagglehttp://kaggle.com/

Authors

Sahil Islam007

License

MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically

Description

📘 College Student Placement Dataset

A realistic, large-scale synthetic dataset of 10,000 students designed to analyze factors affecting college placements.

📄 Dataset Description

This dataset simulates the academic and professional profiles of 10,000 college students, focusing on factors that influence placement outcomes. It includes features like IQ, academic performance, CGPA, internships, communication skills, and more.

The dataset is ideal for:

Predictive modeling of placement outcomes
Educational exercises in classification
Feature importance analysis
End-to-end machine learning projects

📊 Columns Description

Column Name	Description
College_ID	Unique ID of the college (e.g., CLG0001 to CLG0100)
IQ	Student’s IQ score (normally distributed around 100)
Prev_Sem_Result	GPA from the previous semester (range: 5.0 to 10.0)
CGPA	Cumulative Grade Point Average (range: ~5.0 to 10.0)
Academic_Performance	Annual academic rating (scale: 1 to 10)
Internship_Experience	Whether the student has completed any internship (Yes/No)
Extra_Curricular_Score	Involvement in extracurriculars (score from 0 to 10)
Communication_Skills	Soft skill rating (scale: 1 to 10)
Projects_Completed	Number of academic/technical projects completed (0 to 5)
Placement	Final placement result (Yes = Placed, No = Not Placed)

🎯 Target Variable

Placement: This is the binary classification target (Yes/No) that you can try to predict based on the other features.

🧠 Use Cases

📈 Classification Modeling (Logistic Regression, Decision Trees, Random Forest, etc.)
🔍 Exploratory Data Analysis (EDA)
🎯 Feature Engineering and Selection
🧪 Model Evaluation Practice
👩‍🏫 Academic Projects & Capstone Use

📦 Dataset Size

Rows: 10,000
Columns: 10
File Format: .csv

📚 Context

This dataset was generated to resemble real-world data in academic institutions for research and machine learning use. While it is synthetic, the variables and relationships are crafted to mimic authentic trends observed in student placements.

📜 License

MIT

🔗 Source

Created using Python (NumPy, Pandas) with data logic designed for educational and ML experimentation purposes.

t
Evaluating FAIR Models for Rossmann Store Sales Prediction: Insights and...
test.researchdata.tuwien.ac.at
bin, csv, json +1
Updated Apr 28, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dilara Çakmak; Dilara Çakmak; Dilara Çakmak; Dilara Çakmak (2025). Evaluating FAIR Models for Rossmann Store Sales Prediction: Insights and Performance Analysis [Dataset]. http://doi.org/10.70124/f5t2d-xt904
Explore at:
csv, text/markdown, json, binAvailable download formats
Unique identifier
https://doi.org/10.70124/f5t2d-xt904
Dataset updated
Apr 28, 2025
Dataset provided by
TU Wien
Authors
Dilara Çakmak; Dilara Çakmak; Dilara Çakmak; Dilara Çakmak
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Apr 2025
Description
Context and Methodology

Research Domain:
The dataset is part of a project focused on retail sales forecasting. Specifically, it is designed to predict daily sales for Rossmann, a chain of over 3,000 drug stores operating across seven European countries. The project falls under the broader domain of time series analysis and machine learning applications for business optimization. The goal is to apply machine learning techniques to forecast future sales based on historical data, which includes factors like promotions, competition, holidays, and seasonal trends.

Purpose:
The primary purpose of this dataset is to help Rossmann store managers predict daily sales for up to six weeks in advance. By making accurate sales predictions, Rossmann can improve inventory management, staffing decisions, and promotional strategies. This dataset serves as a training set for machine learning models aimed at reducing forecasting errors and supporting decision-making processes across the company’s large network of stores.

How the Dataset Was Created:
The dataset was compiled from several sources, including historical sales data from Rossmann stores, promotional calendars, holiday schedules, and external factors such as competition. The data is split into multiple features, such as the store's location, promotion details, whether the store was open or closed, and weather information. The dataset is publicly available on platforms like Kaggle and was initially created for the Kaggle Rossmann Store Sales competition. The data is made accessible via an API for further analysis and modeling, and it is structured to help machine learning models predict future sales based on various input variables.

Technical Details

Dataset Structure:

The dataset consists of three main files, each with its specific role:

Train:
This file contains the historical sales data, which is used to train machine learning models. It includes daily sales information for each store, as well as various features that could influence the sales (e.g., promotions, holidays, store type, etc.).

https://handle.test.datacite.org/10.82556/yb6j-jw41
PID: b1c59499-9c6e-42c2-af8f-840181e809db

Test2:
The test dataset mirrors the structure of train.csv but does not include the actual sales values (i.e., the target variable). This file is used for making predictions using the trained machine learning models. It is used to evaluate the accuracy of predictions when the true sales data is unknown.

https://handle.test.datacite.org/10.82556/jerg-4b84
PID: 7cbb845c-21dd-4b60-b990-afa8754a0dd9

Store:
This file provides metadata about each store, including information such as the store’s location, type, and assortment level. This data is essential for understanding the context in which the sales data is gathered.

https://handle.test.datacite.org/10.82556/nqeg-gy34
PID: 9627ec46-4ee6-4969-b14a-bda555fe34db

Data Fields Description:

Id: A unique identifier for each (Store, Date) combination within the test set.

Store: A unique identifier for each store.

Sales: The daily turnover (target variable) for each store on a specific day (this is what you are predicting).

Customers: The number of customers visiting the store on a given day.

Open: An indicator of whether the store was open (1 = open, 0 = closed).

StateHoliday: Indicates if the day is a state holiday, with values like:

'a' = public holiday,

'b' = Easter holiday,

'c' = Christmas,

'0' = no holiday.

SchoolHoliday: Indicates whether the store is affected by school closures (1 = yes, 0 = no).

StoreType: Differentiates between four types of stores: 'a', 'b', 'c', 'd'.

Assortment: Describes the level of product assortment in the store:

'a' = basic,

'b' = extra,

'c' = extended.

CompetitionDistance: Distance (in meters) to the nearest competitor store.

CompetitionOpenSince[Month/Year]: The month and year when the nearest competitor store opened.

Promo: Indicates whether the store is running a promotion on a particular day (1 = yes, 0 = no).

Promo2: Indicates whether the store is participating in Promo2, a continuing promotion for some stores (1 = participating, 0 = not participating).

Promo2Since[Year/Week]: The year and calendar week when the store started participating in Promo2.

PromoInterval: Describes the months when Promo2 is active, e.g., "Feb,May,Aug,Nov" means the promotion starts in February, May, August, and November.

Software Requirements

To work with this dataset, you will need to have specific software installed, including:

DBRepo Authorization: This is required to access the datasets via the DBRepo API. You may need to authenticate with an API key or login credentials to retrieve the datasets.

Python Libraries: Key libraries for working with the dataset include:

pandas for data manipulation,

numpy for numerical operations,

matplotlib and seaborn for data visualization,

scikit-learn for machine learning algorithms.

Additional Resources

Several additional resources are available for working with the dataset:

Presentation:
A presentation summarizing the exploratory data analysis (EDA), feature engineering process, and key insights from the analysis is provided. This presentation also includes visualizations that help in understanding the dataset’s trends and relationships.

Jupyter Notebook:
A Jupyter notebook, titled Retail_Sales_Prediction_Capstone_Project.ipynb, is provided, which details the entire machine learning pipeline, from data loading and cleaning to model training and evaluation.

Model Evaluation Results:
The project includes a detailed evaluation of various machine learning models, including their performance metrics like training and testing scores, Mean Absolute Percentage Error (MAPE), and Root Mean Squared Error (RMSE). This allows for a comparison of model effectiveness in forecasting sales.

Trained Models (.pkl files):
The models trained during the project are saved as .pkl files. These files contain the trained machine learning models (e.g., Random Forest, Linear Regression, etc.) that can be loaded and used to make predictions without retraining the models from scratch.

sample_submission.csv:
This file is a sample submission file that demonstrates the format of predictions expected when using the trained model. The sample_submission.csv contains predictions made on the test dataset using the trained Random Forest model. It provides an example of how the output should be structured for submission.

These resources provide a comprehensive guide to implementing and analyzing the sales forecasting model, helping you understand the data, methods, and results in greater detail.
w
Appalachian Basin Play Fairway Analysis: Thermal Quality Analysis in...
data.wu.ac.at
zip
Updated Mar 6, 2018
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
HarvestMaster (2018). Appalachian Basin Play Fairway Analysis: Thermal Quality Analysis in Low-Temperature Geothermal Play Fairway Analysis (GPFA-AB) ThermalQualityAnalysisThermalResourceInterpolationResultsDataFilesImages.zip [Dataset]. https://data.wu.ac.at/schema/geothermaldata_org/MjQ3ZDg1ZmEtMGJkZi00ZGQ5LTlhMjAtZDg1ZTBlOTZmOWMx
Explore at:
zipAvailable download formats
Dataset updated
Mar 6, 2018
Dataset provided by
HarvestMaster
Area covered
936274585b978c6848894628fe23e43e4d0f7b86
Description
This collection of files are part of a larger dataset uploaded in support of Low Temperature Geothermal Play Fairway Analysis for the Appalachian Basin (GPFA-AB, DOE Project DE-EE0006726). Phase 1 of the GPFA-AB project identified potential Geothermal Play Fairways within the Appalachian basin of Pennsylvania, West Virginia and New York. This was accomplished through analysis of 4 key criteria: thermal quality, natural reservoir productivity, risk of seismicity, and heat utilization. Each of these analyses represent a distinct project task, with the fifth task encompassing combination of the 4 risks factors. Supporting data for all five tasks has been uploaded into the Geothermal Data Repository node of the National Geothermal Data System (NGDS).

This submission comprises the data for Thermal Quality Analysis (project task 1) and includes all of the necessary shapefiles, rasters, datasets, code, and references to code repositories that were used to create the thermal resource and risk factor maps as part of the GPFA-AB project. The identified Geothermal Play Fairways are also provided with the larger dataset. Figures (.png) are provided as examples of the shapefiles and rasters. The regional standardized 1 square km grid used in the project is also provided as points (cell centers), polygons, and as a raster. Two ArcGIS toolboxes are available: 1) RegionalGridModels.tbx for creating resource and risk factor maps on the standardized grid, and 2) ThermalRiskFactorModels.tbx for use in making the thermal resource maps and cross sections. These toolboxes contain item description documentation for each model within the toolbox, and for the toolbox itself. This submission also contains three R scripts: 1) AddNewSeisFields.R to add seismic risk data to attribute tables of seismic risk, 2) StratifiedKrigingInterpolation.R for the interpolations used in the thermal resource analysis, and 3) LeaveOneOutCrossValidation.R for the cross validations used in the thermal interpolations.

Some file descriptions make reference to various 'memos'. These are contained within the final report submitted October 16, 2015.

Each zipped file in the submission contains an 'about' document describing the full Thermal Quality Analysis content available, along with key sources, authors, citation, use guidelines, and assumptions, with the specific file(s) contained within the .zip file highlighted.

UPDATE: Newer version of the Thermal Quality Analysis has been added here: https://gdr.openei.org/submissions/879 (Also linked below) Newer version of the Combined Risk Factor Analysis has been added here: https://gdr.openei.org/submissions/880 (Also linked below) This is one of sixteen associated .zip files relating to thermal resource interpolation results within the Thermal Quality Analysis task of the Low Temperature Geothermal Play Fairway Analysis for the Appalachian Basin. This file contains 6 images (.png) including predicted and associated error for surface heat flow, depth to 80 degrees C, depth to 100 degrees C, temperature at 1.5 km, temperature at 2.5 km and temperature at 3.5 km.

The sixteen files contain the results of the thermal resource interpolation as binary grid (raster) files, images (.png) of the rasters, and toolbox of ArcGIS Models used. Note that raster files ending in “pred” are the predicted mean for that resource, and files ending in “err” are the standard error of the predicted mean for that resource. Leave one out cross validation results are provided for each thermal resource.

Several models were built in order to process the well database with outliers removed. ArcGIS toolbox ThermalRiskFactorModels contains the ArcGIS processing tools used. First, the WellClipsToWormSections model was used to clip the wells to the worm sections (interpolation regions). Then, the 1 square km gridded regions (see series of 14 Worm Based Interpolation Boundaries .zip files) along with the wells in those regions were loaded into R using the rgdal package. Then, a stratified kriging algorithm implemented in the R gstat package was used to create rasters of the predicted mean and the standard error of the predicted mean. The code used to make these rasters is called StratifiedKrigingInterpolation.R Details about the interpolation, and exploratory data analysis on the well data is provided in 9_GPFA-AB_InterpolationThermalFieldEstimation.pdf (Smith, 2015), contained within the final report.

The output rasters from R are brought into ArcGIS for further spatial processing. First, the BufferedRasterToClippedRaster tool is used to clip the interpolations back to the Worm Sections. Then, the Mosaic tool in ArcGIS is used to merge all predicted mean rasters into a single raster, and all error rasters into a single raster for each thermal resource.

A leave one out cross validation was performed on each of the thermal resources. The code used to implement the cross validation is provided in the R script LeaveOneOutCrossValidation.R. The results of the cross validation are given for each thermal resource.

Other tools provided in this toolbox are useful for creating cross sections of the thermal resource. ExtractThermalPropertiesToCrossSection model extracts the predicted mean and the standard error of predicted mean to the attribute table of a line of cross section. The AddExtraInfoToCrossSection model is then used to add any other desired information, such as state and county boundaries, to the cross section attribute table. These two functions can be combined as a single function, as provided by the CrossSectionExtraction model.
SQL Bike Stores
kaggle.com
Updated Nov 21, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mohamed ZRIRAK (2024). SQL Bike Stores [Dataset]. https://www.kaggle.com/datasets/mohamedzrirak/sql-bkestores
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Nov 21, 2024
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Mohamed ZRIRAK
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Download: SQL Query This SQL project is focused on analyzing sales data from a relational database to gain insights into customer behavior, store performance, product sales, and the effectiveness of sales representatives. By executing a series of complex SQL queries across multiple tables, the project aggregates key metrics, such as total units sold and total revenue, and links them with customer, store, product, and staff details.

Key Objectives:

Customer Analysis: Understand customer purchasing patterns by analyzing the total number of units and revenue generated per customer. Product and Category Insights: Evaluate product performance and its category’s impact on overall sales. Store Performance: Identify which stores generate the most revenue and handle the highest sales volume. Sales Representative Effectiveness: Assess the performance of sales representatives by linking sales data with each representative’s handled orders. Techniques Used:

SQL Joins: The project integrates data from multiple tables, including orders, customers, order_items, products, categories, stores, and staffs, using INNER JOIN to merge information from related tables. Aggregation: SUM functions are used to compute total units sold and revenue generated by each order, providing valuable insights into sales performance. Grouping: Data is grouped by order ID, customer, product, store, and sales representative, ensuring accurate and summarized sales metrics. Use Cases:

Business Decision-Making: The analysis can help businesses identify high-performing products and stores, optimize inventory, and evaluate the impact of sales teams. Market Segmentation: Segment customers based on geographic location (city/state) and identify patterns in purchasing behavior. Sales Strategy Optimization: Provide recommendations to improve sales strategies by analyzing product categories and sales rep performance.
e
Plasma Proteomics Analysis of the Cooperative Health Research in South Tyrol...
ebi.ac.uk
Updated Aug 7, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Clemens Dierks (2024). Plasma Proteomics Analysis of the Cooperative Health Research in South Tyrol (CHRIS) Study, QC Samples [Dataset]. https://www.ebi.ac.uk/pride/archive/projects/PXD052892
Explore at:
Dataset updated
Aug 7, 2024
Authors
Clemens Dierks
Variables measured
Proteomics
Description
The Cooperative Health Research in South Tyrol (CHRIS) study is a single-site population-based study aimed to investigate the genetic and molecular basis of common age-related chronic conditions and their interaction with lifestyle and environment. In recent work, we have been evaluating the impact of age, sex and diet, amongst others, on the human metabolome. Moreover, gene-metabolite associations, as well as genetic and metabolomic determinants of disease, have been investigated. Using Scanning SWATH acquisition on Triple TOF 6600 instruments (Sciex) we created a MS-based plasma proteomics data set with low technical variability for n = 3,632 CHRIS participants. We performed a general exploratory analysis of the data set to identify relevant factors affecting the plasma proteome, including commonly used drugs. We found hormonal contraceptives to be the main factor explaining the variation in this data set. Here we present the commerical plasma samples used as quality controls.
Klib library python
kaggle.com
Updated Jan 11, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sripaad Srinivasan (2021). Klib library python [Dataset]. https://www.kaggle.com/sripaadsrinivasan/klib-library-python/discussion
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jan 11, 2021
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Sripaad Srinivasan
Description
klib library enables us to quickly visualize missing data, perform data cleaning, visualize data distribution plot, visualize correlation plot and visualize categorical column values. klib is a Python library for importing, cleaning, analyzing and preprocessing data. Explanations on key functionalities can be found on Medium / TowardsDataScience in the examples section or on YouTube (Data Professor).

Original Github repo

https://raw.githubusercontent.com/akanz1/klib/main/examples/images/header.png" alt="klib Header">

Usage

!pip install klib

import klib import pandas as pd df = pd.DataFrame(data) # klib.describe functions for visualizing datasets - klib.cat_plot(df) # returns a visualization of the number and frequency of categorical features - klib.corr_mat(df) # returns a color-encoded correlation matrix - klib.corr_plot(df) # returns a color-encoded heatmap, ideal for correlations - klib.dist_plot(df) # returns a distribution plot for every numeric feature - klib.missingval_plot(df) # returns a figure containing information about missing values

Examples

Take a look at this starter notebook.

Further examples, as well as applications of the functions can be found here.

Contributing

Pull requests and ideas, especially for further functions are welcome. For major changes or feedback, please open an issue first to discuss what you would like to change. Take a look at this Github repo.

License

MIT
McKinsey Solve Assessment Data (2018–2025)
kaggle.com
Updated May 7, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Oluwademilade Adeniyi (2025). McKinsey Solve Assessment Data (2018–2025) [Dataset]. http://doi.org/10.34740/kaggle/dsv/11720554
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.34740/kaggle/dsv/11720554
Dataset updated
May 7, 2025
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Oluwademilade Adeniyi
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
McKinsey Solve Global Assessment Dataset (2018–2025)

🧠 Context

McKinsey's Solve is a gamified problem-solving assessment used globally in the consulting firm’s recruitment process. This dataset simulates assessment results across geographies, education levels, and roles over a 7-year period. It aims to provide deep insights into performance trends, candidate readiness, resume quality, and cognitive task outcomes.

📌 Inspiration & Purpose

Inspired by McKinsey’s real-world assessment framework, this dataset was designed to enable: - Exploratory Data Analysis (EDA) - Recruitment trend analysis - Gamified performance modelling - Dashboard development in Excel / Power BI - Resume and education impact evaluation - Regional performance benchmarking - Data storytelling for portfolio projects

Whether you're building dashboards or training models, this dataset offers practical and relatable data for HR analytics and consulting use cases.

🔍 Dataset Source

Data generated by Oluwademilade Adeniyi (Demibolt) with the assistance of ChatGPT by OpenAI Structure and logic inspired by McKinsey’s public-facing Solve information, including role categories, game types (Ecosystem, Redrock, Seawolf), education tiers, and global office locations The entire dataset is synthetic and designed for analytical learning, ethical use, and professional development

🧾 Dataset Structure

This dataset includes 4,000 rows and the following columns: - Testtaker ID: Unique identifier - Country / Region: Geographic segmentation - Gender / Age: Demographics - Year: Assessment year (2018–2025) - Highest Level of Education: From high school to PhD / MBA - School or University Attended: Mapped to country and education level - First-generation University Student: Yes/No - Employment Status: Student, Employed, Unemployed - Role Applied For and Department / Interest: Business/tech disciplines - Past Test Taker: Indicates repeat attempts - Prepared with Online Materials: Indicates test prep involvement - Desired Office Location: Mapped to McKinsey's international offices - Ecosystem / Redrock / Seawolf (%): Game performance scores - Time Spent on Each Game (mins) - Total Product Score: Average of the 3 game scores - Process Score: A secondary assessment component - Resume Score: Scored based on education prestige, role fit, and clarity - Total Assessment Score (%): Final decision metric - Status (Pass/Fail): Based on total score ≥ 75%

✅ Why Use This Dataset

Benchmark educational and regional trends in global assessments

Build KPI cards, donut charts, histograms, or speedometer visuals

Train pass/fail classifiers or regression models

Segment job applicants by role, location, or game behaviour

Showcase portfolio skills across Excel, SQL, Power BI, Python, or R

Test dashboards or predictive logic in a business-relevant scenario

💡 Credit & Collaboration

Data Creator: Oluwademilade Adeniyi (Me) (LinkedIn, Twitter, GitHub, Medium)

Collaborator: ChatGPT by OpenAI

Inspired by: McKinsey & Company’s Solve Assessment
Reproductive Health In india
kaggle.com
Updated Feb 15, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
AKshay (2025). Reproductive Health In india [Dataset]. https://www.kaggle.com/datasets/ak0212/reproductive-health-in-india/versions/1
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 15, 2025
Dataset provided by
Kaggle
Authors
AKshay
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Area covered
India
Description
The "Reproductive Health" dataset on Kaggle provides an in-depth view of various factors impacting reproductive health across different populations. It includes demographic information such as age, marital status, and educational background, as well as health-related data like contraceptive use, medical conditions, and fertility history. This dataset is particularly useful for data analysts, researchers, and public health professionals aiming to understand trends in reproductive health and identify patterns or associations between lifestyle, medical history, and reproductive health outcomes.

The dataset enables users to explore key questions in reproductive health, such as how socioeconomic factors influence family planning choices or how health conditions may correlate with fertility. It can be applied to various types of analysis, including statistical modeling, machine learning algorithms, and predictive analysis. For example, analysts can use this dataset to build classification models that predict contraceptive use or explore regression models to understand factors contributing to reproductive health outcomes.

The dataset includes several attributes related to individual health profiles, such as whether individuals have previously experienced pregnancies, their contraceptive methods, and other relevant health conditions. It also provides valuable demographic details that can support intersectional analysis, examining how different factors like age, education, and income level impact reproductive health decisions.

With this dataset, one can also conduct exploratory data analysis (EDA), build visualizations, and identify correlations between variables such as health conditions, lifestyle choices, and reproductive outcomes. Additionally, it can serve as a base for conducting hypothesis testing to validate assumptions about reproductive health patterns.

For those interested in public health research or working on health data science projects, the dataset offers a comprehensive foundation for analyzing reproductive health issues. It can be particularly beneficial for projects focused on improving access to family planning services, promoting awareness of reproductive health issues, or creating predictive tools for healthcare interventions.

The "Reproductive Health" dataset is a valuable resource for anyone involved in data-driven public health research, machine learning, or statistical modeling in the context of reproductive health. It is accessible for both beginner and advanced data scientists, offering diverse possibilities for analysis and insights that can have a real-world impact on public health policies and interventions.
Not seeing a result you expected?
Learn how you can add new datasets to our index.

Facebook

Twitter

Click to copy link

Link copied

Cite

Diomidis Spinellis; Diomidis Spinellis; Zoe Kotti; Zoe Kotti; Konstantinos Kravvaritis; Konstantinos Kravvaritis; Georgios Theodorou; Georgios Theodorou; Panos Louridas; Panos Louridas (2020). Enterprise-Driven Open Source Software [Dataset]. http://doi.org/10.5281/zenodo.3742962

Enterprise-Driven Open Source Software

Explore at:

49 scholarly articles cite this dataset (View in Google Scholar)

application/gzipAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.3742962

Dataset updated

Apr 22, 2020

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Diomidis Spinellis; Diomidis Spinellis; Zoe Kotti; Zoe Kotti; Konstantinos Kravvaritis; Konstantinos Kravvaritis; Georgios Theodorou; Georgios Theodorou; Panos Louridas; Panos Louridas

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

We present a dataset of open source software developed mainly by enterprises rather than volunteers. This can be used to address known generalizability concerns, and, also, to perform research on open source business software development. Based on the premise that an enterprise's employees are likely to contribute to a project developed by their organization using the email account provided by it, we mine domain names associated with enterprises from open data sources as well as through white- and blacklisting, and use them through three heuristics to identify 17,264 enterprise GitHub projects. We provide these as a dataset detailing their provenance and properties. A manual evaluation of a dataset sample shows an identification accuracy of 89%. Through an exploratory data analysis we found that projects are staffed by a plurality of enterprise insiders, who appear to be pulling more than their weight, and that in a small percentage of relatively large projects development happens exclusively through enterprise insiders.

The main dataset is provided as a 17,264 record tab-separated file named enterprise_projects.txt with the following 29 fields.

url: the project's GitHub URL
project_id: the project's GHTorrent identifier
sdtc: true if selected using the same domain top committers heuristic (9,016 records)
mcpc: true if selected using the multiple committers from a valid enterprise heuristic (8,314 records)
mcve: true if selected using the multiple committers from a probable company heuristic (8,015 records),
star_number: number of GitHub watchers
commit_count: number of commits
files: number of files in current main branch
lines: corresponding number of lines in text files
pull_requests: number of pull requests
github_repo_creation: timestamp of the GitHub repository creation
earliest_commit: timestamp of the earliest commit
most_recent_commit: date of the most recent commit
committer_count: number of different committers
author_count: number of different authors
dominant_domain: the projects dominant email domain
dominant_domain_committer_commits: number of commits made by committers whose email matches the project's dominant domain
dominant_domain_author_commits: corresponding number for commit authors
dominant_domain_committers: number of committers whose email matches the project's dominant domain
dominant_domain_authors: corresponding number for commit authors
cik: SEC's EDGAR "central index key"
fg500: true if this is a Fortune Global 500 company (2,233 records)
sec10k: true if the company files SEC 10-K forms (4,180 records)
sec20f: true if the company files SEC 20-F forms (429 records)
project_name: GitHub project name
owner_login: GitHub project's owner login
company_name: company name as derived from the SEC and Fortune 500 data
owner_company: GitHub project's owner company name
license: SPDX license identifier

The file cohost_project_details.txt provides the full set of 311,223 cohort projects that are not part of the enterprise data set, but have comparable quality attributes.

url: the project's GitHub URL
project_id: the project's GHTorrent identifier
stars: number of GitHub watchers
commit_count: number of commits

Clear search

Close search

Google apps

Main menu

Enterprise-Driven Open Source Software

Hospital Management Dataset

Enterprise-Driven Open Source Software

College Student Placement Factors Dataset

📘 College Student Placement Dataset

📄 Dataset Description

📊 Columns Description

🎯 Target Variable

🧠 Use Cases

📦 Dataset Size

📚 Context

📜 License

🔗 Source

Evaluating FAIR Models for Rossmann Store Sales Prediction: Insights and...

Context and Methodology

Technical Details

Data Fields Description:

Software Requirements

Additional Resources

Appalachian Basin Play Fairway Analysis: Thermal Quality Analysis in...

SQL Bike Stores

Plasma Proteomics Analysis of the Cooperative Health Research in South Tyrol...

Klib library python

Usage

Examples

Contributing

License

McKinsey Solve Assessment Data (2018–2025)

McKinsey Solve Global Assessment Dataset (2018–2025)

🧠 Context

📌 Inspiration & Purpose

🔍 Dataset Source

🧾 Dataset Structure

✅ Why Use This Dataset

💡 Credit & Collaboration

Reproductive Health In india

Enterprise-Driven Open Source SoftwareSee More Versions

Enterprise-Driven Open Source Software