23 datasets found
  1. Data Management Project for Collaborative Groundwater Research

    • hydroshare.org
    • search.dataone.org
    zip
    Updated Apr 24, 2025
    Cite
    Abbygael Johnson; Collins Stephenson; Brett Safely; Brooklyn Taylor (2025). Data Management Project for Collaborative Groundwater Research [Dataset]. https://www.hydroshare.org/resource/faa268eaa07547938d0e696247fc81fd
    Explore at:
    zip (2.1 GB)
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    HydroShare
    Authors
    Abbygael Johnson; Collins Stephenson; Brett Safely; Brooklyn Taylor
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This project developed a comprehensive data management system designed to support collaborative groundwater research across institutions by establishing a centralized, structured database for hydrologic time series data. Built on the Observations Data Model (ODM), the system stores time series data and metadata in a relational SQLite database. Key project components included database construction, automation of data formatting and importation, development of analytical and visualization tools, and integration with ArcGIS for geospatial representation. The data import workflow standardizes and validates diverse .csv datasets by aligning them with ODM formatting. A Python-based module was created to facilitate data retrieval, analysis, visualization, and export, while an interactive map feature enables users to explore site-specific data availability. Additionally, a custom ArcGIS script was implemented to generate maps that incorporate stream networks, site locations, and watershed boundaries using DEMs from USGS sources. The system was tested using real-world datasets from groundwater wells and surface water gages across Utah, demonstrating its flexibility in handling diverse formats and parameters. The relational structure enabled efficient querying and visualization, and the developed tools promoted accessibility and alignment with FAIR principles.
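    For readers unfamiliar with ODM, a sketch of the kind of query this relational structure supports (assuming the standard ODM 1.1 table names Sites, Variables, and DataValues, which the project's schema may adapt; the site code and variable name below are hypothetical):

    -- Sketch: retrieve one groundwater time series from an ODM-style SQLite DB.
    -- Table and column names assume ODM 1.1; the project's schema may differ.
    SELECT s.SiteName,
           v.VariableName,
           dv.LocalDateTime,
           dv.DataValue
    FROM DataValues AS dv
    JOIN Sites AS s ON s.SiteID = dv.SiteID
    JOIN Variables AS v ON v.VariableID = dv.VariableID
    WHERE s.SiteCode = 'UT_WELL_001'            -- hypothetical site code
      AND v.VariableName = 'Water table depth'  -- hypothetical variable
    ORDER BY dv.LocalDateTime;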

  2. Monday Coffee SQL Data Analysis Project

    • kaggle.com
    Updated Nov 15, 2024
    Cite
    Najir 0123 (2024). Monday Coffee SQL Data Analysis Project [Dataset]. https://www.kaggle.com/datasets/najir0123/monday-coffee-sql-data-analysis-project/suggestions
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Nov 15, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Najir 0123
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Najir 0123

    Released under MIT


  3. Employee Attrition Case Study

    • kaggle.com
    Updated Aug 8, 2023
    Cite
    Hunter Gonzalez (2023). Employee Attrition Case Study [Dataset]. https://www.kaggle.com/datasets/huntergonzalez247/employee-attrition-case-study
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Aug 8, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Hunter Gonzalez
    Description

    This is an in-depth analysis I created using data pulled from an open-source (ODbL) data project provided on Kaggle:

    Pavansubhash. (2017). IBM HR Analytics Employee Attrition & Performance, Version 1. Retrieved August 3rd, 2023 from https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset.

    Problem: The VP of People Operations/HR at [Company] wants to better understand what efforts they can make to retain more employees every year.

    Question: How do education, job involvement, and work-life balance affect employee attrition?

    Metrics

    A survey was sent out to 2,068 current and past employees, asking a series of clear and consistent questions about different variables in the workplace. The surveys were anonymous, to ensure that employees answered truthfully and to protect the integrity of the data collected.

    Education: 1) Below College, 2) Some College, 3) Bachelor, 4) Master, 5) Doctor

    Job Involvement: 1) Low, 2) Medium, 3) High, 4) Very High

    Work Life Balance: 1) Bad, 2) Good, 3) Better, 4) Best
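    These ordinal scales lend themselves to grouped aggregates. A sketch (assuming the IBM source data is loaded as a table named attrition; Education and Attrition are columns in the original Kaggle dataset):

    -- Sketch: attrition rate by education level. The table name
    -- `attrition` is an assumption; Education (1-5) and Attrition
    -- ('Yes'/'No') are columns in the original IBM dataset.
    SELECT Education,
           COUNT(*) AS employees,
           AVG(CASE WHEN Attrition = 'Yes' THEN 1.0 ELSE 0.0 END) AS attrition_rate
    FROM attrition
    GROUP BY Education
    ORDER BY Education;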

  4. SQL Bike Stores

    • kaggle.com
    Updated Nov 21, 2024
    Cite
    Mohamed ZRIRAK (2024). SQL Bike Stores [Dataset]. https://www.kaggle.com/datasets/mohamedzrirak/sql-bkestores
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Nov 21, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Mohamed ZRIRAK
    License

    CC0 1.0 Public Domain: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Download: SQL Query

    This SQL project is focused on analyzing sales data from a relational database to gain insights into customer behavior, store performance, product sales, and the effectiveness of sales representatives. By executing a series of complex SQL queries across multiple tables, the project aggregates key metrics, such as total units sold and total revenue, and links them with customer, store, product, and staff details.

    Key Objectives:

    • Customer Analysis: Understand customer purchasing patterns by analyzing the total number of units and revenue generated per customer.
    • Product and Category Insights: Evaluate product performance and each category's impact on overall sales.
    • Store Performance: Identify which stores generate the most revenue and handle the highest sales volume.
    • Sales Representative Effectiveness: Assess the performance of sales representatives by linking sales data with each representative's handled orders.

    Techniques Used:

    • SQL Joins: The project integrates data from multiple tables, including orders, customers, order_items, products, categories, stores, and staffs, using INNER JOIN to merge information from related tables.
    • Aggregation: SUM functions compute total units sold and revenue generated by each order, providing valuable insights into sales performance.
    • Grouping: Data is grouped by order ID, customer, product, store, and sales representative, ensuring accurate and summarized sales metrics.

    (A minimal sketch of these techniques follows the use cases below.)

    Use Cases:

    • Business Decision-Making: The analysis can help businesses identify high-performing products and stores, optimize inventory, and evaluate the impact of sales teams.
    • Market Segmentation: Segment customers based on geographic location (city/state) and identify patterns in purchasing behavior.
    • Sales Strategy Optimization: Provide recommendations to improve sales strategies by analyzing product categories and sales rep performance.
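    As an illustration of the joins and aggregation described above, a sketch (assuming common BikeStores-style columns on order_items; verify names against the actual schema):

    -- Sketch: units and revenue per store. Column names (quantity,
    -- list_price, discount) are assumptions based on the typical
    -- BikeStores sample schema.
    SELECT st.store_name,
           SUM(oi.quantity) AS total_units,
           SUM(oi.quantity * oi.list_price * (1 - oi.discount)) AS total_revenue
    FROM orders AS o
    INNER JOIN order_items AS oi ON oi.order_id = o.order_id
    INNER JOIN stores AS st ON st.store_id = o.store_id
    GROUP BY st.store_name
    ORDER BY total_revenue DESC;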

  5. Data from: myPhyloDB

    • catalog.data.gov
    Updated Apr 21, 2025
    + more versions
    Cite
    Agricultural Research Service (2025). myPhyloDB [Dataset]. https://catalog.data.gov/dataset/myphylodb-c588e
    Explore at:
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Service (https://www.ars.usda.gov/)
    Description

    myPhyloDB is an open-source software package that provides a user-friendly web interface for accessing and analyzing all of your laboratory's microbial ecology data (currently supported project types: soil, air, water, microbial, and human-associated). myPhyloDB archives users' raw sequencing files and allows for easy selection of any combination of projects/samples from all of your projects using the built-in SQL database. Its data processing capabilities are also flexible enough to allow the upload, storage, and analysis of pre-processed data or raw (454 or Illumina) data files using the built-in versions of Mothur and R. myPhyloDB is designed to run as a local web server, which allows a single installation to be accessible to all of your laboratory members, regardless of their operating system or other hardware limitations. myPhyloDB includes an embedded copy of the popular Mothur program and uses a customizable batch file to perform sequence editing and processing. This allows myPhyloDB to leverage the flexibility of Mothur and allow for greater standardization of data processing and handling across all of your sequencing projects. myPhyloDB also includes an embedded copy of the R software environment for a variety of statistical analyses and graphics. Currently, myPhyloDB includes analyses for factor- or regression-based ANCOVA, principal coordinates analysis (PCoA), differential abundance analysis (DESeq), and sparse partial least-squares regression (sPLS).

    Resources in this dataset: Resource Title: Website Pointer to myPhyloDB. File Name: Web Page; URL: https://myphylodb.azurecloudgov.us/myPhyloDB/home/. Provides information and links to download the latest version, release history, documentation, and tutorials, including the type of analysis you would like to perform (Univariate: ANCOVA/GLM; Multivariate: DiffAbund, PCoA, or sPLS).

  6. To Estimate and Optimize the Source of Drinking Water for Metro Vancouver until 2040

    • borealisdata.ca
    • dataone.org
    Updated Feb 28, 2019
    Cite
    Shahram Yarmand (2019). To Estimate and Optimize the Source of Drinking Water for Metro Vancouver until 2040 [Dataset]. http://doi.org/10.5683/SP2/6KU4I7
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Feb 28, 2019
    Dataset provided by
    Borealis
    Authors
    Shahram Yarmand
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Oct 2017 - Nov 2017
    Area covered
    Metro Vancouver
    Description

    The population of Metro Vancouver (20110729 Regional Growth Strategy Projections: Population, Housing and Employment 2006-2041 file) will have increased greatly by 2040, and finding a new source of reservoirs for drinking water (2015 Water Consumption Statistics file) will be essential. This issue of drinking water needs to be optimized and estimated (Data Mining file) with the aim of developing the region. The three current water reservoirs for Metro Vancouver are Capilano, Seymour, and Coquitlam, from which treated water is supplied to customers. The linear programming (LP) optimization model (Optimization, Sensitivity Report file) illustrates the amount of drinking water for each reservoir and region. The B.C. government has a specific strategy for the growing population until 2040 that leads toward this goal. In addition, the new source of drinking water (wells) needs to be estimated and monitored to anticipate a feasible water source until 2040; the government will have to decide how much groundwater is used. The goal of the project is two steps: (1) an optimization model for the three water reservoirs, and (2) estimating the new source of water to 2040. The data for the project is analyzed with five tools (Trifacta Wrangler, AMPL, Excel Solver, ArcGIS, and SQL) and visualized in Tableau:
    1. Trifacta Wrangler cleans the data (Data Mining file).
    2. AMPL and Excel Solver optimize drinking water consumption for Metro Vancouver (data in the Optimization and Sensitivity Report file).
    3. ArcMap combines the raw data with the results of the reservoir optimization and the population estimate to 2040 (GIS Map for Tableau file).
    4. The source of drinking water for Metro Vancouver until 2040 is visualized, estimated, and optimized with SQL in Tableau (export tableau data file).

  7. AdventureWorks-2014

    • kaggle.com
    Updated Aug 18, 2024
    Cite
    Patrick McKown (2024). AdventureWorks-2014 [Dataset]. https://www.kaggle.com/datasets/duckduckboot/adventureworks-2014
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Aug 18, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Patrick McKown
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    About This Dataset

    This dataset is derived from the AdventureWorks 2014 test database published by Microsoft, and is designed to simplify and enhance data analysis workflows. The dataset consists of multiple CSV files that have been pre-joined and transformed from the original SQL database, facilitating a smoother analytical experience in Python.

    Dataset Composition

    The dataset includes:
    • SalesOrderHeader: Integrates the sales header and sales item tables, providing a unified view of sales transactions.
    • CustomerMaster: Combines customer names, countries, addresses, and other related information into a single, comprehensive file.
    • VendorMaster: Combines vendor names, countries, addresses, and other related information into a single, comprehensive file.

    These pre-joined CSVs aim to streamline data analysis, making it more accessible for users working in Python. The dataset can be used to showcase various Python projects or as a foundation for your own analyses.

    Usage

    Feel free to leverage this dataset for your data analysis projects, explore trends, and create visualizations. Whether you're showcasing your own Python projects or conducting independent analyses, this dataset is designed to support a wide range of data science tasks.

    Documentation

    For those interested in recreating the CSV files from the SQL database, detailed documentation is included at the bottom of this section. It provides step-by-step instructions on how to replicate the CSVs from the AdventureWorks 2014 database using SQL queries.

    AdventureWorks_SalesOrderHeader

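    -- One row per sales order; OrderDate and ShipDate are cast from datetime to plain dates.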
    SELECT
      SalesOrderID
      , CAST (OrderDate AS date) AS OrderDate
      , CAST (ShipDate AS date) AS ShipDate
      , CustomerID
      , ShipToAddressID
      , BillToAddressID
      , SubTotal
      , TaxAmt
      , Freight
      , TotalDue
    FROM
      Sales.SalesOrderHeader
    

    AdventureWorks_CustomerMaster

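    -- One row per business-entity address: joins address, person, email, phone, and sales-territory tables to build the customer master.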
    SELECT
      pa.AddressID
      , pbea.BusinessEntityID
      , pa.AddressLine1
      , pa.City
      , pa.PostalCode
      , psp.[Name] AS ProvinceStateName
      , pat.[Name] AS AddressType
      , pea.EmailAddress
      , ppp.PhoneNumber
      , pp.FirstName
      , pp.LastName
      , sst.CountryRegionCode
      , pcr.[Name] AS CountryName
      , sst.[Group] AS CountryGroup
    FROM 
      Person.[Address] AS pa
    INNER JOIN
      Person.BusinessEntityAddress AS pbea ON pa.AddressID = pbea.AddressID
    INNER JOIN
      Person.StateProvince AS psp ON pa.StateProvinceID = psp.StateProvinceID
    INNER JOIN
      Person.AddressType AS pat ON pbea.AddressTypeID = pat.AddressTypeID 
    INNER JOIN
      Person.EmailAddress AS pea ON pbea.BusinessEntityID = pea.BusinessEntityID
    INNER JOIN
      Person.Person AS pp ON pbea.BusinessEntityID = pp.BusinessEntityID
    INNER JOIN
      Person.PersonPhone AS ppp ON pbea.BusinessEntityID = ppp.BusinessEntityID
    INNER JOIN
      Sales.SalesTerritory AS sst ON psp.TerritoryID = sst.TerritoryID
    INNER JOIN
      Person.CountryRegion AS pcr ON sst.CountryRegionCode = pcr.CountryRegionCode;
    
  8. ghtorrent-projects Dataset

    • zenodo.org
    • data.niaid.nih.gov
    bin, txt
    Updated Jul 17, 2021
    Cite
    Marios Papachristou; Marios Papachristou (2021). ghtorrent-projects Dataset [Dataset]. http://doi.org/10.5281/zenodo.5111043
    Explore at:
    txt, bin
    Dataset updated
    Jul 17, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Marios Papachristou; Marios Papachristou
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A hypergraph dataset mined from the GHTorrent project is presented. The dataset contains two files

    1. project_members.txt: Contains GitHub projects with at least 2 contributors and the corresponding contributors (as a hyperedge). The format of the data is:

    2. num_followers.txt: Contains all GitHub users and their number of followers.

    The artifact also contains the SQL queries used to obtain the data from GHTorrent (schema).
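    For context, the kind of SQL involved might look like the sketch below (assuming GHTorrent's project_members and followers tables; the queries shipped with the artifact are the authoritative versions):

    -- Sketch only; see the SQL files in the artifact for the exact queries.
    -- Hyperedges: projects with at least two contributors.
    SELECT repo_id, GROUP_CONCAT(user_id) AS members
    FROM project_members
    GROUP BY repo_id
    HAVING COUNT(user_id) >= 2;

    -- Follower counts per user.
    SELECT user_id, COUNT(follower_id) AS num_followers
    FROM followers
    GROUP BY user_id;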

  9. Data from: myPhyloDB

    • catalog.data.gov
    • data.amerigeoss.org
    Updated Apr 21, 2025
    Cite
    Agricultural Research Service (2025). myPhyloDB [Dataset]. https://catalog.data.gov/dataset/myphylodb-c588e
    Explore at:
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Service (https://www.ars.usda.gov/)
    Description

    myPhyloDB is an open-source software package that provides a user-friendly web interface for accessing and analyzing all of your laboratory's microbial ecology data (currently supported project types: soil, air, water, microbial, and human-associated). myPhyloDB archives users' raw sequencing files and allows for easy selection of any combination of projects/samples from all of your projects using the built-in SQL database. Its data processing capabilities are also flexible enough to allow the upload, storage, and analysis of pre-processed data or raw (454 or Illumina) data files using the built-in versions of Mothur and R. myPhyloDB is designed to run as a local web server, which allows a single installation to be accessible to all of your laboratory members, regardless of their operating system or other hardware limitations. myPhyloDB includes an embedded copy of the popular Mothur program and uses a customizable batch file to perform sequence editing and processing. This allows myPhyloDB to leverage the flexibility of Mothur and allow for greater standardization of data processing and handling across all of your sequencing projects. myPhyloDB also includes an embedded copy of the R software environment for a variety of statistical analyses and graphics. Currently, myPhyloDB includes analyses for factor- or regression-based ANCOVA, principal coordinates analysis (PCoA), differential abundance analysis (DESeq), and sparse partial least-squares regression (sPLS).

    Resources in this dataset: Resource Title: Website Pointer to myPhyloDB. File Name: Web Page; URL: https://myphylodb.azurecloudgov.us/myPhyloDB/home/. Provides information and links to download the latest version, release history, documentation, and tutorials, including the type of analysis you would like to perform (Univariate: ANCOVA/GLM; Multivariate: DiffAbund, PCoA, or sPLS).

  10. Qualisign: Software Metrics and GoF Design Patterns of the Maven Central Repository

    • data.niaid.nih.gov
    • zenodo.org
    Updated Sep 24, 2020
    Cite
    Aichberger, Johann (2020). Qualisign: Software Metrics and GoF Design Patterns of the Maven Central Repository [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3731871
    Explore at:
    Dataset updated
    Sep 24, 2020
    Dataset authored and provided by
    Aichberger, Johann
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains software metric and design pattern data for around 100,000 projects from the Maven Central repository. The data was collected and analyzed as part of my master's thesis "Mining Software Repositories for the Effects of Design Patterns on Software Quality" (https://www.overleaf.com/read/vnfhydqxmpvx, https://zenodo.org/record/4048275).

    The included qualisign.* files all contain the same data in different formats:
    • qualisign.sql: standard SQL format (exported using "pg_dump --inserts ...")
    • qualisign.psql: PostgreSQL plain format (exported using "pg_dump -Fp ...")
    • qualisign.csql: PostgreSQL custom format (exported using "pg_dump -Fc ...")

    create-tables.sql has to be executed before importing one of the qualisign.* files. Once qualisign.*sql has been imported, create-views.sql can be executed to preprocess the data, thereby creating materialized views that are more appropriate for data analysis purposes.
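    Concretely, the import sequence might look like the following sketch (run inside psql against an empty database; file paths are assumptions, and the custom-format dump would instead go through pg_restore):

    -- Sketch: import order for the PostgreSQL dumps (run inside psql;
    -- \i executes a SQL file).
    -- 1) Schema first:
    \i create-tables.sql
    -- 2) Then the data (plain-SQL dump shown here):
    \i qualisign.sql
    -- 3) Finally the materialized views used for analysis:
    \i create-views.sql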

    Software metrics were calculated using CKJM extended: http://gromit.iiar.pwr.wroc.pl/p_inf/ckjm/

    Included software metrics are (21 total):
    • AMC: Average Method Complexity
    • CA: Afferent Coupling
    • CAM: Cohesion Among Methods
    • CBM: Coupling Between Methods
    • CBO: Coupling Between Objects
    • CC: Cyclomatic Complexity
    • CE: Efferent Coupling
    • DAM: Data Access Metric
    • DIT: Depth of Inheritance Tree
    • IC: Inheritance Coupling
    • LCOM: Lack of Cohesion of Methods (Chidamber and Kemerer)
    • LCOM3: Lack of Cohesion of Methods (Constantine and Graham)
    • LOC: Lines of Code
    • MFA: Measure of Functional Abstraction
    • MOA: Measure of Aggregation
    • NOC: Number of Children
    • NOM: Number of Methods
    • NOP: Number of Polymorphic Methods
    • NPM: Number of Public Methods
    • RFC: Response for Class
    • WMC: Weighted Methods per Class

    In the qualisign.* data, these metrics are only available on the class level. create-views.sql additionally provides averages of these metrics on the package and project levels.

    Design patterns were detected using SSA: https://users.encs.concordia.ca/~nikolaos/pattern_detection.html

    Included design patterns are (15 total): Adapter, Bridge, Chain of Responsibility, Command, Composite, Decorator, Factory Method, Observer, Prototype, Proxy, Singleton, State, Strategy, Template Method, and Visitor.

    The code to generate the dataset is available at: https://github.com/jaichberg/qualisign

    The code to perform quality analysis on the dataset is available at: https://github.com/jaichberg/qualisign-analysis

  11. Geodatabase for the Baltimore Ecosystem Study Spatial Data

    • portal.edirepository.org
    • search.dataone.org
    application/vnd.rar
    Updated May 4, 2012
    Cite
    Jarlath O'Neal-Dunne; Morgan Grove (2012). Geodatabase for the Baltimore Ecosystem Study Spatial Data [Dataset]. http://doi.org/10.6073/pasta/377da686246f06554f7e517de596cd2b
    Explore at:
    application/vnd.rar (29,574,980 kilobytes)
    Dataset updated
    May 4, 2012
    Dataset provided by
    EDI
    Authors
    Jarlath O'Neal-Dunne; Morgan Grove
    Time period covered
    Jan 1, 1999 - Jun 1, 2014
    Area covered
    Description

    The establishment of a BES Multi-User Geodatabase (BES-MUG) allows for the storage, management, and distribution of geospatial data associated with the Baltimore Ecosystem Study. At present, BES data is distributed over the internet via the BES website. While having geospatial data available for download is a vast improvement over having the data housed at individual research institutions, it still suffers from some limitations. BES-MUG overcomes these limitations, improving the quality of the geospatial data available to BES researchers and thereby leading to more informed decision-making.

       BES-MUG builds on Environmental Systems Research Institute's (ESRI) ArcGIS and ArcSDE technology. ESRI was selected because its geospatial software offers robust capabilities. ArcGIS is implemented agency-wide within the USDA and is the predominant geospatial software package used by collaborating institutions.
    
    
       Commercially available enterprise database packages (DB2, Oracle, SQL) provide an efficient means to store, manage, and share large datasets. However, standard database capabilities are limited with respect to geographic datasets because they lack the ability to deal with complex spatial relationships. By using ESRI's ArcSDE (Spatial Database Engine) in conjunction with database software, geospatial data can be handled much more effectively through the implementation of the Geodatabase model. Through ArcSDE and the Geodatabase model the database's capabilities are expanded, allowing for multiuser editing, intelligent feature types, and the establishment of rules and relationships. ArcSDE also allows users to connect to the database using ArcGIS software without being burdened by the intricacies of the database itself.
    
    
       For an example of how BES-MUG will help improve the quality and timeliness of BES geospatial data, consider a census block group layer that is in need of updating. Rather than the researcher downloading the dataset, editing it, and resubmitting it through ORS, access rules will allow the authorized user to edit the dataset over the network. Established rules will ensure that attribute and topological integrity is maintained, so that key fields are not left blank and block group boundaries stay within tract boundaries. Metadata will automatically be updated to show who edited the dataset and when, in the event any questions arise.
    
    
       Currently, a functioning prototype Multi-User Database has been developed for BES at the University of Vermont Spatial Analysis Lab, using Arc SDE and IBM's DB2 Enterprise Database as a back end architecture. This database, which is currently only accessible to those on the UVM campus network, will shortly be migrated to a Linux server where it will be accessible for database connections over the Internet. Passwords can then be handed out to all interested researchers on the project, who will be able to make a database connection through the Geographic Information Systems software interface on their desktop computer. 
    
    
       This database will include a very large number of thematic layers. Those layers are currently divided into biophysical, socio-economic and imagery categories. Biophysical includes data on topography, soils, forest cover, habitat areas, hydrology and toxics. Socio-economics includes political and administrative boundaries, transportation and infrastructure networks, property data, census data, household survey data, parks, protected areas, land use/land cover, zoning, public health and historic land use change. Imagery includes a variety of aerial and satellite imagery.
    
    
       See the readme: http://96.56.36.108/geodatabase_SAL/readme.txt
    
    
       See the file listing: http://96.56.36.108/geodatabase_SAL/diroutput.txt
    
  12. SciSpark: Highly Interactive and Scalable Model Evaluation and Climate Metrics for Scientific Data and Analysis

    • data.amerigeoss.org
    • data.wu.ac.at
    html
    Updated Jul 19, 2018
    Cite
    United States (2018). SciSpark: Highly Interactive and Scalable Model Evaluation and Climate Metrics for Scientific Data and Analysis [Dataset]. https://data.amerigeoss.org/sk/dataset/scispark-highly-interactive-and-scalable-model-evaluation-and-climate-metrics-for-scientif
    Explore at:
    html
    Dataset updated
    Jul 19, 2018
    Dataset provided by
    United States
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Description

    We will construct SciSpark, a scalable system for interactive model evaluation and for the rapid development of climate metrics and analyses. SciSpark directly leverages the Apache Spark technology and its notion of Resilient Distributed Datasets (RDDs). RDDs represent an immutable data set that can be reused across multi-stage operations, partitioned across multiple machines and automatically reconstructed if a partition is lost. The RDD notion directly enables the reuse of array data across multi-stage operations and it ensures data can be replicated, distributed and easily reconstructed in different storage tiers, e.g., memory for fast interactivity, SSDs for near real time availability and I/O oriented spinning disk for later operations. RDDs also allow Spark's performance to degrade gracefully when there is not sufficient memory available to the system. It may seem surprising to consider an in-memory solution for massive datasets, however a recent study found that at Facebook 96% of active jobs could have their entire data inputs in memory at the same time. In addition, it is worth noting that Spark has shown to be 100x faster in memory and 10x faster on disk than Apache Hadoop, the de facto industry platform for Big Data. Hadoop scales well and there are emerging examples of its use in NASA climate projects (e.g., Teng et al. and Schnase et al.) but as is being discovered in these projects, Hadoop is most suited for batch processing and long running operations. SciSpark contributes a Scientific RDD that corresponds to a multi-dimensional array representing a scientific measurement subset by space, or by time. Scientific RDDs can be created in a handful of ways by: (1) directly loading HDF and NetCDF data into Hadoop Distributed File System (HDFS); (2) creating a partition or split function that divides up a multi-dimensional array by space or time; (3) taking the results of a regridding operation or a climate metrics computation; or (4) telling SciSpark to cache an existing Scientific RDD (sRDD), keeping it cached in memory for data reuse between stages. Scientific RDDs will form the basis for a variety of advanced and interactive climate analyses, starting by default in memory, and then being cached and replicated to disk when not directly needed. SciSpark will also use the Shark interactive SQL technology that allows structured query language (SQL) to be used to store/retrieve RDDs; and will use Apache Mesos to be a good tenant in cloud environments interoperating with other data system frameworks (e.g., HDFS, iRODS, SciDB, etc.).

    One of the key components of SciSpark is interactive sRDD visualizations and to accomplish this SciSpark delivers a user interface built around the Data Driven Documents (D3) framework. D3 is an immersive, javascript based technology that exploits the underlying Document Object Model (DOM) structure of the web to create histograms, cartographic displays and inspections of climate variables and statistics.

    SciSpark is evaluated using several topical iterative scientific algorithms inspired by the NASA RCMES project, including machine-learning (ML) based clustering of temperature PDFs and other quantities over North America, and graph-based algorithms for searching for Mesoscale Convective Complexes in West Africa.

  13. Source files For Bike Share Case Study

    • kaggle.com
    Updated Aug 4, 2022
    Cite
    MG (2022). Source files For Bike Share Case Study [Dataset]. https://www.kaggle.com/datasets/magdas0/source-files-for-bike-share-case-study
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Aug 4, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    MG
    Description

    Bike Share Case Study

    This case study has been prepared in partial fulfillment of the Capstone project, the final course of the Google Data Analytics certificate offered by Google on the Coursera platform.

    I created a dataset that contains source files I wrote to perform this analysis:

    • Files presenting and documenting the analysis
      • 2022-08-04-bike-share-pres.pdf - the final presentation of the results including diagrams, conclusions and recommendations, and
      • 2022-08-04-bike-share-report.pdf - document describing all stages of the project
    • scripts - R, bash, and SQL scripts I created and used for this project
    • spreadsheets - spreadsheets I created and used for this project

    The original data for the bike sharing program is publicly available. The link is provided in the presentation and in the report.

  14. Crowdsourced Flow Cytometry Dataset from EVE Online’s Project Discovery for Machine Learning Applications

    • frdr-dfdr.ca
    Updated Apr 24, 2025
    Cite
    Brinkman, Ryan; Yokosawa, Daniel Y. O. (2025). Crowdsourced Flow Cytometry Dataset from EVE Online’s Project Discovery for Machine Learning Applications [Dataset]. http://doi.org/10.20383/103.01043
    Explore at:
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    Federated Research Data Repository / dépôt fédéré de données de recherche
    Authors
    Brinkman, Ryan; Yokosawa, Daniel Y. O.
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains a diverse collection of pre-processed flow cytometry data assembled to support the training and evaluation of machine learning (ML) models for the gating of cell populations. The data was curated through a citizen science initiative embedded in the EVE Online video game, known as Project Discovery. Participants contributed to scientific research by gating bivariate plots generated from flow cytometry data, creating a crowdsourced reference set. The original flow cytometry datasets were sourced from publicly available COVID-19 and immunology-related studies on FlowRepository.org and PubMed. Data were compensated, transformed, and split into bivariate plots for analysis. This dataset includes: 1) CSV files containing two-channel marker combinations per plot, 2) a SQL database capturing player-generated gating polygons in normalized coordinates, 3) scripts and containerized environments (Singularity and Docker) for reproducible evaluation of gating accuracy and consensus scoring using the flowMagic pipeline, and 4) code for filtering bot inputs, evaluating user submissions, calculating F1 scores, and generating consensus gating regions. This data is especially valuable for training and benchmarking models that aim to automate the labor-intensive gating process in immunological and clinical cytometry applications.

  15. SQLite database containing all the project data

    • figshare.com
    Updated Apr 6, 2021
    Cite
    Amandine Bertrand (2021). SQLite database containing all the project data [Dataset]. http://doi.org/10.6084/m9.figshare.14258528.v1
    Explore at:
    application/x-sqlite3
    Dataset updated
    Apr 6, 2021
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Amandine Bertrand
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    See Fig. S3 for the SQL schema

  16. Pulsar Voices

    • figshare.com
    pdf
    Updated Jun 1, 2023
    Cite
    Richard Ferrers; Anderson Murray; Ben Raymond; Gary Ruben; Christopher Russell; Sarath Tomy; Michael Walker (2023). Pulsar Voices [Dataset]. http://doi.org/10.6084/m9.figshare.3084748.v2
    Explore at:
    pdf
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Richard Ferrers; Anderson Murray; Ben Raymond; Gary Ruben; Christopher Russell; Sarath Tomy; Michael Walker
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The data is sourced from CSIRO Parkes ATNF, e.g. http://www.atnf.csiro.au/research/pulsar/psrcat/

    Feel the pulse of the universe: we're taking signal data from astronomical "pulsar" sources and creating a way to listen to their signals audibly. Pulsar data is available from ATNF at CSIRO.au. Our team at #SciHackMelb has been working on a #datavis to give researchers and others a novel way to explore the pulsar corpus, especially through the sound of the frequencies at which the pulsars emit pulses.

    Link to the project page at #SciHackMelb: http://www.the-hackfest.com/events/melbourne-science-hackfest/projects/pulsar-voices/

    The files attached here include: source data, the project presentation, the data as used in the website (final_pulsar.sql), and other methodology documentation. Importantly, see the GitHub link, which contains the data manipulation code, the HTML code to present the data and render it audibly, and an iPython Notebook to process single-pulsar data into an audible waveform file. Together all these resources are the Pulsar Voices activity and resulting data.

    Source data:
    • RA: east/west coordinates (0-24 hrs, roughly equates to longitude) [theta transforms RA to 0-360 degrees]
    • Dec: north/south coordinates (-90 to +90, roughly equates to latitude, i.e. 90 is above the north pole and -90 the south pole)
    • P0: the time in seconds at which a pulsar repeats its signal
    • f: 1/P0, which ranges from 700 cycles per second down to pulses that occur every few seconds
    • kps: distance from Earth in kilo-parsecs. 1 kps = 3,000 light years. The furthest data is 30 kps. The galactic centre is about 25,000 light years away, i.e. about 8 kps.

    Files:
    • psrcatShort.csv: 2,295 pulsars (all known pulsars) with the above fields RA, Dec, Theta
    • psrcatMedium.csv: adds P0 and kps; only 1,428 lines (i.e. not available for all 2,295 data points)
    • psrcatSparse.csv: adds P0 and kps, blanks if n/a; 2,295 lines
    • short.txt: important pulsars with high levels of observation (** even more closely examined)
    • pulsar.R: code contributed by Ben Raymond to visualise pulsar frequency and period in a histogram
    • pulsarVoices_authors.JPG: photo of the authors from SciHackMelb

    Added to the raw data:
    • Coordinates to map RA, Dec to screen width (y) / height (x): y = RA[Theta]*width/360; x = (Dec + 90)*height/180
    • Audible frequency converted from pulsar frequency (1/P0). Formula for 1/P0 (x) -> Hz (y): y = 10^(0.5 log(x) + 2.8). Explanation in the text file Convert1/P0toHz.txt. Tone generator from http://www.softsynth.com/webaudio/tone.php
    • A detailed audible waveform file converted from pulsar signal data, plus a waveform image (the Python notebook to generate these is available)

    The project source is hosted on GitHub at https://github.com/gazzar/pulsarvoices

    An IPython/Jupyter notebook contains code and a rough description of the method used to process a psrfits .sf file downloaded via the CSIRO Data Access Portal at http://doi.org/10.4225/08/55940087706E1. The notebook contains experimental code to read one of these .sf files and access the contained spectrogram data, processing it to generate an audible signal. It also reads the .txt files containing columnar pulse phase data (which is also contained in the .sf files) and processes these by frequency modulating the signal with an audible carrier. This is the method used to generate the .wav and .png files used in the web interface: https://github.com/gazzar/pulsarvoices/blob/master/ipynb/hackfest1.ipynb

    A standalone Python script that does the .txt to .png and .wav signal processing was used to process 15 more pulsar data examples. These can be reproduced by running the script: https://github.com/gazzar/pulsarvoices/blob/master/data/pulsarvoices.py

    Processed files are at https://github.com/gazzar/pulsarvoices/tree/master/web (e.g. https://github.com/gazzar/pulsarvoices/blob/master/web/J0437-4715.png; J0437-4715.wav | J0437-4715.png).

    #Datavis online at http://checkonline.com.au/tooltip.php; code at the GitHub link above. See especially https://github.com/gazzar/pulsarvoices/blob/master/web/index.php, particularly lines 314-328 (or search: "SELECT * FROM final_pulsar"), which load pulsar data from the DB and push it to the screen with Hz on mouseover.

    Pulsar Voices webpage functions:
    1. There is sound when you run the mouse across the pulsars. We plot all known pulsars (N=2,295) and play a tone for the roughly 75% of pulsars for which frequency data was available.
    2. In the bottom-left corner, a more detailed pulsar sound and wave image pop up when you click the star icon. Two of the team worked exclusively on turning a single pulsar's waveform into an audible .wav file. They created 16 of these files and a workflow, but the team only had time to load one waveform. With more time, it would be great to load these files.
    3. If you leave the mouse over a pulsar, a little data description pops up, with location (RA, Dec), distance (kilo-parsecs; 1 = 3,000 light years), and frequency of rotation (and Hz converted to human hearing).
    4. If you click on a pulsar, other pulsars with similar frequency are highlighted in white. With more time, I was interested to see if there are harmonics between pulsars, i.e. related frequencies.

    The Team:
    • Michael Walker: orcid.org/0000-0003-3086-6094; Biosciences PhD student, Unimelb, Melbourne.
    • Richard Ferrers: orcid.org/0000-0002-2923-9889; ANDS Research Data Analyst, Innovation/Value Researcher, Melbourne.
    • Sarath Tomy: http://orcid.org/0000-0003-4301-0690; La Trobe PhD Comp Sci, Melbourne.
    • Gary Ruben: http://orcid.org/0000-0002-6591-1820; CSIRO Postdoc at Australian Synchrotron, Melbourne.
    • Christopher Russell: Data Manager, CSIRO, Sydney. https://wiki.csiro.au/display/ASC/Chris+Russell
    • Anderson Murray: orcid.org/0000-0001-6986-9140; Physics Honours, Monash, Melbourne.

    Contact richard.ferrers@ands.org.au for more information.

    What is still left to do?
    • Load data, description, and images fileset to figshare :: DOI; DONE except DOI
    • Add overview images as an option, e.g. frequency bi-modal histogram
    • Colour-code pulsars by distance; DONE
    • Add pulsar detail sound to the top three observants; 16 pulsars processed but not loaded
    • Add tones to pulsars to indicate f; DONE
    • Add tooltips to show location, distance, frequency, name; DONE
    • Add title and description; DONE
    • Project data onto a planetarium dome with interaction to play pulsar frequencies. DONE; see the YouTube video at https://youtu.be/F119gqOKJ1U
    • Zoom into parts of the sky to get separation between close data points; see YouTube; function in Google Earth #datavis of dataset. Link at YouTube.
    • Set upper and lower tone boundaries, so tones aren't annoying
    • Colour-code pulsars by frequency bins e.g. >100 Hz, 10 - 100, 1 - 10,

  17. McKinsey Solve Assessment Data (2018–2025)

    • kaggle.com
    Updated May 7, 2025
    Cite
    Oluwademilade Adeniyi (2025). McKinsey Solve Assessment Data (2018–2025) [Dataset]. http://doi.org/10.34740/kaggle/dsv/11720554
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    May 7, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Oluwademilade Adeniyi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    McKinsey Solve Global Assessment Dataset (2018–2025)

    🧠 Context

    McKinsey's Solve is a gamified problem-solving assessment used globally in the consulting firm’s recruitment process. This dataset simulates assessment results across geographies, education levels, and roles over a 7-year period. It aims to provide deep insights into performance trends, candidate readiness, resume quality, and cognitive task outcomes.

    📌 Inspiration & Purpose

    Inspired by McKinsey’s real-world assessment framework, this dataset was designed to enable:
    - Exploratory Data Analysis (EDA)
    - Recruitment trend analysis
    - Gamified performance modelling
    - Dashboard development in Excel / Power BI
    - Resume and education impact evaluation
    - Regional performance benchmarking
    - Data storytelling for portfolio projects

    Whether you're building dashboards or training models, this dataset offers practical and relatable data for HR analytics and consulting use cases.

    🔍 Dataset Source

    • Data generated by Oluwademilade Adeniyi (Demibolt) with the assistance of ChatGPT by OpenAI
    • Structure and logic inspired by McKinsey’s public-facing Solve information, including role categories, game types (Ecosystem, Redrock, Seawolf), education tiers, and global office locations
    • The entire dataset is synthetic and designed for analytical learning, ethical use, and professional development

    🧾 Dataset Structure

    This dataset includes 4,000 rows and the following columns:
    - Testtaker ID: Unique identifier
    - Country / Region: Geographic segmentation
    - Gender / Age: Demographics
    - Year: Assessment year (2018–2025)
    - Highest Level of Education: From high school to PhD / MBA
    - School or University Attended: Mapped to country and education level
    - First-generation University Student: Yes/No
    - Employment Status: Student, Employed, Unemployed
    - Role Applied For and Department / Interest: Business/tech disciplines
    - Past Test Taker: Indicates repeat attempts
    - Prepared with Online Materials: Indicates test prep involvement
    - Desired Office Location: Mapped to McKinsey's international offices
    - Ecosystem / Redrock / Seawolf (%): Game performance scores
    - Time Spent on Each Game (mins)
    - Total Product Score: Average of the 3 game scores
    - Process Score: A secondary assessment component
    - Resume Score: Scored based on education prestige, role fit, and clarity
    - Total Assessment Score (%): Final decision metric
    - Status (Pass/Fail): Based on total score ≥ 75%
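    Once loaded into a database, these columns support quick grouped aggregates. A sketch (the table name solve_results is hypothetical; Region and Status are the columns described above):

    -- Sketch: pass rate by region. The table name is an assumption;
    -- Region and Status come from the column list above.
    SELECT Region,
           COUNT(*) AS candidates,
           AVG(CASE WHEN Status = 'Pass' THEN 1.0 ELSE 0.0 END) AS pass_rate
    FROM solve_results
    GROUP BY Region
    ORDER BY pass_rate DESC;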

    ✅ Why Use This Dataset

    • Benchmark educational and regional trends in global assessments
    • Build KPI cards, donut charts, histograms, or speedometer visuals
    • Train pass/fail classifiers or regression models
    • Segment job applicants by role, location, or game behaviour
    • Showcase portfolio skills across Excel, SQL, Power BI, Python, or R
    • Test dashboards or predictive logic in a business-relevant scenario

    💡 Credit & Collaboration

    • Data Creator: Oluwademilade Adeniyi (Me) (LinkedIn, Twitter, GitHub, Medium)
    • Collaborator: ChatGPT by OpenAI
    • Inspired by: McKinsey & Company’s Solve Assessment
  18. agile project dataset 2024

    • kaggle.com
    Updated Feb 20, 2025
    Cite
    digro k (2025). agile project dataset 2024 [Dataset]. https://www.kaggle.com/datasets/digrok/agile-project-dataset-2024
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Feb 20, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    digro k
    License

    Database Contents License (DbCL) 1.0: http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Dataset Description: 200 Agile Software Projects

    Overview

    This dataset contains records of 200 Agile software development projects. It includes various performance metrics related to Agile methodologies, measuring their effectiveness in project success, risk mitigation, time efficiency, and cost savings. The dataset is designed for analysis of AI-driven automation in Agile software teams.

    Dataset Variables

    1. Agile Effectiveness (Likert scale: 2 to 5): Measures how well Agile methodologies enhance project management processes.

    2. Risk Mitigation (Likert scale: 2 to 5): Captures the effectiveness of Agile in identifying and reducing risks throughout the project lifecycle.

    3. Management Satisfaction (Likert scale: 2 to 5): Represents how satisfied management is with the outcomes of Agile-implemented projects.

    4. Supply Chain Improvement (Likert scale: 2 to 5): Evaluates the impact of Agile practices on optimizing supply chain processes.

    5. Time Efficiency (Likert scale: 2 to 5): Measures improvements in time management within Agile projects.

    6. Cost Savings (%) (Range: 10% to 48%): Quantifies the percentage of cost savings achieved due to Agile methodologies.

    7. Project Success (Binary: 0 = Failure, 1 = Success): Indicates whether the project was considered successful.

    Usage

    This dataset is useful for:
    ✅ Evaluating the impact of AI automation on Agile workflows.
    ✅ Understanding factors contributing to Agile project success.
    ✅ Analyzing cost savings and efficiency improvements in Agile teams.
    ✅ Building machine learning models to predict project success based on Agile metrics.
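    For a quick first pass at those questions, a sketch in SQL (all table and column names below are hypothetical stand-ins for the variables described above):

    -- Sketch: success rate and average cost savings by Agile
    -- effectiveness score. Table and column names are hypothetical.
    SELECT agile_effectiveness,
           COUNT(*) AS projects,
           AVG(CAST(project_success AS REAL)) AS success_rate,
           AVG(cost_savings_pct) AS avg_cost_savings
    FROM agile_projects
    GROUP BY agile_effectiveness
    ORDER BY agile_effectiveness;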

  19. Hospital Management Dataset

    • kaggle.com
    Updated May 30, 2025
    Cite
    Kanak Baghel (2025). Hospital Management Dataset [Dataset]. https://www.kaggle.com/datasets/kanakbaghel/hospital-management-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    May 30, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Kanak Baghel
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This is a structured, multi-table dataset designed to simulate a hospital management system. It is ideal for practicing data analysis, SQL, machine learning, and healthcare analytics.

    Dataset Overview

    This dataset includes five CSV files:

    1. patients.csv – Patient demographics, contact details, registration info, and insurance data

    2. doctors.csv – Doctor profiles with specializations, experience, and contact information

    3. appointments.csv – Appointment dates, times, visit reasons, and statuses

    4. treatments.csv – Treatment types, descriptions, dates, and associated costs

    5. billing.csv – Billing amounts, payment methods, and status linked to treatments

    📁 Files & Column Descriptions

    patients.csv

    Contains patient demographic and registration details.

    Column Description

    patient_id -> Unique ID for each patient
    first_name -> Patient's first name
    last_name -> Patient's last name
    gender -> Gender (M/F)
    date_of_birth -> Date of birth
    contact_number -> Phone number
    address -> Address of the patient
    registration_date -> Date of first registration at the hospital
    insurance_provider -> Insurance company name
    insurance_number -> Policy number
    email -> Email address

    doctors.csv

    Details about the doctors working in the hospital.

    Column Description

    doctor_id -> Unique ID for each doctor
    first_name -> Doctor's first name
    last_name -> Doctor's last name
    specialization -> Medical field of expertise
    phone_number -> Contact number
    years_experience -> Total years of experience
    hospital_branch -> Branch of hospital where the doctor is based
    email -> Official email address

    appointments.csv

    Records of scheduled and completed patient appointments.

    Column Description

    appointment_id -> Unique appointment ID
    patient_id -> ID of the patient
    doctor_id -> ID of the attending doctor
    appointment_date -> Date of the appointment
    appointment_time -> Time of the appointment
    reason_for_visit -> Purpose of visit (e.g., checkup)
    status -> Status (Scheduled, Completed, Cancelled)

    treatments.csv

    Information about the treatments given during appointments.

    Column Description

    treatment_id -> Unique ID for each treatment
    appointment_id -> Associated appointment ID
    treatment_type -> Type of treatment (e.g., MRI, X-ray)
    description -> Notes or procedure details
    cost -> Cost of treatment
    treatment_date -> Date when treatment was given

    billing.csv

    Billing and payment details for treatments.

    Column Description

    bill_id -> Unique billing ID
    patient_id -> ID of the billed patient
    treatment_id -> ID of the related treatment
    bill_date -> Date of billing
    amount -> Total amount billed
    payment_method -> Mode of payment (Cash, Card, Insurance)
    payment_status -> Status of payment (Paid, Pending, Failed)
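    To illustrate how the five files link together, a sketch (assuming the CSVs are loaded as tables named after the files):

    -- Sketch: total billed per doctor, linking doctors -> appointments
    -- -> treatments -> billing. Assumes tables named after the CSVs.
    SELECT d.first_name,
           d.last_name,
           COUNT(DISTINCT a.appointment_id) AS appointments,
           SUM(b.amount) AS total_billed
    FROM doctors AS d
    JOIN appointments AS a ON a.doctor_id = d.doctor_id
    JOIN treatments AS t ON t.appointment_id = a.appointment_id
    JOIN billing AS b ON b.treatment_id = t.treatment_id
    GROUP BY d.doctor_id, d.first_name, d.last_name
    ORDER BY total_billed DESC;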

    Possible Use Cases

    SQL queries and relational database design

    Exploratory data analysis (EDA) and dashboarding

    Machine learning projects (e.g., cost prediction, no-show analysis)

    Feature engineering and data cleaning practice

    End-to-end healthcare analytics workflows

    Recommended Tools & Resources

    SQL (joins, filters, window functions)

    Pandas and Matplotlib/Seaborn for EDA

    Scikit-learn for ML models

    Pandas Profiling for automated EDA

    Plotly for interactive visualizations

    Please Note that :

    All data is synthetically generated for educational and project use. No real patient information is included.

    If you find this dataset helpful, consider upvoting or sharing your insights by creating a Kaggle notebook.

  20. My analysis of the "bike share" data: Google S.

    • kaggle.com
    Updated Jul 9, 2025
    Cite
    Lamar McMillan (2025). My analysis of the "bike share" data: Google S. [Dataset]. https://www.kaggle.com/lamarmcmillan/my-analysis-of-the-bike-share-data-google-s/code
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jul 9, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Lamar McMillan
    Description

    Context

    One analysis done in spreadsheets with 202004 and 202005 data

    Content

    To adjust for outlier ride lengths, note the max and min below:
    Max RL: =MAX(N:N) = 978:40:02
    Min RL: =MIN(N:N) = -0:02:56

    TRIMMEAN shaves off the top and bottom of a dataset:
    =TRIMMEAN(N:N, 5%) = 0:20:20
    =TRIMMEAN(N:N, 2%) = 0:21:27

    Otherwise, the average ride length for 202004 is 0:35:51.

    The most common day of the week is Sunday, and there are 61,148 members and 23,628 casual riders:
    Mode of day of week: 1 (Sunday)
    COUNTIF member in member_casual: 61,148
    COUNTIF casual in member_casual: 23,628

    Pivot table 1 (2020-04): member_casual vs AVERAGE of ride_length.

    The same calculations for 2020-05:
    Average RL: 0:33:23
    Max RL: 481:36:53
    Min RL: -0:01:48
    Mode of day of week: 7
    COUNTIF member in member_casual: 113,365
    COUNTIF casual in member_casual: 86,909
    TRIMMEAN (5%): 0:25:22
    TRIMMEAN (2%): 0:26:59

    There are 4 pivot tables included in separate sheets for other comparisons.

    Acknowledgements

    I gathered this data using the sources provided by the Google Data Analytics course. All of the work shown is my own.

    Inspiration

    I want to explore the data further in SQL and Tableau.
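    As a first step in that direction, the spreadsheet aggregates above translate directly into SQL. A sketch (the table name trips and its columns are hypothetical stand-ins for the ride data):

    -- Sketch: ride counts and average ride length by rider type.
    -- Table and column names are hypothetical.
    SELECT member_casual,
           COUNT(*) AS rides,
           AVG(ride_length_seconds) AS avg_ride_length_seconds
    FROM trips
    GROUP BY member_casual;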
