https://crawlfeeds.com/privacy_policy
Discover the Walmart Products Free Dataset, featuring 2,000 records in CSV format. This dataset includes detailed information about various Walmart products, such as names, prices, categories, and descriptions.
It’s perfect for data analysis, e-commerce research, and machine learning projects. Download now and kickstart your insights with accurate, real-world data.
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Context:
This dataset provides comprehensive historical data for the EPX/USDT trading pair on Binance, dating from November 21, 2021, to May 21, 2024. It is particularly curated for facilitating advanced predictive analytics and machine learning projects, especially in the field of financial time series forecasting.
Sources:
The data was meticulously sourced from investing.com, a reliable platform for financial information and data analytics. It captures critical daily trading metrics, including the opening, closing, highest, and lowest prices, along with daily trading volume and percentage changes. This rich dataset is integral for constructing robust models that can predict future trading behaviors and trends.
Inspiration:
With a background in artificial intelligence and financial modeling, I have embarked on a project to predict the future prices of EPX/USDT using advanced neural network architectures. This project aims to leverage the power of several cutting-edge algorithms to create a robust forecasting backbone, combining:
Gated Recurrent Units (GRU): Employed to capture the complexities of sequential data while efficiently handling long-term dependencies.
Long Short-Term Memory (LSTM): Utilized to overcome the vanishing gradient problem, ensuring the model remembers essential patterns over extended periods.
Recurrent Neural Networks (RNN): Applied to process sequences of trading data, retaining the temporal dynamics and dependencies inherent in time series data.
Transformers: Integrated to benefit from their ability to handle both local and global dependencies in data, ensuring more accurate and contextually aware predictions.
The synergy of these algorithms aims to forge a resilient and accurate predictive model, capable of anticipating price movements and trends for the month of June 2024. This project showcases the potential of deploying hybrid neural network architectures for tackling real-world financial forecasting challenges.
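For readers who want a concrete starting point, below is a minimal sketch of a stacked LSTM/GRU forecaster in Python (Keras). The file name `EPX_USDT_daily.csv`, the `Close` column, the 30-day window, and all hyperparameters are illustrative assumptions, not the configuration actually used in this project.

```python
# Minimal sketch: a stacked LSTM + GRU forecaster for daily closing prices.
# File name, column name, window size, and hyperparameters are placeholders.
import numpy as np
import pandas as pd
import tensorflow as tf

WINDOW = 30  # days of history used to predict the next close

df = pd.read_csv("EPX_USDT_daily.csv")           # hypothetical file name
prices = df["Close"].to_numpy(dtype="float32")   # hypothetical column name

# Scale to [0, 1] and build sliding windows.
lo, hi = prices.min(), prices.max()
scaled = (prices - lo) / (hi - lo)
X = np.stack([scaled[i:i + WINDOW] for i in range(len(scaled) - WINDOW)])[..., None]
y = scaled[WINDOW:]

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(WINDOW, 1)),
    tf.keras.layers.LSTM(64, return_sequences=True),  # long-range memory
    tf.keras.layers.GRU(32),                           # lighter recurrent layer
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=10, batch_size=32, validation_split=0.1)
```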
Usage:
Users can utilize this dataset to:
Conduct time series analysis and predictive modeling.
Train and evaluate various machine learning and deep learning models.
Develop custom financial forecasting tools and algorithms.
Enhance their understanding of cryptocurrency trading patterns and dynamics.
With this dataset, the financial forecasting community can explore novel modeling techniques and validate their approaches against real-world data, contributing to the development of more precise and reliable predictive models.
Conclusion:
This dataset not only serves as a vital resource for academic and professional research but also stands as a testament to the power of innovative neural network architectures in the realm of financial forecasting. Whether you are a novice data scientist eager to explore time series data or a seasoned researcher looking to refine your models, this dataset offers a valuable foundation for your endeavors.
https://bigan.iacs.es/
BIGAN is the Big Data project of the Department of Health of the Government of Aragon, created to improve healthcare using data that are routinely collected within the public health system of Aragon. Development of the project has been entrusted to the Aragon Institute of Health Sciences (IACS).
The purpose of the project is to integrate all data collected within the health system on a technological platform, where it can be analysed by healthcare professionals, managers, educators, and researchers. The ultimate goal is to improve the healthcare system and the health of residents in Aragon through data observation. To achieve this, collection, analysis, and sharing of information between all involved stakeholders is vital.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Students in statistics, data science, analytics, and related fields study the theory and methodology of data-related topics. Some, but not all, are exposed to experiential learning courses that cover essential parts of the life cycle of practical problem-solving. Experiential learning enables students to convert real-world issues into solvable technical questions and effectively communicate their findings to clients. We describe several experiential learning course designs in statistics, data science, and analytics curricula. We present findings from interviews with faculty from the U.S., Europe, and the Middle East and surveys of former students. We observe that courses featuring live projects and coaching by experienced faculty have a high career impact, as reported by former participants. However, such courses are labor-intensive for both instructors and students. We give estimates of the required effort to deliver courses with live projects and the perceived benefits and tradeoffs of such courses. Overall, we conclude that courses offering live-project experiences, despite being more time-consuming than traditional courses, offer significant benefits for students regarding career impact and skill development, making them worthwhile investments. Supplementary materials for this article are available online.
The purpose of this project is to improve the accuracy of statistical software by providing reference datasets with certified computational results that enable the objective evaluation of statistical software. Currently datasets and certified values are provided for assessing the accuracy of software for univariate statistics, linear regression, nonlinear regression, and analysis of variance. The collection includes both generated and 'real-world' data of varying levels of difficulty. Generated datasets are designed to challenge specific computations. These include the classic Wampler datasets for testing linear regression algorithms and the Simon & Lesage datasets for testing analysis of variance algorithms. Real-world data include challenging datasets such as the Longley data for linear regression, and more benign datasets such as the Daniel & Wood data for nonlinear regression. Certified values are 'best-available' solutions. The certification procedure is described in the web pages for each statistical method. Datasets are ordered by level of difficulty (lower, average, and higher). Strictly speaking the level of difficulty of a dataset depends on the algorithm. These levels are merely provided as rough guidance for the user. Producing correct results on all datasets of higher difficulty does not imply that your software will pass all datasets of average or even lower difficulty. Similarly, producing correct results for all datasets in this collection does not imply that your software will do the same for your particular dataset. It will, however, provide some degree of assurance, in the sense that your package provides correct results for datasets known to yield incorrect results for some software. The Statistical Reference Datasets is also supported by the Standard Reference Data Program.
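As an illustration of how such reference datasets are typically used, the sketch below fits an ordinary least squares regression to the Longley data (a copy of which ships with statsmodels) and prints the computed coefficients so they can be compared against the certified values published on the NIST StRD pages; the certified numbers are deliberately not hard-coded here.

```python
# Minimal sketch: checking a linear-regression routine against the Longley
# reference dataset. Compare the printed results with the certified values
# from the NIST StRD web pages (not reproduced here).
import numpy as np
import statsmodels.api as sm

data = sm.datasets.longley.load_pandas()
X = sm.add_constant(data.exog)   # design matrix with intercept
y = data.endog

fit = sm.OLS(y, X).fit()
print(fit.params)   # compare against the certified coefficients
print(fit.bse)      # certified standard errors are also published

# A typical acceptance check: agreement to a chosen number of significant digits.
def agrees(computed, certified, digits=9):
    return np.allclose(computed, certified, rtol=10 ** (-digits))
```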
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
License information was derived automatically
Deterministic and stochastic approaches are two methods for modeling the crude oil and bottled water markets. Forecasting market prices directly affects energy producers and water users. Two software tools, Tableau and Python, are used to model and visualize both markets with the aim of estimating possible future prices. The role of these tools is to provide alternative forecasts from different methods (deterministic versus stochastic): price prediction in Tableau is deterministic, based on global optimization and time series, whereas Monte Carlo simulation, a stochastic method, is implemented in Python. The purpose of the project is, first, to predict the prices of crude oil and bottled water with a stochastic method (Monte Carlo simulation) and a deterministic one (Tableau) and, second, to compare the resulting prices in a case study of West Texas Intermediate (WTI) crude oil prices and U.S. bottled water.
1. Introduction
Predicting stocks and stock price indices is challenging because of the uncertainties involved. The problem can be approached from different angles: the analysis investors perform before investing in a stock, or the evaluation of stocks by studying statistics generated by market activity, such as past prices and volumes. Data analysis attempts to identify patterns and trends that may help estimate future prices. Initially, classical regression (deterministic) methods were used to predict stock trends; later, stochastic methods were applied to forecasting as well. In "Deterministic versus stochastic volatility: implications for option pricing models" (1997), Paul Brockman and Mustafa Chowdhury investigated whether stock return volatility is deterministic or stochastic, reporting that "Results reported herein add support to the growing literature on preference-based stochastic volatility models and generally reject the notion of deterministic volatility" (p. 499). Motivated by this debate, we model and forecast historical data with the two tools (Tableau and Python).
For its forecasting feature, Tableau automatically chooses, from up to eight candidate models, the one that generates the highest-quality forecast. According to the Tableau manual, Tableau assesses forecast quality by optimizing the smoothing of each model, and the optimization is global. The core of the feature is a taxonomy of exponential smoothing models from which the best of the eight is selected when enough data are available. The forecast feature assumes a deterministic real-world data-generating process, so Tableau's forecast illustrates the best possible future price using a deterministic approach (a time series of prices).
Monte Carlo simulation (MCS), implemented in Python, is used to predict the fluctuating market index. Monte Carlo forecasting solves problems by generating suitable random numbers and observing the fraction of them that obeys some property or properties; the method is used to obtain numerical solutions to problems too complicated to solve analytically. It randomly generates thousands of series representing potential outcomes for possible returns. Here the simulated price is driven by random draws from the range of spot prices observed between 2002 and 2016, which represents the stochastic method.
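As a hedged illustration of the stochastic side described above, the sketch below runs a simple Monte Carlo simulation in Python by bootstrapping historical daily returns. The input file, column name, horizon, and path count are placeholders, and resampling historical returns is one common choice rather than the exact model used in this project.

```python
# Minimal sketch: Monte Carlo price simulation by bootstrapping historical
# daily returns. File name, column name, horizon, and path count are
# illustrative placeholders.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

prices = pd.read_csv("wti_spot_2002_2016.csv")["Price"].to_numpy()  # hypothetical
returns = np.diff(prices) / prices[:-1]          # historical daily returns

n_paths, horizon = 10_000, 252                   # one trading year ahead
sampled = rng.choice(returns, size=(n_paths, horizon), replace=True)
paths = prices[-1] * np.cumprod(1.0 + sampled, axis=1)

final = paths[:, -1]
print("median simulated price:", np.median(final))
print("5th-95th percentile band:", np.percentile(final, [5, 95]))
```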
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Methods: The objective of this project was to determine whether a federated analysis approach using DataSHIELD can maintain the level of results of a classical centralized analysis in a real-world setting. This research was carried out on an anonymous, synthetic, longitudinal real-world oncology cohort randomly split into three local databases, mimicking three healthcare organizations, stored in a federated data platform integrating DataSHIELD. No individual data were transferred; statistics were calculated in parallel within each healthcare organization, and only summary statistics (aggregates) were returned to the federated data analyst. Descriptive statistics, survival analysis, regression models, and correlation were first performed with the centralized approach and then reproduced with the federated approach. The results were then compared between the two approaches.
Results: The cohort was split into three samples (N1 = 157 patients, N2 = 94, and N3 = 64), and 11 derived variables and four types of analyses were generated. All analyses were successfully reproduced using DataSHIELD, except for one descriptive variable due to a data disclosure limitation in the federated environment, showing the good capability of DataSHIELD. For descriptive statistics, exactly equivalent results were found for the federated and centralized approaches, except for some differences in position measures. Estimates of univariate regression models were similar, with a loss of accuracy observed for multivariate models due to source-database variability.
Conclusion: Our project showed a practical implementation and use case of a real-world federated approach using DataSHIELD. The capability and accuracy of common data manipulation and analysis were satisfactory, and the flexibility of the tool enabled the production of a variety of analyses while preserving the privacy of individual data. The DataSHIELD forum was also a practical source of information and support. To find the right balance between privacy and accuracy of the analysis, privacy requirements should be established before the start of the analysis, along with a data quality review of the participating healthcare organizations.
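The following is a conceptual sketch, in plain Python rather than the DataSHIELD API, of the aggregate-only pattern described above: each "site" computes local summary statistics, and only those aggregates leave the site and are pooled by the analyst.

```python
# Conceptual sketch of aggregate-only pooling, NOT DataSHIELD code: each site
# returns only (count, sum, sum of squares), and the analyst combines them.
import math

def local_aggregates(values):
    """Runs inside a site; only aggregates leave the site."""
    n = len(values)
    s = sum(values)
    ss = sum(v * v for v in values)
    return n, s, ss

def pooled_mean_sd(aggregates):
    """Runs at the federated analyst on the returned aggregates."""
    n = sum(a[0] for a in aggregates)
    s = sum(a[1] for a in aggregates)
    ss = sum(a[2] for a in aggregates)
    mean = s / n
    var = (ss - n * mean * mean) / (n - 1)
    return mean, math.sqrt(var)

# Three hypothetical sites standing in for the split cohort.
site_data = [[1.2, 3.4, 2.2], [0.9, 2.8], [1.5, 2.0, 2.6, 3.1]]
print(pooled_mean_sd([local_aggregates(d) for d in site_data]))
```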
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
McKinsey's Solve is a gamified problem-solving assessment used globally in the consulting firm’s recruitment process. This dataset simulates assessment results across geographies, education levels, and roles over a 7-year period. It aims to provide deep insights into performance trends, candidate readiness, resume quality, and cognitive task outcomes.
Inspired by McKinsey’s real-world assessment framework, this dataset was designed to enable:
- Exploratory Data Analysis (EDA)
- Recruitment trend analysis
- Gamified performance modelling
- Dashboard development in Excel / Power BI
- Resume and education impact evaluation
- Regional performance benchmarking
- Data storytelling for portfolio projects
Whether you're building dashboards or training models, this dataset offers practical and relatable data for HR analytics and consulting use cases.
This dataset includes 4,000 rows and the following columns:
- Testtaker ID: Unique identifier
- Country / Region: Geographic segmentation
- Gender / Age: Demographics
- Year: Assessment year (2018–2025)
- Highest Level of Education: From high school to PhD / MBA
- School or University Attended: Mapped to country and education level
- First-generation University Student: Yes/No
- Employment Status: Student, Employed, Unemployed
- Role Applied For and Department / Interest: Business/tech disciplines
- Past Test Taker: Indicates repeat attempts
- Prepared with Online Materials: Indicates test prep involvement
- Desired Office Location: Mapped to McKinsey's international offices
- Ecosystem / Redrock / Seawolf (%): Game performance scores
- Time Spent on Each Game (mins)
- Total Product Score: Average of the 3 game scores
- Process Score: A secondary assessment component
- Resume Score: Scored based on education prestige, role fit, and clarity
- Total Assessment Score (%): Final decision metric
- Status (Pass/Fail): Based on total score ≥ 75%
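A minimal pandas sketch for a first cut at the data is shown below. The column names follow the description above, but the exact header spellings (for instance whether geography is stored as "Region" or "Country / Region") may differ, and the file name is a placeholder.

```python
# Minimal sketch: pass rate by region and year. Column names follow the
# description above and may need adjusting to the exact CSV headers.
import pandas as pd

df = pd.read_csv("mckinsey_solve_simulated.csv")   # hypothetical file name

pass_rate = (
    df.assign(passed=df["Status (Pass/Fail)"].eq("Pass"))
      .groupby(["Region", "Year"])["passed"]       # "Region" assumed header
      .mean()
      .mul(100)
      .round(1)
      .rename("pass rate (%)")
)
print(pass_rate.head(10))
```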
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This dataset is a minimal example of Data Subject Access Request Packages (SARPs), as they can be retrieved under data protection laws, specifically the GDPR. It includes data from two data subjects, each with accounts for five major services, namely Amazon, Apple, Facebook, Google, and LinkedIn.
This dataset is meant to be an initial dataset that allows for manual exploration of structures and contents found in SARPs. Hence, the number of controllers and user profiles should be minimal but sufficient to allow cross-subject and cross-controller analysis. This dataset can be used to explore structures, formats and data types found in real-world SARPs. Thereby, the planning of future SARP-based research projects and studies shall be facilitated.
We invite other researchers to use this dataset to explore the structure of SARPs. The envisioned primary usage includes the development of user-centric privacy interfaces and other technical contributions in the area of data access rights. Moreover, these packages can also be used for exemplary data analyses, although no substantive research questions can be answered using this data. In particular, this data does not reflect how data subjects behave in the real world. However, it is representative enough to give a first impression of the types of data analysis possible when using real-world data.
In order to allow cross-subject analysis, while keeping the re-identification risk minimal, we used research-only accounts for the data generation. A detailed explanation of the data generation method can be found in the paper corresponding to the dataset, accepted for the Annual Privacy Forum 2024.
In short, two user profiles were designed and corresponding accounts were created for each of the five services. Then, those accounts were used for two to four months. During the usage period, we minimized the amount of identifying data and also avoided interactions with data subjects not part of this research. Afterwards, we performed a data access request via each controller's web interface. Finally, the data was cleansed as described in detail in the accompanying paper and in brief within the following section.
Before publication, both possibly identifying information and security-relevant attributes need to be obfuscated or deleted. Moreover, multi-party data (especially messages with external entities) must be deleted. Where data is obfuscated, we made sure to substitute multiple occurrences of the same information with the same replacement.
We provide a list of deleted and obfuscated items, the obfuscation scheme and, if applicable, the replacement.
The list of obfuscated items looks like the following example:
| path | filetype | filename | attribute | scheme | replacement |
|---|---|---|---|---|---|
| linkedin\Linkedin_Basic | csv | messages.csv | TO | semantic description | Firstname Lastname |
| gooogle\Meine Aktivitäten\Datenexport | html | MeineAktivitäten.html | IP Address | loopback | 127.142.201.194 |
| facebook\personal_information | json | profile_information.json | emails | semantic description | firstname.lastname@gmail.com |
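A minimal sketch of the consistent-substitution idea described above follows: every occurrence of the same original value is mapped to the same pseudonym, so cross-file links survive obfuscation. The replacement scheme and category labels are illustrative only, not the exact procedure used for this dataset.

```python
# Minimal sketch of consistent obfuscation: identical originals always map to
# the identical replacement, keyed by a semantic category.
replacements = {}   # original value -> stable replacement
counters = {}       # per-category counter for generated pseudonyms

def obfuscate(value, category):
    """Return a stable pseudonym for `value`."""
    if value not in replacements:
        counters[category] = counters.get(category, 0) + 1
        replacements[value] = f"{category}_{counters[category]:03d}"
    return replacements[value]

# Example: the same e-mail address found in two different export files.
print(obfuscate("firstname.lastname@gmail.com", "email"))  # email_001
print(obfuscate("firstname.lastname@gmail.com", "email"))  # email_001 again
print(obfuscate("203.0.113.7", "ip"))                      # ip_001
```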
To give you an overview of the dataset, we publicly provide some meta-data about the usage time and SARP characteristics of exports from subject A/ subject B.
| provider | usage time (in months) | export options | file types | # subfolders | # files | export size |
|---|---|---|---|---|---|---|
| Amazon | 2/4 | all categories | CSV (32/49), EML (2/5), JPEG (1/2), JSON (3/3), PDF (9/10), TXT (4/4) | 41/49 | 51/73 | 1.2 MB / 1.4 MB |
| Apple | 2/4 | all data, max. 1 GB / max. 4 GB | CSV (8/3) | 20/1 | 8/3 | 71.8 KB / 294.8 KB |
| Facebook | 2/4 | all data, JSON/HTML, on my computer | JSON (39/0), HTML (0/63), TXT (29/28), JPG (0/4), PNG (1/15), GIF (7/7) | 45/76 | 76/117 | 12.3 MB / 13.5 MB |
| Google | 2/4 | all data, frequency once, ZIP, max. 4 GB | HTML (8/11), CSV (10/13), JSON (27/28), TXT (14/14), PDF (1/1), MBOX (1/1), VCF (1/0), ICS (1/0), README (1/1), JPG (0/2) | 44/51 | 64/71 | 1.54 MB / 1.2 MB |
| LinkedIn | 2/4 | all data | CSV (18/21) | 0/0 (part 1), 0/0 (part 2) | 13/18 (part 1), 19/21 (part 2) | 3.9 KB / 6.0 KB (part 1), 6.2 KB / 9.2 KB (part 2) |
This data collection was performed by Daniela Pöhn (Universität der Bundeswehr München, Germany), Frank Pallas and Nicola Leschke (Paris Lodron Universität Salzburg, Austria). For questions, please contact nicola.leschke@plus.ac.at.
The dataset was collected according to the method presented in:
Leschke, Pöhn, and Pallas (2024). "How to Drill Into Silos: Creating a Free-to-Use Dataset of Data Subject Access Packages". Accepted for Annual Privacy Forum 2024.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
The purpose of data mining analysis is always to find patterns in the data using techniques such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset; before doing any work on the data, it has to be pre-processed, and this normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. In our project, after using clustering prior to classification, performance did not improve much. One reason could be that the features we selected for clustering are not well suited to it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics.
From the dimensionality reduction perspective: clustering is different from Principal Component Analysis, which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters to reduce the data dimension can lose a lot of information, since clustering techniques are based on a metric of 'distance', and at high dimensions Euclidean distance loses pretty much all meaning. Therefore, "reducing" dimensionality by mapping data points to cluster numbers is not always a good idea, since you may lose almost all the information.
From the feature creation perspective: clustering analysis creates labels based on patterns in the data, which introduces uncertainty. When clustering is used prior to classification, the choice of the number of clusters strongly affects the performance of the clustering, and in turn the performance of the classification. If the subset of features we cluster on is well suited to it, it might increase overall classification performance; for example, if the features we run k-means on are numerical and low-dimensional, the overall classification performance may be better.
We did not lock in the clustering outputs with a random_state, in an effort to see whether they were stable. Our assumption was that if the results vary highly from run to run, which they definitely did, the data may simply not cluster well with the selected methods. Essentially, the outcome we observed was that our results were not much better than random when applying clustering in the data preprocessing step.
Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format from which the models were created. This feedback loop can be used to measure the models' real-world effectiveness and to revise the models from time to time as things change.
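A minimal sketch of the pipeline discussed above is shown below: append k-means cluster labels as an extra feature and compare cross-validated classification accuracy with and without them. Synthetic data and scikit-learn defaults stand in for the project's actual dataset and settings (including the fixed random_state, which the project deliberately did not use).

```python
# Minimal sketch: clustering as feature creation before classification.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=30, n_informative=8,
                           random_state=0)

# Baseline: classify on the raw features.
clf = RandomForestClassifier(random_state=0)
base = cross_val_score(clf, X, y, cv=5).mean()

# Cluster label appended as one extra column.
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X)
X_aug = np.column_stack([X, labels])
aug = cross_val_score(clf, X_aug, y, cv=5).mean()

print(f"baseline accuracy:  {base:.3f}")
print(f"with cluster label: {aug:.3f}")   # often little or no improvement
```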
Asking questions, designing experiments, collecting data and interpreting those data, and communicating the results are some of the most important concepts an undergraduate biology lab can teach students. When the SARS-CoV2 pandemic forced our institution to transition to online learning in the middle of the spring 2020 semester, I was faced with the dilemma of converting a large laboratory course normally structured around active and guided inquiry labs, into an online course, while still covering the planned subject material. To that end, I developed this multi-week lab exercise featuring birdwatching to give students a chance to collect real-world data and use that large, complex dataset to answer an ecological question of their choosing. For this exercise, students were instructed to watch birds at their homes or where they were quarantining, and record a number of different parameters and observations twice a week for three weeks. All records were compiled across the entire class of 347 students. Students were then asked to pick an ecological question that could be addressed with these data, and use the full dataset and a PivotTable to generate a chart summarizing the data relevant to their question. The final project consisted of developing an infographic summarizing the key question and findings of their study in language geared toward the birdwatching public. This exercise gave students the opportunity to explore the scientific process including hypothesis development, data collection, analysis and summarization while still feeling connected to a larger class project.
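For readers who prefer scripting to spreadsheets, the PivotTable step could look roughly like the pandas sketch below. The column names (species, week, count) and the file name are hypothetical stand-ins for the fields students actually recorded.

```python
# Minimal sketch of the PivotTable summarization step in pandas.
# Column names and file name are hypothetical placeholders.
import pandas as pd

obs = pd.read_csv("class_bird_observations.csv")   # hypothetical file

summary = pd.pivot_table(
    obs,
    index="species",    # rows: bird species
    columns="week",     # columns: observation week
    values="count",     # cell value: number of birds seen
    aggfunc="sum",
    fill_value=0,
)
# Most frequently reported species across all weeks.
print(summary.sum(axis=1).sort_values(ascending=False).head())
```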
Primary image: Mockingbirds are one of the most common songbirds identified by students in this project. This image was taken by the author and is her own work.
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
The Synthetic Student Performance Dataset is designed to support research, analytics, and educational projects focused on academic performance, family background, and behavioral factors affecting students. It mirrors real-world educational data and offers diverse features to explore student success patterns.
[Image: Synthetic student performance data visuals and distributions]
This dataset is ideal for:
The dataset captures a comprehensive view of student life, including family background, academic history, health, and lifestyle, and supports multi-disciplinary research across education, sociology, and data science.
CC0 (Public Domain)
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
🌌 Persona Omni Verse
Overview
Welcome to the Persona Omni Verse dataset! 🎉 This dataset is a rich collection of diverse and detailed personal information, meticulously crafted to provide comprehensive insights into fictional profiles. Ideal for use in various data analysis, machine learning, and simulation projects, it offers a wide range of attributes designed to reflect real-world complexity and variety. 🌟
Dataset Details
Name: Persona Omni Verse… See the full description on the dataset page: https://huggingface.co/datasets/MawLab/persona_omni_verse.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Background: Research suggests that early detection of hearing loss, coupled with prompt and appropriate treatment, can significantly alleviate its negative impacts. Routinely collected real-world data, such as those from electronic health records, provide an opportunity to enhance our understanding of the management of hearing loss. This project aims to create the HEaring Impairment Data Infrastructure (HEIDI) data lake by assembling datasets from general practice (GP), audiology clinic registries, and cohort studies to investigate hearing-impaired patients’ care pathways. This study seeks to answer key research questions such as “How do patients with hearing loss navigate the care pathway from general practice clinics to audiology clinics?”.
Methods and analysis: The HEIDI data lake will be hosted in a secure research environment at Macquarie University, Sydney, Australia, that complies with Australian legal and ethical requirements to protect patient privacy. New integrated datasets will then be built through data linkage of hearing and GP datasets. Finally, the HEIDI data warehouse will be developed and used as a stand-alone dataset for future research. Descriptive and predictive analytics will be undertaken on the data warehouse to answer our research questions. Descriptive analysis will include both conventional and advanced statistical techniques and visualisation that will help us understand the journey of patients with hearing loss. Machine learning strategies such as deep neural networks, support vector machines, and random forests will also be employed for predictive analytics, to identify participants who could benefit from proactive management by their GP and to determine the effect of interventions along the patient's journey (e.g., referrals to a specialist) on outcomes (e.g., adherence to the intervention).
Dissemination: The findings will be disseminated widely through academic journals, conferences, and other presentations.
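As a generic illustration of one of the predictive strategies mentioned under Methods and analysis, the sketch below fits a random forest to synthetic stand-in features for a binary outcome such as adherence to an intervention. It is not HEIDI analysis code; all data and settings are placeholders.

```python
# Illustrative sketch only (not HEIDI code): a random forest predicting a
# binary outcome from synthetic stand-in features.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, n_informative=6,
                           weights=[0.8, 0.2], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

model = RandomForestClassifier(n_estimators=300, random_state=1).fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```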
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
License information was derived automatically
This dataset provides a unique glimpse into the revolutionary approach of the Oakland Athletics during the early 2000s—a period famously chronicled in the Moneyball story. When budget constraints forced the team to rethink traditional scouting, Billy Beane and Paul DePodesta turned to advanced statistics, focusing on undervalued metrics like on-base percentage (OBP) and slugging percentage (SLG) to identify key contributors on the field.
Timeframe & Context:
Capturing the early 2000s, the dataset reflects the strategic shift in baseball, where data-driven decision-making helped underdog teams compete against financially superior clubs.
Key Features:
The dataset includes critical performance metrics such as:
Inspired by the groundbreaking Moneyball strategy, this dataset was originally assembled to highlight how teams can leverage data analytics to uncover hidden value. The approach not only reshaped baseball recruitment but also set the stage for modern sports analytics. The data itself was gathered from reliable sources such as baseball-reference.com and Sports-reference.com, and it was featured in The Analytics Edge course on EdX, which further demonstrates its educational and analytical value.
Sports Analytics:
Explore how key statistics correlate with game outcomes, playoff appearances, and overall team success.
Predictive Modeling:
Build models to predict wins, playoff qualification, and even future team performance.
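A minimal sketch of such a model is given below: regressing season wins on on-base and slugging percentage. The column names (OBP, SLG, W) follow the convention of the usual Moneyball teaching dataset and may need adjusting; the file name is a placeholder.

```python
# Minimal sketch: regress season wins on OBP and SLG.
# Column names and file name are assumptions and may need adjusting.
import pandas as pd
from sklearn.linear_model import LinearRegression

teams = pd.read_csv("moneyball_teams.csv")   # hypothetical file name

X = teams[["OBP", "SLG"]]
y = teams["W"]                               # wins in the season

model = LinearRegression().fit(X, y)
print("coefficients:", dict(zip(X.columns, model.coef_)))
print("intercept:", model.intercept_)
print("R^2:", model.score(X, y))
```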
Educational Resource:
Ideal for coursework or projects in data science and sports analytics, offering a real-world example of how data can drive strategic decisions.
This dataset is a powerful resource for anyone interested in the intersection of sports and data science, providing both historical insights and a platform for further analysis.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
For the sake of research transparency and reducing research and reproducibility costs, we have stored all data and computer code of the project "What make readers love a fiction book: a stat analysis on Wild Wise Weird using real-world data from Amazon readers' reviews" on Zenodo.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This project developed a comprehensive data management system designed to support collaborative groundwater research across institutions by establishing a centralized, structured database for hydrologic time series data. Built on the Observations Data Model (ODM), the system stores time series data and metadata in a relational SQLite database. Key project components included database construction, automation of data formatting and importation, development of analytical and visualization tools, and integration with ArcGIS for geospatial representation. The data import workflow standardizes and validates diverse .csv datasets by aligning them with ODM formatting. A Python-based module was created to facilitate data retrieval, analysis, visualization, and export, while an interactive map feature enables users to explore site-specific data availability. Additionally, a custom ArcGIS script was implemented to generate maps that incorporate stream networks, site locations, and watershed boundaries using DEMs from USGS sources. The system was tested using real-world datasets from groundwater wells and surface water gages across Utah, demonstrating its flexibility in handling diverse formats and parameters. The relational structure enabled efficient querying and visualization, and the developed tools promoted accessibility and alignment with FAIR principles.
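To give a flavour of how such a database might be queried, the sketch below pulls one site's time series from an ODM-style SQLite file into pandas. The table and column names follow common ODM conventions (Sites, DataValues, SiteCode, LocalDateTime, DataValue), but the exact schema of this project's database, the file name, and the site code are assumptions.

```python
# Minimal sketch: pulling one site's time series from an ODM-style SQLite
# database into pandas. Table/column names follow common ODM conventions
# and may differ from this project's exact schema.
import sqlite3
import pandas as pd

conn = sqlite3.connect("groundwater_odm.sqlite")   # hypothetical file name

query = """
SELECT dv.LocalDateTime, dv.DataValue
FROM DataValues AS dv
JOIN Sites AS s ON s.SiteID = dv.SiteID
WHERE s.SiteCode = ?
ORDER BY dv.LocalDateTime
"""
series = pd.read_sql_query(query, conn, params=("UTGW_001",),   # hypothetical code
                           parse_dates=["LocalDateTime"])
conn.close()

print(series.describe())
```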
https://spdx.org/licenses/CC0-1.0.html
The increasing scale and diversity of seismic data, and the growing role of big data in seismology, have raised interest in methods to make data exploration more accessible. This paper presents the use of knowledge graphs (KGs) for representing seismic data and metadata to improve data exploration and analysis, focusing on usability, flexibility, and extensibility. Using constraints derived from domain knowledge in seismology, we define semantic models of seismic station and event information used to construct the KGs. Our approach utilizes the capability of KGs to integrate data across many sources and diverse schema formats. We use schema-diverse, real-world seismic data to construct KGs with millions of nodes, and illustrate potential applications with three big-data examples. Our findings demonstrate the potential of KGs to enhance the efficiency and efficacy of seismological workflows in research and beyond, indicating a promising interdisciplinary future for this technology.
Methods: The data here consists of, and was collected from:
- Station metadata, in StationXML format, acquired from IRIS DMC using the fdsnws-station webservice (https://service.iris.edu/fdsnws/station/1/).
- Earthquake event data, in NDK format, acquired from the Global Centroid-Moment Tensor (GCMT) catalog webservice (https://www.globalcmt.org) [1,2].
- Earthquake event data, in CSV format, acquired from the USGS earthquake catalog webservice (https://doi.org/10.5066/F7MS3QZH) [3].
The format of the data is described in the README. In addition, a complete description of the StationXML, NDK, and USGS file formats can be found at https://www.fdsn.org/xml/station/, https://www.ldeo.columbia.edu/~gcmt/projects/CMT/catalog/allorder.ndk_explained, and https://earthquake.usgs.gov/data/comcat/#event-terms, respectively. Also provided are conversions from NDK and StationXML file formats into JSON format. References: [1] Dziewonski, A. M., Chou, T. A., & Woodhouse, J. H. (1981). Determination of earthquake source parameters from waveform data for studies of global and regional seismicity. Journal of Geophysical Research: Solid Earth, 86(B4), 2825-2852. [2] Ekström, G., Nettles, M., & Dziewoński, A. M. (2012). The global CMT project 2004–2010: Centroid-moment tensors for 13,017 earthquakes. Physics of the Earth and Planetary Interiors, 200, 1-9. [3] U.S. Geological Survey, Earthquake Hazards Program, 2017, Advanced National Seismic System (ANSS) Comprehensive Catalog of Earthquake Events and Products: Various, https://doi.org/10.5066/F7MS3QZH.
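For orientation, the following is a minimal property-graph sketch in Python using networkx: a station node, an event node, and a "recorded" relationship. The records and attribute names are illustrative and do not reproduce the semantic models defined in the paper.

```python
# Minimal sketch of a property-graph view of stations and events, using
# networkx. Node/edge attributes are illustrative, not the paper's schema.
import networkx as nx

kg = nx.MultiDiGraph()

# Hypothetical records, e.g. parsed from StationXML and a USGS CSV.
station = {"code": "IU.ANMO", "lat": 34.946, "lon": -106.457}
event = {"id": "example_event", "magnitude": 6.1, "lat": 35.2, "lon": -104.9}

kg.add_node(station["code"], kind="station", lat=station["lat"], lon=station["lon"])
kg.add_node(event["id"], kind="event", magnitude=event["magnitude"],
            lat=event["lat"], lon=event["lon"])

# Relationship: this station recorded this event.
kg.add_edge(station["code"], event["id"], relation="recorded")

# Simple query: all events recorded by a given station.
recorded = [v for _, v, d in kg.out_edges(station["code"], data=True)
            if d.get("relation") == "recorded"]
print(recorded)
```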
U.S. Government Works (https://www.usa.gov/government-works)
License information was derived automatically
Motivation/Problem Statement: DEREChOS is a natural advancement of the existing, highly-successful Automated Event Service (AES) project. AES is an advanced system that facilitates efficient exploration and analysis of Earth science data. While AES is well-suited for the original purpose of searching for phenomena in regularly gridded data (e.g., reanalyses), targeted extensions would enable a much broader class of Earth science investigations to exploit the performance and flexibility of this service. We present a relevancy scenario, Event-based Hydrometeorological Science Data Analysis, which highlights the need for these features that would maximize the potential of DEREChOS for scientific research.
Proposed solution: We propose to develop DEREChOS, an extension of AES, that: (1) generalizes the underlying representation to support irregularly spaced observations such as point and swath data, (2) incorporates appropriate re-gridding and interpolation utilities to enable analysis across data from different sources, (3) introduces nonlinear dimensionality reduction (NDR) to facilitate identification of scientific relationships among high-dimensional datasets, and (4) integrates Moving Object Database technology to improve treatment of continuity for the events with coarse representation in time. With these features, DEREChOS will become a powerful environment that is appropriate for a very wide variety of Earth science analysis scenarios.
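For item (3), the generic technique can be illustrated as follows with scikit-learn's Isomap on a synthetic manifold; this is a sketch of nonlinear dimensionality reduction in general, not the DEREChOS implementation on SciDB.

```python
# Illustrative sketch of nonlinear dimensionality reduction (NDR) on a
# synthetic dataset, using scikit-learn's Isomap. Generic example only.
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, color = make_swiss_roll(n_samples=2000, random_state=0)

embedding = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
print(embedding.shape)   # (2000, 2): a 2-D embedding preserving local geometry
```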
Research strategy: DEREChOS will be created by integrating various separately developed technologies. In most cases this will require some re-implementation to exploit SciDB, the underlying database that has strong support for multidimensional scientific data. Where possible, synthetic data/inputs will be generated to facilitate independent testing of new components. A scientific use case will be used to derive specific interface requirements and to demonstrate integration success.
Significance: Freshwater resources are predicted to be a major focus of contention and conflict in the 21st century. Thus, hydrometeorology and hydrology communities are particularly attracted by the superior research productivity through AES, which has been demonstrated for two real-world use cases. This interest is reflected by the participation in DEREChOS of our esteemed collaborators, who include the Project Scientist of NASA SMAP, the Principal Scientist of NOAA MRMS, and lead algorithm developers of NASA GPM.
Relevance to the Program Element: This proposal responds to the core AIST program topic: 2.1.3 Data-Centric-Technologies. DEREChOS specifically addresses the request for big data analytics, including tools and techniques for data fusion and data mining, applied to the substantial data and metadata that result from Earth science observation and the use of other data-centric technologies.
TRL: Although AES will have achieved an exit TRL of 5 by the start date of this proposed project, DEREChOS will have an entry TRL of 3 due to the new innovations that have not previously been implemented within the underlying SciDB database. We expect that DEREChOS will have an exit TRL of 5 corresponding to an end-to-end test of the full system in a relevant environment.