100+ datasets found

🔍 Diverse CSV Dataset Samples
kaggle.com
zip
Updated Nov 6, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Samy Baladram (2023). 🔍 Diverse CSV Dataset Samples [Dataset]. https://www.kaggle.com/datasets/samybaladram/multidisciplinary-csv-datasets-collection
Explore at:
zip(330927 bytes)Available download formats
Dataset updated
Nov 6, 2023
Authors
Samy Baladram
License
http://www.gnu.org/licenses/lgpl-3.0.htmlhttp://www.gnu.org/licenses/lgpl-3.0.html
Description
https://i.imgur.com/PcSDv8A.png" alt="Imgur">

Overview

The dataset provided here is a rich compilation of various data files gathered to support diverse analytical challenges and education in data science. It is especially curated to provide researchers, data enthusiasts, and students with real-world data across different domains, including biostatistics, travel, real estate, sports, media viewership, and more.

Files

Below is a brief overview of what each CSV file contains: - Addresses: Practical examples of string manipulation and address data formatting in CSV. - Air Travel: Historical dataset suitable for analyzing trends in air travel over a period of three years. - Biostats: A dataset of office workers' biometrics, ideal for introductory statistics and biology. - Cities: Geographic and administrative data for urban analysis or socio-demographic studies. - Car Crashes in Catalonia: Weekly traffic accident data from Catalonia, providing a base for public policy research. - De Niro's Film Ratings: Analyze trends in film ratings over time with this entertainment-focused dataset. - Ford Escort Sales: Pre-owned vehicle sales data, perfect for regression analysis or price prediction models. - Old Faithful Geyser: Geological data for pattern recognition and prediction in natural phenomena. - Freshman Year Weights and BMIs: Dataset depicting weight and BMI changes for health and lifestyle studies. - Grades: Education performance data which can be correlated with demographics or study patterns. - Home Sales: A dataset reflecting the housing market dynamics, useful for economic analysis or real estate appraisal. - Hooke's Law Demonstration: Physics data illustrating the classic principle of elasticity in springs. - Hurricanes and Storm Data: Climate data on hurricane and storm frequency for environmental risk assessments. - Height and Weight Measurements: Public health research dataset on anthropometric data. - Lead Shot Specs: Detailed engineering data for material sciences and manufacturing studies. - Alphabet Letter Frequency: Text analysis dataset for frequency distribution studies in large text samples. - MLB Player Statistics: Comprehensive athletic data set for analysis of performance metrics in sports. - MLB Teams' Seasonal Performance: A dataset combining financial and sports performance data from the 2012 MLB season. - TV News Viewership: Media consumption data which can be used to analyze viewing patterns and trends. - Historical Nile Flood Data: A unique environmental dataset for historical trend analysis in flood levels. - Oscar Winner Ages: A dataset to explore age trends among Oscar-winning actors and actresses. - Snakes and Ladders Statistics: Data from the game outcomes useful in studying probability and game theory. - Tallahassee Cab Fares: Price modeling data from the real-world pricing of taxi services. - Taxable Goods Data: A snapshot of economic data concerning taxation impact on prices. - Tree Measurements: Ecological and environmental science data related to tree growth and forest management. - Real Estate Prices from Zillow: Market analysis dataset for those interested in housing price determinants.

Format

The enclosed data respect the comma-separated values (CSV) file format standards, ensuring compatibility with most data processing libraries in Python, R, and other languages. The datasets are ready for import into Jupyter notebooks, RStudio, or any other integrated development environment (IDE) used for data science.

Quality Assurance

The data is pre-checked for common issues such as missing values, duplicate records, and inconsistent entries, offering a clean and reliable dataset for various analytical exercises. With initial header lines in some CSV files, users can easily identify dataset fields and start their analysis without additional data cleaning for headers.

Acknowledgements

The dataset adheres to the GNU LGPL license, making it freely available for modification and distribution, provided that the original source is cited. This opens up possibilities for educators to integrate real-world data into curricula, researchers to validate models against diverse datasets, and practitioners to refine their analytical skills with hands-on data.

This dataset has been compiled from https://people.sc.fsu.edu/~jburkardt/data/csv/csv.html, with gratitude to the authors and maintainers for their dedication to providing open data resources for educational and research purposes. https://i.imgur.com/HOtyghv.png" alt="Imgur">
CSV file used in statistical analyses
data.csiro.au
researchdata.edu.au
+1more
Updated Oct 13, 2014
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
CSIRO (2014). CSV file used in statistical analyses [Dataset]. http://doi.org/10.4225/08/543B4B4CA92E6
Explore at:
Unique identifier
https://doi.org/10.4225/08/543B4B4CA92E6
Dataset updated
Oct 13, 2014
Dataset authored and provided by
CSIROhttps://www.csiro.au/
License
https://research.csiro.au/dap/licences/csiro-data-licence/https://research.csiro.au/dap/licences/csiro-data-licence/
Time period covered
Mar 14, 2008 - Jun 9, 2009
Dataset funded by
CSIROhttps://www.csiro.au/
Description
A csv file containing the tidal frequencies used for statistical analyses in the paper "Estimating Freshwater Flows From Tidally-Affected Hydrographic Data" by Dan Pagendam and Don Percival.
s
Data from: Data files used to study change dynamics in software systems
figshare.swinburne.edu.au
pdf
Updated Jul 22, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rajesh Vasa (2024). Data files used to study change dynamics in software systems [Dataset]. http://doi.org/10.25916/sut.26288227.v1
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.25916/sut.26288227.v1
Dataset updated
Jul 22, 2024
Dataset provided by
Swinburne
Authors
Rajesh Vasa
License
Attribution 3.0 (CC BY 3.0)https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Description
It is a widely accepted fact that evolving software systems change and grow. However, it is less well-understood how change is distributed over time, specifically in object oriented software systems. The patterns and techniques used to measure growth permit developers to identify specific releases where significant change took place as well as to inform them of the longer term trend in the distribution profile. This knowledge assists developers in recording systemic and substantial changes to a release, as well as to provide useful information as input into a potential release retrospective. However, these analysis methods can only be applied after a mature release of the code has been developed. But in order to manage the evolution of complex software systems effectively, it is important to identify change-prone classes as early as possible. Specifically, developers need to know where they can expect change, the likelihood of a change, and the magnitude of these modifications in order to take proactive steps and mitigate any potential risks arising from these changes. Previous research into change-prone classes has identified some common aspects, with different studies suggesting that complex and large classes tend to undergo more changes and classes that changed recently are likely to undergo modifications in the near future. Though the guidance provided is helpful, developers need more specific guidance in order for it to be applicable in practice. Furthermore, the information needs to be available at a level that can help in developing tools that highlight and monitor evolution prone parts of a system as well as support effort estimation activities. The specific research questions that we address in this chapter are: (1) What is the likelihood that a class will change from a given version to the next? (a) Does this probability change over time? (b) Is this likelihood project specific, or general? (2) How is modification frequency distributed for classes that change? (3) What is the distribution of the magnitude of change? Are most modifications minor adjustments, or substantive modifications? (4) Does structural complexity make a class susceptible to change? (5) Does popularity make a class more change-prone? We make recommendations that can help developers to proactively monitor and manage change. These are derived from a statistical analysis of change in approximately 55000 unique classes across all projects under investigation. The analysis methods that we applied took into consideration the highly skewed nature of the metric data distributions. The raw metric data (4 .txt files and 4 .log files in a .zip file measuring ~2MB in total) is provided as a comma separated values (CSV) file, and the first line of the CSV file contains the header. A detailed output of the statistical analysis undertaken is provided as log files generated directly from Stata (statistical analysis software).
Student & School Performance Data Analysis
kaggle.com
zip
Updated Apr 24, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dalton A. Schmidt (2025). Student & School Performance Data Analysis [Dataset]. https://www.kaggle.com/datasets/daltonschmidt/student-and-school-performance
Explore at:
zip(501198 bytes)Available download formats
Dataset updated
Apr 24, 2025
Authors
Dalton A. Schmidt
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Demonstrate my ability to use Jupyter Notebook and Pandas libraries to perform data transformation and analysis of CSV files. The data found within the "Resources" CSV files is synthesized and is to not be used within an official capacity.
c
Mecca Australia Extracted Data in CSV Format
crawlfeeds.com
csv, zip
Updated Sep 2, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Crawl Feeds (2024). Mecca Australia Extracted Data in CSV Format [Dataset]. https://crawlfeeds.com/datasets/mecca-australia-extracted-data-in-csv-format
Explore at:
csv, zipAvailable download formats
Dataset updated
Sep 2, 2024
Dataset authored and provided by
Crawl Feeds
License
https://crawlfeeds.com/privacy_policyhttps://crawlfeeds.com/privacy_policy
Area covered
Australia
Description
format. This dataset provides comprehensive details on a wide range of beauty products listed on Mecca Australia, one of the leading beauty retailers in the country.

Perfect for market researchers, data analysts, and beauty industry professionals, this dataset enables a deep dive into product offerings and trends without the clutter of customer reviews.

Features:

Product Information: Detailed data on various beauty products, including product names, categories, and brands.

Pricing Data: Up-to-date pricing details for each product, allowing for competitive analysis and pricing strategy development.

Product Descriptions: Comprehensive descriptions that provide insight into product features and benefits.

Stock Availability: Information on stock status to help track product availability and manage inventory.

CSV Format: Easy-to-use CSV file format for seamless integration into any data analysis or business intelligence tool.

Applications:

Market Analysis: Gain insights into the beauty market trends in Australia by analyzing product categories, brands, and pricing.

Competitor Research: Compare product offerings and pricing strategies to understand the competitive landscape.

Inventory Management: Use stock availability data to optimize inventory and ensure popular items are always in stock.

Product Development: Leverage product descriptions to identify gaps in the market and innovate new product offerings.

With the "Mecca Australia Extracted Data" in CSV format, you can easily access and analyze crucial product data, enabling informed decision-making and strategic planning in the beauty industry.
The Canada Trademarks Dataset
zenodo.org
data.niaid.nih.gov
+1more
pdf, zip
Updated Jul 19, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jeremy Sheff; Jeremy Sheff (2024). The Canada Trademarks Dataset [Dataset]. http://doi.org/10.5281/zenodo.4999655
Explore at:
zip, pdfAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.4999655
Dataset updated
Jul 19, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Jeremy Sheff; Jeremy Sheff
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Canada Trademarks Dataset

18 Journal of Empirical Legal Studies 908 (2021), prepublication draft available at https://papers.ssrn.com/abstract=3782655, published version available at https://onlinelibrary.wiley.com/share/author/CHG3HC6GTFMMRU8UJFRR?target=10.1111/jels.12303

Dataset Selection and Arrangement (c) 2021 Jeremy Sheff

Python and Stata Scripts (c) 2021 Jeremy Sheff

Contains data licensed by Her Majesty the Queen in right of Canada, as represented by the Minister of Industry, the minister responsible for the administration of the Canadian Intellectual Property Office.

This individual-application-level dataset includes records of all applications for registered trademarks in Canada since approximately 1980, and of many preserved applications and registrations dating back to the beginning of Canada’s trademark registry in 1865, totaling over 1.6 million application records. It includes comprehensive bibliographic and lifecycle data; trademark characteristics; goods and services claims; identification of applicants, attorneys, and other interested parties (including address data); detailed prosecution history event data; and data on application, registration, and use claims in countries other than Canada. The dataset has been constructed from public records made available by the Canadian Intellectual Property Office. Both the dataset and the code used to build and analyze it are presented for public use on open-access terms.

Scripts are licensed for reuse subject to the Creative Commons Attribution License 4.0 (CC-BY-4.0), https://creativecommons.org/licenses/by/4.0/. Data files are licensed for reuse subject to the Creative Commons Attribution License 4.0 (CC-BY-4.0), https://creativecommons.org/licenses/by/4.0/, and also subject to additional conditions imposed by the Canadian Intellectual Property Office (CIPO) as described below.

Terms of Use:

As per the terms of use of CIPO's government data, all users are required to include the above-quoted attribution to CIPO in any reproductions of this dataset. They are further required to cease using any record within the datasets that has been modified by CIPO and for which CIPO has issued a notice on its website in accordance with its Terms and Conditions, and to use the datasets in compliance with applicable laws. These requirements are in addition to the terms of the CC-BY-4.0 license, which require attribution to the author (among other terms). For further information on CIPO’s terms and conditions, see https://www.ic.gc.ca/eic/site/cipointernet-internetopic.nsf/eng/wr01935.html. For further information on the CC-BY-4.0 license, see https://creativecommons.org/licenses/by/4.0/.

The following attribution statement, if included by users of this dataset, is satisfactory to the author, but the author makes no representations as to whether it may be satisfactory to CIPO:

The Canada Trademarks Dataset is (c) 2021 by Jeremy Sheff and licensed under a CC-BY-4.0 license, subject to additional terms imposed by the Canadian Intellectual Property Office. It contains data licensed by Her Majesty the Queen in right of Canada, as represented by the Minister of Industry, the minister responsible for the administration of the Canadian Intellectual Property Office. For further information, see https://creativecommons.org/licenses/by/4.0/ and https://www.ic.gc.ca/eic/site/cipointernet-internetopic.nsf/eng/wr01935.html.

Details of Repository Contents:

This repository includes a number of .zip archives which expand into folders containing either scripts for construction and analysis of the dataset or data files comprising the dataset itself. These folders are as follows:

/csv: contains the .csv versions of the data files

/do: contains Stata do-files used to convert the .csv files to .dta format and perform the statistical analyses set forth in the paper reporting this dataset

/dta: contains the .dta versions of the data files

/py: contains the python scripts used to download CIPO’s historical trademarks data via SFTP and generate the .csv data files

If users wish to construct rather than download the datafiles, the first script that they should run is /py/sftp_secure.py. This script will prompt the user to enter their IP Horizons SFTP credentials; these can be obtained by registering with CIPO at https://ised-isde.survey-sondage.ca/f/s.aspx?s=59f3b3a4-2fb5-49a4-b064-645a5e3a752d&lang=EN&ds=SFTP. The script will also prompt the user to identify a target directory for the data downloads. Because the data archives are quite large, users are advised to create a target directory in advance and ensure they have at least 70GB of available storage on the media in which the directory is located.

The sftp_secure.py script will generate a new subfolder in the user’s target directory called /XML_raw. Users should note the full path of this directory, which they will be prompted to provide when running the remaining python scripts. Each of the remaining scripts, the filenames of which begin with “iterparse”, corresponds to one of the data files in the dataset, as indicated in the script’s filename. After running one of these scripts, the user’s target directory should include a /csv subdirectory containing the data file corresponding to the script; after running all the iterparse scripts the user’s /csv directory should be identical to the /csv directory in this repository. Users are invited to modify these scripts as they see fit, subject to the terms of the licenses set forth above.

With respect to the Stata do-files, only one of them is relevant to construction of the dataset itself. This is /do/CA_TM_csv_cleanup.do, which converts the .csv versions of the data files to .dta format, and uses Stata’s labeling functionality to reduce the size of the resulting files while preserving information. The other do-files generate the analyses and graphics presented in the paper describing the dataset (Jeremy N. Sheff, The Canada Trademarks Dataset, 18 J. Empirical Leg. Studies (forthcoming 2021)), available at https://papers.ssrn.com/abstract=3782655). These do-files are also licensed for reuse subject to the terms of the CC-BY-4.0 license, and users are invited to adapt the scripts to their needs.

The python and Stata scripts included in this repository are separately maintained and updated on Github at https://github.com/jnsheff/CanadaTM.

This repository also includes a copy of the current version of CIPO's data dictionary for its historical XML trademarks archive as of the date of construction of this dataset.
Sample Rainfall Dataset
kaggle.com
zip
Updated May 28, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kuntal Patel (2025). Sample Rainfall Dataset [Dataset]. https://www.kaggle.com/datasets/kpatel123/sample-rainfall-dataset
Explore at:
zip(1161 bytes)Available download formats
Dataset updated
May 28, 2025
Authors
Kuntal Patel
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
A sample rainfall dataset containing 100 records (in CSV file format), which includes some missing values, has been created to practice fundamental data cleaning operations and to extract basic statistical information from the provided CSV file.
r
1000 Empirical Time series
researchdata.edu.au
Updated May 5, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ben Fulcher (2022). 1000 Empirical Time series [Dataset]. http://doi.org/10.6084/m9.figshare.5436136.v10
Explore at:
Unique identifier
https://doi.org/10.6084/m9.figshare.5436136.v10
Dataset updated
May 5, 2022
Dataset provided by
Monash University
Authors
Ben Fulcher
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A diverse selection of 1000 empirical time series, along with results of an hctsa feature extraction, using v1.06 of hctsa and Matlab 2019b, computed on a server at The University of Sydney.

The results of the computation are in the hctsa file, HCTSA_Empirical1000.mat for use in Matlab using v1.06 of hctsa.

The same data is also provided in .csv format for the hctsa_datamatrix.csv (results of feature computation), with information about rows (time series) in hctsa_timeseries-info.csv, information about columns (features) in hctsa_features.csv (and corresponding hctsa code used to compute each feature in hctsa_masterfeatures.csv), and the data of individual time series (each line a time series, for time series described in hctsa_timeseries-info.csv) is in hctsa_timeseries-data.csv.

These .csv files were produced by running >>OutputToCSV(HCTSA_Empirical1000.mat,true,true); in hctsa.

The input file, INP_Empirical1000.mat, is for use with hctsa, and contains the time-series data and metadata for the 1000 time series. For example, massive feature extraction from these data on the user's machine, using hctsa, can proceed as
>> TS_Init('INP_Empirical1000.mat');

Some visualizations of the dataset are in CarpetPlot.png (first 1000 samples of all time series as a carpet (color) plot) and 150TS-250samples.png (conventional time-series plots of the first 250 samples of a sample of 150 time series from the dataset). More visualizations can be performed by the user using TS_PlotTimeSeries from the hctsa package.

See links in references for more comprehensive documentation for performing methodological comparison using this dataset, and on how to download and use v1.06 of hctsa.
Datasets for Sentiment Analysis
zenodo.org
datalens.synthax.codes
csv
Updated Dec 10, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Julie R. Repository creator - Campos Arias; Julie R. Repository creator - Campos Arias (2023). Datasets for Sentiment Analysis [Dataset]. http://doi.org/10.5281/zenodo.10157504
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.10157504
Dataset updated
Dec 10, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Julie R. Repository creator - Campos Arias; Julie R. Repository creator - Campos Arias
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository was created for my Master's thesis in Computational Intelligence and Internet of Things at the University of Córdoba, Spain. The purpose of this repository is to store the datasets found that were used in some of the studies that served as research material for this Master's thesis. Also, the datasets used in the experimental part of this work are included.
Below are the datasets specified, along with the details of their references, authors, and download sources.

----------- STS-Gold Dataset ----------------
The dataset consists of 2026 tweets. The file consists of 3 columns: id, polarity, and tweet. The three columns denote the unique id, polarity index of the text and the tweet text respectively.
Reference: Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.
File name: sts_gold_tweet.csv
----------- Amazon Sales Dataset ----------------
This dataset is having the data of 1K+ Amazon Product's Ratings and Reviews as per their details listed on the official website of Amazon. The data was scraped in the month of January 2023 from the Official Website of Amazon.
Owner: Karkavelraja J., Postgraduate student at Puducherry Technological University (Puducherry, Puducherry, India)
Features:
product_id - Product ID
product_name - Name of the Product
category - Category of the Product
discounted_price - Discounted Price of the Product
actual_price - Actual Price of the Product
discount_percentage - Percentage of Discount for the Product
rating - Rating of the Product
rating_count - Number of people who voted for the Amazon rating
about_product - Description about the Product
user_id - ID of the user who wrote review for the Product
user_name - Name of the user who wrote review for the Product
review_id - ID of the user review
review_title - Short review
review_content - Long review
img_link - Image Link of the Product
product_link - Official Website Link of the Product
License: CC BY-NC-SA 4.0
File name: amazon.csv
----------- Rotten Tomatoes Reviews Dataset ----------------
This rating inference dataset is a sentiment classification dataset, containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. On average, these reviews consist of 21 words. The first 5331 rows contains only negative samples and the last 5331 rows contain only positive samples, thus the data should be shuffled before usage.
This data is collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/ as a txt file and converted into a csv file. The file consists of 2 columns: reviews and labels (1 for fresh (good) and 0 for rotten (bad)).
Reference: Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics
File name: data_rt.csv
----------- Preprocessed Dataset Sentiment Analysis ----------------
Preprocessed amazon product review data of Gen3EcoDot (Alexa) scrapped entirely from amazon.in
Stemmed and lemmatized using nltk.
Sentiment labels are generated using TextBlob polarity scores.
The file consists of 4 columns: index, review (stemmed and lemmatized review using nltk), polarity (score) and division (categorical label generated using polarity score).
DOI: 10.34740/kaggle/dsv/3877817
Citation: @misc{pradeesh arumadi_2022, title={Preprocessed Dataset Sentiment Analysis}, url={https://www.kaggle.com/dsv/3877817}, DOI={10.34740/KAGGLE/DSV/3877817}, publisher={Kaggle}, author={Pradeesh Arumadi}, year={2022} }
This dataset was used in the experimental phase of my research.
File name: EcoPreprocessed.csv
----------- Amazon Earphones Reviews ----------------
This dataset consists of a 9930 Amazon reviews, star ratings, for 10 latest (as of mid-2019) bluetooth earphone devices for learning how to train Machine for sentiment analysis.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 5 columns: ReviewTitle, ReviewBody, ReviewStar, Product and division (manually added - categorical label generated using ReviewStar score)
License: U.S. Government Works
Source: www.amazon.in
File name (original): AllProductReviews.csv (contains 14337 reviews)
File name (edited - used for my research) : AllProductReviews2.csv (contains 9930 reviews)
----------- Amazon Musical Instruments Reviews ----------------
This dataset contains 7137 comments/reviews of different musical instruments coming from Amazon.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 10 columns: reviewerID, asin (ID of the product), reviewerName, helpful (helpfulness rating of the review), reviewText, overall (rating of the product), summary (summary of the review), unixReviewTime (time of the review - unix time), reviewTime (time of the review (raw) and division (manually added - categorical label generated using overall score).
Source: http://jmcauley.ucsd.edu/data/amazon/
File name (original): Musical_instruments_reviews.csv (contains 10261 reviews)
File name (edited - used for my research) : Musical_instruments_reviews2.csv (contains 7137 reviews)
Z
Assessing the impact of hints in learning formal specification: Research...
data.niaid.nih.gov
data-staging.niaid.nih.gov
Updated Jan 29, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Macedo, Nuno; Cunha, Alcino; Campos, José Creissac; Sousa, Emanuel; Margolis, Iara (2024). Assessing the impact of hints in learning formal specification: Research artifact [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10450608
Explore at:
Dataset updated
Jan 29, 2024
Dataset provided by
Centro de Computação Gráfica
INESC TEC
Authors
Macedo, Nuno; Cunha, Alcino; Campos, José Creissac; Sousa, Emanuel; Margolis, Iara
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
This artifact accompanies the SEET@ICSE article "Assessing the impact of hints in learning formal specification", which reports on a user study to investigate the impact of different types of automated hints while learning a formal specification language, both in terms of immediate performance and learning retention, but also in the emotional response of the students. This research artifact provides all the material required to replicate this study (except for the proprietary questionnaires passed to assess the emotional response and user experience), as well as the collected data and data analysis scripts used for the discussion in the paper.

Dataset

The artifact contains the resources described below.

Experiment resources

The resources needed for replicating the experiment, namely in directory experiment:

alloy_sheet_pt.pdf: the 1-page Alloy sheet that participants had access to during the 2 sessions of the experiment. The sheet was passed in Portuguese due to the population of the experiment.

alloy_sheet_en.pdf: a version the 1-page Alloy sheet that participants had access to during the 2 sessions of the experiment translated into English.

docker-compose.yml: a Docker Compose configuration file to launch Alloy4Fun populated with the tasks in directory data/experiment for the 2 sessions of the experiment.

api and meteor: directories with source files for building and launching the Alloy4Fun platform for the study.

Experiment data

The task database used in our application of the experiment, namely in directory data/experiment:

Model.json, Instance.json, and Link.json: JSON files with to populate Alloy4Fun with the tasks for the 2 sessions of the experiment.

identifiers.txt: the list of all (104) available participant identifiers that can participate in the experiment.

Collected data

Data collected in the application of the experiment as a simple one-factor randomised experiment in 2 sessions involving 85 undergraduate students majoring in CSE. The experiment was validated by the Ethics Committee for Research in Social and Human Sciences of the Ethics Council of the University of Minho, where the experiment took place. Data is shared the shape of JSON and CSV files with a header row, namely in directory data/results:

data_sessions.json: data collected from task-solving in the 2 sessions of the experiment, used to calculate variables productivity (PROD1 and PROD2, between 0 and 12 solved tasks) and efficiency (EFF1 and EFF2, between 0 and 1).

data_socio.csv: data collected from socio-demographic questionnaire in the 1st session of the experiment, namely:

participant identification: participant's unique identifier (ID);

socio-demographic information: participant's age (AGE), sex (SEX, 1 through 4 for female, male, prefer not to disclosure, and other, respectively), and average academic grade (GRADE, from 0 to 20, NA denotes preference to not disclosure).

data_emo.csv: detailed data collected from the emotional questionnaire in the 2 sessions of the experiment, namely:

participant identification: participant's unique identifier (ID) and the assigned treatment (column HINT, either N, L, E or D);

detailed emotional response data: the differential in the 5-point Likert scale for each of the 14 measured emotions in the 2 sessions, ranging from -5 to -1 if decreased, 0 if maintained, from 1 to 5 if increased, or NA denoting failure to submit the questionnaire. Half of the emotions are positive (Admiration1 and Admiration2, Desire1 and Desire2, Hope1 and Hope2, Fascination1 and Fascination2, Joy1 and Joy2, Satisfaction1 and Satisfaction2, and Pride1 and Pride2), and half are negative (Anger1 and Anger2, Boredom1 and Boredom2, Contempt1 and Contempt2, Disgust1 and Disgust2, Fear1 and Fear2, Sadness1 and Sadness2, and Shame1 and Shame2). This detailed data was used to compute the aggregate data in data_emo_aggregate.csv and in the detailed discussion in Section 6 of the paper.

data_umux.csv: data collected from the user experience questionnaires in the 2 sessions of the experiment, namely:

participant identification: participant's unique identifier (ID);

user experience data: summarised user experience data from the UMUX surveys (UMUX1 and UMUX2, as a usability metric ranging from 0 to 100).

participants.txt: the list of participant identifiers that have registered for the experiment.

Analysis scripts

The analysis scripts required to replicate the analysis of the results of the experiment as reported in the paper, namely in directory analysis:

analysis.r: An R script to analyse the data in the provided CSV files; each performed analysis is documented within the file itself.

requirements.r: An R script to install the required libraries for the analysis script.

normalize_task.r: A Python script to normalize the task JSON data from file data_sessions.json into the CSV format required by the analysis script.

normalize_emo.r: A Python script to compute the aggregate emotional response in the CSV format required by the analysis script from the detailed emotional response data in the CSV format of data_emo.csv.

Dockerfile: Docker script to automate the analysis script from the collected data.

Setup

To replicate the experiment and the analysis of the results, only Docker is required.

If you wish to manually replicate the experiment and collect your own data, you'll need to install:

A modified version of the Alloy4Fun platform, which is built in the Meteor web framework. This version of Alloy4Fun is publicly available in branch study of its repository at https://github.com/haslab/Alloy4Fun/tree/study.

If you wish to manually replicate the analysis of the data collected in our experiment, you'll need to install:

Python to manipulate the JSON data collected in the experiment. Python is freely available for download at https://www.python.org/downloads/, with distributions for most platforms.

R software for the analysis scripts. R is freely available for download at https://cran.r-project.org/mirrors.html, with binary distributions available for Windows, Linux and Mac.

Usage

Experiment replication

This section describes how to replicate our user study experiment, and collect data about how different hints impact the performance of participants.

To launch the Alloy4Fun platform populated with tasks for each session, just run the following commands from the root directory of the artifact. The Meteor server may take a few minutes to launch, wait for the "Started your app" message to show.

cd experimentdocker-compose up

This will launch Alloy4Fun at http://localhost:3000. The tasks are accessed through permalinks assigned to each participant. The experiment allows for up to 104 participants, and the list of available identifiers is given in file identifiers.txt. The group of each participant is determined by the last character of the identifier, either N, L, E or D. The task database can be consulted in directory data/experiment, in Alloy4Fun JSON files.

In the 1st session, each participant was given one permalink that gives access to 12 sequential tasks. The permalink is simply the participant's identifier, so participant 0CAN would just access http://localhost:3000/0CAN. The next task is available after a correct submission to the current task or when a time-out occurs (5mins). Each participant was assigned to a different treatment group, so depending on the permalink different kinds of hints are provided. Below are 4 permalinks, each for each hint group:

Group N (no hints): http://localhost:3000/0CAN

Group L (error locations): http://localhost:3000/CA0L

Group E (counter-example): http://localhost:3000/350E

Group D (error description): http://localhost:3000/27AD

In the 2nd session, likewise the 1st session, each permalink gave access to 12 sequential tasks, and the next task is available after a correct submission or a time-out (5mins). The permalink is constructed by prepending the participant's identifier with P-. So participant 0CAN would just access http://localhost:3000/P-0CAN. In the 2nd sessions all participants were expected to solve the tasks without any hints provided, so the permalinks from different groups are undifferentiated.

Before the 1st session the participants should answer the socio-demographic questionnaire, that should ask the following information: unique identifier, age, sex, familiarity with the Alloy language, and average academic grade.

Before and after both sessions the participants should answer the standard PrEmo 2 questionnaire. PrEmo 2 is published under an Attribution-NonCommercial-NoDerivatives 4.0 International Creative Commons licence (CC BY-NC-ND 4.0). This means that you are free to use the tool for non-commercial purposes as long as you give appropriate credit, provide a link to the license, and do not modify the original material. The original material, namely the depictions of the diferent emotions, can be downloaded from https://diopd.org/premo/. The questionnaire should ask for the unique user identifier, and for the attachment with each of the depicted 14 emotions, expressed in a 5-point Likert scale.

After both sessions the participants should also answer the standard UMUX questionnaire. This questionnaire can be used freely, and should ask for the user unique identifier and answers for the standard 4 questions in a 7-point Likert scale. For information about the questions, how to implement the questionnaire, and how to compute the usability metric ranging from 0 to 100 score from the answers, please see the original paper:

Kraig Finstad. 2010. The usability metric for user experience. Interacting with computers 22, 5 (2010), 323–327.

Analysis of other applications of the experiment

This section describes how to replicate the analysis of the data collected in an application of the experiment described in Experiment replication.

The analysis script expects data in 4 CSV files,
Pandas Practice Dataset
kaggle.com
zip
Updated Jan 27, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mrityunjay Pathak (2023). Pandas Practice Dataset [Dataset]. https://www.kaggle.com/datasets/themrityunjaypathak/pandas-practice-dataset
Explore at:
zip(493 bytes)Available download formats
Dataset updated
Jan 27, 2023
Authors
Mrityunjay Pathak
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
What is Pandas?

Pandas is a Python library used for working with data sets.

It has functions for analyzing, cleaning, exploring, and manipulating data.

The name "Pandas" has a reference to both "Panel Data", and "Python Data Analysis" and was created by Wes McKinney in 2008.

Why Use Pandas?

Pandas allows us to analyze big data and make conclusions based on statistical theories.

Pandas can clean messy data sets, and make them readable and relevant.

Relevant data is very important in data science.

What Can Pandas Do?

Pandas gives you answers about the data. Like:

Is there a correlation between two or more columns?

What is average value?

Max value?

Min value?
D
CSV files for GLMM and BRT analysis
researchdata.ntu.edu.sg
tsv
Updated May 2, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ser Huay Janice Teresa Lee; Ser Huay Janice Teresa Lee (2019). CSV files for GLMM and BRT analysis [Dataset]. http://doi.org/10.21979/N9/O48JED
Explore at:
tsv(8685), tsv(18621856)Available download formats
Unique identifier
https://doi.org/10.21979/N9/O48JED
Dataset updated
May 2, 2019
Dataset provided by
DR-NTU (Data)
Authors
Ser Huay Janice Teresa Lee; Ser Huay Janice Teresa Lee
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Dataset funded by
Ministry of Education (MOE)
Description
CSV files for conducting GLMM analysis using regencies as units and BRT analysis using pixels (1km) as units.
all csv files used for analysis of NCBI data
figshare.com
txt
Updated Oct 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Cassandre Pyne (2023). all csv files used for analysis of NCBI data [Dataset]. http://doi.org/10.6084/m9.figshare.24461239.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.24461239.v1
Dataset updated
Oct 30, 2023
Dataset provided by
Figsharehttp://figshare.com/
figshare
Authors
Cassandre Pyne
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
all csv files used for analysis of NCBIall files with "WOAH" in it are the disease and disease agents from WOAH's list (see manuscript for link) all breed files (with breed names in name) are from web scrapingMASTER_DATA_coordinates_FINAL_AUG_5: cleaned mined data from NCBI
Data from: ESS-DIVE Reporting Format for Comma-separated Values (CSV) File...
osti.gov
Updated Jan 1, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Crow, Michael; Crystal-Ornelas, Robert; Devarakonda, Ranjeet; Heinz, Susan; Velliquette, Terri; Welch, Jessica (2021). ESS-DIVE Reporting Format for Comma-separated Values (CSV) File Structure [Dataset]. https://www.osti.gov/dataexplorer/biblio/dataset/1734841
Explore at:
Dataset updated
Jan 1, 2021
Dataset provided by
United States Department of Energyhttp://energy.gov/
Environmental System Science Data Infrastructure for a Virtual Ecosystem; Environmental Systems Science Data Infrastructure for a Virtual Ecosystem
Authors
Crow, Michael; Crystal-Ornelas, Robert; Devarakonda, Ranjeet; Heinz, Susan; Velliquette, Terri; Welch, Jessica
Description
The ESS-DIVE reporting format for Comma-separated Values (CSV) file structure is based on a combination of existing guidelines and recommendations including some found within the Earth Science Community with valuable input from the Environmental Systems Science (ESS) Community. The CSV reporting format is designed to promote interoperability and machine-readability of CSV data files while also facilitating the collection of some file-level metadata content. Tabular data in the form of rows and columns should be archived in its simplest form, and we recommend submitting these tabular data following the ESS-DIVE reporting format for generic comma-separated values (CSV) text format files. In general, the CSV file format is more likely accessible by future systems when compared to a proprietary format and CSV files are preferred because this format is easier to exchange between different programs increasing the interoperability of a data file. By defining the reporting format and providing guidelines for how to structure CSV files and some field content within, this can increase the machine-readability of the data file for extracting, compiling, and comparing the data across files and systems.Data package files are in .csv, .png, and .md. Open the .csv with e.g. Microsoft Excel, LibreOffice, or Google Sheets. Open themore » .md files by downloading and using a text editor (e.g., notepad or TextEdit). Open the .png in e.g. a web browser, photo viewer/editor, or Google Drive.« less
d
COVID-19 Time Series Data
data.world
kaggle.com
csv, zip
Updated Mar 18, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shad Reynolds (2025). COVID-19 Time Series Data [Dataset]. https://data.world/shad/covid-19-time-series-data
Explore at:
csv, zipAvailable download formats
Dataset updated
Mar 18, 2025
Authors
Shad Reynolds
Time period covered
Jan 22, 2020 - Mar 9, 2023
Area covered
Description

This data is synced hourly from https://github.com/CSSEGISandData/COVID-19. All credit is to them.

Latest Confirmed Cases

@(https://data.world/shad/covid-analysis/workspace/query?datasetid=covid-19-time-series-data&queryid=e066701e-fa8d-4c9f-97f8-aab3a6f219a8)

I have also added confirmed_pivot.csv which gives a slightly more workable view of the data. Extra columns/day makes things difficult.

@(https://data.world/shad/covid-analysis/workspace/file?datasetid=covid-19-time-series-data&filename=confirmed_pivot)

#
RQ4
figshare.com
pdf
Updated Jan 8, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yunior Pacheco (2019). RQ4 [Dataset]. http://doi.org/10.6084/m9.figshare.7562237.v2
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.7562237.v2
Dataset updated
Jan 8, 2019
Dataset provided by
Figsharehttp://figshare.com/
figshare
Authors
Yunior Pacheco
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Each CSV file contains information about the number of classes in a framework according to the number of extension points defined in them.This file can be used as input to the plots.R scriptThe pdf files show the resulting graphics corresponding to the data of the csv files.
Z
Data from: Reliability Analysis of Random Telegraph Noisebased True Random...
data.niaid.nih.gov
zenodo.org
Updated Sep 30, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zanotti, Tommaso; Ranjan, Alok; O'Shea, Sean J.; Raghavan, Nagarajan; Thamankar, Dr. Ramesh; Pey, Kin Leong; PUGLISI, Francesco Maria (2024). Reliability Analysis of Random Telegraph Noisebased True Random Number Generators [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13169457
Explore at:
Dataset updated
Sep 30, 2024
Dataset provided by
Singapore University of Technology and Design
Chalmers University of Technology
University of Modena and Reggio Emilia
Università degli Studi di Modena e Reggio Emilia
Agency for Science, Technology and Research
VIT University
Authors
Zanotti, Tommaso; Ranjan, Alok; O'Shea, Sean J.; Raghavan, Nagarajan; Thamankar, Dr. Ramesh; Pey, Kin Leong; PUGLISI, Francesco Maria
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Repository author: Tommaso Zanotti* email: tommaso.zanotti@unimore.it or francescomaria.puglisi@unimore.it * Version v1.0

This repository includes MATLAB files and datasets related to the IEEE IIRW 2023 conference proceeding:T. Zanotti et al., "Reliability Analysis of Random Telegraph Noisebased True Random Number Generators," 2023 IEEE International Integrated Reliability Workshop (IIRW), South Lake Tahoe, CA, USA, 2023, pp. 1-6, doi: 10.1109/IIRW59383.2023.10477697

The repository includes:

The data of the bitmaps reported in Fig. 4, i.e., the results of the simulation of the ideal RTN-based TRNG circuit for different reseeding strategies. To load and plot the data use the "plot_bitmaps.mat" file.

The result of the circuit simulations considering the EvolvingRTN from the HfO2 device shown in Fig. 7, for two Rgain values. Specifically, the data is contained in the following csv files:

"Sim_TRNG_Circuit_HfO2_3_20s_Vth_210m_no_Noise_Ibias_11n.csv" (lower Rgain)

"Sim_TRNG_Circuit_HfO2_3_20s_Vth_210m_no_Noise_Ibias_4_8n.csv" (higher Rgain)

The result of the circuit simulations considering the temporary RTN from the SiO2 device shown in Fig. 8. Specifically, the data is contained in the following csv files:

"Sim_TRNG_Circuit_SiO2_1c_300s_Vth_180m_Noise_Ibias_1.5n.csv" (ref. Rgain)

"Sim_TRNG_Circuit_SiO2_1c_100s_200s_Vth_180m_Noise_Ibias_1.575n.csv" (lower Rgain)

"Sim_TRNG_Circuit_SiO2_1c_100s_200s_Vth_180m_Noise_Ibias_1.425n.csv" (higher Rgain)
m
Ransomware and user samples for training and validating ML models
data.mendeley.com
Updated Sep 16, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Eduardo Berrueta (2020). Ransomware and user samples for training and validating ML models [Dataset]. http://doi.org/10.17632/yhg5wk39kf.1
Explore at:
Unique identifier
https://doi.org/10.17632/yhg5wk39kf.1
Dataset updated
Sep 16, 2020
Authors
Eduardo Berrueta
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Ransomware is considered as a significant threat for most enterprises since past few years. In scenarios wherein users can access all files on a shared server, one infected host is capable of locking the access to all shared files. In the article related to this repository, we detect ransomware infection based on file-sharing traffic analysis, even in the case of encrypted traffic. We compare three machine learning models and choose the best for validation. We train and test the detection model using more than 70 ransomware binaries from 26 different families and more than 2500 h of ‘not infected’ traffic from real users. The results reveal that the proposed tool can detect all ransomware binaries, including those not used in the training phase (zero-days). This paper provides a validation of the algorithm by studying the false positive rate and the amount of information from user files that the ransomware could encrypt before being detected.

This dataset directory contains the 'infected' and 'not infected' samples and the models used for each T configuration, each one in a separated folder.

Each folder (for example 10s/) contains 8 files: - tree.py -> Python script with the Tree model. - ensemble.json -> JSON file with the information about the Ensemble model. - NN_XhiddenLayer.json -> JSON file with the information about the NN model with X hidden layers (1, 2 or 3). - train_data.csv -> All samples used for training each model in this folder. It is in csv format for using in bigML application. - zeroDays.csv -> All zero-day samples used for testing each model in this folder. It is in csv format for using in bigML application. - FPtest_data -> All samples used for validating each model in this folder. It is in csv format for using in bigML application.

The six models present in each folder are the configuration reached training it with the train_data.csv file. The validation and the measurement of the data lost have been done with the FPtest_data.csv and the zeroDays.csv files respectively.

The files containing samples (train_data.csv, zeroDays.csv and FPtest_data.csv) are structured as follows: - Each line is one sample. - Each sample has 3*T features and the label (1 if it is 'infected' sample and 0 if it is not). - The features are separated by ',' because it is a csv file. - The last column is the label of the sample.
t
Data from: Trusted Research Environments: Analysis of Characteristics and...
researchdata.tuwien.ac.at
researchdata.tuwien.at
bin, csv
Updated Jun 25, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Martin Weise; Martin Weise; Andreas Rauber; Andreas Rauber (2024). Trusted Research Environments: Analysis of Characteristics and Data Availability [Dataset]. http://doi.org/10.48436/cv20m-sg117
Explore at:
bin, csvAvailable download formats
Unique identifier
https://doi.org/10.48436/cv20m-sg117
Dataset updated
Jun 25, 2024
Dataset provided by
TU Wien
Authors
Martin Weise; Martin Weise; Andreas Rauber; Andreas Rauber
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Trusted Research Environments (TREs) enable analysis of sensitive data under strict security assertions that protect the data with technical organizational and legal measures from (accidentally) being leaked outside the facility. While many TREs exist in Europe, little information is available publicly on the architecture and descriptions of their building blocks & their slight technical variations. To shine light on these problems, we give an overview of existing, publicly described TREs and a bibliography linking to the system description. We further analyze their technical characteristics, especially in their commonalities & variations and provide insight on their data type characteristics and availability. Our literature study shows that 47 TREs worldwide provide access to sensitive data of which two-thirds provide data themselves, predominantly via secure remote access. Statistical offices make available a majority of available sensitive data records included in this study.
Methodology
We performed a literature study covering 47 TREs worldwide using scholarly databases (Scopus, Web of Science, IEEE Xplore, Science Direct), a computer science library (dblp.org), Google and grey literature focusing on retrieving the following source material:
Peer-reviewed articles where available,
TRE websites,
TRE metadata catalogs.
The goal for this literature study is to discover existing TREs, analyze their characteristics and data availability to give an overview on available infrastructure for sensitive data research as many European initiatives have been emerging in recent months.
Technical details
This dataset consists of five comma-separated values (.csv) files describing our inventory:
countries.csv: Table of countries with columns id (number), name (text) and code (text, in ISO 3166-A3 encoding, optional)
tres.csv: Table of TREs with columns id (number), name (text), countryid (number, refering to column id of table countries), structureddata (bool, optional), datalevel (one of [1=de-identified, 2=pseudonomized, 3=anonymized], optional), outputcontrol (bool, optional), inceptionyear (date, optional), records (number, optional), datatype (one of [1=claims, 2=linked records]), optional), statistics_office (bool), size (number, optional), source (text, optional), comment (text, optional)
access.csv: Table of access modes of TREs with columns id (number), suf (bool, optional), physical_visit (bool, optional), external_physical_visit (bool, optional), remote_visit (bool, optional)
inclusion.csv: Table of included TREs into the literature study with columns id (number), included (bool), exclusion reason (one of [peer review, environment, duplicate], optional), comment (text, optional)
major_fields.csv: Table of data categorization into the major research fields with columns id (number), life_sciences (bool, optional), physical_sciences (bool, optional), arts_and_humanities (bool, optional), social_sciences (bool, optional).
Additionally, a MariaDB (10.5 or higher) schema definition .sql file is needed, properly modelling the schema for databases:
schema.sql: Schema definition file to create the tables and views used in the analysis.
The analysis was done through Jupyter Notebook which can be found in our source code repository: https://gitlab.tuwien.ac.at/martin.weise/tres/-/blob/master/analysis.ipynb
d
Data and scripts associated with “Allometric scaling of hyporheic...
search.dataone.org
knb.ecoinformatics.org
+2more
Updated Apr 9, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Peter Regier; Kyongho Son; Xingyuan Chen; Yilin Fang; Peishi Jiang; Micah Taylor; Wil Wollheim; Beck Powers-McCormack; Brieanne Forbes; Amy E. Goldman; James C. Stegen (2025). Data and scripts associated with “Allometric scaling of hyporheic respiration across basins in the Pacific Northwest USA" [Dataset]. http://doi.org/10.15485/2527349
Explore at:
Unique identifier
https://doi.org/10.15485/2527349
Dataset updated
Apr 9, 2025
Dataset provided by
ESS-DIVE
Authors
Peter Regier; Kyongho Son; Xingyuan Chen; Yilin Fang; Peishi Jiang; Micah Taylor; Wil Wollheim; Beck Powers-McCormack; Brieanne Forbes; Amy E. Goldman; James C. Stegen
Area covered

Description
This data package is associated with the publication “Allometric scaling of hyporheic respiration across basins in the Pacific Northwest USA” submitted to JGR-Biogeosciences (Regier et al. 2025). This study used reach-scale modeled estimates of hyporheic aerobic respiration made by the River Corridor Model (Fang et al. 2020) and watershed characteristics across the Willamette and Yakima River basins to explore potential allometric scaling (i.e., power-law relationships between size and function) of cumulative hyporheic respiration across catchment-to-basin scales. Scaling was explored quantitatively via the R2, slope, and y-intercept of relationships between cumulative hyporheic respiration and watershed area, divided into hyporheic exchange flux (HEF) quantiles. We also explored relationships between allometric scaling and other watershed characteristics through linear regression, spatial patterns, and mutual information analyses. Our results also suggest variability of hyporheic respiration allometry for middle exchange flux quantiles, and in relation to land-cover. Our findings provide initial evidence that allometric scaling may be useful for predicting hyporheic biogeochemical dynamics across watersheds from reach to basin scales. This data package is associated with the GitHub repository found at https://github.com/peterregier/rc_wrb_yrb_scaling. The data package is organized into several key directories. The “data” folder contains multiple CSV files, including landscape heterogeneity, scaling analysis, and watershed boundary data. The “figures” folder has all figure files in both PDF and PNG formats. Core analysis scripts and figure generation scripts are in the “scripts” directory, systematically numbered for sequential execution. The root directory includes essential project files; please see the file ending in “flmd.csv” for a list and description of all files contained in this data package and the file ending in “dd.csv” for data dictionaries used to describe tabular column headers.

Facebook

Twitter

Click to copy link

Link copied

Cite

Samy Baladram (2023). 🔍 Diverse CSV Dataset Samples [Dataset]. https://www.kaggle.com/datasets/samybaladram/multidisciplinary-csv-datasets-collection

🔍 Diverse CSV Dataset Samples

Real-world stats for data practice

Explore at:

zip(330927 bytes)Available download formats

Dataset updated

Nov 6, 2023

Authors

Samy Baladram

License

http://www.gnu.org/licenses/lgpl-3.0.htmlhttp://www.gnu.org/licenses/lgpl-3.0.html

Description

https://i.imgur.com/PcSDv8A.png" alt="Imgur">

Overview

The dataset provided here is a rich compilation of various data files gathered to support diverse analytical challenges and education in data science. It is especially curated to provide researchers, data enthusiasts, and students with real-world data across different domains, including biostatistics, travel, real estate, sports, media viewership, and more.

Files

Below is a brief overview of what each CSV file contains: - Addresses: Practical examples of string manipulation and address data formatting in CSV. - Air Travel: Historical dataset suitable for analyzing trends in air travel over a period of three years. - Biostats: A dataset of office workers' biometrics, ideal for introductory statistics and biology. - Cities: Geographic and administrative data for urban analysis or socio-demographic studies. - Car Crashes in Catalonia: Weekly traffic accident data from Catalonia, providing a base for public policy research. - De Niro's Film Ratings: Analyze trends in film ratings over time with this entertainment-focused dataset. - Ford Escort Sales: Pre-owned vehicle sales data, perfect for regression analysis or price prediction models. - Old Faithful Geyser: Geological data for pattern recognition and prediction in natural phenomena. - Freshman Year Weights and BMIs: Dataset depicting weight and BMI changes for health and lifestyle studies. - Grades: Education performance data which can be correlated with demographics or study patterns. - Home Sales: A dataset reflecting the housing market dynamics, useful for economic analysis or real estate appraisal. - Hooke's Law Demonstration: Physics data illustrating the classic principle of elasticity in springs. - Hurricanes and Storm Data: Climate data on hurricane and storm frequency for environmental risk assessments. - Height and Weight Measurements: Public health research dataset on anthropometric data. - Lead Shot Specs: Detailed engineering data for material sciences and manufacturing studies. - Alphabet Letter Frequency: Text analysis dataset for frequency distribution studies in large text samples. - MLB Player Statistics: Comprehensive athletic data set for analysis of performance metrics in sports. - MLB Teams' Seasonal Performance: A dataset combining financial and sports performance data from the 2012 MLB season. - TV News Viewership: Media consumption data which can be used to analyze viewing patterns and trends. - Historical Nile Flood Data: A unique environmental dataset for historical trend analysis in flood levels. - Oscar Winner Ages: A dataset to explore age trends among Oscar-winning actors and actresses. - Snakes and Ladders Statistics: Data from the game outcomes useful in studying probability and game theory. - Tallahassee Cab Fares: Price modeling data from the real-world pricing of taxi services. - Taxable Goods Data: A snapshot of economic data concerning taxation impact on prices. - Tree Measurements: Ecological and environmental science data related to tree growth and forest management. - Real Estate Prices from Zillow: Market analysis dataset for those interested in housing price determinants.

Format

The enclosed data respect the comma-separated values (CSV) file format standards, ensuring compatibility with most data processing libraries in Python, R, and other languages. The datasets are ready for import into Jupyter notebooks, RStudio, or any other integrated development environment (IDE) used for data science.

Quality Assurance

The data is pre-checked for common issues such as missing values, duplicate records, and inconsistent entries, offering a clean and reliable dataset for various analytical exercises. With initial header lines in some CSV files, users can easily identify dataset fields and start their analysis without additional data cleaning for headers.

Acknowledgements

The dataset adheres to the GNU LGPL license, making it freely available for modification and distribution, provided that the original source is cited. This opens up possibilities for educators to integrate real-world data into curricula, researchers to validate models against diverse datasets, and practitioners to refine their analytical skills with hands-on data.

This dataset has been compiled from https://people.sc.fsu.edu/~jburkardt/data/csv/csv.html, with gratitude to the authors and maintainers for their dedication to providing open data resources for educational and research purposes. https://i.imgur.com/HOtyghv.png" alt="Imgur">

Clear search

Close search

Google apps

Main menu

🔍 Diverse CSV Dataset Samples

Overview

Files

Format

Quality Assurance

Acknowledgements

CSV file used in statistical analyses

Data from: Data files used to study change dynamics in software systems

Student & School Performance Data Analysis

Mecca Australia Extracted Data in CSV Format

Features:

Applications:

The Canada Trademarks Dataset

Sample Rainfall Dataset

1000 Empirical Time series

Datasets for Sentiment Analysis

Assessing the impact of hints in learning formal specification: Research...

Pandas Practice Dataset

CSV files for GLMM and BRT analysis

all csv files used for analysis of NCBI data

Data from: ESS-DIVE Reporting Format for Comma-separated Values (CSV) File...

COVID-19 Time Series Data

RQ4

Data from: Reliability Analysis of Random Telegraph Noisebased True Random...

Ransomware and user samples for training and validating ML models

Data from: Trusted Research Environments: Analysis of Characteristics and...

Methodology

Technical details

Data and scripts associated with “Allometric scaling of hyporheic...

🔍 Diverse CSV Dataset Samples

Real-world stats for data practice

Overview

Files

Format

Quality Assurance

Acknowledgements