License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
Objectives: In quantitative research, understanding basic parameters of the study population is key for interpretation of the results. As a result, it is typical for the first table ("Table 1") of a research paper to include summary statistics for the study data. Our objectives are 2-fold. First, we seek to provide a simple, reproducible method for providing summary statistics for research papers in the Python programming language. Second, we seek to use the package to improve the quality of summary statistics reported in research papers.
Materials and Methods: The tableone package is developed following good practice guidelines for scientific computing and all code is made available under a permissive MIT License. A testing framework runs on a continuous integration server, helping to maintain code stability. Issues are tracked openly and public contributions are encouraged.
Results: The tableone software package automatically compiles summary statistics into publishable formats such as CSV, HTML, and LaTeX. An executable Jupyter Notebook demonstrates application of the package to a subset of data from the MIMIC-III database. Tests such as Tukey's rule for outlier detection and Hartigan's Dip Test for modality are computed to highlight potential issues in summarizing the data.
Discussion and Conclusion: We present open source software for researchers to facilitate carrying out reproducible studies in Python, an increasingly popular language in scientific research. The toolkit is intended to mature over time with community feedback and input. Development of a common tool for summarizing data may help to promote good practice when used as a supplement to existing guidelines and recommendations. We encourage use of tableone alongside other methods of descriptive statistics and, in particular, visualization to ensure appropriate data handling. We also suggest seeking guidance from a statistician when using tableone for a research study, especially prior to submitting the study for publication.
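The package is distributed on PyPI (`pip install tableone`). As a rough, hedged sketch of typical usage — the synthetic data and column names below are illustrative only and are not drawn from MIMIC-III:

```python
# Minimal sketch of tableone usage on a synthetic cohort (names are illustrative).
import numpy as np
import pandas as pd
from tableone import TableOne

rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "age": rng.normal(65, 12, n).round(1),
    "sex": rng.choice(["F", "M"], n),
    "los": rng.gamma(2.0, 2.5, n).round(1),   # length of stay in days (skewed)
    "icu": rng.choice(["MICU", "SICU"], n),
})

table1 = TableOne(
    df,
    columns=["age", "sex", "los"],
    categorical=["sex"],
    groupby="icu",        # stratify the summary by ICU type
    nonnormal=["los"],    # report median [IQR] rather than mean (SD)
    pval=True,            # append a hypothesis-test column
)

print(table1.tabulate(tablefmt="github"))
table1.to_csv("table1.csv")   # to_html() and to_latex() exporters are also provided
```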
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
About the Dataset: This dataset is a collection of all public Python-language repositories on GitHub that have 500 or more stars. The data were collected on 05-05-2022, for a total of 9,031 repositories.
An upvote would be great if you found this dataset useful 🙂
Purpose:
- Generate descriptive statistics
- Data visualization
- NLP on the description field of repositories
- Clustering by topics
- Finding hidden gems of open source projects
Description of Columns:

| Column | Description |
| --- | --- |
| full_name | Full name of the repository |
| repo_lang | Programming language used in the repository |
| repo_topics | Topics of the repository |
| created_at | Repository creation date |
| description | Description of the repository |
| forks_count | Total fork count of the repository |
| open_issues_count | Current open issues count |
| repo_size | Size of the repository |
| repo_stargazers_count | Star count of the repository |
| repo_subscribers_count | Subscriber count of the repository |
| repo_watchers_count | Watchers count |
| git_url | Git URL of the repository |
| html_url | HTML URL of the repository |
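A hedged sketch of how the descriptive statistics and topic exploration mentioned above might be started with pandas; the CSV filename and the exact string format of repo_topics are assumptions:

```python
# Load the repository list (filename assumed) and compute basic summaries.
import pandas as pd

repos = pd.read_csv("python_repos.csv")  # assumed filename for the released CSV

# Descriptive statistics for the numeric popularity metrics
print(repos[["repo_stargazers_count", "forks_count", "open_issues_count"]].describe())

# Most common repository topics (assumes repo_topics is a list-like string such as "['web', 'api']")
topics = (
    repos["repo_topics"]
    .dropna()
    .str.strip("[]")
    .str.replace("'", "", regex=False)
    .str.split(",")
    .explode()
    .str.strip()
)
print(topics.value_counts().head(20))
```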
License: Attribution 4.0 International (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
Machine learning (ML) has gained much attention and has been incorporated into our daily lives. While there are numerous publicly available ML projects on open source platforms such as GitHub, there have been limited attempts at filtering those projects to curate ML projects of high quality. The limited availability of such high-quality datasets poses an obstacle to understanding ML projects. To help clear this obstacle, we present NICHE, a manually labelled dataset consisting of 572 ML projects. Based on evidence of good software engineering practices, we label 441 of these projects as engineered and 131 as non-engineered. In this repository we provide the "NICHE.csv" file, which contains the list of the project names along with their labels, descriptive information for every dimension, and several basic statistics, such as the number of stars and commits. This dataset can help researchers understand the practices that are followed in high-quality ML projects. It can also be used as a benchmark for classifiers designed to identify engineered ML projects.
GitHub page: https://github.com/soarsmu/NICHE
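A brief, hedged sketch of a first look at the file with pandas; the label column name is an assumption, not taken from the repository documentation:

```python
# Inspect the label distribution in NICHE.csv (column name "label" is assumed).
import pandas as pd

niche = pd.read_csv("NICHE.csv")
print(niche.head())

# Count engineered vs. non-engineered projects
print(niche["label"].value_counts())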
Spatial analysis and statistical summaries of the Protected Areas Database of the United States (PAD-US) provide land managers and decision makers with a general assessment of management intent for biodiversity protection, natural resource management, and recreation access across the nation. The PAD-US 3.0 Combined Fee, Designation, Easement feature class (with Military Lands and Tribal Areas from the Proclamation and Other Planning Boundaries feature class) was modified to remove overlaps, avoiding overestimation in protected area statistics and supporting user needs. A Python scripted process ("PADUS3_0_CreateVectorAnalysisFileScript.zip") associated with this data release prioritized overlapping designations (e.g. Wilderness within a National Forest) based upon their relative biodiversity conservation status (e.g. GAP Status Code 1 over 2), public access values (in the order of Closed, Restricted, Open, Unknown), and geodatabase load order (records are deliberately organized in the PAD-US full inventory with fee owned lands loaded before overlapping management designations, and easements).

The Vector Analysis File ("PADUS3_0VectorAnalysisFile_ClipCensus.zip"), an associated item of PAD-US 3.0 Spatial Analysis and Statistics (https://doi.org/10.5066/P9KLBB5D), was clipped to the Census state boundary file to define the extent and serve as a common denominator for statistical summaries. Boundaries of interest to stakeholders (State, Department of the Interior Region, Congressional District, County, EcoRegions I-IV, Urban Areas, Landscape Conservation Cooperative) were incorporated into separate geodatabase feature classes to support various data summaries ("PADUS3_0VectorAnalysisFileOtherExtents_Clip_Census.zip"), and Comma-separated Value (CSV) tables ("PADUS3_0SummaryStatistics_TabularData_CSV.zip") summarizing "PADUS3_0VectorAnalysisFileOtherExtents_Clip_Census.zip" are provided as an alternative format that enables users to explore and download summary statistics of interest (Comma-separated Table [CSV], Microsoft Excel Workbook [.XLSX], Portable Document Format [.PDF] Report) from the PAD-US Lands and Inland Water Statistics Dashboard (https://www.usgs.gov/programs/gap-analysis-project/science/pad-us-statistics). In addition, a "flattened" version of the PAD-US 3.0 combined file without other extent boundaries ("PADUS3_0VectorAnalysisFile_ClipCensus.zip") allows for other applications that require a representation of overall protection status without overlapping designation boundaries. The "PADUS3_0VectorAnalysis_State_Clip_CENSUS2020" feature class ("PADUS3_0VectorAnalysisFileOtherExtents_Clip_Census.gdb") is the source of the PAD-US 3.0 raster files (associated item of PAD-US 3.0 Spatial Analysis and Statistics, https://doi.org/10.5066/P9KLBB5D).

Note, the PAD-US inventory is now considered functionally complete, with the vast majority of land protection types represented in some manner, while work continues to maintain updates and improve data quality (see inventory completeness estimates at: http://www.protectedlands.net/data-stewards/). In addition, changes in protected area status between versions of the PAD-US may be attributed to improving the completeness and accuracy of the spatial data more than actual management actions or new acquisitions. USGS provides no legal warranty for the use of this data. While PAD-US is the official aggregation of protected areas (https://www.fgdc.gov/ngda-reports/NGDA_Datasets.html), agencies are the best source of their lands data.
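A hedged sketch, not the USGS scripted process, of how the Vector Analysis File could be summarized with geopandas once the geodatabase is extracted; the attribute field name ("GAP_Sts") and the equal-area projection choice are assumptions:

```python
# Summarize protected acres by GAP status from the Vector Analysis File.
import geopandas as gpd

vaf = gpd.read_file(
    "PADUS3_0VectorAnalysisFileOtherExtents_Clip_Census.gdb",   # extracted from the zip
    layer="PADUS3_0VectorAnalysis_State_Clip_CENSUS2020",
)

# Re-project to an equal-area CRS before measuring area (CONUS Albers assumed)
vaf_ea = vaf.to_crs("EPSG:5070")
vaf_ea["acres"] = vaf_ea.geometry.area / 4046.8564224

# Acres by GAP status code (field name is an assumption about the schema)
print(vaf_ea.groupby("GAP_Sts")["acres"].sum().round(0))
```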
In this project, I have done exploratory data analysis on the UCI Automobile dataset available at https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data
This dataset consists of data from the 1985 Ward's Automotive Yearbook. Here are the sources:
1) 1985 Model Import Car and Truck Specifications, 1985 Ward's Automotive Yearbook.
2) Personal Auto Manuals, Insurance Services Office, 160 Water Street, New York, NY 10038
3) Insurance Collision Report, Insurance Institute for Highway Safety, Watergate 600, Washington, DC 20037
Number of Instances: 398
Number of Attributes: 9 including the class attribute
Attribute Information:
- mpg: continuous
- cylinders: multi-valued discrete
- displacement: continuous
- horsepower: continuous
- weight: continuous
- acceleration: continuous
- model year: multi-valued discrete
- origin: multi-valued discrete
- car name: string (unique for each instance)
This data set consists of three types of entities:
I - The specification of an auto in terms of various characteristics
II - Its assigned insurance risk rating. This corresponds to the degree to which the auto is riskier than its price indicates. Cars are initially assigned a risk factor symbol associated with their price. Then, if a car is riskier (or less risky), this symbol is adjusted by moving it up (or down) the scale. Actuaries call this process "symboling".
III - Its normalized losses in use as compared to other cars. This is the relative average loss payment per insured vehicle year. This value is normalized for all autos within a particular size classification (two-door small, station wagons, sports/specialty, etc...), and represents the average loss per car per year.
The analysis is divided into two parts (a short pandas sketch of these steps follows the list):

- Data Wrangling
- Exploratory Data Analysis
- Descriptive statistics
- Groupby
- Analysis of variance
- Correlation
- Correlation stats
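A hedged sketch of these analysis steps; it assumes a cleaned local CSV with the attribute names listed above, rather than the raw UCI file:

```python
# Descriptive statistics, groupby, one-way ANOVA, and correlation on the auto data.
import pandas as pd
from scipy import stats

df = pd.read_csv("auto.csv")   # assumed cleaned file with the listed attribute names

# Descriptive statistics
print(df[["mpg", "horsepower", "weight", "acceleration"]].describe())

# Groupby: mean mpg per number of cylinders
print(df.groupby("cylinders")["mpg"].mean())

# One-way ANOVA: does mean mpg differ across origin groups?
groups = [g["mpg"].dropna() for _, g in df.groupby("origin")]
f_stat, p_val = stats.f_oneway(*groups)
print(f"ANOVA: F={f_stat:.2f}, p={p_val:.4f}")

# Correlation between numeric variables
print(df[["mpg", "displacement", "horsepower", "weight"]].corr())
```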
Acknowledgment Dataset: UCI Machine Learning Repository Data link: https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data
License: Attribution 4.0 International (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
Author: Andrew J. Felton
Date: 11/15/2024
This R project contains the primary code and data (following pre-processing in Python) used for data production, manipulation, visualization, analysis, and figure production for the study entitled:
"Global estimates of the storage and transit time of water through vegetation"
Please note that 'turnover' and 'transit' are used interchangeably. Also please note that this R project has been updated multiple times as the analysis has evolved throughout the peer review process.
# Data information
The data folder contains key data sets used for analysis. In particular:
"data/turnover_from_python/updated/august_2024_lc/" contains the core datasets used in this study including global arrays summarizing five year (2016-2020) averages of mean (annual) and minimum (monthly) transit time, storage, canopy transpiration, and number of months of data able as both an array (.nc) or data table (.csv). These data were produced in python using the python scripts found in the "supporting_code" folder. The remaining files in the "data" and "data/supporting_data" folder primarily contain ground-based estimates of storage and transit found in public databases or through a literature search, but have been extensively processed and filtered here. The "supporting_data"" folder also contains annual (2016-2020) MODIS land cover data used in the analysis and contains separate filters containing the original data (.hdf) and then the final process (filtered) data in .nc format. The resulting annual land cover distributions were used in the pre-processing of data in python.
# Code information
Python scripts can be found in the "supporting_code" folder.
Each R script in this project has a role:
"01_start.R": This script sets the working directory, loads in the tidyverse package (the remaining packages in this project are called using the `::` operator), and can run two other scripts: one that loads the customized functions (02_functions.R) and one for importing and processing the key dataset for this analysis (03_import_data.R).
"02_functions.R": This script contains custom functions. Load this using the `source()` function in the 01_start.R script.
"03_import_data.R": This script imports and processes the .csv transit data. It joins the mean (annual) transit time data with the minimum (monthly) transit data to generate one dataset for analysis: annual_turnover_2. Load this using the
`source()` function in the 01_start.R script.
"04_figures_tables.R": This is the main workhouse for figure/table production and supporting analyses. This script generates the key figures and summary statistics used in the study that then get saved in the "manuscript_figures" folder. Note that all maps were produced using Python code found in the "supporting_code"" folder. Also note that within the "manuscript_figures" folder there is an "extended_data" folder, which contains tables of the summary statistics (e.g., quartiles and sample sizes) behind figures containing box plots or depicting regression coefficients.
"supporting_generate_data.R": This script processes supporting data used in the analysis, primarily the varying ground-based datasets of leaf water content.
"supporting_process_land_cover.R": This takes annual MODIS land cover distributions and processes them through a multi-step filtering process so that they can be used in preprocessing of datasets in python.
License: Attribution 4.0 International (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
Purpose – This study examines why individuals remain in or exit social media groups by integrating Diffusion of Innovation, Social Identity Theory, and the Stimulus–Organism–Response framework, with emphasis on group-level drivers of retention.

Design/methodology/approach – A cross-sectional survey of 551 participants was analyzed using descriptive statistics, confirmatory factor analysis, and structural equation modelling with robust estimation for ordinal data. Mediation was tested via bootstrapping and moderation via latent interaction, with covariates controlled. Supporting figures and tables are available in the repository. Data processing was implemented in Python 3.11; all Python scripts used for preprocessing and statistical analysis are attached, along with the questionnaire used for the study.
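As a hedged illustration of one of the techniques mentioned above (not the study's attached scripts), a percentile-bootstrap test of an indirect effect can be sketched with statsmodels; the variable names and simulated data are placeholders, not the study's constructs:

```python
# Simple percentile-bootstrap of an indirect effect (X -> M -> Y).
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 551
x = rng.normal(size=n)
m = 0.5 * x + rng.normal(size=n)
y = 0.4 * m + 0.1 * x + rng.normal(size=n)
df = pd.DataFrame({"x": x, "m": m, "y": y})

def indirect_effect(d):
    # a-path: X -> M; b-path: M -> Y controlling for X
    a = sm.OLS(d["m"], sm.add_constant(d["x"])).fit().params["x"]
    b = sm.OLS(d["y"], sm.add_constant(d[["m", "x"]])).fit().params["m"]
    return a * b

boot = np.array([indirect_effect(df.sample(frac=1.0, replace=True)) for _ in range(2000)])
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"indirect effect = {indirect_effect(df):.3f}, 95% bootstrap CI [{lo:.3f}, {hi:.3f}]")
```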
License: GNU General Public License v2.0 (http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html)
This CSV dataset provides comprehensive information about house prices. It consists of 9,819 entries and 54 columns, offering a wealth of features for analysis. The dataset includes various numerical and categorical variables, providing insights into factors that influence house prices.
The key columns in the dataset are as follows:
In addition to these, the dataset contains several other features related to various amenities and facilities available in the houses, such as double-glazed windows, central air conditioning, central heating, waste disposal, furnished status, service elevators, and more.
By performing exploratory data analysis on this dataset using Python and the Pandas library, valuable insights can be gained regarding the relationships between different variables and the impact they have on house prices. Descriptive statistics, data visualization, and feature engineering techniques can be applied to uncover patterns and trends in the housing market.
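A hedged sketch of this kind of EDA; the filename and the "price" column name are assumptions about the dataset:

```python
# Descriptive statistics, correlations with price, and a simple distribution plot.
import pandas as pd
import matplotlib.pyplot as plt

houses = pd.read_csv("house_prices.csv")   # assumed filename

# Descriptive statistics for all numeric columns
print(houses.describe().T)

# Which numeric features correlate most strongly with price? ("price" column assumed)
corr_with_price = houses.corr(numeric_only=True)["price"].sort_values(ascending=False)
print(corr_with_price.head(10))

# Distribution of prices
houses["price"].plot(kind="hist", bins=50, title="House price distribution")
plt.xlabel("price")
plt.tight_layout()
plt.show()
```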
This dataset serves as a valuable resource for real estate professionals, analysts, and researchers interested in understanding the factors that contribute to house prices and making informed decisions in the real estate market.
License: Attribution 4.0 International (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
A collection of datasets and Python scripts for extraction and analysis of isograms (and some palindromes and tautonyms) from corpus-based word-lists, specifically Google Ngram and the British National Corpus (BNC). Below follows a brief description, first, of the included datasets and, second, of the included scripts.

1. Datasets

The data from English Google Ngrams and the BNC is available in two formats: as a plain text CSV file and as a SQLite3 database.

1.1 CSV format

The CSV files for each dataset actually come in two parts: one labelled ".csv" and one ".totals". The ".csv" contains the actual extracted data, and the ".totals" file contains some basic summary statistics about the ".csv" dataset with the same name.

The CSV files contain one row per data point, with the columns separated by a single tab stop. There are no labels at the top of the files. Each line has the following columns, in this order (the labels below are what I use in the database, which has an identical structure, see section below):

- isogramy (int): The order of isogramy, e.g. "2" is a second order isogram
- length (int): The length of the word in letters
- word (text): The actual word/isogram in ASCII
- source_pos (text): The Part of Speech tag from the original corpus
- count (int): Token count (total number of occurrences)
- vol_count (int): Volume count (number of different sources which contain the word)
- count_per_million (int): Token count per million words
- vol_count_as_percent (int): Volume count as percentage of the total number of volumes
- is_palindrome (bool): Whether the word is a palindrome (1) or not (0)
- is_tautonym (bool): Whether the word is a tautonym (1) or not (0)
The ".totals" files have a slightly different format, with one row per data point, where the first column is the label and the second column is the associated value. The ".totals" files contain the following data:
Label
Data type
Description
!total_1grams
int
The total number of words in the corpus
!total_volumes
int
The total number of volumes (individual sources) in the corpus
!total_isograms
int
The total number of isograms found in the corpus (before compacting)
!total_palindromes
int
How many of the isograms found are palindromes
!total_tautonyms
int
How many of the isograms found are tautonyms
The CSV files are mainly useful for further automated data processing. For working with the data set directly (e.g. to do statistics or cross-check entries), I would recommend using the database format described below.

1.2 SQLite database format

On the other hand, the SQLite database combines the data from all four of the plain text files, and adds various useful combinations of the two datasets, namely:

- Compacted versions of each dataset, where identical headwords are combined into a single entry.
- A combined compacted dataset, combining and compacting the data from both Ngrams and the BNC.
- An intersected dataset, which contains only those words which are found in both the Ngrams and the BNC dataset.

The intersected dataset is by far the least noisy, but is missing some real isograms, too. The columns/layout of each of the tables in the database is identical to that described for the CSV/.totals files above. To get an idea of the various ways the database can be queried for various bits of data, see the R script described below, which computes statistics based on the SQLite database.

2. Scripts

There are three scripts: one for tidying Ngram and BNC word lists and extracting isograms, one to create a neat SQLite database from the output, and one to compute some basic statistics from the data. The first script can be run using Python 3, the second script can be run using SQLite 3 from the command line, and the third script can be run in R/RStudio (R version 3).

2.1 Source data

The scripts were written to work with word lists from Google Ngram and the BNC, which can be obtained from http://storage.googleapis.com/books/ngrams/books/datasetsv2.html and https://www.kilgarriff.co.uk/bnc-readme.html (download all.al.gz). For Ngram the script expects the path to the directory containing the various files, for BNC the direct path to the *.gz file.

2.2 Data preparation

Before processing proper, the word lists need to be tidied to exclude superfluous material and some of the most obvious noise. This will also bring them into a uniform format. Tidying and reformatting can be done by running one of the following commands:

python isograms.py --ngrams --indir=INDIR --outfile=OUTFILE
python isograms.py --bnc --indir=INFILE --outfile=OUTFILE

Replace INDIR/INFILE with the input directory or filename and OUTFILE with the filename for the tidied and reformatted output.

2.3 Isogram extraction

After preparing the data as above, isograms can be extracted by running the following command on the reformatted and tidied files:

python isograms.py --batch --infile=INFILE --outfile=OUTFILE

Here INFILE should refer to the output from the previous data cleaning process. Please note that the script will actually write two output files, one named OUTFILE with a word list of all the isograms and their associated frequency data, and one named "OUTFILE.totals" with very basic summary statistics.

2.4 Creating a SQLite3 database

The output data from the above step can be easily collated into a SQLite3 database which allows for easy querying of the data directly for specific properties. The database can be created by following these steps:

1. Make sure the files with the Ngrams and BNC data are named "ngrams-isograms.csv" and "bnc-isograms.csv" respectively. (The script assumes you have both of them; if you only want to load one, just create an empty file for the other one.)
2. Copy the "create-database.sql" script into the same directory as the two data files.
3. On the command line, go to the directory where the files and the SQL script are.
4. Type: sqlite3 isograms.db
5. This will create a database called "isograms.db".

See section 1 for a basic description of the output data and how to work with the database.

2.5 Statistical processing

The repository includes an R script (R version 3) named "statistics.r" that computes a number of statistics about the distribution of isograms by length, frequency, contextual diversity, etc. This can be used as a starting point for running your own stats. It uses RSQLite to access the SQLite database version of the data described above.
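Once built, the database can also be queried directly from Python with the standard sqlite3 module. A hedged sketch follows; the table name is an assumption, while the column names follow the schema described in section 1.1:

```python
# Query isograms.db for long palindromic isograms (table name "ngrams_isograms" is assumed).
import sqlite3

con = sqlite3.connect("isograms.db")
cur = con.cursor()

cur.execute("""
    SELECT word, length, isogramy, count
    FROM ngrams_isograms
    WHERE is_palindrome = 1 AND isogramy >= 2
    ORDER BY length DESC
    LIMIT 10
""")
for word, length, isogramy, count in cur.fetchall():
    print(word, length, isogramy, count)

con.close()
```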
This dataset supports the manuscript "Partisan Double Standards in Protest Judgment: How Group Identity and Moral Framing Influence Behavioral Reactions," submitted to Personality and Social Psychology Bulletin. It contains the full data, code, codebook, and documentation necessary to reproduce all results, figures, and analyses. The study investigates how political affiliation and moral framing affect behavioral responses to protest scenarios, using a 2 × 2 experimental design. Behavioral responses include the choice to donate to, share, or report a protest. This repository contains the cleaned dataset, annotated analysis code in Python, a full codebook of variables, and a brief README.

- 0517pspb_behavioral_data.csv – Cleaned dataset of participant responses used in analysis.
- 0517pspb_codebook.csv – Full codebook listing variable names, descriptions, and coding schema.
- pspb_analysis_code.py – Python script to replicate descriptive statistics, chi-square tests, and behavioral plots (choice_by_identity.png).
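A hedged sketch of the kind of chi-square analysis the repository describes (this is not pspb_analysis_code.py); the column names "identity_condition" and "choice" are placeholders, not the actual codebook names:

```python
# Chi-square test of behavioral choice by group-identity condition.
import pandas as pd
from scipy.stats import chi2_contingency

data = pd.read_csv("0517pspb_behavioral_data.csv")

crosstab = pd.crosstab(data["identity_condition"], data["choice"])  # column names assumed
chi2, p, dof, expected = chi2_contingency(crosstab)
print(crosstab)
print(f"chi2({dof}) = {chi2:.2f}, p = {p:.4f}")
```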
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
Bike Share Data

Over the past decade, bicycle-sharing systems have been growing in number and popularity in cities across the world. Bicycle-sharing systems allow users to rent bicycles on a very short-term basis for a price. This allows people to borrow a bike from point A and return it at point B, though they can also return it to the same location if they'd like to just go for a ride. Regardless, each bike can serve several users per day.
Thanks to the rise in information technologies, it is easy for a user of the system to access a dock within the system to unlock or return bicycles. These technologies also provide a wealth of data that can be used to explore how these bike-sharing systems are used.
In this project, you will use data provided by Motivate, a bike share system provider for many major cities in the United States, to uncover bike share usage patterns. You will compare the system usage between three large cities: Chicago, New York City, and Washington, DC.
The Datasets

Randomly selected data for the first six months of 2017 are provided for all three cities. All three of the data files contain the same core six (6) columns:
- Start Time (e.g., 2017-01-01 00:07:57)
- End Time (e.g., 2017-01-01 00:20:53)
- Trip Duration (in seconds - e.g., 776)
- Start Station (e.g., Broadway & Barry Ave)
- End Station (e.g., Sedgwick St & North Ave)
- User Type (Subscriber or Customer)

The Chicago and New York City files also have the following two columns:

- Gender
- Birth Year
Data for the first 10 rides in the new_york_city.csv file
The original files are much larger and messier, and you don't need to download them, but they can be accessed here if you'd like to see them (Chicago, New York City, Washington). These files had more columns and they differed in format in many cases. Some data wrangling has been performed to condense these files to the above core six columns to make your analysis and the evaluation of your Python skills more straightforward. In the Data Wrangling course that comes later in the Data Analyst Nanodegree program, students learn how to wrangle the dirtiest, messiest datasets, so don't worry, you won't miss out on learning this important skill!
Statistics Computed

You will learn about bike share use in Chicago, New York City, and Washington by computing a variety of descriptive statistics. In this project, you'll write code to provide the following information:
- most common month
- most common day of week
- most common hour of day
- most common start station
- most common end station
- most common trip from start to end (i.e., most frequent combination of start station and end station)
- total travel time
- average travel time
- counts of each user type
- counts of each gender (only available for NYC and Chicago)
- earliest, most recent, and most common year of birth (only available for NYC and Chicago)

The Files

To answer these questions using Python, you will need to write a Python script. To help guide your work in this project, a template with helper code and comments is provided in a bikeshare.py file, and you will do your scripting in there also. You will need the three city dataset files too:

- chicago.csv
- new_york_city.csv
- washington.csv
All four of these files are zipped up in the Bikeshare file in the resource tab in the sidebar on the left side of this page. You may download and open up that zip file to do your project work on your local machine.
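A hedged sketch of how the statistics listed above can be computed with pandas on one of the provided city files; the column names follow the dataset description:

```python
# Popular times, stations, trip durations, and user counts for one city file.
import pandas as pd

df = pd.read_csv("chicago.csv")
df["Start Time"] = pd.to_datetime(df["Start Time"])

# Popular times of travel
print("most common month:", df["Start Time"].dt.month_name().mode()[0])
print("most common day:  ", df["Start Time"].dt.day_name().mode()[0])
print("most common hour: ", df["Start Time"].dt.hour.mode()[0])

# Popular stations and trip
print("most common start station:", df["Start Station"].mode()[0])
print("most common end station:  ", df["End Station"].mode()[0])
print("most common trip:", (df["Start Station"] + " -> " + df["End Station"]).mode()[0])

# Trip duration and user statistics
print("total travel time (s):  ", df["Trip Duration"].sum())
print("average travel time (s):", df["Trip Duration"].mean())
print(df["User Type"].value_counts())
```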
License: Attribution 4.0 International (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
Sentinel2GlobalLULC is a deep learning-ready dataset of RGB images from the Sentinel-2 satellites designed for global land use and land cover (LULC) mapping. Sentinel2GlobalLULC v2.1 contains 194,877 images in GeoTiff and JPEG format corresponding to 29 broad LULC classes. Each image has 224 x 224 pixels at 10 m spatial resolution and was produced by assigning the 25th percentile of all available observations in the Sentinel-2 collection between June 2015 and October 2020 in order to remove atmospheric effects (i.e., clouds, aerosols, shadows, snow, etc.). A spatial purity value was assigned to each image based on the consensus across 15 different global LULC products available in Google Earth Engine (GEE).
Our dataset is structured into 3 main zip-compressed folders, an Excel file with a dictionary for class names and descriptive statistics per LULC class, and a python script to convert RGB GeoTiff images into JPEG format. The first folder called "Sentinel2LULC_GeoTiff.zip" contains 29 zip-compressed subfolders where each one corresponds to a specific LULC class with hundreds to thousands of GeoTiff Sentinel-2 RGB images. The second folder called "Sentinel2LULC_JPEG.zip" contains 29 zip-compressed subfolders with a JPEG formatted version of the same images provided in the first main folder. The third folder called "Sentinel2LULC_CSV.zip" includes 29 zip-compressed CSV files with as many rows as provided images and with 12 columns containing the following metadata (this same metadata is provided in the image filenames):
Land Cover Class ID: is the identification number of each LULC class
Land Cover Class Short Name: is the short name of each LULC class
Image ID: is the identification number of each image within its corresponding LULC class
Pixel purity Value: is the spatial purity of each pixel for its corresponding LULC class calculated as the spatial consensus across up to 15 land-cover products
GHM Value: is the spatial average of the Global Human Modification index (gHM) for each image
Latitude: is the latitude of the center point of each image
Longitude: is the longitude of the center point of each image
Country Code: is the Alpha-2 country code of each image as described in the ISO 3166 international standard. To understand the country codes, we recommend visiting the following website, which presents the Alpha-2 code for each country as described in the ISO 3166 international standard: https://www.iban.com/country-codes
Administrative Department Level1: is the administrative level 1 name to which each image belongs
Administrative Department Level2: is the administrative level 2 name to which each image belongs
Locality: is the name of the locality to which each image belongs
Number of S2 images: is the number of instances found in the corresponding Sentinel-2 image collection between June 2015 and October 2020 when compositing and exporting its corresponding image tile
For seven LULC classes, we could not export from GEE all images that fulfilled a spatial purity of 100% since there were millions of them. In this case, we exported a stratified random sample of 14,000 images and provided an additional CSV file with the images actually contained in our dataset. That is, for these seven LULC classes, we provide these 2 CSV files:
A CSV file that contains all exported images for this class
A CSV file that contains all images available for this class at spatial purity of 100%, both the ones exported and the ones not exported, in case the user wants to export them. These CSV filenames end with "including_non_downloaded_images".
To clearly state the geographical coverage of images available in this dataset, we included in the version v2.1, a compressed folder called "Geographic_Representativeness.zip". This zip-compressed folder contains a csv file for each LULC class that provides the complete list of countries represented in that class. Each csv file has two columns, the first one gives the country code and the second one gives the number of images provided in that country for that LULC class. In addition to these 29 csv files, we provided another csv file that maps each ISO Alpha-2 country code to its original full country name.
© Sentinel2GlobalLULC Dataset by Yassir Benhammou, Domingo Alcaraz-Segura, Emilio Guirado, Rohaifa Khaldi, Boujemâa Achchab, Francisco Herrera & Siham Tabik is marked with Attribution 4.0 International (CC-BY 4.0)
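The dataset also ships a Python script for converting the RGB GeoTiff tiles to JPEG. A comparable conversion can be sketched as follows; this is not the bundled script, and the use of rasterio/Pillow and a simple min-max stretch are assumptions:

```python
# Convert an RGB GeoTiff tile to a JPEG image.
import numpy as np
import rasterio
from PIL import Image

def geotiff_to_jpeg(src_path: str, dst_path: str) -> None:
    with rasterio.open(src_path) as src:
        rgb = src.read([1, 2, 3])                     # (bands, rows, cols)
    rgb = np.transpose(rgb, (1, 2, 0)).astype(np.float32)
    # Rescale to 0-255 for display (simple min-max stretch)
    lo, hi = rgb.min(), rgb.max()
    rgb8 = ((rgb - lo) / max(hi - lo, 1e-6) * 255).astype(np.uint8)
    Image.fromarray(rgb8).save(dst_path, "JPEG", quality=95)

geotiff_to_jpeg("example_tile.tif", "example_tile.jpg")   # filenames are placeholders
```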
License: Attribution 4.0 International (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
We introduced a new estimator for local heritability, "HEELS", which attains statistical efficiency comparable to that of REML estimators (such as those produced by GCTA and BOLT-REML) but only requires summary-level statistics – Z-scores from marginal association tests and the empirical LD. Our method has been implemented into an open-source Python-based command line tool.
The datasets released here can be downloaded to test the two main functions of our software package: 1) estimating local heritability; 2) computing the low-dimensional representation of the LD matrix. They are meant to accompany the HEELS tutorials we have posted onto the wiki pages of our github repository: https://github.com/huilisabrina/HEELS/wiki.
License: Attribution 4.0 International (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
This deposit contains the dataset and analysis code supporting the research paper "Recognition Without Implementation: Institutional Gaps and Forestry Expansion in Post-Girjas Swedish Sápmi" by Stefan Holgersson and Scott Brown.
Research Overview: This study examines forestry permit trends in Swedish Sámi territories following the landmark 2020 Girjas Supreme Court ruling, which recognized exclusive Sámi rights over hunting and fishing in traditional lands. Using 432 region-year observations (1998-2024) from the Swedish Forest Agency, we document a 242% increase in clearcutting approvals during 2020-2024 compared to pre-2020 averages, with state/corporate actors showing 313% increases and private landowners 197%.
Key Findings:
Important Limitation: We cannot isolate causal effects of the Girjas ruling from concurrent shocks including COVID-19 economic disruption, EU Taxonomy implementation, and commodity price volatility. The analysis documents institutional conditions and correlational patterns rather than establishing causation.
Dataset Contents:
- Clearcut.xlsx: Swedish Forest Agency clearcutting permit data (1998-2024) disaggregated by region, ownership type, and year
- SAMI.ipynb: Jupyter notebook containing Python code for descriptive statistics, time series analysis, and figure generation

How to Use These Files in Google Colab:

- Open SAMI.ipynb from your downloads
- Upload Clearcut.xlsx from your downloads to the /content/ directory
- The notebook reads Clearcut.xlsx from the current directory

Alternative method (direct from Zenodo):
# Add this cell at the top of the notebook to download files directly
!wget https://zenodo.org/record/[RECORD_ID]/files/Clearcut.xlsx
Replace [RECORD_ID] with the actual Zenodo record number after publication.
Requirements: The notebook uses standard Python libraries: pandas, numpy, matplotlib, seaborn. These are pre-installed in Google Colab. No additional setup required.
Methodology: Descriptive statistical analysis combined with institutional document review. Data covers eight administrative regions in northern Sweden with mountain-adjacent forests relevant to Sámi reindeer herding territories.
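A hedged sketch of the kind of pre/post-2020 comparison described above (this is not the SAMI.ipynb code); the sheet layout and column names are assumptions:

```python
# Compare pre- and post-2020 permit levels by ownership type.
import pandas as pd

permits = pd.read_excel("Clearcut.xlsx")   # assumed columns: year, region, ownership, area_ha

annual = permits.groupby(["ownership", "year"])["area_ha"].sum().reset_index()
pre = annual[annual["year"] < 2020].groupby("ownership")["area_ha"].mean()
post = annual[annual["year"] >= 2020].groupby("ownership")["area_ha"].mean()

change = ((post - pre) / pre * 100).round(1)
print(pd.DataFrame({"pre2020_mean": pre, "post2020_mean": post, "pct_change": change}))
```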
Policy Relevance: Findings inform debates on Indigenous land rights implementation, forestry governance reform, ESG disclosure requirements, and the gap between legal recognition and operational constraints in resource extraction contexts.
Keywords: Indigenous rights, Sámi, forestry governance, legal pluralism, Sweden, Girjas ruling, land tenure, corporate accountability, ESG disclosure
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
License: Attribution 4.0 International (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
This data set contains the full-resolution and state-level data described in the linked technical report (https://www.nrel.gov/docs/fy18osti/71492.pdf). It can be accessed with the NREL-dsgrid-legacy-efs-api, available on GitHub at https://github.com/dsgrid/dsgrid-legacy-efs-api and through PyPI (pip install NREL-dsgrid-legacy-efs-api). The data format is HDF5. The API is written in Python.
This initial dsgrid data set, whose description was originally published in 2018, covers electricity demand in the contiguous United States (CONUS) for the historical year of 2012. It is a proof-of-concept demonstrating the feasibility of reconciling bottom-up demand modeling results with top-down information about electricity demand to create a more detailed description than is possible with either type of data source on its own. The result is demand data that is more highly resolved along geographic, temporal, sectoral, and end-use dimensions as may be helpful for conducting electricity sector-wide "what-if" analysis of, e.g., energy efficiency, electrification, and/or demand flexibility.
Although we conducted bottom-up versus top-down validation, the final residuals were significant, especially at higher geographic and temporal resolution. Please see the Executive Summary and/or Section 3 of the report to obtain an understanding of the data set limitations before deciding whether these data are suitable for any particular use case.
New dsgrid datasets are under development. Please visit https://www.nrel.gov/analysis/dsgrid.html for the latest information which is also linked in the data resources.
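The supported access path is the NREL-dsgrid-legacy-efs-api noted above. As a generic, assumption-laden first look, the HDF5 files can also be opened directly with h5py; no group or dataset names are assumed, the sketch simply lists whatever the file contains:

```python
# List the groups and datasets inside an HDF5 file (filename is a placeholder).
import h5py

with h5py.File("dsgrid_state_level.h5", "r") as f:
    def show(name, obj):
        kind = "group" if isinstance(obj, h5py.Group) else f"dataset {obj.shape} {obj.dtype}"
        print(name, "->", kind)
    f.visititems(show)
```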
License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
Objective: Analyze Diwali sales data to uncover trends, customer behavior, and sales performance during the festive season.

Tools Used: Python, Pandas, NumPy, Matplotlib, Seaborn
Dataset: A dataset containing sales data for Diwali, including details like product categories, customer demographics, sales amounts, discounts, etc.
Feature Engineering: Create new features if necessary, such as total sales per customer, average discount per sale, etc.
Descriptive Statistics: Calculate basic statistics (mean, median, mode) to get a sense of the data distribution.

Visualizations (a short pandas/matplotlib sketch follows this list):
- Sales Trends: Plot sales over time to see how they varied during the Diwali season.
- Top-Selling Products: Identify the products or categories with the highest sales.
- Customer Demographics: Analyze sales by age, gender, and location to understand customer behavior.
- Discount Impact: Evaluate how different discount levels affected sales volume.
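A hedged sketch of these steps; the filename and column names such as "Amount", "Product_Category", and "Gender" are assumptions about the dataset:

```python
# Descriptive statistics, top-selling categories, and demographic breakdowns.
import pandas as pd
import matplotlib.pyplot as plt

sales = pd.read_csv("diwali_sales.csv")   # assumed filename

# Descriptive statistics for the sales amount
print(sales["Amount"].describe())

# Top-selling product categories
top = sales.groupby("Product_Category")["Amount"].sum().nlargest(10)
top.plot(kind="barh", title="Top 10 categories by sales")
plt.tight_layout()
plt.show()

# Sales by customer demographics
print(sales.groupby("Gender")["Amount"].sum())
```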
- Customer Behavior: Insights on which customer segments contributed the most to sales.
- Sales Performance: Which products or categories had the highest sales, and during which days of Diwali sales peaked.
- Discount Effectiveness: The impact of discounts on sales and whether higher discounts led to significantly higher sales or not.
Summarize the key insights derived from the EDA. Discuss any patterns or trends that were unexpected or particularly interesting. Provide recommendations for future sales strategies based on the findings.
License: Attribution 4.0 International (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
The dataset deposited here contains decomposed matrices of GWAS summary statistics across 2,138 phenotypes described in the following publication: Y. Tanigawa*, J. Li*, et al., Components of genetic associations across 2,138 phenotypes in the UK Biobank highlight adipocyte biology. Nature Communications (2019). doi:10.1038/s41467-019-11953-9.

The data are provided as three Python NumPy data (npz) files, each of which corresponds to one of the three datasets used in the computational analysis described in our manuscript:

- "all" dataset: dev_allNonMHC_z_center_p0001_100PCs_20180129.npz
- "Coding only" dataset: dev_codingNonMHC_z_center_p0001_100PCs_20180129.npz
- "PTVs only" dataset: dev_PTVsNonMHC_z_center_p0001_100PCs_20180129.npz

Those files can be loaded with the Python numpy package and were used in our analysis scripts and notebook (https://github.com/rivas-lab/public-resources/tree/master/uk_biobank/DeGAs). Please read our publication for more information regarding this dataset.

Abstract

Population-based biobanks with genomic and dense phenotype data provide opportunities for generating effective therapeutic hypotheses and understanding the genomic role in disease predisposition. To characterize latent components of genetic associations, we applied truncated singular value decomposition (DeGAs) to matrices of summary statistics derived from genome-wide association analyses across 2,138 phenotypes measured in 337,199 White British individuals in the UK Biobank study. We systematically identified key components of genetic associations and the contributions of variants, genes, and phenotypes to each component. As an illustration of the utility of the approach to inform downstream experiments, we report putative loss of function variants, rs114285050 (GPR151) and rs150090666 (PDE3B), that substantially contribute to obesity-related traits, and experimentally demonstrate the role of these genes in adipocyte biology. Our approach to dissect components of genetic associations across the human phenome will accelerate biomedical hypothesis generation by providing insights on previously unexplored latent structures.
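A hedged sketch of loading one of the npz files with numpy; the array names stored inside are not documented here, so the sketch simply lists them:

```python
# Inspect the contents of one of the released npz files.
import numpy as np

data = np.load("dev_allNonMHC_z_center_p0001_100PCs_20180129.npz", allow_pickle=True)
for name in data.files:
    arr = data[name]
    print(name, getattr(arr, "shape", None), getattr(arr, "dtype", None))
```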
License: Attribution 4.0 International (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
Network descriptive statistics for the Deezer networks.
This dataset contains the data and code necessary to replicate work in the following paper: Narayan, Sneha, Jake Orlowitz, Jonathan Morgan, Benjamin Mako Hill, and Aaron Shaw. 2017. "The Wikipedia Adventure: Field Evaluation of an Interactive Tutorial for New Users." In Proceedings of the 20th ACM Conference on Computer-Supported Cooperative Work & Social Computing (CSCW '17). New York, New York: ACM Press. http://dx.doi.org/10.1145/2998181.2998307

The published paper contains two studies. Study 1 is a descriptive analysis of a survey of Wikipedia editors who played a gamified tutorial. Study 2 is a field experiment that evaluated the same tutorial. These data are the data used in the field experiment described in Study 2.

Description of Files

This dataset contains the following files beyond this README:

- twa.RData — An RData file that includes all variables used in Study 2.
- twa_analysis.R — A GNU R script that includes all the code used to generate the tables and plots related to Study 2 in the paper.

The RData file contains one variable (d) which is an R dataframe (i.e., table) that includes the following columns:

- userid (integer): The unique numerical ID representing each user in our sample. These are 8-digit integers and describe public accounts on Wikipedia.
- sample.date (date string): The day the user was recruited to the study. Dates are formatted in "YYYY-MM-DD" format. In the case of invitees, it is the date their invitation was sent. For users in the control group, this is the date that they would have been invited to the study.
- edits.all (integer): The total number of edits made by the user on Wikipedia in the 180 days after they joined the study. Edits to a user's user pages, user talk pages and subpages are ignored.
- edits.ns0 (integer): The total number of edits made by the user to article pages on Wikipedia in the 180 days after they joined the study.
- edits.talk (integer): The total number of edits made by the user to talk pages on Wikipedia in the 180 days after they joined the study. Edits to a user's user page, user talk page and subpages are ignored.
- treat (logical): TRUE if the user was invited, FALSE if the user was in the control group.
- play (logical): TRUE if the user played the game, FALSE if the user did not. All users in control are listed as FALSE because any user who had not been invited to the game but played was removed.
- twa.level (integer): Takes a value of 0 if the user has not played the game. Ranges from 1 to 7 for those who did, indicating the highest level they reached in the game.
- quality.score (float): This is the average word persistence (over a 6-revision window) over all edits made by this userid. Our measure of word persistence (persistent word revision per word) is a measure of edit quality developed by Halfaker et al. that tracks how long words in an edit persist after subsequent revisions are made to the wiki-page. For more information on how word persistence is calculated, see the following paper: Halfaker, Aaron, Aniket Kittur, Robert Kraut, and John Riedl. 2009. "A Jury of Your Peers: Quality, Experience and Ownership in Wikipedia." In Proceedings of the 5th International Symposium on Wikis and Open Collaboration (OpenSym '09), 1–10. New York, New York: ACM Press. doi:10.1145/1641309.1641332.
Or this page: https://meta.wikimedia.org/wiki/Research:Content_persistence

How we created twa.RData

The file twa.RData combines datasets drawn from three places:

- A dataset created by Wikimedia Foundation staff that tracked the details of the experiment and how far people got in the game. The variables userid, sample.date, treat, play, and twa.level were all generated in a dataset created by WMF staff when The Wikipedia Adventure was deployed. All users in the sample created their accounts within 2 days before the date they were entered into the study. None of them had received a Teahouse invitation, a Level 4 user warning, or been blocked from editing at the time that they entered the study. Additionally, all users made at least one edit after the day they were invited. Users were sorted randomly into treatment and control groups, based on which they either received or did not receive an invite to play The Wikipedia Adventure.
- Edit and text persistence data drawn from public XML dumps created on May 21st, 2015. We used publicly available XML dumps to generate the outcome variables, namely edits.all, edits.ns0, edits.talk and quality.score. We first extracted all edits made by users in our sample during the six month period since they joined the study, excluding edits made to user pages or user talk pages. We parsed the XML dumps using the Python based wikiq and MediaWikiUtilities software online at: http://projects.mako.cc/source/?p=mediawiki_dump_tools and https://github.com/mediawiki-utilities/python-mediawiki-utilities
- We o...

Visit https://dataone.org/datasets/sha256%3Ab1240bda398e8fa311ac15dbcc04880333d5f3fbe67a7a951786da2d44e33018 for complete metadata about this dataset.
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
Abstract

The emergence of animal societies offers unsolved problems for both evolutionary and ecological studies. Social spiders are especially well suited to address this problem given their multiple independent origins and distinct geographical distribution. Based on long-term research on the spider genus Anelosimus, we developed a spatial model that recreates observed macroecological patterns in the distribution of social and subsocial spiders. We show that parallel gradients of increasing insect size and disturbance (rain, predation) with proximity to the lowland tropical rainforest would explain why social species are concentrated in the lowland wet tropics, but absent from higher elevations and latitudes. The model further shows that disturbance, which disproportionately affects small colonies, not only creates conditions that require group living, but also tempers the dynamics of large social groups. Similarly simple underlying processes, albeit with different players on a somewhat different stage, may explain the diversity of other social systems.
Methods

This dataset was created by a spatial computer model written in Python. The dataset contains the main results; further results can be re-generated by the Python code, or its minor variants, available as a supplement of our publication. The modelled grid incorporates parallel gradients of insect size and disturbance in a square lattice grid, one end of which represents a high-elevation tropical cloudforest, the other, a lowland tropical rainforest. As we move from the cloudforest to the rainforest, insects get larger and disturbances more severe. Each node can be inhabited by a single colony of either a subsocial or a social spider species, as inspired by those in the genus Anelosimus.
Usage notes

readme.txt -> help

FOLDERS
- basic_setting -> the model with the basic parameters
- test_preysize_hyp -> test of the prey size hypothesis
- test_disturbance_hyp -> test of the disturbance hypothesis
- control_preysize_hyp -> control for the prey size hypothesis
- control_disturbance_hyp -> control for the disturbance hypothesis

FILES WITHIN FOLDERS
- col_sizes.txt -> records colony sizes at 1 arbitrary position in each environment
- data_allsizes -> descriptive statistics for all social colony sizes averaged throughout the last 100 generations
- data_social -> descriptive statistics on all social colonies within each generation
- data_subsocial -> descriptive statistics on all subsocial colonies within each generation
- parameters -> main parameters of the simulation
- population -> records the whole grid (both populations) in the last two generations