100+ datasets found
  1. CSV file used in statistical analyses

    • data.csiro.au
    • researchdata.edu.au
    • +1 more
    Updated Oct 13, 2014
    + more versions
    Cite
    CSIRO (2014). CSV file used in statistical analyses [Dataset]. http://doi.org/10.4225/08/543B4B4CA92E6
    Explore at:
    Dataset updated
    Oct 13, 2014
    Dataset authored and provided by
    CSIRO (http://www.csiro.au/)
    License

    CSIRO Data Licence: https://research.csiro.au/dap/licences/csiro-data-licence/

    Time period covered
    Mar 14, 2008 - Jun 9, 2009
    Dataset funded by
    CSIRO (http://www.csiro.au/)
    Description

    A csv file containing the tidal frequencies used for statistical analyses in the paper "Estimating Freshwater Flows From Tidally-Affected Hydrographic Data" by Dan Pagendam and Don Percival.

  2. CORD-19-CSV

    • kaggle.com
    Updated Apr 12, 2020
    Cite
    Huáscar Méndez (2020). CORD-19-CSV [Dataset]. https://www.kaggle.com/huascarmendez1/cord19csv/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 12, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Huáscar Méndez
    Description

    Context

    This dataset is an extract from COVID-19 Open Research Dataset Challenge (CORD-19).

    This preprocessing is necessary because the original input data is stored in JSON files whose structure is too complex to analyze directly.

    The preprocessing further consisted of filtering for documents that specifically discuss COVID-19 and its other names, among other general data review and cleaning activities.

    Content

    As a result, this dataset contains a set of files in CSV format, grouped by original source (Biorxiv, Comm_use, Custom_licence, Nomcomm_use). Each file contains a subset of data columns: paper_id, doc_title, doc_text, and source.
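    The filtering step described above can be sketched with the standard csv module. This is a minimal illustration, not the dataset's actual pipeline: the toy rows and the keyword list ("covid-19 and its other names") are assumptions, while the column names follow the dataset description.

```python
import csv
import io

# Toy stand-in for one of the CSV files; the column names
# (paper_id, doc_title, doc_text, source) follow the dataset description.
sample = io.StringIO(
    "paper_id,doc_title,doc_text,source\n"
    "p1,Study A,Analysis of covid-19 transmission dynamics,biorxiv\n"
    "p2,Study B,Unrelated influenza surveillance work,comm_use\n"
)

# Illustrative keyword list; the dataset's real filter terms are not specified here.
KEYWORDS = ("covid-19", "sars-cov-2", "2019-ncov")

def filter_covid_rows(fh):
    """Keep only rows whose doc_text mentions one of the keywords."""
    return [row for row in csv.DictReader(fh)
            if any(k in row["doc_text"].lower() for k in KEYWORDS)]

covid_rows = filter_covid_rows(sample)
```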

  3. Image csv file

    • kaggle.com
    Updated Jun 14, 2024
    Cite
    Muluken Hakim (2024). Image csv file [Dataset]. https://www.kaggle.com/datasets/mulukenhakim/image-csv-file/suggestions
    Explore at:
    Croissant (a format for machine-learning datasets)
    Dataset updated
    Jun 14, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Muluken Hakim
    Description

    Dataset

    This dataset was created by Muluken Hakim

    Released under Other (specified in description)

    Contents

  4. Data from: Census of the Ecosystem of Decentralized Autonomous Organizations...

    • data.niaid.nih.gov
    • produccioncientifica.ucm.es
    • +2 more
    Updated Jul 6, 2024
    Cite
    Schwartz, Andrew (2024). Census of the Ecosystem of Decentralized Autonomous Organizations [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10794915
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    Arroyo, Javier
    Peña-Calvin, Andrea
    Davó, David
    Schwartz, Andrew
    Hassan, Samer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset includes data from various Decentralized Autonomous Organizations (DAOs) platforms, namely Aragon, DAOHaus, DAOstack, Realms, Snapshot and Tally. DAOs are a new form of self-governed online communities deployed in the blockchain. DAO members typically use governance tokens to participate in the DAO decision-making process, often through a voting system where members submit proposals and vote on them.

    The description of the methods used for the generation of data, for processing it, and the quality-assurance procedures performed on the data can be found here: https://doi.org/10.1145/3589335.3651481

    Recommended citation for this dataset: Peña-Calvin, A., Arroyo, J., Schwartz, A., & Hassan, S. (2024). Concentration of Power and Participation in Online Governance: the Ecosystem of Decentralized Autonomous Organizations. Companion Proceedings of the ACM Web Conference, 13–17, 2024, Singapore, doi: https://doi.org/10.1145/3589335.3651481

    The dataset comprises three CSV files: deployments.csv, proposals.csv, and votes.csv, each containing essential information regarding DAO deployments, their proposals, and the corresponding votes.

    The file deployments.csv provides insights into the general aspects of DAO deployments, including the platform it is deployed in, the number of proposals, unique voters, votes cast, and estimated voting power.

    The proposals.csv file contains comprehensive information about all proposals associated with the deployments, including their date, the number of votes they received, and the total voting power voters employed on that proposal.

    In votes.csv, data regarding the votes cast for the deployment proposals is recorded. It includes the voter's blockchain address, the vote's weight in voting power, and the day it was cast.
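    As a sketch of how these files relate, the per-vote weights in votes.csv can be aggregated into the per-proposal voting power that proposals.csv summarizes. The miniature rows and the exact column names (proposal_id, voter, weight, date) below are illustrative assumptions based on the description, not the dataset's documented schema.

```python
import csv
import io
from collections import defaultdict

# Hypothetical miniature of votes.csv; column names are assumptions.
votes_csv = io.StringIO(
    "proposal_id,voter,weight,date\n"
    "prop-1,0xabc,10.0,2024-01-02\n"
    "prop-1,0xdef,5.0,2024-01-03\n"
    "prop-2,0xabc,2.5,2024-01-04\n"
)

def voting_power_per_proposal(fh):
    """Sum the vote weights (voting power) cast on each proposal."""
    totals = defaultdict(float)
    for row in csv.DictReader(fh):
        totals[row["proposal_id"]] += float(row["weight"])
    return dict(totals)

totals = voting_power_per_proposal(votes_csv)
```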

  5. Mapping incident locations from a CSV file in a web map (video)

    • data.amerigeoss.org
    esri rest, html
    Updated Mar 17, 2020
    Cite
    ESRI (2020). Mapping incident locations from a CSV file in a web map (video) [Dataset]. https://data.amerigeoss.org/zh_CN/dataset/mapping-incident-locations-from-a-csv-file-in-a-web-map-video
    Explore at:
    Available download formats: esri rest, html
    Dataset updated
    Mar 17, 2020
    Dataset provided by
    ESRI
    Description

    Mapping incident locations from a CSV file in a web map (YouTube video).


    View this short demonstration video to learn how to geocode incident locations from a spreadsheet in ArcGIS Online. In the demonstration, the presenter drags a simple .csv file into a browser-based web map and maps the appropriate address fields to display incident points, enabling different types of spatial overlays and analysis.


    Communities around the world are taking strides in mitigating the threat that COVID-19 (coronavirus) poses. Geography and location analysis have a crucial role in better understanding this evolving pandemic.

    When you need help quickly, Esri can provide data, software, configurable applications, and technical support for your emergency GIS operations. Use GIS to rapidly access and visualize mission-critical information. Get the information you need quickly, in a way that’s easy to understand, to make better decisions during a crisis.

    Esri’s Disaster Response Program (DRP) assists with disasters worldwide as part of our corporate citizenship. We support response and relief efforts with GIS technology and expertise.


  6. Gene expression csv files

    • figshare.com
    txt
    Updated Jun 12, 2023
    Cite
    Cristina Alvira (2023). Gene expression csv files [Dataset]. http://doi.org/10.6084/m9.figshare.21861975.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 12, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Cristina Alvira
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    CSV files containing all detectable genes.

  7. Waitrose Products Information Dataset in CSV Format - Comprehensive Product...

    • crawlfeeds.com
    csv, zip
    Updated Jun 7, 2025
    Cite
    Crawl Feeds (2025). Waitrose Products Information Dataset in CSV Format - Comprehensive Product Data [Dataset]. https://crawlfeeds.com/datasets/waitrose-products-information-dataset-in-csv-format-comprehensive-product-data
    Explore at:
    Available download formats: zip, csv
    Dataset updated
    Jun 7, 2025
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policy

    Description

    The Waitrose Product Dataset offers a comprehensive and structured collection of grocery items listed on the Waitrose online platform. This dataset includes 25,000+ product records across multiple categories, curated specifically for use in retail analytics, pricing comparison, AI training, and eCommerce integration.

    Each record contains detailed attributes such as:

    • Product title, brand, MPN, and product ID

    • Price and currency

    • Availability status

    • Description, ingredients, and raw nutrition data

    • Review count and average rating

    • Breadcrumbs, image links, and more

    Delivered in CSV format (ZIP archive), this dataset is ideal for professionals in the FMCG, retail, and grocery tech industries who need structured, crawl-ready data for their projects.

  8. AOI polygon fire statistics CSV files

    • nwcc-nrcs.hub.arcgis.com
    Updated Nov 24, 2024
    Cite
    USDA NRCS ArcGIS Online (2024). AOI polygon fire statistics CSV files [Dataset]. https://nwcc-nrcs.hub.arcgis.com/datasets/6928853d9d7c450a84984d4c66f95e9c
    Explore at:
    Dataset updated
    Nov 24, 2024
    Dataset provided by
    Natural Resources Conservation Service (http://www.nrcs.usda.gov/)
    United States Department of Agriculture (http://usda.gov/)
    Authors
    USDA NRCS ArcGIS Online
    Description

    Annual and time-period fire statistics in CSV format for the AOIs of the NWCC active forecast stations. The statistics are based on NIFC historical and current fire perimeters and MTBS burn severity data. This release contains NIFC data from 1996 to current (July 10, 2025) and MTBS data from 1996 to 2022. Annual statistics were generated for the period 1996 to 2025. Time-period statistics were generated from 1998 to 2022 at 5-year intervals; the time periods are 2018-2022 (last 5 years), 2013-2022 (last 10 years), 2008-2022 (last 15 years), 2003-2022 (last 20 years), and 1998-2022 (last 25 years).

  9. ShoppingAppReviews Dataset

    • data.mendeley.com
    Updated Sep 16, 2024
    + more versions
    Cite
    Noor Mairukh Khan Arnob (2024). ShoppingAppReviews Dataset [Dataset]. http://doi.org/10.17632/chr5b94c6y.2
    Explore at:
    Dataset updated
    Sep 16, 2024
    Authors
    Noor Mairukh Khan Arnob
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    A dataset consisting of 751,500 English app reviews of 12 online shopping apps. The dataset was scraped from the internet using a Python script. This ShoppingAppReviews dataset contains app reviews of the 12 most popular online shopping Android apps: Alibaba, Aliexpress, Amazon, Daraz, eBay, Flipcart, Lazada, Meesho, Myntra, Shein, Snapdeal and Walmart. Each review entry contains metadata fields such as review score, thumbs-up count, review posting time, and reply content. The dataset is organized in a zip file containing 12 JSON files and 12 CSV files, one pair per app. This dataset can be used to obtain valuable information about customers' feedback regarding their user experience of these financially important apps.
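    A typical first step with per-app review CSVs like these is computing summary statistics such as the mean review score. The sketch below uses toy rows with assumed column names (review_id, score, thumbs_up_count, posted_at); the real files' headers may differ.

```python
import csv
import io

# Hypothetical miniature of one app's review CSV; real column names may differ.
reviews = io.StringIO(
    "review_id,score,thumbs_up_count,posted_at\n"
    "r1,5,12,2023-09-01\n"
    "r2,1,3,2023-09-02\n"
    "r3,3,0,2023-09-03\n"
)

def average_score(fh):
    """Mean review score across all rows of one app's CSV."""
    scores = [int(row["score"]) for row in csv.DictReader(fh)]
    return sum(scores) / len(scores)

avg = average_score(reviews)
```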

  10. Data from: The Tropical Andes Biodiversity Hotspot: A Comprehensive Dataset...

    • knb.ecoinformatics.org
    • dataone.org
    • +3 more
    Updated May 30, 2024
    + more versions
    Cite
    Pablo Jarrín-V.; Mario H Yánez-Muñoz (2024). The Tropical Andes Biodiversity Hotspot: A Comprehensive Dataset for the Mira-Mataje Binational Basins [Dataset]. http://doi.org/10.5063/F14F1P6H
    Explore at:
    Dataset updated
    May 30, 2024
    Dataset provided by
    Knowledge Network for Biocomplexity
    Authors
    Pablo Jarrín-V.; Mario H Yánez-Muñoz
    Time period covered
    Jun 11, 2022 - Jun 11, 2023
    Area covered
    Description

    We present a flora and fauna dataset for the Mira-Mataje binational basins. This is an area shared between southwestern Colombia and northwestern Ecuador, where both the Chocó and Tropical Andes biodiversity hotspots converge. Information from 120 sources was systematized in the Darwin Core Archive (DwC-A) standard and geospatial vector data format for geographic information systems (GIS) (shapefiles). Sources included natural history museums, published literature, and citizen science repositories across 18 countries. The resulting database has 33,460 records from 5,281 species, of which 1,083 are endemic and 680 threatened. The diversity represented in the dataset is equivalent to 10% of the total plant species and 26% of the total terrestrial vertebrate species in the hotspots. It corresponds to 0.07% of their total area. The dataset can be used to estimate and compare biodiversity patterns with environmental parameters and provide value to ecosystems, ecoregions, and protected areas. The dataset is a baseline for future assessments of biodiversity in the face of environmental degradation, climate change, and accelerated extinction processes. The data has been formally presented in the manuscript entitled "The Tropical Andes Biodiversity Hotspot: A Comprehensive Dataset for the Mira-Mataje Binational Basins" in the journal "Scientific Data". To maintain DOI integrity, this version will not change after publication of the manuscript and therefore we cannot provide further references on volume, issue, and DOI of manuscript publication.

    Data format 1: The .rds file extension saves a single object to be read in R and provides better compression, serialization, and integration within the R environment than simple .csv files. The description of file names is in the original manuscript.
    -- m_m_flora_2021_voucher_ecuador.rds
    -- m_m_flora_2021_observation_ecuador.rds
    -- m_m_flora_2021_total_ecuador.rds
    -- m_m_fauna_2021_ecuador.rds

    Data format 2: The .csv files have been encoded in UTF-8, with text values separated by commas. The description of file names is in the original manuscript.
    -- m_m_flora_fauna_2021_all.zip (includes all biodiversity datasets)
    -- m_m_flora_2021_voucher_ecuador.csv
    -- m_m_flora_2021_observation_ecuador.csv
    -- m_m_flora_2021_total_ecuador.csv
    -- m_m_fauna_2021_ecuador.csv

    Data format 3: We consolidated a shapefile for the basin containing layers for vegetation ecosystems and the total number of occurrences, species, and endemic and threatened species for each ecosystem.
    -- biodiversity_measures_mira_mataje.zip (includes the .shp file and accessory geomatic files)

    A set of 3D shaded-relief map representations of the data in the shapefile can be found at https://doi.org/10.6084/m9.figshare.23499180.v4

    Three taxonomic data tables were used in our technical validation of the presented dataset:
    1) the_catalog_of_life.tsv (Source: Bánki, O. et al. Catalogue of life checklist (version 2024-03-26). https://doi.org/10.48580/dfz8d (2024))
    2) world_checklist_of_vascular_plants_names.csv, with ancillary tables "world_checklist_of_vascular_plants_distribution.csv" and "README_world_checklist_of_vascular_plants_.xlsx" (Source: Govaerts, R., Lughadha, E. N., Black, N., Turner, R. & Paton, A. The World Checklist of Vascular Plants is a continuously updated resource for exploring global plant diversity. Sci. Data 8, 215, 10.1038/s41597-021-00997-6 (2021))
    3) world_flora_online.csv (Source: The World Flora Online Consortium et al. World flora online plant list December 2023, 10.5281/zenodo.10425161 (2023))

  11. E-learning Recommender System Dataset

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 8, 2023
    Cite
    Hafsa, Mounir (2023). E-learning Recommender System Dataset [Dataset]. http://doi.org/10.7910/DVN/BMY3UD
    Explore at:
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Hafsa, Mounir
    Description

    The Mandarine Academy Recommender System (MARS) Dataset was captured from a real-world open MOOC (https://mooc.office365-training.com/). The dataset offers both explicit and implicit ratings, for both the French and English versions of the MOOC. Compared with classical recommendation datasets like MovieLens, this is a rather small dataset due to the nature of the available content (educational). However, the dataset offers insights into real-world ratings and provides testing grounds away from common datasets. All items are available online for viewing in both French and English versions. All selected users had rated at least 1 item. No demographic information is included. Each user is represented by an id and a job (if available).

    For both French and English, the same kinds of files are available in .csv format. We provide the following files:

    Users: contains information about user ids and their jobs.
    Items: contains information about items (resources) in the selected language; contains a mix of feature types.
    Ratings: both explicit (watch time) and implicit (page views of items).

    Formatting and Encoding: The dataset files are written as comma-separated values files with a single header row. Columns that contain commas (,) are escaped using double quotes ("). These files are encoded as UTF-8.

    User Ids: User ids are consistent between explicit_ratings.csv, implicit_ratings.csv, and users.csv (i.e., the same id refers to the same user across the dataset).

    Item Ids: Item ids are consistent between explicit_ratings.csv, implicit_ratings.csv, and items.csv (i.e., the same id refers to the same item across the dataset).

    Ratings Data File Structure: All ratings are contained in the files explicit_ratings.csv and implicit_ratings.csv. Each line of these files after the header row represents one rating of one item by one user, and has the following format: item_id,user_id,created_at (implicit_ratings.csv); user_id,item_id,watch_percentage,created_at,rating (explicit_ratings.csv).

    Item Data File Structure: Item information is contained in the file items.csv. Each line of this file after the header row represents one item, and has the following format: item_id,language,name,nb_views,description,created_at,Difficulty,Job,Software,Theme,duration,type
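    The explicit_ratings.csv layout documented above (user_id,item_id,watch_percentage,created_at,rating) can be consumed directly with csv.DictReader, for example to average ratings per item. The toy rows below are invented for illustration; only the header follows the documented format.

```python
import csv
import io

# Toy rows following the documented explicit_ratings.csv format:
# user_id,item_id,watch_percentage,created_at,rating
explicit = io.StringIO(
    "user_id,item_id,watch_percentage,created_at,rating\n"
    "1,42,0.85,2023-01-05,4\n"
    "2,42,0.10,2023-01-06,1\n"
    "1,7,0.50,2023-01-07,3\n"
)

def mean_rating_per_item(fh):
    """Average explicit rating for each item id."""
    sums, counts = {}, {}
    for row in csv.DictReader(fh):
        item = row["item_id"]
        sums[item] = sums.get(item, 0) + int(row["rating"])
        counts[item] = counts.get(item, 0) + 1
    return {item: sums[item] / counts[item] for item in sums}

means = mean_rating_per_item(explicit)
```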

  12. Network traffic and code for machine learning classification

    • data.mendeley.com
    Updated Feb 20, 2020
    + more versions
    Cite
    Víctor Labayen (2020). Network traffic and code for machine learning classification [Dataset]. http://doi.org/10.17632/5pmnkshffm.2
    Explore at:
    Dataset updated
    Feb 20, 2020
    Authors
    Víctor Labayen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset is a set of network traffic traces in pcap/csv format captured from a single user. The traffic is classified in 5 different activities (Video, Bulk, Idle, Web, and Interactive) and the label is shown in the filename. There is also a file (mapping.csv) with the mapping of the host's IP address, the csv/pcap filename and the activity label.

    Activities:

    Interactive: applications that perform real-time interactions to provide a suitable user experience, such as editing a file in Google Docs or remote CLI sessions over SSH.

    Bulk data transfer: applications that transfer large data volumes over the network, such as SCP/FTP applications and direct downloads of large files from web servers like Mediafire, Dropbox or the university repository.

    Web browsing: all the traffic generated while searching and consuming different web pages, such as several blogs, news sites and the university's Moodle.

    Video playback: traffic from applications that consume video in streaming or pseudo-streaming. The best-known services used are Twitch and YouTube, but the university online classroom has also been used.

    Idle behaviour: the background traffic generated by the user's computer when the user is idle. This traffic has been captured with every application closed and with some opened pages like Google Docs, YouTube and several web pages, but always without user interaction.

    The capture is performed on a network probe, attached to the router that forwards the user's network traffic, using a SPAN port. The traffic is stored in pcap format with all the packet payload. In the csv files, every non-TCP/UDP packet is filtered out, as well as every packet with no payload. The fields in the csv files are the following (one line per packet): timestamp, protocol, payload size, source and destination IP address, source and destination UDP/TCP port. The fields are also included as a header in every csv file.
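    A per-packet CSV with those fields can be summarized per trace (packet count, total payload bytes, duration). The snippet below is a sketch over toy rows; the exact header spellings in the real files may differ from the assumed ones.

```python
import csv
import io

# Hypothetical packet rows; header spellings are assumptions based on the
# field list (timestamp, protocol, payload size, IPs, ports).
packets = io.StringIO(
    "timestamp,protocol,payload_size,ip_src,ip_dst,port_src,port_dst\n"
    "0.000,TCP,1460,10.0.0.2,93.184.216.34,51000,443\n"
    "0.120,TCP,1460,93.184.216.34,10.0.0.2,443,51000\n"
    "0.500,UDP,512,10.0.0.2,8.8.8.8,40000,53\n"
)

def trace_summary(fh):
    """Packet count, total payload bytes, and duration of one trace."""
    rows = list(csv.DictReader(fh))
    total_bytes = sum(int(r["payload_size"]) for r in rows)
    duration = float(rows[-1]["timestamp"]) - float(rows[0]["timestamp"])
    return {"packets": len(rows), "payload_bytes": total_bytes,
            "duration_s": duration}

summary = trace_summary(packets)
```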

    The amount of data is stated as follows:

    Bulk: 19 traces, 3599 s of total duration, 8704 MBytes of pcap files
    Video: 23 traces, 4496 s, 1405 MBytes
    Web: 23 traces, 4203 s, 148 MBytes
    Interactive: 42 traces, 8934 s, 30.5 MBytes
    Idle: 52 traces, 6341 s, 0.69 MBytes

    The code of our machine learning approach is also included. There is a README.txt file with the documentation of how to use the code.

  13. PPG signals in .CSV files

    • figshare.com
    csv
    Updated Oct 1, 2024
    + more versions
    Cite
    Arthur Robles Bolivar (2024). PPG signals in .CSV files [Dataset]. http://doi.org/10.6084/m9.figshare.27132552.v2
    Explore at:
    Available download formats: csv
    Dataset updated
    Oct 1, 2024
    Dataset provided by
    figshare
    Authors
    Arthur Robles Bolivar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This information is related to the project "PPG-Based Cholesterol Assessment," which showcases different PPG signals along with the cholesterol levels of the subjects. The purpose is to validate techniques or tools for estimating total cholesterol levels in the blood using the PPG signal.

  14. Classification of online health messages - Dataset - CKAN

    • rdm.inesctec.pt
    Updated Jul 6, 2022
    Cite
    (2022). Classification of online health messages - Dataset - CKAN [Dataset]. https://rdm.inesctec.pt/dataset/cs-2022-008
    Explore at:
    Dataset updated
    Jul 6, 2022
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The dataset has 487 annotated messages taken from Medhelp, an online health forum with several health communities (https://www.medhelp.org/). It was built in a master thesis entitled "Automatic categorization of health-related messages in online health communities" of the Master in Informatics and Computing Engineering of the Faculty of Engineering of the University of Porto. It expands a dataset created in a previous work [see Relation metadata] whose objective was to propose a classification scheme to analyze messages exchanged in online health forums.

    A website was built to allow the classification of additional messages collected from Medhelp. After using a Python script to scrape the five most recent discussions from popular forums (https://www.medhelp.org/forums/list), we sampled 285 messages from them to annotate. Each message was classified three times by anonymous people into 11 categories from April 2022 until the end of May 2022. For each message, the rater picked the categories associated with the message and its emotional polarity (positive, neutral, or negative).

    Our dataset is organized in two CSV files: one containing information regarding the 885 (=3*285) classifications collected via crowdsourcing (CrowdsourcingClassification.csv), and the other containing the 487 messages with their final and consensual classifications (FinalClassification.csv). The readMe file provides detailed information about the two .csv files.
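    Deriving a consensual label from the three crowdsourced ratings per message can be sketched as a majority vote. The column names (message_id, category, polarity) and the toy rows below are illustrative assumptions, not the files' documented schema.

```python
import csv
import io
from collections import Counter

# Hypothetical miniature of CrowdsourcingClassification.csv: each message
# was rated three times; column names are assumptions.
raw = io.StringIO(
    "message_id,category,polarity\n"
    "m1,support,positive\n"
    "m1,support,positive\n"
    "m1,question,neutral\n"
)

def consensus_polarity(fh):
    """Majority-vote polarity per message across its ratings."""
    by_msg = {}
    for row in csv.DictReader(fh):
        by_msg.setdefault(row["message_id"], []).append(row["polarity"])
    return {m: Counter(v).most_common(1)[0][0] for m, v in by_msg.items()}

consensus = consensus_polarity(raw)
```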

  15. Online Package for the manuscript "Continuous Integration and Delivery for...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Sep 2, 2021
    Cite
    Anonymous; Anonymous (2021). Online Package for the manuscript "Continuous Integration and Delivery for Cyber-Physical systems: A Grounded-Theory" [Dataset]. http://doi.org/10.5281/zenodo.5379186
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 2, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous; Anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This package contains the material of the (grounded theory) study related to the paper

    "Continuous Integration and Delivery for Cyber-Physical systems: A Grounded-Theory"

    The content of the various files and directories is detailed in the following.

    InterviewStructure.pdf: complete interview structure (the paper reports an overview in Table 1)

    open_coding/: this directory contains details from the open coding procedure. In particular:
    - AllCodesUsedWhileCodingAndMapping.csv contains the list of all codes generated during the open coding, with a symbol near each one (first column) then used to compute the inter-rater agreement
    - first_round.csv, second_round.csv, third_round.csv, fourth_round.csv: files used to compute the inter-rater agreement


    CodesContributingToMindMap.csv: final set of codes that contributed to the taxonomy (see below)

    FinalCodingTraceability.csv: this file describes how the final set of codes is traced onto the ten interviews. Note that, at this stage, the transcripts have been redacted for confidentiality purposes.

    MindMap_Complete.pdf: complete taxonomy of codes, in the form of a mind map. The one reported in the paper (Figure 2) omits leaves, unless (as for benefits, for example) they are necessary to properly describe and understand the category. Also, note that the complete mind map separates the benefits into "actual" and "expected" (based on what was collected from the interviews), whereas the summary mind map shown in the paper (Figure 2) does not make this distinction.

    6C_Diagrams/ : This directory contains the detailed 6C diagrams (i.e., each box related to a "C" contains the list of codes pertaining to it) for the three dimensions investigated in the paper and addressed in RQ1, RQ2, and RQ3.

  16. Replication Data for: Revisiting 'The Rise and Decline' in a Population of...

    • search.dataone.org
    Updated Nov 22, 2023
    Cite
    TeBlunthuis, Nathan; Aaron Shaw; Benjamin Mako Hill (2023). Replication Data for: Revisiting 'The Rise and Decline' in a Population of Peer Production Projects [Dataset]. http://doi.org/10.7910/DVN/SG3LP1
    Explore at:
    Dataset updated
    Nov 22, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    TeBlunthuis, Nathan; Aaron Shaw; Benjamin Mako Hill
    Description

    This archive contains code and data for reproducing the analysis for “Replication Data for Revisiting ‘The Rise and Decline’ in a Population of Peer Production Projects”. Depending on what you hope to do with the data you probabbly do not want to download all of the files. Depending on your computation resources you may not be able to run all stages of the analysis. The code for all stages of the analysis, including typesetting the manuscript and running the analysis, is in code.tar. If you only want to run the final analysis or to play with datasets used in the analysis of the paper, you want intermediate_data.7z or the uncompressed tab and csv files. The data files are created in a four-stage process. The first stage uses the program “wikiq” to parse mediawiki xml dumps and create tsv files that have edit data for each wiki. The second stage generates all.edits.RDS file which combines these tsvs into a dataset of edits from all the wikis. This file is expensive to generate and at 1.5GB is pretty big. The third stage builds smaller intermediate files that contain the analytical variables from these tsv files. The fourth stage uses the intermediate files to generate smaller RDS files that contain the results. Finally, knitr and latex typeset the manuscript. A stage will only run if the outputs from the previous stages do not exist. So if the intermediate files exist they will not be regenerated. Only the final analysis will run. The exception is that stage 4, fitting models and generating plots, always runs. If you only want to replicate from the second stage onward, you want wikiq_tsvs.7z. If you want to replicate everything, you want wikia_mediawiki_xml_dumps.7z.001 wikia_mediawiki_xml_dumps.7z.002, and wikia_mediawiki_xml_dumps.7z.003. These instructions work backwards from building the manuscript using knitr, loading the datasets, running the analysis, to building the intermediate datasets. 
Building the manuscript using knitr This requires working latex, latexmk, and knitr installations. Depending on your operating system you might install these packages in different ways. On Debian Linux you can run apt install r-cran-knitr latexmk texlive-latex-extra. Alternatively, you can upload the necessary files to a project on Overleaf.com. Download code.tar. This has everything you need to typeset the manuscript. Unpack the tar archive. On a unix system this can be done by running tar xf code.tar. Navigate to code/paper_source. Install R dependencies. In R. run install.packages(c("data.table","scales","ggplot2","lubridate","texreg")) On a unix system you should be able to run make to build the manuscript generalizable_wiki.pdf. Otherwise you should try uploading all of the files (including the tables, figure, and knitr folders) to a new project on Overleaf.com. Loading intermediate datasets The intermediate datasets are found in the intermediate_data.7z archive. They can be extracted on a unix system using the command 7z x intermediate_data.7z. The files are 95MB uncompressed. These are RDS (R data set) files and can be loaded in R using the readRDS. For example newcomer.ds <- readRDS("newcomers.RDS"). If you wish to work with these datasets using a tool other than R, you might prefer to work with the .tab files. Running the analysis Fitting the models may not work on machines with less than 32GB of RAM. If you have trouble, you may find the functions in lib-01-sample-datasets.R useful to create stratified samples of data for fitting models. See line 89 of 02_model_newcomer_survival.R for an example. Download code.tar and intermediate_data.7z to your working folder and extract both archives. On a unix system this can be done with the command tar xf code.tar && 7z x intermediate_data.7z. Install R dependencies. install.packages(c("data.table","ggplot2","urltools","texreg","optimx","lme4","bootstrap","scales","effects","lubridate","devtools","roxygen2")). 
On a unix system you can simply run regen.all.sh to fit the models, build the plots, and create the RDS files.

Generating datasets

Building the intermediate files: The intermediate files are generated from all.edits.RDS. This process requires about 20GB of memory. Download all.edits.RDS, userroles_data.7z, selected.wikis.csv, and code.tar. Unpack code.tar and userroles_data.7z (on a unix system, tar xf code.tar && 7z x userroles_data.7z). Install the R dependencies: in R, run install.packages(c("data.table","ggplot2","urltools","texreg","optimx","lme4","bootstrap","scales","effects","lubridate","devtools","roxygen2")). Then run 01_build_datasets.R.

Building all.edits.RDS: The intermediate RDS files used in the analysis are created from all.edits.RDS. To replicate building all.edits.RDS, you only need to run 01_build_datasets.R when the int... Visit https://dataone.org/datasets/sha256%3Acfa4980c107154267d8eb6dc0753ed0fde655a73a062c0c2f5af33f237da3437 for complete metadata about this dataset.
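The stage-skipping behaviour described above — a stage runs only when its outputs do not already exist, while stage 4 always runs — can be sketched in a few lines. This is a minimal illustration of the dependency logic, not the archive's actual build code, and the file names are placeholders:

```python
import os

def run_stage(outputs, build):
    """Run `build` only if any of this stage's output files are missing."""
    if all(os.path.exists(p) for p in outputs):
        return  # all outputs exist; skip the stage
    build()

# Illustrative pipeline: stages 1-3 are skipped when their outputs
# already exist; stage 4 (fitting models, generating plots) always runs.
run_stage(["all.edits.RDS"], lambda: print("building all.edits.RDS"))
run_stage(["newcomers.RDS"], lambda: print("building intermediate files"))
print("fitting models and generating plots")  # stage 4 always runs
```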

  17. fluTwitterData.csv: Data file containing weekly ILI and tweet counts from A data-driven model for influenza transmission incorporating media effects

    • rs.figshare.com
    txt
    Updated May 31, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Lewis Mitchell; Joshua V. Ross (2023). fluTwitterData.csv: Data file containing weekly ILI and tweet counts from A data-driven model for influenza transmission incorporating media effects [Dataset]. http://doi.org/10.6084/m9.figshare.4021752.v1
    Explore at:
    txtAvailable download formats
    Dataset updated
    May 31, 2023
    Dataset provided by
    The Royal Society
    Authors
    Lewis Mitchell; Joshua V. Ross
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Numerous studies have attempted to model the effect of mass media on the transmission of diseases such as influenza; however, quantitative data on media engagement has until recently been difficult to obtain. With the recent explosion of ‘big data’ coming from online social media and the like, large volumes of data on a population’s engagement with mass media during an epidemic are becoming available to researchers. In this study, we combine an online dataset comprising millions of shared messages relating to influenza with traditional surveillance data on flu activity to suggest a functional form for the relationship between the two. Using these data, we present a simple deterministic model for influenza dynamics incorporating media effects, and show that such a model helps explain the dynamics of historical influenza outbreaks. Furthermore, through model selection we show that the proposed media function fits historical data better than other media functions proposed in earlier studies.

  18. Effects of community management on user activity in online communities

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Apr 24, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Alberto Cottica; Alberto Cottica (2025). Effects of community management on user activity in online communities [Dataset]. http://doi.org/10.5281/zenodo.1320261
    Explore at:
    zipAvailable download formats
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Alberto Cottica; Alberto Cottica
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data and code needed to reproduce the results of the paper "Effects of community management on user activity in online communities", available in draft here.

    Instructions:

    1. Unzip the files.
    2. Start with JSON files obtained from calling platform APIs: each dataset consists of one file for posts, one for comments, one for users. In the paper we use two datasets, one referring to Edgeryders, the other to Matera 2019.
    3. Run them through edgesense (https://github.com/edgeryders/edgesense). Edgesense allows you to set the length of the observation period. We set it to 1 week and 1 day for Edgeryders data, and to 1 day for Matera 2019 data. Edgesense stores its results in a JSON file called network.min.json, which we then rename to keep track of the data source and observation length.
    4. Launch Jupyter Notebook and run the notebook provided to convert the network.min.json files into CSV flat files, one for each network file.
    5. Launch Stata and open each flat CSV file with it, then save it in Stata format.
    6. Use the provided Stata .do scripts to replicate the results.

    Please note: I use both Stata and Jupyter Notebook interactively, running a block with a few lines of code at a time. Expect to have to change directories, file names etc.
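Step 4's JSON-to-CSV conversion can be sketched with the standard library alone. The structure assumed here — a list of node records under a "nodes" key — is an illustration, not edgesense's actual network.min.json schema:

```python
import csv
import json

def network_json_to_csv(json_path, csv_path, key="nodes"):
    """Flatten one list of records from a network JSON file into a CSV."""
    with open(json_path) as f:
        records = json.load(f)[key]
    if not records:
        return
    # Union of keys across records, so partial records still fit one header.
    fieldnames = sorted({k for record in records for k in record})
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)  # missing fields are written as ""
```

The resulting flat CSV can then be opened in Stata as step 5 describes.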

  19. ToS;DR policies dataset (training)

    • zenodo.org
    csv
    Updated Mar 17, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Mahmoud Istaiti; Mahmoud Istaiti (2025). ToS;DR policies dataset (training) [Dataset]. http://doi.org/10.5281/zenodo.15014823
    Explore at:
    csvAvailable download formats
    Dataset updated
    Mar 17, 2025
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Mahmoud Istaiti; Mahmoud Istaiti
    License

    https://www.gnu.org/licenses/gpl-3.0-standalone.htmlhttps://www.gnu.org/licenses/gpl-3.0-standalone.html

    Description

    Dataset Overview

    This dataset is derived from Terms of Service; Didn't Read (ToS;DR), a project that analyzes and categorizes terms of service from various online services. The dataset has been cleaned and organized into two CSV files, with a focus on reproducibility and usability. The privacy dataset is a subset of the full dataset, specifically filtering for privacy-related terms.

    File Descriptions

    1. training_tosdr_all_data.csv

    This file contains the complete collection of terms of service data after cleaning and preprocessing. Each row represents a statement (or "point") extracted from a service's terms of service.

    Key Columns:

    • case_id: Unique identifier for the case.
    • case_title: Brief description of the case.
    • topic_id: Unique identifier for the topic.
    • topic_title: Broad category the case falls under (e.g., Transparency, Copyright License).
    • sentence: The extracted text from the terms of service.
    • seq_case_id: Sequential identifier for the case, used for mapping.
    • seq_topic_id: Sequential identifier for the topic, used for mapping.

    2. training_tosdr_privacy_data.csv

    This file is a subset of the full dataset, focusing exclusively on privacy-related terms. It includes cases related to tracking, data collection, account deletion policies, and other privacy-related topics.

    Key Columns:

    • case_id: Unique identifier for the case.
    • case_title: Brief description of the case.
    • topic_id: Unique identifier for the topic.
    • topic_title: Broad category the case falls under (e.g., Privacy, Data Collection).
    • sentence: The extracted text from the terms of service.
    • seq_case_id: Sequential identifier for the case, used for mapping.
    • seq_topic_id: Sequential identifier for the topic, used for mapping.
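    Since both files share the column layout above, they can be summarised with the standard library alone. The helper and file path below are illustrative, not part of the dataset:

```python
import csv
from collections import Counter

def topic_counts(csv_path):
    """Count how many sentences fall under each topic_title in a ToS;DR CSV."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return Counter(row["topic_title"] for row in csv.DictReader(f))

# e.g. topic_counts("training_tosdr_all_data.csv")
```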
  20. Florida COVID19 08222020 ByCounty CSV

    • hub.arcgis.com
    • covid19-usflibrary.hub.arcgis.com
    Updated Aug 22, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    University of South Florida GIS (2020). Florida COVID19 08222020 ByCounty CSV [Dataset]. https://hub.arcgis.com/datasets/a993a65a91cf473aa0eabc2395e779be
    Explore at:
    Dataset updated
    Aug 22, 2020
    Dataset authored and provided by
    University of South Florida GIS
    Area covered
    Description

    Florida COVID-19 Cases by County exported from the Florida Department of Health GIS Layer on the date seen in the file name. Archived by the University of South Florida Libraries, Digital Heritage and Humanities Collections. Contact: LibraryGIS@usf.edu.

    Please Cite Our GIS HUB. If you are a researcher or other party utilizing our Florida COVID-19 HUB as a tool, or accessing and utilizing the data provided herein, please provide an acknowledgement of such in any publication or re-publication. The following citation is suggested: University of South Florida Libraries, Digital Heritage and Humanities Collections. 2020. Florida COVID-19 Hub. Available at https://covid19-usflibrary.hub.arcgis.com/ . https://doi.org/10.5038/USF-COVID-19-GIS

    Live FDOH Data Source: https://services1.arcgis.com/CY1LXxl9zlJeBuRZ/arcgis/rest/services/Florida_COVID19_Cases/FeatureServer

    For data 5/10/2020 or after: archived data was exported directly from the live FDOH layer into the archive. For data prior to 5/10/2020: data was exported by the University of South Florida - Digital Heritage and Humanities Collections using ArcGIS Pro software, then converted to shapefile and CSV and uploaded into the ArcGIS Online archive. Up until 3/25 the FDOH Cases by County layer was updated twice a day; archives are taken from the 11AM update.

    For data definitions please visit the following Box folder: https://usf.box.com/s/vfjwbczkj73ucj19yvwz53at6v6w614h . Data definition file names include the relative date they were published. The below information was taken from ancillary documents associated with the original layer from FDOH.

    Persons Under Investigation/Surveillance (PUI): Essentially, PUIs are any person who has been or is waiting to be tested. This includes persons who are considered high-risk for COVID-19 due to recent travel, contact with a known case, exhibiting symptoms of COVID-19 as determined by a healthcare professional, or some combination thereof. PUIs also include people who meet laboratory testing criteria based on symptoms and exposure, as well as confirmed cases with positive test results. PUIs include any person who is or was being tested, including those with negative and pending results. All PUIs fit into one of three residency types:

    1. Florida residents tested in Florida
    2. Non-Florida residents tested in Florida
    3. Florida residents tested outside of Florida

    Florida Residents Tested Elsewhere: The total number of Florida residents with positive COVID-19 test results who were tested outside of Florida, and were not exposed/infectious in Florida.

    Non-Florida Residents Tested in Florida: The total number of people with positive COVID-19 test results who were tested, exposed, and/or infectious while in Florida, but are legal residents of another state.

    Total Cases: The total (sum) number of Persons Under Investigation (PUI) who tested positive for COVID-19 while in Florida, as well as Florida residents who tested positive or were exposed/contagious while outside of Florida, and out-of-state residents who were exposed, contagious, and/or tested in Florida.

    Deaths: The Deaths by Day chart shows the total number of Florida residents with confirmed COVID-19 that died on each calendar day (12:00 AM - 11:59 PM). Caution should be used in interpreting recent trends, as deaths are added as they are reported to the Department. Death data often has significant delays in reporting, so data within the past two weeks will be updated frequently.

    Prefix guide:

    • "PUI" = PUI: Persons under surveillance (any person for which we have data)
    • "T_" = Testing: Testing information for all PUIs and cases.
    • "C_" = Cases only: Information about cases, which are those persons who have COVID-19 positive test results on file.
    • "W_" = Surveillance and syndromic data.

    Key Data about Testing:

    • T_negative: Total negative persons tested for all Florida and non-Florida residents, including Florida residents tested outside of the state, and those tested at private facilities.
    • T_positive: Total positive persons tested for all Florida and non-Florida resident types, including Florida residents tested outside of the state, and those tested at private facilities.
    • PUILab_Yes: All persons tested with lab results on file, including negative, positive, and inconclusive. This total does NOT include those who are waiting to be tested or have submitted tests to labs for which results are still pending.

    Key Data about Confirmed COVID-19 Positive Cases:

    • CasesAll: The sum total of all positive cases, including Florida residents in Florida, Florida residents outside Florida, and non-Florida residents in Florida.
    • FLResDeaths: Deaths of Florida residents.
    • C_Hosp_Yes: Cases (confirmed positive) with a hospital admission noted.
    • C_AgeRange: Age range for all cases, regardless of residency type.
    • C_AgeMedian: Median age for all cases, regardless of residency type.
    • C_AllResTypes: Sum of COVID-19 positive Florida residents; includes in- and out-of-state Florida residents, but does not include out-of-state residents who were treated/tested/isolated in Florida.

    All questions regarding this dataset should be directed to the Florida Department of Health.
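    The prefix guide lends itself to a small lookup for grouping the layer's column names by category. The helper below is a sketch based only on the prefixes listed above, not code shipped with the dataset:

```python
# Column-name prefixes from the FDOH layer's ancillary documentation.
PREFIXES = {
    "T_": "testing",            # testing information for all PUIs and cases
    "C_": "cases_only",         # confirmed-positive cases
    "W_": "surveillance",       # surveillance and syndromic data
    "PUI": "under_surveillance" # any person for which we have data
}

def classify_column(name):
    """Return the category implied by a column's prefix, or 'other'."""
    for prefix, category in PREFIXES.items():
        if name.startswith(prefix):
            return category
    return "other"
```

Note that columns such as CasesAll carry no prefix and fall through to "other", matching the documentation's treatment of them as summary fields.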
