License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This dataset contains Zenodo's published open access records and communities metadata, including entries marked by the Zenodo staff as spam and deleted.
The datasets are gzipped compressed JSON-lines files, where each line is a JSON object representation of a Zenodo record or community.
Records dataset
Filename: zenodo_open_metadata_{ date of export }.jsonl.gz
Each object contains the terms: part_of, thesis, description, doi, meeting, imprint, references, recid, alternate_identifiers, resource_type, journal, related_identifiers, title, subjects, notes, creators, communities, access_right, keywords, contributors, publication_date
which correspond to the fields with the same name available in Zenodo's record JSON Schema at https://zenodo.org/schemas/records/record-v1.0.0.json.
In addition, some terms have been altered:
Communities dataset
Filename: zenodo_community_metadata_{ date of export }.jsonl.gz
Each object contains the terms: id, title, description, curation_policy, page
which correspond to the fields with the same name available in Zenodo's community creation form.
Notes for all datasets
For each object, the term spam contains a boolean value indicating whether the record/community was marked as spam content by Zenodo staff.
Top-level terms that were missing in the metadata may contain a null value.
A smaller uncompressed random sample of 200 JSON lines is also included for each dataset to test and get familiar with the format without having to download the entire dataset.
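For example, a records file (or the included sample) can be streamed line by line without unpacking it first; a minimal Python sketch, with the export date in the file name as a placeholder:

```python
import gzip
import json

# Hypothetical file name following the pattern above; substitute the actual export date.
path = "zenodo_open_metadata_2024-01-01.jsonl.gz"

spam_count = 0
with gzip.open(path, "rt", encoding="utf-8") as fh:
    for line in fh:
        record = json.loads(line)   # one Zenodo record per line
        if record.get("spam"):      # boolean spam flag set by Zenodo staff
            spam_count += 1

print(f"records marked as spam: {spam_count}")
```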
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
Data file for the second release of the Data Citation Corpus, produced by DataCite and Make Data Count as part of an ongoing grant project funded by the Wellcome Trust. Read more about the project.
The data file includes 5,256,114 data citation records in JSON and CSV formats. The JSON file is the version of record.
For convenience, the data is provided in batches of approximately 1 million records each. The publication date and batch number are included in the file name, e.g., 2024-08-23-data-citation-corpus-01-v2.0.json.
The data citations in the file originate from DataCite Event Data and from a project by the Chan Zuckerberg Initiative (CZI) to identify mentions of datasets in the full text of articles.
Each data citation record consists of:
A pair of identifiers: An identifier for the dataset (a DOI or an accession number) and the DOI of the publication (journal article or preprint) in which the dataset is cited
Metadata for the cited dataset and for the citing publication
The data file includes the following fields:
| Field | Description | Required? |
| id | Internal identifier for the citation | Yes |
| created | Date of item's incorporation into the corpus | Yes |
| updated | Date of item's most recent update in corpus | Yes |
| repository | Repository where cited data is stored | No |
| publisher | Publisher for the article citing the data | No |
| journal | Journal for the article citing the data | No |
| title | Title of cited data | No |
| publication | DOI of article where data is cited | Yes |
| dataset | DOI or accession number of cited data | Yes |
| publishedDate | Date when citing article was published | No |
| source | Source where citation was harvested | Yes |
| subjects | Subject information for cited data | No |
| affiliations | Affiliation information for creator of cited data | No |
| funders | Funding information for cited data | No |
Additional documentation about the citations and metadata in the file is available on the Make Data Count website.
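As an illustration, a batch file can be loaded and summarized as follows. This sketch assumes each batch is a top-level JSON array of citation objects carrying the fields listed above (the file name is the example given earlier):

```python
import json
from collections import Counter

# One of the ~1M-record batch files; assumed here to be a JSON array of citation
# objects with the fields documented in the table above.
with open("2024-08-23-data-citation-corpus-01-v2.0.json", encoding="utf-8") as fh:
    citations = json.load(fh)

# Count citations per harvesting source (required field "source").
by_source = Counter(c["source"] for c in citations)

# Collect dataset/publication identifier pairs (both required fields).
pairs = [(c["dataset"], c["publication"]) for c in citations]

print(by_source.most_common(5))
print(pairs[:3])
```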
The second release of the Data Citation Corpus data file reflects several changes made to add new citations, remove some records deemed out of scope for the corpus, update and enhance citation metadata, and improve the overall usability of the file. These changes are as follows:
Add and update Event Data citations:
Add 179,885 new data citations created in DataCite Event Data from 01 June 2023 through 30 June 2024
Remove citation records deemed out of scope for the corpus:
273,567 records from DataCite Event Data with non-citation relationship types
28,334 citations to items in non-data repositories (clinical trials registries, stem cells, samples, and other non-data materials)
44,117 invalid citations where the subj_id value was the same as the obj_id value, or where subj_id and obj_id were inverted, indicating a citation from a dataset to a publication
473,792 citations to invalid accession numbers from CZI data present in v1.1 as a result of false positives in the algorithm used to identify mentions
4,110,019 duplicate records from CZI data present in v1.1 where metadata is the same for obj_id, subj_id, repository_id, publisher_id, journal_id, accession_number, and source_id (the record with the most recent updated date was retained in all of these cases)
Metadata enhancements:
Apply Field of Science subject terms to citation records originating from CZI, based on disciplinary area of data repository
Initial cleanup of affiliation and funder organization names to remove personal email addresses and social media handles (additional cleanup and standardization in progress and will be included in future releases)
Data structure updates to improve usability and eliminate redundancies:
Rename subj_id and obj_id fields to “dataset” and “publication” for clarity
Remove accessionNumber and doi elements to eliminate redundancy with subj_id
Remove relationTypeId fields as these are specific to Event Data only
Full details of the above changes, including the scripts used to perform them, are available on GitHub.
While v2 addresses a number of cleanup and enhancement tasks, additional data issues may remain, and additional enhancements are being explored. These will be addressed in the course of subsequent data file releases.
Feedback on the data file can be submitted via GitHub. For general questions, email info@makedatacount.org.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Description:
This repository contains all source codes, necessary input and output data, and raw figures and tables for reproducing most figures and results published in the following study:
Hefei Zhang#, Xuhang Li#, Dongyuan Song, Onur Yukselen, Shivani Nanda, Alper Kucukural, Jingyi Jessica Li, Manuel Garber, Albertha J.M. Walhout. Worm Perturb-Seq: massively parallel whole-animal RNAi and RNA-seq. (2025) Nature Communications, in press (#: equal contribution, *: corresponding author)
These include results related to method benchmarking and NHR data processing. Source data for figures that are not reproducible here have been provided with the publication.
Files:
This repository contains a few directories related to this publication. To deposit into Zenodo, we have individually zipped each subfolder of the root directory.
There are three directories included:
MetabolicLibrary
method_simulation
NHRLibrary
Note: the parameter optimization output is deposited in a separate Zenodo repository (10.5281/zenodo.15236858) for better organization and ease of use. If you would like to reproduce results related to the "MetabolicLibrary" folder, please download and integrate the omitted subfolder "MetabolicLibrary/2_DE/output/" from this separate repository.
Please be advised that this repository contains raw code and data that are not directly related to a figure in our paper. However, they may be useful for generating input used in the analysis of a figure, or for reproducing tables in our manuscript. The repository may also contain unpublished analyses and figures, which we kept for the record rather than deleting.
Usage:
Please refer to the table below to locate a specific file for reproducing a figure of interest (also available in METHOD_FIGURE_LOOKUP.xlsx under the root directory).
| Figure | File | Lines | Notes |
| Fig. 2c | MetabolicLibrary/1_QC_dataCleaning/2_badSampleExclusion_manual.R | 65-235 | output figure is selected from figures/met10_lib6_badSamplePCA.pdf |
| Fig. 2d | NHRLibrary/example_bams/* | - | load the bam files in IGV to make the figure |
| Fig. 3a | MetabolicLibrary/2_DE/2_5_vectorlike_re_analysis_with_curated_44_NTP_conditions.R | 348-463 | |
| Fig. 3b,c | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 106-376 | |
| Fig. 3d | MetabolicLibrary/2_DE/SUPP_extra_figures_for_rewiring.R | 10-139 | |
| Fig. 3e | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 379-522 | |
| Fig. 3f,g | MetabolicLibrary/2_DE/SUPP_extra_figures_for_rewiring.R | 1-8 | |
| Fig. 3h | method_simulation/Supp_systematic_mean_variation_example.R | 1-138 | |
| Fig. 3i | method_simulation/3_benchmark_DE_result_w_rep.R and 1_benchmark_DE_result_std_NB_w_rep.R | 1-518 | the example figure was from figures/GLM_NB_deltaMiu_k08_10rep/seed_12345_empircial_null_fit_simulated_data_NB_GLM.pdf and figures/GLM_NB_10rep/seed_1_empircial_null_fit_simulated_data_NB_GLM.pdf; |
| Fig. 3j | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 2104-2106 | load dependencies starting from line 1837 |
| Fig. 3k | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 2053-2078 | load dependencies starting from line 1837; the GSEA was performed using SUPP_supplementary_figures_for_method_noiseness_GSEA.R |
| Fig. 4a,b | method_simulation/3_benchmark_DE_result_w_rep.R | 1-523 | |
| Fig. 4c | method_simulation/3_benchmark_WPS_parameters.R | 1-237 | |
| Fig. 4d | MetabolicLibrary/2_DE/2_5_vectorlike_re_analysis_with_curated_44_NTP_conditions.R | 1-346 | output figure is selected from figures/0_DE_QA/vectorlike_analysis/2d_cutoff_titration_71NTP_rawDE_log2FoldChange_log2FoldChange_raw.pdf and 2d_cutoff_titration_71NTP_p0.005_log2FoldChange_raw.pdf. The "p0.005" in the second file name indicates the p_outlier cutoff used in the final parameter set for EmpirDE. |
| Fig. 4e | MetabolicLibrary/2_DE/SUPP_plot_N_DE_repeated_RNAi.R | entire file | |
| Fig. 4f | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 1020-1407 | |
| Fig. 4g,h | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 529-851 | |
| Fig. 5d | NHRLibrary/FinalAnalysis/2_DE_new/3_DE_network_analysis.R | 51-69; 94-112 | load dependencies starting from line 1 |
| Fig. 5e | NHRLibrary/FinalAnalysis/2_DE_new/5_GSA_bubble_plot.R | 1-306 | |
| Fig. 5f | NHRLibrary/FinalAnalysis/2_DE_new/5_GSA.R | 1-1492 | |
| Fig. 6a | NHRLibrary/FinalAnalysis/2_DE_new/4_DE_similarity_analysis.R | 1-175 | |
| Fig. 6b | NHRLibrary/FinalAnalysis/6_case_study.R | 506-534 | load dependencies starting from line 1 |
| Fig. 6c | NHRLibrary/FinalAnalysis/6_case_study.R | 668-888 | load dependencies starting from line 1 |
| Supplementary Fig. 1e | NHRLibrary/FinalAnalysis/5_revision/REVISION_gene_detection_sensitivity_benchmark.R | 1-143 | |
| Supplementary Fig. 1f | MetabolicLibrary/1_QC_dataCleaning/2_badSampleExclusion_manual.R | 65-235 | output figure is selected from figures/met10_lib6_badSampleCorr.pdf |
| Supplementary Fig. 1g | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 2191-2342 | |
| Supplementary Fig. 2a | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 1409-1822 | |
| Supplementary Fig. 2b | method_simulation/Supp_systematic_mean_variation_example.R | 1-138 | |
| Supplementary Fig. 2c | method_simulation/Supp_systematic_mean_variation_example.R; 2_fit_logFC_distribution.R | 141-231; 1-201 | the middle panel was generated from Supp_systematic_mean_variation_example.R (lines 141-231) and right panel was from 2_fit_logFC_distribution.R (lines 1-201) |
| Supplementary Fig. 2d | method_simulation/1_benchmark_DE_result_std_NB_w_rep.R | 1-518 | the example figure was from figures/GLM_NB_10rep/seed_1_empircial_null_fit_simulated_data_NB_GLM.pdf; |
| Supplementary Fig. 2e | method_simulation/3_benchmark_DE_result_w_rep.R | 1-518 | the example figure was from figures/GLM_NB_deltaMiu_k08_10rep/seed_12345_empircial_null_fit_simulated_data_NB_GLM.pdf |
| Supplementary Fig. 2f | method_simulation/3_benchmark_DE_result_w_rep.R | 528-573 | may need to run the code from line 1 to load other variables needed |
| Supplementary Fig. 3a,b | method_simulation/1_benchmark_DE_result_std_NB_w_rep.R | 1-523 | |
| Supplementary Fig. 3c | method_simulation/3_benchmark_WPS_parameters.R | 1-237 | |
| Supplementary Fig. 3d | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 3190-3300 | |
| Supplementary Fig. 3e | 2_3_power_error_tradeoff_optimization.R | entire file | the figure was in figures/0_DE_QA/cleaning_strength_titration/benchmark_vectorlikes_titrate_cleaning_cutoff_DE_log2FoldChange_FDR0.2_FC1.pdf (produced in line 398); this script produced the titration plots for a series of thresholds, where we picked FDR0.2_FC1 for presentation in the paper |
| Supplementary Fig. 3f | 2_3_power_error_tradeoff_optimization.R | entire file | the figure was in figures/0_DE_QA/cleaning_strength_titration/benchmark_independent_repeats_titrate_cleaning_cutoff_FP_log2FoldChange_raw.pdf (produced in line 195). The top line plot was from figures/0_DE_QA/cleaning_strength_titration/benchmark_independent_repeats_titrate_cleaning_cutoff_FP_log2FoldChange_raw_summary_stat.pdf (produced in line 218). |
| Supplementary Fig. 4 | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 853-898 | please run from line 529 to load dependencies |
| Supplementary Fig. 5a,b | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 2590-3185 | |
| Supplementary Fig. |
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This replication package accompanies the dataset and the exploratory empirical analysis reported in the paper "A dataset of GitHub Actions workflow histories", published at the IEEE MSR 2024 conference. (The Jupyter notebook can be found in a previous version of this dataset.)
Important notice: Zenodo appears to compress gzipped files a second time without notice, so they are "double compressed". When downloaded, the files are therefore named x.gz.gz instead of x.gz. Note that the provided MD5 refers to the original file.
2024-10-25 update: updated the repositories list and observation period. The filters relying on dates were also updated.
2024-07-09 update: fixed an occasionally invalid valid_yaml flag.
The dataset was created as follows:
First, we used GitHub SEART (on October 7th, 2024) to get a list of every non-fork repository created before January 1st, 2024, having at least 300 commits and at least 100 stars, and in which at least one commit was made after January 1st, 2024. (The goal of these filters is to exclude experimental and personal repositories.)
We checked whether a .github/workflows folder existed. We filtered out repositories that did not contain this folder and pulled the others (between the 9th and 10th of October 2024).
We applied the tool gigawork (version 1.4.2) to extract every file from this folder. The exact command used is python batch.py -d /ourDataFolder/repositories -e /ourDataFolder/errors -o /ourDataFolder/output -r /ourDataFolder/repositories_everything.csv.gz -- -w /ourDataFolder/workflows_auxiliaries. (The script batch.py can be found on GitHub.)
We concatenated every file in /ourDataFolder/output into a CSV (using cat headers.csv output/*.csv > workflows_auxiliaries.csv in /ourDataFolder) and compressed it.
We added the column uid via a script available on GitHub.
Finally, we archived the folder /ourDataFolder/workflows with pigz (tar -c --use-compress-program=pigz -f workflows_auxiliaries.tar.gz /ourDataFolder/workflows).
Using the extracted data, the following files were created:
workflows.tar.gz contains the dataset of GitHub Actions workflow file histories.
workflows_auxiliaries.tar.gz is a similar file that also contains auxiliary files.
workflows.csv.gz contains the metadata for the extracted workflow files.
workflows_auxiliaries.csv.gz is a similar file that also contains metadata for auxiliary files.
repositories.csv.gz contains metadata about the GitHub repositories containing the workflow files. These metadata were extracted using the SEART Search tool.
The metadata is separated into different columns:
repository: The repository (author and repository name) from which the workflow was extracted. The separator "/" distinguishes the author from the repository name
commit_hash: The commit hash returned by git
author_name: The name of the author that changed this file
author_email: The email of the author that changed this file
committer_name: The name of the committer
committer_email: The email of the committer
committed_date: The committed date of the commit
authored_date: The authored date of the commit
file_path: The path to this file in the repository
previous_file_path: The path to this file before it has been touched
file_hash: The name of the related workflow file in the dataset
previous_file_hash: The name of the related workflow file in the dataset, before it has been touched
git_change_type: A single letter (A, D, M or R) representing the type of change made to the workflow (Added, Deleted, Modified or Renamed). This letter is given by gitpython and provided as is.
valid_yaml: A boolean indicating if the file is a valid YAML file.
probably_workflow: A boolean indicating whether the file contains the YAML keys on and jobs. (Note that it can still be an invalid YAML file.)
valid_workflow: A boolean indicating whether the file respects the syntax of GitHub Actions workflows. A freely available JSON Schema (used by gigawork) was used for this purpose.
uid: Unique identifier for a given file surviving modifications and renames. It is generated when the file is added and stays the same until the file is deleted. Renames do not change the identifier.
Both workflows.csv.gz and workflows_auxiliaries.csv.gz follow this format.
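A minimal sketch for loading the metadata with pandas, taking the double-compression notice above into account (the decompression loop is a defensive assumption, not part of the official tooling; the column names are those documented above):

```python
import gzip
import io

import pandas as pd

def read_possibly_double_gzipped_csv(path: str) -> pd.DataFrame:
    # workflows.csv.gz may arrive double-compressed from Zenodo (x.gz.gz); peel gzip
    # layers until the payload no longer starts with the gzip magic bytes (0x1f 0x8b).
    with open(path, "rb") as fh:
        payload = fh.read()
    while payload[:2] == b"\x1f\x8b":
        payload = gzip.decompress(payload)
    return pd.read_csv(io.BytesIO(payload))

meta = read_possibly_double_gzipped_csv("workflows.csv.gz")

# Example use of the documented columns: keep valid workflow files that were added or modified.
is_valid = meta["valid_workflow"].astype(str).str.lower().eq("true")
changed = meta["git_change_type"].isin(["A", "M"])
print(meta[is_valid & changed][["repository", "file_path", "committed_date"]].head())
```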
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
rxivist.org allowed readers to sort and filter the tens of thousands of preprints posted to bioRxiv and medRxiv. Rxivist used a custom web crawler to index all papers posted to those two websites; this is a snapshot of the Rxivist production database. The version number indicates the date on which the snapshot was taken. See the included "README.md" file for instructions on how to use the "rxivist.backup" file to import the data into a PostgreSQL database server.
Please note this is a different repository than the one used for the Rxivist manuscript—that is in a separate Zenodo repository. You're welcome (and encouraged!) to use this data in your research, but please cite our paper, now published in eLife.
Previous versions are also available pre-loaded into Docker images, available at blekhmanlab/rxivist_data.
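Once "rxivist.backup" has been restored into PostgreSQL (see the included README), the snapshot can be queried like any other database. A minimal sketch with psycopg2; the connection parameters are placeholders, and the "articles" table with its "repo" column is taken from the version notes below:

```python
import psycopg2

# Connection parameters are placeholders; adjust them to wherever you restored rxivist.backup.
conn = psycopg2.connect(
    host="localhost", dbname="rxivist", user="postgres", password="postgres"
)

with conn, conn.cursor() as cur:
    # Count indexed preprints per source site using the "repo" column described
    # in the 2020-12-07 version notes; the query is illustrative only.
    cur.execute("SELECT repo, COUNT(*) FROM articles GROUP BY repo;")
    for repo, n in cur.fetchall():
        print(repo, n)

conn.close()
```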
Version notes:
2023-03-01
The final Rxivist data upload, more than four years after the first and encompassing 223,541 preprints posted to bioRxiv and medRxiv through the end of February 2023.
2020-12-07
In addition to bioRxiv preprints, the database now includes all medRxiv preprints as well.
The website where a preprint was posted is now recorded in a new field in the "articles" table, called "repo".
We've significantly refactored the web crawler to take advantage of developments with the bioRxiv API.
The main difference is that preprints flagged as "published" by bioRxiv are no longer recorded on the same schedule that download metrics are updated: The Rxivist database should now record published DOI entries the same day bioRxiv detects them.
Twitter metrics have returned, for the most part. Improvements with the Crossref Event Data API mean we can once again tally daily Twitter counts for all bioRxiv DOIs.
The "crossref_daily" table remains where these are recorded, and daily numbers are now up to date.
Historical daily counts have also been re-crawled to fill in the empty space that started in October 2019.
There are still several gaps that are more than a week long due to missing data from Crossref.
We have recorded available Crossref Twitter data for all papers with DOI numbers starting with "10.1101," which includes all medRxiv preprints. However, there appears to be almost no Twitter data available for medRxiv preprints.
The download metrics for article id 72514 (DOI 10.1101/2020.01.30.927871) were found to be out of date for February 2020 and are now correct. This is notable because article 72514 is the most downloaded preprint of all time; we're still looking into why this wasn't updated after the month ended.
2020-11-18
Publication checks should be back on schedule.
2020-10-26
This snapshot fixes most of the data issues found in the previous version. Indexed papers are now up to date, and download metrics are back on schedule. The check for publication status remains behind schedule, however, and the database may not include published DOIs for papers that have been flagged on bioRxiv as "published" over the last two months. Another snapshot will be posted in the next few weeks with updated publication information.
2020-09-15
A crawler error caused this snapshot to exclude all papers posted after about August 29, with some papers having download metrics that were more out of date than usual. The "last_crawled" field is accurate.
2020-09-08
This snapshot is misconfigured and will not work without modification; it has been replaced with version 2020-09-15.
2019-12-27
Several dozen papers did not have dates associated with them; that has been fixed.
Some authors have had two entries in the "authors" table for portions of 2019, one profile that was linked to their ORCID and one that was not, occasionally with almost identical "name" strings. This happened after bioRxiv began changing author names to reflect the names in the PDFs, rather than the ones manually entered into their system. These database records are mostly consolidated now, but some may remain.
2019-11-29
The Crossref Event Data API remains down; Twitter data is unavailable for dates after early October.
2019-10-31
The Crossref Event Data API is still experiencing problems; the Twitter data for October is incomplete in this snapshot.
The README file has been modified to reflect changes in the process for creating your own DB snapshots if using the newly released PostgreSQL 12.
2019-10-01
The Crossref API is back online, and the "crossref_daily" table should now include up-to-date tweet information for July through September.
About 40,000 authors were removed from the author table because the name had been removed from all preprints they had previously been associated with, likely because their name changed slightly on the bioRxiv website ("John Smith" to "J Smith" or "John M Smith"). The "author_emails" table was also modified to remove entries referring to the deleted authors. The web crawler is being updated to clean these orphaned entries more frequently.
2019-08-30
The Crossref Event Data API, which provides the data used to populate the table of tweet counts, has not been fully functional since early July. While we are optimistic that accurate tweet counts will be available at some point, the sparse values currently in the "crossref_daily" table for July and August should not be considered reliable.
2019-07-01
A new "institution" field has been added to the "article_authors" table that stores each author's institutional affiliation as listed on that paper. The "authors" table still has each author's most recently observed institution.
We began collecting this data in the middle of May, but it has not been applied to older papers yet.
2019-05-11
The README was updated to correct a link to the Docker repository used for the pre-built images.
2019-03-21
The license for this dataset has been changed to CC-BY, which allows use for any purpose and requires only attribution.
A new table, "publication_dates," has been added and will be continually updated. This table will include an entry for each preprint that has been published externally for which we can determine a date of publication, based on data from Crossref. (This table was previously included in the "paper" schema but was not updated after early December 2018.)
Foreign key constraints have been added to almost every table in the database. This should not impact any read behavior, but anyone writing to these tables will encounter constraints on existing fields that refer to other tables. Most frequently, this means the "article" field in a table will need to refer to an ID that actually exists in the "articles" table.
The "author_translations" table has been removed. This was used to redirect incoming requests for outdated author profile pages and was likely not of any functional use to others.
The "README.md" file has been renamed "1README.md" because Zenodo only displays a preview for the file that appears first in the list alphabetically.
The "article_ranks" and "article_ranks_working" tables have been removed as well; they were unused.
2019-02-13.1
After consultation with bioRxiv, the "fulltext" table will not be included in further snapshots until (and if) concerns about licensing and copyright can be resolved.
The "docker-compose.yml" file was added, with corresponding instructions in the README to streamline deployment of a local copy of this database.
2019-02-13
The redundant "paper" schema has been removed.
BioRxiv has begun making the full text of preprints available online. Beginning with this version, a new table ("fulltext") is available that contains the text of preprints that have been processed already. The format in which this information is stored may change in the future; any digression will be noted here.
This is the first version that has a corresponding Docker image.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Reddit Comments Dataset for Text Style Transfer Tasks
A dataset of Reddit comments prepared for Text Style Transfer Tasks.
The dataset contains Reddit comments translated into a formal language using text-davinci-003. To make text-davinci-003 translate the comments into a more formal version, the following prompt was used:
"Here is some text: {original_comment} Here is a rewrite of the text, which is more neutral: {"
This prompting technique was taken from A Recipe For Arbitrary Text Style Transfer with Large Language Models.
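For reference, the template above can be filled in per comment before being sent to the completion model; this sketch shows only the string construction (the API call and stop-sequence handling are not reproduced here):

```python
def build_formal_prompt(original_comment: str) -> str:
    # Prompt template quoted above; the model's completion, up to the closing brace,
    # is taken as the more neutral rewrite.
    return (
        f"Here is some text: {original_comment} "
        "Here is a rewrite of the text, which is more neutral: {"
    )

print(build_formal_prompt("ngl this take is awful"))
```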
The dataset contains comments from the following Subreddits: antiwork, atheism, Conservative, conspiracy, dankmemes, gaybros, leagueoflegends, lgbt, libertarian, linguistics, MensRights, news, offbeat, PoliticalCompassMemes, politics, teenagers, TrueReddit, TwoXChromosomes, wallstreetbets, worldnews.
The quality of formal translations was assessed with BERTScore and chrF++:
The average perplexity of the generated formal texts was calculated using GPT-2 and is 123.77
The dataset consists of 3 components.
reddit_commments.csv
This file contains a collection of randomly selected comments from 20 Subreddits. For each comment, the following information was collected:
- subreddit (name of the subreddit in which the comment was posted)
- id (ID of the comment)
- submission_id (ID of the submission to which the comment was posted)
- body (the comment itself)
- created_utc (timestamp in seconds)
- parent_id (The ID of the comment or submission to which the comment is a reply)
- permalink (the URL to the original comment)
- token_size (how many tokens the comment is split into by the standard GPT-2 tokenizer)
- perplexity (the perplexity GPT-2 assigns to the comment)
The comments were filtered. This file contains only comments that:
- have been split by GPT-2 Tokenizer into more than 10 tokens but less than 512 tokens.
- are not [removed] or [deleted]
- do not contain URLs
This file was used as a source for the other two file types.
Labeled Files (training_labeled.csv and eval_labeled.csv)
These files contain the formal translations of the Reddit comments.
The 150 comments with the highest GPT-2 perplexity from each Subreddit were translated into a formal version. This filter was used to translate as many comments as possible that show large stylistic salience.
They are structured as follows:
- Subreddit (name of the subreddit where the comment was posted).
- Original Comment
- Formal Comment
Labeled Files with Style Examples (training_labeled_with_style_samples.json and eval_labeled_with_style_samples.json)
These files contain an original Reddit comment, three sample comments from the same subreddit, and the formal translation of the original Reddit comment.
These files can be used to train models to perform style transfers based on given examples.
The task is to transform the formal translation of the Reddit comment, using the three given examples, into the style of the examples.
An entry in this file is structured as follows:
"data":[
{
"input_sentence":"The original Reddit comment",
"style_samples":[
"sample1",
"sample2",
"sample3"
],
"results_sentence":"The formal translated input_sentence",
"subreddit":"The subreddit from which the comments originated"
},
"..."
]
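A minimal sketch for loading one of these files, assuming the top-level object carries a "data" list exactly as in the sample above:

```python
import json

# File name taken from the list above; eval_labeled_with_style_samples.json has the same structure.
with open("training_labeled_with_style_samples.json", encoding="utf-8") as fh:
    dataset = json.load(fh)

for entry in dataset["data"][:3]:
    original = entry["input_sentence"]   # the original Reddit comment
    samples = entry["style_samples"]     # three comments from the same subreddit
    formal = entry["results_sentence"]   # the formal translation of the original
    print(entry["subreddit"], len(samples), formal[:60])
```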
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Given the critical need for the unification and coordinated use of floristic checklists within the TURBOVEG software environment (Hennekens & Schaminée 2001), we propose a new species list, Ukraine_SL, for Ukrainian flora.
The taxonomic basis for Ukraine_SL (for vascular plants) is the UkrTrait taxonomy (Vynokurov et al. 2024), which is based on the Checklist of vascular plants of Ukraine (Mosyakin & Fedoronchuk 1999) and supplemented with taxa newly recorded or described in Ukraine in the years following its publication. Additionally, corrections have been made to the spelling of some taxon names (see details in Vynokurov 2024). Bryobionts follow the Second checklist of bryobionts in Ukraine (Boiko 2014).
For the vast majority of vascular plants, corresponding names from the Euro+Med database are provided, enabling efficient conversion of phytosociological relevés between different taxonomic systems and facilitating integration with the European Vegetation Archive (EVA) (Chytrý et al. 2016).
Moreover, most vascular plants are linked to the Ukrainian Plant Trait Database (UkrTrait v. 1.0) (Vynokurov et al. 2024), allowing rapid extraction of available traits for vegetation studies (e.g. plant height, life forms, flowering period, etc.).
Ukraine_SL will be regularly updated and published on the Zenodo platform. In addition to the species list for TURBOVEG itself (Ukraine_SL.zip), an Excel file with a taxonomic crosswalk (ukraine_sl_taxonomy.xlsx) is also provided. It includes matches between the UkrTrait taxonomy, the original taxon concepts from Mosyakin & Fedoronchuk (1999), and names from the Euro+Med database (europlusmed.org).
An expert system file (expert_ukraine_sl_euromed.txt) is also available for download, enabling translation of vegetation plots to the Euro+Med floristic list within the JUICE software (Tichý 2002).
To install the species list in TURBOVEG (Ukraine_SL.zip), download and unzip the archive into the Turbowin/Species/ directory of your TURBOVEG installation. After unzipping, a folder named Ukraine_SL should appear, containing the file SPECIES.DBF. The list will then be available for use in TURBOVEG.
When working with this list, it is critically important to use only the species already included and not to add new taxa manually, as this would prevent synchronization with future updates and may cause errors during database merging.
If taxa not present in the list are needed, users should contact the authors. The list will then be updated, and a new version made available for download. The full update history, including a list of changes, will be accessible on the Zenodo website. Any newly added taxa will be assigned unique, non-overlapping IDs.
To update the list in TURBOVEG, download the latest version from Zenodo and replace the old version in the Turbowin/Species/ directory by deleting it and unzipping the new archive (Ukraine_SL.zip).
To use the expert system file (expert_ukraine_sl_euromed.txt) in JUICE:
Go to Analysis → Expert System Classificator.
Upload the .txt file.
In the window that appears, click "Modify Species Names", followed by "Merge Same Spec. Names".
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
[Content warning: Files may contain instances of highly inflammatory and offensive content.]
This dataset was generated as an extension of our CSCW 2018 paper:
Eshwar Chandrasekharan, Mattia Samory, Shagun Jhaver, Hunter Charvat, Amy Bruckman, Cliff Lampe, Jacob Eisenstein, and Eric Gilbert. 2018. The Internet’s Hidden Rules: An Empirical Study of Reddit Norm Violations at Micro, Meso, and Macro Scales. Proceedings of the ACM on Human-Computer Interaction 2, CSCW (2018), 32.
Description:
Working with over 2M removed comments collected from 100 different communities on Reddit (subreddit names listed in data/study-subreddits.csv), we identified 8 macro norms, i.e., norms that are widely enforced on most parts of Reddit. We extracted these macro norms by employing a hybrid approach—classification, topic modeling, and open-coding—on comments identified to be norm violations within at least 85 out of the 100 study subreddits. Finally, we labelled over 40K Reddit comments removed by moderators according to the specific type of macro norm being violated, and make this dataset publicly available (also available on Github).
For each of the labeled topics, we identified the top 5000 removed comments that were best fit by the LDA topic model. In this way, we identified over 5000 removed comments that are examples of each type of macro norm violation described in the paper. The removed comments were sorted by their topic fit, stored into respective files based on the type of norm violation they represent, and are made available on this repo.
Here we make the following datasets publicly available:
* 1 file containing the log of over 2M removed comments obtained from the top 100 subreddits between May 2016 and March 2017, after filtering out the following comments: 1) comments by u/AutoModerator, 2) replies to removed comments (i.e., children of the poisoned tree; refer to the paper for more information), and 3) non-readable comments (not utf-8 encoded).
* 8 files, each containing 5000+ removed comments obtained from Reddit, stored in data/macro-norm-violations/ and split into different files based on the macro norm they violated. Each new line in these files represents a comment that was posted on Reddit between May 2016 and March 2017 and subsequently removed by subreddit moderators for violating community norms. All comments were preprocessed using the script in code/preprocessing-reddit-comments.py, in order to: 1. remove new lines, 2. convert text to lowercase, and 3. strip numbers and punctuation from comments. A minimal re-implementation of these preprocessing steps is sketched below.
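A rough re-implementation of those preprocessing steps in Python (the canonical version is the script code/preprocessing-reddit-comments.py shipped with the dataset; this sketch only illustrates the three operations):

```python
import string

def preprocess_comment(text: str) -> str:
    # 1. remove new lines
    text = text.replace("\r", " ").replace("\n", " ")
    # 2. convert text to lowercase
    text = text.lower()
    # 3. strip numbers and punctuation
    return text.translate(str.maketrans("", "", string.digits + string.punctuation))

print(preprocess_comment("This is RUDE!!! 100% ban-worthy.\nSecond line."))
```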
Description of 1 file containing over 2M removed comments from 100 subreddits.
Descriptions of each file containing 5059 comments (that were removed from Reddit, and preprocessed) violating macro norms present in data/macro-norm-violations/:
More details about the dataset can be found on arXiv: https://arxiv.org/abs/1904.03596
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
New version on https://zenodo.org/records/13341214.
When you use this dataset, please cite the paper below. More information about this dataset can also be found in that paper.
Xu, X., Wang, B., Xiao, B., Niu, Y., Wang, Y., Wu, X., & Chen, J. (2024). Beware of Overestimated Decoding Performance Arising from Temporal Autocorrelations in Electroencephalogram Signals. arXiv preprint arXiv:2405.17024.
The present work aims to demonstrate that temporal autocorrelations (TA) significantly impact various BCI tasks even in conditions without neural activity. We used watermelons as phantom heads and found that the pitfall of overestimated decoding performance arises if continuous EEG data with the same class label are split into training and test sets. More details can be found in Motivation.
As watermelons cannot perform any experimental tasks, we can reorganize the recordings into the format of various actual EEG datasets without the need to collect EEG data as previous work did (examples in Domain Studied).
Manufacturers: NeuroScan SynAmps2 system (Compumedics Limited, Victoria, Australia)
Configuration: 64-channel Ag/AgCl electrode cap with a 10/20 layout
Watermelons. Ten watermelons served as phantom heads.
Overestimated Decoding Performance in EEG decoding.
The following BCI datasets from various BCI tasks have been reorganized using the Phantom EEG Dataset. The pitfall was found in four of the five tasks.
- CVPR dataset [1] for image decoding task.
- DEAP dataset [2] for emotion recognition task.
- KUL dataset [3] for auditory spatial attention decoding task.
- BCIIV2a dataset [4] for motor imagery task (the pitfall was absent due to the use of a rapid-design paradigm during EEG recording).
- SIENA dataset [5] for epilepsy detection task.
Resting state, but it can be reorganized into any BCI task.
The Phantom EEG Dataset
Creative Commons Attribution 4.0 International
The code to read the data files (.cnt) is provided in "Other". We could not add the file in this version because Zenodo demands that "you must create a new version to add, modify or delete files". We will add the file in version v2, after organizing the datasets to comply with the FAIR principles.
The data will be published in the following formats in version v2:
- CNT: the raw data.
- BIDS: an extension to the brain imaging data structure for electroencephalography. BIDS primarily addresses the heterogeneity of data organization by following the FAIR principles [6].
An additional electrode was placed on the lower part of the watermelon as the physiological reference, and the forehead served as the ground site. The inter-electrode impedances were maintained under 20 kOhm. Data were recorded at a sampling rate of 1000 Hz. EEG recordings for each watermelon lasted for more than 1 hour to ensure sufficient data for the decoding task.
Each Subject (S*.cnt) contains the following information:
EEG.data: EEG data (samples X channels)
EEG.srate: Sampling frequency of the saved data
EEG.chanlocs: channel numbers (1 to 68; 'EKG', 'EMG', 'VEO', and 'HEO' were not recorded)
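One way to inspect a recording is MNE-Python's Neuroscan CNT reader; this is an alternative to the reader provided under "Other", and the file name below is a placeholder:

```python
import mne

# Load one phantom recording (placeholder file name); MNE returns data as
# (channels, samples), i.e. the transpose of the EEG.data layout described above.
raw = mne.io.read_raw_cnt("S1.cnt", preload=True)

print(raw.info["sfreq"])   # expected sampling rate: 1000 Hz
print(len(raw.ch_names))   # recorded channels ('EKG', 'EMG', 'VEO', 'HEO' were not recorded)
data = raw.get_data()      # numpy array of shape (n_channels, n_samples)
print(data.shape)
```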
Citation will be updated after the review period is completed.
We will provide more information about this dataset (e.g. the units of the captured data) once our work is accepted. This is because our work is currently under review, and we are not allowed to disclose more information according to the relevant requirements.
All metadata will be provided as a backup on Github and will be available after the review period is completed.
Researchers have reported high decoding accuracy (>95%) using non-invasive Electroencephalogram (EEG) signals for brain-computer interface (BCI) decoding tasks like image decoding, emotion recognition, auditory spatial attention detection, epilepsy detection, etc. Since these EEG data were usually collected with well-designed paradigms in labs, the reliability and robustness of the corresponding decoding methods were doubted by some researchers, and they proposed that such decoding accuracy was overestimated due to the inherent temporal autocorrelations (TA) of EEG signals [7]–[9].
However, the coupling between the stimulus-driven neural responses and the EEG temporal autocorrelations makes it difficult to confirm whether this overestimation exists in truth. Some researchers also argue that the effect of TA in EEG data on decoding is negligible and that it becomes a significant problem only under specific experimental designs in which subjects do not have enough resting time [10], [11].
Due to a lack of problem formulation, previous studies [7]–[9] only proposed that block-design should not be used in order to avoid the pitfall. However, the impact of TA could be avoided only when the trial of EEG was not further segmented into several samples; otherwise, the overfitting or pitfall would still occur. In contrast, when the correct data splitting strategy was used (e.g. separating training and test data in time), the pitfall could also be avoided even when block-design was used.
In our framework, we proposed the concept of "domain" to represent the EEG patterns resulting from TA and then used phantom EEG to remove stimulus-driven neural responses for verification. The results confirmed that the TA, always existing in the EEG data, added unique domain features to a continuous segment of EEG. The specific finding is that when the segment of EEG data with the same class label is split into multiple samples, the classifier will associate the sample's class label with the domain features, interfering with the learning of class-related features. This leads to an overestimation of decoding performance for test samples from the domains seen during training, and results in poor accuracy for test samples from unseen domains (as in real-world applications).
Importantly, our work suggests that the key to reducing the impact of EEG TA on BCI decoding is to decouple class-related features from domain features in the actual EEG dataset. Our proposed unified framework serves as a reminder to BCI researchers of the impact of TA on their specific BCI tasks and is intended to guide them in selecting the appropriate experimental design, splitting strategy and model construction.
We must point out that the "phantom EEG" indeed does not contain any "EEG" but records only noise: a watermelon is not a brain and does not generate any electrical signals. Therefore, the recorded electrical noise, even when amplified using equipment typically used for EEG, does not constitute EEG data under the definition of EEG. This is why previous researchers called it "phantom EEG". Some researchers may therefore think that it is questionable to use a watermelon to obtain phantom EEG.
However, the usage of the phantom head allows researchers to evaluate the performance of neural-recording equipment and proposed algorithms without the effects of neural activity variability, artifacts, and potential ethical issues. Phantom heads used in previous studies include digital models [12]–[14], real human skulls [15]–[17], artificial physical phantoms [18]–[24] and watermelons [25]–[40]. Due to their similar conductivity to human tissue, similar size and shape to the human head, and ease of acquisition, watermelons are widely used as "phantom heads".
Most previous works used watermelons as phantom heads and found that results obtained from the neural signals of human subjects could not be reproduced with the phantom head, thus proving that the achieved results were indeed caused by neural signals. For example, Mutanen et al. [35] proposed that "the fact that the phantom head stimulation did not evoke similar biphasic artifacts excludes the possibility that residual induced artifacts, with the current TMS-compatible EEG system, could explain these components".
Our work differs significantly from most previous works. It is the first to show that phantom EEG exhibits the effect of TA on BCI decoding even when only noise was recorded, indicating the inherent existence of TA in EEG data. The conclusion we hope to draw is that some current works may not truly be using stimulus-driven neural responses when they obtain overestimated decoding performance. Similar logic may be found in a neuroscience review article [41], which proposed that EEG recordings from a phantom head (watermelon) remind us that background noise may appear as positive results without proper statistical precautions.
[1] C. Spampinato, S. Palazzo, I. Kavasidis, D. Giordano, N. Souly, and M. Shah, “Deep Learning Human Mind for Automated Visual Classification,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul. 2017, pp. 4503–4511.
[2] S. Koelstra et al., “DEAP: A Database for Emotion Analysis ;Using Physiological Signals,” IEEE Transactions on Affective Computing, vol. 3, no. 1, pp. 18–31, 2012.
[3] N. Das, T. Francart, and A. Bertrand, “Auditory Attention Detection Dataset KULeuven.” Zenodo, Aug. 27, 2020.
[4] M. Tangermann et al., “Review of the BCI Competition IV,” Front.
License: GNU GPL 2.0, https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html
Replication pack, FSE2018 submission #164:
------------------------------------------
**Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: A Case Study of the PyPI Ecosystem

**Note:** link to data artifacts is already included in the paper. Link to the code will be included in the Camera Ready version as well.

Content description
===================

- **ghd-0.1.0.zip** - the code archive. This code produces the dataset files described below
- **settings.py** - settings template for the code archive.
- **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset. This dataset only includes stats aggregated by the ecosystem (PyPI)
- **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages themselves, which take around 2TB.
- **build_model.r, helpers.r** - R files to process the survival data (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, `common.cache/survival_data.pypi_2008_2017-12_6.csv` in **dataset_full_Jan_2018.tgz**)
- **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
- LICENSE - text of GPL v3, under which this dataset is published
- INSTALL.md - replication guide (~2 pages)
Replication guide
=================

Step 0 - prerequisites
----------------------

- Unix-compatible OS (Linux or OS X)
- Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
- R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)

Depending on detalization level (see Step 2 for more details):

- up to 2Tb of disk space (see Step 2 detalization levels)
- at least 16Gb of RAM (64 preferable)
- a few hours to a few months of processing time

Step 1 - software
----------------

- unpack **ghd-0.1.0.zip**, or clone from gitlab:

      git clone https://gitlab.com/user2589/ghd.git
      git checkout 0.1.0

  `cd` into the extracted folder. All commands below assume it as a current directory.
- copy `settings.py` into the extracted folder. Edit the file:
  * set `DATASET_PATH` to some newly created folder path
  * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS`
- install docker. For Ubuntu Linux, the command is `sudo apt-get install docker-compose`
- install libarchive and headers: `sudo apt-get install libarchive-dev`
- (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`
  Without this dependency, you might get an error on the next step, but it's safe to ignore.
- install Python libraries: `pip install --user -r requirements.txt`
- disable all APIs except GitHub (Bitbucket and Gitlab support were not yet implemented when this study was in progress): edit `scraper/init.py`, comment out everything except GitHub support in `PROVIDERS`.

Step 2 - obtaining the dataset
-----------------------------

The ultimate goal of this step is to get output of the Python function `common.utils.survival_data()` and save it into a CSV file:

    # copy and paste into a Python console
    from common import utils
    survival_data = utils.survival_data('pypi', '2008', smoothing=6)
    survival_data.to_csv('survival_data.csv')

Since full replication will take several months, here are some ways to speed up the process:

#### Option 2.a, difficulty level: easiest

Just use the precomputed data. Step 1 is not necessary under this scenario.

- extract **dataset_minimal_Jan_2018.zip**
- get `survival_data.csv`, go to the next step

#### Option 2.b, difficulty level: easy

Use precomputed longitudinal feature values to build the final table. The whole process will take 15..30 minutes.

- create a folder `
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
This dataset contains scraped and processed text from roughly 100 years of articles published in the Wiley journal Science Education (formerly General Science Quarterly). This text has been cleaned and filtered in preparation for analysis using natural language processing techniques, particularly topic modeling with latent Dirichlet allocation (LDA). We also include a Jupyter Notebook illustrating how one can use LDA to analyze this dataset and extract latent topics from it, as well as analyze the rise and fall of those topics over the history of the journal.
The articles were downloaded and scraped in December of 2019. Only non-duplicate articles with a listed author (according to the CrossRef metadata database) were included, and due to missing data and text recognition issues we excluded all articles published prior to 1922. This resulted in 5577 articles in total being included in the dataset. The text of these articles was then cleaned in the following way:
After filtering, each document was turned into a list of individual words (or tokens), which were then collected and saved (using the Python pickle format) into the file scied_words_bigrams_V5.pkl.
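As a minimal illustration of how the tokenized documents can be fed into an LDA model (the included Jupyter Notebook is the reference analysis; the use of gensim and the topic count below are assumptions for this sketch):

```python
import pickle

from gensim import corpora, models

# Load the tokenized documents: a list of token lists, as described above.
with open("scied_words_bigrams_V5.pkl", "rb") as fh:
    docs = pickle.load(fh)

# Build a bag-of-words corpus and fit a small LDA model (topic count is illustrative).
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]
lda = models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=20,
                      passes=5, random_state=0)

for topic_id, words in lda.print_topics(num_topics=5):
    print(topic_id, words)
```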
In addition to this file, we have also included the following files:
This dataset is shared under the terms of the Wiley Text and Data Mining Agreement, which allows users to share text and data mining output for non-commercial research purposes. Any questions or comments can be directed to Tor Ole Odden, t.o.odden@fys.uio.no.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This is a redistribution of the dataset 'Watervlakken - versie 2024' (Water surfaces - edition 2024), originally published by the Research Institute for Nature and Forest (INBO) and distributed by 'Informatie Vlaanderen' under a CC-BY compatible license. More specifically, this Zenodo record redistributes the GeoPackage file from the original data source, in order to support reproducible, analytical workflows on Flemish Natura 2000 habitats and regionally important biotopes.
The digital map of standing water surfaces (edition 2024) is a georeferenced digital file of standing surface waters in Flanders (northern Belgium). The file contains 93 201 polygons with an area between 1.45 m² and 2.47 km² and can be considered as the most complete and accurate representation of lentic water bodies presently available for the Flemish territory. The map is based on topographic map layers, orthophoto images, the Digital Terrain Model of Flanders version II, results of a water prediction model and, to a lesser extent, field observations. It can be used for a wide range of applications in research, policy preparation and policy implementation, management planning and evaluation that consider the distribution and characteristics of stagnant water bodies. The map is also relevant internationally, including updates for the National Wetland Inventories (Ramsar). Furthermore, its unique reference to each object will considerably facilitate related data management.
For this new edition of Watervlakken (2024), the orthophoto images of 2021, 2022 and 2023 and the digital terrain model of Flanders have been used. This edition also uses the results of an AI prediction model for water developed by VITO. Data from various Regional Landscapes, ad hoc user reports and field observations have been used to digitise additional polygons, make shape corrections or remove filled ponds from the map layer. For a number of water surfaces, new data on the Flemish type according to the European Water Framework Directive (WFD type), water depth and connectivity have been added to the attribute table.
The data source is produced, owned and administered by the Research Institute for Nature and Forest (INBO, Department of Environment of the Flemish government).
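A minimal sketch for loading the redistributed GeoPackage with geopandas (the file name is a placeholder and the default layer is assumed):

```python
import geopandas as gpd

# Read the standing-water polygons (placeholder file name for the redistributed GeoPackage).
water = gpd.read_file("Watervlakken_2024.gpkg")

print(len(water))                       # expected ~93 201 polygons
print(water.crs)                        # coordinate reference system of the layer
print(water.geometry.area.describe())   # polygon areas in the units of the CRS
```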
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
In the context of the European projects “Wind and Ports” (grant No. B87E09000000007) and “Wind, Ports and Sea” (grant No. B82F13000100005), an extensive in-situ wind monitoring network was installed in the main ports of the Northern Mediterranean Sea. An unprecedented number of wind records has been acquired and systematically analyzed. Among these, a considerable number of records presented non-stationary and non-Gaussian characteristics that are completely different from those of synoptic extra-tropical cyclones, widely known in the atmospheric science and wind engineering communities. Cross-checking with meteorological information allows one to identify which of these events can be defined as thunderstorm winds, i.e., downbursts and gust fronts.
The scientific literature of the last few decades has demonstrated that downbursts, and especially micro-bursts, are extremely dangerous for the natural and built environment. Furthermore, recent trends in climate change seem to point to drastic future scenarios in terms of intensification and increased frequency of this type of extreme event. However, the limited spatial and temporal extent of thunderstorm outflows still makes them difficult to measure in nature and, consequently, to describe with physically reliable and easily applicable models such as those available for extra-tropical cyclones. For these reasons, the collection and publication of events of this type represents a unique opportunity for the scientific community.
The dataset presented here was built in the context of the activities of the project THUNDERR “Detection, simulation, modelling and loading of thunderstorm outflows to design wind-safer and cost-efficient structures”, financed by the European Research Council (ERC), Advanced Grant 2016 (grant No. 741273, P.I. Prof. Giovanni Solari, University of Genoa). It collects 29 thunderstorm downbursts that occurred between 2010 and 2015 in the Italian ports of Genoa (GE) (4), Livorno (LI) (14), and La Spezia (SP) (11), and were recorded by means of ultrasonic anemometers (Gill WindObserver II in Genoa and La Spezia, Gill WindMaster Pro in Livorno). All thunderstorm events included in the database were verified by means of meteorological information, such as radar (the CIMA Research Foundation is gratefully acknowledged for providing most of the radar images), satellite, and lightning data. In fact, (i) high and localized clouds typical of thunderstorm cumulonimbus, (ii) precipitation, and (iii) lightning represent reliable indicators of the occurrence of a thunderstorm event.
Some events were recorded by multiple anemometers in the same port area – the total number of signals included in the database is 99. Despite the limited number of points (anemometers), this will allow the user to perform cross-correlation analysis in time and space to eventually retrieve size, position, trajectory of the storm, etc.
The ASCII tab-delimited file ‘Anemometers_location.txt’ reports specifications of the anemometers used in this monitoring study: port code (Port code – Genoa-GE, Livorno-LI, La Spezia-SP); anemometer code (Anemometer code); latitude (Lat.) and longitude (Lon.) in decimal degrees WGS84; height above the ground level (h a.g.l.) in meters; Instrument type. Bi-axial anemometers were used in the ports of Genoa and La Spezia, recording the two horizontal wind speed components (u, v). Three-axial ultrasonic anemometers were used in the port of Livorno, also providing the vertical wind speed component w (except bi-axial anemometers LI06 and LI07). All anemometers acquired velocity data at sampling frequency 10 Hz, sensitivity 0.01 m s-1 (except anemometers LI06 and LI07 with sensitivity 0.1 m s-1) and were installed at various heights ranging from 13.0 to 75.0 m, as reported in the file ‘Anemometers_location.txt’.
The ASCII tab-delimited file ‘List_DBevents.txt’ lists all downburst records included in the database, in terms of: event and record number (Event | record no.); port code (Port code); date of event occurrence (Date) in the format yyyy-mm-dd; approximate time of occurrence of the velocity peak (Time [UTC]) in the format HH:MM; anemometer code (Anemometer code).
The database is provided as a zip file (‘DB-records.zip’). The events are divided by port of occurrence (three folders: GE, LI, and SP). Within each folder, the downburst events recorded in that port appear as subfolders (name format ‘[port code]_yyyy-mm-dd’) containing the individual anemometer signals as tab-delimited text files (name format ‘[port and anemometer code]_yyyy-mm-dd.txt’). Each file contains 3 (or 4) columns and 360,000 rows. The first column is the 10-h time vector (t, ISO format) in UTC, while the remaining 2 (or 3) columns report the 10-h time series of the 10-Hz instantaneous horizontal (zonal west-to-east u, meridional south-to-north v) and, where available, vertical (positive upward w) wind speed components, centred around the time of the maximum horizontal wind speed (vectorial sum of u and v). Representing the wind speed over a large time interval (10 hours) allows a more comprehensive and detailed analysis of the event, taking into account the wind conditions before and after the onset of the downburst. ‘Not-a-Number’ (‘NaN’) values appear in the wind velocity signals wherever the instrument did not record valid data.
Some wind speed records show noise in discrete intervals of the signal, which appears as an increase in the wind speed standard deviation. A modified Hampel filter was employed to remove measurement outliers: each data sample was considered in ascending order, together with its ten adjacent samples (five on each side); the median of the window and a robust standard deviation based on the median absolute deviation were computed; and samples deviating from the median by more than six standard deviations were replaced with ‘NaN’. Tuning the filter parameters required balancing overly aggressive and insufficient removal of outliers, and residual outliers were subsequently removed manually through careful qualitative inspection. Because this operation is partly subjective, users may wish to explore alternative approaches; consequently, the published dataset includes two versions: an initial version (v1) containing the original raw data with no filtering applied, and a second, cleaned version (v2).
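To illustrate this cleaning step, the following Python sketch loads one record file and applies a modified Hampel filter of the kind described above. The file path is hypothetical, the window includes the sample plus five neighbours on each side, and the 1.4826*MAD scaling is an assumption for illustration only; the published v2 files already contain the cleaned signals.
# Sketch: load one downburst record (hypothetical path) and apply a modified Hampel filter
import numpy as np
import pandas as pd
rec = pd.read_csv('GE/GE_2011-01-01/GE01_2011-01-01.txt', sep='\t')  # hypothetical example path
t = pd.to_datetime(rec.iloc[:, 0])          # 10-h time vector (UTC, ISO format)
u = rec.iloc[:, 1].to_numpy(dtype=float)    # west-to-east component (m/s)
v = rec.iloc[:, 2].to_numpy(dtype=float)    # south-to-north component (m/s)
def hampel_nan(x, half_window=5, n_sigma=6.0):
    # Replace samples deviating from the local median by more than
    # n_sigma robust standard deviations (1.4826 * MAD) with NaN.
    y = x.copy()
    n = len(x)
    for i in range(n):
        lo, hi = max(0, i - half_window), min(n, i + half_window + 1)
        window = x[lo:hi]
        window = window[~np.isnan(window)]
        if window.size == 0:
            continue
        med = np.median(window)
        sigma = 1.4826 * np.median(np.abs(window - med))
        if sigma > 0 and abs(x[i] - med) > n_sigma * sigma:
            y[i] = np.nan
    return y
u_clean = hampel_nan(u)
v_clean = hampel_nan(v)
# Horizontal wind speed (vectorial sum of u and v) and its peak
speed = np.sqrt(u_clean**2 + v_clean**2)
i_peak = np.nanargmax(speed)
print(t.iloc[i_peak], speed[i_peak])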
The presented database can be used by researchers to validate and calibrate experimental and numerical simulations, as well as analytical models, of downburst winds. It will also be an important resource for the scientific community working in wind engineering, meteorology, and atmospheric sciences, as well as for risk management and the reduction of losses related to thunderstorm events (e.g., insurance companies).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data is split across three Zenodo locations because it is too large for a single upload. In total, the dataset contains MEG data from 58 participants. An overview of the participants and the amount of musical training they have completed is also available. Each of the three Zenodo uploads contains the participant overview file plus one Set#.zip archive.
Part/Set 1 (blue) contains: MEG data of participants 1–19 + audio folder (can be found here)
Part/Set 2 (pink) contains: MEG data of participants 20–38 (can be found here)
Part/Set 3 (yellow) contains: MEG data of participants 39–58
We used four German audiobooks (all published by Hörbuch Hamburg Verlag and available online):
1. „Frau Ella“ (narrated by lower pitched (LP) speaker and attended by participants)
2. „Darum“ (narrated by LP speaker and ignored by participants)
3. „Den Hund überleben“ (narrated by higher pitched (HP) speaker and attended by participants)
4. „Looking for Hope“ (narrated by HP speaker and ignored by participants)
The participants listened to 10 audiobook chapters. Two audiobooks were always presented at the same time (one narrated by the HP speaker and one by the LP speaker), and the participants attended to one speaker while ignoring the other. The structure of the chapters was as follows:
Chapter 1 of audiobook 1 + random part of audiobook 4
3 comprehension questions
Chapter 1 of audiobook 3 + random part of audiobook 2
3 comprehension questions
Chapter 2 of audiobook 1 + random part of audiobook 4
3 comprehension questions
Chapter 2 of audiobook 3 + random part of audiobook 2
3 comprehension questions
Chapter 3 of audiobook 1 + random part of audiobook 4
3 comprehension questions
Chapter 3 of audiobook 3 + random part of audiobook 2
3 comprehension questions
Chapter 4 of audiobook 1 + random part of audiobook 4
3 comprehension questions
Chapter 4 of audiobook 3 + random part of audiobook 2
3 comprehension questions
Chapter 5 of audiobook 1 + random part of audiobook 4
3 comprehension questions
Chapter 5 of audiobook 3 + random part of audiobook 2
3 comprehension questions
MEG data of 58 participants is contained in this data set.
Each participant has a folder with its participant number as folder name (1,2,3,…).
Each participant folder contains two subfolders: one (LP_speaker_attended) containing the MEG data recorded while the participant attended the LP speaker (ignoring the HP speaker), and one (HP_speaker_attended) containing the MEG data recorded while the participant attended the HP speaker (ignoring the LP speaker). Note that after each chapter the participants switched their attention from the LP to the HP speaker and vice versa; for evaluation, however, we concatenated the data of the LP-speaker-attended/HP-speaker-ignored condition and of the HP-speaker-attended/LP-speaker-ignored condition.
The data for the HP-speaker-attended condition has shape (248, 959416) (ca. 16 minutes); the data for the LP-speaker-attended condition has shape (248, 1247854) (ca. 21 minutes).
# The MEG data can be loaded with the MNE-Python library
import mne
meg = mne.io.read_raw_fif("…/data_meg.fif")  # path to a participant's data_meg.fif file
# The data can be accessed as a (n_channels, n_times) NumPy array:
meg_data = meg.get_data()
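As a quick sanity check on a loaded recording (the expected values follow the description above; this is only a sketch):
# Sanity check on the loaded recording
print(meg.info['sfreq'])   # expected: 1000.0 Hz after resampling
print(meg_data.shape)      # e.g. (248, 959416) for the HP-speaker-attended condition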
Exemplary code for performing source reconstruction and TRF evaluation can be found in our Git repository.
The original audio chapters of the audio books are stored in the folder „Audio“ in Part 1.
There are two subfolders. One (attended_speech) contains the ten audiobook chapters which were attended by the participant (audiobook1_#, audiobook3_#). The other subfolder (ignored_speech) contains the ten audiobook chapters which were ignored by the participant (audiobook2_#, audiobook4_#).
We recommend the librosa library for audio loading and processing.
Audio data is provided with a sampling frequency of 44.1 kHz.
Each audiobook is provided in 5 chapters as they were presented to the participants. The corresponding MEG file, as described above, already contains the concatenated measured data of all five chapters.
If you resample the audio data to 1000 Hz and concatenate the chapters, the audio length (n_times) will equal the corresponding n_times of the MEG data.
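A minimal sketch of that alignment step follows; the chapter file names and the .wav extension are assumptions based on the folder description above.
# Sketch: resample the attended chapters to 1000 Hz and concatenate them
# so that the resulting length matches n_times of the MEG data
import numpy as np
import librosa
chapters = []
for i in range(1, 6):
    audio, sr = librosa.load(f'Audio/attended_speech/audiobook1_{i}.wav', sr=1000)  # assumed file names
    chapters.append(audio)
attended = np.concatenate(chapters)
print(attended.shape)   # should match meg_data.shape[1] for the corresponding condition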
The MEG data was filtered with an analog 1.0–200 Hz filter and preprocessed offline using a notch filter (FIR, firwin design, 0.5 Hz bandwidth) to remove power-line interference at 50, 100, 150, and 200 Hz.
The data was then resampled from 1017.25 Hz to 1000 Hz.
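The published files already include these preprocessing steps; purely as an illustration, an equivalent offline pipeline in MNE-Python could look like the following sketch (not the original analysis script).
# Illustrative only: the distributed data is already notch-filtered and resampled
import mne
raw = mne.io.read_raw_fif("…/data_meg.fif", preload=True)
raw.notch_filter(freqs=[50, 100, 150, 200], notch_widths=0.5, fir_design='firwin')
raw.resample(1000)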
The MEG system used for the recordings was a 248-magnetometer system (4D Neuroimaging, San Diego, CA, USA).
The audio signal was presented through loudspeakers outside the magnetically shielded chamber and delivered to the participant via tubes 2 m long and 2 cm in diameter, resulting in an acoustic delay of 6 ms. The audio was presented diotically (both the attended and the ignored audio stream were presented to both ears) at a sound pressure level of 67 dB(A).
The measurement setup was adopted from a previous study by Schilling et al. (https://doi.org/10.1080/23273798.2020.1803375).