19 datasets found
  1. Chinook CSV Dataset

    • kaggle.com
    Updated Nov 9, 2023
    Cite
    Anurag Verma (2023). Chinook CSV Dataset [Dataset]. https://www.kaggle.com/datasets/anurag629/chinook-csv-dataset/data
    Explore at:
    Croissant - Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 9, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Anurag Verma
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    This dataset is an export of the tables from the Chinook sample database into CSV files. The Chinook database contains information about a fictional digital media store, including tables for artists, albums, media tracks, invoices, customers, and more.

    The CSV file for each table contains the columns and all rows of data. The column headers match the table schema. Refer to the Chinook schema documentation for more details on each table and column.

    The files are encoded as UTF-8. The delimiter is a comma. Strings are quoted. Null values are represented by empty strings.

    Files

    1. albums.csv
    2. artists.csv
    3. customers.csv
    4. employees.csv
    5. genres.csv
    6. invoice_items.csv
    7. invoices.csv
    8. media_types.csv
    9. playlist_track.csv
    10. playlists.csv
    11. tracks.csv

    Usage

    This dataset can be used to analyze the Chinook store data. For example, you could build models on customer purchases, track listening patterns, identify trends in genres or artists, etc.

    The data is ideal for practicing with libraries such as Pandas, NumPy, and PySpark. The database schema provides a realistic set of tables and relationships.
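
    As a quick, hedged illustration of the kind of analysis described above (not part of the dataset itself), the following pandas sketch counts tracks per genre; it assumes the standard Chinook column names (TrackId, GenreId, Name), since the CSV headers follow the table schema.

    import pandas as pd

    # assumes the standard Chinook column names; adjust if the headers differ
    tracks = pd.read_csv("tracks.csv")
    genres = pd.read_csv("genres.csv")

    tracks_per_genre = (
        tracks.merge(genres, on="GenreId", suffixes=("_track", "_genre"))
              .groupby("Name_genre")["TrackId"]
              .count()
              .sort_values(ascending=False)
    )
    print(tracks_per_genre.head())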

  2. Podcast PR Contacts - Self-Service CSV Batch Export

    • datarade.ai
    .csv, .xls
    Updated May 27, 2025
    Cite
    Listen Notes (2025). Podcast PR Contacts - Self-Service CSV Batch Export [Dataset]. https://datarade.ai/data-products/podcast-pr-contacts-self-service-csv-batch-export-listen-notes
    Explore at:
    .csv, .xls (available download formats)
    Dataset updated
    May 27, 2025
    Dataset authored and provided by
    Listen Notes
    Area covered
    Bulgaria, Algeria, Kuwait, Costa Rica, Dominican Republic, Congo, Benin, French Polynesia, Israel, Gibraltar
    Description

    == Quick starts ==

    Batch export podcast metadata to CSV files:

    1) Export by search keyword: https://www.listennotes.com/podcast-datasets/keyword/

    2) Export by category: https://www.listennotes.com/podcast-datasets/category/

    == Quick facts ==

    • The most up-to-date and comprehensive podcast database available
    • All languages & all countries
    • Includes over 3,500,000 podcasts
    • Features 35+ data fields, such as basic metadata, global rank, RSS feed (with audio URLs), Spotify links, and more
    • Delivered in CSV format

    == Data Attributes ==

    See the full list of data attributes on this page: https://www.listennotes.com/podcast-datasets/fields/?filter=podcast_only

    How to access podcast audio files: Our dataset includes RSS feed URLs for all podcasts. You can retrieve audio for over 170 million episodes directly from these feeds. With access to the raw audio, you’ll have high-quality podcast speech data ideal for AI training and related applications.
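
    As a hedged sketch of that workflow (the export file name and the name of the column holding the RSS feed URL are assumptions, not part of the product description), here is one way to pull episode audio URLs from a feed with Python and feedparser:

    import feedparser
    import pandas as pd

    podcasts = pd.read_csv("podcasts_export.csv")        # hypothetical batch-export file
    feed_url = podcasts.loc[0, "rss"]                    # assumed column name for the RSS feed URL

    feed = feedparser.parse(feed_url)
    for episode in feed.entries[:5]:
        for enclosure in episode.get("enclosures", []):  # audio files are published as RSS enclosures
            print(episode.get("title"), enclosure.get("href"))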

    == Custom Offers ==

    We can provide custom datasets based on your needs, such as language-specific data, daily/weekly/monthly update frequency, or one-time purchases.

    We also provide a RESTful API at PodcastAPI.com

    Contact us: hello@listennotes.com

    == Need Help? ==

    If you have any questions about our products, feel free to reach out to hello@listennotes.com

    == About Listen Notes, Inc. ==

    Since 2017, Listen Notes, Inc. has provided the leading podcast search engine and podcast database.

  3. TWIW database dump

    • data.dtu.dk
    txt
    Updated Jul 10, 2023
    Cite
    Sidsel Nag; Gunhild Larsen; Judit Szarvas; Laura Elmlund Kohl Birkedahl; Gábor Máté Gulyás; Wojciech Jakub Ciok; Timmie M. R. Lagermann; Silva Tafaj; Susan Bradbury; Peter Collignon; Denise Daley; Victorien Dougnon; Kafayath Fabiyi; Boubacar Coulibaly; René Dembélé; Georgette Nikiema; Natama Magloire; Isidore Juste Ouindgueta; Zenat Zebin Hossain; Anowara Begum; Deyan Donchev; Mathew Diggle; LeeAnn Turnbull; Simon Lévesque; Livia Berlinger; Kirstine Kobberoe Søgaard; Paula Diaz Guevara; Carolina Duarte Valderrama; Panagiota Maikanti; Jana Amlerova; Pavel Drevinek; Jan Tkadlec; Milica Dilas; Achim J. Kaasch; HenrikTorkil Westh; Mohamed Azzedine Bachtarzi; Wahiba Amhis; Carolina Elizabeth Satán Salazar; José Eduardo Villacis; Mária Angeles Dominguez Lúzon; Dàmaris Berbel Palau; Claire Duployez; Maxime Paluch; Solomon Asante-Sefa; Mie Møller; Margaret Ip; Ivana Marecović; Agnes Pál-Sonnevend; Clementiza Elvezia Cocuzza; Asta Dambrauskiene; Alexandre Macanze; Anelsio Cossa; Inácio Mandomando; Philip Nwajiobi-Princewill; Iruka N. Okeke; Aderemi O. Kehinde; Ini Adebiyi; Ifeoluwa Akintayo; Oluwafemi Popoola; Anthony Onipede; Anita Blomfeldt; Nora Elisabeth Nyquist; Kiri Bocker; James Ussher; Amjad Ali; Nimat Ullah; Habibullah Khan; Natalie Weiler Gustafson; Ikhlas Jarrar; Arif Al-Hamad; Viravarn Luvira; Wantana Paveenkittiporn; Irmak Baran; James C. L. Mwansa; Linda Sikakwa; Kaunda Yamba; Rene Sjøgren Hendriksen; Frank Møller Aarestrup (2023). TWIW database dump [Dataset]. http://doi.org/10.11583/DTU.21758456.v2
    Explore at:
    txt (available download formats)
    Dataset updated
    Jul 10, 2023
    Dataset provided by
    Technical University of Denmark
    Authors
    Sidsel Nag; Gunhild Larsen; Judit Szarvas; Laura Elmlund Kohl Birkedahl; Gábor Máté Gulyás; Wojciech Jakub Ciok; Timmie M. R. Lagermann; Silva Tafaj; Susan Bradbury; Peter Collignon; Denise Daley; Victorien Dougnon; Kafayath Fabiyi; Boubacar Coulibaly; René Dembélé; Georgette Nikiema; Natama Magloire; Isidore Juste Ouindgueta; Zenat Zebin Hossain; Anowara Begum; Deyan Donchev; Mathew Diggle; LeeAnn Turnbull; Simon Lévesque; Livia Berlinger; Kirstine Kobberoe Søgaard; Paula Diaz Guevara; Carolina Duarte Valderrama; Panagiota Maikanti; Jana Amlerova; Pavel Drevinek; Jan Tkadlec; Milica Dilas; Achim J. Kaasch; HenrikTorkil Westh; Mohamed Azzedine Bachtarzi; Wahiba Amhis; Carolina Elizabeth Satán Salazar; José Eduardo Villacis; Mária Angeles Dominguez Lúzon; Dàmaris Berbel Palau; Claire Duployez; Maxime Paluch; Solomon Asante-Sefa; Mie Møller; Margaret Ip; Ivana Marecović; Agnes Pál-Sonnevend; Clementiza Elvezia Cocuzza; Asta Dambrauskiene; Alexandre Macanze; Anelsio Cossa; Inácio Mandomando; Philip Nwajiobi-Princewill; Iruka N. Okeke; Aderemi O. Kehinde; Ini Adebiyi; Ifeoluwa Akintayo; Oluwafemi Popoola; Anthony Onipede; Anita Blomfeldt; Nora Elisabeth Nyquist; Kiri Bocker; James Ussher; Amjad Ali; Nimat Ullah; Habibullah Khan; Natalie Weiler Gustafson; Ikhlas Jarrar; Arif Al-Hamad; Viravarn Luvira; Wantana Paveenkittiporn; Irmak Baran; James C. L. Mwansa; Linda Sikakwa; Kaunda Yamba; Rene Sjøgren Hendriksen; Frank Møller Aarestrup
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Two Weeks in the World is a global research collaboration which seeks to shed light on various aspects of antimicrobial resistance. The research project has resulted in a dataset of 3100 clinically relevant bacterial genomes with pertaining metadata. "Clinically relevant" refers to the fact that the bacteria from which the genomes were obtained were all concluded to be a cause of clinical manifestations of infection. The metadata describes the infection from which each bacterium was obtained, such as geographic origin and approximate collection date. The bacteria were collected from 59 microbiological diagnostic units in 35 countries around the world during 2020.

    The data from the project consists of tabular data and genomic sequence data. The tabular data is available as a MySQL dump (relational database) and as CSV files. It includes the infection metadata, the results from bioinformatic analyses (species prediction, identification of acquired resistance genes and phylogenetic analysis), as well as the pertaining accession numbers of the individual genomic sequence data, which are available through the European Nucleotide Archive (ENA).

    At the time of submission, the project also has a dedicated web app, from which data can be browsed and downloaded: https://twiw.genomicepidemiology.org/ This complete dataset is created and shared according to the FAIR principles and has large reuse potential within the research fields of antimicrobial resistance, clinical microbiology and global health.

    .v2: The author list and readme have been updated, and a file containing column descriptions for the database dump has been added: TWIW_dbcolumns_explained.csv.

  4. Up-to-date mapping of COVID-19 treatment and vaccine development...

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, png
    Updated Jul 19, 2024
    Cite
    Tomáš Wagner; Ivana Mišová; Ivana Mišová; Ján Frankovský; Ján Frankovský; Tomáš Wagner (2024). Up-to-date mapping of COVID-19 treatment and vaccine development (covid19-help.org data dump) [Dataset]. http://doi.org/10.5281/zenodo.4601446
    Explore at:
    csv, png, bin (available download formats)
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Tomáš Wagner; Ivana Mišová; Ivana Mišová; Ján Frankovský; Ján Frankovský; Tomáš Wagner
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The free database mapping COVID-19 treatment and vaccine development based on the global scientific research is available at https://covid19-help.org/.

    The files provided here are curated partial data exports in the form of .csv files, or a full data export as a .sql script generated with pg_dump from our PostgreSQL 12 database. You can also find a .png file with the ER diagram of the tables in the .sql file in this repository.

    Structure of CSV files

    *On our site, compounds are named as substances

    compounds.csv

    1. Id - Unique identifier in our database (unsigned integer)

    2. Name - Name of the Substance/Compound (string)

    3. Marketed name - The marketed name of the Substance/Compound (string)

    4. Synonyms - Known synonyms (string)

    5. Description - Description (HTML code)

    6. Dietary sources - Dietary sources where the Substance/Compound can be found (string)

    7. Dietary sources URL - Dietary sources URL (string)

    8. Formula - Compound formula (HTML code)

    9. Structure image URL - Url to our website with the structure image (string)

    10. Status - Status of approval (string)

    11. Therapeutic approach - Approach in which Substance/Compound works (string)

    12. Drug status - Availability of Substance/Compound (string)

    13. Additional data - Additional data in stringified JSON format, with fields such as prescribing information and notes (string)

    14. General information - General information about Substance/Compound (HTML code)

    references.csv

    1. Id - Unique identifier in our database (unsigned integer)

    2. Impact factor - Impact factor of the scientific article (string)

    3. Source title - Title of the scientific article (string)

    4. Source URL - URL link of the scientific article (string)

    5. Tested on species - What testing model was used for the study (string)

    6. Published at - Date of publication of the scientific article (Date in ISO 8601 format)

    clinical-trials.csv

    1. Id - Unique identifier in our database (unsigned integer)

    2. Title - Title of the clinical trial study (string)

    3. Acronym title - Acronym of title of the clinical trial study (string)

    4. Source id - Unique identifier in the source database

    5. Source id optional - Optional identifier in other databases (string)

    6. Interventions - Description of interventions (string)

    7. Study type - Type of the conducted study (string)

    8. Study results - Has results? (string)

    9. Phase - Current phase of the clinical trial (string)

    10. Url - URL to clinical trial study page on clinicaltrials.gov (string)

    11. Status - Status in which study currently is (string)

    12. Start date - Date at which study was started (Date in ISO 8601 format)

    13. Completion date - Date at which study was completed (Date in ISO 8601 format)

    14. Additional data - Additional data in the form of stringified JSON, with fields such as locations of the study, study design, enrollment, age, and outcome measures (string)

    compound-reference-relations.csv

    1. Reference id - Id of a reference in our DB (unsigned integer)

    2. Compound id - Id of a substance in our DB (unsigned integer)

    3. Note - Id of a substance in our DB (unsigned integer)

    4. Is supporting - Is evidence supporting or contradictory (Boolean, true if supporting)

    compound-clinical-trial.csv

    1. Clinical trial id - Id of a clinical trial in our DB (unsigned integer)

    2. Compound id - Id of a Substance/Compound in our DB (unsigned integer)

    tags.csv

    1. Id - Unique identifier in our database (unsigned integer)

    2. Name - Name of the tag (string)

    tags-entities.csv

    1. Tag id - Id of a tag in our DB (unsigned integer)

    2. Reference id - Id of a reference in our DB (unsigned integer)

    API Specification

    Our project also has an Open API that gives you access to our data in a format suitable for processing, particularly in JSON format.

    https://covid19-help.org/api-specification

    Services are split into five endpoints:

    • Substances - /api/substances

    • References - /api/references

    • Substance-reference relations - /api/substance-reference-relations

    • Clinical trials - /api/clinical-trials

    • Clinical trials-substances relations - /api/clinical-trials-substances

    Method of providing data

    • All dates are text strings formatted in compliance with ISO 8601 as YYYY-MM-DD

    • If the syntax request is incorrect (missing or incorrectly formatted parameters) an HTTP 400 Bad Request response will be returned. The body of the response may include an explanation.

    • Data updated_at (used for querying changed-from) refers only to a particular entity and not its logical relations. Example: If a new substance reference relation is added, but the substance detail has not changed, this is reflected in the substance reference relation endpoint where a new entity with id and current dates in created_at and updated_at fields will be added, but in substances or references endpoint nothing has changed.

    The recommended way of sequential download

    • During the first download, it is possible to obtain all data by entering an old enough date in the changed-from parameter, for example changed-from=2020-01-01. It is important to write down the date on which receiving the data was initiated, let's say 2020-10-20.

    • For repeated data downloads, it is sufficient to receive only the records in which something has changed. They can therefore be requested with the parameter changed-from=2020-10-20 (the example from the previous bullet). Again, it is important to write down the date when the updates were downloaded (e.g., 2020-10-20). This date will be used in the next update (refresh) of the data.

    Services for entities

    List of endpoint URLs:

    Format of the request

    All endpoints have these parameters in common:

    • changed-from - a parameter to return only the entities that have been modified on a given date or later.

    • continue-after-id - a parameter to return only the entities that have a larger ID than specified in the parameter.

    • limit - a parameter to return only the number of records specified (up to 1000). The preset number is 100.

    Request example:

    /api/references?changed-from=2020-01-01&continue-after-id=1&limit=100

    Format of the response

    The response format is the same for all endpoints.

    • number_of_remaining_ids - the number of remaining entities that meet the specified criteria but are not displayed on the page. An integer of virtually unlimited size.

    • entities - an array of entity details in JSON format.

    Response example:

    {
      "number_of_remaining_ids": 100,
      "entities": [
        {
          "id": 3,
          "url": "https://www.ncbi.nlm.nih.gov/pubmed/32147628",
          "title": "Discovering drugs to treat coronavirus disease 2019 (COVID-19).",
          "impact_factor": "Discovering drugs to treat coronavirus disease 2019 (COVID-19).",
          "tested_on_species": "in silico",
          "publication_date": "2020-22-02",
          "created_at": "2020-30-03",
          "updated_at": "2020-31-03",
          "deleted_at": null
        },
        {
          "id": 4,
          "url": "https://www.ncbi.nlm.nih.gov/pubmed/32157862",
          "title": "CT Manifestations of Novel Coronavirus Pneumonia: A Case Report",
          "impact_factor": "CT Manifestations of Novel Coronavirus Pneumonia: A Case Report",
          "tested_on_species": "Patient",
          "publication_date": "2020-06-03",
          "created_at": "2020-30-03",
          "updated_at": "2020-30-03",
          "deleted_at": null
        }
      ]
    }
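
    As an illustration of the sequential-download pattern described above, here is a minimal Python sketch. Only the documented query parameters (changed-from, continue-after-id, limit) and response keys (number_of_remaining_ids, entities) come from this description; the base URL and error handling are assumptions.

    import requests

    BASE_URL = "https://covid19-help.org"   # assumed host for the /api endpoints

    def fetch_all(endpoint, changed_from="2020-01-01", limit=100):
        """Page through an endpoint with continue-after-id until nothing remains."""
        entities, last_id = [], 0
        while True:
            response = requests.get(
                BASE_URL + endpoint,
                params={
                    "changed-from": changed_from,
                    "continue-after-id": last_id,
                    "limit": limit,
                },
            )
            response.raise_for_status()      # an HTTP 400 signals a malformed request
            payload = response.json()
            if not payload["entities"]:
                break
            entities.extend(payload["entities"])
            last_id = max(entity["id"] for entity in payload["entities"])
            if payload["number_of_remaining_ids"] == 0:
                break
        return entities

    references = fetch_all("/api/references")
    print(len(references), "references downloaded")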

    Endpoint details

    Substances

    URL: /api/substances


  5. Australian Employee Salary/Wages DATAbase by detailed occupation, location...

    • figshare.com
    txt
    Updated May 31, 2023
    Cite
    Richard Ferrers; Australian Taxation Office (2023). Australian Employee Salary/Wages DATAbase by detailed occupation, location and year (2002-14); (plus Sole Traders) [Dataset]. http://doi.org/10.6084/m9.figshare.4522895.v5
    Explore at:
    txt (available download formats)
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Richard Ferrers; Australian Taxation Office
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The ATO (Australian Tax Office) made a dataset openly available (see links) showing all the Australian Salary and Wages (2002, 2006, 2010, 2014) by detailed occupation (around 1,000) and over 100 SA4 regions. Sole Trader sales and earnings are also provided. This open data (csv) is now packaged into a database (*.sql) with 45 sample SQL queries (backupSQL[date]_public.txt). See more description at the related Figshare #datavis record.

    Versions: V5: Following the #datascience course, I have made the main data (individual salary and wages) available as csv and Jupyter Notebook. Checksum matches #dataTotals. In 209,xxx rows. Also provided Jobs and SA4 (Locations) description files as csv. More details at: Where are jobs growing/shrinking? Figshare DOI: 4056282 (linked below). Noted 1% discrepancy ($6B) in 2010 wages total - to follow up.

    #dataTotals - Salary and Wages
    Year | Workers (M) | Earnings ($B)
    2002 | 8.5 | 285
    2006 | 9.4 | 372
    2010 | 10.2 | 481
    2014 | 10.3 | 584

    #dataTotal - Sole Traders
    Year | Workers (M) | Sales ($B) | Earnings ($B)
    2002 | 0.9 | 61 | 13
    2006 | 1.0 | 88 | 19
    2010 | 1.1 | 112 | 26
    2014 | 1.1 | 96 | 30

    #links See the ATO request for data at the ideascale link below. See the original csv open data set (CC-BY) at the data.gov.au link below. This database was used to create maps of change in regional employment - see the Figshare link below (m9.figshare.4056282).

    #package This file package contains a database (analysing the open data) as an SQL package and sample SQL text interrogating the DB. DB name: test. There are 20 queries relating to Salary and Wages.

    #analysis The database was analysed and outputs provided on Nectar(.org.au) resources at: http://118.138.240.130 (offline). This was only resourced for a maximum of 1 year, from July 2016, so expired in June 2017; hence the filing here. The sample home page is provided here (and as pdf), but not all the supporting files, which may be packaged and added later. Until then all files are available at the Nectar URL. Nectar URL now offline - server files attached as a package (html_backup[date].zip), including php scripts, html, csv, jpegs.

    #install IMPORT: DB SQL dump e.g. test_2016-12-20.sql (14.8Mb)
    1. Start MAMP on OSX.
    1.1 Go to PhpMyAdmin.
    2. New Database: test
    3. Import: Choose file: test_2016-12-20.sql -> Go (about 15-20 seconds on a MacBook Pro, 16Gb, 2.3 GHz i5)
    4. Four tables appear: jobTitles 3,208 rows | salaryWages 209,697 rows | soleTrader 97,209 rows | stateNames 9 rows, plus views e.g. deltahair, Industrycodes, states
    5. Run a test query under #sampleSQL: Sum of Salary by SA4, e.g. 101 $4.7B, 102 $6.9B

    #sampleSQL
    select sa4,
      (select sum(count) from salaryWages where year = '2014' and sa4 = sw.sa4) as thisYr14,
      (select sum(count) from salaryWages where year = '2010' and sa4 = sw.sa4) as thisYr10,
      (select sum(count) from salaryWages where year = '2006' and sa4 = sw.sa4) as thisYr06,
      (select sum(count) from salaryWages where year = '2002' and sa4 = sw.sa4) as thisYr02
    from salaryWages sw
    group by sa4
    order by sa4
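
    For readers working in Python rather than MySQL, here is a rough pandas equivalent of the sample SQL above, run against the V5 csv of individual salary and wages; the file name is a placeholder and the column names (sa4, year, count) are assumed to match the salaryWages table.

    import pandas as pd

    sw = pd.read_csv("salary_wages.csv")    # hypothetical name for the V5 csv export

    # sum of worker counts per SA4 region and year, as in the #sampleSQL query
    workers_by_sa4 = sw.pivot_table(index="sa4", columns="year",
                                    values="count", aggfunc="sum")
    print(workers_by_sa4.head())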

  6. Data from: Public Notification

    • anla-esp-esri-co.hub.arcgis.com
    Updated Feb 23, 2023
    Cite
    esri_en (2023). Public Notification [Dataset]. https://anla-esp-esri-co.hub.arcgis.com/items/4799f2bd805c4d7bab4cfc90c96e1c38
    Explore at:
    Dataset updated
    Feb 23, 2023
    Dataset provided by
    Esri (http://esri.com/)
    Authors
    esri_en
    Description

    Use the Public Notification template to allow users to create a list of features that they can export to a CSV or PDF file. They can create lists of features by searching for a single location, drawing an area of interest to include intersecting features, or using the geometry of an existing feature as the area of interest. Users can include a search buffer around a defined area of interest to expand the list of features.

    Examples:
    • Export a CSV file with addresses for residents to alert about road closures.
    • Create an inventory of contact information for parents in a school district.
    • Generate a PDF file to print address labels and the corresponding map for community members.

    Data requirements
    The Public Notification template requires a feature layer to use all of its capabilities.

    Key app capabilities
    • Search radius - Define a distance for a search buffer that selects intersecting input features to include in the list.
    • Export - Save the results from the lists created in the app. Users can export the data to CSV and PDF format.
    • Refine selection - Allow users to revise the selected features in the lists they create by adding or removing features with sketch tools.
    • Sketch tools - Draw graphics on the map to select features to add to a list. Users can also use features from a layer to select intersecting features from the input layer.
    • Home, Zoom controls, Legend, Layer List, Search

    Supportability
    This web app is designed responsively to be used in browsers on desktops, mobile phones, and tablets. We are committed to ongoing efforts towards making our apps as accessible as possible. Please feel free to leave a comment on how we can improve the accessibility of our apps for those who use assistive technologies.

  7. Dataset for the paper: "Monant Medical Misinformation Dataset: Mapping...

    • zenodo.org
    • data.niaid.nih.gov
    Updated Apr 22, 2022
    Cite
    Ivan Srba; Ivan Srba; Branislav Pecher; Branislav Pecher; Matus Tomlein; Matus Tomlein; Robert Moro; Robert Moro; Elena Stefancova; Elena Stefancova; Jakub Simko; Jakub Simko; Maria Bielikova; Maria Bielikova (2022). Dataset for the paper: "Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims" [Dataset]. http://doi.org/10.5281/zenodo.5996864
    Explore at:
    Dataset updated
    Apr 22, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ivan Srba; Ivan Srba; Branislav Pecher; Branislav Pecher; Matus Tomlein; Matus Tomlein; Robert Moro; Robert Moro; Elena Stefancova; Elena Stefancova; Jakub Simko; Jakub Simko; Maria Bielikova; Maria Bielikova
    Description

    Overview

    This dataset of medical misinformation was collected and is published by Kempelen Institute of Intelligent Technologies (KInIT). It consists of approx. 317k news articles and blog posts on medical topics published between January 1, 1998 and February 1, 2022 from a total of 207 reliable and unreliable sources. The dataset contains full-texts of the articles, their original source URL and other extracted metadata. If a source has a credibility score available (e.g., from Media Bias/Fact Check), it is also included in the form of annotation. Besides the articles, the dataset contains around 3.5k fact-checks and extracted verified medical claims with their unified veracity ratings published by fact-checking organisations such as Snopes or FullFact. Lastly and most importantly, the dataset contains 573 manually and more than 51k automatically labelled mappings between previously verified claims and the articles; mappings consist of two values: claim presence (i.e., whether a claim is contained in the given article) and article stance (i.e., whether the given article supports or rejects the claim or provides both sides of the argument).

    The dataset is primarily intended to be used as a training and evaluation set for machine learning methods for claim presence detection and article stance classification, but it enables a range of other misinformation related tasks, such as misinformation characterisation or analyses of misinformation spreading.

    Its novelty and our main contributions lie in (1) focus on medical news articles and blog posts as opposed to social media posts or political discussions; (2) providing multiple modalities (besides full-texts of the articles, there are also images and videos), thus enabling research of multimodal approaches; (3) mapping of the articles to the fact-checked claims (with manual as well as predicted labels); (4) providing source credibility labels for 95% of all articles and other potential sources of weak labels that can be mined from the articles' content and metadata.

    The dataset is associated with the research paper "Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims" accepted and presented at ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22).

    The accompanying Github repository provides a small static sample of the dataset and the dataset's descriptive analysis in the form of Jupyter notebooks.

    Options to access the dataset

    There are two ways to get access to the dataset:

    1. Static dump of the dataset available in the CSV format
    2. Continuously updated dataset available via REST API

    In order to obtain access to the dataset (either the full static dump or the REST API), please request access by following the instructions provided below.

    References

    If you use this dataset in any publication, project, tool or in any other form, please cite the following papers:

    @inproceedings{SrbaMonantPlatform,
      author = {Srba, Ivan and Moro, Robert and Simko, Jakub and Sevcech, Jakub and Chuda, Daniela and Navrat, Pavol and Bielikova, Maria},
      booktitle = {Proceedings of Workshop on Reducing Online Misinformation Exposure (ROME 2019)},
      pages = {1--7},
      title = {Monant: Universal and Extensible Platform for Monitoring, Detection and Mitigation of Antisocial Behavior},
      year = {2019}
    }
    @inproceedings{SrbaMonantMedicalDataset,
      author = {Srba, Ivan and Pecher, Branislav and Tomlein, Matus and Moro, Robert and Stefancova, Elena and Simko, Jakub and Bielikova, Maria},
      booktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22)},
      numpages = {11},
      title = {Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims},
      year = {2022},
      doi = {10.1145/3477495.3531726},
      publisher = {Association for Computing Machinery},
      address = {New York, NY, USA},
      url = {https://doi.org/10.1145/3477495.3531726},
    }
    


    Dataset creation process

    In order to create this dataset (and to continuously obtain new data), we used our research platform Monant. The Monant platform provides so-called data providers to extract news articles/blogs from news/blog sites as well as fact-checking articles from fact-checking sites. General parsers (for RSS feeds, Wordpress sites, Google Fact Check Tool, etc.) as well as custom crawlers and parsers were implemented (e.g., for the fact-checking site Snopes.com). All data is stored in a unified format in a central data storage.


    Ethical considerations

    The dataset was collected and is published for research purposes only. We collected only publicly available content of news/blog articles. The dataset contains identities of authors of the articles if they were stated in the original source; we left this information, since the presence of an author's name can be a strong credibility indicator. However, we anonymised the identities of the authors of discussion posts included in the dataset.

    The main identified ethical issue related to the presented dataset lies in the risk of mislabelling of an article as supporting a false fact-checked claim and, to a lesser extent, in mislabelling an article as not containing a false claim or not supporting it when it actually does. To minimise these risks, we developed a labelling methodology and require an agreement of at least two independent annotators to assign a claim presence or article stance label to an article. It is also worth noting that we do not label an article as a whole as false or true. Nevertheless, we provide partial article-claim pair veracities based on the combination of claim presence and article stance labels.

    As to the veracity labels of the fact-checked claims and the credibility (reliability) labels of the articles' sources, we take these from the fact-checking sites and external listings such as Media Bias/Fact Check as they are and refer to their methodologies for more details on how they were established.

    Lastly, the dataset also contains automatically predicted labels of claim presence and article stance using our baselines described in the next section. These methods have their limitations and work with certain accuracy as reported in this paper. This should be taken into account when interpreting them.


    Reporting mistakes in the dataset

    The means to report considerable mistakes in raw collected data or in manual annotations is by creating a new issue in the accompanying Github repository. Alternatively, general enquiries or requests can be sent to info [at] kinit.sk.


    Dataset structure

    Raw data

    At first, the dataset contains so-called raw data (i.e., data extracted by the Web monitoring module of the Monant platform and stored in exactly the same form as they appear at the original websites). Raw data consist of articles from news sites and blogs (e.g. naturalnews.com), discussions attached to such articles, and fact-checking articles from fact-checking portals (e.g. snopes.com). In addition, the dataset contains feedback (number of likes, shares, comments) provided by users on the social network Facebook, which is regularly extracted for all news/blog articles.

    Raw data are contained in these CSV files (and corresponding REST API endpoints):

    • sources.csv
    • articles.csv
    • article_media.csv
    • article_authors.csv
    • discussion_posts.csv
    • discussion_post_authors.csv
    • fact_checking_articles.csv
    • fact_checking_article_media.csv
    • claims.csv
    • feedback_facebook.csv

    Note: Personal information about discussion posts' authors (name, website, gravatar) is anonymised.


    Annotations

    Secondly, the dataset contains so-called annotations. Entity annotations describe the individual raw data entities (e.g., article, source). Relation annotations describe a relation between two such entities.

    Each annotation is described by the following attributes:

    1. category of annotation (`annotation_category`). Possible values: label (the annotation corresponds to ground truth, determined by human experts) and prediction (the annotation was created by means of an AI method).
    2. type of annotation (`annotation_type_id`). Example values: Source reliability (binary), Claim presence. The list of possible values can be obtained from the enumeration in annotation_types.csv.
    3. method which created the annotation (`method_id`). Example values: Expert-based source reliability evaluation, Fact-checking article to claim transformation method. The list of possible values can be obtained from the enumeration in methods.csv.
    4. its value (`value`). The value is stored in JSON format and its structure differs according to the particular annotation type.


    At the same time, annotations are associated with a particular object identified by:

    1. entity type (parameter entity_type in case of entity annotations, or source_entity_type and target_entity_type in case of relation annotations). Possible values: sources, articles, fact-checking-articles.
    2. entity id (parameter entity_id in case of entity annotations, or source_entity_id and target_entity_id in case of relation annotations).
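
    A hedged sketch of reading the annotation data with pandas; the CSV file name is hypothetical, while the column names follow the attributes listed above.

    import json
    import pandas as pd

    annotations = pd.read_csv("entity_annotations.csv")    # hypothetical file name

    # keep only human-labelled annotations attached to articles
    labels = annotations[
        (annotations["annotation_category"] == "label")
        & (annotations["entity_type"] == "articles")
    ]

    # the `value` column stores JSON whose structure depends on the annotation type
    labels = labels.assign(value=labels["value"].apply(json.loads))
    print(labels[["entity_id", "annotation_type_id", "value"]].head())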

  8. Test data and model for the FlowCam data processing pipeline

    • zenodo.org
    • explore.openaire.eu
    csv, zip
    Updated Jan 27, 2025
    Cite
    Katerina Symiakaki; Katerina Symiakaki; Tim Walles; Tim Walles; Cassidy J. Park; Jens Nejstgaard; Jens Nejstgaard; Stella A. Berger; Stella A. Berger; Cassidy J. Park (2025). Test data and model for the FlowCam data processing pipeline [Dataset]. http://doi.org/10.5281/zenodo.14732560
    Explore at:
    zip, csv (available download formats)
    Dataset updated
    Jan 27, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Katerina Symiakaki; Katerina Symiakaki; Tim Walles; Tim Walles; Cassidy J. Park; Jens Nejstgaard; Jens Nejstgaard; Stella A. Berger; Stella A. Berger; Cassidy J. Park
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Testing data for the processing pipeline for FlowCam data

    The data are fully processed but can be used to test each pipeline component. You can download the scripts at:

    Pipeline scripts

    To use the model, unzip the freshwater_phytoplankton_model.zip and place the folder in the respective model folder in the services.

    |-- services
        |-- ProcessData.py
        |-- config.py
        |-- classification
            |-- ObjectClassification
                |-- models
                    |--

    Once you unzip the data.zip file, each folder corresponds to the data export of a FlowCam run. You have the TIF collage files, a CSV file with the sample name containing all the parameters measured by the FlowCam, and a LabelChecker_

    You can run the preprocessing.py script directly on the files by including the -R (reprocess) argument; alternatively, you can trigger reprocessing by removing the LabelChecker CSV from the folders. The PreprocessingTrue column will remain the same.

    When running the classification.py script you can get new predictions on the data. In this case, only the LabelPredicted column will be updated and the validated labels (LabelTrue column) will not be lost.
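
    A small, hedged sketch for checking the effect of re-running classification.py on a LabelChecker CSV; the file name is a placeholder, and the column names are the ones mentioned above.

    import pandas as pd

    lc = pd.read_csv("LabelChecker_sample.csv")   # hypothetical file name

    # compare refreshed predictions against already-validated labels
    validated = lc[lc["LabelTrue"].notna()]
    changed = validated[validated["LabelPredicted"] != validated["LabelTrue"]]
    print(f"{len(changed)} of {len(validated)} validated rows have a differing prediction")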

    You could also use these files to try out the train_model.ipynb, although the resulting model will not be very good with so little data. We recommend trying it with your own data.

    LabelChecker

    These files can be used to test LabelChecker. You can open them one by one or all together and try all functionalities. We provide a label_file.csv but you can also make your own.

  9. Data from: Supporting Dataset for "RT-QuIC Optimization for Prion Detection...

    • researchworks.creighton.edu
    Updated Feb 24, 2025
    Cite
    Madeline K Grunklee; Jason C Bartz; Diana L Karwan; Stuart S Lichtenberg; Nicole A Lurndahl; Peter A Larsen; Marc D Schwabenlander; Gage R Rowden; E Anu Li; Qi Yuan; Tiffany M Wolf (2025). Supporting Dataset for "RT-QuIC Optimization for Prion Detection in Two Minnesota Soil Types" and "Detection of Chronic Wasting Disease Prions in Soil at an Illegal White-tailed Deer Carcass Disposal Site" [Dataset]. https://researchworks.creighton.edu/esploro/outputs/dataset/Supporting-Dataset-for-RT-QuIC-Optimization-for/991006150025902656
    Explore at:
    Dataset updated
    Feb 24, 2025
    Authors
    Madeline K Grunklee; Jason C Bartz; Diana L Karwan; Stuart S Lichtenberg; Nicole A Lurndahl; Peter A Larsen; Marc D Schwabenlander; Gage R Rowden; E Anu Li; Qi Yuan; Tiffany M Wolf
    Time period covered
    Feb 24, 2025
    Description

    Soil_Cntrl_Expmts_Data.csv - This data includes all control experiments conducted prior to testing study site test samples, i.e. negative soil experiments and inoculation/spiking soil experiments. We described this dataset and outline the full sample/data collection and processing protocols in a.Grunklee et al., In review. We collected this data to inform the sample processing protocol which we then used on the study site soil samples (b.Grunklee et al., In review).

    Soil_Test_Samples_Data.csv - This data includes all experiments involved in the testing of the Beltrami County study site soil samples (i.e. those within the dump site, around the CWD+ farm, and immediately around the dump site). We described this dataset and outline the full sample/data collection and processing protocols in b.Grunklee et al., In review. Each plate had at least 1 representative negative control of the Alfisol or Histosol soil types, followed by the study site test soil samples in question. Note: all of these soil samples were initially dried and run in a soil dilution of 10^-1 prior to RT-QuIC analysis, per the negative soil experimental results (a.Grunklee et al., In review).

    The Soil_Cntrl_Expmts_Data.csv dataset informed our sample processing and analysis protocols for the study site samples contained in the Soil_Test_Samples_Data.csv dataset. Per the results of the Soil_Cntrl_Expmts_Data.csv dataset, we dried and ran study site soil samples in a soil dilution of 10^-1 prior to RT-QuIC analysis (a.Grunklee et al., In review). These data describe prion detections in soil using the real-time quaking-induced conversion (RT-QuIC) assay, with various metric calculations common to RT-QuIC.

    The Soil_Cntrl_Expmts_Data.xlsx file contains data from a series of control experiments aimed at optimizing and applying RT-QuIC for the detection of chronic wasting disease prions in environmental soil samples. We focused negative control experiments on refining RT-QuIC and sample processing to use on Minnesota native soils, which included limiting background noise from the samples. Starting on 2023-05-08, we used spiked soil control experiments to distinguish true prion signal from background noise and validate detection reliability. Following the soil control experiments, the Soil_Test_Samples_Data.xlsx file describes our sample testing in RT-QuIC, collected from our study site, an illegal white-tailed deer (Odocoileus virginianus, WTD) carcass disposal site and a nearby captive WTD farm in Beltrami County, Minnesota. We analyzed study site soil samples for prion presence to assess potential environmental contamination associated with improper carcass disposal practices.

    This study was funded by the Minnesota Environment and Natural Resource Trust Fund as recommended by the Legislative-Citizen Commission on Minnesota Resources [2020-087 and 2022-217] and the Conservation Science Graduate Program at the University of Minnesota Twin Cities.

    Grunklee, Madeline K; Bartz, Jason C; Karwan, Diana L; Lichtenberg, Stuart S; Lurndahl, Nicole A; Larsen, Peter A; Schwabenlander, Marc D; Rowden, Gage R; Li, E Anu; Yuan, Qi; Wolf, Tiffany M. (2025). Supporting Dataset for "RT-QuIC Optimization for Prion Detection in Two Minnesota Soil Types" and "Detection of Chronic Wasting Disease Prions in Soil at an Illegal White-tailed Deer Carcass Disposal Site". Retrieved from the Data Repository for the University of Minnesota (DRUM), https://hdl.handle.net/11299/270027.

  10. OpenCitations Index CSV dataset of all the citation data

    • figshare.com
    zip
    Updated Jul 15, 2025
    Cite
    OpenCitations (2025). OpenCitations Index CSV dataset of all the citation data [Dataset]. http://doi.org/10.6084/m9.figshare.24356626.v6
    Explore at:
    zip (available download formats)
    Dataset updated
    Jul 15, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    OpenCitations
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains all the citation data (in CSV format) included in the OpenCitations Index (https://opencitations.net/index), released on July 10, 2025. Each line of the CSV file defines a citation and includes the following information:

    • [field "oci"] the Open Citation Identifier (OCI) for the citation;
    • [field "citing"] the OMID of the citing entity;
    • [field "cited"] the OMID of the cited entity;
    • [field "creation"] the creation date of the citation (i.e. the publication date of the citing entity);
    • [field "timespan"] the time span of the citation (i.e. the interval between the publication date of the cited entity and the publication date of the citing entity);
    • [field "journal_sc"] whether the citation is a journal self-citation (i.e. the citing and the cited entities are published in the same journal);
    • [field "author_sc"] whether the citation is an author self-citation (i.e. the citing and the cited entities have at least one author in common).

    Note: the information for each citation is sourced from OpenCitations Meta (https://opencitations.net/meta), a database that stores and delivers bibliographic metadata for all bibliographic resources included in the OpenCitations Index. The data provided in this dump is therefore based on the state of OpenCitations Meta at the time this collection was generated.

    This version of the dataset contains 2,216,426,689 citations. The size of the zipped archive is 38.8 GB, while the size of the unzipped CSV file is 242 GB.
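
    Because the unzipped CSV is around 242 GB, it is best processed in a streaming fashion. A minimal, hedged pandas sketch (the file name is an assumption, and the journal_sc/author_sc values are assumed to be "yes"/"no"):

    import pandas as pd

    total = journal_self = author_self = 0
    for chunk in pd.read_csv("opencitations_index.csv",
                             usecols=["oci", "journal_sc", "author_sc"],
                             chunksize=1_000_000):
        total += len(chunk)
        journal_self += (chunk["journal_sc"] == "yes").sum()
        author_self += (chunk["author_sc"] == "yes").sum()

    print(total, "citations;", journal_self, "journal self-citations;", author_self, "author self-citations")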

  11. Stable Water Isotope Data for the East River Watershed, Colorado (2014-2024)...

    • search.dataone.org
    • search-demo.dataone.org
    • +6 more
    Updated May 16, 2025
    + more versions
    Cite
    Kenneth Williams; Curtis Beutler; Markus Bill; Wendy Brown; Alexander Newman; Dylan O'Ryan; Austin Shirley; Roelof Versteeg (2025). Stable Water Isotope Data for the East River Watershed, Colorado (2014-2024) [Dataset]. http://doi.org/10.15485/1668053
    Explore at:
    Dataset updated
    May 16, 2025
    Dataset provided by
    ESS-DIVE
    Authors
    Kenneth Williams; Curtis Beutler; Markus Bill; Wendy Brown; Alexander Newman; Dylan O'Ryan; Austin Shirley; Roelof Versteeg
    Time period covered
    Jun 27, 2014 - Sep 30, 2024
    Area covered
    Description

    The stable water isotope data for the East River Watershed, Colorado, consist of delta2H (hydrogen) and delta18O (oxygen) values from samples collected at multiple long-term monitoring sites including streams, groundwater wells, springs, and a precipitation collector used to establish a local meteoric water line (LMWL) for the watershed. These locations represent important and/or unique end-member locations for which stable isotope values can be diagnostic of the connection between precipitation inputs as snow and rain and riverine export. Such locations include drainages underlain entirely or largely by shale bedrock, land cover dominated by conifers, aspens, or meadows, and drainages impacted by historic mining activity and the presence of naturally mineralized rock. Developing a long-term record of water isotope values from a diversity of environments is a critical component of quantifying the impacts of both climate change and discrete climate perturbations, such as drought, forest mortality, and wildfire, on water export. Such data may be combined with stream gaging stations co-located at each surface water monitoring site to relate seasonal variations in water export to their stable isotopic signature. Data for liquid water delta2H and delta18O values are reported in units of parts per thousand (per-mil; ‰).

    This data package contains (1) a zip file (isotope_data_2014-2024.zip) containing a total of 87 files: 86 data files of isotope data from across the Lawrence Berkeley National Laboratory (LBNL) Watershed Function Scientific Focus Area (SFA), reported in .csv files per location, and a locations.csv (1 file) with latitude and longitude for each location; (2) a file-level metadata (v5_20250515_flmd.csv) file that lists each file contained in the dataset with associated metadata; and (3) a data dictionary (v5_20250515_dd.csv) file that contains the terms/column headers used throughout the files along with a definition, units, and data type. Missing values within the anion data files are noted as either "-9999" or "0.0" for not detectable (N.D.) data. There are a total of 43 locations containing isotope data.

    Update on 2022-06-10: versioned updates to this dataset were made along with these changes: (1) updated isotope data for all locations up to 2021-12-31 and (2) the addition of the file-level metadata (flmd.csv) and data dictionary (dd.csv) to comply with the File-Level Metadata Reporting Format.
    Update on 2022-09-09: Updates were made to the reporting-format-specific files (file-level metadata and data dictionary) to correct swapped file names, add additional details to the metadata descriptions in both files, add a header_row column to enable parsing, and add a version number and date to the file names (v2_20220909_flmd.csv and v2_20220909_dd.csv).
    Update on 2023-08-08: Updates were made to both the data files and the reporting-format-specific files. Newly available anion data was added, up until 2023-03-13. The file-level metadata and data dictionary files were updated to reflect the additional data added.
    Update on 2024-03-11: Updates were made to both the data files and the reporting-format-specific files. Newly available anion data was added, up until 2024-02-19. Further, revisions to the data files were made to remove incorrect data points (from 1970 and 2001). The reporting-format-specific files were updated to reflect the additional data added.
    Update on 2025-05-15: Updates were made to both the data files and the reporting-format-specific files. Newly available isotope data was added, up until the end of WY2024 (September 30, 2024). International Generic Sample Numbers (IGSNs), when registered, were added to the data files. The reporting-format-specific files were updated to reflect the additional data added.
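
    A hedged pandas sketch for working with the package: locations.csv and the "-9999" missing-value code come from the description above, while the per-location file name and column layout are assumptions.

    import pandas as pd

    locations = pd.read_csv("locations.csv")      # latitude/longitude for each of the 43 locations
    iso = pd.read_csv("example_location.csv",     # hypothetical per-location isotope file
                      na_values=["-9999"])        # "0.0" may additionally mark not-detectable values
    print(locations.head())
    print(iso.describe())                         # summary of per-mil delta2H / delta18O values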

  12. String probability dataset of Al-Sm glasses

    • figshare.com
    zip
    Updated Jan 14, 2024
    + more versions
    Cite
    Qi Wang (2024). String probability dataset of Al-Sm glasses [Dataset]. http://doi.org/10.6084/m9.figshare.24995903.v1
    Explore at:
    zip (available download formats)
    Dataset updated
    Jan 14, 2024
    Dataset provided by
    figshare
    Authors
    Qi Wang
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ** About this dataset:
    This dataset is established based on the iso-configurational ensemble simulations of Al90Sm10 metallic glasses. There are a total of 61 independent Al90Sm10 samples that are quenched to and relaxed at 400 K. Additional details, including the train/val/test splits used in the paper, can be found in "Readme.txt".

    ** Files of each sample directory:
    [data.dump] The glass configuration in the format of a LAMMPS dump. Atom type 1 is for Al, 2 is for Sm.
    [string_probability.csv] The string-like motion (SLM) probability data of each glass configuration. The columns are ["source_id", "final_id", "atom_types", "string_probability"]:
    • "source_id": the source id of the string.
    • "final_id": the final id of the string. For example, if source_id is 103 and final_id is 107, this indicates a string segment 103 --> 107.
    • "atom_types": (atom type of source_id, atom type of final_id). For example, (1, 1) represents (Al, Al).
    • "string_probability": the probability of each string segment emerging during the iso-configurational ensemble simulations (a total of 170 runs).

    ** Citation:
    Qi Wang*, Long-Fei Zhang, Zhen-Ya Zhou, Hai-Bin Yu*. To be determined.
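
    A short, hedged sketch for loading one sample's string-probability table with pandas; the sample directory name is an assumption, and atom_types is assumed to be stored as the literal string "(1, 1)" etc.

    import pandas as pd

    sp = pd.read_csv("sample_01/string_probability.csv")   # hypothetical sample directory

    # highest-probability Al -> Al string segments (atom type 1 is Al)
    al_al = sp[sp["atom_types"] == "(1, 1)"]
    top = al_al.nlargest(5, "string_probability")
    print(top[["source_id", "final_id", "string_probability"]])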

  13. Crypto Market Data CSV Export: Trades, Quotes & Order Book Access via S3

    • datarade.ai
    .json, .csv
    Cite
    CoinAPI, Crypto Market Data CSV Export: Trades, Quotes & Order Book Access via S3 [Dataset]. https://datarade.ai/data-products/coinapi-comprehensive-crypto-market-data-in-flat-files-tra-coinapi
    Explore at:
    .json, .csv (available download formats)
    Dataset provided by
    Coinapi Ltd
    Authors
    CoinAPI
    Area covered
    Solomon Islands, Kyrgyzstan, Norfolk Island, Montserrat, Liechtenstein, Qatar, Iraq, Tanzania, Latvia, Northern Mariana Islands
    Description

    When you need to analyze crypto market history, batch processing often beats streaming APIs. That's why we built the Flat Files S3 API - giving analysts and researchers direct access to structured historical cryptocurrency data without the integration complexity of traditional APIs.

    Pull comprehensive historical data across 800+ cryptocurrencies and their trading pairs, delivered in clean, ready-to-use CSV formats that drop straight into your analysis tools. Whether you're building backtest environments, training machine learning models, or running complex market studies, our flat file approach gives you the flexibility to work with massive datasets efficiently.

    Why work with us?

    Market Coverage & Data Types:
    • Comprehensive historical data since 2010 (for chosen assets)
    • Comprehensive order book snapshots and updates
    • Trade-by-trade data

    Technical Excellence:
    • 99.9% uptime guarantee
    • Standardized data format across exchanges
    • Flexible integration
    • Detailed documentation
    • Scalable architecture

    CoinAPI serves hundreds of institutions worldwide, from trading firms and hedge funds to research organizations and technology providers. Our S3 delivery method easily integrates with your existing workflows, offering familiar access patterns, reliable downloads, and straightforward automation for your data team. Our commitment to data quality and technical excellence, combined with accessible delivery options, makes us the trusted choice for institutions that demand both comprehensive historical data and real-time market intelligence.

  14. Subject indexing data of K10plus library union catalog

    • explore.openaire.eu
    • zenodo.org
    Updated Jun 30, 2021
    Cite
    Jakob Voß (2021). Subject indexing data of K10plus library union catalog [Dataset]. http://doi.org/10.5281/zenodo.6817455
    Explore at:
    Dataset updated
    Jun 30, 2021
    Authors
    Jakob Voß
    Description

    This dataset contains an extract of the K10plus library union catalog with its subject indexing data:

    • kxp-subjects-sample_2022-06-30.dat : a random sample of 10,000 records
    • kxp-subjects_2022-06-30_??of10.dat : the full data (47,686,063 records) split into files of up to 5,000,000 records each

    K10plus is a union catalog of German libraries, run by the library service centers BSZ and VZG since 2019. The catalog contains bibliographic data of the majority of academic libraries in Germany. The core data of K10plus is made available as open data via APIs and in the form of database dumps. More information can be found here:

    • K10plus homepage (in German)
    • K10plus Open Data page (in German)
    • Traditional search interface (OPAC)

    Data format

    The data is provided in its raw internal format called PICA+ to not lose information during conversion. In particular, the data is given in PICA Normalized Format with one record per line. Each record consists of a list of fields and each field consists of a list of subfields. The data can best be processed with the command line tools pica-rs or picadata. A detailed description of the PICA format and its processing is given in the German textbook Einführung in die Verarbeitung von PICA-Daten. For visual inspection, PICA Normalized Format is best converted into PICA Plain Format (pica-rs command pica print). The following example record contains these fields:

    003@ $0010003231
    013D $9104450460$VTsvz$3209786884$7gnd/4151278-9$aEinführung
    044K $9106080474$VTsv1$7gnd/4077343-7$3209204761$aSekte
    044N $aReligionsgemeinschaft
    045E $a12
    045F $a291
    045Q/01 $9181570408$VTkv$a11.97$jNeue religiöse Bewegungen$jSekten
    045R $91270641751$VTkv$7rvk/11410:$3200641751$aBG 9600$jAllgemeines$NB$JTheologie und Religionswissenschaften$NBG$JFundamentaltheologie$NBG 9020-BG 9790$JKirche und Kirchen$NBG 9600-BG 9720$JFreikirchen und Sekten
    045V $a1

    Each K10plus record is uniquely identified by its record identifier PPN, given in field 003@ subfield $0. The PPN can be used:

    • to link into the K10plus catalog, e.g. https://opac.k10plus.de/DB=2.299/PPNSET?PPN=010003231
    • to retrieve the record in other formats via API, e.g. https://unapi.k10plus.de/?id=opac-de-627:ppn:010003231&format=marcxml (MARC/XML format) and https://ws.gbv.de/suggest/csl/?query=pica.ppn=010003231&citationstyle=ieee&language=de (Citation Format)

    Scope of the data

    The data is limited to records having at least one holding by a library participating in K10plus. Records are provided with "offline expansion" (some subfields have been added automatically to facilitate re-use of the data) and limited to the following fields:

    003@ with internal record identifier "PPN" in subfield $0
    013D type of content
    013F target audience
    041A keywords
    044. all subject indexing fields starting with 044
    045. all subject indexing fields starting with 045
    144Z local library keywords
    145S local library classification
    145Z local library classification

    Documentation of the fields can be found at https://format.k10plus.de/k10plushelp.pl?cmd=pplist&katalog=Standard#titel

    The current dump contains 47,686,063 records with subject indexing out of 74,127,563 K10plus records in total. For reference, the dump has been created and split from a full dump of K10plus with the script extract.sh.

    Processing examples

    Extract a CSV file of PPN and RVK-Notation:

    pica filter '045R?' kxp-subjects_2022-06-30.dat | pica select '003@$0,045Ra'

    Get a list of PPNs of records having RVK but not BK:

    pica filter '045R? & !045Q/01' kxp-subjects_2022-06-30.dat | pica select '003@$0'

    See https://github.com/gbv/k10plus-subjects#readme for additional examples of data analysis.

    Automatic download

    Given the Zenodo record ID (e.g. 6810556), a list of all files can be generated with curl and jq:

    curl -sL https://zenodo.org/api/records/$ID | jq -r '.files|map([.key,.links.self]|@tsv)[]'

    Changes

    2022-06-30: update with additional fields 013D and 013F (47,686,064 records)
    2021-06-30: first published dump (41,786,820 records)

    License

    https://creativecommons.org/publicdomain/zero/1.0/

  15. Dataset of 286 publications citing the 2014 Willoughby-Jansma-Hoye protocol

    • databank.illinois.edu
    Updated Nov 7, 2024
    + more versions
    Cite
    Heng Zheng; Yuanxi Fu; Ellie Vandel; Jodi Schneider (2024). Dataset of 286 publications citing the 2014 Willoughby-Jansma-Hoye protocol [Dataset]. http://doi.org/10.13012/B2IDB-4610831_V3
    Explore at:
    Dataset updated
    Nov 7, 2024
    Authors
    Heng Zheng; Yuanxi Fu; Ellie Vandel; Jodi Schneider
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Dataset funded by
    U.S. National Science Foundation (NSF)
    Alfred P. Sloan Foundation
    Harvard Radcliffe Institute for Advanced Study
    Description

    This dataset consists of the 286 publications retrieved from Web of Science and Scopus on July 6, 2023 as citations of Willoughby et al., 2014: Patrick H. Willoughby, Matthew J. Jansma, and Thomas R. Hoye (2014). A guide to small-molecule structure assignment through computation of (¹H and ¹³C) NMR chemical shifts. Nature Protocols, 9(3), Article 3. https://doi.org/10.1038/nprot.2014.042

    We added the DOIs of the citing publications to a Zotero collection and then exported all 286 DOIs in two formats: a .csv file (data export) and an .rtf file (bibliography).

    • Willoughby2014_286citing_publications.csv is a Zotero data export of the citing publications.
    • Willoughby2014_286citing_publications.rtf is a bibliography of the citing publications, using a variation of the American Psychological Association style (7th edition) with full names instead of initials.
    • Willoughby2014_citation_contexts.csv was created by HZ, who manually extracted the paragraphs that contain a citation marker of Willoughby et al., 2014. We refer to these paragraphs as the citation contexts of Willoughby et al., 2014. Manual extraction started with the 286 citing publications but excluded 2 publications that are not in English (DOIs 10.13220/j.cnki.jipr.2015.06.004 and 10.19540/j.cnki.cjcmm.20200604.201).

    The silver standard aimed to triage the citing publications of Willoughby et al., 2014 that are at risk of propagating unreliability due to a code glitch in a computational chemistry protocol introduced in Willoughby et al., 2014. The silver standard was created stepwise. First, one chemistry expert (YF) manually annotated the corpus of 284 citing publications in English, using their full text and citation contexts; she categorized each publication as either at risk of propagating unreliability or not at risk, with a rationale justifying each category. Then we selected a representative sample of citation contexts to be double annotated: MJS turned the full set of citation contexts (Willoughby2014_citation_contexts.csv) into word embeddings, clustered them by similarity using BERTopic's HDBSCAN-based clustering, and selected representative citation contexts based on the cluster centroids. Next, the second chemistry expert (EV) annotated the 77 publications associated with these citation contexts, considering the full text as well as the citation contexts.

    • double_annotated_subset_77_before_reconciliation.csv provides EV's and YF's annotations before reconciliation.
    • To create the silver standard, YF, EV, and JS discussed differences and reconciled most of them. YF and EV had principled reasons for disagreeing on 9 publications; to handle these, YF updated the annotations, creating the silver standard we use for evaluation in the remainder of our JCDL 2024 paper (silver_standard.csv).
    • Inter_Annotator_Agreement.xlsx indicates publications where the two annotators made opposite decisions and calculates the inter-annotator agreement before and after reconciliation.
    • double_annotated_subset_77_before_reconciliation.csv provides EV's and YF's annotations after reconciliation, including the application of the reconciliation policy.
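
    The clustering step described above can be illustrated with a short script. The sketch below is not the authors' code: it assumes the citation contexts live in a column named citation_context in Willoughby2014_citation_contexts.csv (the actual header may differ) and that the pandas and bertopic packages are installed.

        # Illustrative sketch only: cluster citation contexts with BERTopic (which
        # uses HDBSCAN by default) and print one representative context per topic.
        # The column name "citation_context" is an assumption.
        import pandas as pd
        from bertopic import BERTopic

        contexts = pd.read_csv("Willoughby2014_citation_contexts.csv")
        docs = contexts["citation_context"].dropna().astype(str).tolist()

        topic_model = BERTopic()
        topics, probs = topic_model.fit_transform(docs)

        # Representative documents per topic, i.e. contexts closest to each topic's cluster.
        for topic_id, examples in topic_model.get_representative_docs().items():
            print(topic_id, examples[0][:120], "...")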

  16. Data supporting the Master thesis "Monitoring von Open Data Praktiken -...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Nov 21, 2024
    Cite
    Katharina Zinke; Katharina Zinke (2024). Data supporting the Master thesis "Monitoring von Open Data Praktiken - Herausforderungen beim Auffinden von Datenpublikationen am Beispiel der Publikationen von Forschenden der TU Dresden" [Dataset]. http://doi.org/10.5281/zenodo.14196539
    Explore at:
    zipAvailable download formats
    Dataset updated
    Nov 21, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Katharina Zinke; Katharina Zinke
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data supporting the Master thesis "Monitoring von Open Data Praktiken - Herausforderungen beim Auffinden von Datenpublikationen am Beispiel der Publikationen von Forschenden der TU Dresden" (Monitoring open data practices - challenges in finding data publications using the example of publications by researchers at TU Dresden) - Katharina Zinke, Institut für Bibliotheks- und Informationswissenschaften, Humboldt-Universität Berlin, 2023

    This ZIP file contains the data the thesis is based on, interim exports of the results, and the R script with all pre-processing, data merging, and analyses carried out. The documentation of the additional explorative analysis is also included. The actual PDFs and text files of the scientific papers used are not included; since they are published open access, they can be retrieved from the original sources.

    The folder structure is shown below with the file names and a brief description of the contents of each file. For details concerning the analysis approach, please refer to the master's thesis (publication to follow soon).

    ## Data sources

    Folder 01_SourceData/

    - PLOS-Dataset_v2_Mar23.csv (PLOS-OSI dataset)

    - ScopusSearch_ExportResults.csv (export of Scopus search results from Scopus)

    - ScopusSearch_ExportResults.ris (export of Scopus search results from Scopus)

    - Zotero_Export_ScopusSearch.csv (export of the file names and DOIs of the Scopus search results from Zotero)

    ## Automatic classification

    Folder 02_AutomaticClassification/

    - (NOT INCLUDED) PDFs folder (Folder for PDFs of all publications identified by the Scopus search, named AuthorLastName_Year_PublicationTitle_Title)

    - (NOT INCLUDED) PDFs_to_text folder (Folder for all texts extracted from the PDFs by ODDPub, named AuthorLastName_Year_PublicationTitle_Title)

    - PLOS_ScopusSearch_matched.csv (merge of the Scopus search results with the PLOS_OSI dataset for the files contained in both)

    - oddpub_results_wDOIs.csv (results file of the ODDPub classification)

    - PLOS_ODDPub.csv (merge of the results file of the ODDPub classification with the PLOS-OSI dataset for the publications contained in both)

    ## Manual coding

    Folder 03_ManualCheck/

    - CodeSheet_ManualCheck.txt (Code sheet with descriptions of the variables for manual coding)

    - ManualCheck_2023-06-08.csv (Manual coding results file)

    - PLOS_ODDPub_Manual.csv (Merge of the results file of the ODDPub and PLOS-OSI classification with the results file of the manual coding)

    ## Explorative analysis for the discoverability of open data

    Folder 04_FurtherAnalyses/

    Proof_of_of_Concept_Open_Data_Monitoring.pdf (Description of the explorative analysis of the discoverability of open data publications using the example of a researcher) - in German

    ## R-Script

    Analyses_MA_OpenDataMonitoring.R (R-Script for preparing, merging and analyzing the data and for performing the ODDPub algorithm)
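
    As a rough illustration of the kind of merging done in the R script, the following Python sketch joins the ODDPub results with the PLOS-OSI dataset on DOI. It is not the thesis code, and the column name "doi" is an assumption; the actual headers are documented in the files themselves.

        # Hypothetical sketch of the merge step; the "doi" column name is an
        # assumption and must be adapted to the real headers of the CSV files.
        import pandas as pd

        oddpub = pd.read_csv("02_AutomaticClassification/oddpub_results_wDOIs.csv")
        plos_osi = pd.read_csv("01_SourceData/PLOS-Dataset_v2_Mar23.csv")

        # Normalize DOIs before joining.
        for df in (oddpub, plos_osi):
            df["doi"] = df["doi"].astype(str).str.strip().str.lower()

        merged = oddpub.merge(plos_osi, on="doi", how="inner")
        print(len(merged), "publications appear in both sources")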

  17. Cyber-Physical System power Consumption

    • zenodo.org
    bin, csv, zip
    Updated Nov 26, 2024
    Cite
    Gabriel Iuhasz; Gabriel Iuhasz; Teodor-Florin Fortis; Teodor-Florin Fortis (2024). Cyber-Physical System power Consumption [Dataset]. http://doi.org/10.5281/zenodo.14215756
    Explore at:
    bin, csv, zipAvailable download formats
    Dataset updated
    Nov 26, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Gabriel Iuhasz; Gabriel Iuhasz; Teodor-Florin Fortis; Teodor-Florin Fortis
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Files

    This dataset comprises 5 CSV files contained in the data.zip archive. Each file represents a production machine from which various sensor data was collected, at an average cadence of 5 measurements per second. The monitored devices were used for hydroforming.

    Data collection covered the period from 2023-06-01 to 2023-08-05.

    Data

    These files are a complete dump of the data stored in the time-series database (InfluxDB) used for collection. Because of this, some columns have no semantic value for detecting production cycles or for any other analytics.

    Each file contains a total of 14 columns. Some of the columns are artefacts of the query used to extract the data from InfluxDB and can be discarded; these include results, table, _start, and _stop, and are described in the list below.

    • results - An artefact of the InfluxDB query; it records the post-processing applied to the results and is always "mean" in this dataset.
    • table - An artefact of the InfluxDB query; it can be discarded.
    • _start and _stop - Refer to ingestion-related data, used for monitoring ingestion.
    • _field - An artefact of the InfluxDB query, specifying what field to use for the query.
    • _measurement - An artefact of the InfluxDB query, specifying what measurement to use for the query. Contains the same information as device_id.
    • host - An artefact of the InfluxDB query, the unique name of the host used for the InfluxDB sink in Kubernetes.
    • kafka_topic - Name of the Kafka topic used for collection.

    Pertinent columns are:

    • _time - Denotes the time at which a particular event was measured; it is used as the index when creating a dataframe.
    • _time.1 - Duplicate of _time, kept as a sanity check and for ease of analysis when _time is set as the index.
    • _value - Represents the value measured by each sensor type.
    • device_id - Unique identifier of the manufacturing device, should be the same as the file name, i.e. B827EB8D8E0C.
    • ingestion_time - Timestamp at which the data was collected and ingested by InfluxDB.
    • sid - Unique sensor ID; the power measurements can be found at sid 1.

    Annotations

    There are two additional files which contain annotation data:

    • scamp_devices.csv - Contains mapping information between the dataset device ID (defined in column "DeviceIDMonitoring") and the ground truth file ID (defined in column "DeviceID")
    • scamp_report_3m.csv - Contains the ground truth, which can be used for validation of cycle detection and analysis methods. The columns are as follows:
      • ReportID - Internal unique ID created during data collection. It can be discarded.
      • JobID - Internal Scheduling Job unique ID.
      • DeviceID - The unique ID of the device used for manufacturing; it needs to be mapped to the dataset device ID using the scamp_devices.csv data.
      • StartTime - Start time of operations
      • EndTime - End time of operations
      • ProductID - Unique identifier of the product being manufactured.
      • CycleTime - Average length of cycle in seconds, added manually by operators. It can be unreliable.
      • QuantityProduced - Number of products manufactured during the timeframe given by StartTime and EndTime.
      • QuantityScrap - Number of scrapped/malformed products in the given timeframe. These are part of QuantityProduced, not in addition to it.
      • IntreruptionMinuted - Minutes of production halt.
    • scamp_patterns.csv - Contains the start and end timestamps for selected example production cycles, chosen with input from expert users.

    Jupyter Notebook

    We have provided a sample Jupyter notebook (verify_data.ipynb), which gives examples of how the dataset can be loaded and visualised, as well as how the sample patterns and ground truth can be accessed and visualised.

    Note

    The Jupyter notebook contains an example of how the data can be loaded and visualised. Please note that the data should be filtered by sid; the power measurements are collected under sid 1. See the notebook for an example, or the minimal sketch below.
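
    For readers without the notebook at hand, a minimal pandas sketch along these lines is given below. It follows the column descriptions in this entry (index on _time, filter on sid == 1); the file name B827EB8D8E0C.csv is taken from the device_id example, and plotting assumes matplotlib is available.

        # Minimal sketch based on the column descriptions above (not the project notebook):
        # load one machine's CSV, index it by _time, and keep only the power measurements.
        import pandas as pd
        import matplotlib.pyplot as plt

        df = pd.read_csv("B827EB8D8E0C.csv", parse_dates=["_time"])
        df = df.set_index("_time").sort_index()

        power = df[df["sid"] == 1]          # power measurements are collected under sid 1
        power["_value"].plot(title="Power consumption, device B827EB8D8E0C")
        plt.ylabel("_value")
        plt.show()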

  18. National RCRA Hazardous Waste Biennial Report Data Files

    • datasets.ai
    • datadiscoverystudio.org
    • +1more
    57
    Updated Sep 27, 2016
    Cite
    U.S. Environmental Protection Agency (2016). National RCRA Hazardous Waste Biennial Report Data Files [Dataset]. https://datasets.ai/datasets/national-rcra-hazardous-waste-biennial-report-data-files5
    Explore at:
    57Available download formats
    Dataset updated
    Sep 27, 2016
    Dataset provided by
    United States Environmental Protection Agencyhttp://www.epa.gov/
    Authors
    U.S. Environmental Protection Agency
    Description

    The United States Environmental Protection Agency (EPA), in cooperation with the States, biennially collects information regarding the generation, management, and final disposition of hazardous wastes regulated under the Resource Conservation and Recovery Act of 1976 (RCRA), as amended. Collection, validation and verification of the Biennial Report (BR) data is the responsibility of RCRA authorized states and EPA regions. EPA does not modify the data reported by the states or regions. Any questions regarding the information reported for a RCRA handler should be directed to the state agency or region responsible for the BR data collection. BR data are collected every other year (odd-numbered years) and submitted in the following year. The BR data are used to support regulatory activities and provide basic statistics and trend of hazardous waste generation and management.

    BR data is available to the public through three mechanisms:

    1. The RCRAInfo website (https://rcrainfo.epa.gov/rcrainfoweb/action/main-menu/view) holds all BR data collected from 2001 to the present day; users can run queries and output reports for the different data collection years.
    2. BR data files collected from 1999 to the present day may be downloaded directly in zip file format from https://rcrapublic.epa.gov/rcra-public-export/?outputType=Fixed or https://rcrapublic.epa.gov/rcra-public-export/?outputType=CSV.
    3. Historical data collected prior to 1999 may be ordered on CD; please see the contact information in this metadata file to order historical BR data.

    BR data are typically published in December of the year following their collection. Data must be received by authorized states (or by EPA regions, where a state is not authorized to implement the BR program) by March 1st of the year following collection. For example, data collected in 2001 would be received by states and EPA regions by March 1, 2002; the states and EPA regions compile the BR data submitted by facilities and load the state data sets into RCRAInfo, the system managed by EPA Headquarters (HQ), and EPA HQ then publishes the data files around December 2002. Additional information regarding the biennial report data is available at https://rcrapublic.epa.gov/rcra-public-export/rcrainfo_flat_file_documentation_v5.pdf and https://www.epa.gov/hwgenerators/biennial-hazardous-waste-report.

    Please note that the update frequency field for this data set indicates annual, but that the true update period is biennial (every other year). There is no selection option for biennial for the update frequency field.

  19. TrainRuns.jl: an Open-Source Tool for Running Time Estimation - Supplement...

    • zenodo.org
    zip
    Updated Aug 23, 2023
    Cite
    Martin Scheidt; Martin Scheidt; Max Kannenberg; Max Kannenberg (2023). TrainRuns.jl: an Open-Source Tool for Running Time Estimation - Supplement Data [Dataset]. http://doi.org/10.5281/zenodo.6842604
    Explore at:
    zipAvailable download formats
    Dataset updated
    Aug 23, 2023
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Martin Scheidt; Martin Scheidt; Max Kannenberg; Max Kannenberg
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This supplementary data contains the initial data and the calculated results for comparing FBS and TrainRuns.jl.

    File description

    • local.yaml: input parameters for the local train
    • freight.yaml: input parameters for the freight train
    • running_path.yaml: input parameters for the path
    • freight_FBS.csv: export of the calculation from FBS for the freight train
    • freight_TrainRuns.csv: export of the calculation from TrainRuns.jl, converted to FBS units, for the freight train
    • freight_diff.csv: the calculated difference between freight_FBS.csv and freight_TrainRuns.csv for the freight train
    • local_FBS.csv: export of the calculation from FBS for the local train
    • local_TrainRuns.csv: export of the calculation from TrainRuns.jl, converted to FBS units, for the local train
    • local_diff.csv: the calculated difference between local_FBS.csv and local_TrainRuns.csv for the local train
    • running_path.csv: running_path.yaml converted to CSV for display
    • comparison.tex: LaTeX code for the graph in comparison.pdf
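
    As an illustration of how the *_diff.csv files could be reproduced, the short Python sketch below subtracts the TrainRuns.jl export from the FBS export column by column. This is not the authors' workflow, and it assumes that both exports share the same column names and row order, which should be verified against the actual files.

        # Hypothetical re-computation of freight_diff.csv; assumes freight_FBS.csv and
        # freight_TrainRuns.csv have identical numeric columns and aligned rows.
        import pandas as pd

        fbs = pd.read_csv("freight_FBS.csv")
        trainruns = pd.read_csv("freight_TrainRuns.csv")

        numeric_cols = fbs.select_dtypes("number").columns
        diff = fbs[numeric_cols] - trainruns[numeric_cols]

        diff.to_csv("freight_diff_recomputed.csv", index=False)
        print(diff.describe())  # summary of the deviations between the two tools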

    Sources

    The calculations in FBS were done with the file 'Ostsachsen_V220.railml'. FBS requires a commercial license, which can be purchased. The license for 'Ostsachsen_V220.railml' is Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0).
    The file 'Ostsachsen_V220.railml' can be found at:
    https://www.railml.org/en/user/exampledata.html (last accessed 2022-06-06 with login) -> "Real world railway examples from professional tools" -> "East Saxony railway network by FBS" -> "Ostsachsen_V220.railml"

    Other sources are mentioned in the files.

