78 datasets found
  1. Dataset for the paper: "Monant Medical Misinformation Dataset: Mapping...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Apr 22, 2022
    + more versions
    Cite
    Branislav Pecher (2022). Dataset for the paper: "Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5996863
    Dataset updated
    Apr 22, 2022
    Dataset provided by
    Jakub Simko
    Branislav Pecher
    Elena Stefancova
    Robert Moro
    Maria Bielikova
    Ivan Srba
    Matus Tomlein
    Description

    Overview

    This dataset of medical misinformation was collected and is published by Kempelen Institute of Intelligent Technologies (KInIT). It consists of approx. 317k news articles and blog posts on medical topics published between January 1, 1998 and February 1, 2022 from a total of 207 reliable and unreliable sources. The dataset contains full-texts of the articles, their original source URL and other extracted metadata. If a source has a credibility score available (e.g., from Media Bias/Fact Check), it is also included in the form of annotation. Besides the articles, the dataset contains around 3.5k fact-checks and extracted verified medical claims with their unified veracity ratings published by fact-checking organisations such as Snopes or FullFact. Lastly and most importantly, the dataset contains 573 manually and more than 51k automatically labelled mappings between previously verified claims and the articles; mappings consist of two values: claim presence (i.e., whether a claim is contained in the given article) and article stance (i.e., whether the given article supports or rejects the claim or provides both sides of the argument).

    The dataset is primarily intended to be used as a training and evaluation set for machine learning methods for claim presence detection and article stance classification, but it enables a range of other misinformation-related tasks, such as misinformation characterisation or analyses of misinformation spreading.

    Its novelty and our main contributions lie in (1) focus on medical news articles and blog posts as opposed to social media posts or political discussions; (2) providing multiple modalities (besides full-texts of the articles, there are also images and videos), thus enabling research of multimodal approaches; (3) mapping of the articles to the fact-checked claims (with manual as well as predicted labels); (4) providing source credibility labels for 95% of all articles and other potential sources of weak labels that can be mined from the articles' content and metadata.

    The dataset is associated with the research paper "Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims" accepted and presented at ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22).

    The accompanying GitHub repository provides a small static sample of the dataset and the dataset's descriptive analysis in the form of Jupyter notebooks.

    Options to access the dataset

    There are two ways to access the dataset:

    1. A static dump of the dataset, available in CSV format
    2. A continuously updated dataset, available via a REST API

    To obtain access to the dataset (either the full static dump or the REST API), please request access by following the instructions provided below.

    References

    If you use this dataset in any publication, project, tool or in any other form, please cite the following papers:

    @inproceedings{SrbaMonantPlatform,
      author = {Srba, Ivan and Moro, Robert and Simko, Jakub and Sevcech, Jakub and Chuda, Daniela and Navrat, Pavol and Bielikova, Maria},
      booktitle = {Proceedings of Workshop on Reducing Online Misinformation Exposure (ROME 2019)},
      pages = {1--7},
      title = {Monant: Universal and Extensible Platform for Monitoring, Detection and Mitigation of Antisocial Behavior},
      year = {2019}
    }

    @inproceedings{SrbaMonantMedicalDataset,
      author = {Srba, Ivan and Pecher, Branislav and Tomlein, Matus and Moro, Robert and Stefancova, Elena and Simko, Jakub and Bielikova, Maria},
      booktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22)},
      numpages = {11},
      title = {Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims},
      year = {2022},
      doi = {10.1145/3477495.3531726},
      publisher = {Association for Computing Machinery},
      address = {New York, NY, USA},
      url = {https://doi.org/10.1145/3477495.3531726}
    }

    Dataset creation process

    To create this dataset (and to continuously obtain new data), we used our research platform Monant. The Monant platform provides so-called data providers to extract news articles/blogs from news/blog sites as well as fact-checking articles from fact-checking sites. General parsers (for RSS feeds, Wordpress sites, Google Fact Check Tool, etc.) as well as custom crawlers and parsers were implemented (e.g., for the fact-checking site Snopes.com). All data is stored in a unified format in a central data storage.

    Ethical considerations

    The dataset was collected and is published for research purposes only. We collected only publicly available content of news/blog articles. The dataset contains the identities of the articles' authors if they were stated in the original source; we kept this information, since the presence of an author's name can be a strong credibility indicator. However, we anonymised the identities of the authors of discussion posts included in the dataset.

    The main identified ethical issue related to the presented dataset lies in the risk of mislabelling an article as supporting a false fact-checked claim and, to a lesser extent, in mislabelling an article as not containing a false claim or not supporting it when it actually does. To minimise these risks, we developed a labelling methodology and require agreement of at least two independent annotators to assign a claim presence or article stance label to an article. It is also worth noting that we do not label an article as a whole as false or true. Nevertheless, we provide partial article-claim pair veracities based on the combination of claim presence and article stance labels.

    As to the veracity labels of the fact-checked claims and the credibility (reliability) labels of the articles' sources, we take these from the fact-checking sites and external listings such as Media Bias/Fact Check as they are and refer to their methodologies for more details on how they were established.

    Lastly, the dataset also contains automatically predicted labels of claim presence and article stance produced by our baselines described in the next section. These methods have their limitations and achieve only a certain accuracy, as reported in the paper. This should be taken into account when interpreting the predicted labels.

    Reporting mistakes in the dataset

    The way to report considerable mistakes in raw collected data or in manual annotations is by creating a new issue in the accompanying GitHub repository. Alternatively, general enquiries or requests can be sent to info [at] kinit.sk.

    Dataset structure

    Raw data

    First, the dataset contains so-called raw data (i.e., data extracted by the Web monitoring module of the Monant platform and stored in exactly the same form as they appear at the original websites). Raw data consist of articles from news sites and blogs (e.g. naturalnews.com), discussions attached to such articles, and fact-checking articles from fact-checking portals (e.g. snopes.com). In addition, the dataset contains feedback (number of likes, shares, comments) provided by users on the social network Facebook, which is regularly extracted for all news/blog articles.

    Raw data are contained in these CSV files (and corresponding REST API endpoints):

    sources.csv

    articles.csv

    article_media.csv

    article_authors.csv

    discussion_posts.csv

    discussion_post_authors.csv

    fact_checking_articles.csv

    fact_checking_article_media.csv

    claims.csv

    feedback_facebook.csv

    Note: Personal information about discussion posts' authors (name, website, gravatar) is anonymised.
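
    For illustration, a minimal Python sketch of loading a few of the raw-data files with pandas once the static dump has been obtained; the join key names (source_id, id) are assumptions, as the column layout is not documented in this listing:

        import pandas as pd

        # Load two of the raw-data CSV files from the static dump.
        sources = pd.read_csv("sources.csv")
        articles = pd.read_csv("articles.csv")

        # Attach basic source attributes to each article; the key names here
        # are assumed, not taken from the dataset documentation.
        articles_with_sources = articles.merge(
            sources, left_on="source_id", right_on="id",
            how="left", suffixes=("", "_source"),
        )
        print(articles_with_sources.head())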

    Annotations

    Secondly, the dataset contains so-called annotations. Entity annotations describe individual raw data entities (e.g., article, source). Relation annotations describe a relation between two such entities.

    Each annotation is described by the following attributes:

    category of annotation (annotation_category). Possible values: label (the annotation corresponds to ground truth determined by human experts) and prediction (the annotation was created by an AI method).

    type of annotation (annotation_type_id). Example values: Source reliability (binary), Claim presence. The list of possible values can be obtained from the enumeration in annotation_types.csv.

    method which created the annotation (method_id). Example values: Expert-based source reliability evaluation, Fact-checking article to claim transformation method. The list of possible values can be obtained from the enumeration in methods.csv.

    its value (value). The value is stored in JSON format and its structure differs according to the particular annotation type.

    At the same time, annotations are associated with a particular object identified by:

    entity type (parameter entity_type in case of entity annotations, or source_entity_type and target_entity_type in case of relation annotations). Possible values: sources, articles, fact-checking-articles.

    entity id (parameter entity_id in case of entity annotations, or source_entity_id and target_entity_id in case of relation annotations).

    The dataset provides specifically these entity annotations:

    Source reliability (binary). Determines the validity of a source (website) on a binary scale with two options: reliable source and unreliable source.

    Article veracity. Aggregated information about veracity from article-claim pairs.

    The dataset provides specifically these relation annotations:

    Fact-checking article to claim mapping. Determines mapping between fact-checking article and claim.

    Claim presence. Determines presence of claim in article.

    Claim stance. Determines stance of an article to a claim.

    Annotations are contained in these CSV files (and corresponding REST API endpoints):

    entity_annotations.csv

    relation_annotations.csv

    Note: Identification of human annotators (the email provided in the annotation app) is anonymised.
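
    A minimal sketch of working with the annotation files, assuming the CSV columns mirror the attribute names above (annotation_category, annotation_type_id, value, entity_type, entity_id); the JSON structure of value differs per annotation type:

        import json
        import pandas as pd

        entity_ann = pd.read_csv("entity_annotations.csv")

        # Keep only human-labelled ground truth, dropping model predictions.
        labels = entity_ann[entity_ann["annotation_category"] == "label"].copy()

        # Parse the JSON-encoded value column; its schema depends on the annotation type.
        labels["value_parsed"] = labels["value"].apply(json.loads)
        print(labels[["entity_type", "entity_id",
                      "annotation_type_id", "value_parsed"]].head())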

  2. Replication Data for: Does mode of administration impact on quality of data?...

    • dataone.org
    • dataverse.harvard.edu
    • +1more
    Updated Nov 22, 2023
    Cite
    Triga, Vasiliki; Vasilis Manavopoulos (2023). Replication Data for: Does mode of administration impact on quality of data? Comparing a traditional survey versus an online survey via a Voting Advice Application [Dataset]. http://doi.org/10.7910/DVN/ARDVUL
    Dataset updated
    Nov 22, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Triga, Vasiliki; Vasilis Manavopoulos
    Description

    This dataset (in .csv format), accompanying codebook and replication code serve as a supplement to a study titled "Does the mode of administration impact on quality of data? Comparing a traditional survey versus an online survey via a Voting Advice Application", submitted for publication to the journal Survey Research Methods. The study involved comparisons of responses to two near-identical questionnaires administered via a traditional survey and through a Voting Advice Application (VAA), both designed for and administered during the pre-electoral period of the Cypriot Presidential Elections of 2013.

    The offline dataset consisted of questionnaires collected from 818 individuals whose participation was elicited through door-to-door stratified random sampling, with replacement of individuals who could not be contacted. The strata were designed to take into account regional population density, gender, age and whether the area was urban or rural. Offline participants completed a pen-and-paper questionnaire version of the VAA in a self-completing capacity, although the person administering the questionnaire remained present throughout.

    The online dataset involved responses from 10,241 VAA users who completed the Choose4Cyprus VAA. Voting Advice Applications are online platforms that provide voting recommendations to users based on their closeness to political parties after they declare their agreement or disagreement with a number of policy statements. VAA users freely visited the VAA website and completed the relevant questionnaire in a self-completing capacity.

    The two modes of administration (online and offline) involved respondents completing a series of supplementary questions (demographics, ideological affinity & political orientation [e.g. vote in the previous election]) prior to the main questionnaire, consisting of 35 and 30 policy-related Likert-type items for the offline and online mode respectively. The dataset includes all 30 policy items that were common between the two modes, although only the first 19 (q1:q19) appeared in the same order and in the same position in the two questionnaires; as such, all analyses reported in the article were conducted using these 19 items only. The phrasing of the questions was identical for the two modes and is described per variable in the attached codebook.

  3. Zero Modes and Classification of Combinatorial Metamaterials

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Nov 8, 2022
    + more versions
    Cite
    Ryan van Mastrigt; Marjolein Dijkstra; Martin van Hecke; Corentin Coulais (2022). Zero Modes and Classification of Combinatorial Metamaterials [Dataset]. http://doi.org/10.5281/zenodo.7070963
    Available download formats: zip
    Dataset updated
    Nov 8, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ryan van Mastrigt; Marjolein Dijkstra; Martin van Hecke; Corentin Coulais
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains the simulation data of the combinatorial metamaterial as used for the paper 'Machine Learning of Implicit Combinatorial Rules in Mechanical Metamaterials', as published in Physical Review Letters.

    In this paper, the data is used to classify each \(k \times k\) unit cell design into one of two classes (C or I) based on the scaling (linear or constant) of the number of zero modes \(M_k(n)\) for metamaterials consisting of an \(n\times n\) tiling of the corresponding unit cell. Additionally, a random walk through the design space starting from class C unit cells was performed to characterize the boundary between class C and I in design space. A more detailed description of the contents of the dataset follows below.

    Modescaling_raw_data.zip

    This file contains uniformly sampled unit cell designs for metamaterial M2 and \(M_k(n)\) for \(1\leq n\leq 4\), which were used to classify the unit cell designs for the data set. There is a small subset of designs for \(k=\{3, 4, 5\}\) that do not neatly fall into the class C and I classification, and instead require additional simulation for \(4 \leq n \leq 6\) before either saturating to a constant number of zero modes (class I) or increasing linearly (class C). This file contains the simulation data of unit cells of size \(3 \leq k \leq 8\). The data is organized as follows.

    Simulation data for \(3 \leq k \leq 5\) and \(1 \leq n \leq 4\) is stored in numpy array format (.npy) and can be readily loaded in Python with the Numpy package using the numpy.load command. These files are named "data_new_rrQR_i_n_M_kxk_fixn4.npy", and contain a [Nsim, 1+k*k+4] sized array, where Nsim is the number of simulated unit cells. Each row corresponds to a unit cell. The columns are organized as follows:

    • col 0: label number to keep track
    • col 1 - k*k+1: flattened unit cell design, numpy.reshape should bring it back to its original \(k \times k\) form.
    • col k*k+1 - k*k+5: number of zero modes \(M_k(n)\) in ascending order of \(n\), so: \(\{M_k(1), M_k(2), M_k(3), M_k(4)\}\).

    Note: the unit cell design uses the numbers \(\{0, 1, 2, 3\}\) to refer to each building block orientation. The building block orientations can be characterized through the orientation of the missing diagonal bar (see Fig. 2 in the paper), which can be Left Up (LU), Left Down (LD), Right Up (RU), or Right Down (RD). The numbers correspond to the building block orientation \(\{0, 1, 2, 3\} = \{\mathrm{LU, RU, RD, LD}\}\).
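
    A minimal sketch of reading one of these files, assuming k = 3 and an index of 0 in place of the "i" placeholder in the file name:

        import numpy as np

        k = 3
        data = np.load("data_new_rrQR_0_n_M_3x3_fixn4.npy")  # shape [Nsim, 1 + k*k + 4]

        labels = data[:, 0]                              # col 0: label number
        designs = data[:, 1:k*k + 1].reshape(-1, k, k)   # flattened designs -> k x k
        modes = data[:, k*k + 1:k*k + 5]                 # M_k(1) ... M_k(4)

        print(designs[0])  # entries in {0, 1, 2, 3} = {LU, RU, RD, LD}
        print(modes[0])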

    Simulation data for \(3 \leq k \leq 5\) and \(1 \leq n \leq 6\) for unit cells that cannot be classified as class C or I for \(1 \leq n \leq 4\) is stored in numpy array format (.npy) and can be readily loaded in Python with the Numpy package using the numpy.load command. These files are named "data_new_rrQR_i_n_M_kxk_fixn4_classX_extend.npy", and contain a [Nsim, 1+k*k+6] sized array, where Nsim is the number of simulated unit cells. Each row corresponds to a unit cell. The columns are organized as follows:

    • col 0: label number to keep track
    • col 1 - k*k+1: flattened unit cell design, numpy.reshape should bring it back to its original \(k \times k\) form.
    • col k*k+1 - k*k+7: number of zero modes \(M_k(n)\) in ascending order of \(n\), so: \(\{M_k(1), M_k(2), M_k(3), M_k(4), M_k(5), M_k(6)\}\).

    Simulation data for \(6 \leq k \leq 8\) unit cells are stored in numpy array format (.npy) and can be readily loaded in Python with the Numpy package using the numpy.load command. Note that the number of modes is now calculated for \(n_x \times n_y\) metamaterials, where we calculate \((n_x, n_y) = \{(1,1), (2, 2), (3, 2), (4,2), (2, 3), (2, 4)\}\) rather than \(n_x=n_y=n\) to save computation time. These files are named "data_new_rrQR_i_n_Mx_My_n4_kxk(_extended).npy", and contain a [Nsim, 1+k*k+8] sized array, where Nsim is the number of simulated unit cells. Each row corresponds to a unit cell. The columns are organized as follows:

    • col 0: label number to keep track
    • col 1 - k*k+1: flattened unit cell design, numpy.reshape should bring it back to its original \(k \times k\) form.
    • col k*k+1 - k*k+9: number of zero modes \(M_k(n_x, n_y)\) in order: \(\{M_k(1, 1), M_k(2, 2), M_k(3, 2), M_k(4, 2), M_k(1, 1), M_k(2, 2), M_k(2, 3), M_k(2, 4)\}\).

    Simulation data of metamaterial M1 for \(k_x \times k_y\) metamaterials are stored in compressed numpy array format (.npz) and can be loaded in Python with the Numpy package using the numpy.load command. These files are named "smiley_cube_x_y_\(k_x\)x\(k_y\).npz", which contain all possible metamaterial designs, and "smiley_cube_uniform_sample_x_y_\(k_x\)x\(k_y\).npz", which contain uniformly sampled metamaterial designs. The configurations are accessed with the keyword argument 'configs'. The classification is accessed with the keyword argument 'compatible'. The configurations array is of shape [Nsim, \(k_x\), \(k_y\)], the classification array is of shape [Nsim]. The building blocks in the configuration are denoted by 0 or 1, which correspond to the red/green and white/dashed building blocks respectively. Classification is 0 or 1, which corresponds to I and C respectively.
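
    A minimal sketch of reading one of these compressed archives, assuming \(k_x = k_y = 2\); the keyword arguments are as documented above:

        import numpy as np

        archive = np.load("smiley_cube_x_y_2x2.npz")
        configs = archive["configs"]      # shape [Nsim, k_x, k_y], entries 0 or 1
        classes = archive["compatible"]   # shape [Nsim], 0 = class I, 1 = class C

        print(configs.shape, classes.mean())  # fraction of class-C designs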

    Modescaling_classification_results.zip

    This file contains the classification, slope, and offset of the scaling of the number of zero modes \(M_k(n)\) for the unit cells of metamaterial M2 in Modescaling_raw_data.zip. The data is organized as follows.

    The results for \(3 \leq k \leq 5\) based on the \(1 \leq n \leq 4\) mode scaling data are stored in "results_analysis_new_rrQR_i_Scen_slope_offset_M1k_kxk_fixn4.txt". The data can be loaded using ',' as delimiter. Every row corresponds to a unit cell design (see the label number to compare to the earlier data). The columns are organized as follows:

    • col 0: label number to keep track
    • col 1: the class, where 0 corresponds to class I, 1 to class C and 2 to class X (neither class I nor C for \(1 \leq n \leq 4\))
    • col 2: slope from \(n \geq 2\) onward (undefined for class X)
    • col 3: the offset, defined as \(M_k(2) - 2 \cdot \mathrm{slope}\)
    • col 4: \(M_k(1)\)

    The results for \(3 \leq k \leq 5\) based on the extended \(1 \leq n \leq 6\) mode scaling data are stored in "results_analysis_new_rrQR_i_Scen_slope_offset_M1k_kxk_fixn4_classC_extend.txt". The data can be loaded using ',' as delimiter. Every row corresponds to a unit cell design (see the label number to compare to the earlier data). The columns are organized as follows:

    • col 0: label number to keep track
    • col 1: the class, where 0 corresponds to class I, 1 to class C and 2 to class X (neither class I nor C for \(1 \leq n \leq 6\))
    • col 2: slope from \(n \geq 2\) onward (undefined for class X)
    • col 3: the offset, defined as \(M_k(2) - 2 \cdot \mathrm{slope}\)
    • col 4: \(M_k(1)\)

    The results for \(6 \leq k \leq 8\) based on the \(1 \leq n \leq 4\) mode scaling data are stored in "results_analysis_new_rrQR_i_Scenx_Sceny_slopex_slopey_offsetx_offsety_M1k_kxk(_extended).txt". The data can be loaded using ',' as delimiter. Every row corresponds to a unit cell design (see the label number to compare to the earlier data). The columns are organized as follows:

    • col 0: label number to keep track
    • col 1: the class_x based on \(M_k(n_x, 2)\), where 0 corresponds to class I, 1 to class C and 2 to class X (neither class I nor C for \(1 \leq n_x \leq 4\))
    • col 2: the class_y based on \(M_k(2, n_y)\), where 0 corresponds to class I, 1 to class C and 2 to class X (neither class I nor C for \(1 \leq n_y \leq 4\))
    • col 3: slope_x from \(n_x \geq 2\) onward (undefined for class X)
    • col 4: slope_y from \(n_y \geq 2\) onward (undefined for class X)
    • col 5: the offset_x, defined as \(M_k(2, 2) - 2 \cdot \mathrm{slope_x}\)
    • col 6: the offset_y, defined as \(M_k(2, 2) - 2 \cdot \mathrm{slope_y}\)
    • col 7: \(M_k(1, 1)\)
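
    A minimal sketch of parsing one of these comma-delimited results files, assuming k = 3 and an index of 0 in place of the "i" placeholder:

        import numpy as np

        results = np.genfromtxt(
            "results_analysis_new_rrQR_0_Scen_slope_offset_M1k_3x3_fixn4.txt",
            delimiter=",",
        )
        labels = results[:, 0].astype(int)   # col 0: label number
        classes = results[:, 1].astype(int)  # col 1: 0 = class I, 1 = C, 2 = X
        slopes = results[:, 2]               # col 2: undefined for class X
        print(np.bincount(classes))          # counts per class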

  4. CMAPSS Jet Engine Simulated Data

    • data.nasa.gov
    • data.staging.idas-ds1.appdat.jsc.nasa.gov
    • +1more
    Updated Oct 15, 2008
    + more versions
    Cite
    nasa.gov (2008). CMAPSS Jet Engine Simulated Data [Dataset]. https://data.nasa.gov/dataset/cmapss-jet-engine-simulated-data
    Dataset updated
    Oct 15, 2008
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    The data set consists of multiple multivariate time series. Each data set is further divided into training and test subsets. Each time series is from a different engine, i.e., the data can be considered to be from a fleet of engines of the same type. Each engine starts with different degrees of initial wear and manufacturing variation, which are unknown to the user. This wear and variation is considered normal, i.e., it is not considered a fault condition. There are three operational settings that have a substantial effect on engine performance. These settings are also included in the data. The data is contaminated with sensor noise.

    The engine is operating normally at the start of each time series and develops a fault at some point during the series. In the training set, the fault grows in magnitude until system failure. In the test set, the time series ends some time prior to system failure. The objective of the competition is to predict the number of remaining operational cycles before failure in the test set, i.e., the number of operational cycles after the last cycle that the engine will continue to operate. A vector of true Remaining Useful Life (RUL) values for the test data is also provided.

    The data are provided as a zip-compressed text file with 26 columns of numbers, separated by spaces. Each row is a snapshot of data taken during a single operational cycle; each column is a different variable. The columns correspond to: 1) unit number, 2) time in cycles, 3) operational setting 1, 4) operational setting 2, 5) operational setting 3, 6)-26) sensor measurements 1-21.

    • Data Set FD001: Train trajectories: 100; Test trajectories: 100; Conditions: ONE (Sea Level); Fault Modes: ONE (HPC Degradation)
    • Data Set FD002: Train trajectories: 260; Test trajectories: 259; Conditions: SIX; Fault Modes: ONE (HPC Degradation)
    • Data Set FD003: Train trajectories: 100; Test trajectories: 100; Conditions: ONE (Sea Level); Fault Modes: TWO (HPC Degradation, Fan Degradation)
    • Data Set FD004: Train trajectories: 248; Test trajectories: 249; Conditions: SIX; Fault Modes: TWO (HPC Degradation, Fan Degradation)

    Reference: A. Saxena, K. Goebel, D. Simon, and N. Eklund, "Damage Propagation Modeling for Aircraft Engine Run-to-Failure Simulation", in Proceedings of the 1st International Conference on Prognostics and Health Management (PHM08), Denver, CO, Oct 2008.
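
    For illustration, a minimal Python sketch of loading one training file and deriving a Remaining Useful Life target; the file name "train_FD001.txt" and the column names are assumptions about the zip contents, not taken from this listing:

        import pandas as pd

        # 26 space-separated columns: unit, cycle, 3 operational settings, 21 sensors.
        cols = (["unit", "cycle", "op_setting_1", "op_setting_2", "op_setting_3"]
                + [f"sensor_{i}" for i in range(1, 22)])
        train = pd.read_csv("train_FD001.txt", sep=r"\s+", header=None, names=cols)

        # In the training subset each unit runs to failure, so the RUL at any
        # cycle is that unit's final cycle minus the current cycle.
        train["RUL"] = train.groupby("unit")["cycle"].transform("max") - train["cycle"]
        print(train[["unit", "cycle", "RUL"]].head())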

  5. Modes Of Transport Dataset

    • universe.roboflow.com
    zip
    Updated Sep 26, 2024
    Cite
    TS (2024). Modes Of Transport Dataset [Dataset]. https://universe.roboflow.com/ts-2qpml/modes-of-transport/model/2
    Available download formats: zip
    Dataset updated
    Sep 26, 2024
    Dataset authored and provided by
    TS
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Cars and Bikes bounding boxes
    Description

    Modes Of Transport

    ## Overview
    
    Modes Of Transport is a dataset for object detection tasks - it contains Cars and Bikes annotations for 401 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  6. Grid Transformer Power Flow Historic Monthly

    • ukpowernetworks.opendatasoft.com
    Updated Mar 28, 2025
    Cite
    (2025). Grid Transformer Power Flow Historic Monthly [Dataset]. https://ukpowernetworks.opendatasoft.com/explore/dataset/ukpn-grid-transformer-operational-data-monthly/
    Dataset updated
    Mar 28, 2025
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction

    UK Power Networks maintains the 132kV voltage level network and below. An important part of the distribution network is the stepping down of voltage as it is moved towards the household; this is achieved using transformers. Transformers have a maximum rating for the utilisation of these assets based upon protection, overcurrent, switchgear, etc. This dataset contains the Grid Substation Transformers, also known as Bulk Supply Points, that typically step down voltage from 132kV to 33kV (occasionally down to 66kV or, more rarely, 20-25kV). These transformers can be viewed on the single line diagrams in our Long-Term Development Statements (LTDS), and the underlying data is then found in the LTDS tables.

    Care is taken to protect the private affairs of companies connected to the 33kV network, resulting in the redaction of certain transformers. Where redacted, we provide monthly statistics to continue to add value where possible. Where monthly statistics exist but half-hourly data is absent, that data has been redacted.

    This dataset provides monthly statistics across these named transformers from 2021 through to the previous month across our licence areas. The data are aligned with the same naming convention as the LTDS for improved interoperability. To find half-hourly current and power flow data for a transformer, use the 'tx_id' that can be cross-referenced in the Grid Transformers Half Hourly Dataset. If you want to download all this data, it is perhaps more convenient from our public sharepoint: Open Data Portal Library - Grid Transformers - All Documents (sharepoint.com)

    This dataset is part of a larger endeavour to share more operational data on UK Power Networks assets. Please visit our Network Operational Data Dashboard for more operational datasets.

    Methodological Approach

    The dataset is not derived; it is the measurements from our network stored in our historian. The measurements are taken from current transformers attached to the cable at the circuit breaker, and power is derived by combining this with the data from voltage transformers physically attached to the busbar. The historian stores datasets based on a report-by-exception process, such that a certain deviation from the present value must be reached before logging a point measurement to the historian. We extract the data following a 30-min time-weighted averaging method to get half-hourly values. Where there are no measurements logged in the period, the data provided is blank; due to the report-by-exception process, it may be appropriate to forward fill this data for shorter gaps.

    We developed a data redaction process to protect the privacy of companies according to the Utilities Act 2000 section 105.1.b, which requires UK Power Networks not to disclose information relating to the affairs of a business. For this reason, where the demand of a private customer is derivable from our data and that data is not already public information (e.g., data provided via Elexon on the Balancing Mechanism), we redact the half-hourly time series and provide only the monthly averages. This redaction process considers the correlation of all the data, of only corresponding periods where the customer is active, the first order difference of all the data, and the first order difference of only corresponding periods where the customer is active. Should any of these four tests have a high linear correlation, the data is deemed redacted. This process is not applied only to the circuit of the customer, but also to the surrounding circuits that would otherwise reveal the signal of that customer.

    The directionality of the data is not consistent within this dataset. Where directionality was ascertainable, we arrange the power data in the direction of the LTDS "from node" to the LTDS "to node". Measurements of current do not indicate directionality and are instead positive regardless of direction. In some circumstances, the polarity can be negative, and depends on the data commissioner's decision on what the operators in the control room might find most helpful in ensuring reliable and secure network operation.

    Quality Control Statement

    The data is provided "as is". In the design and delivery process adopted by the DSO, customer feedback and guidance is considered at each phase of the project. One of the earliest steers was that raw data was preferable. This means that we do not perform prior quality control screening on our raw network data. The result of this decision is that network rearrangements and other periods of non-intact running of the network are present throughout the dataset, which has the potential to misconstrue the true utilisation of the network, which is determined regulatorily by considering only intact running arrangements. Therefore, taking the maximum or minimum of these transformers is not a reliable method of correctly ascertaining the true utilisation. This does have the intended added benefit of giving a realistic view of how the network was operated. The critical feedback was that our customers have a desire to understand what would have been the impact to them under real operational conditions. As such, this dataset offers unique insight into that.

    Assurance Statement

    Creating this dataset involved a lot of human data imputation. At UK Power Networks, we have differing software to run the network operationally (ADMS) and to plan and study the network (PowerFactory). The measurement devices are intended primarily to inform the network operators of the real-time condition of the network, and importantly, the network drawings visible in the LTDS are a planning view, which differs from the operational one. To compile this dataset, we made the union between the two modes of operating manually. A team of data scientists, data engineers, and power system engineers manually identified the LTDS transformer from the single line diagram, identified the line name from LTDS Table 2a/b, then identified the same transformer in ADMS to identify the measurement data tags. This was then manually input into a spreadsheet. Any influential customers on that circuit were noted using ADMS and the single line diagrams. From there, Python code is used to perform the triage and compilation of the datasets. There is potential for human error during the manual data processing. These issues can include missing transformers, incorrectly labelled transformers, incorrectly identified measurement data tags, and incorrectly interpreted directionality. Whilst care has been taken to minimise the risk of these issues, they may persist in the provided dataset. Any uncertain behaviour observed by using this data should be reported to allow us to correct it as fast as possible.

    Additional Information

    Definitions of key terms related to this dataset can be found in the Open Data Portal Glossary. Download dataset information: Metadata (JSON). We would be grateful if you would submit a "reuse" case study to tell us what you did and how you used this dataset, if you found it useful. This enables us to drive our direction and gain better understanding of how we improve our data offering in the future. Click here for more information: Open Data Portal Reuses — UK Power Networks
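
    As a hedged illustration of the forward-filling suggested above, a short pandas sketch; the file name, column names and the 2-hour gap limit are assumptions, not part of the dataset specification:

        import pandas as pd

        # Hypothetical extract with one row per half-hour period.
        df = pd.read_csv("grid_transformer_half_hourly.csv", parse_dates=["timestamp"])
        series = df.set_index("timestamp")["power_mw"].asfreq("30min")

        # Forward-fill gaps of up to four half-hour steps (2 hours); longer
        # gaps are left blank, mirroring the report-by-exception logging.
        filled = series.ffill(limit=4)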

  7. Multi-Camera Action Dataset (MCAD)

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip +2
    Updated Jan 24, 2020
    Cite
    Wenhui Li; Yongkang Wong; An-An Liu; Yang Li; Yu-Ting Su; Mohan Kankanhalli (2020). Multi-Camera Action Dataset (MCAD) [Dataset]. http://doi.org/10.5281/zenodo.884592
    Available download formats: application/gzip, json, txt
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Wenhui Li; Yongkang Wong; An-An Liu; Yang Li; Yu-Ting Su; Mohan Kankanhalli
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Action recognition has received increasing attention from the computer vision and machine learning community over the last decades. Since then, the recognition task has evolved from single-view recordings under controlled laboratory environments to unconstrained environments (i.e., surveillance environments or user-generated videos). Furthermore, recent work has focused on other aspects of the action recognition problem, such as cross-view classification, cross-domain learning, multi-modality learning, and action localization. Despite the large variety of studies, we observed limited work that explores the open-set and open-view classification problem, which is a genuine inherent property of the action recognition problem. In other words, a well-designed algorithm should robustly identify an unfamiliar action as "unknown" and achieve similar performance across sensors with similar fields of view. The Multi-Camera Action Dataset (MCAD) is designed to evaluate the open-view classification problem under a surveillance environment.

    In our multi-camera action dataset, different from common action datasets, we use a total of five cameras, which can be divided into two types (Static and PTZ), to record actions. In particular, there are three Static cameras (Cam04, Cam05 & Cam06) with a fisheye effect and two Pan-Tilt-Zoom (PTZ) cameras (PTZ04 & PTZ06). A Static camera has a resolution of 1280×960 pixels, while a PTZ camera has a resolution of 704×576 pixels and a smaller field of view than a Static camera. What's more, we do not control the illumination environment. We even set two contrasting conditions (daytime and nighttime environments), which makes our dataset more challenging than many controlled datasets with strongly controlled illumination environments. The distribution of the cameras is shown in the picture on the right.

    We identified 18 single-person daily actions, with or without objects, which are inherited from the KTH, IXMAS, and TRECVID datasets, etc. The list and the definitions of the actions are shown in the table. These actions can be divided into 4 types: micro actions without objects (action IDs 01, 02, 05), micro actions with objects (action IDs 10, 11, 12, 13), intense actions without objects (action IDs 03, 04, 06, 07, 08, 09), and intense actions with objects (action IDs 14, 15, 16, 17, 18). We recruited a total of 20 human subjects. Each candidate repeated each action 8 times (4 times during the day and 4 times in the evening) under one camera. In the recording process, we used five cameras to record each action sample separately. During the recording stage we just told candidates the action name, and they could perform the action freely with their own habits, provided they performed the action in the field of view of the current camera. This makes our dataset much closer to reality. As a result, there is high intra-class variation among different action samples, as shown in the picture of action samples.

    URL: http://mmas.comp.nus.edu.sg/MCAD/MCAD.html

    Resources:

    • IDXXXX.mp4.tar.gz contains video data for each individual
    • boundingbox.tar.gz contains person bounding box for all videos
    • protocol.json contains the evaluation protocol
    • img_list.txt contains the download URLs for the images version of the video data
    • idt_list.txt contains the download URLs for the improved Dense Trajectory feature
    • stip_list.txt contains the download URLs for the STIP feature

    How to Cite:

    Please cite the following paper if you use the MCAD dataset in your work (papers, articles, reports, books, software, etc):

    • Wenhui Li, Yongkang Wong, An-An Liu, Yang Li, Yu-Ting Su, Mohan Kankanhalli
      Multi-Camera Action Dataset for Cross-Camera Action Recognition Benchmarking
      IEEE Winter Conference on Applications of Computer Vision (WACV), 2017.
      http://doi.org/10.1109/WACV.2017.28
  8. Prescription Drugs Introduced to Market

    • data.chhs.ca.gov
    csv, xlsx, zip
    Updated Jun 6, 2025
    Cite
    Department of Health Care Access and Information (2025). Prescription Drugs Introduced to Market [Dataset]. https://data.chhs.ca.gov/dataset/prescription-drugs-introduced-to-market
    Available download formats: xlsx(78989), xlsx(97853), xlsx(88082), csv(4193), xlsx(56740), zip, csv(209944), xlsx(87563), xlsx(138801)
    Dataset updated
    Jun 6, 2025
    Dataset authored and provided by
    Department of Health Care Access and Information
    Description

    This dataset provides data for new prescription drugs introduced to market in California with a Wholesale Acquisition Cost (WAC) that exceeds the Medicare Part D specialty drug cost threshold. Prescription drug manufacturers submit information to HCAI within a specified time period after a drug is introduced to market. Key data elements include the National Drug Code (NDC) administered by the FDA, a narrative description of marketing and pricing plans, and WAC, among other information. Manufacturers may withhold information that is not in the public domain. Note that prescription drug manufacturers are able to submit new drug reports for a prior quarter at any time. Therefore, the data set may include additional new drug report(s) from previous quarter(s).

    There are two types of New Drug data sets: Monthly and Annual. The Monthly data sets include the data in completed reports submitted by manufacturers for calendar year 2025, as of June 6, 2025. The Annual data sets include data in completed reports submitted by manufacturers for the specified year. The data sets may include reports that do not meet the specified minimum thresholds for reporting.

    The program regulations are available here: https://hcai.ca.gov/wp-content/uploads/2024/03/CTRx-Regulations-Text.pdf

    The data format and file specifications are available here: https://hcai.ca.gov/wp-content/uploads/2024/03/Format-and-File-Specifications-version-2.0-ada.pdf

    DATA NOTES: Due to recent changes in Excel capabilities, saving these files to .csv format is not recommended. If you do, the leading zeros in the NDC number column will be dropped when the file is imported back into Excel. If you need to save the data in a format other than .xlsx, use .txt.
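
    A minimal pandas sketch of reading a CSV export while preserving the leading zeros, per the note above; the file and column names are assumptions about the download layout:

        import pandas as pd

        # Force the NDC column to string so leading zeros survive the round trip.
        drugs = pd.read_csv("new_drugs_monthly.csv", dtype={"NDC Number": str})
        print(drugs["NDC Number"].head())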

  9. 33kV Circuit Operational Data Half Hourly - South Eastern Power Networks...

    • ukpowernetworks.opendatasoft.com
    Updated May 1, 2025
    + more versions
    Cite
    (2025). 33kV Circuit Operational Data Half Hourly - South Eastern Power Networks (SPN) [Dataset]. https://ukpowernetworks.opendatasoft.com/explore/dataset/ukpn-33kv-circuit-operational-data-half-hourly-spn/
    Dataset updated
    May 1, 2025
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction

    UK Power Networks maintains the 132kV voltage level network and below. An important part of the distribution network is distributing this electricity across our regions through circuits. Electricity enters our network through Super Grid Transformers at substations shared with National Grid, which we call Grid Supply Points. It is then sent across our 132kV circuits towards our grid substations and primary substations. From there, electricity is distributed along the 33kV circuits to bring it closer to the home. These circuits can be viewed on the single line diagrams in our Long-Term Development Statements (LTDS), and the underlying data is then found in the LTDS tables.

    This dataset provides half-hourly current and power flow data across these named circuits from 2021 through to the previous month in our South Eastern Power Networks (SPN) licence area. The data are aligned with the same naming convention as the LTDS for improved interoperability.

    Care is taken to protect the private affairs of companies connected to the 33 kV network, resulting in the redaction of certain circuits. Where redacted, we provide monthly statistics to continue to add value where possible. Where monthly statistics exist but half-hourly is absent, this data has been redacted.

    To find the circuit you are looking for, use the 'ltds_line_name' that can be cross-referenced in the 33kV Circuits Monthly Data, which describes, by month, which circuits were triaged, whether they could be made public, and what the monthly statistics are for that site.

    If you want to download all this data, it is perhaps more convenient from our public sharepoint: Sharepoint

    This dataset is part of a larger endeavour to share more operational data on UK Power Networks assets. Please visit our Network Operational Data Dashboard for more operational datasets.

    Methodological Approach

    The dataset is not derived; it is the measurements from our network stored in our historian. The measurements are taken from current transformers attached to the cable at the circuit breaker, and power is derived by combining this with the data from voltage transformers physically attached to the busbar. The historian stores datasets based on a report-by-exception process, such that a certain deviation from the present value must be reached before logging a point measurement to the historian. We extract the data following a 30-min time-weighted averaging method to get half-hourly values. Where there are no measurements logged in the period, the data provided is blank; due to the report-by-exception process, it may be appropriate to forward fill this data for shorter gaps.

    We developed a data redaction process to protect the privacy of companies according to the Utilities Act 2000 section 105.1.b, which requires UK Power Networks not to disclose information relating to the affairs of a business. For this reason, where the demand of a private customer is derivable from our data and that data is not already public information (e.g., data provided via Elexon on the Balancing Mechanism), we redact the half-hourly time series and provide only the monthly averages. This redaction process considers the correlation of all the data, of only corresponding periods where the customer is active, the first order difference of all the data, and the first order difference of only corresponding periods where the customer is active. Should any of these four tests have a high linear correlation, the data is deemed redacted. This process is not applied only to the circuit of the customer, but also to the surrounding circuits that would otherwise reveal the signal of that customer.

    The directionality of the data is not consistent within this dataset. Where directionality was ascertainable, we arrange the power data in the direction of the LTDS "from node" to the LTDS "to node". Measurements of current do not indicate directionality and are instead positive regardless of direction. In some circumstances, the polarity can be negative, and depends on the data commissioner's decision on what the operators in the control room might find most helpful in ensuring reliable and secure network operation.

    Quality Control Statement

    The data is provided "as is". In the design and delivery process adopted by the DSO, customer feedback and guidance is considered at each phase of the project. One of the earliest steers was that raw data was preferable. This means that we do not perform prior quality control screening on our raw network data. The result of this decision is that network rearrangements and other periods of non-intact running of the network are present throughout the dataset, which has the potential to misconstrue the true utilisation of the network, which is determined regulatorily by considering only intact running arrangements. Therefore, taking the maximum or minimum of these measurements is not a reliable method of correctly ascertaining the true utilisation. This does have the intended added benefit of giving a realistic view of how the network was operated. The critical feedback was that our customers have a desire to understand what would have been the impact to them under real operational conditions. As such, this dataset offers unique insight into that.

    Assurance Statement

    Creating this dataset involved a lot of human data imputation. At UK Power Networks, we have differing software to run the network operationally (ADMS) and to plan and study the network (PowerFactory). The measurement devices are intended primarily to inform the network operators of the real-time condition of the network, and importantly, the network drawings visible in the LTDS are a planning view, which differs from the operational one. To compile this dataset, we made the union between the two modes of operating manually. A team of data scientists, data engineers, and power system engineers manually identified the LTDS circuit from the single line diagram, identified the line name from LTDS Table 2a/b, then identified the same circuit in ADMS to identify the measurement data tags. This was then manually input into a spreadsheet. Any influential customers on that circuit were noted using ADMS and the single line diagrams. From there, Python code is used to perform the triage and compilation of the datasets. There is potential for human error during the manual data processing. These issues can include missing circuits, incorrectly labelled circuits, incorrectly identified measurement data tags, and incorrectly interpreted directionality. Whilst care has been taken to minimise the risk of these issues, they may persist in the provided dataset. Any uncertain behaviour observed by using this data should be reported to allow us to correct it as fast as possible.

    Additional Information

    Definitions of key terms related to this dataset can be found in the Open Data Portal Glossary. Download dataset information: Metadata (JSON). We would be grateful if you would submit a "reuse" case study to tell us what you did and how you used this dataset, if you found it useful. This enables us to drive our direction and gain better understanding of how we improve our data offering in the future. Click here for more information: Open Data Portal Reuses — UK Power Networks
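
    For illustration only, a minimal sketch of the four-test redaction check described above; the 0.8 threshold and the column handling are assumptions, not UK Power Networks' actual parameters:

        import pandas as pd

        def is_redacted(circuit: pd.Series, customer: pd.Series,
                        threshold: float = 0.8) -> bool:
            """Flag a circuit whose series would reveal a private customer's demand."""
            active = customer > 0  # periods where the customer is active
            tests = [
                circuit.corr(customer),                          # levels, all periods
                circuit[active].corr(customer[active]),          # levels, active periods
                circuit.diff().corr(customer.diff()),            # first difference, all
                circuit[active].diff().corr(customer[active].diff()),  # diff, active
            ]
            return any(abs(t) > threshold for t in tests if pd.notna(t))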

  10. Simple download service (Atom) of the dataset: Areas of a PLU or POS...

    • data.europa.eu
    • gimi9.com
    unknown
    Updated Feb 3, 2021
    + more versions
    Cite
    (2021). Simple download service (Atom) of the dataset: Areas of a PLU or POS document [Dataset]. https://data.europa.eu/data/datasets/fr-120066022-srv-c3249caa-5654-40ff-b6dc-ba9849d6b63c
    Available download formats: unknown
    Dataset updated
    Feb 3, 2021
    Description

    The Urban Planning Code defines four types of areas regulated in the local planning plan (R.123-5 to 8): urban areas (U), areas to be urbanised (AU), agricultural areas (A), and natural and forest areas (N). These areas shall be demarcated on one or more graphic documents, and a regulation is attached to each area. The by-law may lay down different rules, depending on whether the purpose of the construction relates to housing, hotel accommodation, offices, commerce, crafts, industry, agricultural or forestry operations, or warehouse functions. These categories are limited (Art. R.123-9).

    Areas already urbanised, where existing or under-construction public facilities have sufficient capacity to serve the buildings to be installed, are classified as U areas.

    The natural areas of the municipality may be classified as AU zones, which are intended to be opened for urbanisation depending on whether or not the existing equipment on the periphery is sufficient to serve the buildings to be installed. There are two types of AU zone: "constructible" and "inconstructible" areas.

    Areas of the municipality, whether or not equipped, that are to be protected due to the agronomic, biological or economic potential of agricultural land can be classified as A zones.

    Areas of the municipality, equipped or not, that are to be protected either because of the quality of the sites, natural habitats, landscapes and their interest (in particular from the aesthetic, historical or ecological point of view), or because of the existence of a forestry operation, or because of their nature as natural areas, can be classified as N zones. Within the N zones there can be: perimeters in which possibilities for the transfer of the right to build can be carried out (transfer of COS), and areas of limited size and capacity where construction is possible subject to conditions of implantation and density.

  11. NIST Structured Forms Reference Set of Binary Images (SFRS) - NIST Special...

    • catalog.data.gov
    • data.nist.gov
    • +2more
    Updated Jun 27, 2023
    + more versions
    Cite
    National Institute of Standards and Technology (2023). NIST Structured Forms Reference Set of Binary Images (SFRS) - NIST Special Database 2 [Dataset]. https://catalog.data.gov/dataset/nist-structured-forms-reference-set-of-binary-images-sfrs-nist-special-database-2-36b10
    Dataset updated
    Jun 27, 2023
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    The documents in this database are 12 different tax forms from the IRS 1040 Package X for the year 1988. These include Forms 1040, 2106, 2441, 4562, and 6251 together with Schedules A, B, C, D, E, F, and SE. Eight of these forms contain two pages or form faces; therefore, there are 20 different form faces represented in the database. The document images in this database appear to be real forms prepared by individuals, but the images have been automatically derived and synthesized using a computer.

  12. Satellite Electric Vehicle Dataset (TESLA, LUCID, RIVIAN)

    • datarade.ai
    .csv
    Updated Jan 21, 2023
    Cite
    Space Know (2023). Satellite Electric Vehicle Dataset (TESLA, LUCID, RIVIAN) [Dataset]. https://datarade.ai/data-products/satellite-electric-vehicle-dataset-tesla-lucid-rivian-space-know
    Available download formats: .csv
    Dataset updated
    Jan 21, 2023
    Dataset authored and provided by
    Space Know
    Area covered
    China, United States of America
    Description

    SpaceKnow uses satellite (SAR) data to capture activity in electric vehicles and automotive factories.

    Data is updated daily, has an average lag of 4-6 days, and history back to 2017.

    The insights provide level and change data that monitor the area covered by assembled light vehicles, in square meters.

    We offer 3 delivery options: CSV, API, and Insights Dashboard

    Available companies:

    • Rivian (NASDAQ: RIVN): indices for employee parking, logistics, logistic centers, product distribution & product in the US (see the use-case write-up on page 4)
    • TESLA (NASDAQ: TSLA): indices for product, logistics & employee parking for Fremont, Nevada, Shanghai, Texas, Berlin, and the global level
    • Lucid Motors (NASDAQ: LCID): indices for employee parking, logistics & product in the US

    Why get SpaceKnow's EV datasets?

    Monitor the company’s business activity: Near-real-time insights into the business activities of Rivian allow users to better understand and anticipate the company’s performance.

    Assess Risk: Use satellite activity data to assess the risks associated with investing in the company.

    Types of Indices Available

    Continuous Feed Index (CFI) is a daily aggregation of the area of metallic objects in square meters. There are two types of CFI indices. The first is CFI-R, which gives you level data: it shows how many square meters are covered by metallic objects (for example, assembled cars). The second is CFI-S, which gives you change data: it shows how many square meters have changed within the locations between two consecutive satellite images.

    How to interpret the data

    SpaceKnow indices can be compared with related economic indicators or KPIs, as sketched below. If the economic indicator is in monthly terms, perform a 30-day rolling sum and pick the last day of the month to compare with the economic indicator; each data point will then reflect approximately the sum of the month. If the economic indicator is in quarterly terms, perform a 90-day rolling sum and pick the last day of the 90-day window to compare with the economic indicator; each data point will then reflect approximately the sum of the quarter.
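
    A minimal pandas sketch of that monthly comparison; the file and column names are assumptions about the CSV delivery format:

        import pandas as pd

        idx = (pd.read_csv("spaceknow_ev_index.csv", parse_dates=["date"])
               .set_index("date"))

        # 30-day rolling sum of the daily change index, sampled at month end
        # ("ME"; use "M" on older pandas versions) for comparison with a
        # monthly economic indicator.
        rolling = idx["cfi_s"].rolling("30D").sum()
        monthly = rolling.resample("ME").last()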

    Product index

    This index monitors the area covered by manufactured cars. The larger the area covered by assembled cars, the larger and faster the production of a particular facility. The index rises as production increases.

    Product distribution index

    This index monitors the area covered by assembled cars that are ready for distribution. The index covers locations in the Rivian factory. Distribution is done via trucks and trains.

    Employee parking index

    Like the previous index, this one indicates the area covered by cars, but those belonging to factory employees. This index is a good indicator of factory construction, closures, and capacity utilization. The index rises as more employees work in the factory.

    Logistics index

    The index monitors the movement of materials-supply trucks at particular car factories.

    Logistics Centers index

    The index monitors the movement of supply trucks at warehouses.

    Where the data comes from

    SpaceKnow brings you an information advantage by applying machine learning and AI algorithms to synthetic aperture radar and optical satellite imagery. The company's infrastructure searches for and downloads new imagery every day, and the computations on the data take place within less than 24 hours.

    In contrast to traditional economic data, which are released in monthly and quarterly terms, SpaceKnow data is high-frequency and available daily. It is possible to observe the latest movements in the EV industry with just a 4-6 day lag, on average.

    The EV data help you to estimate the performance of the EV sector and the business activity of the selected companies.

    The backbone of SpaceKnow’s high-quality data is the locations from which data is extracted. All locations are thoroughly researched and validated by an in-house team of annotators and data analysts.

    Each individual location is precisely defined so that the resulting data does not contain noise such as surrounding traffic or changing vegetation with the season.

    We use radar imagery and our own algorithms, so the final indices are not devalued by weather conditions such as rain or heavy clouds.

    → Reach out to get a free trial

    Use Case - Rivian:

    SpaceKnow uses Rivian's quarterly production and delivery data as a benchmark. Rivian targeted production of 25,000 cars in 2022; to achieve this, the company had to increase production by 45% in Q4, to 10,683 cars. Actual Q4 production was 10,020 cars, so the target was narrowly missed, with total FY22 production reaching 24,337 cars.

    SpaceKnow indices help us to observe the company’s operations, and we are able to monitor if the company is set to meet its forecasts or not. We deliver five different indices for Rivian, and these indices observe logistic centers, employee parking lot, logistics, product, and prod...

  13. Primary Transformer Power Flow Historic Monthly

    • ukpowernetworks.opendatasoft.com
    Updated May 12, 2025
    Cite
    (2025). Primary Transformer Power Flow Historic Monthly [Dataset]. https://ukpowernetworks.opendatasoft.com/explore/dataset/ukpn-primary-transformer-power-flow-historic-monthly/
    Explore at:
    Dataset updated
    May 12, 2025
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction

    UK Power Networks maintains the network at the 132kV voltage level and below. An important part of the distribution network is stepping voltage down as it moves towards the household; this is achieved using transformers. Transformers have a maximum rating for the utilisation of these assets based upon protection, overcurrent, switchgear, etc. This dataset covers the primary substation transformers, which typically step voltage down from 33kV to 11kV (occasionally from 132kV to 11kV). These transformers can be viewed on the single line diagrams in our Long-Term Development Statements (LTDS), and the underlying data is found in the LTDS tables.

    Care is taken to protect the private affairs of companies connected to the 11kV network, resulting in the redaction of certain transformers. Where redacted, we provide monthly statistics to continue to add value where possible. Where monthly statistics exist but half-hourly data is absent, the half-hourly data has been redacted.

    This dataset provides monthly statistics across these named transformers from 2021 through to the previous month, across our license areas. The data follow the same naming convention as the LTDS for improved interoperability. To find half-hourly current and power flow data for a transformer, use the ‘tx_id’ that can be cross-referenced in the Primary Transformers Half Hourly dataset. If you want to download all of this data, it is perhaps more convenient to use our public SharePoint: Open Data Portal Library - Primary Transformers - All Documents (sharepoint.com).

    This dataset is part of a larger endeavour to share more operational data on UK Power Networks assets. Please visit our Network Operational Data Dashboard for more operational datasets.

    Methodological Approach

    The dataset is not derived; it consists of measurements from our network stored in our historian. The measurements are taken from current transformers attached to the cable at the circuit breaker, and power is derived by combining this with data from voltage transformers physically attached to the busbar. The historian stores data on a report-by-exception basis, such that a certain deviation from the present value must be reached before a point measurement is logged. We extract the data using a 30-minute time-weighted averaging method to obtain half-hourly values. Where no measurements were logged in the period, the data provided is blank; due to the report-by-exception process, it may be appropriate to forward-fill this data for shorter gaps.

    We developed a data redaction process to protect the privacy of companies in accordance with the Utilities Act 2000, section 105(1)(b), which requires UK Power Networks not to disclose information relating to the affairs of a business. For this reason, where the demand of a private customer is derivable from our data and that data is not already public information (e.g., data provided via Elexon on the Balancing Mechanism), we redact the half-hourly time series and provide only the monthly averages. The redaction process considers four tests: the correlation of all the data; the correlation of only the periods where the customer is active; the correlation of the first-order difference of all the data; and the correlation of the first-order difference of only the periods where the customer is active. Should any of these four tests show a high linear correlation, the data is deemed redacted. This process is applied not only to the circuit of the customer, but also to the surrounding circuits that would otherwise reveal the signal of that customer.

    The directionality of the data is not consistent within this dataset. Where directionality was ascertainable, we arrange the power data in the direction of the LTDS "from node" to the LTDS "to node". Measurements of current do not indicate directionality and are positive regardless of direction. In some circumstances the polarity can be negative, depending on the data commissioner's decision on what the operators in the control room might find most helpful in ensuring reliable and secure network operation.

    Quality Control Statement

    The data is provided "as is". In the design and delivery process adopted by the DSO, customer feedback and guidance is considered at each phase of the project. One of the earliest steers was that raw data was preferable. This means that we do not perform prior quality-control screening of our raw network data. As a result, network rearrangements and other periods of non-intact running of the network are present throughout the dataset, which has the potential to misconstrue the true utilisation of the network; true utilisation is determined regulatorily by considering only intact running arrangements. Therefore, taking the maximum or minimum across these transformers is not a reliable method of ascertaining the true utilisation. This does have the intended added benefit of giving a realistic view of how the network was operated: the critical feedback was that our customers want to understand what the impact on them would have been under real operational conditions, and this dataset offers unique insight into that.
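
    For the accompanying half-hourly data, a minimal sketch of the forward-filling suggested above (pandas assumed; the file and column names are hypothetical), limited to short gaps so that genuine outages remain visible:

    import pandas as pd

    # Hypothetical half-hourly export for a single transformer.
    df = pd.read_csv("primary_transformer_half_hourly.csv",
             parse_dates=["timestamp"], index_col="timestamp")

    # Enforce a regular half-hourly index so unlogged periods appear as NaN.
    df = df.asfreq("30min")

    # Under report-by-exception logging, short blanks often just mean "no change":
    # forward-fill, but only across gaps of up to 2 hours (4 half-hour slots).
    df["power_mw"] = df["power_mw"].ffill(limit=4)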

    Assurance Statement

    Creating this dataset involved substantial manual data work. At UK Power Networks, we use different software to run the network operationally (ADMS) and to plan and study the network (PowerFactory). The measurement devices are intended primarily to inform the network operators of the real-time condition of the network, and, importantly, the network drawings visible in the LTDS follow a planning view, which differs from the operational one. To compile this dataset, we made the union between the two modes of operating manually. A team of data scientists, data engineers, and power system engineers manually identified each LTDS transformer from the single line diagram, identified the line name from LTDS Table 2a/b, then identified the same transformer in ADMS to find the measurement data tags. This was then manually entered into a spreadsheet. Any influential customers on that circuit were noted using ADMS and the single line diagrams. From there, Python code performs the triage and compilation of the datasets. There is potential for human error in this manual processing: missing transformers, incorrectly labelled transformers, incorrectly identified measurement data tags, or incorrectly interpreted directionality. While care has been taken to minimise these risks, they may persist in the provided dataset. Any uncertain behaviour observed when using this data should be reported to allow us to correct it as quickly as possible.

    Additional information

    Definitions of key terms related to this dataset can be found in the Open Data Portal Glossary.

    Download dataset information: Metadata (JSON)

    We would be grateful if, should you find this dataset useful, you submitted a "reuse" case study to tell us what you did and how you used it. This enables us to drive our direction and gain a better understanding of how to improve our data offering in the future. Click here for more information: Open Data Portal Reuses — UK Power Networks

  14. Crop classification dataset for testing domain adaptation or distributional shift methods

    • zenodo.org
    • explore.openaire.eu
    • +1more
    bin, csv
    Updated May 13, 2022
    Cite
    Dan M. Kluger; Dan M. Kluger; Sherrie Wang; Sherrie Wang; David B. Lobell; David B. Lobell (2022). Crop classification dataset for testing domain adaptation or distributional shift methods [Dataset]. http://doi.org/10.5281/zenodo.6376160
    Explore at:
    Available download formats: bin, csv
    Dataset updated
    May 13, 2022
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Dan M. Kluger; Dan M. Kluger; Sherrie Wang; Sherrie Wang; David B. Lobell; David B. Lobell
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In this upload we share processed crop type datasets from both France and Kenya. These datasets can be helpful for testing and comparing various domain adaptation methods. The datasets are processed, used, and described in this paper: https://doi.org/10.1016/j.rse.2021.112488 (arXiv version: https://arxiv.org/pdf/2109.01246.pdf).

    In summary, each point in the uploaded datasets corresponds to a particular location. The label is the crop type grown at that location in 2017. The 70 processed features are based on Sentinel-2 satellite measurements at that location in 2017. The points in the France dataset come from 11 different departments (regions) in Occitanie, France, and the points in the Kenya dataset come from 3 different regions in Western Province, Kenya. Within each dataset there are notable shifts in the distribution of the labels and in the distribution of the features between regions. Therefore, these datasets can be helpful for testing and comparing methods that are designed to address such distributional shifts.
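
    For instance, the regional shifts can be inspected directly from the accompanying .csv files; a minimal sketch, assuming a dataframe with hypothetical "region", "crop_type", and "feature_*" column names (see Metadata.rtf for the actual schema):

    import pandas as pd

    df = pd.read_csv("france_crop_dataset.csv")

    # Per-region label distribution: differences across rows indicate label shift.
    print(pd.crosstab(df["region"], df["crop_type"], normalize="index").round(3))

    # Per-region means of the processed features: a first look at covariate shift.
    feature_cols = [c for c in df.columns if c.startswith("feature_")]
    print(df.groupby("region")[feature_cols].mean())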

    More details on the dataset and processing steps can be found in Kluger et al. (2021). Much of the processing was done to deal with Sentinel-2 measurements that were corrupted by cloud cover. For users interested in the raw multi-spectral time series data and in dealing with cloud cover issues on their own (rather than using the 70 processed features provided here), the raw dataset from Kenya can be found in Yeh et al. (2021), and the raw dataset from France can be made available upon request from the authors of this Zenodo upload.

    All of the data uploaded here can be found in "CropTypeDatasetProcessed.RData". We also post the dataframes and tables within that .RData file as separate .csv files for users who do not have R. The contents of each R object (or .csv file) is described in the file "Metadata.rtf".

    Preferred Citation:

    -Kluger, D.M., Wang, S., Lobell, D.B., 2021. Two shifts for crop mapping: Leveraging aggregate crop statistics to improve satellite-based maps in new regions. Remote Sens. Environ. 262, 112488. https://doi.org/10.1016/j.rse.2021.112488.

    -URL to this Zenodo post https://zenodo.org/record/6376160

  15. NIST Handprinted Forms and Characters - NIST Special Database 19

    • catalog.data.gov
    • datadiscoverystudio.org
    • +1more
    Updated Jun 27, 2023
    + more versions
    Cite
    National Institute of Standards and Technology (2023). NIST Handprinted Forms and Characters - NIST Special Database 19 [Dataset]. https://catalog.data.gov/dataset/nist-handprinted-forms-and-characters-nist-special-database-19-0f025
    Explore at:
    Dataset updated
    Jun 27, 2023
    Dataset provided by
    National Institute of Standards and Technologyhttp://www.nist.gov/
    Description

    Special Database 19 contains NIST's entire corpus of training materials for handprinted document and character recognition. It publishes handprinted sample forms from 3,600 writers, 810,000 character images isolated from their forms, ground-truth classifications for those images, reference forms for further data collection, and software utilities for image management and handling. There are two editions of the database. The first is the original database, with images in MIS or PCT format; it also includes software to open and manipulate the data. The second edition has all images in PNG format.

  16. Hong Kong Social Contact Dynamics

    • kaggle.com
    Updated Feb 5, 2023
    Cite
    The Devastator (2023). Hong Kong Social Contact Dynamics [Dataset]. https://www.kaggle.com/datasets/thedevastator/hong-kong-social-contact-dynamics
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 5, 2023
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    Hong Kong
    Description

    Hong Kong Social Contact Dynamics

    Understanding Age, Gender and Network Dynamics

    By [source]

    About this dataset

    This dataset provides an in-depth look at the dynamics of social interaction in Hong Kong. It contains comprehensive information on individuals, households, and interactions between individuals, such as their ages, genders, and the frequency and duration of contact. The data can be used to evaluate social and economic trends, behaviors, and dynamics at different levels. For example, it is well suited to recognising population-level trends such as the age and gender diversification of contacts, or to investigating the structure of social networks and the implications of contact patterns for health and economic outcomes. It also offers insights into distinct groups of people, including their permanent-residence, work, and leisure activities, by enabling one to understand interactions and contact dynamics within their respective populations. Ultimately, this dataset is key to attaining a comprehensive understanding of social contact dynamics and why these interactions are crucial in Hong Kong's society today.


    How to use the dataset

    This dataset provides detailed information about the social contact dynamics in Hong Kong. With this dataset, it is possible to gain a comprehensive understanding of the patterns of various forms of social contact - from permanent residence and work contacts to leisure contacts. This guide will provide an overview and guidelines on how to use this dataset for analysis.

    Exploring Trends and Dynamics:

    To begin exploring the trends and dynamics of social contact in Hong Kong, start by looking at demographic factors such as age, gender, ethnicity, and educational attainment associated with different types of contacts (permanent residence/work/leisure). Consider the frequency and duration of contacts within these segments to identify any potential differences between them. Additionally, look at how these factors interact with each other – observe which segments have higher levels of interaction with each other or if there are any differences between different population groups based on their demographic characteristics. This can be done through visualizations such as line graphs or bar charts which can illustrate trends across timeframes or population demographics more clearly than raw numbers would alone.

    Investigating Social Networks:

    The data collected through this dataset also allows for investigation of social networks – understanding who connects with whom, both in real-life interactions and through digital channels (if applicable). Focus on analyzing individual or family networks rather than larger groups in order to get a clearer picture without adding too much complexity to the analysis. Analyze commonalities among individuals within a network, even after controlling for factors that could affect interaction such as age or gender – utilize clustering techniques for this step if appropriate – then focus on comparing networks between individuals/families using graph-theory methods such as length distributions (the average number of relationships one has), degrees (the number of links connected to one individual or family unit), centrality measures (identifying individuals who serve an important bridging role between two different parts of the network), etc. These methods will help provide insight into the varying structures of larger groups, rather than focusing only on small-scale personal connections among friends, colleagues, or relatives, which may not always offer accurate portrayals due to their naturally limited scope. A sketch of these measures follows below.
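
    A minimal sketch of these measures using networkx (the edge-list file and column names are hypothetical):

    import networkx as nx
    import pandas as pd

    # Hypothetical edge list: one row per recorded contact between two individuals.
    contacts = pd.read_csv("contacts.csv")  # columns: person_a, person_b
    G = nx.from_pandas_edgelist(contacts, source="person_a", target="person_b")

    # Degree: the number of links connected to each individual.
    degrees = dict(G.degree())

    # Betweenness centrality: individuals bridging different parts of the network.
    centrality = nx.betweenness_centrality(G)

    # Average shortest path length within each connected component.
    for component in nx.connected_components(G):
      sub = G.subgraph(component)
      if len(sub) > 1:
        print(len(sub), nx.average_shortest_path_length(sub))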

    Modeling Health Implications:

    Finally, consider modeling health implications stemming from these observed patterns – particularly implications that may not be captured by simpler measures like count per contact hour (which does not differentiate based on intensity). Take into account aspects like viral transmission risk by analyzing secondary effects generated from contact events captured in the data, such as physical proximity when multiple people meet up over multiple days.

    Research Ideas

    • Analyzing the age, gender and contact dynamics of different areas within Hong Kong to understand the local population trends and behavior.
    • Investigating the structure of social networks to study how patterns of contact vary among socio economic backgro...
  17. Fostering cultures of open qualitative research: Dataset 2 – Interview Transcripts

    • orda.shef.ac.uk
    xlsx
    Updated Jun 28, 2023
    Cite
    Matthew Hanchard; Itzel San Roman Pineda (2023). Fostering cultures of open qualitative research: Dataset 2 – Interview Transcripts [Dataset]. http://doi.org/10.15131/shef.data.23567223.v2
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 28, 2023
    Dataset provided by
    The University of Sheffield
    Authors
    Matthew Hanchard; Itzel San Roman Pineda
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This dataset was created and deposited onto the University of Sheffield Online Research Data repository (ORDA) on 23-Jun-2023 by Dr. Matthew S. Hanchard, Research Associate at the University of Sheffield iHuman Institute. The dataset forms part of three outputs from a project titled ‘Fostering cultures of open qualitative research’ which ran from January 2023 to June 2023:

    · Fostering cultures of open qualitative research: Dataset 1 – Survey Responses
    · Fostering cultures of open qualitative research: Dataset 2 – Interview Transcripts
    · Fostering cultures of open qualitative research: Dataset 3 – Coding Book

    The project was funded with £13,913.85 of Research England monies held internally by the University of Sheffield - as part of their ‘Enhancing Research Cultures’ scheme 2022-2023.

    The dataset aligns with ethical approval granted by the University of Sheffield School of Sociological Studies Research Ethics Committee (ref: 051118) on 23-Jan-2021. This includes due concern for participant anonymity and data management.

    ORDA has full permission to store this dataset and to make it open access for public re-use on the basis that no commercial gain will be made from reuse. It has been deposited under a CC BY-NC licence. Overall, this dataset comprises:

    · 15 x Interview transcripts - in .docx file format which can be opened with Microsoft Word, Google Doc, or an open-source equivalent.

    All participants have read and approved their transcripts and have had an opportunity to retract details should they wish to do so.

    Participants chose whether to be pseudonymised or named directly. The pseudonym can be used to identify individual participant responses in the qualitative coding held within the ‘Fostering cultures of open qualitative research: Dataset 3 – Coding Book’ files.

    For recruitment, 14 participants were selected based on their responses to the project survey, whilst one participant was recruited based on specific expertise.

    · 1 x Participant sheet – in .csv format, which may be opened with Microsoft Excel, Google Sheets, or an open-source equivalent.

    This provides socio-demographic detail on each participant alongside their main field of research and career stage. It includes a RespondentID field/column which can be used to connect interview participants with their responses to the survey questions in the accompanying ‘Fostering cultures of open qualitative research: Dataset 1 – Survey Responses’ files, as sketched below.
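
    A minimal sketch of that join, assuming both datasets are exported as CSV (the file names here are hypothetical):

    import pandas as pd

    participants = pd.read_csv("dataset2_participant_sheet.csv")  # includes RespondentID
    survey = pd.read_csv("dataset1_survey_responses.csv")  # includes RespondentID

    # Attach socio-demographic detail to each interviewee's survey responses.
    linked = survey.merge(participants, on="RespondentID", how="inner")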

    The project was undertaken by two staff:

    Co-investigator: Dr. Itzel San Roman Pineda
    ORCiD ID: 0000-0002-3785-8057
    i.sanromanpineda@sheffield.ac.uk
    Postdoctoral Research Assistant
    Labelled as ‘Researcher 1’ throughout the dataset

    Principal Investigator (corresponding dataset author): Dr. Matthew Hanchard
    ORCiD ID: 0000-0003-2460-8638
    m.s.hanchard@sheffield.ac.uk
    Research Associate, iHuman Institute, Social Research Institutes, Faculty of Social Science
    Labelled as ‘Researcher 2’ throughout the dataset

  18. ModE-Sim - A medium size AGCM ensemble to study climate variability during the modern era (1420 to 2009): Set 1420-2: forcings

    • wdc-climate.de
    Updated Mar 7, 2023
    + more versions
    Cite
    Hand, Ralf; Brönnimann, Stefan; Samakinwa, Eric; Lipfert, Laura (2023). ModE-Sim - A medium size AGCM ensemble to study climate variability during the modern era (1420 to 2009): Set 1420-2: forcings [Dataset]. https://www.wdc-climate.de/ui/entry?acronym=ModE-Sim_s14202_forc
    Explore at:
    Dataset updated
    Mar 7, 2023
    Dataset provided by
    World Data Centerhttp://www.icsu-wds.org/
    Authors
    Hand, Ralf; Brönnimann, Stefan; Samakinwa, Eric; Lipfert, Laura
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 1420 - Dec 31, 1900
    Area covered
    Earth
    Variables measured
    aerosol_extinction, aerosol optical depth, sea_ice_area_fraction, sea_surface_temperature, aerosol effective radius, single_scattering_albedo, aerosol_scattering_asymmetry_factor
    Description

    This dataset provides the forcings and boundary conditions used for ModE-Sim Set 1420-2. The output for the individual ensemble members and the ensemble statistics can be found in the other datasets within this dataset group. Example run scripts of the simulations can be found in the second additional info file at the experiment level. Information on the experiment design and the variables included in this dataset can be found in the experiment summary and the additional information provided with it. For a detailed description of ModE-Sim, please refer to the documentation paper (reference provided in the summary at the experiment level).

  19. Spanish Open Ended Question Answer Text Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Spanish Open Ended Question Answer Text Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/spanish-open-ended-question-answer-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    The Spanish Open-Ended Question Answering Dataset is a meticulously curated collection of comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and Question-answering models in the Spanish language, advancing the field of artificial intelligence.

    Dataset Content:

    This QA dataset comprises a diverse set of open-ended questions paired with corresponding answers in Spanish. There is no context paragraph given to choose an answer from, and each question is answered without any predefined context content. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.

    Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native Spanish people, and references were taken from diverse sources like books, news articles, websites, and other reliable references.

    This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.

    Question Diversity:

    To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. Additionally, questions are further classified into fact-based and opinion-based categories, creating a comprehensive variety. The QA dataset also contains the question with constraints and persona restrictions, which makes it even more useful for LLM training.

    Answer Formats:

    To accommodate varied learning experiences, the dataset incorporates different types of answer formats, including single-word, short-phrase, single-sentence, and paragraph-length answers. Answers contain text strings, numerical values, and date and time formats as well. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.

    Data Format and Annotation Details:

    This fully labeled Spanish Open Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as id, language, domain, question_length, prompt_type, question_category, question_type, complexity, answer_type, rich_text.
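
    For example, a record from the JSON export could be inspected as follows (a sketch: the field names are those listed above, while the file name and the assumption of a top-level JSON array are hypothetical):

    import json

    # Assumes the export is a top-level JSON array of records.
    with open("spanish_open_ended_qa.json", encoding="utf-8") as f:
      records = json.load(f)

    record = records[0]
    for field in ("id", "language", "domain", "question_length", "prompt_type",
           "question_category", "question_type", "complexity",
           "answer_type", "rich_text"):
      print(field, "->", record[field])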

    Quality and Accuracy:

    The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.

    Both the questions and answers in Spanish are grammatically accurate, without spelling or grammatical errors. No copyrighted, toxic, or harmful content was used while building this dataset.

    Continuous Updates and Customization:

    The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.

    License:

    The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy Spanish Open Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.

  20. RODEM Jet Datasets

    • zenodo.org
    application/gzip
    Updated Aug 23, 2024
    Cite
    Knut Zoch; Knut Zoch; Debajyoti Sengupta; Debajyoti Sengupta; John Andrew Raine; John Andrew Raine; Tobias Golling; Tobias Golling (2024). RODEM Jet Datasets [Dataset]. http://doi.org/10.5281/zenodo.12793616
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Aug 23, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Knut Zoch; Knut Zoch; Debajyoti Sengupta; Debajyoti Sengupta; John Andrew Raine; John Andrew Raine; Tobias Golling; Tobias Golling
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A detailed description of the RODEM Jet Datasets is provided at arXiv:2408.11616.

    Jet types

    There are five different types of datasets:

    1. Light jets: simulated via QCD dijet events (QCD.tar.gz)
    2. Jets from W bosons: simulated via WZ production (WZ.tar.gz)
    3. Jets from top quarks: simulated via ttbar production (ttbar.tar.gz)
    4. Semi-visible jets: simulated via dark-sector quarks (SIMP.tar.gz)
    5. Resonant Higgs boson production: simulated via type-II two-Higgs-doublet models (2HDM.tar.gz)

    The tar.gz archives contain files in the HDF5 format, compressed using 7z. For types 1 to 4, validation and training splits of 5% of the total event count are provided. The remaining events are split into (decompressed) chunks no larger than 8GB.

    For the 2HDM models, two production modes (via g-g fusion and b-bbar annihilation) and two decay modes (h --> jj and t --> tb) are simulated. In addition, various heavy-Higgs and light-Higgs mass combinations were produced.

    Dataset content

    All HDF5 files contain four dataset objects:

    1. jet1_obs – observables for the leading jet
    2. jet1_cnsts – constituent array for the leading jet
    3. jet2_obs – observables for the subleading jet
    4. jet2_cnsts – constituent array for the subleading jet

    The latter two are not present in the WZ files.

    The observable dataset objects contain one row per event with 11 entries (in this order): pT, eta, phi, mass, tau1, tau2, tau3, d12, d23, ECF2, ECF3 (for details on the calculation, see arXiv).

    The constituent dataset objects contain 100 rows per event with seven entries each. The 100 rows represent (up to) 100 jet constituents; if the jet has fewer, the rows are zero-padded. The seven entries per row are (in this order): pT, eta, phi, mass, charge, D0, DZ (for details, see arXiv).

    Usage Example

    The following snippet loads 100,000 jets and their constituents from one of the QCD input files, then creates distributions of the jet transverse momenta and the number of constituents:

    import h5py
    import numpy as np
    import matplotlib.pyplot as plt
    
    # The input HDF5 file containing the QCD jets.
    input_qcd = "h5files/QCDjj_pT_450_1200_train01.h5"
    
    # The number of jets to load.
    n_jets = 100_000
    
    
    def load_jets(ifile: str, n_jets: int):
      """Load jets and constituents from an HDF5 file."""
      with h5py.File(ifile, "r") as f:
        cnsts = f["objects/jets/jet1_cnsts"][:n_jets]
        jets = f["objects/jets/jet1_obs"][:n_jets]
    
      # Constituent arrays are zero-padded: pT == 0 marks an unused slot.
      # Broadcast that flag across all feature columns and mask the padding.
      zeros = np.repeat(cnsts[:, :, 0] == 0, cnsts.shape[2])
      zeros = zeros.reshape(-1, cnsts.shape[1], cnsts.shape[2])
      cnsts = np.ma.masked_where(zeros, cnsts)
      return jets, cnsts
    
    qcd_jets, qcd_constituents = load_jets(input_qcd, n_jets=n_jets)
    
    # Plot the transverse momentum of the jets.
    plt.hist(qcd_jets[:, 0], label="QCD jets", bins=30)
    plt.xlabel(r"$p_{\mathrm{T}}$ [GeV]")
    plt.ylabel("Number of jets")
    plt.show()
    
    # Plot the number of constituents in the jets.
    plt.hist(qcd_constituents.count(axis=1)[:, 0], label="QCD jets", bins=100, range=(0.5, 100.5))
    plt.xlabel("Number of constituents")
    plt.ylabel("Number of jets")
    plt.show()

    Citing this work

    Please cite the work as follows:

    K. Zoch, J. A. Raine, D. Sengupta, and T. Golling. RODEM Jet Datasets. Available on Zenodo: 10.5281/zenodo.12793616. Aug. 2024. arXiv: 2408.11616 [hep-ph].

    Bibtex entry:

    @misc{Zoch:2024eyp,
      author = "Zoch, Knut and Raine, John Andrew and Sengupta, Debajyoti and Golling, Tobias",
      title = "{RODEM Jet Datasets}",
      eprint = "2408.11616",
      archivePrefix = "arXiv",
      primaryClass = "hep-ph",
      month = "8",
      year = "2024",
      note = "Available on Zenodo: \href{https://doi.org/10.5281/zenodo.12793616}{10.5281/zenodo.12793616}."
    }
     

Dataset for the paper: "Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims"


References

If you use this dataset in any publication, project, tool or in any other form, please, cite the following papers:

@inproceedings{SrbaMonantPlatform,
  author = {Srba, Ivan and Moro, Robert and Simko, Jakub and Sevcech, Jakub and Chuda, Daniela and Navrat, Pavol and Bielikova, Maria},
  booktitle = {Proceedings of Workshop on Reducing Online Misinformation Exposure (ROME 2019)},
  pages = {1--7},
  title = {Monant: Universal and Extensible Platform for Monitoring, Detection and Mitigation of Antisocial Behavior},
  year = {2019}
}

@inproceedings{SrbaMonantMedicalDataset,
  author = {Srba, Ivan and Pecher, Branislav and Tomlein, Matus and Moro, Robert and Stefancova, Elena and Simko, Jakub and Bielikova, Maria},
  booktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22)},
  numpages = {11},
  title = {Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims},
  year = {2022},
  doi = {10.1145/3477495.3531726},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3477495.3531726}
}

Dataset creation process

In order to create this dataset (and to continuously obtain new data), we used our research platform Monant. The Monant platform provides so-called data providers to extract news articles/blog posts from news/blog sites as well as fact-checking articles from fact-checking sites. General parsers (for RSS feeds, Wordpress sites, the Google Fact Check Tool, etc.) as well as custom crawlers and parsers were implemented (e.g., for the fact-checking site Snopes.com). All data is stored in a unified format in a central data storage.

Ethical considerations

The dataset was collected and is published for research purposes only. We collected only publicly available content of news/blog articles. The dataset contains identities of authors of the articles if they were stated in the original source; we left this information, since the presence of an author's name can be a strong credibility indicator. However, we anonymised the identities of the authors of discussion posts included in the dataset.

The main identified ethical issue related to the presented dataset lies in the risk of mislabelling of an article as supporting a false fact-checked claim and, to a lesser extent, in mislabelling an article as not containing a false claim or not supporting it when it actually does. To minimise these risks, we developed a labelling methodology and require an agreement of at least two independent annotators to assign a claim presence or article stance label to an article. It is also worth noting that we do not label an article as a whole as false or true. Nevertheless, we provide partial article-claim pair veracities based on the combination of claim presence and article stance labels.

As to the veracity labels of the fact-checked claims and the credibility (reliability) labels of the articles' sources, we take these from the fact-checking sites and external listings such as Media Bias/Fact Check as they are and refer to their methodologies for more details on how they were established.

Lastly, the dataset also contains automatically predicted labels of claim presence and article stance using our baselines described in the next section. These methods have their limitations and work with certain accuracy as reported in this paper. This should be taken into account when interpreting them.

Reporting mistakes in the dataset

The means to report considerable mistakes in raw collected data or in manual annotations is to create a new issue in the accompanying Github repository. Alternatively, general enquiries or requests can be sent to info [at] kinit.sk.

Dataset structure

Raw data

At first, the dataset contains so-called raw data (i.e., data extracted by the Web monitoring module of the Monant platform and stored in exactly the same form as it appears at the original websites). Raw data consist of articles from news sites and blogs (e.g., naturalnews.com), discussions attached to such articles, and fact-checking articles from fact-checking portals (e.g., snopes.com). In addition, the dataset contains feedback (numbers of likes, shares, and comments) provided by users on the social network Facebook, which is regularly extracted for all news/blog articles.

Raw data are contained in these CSV files (and corresponding REST API endpoints):

sources.csv

articles.csv

article_media.csv

article_authors.csv

discussion_posts.csv

discussion_post_authors.csv

fact_checking_articles.csv

fact_checking_article_media.csv

claims.csv

feedback_facebook.csv

Note: Personal information about discussion posts' authors (name, website, gravatar) are anonymised.

Annotations

Secondly, the dataset contains so-called annotations. Entity annotations describe individual raw-data entities (e.g., an article or a source). Relation annotations describe a relation between two such entities.

Each annotation is described by the following attributes:

category of annotation (annotation_category). Possible values: label (annotation corresponds to ground truth, determined by human experts) and prediction (annotation was created by means of AI method).

type of annotation (annotation_type_id). Example values: Source reliability (binary), Claim presence. The list of possible values can be obtained from enumeration in annotation_types.csv.

method which created the annotation (method_id). Example values: Expert-based source reliability evaluation, Fact-checking article to claim transformation method. The list of possible values can be obtained from the enumeration in methods.csv.

its value (value). The value is stored in JSON format and its structure differs according to particular annotation type.

At the same time, annotations are associated with a particular object identified by:

entity type (parameter entity_type in case of entity annotations, or source_entity_type and target_entity_type in case of relation annotations). Possible values: sources, articles, fact-checking-articles.

entity id (parameter entity_id in case of entity annotations, or source_entity_id and target_entity_id in case of relation annotations).

The dataset provides specifically these entity annotations:

Source reliability (binary). Determines validity of source (website) at a binary scale with two options: reliable source and unreliable source.

Article veracity. Aggregated information about veracity from article-claim pairs.

The dataset provides specifically these relation annotations:

Fact-checking article to claim mapping. Determines mapping between fact-checking article and claim.

Claim presence. Determines presence of claim in article.

Claim stance. Determines stance of an article to a claim.

Annotations are contained in these CSV files (and corresponding REST API endpoints):

entity_annotations.csv

relation_annotations.csv

Note: Identification of human annotators authors (email provided in the annotation app) is anonymised.
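
A minimal sketch of reading the annotation CSVs and parsing the JSON-encoded value column (pandas assumed; the column names are the annotation attributes described above):

import json
import pandas as pd

annotations = pd.read_csv("entity_annotations.csv")

# Keep only expert-assigned (ground truth) annotations on sources.
labels = annotations[
  (annotations["annotation_category"] == "label")
  & (annotations["entity_type"] == "sources")
]

# The value column is stored as JSON; its structure differs per annotation type.
parsed = labels["value"].apply(json.loads)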
