Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 1 row and is filtered where the book is Python in a nutshell : a desktop quick reference. It features 7 columns including author, publication date, language, and book publisher.
This is a dataset for classifying citation intents in academic papers. The main citation intent label for each JSON object is specified with the label key, while the citation context is specified with the context key. Example:
{
  'string': 'In chacma baboons, male-infant relationships can be linked to both
             formation of friendships and paternity success [30,31].',
  'sectionName': 'Introduction',
  'label': 'background',
  'citingPaperId': '7a6b2d4b405439',
  'citedPaperId': '9d1abadc55b5e0',
  ...
}
You may obtain the full information about the paper using the provided paper ids with the Semantic Scholar API (https://api.semanticscholar.org/).
The labels are: Method, Background, Result
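Example of looking up one of these papers (a minimal sketch, not part of the dataset docs; the Graph API endpoint and field list are assumptions based on the public Semantic Scholar API documentation, and `example` stands for a parsed dataset record):
import requests

def fetch_paper(paper_id, fields="title,year,authors"):
    # Query the Semantic Scholar Graph API for basic metadata about a paper id.
    url = f"https://api.semanticscholar.org/graph/v1/paper/{paper_id}"
    resp = requests.get(url, params={"fields": fields}, timeout=30)
    resp.raise_for_status()
    return resp.json()

# e.g. fetch_paper(example["citedPaperId"])["title"]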
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('scicite', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
This is a sample project highlighting some basic methodologies in working with the DataCite public data file and Data Citation Corpus on Redivis.
Using the transform interface, we extract all records associated with DOIs for Stanford datasets on Redivis. We then make a simple plot using a Python notebook to see DOI issuance over time. The nested nature of some of the public data file fields makes exploration a bit challenging; future work could break this dataset into multiple related tables for easier analysis.
We can also join with the Data Citation Corpus to find all citations referencing Stanford-on-Redivis DOIs (the citation corpus is a work in progress, and doesn't currently capture many of the citations in the literature).
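As an illustration of the plotting step, here is a minimal sketch that assumes the extracted records have already been pulled into a pandas DataFrame with a DOI creation date column (the column name and sample values below are hypothetical, not the actual DataCite schema):
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical records for Stanford-on-Redivis DOIs; only the creation date matters here.
dois = pd.DataFrame({"created": pd.to_datetime(["2021-03-01", "2021-07-15", "2022-02-10"])})

# Count DOIs issued per year and plot the trend.
per_year = dois["created"].dt.year.value_counts().sort_index()
per_year.plot(kind="bar", xlabel="Year", ylabel="DOIs issued", title="DOI issuance over time")
plt.tight_layout()
plt.show()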
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
These files accompany an article published in the Law and Courts Newsletter.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about book subjects. It has 1 row and is filtered where the book is Python pocket reference. It features 10 columns including number of authors, number of books, earliest publication date, and latest publication date.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about book subjects. It has 1 row and is filtered where the book is Python : the complete reference. It features 4 columns including authors, books, and publication dates.
The MNIST database of handwritten digits.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('mnist', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/mnist-3.0.1.png
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset represents structured metadata and contextual information about references added to Wikipedia articles, in a JSON format. Each record represents an individual Wikipedia article revision with all the tags parsed, as stored in Wikipedia's XML dumps, including information about: 1) the context(s) in which the reference occurs within the article, such as the surrounding text, parent section title, and section level; 2) structured data and bibliographic metadata included within the reference itself (such as any citation template used, external links, and any known persistent identifiers); and 3) additional data/metadata about the reference itself (the reference name, its raw content, and, if applicable, the revision ID associated with the reference addition/deletion/change). The data is available as a set of compressed JSON files, extracted from the July 1, 2017 XML dump of English Wikipedia. Other languages may be added to this dataset in the future. The JSON schema and Python parsing libraries used to generate the data are in the references.
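A rough sketch for inspecting one of the compressed JSON files (assuming one JSON record per line; the filename below is hypothetical and the printed keys are whatever the actual schema defines):
import gzip
import json

# Stream a few records from one compressed dump file (path is illustrative).
with gzip.open("enwiki-20170701-references-part1.json.gz", "rt", encoding="utf-8") as fh:
    for i, line in enumerate(fh):
        record = json.loads(line)
        print(sorted(record.keys()))  # list the fields available on each record
        if i >= 2:
            break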
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The NewsMediaBias-Plus dataset is designed for the analysis of media bias and disinformation by combining textual and visual data from news articles. It aims to support research in detecting, categorizing, and understanding biased reporting in media outlets.
NewsMediaBias-Plus pairs news articles with relevant images and annotations indicating perceived biases and the reliability of the content. It adds a multimodal dimension for bias detection in news media.
unique_id: Unique identifier for each news item. Each unique_id matches an image for the same article.
outlet: The publisher of the article.
headline: The headline of the article.
article_text: The full content of the news article.
image_description: Description of the paired image.
image: The file path of the associated image.
date_published: The date the article was published.
source_url: The original URL of the article.
canonical_link: The canonical URL of the article.
new_categories: Categories assigned to the article.
news_categories_confidence_scores: Confidence scores for each category.
text_label: Indicates the likelihood of the article being disinformation:
  Likely: Likely to be disinformation.
  Unlikely: Unlikely to be disinformation.
multimodal_label: Indicates the likelihood of disinformation from the combination of the text snippet and image content:
  Likely: Likely to be disinformation.
  Unlikely: Unlikely to be disinformation.
Load the dataset into Python:
from datasets import load_dataset
ds = load_dataset("vector-institute/newsmediabias-plus")
print(ds) # View structure and splits
print(ds['train'][0]) # Access the first record of the train split
print(ds['train'][:5]) # Access the first five records
from datasets import load_dataset
# Load the dataset in streaming mode
streamed_dataset = load_dataset("vector-institute/newsmediabias-plus", streaming=True)
# Get an iterable dataset
dataset_iterable = streamed_dataset['train'].take(5)
# Print the records
for record in dataset_iterable:
  print(record)
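Building on the fields above, a small follow-on sketch filters the streamed split down to articles annotated as likely disinformation:
from datasets import load_dataset

streamed_dataset = load_dataset("vector-institute/newsmediabias-plus", streaming=True)

# Keep only records whose text-level annotation marks them as likely disinformation.
likely = streamed_dataset["train"].filter(lambda record: record["text_label"] == "Likely")

for record in likely.take(3):
    print(record["headline"], "->", record["text_label"])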
Contributions are welcome! To contribute, fork the repository and create a pull request with your changes.
This dataset is released under a non-commercial license. See the LICENSE file for more details.
Please cite the dataset using this BibTeX entry:
@misc{vector_institute_2024_newsmediabias_plus,
title={NewsMediaBias-Plus: A Multimodal Dataset for Analyzing Media Bias},
author={Vector Institute Research Team},
year={2024},
url={https://huggingface.co/datasets/vector-institute/newsmediabias-plus}
}
For questions or support, contact Shaina Raza at: shaina.raza@vectorinstitute.ai
Disclaimer: The labels Likely and Unlikely are based on LLM annotations and expert assessments, intended for informational use only. They should not be considered final judgments.
Guidance: This dataset is for research purposes. Cross-reference findings with other reliable sources before drawing conclusions. The dataset aims to encourage critical thinking, not provide definitive classifications.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 1 row and is filtered where the book is Machine learning pocket reference : working with structured data in Python. It features 7 columns including author, publication date, language, and book publisher.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Summary
This dataset contains two hyperspectral images and one multispectral image for anomaly detection, along with their corresponding binary pixel masks. They were initially used for real-time anomaly detection in line-scanning, but they can be used for any anomaly detection task.
They are in .npy file format (TIFF or GeoTIFF variants will be added in the future), with the image arrays stored in (height, width, channels) order. The SNP dataset was collected using sentinelhub, and the Synthetic dataset was collected from AVIRIS. The Python code used to analyse these datasets can be found at: https://github.com/WiseGamgee/HyperAD
How to Get Started
All that is needed to load these datasets is Python (preferably 3.8+) and the NumPy package. Example code for loading the beach dataset, assuming it is placed in a folder called "data" alongside the Python script:
import numpy as np
hsi_array = np.load("data/beach_hsi.npy")
n_pixels, n_lines, n_bands = hsi_array.shape
print(f"This dataset has {n_pixels} pixels, {n_lines} lines, and {n_bands} bands.")

mask_array = np.load("data/beach_mask.npy")
m_pixels, m_lines = mask_array.shape
print(f"The corresponding anomaly mask is {m_pixels} pixels by {m_lines} lines.")
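A quick visual sanity check (a sketch, assuming matplotlib is installed and reusing the arrays loaded above) is to display one spectral band next to the anomaly mask:
import matplotlib.pyplot as plt

fig, (ax_img, ax_mask) = plt.subplots(1, 2, figsize=(10, 4))
ax_img.imshow(hsi_array[:, :, 0], cmap="gray")   # first spectral band of the beach image
ax_img.set_title("Beach HSI (band 0)")
ax_mask.imshow(mask_array, cmap="gray")          # binary anomaly mask
ax_mask.set_title("Anomaly mask")
plt.show()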
Citing the Datasets
If you use any of these datasets, please cite the following paper:
@article{garske2024erx,
  title={ERX - a Fast Real-Time Anomaly Detection Algorithm for Hyperspectral Line-Scanning},
  author={Garske, Samuel and Evans, Bradley and Artlett, Christopher and Wong, KC},
  journal={arXiv preprint arXiv:2408.14947},
  year={2024}
}
If you use the beach dataset please cite the following paper as well (original source):
@article{mao2022openhsi,
  title={OpenHSI: A complete open-source hyperspectral imaging solution for everyone},
  author={Mao, Yiwei and Betters, Christopher H and Evans, Bradley and Artlett, Christopher P and Leon-Saval, Sergio G and Garske, Samuel and Cairns, Iver H and Cocks, Terry and Winter, Robert and Dell, Timothy},
  journal={Remote Sensing},
  volume={14},
  number={9},
  pages={2244},
  year={2022},
  publisher={MDPI}
}
Empirical line methods (ELM) are frequently used to correct images from aerial remote sensing. Remote sensing of aquatic environments captures only a small amount of energy because the water absorbs much of it, so the water's signal response is proportionally small compared to other land-surface targets. This dataset presents resources and results of a new approach to calibrating empirical lines that combines reference calibration panels with water samples. We optimized the method using Python algorithms until it reached the best result.
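For readers unfamiliar with the technique, the sketch below shows the generic empirical line idea for a single band (a linear fit from measured radiance of reference targets to their known reflectance); it is an illustration with made-up values, not the authors' exact calibration procedure:
import numpy as np

# Known surface reflectance of the reference targets (calibration panels plus a
# water sample) and the radiance the sensor measured for them. Values are illustrative.
known_reflectance = np.array([0.05, 0.20, 0.50, 0.02])
measured_radiance = np.array([12.0, 35.0, 80.0, 8.5])

# Per-band empirical line: fit gain/offset, then convert a whole radiance band.
gain, offset = np.polyfit(measured_radiance, known_reflectance, deg=1)
radiance_band = np.random.uniform(5, 90, size=(100, 100))
reflectance_band = gain * radiance_band + offset
print(f"gain={gain:.4f}, offset={offset:.4f}")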
The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('cifar10', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/cifar10-3.0.2.png
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains the metadata of the datasets published in 85 Dataverse installations and information about each installation's metadata blocks. It also includes the lists of pre-defined licenses or terms of use that dataset depositors can apply to the datasets they publish in the 58 installations that were running versions of the Dataverse software that include that feature. The data is useful for reporting on the quality of dataset and file-level metadata within and across Dataverse installations and for improving understanding of how certain Dataverse features and metadata fields are used. Curators and other researchers can use this dataset to explore how well Dataverse software and the repositories using the software help depositors describe data.
How the metadata was downloaded
The dataset metadata and metadata block JSON files were downloaded from each installation between August 22 and August 28, 2023 using a Python script kept in a GitHub repo at https://github.com/jggautier/dataverse-scripts/blob/main/other_scripts/get_dataset_metadata_of_all_installations.py. In order to get the metadata from installations that require an installation account API token to use certain Dataverse software APIs, I created a CSV file with two columns: one column named "hostname" listing each installation URL in which I was able to create an account and another column named "apikey" listing my accounts' API tokens. The Python script expects this CSV file and uses the listed API tokens to get metadata and other information from installations that require API tokens.
How the files are organized
├── csv_files_with_metadata_from_most_known_dataverse_installations
│   ├── author(citation)_2023.08.22-2023.08.28.csv
│   ├── contributor(citation)_2023.08.22-2023.08.28.csv
│   ├── data_source(citation)_2023.08.22-2023.08.28.csv
│   ├── ...
│   └── topic_classification(citation)_2023.08.22-2023.08.28.csv
├── dataverse_json_metadata_from_each_known_dataverse_installation
│   ├── Abacus_2023.08.27_12.59.59.zip
│   │   ├── dataset_pids_Abacus_2023.08.27_12.59.59.csv
│   │   ├── Dataverse_JSON_metadata_2023.08.27_12.59.59
│   │   │   ├── hdl_11272.1_AB2_0AQZNT_v1.0(latest_version).json
│   │   │   └── ...
│   │   └── metadatablocks_v5.6
│   │       ├── astrophysics_v5.6.json
│   │       ├── biomedical_v5.6.json
│   │       ├── citation_v5.6.json
│   │       ├── ...
│   │       └── socialscience_v5.6.json
│   ├── ACSS_Dataverse_2023.08.26_22.14.04.zip
│   ├── ADA_Dataverse_2023.08.27_13.16.20.zip
│   ├── Arca_Dados_2023.08.27_13.34.09.zip
│   ├── ...
│   └── World_Agroforestry_-_Research_Data_Repository_2023.08.27_19.24.15.zip
├── dataverse_installations_summary_2023.08.28.csv
├── dataset_pids_from_most_known_dataverse_installations_2023.08.csv
├── license_options_for_each_dataverse_installation_2023.09.05.csv
└── metadatablocks_from_most_known_dataverse_installations_2023.09.05.csv
This dataset contains two directories and four CSV files not in a directory. One directory, "csv_files_with_metadata_from_most_known_dataverse_installations", contains 20 CSV files that list the values of many of the metadata fields in the citation metadata block and geospatial metadata block of datasets in the 85 Dataverse installations. For example, author(citation)_2023.08.22-2023.08.28.csv contains the "Author" metadata for the latest versions of all published, non-deaccessioned datasets in the 85 installations, where there's a row for author names, affiliations, identifier types and identifiers.
The other directory, "dataverse_json_metadata_from_each_known_dataverse_installation", contains 85 zipped files, one for each of the 85 Dataverse installations whose dataset metadata I was able to download. Each zip file contains a CSV file and two sub-directories: The CSV file contains the persistent IDs and URLs of each published dataset in the Dataverse installation as well as a column to indicate if the Python script was able to download the Dataverse JSON metadata for each dataset. It also includes the alias/identifier and category of the Dataverse collection that the dataset is in. One sub-directory contains a JSON file for each of the installation's published, non-deaccessioned dataset versions. The JSON files contain the metadata in the "Dataverse JSON" metadata schema. The Dataverse JSON export of the latest version of each dataset includes "(latest_version)" in the file name. This should help those who are interested in the metadata of only the latest version of each dataset. The other sub-directory contains information about the metadata models (the "metadata blocks" in JSON files) that the installation was using when the dataset metadata was downloaded. I included them so that they can be used when extracting metadata from the dataset's Dataverse JSON exports. The dataverse_installations_summary_2023.08.28.csv file contains information about each installation, including its name, URL, Dataverse software version, and counts of dataset metadata...
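As a starting point for analysis, one of the CSV files in the first directory can be loaded directly (a sketch; the columns each file contains are not documented here, so inspect them after loading):
import pandas as pd

# Load the "Author" citation metadata exported from the 85 installations.
authors = pd.read_csv(
    "csv_files_with_metadata_from_most_known_dataverse_installations/"
    "author(citation)_2023.08.22-2023.08.28.csv"
)
print(authors.shape)
print(authors.columns.tolist())  # inspect which author fields are present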
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
4,914 perovskite oxides containing composition data, lattice constants, and formation + vacancy formation energies. All perovskites are of the form ABO3. Adapted from a dataset presented by Emery and Wolverton. Available as Monty Encoder encoded JSON and as CSV. Recommended access method is with the matminer Python package using the datasets module.
Note on citations: If you found this dataset useful and would like to cite it in your work, please be sure to cite its original sources below rather than or in addition to this page.
Dataset described in: Emery, A. A. & Wolverton, C. High-throughput DFT calculations of formation energy, stability and oxygen vacancy formation energy of ABO3 perovskites. Sci. Data 4:170153, doi: 10.1038/sdata.2017.153 (2017).
Data sourced from: Emery, A. A., & Wolverton, C. Figshare, http://dx.doi.org/10.6084/m9.figshare.5334142 (2017).
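A hedged sketch of the recommended matminer access path (the exact dataset key is not stated above, so the snippet first lists the available names rather than guessing one):
from matminer.datasets import get_available_datasets, load_dataset

# Print the dataset keys matminer ships with, then load the perovskite entry by its key.
print(get_available_datasets())
# df = load_dataset("<perovskite_dataset_key>")  # replace with the key found in the listing
# print(df.head())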
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data for Figure Atlas.2 from Atlas of the Working Group I (WGI) Contribution to the Intergovernmental Panel on Climate Change (IPCC) Sixth Assessment Report (AR6).
Figure Atlas.2 shows WGI reference regions used in the (a) AR5 and (b) AR6 reports.
How to cite this dataset
When citing this dataset, please include both the data citation below (under 'Citable as') and the following citations: For the report component from which the figure originates: Gutiérrez, J.M., R.G. Jones, G.T. Narisma, L.M. Alves, M. Amjad, I.V. Gorodetskaya, M. Grose, N.A.B. Klutse, S. Krakovska, J. Li, D. Martínez-Castro, L.O. Mearns, S.H. Mernild, T. Ngo-Duc, B. van den Hurk, and J.-H. Yoon, 2021: Atlas. In Climate Change 2021: The Physical Science Basis. Contribution of Working Group I to the Sixth Assessment Report of the Intergovernmental Panel on Climate Change [Masson-Delmotte, V., P. Zhai, A. Pirani, S.L. Connors, C. Péan, S. Berger, N. Caud, Y. Chen, L. Goldfarb, M.I. Gomis, M. Huang, K. Leitzell, E. Lonnoy, J.B.R. Matthews, T.K. Maycock, T. Waterfield, O. Yelekçi, R. Yu, and B. Zhou (eds.)]. Cambridge University Press, Cambridge, United Kingdom and New York, NY, USA, pp. 1927–2058, doi:10.1017/9781009157896.021
Iturbide, M. et al., 2021: Repository supporting the implementation of FAIR principles in the IPCC-WG1 Interactive Atlas. Zenodo. Retrieved from: http://doi.org/10.5281/zenodo.5171760
Figure subpanels
The figure has two panels, with data provided for both panels in the master GitHub repository linked in the documentation.
Data provided in relation to figure
This dataset contains the corner coordinates defining each reference region for the second panel of the figure, which contains coordinate information at a 0.44º resolution. The repository directory 'reference-regions' contains the reference regions as polygons in different formats (CSV with coordinates, R data, shapefile and geojson), together with R and Python notebooks illustrating the use of these regions with worked examples.
Data for reference regions for AR5 can be found here: https://catalogue.ceda.ac.uk/uuid/a3b6d7f93e5c4ea986f3622eeee2b96f
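As an illustration of using the polygon formats mentioned above (a sketch; the geojson filename inside the 'reference-regions' directory is assumed, not verified):
import geopandas as gpd

# Read the AR6 reference-region polygons and plot them (filename is illustrative).
regions = gpd.read_file("reference-regions/IPCC-WGI-reference-regions-v4.geojson")
print(regions.head())  # region names/acronyms and their geometries
regions.plot(edgecolor="black", figsize=(10, 5))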
CMIP5 and CMIP6 are the fifth and sixth phases of the Coupled Model Intercomparison Project. CORDEX is the Coordinated Regional Downscaling Experiment from the WCRP. AR5 and AR6 refer to the 5th and 6th Assessment Reports of the IPCC. WGI stands for Working Group I.
Notes on reproducing the figure from the provided data
Data and figures produced by the Jupyter notebooks live inside the notebooks directory. The notebooks describe step by step the basic process followed to generate some key figures of the AR6 WGI Atlas and some products underpinning the Interactive Atlas, such as reference regions, global warming levels, and aggregated datasets. They include comments and hints to extend the analysis, thus promoting reusability of the results. These notebooks are provided as guidance for practitioners and are more user-friendly than the code provided as scripts in the reproducibility folder.
Some of the notebooks require access to large data volumes outside this repository. To speed up execution, in addition to the full data-access code, we provide a data-loading shortcut that uses intermediate results stored in the auxiliary-material folder of this repository. To test other parameter settings, the full data-access instructions should be followed, which can involve long waiting times.
Sources of additional information
The following weblinks are provided in the Related Documents section of this catalogue record:
- Link to the figure on the IPCC AR6 website
- Link to the report component containing the figure (Atlas)
- Link to the Supplementary Material for Atlas, which contains details on the input data used in Table Atlas.SM.15
- Link to the code for the figure, archived on Zenodo
- Link to the notebooks needed for reproducing the figure, from GitHub
- Link to the IPCC AR5 reference regions dataset
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Metallic glass formation data for binary alloys, collected from various experimental techniques such as melt-spinning or mechanical alloying. This dataset covers all compositions with an interval of 5 at.% in 59 binary systems, containing a total of 5959 alloys. The target property of this dataset is the glass forming ability (GFA), i.e. whether the composition can form monolithic glass or not, which is either 1 for glass forming or 0 for non-full glass forming.
The V2 versions of this dataset have been cleaned to remove duplicate data points. Any entries with identical formula and both negative and positive GFA classes were combined into a single entry with a positive GFA class.
Data is available as Monty Encoder encoded JSON and as the source CSV file. Recommended access method is with the matminer Python package using the datasets module.
Note on citations: If you found this dataset useful and would like to cite it in your work, please be sure to cite its original sources below rather than or in addition to this page.
Dataset discussed in: Machine Learning Approach for Prediction and Understanding of Glass-Forming Ability. Y. T. Sun†§, H. Y. Bai†§, M. Z. Li*‡, and W. H. Wang*†§. † Institute of Physics, Chinese Academy of Sciences, Beijing 100190, People’s Republic of China; ‡ Department of Physics, Beijing Key Laboratory of Optoelectronic Functional Materials & Micro-nano Devices, Renmin University of China, Beijing 100872, People’s Republic of China; § University of Chinese Academy of Science, Beijing 100049, People’s Republic of China. J. Phys. Chem. Lett., 2017, 8 (14), pp 3434–3439. DOI: 10.1021/acs.jpclett.7b01046. Publication Date (Web): July 11, 2017.
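A minimal matminer access sketch (the dataset key "glass_binary_v2" and the GFA column name are assumptions matching the V2 description above; verify them against matminer's dataset listing):
from matminer.datasets import load_dataset

# Load the cleaned V2 binary metallic glass dataset (key assumed to be "glass_binary_v2").
df = load_dataset("glass_binary_v2")
print(df.head())
print(df["gfa"].value_counts())  # glass-forming (1) vs not (0); column name assumed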
This dataset features both data and code related to the research article titled "Rayleigh Invariance Enables Estimation of Effective CO2 Fluxes Resulting from Convective Dissolution in Water-Filled Fractures." It includes raw data packaged in tarball format, including Python scripts used to derive the results presented in the publication. High-resolution raw data for contour plots is available upon request.
1. Download the Dataset: Download the dataset file using Access Dataset. Ensure you have sufficient disk space available for storing and processing the dataset.
2. Extract the Dataset: Once the dataset file is downloaded, extract its contents. The dataset is compressed in tar.xz format; use appropriate tools to extract it. For example, on Linux you can use the following commands:
tar -xf Publication_CCS.tar.xz
tar -xf Publication_Karst.tar.xz
tar -xf Validation_Sim.tar.xz
This will create directories containing the dataset files.
3. Install Required Python Packages: Before running any code, ensure you have the necessary Python (version 3.10 tested) packages installed. The required packages and their versions are listed in the requirements.txt file. You can install them using pip:
pip install -r requirements.txt
4. Run the Post-Processing Script: After extracting the dataset and installing the required Python packages, you can run the provided post-processing script (post_process.py), which is designed to replicate all the plots from the publication based on the dataset. Execute the script using Python:
python3 post_process.py
This script will generate the plots and output them to the specified directory.
5. Explore and Analyze: Once the script has completed running, you can explore the generated plots to gain insights from the dataset. Feel free to modify the script or use the dataset in your own analysis and experiments. High-resolution data, such as the vtu files for contour plots, is available upon request; please feel free to reach out if needed.
6. Small Grid Study: There is a tarball for the data that was generated to study the grid used in the related publication:
tar -xf Publication_CCS.tar.xz
If you unpack the tarball and have the requirements from above installed, you can use the Python script to generate the plots.
7. Citation: If you use this dataset in your research or publication, please cite the original source appropriately to give credit to the authors and contributors.
MultiPL-T Python Sources
Citation
If you use this dataset, we request that you cite our work:
@misc{cassano:multipl-t,
  title={Knowledge Transfer from High-Resource to Low-Resource Programming Languages for Code LLMs},
  author={Federico Cassano and John Gouwar and Francesca Lucchetti and Claire Schlesinger and Anders Freeman and Carolyn Jane Anderson and Molly Q Feldman and Michael Greenberg and Abhinav Jangda and Arjun Guha},
  year={2024}…
See the full description on the dataset page: https://huggingface.co/datasets/nuprl/stack-dedup-python-testgen-starcoder-filter-v2-dedup.
This dataset contains additional data for the publication "A Deep Dive into Machine Learning Density Functional Theory for Materials Science and Chemistry". Its goal is to enable interested people to reproduce the citation analysis carried out in the aforementioned publication.
Prerequisites
The following software versions were used for the Python version of this dataset:
Python: 3.8.6
Scholarly: 1.2.0
Pyzotero: 1.4.24
Numpy: 1.20.1
Contents
results/: Contains the .csv files that were the results of the citation analysis. Paper groupings follow the ones outlined in the publication.
scripts/: Contains scripts to perform the citation analysis.
Zotero.cached.pkl: Contains the cached Zotero library.
Usage
In order to reproduce the results of the citation analysis, you can use citation_analysis.py in conjunction with the cached Zotero library. Manual additions can be verified using the check_consistency script. Please note that you will need a Tor key for the citation analysis, and access to our Zotero library if you don't want to use the cached version. If you need this access, simply contact us.
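To get a quick overview of the result files (a sketch; the individual CSV filenames and columns are not listed here):
import glob
import pandas as pd

# Load every results CSV from the citation analysis and report its shape.
for path in glob.glob("results/*.csv"):
    df = pd.read_csv(path)
    print(path, df.shape)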