27 datasets found
  1. Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter...

    • explore.openaire.eu
    • zenodo.org
    Updated Mar 13, 2019
    + more versions
    Cite
    João Felipe; Leonardo; Vanessa; Juliana (2019). Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter Notebooks [Dataset]. http://doi.org/10.5281/zenodo.2592524
    Explore at:
    Dataset updated
    Mar 13, 2019
    Authors
    João Felipe; Leonardo; Vanessa; Juliana
    Description

    The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourages poor coding practices, and that their results can be hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub.

    Paper: https://2019.msrconf.org/event/msr-2019-papers-a-large-scale-study-about-quality-and-reproducibility-of-jupyter-notebooks

    This repository contains two files:

    • dump.tar.bz2
    • jupyter_reproducibility.tar.bz2

    The dump.tar.bz2 file contains a PostgreSQL dump of the database, with all the data we extracted from the notebooks.

    The jupyter_reproducibility.tar.bz2 file contains all the scripts we used to query and download Jupyter Notebooks, extract data from them, and analyze the data. It is organized as follows:

    • analyses: this folder has all the notebooks we use to analyze the data in the PostgreSQL database.
    • archaeology: this folder has all the scripts we use to query, download, and extract data from GitHub notebooks.
    • paper: empty. The notebook analyses/N12.To.Paper.ipynb moves data to it.

    In the remainder of this text, we give instructions for reproducing the analyses, by using the data provided in the dump, and for reproducing the collection, by collecting data from GitHub again.

    Reproducing the Analysis

    This section shows how to load the data in the database and run the analysis notebooks. In the analysis, we used the following environment:

    Ubuntu 18.04.1 LTS
    PostgreSQL 10.6
    Conda 4.5.11
    Python 3.7.2
    PdfCrop 2012/11/02 v1.38

    First, download dump.tar.bz2 and extract it:

    tar -xjf dump.tar.bz2

    It extracts the file db2019-03-13.dump. Create a database in PostgreSQL (we call it "jupyter"), and use psql to restore the dump:

    psql jupyter < db2019-03-13.dump

    It populates the database with the dump. Now, configure the connection string for sqlalchemy by setting the environment variable JUP_DB_CONNECTION:

    export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter";

    Download and extract jupyter_reproducibility.tar.bz2:

    tar -xjf jupyter_reproducibility.tar.bz2

    Create a conda environment with Python 3.7:

    conda create -n analyses python=3.7
    conda activate analyses

    Go to the analyses folder and install all the dependencies of the requirements.txt:

    cd jupyter_reproducibility/analyses
    pip install -r requirements.txt

    To reproduce the analyses, run jupyter in this folder:

    jupyter notebook

    Execute the notebooks in this order:

    • Index.ipynb
    • N0.Repository.ipynb
    • N1.Skip.Notebook.ipynb
    • N2.Notebook.ipynb
    • N3.Cell.ipynb
    • N4.Features.ipynb
    • N5.Modules.ipynb
    • N6.AST.ipynb
    • N7.Name.ipynb
    • N8.Execution.ipynb
    • N9.Cell.Execution.Order.ipynb
    • N10.Markdown.ipynb
    • N11.Repository.With.Notebook.Restriction.ipynb
    • N12.To.Paper.ipynb

    Reproducing or Expanding the Collection

    The collection demands more steps to reproduce and takes much longer to run (months). It also involves running arbitrary code on your machine. Proceed with caution.

    Requirements

    This time, we have extra requirements:

    • All the analysis requirements
    • lbzip2 2.5
    • gcc 7.3.0
    • GitHub account
    • Gmail account

    Environment

    First, set the following environment variables:

    export JUP_MACHINE="db"; # machine identifier
    export JUP_BASE_DIR="/mnt/jupyter/github"; # place to store the repositories
    export JUP_LOGS_DIR="/home/jupyter/logs"; # log files
    export JUP_COMPRESSION="lbzip2"; # compression program
    export JUP_VERBOSE="5"; # verbose level
    export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter"; # sqlalchemy connection
    export JUP_GITHUB_USERNAME="github_username"; # your github username
    export JUP_GITHUB_PASSWORD="github_password"; # your github password
    export JUP_MAX_SIZE="8000.0"; # maximum size of the repositories directory (in GB)
    export JUP_FIRST_DATE="2013-01-01"; # initial date to query github
    export JUP_EMAIL_LOGIN="gmail@gmail.com"; # your gmail address
    export JUP_EMAIL_TO="target@email.com"; # email that receives notifications
    export JUP_OAUTH_FILE="~/oauth2_creds.json"; # oauth2 authentication file
    export JUP_NOTEBOOK_INTERVAL=""; # notebook id interval for this machine. Leave it blank
    export JUP_REPOSITORY_INTERVAL=""; # repository id interval for this machine. Leave it blank
    export JUP_WITH_EXECUTION="1"; # run execute python notebooks
    export JUP_WITH_DEPENDENCY="0"; # run notebooks with and without declared dependencies
    export JUP_EXECUTION_MODE="-1"; # run following the execution order
    export JUP_EXECUTION_DIR="/home/jupyter/execution"; # temporary directory for running notebooks
    export JUP_ANACONDA_PATH="~/anaconda3"; # conda installation path
    export JUP_MOUNT_BASE="/home/jupyter/mount_ghstudy.sh"; # bash script to mount base dir
    export JUP_UMOUNT_BASE="/home/jupyter/umount_ghstudy.sh"; # bash script to umount base dir
    export JUP_NOTEBOOK_TIMEOUT="300"; # timeou...

  2. MoreFixes: Largest CVE dataset with fixes

    • data.niaid.nih.gov
    • explore.openaire.eu
    • +1more
    Updated Oct 23, 2024
    Cite
    Akhoundali, Jafar (2024). MoreFixes: Largest CVE dataset with fixes [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11199119
    Explore at:
    Dataset updated
    Oct 23, 2024
    Dataset provided by
    Rietveld, Kristian F. D.
    GADYATSKAYA, Olga
    Akhoundali, Jafar
    Rahim Nouri, Sajad
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    In our work, we have designed and implemented a novel workflow with several heuristic methods to combine state-of-the-art methods related to CVE fix commit gathering. As a consequence of our improvements, we have been able to gather the largest programming language-independent real-world dataset of CVE vulnerabilities with the associated fix commits. Our dataset containing 29,203 unique CVEs coming from 7,238 unique GitHub projects is, to the best of our knowledge, by far the biggest CVE vulnerability dataset with fix commits available today. These CVEs are associated with 35,276 unique commits stored as SQL and 39,931 patch commit files that fixed those vulnerabilities (some patch files can't be saved as SQL due to several technical reasons). Our larger dataset thus substantially improves over the current real-world vulnerability datasets and enables further progress in research on vulnerability detection and software security. We used the NVD (nvd.nist.gov) and the GitHub Security Advisory Database as the main sources of our pipeline.

    We release to the community a 16GB PostgreSQL database that contains information on CVEs up to 2024-09-26, CWEs of each CVE, files and methods changed by each commit, and repository metadata. Additionally, patch files related to the fix commits are available as a separate package. Furthermore, we make our dataset collection tool also available to the community.

    The cvedataset-patches.zip file contains fix patches, and postgrescvedumper.sql.zip contains a PostgreSQL dump of fixes, together with several other fields such as CVEs, CWEs, repository metadata, commit data, file changes, methods changed, etc.

    The MoreFixes data-storage strategy is based on CVEFixes for storing CVE fix commits from open-source repositories, and uses a modified version of Prospector (part of Project KB from SAP) as a module to detect the fix commits of a CVE. Our full methodology is presented in the paper "MoreFixes: A Large-Scale Dataset of CVE Fix Commits Mined through Enhanced Repository Discovery", published at the PROMISE conference (2024).

    For more information about usage and sample queries, visit the GitHub repository: https://github.com/JafarAkhondali/Morefixes

    If you are using this dataset, please be aware that the repositories we mined carry different licenses and you are responsible for handling any licensing issues. The same applies to CVEFixes.

    This product uses the NVD API but is not endorsed or certified by the NVD.

    This research was partially supported by the Dutch Research Council (NWO) under the project NWA.1215.18.008 Cyber Security by Integrated Design (C-SIDe).

    To restore the dataset, you can use the docker-compose file available in the GitHub repository. Default dataset credentials after restoring the dump:

    POSTGRES_USER=postgrescvedumper POSTGRES_DB=postgrescvedumper POSTGRES_PASSWORD=a42a18537d74c3b7e584c769152c3d
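
    For example, a minimal Python sketch for connecting to the restored database with the credentials above (host and port are assumptions for a local Docker setup) and listing its tables without assuming specific table names:

    import psycopg2

    # Connect with the default credentials given above; host/port are assumptions.
    conn = psycopg2.connect(
        dbname="postgrescvedumper",
        user="postgrescvedumper",
        password="a42a18537d74c3b7e584c769152c3d",
        host="localhost",
        port=5432,
    )
    with conn, conn.cursor() as cur:
        # Read the table names from the schema instead of assuming them.
        cur.execute(
            "SELECT table_name FROM information_schema.tables WHERE table_schema = 'public'"
        )
        for (table,) in cur.fetchall():
            print(table)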

    Please use this for citation:

     @inproceedings{morefixes2024,
     title={MoreFixes: A large-scale dataset of CVE fix commits mined through enhanced repository discovery},
     author={Akhoundali, Jafar and Nouri, Sajad Rahim and Rietveld, Kristian and Gadyatskaya, Olga},
     booktitle={Proceedings of the 20th International Conference on Predictive Models and Data Analytics in Software Engineering},
     pages={42--51},
     year={2024}
    }
    
  3. GitHub Repos

    • kaggle.com
    zip
    Updated Mar 20, 2019
    Cite
    Github (2019). GitHub Repos [Dataset]. https://www.kaggle.com/datasets/github/github-repos
    Explore at:
    zip (0 bytes)
    Dataset updated
    Mar 20, 2019
    Dataset provided by
    GitHub (https://github.com/)
    Authors
    Github
    Description

    GitHub is how people build software and is home to the largest community of open source developers in the world, with over 12 million people contributing to 31 million projects on GitHub since 2008.

    This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions.

    Querying BigQuery tables

    You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.github_repos.[TABLENAME]. Fork this kernel to get started and to learn how to safely manage analyzing large BigQuery datasets.
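
    For example, a minimal sketch with the BigQuery Python client (it requires Google Cloud credentials); the languages table and its repeated language record (name, bytes) are assumptions about the public github_repos schema, so adjust the query to the table you need:

    from google.cloud import bigquery

    # Count repositories per language using the public github_repos dataset.
    client = bigquery.Client()
    query = """
        SELECT lang.name AS language_name, COUNT(*) AS repo_count
        FROM `bigquery-public-data.github_repos.languages`,
             UNNEST(language) AS lang
        GROUP BY language_name
        ORDER BY repo_count DESC
        LIMIT 10
    """
    for row in client.query(query).result():
        print(row.language_name, row.repo_count)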

    Acknowledgements

    This dataset was made available per GitHub's terms of service. This dataset is available via Google Cloud Platform's Marketplace, GitHub Activity Data, as part of GCP Public Datasets.

    Inspiration

    • This is the perfect dataset for fighting language wars.
    • Can you identify any signals that predict which packages or languages will become popular, in advance of their mass adoption?
  4. Spider Realistic Dataset In Structure-Grounded Pretraining for Text-to-SQL

    • zenodo.org
    bin, json, txt
    Updated Aug 16, 2021
    + more versions
    Cite
    Xiang Deng; Ahmed Hassan Awadallah; Christopher Meek; Oleksandr Polozov; Huan Sun; Matthew Richardson (2021). Spider Realistic Dataset In Structure-Grounded Pretraining for Text-to-SQL [Dataset]. http://doi.org/10.5281/zenodo.5205322
    Explore at:
    txt, json, bin
    Dataset updated
    Aug 16, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Xiang Deng; Ahmed Hassan Awadallah; Christopher Meek; Oleksandr Polozov; Huan Sun; Matthew Richardson
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This folder contains the Spider-Realistic dataset used for evaluation in the paper "Structure-Grounded Pretraining for Text-to-SQL". The dataset is created based on the dev split of the Spider dataset (2020-06-07 version from https://yale-lily.github.io/spider). We manually modified the original questions to remove the explicit mention of column names while keeping the SQL queries unchanged to better evaluate the model's capability in aligning the NL utterance and the DB schema. For more details, please check our paper at https://arxiv.org/abs/2010.12773.

    It contains the following files:

    - spider-realistic.json
    # The spider-realistic evaluation set
    # Examples: 508
    # Databases: 19
    - dev.json
    # The original dev split of Spider
    # Examples: 1034
    # Databases: 20
    - tables.json
    # The original DB schemas from Spider
    # Databases: 166
    - README.txt
    - license

    The Spider-Realistic dataset is created based on the dev split of the Spider dataset released by Yu, Tao, et al. "Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task." It is a subset of the original dataset with explicit mention of the column names removed. The SQL queries and databases are kept unchanged.
    For the format of each json file, please refer to the github page of Spider https://github.com/taoyds/spider.
    For the database files please refer to the official Spider release https://yale-lily.github.io/spider.
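
    As an illustration, the evaluation set can be loaded with plain Python; the "question" and "query" field names are assumed to follow the Spider data format described on the Spider GitHub page:

    import json

    # Load the Spider-Realistic evaluation set and peek at a few NL/SQL pairs.
    with open("spider-realistic.json") as f:
        examples = json.load(f)

    print(len(examples), "examples")
    for ex in examples[:3]:
        # "question" and "query" are assumed field names (Spider format).
        print(ex.get("question"), "->", ex.get("query"))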

    This dataset is distributed under the CC BY-SA 4.0 license.

    If you use the dataset, please cite the following papers, including the original Spider dataset, Finegan-Dollak et al., 2018, and the original datasets for Restaurants, GeoQuery, Scholar, Academic, IMDB, and Yelp.

    @article{deng2020structure,
    title={Structure-Grounded Pretraining for Text-to-SQL},
    author={Deng, Xiang and Awadallah, Ahmed Hassan and Meek, Christopher and Polozov, Oleksandr and Sun, Huan and Richardson, Matthew},
    journal={arXiv preprint arXiv:2010.12773},
    year={2020}
    }

    @inproceedings{Yu&al.18c,
    year = 2018,
    title = {Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task},
    booktitle = {EMNLP},
    author = {Tao Yu and Rui Zhang and Kai Yang and Michihiro Yasunaga and Dongxu Wang and Zifan Li and James Ma and Irene Li and Qingning Yao and Shanelle Roman and Zilin Zhang and Dragomir Radev }
    }

    @InProceedings{P18-1033,
    author = "Finegan-Dollak, Catherine
    and Kummerfeld, Jonathan K.
    and Zhang, Li
    and Ramanathan, Karthik
    and Sadasivam, Sesh
    and Zhang, Rui
    and Radev, Dragomir",
    title = "Improving Text-to-SQL Evaluation Methodology",
    booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    year = "2018",
    publisher = "Association for Computational Linguistics",
    pages = "351--360",
    location = "Melbourne, Australia",
    url = "http://aclweb.org/anthology/P18-1033"
    }

    @InProceedings{data-sql-imdb-yelp,
    dataset = {IMDB and Yelp},
    author = {Navid Yaghmazadeh, Yuepeng Wang, Isil Dillig, and Thomas Dillig},
    title = {SQLizer: Query Synthesis from Natural Language},
    booktitle = {International Conference on Object-Oriented Programming, Systems, Languages, and Applications, ACM},
    month = {October},
    year = {2017},
    pages = {63:1--63:26},
    url = {http://doi.org/10.1145/3133887},
    }

    @article{data-academic,
    dataset = {Academic},
    author = {Fei Li and H. V. Jagadish},
    title = {Constructing an Interactive Natural Language Interface for Relational Databases},
    journal = {Proceedings of the VLDB Endowment},
    volume = {8},
    number = {1},
    month = {September},
    year = {2014},
    pages = {73--84},
    url = {http://dx.doi.org/10.14778/2735461.2735468},
    }

    @InProceedings{data-atis-geography-scholar,
    dataset = {Scholar, and Updated ATIS and Geography},
    author = {Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Jayant Krishnamurthy, and Luke Zettlemoyer},
    title = {Learning a Neural Semantic Parser from User Feedback},
    booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
    year = {2017},
    pages = {963--973},
    location = {Vancouver, Canada},
    url = {http://www.aclweb.org/anthology/P17-1089},
    }

    @inproceedings{data-geography-original,
    dataset = {Geography, original},
    author = {John M. Zelle and Raymond J. Mooney},
    title = {Learning to Parse Database Queries Using Inductive Logic Programming},
    booktitle = {Proceedings of the Thirteenth National Conference on Artificial Intelligence - Volume 2},
    year = {1996},
    pages = {1050--1055},
    location = {Portland, Oregon},
    url = {http://dl.acm.org/citation.cfm?id=1864519.1864543},
    }

    @inproceedings{data-restaurants-logic,
    author = {Lappoon R. Tang and Raymond J. Mooney},
    title = {Automated Construction of Database Interfaces: Integrating Statistical and Relational Learning for Semantic Parsing},
    booktitle = {2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora},
    year = {2000},
    pages = {133--141},
    location = {Hong Kong, China},
    url = {http://www.aclweb.org/anthology/W00-1317},
    }

    @inproceedings{data-restaurants-original,
    author = {Ana-Maria Popescu, Oren Etzioni, and Henry Kautz},
    title = {Towards a Theory of Natural Language Interfaces to Databases},
    booktitle = {Proceedings of the 8th International Conference on Intelligent User Interfaces},
    year = {2003},
    location = {Miami, Florida, USA},
    pages = {149--157},
    url = {http://doi.acm.org/10.1145/604045.604070},
    }

    @inproceedings{data-restaurants,
    author = {Alessandra Giordani and Alessandro Moschitti},
    title = {Automatic Generation and Reranking of SQL-derived Answers to NL Questions},
    booktitle = {Proceedings of the Second International Conference on Trustworthy Eternal Systems via Evolving Software, Data and Knowledge},
    year = {2012},
    location = {Montpellier, France},
    pages = {59--76},
    url = {https://doi.org/10.1007/978-3-642-45260-4_5},
    }

  5. Data from: Correlated RNN Framework to Quickly Generate Molecules with...

    • acs.figshare.com
    xlsx
    Updated Jun 1, 2023
    Cite
    Chuan Li; Chenghui Wang; Ming Sun; Yan Zeng; Yuan Yuan; Qiaolin Gou; Guangchuan Wang; Yanzhi Guo; Xuemei Pu (2023). Correlated RNN Framework to Quickly Generate Molecules with Desired Properties for Energetic Materials in the Low Data Regime [Dataset]. http://doi.org/10.1021/acs.jcim.2c00997.s002
    Explore at:
    xlsx
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    ACS Publications
    Authors
    Chuan Li; Chenghui Wang; Ming Sun; Yan Zeng; Yuan Yuan; Qiaolin Gou; Guangchuan Wang; Yanzhi Guo; Xuemei Pu
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
    License information was derived automatically

    Description

    Motivated by the challenge of deep learning in the low data regime and the urgent demand for intelligent design of highly energetic materials, we explore a correlated deep learning framework, which consists of three recurrent neural networks (RNNs) correlated by the transfer learning strategy, to efficiently generate new energetic molecules with a high detonation velocity in the case of very limited data available. To avoid the dependence on an external big data set, data augmentation by fragment shuffling of 303 energetic compounds is utilized to produce 500,000 molecules to pretrain the RNN, through which the model can learn sufficient structure knowledge. Then the pretrained RNN is fine-tuned by focusing on the 303 energetic compounds to generate 7153 molecules similar to the energetic compounds. In order to more reliably screen the molecules with a high detonation velocity, SMILES enumeration augmentation coupled with the pretrained knowledge is utilized to build an RNN-based prediction model, through which R2 is boosted from 0.4446 to 0.9572. The comparable performance with the transfer learning strategy based on an existing big database (ChEMBL) to produce the energetic molecules and drug-like ones further supports the effectiveness and generality of our strategy in the low data regime. High-precision quantum mechanics calculations further confirm that 35 new molecules present a higher detonation velocity and lower synthetic accessibility than the classic explosive RDX, along with good thermal stability. In particular, three new molecules are comparable to caged CL-20 in the detonation velocity. All the source codes and the data set are freely available at https://github.com/wangchenghuidream/RNNMGM.
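
    As an aside, the SMILES enumeration idea mentioned above can be illustrated with RDKit (RDKit and the example molecule are not part of the dataset; this is only a sketch of the augmentation step):

    from rdkit import Chem

    # Generate alternative, non-canonical SMILES strings for the same molecule.
    # The input SMILES (toluene) is an illustrative placeholder.
    mol = Chem.MolFromSmiles("CC1=CC=CC=C1")
    variants = {Chem.MolToSmiles(mol, canonical=False, doRandom=True) for _ in range(20)}
    print(variants)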

  6. Open Data Portal Catalogue

    • open.canada.ca
    • datasets.ai
    • +1more
    csv, json, jsonl, png +2
    Updated Jul 27, 2025
    Cite
    Treasury Board of Canada Secretariat (2025). Open Data Portal Catalogue [Dataset]. https://open.canada.ca/data/en/dataset/c4c5c7f1-bfa6-4ff6-b4a0-c164cb2060f7
    Explore at:
    csv, sqlite, json, png, jsonl, xlsx
    Dataset updated
    Jul 27, 2025
    Dataset provided by
    Treasury Board of Canada Secretariat (http://www.tbs-sct.gc.ca/)
    Treasury Board of Canada (https://www.canada.ca/en/treasury-board-secretariat/corporate/about-treasury-board.html)
    License

    Open Government Licence - Canada 2.0 (https://open.canada.ca/en/open-government-licence-canada)
    License information was derived automatically

    Description

    The open data portal catalogue is a downloadable dataset containing some key metadata for the general datasets available on the Government of Canada's Open Data portal. Resource 1 is generated using the ckanapi tool (external link). Resources 2 - 8 are generated using the Flatterer (external link) utility.

    Description of resources:

    1. Dataset is a JSON Lines (external link) file where the metadata of each Dataset/Open Information Record is one line of JSON. The file is compressed with GZip. The file is heavily nested and recommended for users familiar with working with nested JSON.
    2. Catalogue is an XLSX workbook where the nested metadata of each Dataset/Open Information Record is flattened into worksheets for each type of metadata.
    3. Datasets Metadata contains metadata at the dataset level. This is also referred to as the package in some CKAN documentation. This is the main table/worksheet in the SQLite database and XLSX output.
    4. Resources Metadata contains the metadata for the resources contained within each dataset.
    5. Resource Views Metadata contains the metadata for the views applied to each resource, if a resource has a view configured.
    6. Datastore Fields Metadata contains the DataStore information for CSV datasets that have been loaded into the DataStore. This information is displayed in the Data Dictionary for DataStore-enabled CSVs.
    7. Data Package Fields contains a description of the fields available in each of the tables within the Catalogue, as well as the count of the number of records each table contains.
    8. Data Package Entity Relation Diagram displays the title and format for each column, in each table in the Data Package, in the form of an ERD diagram. The Data Package resource offers a text-based version.
    9. SQLite Database is a .db database, similar in structure to Catalogue. This can be queried with database or analytical software tools for doing analysis.
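
    For example, the JSON Lines resource (resource 1) can be read directly from Python; the file name below is a placeholder for the downloaded resource, and the "title" field assumes the usual CKAN package schema:

    import gzip
    import json

    # Stream the gzipped JSON Lines file and print the first few dataset titles.
    with gzip.open("od-do-canada.jsonl.gz", "rt", encoding="utf-8") as f:
        for i, line in enumerate(f):
            record = json.loads(line)
            print(record.get("title"))  # "title" is an assumed CKAN field
            if i >= 4:
                break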

  7. Reproducibility in Practice: Dataset of a Large-Scale Study of Jupyter...

    • zenodo.org
    bz2
    Updated Mar 15, 2021
    Cite
    Anonymous; Anonymous (2021). Reproducibility in Practice: Dataset of a Large-Scale Study of Jupyter Notebooks [Dataset]. http://doi.org/10.5281/zenodo.2538877
    Explore at:
    bz2
    Dataset updated
    Mar 15, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous; Anonymous
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourages poor coding practices, and that their results can be hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub.

    This repository contains two files:

    • dump.tar.bz2
    • jupyter_reproducibility.tar.bz2

    The dump.tar.bz2 file contains a PostgreSQL dump of the database, with all the data we extracted from the notebooks.

    The jupyter_reproducibility.tar.bz2 file contains all the scripts we used to query and download Jupyter Notebooks, extract data from them, and analyze the data. It is organized as follows:

    • analyses: this folder has all the notebooks we use to analyze the data in the PostgreSQL database.
    • archaeology: this folder has all the scripts we use to query, download, and extract data from GitHub notebooks.
    • paper: empty. The notebook analyses/N11.To.Paper.ipynb moves data to it

    In the remainder of this text, we give instructions for reproducing the analyses, by using the data provided in the dump, and for reproducing the collection, by collecting data from GitHub again.

    Reproducing the Analysis

    This section shows how to load the data in the database and run the analyses notebooks. In the analysis, we used the following environment:

    Ubuntu 18.04.1 LTS
    PostgreSQL 10.6
    Conda 4.5.1
    Python 3.6.8
    PdfCrop 2012/11/02 v1.38

    First, download dump.tar.bz2 and extract it:

    tar -xjf dump.tar.bz2

    It extracts the file db2019-01-13.dump. Create a database in PostgreSQL (we call it "jupyter"), and use psql to restore the dump:

    psql jupyter < db2019-01-13.dump

    It populates the database with the dump. Now, configure the connection string for sqlalchemy by setting the environment variable JUP_DB_CONNECTION:

    export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter";
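
    As a quick sanity check (not part of the original scripts), the connection string can be exercised from Python with sqlalchemy, listing the public tables of the restored dump without assuming their names:

    import os

    from sqlalchemy import create_engine, text

    # Build an engine from the JUP_DB_CONNECTION string set above.
    engine = create_engine(os.environ["JUP_DB_CONNECTION"])
    with engine.connect() as conn:
        tables = conn.execute(text(
            "SELECT table_name FROM information_schema.tables WHERE table_schema = 'public'"
        )).fetchall()
        print(tables)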

    Download and extract jupyter_reproducibility.tar.bz2:

    tar -xjf jupyter_reproducibility.tar.bz2

    Create a conda environment with Python 3.6:

    conda create -n py36 python=3.6
    conda activate py36

    Go to the analyses folder and install all the dependencies of the requirements.txt

    cd jupyter_reproducibility/analyses
    pip install -r requirements.txt

    To reproduce the analyses, run jupyter in this folder:

    jupyter notebook

    Execute the notebooks in this order:

    • N0.Index.ipynb
    • N1.Repository.ipynb
    • N2.Notebook.ipynb
    • N3.Cell.ipynb
    • N4.Features.ipynb
    • N5.Modules.ipynb
    • N6.AST.ipynb
    • N7.Name.ipynb
    • N8.Execution.ipynb
    • N9.Cell.Execution.Order.ipynb
    • N10.Markdown.ipynb
    • N11.To.Paper.ipynb

    Reproducing or Expanding the Collection

    The collection demands more steps to reproduce and takes much longer to run (months). It also involves running arbitrary code on your machine. Proceed with caution.

    Requirements

    This time, we have extra requirements:

    • All the analysis requirements
    • lbzip2 2.5
    • gcc 7.3.0
    • GitHub account
    • Gmail account

    Environment

    First, set the following environment variables:

    export JUP_MACHINE="db"; # machine identifier
    export JUP_BASE_DIR="/mnt/jupyter/github"; # place to store the repositories
    export JUP_LOGS_DIR="/home/jupyter/logs"; # log files
    export JUP_COMPRESSION="lbzip2"; # compression program
    export JUP_VERBOSE="5"; # verbose level
    export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter"; # sqlalchemy connection
    export JUP_GITHUB_USERNAME="github_username"; # your github username
    export JUP_GITHUB_PASSWORD="github_password"; # your github password
    export JUP_MAX_SIZE="8000.0"; # maximum size of the repositories directory (in GB)
    export JUP_FIRST_DATE="2013-01-01"; # initial date to query github
    export JUP_EMAIL_LOGIN="gmail@gmail.com"; # your gmail address
    export JUP_EMAIL_TO="target@email.com"; # email that receives notifications
    export JUP_OAUTH_FILE="~/oauth2_creds.json"; # oauth2 authentication file
    export JUP_NOTEBOOK_INTERVAL=""; # notebook id interval for this machine. Leave it blank
    export JUP_REPOSITORY_INTERVAL=""; # repository id interval for this machine. Leave it blank
    export JUP_WITH_EXECUTION="1"; # run execute python notebooks
    export JUP_WITH_DEPENDENCY="0"; # run notebooks with and without declared dependencies
    export JUP_EXECUTION_MODE="-1"; # run following the execution order
    export JUP_EXECUTION_DIR="/home/jupyter/execution"; # temporary directory for running notebooks
    export JUP_ANACONDA_PATH="~/anaconda3"; # conda installation path
    export JUP_MOUNT_BASE="/home/jupyter/mount_ghstudy.sh"; # bash script to mount base dir
    export JUP_UMOUNT_BASE="/home/jupyter/umount_ghstudy.sh"; # bash script to umount base dir
    export JUP_NOTEBOOK_TIMEOUT="300"; # timeout the extraction
    
    
    # Frequency of log reports
    export JUP_ASTROID_FREQUENCY="5";
    export JUP_IPYTHON_FREQUENCY="5";
    export JUP_NOTEBOOKS_FREQUENCY="5";
    export JUP_REQUIREMENT_FREQUENCY="5";
    export JUP_CRAWLER_FREQUENCY="1";
    export JUP_CLONE_FREQUENCY="1";
    export JUP_COMPRESS_FREQUENCY="5";
    
    export JUP_DB_IP="localhost"; # postgres database IP

    Then, configure the file ~/oauth2_creds.json, according to yagmail documentation: https://media.readthedocs.org/pdf/yagmail/latest/yagmail.pdf

    Configure the mount_ghstudy.sh and umount_ghstudy.sh scripts. The first one should mount the folder that stores the directories. The second one should umount it. You can leave the scripts blank, but it is not advisable, as the reproducibility study runs arbitrary code on your machine and you may lose your data.

    Scripts

    Download and extract jupyter_reproducibility.tar.bz2:

    tar -xjf jupyter_reproducibility.tar.bz2

    Install 5 conda environments and 5 anaconda environments, one pair for each Python version. In each of them, upgrade pip, install pipenv, and install the archaeology package (note that it is a local package that has not been published to PyPI; make sure to use the -e option):

    Conda 2.7

    conda create -n raw27 python=2.7 -y
    conda activate raw27
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Anaconda 2.7

    conda create -n py27 python=2.7 anaconda -y
    conda activate py27
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology
    

    Conda 3.4

    It requires a manual jupyter and pathlib2 installation due to some incompatibilities found in the default installation.

    conda create -n raw34 python=3.4 -y
    conda activate raw34
    conda install jupyter -c conda-forge -y
    conda uninstall jupyter -y
    pip install --upgrade pip
    pip install jupyter
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology
    pip install pathlib2

    Anaconda 3.4

    conda create -n py34 python=3.4 anaconda -y
    conda activate py34
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Conda 3.5

    conda create -n raw35 python=3.5 -y
    conda activate raw35
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Anaconda 3.5

    It requires the manual installation of other anaconda packages.

    conda create -n py35 python=3.5 anaconda -y
    conda install -y appdirs atomicwrites keyring secretstorage libuuid navigator-updater prometheus_client pyasn1 pyasn1-modules spyder-kernels tqdm jeepney automat constantly anaconda-navigator
    conda activate py35
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Conda 3.6

    conda create -n raw36 python=3.6 -y
    conda activate raw36
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Anaconda 3.6

    conda create -n py36 python=3.6 anaconda -y
    conda activate py36
    conda install -y anaconda-navigator jupyterlab_server navigator-updater
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Conda 3.7

    conda create -n raw37 python=3.7 -y
    conda activate raw37
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

  8. Fruits-360 dataset

    • kaggle.com
    • data.mendeley.com
    Updated Jun 7, 2025
    Cite
    Mihai Oltean (2025). Fruits-360 dataset [Dataset]. https://www.kaggle.com/datasets/moltean/fruits
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 7, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Mihai Oltean
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
    License information was derived automatically

    Description

    Fruits-360 dataset: A dataset of images containing fruits, vegetables, nuts and seeds

    Version: 2025.06.07.0

    Content

    The following fruits, vegetables, nuts and seeds are included: Apples (different varieties: Crimson Snow, Golden, Golden-Red, Granny Smith, Pink Lady, Red, Red Delicious), Apricot, Avocado, Avocado ripe, Banana (Yellow, Red, Lady Finger), Beans, Beetroot Red, Blackberry, Blueberry, Cabbage, Caju seed, Cactus fruit, Cantaloupe (2 varieties), Carambula, Carrot, Cauliflower, Cherimoya, Cherry (different varieties, Rainier), Cherry Wax (Yellow, Red, Black), Chestnut, Clementine, Cocos, Corn (with husk), Cucumber (ripened, regular), Dates, Eggplant, Fig, Ginger Root, Goosberry, Granadilla, Grape (Blue, Pink, White (different varieties)), Grapefruit (Pink, White), Guava, Hazelnut, Huckleberry, Kiwi, Kaki, Kohlrabi, Kumsquats, Lemon (normal, Meyer), Lime, Lychee, Mandarine, Mango (Green, Red), Mangostan, Maracuja, Melon Piel de Sapo, Mulberry, Nectarine (Regular, Flat), Nut (Forest, Pecan), Onion (Red, White), Orange, Papaya, Passion fruit, Peach (different varieties), Pepino, Pear (different varieties, Abate, Forelle, Kaiser, Monster, Red, Stone, Williams), Pepper (Red, Green, Orange, Yellow), Physalis (normal, with Husk), Pineapple (normal, Mini), Pistachio, Pitahaya Red, Plum (different varieties), Pomegranate, Pomelo Sweetie, Potato (Red, Sweet, White), Quince, Rambutan, Raspberry, Redcurrant, Salak, Strawberry (normal, Wedge), Tamarillo, Tangelo, Tomato (different varieties, Maroon, Cherry Red, Yellow, not ripened, Heart), Walnut, Watermelon, Zucchini (green and dark).

    Branches

    The dataset has 5 major branches:

    -The 100x100 branch, where all images have 100x100 pixels. See _fruits-360_100x100_ folder.

    -The original-size branch, where all images are at their original (captured) size. See _fruits-360_original-size_ folder.

    -The meta branch, which contains additional information about the objects in the Fruits-360 dataset. See _fruits-360_dataset_meta_ folder.

    -The multi branch, which contains images with multiple fruits, vegetables, nuts and seeds. These images are not labeled. See _fruits-360_multi_ folder.

    -The _3_body_problem_ branch, where the Training and Test folders contain different varieties of the 3 fruits and vegetables (Apples, Cherries and Tomatoes). See _fruits-360_3-body-problem_ folder.

    How to cite

    Mihai Oltean, Fruits-360 dataset, 2017-

    Dataset properties

    For the 100x100 branch

    Total number of images: 138704.

    Training set size: 103993 images.

    Test set size: 34711 images.

    Number of classes: 206 (fruits, vegetables, nuts and seeds).

    Image size: 100x100 pixels.

    For the original-size branch

    Total number of images: 58363.

    Training set size: 29222 images.

    Validation set size: 14614 images

    Test set size: 14527 images.

    Number of classes: 90 (fruits, vegetables, nuts and seeds).

    Image size: various (original captured size).

    For the 3-body-problem branch

    Total number of images: 47033.

    Training set size: 34800 images.

    Test set size: 12233 images.

    Number of classes: 3 (Apples, Cherries, Tomatoes).

    Number of varieties: Apples = 29; Cherries = 12; Tomatoes = 19.

    Image size: 100x100 pixels.

    For the meta branch

    Number of classes: 26 (fruits, vegetables, nuts and seeds).

    For the multi branch

    Number of images: 150.

    Filename format:

    For the 100x100 branch

    image_index_100.jpg (e.g. 31_100.jpg) or

    r_image_index_100.jpg (e.g. r_31_100.jpg) or

    r?_image_index_100.jpg (e.g. r2_31_100.jpg)

    where "r" stands for rotated fruit. "r2" means that the fruit was rotated around the 3rd axis. "100" comes from image size (100x100 pixels).

    Different varieties of the same fruit (apple, for instance) are stored as belonging to different classes.

    For the original-size branch

    r?_image_index.jpg (e.g. r2_31.jpg)

    where "r" stands for rotated fruit. "r2" means that the fruit was rotated around the 3rd axis.

    The name of the image files in the new version does NOT contain the "_100" suffix anymore. This will help you to make the distinction between the original-size branch and the 100x100 branch.
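
    A small sketch (not part of the dataset) for parsing these file names according to the conventions above:

    import re

    # Split a file name into rotation prefix, image index, and the optional
    # "_100" suffix used by the 100x100 branch.
    PATTERN = re.compile(r"^(?P<rotation>r\d*)?_?(?P<index>\d+)(?P<size>_100)?\.jpg$")

    for name in ["31_100.jpg", "r_31_100.jpg", "r2_31_100.jpg", "r2_31.jpg"]:
        m = PATTERN.match(name)
        print(name, m.groupdict() if m else "no match")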

    For the multi branch

    The file's name is the concatenation of the names of the fruits inside that picture.

    Alternate download

    The Fruits-360 dataset can be downloaded from:

    Kaggle https://www.kaggle.com/moltean/fruits

    GitHub https://github.com/fruits-360

    How fruits were filmed

    Fruits and vegetables were planted in the shaft of a low-speed motor (3 rpm) and a short movie of 20 seconds was recorded.

    A Logitech C920 camera was used for filming the fruits. This is one of the best webcams available.

    Behind the fruits, we placed a white sheet of paper as a background.

    Here i...

  9. Data from: TerraDS: A Dataset for Terraform HCL Programs

    • zenodo.org
    application/gzip, bin
    Updated Nov 27, 2024
    Cite
    Christoph Bühler; David Spielmann; Roland Meier; Guido Salvaneschi (2024). TerraDS: A Dataset for Terraform HCL Programs [Dataset]. http://doi.org/10.5281/zenodo.14217386
    Explore at:
    application/gzip, bin
    Dataset updated
    Nov 27, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Christoph Bühler; David Spielmann; Roland Meier; Guido Salvaneschi
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    TerraDS

    The TerraDS dataset provides a comprehensive collection of Terraform programs written in the HashiCorp Configuration Language (HCL). As Infrastructure as Code (IaC) gains popularity for managing cloud infrastructure, Terraform has become one of the leading tools due to its declarative nature and widespread adoption. However, a lack of publicly available, large-scale datasets has hindered systematic research on Terraform practices. TerraDS addresses this gap by compiling metadata and source code from 62,406 open-source repositories with valid licenses. This dataset aims to foster research on best practices, vulnerabilities, and improvements in IaC methodologies.

    Structure of the Database

    The TerraDS dataset is organized into two main components: a SQLite database containing metadata and an archive of source code (~335 MB). The metadata, captured in a structured format, includes information about repositories, modules, and resources:

    1. Repository Data:

    • Contains 62,406 repositories with fields such as repository name, creation date, star count, and permissive license details.
    • Provides cloneable URLs for access and analysis.
    • Tracks additional metrics like repository size and the latest commit details.

    2. Module Data:

    • Consists of 279,344 modules identified within the repositories.
    • Each module includes its relative path, referenced providers, and external module calls stored as JSON objects.

    3. Resource Data:

    • Encompasses 1,773,991 resources, split into managed (1,484,185) and data (289,806) resources.
    • Each resource entry details its type, provider, and whether it is managed or read-only.
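
    As a quick orientation, the metadata database can be explored from Python with the standard sqlite3 module; this is only a sketch, with a placeholder file name, and it reads the table names from the schema rather than assuming them:

    import sqlite3

    # Open the SQLite metadata database (file name is a placeholder).
    con = sqlite3.connect("terrads.sqlite")
    tables = [row[0] for row in con.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    )]
    for table in tables:
        (count,) = con.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
        print(table, count)
    con.close()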

    Structure of the Archive

    The provided archive contains the source code of the 62,406 repositories to allow further analysis based on the actual source instead of the metadata only. As such, researchers can access the permissive repositories and conduct studies on the executable HCL code.

    Tools

    The "HCL Dataset Tools" file contains a snapshot of the https://github.com/prg-grp/hcl-dataset-tools repository - for long term archival reasons. The tools in this repository can be used to reproduce this dataset.

    One of the tools - "RepositorySearcher" - can be used to fetch metadata for various other GitHub API queries, not only Terraform code. While the RepositorySearcher allows usage for other types of repository search, the other tools provided are focused on Terraform repositories.

  10. Overhead Wind Turbine Dataset (NAIP)

    • data.niaid.nih.gov
    • explore.openaire.eu
    Updated Dec 2, 2022
    Cite
    Jordan Malof (2022). Overhead Wind Turbine Dataset (NAIP) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7385226
    Explore at:
    Dataset updated
    Dec 2, 2022
    Dataset provided by
    Yuxi Long
    Kyle Bradbury
    Jordan Malof
    Simiao Ren
    Frank Willard
    Caleb Kornfein
    Caroline Tang
    Saksham Jain
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    1 - OVERVIEW

    This dataset contains overhead images of wind turbines from three regions of the United States – the Eastern Midwest (EM), Northwest (NW), and Southwest (SW). The images come from the National Agricultural Imagery Program and were extracted using Google Earth Engine and wind turbine latitude-longitude coordinates from the U.S. Wind Turbine Database. Overall, there are 2003 NAIP-collected images, of which 988 images contain wind turbines and the other 1015 are background images (not containing wind turbines) collected from regions near the wind turbines. Labels are provided for all images containing wind turbines. We welcome uses of this dataset for object detection or other research purposes.

    2 - DATA DETAILS

    Each image is 608 x 608 pixels, with a GSD of 1m. This means each image represents a frame of approximately 608 m x 608 m. Because images were collected directly overhead the exact wind turbine coordinates, the turbines would otherwise appear almost exactly centered in every image. To avoid this issue, images were randomly shifted by up to 75 m in two directions.

    We refer to images without turbines as "background images", and further split up the images with turbines into the training and testing set splits. We call the training images with turbines "real images" and the testing images "test images".

    Distribution of gathered images by region and type:

    Domain   Real   Test   Background
    EM       267    100    244
    NW       213    100    415
    SW       208    100    356

    Note that this dataset is part of a larger research project in Duke's 2021-2022 Bass Connections team, Creating Artificial Worlds with AI to Improve Energy Access Data. Our research proposes a technique to synthetically generate images with implanted energy infrastructure objects. We include the synthetic images we generated along with the NAIP collected images above. Generating synthetic images requires a training and testing domain, so for each pair of domains we include 173 synthetically generated images. For a fuller picture on our research, including additional image data from domain adaptation techniques we benchmark our method against, visit our github: https://github.com/energydatalab/closing-the-domain-gap. If you use this dataset, please cite the citation found in our Github README.

    3 - NAVIGATING THE DATASET

    Once the data is unzipped, you will see that the base level of the dataset contains an image and a labels folder, which have the exact same structure. Here is how the images directory is divided:

    | - images
    | | - SW
    | | | - Background
    | | | - Test
    | | | - Real
    | | - EM
    | | | - Background
    | | | - Test
    | | | - Real
    | | - NW
    | | | - Background
    | | | - Test
    | | | - Real
    | | - Synthetic
    | | | - s_EM_t_NW
    | | | - s_SW_t_NW
    | | | - s_NW_t_NW
    | | | - s_NW_t_EM
    | | | - s_SW_t_EM
    | | | - s_EM_t_SW
    | | | - s_NW_t_SW
    | | | - s_EM_t_EM
    | | | - s_SW_t_SW

    For example images/SW/Real has the 208 .jpg images from the Southwest that contain turbines. The synthetic subdirectory is structured such that for example images/Synthetic/s_EM_t_NW contains synthetic images using a source domain of Eastern Midwest and a target domain of Northwest, meaning the images were stylized to artificially look like Northwest images.

    Note that we also provide a domain_overview.json file at the top level to help you navigate the directory. The domain_overview.json file navigates the directory with keys, so if you load the file as f, then f['images']['SW']['Background'] should list all the background photos from the SW. The keys in the domain json are ordered in the order we used the images for our experiments. So if our experiment used 100 SW background images, we used the images corresponding to the first 100 keys.
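
    A minimal sketch of that navigation in Python (the path is relative to the unzipped dataset; whether each leaf is a list of names or a dict keyed by name is inferred from the description above):

    import json

    # Load the overview file and list the SW background images.
    with open("domain_overview.json") as fh:
        f = json.load(fh)

    sw_background = f["images"]["SW"]["Background"]
    print(len(sw_background), "SW background images")
    # list() works whether the leaf is a list or a dict of keys.
    print(list(sw_background)[:5])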

    Naming conventions:

    1 - Real and Test images:

    {DOMAIN}_{UNIQUE ID}.jpg

    For example 'EM_136.jpg' with corresponding label file 'EM_136.txt' refers to an image from the Eastern Midwest with unique ID 136.

    2 - Background images:

    Background images were collected in 3 waves with the purpose to create a set of images similar visually to real images, just without turbines:

    The first wave came from NAIP images from the U.S. Wind Turbine Database coordinates where no wind turbine was present in the snapshot (NAIP images span a relatively large time, thus it is possible that wind turbines might be missing from the images). These images are labeled {DOMAIN}_{UNIQUE ID}.jpg, for example 'EM_1612_background.jpg'.

    In the second wave, using wind turbine coordinates, images were randomly collected either 4000m Southeast or Northwest. These images are labeled {DOMAIN}_{UNIQUE_ID}_{SHIFT DIRECTION (SE or NW)}.jpg. For example 'NW_12750_SE_background.jpg' refers to an image from the Northwest without turbines captured at a shift of 4000m Southeast from a wind turbine with unique ID 12750.

    In the third wave, using wind turbine coordinates, images were randomly collected either 6000m Southeast or Northwest. These images are labeled {DOMAIN}_{UNIQUE_ID}_{SHIFT DIRECTION (SE or NW)}_6000.jpg, for example 'NW_12937_NW_6000_background.jpg'.

    3 - Synthetic images

    Each synthetic image takes in labeled wind turbine examples from the source domain, a background image from the target domain, and a mask. It uses the mask to place wind turbine examples and blends those examples onto the background image using GP-GAN. Thus, the naming conventions for synthetic images are:

    {BACKGROUND IMAGE NAME FROM TARGET DOMAIN}_{MASK NUMBER}.jpg.

    For example, images/Synthetic/s_NW_t_SW/SW_2246_m15.jpg corresponds to a synthetic image created using labeled wind turbine examples from the Northwest and stylized in the image of the Southwest using Southwest background image SW_2246 and mask 15.

    For any remaining questions, please reach out to the author point of contact at caleb.kornfein@gmail.com.

  11. B3DB

    • huggingface.co
    Updated Jun 4, 2025
    Cite
    Maom Lab (2025). B3DB [Dataset]. https://huggingface.co/datasets/maomlab/B3DB
    Explore at:
    Dataset updated
    Jun 4, 2025
    Dataset authored and provided by
    Maom Lab
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    Blood-Brain Barrier Database (B3DB)

    The Blood-Brain Barrier Database (B3DB) is a large benchmark dataset compiled from 50 published resources (as summarized at raw_data/raw_data_summary.tsv) and categorized based on the consistency between different experimental references/measurements. This dataset was published in Scientific Data and is a mirror of theochem/B3DB, the official GitHub repo, where it is occasionally updated with new experimental data. We used the original datasets… See the full description on the dataset page: https://huggingface.co/datasets/maomlab/B3DB.
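
    A minimal sketch for loading the dataset with the Hugging Face datasets library; whether a configuration name is required is an assumption to verify on the dataset page:

    from datasets import load_dataset

    # Load B3DB from the Hugging Face Hub (a config name may be required).
    b3db = load_dataset("maomlab/B3DB")
    print(b3db)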

  12. HYG Stellar database / Base de données stellaire HYG

    • data.smartidf.services
    • datastro.eu
    csv, excel, json
    Updated Jun 27, 2022
    Cite
    (2022). HYG Stellar database / Base de données stellaire HYG [Dataset]. https://data.smartidf.services/explore/dataset/hyg-stellar-database/
    Explore at:
    csv, excel, json
    Dataset updated
    Jun 27, 2022
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
    License information was derived automatically

    Description

    Note: in this dataset, the stars whose distances are 100,000 parsecs are associated with defective data. As it can't be modified at the source, users have to exclude these stars manually: if you exclude them from all distance-related analyses, everything works perfectly well.

    "The database is a subset of the data in three major catalogs: the Hipparcos Catalog, the Yale Bright Star Catalog (5th Edition), and the Gliese Catalog of Nearby Stars (3rd Edition). Each of these catalogs contains information useful to amateur astronomers:

    • The Hipparcos catalog is the largest collection of high-accuracy stellar positional data, particularly parallaxes, which makes it useful as a starting point for stellar distance data.
    • The Yale Bright Star Catalog contains basic data on essentially all naked-eye stars, including much information (such as the traditional Bayer Greek letters and Flamsteed numbers) missing from many other catalogs.
    • The Gliese catalog is the most comprehensive catalog of nearby stars (those within 75 light years of the Sun). It contains many fainter stars not found in Hipparcos.

    The name of the database comes from the three catalogs comprising its data: Hipparcos, Yale, and Gliese. This database contains ALL stars that are either brighter than a certain magnitude cutoff (magnitude +7.5 to +9.0) or within 50 parsecs (about 160 light years) from the Sun. The current version, v. 3.0, has no magnitude cutoff: any star in Hipparcos, Yale, or Gliese is represented."

    Source: http://www.astronexus.com/hyg and https://github.com/astronexus/HYG-Database

    Fields in the database:

    • id: The database primary key.
    • hip: The star's ID in the Hipparcos catalog, if known.
    • hd: The star's ID in the Henry Draper catalog, if known.
    • hr: The star's ID in the Harvard Revised catalog, which is the same as its number in the Yale Bright Star Catalog.
    • gl: The star's ID in the third edition of the Gliese Catalog of Nearby Stars.
    • bf: The Bayer / Flamsteed designation, primarily from the Fifth Edition of the Yale Bright Star Catalog. This is a combination of the two designations. The Flamsteed number, if present, is given first; then a three-letter abbreviation for the Bayer Greek letter; the Bayer superscript number, if present; and finally, the three-letter constellation abbreviation. Thus Alpha Andromedae has the field value "21Alp And", and Kappa1 Sculptoris (no Flamsteed number) has "Kap1Scl".
    • ra, dec: The star's right ascension and declination, for epoch and equinox 2000.0.
    • proper: A common name for the star, such as "Barnard's Star" or "Sirius". I have taken these names primarily from the Hipparcos project's web site, which lists representative names for the 150 brightest stars and many of the 150 closest stars. I have added a few names to this list. Most of the additions are designations from catalogs mostly now forgotten (e.g., Lalande, Groombridge, and Gould ["G."]) except for certain nearby stars which are still best known by these designations.
    • dist: The star's distance in parsecs, the most common unit in astrometry. To convert parsecs to light years, multiply by 3.262. A value >= 10000000 indicates missing or dubious (e.g., negative) parallax data in Hipparcos.
    • pmra, pmdec: The star's proper motion in right ascension and declination, in milliarcseconds per year.
    • rv: The star's radial velocity in km/sec, where known.
    • mag: The star's apparent visual magnitude.
    • absmag: The star's absolute visual magnitude (its apparent magnitude from a distance of 10 parsecs).
    • spect: The star's spectral type, if known.
    • ci: The star's color index (blue magnitude - visual magnitude), where known.
    • x, y, z: The Cartesian coordinates of the star, in a system based on the equatorial coordinates as seen from Earth. +X is in the direction of the vernal equinox (at epoch 2000), +Z towards the north celestial pole, and +Y in the direction of R.A. 6 hours, declination 0 degrees.
    • vx, vy, vz: The Cartesian velocity components of the star, in the same coordinate system described immediately above. They are determined from the proper motion and the radial velocity (when known). The velocity unit is parsecs per year; these are small values (around 1 millionth of a parsec per year), but they enormously simplify calculations using parsecs as base units for celestial mapping.
    • rarad, decrad, pmrarad, pmdecrad: The positions in radians, and proper motions in radians per year.
    • bayer: The Bayer designation as a distinct value.
    • flam: The Flamsteed number as a distinct value.
    • con: The standard constellation abbreviation.
    • comp, comp_primary, base: Identifies a star in a multiple star system. comp = ID of companion star, comp_primary = ID of primary star for this component, and base = catalog ID or name for this multi-star system. Currently only used for Gliese stars.
    • lum: Star's luminosity as a multiple of Solar luminosity.
    • var: Star's standard variable star designation, when known.
    • var_min, var_max: Star's approximate magnitude range, for variables. This value is based on the Hp magnitudes for the range in the original Hipparcos catalog, adjusted to the V magnitude scale to match the "mag" field.
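
    A minimal sketch with pandas, assuming a CSV export of the database with the fields above (the file name is a placeholder); it drops the stars flagged with the defective 100,000-parsec distance before any distance-related analysis:

    import pandas as pd

    # Load the CSV export and filter out stars with defective distance data.
    hyg = pd.read_csv("hygdata_v3.csv")
    clean = hyg[hyg["dist"] < 100000]
    print(len(hyg) - len(clean), "stars with defective distances removed")
    print(clean.nsmallest(10, "dist")[["proper", "dist", "mag"]])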

  13. short_jokes

    • huggingface.co
    Updated Feb 22, 2024
    + more versions
    Cite
    yuvraj sharma (2024). short_jokes [Dataset]. https://huggingface.co/datasets/ysharma/short_jokes
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 22, 2024
    Authors
    yuvraj sharma
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Context: Generating humor is a complex task in the domain of machine learning; it requires models to understand the deep semantic meaning of a joke in order to generate new ones. Such problems are difficult to solve for a number of reasons, one of which is the lack of a database that provides an elaborate list of jokes. Thus, a large corpus of over 0.2 million jokes has been collected by scraping several websites containing funny and short jokes. You can visit the Github… See the full description on the dataset page: https://huggingface.co/datasets/ysharma/short_jokes.
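
    A minimal loading sketch with the Hugging Face datasets library, using the dataset ID from the URL above (the split name and column layout are assumptions and may differ):

     from datasets import load_dataset

     # Dataset ID taken from the URL above; inspect the object to see the
     # actual splits and columns before using them.
     jokes = load_dataset("ysharma/short_jokes", split="train")
     print(jokes)
     print(jokes[0])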

  14. 3D MNIST

    • kaggle.com
    Updated Oct 18, 2019
    Cite
    David de la Iglesia Castro (2019). 3D MNIST [Dataset]. https://www.kaggle.com/daavoo/3d-mnist/home
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 18, 2019
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    David de la Iglesia Castro
    Description

    Context

    The aim of this dataset is to provide a simple way to get started with 3D computer vision problems such as 3D shape recognition.

    Accurate 3D point clouds can nowadays be acquired easily and cheaply from a variety of sources.

    However, there is a lack of large 3D datasets (a good one, based on triangular meshes, can be found here); it is especially hard to find datasets based on point clouds (which are the raw output of every 3D sensing device).

    This dataset contains 3D point clouds generated from the original images of the MNIST dataset to bring a familiar introduction to 3D to people used to working with 2D datasets (images).

    In the 3D_from_2D notebook you can find the code used to generate the dataset.

    You can use the code in the notebook to generate a bigger 3D dataset from the original.

    Content

    full_dataset_vectors.h5

    The entire dataset, stored as 4096-D vectors obtained from the voxelization (x:16, y:16, z:16) of all the 3D point clouds.

    In addition to the original point clouds, it contains randomly rotated copies with noise.

    The full dataset is split into the following arrays:

    • X_train (10000, 4096)
    • y_train (10000)
    • X_test (2000, 4096)
    • y_test (2000)

    Example python code reading the full dataset:

     import h5py

     # Read the voxelized 16x16x16 vectors and labels from full_dataset_vectors.h5.
     with h5py.File("../input/full_dataset_vectors.h5", "r") as hf:
         X_train = hf["X_train"][:]
         y_train = hf["y_train"][:]
         X_test = hf["X_test"][:]
         y_test = hf["y_test"][:]
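
    Since each 4096-D vector is a flattened 16 x 16 x 16 voxel grid, it can be reshaped back into a volume for 3D models; this sketch reuses the arrays read above, and the axis order is an assumption:

     import numpy as np

     # Reshape the flattened voxel vectors into (16, 16, 16) volumes.
     X_train_3d = np.asarray(X_train).reshape(-1, 16, 16, 16)
     print(X_train_3d.shape)  # expected: (10000, 16, 16, 16)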
    

    train_point_clouds.h5 & test_point_clouds.h5

    5000 (train) and 1000 (test) 3D point clouds stored in HDF5 file format. The point clouds have zero mean and a maximum dimension range of 1.

    Each file is divided into HDF5 groups.

    Each group is named after its corresponding array index in the original MNIST dataset and contains:

    • "points" dataset: x, y, z coordinates of each 3D point in the point cloud.
    • "normals" dataset: nx, ny, nz components of the unit normal associated with each point.
    • "img" dataset: the original mnist image.
    • "label" attribute: the original mnist label.

    Example python code reading 2 digits and storing some of the group content in tuples:

     import h5py

     # Open the training point clouds and read two digits (groups "0" and "1").
     with h5py.File("../input/train_point_clouds.h5", "r") as hf:
         a = hf["0"]
         b = hf["1"]
         digit_a = (a["img"][:], a["points"][:], a.attrs["label"])
         digit_b = (b["img"][:], b["points"][:], b.attrs["label"])
    

    voxelgrid.py

    A simple Python class that generates a grid of voxels from a 3D point cloud. Check the kernel for usage.

    plot3D.py

    Module with functions to plot point clouds and voxel grids inside a Jupyter notebook. You have to run this locally because Kaggle notebooks do not support rendering IFrames; see the GitHub issue here.

    Functions included (see the usage sketch after this list):

    • array_to_color: converts a 1D array to RGB values for use as the color kwarg in plot_points()

    • plot_points(xyz, colors=None, size=0.1, axis=False)

    • plot_voxelgrid(v_grid, cmap="Oranges", axis=False)
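
    A minimal usage sketch combining the pieces above; it assumes plot3D.py is importable from the working directory and reuses the digit_a tuple read in the earlier example:

     from plot3D import array_to_color, plot_points  # local module described above

     img_a, points_a, label_a = digit_a

     # Color each point by its z coordinate and plot the cloud.
     colors = array_to_color(points_a[:, 2])
     plot_points(points_a, colors=colors, size=0.1, axis=False)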

    Acknowledgements

    Have fun!

  15. Data from: MarFERReT: an open-source, version-controlled reference library...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 22, 2025
    Cite
    Blaskowski, Stephen (2025). MarFERReT: an open-source, version-controlled reference library of marine microbial eukaryote functional genes [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7055911
    Explore at:
    Dataset updated
    Jan 22, 2025
    Dataset provided by
    Blaskowski, Stephen
    Coesel, Sacha
    Armbrust, E. Virginia
    Groussman, Mora J
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Metatranscriptomics generates large volumes of sequence data about transcribed genes in natural environments. Taxonomic annotation of these datasets depends on the availability of curated reference sequences. For marine microbial eukaryotes, current reference libraries are limited by gaps in sequenced organism diversity and barriers to updating libraries with new sequence data, resulting in taxonomic annotation of only about half of eukaryotic environmental transcripts. Here, we introduce version 1.0 of the Marine Functional EukaRyotic Reference Taxa (MarFERReT), an updated marine microbial eukaryotic sequence library with a version-controlled framework designed for taxonomic annotation of eukaryotic metatranscriptomes. We gathered 902 marine eukaryote genomes and transcriptomes from multiple sources and assessed these candidate entries for sequence quality and cross-contamination issues, selecting 800 validated entries for inclusion in the library. MarFERReT v1 contains reference sequences from 800 marine eukaryotic genomes and transcriptomes, covering 453 species- and strain-level taxa, totaling nearly 28 million protein sequences with associated NCBI and PR2 Taxonomy identifiers and Pfam functional annotations. An accompanying MarFERReT project repository hosts containerized build scripts, documentation on installation and use case examples, and information on new versions of MarFERReT: https://github.com/armbrustlab/marferret

    The raw source data for the 902 candidate entries considered for MarFERReT v1.1.1, including the 800 accepted entries, are available for download from their respective online locations. The source URL for each of the entries is listed here in MarFERReT.v1.1.1.entry_curation.csv, and detailed instructions and code for downloading the raw sequence data from source are available in the MarFERReT code repository (link).

    This repository release contains MarFERReT database files from the v1.1.1 MarFERReT release, built using the following MarFERReT library build scripts: assemble_marferret.sh, pfam_annotate.sh, and build_diamond_db.sh. The following MarFERReT data products are available in this repository:

    MarFERReT.v1.1.1.metadata.csv

    This CSV file contains descriptors of each of the 902 database entries, including data source, taxonomy, and sequence descriptors. Data fields are as follows:

    entry_id: Unique MarFERReT sequence entry identifier.

    accepted: Acceptance into the final MarFERReT build (Y/N). The Y/N values can be adjusted to customize the final build output according to user-specific needs.

    marferret_name: A human and machine friendly string derived from the NCBI Taxonomy organism name; maintaining strain-level designation wherever possible.

    tax_id: The NCBI Taxonomy ID (taxID).

    pr2_accession: Best-matching PR2 accession ID associated with entry

    pr2_rank: The lowest shared rank between the entry and the pr2_accession

    pr2_taxonomy: PR2 Taxonomy classification scheme of the pr2_accession

    data_type: Type of sequence data; transcriptome shotgun assemblies (TSA), gene models from assembled genomes (genome), and single-cell amplified genomes (SAG) or transcriptomes (SAT).

    data_source: Online location of sequence data; the Zenodo data repository (Zenodo), the datadryad.org repository (datadryad.org), MMETSP re-assemblies on Zenodo (MMETSP)17, NCBI GenBank (NCBI), JGI Phycocosm (JGI-Phycocosm), the TARA Oceans portal on Genoscope (TARA), or entries from the Roscoff Culture Collection through the METdb database repository (METdb).

    source_link: URL where the original sequence data and/or metadata was collected.

    pub_year: Year of data release or publication of linked reference.

    ref_link: Pubmed URL directs to the published reference for entry, if available.

    ref_doi: DOI of entry data from source, if available.

    source_filename: Name of the original sequence file name from the data source.

    seq_type: Entry sequence data retrieved in nucleotide (nt) or amino acid (aa) alphabets.

    n_seqs_raw: Number of sequences in the original sequence file.

    source_name: Full organism name from entry source

    original_taxID: Original NCBI taxID from entry data source metadata, if available

    alias: Additional identifiers for the entry, if available
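
    A minimal sketch for working with the metadata table (the use of pandas is an assumption; the column names and the 'Y' value are taken from the field descriptions above):

     import pandas as pd

     # Load the entry metadata and keep only entries accepted into the final build.
     meta = pd.read_csv("MarFERReT.v1.1.1.metadata.csv")
     accepted = meta[meta["accepted"] == "Y"]

     # Summarize the accepted entries by sequence data type (TSA, genome, SAG, SAT).
     print(accepted["data_type"].value_counts())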

    MarFERReT.v1.1.1.curation.csv

    This CSV file contains curation and quality-control information on the 902 candidate entries considered for incorporation into MarFERReT v1, including curated NCBI Taxonomy IDs and entry validation statistics. Data fields are as follows:

    entry_id: Unique MarFERReT sequence entry identifier

    marferret_name: Organism name in human and machine friendly format, including additional NCBI taxonomy strain identifiers if available.

    tax_id: Verified NCBI taxID used in MarFERReT

    taxID_status: Status of the final NCBI taxID (Assigned, Updated, or Unchanged)

    taxID_notes: Notes on the original_taxID

    n_seqs_raw: Number of sequences in the original sequence file

    n_pfams: Number of Pfam domains identified in protein sequences

    qc_flag: Early validation quality-control flags: LOW_SEQS (fewer than 1,200 raw sequences) and LOW_PFAMS (fewer than 500 Pfam domain annotations).

    flag_Lasek: Flag notes from Lasek-Nesselquist and Johnson (2019); contains the flag 'FLAG_LASEK' indicating ciliate samples reported as contaminated in this study.

    VV_contam_pct: Estimated contamination reported for MMETSP entries in Van Vlierberghe et al., (2021).

    flag_VanVlierberghe: Flag for a high level of estimated contamination; set to FLAG_VV when 'VV_contam_pct' is over 50%.

    rp63_npfams: Number of ribosomal protein Pfam domains out of 63 total.

    rp63_contam_pct: Percent of total ribosomal protein sequences with an inferred taxonomic identity in any lineage other than the recorded identity, as described in the Technical Validation section from analysis of 63 Pfam ribosomal protein domains.

    flag_rp63: Flag for a high level of estimated contamination, from 'rp63_contam_pct' values over 50%: FLAG_RP63.

    flag_sum: Count of the number of flag columns (qc_flag, flag_Lasek, flag_VanVlierberghe, and flag_rp63). All entries with one or more flag are nominally rejected ('accepted' = N); entries without any flags are validated and accepted ('accepted' = Y).

    accepted: Acceptance into the final MarFERReT build (Y or N).

    MarFERReT.v1.1.1.proteins.faa.gz

    This Gzip-compressed FASTA file contains the 27,951,013 final translated and clustered protein sequences for all 800 accepted MarFERReT entries. The sequence defline contains the unique identifier for the sequence and its reference (mftX, where 'X' is a ten-digit integer value).

    MarFERReT.v1.1.1.taxonomies.tab.gz

    This Gzip-compressed tab-separated file is formatted for interoperability with the DIAMOND protein alignment tool commonly used for downstream analyses and contains some columns without any data. Each row contains an entry for one of the MarFERReT protein sequences in MarFERReT.v1.proteins.faa.gz. Note that 'accession.version' and 'taxid' are populated columns while 'accession' and 'gi' have NA values; the latter columns are required for back-compatibility as input for the DIAMOND alignment software and LCA analysis.

    The columns in this file contain the following information:

    accession: (NA)

    accession.version: The unique MarFERReT sequence identifier ('mftX').

    taxid: The NCBI Taxonomy ID associated with this reference sequence.

    gi: (NA).

    MarFERReT.v1.1.1.proteins_info.tab.gz

    This Gzip-compressed tab-separated file contains a row for each final MarFERReT protein sequence with the following columns:

    aa_id: the unique identifier for each MarFERReT protein sequence.

    entry_id: The unique numeric identifier for each MarFERReT entry.

    source_defline: The original, unformatted sequence identifier

    MarFERReT.v1.1.1.best_pfam_annotations.csv.gz

    This Gzip-compressed CSV file contains the best-scoring Pfam annotation for intra-species clustered protein sequences from the 800 validated MarFERReT entries, derived from the hmmsearch annotations against Pfam 34.0 functional domains. This file contains the following fields:

    aa_id: The unique MarFERReT protein sequence ID ('mftX').

    pfam_name: The shorthand Pfam protein family name.

    pfam_id: The Pfam identifier.

    pfam_eval: hmm profile match e-value score

    pfam_score: hmm profile match bitscore

    MarFERReT.v1.1.1.dmnd

    This binary file is the indexed database of the MarFERReT protein library with embedded NCBI taxonomic information, generated by the DIAMOND makedb tool using the build_diamond_db.sh script from the MarFERReT /scripts/ library. It can be used as the reference DIAMOND database for annotating environmental sequences from eukaryotic metatranscriptomes.
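
    As an illustration of how the indexed database is typically used (not a command documented by this record), a DIAMOND protein alignment could be launched from Python roughly as follows; the query file name is a placeholder and the exact flags depend on your DIAMOND version, so check the DIAMOND documentation and the MarFERReT repository:

     import subprocess

     # Align environmental protein sequences against the MarFERReT DIAMOND database.
     # "--outfmt 102" requests DIAMOND's taxonomic classification (LCA) output in
     # recent DIAMOND versions; adjust the flags to your installed version.
     subprocess.run([
         "diamond", "blastp",
         "--db", "MarFERReT.v1.1.1.dmnd",
         "--query", "environmental_proteins.faa",  # placeholder input file
         "--out", "marferret_lca.tsv",
         "--outfmt", "102",
     ], check=True)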

  16. HumBugDB: a large-scale acoustic mosquito dataset

    • zenodo.org
    csv, zip
    Updated May 10, 2022
    Cite
    Ivan Kiskin; Lawrence Wang; Marianne Sinka; Adam D. Cobb; Benjamin Gutteridge; Davide Zilli; Waqas Rafique; Rinita Dam; Theodoros Marinos; Yunpeng Li; Gerard Killeen; Dickson Msaky; Emmanuel Kaindoa; Kathy Willis; Steve J. Roberts (2022). HumBugDB: a large-scale acoustic mosquito dataset [Dataset]. http://doi.org/10.5281/zenodo.4904800
    Explore at:
    zip, csv (available download formats)
    Dataset updated
    May 10, 2022
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Ivan Kiskin; Lawrence Wang; Marianne Sinka; Adam D. Cobb; Benjamin Gutteridge; Davide Zilli; Waqas Rafique; Rinita Dam; Theodoros Marinos; Yunpeng Li; Gerard Killeen; Dickson Msaky; Emmanuel Kaindoa; Kathy Willis; Steve J. Roberts
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A large-scale multi-species dataset of acoustic recordings

    Dataset accompanying code and paper: HumBugDB: a large-scale acoustic mosquito dataset.

    A large-scale multi-species dataset containing recordings of mosquitoes collected from multiple locations globally, as well as via different collection methods. In total, we present 71,286 seconds (20 hours) of labelled mosquito data with 53,227 seconds (15 hours) of corresponding background noise, recorded at the sites of 8 experiments. Of these, 64,843 seconds contain species metadata, consisting of 36 species (or species complexes).

    This repository contains the labelled audio recordings (as a multi-part zip archive) and the accompanying metadata (CSV).

    The data is supplemented by a GitHub repository, https://github.com/HumBug-Mosquito/HumBugDB, which supports its use as follows:

    • The multi-part zip is intended to be extracted into the folder: /data/audio/ in the repository.
    • The latest metadata is hosted on GitHub, so that additional metadata can be added as it becomes available in the database and bugs can be fixed.
    • Documentation for code use, and a complete Datasheet for Datasets also available on GitHub.
    • Example code for data splitting, feature extraction, model training, and evaluation in the top-level notebook main.ipynb (an illustrative feature-extraction sketch is given after this list).
    • Bayesian Convolutional Neural Network models, in both Keras and PyTorch, trained on this data are available at GitHub release v1.0.
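
    The repository's main.ipynb covers feature extraction end to end; purely as an illustration of a typical audio front end (not the project's exact pipeline), a log-mel spectrogram can be computed with librosa as follows, with the file path, sample rate, and parameters as placeholders:

     import librosa
     import numpy as np

     # Placeholder path to one recording extracted under /data/audio/.
     y, sr = librosa.load("data/audio/example_recording.wav", sr=8000)

     # Log-mel spectrogram, a common feature representation for acoustic event models.
     mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
     log_mel = librosa.power_to_db(mel, ref=np.max)
     print(log_mel.shape)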

  17. Arabic Handwritten Digits Dataset

    • figshare.com
    bin
    Updated May 31, 2023
    Cite
    Mohamed Loey (2023). Arabic Handwritten Digits Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.12236948.v1
    Explore at:
    bin (available download formats)
    Dataset updated
    May 31, 2023
    Dataset provided by
    figshare
    Authors
    Mohamed Loey
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Arabic Handwritten Digits Dataset

    Abstract: In recent years, handwritten digit recognition has become an important area of research due to its applications in several fields. This work focuses on the recognition part of Arabic handwritten digit recognition, which faces several challenges, including the unlimited variation in human handwriting and the large public databases. The paper provides a deep learning technique that can be effectively applied to recognizing Arabic handwritten digits: LeNet-5, a Convolutional Neural Network (CNN), was trained and tested on the MADBase database of Arabic handwritten digit images, which contains 60,000 training and 10,000 testing images. A comparison is held amongst the results, and it is shown that the use of CNN leads to significant improvements over other machine-learning classification algorithms, with the CNN reaching an average recognition accuracy of 99.15%.

    Context: The motivation of this study is to use knowledge learned across multiple works to enhance the performance of Arabic handwritten digit recognition. Arabic handwriting comes in many different styles, which makes it important to develop new and advanced solutions for handwriting recognition, and a deep learning system needs a huge number of images to be able to make good decisions.

    Content: MADBase is a modified Arabic handwritten digits database containing 60,000 training images and 10,000 test images. MADBase was written by 700 writers; each writer wrote each digit (from 0 to 9) ten times. To include different writing styles, the database was gathered from different institutions: Colleges of Engineering and Law, School of Medicine, the Open University (whose students span a wide range of ages), a high school, and a governmental institution. MADBase is available for free and can be downloaded from http://datacenter.aucegypt.edu/shazeem/.

    Acknowledgements: CNN for Handwritten Arabic Digits Recognition Based on LeNet-5 (http://link.springer.com/chapter/10.1007/978-3-319-48308-5_54), Ahmed El-Sawy, Hazem El-Bakry, Mohamed Loey. Proceedings of the International Conference on Advanced Intelligent Systems and Informatics 2016, Volume 533 of the series Advances in Intelligent Systems and Computing, pp. 566-575.

    Inspiration: Creating the proposed database presents more challenges because it deals with many issues such as writing style, stroke thickness, and the number and position of dots. Some characters have different shapes while written in the same position; for example, the teh character has different shapes in the isolated position. See also the Arabic Handwritten Characters Dataset: https://www.kaggle.com/mloey1/ahcd1. Benha University: http://bu.edu.eg/staff/mloey, https://mloey.github.io/
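
    Purely as an illustrative sketch of the LeNet-5-style architecture described above (not the authors' implementation), a Keras model for 28x28 grayscale digit images with 10 classes could look like this; the input size and preprocessing are assumptions based on MADBase's MNIST-like format:

     import tensorflow as tf
     from tensorflow.keras import layers, models

     # LeNet-5-style CNN; 28x28x1 inputs and 10 output classes are assumptions.
     model = models.Sequential([
         layers.Input(shape=(28, 28, 1)),
         layers.Conv2D(6, kernel_size=5, padding="same", activation="tanh"),
         layers.AveragePooling2D(pool_size=2),
         layers.Conv2D(16, kernel_size=5, activation="tanh"),
         layers.AveragePooling2D(pool_size=2),
         layers.Flatten(),
         layers.Dense(120, activation="tanh"),
         layers.Dense(84, activation="tanh"),
         layers.Dense(10, activation="softmax"),
     ])
     model.compile(optimizer="adam",
                   loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])
     model.summary()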

  18. CAnine CuTaneous Cancer Histology Dataset

    • cancerimagingarchive.net
    • dev.cancerimagingarchive.net
    json, n/a, svs +1
    Updated Jan 12, 2022
    Cite
    The Cancer Imaging Archive (2022). CAnine CuTaneous Cancer Histology Dataset [Dataset]. http://doi.org/10.7937/TCIA.2M93-FX66
    Explore at:
    n/a, svs, json, zip and sqlite (available download formats)
    Dataset updated
    Jan 12, 2022
    Dataset authored and provided by
    The Cancer Imaging Archive
    License

    https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/

    Time period covered
    Jan 12, 2022
    Dataset funded by
    National Cancer Institute: http://www.cancer.gov/
    Description

    We present a large-scale dataset of 350 histologic samples of seven different canine cutaneous tumors. All samples were obtained through surgical resection due to neoplastic indicators and were selected retrospectively from the biopsy archive of the Institute for Veterinary Pathology of the Freie Universität Berlin according to sufficient tissue preservation and presence of characteristic histologic features for the corresponding tumor subtypes. Samples were stained with a routine Hematoxylin & Eosin dye and digitized with two Leica linear scanning systems at a resolution of 0.25 um/pixel. Together with the 350 whole slide images, we provide a database consisting of 12,424 polygon annotations for six non-neoplastic tissue classes (epidermis, dermis, subcutis, bone, cartilage, and a joint class of inflammation and necrosis) and seven tumor classes (melanoma, mast cell tumor, squamous cell carcinoma, peripheral nerve sheath tumor, plasmacytoma, trichoblastoma, and histiocytoma).

    The polygon annotations were generated using the open source software SlideRunner (https://github.com/DeepPathology/SlideRunner). Within SlideRunner, users can view whole slide images (WSIs) and zoom through their magnification levels. Using multiple clicks or click-and-drag, the pathologist annotated polygons for 13 classes (epidermis, dermis, subcutis, bone, cartilage, a joint class of inflammation and necrosis, melanoma, mast cell tumor, squamous cell carcinoma, peripheral nerve sheath tumor, plasmacytoma, trichoblastoma, and histiocytoma) on 287 WSIs. The remaining WSIs were annotated by three medical students in their 8th semester supervised by the leading pathologist who later reviewed these annotations for correctness and completeness.

    Due to the large size of the dataset and the extensive annotations, it provides a good basis for segmentation and classification algorithms based on supervised learning. Previous work [1-4] has shown that, due to various homologies between the species, canine cutaneous tissue can serve as a model for human samples. Prouteau et al. have published an extensive comparison of the two species, especially for cutaneous tumors, and include homologies between canine and human oncology regarding "clinical and histological appearance, biological behavior, tumor genetics, molecular pathways and targets, and response to therapies" [1]. Ranieri et al. highlight that pet dogs and humans share many environmental risk factors and show the highest risk for cancer development at similar points in time relative to their life spans [2]. Both Ranieri et al. and Pinho et al. highlight the potential of using insights from experiments on canine samples for developing human cancer treatments [2,3]. From a technical perspective, Aubreville et al. have shown that canine samples can be used to aid human cancer research through the use of transfer learning methods [4].

    Potential users of the dataset can load the SQLite database into their custom installation of SlideRunner and adapt or extend the database with custom annotations. Furthermore, we converted the annotations to the COCO JSON format, which is commonly used by computer scientists for training neural networks. Its pixel-level annotations can be used for supervised segmentation algorithms, as opposed to datasets that only provide clinical data at the slide level.
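
    As a quick orientation to the COCO-style annotations mentioned above, the following sketch counts annotations per class using only the standard library; the annotation file name is a placeholder, and the keys assume the standard COCO layout:

     import json
     from collections import Counter

     # Placeholder file name for the COCO-format annotation export.
     with open("canine_cutaneous_annotations.json") as f:
         coco = json.load(f)

     # Map category IDs to names and count polygon annotations per class.
     id_to_name = {c["id"]: c["name"] for c in coco["categories"]}
     counts = Counter(id_to_name[a["category_id"]] for a in coco["annotations"])
     for name, n in counts.most_common():
         print(f"{name}: {n}")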

    References

    1. Prouteau, Anaïs, and Catherine André. "Canine melanomas as models for human melanomas: Clinical, histological, and genetic comparison." Genes 10.7 (2019): 501. https://doi.org/10.3390/genes10070501
    2. Ranieri, G., et al. "A model of study for human cancer: Spontaneous occurring tumors in dogs. Biological features and translation for new anticancer therapies." Critical reviews in oncology/hematology 88.1 (2013): 187-197. https://doi.org/10.1016/j.critrevonc.2013.03.005
    3. Pinho, Salomé S., et al. "Canine tumors: a spontaneous animal model of human carcinogenesis." Translational Research 159.3 (2012): 165-172. https://doi.org/10.1016/j.trsl.2011.11.005
    4. Aubreville, Marc, et al. "A completely annotated whole slide image dataset of canine breast cancer to aid human breast cancer research." Scientific data 7.1 (2020): 1-10. https://doi.org/10.1038/s41597-020-00756-z

  19. LUTBIO multimodal biometric database

    • data.mendeley.com
    Updated Nov 25, 2024
    + more versions
    Cite
    rui yang (2024). LUTBIO multimodal biometric database [Dataset]. http://doi.org/10.17632/jszw485f8j.4
    Explore at:
    Dataset updated
    Nov 25, 2024
    Authors
    rui yang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The LUTBIO database provides a comprehensive resource for research in multimodal biometric authentication, featuring the following key aspects:

    • Extensive Biometric Modalities: The database contains data from nine biometric modalities: voice, face, fingerprint, contact-based palmprint, electrocardiogram (ECG), opisthenar (back of hand), ear, contactless palmprint, and periocular region.

    • Diverse Demographics: Data were collected from 306 individuals, with a balanced gender distribution of 164 males and 142 females, spanning an age range of 8 to 90 years. This diverse age representation enables analyses across a wide demographic spectrum.

    • Representative Population Sampling: Volunteers were recruited from naturally occurring communities, ensuring a large-scale, statistically representative population. The collected data encompass variations observed in real-world environments.

    • Support for Multimodal and Cross-Modality Research: LUTBIO provides both contact-based and contactless palmprint data, as well as fingerprint data (from optical images and scans), promoting advancements in multimodal biometric authentication. This resource is designed to guide the development of future multimodal databases.

    • Flexible, Decouplable Data: The biometric data in the LUTBIO database are designed to be highly decouplable, enabling independent processing of each modality without loss of information. This flexibility supports both single-modality and multimodal analysis, empowering researchers to optimize, combine, and customize biometric features for specific applications.

    ✅ Data Availability: To facilitate early access, we are initially releasing sample data from 6 individuals. Upon publication of the paper, the full dataset will be made available.

    🥸 Important Notice: Please read the data collection protocol of the LUTBIO dataset carefully before use, as it is essential for understanding and correctly interpreting the dataset. Thank you.

    😎 Good news! Our paper has received revision comments, and we are carefully making revisions based on the feedback. We appreciate the reviewers and the editor for their efforts.🥰🥰🥰

    🤠 We conducted further research based on the LUTBIO database and proposed AuthFormer: An Adaptive Multimodal Biometric Authentication Framework for Secure and Flexible Identity Verification. The code is available at https://github.com/RykerYang/Authformer-LUTBIO.git. Our work is currently under review for an international conference, where we are dealing with a malicious reviewer. We are in the process of responding and appealing—wish us luck! 🍀🍀🍀

  20. mroeck/carbenmats-buildings: Pre-release

    • zenodo.org
    zip
    Updated Sep 26, 2023
    Cite
    Martin RÖCK; Martin RÖCK (2023). mroeck/carbenmats-buildings: Pre-release [Dataset]. http://doi.org/10.5281/zenodo.8363895
    Explore at:
    zipAvailable download formats
    Dataset updated
    Sep 26, 2023
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Martin RÖCK; Martin RÖCK
    Description

    A Global Database on Whole Life Carbon, Energy and Material Intensity of Buildings (CarbEnMats-Buildings)

    Abstract

    Globally, interest in understanding the life cycle related greenhouse gas (GHG) emissions of buildings is increasing. Robust data is required for benchmarking and analysis of parameters driving resource use and whole life carbon (WLC) emissions. However, open datasets combining information on energy and material use as well as whole life carbon emissions remain largely unavailable – until now.

    We present a global database on whole life carbon, energy use, and material intensity of buildings. It contains data on more than 1,200 building case studies and includes over 300 attributes addressing context and site, building design, assessment methods, energy and material use, as well as WLC emissions across different life cycle stages. The data was collected through various meta-studies, using a dedicated data collection template (DCT) and processing scripts (Python Jupyter Notebooks), all of which are shared alongside this data descriptor.

    This dataset is valuable for industrial ecology and sustainable construction research and will help inform decision-making in the building industry as well as the climate policy context.

    Background & Summary

    The need for reducing greenhouse gas (GHG) emissions across Europe requires defining and implementing a performance system for both operational and embodied carbon at the building level that provides relevant guidance for policymakers and the building industry. So-called whole life carbon (WLC) of buildings is gaining increasing attention among decision-makers concerned with climate and industrial policy, as well as building procurement, design, and operation. However, most open building datasets published thus far have focused on buildings' operational energy consumption and related parameters 1,2,2–4. Recent years have furthermore brought large-scale datasets on building geometry (footprint, height) 5,6 as well as the publication of some datasets on building construction systems and material intensity 7,8. Heeren and Fishman's database seed on material intensity (MI) of buildings 7, an essential reference to this work, was a first step towards an open data repository on material-related environmental impacts of buildings. In their 2019 descriptor, the authors present data on the material coefficients of more than 300 building cases intended for use in studies applying material flow analysis (MFA), input-output (IO) or life cycle assessment (LCA) methods. Guven et al. 8 elaborated on this effort by publishing a construction classification system database for understanding resource use in building construction. However, thus far, there is a lack of publicly available data that combines material composition and energy use and also considers life cycle-related environmental impacts, such as life cycle-related GHG emissions, also referred to as buildings' whole life carbon.

    The Global Database on Whole Life Carbon, Energy Use, and Material Intensity of Buildings (CarbEnMats-Buildings) published alongside this descriptor provides information on more than 1,200 buildings worldwide. The dataset includes attributes on geographical context and site, main building design characteristics, LCA-based assessment methods, as well as information on energy and material use, and related life cycle greenhouse gas (GHG) emissions, commonly referred to as whole life carbon (WLC), with a focus on embodied carbon (EC) emissions. The dataset compiles data obtained through a systematic review of the scientific literature as well as systematic data collection from both literature sources and industry partners. By applying a uniform data collection template (DCT) and related automated procedures for systematic data collection and compilation, we facilitate the processing, analysis and visualization along predefined categories and attributes, and support the consistency of data types and units. The descriptor includes specifications related to the DCT spreadsheet form used for obtaining these data as well as explanations of the data processing and feature engineering steps undertaken to clean and harmonise the data records. The validation focuses on describing the composition of the dataset and values observed for attributes related to whole life carbon, energy and material intensity.

    The data published with this descriptor offers the largest open compilation of data on whole life carbon emissions, energy use and material intensity of buildings published to date. This open dataset is expected to be valuable for research applications in the context of MFA, I/O and LCA modelling. It also offers a unique data source for benchmarking whole life carbon, energy use and material intensity of buildings to inform policy and decision-making in the context of the decarbonization of building construction and operation as well as commercial real estate in Europe and beyond.

    Files

    All files related to this descriptor are available on a public GitHub repository and related release via Zenodo (https://doi.org/10.5281/zenodo.8363895). The repository contains the following files:

    • README.md is a text file with instructions on how to use the files and documents.
    • CarbEnMats_attributes.XLSX is a table with the complete attribute description.
    • CarbEnMats_materials.XLSX is the table of material options and mappings.
    • CarbEnMats_dataset.XLSX is the building dataset in MS Excel format.
    • CarbEnMats_dataset.txt is the building dataset in tab-delimited TXT format (see the loading sketch after this list).
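
    A minimal loading sketch for the tab-delimited version of the dataset (the use of pandas is an assumption; no column names are assumed here):

     import pandas as pd

     # Load the tab-delimited building dataset.
     buildings = pd.read_csv("CarbEnMats_dataset.txt", sep="\t")

     # Quick shape check: over 1,200 building case studies and 300+ attributes.
     print(buildings.shape)
     print(list(buildings.columns)[:10])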

    Further information

    Please consult the related data descriptor article (linked at the top) for further information, e.g.:

    • Methods (Data collection; data processing)
    • Data records (Files; Sources; Attributes)
    • Technical validation (Data overview; Data consistency)
    • Usage Notes (Attribute priority; Scope summary, Missing information)

    Code availability (LICENSE)

    The dataset, the data collection template as well as the code used for processing, harmonization and visualization are published under a GNU General Public License v3.0. The GNU General Public License is a free, copyleft license for software and other kinds of works. We encourage you to review, reuse, and refine the data and scripts and eventually share-alike.

    Contributing

    The CarbEnMats-Buildings database is the result of a highly collaborative effort and needs your active contributions to further improve and grow the open building data landscape. Reach out to the lead author (email, linkedin) if you are interested in contributing your data or time.

    Cite as

    When referring to this work, please cite both the descriptor and the dataset:

    • Descriptor: RÖCK, Martin, SORENSEN, Andreas, BALOUKTSI, Maria, RUSCHI MENDES SAADE, Marcella, RASMUSSEN, Freja Nygaard, BIRGISDOTTIR, Harpa, FRISCHKNECHT, Rolf, LÜTZKENDORF, Thomas, HOXHA, Endrit, HABERT, Guillaume, SATOLA, Daniel, TRUGER, Barbara, TOZAN, Buket, KUITTINEN, Matti, ALAUX, Nicolas, ALLACKER, Karen, & PASSER, Alexander. (2023). A Global Database on Whole Life Carbon, Energy and Material Intensity of Buildings (CarbEnMats-Buildings) [Preprint]. Zenodo. https://doi.org/10.5281/zenodo.8378939
    • Dataset: Martin Röck. (2023). mroeck/carbenmats-buildings: Pre-release (0.1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.8363895
Cite
João Felipe; Leonardo; Vanessa; Juliana (2019). Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter Notebooks [Dataset]. http://doi.org/10.5281/zenodo.2592524

Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter Notebooks

Explore at:
23 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
Mar 13, 2019
Authors
João Felipe; Leonardo; Vanessa; Juliana
Description

The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourage poor coding practices and that their results can be hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub. Paper: https://2019.msrconf.org/event/msr-2019-papers-a-large-scale-study-about-quality-and-reproducibility-of-jupyter-notebooks This repository contains two files: dump.tar.bz2 jupyter_reproducibility.tar.bz2 The dump.tar.bz2 file contains a PostgreSQL dump of the database, with all the data we extracted from the notebooks. The jupyter_reproducibility.tar.bz2 file contains all the scripts we used to query and download Jupyter Notebooks, extract data from them, and analyze the data. It is organized as follows: analyses: this folder has all the notebooks we use to analyze the data in the PostgreSQL database. archaeology: this folder has all the scripts we use to query, download, and extract data from GitHub notebooks. paper: empty. The notebook analyses/N12.To.Paper.ipynb moves data to it In the remaining of this text, we give instructions for reproducing the analyses, by using the data provided in the dump and reproducing the collection, by collecting data from GitHub again. Reproducing the Analysis This section shows how to load the data in the database and run the analyses notebooks. In the analysis, we used the following environment: Ubuntu 18.04.1 LTS PostgreSQL 10.6 Conda 4.5.11 Python 3.7.2 PdfCrop 2012/11/02 v1.38 First, download dump.tar.bz2 and extract it: tar -xjf dump.tar.bz2 It extracts the file db2019-03-13.dump. Create a database in PostgreSQL (we call it "jupyter"), and use psql to restore the dump: psql jupyter < db2019-03-13.dump It populates the database with the dump. Now, configure the connection string for sqlalchemy by setting the environment variable JUP_DB_CONNECTTION: export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter"; Download and extract jupyter_reproducibility.tar.bz2: tar -xjf jupyter_reproducibility.tar.bz2 Create a conda environment with Python 3.7: conda create -n analyses python=3.7 conda activate analyses Go to the analyses folder and install all the dependencies of the requirements.txt cd jupyter_reproducibility/analyses pip install -r requirements.txt For reproducing the analyses, run jupyter on this folder: jupyter notebook Execute the notebooks on this order: Index.ipynb N0.Repository.ipynb N1.Skip.Notebook.ipynb N2.Notebook.ipynb N3.Cell.ipynb N4.Features.ipynb N5.Modules.ipynb N6.AST.ipynb N7.Name.ipynb N8.Execution.ipynb N9.Cell.Execution.Order.ipynb N10.Markdown.ipynb N11.Repository.With.Notebook.Restriction.ipynb N12.To.Paper.ipynb Reproducing or Expanding the Collection The collection demands more steps to reproduce and takes much longer to run (months). It also involves running arbitrary code on your machine. Proceed with caution. 
Requirements. This time, we have extra requirements: all the analysis requirements, lbzip2 2.5, gcc 7.3.0, a GitHub account, and a Gmail account.

Environment. First, set the following environment variables:

    export JUP_MACHINE="db"; # machine identifier
    export JUP_BASE_DIR="/mnt/jupyter/github"; # place to store the repositories
    export JUP_LOGS_DIR="/home/jupyter/logs"; # log files
    export JUP_COMPRESSION="lbzip2"; # compression program
    export JUP_VERBOSE="5"; # verbose level
    export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter"; # sqlalchemy connection
    export JUP_GITHUB_USERNAME="github_username"; # your github username
    export JUP_GITHUB_PASSWORD="github_password"; # your github password
    export JUP_MAX_SIZE="8000.0"; # maximum size of the repositories directory (in GB)
    export JUP_FIRST_DATE="2013-01-01"; # initial date to query github
    export JUP_EMAIL_LOGIN="gmail@gmail.com"; # your gmail address
    export JUP_EMAIL_TO="target@email.com"; # email that receives notifications
    export JUP_OAUTH_FILE="~/oauth2_creds.json" # oauth2 authentication file
    export JUP_NOTEBOOK_INTERVAL=""; # notebook id interval for this machine; leave it blank
    export JUP_REPOSITORY_INTERVAL=""; # repository id interval for this machine; leave it blank
    export JUP_WITH_EXECUTION="1"; # execute python notebooks
    export JUP_WITH_DEPENDENCY="0"; # run notebooks with and without declared dependencies
    export JUP_EXECUTION_MODE="-1"; # run following the execution order
    export JUP_EXECUTION_DIR="/home/jupyter/execution"; # temporary directory for running notebooks
    export JUP_ANACONDA_PATH="~/anaconda3"; # conda installation path
    export JUP_MOUNT_BASE="/home/jupyter/mount_ghstudy.sh"; # bash script to mount base dir
    export JUP_UMOUNT_BASE="/home/jupyter/umount_ghstudy.sh"; # bash script to umount base dir
    export JUP_NOTEBOOK_TIMEOUT="300"; # timeou...
