27 datasets found
  1. Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter...

    • explore.openaire.eu
    • zenodo.org
    Updated Mar 13, 2019
    + more versions
    Cite
    João Felipe; Leonardo; Vanessa; Juliana (2019). Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter Notebooks [Dataset]. http://doi.org/10.5281/zenodo.2592524
    Explore at:
    Dataset updated
    Mar 13, 2019
    Authors
    João Felipe; Leonardo; Vanessa; Juliana
    Description

    The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourages poor coding practices, and that their results can be hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub.

    Paper: https://2019.msrconf.org/event/msr-2019-papers-a-large-scale-study-about-quality-and-reproducibility-of-jupyter-notebooks

    This repository contains two files:

    • dump.tar.bz2
    • jupyter_reproducibility.tar.bz2

    The dump.tar.bz2 file contains a PostgreSQL dump of the database, with all the data we extracted from the notebooks.

    The jupyter_reproducibility.tar.bz2 file contains all the scripts we used to query and download Jupyter Notebooks, extract data from them, and analyze the data. It is organized as follows:

    • analyses: this folder has all the notebooks we use to analyze the data in the PostgreSQL database.
    • archaeology: this folder has all the scripts we use to query, download, and extract data from GitHub notebooks.
    • paper: empty. The notebook analyses/N12.To.Paper.ipynb moves data to it.

    In the remainder of this text, we give instructions for reproducing the analyses, by using the data provided in the dump, and for reproducing the collection, by collecting data from GitHub again.

    Reproducing the Analysis

    This section shows how to load the data in the database and run the analysis notebooks. In the analysis, we used the following environment:

    Ubuntu 18.04.1 LTS
    PostgreSQL 10.6
    Conda 4.5.11
    Python 3.7.2
    PdfCrop 2012/11/02 v1.38

    First, download dump.tar.bz2 and extract it:

    tar -xjf dump.tar.bz2

    It extracts the file db2019-03-13.dump. Create a database in PostgreSQL (we call it "jupyter"), and use psql to restore the dump:

    psql jupyter < db2019-03-13.dump

    It populates the database with the dump. Now, configure the connection string for sqlalchemy by setting the environment variable JUP_DB_CONNECTION:

    export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter";

    Download and extract jupyter_reproducibility.tar.bz2:

    tar -xjf jupyter_reproducibility.tar.bz2

    Create a conda environment with Python 3.7:

    conda create -n analyses python=3.7
    conda activate analyses

    Go to the analyses folder and install all the dependencies of the requirements.txt:

    cd jupyter_reproducibility/analyses
    pip install -r requirements.txt

    To reproduce the analyses, run jupyter in this folder:

    jupyter notebook

    Execute the notebooks in this order:

    • Index.ipynb
    • N0.Repository.ipynb
    • N1.Skip.Notebook.ipynb
    • N2.Notebook.ipynb
    • N3.Cell.ipynb
    • N4.Features.ipynb
    • N5.Modules.ipynb
    • N6.AST.ipynb
    • N7.Name.ipynb
    • N8.Execution.ipynb
    • N9.Cell.Execution.Order.ipynb
    • N10.Markdown.ipynb
    • N11.Repository.With.Notebook.Restriction.ipynb
    • N12.To.Paper.ipynb

    Reproducing or Expanding the Collection

    The collection demands more steps to reproduce and takes much longer to run (months). It also involves running arbitrary code on your machine. Proceed with caution.

    Requirements

    This time, we have extra requirements:

    • All the analysis requirements
    • lbzip2 2.5
    • gcc 7.3.0
    • GitHub account
    • Gmail account

    Environment

    First, set the following environment variables:

    export JUP_MACHINE="db"; # machine identifier
    export JUP_BASE_DIR="/mnt/jupyter/github"; # place to store the repositories
    export JUP_LOGS_DIR="/home/jupyter/logs"; # log files
    export JUP_COMPRESSION="lbzip2"; # compression program
    export JUP_VERBOSE="5"; # verbose level
    export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter"; # sqlalchemy connection
    export JUP_GITHUB_USERNAME="github_username"; # your github username
    export JUP_GITHUB_PASSWORD="github_password"; # your github password
    export JUP_MAX_SIZE="8000.0"; # maximum size of the repositories directory (in GB)
    export JUP_FIRST_DATE="2013-01-01"; # initial date to query github
    export JUP_EMAIL_LOGIN="gmail@gmail.com"; # your gmail address
    export JUP_EMAIL_TO="target@email.com"; # email that receives notifications
    export JUP_OAUTH_FILE="~/oauth2_creds.json"; # oauth2 authentication file
    export JUP_NOTEBOOK_INTERVAL=""; # notebook id interval for this machine. Leave it blank
    export JUP_REPOSITORY_INTERVAL=""; # repository id interval for this machine. Leave it blank
    export JUP_WITH_EXECUTION="1"; # run execute python notebooks
    export JUP_WITH_DEPENDENCY="0"; # run notebooks with and without declared dependencies
    export JUP_EXECUTION_MODE="-1"; # run following the execution order
    export JUP_EXECUTION_DIR="/home/jupyter/execution"; # temporary directory for running notebooks
    export JUP_ANACONDA_PATH="~/anaconda3"; # conda installation path
    export JUP_MOUNT_BASE="/home/jupyter/mount_ghstudy.sh"; # bash script to mount base dir
    export JUP_UMOUNT_BASE="/home/jupyter/umount_ghstudy.sh"; # bash script to umount base dir
    export JUP_NOTEBOOK_TIMEOUT="300"; # timeou...

  2. MoreFixes: Largest CVE dataset with fixes

    • data.niaid.nih.gov
    • explore.openaire.eu
    • +1more
    Updated Oct 23, 2024
    Cite
    Akhoundali, Jafar (2024). MoreFixes: Largest CVE dataset with fixes [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11199119
    Explore at:
    Dataset updated
    Oct 23, 2024
    Dataset provided by
    Rietveld, Kristian F. D.
    GADYATSKAYA, Olga
    Akhoundali, Jafar
    Rahim Nouri, Sajad
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    In our work, we have designed and implemented a novel workflow with several heuristic methods to combine state-of-the-art methods related to CVE fix commit gathering. As a consequence of our improvements, we have been able to gather the largest programming language-independent real-world dataset of CVE vulnerabilities with the associated fix commits. Our dataset containing 29,203 unique CVEs coming from 7,238 unique GitHub projects is, to the best of our knowledge, by far the biggest CVE vulnerability dataset with fix commits available today. These CVEs are associated with 35,276 unique commits stored as SQL and 39,931 patch commit files that fixed those vulnerabilities (some patch files can't be saved as SQL due to several technical reasons). Our larger dataset thus substantially improves over the current real-world vulnerability datasets and enables further progress in research on vulnerability detection and software security. We used the NVD (nvd.nist.gov) and the GitHub Security Advisory Database as the main sources of our pipeline.

    We release to the community a 16GB PostgreSQL database that contains information on CVEs up to 2024-09-26, CWEs of each CVE, files and methods changed by each commit, and repository metadata. Additionally, patch files related to the fix commits are available as a separate package. Furthermore, we make our dataset collection tool also available to the community.

    The cvedataset-patches.zip file contains fix patches, and postgrescvedumper.sql.zip contains a PostgreSQL dump of fixes, together with several other fields such as CVEs, CWEs, repository metadata, commit data, file changes, methods changed, etc.

    The MoreFixes data-storage strategy is based on CVEFixes for storing CVE fix commits from open-source repositories, and uses a modified version of Prospector (part of Project KB from SAP) as a module to detect the fix commits of a CVE. Our full methodology is presented in the paper "MoreFixes: A Large-Scale Dataset of CVE Fix Commits Mined through Enhanced Repository Discovery", published at the PROMISE conference (2024).

    For more information about usage and sample queries, visit the GitHub repository: https://github.com/JafarAkhondali/Morefixes

    If you are using this dataset, please be aware that the repositories we mined carry different licenses and you are responsible for handling any licensing issues. The same applies to CVEFixes.

    This product uses the NVD API but is not endorsed or certified by the NVD.

    This research was partially supported by the Dutch Research Council (NWO) under the project NWA.1215.18.008 Cyber Security by Integrated Design (C-SIDe).

    To restore the dataset, you can use the docker-compose file available in the GitHub repository. Default dataset credentials after restoring the dump:

    POSTGRES_USER=postgrescvedumper POSTGRES_DB=postgrescvedumper POSTGRES_PASSWORD=a42a18537d74c3b7e584c769152c3d
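
    For example, a minimal Python sketch for connecting to the restored database with the credentials above (host and port are assumptions for a local Docker setup) and listing its tables without assuming specific table names:

    import psycopg2

    # Connect with the default credentials given above; host/port are assumptions.
    conn = psycopg2.connect(
        dbname="postgrescvedumper",
        user="postgrescvedumper",
        password="a42a18537d74c3b7e584c769152c3d",
        host="localhost",
        port=5432,
    )
    with conn, conn.cursor() as cur:
        # Read the table names from the schema instead of assuming them.
        cur.execute(
            "SELECT table_name FROM information_schema.tables WHERE table_schema = 'public'"
        )
        for (table,) in cur.fetchall():
            print(table)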

    Please use this for citation:

     @inproceedings{morefixes2024,
     title={MoreFixes: A large-scale dataset of CVE fix commits mined through enhanced repository discovery},
     author={Akhoundali, Jafar and Nouri, Sajad Rahim and Rietveld, Kristian and Gadyatskaya, Olga},
     booktitle={Proceedings of the 20th International Conference on Predictive Models and Data Analytics in Software Engineering},
     pages={42--51},
     year={2024}
    }
    
  3. GitHub Repos

    • kaggle.com
    zip
    Updated Mar 20, 2019
    Cite
    Github (2019). GitHub Repos [Dataset]. https://www.kaggle.com/datasets/github/github-repos
    Explore at:
    zip (0 bytes)
    Dataset updated
    Mar 20, 2019
    Dataset provided by
    GitHub (https://github.com/)
    Authors
    Github
    Description

    GitHub is how people build software and is home to the largest community of open source developers in the world, with over 12 million people contributing to 31 million projects on GitHub since 2008.

    This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions.

    Querying BigQuery tables

    You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.github_repos.[TABLENAME]. Fork this kernel to get started and to learn how to safely manage analyzing large BigQuery datasets.
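
    For example, a minimal sketch with the BigQuery Python client (it requires Google Cloud credentials); the languages table and its repeated language record (name, bytes) are assumptions about the public github_repos schema, so adjust the query to the table you need:

    from google.cloud import bigquery

    # Count repositories per language using the public github_repos dataset.
    client = bigquery.Client()
    query = """
        SELECT lang.name AS language_name, COUNT(*) AS repo_count
        FROM `bigquery-public-data.github_repos.languages`,
             UNNEST(language) AS lang
        GROUP BY language_name
        ORDER BY repo_count DESC
        LIMIT 10
    """
    for row in client.query(query).result():
        print(row.language_name, row.repo_count)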

    Acknowledgements

    This dataset was made available per GitHub's terms of service. This dataset is available via Google Cloud Platform's Marketplace, GitHub Activity Data, as part of GCP Public Datasets.

    Inspiration

    • This is the perfect dataset for fighting language wars.
    • Can you identify any signals that predict which packages or languages will become popular, in advance of their mass adoption?
  4. Spider Realistic Dataset In Structure-Grounded Pretraining for Text-to-SQL

    • zenodo.org
    bin, json, txt
    Updated Aug 16, 2021
    + more versions
    Cite
    Xiang Deng; Ahmed Hassan Awadallah; Christopher Meek; Oleksandr Polozov; Huan Sun; Matthew Richardson (2021). Spider Realistic Dataset In Structure-Grounded Pretraining for Text-to-SQL [Dataset]. http://doi.org/10.5281/zenodo.5205322
    Explore at:
    txt, json, bin
    Dataset updated
    Aug 16, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Xiang Deng; Ahmed Hassan Awadallah; Christopher Meek; Oleksandr Polozov; Huan Sun; Matthew Richardson
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This folder contains the Spider-Realistic dataset used for evaluation in the paper "Structure-Grounded Pretraining for Text-to-SQL". The dataset is created based on the dev split of the Spider dataset (2020-06-07 version from https://yale-lily.github.io/spider). We manually modified the original questions to remove the explicit mention of column names while keeping the SQL queries unchanged to better evaluate the model's capability in aligning the NL utterance and the DB schema. For more details, please check our paper at https://arxiv.org/abs/2010.12773.

    It contains the following files:

    - spider-realistic.json
    # The spider-realistic evaluation set
    # Examples: 508
    # Databases: 19
    - dev.json
    # The original dev split of Spider
    # Examples: 1034
    # Databases: 20
    - tables.json
    # The original DB schemas from Spider
    # Databases: 166
    - README.txt
    - license

    The Spider-Realistic dataset is created based on the dev split of the Spider dataset released by Yu, Tao, et al. "Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task." It is a subset of the original dataset with explicit mention of the column names removed. The SQL queries and databases are kept unchanged.
    For the format of each json file, please refer to the github page of Spider https://github.com/taoyds/spider.
    For the database files please refer to the official Spider release https://yale-lily.github.io/spider.
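
    As an illustration, the evaluation set can be loaded with plain Python; the "question" and "query" field names are assumed to follow the Spider data format described on the Spider GitHub page:

    import json

    # Load the Spider-Realistic evaluation set and peek at a few NL/SQL pairs.
    with open("spider-realistic.json") as f:
        examples = json.load(f)

    print(len(examples), "examples")
    for ex in examples[:3]:
        # "question" and "query" are assumed field names (Spider format).
        print(ex.get("question"), "->", ex.get("query"))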

    This dataset is distributed under the CC BY-SA 4.0 license.

    If you use the dataset, please cite the following papers, including the original Spider dataset, Finegan-Dollak et al., 2018, and the original datasets for Restaurants, GeoQuery, Scholar, Academic, IMDB, and Yelp.

    @article{deng2020structure,
    title={Structure-Grounded Pretraining for Text-to-SQL},
    author={Deng, Xiang and Awadallah, Ahmed Hassan and Meek, Christopher and Polozov, Oleksandr and Sun, Huan and Richardson, Matthew},
    journal={arXiv preprint arXiv:2010.12773},
    year={2020}
    }

    @inproceedings{Yu&al.18c,
    year = 2018,
    title = {Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task},
    booktitle = {EMNLP},
    author = {Tao Yu and Rui Zhang and Kai Yang and Michihiro Yasunaga and Dongxu Wang and Zifan Li and James Ma and Irene Li and Qingning Yao and Shanelle Roman and Zilin Zhang and Dragomir Radev }
    }

    @InProceedings{P18-1033,
    author = "Finegan-Dollak, Catherine
    and Kummerfeld, Jonathan K.
    and Zhang, Li
    and Ramanathan, Karthik
    and Sadasivam, Sesh
    and Zhang, Rui
    and Radev, Dragomir",
    title = "Improving Text-to-SQL Evaluation Methodology",
    booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    year = "2018",
    publisher = "Association for Computational Linguistics",
    pages = "351--360",
    location = "Melbourne, Australia",
    url = "http://aclweb.org/anthology/P18-1033"
    }

    @InProceedings{data-sql-imdb-yelp,
    dataset = {IMDB and Yelp},
    author = {Navid Yaghmazadeh, Yuepeng Wang, Isil Dillig, and Thomas Dillig},
    title = {SQLizer: Query Synthesis from Natural Language},
    booktitle = {International Conference on Object-Oriented Programming, Systems, Languages, and Applications, ACM},
    month = {October},
    year = {2017},
    pages = {63:1--63:26},
    url = {http://doi.org/10.1145/3133887},
    }

    @article{data-academic,
    dataset = {Academic},
    author = {Fei Li and H. V. Jagadish},
    title = {Constructing an Interactive Natural Language Interface for Relational Databases},
    journal = {Proceedings of the VLDB Endowment},
    volume = {8},
    number = {1},
    month = {September},
    year = {2014},
    pages = {73--84},
    url = {http://dx.doi.org/10.14778/2735461.2735468},
    }

    @InProceedings{data-atis-geography-scholar,
    dataset = {Scholar, and Updated ATIS and Geography},
    author = {Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Jayant Krishnamurthy, and Luke Zettlemoyer},
    title = {Learning a Neural Semantic Parser from User Feedback},
    booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
    year = {2017},
    pages = {963--973},
    location = {Vancouver, Canada},
    url = {http://www.aclweb.org/anthology/P17-1089},
    }

    @inproceedings{data-geography-original,
    dataset = {Geography, original},
    author = {John M. Zelle and Raymond J. Mooney},
    title = {Learning to Parse Database Queries Using Inductive Logic Programming},
    booktitle = {Proceedings of the Thirteenth National Conference on Artificial Intelligence - Volume 2},
    year = {1996},
    pages = {1050--1055},
    location = {Portland, Oregon},
    url = {http://dl.acm.org/citation.cfm?id=1864519.1864543},
    }

    @inproceedings{data-restaurants-logic,
    author = {Lappoon R. Tang and Raymond J. Mooney},
    title = {Automated Construction of Database Interfaces: Integrating Statistical and Relational Learning for Semantic Parsing},
    booktitle = {2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora},
    year = {2000},
    pages = {133--141},
    location = {Hong Kong, China},
    url = {http://www.aclweb.org/anthology/W00-1317},
    }

    @inproceedings{data-restaurants-original,
    author = {Ana-Maria Popescu, Oren Etzioni, and Henry Kautz},
    title = {Towards a Theory of Natural Language Interfaces to Databases},
    booktitle = {Proceedings of the 8th International Conference on Intelligent User Interfaces},
    year = {2003},
    location = {Miami, Florida, USA},
    pages = {149--157},
    url = {http://doi.acm.org/10.1145/604045.604070},
    }

    @inproceedings{data-restaurants,
    author = {Alessandra Giordani and Alessandro Moschitti},
    title = {Automatic Generation and Reranking of SQL-derived Answers to NL Questions},
    booktitle = {Proceedings of the Second International Conference on Trustworthy Eternal Systems via Evolving Software, Data and Knowledge},
    year = {2012},
    location = {Montpellier, France},
    pages = {59--76},
    url = {https://doi.org/10.1007/978-3-642-45260-4_5},
    }

  5. Data from: Correlated RNN Framework to Quickly Generate Molecules with...

    • acs.figshare.com
    xlsx
    Updated Jun 1, 2023
    Cite
    Chuan Li; Chenghui Wang; Ming Sun; Yan Zeng; Yuan Yuan; Qiaolin Gou; Guangchuan Wang; Yanzhi Guo; Xuemei Pu (2023). Correlated RNN Framework to Quickly Generate Molecules with Desired Properties for Energetic Materials in the Low Data Regime [Dataset]. http://doi.org/10.1021/acs.jcim.2c00997.s002
    Explore at:
    xlsx
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    ACS Publications
    Authors
    Chuan Li; Chenghui Wang; Ming Sun; Yan Zeng; Yuan Yuan; Qiaolin Gou; Guangchuan Wang; Yanzhi Guo; Xuemei Pu
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
    License information was derived automatically

    Description

    Motivated by the challenge of deep learning in the low data regime and the urgent demand for intelligent design of highly energetic materials, we explore a correlated deep learning framework, which consists of three recurrent neural networks (RNNs) correlated by the transfer learning strategy, to efficiently generate new energetic molecules with a high detonation velocity in the case of very limited data available. To avoid the dependence on an external big data set, data augmentation by fragment shuffling of 303 energetic compounds is utilized to produce 500,000 molecules to pretrain the RNN, through which the model can learn sufficient structure knowledge. Then the pretrained RNN is fine-tuned by focusing on the 303 energetic compounds to generate 7153 molecules similar to the energetic compounds. In order to more reliably screen the molecules with a high detonation velocity, SMILES enumeration augmentation coupled with the pretrained knowledge is utilized to build an RNN-based prediction model, through which R2 is boosted from 0.4446 to 0.9572. The comparable performance with the transfer learning strategy based on an existing big database (ChEMBL) to produce the energetic molecules and drug-like ones further supports the effectiveness and generality of our strategy in the low data regime. High-precision quantum mechanics calculations further confirm that 35 new molecules present a higher detonation velocity and lower synthetic accessibility than the classic explosive RDX, along with good thermal stability. In particular, three new molecules are comparable to caged CL-20 in the detonation velocity. All the source codes and the data set are freely available at https://github.com/wangchenghuidream/RNNMGM.
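
    As an aside, the SMILES enumeration idea mentioned above can be illustrated with RDKit (RDKit and the example molecule are not part of the dataset; this is only a sketch of the augmentation step):

    from rdkit import Chem

    # Generate alternative, non-canonical SMILES strings for the same molecule.
    # The input SMILES (toluene) is an illustrative placeholder.
    mol = Chem.MolFromSmiles("CC1=CC=CC=C1")
    variants = {Chem.MolToSmiles(mol, canonical=False, doRandom=True) for _ in range(20)}
    print(variants)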

  6. Open Data Portal Catalogue

    • open.canada.ca
    • datasets.ai
    • +1more
    csv, json, jsonl, png +2
    Updated Jul 27, 2025
    Cite
    Treasury Board of Canada Secretariat (2025). Open Data Portal Catalogue [Dataset]. https://open.canada.ca/data/en/dataset/c4c5c7f1-bfa6-4ff6-b4a0-c164cb2060f7
    Explore at:
    csv, sqlite, json, png, jsonl, xlsx
    Dataset updated
    Jul 27, 2025
    Dataset provided by
    Treasury Board of Canada Secretariat (http://www.tbs-sct.gc.ca/)
    Treasury Board of Canada (https://www.canada.ca/en/treasury-board-secretariat/corporate/about-treasury-board.html)
    License

    Open Government Licence - Canada 2.0 (https://open.canada.ca/en/open-government-licence-canada)
    License information was derived automatically

    Description

    The open data portal catalogue is a downloadable dataset containing some key metadata for the general datasets available on the Government of Canada's Open Data portal. Resource 1 is generated using the ckanapi tool (external link). Resources 2 - 8 are generated using the Flatterer (external link) utility.

    Description of resources:

    1. Dataset is a JSON Lines (external link) file where the metadata of each Dataset/Open Information Record is one line of JSON. The file is compressed with GZip. The file is heavily nested and recommended for users familiar with working with nested JSON.
    2. Catalogue is an XLSX workbook where the nested metadata of each Dataset/Open Information Record is flattened into worksheets for each type of metadata.
    3. Datasets Metadata contains metadata at the dataset level. This is also referred to as the package in some CKAN documentation. This is the main table/worksheet in the SQLite database and XLSX output.
    4. Resources Metadata contains the metadata for the resources contained within each dataset.
    5. Resource Views Metadata contains the metadata for the views applied to each resource, if a resource has a view configured.
    6. Datastore Fields Metadata contains the DataStore information for CSV datasets that have been loaded into the DataStore. This information is displayed in the Data Dictionary for DataStore-enabled CSVs.
    7. Data Package Fields contains a description of the fields available in each of the tables within the Catalogue, as well as the count of the number of records each table contains.
    8. Data Package Entity Relation Diagram displays the title and format for each column, in each table in the Data Package, in the form of an ERD diagram. The Data Package resource offers a text-based version.
    9. SQLite Database is a .db database, similar in structure to Catalogue. This can be queried with database or analytical software tools for doing analysis.
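
    For example, the JSON Lines resource (resource 1) can be read directly from Python; the file name below is a placeholder for the downloaded resource, and the "title" field assumes the usual CKAN package schema:

    import gzip
    import json

    # Stream the gzipped JSON Lines file and print the first few dataset titles.
    with gzip.open("od-do-canada.jsonl.gz", "rt", encoding="utf-8") as f:
        for i, line in enumerate(f):
            record = json.loads(line)
            print(record.get("title"))  # "title" is an assumed CKAN field
            if i >= 4:
                break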

  7. Reproducibility in Practice: Dataset of a Large-Scale Study of Jupyter...

    • zenodo.org
    bz2
    Updated Mar 15, 2021
    Cite
    Anonymous; Anonymous (2021). Reproducibility in Practice: Dataset of a Large-Scale Study of Jupyter Notebooks [Dataset]. http://doi.org/10.5281/zenodo.2538877
    Explore at:
    bz2
    Dataset updated
    Mar 15, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous; Anonymous
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourages poor coding practices, and that their results can be hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub.

    This repository contains two files:

    • dump.tar.bz2
    • jupyter_reproducibility.tar.bz2

    The dump.tar.bz2 file contains a PostgreSQL dump of the database, with all the data we extracted from the notebooks.

    The jupyter_reproducibility.tar.bz2 file contains all the scripts we used to query and download Jupyter Notebooks, extract data from them, and analyze the data. It is organized as follows:

    • analyses: this folder has all the notebooks we use to analyze the data in the PostgreSQL database.
    • archaeology: this folder has all the scripts we use to query, download, and extract data from GitHub notebooks.
    • paper: empty. The notebook analyses/N11.To.Paper.ipynb moves data to it

    In the remainder of this text, we give instructions for reproducing the analyses, by using the data provided in the dump, and for reproducing the collection, by collecting data from GitHub again.

    Reproducing the Analysis

    This section shows how to load the data in the database and run the analyses notebooks. In the analysis, we used the following environment:

    Ubuntu 18.04.1 LTS
    PostgreSQL 10.6
    Conda 4.5.1
    Python 3.6.8
    PdfCrop 2012/11/02 v1.38

    First, download dump.tar.bz2 and extract it:

    tar -xjf dump.tar.bz2

    It extracts the file db2019-01-13.dump. Create a database in PostgreSQL (we call it "jupyter"), and use psql to restore the dump:

    psql jupyter < db2019-01-13.dump

    It populates the database with the dump. Now, configure the connection string for sqlalchemy by setting the environment variable JUP_DB_CONNECTION:

    export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter";
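
    As a quick sanity check (not part of the original scripts), the connection string can be exercised from Python with sqlalchemy, listing the public tables of the restored dump without assuming their names:

    import os

    from sqlalchemy import create_engine, text

    # Build an engine from the JUP_DB_CONNECTION string set above.
    engine = create_engine(os.environ["JUP_DB_CONNECTION"])
    with engine.connect() as conn:
        tables = conn.execute(text(
            "SELECT table_name FROM information_schema.tables WHERE table_schema = 'public'"
        )).fetchall()
        print(tables)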

    Download and extract jupyter_reproducibility.tar.bz2:

    tar -xjf jupyter_reproducibility.tar.bz2

    Create a conda environment with Python 3.6:

    conda create -n py36 python=3.6
    conda activate py36

    Go to the analyses folder and install all the dependencies of the requirements.txt

    cd jupyter_reproducibility/analyses
    pip install -r requirements.txt

    To reproduce the analyses, run jupyter in this folder:

    jupyter notebook

    Execute the notebooks in this order:

    • N0.Index.ipynb
    • N1.Repository.ipynb
    • N2.Notebook.ipynb
    • N3.Cell.ipynb
    • N4.Features.ipynb
    • N5.Modules.ipynb
    • N6.AST.ipynb
    • N7.Name.ipynb
    • N8.Execution.ipynb
    • N9.Cell.Execution.Order.ipynb
    • N10.Markdown.ipynb
    • N11.To.Paper.ipynb

    Reproducing or Expanding the Collection

    The collection demands more steps to reproduce and takes much longer to run (months). It also involves running arbitrary code on your machine. Proceed with caution.

    Requirements

    This time, we have extra requirements:

    • All the analysis requirements
    • lbzip2 2.5
    • gcc 7.3.0
    • GitHub account
    • Gmail account

    Environment

    First, set the following environment variables:

    export JUP_MACHINE="db"; # machine identifier
    export JUP_BASE_DIR="/mnt/jupyter/github"; # place to store the repositories
    export JUP_LOGS_DIR="/home/jupyter/logs"; # log files
    export JUP_COMPRESSION="lbzip2"; # compression program
    export JUP_VERBOSE="5"; # verbose level
    export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter"; # sqlalchemy connection
    export JUP_GITHUB_USERNAME="github_username"; # your github username
    export JUP_GITHUB_PASSWORD="github_password"; # your github password
    export JUP_MAX_SIZE="8000.0"; # maximum size of the repositories directory (in GB)
    export JUP_FIRST_DATE="2013-01-01"; # initial date to query github
    export JUP_EMAIL_LOGIN="gmail@gmail.com"; # your gmail address
    export JUP_EMAIL_TO="target@email.com"; # email that receives notifications
    export JUP_OAUTH_FILE="~/oauth2_creds.json"; # oauth2 authentication file
    export JUP_NOTEBOOK_INTERVAL=""; # notebook id interval for this machine. Leave it blank
    export JUP_REPOSITORY_INTERVAL=""; # repository id interval for this machine. Leave it blank
    export JUP_WITH_EXECUTION="1"; # run execute python notebooks
    export JUP_WITH_DEPENDENCY="0"; # run notebooks with and without declared dependencies
    export JUP_EXECUTION_MODE="-1"; # run following the execution order
    export JUP_EXECUTION_DIR="/home/jupyter/execution"; # temporary directory for running notebooks
    export JUP_ANACONDA_PATH="~/anaconda3"; # conda installation path
    export JUP_MOUNT_BASE="/home/jupyter/mount_ghstudy.sh"; # bash script to mount base dir
    export JUP_UMOUNT_BASE="/home/jupyter/umount_ghstudy.sh"; # bash script to umount base dir
    export JUP_NOTEBOOK_TIMEOUT="300"; # timeout the extraction
    
    
    # Frequency of log reports
    export JUP_ASTROID_FREQUENCY="5";
    export JUP_IPYTHON_FREQUENCY="5";
    export JUP_NOTEBOOKS_FREQUENCY="5";
    export JUP_REQUIREMENT_FREQUENCY="5";
    export JUP_CRAWLER_FREQUENCY="1";
    export JUP_CLONE_FREQUENCY="1";
    export JUP_COMPRESS_FREQUENCY="5";
    
    export JUP_DB_IP="localhost"; # postgres database IP

    Then, configure the file ~/oauth2_creds.json, according to yagmail documentation: https://media.readthedocs.org/pdf/yagmail/latest/yagmail.pdf

    Configure the mount_ghstudy.sh and umount_ghstudy.sh scripts. The first one should mount the folder that stores the directories. The second one should umount it. You can leave the scripts blank, but it is not advisable, as the reproducibility study runs arbitrary code on your machine and you may lose your data.

    Scripts

    Download and extract jupyter_reproducibility.tar.bz2:

    tar -xjf jupyter_reproducibility.tar.bz2

    Install 5 conda environments and 5 anaconda environments, one pair for each Python version. In each of them, upgrade pip, install pipenv, and install the archaeology package (note that it is a local package that has not been published to PyPI; make sure to use the -e option):

    Conda 2.7

    conda create -n raw27 python=2.7 -y
    conda activate raw27
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Anaconda 2.7

    conda create -n py27 python=2.7 anaconda -y
    conda activate py27
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology
    

    Conda 3.4

    It requires a manual jupyter and pathlib2 installation due to some incompatibilities found in the default installation.

    conda create -n raw34 python=3.4 -y
    conda activate raw34
    conda install jupyter -c conda-forge -y
    conda uninstall jupyter -y
    pip install --upgrade pip
    pip install jupyter
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology
    pip install pathlib2

    Anaconda 3.4

    conda create -n py34 python=3.4 anaconda -y
    conda activate py34
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Conda 3.5

    conda create -n raw35 python=3.5 -y
    conda activate raw35
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Anaconda 3.5

    It requires the manual installation of other anaconda packages.

    conda create -n py35 python=3.5 anaconda -y
    conda install -y appdirs atomicwrites keyring secretstorage libuuid navigator-updater prometheus_client pyasn1 pyasn1-modules spyder-kernels tqdm jeepney automat constantly anaconda-navigator
    conda activate py35
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Conda 3.6

    conda create -n raw36 python=3.6 -y
    conda activate raw36
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Anaconda 3.6

    conda create -n py36 python=3.6 anaconda -y
    conda activate py36
    conda install -y anaconda-navigator jupyterlab_server navigator-updater
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Conda 3.7

    conda create -n raw37 python=3.7 -y
    conda activate raw37
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

  8. Fruits-360 dataset

    • kaggle.com
    • data.mendeley.com
    Updated Jun 7, 2025
    Cite
    Mihai Oltean (2025). Fruits-360 dataset [Dataset]. https://www.kaggle.com/datasets/moltean/fruits
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 7, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Mihai Oltean
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
    License information was derived automatically

    Description

    Fruits-360 dataset: A dataset of images containing fruits, vegetables, nuts and seeds

    Version: 2025.06.07.0

    Content

    The following fruits, vegetables, nuts and seeds are included: Apples (different varieties: Crimson Snow, Golden, Golden-Red, Granny Smith, Pink Lady, Red, Red Delicious), Apricot, Avocado, Avocado ripe, Banana (Yellow, Red, Lady Finger), Beans, Beetroot Red, Blackberry, Blueberry, Cabbage, Caju seed, Cactus fruit, Cantaloupe (2 varieties), Carambula, Carrot, Cauliflower, Cherimoya, Cherry (different varieties, Rainier), Cherry Wax (Yellow, Red, Black), Chestnut, Clementine, Cocos, Corn (with husk), Cucumber (ripened, regular), Dates, Eggplant, Fig, Ginger Root, Goosberry, Granadilla, Grape (Blue, Pink, White (different varieties)), Grapefruit (Pink, White), Guava, Hazelnut, Huckleberry, Kiwi, Kaki, Kohlrabi, Kumsquats, Lemon (normal, Meyer), Lime, Lychee, Mandarine, Mango (Green, Red), Mangostan, Maracuja, Melon Piel de Sapo, Mulberry, Nectarine (Regular, Flat), Nut (Forest, Pecan), Onion (Red, White), Orange, Papaya, Passion fruit, Peach (different varieties), Pepino, Pear (different varieties, Abate, Forelle, Kaiser, Monster, Red, Stone, Williams), Pepper (Red, Green, Orange, Yellow), Physalis (normal, with Husk), Pineapple (normal, Mini), Pistachio, Pitahaya Red, Plum (different varieties), Pomegranate, Pomelo Sweetie, Potato (Red, Sweet, White), Quince, Rambutan, Raspberry, Redcurrant, Salak, Strawberry (normal, Wedge), Tamarillo, Tangelo, Tomato (different varieties, Maroon, Cherry Red, Yellow, not ripened, Heart), Walnut, Watermelon, Zucchini (green and dark).

    Branches

    The dataset has 5 major branches:

    -The 100x100 branch, where all images have 100x100 pixels. See _fruits-360_100x100_ folder.

    -The original-size branch, where all images are at their original (captured) size. See _fruits-360_original-size_ folder.

    -The meta branch, which contains additional information about the objects in the Fruits-360 dataset. See _fruits-360_dataset_meta_ folder.

    -The multi branch, which contains images with multiple fruits, vegetables, nuts and seeds. These images are not labeled. See _fruits-360_multi_ folder.

    -The _3_body_problem_ branch, where the Training and Test folders contain different varieties of the 3 fruits and vegetables (Apples, Cherries and Tomatoes). See _fruits-360_3-body-problem_ folder.

    How to cite

    Mihai Oltean, Fruits-360 dataset, 2017-

    Dataset properties

    For the 100x100 branch

    Total number of images: 138704.

    Training set size: 103993 images.

    Test set size: 34711 images.

    Number of classes: 206 (fruits, vegetables, nuts and seeds).

    Image size: 100x100 pixels.

    For the original-size branch

    Total number of images: 58363.

    Training set size: 29222 images.

    Validation set size: 14614 images

    Test set size: 14527 images.

    Number of classes: 90 (fruits, vegetables, nuts and seeds).

    Image size: various (original captured size).

    For the 3-body-problem branch

    Total number of images: 47033.

    Training set size: 34800 images.

    Test set size: 12233 images.

    Number of classes: 3 (Apples, Cherries, Tomatoes).

    Number of varieties: Apples = 29; Cherries = 12; Tomatoes = 19.

    Image size: 100x100 pixels.

    For the meta branch

    Number of classes: 26 (fruits, vegetables, nuts and seeds).

    For the multi branch

    Number of images: 150.

    Filename format:

    For the 100x100 branch

    image_index_100.jpg (e.g. 31_100.jpg) or

    r_image_index_100.jpg (e.g. r_31_100.jpg) or

    r?_image_index_100.jpg (e.g. r2_31_100.jpg)

    where "r" stands for rotated fruit. "r2" means that the fruit was rotated around the 3rd axis. "100" comes from image size (100x100 pixels).

    Different varieties of the same fruit (apple, for instance) are stored as belonging to different classes.

    For the original-size branch

    r?_image_index.jpg (e.g. r2_31.jpg)

    where "r" stands for rotated fruit. "r2" means that the fruit was rotated around the 3rd axis.

    The name of the image files in the new version does NOT contain the "_100" suffix anymore. This will help you to make the distinction between the original-size branch and the 100x100 branch.
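
    A small sketch (not part of the dataset) for parsing these file names according to the conventions above:

    import re

    # Split a file name into rotation prefix, image index, and the optional
    # "_100" suffix used by the 100x100 branch.
    PATTERN = re.compile(r"^(?P<rotation>r\d*)?_?(?P<index>\d+)(?P<size>_100)?\.jpg$")

    for name in ["31_100.jpg", "r_31_100.jpg", "r2_31_100.jpg", "r2_31.jpg"]:
        m = PATTERN.match(name)
        print(name, m.groupdict() if m else "no match")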

    For the multi branch

    The file's name is the concatenation of the names of the fruits inside that picture.

    Alternate download

    The Fruits-360 dataset can be downloaded from:

    Kaggle https://www.kaggle.com/moltean/fruits

    GitHub https://github.com/fruits-360

    How fruits were filmed

    Fruits and vegetables were planted in the shaft of a low-speed motor (3 rpm) and a short movie of 20 seconds was recorded.

    A Logitech C920 camera was used for filming the fruits. This is one of the best webcams available.

    Behind the fruits, we placed a white sheet of paper as a background.

    Here i...

  9. Data from: TerraDS: A Dataset for Terraform HCL Programs

    • zenodo.org
    application/gzip, bin
    Updated Nov 27, 2024
    Cite
    Christoph Bühler; David Spielmann; Roland Meier; Guido Salvaneschi (2024). TerraDS: A Dataset for Terraform HCL Programs [Dataset]. http://doi.org/10.5281/zenodo.14217386
    Explore at:
    application/gzip, bin
    Dataset updated
    Nov 27, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Christoph Bühler; David Spielmann; Roland Meier; Guido Salvaneschi
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    TerraDS

    The TerraDS dataset provides a comprehensive collection of Terraform programs written in the HashiCorp Configuration Language (HCL). As Infrastructure as Code (IaC) gains popularity for managing cloud infrastructure, Terraform has become one of the leading tools due to its declarative nature and widespread adoption. However, a lack of publicly available, large-scale datasets has hindered systematic research on Terraform practices. TerraDS addresses this gap by compiling metadata and source code from 62,406 open-source repositories with valid licenses. This dataset aims to foster research on best practices, vulnerabilities, and improvements in IaC methodologies.

    Structure of the Database

    The TerraDS dataset is organized into two main components: a SQLite database containing metadata and an archive of source code (~335 MB). The metadata, captured in a structured format, includes information about repositories, modules, and resources:

    1. Repository Data:

    • Contains 62,406 repositories with fields such as repository name, creation date, star count, and permissive license details.
    • Provides cloneable URLs for access and analysis.
    • Tracks additional metrics like repository size and the latest commit details.

    2. Module Data:

    • Consists of 279,344 modules identified within the repositories.
    • Each module includes its relative path, referenced providers, and external module calls stored as JSON objects.

    3. Resource Data:

    • Encompasses 1,773,991 resources, split into managed (1,484,185) and data (289,806) resources.
    • Each resource entry details its type, provider, and whether it is managed or read-only.
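
    As a quick orientation, the metadata database can be explored from Python with the standard sqlite3 module; this is only a sketch, with a placeholder file name, and it reads the table names from the schema rather than assuming them:

    import sqlite3

    # Open the SQLite metadata database (file name is a placeholder).
    con = sqlite3.connect("terrads.sqlite")
    tables = [row[0] for row in con.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    )]
    for table in tables:
        (count,) = con.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
        print(table, count)
    con.close()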

    Structure of the Archive

    The provided archive contains the source code of the 62,406 repositories to allow further analysis based on the actual source instead of the metadata only. As such, researchers can access the permissive repositories and conduct studies on the executable HCL code.

    Tools

    The "HCL Dataset Tools" file contains a snapshot of the https://github.com/prg-grp/hcl-dataset-tools repository - for long term archival reasons. The tools in this repository can be used to reproduce this dataset.

    One of the tools - "RepositorySearcher" - can be used to fetch metadata for various other GitHub API queries, not only Terraform code. While the RepositorySearcher allows usage for other types of repository search, the other tools provided are focused on Terraform repositories.

  10. Overhead Wind Turbine Dataset (NAIP)

    • data.niaid.nih.gov
    • explore.openaire.eu
    Updated Dec 2, 2022
    Cite
    Jordan Malof (2022). Overhead Wind Turbine Dataset (NAIP) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7385226
    Explore at:
    Dataset updated
    Dec 2, 2022
    Dataset provided by
    Yuxi Long
    Kyle Bradbury
    Jordan Malof
    Simiao Ren
    Frank Willard
    Caleb Kornfein
    Caroline Tang
    Saksham Jain
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    1 - OVERVIEW

    This dataset contains overhead images of wind turbines from three regions of the United States – the Eastern Midwest (EM), Northwest (NW), and Southwest (SW). The images come from the National Agricultural Imagery Program and were extracted using Google Earth Engine and wind turbine latitude-longitude coordinates from the U.S. Wind Turbine Database. Overall, there are 2003 NAIP-collected images, of which 988 images contain wind turbines and the other 1015 are background images (not containing wind turbines) collected from regions near the wind turbines. Labels are provided for all images containing wind turbines. We welcome uses of this dataset for object detection or other research purposes.

    2 - DATA DETAILS

    Each image is 608 x 608 pixels, with a GSD of 1m. This means each image represents a frame of approximately 608 m x 608 m. Because images were collected directly overhead the exact wind turbine coordinates, the turbines would otherwise appear almost exactly centered in every image. To avoid this issue, images were randomly shifted by up to 75 m in two directions.

    We refer to images without turbines as "background images", and further split up the images with turbines into the training and testing set splits. We call the training images with turbines "real images" and the testing images "test images".

    Distribution of gathered images by region and type:

    Domain   Real   Test   Background
    EM       267    100    244
    NW       213    100    415
    SW       208    100    356

    Note that this dataset is part of a larger research project in Duke's 2021-2022 Bass Connections team, Creating Artificial Worlds with AI to Improve Energy Access Data. Our research proposes a technique to synthetically generate images with implanted energy infrastructure objects. We include the synthetic images we generated along with the NAIP collected images above. Generating synthetic images requires a training and testing domain, so for each pair of domains we include 173 synthetically generated images. For a fuller picture on our research, including additional image data from domain adaptation techniques we benchmark our method against, visit our github: https://github.com/energydatalab/closing-the-domain-gap. If you use this dataset, please cite the citation found in our Github README.

    3 - NAVIGATING THE DATASET

    Once the data is unzipped, you will see that the base level of the dataset contains an image and a labels folder, which have the exact same structure. Here is how the images directory is divided:

    | - images
    | | - SW
    | | | - Background
    | | | - Test
    | | | - Real
    | | - EM
    | | | - Background
    | | | - Test
    | | | - Real
    | | - NW
    | | | - Background
    | | | - Test
    | | | - Real
    | | - Synthetic
    | | | - s_EM_t_NW
    | | | - s_SW_t_NW
    | | | - s_NW_t_NW
    | | | - s_NW_t_EM
    | | | - s_SW_t_EM
    | | | - s_EM_t_SW
    | | | - s_NW_t_SW
    | | | - s_EM_t_EM
    | | | - s_SW_t_SW

    For example images/SW/Real has the 208 .jpg images from the Southwest that contain turbines. The synthetic subdirectory is structured such that for example images/Synthetic/s_EM_t_NW contains synthetic images using a source domain of Eastern Midwest and a target domain of Northwest, meaning the images were stylized to artificially look like Northwest images.

    Note that we also provide a domain_overview.json file at the top level to help you navigate the directory. The domain_overview.json file navigates the directory with keys, so if you load the file as f, then f['images']['SW']['Background'] should list all the background photos from the SW. The keys in the domain json are ordered in the order we used the images for our experiments. So if our experiment used 100 SW background images, we used the images corresponding to the first 100 keys.
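
    A minimal sketch of that navigation in Python (the path is relative to the unzipped dataset; whether each leaf is a list of names or a dict keyed by name is inferred from the description above):

    import json

    # Load the overview file and list the SW background images.
    with open("domain_overview.json") as fh:
        f = json.load(fh)

    sw_background = f["images"]["SW"]["Background"]
    print(len(sw_background), "SW background images")
    # list() works whether the leaf is a list or a dict of keys.
    print(list(sw_background)[:5])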

    Naming conventions:

    1 - Real and Test images:

    {DOMAIN}_{UNIQUE ID}.jpg

    For example 'EM_136.jpg' with corresponding label file 'EM_136.txt' refers to an image from the Eastern Midwest with unique ID 136.

    2 - Background images:

    Background images were collected in 3 waves with the purpose to create a set of images similar visually to real images, just without turbines:

    The first wave came from NAIP images from the U.S. Wind Turbine Database coordinates where no wind turbine was present in the snapshot (NAIP images span a relatively large time, thus it is possible that wind turbines might be missing from the images). These images are labeled {DOMAIN}_{UNIQUE ID}.jpg, for example 'EM_1612_background.jpg'.

    In the second wave, using wind turbine coordinates, images were randomly collected either 4000m Southeast or Northwest. These images are labeled {DOMAIN}_{UNIQUE_ID}_{SHIFT DIRECTION (SE or NW)}.jpg. For example 'NW_12750_SE_background.jpg' refers to an image from the Northwest without turbines captured at a shift of 4000m Southeast from a wind turbine with unique ID 12750.

    In the third wave, using wind turbine coordinates, images were randomly collected either 6000m Southeast or Northwest. These images are labeled {DOMAIN}_{UNIQUE_ID}_{SHIFT DIRECTION (SE or NW)}_6000.jpg, for example 'NW_12937_NW_6000_background.jpg'.

    3 - Synthetic images

    Each synthetic image takes in labeled wind turbine examples from the source domain, a background image from the target domain, and a mask. It uses the mask to place wind turbine examples and blends those examples onto the background image using GP-GAN. Thus, the naming conventions for synthetic images are:

    {BACKGROUND IMAGE NAME FROM TARGET DOMAIN}_{MASK NUMBER}.jpg.

    For example, images/Synthetic/s_NW_t_SW/SW_2246_m15.jpg corresponds to a synthetic image created using labeled wind turbine examples from the Northwest and stylized in the image of the Southwest using Southwest background image SW_2246 and mask 15.

    For any remaining questions, please reach out to the author point of contact at caleb.kornfein@gmail.com.

  11. B3DB

    • huggingface.co
    Updated Jun 4, 2025
    Cite
    Maom Lab (2025). B3DB [Dataset]. https://huggingface.co/datasets/maomlab/B3DB
    Explore at:
    Dataset updated
    Jun 4, 2025
    Dataset authored and provided by
    Maom Lab
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    Blood-Brain Barrier Database (B3DB)

    The Blood-Brain Barrier Database (B3DB) is a large benchmark dataset compiled from 50 published resources (as summarized at raw_data/raw_data_summary.tsv) and categorized based on the consistency between different experimental references/measurements. This dataset was published in Scientific Data and is a mirror of theochem/B3DB, the official GitHub repo, where it is occasionally updated with new experimental data. We used the original datasets… See the full description on the dataset page: https://huggingface.co/datasets/maomlab/B3DB.
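
    A minimal sketch for loading the dataset with the Hugging Face datasets library; whether a configuration name is required is an assumption to verify on the dataset page:

    from datasets import load_dataset

    # Load B3DB from the Hugging Face Hub (a config name may be required).
    b3db = load_dataset("maomlab/B3DB")
    print(b3db)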

  12. HYG Stellar database / Base de données stellaire HYG

    • data.smartidf.services
    • datastro.eu
    csv, excel, json
    Updated Jun 27, 2022
    Cite
    (2022). HYG Stellar database / Base de données stellaire HYG [Dataset]. https://data.smartidf.services/explore/dataset/hyg-stellar-database/
    Explore at:
    csv, excel, json
    Dataset updated
    Jun 27, 2022
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
    License information was derived automatically

    Description

    Note: in this dataset, the stars whose distances are 100,000 parsecs are associated with defective data. As it can't be modified at the source, users have to exclude these stars manually: if you exclude them from all distance-related analyses, everything works perfectly well.

    "The database is a subset of the data in three major catalogs: the Hipparcos Catalog, the Yale Bright Star Catalog (5th Edition), and the Gliese Catalog of Nearby Stars (3rd Edition). Each of these catalogs contains information useful to amateur astronomers:

    • The Hipparcos catalog is the largest collection of high-accuracy stellar positional data, particularly parallaxes, which makes it useful as a starting point for stellar distance data.
    • The Yale Bright Star Catalog contains basic data on essentially all naked-eye stars, including much information (such as the traditional Bayer Greek letters and Flamsteed numbers) missing from many other catalogs.
    • The Gliese catalog is the most comprehensive catalog of nearby stars (those within 75 light years of the Sun). It contains many fainter stars not found in Hipparcos.

    The name of the database comes from the three catalogs comprising its data: Hipparcos, Yale, and Gliese. This database contains ALL stars that are either brighter than a certain magnitude cutoff (magnitude +7.5 to +9.0) or within 50 parsecs (about 160 light years) from the Sun. The current version, v. 3.0, has no magnitude cutoff: any star in Hipparcos, Yale, or Gliese is represented."

    Source: http://www.astronexus.com/hyg and https://github.com/astronexus/HYG-Database

    Fields in the database:

    • id: The database primary key.
    • hip: The star's ID in the Hipparcos catalog, if known.
    • hd: The star's ID in the Henry Draper catalog, if known.
    • hr: The star's ID in the Harvard Revised catalog, which is the same as its number in the Yale Bright Star Catalog.
    • gl: The star's ID in the third edition of the Gliese Catalog of Nearby Stars.
    • bf: The Bayer / Flamsteed designation, primarily from the Fifth Edition of the Yale Bright Star Catalog. This is a combination of the two designations. The Flamsteed number, if present, is given first; then a three-letter abbreviation for the Bayer Greek letter; the Bayer superscript number, if present; and finally, the three-letter constellation abbreviation. Thus Alpha Andromedae has the field value "21Alp And", and Kappa1 Sculptoris (no Flamsteed number) has "Kap1Scl".
    • ra, dec: The star's right ascension and declination, for epoch and equinox 2000.0.
    • proper: A common name for the star, such as "Barnard's Star" or "Sirius". I have taken these names primarily from the Hipparcos project's web site, which lists representative names for the 150 brightest stars and many of the 150 closest stars. I have added a few names to this list. Most of the additions are designations from catalogs mostly now forgotten (e.g., Lalande, Groombridge, and Gould ["G."]) except for certain nearby stars which are still best known by these designations.
    • dist: The star's distance in parsecs, the most common unit in astrometry. To convert parsecs to light years, multiply by 3.262. A value >= 10000000 indicates missing or dubious (e.g., negative) parallax data in Hipparcos.
    • pmra, pmdec: The star's proper motion in right ascension and declination, in milliarcseconds per year.
    • rv: The star's radial velocity in km/sec, where known.
    • mag: The star's apparent visual magnitude.
    • absmag: The star's absolute visual magnitude (its apparent magnitude from a distance of 10 parsecs).
    • spect: The star's spectral type, if known.
    • ci: The star's color index (blue magnitude - visual magnitude), where known.
    • x, y, z: The Cartesian coordinates of the star, in a system based on the equatorial coordinates as seen from Earth. +X is in the direction of the vernal equinox (at epoch 2000), +Z towards the north celestial pole, and +Y in the direction of R.A. 6 hours, declination 0 degrees.
    • vx, vy, vz: The Cartesian velocity components of the star, in the same coordinate system described immediately above. They are determined from the proper motion and the radial velocity (when known). The velocity unit is parsecs per year; these are small values (around 1 millionth of a parsec per year), but they enormously simplify calculations using parsecs as base units for celestial mapping.
    • rarad, decrad, pmrarad, pmdecrad: The positions in radians, and proper motions in radians per year.
    • bayer: The Bayer designation as a distinct value.
    • flam: The Flamsteed number as a distinct value.
    • con: The standard constellation abbreviation.
    • comp, comp_primary, base: Identifies a star in a multiple star system. comp = ID of companion star, comp_primary = ID of primary star for this component, and base = catalog ID or name for this multi-star system. Currently only used for Gliese stars.
    • lum: Star's luminosity as a multiple of Solar luminosity.
    • var: Star's standard variable star designation, when known.
    • var_min, var_max: Star's approximate magnitude range, for variables. This value is based on the Hp magnitudes for the range in the original Hipparcos catalog, adjusted to the V magnitude scale to match the "mag" field.
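
    A minimal sketch with pandas, assuming a CSV export of the database with the fields above (the file name is a placeholder); it drops the stars flagged with the defective 100,000-parsec distance before any distance-related analysis:

    import pandas as pd

    # Load the CSV export and filter out stars with defective distance data.
    hyg = pd.read_csv("hygdata_v3.csv")
    clean = hyg[hyg["dist"] < 100000]
    print(len(hyg) - len(clean), "stars with defective distances removed")
    print(clean.nsmallest(10, "dist")[["proper", "dist", "mag"]])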

  13. short_jokes

    • huggingface.co
    Updated Feb 22, 2024
    + more versions
    Cite
    yuvraj sharma (2024). short_jokes [Dataset]. https://huggingface.co/datasets/ysharma/short_jokes
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 22, 2024
    Authors
    yuvraj sharma
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Context: Generating humor is a complex task in the domain of machine learning; it requires models to understand the deep semantic meaning of a joke in order to generate new ones. Such problems are difficult to solve for a number of reasons, one of which is the lack of a database that provides an elaborate list of jokes. Thus, a large corpus of over 0.2 million jokes has been collected by scraping several websites containing funny and short jokes. You can visit the Github… See the full description on the dataset page: https://huggingface.co/datasets/ysharma/short_jokes.
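
    A minimal loading sketch with the Hugging Face datasets library, using the dataset ID from the URL above (the split name and column layout are assumptions and may differ):

     from datasets import load_dataset

     # Dataset ID taken from the URL above; inspect the object to see the
     # actual splits and columns before using them.
     jokes = load_dataset("ysharma/short_jokes", split="train")
     print(jokes)
     print(jokes[0])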

  14. 3D MNIST

    • kaggle.com
    Updated Oct 18, 2019
    Cite
    David de la Iglesia Castro (2019). 3D MNIST [Dataset]. https://www.kaggle.com/daavoo/3d-mnist/home
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 18, 2019
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    David de la Iglesia Castro
    Description

    Context

    The aim of this dataset is to provide a simple way to get started with 3D computer vision problems such as 3D shape recognition.

    Accurate 3D point clouds can nowadays be acquired easily and cheaply from a variety of sources.

    However, there is a lack of large 3D datasets (a good one, based on triangular meshes, can be found here); it is especially hard to find datasets based on point clouds (which are the raw output of every 3D sensing device).

    This dataset contains 3D point clouds generated from the original images of the MNIST dataset to bring a familiar introduction to 3D to people used to working with 2D datasets (images).

    In the 3D_from_2D notebook you can find the code used to generate the dataset.

    You can use the code in the notebook to generate a bigger 3D dataset from the original.

    Content

    full_dataset_vectors.h5

    The entire dataset, stored as 4096-D vectors obtained from the voxelization (x:16, y:16, z:16) of all the 3D point clouds.

    In addition to the original point clouds, it contains randomly rotated copies with noise.

    The full dataset is split into the following arrays:

    • X_train (10000, 4096)
    • y_train (10000)
    • X_test (2000, 4096)
    • y_test (2000)

    Example python code reading the full dataset:

     import h5py

     # Read the voxelized 16x16x16 vectors and labels from full_dataset_vectors.h5.
     with h5py.File("../input/full_dataset_vectors.h5", "r") as hf:
         X_train = hf["X_train"][:]
         y_train = hf["y_train"][:]
         X_test = hf["X_test"][:]
         y_test = hf["y_test"][:]
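
    Since each 4096-D vector is a flattened 16 x 16 x 16 voxel grid, it can be reshaped back into a volume for 3D models; this sketch reuses the arrays read above, and the axis order is an assumption:

     import numpy as np

     # Reshape the flattened voxel vectors into (16, 16, 16) volumes.
     X_train_3d = np.asarray(X_train).reshape(-1, 16, 16, 16)
     print(X_train_3d.shape)  # expected: (10000, 16, 16, 16)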
    

    train_point_clouds.h5 & test_point_clouds.h5

    5000 (train) and 1000 (test) 3D point clouds stored in HDF5 file format. The point clouds have zero mean and a maximum dimension range of 1.

    Each file is divided into HDF5 groups.

    Each group is named after its corresponding array index in the original MNIST dataset and contains:

    • "points" dataset: x, y, z coordinates of each 3D point in the point cloud.
    • "normals" dataset: nx, ny, nz components of the unit normal associated with each point.
    • "img" dataset: the original mnist image.
    • "label" attribute: the original mnist label.

    Example python code reading 2 digits and storing some of the group content in tuples:

     import h5py

     # Open the training point clouds and read two digits (groups "0" and "1").
     with h5py.File("../input/train_point_clouds.h5", "r") as hf:
         a = hf["0"]
         b = hf["1"]
         digit_a = (a["img"][:], a["points"][:], a.attrs["label"])
         digit_b = (b["img"][:], b["points"][:], b.attrs["label"])
    

    voxelgrid.py

    A simple Python class that generates a grid of voxels from a 3D point cloud. Check the kernel for usage.

    plot3D.py

    Module with functions to plot point clouds and voxel grids inside a Jupyter notebook. You have to run this locally because Kaggle notebooks do not support rendering IFrames; see the GitHub issue here.

    Functions included (see the usage sketch after this list):

    • array_to_color: converts a 1D array to RGB values for use as the color kwarg in plot_points()

    • plot_points(xyz, colors=None, size=0.1, axis=False)

    • plot_voxelgrid(v_grid, cmap="Oranges", axis=False)
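
    A minimal usage sketch combining the pieces above; it assumes plot3D.py is importable from the working directory and reuses the digit_a tuple read in the earlier example:

     from plot3D import array_to_color, plot_points  # local module described above

     img_a, points_a, label_a = digit_a

     # Color each point by its z coordinate and plot the cloud.
     colors = array_to_color(points_a[:, 2])
     plot_points(points_a, colors=colors, size=0.1, axis=False)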

    Acknowledgements

    Have fun!

  15. Data from: MarFERReT: an open-source, version-controlled reference library...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 22, 2025
    Cite
    Blaskowski, Stephen (2025). MarFERReT: an open-source, version-controlled reference library of marine microbial eukaryote functional genes [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7055911
    Explore at:
    Dataset updated
    Jan 22, 2025
    Dataset provided by
    Blaskowski, Stephen
    Coesel, Sacha
    Armbrust, E. Virginia
    Groussman, Mora J
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Metatranscriptomics generates large volumes of sequence data about transcribed genes in natural environments. Taxonomic annotation of these datasets depends on the availability of curated reference sequences. For marine microbial eukaryotes, current reference libraries are limited by gaps in sequenced organism diversity and barriers to updating libraries with new sequence data, resulting in taxonomic annotation of only about half of eukaryotic environmental transcripts. Here, we introduce version 1.0 of the Marine Functional EukaRyotic Reference Taxa (MarFERReT), an updated marine microbial eukaryotic sequence library with a version-controlled framework designed for taxonomic annotation of eukaryotic metatranscriptomes. We gathered 902 marine eukaryote genomes and transcriptomes from multiple sources and assessed these candidate entries for sequence quality and cross-contamination issues, selecting 800 validated entries for inclusion in the library. MarFERReT v1 contains reference sequences from 800 marine eukaryotic genomes and transcriptomes, covering 453 species- and strain-level taxa, totaling nearly 28 million protein sequences with associated NCBI and PR2 Taxonomy identifiers and Pfam functional annotations. An accompanying MarFERReT project repository hosts containerized build scripts, documentation on installation and use case examples, and information on new versions of MarFERReT: https://github.com/armbrustlab/marferret

    The raw source data for the 902 candidate entries considered for MarFERReT v1.1.1, including the 800 accepted entries, are available for download from their respective online locations. The source URL for each of the entries is listed here in MarFERReT.v1.1.1.entry_curation.csv, and detailed instructions and code for downloading the raw sequence data from source are available in the MarFERReT code repository (link).

    This repository release contains MarFERReT database files from the v1.1.1 MarFERReT release, built using the following MarFERReT library build scripts: assemble_marferret.sh, pfam_annotate.sh, and build_diamond_db.sh. The following MarFERReT data products are available in this repository:

    MarFERReT.v1.1.1.metadata.csv

    This CSV file contains descriptors of each of the 902 database entries, including data source, taxonomy, and sequence descriptors. Data fields are as follows:

    entry_id: Unique MarFERReT sequence entry identifier.

    accepted: Acceptance into the final MarFERReT build (Y/N). The Y/N values can be adjusted to customize the final build output according to user-specific needs.

    marferret_name: A human and machine friendly string derived from the NCBI Taxonomy organism name; maintaining strain-level designation wherever possible.

    tax_id: The NCBI Taxonomy ID (taxID).

    pr2_accession: Best-matching PR2 accession ID associated with entry

    pr2_rank: The lowest shared rank between the entry and the pr2_accession

    pr2_taxonomy: PR2 Taxonomy classification scheme of the pr2_accession

    data_type: Type of sequence data; transcriptome shotgun assemblies (TSA), gene models from assembled genomes (genome), and single-cell amplified genomes (SAG) or transcriptomes (SAT).

    data_source: Online location of sequence data; the Zenodo data repository (Zenodo), the datadryad.org repository (datadryad.org), MMETSP re-assemblies on Zenodo (MMETSP)17, NCBI GenBank (NCBI), JGI Phycocosm (JGI-Phycocosm), the TARA Oceans portal on Genoscope (TARA), or entries from the Roscoff Culture Collection through the METdb database repository (METdb).

    source_link: URL where the original sequence data and/or metadata was collected.

    pub_year: Year of data release or publication of linked reference.

    ref_link: Pubmed URL directs to the published reference for entry, if available.

    ref_doi: DOI of entry data from source, if available.

    source_filename: Name of the original sequence file name from the data source.

    seq_type: Entry sequence data retrieved in nucleotide (nt) or amino acid (aa) alphabets.

    n_seqs_raw: Number of sequences in the original sequence file.

    source_name: Full organism name from entry source

    original_taxID: Original NCBI taxID from entry data source metadata, if available

    alias: Additional identifiers for the entry, if available
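
    A minimal sketch for working with the metadata table (the use of pandas is an assumption; the column names and the 'Y' value are taken from the field descriptions above):

     import pandas as pd

     # Load the entry metadata and keep only entries accepted into the final build.
     meta = pd.read_csv("MarFERReT.v1.1.1.metadata.csv")
     accepted = meta[meta["accepted"] == "Y"]

     # Summarize the accepted entries by sequence data type (TSA, genome, SAG, SAT).
     print(accepted["data_type"].value_counts())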

    MarFERReT.v1.1.1.curation.csv

    This CSV file contains curation and quality-control information on the 902 candidate entries considered for incorporation into MarFERReT v1, including curated NCBI Taxonomy IDs and entry validation statistics. Data fields are as follows:

    entry_id: Unique MarFERReT sequence entry identifier

    marferret_name: Organism name in human and machine friendly format, including additional NCBI taxonomy strain identifiers if available.

    tax_id: Verified NCBI taxID used in MarFERReT

    taxID_status: Status of the final NCBI taxID (Assigned, Updated, or Unchanged)

    taxID_notes: Notes on the original_taxID

    n_seqs_raw: Number of sequences in the original sequence file

    n_pfams: Number of Pfam domains identified in protein sequences

    qc_flag: Early validation quality-control flags: LOW_SEQS (fewer than 1,200 raw sequences) and LOW_PFAMS (fewer than 500 Pfam domain annotations).

    flag_Lasek: Flag notes from Lasek-Nesselquist and Johnson (2019); contains the flag 'FLAG_LASEK' indicating ciliate samples reported as contaminated in this study.

    VV_contam_pct: Estimated contamination reported for MMETSP entries in Van Vlierberghe et al., (2021).

    flag_VanVlierberghe: Flag for a high level of estimated contamination; set to FLAG_VV when 'VV_contam_pct' is over 50%.

    rp63_npfams: Number of ribosomal protein Pfam domains out of 63 total.

    rp63_contam_pct: Percent of total ribosomal protein sequences with an inferred taxonomic identity in any lineage other than the recorded identity, as described in the Technical Validation section from analysis of 63 Pfam ribosomal protein domains.

    flag_rp63: Flag for a high level of estimated contamination, from 'rp63_contam_pct' values over 50%: FLAG_RP63.

    flag_sum: Count of the number of flag columns (qc_flag, flag_Lasek, flag_VanVlierberghe, and flag_rp63). All entries with one or more flag are nominally rejected ('accepted' = N); entries without any flags are validated and accepted ('accepted' = Y).

    accepted: Acceptance into the final MarFERReT build (Y or N).

    MarFERReT.v1.1.1.proteins.faa.gz

    This Gzip-compressed FASTA file contains the 27,951,013 final translated and clustered protein sequences for all 800 accepted MarFERReT entries. The sequence defline contains the unique identifier for the sequence and its reference (mftX, where 'X' is a ten-digit integer value).

    MarFERReT.v1.1.1.taxonomies.tab.gz

    This Gzip-compressed tab-separated file is formatted for interoperability with the DIAMOND protein alignment tool commonly used for downstream analyses and contains some columns without any data. Each row contains an entry for one of the MarFERReT protein sequences in MarFERReT.v1.proteins.faa.gz. Note that 'accession.version' and 'taxid' are populated columns while 'accession' and 'gi' have NA values; the latter columns are required for back-compatibility as input for the DIAMOND alignment software and LCA analysis.

    The columns in this file contain the following information:

    accession: (NA)

    accession.version: The unique MarFERReT sequence identifier ('mftX').

    taxid: The NCBI Taxonomy ID associated with this reference sequence.

    gi: (NA).

    MarFERReT.v1.1.1.proteins_info.tab.gz

    This Gzip-compressed tab-separated file contains a row for each final MarFERReT protein sequence with the following columns:

    aa_id: the unique identifier for each MarFERReT protein sequence.

    entry_id: The unique numeric identifier for each MarFERReT entry.

    source_defline: The original, unformatted sequence identifier

    MarFERReT.v1.1.1.best_pfam_annotations.csv.gz

    This Gzip-compressed CSV file contains the best-scoring Pfam annotation for intra-species clustered protein sequences from the 800 validated MarFERReT entries, derived from the hmmsearch annotations against Pfam 34.0 functional domains. This file contains the following fields:

    aa_id: The unique MarFERReT protein sequence ID ('mftX').

    pfam_name: The shorthand Pfam protein family name.

    pfam_id: The Pfam identifier.

    pfam_eval: hmm profile match e-value score

    pfam_score: hmm profile match bitscore

    MarFERReT.v1.1.1.dmnd

    This binary file is the indexed database of the MarFERReT protein library with embedded NCBI taxonomic information, generated by the DIAMOND makedb tool using the build_diamond_db.sh script from the MarFERReT /scripts/ library. It can be used as the reference DIAMOND database for annotating environmental sequences from eukaryotic metatranscriptomes.
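
    As an illustration of how the indexed database is typically used (not a command documented by this record), a DIAMOND protein alignment could be launched from Python roughly as follows; the query file name is a placeholder and the exact flags depend on your DIAMOND version, so check the DIAMOND documentation and the MarFERReT repository:

     import subprocess

     # Align environmental protein sequences against the MarFERReT DIAMOND database.
     # "--outfmt 102" requests DIAMOND's taxonomic classification (LCA) output in
     # recent DIAMOND versions; adjust the flags to your installed version.
     subprocess.run([
         "diamond", "blastp",
         "--db", "MarFERReT.v1.1.1.dmnd",
         "--query", "environmental_proteins.faa",  # placeholder input file
         "--out", "marferret_lca.tsv",
         "--outfmt", "102",
     ], check=True)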

  16. HumBugDB: a large-scale acoustic mosquito dataset

    • zenodo.org
    csv, zip
    Updated May 10, 2022
    Cite
    Ivan Kiskin; Lawrence Wang; Marianne Sinka; Adam D. Cobb; Benjamin Gutteridge; Davide Zilli; Waqas Rafique; Rinita Dam; Theodoros Marinos; Yunpeng Li; Gerard Killeen; Dickson Msaky; Emmanuel Kaindoa; Kathy Willis; Steve J. Roberts (2022). HumBugDB: a large-scale acoustic mosquito dataset [Dataset]. http://doi.org/10.5281/zenodo.4904800
    Explore at:
    zip, csv (available download formats)
    Dataset updated
    May 10, 2022
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Ivan Kiskin; Lawrence Wang; Marianne Sinka; Adam D. Cobb; Benjamin Gutteridge; Davide Zilli; Waqas Rafique; Rinita Dam; Theodoros Marinos; Yunpeng Li; Gerard Killeen; Dickson Msaky; Emmanuel Kaindoa; Kathy Willis; Steve J. Roberts
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A large-scale multi-species dataset of acoustic recordings

    Dataset accompanying code and paper: HumBugDB: a large-scale acoustic mosquito dataset.

    A large-scale multi-species dataset containing recordings of mosquitoes collected from multiple locations globally, as well as via different collection methods. In total, we present 71,286 seconds (20 hours) of labelled mosquito data with 53,227 seconds (15 hours) of corresponding background noise, recorded at the sites of 8 experiments. Of these, 64,843 seconds contain species metadata, consisting of 36 species (or species complexes).

    This repository contains the labelled audio recordings (as a multi-part zip archive) and the accompanying metadata (CSV).

    The data is supplemented by a GitHub repository, https://github.com/HumBug-Mosquito/HumBugDB, which supports its use as follows:

    • The multi-part zip is intended to be extracted into the folder: /data/audio/ in the repository.
    • The latest metadata is hosted on GitHub, so that additional metadata can be added as it becomes available in the database and bugs can be fixed.
    • Documentation for code use, and a complete Datasheet for Datasets also available on GitHub.
    • Example code for data splitting, feature extraction, model training, and evaluation in the top-level notebook main.ipynb (an illustrative feature-extraction sketch is given after this list).
    • Bayesian Convolutional Neural Network models, in both Keras and PyTorch, trained on this data are available at GitHub release v1.0.
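
    The repository's main.ipynb covers feature extraction end to end; purely as an illustration of a typical audio front end (not the project's exact pipeline), a log-mel spectrogram can be computed with librosa as follows, with the file path, sample rate, and parameters as placeholders:

     import librosa
     import numpy as np

     # Placeholder path to one recording extracted under /data/audio/.
     y, sr = librosa.load("data/audio/example_recording.wav", sr=8000)

     # Log-mel spectrogram, a common feature representation for acoustic event models.
     mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
     log_mel = librosa.power_to_db(mel, ref=np.max)
     print(log_mel.shape)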

  17. Arabic Handwritten Digits Dataset

    • figshare.com
    bin
    Updated May 31, 2023
    Cite
    Mohamed Loey (2023). Arabic Handwritten Digits Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.12236948.v1
    Explore at:
    bin (available download formats)
    Dataset updated
    May 31, 2023
    Dataset provided by
    figshare
    Authors
    Mohamed Loey
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Arabic Handwritten Digits Dataset

    Abstract: In recent years, handwritten digit recognition has become an important area of research due to its applications in several fields. This work focuses on the recognition part of Arabic handwritten digit recognition, which faces several challenges, including the unlimited variation in human handwriting and the large public databases. The paper provides a deep learning technique that can be effectively applied to recognizing Arabic handwritten digits: LeNet-5, a Convolutional Neural Network (CNN), was trained and tested on the MADBase database of Arabic handwritten digit images, which contains 60,000 training and 10,000 testing images. A comparison is held amongst the results, and it is shown that the use of CNN leads to significant improvements over other machine-learning classification algorithms, with the CNN reaching an average recognition accuracy of 99.15%.

    Context: The motivation of this study is to use knowledge learned across multiple works to enhance the performance of Arabic handwritten digit recognition. Arabic handwriting comes in many different styles, which makes it important to develop new and advanced solutions for handwriting recognition, and a deep learning system needs a huge number of images to be able to make good decisions.

    Content: MADBase is a modified Arabic handwritten digits database containing 60,000 training images and 10,000 test images. MADBase was written by 700 writers; each writer wrote each digit (from 0 to 9) ten times. To include different writing styles, the database was gathered from different institutions: Colleges of Engineering and Law, School of Medicine, the Open University (whose students span a wide range of ages), a high school, and a governmental institution. MADBase is available for free and can be downloaded from http://datacenter.aucegypt.edu/shazeem/.

    Acknowledgements: CNN for Handwritten Arabic Digits Recognition Based on LeNet-5 (http://link.springer.com/chapter/10.1007/978-3-319-48308-5_54), Ahmed El-Sawy, Hazem El-Bakry, Mohamed Loey. Proceedings of the International Conference on Advanced Intelligent Systems and Informatics 2016, Volume 533 of the series Advances in Intelligent Systems and Computing, pp. 566-575.

    Inspiration: Creating the proposed database presents more challenges because it deals with many issues such as writing style, stroke thickness, and the number and position of dots. Some characters have different shapes while written in the same position; for example, the teh character has different shapes in the isolated position. See also the Arabic Handwritten Characters Dataset: https://www.kaggle.com/mloey1/ahcd1. Benha University: http://bu.edu.eg/staff/mloey, https://mloey.github.io/
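
    Purely as an illustrative sketch of the LeNet-5-style architecture described above (not the authors' implementation), a Keras model for 28x28 grayscale digit images with 10 classes could look like this; the input size and preprocessing are assumptions based on MADBase's MNIST-like format:

     import tensorflow as tf
     from tensorflow.keras import layers, models

     # LeNet-5-style CNN; 28x28x1 inputs and 10 output classes are assumptions.
     model = models.Sequential([
         layers.Input(shape=(28, 28, 1)),
         layers.Conv2D(6, kernel_size=5, padding="same", activation="tanh"),
         layers.AveragePooling2D(pool_size=2),
         layers.Conv2D(16, kernel_size=5, activation="tanh"),
         layers.AveragePooling2D(pool_size=2),
         layers.Flatten(),
         layers.Dense(120, activation="tanh"),
         layers.Dense(84, activation="tanh"),
         layers.Dense(10, activation="softmax"),
     ])
     model.compile(optimizer="adam",
                   loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])
     model.summary()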

  18. CAnine CuTaneous Cancer Histology Dataset

    • cancerimagingarchive.net
    • dev.cancerimagingarchive.net
    json, n/a, svs +1
    Updated Jan 12, 2022
    Cite
    The Cancer Imaging Archive (2022). CAnine CuTaneous Cancer Histology Dataset [Dataset]. http://doi.org/10.7937/TCIA.2M93-FX66
    Explore at:
    n/a, svs, json, zip and sqlite (available download formats)
    Dataset updated
    Jan 12, 2022
    Dataset authored and provided by
    The Cancer Imaging Archive
    License

    https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/

    Time period covered
    Jan 12, 2022
    Dataset funded by
    National Cancer Institute: http://www.cancer.gov/
    Description

    We present a large-scale dataset of 350 histologic samples of seven different canine cutaneous tumors. All samples were obtained through surgical resection due to neoplastic indicators and were selected retrospectively from the biopsy archive of the Institute for Veterinary Pathology of the Freie Universität Berlin according to sufficient tissue preservation and presence of characteristic histologic features for the corresponding tumor subtypes. Samples were stained with a routine Hematoxylin & Eosin dye and digitized with two Leica linear scanning systems at a resolution of 0.25 um/pixel. Together with the 350 whole slide images, we provide a database consisting of 12,424 polygon annotations for six non-neoplastic tissue classes (epidermis, dermis, subcutis, bone, cartilage, and a joint class of inflammation and necrosis) and seven tumor classes (melanoma, mast cell tumor, squamous cell carcinoma, peripheral nerve sheath tumor, plasmacytoma, trichoblastoma, and histiocytoma).

    The polygon annotations were generated using the open source software SlideRunner (https://github.com/DeepPathology/SlideRunner). Within SlideRunner, users can view whole slide images (WSIs) and zoom through their magnification levels. Using multiple clicks or click-and-drag, the pathologist annotated polygons for 13 classes (epidermis, dermis, subcutis, bone, cartilage, a joint class of inflammation and necrosis, melanoma, mast cell tumor, squamous cell carcinoma, peripheral nerve sheath tumor, plasmacytoma, trichoblastoma, and histiocytoma) on 287 WSIs. The remaining WSIs were annotated by three medical students in their 8th semester supervised by the leading pathologist who later reviewed these annotations for correctness and completeness.

    Due to the large size of the dataset and the extensive annotations, it provides a good basis for segmentation and classification algorithms based on supervised learning. Previous work [1-4] has shown that, due to various homologies between the species, canine cutaneous tissue can serve as a model for human samples. Prouteau et al. have published an extensive comparison of the two species, especially for cutaneous tumors, and include homologies between canine and human oncology regarding "clinical and histological appearance, biological behavior, tumor genetics, molecular pathways and targets, and response to therapies" [1]. Ranieri et al. highlight that pet dogs and humans share many environmental risk factors and show the highest risk for cancer development at similar points in time relative to their life spans [2]. Both Ranieri et al. and Pinho et al. highlight the potential of using insights from experiments on canine samples for developing human cancer treatments [2,3]. From a technical perspective, Aubreville et al. have shown that canine samples can be used to aid human cancer research through the use of transfer learning methods [4].

    Potential users of the dataset can load the SQLite database into their custom installation of SlideRunner and adapt or extend the database with custom annotations. Furthermore, we converted the annotations to the COCO JSON format, which is commonly used by computer scientists for training neural networks. Its pixel-level annotations can be used for supervised segmentation algorithms, as opposed to datasets that only provide clinical data at the slide level.
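
    As a quick orientation to the COCO-style annotations mentioned above, the following sketch counts annotations per class using only the standard library; the annotation file name is a placeholder, and the keys assume the standard COCO layout:

     import json
     from collections import Counter

     # Placeholder file name for the COCO-format annotation export.
     with open("canine_cutaneous_annotations.json") as f:
         coco = json.load(f)

     # Map category IDs to names and count polygon annotations per class.
     id_to_name = {c["id"]: c["name"] for c in coco["categories"]}
     counts = Counter(id_to_name[a["category_id"]] for a in coco["annotations"])
     for name, n in counts.most_common():
         print(f"{name}: {n}")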

    References

    1. Prouteau, Anaïs, and Catherine André. "Canine melanomas as models for human melanomas: Clinical, histological, and genetic comparison." Genes 10.7 (2019): 501. https://doi.org/10.3390/genes10070501
    2. Ranieri, G., et al. "A model of study for human cancer: Spontaneous occurring tumors in dogs. Biological features and translation for new anticancer therapies." Critical reviews in oncology/hematology 88.1 (2013): 187-197. https://doi.org/10.1016/j.critrevonc.2013.03.005
    3. Pinho, Salomé S., et al. "Canine tumors: a spontaneous animal model of human carcinogenesis." Translational Research 159.3 (2012): 165-172. https://doi.org/10.1016/j.trsl.2011.11.005
    4. Aubreville, Marc, et al. "A completely annotated whole slide image dataset of canine breast cancer to aid human breast cancer research." Scientific data 7.1 (2020): 1-10. https://doi.org/10.1038/s41597-020-00756-z

  19. LUTBIO multimodal biometric database

    • data.mendeley.com
    Updated Nov 25, 2024
    + more versions
    Cite
    rui yang (2024). LUTBIO multimodal biometric database [Dataset]. http://doi.org/10.17632/jszw485f8j.4
    Explore at:
    Dataset updated
    Nov 25, 2024
    Authors
    rui yang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The LUTBIO database provides a comprehensive resource for research in multimodal biometric authentication, featuring the following key aspects:

    • Extensive Biometric Modalities: The database contains data from nine biometric modalities: voice, face, fingerprint, contact-based palmprint, electrocardiogram (ECG), opisthenar (back of hand), ear, contactless palmprint, and periocular region.

    • Diverse Demographics: Data were collected from 306 individuals, with a balanced gender distribution of 164 males and 142 females, spanning an age range of 8 to 90 years. This diverse age representation enables analyses across a wide demographic spectrum.

    • Representative Population Sampling: Volunteers were recruited from naturally occurring communities, ensuring a large-scale, statistically representative population. The collected data encompass variations observed in real-world environments.

    • Support for Multimodal and Cross-Modality Research: LUTBIO provides both contact-based and contactless palmprint data, as well as fingerprint data (from optical images and scans), promoting advancements in multimodal biometric authentication. This resource is designed to guide the development of future multimodal databases.

    • Flexible, Decouplable Data: The biometric data in the LUTBIO database are designed to be highly decouplable, enabling independent processing of each modality without loss of information. This flexibility supports both single-modality and multimodal analysis, empowering researchers to optimize, combine, and customize biometric features for specific applications.

    ✅ Data Availability: To facilitate early access, we are initially releasing sample data from 6 individuals. Upon publication of the paper, the full dataset will be made available.

    🥸 Important Notice: Please read the data collection protocol of the LUTBIO dataset carefully before use, as it is essential for understanding and correctly interpreting the dataset. Thank you.

    😎 Good news! Our paper has received revision comments, and we are carefully making revisions based on the feedback. We appreciate the reviewers and the editor for their efforts.🥰🥰🥰

    🤠 We conducted further research based on the LUTBIO database and proposed AuthFormer: An Adaptive Multimodal Biometric Authentication Framework for Secure and Flexible Identity Verification. The code is available at https://github.com/RykerYang/Authformer-LUTBIO.git. Our work is currently under review for an international conference, where we are dealing with a malicious reviewer. We are in the process of responding and appealing—wish us luck! 🍀🍀🍀

  20. mroeck/carbenmats-buildings: Pre-release

    • zenodo.org
    zip
    Updated Sep 26, 2023
    Cite
    Martin RÖCK; Martin RÖCK (2023). mroeck/carbenmats-buildings: Pre-release [Dataset]. http://doi.org/10.5281/zenodo.8363895
    Explore at:
    zipAvailable download formats
    Dataset updated
    Sep 26, 2023
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Martin RÖCK; Martin RÖCK
    Description

    A Global Database on Whole Life Carbon, Energy and Material Intensity of Buildings (CarbEnMats-Buildings)

    Abstract

    Globally, interest in understanding the life cycle related greenhouse gas (GHG) emissions of buildings is increasing. Robust data is required for benchmarking and analysis of parameters driving resource use and whole life carbon (WLC) emissions. However, open datasets combining information on energy and material use as well as whole life carbon emissions remain largely unavailable – until now.

    We present a global database on whole life carbon, energy use, and material intensity of buildings. It contains data on more than 1,200 building case studies and includes over 300 attributes addressing context and site, building design, assessment methods, energy and material use, as well as WLC emissions across different life cycle stages. The data was collected through various meta-studies, using a dedicated data collection template (DCT) and processing scripts (Python Jupyter Notebooks), all of which are shared alongside this data descriptor.

    This dataset is valuable for industrial ecology and sustainable construction research and will help inform decision-making in the building industry as well as the climate policy context.

    Background & Summary

    The need for reducing greenhouse gas (GHG) emissions across Europe requires defining and implementing a performance system for both operational and embodied carbon at the building level that provides relevant guidance for policymakers and the building industry. So-called whole life carbon (WLC) of buildings is gaining increasing attention among decision-makers concerned with climate and industrial policy, as well as building procurement, design, and operation. However, most open building datasets published thus far have focused on buildings' operational energy consumption and related parameters 1,2,2–4. Recent years have furthermore brought large-scale datasets on building geometry (footprint, height) 5,6 as well as the publication of some datasets on building construction systems and material intensity 7,8. Heeren and Fishman's database seed on material intensity (MI) of buildings 7, an essential reference to this work, was a first step towards an open data repository on material-related environmental impacts of buildings. In their 2019 descriptor, the authors present data on the material coefficients of more than 300 building cases intended for use in studies applying material flow analysis (MFA), input-output (IO) or life cycle assessment (LCA) methods. Guven et al. 8 elaborated on this effort by publishing a construction classification system database for understanding resource use in building construction. However, thus far, there is a lack of publicly available data that combines material composition and energy use and also considers life cycle-related environmental impacts, such as life cycle-related GHG emissions, also referred to as buildings' whole life carbon.

    The Global Database on Whole Life Carbon, Energy Use, and Material Intensity of Buildings (CarbEnMats-Buildings) published alongside this descriptor provides information on more than 1,200 buildings worldwide. The dataset includes attributes on geographical context and site, main building design characteristics, LCA-based assessment methods, as well as information on energy and material use, and related life cycle greenhouse gas (GHG) emissions, commonly referred to as whole life carbon (WLC), with a focus on embodied carbon (EC) emissions. The dataset compiles data obtained through a systematic review of the scientific literature as well as systematic data collection from both literature sources and industry partners. By applying a uniform data collection template (DCT) and related automated procedures for systematic data collection and compilation, we facilitate the processing, analysis and visualization along predefined categories and attributes, and support the consistency of data types and units. The descriptor includes specifications related to the DCT spreadsheet form used for obtaining these data as well as explanations of the data processing and feature engineering steps undertaken to clean and harmonise the data records. The validation focuses on describing the composition of the dataset and values observed for attributes related to whole life carbon, energy and material intensity.

    The data published with this descriptor offers the largest open compilation of data on whole life carbon emissions, energy use and material intensity of buildings published to date. This open dataset is expected to be valuable for research applications in the context of MFA, I/O and LCA modelling. It also offers a unique data source for benchmarking whole life carbon, energy use and material intensity of buildings to inform policy and decision-making in the context of the decarbonization of building construction and operation as well as commercial real estate in Europe and beyond.

    Files

    All files related to this descriptor are available on a public GitHub repository and related release via Zenodo (https://doi.org/10.5281/zenodo.8363895). The repository contains the following files:

    • README.md is a text file with instructions on how to use the files and documents.
    • CarbEnMats_attributes.XLSX is a table with the complete attribute description.
    • CarbEnMats_materials.XLSX is the table of material options and mappings.
    • CarbEnMats_dataset.XLSX is the building dataset in MS Excel format.
    • CarbEnMats_dataset.txt is the building dataset in tab-delimited TXT format (see the loading sketch after this list).
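
    A minimal loading sketch for the tab-delimited version of the dataset (the use of pandas is an assumption; no column names are assumed here):

     import pandas as pd

     # Load the tab-delimited building dataset.
     buildings = pd.read_csv("CarbEnMats_dataset.txt", sep="\t")

     # Quick shape check: over 1,200 building case studies and 300+ attributes.
     print(buildings.shape)
     print(list(buildings.columns)[:10])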

    Further information

    Please consult the related data descriptor article (linked at the top) for further information, e.g.:

    • Methods (Data collection; data processing)
    • Data records (Files; Sources; Attributes)
    • Technical validation (Data overview; Data consistency)
    • Usage Notes (Attribute priority; Scope summary, Missing information)

    Code availability (LICENSE)

    The dataset, the data collection template as well as the code used for processing, harmonization and visualization are published under a GNU General Public License v3.0. The GNU General Public License is a free, copyleft license for software and other kinds of works. We encourage you to review, reuse, and refine the data and scripts and eventually share-alike.

    Contributing

    The CarbEnMats-Buildings database is the result of a highly collaborative effort and needs your active contributions to further improve and grow the open building data landscape. Reach out to the lead author (email, linkedin) if you are interested in contributing your data or time.

    Cite as

    When referring to this work, please cite both the descriptor and the dataset:

    • Descriptor: RÖCK, Martin, SORENSEN, Andreas, BALOUKTSI, Maria, RUSCHI MENDES SAADE, Marcella, RASMUSSEN, Freja Nygaard, BIRGISDOTTIR, Harpa, FRISCHKNECHT, Rolf, LÜTZKENDORF, Thomas, HOXHA, Endrit, HABERT, Guillaume, SATOLA, Daniel, TRUGER, Barbara, TOZAN, Buket, KUITTINEN, Matti, ALAUX, Nicolas, ALLACKER, Karen, & PASSER, Alexander. (2023). A Global Database on Whole Life Carbon, Energy and Material Intensity of Buildings (CarbEnMats-Buildings) [Preprint]. Zenodo. https://doi.org/10.5281/zenodo.8378939
    • Dataset: Martin Röck. (2023). mroeck/carbenmats-buildings: Pre-release (0.1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.8363895
Cite
João Felipe; Leonardo; Vanessa; Juliana (2019). Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter Notebooks [Dataset]. http://doi.org/10.5281/zenodo.2592524

Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter Notebooks

Explore at:
23 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
Mar 13, 2019
Authors
João Felipe; Leonardo; Vanessa; Juliana
Description

The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourage poor coding practices and that their results can be hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub. Paper: https://2019.msrconf.org/event/msr-2019-papers-a-large-scale-study-about-quality-and-reproducibility-of-jupyter-notebooks This repository contains two files: dump.tar.bz2 jupyter_reproducibility.tar.bz2 The dump.tar.bz2 file contains a PostgreSQL dump of the database, with all the data we extracted from the notebooks. The jupyter_reproducibility.tar.bz2 file contains all the scripts we used to query and download Jupyter Notebooks, extract data from them, and analyze the data. It is organized as follows: analyses: this folder has all the notebooks we use to analyze the data in the PostgreSQL database. archaeology: this folder has all the scripts we use to query, download, and extract data from GitHub notebooks. paper: empty. The notebook analyses/N12.To.Paper.ipynb moves data to it In the remaining of this text, we give instructions for reproducing the analyses, by using the data provided in the dump and reproducing the collection, by collecting data from GitHub again. Reproducing the Analysis This section shows how to load the data in the database and run the analyses notebooks. In the analysis, we used the following environment: Ubuntu 18.04.1 LTS PostgreSQL 10.6 Conda 4.5.11 Python 3.7.2 PdfCrop 2012/11/02 v1.38 First, download dump.tar.bz2 and extract it: tar -xjf dump.tar.bz2 It extracts the file db2019-03-13.dump. Create a database in PostgreSQL (we call it "jupyter"), and use psql to restore the dump: psql jupyter < db2019-03-13.dump It populates the database with the dump. Now, configure the connection string for sqlalchemy by setting the environment variable JUP_DB_CONNECTTION: export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter"; Download and extract jupyter_reproducibility.tar.bz2: tar -xjf jupyter_reproducibility.tar.bz2 Create a conda environment with Python 3.7: conda create -n analyses python=3.7 conda activate analyses Go to the analyses folder and install all the dependencies of the requirements.txt cd jupyter_reproducibility/analyses pip install -r requirements.txt For reproducing the analyses, run jupyter on this folder: jupyter notebook Execute the notebooks on this order: Index.ipynb N0.Repository.ipynb N1.Skip.Notebook.ipynb N2.Notebook.ipynb N3.Cell.ipynb N4.Features.ipynb N5.Modules.ipynb N6.AST.ipynb N7.Name.ipynb N8.Execution.ipynb N9.Cell.Execution.Order.ipynb N10.Markdown.ipynb N11.Repository.With.Notebook.Restriction.ipynb N12.To.Paper.ipynb Reproducing or Expanding the Collection The collection demands more steps to reproduce and takes much longer to run (months). It also involves running arbitrary code on your machine. Proceed with caution. 
Requirements. This time, we have extra requirements: all the analysis requirements, lbzip2 2.5, gcc 7.3.0, a GitHub account, and a Gmail account.

Environment. First, set the following environment variables:

    export JUP_MACHINE="db"; # machine identifier
    export JUP_BASE_DIR="/mnt/jupyter/github"; # place to store the repositories
    export JUP_LOGS_DIR="/home/jupyter/logs"; # log files
    export JUP_COMPRESSION="lbzip2"; # compression program
    export JUP_VERBOSE="5"; # verbose level
    export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter"; # sqlalchemy connection
    export JUP_GITHUB_USERNAME="github_username"; # your github username
    export JUP_GITHUB_PASSWORD="github_password"; # your github password
    export JUP_MAX_SIZE="8000.0"; # maximum size of the repositories directory (in GB)
    export JUP_FIRST_DATE="2013-01-01"; # initial date to query github
    export JUP_EMAIL_LOGIN="gmail@gmail.com"; # your gmail address
    export JUP_EMAIL_TO="target@email.com"; # email that receives notifications
    export JUP_OAUTH_FILE="~/oauth2_creds.json" # oauth2 authentication file
    export JUP_NOTEBOOK_INTERVAL=""; # notebook id interval for this machine; leave it blank
    export JUP_REPOSITORY_INTERVAL=""; # repository id interval for this machine; leave it blank
    export JUP_WITH_EXECUTION="1"; # execute python notebooks
    export JUP_WITH_DEPENDENCY="0"; # run notebooks with and without declared dependencies
    export JUP_EXECUTION_MODE="-1"; # run following the execution order
    export JUP_EXECUTION_DIR="/home/jupyter/execution"; # temporary directory for running notebooks
    export JUP_ANACONDA_PATH="~/anaconda3"; # conda installation path
    export JUP_MOUNT_BASE="/home/jupyter/mount_ghstudy.sh"; # bash script to mount base dir
    export JUP_UMOUNT_BASE="/home/jupyter/umount_ghstudy.sh"; # bash script to umount base dir
    export JUP_NOTEBOOK_TIMEOUT="300"; # timeou...
