Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Compilation of Python code for data preprocessing and VegeNet building, as well as image datasets (zip files).
Image datasets:
Dataset Card for example-preference-dataset
This dataset has been created with distilabel.
Dataset Summary
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/sdiazlor/example-preference-dataset/raw/main/pipeline.yaml"
or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/distilabel-internal-testing/example-generate-preference-dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 1 row and is filtered where the book is How to create high impact business presentations. It features 7 columns including author, publication date, language, and book publisher.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0 licensed Python and R notebook versions on Kaggle used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.
By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.
Meta Kaggle for Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.
The best part is Meta Kaggle enriches Meta Kaggle for Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code’s author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!
While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.
The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.
The files are organized into a two-level directory structure. Each top level folder contains up to 1 million files, e.g. - folder 123 contains all versions from 123,000,000 to 123,999,999. Each sub folder contains up to 1 thousand files, e.g. - 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.
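As a rough illustration of this layout, the snippet below maps a KernelVersions id to the path where its file should live; the exact folder-name formatting and the file extension are assumptions to verify against the actual archive.

# Hypothetical helper (not part of the dataset): map a KernelVersions id to
# its expected location in the two-level layout described above.
def kernel_version_path(kernel_version_id: int, extension: str = "ipynb") -> str:
    top = kernel_version_id // 1_000_000            # e.g. 123 for id 123456789
    sub = (kernel_version_id % 1_000_000) // 1_000  # e.g. 456 for id 123456789
    return f"{top}/{sub}/{kernel_version_id}.{extension}"

print(kernel_version_path(123456789))  # -> 123/456/123456789.ipynb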
The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays
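If you need to pull files from that bucket programmatically, a minimal sketch using the google-cloud-storage client looks like the following; the project id and object path are placeholders, and the download is billed to your project.

from google.cloud import storage

# Requester-pays sketch: "your-gcp-project" must be a billing-enabled project,
# and the object name below is purely illustrative.
client = storage.Client()
bucket = client.bucket("kaggle-meta-kaggle-code-downloads", user_project="your-gcp-project")
blob = bucket.blob("123/456/123456789.ipynb")  # hypothetical object path
blob.download_to_filename("123456789.ipynb")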
We love feedback! Let us know in the Discussion tab.
Happy Kaggling!
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains 6,000 example images generated with the process described in Roboflow's How to Create a Synthetic Dataset tutorial.
The images are composed of a background (randomly selected from Google's Open Images dataset) and a number of fruits (from Horea94's Fruit Classification Dataset) superimposed on top with a random orientation, scale, and color transformation. All images are 416x550 to simulate a smartphone aspect ratio.
To generate your own images, follow our tutorial or download the code.
Example image: https://blog.roboflow.ai/content/images/2020/04/synthetic-fruit-examples.jpg
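For a rough idea of the compositing step described above, here is a minimal sketch using Pillow; it is not Roboflow's actual generation script, and the file names, transformation ranges, and placement logic are illustrative assumptions.

import random
from PIL import Image, ImageEnhance

# Paste a randomly rotated, scaled, and color-jittered fruit cutout onto a
# background image (sketch only; file names are placeholders).
background = Image.open("background.jpg").convert("RGB").resize((416, 550))
fruit = Image.open("fruit.png").convert("RGBA")

scale = random.uniform(0.3, 1.0)
fruit = fruit.resize((int(fruit.width * scale), int(fruit.height * scale)))
fruit = fruit.rotate(random.uniform(0, 360), expand=True)

# Color jitter on the RGB channels only, so the alpha mask stays intact
r, g, b, a = fruit.split()
rgb = ImageEnhance.Color(Image.merge("RGB", (r, g, b))).enhance(random.uniform(0.7, 1.3))
fruit = Image.merge("RGBA", (*rgb.split(), a))

x = random.randint(0, max(0, background.width - fruit.width))
y = random.randint(0, max(0, background.height - fruit.height))
background.paste(fruit, (x, y), fruit)  # use the alpha channel as the paste mask
background.save("synthetic_example.jpg")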
The datasets are resulting from OPFLearn.jl, a Julia package for creating AC OPF datasets. The package was developed to provide researchers with a standardized way to efficiently create AC OPF datasets that are representative of more of the AC OPF feasible load space compared to typical dataset creation methods. The OPFLearn dataset creation method uses a relaxed AC OPF formulation to reduce the volume of the unclassified input space throughout the dataset creation process. The dataset contains load profiles and their respective optimal primal and dual solutions. Load samples are processed using AC OPF formulations from PowerModels.jl. More information on the dataset creation method can be found in our publication, "OPF-Learn: An Open-Source Framework for Creating Representative AC Optimal Power Flow Datasets" and in the package website: https://github.com/NREL/OPFLearn.jl.
This dataset contains the metadata of the datasets published in 77 Dataverse installations, information about each installation's metadata blocks, and the list of standard licenses that dataset depositors can apply to the datasets they publish in the 36 installations running more recent versions of the Dataverse software. The data is useful for reporting on the quality of dataset and file-level metadata within and across Dataverse installations. Curators and other researchers can use this dataset to explore how well Dataverse software and the repositories using the software help depositors describe data.
How the metadata was downloaded
The dataset metadata and metadata block JSON files were downloaded from each installation on October 2 and October 3, 2022 using a Python script kept in a GitHub repo at https://github.com/jggautier/dataverse-scripts/blob/main/other_scripts/get_dataset_metadata_of_all_installations.py. In order to get the metadata from installations that require an installation account API token to use certain Dataverse software APIs, I created a CSV file with two columns: one column named "hostname" listing each installation URL in which I was able to create an account and another named "apikey" listing my accounts' API tokens. The Python script expects and uses the API tokens in this CSV file to get metadata and other information from installations that require API tokens.
How the files are organized
├── csv_files_with_metadata_from_most_known_dataverse_installations
│   ├── author(citation).csv
│   ├── basic.csv
│   ├── contributor(citation).csv
│   ├── ...
│   └── topic_classification(citation).csv
├── dataverse_json_metadata_from_each_known_dataverse_installation
│   ├── Abacus_2022.10.02_17.11.19.zip
│   │   ├── dataset_pids_Abacus_2022.10.02_17.11.19.csv
│   │   ├── Dataverse_JSON_metadata_2022.10.02_17.11.19
│   │   │   ├── hdl_11272.1_AB2_0AQZNT_v1.0.json
│   │   │   └── ...
│   │   └── metadatablocks_v5.6
│   │       ├── astrophysics_v5.6.json
│   │       ├── biomedical_v5.6.json
│   │       ├── citation_v5.6.json
│   │       ├── ...
│   │       └── socialscience_v5.6.json
│   ├── ACSS_Dataverse_2022.10.02_17.26.19.zip
│   ├── ADA_Dataverse_2022.10.02_17.26.57.zip
│   ├── Arca_Dados_2022.10.02_17.44.35.zip
│   ├── ...
│   └── World_Agroforestry_-_Research_Data_Repository_2022.10.02_22.59.36.zip
├── dataset_pids_from_most_known_dataverse_installations.csv
├── licenses_used_by_dataverse_installations.csv
└── metadatablocks_from_most_known_dataverse_installations.csv
This dataset contains two directories and three CSV files not in a directory. One directory, "csv_files_with_metadata_from_most_known_dataverse_installations", contains 18 CSV files that contain the values from common metadata fields of all 77 Dataverse installations. For example, author(citation)_2022.10.02-2022.10.03.csv contains the "Author" metadata for all published, non-deaccessioned, versions of all datasets in the 77 installations, where there's a row for each author name, affiliation, identifier type and identifier. The other directory, "dataverse_json_metadata_from_each_known_dataverse_installation", contains 77 zipped files, one for each of the 77 Dataverse installations whose dataset metadata I was able to download using Dataverse APIs. Each zip file contains a CSV file and two sub-directories: The CSV file contains the persistent IDs and URLs of each published dataset in the Dataverse installation as well as a column to indicate whether or not the Python script was able to download the Dataverse JSON metadata for each dataset.
For Dataverse installations using Dataverse software versions whose Search APIs include each dataset's owning Dataverse collection name and alias, the CSV files also include which Dataverse collection (within the installation) that dataset was published in. One sub-directory contains a JSON file for each of the installation's published, non-deaccessioned dataset versions. The JSON files contain the metadata in the "Dataverse JSON" metadata schema. The other sub-directory contains information about the metadata models (the "metadata blocks" in JSON files) that the installation was using when the dataset metadata was downloaded. I saved them so that they can be used when extracting metadata from the Dataverse JSON files. The dataset_pids_from_most_known_dataverse_installations.csv file contains the dataset PIDs of all published datasets in the 77 Dataverse installations, with a column to indicate if the Python script was able to download the dataset's metadata. It's a union of all of the "dataset_pids_..." files in each of the 77 zip files. The licenses_used_by_dataverse_installations.csv file contains information about the licenses that a number of the installations let depositors choose when creating datasets. When I collected ... Visit https://dataone.org/datasets/sha256%3Ad27d528dae8cf01e3ea915f450426c38fd6320e8c11d3e901c43580f997a3146 for complete metadata about this dataset.
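As a quick starting point for working with the aggregated CSV files, the hedged sketch below loads the union file of dataset PIDs and inspects it; no column names are assumed, so check the header before building on it.

import pandas as pd

# Load the union CSV of dataset PIDs from all 77 installations and inspect it.
pids = pd.read_csv("dataset_pids_from_most_known_dataverse_installations.csv")
print(pids.columns.tolist())   # check the real column names first
print(len(pids), "published dataset records")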
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 1 row and is filtered where the book is Create your own sign making business from scratch. It features 7 columns including author, publication date, language, and book publisher.
The CREATE database is composed of 14 hours of multimodal recordings from a mobile robotic platform based on the iRobot Create.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
sql-create-context-v2 Dataset
Overview
The sql-create-context-v2 dataset enhances the original dataset built from WikiSQL and Spider, focusing on text-to-SQL tasks with a special emphasis on reducing hallucination of column and table names. This version introduces a JSONL format for more efficient data processing and iteration, alongside a structured approach to representing SQL queries in the dataset entries.
Key Enhancements
Dataset Format: Transitioned to… See the full description on the dataset page: https://huggingface.co/datasets/ramachetan22/sql-create-context-v2.
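A hedged way to explore the dataset is to load it from the Hugging Face Hub and inspect the first record rather than assuming a fixed schema; the split name here is an assumption.

from datasets import load_dataset

# Load sql-create-context-v2 from the Hub and print one record to see its fields.
ds = load_dataset("ramachetan22/sql-create-context-v2", split="train")  # split name assumed
print(ds[0])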
https://dataintelo.com/privacy-and-policy
The global AI training dataset market size was valued at approximately USD 1.2 billion in 2023 and is projected to reach USD 6.5 billion by 2032, growing at a compound annual growth rate (CAGR) of 20.5% from 2024 to 2032. This substantial growth is driven by the increasing adoption of artificial intelligence across various industries, the necessity for large-scale and high-quality datasets to train AI models, and the ongoing advancements in AI and machine learning technologies.
One of the primary growth factors in the AI training dataset market is the exponential increase in data generation across multiple sectors. With the proliferation of internet usage, the expansion of IoT devices, and the digitalization of industries, there is an unprecedented volume of data being generated daily. This data is invaluable for training AI models, enabling them to learn and make more accurate predictions and decisions. Moreover, the need for diverse and comprehensive datasets to improve AI accuracy and reliability is further propelling market growth.
Another significant factor driving the market is the rising investment in AI and machine learning by both public and private sectors. Governments around the world are recognizing the potential of AI to transform economies and improve public services, leading to increased funding for AI research and development. Simultaneously, private enterprises are investing heavily in AI technologies to gain a competitive edge, enhance operational efficiency, and innovate new products and services. These investments necessitate high-quality training datasets, thereby boosting the market.
The proliferation of AI applications in various industries, such as healthcare, automotive, retail, and finance, is also a major contributor to the growth of the AI training dataset market. In healthcare, AI is being used for predictive analytics, personalized medicine, and diagnostic automation, all of which require extensive datasets for training. The automotive industry leverages AI for autonomous driving and vehicle safety systems, while the retail sector uses AI for personalized shopping experiences and inventory management. In finance, AI assists in fraud detection and risk management. The diverse applications across these sectors underline the critical need for robust AI training datasets.
As the demand for AI applications continues to grow, the role of Ai Data Resource Service becomes increasingly vital. These services provide the necessary infrastructure and tools to manage, curate, and distribute datasets efficiently. By leveraging Ai Data Resource Service, organizations can ensure that their AI models are trained on high-quality and relevant data, which is crucial for achieving accurate and reliable outcomes. The service acts as a bridge between raw data and AI applications, streamlining the process of data acquisition, annotation, and validation. This not only enhances the performance of AI systems but also accelerates the development cycle, enabling faster deployment of AI-driven solutions across various sectors.
Regionally, North America currently dominates the AI training dataset market due to the presence of major technology companies and extensive R&D activities in the region. However, Asia Pacific is expected to witness the highest growth rate during the forecast period, driven by rapid technological advancements, increasing investments in AI, and the growing adoption of AI technologies across various industries in countries like China, India, and Japan. Europe and Latin America are also anticipated to experience significant growth, supported by favorable government policies and the increasing use of AI in various sectors.
The data type segment of the AI training dataset market encompasses text, image, audio, video, and others. Each data type plays a crucial role in training different types of AI models, and the demand for specific data types varies based on the application. Text data is extensively used in natural language processing (NLP) applications such as chatbots, sentiment analysis, and language translation. As the use of NLP is becoming more widespread, the demand for high-quality text datasets is continually rising. Companies are investing in curated text datasets that encompass diverse languages and dialects to improve the accuracy and efficiency of NLP models.
Image data is critical for computer vision application
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 1 row and is filtered where the book is Learning and playing indoors: how to create an inspiring indoor environment. It features 7 columns including book, author, publication date, language, and book publisher. The preview is ordered by publication date (descending).
Description:
👉 Download the dataset here
The Printed Digits Dataset is a comprehensive collection of approximately 3,000 grayscale images, specifically curated for numeric digit classification tasks. Originally created with 177 images, this dataset has undergone extensive augmentation to enhance its diversity and utility, making it an ideal resource for machine learning projects such as Sudoku digit recognition.
Dataset Composition:
Image Count: The dataset contains around 3,000 images, each representing a single numeric digit from 0 to 9.
Image Dimensions: Each image is standardized to a 28×28 pixel resolution, maintaining a consistent grayscale format.
Purpose: This dataset was developed with a specific focus on Sudoku digit classification. Notably, it includes blank images for the digit '0', reflecting the common occurrence of empty cells in Sudoku puzzles.
Augmentation Details:
To expand the original dataset from 177 images to 3,000, a variety of data augmentation techniques were applied (a minimal sketch follows the list below). These include:
Rotation: Images were rotated to simulate different orientations of printed digits.
Scaling: Variations in the size of digits were introduced to mimic real-world printing inconsistencies.
Translation: Digits were shifted within the image frame to represent slight misalignments often seen in printed text.
Noise Addition: Gaussian noise was added to simulate varying print quality and scanner imperfections.
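The following sketch illustrates these four augmentations on a single 28×28 grayscale digit image using Pillow and NumPy; it is not the exact pipeline used to build the dataset, and the file names and parameter ranges are assumptions.

import random
import numpy as np
from PIL import Image

# Apply rotation, scaling, translation, and Gaussian noise to one digit image.
img = Image.open("digit.png").convert("L")  # placeholder file name

# Rotation and scaling (white background assumed, so fill with 255)
augmented = img.rotate(random.uniform(-15, 15), resample=Image.BILINEAR, fillcolor=255)
scale = random.uniform(0.9, 1.1)
augmented = augmented.resize((int(28 * scale), int(28 * scale))).resize((28, 28))

# Translation: shift the digit a few pixels within the 28x28 frame
dx, dy = random.randint(-3, 3), random.randint(-3, 3)
augmented = augmented.transform((28, 28), Image.AFFINE, (1, 0, dx, 0, 1, dy), fillcolor=255)

# Gaussian noise to mimic print and scan imperfections
arr = np.asarray(augmented, dtype=np.float32) + np.random.normal(0, 10, (28, 28))
Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8)).save("digit_augmented.png")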
Applications:
Sudoku Digit Recognition: Given its design, this dataset is particularly well-suited for training models to recognize and classify digits in Sudoku puzzles.
Handwritten Digit Classification: Although the dataset contains printed digits, it can be adapted and utilized in combination with handwritten digit datasets for broader numeric classification tasks.
Optical Character Recognition (OCR): This dataset can also be valuable for training OCR systems, especially those aimed at processing low-resolution or small-scale printed text.
Dataset Quality:
Uniformity: All images are uniformly scaled and aligned, providing a clean and consistent dataset for model training.
Diversity: Augmentation has significantly increased the diversity of digit representation, making the dataset robust for training deep learning models.
Usage Notes:
Zero Representation: Users should note that the digit '0' is represented by a blank image.
This design choice aligns with the specific application of Sudoku puzzle solving but may require adjustments if the dataset is used for other numeric classification tasks.
Preprocessing Required: While the dataset is ready for use, additional preprocessing steps, such as normalization or further augmentation, can be applied based on the specific requirements of the intended machine learning model.
File Format:
The images are stored in a standardized format compatible with most machine learning frameworks, ensuring ease of integration into existing workflows.
Conclusion: The Printed Digits Dataset offers a rich resource for those working on digit classification projects, particularly within the context of Sudoku or other numeric-based puzzles. Its extensive augmentation and attention to application-specific details make it a valuable asset for both academic research and practical AI development.
This dataset is sourced from Kaggle.
Three datasets are available, each consisting of 15 csv files. Each file contains the voxelised shower information obtained from single particles produced at the front of the calorimeter in the |η| range (0.2-0.25) simulated in the ATLAS detector. Two datasets contain photon events with different statistics; the larger sample has about 10 times the number of events as the other. The other dataset contains pions. The pion dataset and the photon dataset with the lower statistics were used to train the corresponding two GANs presented in the AtlFast3 paper SIMU-2018-04.
The information in each file is a table; the rows correspond to the events and the columns to the voxels. The voxelisation procedure is described in the AtlFast3 paper linked above and in the dedicated PUB note ATL-SOFT-PUB-2020-006. In summary, the detailed energy deposits produced by ATLAS were converted from x,y,z coordinates to local cylindrical coordinates defined around the particle 3-momentum at the entrance of the calorimeter. The energy deposits in each layer were then grouped in voxels and for each voxel the energy was stored in the csv file. For each particle, there are 15 files corresponding to the 15 energy points used to train the GAN. The name of the csv file defines both the particle and the energy of the sample used to create the file.
The size of the voxels is described in the binning.xml file. Software tools to read the XML file and manipulate the spatial information of voxels are provided in the FastCaloGAN repository.
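As a hedged sketch of how to read one of these files (the file name below is a placeholder, and the per-voxel columns should be interpreted with the help of the binning.xml file):

import pandas as pd

# Each row is an event, each column a voxel energy deposit.
showers = pd.read_csv("photons_sample.csv")   # placeholder file name
print(showers.shape)                          # (n_events, n_voxels)
total_energy = showers.sum(axis=1)            # total deposited energy per event
print(total_energy.describe())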
Updated on February 10th 2022. A new dataset photons_samples_highStat.tgz was added to this record and the binning.xml file was updated accordingly.
Updated on April 18th 2023. A new dataset pions_samples_highStat.tgz was added to this record.
The Delta Produce Sources Study was an observational study designed to measure and compare food environments of farmers markets (n=3) and grocery stores (n=12) in 5 rural towns located in the Lower Mississippi Delta region of Mississippi. Data were collected via electronic surveys from June 2019 to March 2020 using a modified version of the Nutrition Environment Measures Survey (NEMS) Farmers Market Audit tool. The tool was modified to collect information pertaining to source of fresh produce and also for use with both farmers markets and grocery stores. Availability, source, quality, and price information were collected and compared between farmers markets and grocery stores for 13 fresh fruits and 32 fresh vegetables via SAS software programming. Because the towns were not randomly selected and the sample sizes are relatively small, the data may not be generalizable to all rural towns in the Lower Mississippi Delta region of Mississippi.
Resources in this dataset:
Resource Title: Delta Produce Sources Study dataset. File Name: DPS Data Public.csv. Resource Description: The dataset contains variables corresponding to availability, source (country, state and town if country is the United States), quality, and price (by weight or volume) of 13 fresh fruits and 32 fresh vegetables sold in farmers markets and grocery stores located in 5 Lower Mississippi Delta towns. Resource Software Recommended: Microsoft Excel, url: https://www.microsoft.com/en-us/microsoft-365/excel
Resource Title: Delta Produce Sources Study data dictionary. File Name: DPS Data Dictionary Public.csv. Resource Description: This file is the data dictionary corresponding to the Delta Produce Sources Study dataset. Resource Software Recommended: Microsoft Excel, url: https://www.microsoft.com/en-us/microsoft-365/excel
https://creativecommons.org/publicdomain/zero/1.0/
By Fraser Greenlee (From Huggingface) [source]
This dataset offers a valuable resource for various applications such as natural language processing, sentiment analysis, joke generation algorithms, or simply for entertainment purposes. Whether you're a data scientist looking to analyze humor patterns or an individual seeking some quick comedic relief, this dataset has got you covered.
By utilizing this dataset, researchers can explore different aspects of humor and study the linguistic features that make these short jokes amusing. Moreover, it provides an opportunity for developing computer models capable of generating similar humorous content based on learned patterns.
Understanding the Columns:
- text: This column contains the text of the short joke.
- text: No information is provided about this column.
Exploring the Jokes:
- Start by exploring the text column, which contains the actual jokes. You can read through them and have a good laugh!
Analyzing the Jokes:
- To gain insights from this dataset, you can perform various analyses:
- Sentiment Analysis: Use Natural Language Processing techniques to analyze the sentiment of each joke.
- Categorization: Group jokes based on common themes or subjects, such as animals, professions, etc.
- Length Distribution: Analyze and visualize the distribution of joke lengths.
Creating New Content or Applications: Since this dataset provides a large collection of short jokes, you can utilize it creatively:
- Generating Random Jokes: Develop an algorithm that generates new jokes based on patterns found in this dataset.
- Humor Classification: Build a model that predicts if a given piece of text is funny or not using machine learning techniques.
Sharing Your Findings: If you make interesting discoveries or create unique applications using this dataset, consider sharing them with others in the Kaggle community.
Please note that no information regarding dates is available in train.csv; therefore, any temporal analysis or date-based insights won't be feasible with this specific file.
- Analyzing humor patterns: This dataset can be used to analyze different types of humor and identify patterns or common elements in jokes that make them funny. Researchers and linguists can use this dataset to gain insights into the structure, wordplay, or comedic techniques used in short jokes.
- Natural language processing: With the text data available in this dataset, it can be used for training models in natural language processing (NLP) tasks such as sentiment analysis, joke generation, or understanding humor from written text. NLP researchers and developers can utilize this dataset to build and improve algorithms for detecting or generating funny content.
- Social media analysis: Short jokes are popular on social media platforms like Twitter or Reddit where users frequently share humorous content. This dataset can be valuable for analyzing the reception and impact of these jokes on social media platforms. By examining trends, engagement metrics, or user reactions to specific jokes from the dataset, marketers or social media analysts can gain insights into what type of humor resonates with different online communities. Overall, this dataset provides a rich resource for exploring various aspects related to humor analysis and NLP tasks while offering opportunities for sociocultural studies related to online comedy culture
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv
| Column name | Description |
|:------------|:-----------------------------------------------|
| text | The actual content of the short jokes. (Text) |
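A minimal sketch for loading train.csv with pandas and looking at the joke length distribution (the "text" column is the one documented above):

import pandas as pd

# Load the jokes and summarize their length distribution.
jokes = pd.read_csv("train.csv")
jokes["length"] = jokes["text"].str.len()
print(jokes["length"].describe())
print(jokes["text"].sample(3).tolist())  # a few random jokes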
If you use this dataset in your research, please credit the original author, Fraser Greenlee (from Huggingface).
Unlock the power of ready-to-use data sourced from developer communities and repositories with Developer Community and Code Datasets.
Data Sources:
GitHub: Access comprehensive data about GitHub repositories, developer profiles, contributions, issues, social interactions, and more.
StackShare: Receive information about companies, their technology stacks, reviews, tools, services, trends, and more.
DockerHub: Dive into data from container images, repositories, developer profiles, contributions, usage statistics, and more.
Developer Community and Code Datasets are a treasure trove of public data points gathered from tech communities and code repositories across the web.
With our datasets, you'll receive:
Choose from various output formats, storage options, and delivery frequencies:
Why choose our Datasets?
Fresh and accurate data: Access complete, clean, and structured data from scraping professionals, ensuring the highest quality.
Time and resource savings: Let us handle data extraction and processing cost-effectively, freeing your resources for strategic tasks.
Customized solutions: Share your unique data needs, and we'll tailor our data harvesting approach to fit your requirements perfectly.
Legal compliance: Partner with a trusted leader in ethical data collection. Oxylabs is trusted by Fortune 500 companies and adheres to GDPR and CCPA standards.
Pricing Options:
Standard Datasets: choose from various ready-to-use datasets with standardized data schemas, priced from $1,000/month.
Custom Datasets: Tailor datasets from any public web domain to your unique business needs. Contact our sales team for custom pricing.
Experience a seamless journey with Oxylabs:
Empower your data-driven decisions with Oxylabs Developer Community and Code Datasets!
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 1 row and is filtered where the book is Scenario planning in organizations : how to create, use, and assess scenarios. It features 7 columns including author, publication date, language, and book publisher.
Automatically describing images using natural sentences is an essential task for the inclusion of visually impaired people on the Internet. Although there are many datasets in the literature, most of them contain only English captions, whereas datasets with captions described in other languages are scarce.
The #PraCegoVer movement arose on the Internet, encouraging social media users to publish images, tag them with #PraCegoVer, and add a short description of their content. Inspired by this movement, we have proposed #PraCegoVer, a multi-modal dataset with Portuguese captions based on posts from Instagram. It is the first large dataset for image captioning in Portuguese with freely annotated images.
Dataset Structure
The dataset comprises the file dataset.json and a set of compressed archives (images.tar.gz parts) containing the images. The file dataset.json is a list of JSON objects with the attributes:
user: anonymized user that made the post;
filename: image file name;
raw_caption: raw caption;
caption: clean caption;
date: post date.
Each instance in dataset.json is associated with exactly one image in the images directory whose filename is pointed by the attribute filename. Also, we provide a sample with five instances, so the users can download the sample to get an overview of the dataset before downloading it completely.
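A minimal sketch for pairing each record in dataset.json with its image file, using the attributes listed above (it assumes the images have already been extracted into an images directory):

import json
from pathlib import Path

# Match each caption record to its image file via the "filename" attribute.
with open("dataset.json", encoding="utf-8") as f:
    posts = json.load(f)

for post in posts[:5]:
    image_path = Path("images") / post["filename"]
    print(image_path, "->", post["caption"][:60])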
Download Instructions
If you just want to have an overview of the dataset structure, you can download sample.tar.gz. But, if you want to use the dataset, or any of its subsets (63k and 173k), you must download all the files and run the following commands to uncompress and join the files:
cat images.tar.gz.part* > images.tar.gz
tar -xzvf images.tar.gz
Alternatively, you can download the entire dataset from the terminal using the python script download_dataset.py available in PraCegoVer repository. In this case, first, you have to download the script and create an access token here. Then, you can run the following command to download and uncompress the image files:
python download_dataset.py --access_token=<ACCESS_TOKEN>
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourages poor coding practices, and that their results can be hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub.
This repository contains two files:
The dump.tar.bz2 file contains a PostgreSQL dump of the database, with all the data we extracted from the notebooks.
The jupyter_reproducibility.tar.bz2 file contains all the scripts we used to query and download Jupyter Notebooks, extract data from them, and analyze the data. It is organized as follows:
In the remainder of this text, we give instructions for reproducing the analyses using the data provided in the dump, and for reproducing the collection by collecting data from GitHub again.
Reproducing the Analysis
This section shows how to load the data in the database and run the analyses notebooks. In the analysis, we used the following environment:
Ubuntu 18.04.1 LTS
PostgreSQL 10.6
Conda 4.5.11
Python 3.7.2
PdfCrop 2012/11/02 v1.38
First, download dump.tar.bz2 and extract it:
tar -xjf dump.tar.bz2
It extracts the file db2019-03-13.dump. Create a database in PostgreSQL (we call it "jupyter"), and use psql to restore the dump:
psql jupyter < db2019-03-13.dump
It populates the database with the dump. Now, configure the connection string for sqlalchemy by setting the environment variable JUP_DB_CONNECTION:
export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter";
Download and extract jupyter_reproducibility.tar.bz2:
tar -xjf jupyter_reproducibility.tar.bz2
Create a conda environment with Python 3.7:
conda create -n analyses python=3.7
conda activate analyses
Go to the analyses folder and install all the dependencies listed in requirements.txt:
cd jupyter_reproducibility/analyses
pip install -r requirements.txt
For reproducing the analyses, run jupyter on this folder:
jupyter notebook
Execute the notebooks in this order:
Reproducing or Expanding the Collection
The collection demands more steps to reproduce and takes much longer to run (months). It also involves running arbitrary code on your machine. Proceed with caution.
Requirements
This time, we have extra requirements:
All the analysis requirements
lbzip2 2.5
gcc 7.3.0
Github account
Gmail account
Environment
First, set the following environment variables:
export JUP_MACHINE="db"; # machine identifier
export JUP_BASE_DIR="/mnt/jupyter/github"; # place to store the repositories
export JUP_LOGS_DIR="/home/jupyter/logs"; # log files
export JUP_COMPRESSION="lbzip2"; # compression program
export JUP_VERBOSE="5"; # verbose level
export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter"; # sqlalchemy connection
export JUP_GITHUB_USERNAME="github_username"; # your github username
export JUP_GITHUB_PASSWORD="github_password"; # your github password
export JUP_MAX_SIZE="8000.0"; # maximum size of the repositories directory (in GB)
export JUP_FIRST_DATE="2013-01-01"; # initial date to query github
export JUP_EMAIL_LOGIN="gmail@gmail.com"; # your gmail address
export JUP_EMAIL_TO="target@email.com"; # email that receives notifications
export JUP_OAUTH_FILE="~/oauth2_creds.json" # oauth2 authentication file
export JUP_NOTEBOOK_INTERVAL=""; # notebook id interval for this machine. Leave it blank
export JUP_REPOSITORY_INTERVAL=""; # repository id interval for this machine. Leave it blank
export JUP_WITH_EXECUTION="1"; # execute python notebooks
export JUP_WITH_DEPENDENCY="0"; # run notebooks with and without declared dependencies
export JUP_EXECUTION_MODE="-1"; # run following the execution order
export JUP_EXECUTION_DIR="/home/jupyter/execution"; # temporary directory for running notebooks
export JUP_ANACONDA_PATH="~/anaconda3"; # conda installation path
export JUP_MOUNT_BASE="/home/jupyter/mount_ghstudy.sh"; # bash script to mount base dir
export JUP_UMOUNT_BASE="/home/jupyter/umount_ghstudy.sh"; # bash script to umount base dir
export JUP_NOTEBOOK_TIMEOUT="300"; # timeout the extraction
# Frequency of log reports
export JUP_ASTROID_FREQUENCY="5";
export JUP_IPYTHON_FREQUENCY="5";
export JUP_NOTEBOOKS_FREQUENCY="5";
export JUP_REQUIREMENT_FREQUENCY="5";
export JUP_CRAWLER_FREQUENCY="1";
export JUP_CLONE_FREQUENCY="1";
export JUP_COMPRESS_FREQUENCY="5";
export JUP_DB_IP="localhost"; # postgres database IP
Then, configure the file ~/oauth2_creds.json, according to yagmail documentation: https://media.readthedocs.org/pdf/yagmail/latest/yagmail.pdf
Configure the mount_ghstudy.sh and umount_ghstudy.sh scripts. The first one should mount the folder that stores the directories. The second one should umount it. You can leave the scripts blank, but it is not advisable, as the reproducibility study runs arbitrary code on your machine and you may lose your data.
Scripts
Download and extract jupyter_reproducibility.tar.bz2:
tar -xjf jupyter_reproducibility.tar.bz2
Install 5 conda environments and 5 anaconda environments, one pair for each Python version. In each of them, upgrade pip, install pipenv, and install the archaeology package (note that it is a local package that has not been published to PyPI; make sure to use the -e option):
Conda 2.7
conda create -n raw27 python=2.7 -y
conda activate raw27
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 2.7
conda create -n py27 python=2.7 anaconda -y
conda activate py27
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.4
It requires a manual jupyter and pathlib2 installation due to some incompatibilities found in the default installation.
conda create -n raw34 python=3.4 -y
conda activate raw34
conda install jupyter -c conda-forge -y
conda uninstall jupyter -y
pip install --upgrade pip
pip install jupyter
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
pip install pathlib2
Anaconda 3.4
conda create -n py34 python=3.4 anaconda -y
conda activate py34
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.5
conda create -n raw35 python=3.5 -y
conda activate raw35
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 3.5
It requires the manual installation of other anaconda packages.
conda create -n py35 python=3.5 anaconda -y
conda install -y appdirs atomicwrites keyring secretstorage libuuid navigator-updater prometheus_client pyasn1 pyasn1-modules spyder-kernels tqdm jeepney automat constantly anaconda-navigator
conda activate py35
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.6
conda create -n raw36 python=3.6 -y
conda activate raw36
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 3.6
conda create -n py36 python=3.6 anaconda -y
conda activate py36
conda install -y anaconda-navigator jupyterlab_server navigator-updater
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.7