Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Compilation of python codes for data preprocessing and VegeNet building, as well as image datasets (zip files).
Image datasets:
Facebook
TwitterMIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Dataset Description
Overview: This dataset contains three distinct fake datasets generated using the Faker and Mimesis libraries. These libraries are commonly used for generating realistic-looking synthetic data for testing, prototyping, and data science projects. The datasets were created to simulate real-world scenarios while ensuring no sensitive or private information is included.
Data Generation Process: The data creation process is documented in the accompanying notebook, Creating_simple_Sintetic_data.ipynb. This notebook showcases the step-by-step procedure for generating synthetic datasets with customizable structures and fields using the Faker and Mimesis libraries.
File Contents:
Datasets: CSV files containing the three synthetic datasets. Notebook: Creating_simple_Sintetic_data.ipynb detailing the data generation process and the code used to create these datasets.
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
What is Pandas?
Pandas is a Python library used for working with data sets.
It has functions for analyzing, cleaning, exploring, and manipulating data.
The name "Pandas" has a reference to both "Panel Data", and "Python Data Analysis" and was created by Wes McKinney in 2008.
Why Use Pandas?
Pandas allows us to analyze big data and make conclusions based on statistical theories.
Pandas can clean messy data sets, and make them readable and relevant.
Relevant data is very important in data science.
What Can Pandas Do?
Pandas gives you answers about the data. Like:
Is there a correlation between two or more columns?
What is average value?
Max value?
Min value?
Facebook
TwitterApache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for "instructional_code-search-net-python"
Dataset Summary
This is an instructional dataset for Python. The dataset contains two different kind of tasks:
Given a piece of code generate a description of what it does. Given a description generate a piece of code that fulfils the description.
Languages
The dataset is in English.
Data Splits
There are no splits.
Dataset Creation
May of 2023
Curation Rationale
This… See the full description on the dataset page: https://huggingface.co/datasets/Nan-Do/instructional_code-search-net-python.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Sample data set used in an introductory course on Programming in Python
Facebook
TwitterMIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Attention: This dataset is a summary and reformat pulled from github code.
You should make your own assumptions based on this. In fact, there is another dataset I formed through parsing that addresses several points:
out of 500k python related items, most of them are python-ish, not pythonic the majority of the items here contain excessive licensing inclusion of original code the items here are sometimes not even python but have references There's a whole lot of gpl summaries… See the full description on the dataset page: https://huggingface.co/datasets/jtatman/python-code-dataset-500k.
Facebook
TwitterThe EPA GitHub repository PAU4ChemAs as described in the README.md file, contains Python scripts written to build the PAU dataset modules (technologies, capital and operating costs, and chemical prices) for tracking chemical flows transfers, releases estimation, and identification of potential occupation exposure scenarios in pollution abatement units (PAUs). These PAUs are employed for on-site chemical end-of-life management. The folder datasets contains the outputs for each framework step. The Chemicals_in_categories.csv contains the chemicals for the TRI chemical categories. The EPA GitHub repository PAU_case_study as described in its readme.md entry, contains the Python scripts to run the manuscript case study for designing the PAUs, the data-driven models, and the decision-making module for chemicals of concern and tracking flow transfers at the end-of-life stage. The data was obtained by means of data engineering using different publicly-available databases. The properties of chemicals were obtained using the GitHub repository Properties_Scraper, while the PAU dataset using the repository PAU4Chem. Finally, the EPA GitHub repository Properties_Scraper contains a Python script to massively gather information about exposure limits and physical properties from different publicly-available sources: EPA, NOAA, OSHA, and the institute for Occupational Safety and Health of the German Social Accident Insurance (IFA). Also, all GitHub repositories describe the Python libraries required for running their code, how to use them, the obtained outputs files after running the Python script modules, and the corresponding EPA Disclaimer. This dataset is associated with the following publication: Hernandez-Betancur, J.D., M. Martin, and G.J. Ruiz-Mercado. A data engineering framework for on-site end-of-life industrial operations. JOURNAL OF CLEANER PRODUCTION. Elsevier Science Ltd, New York, NY, USA, 327: 129514, (2021).
Facebook
TwitterThe files in this repository can be used to generate the complete set of figures in the paper "An algorithm to identify vapor-liquid-liquid equilibria from vapor-liquid equilibria". The zip file, when expanded, includes a conda environment to populate the dependencies, and a set of python scripts. Running make_figures.py will regenerate all the figures, demonstrating how to use the algorithm.
Facebook
TwitterApache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for "reason_code-search-net-python"
Dataset Summary
This dataset is an instructional dataset for Python.The dataset contains five different kind of tasks.
Given a Python 3 function:
Type 1: Generate a summary explaining what it does. (For example: This function counts the number of objects stored in the jsonl file passed as input.) Type 2: Generate a summary explaining what its input parameters represent ("For example: infile: a file descriptor of a file… See the full description on the dataset page: https://huggingface.co/datasets/Nan-Do/reason_code-search-net-python.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset to run Example.py script of the ValparaÃso Stacking Analysis Tool (VSAT-3D). The ValparaÃso Stacking Analysis Tool (VSAT-3D) provides a series of tools for selecting, stacking, and analyzing 3D spectra. It is intended for stacking samples of datacubes extracted from interferometric datasets, belonging to large extragalactic catalogs by selecting subsamples of galaxies defined by their available properties (e.g. redshift, stellar mass, star formation rate) being possible to generate diverse (e.g. median, average, weighted average, histogram) composite spectra. However, it is possible to also use VSAT-3D on smaller datasets containing any type of astronomical object.
VSAT-3D can be downloaded from the github repository link.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about book series. It has 1 row and is filtered where the books is Building machine learning systems with Python : master the art of machine learning with Python and build effective machine learning sytems with this intensive hands-on guide. It features 10 columns including number of authors, number of books, earliest publication date, and latest publication date.
Facebook
TwitterAutomatically describing images using natural sentences is an essential task to visually impaired people's inclusion on the Internet. Although there are many datasets in the literature, most of them contain only English captions, whereas datasets with captions described in other languages are scarce.
PraCegoVer arose on the Internet, stimulating users from social media to publish images, tag #PraCegoVer and add a short description of their content. Inspired by this movement, we have proposed the #PraCegoVer, a multi-modal dataset with Portuguese captions based on posts from Instagram. It is the first large dataset for image captioning in Portuguese with freely annotated images.
Dataset Structure
containing the images. The file dataset.json comprehends a list of json objects with the attributes:
user: anonymized user that made the post;
filename: image file name;
raw_caption: raw caption;
caption: clean caption;
date: post date.
Each instance in dataset.json is associated with exactly one image in the images directory whose filename is pointed by the attribute filename. Also, we provide a sample with five instances, so the users can download the sample to get an overview of the dataset before downloading it completely.
Download Instructions
If you just want to have an overview of the dataset structure, you can download sample.tar.gz. But, if you want to use the dataset, or any of its subsets (63k and 173k), you must download all the files and run the following commands to uncompress and join the files:
cat images.tar.gz.part* > images.tar.gz tar -xzvf images.tar.gz
Alternatively, you can download the entire dataset from the terminal using the python script download_dataset.py available in PraCegoVer repository. In this case, first, you have to download the script and create an access token here. Then, you can run the following command to download and uncompress the image files:
python download_dataset.py --access_token=
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourage poor coding practices and that their results can be hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub.
This repository contains two files:
The dump.tar.bz2 file contains a PostgreSQL dump of the database, with all the data we extracted from the notebooks.
The jupyter_reproducibility.tar.bz2 file contains all the scripts we used to query and download Jupyter Notebooks, extract data from them, and analyze the data. It is organized as follows:
In the remaining of this text, we give instructions for reproducing the analyses, by using the data provided in the dump and reproducing the collection, by collecting data from GitHub again.
Reproducing the Analysis
This section shows how to load the data in the database and run the analyses notebooks. In the analysis, we used the following environment:
Ubuntu 18.04.1 LTS
PostgreSQL 10.6
Conda 4.5.11
Python 3.7.2
PdfCrop 2012/11/02 v1.38
First, download dump.tar.bz2 and extract it:
tar -xjf dump.tar.bz2
It extracts the file db2019-03-13.dump. Create a database in PostgreSQL (we call it "jupyter"), and use psql to restore the dump:
psql jupyter < db2019-03-13.dump
It populates the database with the dump. Now, configure the connection string for sqlalchemy by setting the environment variable JUP_DB_CONNECTTION:
export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter";
Download and extract jupyter_reproducibility.tar.bz2:
tar -xjf jupyter_reproducibility.tar.bz2
Create a conda environment with Python 3.7:
conda create -n analyses python=3.7
conda activate analyses
Go to the analyses folder and install all the dependencies of the requirements.txt
cd jupyter_reproducibility/analyses
pip install -r requirements.txt
For reproducing the analyses, run jupyter on this folder:
jupyter notebook
Execute the notebooks on this order:
Reproducing or Expanding the Collection
The collection demands more steps to reproduce and takes much longer to run (months). It also involves running arbitrary code on your machine. Proceed with caution.
Requirements
This time, we have extra requirements:
All the analysis requirements
lbzip2 2.5
gcc 7.3.0
Github account
Gmail account
Environment
First, set the following environment variables:
export JUP_MACHINE="db"; # machine identifier
export JUP_BASE_DIR="/mnt/jupyter/github"; # place to store the repositories
export JUP_LOGS_DIR="/home/jupyter/logs"; # log files
export JUP_COMPRESSION="lbzip2"; # compression program
export JUP_VERBOSE="5"; # verbose level
export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter"; # sqlchemy connection
export JUP_GITHUB_USERNAME="github_username"; # your github username
export JUP_GITHUB_PASSWORD="github_password"; # your github password
export JUP_MAX_SIZE="8000.0"; # maximum size of the repositories directory (in GB)
export JUP_FIRST_DATE="2013-01-01"; # initial date to query github
export JUP_EMAIL_LOGIN="gmail@gmail.com"; # your gmail address
export JUP_EMAIL_TO="target@email.com"; # email that receives notifications
export JUP_OAUTH_FILE="~/oauth2_creds.json" # oauth2 auhentication file
export JUP_NOTEBOOK_INTERVAL=""; # notebook id interval for this machine. Leave it in blank
export JUP_REPOSITORY_INTERVAL=""; # repository id interval for this machine. Leave it in blank
export JUP_WITH_EXECUTION="1"; # run execute python notebooks
export JUP_WITH_DEPENDENCY="0"; # run notebooks with and without declared dependnecies
export JUP_EXECUTION_MODE="-1"; # run following the execution order
export JUP_EXECUTION_DIR="/home/jupyter/execution"; # temporary directory for running notebooks
export JUP_ANACONDA_PATH="~/anaconda3"; # conda installation path
export JUP_MOUNT_BASE="/home/jupyter/mount_ghstudy.sh"; # bash script to mount base dir
export JUP_UMOUNT_BASE="/home/jupyter/umount_ghstudy.sh"; # bash script to umount base dir
export JUP_NOTEBOOK_TIMEOUT="300"; # timeout the extraction
# Frequenci of log report
export JUP_ASTROID_FREQUENCY="5";
export JUP_IPYTHON_FREQUENCY="5";
export JUP_NOTEBOOKS_FREQUENCY="5";
export JUP_REQUIREMENT_FREQUENCY="5";
export JUP_CRAWLER_FREQUENCY="1";
export JUP_CLONE_FREQUENCY="1";
export JUP_COMPRESS_FREQUENCY="5";
export JUP_DB_IP="localhost"; # postgres database IP
Then, configure the file ~/oauth2_creds.json, according to yagmail documentation: https://media.readthedocs.org/pdf/yagmail/latest/yagmail.pdf
Configure the mount_ghstudy.sh and umount_ghstudy.sh scripts. The first one should mount the folder that stores the directories. The second one should umount it. You can leave the scripts in blank, but it is not advisable, as the reproducibility study runs arbitrary code on your machine and you may lose your data.
Scripts
Download and extract jupyter_reproducibility.tar.bz2:
tar -xjf jupyter_reproducibility.tar.bz2
Install 5 conda environments and 5 anaconda environments, for each python version. In each of them, upgrade pip, install pipenv, and install the archaeology package (Note that it is a local package that has not been published to pypi. Make sure to use the -e option):
Conda 2.7
conda create -n raw27 python=2.7 -y
conda activate raw27
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 2.7
conda create -n py27 python=2.7 anaconda -y
conda activate py27
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.4
It requires a manual jupyter and pathlib2 installation due to some incompatibilities found on the default installation.
conda create -n raw34 python=3.4 -y
conda activate raw34
conda install jupyter -c conda-forge -y
conda uninstall jupyter -y
pip install --upgrade pip
pip install jupyter
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
pip install pathlib2
Anaconda 3.4
conda create -n py34 python=3.4 anaconda -y
conda activate py34
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.5
conda create -n raw35 python=3.5 -y
conda activate raw35
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 3.5
It requires the manual installation of other anaconda packages.
conda create -n py35 python=3.5 anaconda -y
conda install -y appdirs atomicwrites keyring secretstorage libuuid navigator-updater prometheus_client pyasn1 pyasn1-modules spyder-kernels tqdm jeepney automat constantly anaconda-navigator
conda activate py35
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.6
conda create -n raw36 python=3.6 -y
conda activate raw36
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 3.6
conda create -n py36 python=3.6 anaconda -y
conda activate py36
conda install -y anaconda-navigator jupyterlab_server navigator-updater
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.7
<code
Facebook
TwitterCC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This repository contains the dataset for the study of computational reproducibility of Jupyter notebooks from biomedical publications. Our focus lies in evaluating the extent of reproducibility of Jupyter notebooks derived from GitHub repositories linked to publications present in the biomedical literature repository, PubMed Central. We analyzed the reproducibility of Jupyter notebooks from GitHub repositories associated with publications indexed in the biomedical literature repository PubMed Central. The dataset includes the metadata information of the journals, publications, the Github repositories mentioned in the publications and the notebooks present in the Github repositories.
Data Collection and Analysis
We use the code for reproducibility of Jupyter notebooks from the study done by Pimentel et al., 2019 and adapted the code from ReproduceMeGit. We provide code for collecting the publication metadata from PubMed Central using NCBI Entrez utilities via Biopython.
Our approach involves searching PMC using the esearch function for Jupyter notebooks using the query: ``(ipynb OR jupyter OR ipython) AND github''. We meticulously retrieve data in XML format, capturing essential details about journals and articles. By systematically scanning the entire article, encompassing the abstract, body, data availability statement, and supplementary materials, we extract GitHub links. Additionally, we mine repositories for key information such as dependency declarations found in files like requirements.txt, setup.py, and pipfile. Leveraging the GitHub API, we enrich our data by incorporating repository creation dates, update histories, pushes, and programming languages.
All the extracted information is stored in a SQLite database. After collecting and creating the database tables, we ran a pipeline to collect the Jupyter notebooks contained in the GitHub repositories based on the code from Pimentel et al., 2019.
Our reproducibility pipeline was started on 27 March 2023.
Repository Structure
Our repository is organized into two main folders:
Accessing Data and Resources:
System Requirements:
Running the pipeline:
Running the analysis:
References:
Facebook
TwitterThis dataset contains 55,000 entries of synthetic customer transactions, generated using Python's Faker library. The goal behind creating this dataset was to provide a resource for learners like myself to explore, analyze, and apply various data analysis techniques in a context that closely mimics real-world data.
About the Dataset: - CID (Customer ID): A unique identifier for each customer. - TID (Transaction ID): A unique identifier for each transaction. - Gender: The gender of the customer, categorized as Male or Female. - Age Group: Age group of the customer, divided into several ranges. - Purchase Date: The timestamp of when the transaction took place. - Product Category: The category of the product purchased, such as Electronics, Apparel, etc. - Discount Availed: Indicates whether the customer availed any discount (Yes/No). - Discount Name: Name of the discount applied (e.g., FESTIVE50). - Discount Amount (INR): The amount of discount availed by the customer. - Gross Amount: The total amount before applying any discount. - Net Amount: The final amount after applying the discount. - Purchase Method: The payment method used (e.g., Credit Card, Debit Card, etc.). - Location: The city where the purchase took place.
Use Cases: 1. Exploratory Data Analysis (EDA): This dataset is ideal for conducting EDA, allowing users to practice techniques such as summary statistics, visualizations, and identifying patterns within the data. 2. Data Preprocessing and Cleaning: Learners can work on handling missing data, encoding categorical variables, and normalizing numerical values to prepare the dataset for analysis. 3. Data Visualization: Use tools like Python’s Matplotlib, Seaborn, or Power BI to visualize purchasing trends, customer demographics, or the impact of discounts on purchase amounts. 4. Machine Learning Applications: After applying feature engineering, this dataset is suitable for supervised learning models, such as predicting whether a customer will avail a discount or forecasting purchase amounts based on the input features.
This dataset provides an excellent sandbox for honing skills in data analysis, machine learning, and visualization in a structured but flexible manner.
This is not a real dataset. This dataset was generated using Python's Faker library for the sole purpose of learning
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Upload From Python is a dataset for object detection tasks - it contains Cars WyfR annotations for 3,002 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
Facebook
TwitterCL-Splats Dataset
This repository contains the data to reproduce our results from the CL-Splats paper. In addition, for future research we provide the Blender files used to create our synthetic scenes with a blender python script creating the camera trajectories.
Usage
The files under Blender-Levels and Real-World are compatible with the 3DGS data-loading. We will create a copy of this dataset that matches our data-loading script once available. Note that the COLMAP… See the full description on the dataset page: https://huggingface.co/datasets/ackermannj/cl-splats-dataset.
Facebook
TwitterPythonic Function Calling Dataset
This dataset contains synthetic data used for training Pythonic function calling models Dria-Agent-a-3B and Dria-Agent-a-7B. Dria is a python framework to generate synthetic data on globally connected edge devices with 50+ models. See the network here
Dataset Summary
The dataset includes various examples of function calling scenarios, ranging from simple to complex multi-turn interactions. It was generated synthetically using the… See the full description on the dataset page: https://huggingface.co/datasets/driaforall/pythonic-function-calling.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Cadaster data from PDOK used to illustrate the use of geopandas and shapely, geospatial python packages for manipulating vector data. The brpgewaspercelen_definitief_2020.gpkg file has been subsetted in order to make the download manageable for workshops. Other datasets are copies of those available from PDOK.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Python Projesi is a dataset for object detection tasks - it contains Objects annotations for 506 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Compilation of python codes for data preprocessing and VegeNet building, as well as image datasets (zip files).
Image datasets: