61 datasets found
  1. Exploratory Topic Modelling in Python Dataset - EHRI-3

    • zenodo.org
    • data.niaid.nih.gov
    bin
    Updated Jun 20, 2022
    Cite
    Maria Dermentzi; Maria Dermentzi (2022). Exploratory Topic Modelling in Python Dataset - EHRI-3 [Dataset]. http://doi.org/10.5281/zenodo.6670234
    Explore at:
    Available download formats: bin
    Dataset updated
    Jun 20, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Maria Dermentzi; Maria Dermentzi
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In the EHRI-3 project, we are investigating tools and methods that historical researchers and scholars can use to better understand, visualise, and interpret the material held by our partner archives. This dataset accompanies a tutorial exploring a technique called topic modelling in the context of a Holocaust-related historical collection.

    We were on the lookout for datasets that would be easily accessible and, for convenience, predominantly in English. One such dataset was the United States Holocaust Memorial Museum’s (USHMM) extensive collection of oral history testimonies, for which there are a considerable number of textual transcripts. The museum’s total collection consists of over 80,703 testimonies, 41,695 of which are available in English, with 2,894 of them listing a transcript.

    Since there is not yet a ready-to-download dataset that includes these transcripts, we had to construct our own. Using a web scraping tool, we managed to create a list of the links pointing to the metadata (including transcripts) of the testimonies that were of interest to us. After obtaining the transcript and other metadata of each of these testimonies, we were able to create our dataset and curate it to remove any unwanted entries. For example, we made sure to remove entries with restrictions on access or use. We also removed entries with transcripts that consisted only of some automatically generated headers and entries which turned out to be in languages other than English. The remaining 1,873 transcripts form the corpus of this tutorial — a small, but still decently sized dataset.

    The process that we followed to put together this dataset is detailed in the Jupyter Notebook accompanying this post, which can be found in this Github repository.

    In this Zenodo upload, the user can find two files, each of them containing a pickled pandas DataFrame that was obtained at a different stage of the tutorial:

    "unrestricted_df.pkl" contains 1,946 entries of Oral Testimony transcripts and has five fields (RG_number, text, display_date, conditions_access, conditions_use)
    "unrestricted_lemmatized_df.pkl" contains 1,873 entries of Oral Testimony transcripts and has six fields (RG_number, text, display_date, conditions_access, conditions_use, lemmas)

    Instructions on their intended use can be found in the accompanying Jupyter Notebook.
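
    As a quick-start sketch (not part of the upload itself), the two pickled DataFrames can be loaded with pandas; note that unpickling requires a pandas version compatible with the one used to create the files, and pickle files should only be loaded from sources you trust:

    import pandas as pd

    # Load the two pickled DataFrames (paths assume the files are in the working directory).
    unrestricted_df = pd.read_pickle("unrestricted_df.pkl")
    lemmatized_df = pd.read_pickle("unrestricted_lemmatized_df.pkl")

    print(unrestricted_df.shape)           # expected: (1946, 5)
    print(lemmatized_df.columns.tolist())  # RG_number, text, display_date, conditions_access, conditions_use, lemmas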

    Credits:

    The transcripts that form the corpus in this tutorial were obtained through the United States Holocaust Memorial Museum (USHMM).

  2. Meta Kaggle Code

    • kaggle.com
    zip
    Updated Jun 5, 2025
    Cite
    Kaggle (2025). Meta Kaggle Code [Dataset]. https://www.kaggle.com/datasets/kaggle/meta-kaggle-code/code
    Explore at:
    Available download formats: zip (143722388562 bytes)
    Dataset updated
    Jun 5, 2025
    Dataset authored and provided by
    Kaggle (http://kaggle.com/)
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Explore our public notebook content!

    Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0 licensed Python and R notebook versions on Kaggle used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.

    Why we’re releasing this dataset

    By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.

    Meta Kaggle for Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.

    The best part is Meta Kaggle enriches Meta Kaggle for Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code’s author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!

    Sensitive data

    While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.

    Joining with Meta Kaggle

    The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.
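
    As an illustrative sketch (not an official recipe), the code files can be joined back to the Meta Kaggle KernelVersions table with pandas; the local paths below are placeholders, and the Id extraction assumes each file is named after its KernelVersions id:

    import pandas as pd
    from pathlib import Path

    # Placeholder paths: point these at your local copies of Meta Kaggle and Meta Kaggle Code.
    kernel_versions = pd.read_csv("meta-kaggle/KernelVersions.csv")

    # Collect code file paths; each file is named after a KernelVersions id (e.g. 123456789.py).
    code_files = pd.DataFrame(
        {"path": [str(p) for p in Path("meta-kaggle-code").rglob("*.*") if p.is_file()]}
    )
    code_files["Id"] = pd.to_numeric(
        code_files["path"].str.extract(r"(\d+)\.\w+$", expand=False), errors="coerce"
    ).astype("Int64")

    # Join the file paths onto the KernelVersions metadata.
    joined = kernel_versions.merge(code_files, on="Id", how="inner")
    print(joined.head())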

    File organization

    The files are organized into a two-level directory structure. Each top level folder contains up to 1 million files, e.g. - folder 123 contains all versions from 123,000,000 to 123,999,999. Each sub folder contains up to 1 thousand files, e.g. - 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.
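
    A small illustrative helper (an assumption based on the folder description above, not an official utility) maps a KernelVersions id to its expected folder; the file extension depends on the notebook's language:

    def kernel_version_dir(version_id: int) -> str:
        """Return the two-level folder for a KernelVersions id, e.g. 123456789 -> '123/456'."""
        top = version_id // 1_000_000        # top-level folder: groups of 1 million ids
        sub = (version_id // 1_000) % 1_000  # sub-folder: groups of 1 thousand ids
        return f"{top}/{sub}"

    # 123456789 lives under 123/456/ (as 123456789.py or 123456789.ipynb, depending on language)
    print(kernel_version_dir(123_456_789))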

    The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays

    Questions / Comments

    We love feedback! Let us know in the Discussion tab.

    Happy Kaggling!

  3. notional-python

    • huggingface.co
    Updated Dec 24, 2021
    Cite
    Notional Project (2021). notional-python [Dataset]. https://huggingface.co/datasets/notional/notional-python
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 24, 2021
    Authors
    Notional Project
    License

    https://choosealicense.com/licenses/unknown/

    Description

    Dataset Card for notional-python

      Dataset Summary
    

    The Notional-python dataset contains Python code files from 100 well-known repositories gathered from the Google BigQuery GitHub dataset. The dataset was created to test the ability of programming language models. Follow our repo to do the model evaluation using the notional-python dataset.
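
    As a minimal sketch (the split and field names are assumptions; check the dataset page for the exact configuration), the dataset can be pulled with the Hugging Face datasets library:

    from datasets import load_dataset

    # Load notional/notional-python from the Hugging Face Hub.
    ds = load_dataset("notional/notional-python")
    print(ds)              # shows the available splits and features
    print(ds["train"][0])  # assumes a "train" split; adjust if the split names differ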

      Languages
    

    Python

      Dataset Creation

      Curation Rationale
    

    Notional-python was built to provide a dataset for… See the full description on the dataset page: https://huggingface.co/datasets/notional/notional-python.

  4. Python scripts used to generate the figures in "An algorithm to identify...

    • datasets.ai
    • data.nist.gov
    • +1more
    0, 47, 57
    Updated Aug 8, 2024
    Cite
    National Institute of Standards and Technology (2024). Python scripts used to generate the figures in "An algorithm to identify vapor-liquid-liquid equilibria of binary mixtures from vapor-liquid equilibria" [Dataset]. https://datasets.ai/datasets/python-scripts-used-to-generate-the-figures-in-an-algorithm-to-identify-vapor-liquid-liqui
    Explore at:
    Available download formats: 57, 47, 0
    Dataset updated
    Aug 8, 2024
    Dataset authored and provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    The files in this repository can be used to generate the complete set of figures in the paper "An algorithm to identify vapor-liquid-liquid equilibria from vapor-liquid equilibria". The zip file, when expanded, includes a conda environment to populate the dependencies, and a set of python scripts. Running make_figures.py will regenerate all the figures, demonstrating how to use the algorithm.

  5. Dataset of book series where books equals Python, beginner's guide to...

    • workwithdata.com
    Updated Aug 26, 2024
    Cite
    Work With Data (2024). Dataset of book series where books equals Python, beginner's guide to artificial intelligence : build applications to intelligently interact with the world around you using Python [Dataset]. https://www.workwithdata.com/datasets/book-series?f=1&fcol0=j0-book&fop0=%3D&fval0=Python%2C+beginner%27s+guide+to+artificial+intelligence+:+build+applications+to+intelligently+interact+with+the+world+around+you+using+Python&j=1&j0=books
    Explore at:
    Dataset updated
    Aug 26, 2024
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about book series. It has 1 row and is filtered where the book is Python, beginner's guide to artificial intelligence : build applications to intelligently interact with the world around you using Python. It features 10 columns including book series, number of authors, number of books, earliest publication date, and latest publication date. The preview is ordered by number of books (descending).

  6. Dataset metadata of known Dataverse installations

    • search.dataone.org
    • dataverse.harvard.edu
    • +1more
    Updated Nov 22, 2023
    Cite
    Gautier, Julian (2023). Dataset metadata of known Dataverse installations [Dataset]. http://doi.org/10.7910/DVN/DCDKZQ
    Explore at:
    Dataset updated
    Nov 22, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Gautier, Julian
    Description

    This dataset contains the metadata of the datasets published in 77 Dataverse installations, information about each installation's metadata blocks, and the list of standard licenses that dataset depositors can apply to the datasets they publish in the 36 installations running more recent versions of the Dataverse software. The data is useful for reporting on the quality of dataset and file-level metadata within and across Dataverse installations. Curators and other researchers can use this dataset to explore how well Dataverse software and the repositories using the software help depositors describe data.

    How the metadata was downloaded

    The dataset metadata and metadata block JSON files were downloaded from each installation on October 2 and October 3, 2022 using a Python script kept in a GitHub repo at https://github.com/jggautier/dataverse-scripts/blob/main/other_scripts/get_dataset_metadata_of_all_installations.py. In order to get the metadata from installations that require an installation account API token to use certain Dataverse software APIs, I created a CSV file with two columns: one column named "hostname" listing each installation URL in which I was able to create an account and another named "apikey" listing my accounts' API tokens. The Python script expects and uses the API tokens in this CSV file to get metadata and other information from installations that require API tokens.

    How the files are organized

    ├── csv_files_with_metadata_from_most_known_dataverse_installations
    │   ├── author(citation).csv
    │   ├── basic.csv
    │   ├── contributor(citation).csv
    │   ├── ...
    │   └── topic_classification(citation).csv
    ├── dataverse_json_metadata_from_each_known_dataverse_installation
    │   ├── Abacus_2022.10.02_17.11.19.zip
    │   │   ├── dataset_pids_Abacus_2022.10.02_17.11.19.csv
    │   │   ├── Dataverse_JSON_metadata_2022.10.02_17.11.19
    │   │   │   ├── hdl_11272.1_AB2_0AQZNT_v1.0.json
    │   │   │   └── ...
    │   │   └── metadatablocks_v5.6
    │   │       ├── astrophysics_v5.6.json
    │   │       ├── biomedical_v5.6.json
    │   │       ├── citation_v5.6.json
    │   │       ├── ...
    │   │       └── socialscience_v5.6.json
    │   ├── ACSS_Dataverse_2022.10.02_17.26.19.zip
    │   ├── ADA_Dataverse_2022.10.02_17.26.57.zip
    │   ├── Arca_Dados_2022.10.02_17.44.35.zip
    │   ├── ...
    │   └── World_Agroforestry_-_Research_Data_Repository_2022.10.02_22.59.36.zip
    ├── dataset_pids_from_most_known_dataverse_installations.csv
    ├── licenses_used_by_dataverse_installations.csv
    └── metadatablocks_from_most_known_dataverse_installations.csv

    This dataset contains two directories and three CSV files not in a directory.

    One directory, "csv_files_with_metadata_from_most_known_dataverse_installations", contains 18 CSV files that contain the values from common metadata fields of all 77 Dataverse installations. For example, author(citation)_2022.10.02-2022.10.03.csv contains the "Author" metadata for all published, non-deaccessioned, versions of all datasets in the 77 installations, where there's a row for each author name, affiliation, identifier type and identifier.

    The other directory, "dataverse_json_metadata_from_each_known_dataverse_installation", contains 77 zipped files, one for each of the 77 Dataverse installations whose dataset metadata I was able to download using Dataverse APIs. Each zip file contains a CSV file and two sub-directories:

    The CSV file contains the persistent IDs and URLs of each published dataset in the Dataverse installation as well as a column to indicate whether or not the Python script was able to download the Dataverse JSON metadata for each dataset. For Dataverse installations using Dataverse software versions whose Search APIs include each dataset's owning Dataverse collection name and alias, the CSV files also include which Dataverse collection (within the installation) that dataset was published in.

    One sub-directory contains a JSON file for each of the installation's published, non-deaccessioned dataset versions. The JSON files contain the metadata in the "Dataverse JSON" metadata schema.

    The other sub-directory contains information about the metadata models (the "metadata blocks" in JSON files) that the installation was using when the dataset metadata was downloaded. I saved them so that they can be used when extracting metadata from the Dataverse JSON files.

    The dataset_pids_from_most_known_dataverse_installations.csv file contains the dataset PIDs of all published datasets in the 77 Dataverse installations, with a column to indicate if the Python script was able to download the dataset's metadata. It's a union of all of the "dataset_pids_..." files in each of the 77 zip files.

    The licenses_used_by_dataverse_installations.csv file contains information about the licenses that a number of the installations let depositors choose when creating datasets. When I collected ...

    Visit https://dataone.org/datasets/sha256%3Ad27d528dae8cf01e3ea915f450426c38fd6320e8c11d3e901c43580f997a3146 for complete metadata about this dataset.
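
    For reference, a minimal sketch of the two-column API-token CSV described above; the column names come from the description, while the file name, hostname, and token values are placeholders:

    import csv

    # Placeholder rows: replace with real installation URLs and your own API tokens.
    rows = [
        {"hostname": "https://dataverse.example.edu", "apikey": "00000000-0000-0000-0000-000000000000"},
    ]

    with open("installation_api_tokens.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["hostname", "apikey"])
        writer.writeheader()
        writer.writerows(rows)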

  7. #PraCegoVer dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 19, 2023
    Cite
    Sandra Avila (2023). #PraCegoVer dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5710561
    Explore at:
    Dataset updated
    Jan 19, 2023
    Dataset provided by
    Gabriel Oliveira dos Santos
    Esther Luna Colombini
    Sandra Avila
    Description

    Automatically describing images using natural sentences is an essential task to visually impaired people's inclusion on the Internet. Although there are many datasets in the literature, most of them contain only English captions, whereas datasets with captions described in other languages are scarce.

    PraCegoVer arose on the Internet, stimulating users from social media to publish images, tag #PraCegoVer and add a short description of their content. Inspired by this movement, we have proposed the #PraCegoVer, a multi-modal dataset with Portuguese captions based on posts from Instagram. It is the first large dataset for image captioning in Portuguese with freely annotated images.

    PraCegoVer has 533,523 pairs with images and captions described in Portuguese collected from more than 14 thousand different profiles. Also, the average caption length in #PraCegoVer is 39.3 words and the standard deviation is 29.7.

    Dataset Structure

    The #PraCegoVer dataset is composed of the main file dataset.json and a collection of compressed files named images.tar.gz.partX containing the images. The file dataset.json comprises a list of JSON objects with the attributes:

    user: anonymized user that made the post;

    filename: image file name;

    raw_caption: raw caption;

    caption: clean caption;

    date: post date.

    Each instance in dataset.json is associated with exactly one image in the images directory whose filename is pointed by the attribute filename. Also, we provide a sample with five instances, so the users can download the sample to get an overview of the dataset before downloading it completely.
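
    A minimal sketch of reading dataset.json and pairing entries with their image files (it assumes the attribute names listed above and that the archives were extracted into an images/ directory):

    import json

    with open("dataset.json", encoding="utf-8") as f:
        entries = json.load(f)

    for entry in entries[:5]:
        image_path = f"images/{entry['filename']}"  # directory name assumed from the extraction step
        print(entry["user"], entry["date"], image_path)
        print(entry["caption"][:80])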

    Download Instructions

    If you just want to have an overview of the dataset structure, you can download sample.tar.gz. But, if you want to use the dataset, or any of its subsets (63k and 173k), you must download all the files and run the following commands to uncompress and join the files:

    cat images.tar.gz.part* > images.tar.gz
    tar -xzvf images.tar.gz

    Alternatively, you can download the entire dataset from the terminal using the python script download_dataset.py available in PraCegoVer repository. In this case, first, you have to download the script and create an access token here. Then, you can run the following command to download and uncompress the image files:

    python download_dataset.py --access_token=

  8. PIPr: A Dataset of Public Infrastructure as Code Programs

    • data.niaid.nih.gov
    • zenodo.org
    Updated Nov 28, 2023
    Cite
    Salvaneschi, Guido (2023). PIPr: A Dataset of Public Infrastructure as Code Programs [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8262770
    Explore at:
    Dataset updated
    Nov 28, 2023
    Dataset provided by
    Spielmann, David
    Salvaneschi, Guido
    Sokolowski, Daniel
    License

    Open Data Commons Attribution License (ODC-By) v1.0, https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    Programming Languages Infrastructure as Code (PL-IaC) enables IaC programs written in general-purpose programming languages like Python and TypeScript. The currently available PL-IaC solutions are Pulumi and the Cloud Development Kits (CDKs) of Amazon Web Services (AWS) and Terraform. This dataset provides metadata and initial analyses of all public GitHub repositories in August 2022 with an IaC program, including their programming languages, applied testing techniques, and licenses. Further, we provide a shallow copy of the head state of those 7104 repositories whose licenses permit redistribution. The dataset is available under the Open Data Commons Attribution License (ODC-By) v1.0. Contents:

    metadata.zip: The dataset metadata and analysis results as CSV files.
    scripts-and-logs.zip: Scripts and logs of the dataset creation.
    LICENSE: The Open Data Commons Attribution License (ODC-By) v1.0 text.
    README.md: This document.
    redistributable-repositiories.zip: Shallow copies of the head state of all redistributable repositories with an IaC program.

    This artifact is part of the ProTI Infrastructure as Code testing project: https://proti-iac.github.io.

    Metadata

    The dataset's metadata comprises three tabular CSV files containing metadata about all analyzed repositories, IaC programs, and testing source code files.

    repositories.csv:

    ID (integer): GitHub repository ID
    url (string): GitHub repository URL
    downloaded (boolean): Whether cloning the repository succeeded
    name (string): Repository name
    description (string): Repository description
    licenses (string, list of strings): Repository licenses
    redistributable (boolean): Whether the repository's licenses permit redistribution
    created (string, date & time): Time of the repository's creation
    updated (string, date & time): Time of the last update to the repository
    pushed (string, date & time): Time of the last push to the repository
    fork (boolean): Whether the repository is a fork
    forks (integer): Number of forks
    archive (boolean): Whether the repository is archived
    programs (string, list of strings): Project file path of each IaC program in the repository

    programs.csv:

    ID (string): Project file path of the IaC program
    repository (integer): GitHub repository ID of the repository containing the IaC program
    directory (string): Path of the directory containing the IaC program's project file
    solution (string, enum): PL-IaC solution of the IaC program ("AWS CDK", "CDKTF", "Pulumi")
    language (string, enum): Programming language of the IaC program (enum values: "csharp", "go", "haskell", "java", "javascript", "python", "typescript", "yaml")
    name (string): IaC program name
    description (string): IaC program description
    runtime (string): Runtime string of the IaC program
    testing (string, list of enum): Testing techniques of the IaC program (enum values: "awscdk", "awscdk_assert", "awscdk_snapshot", "cdktf", "cdktf_snapshot", "cdktf_tf", "pulumi_crossguard", "pulumi_integration", "pulumi_unit", "pulumi_unit_mocking")
    tests (string, list of strings): File paths of IaC program's tests

    testing-files.csv:

    file (string): Testing file path
    language (string, enum): Programming language of the testing file (enum values: "csharp", "go", "java", "javascript", "python", "typescript")
    techniques (string, list of enum): Testing techniques used in the testing file (enum values: "awscdk", "awscdk_assert", "awscdk_snapshot", "cdktf", "cdktf_snapshot", "cdktf_tf", "pulumi_crossguard", "pulumi_integration", "pulumi_unit", "pulumi_unit_mocking")
    keywords (string, list of enum): Keywords found in the testing file (enum values: "/go/auto", "/testing/integration", "@AfterAll", "@BeforeAll", "@Test", "@aws-cdk", "@aws-cdk/assert", "@pulumi.runtime.test", "@pulumi/", "@pulumi/policy", "@pulumi/pulumi/automation", "Amazon.CDK", "Amazon.CDK.Assertions", "Assertions_", "HashiCorp.Cdktf", "IMocks", "Moq", "NUnit", "PolicyPack(", "ProgramTest", "Pulumi", "Pulumi.Automation", "PulumiTest", "ResourceValidationArgs", "ResourceValidationPolicy", "SnapshotTest()", "StackValidationPolicy", "Testing", "Testing_ToBeValidTerraform(", "ToBeValidTerraform(", "Verifier.Verify(", "WithMocks(", "[Fact]", "[TestClass]", "[TestFixture]", "[TestMethod]", "[Test]", "afterAll(", "assertions", "automation", "aws-cdk-lib", "aws-cdk-lib/assert", "aws_cdk", "aws_cdk.assertions", "awscdk", "beforeAll(", "cdktf", "com.pulumi", "def test_", "describe(", "github.com/aws/aws-cdk-go/awscdk", "github.com/hashicorp/terraform-cdk-go/cdktf", "github.com/pulumi/pulumi", "integration", "junit", "pulumi", "pulumi.runtime.setMocks(", "pulumi.runtime.set_mocks(", "pulumi_policy", "pytest", "setMocks(", "set_mocks(", "snapshot", "software.amazon.awscdk.assertions", "stretchr", "test(", "testing", "toBeValidTerraform(", "toMatchInlineSnapshot(", "toMatchSnapshot(", "to_be_valid_terraform(", "unittest", "withMocks(")
    program (string): Project file path of the testing file's IaC program

    Dataset Creation

    scripts-and-logs.zip contains all scripts and logs of the creation of this dataset. In it, executions/executions.log documents the commands that generated this dataset in detail. On a high level, the dataset was created as follows:

    1. A list of all repositories with a PL-IaC program configuration file was created using search-repositories.py (documented below). The execution took two weeks due to the non-deterministic nature of GitHub's REST API, causing excessive retries.
    2. A shallow copy of the head of all repositories was downloaded using download-repositories.py (documented below).
    3. Using analysis.ipynb, the repositories were analyzed for the programs' metadata, including the used programming languages and licenses.
    4. Based on the analysis, all repositories with at least one IaC program and a redistributable license were packaged into redistributable-repositiories.zip, excluding any node_modules and .git directories.

    Searching Repositories

    The repositories are searched through search-repositories.py and saved in a CSV file. The script takes these arguments in the following order:

    1. Github access token.
    2. Name of the CSV output file.
    3. Filename to search for.
    4. File extensions to search for, separated by commas.
    5. Min file size for the search (for all files: 0).
    6. Max file size for the search or * for unlimited (for all files: *).

    Pulumi projects have a Pulumi.yaml or Pulumi.yml (case-sensitive file name) file in their root folder, i.e., (3) is Pulumi and (4) is yml,yaml. https://www.pulumi.com/docs/intro/concepts/project/
    AWS CDK projects have a cdk.json (case-sensitive file name) file in their root folder, i.e., (3) is cdk and (4) is json. https://docs.aws.amazon.com/cdk/v2/guide/cli.html
    CDK for Terraform (CDKTF) projects have a cdktf.json (case-sensitive file name) file in their root folder, i.e., (3) is cdktf and (4) is json. https://www.terraform.io/cdktf/create-and-deploy/project-setup

    Limitations

    The script uses the GitHub code search API and inherits its limitations:

    Only forks with more stars than the parent repository are included.
    Only the repositories' default branches are considered.
    Only files smaller than 384 KB are searchable.
    Only repositories with fewer than 500,000 files are considered.
    Only repositories that have had activity or have been returned in search results in the last year are considered.
    More details: https://docs.github.com/en/search-github/searching-on-github/searching-code

    The results of the GitHub code search API are not stable. However, the generally more robust GraphQL API does not support searching for files in repositories: https://stackoverflow.com/questions/45382069/search-for-code-in-github-using-graphql-v4-api

    Downloading Repositories

    download-repositories.py downloads all repositories in CSV files generated through search-repositories.py and generates an overview CSV file of the downloads. The script takes these arguments in the following order:

    1. Name of the repositories CSV files generated through search-repositories.py, separated by commas.
    2. Output directory to download the repositories to.
    3. Name of the CSV output file.

    The script only downloads a shallow recursive copy of the HEAD of the repo, i.e., only the main branch's most recent state, including submodules, without the rest of the git history. Each repository is downloaded to a subfolder named by the repository's ID.
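
    As an illustrative sketch (assuming metadata.zip has been extracted into the working directory; the boolean handling below tolerates columns serialized either as booleans or as "True"/"False" strings), the repository table can be filtered with pandas:

    import pandas as pd

    # Load the repository metadata and keep redistributable repositories that contain IaC programs.
    repos = pd.read_csv("repositories.csv")
    redistributable = repos["redistributable"].astype(str).str.lower().eq("true")
    mask = redistributable & repos["programs"].notna()
    print(len(repos[mask]), "redistributable repositories with at least one IaC program")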

  9. 1300q_ATP

    • huggingface.co
    Updated Jan 17, 2025
    Cite
    Andrea Elliott (2025). 1300q_ATP [Dataset]. https://huggingface.co/datasets/AnnieEl/1300q_ATP
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 17, 2025
    Authors
    Andrea Elliott
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    I tried to create this by building a dataset in Python using dataset = Dataset.from_pandas(df). My goal is then to see if I can load it into an AutoTrain model.
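
    For context, a minimal sketch of that workflow; the DataFrame columns here are placeholders rather than the actual fields of 1300q_ATP, and the push_to_hub call is optional:

    import pandas as pd
    from datasets import Dataset

    # Placeholder DataFrame standing in for the one used to build the dataset.
    df = pd.DataFrame({"question": ["What does ATP stand for?"],
                       "answer": ["Adenosine triphosphate."]})

    ds = Dataset.from_pandas(df)
    print(ds)

    # Optionally push it to the Hub so AutoTrain (or anything else) can load it later:
    # ds.push_to_hub("AnnieEl/1300q_ATP")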

  10. Data from: Ecosystem-Level Determinants of Sustained Activity in Open-Source...

    • zenodo.org
    application/gzip, bin +2
    Updated Aug 2, 2024
    Cite
    Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb (2024). Ecosystem-Level Determinants of Sustained Activity in Open-Source Projects: A Case Study of the PyPI Ecosystem [Dataset]. http://doi.org/10.5281/zenodo.1419788
    Explore at:
    Available download formats: bin, application/gzip, zip, text/x-python
    Dataset updated
    Aug 2, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb
    License

    https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html

    Description
    Replication pack, FSE2018 submission #164:
    ------------------------------------------
    
    **Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: 
    A Case Study of the PyPI Ecosystem
    
    **Note:** link to data artifacts is already included in the paper. 
    Link to the code will be included in the Camera Ready version as well.
    
    
    Content description
    ===================
    
    - **ghd-0.1.0.zip** - the code archive. This code produces the dataset files 
     described below
    - **settings.py** - settings template for the code archive.
    - **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset.
     This dataset only includes stats aggregated by the ecosystem (PyPI)
    - **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level
     statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages
     themselves, which take around 2TB.
    - **build_model.r, helpers.r** - R files to process the survival data 
      (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, 
      `common.cache/survival_data.pypi_2008_2017-12_6.csv` in 
      **dataset_full_Jan_2018.tgz**)
    - **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
    - LICENSE - text of GPL v3, under which this dataset is published
    - INSTALL.md - replication guide (~2 pages)
    Replication guide
    =================
    
    Step 0 - prerequisites
    ----------------------
    
    - Unix-compatible OS (Linux or OS X)
    - Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
    - R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)
    
    Depending on detalization level (see Step 2 for more details):
    - up to 2Tb of disk space (see Step 2 detalization levels)
    - at least 16Gb of RAM (64 preferable)
    - a few hours to a few months of processing time
    
    Step 1 - software
    ----------------
    
    - unpack **ghd-0.1.0.zip**, or clone from gitlab:
    
       git clone https://gitlab.com/user2589/ghd.git
       git checkout 0.1.0
     
     `cd` into the extracted folder. 
     All commands below assume it as a current directory.
      
    - copy `settings.py` into the extracted folder. Edit the file:
      * set `DATASET_PATH` to some newly created folder path
      * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS` 
    - install docker. For Ubuntu Linux, the command is 
      `sudo apt-get install docker-compose`
    - install libarchive and headers: `sudo apt-get install libarchive-dev`
    - (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`
     Without this dependency, you might get an error on the next step, 
     but it's safe to ignore.
    - install Python libraries: `pip install --user -r requirements.txt` . 
    - disable all APIs except GitHub (Bitbucket and Gitlab support were
     not yet implemented when this study was in progress): edit
     `scraper/__init__.py`, comment out everything except GitHub support
     in `PROVIDERS`.
    
    Step 2 - obtaining the dataset
    -----------------------------
    
    The ultimate goal of this step is to get output of the Python function 
    `common.utils.survival_data()` and save it into a CSV file:
    
      # copy and paste into a Python console
      from common import utils
      survival_data = utils.survival_data('pypi', '2008', smoothing=6)
      survival_data.to_csv('survival_data.csv')
    
    Since full replication will take several months, here are some ways to speedup
    the process:
    
    ####Option 2.a, difficulty level: easiest
    
    Just use the precomputed data. Step 1 is not necessary under this scenario.
    
    - extract **dataset_minimal_Jan_2018.zip**
    - get `survival_data.csv`, go to the next step
    
    ####Option 2.b, difficulty level: easy
    
    Use precomputed longitudinal feature values to build the final table.
    The whole process will take 15..30 minutes.
    
    - create a folder `
  11. DATS 6401 - Final Project - Yon ho Cheong.zip

    • figshare.com
    zip
    Updated Dec 15, 2018
    Cite
    Yon ho Cheong (2018). DATS 6401 - Final Project - Yon ho Cheong.zip [Dataset]. http://doi.org/10.6084/m9.figshare.7471007.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 15, 2018
    Dataset provided by
    figshare
    Authors
    Yon ho Cheong
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract

    The H1B is an employment-based visa category for temporary foreign workers in the United States. Every year, the US immigration department receives over 200,000 petitions and selects 85,000 applications through a random process, and the U.S. employer must submit a petition for an H1B visa to the US immigration department. This is the most common visa status applied to international students once they complete college or higher education and begin working in a full-time position. The project provides essential information on job titles, preferred regions of settlement, foreign applicants, and employers' trends for H1B visa applications. Locations, employers, job titles, and salary ranges make up most of the H1B petitions, so different visualization tools are used to analyze and interpret the trends of the H1B visa and provide recommendations to applicants. This report is the basis of the project for the Visualization of Complex Data class at the George Washington University; some examples in this project analyze the relevant variables (Case Status, Employer Name, SOC Name, Job Title, Prevailing Wage, Worksite, and Latitude and Longitude information) from Kaggle and the Office of Foreign Labor Certification (OFLC) in order to see how the H1B visa has changed over the past several decades.

    Keywords: H1B visa, Data Analysis, Visualization of Complex Data, HTML, JavaScript, CSS, Tableau, D3.js

    Dataset

    The dataset contains 10 columns and covers a total of 3 million records spanning 2011-2016. The relevant columns include case status, employer name, SOC name, job title, full-time position, prevailing wage, year, worksite, and latitude and longitude information.
    Link to dataset: https://www.kaggle.com/nsharan/h-1b-visa
    Link to dataset (FY2017): https://www.foreignlaborcert.doleta.gov/performancedata.cfm

    Running the code

    Open Index.html

    Data Processing

    Do some data preprocessing to transform the raw data into an understandable format.
    Find and combine other external datasets to enrich the analysis, such as the FY2017 dataset.
    Develop and compile the variables into the visualization programs to make appropriate visualizations.
    Draw a geo map and scatter plot to compare the fastest growth in fixed value and in percentages.
    Extract some aspects and analyze the changes in employers' preferences as well as forecasts for future trends.

    Visualizations

    Combo chart: shows the overall volume of receipts and the approval rate.
    Scatter plot: shows the beneficiary country of birth.
    Geo map: shows all states of H1B petitions filed.
    Line chart: shows the top 10 states of H1B petitions filed.
    Pie chart: shows a comparison of education level and occupations for petitions FY2011 vs FY2017.
    Tree map: shows the top employers who submit the greatest number of applications.
    Side-by-side bar chart: shows an overall comparison of Data Scientist and Data Analyst.
    Highlight table: shows the mean wage of a Data Scientist and Data Analyst with case status certified.
    Bubble chart: shows the top 10 companies for Data Scientist and Data Analyst.

    Related Research

    The H-1B Visa Debate, Explained - Harvard Business Review: https://hbr.org/2017/05/the-h-1b-visa-debate-explained
    Foreign Labor Certification Data Center: https://www.foreignlaborcert.doleta.gov
    Key facts about the U.S. H-1B visa program: http://www.pewresearch.org/fact-tank/2017/04/27/key-facts-about-the-u-s-h-1b-visa-program/
    H1B visa News and Updates from The Economic Times: https://economictimes.indiatimes.com/topic/H1B-visa/news
    H-1B visa - Wikipedia: https://en.wikipedia.org/wiki/H-1B_visa

    Key Findings

    From the analysis, the government is cutting down the number of approvals for H1B in 2017.
    In the past decade, due to the nature of demand for high-skilled workers, visa holders have clustered in STEM fields and come mostly from countries in Asia such as China and India.
    Technical jobs make up the majority of the top 10 jobs among foreign workers, such as Computer Systems Analyst and Software Developer.
    Employers located in metro areas strive to find foreign workers who can fill the technical positions in their organizations.
    States like California, New York, Washington, New Jersey, Massachusetts, Illinois, and Texas are prime locations for foreign workers and provide many job opportunities.
    Top companies such as Infosys, Tata, and IBM India, which submit the most H1B visa applications, are companies based in India associated with software and IT services.
    The Data Scientist position has experienced exponential growth in terms of H1B visa applications, and jobs are clustered in the West region in the highest numbers.

    Visualization utilizing programs

    HTML, JavaScript, CSS, D3.js, Google API, Python, R, and Tableau

  12. codeparrot_1M

    • kaggle.com
    Updated Feb 25, 2024
    Cite
    Tanay Mehta (2024). codeparrot_1M [Dataset]. https://www.kaggle.com/datasets/heyytanay/codeparrot-1m
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 25, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Tanay Mehta
    Description

    A subset of codeparrot/github-code dataset consisting of 1 Million tokenized Python files in Lance file format for blazing fast and memory efficient I/O.

    The files were tokenized using the EleutherAI/gpt-neox-20b tokenizer with no extra tokens.

    For detailed information on how the dataset was created, refer to my article on Curating Custom Datasets for efficient LLM training using Lance.

    The script used for creating the dataset can be found here.

    Instructions for using this dataset

    This dataset is not supposed to be used on Kaggle Kernels since Lance requires the input directory of the dataset to have write access but Kaggle Kernel's input directory doesn't have it and the dataset size prohibits one from moving it to /kaggle/working. Hence, to use this dataset, you must download it by using the Kaggle API or through this page and then move the unzipped files to a folder called codeparrot_1M.lance. Below are detailed snippets on how to download and use this dataset.

    First, download and unzip the dataset from your terminal (make sure you have your Kaggle API key at ~/.kaggle/):

    $ pip install -q kaggle pyarrow pylance
    $ kaggle datasets download -d heyytanay/codeparrot-1m
    $ mkdir codeparrot_1M.lance/
    $ unzip -qq codeparrot-1m.zip -d codeparrot_1M.lance/
    $ rm codeparrot-1m.zip
    

    Once this is done, you will find your dataset in the codeparrot_1M.lance/ folder. Now to load and get a gist of the data, run the below snippet.

    import lance
    dataset = lance.dataset('codeparrot_1M.lance/')
    print(dataset.count_rows())
    

    This will give you the total number of tokens in the dataset.

    Considerations for Using the Data

    The dataset consists of source code from a wide range of repositories. As such, it can potentially include harmful or biased code as well as sensitive information like passwords or usernames.

  13. (HS 17) Automate Workflows using Jupyter notebook to create Large Spatial...

    • search.dataone.org
    Updated Dec 30, 2023
    Cite
    Young-Don Choi (2023). (HS 17) Automate Workflows using Jupyter notebook to create Large Spatial Sample Datasets [Dataset]. https://search.dataone.org/view/sha256%3A031befe4052e42a42b569cdcb0e76542e5c5b163dbf4480db9d1a52481071759
    Explore at:
    Dataset updated
    Dec 30, 2023
    Dataset provided by
    Hydroshare
    Authors
    Young-Don Choi
    Description

    For the automated workflows, we create Jupyter notebooks for each state. In these workflows, GIS processing to merge, extract and project GeoTIFF data was the most important process. For this process, we used ArcPy, which is a Python package to perform geographic data analysis, data conversion, and data management in ArcGIS (Toms, 2015). After creating state-scale LSS datasets in GeoTIFF format, we convert GeoTIFF to NetCDF using the xarray and rioxarray Python packages. Xarray is a Python package to work with multi-dimensional arrays, and rioxarray is the rasterio xarray extension; rasterio is a Python library to read and write GeoTIFF and other raster formats. We used xarray to manipulate data types and add metadata in the NetCDF files, and rioxarray to save GeoTIFF to NetCDF format. Through these procedures, we created three composite HydroShare resources to share state-scale LSS datasets. Due to the license limitations of ArcGIS Pro, which is commercial GIS software, we developed this Jupyter notebook on Windows OS.
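
    A minimal sketch of the GeoTIFF-to-NetCDF step described above (the file names and attribute values are placeholders; the actual notebooks may differ):

    import rioxarray  # rasterio-backed xarray extension

    # Open a state-scale GeoTIFF as an xarray DataArray, attach metadata, and save to NetCDF.
    da = rioxarray.open_rasterio("state_lss.tif")
    da = da.astype("float32")                            # adjust the data type if needed
    da.attrs["description"] = "State-scale LSS dataset"  # add metadata via xarray
    da.to_netcdf("state_lss.nc")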

  14. "module-utilities": A Python package for simplify creating python modules.

    • catalog.data.gov
    Updated Apr 11, 2024
    Cite
    National Institute of Standards and Technology (2024). "module-utilities": A Python package for simplify creating python modules. [Dataset]. https://catalog.data.gov/dataset/module-utilities-a-python-package-for-simplify-creating-python-modules
    Explore at:
    Dataset updated
    Apr 11, 2024
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    "module-utilities" is a python package of utilities to simplify working with python packages.The main features of module-utilities are as follows: "cached" module: A module to cache class attributes and methods. Right now, this uses a standard python dictionary for storage. Future versions will hopefully be more robust to threading and shared cache."docfiller" module: A module to share documentation. This is adapted from the pandas doc decorator. There are a host of utilities build around this."docinhert": An interface to "docstring-inheritance" module. This can be combined with "docfiller" to make creating related function/class documentation easy.

  15. Data from: Algorithms for Reconstruction of Undersampled Atomic Force...

    • data.niaid.nih.gov
    Updated Jan 24, 2020
    Cite
    Larsen, Torben (2020). Algorithms for Reconstruction of Undersampled Atomic Force Microscopy Images Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_375833
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Oxvig, Christian Schou
    Arildsen, Thomas
    Larsen, Torben
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This deposition contains the results from a simulation of reconstructions of undersampled atomic force microscopy (AFM) images. The reconstructions were obtained using a variety of interpolation and reconstruction methods.

    The deposition consists of:

    An HDF5 database containing the results from simulations of reconstructions of undersampled atomic force microscopy images (reconstruction_goblet_ID_0_of_1.hdf5).

    The Python script which was used to create the database (reconstruction_goblet.py).

    Auxiliary Python scripts needed to run the simulations (optim_reconstructions.py, it_reconstruction.py, interp_reconstructions.py, gamp_reconstructions.py, and utils.py).

    MD5 and SHA256 checksums of the database and Python script files (reconstruction_goblet.MD5SUMS, reconstruction_goblet.SHA256SUMS).

    The HDF5 database is licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/) . Since the CC BY 4.0 license is not well suited for source code, the Python script is licensed under the BSD 2-Clause license (http://opensource.org/licenses/BSD-2-Clause) .

    The files are provided as-is with no warranty as detailed in the above mentioned licenses.

    The simulation results in the database are based on "Atomic Force Microscopy Images of Cell Specimens" and "Atomic Force Microscopy Images of Various Specimens" by Christian Rankl licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). The original images are available at http://dx.doi.org/10.5281/zenodo.17573 and http://dx.doi.org/10.5281/zenodo.60434. The original images are provided as-is without warranty of any kind. Both the original images as well as adapted images are part of the dataset.
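
    To get a first look at the contents of the HDF5 database, a minimal sketch with h5py (the internal group layout is not documented here, so this only lists whatever is present):

    import h5py

    with h5py.File("reconstruction_goblet_ID_0_of_1.hdf5", "r") as db:
        db.visit(print)  # print the path of every group and dataset in the file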

  16. Vietnamese Person Name

    • kaggle.com
    Updated Jun 5, 2021
    Cite
    Yến Trang (2021). Vietnamese Person Name [Dataset]. https://www.kaggle.com/datasets/minhnguynphc/vietnamese-person-name/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 5, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Yến Trang
    Description

    Context

    This dataset was generated while I was working on "Vietnamese person name recognition" for informal text. The deep model is great, but it's too heavy for deployment. An ML, rule-based, or dictionary-based approach can benefit production thanks to its lower complexity.

    Content

    This dataset contains a list of Vietnamese full names. Each full name follows the structure: First + Middle + Last.

    Acknowledgements

    Thanks to Vinbrain for providing me the opportunity to create this dataset during my internship.

    Inspiration

    Hope you find it helpful for the problem you are facing.

  17. Data from: Advanced Python Scripting for ArcGIS Pro

    • dados-edu-pt.hub.arcgis.com
    Updated Aug 13, 2020
    Cite
    Esri Portugal - Educação (2020). Advanced Python Scripting for ArcGIS Pro [Dataset]. https://dados-edu-pt.hub.arcgis.com/datasets/advanced-python-scripting-for-arcgis-pro
    Explore at:
    Dataset updated
    Aug 13, 2020
    Dataset provided by
    Esri (http://esri.com/)
    Authors
    Esri Portugal - Educação
    License

    Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0), https://creativecommons.org/licenses/by-nc-sa/3.0/
    License information was derived automatically

    Description

    Ready for something more complex? An easy-to-follow guide to writing specialized Python scripts for spatial data in ArcGIS Pro.

    Advanced Python Scripting for ArcGIS Pro builds on Python Scripting for ArcGIS Pro (Esri Press, 2019). Learn how to create a geoprocessing tool out of your script and automate tasks in ArcGIS Pro, how to share your tools with others, and how to master a number of more specialized tasks.

    Some of the key topics you will learn include:
    Python toolboxes
    Creating and sharing script tools
    Creating functions and classes
    Managing Python packages and environments
    ArcPy and ArcGIS API for Python
    Jupyter Notebook, Pandas, NumPy, Matplotlib
    Migrating scripts from Python 2 to 3

    Helpful "points to remember," key terms, and review questions are included at the end of each chapter to reinforce your understanding of Python. Companion data and exercises are available online. Advanced Python Scripting for ArcGIS Pro is perfect for more experienced developers who are looking to upgrade their skills. Don't forget to also check out Esri Press's other Python title: Python Scripting for ArcGIS Pro.

    AUDIENCE: Professional and scholarly. College/higher education. General/trade.

    AUTHOR BIO: Paul A. Zandbergen is an associate professor of geography at the University of New Mexico in Albuquerque. His areas of expertise include geographic information science; spatial and statistical analysis techniques using GIS; error and uncertainty in spatial data; GIS applications in criminology, economics, health, and spatial ecology; terrain analysis and modeling; and community-based mapping using GIS and GPS.

    Pub Date: print: 7/14/2020; Digital: 7/14/2020
    ISBN: print: 9781589486188; Digital: 9781589486195
    Price: print: $69.99 USD; Digital: $69.99 USD
    Pages: 300; Trim: 8 x 10 in.

    Table of Contents
    Preface
    Acknowledgments
    Chapter 1. Creating Python Functions and Classes
    Chapter 2. Creating Python script tools
    Chapter 3. Python toolboxes
    Chapter 4. Sharing tools
    Chapter 5. Managing Python packages and environments
    Chapter 6. Essential Python Modules and Packages for Geoprocessing
    Chapter 7. Migrating Scripts from Python 2 to 3
    Chapter 8. ArcGIS API for Python
    Index

    Python Scripting and Advanced Python Scripting for ArcGIS Pro | Official Trailer | 2020-07-12 | 01:04
    Paul Zandbergen | Interview with Esri Press | 2020-07-10 | 25:37 | Link.

  18. Aluminum alloy industrial materials defect

    • figshare.com
    zip
    Updated Dec 3, 2024
    Cite
    Ying Han; Yugang Wang (2024). Aluminum alloy industrial materials defect [Dataset]. http://doi.org/10.6084/m9.figshare.27922929.v3
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 3, 2024
    Dataset provided by
    figshare
    Authors
    Ying Han; Yugang Wang
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset used in this study experiment was from the preliminary competition dataset of the 2018 Guangdong Industrial Intelligent Manufacturing Big Data Intelligent Algorithm Competition organized by Tianchi Feiyue Cloud (https://tianchi.aliyun.com/competition/entrance/231682/introduction). We have selected the dataset, removing images that do not meet the requirements of our experiment. All datasets have been classified for training and testing. The image pixels are all 2560×1960. Before training, all defects need to be labeled using labelimg and saved as json files. Then, all json files are converted to txt files. Finally, the organized defect dataset is detected and classified.Description of the data and file structureThis is a project based on the YOLOv8 enhanced algorithm for aluminum defect classification and detection tasks.All code has been tested on Windows computers with Anaconda and CUDA-enabled GPUs. The following instructions allow users to run the code in this repository based on a Windows+CUDA GPU system already in use.Files and variablesFile: defeat_dataset.zipDescription:SetupPlease follow the steps below to set up the project:Download Project RepositoryDownload the project repository defeat_dataset.zip from the following location.Unzip and navigate to the project folder; it should contain a subfolder: quexian_datasetDownload data1.Download data .defeat_dataset.zip2.Unzip the downloaded data and move the 'defeat_dataset' folder into the project's main folder.3. Make sure that your defeat_dataset folder now contains a subfolder: quexian_dataset.4. Within the folder you should find various subfolders such as addquexian-13, quexian_dataset, new_dataset-13, etc.softwareSet up the Python environment1.Download and install the Anaconda.2.Once Anaconda is installed, activate the Anaconda Prompt. For Windows, click Start, search for Anaconda Prompt, and open it.3.Create a new conda environment with Python 3.8. You can name it whatever you like; for example. Enter the following command: conda create -n yolov8 python=3.84.Activate the created environment. If the name is , enter: conda activate yolov8Download and install the Visual Studio Code.Install PyTorch based on your system:For Windows/Linux users with a CUDA GPU: bash conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 cudatoolkit=11.3 -c pytorch -c conda-forgeInstall some necessary libraries:Install scikit-learn with the command: conda install anaconda scikit-learn=0.24.1Install astropy with: conda install astropy=4.2.1Install pandas using: conda install anaconda pandas=1.2.4Install Matplotlib with: conda install conda-forge matplotlib=3.5.3Install scipy by entering: conda install scipy=1.10.1RepeatabilityFor PyTorch, it's a well-known fact:There is no guarantee of fully reproducible results between PyTorch versions, individual commits, or different platforms. In addition, results may not be reproducible between CPU and GPU executions, even if the same seed is used.All results in the Analysis Notebook that involve only model evaluation are fully reproducible. 
However, when it comes to updating the model on the GPU, the results of model training on different machines vary.Access informationOther publicly accessible locations of the data:https://tianchi.aliyun.com/dataset/public/Data was derived from the following sources:https://tianchi.aliyun.com/dataset/140666Data availability statementThe ten datasets used in this study come from Guangdong Industrial Wisdom Big Data Innovation Competition - Intelligent Algorithm Competition Rematch. and the dataset download link is https://tianchi.aliyun.com/competition/entrance/231682/information?lang=en-us. Officially, there are 4,356 images, including single blemish images, multiple blemish images and no blemish images. The official website provides 4,356 images, including single defect images, multiple defect images and no defect images. We have selected only single defect images and multiple defect images, which are 3,233 images in total. The ten defects are non-conductive, effacement, miss bottom corner, orange, peel, varicolored, jet, lacquer bubble, jump into a pit, divulge the bottom and blotch. Each image contains one or more defects, and the resolution of the defect images are all 2560×1920.By investigating the literature, we found that most of the experiments were done with 10 types of defects, so we chose three more types of defects that are more different from these ten types and more in number, which are suitable for the experiments. The three newly added datasets come from the preliminary dataset of Guangdong Industrial Wisdom Big Data Intelligent Algorithm Competition. The dataset can be downloaded from https://tianchi.aliyun.com/dataset/140666. There are 3,000 images in total, among which 109, 73 and 43 images are for the defects of bruise, camouflage and coating cracking respectively. Finally, the 10 types of defects in the rematch and the 3 types of defects selected in the preliminary round are fused into a new dataset, which is examined in this dataset.In the processing of the dataset, we tried different division ratios, such as 8:2, 7:3, 7:2:1, etc. After testing, we found that the experimental results did not differ much for different division ratios. Therefore, we divide the dataset according to the ratio of 7:2:1, the training set accounts for 70%, the validation set accounts for 20%, and the testing set accounts for 10%. At the same time, the random number seed is set to 0 to ensure that the results obtained are consistent every time the model is trained.Finally, the mean Average Precision (mAP) metric obtained from the experiment was tested on the dataset a total of three times. Each time the results differed very little, but for the accuracy of the experimental results, we took the average value derived from the highest and lowest results. The highest was 71.5% and the lowest was 71.1%, resulting in an average detection accuracy of 71.3% for the final experiment.All data and images utilized in this research are from publicly available sources, and the original creators have given their consent for these materials to be published in open-access formats.The settings for other parameters are as follows. epochs: 200,patience: 50,batch: 16,imgsz: 640,pretrained: true,optimizer: SGD,close_mosaic: 10,iou: 0.7,momentum: 0.937,weight_decay: 0.0005,box: 7.5,cls: 0.5,dfl: 1.5,pose: 12.0,kobj: 1.0,save_dir: runs/trainThe defeat_dataset.(ZIP)is mentioned in the Supporting information section of our manuscript. The underlying data are held at Figshare. 
    The defeat_dataset.zip archive is mentioned in the Supporting information section of our manuscript. The underlying data are held at Figshare, DOI: 10.6084/m9.figshare.27922929. The results_images.zip archive in the system contains the experimental results graphs, and the images_1.zip and images_2.zip archives contain all the images needed to generate the manuscript from manuscript.tex.
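    As a rough sketch of the 7:2:1 split with a fixed random seed described above (the image folder path, file extension, and the use of plain-text split lists are assumptions):

        # Hedged sketch of a 7:2:1 train/val/test split with a fixed seed.
        # The directory layout and the plain-text split lists are illustrative assumptions.
        import random
        from pathlib import Path

        random.seed(0)  # fixed seed so the split is identical on every run

        images = sorted(Path("defeat_dataset/images").glob("*.jpg"))
        random.shuffle(images)

        n = len(images)
        n_train = int(0.7 * n)
        n_val = int(0.2 * n)

        splits = {
            "train": images[:n_train],
            "val": images[n_train:n_train + n_val],
            "test": images[n_train + n_val:],
        }

        for split, files in splits.items():
            # One image path per line, as expected by YOLO-style split lists.
            Path(f"{split}.txt").write_text("\n".join(str(f) for f in files), encoding="utf-8")
            print(split, len(files))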

  19. Data from: Different evolutionary paths to complexity for small and large populations of digital organisms

    • datadryad.org
    • plos.figshare.com
    • +1more
    zip
    Updated Jul 25, 2017
    Cite
    Thomas LaBar; Christoph Adami (2017). Different evolutionary paths to complexity for small and large populations of digital organisms [Dataset]. http://doi.org/10.5061/dryad.3h5kv
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jul 25, 2017
    Dataset provided by
    Dryad
    Authors
    Thomas LaBar; Christoph Adami
    Time period covered
    2017
    Description

    ReadMe File: This file is a guide to the rest of the files associated with this manuscript.
    Avida Configuration Files: This folder contains the Avida configuration files needed to replicate all but one of the experimental treatments in the paper. (config_files.zip)
    Avida nopX Configuration Files: This file contains the Avida configuration files needed to run the "non-functional insertion" treatment experiments. "nopX" refers to the nop-X instruction added in this configuration. (config_files_nopx.zip)
    Submission Files: This folder contains the scripts used to submit the Avida experiments to a computing cluster. (sub_files.zip)
    Figures Script: This Python script creates all of the figures in the paper. (create_figures.py)
    CSV Script: This Python script created the final CSV used to create the figures for the paper. (create_final_csv.py)
    Rename Detail Files Script: This Python script renamed the detail files from the Avida experiments in order to better use them in Avida's Analyze mode. (rename_detail_files.py)
    Data CSV: This cs...

  20. MBC Groundwater drawdown plots

    • demo.dev.magda.io
    • researchdata.edu.au
    • +1more
    zip
    Updated Dec 4, 2022
    Cite
    Bioregional Assessment Program (2022). MBC Groundwater drawdown plots [Dataset]. https://demo.dev.magda.io/dataset/ds-dga-46c50a62-21ea-4e27-9ffd-c98d1a79a47e
    Explore at:
    zipAvailable download formats
    Dataset updated
    Dec 4, 2022
    Dataset provided by
    Bioregional Assessment Program
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract

    The dataset was derived by the Bioregional Assessment Programme from the MBC Groundwater Model dataset. The source dataset is identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement. This dataset contains a number of files that were used to create Figure MBC-2627-002 for report number MBC-2.6.2. This figure depicts changes in groundwater drawdown over time at three separate groundwater wells, as predicted by the MBC groundwater flow model under both Baseline and CRDP scenarios. Differences between the two scenarios are also depicted for each of the three groundwater wells.

    Purpose

    To produce a non-text element in the MBC 2.6.2 product.

    Dataset History

    This dataset contains the following files that were used to create Figure MBC-2627-002 for report number MBC-2.6.2.
    * RN30203.csv, RN66872.csv, RN87532.csv: These three comma-delimited files contain drawdown time series data that were exported from the source data listed under metadata lineage.
    * MBC-2627-002.py: This Python script was created by Chris Turnadge from an existing script created by Luk Peeters and was used to plot three drawdown time series and the relative differences between them, using the data contained in the three csv files "RN30203.csv", "RN66872.csv" and "RN87532.csv".
    * BA_visualisation.py and BA_visualisation.pyc: These Python subroutines were created by Luk Peeters and were used by the Python script "MBC-2627-002.py".
    * MBC-2627-002.png: This PNG image was created by the Python script "MBC-2627-002.py" and was used as Figure MBC-2627-002 in report number MBC-2.6.2.

    Dataset Citation

    Bioregional Assessment Programme (2016) MBC Groundwater drawdown plots. Bioregional Assessment Derived Dataset. Viewed 07 July 2017, http://data.bioregionalassessments.gov.au/dataset/352a2f65-ddbf-4251-a401-c7070d2c9208.

    Dataset Ancestors

    Derived From MBC Groundwater model
    Derived From MBC Groundwater model mine footprints
    Derived From MBC Groundwater model layer boundaries
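    For illustration, a plotting sketch in the spirit of MBC-2627-002.py is given below. The column names in the CSV files ("time", "baseline", "crdp") are assumptions, and the original script and its BA_visualisation helpers are not reproduced here.

        # Hedged sketch: plot drawdown time series for the three wells and the
        # difference between the Baseline and CRDP scenarios.
        # Column names ("time", "baseline", "crdp") are assumptions.
        import pandas as pd
        import matplotlib.pyplot as plt

        wells = ["RN30203", "RN66872", "RN87532"]
        fig, axes = plt.subplots(len(wells), 2, figsize=(10, 9), sharex=True)

        for row, well in enumerate(wells):
            df = pd.read_csv(f"{well}.csv")
            # Left panel: drawdown under each scenario.
            axes[row, 0].plot(df["time"], df["baseline"], label="Baseline")
            axes[row, 0].plot(df["time"], df["crdp"], label="CRDP")
            axes[row, 0].set_ylabel(f"{well} drawdown (m)")
            axes[row, 0].legend()
            # Right panel: difference between the two scenarios.
            axes[row, 1].plot(df["time"], df["crdp"] - df["baseline"], color="k")
            axes[row, 1].set_ylabel("CRDP - Baseline (m)")

        axes[-1, 0].set_xlabel("Time")
        axes[-1, 1].set_xlabel("Time")
        fig.tight_layout()
        fig.savefig("drawdown_sketch.png", dpi=150)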
