100+ datasets found
  1. Free JSON MAC Address Database Download

    • maclookup.app
    json
    Updated Sep 10, 2025
    Cite
    (2025). Free JSON MAC Address Database Download [Dataset]. https://maclookup.app/downloads/json-database
    Explore at:
    json
    Dataset updated
    Sep 10, 2025
    Description

    Download the complete MAC Address JSON database to integrate network data into your projects. Regularly updated and easy to use.
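    A minimal integration sketch. The local file name and the "macPrefix"/"vendorName" field names are assumptions; verify them against the downloaded database:

```python
# A minimal integration sketch. The local file name and the "macPrefix" /
# "vendorName" field names are assumptions; verify them against the download.
import json

with open("mac-vendors.json", encoding="utf-8") as f:  # hypothetical file name
    vendors = json.load(f)

# Index records by OUI prefix for constant-time lookups.
by_prefix = {rec["macPrefix"]: rec["vendorName"] for rec in vendors}

def vendor_for(mac):
    """Return the vendor registered for the MAC's 24-bit OUI prefix, if any."""
    oui = mac.upper().replace("-", ":")[:8]  # first three octets, e.g. "44:38:39"
    return by_prefix.get(oui)

print(vendor_for("44:38:39:ff:ef:57"))
```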

  2. Mongo DB/ Json datasets

    • kaggle.com
    Updated Sep 3, 2023
    Cite
    Shrashti (2023). Mongo DB/ Json datasets [Dataset]. https://www.kaggle.com/datasets/shrashtisinghal/mongo-db-datsets
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Sep 3, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Shrashti
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    Introducing the largest and most comprehensive collection of MongoDB datasets! This curated collection brings together a wealth of information from various domains, including e-commerce, aviation, biology, zoology, literature, history, and more. Gathered from numerous reliable sources and transformed into a unified format, it is a valuable resource for researchers, data scientists, and enthusiasts alike. Each domain contributes its own insights and knowledge, providing a diverse range of information for exploration and analysis. With its enriched content and extensive coverage, this dataset opens up possibilities for uncovering hidden patterns, conducting research, and gaining insights across multiple disciplines.

  3. MEDQA-USMLE QA JSON Only

    • kaggle.com
    Updated Oct 24, 2023
    Cite
    Nithin Dhananjayan (2023). MEDQA-USMLE QA JSON Only [Dataset]. https://www.kaggle.com/datasets/evidence/medqa-usmle-qa-json-only
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Oct 24, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Nithin Dhananjayan
    Description

    The current dataset is a subset and reformatting of a more raw dataset. The focus here is only on US questions and answers split into dev, train, and test sets in separate json files. This format ought to be easier to use. This notebook captures how the conversion was done.

    The rawer dataset is pulled from Papers with Code, which originally sourced it from "A Large-scale Open Domain Question Answering Dataset from Medical Exams".

    The dataset is collected from the professional medical board exams. It covers three languages: English, simplified Chinese, and traditional Chinese, and contains 12,723, 34,251, and 14,123 questions for the three languages, respectively.
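    A minimal loading sketch for the three splits; the file names are assumptions, and if the files turn out to be JSON Lines rather than plain JSON, parse them line by line instead:

```python
# A minimal loading sketch. The split file names are assumptions; if the
# files are JSON Lines rather than plain JSON, parse them line by line.
import json

splits = {}
for name in ("train", "dev", "test"):  # hypothetical file names
    with open(f"{name}.json", encoding="utf-8") as f:
        splits[name] = json.load(f)

print({name: len(items) for name, items in splits.items()})
```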

    This is under the MIT License

    MIT License (as given on GitHub)

    Copyright (c) 2022 Di Jin

    Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

    The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

    Written with StackEdit.

  4. wikidata-20220103-all.json.gz

    • academictorrents.com
    bittorrent
    Updated Jan 24, 2022
    Cite
    wikidata.org (2022). wikidata-20220103-all.json.gz [Dataset]. https://academictorrents.com/details/229cfeb2331ad43d4706efd435f6d78f40a3c438
    Explore at:
    bittorrent (109042925619)
    Dataset updated
    Jan 24, 2022
    Dataset provided by
    Wikidata (https://wikidata.org/)
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    A BitTorrent file to download data with the title 'wikidata-20220103-all.json.gz'
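    The dump is large (about 109 GB compressed, per the listed byte count), so it is usually processed as a stream. A minimal sketch, relying on the common convention that Wikidata JSON dumps are one large array with one entity per line; verify against the file itself:

```python
# A minimal streaming sketch. That each entity occupies one line of the
# array is a common convention for Wikidata JSON dumps, not something this
# listing states; verify against the file itself.
import gzip
import json

with gzip.open("wikidata-20220103-all.json.gz", "rt", encoding="utf-8") as f:
    for line in f:
        line = line.strip().rstrip(",")
        if not line or line in ("[", "]"):
            continue  # skip the array brackets and blank lines
        entity = json.loads(line)
        print(entity["id"])  # e.g. "Q42"
        break  # demo: stop after the first entity
```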

  5. wikidata-20240902-all.json.bz2

    • academictorrents.com
    bittorrent
    Updated Sep 5, 2024
    + more versions
    Cite
    Wikidata Contributors (2024). wikidata-20240902-all.json.bz2 [Dataset]. https://academictorrents.com/details/7bee8ece634c55ab4ed7da5a56dd81578729ed2b
    Explore at:
    bittorrent (91964359511)
    Dataset updated
    Sep 5, 2024
    Dataset provided by
    Wikidata (https://wikidata.org/)
    Authors
    Wikidata Contributors
    License

    No license specified: https://academictorrents.com/nolicensespecified

    Description

    A BitTorrent file to download data with the title 'wikidata-20240902-all.json.bz2'

  6. Data from: MECD

    • huggingface.co
    Updated Oct 31, 2024
    Cite
    tychen (2024). MECD [Dataset]. https://huggingface.co/datasets/tychen-sjtu/MECD
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Oct 31, 2024
    Authors
    tychen
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This version of the JSON file is for display on the dataset card.

    To utilize the JSON file loading mechanism in the current version of the code, please download the JSON file directly from the GitHub repository.

  7. Bulk Download Facility

    • catalog.data.gov
    • s.cnmilf.com
    • +1more
    Updated Jul 6, 2021
    Cite
    U.S. Energy Information Administration (2021). Bulk Download Facility [Dataset]. https://catalog.data.gov/dataset/bulk-download-facility
    Explore at:
    Dataset updated
    Jul 6, 2021
    Dataset provided by
    Energy Information Administration (http://www.eia.gov/)
    Description

    The bulk download facility provides the entire contents of each major API data set in a single ZIP file. A small JSON formatted manifest file lists the bulk files and the update date of each file. The manifest is generally updated daily and can be downloaded from http://api.eia.gov/bulk/manifest.txt. The manifest contains information about the bulk files, including all required common core attributes.
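    A minimal sketch of fetching and inspecting the manifest; the description says only that it is JSON-formatted, so no key names are assumed:

```python
# A minimal sketch that fetches the daily manifest and lists its top-level
# keys. The description says only that it is JSON-formatted, so no key names
# are assumed here.
import json
from urllib.request import urlopen

with urlopen("http://api.eia.gov/bulk/manifest.txt") as resp:
    manifest = json.load(resp)

print(list(manifest))  # inspect the structure before processing bulk files
```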

  8. #PraCegoVer dataset

    • data.niaid.nih.gov
    Updated Jan 19, 2023
    Cite
    Sandra Avila (2023). #PraCegoVer dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5710561
    Explore at:
    Dataset updated
    Jan 19, 2023
    Dataset provided by
    Gabriel Oliveira dos Santos
    Sandra Avila
    Esther Luna Colombini
    Description

    Automatically describing images using natural sentences is an essential task to visually impaired people's inclusion on the Internet. Although there are many datasets in the literature, most of them contain only English captions, whereas datasets with captions described in other languages are scarce.

    #PraCegoVer arose on the Internet as a movement that encourages social media users to publish images, tag them #PraCegoVer, and add a short description of their content. Inspired by this movement, we have proposed the #PraCegoVer dataset, a multi-modal dataset with Portuguese captions based on posts from Instagram. It is the first large dataset for image captioning in Portuguese with freely annotated images.

    PraCegoVer has 533,523 pairs with images and captions described in Portuguese collected from more than 14 thousand different profiles. Also, the average caption length in #PraCegoVer is 39.3 words and the standard deviation is 29.7.

    Dataset Structure

    The #PraCegoVer dataset is composed of the main file dataset.json and a collection of compressed files named images.tar.gz.partX containing the images. The file dataset.json comprises a list of JSON objects with the attributes:

    user: anonymized user that made the post;

    filename: image file name;

    raw_caption: raw caption;

    caption: clean caption;

    date: post date.

    Each instance in dataset.json is associated with exactly one image in the images directory whose filename is pointed by the attribute filename. Also, we provide a sample with five instances, so the users can download the sample to get an overview of the dataset before downloading it completely.
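    Given that structure, a minimal sketch for reading the annotations from dataset.json:

```python
# A minimal sketch of reading dataset.json, using the attributes listed above.
import json

with open("dataset.json", encoding="utf-8") as f:
    records = json.load(f)

for rec in records[:3]:
    print(rec["filename"], "->", rec["caption"][:60])
```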

    Download Instructions

    If you just want to have an overview of the dataset structure, you can download sample.tar.gz. But, if you want to use the dataset, or any of its subsets (63k and 173k), you must download all the files and run the following commands to uncompress and join the files:

    cat images.tar.gz.part* > images.tar.gz
    tar -xzvf images.tar.gz

    Alternatively, you can download the entire dataset from the terminal using the Python script download_dataset.py available in the #PraCegoVer repository. In this case, you first have to download the script and create an access token (linked from the dataset page). Then, you can run the following command to download and uncompress the image files:

    python download_dataset.py --access_token=

  9. Dataset metadata of known Dataverse installations

    • search.dataone.org
    • dataverse.harvard.edu
    • +1more
    Updated Nov 22, 2023
    + more versions
    Cite
    Gautier, Julian (2023). Dataset metadata of known Dataverse installations [Dataset]. http://doi.org/10.7910/DVN/DCDKZQ
    Explore at:
    Dataset updated
    Nov 22, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Gautier, Julian
    Description

    This dataset contains the metadata of the datasets published in 77 Dataverse installations, information about each installation's metadata blocks, and the list of standard licenses that dataset depositors can apply to the datasets they publish in the 36 installations running more recent versions of the Dataverse software. The data is useful for reporting on the quality of dataset and file-level metadata within and across Dataverse installations. Curators and other researchers can use this dataset to explore how well Dataverse software and the repositories using the software help depositors describe data.

    How the metadata was downloaded

    The dataset metadata and metadata block JSON files were downloaded from each installation on October 2 and October 3, 2022 using a Python script kept in a GitHub repo at https://github.com/jggautier/dataverse-scripts/blob/main/other_scripts/get_dataset_metadata_of_all_installations.py. In order to get the metadata from installations that require an installation account API token to use certain Dataverse software APIs, I created a CSV file with two columns: one column named "hostname" listing each installation URL in which I was able to create an account and another named "apikey" listing my accounts' API tokens. The Python script expects and uses the API tokens in this CSV file to get metadata and other information from installations that require API tokens.

    How the files are organized

    ├── csv_files_with_metadata_from_most_known_dataverse_installations
    │   ├── author(citation).csv
    │   ├── basic.csv
    │   ├── contributor(citation).csv
    │   ├── ...
    │   └── topic_classification(citation).csv
    ├── dataverse_json_metadata_from_each_known_dataverse_installation
    │   ├── Abacus_2022.10.02_17.11.19.zip
    │   │   ├── dataset_pids_Abacus_2022.10.02_17.11.19.csv
    │   │   ├── Dataverse_JSON_metadata_2022.10.02_17.11.19
    │   │   │   ├── hdl_11272.1_AB2_0AQZNT_v1.0.json
    │   │   │   └── ...
    │   │   └── metadatablocks_v5.6
    │   │       ├── astrophysics_v5.6.json
    │   │       ├── biomedical_v5.6.json
    │   │       ├── citation_v5.6.json
    │   │       ├── ...
    │   │       └── socialscience_v5.6.json
    │   ├── ACSS_Dataverse_2022.10.02_17.26.19.zip
    │   ├── ADA_Dataverse_2022.10.02_17.26.57.zip
    │   ├── Arca_Dados_2022.10.02_17.44.35.zip
    │   ├── ...
    │   └── World_Agroforestry_-_Research_Data_Repository_2022.10.02_22.59.36.zip
    ├── dataset_pids_from_most_known_dataverse_installations.csv
    ├── licenses_used_by_dataverse_installations.csv
    └── metadatablocks_from_most_known_dataverse_installations.csv

    This dataset contains two directories and three CSV files not in a directory. One directory, "csv_files_with_metadata_from_most_known_dataverse_installations", contains 18 CSV files that contain the values from common metadata fields of all 77 Dataverse installations. For example, author(citation)_2022.10.02-2022.10.03.csv contains the "Author" metadata for all published, non-deaccessioned versions of all datasets in the 77 installations, where there's a row for each author name, affiliation, identifier type, and identifier. The other directory, "dataverse_json_metadata_from_each_known_dataverse_installation", contains 77 zipped files, one for each of the 77 Dataverse installations whose dataset metadata I was able to download using Dataverse APIs.

    Each zip file contains a CSV file and two sub-directories. The CSV file contains the persistent IDs and URLs of each published dataset in the Dataverse installation, as well as a column to indicate whether or not the Python script was able to download the Dataverse JSON metadata for each dataset. For Dataverse installations using Dataverse software versions whose Search APIs include each dataset's owning Dataverse collection name and alias, the CSV files also include which Dataverse collection (within the installation) each dataset was published in. One sub-directory contains a JSON file for each of the installation's published, non-deaccessioned dataset versions; the JSON files contain the metadata in the "Dataverse JSON" metadata schema. The other sub-directory contains information about the metadata models (the "metadata blocks" in JSON files) that the installation was using when the dataset metadata was downloaded. I saved them so that they can be used when extracting metadata from the Dataverse JSON files.

    The dataset_pids_from_most_known_dataverse_installations.csv file contains the dataset PIDs of all published datasets in the 77 Dataverse installations, with a column to indicate if the Python script was able to download the dataset's metadata. It's a union of all of the "dataset_pids_..." files in each of the 77 zip files. The licenses_used_by_dataverse_installations.csv file contains information about the licenses that a number of the installations let depositors choose when creating datasets. When I collected ... Visit https://dataone.org/datasets/sha256%3Ad27d528dae8cf01e3ea915f450426c38fd6320e8c11d3e901c43580f997a3146 for complete metadata about this dataset.

  10. Data from: ThermoML/Data Archive

    • catalog.data.gov
    • data.nist.gov
    • +2more
    Updated Jul 29, 2022
    Cite
    National Institute of Standards and Technology (2022). ThermoML/Data Archive [Dataset]. https://catalog.data.gov/dataset/thermoml-data-archive
    Explore at:
    Dataset updated
    Jul 29, 2022
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    ThermoML is an XML-based IUPAC standard for the storage and exchange of experimental thermophysical and thermochemical property data. The ThermoML archive is a subset of Thermodynamics Research Center (TRC) data holdings corresponding to cooperation between NIST TRC and five journals: Journal of Chemical & Engineering Data (ISSN: 1520-5134), The Journal of Chemical Thermodynamics (ISSN: 1096-3626), Fluid Phase Equilibria (ISSN: 0378-3812), Thermochimica Acta (ISSN: 0040-6031), and International Journal of Thermophysics (ISSN: 1572-9567). Data from initial cooperation (around 2003) through the 2019 calendar year are included.

    The original scope of the archive has been expanded to include JSON files. The JSON files are structured according to the ThermoML.xsd (available below) and rendered from the same experimental thermophysical and thermochemical property data reported in the corresponding articles as the ThermoML files. In fact, the ThermoML files are generated from the JSON files to keep the information in sync. The JSON files may contain additional information not supported by the ThermoML schema. For example, each JSON file contains the md5 checksum of the ThermoML file (THERMOML_MD5_CHECKSUM) that may be used to validate the ThermoML download.

    This data.nist.gov resource provides a .tgz file download containing the JSON and ThermoML files for each version of the archive. Data from initial cooperation (around 2003) through the 2019 calendar year are provided below (ThermoML.v2020-09.30.tgz). The dates of the extraction from TRC databases, as specified in the dateCit field of the xml files, are 2020-09-29 and 2020-09-30. The .tgz file contains a directory tree that maps to the DOI prefix/suffix of the entries; e.g., unzipping the .tgz file creates a directory for each of the prefixes (10.1007, 10.1016, and 10.1021) that contains all the .json and .xml files.

    The data and other information throughout this digital resource (including the website, API, JSON, and ThermoML files) have been carefully extracted from the original articles by NIST/TRC personnel. Neither the Journal publisher, nor its editors, nor NIST/TRC warrant or represent, expressly or implied, the correctness or accuracy of the content of information contained throughout this digital resource, nor its fitness for any use or for any purpose, nor can they, or will they, accept any liability or responsibility whatever for the consequences of its use or misuse by anyone. In any individual case of application, the respective user must check the correctness by consulting other relevant sources of information.
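    As a concrete example of that validation step, the sketch below recomputes the md5 of a downloaded .xml file and compares it against the THERMOML_MD5_CHECKSUM recorded in its companion .json file; the file paths are illustrative only:

```python
# A sketch of the checksum validation mentioned above: recompute the md5 of a
# ThermoML .xml file and compare it with the THERMOML_MD5_CHECKSUM field in
# its companion .json file. The file paths are illustrative.
import hashlib
import json

with open("10.1021/example.json", encoding="utf-8") as f:  # hypothetical path
    meta = json.load(f)

with open("10.1021/example.xml", "rb") as f:  # hypothetical path
    digest = hashlib.md5(f.read()).hexdigest()

print("valid:", digest == meta["THERMOML_MD5_CHECKSUM"])
```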

  11. Site Scanning API

    • catalog.data.gov
    • gimi9.com
    Updated May 6, 2025
    Cite
    General Services Administration (2025). Site Scanning API [Dataset]. https://catalog.data.gov/dataset/site-scanning-api
    Explore at:
    Dataset updated
    May 6, 2025
    Dataset provided by
    General Services Administration (http://www.gsa.gov/)
    Description

    Every day, the Site Scanning program runs a scanning engine to dynamically pull down lists of domains from various sources and then scan them with a collection of scan plugins to gather data on them. The resulting data that populates this API can be seen as having two main utilities: providing a fairly comprehensive dataset of US federal government websites, and providing various information and analysis about each of these websites. In addition to querying the data via API, you can also download it directly as a CSV or JSON file.

  12. Anvil Centre Events Schedule (JSON file)

    • opendata.newwestcity.ca
    • data-60320-newwestcity.opendata.arcgis.com
    • +1more
    Updated Mar 22, 2022
    Cite
    City of New Westminster, British Columbia, Canada (2022). Anvil Centre Events Schedule (JSON file) [Dataset]. https://opendata.newwestcity.ca/datasets/6d398e267fde4cd19a29abd461034830
    Explore at:
    Dataset updated
    Mar 22, 2022
    Dataset authored and provided by
    City of New Westminster, British Columbia, Canada
    Description

    Custom JSON File created for download

  13. EIA Bulk File Downloads

    • datalumos.org
    Updated May 14, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    U.S. Energy Information Administration (2025). EIA Bulk File Downloads [Dataset]. http://doi.org/10.3886/E229741V1
    Explore at:
    Dataset updated
    May 14, 2025
    Dataset provided by
    Energy Information Administration (http://www.eia.gov/)
    Authors
    U.S. Energy Information Administration
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    United States of America
    Description

    This collection encompasses all bulk data downloads available on EIA's open data site on 5/14/2025. The manifest.txt file provides descriptions of the included datasets in a JSON format. The datasets are divided by subject. Survey forms used to collect the data are available here: https://www.eia.gov/survey/

    File name | Subject
    AEO2025.zip | Annual Energy Outlook 2025
    SEDS.zip | State Energy Data Systems
    ELEC.zip | Electricity
    NG.zip | Natural Gas
    PET.zip | Petroleum
    TOTAL.zip | Total Energy
    COAL.zip | Coal
    STEO.zip | Short Term Energy Outlook
    PET_IMPORTS.zip | Crude Oil Imports
    INTL.zip | International Energy Data
    EBA.zip | US Electric System Operating Data (2019-present)
    EBA-pre2019.zip | US Electric System Operating Data (before 2019)
    EMISS.zip | CO2 Emissions
    IEO.zip | International Energy Outlook
    NUC_STATUS.zip | U.S. Nuclear Outages

  14. MULocBench

    • huggingface.co
    Updated Sep 22, 2025
    Cite
    somethingone (2025). MULocBench [Dataset]. https://huggingface.co/datasets/somethingone/MULocBench
    Explore at:
    Dataset updated
    Sep 22, 2025
    Dataset authored and provided by
    somethingone
    Description
    1. Downloads

    Please download the benchmark from https://huggingface.co/datasets/somethingone/MULocBench/blob/main/all_issues_with_pr_commit_comment_all_project_0922.pkl. If you'd like to view the data directly, you can download the JSON file and open it in your browser.

      2. How to Use the Dataset

```python
import pickle

filepath = "input the data path, e.g., all_issues_with_pr_commit_comment_all_project_0922.pkl"
with open(filepath, 'rb') as file:
    iss_list = …
```

    See the full description on the dataset page: https://huggingface.co/datasets/somethingone/MULocBench.
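    The snippet above is truncated. A minimal completed sketch, assuming the elided call is pickle.load (consistent with the .pkl extension):

```python
# A minimal completed sketch; pickle.load is an assumption about the elided
# call, consistent with the .pkl extension.
import pickle

filepath = "all_issues_with_pr_commit_comment_all_project_0922.pkl"
with open(filepath, "rb") as file:
    iss_list = pickle.load(file)

print(len(iss_list))
```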

  15. Forensic Toolkit Dataset

    • kaggle.com
    Updated May 26, 2025
    + more versions
    Cite
    SUNNY THAKUR (2025). Forensic Toolkit Dataset [Dataset]. https://www.kaggle.com/datasets/cyberprince/forensic-toolkit-dataset
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    May 26, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    SUNNY THAKUR
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Forensic Toolkit Dataset

    Overview

    The Forensic Toolkit Dataset is a comprehensive collection of 300 digital forensics and incident response (DFIR) tools, designed for training AI models, supporting forensic investigations, and enhancing cybersecurity workflows. The dataset includes both mainstream and unconventional tools, covering disk imaging, memory analysis, network forensics, mobile forensics, cloud forensics, blockchain analysis, and AI-driven forensic techniques. Each entry provides detailed information about the tool's name, commands, usage, description, supported platforms, and official links, making it a valuable resource for forensic analysts, data scientists, and machine learning engineers.

    Dataset Description

    The dataset is provided in JSON Lines (JSONL) format, with each line representing a single tool as a JSON object. It is optimized for AI training, data analysis, and integration into forensic workflows.

    Schema

    Each entry contains the following fields:

    id: Sequential integer identifier (1–300).
    tool_name: Name of the forensic tool.
    commands: List of primary commands or usage syntax (if applicable; GUI-based tools noted).
    usage: Brief description of how the tool is used in forensic or incident response tasks.
    description: Detailed explanation of the tool’s purpose, capabilities, and forensic applications.
    link: URL to the tool’s official website or documentation (verified as of May 26, 2025).
    system: List of supported platforms (e.g., Linux, Windows, macOS, Android, iOS, Cloud).
    
    
    Sample Entry
    {
     "id": 1,
     "tool_name": "The Sleuth Kit (TSK)",
     "commands": ["fls -r -m / image.dd > bodyfile", "ils -e image.dd", "icat image.dd 12345 > output.file", "istat image.dd 12345"],
     "usage": "Analyze disk images to recover files, list file metadata, and create timelines.",
     "description": "Open-source collection of command-line tools for analyzing disk images and file systems (NTFS, FAT, ext). Enables recovery of deleted files, metadata examination, and timeline generation.",
     "link": "https://www.sleuthkit.org/sleuthkit/",
     "system": ["Linux", "Windows", "macOS"]
    }
    

    Dataset Structure

    Total Entries: 300

    Content Focus:
    - Mainstream tools (e.g., The Sleuth Kit, FTK Imager)
    - Unconventional tools (e.g., IoTSeeker, Chainalysis Reactor, DeepCase)
    - Specialized areas: IoT, blockchain, cloud, mobile, and AI-driven forensics

    Purpose

    The dataset is designed for:
    - AI Training: Fine-tuning machine learning models for forensic tool recommendation, command generation, or artifact analysis.
    - Forensic Analysis: Reference for forensic analysts to identify tools for specific investigative tasks.
    - Cybersecurity Research: Supporting incident response, threat hunting, and vulnerability analysis.
    - Education: Providing a structured resource for learning about DFIR tools and their applications.

    Usage

    Accessing the Dataset

    Download the JSONL files from the repository. Each file can be parsed using standard JSONL libraries (e.g., jsonlines in Python, jq in Linux). Combine files for a complete dataset or use individual segments as needed.

```python
# Example: Parsing with Python
import json

with open('forensic_toolkit_dataset_1_50.jsonl', 'r') as file:
    for line in file:
        tool = json.loads(line)
        print(f"Tool: {tool['tool_name']}, Supported Systems: {tool['system']}")
```

    Applications
    
    AI Model Training: Use the dataset to train models for predicting tool usage based on forensic tasks or generating command sequences.
    Forensic Workflows: Query the dataset to select tools for specific platforms (e.g., Cloud, Android) or tasks (e.g., memory analysis).
    Data Analysis: Analyze tool distribution across platforms or forensic categories using data science tools (e.g., Pandas, R).
    
    Contribution Guidelines
    We welcome contributions to expand or refine the dataset. To contribute:
    
    Fork the repository.
    Add new tools or update existing entries in JSONL format, ensuring adherence to the schema.
    Verify links and platform compatibility as of the contribution date.
    Submit a pull request with a clear description of changes.
    Avoid duplicating tools from existing entries (check IDs 1–300).
    
    Contribution Notes
    
    Ensure tools are forensically sound (preserve evidence integrity, court-admissible where applicable).
    Include unconventional or niche tools to maintain dataset diversity.
    Validate links and commands against official documentation.
    
    License
    This dataset is licensed under the MIT License. See the LICENSE file for details.
    Acknowledgments
    
    Inspired by forensic toolkits and resources from ForensicArtifacts.com, SANS, and open-source communities.
    Thanks to contributors for identifying unique and unconventional DFIR tools.
    
    Contact
    For issues, suggestions, or inquiries, please open an issue on the repository or contact the maintainers at sunny48445@gmail.com.
    
  16. 20150112.json.gz

    • academictorrents.com
    bittorrent
    Updated Jan 17, 2015
    Cite
    Wikidata Project (2015). 20150112.json.gz [Dataset]. https://academictorrents.com/details/466d6a3794328acc7c068a45f0380ef3ade8345f
    Explore at:
    bittorrent (3908362534)
    Dataset updated
    Jan 17, 2015
    Dataset provided by
    Wikidata (https://wikidata.org/)
    Authors
    Wikidata Project
    License

    No license specified: https://academictorrents.com/nolicensespecified

    Description

    A BitTorrent file to download data with the title '20150112.json.gz'

  17. Pleiades dataset

    • marketplace.sshopencloud.eu
    Updated Jan 1, 2017
    Cite
    (2017). Pleiades dataset [Dataset]. https://marketplace.sshopencloud.eu/dataset/ZV8S2J
    Explore at:
    Dataset updated
    Jan 1, 2017
    Description

    Pleiades gives scholars, students, and enthusiasts worldwide the ability to use, create, and share historical geographic information about the ancient world in digital form. At present, Pleiades has extensive coverage for the Greek and Roman world, and is expanding into Ancient Near Eastern, Byzantine, Celtic, and Early Medieval geography.

    JSON-formatted data is the site's "only comprehensive data dump." It contains all attributes of all place, name, and location objects in the database that have been published. Each morning, a single JSON file covering all published places is written to http://atlantides.org/downloads/pleiades/json/. JSON is a widely used, well-known format that is popular for use in web applications and other programming tasks.

    We keep a week's worth of files, deleting older ones. The file named pleiades-places-latest.json.gz will always get you the most recent version. Note also that previously published place resources that have been withdrawn and moved to the "errata" section of the site are dumped to a separate JSON file.
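    A minimal sketch of fetching and opening the latest dump named above; the description does not specify the top-level JSON structure, so inspect it before relying on any keys:

```python
# A minimal sketch of fetching and opening the latest places dump. The
# top-level JSON structure is not documented here, so inspect it first.
import gzip
import json
from urllib.request import urlretrieve

url = "http://atlantides.org/downloads/pleiades/json/pleiades-places-latest.json.gz"
urlretrieve(url, "pleiades-places-latest.json.gz")

with gzip.open("pleiades-places-latest.json.gz", "rt", encoding="utf-8") as f:
    data = json.load(f)

print(type(data))  # check whether the dump is a list or an object before use
```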

  18. openFDA Drug Labeling

    • kaggle.com
    Updated Apr 9, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    ddrbcn (2025). openFDA Drug Labeling [Dataset]. https://www.kaggle.com/datasets/ddrbcn/openfda-drug-labeling
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Apr 9, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    ddrbcn
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    🧬 openFDA Drug Labeling – JSON Dataset

    This dataset contains structured drug labeling information (FDA labels) provided by DailyMed and made available through the openFDA Drug Labeling endpoint.

    The dataset includes 13 compressed .zip files with drug label records in JSON format. Each record reflects the full label submitted to the FDA, and the structure matches what you would receive from the /drug/label API.

    📁 Dataset Contents

    • 13 ZIP files
    • Each file contains multiple JSON documents representing FDA-approved drug labels
    • Data fields include (but are not limited to):
      • drug_interactions
      • warnings
      • indications_and_usage
      • contraindications
      • adverse_reactions
      • dosage_and_administration
      • brand_name, generic_name
      • ...and many others

    You will also find the 'Human Drug.xlsx' file included in the dataset, which contains the complete data dictionary for reference.

    🔄 Updates

    This dataset reflects the most recent version available as of April 9, 2025. According to the source, previous records may be modified in future updates. For accuracy and completeness, all files should be downloaded together.

    📚 Sources and More Information

    ⚠️ Disclaimer (Please Read Carefully)

    Do not rely on openFDA to make decisions regarding medical care. Always speak to your health provider about the risks and benefits of FDA-regulated products. We may limit or otherwise restrict your access to the API in line with our Terms of Service.

    Full terms available here: openFDA Terms of Service

    🛠️ Notes for Usage

    This dataset is ideal for applications involving:
    - Drug safety analysis
    - Drug interaction monitoring
    - Medical language modeling
    - Retrieval-augmented generation (RAG) agents
    - Regulatory and pharmacovigilance systems

    You may want to extract and preprocess only relevant fields before vectorizing or feeding them into an AI model for efficiency and performance.
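    A minimal preprocessing sketch along those lines. The zip file name is hypothetical, and the {"results": [...]} inner layout mirrors the /drug/label API responses; verify both against the downloaded files:

```python
# A minimal preprocessing sketch. The zip file name is hypothetical, and the
# {"results": [...]} inner layout mirrors the /drug/label API; verify both
# against the downloaded files.
import json
import zipfile

FIELDS = ("brand_name", "generic_name", "indications_and_usage")

with zipfile.ZipFile("drug-label-0001-of-0013.zip") as zf:  # hypothetical name
    with zf.open(zf.namelist()[0]) as f:
        labels = json.load(f).get("results", [])

# Keep only the fields of interest; brand/generic names often live under the
# "openfda" sub-object in API responses, so fall back to it.
slim = [
    {k: rec.get(k, rec.get("openfda", {}).get(k)) for k in FIELDS}
    for rec in labels
]
print(slim[:1])
```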

  19. CityPropertyMailingListBusiness

    • data.milwaukee.gov
    csv
    Updated Sep 9, 2025
    + more versions
    Cite
    Information Technology and Management Division (2025). CityPropertyMailingListBusiness [Dataset]. https://data.milwaukee.gov/dataset/citypropertymailinglistbusiness
    Explore at:
    csv
    Dataset updated
    Sep 9, 2025
    Dataset authored and provided by
    Information Technology and Management Division
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    To download XML and JSON files, click the CSV option below and click the down arrow next to the Download button in the upper right on its page.

  20. PIPr: A Dataset of Public Infrastructure as Code Programs

    • data.niaid.nih.gov
    • zenodo.org
    Updated Nov 28, 2023
    Cite
    Salvaneschi, Guido (2023). PIPr: A Dataset of Public Infrastructure as Code Programs [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8262770
    Explore at:
    Dataset updated
    Nov 28, 2023
    Dataset provided by
    Salvaneschi, Guido
    Spielmann, David
    Sokolowski, Daniel
    License

    Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    Programming Languages Infrastructure as Code (PL-IaC) enables IaC programs written in general-purpose programming languages like Python and TypeScript. The currently available PL-IaC solutions are Pulumi and the Cloud Development Kits (CDKs) of Amazon Web Services (AWS) and Terraform. This dataset provides metadata and initial analyses of all public GitHub repositories in August 2022 with an IaC program, including their programming languages, applied testing techniques, and licenses. Further, we provide a shallow copy of the head state of those 7104 repositories whose licenses permit redistribution. The dataset is available under the Open Data Commons Attribution License (ODC-By) v1.0. Contents:

    - metadata.zip: The dataset metadata and analysis results as CSV files.
    - scripts-and-logs.zip: Scripts and logs of the dataset creation.
    - LICENSE: The Open Data Commons Attribution License (ODC-By) v1.0 text.
    - README.md: This document.
    - redistributable-repositiories.zip: Shallow copies of the head state of all redistributable repositories with an IaC program.

    This artifact is part of the ProTI Infrastructure as Code testing project: https://proti-iac.github.io.

    Metadata

    The dataset's metadata comprises three tabular CSV files containing metadata about all analyzed repositories, IaC programs, and testing source code files. (A small loading sketch follows at the end of this entry.)

    repositories.csv:
    - ID (integer): GitHub repository ID
    - url (string): GitHub repository URL
    - downloaded (boolean): Whether cloning the repository succeeded
    - name (string): Repository name
    - description (string): Repository description
    - licenses (string, list of strings): Repository licenses
    - redistributable (boolean): Whether the repository's licenses permit redistribution
    - created (string, date & time): Time of the repository's creation
    - updated (string, date & time): Time of the last update to the repository
    - pushed (string, date & time): Time of the last push to the repository
    - fork (boolean): Whether the repository is a fork
    - forks (integer): Number of forks
    - archive (boolean): Whether the repository is archived
    - programs (string, list of strings): Project file path of each IaC program in the repository

    programs.csv:
    - ID (string): Project file path of the IaC program
    - repository (integer): GitHub repository ID of the repository containing the IaC program
    - directory (string): Path of the directory containing the IaC program's project file
    - solution (string, enum): PL-IaC solution of the IaC program ("AWS CDK", "CDKTF", "Pulumi")
    - language (string, enum): Programming language of the IaC program (enum values: "csharp", "go", "haskell", "java", "javascript", "python", "typescript", "yaml")
    - name (string): IaC program name
    - description (string): IaC program description
    - runtime (string): Runtime string of the IaC program
    - testing (string, list of enum): Testing techniques of the IaC program (enum values: "awscdk", "awscdk_assert", "awscdk_snapshot", "cdktf", "cdktf_snapshot", "cdktf_tf", "pulumi_crossguard", "pulumi_integration", "pulumi_unit", "pulumi_unit_mocking")
    - tests (string, list of strings): File paths of IaC program's tests

    testing-files.csv:
    - file (string): Testing file path
    - language (string, enum): Programming language of the testing file (enum values: "csharp", "go", "java", "javascript", "python", "typescript")
    - techniques (string, list of enum): Testing techniques used in the testing file (enum values: "awscdk", "awscdk_assert", "awscdk_snapshot", "cdktf", "cdktf_snapshot", "cdktf_tf", "pulumi_crossguard", "pulumi_integration", "pulumi_unit", "pulumi_unit_mocking")
    - keywords (string, list of enum): Keywords found in the testing file (enum values: "/go/auto", "/testing/integration", "@AfterAll", "@BeforeAll", "@Test", "@aws-cdk", "@aws-cdk/assert", "@pulumi.runtime.test", "@pulumi/", "@pulumi/policy", "@pulumi/pulumi/automation", "Amazon.CDK", "Amazon.CDK.Assertions", "Assertions_", "HashiCorp.Cdktf", "IMocks", "Moq", "NUnit", "PolicyPack(", "ProgramTest", "Pulumi", "Pulumi.Automation", "PulumiTest", "ResourceValidationArgs", "ResourceValidationPolicy", "SnapshotTest()", "StackValidationPolicy", "Testing", "Testing_ToBeValidTerraform(", "ToBeValidTerraform(", "Verifier.Verify(", "WithMocks(", "[Fact]", "[TestClass]", "[TestFixture]", "[TestMethod]", "[Test]", "afterAll(", "assertions", "automation", "aws-cdk-lib", "aws-cdk-lib/assert", "aws_cdk", "aws_cdk.assertions", "awscdk", "beforeAll(", "cdktf", "com.pulumi", "def test_", "describe(", "github.com/aws/aws-cdk-go/awscdk", "github.com/hashicorp/terraform-cdk-go/cdktf", "github.com/pulumi/pulumi", "integration", "junit", "pulumi", "pulumi.runtime.setMocks(", "pulumi.runtime.set_mocks(", "pulumi_policy", "pytest", "setMocks(", "set_mocks(", "snapshot", "software.amazon.awscdk.assertions", "stretchr", "test(", "testing", "toBeValidTerraform(", "toMatchInlineSnapshot(", "toMatchSnapshot(", "to_be_valid_terraform(", "unittest", "withMocks(")
    - program (string): Project file path of the testing file's IaC program

    Dataset Creation

    scripts-and-logs.zip contains all scripts and logs of the creation of this dataset. In it, executions/executions.log documents the commands that generated this dataset in detail. On a high level, the dataset was created as follows:

    1. A list of all repositories with a PL-IaC program configuration file was created using search-repositories.py (documented below). The execution took two weeks due to the non-deterministic nature of GitHub's REST API, causing excessive retries.
    2. A shallow copy of the head of all repositories was downloaded using download-repositories.py (documented below).
    3. Using analysis.ipynb, the repositories were analyzed for the programs' metadata, including the used programming languages and licenses.
    4. Based on the analysis, all repositories with at least one IaC program and a redistributable license were packaged into redistributable-repositiories.zip, excluding any node_modules and .git directories.

    Searching Repositories

    The repositories are searched through search-repositories.py and saved in a CSV file. The script takes these arguments in the following order:

    1. GitHub access token.
    2. Name of the CSV output file.
    3. Filename to search for.
    4. File extensions to search for, separated by commas.
    5. Min file size for the search (for all files: 0).
    6. Max file size for the search or * for unlimited (for all files: *).

    Pulumi projects have a Pulumi.yaml or Pulumi.yml (case-sensitive file name) file in their root folder, i.e., (3) is Pulumi and (4) is yml,yaml: https://www.pulumi.com/docs/intro/concepts/project/. AWS CDK projects have a cdk.json (case-sensitive file name) file in their root folder, i.e., (3) is cdk and (4) is json: https://docs.aws.amazon.com/cdk/v2/guide/cli.html. CDK for Terraform (CDKTF) projects have a cdktf.json (case-sensitive file name) file in their root folder, i.e., (3) is cdktf and (4) is json: https://www.terraform.io/cdktf/create-and-deploy/project-setup.

    Limitations: the script uses the GitHub code search API and inherits its limitations:
    - Only forks with more stars than the parent repository are included.
    - Only the repositories' default branches are considered.
    - Only files smaller than 384 KB are searchable.
    - Only repositories with fewer than 500,000 files are considered.
    - Only repositories that have had activity or have been returned in search results in the last year are considered.
    More details: https://docs.github.com/en/search-github/searching-on-github/searching-code

    The results of the GitHub code search API are not stable. However, the generally more robust GraphQL API does not support searching for files in repositories: https://stackoverflow.com/questions/45382069/search-for-code-in-github-using-graphql-v4-api

    Downloading Repositories

    download-repositories.py downloads all repositories in CSV files generated through search-repositories.py and generates an overview CSV file of the downloads. The script takes these arguments in the following order:

    1. Name of the repositories CSV files generated through search-repositories.py, separated by commas.
    2. Output directory to download the repositories to.
    3. Name of the CSV output file.

    The script only downloads a shallow recursive copy of the HEAD of the repo, i.e., only the main branch's most recent state, including submodules, without the rest of the git history. Each repository is downloaded to a subfolder named by the repository's ID.
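    As referenced above, a minimal sketch of joining the metadata tables; pandas is an assumption, and any CSV reader works:

```python
# A minimal sketch of joining the metadata tables described above.
# pandas is an assumption; any CSV reader works.
import pandas as pd

repositories = pd.read_csv("repositories.csv")
programs = pd.read_csv("programs.csv")

# Attach repository metadata to each IaC program, then count programs
# per PL-IaC solution and programming language.
merged = programs.merge(
    repositories,
    left_on="repository",
    right_on="ID",
    suffixes=("_program", "_repository"),
)
print(merged.groupby(["solution", "language"]).size())
```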
