Automatically describing images using natural sentences is an essential task for the inclusion of visually impaired people on the Internet. Although there are many datasets in the literature, most of them contain only English captions, whereas datasets with captions described in other languages are scarce.
The #PraCegoVer movement arose on the Internet, stimulating social media users to publish images, tag them with #PraCegoVer, and add a short description of their content. Inspired by this movement, we have proposed #PraCegoVer, a multi-modal dataset with Portuguese captions based on posts from Instagram. It is the first large dataset for image captioning in Portuguese with freely annotated images.
Dataset Structure
The dataset comprises a directory images containing the images and a file dataset.json, which contains a list of JSON objects with the following attributes:
user: anonymized user that made the post;
filename: image file name;
raw_caption: raw caption;
caption: clean caption;
date: post date.
Each instance in dataset.json is associated with exactly one image in the images directory, whose file name is given by the attribute filename. We also provide a sample with five instances, so users can get an overview of the dataset before downloading it completely.
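To illustrate this layout, here is a minimal sketch that iterates over dataset.json and pairs each caption with its image file, assuming dataset.json holds a single JSON array with the attributes listed above (the slice of five instances is only illustrative):

```python
import json
import os

# Assumed layout: dataset.json and an images/ directory side by side.
with open("dataset.json", encoding="utf-8") as f:
    instances = json.load(f)

for instance in instances[:5]:
    image_path = os.path.join("images", instance["filename"])
    print(instance["user"], instance["date"], image_path)
    print("caption:", instance["caption"])
```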
Download Instructions
If you just want an overview of the dataset structure, you can download sample.tar.gz. However, if you want to use the dataset, or any of its subsets (63k and 173k), you must download all the files and run the following commands to join and uncompress them:
cat images.tar.gz.part* > images.tar.gz
tar -xzvf images.tar.gz
Alternatively, you can download the entire dataset from the terminal using the Python script download_dataset.py available in the PraCegoVer repository. In this case, first download the script and create an access token here. Then, run the following command to download and uncompress the image files:
python download_dataset.py --access_token=
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourages poor coding practices, and that their results can be hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub.
This repository contains two files:
The dump.tar.bz2 file contains a PostgreSQL dump of the database, with all the data we extracted from the notebooks.
The jupyter_reproducibility.tar.bz2 file contains all the scripts we used to query and download Jupyter Notebooks, extract data from them, and analyze the data. It is organized as follows:
In the remainder of this text, we give instructions for reproducing the analyses using the data provided in the dump, and for reproducing the collection by collecting data from GitHub again.
Reproducing the Analysis
This section shows how to load the data into the database and run the analysis notebooks. In the analysis, we used the following environment:
Ubuntu 18.04.1 LTS
PostgreSQL 10.6
Conda 4.5.11
Python 3.7.2
PdfCrop 2012/11/02 v1.38
First, download dump.tar.bz2 and extract it:
tar -xjf dump.tar.bz2
It extracts the file db2019-03-13.dump. Create a database in PostgreSQL (we call it "jupyter"), and use psql to restore the dump:
psql jupyter < db2019-03-13.dump
It populates the database with the dump. Now, configure the connection string for sqlalchemy by setting the environment variable JUP_DB_CONNECTION:
export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter";
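As a quick sanity check that the connection string works, a minimal sketch that reads JUP_DB_CONNECTION and lists the tables restored from the dump (this is an illustration, not part of the analysis notebooks; it assumes sqlalchemy and a PostgreSQL driver are installed):

```python
import os

from sqlalchemy import create_engine, inspect

# Read the same environment variable used by the analysis notebooks.
engine = create_engine(os.environ["JUP_DB_CONNECTION"])

# List the tables restored from db2019-03-13.dump.
print(inspect(engine).get_table_names())
```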
Download and extract jupyter_reproducibility.tar.bz2:
tar -xjf jupyter_reproducibility.tar.bz2
Create a conda environment with Python 3.7:
conda create -n analyses python=3.7
conda activate analyses
Go to the analyses folder and install all the dependencies from requirements.txt:
cd jupyter_reproducibility/analyses
pip install -r requirements.txt
To reproduce the analyses, run Jupyter in this folder:
jupyter notebook
Execute the notebooks in this order:
Reproducing or Expanding the Collection
The collection demands more steps to reproduce and takes much longer to run (months). It also involves running arbitrary code on your machine. Proceed with caution.
Requirements
This time, we have extra requirements:
All the analysis requirements
lbzip2 2.5
gcc 7.3.0
GitHub account
Gmail account
Environment
First, set the following environment variables:
export JUP_MACHINE="db"; # machine identifier
export JUP_BASE_DIR="/mnt/jupyter/github"; # place to store the repositories
export JUP_LOGS_DIR="/home/jupyter/logs"; # log files
export JUP_COMPRESSION="lbzip2"; # compression program
export JUP_VERBOSE="5"; # verbose level
export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter"; # sqlalchemy connection
export JUP_GITHUB_USERNAME="github_username"; # your github username
export JUP_GITHUB_PASSWORD="github_password"; # your github password
export JUP_MAX_SIZE="8000.0"; # maximum size of the repositories directory (in GB)
export JUP_FIRST_DATE="2013-01-01"; # initial date to query github
export JUP_EMAIL_LOGIN="gmail@gmail.com"; # your gmail address
export JUP_EMAIL_TO="target@email.com"; # email that receives notifications
export JUP_OAUTH_FILE="~/oauth2_creds.json" # oauth2 authentication file
export JUP_NOTEBOOK_INTERVAL=""; # notebook id interval for this machine. Leave it blank
export JUP_REPOSITORY_INTERVAL=""; # repository id interval for this machine. Leave it blank
export JUP_WITH_EXECUTION="1"; # execute python notebooks
export JUP_WITH_DEPENDENCY="0"; # run notebooks with and without declared dependencies
export JUP_EXECUTION_MODE="-1"; # run following the execution order
export JUP_EXECUTION_DIR="/home/jupyter/execution"; # temporary directory for running notebooks
export JUP_ANACONDA_PATH="~/anaconda3"; # conda installation path
export JUP_MOUNT_BASE="/home/jupyter/mount_ghstudy.sh"; # bash script to mount base dir
export JUP_UMOUNT_BASE="/home/jupyter/umount_ghstudy.sh"; # bash script to umount base dir
export JUP_NOTEBOOK_TIMEOUT="300"; # timeout for the extraction
# Frequency of log reports
export JUP_ASTROID_FREQUENCY="5";
export JUP_IPYTHON_FREQUENCY="5";
export JUP_NOTEBOOKS_FREQUENCY="5";
export JUP_REQUIREMENT_FREQUENCY="5";
export JUP_CRAWLER_FREQUENCY="1";
export JUP_CLONE_FREQUENCY="1";
export JUP_COMPRESS_FREQUENCY="5";
export JUP_DB_IP="localhost"; # postgres database IP
Then, configure the file ~/oauth2_creds.json, according to yagmail documentation: https://media.readthedocs.org/pdf/yagmail/latest/yagmail.pdf
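For reference, a minimal sketch of sending a test notification with yagmail using the OAuth2 file configured above (the addresses are the placeholders from the environment variables, not real accounts):

```python
import yagmail

# Uses the OAuth2 credentials file referenced by JUP_OAUTH_FILE.
yag = yagmail.SMTP("gmail@gmail.com", oauth2_file="~/oauth2_creds.json")
yag.send(
    to="target@email.com",
    subject="ghstudy notification test",
    contents="If you can read this, the yagmail OAuth2 setup works.",
)
```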
Configure the mount_ghstudy.sh and umount_ghstudy.sh scripts. The first one should mount the folder that stores the directories. The second one should umount it. You can leave the scripts blank, but it is not advisable, as the reproducibility study runs arbitrary code on your machine and you may lose your data.
Scripts
Download and extract jupyter_reproducibility.tar.bz2:
tar -xjf jupyter_reproducibility.tar.bz2
Install 5 conda environments and 5 anaconda environments, one for each Python version. In each of them, upgrade pip, install pipenv, and install the archaeology package (note that it is a local package that has not been published to PyPI; make sure to use the -e option):
Conda 2.7
conda create -n raw27 python=2.7 -y
conda activate raw27
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 2.7
conda create -n py27 python=2.7 anaconda -y
conda activate py27
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.4
It requires a manual jupyter and pathlib2 installation due to some incompatibilities found in the default installation.
conda create -n raw34 python=3.4 -y
conda activate raw34
conda install jupyter -c conda-forge -y
conda uninstall jupyter -y
pip install --upgrade pip
pip install jupyter
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
pip install pathlib2
Anaconda 3.4
conda create -n py34 python=3.4 anaconda -y
conda activate py34
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.5
conda create -n raw35 python=3.5 -y
conda activate raw35
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 3.5
It requires the manual installation of other anaconda packages.
conda create -n py35 python=3.5 anaconda -y
conda install -y appdirs atomicwrites keyring secretstorage libuuid navigator-updater prometheus_client pyasn1 pyasn1-modules spyder-kernels tqdm jeepney automat constantly anaconda-navigator
conda activate py35
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.6
conda create -n raw36 python=3.6 -y
conda activate raw36
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 3.6
conda create -n py36 python=3.6 anaconda -y
conda activate py36
conda install -y anaconda-navigator jupyterlab_server navigator-updater
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.7
conda create -n raw37 python=3.7 -y
conda activate raw37
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
https://choosealicense.com/licenses/other/
Python Copilot Large Coding Dataset
This dataset is a subset of the matlok python copilot datasets. Please refer to the Multimodal Python Copilot Training Overview for more details on how to use this dataset.
Details
Each row contains Python code, either a class method or a global function, along with imported modules, base classes (if any), exceptions (ordered based on the code), returns (ordered based on the code), arguments (ordered based on the code), and more.
Rows: 2350782… See the full description on the dataset page: https://huggingface.co/datasets/matlok/python-copilot-training-from-many-repos-large.
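For a quick look at the rows, a minimal sketch using the Hugging Face datasets library (streaming to avoid downloading the full set; the split name 'train' is an assumption):

```python
from datasets import load_dataset

# Stream the dataset to inspect a few rows without downloading everything.
ds = load_dataset(
    "matlok/python-copilot-training-from-many-repos-large",
    split="train",
    streaming=True,
)
for i, row in enumerate(ds):
    print(sorted(row.keys()))
    if i == 2:
        break
```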
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction
This archive contains the ApacheJIT dataset presented in the paper "ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction" as well as the replication package. The paper is submitted to MSR 2022 Data Showcase Track.
The datasets are available under directory dataset. There are 4 datasets in this directory.
In addition to the dataset, we also provide the scripts that we used to build it. These scripts are written in Python 3.8, so Python 3.8 or above is required. To set up the environment, we have provided a list of required packages in the file requirements.txt. Additionally, one filtering step requires GumTree [1]. For Java, GumTree requires Java 11. For other languages, external tools are needed. The installation guide and more details can be found here.
The scripts comprise Python scripts under the directory src and Python notebooks under the directory notebooks. The Python scripts are mainly responsible for conducting the GitHub search via the GitHub search API and collecting commits through the PyDriller package [2]. The notebooks link the fixed issue reports with their corresponding fixing commits and apply some filtering steps. The bug-inducing candidates are then filtered again using the gumtree.py script, which utilizes the GumTree package. Finally, the remaining bug-inducing candidates are combined with the clean commits in the dataset_construction notebook to form the entire dataset.
More specifically, git_token.py handles the GitHub API token that is necessary for requests to the GitHub API. The script collector.py performs the GitHub search. Tracing changed lines and git annotate are done in gitminer.py using PyDriller. Finally, gumtree.py applies 4 filtering steps (number of lines, number of files, language, and change significance).
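To give a flavor of the commit-collection step, a minimal PyDriller sketch in the 2.x API style (the repository URL is only an example; the actual scripts drive this through collector.py and gitminer.py):

```python
from pydriller import Repository

# Iterate over commits of an Apache project and print basic change information.
for commit in Repository("https://github.com/apache/commons-lang").traverse_commits():
    changed_files = [m.new_path for m in commit.modified_files]
    print(commit.hash[:10], commit.author_date, len(changed_files), "files changed")
    break  # remove this to traverse the full history
```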
References:
[1] Jean-Rémy Falleri, Floréal Morandat, Xavier Blanc, Matias Martinez, and Martin Monperrus. 2014. Fine-grained and accurate source code differencing. In ACM/IEEE International Conference on Automated Software Engineering, ASE ’14, Vasteras, Sweden, September 15-19, 2014. 313–324.
[2] Davide Spadini, Maurício Aniche, and Alberto Bacchelli. 2018. PyDriller: Python Framework for Mining Software Repositories. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Lake Buena Vista, FL, USA) (ESEC/FSE 2018). Association for Computing Machinery, New York, NY, USA, 908–911.
https://www.gnu.org/licenses/gpl-3.0.html
PARAMOUNT: parallel modal analysis of large datasets
PARAMOUNT is a Python package developed at the University of Twente to perform modal analysis of large numerical and experimental datasets. A brief video introduction to the theory and methodology is presented here.
Features
- Distributed processing of data on local machines or clusters using Dask Distributed
- Reading CSV files in glob format from specified folders
- Extracting relevant columns from CSV files and writing Parquet database for each specified variable
- Distributed computation of Proper Orthogonal Decomposition (POD)
- Writing U, S and V matrices into Parquet database for further analysis
- Visualizing POD modes and coefficients using pyplot
Using PARAMOUNT
Make sure to install the dependencies by running `pip install -r requirements.txt`
Refer to csv_example to see how to use PARAMOUNT to read CSV files, write the variables of interest into Parquet datasets and inspect the final datasets.
Refer to svd_example to see how to read Parquet datasets, compute the Singular Value Decomposition, and store the results in Parquet format.
To visualize the results, you can simply read the U, S and V Parquet files and use your plotting tool of choice. Examples are provided in viz_example.
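As an illustration of this last step, a minimal sketch that reads the decomposition matrices from Parquet with pandas and plots the singular values with pyplot (the file names are assumptions; see viz_example for the actual workflow):

```python
import pandas as pd
import matplotlib.pyplot as plt

# File names are illustrative; use the Parquet files written by the SVD step.
U = pd.read_parquet("U.parquet")
S = pd.read_parquet("S.parquet")
V = pd.read_parquet("V.parquet")

# Plot the singular value spectrum to judge how many POD modes are relevant.
plt.semilogy(S.to_numpy().ravel(), "o-")
plt.xlabel("mode index")
plt.ylabel("singular value")
plt.title("POD singular values")
plt.show()
```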
Author and Acknowledgements
This package is developed by Alireza Ghasemi (alireza.ghasemi@utwente.nl) at University of Twente under the MAGISTER (https://www.magister-itn.eu/) project. This project has received funding from the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement No. 766264.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Please refer to the original article for further data description: Jan Luxemburk et al. Fine-grained TLS services classification with reject option, Computer Networks, 2023, 109467, ISSN 1389-1286, https://doi.org/10.1016/j.comnet.2022.109467
We recommend using the CESNET DataZoo python library, which facilitates the work with large network traffic datasets. More information about the DataZoo project can be found in the GitHub repository https://github.com/CESNET/cesnet-datazoo.
The recent success and proliferation of machine learning and deep learning have provided powerful tools, which are also utilized for encrypted traffic analysis, classification, and threat detection. These methods, neural networks in particular, are often complex and require a huge corpus of training data. Moreover, because most of the network traffic is being encrypted, the traditional deep-packet-inspecting (DPI) solutions are becoming obsolete, and there is an urgent need for modern classification methods capable of analyzing encrypted traffic. These methods have to forgo the packet's opaque payload and focus on flow statistics and packet metadata sequences like packet sizes, directions, and inter-arrival times. The classification can be further extended with the task of "rejecting" unknown traffic, i.e., the traffic not seen during the training phase. This makes the problem more challenging, and neural networks offer superior performance for tackling this problem. When the factors of (1) the hardness of classification of encrypted traffic with unknown traffic detection and (2) the neural networks' inherent need for large datasets are combined, the requirement for a rich, large, and up-to-date dataset is even stronger.
Therefore, we created a large dataset spanning two weeks, consisting of 141 million network flows, and having 191 fine-grained service labels. The dataset is intended as a benchmark for the task of identification of services in encrypted traffic with the detection of unknown services.
Data capture The data was captured in the flow monitoring infrastructure of the CESNET2 network. The capturing was done for two weeks between 4.10.2021 and 17.10.2021. The following table provides per-week flow count, capture period, and uncompressed size:
Dataset structure The dataset flows are delivered in compressed CSV files, which contain one flow per row. For each flow data file, there is a JSON file with the number of saved flows per service. There is also the stats-week.json file aggregating flow counts of a whole week and the stats-dataset.json file aggregating flow counts for the entire dataset. The mapping between services and service providers is provided in the servicemap.csv file, which also includes SNI domains used for ground truth labeling. The following table describes flow data fields in CSV files:
Link to other CESNET datasets
Please cite the original article:
@article{luxemburk_fine-grained-tls_2023, author = {Jan Luxemburk and Tomáš Čejka}, title = {Fine-grained TLS services classification with reject option}, journal = {Computer Networks}, volume = {220}, pages = {109467}, year = {2023}, issn = {1389-1286}, doi = {https://doi.org/10.1016/j.comnet.2022.109467}, url = {https://www.sciencedirect.com/science/article/pii/S1389128622005011} }
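Outside of DataZoo, the flow files can also be inspected directly with pandas; a minimal sketch, with file names as illustrative assumptions based on the dataset structure described above:

```python
import pandas as pd

# File names are illustrative; the dataset ships compressed CSVs with one flow per row.
flows = pd.read_csv("flows-day1.csv.gz")
servicemap = pd.read_csv("servicemap.csv")

print(flows.shape)
print(flows.columns.tolist())
print(servicemap.head())
```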
http://opendatacommons.org/licenses/dbcl/1.0/
This dataset was accumulated for my undergraduate thesis research. We were exploring deep learning models to perform text summarization for the Bangla language. As there weren't large datasets available on the internet, we decided to mine the Prothom Alo website and store their news article data. This dataset is the unaltered version of that mined data. It contains nearly 1 million news articles.
We mined this huge dataset using the Python Requests library and performed data handling using Pandas. The tool we developed for mining can be found here. The tool saves data in CSV format. Articles published in the same month and year are stored together. We mined news articles all the way back from 2009 to the beginning of 2021. There are 19 columns in total, which are mentioned here:
* Time related: updated_at, last_published_at, content_updated_at, created_at, content_created_at, published_at
* Text related: body, headline, subheadline, summary
* Category related: tags
* Metadata: id, read_time, word_count, URL, status, content_type, author_name
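A minimal pandas sketch for loading one of the monthly CSV files and keeping the text-related columns (the file name is an assumption; the column names follow the list above):

```python
import pandas as pd

# The file name is illustrative; articles are stored per month and year.
df = pd.read_csv("2020_01.csv")

# Keep the text-related columns for summarization experiments.
text_df = df[["headline", "subheadline", "summary", "body", "published_at"]]
print(text_df.shape)
print(text_df.head())
```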
Accumulation of this dataset wouldn't be possible without my teammates Ahmed Sadman Muhib and AKM Nahid Hasan. I am also thankful to my thesis supervisor Dr. Abu Raihan Mostofa Kamal for his guidance, support, and advice.
Python is a free computer language that prioritizes readability for humans and general application. It is one of the easier computer languages to learn and start with, especially with no prior programming knowledge. I have been using Python for Excel spreadsheet automation, data analysis, and data visualization. It has allowed me to better focus on learning how to automate my data analysis workload. I am currently examining the North Carolina Department of Environmental Quality (NCDEQ) database of water quality sampling for the Town of Nags Head, NC. It spans over 26 years (1997-2023) and lists a total of 41 different testing site locations. You can see at the bottom of image 2 below that I have 148,204 testing data points for the entirety of the NCDEQ testing for the state. From this large dataset, 34,759 data points are from Dare County (Nags Head) specifically, subdivided into testing sites.
We implemented automated workflows using Jupyter notebooks for each state. The GIS processing, crucial for merging, extracting, and projecting GeoTIFF data, was performed using ArcPy, a Python package for geographic data analysis, conversion, and management within ArcGIS (Toms, 2015). After generating state-scale LES (large extent spatial) datasets in GeoTIFF format, we utilized the xarray and rioxarray Python packages to convert GeoTIFF to NetCDF. Xarray is a Python package for working with multi-dimensional arrays, and rioxarray is the rasterio xarray extension. Rasterio is a Python library to read and write GeoTIFF and other raster formats. Xarray facilitated data manipulation and metadata addition in the NetCDF file, while rioxarray was used to save the GeoTIFF as NetCDF. These procedures resulted in the creation of three HydroShare resources (HS 3, HS 4 and HS 5) for sharing state-scale LES datasets. Notably, due to licensing constraints with ArcGIS Pro, a commercial GIS software, the Jupyter notebook development was undertaken on a Windows OS.
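A minimal sketch of the GeoTIFF-to-NetCDF conversion step with rioxarray (the file names and metadata attribute are assumptions, not the exact notebook code):

```python
import rioxarray

# Open a state-scale LES GeoTIFF as an xarray DataArray with spatial metadata.
les = rioxarray.open_rasterio("state_les.tif")

# Add descriptive metadata before export (the attribute value is illustrative).
les.attrs["description"] = "Large extent spatial (LES) dataset"

# Save the array as NetCDF.
les.to_netcdf("state_les.nc")
```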
https://choosealicense.com/licenses/other/
Dataset Card for The Stack
Changelog
| Release | Description |
|---|---|
| v1.0 | Initial release of the Stack. Included 30 programming languages and 18 permissive licenses. Note: Three included licenses (MPL/EPL/LGPL) are considered weak copyleft licenses. The resulting near-deduplicated dataset is 3TB in size. |
| v1.1 | The three copyleft licenses (MPL/EPL/LGPL) were excluded and the list of permissive licenses extended to 193 licenses in total. The list of programming languages… |

See the full description on the dataset page: https://huggingface.co/datasets/bigcode/the-stack.
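For example, a single language subset can be loaded with the Hugging Face datasets library; a minimal sketch (the data_dir value follows the per-language layout documented on the dataset page, and streaming avoids downloading terabytes):

```python
from datasets import load_dataset

# Load only the Python subset of The Stack, streaming to avoid a full download.
ds = load_dataset(
    "bigcode/the-stack",
    data_dir="data/python",
    split="train",
    streaming=True,
)
for i, example in enumerate(ds):
    print(sorted(example.keys()))
    if i == 2:
        break
```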
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Zhang et al. (https://link.springer.com/article/10.1140/epjb/e2017-80122-8) suggest a temporal random network with changing dynamics that follow a Markov process, allowing for a continuous-time network history, moving from a static definition of a random graph with a fixed number of nodes n and edge probability p to a temporal one. Defining lambda as the probability per time granule of a new edge appearing and mu as the probability per time granule of an existing edge disappearing, Zhang et al. show that the equilibrium probability of an edge is p = lambda / (lambda + mu). Our implementation, a Python package that we refer to as RandomDynamicGraph (https://github.com/ScanLab-ossi/DynamicRandomGraphs), generates large-scale dynamic random graphs according to the defined density. The package focuses on massive data generation; it uses efficient math calculations, writes to file instead of keeping data in memory when datasets are too large, and supports multi-processing. Please note the datetime is arbitrary.
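A minimal simulation sketch of the single-edge Markov process, checking that the empirical edge density approaches lambda / (lambda + mu) (this is an illustration, not the RandomDynamicGraph API):

```python
import random

lam, mu = 0.2, 0.6   # per-time-granule appearance / disappearance probabilities
steps = 100_000

edge_present = False
time_present = 0
for _ in range(steps):
    if edge_present:
        if random.random() < mu:
            edge_present = False
    else:
        if random.random() < lam:
            edge_present = True
    time_present += edge_present

print("empirical density:", time_present / steps)
print("lambda / (lambda + mu):", lam / (lam + mu))
```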
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository is composed of 2 compressed files, with the contents as described next.
--- code.tar.gz --- The source code that implements the pipeline, as well as code and scripts needed to retrieve time series, create the plots or run the experiments. More specifically:
+ prepare.py and main.py ⇨
The Python programs that implement the pipeline, both the auxiliary and the main pipeline
stages, respectively.
+ 'anomaly' and 'config' folders ⇨
Scripts and Python files containing the configuration and some basic functions that are
used to retrieve the information needed to process the data, like the actual resource
time series from OpenTSDB, or the job metadata from Slurm.
+ 'functions' folder ⇨
Several folders with the Python programs that implement all the stages of the pipeline,
either for the Machine Learning processing (e.g., extractors, aggregators, models), or
the technical aspect of the pipeline (e.g., pipelines, transformer).
+ plotDF.py ⇨
A Python program used to create the different plots presented, from the resource time
series to the evaluation plots.
+ several bash scripts ⇨
Used to run the experiments using a specific configuration, whether regarding which
transformers are chosen and how they are parametrized, or more technical aspects
involving how the pipeline is executed.
--- data.tar.gz --- The actual data and results, organized as follows:
+ jobs ⇨
All the jobs' resource time series plots for all the experiments, with a folder used
for each experiment. Inside each folder all the jobs are separated according to their
id, containing the plots for the different system resources (e.g., User CPU, Cached memory).
+ plots ⇨
All the predictions' plots for all the experiments in separated folders, mainly used for
evaluation purposes (e.g., scatter plot, heatmaps, Andrews curves, dendrograms). These
plots are available for all the predictors resulting from the pipeline execution. In
addition, for each predictor it is also possible to visualize the resource time series
grouped by clusters. Finally, the projections as generated by the dimension reduction
models, and the outliers detected, are also available for each experiment.
+ datasets ⇨
The datasets used for the experiments, which include the lists of job IDs to be processed
(CSV files) and the results of each stage of the pipeline (e.g., features, predictions),
and the output text files as generated by several pipeline stages. Among these latter
files, it is worth noting the evaluation ones, which include all the prediction scores.
This dataset is released along with the paper “A Large Scale Benchmark for Uplift Modeling” by Eustache Diemert, Artem Betlei, Christophe Renaudin (Criteo AI Lab), and Massih-Reza Amini (LIG, Grenoble INP).
This work was published in: AdKDD 2018 Workshop, in conjunction with KDD 2018.
This dataset is constructed by assembling data resulting from several incrementality tests, a particular randomized trial procedure where a random part of the population is prevented from being targeted by advertising. It consists of 25M rows, each one representing a user with 11 features, a treatment indicator and 2 labels (visits and conversions).
Here is a detailed description of the fields (they are comma-separated in the file):
The dataset was collected and prepared with uplift prediction in mind as the main task. Additionally we can foresee related usages such as but not limited to:
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('criteo', split='train')
for ex in ds.take(4):
print(ex)
See the guide for more information on tensorflow_datasets.
WikiHow is a new large-scale dataset using the online WikiHow (http://www.wikihow.com/) knowledge base.
There are two features: - text: wikihow answers texts. - headline: bold lines as summary.
There are two separate versions: - all: consisting of the concatenation of all paragraphs as the articles and the bold lines as the reference summaries. - sep: consisting of each paragraph and its summary.
Download "wikihowAll.csv" and "wikihowSep.csv" from https://github.com/mahnazkoupaee/WikiHow-Dataset and place them in manual folder https://www.tensorflow.org/datasets/api_docs/python/tfds/download/DownloadConfig. Train/validation/test splits are provided by the authors. Preprocessing is applied to remove short articles (abstract length < 0.75 article length) and clean up extra commas.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('wikihow', split='train')
for ex in ds.take(4):
print(ex)
See the guide for more information on tensorflow_datasets.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains a selection of behavioral datasets collected using soluble agents and labeled using realistic threat simulation and IDS rules. The collected datasets are anonymized and aggregated using time window representations. The dataset generation pipeline preprocesses the application logs from the corporate network, structures them according to entities and users inventory, and labels them based on the IDS and phishing simulation appliances.
This repository is associated with the article "RBD24: A labelled dataset with risk activities using log applications data" published in the journal Computers & Security. For more information go to https://doi.org/10.1016/j.cose.2024.104290
The RBD24 dataset comprises various risk activities collected from real entities and users over a period of 15 days, with the samples segmented by Desktop (DE) and Smartphone (SM) devices.
| DatasetId | Entity | Observed Behaviour | Groundtruth | Sample Shape |
|---|---|---|---|---|
| Crypto_desktop.parquet | DE | Miner Checking | IDS | 0: 738/161202, 1: 11/1343 |
| Crypto_smarphone.parquet | SM | Miner Checking | IDS | 0: 613/180021, 1: 4/956 |
| OutFlash_desktop.parquet | DE | Outdated software components | IDS | 0: 738/161202, 1: 56/10820 |
| OutFlash_smartphone.parquet | SM | Outdated software components | IDS | 0: 613/180021, 1: 22/6639 |
| OutTLS_desktop.parquet | DE | Outdated TLS protocol | IDS | 0: 738/161202, 1: 18/2458 |
| OutTLS_smartphone.parquet | SM | Outdated TLS protocol | IDS | 0: 613/180021, 1: 11/2930 |
| P2P_desktop.parquet | DE | P2P Activity | IDS | 0: 738/161202, 1: 177/35892 |
| P2P_smartphone.parquet | SM | P2P Activity | IDS | 0: 613/180021, 1: 94/21688 |
| NonEnc_desktop.parquet | DE | Non-encrypted password | IDS | 0: 738/161202, 1: 291/59943 |
| NonEnc_smaprthone.parquet | SM | Non-encrypted password | IDS | 0: 613/180021, 1: 167/41434 |
| Phishing_desktop.parquet | DE | Phishing email | Experimental Campaign | 0: 98/13864, 1: 19/3072 |
| Phishing_smartphone.parquet | SM | Phishing email | Experimental Campaign | 0: 117/34006, 1: 26/8968 |
To collect the dataset, we have deployed multiple agents and soluble agents within an infrastructure with more than 3k entities, comprising laptops, workstations, and smartphone devices. The methods to build ground truth are as follows:
- Simulator: We launch different realistic phishing campaigns, aiming to expose user credentials or defeat access to a service.
- IDS: We deploy an IDS to collect various alerts associated with behavioral anomalies, such as cryptomining or peer-to-peer traffic.
For each user exposed to the behaviors stated in the summary table, different TWs are computed, aggregating user behavior within a fixed time interval. These TWs serve as the basis for generating various supervised and unsupervised methods.
The time windows (TW) are a data representation based on aggregated logs from multimodal sources between two timestamps. In this study, logs from HTTP, DNS, SSL, and SMTP are taken into consideration, allowing the construction of rich behavioral profiles. The indicators described in the TE are a set of manually curated interpretable features designed to describe device-level properties within the specified time frame. The most influential features are described below.
Parquet format uses a columnar storage format, which enhances efficiency and compression, making it suitable for large datasets and complex analytical tasks. It has support across various tools and languages, including Python. Parquet can be used with the pandas library in Python, allowing pandas to read and write Parquet files through the `pyarrow` or `fastparquet` libraries. Its efficient data retrieval and fast query execution improve performance over other formats. Compared to row-based storage formats such as CSV, Parquet's columnar storage greatly reduces read times and storage costs for large datasets. Although binary formats like HDF5 are effective for specific use cases, Parquet provides broader compatibility and optimization. The provided datasets use the Parquet format. Here's an example of how to retrieve data using pandas; ensure you have the fastparquet library installed:
```python
import pandas as pd

# Reading a Parquet file
df = pd.read_parquet('path_to_your_file.parquet', engine='fastparquet')
```
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This New Zealand Point Cloud Classification Deep Learning Package will classify point clouds into tree and background classes. This model is optimized to work with New Zealand aerial LiDAR data. The classification of point cloud datasets to identify trees is useful in applications such as high-quality 3D basemap creation, urban planning, forestry workflows, and planning climate change response. Trees can have a complex irregular geometrical structure that is hard to capture using traditional means. Deep learning models are highly capable of learning these complex structures and giving superior results. This model is designed to extract trees in both urban and rural areas in New Zealand. The training/testing/validation datasets are taken within New Zealand, resulting in high reliability in recognizing the patterns of NZ common building architecture.
Licensing requirements
ArcGIS Desktop - ArcGIS 3D Analyst extension for ArcGIS Pro
Using the model
The model can be used in ArcGIS Pro's Classify Point Cloud Using Trained Model tool. Before using this model, ensure that the supported deep learning frameworks libraries are installed. For more details, check Deep Learning Libraries Installer for ArcGIS. Note: Deep learning is computationally intensive, and a powerful GPU is recommended to process large datasets.
Input
The model is trained with classified LiDAR that follows the LINZ base specification. The input data should be similar to this specification. Note: The model is dependent on additional attributes such as Intensity, Number of Returns, etc., similar to the LINZ base specification. This model is trained to work on classified and unclassified point clouds that are in a projected coordinate system, in which the units of X, Y and Z are based on the metric system of measurement. If the dataset is in degrees or feet, it needs to be re-projected accordingly. The model was trained using a training dataset with the full set of points. Therefore, it is important to make the full set of points available to the neural network while predicting - allowing it to better discriminate points of 'class of interest' versus background points. It is recommended to use 'selective/target classification' and 'class preservation' functionalities during prediction to have better control over the classification and scenarios with false positives.
The model was trained on airborne lidar datasets and is expected to perform best with similar datasets. Classification of terrestrial point cloud datasets may work but has not been validated. For such cases, this pre-trained model may be fine-tuned to save on cost, time, and compute resources while improving accuracy. Another example where fine-tuning this model can be useful is when the object of interest is tram wires, railway wires, etc., which are geometrically similar to electricity wires. When fine-tuning this model, the target training data characteristics such as class structure, maximum number of points per block and extra attributes should match those of the data originally used for training this model (see Training data section below).
Output
The model will classify the point cloud into the following classes, with their meaning as defined by the American Society for Photogrammetry and Remote Sensing (ASPRS):
0 Background
5 Trees / High-vegetation
Applicable geographies
The model is expected to work well in New Zealand. It has been seen to produce favorable results in many regions. However, results can vary for datasets that are statistically dissimilar to the training data.

| Dataset | City |
|---|---|
| Training | Wellington |
| Testing | Tawa |
| Validating | Christchurch |

Model architecture
This model uses the PointCNN model architecture implemented in the ArcGIS API for Python.
Accuracy metrics
The table below summarizes the accuracy of the predictions on the validation dataset.

| Class | Precision | Recall | F1-score |
|---|---|---|---|
| Never Classified | 0.991200 | 0.975404 | 0.983239 |
| High Vegetation | 0.933569 | 0.975559 | 0.954102 |

Training data
This model is trained on a classified dataset originally provided by OpenTopography, with < 1% manual labelling and correction. Train-test split percentage: {Train: 80%, Test: 20%}. This ratio was chosen based on the analysis of previous epoch statistics, which showed a decent improvement. The training data used has the following characteristics:

| Characteristic | Value |
|---|---|
| X, Y, and Z linear unit | Meter |
| Z range | -121.69 m to 26.84 m |
| Number of Returns | 1 to 5 |
| Intensity | 16 to 65520 |
| Point spacing | 0.2 ± 0.1 |
| Scan angle | -15 to +15 |
| Maximum points per block | 8192 |
| Block Size | 20 Meters |
| Class structure | [0, 5] |

Sample results
The model was used to classify a dataset with 5 pts/m density from the Christchurch city dataset. The model's performance is directly proportional to the dataset point density and the exclusion of noise points. To learn how to use this model, see this story.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The COCO dataset is a large dataset of labeled images and annotations. It is a popular dataset for machine learning and artificial intelligence research. The dataset consists of 330,000 images and 500,000 object annotations. The annotations include the bounding boxes of objects in the images, as well as the labels of the objects.
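As an illustration of working with the annotations, a minimal pycocotools sketch (the annotation file path and category name are assumptions; pycocotools is not bundled with the dataset):

```python
from pycocotools.coco import COCO

# The path is illustrative; point it at a downloaded COCO annotation file.
coco = COCO("annotations/instances_val2017.json")

# Look up images containing a given category and read their bounding boxes.
cat_ids = coco.getCatIds(catNms=["dog"])
img_ids = coco.getImgIds(catIds=cat_ids)
anns = coco.loadAnns(coco.getAnnIds(imgIds=img_ids[:1], catIds=cat_ids))
for ann in anns:
    print(ann["category_id"], ann["bbox"])  # bbox is [x, y, width, height]
```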
CheXpert is a large dataset of chest X-rays and a competition for automated chest X-ray interpretation, which features uncertainty labels and radiologist-labeled reference standard evaluation sets. It consists of 224,316 chest radiographs of 65,240 patients, where the chest radiographic examinations and the associated radiology reports were retrospectively collected from Stanford Hospital. Each report was labeled for the presence of 14 observations as positive, negative, or uncertain. We decided on the 14 observations based on the prevalence in the reports and clinical relevance.
The CheXpert dataset must be downloaded separately after reading and agreeing to a Research Use Agreement. To do so, please follow the instructions on the website, https://stanfordmlgroup.github.io/competitions/chexpert/.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('chexpert', split='train')
for ex in ds.take(4):
print(ex)
See the guide for more information on tensorflow_datasets.
This item contains data and code used in experiments that produced the results for Sadler et al. (2022) (see below for full reference). We ran five experiments for the analysis: Experiment A, Experiment B, Experiment C, Experiment D, and Experiment AuxIn. Experiment A tested multi-task learning for predicting streamflow with 25 years of training data and using a different model for each of 101 sites. Experiment B tested multi-task learning for predicting streamflow with 25 years of training data and using a single model for all 101 sites. Experiment C tested multi-task learning for predicting streamflow with just 2 years of training data. Experiment D tested multi-task learning for predicting water temperature with over 25 years of training data. Experiment AuxIn used water temperature as an input variable for predicting streamflow. These experiments and their results are described in detail in the WRR paper. Data from a total of 101 sites across the US were used for the experiments. The model input data and streamflow data were from the Catchment Attributes and Meteorology for Large-sample Studies (CAMELS) dataset (Newman et al. 2014, Addor et al. 2017). The water temperature data were gathered from the National Water Information System (NWIS) (U.S. Geological Survey, 2016). The contents of this item are broken into 13 files or groups of files aggregated into zip files:
https://www.nist.gov/open/license
This is the source code package for the labbench python module, version 0.20, which is its first public release. The purpose of labbench is to streamline and organize complicated laboratory automation tasks that involve large-scale benchtop automation, concurrency, and/or data management. It is built around a system of wrappers that facilitate robust, concise exception handling, type checking, API conventions, and synchronized device connection through python context blocks. The wrappers also provide convenient new functionality, such as support for automated status displays in jupyter notebooks, simplified threaded concurrency, and automated, type-safe logging to relational databases. Together, these features help to minimize the amount of "copy-and-paste" code that can make your lab automation scripts error-prone and difficult to maintain. The python code that results can be clear, concise, reusable and maintainable, and provide consistent formatting for stored data. The result helps researchers to meet NIST's open data obligations, even for complicated, large, and heterogeneous datasets. Several past and ongoing projects in the NIST Communication Technology Laboratory (CTL) published data that were acquired by automation in labbench. We release it here both for transparency and to invite public use and feedback. Ongoing updates to this source code will be maintained on the NIST github page at https://github.com/usnistgov/labbench. The code was developed in python, documented with the python sphinx package and markdown, and shared through the USNISTGOV organization on GitHub.
INSTALLATION
labbench can run on any computer that supports python 3.6. The hardware requirements are discussed here: https://docs.anaconda.com/anaconda/install/#requirements
1. Install your favorite distribution of a python version 3.6 or greater
2. In a command prompt, pip install git+https://gitlab.nist.gov/gitlab/ssm/labbench
3. (Optional) install an NI VISA [1] runtime, for example this one for windows.
USAGE
The source distribution contains detailed information including:
* README.md - documentation to get started using labbench
* LICENSE.md - license and redistribution information
* doc/labbench-api.pdf - complete listing of the module and documentation