Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Indian Names Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/ananysharma/indian-names-dataset on 28 January 2022.
--- Dataset description provided by original source is as follows ---
This dataset was useful to me for a project I was working on, where the problem was to extract names from unstructured text; I am still working on it. I felt like sharing it, as some people might find it useful for Named Entity Recognition and other NLP tasks. If you want, you can work on extracting names from unstructured text without any context, for example when names must be extracted from a document where no context is present. You can share your work and we can work together to improve it.
The dataset contains a male and a female name dataset, along with a Python preprocessing file for merging the two. You can use either of the datasets, or see how the two can be merged (a minimal sketch of such a merge is shown below).
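A minimal sketch (an assumption, with hypothetical file names) of merging the two lists with pandas:

```python
# Minimal sketch (assumption): merging the male and female name files into one list.
# File names are hypothetical; adjust them to the actual files in the dataset.
import pandas as pd

male = pd.read_csv("Indian-Male-Names.csv")
female = pd.read_csv("Indian-Female-Names.csv")

names = pd.concat([male, female], ignore_index=True).drop_duplicates()
names.to_csv("indian_names_merged.csv", index=False)
```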
I got to know about this dataset from a GitHub repository, which can be visited here.
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Content of this repository
This is the repository that contains the scripts and dataset for the MSR 2019 mining challenge.
DATASET
The dataset was retrieved using Google BigQuery and dumped to a CSV file for further processing; this original file with no treatment is called jsanswers.csv. Here we can find the following information:
1. The Id of the question (PostId)
2. The content (in this case the code block)
3. The length of the code block
4. The line count of the code block
5. The score of the post
6. The title
A quick look at this file shows that a PostId can have multiple rows related to it; that is how multiple code blocks are saved in the database.
Filtered Dataset:
Extracting code from CSV
We used a Python script called "ExtractCodeFromCSV.py" to extract the code from the original CSV and merge all the code blocks into their respective JavaScript files, named after the PostId; this resulted in 336 thousand files.
Running ESLint
Due to the single-threaded nature of ESLint, and because running it on 336 thousand files took a huge toll on the machine, we created a script named "ESlintRunnerScript.py" to run ESLint. It splits the files into 20 evenly distributed parts and runs 20 ESLint processes to generate the reports, producing 20 JSON files (a sketch of such a runner is shown below).
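A minimal sketch (an assumption; paths, report names, and ESLint flags are illustrative and not taken from the original script) of such a parallel runner:

```python
# Minimal sketch (assumption): split the extracted .js files into 20 parts and run one
# ESLint process per part, producing 20 JSON reports.
import glob
import subprocess

files = sorted(glob.glob("extracted_js/*.js"))
n_parts = 20
chunks = [files[i::n_parts] for i in range(n_parts)]

procs = []
for i, chunk in enumerate(chunks):
    cmd = ["npx", "eslint", "--format", "json", "--output-file", f"report_{i}.json", *chunk]
    procs.append(subprocess.Popen(cmd))

for p in procs:
    p.wait()
```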
Number of Violations per Rule
This information was extracted using the script named "parser.py", which generated the file "NumberofViolationsPerRule.csv" containing the number of violations per rule used in the linter configuration across the dataset.
Number of Violations per Category
To produce relevant statistics about the dataset, we generated the number of violations per rule category as defined on the ESLint website; this information was extracted using the same "parser.py" script.
Individual Reports
This information was extracted from the JSON reports; it is a CSV file with the PostId and the violations per rule.
Rules
The file "Rules with categories" contains all the rules used and their categories.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains the metadata of the datasets published in 85 Dataverse installations and information about each installation's metadata blocks. It also includes the lists of pre-defined licenses or terms of use that dataset depositors can apply to the datasets they publish in the 58 installations that were running versions of the Dataverse software that include that feature. The data is useful for reporting on the quality of dataset and file-level metadata within and across Dataverse installations and improving understandings about how certain Dataverse features and metadata fields are used. Curators and other researchers can use this dataset to explore how well Dataverse software and the repositories using the software help depositors describe data.

How the metadata was downloaded
The dataset metadata and metadata block JSON files were downloaded from each installation between August 22 and August 28, 2023 using a Python script kept in a GitHub repo at https://github.com/jggautier/dataverse-scripts/blob/main/other_scripts/get_dataset_metadata_of_all_installations.py. In order to get the metadata from installations that require an installation account API token to use certain Dataverse software APIs, I created a CSV file with two columns: one column named "hostname" listing each installation URL in which I was able to create an account and another column named "apikey" listing my accounts' API tokens. The Python script expects the CSV file and the listed API tokens to get metadata and other information from installations that require API tokens.

How the files are organized
├── csv_files_with_metadata_from_most_known_dataverse_installations
│   ├── author(citation)_2023.08.22-2023.08.28.csv
│   ├── contributor(citation)_2023.08.22-2023.08.28.csv
│   ├── data_source(citation)_2023.08.22-2023.08.28.csv
│   ├── ...
│   └── topic_classification(citation)_2023.08.22-2023.08.28.csv
├── dataverse_json_metadata_from_each_known_dataverse_installation
│   ├── Abacus_2023.08.27_12.59.59.zip
│   │   ├── dataset_pids_Abacus_2023.08.27_12.59.59.csv
│   │   ├── Dataverse_JSON_metadata_2023.08.27_12.59.59
│   │   │   ├── hdl_11272.1_AB2_0AQZNT_v1.0(latest_version).json
│   │   │   └── ...
│   │   └── metadatablocks_v5.6
│   │       ├── astrophysics_v5.6.json
│   │       ├── biomedical_v5.6.json
│   │       ├── citation_v5.6.json
│   │       ├── ...
│   │       └── socialscience_v5.6.json
│   ├── ACSS_Dataverse_2023.08.26_22.14.04.zip
│   ├── ADA_Dataverse_2023.08.27_13.16.20.zip
│   ├── Arca_Dados_2023.08.27_13.34.09.zip
│   ├── ...
│   └── World_Agroforestry_-_Research_Data_Repository_2023.08.27_19.24.15.zip
├── dataverse_installations_summary_2023.08.28.csv
├── dataset_pids_from_most_known_dataverse_installations_2023.08.csv
├── license_options_for_each_dataverse_installation_2023.09.05.csv
└── metadatablocks_from_most_known_dataverse_installations_2023.09.05.csv

This dataset contains two directories and four CSV files not in a directory. One directory, "csv_files_with_metadata_from_most_known_dataverse_installations", contains 20 CSV files that list the values of many of the metadata fields in the citation metadata block and geospatial metadata block of datasets in the 85 Dataverse installations. For example, author(citation)_2023.08.22-2023.08.28.csv contains the "Author" metadata for the latest versions of all published, non-deaccessioned datasets in the 85 installations, where there's a row for author names, affiliations, identifier types and identifiers.
The other directory, "dataverse_json_metadata_from_each_known_dataverse_installation", contains 85 zipped files, one for each of the 85 Dataverse installations whose dataset metadata I was able to download. Each zip file contains a CSV file and two sub-directories.

The CSV file contains the persistent IDs and URLs of each published dataset in the Dataverse installation as well as a column to indicate if the Python script was able to download the Dataverse JSON metadata for each dataset. It also includes the alias/identifier and category of the Dataverse collection that the dataset is in.

One sub-directory contains a JSON file for each of the installation's published, non-deaccessioned dataset versions. The JSON files contain the metadata in the "Dataverse JSON" metadata schema. The Dataverse JSON export of the latest version of each dataset includes "(latest_version)" in the file name. This should help those who are interested in the metadata of only the latest version of each dataset.

The other sub-directory contains information about the metadata models (the "metadata blocks" in JSON files) that the installation was using when the dataset metadata was downloaded. I included them so that they can be used when extracting metadata from the dataset's Dataverse JSON exports.

The dataverse_installations_summary_2023.08.28.csv file contains information about each installation, including its name, URL, Dataverse software version, and counts of dataset metadata...
https://www.gnu.org/licenses/lgpl-3.0-standalone.html
Cell line pharmacogenomics datasets for cancer biology and machine learning studies. The datasets are compatible with rcellminer and CellMinerCDB (see publications for details) and data can be extracted for use with Python-based projects.
An example for extracting data from the rcellminer and CellMinerCDB compatible packages:
# INSTALL ----
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("rcellminer")
# Replace path_to_file with the data package filename
install.packages(path_to_file, repos = NULL, type="source")
# GET DATA ----
## Replace nciSarcomaData with the name of the dataset throughout the code
library(nciSarcomaData)
## DRUG DATA ----
drugAct <- exprs(getAct(nciSarcomaData::drugData))
drugAnnot <- getFeatureAnnot(nciSarcomaData::drugData)[["drug"]]
## MOLECULAR DATA ----
### List available datasets
names(getAllFeatureData(nciSarcomaData::molData))
### Extract data and annotations
expData <- exprs(nciSarcomaData::molData[["exp"]])
mirData <- exprs(nciSarcomaData::molData[["mir"]])
expAnnot <- getFeatureAnnot(nciSarcomaData::molData)[["exp"]]
mirAnnot <- getFeatureAnnot(nciSarcomaData::molData)[["mir"]]
## SAMPLE DATA ----
sampleAnnot <- getSampleData(nciSarcomaData::molData)
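For Python-based projects, one route (an assumption, not part of the package documentation) is to export the matrices from the R session above to CSV, e.g. with write.csv(drugAct, "drugAct.csv"), and then load them with pandas:

```python
# Minimal sketch (assumption): loading CSVs exported from the R session above into pandas.
# File names are hypothetical and correspond to write.csv() calls on drugAct and expData.
import pandas as pd

drug_act = pd.read_csv("drugAct.csv", index_col=0)   # drug activity matrix (drugs x cell lines)
exp_data = pd.read_csv("expData.csv", index_col=0)   # expression matrix (genes x cell lines)

# Restrict both matrices to the cell lines they share before any joint analysis
shared = drug_act.columns.intersection(exp_data.columns)
print(drug_act.shape, exp_data.shape, len(shared))
```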
https://spdx.org/licenses/CC0-1.0.html
The dataset that accompanies the "Build Better LibGuides" chapter of Teaching Information Literacy in Political Science, Public Affairs, and International Studies. This dataset was created to compare current practices in Political Science, Public Affairs, and International Studies (PSPAIS) LibGuides with recommended best practices, using a sample that represents a variety of academic institutions. Members of the ACRL Politics, Policy, and International Relations Section (PPIRS) were identified as the librarians most likely to be actively engaged with these specific subjects, so the dataset was scoped by identifying the institutions associated with the most active PPIRS members and then locating the LibGuides in these and related disciplines. The resulting dataset includes 101 guides at 46 institutions, for a total of 887 LibGuide tabs.

Methods
Specifically, a student assistant collected the names and institutional affiliations of each member serving on a PPIRS committee as of July 1, 2021, 2022, and 2023. The student then removed the individual librarian names from the list and located the links to the Political Science or Government; Public Policy, Public Affairs, or Public Administration; and International Studies or International Relations LibGuides at each institution. The chapter author then confirmed and, in a few cases, added to the student's work and copied and pasted the tab names from each guide (which conveniently were also hyperlinked) into a Google Sheet. The resulting dataset included 101 guides at 46 institutions, for a total of 887 LibGuide tabs. A Google Apps script was used to extract the hyperlinks from the collected tab names, and a Python script was then used to scrape the names of links included on each of the tabs. LibGuides from two institutions returned errors during the link name scraping process and were excluded from this part of the analysis.
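As an illustration only (the original Python script is not included here; the URL and HTML structure below are assumptions), scraping the link names from a LibGuide tab could look like:

```python
# Minimal sketch (assumption): collect the visible names of links on one LibGuide tab.
# The URL is a placeholder and the page is assumed to expose links as plain anchor tags.
import requests
from bs4 import BeautifulSoup

tab_url = "https://example.libguides.com/c.php?g=000000&p=0000000"  # placeholder URL
html = requests.get(tab_url, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

link_names = [a.get_text(strip=True) for a in soup.find_all("a") if a.get_text(strip=True)]
print(link_names[:20])
```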
Subscribers can look up export and import data for 23 countries by HS code or product name. This demo is helpful for market analysis.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset provides a collection of 1000 tweets designed for sentiment analysis. The tweets were sourced from Twitter using Python and systematically generated using various modules to ensure a balanced representation of different tweet types, user behaviours, and sentiments. This includes the use of a random module for IDs and text, a faker module for usernames and dates, and a textblob module for assigning sentiment. The dataset's purpose is to offer a robust foundation for analysing and visualising sentiment trends and patterns, aiding in the initial exploration of data and the identification of significant patterns or trends.
The dataset is provided in a CSV file format. It consists of 1000 individual tweet records, structured in a tabular layout with the columns detailed above. A sample file will be made available separately on the platform.
This dataset is ideal for: * Analysing and visualising sentiment trends and patterns in social media. * Initial data exploration to uncover insights into tweet characteristics and user emotions. * Identifying underlying patterns or trends within social media conversations. * Developing and training machine learning models for sentiment classification. * Academic research into Natural Language Processing (NLP) and social media dynamics. * Educational purposes, allowing students to practise data analysis and visualisation techniques.
The dataset spans tweets created between January and April 2023, as observed from the included data samples. While specific geographic or demographic information for users is not available within the dataset, the nature of Twitter implies a general global scope, reflecting a variety of user behaviours and sentiments without specific regional or population group focus.
CC0
This dataset is valuable for: * Data Scientists and Machine Learning Engineers working on NLP tasks and model development. * Researchers in fields such as Natural Language Processing, Machine Learning Algorithms, Deep Learning, and Computer Science. * Data Analysts looking to extract insights from social media content. * Academics and Students undertaking projects related to sentiment analysis or social media studies. * Anyone interested in understanding online sentiment and user behaviour on social media platforms.
Original Data Source: Twitter Sentiment Analysis using Roberta and Vader
The US Census Bureau conducts the American Community Survey (ACS) 1-Year and 5-Year surveys, which record various demographics and provide public access through APIs. I have attempted to call the APIs from a Python environment using the requests library, then clean and organize the data into a usable format.
ACS Subject data [2011-2019] was accessed using Python via the API link below:
https://api.census.gov/data/2011/acs/acs1?get=group(B08301)&for=county:*
The data was obtained in JSON format by calling the above API and then imported as a Python Pandas DataFrame. The 84 variables returned comprise 21 estimate values for various metrics, 21 respective margin-of-error values, and the respective annotation values for the estimates and margins of error. This data then went through various cleaning processes in Python, where excess variables were removed and the column names were renamed. Web scraping was carried out to extract the variables' names and replace the codes in the column names of the raw data.
The above step was carried out for multiple ACS/ACS-1 datasets spanning 2011-2019, which were then merged into a single Python Pandas DataFrame. The columns were rearranged, and the "NAME" column was split into two columns, 'StateName' and 'CountyName.' Counties for which no data was available were removed from the DataFrame. Once the DataFrame was ready, it was separated into two new DataFrames for state and county data and exported in '.csv' format.
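A minimal sketch (an assumption; the column renaming and label-scraping steps are omitted) of the request-and-reshape workflow described above:

```python
# Minimal sketch (assumption): pull one ACS table from the quoted endpoint, build a DataFrame,
# and split NAME ("<County>, <State>") into the CountyName/StateName columns described above.
import requests
import pandas as pd

url = "https://api.census.gov/data/2011/acs/acs1"
params = {"get": "group(B08301)", "for": "county:*"}

rows = requests.get(url, params=params).json()      # JSON array of arrays; first row is the header
df = pd.DataFrame(rows[1:], columns=rows[0])

df[["CountyName", "StateName"]] = df["NAME"].str.split(", ", n=1, expand=True)
print(df.head())
```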
More information about the source of Data can be found at the URL below:
US Census Bureau. (n.d.). About: Census Bureau API. Retrieved from Census.gov
https://www.census.gov/data/developers/about.html
I hope this data helps you create something beautiful and awesome. I will be posting a lot more datasets shortly, if I get time between assignments, submissions, and semester projects 🧙🏼♂️. Good luck.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Geographic Diversity in Public Code Contributions - Replication Package
This document describes how to replicate the findings of the paper: Davide Rossi and Stefano Zacchiroli, 2022, Geographic Diversity in Public Code Contributions - An Exploratory Large-Scale Study Over 50 Years. In 19th International Conference on Mining Software Repositories (MSR ’22), May 23-24, Pittsburgh, PA, USA. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3524842.3528471
This document comes with the software needed to mine and analyze the data presented in the paper.
Prerequisites
These instructions assume the use of the bash shell, the Python programming language, the PostgreSQL DBMS (version 11 or later), the zstd compression utility and various usual *nix shell utilities (cat, pv, …), all of which are available for multiple architectures and OSs. It is advisable to create a Python virtual environment and install the following PyPI packages:
click==8.0.4 cycler==0.11.0 fonttools==4.31.2 kiwisolver==1.4.0 matplotlib==3.5.1 numpy==1.22.3 packaging==21.3 pandas==1.4.1 patsy==0.5.2 Pillow==9.0.1 pyparsing==3.0.7 python-dateutil==2.8.2 pytz==2022.1 scipy==1.8.0 six==1.16.0 statsmodels==0.13.2
Initial data
swh-replica, a PostgreSQL database containing a copy of Software Heritage data. The schema for the database is available at https://forge.softwareheritage.org/source/swh-storage/browse/master/swh/storage/sql/. We retrieved these data from Software Heritage, in collaboration with the archive operators, taking an archive snapshot as of 2021-07-07. We cannot make these data available in full as part of the replication package due to both its volume and the presence in it of personal information such as user email addresses. However, equivalent data (stripped of email addresses) can be obtained from the Software Heritage archive dataset, as documented in the article: Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli, The Software Heritage Graph Dataset: Public software development under one roof. In proceedings of MSR 2019: The 16th International Conference on Mining Software Repositories, May 2019, Montreal, Canada. Pages 138-142, IEEE 2019. http://dx.doi.org/10.1109/MSR.2019.00030. Once retrieved, the data can be loaded in PostgreSQL to populate swh-replica.
names.tab - forenames and surnames per country with their frequency
zones.acc.tab - countries/territories, timezones, population and world zones
c_c.tab - ccTLD entities - world zones matches
Data preparation
Export data from the swh-replica database to create commits.csv.zst and authors.csv.zst
sh> ./export.sh
Run the authors cleanup script to create authors--clean.csv.zst
sh> ./cleanup.sh authors.csv.zst
Filter out implausible names and create authors--plausible.csv.zst
sh> pv authors--clean.csv.zst | unzstd | ./filter_names.py 2> authors--plausible.csv.log | zstdmt > authors--plausible.csv.zst
Zone detection by email
Run the email detection script to create author-country-by-email.tab.zst
sh> pv authors--plausible.csv.zst | zstdcat | ./guess_country_by_email.py -f 3 2> author-country-by-email.csv.log | zstdmt > author-country-by-email.tab.zst
Database creation and initial data ingestion
Create the PostgreSQL DB
sh> createdb zones-commit
Note that, from now on, commands shown with the psql> prompt are assumed to be executed with psql on the zones-commit database.
Import data into PostgreSQL DB
sh> ./import_data.sh
Zone detection by name
Extract commit data from the DB and create commits.tab, which is used as input for the zone detection script
sh> psql -f extract_commits.sql zones-commit
Run the world zone detection script to create commit_zones.tab.zst
sh> pv commits.tab | ./assign_world_zone.py -a -n names.tab -p zones.acc.tab -x -w 8 | zstdmt > commit_zones.tab.zst
Use ./assign_world_zone.py --help if you are interested in changing the script parameters.
Ingest zones assignment data into the DB
psql> \copy commit_zone from program 'zstdcat commit_zones.tab.zst | cut -f1,6 | grep -Ev ''\s$'''
Extraction and graphs
Run the script to execute the queries that extract the data to plot from the DB. This creates commit_zones_7120.tab, author_zones_7120_t5.tab, commit_zones_7120.grid and author_zones_7120_t5.grid. Edit extract_data.sql if you wish to modify the extraction parameters (start/end year, sampling, …).
sh> ./extract_data.sh
Run the script to create the graphs from all the previously extracted tabfiles.
sh> ./create_stackedbar_chart.py -w 20 -s 1971 -f commit_zones_7120.grid -f author_zones_7120_t5.grid -o chart.pdf
OpenAsp Dataset
OpenAsp is an Open Aspect-based Multi-Document Summarization dataset derived from the DUC and MultiNews summarization datasets.
Dataset Access
To generate OpenAsp, you need access to the DUC dataset, from which OpenAsp is derived.
Steps:
Grant access to the DUC dataset by following the NIST instructions here. You should receive two user-password pairs (for DUC01-02 and DUC06-07) and a file named fwdrequestingducdata.zip.
Clone this repository by running the following command: git clone https://github.com/liatschiff/OpenAsp.git
Optionally create a conda or virtualenv environment:
```bash
conda create -n openasp 'python>3.10,<3.11'
conda activate openasp
```
Install python requirements, currently requires python3.8-3.10 (later python versions have issues with spacy)
```bash
pip install -r requirements.txt
```
copy fwdrequestingducdata.zip into the OpenAsp repo directory
run the prepare script command:
```bash
python prepare_openasp_dataset.py --nist-duc2001-user '<2001-user>' --nist-duc2001-password '<2001-pwd>' --nist-duc2006-user '<2006-user>' --nist-duc2006-password '<2006-pwd>'
```
load the dataset using huggingface datasets
```python
from glob import glob
import os
import gzip
import shutil
from datasets import load_dataset

openasp_files = os.path.join('openasp-v1', '*.jsonl.gz')

data_files = {
    os.path.basename(fname).split('.')[0]: fname
    for fname in glob(openasp_files)
}

# decompress the .jsonl.gz files next to the originals
for ftype, fname in data_files.copy().items():
    with gzip.open(fname, 'rb') as gz_file:
        with open(fname[:-3], 'wb') as output_file:
            shutil.copyfileobj(gz_file, output_file)
    data_files[ftype] = fname[:-3]

# load OpenAsp as huggingface's dataset
openasp = load_dataset('json', data_files=data_files)

# print first sample from every split
for split in ['train', 'valid', 'test']:
    sample = openasp[split][0]

    # print title, aspect_label, summary and documents for the sample
    title = sample['title']
    aspect_label = sample['aspect_label']
    summary = '\n'.join(sample['summary_text'])
    input_docs_text = ['\n'.join(d['text']) for d in sample['documents']]

    print('* ' * 60 + '*')
    print(f'Sample from {split}\nSplit title={title}\nAspect label={aspect_label}')
    print(f'\naspect-based summary:\n{summary}')
    print('\ninput documents:\n')

    for i, doc_txt in enumerate(input_docs_text):
        print(f'---- doc #{i} ----')
        print(doc_txt[:256] + '...')

    print('* ' * 60 + '*\n')
```
Troubleshooting
Dataset failed loading with load_dataset() - you may want to delete the huggingface datasets cache folder.
401 Client Error: Unauthorized - your DUC credentials are incorrect, please verify them (case sensitive, no extra spaces, etc.).
Dataset created but prints a warning about content verification - you may be using a different version of the NLTK or spacy model, which affects the sentence tokenization process. You must use the exact versions pinned in requirements.txt.
IndexError: list index out of range - similar to (3), try to reinstall the requirements with exact package versions.
Under The Hood
The prepare_openasp_dataset.py script downloads the DUC and Multi-News source files, uses the sacrerouge package to prepare the datasets, and uses the openasp_v1_dataset_metadata.json file to extract the relevant aspect summaries and compile the final OpenAsp dataset.
License
This repository, including openasp_v1_dataset_metadata.json and prepare_openasp_dataset.py, is released under the Apache license.
The OpenAsp dataset summary and source documents for each sample, which are generated by running the script, are licensed under the respective generic summarization dataset licenses - the Multi-News license and the DUC license.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The authors collected this database over eight months through the Twitter API. The data are tweets from over 50 experts regarding market analysis of 40 cryptocurrencies. These experts are known as influencers on social networks such as Twitter. The theory of behavioral economics shows that the opinions of people, especially experts, can impact the stock market trend (here, cryptocurrencies). Existing databases often cover tweets related to only one or a few cryptocurrencies. Also, in these databases, no attention is paid to the user's expertise, and most of the data is extracted using hashtags. Ignoring the user's expertise considerably increases the volume of irrelevant tweets and the share of neutral polarity. This database has a main table named "Tweets1" with 11 columns and 40 tables that separate the comments related to each cryptocurrency. The columns of the main table and the cryptocurrency tables are explained in the attached document. Researchers can use this dataset in various machine learning tasks, such as sentiment analysis and deep transfer learning with sentiment analysis. This data can also be used to examine the impact of influencers' opinions on cryptocurrency market trends. Use of this database is allowed provided the source is mentioned. In this version, we have also added the Excel version of the database and Python code to extract the names of influencers and tweets.

In Version (3): Three datasets related to historical prices and sentiments for Bitcoin, Ethereum, and Binance have been added as Excel files, covering January 1, 2023, to June 12, 2023. Also, two datasets of 52 influential tweets in cryptocurrencies have been published, along with the score and polarity of sentiments regarding more than 300 cryptocurrencies from February 2021 to June 2023. Two Python scripts related to the tweet sentiment analysis algorithm have also been published. This algorithm combines a pre-trained RoBERTa deep neural network and a BiGRU deep neural network with an attention layer (see the code Preprocessing_and_sentiment_analysis with python).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains information on the Surface Soil Moisture (SM) content derived from satellite observations in the microwave domain.
A description of this dataset, including the methodology and validation results, is available at:
Preimesberger, W., Stradiotti, P., and Dorigo, W.: ESA CCI Soil Moisture GAPFILLED: An independent global gap-free satellite climate data record with uncertainty estimates, Earth Syst. Sci. Data Discuss. [preprint], https://doi.org/10.5194/essd-2024-610, in review, 2025.
ESA CCI Soil Moisture is a multi-satellite climate data record that consists of harmonized, daily observations coming from 19 satellites (as of v09.1) operating in the microwave domain. The wealth of satellite information, particularly over the last decade, facilitates the creation of a data record with the highest possible data consistency and coverage.
However, data gaps are still found in the record. This is particularly notable in earlier periods when a limited number of satellites were in operation, but can also arise from various retrieval issues, such as frozen soils, dense vegetation, and radio frequency interference (RFI). These data gaps present a challenge for many users, as they have the potential to obscure relevant events within a study area or are incompatible with (machine learning) software that often relies on gap-free inputs.
Since the requirement for a gap-free ESA CCI SM product was identified, various studies have demonstrated the suitability of different statistical methods to achieve this goal. A fundamental feature of such a gap-filling method is that it relies only on the original observational record, without the need for ancillary variables or model-based information. Due to this intrinsic challenge, no global, long-term, univariate gap-filled product has been available until now. In this version of the record, data gaps due to missing satellite overpasses and invalid measurements are filled using the Discrete Cosine Transform (DCT) Penalized Least Squares (PLS) algorithm (Garcia, 2010). A linear interpolation is applied over periods of (potentially) frozen soils with little to no variability in (frozen) soil moisture content. Uncertainty estimates are based on models calibrated in experiments that fill satellite-like gaps introduced into GLDAS Noah reanalysis soil moisture (Rodell et al., 2004), and consider the gap size and local vegetation conditions as parameters that affect the gap-filling performance.
You can use command line tools such as wget or curl to download (and extract) data for multiple years. The following command will download and extract the complete data set to the local directory ~/Downloads on Linux or macOS systems.
#!/bin/bash
# Set download directory
DOWNLOAD_DIR=~/Downloads
base_url="https://researchdata.tuwien.at/records/3fcxr-cde10/files"
# Loop through years 1991 to 2023 and download & extract data
for year in {1991..2023}; do
echo "Downloading $year.zip..."
wget -q -P "$DOWNLOAD_DIR" "$base_url/$year.zip"
unzip -o "$DOWNLOAD_DIR/$year.zip" -d "$DOWNLOAD_DIR"
rm "$DOWNLOAD_DIR/$year.zip"
done
The dataset provides global daily estimates for the 1991-2023 period at 0.25° (~25 km) horizontal grid resolution. Daily images are grouped by year (YYYY), each subdirectory containing one netCDF image file for a specific day (DD), month (MM) in a 2-dimensional (longitude, latitude) grid system (CRS: WGS84). The file name has the following convention:
ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_GAPFILLED-YYYYMMDD000000-fv09.1r1.nc
Each netCDF file contains 3 coordinate variables (WGS84 longitude, latitude and time stamp), as well as the following data variables:
Additional information for each variable is given in the netCDF attributes.
Changes in v9.1r1 (previous version was v09.1):
These data can be read by any software that supports Climate and Forecast (CF) conform metadata standards for netCDF files, such as:
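For example, a minimal sketch (an assumption, not the authors' tool list) using the xarray Python package, which reads CF-compliant netCDF:

```python
# Minimal sketch (assumption): open one daily file with xarray. The date in the file name is
# only an example; inspect ds.data_vars for the actual variable names and attributes.
import xarray as xr

path = "1991/ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_GAPFILLED-19910805000000-fv09.1r1.nc"
ds = xr.open_dataset(path)
print(ds)   # lists coordinates (longitude, latitude, time) and the data variables
```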
The following records are all part of the Soil Moisture Climate Data Records from satellites community:
ESA CCI SM MODELFREE Surface Soil Moisture Record: https://doi.org/10.48436/svr1r-27j77
1. Framework overview. This paper proposes a pipeline to construct high-quality datasets for text mining in materials science. First, we use a traceable automatic literature acquisition scheme to ensure the traceability of the textual data. Then, a data processing method driven by downstream tasks is applied to generate high-quality pre-annotated corpora conditioned on the characteristics of materials texts. On this basis, we define a general annotation scheme derived from the materials science tetrahedron to complete high-quality annotation. Finally, a conditional data augmentation model incorporating materials domain knowledge (cDA-DK) is constructed to augment the data quantity.

2. Dataset information. The experimental datasets used in this paper include the Matscholar dataset publicly published by Weston et al. (DOI: 10.1021/acs.jcim.9b00470) and the NASICON entity recognition dataset constructed by ourselves. Here, we mainly introduce the details of the NASICON entity recognition dataset.

2.1 Data collection and preprocessing. First, 55 materials science publications related to the NASICON system are collected through Crystallographic Information Files (CIF), which contain a wealth of structure-activity relationship information. Note that materials science literature is mostly stored as portable document format (PDF), with content arranged in columns and mixed with tables, images, and formulas, which significantly compromises the readability of the text sequence. To tackle this issue, we employ the text parser PDFMiner (a Python toolkit) to standardize, segment, and parse the original documents, thereby converting PDF literature into plain text. In this process, the entire textual information of each publication, encompassing title, author, abstract, keywords, institution, publisher, and publication year, is retained and stored as a unified TXT document. Subsequently, we apply rules based on Python regular expressions to remove redundant information, such as garbled characters and line breaks caused by figures, tables, and formulas. This results in a cleaner text corpus, enhancing its readability and enabling more efficient data analysis. Note that special symbols may also appear as garbled characters, but we refrain from directly deleting them, as they may contain valuable information such as chemical units. Therefore, we converted all such symbols to a special token
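By way of illustration (an assumption; the authors' exact cleanup rules are not reproduced here), PDF-to-text conversion and regex-based cleanup along these lines:

```python
# Minimal sketch (assumption): convert a PDF to plain text with pdfminer.six and apply simple
# regex-based cleanup, in the spirit of the preprocessing described above.
import re
from pdfminer.high_level import extract_text

raw_text = extract_text("nasicon_paper.pdf")        # hypothetical input file

text = re.sub(r"-\n(?=\w)", "", raw_text)           # re-join words hyphenated across line breaks
text = re.sub(r"[ \t]*\n[ \t]*", " ", text)         # flatten line breaks inside paragraphs
text = re.sub(r"\s{2,}", " ", text).strip()         # squeeze repeated whitespace

with open("nasicon_paper.txt", "w", encoding="utf-8") as fh:
    fh.write(text)
```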
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the replication package that accompanies the paper titled: "Dynamic Slicing of WebAssembly Binaries".
# Slices dataset
## Generating the dataset
The dynamic slices have been generated with [P-ORBS](https://syed-islam.github.io/research/program-analysis/#observation-based-program-slicing-orbs).
The steps and scripts to generate the dynamic slices are included in the `slicing-steps/` directory.
These steps also describe how to generate the `stats.csv` file that is included in this dataset.
The static slices have been generated with [wassail](https://github.com/acieroid/wassail).
The scripts to generate the static slices are included in the current directory (`generate-static-slices.sh` which relies on `run_wassail.sh`).
These steps generate the file `static-stats.csv` that is included in this dataset.
The `original-size.csv` file, included in this dataset, can be generated as follows:
```sh
find subjects-wasm-extract-slice -name \*.c.wat -exec c_count {} \; | grep subjects > counts.txt
echo 'slice,original fn slice' > evaluation/original-sizes.csv
sed -E 's|^(.*) subjects-wasm-extract-slice/[^/]*/([^/]*)/.*$|\2,\1|' counts.txt >> evaluation/original-sizes.csv
```
The numbers in Table 1 of the paper (the list of programs in the dataset along with their sizes) can be generated as follows.
For the WebAssembly files, we can count the function size:
```
find subjects-wasm-extract-slice -name \*.c.wat -exec c_count {} \; | grep subjects > counts.txt
sed -E 's|^(.*) subjects-wasm-extract-slice/([^/]*)/.*$|\1 \2|' counts.txt | python evaluation/table1-wasm-mean.py
```
or the full program size:
```
find subjects-wasm-extract-slice -name t.wat -exec c_count {} \; | grep subjects > counts.txt
sed -E 's|^(.*) subjects-wasm-extract-slice/([^/]*)/.*$|\1 \2|' counts.txt | python evaluation/table1-wasm-mean.py
```
## Structure of the dataset
The dataset is structured as follows:
- `subjects/` contains the instrumented `.c` source code, along with scripts to generate the dynamic slices.
  The original source code can be obtained by removing the line `printf("\nORBS:%x\n....`.
- `subjects-wasm-extract-slice/` contains the original WebAssembly programs to slice. Each program has two files: `t.wat` is the full binary file, and `name.c.wat` is the binary code of the function containing the slicing criterion.
- `all_slices/` contains the slices. For example, program `adpcm_ah1_254_expr` has the following files
- `adpcm/adpcm_ah1_254_expr/EWS_adpcm.wat`: the EWS slice
- `adpcm/adpcm_ah1_254_expr/SEW_adpcm.wat`: the SEW slice
- `adpcm/adpcm_ah1_254_expr/ESW_adpcm.wat`: the ESW slice
- `adpcm/adpcm_ah1_254_expr/static_adpcm.wat.slice`: the SWS slice
The other files are produced by intermediary steps and can be ignored. They are:
- `adpcm/adpcm_ah1_254_expr/ESW_adpcm.wat.orig`: original (unsliced) binary *file* from which SW and ESW slices are computed
- `adpcm/adpcm_ah1_254_expr/SEW_adpcm.wat.orig`: original (unsliced) binary *function* from which SEW slice is computed
- `adpcm/adpcm_ah1_254_expr/SW_adpcm.wat`: slice of entire binary file from which ESW slice is extracted
- `adpcm/adpcm_ah1_254_expr/WS_adpcm.wat`: compiled (binary) version of dynamic C slice from which EWS slice is extracted
# Research questions
## RQ1
The script `./RQ1.py` found in the `evaluation/` directory generates:
- Figure 3 (time.pdf)
- The mean, min, max, and stddev of the times
- How many slices are computed below 10, 100, 1000, and 10000 seconds
## RQ2
The script `./RQ2.py` found in the `evaluation/` directory generates:
- Figure 4 (loc.pdf)
- The mean, median, min, max, and stddev of the sizes
- The largest differences between the approaches
- The number of slices larger than the original program
## RQ3 and RQ4
The process for these research questions is manual and requires comparing slices.
It cannot be automated.
We did make heavy use of `diff --side-by-side` in this analysis.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Research Domain/Project:
This dataset is part of the Tour Recommendation System project, which focuses on predicting user preferences and ratings for various tourist places and events. It belongs to the field of Machine Learning, specifically applied to Recommender Systems and Predictive Analytics.
Purpose:
The dataset serves as the training and evaluation data for a Decision Tree Regressor model, which predicts ratings (from 1-5) for different tourist destinations based on user preferences. The model can be used to recommend places or events to users based on their predicted ratings.
Creation Methodology:
The dataset was originally collected from a tourism platform where users rated various tourist places and events. The data was preprocessed to remove missing or invalid entries (such as #NAME? in rating columns). It was then split into subsets for training, validation, and testing the model.
Structure of the Dataset:
The dataset is stored as a CSV file (user_ratings_dataset.csv) and contains the following columns:
place_or_event_id: Unique identifier for each tourist place or event.
rating: Rating given by the user, ranging from 1 to 5.
The data is split into three subsets:
Training Set: 80% of the dataset used to train the model.
Validation Set: A small portion used for hyperparameter tuning.
Test Set: 20% used to evaluate model performance.
Folder and File Naming Conventions:
The dataset files are stored in the following structure:
user_ratings_dataset.csv: The original dataset file containing user ratings.
tour_recommendation_model.pkl: The saved model after training.
actual_vs_predicted_chart.png: A chart comparing actual and predicted ratings.
Software Requirements:
To open and work with this dataset, the following software and libraries are required:
Python 3.x
Pandas for data manipulation
Scikit-learn for training and evaluating machine learning models
Matplotlib for chart generation
Joblib for saving and loading the trained model
The dataset can be opened and processed using any Python environment that supports these libraries.
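A minimal sketch (an assumption; column handling and hyperparameters are illustrative, not the project's actual training code) of this workflow with the libraries listed above:

```python
# Minimal sketch (assumption): load the ratings CSV, train a DecisionTreeRegressor on an
# 80/20 split, report the error, and save the model with joblib.
import pandas as pd
import joblib
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

df = pd.read_csv("user_ratings_dataset.csv")

# Assumes place_or_event_id is numeric; categorical IDs would need encoding first.
X = df[["place_or_event_id"]]
y = df["rating"]                                      # target rating from 1 to 5

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = DecisionTreeRegressor(max_depth=5, random_state=42)
model.fit(X_train, y_train)

print("Test MAE:", mean_absolute_error(y_test, model.predict(X_test)))
joblib.dump(model, "tour_recommendation_model.pkl")   # file name from the description above
```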
Additional Resources:
The model training code, README file, and performance chart are available in the project repository.
For detailed explanation and code, please refer to the GitHub repository (or any other relevant link for the code).
Dataset Reusability:
The dataset is structured for easy use in training machine learning models for recommendation systems. Researchers and practitioners can utilize it to:
Train other types of models (e.g., regression, classification).
Experiment with different features or add more metadata to enrich the dataset.
Data Integrity:
The dataset has been cleaned and preprocessed to remove invalid values (such as #NAME? or missing ratings). However, users should ensure they understand the structure and the preprocessing steps taken before reusing it.
Licensing:
The dataset is provided under the CC BY 4.0 license, which allows free usage, distribution, and modification, provided that proper attribution is given.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
These data granularly explore the substance of collections data of Pan troglodytes (PT) specimens stored at the illustrious Field Museum of Natural History (FMNH) in Chicago, Illinois. Describing a specific dataset in the wake of thorough reviewing and analyzing, these collections reveal patterns that could lead to new research questions and hypotheses, calling for a deepening of our scientific understanding of the species' distribution and biology. Thematically, at its core, this data extraction and analysis emphasize the importance of studying chimpanzees, in light of their close genetic relationship(s) to human beings, in allowing for the unearthing of insights into human health and disease. In this work, Python code for data extraction is also provided, in describing extraction methodology and data categorization. Furthermore, this piece and data highlight some of the barriers associated with studying museum collections, concluding that studying PT collections data from the Field Museum can provide valuable insights that would inform conclusions in ecology, primate biology, biogeography, taxonomy, as well as anatomy, and physiology. The Field Museum of Natural History (FMNH), often referred to as the Field Museum, is an internationally-acclaimed, natural history museum located in one of the U.S.’s most populous cities, i.e., Chicago, Illinois. Named in honor of its primary benefactor as well as University of Chicago founding benefactor Marshall Field, the Museum is renowned for its exceptional scientific and educational programs, as well as its vast assortment of scientific specimens and artifacts. Internationally, the Museum’s zoological collections are regarded as one of the world’s most extensive collections, comprising millions of specimens preserved in dryness, fluids, or ice (e.g., to facilitate anatomical research and DNA analysis). Aiming to extract collections data to gather information on Pan troglodytes (PT), i.e., common chimpanzees, a search of museum records was conducted. The primary objective of this data extraction, and its subsequent analysis, entailed providing significant benefits, including a deeper understanding of the species’ taxonomy, ecology, and evolution. Additionally, surveying and analyzing PT collections aligns with uncovering patterns, which can foster the generation of novel research questions and hypotheses regarding primates and human beings. Furthermore, this work has the potential to document PT’s distribution and range, shedding light on PT interactions with other species and the environment. Fundamentally, therefore, studying PT collections data from a renowned institution like the Field Museum offers valuable insights into ecology, primate biology, biogeography, taxonomy, as well as anatomy and physiology —contributing incrementally to the advancement of scientific knowledge in these realms. Being close genetic relatives to human beings, chimpanzees allow for the harnessing of important insights into human health and disease. For instance, studying the immune system of chimpanzees has led to a new understanding of human immunology. Moreover, studying PT collections data offers the additional benefit of promoting greater cultural understanding. Chimpanzees bear cultural significance in various human societies, and exploring museum collections can therefore enhance people’s comprehension of the cultural significance of these animals. Despite these benefits, some barriers associated with studying museum collections exist. 
For instance, the FMNH notes that its web database does not represent the museum’s complete zoological holdings, and documentation for specimens may differ based on when and how they were collected, as well as how recently they were acquired. Furthermore, while steps are taken to ensure the accuracy of the information in the FMNH database, per FMNH, some of the collections data may contain inaccuracies. Methodologically, based on the benefits of studying PT collections data, a search was conducted on the FMNH database to extract all available data regarding this primate species. The search criteria focus solely on the species’ name, resulting in 61 specimens with corresponding data. The extracted data, downloaded from the Museum in a comma-separated values (CSV) file format, were categorized by Occurrence ID, Institution Code, Collection Code, Catalog Number, Taxonomic Name, Phylum, Class, Order, Family, State/Province, County, Locality, DwC_Sex, Sex, Tissue Present, IRN, Prep Type, Date Collected, and PriLoan Status. Some columns from the original file, such as IRN, were duplicated and deleted, while other columns, such as Identified By, Type Status, Continent/Ocean, Preservation, Life Stage, Collection Methods, Available Data, Latitude, Longitude, Count, Birds/Mammals Parts, Measurement, and Measurement Value, contained empty cells. An alternative method for producing the CSV file using Python entails using the following lines of code,...
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The ARK extension for CKAN provides the capability to mint and resolve ARK (Archival Resource Key) identifiers for datasets. Inspired by the ckanext-doi extension, it facilitates persistent identification of data resources within a CKAN instance, ensuring greater stability and longevity of data references. This extension is compatible with CKAN versions 2.9 and 2.10 and supports Python versions 3.8, 3.9, and 3.10.

Key Features:
ARK Identifier Minting: Generates unique ARK identifiers for CKAN datasets, helping to guarantee long-term access and citation stability.
ARK Identifier Resolving: Enables resolution of ARK identifiers to associated CKAN resources.
Configurable ARK Generation: Allows customization of ARK generation through configurable templates and shoulder assignments, providing flexibility in identifier structure.
ERC Metadata Mapping: Supports mapping of dataset fields to ERC (Encoding and Rendering Conventions) metadata elements, using configurable mappings to extract data for persistent identification.
Command-Line Interface (CLI) Tools: Includes CLI commands to manage (create, update and delete) ARK identifiers for existing datasets.
Customizable NMA (Name Mapping Authority) URL: Supports setting up a custom NMA URL, providing the ability to customize the resolver to point towards different data sources than ckan.siteurl.

Technical Integration: The ARK extension integrates with CKAN through plugins and modifies the read_base.html template (either in a custom extension or directly) to display the ARK identifier within the dataset's view. Configuration settings, such as the ARK NAAN (Name Assigning Authority Number) and other specific parameters, are defined in the CKAN configuration file (ckan.ini). Database initialization is required after installation to ensure proper functioning.

Benefits & Impact: Implementing the ARK extension enhances the data management capabilities of CKAN by providing persistent identifiers for datasets. This ensures that data resources can be reliably cited and accessed over time. By enabling the association of ERC metadata with ARKs, the extension promotes better description and discoverability of data. The extension can be beneficial for institutions that require long-term preservation and persistent identification of their data resources.
The publicly available open GreenDB is a product-by-product sustainability database. It contains product attributes, e.g., name, description, color, etc., and, on the other hand, information about the products' sustainability. Moreover, the sustainability information is transparently evaluated so that it is possible to rank products depending on their sustainability.
Landing Page: https://green-db.calgo-lab.de
Source Code: https://github.com/calgo-lab/green-db
Further Notes
For usability, we export two datatypes: Parquet files (which can be read with, e.g., the pyarrow python package) and CSV files. Be careful with the columns' data types when using the CSV files! Use for example:
from ast import literal_eval
from datetime import datetime
import pandas as pd
products = pd.read_csv("products.csv", converters={"timestamp": datetime.fromisoformat, "image_urls": literal_eval, "sustainability_labels": literal_eval}).convert_dtypes()
sustainability_labels = pd.read_csv("sustainability_labels.csv", converters={"timestamp": datetime.fromisoformat}).convert_dtypes()
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This replication package accompanies the dataset and exploratory empirical analysis reported in the paper "A dataset of GitHub Actions workflow histories", published at the IEEE MSR 2024 conference. (The Jupyter notebook can be found in a previous version of this dataset.)
Important notice: It looks like Zenodo is compressing gzipped files a second time without notice, so they are "double compressed". When you download them, they will be named x.gz.gz instead of x.gz. Note that the provided MD5 refers to the original file.
2024-10-25 update: updated the repositories list and observation period. The filters relying on dates were also updated.
2024-07-09 update: fixed a sometimes invalid valid_yaml flag.
The dataset was created as follows:
First, we used GitHub SEART (on October 7th, 2024) to get a list of every non-fork repository created before January 1st, 2024, having at least 300 commits and at least 100 stars, where at least one commit was made after January 1st, 2024. (The goal of these filters is to exclude experimental and personal repositories.)
We checked whether a .github/workflows folder existed. We filtered out the repositories that did not contain this folder and pulled the others (between the 9th and 10th of October 2024).
We applied the tool gigawork (version 1.4.2) to extract every file from this folder. The exact command used is python batch.py -d /ourDataFolder/repositories -e /ourDataFolder/errors -o /ourDataFolder/output -r /ourDataFolder/repositories_everything.csv.gz -- -w /ourDataFolder/workflows_auxiliaries. (The script batch.py can be found on GitHub.)
We concatenated every file in /ourDataFolder/output into a CSV (using cat headers.csv output/*.csv > workflows_auxiliaries.csv in /ourDataFolder) and compressed it.
We added the column uid via a script available on GitHub.
Finally, we archived the folder with pigz (tar -c --use-compress-program=pigz -f workflows_auxiliaries.tar.gz /ourDataFolder/workflows).
Using the extracted data, the following files were created:
workflows.tar.gz contains the dataset of GitHub Actions workflow file histories.
workflows_auxiliaries.tar.gz is a similar file containing also auxiliary files.
workflows.csv.gz contains the metadata for the extracted workflow files.
workflows_auxiliaries.csv.gz is a similar file containing also metadata for auxiliary files.
repositories.csv.gz contains metadata about the GitHub repositories containing the workflow files. These metadata were extracted using the SEART Search tool.
The metadata is separated in different columns:
repository: The repository (author and repository name) from which the workflow was extracted. The separator "/" distinguishes the author from the repository name
commit_hash: The commit hash returned by git
author_name: The name of the author that changed this file
author_email: The email of the author that changed this file
committer_name: The name of the committer
committer_email: The email of the committer
committed_date: The committed date of the commit
authored_date: The authored date of the commit
file_path: The path to this file in the repository
previous_file_path: The path to this file before it has been touched
file_hash: The name of the related workflow file in the dataset
previous_file_hash: The name of the related workflow file in the dataset, before it has been touched
git_change_type: A single letter (A, D, M, or R) representing the type of change made to the workflow (Added, Deleted, Modified, or Renamed). This letter is given by gitpython and provided as is.
valid_yaml: A boolean indicating if the file is a valid YAML file.
probably_workflow: A boolean indicating if the file contains the YAML keys on and jobs. (Note that it can still be an invalid YAML file.)
valid_workflow: A boolean indicating if the file respects the syntax of GitHub Actions workflows. A freely available JSON Schema (used by gigawork) was used for this purpose.
uid: Unique identifier for a given file surviving modifications and renames. It is generated when the file is added and stays the same until the file is deleted. Renaming does not change the identifier.
Both workflows.csv.gz and workflows_auxiliaries.csv.gz follow this format (a loading sketch is shown below).
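A minimal sketch (an assumption, not part of the replication material) of loading the metadata with pandas:

```python
# Minimal sketch (assumption): read the workflow metadata CSV. If the downloaded file is
# "double compressed" (x.gz.gz, see the notice above), decompress it once first.
import pandas as pd

workflows = pd.read_csv("workflows.csv.gz", compression="gzip")

# Count workflow file changes per repository and per change type (A/D/M/R)
changes = workflows.groupby(["repository", "git_change_type"]).size().unstack(fill_value=0)
print(changes.head())
```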
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Please refer to the original data article for further data description: Jan Luxemburk et al. CESNET-QUIC22: A large one-month QUIC network traffic dataset from backbone lines, Data in Brief, 2023, 108888, ISSN 2352-3409, https://doi.org/10.1016/j.dib.2023.108888. We recommend using the CESNET DataZoo python library, which facilitates the work with large network traffic datasets. More information about the DataZoo project can be found in the GitHub repository https://github.com/CESNET/cesnet-datazoo. The QUIC (Quick UDP Internet Connection) protocol has the potential to replace TLS over TCP, which is the standard choice for reliable and secure Internet communication. Due to its design that makes the inspection of QUIC handshakes challenging and its usage in HTTP/3, there is an increasing demand for research in QUIC traffic analysis. This dataset contains one month of QUIC traffic collected in an ISP backbone network, which connects 500 large institutions and serves around half a million people. The data are delivered as enriched flows that can be useful for various network monitoring tasks. The provided server names and packet-level information allow research in the encrypted traffic classification area. Moreover, included QUIC versions and user agents (smartphone, web browser, and operating system identifiers) provide information for large-scale QUIC deployment studies. Data capture The data was captured in the flow monitoring infrastructure of the CESNET2 network. The capturing was done for four weeks between 31.10.2022 and 27.11.2022. The following list provides per-week flow count, capture period, and uncompressed size:
W-2022-44: Capture Period: 31.10.2022 - 6.11.2022, Number of flows: 32.6M, Uncompressed Size: 19 GB
W-2022-45: Capture Period: 7.11.2022 - 13.11.2022, Number of flows: 42.6M, Uncompressed Size: 25 GB
W-2022-46: Capture Period: 14.11.2022 - 20.11.2022, Number of flows: 33.7M, Uncompressed Size: 20 GB
W-2022-47: Capture Period: 21.11.2022 - 27.11.2022, Number of flows: 44.1M, Uncompressed Size: 25 GB
CESNET-QUIC22: Capture Period: 31.10.2022 - 27.11.2022, Number of flows: 153M, Uncompressed Size: 89 GB
Data description
The dataset consists of network flows describing encrypted QUIC communications. Flows were created using the ipfixprobe flow exporter and are extended with packet metadata sequences, packet histograms, and with fields extracted from the QUIC Initial Packet, which is the first packet of the QUIC connection handshake. The extracted handshake fields are the Server Name Indication (SNI) domain, the used version of the QUIC protocol, and the user agent string that is available in a subset of QUIC communications.

Packet Sequences
Flows in the dataset are extended with sequences of packet sizes, directions, and inter-packet times. For the packet sizes, we consider payload size after transport headers (UDP headers for the QUIC case). Packet directions are encoded as ±1, +1 meaning a packet sent from client to server, and -1 a packet from server to client. Inter-packet times depend on the location of communicating hosts, their distance, and on the network conditions on the path. However, it is still possible to extract relevant information that correlates with user interactions and, for example, with the time required for an API/server/database to process the received data and generate the response to be sent in the next packet. Packet metadata sequences have a length of 30, which is the default setting of the used flow exporter. We also derive three fields from each packet sequence: its length, time duration, and the number of roundtrips. The roundtrips are counted as the number of changes in the communication direction (from packet directions data); in other words, each client request and server response pair counts as one roundtrip.

Flow statistics
Flows also include standard flow statistics, which represent aggregated information about the entire bidirectional flow. The fields are: the number of transmitted bytes and packets in both directions, the duration of the flow, and packet histograms. Packet histograms include binned counts of packet sizes and inter-packet times of the entire flow in both directions (more information is available in the PHISTS plugin documentation). There are eight bins with a logarithmic scale; the intervals are 0-15, 16-31, 32-63, 64-127, 128-255, 256-511, 512-1024, >1024 [ms or B]. The units are milliseconds for inter-packet times and bytes for packet sizes. Moreover, each flow has its end reason - either it was idle, reached the active timeout, or ended due to other reasons. This corresponds with the official IANA IPFIX-specified values. The FLOW_ENDREASON_OTHER field represents the forced end and lack of resources reasons. The end of flow detected reason is not considered because it is not relevant for UDP connections.

Dataset structure
The dataset flows are delivered in compressed CSV files. CSV files contain one flow per row; data columns are summarized in the provided list below. For each flow data file, there is a JSON file with the number of saved and seen (before sampling) flows per service and total counts of all received (observed on the CESNET2 network), service (belonging to one of the dataset's services), and saved (provided in the dataset) flows. There is also the stats-week.json file aggregating flow counts of a whole week and the stats-dataset.json file aggregating flow counts for the entire dataset. Flow counts before sampling can be used to compute sampling ratios of individual services and to resample the dataset back to the original service distribution.
Moreover, various dataset statistics, such as feature distributions and value counts of QUIC versions and user agents, are provided in the dataset-statistics folder. The mapping between services and service providers is provided in the servicemap.csv file, which also includes SNI domains used for ground truth labeling. The following list describes flow data fields in CSV files:
ID: Unique identifier
SRC_IP: Source IP address
DST_IP: Destination IP address
DST_ASN: Destination Autonomous System number
SRC_PORT: Source port
DST_PORT: Destination port
PROTOCOL: Transport protocol
QUIC_VERSION: QUIC protocol version
QUIC_SNI: Server Name Indication domain
QUIC_USER_AGENT: User agent string, if available in the QUIC Initial Packet
TIME_FIRST: Timestamp of the first packet in format YYYY-MM-DDTHH-MM-SS.ffffff
TIME_LAST: Timestamp of the last packet in format YYYY-MM-DDTHH-MM-SS.ffffff
DURATION: Duration of the flow in seconds
BYTES: Number of transmitted bytes from client to server
BYTES_REV: Number of transmitted bytes from server to client
PACKETS: Number of packets transmitted from client to server
PACKETS_REV: Number of packets transmitted from server to client
PPI: Packet metadata sequence in the format: [[inter-packet times], [packet directions], [packet sizes]]
PPI_LEN: Number of packets in the PPI sequence
PPI_DURATION: Duration of the PPI sequence in seconds
PPI_ROUNDTRIPS: Number of roundtrips in the PPI sequence
PHIST_SRC_SIZES: Histogram of packet sizes from client to server
PHIST_DST_SIZES: Histogram of packet sizes from server to client
PHIST_SRC_IPT: Histogram of inter-packet times from client to server
PHIST_DST_IPT: Histogram of inter-packet times from server to client
APP: Web service label
CATEGORY: Service category
FLOW_ENDREASON_IDLE: Flow was terminated because it was idle
FLOW_ENDREASON_ACTIVE: Flow was terminated because it reached the active timeout
FLOW_ENDREASON_OTHER: Flow was terminated for other reasons
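A minimal sketch (an assumption; the file path is illustrative and the CESNET DataZoo library remains the recommended route) of reading one flow CSV and decoding the PPI column:

```python
# Minimal sketch (assumption): load a compressed flow CSV with pandas and parse PPI, which
# stores [[inter-packet times], [packet directions], [packet sizes]] as a string.
from ast import literal_eval
import pandas as pd

flows = pd.read_csv("flows.csv.gz", converters={"PPI": literal_eval})

first = flows.iloc[0]
ipt, directions, sizes = first["PPI"]
client_to_server = sum(1 for d in directions if d == 1)
print(first["QUIC_SNI"], first["PPI_LEN"], client_to_server, "client-to-server packets")
```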
Link to other CESNET datasets
https://www.liberouter.org/technology-v2/tools-services-datasets/datasets/ https://github.com/CESNET/cesnet-datazoo Please cite the original data article:
@article{CESNETQUIC22, author = {Jan Luxemburk and Karel Hynek and Tomáš Čejka and Andrej Lukačovič and Pavel Šiška}, title = {CESNET-QUIC22: a large one-month QUIC network traffic dataset from backbone lines}, journal = {Data in Brief}, pages = {108888}, year = {2023}, issn = {2352-3409}, doi = {https://doi.org/10.1016/j.dib.2023.108888}, url = {https://www.sciencedirect.com/science/article/pii/S2352340923000069} }