71 datasets found

w
Data Use in Academia Dataset
datacatalog.worldbank.org
csv, utf-8
Updated Nov 27, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Semantic Scholar Open Research Corpus (S2ORC) (2023). Data Use in Academia Dataset [Dataset]. https://datacatalog.worldbank.org/search/dataset/0065200/data_use_in_academia_dataset
Explore at:
utf-8, csvAvailable download formats
Dataset updated
Nov 27, 2023
Dataset provided by
Semantic Scholar Open Research Corpus (S2ORC)
Brian William Stacy
License
https://datacatalog.worldbank.org/public-licenses?fragment=cchttps://datacatalog.worldbank.org/public-licenses?fragment=cc
Description
This dataset contains metadata (title, abstract, date of publication, field, etc) for around 1 million academic articles. Each record contains additional information on the country of study and whether the article makes use of data. Machine learning tools were used to classify the country of study and data use.

Our data source of academic articles is the Semantic Scholar Open Research Corpus (S2ORC) (Lo et al. 2020). The corpus contains more than 130 million English language academic papers across multiple disciplines. The papers included in the Semantic Scholar corpus are gathered directly from publishers, from open archives such as arXiv or PubMed, and crawled from the internet.

We placed some restrictions on the articles to make them usable and relevant for our purposes. First, only articles with an abstract and parsed PDF or latex file are included in the analysis. The full text of the abstract is necessary to classify the country of study and whether the article uses data. The parsed PDF and latex file are important for extracting important information like the date of publication and field of study. This restriction eliminated a large number of articles in the original corpus. Around 30 million articles remain after keeping only articles with a parsable (i.e., suitable for digital processing) PDF, and around 26% of those 30 million are eliminated when removing articles without an abstract. Second, only articles from the year 2000 to 2020 were considered. This restriction eliminated an additional 9% of the remaining articles. Finally, articles from the following fields of study were excluded, as we aim to focus on fields that are likely to use data produced by countries’ national statistical system: Biology, Chemistry, Engineering, Physics, Materials Science, Environmental Science, Geology, History, Philosophy, Math, Computer Science, and Art. Fields that are included are: Economics, Political Science, Business, Sociology, Medicine, and Psychology. This third restriction eliminated around 34% of the remaining articles. From an initial corpus of 136 million articles, this resulted in a final corpus of around 10 million articles.

Due to the intensive computer resources required, a set of 1,037,748 articles were randomly selected from the 10 million articles in our restricted corpus as a convenience sample.

The empirical approach employed in this project utilizes text mining with Natural Language Processing (NLP). The goal of NLP is to extract structured information from raw, unstructured text. In this project, NLP is used to extract the country of study and whether the paper makes use of data. We will discuss each of these in turn.

To determine the country or countries of study in each academic article, two approaches are employed based on information found in the title, abstract, or topic fields. The first approach uses regular expression searches based on the presence of ISO3166 country names. A defined set of country names is compiled, and the presence of these names is checked in the relevant fields. This approach is transparent, widely used in social science research, and easily extended to other languages. However, there is a potential for exclusion errors if a country’s name is spelled non-standardly.

The second approach is based on Named Entity Recognition (NER), which uses machine learning to identify objects from text, utilizing the spaCy Python library. The Named Entity Recognition algorithm splits text into named entities, and NER is used in this project to identify countries of study in the academic articles. SpaCy supports multiple languages and has been trained on multiple spellings of countries, overcoming some of the limitations of the regular expression approach. If a country is identified by either the regular expression search or NER, it is linked to the article. Note that one article can be linked to more than one country.

The second task is to classify whether the paper uses data. A supervised machine learning approach is employed, where 3500 publications were first randomly selected and manually labeled by human raters using the Mechanical Turk service (Paszke et al. 2019).[1] To make sure the human raters had a similar and appropriate definition of data in mind, they were given the following instructions before seeing their first paper:

Each of these documents is an academic article. The goal of this study is to measure whether a specific academic article is using data and from which country the data came.
There are two classification tasks in this exercise:
1. identifying whether an academic article is using data from any country
2. Identifying from which country that data came.
For task 1, we are looking specifically at the use of data. Data is any information that has been collected, observed, generated or created to produce research findings. As an example, a study that reports findings or analysis using a survey data, uses data. Some clues to indicate that a study does use data includes whether a survey or census is described, a statistical model estimated, or a table or means or summary statistics is reported.
After an article is classified as using data, please note the type of data used. The options are population or business census, survey data, administrative data, geospatial data, private sector data, and other data. If no data is used, then mark "Not applicable". In cases where multiple data types are used, please click multiple options.[2]
For task 2, we are looking at the country or countries that are studied in the article. In some cases, no country may be applicable. For instance, if the research is theoretical and has no specific country application. In some cases, the research article may involve multiple countries. In these cases, select all countries that are discussed in the paper.
We expect between 10 and 35 percent of all articles to use data.

The median amount of time that a worker spent on an article, measured as the time between when the article was accepted to be classified by the worker and when the classification was submitted was 25.4 minutes. If human raters were exclusively used rather than machine learning tools, then the corpus of 1,037,748 articles examined in this study would take around 50 years of human work time to review at a cost of $3,113,244, which assumes a cost of $3 per article as was paid to MTurk workers.

A model is next trained on the 3,500 labelled articles. We use a distilled version of the BERT (bidirectional Encoder Representations for transformers) model to encode raw text into a numeric format suitable for predictions (Devlin et al. (2018)). BERT is pre-trained on a large corpus comprising the Toronto Book Corpus and Wikipedia. The distilled version (DistilBERT) is a compressed model that is 60% the size of BERT and retains 97% of the language understanding capabilities and is 60% faster (Sanh, Debut, Chaumond, Wolf 2019). We use PyTorch to produce a model to classify articles based on the labeled data. Of the 3,500 articles that were hand coded by the MTurk workers, 900 are fed to the machine learning model. 900 articles were selected because of computational limitations in training the NLP model. A classification of “uses data” was assigned if the model predicted an article used data with at least 90% confidence.

The performance of the models classifying articles to countries and as using data or not can be compared to the classification by the human raters. We consider the human raters as giving us the ground truth. This may underestimate the model performance if the workers at times got the allocation wrong in a way that would not apply to the model. For instance, a human rater could mistake the Republic of Korea for the Democratic People’s Republic of Korea. If both humans and the model perform the same kind of errors, then the performance reported here will be overestimated.

The model was able to predict whether an article made use of data with 87% accuracy evaluated on the set of articles held out of the model training. The correlation between the number of articles written about each country using data estimated under the two approaches is given in the figure below. The number of articles represents an aggregate total of
Data for "To Pre-Filter, or Not to Pre-Filter, That Is the Query: A...
figshare.com
pdf
Updated Jun 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Heather Cribbs; Gabriel Gardner (2023). Data for "To Pre-Filter, or Not to Pre-Filter, That Is the Query: A Multi-Campus Big Data Study" [Dataset]. http://doi.org/10.6084/m9.figshare.19071578.v1
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.19071578.v1
Dataset updated
Jun 1, 2023
Dataset provided by
Figsharehttp://figshare.com/
Authors
Heather Cribbs; Gabriel Gardner
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Five files, one of which is a ZIP archive, containing data that support the findings of this study. PDF file "IA screenshots CSU Libraries search config" contains screenshots captured from the Internet Archive's Wayback Machine for all 24 CalState libraries' homepages for years 2017 - 2019. Excel file "CCIHE2018-PublicDataFile" contains Carnegie Classifications data from the Indiana University Center for Postsecondary Research for all of the CalState campuses from 2018. CSV file "2017-2019_RAW" contains the raw data exported from Ex Libris Primo Analytics (OBIEE) for all 24 CalState libraries for calendar years 2017 - 2019. CSV file "clean_data" contains the cleaned data from Primo Analytics which was used for all subsequent analysis such as charting and import into SPSS for statistical testing. ZIP archive file "NonparametricStatisticalTestsFromSPSS" contains 23 SPSS files [.spv format] reporting the results of testing conducted in SPSS. This archive includes things such as normality check, descriptives, and Kruskal-Wallis H-test results.
Z
Data supporting the Master thesis "Monitoring von Open Data Praktiken -...
data-staging.niaid.nih.gov
data.niaid.nih.gov
+1more
Updated Nov 21, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zinke, Katharina (2024). Data supporting the Master thesis "Monitoring von Open Data Praktiken - Herausforderungen beim Auffinden von Datenpublikationen am Beispiel der Publikationen von Forschenden der TU Dresden" [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_14196538
Explore at:
Dataset updated
Nov 21, 2024
Dataset provided by
SLUB Dresden
Authors
Zinke, Katharina
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Dresden
Description
Data supporting the Master thesis "Monitoring von Open Data Praktiken - Herausforderungen beim Auffinden von Datenpublikationen am Beispiel der Publikationen von Forschenden der TU Dresden" (Monitoring open data practices - challenges in finding data publications using the example of publications by researchers at TU Dresden) - Katharina Zinke, Institut für Bibliotheks- und Informationswissenschaften, Humboldt-Universität Berlin, 2023

This ZIP-File contains the data the thesis is based on, interim exports of the results and the R script with all pre-processing, data merging and analyses carried out. The documentation of the additional, explorative analysis is also available. The actual PDFs and text files of the scientific papers used are not included as they are published open access.

The folder structure is shown below with the file names and a brief description of the contents of each file. For details concerning the analyses approach, please refer to the master's thesis (publication following soon).

Data sources

Folder 01_SourceData/

PLOS-Dataset_v2_Mar23.csv (PLOS-OSI dataset)

ScopusSearch_ExportResults.csv (export of Scopus search results from Scopus)

ScopusSearch_ExportResults.ris (export of Scopus search results from Scopus)

Zotero_Export_ScopusSearch.csv (export of the file names and DOIs of the Scopus search results from Zotero)

Automatic classification

Folder 02_AutomaticClassification/

(NOT INCLUDED) PDFs folder (Folder for PDFs of all publications identified by the Scopus search, named AuthorLastName_Year_PublicationTitle_Title)

(NOT INCLUDED) PDFs_to_text folder (Folder for all texts extracted from the PDFs by ODDPub, named AuthorLastName_Year_PublicationTitle_Title)

PLOS_ScopusSearch_matched.csv (merge of the Scopus search results with the PLOS_OSI dataset for the files contained in both)

oddpub_results_wDOIs.csv (results file of the ODDPub classification)

PLOS_ODDPub.csv (merge of the results file of the ODDPub classification with the PLOS-OSI dataset for the publications contained in both)

Manual coding

Folder 03_ManualCheck/

CodeSheet_ManualCheck.txt (Code sheet with descriptions of the variables for manual coding)

ManualCheck_2023-06-08.csv (Manual coding results file)

PLOS_ODDPub_Manual.csv (Merge of the results file of the ODDPub and PLOS-OSI classification with the results file of the manual coding)

Explorative analysis for the discoverability of open data Folder04_FurtherAnalyses

Proof_of_of_Concept_Open_Data_Monitoring.pdf (Description of the explorative analysis of the discoverability of open data publications using the example of a researcher) - in German

R-Script

Analyses_MA_OpenDataMonitoring.R (R-Script for preparing, merging and analyzing the data and for performing the ODDPub algorithm)
PDF Classification ML Models & Metrics
kaggle.com
zip
Updated Jun 2, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jackson Júnior (2023). PDF Classification ML Models & Metrics [Dataset]. https://www.kaggle.com/datasets/jacksonbarreto/pdf-classification-ml-models-and-metrics/discussion
Explore at:
zip(1189073825 bytes)Available download formats
Dataset updated
Jun 2, 2023
Authors
Jackson Júnior
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset is a product of extensive research conducted on detecting malware in PDF files using machine learning algorithms. The work was motivated by the challenge of detecting malicious PDF files with evasive characteristics and the need to improve cyber security. Using the Python programming language and leveraging the power of NoSQL cloud storage, the project resulted in a variety of trained models and their corresponding performance metrics.

The dataset was built on Evasive-PDFMal2022, a dataset containing more than 10,000 records, including malicious and benign PDF files. Different machine learning algorithms were applied, producing a variety of models and hyperparameters. For each model, the performance was evaluated and recorded, allowing a direct comparison of the different models and algorithms. In addition, a hyperparameter generator module was developed to provide all possible combinations of hyperparameters for each algorithm, allowing in-depth analysis of the factors affecting model performance.

The resulting dataset is comprehensive, providing insights into the performance of different machine learning algorithms on the task of detecting malware in PDF files. The storage of all trained models, regardless of accuracy, provides a transparent and honest view of the modeling process and allows for a comprehensive evaluation of the strengths and weaknesses of each algorithm.

It is hoped that this dataset will serve as a valuable resource for the machine learning and cybersecurity research community, providing detailed information that can help improve the detection of malicious PDF files and contribute to the development of more effective machine learning techniques in this domain.

The inspiration for the project came from a paper done as part of the assessment component for approval in the Data Analysis and CyberIntelligence subject of the Master in Cybersecurity at the Polytechnic University of Viana do Castelo. The research sought to fill gaps in the state of the art in the detection of malicious PDF files, and the results were used to improve and deepen our understanding in this field.
Text Document Classification Dataset
kaggle.com
zip
Updated Dec 4, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
sunil thite (2023). Text Document Classification Dataset [Dataset]. https://www.kaggle.com/datasets/sunilthite/text-document-classification-dataset
Explore at:
zip(1941393 bytes)Available download formats
Dataset updated
Dec 4, 2023
Authors
sunil thite
Description
This is text document classification dataset which contains 2225 text data and five categories of documents. Five categories are politics, sport, tech, entertainment and business. We can use this dataset for documents classification and document clustering.

About Dataset - Dataset contains two features text and label. - No. of Rows : 2225 - No. of Columns : 2

Text: It contains different categories of text data Label: It contains labels for five different categories : 0,1,2,3,4

Politics = 0

Sport = 1

Technology = 2

Entertainment =3

Business = 4
DataSheet1_Comparative analysis of classification techniques for topic-based...
frontiersin.figshare.com
pdf
Updated Nov 7, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ihor Stepanov; Arsentii Ivasiuk; Oleksandr Yavorskyi; Alina Frolova (2023). DataSheet1_Comparative analysis of classification techniques for topic-based biomedical literature categorisation.PDF [Dataset]. http://doi.org/10.3389/fgene.2023.1238140.s001
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.3389/fgene.2023.1238140.s001
Dataset updated
Nov 7, 2023
Dataset provided by
Frontiers Mediahttp://www.frontiersin.org/
Authors
Ihor Stepanov; Arsentii Ivasiuk; Oleksandr Yavorskyi; Alina Frolova
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Introduction: Scientific articles serve as vital sources of biomedical information, but with the yearly growth in publication volume, processing such vast amounts of information has become increasingly challenging. This difficulty is particularly pronounced when it requires the expertise of highly qualified professionals. Our research focused on the domain-specific articles classification to determine whether they contain information about drug-induced liver injury (DILI). DILI is a clinically significant condition and one of the reasons for drug registration failures. The rapid and accurate identification of drugs that may cause such conditions can prevent side effects in millions of patients.Methods: Developing a text classification method can help regulators, such as the FDA, much faster at a massive scale identify facts of potential DILI of concrete drugs. In our study, we compared several text classification methodologies, including transformers, LSTMs, information theory, and statistics-based methods. We devised a simple and interpretable text classification method that is as fast as Naïve Bayes while delivering superior performance for topic-oriented text categorisation. Moreover, we revisited techniques and methodologies to handle the imbalance of the data.Results: Transformers achieve the best results in cases if the distribution of classes and semantics of test data matches the training set. But in cases of imbalanced data, simple statistical-information theory-based models can surpass complex transformers, bringing more interpretable results that are so important for the biomedical domain. As our results show, neural networks can achieve better results if they are pre-trained on domain-specific data, and the loss function was designed to reflect the class distribution.Discussion: Overall, transformers are powerful architecture, however, in certain cases, such as topic classification, its usage can be redundant and simple statistical approaches can achieve compatible results while being much faster and explainable. However, we see potential in combining results from both worlds. Development of new neural network architectures, loss functions and training procedures that bring stability to unbalanced data is a promising topic of development.
Generative AI In Data Analytics Market Analysis, Size, and Forecast...
technavio.com
pdf
Updated Jul 17, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Technavio (2025). Generative AI In Data Analytics Market Analysis, Size, and Forecast 2025-2029: North America (US, Canada, and Mexico), Europe (France, Germany, and UK), APAC (China, India, and Japan), South America (Brazil), and Rest of World (ROW) [Dataset]. https://www.technavio.com/report/generative-ai-in-data-analytics-market-industry-analysis
Explore at:
pdfAvailable download formats
Dataset updated
Jul 17, 2025
Dataset provided by
TechNavio
Authors
Technavio
License
https://www.technavio.com/content/privacy-noticehttps://www.technavio.com/content/privacy-notice
Time period covered
2025 - 2029
Area covered
United States
Description
Snapshot img

Generative AI In Data Analytics Market Size 2025-2029

The generative ai in data analytics market size is valued to increase by USD 4.62 billion, at a CAGR of 35.5% from 2024 to 2029. Democratization of data analytics and increased accessibility will drive the generative ai in data analytics market.

Market Insights

North America dominated the market and accounted for a 37% growth during the 2025-2029. By Deployment - Cloud-based segment was valued at USD 510.60 billion in 2023 By Technology - Machine learning segment accounted for the largest market revenue share in 2023

Market Size & Forecast

Market Opportunities: USD 621.84 million Market Future Opportunities 2024: USD 4624.00 million CAGR from 2024 to 2029 : 35.5%

Market Summary

The market is experiencing significant growth as businesses worldwide seek to unlock new insights from their data through advanced technologies. This trend is driven by the democratization of data analytics and increased accessibility of AI models, which are now available in domain-specific and enterprise-tuned versions. Generative AI, a subset of artificial intelligence, uses deep learning algorithms to create new data based on existing data sets. This capability is particularly valuable in data analytics, where it can be used to generate predictions, recommendations, and even new data points. One real-world business scenario where generative AI is making a significant impact is in supply chain optimization. In this context, generative AI models can analyze historical data and generate forecasts for demand, inventory levels, and production schedules. This enables businesses to optimize their supply chain operations, reduce costs, and improve customer satisfaction. However, the adoption of generative AI in data analytics also presents challenges, particularly around data privacy, security, and governance. As businesses continue to generate and analyze increasingly large volumes of data, ensuring that it is protected and used in compliance with regulations is paramount. Despite these challenges, the benefits of generative AI in data analytics are clear, and its use is set to grow as businesses seek to gain a competitive edge through data-driven insights.

What will be the size of the Generative AI In Data Analytics Market during the forecast period?

Get Key Insights on Market Forecast (PDF) Request Free SampleGenerative AI, a subset of artificial intelligence, is revolutionizing data analytics by automating data processing and analysis, enabling businesses to derive valuable insights faster and more accurately. Synthetic data generation, a key application of generative AI, allows for the creation of large, realistic datasets, addressing the challenge of insufficient data in analytics. Parallel processing methods and high-performance computing power the rapid analysis of vast datasets. Automated machine learning and hyperparameter optimization streamline model development, while model monitoring systems ensure continuous model performance. Real-time data processing and scalable data solutions facilitate data-driven decision-making, enabling businesses to respond swiftly to market trends. One significant trend in the market is the integration of AI-powered insights into business operations. For instance, probabilistic graphical models and backpropagation techniques are used to predict customer churn and optimize marketing strategies. Ensemble learning methods and transfer learning techniques enhance predictive analytics, leading to improved customer segmentation and targeted marketing. According to recent studies, businesses have achieved a 30% reduction in processing time and a 25% increase in predictive accuracy by implementing generative AI in their data analytics processes. This translates to substantial cost savings and improved operational efficiency. By embracing this technology, businesses can gain a competitive edge, making informed decisions with greater accuracy and agility.

Unpacking the Generative AI In Data Analytics Market Landscape

In the dynamic realm of data analytics, Generative AI algorithms have emerged as a game-changer, revolutionizing data processing and insights generation. Compared to traditional data mining techniques, Generative AI models can create new data points that mirror the original dataset, enabling more comprehensive data exploration and analysis (Source: Gartner). This innovation leads to a 30% increase in identified patterns and trends, resulting in improved ROI and enhanced business decision-making (IDC).

Data security protocols are paramount in this context, with Classification Algorithms and Clustering Algorithms ensuring data privacy and compliance alignment. Machine Learning Pipelines and Deep Learning Frameworks facilitate seamless integration with Predictive Modeling Tools and Automated Report Generation on Cloud
H
Data from: Land Use Land Cover (LULC)
opendata.hawaii.gov
geoportal.hawaii.gov
+3more
Updated Jun 1, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Office of Planning (2024). Land Use Land Cover (LULC) [Dataset]. https://opendata.hawaii.gov/dataset/land-use-land-cover-lulc
Explore at:
pdf, arcgis geoservices rest api, geojson, kml, html, zip, csv, ogc wms, ogc wfsAvailable download formats
Dataset updated
Jun 1, 2024
Dataset provided by
Hawaii Statewide GIS Program
Authors
Office of Planning
Description
[Metadata] Description: Land Use Land Cover of main Hawaiian Islands as of 1976
Source: 1:100,000 1976 Digital GIRAS (Geographic Information Retrieval and Analysis) files.

Land Use and Land Cover (LULC) data consists of historical land use and land cover classification data that was based primarily on the manual interpretation of 1970's and 1980's aerial photography. Secondary sources included land use maps and surveys. There are 21 possible categories of cover type. The spatial resolution for all LULC files will depend on the format and feature type. Files in GIRAS format will have a minimum polygon area of 10 acres (4 hectares) with a minimum width of 660 feet (200 meters) for manmade features. Non-urban or natural features have a minimum polygon area of 40 acres (16 hectares) with a minimum width of 1320 feet (400 meters). Files in CTG format will have a resolution of 30 meters.

May 2024: Hawaii Statewide GIS Program staff removed extraneous fields that had been added as part of the 2016 GIS database conversion and were no longer needed.

For additional information, please refer to https://files.hawaii.gov/dbedt/op/gis/data/lulc.pdf or contact Hawaii Statewide GIS Program, Office of Planning and Sustainable Development, State of Hawaii; PO Box 2359, Honolulu, HI 96804; (808) 587-2846; email: gis@hawaii.gov; Website: https://planning.hawaii.gov/gis.

Network traffic datasets created by Single Flow Time Series Analysis

zenodo.org
data.niaid.nih.gov

csv, pdf

Updated Jul 11, 2024

Facebook

Twitter

Click to copy link

Link copied

Cite

Josef Koumar; Josef Koumar; Karel Hynek; Karel Hynek; Tomáš Čejka; Tomáš Čejka (2024). Network traffic datasets created by Single Flow Time Series Analysis [Dataset]. http://doi.org/10.5281/zenodo.8035724

Explore at:

csv, pdfAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.8035724

Dataset updated

Jul 11, 2024

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Josef Koumar; Josef Koumar; Karel Hynek; Karel Hynek; Tomáš Čejka; Tomáš Čejka

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Network traffic datasets created by Single Flow Time Series Analysis

Datasets were created for the paper: Network Traffic Classification based on Single Flow Time Series Analysis -- Josef Koumar, Karel Hynek, Tomáš Čejka -- which was published at The 19th International Conference on Network and Service Management (CNSM) 2023. Please cite usage of our datasets as:

J. Koumar, K. Hynek and T. Čejka, "Network Traffic Classification Based on Single Flow Time Series Analysis," 2023 19th International Conference on Network and Service Management (CNSM), Niagara Falls, ON, Canada, 2023, pp. 1-7, doi: 10.23919/CNSM59352.2023.10327876.

This Zenodo repository contains 23 datasets created from 15 well-known published datasets which are cited in the table below. Each dataset contains 69 features created by Time Series Analysis of Single Flow Time Series. The detailed description of features from datasets is in the file: feature_description.pdf

In the following table is a description of each dataset file:

File name	Detection problem	Citation of original raw dataset
botnet_binary.csv	Binary detection of botnet	S. García et al. An Empirical Comparison of Botnet Detection Methods. Computers & Security, 45:100–123, 2014.
botnet_multiclass.csv	Multi-class classification of botnet	S. García et al. An Empirical Comparison of Botnet Detection Methods. Computers & Security, 45:100–123, 2014.
cryptomining_design.csv	Binary detection of cryptomining; the design part	Richard Plný et al. Datasets of Cryptomining Communication. Zenodo, October 2022
cryptomining_evaluation.csv	Binary detection of cryptomining; the evaluation part	Richard Plný et al. Datasets of Cryptomining Communication. Zenodo, October 2022
dns_malware.csv	Binary detection of malware DNS	Samaneh Mahdavifar et al. Classifying Malicious Domains using DNS Traffic Analysis. In DASC/PiCom/CBDCom/CyberSciTech 2021, pages 60–67. IEEE, 2021.
doh_cic.csv	Binary detection of DoH	Mohammadreza MontazeriShatoori et al. Detection of doh tunnels using time-series classification of encrypted traffic. In DASC/PiCom/CBDCom/CyberSciTech 2020, pages 63–70. IEEE, 2020
doh_real_world.csv	Binary detection of DoH	Kamil Jeřábek et al. Collection of datasets with DNS over HTTPS traffic. Data in Brief, 42:108310, 2022
dos.csv	Binary detection of DoS	Nickolaos Koroniotis et al. Towards the development of realistic botnet dataset in the Internet of Things for network forensic analytics: Bot-IoT dataset. Future Gener. Comput. Syst., 100:779–796, 2019.
edge_iiot_binary.csv	Binary detection of IoT malware	Mohamed Amine Ferrag et al. Edge-iiotset: A new comprehensive realistic cyber security dataset of iot and iiot applications: Centralized and federated learning, 2022.
edge_iiot_multiclass.csv	Multi-class classification of IoT malware	Mohamed Amine Ferrag et al. Edge-iiotset: A new comprehensive realistic cyber security dataset of iot and iiot applications: Centralized and federated learning, 2022.
https_brute_force.csv	Binary detection of HTTPS Brute Force	Jan Luxemburk et al. HTTPS Brute-force dataset with extended network flows, November 2020
ids_cic_binary.csv	Binary detection of intrusion in IDS	Iman Sharafaldin et al. Toward generating a new intrusion detection dataset and intrusion traffic characterization. ICISSp, 1:108–116, 2018.
ids_cic_multiclass.csv	Multi-class classification of intrusion in IDS	Iman Sharafaldin et al. Toward generating a new intrusion detection dataset and intrusion traffic characterization. ICISSp, 1:108–116, 2018.
ids_unsw_nb_15_binary.csv	Binary detection of intrusion in IDS	Nour Moustafa and Jill Slay. Unsw-nb15: a comprehensive data set for network intrusion detection systems (unsw-nb15 network data set). In 2015 military communications and information systems conference (MilCIS), pages 1–6. IEEE, 2015.
ids_unsw_nb_15_multiclass.csv	Multi-class classification of intrusion in IDS	Nour Moustafa and Jill Slay. Unsw-nb15: a comprehensive data set for network intrusion detection systems (unsw-nb15 network data set). In 2015 military communications and information systems conference (MilCIS), pages 1–6. IEEE, 2015.
iot_23.csv	Binary detection of IoT malware	Sebastian Garcia et al. IoT-23: A labeled dataset with malicious and benign IoT network traffic, January 2020. More details here https://www.stratosphereips.org /datasets-iot23
ton_iot_binary.csv	Binary detection of IoT malware	Nour Moustafa. A new distributed architecture for evaluating ai-based security systems at the edge: Network ton iot datasets. Sustainable Cities and Society, 72:102994, 2021
ton_iot_multiclass.csv	Multi-class classification of IoT malware	Nour Moustafa. A new distributed architecture for evaluating ai-based security systems at the edge: Network ton iot datasets. Sustainable Cities and Society, 72:102994, 2021
tor_binary.csv	Binary detection of TOR	Arash Habibi Lashkari et al. Characterization of Tor Traffic using Time based Features. In ICISSP 2017, pages 253–262. SciTePress, 2017.
tor_multiclass.csv	Multi-class classification of TOR	Arash Habibi Lashkari et al. Characterization of Tor Traffic using Time based Features. In ICISSP 2017, pages 253–262. SciTePress, 2017.
vpn_iscx_binary.csv	Binary detection of VPN	Gerard Draper-Gil et al. Characterization of Encrypted and VPN Traffic Using Time-related. In ICISSP, pages 407–414, 2016.
vpn_iscx_multiclass.csv	Multi-class classification of VPN	Gerard Draper-Gil et al. Characterization of Encrypted and VPN Traffic Using Time-related. In ICISSP, pages 407–414, 2016.
vpn_vnat_binary.csv	Binary detection of VPN	Steven Jorgensen et al. Extensible Machine Learning for Encrypted Network Traffic Application Labeling via Uncertainty Quantification. CoRR, abs/2205.05628, 2022
vpn_vnat_multiclass.csv	Multi-class classification of VPN	Steven Jorgensen et al. Extensible Machine Learning for Encrypted Network Traffic Application Labeling via Uncertainty Quantification. CoRR, abs/2205.05628, 2022

Dataset - Understanding the software and data used in the social sciences
eprints.soton.ac.uk
Updated Mar 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Chue Hong, Neil; Aragon, Selina; Antonioletti, Mario; Walker, Johanna (2023). Dataset - Understanding the software and data used in the social sciences [Dataset]. http://doi.org/10.5281/zenodo.7785710
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.7785710
Dataset updated
Mar 30, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Chue Hong, Neil; Aragon, Selina; Antonioletti, Mario; Walker, Johanna
Description
This is a repository for a UKRI Economic and Social Research Council (ESRC) funded project to understand the software used to analyse social sciences data. Any software produced has been made available under a BSD 2-Clause license and any data and other non-software derivative is made available under a CC-BY 4.0 International License. Note that the software that analysed the survey is provided for illustrative purposes - it will not work on the decoupled anonymised data set. Exceptions to this are: Data from the UKRI ESRC is mostly made available under a CC BY-NC-SA 4.0 Licence. Data from Gateway to Research is made available under an Open Government Licence (Version 3.0). Contents Survey data & analysis: esrc_data-survey-analysis-data.zip Other data: esrc_data-other-data.zip Transcripts: esrc_data-transcripts.zip Data Management Plan: esrc_data-dmp.zip Survey data & analysis The survey ran from 3rd February 2022 to 6th March 2023 during which 168 responses were received. Of these responses, three were removed because they were supplied by people from outside the UK without a clear indication of involvement with the UK or associated infrastructure. A fourth response was removed as both came from the same person which leaves us with 164 responses in the data. The survey responses, Question (Q) Q1-Q16, have been decoupled from the demographic data, Q17-Q23. Questions Q24-Q28 are for follow-up and have been removed from the data. The institutions (Q17) and funding sources (Q18) have been provided in a separate file as this could be used to identify respondents. Q17, Q18 and Q19-Q23 have all been independently shuffled. The data has been made available as Comma Separated Values (CSV) with the question number as the header of each column and the encoded responses in the column below. To see what the question and the responses correspond to you will have to consult the survey-results-key.csv which decodes the question and responses accordingly. A pdf copy of the survey questions is available on GitHub. The survey data has been decoupled into: survey-results-key.csv - maps a question number and the responses to the actual question values. q1-16-survey-results.csv- the non-demographic component of the survey responses (Q1-Q16). q19-23-demographics.csv - the demographic part of the survey (Q19-Q21, Q23). q17-institutions.csv - the institution/location of the respondent (Q17). q18-funding.csv - funding sources within the last 5 years (Q18). Please note the code that has been used to do the analysis will not run with the decoupled survey data. Other data files included CleanedLocations.csv - normalised version of the institutions that the survey respondents volunteered. DTPs.csv - information on the UKRI Doctoral Training Partnerships (DTPs) scaped from the UKRI DTP contacts web page in October 2021. projectsearch-1646403729132.csv.gz - data snapshot from the UKRI Gateway to Research released on the 24th February 2022 made available under an Open Government Licence. locations.csv - latitude and longitude for the institutions in the cleaned locations. subjects.csv - research classifications for the ESRC projects for the 24th February data snapshot. topics.csv - topic classification for the ESRC projects for the 24th February data snapshot. Interview transcripts The interview transcripts have been anonymised and converted to markdown so that it's easier to process in general. List of interview transcripts: 1269794877.md 1578450175.md 1792505583.md 2964377624.md 3270614512.md 40983347262.md 4288358080.md 4561769548.md 4938919540.md 5037840428.md 5766299900.md 5996360861.md 6422621713.md 6776362537.md 7183719943.md 7227322280.md 7336263536.md 75909371872.md 7869268779.md 8031500357.md 9253010492.md Data Management Plan The study's Data Management Plan is provided in PDF format and shows the different data sets used throughout the duration of the study and where they have been deposited, as well as how long the SSI will keep these records.
d
October 2023 data-update for "Updated science-wide author databases of...
elsevier.digitalcommonsdata.com
Updated Oct 4, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
John P.A. Ioannidis (2023). October 2023 data-update for "Updated science-wide author databases of standardized citation indicators" [Dataset]. http://doi.org/10.17632/btchxktzyw.6
Explore at:
Unique identifier
https://doi.org/10.17632/btchxktzyw.6
Dataset updated
Oct 4, 2023
Authors
John P.A. Ioannidis
License
Attribution-NonCommercial 3.0 (CC BY-NC 3.0)https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
Description
Citation metrics are widely used and misused. We have created a publicly available database of top-cited scientists that provides standardized information on citations, h-index, co-authorship adjusted hm-index, citations to papers in different authorship positions and a composite indicator (c-score). Separate data are shown for career-long and, separately, for single recent year impact. Metrics with and without self-citations and ratio of citations to citing papers are given. Scientists are classified into 22 scientific fields and 174 sub-fields according to the standard Science-Metrix classification. Field- and subfield-specific percentiles are also provided for all scientists with at least 5 papers. Career-long data are updated to end-of-2022 and single recent year data pertain to citations received during calendar year 2022. The selection is based on the top 100,000 scientists by c-score (with and without self-citations) or a percentile rank of 2% or above in the sub-field. This version (6) is based on the October 1, 2023 snapshot from Scopus, updated to end of citation year 2022. This work uses Scopus data provided by Elsevier through ICSR Lab (https://www.elsevier.com/icsr/icsrlab). Calculations were performed using all Scopus author profiles as of October 1, 2023. If an author is not on the list it is simply because the composite indicator value was not high enough to appear on the list. It does not mean that the author does not do good work.

PLEASE ALSO NOTE THAT THE DATABASE HAS BEEN PUBLISHED IN AN ARCHIVAL FORM AND WILL NOT BE CHANGED. The published version reflects Scopus author profiles at the time of calculation. We thus advise authors to ensure that their Scopus profiles are accurate. REQUESTS FOR CORRECIONS OF THE SCOPUS DATA (INCLUDING CORRECTIONS IN AFFILIATIONS) SHOULD NOT BE SENT TO US. They should be sent directly to Scopus, preferably by use of the Scopus to ORCID feedback wizard (https://orcid.scopusfeedback.com/) so that the correct data can be used in any future annual updates of the citation indicator databases.

The c-score focuses on impact (citations) rather than productivity (number of publications) and it also incorporates information on co-authorship and author positions (single, first, last author). If you have additional questions, please read the 3 associated PLoS Biology papers that explain the development, validation and use of these metrics and databases. (https://doi.org/10.1371/journal.pbio.1002501, https://doi.org/10.1371/journal.pbio.3000384 and https://doi.org/10.1371/journal.pbio.3000918).

Finally, we alert users that all citation metrics have limitations and their use should be tempered and judicious. For more reading, we refer to the Leiden manifesto: https://www.nature.com/articles/520429a
Multi-Class Images for Weather Classification
kaggle.com
zip
Updated Jan 14, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Somesh Sharma (2021). Multi-Class Images for Weather Classification [Dataset]. https://www.kaggle.com/datasets/somesh24/multiclass-images-for-weather-classification/data
Explore at:
zip(95740914 bytes)Available download formats
Dataset updated
Jan 14, 2021
Authors
Somesh Sharma
Description
Multi-class weather dataset(MWD) for image classification is a valuable dataset used in the research paper entitled “Multi-class weather recognition from still image using heterogeneous ensemble method”. The dataset provides a platform for outdoor weather analysis by extracting various features for recognizing different weather conditions.

Research Paper: https://web.cse.ohio-state.edu/~zhang.7804/Cheng_NC2016.pdf
f
Data Sheet 1_Weakly supervised text classification on free-text comments in...
frontiersin.figshare.com
pdf
Updated Apr 30, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Anna-Grace Linton; Vania Gatseva Dimitrova; Amy Downing; Richard Wagland; Adam W. Glaser (2025). Data Sheet 1_Weakly supervised text classification on free-text comments in patient-reported outcome measures.pdf [Dataset]. http://doi.org/10.3389/fdgth.2025.1345360.s001
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.3389/fdgth.2025.1345360.s001
Dataset updated
Apr 30, 2025
Dataset provided by
Frontiers
Authors
Anna-Grace Linton; Vania Gatseva Dimitrova; Amy Downing; Richard Wagland; Adam W. Glaser
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
BackgroundFree-text comments in patient-reported outcome measures (PROMs) data provide insights into health-related quality of life (HRQoL). However, these comments are typically analysed using manual methods, such as content analysis, which is labour-intensive and time-consuming. Machine learning analysis methods are largely unsupervised, necessitating post-analysis interpretation. Weakly supervised text classification (WSTC) can be a valuable analytical method of analysis for classifying domain-specific text data, especially when limited labelled data are available. In this paper, we applied five WSTC techniques to PROMs comment data to explore the extent to which they can be used to identify HRQoL themes reported by patients with prostate and colorectal cancer.MethodsThe main HRQoL themes and associated keywords were identified from a scoping review. They were used to classify PROMs comments with these themes from two national PROMs datasets: colorectal cancer (n = 5,634) and prostate cancer (n = 59,768). Classification was done using five keyword-based WSTC methods (anchored CorEx, BERTopic, Guided LDA, WeSTClass, and X-Class). To evaluate these methods, we assessed the overall performance of the methods and by theme. Domain experts reviewed the interpretability of the methods using the keywords extracted from the methods during training.ResultsBased on the 12 papers identified in the scoping review, we determined six main themes and corresponding keywords to label PROMs comments using WSTC methods. These themes were: Comorbidities, Daily Life, Health Pathways and Services, Physical Function, Psychological and Emotional Function, and Social Function. The performance of the methods varied across themes and between the datasets. While the best-performing model for both datasets, CorEx, attained weighted F1 scores of 0.57 (colorectal cancer) and 0.61 (prostate cancer), methods achieved an F1 score of up to 0.92 (Social Function) on individual themes. By evaluating the keywords extracted from the trained models, we saw that the methods that can utilise expert-driven seed terms and extrapolate based on limited data performed the best.ConclusionsOverall, evaluating these WSTC methods provided insight into their applicability for analysing PROMs comments. Evaluating the classification performance illustrated the potential and limitations of keyword-based WSTC in labelling PROMs comments when labelled data are limited.
f
Data_Sheet_1_Mild cognitive impairment prediction and cognitive score...
datasetcatalog.nlm.nih.gov
Updated Jan 4, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rutkowski, Tomasz M.; Otake-Matsuura, Mihoko; Komendziński, Tomasz (2024). Data_Sheet_1_Mild cognitive impairment prediction and cognitive score regression in the elderly using EEG topological data analysis and machine learning with awareness assessed in affective reminiscent paradigm.PDF [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001273208
Explore at:
Dataset updated
Jan 4, 2024
Authors
Rutkowski, Tomasz M.; Otake-Matsuura, Mihoko; Komendziński, Tomasz
Description
IntroductionThe main objective of this study is to evaluate working memory and determine EEG biomarkers that can assist in the field of health neuroscience. Our ultimate goal is to utilize this approach to predict the early signs of mild cognitive impairment (MCI) in healthy elderly individuals, which could potentially lead to dementia. The advancements in health neuroscience research have revealed that affective reminiscence stimulation is an effective method for developing EEG-based neuro-biomarkers that can detect the signs of MCI.MethodsWe use topological data analysis (TDA) on multivariate EEG data to extract features that can be used for unsupervised clustering, subsequent machine learning-based classification, and cognitive score regression. We perform EEG experiments to evaluate conscious awareness in affective reminiscent photography settings.ResultsWe use EEG and interior photography to distinguish between healthy cognitive aging and MCI. Our clustering UMAP and random forest application accurately predict MCI stage and MoCA scores.DiscussionOur team has successfully implemented TDA feature extraction, MCI classification, and an initial regression of MoCA scores. However, our study has certain limitations due to a small sample size of only 23 participants and an unbalanced class distribution. To enhance the accuracy and validity of our results, future research should focus on expanding the sample size, ensuring gender balance, and extending the study to a cross-cultural context.
f
DataSheet2_A deep learning mixed-data type approach for the classification...
frontiersin.figshare.com
pdf
Updated Jun 2, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Edoardo Spairani; Beniamino Daniele; Maria Gabriella Signorini; Giovanni Magenes (2023). DataSheet2_A deep learning mixed-data type approach for the classification of FHR signals.PDF [Dataset]. http://doi.org/10.3389/fbioe.2022.887549.s002
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.3389/fbioe.2022.887549.s002
Dataset updated
Jun 2, 2023
Dataset provided by
Frontiers
Authors
Edoardo Spairani; Beniamino Daniele; Maria Gabriella Signorini; Giovanni Magenes
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Cardiotocography (CTG) is a widely diffused monitoring practice, used in Ob-Gyn Clinic to assess the fetal well-being through the analysis of the Fetal Heart Rate (FHR) and the Uterine contraction signals. Due to the complex dynamics regulating the Fetal Heart Rate, a reliable visual interpretation of the signal is almost impossible and results in significant subjective inter and intra-observer variability. Also, the introduction of few parameters obtained from computer analysis did not solve the problem of a robust antenatal diagnosis. Hence, during the last decade, computer aided diagnosis systems, based on artificial intelligence (AI) machine learning techniques have been developed to assist medical decisions. The present work proposes a hybrid approach based on a neural architecture that receives heterogeneous data in input (a set of quantitative parameters and images) for classifying healthy and pathological fetuses. The quantitative regressors, which are known to represent different aspects of the correct development of the fetus, and thus are related to the fetal healthy status, are combined with features implicitly extracted from various representations of the FHR signal (images), in order to improve the classification performance. This is achieved by setting a neural model with two connected branches, consisting respectively of a Multi-Layer Perceptron (MLP) and a Convolutional Neural Network (CNN). The neural architecture was trained on a huge and balanced set of clinical data (14.000 CTG tracings, 7000 healthy and 7000 pathological) recorded during ambulatory non stress tests at the University Hospital Federico II, Napoli, Italy. After hyperparameters tuning and training, the neural network proposed has reached an overall accuracy of 80.1%, which is a promising result, as it has been obtained on a huge dataset.
t
Data for binary classification experiments
researchdata.tuwien.at
zip
Updated Feb 24, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Markus Kattenbeck; Antonia Golab; Antonia Golab; Negar Alinaghi; Negar Alinaghi; Ioannis Giannopoulos; Ioannis Giannopoulos; Markus Kattenbeck; Markus Kattenbeck; Markus Kattenbeck (2025). Data for binary classification experiments [Dataset]. http://doi.org/10.48436/zjkky-pgs18
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.48436/zjkky-pgs18
Dataset updated
Feb 24, 2025
Dataset provided by
Geoinformation, TU Wien
Authors
Markus Kattenbeck; Antonia Golab; Antonia Golab; Negar Alinaghi; Negar Alinaghi; Ioannis Giannopoulos; Ioannis Giannopoulos; Markus Kattenbeck; Markus Kattenbeck; Markus Kattenbeck
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Research context

This zip archive contains all the data and scripts which are neccessary to reproduce the results of the following paper, co-authored by Markus Kattenbeck, Ioannis Giannopoulos, Negar Alinaghi, Antonia Golab, and Daniel R. Montello:

Predicting spatial familiarity by exploiting head and eye movements during pedestrian navigation in the real world

This paper will be published in Springer Nature Scientific Reports.

File overview

The structure of the archive is the following:

Folder "01_data" contains all the data files needed and a readme file describing the structure of each of these data files. These data files are:

lsp.csv [contains demographic data about participants]

matched_gaze_imu.csv [contains the segmented behavioral data, i.e. both gaze features and imu features]

matched_gaze_imu_feature_description.pdf [contains a description of the features contained in matched_gaze_imu.csv]

walking_dates.csv [contains an overview on which date participants walked the familiar and unfamiliar routes]

users_polygons.csv [contains one or more polygons per participant in which they are familiar]

polygons_markers.csv [contains locations of POIs per polygon for which participants reported to be familiar with]

user_routes.csv [containes the route participants provided between a randomly selected pair of POIs they have provided for a given polygon]

Folder "02_scripts" contains the data analysis scripts; they are organized in two subfolders:

01_ml_scripts: these are the scripts for the XGBoost classification; they are organized as two python files in which further instructions for use are given.

80_20_code.py is the python file which runs the ML experiments using an 80/20 train/test split

L5O4T_code.py is the python file which runs the ML experiments leaving the full data of five different participants per condition as unseen data for the test.

requirements.txt states the used Python package versions

02_r_scripts:

cleaned_script.Rmd This is an R notebook which can be easily opened in R-Studio and provides the analysis scripts for the descriptive statistics presented in the paper.

package_versions.txt states the used R package versions

Licenses

The code is licensed under MIT, the data is licensed under CC-BY.
Make Data Count Dataset - MinerU Extraction
kaggle.com
zip
Updated Aug 26, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Omid Erfanmanesh (2025). Make Data Count Dataset - MinerU Extraction [Dataset]. https://www.kaggle.com/datasets/omiderfanmanesh/make-data-count-dataset-mineru-extraction
Explore at:
zip(4272989320 bytes)Available download formats
Dataset updated
Aug 26, 2025
Authors
Omid Erfanmanesh
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Dataset Description

This dataset contains PDF-to-text conversions of scientific research articles, prepared for the task of data citation mining. The goal is to identify references to research datasets within full-text scientific papers and classify them as Primary (data generated in the study) or Secondary (data reused from external sources).

The PDF articles were processed using MinerU, which converts scientific PDFs into structured machine-readable formats (JSON, Markdown, images). This ensures participants can access both the raw text and layout information needed for fine-grained information extraction.

Files and Structure

Each paper directory contains the following files:

*_origin.pdf The original PDF file of the scientific article.

*_content_list.json Structured extraction of the PDF content, where each object represents a text or figure element with metadata. Example entry:

{ "type": "text", "text": "10.1002/2017JC013030", "text_level": 1, "page_idx": 0 }

full.md The complete article content in Markdown format (linearized for easier reading).

images/ Folder containing figures and extracted images from the article.

layout.json Page layout metadata, including positions of text blocks and images.

Data Mining Task

The aim is to detect dataset references in the article text and classify them:

DOIs (Digital Object Identifiers): https://doi.org/[prefix]/[suffix] Example: https://doi.org/10.5061/dryad.r6nq870

Accession IDs: Used by data repositories. Format varies by repository. Examples:

GSE12345 (NCBI GEO)

PDB 1Y2T (Protein Data Bank)

E-MEXP-568 (ArrayExpress)

Each dataset mention must be labeled as:

Primary: Data generated by the paper (new experiments, field observations, sequencing runs, etc.).

Secondary: Data reused from external repositories or prior studies.

Training and Test Splits

train/ → Articles with gold-standard labels (train_labels.csv).

test/ → Articles without labels, used for evaluation.

train_labels.csv → Ground truth with:

article_id: Research paper DOI.

dataset_id: Extracted dataset identifier.

type: Citation type (Primary / Secondary).

sample_submission.csv → Example submission format.

Example

Paper: https://doi.org/10.1098/rspb.2016.1151 Data: https://doi.org/10.5061/dryad.6m3n9 In-text span:

"The data we used in this publication can be accessed from Dryad at doi:10.5061/dryad.6m3n9." Citation type: Primary

This dataset enables participants to develop and test NLP systems for:

Information extraction (locating dataset mentions).

Identifier normalization (mapping mentions to persistent IDs).

Citation classification (distinguishing Primary vs Secondary data usage).
d
Data from: NRCS FY2018 Soil Properties and Interpretations, Derived Using...
catalog.data.gov
data.usgs.gov
Updated Nov 25, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2025). NRCS FY2018 Soil Properties and Interpretations, Derived Using gSSURGO Data and Tools [Dataset]. https://catalog.data.gov/dataset/nrcs-fy2018-soil-properties-and-interpretations-derived-using-gssurgo-data-and-tools
Explore at:
Dataset updated
Nov 25, 2025
Dataset provided by
United States Geological Surveyhttp://www.usgs.gov/
Description
These data depict the western United States Map Unit areas as defined by the USDA NRCS. Each Map Unit area contains information on a variety of soil properties and interpretations. The raster is to be joined to the .csv file by the field "mukey." We keep the raster and csv separate to preserve the full attribute names in the csv that would be truncated if attached to the raster. Once joined, the raster can be classified or analyzed by the columns which depict the properties and interpretations. It is important to note that each property has a corresponding component percent column to indicate how much of the map unit has the dominant property provided. For example, if the property "AASHTO Group Classification (Surface) 0 to 1cm" is recorded as "A-1" for a map unit, a user should also refer to the component percent field for this property (in this case 75). This means that an estimated 75% of the map unit has a "A-1" AASHTO group classification and that "A-1" is the dominant group. The property in the column is the dominant component, and so the other 25% of this map unit is comprised of other AASHTO Group Classifications. This raster attribute table was generated from the "Map Soil Properties and Interpretations" tool within the gSSURGO Mapping Toolset in the Soil Data Management Toolbox for ArcGIS™ User Guide Version 4.0 (https://www.nrcs.usda.gov/wps/PA_NRCSConsumption/download?cid=nrcseprd362255&ext=pdf) from GSSURGO that used their Map Unit Raster as the input feature (https://gdg.sc.egov.usda.gov/). The FY2018 Gridded SSURGO Map Unit Raster was created for use in national, regional, and state-wide resource planning and analysis of soils data. These data were created with guidance from the USDA NRCS. The fields named "*COMPPCT_R" can exceed 100% for some map units. The NRCS personnel are aware of and working on fixing this issue. Take caution when interpreting these areas, as they are the result of some data duplication in the master gSSURGO database. The data are considered valuable and required for timely science needs, and thus are released with this known error. The USDA NRCS are developing a data release which will replace this item when it is available. For the most up to date ssurgo releases that do not include the custom fields as this release does, see https://www.nrcs.usda.gov/wps/portal/nrcs/detail/soils/home/?cid=nrcs142p2_053628#tools For additional definitions, see https://www.nrcs.usda.gov/wps/portal/nrcs/detail/soils/survey/geo/?cid=nrcs142p2_053627.
Broad-Coverage German Sentiment Classification Model and Dataset for Dialog...
zenodo.org
live.european-language-grid.eu
+2more
zip
Updated Jun 11, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Oliver Guhr; Anne-Kathrin Schumann; Frank Bahrmann; Hans-Joachim Böhme; Oliver Guhr; Anne-Kathrin Schumann; Frank Bahrmann; Hans-Joachim Böhme (2020). Broad-Coverage German Sentiment Classification Model and Dataset for Dialog Systems [Dataset]. http://doi.org/10.5281/zenodo.3693810
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.3693810
Dataset updated
Jun 11, 2020
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Oliver Guhr; Anne-Kathrin Schumann; Frank Bahrmann; Hans-Joachim Böhme; Oliver Guhr; Anne-Kathrin Schumann; Frank Bahrmann; Hans-Joachim Böhme
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Training a Broad-Coverage German Sentiment Classification Model for Dialog Systems

This paper describes the training of a general-purpose German sentiment classification model. Sentiment classification is an important aspect of general text analytics. Furthermore, it plays a vital role in dialogue systems and voice interfaces that depend on the ability of the system to pick up and understand emotional signals from user utterances. The presented study outlines how we have collected a new German sentiment corpus and then combined this corpus with existing resources to train a broad-coverage German sentiment model. The resulting data set contains 5.4 million labelled samples. We have used the data to train both, a simple convolutional and a transformer-based classification model and compared the results achieved on various training configurations. The model and the data set will be published along with this paper.

You can find the code for training testing the models, that was published along with the paper in this repository.

The germansentiment Python package contains a easy to use interface for the model that was published with this paper.
Cleaned CIC PDF-Malware 2022 Dataset
kaggle.com
zip
Updated Nov 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Satyaprakash Sethy (2023). Cleaned CIC PDF-Malware 2022 Dataset [Dataset]. https://www.kaggle.com/datasets/satyaprakash138/cleaned-cic-pdf-malware-2022-dataset
Explore at:
zip(637257 bytes)Available download formats
Dataset updated
Nov 30, 2023
Authors
Satyaprakash Sethy
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Welcome to the CIC PDF-Malware 2022 dataset! This dataset is meticulously cleaned and curated to support research and development in the field of malware detection within PDF files. The dataset offers a valuable resource for machine learning practitioners, researchers, and data scientists working on cybersecurity projects.

Dataset Overview: The CIC PDF-Malware 2022 dataset comprises a comprehensive collection of features extracted from PDF files, both benign and malicious. It has been thoroughly cleaned to ensure high quality and consistency. Each entry in the dataset includes detailed attributes that can be leveraged for training and testing machine learning models aimed at detecting malware embedded in PDFs.

Key Features:

Feature-Rich Data: Includes various attributes related to PDF files, making it suitable for in-depth analysis and model training. Cleaned and Curated: The dataset has been meticulously cleaned to remove inconsistencies and errors, ensuring reliability and accuracy. Visualizations: We provide insightful visualizations to help understand the dataset's characteristics and distribution. Usage: To facilitate easy utilization of the dataset, we have included example code and tutorials demonstrating how to load and analyze the data. These resources will help you get started quickly and effectively.

Why This Dataset is Valuable:

Research and Development: Ideal for researchers and practitioners focused on enhancing malware detection mechanisms. Benchmarking: Useful for benchmarking new algorithms and models in the context of PDF malware detection. Community Engagement: Engage with the dataset through discussions and collaborative projects to advance cybersecurity research. Getting Started:

Download the dataset and explore the included examples and tutorials. Use the provided visualizations to gain insights into the dataset’s structure and attributes. Share your findings, contribute to discussions, and collaborate with other Kaggle users to maximize the impact of this dataset. Feel free to reach out with any questions or feedback. We look forward to seeing how you utilize this dataset to advance the field of malware detection!

Facebook

Twitter

Click to copy link

Link copied

Cite

Semantic Scholar Open Research Corpus (S2ORC) (2023). Data Use in Academia Dataset [Dataset]. https://datacatalog.worldbank.org/search/dataset/0065200/data_use_in_academia_dataset

Data Use in Academia Dataset

Explore at:

2 scholarly articles cite this dataset (View in Google Scholar)

utf-8, csvAvailable download formats

Dataset updated

Nov 27, 2023

Dataset provided by

Semantic Scholar Open Research Corpus (S2ORC)
Brian William Stacy

License

https://datacatalog.worldbank.org/public-licenses?fragment=cchttps://datacatalog.worldbank.org/public-licenses?fragment=cc

Description

This dataset contains metadata (title, abstract, date of publication, field, etc) for around 1 million academic articles. Each record contains additional information on the country of study and whether the article makes use of data. Machine learning tools were used to classify the country of study and data use.

Our data source of academic articles is the Semantic Scholar Open Research Corpus (S2ORC) (Lo et al. 2020). The corpus contains more than 130 million English language academic papers across multiple disciplines. The papers included in the Semantic Scholar corpus are gathered directly from publishers, from open archives such as arXiv or PubMed, and crawled from the internet.

We placed some restrictions on the articles to make them usable and relevant for our purposes. First, only articles with an abstract and parsed PDF or latex file are included in the analysis. The full text of the abstract is necessary to classify the country of study and whether the article uses data. The parsed PDF and latex file are important for extracting important information like the date of publication and field of study. This restriction eliminated a large number of articles in the original corpus. Around 30 million articles remain after keeping only articles with a parsable (i.e., suitable for digital processing) PDF, and around 26% of those 30 million are eliminated when removing articles without an abstract. Second, only articles from the year 2000 to 2020 were considered. This restriction eliminated an additional 9% of the remaining articles. Finally, articles from the following fields of study were excluded, as we aim to focus on fields that are likely to use data produced by countries’ national statistical system: Biology, Chemistry, Engineering, Physics, Materials Science, Environmental Science, Geology, History, Philosophy, Math, Computer Science, and Art. Fields that are included are: Economics, Political Science, Business, Sociology, Medicine, and Psychology. This third restriction eliminated around 34% of the remaining articles. From an initial corpus of 136 million articles, this resulted in a final corpus of around 10 million articles.

Due to the intensive computer resources required, a set of 1,037,748 articles were randomly selected from the 10 million articles in our restricted corpus as a convenience sample.

The empirical approach employed in this project utilizes text mining with Natural Language Processing (NLP). The goal of NLP is to extract structured information from raw, unstructured text. In this project, NLP is used to extract the country of study and whether the paper makes use of data. We will discuss each of these in turn.

To determine the country or countries of study in each academic article, two approaches are employed based on information found in the title, abstract, or topic fields. The first approach uses regular expression searches based on the presence of ISO3166 country names. A defined set of country names is compiled, and the presence of these names is checked in the relevant fields. This approach is transparent, widely used in social science research, and easily extended to other languages. However, there is a potential for exclusion errors if a country’s name is spelled non-standardly.

The second approach is based on Named Entity Recognition (NER), which uses machine learning to identify objects from text, utilizing the spaCy Python library. The Named Entity Recognition algorithm splits text into named entities, and NER is used in this project to identify countries of study in the academic articles. SpaCy supports multiple languages and has been trained on multiple spellings of countries, overcoming some of the limitations of the regular expression approach. If a country is identified by either the regular expression search or NER, it is linked to the article. Note that one article can be linked to more than one country.

The second task is to classify whether the paper uses data. A supervised machine learning approach is employed, where 3500 publications were first randomly selected and manually labeled by human raters using the Mechanical Turk service (Paszke et al. 2019).[1] To make sure the human raters had a similar and appropriate definition of data in mind, they were given the following instructions before seeing their first paper:

Each of these documents is an academic article. The goal of this study is to measure whether a specific academic article is using data and from which country the data came.

There are two classification tasks in this exercise:

1. identifying whether an academic article is using data from any country

2. Identifying from which country that data came.

For task 1, we are looking specifically at the use of data. Data is any information that has been collected, observed, generated or created to produce research findings. As an example, a study that reports findings or analysis using a survey data, uses data. Some clues to indicate that a study does use data includes whether a survey or census is described, a statistical model estimated, or a table or means or summary statistics is reported.

After an article is classified as using data, please note the type of data used. The options are population or business census, survey data, administrative data, geospatial data, private sector data, and other data. If no data is used, then mark "Not applicable". In cases where multiple data types are used, please click multiple options.[2]

For task 2, we are looking at the country or countries that are studied in the article. In some cases, no country may be applicable. For instance, if the research is theoretical and has no specific country application. In some cases, the research article may involve multiple countries. In these cases, select all countries that are discussed in the paper.

We expect between 10 and 35 percent of all articles to use data.

The median amount of time that a worker spent on an article, measured as the time between when the article was accepted to be classified by the worker and when the classification was submitted was 25.4 minutes. If human raters were exclusively used rather than machine learning tools, then the corpus of 1,037,748 articles examined in this study would take around 50 years of human work time to review at a cost of $3,113,244, which assumes a cost of $3 per article as was paid to MTurk workers.

A model is next trained on the 3,500 labelled articles. We use a distilled version of the BERT (bidirectional Encoder Representations for transformers) model to encode raw text into a numeric format suitable for predictions (Devlin et al. (2018)). BERT is pre-trained on a large corpus comprising the Toronto Book Corpus and Wikipedia. The distilled version (DistilBERT) is a compressed model that is 60% the size of BERT and retains 97% of the language understanding capabilities and is 60% faster (Sanh, Debut, Chaumond, Wolf 2019). We use PyTorch to produce a model to classify articles based on the labeled data. Of the 3,500 articles that were hand coded by the MTurk workers, 900 are fed to the machine learning model. 900 articles were selected because of computational limitations in training the NLP model. A classification of “uses data” was assigned if the model predicted an article used data with at least 90% confidence.

The performance of the models classifying articles to countries and as using data or not can be compared to the classification by the human raters. We consider the human raters as giving us the ground truth. This may underestimate the model performance if the workers at times got the allocation wrong in a way that would not apply to the model. For instance, a human rater could mistake the Republic of Korea for the Democratic People’s Republic of Korea. If both humans and the model perform the same kind of errors, then the performance reported here will be overestimated.

The model was able to predict whether an article made use of data with 87% accuracy evaluated on the set of articles held out of the model training. The correlation between the number of articles written about each country using data estimated under the two approaches is given in the figure below. The number of articles represents an aggregate total of

Clear search

Close search

Google apps

Main menu

Data Use in Academia Dataset

Data for "To Pre-Filter, or Not to Pre-Filter, That Is the Query: A...

Data supporting the Master thesis "Monitoring von Open Data Praktiken -...

Data sources

Automatic classification

Manual coding

Explorative analysis for the discoverability of open data Folder04_FurtherAnalyses

R-Script

PDF Classification ML Models & Metrics

Text Document Classification Dataset

DataSheet1_Comparative analysis of classification techniques for topic-based...

Generative AI In Data Analytics Market Analysis, Size, and Forecast...

Snapshot img

Data from: Land Use Land Cover (LULC)

Network traffic datasets created by Single Flow Time Series Analysis

Dataset - Understanding the software and data used in the social sciences

October 2023 data-update for "Updated science-wide author databases of...

Multi-Class Images for Weather Classification

Data Sheet 1_Weakly supervised text classification on free-text comments in...

Data_Sheet_1_Mild cognitive impairment prediction and cognitive score...

DataSheet2_A deep learning mixed-data type approach for the classification...

Data for binary classification experiments

Research context

File overview

Licenses

Make Data Count Dataset - MinerU Extraction

Dataset Description

Files and Structure

Data Mining Task

Training and Test Splits

Example

Data from: NRCS FY2018 Soil Properties and Interpretations, Derived Using...

Broad-Coverage German Sentiment Classification Model and Dataset for Dialog...

Cleaned CIC PDF-Malware 2022 Dataset

Data Use in Academia Dataset