17 datasets found
  1. SemEval-2024 Task 3: The Competition of Multimodal Emotion Cause Analysis in Conversations

    • zenodo.org
    • data.niaid.nih.gov
    bin, json
    Updated Sep 5, 2024
    Cite
    Fanfan Wang; Heqing Ma; Jianfei Yu; Rui Xia; Erik Cambria (2024). SemEval-2024 Task 3: The Competition of Multimodal Emotion Cause Analysis in Conversations [Dataset]. http://doi.org/10.5281/zenodo.13689364
    Explore at:
    Available download formats: bin, json
    Dataset updated
    Sep 5, 2024
    Dataset provided by
    Association for Computational Linguistics
    Authors
    Fanfan Wang; Heqing Ma; Jianfei Yu; Rui Xia; Erik Cambria
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The ability to understand emotions is an essential component of human-like artificial intelligence, as emotions greatly influence human cognition, decision making, and social interactions. Beyond emotion recognition in conversations, identifying the potential causes behind an individual’s emotional state in a conversation is of great importance in many application scenarios. We organize SemEval-2024 Task 3, named Multimodal Emotion Cause Analysis in Conversations, which aims at extracting all pairs of emotions and their corresponding causes from conversations. Under different modality settings, it consists of two subtasks: Textual Emotion-Cause Pair Extraction in Conversations (TECPE) and Multimodal Emotion-Cause Pair Extraction in Conversations (MECPE). The shared task attracted 143 registrations and 216 successful submissions. In this paper, we introduce the task, dataset and evaluation settings, summarize the systems of the top teams, and discuss the findings of the participants.

    For more information about the task, please visit our task website (https://github.com/NUSTM/SemEval-2024_ECAC) and the CodaLab competition website.
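    As a rough illustration of the task output (the field names below are invented for illustration, not the official annotation format), an emotion-cause pair links an emotion utterance to the utterance that triggered it, and systems can be scored with a simple pair-level F1 in the spirit of the official metrics:

```python
# Hypothetical sketch of emotion-cause pair extraction output.
# Field names ("utt_id", "emotion_utt", "cause_utt") are illustrative only.
conversation = [
    {"utt_id": 1, "speaker": "A", "text": "I lost my keys again."},
    {"utt_id": 2, "speaker": "B", "text": "Oh no, that's so frustrating!"},
]

# For TECPE the cause is a textual span; for MECPE it is a whole
# (multimodal) utterance. Here we pair utterances for simplicity.
pairs = [
    {"emotion_utt": 2, "emotion": "sadness", "cause_utt": 1},
]

def pair_f1(pred, gold):
    """Micro F1 over predicted (emotion_utt, emotion, cause_utt) tuples."""
    pred_set = {(p["emotion_utt"], p["emotion"], p["cause_utt"]) for p in pred}
    gold_set = {(g["emotion_utt"], g["emotion"], g["cause_utt"]) for g in gold}
    tp = len(pred_set & gold_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    return 2 * precision * recall / (precision + recall) if tp else 0.0

print(pair_f1(pairs, pairs))  # perfect match -> 1.0
```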

  2. Natural Language Processing Solution Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Jun 3, 2025
    + more versions
    Cite
    Data Insights Market (2025). Natural Language Processing Solution Report [Dataset]. https://www.datainsightsmarket.com/reports/natural-language-processing-solution-1943950
    Explore at:
    Available download formats: pdf, doc, ppt
    Dataset updated
    Jun 3, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Natural Language Processing (NLP) solutions market is experiencing robust growth, driven by the increasing adoption of AI-powered applications across various sectors. The market's expansion is fueled by the rising volume of unstructured data, the need for efficient data analysis and automation, and the growing demand for personalized customer experiences. Technological advancements, such as deep learning and improved algorithms, are enhancing NLP capabilities, enabling more accurate language understanding and generation. Key applications include chatbots, virtual assistants, sentiment analysis, machine translation, and text summarization.

    While market size data is not explicitly provided, given the presence of major players like IBM, Google, and Microsoft and the rapid growth of AI, we estimate the 2025 market size at around $15 billion. Assuming a conservative compound annual growth rate (CAGR) of 20% (a reasonable estimate given current market dynamics), the market is projected to reach approximately $40 billion by 2033.

    The market is segmented across various industries, including healthcare, finance, retail, and customer service. Healthcare's adoption of NLP for medical record analysis and patient engagement is a significant growth driver; financial institutions leverage NLP for fraud detection, risk management, and regulatory compliance; and retail businesses use NLP for personalized marketing and customer service automation. While restraining factors such as data privacy concerns and the need for high-quality training data remain, the overall market outlook is positive. The competitive landscape comprises both large technology companies and specialized NLP solution providers, fostering innovation and competition that continuously improve the accuracy, efficiency, and affordability of NLP solutions. The forecast period of 2025-2033 offers substantial opportunities for businesses to capitalize on this rapidly evolving technology.

  3. NLC2CMD Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Apr 16, 2020
    Cite
    Mayank Agarwal; Tathagata Chakraborti; Quchen Fu; David Gros; Xi Victoria Lin; Jaron Maene; Kartik Talamadupula; Zhongwei Teng; Jules White (2020). NLC2CMD Dataset [Dataset]. https://paperswithcode.com/dataset/nlc2cmd
    Explore at:
    Dataset updated
    Apr 16, 2020
    Authors
    Mayank Agarwal; Tathagata Chakraborti; Quchen Fu; David Gros; Xi Victoria Lin; Jaron Maene; Kartik Talamadupula; Zhongwei Teng; Jules White
    Description

    The NLC2CMD Competition hosted at NeurIPS 2020 aimed to bring the power of natural language processing to the command line. Participants were tasked with building models that can transform descriptions of command line tasks in English to their Bash syntax.
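    A minimal sketch of what such English-to-Bash data and a trivial baseline might look like (the example pairs and the retrieval "model" below are invented for illustration, not drawn from the actual corpus):

```python
# Illustrative NL -> Bash pairs in the spirit of NLC2CMD (invented examples).
examples = [
    {"invocation": "list all files in the current directory, including hidden ones",
     "cmd": "ls -a"},
    {"invocation": "count the number of lines in file.txt",
     "cmd": "wc -l file.txt"},
]

# A toy retrieval baseline: return the command whose English description
# shares the most words with the query.
def predict(query, data):
    q = set(query.lower().split())
    return max(data, key=lambda ex: len(q & set(ex["invocation"].split())))["cmd"]

print(predict("count lines in file.txt", examples))  # -> "wc -l file.txt"
```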

  4. Natural Language Processing For Healthcare And Life Sciences Report

    • marketresearchforecast.com
    doc, pdf, ppt
    Updated Mar 20, 2025
    + more versions
    Cite
    Market Research Forecast (2025). Natural Language Processing For Healthcare And Life Sciences Report [Dataset]. https://www.marketresearchforecast.com/reports/natural-language-processing-for-healthcare-and-life-sciences-43881
    Explore at:
    Available download formats: pdf, doc, ppt
    Dataset updated
    Mar 20, 2025
    Dataset authored and provided by
    Market Research Forecast
    License

    https://www.marketresearchforecast.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Natural Language Processing (NLP) market for healthcare and life sciences is experiencing robust growth, driven by the increasing volume of unstructured clinical data and the need for efficient data analysis to improve patient care and accelerate drug discovery. A 5% CAGR suggests a consistently expanding market, projected to reach significant value within the forecast period (2025-2033). The market is segmented by NLP type (rule-based, statistical, hybrid, learned) and application (physicians, patients, clinical operators, others). The diverse application areas reflect the multifaceted nature of NLP's impact, ranging from automating administrative tasks and improving diagnostic accuracy to personalizing patient experiences and accelerating research. Major players like Microsoft, Google, IBM, and others are actively investing in and developing NLP solutions, contributing to increased competition and innovation within the sector. The growth is further fueled by advancements in machine learning and deep learning techniques, allowing for more accurate and nuanced analysis of complex medical information. Regulatory approvals and increasing adoption of cloud-based solutions are additional positive market drivers.

    However, challenges remain. Data privacy concerns and the need for robust data security protocols represent significant hurdles. The complexity of integrating NLP solutions into existing healthcare IT infrastructure, along with the requirement for substantial investments in training and infrastructure, restrains widespread adoption. The market's future growth hinges on overcoming these challenges, along with addressing ethical considerations related to algorithmic bias and data transparency. Strategic partnerships between technology providers and healthcare organizations will be crucial in driving successful implementation and maximizing the potential of NLP in improving healthcare outcomes and transforming life sciences research. The expansion into emerging markets, particularly in Asia Pacific, will also contribute to substantial market expansion.

  5. ACHILLES: Ancient and Historical Language Evaluation Set

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated May 29, 2024
    Cite
    Oksana Dereza (2024). ACHILLES: Ancient and Historical Language Evaluation Set [Dataset]. http://doi.org/10.5281/zenodo.10655061
    Explore at:
    Available download formats: zip
    Dataset updated
    May 29, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Oksana Dereza
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Nov 5, 2023
    Description

    The dataset used in the SIGTYP 2024 Shared Task on Word Embedding Evaluation for Ancient and Historical Languages. The task included four problems; problems 1-3 were offered in both constrained and unconstrained tracks on CodaLab, while problem 4 was only a part of the unconstrained track.

    1. POS-tagging
    2. Lemmatisation
    3. Morphological feature prediction
    4. Mask filling
      • Word-level
      • Character-level

    For problems 1-3, data from Universal Dependencies v.2.12 was used for Ancient Greek, Ancient Hebrew, Classical Chinese, Coptic, Gothic, medieval Icelandic, Latin, Old Church Slavonic, Old East Slavic, Old French and Vedic Sanskrit. Old Hungarian texts, annotated to the same standard as UD corpora, were added to the dataset from the MGTSZ website. In Old Hungarian data, tokens which were POS-tagged PUNCT were altered so that the form matched the lemma to simplify complex punctuation marks used to approximate manuscript symbols; otherwise, no characters were changed.

    Although the ISO 639-3 standard does not distinguish between historical stages of Latin, as it does for other languages such as Irish, it was desirable to approximate this distinction for Latin, so we further split the Latin data. This resulted in two Latin datasets: Classical and Late Latin, and Medieval Latin. This split was dictated by the composition of the Perseus and PROIEL treebanks that served as a source for the Latin UD treebanks.

    Historical forms of Irish were only included in mask filling challenges (problem 4), as the quantity of historical Irish text data which has been tokenised and annotated to a single standard to date is insufficient for the purpose of training models to perform morphological analysis tasks. The texts were drawn from CELT, Corpas Stairiúil na Gaeilge, and digital editions of the St. Gall glosses and the Würzburg glosses. Each Irish text taken from CELT is labelled "Old", "Middle" or "Early Modern" in accordance with the language labels provided in CELT metadata. Because CELT metadata relating to language stages and text dating is reliant on information provided by a variety of different editors of earlier print editions, this metadata can be inconsistent across the corpus and on occasion inaccurate. To mitigate complications arising from this, texts drawn from CELT were included in the dataset only if they had a single Irish language label and if the dates provided in CELT metadata for the text match the expected dates for the given period in the history of the Irish language.

    The upper temporal boundary was set at 1700 CE, and texts created later than this date were not included in the dataset. The choice of this date is driven by the fact that most of the historical language data used in word embedding research dates back to the 18th century CE or later, and our intention was to focus on the more challenging and yet unaddressed data. The resulting datasets for each language were then shuffled at the sentence level and split into training, validation and test subsets at the ratio of 0.8 : 0.1 : 0.1.
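    The shuffle-and-split step described above can be sketched as follows (the shared task's actual seed and tooling are not specified, so this is only illustrative):

```python
import random

def split_sentences(sentences, seed=0):
    """Shuffle at sentence level and split 0.8 : 0.1 : 0.1, as described
    above. The seed is an assumption; the task's actual seed is unknown."""
    rng = random.Random(seed)
    data = list(sentences)
    rng.shuffle(data)
    n = len(data)
    n_train, n_valid = int(0.8 * n), int(0.1 * n)
    return (data[:n_train],
            data[n_train:n_train + n_valid],
            data[n_train + n_valid:])

train, valid, test = split_sentences(range(100))
print(len(train), len(valid), len(test))  # 80 10 10
```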

    A detailed list of text sources for each language in the dataset, as well as other metadata and the description of data formats used for each problem, is provided on the Shared Task's GitHub. The structure of the dataset is as follows:

    📂 morphology (data for problems 1-3)
    ├── 📂 test
    │   ├── 📂 ref (reference data used in CodaLab competitions)
    │   │   ├── 📂 lemmatisation
    │   │   ├── 📂 morph_features
    │   │   └── 📂 pos_tagging
    │   └── 📂 src (source test data with labels)
    ├── 📂 train
    └── 📂 valid

    📂 fill_mask_word (data for problem 4a)
    ├── 📂 test
    │   ├── 📂 ref (reference data used in CodaLab competitions)
    │   └── 📂 src (source test data with labels in 2 different formats)
    │       ├── 📂 json
    │       └── 📂 tsv
    ├── 📂 train (train data in 2 different formats)
    │   ├── 📂 json
    │   └── 📂 tsv
    └── 📂 valid (validation data in 2 different formats)
        ├── 📂 json
        └── 📂 tsv

    📂 fill_mask_char (data for problem 4b)
    ├── 📂 test
    │   ├── 📂 ref (reference data used in CodaLab competitions)
    │   └── 📂 src (source test data with labels in 2 different formats)
    │       ├── 📂 json
    │       └── 📂 tsv
    ├── 📂 train (train data in 2 different formats)
    │   ├── 📂 json
    │   └── 📂 tsv
    └── 📂 valid (validation data in 2 different formats)
        ├── 📂 json
        └── 📂 tsv

    We would like to thank Ekaterina Melnikova for suggesting the name for the dataset.

  6. Global Artificial Intelligence Market By Application (Image Recognition,...

    • techsciresearch.com
    Updated Mar 22, 2017
    Cite
    TechSci Research (2017). Global Artificial Intelligence Market By Application (Image Recognition, Natural Language Processing, Speech Recognition, etc.), By End User (Consumer Electronics, BFSI, etc.), By Region, Competition Forecast & Opportunities, Demand, Size and Competitive Analysis | TechSci Research [Dataset]. https://www.techsciresearch.com/report/global-artificial-intelligence-market-by-application-image-recognition-natural-language-processing-speech-recognition-etc-by-end-user-consumer-electronics-bfsi-etc-by-region-competition-forecast-opportunities/932.html
    Explore at:
    Dataset updated
    Mar 22, 2017
    Dataset authored and provided by
    TechSci Research
    License

    https://www.techsciresearch.com/privacy-policy.aspx

    Description

    Get the TechSci Research report on the Global Artificial Intelligence Market, covering market growth, trends, forecast and revenue.

    Pages: 255
    Market Size
    Forecast Market Size
    CAGR
    Fastest Growing Segment
    Largest Market
    Key Players

  7. Code4ML 2.0

    • zenodo.org
    csv, txt
    Updated May 19, 2025
    Cite
    Anonymous authors (2025). Code4ML 2.0 [Dataset]. http://doi.org/10.5281/zenodo.15465737
    Explore at:
    Available download formats: csv, txt
    Dataset updated
    May 19, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous authors
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is an enriched version of the Code4ML dataset, a large-scale corpus of annotated Python code snippets, competition summaries, and data descriptions sourced from Kaggle. The initial release includes approximately 2.5 million snippets of machine learning code extracted from around 100,000 Jupyter notebooks. A portion of these snippets has been manually annotated by human assessors through a custom-built, user-friendly interface designed for this task.

    The original dataset is organized into multiple CSV files, each containing structured data on different entities:

    • code_blocks.csv: Contains raw code snippets extracted from Kaggle.
    • kernels_meta.csv: Metadata for the notebooks (kernels) from which the code snippets were derived.
    • competitions_meta.csv: Metadata describing Kaggle competitions, including information about tasks and data.
    • markup_data.csv: Annotated code blocks with semantic types, allowing deeper analysis of code structure.
    • vertices.csv: A mapping from numeric IDs to semantic types and subclasses, used to interpret annotated code blocks.

    Table 1. code_blocks.csv structure

    • code_blocks_index: Global index linking code blocks to markup_data.csv.
    • kernel_id: Identifier for the Kaggle Jupyter notebook from which the code block was extracted.
    • code_block_id: Position of the code block within the notebook.
    • code_block: The actual machine learning code snippet.

    Table 2. kernels_meta.csv structure

    • kernel_id: Identifier for the Kaggle Jupyter notebook.
    • kaggle_score: Performance metric of the notebook.
    • kaggle_comments: Number of comments on the notebook.
    • kaggle_upvotes: Number of upvotes the notebook received.
    • kernel_link: URL to the notebook.
    • comp_name: Name of the associated Kaggle competition.

    Table 3. competitions_meta.csv structure

    • comp_name: Name of the Kaggle competition.
    • description: Overview of the competition task.
    • data_type: Type of data used in the competition.
    • comp_type: Classification of the competition.
    • subtitle: Short description of the task.
    • EvaluationAlgorithmAbbreviation: Metric used for assessing competition submissions.
    • data_sources: Links to datasets used.
    • metric type: Class label for the assessment metric.

    Table 4. markup_data.csv structure

    • code_block: Machine learning code block.
    • too_long: Flag indicating whether the block spans multiple semantic types.
    • marks: Confidence level of the annotation.
    • graph_vertex_id: ID of the semantic type.

    The dataset allows mapping between these tables. For example:

    • code_blocks.csv can be linked to kernels_meta.csv via the kernel_id column.
    • kernels_meta.csv is connected to competitions_meta.csv through comp_name. To maintain quality, kernels_meta.csv includes only notebooks with available Kaggle scores.

    In addition, data_with_preds.csv contains automatically classified code blocks, with a mapping back to code_blocks.csv via the code_blocks_index column.
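    The joins described above can be sketched with pandas; the toy rows below are invented for illustration and only mirror the documented column names:

```python
import pandas as pd

# Toy frames mirroring the documented schema (contents invented for illustration).
code_blocks = pd.DataFrame({
    "code_blocks_index": [0, 1],
    "kernel_id": [10, 11],
    "code_block_id": [0, 0],
    "code_block": ["import pandas as pd", "model.fit(X, y)"],
})
kernels_meta = pd.DataFrame({
    "kernel_id": [10, 11],
    "kaggle_score": [0.91, 0.87],
    "comp_name": ["titanic", "titanic"],
})
competitions_meta = pd.DataFrame({
    "comp_name": ["titanic"],
    "comp_type": ["classification"],
})

# code_blocks -> kernels_meta via kernel_id, then -> competitions_meta via comp_name.
merged = (code_blocks
          .merge(kernels_meta, on="kernel_id")
          .merge(competitions_meta, on="comp_name"))
print(merged[["code_blocks_index", "kaggle_score", "comp_type"]].shape)  # (2, 3)
```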

    Code4ML 2.0 Enhancements

    The updated Code4ML 2.0 corpus introduces kernels extracted from Meta Kaggle Code. These kernels correspond to Kaggle competitions launched since 2020. The natural-language descriptions of the competitions are retrieved with the aid of an LLM.

    Notebooks in kernels_meta2.csv may not have a Kaggle score but include a leaderboard ranking (rank), providing additional context for evaluation.

    competitions_meta_2.csv is enriched with data_cards, describing the data used in the competitions.

    Applications

    The Code4ML 2.0 corpus is a versatile resource, enabling training and evaluation of models in areas such as:

    • Code generation
    • Code understanding
    • Natural language processing of code-related tasks
  8. English Monograph OCR Dataset (Preprocessed) 📄🔍

    • kaggle.com
    Updated Mar 21, 2025
    Cite
    Arjav 007 (2025). English Monograph OCR Dataset (Preprocessed) 📄🔍 [Dataset]. https://www.kaggle.com/datasets/arjav007/icdar-eng
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 21, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Arjav 007
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset is a preprocessed version of the English Monograph subset from the ICDAR 2017 OCR Post-Correction competition. It contains OCR-generated text alongside its corresponding aligned ground truth, making it useful for OCR error detection and correction tasks.

    📌 About the Dataset

    The dataset consists of historical English texts that were processed using OCR technology. Due to OCR errors, the text contains misrecognized characters, missing words, and other inaccuracies. This dataset provides both raw OCR output and gold-standard corrected text.

    🚀 Use Cases

    This dataset is ideal for:
    - OCR Error Detection & Correction 📝
    - Training Character-Based Machine Translation Models 🔠
    - Natural Language Processing (NLP) on Historical Texts 📜

    📊 Dataset Statistics

    • Total Entries: 724
    • Character-Level OCR Error Rate: ~1.79%
    • Common OCR Errors Observed:
      • 1 → I
      • tbe → the
      • tho → the
      • aud → and
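    The character-level error rate quoted above can be computed as edit distance divided by ground-truth length; a minimal sketch (this is an assumption about the metric's definition, not the competition's official scorer):

```python
def levenshtein(a, b):
    """Edit distance between two strings via the standard DP recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def char_error_rate(ocr, gold):
    return levenshtein(ocr, gold) / max(len(gold), 1)

# Two substitutions (tbe -> the, tho -> the) over 22 gold characters.
print(char_error_rate("tbe cat sat on tho mat", "the cat sat on the mat"))
```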

    📜 Citation

    If you use this dataset, please cite the original ICDAR 2017 OCR Post-Correction paper:

    Chiron, G., Doucet, A., Coustaty, M., Moreux, J.P. (2017). ICDAR 2017 Competition on Post-OCR Text Correction.

  9. Global Workforce Resume Dataset

    • opendatabay.com
    .undefined
    Updated Jul 2, 2025
    Cite
    Datasimple (2025). Global Workforce Resume Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/503680cc-bc47-4d8a-8231-cb824e9687e5
    Explore at:
    Available download formats: .undefined
    Dataset updated
    Jul 2, 2025
    Dataset authored and provided by
    Datasimple
    Area covered
    Data Science and Analytics
    Description

    This dataset, curated and processed by Neuralframe AI, serves as a valuable resource for resume parsing, candidate profiling, and job matching applications. It includes structured information on career objectives, skills, education, work experience, certifications, and other pertinent details. The data has been collected from both open-source platforms and Neuralframe AI's proprietary sources, with all data obtained with explicit consent. The dataset was initially utilised in the Datathon Competition at Bitfest 2025, offering participants a practical dataset to develop and refine resume parsing algorithms and candidate evaluation systems.

    Columns

    The dataset contains 35 columns. Key columns include:

    • address: Candidate's address (if available).
    • career_objective: A brief summary of the candidate's career goals or objectives.
    • skills: A list of skills possessed by the candidate, such as technical and soft skills.
    • educational_institution_name: Names of educational institutions attended by the candidate.
    • degree_names: Degrees obtained by the candidate (e.g., B.Tech, MBA).
    • passing_years: Year(s) of graduation or programme completion.
    • educational_results: Results or grades achieved in educational qualifications, such as GPA, percentage, or division.
    • result_types: The format or type of the educational results, such as GPA, percentage, or classification (e.g., Distinction).
    • major_field_of_studies: The main fields or subjects studied during the candidate’s education (e.g., Computer Science, Mathematics).
    • professional_company_names: Names of the companies or organisations where the candidate has worked professionally.

    Distribution

    • Filename: resume_data.csv
    • Format: CSV (Comma-Separated Values)
    • Size: 17 MB
    • Number of Columns: 35
    • Number of Rows: 9544
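    A minimal loading sketch for the file described above (only three of the 35 columns are mocked inline here; in practice you would read resume_data.csv directly):

```python
import io
import pandas as pd

# Inline mock standing in for resume_data.csv; column names follow the
# listing above, but the row content is invented for illustration.
csv_text = io.StringIO(
    "career_objective,skills,degree_names\n"
    "Become a data scientist,\"Python; SQL\",B.Tech\n"
)
df = pd.read_csv(csv_text)  # in practice: pd.read_csv("resume_data.csv")
print(df.columns.tolist())
print(len(df))
```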

    Usage

    This dataset is ideal for:

    • Developing and refining resume parsing algorithms.
    • Creating candidate profiling systems.
    • Building job matching applications.
    • Enhancing candidate evaluation systems.
    • Research in natural language processing (NLP) and machine learning on textual data.

    Coverage

    The dataset's region coverage is global. Specific details regarding time range or detailed demographic scope are not explicitly provided within the available information.

    License

    CC-BY

    Who Can Use It

    This dataset is particularly useful for:

    • Data Scientists and Analysts: For building predictive models and extracting insights from resume data.
    • Machine Learning Engineers: For training and testing NLP models for text analysis on resumes.
    • HR Professionals and Recruiters: For automating aspects of candidate screening and matching.
    • Academic Researchers: For studies related to human resources, labour markets, or AI applications in recruitment.
    • Participants in Datathons and Competitions: Seeking a practical dataset for developing real-world solutions.

    Dataset Name Suggestions

    • Candidate Profile Dataset
    • Resume Data for AI Models
    • Global Workforce Resume Data
    • Structured Career Data
    • Job Applicant Skills Dataset

    Attributes

    Original Data Source: Resume Dataset

  10. Data from: Code4ML: a Large-scale Dataset of annotated Machine Learning Code...

    • zenodo.org
    Updated May 18, 2024
    + more versions
    Cite
    Ekaterina Trofimova; Emil Sataev; Anastasia Drozdova; Polina Guseva; Anna Scherbakova; Andrey Ustyuzhanin; Anastasia Gorodilova; Valeriy Berezovskiy (2024). Code4ML: a Large-scale Dataset of annotated Machine Learning Code [Dataset]. http://doi.org/10.5281/zenodo.11213783
    Explore at:
    Dataset updated
    May 18, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ekaterina Trofimova; Emil Sataev; Anastasia Drozdova; Polina Guseva; Anna Scherbakova; Andrey Ustyuzhanin; Anastasia Gorodilova; Valeriy Berezovskiy
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is an enriched version of Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle. The initial corpus consists of ≈ 2.5 million snippets of ML code collected from ≈ 100 thousand Jupyter notebooks. A representative fraction of the snippets is annotated by human assessors through a user-friendly interface specially designed for that purpose.

    The data is organized as a set of tables in CSV format. It includes several central entities: raw code blocks collected from Kaggle (code_blocks.csv), kernels (kernels_meta.csv) and competitions meta information (competitions_meta.csv). Manually annotated code blocks are presented as a separate table (markup_data.csv). As this table contains the numeric id of the code block semantic type, we also provide a mapping from the id to semantic class and subclass (vertices.csv).

    Snippets information (code_blocks.csv) can be mapped with kernels meta-data via kernel_id. Kernels metadata is linked to Kaggle competitions information through comp_name. To ensure the quality of the data kernels_meta.csv includes only notebooks with an available Kaggle score.

    Automatic classifications of code blocks are stored in data_with_preds.csv. This table can be mapped to code_blocks.csv through the code_blocks_index column, which corresponds to code_blocks indices.

    The updated Code4ML 2.0 corpus includes kernels retrieved from Meta Kaggle Code. These kernels correspond to Kaggle competitions launched since 2020. The natural-language descriptions of the competitions are retrieved with the aid of an LLM.

    kernels_meta2.csv may contain kernels without a Kaggle score, but with a leaderboard placement (rank).

    Code4ML 2.0 dataset can be used for various purposes, including training and evaluating models for code generation, code understanding, and natural language processing tasks.

  11. SemEval-2021 Task 11: NLPContributionGraph

    • zenodo.org
    zip
    Updated Jul 27, 2021
    + more versions
    Cite
    Jennifer D'Souza; Soeren Auer; Ted Pedersen (2021). SemEval-2021 Task 11: NLPContributionGraph [Dataset]. http://doi.org/10.25835/0022787
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 27, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jennifer D'Souza; Soeren Auer; Ted Pedersen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    NLPContributionGraph was introduced as Task 11 at SemEval 2021 for the first time. The task is defined on a dataset of Natural Language Processing (NLP) scholarly articles with their contributions structured to be integrable within Knowledge Graph infrastructures such as the Open Research Knowledge Graph. The structured contribution annotations are provided as (1) Contribution sentences : a set of sentences about the contribution in the article; (2) Scientific terms and relations: a set of scientific terms and relational cue phrases extracted from the contribution sentences; and (3) Triples: semantic statements that pair scientific terms with a relation, modeled toward subject-predicate-object RDF statements for KG building. The Triples are organized under three (mandatory) or more of twelve total information units (viz., ResearchProblem, Approach, Model, Code, Dataset, ExperimentalSetup, Hyperparameters, Baselines, Results, Tasks, Experiments, and AblationAnalysis).

    The Shared Task

    As a complete submission for the Shared Task, given NLP scholarly articles in plaintext format, systems had to automatically extract the following information: contribution sentences; scientific term and predicate phrases from the sentences; and (subject, predicate, object) triple statements toward KG building, organized under three or more of the twelve total information units. The shared task has a never-ending official online evaluation on CodaLab.
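    The triple structure described above can be sketched as plain Python data (the example content below is invented for illustration; only the information-unit names come from the task description):

```python
from collections import defaultdict

# Triples grouped under information units, as in the annotation scheme
# described above. Subject/predicate/object strings here are invented.
contribution = defaultdict(list)
contribution["ResearchProblem"].append(
    ("Contribution", "hasResearchProblem", "named entity recognition"))
contribution["Results"].append(
    ("our model", "achieves", "92.4 F1"))

# Render the (subject, predicate, object) statements per information unit.
for unit, triples in contribution.items():
    for s, p, o in triples:
        print(f"{unit}: ({s}, {p}, {o})")
```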

  12. Global Casual Ai Market Research Report: By Type of AI (Machine Learning,...

    • wiseguyreports.com
    Updated Jul 19, 2024
    Cite
    Wiseguy Research Consultants Pvt Ltd (2024). Global Casual Ai Market Research Report: By Type of AI (Machine Learning, Natural Language Processing, Computer Vision, Speech Recognition, Others), By End User (Individual Consumers, Small and Medium-Sized Businesses (SMBs), Large Enterprises), By Application (Customer Service, Marketing and Sales, Fraud Detection, Predictive Analytics, Image Recognition, Language Translation, Others), By Deployment Model (On-Premise, Cloud-Based, Hybrid), By Vertical (Financial Services, Healthcare, Retail, Manufacturing, Transportation and Logistics, Government, Others) and By Regional (North America, Europe, South America, Asia Pacific, Middle East and Africa) - Forecast to 2032. [Dataset]. https://www.wiseguyreports.com/reports/casual-ai-market
    Explore at:
    Dataset updated
    Jul 19, 2024
    Dataset authored and provided by
    Wiseguy Research Consultants Pvt Ltd
    License

    https://www.wiseguyreports.com/pages/privacy-policy

    Time period covered
    Jan 7, 2024
    Area covered
    Global
    Description
    BASE YEAR: 2024
    HISTORICAL DATA: 2019 - 2024
    REPORT COVERAGE: Revenue Forecast, Competitive Landscape, Growth Factors, and Trends
    MARKET SIZE 2023: 2.79 (USD Billion)
    MARKET SIZE 2024: 3.26 (USD Billion)
    MARKET SIZE 2032: 11.3 (USD Billion)
    SEGMENTS COVERED: Type of AI, End User, Application, Deployment Model, Vertical, Regional
    COUNTRIES COVERED: North America, Europe, APAC, South America, MEA
    KEY MARKET DYNAMICS: Growing adoption of AI for task automation; increasing demand for personalized user experiences; advancements in natural language processing (NLP); rising need for cost-effective AI solutions; growing competition from established tech giants
    MARKET FORECAST UNITS: USD Billion
    KEY COMPANIES PROFILED: Microsoft, IBM, NVIDIA, Qualcomm, Google, Arm, Baidu, Intel, InfuseAI, Tencent, Amazon, Apple, Alibaba, Meta, Samsung
    MARKET FORECAST PERIOD: 2025 - 2032
    KEY MARKET OPPORTUNITIES: Expansion into new vertical markets; increased demand for personalized user experiences; growing popularity of AI-powered chatbots; integration with existing technologies; rising focus on data privacy and security
    COMPOUND ANNUAL GROWTH RATE (CAGR): 16.8% (2025 - 2032)
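The table's figures are internally consistent, which a quick back-of-envelope check confirms: compounding the 2024 base at the stated CAGR over the eight-year forecast window reproduces the stated 2032 size. The snippet below only rechecks arithmetic from the table above; it adds no new data.

```python
# Sanity-check the reported figures: growing the 3.26 USD billion 2024
# base at 16.8% per year over the 8-year window 2024 -> 2032 should land
# near the stated 11.3 USD billion 2032 market size.
size_2024 = 3.26
size_2032 = 11.3
years = 8

implied_cagr = (size_2032 / size_2024) ** (1 / years) - 1
print(f"implied CAGR: {implied_cagr:.1%}")            # ~16.8%

projected_2032 = size_2024 * (1 + 0.168) ** years
print(f"projected 2032 size: {projected_2032:.2f}")   # ~11.29
```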
  13. TalentCLEF 2025 corpus: Skill and Job Title Intelligence for Human Capital...

    • zenodo.org
    zip
    Updated Apr 18, 2025
    Cite
    Luis Gascó; Luis Gascó; Fabregat Marcos Hermenegildo; Fabregat Marcos Hermenegildo; García-Sardiña Laura; Déniz Cerpa Daniel; Paula Estrella; Rodrigo Alvaro; Zbib Rabih; García-Sardiña Laura; Déniz Cerpa Daniel; Paula Estrella; Rodrigo Alvaro; Zbib Rabih (2025). TalentCLEF 2025 corpus: Skill and Job Title Intelligence for Human Capital Management [Dataset]. http://doi.org/10.5281/zenodo.15038364
    Explore at:
    zip (available download formats)
    Dataset updated
    Apr 18, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Luis Gascó; Luis Gascó; Fabregat Marcos Hermenegildo; Fabregat Marcos Hermenegildo; García-Sardiña Laura; Déniz Cerpa Daniel; Paula Estrella; Rodrigo Alvaro; Zbib Rabih; García-Sardiña Laura; Déniz Cerpa Daniel; Paula Estrella; Rodrigo Alvaro; Zbib Rabih
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    🚨 Current Status: Release of Task B Development set. To check when new data will be uploaded, please consult the task calendar.

    TalentCLEF2025 corpus - Task B Development set release

    Introduction:

    The first edition of TalentCLEF aims to develop and evaluate models designed to facilitate three essential tasks:

    1. Finding/ranking candidates for job positions based on their experience and professional skills.
    2. Implementing upskilling and reskilling strategies that promote the continuous development of workers.
    3. Detecting emerging skills and skill gaps of importance in organizations.

    With that aim, the evaluation is divided into two tasks:

    • Task A - Multilingual Job Title Matching. Given each job title in a provided test set, systems must identify and rank the most similar job titles from a specified knowledge base.
    • Task B - Job Title-Based Skill Prediction. Systems must retrieve the relevant skills associated with a specified job title.

    This data repository contains the data for these two tasks. The data is being released progressively according to the task schedule.

    The task evaluation takes place on Codabench (Task A and Task B). Participants must register for the competition through the CLEF Lab Registration Page to be part of the evaluation campaign.

    File structure:

    For a detailed description of the data structure, you can refer to the TalentCLEF2025 data description page, where it is thoroughly explained.

    The data is organized into two *.zip files, TaskA.zip and TaskB.zip, each containing training, validation and test folders to support different stages of model development. So far, only the training set for both tasks has been released; in future releases, as the tasks progress, additional data will be added to the different subfolders for each task.

    TaskA includes language-specific subfolders within the training and validation directories. The training folders contain language-specific .tsv files for English, Spanish, and German; the validation and test directories additionally cover Chinese. Validation folders include three essential files (queries, corpus_elements, and qrels) for evaluating model relevance to search queries. TaskA's test folder has queries and corpus_elements files for testing retrieval.

    TaskA/
    │
    ├── training/
    │  ├── english/
    │  │  └── taskA_training_en.tsv
    │  ├── spanish/
    │  │  └── taskA_training_es.tsv
    │  └── german/
    │    └── taskA_training_de.tsv
    │
    ├── validation/
    │  ├── english/
    │  │  ├── queries
    │  │  ├── corpus_elements
    │  │  └── qrels
    │  ├── spanish/
    │  ├── german/
    │  └── chinese/
    │
    └── test/
      ├── english/
      │  ├── queries
      │  └── corpus_elements
      ├── spanish/
      ├── german/
      └── chinese/
    

    TaskB follows a similar structure but without language-specific subfolders, providing general .tsv files for training, validation, and testing. This consistent file organization enables efficient data access and structured updates as new data versions are published.

    TaskB/
    │
    ├── training/
    │  ├── job2skill.tsv
    │  ├── jobid2terms.json
    │  └── skillid2terms.json
    │
    ├── validation/
    │  ├── queries
    │  ├── corpus_elements
    │  └── qrels
    │
    └── test/
      ├── queries
      └── corpus_elements
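The queries / corpus_elements / qrels layout above is typical of ranked-retrieval evaluation. As a hypothetical sketch (the exact TalentCLEF file formats and metrics are defined on the task's data description page), the snippet below assumes qrels maps each query id to its set of relevant corpus ids and computes Mean Average Precision over a toy run with invented ids.

```python
# Hypothetical evaluation sketch over a qrels-style relevance mapping.
# All ids below are invented; real runs would be read from the released files.

def average_precision(ranked_ids, relevant):
    """Average precision of one ranked list against a set of relevant ids."""
    hits, score = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            score += hits / rank
    return score / len(relevant) if relevant else 0.0

def mean_average_precision(run, qrels):
    """Mean of per-query AP; run maps query id -> ranked list of corpus ids."""
    aps = [average_precision(run[q], rel) for q, rel in qrels.items()]
    return sum(aps) / len(aps)

# Toy example:
qrels = {"q1": {"d1", "d3"}, "q2": {"d2"}}
run = {"q1": ["d1", "d2", "d3"], "q2": ["d3", "d2"]}
print(mean_average_precision(run, qrels))  # 0.666...
```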

    Tutorials:

    • Data Download and Load using Python: Link to Colab
    • Task A - Prepare submission file and run evaluation: Link to Colab
    • Task A - Development set Baseline generation: Link to Colab
    • Task B - Prepare submission file and run evaluation: Link to Colab

    Resources:

  14. Competition and Symmetry in an Artificial Word Learning Task, 2016-2019

    • datacatalogue.cessda.eu
    Updated Jun 1, 2025
    + more versions
    Cite
    Dautriche, I (2025). Competition and Symmetry in an Artificial Word Learning Task, 2016-2019 [Dataset]. http://doi.org/10.5255/UKDA-SN-855110
    Explore at:
    Dataset updated
    Jun 1, 2025
    Dataset provided by
    CNRS
    Authors
    Dautriche, I
    Time period covered
    Nov 1, 2016 - Nov 27, 2019
    Area covered
    United Kingdom
    Variables measured
    Individual
    Measurement technique
    Participants were recruited through Amazon Mechanical Turk. Participants were tested online. They were instructed that their task was to learn new words by associating them with objects displayed on the screen. In the instructions, participants were given a screenshot of a trial involving a word (not used during the test) and a set of objects. No information about the number of to-be-learned words was given. For each trial, a word was displayed, first alone for 500 ms to attract participants’ attention to the word, then together with a collection of 3 objects, aligned horizontally, below the word. Participants were asked to click on the object they believed to be associated with the word.
    Description

    Natural language involves competition. The sentences we choose to utter activate alternative sentences (those we chose not to utter), which hearers typically infer to be false. Hence, as a first approximation, the more alternatives a sentence activates, the more inferences it will trigger. But a closer look at the theory of competition shows that this is not quite true and that under specific circumstances, so-called symmetric alternatives cancel each other out. We present an artificial word learning experiment in which participants learn words that may enter into competition with one another. The results show that a mechanism of competition takes place, and that the subtle prediction that alternatives trigger inferences, and may stop triggering them after a point due to symmetry, is borne out. This study provides a minimal testing paradigm to reveal competition and some of its subtle characteristics in human languages and beyond.

    As anyone who has learnt a foreign language or travelled abroad will have noticed, languages differ in the sounds they employ, the names they give to things, and the rules of grammar. However, linguists have long observed that, beneath this surface diversity, all human languages share a number of fundamental structural similarities. Most obviously, all languages use sounds, all languages have words, and all languages have a grammar. More subtly and more surprisingly, similarities can also be observed in more fine-grained linguistic features: for instance, George Zipf famously observed that, across multiple languages, short words tend also to be more frequent, and in my own recent work I have shown that languages prefer to use words that sound alike (e.g., cat, mat, rat, bat, fat, ...). Why do all languages exhibit these shared features? This project aims to tackle exactly this key question by studying how languages are shaped by the human mind. In particular, I will explore how the way we learn language and use it to communicate drives the emergence of important features of lexicons, the set of all words in a language. To simulate the process of language change and evolution in the lab, I will use an experimental paradigm where an artificial language is passed between learners (language learning), and used by individuals to communicate with each other (language use). This paradigm has been successfully applied in previous research showing that key structural features of language can be explained as a consequence of repeated learning and use; my contribution will be to apply the same methods to study the evolution of the lexicon. I will then use two complementary techniques to evaluate the ecological validity of these results. First, do the artificial lexicons obtained after repeated learning and communication match the structure of lexicons found in real human languages? We will assess this by analyzing real natural language corpora using computational methods. 
Second, are these lexicons easily learnable by young children, the primary conduit of natural language transmission in the wild? This will be assessed using methods from developmental psychology to study word learning in toddlers. The present project requires an unprecedented integration of techniques and concepts from language evolution, computational linguistics and developmental psychology, three fields that have so far worked independently to understand the structure of language. The outcomes of the project will be of vital interest for all these communities, and will provide insights into the foundational properties found in all human languages, as well as the nature of the constraints underlying language processing and language acquisition. This project will provide a springboard for my future work at the intersection of computational and experimental approaches to language and cognitive development.

  15. Large-Scale Model Training Machine Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Jun 23, 2025
    Cite
    Archive Market Research (2025). Large-Scale Model Training Machine Report [Dataset]. https://www.archivemarketresearch.com/reports/large-scale-model-training-machine-196019
    Explore at:
    ppt, pdf, doc (available download formats)
    Dataset updated
    Jun 23, 2025
    Dataset authored and provided by
    Archive Market Research
    License

    https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Large-Scale Model Training Machine market is experiencing rapid growth, driven by the increasing demand for sophisticated AI applications across various sectors. The market size in 2025 is estimated at $15 billion, projecting a Compound Annual Growth Rate (CAGR) of 25% from 2025 to 2033. This robust growth is fueled by several key factors, including the proliferation of big data, advancements in deep learning algorithms, and the rising adoption of cloud computing for AI model training. The expansion of edge computing infrastructure also contributes significantly, enabling faster and more efficient training of large-scale models closer to the data source. Major players like Google, Amazon, Microsoft, and others are heavily investing in research and development, further accelerating market expansion. The market segmentation is largely driven by deployment models (on-premises vs. cloud), application domains (image recognition, natural language processing, etc.), and geographical regions. Competition is fierce, with established tech giants and emerging AI startups vying for market share through innovative solutions and strategic partnerships. The continued growth of the Large-Scale Model Training Machine market is expected to be shaped by several emerging trends. These include the increasing adoption of specialized hardware like GPUs and TPUs, the development of more efficient training algorithms, and the growing interest in federated learning for enhanced data privacy. However, challenges remain, such as the high cost of infrastructure and specialized expertise, along with concerns about data security and ethical implications of advanced AI models. Despite these challenges, the long-term outlook for the Large-Scale Model Training Machine market remains extremely positive, with sustained growth predicted well into the next decade, driven by an ever-increasing need for powerful and sophisticated AI capabilities.
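The report above gives a 2025 base and a CAGR but no terminal figure; the arithmetic implication can be sketched directly. The 2033 value below is only the compound-growth implication of the stated numbers, not a figure from the report.

```python
# Back-of-envelope implication of the stated forecast: a 15 USD billion
# 2025 base compounding at 25% per year through 2033 implies roughly a
# 6x expansion over the eight-year window.
base_2025 = 15.0          # USD billion, stated 2025 estimate
cagr = 0.25               # stated CAGR 2025-2033
years = 2033 - 2025

implied_2033 = base_2025 * (1 + cagr) ** years
print(f"implied 2033 size: {implied_2033:.1f} USD billion")  # ~89.4
```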

  16. AI Children's Learning Robot Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated May 5, 2025
    Cite
    Data Insights Market (2025). AI Children's Learning Robot Report [Dataset]. https://www.datainsightsmarket.com/reports/ai-childrens-learning-robot-1280042
    Explore at:
    ppt, pdf, doc (available download formats)
    Dataset updated
    May 5, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The AI children's learning robot market is experiencing robust growth, driven by increasing parental awareness of the benefits of early childhood education and technological advancements in artificial intelligence and robotics. The market is segmented by application (education & entertainment, autism treatment, others) and type (humanoid, animal type), reflecting the diverse functionalities and designs catering to various needs. The education and entertainment segment currently dominates, fueled by the rising demand for engaging and interactive learning tools. However, the autism treatment segment is projected to witness significant growth over the forecast period (2025-2033) due to the potential of AI robots to provide personalized therapeutic interventions and improve social interaction skills in autistic children. The humanoid robot type holds a larger market share compared to animal-type robots, largely because of its advanced capabilities in mimicking human interactions and engaging in complex educational activities. North America and Europe currently represent the largest regional markets, driven by high technological adoption rates and a strong emphasis on early childhood education. However, the Asia-Pacific region is expected to exhibit substantial growth in the coming years, fueled by rising disposable incomes and increasing investments in education technology. Several key players, including Miko, Elenco, ROYBI, Petoi, and others, are actively shaping the market landscape through product innovation and strategic partnerships. The market faces challenges such as high initial costs of AI robots and concerns about data privacy and security. Nonetheless, the continuous advancements in AI technology, coupled with growing parental investments in children's education, are expected to propel market expansion. The market's Compound Annual Growth Rate (CAGR) is estimated at 15% for the period 2025-2033, projecting a substantial increase in market size. 
This growth is further stimulated by the integration of advanced features like natural language processing, computer vision, and machine learning, improving the robots' capabilities. Competition is expected to intensify with the entry of new players, leading to further product diversification and cost reduction. Future growth will likely hinge on effectively addressing consumer concerns regarding data privacy and safety while further developing the educational and therapeutic capabilities of the robots. The market will benefit from increased research and development focusing on personalization and adaptability to various learning styles and needs.

  17. Metasearch Engine Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Jul 11, 2025
    Cite
    The citation is currently not available for this dataset.
    Explore at:
    pdf, doc, ppt (available download formats)
    Dataset updated
    Jul 11, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The metasearch engine market, while exhibiting a history of fluctuating growth, is poised for a period of expansion. While precise figures for market size and CAGR are unavailable, a logical assessment, considering the presence of established players like Google, Bing, and the listed companies (Dogpile, InfoSpace, IBM, Startpage, AOL, Ceek.jp, CurryGuide, Entireweb), suggests a substantial market. The market's value likely sits in the hundreds of millions of dollars, with a CAGR in the low-to-mid single digits, reflecting both the mature nature of the search landscape and the ongoing innovation within the metasearch sector. Key drivers include the increasing need for efficient and unbiased search results, particularly for price-sensitive consumers seeking the best deals across multiple platforms. Trends point toward increased integration of AI and machine learning for improved search accuracy and personalization, along with a growing focus on user privacy and data security. However, restraints include intense competition from dominant search engines and the complexities of maintaining consistent data accuracy across various sources. The market is segmented by features such as search algorithm, user interface, supported platforms (desktop, mobile, etc.), and target demographics (business, consumers, etc.). Although specific regional breakdowns are not provided, North America and Europe likely hold significant market share, given the established technological infrastructure and higher internet penetration rates. Future growth hinges on the ability of metasearch engines to differentiate themselves through innovative features and by effectively addressing user concerns about privacy and data security. The forecast period of 2025-2033 presents opportunities for metasearch engine providers to capitalize on evolving consumer needs. Strategic partnerships with travel, e-commerce, and other relevant sectors can drive adoption.
Investment in advanced technologies such as natural language processing (NLP) and semantic search will be crucial for enhancing user experience. While competition remains fierce, focusing on niche markets or specialized search functions can create growth avenues. Furthermore, a robust marketing strategy emphasizing transparency and trust-building is vital in overcoming user hesitancy related to data privacy. Overall, the metasearch engine market presents a complex but potentially rewarding landscape for companies willing to innovate and adapt.


