17 datasets found
  1. SemEval-2024 Task 3: The Competition of Multimodal Emotion Cause Analysis in Conversations

    • zenodo.org
    • data.niaid.nih.gov
    bin, json
    Updated Sep 5, 2024
    Cite
    Fanfan Wang; Heqing Ma; Jianfei Yu; Rui Xia; Erik Cambria (2024). SemEval-2024 Task 3: The Competition of Multimodal Emotion Cause Analysis in Conversations [Dataset]. http://doi.org/10.5281/zenodo.13689364
    Explore at:
    Available download formats: bin, json
    Dataset updated
    Sep 5, 2024
    Dataset provided by
    Association for Computational Linguistics
    Authors
    Fanfan Wang; Heqing Ma; Jianfei Yu; Rui Xia; Erik Cambria
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The ability to understand emotions is an essential component of human-like artificial intelligence, as emotions greatly influence human cognition, decision making, and social interactions. Beyond emotion recognition in conversations, identifying the potential causes behind an individual’s emotional state in a conversation is of great importance in many application scenarios. We organize SemEval-2024 Task 3, named Multimodal Emotion Cause Analysis in Conversations, which aims at extracting all pairs of emotions and their corresponding causes from conversations. Under different modality settings, it consists of two subtasks: Textual Emotion-Cause Pair Extraction in Conversations (TECPE) and Multimodal Emotion-Cause Pair Extraction in Conversations (MECPE). The shared task attracted 143 registrations and 216 successful submissions. In this paper, we introduce the task, dataset and evaluation settings, summarize the systems of the top teams, and discuss the findings of the participants.

    For more information about the task, please visit our task website (https://github.com/NUSTM/SemEval-2024_ECAC) and the CodaLab competition website.
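    As a rough illustration of the task output (the field names below are invented for illustration, not the official annotation format), an emotion-cause pair links an emotion utterance to the utterance that triggered it, and systems can be scored with a simple pair-level F1 in the spirit of the official metrics:

```python
# Hypothetical sketch of emotion-cause pair extraction output.
# Field names ("utt_id", "emotion_utt", "cause_utt") are illustrative only.
conversation = [
    {"utt_id": 1, "speaker": "A", "text": "I lost my keys again."},
    {"utt_id": 2, "speaker": "B", "text": "Oh no, that's so frustrating!"},
]

# For TECPE the cause is a textual span; for MECPE it is a whole
# (multimodal) utterance. Here we pair utterances for simplicity.
pairs = [
    {"emotion_utt": 2, "emotion": "sadness", "cause_utt": 1},
]

def pair_f1(pred, gold):
    """Micro F1 over predicted (emotion_utt, emotion, cause_utt) tuples."""
    pred_set = {(p["emotion_utt"], p["emotion"], p["cause_utt"]) for p in pred}
    gold_set = {(g["emotion_utt"], g["emotion"], g["cause_utt"]) for g in gold}
    tp = len(pred_set & gold_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    return 2 * precision * recall / (precision + recall) if tp else 0.0

print(pair_f1(pairs, pairs))  # perfect match -> 1.0
```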

  2. Natural Language Processing Solution Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Jun 3, 2025
    + more versions
    Cite
    Data Insights Market (2025). Natural Language Processing Solution Report [Dataset]. https://www.datainsightsmarket.com/reports/natural-language-processing-solution-1943950
    Explore at:
    Available download formats: pdf, doc, ppt
    Dataset updated
    Jun 3, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Natural Language Processing (NLP) solutions market is experiencing robust growth, driven by the increasing adoption of AI-powered applications across various sectors. The market's expansion is fueled by the rising volume of unstructured data, the need for efficient data analysis and automation, and the growing demand for personalized customer experiences. Technological advancements, such as deep learning and improved algorithms, are enhancing NLP capabilities, enabling more accurate language understanding and generation. Key applications include chatbots, virtual assistants, sentiment analysis, machine translation, and text summarization.

    While market size data is not explicitly provided, given the presence of major players like IBM, Google, and Microsoft and the rapid growth of AI, we estimate the 2025 market size at around $15 billion. Assuming a conservative compound annual growth rate (CAGR) of 20% (a reasonable estimate given current market dynamics), the market is projected to reach approximately $40 billion by 2033.

    The market is segmented across various industries, including healthcare, finance, retail, and customer service. Healthcare's adoption of NLP for medical record analysis and patient engagement is a significant growth driver; financial institutions leverage NLP for fraud detection, risk management, and regulatory compliance; and retail businesses use NLP for personalized marketing and customer service automation. While restraining factors such as data privacy concerns and the need for high-quality training data remain, the overall market outlook is positive. The competitive landscape comprises both large technology companies and specialized NLP solution providers, fostering innovation and competition that continuously improve the accuracy, efficiency, and affordability of NLP solutions. The forecast period of 2025-2033 offers substantial opportunities for businesses to capitalize on this rapidly evolving technology.

  3. NLC2CMD Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Apr 16, 2020
    Cite
    Mayank Agarwal; Tathagata Chakraborti; Quchen Fu; David Gros; Xi Victoria Lin; Jaron Maene; Kartik Talamadupula; Zhongwei Teng; Jules White (2020). NLC2CMD Dataset [Dataset]. https://paperswithcode.com/dataset/nlc2cmd
    Explore at:
    Dataset updated
    Apr 16, 2020
    Authors
    Mayank Agarwal; Tathagata Chakraborti; Quchen Fu; David Gros; Xi Victoria Lin; Jaron Maene; Kartik Talamadupula; Zhongwei Teng; Jules White
    Description

    The NLC2CMD Competition hosted at NeurIPS 2020 aimed to bring the power of natural language processing to the command line. Participants were tasked with building models that can transform descriptions of command line tasks in English to their Bash syntax.
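    A minimal sketch of what such English-to-Bash data and a trivial baseline might look like (the example pairs and the retrieval "model" below are invented for illustration, not drawn from the actual corpus):

```python
# Illustrative NL -> Bash pairs in the spirit of NLC2CMD (invented examples).
examples = [
    {"invocation": "list all files in the current directory, including hidden ones",
     "cmd": "ls -a"},
    {"invocation": "count the number of lines in file.txt",
     "cmd": "wc -l file.txt"},
]

# A toy retrieval baseline: return the command whose English description
# shares the most words with the query.
def predict(query, data):
    q = set(query.lower().split())
    return max(data, key=lambda ex: len(q & set(ex["invocation"].split())))["cmd"]

print(predict("count lines in file.txt", examples))  # -> "wc -l file.txt"
```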

  4. Natural Language Processing For Healthcare And Life Sciences Report

    • marketresearchforecast.com
    doc, pdf, ppt
    Updated Mar 20, 2025
    + more versions
    Cite
    Market Research Forecast (2025). Natural Language Processing For Healthcare And Life Sciences Report [Dataset]. https://www.marketresearchforecast.com/reports/natural-language-processing-for-healthcare-and-life-sciences-43881
    Explore at:
    Available download formats: pdf, doc, ppt
    Dataset updated
    Mar 20, 2025
    Dataset authored and provided by
    Market Research Forecast
    License

    https://www.marketresearchforecast.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Natural Language Processing (NLP) market for healthcare and life sciences is experiencing robust growth, driven by the increasing volume of unstructured clinical data and the need for efficient data analysis to improve patient care and accelerate drug discovery. A 5% CAGR suggests a consistently expanding market, projected to reach significant value within the forecast period (2025-2033). The market is segmented by NLP type (rule-based, statistical, hybrid, learned) and application (physicians, patients, clinical operators, others). The diverse application areas reflect the multifaceted nature of NLP's impact, ranging from automating administrative tasks and improving diagnostic accuracy to personalizing patient experiences and accelerating research. Major players like Microsoft, Google, IBM, and others are actively investing in and developing NLP solutions, contributing to increased competition and innovation within the sector. The growth is further fueled by advancements in machine learning and deep learning techniques, allowing for more accurate and nuanced analysis of complex medical information. Regulatory approvals and increasing adoption of cloud-based solutions are additional positive market drivers.

    However, challenges remain. Data privacy concerns and the need for robust data security protocols represent significant hurdles. The complexity of integrating NLP solutions into existing healthcare IT infrastructure, along with the requirement for substantial investments in training and infrastructure, restrains widespread adoption. The market's future growth hinges on overcoming these challenges, along with addressing ethical considerations related to algorithmic bias and data transparency. Strategic partnerships between technology providers and healthcare organizations will be crucial in driving successful implementation and maximizing the potential of NLP in improving healthcare outcomes and transforming life sciences research. The expansion into emerging markets, particularly in Asia Pacific, will also contribute to substantial market expansion.

  5. ACHILLES: Ancient and Historical Language Evaluation Set

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated May 29, 2024
    Cite
    Oksana Dereza (2024). ACHILLES: Ancient and Historical Language Evaluation Set [Dataset]. http://doi.org/10.5281/zenodo.10655061
    Explore at:
    Available download formats: zip
    Dataset updated
    May 29, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Oksana Dereza
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Nov 5, 2023
    Description

    The dataset used in the SIGTYP 2024 Shared Task on Word Embedding Evaluation for Ancient and Historical Languages. The task included four problems; problems 1-3 were offered in both constrained and unconstrained tracks on CodaLab, while problem 4 was only a part of the unconstrained track.

    1. POS-tagging
    2. Lemmatisation
    3. Morphological feature prediction
    4. Mask filling
      • Word-level
      • Character-level

    For problems 1-3, data from Universal Dependencies v.2.12 was used for Ancient Greek, Ancient Hebrew, Classical Chinese, Coptic, Gothic, medieval Icelandic, Latin, Old Church Slavonic, Old East Slavic, Old French and Vedic Sanskrit. Old Hungarian texts, annotated to the same standard as UD corpora, were added to the dataset from the MGTSZ website. In Old Hungarian data, tokens which were POS-tagged PUNCT were altered so that the form matched the lemma to simplify complex punctuation marks used to approximate manuscript symbols; otherwise, no characters were changed.

    Although the ISO 639-3 standard does not distinguish between historical stages of Latin, as it does for other languages such as Irish, it was desirable to approximate this distinction for Latin, so we further split the Latin data. This resulted in two Latin datasets: Classical and Late Latin, and Medieval Latin. This split was dictated by the composition of the Perseus and PROIEL treebanks that served as a source for the Latin UD treebanks.

    Historical forms of Irish were only included in mask filling challenges (problem 4), as the quantity of historical Irish text data which has been tokenised and annotated to a single standard to date is insufficient for the purpose of training models to perform morphological analysis tasks. The texts were drawn from CELT, Corpas Stairiúil na Gaeilge, and digital editions of the St. Gall glosses and the Würzburg glosses. Each Irish text taken from CELT is labelled "Old", "Middle" or "Early Modern" in accordance with the language labels provided in CELT metadata. Because CELT metadata relating to language stages and text dating is reliant on information provided by a variety of different editors of earlier print editions, this metadata can be inconsistent across the corpus and on occasion inaccurate. To mitigate complications arising from this, texts drawn from CELT were included in the dataset only if they had a single Irish language label and if the dates provided in CELT metadata for the text match the expected dates for the given period in the history of the Irish language.

    The upper temporal boundary was set at 1700 CE, and texts created later than this date were not included in the dataset. The choice of this date is driven by the fact that most of the historical language data used in word embedding research dates back to the 18th century CE or later, and our intention was to focus on the more challenging and yet unaddressed data. The resulting datasets for each language were then shuffled at the sentence level and split into training, validation and test subsets at the ratio of 0.8 : 0.1 : 0.1.
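    The shuffle-and-split step described above can be sketched as follows (the shared task's actual seed and tooling are not specified, so this is only illustrative):

```python
import random

def split_sentences(sentences, seed=0):
    """Shuffle at sentence level and split 0.8 : 0.1 : 0.1, as described
    above. The seed is an assumption; the task's actual seed is unknown."""
    rng = random.Random(seed)
    data = list(sentences)
    rng.shuffle(data)
    n = len(data)
    n_train, n_valid = int(0.8 * n), int(0.1 * n)
    return (data[:n_train],
            data[n_train:n_train + n_valid],
            data[n_train + n_valid:])

train, valid, test = split_sentences(range(100))
print(len(train), len(valid), len(test))  # 80 10 10
```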

    A detailed list of text sources for each language in the dataset, as well as other metadata and the description of data formats used for each problem, is provided on the Shared Task's GitHub. The structure of the dataset is as follows:

    📂 morphology (data for problems 1-3)
    ├── 📂 test
    │   ├── 📂 ref (reference data used in CodaLab competitions)
    │   │   ├── 📂 lemmatisation
    │   │   ├── 📂 morph_features
    │   │   └── 📂 pos_tagging
    │   └── 📂 src (source test data with labels)
    ├── 📂 train
    └── 📂 valid

    📂 fill_mask_word (data for problem 4a)
    ├── 📂 test
    │   ├── 📂 ref (reference data used in CodaLab competitions)
    │   └── 📂 src (source test data with labels in 2 different formats)
    │       ├── 📂 json
    │       └── 📂 tsv
    ├── 📂 train (train data in 2 different formats)
    │   ├── 📂 json
    │   └── 📂 tsv
    └── 📂 valid (validation data in 2 different formats)
        ├── 📂 json
        └── 📂 tsv

    📂 fill_mask_char (data for problem 4b)
    ├── 📂 test
    │   ├── 📂 ref (reference data used in CodaLab competitions)
    │   └── 📂 src (source test data with labels in 2 different formats)
    │       ├── 📂 json
    │       └── 📂 tsv
    ├── 📂 train (train data in 2 different formats)
    │   ├── 📂 json
    │   └── 📂 tsv
    └── 📂 valid (validation data in 2 different formats)
        ├── 📂 json
        └── 📂 tsv

    We would like to thank Ekaterina Melnikova for suggesting the name for the dataset.

  6. Global Artificial Intelligence Market By Application (Image Recognition,...

    • techsciresearch.com
    Updated Mar 22, 2017
    Cite
    TechSci Research (2017). Global Artificial Intelligence Market By Application (Image Recognition, Natural Language Processing, Speech Recognition, etc.), By End User (Consumer Electronics, BFSI, etc.), By Region, Competition Forecast & Opportunities, Demand, Size and Competitive Analysis | TechSci Research [Dataset]. https://www.techsciresearch.com/report/global-artificial-intelligence-market-by-application-image-recognition-natural-language-processing-speech-recognition-etc-by-end-user-consumer-electronics-bfsi-etc-by-region-competition-forecast-opportunities/932.html
    Explore at:
    Dataset updated
    Mar 22, 2017
    Dataset authored and provided by
    TechSci Research
    License

    https://www.techsciresearch.com/privacy-policy.aspx

    Description

    Get the TechSci Research report on the Global Artificial Intelligence Market, covering market growth, trends, forecast and revenue.

    Pages: 255
    Market Size
    Forecast Market Size
    CAGR
    Fastest Growing Segment
    Largest Market
    Key Players

  7. Code4ML 2.0

    • zenodo.org
    csv, txt
    Updated May 19, 2025
    Cite
    Anonymous authors (2025). Code4ML 2.0 [Dataset]. http://doi.org/10.5281/zenodo.15465737
    Explore at:
    Available download formats: csv, txt
    Dataset updated
    May 19, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous authors
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is an enriched version of the Code4ML dataset, a large-scale corpus of annotated Python code snippets, competition summaries, and data descriptions sourced from Kaggle. The initial release includes approximately 2.5 million snippets of machine learning code extracted from around 100,000 Jupyter notebooks. A portion of these snippets has been manually annotated by human assessors through a custom-built, user-friendly interface designed for this task.

    The original dataset is organized into multiple CSV files, each containing structured data on different entities:

    • code_blocks.csv: Contains raw code snippets extracted from Kaggle.
    • kernels_meta.csv: Metadata for the notebooks (kernels) from which the code snippets were derived.
    • competitions_meta.csv: Metadata describing Kaggle competitions, including information about tasks and data.
    • markup_data.csv: Annotated code blocks with semantic types, allowing deeper analysis of code structure.
    • vertices.csv: A mapping from numeric IDs to semantic types and subclasses, used to interpret annotated code blocks.

    Table 1. code_blocks.csv structure

    • code_blocks_index: Global index linking code blocks to markup_data.csv.
    • kernel_id: Identifier for the Kaggle Jupyter notebook from which the code block was extracted.
    • code_block_id: Position of the code block within the notebook.
    • code_block: The actual machine learning code snippet.

    Table 2. kernels_meta.csv structure

    • kernel_id: Identifier for the Kaggle Jupyter notebook.
    • kaggle_score: Performance metric of the notebook.
    • kaggle_comments: Number of comments on the notebook.
    • kaggle_upvotes: Number of upvotes the notebook received.
    • kernel_link: URL to the notebook.
    • comp_name: Name of the associated Kaggle competition.

    Table 3. competitions_meta.csv structure

    • comp_name: Name of the Kaggle competition.
    • description: Overview of the competition task.
    • data_type: Type of data used in the competition.
    • comp_type: Classification of the competition.
    • subtitle: Short description of the task.
    • EvaluationAlgorithmAbbreviation: Metric used for assessing competition submissions.
    • data_sources: Links to datasets used.
    • metric type: Class label for the assessment metric.

    Table 4. markup_data.csv structure

    • code_block: Machine learning code block.
    • too_long: Flag indicating whether the block spans multiple semantic types.
    • marks: Confidence level of the annotation.
    • graph_vertex_id: ID of the semantic type.

    The dataset allows mapping between these tables. For example:

    • code_blocks.csv can be linked to kernels_meta.csv via the kernel_id column.
    • kernels_meta.csv is connected to competitions_meta.csv through comp_name. To maintain quality, kernels_meta.csv includes only notebooks with available Kaggle scores.

    In addition, data_with_preds.csv contains automatically classified code blocks, with a mapping back to code_blocks.csv via the code_blocks_index column.
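    The joins described above can be sketched with pandas; the toy rows below are invented for illustration and only mirror the documented column names:

```python
import pandas as pd

# Toy frames mirroring the documented schema (contents invented for illustration).
code_blocks = pd.DataFrame({
    "code_blocks_index": [0, 1],
    "kernel_id": [10, 11],
    "code_block_id": [0, 0],
    "code_block": ["import pandas as pd", "model.fit(X, y)"],
})
kernels_meta = pd.DataFrame({
    "kernel_id": [10, 11],
    "kaggle_score": [0.91, 0.87],
    "comp_name": ["titanic", "titanic"],
})
competitions_meta = pd.DataFrame({
    "comp_name": ["titanic"],
    "comp_type": ["classification"],
})

# code_blocks -> kernels_meta via kernel_id, then -> competitions_meta via comp_name.
merged = (code_blocks
          .merge(kernels_meta, on="kernel_id")
          .merge(competitions_meta, on="comp_name"))
print(merged[["code_blocks_index", "kaggle_score", "comp_type"]].shape)  # (2, 3)
```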

    Code4ML 2.0 Enhancements

    The updated Code4ML 2.0 corpus introduces kernels extracted from Meta Kaggle Code. These kernels correspond to Kaggle competitions launched since 2020. The natural-language descriptions of the competitions are retrieved with the aid of an LLM.

    Notebooks in kernels_meta2.csv may not have a Kaggle score but include a leaderboard ranking (rank), providing additional context for evaluation.

    competitions_meta_2.csv is enriched with data_cards, describing the data used in the competitions.

    Applications

    The Code4ML 2.0 corpus is a versatile resource, enabling training and evaluation of models in areas such as:

    • Code generation
    • Code understanding
    • Natural language processing of code-related tasks
  8. English Monograph OCR Dataset (Preprocessed) 📄🔍

    • kaggle.com
    Updated Mar 21, 2025
    Cite
    Arjav 007 (2025). English Monograph OCR Dataset (Preprocessed) 📄🔍 [Dataset]. https://www.kaggle.com/datasets/arjav007/icdar-eng
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 21, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Arjav 007
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset is a preprocessed version of the English Monograph subset from the ICDAR 2017 OCR Post-Correction competition. It contains OCR-generated text alongside its corresponding aligned ground truth, making it useful for OCR error detection and correction tasks.

    📌 About the Dataset

    The dataset consists of historical English texts that were processed using OCR technology. Due to OCR errors, the text contains misrecognized characters, missing words, and other inaccuracies. This dataset provides both raw OCR output and gold-standard corrected text.

    🚀 Use Cases

    This dataset is ideal for:
    - OCR Error Detection & Correction 📝
    - Training Character-Based Machine Translation Models 🔠
    - Natural Language Processing (NLP) on Historical Texts 📜

    📊 Dataset Statistics

    • Total Entries: 724
    • Character-Level OCR Error Rate: ~1.79%
    • Common OCR Errors Observed:
      • 1 → I
      • tbe → the
      • tho → the
      • aud → and
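    The character-level error rate quoted above can be computed as edit distance divided by ground-truth length; a minimal sketch (this is an assumption about the metric's definition, not the competition's official scorer):

```python
def levenshtein(a, b):
    """Edit distance between two strings via the standard DP recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def char_error_rate(ocr, gold):
    return levenshtein(ocr, gold) / max(len(gold), 1)

# Two substitutions (tbe -> the, tho -> the) over 22 gold characters.
print(char_error_rate("tbe cat sat on tho mat", "the cat sat on the mat"))
```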

    📜 Citation

    If you use this dataset, please cite the original ICDAR 2017 OCR Post-Correction paper:

    Chiron, G., Doucet, A., Coustaty, M., Moreux, J.P. (2017). ICDAR 2017 Competition on Post-OCR Text Correction.

  9. Global Workforce Resume Dataset

    • opendatabay.com
    .undefined
    Updated Jul 2, 2025
    Cite
    Datasimple (2025). Global Workforce Resume Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/503680cc-bc47-4d8a-8231-cb824e9687e5
    Explore at:
    Available download formats: .undefined
    Dataset updated
    Jul 2, 2025
    Dataset authored and provided by
    Datasimple
    Area covered
    Data Science and Analytics
    Description

    This dataset, curated and processed by Neuralframe AI, serves as a valuable resource for resume parsing, candidate profiling, and job matching applications. It includes structured information on career objectives, skills, education, work experience, certifications, and other pertinent details. The data has been collected from both open-source platforms and Neuralframe AI's proprietary sources, with all data obtained with explicit consent. The dataset was initially utilised in the Datathon Competition at Bitfest 2025, offering participants a practical dataset to develop and refine resume parsing algorithms and candidate evaluation systems.

    Columns

    The dataset contains 35 columns. Key columns include:

    • address: Candidate's address (if available).
    • career_objective: A brief summary of the candidate's career goals or objectives.
    • skills: A list of skills possessed by the candidate, such as technical and soft skills.
    • educational_institution_name: Names of educational institutions attended by the candidate.
    • degree_names: Degrees obtained by the candidate (e.g., B.Tech, MBA).
    • passing_years: Year(s) of graduation or programme completion.
    • educational_results: Results or grades achieved in educational qualifications, such as GPA, percentage, or division.
    • result_types: The format or type of the educational results, such as GPA, percentage, or classification (e.g., Distinction).
    • major_field_of_studies: The main fields or subjects studied during the candidate’s education (e.g., Computer Science, Mathematics).
    • professional_company_names: Names of the companies or organisations where the candidate has worked professionally.

    Distribution

    • Filename: resume_data.csv
    • Format: CSV (Comma-Separated Values)
    • Size: 17 MB
    • Number of Columns: 35
    • Number of Rows: 9544
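    A minimal loading sketch for the file described above (only three of the 35 columns are mocked inline here; in practice you would read resume_data.csv directly):

```python
import io
import pandas as pd

# Inline mock standing in for resume_data.csv; column names follow the
# listing above, but the row content is invented for illustration.
csv_text = io.StringIO(
    "career_objective,skills,degree_names\n"
    "Become a data scientist,\"Python; SQL\",B.Tech\n"
)
df = pd.read_csv(csv_text)  # in practice: pd.read_csv("resume_data.csv")
print(df.columns.tolist())
print(len(df))
```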

    Usage

    This dataset is ideal for:

    • Developing and refining resume parsing algorithms.
    • Creating candidate profiling systems.
    • Building job matching applications.
    • Enhancing candidate evaluation systems.
    • Research in natural language processing (NLP) and machine learning on textual data.

    Coverage

    The dataset's region coverage is global. Specific details regarding time range or detailed demographic scope are not explicitly provided within the available information.

    License

    CC-BY

    Who Can Use It

    This dataset is particularly useful for:

    • Data Scientists and Analysts: For building predictive models and extracting insights from resume data.
    • Machine Learning Engineers: For training and testing NLP models for text analysis on resumes.
    • HR Professionals and Recruiters: For automating aspects of candidate screening and matching.
    • Academic Researchers: For studies related to human resources, labour markets, or AI applications in recruitment.
    • Participants in Datathons and Competitions: Seeking a practical dataset for developing real-world solutions.

    Dataset Name Suggestions

    • Candidate Profile Dataset
    • Resume Data for AI Models
    • Global Workforce Resume Data
    • Structured Career Data
    • Job Applicant Skills Dataset

    Attributes

    Original Data Source: Resume Dataset

  10. Data from: Code4ML: a Large-scale Dataset of annotated Machine Learning Code...

    • zenodo.org
    Updated May 18, 2024
    + more versions
    Cite
    Ekaterina Trofimova; Emil Sataev; Anastasia Drozdova; Polina Guseva; Anna Scherbakova; Andrey Ustyuzhanin; Anastasia Gorodilova; Valeriy Berezovskiy (2024). Code4ML: a Large-scale Dataset of annotated Machine Learning Code [Dataset]. http://doi.org/10.5281/zenodo.11213783
    Explore at:
    Dataset updated
    May 18, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ekaterina Trofimova; Emil Sataev; Anastasia Drozdova; Polina Guseva; Anna Scherbakova; Andrey Ustyuzhanin; Anastasia Gorodilova; Valeriy Berezovskiy
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is an enriched version of Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle. The initial corpus consists of ≈ 2.5 million snippets of ML code collected from ≈ 100 thousand Jupyter notebooks. A representative fraction of the snippets is annotated by human assessors through a user-friendly interface specially designed for that purpose.

    The data is organized as a set of tables in CSV format. It includes several central entities: raw code blocks collected from Kaggle (code_blocks.csv), kernels (kernels_meta.csv) and competitions meta information (competitions_meta.csv). Manually annotated code blocks are presented as a separate table (markup_data.csv). As this table contains the numeric id of the code block semantic type, we also provide a mapping from the id to semantic class and subclass (vertices.csv).

    Snippets information (code_blocks.csv) can be mapped with kernels meta-data via kernel_id. Kernels metadata is linked to Kaggle competitions information through comp_name. To ensure the quality of the data kernels_meta.csv includes only notebooks with an available Kaggle score.

    Automatic classifications of code blocks are stored in data_with_preds.csv. This table can be mapped to code_blocks.csv through the code_blocks_index column, which corresponds to code_blocks indices.

    The updated Code4ML 2.0 corpus includes kernels retrieved from Meta Kaggle Code. These kernels correspond to Kaggle competitions launched since 2020. The natural-language descriptions of the competitions are retrieved with the aid of an LLM.

    kernels_meta2.csv may contain kernels without a Kaggle score, but with a leaderboard placement (rank).

    Code4ML 2.0 dataset can be used for various purposes, including training and evaluating models for code generation, code understanding, and natural language processing tasks.

  11. SemEval-2021 Task 11: NLPContributionGraph

    • zenodo.org
    zip
    Updated Jul 27, 2021
    + more versions
    Cite
    Jennifer D'Souza; Soeren Auer; Ted Pedersen (2021). SemEval-2021 Task 11: NLPContributionGraph [Dataset]. http://doi.org/10.25835/0022787
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 27, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jennifer D'Souza; Soeren Auer; Ted Pedersen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    NLPContributionGraph was introduced as Task 11 at SemEval 2021 for the first time. The task is defined on a dataset of Natural Language Processing (NLP) scholarly articles with their contributions structured to be integrable within Knowledge Graph infrastructures such as the Open Research Knowledge Graph. The structured contribution annotations are provided as (1) Contribution sentences : a set of sentences about the contribution in the article; (2) Scientific terms and relations: a set of scientific terms and relational cue phrases extracted from the contribution sentences; and (3) Triples: semantic statements that pair scientific terms with a relation, modeled toward subject-predicate-object RDF statements for KG building. The Triples are organized under three (mandatory) or more of twelve total information units (viz., ResearchProblem, Approach, Model, Code, Dataset, ExperimentalSetup, Hyperparameters, Baselines, Results, Tasks, Experiments, and AblationAnalysis).

    The Shared Task

    As a complete submission for the Shared Task, given NLP scholarly articles in plaintext format, systems had to automatically extract the following information: contribution sentences; scientific term and predicate phrases from the sentences; and (subject, predicate, object) triple statements toward KG building, organized under three or more of the twelve total information units. The shared task has a never-ending official online evaluation on CodaLab.
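    The triple structure described above can be sketched as plain Python data (the example content below is invented for illustration; only the information-unit names come from the task description):

```python
from collections import defaultdict

# Triples grouped under information units, as in the annotation scheme
# described above. Subject/predicate/object strings here are invented.
contribution = defaultdict(list)
contribution["ResearchProblem"].append(
    ("Contribution", "hasResearchProblem", "named entity recognition"))
contribution["Results"].append(
    ("our model", "achieves", "92.4 F1"))

# Render the (subject, predicate, object) statements per information unit.
for unit, triples in contribution.items():
    for s, p, o in triples:
        print(f"{unit}: ({s}, {p}, {o})")
```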

  12. Global Casual Ai Market Research Report: By Type of AI (Machine Learning,...

    • wiseguyreports.com
    Updated Jul 19, 2024
    Cite
    Wiseguy Research Consultants Pvt Ltd (2024). Global Casual Ai Market Research Report: By Type of AI (Machine Learning, Natural Language Processing, Computer Vision, Speech Recognition, Others), By End User (Individual Consumers, Small and Medium-Sized Businesses (SMBs), Large Enterprises), By Application (Customer Service, Marketing and Sales, Fraud Detection, Predictive Analytics, Image Recognition, Language Translation, Others), By Deployment Model (On-Premise, Cloud-Based, Hybrid), By Vertical (Financial Services, Healthcare, Retail, Manufacturing, Transportation and Logistics, Government, Others) and By Regional (North America, Europe, South America, Asia Pacific, Middle East and Africa) - Forecast to 2032. [Dataset]. https://www.wiseguyreports.com/reports/casual-ai-market
    Explore at:
    Dataset updated
    Jul 19, 2024
    Dataset authored and provided by
    Wiseguy Research Consultants Pvt Ltd
    License

    https://www.wiseguyreports.com/pages/privacy-policy

    Time period covered
    Jan 7, 2024
    Area covered
    Global
    Description
    BASE YEAR: 2024
    HISTORICAL DATA: 2019 - 2024
    REPORT COVERAGE: Revenue Forecast, Competitive Landscape, Growth Factors, and Trends
    MARKET SIZE 2023: 2.79 (USD Billion)
    MARKET SIZE 2024: 3.26 (USD Billion)
    MARKET SIZE 2032: 11.3 (USD Billion)
    SEGMENTS COVERED: Type of AI, End User, Application, Deployment Model, Vertical, Regional
    COUNTRIES COVERED: North America, Europe, APAC, South America, MEA
    KEY MARKET DYNAMICS: Growing adoption of AI for task automation; increasing demand for personalized user experiences; advancements in natural language processing (NLP); rising need for cost-effective AI solutions; growing competition from established tech giants
    MARKET FORECAST UNITS: USD Billion
    KEY COMPANIES PROFILED: Microsoft, IBM, NVIDIA, Qualcomm, Google, Arm, Baidu, Intel, InfuseAI, Tencent, Amazon, Apple, Alibaba, Meta, Samsung
    MARKET FORECAST PERIOD: 2025 - 2032
    KEY MARKET OPPORTUNITIES: Expansion into new vertical markets; increased demand for personalized user experiences; growing popularity of AI-powered chatbots; integration with existing technologies; rising focus on data privacy and security
    COMPOUND ANNUAL GROWTH RATE (CAGR): 16.8% (2025 - 2032)
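The table's figures are internally consistent, which a quick back-of-envelope check confirms: compounding the 2024 base at the stated CAGR over the eight-year forecast window reproduces the stated 2032 size. The snippet below only rechecks arithmetic from the table above; it adds no new data.

```python
# Sanity-check the reported figures: growing the 3.26 USD billion 2024
# base at 16.8% per year over the 8-year window 2024 -> 2032 should land
# near the stated 11.3 USD billion 2032 market size.
size_2024 = 3.26
size_2032 = 11.3
years = 8

implied_cagr = (size_2032 / size_2024) ** (1 / years) - 1
print(f"implied CAGR: {implied_cagr:.1%}")            # ~16.8%

projected_2032 = size_2024 * (1 + 0.168) ** years
print(f"projected 2032 size: {projected_2032:.2f}")   # ~11.29
```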
  13. TalentCLEF 2025 corpus: Skill and Job Title Intelligence for Human Capital...

    • zenodo.org
    zip
    Updated Apr 18, 2025
    Cite
    Luis Gascó; Luis Gascó; Fabregat Marcos Hermenegildo; Fabregat Marcos Hermenegildo; García-Sardiña Laura; Déniz Cerpa Daniel; Paula Estrella; Rodrigo Alvaro; Zbib Rabih; García-Sardiña Laura; Déniz Cerpa Daniel; Paula Estrella; Rodrigo Alvaro; Zbib Rabih (2025). TalentCLEF 2025 corpus: Skill and Job Title Intelligence for Human Capital Management [Dataset]. http://doi.org/10.5281/zenodo.15038364
    Explore at:
    zip (available download formats)
    Dataset updated
    Apr 18, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Luis Gascó; Luis Gascó; Fabregat Marcos Hermenegildo; Fabregat Marcos Hermenegildo; García-Sardiña Laura; Déniz Cerpa Daniel; Paula Estrella; Rodrigo Alvaro; Zbib Rabih; García-Sardiña Laura; Déniz Cerpa Daniel; Paula Estrella; Rodrigo Alvaro; Zbib Rabih
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    🚨 Current Status: Release of Task B Development set. To check when new data will be uploaded, please consult the task calendar.

    TalentCLEF2025 corpus - Task B Development set release

    Introduction:

    The first edition of TalentCLEF aims to develop and evaluate models designed to facilitate three essential tasks:

    1. Finding/ranking candidates for job positions based on their experience and professional skills.
    2. Implementing upskilling and reskilling strategies that promote the continuous development of workers.
    3. Detecting emerging skills and skill gaps of importance in organizations.

    With that aim, the evaluation is divided into two tasks:

    • Task A - Multilingual Job Title Matching. Given each job title in a provided test set, systems must identify and rank the most similar job titles from a specified knowledge base.
    • Task B - Job Title-Based Skill Prediction. Systems must retrieve the relevant skills associated with a specified job title.

    This data repository contains the data for these two tasks. The data is being released progressively according to the task schedule.

    The task evaluation takes place on Codabench (Task A and Task B). Participants must register for the competition through the CLEF Lab Registration Page to be part of the evaluation campaign.

    File structure:

    For a detailed description of the data structure, you can refer to the TalentCLEF2025 data description page, where it is thoroughly explained.

    The data is organized into two *.zip files, TaskA.zip and TaskB.zip, each containing training, validation and test folders to support different stages of model development. So far, only the training set for both tasks has been released; in future releases, as the tasks progress, additional data will be added to the different subfolders for each task.

    TaskA includes language-specific subfolders within the training and validation directories. The training folders contain language-specific .tsv files for English, Spanish, and German; the validation and test directories additionally cover Chinese. Validation folders include three essential files (queries, corpus_elements, and qrels) for evaluating model relevance to search queries. TaskA's test folder has queries and corpus_elements files for testing retrieval.

    TaskA/
    │
    ├── training/
    │  ├── english/
    │  │  └── taskA_training_en.tsv
    │  ├── spanish/
    │  │  └── taskA_training_es.tsv
    │  └── german/
    │    └── taskA_training_de.tsv
    │
    ├── validation/
    │  ├── english/
    │  │  ├── queries
    │  │  ├── corpus_elements
    │  │  └── qrels
    │  ├── spanish/
    │  ├── german/
    │  └── chinese/
    │
    └── test/
      ├── english/
      │  ├── queries
      │  └── corpus_elements
      ├── spanish/
      ├── german/
      └── chinese/
    

    TaskB follows a similar structure but without language-specific subfolders, providing general .tsv files for training, validation, and testing. This consistent file organization enables efficient data access and structured updates as new data versions are published.

    TaskB/
    │
    ├── training/
    │  ├── job2skill.tsv
    │  ├── jobid2terms.json
    │  └── skillid2terms.json
    │
    ├── validation/
    │  ├── queries
    │  ├── corpus_elements
    │  └── qrels
    │
    └── test/
      ├── queries
      └── corpus_elements
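The queries / corpus_elements / qrels layout above is typical of ranked-retrieval evaluation. As a hypothetical sketch (the exact TalentCLEF file formats and metrics are defined on the task's data description page), the snippet below assumes qrels maps each query id to its set of relevant corpus ids and computes Mean Average Precision over a toy run with invented ids.

```python
# Hypothetical evaluation sketch over a qrels-style relevance mapping.
# All ids below are invented; real runs would be read from the released files.

def average_precision(ranked_ids, relevant):
    """Average precision of one ranked list against a set of relevant ids."""
    hits, score = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            score += hits / rank
    return score / len(relevant) if relevant else 0.0

def mean_average_precision(run, qrels):
    """Mean of per-query AP; run maps query id -> ranked list of corpus ids."""
    aps = [average_precision(run[q], rel) for q, rel in qrels.items()]
    return sum(aps) / len(aps)

# Toy example:
qrels = {"q1": {"d1", "d3"}, "q2": {"d2"}}
run = {"q1": ["d1", "d2", "d3"], "q2": ["d3", "d2"]}
print(mean_average_precision(run, qrels))  # 0.666...
```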

    Tutorials:

    • Data Download and Load using Python: Link to Colab
    • Task A - Prepare submission file and run evaluation: Link to Colab
    • Task A - Development set Baseline generation: Link to Colab
    • Task B - Prepare submission file and run evaluation: Link to Colab

    Resources:

  14. Competition and Symmetry in an Artificial Word Learning Task, 2016-2019

    • datacatalogue.cessda.eu
    Updated Jun 1, 2025
    + more versions
    Cite
    Dautriche, I (2025). Competition and Symmetry in an Artificial Word Learning Task, 2016-2019 [Dataset]. http://doi.org/10.5255/UKDA-SN-855110
    Explore at:
    Dataset updated
    Jun 1, 2025
    Dataset provided by
    CNRS
    Authors
    Dautriche, I
    Time period covered
    Nov 1, 2016 - Nov 27, 2019
    Area covered
    United Kingdom
    Variables measured
    Individual
    Measurement technique
    Participants were recruited through Amazon Mechanical Turk. Participants were tested online. They were instructed that their task was to learn new words by associating them with objects displayed on the screen. In the instructions, participants were given a screenshot of a trial involving a word (not used during the test) and a set of objects. No information about the number of to-be-learned words was given. For each trial, a word was displayed, first alone for 500 ms to attract participants’ attention to the word, then together with a collection of 3 objects, aligned horizontally, below the word. Participants were asked to click on the object they believed to be associated with the word.
    Description

    Natural language involves competition. The sentences we choose to utter activate alternative sentences (those we chose not to utter), which hearers typically infer to be false. Hence, as a first approximation, the more alternatives a sentence activates, the more inferences it will trigger. But a closer look at the theory of competition shows that this is not quite true and that under specific circumstances, so-called symmetric alternatives cancel each other out. We present an artificial word learning experiment in which participants learn words that may enter into competition with one another. The results show that a mechanism of competition takes place, and that the subtle prediction that alternatives trigger inferences, and may stop triggering them after a point due to symmetry, is borne out. This study provides a minimal testing paradigm to reveal competition and some of its subtle characteristics in human languages and beyond.

    As anyone who has learnt a foreign language or travelled abroad will have noticed, languages differ in the sounds they employ, the names they give to things, and the rules of grammar. However, linguists have long observed that, beneath this surface diversity, all human languages share a number of fundamental structural similarities. Most obviously, all languages use sounds, all languages have words, and all languages have a grammar. More subtly and more surprisingly, similarities can also be observed in more fine-grained linguistic features: for instance, George Zipf famously observed that, across multiple languages, short words tend also to be more frequent, and in my own recent work I have shown that languages prefer to use words that sound alike (e.g., cat, mat, rat, bat, fat, ...). Why do all languages exhibit these shared features? This project aims to tackle exactly this key question by studying how languages are shaped by the human mind. In particular, I will explore how the way we learn language and use it to communicate drives the emergence of important features of lexicons, the set of all words in a language. To simulate the process of language change and evolution in the lab, I will use an experimental paradigm where an artificial language is passed between learners (language learning), and used by individuals to communicate with each other (language use). This paradigm has been successfully applied in previous research showing that key structural features of language can be explained as a consequence of repeated learning and use; my contribution will be to apply the same methods to study the evolution of the lexicon. I will then use two complementary techniques to evaluate the ecological validity of these results. First, do the artificial lexicons obtained after repeated learning and communication match the structure of lexicons found in real human languages? We will assess this by analyzing real natural language corpora using computational methods. 
Second, are these lexicons easily learnable by young children, the primary conduit of natural language transmission in the wild? This will be assessed using methods from developmental psychology to study word learning in toddlers. The present project requires an unprecedented integration of techniques and concepts from language evolution, computational linguistics and developmental psychology, three fields that have so far worked independently to understand the structure of language. The outcomes of the project will be of vital interest for all these communities, and will provide insights into the foundational properties found in all human languages, as well as the nature of the constraints underlying language processing and language acquisition. This project will provide a springboard for my future work at the intersection of computational and experimental approaches to language and cognitive development.

  15. Large-Scale Model Training Machine Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Jun 23, 2025
    Cite
    Archive Market Research (2025). Large-Scale Model Training Machine Report [Dataset]. https://www.archivemarketresearch.com/reports/large-scale-model-training-machine-196019
    Explore at:
    ppt, pdf, doc (available download formats)
    Dataset updated
    Jun 23, 2025
    Dataset authored and provided by
    Archive Market Research
    License

    https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Large-Scale Model Training Machine market is experiencing rapid growth, driven by the increasing demand for sophisticated AI applications across various sectors. The market size in 2025 is estimated at $15 billion, projecting a Compound Annual Growth Rate (CAGR) of 25% from 2025 to 2033. This robust growth is fueled by several key factors, including the proliferation of big data, advancements in deep learning algorithms, and the rising adoption of cloud computing for AI model training. The expansion of edge computing infrastructure also contributes significantly, enabling faster and more efficient training of large-scale models closer to the data source. Major players like Google, Amazon, Microsoft, and others are heavily investing in research and development, further accelerating market expansion. The market segmentation is largely driven by deployment models (on-premises vs. cloud), application domains (image recognition, natural language processing, etc.), and geographical regions. Competition is fierce, with established tech giants and emerging AI startups vying for market share through innovative solutions and strategic partnerships. The continued growth of the Large-Scale Model Training Machine market is expected to be shaped by several emerging trends. These include the increasing adoption of specialized hardware like GPUs and TPUs, the development of more efficient training algorithms, and the growing interest in federated learning for enhanced data privacy. However, challenges remain, such as the high cost of infrastructure and specialized expertise, along with concerns about data security and ethical implications of advanced AI models. Despite these challenges, the long-term outlook for the Large-Scale Model Training Machine market remains extremely positive, with sustained growth predicted well into the next decade, driven by an ever-increasing need for powerful and sophisticated AI capabilities.
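The report above gives a 2025 base and a CAGR but no terminal figure; the arithmetic implication can be sketched directly. The 2033 value below is only the compound-growth implication of the stated numbers, not a figure from the report.

```python
# Back-of-envelope implication of the stated forecast: a 15 USD billion
# 2025 base compounding at 25% per year through 2033 implies roughly a
# 6x expansion over the eight-year window.
base_2025 = 15.0          # USD billion, stated 2025 estimate
cagr = 0.25               # stated CAGR 2025-2033
years = 2033 - 2025

implied_2033 = base_2025 * (1 + cagr) ** years
print(f"implied 2033 size: {implied_2033:.1f} USD billion")  # ~89.4
```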

  16. AI Children's Learning Robot Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated May 5, 2025
    Cite
    Data Insights Market (2025). AI Children's Learning Robot Report [Dataset]. https://www.datainsightsmarket.com/reports/ai-childrens-learning-robot-1280042
    Explore at:
    ppt, pdf, doc (available download formats)
    Dataset updated
    May 5, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The AI children's learning robot market is experiencing robust growth, driven by increasing parental awareness of the benefits of early childhood education and technological advancements in artificial intelligence and robotics. The market is segmented by application (education & entertainment, autism treatment, others) and type (humanoid, animal type), reflecting the diverse functionalities and designs catering to various needs. The education and entertainment segment currently dominates, fueled by the rising demand for engaging and interactive learning tools. However, the autism treatment segment is projected to witness significant growth over the forecast period (2025-2033) due to the potential of AI robots to provide personalized therapeutic interventions and improve social interaction skills in autistic children. The humanoid robot type holds a larger market share compared to animal-type robots, largely because of its advanced capabilities in mimicking human interactions and engaging in complex educational activities. North America and Europe currently represent the largest regional markets, driven by high technological adoption rates and a strong emphasis on early childhood education. However, the Asia-Pacific region is expected to exhibit substantial growth in the coming years, fueled by rising disposable incomes and increasing investments in education technology. Several key players, including Miko, Elenco, ROYBI, Petoi, and others, are actively shaping the market landscape through product innovation and strategic partnerships. The market faces challenges such as high initial costs of AI robots and concerns about data privacy and security. Nonetheless, the continuous advancements in AI technology, coupled with growing parental investments in children's education, are expected to propel market expansion. The market's Compound Annual Growth Rate (CAGR) is estimated at 15% for the period 2025-2033, projecting a substantial increase in market size. 
This growth is further stimulated by the integration of advanced features like natural language processing, computer vision, and machine learning, improving the robots' capabilities. Competition is expected to intensify with the entry of new players, leading to further product diversification and cost reduction. Future growth will likely hinge on effectively addressing consumer concerns regarding data privacy and safety while further developing the educational and therapeutic capabilities of the robots. The market will benefit from increased research and development focusing on personalization and adaptability to various learning styles and needs.

  17. Metasearch Engine Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Jul 11, 2025
    Cite
    The citation is currently not available for this dataset.
    Explore at:
    pdf, doc, ppt (available download formats)
    Dataset updated
    Jul 11, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The metasearch engine market, while exhibiting a history of fluctuating growth, is poised for a period of expansion. While precise figures for market size and CAGR are unavailable, a logical assessment, considering the presence of established players like Google, Bing, and the listed companies (Dogpile, InfoSpace, IBM, Startpage, AOL, Ceek.jp, CurryGuide, Entireweb), suggests a substantial market. The market's value likely sits in the hundreds of millions of dollars, with a CAGR in the low-to-mid single digits, reflecting both the mature nature of the search landscape and the ongoing innovation within the metasearch sector. Key drivers include the increasing need for efficient and unbiased search results, particularly for price-sensitive consumers seeking the best deals across multiple platforms. Trends point toward increased integration of AI and machine learning for improved search accuracy and personalization, along with a growing focus on user privacy and data security. However, restraints include intense competition from dominant search engines and the complexities of maintaining consistent data accuracy across various sources. The market is segmented by features such as search algorithm, user interface, supported platforms (desktop, mobile, etc.), and target demographics (business, consumers, etc.). Although specific regional breakdowns are not provided, North America and Europe likely hold significant market share, given the established technological infrastructure and higher internet penetration rates. Future growth hinges on the ability of metasearch engines to differentiate themselves through innovative features and by effectively addressing user concerns about privacy and data security. The forecast period of 2025-2033 presents opportunities for metasearch engine providers to capitalize on evolving consumer needs. Strategic partnerships with travel, e-commerce, and other relevant sectors can drive adoption.
Investment in advanced technologies such as natural language processing (NLP) and semantic search will be crucial for enhancing user experience. While competition remains fierce, focusing on niche markets or specialized search functions can create growth avenues. Furthermore, a robust marketing strategy emphasizing transparency and trust-building is vital in overcoming user hesitancy related to data privacy. Overall, the metasearch engine market presents a complex but potentially rewarding landscape for companies willing to innovate and adapt.


