3 datasets found
  1. Hive Annotation Job Results - Cleaned and Audited

    • kaggle.com
    zip
    Updated Apr 28, 2021
    Brendan Kelley (2021). Hive Annotation Job Results - Cleaned and Audited [Dataset]. https://www.kaggle.com/brendankelley/hive-annotation-job-results-cleaned-and-audited
    Available download formats: zip (471571 bytes)
    Dataset updated
    Apr 28, 2021
    Authors
    Brendan Kelley
    Description

    Context

    This notebook serves to showcase my problem-solving ability, knowledge of the data analysis process, proficiency with Excel and its various tools and functions, as well as my strategic mindset and statistical prowess. This project consists of an auditing prompt provided by Hive Data, a raw Excel data set, a cleaned and audited version of the raw Excel data set, and my description of my thought process and knowledge used during completion of the project. The prompt can be found below:

    Hive Data Audit Prompt

    The raw data that accompanies the prompt can be found below:

    Hive Annotation Job Results - Raw Data

    ^ These are the tools I was given to complete my task. The rest of the work is entirely my own.

    To summarize broadly, my task was to audit the dataset and summarize my process and results. Specifically, I was to create a method for identifying which "jobs" - explained in the prompt above - needed to be rerun based on a set of "background facts," or criteria. The description of my extensive thought process and results can be found below in the Content section.

    Content

    Brendan Kelley April 23, 2021

    Hive Data Audit Prompt Results

    This paper explains the auditing process of the “Hive Annotation Job Results” data. It includes the preparation, analysis, visualization, and summary of the data. It is accompanied by the results of the audit in the Excel file “Hive Annotation Job Results – Audited”.

    Observation

    The “Hive Annotation Job Results” data comes in the form of a single Excel sheet. It contains 7 columns and 5,001 rows, including the column headers. The columns are “file”, “object id”, and the pseudonyms for the five questions that each client was instructed to answer about their respective table: “tabular”, “semantic”, “definition list”, “header row”, and “header column”. The “file” column contains non-unique numbers separated by a dash (that is, the same value can appear more than once in the column). The “object id” column contains non-unique numbers ranging from 5 to 487539. The five question columns contain Boolean values (TRUE or FALSE), which depend upon the yes/no worker judgement.

    Use of the COUNTIF() function reveals that there are no values other than TRUE or FALSE in any of the five question columns. The VLOOKUP() function reveals that the data does not include any missing values in any of the cells.
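
The same kind of check can be sketched outside Excel. The snippet below is a minimal Python analogue, not the author's workbook formulas: it counts, per question column, the cells that are missing or hold anything other than TRUE or FALSE. The function name `count_invalid` and the two sample rows are hypothetical and exist only for illustration.

```python
# Column names follow the five question columns described above.
QUESTION_COLUMNS = [
    "tabular",
    "semantic",
    "definition list",
    "header row",
    "header column",
]

ALLOWED = {"TRUE", "FALSE"}


def count_invalid(rows, columns=QUESTION_COLUMNS):
    """Count, per column, the cells that are missing or not TRUE/FALSE."""
    counts = {column: 0 for column in columns}
    for row in rows:
        for column in columns:
            value = row.get(column)
            # A missing key, an empty cell, or any other text counts as invalid.
            if value is None or str(value).strip().upper() not in ALLOWED:
                counts[column] += 1
    return counts


# Hypothetical sample rows (not taken from the dataset).
sample_rows = [
    {"tabular": "TRUE", "semantic": "FALSE", "definition list": "FALSE",
     "header row": "TRUE", "header column": "FALSE"},
    {"tabular": "TRUE", "semantic": "", "definition list": "FALSE",
     "header row": "yes", "header column": "FALSE"},
]

invalid_counts = count_invalid(sample_rows)
print(invalid_counts)
```

A column whose count is zero corresponds to the situation described above, where no value other than TRUE or FALSE is present.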

    Assumptions

    Based on the clean state of the data and the guidelines of the Hive Data Audit Prompt, the assumption is that duplicate values in the “file” column are acceptable and should not be removed. Similarly, duplicated values in the “object id” column are acceptable and should not be removed. The data is therefore clean and is ready for analysis/auditing.

    Preparation

    The purpose of the audit is to analyze the accuracy of the yes/no worker judgement of each question according to the guidelines of the background facts. The background facts are as follows:

    • A table that is a definition list should automatically be tabular and also semantic
    • Semantic tables should automatically be tabular
    • If a table is NOT tabular, then it is definitely not semantic nor a definition list
    • A tabular table that has a header row OR header column should definitely be semantic
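
The four background facts can be read as consistency rules over the five Boolean answers. The sketch below is one possible Python encoding of them, offered as an illustration rather than the author's Excel method. It assumes that a job "needs to be rerun" whenever at least one fact is contradicted; the function names and sample rows are hypothetical.

```python
from typing import Dict, List

Row = Dict[str, bool]


def violated_facts(row: Row) -> List[str]:
    """Return the background facts that one job's answers contradict."""
    tabular = row["tabular"]
    semantic = row["semantic"]
    definition_list = row["definition list"]
    has_header = row["header row"] or row["header column"]

    problems = []
    if definition_list and not (tabular and semantic):
        problems.append("a definition list must be tabular and semantic")
    if semantic and not tabular:
        problems.append("a semantic table must be tabular")
    if not tabular and (semantic or definition_list):
        problems.append("a non-tabular table cannot be semantic or a definition list")
    if tabular and has_header and not semantic:
        problems.append("a tabular table with a header must be semantic")
    return problems


def needs_rerun(row: Row) -> bool:
    """Assumption: any violated fact is enough to flag the job."""
    return bool(violated_facts(row))


# Hypothetical rows (not taken from the dataset).
consistent = {"tabular": True, "semantic": True, "definition list": True,
              "header row": True, "header column": False}
inconsistent = {"tabular": False, "semantic": True, "definition list": False,
                "header row": False, "header column": False}

print(needs_rerun(consistent), violated_facts(consistent))
print(needs_rerun(inconsistent), violated_facts(inconsistent))
```

The facts overlap on purpose: a non-tabular table marked semantic contradicts both the second and the third fact, so one row can report more than one violation.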

    These background facts serve as instructions for how the answers to the five questions should interact with one another. These facts can be re-written to establish criteria for each question:

    For tabular column:
    - If the table is a definition list, it is also tabular
    - If the table is semantic, it is also tabular

    For semantic column:
    - If the table is a definition list, it is also semantic
    - If the table is not tabular, it is not semantic
    - If the table is tabular and has either a header row or a header column...

  2. Data from: Data cleaning and enrichment through data integration: networking...

    • search.dataone.org
    • data.niaid.nih.gov
    • +1more
    Updated Feb 25, 2025
    Irene Finocchi; Alessio Martino; Blerina Sinaimeri; Fariba Ranjbar (2025). Data cleaning and enrichment through data integration: networking the Italian academia [Dataset]. http://doi.org/10.5061/dryad.wpzgmsbwj
    Dataset updated
    Feb 25, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Irene Finocchi; Alessio Martino; Blerina Sinaimeri; Fariba Ranjbar
    Description

    We describe a bibliometric network characterizing co-authorship collaborations in the entire Italian academic community. The network, consisting of 38,220 nodes and 507,050 edges, is built upon two distinct data sources: faculty information provided by the Italian Ministry of University and Research and publications available in Semantic Scholar. Both nodes and edges are associated with a large variety of semantic data, including gender, bibliometric indexes, authors' and publications' research fields, and temporal information. While linking data between the two original sources posed many challenges, the network has been carefully validated to assess its reliability and to understand its graph-theoretic characteristics. By resembling several features of social networks, our dataset can be profitably leveraged in experimental studies in the wide social network analytics domain as well as in more specific bibliometric contexts.

    The proposed network is built starting from two distinct data sources:

    • the entire dataset dump from Semantic Scholar (with particular emphasis on the authors and papers datasets)
    • the entire list of Italian faculty members as maintained by Cineca (under appointment by the Italian Ministry of University and Research).

    By means of a custom name-identity recognition algorithm (details are available in the accompanying paper published in Scientific Data), the names of the authors in the Semantic Scholar dataset have been mapped against the names contained in the Cineca dataset and authors with no match (e.g., because of not being part of an Italian university) have been discarded. The remaining authors will compose the nodes of the network, which have been enriched with node-related (i.e., author-related) attributes. In order to build the network edges, we leveraged the papers dataset from Semantic Scholar: specifically, any two authors are said to be connected if there is at least one pap...

    # Data cleaning and enrichment through data integration: networking the Italian academia

    https://doi.org/10.5061/dryad.wpzgmsbwj

    Manuscript published in Scientific Data with DOI .

    Description of the data and file structure

    This repository contains two main data files:

    • edge_data_AGG.csv, the full network in comma-separated edge list format (this file contains mainly temporal co-authorship information);
    • Coauthorship_Network_AGG.graphml, the full network in GraphML format.

    along with several supplementary data, listed below, useful only to build the network (i.e., for reproducibility only):

    • University-City-match.xlsx, an Excel file that maps each university's name to the city where its headquarters is located;
    • Areas-SS-CINECA-match.xlsx, an Excel file that maps the research areas in Cineca against the research areas in Semantic Scholar.
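
To make the idea of a co-authorship edge list concrete, here is a minimal Python sketch under stated assumptions. It assumes that the truncated edge rule in the methods text means two authors are linked when they share at least one paper. The toy `papers` mapping, the author names, and the per-pair paper count are hypothetical; they do not reflect the actual columns of edge_data_AGG.csv or the authors' pipeline.

```python
from collections import defaultdict
from itertools import combinations


def coauthorship_edges(papers):
    """Map each unordered author pair to the number of papers they share."""
    weights = defaultdict(int)
    for authors in papers.values():
        # Deduplicate and sort so each pair is counted once per paper
        # and always appears in the same (a, b) order.
        for a, b in combinations(sorted(set(authors)), 2):
            weights[(a, b)] += 1
    return dict(weights)


# Hypothetical toy input: paper id -> list of author identifiers.
papers = {
    "p1": ["alice", "bob", "carol"],
    "p2": ["alice", "bob"],
    "p3": ["dave"],  # a single-author paper contributes no edge
}

edges = coauthorship_edges(papers)

# Emit the result as a comma-separated edge list.
for (a, b), shared in sorted(edges.items()):
    print(f"{a},{b},{shared}")
```

Keeping the pair as a sorted tuple makes the graph undirected by construction, so (alice, bob) and (bob, alice) cannot appear as separate edges.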

    Description of the main data files

    The `Coauthorship_Networ...

  3. Food quality decision tree based on collective know-how (Capex ontology)

    • entrepot.recherche.data.gouv.fr
    bin, text/markdown +3
    Updated Sep 4, 2025
    Patrice Buche; Julien Couteaux; Julien Cufi; Sébastien Destercke; Alrick Oudot (2025). Food quality decision tree based on collective know-how (Capex ontology) [Dataset]. http://doi.org/10.57745/SEJP1B
    Available download formats: ttl (87396), xlsx (38177), bin (15119), ttl (4261), text/markdown (10457), txt (1393), bin (153351)
    Dataset updated
    Sep 4, 2025
    Dataset provided by
    Recherche Data Gouv
    Authors
    Patrice Buche; Julien Couteaux; Julien Cufi; Sébastien Destercke; Alrick Oudot
    License

    https://entrepot.recherche.data.gouv.fr/api/datasets/:persistentId/versions/4.4/customlicense?persistentId=doi:10.57745/SEJP1B

    Dataset funded by
    France Relance
    Description

    Agri-food chain processes are based on a multitude of knowledge, know-how and experiences forged over time. Improving food quality must go through the sharing of collective expertise. In this dataset, we provide files associated with the design and implementation of a comprehensive methodology to create a knowledge base integrating the collective expertise and use it to recommend technical actions to be taken to improve food quality. We propose an original core ontology expressed with the international languages of the Semantic Web to represent, on the one hand, knowledge in the form of decision trees representing potential causal relations between situations of interest and, on the other hand, recommendations in terms of technological actions to manage them. An example decision tree, “Excessive salting”, is provided in mind-mapping format and in RDF format. An additional Excel file contains data used to assess the relevance of the technical action's efficiency indicator.



Hive Annotation Job Results - Cleaned and Audited

Hive Data Audit Prompt, Answered
