Data Definition Guidelines
catalog.data.gov
data.virginia.gov
Updated Sep 8, 2025
Cite
Administration for Children and Families (2025). Data Definition Guidelines [Dataset]. https://catalog.data.gov/dataset/data-definition-guidelines
Dataset provided by
Administration for Children and Families
Description
ACF agency-wide resource. Metadata-only record linking to the original dataset.
Dataset: A Systematic Literature Review on the topic of High-value datasets
data.niaid.nih.gov
zenodo.org
Updated Jun 23, 2023
Cite
Anastasija Nikiforova; Nina Rizun; Magdalena Ciesielska; Charalampos Alexopoulos; Andrea Miletič (2023). Dataset: A Systematic Literature Review on the topic of High-value datasets [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7944424
Dataset provided by
University of the Aegean
Gdańsk University of Technology
University of Zagreb
University of Tartu
Authors
Anastasija Nikiforova; Nina Rizun; Magdalena Ciesielska; Charalampos Alexopoulos; Andrea Miletič
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains data collected during a study ("Towards High-Value Datasets determination for data-driven development: a systematic literature review") conducted by Anastasija Nikiforova (University of Tartu), Nina Rizun, Magdalena Ciesielska (Gdańsk University of Technology), Charalampos Alexopoulos (University of the Aegean) and Andrea Miletič (University of Zagreb). It is being made public both to act as supplementary data for the paper of the same title (a pre-print is available in Open Access at https://arxiv.org/abs/2305.10234) and to allow other researchers to use these data in their own work.

The protocol is intended for the systematic literature review (SLR) on the topic of high-value datasets (HVD). Its aim is to gather information on how HVD and their determination have been reflected in the literature over the years and what these studies have found to date, including the indicators used in them, the stakeholders involved, data-related aspects, and frameworks. The data in this dataset were collected as a result of the SLR over Scopus, Web of Science, and the Digital Government Research Library (DGRL) in 2023.

Methodology

To understand how HVD determination has been reflected in the literature over the years and what these studies have found to date, all relevant literature covering this topic was studied. To this end, the SLR was carried out by searching the digital libraries covered by Scopus, Web of Science (WoS), and the Digital Government Research Library (DGRL).

These databases were queried for the keywords ("open data" OR "open government data") AND ("high-value data*" OR "high value data*"), applied to the article title, keywords, and abstract. This limited the results to papers in which these concepts were primary research objects rather than merely mentioned in the body, e.g., as future work. After deduplication, 11 unique articles were found and checked for relevance. As a result, a total of 9 articles were examined further. Each study was independently examined by at least two authors.
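As a rough illustration of the screening step described above, the keyword query and the deduplication can be sketched in Python. This is only a sketch under stated assumptions: the regular expressions approximate the quoted query (real databases apply their own wildcard and phrase semantics), deduplication by normalized title is an assumption since the description does not say how duplicates were detected, and the sample records are made up.

```python
import re

# Approximations of the two query groups; "data*" is modeled as "data" plus any word characters.
OPEN_DATA = re.compile(r"open (government )?data", re.IGNORECASE)
HIGH_VALUE = re.compile(r"high[- ]value data\w*", re.IGNORECASE)


def matches_query(title, keywords, abstract):
    """Return True if both query groups occur in the title, keywords, or abstract."""
    text = " ".join([title, keywords, abstract])
    return bool(OPEN_DATA.search(text)) and bool(HIGH_VALUE.search(text))


def deduplicate(records):
    """Keep the first record for each normalized title (assumed deduplication key)."""
    seen = set()
    unique = []
    for record in records:
        key = record["title"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical search results merged from several databases.
records = [
    {"title": "High-value datasets in open government data", "keywords": "", "abstract": ""},
    {"title": "HIGH-VALUE DATASETS IN OPEN GOVERNMENT DATA", "keywords": "", "abstract": ""},
    {"title": "Open data portals", "keywords": "usability", "abstract": "A portal survey."},
]

hits = [
    r for r in deduplicate(records)
    if matches_query(r["title"], r["keywords"], r["abstract"])
]
```

With these made-up records, the duplicate title is collapsed first and the portal paper is dropped because it never mentions high-value data.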

To attain the objective of our study, we developed the protocol, where the information on each selected study was collected in four categories: (1) descriptive information, (2) approach- and research design- related information, (3) quality-related information, (4) HVD determination-related information.

Test procedure

Each study was independently examined by at least two authors; after an in-depth examination of the full text of the article, the structured protocol was filled in for each study. The structure of the survey is available in the supplementary files (see Protocol_HVD_SLR.odt, Protocol_HVD_SLR.docx). The data collected for each study by two researchers were then synthesized into one final version by a third researcher.

Description of the data in this data set

Protocol_HVD_SLR provides the structure of the protocol.
Spreadsheet #1 provides the filled protocol for relevant studies.
Spreadsheet #2 provides the list of results after the search over three indexing databases, i.e., before filtering out irrelevant studies.

The information on each selected study was collected in four categories: (1) descriptive information, (2) approach- and research design- related information, (3) quality-related information, (4) HVD determination-related information

Descriptive information

1) Article number - a study number, corresponding to the study number assigned in an Excel worksheet
2) Complete reference - the complete source information to refer to the study
3) Year of publication - the year in which the study was published
4) Journal article / conference paper / book chapter - the type of the paper {journal article, conference paper, book chapter}
5) DOI / Website - a link to the website where the study can be found
6) Number of citations - the number of citations of the article in Google Scholar, Scopus, Web of Science
7) Availability in OA - availability of an article in the Open Access
8) Keywords - keywords of the paper as indicated by the authors
9) Relevance for this study - what is the relevance level of the article for this study? {high / medium / low}

Approach- and research design-related information

10) Objective / RQ - the research objective / aim, established research questions
11) Research method (including unit of analysis) - the methods used to collect data, including the unit of analysis (country, organisation, specific unit that has been analysed, e.g., the number of use-cases, scope of the SLR etc.)
12) Contributions - the contributions of the study
13) Method - whether the study uses a qualitative, quantitative, or mixed methods approach?
14) Availability of the underlying research data - whether there is a reference to the publicly available underlying research data e.g., transcriptions of interviews, collected data, or explanation why these data are not shared?
15) Period under investigation - period (or moment) in which the study was conducted
16) Use of theory / theoretical concepts / approaches - does the study mention any theory / theoretical concepts / approaches? If any theory is mentioned, how is theory used in the study?

Quality- and relevance-related information

17) Quality concerns - whether there are any quality concerns (e.g., limited information about the research methods used)?
18) Primary research object - is the HVD a primary research object in the study? (primary - the paper is focused around the HVD determination, secondary - mentioned but not studied (e.g., as part of discussion, future work etc.))

HVD determination-related information

19) HVD definition and type of value - how is the HVD defined in the article and / or any other equivalent term?
20) HVD indicators - what are the indicators to identify HVD? How were they identified? (components & relationships, "input -> output")
21) A framework for HVD determination - is there a framework presented for HVD identification? What components does it consist of and what are the relationships between these components? (detailed description)
22) Stakeholders and their roles - what stakeholders or actors does HVD determination involve? What are their roles?
23) Data - what data do HVD cover?
24) Level (if relevant) - what is the level of the HVD determination covered in the article? (e.g., city, regional, national, international)

Format of the file: .xls, .csv (for the first spreadsheet only), .odt, .docx

Licenses or restrictions: CC-BY

For more info, see README.txt
Meta data and supporting documentation
catalog.data.gov
s.cnmilf.com
Updated Nov 12, 2020
Cite
U.S. EPA Office of Research and Development (ORD) (2020). Meta data and supporting documentation [Dataset]. https://catalog.data.gov/dataset/meta-data-and-supporting-documentation
Dataset provided by
United States Environmental Protection Agency http://www.epa.gov/
Description
We include a description of the data sets in the meta-data as well as sample code and results from a simulated data set.

This dataset is not publicly accessible because: EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed.

It can be accessed through the following means: The R code is available online here: https://github.com/warrenjl/SpGPCW.

Format:

Abstract

The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining confidentiality of any actual pregnant women.

Availability

Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publicly available. However, we will make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This will also allow the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics and requires an appropriate data use agreement.

Description

Permissions: These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting off the median exposure amount on a given week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way while the medians and IQRs are not given. This further protects identifiability of the spatial locations used in the analysis.

File format: R workspace file.

Metadata (including data dictionary)
• y: Vector of binary responses (1: preterm birth, 0: control)
• x: Matrix of covariates; one row for each simulated individual
• z: Matrix of standardized pollution exposures
• n: Number of simulated individuals
• m: Number of exposure time periods (e.g., weeks of pregnancy)
• p: Number of columns in the covariate design matrix
• alpha_true: Vector of "true" critical window locations/magnitudes (i.e., the ground truth that we want to estimate)

This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics. Oxford University Press, OXFORD, UK, 1-30, (2019).
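The weekly standardization described in the record above (subtract the week's median, divide by the week's IQR) can be sketched as follows. This is an illustration only, not the authors' R code: the function names, the row-per-individual layout of the exposure matrix, the sample values, and the quartile convention (`method="inclusive"`) are all assumptions, since the description does not specify them.

```python
from statistics import median, quantiles


def standardize_week(exposures):
    """Subtract the weekly median and divide by the weekly interquartile range."""
    # Quartile convention is an assumption; other conventions give slightly different IQRs.
    q1, _, q3 = quantiles(exposures, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; this week cannot be standardized")
    week_median = median(exposures)
    return [(value - week_median) / iqr for value in exposures]


def standardize_matrix(z_raw):
    """Standardize each column (week) of a matrix with one row per individual."""
    n_weeks = len(z_raw[0])
    columns = [standardize_week([row[j] for row in z_raw]) for j in range(n_weeks)]
    # Transpose back so that each row again corresponds to one individual.
    return [list(row) for row in zip(*columns)]


# Made-up exposures: 5 individuals (rows) by 2 weeks (columns).
z_raw = [[10.0, 1.0], [12.0, 2.0], [14.0, 3.0], [16.0, 4.0], [18.0, 5.0]]
z_std = standardize_matrix(z_raw)
```

Because only the standardized values are released and the weekly medians and IQRs are withheld, the original exposure levels cannot be reconstructed from a matrix like `z_std`.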
Medical Service Study Area Data Dictionary
gis.data.chhs.ca.gov
data.ca.gov
Updated Sep 6, 2024
Cite
CA Department of Health Care Access and Information (2024). Medical Service Study Area Data Dictionary [Dataset]. https://gis.data.chhs.ca.gov/datasets/hcai::medical-service-study-area-data-dictionary
Dataset provided by
Department of Health Care Access and Information
Authors
CA Department of Health Care Access and Information
Description
Field Name    Data Type   Description
Statefp       Number      US Census Bureau unique identifier of the state
Countyfp      Number      US Census Bureau unique identifier of the county
Countynm      Text        County name
Tractce       Number      US Census Bureau unique identifier of the census tract
Geoid         Number      US Census Bureau unique identifier of the state + county + census tract
Aland         Number      US Census Bureau defined land area of the census tract
Awater        Number      US Census Bureau defined water area of the census tract
Asqmi         Number      Area calculated in square miles from the Aland
MSSAid        Text        ID of the Medical Service Study Area (MSSA) the census tract belongs to
MSSAnm        Text        Name of the Medical Service Study Area (MSSA) the census tract belongs to
Definition    Text        Type of MSSA, possible values are urban, rural and frontier.
TotalPovPop   Number      US Census Bureau total population for whom poverty status is determined of the census tract, taken from the 2020 ACS 5 YR S1701
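Two of the fields above are derived from others, and a short sketch may make the relationships concrete. It assumes the usual Census Bureau conventions, which the data dictionary itself does not state: a tract GEOID is the 2-digit state code, 3-digit county code, and 6-digit tract code concatenated with zero padding, and Aland is expressed in square meters. The function names and sample values are illustrative only.

```python
# One international mile is 1609.344 m, so one square mile is 1609.344 ** 2 square meters.
SQ_METERS_PER_SQ_MILE = 1609.344 ** 2


def build_geoid(statefp, countyfp, tractce):
    """Concatenate zero-padded state (2), county (3), and tract (6) codes."""
    return f"{int(statefp):02d}{int(countyfp):03d}{int(tractce):06d}"


def aland_to_sq_miles(aland_sq_m):
    """Convert a land area in square meters (assumed unit of Aland) to square miles."""
    return aland_sq_m / SQ_METERS_PER_SQ_MILE


# Illustrative values only: state 6, county 37, tract 101110.
geoid = build_geoid(6, 37, 101110)
asqmi = aland_to_sq_miles(5_179_976.220672)
```

Since the dictionary lists these identifiers as Number, leading zeros may be lost on import; padding them back as above is one way to recover a fixed-width key for joins.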
Data from: Data Dictionary Template
catalog.data.gov
data-academy.tempe.gov
Updated Mar 18, 2023
Cite
City of Tempe (2023). Data Dictionary Template [Dataset]. https://catalog.data.gov/dataset/data-dictionary-template-2e170
Dataset provided by
City of Tempe
Description
Data Dictionary template for Tempe Open Data.
Conceptualization of public data ecosystems
data.niaid.nih.gov
Updated Sep 26, 2024
Cite
Anastasija Nikiforova; Martin Lnenicka (2024). Conceptualization of public data ecosystems [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13842001
Dataset provided by
University of Tartu
University of Hradec Králové
Authors
Anastasija Nikiforova; Martin Lnenicka
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains data collected during a study "Understanding the development of public data ecosystems: from a conceptual model to a six-generation model of the evolution of public data ecosystems" conducted by Martin Lnenicka (University of Hradec Králové, Czech Republic), Anastasija Nikiforova (University of Tartu, Estonia), Mariusz Luterek (University of Warsaw, Warsaw, Poland), Petar Milic (University of Pristina - Kosovska Mitrovica, Serbia), Daniel Rudmark (Swedish National Road and Transport Research Institute, Sweden), Sebastian Neumaier (St. Pölten University of Applied Sciences, Austria), Karlo Kević (University of Zagreb, Croatia), Anneke Zuiderwijk (Delft University of Technology, Delft, the Netherlands), Manuel Pedro Rodríguez Bolívar (University of Granada, Granada, Spain).

There is a lack of understanding of the elements that constitute different types of value-adding public data ecosystems and of how these elements form and shape the development of these ecosystems over time, which can lead to misguided efforts to develop future public data ecosystems. The aim of the study is therefore (1) to explore how public data ecosystems have developed over time and (2) to identify the value-adding elements and formative characteristics of public data ecosystems. Using an exploratory retrospective analysis and a deductive approach, we systematically review 148 studies published between 1994 and 2023. Based on the results, this study presents a typology of public data ecosystems, develops a conceptual model of the elements and formative characteristics that contribute most to value-adding public data ecosystems, and develops a conceptual model of the evolutionary generations of public data ecosystems, represented by six generations and called the Evolutionary Model of Public Data Ecosystems (EMPDE). Finally, three avenues for a future research agenda are proposed.

This dataset is being made public to act as supplementary data for "Understanding the development of public data ecosystems: from a conceptual model to a six-generation model of the evolution of public data ecosystems", Telematics and Informatics, and for the Systematic Literature Review component that informs the study.

Description of the data in this data set

PublicDataEcosystem_SLR provides the structure of the protocol

Spreadsheet #1 provides the list of results after the search over three indexing databases and filtering out irrelevant studies.

Spreadsheet #2 provides the protocol structure.

Spreadsheet #3 provides the filled protocol for relevant studies.

The information on each selected study was collected in four categories: (1) descriptive information, (2) approach- and research design-related information, (3) quality-related information, (4) public data ecosystem-related information

Descriptive Information

Article number

A study number, corresponding to the study number assigned in an Excel worksheet

Complete reference

The complete source information to refer to the study (in APA style), including the author(s) of the study, the year in which it was published, the study's title and other source information.

Year of publication

The year in which the study was published.

Journal article / conference paper / book chapter

The type of the paper, i.e., journal article, conference paper, or book chapter.

Journal / conference / book

The journal, conference, or book in which the paper is published.

DOI / Website

A link to the website where the study can be found.

Number of words

The number of words in the study.

Number of citations in Scopus and WoS

The number of citations of the paper in Scopus and WoS digital libraries.

Availability in Open Access

Availability of a study in the Open Access or Free / Full Access.

Keywords

Keywords of the paper as indicated by the authors (in the paper).

Relevance for our study (high / medium / low)

What is the relevance level of the paper for our study?

Approach- and research design-related information

Objective / Aim / Goal / Purpose & Research Questions

The research objective and established RQs.

Research method (including unit of analysis)

The methods used to collect data in the study, including the unit of analysis that refers to the country, organisation, or other specific unit that has been analysed such as the number of use-cases or policy documents, number and scope of the SLR etc.

Study’s contributions

The study’s contribution as defined by the authors

Qualitative / quantitative / mixed method

Whether the study uses a qualitative, quantitative, or mixed methods approach?

Availability of the underlying research data

Whether the paper has a reference to the public availability of the underlying research data e.g., transcriptions of interviews, collected data etc., or explains why these data are not openly shared?

Period under investigation

Period (or moment) in which the study was conducted (e.g., January 2021-March 2022)

Use of theory / theoretical concepts / approaches? If yes, specify them

Does the study mention any theory / theoretical concepts / approaches? If yes, what theory / concepts / approaches? If any theory is mentioned, how is theory used in the study? (e.g., mentioned to explain a certain phenomenon, used as a framework for analysis, tested theory, theory mentioned in the future research section).

Quality-related information

Quality concerns

Whether there are any quality concerns (e.g., limited information about the research methods used)?

Public Data Ecosystem-related information

Public data ecosystem definition

How is the public data ecosystem, or any equivalent term (most often "infrastructure"), defined in the paper? If an alternative term is used, what is the public data ecosystem called in the paper?

Public data ecosystem evolution / development

Does the paper define the evolution of the public data ecosystem? If yes, how is it defined and what factors affect it?

What constitutes a public data ecosystem?

What constitutes a public data ecosystem (components & relationships) - their "FORM / OUTPUT" presented in the paper (general description with more detailed answers to further additional questions).

Components and relationships

What components does the public data ecosystem consist of and what are the relationships between these components? Alternative names for components - element, construct, concept, item, helix, dimension etc. (detailed description).

Stakeholders

What stakeholders (e.g., governments, citizens, businesses, Non-Governmental Organisations (NGOs) etc.) does the public data ecosystem involve?

Actors and their roles

What actors does the public data ecosystem involve? What are their roles?

Data (data types, data dynamism, data categories etc.)

What data does the public data ecosystem cover (is intended / designed for)? Refer to all data-related aspects, including but not limited to data types, data dynamism (static data, dynamic, real-time data, stream), prevailing data categories / domains / topics etc.

Processes / activities / dimensions, data lifecycle phases

What processes, activities, dimensions and data lifecycle phases (e.g., locate, acquire, download, reuse, transform, etc.) does the public data ecosystem involve or refer to?

Level (if relevant)

What is the level of the public data ecosystem covered in the paper? (e.g., city, municipal, regional, national (=country), supranational, international).

Other elements or relationships (if any)

What other elements or relationships does the public data ecosystem consist of?

Additional comments

Additional comments (e.g., what other topics affected the public data ecosystems and their elements, what is expected to affect the public data ecosystems in the future, what were important topics by which the period was characterised etc.).

New papers

Does the study refer to any other potentially relevant papers?

Additional references to potentially relevant papers that were found in the analysed paper (snowballing).

Format of the file: .xls, .csv (for the first spreadsheet only), .docx

Licenses or restrictions: CC-BY

For more info, see README.txt
English Wikipedia People Dataset
kaggle.com
zip
Updated Jul 31, 2025
Cite
Wikimedia (2025). English Wikipedia People Dataset [Dataset]. https://www.kaggle.com/datasets/wikimedia-foundation/english-wikipedia-people-dataset
Available download formats: zip (4293465577 bytes)
Dataset provided by
Wikimedia Foundation http://www.wikimedia.org/
Authors
Wikimedia
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Summary

This dataset contains biographical information derived from articles on English Wikipedia as it stood in early June 2024. It was created as part of the Structured Contents initiative at Wikimedia Enterprise and is intended for evaluation and research use.

The beta sample dataset is a subset of the Structured Contents Snapshot focusing on people with infoboxes in English Wikipedia, output as JSON files (compressed in tar.gz).

We warmly welcome any feedback you have. Please share your thoughts, suggestions, and any issues you encounter on the discussion page for this dataset here on Kaggle.

Data Structure

File name: wme_people_infobox.tar.gz

Size of compressed file: 4.12 GB

Size of uncompressed file: 21.28 GB

Noteworthy Included Fields:
- name - title of the article.
- identifier - ID of the article.
- image - main image representing the article's subject.
- description - one-sentence description of the article for quick reference.
- abstract - lead section, summarizing what the article is about.
- infoboxes - parsed information from the side panel (infobox) on the Wikipedia article.
- sections - parsed sections of the article, including links. Note: excludes other media/images, lists, tables and references or similar non-prose sections.

The Wikimedia Enterprise Data Dictionary explains all of the fields in this dataset.
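A minimal sketch of how such an archive could be read without unpacking all 21.28 GB to disk is given below. It assumes that each file inside the tar.gz holds one JSON object per line, which should be checked against the actual archive; the member name, the demo path, and the two demo records are made up so that the example runs on its own.

```python
import io
import json
import os
import tarfile
import tempfile


def iter_articles(archive_path):
    """Yield one dict per non-empty JSON line from every regular file in a tar.gz archive."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue
            handle = archive.extractfile(member)
            for raw_line in handle:
                line = raw_line.strip()
                if line:
                    yield json.loads(line)


def write_demo_archive(path):
    """Create a tiny stand-in archive with made-up records (not real dataset content)."""
    records = [
        {"name": "Ada Lovelace", "identifier": 1, "description": "English mathematician"},
        {"name": "Alan Turing", "identifier": 2, "description": "English computer scientist"},
    ]
    payload = "\n".join(json.dumps(record) for record in records).encode("utf-8")
    info = tarfile.TarInfo(name="people_0.ndjson")  # hypothetical member name
    info.size = len(payload)
    with tarfile.open(path, mode="w:gz") as archive:
        archive.addfile(info, io.BytesIO(payload))


demo_path = os.path.join(tempfile.mkdtemp(), "demo_people.tar.gz")
write_demo_archive(demo_path)
names = [article["name"] for article in iter_articles(demo_path)]
```

Pointing `iter_articles` at wme_people_infobox.tar.gz instead of the demo file would stream the real records one at a time, provided the one-object-per-line assumption holds.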

Stats

Infoboxes:
- Compressed: 2GB
- Uncompressed: 11GB

Infoboxes + sections + short description:
- Size of compressed file: 4.12 GB
- Size of uncompressed file: 21.28 GB

Article analysis and filtering breakdown:
- Total # of articles analyzed: 6,940,949
- # people found with QID: 1,778,226
- # people found with Category: 158,996
- # people found with Biography Project: 76,150
- Total # of people articles found: 2,013,372
- Total # people articles with infoboxes: 1,559,985

End stats:
- Total number of people articles in this dataset: 1,559,985
- that have a short description: 1,416,701
- that have an infobox: 1,559,985
- that have article sections: 1,559,921

This dataset includes 235,146 people articles that exist on Wikipedia but aren't yet tagged on Wikidata as instance of:human.
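The figures in this section can be cross-checked with a few lines of arithmetic. The variable names are illustrative, and reading the three "found with" counts as non-overlapping groups is an inference from the fact that they add up to the stated total, not something the description says outright.

```python
# Counts copied from the filtering breakdown above.
found_with_qid = 1_778_226
found_with_category = 158_996
found_with_biography_project = 76_150
with_infobox = 1_559_985

# The three groups add up to the reported total of people articles found.
total_people_found = found_with_qid + found_with_category + found_with_biography_project

# People articles found without a Wikidata QID, i.e. via Category or Biography Project.
not_tagged_on_wikidata = found_with_category + found_with_biography_project

# Share of the people articles found that have an infobox and so appear in the dataset.
share_with_infobox = with_infobox / total_people_found
```

Under that reading, the 235,146 articles not yet tagged as instance of:human are exactly those found through the Category and Biography Project routes.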

Maintenance and Support

This dataset was originally extracted from the Wikimedia Enterprise APIs on June 5, 2024. The information in this dataset may therefore be out of date. This dataset isn't being actively updated or maintained, and has been shared for community use and feedback. If you'd like to retrieve up-to-date Wikipedia articles or data from other Wikiprojects, get started with Wikimedia Enterprise's APIs.

Initial Data Collection and Normalization

The dataset is built from the Wikimedia Enterprise HTML “snapshots”: https://enterprise.wikimedia.com/docs/snapshot/ and focuses on the Wikipedia article namespace (namespace 0 (main)).

Who are the source language producers?

Wikipedia is a human-generated corpus of free knowledge, written, edited, and curated by a global community of editors since 2001. It is the largest and most accessed educational resource in history, accessed over 20 billion times by half a billion people each month. Wikipedia represents almost 25 years of work by its community: the creation, curation, and maintenance of millions of articles on distinct topics. This dataset includes the biographical contents of the English-language edition of Wikipedia (https://en.wikipedia.org/), written by the community.

Attribution

Terms and conditions

Wikimedia Enterprise provides this dataset under the assumption that downstream users will adhere to the relevant free culture licenses when the data is reused. In situations where attribution is required, reusers should identify the Wikimedia project from which the content was retrieved as the source of the content. Any attribution should adhere to Wikimedia’s trademark policy (available at https://foundation.wikimedia.org/wiki/Trademark_policy) and visual identity guidelines (ava...
Natural Resources Data Dictionary
catalog.data.gov
datasets.ai
Updated Mar 17, 2023
Cite
Lake County Illinois GIS (2023). Natural Resources Data Dictionary [Dataset]. https://catalog.data.gov/dataset/natural-resources-data-dictionary-aeff9
Dataset provided by
Lake County Illinois GIS
Description
An in-depth description of the various Natural Resources GIS data layers outlining terms of use, update frequency, attribute explanations, and more. District data layers include: Forest Preserve Boundaries and State Park Boundaries.
20CDA35310305 data information (README and data dictionary)
datasetcatalog.nlm.nih.gov
figshare.com
Updated Aug 30, 2024
Cite
Burchat, Natalie; Sampath, Harini (2024). 20CDA35310305 data information (README and data dictionary) [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001287002
Authors
Burchat, Natalie; Sampath, Harini
Description
An intestine-specific Scd1 knockout model was developed by crossing Scd1fl/fl mice with mice expressing Cre recombinase under the control of the Villin promoter to study the specific role of intestinal SCD1 in intestinal and whole-body lipid metabolism. The intestinal, hepatic and plasma lipid content and composition of these mice were evaluated by GC-MS analysis under chow fed and sucrose refed conditions. The role of intestinal SCD1 in the regulation of energy balance was also evaluated under chow fed and high-fat conditions. Bile acid content, composition, and signaling was analyzed. Additionally, metabolic phenotyping including body composition, indirect calorimetry and glucose tolerance analyses were conducted.
Data from: Variable definition.
datasetcatalog.nlm.nih.gov
plos.figshare.com
Updated Mar 17, 2023
Cite
Min, Liangyu; Huang, Xiaohong; Zhang, Xiaorong; Zhang, Jun; Zeng, Qianqian; Liu, Jiangwei (2023). Variable definition. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000998892
Authors
Min, Liangyu; Huang, Xiaohong; Zhang, Xiaorong; Zhang, Jun; Zeng, Qianqian; Liu, Jiangwei
Description
The impact of a chief executive officer’s (CEO’s) functional experience on firm performance has gained the attention of many scholars. However, the measurement of functional experience is rarely disclosed in the public database. Few studies have been conducted on the comprehensive functional experience of CEOs. This paper used the upper echelons theory and obtained deep-level curricula vitae (CVs) data through the named entity recognition technique. First, we mined 15 consecutive years of CEOs’ CVs from 2006 to 2020 from Chinese listed companies. Second, we extracted information throughout their careers and automatically classified their functional hierarchy. Finally, we constructed breadth (functional breadth: functional experience richness) and depth (functional depth: average tenure and the hierarchy of function) for empirical analysis. We found that a CEO’s breadth is significantly negatively related to firm performance, and the quadratic term is significantly positive. A CEO’s depth is significantly positively related to firm performance, and the quadratic term is significantly negative. The research results indicate a u-shaped relationship between a CEO’s breadth and firm performance and an inverted u-shaped relationship between their depth and firm performance. The study’s findings extend the literature on factors influencing firm performance and CEOs’ functional experience. The study expands from the horizontal macro to the vertical micro level, providing new evidence to support the recruitment and selection of high-level corporate talent.
TxDOT Street Definition Data Dictionary
geoportal-mpo.opendata.arcgis.com
arc-gis-hub-home-arcgishub.hub.arcgis.com
Updated Apr 24, 2025
Cite
Texas Department of Transportation (2025). TxDOT Street Definition Data Dictionary [Dataset]. https://geoportal-mpo.opendata.arcgis.com/documents/2c7c512e64334fb49884613fe745b406
Dataset authored and provided by
Texas Department of Transportation http://txdot.gov/
Description
Programmatically generated Data Dictionary document detailing the TxDOT Street Definition service.

The PDF contains service metadata and a complete list of data fields. For any questions or issues related to the document, please contact the data owner of the service identified in the PDF and Credits of this portal item.

Related Links: TxDOT Street Definition Service URL; TxDOT Street Definition Portal Item
Wikipedia Structured Contents
kaggle.com
zip
Updated Apr 11, 2025
Wikimedia (2025). Wikipedia Structured Contents [Dataset]. https://www.kaggle.com/datasets/wikimedia-foundation/wikipedia-structured-contents
Explore at:
Available download formats: zip (25121685657 bytes)
Dataset updated
Apr 11, 2025
Dataset provided by
Wikimedia Foundation (http://www.wikimedia.org/)
Authors
Wikimedia
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
Description
Dataset Summary

Early beta release of pre-parsed English and French Wikipedia articles including infoboxes. Inviting feedback.

This dataset contains all articles of the English and French language editions of Wikipedia, pre-parsed and outputted as structured JSON files with a consistent schema. Each JSON line holds the content of one full Wikipedia article stripped of extra markdown and non-prose sections (references, etc.).

Invitation for Feedback

The dataset is built as part of the Structured Contents initiative and is based on the Wikimedia Enterprise HTML snapshots. It is an early beta release, published to improve transparency in the development process and to request feedback. This first version includes pre-parsed Wikipedia abstracts, short descriptions, main image links, infoboxes and article sections, excluding non-prose sections (e.g. references). More elements (such as lists and tables) may be added over time. For updates follow the project’s blog and our Mediawiki Quarterly software updates on MediaWiki. As this is an early beta release, we highly value your feedback to help us refine and improve this dataset. Please share your thoughts, suggestions, and any issues you encounter either on the discussion page of Wikimedia Enterprise’s homepage on Meta wiki, or on the discussion page for this dataset here on Kaggle.

The content of this dataset of Wikipedia articles is collectively written and curated by a global volunteer community. All original textual content is licensed under the GNU Free Documentation License (GFDL) and the Creative Commons Attribution-Share-Alike 4.0 License. Some text may be available only under the Creative Commons license; see the Wikimedia Terms of Use for details. Text written by some authors may be released under additional licenses or into the public domain.

The dataset in its structured form is generally helpful for a wide variety of tasks, including all phases of model development, from pre-training to alignment, fine-tuning, updating/RAG as well as testing/benchmarking. We would love to hear more about your use cases.

Data Fields

The data fields are the same across all files. Noteworthy included fields:

name - title of the article.
identifier - ID of the article.
url - URL of the article.
version - metadata related to the latest specific revision of the article.
version.editor - editor-specific signals that can help contextualize the revision.
version.scores - returns assessments by ML models on the likelihood of a revision being reverted.
main entity - Wikidata QID the article is related to.
abstract - lead section, summarizing what the article is about.
description - one-sentence description of the article for quick reference.
image - main image representing the article's subject.
infoboxes - parsed information from the side panel (infobox) on the Wikipedia article.
sections - parsed sections of the article, including links. Note: excludes other media/images, lists, tables and references or similar non-prose sections.

Full data dictionary is available here: https://enterprise.wikimedia.com/docs/data-dictionary/
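Because each JSON line holds one article, the fields listed above can be read one record at a time. The following is a minimal sketch, not the publisher's tooling: the key spellings are taken from the field list and are assumed to match the files, and the sample record is made up for illustration.

```python
import io
import json

# Top-level field names from the list above; their exact spelling in the
# real files is an assumption.
FIELDS = ("name", "identifier", "url", "abstract", "description")


def iter_articles(lines):
    """Yield one dict per non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def summarize(article):
    """Keep a few top-level fields and count sections, tolerating missing keys."""
    summary = {key: article.get(key) for key in FIELDS}
    summary["n_sections"] = len(article.get("sections") or [])
    return summary


# Made-up record standing in for one line of a real file.
sample = io.StringIO(
    '{"name": "Example", "identifier": 1, '
    '"url": "https://example.org/wiki/Example", '
    '"abstract": "An example article.", '
    '"sections": [{"name": "History"}]}\n'
)
summaries = [summarize(article) for article in iter_articles(sample)]
```

With a real file, `iter_articles` could be given an open file handle in place of the in-memory sample, so that the whole archive never has to be loaded at once.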

Curation Rationale

This dataset has been created as part of the larger Structured Contents initiative at Wikimedia Enterprise with the aim of making Wikimedia data more machine readable. These efforts focus both on pre-parsing Wikipedia snippets and on connecting the different projects closer together. Even if Wikipedia is very structured to the human eye, it is a non-triv...
National concept directory in National data catalogue
data.norge.no
json
Updated Oct 9, 2025
Digitaliseringsdirektoratet (2025). National concept directory in National data catalogue [Dataset]. https://data.norge.no/en/datasets/8fbe9c6d-4962-3362-9952-62d9d7ce17bf/national-concept-directory-in-national-data-catalogue
Explore at:
Available download formats: json
Dataset updated
Oct 9, 2025
Dataset provided by
Digitaliseringsdirektoratet
Description
The data set "National concept directory in National data catalogue" (Begrepskatalog i Felles datakatalog) contains all terms published in National concept directory in National data catalogue. Each term contains at least information about the recommended term, definition and source of definition. The terms may also include the following information if the owner of the concept has provided such information: additional information about the meaning of the term that does not belong in the definition field; permitted and advised term, example on use of the term, subject area the term belongs to, area of application, legal categories or value ranges of the term, the date the term is valid from, the date the term shall apply to and contact information by e-mail and telephone.

Objective: To make all concepts in the National concept directory in National data catalogue available for downloading
Data from: Delta Neighborhood Physical Activity Study
s.cnmilf.com
agdatacommons.nal.usda.gov
+1more
Updated Jun 5, 2025
Agricultural Research Service (2025). Delta Neighborhood Physical Activity Study [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/delta-neighborhood-physical-activity-study-f82d7
Explore at:
Dataset updated
Jun 5, 2025
Dataset provided by
Agricultural Research Service
Description
The Delta Neighborhood Physical Activity Study was an observational study designed to assess characteristics of neighborhood built environments associated with physical activity. It was an ancillary study to the Delta Healthy Sprouts Project and therefore included towns and neighborhoods in which Delta Healthy Sprouts participants resided. The 12 towns were located in the Lower Mississippi Delta region of Mississippi. Data were collected via electronic surveys between August 2016 and September 2017 using the Rural Active Living Assessment (RALA) tools and the Community Park Audit Tool (CPAT). Scale scores for the RALA Programs and Policies Assessment and the Town-Wide Assessment were computed using the scoring algorithms provided for these tools via SAS software programming. The Street Segment Assessment and CPAT do not have associated scoring algorithms and therefore no scores are provided for them. Because the towns were not randomly selected and the sample size is small, the data may not be generalizable to all rural towns in the Lower Mississippi Delta region of Mississippi.

Dataset one contains data collected with the RALA Programs and Policies Assessment (PPA) tool. Dataset two contains data collected with the RALA Town-Wide Assessment (TWA) tool. Dataset three contains data collected with the RALA Street Segment Assessment (SSA) tool. Dataset four contains data collected with the Community Park Audit Tool (CPAT). [Note: title changed 9/4/2020 to reflect study name]

Resources in this dataset:

Resource Title: Dataset One RALA PPA Data Dictionary. File Name: RALA PPA Data Dictionary.csv. Resource Description: Data dictionary for dataset one collected using the RALA PPA tool. Resource Software Recommended: Microsoft Excel, url: https://products.office.com/en-us/excel
Resource Title: Dataset Two RALA TWA Data Dictionary. File Name: RALA TWA Data Dictionary.csv. Resource Description: Data dictionary for dataset two collected using the RALA TWA tool. Resource Software Recommended: Microsoft Excel, url: https://products.office.com/en-us/excel
Resource Title: Dataset Three RALA SSA Data Dictionary. File Name: RALA SSA Data Dictionary.csv. Resource Description: Data dictionary for dataset three collected using the RALA SSA tool. Resource Software Recommended: Microsoft Excel, url: https://products.office.com/en-us/excel
Resource Title: Dataset Four CPAT Data Dictionary. File Name: CPAT Data Dictionary.csv. Resource Description: Data dictionary for dataset four collected using the CPAT. Resource Software Recommended: Microsoft Excel, url: https://products.office.com/en-us/excel
Resource Title: Dataset One RALA PPA. File Name: RALA PPA Data.csv. Resource Description: Data collected using the RALA PPA tool. Resource Software Recommended: Microsoft Excel, url: https://products.office.com/en-us/excel
Resource Title: Dataset Two RALA TWA. File Name: RALA TWA Data.csv. Resource Description: Data collected using the RALA TWA tool. Resource Software Recommended: Microsoft Excel, url: https://products.office.com/en-us/excel
Resource Title: Dataset Three RALA SSA. File Name: RALA SSA Data.csv. Resource Description: Data collected using the RALA SSA tool. Resource Software Recommended: Microsoft Excel, url: https://products.office.com/en-us/excel
Resource Title: Dataset Four CPAT. File Name: CPAT Data.csv. Resource Description: Data collected using the CPAT. Resource Software Recommended: Microsoft Excel, url: https://products.office.com/en-us/excel
Resource Title: Data Dictionary. File Name: DataDictionary_RALA_PPA_SSA_TWA_CPAT.csv. Resource Description: This is a combined data dictionary from each of the 4 dataset files in this set.
CRITEO FAIRNESS IN JOB ADS DATASET
kaggle.com
zip
Updated Jul 1, 2024
Md. Abdur Rahman (2024). CRITEO FAIRNESS IN JOB ADS DATASET [Dataset]. https://www.kaggle.com/datasets/borhanitrash/fairness-in-job-ads-dataset
Explore at:
Available download formats: zip (201430692 bytes)
Dataset updated
Jul 1, 2024
Authors
Md. Abdur Rahman
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
License information was derived automatically
Description
Summary

This dataset is released by Criteo to foster research and innovation on Fairness in Advertising and AI systems in general. See also Criteo pledge for Fairness in Advertising.

The dataset is intended for training click-prediction models and evaluating how much their predictions are biased between different gender groups.

Data description

The dataset contains pseudonymized user context and publisher features that were collected from a job targeting campaign run for 5 months by the Criteo AdTech company. Each line represents a product that was shown to a user. Each user has an impression session where they can see several products at the same time. Each product can be clicked or not clicked by the user. The dataset consists of 1072226 rows and 55 columns.

features

user_id is a unique identifier assigned to each user. This identifier has been anonymized and does not contain any information related to the real users.

product_id is a unique identifier assigned to each product, i.e. job offer.

impression_id is a unique identifier assigned to each impression, i.e. online session that can have several products at the same time.

cat0 to cat5 are anonymized categorical user features.

cat6 to cat12 are anonymized categorical product features.

num13 to num47 are anonymized numerical user features.

labels

protected_attribute is a binary feature that describes user gender proxy, i.e. female is 0, male is 1. The detailed description on the meaning can be found below.

senior is a binary feature that describes the seniority of the job position, i.e. an assistant role is 0, a managerial role is 1. This feature was created during data processing step from the product title feature: if the product title contains words describing managerial role (e.g. 'president', 'ceo', and others), it is assigned to 1, otherwise to 0.

rank is a numerical feature that corresponds to the positional rank of the product on the display for given impression_id. Usually, the position on the display creates the bias with respect to the click: lower rank means higher position of the product on the display.

displayrandom is a binary feature that equals 1 if the display position on the banner of the products associated with the same impression_id was randomized. The click-rank metric should be computed on displayrandom = 1 to avoid positional bias.

click is a binary feature that equals 1 if the product product_id in the impression impression_id was clicked by the user user_id.
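The labels above can be combined into a simple per-group comparison. The sketch below computes a plain click rate per `protected_attribute` value on rows with `displayrandom` equal to 1; it is not the click-rank metric mentioned above, which this description does not define, and the rows are made up for illustration.

```python
from collections import defaultdict


def click_rate_by_group(rows):
    """Mean click per protected_attribute value, using only randomized displays."""
    clicks = defaultdict(int)
    shown = defaultdict(int)
    for row in rows:
        if row["displayrandom"] != 1:
            continue  # skip impressions whose display order was not randomized
        group = row["protected_attribute"]
        shown[group] += 1
        clicks[group] += row["click"]
    return {group: clicks[group] / shown[group] for group in shown}


# Made-up rows using the column names described above.
rows = [
    {"protected_attribute": 0, "displayrandom": 1, "click": 1},
    {"protected_attribute": 0, "displayrandom": 1, "click": 0},
    {"protected_attribute": 1, "displayrandom": 1, "click": 0},
    {"protected_attribute": 1, "displayrandom": 0, "click": 1},
]
rates = click_rate_by_group(rows)
gap = rates[0] - rates[1]  # difference in click rate between the two groups
```

Filtering on `displayrandom` first keeps the positional effect described for `rank` from being mistaken for a difference between groups.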

Data statistics

dimension average
click 0.077
protected attribute 0.500
senior 0.704

License

The data is released under the CC-BY-NC-SA 4.0 license. You are free to Share and Adapt this data provided that you respect the Attribution, NonCommercial and ShareAlike conditions. Please read carefully the full license before using.

Protected attribute

As Criteo does not have access to user demographics we report a proxy of gender as protected attribute. This proxy is reported as binary for simplicity yet we acknowledge gender is not necessarily binary.

The value of the proxy is computed as the majority of gender attributes of products seen in the user timeline. Products having a gender attribute are typically fashion and clothing items. We acknowledge that this proxy does not necessarily represent how users relate to a given gender yet we believe it to be a realistic approximation for research purposes.
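The majority rule can be illustrated with a short sketch. This is not Criteo's implementation: the function name, the string labels, and the handling of ties and of users with no gendered products are assumptions made here for illustration.

```python
from collections import Counter


def gender_proxy(product_genders):
    """Return the most common gender attribute among the products a user saw.

    Products without a gender attribute are passed as None and ignored.
    Returns None when nothing is labelled or when the top two counts tie
    (an assumption: the description does not say how ties are resolved).
    """
    labelled = [g for g in product_genders if g is not None]
    if not labelled:
        return None
    counts = Counter(labelled).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]
```

For example, a hypothetical timeline of `["female", "male", "female", None]` would yield `"female"`, which the dataset would then encode as 0 in `protected_attribute`.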

We encourage research in Fairness defined with respect to other attributes as well.

Limitations and interpretations

We remark that the proposed gender proxy does not give a definition of gender. Since we do not have access to the sensitive information, this is the best solution we have identified at this stage to identify bias on pseudonymised data, and we encourage any discussion on better approximations. This proxy is reported as binary for simplicity yet we acknowledge gender is not necessarily binary. Although our research focuses on gender, this should not diminish the importance of investigating other types of algorithmic discrimination. While this dataset provides an important application of fairness-aware algorithms in a high-risk domain, there are several fundamental limitations that cannot be addressed easily through data collection or curation processes. These limitations in...
Data dictionary from: Gridded National Soil Survey Geographic Database...
figshare.com
txt
Updated Jun 1, 2023
Ag Data Commons (2023). Data dictionary from: Gridded National Soil Survey Geographic Database (gNATSGO) [Dataset]. http://doi.org/10.6084/m9.figshare.19108361.v1
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.19108361.v1
Dataset updated
Jun 1, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Ag Data Commons
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
Data dictionary for Gridded National Soil Survey Geographic Database (gNATSGO). https://data.nal.usda.gov/node/23067

gNATSGO has a schema that is very similar to that of SSURGO and STATSGO2. A CSV version of the data dictionary is presented.

A data dictionary typically provides a detailed description for each element or variable in a dataset or data model. Data dictionaries are used to document important and useful information such as a descriptive name, the data type, allowed values, units, and text description.

Dataset citation: (dataset) Soil Survey Staff. Gridded National Soil Survey Geographic (gNATSGO) Database for [State name -or- the Conterminous United States]. United States Department of Agriculture, Natural Resources Conservation Service. Available online at https://nrcs.app.box.com/v/soils. Month, day, year.
Trail Centerline Data Dictionary
catalog.data.gov
data-test-lakecountyil.opendata.arcgis.com
+1more
Updated Mar 17, 2023
Lake County Illinois GIS (2023). Trail Centerline Data Dictionary [Dataset]. https://catalog.data.gov/dataset/trail-centerline-data-dictionary-5f8be
Explore at:
Dataset updated
Mar 17, 2023
Dataset provided by
Lake County Illinois GIS
Description
An in-depth description of the Trail Centerline GIS dataset outlining terms of use, update frequency, attribute explanations, and more.
Elevation, Flow Accumulation, Flow Direction, and Stream Definition Data in...
data.usgs.gov
datasets.ai
+2more
Updated Dec 8, 2023
Lindsey Schafer; Jennifer Sharpe (2023). Elevation, Flow Accumulation, Flow Direction, and Stream Definition Data in Support of the Illinois StreamStats Upgrade to the Basin Delineation Database [Dataset]. http://doi.org/10.5066/P9YIAUZQ
Explore at:
Unique identifier
https://doi.org/10.5066/P9YIAUZQ
Dataset updated
Dec 8, 2023
Dataset provided by
United States Geological Survey (http://www.usgs.gov/)
Authors
Lindsey Schafer; Jennifer Sharpe
License
U.S. Government Works (https://www.usa.gov/government-works)
License information was derived automatically
Time period covered
2023
Area covered
Illinois
Description
The U.S. Geological Survey (USGS), in cooperation with the Illinois Center for Transportation and the Illinois Department of Transportation, prepared hydro-conditioned geographic information systems (GIS) layers for use in the Illinois StreamStats application. These data were used to delineate drainage basins and compute basin characteristics for updated peak flow and flow duration regression equations for Illinois. This dataset consists of raster grid files for elevation (dem), flow accumulation (fac), flow direction (fdr), and stream definition (str900) for each 8-digit Hydrologic Unit Code (HUC) area in Illinois merged into a single dataset. There are 51 full or partial HUC 8s represented by this data set: 04040002, 05120108, 05120109, 05120111, 05120112, 05120113, 05120114, 05120115, 05140202, 05140203, 05140204, 05140206, 07060005, 07080101, 07080104, 07090001, 07090002, 07090003, 07090004, 07090005, 07090006, 07090007, 07110001, 07110004, 07110009, 07120001, 07120002, 071200 ...
Semantic Similarity with Concept Senses: new Experiment
data.mendeley.com
Updated Oct 24, 2022
Francesco Taglino (2022). Semantic Similarity with Concept Senses: new Experiment [Dataset]. http://doi.org/10.17632/v2bwh7z8kj.1
Explore at:
Unique identifier
https://doi.org/10.17632/v2bwh7z8kj.1
Dataset updated
Oct 24, 2022
Authors
Francesco Taglino
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset represents the results of the experimentation of a method for evaluating semantic similarity between concepts in a taxonomy. The method is based on the information-theoretic approach and allows senses of concepts in a given context to be considered. Relevance of senses is calculated in terms of semantic relatedness with the compared concepts. In a previous work [9], the adopted semantic relatedness method was the one described in [10], while in this work we also adopted the ones described in [11], [12], [13], [14], [15], and [16].

We applied our proposal by extending 7 methods for computing semantic similarity in a taxonomy, selected from the literature. The methods considered in the experiment are referred to as R [2], W&P [3], L [4], J&C [5], P&S [6], A [7], and A&M [8].
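For orientation, two of the baseline measures named above, R (Resnik) and L (Lin), can be sketched on a toy taxonomy. This is only the standard information-content formulation, not the context-sense extension evaluated in this dataset; the taxonomy and probabilities are made up, and the sketch assumes a single-inheritance tree, where the lowest common subsumer is also the most informative one.

```python
import math

# Toy taxonomy (child -> parent) with made-up concept probabilities.
PARENT = {"car": "vehicle", "bicycle": "vehicle",
          "vehicle": "entity", "fruit": "entity"}
PROB = {"entity": 1.0, "vehicle": 0.5, "fruit": 0.5,
        "car": 0.25, "bicycle": 0.25}


def ancestors(concept):
    """Return the concept followed by its ancestors up to the root."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def information_content(concept):
    """IC(c) = -log p(c); the root has probability 1 and therefore IC 0."""
    return -math.log(PROB[concept])


def lowest_common_subsumer(a, b):
    """First concept on a's path to the root that also subsumes b."""
    b_ancestors = set(ancestors(b))
    return next(c for c in ancestors(a) if c in b_ancestors)


def resnik(a, b):
    """Resnik similarity: IC of the lowest common subsumer."""
    return information_content(lowest_common_subsumer(a, b))


def lin(a, b):
    """Lin similarity: 2 * IC(lcs) / (IC(a) + IC(b))."""
    denom = information_content(a) + information_content(b)
    return 2 * resnik(a, b) / denom if denom else 1.0
```

In this toy setup "car" and "bicycle" share the subsumer "vehicle", so their similarity is positive, while "car" and "fruit" meet only at the root and score zero under both measures.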

The experiment was run on the well-known Miller and Charles benchmark dataset [1] for assessing semantic similarity.

The results are organized in seven folders, each with the results related to one of the above semantic relatedness methods. In each folder there is a set of files, each referring to one pair of the Miller and Charles dataset. In fact, for each pair of concepts, all the 28 pairs are considered as possible different contexts.

REFERENCES
[1] Miller G.A., Charles W.G. 1991. Contextual correlates of semantic similarity. Language and Cognitive Processes 6(1).
[2] Resnik P. 1995. Using Information Content to Evaluate Semantic Similarity in a Taxonomy. Int. Joint Conf. on Artificial Intelligence, Montreal.
[3] Wu Z., Palmer M. 1994. Verb semantics and lexical selection. 32nd Annual Meeting of the Association for Computational Linguistics.
[4] Lin D. 1998. An Information-Theoretic Definition of Similarity. Int. Conf. on Machine Learning.
[5] Jiang J.J., Conrath D.W. 1997. Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy. Inter. Conf. Research on Computational Linguistics.
[6] Pirrò G. 2009. A Semantic Similarity Metric Combining Features and Intrinsic Information Content. Data Knowl. Eng, 68(11).
[7] Adhikari A., Dutta B., Dutta A., Mondal D., Singh S. 2018. An intrinsic information content-based semantic similarity measure considering the disjoint common subsumers of concepts of an ontology. J. Assoc. Inf. Sci. Technol. 69(8).
[8] Adhikari A., Singh S., Mondal D., Dutta B., Dutta A. 2016. A Novel Information Theoretic Framework for Finding Semantic Similarity in WordNet. CoRR, arXiv:1607.05422, abs/1607.05422.
[9] Formica A., Taglino F. 2021. An Enriched Information-Theoretic Definition of Semantic Similarity in a Taxonomy. IEEE Access, vol. 9.
[10] Information Content-based approach [Schuhmacher and Ponzetto, 2014].
[11] Linked Data Semantic Distance (LDSD) [Passant, 2010].
[12] Wikipedia Link-based Measure (WLM) [Witten and Milne, 2008].
[13] Linked Open Data Description Overlap-based approach (LODDO) [Zhou et al., 2012].
[14] Exclusivity-based [Hulpuş et al., 2015].
[15] ASRMP [El Vaigh et al., 2020].
[16] LDSDGN [Piao and Breslin, 2016].
US Industry Data by State, by Industry
kaggle.com
zip
Updated Jan 15, 2023
The Devastator (2023). US Industry Data by State, by Industry [Dataset]. https://www.kaggle.com/datasets/thedevastator/2012-us-industry-data-by-state-by-industry
Explore at:
Available download formats: zip (53066 bytes)
Dataset updated
Jan 15, 2023
Authors
The Devastator
Area covered
United States
Description
US Industry Data by State, by Industry

Number of Establishments, Sales, Payroll, and Employees

By Gary Hoover [source]

About this dataset

This data set provides a detailed look into the US economy. It includes information on establishments and nonemployer businesses, as well as sales revenue, payrolls, and the number of employees. Gleaned from the Economic Census done every five years, this data is a valuable resource to anyone curious about where the nation was economically at the time. Its columns include geographic area name, North American Industry Classification System (NAICS) codes for industries, the meanings of those codes, the type of operation or tax status, and annual payroll. Whether you’re a researcher studying industry patterns or an entrepreneur looking for market insight, this dataset is a useful starting point.

How to use the dataset

This dataset provides detailed US industry data by state, including the number of establishments, value of sales, payroll, and number of employees. All the data is based on the North American Industry Classification System (NAICS) code for each specific industry. This will allow you to easily analyze and compare industries across different states or regions.

Research Ideas

Analyzing the economic impact of a new business or industry trends in different states: Comparing the change in the number of establishments, payroll, and employees over time can give insight into how a state is affected by a new industry trend or introduction of a new service or product.

Estimating customer sales potential for businesses: This dataset can be used to estimate the potential customer base for businesses in different geographic areas. Analyzing the total business done by non-employers in an area, along with its estimated population, can help estimate how much overall sales potential exists for a given region.

Tracking competitor performance: By looking at shipments, receipts, and value of business done across industries in different regions or even cities, companies can track their competitors’ performance and compare it to their own to better assess their strategies going forward.

Acknowledgements

If you use this dataset in your research, please credit the original authors. Data Source

License

License: Dataset copyright by authors

You are free to:
- Share - copy and redistribute the material in any medium or format for any purpose, even commercially.
- Adapt - remix, transform, and build upon the material for any purpose, even commercially.

You must:
- Give appropriate credit - Provide a link to the license, and indicate if changes were made.
- ShareAlike - You must distribute your contributions under the same license as the original.
- Keep intact - all notices that refer to this license, including copyright notices.

Columns

File: 2012 Industry Data by Industry and State.csv

| Column name | Description |
|:---|:---|
| Geographic area name | The name of the geographic area the data is for. (String) |
| NAICS code | The North American Industry Classification System (NAICS) code for the industry. (String) |
| Meaning of NAICS code | The description of the NAICS code. (String) |
| Meaning of Type of operation or tax status code | The description of the type of operation or tax status code. (String) |

...