2 datasets found
  1. Overtone Journalistic Content Bot/Human Indicator Dataset

    • datarade.ai
    Updated Jan 23, 2023
    Cite
    Overtone (2023). Overtone Journalistic Content Bot/Human Indicator Dataset [Dataset]. https://datarade.ai/data-products/overtone-journalistic-content-bot-human-indicator-dataset-overtone
    Dataset authored and provided by
    Overtone
    Area covered
    Russian Federation, Finland, Aruba, Virgin Islands (U.S.), Belarus, Falkland Islands (Malvinas), Panama, Australia, Belize, Brazil
    Description

    We indicate how likely a piece of content is to be computer-generated or human-written. Content: any text in English or Spanish, from a single sentence to articles thousands of words long.

    Data uniqueness: we use custom-built and custom-trained NLP algorithms to assess human-effort metrics that are inherent in text content. We focus on what's in the text, not on metadata such as publication or engagement. Our AI algorithms are co-created by NLP and journalism experts. Our datasets have all been human-reviewed and labeled.

    Dataset: CSV containing URL and/or body text, with the attributed score as an integer and model confidence as a percentage. One row = one scored article. We ignore metadata such as author, publication, date, word count, shares and so on, to provide a clean and maximally unbiased assessment of how much human effort has been invested in content. Our data is provided in CSV/RSS/JSON format.

    Integrity indicators are provided as integers on a 1–5 scale. We also have custom models with 35 categories that can be added on request.
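
A minimal sketch of how a consumer might read such a CSV, offered only as an illustration. The column names (`url`, `score`, `confidence`), the sample rows, and the thresholds are assumptions and are not taken from Overtone's actual schema; the sketch also assumes the integer score uses the 1–5 scale mentioned above.

```python
import csv
import io

# Hypothetical sample data; column names and values are assumptions,
# not Overtone's published schema.
SAMPLE_CSV = """url,score,confidence
https://example.com/a,4,87%
https://example.com/b,2,55%
https://example.com/c,5,91%
"""


def load_scored_articles(text):
    """Parse rows into dicts with an integer score and a confidence in [0, 1]."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        score = int(row["score"])
        if not 1 <= score <= 5:
            raise ValueError(f"score outside the assumed 1-5 range: {score}")
        # Confidence arrives as a percentage string such as "87%".
        confidence = float(row["confidence"].rstrip("%")) / 100.0
        rows.append({"url": row["url"], "score": score, "confidence": confidence})
    return rows


def filter_confident(rows, min_score=4, min_confidence=0.8):
    """Keep rows whose score and confidence both meet the thresholds."""
    return [
        r for r in rows
        if r["score"] >= min_score and r["confidence"] >= min_confidence
    ]


articles = load_scored_articles(SAMPLE_CSV)
selected = filter_confident(articles)
print([r["url"] for r in selected])
```

Filtering on both the score and the confidence is one possible design choice: it avoids treating a high score with low model confidence as a strong signal.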

    Data sourcing: public websites, crawlers, scrapers, and other partnerships where available. We can generally assess content both behind and outside paywalls. We source from ~4,000 news outlets; Bloomberg, CNN and the BBC, for example, each count as one outlet. Countries: all English-speaking markets worldwide. Includes English-language content from non-English-majority regions, such as Germany, Scandinavia and Japan. Also available in Spanish on request.

    Use cases: assessing the implicit integrity and reliability of an article. There is a correlation between integrity and human value: we have shown that articles scoring highly on our scales show increased, sustained end-user engagement. Clients also use this to assess journalistic output and publication relevance, and to create datasets of 'quality' journalism.

    Overtone provides a range of qualitative metrics for journalistic, newsworthy and long-form content. We find, highlight and synthesise content that shows added human effort and, by extension, added human value.

  2. Symbolic Institutional Traps: Language Regimes, Legal Legacy, and...

    • zenodo.org
    bin
    Updated Apr 26, 2025
    Cite
    Scott Brown (2025). Symbolic Institutional Traps: Language Regimes, Legal Legacy, and Organizational Constraint in Postcolonial Economies [Dataset]. http://doi.org/10.5281/zenodo.15285179
    Available download formats: bin
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Scott Brown
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    README: Symbolic Institutional Traps and the Liability of Foreignness

    Scott M. Brown (University of Puerto Rico)
    Email: scott.brown@upr.edu
    Data DOI: 10.5281/zenodo.15050209

    Overview

    This project empirically tests how language regimes embedded in legal and administrative systems create institutional traps that constrain multinational enterprise (MNE) operations and economic integration.
    The study combines national and subnational data across four key datasets to measure how symbolic misalignment (such as monolingualism in non-commercial languages) affects regulatory quality, business formation, and workforce access.

    📂 Datasets

    You must upload the following four files into your Google Colab session before running the code:

    Uploaded File | Description
    /content/2020_Rankings.xlsx | World Bank Ease of Doing Business (EODB): global regulatory efficiency indicators (2020 Edition)
    /content/DBNA 2022 Rank and Scores.xlsx | Doing Business North America (DBNA 2022): city-level institutional performance across 83 U.S. cities
    /content/Spanish_Speakers_All_States.xlsx | U.S. Census American Community Survey (ACS): state-level Spanish-speaking and English proficiency data
    /content/wgidataset.xlsx | World Governance Indicators (WGI): governance quality measures (Regulatory Quality, Government Effectiveness, etc.)
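
Before running the study code, it can help to confirm that all four uploads are in place. The following is a minimal sketch, not part of the original README; the helper name `missing_files` is hypothetical, and the paths are the ones listed above.

```python
import os

# Paths exactly as listed in the table above.
REQUIRED_FILES = [
    "/content/2020_Rankings.xlsx",
    "/content/DBNA 2022 Rank and Scores.xlsx",
    "/content/Spanish_Speakers_All_States.xlsx",
    "/content/wgidataset.xlsx",
]


def missing_files(paths, exists=os.path.exists):
    """Return the paths for which `exists` reports False, in input order."""
    return [p for p in paths if not exists(p)]


missing = missing_files(REQUIRED_FILES)
if missing:
    print("Upload these files before running the study:")
    for path in missing:
        print("  " + path)
else:
    print("All four input files are present.")
```

The `exists` parameter is injectable only so the check can be exercised without touching the file system.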

    📋 How to Run the Study

    1. Open Google Colab.

    2. Upload the four Excel files listed above.

    3. Copy and paste the Python code provided below into a Colab notebook cell.

    4. Run the code to automatically load the datasets, clean the data, and estimate key regression models.

    🚀 Required Python Code

    python
    # --- 0. Imports ---
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf


    # --- 1. Load Clean Datasets ---
    dbna = pd.read_excel('/content/DBNA 2022 Rank and Scores.xlsx')
    acs = pd.read_excel('/content/Spanish_Speakers_All_States.xlsx')
    wgi = pd.read_excel('/content/wgidataset.xlsx') # Optional: Governance analysis

    # --- 2. Standardize Column Names ---
    dbna.columns = dbna.columns.str.strip().str.replace(' ', '_')
    acs.columns = acs.columns.str.strip().str.replace(' ', '_')
    wgi.columns = wgi.columns.str.strip().str.replace(' ', '_')

    # --- 3. Merge Datasets ---
    # Merge DBNA and ACS on 'State'
    merged_dbna = dbna.merge(acs, on='State', how='left')

    # --- 4. Regressions: Language vs Institutional Outcomes ---

    # H1: Language (% Spanish) and Starting a Business Score
    model1 = smf.ols('Starting_a_Business_Score ~ Percent_Spanish_Speakers', data=merged_dbna).fit()
    print(" Regression: Starting a Business Score ~ Percent Spanish Speakers")
    print(model1.summary())

    # H2: Language (% Spanish) and Land and Space Use Score
    model2 = smf.ols('Land_and_Space_Use_Score ~ Percent_Spanish_Speakers', data=merged_dbna).fit()
    print(" Regression: Land and Space Use Score ~ Percent Spanish Speakers")
    print(model2.summary())

    # H3: Language (% Spanish) and Getting Electricity Score
    model3 = smf.ols('Getting_Electricity_Score ~ Percent_Spanish_Speakers', data=merged_dbna).fit()
    print(" Regression: Getting Electricity Score ~ Percent Spanish Speakers")
    print(model3.summary())

    # H4: Language (% Spanish) and Employing Workers Score
    model4 = smf.ols('Employing_Workers_Score ~ Percent_Spanish_Speakers', data=merged_dbna).fit()
    print(" Regression: Employing Workers Score ~ Percent Spanish Speakers")
    print(model4.summary())

    # --- 5. (Optional) Governance Analysis: Percent Spanish vs. WGI Regulatory Quality ---
    # Run this only if the WGI sheet has a 'Country' (or 'State') column to merge on; otherwise skip.
    # The example below assumes WGI 'Country' values line up with ACS 'State' values.

    #wgi_merged = wgi.merge(acs, left_on='Country', right_on='State', how='left')
    #model5 = smf.ols('Regulatory_Quality ~ Percent_Spanish_Speakers', data=wgi_merged).fit()
    #print(" Regression: Regulatory Quality ~ Percent Spanish Speakers")
    #print(model5.summary())

    # --- 6. End ---
    print(" All regressions completed.")
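
Each model above regresses one score on a single predictor. For such a one-regressor OLS, the slope equals the sample covariance of the predictor and the outcome divided by the sample variance of the predictor. The sketch below illustrates that calculation in plain Python; the function name `simple_ols` is hypothetical and the numbers are made up for illustration, not results from the study.

```python
def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x has no variation, so the slope is undefined")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up illustrative values (NOT data or results from the study).
percent_spanish = [5.0, 10.0, 20.0, 40.0]
score = [80.0, 78.0, 74.0, 66.0]

intercept, slope = simple_ols(percent_spanish, score)
print(f"intercept={intercept:.2f}, slope={slope:.2f}")
```

The intercept and slope computed this way correspond to the `Intercept` and predictor coefficients reported by `model.summary()`; the summary additionally reports standard errors and fit statistics that this sketch omits.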

    🧠 Key Concepts

    • Symbolic Institutional Traps: Language regimes act as hidden barriers, complicating regulatory navigation and labor market integration.

    • Symbolic Misalignment: Misfit between administrative languages and global commercial norms raises onboarding costs for MNEs.

    • Institutional Friction: Language encapsulation isolates economies and reduces foreign direct investment (FDI) attractiveness.

    📜 Data Documentation

    Each dataset has been:

    • Cleaned for consistent formatting.

    • Harmonized for cross-dataset integration.

    • Standardized to facilitate reproducible econometric analysis.

    Full codebooks and metadata are available in the appendix of the research paper.

    ⚡ Notes

    • The EF EPI (English Proficiency) dataset was not uploaded here. If available, further regressions on symbolic distance can be run.

    • If any column names do not match exactly (e.g., different spellings), run print(dbna.columns) and adjust the variable names in the regression formulas to match.
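
One way to spot near-miss column names is to compare the names used in the formulas with the names actually present. The sketch below uses the standard-library `difflib` module; the helper `suggest_columns` and the example `actual` list are hypothetical, and in practice `actual` would be `list(merged_dbna.columns)`.

```python
import difflib


def suggest_columns(expected, actual, cutoff=0.6):
    """Map each expected name to its closest actual column name, or None."""
    suggestions = {}
    for name in expected:
        if name in actual:
            suggestions[name] = name  # exact match, nothing to fix
            continue
        matches = difflib.get_close_matches(name, actual, n=1, cutoff=cutoff)
        suggestions[name] = matches[0] if matches else None
    return suggestions


# Names used in the regression formulas.
expected = ["Starting_a_Business_Score", "Percent_Spanish_Speakers"]
# Hypothetical column names with slightly different spellings.
actual = ["State", "Starting_a_Business_score", "Pct_Spanish_Speakers"]

for want, got in suggest_columns(expected, actual).items():
    print(f"{want} -> {got}")
```

A `None` suggestion means no sufficiently similar column was found, in which case the column list needs to be inspected by hand.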

    📈 Planned Outputs

    The code generates:

    • Regression outputs on how Spanish-speaking prevalence correlates with:

      • Starting a business

      • Ease of Doing Business

      • Regulatory quality

    • Subnational institutional performance differences (Puerto Rico vs. U.S. states).

    🌍 License and Reuse

    • Open Data: CC BY 4.0 License

    • Citation Requested:
      Brown, S.M. (2025). Symbolic Institutional Traps and the Liability of Foreignness: Language Regimes as Hidden Barriers to Multinational Entry. University of Puerto Rico. DOI: 10.5281/zenodo.15050209
