2 datasets found

Overtone Journalistic Content Bot/Human Indicator Dataset
datarade.ai
Updated Jan 23, 2023
Cite
Overtone (2023). Overtone Journalistic Content Bot/Human Indicator Dataset [Dataset]. https://datarade.ai/data-products/overtone-journalistic-content-bot-human-indicator-dataset-overtone
Dataset updated
Jan 23, 2023
Dataset authored and provided by
Overtone
Area covered
Russian Federation, Finland, Aruba, Virgin Islands (U.S.), Belarus, Falkland Islands (Malvinas), Panama, Australia, Belize, Brazil
Description
We indicate how likely it is that a piece of content is computer-generated or human-written. Content: any text in English or Spanish, from a single sentence to articles thousands of words long.

Data uniqueness: we use custom-built, custom-trained NLP algorithms to assess human-effort metrics that are inherent in text content. We focus on what's in the text, not on metadata such as publication or engagement. Our AI algorithms are co-created by NLP and journalism experts. All of our datasets have been human-reviewed and labeled.

Dataset: CSV containing URL and/or body text, with the attributed score as an integer and model confidence as a percentage. We ignore metadata such as author, publication, date, word count, and shares, to provide a clean and maximally unbiased assessment of how much human effort has been invested in content. Our data is provided in CSV/RSS/JSON format. One row = one scored article.
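As a minimal sketch of how a consumer might read rows of the shape described above, the following parses a CSV with a URL and/or body text, an integer score, and a confidence percentage. The column names (`url`, `body`, `score`, `confidence`), the `%` suffix, and the sample rows are all assumptions made for illustration; the listing does not publish the actual header or any sample data.

```python
import csv
import io

# Invented sample; column names and values are hypothetical, not from the vendor.
SAMPLE = """url,body,score,confidence
https://example.com/a,Some article text,4,87%
https://example.com/b,,2,55%
"""

def parse_scored_rows(text):
    """Parse rows with a URL and/or body, an integer score, and a percent confidence."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        # The description says each row carries a URL and/or body text.
        if not (record["url"] or record["body"]):
            raise ValueError("each row needs a URL and/or body text")
        confidence = float(record["confidence"].rstrip("%"))
        if not 0.0 <= confidence <= 100.0:
            raise ValueError(f"confidence out of range: {confidence}")
        rows.append({
            "url": record["url"] or None,
            "body": record["body"] or None,
            "score": int(record["score"]),  # int() rejects non-integer scores
            "confidence": confidence,
        })
    return rows

rows = parse_scored_rows(SAMPLE)
print(rows[0]["score"], rows[0]["confidence"])
```

Converting empty fields to `None` keeps the "URL and/or body" distinction explicit for downstream code.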

Integrity indicators are provided as integers on a 1–5 scale. Custom models with 35 categories can be added on request.

Data sourcing: public websites, crawlers, scrapers, and other partnerships where available. We can generally assess paywalled content as well as freely available content. We source from ~4,000 news outlets; for example, Bloomberg, CNN, and the BBC each count as one outlet. Countries: all English-speaking markets worldwide. Includes English-language content from non-English-majority regions, such as Germany, Scandinavia, and Japan. Also available in Spanish on request.

Use cases: assessing the implicit integrity and reliability of an article. There is a correlation between integrity and human value: we have shown that articles scoring highly on our scales show increased, sustained end-user engagement. Clients also use this to assess journalistic output and publication relevance, and to create datasets of 'quality' journalism.

Overtone provides a range of qualitative metrics for journalistic, newsworthy and long-form content. We find, highlight and synthesise content that shows added human effort and, by extension, added human value.

Symbolic Institutional Traps: Language Regimes, Legal Legacy, and...

zenodo.org

Updated Apr 26, 2025

Cite

Scott Brown (2025). Symbolic Institutional Traps: Language Regimes, Legal Legacy, and Organizational Constraint in Postcolonial Economies [Dataset]. http://doi.org/10.5281/zenodo.15285179

Available download formats: bin

Unique identifier

https://doi.org/10.5281/zenodo.15285179

Dataset updated

Apr 26, 2025

Dataset provided by

Zenodo (http://zenodo.org/)

Authors

Scott Brown

License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

README: Symbolic Institutional Traps and the Liability of Foreignness

Scott M. Brown (University of Puerto Rico)
Email: scott.brown@upr.edu
Data DOI: 10.5281/zenodo.15050209

Overview

This project empirically tests how language regimes embedded in legal and administrative systems create institutional traps that constrain multinational enterprise (MNE) operations and economic integration.
The study combines national and subnational data across four key datasets to measure how symbolic misalignment (such as monolingualism in non-commercial languages) affects regulatory quality, business formation, and workforce access.

📂 Datasets

You must upload the following four files into your Google Colab session before running the code:

Uploaded File	Description
`/content/2020_Rankings.xlsx`	World Bank Ease of Doing Business (EODB) — Global regulatory efficiency indicators (2020 Edition)
`/content/DBNA 2022 Rank and Scores.xlsx`	Doing Business North America (DBNA 2022) — City-level institutional performance across 83 U.S. cities
`/content/Spanish_Speakers_All_States.xlsx`	U.S. Census American Community Survey (ACS) — State-level Spanish-speaking and English proficiency data
`/content/wgidataset.xlsx`	World Governance Indicators (WGI) — Governance quality measures (Regulatory Quality, Government Effectiveness, etc.)

📋 How to Run the Study

Open Google Colab.
Upload the four Excel files listed above.
Copy and paste the Python code provided below into a Colab notebook cell.
Run the code to automatically load the datasets, clean the data, and estimate key regression models.

🚀 Required Python Code

python

# --- 0. Imports ---
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# --- 1. Load Clean Datasets ---
dbna = pd.read_excel('/content/DBNA 2022 Rank and Scores.xlsx')
acs = pd.read_excel('/content/Spanish_Speakers_All_States.xlsx')
wgi = pd.read_excel('/content/wgidataset.xlsx')  # Optional: Governance analysis

# --- 2. Standardize Column Names ---
# Strip surrounding whitespace and replace spaces with underscores so the
# column names are valid identifiers in the regression formulas below.
dbna.columns = dbna.columns.str.strip().str.replace(' ', '_')
acs.columns = acs.columns.str.strip().str.replace(' ', '_')
wgi.columns = wgi.columns.str.strip().str.replace(' ', '_')

# --- 3. Merge Datasets ---
# Merge DBNA and ACS on 'State' (left join: every DBNA row is kept)
merged_dbna = dbna.merge(acs, on='State', how='left')

# --- 4. Regressions: Language vs Institutional Outcomes ---
# H1: Language (% Spanish) and Starting a Business Score
model1 = smf.ols('Starting_a_Business_Score ~ Percent_Spanish_Speakers', data=merged_dbna).fit()
print(" Regression: Starting a Business Score ~ Percent Spanish Speakers")
print(model1.summary())

# H2: Language (% Spanish) and Land and Space Use Score
model2 = smf.ols('Land_and_Space_Use_Score ~ Percent_Spanish_Speakers', data=merged_dbna).fit()
print(" Regression: Land and Space Use Score ~ Percent Spanish Speakers")
print(model2.summary())

# H3: Language (% Spanish) and Getting Electricity Score
model3 = smf.ols('Getting_Electricity_Score ~ Percent_Spanish_Speakers', data=merged_dbna).fit()
print(" Regression: Getting Electricity Score ~ Percent Spanish Speakers")
print(model3.summary())

# H4: Language (% Spanish) and Employing Workers Score
model4 = smf.ols('Employing_Workers_Score ~ Percent_Spanish_Speakers', data=merged_dbna).fit()
print(" Regression: Employing Workers Score ~ Percent Spanish Speakers")
print(model4.summary())

# --- 5. (Optional) Governance Analysis: Percent Spanish vs. WGI Regulatory Quality ---
# If WGI includes 'State' or 'Country' to merge, otherwise skip
# Example assuming WGI has 'Country' to match 'State'
#wgi_merged = wgi.merge(acs, left_on='Country', right_on='State', how='left')
#model5 = smf.ols('Regulatory_Quality ~ Percent_Spanish_Speakers', data=wgi_merged).fit()
#print(" Regression: Regulatory Quality ~ Percent Spanish Speakers")
#print(model5.summary())

# --- 6. End ---
print(" All regressions completed.")
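Because the merge in step 3 uses `how='left'`, any DBNA row whose `State` value has no exact match in the ACS table keeps missing values in the ACS columns, and the formula interface will, by default, leave such rows out of the fit. The sketch below is one way to list those unmatched keys before running the regressions. The helper `unmatched_keys` and the toy values are hypothetical and not part of the original script; in Colab one would pass `dbna['State']` and `acs['State']` instead.

```python
def unmatched_keys(left_keys, right_keys):
    """Return the distinct left-side keys that have no exact right-side match."""
    right = set(right_keys)
    return sorted(set(left_keys) - right)

# Toy values for illustration only; note the trailing space in "Florida ".
dbna_states = ["Texas", "Florida ", "Puerto Rico", "Texas"]
acs_states = ["Texas", "Florida"]

missing = unmatched_keys(dbna_states, acs_states)
print(missing)  # keys that would receive missing ACS values after the left merge
```

The comparison is deliberately exact, mirroring how the merge matches keys: step 2 cleans column names only, so stray whitespace inside the `State` values themselves would still prevent a match.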

🧠 Key Concepts

Symbolic Institutional Traps: Language regimes act as hidden barriers, complicating regulatory navigation and labor market integration.
Symbolic Misalignment: Misfit between administrative languages and global commercial norms raises onboarding costs for MNEs.
Institutional Friction: Language encapsulation isolates economies and reduces foreign direct investment (FDI) attractiveness.

📜 Data Documentation

Each dataset has been:

Cleaned for consistent formatting.
Harmonized for cross-dataset integration.
Standardized to facilitate reproducible econometric analysis.
Full codebooks and metadata are available in the appendix of the research paper.

⚡ Notes

The EF EPI (English Proficiency) dataset was not uploaded here. If available, further regressions on symbolic distance can be run.
If any columns do not match exactly (e.g., different spellings), modify the variable names slightly based on print(dbna.columns).
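The column check described in the second note can be partly automated. The following is a minimal sketch, using only the standard library, that compares the names the regressions expect against the names actually present and suggests the closest available spelling. The helper `suggest_columns`, the 0.8 similarity cutoff, and the example column lists are assumptions for illustration; in practice `available` would be `list(merged_dbna.columns)`.

```python
import difflib

def suggest_columns(required, available, cutoff=0.8):
    """Map each required name to itself if present, else to its closest available name or None."""
    mapping = {}
    for name in required:
        if name in available:
            mapping[name] = name
        else:
            close = difflib.get_close_matches(name, available, n=1, cutoff=cutoff)
            mapping[name] = close[0] if close else None
    return mapping

# Hypothetical column lists, for illustration only.
available = ["State", "Starting_a_Business_Score", "Pct_Spanish_Speakers"]
required = [
    "Starting_a_Business_Score",
    "Percent_Spanish_Speakers",
    "Employing_Workers_Score",
]

mapping = suggest_columns(required, available)
for wanted, found in mapping.items():
    print(f"{wanted} -> {found}")
```

A `None` entry means no sufficiently similar column exists, so the corresponding regression should be skipped or the source file inspected by hand.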

📈 Planned Outputs

The code generates:

Regression outputs on how Spanish-speaking prevalence correlates with:
- Starting a business
- Ease of Doing Business
- Regulatory quality
Subnational institutional performance differences (Puerto Rico vs. U.S. states).

🌍 License and Reuse

Open Data: CC BY 4.0 License
Citation Requested:
Brown, S.M. (2025). Symbolic Institutional Traps and the Liability of Foreignness: Language Regimes as Hidden Barriers to Multinational Entry. University of Puerto Rico. DOI: 10.5281/zenodo.15050209
