18 datasets found

d
A n-grams collection extracted from the Portuguese Web
dataone.org
dados.gov.pt
Updated Nov 8, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Batista, David (2023). A n-grams collection extracted from the Portuguese Web [Dataset]. http://doi.org/10.7910/DVN/ZSXC55
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/ZSXC55
Dataset updated
Nov 8, 2023
Dataset provided by
Harvard Dataverse
Authors
Batista, David
Description
The n-grams collection was extracted from the collected documents whose identified language was Portuguese. We extracted word n-grams up to the fifht order (5-grams). A set of regular expressions to tokenize the text were applied. After the extraction, all n-grams with tokens having more than 32 characters were discarded. N-grams with frequencies below 5 were discarded as well. The n-grams collection is available as a set of UTF-8 encoded files, containing the n-grams and their frequencies
d
Data from: LSTM neural network for textual ngrams
datadryad.org
figshare.com
zip
Updated Dec 6, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shaun D'Souza (2019). LSTM neural network for textual ngrams [Dataset]. http://doi.org/10.5061/dryad.wstqjq2gm
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.wstqjq2gm
Dataset updated
Dec 6, 2019
Dataset provided by
Dryad
Authors
Shaun D'Souza
Time period covered
2019
Description
Cognitive neuroscience is the study of how the human brain functions on tasks like decision making, language, perception and reasoning. Deep learning is a class of machine learning algorithms that use neural networks. They are designed to model the responses of neurons in the human brain. Learning can be supervised or unsupervised. Ngram token models are used extensively in language prediction. Ngrams are probabilistic models that are used in predicting the next word or token. They are a statistical model of word sequences or tokens and are called Language Models or Lms. Ngrams are essential in creating language prediction models. We are exploring a broader sandbox ecosystems enabling for AI. Specifically, around Deep learning applications on unstructured content form on the web.
English Word Frequency
kaggle.com
Updated Sep 6, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rachael Tatman (2017). English Word Frequency [Dataset]. https://www.kaggle.com/datasets/rtatman/english-word-frequency/discussion?sortBy=hot
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Sep 6, 2017
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Rachael Tatman
Description
Context:

How frequently a word occurs in a language is an important piece of information for natural language processing and linguists. In natural language processing, very frequent words tend to be less informative than less frequent one and are often removed during preprocessing. Human language users are also sensitive to word frequency. How often a word is used affects language processing in humans. For example, very frequent words are read and understood more quickly and can be understood more easily in background noise.

Content:

This dataset contains the counts of the 333,333 most commonly-used single words on the English language web, as derived from the Google Web Trillion Word Corpus.

Acknowledgements:

Data files were derived from the Google Web Trillion Word Corpus (as described by Thorsten Brants and Alex Franz, and distributed by the Linguistic Data Consortium) by Peter Norvig. You can find more information on these files and the code used to generate them here.

The code used to generate this dataset is distributed under the MIT License.

Inspiration:

Can you tag the part of speech of these words? Which parts of speech are most frequent? Is this similar to other languages, like Japanese?

What differences are there between the very frequent words in this dataset, and the the frequent words in other corpora, such as the Brown Corpus or the TIMIT corpus? What might these differences tell us about how language is used?
e
PANACEA Labour Legislation Corpus n-grams FR (French) - Dataset - B2FIND
b2find.eudat.eu
Updated Apr 6, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). PANACEA Labour Legislation Corpus n-grams FR (French) - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/5f92eb15-5bec-50bf-8d45-0ffdbee9a84b
Explore at:
Dataset updated
Apr 6, 2024
Area covered
French
Description
This data set contains French word n-grams and French word/tag/lemma n-grams in the "Labour" (LAB) domain. N-grams are accompanied by their observed frequency counts. The length of the n-grams ranges from unigrams (single words) to five-grams. The data were collected in the context of PANACEA (http://www.panacea-lr.eu), an EU-FP7 Funded Project under Grant Agreement 248064. The n-gram counts were generated from crawled Web pages that were automatically detected to be in the French language and were automatically classified as relevant to the LAB domain. The LAB domain collection used consisted of approximately 56.4 million tokens. Data collection took place in the summer of 2011.
C
PANACEA Environment Corpus n-grams ES (Spanish)
dataverse.csuc.cat
txt, zip
Updated Oct 11, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
CORA.Repositori de Dades de Recerca (2023). PANACEA Environment Corpus n-grams ES (Spanish) [Dataset]. http://doi.org/10.34810/data335
Explore at:
txt(1456), zip(1449062879), zip(194184)Available download formats
Unique identifier
https://doi.org/10.34810/data335
Dataset updated
Oct 11, 2023
Dataset provided by
CORA.Repositori de Dades de Recerca
License
https://dataverse.csuc.cat/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.34810/data335https://dataverse.csuc.cat/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.34810/data335
Dataset funded by
European Commission
Description
This data set contains Spanish word n-grams and Spanish word/tag/lemma n-grams in the "Environment" (ENV) domain. N-grams are accompanied by their observed frequency counts. The length of the n-grams ranges from unigrams (single words) to five-grams. The data were collected in the context of PANACEA (http://www.panacea-lr.eu), an EU-FP7 Funded Project under Grant Agreement 248064. The n-gram counts were generated from crawled Web pages that were automatically detected to be in the Spanish language and were automatically classified as relevant to the ENV domain. The ENV domain collection used consisted of approximately 49.86 million tokens. Data collection took place in the summer of 2011.
google n gram
kaggle.com
Updated Nov 28, 2018
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
DHARMENDRA MAURYA (2018). google n gram [Dataset]. https://www.kaggle.com/dm4006/google-n-gram/tasks
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Nov 28, 2018
Dataset provided by
Kagglehttp://kaggle.com/
Authors
DHARMENDRA MAURYA
Description
Dataset

This dataset was created by DHARMENDRA MAURYA

Contents
Z
Evaluation datasets and results of the paper "Efficient Online Computation...
data.niaid.nih.gov
Updated Jan 17, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Chapela-Campa, David (2025). Evaluation datasets and results of the paper "Efficient Online Computation of Business Process State From Trace Prefixes via N-Gram Indexing" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11409896
Explore at:
Dataset updated
Jan 17, 2025
Dataset provided by
Dumas, Marlon
Chapela-Campa, David
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Event logs, process models, and results corresponding to the paper "Efficient Online Computation of Business Process State From Trace Prefixes via N-Gram Indexing".

Inputs: preprocessed event logs and discovered process models (and their characteristics) used in the evaluation.

Real-life: preprocessed event logs (xes and csv) corresponding to the real-life processes used in the evaluation. Process models (pnml) discovered with the Inductive Miner infrequent for thresholds of 10%, 20%, and 50%. Characteristics (txt) of the event logs and process models. Ongoing cases result from splitting each case in the preprocessed event logs (under folder split).

Synthetic: simulated event logs (csv) corresponding to the synthetic processes used in the evaluation. Designed process models (bpmn and pnml). Ongoing cases result from splitting each case in the preprocessed event logs (under folder split). Ongoing cases with injected noise as described in the publication (under folders noise_1, noise_2, and noise_3).
C
PANACEA Environment Corpus n-grams IT (Italian)
dataverse.csuc.cat
html, pdf, txt, zip
Updated Oct 19, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Consiglio di ionale delle Ricerche. Istituto di Linguistica Computazionale "Antonio Zampolli"; Consiglio di ionale delle Ricerche. Istituto di Linguistica Computazionale "Antonio Zampolli" (2023). PANACEA Environment Corpus n-grams IT (Italian) [Dataset]. http://doi.org/10.34810/data350
Explore at:
html(8640), pdf(194024), txt(236), zip(1128602950), txt(3790), txt(566)Available download formats
Unique identifier
https://doi.org/10.34810/data350
Dataset updated
Oct 19, 2023
Dataset provided by
CORA.Repositori de Dades de Recerca
Authors
Consiglio di ionale delle Ricerche. Istituto di Linguistica Computazionale "Antonio Zampolli"; Consiglio di ionale delle Ricerche. Istituto di Linguistica Computazionale "Antonio Zampolli"
License
https://dataverse.csuc.cat/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.34810/data350https://dataverse.csuc.cat/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.34810/data350
Dataset funded by
European Commission
Description
This data set contains Italian word n-grams and Italian word/tag/lemma n-grams in the "Environment" (ENV) domain. N-grams are accompanied by their observed frequency counts. The length of the n-grams ranges from unigrams (single words) to five-grams. The data were collected in the context of PANACEA (http://www.panacea-lr.eu), an EU-FP7 Funded Project under Grant Agreement 248064. The n-gram counts were generated from crawled Web pages that were automatically detected to be in the Italian language and were automatically classified as relevant to the ENV domain. The ENV domain collection used consisted of approximately 36 million tokens. Data collection took place in the summer of 2011.
w
Beautiful Data Natural Language Corpus and Code
data.wu.ac.at
zip
Updated Oct 10, 2013
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Global (2013). Beautiful Data Natural Language Corpus and Code [Dataset]. https://data.wu.ac.at/schema/datahub_io/MmMzNDM5NDEtMjk3Zi00YjMyLWJlZWUtYTE1YjhlZTE2YmRl
Explore at:
zipAvailable download formats
Dataset updated
Oct 10, 2013
Dataset provided by
Global
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Ngrams and code from Dr. Peter Norvig's chapter for Beautiful Data (2009), edited by Segaran and Hammerbacher. Data files are derived from the Google Web Trillion Word Corpus, as described, which is distributed by the Linguistic Data Consortium.
H
Knowledge Management - Raw Source Data
dataverse.harvard.edu
Updated May 6, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Diomar Anez; Dimar Anez (2025). Knowledge Management - Raw Source Data [Dataset]. http://doi.org/10.7910/DVN/8ATSMJ
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/8ATSMJ
Dataset updated
May 6, 2025
Dataset provided by
Harvard Dataverse
Authors
Diomar Anez; Dimar Anez
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This dataset contains raw, unprocessed data files pertaining to the management tool 'Knowledge Management' (KM), including related concepts like Intellectual Capital Management and Knowledge Transfer. The data originates from five distinct sources, each reflecting different facets of the tool's prominence and usage over time. Files preserve the original metrics and temporal granularity before any comparative normalization or harmonization. Data Sources & File Details: Google Trends File (Prefix: GT_): Metric: Relative Search Interest (RSI) Index (0-100 scale). Keywords Used: "knowledge management" + "knowledge management organizational" Time Period: January 2004 - January 2025 (Native Monthly Resolution). Scope: Global Web Search, broad categorization. Extraction Date: Data extracted January 2025. Notes: Index relative to peak interest within the period for these terms. Reflects public/professional search interest trends. Based on probabilistic sampling. Source URL: Google Trends Query Google Books Ngram Viewer File (Prefix: GB_): Metric: Annual Relative Frequency (% of total n-grams in the corpus). Keywords Used: Knowledge Management + Intellectual Capital Management + Knowledge Transfer Time Period: 1950 - 2022 (Annual Resolution). Corpus: English. Parameters: Case Insensitive OFF, Smoothing 0. Extraction Date: Data extracted January 2025. Notes: Reflects term usage frequency in Google's digitized book corpus. Subject to corpus limitations (English bias, coverage). Source URL: Ngram Viewer Query Crossref.org File (Prefix: CR_): Metric: Absolute count of publications per month matching keywords. Keywords Used: ("knowledge management" OR "intellectual capital management" OR "knowledge transfer") AND ("organizational" OR "management" OR "learning" OR "innovation" OR "sharing" OR "system") Time Period: 1950 - 2025 (Queried for monthly counts based on publication date metadata). Search Fields: Title, Abstract. Extraction Date: Data extracted January 2025. Notes: Reflects volume of relevant academic publications indexed by Crossref. Deduplicated using DOIs; records without DOIs omitted. Source URL: Crossref Search Query Bain & Co. Survey - Usability File (Prefix: BU_): Metric: Original Percentage (%) of executives reporting tool usage. Tool Names/Years Included: Knowledge Management (1999, 2000, 2002, 2004, 2006, 2008, 2010). Respondent Profile: CEOs, CFOs, COOs, other senior leaders; global, multi-sector. Source: Bain & Company Management Tools & Trends publications (Rigby D., Bilodeau B., et al., various years: 2001, 2003, 2005, 2007, 2009, 2011). Note: Tool potentially not surveyed or reported after 2010 under this specific name. Data Compilation Period: July 2024 - January 2025. Notes: Data points correspond to specific survey years. Sample sizes: 1999/475; 2000/214; 2002/708; 2004/960; 2006/1221; 2008/1430; 2010/1230. Bain & Co. Survey - Satisfaction File (Prefix: BS_): Metric: Original Average Satisfaction Score (Scale 0-5). Tool Names/Years Included: Knowledge Management (1999, 2000, 2002, 2004, 2006, 2008, 2010). Respondent Profile: CEOs, CFOs, COOs, other senior leaders; global, multi-sector. Source: Bain & Company Management Tools & Trends publications (Rigby D., Bilodeau B., et al., various years: 2001, 2003, 2005, 2007, 2009, 2011). Note: Tool potentially not surveyed or reported after 2010 under this specific name. Data Compilation Period: July 2024 - January 2025. Notes: Data points correspond to specific survey years. Sample sizes: 1999/475; 2000/214; 2002/708; 2004/960; 2006/1221; 2008/1430; 2010/1230. Reflects subjective executive perception of utility. File Naming Convention: Files generally follow the pattern: PREFIX_Tool.csv, where the PREFIX indicates the data source: GT_: Google Trends GB_: Google Books Ngram CR_: Crossref.org (Count Data for this Raw Dataset) BU_: Bain & Company Survey (Usability) BS_: Bain & Company Survey (Satisfaction) The essential identification comes from the PREFIX and the Tool Name segment. This dataset resides within the 'Management Tool Source Data (Raw Extracts)' Dataverse.
H
Benchmarking - Raw Source Data
dataverse.harvard.edu
Updated May 6, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Diomar Anez; Dimar Anez (2025). Benchmarking - Raw Source Data [Dataset]. http://doi.org/10.7910/DVN/JKDONM
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/JKDONM
Dataset updated
May 6, 2025
Dataset provided by
Harvard Dataverse
Authors
Diomar Anez; Dimar Anez
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This dataset contains raw, unprocessed data files pertaining to the management tool 'Benchmarking'. The data originates from five distinct sources, each reflecting different facets of the tool's prominence and usage over time. Files preserve the original metrics and temporal granularity before any comparative normalization or harmonization. Data Sources & File Details: Google Trends File (Prefix: GT_): Metric: Relative Search Interest (RSI) Index (0-100 scale). Keywords Used: "benchmarking" + "benchmarking management" Time Period: January 2004 - January 2025 (Native Monthly Resolution). Scope: Global Web Search, broad categorization. Extraction Date: Data extracted January 2025. Notes: Index relative to peak interest within the period for these terms. Reflects public/professional search interest trends. Based on probabilistic sampling. Source URL: Google Trends Query Google Books Ngram Viewer File (Prefix: GB_): Metric: Annual Relative Frequency (% of total n-grams in the corpus). Keywords Used: Benchmarking Time Period: 1950 - 2022 (Annual Resolution). Corpus: English. Parameters: Case Insensitive OFF, Smoothing 0. Extraction Date: Data extracted January 2025. Notes: Reflects term usage frequency in Google's digitized book corpus. Subject to corpus limitations (English bias, coverage). Source URL: Ngram Viewer Query Crossref.org File (Prefix: CR_): Metric: Absolute count of publications per month matching keywords. Keywords Used: "benchmarking" AND ("process" OR "management" OR "performance" OR "best practices" OR "implementation" OR "approach" OR "evaluation" OR "methodology") Time Period: 1950 - 2025 (Queried for monthly counts based on publication date metadata). Search Fields: Title, Abstract. Extraction Date: Data extracted January 2025. Notes: Reflects volume of relevant academic publications indexed by Crossref. Deduplicated using DOIs; records without DOIs omitted. Source URL: Crossref Search Query Bain & Co. Survey - Usability File (Prefix: BU_): Metric: Original Percentage (%) of executives reporting tool usage. Tool Names/Years Included: Benchmarking (1993, 1996, 1999, 2000, 2002, 2004, 2006, 2008, 2010, 2012, 2014, 2017). Respondent Profile: CEOs, CFOs, COOs, other senior leaders; global, multi-sector. Source: Bain & Company Management Tools & Trends publications (Rigby D., Bilodeau B., et al., various years: 1994, 2001, 2003, 2005, 2007, 2009, 2011, 2013, 2015, 2017). Note: Tool not included in the 2022 survey data. Data Compilation Period: July 2024 - January 2025. Notes: Data points correspond to specific survey years. Sample sizes: 1993/500; 1996/784; 1999/475; 2000/214; 2002/708; 2004/960; 2006/1221; 2008/1430; 2010/1230; 2012/1208; 2014/1067; 2017/1268. Bain & Co. Survey - Satisfaction File (Prefix: BS_): Metric: Original Average Satisfaction Score (Scale 0-5). Tool Names/Years Included: Benchmarking (1993, 1996, 1999, 2000, 2002, 2004, 2006, 2008, 2010, 2012, 2014, 2017). Respondent Profile: CEOs, CFOs, COOs, other senior leaders; global, multi-sector. Source: Bain & Company Management Tools & Trends publications (Rigby D., Bilodeau B., et al., various years: 1994, 2001, 2003, 2005, 2007, 2009, 2011, 2013, 2015, 2017). Note: Tool not included in the 2022 survey data. Data Compilation Period: July 2024 - January 2025. Notes: Data points correspond to specific survey years. Sample sizes: 1993/500; 1996/784; 1999/475; 2000/214; 2002/708; 2004/960; 2006/1221; 2008/1430; 2010/1230; 2012/1208; 2014/1067; 2017/1268. Reflects subjective executive perception of utility. File Naming Convention: Files generally follow the pattern: PREFIX_Tool.csv, where the PREFIX indicates the data source: GT_: Google Trends GB_: Google Books Ngram CR_: Crossref.org (Count Data for this Raw Dataset) BU_: Bain & Company Survey (Usability) BS_: Bain & Company Survey (Satisfaction) The essential identification comes from the PREFIX and the Tool Name segment. This dataset resides within the 'Management Tool Source Data (Raw Extracts)' Dataverse.
f
Results of our N-Gram (LLR) analysis.
figshare.com
xls
Updated Jun 21, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Laura Biester; James Pennebaker; Rada Mihalcea (2023). Results of our N-Gram (LLR) analysis. [Dataset]. http://doi.org/10.1371/journal.pone.0278179.t002
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0278179.t002
Dataset updated
Jun 21, 2023
Dataset provided by
PLOS ONE
Authors
Laura Biester; James Pennebaker; Rada Mihalcea
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Results of our N-Gram (LLR) analysis.
H
Core Competencies - Raw Source Data
dataverse.harvard.edu
Updated May 6, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Diomar Anez; Dimar Anez (2025). Core Competencies - Raw Source Data [Dataset]. http://doi.org/10.7910/DVN/1UFJRM
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/1UFJRM
Dataset updated
May 6, 2025
Dataset provided by
Harvard Dataverse
Authors
Diomar Anez; Dimar Anez
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This dataset contains raw, unprocessed data files pertaining to the management tool 'Core Competencies' (also Core Competence). The data originates from five distinct sources, each reflecting different facets of the tool's prominence and usage over time. Files preserve the original metrics and temporal granularity before any comparative normalization or harmonization. Data Sources & File Details: Google Trends File (Prefix: GT_): Metric: Relative Search Interest (RSI) Index (0-100 scale). Keywords Used: "core competencies" + "core competence strategy" Time Period: January 2004 - January 2025 (Native Monthly Resolution). Scope: Global Web Search, broad categorization. Extraction Date: Data extracted January 2025. Notes: Index relative to peak interest within the period for these terms. Reflects public/professional search interest trends. Based on probabilistic sampling. Source URL: Google Trends Query Google Books Ngram Viewer File (Prefix: GB_): Metric: Annual Relative Frequency (% of total n-grams in the corpus). Keywords Used: Core Competencies + Core Competence Time Period: 1950 - 2022 (Annual Resolution). Corpus: English. Parameters: Case Insensitive OFF, Smoothing 0. Extraction Date: Data extracted January 2025. Notes: Reflects term usage frequency in Google's digitized book corpus. Subject to corpus limitations (English bias, coverage). Source URL: Ngram Viewer Query Crossref.org File (Prefix: CR_): Metric: Absolute count of publications per month matching keywords. Keywords Used: ("core competencies" OR "core competence") AND ("management" OR "competitive advantage" OR "strategy" OR "capabilities" OR "resources" OR "approach" OR "development") Time Period: 1950 - 2025 (Queried for monthly counts based on publication date metadata). Search Fields: Title, Abstract. Extraction Date: Data extracted January 2025. Notes: Reflects volume of relevant academic publications indexed by Crossref. Deduplicated using DOIs; records without DOIs omitted. Source URL: Crossref Search Query Bain & Co. Survey - Usability File (Prefix: BU_): Metric: Original Percentage (%) of executives reporting tool usage. Tool Names/Years Included: Core Competencies (1993, 1996, 1999, 2000, 2002, 2004, 2006, 2008, 2010, 2012, 2014, 2017). Respondent Profile: CEOs, CFOs, COOs, other senior leaders; global, multi-sector. Source: Bain & Company Management Tools & Trends publications (Rigby D., Bilodeau B., et al., various years: 1994, 2001, 2003, 2005, 2007, 2009, 2011, 2013, 2015, 2017). Note: Tool not included in the 2022 survey data. Data Compilation Period: July 2024 - January 2025. Notes: Data points correspond to specific survey years. Sample sizes: 1993/500; 1996/784; 1999/475; 2000/214; 2002/708; 2004/960; 2006/1221; 2008/1430; 2010/1230; 2012/1208; 2014/1067; 2017/1268. Bain & Co. Survey - Satisfaction File (Prefix: BS_): Metric: Original Average Satisfaction Score (Scale 0-5). Tool Names/Years Included: Core Competencies (1993, 1996, 1999, 2000, 2002, 2004, 2006, 2008, 2010, 2012, 2014, 2017). Respondent Profile: CEOs, CFOs, COOs, other senior leaders; global, multi-sector. Source: Bain & Company Management Tools & Trends publications (Rigby D., Bilodeau B., et al., various years: 1994, 2001, 2003, 2005, 2007, 2009, 2011, 2013, 2015, 2017). Note: Tool not included in the 2022 survey data. Data Compilation Period: July 2024 - January 2025. Notes: Data points correspond to specific survey years. Sample sizes: 1993/500; 1996/784; 1999/475; 2000/214; 2002/708; 2004/960; 2006/1221; 2008/1430; 2010/1230; 2012/1208; 2014/1067; 2017/1268. Reflects subjective executive perception of utility. File Naming Convention: Files generally follow the pattern: PREFIX_Tool.csv, where the PREFIX indicates the data source: GT_: Google Trends GB_: Google Books Ngram CR_: Crossref.org (Count Data for this Raw Dataset) BU_: Bain & Company Survey (Usability) BS_: Bain & Company Survey (Satisfaction) The essential identification comes from the PREFIX and the Tool Name segment. This dataset resides within the 'Management Tool Source Data (Raw Extracts)' Dataverse.
f
Top 10 most frequent n-grams in cannabis-related posts.
plos.figshare.com
xls
Updated Jun 15, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Meredith C. Meacham; Alicia L. Nobles; D. Andrew Tompkins; Johannes Thrul (2023). Top 10 most frequent n-grams in cannabis-related posts. [Dataset]. http://doi.org/10.1371/journal.pone.0263583.t002
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0263583.t002
Dataset updated
Jun 15, 2023
Dataset provided by
PLOS ONE
Authors
Meredith C. Meacham; Alicia L. Nobles; D. Andrew Tompkins; Johannes Thrul
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Top 10 most frequent n-grams in cannabis-related posts.
H
Total Quality Management (TQM) - Raw Source Data
dataverse.harvard.edu
Updated May 6, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Diomar Anez; Dimar Anez (2025). Total Quality Management (TQM) - Raw Source Data [Dataset]. http://doi.org/10.7910/DVN/IJLFWU
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/IJLFWU
Dataset updated
May 6, 2025
Dataset provided by
Harvard Dataverse
Authors
Diomar Anez; Dimar Anez
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This dataset contains raw, unprocessed data files pertaining to the management tool group 'Total Quality Management' (TQM). The data originates from five distinct sources, each reflecting different facets of the tool's prominence and usage over time. Files preserve the original metrics and temporal granularity before any comparative normalization or harmonization. Data Sources & File Details: Google Trends File (Prefix: GT_): Metric: Relative Search Interest (RSI) Index (0-100 scale). Keywords Used: "total quality management" + TQM + "TQM system" Time Period: January 2004 - January 2025 (Native Monthly Resolution). Scope: Global Web Search, broad categorization. Extraction Date: Data extracted January 2025. Notes: Index relative to peak interest within the period for these terms. Reflects public/professional search interest trends. Based on probabilistic sampling. Source URL: Google Trends Query Google Books Ngram Viewer File (Prefix: GB_): Metric: Annual Relative Frequency (% of total n-grams in the corpus). Keywords Used: Total Quality Management + TQM + Total Quality Time Period: 1950 - 2022 (Annual Resolution). Corpus: English. Parameters: Case Insensitive OFF, Smoothing 0. Extraction Date: Data extracted January 2025. Notes: Reflects term usage frequency in Google's digitized book corpus. Subject to corpus limitations (English bias, coverage). Source URL: Ngram Viewer Query Crossref.org File (Prefix: CR_): Metric: Absolute count of publications per month matching keywords. Keywords Used: ("total quality management" OR "total quality" OR TQM) AND ("management" OR "system" OR "approach" OR "implementation" OR "practice" OR "framework" OR "methodology" OR "tool") Time Period: 1950 - 2025 (Queried for monthly counts based on publication date metadata). Search Fields: Title, Abstract. Extraction Date: Data extracted January 2025. Notes: Reflects volume of relevant academic publications indexed by Crossref. Deduplicated using DOIs; records without DOIs omitted. Source URL: Crossref Search Query Bain & Co. Survey - Usability File (Prefix: BU_): Metric: Original Percentage (%) of executives reporting tool usage. Tool Names/Years Included: Total Quality Management (1993, 1999, 2000, 2002, 2006, 2008, 2010, 2012, 2014, 2017, 2022); TQM (1996, 2004). Respondent Profile: CEOs, CFOs, COOs, other senior leaders; global, multi-sector. Source: Bain & Company Management Tools & Trends publications (Rigby D., Bilodeau B., Ronan C. et al., various years: 1994, 2001, 2003, 2005, 2007, 2009, 2011, 2013, 2015, 2017, 2023). Data Compilation Period: July 2024 - January 2025. Notes: Data points correspond to specific survey years. Sample sizes: 1993/500; 1996/784; 1999/475; 2000/214; 2002/708; 2004/960; 2006/1221; 2008/1430; 2010/1230; 2012/1208; 2014/1067; 2017/1268; 2022/1068. Bain & Co. Survey - Satisfaction File (Prefix: BS_): Metric: Original Average Satisfaction Score (Scale 0-5). Tool Names/Years Included: Total Quality Management (1993, 1999, 2000, 2002, 2006, 2008, 2010, 2012, 2014, 2017, 2022); TQM (1996, 2004). Respondent Profile: CEOs, CFOs, COOs, other senior leaders; global, multi-sector. Source: Bain & Company Management Tools & Trends publications (Rigby D., Bilodeau B., Ronan C. et al., various years: 1994, 2001, 2003, 2005, 2007, 2009, 2011, 2013, 2015, 2017, 2023). Data Compilation Period: July 2024 - January 2025. Notes: Data points correspond to specific survey years. Sample sizes: 1993/500; 1996/784; 1999/475; 2000/214; 2002/708; 2004/960; 2006/1221; 2008/1430; 2010/1230; 2012/1208; 2014/1067; 2017/1268; 2022/1068. Reflects subjective executive perception of utility. File Naming Convention: Files generally follow the pattern: PREFIX_Tool.csv, where the PREFIX indicates the data source: GT_: Google Trends GB_: Google Books Ngram CR_: Crossref.org (Count Data for this Raw Dataset) BU_: Bain & Company Survey (Usability) BS_: Bain & Company Survey (Satisfaction) The essential identification comes from the PREFIX and the Tool Name segment. This dataset resides within the 'Management Tool Source Data (Raw Extracts)' Dataverse.
H
Strategic Alliances & Corporate Venture Capital - Raw Source Data
dataverse.harvard.edu
Updated May 6, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Diomar Anez; Dimar Anez (2025). Strategic Alliances & Corporate Venture Capital - Raw Source Data [Dataset]. http://doi.org/10.7910/DVN/Z8SNIU
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/Z8SNIU
Dataset updated
May 6, 2025
Dataset provided by
Harvard Dataverse
Authors
Diomar Anez; Dimar Anez
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This dataset contains raw, unprocessed data files pertaining to the management tool group focused on 'Strategic Alliances' and 'Corporate Venture Capital' (CVC). The data originates from five distinct sources, each reflecting different facets of the tool's prominence and usage over time. Files preserve the original metrics and temporal granularity before any comparative normalization or harmonization. Data Sources & File Details: Google Trends File (Prefix: GT_): Metric: Relative Search Interest (RSI) Index (0-100 scale). Keywords Used: "strategic alliance" + "corporate venture capital" + "strategic alliance strategy" Time Period: January 2004 - January 2025 (Native Monthly Resolution). Scope: Global Web Search, broad categorization. Extraction Date: Data extracted January 2025. Notes: Index relative to peak interest within the period for these terms. Reflects public/professional search interest trends. Based on probabilistic sampling. Source URL: Google Trends Query Google Books Ngram Viewer File (Prefix: GB_): Metric: Annual Relative Frequency (% of total n-grams in the corpus). Keywords Used: Corporate Venture Capital + Strategic Alliance + Strategic Alliances Time Period: 1950 - 2022 (Annual Resolution). Corpus: English. Parameters: Case Insensitive OFF, Smoothing 0. Extraction Date: Data extracted January 2025. Notes: Reflects term usage frequency in Google's digitized book corpus. Subject to corpus limitations (English bias, coverage). Source URL: Ngram Viewer Query Crossref.org File (Prefix: CR_): Metric: Absolute count of publications per month matching keywords. Keywords Used: ("strategic alliance" OR "strategic alliances" OR "corporate venture capital") AND ("management" OR "strategy" OR "corporate" OR "development" OR "partnership" OR "approach" OR "implementation") Time Period: 1950 - 2025 (Queried for monthly counts based on publication date metadata). Search Fields: Title, Abstract. Extraction Date: Data extracted January 2025. Notes: Reflects volume of relevant academic publications indexed by Crossref. Deduplicated using DOIs; records without DOIs omitted. Source URL: Crossref Search Query Bain & Co. Survey - Usability File (Prefix: BU_): Metric: Original Percentage (%) of executives reporting tool usage. Tool Names/Years Included: Strategic Alliance (1993); Strategic Alliances (1996, 1999, 2000, 2002, 2004, 2006, 2008, 2010, 2012, 2014, 2017); Corporate Venture Capital (2022). Respondent Profile: CEOs, CFOs, COOs, other senior leaders; global, multi-sector. Source: Bain & Company Management Tools & Trends publications (Rigby D., Bilodeau B., Ronan C. et al., various years: 1994, 2001, 2003, 2005, 2007, 2009, 2011, 2013, 2015, 2017, 2023). Data Compilation Period: July 2024 - January 2025. Notes: Data points correspond to specific survey years. Sample sizes: 1993/500; 1996/784; 1999/475; 2000/214; 2002/708; 2004/960; 2006/1221; 2008/1430; 2010/1230; 2012/1208; 2014/1067; 2017/1268; 2022/1068. Bain & Co. Survey - Satisfaction File (Prefix: BS_): Metric: Original Average Satisfaction Score (Scale 0-5). Tool Names/Years Included: Strategic Alliance (1993); Strategic Alliances (1996, 1999, 2000, 2002, 2004, 2006, 2008, 2010, 2012, 2014, 2017); Corporate Venture Capital (2022). Respondent Profile: CEOs, CFOs, COOs, other senior leaders; global, multi-sector. Source: Bain & Company Management Tools & Trends publications (Rigby D., Bilodeau B., Ronan C. et al., various years: 1994, 2001, 2003, 2005, 2007, 2009, 2011, 2013, 2015, 2017, 2023). Data Compilation Period: July 2024 - January 2025. Notes: Data points correspond to specific survey years. Sample sizes: 1993/500; 1996/784; 1999/475; 2000/214; 2002/708; 2004/960; 2006/1221; 2008/1430; 2010/1230; 2012/1208; 2014/1067; 2017/1268; 2022/1068. Reflects subjective executive perception of utility. File Naming Convention: Files generally follow the pattern: PREFIX_Tool.csv, where the PREFIX indicates the data source: GT_: Google Trends GB_: Google Books Ngram CR_: Crossref.org (Count Data for this Raw Dataset) BU_: Bain & Company Survey (Usability) BS_: Bain & Company Survey (Satisfaction) The essential identification comes from the PREFIX and the Tool Name segment. This dataset resides within the 'Management Tool Source Data (Raw Extracts)' Dataverse.
H
Business Process Reengineering - Raw Source Data
dataverse.harvard.edu
Updated May 6, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Diomar Anez; Dimar Anez (2025). Business Process Reengineering - Raw Source Data [Dataset]. http://doi.org/10.7910/DVN/2DR8U5
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/2DR8U5
Dataset updated
May 6, 2025
Dataset provided by
Harvard Dataverse
Authors
Diomar Anez; Dimar Anez
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This dataset contains raw, unprocessed data files pertaining to the management tool 'Business Process Reengineering' (BPR). The data originates from five distinct sources, each reflecting different facets of the tool's prominence and usage over time. Files preserve the original metrics and temporal granularity before any comparative normalization or harmonization. Data Sources & File Details: Google Trends File (Prefix: GT_): Metric: Relative Search Interest (RSI) Index (0-100 scale). Keywords Used: "business process reengineering" + "process reengineering" + "reengineering management" Time Period: January 2004 - January 2025 (Native Monthly Resolution). Scope: Global Web Search, broad categorization. Extraction Date: Data extracted January 2025. Notes: Index relative to peak interest within the period for these terms. Reflects public/professional search interest trends. Based on probabilistic sampling. Source URL: Google Trends Query Google Books Ngram Viewer File (Prefix: GB_): Metric: Annual Relative Frequency (% of total n-grams in the corpus). Keywords Used: Reengineering + Business Process Reengineering + Process Reengineering Time Period: 1950 - 2022 (Annual Resolution). Corpus: English. Parameters: Case Insensitive OFF, Smoothing 0. Extraction Date: Data extracted January 2025. Notes: Reflects term usage frequency in Google's digitized book corpus. Subject to corpus limitations (English bias, coverage). Source URL: Ngram Viewer Query Crossref.org File (Prefix: CR_): Metric: Absolute count of publications per month matching keywords. Keywords Used: ("business process reengineering" OR "process reengineering" OR "reengineering") AND ("management" OR "technique" OR "methodology" OR "approach" OR "implementation" OR "adoption" OR "practice" OR "framework" OR "model" OR "tool" OR "system") Time Period: 1950 - 2025 (Queried for monthly counts based on publication date metadata). Search Fields: Title, Abstract. Extraction Date: Data extracted January 2025. Notes: Reflects volume of relevant academic publications indexed by Crossref. Deduplicated using DOIs; records without DOIs omitted. Source URL: Crossref Search Query Bain & Co. Survey - Usability File (Prefix: BU_): Metric: Original Percentage (%) of executives reporting tool usage. Tool Names/Years Included: Reengineering (1993, 1996, 2000, 2002); Business Process Reengineering (2004, 2006, 2008, 2010, 2012, 2014, 2017, 2022). Respondent Profile: CEOs, CFOs, COOs, other senior leaders; global, multi-sector. Source: Bain & Company Management Tools & Trends publications (Rigby D., Bilodeau B., Ronan C. et al., various years: 1994, 2001, 2003, 2005, 2007, 2009, 2011, 2013, 2015, 2017, 2023). Data Compilation Period: July 2024 - January 2025. Notes: Data points correspond to specific survey years. Sample sizes: 1993/500; 1996/784; 2000/214; 2002/708; 2004/960; 2006/1221; 2008/1430; 2010/1230; 2012/1208; 2014/1067; 2017/1268; 2022/1068. Bain & Co. Survey - Satisfaction File (Prefix: BS_): Metric: Original Average Satisfaction Score (Scale 0-5). Tool Names/Years Included: Reengineering (1993, 1996, 2000, 2002); Business Process Reengineering (2004, 2006, 2008, 2010, 2012, 2014, 2017, 2022). Respondent Profile: CEOs, CFOs, COOs, other senior leaders; global, multi-sector. Source: Bain & Company Management Tools & Trends publications (Rigby D., Bilodeau B., Ronan C. et al., various years: 1994, 2001, 2003, 2005, 2007, 2009, 2011, 2013, 2015, 2017, 2023). Data Compilation Period: July 2024 - January 2025. Notes: Data points correspond to specific survey years. Sample sizes: 1993/500; 1996/784; 2000/214; 2002/708; 2004/960; 2006/1221; 2008/1430; 2010/1230; 2012/1208; 2014/1067; 2017/1268; 2022/1068. Reflects subjective executive perception of utility. File Naming Convention: Files generally follow the pattern: PREFIX_Tool.csv, where the PREFIX indicates the data source: GT_: Google Trends GB_: Google Books Ngram CR_: Crossref.org (Count Data for this Raw Dataset) BU_: Bain & Company Survey (Usability) BS_: Bain & Company Survey (Satisfaction) The essential identification comes from the PREFIX and the Tool Name segment. This dataset resides within the 'Management Tool Source Data (Raw Extracts)' Dataverse.
H
Cost Management (Activity-Based) - Raw Source Data
dataverse.harvard.edu
Updated May 6, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Diomar Anez; Dimar Anez (2025). Cost Management (Activity-Based) - Raw Source Data [Dataset]. http://doi.org/10.7910/DVN/8GJH2G
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/8GJH2G
Dataset updated
May 6, 2025
Dataset provided by
Harvard Dataverse
Authors
Diomar Anez; Dimar Anez
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This dataset contains raw, unprocessed data files pertaining to the management tool group focused on 'Activity-Based Costing' (ABC) and 'Activity-Based Management' (ABM). The data originates from five distinct sources, each reflecting different facets of the tool's prominence and usage over time. Files preserve the original metrics and temporal granularity before any comparative normalization or harmonization. Data Sources & File Details: Google Trends File (Prefix: GT_): Metric: Relative Search Interest (RSI) Index (0-100 scale). Keywords Used: "activity based costing" + "activity based management" + "activity based costing management" Time Period: January 2004 - January 2025 (Native Monthly Resolution). Scope: Global Web Search, broad categorization. Extraction Date: Data extracted January 2025. Notes: Index relative to peak interest within the period for these terms. Reflects public/professional search interest trends. Based on probabilistic sampling. Source URL: Google Trends Query Google Books Ngram Viewer File (Prefix: GB_): Metric: Annual Relative Frequency (% of total n-grams in the corpus). Keywords Used: Activity Based Management + Activity Based Costing Time Period: 1950 - 2022 (Annual Resolution). Corpus: English. Parameters: Case Insensitive OFF, Smoothing 0. Extraction Date: Data extracted January 2025. Notes: Reflects term usage frequency in Google's digitized book corpus. Subject to corpus limitations (English bias, coverage). Source URL: Ngram Viewer Query Crossref.org File (Prefix: CR_): Metric: Absolute count of publications per month matching keywords. Keywords Used: ("activity based costing" OR "activity based management") AND ("management" OR "accounting" OR "cost control" OR "financial" OR "analysis" OR "system") Time Period: 1950 - 2025 (Queried for monthly counts based on publication date metadata). Search Fields: Title, Abstract. Extraction Date: Data extracted January 2025. Notes: Reflects volume of relevant academic publications indexed by Crossref. Deduplicated using DOIs; records without DOIs omitted. Source URL: Crossref Search Query Bain & Co. Survey - Usability File (Prefix: BU_): Metric: Original Percentage (%) of executives reporting tool usage. Tool Names/Years Included: Activity-Based Costing (1993); Activity-Based Management (1999, 2000, 2002, 2004). (Note: Some sources use Activity Based Management). Respondent Profile: CEOs, CFOs, COOs, other senior leaders; global, multi-sector. Source: Bain & Company Management Tools & Trends publications (Rigby D., Bilodeau B., et al., various years: 1994, 2001, 2003, 2005). Note: Tool potentially not surveyed or reported after 2004 under these specific names. Data Compilation Period: July 2024 - January 2025. Notes: Data points correspond to specific survey years. Sample sizes: 1993/500; 1999/475; 2000/214; 2002/708; 2004/960. Bain & Co. Survey - Satisfaction File (Prefix: BS_): Metric: Original Average Satisfaction Score (Scale 0-5). Tool Names/Years Included: Activity-Based Costing (1993); Activity-Based Management (1999, 2000, 2002, 2004). (Note: Some sources use Activity Based Management). Respondent Profile: CEOs, CFOs, COOs, other senior leaders; global, multi-sector. Source: Bain & Company Management Tools & Trends publications (Rigby D., Bilodeau B., et al., various years: 1994, 2001, 2003, 2005). Note: Tool potentially not surveyed or reported after 2004 under these specific names. Data Compilation Period: July 2024 - January 2025. Notes: Data points correspond to specific survey years. Sample sizes: 1993/500; 1999/475; 2000/214; 2002/708; 2004/960. Reflects subjective executive perception of utility. File Naming Convention: Files generally follow the pattern: PREFIX_Tool.csv, where the PREFIX indicates the data source: GT_: Google Trends GB_: Google Books Ngram CR_: Crossref.org (Count Data for this Raw Dataset) BU_: Bain & Company Survey (Usability) BS_: Bain & Company Survey (Satisfaction) The essential identification comes from the PREFIX and the Tool Name segment. This dataset resides within the 'Management Tool Source Data (Raw Extracts)' Dataverse.
Not seeing a result you expected?
Learn how you can add new datasets to our index.

Facebook

Twitter

Click to copy link

Link copied

Cite

Batista, David (2023). A n-grams collection extracted from the Portuguese Web [Dataset]. http://doi.org/10.7910/DVN/ZSXC55

A n-grams collection extracted from the Portuguese Web

Explore at:

Unique identifier

https://doi.org/10.7910/DVN/ZSXC55

Dataset updated

Nov 8, 2023

Dataset provided by

Harvard Dataverse

Authors

Batista, David

Description

The n-grams collection was extracted from the collected documents whose identified language was Portuguese. We extracted word n-grams up to the fifht order (5-grams). A set of regular expressions to tokenize the text were applied. After the extraction, all n-grams with tokens having more than 32 characters were discarded. N-grams with frequencies below 5 were discarded as well. The n-grams collection is available as a set of UTF-8 encoded files, containing the n-grams and their frequencies

Clear search

Close search

Google apps

Main menu

A n-grams collection extracted from the Portuguese Web

Data from: LSTM neural network for textual ngrams

English Word Frequency

Context:

Content:

Acknowledgements:

Inspiration:

PANACEA Labour Legislation Corpus n-grams FR (French) - Dataset - B2FIND

PANACEA Environment Corpus n-grams ES (Spanish)

google n gram

Dataset

Contents

Evaluation datasets and results of the paper "Efficient Online Computation...

PANACEA Environment Corpus n-grams IT (Italian)

Beautiful Data Natural Language Corpus and Code

Knowledge Management - Raw Source Data

Benchmarking - Raw Source Data

Results of our N-Gram (LLR) analysis.

Core Competencies - Raw Source Data

Top 10 most frequent n-grams in cannabis-related posts.

Total Quality Management (TQM) - Raw Source Data

Strategic Alliances & Corporate Venture Capital - Raw Source Data

Business Process Reengineering - Raw Source Data

Cost Management (Activity-Based) - Raw Source Data

A n-grams collection extracted from the Portuguese WebSee More Versions

A n-grams collection extracted from the Portuguese Web