Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about book subjects and is filtered to the book Exploratory Data Mining and Data Cleaning. It has 4 columns: book subject, authors, books, and publication dates. The data are ordered by earliest publication date (descending).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The LSC (Leicester Scientific Corpus)
April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk). Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes. The data are extracted from the Web of Science [1]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics. [Version 2] Further cleaning was applied in Data Processing for LSC Abstracts in Version 1*. Details of the cleaning procedure are explained in Step 6. * Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v1
Getting Started
This text provides information on the LSC (Leicester Scientific Corpus) and the pre-processing steps applied to abstracts, and describes the structure of the files that organise the corpus. The corpus was created for future work on the quantification of the meaning of research texts and is made available for use in Natural Language Processing projects. LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [1]. The corpus contains only documents in English. Each document in the corpus contains the following parts:
1. Authors: The list of authors of the paper
2. Title: The title of the paper
3. Abstract: The abstract of the paper
4. Categories: One or more categories from the list of categories [2]. The full list of categories is presented in the file ‘List_of_Categories.txt’.
5. Research Areas: One or more research areas from the list of research areas [3]. The full list of research areas is presented in the file ‘List_of_Research_Areas.txt’.
6. Total Times Cited: The number of times the paper was cited by other items from all databases within the Web of Science platform [4]
7. Times Cited in Core Collection: The total number of times the paper was cited by other papers within the WoS Core Collection [4]
The corpus was collected online in July 2018 and contains the number of citations from publication date to July 2018. We describe a document as the collection of information (about a paper) listed above. The total number of documents in the LSC is 1,673,350.
Data Processing
Step 1: Downloading the Data Online
The dataset was collected manually by exporting documents as tab-delimited files online. All documents are available online.
Step 2: Importing the Dataset to R
The LSC was collected as TXT files, and all documents were imported into R.
Step 3: Cleaning the Data of Documents with an Empty Abstract or without a Category
As our research is based on the analysis of abstracts and categories, all documents with empty abstracts and all documents without categories were removed.
Step 4: Identification and Correction of Concatenated Words in Abstracts
Medicine-related publications in particular use ‘structured abstracts’. Such abstracts are divided into sections with distinct headings such as introduction, aim, objective, method, result, conclusion, etc. The tool used for extracting abstracts concatenates the section headings with the first word of the section; for instance, we observe words such as ConclusionHigher and ConclusionsRT. Such concatenated words were detected and identified by sampling medicine-related publications with human intervention, and each is split into two words. For instance, the word ‘ConclusionHigher’ is split into ‘Conclusion’ and ‘Higher’. The section headings in such abstracts are listed below:
Background Method(s) Design Theoretical Measurement(s) Location Aim(s) Methodology Process Abstract Population Approach Objective(s) Purpose(s) Subject(s) Introduction Implication(s) Patient(s) Procedure(s) Hypothesis Measure(s) Setting(s) Limitation(s) Discussion Conclusion(s) Result(s) Finding(s) Material(s) Rationale(s) Implications for health and nursing policy
Step 5: Extracting (Sub-setting) the Data Based on Lengths of Abstracts
After correction, the lengths of abstracts were calculated. ‘Length’ is the total number of words in the text, calculated by the same rule as the Microsoft Word ‘word count’ [5]. According to the APA style manual [6], an abstract should contain between 150 and 250 words. In the LSC, we limited the length of abstracts to between 30 and 500 words in order to study documents with abstracts of typical length and to avoid the effect of length on the analysis.
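Steps 4 and 5 can be sketched as follows. This is a minimal illustration, not the project's R code: we use Python, only a few of the headings above are included, and the function names are our own. The 30-500 word limits come from the description above, and whitespace-delimited token counting is used as an approximation of the Microsoft Word rule.

```python
import re

# A few of the section headings listed above; the full list would expand
# each "(s)" variant to both singular and plural forms.
HEADINGS = ["Conclusions", "Conclusion", "Results", "Result",
            "Methods", "Method", "Background", "Objectives", "Objective"]

def split_concatenated_headings(text):
    """Step 4: split heading/first-word concatenations such as
    'ConclusionHigher' into 'Conclusion Higher'."""
    for h in sorted(HEADINGS, key=len, reverse=True):
        # a known heading immediately followed by an uppercase letter
        text = re.sub(rf"\b{h}(?=[A-Z])", h + " ", text)
    return text

def within_length_limits(abstract, lo=30, hi=500):
    """Step 5: keep abstracts of 30-500 words; whitespace-delimited
    tokens approximate the MS Word word count."""
    return lo <= len(abstract.split()) <= hi
```

Matching the longest headings first prevents, for example, 'Conclusions' from being split as 'Conclusion' plus a stray 's'.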
Step 6: [Version 2] Cleaning Copyright Notices, Permission Policies, Journal Names and Conference Names from LSC Abstracts in Version 1
Conferences and journals can place a footer below the text of an abstract containing a copyright notice, permission policy, journal name, licence, authors' rights or conference name. The tool used for extracting and processing abstracts from the WoS database attaches such footers to the text; for example, casual observation shows that copyright notices such as ‘Published by Elsevier Ltd.’ appear in many texts. To avoid abnormal appearances of words in further analysis, such as bias in frequency calculations, we performed a cleaning procedure on such sentences and phrases in the abstracts of LSC Version 1. We removed copyright notices, names of conferences, names of journals, authors' rights, licences and permission policies identified by sampling of abstracts.
Step 7: [Version 2] Re-extracting (Sub-setting) the Data Based on Lengths of Abstracts
The cleaning procedure described in the previous step led to some abstracts falling below our minimum length criterion (30 words); 474 texts were removed.
Step 8: Saving the Dataset into CSV Format
Documents are saved into 34 CSV files. In the CSV files, the information is organised with one record per line; the abstract, title, list of authors, list of categories, list of research areas, and times cited are recorded in fields. To access the LSC for research purposes, please email ns433@le.ac.uk.
References
[1] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
[2] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html
[3] Research Areas in WoS. Available: https://images.webofknowledge.com/images/help/WOS/hp_research_areas_easca.html
[4] Times Cited in WoS Core Collection. (15 July).
Available: https://support.clarivate.com/ScientificandAcademicResearch/s/article/Web-of-Science-Times-Cited-accessibility-and-variation?language=en_US
[5] Word Count. Available: https://support.office.com/en-us/article/show-word-count-3c9e6a11-a04d-43b4-977c-563a0e0d5da3
[6] American Psychological Association, Publication Manual. Washington, DC: American Psychological Association, 1983.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LScDC Word-Category RIG Matrix
April 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com). Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.
Getting Started
This file describes the Word-Category RIG Matrix for the Leicester Scientific Corpus (LSC) [1] and the procedure used to build the matrix, and introduces the Leicester Scientific Thesaurus (LScT) together with its construction process. The Word-Category RIG Matrix is a 103,998 by 252 matrix, where rows correspond to the words of the Leicester Scientific Dictionary-Core (LScDC) [2] and columns correspond to 252 Web of Science (WoS) categories [3, 4, 5]. Each entry in the matrix corresponds to a pair (category, word); its value is the Relative Information Gain (RIG) on the belonging of a text from the LSC to the category from observing the word in this text. The CSV file of the Word-Category RIG Matrix in the published archive includes two additional columns: the sum of RIGs over categories and the maximum of RIGs over categories (the last two columns of the matrix). The file ‘Word-Category RIG Matrix.csv’ therefore contains a total of 254 columns. This matrix was created for future research on quantifying meaning in scientific texts, under the assumption that words have scientifically specific meanings in subject categories and that this meaning can be estimated by the information gains from the word to the categories. The LScT (Leicester Scientific Thesaurus) is a scientific thesaurus of English comprising a list of 5,000 words from the LScDC. We order the words of the LScDC by the sum of their RIGs over categories; that is, words are arranged by their informativeness in the scientific corpus LSC, so the meaningfulness of a word is evaluated by its average informativeness over the categories. We decided to include the 5,000 most informative words in the scientific thesaurus.
Words as a Vector of Frequencies in WoS Categories
Each word of the LScDC is represented as a vector of frequencies in WoS categories. Given the collection of LSC texts, each entry of the vector is the number of texts in the corresponding category that contain the word. It is noteworthy that texts in a corpus do not necessarily belong to a single category: they may correspond to multidisciplinary studies, especially in a corpus of scientific texts. In other words, categories need not be exclusive. There are 252 WoS categories, and a text can be assigned to at least 1 and at most 6 categories in the LSC. Frequencies are calculated in a binary fashion at the text level: we record only the presence or absence of a word in a text. We create a vector of frequencies for each word, whose dimensions are the categories in the corpus. The collection of vectors, over all words and categories in the entire corpus, can be shown as a table in which each entry corresponds to a pair (word, category). This table is built for the LScDC with 252 WoS categories and presented in the published archive with this file. The value of each entry shows how many times a word of the LScDC appears in a WoS category, where the occurrence of a word in a category is determined by counting the number of LSC texts in the category that contain the word.
Words as a Vector of Relative Information Gains Extracted for Categories
In this section, we introduce our approach to representing a word as a vector of relative information gains for categories, under the assumption that the meaning of a word can be quantified by the information it gains for categories. For each category, a function is defined on texts that takes the value 1 if the text belongs to the category, and 0 otherwise. For each word, a function is defined on texts that takes the value 1 if the word belongs to the text, and 0 otherwise. Consider the LSC as a probabilistic sample space (the space of equally probable elementary outcomes).
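The binary text-level counting described above can be sketched in Python. This is a small illustration, not the authors' code; the function and argument names are our own.

```python
def category_frequency_vector(word, texts, text_categories, categories):
    """Binary text-level counting: the component for category c is the
    number of texts that contain `word` and are assigned to c.

    texts: one set of words per text; text_categories: one set of WoS
    categories (1 to 6 of them) per text, aligned with `texts`."""
    counts = dict.fromkeys(categories, 0)
    for words, cats in zip(texts, text_categories):
        if word in words:           # presence only, not term frequency
            for c in cats:
                counts[c] += 1
    return [counts[c] for c in categories]
```

Because a text can carry up to six categories, a single occurrence of a word may increment several components of its vector, which is why the categories are not exclusive.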
For these Boolean random variables, the joint probability distribution, the entropy and the information gains are defined. The information gain about the category from the word is the amount of information on the belonging of a text from the LSC to the category obtained from observing the word in the text [6]. We used the Relative Information Gain (RIG), which provides a normalised measure of the information gain and thereby allows information gains to be compared across categories. The calculations of the entropy, Information Gain and Relative Information Gain can be found in the README file in the published archive. Given a word, we created a vector in which each component corresponds to a category; each word is thus represented as a vector of relative information gains, and the dimension of the vector is the number of categories. The set of vectors forms the Word-Category RIG Matrix, in which each column corresponds to a category, each row corresponds to a word, and each component is the relative information gain from the word to the category. In the Word-Category RIG Matrix, a row vector represents the corresponding word as a vector of RIGs in categories, while a column vector represents the RIGs of all words in an individual category. For any chosen category, words can be ordered by their RIGs from the most informative to the least informative for that category. Beyond ordering words within each category, words can also be ordered by two global criteria: the sum and the maximum of their RIGs over categories; the top n words in such a list can be considered the most informative words in scientific texts. For a given word, the sum and maximum of RIGs are calculated from the Word-Category RIG Matrix. RIGs for each word of the LScDC in the 252 categories are calculated and the vectors of words are formed; we then form the Word-Category RIG Matrix for the LSC.
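For the Boolean indicators just defined, the RIG of a category from a word can be sketched as follows. This is a minimal illustration under the definitions above, not the code from the published archive; we assume base-2 entropy, and the function names are our own.

```python
import math

def entropy(p):
    """Shannon entropy (bits) of a Bernoulli variable with P(1) = p."""
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

def relative_information_gain(word_in_text, text_in_cat):
    """RIG of a category from a word, treating the corpus as a space of
    equally probable texts. Inputs are parallel 0/1 lists over all texts.

    Returns IG(category; word) / H(category), i.e. the information gain
    normalised by the category entropy."""
    n = len(word_in_text)
    h_c = entropy(sum(text_in_cat) / n)
    if h_c == 0:
        return 0.0
    # conditional entropy H(category | word)
    h_c_given_w = 0.0
    for v in (1, 0):
        idx = [i for i in range(n) if word_in_text[i] == v]
        if idx:
            p = sum(text_in_cat[i] for i in idx) / len(idx)
            h_c_given_w += (len(idx) / n) * entropy(p)
    return (h_c - h_c_given_w) / h_c
```

A word that perfectly predicts the category gives RIG = 1, and a word independent of the category gives RIG = 0, which is what makes RIGs comparable across categories of different sizes.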
For each word, the sum (S) and maximum (M) of RIGs over categories are calculated and appended to the matrix (its last two columns). The Word-Category RIG Matrix for the LScDC with 252 categories, together with the sum and maximum of RIGs, can be found in the database.
Leicester Scientific Thesaurus (LScT)
The Leicester Scientific Thesaurus (LScT) is a list of 5,000 words from the LScDC [2]. Words of the LScDC are sorted in descending order by the sum (S) of RIGs over categories, and the top 5,000 words are selected for inclusion in the LScT. We consider these 5,000 words to be the most meaningful words in the scientific corpus: the meaningfulness of a word is evaluated by its average informativeness over the categories, and the list of these words is considered a ‘thesaurus’ for science. The LScT, with the sum values, is available as a CSV file in the published archive. The published archive contains the following files:
1) Word_Category_RIG_Matrix.csv: A 103,998 by 254 matrix whose columns are the 252 WoS categories plus the sum (S) and the maximum (M) of RIGs over categories (the last two columns), and whose rows are the words of the LScDC. Each entry in the first 252 columns is the RIG from the word to the category. Words are ordered as in the LScDC.
2) Word_Category_Frequency_Matrix.csv: A 103,998 by 252 matrix whose columns are the 252 WoS categories and whose rows are the words of the LScDC. Each entry is the number of texts in the corresponding category that contain the word. Words are ordered as in the LScDC.
3) LScT.csv: The list of words of the LScT with their sum (S) values.
4) Text_No_in_Cat.csv: The number of texts in each category.
5) Categories_in_Documents.csv: The list of WoS categories for each document of the LSC.
6) README.txt: Description of the Word-Category RIG Matrix, the Word-Category Frequency Matrix and the LScT, and the procedures used to form them.
7) README.pdf: The same as 6), in PDF format.
References
[1] Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset.
https://doi.org/10.25392/leicester.data.9449639.v2
[2] Suzen, Neslihan (2019): LScDC (Leicester Scientific Dictionary-Core). figshare. Dataset. https://doi.org/10.25392/leicester.data.9896579.v3
[3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
[4] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html
[5] Suzen, N., Mirkes, E. M., & Gorban, A. N. (2019). LScDC - new large scientific dictionary. arXiv preprint arXiv:1912.06858.
[6] Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3), 379-423.
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
The Latin Lexicon Dataset contains information about Latin words collected by web scraping Wiktionary. The dataset includes various linguistic features such as part of speech, lemma, aspect, tense, verb form, voice, mood, number, person, case, and gender. Additionally, it provides source URLs and links to the Wiktionary pages for further reference. The dataset aims to contribute to linguistic research and analysis of Latin language elements.
This dataset is available in three versions, each offering varying levels of refinement:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The LScDC (Leicester Scientific Dictionary-Core)
April 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com). Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.
[Version 3] The third version of the LScDC (Leicester Scientific Dictionary-Core) is formed from the updated LScD (Leicester Scientific Dictionary) - Version 3*. All steps applied to build the new version of the core dictionary are the same as in Version 2** and can be found in the description of Version 2 below; we do not repeat the explanation. The files provided with this description are also the same as described for LScDC Version 2. The numbers of words in the 3rd versions of LScD and LScDC are summarised below.
# of words
LScD (v3): 972,060
LScDC (v3): 103,998
* Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v3
** Suzen, Neslihan (2019): LScDC (Leicester Scientific Dictionary-Core). figshare. Dataset. https://doi.org/10.25392/leicester.data.9896579.v2
[Version 2] Getting Started
This file describes a sorted and cleaned list of words from the LScD (Leicester Scientific Dictionary), explains the steps for sub-setting the LScD, and gives basic statistics of words in the LSC (Leicester Scientific Corpus); see [1, 2]. The LScDC (Leicester Scientific Dictionary-Core) is a list of words ordered by the number of documents containing them, available in the published CSV file. There are 104,223 unique words (lemmas) in the LScDC. This dictionary was created for future work on the quantification of the sense of research texts. The objective of sub-setting the LScD is to discard words which appear too rarely in the corpus: in text mining algorithms, the use of an enormous number of words challenges both the performance and the accuracy of data mining applications.
The performance and accuracy of models depend heavily on the type of words (such as stop words and content words) and the number of words in the corpus. Rarely occurring words are not useful for discriminating texts in large corpora, as rare words are likely to be non-informative signals (noise) and redundant in the collection of texts. Selecting relevant words also holds out the possibility of more effective and faster operation of text mining algorithms. To build the LScDC, we applied the following process to the LScD: removing words that appear in no more than 10 documents.
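The document-frequency filter just described can be sketched as follows. This is an illustration of the stated rule, not the authors' code; `core_dictionary` and its arguments are names of our own.

```python
from collections import Counter

def core_dictionary(docs, min_df=11):
    """Keep words that appear in at least `min_df` documents. With the
    default of 11 this discards words occurring in no more than 10
    documents, the rule stated for the LScDC.

    docs: a list of word lists, one per document."""
    df = Counter()
    for words in docs:
        df.update(set(words))      # count each word once per document
    return sorted(w for w, n in df.items() if n >= min_df)
```

Counting over `set(words)` makes this a document-frequency threshold rather than a raw term-frequency one: a word repeated many times inside a single rare document is still discarded.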
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
On 28 November 2018, amendments to Section 32(3) of the Latvian Labour Law entered into force, obliging employers to indicate the wage in job advertisements. This database continues the wage monitoring started in 2019 and presents observations for 2021. 2019 was the first year in Latvia in which, based on job advertisement analysis, it became possible to draw conclusions about salaries by occupation and about salary growth. Advertisement analysis is an operational indicator compared with official statistics. This dataset represents a collection of job advertisements from the largest Latvian job advertisement site, cv.lv. Data were collected weekly in Q1-Q2 2021, at around 1,700 advertisements per week. After collection, the dataset was cleared of advertisements in which it was not possible to identify the occupation. After cleaning, the dataset consists of 41,138 advertisements. Data for the first salary monitoring year (2020) are available here: Skribans, Valerijs (2021), “Job advertisement and salary monitoring dataset for Latvia in 2020”, Mendeley Data, V1, doi: 10.17632/f3s8h6dzzf.1
https://www.mordorintelligence.com/privacy-policy
The report covers Global Data Processing and Hosting Services Companies. The market is segmented by organisation (large enterprise, small & medium enterprise), offering (data processing services (data entry services, data mining services, data cleansing and formatting, and data scanning and indexing) and hosting services (web hosting, cloud hosting, shared (reseller) hosting, virtual private server (VPS) hosting, WordPress hosting, and application hosting)), end-user industry (IT & telecommunication, BFSI, retail, and other end-user industries), and geography (North America, Europe, Asia Pacific, Latin America, and the Middle East and Africa). The market sizes and forecasts are in terms of value (USD billion) for all the above segments.
Enterprise Data Warehouse Market Size 2024-2028
The enterprise data warehouse market size is forecast to increase by USD 39.24 billion, at a CAGR of 30.08% between 2023 and 2028. The market is experiencing significant growth due to the data explosion across various industries. With the increasing volume, velocity, and variety of data, businesses are investing heavily in EDW solutions and data warehousing to gain insights and make informed decisions. A key growth driver is the spotlight on innovative solution launches, designed with cutting-edge features and functionalities to keep pace with the ever-evolving demands of modern businesses.
However, concerns related to data security continue to pose a challenge in the market. With the increasing amount of sensitive data being stored in EDWs, ensuring its security has become a top priority for organizations. Despite these challenges, the market is expected to grow at a strong pace, driven by the need for efficient data management and analysis.
What will be the Size of the Enterprise Data Warehouse Market During the Forecast Period?
To learn more about the EDW market report, Request Free Sample
An enterprise data warehouse (EDW) is a centralized, large-scale database designed to collect, store, and manage an organization's valuable business information from multiple sources. The EDW acts as the 'brain' of an organization, processing and integrating data from various physical recordings, flat files, and real-time data sources. Data engineering plays a crucial role in the EDW, responsible for data ingestion, cleaning, and digital transformation. Business units across the organization rely on Business Intelligence (BI) tools like Tableau, PowerBI, Qlik, and data visualization tools to extract insights from the EDW. The EDW is a collection of databases, including Teradata, Netezza, Exadata, Amazon Redshift, and Google BigQuery, which serve as the backbone for data-driven decision-making.
Moreover, the cloud has significantly impacted the EDW market, enabling cost-effective and scalable solutions for businesses of all sizes. BI tools and data visualization tools enable departments to access and analyze data, improving operational efficiency and driving innovation. The EDW market continues to grow, with organizations recognizing the importance of a centralized, integrated data platform for managing their valuable assets.
Enterprise Data Warehouse Market Segmentation
The enterprise data warehouse market research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD Billion' for the period 2024-2028, as well as historical data from 2018 - 2022 for the following segments.
Product Type
Information and analytical processing
Data mining
Deployment
Cloud based
On-premises
Geography
North America
US
Europe
Germany
UK
APAC
China
India
Middle East and Africa
South America
By Product Type
The information and analytical processing segment is estimated to witness significant growth during the forecast period. The market is witnessing significant growth due to the increasing data requirements of various industries such as IT, BFSI, education, healthcare, and retail. The primary function of an EDW system is to extract, transform, and load data from source systems into a central repository for data integration and analysis. This process enables businesses to gain timely insights and make informed decisions based on historical data and real-time analytics. EDW systems are designed to be scalable to cater to the data processing needs of the largest organizations. The use of Extract, Transform, Load (ETL) and Extract, Load, Transform (ELT) processes in data warehousing has become a popular trend to address processing bottlenecks and ensure Service Level Agreements (SLAs) are met.
Furthermore, business users increasingly rely on these systems for business intelligence and data analytics. Big Data technologies like Hadoop MapReduce and Apache Spark are being integrated with ETL tools to enable the processing of large volumes of data. Precisely, a pioneer in data integration, offers solutions that cater to the needs of various business teams and departments. Data visualization and warehousing tools like Tableau, PowerBI, Qlik, Teradata, Netezza, Exadata, Amazon Redshift, Google BigQuery, Snowflake, and data virtualization are being used to gain insights from the data in the EDW. The history of transactions and multiple users accessing the data make the need for data warehousing more critical than ever.
Get a glance at the market share of various segments. Request Free Sample
The information and analytical processing segment was valued at USD 3.65 billion in 2018 and showed a gradual increase during the forecast period.
Regional Insights
APAC is estimated to contribute 32% to the growth.
Purpose:The Integrated Support Environment (ISE) Laboratory serves the fleet, in-service engineers, logisticians and program management offices by automatically and periodically providing key decision makers with the big picture tools and actionable metrics needed for informed decision making within the realm of Support Equipment (SE) and Aircraft Launch and Recovery Equipment (ALRE) system improvements.Function:The ISE Laboratory at the Naval Air Warfare Center Aircraft Division, Lakehurst, NJ correlates cross-competency data to provide meaningful metrics. The lab provides a distributed data system that achieves the lab's mission of providing actionable metrics by combining multiple data sources and leveraging automated data feeds for near real-time situational awareness across all phases of a program including design, development, test and operational deployment all within a single system interface.Capabilities:The ISE Lab utilizes corporate toolsets to provide business intelligence to Naval Aviation Enterprise (NAE) leadership. The ISE Lab provides pertinent metrics to the fleet, engineers, logisticians and program management users on demand. The lab also utilizes specialized software to provide a thorough analysis of the data being collected, which allows for data mining, data cleansing, processing and modeling to identify and visualize trends. Moreover, the lab has defined and implemented streamlined processes for collecting data, performing data mining techniques and providing pertinent data metrics, via reports or dashboards, to decision makers.
About the MNAD Dataset The MNAD corpus is a collection of over 1 million Moroccan news articles written in modern Arabic language. These news articles have been gathered from 11 prominent electronic news sources. The dataset is made available to the academic community for research purposes, such as data mining (clustering, classification, etc.), information retrieval (ranking, search, etc.), and other non-commercial activities.
Dataset Fields
Title: The title of the article
Body: The body of the article
Category: The category of the article
Source: The electronic newspaper source of the article
About Version 1 of the Dataset (MNAD.v1) Version 1 of the dataset comprises 418,563 articles classified into 19 categories. The data was collected from well-known electronic news sources, namely Akhbarona.ma, Hespress.ma, Hibapress.com, and Le360.com. The articles were stored in four separate CSV files, each corresponding to the news website source. Each CSV file contains three fields: Title, Body, and Category of the news article.
The dataset is rich in Arabic vocabulary, with approximately 906,125 unique words. It has been utilized as a benchmark in the research paper: "A Moroccan News Articles Dataset (MNAD) For Arabic Text Categorization". In 2021 International Conference on Decision Aid Sciences and Application (DASA).
This dataset is available for download from the following sources: - Kaggle Datasets : MNADv1 - Huggingface Datasets: MNADv1
About Version 2 of the Dataset (MNAD.v2) Version 2 of the MNAD dataset includes an additional 653,901 articles, bringing the total number of articles to over 1 million (1,069,489), classified into the same 19 categories as in version 1. The new documents were collected from seven additional prominent Moroccan news websites, namely al3omk.com, medi1news.com, alayam24.com, anfaspress.com, alyaoum24.com, barlamane.com, and SnrtNews.com.
The newly collected articles have been merged with the articles from the previous version into a single CSV file named MNADv2.csv. This file includes an additional column called "Source" to indicate the source of each news article.
Furthermore, MNAD.v2 incorporates improved pre-processing techniques and data cleaning methods. These enhancements involve removing duplicates, eliminating multiple spaces, discarding rows with NaN values, replacing new lines with " ", excluding very long and very short articles, and removing non-Arabic articles. These additions and improvements aim to enhance the usability and value of the MNAD dataset for researchers and practitioners in the field of Arabic text analysis.
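The cleaning steps listed for MNAD.v2 can be sketched in pandas. This is a hedged illustration, not the dataset authors' pipeline: the column names follow the fields described above, and the short/long thresholds are illustrative since the description does not state exact cut-offs.

```python
import pandas as pd

def clean_mnad(df, min_words=20, max_words=2000):
    """Sketch of the described cleaning: drop duplicates and NaN rows,
    collapse newlines/multiple spaces, drop very short or very long
    articles, and keep only Arabic articles. Thresholds are illustrative."""
    df = df.drop_duplicates().dropna(subset=["Title", "Body"]).copy()
    # replace new lines with " " and collapse runs of whitespace
    df["Body"] = df["Body"].str.replace(r"\s+", " ", regex=True).str.strip()
    # exclude very short and very long articles
    n_words = df["Body"].str.split().str.len()
    df = df[(n_words >= min_words) & (n_words <= max_words)]
    # remove non-Arabic articles: keep bodies containing Arabic characters
    df = df[df["Body"].str.contains(r"[\u0600-\u06FF]", regex=True)]
    return df.reset_index(drop=True)
```

In practice one would tune the length thresholds to the corpus and perhaps require a minimum proportion of Arabic characters rather than a single occurrence.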
This dataset is available for download from the following sources: - Kaggle Datasets : MNADv2 - Huggingface Datasets: MNADv2
Citation If you use our data, please cite the following paper:
@inproceedings{MNAD2021,
  author    = {Mourad Jbene and Smail Tigani and Rachid Saadane and Abdellah Chehri},
  title     = {A Moroccan News Articles Dataset ({MNAD}) For Arabic Text Categorization},
  year      = {2021},
  publisher = {{IEEE}},
  booktitle = {2021 International Conference on Decision Aid Sciences and Application ({DASA})},
  doi       = {10.1109/dasa53625.2021.9682402},
  url       = {https://doi.org/10.1109/dasa53625.2021.9682402},
}
Attribution-NoDerivs 4.0 (CC BY-ND 4.0): https://creativecommons.org/licenses/by-nd/4.0/
License information was derived automatically
Integrated Geodatabase: The Global Catholic Footprint of Healthcare and Welfare
Burhans, Molly A., Mrowczynski, Jon M., Schweigel, Tayler C., Burhans, Debra T., and Wacta, Christine. The Catholic Footprint of Care Around the World (1). GoodLands and GHR Foundation, 2019.
Catholic Statistics Numbers: Annuarium Statisticum Ecclesiae – Statistical Yearbook of the Church: 1980–2018. Libreria Editrice Vaticana.
Historical Country Boundary Geodatabase: Weidmann, Nils B., Doreen Kuse, and Kristian Skrede Gleditsch. The Geography of the International System: The CShapes Dataset. International Interactions 36 (1). 2010. https://www.tandfonline.com/doi/full/10.1080/03050620903554614
GoodLands created a significant new data set for GHR and the UISG of important Church information regarding orphanages and sisters around the world, as well as healthcare, welfare, and other child care institutions. The data were extracted from the gold standard of Church data, the Annuarium Statisticum Ecclesiae, published yearly by the Vatican. It is inevitable that raw data sources will contain errors. GoodLands and its partners are not responsible for misinformation within Vatican documents. We encourage error reporting to us at data@good-lands.org or directly to the Vatican.
GoodLands worked with the GHR Foundation to map Catholic healthcare and welfare around the world using data mined from the Annuarium Statisticum Ecclesiae. GHR supported the data development and GoodLands independently invested in the mapping of information.
The workflows and data models developed for this project can be used to map any global, historical country-scale data in a time-series map while accounting for country boundary changes.
GoodLands created proprietary software that enables mining the Annuarium Statisticum Eccleasiea (see Software and Program Library at our home page for details).The GHR Foundation supported data extraction and cleaning of this information.GoodLands’ supported the development of maps, infographics, and applications for all healthcare data.
Attribution-NoDerivs 4.0 (CC BY-ND 4.0)https://creativecommons.org/licenses/by-nd/4.0/
License information was derived automatically
Integrated Geodatabase: The Global Catholic Footprint of Healthcare and Welfare
Burhans, Molly A., Mrowczynski, Jon M., Schweigel, Tayler C., Burhans, Debra T., and Wacta, Christine. The Catholic Footprint of Care Around the World (1). GoodLands and GHR Foundation, 2019.
WHO Statistics Numbers: Clean Care is Safe Care, Registration Update. (2017). Retrieved n.d., from https://www.who.int/gpsc/5may/registration_update/en/
Catholic Statistics Numbers: Annuarium Statisticum Ecclesiae – Statistical Yearbook of the Church: 1980–2018. Libreria Editrice Vaticana.
Historical Country Boundary Geodatabase: Weidmann, Nils B., Doreen Kuse, and Kristian Skrede Gleditsch. The Geography of the International System: The CShapes Dataset. International Interactions 36 (1). 2010. https://www.tandfonline.com/doi/full/10.1080/03050620903554614
GoodLands created a significant new data set for GHR and the UISG of important Church information regarding orphanages and sisters around the world, as well as healthcare, welfare, and other child care institutions. The data were extracted from the gold standard of Church data, the Annuarium Statisticum Ecclesiae, published yearly by the Vatican. It is inevitable that raw data sources will contain errors; GoodLands and its partners are not responsible for misinformation within Vatican documents. We encourage error reporting to us at data@good-lands.org or directly to the Vatican.
GoodLands worked with the GHR Foundation to map Catholic healthcare and welfare around the world using data mined from the Annuarium Statisticum Ecclesiae. GHR supported the data development, and GoodLands independently invested in the mapping of information. The workflows and data models developed for this project can be used to map any global, historical country-scale data in a time-series map while accounting for country boundary changes.
GoodLands created proprietary software that enables mining the Annuarium Statisticum Ecclesiae (see Software and Program Library at our home page for details). The GHR Foundation supported data extraction and cleaning of this information. GoodLands supported the development of maps, infographics, and applications for all healthcare data.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is a compilation of open asset-level data, i.e., the locations of sites such as the operating, manufacturing, and processing facilities of global supply chains, as of December 2022. It draws on 9 publicly available sources which, after data cleaning and harmonization, yielded 189,075 data points.
Data source | Number of data points |
Open Supply Hub (former Open Apparel Registry) | 96,736 |
Global Power Plant Database | 35,419 |
Climate trace | 19,945 |
FDA database | 12,898 |
Global Dam Watch | 11,017 |
EudraGMDP database | 5,181 |
Sustainable Finance Initiative GeoAsset Databases | 4,716 |
Global Tailings Portal | 1,956 |
Fine print Mining Database | 1,207 |
Each asset was assigned the industry in which it operates. The summary table below shows the number of assets by industry.
Industry | Number of assets |
Textiles, Apparel & Luxury Good Production | 96,736 |
Health Care, Pharma and Biotechnology | 18,079 |
Energy - Solar, Wind | 16,282 |
Energy - Hydropower | 14,515 |
Energy - Geothermal or Combustion | 11,724 |
Metals & Mining | 11,210 |
Transportation Services | 4,872 |
Construction Materials | 3,117 |
Agriculture (animal products) | 2,388 |
Agriculture (plant products) | 1,896 |
Oil, Gas & Consumable Fuels | 1,194 |
Water utilities / Water Service Providers | 892 |
Hospitality Services | 294 |
Fishing and aquaculture | 14 |
Other | 5,862 |
Note that this compilation is based on an extensive search; however, we acknowledge that there is a significant discrepancy in data coverage and comprehensiveness among the different industries. The industry “Textiles, Apparel & Luxury Good Production” is by far the most complete, while others are clearly far from complete, for example, “Construction Materials”, “Agriculture (animal products)”, “Agriculture (plant products)”, “Oil, Gas & Consumable Fuels”, “Water utilities / Water Service Providers”, “Hospitality Services”, and “Fishing and aquaculture”. Therefore, any comparison between industries should take this coverage bias into consideration.
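The cleaning-and-harmonization step described above (merging heterogeneous sources into one schema, dropping duplicates, and tallying assets by industry) can be sketched as follows. The record layout and field names are hypothetical, since the actual source schemas are not given here.

```python
from collections import Counter

# Tiny illustrative records from two of the nine sources; the real sources
# hold ~189k rows and use different field names (hypothetical schema here).
open_supply_hub = [
    {"name": "Factory A", "lat": 10.0, "lon": 30.0,
     "industry": "Textiles, Apparel & Luxury Good Production"},
    {"name": "Factory B", "lat": 20.0, "lon": 40.0,
     "industry": "Textiles, Apparel & Luxury Good Production"},
    {"name": "Factory B", "lat": 20.0, "lon": 40.0,  # exact duplicate
     "industry": "Textiles, Apparel & Luxury Good Production"},
]
power_plants = [
    {"name": "Plant C", "lat": 50.0, "lon": 60.0,
     "industry": "Energy - Hydropower"},
]

def harmonize(records, source):
    """Map each record to a shared schema, tag its source, drop rows
    without coordinates."""
    return [
        {"name": r["name"], "latitude": r["lat"], "longitude": r["lon"],
         "industry": r["industry"], "source": source}
        for r in records
        if r.get("lat") is not None and r.get("lon") is not None
    ]

combined = (harmonize(open_supply_hub, "Open Supply Hub")
            + harmonize(power_plants, "Global Power Plant Database"))

# Deduplicate on (name, latitude, longitude), keeping the first occurrence.
seen, assets = set(), []
for rec in combined:
    key = (rec["name"], rec["latitude"], rec["longitude"])
    if key not in seen:
        seen.add(key)
        assets.append(rec)

# Summary: number of assets by industry, as in the table above.
by_industry = Counter(rec["industry"] for rec in assets)
print(by_industry)
```

This is only a minimal sketch of the approach; the published compilation would also need per-source coordinate validation and fuzzier duplicate matching than an exact-key check.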
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Indonesia Manufacturing Industry: Production: Value: Milling and Cleaning of Cereals data was reported at 1,932.885 IDR bn in 2015. This records an increase from the previous number of 656.652 IDR bn for 2014. Indonesia Manufacturing Industry: Production: Value: Milling and Cleaning of Cereals data is updated yearly, averaging 156.820 IDR bn from Dec 1999 (Median) to 2015, with 15 observations. The data reached an all-time high of 1,932.885 IDR bn in 2015 and a record low of 27.499 IDR bn in 1999. Indonesia Manufacturing Industry: Production: Value: Milling and Cleaning of Cereals data remains in active status in CEIC and is reported by the Central Bureau of Statistics. The data is categorized under Indonesia Premium Database’s Mining and Manufacturing Sector – Table ID.BAD001: Manufacturing Industry: by Product: Value.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Brazil IPI: Mfg: Last 12 Months=100: Year to Date: Chemicals Products: Cleaning and Polishing data was reported at 106.000 Prev 12 Mths=100 in May 2019. This records an increase from the previous number of 103.200 Prev 12 Mths=100 for Apr 2019. Brazil IPI: Mfg: Last 12 Months=100: Year to Date: Chemicals Products: Cleaning and Polishing data is updated monthly, averaging 103.400 Prev 12 Mths=100 from Dec 2013 (Median) to May 2019, with 66 observations. The data reached an all-time high of 107.600 Prev 12 Mths=100 in Dec 2018 and a record low of 97.800 Prev 12 Mths=100 in May 2017. Brazil IPI: Mfg: Last 12 Months=100: Year to Date: Chemicals Products: Cleaning and Polishing data remains in active status in CEIC and is reported by the Brazilian Institute of Geography and Statistics. The data is categorized under Brazil Premium Database’s Mining and Manufacturing Sector – Table BR.BAA023: Industrial Production Index: by Industrial Groups and Classes: Last 12 Months=100: Year-to-Date.
This research was conducted in Albania from June 19 to July 31, 2002, as part of the second round of the Business Environment and Enterprise Performance Survey. The objective of the survey is to obtain feedback from enterprises on the state of the private sector as well as to help in building a panel of enterprise data that will make it possible to track changes in the business environment over time, thus allowing, for example, impact assessments of reforms. Through face-to-face interviews with firms in the manufacturing and services sectors, the survey assesses the constraints to private sector growth and creates statistically significant business environment indicators that are comparable across countries.
The survey topics include company's characteristics, information about sales and suppliers, competition, infrastructure services, judiciary and law enforcement, security, government policies and regulations, bribery, sources of financing, overall business environment, performance and investment activities, and workforce composition.
National
The primary sampling unit of the study is the establishment.
The manufacturing and services sectors are the primary business sectors of interest.
Sample survey data [ssd]
The information below is taken from "The Business Environment and Enterprise Performance Survey - 2002. A brief report on observations, experiences and methodology from the survey", prepared by MEMRB Custom Research Worldwide (now part of Synovate), the research company that implemented the BEEPS II instrument.
The general targeted distributional criteria of the sample in BEEPS II countries were to be as follows:
1) Coverage of countries: The BEEPS II instrument was to be administered to approximately 6,500 enterprises in 28 transition economies: 16 from CEE (Albania, Bosnia and Herzegovina, Bulgaria, Croatia, Czech Republic, Estonia, FR Yugoslavia, FYROM, Hungary, Latvia, Lithuania, Poland, Romania, Slovak Republic, Slovenia and Turkey) and 12 from the CIS (Armenia, Azerbaijan, Belarus, Georgia, Kazakhstan, Kyrgyzstan, Moldova, Russia, Tajikistan, Turkmenistan, Ukraine and Uzbekistan).
2) In each country, the sector composition of the total sample in terms of manufacturing versus services (including commerce) was to be determined by their relative contribution to GDP, subject to a 15% minimum for each category. Firms that operated in sectors subject to government price regulation and prudential supervision, such as banking, electric power, rail transport, and water and wastewater, were excluded.
Eligible enterprise activities were as follows (ISIC sections): Mining and quarrying (Section C: 10-14); Construction (Section F: 45); Manufacturing (Section D: 15-37); Transportation, storage and communications (Section I: 60-64); Wholesale, retail, repairs (Section G: 50-52); Real estate, business services (Section K: 70-74); Hotels and restaurants (Section H: 55); Other community, social and personal activities (Section O: selected groups).
3) Size: At least 10% of the sample was to be in the small and 10% in the large size categories. A small firm was defined as an establishment with 2-49 employees, a medium firm with 50-249 employees, and a large firm with 250-9,999 employees. Companies with only one employee or more than 10,000 employees were excluded.
4) Ownership: At least 10% of the firms were to have foreign control (more than 50% shareholding), and at least 10% were to have state control.
5) Exporters: At least 10% of the firms were to be exporters. A firm should be regarded as an exporter if it exported 20% or more of its total sales.
6) Location: At least 10% of firms were to be in the category "small city/countryside" (population under 50,000).
7) Year of establishment: Enterprises established later than 2000 were to be excluded.
The sample structure for BEEPS II was designed to be as representative (self-weighted) as possible to the population of firms within the industry and service sectors subject to the various minimum quotas for the total sample. This approach ensured that there was sufficient weight in the tails of the distribution of firms by the various relevant controlled parameters (sector, size, location and ownership).
As pertinent data on the actual population, or data that would have allowed estimating the population of foreign-owned and exporting enterprises, were not available, it was not feasible to build these two parameters into the design of the sample guidelines from the outset. The primary parameters used for the design of the sample were: total population of enterprises; ownership (private and state); size of enterprise (small, medium, and large); geographic location (capital; over 1 million; 1 million-250,000; 250,000-50,000; and under 50,000); and sub-sectors (e.g., mining, construction, wholesale, etc.).
For certain parameters where statistical information was not available, enterprise populations and distributions were estimated from other accessible demographic (e.g. human population concentrations in rural and urban areas) and socio-economic (e.g. employment levels) data.
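The minimum-quota criteria listed above lend themselves to a simple mechanical check during fieldwork. The sketch below, with hypothetical firm records and field names, flags which quota categories a drawn sample still falls short of.

```python
# Minimum shares of the sample per the BEEPS II guidelines above.
QUOTAS = {
    "small": 0.10,            # 2-49 employees
    "large": 0.10,            # 250-9,999 employees
    "foreign_control": 0.10,  # more than 50% foreign shareholding
    "state_control": 0.10,    # more than 50% state shareholding
    "exporter": 0.10,         # 20% or more of total sales exported
    "small_city": 0.10,       # location with population under 50,000
}

def classify(firm):
    """Derive the quota flags for one (hypothetical) firm record."""
    return {
        "small": 2 <= firm["employees"] <= 49,
        "large": 250 <= firm["employees"] <= 9999,
        "foreign_control": firm["foreign_share"] > 0.50,
        "state_control": firm["state_share"] > 0.50,
        "exporter": firm["export_share"] >= 0.20,
        "small_city": firm["city_population"] < 50_000,
    }

def quota_shortfalls(sample):
    """Return the criteria whose minimum share is not yet met."""
    n = len(sample)
    counts = {k: 0 for k in QUOTAS}
    for firm in sample:
        for k, hit in classify(firm).items():
            counts[k] += hit
    return [k for k, minimum in QUOTAS.items() if counts[k] / n < minimum]

sample = [
    {"employees": 30, "foreign_share": 0.0, "state_share": 0.0,
     "export_share": 0.25, "city_population": 20_000},
    {"employees": 300, "foreign_share": 0.6, "state_share": 0.0,
     "export_share": 0.0, "city_population": 2_000_000},
]
print(quota_shortfalls(sample))  # only the state-control quota is unmet
```

This is a sketch of the bookkeeping only; the actual survey also had to balance these quotas against the self-weighted sector and location distribution described above.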
The survey was discontinued in Turkmenistan due to concerns about Turkmen government interference with implementation of the study.
Face-to-face [f2f]
The following survey instruments are available: Screener and Main Questionnaires.
Data entry and first checking and validation of the results were undertaken locally. Final checking and validation of the results were made at MEMRB Custom Research Worldwide headquarters.
Overall, in all BEEPS II countries, the implementing agency contacted 18,052 enterprises and achieved an interview completion rate of 36.93%.
Respondents who either refused outright (i.e. not interested) or were unavailable to be interviewed (i.e. on holiday, etc) accounted for 38.34% of all contacts. Enterprises which were contacted but were non-eligible (i.e. business activity, year of establishment, etc) or quotas were already met (i.e. size, ownership etc) or to which “blind calls” were made to meet quotas (i.e. foreign ownership, exporters, etc) accounted for 24.73% of the total number of enterprises contacted.
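The contact accounting above can be checked arithmetically: the three outcome shares (36.93% completed, 38.34% refused or unavailable, 24.73% ineligible, over-quota, or blind calls) cover all 18,052 contacts, and the implied number of completed interviews matches the roughly 6,500-interview target mentioned earlier.

```python
# Sanity check on the reported BEEPS II contact outcomes.
contacts = 18_052
completed_pct, refused_pct, ineligible_pct = 36.93, 38.34, 24.73

# The three outcome categories should account for every contact.
assert round(completed_pct + refused_pct + ineligible_pct, 2) == 100.00

# Implied number of completed interviews across all BEEPS II countries.
completed = round(contacts * completed_pct / 100)
print(completed)  # 6667, close to the ~6,500 target
```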
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
DRAGON: Multi-Label Classification Replication Package
This archive contains the replication package for the DRAGON multi-label classification models, which leverage BERT-based architectures. The package includes scripts for repository mining, dataset creation, data processing, model training, and evaluation.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Brazil IPI: PY=100: São Paulo: Soap, Detergents, Cleaning Products, Cosmetics, Perfumery and Personal Care data was reported at 100.700 Prev Year=100 in May 2019. This records an increase from the previous number of 93.200 Prev Year=100 for Apr 2019. Brazil IPI: PY=100: São Paulo: Soap, Detergents, Cleaning Products, Cosmetics, Perfumery and Personal Care data is updated monthly, averaging 101.800 Prev Year=100 from Jan 2003 (Median) to May 2019, with 197 observations. The data reached an all-time high of 137.600 Prev Year=100 in Apr 2005 and a record low of 83.200 Prev Year=100 in May 2018. Brazil IPI: PY=100: São Paulo: Soap, Detergents, Cleaning Products, Cosmetics, Perfumery and Personal Care data remains in active status in CEIC and is reported by the Brazilian Institute of Geography and Statistics. The data is categorized under Brazil Premium Database’s Mining and Manufacturing Sector – Table BR.BAB044: Industrial Production Index: Previous Year=100: by State: São Paulo.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about book subjects and is filtered where the book is Exploratory Data Mining and Data Cleaning. It has 4 columns: book subject, authors, books, and publication dates. The data is ordered by earliest publication date (descending).