Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about book subjects and is filtered to the book Exploratory Data Mining and Data Cleaning. It has 4 columns: book subject, authors, books, and publication dates. The data are ordered by earliest publication date (descending).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The LSC (Leicester Scientific Corpus)
April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk). Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes. The data are extracted from the Web of Science [1]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics. [Version 2] Further cleaning was applied in Data Processing for LSC Abstracts in Version 1*. Details of the cleaning procedure are explained in Step 6. * Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v1
Getting Started
This text provides information on the LSC (Leicester Scientific Corpus) and the pre-processing steps applied to abstracts, and describes the structure of the files that organise the corpus. The corpus was created for future work on the quantification of the meaning of research texts and is made available for use in Natural Language Processing projects. LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [1]. The corpus contains only documents in English. Each document in the corpus contains the following parts:
1. Authors: The list of authors of the paper
2. Title: The title of the paper
3. Abstract: The abstract of the paper
4. Categories: One or more categories from the list of categories [2]. The full list of categories is presented in the file ‘List_of_Categories.txt’.
5. Research Areas: One or more research areas from the list of research areas [3]. The full list of research areas is presented in the file ‘List_of_Research_Areas.txt’.
6. Total Times Cited: The number of times the paper was cited by other items from all databases within the Web of Science platform [4]
7. Times Cited in Core Collection: The total number of times the paper was cited by other papers within the WoS Core Collection [4]
The corpus was collected online in July 2018 and contains the number of citations from publication date to July 2018. We describe a document as the collection of information (about a paper) listed above. The total number of documents in the LSC is 1,673,350.
Data Processing
Step 1: Downloading the Data Online
The dataset was collected manually by exporting documents as tab-delimited files online. All documents are available online.
Step 2: Importing the Dataset to R
The LSC was collected as TXT files, and all documents were imported into R.
Step 3: Cleaning the Data of Documents with an Empty Abstract or without a Category
As our research is based on the analysis of abstracts and categories, all documents with empty abstracts and all documents without categories were removed.
Step 4: Identification and Correction of Concatenated Words in Abstracts
Medicine-related publications in particular use ‘structured abstracts’. Such abstracts are divided into sections with distinct headings such as introduction, aim, objective, method, result, conclusion, etc. The tool used for extracting abstracts concatenates the section headings with the first word of the section; for instance, we observe words such as ConclusionHigher and ConclusionsRT. Such concatenated words were detected and identified by sampling medicine-related publications with human intervention, and each is split into two words. For instance, the word ‘ConclusionHigher’ is split into ‘Conclusion’ and ‘Higher’. The section headings in such abstracts are listed below:
Background Method(s) Design Theoretical Measurement(s) Location Aim(s) Methodology Process Abstract Population Approach Objective(s) Purpose(s) Subject(s) Introduction Implication(s) Patient(s) Procedure(s) Hypothesis Measure(s) Setting(s) Limitation(s) Discussion Conclusion(s) Result(s) Finding(s) Material(s) Rationale(s) Implications for health and nursing policy
Step 5: Extracting (Sub-setting) the Data Based on Lengths of Abstracts
After correction, the lengths of abstracts were calculated. ‘Length’ is the total number of words in the text, calculated by the same rule as the Microsoft Word ‘word count’ [5]. According to the APA style manual [6], an abstract should contain between 150 and 250 words. In the LSC, we limited the length of abstracts to between 30 and 500 words in order to study documents with abstracts of typical length and to avoid the effect of length on the analysis.
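Steps 4 and 5 can be sketched as follows. This is a minimal illustration, not the project's R code: we use Python, only a few of the headings above are included, and the function names are our own. The 30-500 word limits come from the description above, and whitespace-delimited token counting is used as an approximation of the Microsoft Word rule.

```python
import re

# A few of the section headings listed above; the full list would expand
# each "(s)" variant to both singular and plural forms.
HEADINGS = ["Conclusions", "Conclusion", "Results", "Result",
            "Methods", "Method", "Background", "Objectives", "Objective"]

def split_concatenated_headings(text):
    """Step 4: split heading/first-word concatenations such as
    'ConclusionHigher' into 'Conclusion Higher'."""
    for h in sorted(HEADINGS, key=len, reverse=True):
        # a known heading immediately followed by an uppercase letter
        text = re.sub(rf"\b{h}(?=[A-Z])", h + " ", text)
    return text

def within_length_limits(abstract, lo=30, hi=500):
    """Step 5: keep abstracts of 30-500 words; whitespace-delimited
    tokens approximate the MS Word word count."""
    return lo <= len(abstract.split()) <= hi
```

Matching the longest headings first prevents, for example, 'Conclusions' from being split as 'Conclusion' plus a stray 's'.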
Step 6: [Version 2] Cleaning Copyright Notices, Permission Policies, Journal Names and Conference Names from LSC Abstracts in Version 1
Conferences and journals can place a footer below the text of an abstract containing a copyright notice, permission policy, journal name, licence, authors' rights or conference name. The tool used for extracting and processing abstracts from the WoS database attaches such footers to the text; for example, casual observation shows that copyright notices such as ‘Published by Elsevier Ltd.’ appear in many texts. To avoid abnormal appearances of words in further analysis, such as bias in frequency calculations, we performed a cleaning procedure on such sentences and phrases in the abstracts of LSC Version 1. We removed copyright notices, names of conferences, names of journals, authors' rights, licences and permission policies identified by sampling of abstracts.
Step 7: [Version 2] Re-extracting (Sub-setting) the Data Based on Lengths of Abstracts
The cleaning procedure described in the previous step led to some abstracts falling below our minimum length criterion (30 words); 474 texts were removed.
Step 8: Saving the Dataset into CSV Format
Documents are saved into 34 CSV files. In the CSV files, the information is organised with one record per line; the abstract, title, list of authors, list of categories, list of research areas, and times cited are recorded in fields. To access the LSC for research purposes, please email ns433@le.ac.uk.
References
[1] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
[2] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html
[3] Research Areas in WoS. Available: https://images.webofknowledge.com/images/help/WOS/hp_research_areas_easca.html
[4] Times Cited in WoS Core Collection. (15 July).
Available: https://support.clarivate.com/ScientificandAcademicResearch/s/article/Web-of-Science-Times-Cited-accessibility-and-variation?language=en_US
[5] Word Count. Available: https://support.office.com/en-us/article/show-word-count-3c9e6a11-a04d-43b4-977c-563a0e0d5da3
[6] American Psychological Association, Publication Manual. Washington, DC: American Psychological Association, 1983.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LScDC Word-Category RIG Matrix
April 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com). Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.
Getting Started
This file describes the Word-Category RIG Matrix for the Leicester Scientific Corpus (LSC) [1] and the procedure used to build the matrix, and introduces the Leicester Scientific Thesaurus (LScT) together with its construction process. The Word-Category RIG Matrix is a 103,998 by 252 matrix, where rows correspond to the words of the Leicester Scientific Dictionary-Core (LScDC) [2] and columns correspond to 252 Web of Science (WoS) categories [3, 4, 5]. Each entry in the matrix corresponds to a pair (category, word); its value is the Relative Information Gain (RIG) on the belonging of a text from the LSC to the category from observing the word in this text. The CSV file of the Word-Category RIG Matrix in the published archive includes two additional columns: the sum of RIGs over categories and the maximum of RIGs over categories (the last two columns of the matrix). The file ‘Word-Category RIG Matrix.csv’ therefore contains a total of 254 columns. This matrix was created for future research on quantifying meaning in scientific texts, under the assumption that words have scientifically specific meanings in subject categories and that this meaning can be estimated by the information gains from the word to the categories. The LScT (Leicester Scientific Thesaurus) is a scientific thesaurus of English comprising a list of 5,000 words from the LScDC. We order the words of the LScDC by the sum of their RIGs over categories; that is, words are arranged by their informativeness in the scientific corpus LSC, so the meaningfulness of a word is evaluated by its average informativeness over the categories. We decided to include the 5,000 most informative words in the scientific thesaurus.
Words as a Vector of Frequencies in WoS Categories
Each word of the LScDC is represented as a vector of frequencies in WoS categories. Given the collection of LSC texts, each entry of the vector is the number of texts in the corresponding category that contain the word. It is noteworthy that texts in a corpus do not necessarily belong to a single category: they may correspond to multidisciplinary studies, especially in a corpus of scientific texts. In other words, categories need not be exclusive. There are 252 WoS categories, and a text can be assigned to at least 1 and at most 6 categories in the LSC. Frequencies are calculated in a binary fashion at the text level: we record only the presence or absence of a word in a text. We create a vector of frequencies for each word, whose dimensions are the categories in the corpus. The collection of vectors, over all words and categories in the entire corpus, can be shown as a table in which each entry corresponds to a pair (word, category). This table is built for the LScDC with 252 WoS categories and presented in the published archive with this file. The value of each entry shows how many times a word of the LScDC appears in a WoS category, where the occurrence of a word in a category is determined by counting the number of LSC texts in the category that contain the word.
Words as a Vector of Relative Information Gains Extracted for Categories
In this section, we introduce our approach to representing a word as a vector of relative information gains for categories, under the assumption that the meaning of a word can be quantified by the information it gains for categories. For each category, a function is defined on texts that takes the value 1 if the text belongs to the category, and 0 otherwise. For each word, a function is defined on texts that takes the value 1 if the word belongs to the text, and 0 otherwise. Consider the LSC as a probabilistic sample space (the space of equally probable elementary outcomes).
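The binary text-level counting described above can be sketched in Python. This is a small illustration, not the authors' code; the function and argument names are our own.

```python
def category_frequency_vector(word, texts, text_categories, categories):
    """Binary text-level counting: the component for category c is the
    number of texts that contain `word` and are assigned to c.

    texts: one set of words per text; text_categories: one set of WoS
    categories (1 to 6 of them) per text, aligned with `texts`."""
    counts = dict.fromkeys(categories, 0)
    for words, cats in zip(texts, text_categories):
        if word in words:           # presence only, not term frequency
            for c in cats:
                counts[c] += 1
    return [counts[c] for c in categories]
```

Because a text can carry up to six categories, a single occurrence of a word may increment several components of its vector, which is why the categories are not exclusive.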
For these Boolean random variables, the joint probability distribution, the entropy and the information gains are defined. The information gain about the category from the word is the amount of information on the belonging of a text from the LSC to the category obtained from observing the word in the text [6]. We used the Relative Information Gain (RIG), which provides a normalised measure of the information gain and thereby allows information gains to be compared across categories. The calculations of the entropy, Information Gain and Relative Information Gain can be found in the README file in the published archive. Given a word, we created a vector in which each component corresponds to a category; each word is thus represented as a vector of relative information gains, and the dimension of the vector is the number of categories. The set of vectors forms the Word-Category RIG Matrix, in which each column corresponds to a category, each row corresponds to a word, and each component is the relative information gain from the word to the category. In the Word-Category RIG Matrix, a row vector represents the corresponding word as a vector of RIGs in categories, while a column vector represents the RIGs of all words in an individual category. For any chosen category, words can be ordered by their RIGs from the most informative to the least informative for that category. Beyond ordering words within each category, words can also be ordered by two global criteria: the sum and the maximum of their RIGs over categories; the top n words in such a list can be considered the most informative words in scientific texts. For a given word, the sum and maximum of RIGs are calculated from the Word-Category RIG Matrix. RIGs for each word of the LScDC in the 252 categories are calculated and the vectors of words are formed; we then form the Word-Category RIG Matrix for the LSC.
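For the Boolean indicators just defined, the RIG of a category from a word can be sketched as follows. This is a minimal illustration under the definitions above, not the code from the published archive; we assume base-2 entropy, and the function names are our own.

```python
import math

def entropy(p):
    """Shannon entropy (bits) of a Bernoulli variable with P(1) = p."""
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

def relative_information_gain(word_in_text, text_in_cat):
    """RIG of a category from a word, treating the corpus as a space of
    equally probable texts. Inputs are parallel 0/1 lists over all texts.

    Returns IG(category; word) / H(category), i.e. the information gain
    normalised by the category entropy."""
    n = len(word_in_text)
    h_c = entropy(sum(text_in_cat) / n)
    if h_c == 0:
        return 0.0
    # conditional entropy H(category | word)
    h_c_given_w = 0.0
    for v in (1, 0):
        idx = [i for i in range(n) if word_in_text[i] == v]
        if idx:
            p = sum(text_in_cat[i] for i in idx) / len(idx)
            h_c_given_w += (len(idx) / n) * entropy(p)
    return (h_c - h_c_given_w) / h_c
```

A word that perfectly predicts the category gives RIG = 1, and a word independent of the category gives RIG = 0, which is what makes RIGs comparable across categories of different sizes.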
For each word, the sum (S) and maximum (M) of RIGs over categories are calculated and appended to the matrix (its last two columns). The Word-Category RIG Matrix for the LScDC with 252 categories, together with the sum and maximum of RIGs, can be found in the database.
Leicester Scientific Thesaurus (LScT)
The Leicester Scientific Thesaurus (LScT) is a list of 5,000 words from the LScDC [2]. Words of the LScDC are sorted in descending order by the sum (S) of RIGs over categories, and the top 5,000 words are selected for inclusion in the LScT. We consider these 5,000 words to be the most meaningful words in the scientific corpus: the meaningfulness of a word is evaluated by its average informativeness over the categories, and the list of these words is considered a ‘thesaurus’ for science. The LScT, with the sum values, is available as a CSV file in the published archive. The published archive contains the following files:
1) Word_Category_RIG_Matrix.csv: A 103,998 by 254 matrix whose columns are the 252 WoS categories plus the sum (S) and the maximum (M) of RIGs over categories (the last two columns), and whose rows are the words of the LScDC. Each entry in the first 252 columns is the RIG from the word to the category. Words are ordered as in the LScDC.
2) Word_Category_Frequency_Matrix.csv: A 103,998 by 252 matrix whose columns are the 252 WoS categories and whose rows are the words of the LScDC. Each entry is the number of texts in the corresponding category that contain the word. Words are ordered as in the LScDC.
3) LScT.csv: The list of words of the LScT with their sum (S) values.
4) Text_No_in_Cat.csv: The number of texts in each category.
5) Categories_in_Documents.csv: The list of WoS categories for each document of the LSC.
6) README.txt: Description of the Word-Category RIG Matrix, the Word-Category Frequency Matrix and the LScT, and the procedures used to form them.
7) README.pdf: The same as 6), in PDF format.
References
[1] Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset.
https://doi.org/10.25392/leicester.data.9449639.v2
[2] Suzen, Neslihan (2019): LScDC (Leicester Scientific Dictionary-Core). figshare. Dataset. https://doi.org/10.25392/leicester.data.9896579.v3
[3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
[4] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html
[5] Suzen, N., Mirkes, E. M., & Gorban, A. N. (2019). LScDC - new large scientific dictionary. arXiv preprint arXiv:1912.06858.
[6] Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3), 379-423.
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
The Latin Lexicon Dataset contains information about Latin words collected by web scraping Wiktionary. The dataset includes various linguistic features such as part of speech, lemma, aspect, tense, verb form, voice, mood, number, person, case, and gender. Additionally, it provides source URLs and links to the Wiktionary pages for further reference. The dataset aims to contribute to linguistic research and analysis of Latin language elements.
This dataset is available in three versions, each offering varying levels of refinement:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The LScDC (Leicester Scientific Dictionary-Core)
April 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com). Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.
[Version 3] The third version of the LScDC (Leicester Scientific Dictionary-Core) is formed from the updated LScD (Leicester Scientific Dictionary) - Version 3*. All steps applied to build the new version of the core dictionary are the same as in Version 2** and can be found in the description of Version 2 below; we do not repeat the explanation. The files provided with this description are also the same as described for LScDC Version 2. The numbers of words in the 3rd versions of LScD and LScDC are summarised below.
# of words
LScD (v3): 972,060
LScDC (v3): 103,998
* Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v3
** Suzen, Neslihan (2019): LScDC (Leicester Scientific Dictionary-Core). figshare. Dataset. https://doi.org/10.25392/leicester.data.9896579.v2
[Version 2] Getting Started
This file describes a sorted and cleaned list of words from the LScD (Leicester Scientific Dictionary), explains the steps for sub-setting the LScD, and gives basic statistics of words in the LSC (Leicester Scientific Corpus); see [1, 2]. The LScDC (Leicester Scientific Dictionary-Core) is a list of words ordered by the number of documents containing them, available in the published CSV file. There are 104,223 unique words (lemmas) in the LScDC. This dictionary was created for future work on the quantification of the sense of research texts. The objective of sub-setting the LScD is to discard words which appear too rarely in the corpus: in text mining algorithms, the use of an enormous number of words challenges both the performance and the accuracy of data mining applications.
The performance and accuracy of models depend heavily on the type of words (such as stop words and content words) and the number of words in the corpus. Rarely occurring words are not useful for discriminating texts in large corpora, as rare words are likely to be non-informative signals (noise) and redundant in the collection of texts. Selecting relevant words also holds out the possibility of more effective and faster operation of text mining algorithms. To build the LScDC, we applied the following process to the LScD: removing words that appear in no more than 10 documents.
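The document-frequency filter just described can be sketched as follows. This is an illustration of the stated rule, not the authors' code; `core_dictionary` and its arguments are names of our own.

```python
from collections import Counter

def core_dictionary(docs, min_df=11):
    """Keep words that appear in at least `min_df` documents. With the
    default of 11 this discards words occurring in no more than 10
    documents, the rule stated for the LScDC.

    docs: a list of word lists, one per document."""
    df = Counter()
    for words in docs:
        df.update(set(words))      # count each word once per document
    return sorted(w for w, n in df.items() if n >= min_df)
```

Counting over `set(words)` makes this a document-frequency threshold rather than a raw term-frequency one: a word repeated many times inside a single rare document is still discarded.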
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
On 28 November 2018, amendments to Section 32(3) of the Latvian Labour Law entered into force, obliging employers to indicate the wage in job advertisements. This database continues the wage monitoring started in 2019 and presents observations for 2021. 2019 was the first year in Latvia in which, based on job advertisement analysis, it became possible to draw conclusions about salaries by occupation and about salary growth. Advertisement analysis is an operational indicator compared with official statistics. This dataset represents a collection of job advertisements from the largest Latvian job advertisement site, cv.lv. Data were collected weekly in Q1-Q2 2021, at around 1,700 advertisements per week. After collection, the dataset was cleared of advertisements in which it was not possible to identify the occupation. After cleaning, the dataset consists of 41,138 advertisements. Data for the first salary monitoring year (2020) are available here: Skribans, Valerijs (2021), “Job advertisement and salary monitoring dataset for Latvia in 2020”, Mendeley Data, V1, doi: 10.17632/f3s8h6dzzf.1
https://www.mordorintelligence.com/privacy-policy
The report covers Global Data Processing and Hosting Services Companies. The market is segmented by organisation (large enterprise, small & medium enterprise), offering (data processing services (data entry services, data mining services, data cleansing and formatting, and data scanning and indexing) and hosting services (web hosting, cloud hosting, shared (reseller) hosting, virtual private server (VPS) hosting, WordPress hosting, and application hosting)), end-user industry (IT & telecommunication, BFSI, retail, and other end-user industries), and geography (North America, Europe, Asia Pacific, Latin America, and the Middle East and Africa). The market sizes and forecasts are in terms of value (USD billion) for all the above segments.
Enterprise Data Warehouse Market Size 2024-2028
The enterprise data warehouse market size is forecast to increase by USD 39.24 billion, at a CAGR of 30.08% between 2023 and 2028. The market is experiencing significant growth due to the data explosion across various industries. With the increasing volume, velocity, and variety of data, businesses are investing heavily in EDW solutions and data warehousing to gain insights and make informed decisions. A key growth driver is the spotlight on innovative solution launches, designed with cutting-edge features and functionalities to keep pace with the ever-evolving demands of modern businesses.
However, concerns related to data security continue to pose a challenge in the market. With the increasing amount of sensitive data being stored in EDWs, ensuring its security has become a top priority for organizations. Despite these challenges, the market is expected to grow at a strong pace, driven by the need for efficient data management and analysis.
What will be the Size of the Enterprise Data Warehouse Market During the Forecast Period?
To learn more about the EDW market report, Request Free Sample
An enterprise data warehouse (EDW) is a centralized, large-scale database designed to collect, store, and manage an organization's valuable business information from multiple sources. The EDW acts as the 'brain' of an organization, processing and integrating data from various physical recordings, flat files, and real-time data sources. Data engineering plays a crucial role in the EDW, responsible for data ingestion, cleaning, and digital transformation. Business units across the organization rely on Business Intelligence (BI) tools like Tableau, PowerBI, Qlik, and data visualization tools to extract insights from the EDW. The EDW is a collection of databases, including Teradata, Netezza, Exadata, Amazon Redshift, and Google BigQuery, which serve as the backbone for data-driven decision-making.
Moreover, the cloud has significantly impacted the EDW market, enabling cost-effective and scalable solutions for businesses of all sizes. BI tools and data visualization tools enable departments to access and analyze data, improving operational efficiency and driving innovation. The EDW market continues to grow, with organizations recognizing the importance of a centralized, integrated data platform for managing their valuable assets.
Enterprise Data Warehouse Market Segmentation
The enterprise data warehouse market research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD Billion' for the period 2024-2028, as well as historical data from 2018 - 2022 for the following segments.
Product Type
Information and analytical processing
Data mining
Deployment
Cloud based
On-premises
Geography
North America
US
Europe
Germany
UK
APAC
China
India
Middle East and Africa
South America
By Product Type
The information and analytical processing segment is estimated to witness significant growth during the forecast period. The market is witnessing significant growth due to the increasing data requirements of various industries such as IT, BFSI, education, healthcare, and retail. The primary function of an EDW system is to extract, transform, and load data from source systems into a central repository for data integration and analysis. This process enables businesses to gain timely insights and make informed decisions based on historical data and real-time analytics. EDW systems are designed to be scalable to cater to the data processing needs of the largest organizations. The use of Extract, Transform, Load (ETL) and Extract, Load, Transform (ELT) processes in data warehousing has become a popular trend to address processing bottlenecks and ensure Service Level Agreements (SLAs) are met.
Furthermore, business users increasingly rely on these systems for business intelligence and data analytics. Big Data technologies like Hadoop MapReduce and Apache Spark are being integrated with ETL tools to enable the processing of large volumes of data. Precisely, a pioneer in data integration, offers solutions that cater to the needs of various business teams and departments. Data visualization and warehousing tools like Tableau, PowerBI, Qlik, Teradata, Netezza, Exadata, Amazon Redshift, Google BigQuery, Snowflake, and data virtualization are being used to gain insights from the data in the EDW. The history of transactions and multiple users accessing the data make the need for data warehousing more critical than ever.
Get a glance at the market share of various segments. Request Free Sample
The information and analytical processing segment was valued at USD 3.65 billion in 2018 and showed a gradual increase during the forecast period.
Regional Insights
APAC is estimated to contribute 32% to the growth.
Purpose:The Integrated Support Environment (ISE) Laboratory serves the fleet, in-service engineers, logisticians and program management offices by automatically and periodically providing key decision makers with the big picture tools and actionable metrics needed for informed decision making within the realm of Support Equipment (SE) and Aircraft Launch and Recovery Equipment (ALRE) system improvements.Function:The ISE Laboratory at the Naval Air Warfare Center Aircraft Division, Lakehurst, NJ correlates cross-competency data to provide meaningful metrics. The lab provides a distributed data system that achieves the lab's mission of providing actionable metrics by combining multiple data sources and leveraging automated data feeds for near real-time situational awareness across all phases of a program including design, development, test and operational deployment all within a single system interface.Capabilities:The ISE Lab utilizes corporate toolsets to provide business intelligence to Naval Aviation Enterprise (NAE) leadership. The ISE Lab provides pertinent metrics to the fleet, engineers, logisticians and program management users on demand. The lab also utilizes specialized software to provide a thorough analysis of the data being collected, which allows for data mining, data cleansing, processing and modeling to identify and visualize trends. Moreover, the lab has defined and implemented streamlined processes for collecting data, performing data mining techniques and providing pertinent data metrics, via reports or dashboards, to decision makers.
About the MNAD Dataset The MNAD corpus is a collection of over 1 million Moroccan news articles written in modern Arabic language. These news articles have been gathered from 11 prominent electronic news sources. The dataset is made available to the academic community for research purposes, such as data mining (clustering, classification, etc.), information retrieval (ranking, search, etc.), and other non-commercial activities.
Dataset Fields
Title: The title of the article
Body: The body of the article
Category: The category of the article
Source: The electronic newspaper source of the article
About Version 1 of the Dataset (MNAD.v1) Version 1 of the dataset comprises 418,563 articles classified into 19 categories. The data was collected from well-known electronic news sources, namely Akhbarona.ma, Hespress.ma, Hibapress.com, and Le360.com. The articles were stored in four separate CSV files, each corresponding to the news website source. Each CSV file contains three fields: Title, Body, and Category of the news article.
The dataset is rich in Arabic vocabulary, with approximately 906,125 unique words. It has been utilized as a benchmark in the research paper: "A Moroccan News Articles Dataset (MNAD) For Arabic Text Categorization". In 2021 International Conference on Decision Aid Sciences and Application (DASA).
This dataset is available for download from the following sources: - Kaggle Datasets : MNADv1 - Huggingface Datasets: MNADv1
About Version 2 of the Dataset (MNAD.v2) Version 2 of the MNAD dataset includes an additional 653,901 articles, bringing the total number of articles to over 1 million (1,069,489), classified into the same 19 categories as in version 1. The new documents were collected from seven additional prominent Moroccan news websites, namely al3omk.com, medi1news.com, alayam24.com, anfaspress.com, alyaoum24.com, barlamane.com, and SnrtNews.com.
The newly collected articles have been merged with the articles from the previous version into a single CSV file named MNADv2.csv. This file includes an additional column called "Source" to indicate the source of each news article.
Furthermore, MNAD.v2 incorporates improved pre-processing techniques and data cleaning methods. These enhancements involve removing duplicates, eliminating multiple spaces, discarding rows with NaN values, replacing new lines with " ", excluding very long and very short articles, and removing non-Arabic articles. These additions and improvements aim to enhance the usability and value of the MNAD dataset for researchers and practitioners in the field of Arabic text analysis.
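The cleaning steps listed for MNAD.v2 can be sketched in pandas. This is a hedged illustration, not the dataset authors' pipeline: the column names follow the fields described above, and the short/long thresholds are illustrative since the description does not state exact cut-offs.

```python
import pandas as pd

def clean_mnad(df, min_words=20, max_words=2000):
    """Sketch of the described cleaning: drop duplicates and NaN rows,
    collapse newlines/multiple spaces, drop very short or very long
    articles, and keep only Arabic articles. Thresholds are illustrative."""
    df = df.drop_duplicates().dropna(subset=["Title", "Body"]).copy()
    # replace new lines with " " and collapse runs of whitespace
    df["Body"] = df["Body"].str.replace(r"\s+", " ", regex=True).str.strip()
    # exclude very short and very long articles
    n_words = df["Body"].str.split().str.len()
    df = df[(n_words >= min_words) & (n_words <= max_words)]
    # remove non-Arabic articles: keep bodies containing Arabic characters
    df = df[df["Body"].str.contains(r"[\u0600-\u06FF]", regex=True)]
    return df.reset_index(drop=True)
```

In practice one would tune the length thresholds to the corpus and perhaps require a minimum proportion of Arabic characters rather than a single occurrence.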
This dataset is available for download from the following sources: - Kaggle Datasets : MNADv2 - Huggingface Datasets: MNADv2
Citation If you use our data, please cite the following paper:
@inproceedings{MNAD2021,
  author    = {Mourad Jbene and Smail Tigani and Rachid Saadane and Abdellah Chehri},
  title     = {A Moroccan News Articles Dataset ({MNAD}) For Arabic Text Categorization},
  year      = {2021},
  publisher = {{IEEE}},
  booktitle = {2021 International Conference on Decision Aid Sciences and Application ({DASA})},
  doi       = {10.1109/dasa53625.2021.9682402},
  url       = {https://doi.org/10.1109/dasa53625.2021.9682402},
}
Attribution-NoDerivs 4.0 (CC BY-ND 4.0): https://creativecommons.org/licenses/by-nd/4.0/
License information was derived automatically
Integrated Geodatabase: The Global Catholic Footprint of Healthcare and Welfare
Burhans, Molly A., Mrowczynski, Jon M., Schweigel, Tayler C., Burhans, Debra T., and Wacta, Christine. The Catholic Footprint of Care Around the World (1). GoodLands and GHR Foundation, 2019.
Catholic Statistics Numbers: Annuarium Statisticum Ecclesiae – Statistical Yearbook of the Church: 1980–2018. Libreria Editrice Vaticana.
Historical Country Boundary Geodatabase: Weidmann, Nils B., Doreen Kuse, and Kristian Skrede Gleditsch. The Geography of the International System: The CShapes Dataset. International Interactions 36 (1). 2010. https://www.tandfonline.com/doi/full/10.1080/03050620903554614
GoodLands created a significant new data set for GHR and the UISG of important Church information regarding orphanages and sisters around the world, as well as healthcare, welfare, and other child care institutions. The data were extracted from the gold standard of Church data, the Annuarium Statisticum Ecclesiae, published yearly by the Vatican. It is inevitable that raw data sources will contain errors. GoodLands and its partners are not responsible for misinformation within Vatican documents. We encourage error reporting to us at data@good-lands.org or directly to the Vatican.
GoodLands worked with the GHR Foundation to map Catholic healthcare and welfare around the world using data mined from the Annuarium Statisticum Ecclesiae. GHR supported the data development and GoodLands independently invested in the mapping of information.
The workflows and data models developed for this project can be used to map any global, historical country-scale data in a time-series map while accounting for country boundary changes.
GoodLands created proprietary software that enables mining the Annuarium Statisticum Eccleasiea (see Software and Program Library at our home page for details).The GHR Foundation supported data extraction and cleaning of this information.GoodLands’ supported the development of maps, infographics, and applications for all healthcare data.
Attribution-NoDerivs 4.0 (CC BY-ND 4.0)https://creativecommons.org/licenses/by-nd/4.0/
License information was derived automatically
Integrated Geodatabase: The Global Catholic Footprint of Healthcare and Welfare
Burhans, Molly A., Mrowczynski, Jon M., Schweigel, Tayler C., Burhans, Debra T., and Wacta, Christine. The Catholic Footprint of Care Around the World (1). GoodLands and GHR Foundation, 2019.
WHO Statistics Numbers: Clean Care is Safe Care, Registration Update. (2017). Retrieved n.d., from https://www.who.int/gpsc/5may/registration_update/en/
Catholic Statistics Numbers: Annuarium Statisticum Ecclesiae – Statistical Yearbook of the Church: 1980–2018. Libreria Editrice Vaticana.
Historical Country Boundary Geodatabase: Weidmann, Nils B., Doreen Kuse, and Kristian Skrede Gleditsch. The Geography of the International System: The CShapes Dataset. International Interactions 36 (1). 2010. https://www.tandfonline.com/doi/full/10.1080/03050620903554614
GoodLands created a significant new data set for GHR and the UISG of important Church information regarding orphanages and sisters around the world, as well as healthcare, welfare, and other child care institutions. The data were extracted from the gold standard of Church data, the Annuarium Statisticum Ecclesiae, published yearly by the Vatican. It is inevitable that raw data sources will contain errors; GoodLands and its partners are not responsible for misinformation within Vatican documents. We encourage error reporting to us at data@good-lands.org or directly to the Vatican.
GoodLands worked with the GHR Foundation to map Catholic healthcare and welfare around the world using data mined from the Annuarium Statisticum Ecclesiae. GHR supported the data development, and GoodLands independently invested in the mapping of information. The workflows and data models developed for this project can be used to map any global, historical country-scale data in a time-series map while accounting for country boundary changes.
GoodLands created proprietary software that enables mining the Annuarium Statisticum Ecclesiae (see Software and Program Library at our home page for details). The GHR Foundation supported data extraction and cleaning of this information. GoodLands supported the development of maps, infographics, and applications for all healthcare data.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is a compilation of open asset-level data, i.e., the locations of sites such as the operating, manufacturing, and processing facilities of global supply chains, as of December 2022. It draws on 9 publicly available sources which, after data cleaning and harmonization, yielded 189,075 data points.
Data source | Number of data points |
Open Supply Hub (former Open Apparel Registry) | 96,736 |
Global Power Plant Database | 35,419 |
Climate trace | 19,945 |
FDA database | 12,898 |
Global Dam Watch | 11,017 |
EudraGMDP database | 5,181 |
Sustainable Finance Initiative GeoAsset Databases | 4,716 |
Global Tailings Portal | 1,956 |
Fine print Mining Database | 1,207 |
Each asset was assigned the industry in which it operates. The summary table below shows the number of assets by industry.
Industry | Number of assets |
Textiles, Apparel & Luxury Good Production | 96,736 |
Health Care, Pharma and Biotechnology | 18,079 |
Energy - Solar, Wind | 16,282 |
Energy - Hydropower | 14,515 |
Energy - Geothermal or Combustion | 11,724 |
Metals & Mining | 11,210 |
Transportation Services | 4,872 |
Construction Materials | 3,117 |
Agriculture (animal products) | 2,388 |
Agriculture (plant products) | 1,896 |
Oil, Gas & Consumable Fuels | 1,194 |
Water utilities / Water Service Providers | 892 |
Hospitality Services | 294 |
Fishing and aquaculture | 14 |
Other | 5,862 |
Note that this compilation is based on an extensive search; however, we acknowledge that there is a significant discrepancy in data coverage and comprehensiveness among the different industries. The industry “Textiles, Apparel & Luxury Good Production” is by far the most complete, while others are clearly far from complete, for example, “Construction Materials”, “Agriculture (animal products)”, “Agriculture (plant products)”, “Oil, Gas & Consumable Fuels”, “Water utilities / Water Service Providers”, “Hospitality Services”, and “Fishing and aquaculture”. Therefore, any comparison between industries should take this coverage bias into consideration.
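The cleaning-and-harmonization step described above (merging heterogeneous sources into one schema, dropping duplicates, and tallying assets by industry) can be sketched as follows. The record layout and field names are hypothetical, since the actual source schemas are not given here.

```python
from collections import Counter

# Tiny illustrative records from two of the nine sources; the real sources
# hold ~189k rows and use different field names (hypothetical schema here).
open_supply_hub = [
    {"name": "Factory A", "lat": 10.0, "lon": 30.0,
     "industry": "Textiles, Apparel & Luxury Good Production"},
    {"name": "Factory B", "lat": 20.0, "lon": 40.0,
     "industry": "Textiles, Apparel & Luxury Good Production"},
    {"name": "Factory B", "lat": 20.0, "lon": 40.0,  # exact duplicate
     "industry": "Textiles, Apparel & Luxury Good Production"},
]
power_plants = [
    {"name": "Plant C", "lat": 50.0, "lon": 60.0,
     "industry": "Energy - Hydropower"},
]

def harmonize(records, source):
    """Map each record to a shared schema, tag its source, drop rows
    without coordinates."""
    return [
        {"name": r["name"], "latitude": r["lat"], "longitude": r["lon"],
         "industry": r["industry"], "source": source}
        for r in records
        if r.get("lat") is not None and r.get("lon") is not None
    ]

combined = (harmonize(open_supply_hub, "Open Supply Hub")
            + harmonize(power_plants, "Global Power Plant Database"))

# Deduplicate on (name, latitude, longitude), keeping the first occurrence.
seen, assets = set(), []
for rec in combined:
    key = (rec["name"], rec["latitude"], rec["longitude"])
    if key not in seen:
        seen.add(key)
        assets.append(rec)

# Summary: number of assets by industry, as in the table above.
by_industry = Counter(rec["industry"] for rec in assets)
print(by_industry)
```

This is only a minimal sketch of the approach; the published compilation would also need per-source coordinate validation and fuzzier duplicate matching than an exact-key check.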
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Indonesia Manufacturing Industry: Production: Value: Milling and Cleaning of Cereals data was reported at 1,932.885 IDR bn in 2015. This records an increase from the previous number of 656.652 IDR bn for 2014. Indonesia Manufacturing Industry: Production: Value: Milling and Cleaning of Cereals data is updated yearly, averaging 156.820 IDR bn from Dec 1999 (Median) to 2015, with 15 observations. The data reached an all-time high of 1,932.885 IDR bn in 2015 and a record low of 27.499 IDR bn in 1999. Indonesia Manufacturing Industry: Production: Value: Milling and Cleaning of Cereals data remains in active status in CEIC and is reported by the Central Bureau of Statistics. The data is categorized under Indonesia Premium Database’s Mining and Manufacturing Sector – Table ID.BAD001: Manufacturing Industry: by Product: Value.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Brazil IPI: Mfg: Last 12 Months=100: Year to Date: Chemicals Products: Cleaning and Polishing data was reported at 106.000 Prev 12 Mths=100 in May 2019. This records an increase from the previous number of 103.200 Prev 12 Mths=100 for Apr 2019. Brazil IPI: Mfg: Last 12 Months=100: Year to Date: Chemicals Products: Cleaning and Polishing data is updated monthly, averaging 103.400 Prev 12 Mths=100 from Dec 2013 (Median) to May 2019, with 66 observations. The data reached an all-time high of 107.600 Prev 12 Mths=100 in Dec 2018 and a record low of 97.800 Prev 12 Mths=100 in May 2017. Brazil IPI: Mfg: Last 12 Months=100: Year to Date: Chemicals Products: Cleaning and Polishing data remains in active status in CEIC and is reported by the Brazilian Institute of Geography and Statistics. The data is categorized under Brazil Premium Database’s Mining and Manufacturing Sector – Table BR.BAA023: Industrial Production Index: by Industrial Groups and Classes: Last 12 Months=100: Year-to-Date.
This research was conducted in Albania from June 19 to July 31, 2002, as part of the second round of the Business Environment and Enterprise Performance Survey. The objective of the survey is to obtain feedback from enterprises on the state of the private sector as well as to help in building a panel of enterprise data that will make it possible to track changes in the business environment over time, thus allowing, for example, impact assessments of reforms. Through face-to-face interviews with firms in the manufacturing and services sectors, the survey assesses the constraints to private sector growth and creates statistically significant business environment indicators that are comparable across countries.
The survey topics include company's characteristics, information about sales and suppliers, competition, infrastructure services, judiciary and law enforcement, security, government policies and regulations, bribery, sources of financing, overall business environment, performance and investment activities, and workforce composition.
National
The primary sampling unit of the study is the establishment.
The manufacturing and services sectors are the primary business sectors of interest.
Sample survey data [ssd]
The information below is taken from "The Business Environment and Enterprise Performance Survey - 2002. A brief report on observations, experiences and methodology from the survey", prepared by MEMRB Custom Research Worldwide (now part of Synovate), the research company that implemented the BEEPS II instrument.
The general targeted distributional criteria of the sample in BEEPS II countries were to be as follows:
1) Coverage of countries: The BEEPS II instrument was to be administered to approximately 6,500 enterprises in 28 transition economies: 16 from CEE (Albania, Bosnia and Herzegovina, Bulgaria, Croatia, Czech Republic, Estonia, FR Yugoslavia, FYROM, Hungary, Latvia, Lithuania, Poland, Romania, Slovak Republic, Slovenia and Turkey) and 12 from the CIS (Armenia, Azerbaijan, Belarus, Georgia, Kazakhstan, Kyrgyzstan, Moldova, Russia, Tajikistan, Turkmenistan, Ukraine and Uzbekistan).
2) In each country, the sector composition of the total sample in terms of manufacturing versus services (including commerce) was to be determined by their relative contribution to GDP, subject to a 15% minimum for each category. Firms that operated in sectors subject to government price regulation and prudential supervision, such as banking, electric power, rail transport, and water and wastewater, were excluded.
Eligible enterprise activities were as follows (ISIC sections): Mining and quarrying (Section C: 10-14); Construction (Section F: 45); Manufacturing (Section D: 15-37); Transportation, storage and communications (Section I: 60-64); Wholesale, retail, repairs (Section G: 50-52); Real estate, business services (Section K: 70-74); Hotels and restaurants (Section H: 55); Other community, social and personal activities (Section O: selected groups).
3) Size: At least 10% of the sample was to be in the small and 10% in the large size categories. A small firm was defined as an establishment with 2-49 employees, a medium firm with 50-249 employees, and a large firm with 250-9,999 employees. Companies with only one employee or more than 10,000 employees were excluded.
4) Ownership: At least 10% of the firms were to have foreign control (more than 50% shareholding), and at least 10% were to have state control.
5) Exporters: At least 10% of the firms were to be exporters. A firm should be regarded as an exporter if it exported 20% or more of its total sales.
6) Location: At least 10% of firms were to be in the category "small city/countryside" (population under 50,000).
7) Year of establishment: Enterprises established later than 2000 were to be excluded.
The sample structure for BEEPS II was designed to be as representative (self-weighted) as possible to the population of firms within the industry and service sectors subject to the various minimum quotas for the total sample. This approach ensured that there was sufficient weight in the tails of the distribution of firms by the various relevant controlled parameters (sector, size, location and ownership).
As pertinent data on the actual population, or data that would have allowed estimating the population of foreign-owned and exporting enterprises, were not available, it was not feasible to build these two parameters into the design of the sample guidelines from the outset. The primary parameters used for the design of the sample were: total population of enterprises; ownership (private and state); size of enterprise (small, medium, and large); geographic location (capital; over 1 million; 1 million-250,000; 250,000-50,000; and under 50,000); and sub-sectors (e.g., mining, construction, wholesale, etc.).
For certain parameters where statistical information was not available, enterprise populations and distributions were estimated from other accessible demographic (e.g. human population concentrations in rural and urban areas) and socio-economic (e.g. employment levels) data.
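The minimum-quota criteria listed above lend themselves to a simple mechanical check during fieldwork. The sketch below, with hypothetical firm records and field names, flags which quota categories a drawn sample still falls short of.

```python
# Minimum shares of the sample per the BEEPS II guidelines above.
QUOTAS = {
    "small": 0.10,            # 2-49 employees
    "large": 0.10,            # 250-9,999 employees
    "foreign_control": 0.10,  # more than 50% foreign shareholding
    "state_control": 0.10,    # more than 50% state shareholding
    "exporter": 0.10,         # 20% or more of total sales exported
    "small_city": 0.10,       # location with population under 50,000
}

def classify(firm):
    """Derive the quota flags for one (hypothetical) firm record."""
    return {
        "small": 2 <= firm["employees"] <= 49,
        "large": 250 <= firm["employees"] <= 9999,
        "foreign_control": firm["foreign_share"] > 0.50,
        "state_control": firm["state_share"] > 0.50,
        "exporter": firm["export_share"] >= 0.20,
        "small_city": firm["city_population"] < 50_000,
    }

def quota_shortfalls(sample):
    """Return the criteria whose minimum share is not yet met."""
    n = len(sample)
    counts = {k: 0 for k in QUOTAS}
    for firm in sample:
        for k, hit in classify(firm).items():
            counts[k] += hit
    return [k for k, minimum in QUOTAS.items() if counts[k] / n < minimum]

sample = [
    {"employees": 30, "foreign_share": 0.0, "state_share": 0.0,
     "export_share": 0.25, "city_population": 20_000},
    {"employees": 300, "foreign_share": 0.6, "state_share": 0.0,
     "export_share": 0.0, "city_population": 2_000_000},
]
print(quota_shortfalls(sample))  # only the state-control quota is unmet
```

This is a sketch of the bookkeeping only; the actual survey also had to balance these quotas against the self-weighted sector and location distribution described above.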
The survey was discontinued in Turkmenistan due to concerns about Turkmen government interference with implementation of the study.
Face-to-face [f2f]
The following survey instruments are available: Screener and Main Questionnaires.
Data entry and first checking and validation of the results were undertaken locally. Final checking and validation of the results were made at MEMRB Custom Research Worldwide headquarters.
Overall, in all BEEPS II countries, the implementing agency contacted 18,052 enterprises and achieved an interview completion rate of 36.93%.
Respondents who either refused outright (i.e. not interested) or were unavailable to be interviewed (i.e. on holiday, etc) accounted for 38.34% of all contacts. Enterprises which were contacted but were non-eligible (i.e. business activity, year of establishment, etc) or quotas were already met (i.e. size, ownership etc) or to which “blind calls” were made to meet quotas (i.e. foreign ownership, exporters, etc) accounted for 24.73% of the total number of enterprises contacted.
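The contact accounting above can be checked arithmetically: the three outcome shares (36.93% completed, 38.34% refused or unavailable, 24.73% ineligible, over-quota, or blind calls) cover all 18,052 contacts, and the implied number of completed interviews matches the roughly 6,500-interview target mentioned earlier.

```python
# Sanity check on the reported BEEPS II contact outcomes.
contacts = 18_052
completed_pct, refused_pct, ineligible_pct = 36.93, 38.34, 24.73

# The three outcome categories should account for every contact.
assert round(completed_pct + refused_pct + ineligible_pct, 2) == 100.00

# Implied number of completed interviews across all BEEPS II countries.
completed = round(contacts * completed_pct / 100)
print(completed)  # 6667, close to the ~6,500 target
```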
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
DRAGON: Multi-Label Classification Replication Package
This archive contains the replication package for the DRAGON multi-label classification models, which leverage BERT-based architectures. The package includes scripts for repository mining, dataset creation, data processing, model training, and evaluation.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Brazil IPI: PY=100: São Paulo: Soap, Detergents, Cleaning Products, Cosmetics, Perfumery and Personal Care data was reported at 100.700 Prev Year=100 in May 2019. This records an increase from the previous number of 93.200 Prev Year=100 for Apr 2019. Brazil IPI: PY=100: São Paulo: Soap, Detergents, Cleaning Products, Cosmetics, Perfumery and Personal Care data is updated monthly, averaging 101.800 Prev Year=100 from Jan 2003 (Median) to May 2019, with 197 observations. The data reached an all-time high of 137.600 Prev Year=100 in Apr 2005 and a record low of 83.200 Prev Year=100 in May 2018. Brazil IPI: PY=100: São Paulo: Soap, Detergents, Cleaning Products, Cosmetics, Perfumery and Personal Care data remains in active status in CEIC and is reported by the Brazilian Institute of Geography and Statistics. The data is categorized under Brazil Premium Database’s Mining and Manufacturing Sector – Table BR.BAB044: Industrial Production Index: Previous Year=100: by State: São Paulo.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about book subjects and is filtered where the book is Exploratory Data Mining and Data Cleaning. It has 4 columns: book subject, authors, books, and publication dates. The data is ordered by earliest publication date (descending).