78 datasets found
  1. Data Cleaning Sample

    • borealisdata.ca
    • dataone.org
    Updated Jul 13, 2023
    Cite
    Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177
    Dataset provided by
    Borealis
    Authors
    Rong Luo
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Sample data for exercises in Further Adventures in Data Cleaning.

  2. Cleaning Biodiversity Data: A Botanical Example Using Excel or RStudio

    • qubeshub.org
    Updated Jul 16, 2020
    Cite
    Shelly Gaynor (2020). Cleaning Biodiversity Data: A Botanical Example Using Excel or RStudio [Dataset]. http://doi.org/10.25334/DRGD-F069
    Dataset provided by
    QUBES
    Authors
    Shelly Gaynor
    Description

    Access and clean an open source herbarium dataset using Excel or RStudio.

  3. Data for A Conceptual Model for Transparent, Reusable, and Collaborative Data Cleaning

    • databank.illinois.edu
    • aws-databank-alb.library.illinois.edu
    Updated Jul 12, 2023
    Cite
    Nikolaus Parulian (2023). Data for A Conceptual Model for Transparent, Reusable, and Collaborative Data Cleaning [Dataset]. http://doi.org/10.13012/B2IDB-6827044_V1
    Authors
    Nikolaus Parulian
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The dissertation_demo.zip contains the base code and demonstration materials for the dissertation A Conceptual Model for Transparent, Reusable, and Collaborative Data Cleaning. Each chapter has a demo folder for demonstrating provenance queries or tools. The Airbnb dataset used for demonstration and simulation is not included in this demo but can be accessed directly from the reference website. Any updates to the demonstrations and examples can be found online at: https://github.com/nikolausn/dissertation_demo

  4. LSC (Leicester Scientific Corpus)

    • figshare.le.ac.uk
    Updated Apr 15, 2020
    Cite
    LSC (Leicester Scientific Corpus) [Dataset]. https://figshare.le.ac.uk/articles/dataset/LSC_Leicester_Scientific_Corpus_/9449639
    Dataset provided by
    University of Leicester
    Authors
    Neslihan Suzen
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Area covered
    Leicester
    Description

    The LSC (Leicester Scientific Corpus)

    April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk), supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.

    The data are extracted from the Web of Science [1]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.

    [Version 2] A further cleaning pass was applied to the abstracts of LSC Version 1*; details of the cleaning procedure are explained in Step 6. (* Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v1.)

    Getting Started

    This text provides information on the LSC (Leicester Scientific Corpus) and the pre-processing steps applied to abstracts, and describes the structure of the files organising the corpus. The corpus was created for future work on quantifying the meaning of research texts, and is made available for use in Natural Language Processing projects.

    LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [1]. The corpus contains only documents in English. Each document in the corpus contains the following parts:

    1. Authors: the list of authors of the paper
    2. Title: the title of the paper
    3. Abstract: the abstract of the paper
    4. Categories: one or more categories from the list of categories [2]; the full list is presented in the file 'List_of_Categories.txt'
    5. Research Areas: one or more research areas from the list of research areas [3]; the full list is presented in the file 'List_of_Research_Areas.txt'
    6. Total Times Cited: the number of times the paper was cited by other items from all databases within the Web of Science platform [4]
    7. Times Cited in Core Collection: the total number of times the paper was cited by other papers within the WoS Core Collection [4]

    The corpus was collected online in July 2018 and contains citation counts from publication date to July 2018. We describe a document as the collection of the information (about a paper) listed above. The total number of documents in LSC is 1,673,350.

    Data Processing

    Step 1: Downloading the Data Online

    The dataset was collected manually by exporting documents as tab-delimited files online. All documents are available online.

    Step 2: Importing the Dataset to R

    The LSC was collected as TXT files, all of which were imported into R.

    Step 3: Cleaning the Data of Documents with an Empty Abstract or without a Category

    As our research is based on the analysis of abstracts and categories, all documents with empty abstracts and all documents without categories were removed.

    Step 4: Identification and Correction of Concatenated Words in Abstracts

    Medicine-related publications in particular use 'structured abstracts', which are divided into sections with distinct headings such as introduction, aim, objective, method, result, and conclusion. The tool used for extracting abstracts concatenates these section headings with the first word of the section, producing words such as 'ConclusionHigher' and 'ConclusionsRT'. Such words were detected by sampling medicine-related publications with human intervention, and each detected concatenation was split into two words; for instance, 'ConclusionHigher' was split into 'Conclusion' and 'Higher'. (An illustrative sketch of Steps 4-6 follows Step 5 below.) The section headings in such abstracts are listed below:

    Background, Method(s), Design, Theoretical, Measurement(s), Location, Aim(s), Methodology, Process, Abstract, Population, Approach, Objective(s), Purpose(s), Subject(s), Introduction, Implication(s), Patient(s), Procedure(s), Hypothesis, Measure(s), Setting(s), Limitation(s), Discussion, Conclusion(s), Result(s), Finding(s), Material(s), Rationale(s), Implications for health and nursing policy

    Step 5: Extracting (Sub-setting) the Data Based on the Lengths of Abstracts

    After correction, the lengths of the abstracts were calculated. 'Length' is the total number of words in the text, calculated by the same rule as Microsoft Word's word count [5]. According to the APA style manual [6], an abstract should contain between 150 and 250 words. In LSC, we limited the length of abstracts to between 30 and 500 words in order to study documents with abstracts of typical length and to avoid the effect of length on the analysis.
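    The authors worked in R; purely as an illustration, here is a minimal Python sketch of the heading-splitting and length-filtering steps above, plus the footer cleaning described in Step 6 below. The heading list and the 30-500 word bounds come from the text; the regexes and the whitespace word count are simplifying assumptions.

    ```python
    import re

    # A few of the section headings listed above (longer variants first so the
    # regex alternation prefers them).
    HEADINGS = ["Conclusions", "Conclusion", "Results", "Result", "Methods",
                "Method", "Objectives", "Objective", "Background", "Introduction"]
    heading_re = re.compile(r"\b(" + "|".join(HEADINGS) + r")(?=[A-Z])")

    # Step 6 (Version 2): trailing copyright/permission footers; these patterns
    # are illustrative only (the authors identified theirs by sampling).
    footer_re = re.compile(
        r"(\(C\)\s*\d{4}|Published by Elsevier|All rights reserved).*$",
        flags=re.IGNORECASE)

    def clean_abstract(text: str) -> str:
        text = heading_re.sub(r"\1 ", text)   # "ConclusionHigher" -> "Conclusion Higher"
        return footer_re.sub("", text).strip()

    def keep_by_length(text: str, lo: int = 30, hi: int = 500) -> bool:
        # Step 5: word count approximated by whitespace splitting.
        return lo <= len(text.split()) <= hi

    print(clean_abstract("IntroductionWe study X. ConclusionsRT improved. (C) 2014 Elsevier."))
    ```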

    Step 6: [Version 2] Cleaning Copyright Notices, Permission Policies, Journal Names and Conference Names from LSC Abstracts in Version 1

    Conferences and journals can place a footer of copyright notice, permission policy, journal name, licence, authors' rights or conference name below the text of an abstract. The tool used for extracting and processing abstracts from the WoS database attaches such footers to the text; for example, casual observation shows that copyright notices such as 'Published by Elsevier Ltd.' appear in many texts. To avoid abnormal appearances of words in further analysis, such as bias in frequency calculations, we performed a cleaning procedure on such sentences and phrases in the abstracts of LSC Version 1. We removed copyright notices, names of conferences, names of journals, authors' rights, licences and permission policies identified by sampling abstracts.

    Step 7: [Version 2] Re-extracting (Sub-setting) the Data Based on the Lengths of Abstracts

    The cleaning procedure described in the previous step left some abstracts with fewer words than our minimum length criterion (30 words); 474 such texts were removed.

    Step 8: Saving the Dataset in CSV Format

    Documents were saved into 34 CSV files, with one record per line and the abstract, title, list of authors, list of categories, list of research areas, and citation counts recorded in fields. To access the LSC for research purposes, please email ns433@le.ac.uk.

    References

    [1] Web of Science. (15 July). Available: https://apps.webofknowledge.com/

    [2] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html

    [3] Research Areas in WoS. Available: https://images.webofknowledge.com/images/help/WOS/hp_research_areas_easca.html

    [4] Times Cited in WoS Core Collection. (15 July). Available: https://support.clarivate.com/ScientificandAcademicResearch/s/article/Web-of-Science-Times-Cited-accessibility-and-variation?language=en_US

    [5] Word Count. Available: https://support.office.com/en-us/article/show-word-count-3c9e6a11-a04d-43b4-977c-563a0e0d5da3

    [6] A. P. Association, Publication Manual. American Psychological Association, Washington, DC, 1983.

  5. CNVVE Dataset clean audio samples

    • darus.uni-stuttgart.de
    Updated Feb 13, 2024
    Cite
    Ramin Hedeshy; Raphael Menges; Steffen Staab (2024). CNVVE Dataset clean audio samples [Dataset]. http://doi.org/10.18419/DARUS-3898
    Available download formats: audio/vnd.wave
    Dataset provided by
    DaRUS
    Authors
    Ramin Hedeshy; Raphael Menges; Steffen Staab
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Dataset funded by
    BMBF
    BMWK/ESF
    Description

    This CNVVE Dataset contains clean audio samples encompassing six distinct classes of voice expressions, namely “Uh-huh” or “mm-hmm”, “Uh-uh” or “mm-mm”, “Hush” or “Shh”, “Psst”, “Ahem”, and continuous humming, e.g., “hmmm.” Audio samples of each class are found in the respective folders. These audio samples have undergone a thorough cleaning process; the raw samples are published at https://doi.org/10.18419/darus-3897. Initially, we applied the Google WebRTC voice activity detection (VAD) algorithm to the audio files to remove noise and silence from the collected voice signals. The intensity was set to "2" on a scale from "1" to "3". However, because of variations in the data, some files required additional manual cleaning; these outliers, characterized by sharp click sounds (such as those occurring at the end of recordings), were addressed by hand. The samples were recorded through a dedicated data-collection website that defines the purpose and type of voice data by providing example recordings to participants, as well as the expressions’ written equivalents, e.g., “Uh-huh”. Audio recordings were automatically saved in .wav format and kept anonymous, with a sampling rate of 48 kHz and a bit depth of 32 bits. For more information, please check the paper, or feel free to contact the authors with any inquiries.
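    A minimal sketch of such a VAD pass, assuming the py-webrtcvad bindings for Google's WebRTC VAD (the authors do not name their implementation, and they describe the aggressiveness as 2 on a 1-3 scale, while py-webrtcvad exposes modes 0-3). webrtcvad expects 16-bit mono PCM in 10, 20, or 30 ms frames, so the dataset's 32-bit samples would first need conversion:

    ```python
    import webrtcvad

    vad = webrtcvad.Vad(2)         # aggressiveness mode 2

    SAMPLE_RATE = 48000            # Hz, as in the dataset
    FRAME_MS = 30                  # webrtcvad accepts 10, 20, or 30 ms frames
    FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2   # 16-bit mono: 2 bytes/sample

    def voiced_frames(pcm: bytes):
        """Yield only the frames that the VAD classifies as speech."""
        for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
            frame = pcm[i:i + FRAME_BYTES]
            if vad.is_speech(frame, SAMPLE_RATE):
                yield frame
    ```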

  6. The Surface Water Chemistry (SWatCh) database

    • data.niaid.nih.gov
    • zenodo.org
    Updated Apr 26, 2022
    Cite
    Heubach, Franz (2022). The Surface Water Chemistry (SWatCh) database [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4559695
    Dataset provided by
    Heubach, Franz
    Rotteveel, Lobke
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This is the dataset presented in the following manuscript: The Surface Water Chemistry (SWatCh) database: A standardized global database of water chemistry to facilitate large-sample hydrological research, which is currently under review at Earth System Science Data.

    Openly accessible global scale surface water chemistry datasets are urgently needed to detect widespread trends and problems, to help identify their possible solutions, and determine critical spatial data gaps where more monitoring is required. Existing datasets are limited in availability, sample size/sampling frequency, and geographic scope. These limitations inhibit the answering of emerging transboundary water chemistry questions, for example, the detection and understanding of delayed recovery from freshwater acidification. Here, we begin to address these limitations by compiling the global surface water chemistry (SWatCh) database. We collect, clean, standardize, and aggregate open access data provided by six national and international agencies to compile a database containing information on sites, methods, and samples, and a GIS shapefile of site locations. We remove poor quality data (for example, values flagged as “suspect” or “rejected”), standardize variable naming conventions and units, and perform other data cleaning steps required for statistical analysis. The database contains water chemistry data for streams, rivers, canals, ponds, lakes, and reservoirs across seven continents, 24 variables, 33,722 sites, and over 5 million samples collected between 1960 and 2022. Similar to prior research, we identify critical spatial data gaps on the African and Asian continents, highlighting the need for more data collection and sharing initiatives in these areas, especially considering freshwater ecosystems in these environs are predicted to be among the most heavily impacted by climate change. We identify the main challenges associated with compiling global databases – limited data availability, dissimilar sample collection and analysis methodology, and reporting ambiguity – and provide recommended solutions. By addressing these challenges and consolidating data from various sources into one standardized, openly available, high quality, and trans-boundary database, SWatCh allows users to conduct powerful and robust statistical analyses of global surface water chemistry.
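    The flag-based filtering and unit standardization described above are routine pandas operations. A minimal sketch, assuming hypothetical column names value, unit, and status_flag (the actual SWatCh schema may differ):

    ```python
    import pandas as pd

    samples = pd.read_csv("swatch_samples.csv")   # hypothetical file name

    # Remove poor-quality data, e.g. values flagged as "suspect" or "rejected".
    samples = samples[~samples["status_flag"].isin(["suspect", "rejected"])]

    # Standardize units, e.g. convert mg/L to ug/L so each variable has one unit.
    mg_per_l = samples["unit"].str.lower() == "mg/l"
    samples.loc[mg_per_l, "value"] *= 1000
    samples.loc[mg_per_l, "unit"] = "ug/l"
    ```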

  7. CulturaX Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Dec 18, 2024
    Cite
    CulturaX Dataset [Dataset]. https://paperswithcode.com/dataset/culturax
    Authors
    Thuat Nguyen; Chien Van Nguyen; Viet Dac Lai; Hieu Man; Nghia Trung Ngo; Franck Dernoncourt; Ryan A. Rossi; Thien Huu Nguyen
    Description

    We present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages, tailored for large language model (LLM) development. Our dataset undergoes meticulous cleaning and deduplication through a rigorous pipeline of multiple stages to accomplish the best quality for model training, including language identification, URL-based filtering, metric-based cleaning, document refinement, and data deduplication. We employ MinHash at document level to achieve fuzzy deduplication for the datasets in different languages. Our data cleaning framework includes diverse criteria and threshold selections, guided by extensive data samples, ensuring comprehensive noise filtering in various aspects. CulturaX is fully released to the public in HuggingFace to facilitate research and advancements in multilingual LLMs.
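    The paper names document-level MinHash for fuzzy deduplication but not a specific implementation; a minimal sketch using the datasketch library (an assumption) to drop near-duplicate documents:

    ```python
    from datasketch import MinHash, MinHashLSH

    def minhash(text: str, num_perm: int = 128) -> MinHash:
        m = MinHash(num_perm=num_perm)
        for token in text.lower().split():
            m.update(token.encode("utf8"))
        return m

    docs = {"d1": "the quick brown fox", "d2": "the quick brown foxes", "d3": "unrelated text"}
    lsh = MinHashLSH(threshold=0.5, num_perm=128)   # estimated-Jaccard threshold

    kept = []
    for key, text in docs.items():
        m = minhash(text)
        if not lsh.query(m):      # no near-duplicate kept so far
            lsh.insert(key, m)
            kept.append(key)
    print(kept)                   # d2 is typically dropped as a near-duplicate of d1
    ```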

    Our dataset combines the most recent iteration of mC4 (version 3.1.0) [1] with all accessible OSCAR corpora up to the present year, including 20.19, 21.09, 22.01, and 23.01 [2]. After deep cleaning and deduplication, CulturaX comprises 16 TB of data in Parquet format (expanding to 27 TB when unpacked). More than half of our dataset is dedicated to non-English languages to significantly boost the data size and enhance the feasibility of training models in multilingual scenarios.

    To obtain perplexity scores for data cleaning, we train a SentencePiece tokenizer and 5-gram Kneser-Ney language models as provided in the KenLM library [3] using the 20230501 dumps of Wikipedia. Our KenLM models are also released in HuggingFace: https://huggingface.co/uonlp/kenlm.
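    The released KenLM models can be queried from Python with the kenlm bindings; a minimal sketch (the model file name here is a hypothetical local path; the released models live at https://huggingface.co/uonlp/kenlm):

    ```python
    import kenlm

    model = kenlm.Model("en.arpa.bin")   # hypothetical path to a downloaded model

    # Documents whose perplexity is far above a per-language threshold (chosen
    # from data samples, per the paper) would be filtered out.
    print(model.perplexity("this is a well formed english sentence ."))
    ```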

    Details for the dataset can be found in our technical paper: https://arxiv.org/abs/2309.09400 and https://huggingface.co/datasets/uonlp/CulturaX

  8. Ultimate Arabic News Dataset

    • data.mendeley.com
    • opendatalab.com
    Updated Jul 4, 2022
    Cite
    Ahmed Hashim Al-Dulaimi (2022). Ultimate Arabic News Dataset [Dataset]. http://doi.org/10.17632/jz56k5wxz7.2
    Authors
    Ahmed Hashim Al-Dulaimi
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The Ultimate Arabic News Dataset is a collection of single-label modern Arabic texts that are used in news websites and press articles.

    Arabic news data was collected using web-scraping techniques from many well-known news sites such as Al-Arabiya and Al-Youm Al-Sabea (Youm7), from news published on the Google search engine, and from various other sources.

    • The data we collected consist of two primary files:

    UltimateArabic: A file containing more than 193,000 original Arabic news texts, without pre-processing. The texts contain words, numbers, and symbols that can be removed using pre-processing to increase accuracy when using the dataset in various Arabic natural language processing tasks such as text classification.

    UltimateArabicPrePros: A file containing the data from the first file after pre-processing, reduced to about 188,000 text documents; stop words, non-Arabic words, symbols, and numbers have been removed, so this file is ready for direct use in various Arabic natural language processing tasks such as text classification.

    • We have added two folders containing additional detailed datasets:

    1- Sample: This folder contains samples of the web-scraping results for two popular Arab websites in two different news categories, Sports and Politics. It contains two datasets:

    Sample_Youm7_Politic: an example of news in the "Politic" category collected from the Youm7 website.
    Sample_alarabiya_Sport: an example of news in the "Sport" category collected from the Al-Arabiya website.

    2- Dataset Versions: This folder contains four different versions of the original dataset, from which the appropriate version can be selected for use in text classification techniques. The first version (Original) contains the raw data without any pre-processing, so its token count is very high. In the second version (Original_without_Stop) the data was cleaned by removing symbols, numbers, non-Arabic words, and stop words, so the token count is greatly reduced. In the third version (Original_with_Stem) the data was cleaned and a text-stemming technique was applied to remove the affixes that might affect the accuracy of results and to obtain word roots. In the fourth version (Original_Without_Stop_Stem) all preprocessing techniques (data cleaning, stop-word removal, and stemming) were applied, so its token count is the lowest of all the versions. (A sketch of this preprocessing pipeline appears after the category list below.)

    • The data is divided into 10 different categories: Culture, Diverse, Economy, Sport, Politic, Art, Society, Technology, Medical and Religion.
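    As an illustration of the preprocessing behind the four versions, here is a minimal Python sketch assuming NLTK's Arabic stopword list and ISRI stemmer; the dataset authors do not name their tools, so treat the specific libraries as assumptions:

    ```python
    import re
    from nltk.corpus import stopwords          # requires nltk.download("stopwords")
    from nltk.stem.isri import ISRIStemmer     # an Arabic (root-based) stemmer

    ARABIC_STOPWORDS = set(stopwords.words("arabic"))
    stemmer = ISRIStemmer()
    arabic_word = re.compile(r"^[\u0621-\u064A]+$")   # Arabic letters only

    def preprocess(text: str, remove_stop: bool = True, stem: bool = True) -> str:
        # Drop numbers, symbols, and non-Arabic words (as in Original_without_Stop).
        tokens = [t for t in text.split() if arabic_word.match(t)]
        if remove_stop:
            tokens = [t for t in tokens if t not in ARABIC_STOPWORDS]
        if stem:                                # as in Original_with_Stem
            tokens = [stemmer.stem(t) for t in tokens]
        return " ".join(tokens)
    ```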
  9. 2019 Kaggle Machine Learning & Data Science Survey

    • kaggle.com
    Updated Dec 22, 2020
    Cite
    EK (2020). 2019 Kaggle Machine Learning & Data Science Survey [Dataset]. https://www.kaggle.com/eswarankrishnasamy/2019-kaggle-machine-learning-data-science-survey/notebooks
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    EK
    Description

    Overview

    Welcome to Kaggle's third annual Machine Learning and Data Science Survey ― and our second-ever survey data challenge. You can read our executive summary here.

    This year, as in 2017 and 2018, we set out to conduct an industry-wide survey that presents a truly comprehensive view of the state of data science and machine learning. The survey was live for three weeks in October, and after cleaning the data we finished with 19,717 responses!

    There's a lot to explore here. The results include raw numbers about who is working with data, what’s happening with machine learning in different industries, and the best ways for new data scientists to break into the field. We've published the data in as raw a format as possible without compromising anonymization, which makes it an unusual example of a survey dataset.

    Challenge

    This year Kaggle is launching the second annual Data Science Survey Challenge, where we will be awarding a prize pool of $30,000 to notebook authors who tell a rich story about a subset of the data science and machine learning community.

    In our third year running this survey, we were once again awed by the global, diverse, and dynamic nature of the data science and machine learning industry. This survey data EDA provides an overview of the industry on an aggregate scale, but it also leaves us wanting to know more about the many specific communities comprised within the survey. For that reason, we’re inviting the Kaggle community to dive deep into the survey datasets and help us tell the diverse stories of data scientists from around the world.

    The challenge objective: tell a data story about a subset of the data science community represented in this survey, through a combination of both narrative text and data exploration. A “story” could be defined any number of ways, and that’s deliberate. The challenge is to deeply explore (through data) the impact, priorities, or concerns of a specific group of data science and machine learning practitioners. That group can be defined in the macro (for example: anyone who does most of their coding in Python) or the micro (for example: female data science students studying machine learning in masters programs). This is an opportunity to be creative and tell the story of a community you identify with or are passionate about!

    Submissions will be evaluated on the following:

    • Composition: Is there a clear narrative thread to the story that's articulated and supported by data? The subject should be well defined, well researched, and well supported through the use of data and visualizations.
    • Originality: Does the reader learn something new through this submission? Or is the reader challenged to think about something in a new way? A great entry will be informative, thought-provoking, and fresh all at the same time.
    • Documentation: Are your code, notebook, and additional data sources well documented so a reader can understand what you did? Are your sources clearly cited? A high-quality analysis should be concise and clear at each step so the rationale is easy to follow and the process is reproducible.

    To be valid, a submission must be contained in one notebook, made public on or before the submission deadline. Participants are free to use any datasets in addition to the Kaggle Data Science survey, but those datasets must also be publicly available on Kaggle by the deadline for a submission to be valid.

    How to Participate

    To make a submission, complete the submission form. Only one submission will be judged per participant, so if you make multiple submissions we will review the last (most recent) entry.

    No submission is necessary for the Weekly Notebook Award. To be eligible, a notebook must be public and use the 2019 Data Science Survey as a data source.

    Submission deadline: 11:59PM UTC, December 2nd, 2019.

    Survey Methodology

    This survey received 19,717 usable responses from 171 countries and territories. If a country or territory received fewer than 50 respondents, we grouped them into a group named “Other” for anonymity.

    We excluded respondents who were flagged by our survey system as “Spam”.

    Most of our respondents were found primarily through Kaggle channels, like our email list, discussion forums and social media channels.

    The survey was live from October 8th to October 28th. We allowed respondents to complete the survey at any time during that window. The median response time for those who participated in the survey was approximately 10 minutes.

    Not every question was shown to every respondent. You can learn more about the different segments we used in the survey_schema.csv file. In general, respondents with more experience were asked more questions and respondents with less experience were asked fewer questions.

    To protect the respondents’ identity, the answers to multiple choice questions have been kept in a separate data file from the open-ended responses. We do not provide a key to match up the multiple choice and free form responses. Further, the free form responses have been randomized column-wise, such that the responses that appear on the same row did not necessarily come from the same survey-taker.

    Multiple choice single response questions fit into individual columns whereas multiple choice multiple response questions were split into multiple columns. Text responses were encoded to protect user privacy and countries with fewer than 50 respondents were grouped into the category "other".
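    A minimal pandas sketch of working with that layout; the file and column names here (multiple_choice_responses.csv, Q1, Q2_Part_1, ...) are hypothetical stand-ins, since the real names are documented in survey_schema.csv:

    ```python
    import pandas as pd

    mc = pd.read_csv("multiple_choice_responses.csv", low_memory=False)

    # Single-response question: one column, one answer per respondent.
    print(mc["Q1"].value_counts())

    # Multiple-response question: one column per answer option; count the picks.
    parts = [c for c in mc.columns if c.startswith("Q2_Part_")]
    print(mc[parts].notna().sum())
    ```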

    Data has been released under a CC BY 2.0 license: https://creativecommons.org/licenses/by/2.0/

  10. Integrated Household Panel Survey 2010-2013-2016-2019 (Long-Term Panel, 102 EAs) - Malawi

    • catalog.ihsn.org
    • datacatalog.ihsn.org
    Updated Jul 19, 2023
    Cite
    National Statistical Office (NSO) (2023). Integrated Household Panel Survey 2010-2013-2016-2019 (Long-Term Panel, 102 EAs) - Malawi [Dataset]. http://catalog.ihsn.org/catalog/8702
    Dataset authored and provided by
    National Statistical Office (NSO)
    Time period covered
    2010 - 2019
    Area covered
    Malawi
    Description

    Abstract

    The 2016 Integrated Household Panel Survey (IHPS) was launched in April 2016 as part of the Malawi Fourth Integrated Household Survey fieldwork operation. The IHPS 2016 targeted 1,989 households that were interviewed in the IHPS 2013 and that could be traced back to half of the 204 enumeration areas that were originally sampled as part of the Third Integrated Household Survey (IHS3) 2010/11. The 2019 IHPS was launched in April 2019 as part of the Malawi Fifth Integrated Household Survey fieldwork operations, targeting the 2,508 households that were interviewed in 2016. The panel sample expanded each wave through the tracking of split-off individuals and the new households that they formed.

    Available as part of this project are the IHPS 2019 data, the IHPS 2016 data, and the re-released IHPS 2010 & 2013 data covering only the subsample of 102 EAs with updated panel weights.

    Additionally, the IHPS 2016 was the first survey that received complementary financial and technical support from the Living Standards Measurement Study – Plus (LSMS+) initiative, which was established with grants from the Umbrella Facility for Gender Equality Trust Fund, the World Bank Trust Fund for Statistical Capacity Building, and the International Fund for Agricultural Development, and is implemented by the World Bank Living Standards Measurement Study (LSMS) team in collaboration with the World Bank Gender Group and partner national statistical offices. The LSMS+ aims to improve the availability and quality of individual-disaggregated household survey data and is, at the outset, a direct response to the World Bank IDA18 commitment to support 6 IDA countries in collecting intra-household, sex-disaggregated household survey data on (1) ownership of and rights to selected physical and financial assets, (2) work and employment, and (3) entrepreneurship, following international best practices in questionnaire design and minimizing the use of proxy respondents while collecting personal information. This dataset is included here.

    Geographic coverage

    National coverage

    Analysis unit

    • Households
    • Individuals
    • Children under 5 years
    • Consumption expenditure commodities/items
    • Communities
    • Agricultural household/ Holder/ Crop

    Universe

    The IHPS 2016 and 2019 attempted to track all IHPS 2013 households stemming from 102 of the original 204 baseline panel enumeration areas, as well as individuals that moved away from the 2013 dwellings between 2013 and 2016, provided they were neither servants nor guests at the time of the IHPS 2013, were projected to be at least 12 years of age, and were known to be residing in mainland Malawi, excluding Likoma Island and institutions such as prisons, police compounds, and army barracks.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    A sub-sample of IHS3 2010 sample enumeration areas (EAs) (i.e. 204 EAs out of 768 EAs) was selected prior to the start of the IHS3 field work with the intention (i) to track and resurvey these households in 2013 in accordance with the IHS3 fieldwork timeline and as part of the Integrated Household Panel Survey (IHPS 2013), and (ii) to visit a total of 3,246 households in these EAs twice to reduce recall error associated with different aspects of agricultural data collection. At baseline, the IHPS sample was selected to be representative at the national, regional, and urban/rural levels and for each of the following 6 strata: (i) Northern Region - Rural, (ii) Northern Region - Urban, (iii) Central Region - Rural, (iv) Central Region - Urban, (v) Southern Region - Rural, and (vi) Southern Region - Urban. The IHPS 2013 main fieldwork took place during the period of April-October 2013, with residual tracking operations in November-December 2013.

    Given budget and resource constraints, for the IHPS 2016 the number of sample EAs in the panel was reduced to 102 out of the 204 EAs. As a result, the domains of analysis are limited to the national, urban and rural areas. Although the results of the IHPS 2016 cannot be tabulated by region, the stratification of the IHPS by region, urban and rural strata was maintained. The IHPS 2019 tracked all individuals 12 years or older from the 2016 households.

    Mode of data collection

    Computer Assisted Personal Interview [capi]

    Cleaning operations

    Data Entry Platform

    To ensure data quality and timely availability of data, the IHPS 2019 was implemented using the World Bank’s Survey Solutions CAPI software. To carry out the IHPS 2019, one laptop computer and a wireless internet router were assigned to each team supervisor, and each enumerator had an 8-inch GPS-enabled Lenovo tablet computer provided by the NSO. The use of Survey Solutions allowed for the real-time availability of data: as interviews were completed, they were approved by the supervisor and synced to the headquarters server as frequently as possible. While administering the first module of the questionnaire, the enumerators also used their tablets to record the GPS coordinates of the dwelling units. Geo-referenced household locations from the tablets complemented the GPS measurements taken by Garmin eTrex 30 handheld devices, and these were linked with publicly available geospatial databases to enable the inclusion of a number of geospatial variables in the analysis - extensive measures of distance (i.e. distance to the nearest market), climatology, soil and terrain, and other environmental factors.

    Data Management

    The IHPS 2019 Survey Solutions CAPI-based data entry application was designed to streamline the data collection process in the field. IHPS 2019 interviews were mainly collected in “sample” mode (assignments generated from headquarters) and a few in “census” mode (new interviews created by interviewers from a template) so that the NSO had more control over the sample. This hybrid approach was necessary to aid the tracking operations, whereby an enumerator could quickly create a tracking assignment, considering that teams were mostly working in areas with poor network connection and hence could not quickly receive tracking cases from headquarters.

    The range and consistency checks built into the application were informed by the LSMS-ISA experience with the IHS3 2010/11, IHPS 2013, and IHPS 2016. Prior programming of the data entry application allowed a wide variety of range and consistency checks to be conducted and reported, and potential issues to be investigated and corrected before closing the assigned enumeration area. Headquarters (the NSO management) assigned work to the supervisors based on their regions of coverage. The supervisors then made assignments to the enumerators linked to their supervisor account. The work assignments and syncing of completed interviews took place through a Wi-Fi connection to the IHPS 2019 server. Because the data were available in real time, they were monitored closely throughout the entire data collection period; upon receipt at headquarters, data were exported to Stata for further consistency checks, data cleaning, and analysis.

    Data Cleaning

    The data cleaning process was done in several stages over the course of fieldwork and through preliminary analysis. The first stage of data cleaning was conducted in the field by the field teams, utilizing error messages generated by the Survey Solutions application when a response did not fit the rules for a particular question. For questions that flagged an error, the enumerators were expected to record a comment within the questionnaire explaining the reason for the error to their supervisor and confirming that they had double-checked the response with the respondent. The supervisors were expected to sync the enumerator tablets as frequently as possible to avoid having many questionnaires on a tablet and to enable daily checks of questionnaires. Some supervisors preferred to review completed interviews on the tablets prior to syncing, but still recorded notes in the supervisor account and rejected questionnaires accordingly. The second stage of data cleaning was also done in the field and resulted from the additional error reports generated in Stata, which were sent to the field teams via email or Dropbox. The field supervisors collected reports for their assignments and, in coordination with the enumerators, reviewed, investigated, and corrected errors. Due to the quick turn-around in error reporting, it was possible to conduct call-backs while the team was still operating in the EA when required. Corrections to the data were entered in the rejected questionnaires and sent back to headquarters.

    The first stage was during the interview itself: because CAPI software was used, as enumerators asked the questions and recorded information, error messages appeared immediately when the recorded information did not match previously defined rules for that variable - for example, if the education level for a 12-year-old respondent was given as postgraduate. The second stage occurred during the review of the questionnaire by the field supervisor. The Survey Solutions software allows errors to remain in the data if the enumerator does not make a correction; the enumerator can write a comment to explain why the data appear to be incorrect - for example, if the previously mentioned 12-year-old was, in fact, a genius who had completed graduate studies. The next stage occurred when the data were transferred to headquarters, where the NSO staff would again review the data for errors and verify the comments from the
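    A consistency check of the kind described (flagging, say, a 12-year-old recorded as having postgraduate education) is straightforward to sketch in pandas; the column names here are hypothetical:

    ```python
    import pandas as pd

    roster = pd.DataFrame({
        "hhid": [101, 102, 103],
        "age": [34, 12, 45],
        "educ_level": ["secondary", "postgraduate", "primary"],
    })

    # Flag implausible combinations for review rather than auto-correcting them,
    # since a flagged value may be legitimate and merely need a comment.
    suspicious = roster[(roster["age"] < 16) & (roster["educ_level"] == "postgraduate")]
    print(suspicious)
    ```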

  11. Data from: Benchmark Datasets Incorporating Diverse Tasks, Sample Sizes, Material Systems, and Data Heterogeneity for Materials Informatics

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 6, 2021
    Cite
    Benchmark Datasets Incorporating Diverse Tasks, Sample Sizes, Material Systems, and Data Heterogeneity for Materials Informatics [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4903957
    Dataset provided by
    Henderson, N. Ashley
    Sparks, D. Taylor
    Kauwe, K. Steven
    Description

    This benchmark comprises 50 different datasets of materials properties obtained from 16 previous publications. The data contain both experimental and computational results, data suited for regression as well as classification, sizes ranging from 12 to 6,354 samples, and materials systems spanning the diversity of materials research. In addition to cleaning the data where necessary, each dataset was split into train, validation, and test splits.

    For datasets with more than 100 values, train-val-test splits were created with either a 5-fold or a 10-fold cross-validation method, depending on what each respective paper did in its study. Datasets with fewer than 100 values had train-test splits created using the leave-one-out cross-validation method.
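    A minimal sketch of the two splitting regimes with scikit-learn (the benchmark publishes its splits; this merely illustrates the methods named above):

    ```python
    import numpy as np
    from sklearn.model_selection import KFold, LeaveOneOut

    X = np.random.rand(120, 5)    # toy feature matrix

    # More than 100 samples: 5-fold (or 10-fold) cross-validation splits.
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        pass  # fit on X[train_idx], evaluate on X[test_idx]

    # Fewer than 100 samples: leave-one-out cross-validation.
    X_small = X[:40]
    for train_idx, test_idx in LeaveOneOut().split(X_small):
        pass
    ```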

    For further information, as well as directions on how to access the data, please go to the corresponding GitHub repository: https://github.com/anhender/mse_ML_datasets/tree/v1.0

  12. alpaca-cleaned

    • huggingface.co
    Updated Mar 30, 2023
    Cite
    alpaca-cleaned [Dataset]. https://huggingface.co/datasets/yahma/alpaca-cleaned
    Authors
    Gene Ruebsamen
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Dataset Card for Alpaca-Cleaned

    Repository: https://github.com/gururise/AlpacaDataCleaned

    Dataset Description

    This is a cleaned version of the original Alpaca Dataset released by Stanford. The following issues have been identified in the original release and fixed in this dataset:

    Hallucinations: Many instructions in the original dataset referenced data on the internet, which simply caused GPT-3 to hallucinate an answer.

    "instruction":"Summarize the… See the full description on the dataset page: https://huggingface.co/datasets/yahma/alpaca-cleaned.

  13. COVID-19 Case Surveillance Public Use Data

    • data.cdc.gov
    • data.virginia.gov
    Updated Jul 9, 2024
    Cite
    CDC Data, Analytics and Visualization Task Force (2024). COVID-19 Case Surveillance Public Use Data [Dataset]. https://data.cdc.gov/widgets/vbim-akqf
    Available download formats: json, application/rdfxml, csv, xml, tsv, application/rssxml
    Dataset provided by
    Centers for Disease Control and Prevention (http://www.cdc.gov/)
    Authors
    CDC Data, Analytics and Visualization Task Force
    License

    U.S. Government Works (https://www.usa.gov/government-works)

    Description

    Note: Reporting of new COVID-19 Case Surveillance data will be discontinued July 1, 2024, to align with the process of removing SARS-CoV-2 infections (COVID-19 cases) from the list of nationally notifiable diseases. Although these data will continue to be publicly available, the dataset will no longer be updated.

    Authorizations to collect certain public health data expired at the end of the U.S. public health emergency declaration on May 11, 2023. The following jurisdictions discontinued COVID-19 case notifications to CDC: Iowa (11/8/21), Kansas (5/12/23), Kentucky (1/1/24), Louisiana (10/31/23), New Hampshire (5/23/23), and Oklahoma (5/2/23). Please note that these jurisdictions will not routinely send new case data after the dates indicated. As of 7/13/23, case notifications from Oregon will only include pediatric cases resulting in death.

    This case surveillance public use dataset has 12 elements for all COVID-19 cases shared with CDC and includes demographics, any exposure history, disease severity indicators and outcomes, presence of any underlying medical conditions and risk behaviors, and no geographic data.

    CDC has three COVID-19 case surveillance datasets:

    The following apply to all three datasets:

    Overview

    The COVID-19 case surveillance database includes individual-level data reported to U.S. states and autonomous reporting entities, including New York City and the District of Columbia (D.C.), as well as U.S. territories and affiliates. On April 5, 2020, COVID-19 was added to the Nationally Notifiable Condition List and classified as “immediately notifiable, urgent (within 24 hours)” by a Council of State and Territorial Epidemiologists (CSTE) Interim Position Statement (Interim-20-ID-01). CSTE updated the position statement on August 5, 2020, to clarify the interpretation of antigen detection tests and serologic test results within the case classification (Interim-20-ID-02). The statement also recommended that all states and territories enact laws to make COVID-19 reportable in their jurisdiction, and that jurisdictions conducting surveillance should submit case notifications to CDC. COVID-19 case surveillance data are collected by jurisdictions and reported voluntarily to CDC.

    For more information: NNDSS Supports the COVID-19 Response | CDC.

    The deidentified data in the “COVID-19 Case Surveillance Public Use Data” include demographic characteristics, any exposure history, disease severity indicators and outcomes, clinical data, laboratory diagnostic test results, and presence of any underlying medical conditions and risk behaviors. All data elements can be found on the COVID-19 case report form located at www.cdc.gov/coronavirus/2019-ncov/downloads/pui-form.pdf.

    COVID-19 Case Reports

    COVID-19 case reports have been routinely submitted using nationally standardized case reporting forms. On April 5, 2020, CSTE released an Interim Position Statement with national surveillance case definitions for COVID-19 included. Current versions of these case definitions are available here: https://ndc.services.cdc.gov/case-definitions/coronavirus-disease-2019-2021/.

    All cases reported on or after that date were requested to be shared by public health departments to CDC using the standardized case definitions for laboratory-confirmed or probable cases. On May 5, 2020, the standardized case reporting form was revised. Case reporting using this new form is ongoing among U.S. states and territories.

    Data are Considered Provisional

    • The COVID-19 case surveillance data are dynamic; case reports can be modified at any time by the jurisdictions sharing COVID-19 data with CDC. CDC may update prior cases shared with CDC based on any updated information from jurisdictions. For instance, as new information is gathered about previously reported cases, health departments provide updated data to CDC. As more information and data become available, analyses might find changes in surveillance data and trends during a previously reported time window. Data may also be shared late with CDC due to the volume of COVID-19 cases.
    • Annual finalized data: To create the final NNDSS data used in the annual tables, CDC works carefully with the reporting jurisdictions to reconcile the data received during the year until each state or territorial epidemiologist confirms that the data from their area are correct.
    • Access Addressing Gaps in Public Health Reporting of Race and Ethnicity for COVID-19, a report from the Council of State and Territorial Epidemiologists, to better understand the challenges in completing race and ethnicity data for COVID-19 and recommendations for improvement.

    Data Limitations

    To learn more about the limitations in using case surveillance data, visit FAQ: COVID-19 Data and Surveillance.

    Data Quality Assurance Procedures

    CDC’s Case Surveillance Section routinely performs data quality assurance procedures (i.e., ongoing corrections and logic checks to address data errors). To date, the following data cleaning steps have been implemented (a sketch of the first two appears after the list):

    • Questions that have been left unanswered (blank) on the case report form are reclassified to a Missing value, if applicable to the question. For example, in the question “Was the individual hospitalized?” where the possible answer choices include “Yes,” “No,” or “Unknown,” the blank value is recoded to Missing because the case report form did not include a response to the question.
    • Logic checks are performed for date data. If an illogical date has been provided, CDC reviews the data with the reporting jurisdiction. For example, if a symptom onset date in the future is reported to CDC, this value is set to null until the reporting jurisdiction updates the date appropriately.
    • Additional data quality processing to recode free text data is ongoing. Data on symptoms, race and ethnicity, and healthcare worker status have been prioritized.
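    A minimal pandas sketch of the blank-to-Missing recoding and the date logic check; the column names hosp_yn and onset_dt follow the public use file's naming but should be checked against the actual download:

    ```python
    import pandas as pd

    cases = pd.read_csv("covid19_case_surveillance.csv")   # hypothetical file name

    # Blank answers to applicable questions are reclassified as Missing.
    cases["hosp_yn"] = cases["hosp_yn"].fillna("Missing")

    # Logic check on dates: onset dates in the future are set to null until the
    # reporting jurisdiction corrects them.
    onset = pd.to_datetime(cases["onset_dt"], errors="coerce")
    cases["onset_dt"] = onset.mask(onset > pd.Timestamp.today())
    ```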

    Data Suppression

    To prevent release of data that could be used to identify people, data cells are suppressed for low frequency (<5) records and indirect identifiers (e.g., date of first positive specimen). Suppression includes rare combinations of demographic characteristics (sex, age group, race/ethnicity). Suppressed values are re-coded to the NA answer option; records with data suppression are never removed.

    For questions, please contact Ask SRRG (eocevent394@cdc.gov).

    Additional COVID-19 Data

    COVID-19 data are available to the public as summary or aggregate count files, including total counts of cases and deaths by state and by county. These

  14. Climate Change: Earth Surface Temperature Data

    • kaggle.com
    • redivis.com
    Updated May 1, 2017
    Cite
    Berkeley Earth (2017). Climate Change: Earth Surface Temperature Data [Dataset]. https://www.kaggle.com/berkeleyearth/climate-change-earth-suRFace-temperature-data/kernels
    Available download formats: zip (88,843,537 bytes)
    Dataset authored and provided by
    Berkeley Earth (http://berkeleyearth.org/)
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
    License information was derived automatically

    Area covered
    Earth
    Description

    Some say climate change is the biggest threat of our age while others say it’s a myth based on dodgy science. We are turning some of the data over to you so you can form your own view.


    Even more than with other data sets that Kaggle has featured, there’s a huge amount of data cleaning and preparation that goes into putting together a long-term study of climate trends. Early data was collected by technicians using mercury thermometers, where any variation in the visit time impacted measurements. In the 1940s, the construction of airports caused many weather stations to be moved. In the 1980s, there was a move to electronic thermometers that are said to have a cooling bias.

    Given this complexity, there are a range of organizations that collate climate trends data. The three most cited land and ocean temperature data sets are NOAA’s MLOST, NASA’s GISTEMP and the UK’s HadCRUT.

    We have repackaged the data from a newer compilation put together by Berkeley Earth, which is affiliated with Lawrence Berkeley National Laboratory. The Berkeley Earth Surface Temperature Study combines 1.6 billion temperature reports from 16 pre-existing archives. It is nicely packaged and allows for slicing into interesting subsets (for example, by country). They publish the source data and the code for the transformations they applied. They also use methods that allow weather observations from shorter time series to be included, meaning fewer observations need to be thrown away.

    In this dataset, we have included several files:

    Global Land and Ocean-and-Land Temperatures (GlobalTemperatures.csv):

    • Date: starts in 1750 for average land temperature and 1850 for max and min land temperatures and global ocean and land temperatures
    • LandAverageTemperature: global average land temperature in celsius
    • LandAverageTemperatureUncertainty: the 95% confidence interval around the average
    • LandMaxTemperature: global average maximum land temperature in celsius
    • LandMaxTemperatureUncertainty: the 95% confidence interval around the maximum land temperature
    • LandMinTemperature: global average minimum land temperature in celsius
    • LandMinTemperatureUncertainty: the 95% confidence interval around the minimum land temperature
    • LandAndOceanAverageTemperature: global average land and ocean temperature in celsius
    • LandAndOceanAverageTemperatureUncertainty: the 95% confidence interval around the global average land and ocean temperature

    Other files include:

    • Global Average Land Temperature by Country (GlobalLandTemperaturesByCountry.csv)
    • Global Average Land Temperature by State (GlobalLandTemperaturesByState.csv)
    • Global Land Temperatures By Major City (GlobalLandTemperaturesByMajorCity.csv)
    • Global Land Temperatures By City (GlobalLandTemperaturesByCity.csv)

    The raw data comes from the Berkeley Earth data page.
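    A minimal pandas sketch for loading the main file, assuming the date column is named dt as in the Kaggle release:

    ```python
    import pandas as pd

    gt = pd.read_csv("GlobalTemperatures.csv", parse_dates=["dt"])

    # Land-only averages start in 1750; ocean-and-land series start in 1850.
    annual = (gt.set_index("dt")["LandAverageTemperature"]
                .resample("YS").mean())        # monthly -> annual means
    print(annual.loc["1850":].head())
    ```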

  15. Enterprise Survey 2009-2014, Panel Data - Malawi

    • microdata.worldbank.org
    • catalog.ihsn.org
    Updated Oct 7, 2015
    Cite
    World Bank (2015). Enterprise Survey 2009-2014, Panel Data - Malawi [Dataset]. https://microdata.worldbank.org/index.php/catalog/2360
    Dataset authored and provided by
    World Bank (http://worldbank.org/)
    Time period covered
    2009 - 2014
    Area covered
    Malawi
    Description

    Abstract

    The documented dataset covers Enterprise Survey (ES) panel data collected in Malawi in 2009 and 2014 as part of the Africa Enterprise Surveys roll-out, an initiative of the World Bank.

    New Enterprise Surveys target a sample consisting of longitudinal (panel) observations and new cross-sectional data. Panel firms are prioritized in the sample selection, comprising up to 50% of the sample in the current wave. For all panel firms, regardless of the sample, current eligibility or operating status is determined and included in panel datasets.

    Malawi ES 2014 was conducted between April 2014 and February 2015, Malawi ES 2009 was carried out in May - July 2009. The objective of the Enterprise Survey is to obtain feedback from enterprises on the state of the private sector as well as to help in building a panel of enterprise data that will make it possible to track changes in the business environment over time, thus allowing, for example, impact assessments of reforms. Through interviews with firms in the manufacturing and services sectors, the survey assesses the constraints to private sector growth and creates statistically significant business environment indicators that are comparable across countries.

    Stratified random sampling was used to select the surveyed businesses. The data was collected using face-to-face interviews.

    Data from 673 establishments were analyzed: 436 businesses appear in the 2014 ES only, 63 in the 2009 ES only, and 174 firms in both the 2009 and 2014 panels.

    The standard Enterprise Survey topics include firm characteristics, gender participation, access to finance, annual sales, costs of inputs and labor, workforce composition, bribery, licensing, infrastructure, trade, crime, competition, capacity utilization, land and permits, taxation, informality, business-government relations, innovation and technology, and performance measures. Over 90 percent of the questions objectively measure characteristics of a country’s business environment. The remaining questions assess the survey respondents’ opinions on what are the obstacles to firm growth and performance.

    Geographic coverage

    National

    Analysis unit

    The primary sampling unit of the study is an establishment. An establishment is a physical location where business is carried out and where industrial operations take place or services are provided. A firm may be composed of one or more establishments. For example, a brewery may have several bottling plants and several establishments for distribution. For the purposes of this survey an establishment must make its own financial decisions and have its own financial statements separate from those of the firm. An establishment must also have its own management and control over its payroll.

    Universe

    The whole population, or the universe, covered in the Enterprise Surveys is the non-agricultural private economy. It comprises: all manufacturing sectors according to the ISIC Revision 3.1 group classification (group D), construction sector (group F), services sector (groups G and H), and transport, storage, and communications sector (group I). Note that this population definition excludes the following sectors: financial intermediation (group J), real estate and renting activities (group K, except sub-sector 72, IT, which was added to the population under study), and all public or utilities sectors. Companies with 100% government ownership are not eligible to participate in the Enterprise Surveys.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    For the Malawi ES, multiple sample frames were used: a sample frame was built using data compiled from local and municipal business registries. Because the 2009 survey round used different stratification criteria, the presence of panel firms was limited to a maximum of 50% of the achieved interviews in each stratum. That sample is referred to as the panel.

    Mode of data collection

    Face-to-face [f2f]

    Research instrument

    The following survey instruments were used for Malawi ES 2009 and 2014:

    • Manufacturing Module Questionnaire
    • Services Module Questionnaire

    The survey is fielded via manufacturing or services questionnaires so that firms are not asked questions irrelevant to their activities; e.g., a question about production and nonproduction workers should not be asked of a retail firm. In addition to questions that are asked across countries, all surveys are customized and contain country-specific questions. An example of customization would be including tourism-related questions in countries where tourism is an existing or potential sector of economic growth. There is a skip pattern in the Services Module Questionnaire for questions that apply only to retail firms.

    Cleaning operations

    Data entry and quality controls are implemented by the contractor and data is delivered to the World Bank in batches (typically 10%, 50% and 100%). These data deliveries are checked for logical consistency, out of range values, skip patterns, and duplicate entries. Problems are flagged by the World Bank and corrected by the implementing contractor through data checks, callbacks, and revisiting establishments.

    Response rate

    Survey non-response must be differentiated from item non-response. The former refers to refusals to participate in the survey altogether, whereas the latter refers to refusals to answer specific questions. Enterprise Surveys suffer from both problems, and different strategies were used to address each.

    Item non-response was addressed by two strategies: (a) for sensitive questions that may generate negative reactions from the respondent, such as corruption or tax evasion, enumerators were instructed to record "Refusal to respond" (-8) as an option distinct from "Don't know" (-9); (b) establishments with incomplete information were re-contacted in order to complete this information, whenever necessary.
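
    As an illustration of how the two special codes might be handled downstream, here is a hedged pandas sketch; the column name bribery_pct is hypothetical, and only the -8/-9 convention comes from the text above.

    import numpy as np
    import pandas as pd

    # Hypothetical column; -8 = "Refusal to respond", -9 = "Don't know".
    df = pd.DataFrame({"bribery_pct": [3, -8, 0, -9, 12]})

    # Keep refusals distinguishable from "don't know" before masking both as missing.
    df["bribery_refused"] = df["bribery_pct"].eq(-8)
    df["bribery_pct"] = df["bribery_pct"].where(df["bribery_pct"] >= 0, np.nan)
    print(df)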

    Survey non-response was addressed by maximizing efforts to contact establishments that were initially selected for interview. Attempts were made to contact the establishment for interview at different times/days of the week before a replacement establishment (with similar strata characteristics) was suggested for interview. Survey non-response did occur but substitutions were made in order to potentially achieve strata-specific goals.

  16. wikipedia

    • tensorflow.org
    • huggingface.co
    Updated Aug 9, 2019
    Cite
    (2019). wikipedia [Dataset]. https://www.tensorflow.org/datasets/catalog/wikipedia
    Explore at:
    Dataset updated
    Aug 9, 2019
    Description

    Wikipedia dataset containing cleaned articles of all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article with cleaning to strip markdown and unwanted sections (references, etc.).

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('wikipedia', split='train')
    for ex in ds.take(4):
        print(ex)
    

    See the guide for more information on tensorflow_datasets.
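
    To pick a single language split rather than the default, the TFDS catalog exposes per-language configs; the config name below ('20190301.en') and the 'title'/'text' feature names are assumptions based on the 2019-era catalog.

    import tensorflow_datasets as tfds

    # Assumed config naming: '<dump-date>.<language>' (e.g., English dump of 2019-03-01).
    ds = tfds.load('wikipedia/20190301.en', split='train')
    for ex in ds.take(1):
        print(ex['title'])  # 'title' and 'text' features are assumed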

  17. Data from: Twitter historical dataset: March 21, 2006 (first tweet) to July...

    • live.european-language-grid.eu
    • portalinvestigacion.uniovi.es
    • +2more
    Updated Mar 21, 2006
    Cite
    (2006). Twitter historical dataset: March 21, 2006 (first tweet) to July 31, 2009 (3 years, 1.5 billion tweets) [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/967
    Explore at:
    Dataset updated
    Mar 21, 2006
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Disclaimer: This dataset is distributed by Daniel Gayo-Avello, an associate professor at the Department of Computer Science of the University of Oviedo, for the sole purpose of non-commercial research; it includes only tweet IDs.

    The dataset contains tweet IDs for all the tweets published (in any language) between March 21, 2006 and July 31, 2009, thus comprising Twitter's first three full years from its creation, about 1.5 billion tweets (see file Twitter-historical-20060321-20090731.zip).

    It covers several defining developments in Twitter, such as the invention of hashtags, retweets, and trending topics, and it includes tweets related to the 2008 US presidential elections, Barack Obama's first inauguration speech, and the 2009 Iran election protests (one of the so-called Twitter Revolutions).

    Finally, it contains tweets in many major languages (mainly English, Portuguese, Japanese, Spanish, German, and French), so it should be possible, at least in theory, to analyze international events from different cultural perspectives.

    The dataset was completed in November 2016 and, therefore, the tweet IDs it contains were publicly available at that moment. This means that there could be tweets that were public during that period but do not appear in the dataset, and also that a substantial share of the tweets in the dataset have been deleted (or locked) since 2016.

    To make the decay of tweet IDs in the dataset easier to understand, a number of representative samples (99% confidence level and ±0.5 confidence interval) are provided.

    In general terms, 85.5% ±0.5 of the historical tweets were available as of May 19, 2020 (see file Twitter-historical-20060321-20090731-sample.txt). However, since the number of tweets varies greatly throughout the three-year period covered by the dataset, additional representative samples are provided for 90-day intervals (see the file 90-day-samples.zip).
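
    The sample sizes behind those figures can be reproduced with Cochran's formula plus a finite-population correction; reading "99% confidence level and ±0.5 confidence interval" as z = 2.576 and a ±0.5 percentage-point margin is an interpretation, not something the dataset states.

    from math import ceil

    def sample_size(population, z=2.576, margin=0.005, p=0.5):
        # Cochran's formula with finite-population correction.
        n0 = (z ** 2) * p * (1 - p) / margin ** 2
        return ceil(n0 / (1 + (n0 - 1) / population))

    print(sample_size(5_512))        # smallest 90-day interval
    print(sample_size(589_116_341))  # largest interval: roughly 66,000 IDs to check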

    In that regard, the ratio of publicly available tweets (as of May 19, 2020) is as follows:

    March 21, 2006 to June 18, 2006: 88.4% ±0.5 (from 5,512 tweets).

    June 18, 2006 to September 16, 2006: 82.7% ±0.5 (from 14,820 tweets).

    September 16, 2006 to December 15, 2006: 85.7% ±0.5 (from 107,975 tweets).

    December 15, 2006 to March 15, 2007: 88.2% ±0.5 (from 852,463 tweets).

    March 15, 2007 to June 13, 2007: 89.6% ±0.5 (from 6,341,665 tweets).

    June 13, 2007 to September 11, 2007: 88.6% ±0.5 (from 11,171,090 tweets).

    September 11, 2007 to December 10, 2007: 87.9% ±0.5 (from 15,545,532 tweets).

    December 10, 2007 to March 9, 2008: 89.0% ±0.5 (from 23,164,663 tweets).

    March 9, 2008 to June 7, 2008: 66.5% ±0.5 (from 56,416,772 tweets; see below for more details on this).

    June 7, 2008 to September 5, 2008: 78.3% ±0.5 (from 62,868,189 tweets; see below for more details on this).

    September 5, 2008 to December 4, 2008: 87.3% ±0.5 (from 89,947,498 tweets).

    December 4, 2008 to March 4, 2009: 86.9% ±0.5 (from 169,762,425 tweets).

    March 4, 2009 to June 2, 2009: 86.4% ±0.5 (from 474,581,170 tweets).

    June 2, 2009 to July 31, 2009: 85.7% ±0.5 (from 589,116,341 tweets).

    The apparent drop in available tweets from March 9, 2008 to September 5, 2008 has an easy, although embarrassing, explanation.

    At the moment of cleaning all the data to publish this dataset there seemed to be a gap between April 1, 2008 and July 7, 2008 (actually, the data was not missing but in a different backup). Since tweet IDs are easy to regenerate for that Twitter era (source code is provided in generate-ids.m), I simply produced all those that were created between those two dates. All those tweets actually existed, but a number of them were obviously private and not crawlable. For those regenerated IDs the actual ratio of public tweets (as of May 19, 2020) is 62.3% ±0.5.

    In other words, what you see in that period (April to July 2008) is not actually a huge number of deleted tweets but a combination of deleted and non-public ones (whose IDs would ideally not be in the dataset, for performance reasons when rehydrating it).

    Additionally, given that not everybody will need the whole period of time, the earliest tweet ID for each date is provided in the file date-tweet-id.tsv.
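
    A minimal sketch of how such a file could be used to slice the dataset by date is given below; it assumes date-tweet-id.tsv holds two tab-separated columns (an ISO date and the earliest tweet ID for that date), which is a guess at the layout since only the file name is documented.

    import csv

    def id_bounds(path, start_date, end_date):
        # Map each date to its earliest tweet ID (layout assumed, see above).
        bounds = {}
        with open(path, newline="") as f:
            for date, tweet_id in csv.reader(f, delimiter="\t"):
                bounds[date] = int(tweet_id)
        return bounds[start_date], bounds[end_date]

    lo, hi = id_bounds("date-tweet-id.tsv", "2008-04-01", "2008-07-07")
    # Pre-Snowflake tweet IDs were sequential, which is what made the
    # regeneration described above possible: every integer in [lo, hi) is a
    # candidate tweet ID for that window.
    candidate_ids = range(lo, hi)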

    For additional details regarding this dataset please see: Gayo-Avello, Daniel. "How I Stopped Worrying about the Twitter Archive at the Library of Congress and Learned to Build a Little One for Myself." arXiv preprint arXiv:1611.08144 (2016).

    If you use this dataset in any way please cite that preprint (in addition to the dataset itself).

    If you need to contact me you can find me as @PFCdgayo in Twitter.

  18. Multi-Sensor Voice Command Dataset

    • rdr.kuleuven.be
    text/markdown, zip
    Updated Mar 24, 2025
    Cite
    Manuele Rusci; Hugo Van hamme; Tinne Tuytelaars (2025). Multi-Sensor Voice Command Dataset [Dataset]. http://doi.org/10.48804/IEKKVZ
    Explore at:
    Available download formats: zip (2017734694), text/markdown (4949)
    Dataset updated
    Mar 24, 2025
    Dataset provided by
    KU Leuven RDR
    Authors
    Manuele Rusci; Hugo Van hamme; Tinne Tuytelaars
    License

    https://rdr.kuleuven.be/api/datasets/:persistentId/versions/1.2/customlicense?persistentId=doi:10.48804/IEKKVZ

    Dataset funded by
    European Commission
    Description

    The repository includes the audio files recorded from a wireless audio sensor network of 4 sensors. A total of 20 volunteers were instructed to repeat five English voice commands ("Lights On", "Lights Off", "Music On", "Music Stop", "Next Song") from three different positions, for a total of 15 recordings per keyword; in a few cases the number of samples drops to 14 after manual data cleaning. Additionally, 15 generic spoken utterances per speaker (e.g., "Set an alarm to 7am") were recorded for use as negative examples. Given a duration of 3 seconds per utterance, this amounts to a total of 1.5 hours of audio per sensor.

    The negative data were further augmented with 4.4 hours of recordings obtained by replaying, through a set of loudspeakers, the audio files from the test-clean and dev-clean sets of Librispeech. These files are available from https://www.openslr.org/12 (dev-clean and test-clean repositories) and are freely distributed under the CC-BY-4.0 license. Every recording was limited to 3 seconds, leading to a total of 5.9 hours of audio per sensor in the multi-sensor dataset.

    This dataset supports the investigation of new speech recognition algorithms for audio recorded by a network of ultra-low-power smart audio sensors. The target application is voice command recognition, also known as keyword spotting: the algorithm must recognize a voice command (e.g., "Lights On") and distinguish it from other voice commands or other audio tracks, i.e., the negative data. In a wireless audio sensor network scenario, the speech recognition algorithms are fed with the audio data recorded by multiple sensors located in the environment.
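
    The per-sensor totals above can be checked with a quick calculation; every figure below is taken from the description itself.

    speakers = 20
    commands = 5
    reps_per_command = 15        # per speaker, across the three positions
    negatives_per_speaker = 15
    seconds_per_utterance = 3

    utterances = speakers * (commands * reps_per_command + negatives_per_speaker)
    recorded_hours = utterances * seconds_per_utterance / 3600
    print(recorded_hours)        # 1.5 hours recorded per sensor
    print(recorded_hours + 4.4)  # 5.9 hours per sensor including Librispeech replays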

  19. Data supporting phosphorus load-reduction estimates from leaf-litter removal...

    • s.cnmilf.com
    • data.usgs.gov
    • +2more
    Updated Jul 20, 2024
    Cite
    U.S. Geological Survey (2024). Data supporting phosphorus load-reduction estimates from leaf-litter removal in central and northwestern Vermont [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/data-supporting-phosphorus-load-reduction-estimates-from-leaf-litter-removal-in-central-an
    Explore at:
    Dataset updated
    Jul 20, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Vermont
    Description

    Removal of leaf-litter may help municipalities reduce phosphorus loads. Catch-basin cleaning and street cleaning are two commonly used Best Management Practices that could be modified to remove leaves and qualify for additional load-reduction credits. This Data Release contains four tab-delimited .txt files with additional information about the study area, the characteristics of municipal street solids, and load-reduction estimates from increased catch-basin and street-cleaning practices that are not available in the associated report. It also contains a compressed file, "EngBrk_ModelArchive.7z", which archives the model developed and used for the project.

    The four .txt files are:

    • "VT_LU_data.txt": the area, in acres and percent, for each land-use type within the nine participating municipalities and within the Englesby Brook basin (based on the National Oceanic and Atmospheric Administration's (NOAA) 2006 C-CAP Regional Land Cover);
    • "CB_sample_characteristics.txt": the physical characteristics of samples collected from piles of catch-basin solids for the nine municipalities in this study (September to November 2017 and April to November 2018);
    • "SC_sample_characteristics.txt": the physical characteristics of samples collected from piles of street-cleaning solids for the nine municipalities in this study (September to November 2017 and April to November 2018);
    • "Estimated P-load reductions.txt": estimated phosphorus load-reduction credits, by individual Soil Water Assessment Tool (SWAT) drainage area, for street cleaning and for street cleaning with leaf management practices in the seven participating Municipal Separate Storm Sewer Systems (MS4s) municipalities in northwestern Vermont. The cities of Barre and Montpelier currently do not have to meet MS4 permit requirements.

    Information from NOAA's 2006 C-CAP Regional Land Cover (https://data.noaa.gov/dataset/dataset/noaas-coastal-change-analysis-program-c-cap-2006-regional-land-cover-data-coastal-united-state1) and the Vermont Center for Geographic Information (https://vcgi.vermont.gov/) was used to characterize land use within each of the nine municipalities in the central and northwestern Vermont study area and within the partially urbanized Englesby Brook basin, located in Burlington and South Burlington, Vermont, which drains into Lake Champlain. The compressed file "EngBrk_ModelArchive.7z" is the model archive; it contains 31 files associated with the Englesby Brook model built with the Source Loading and Management Model for Windows (WinSLAMM), version 10.4.0.
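
    Since the four tables are plain tab-delimited text, loading them requires nothing beyond the delimiter; the sketch below assumes only that, and inspects rather than assumes the column names.

    import pandas as pd

    # Tab-delimited per the Data Release; print the columns instead of guessing them.
    lu = pd.read_csv("VT_LU_data.txt", sep="\t")
    print(lu.columns.tolist())
    print(lu.head())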

  20. DBPedia Classes

    • kaggle.com
    Updated Jul 4, 2019
    Cite
    Dan Ofer (2019). DBPedia Classes [Dataset]. https://www.kaggle.com/danofer/dbpedia-classes/discussion
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 4, 2019
    Dataset provided by
    Kaggle
    Authors
    Dan Ofer
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    DBpedia (from "DB" for "database") is a project aiming to extract structured content from the information created in Wikipedia. This is an extract of the data (after cleaning; kernel included) that provides taxonomic, hierarchical categories ("classes") for 342,782 Wikipedia articles. There are 3 levels, with 9, 70, and 219 classes respectively. A version of this dataset is a popular baseline for NLP/text classification tasks. This version of the dataset is much tougher, especially if the L2/L3 levels are used as the targets.

    This is an excellent benchmark for hierarchical multiclass/multilabel text classification. Some example approaches are included as code snippets.
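
    As one hypothetical starting point (not one of the included snippets), a flat TF-IDF baseline per level could look like the following; the column names 'text' and 'l1'/'l2'/'l3' (the three class levels) and the file name are assumptions about the CSV layout.

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    train = pd.read_csv("DBPEDIA_train.csv")  # file name assumed

    model = make_pipeline(
        TfidfVectorizer(max_features=50_000, ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    )
    model.fit(train["text"], train["l1"])  # 9-class level-1 task
    # Swap in train["l2"] (70 classes) or train["l3"] (219) for the harder targets.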

    Content

    DBPedia dataset with multiple levels of hierarchy/classes, as a multiclass dataset. Original DBPedia ontology (triplets data): https://wiki.dbpedia.org/develop/datasets Listing of the class tree/taxonomy: http://mappings.dbpedia.org/server/ontology/classes/

    Acknowledgements

    Thanks to the Wikimedia foundation for creating Wikipedia, DBPedia and associated open-data goodness!

    Thanks to my colleagues at Sparkbeyond (https://www.sparkbeyond.com) for pointing me towards the taxonomic version of this dataset (as opposed to the classic 14-class version)

    Inspiration
