81 datasets found
  1. Data Cleaning Sample

    • borealisdata.ca
    • dataone.org
    Updated Jul 13, 2023
    Cite
    Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177
    Dataset updated
    Jul 13, 2023
    Dataset provided by
    Borealis
    Authors
    Rong Luo
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Sample data for exercises in Further Adventures in Data Cleaning.

  2. Data_Sheet_6_“R” U ready?: a case study using R to analyze changes in gene...

    • frontiersin.figshare.com
    docx
    Updated Mar 22, 2024
    + more versions
    Cite
    Amy E. Pomeroy; Andrea Bixler; Stefanie H. Chen; Jennifer E. Kerr; Todd D. Levine; Elizabeth F. Ryder (2024). Data_Sheet_6_“R” U ready?: a case study using R to analyze changes in gene expression during evolution.docx [Dataset]. http://doi.org/10.3389/feduc.2024.1379910.s006
    Available download formats: docx
    Dataset updated
    Mar 22, 2024
    Dataset provided by
    Frontiers
    Authors
    Amy E. Pomeroy; Andrea Bixler; Stefanie H. Chen; Jennifer E. Kerr; Todd D. Levine; Elizabeth F. Ryder
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    As high-throughput methods become more common, training undergraduates to analyze data must include having them generate informative summaries of large datasets. This flexible case study provides an opportunity for undergraduate students to become familiar with the capabilities of R programming in the context of high-throughput evolutionary data collected using macroarrays. The story line introduces a recent graduate hired at a biotech firm and tasked with analysis and visualization of changes in gene expression from 20,000 generations of the Lenski Lab’s Long-Term Evolution Experiment (LTEE). Our main character is not familiar with R and is guided by a coworker to learn about this platform. Initially this involves a step-by-step analysis of the small Iris dataset built into R which includes sepal and petal length of three species of irises. Practice calculating summary statistics and correlations, and making histograms and scatter plots, prepares the protagonist to perform similar analyses with the LTEE dataset. In the LTEE module, students analyze gene expression data from the long-term evolutionary experiments, developing their skills in manipulating and interpreting large scientific datasets through visualizations and statistical analysis. Prerequisite knowledge is basic statistics, the Central Dogma, and basic evolutionary principles. The Iris module provides hands-on experience using R programming to explore and visualize a simple dataset; it can be used independently as an introduction to R for biological data or skipped if students already have some experience with R. Both modules emphasize understanding the utility of R, rather than creation of original code. Pilot testing showed the case study was well-received by students and faculty, who described it as a clear introduction to R and appreciated the value of R for visualizing and analyzing large datasets.

  3. Data and code for "Plastic bag bans and fees reduce harmful bag litter on...

    • openicpsr.org
    delimited
    Updated Apr 14, 2024
    Cite
    Anna Papp; Kimberly Oremus (2024). Data and code for "Plastic bag bans and fees reduce harmful bag litter on shorelines" [Dataset]. http://doi.org/10.3886/E200661V3
    Available download formats: delimited
    Dataset updated
    Apr 14, 2024
    Dataset provided by
    Columbia University
    University of Delaware
    Authors
    Anna Papp; Kimberly Oremus
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Code and data for "Plastic bag bans and fees reduce harmful bag litter on shorelines" by Anna Papp and Kimberly Oremus. Please see the included README file for details.

    This folder includes code and data to fully replicate Figures 1-5. In addition, the folder includes instructions to rerun the data cleaning steps. Last modified: March 6, 2025. For any questions, please reach out to ap3907@columbia.edu.

    Code (replication/code):

    To replicate the main figures, run the file for each main figure:
    - 1_figure1.R
    - 1_figure2.R
    - 1_figure3.R
    - 1_figure4.R
    - 1_figure5.R

    Update the home directory to match where the directory is saved ("replication" folder) in this file before running it. The code will require you to install packages (see note on versions below).

    To replicate the entire data cleaning pipeline:
    - First download all required data (explained in the Data section below).
    - Run the code in the code/0_setup folder (refer to the separate README file).

    R Version and Package Versions:

    The project was developed and executed using:
    - R version: 4.0.0 (2024-04-24)
    - Platform: macOS 13.5

    Code was developed and main figures were created using the following versions:
    - data.table: 1.14.2
    - dplyr: 1.1.4
    - readr: 2.1.2
    - tidyr: 1.2.0
    - broom: 0.7.12
    - stringr: 1.5.1
    - lubridate: 1.7.9
    - raster: 3.5.15
    - sf: 1.0.7
    - readxl: 1.4.0
    - cobalt: 4.4.1.9002
    - spdep: 1.2.3
    - ggplot2: 3.4.4
    - PNWColors: 0.1.0
    - grid: 4.0.0
    - gridExtra: 2.3
    - ggpubr: 0.4.0
    - knitr: 1.48
    - zoo: 1.8.12
    - fixest: 0.11.2
    - lfe: 2.8.7.1
    - did: 2.1.2
    - didimputation: 0.3.0
    - DIDmultiplegt: 0.1.0
    - DIDmultiplegtDYN: 1.0.15
    - scales: 1.2.1
    - usmap: 0.6.1
    - tigris: 2.0.1
    - dotwhisker: 0.7.4

    Data:

    Processed data files are provided to replicate the main figures. To replicate from raw data, follow the instructions below.

    Policies (needs to be recreated or email for version): Compiled from bagtheban.com/in-your-state/, rila.org/retail-compliance-center/consumer-bag-legislation, baglaws.com, nicholasinstitute.duke.edu/plastics-policy-inventory, and wikipedia.org/wiki/Plastic_bag_bans_in_the_United_States; and massgreen.org/plastic-bag-legislation.html and cawrecycles.org/list-of-local-bag-bans to confirm legislation in Massachusetts and California.

    TIDES (needs to be downloaded for full replication): Download cleanup data for the United States from Ocean Conservancy (coastalcleanupdata.org/reports). Download files for 2000-2009, 2010-2014, and then each separate year from 2015 until 2023. Save files in the data/tides directory, as year.csv (and 2000-2009.csv, 2010-2014.csv). Also download entanglement data for each year (2016-2023) separately in a file called data/tides/entanglement (each file should be called 'entangled-animals-united-states_YEAR.csv').

    Shapefiles (needs to be downloaded for full replication): Download shapefiles for processing cleanups and policies. Download county shapefiles from the US Census Bureau; save files in the data/shapefiles directory, county shapefile should be in folder called county (files called cb_2018_us_county_500k.shp). Download TIGER Zip Code tabulation areas from the US Census Bureau (through data.gov); save files in the data/shapefiles directory, zip codes shapefile folder and files should be called tl_2019_us_zcta510.

    Other: Helper files with US county and state fips codes, lists of US counties and zip codes in data/other directory, provided in the directory except as follows. Download zip code list and 2020 IRS population data from United States zip codes and save as uszipcodes.csv in data/other directory. Download demographic characteristics of zip codes from Social Explorer and save as raw_zip_characteristics.csv in data/other directory.

    Refer to the .txt files in each data folder to ensure all necessary files are downloaded.
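
    The file-naming rules for the TIDES downloads can be checked before running the setup code. The following is a minimal Python sketch, not part of the replication package (which is written in R). The function names are hypothetical, and the sketch assumes that data/tides/entanglement is a folder and that the path to the "replication" folder is supplied by the caller.

```python
from pathlib import Path


def expected_tides_files(first_year=2015, last_year=2023):
    """File names the description asks for in data/tides."""
    names = ["2000-2009.csv", "2010-2014.csv"]
    names += [f"{year}.csv" for year in range(first_year, last_year + 1)]
    return names


def expected_entanglement_files(first_year=2016, last_year=2023):
    """File names the description asks for in data/tides/entanglement."""
    return [
        f"entangled-animals-united-states_{year}.csv"
        for year in range(first_year, last_year + 1)
    ]


def missing_files(replication_root):
    """Return the expected TIDES paths that do not exist under the given root."""
    tides = Path(replication_root) / "data" / "tides"
    wanted = [tides / name for name in expected_tides_files()]
    # Assumption: 'entanglement' is a subfolder of data/tides.
    wanted += [tides / "entanglement" / name for name in expected_entanglement_files()]
    return [path for path in wanted if not path.exists()]
```

    Calling missing_files("replication") returns an empty list once every expected download is in place.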

  4. Data cleaning EVI2

    • figshare.com
    txt
    Updated May 13, 2019
    Cite
    Geraldine Klarenberg (2019). Data cleaning EVI2 [Dataset]. http://doi.org/10.6084/m9.figshare.5327527.v1
    Available download formats: txt
    Dataset updated
    May 13, 2019
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Geraldine Klarenberg
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Scripts to clean EVI2 data obtained from the VIP lab (University of Arizona) website (https://vip.arizona.edu/about.php and https://vip.arizona.edu/viplab_data_explorer.php). Data obtained in 2012.
    - outlier detection and removal/replacement
    - alignment of 2 periods

    The manuscript detailing the methods and resulting data sets has been accepted for publication in Nature Scientific Data (05/11/2019).

    Instructions: use the R Markdown html file for instructions!

    Code last manipulated and tested in R 3.4.3 ("Kite-Eating Tree").

  5. Reddit r/AskScience Flair Dataset

    • data.mendeley.com
    Updated May 23, 2022
    + more versions
    Cite
    Sumit Mishra (2022). Reddit r/AskScience Flair Dataset [Dataset]. http://doi.org/10.17632/k9r2d9z999.3
    Dataset updated
    May 23, 2022
    Authors
    Sumit Mishra
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Reddit is a social news, content rating and discussion website. It's one of the most popular sites on the internet. Reddit has 52 million daily active users and approximately 430 million users who use it once a month. Reddit has different subreddits, and here we'll use the r/AskScience subreddit.

    The dataset is extracted from the subreddit /r/AskScience on Reddit. The data was collected between 01-01-2016 and 20-05-2022. It contains 612,668 data points and 25 columns. It includes information about the questions asked on the subreddit, the description of the submission, the flair of the question, NSFW or SFW status, the year of the submission, and more (see the descriptions of individual columns below). The data is extracted using Python and Pushshift's API. A little cleaning is done using NumPy and pandas as well.

    The dataset contains the following columns and descriptions:
    author - Redditor Name
    author_fullname - Redditor Full name
    contest_mode - Contest mode [implement obscured scores and randomized sorting].
    created_utc - Time the submission was created, represented in Unix Time.
    domain - Domain of submission.
    edited - If the post is edited or not.
    full_link - Link of the post on the subreddit.
    id - ID of the submission.
    is_self - Whether or not the submission is a self post (text-only).
    link_flair_css_class - CSS Class used to identify the flair.
    link_flair_text - Flair on the post or The link flair’s text content.
    locked - Whether or not the submission has been locked.
    num_comments - The number of comments on the submission.
    over_18 - Whether or not the submission has been marked as NSFW.
    permalink - A permalink for the submission.
    retrieved_on - time ingested.
    score - The number of upvotes for the submission.
    description - Description of the Submission.
    spoiler - Whether or not the submission has been marked as a spoiler.
    stickied - Whether or not the submission is stickied.
    thumbnail - Thumbnail of Submission.
    question - Question Asked in the Submission.
    url - The URL the submission links to, or the permalink if a self post.
    year - Year of the Submission.
    banned - Banned by the moderator or not.

    This dataset can be used for Flair Prediction, NSFW Classification, and different Text Mining/NLP tasks. Exploratory Data Analysis can also be done to get the insights and see the trend and patterns over the years.
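
    Because created_utc is stored as Unix time, a common first step in exploratory analysis is converting it to a calendar date. The sketch below uses only the Python standard library. The function names are hypothetical, and treating the stated collection dates (01-01-2016 to 20-05-2022, read as day-month-year) as an inclusive UTC window is an assumption, not something the dataset description confirms.

```python
from datetime import datetime, timezone

# Assumption: the collection window is inclusive and expressed in UTC.
WINDOW_START = datetime(2016, 1, 1, tzinfo=timezone.utc)
WINDOW_END = datetime(2022, 5, 20, 23, 59, 59, tzinfo=timezone.utc)


def to_datetime(created_utc):
    """Convert a Unix timestamp (seconds) to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(int(created_utc), tz=timezone.utc)


def submission_year(created_utc):
    """Year of the submission, comparable to the dataset's 'year' column."""
    return to_datetime(created_utc).year


def in_collection_window(created_utc):
    """True if the timestamp falls inside the assumed collection window."""
    return WINDOW_START <= to_datetime(created_utc) <= WINDOW_END
```

    A row whose created_utc falls outside the window, or whose derived year disagrees with the year column, would be worth inspecting during cleaning.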

  6. Writing Clean Code in R Workshop

    • qubeshub.org
    Updated Oct 15, 2019
    Cite
    Max Joseph; Leah Wasser (2019). Writing Clean Code in R Workshop [Dataset]. https://qubeshub.org/publications/1442
    Dataset updated
    Oct 15, 2019
    Dataset provided by
    QUBES
    Authors
    Max Joseph; Leah Wasser
    Description

    When working with data, you often spend most of your time cleaning it. Learn how to write more efficient code using the tidyverse in R.

  7. LSC (Leicester Scientific Corpus)

    • figshare.le.ac.uk
    Updated Apr 15, 2020
    Cite
    Neslihan Suzen (2020). LSC (Leicester Scientific Corpus) [Dataset]. http://doi.org/10.25392/leicester.data.9449639.v2
    Dataset updated
    Apr 15, 2020
    Dataset provided by
    University of Leicester
    Authors
    Neslihan Suzen
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Area covered
    Leicester
    Description

    The LSC (Leicester Scientific Corpus)

    April 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk). Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.

    The data are extracted from the Web of Science [1]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.

    [Version 2] A further cleaning is applied in Data Processing for LSC Abstracts in Version 1*. Details of the cleaning procedure are explained in Step 6.

    * Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v1.

    Getting Started

    This text provides information on the LSC (Leicester Scientific Corpus) and the pre-processing steps applied to abstracts, and describes the structure of the files that organise the corpus. The corpus was created for use in future work on the quantification of the meaning of research texts, and to make it available for use in Natural Language Processing projects.

    LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [1]. The corpus contains only documents in English. Each document in the corpus contains the following parts:

    1. Authors: The list of authors of the paper
    2. Title: The title of the paper
    3. Abstract: The abstract of the paper
    4. Categories: One or more categories from the list of categories [2]. The full list of categories is presented in the file ‘List_of _Categories.txt’.
    5. Research Areas: One or more research areas from the list of research areas [3]. The full list of research areas is presented in the file ‘List_of_Research_Areas.txt’.
    6. Total Times cited: The number of times the paper was cited by other items from all databases within the Web of Science platform [4]
    7. Times cited in Core Collection: The total number of times the paper was cited by other papers within the WoS Core Collection [4]

    The corpus was collected online in July 2018 and contains the number of citations from publication date to July 2018. We describe a document as the collection of information (about a paper) listed above. The total number of documents in LSC is 1,673,350.

    Data Processing

    Step 1: Downloading of the Data Online

    The dataset is collected manually by exporting documents as tab-delimited files online. All documents are available online.

    Step 2: Importing the Dataset to R

    The LSC was collected as TXT files. All documents are extracted to R.

    Step 3: Cleaning the Data from Documents with Empty Abstract or without Category

    As our research is based on the analysis of abstracts and categories, all documents with empty abstracts and documents without categories are removed.

    Step 4: Identification and Correction of Concatenated Words in Abstracts

    Medicine-related publications in particular use ‘structured abstracts’. Such abstracts are divided into sections with distinct headings such as introduction, aim, objective, method, result, conclusion etc. The tool used for extracting abstracts concatenates section headings with the first word of the section. For instance, we observe words such as ConclusionHigher and ConclusionsRT. The detection and identification of such words is done by sampling medicine-related publications with human intervention. Detected concatenated words are split into two words. For instance, the word ‘ConclusionHigher’ is split into ‘Conclusion’ and ‘Higher’.

    The section headings in such abstracts are listed below:

    Background Method(s) Design Theoretical Measurement(s) Location Aim(s) Methodology Process Abstract Population Approach Objective(s) Purpose(s) Subject(s) Introduction Implication(s) Patient(s) Procedure(s) Hypothesis Measure(s) Setting(s) Limitation(s) Discussion Conclusion(s) Result(s) Finding(s) Material(s) Rationale(s) Implications for health and nursing policy

    Step 5: Extracting (Sub-setting) the Data Based on Lengths of Abstracts

    After correction, the lengths of abstracts are calculated. ‘Length’ indicates the total number of words in the text, calculated by the same rule as for Microsoft Word ‘word count’ [5].

    According to the APA style manual [6], an abstract should contain between 150 and 250 words. In LSC, we decided to limit the length of abstracts to between 30 and 500 words in order to study documents with abstracts of typical length ranges and to avoid the effect of length on the analysis.
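
    Steps 4 and 5 can be illustrated with a short Python sketch. This is not the authors' procedure: they worked in R and identified concatenated words by sampling with human intervention, whereas the regular expression below is a swapped-in heuristic. The heading list is a small illustrative subset, the function names are hypothetical, and a whitespace split is only a rough stand-in for the Microsoft Word counting rule.

```python
import re

# Illustrative subset of the section headings listed in Step 4 (assumption:
# a real run would use the full list and manual review of the matches).
HEADINGS = [
    "Conclusions", "Conclusion", "Background", "Methods", "Method",
    "Results", "Result", "Objectives", "Objective",
]

# A heading at the start of a word, immediately followed by an uppercase letter.
_CONCATENATED = re.compile(r"\b(" + "|".join(HEADINGS) + r")(?=[A-Z])")


def split_concatenated(text):
    """Insert a space between a section heading and the word fused to it."""
    return _CONCATENATED.sub(r"\1 ", text)


def word_count(text):
    """Approximate word count: number of whitespace-separated tokens."""
    return len(text.split())


def within_length(text, minimum=30, maximum=500):
    """Apply the Step 5 length criterion (30 to 500 words, inclusive)."""
    return minimum <= word_count(text) <= maximum
```

    Splitting before counting matters because a fused token such as ‘ConclusionHigher’ would otherwise be counted as one word.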

    Step 6: [Version 2] Cleaning Copyright Notices, Permission Policies, Journal Names and Conference Names from LSC Abstracts in Version 1

    Conferences and journals can add a footer below the text of the abstract containing a copyright notice, permission policy, journal name, licence, author’s rights or conference name. The tool used for extracting and processing abstracts in the WoS database attaches such footers to the text. For example, our casual observation shows that copyright notices such as ‘Published by Elsevier ltd.’ are placed in many texts. To avoid abnormal appearances of words in further analysis, such as bias in frequency calculation, we performed a cleaning procedure on such sentences and phrases in the abstracts of LSC version 1. We removed copyright notices, names of conferences, names of journals, authors’ rights, licenses and permission policies identified by sampling of abstracts.

    Step 7: [Version 2] Re-extracting (Sub-setting) the Data Based on Lengths of Abstracts

    The cleaning procedure described in the previous step led to some abstracts having fewer words than our minimum length criterion (30 words). 474 texts were removed.

    Step 8: Saving the Dataset into CSV Format

    Documents are saved into 34 CSV files. In the CSV files, the information is organised with one record on each line, and the abstract, title, list of authors, list of categories, list of research areas, and times cited are recorded in fields.

    To access the LSC for research purposes, please email ns433@le.ac.uk.

    References

    [1] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
    [2] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html
    [3] Research Areas in WoS. Available: https://images.webofknowledge.com/images/help/WOS/hp_research_areas_easca.html
    [4] Times Cited in WoS Core Collection. (15 July). Available: https://support.clarivate.com/ScientificandAcademicResearch/s/article/Web-of-Science-Times-Cited-accessibility-and-variation?language=en_US
    [5] Word Count. Available: https://support.office.com/en-us/article/show-word-count-3c9e6a11-a04d-43b4-977c-563a0e0d5da3
    [6] A. P. Association, Publication manual. American Psychological Association Washington, DC, 1983.

  8. Dataset for a globally synthesised and flagged bee occurrence dataset and...

    • open.flinders.edu.au
    • researchdata.edu.au
    txt
    Updated Jun 17, 2024
    Cite
    James Dorey; Erica E. Fischer; Paige R. Chesshire; Angela Nava-Bolaños; Robert L O'Reilly; Silas Bossert; Shannon M. Collins; Elinor M. Lichtenberg; Tucker, Erika M.; Allan Smith-Pardo; Armando Falcón-Brindis; Diego A. Guevara; Bruno Ribeiro; Diego de Pedro; Keng-Lou James Hung; Katherine A. Parys; Lindsie M. McCabe; Matthew S. Rogan; Robert L. Minckley; Santiago José Elías Velazco; Terry Griswold; Tracy A. Zarrillo; Walter Jetz; Yanina V. Sica; Michael Christopher Orr.; Laura Melissa Guzman; John S. Ascher; Alice Hughes; Neil S. Cobb (2024). Dataset for a globally synthesised and flagged bee occurrence dataset and cleaning workflow [Dataset]. http://doi.org/10.25451/flinders.21709757.v7
    Available download formats: txt
    Dataset updated
    Jun 17, 2024
    Dataset provided by
    Flinders University
    Authors
    James Dorey; Erica E. Fischer; Paige R. Chesshire; Angela Nava-Bolaños; Robert L O'Reilly; Silas Bossert; Shannon M. Collins; Elinor M. Lichtenberg; Tucker, Erika M.; Allan Smith-Pardo; Armando Falcón-Brindis; Diego A. Guevara; Bruno Ribeiro; Diego de Pedro; Keng-Lou James Hung; Katherine A. Parys; Lindsie M. McCabe; Matthew S. Rogan; Robert L. Minckley; Santiago José Elías Velazco; Terry Griswold; Tracy A. Zarrillo; Walter Jetz; Yanina V. Sica; Michael Christopher Orr.; Laura Melissa Guzman; John S. Ascher; Alice Hughes; Neil S. Cobb
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Species occurrence data are foundational for research, conservation, and science communication, but the limited availability and accessibility of reliable data represents a major obstacle, particularly for insects, which face mounting pressures. We present BeeBDC, a new R package, and a global bee occurrence dataset to address this issue. We combined >18.3 million bee occurrence records from multiple public repositories (GBIF, SCAN, iDigBio, USGS, ALA) and smaller datasets, then standardised, flagged, deduplicated, and cleaned the data using the reproducible BeeBDC R-workflow. Specifically, we harmonised species names (following established global taxonomy), country names, and collection dates, and we added record-level flags for a series of potential quality issues. These data are provided in two formats, “cleaned” and “flagged-but-uncleaned”. The BeeBDC package with online documentation provides end users the ability to modify filtering parameters to address their research questions. By publishing reproducible R workflows and globally cleaned datasets, we can increase the accessibility and reliability of downstream analyses. This workflow can be implemented for other taxa to support research and conservation.

  9. Data and tools for studying isograms

    • figshare.com
    Updated Jul 31, 2017
    Cite
    Florian Breit (2017). Data and tools for studying isograms [Dataset]. http://doi.org/10.6084/m9.figshare.5245810.v1
    Available download formats: application/x-sqlite3
    Dataset updated
    Jul 31, 2017
    Dataset provided by
    figshare
    Authors
    Florian Breit
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    A collection of datasets and python scripts for extraction and analysis of isograms (and some palindromes and tautonyms) from corpus-based word-lists, specifically Google Ngram and the British National Corpus (BNC).

    Below follows a brief description, first, of the included datasets and, second, of the included scripts.

    1. Datasets

    The data from English Google Ngrams and the BNC is available in two formats: as a plain text CSV file and as a SQLite3 database.

    1.1 CSV format

    The CSV files for each dataset actually come in two parts: one labelled ".csv" and one ".totals". The ".csv" contains the actual extracted data, and the ".totals" file contains some basic summary statistics about the ".csv" dataset with the same name.

    The CSV files contain one row per data point, with the columns separated by a single tab stop. There are no labels at the top of the files. Each line has the following columns, in this order (the labels below are what I use in the database, which has an identical structure, see section below):

    Label                 Data type  Description
    isogramy              int        The order of isogramy, e.g. "2" is a second order isogram
    length                int        The length of the word in letters
    word                  text       The actual word/isogram in ASCII
    source_pos            text       The Part of Speech tag from the original corpus
    count                 int        Token count (total number of occurrences)
    vol_count             int        Volume count (number of different sources which contain the word)
    count_per_million     int        Token count per million words
    vol_count_as_percent  int        Volume count as percentage of the total number of volumes
    is_palindrome         bool       Whether the word is a palindrome (1) or not (0)
    is_tautonym           bool       Whether the word is a tautonym (1) or not (0)
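
    As a minimal sketch of how such a file might be loaded, the Python snippet below reads tab-separated rows without a header and attaches the column labels from the table above. The function name is hypothetical, the type conversions simply follow the "Data type" column (an assumption about the actual file contents), and the sample row in the usage note is invented for illustration.

```python
import csv
import io

# Column labels in file order, taken from the table above.
COLUMNS = [
    "isogramy", "length", "word", "source_pos", "count", "vol_count",
    "count_per_million", "vol_count_as_percent", "is_palindrome", "is_tautonym",
]
INT_COLUMNS = {
    "isogramy", "length", "count", "vol_count",
    "count_per_million", "vol_count_as_percent",
}
BOOL_COLUMNS = {"is_palindrome", "is_tautonym"}


def read_isogram_rows(handle):
    """Yield one dict per line of a headerless, tab-separated isogram file."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for raw in reader:
        row = dict(zip(COLUMNS, raw))
        for name in INT_COLUMNS:
            row[name] = int(row[name])
        for name in BOOL_COLUMNS:
            row[name] = row[name] == "1"  # stored as 1 or 0
        yield row


# Usage with an invented sample row (not real corpus data):
sample = io.StringIO("1\t5\tworld\tNN\t100\t20\t3\t1\t0\t0\n")
rows = list(read_isogram_rows(sample))
```

    For a real file, pass an open file handle (opened with newline="") instead of the in-memory sample.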

    The ".totals" files have a slightly different format, with one row per data point, where the first column is the label and the second column is the associated value. The ".totals" files contain the following data:

    Label               Data type  Description
    !total_1grams       int        The total number of words in the corpus
    !total_volumes      int        The total number of volumes (individual sources) in the corpus
    !total_isograms     int        The total number of isograms found in the corpus (before compacting)
    !total_palindromes  int        How many of the isograms found are palindromes
    !total_tautonyms    int        How many of the isograms found are tautonyms

    The CSV files are mainly useful for further automated data processing. For working with the data set directly (e.g. to do statistics or cross-check entries), I would recommend using the database format described below.

    1.2 SQLite database format

    The SQLite database, on the other hand, combines the data from all four of the plain text files, and adds various useful combinations of the two datasets, namely:

    • Compacted versions of each dataset, where identical headwords are combined into a single entry.
    • A combined compacted dataset, combining and compacting the data from both Ngrams and the BNC.
    • An intersected dataset, which contains only those words which are found in both the Ngrams and the BNC dataset.

    The intersected dataset is by far the least noisy, but is missing some real isograms, too.

    The columns/layout of each of the tables in the database is identical to that described for the CSV/.totals files above.

    To get an idea of the various ways the database can be queried for various bits of data, see the R script described below, which computes statistics based on the SQLite database.

    2. Scripts

    There are three scripts: one for tidying Ngram and BNC word lists and extracting isograms, one to create a neat SQLite database from the output, and one to compute some basic statistics from the data. The first script can be run using Python 3, the second script can be run using SQLite 3 from the command line, and the third script can be run in R/RStudio (R version 3).

    2.1 Source data

    The scripts were written to work with word lists from Google Ngram and the BNC, which can be obtained from http://storage.googleapis.com/books/ngrams/books/datasetsv2.html and https://www.kilgarriff.co.uk/bnc-readme.html (download all.al.gz).

    For Ngram the script expects the path to the directory containing the various files, for BNC the direct path to the *.gz file.

    2.2 Data preparation

    Before processing proper, the word lists need to be tidied to exclude superfluous material and some of the most obvious noise. This will also bring them into a uniform format.

    Tidying and reformatting can be done by running one of the following commands:

        python isograms.py --ngrams --indir=INDIR --outfile=OUTFILE
        python isograms.py --bnc --indir=INFILE --outfile=OUTFILE

    Replace INDIR/INFILE with the input directory or filename and OUTFILE with the filename for the tidied and reformatted output.

    2.3 Isogram Extraction

    After preparing the data as above, isograms can be extracted by running the following command on the reformatted and tidied files:

        python isograms.py --batch --infile=INFILE --outfile=OUTFILE

    Here INFILE should refer to the output from the previous data cleaning process. Please note that the script will actually write two output files, one named OUTFILE with a word list of all the isograms and their associated frequency data, and one named "OUTFILE.totals" with very basic summary statistics.

    2.4 Creating a SQLite3 database

    The output data from the above step can be easily collated into a SQLite3 database which allows for easy querying of the data directly for specific properties. The database can be created by following these steps:

    1. Make sure the files with the Ngrams and BNC data are named “ngrams-isograms.csv” and “bnc-isograms.csv” respectively. (The script assumes you have both of them; if you only want to load one, just create an empty file for the other one.)
    2. Copy the “create-database.sql” script into the same directory as the two data files.
    3. On the command line, go to the directory where the files and the SQL script are.
    4. Type: sqlite3 isograms.db
    5. This will create a database called “isograms.db”.

    See section 1 for a basic description of the output data and how to work with the database.

    2.5 Statistical processing

    The repository includes an R script (R version 3) named “statistics.r” that computes a number of statistics about the distribution of isograms by length, frequency, contextual diversity, etc. This can be used as a starting point for running your own stats. It uses RSQLite to access the SQLite database version of the data described above.
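
    The three word properties recorded in the data (order of isogramy, palindrome, tautonym) can be sketched in a few lines of Python. This is not the code from isograms.py, which is not shown here. The function names are hypothetical, and the definitions used are the usual ones, assumed rather than confirmed by the description: an n-th order isogram has every letter occurring exactly n times, a palindrome reads the same backwards, and a tautonym consists of two identical halves.

```python
from collections import Counter


def isogram_order(word):
    """Return n if every letter occurs exactly n times, otherwise 0."""
    counts = set(Counter(word.lower()).values())
    return counts.pop() if len(counts) == 1 else 0


def is_palindrome(word):
    """True if the word reads the same forwards and backwards."""
    lowered = word.lower()
    return lowered == lowered[::-1]


def is_tautonym(word):
    """True if the word is made of two identical halves."""
    lowered = word.lower()
    half, remainder = divmod(len(lowered), 2)
    return remainder == 0 and half > 0 and lowered[:half] == lowered[half:]
```

    Under these definitions "world" is a first-order isogram, "deed" is a second-order isogram and a palindrome, and "beriberi" is a second-order isogram and a tautonym.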

  10. Initial data analysis checklist for data screening in longitudinal studies.

    • plos.figshare.com
    xls
    Updated May 29, 2024
    Cite
    Lara Lusa; Cécile Proust-Lima; Carsten O. Schmidt; Katherine J. Lee; Saskia le Cessie; Mark Baillie; Frank Lawrence; Marianne Huebner (2024). Initial data analysis checklist for data screening in longitudinal studies. [Dataset]. http://doi.org/10.1371/journal.pone.0295726.t001
    Available download formats: xls
    Dataset updated
    May 29, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Lara Lusa; Cécile Proust-Lima; Carsten O. Schmidt; Katherine J. Lee; Saskia le Cessie; Mark Baillie; Frank Lawrence; Marianne Huebner
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Initial data analysis checklist for data screening in longitudinal studies.

  11. Data from: Who shares? Who doesn’t? Factors associated with openly archiving...

    • datadryad.org
    • data.niaid.nih.gov
    zip
    Updated May 26, 2011
    + more versions
    Cite
    Heather A. Piwowar (2011). Who shares? Who doesn’t? Factors associated with openly archiving raw research data [Dataset]. http://doi.org/10.5061/dryad.mf1sd
    Explore at:
    zip (available download formats)
    Dataset updated
    May 26, 2011
    Dataset provided by
    Dryad
    Authors
    Heather A. Piwowar
    Time period covered
    May 26, 2011
    Description

    Microarray publications and publication attributes: 157 columns of attributes for 11603 publications identified as creating gene expression microarray data. Tab delimited. Key: PubMed identifier (pmid). See stats.R for data cleaning steps and more details on variables. Data collected in January 2010 using code available at http://github.com/hpiwowar/pypub. File: rawdata.txt

    Journal policy details for microarray data: Data sharing policy details for journals that publish a lot of gene expression microarray data. Policy links, excerpts, and classifications (24 columns) for 156 journals. Some of these classifications are included as columns in rawdata.txt as journal policy attributes. File: journal_policies_microarray_data.csv

    Statistical analysis R script: R script for data cleaning, statistical analysis, and graphics as presented in the paper. Takes rawdata.txt as input and loads helper_functions.R source. File: stats.R

    Helper R script functions: Helper functions loaded by stats.R for analysis and graphical output...

  12. Data from: Datasets for lot sizing and scheduling problems in the...

    • data.mendeley.com
    • narcis.nl
    Updated Jan 19, 2021
    Cite
    Juan Piñeros (2021). Datasets for lot sizing and scheduling problems in the fruit-based beverage production process [Dataset]. http://doi.org/10.17632/j2x3gbskfw.1
    Explore at:
    Dataset updated
    Jan 19, 2021
    Authors
    Juan Piñeros
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The datasets presented here were partially used in “Formulation and MIP-heuristics for the lot sizing and scheduling problem with temporal cleanings” (Toscano, A., Ferreira, D., Morabito, R., Computers & Chemical Engineering) [1], in “A decomposition heuristic to solve the two-stage lot sizing and scheduling problem with temporal cleaning” (Toscano, A., Ferreira, D., Morabito, R., Flexible Services and Manufacturing Journal) [2], and in “A heuristic approach to optimize the production scheduling of fruit-based beverages” (Toscano et al., Gestão & Produção, 2020) [3]. In fruit-based production processes, there are two production stages: preparation tanks and production lines. This production process has some process-specific characteristics, such as temporal cleanings and synchrony between the two production stages, which make optimized production planning and scheduling even more difficult. Several papers in the literature have therefore proposed different methods to solve this problem. To the best of our knowledge, there are no standard datasets that researchers can use to verify the accuracy and performance of proposed methods or to serve as a benchmark for others working on this problem. Authors have been using small datasets that do not satisfactorily represent different production scenarios. Since demand in the beverage sector is seasonal, a wide range of scenarios makes it possible to evaluate how effectively the methods proposed in the scientific literature solve real scenarios of the problem. The datasets presented here are based on real data collected from five beverage companies. We present four datasets that are specifically constructed assuming a scenario of restricted capacity and balanced costs. These datasets are supplementary data for the paper submitted to Data in Brief [4].
    [1] Toscano, A., Ferreira, D., Morabito, R., Formulation and MIP-heuristics for the lot sizing and scheduling problem with temporal cleanings. Computers & Chemical Engineering, 142 (2020) 107038. doi: 10.1016/j.compchemeng.2020.107038.
    [2] Toscano, A., Ferreira, D., Morabito, R., A decomposition heuristic to solve the two-stage lot sizing and scheduling problem with temporal cleaning. Flexible Services and Manufacturing Journal, 31 (2019) 142-173. doi: 10.1007/s10696-017-9303-9.
    [3] Toscano, A., Ferreira, D., Morabito, R., Trassi, M. V. C., A heuristic approach to optimize the production scheduling of fruit-based beverages. Gestão & Produção, 27(4), e4869, 2020. https://doi.org/10.1590/0104-530X4869-20.
    [4] Piñeros, J., Toscano, A., Ferreira, D., Morabito, R., Datasets for lot sizing and scheduling problems in the fruit-based beverage production process. Data in Brief (2021).

  13. Replication Data for: A more efficient approach to converting ASCII files...

    • dataverse.harvard.edu
    Updated Dec 1, 2021
    Cite
    Jacob Harris (2021). Replication Data for: A more efficient approach to converting ASCII files and cleaning data in R with the speedycode package [Dataset]. http://doi.org/10.7910/DVN/X7UKRL
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 1, 2021
    Dataset provided by
    Harvard Dataverse
    Authors
    Jacob Harris
    License

    CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Replication data for working paper: A more efficient approach to converting ASCII files and cleaning data in R with the speedycode package

  14. Data cleaning and analysis for the Master's thesis: DIFFERENCES IN CONSUMER...

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, html
    Updated Aug 13, 2020
    Cite
    Hana Remesova; Michael Burnard (2020). Data cleaning and analysis for the Master's thesis: DIFFERENCES IN CONSUMER PREFERENCES FOR UNWEATHERED AND WEATHERED WOOD [Dataset]. http://doi.org/10.5281/zenodo.3981177
    Explore at:
    html, csv, bin (available download formats)
    Dataset updated
    Aug 13, 2020
    Dataset provided by
    Zenodo http://zenodo.org/
    Authors
    Hana Remesova; Michael Burnard
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The data and analysis files support the Master's thesis submitted by Hana Remesova at the University of Primorska, Faculty of Mathematics, Natural Sciences, and Information Technologies. The .csv files are data files, and the .Rmd file is an R Markdown document that can be run. Knitting the .Rmd file produces the .html file.

  15. Cleaning against MHV dataset

    • catalog.data.gov
    Updated Mar 4, 2022
    Cite
    U.S. EPA Office of Research and Development (ORD) (2022). Cleaning against MHV dataset [Dataset]. https://catalog.data.gov/dataset/cleaning-against-mhv-dataset
    Explore at:
    Dataset updated
    Mar 4, 2022
    Dataset provided by
    United States Environmental Protection Agency http://www.epa.gov/
    Description

    Efficacy data for surface cleaning methods against MHV. This dataset is associated with the following publication: Hardison, R., S. Nelson, D. Barriga, J. Ghere, G. Fenton, R. James, M. Stewart, S. Lee, M.W. Calfee, S. Ryan, and M. Howard. Efficacy of Detergent-Based Cleaning Methods Against Coronavirus MHV-A59 on Porous and Non-Porous Surfaces. JOURNAL OF OCCUPATIONAL AND ENVIRONMENTAL HYGIENE. Taylor & Francis, Inc., Philadelphia, PA, USA, 19(2): 91-101, (2022).

  16. Replication Data for: Race, gender, and the politics of incivility

    • dataverse.harvard.edu
    Updated Jun 10, 2020
    Cite
    Sam Gubitz (2020). Replication Data for: Race, gender, and the politics of incivility [Dataset]. http://doi.org/10.7910/DVN/ODPNI8
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 10, 2020
    Dataset provided by
    Harvard Dataverse
    Authors
    Sam Gubitz
    License

    CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Use the project file first, then open the cleaning R file to clean the raw data. Then use the R file called OLS analysis to analyze the cleaned data, which was outputted as a .rds file.

  17. The fractured lab notebook: undergraduate and ecological data management...

    • search.dataone.org
    Updated Nov 14, 2013
    Cite
    National Center for Ecological Analysis and Synthesis; Carly Strasser (2013). The fractured lab notebook: undergraduate and ecological data management training in the United States [Dataset]. https://search.dataone.org/view/knb.300.9
    Explore at:
    Dataset updated
    Nov 14, 2013
    Dataset provided by
    Knowledge Network for Biocomplexity
    Authors
    National Center for Ecological Analysis and Synthesis; Carly Strasser
    Time period covered
    Mar 29, 2011 - May 25, 2011
    Variables measured
    Answer, Coding, EndDate, Question, R script, StartDate, First Name, Param name, Description, RespondentID, and 157 more
    Description

    Data presented here are those collected from a survey of Ecology professors at 48 undergraduate institutions to assess the current state of data management education. The following files have been uploaded:

    Scripts (2):
    1. DataCleaning_20120105.R is an R script for cleaning up data prior to analysis. This script removes spaces, substitutes text for codes, removes duplicate schools, and converts questions and answers from the survey into simpler parameter names, without any numbers, spaces, or symbols. The script is heavily annotated to help the user understand what is being done to the data files. It produces the file cleandata_[date].Rdata, which is called in the file DataTrimming_20120105.R.
    2. DataTrimming_20120105.R is an R script for trimming extraneous variables not used in final analyses. Some variables are combined as needed and NAs (no answers) are removed. The file is heavily annotated. It produces trimdata_[date].Rdata, which was imported into Excel for summary statistics.

    Data files (3):
    3. AdvancedSpreadsheet_20110526.csv is the output file from the SurveyMonkey online survey tool used for this project. It is a .csv sheet with the complete set of survey data, although some data (e.g., open-ended responses, institution names) are removed to prevent schools and/or instructors from being identifiable. This file is read into DataCleaning_20120105.R for cleaning and editing.
    4. VariableRenaming_20110711.csv is called into the DataCleaning_20120105.R script to convert the questions and answers from the survey into simple parameter names, without any numbers, spaces, or symbols.
    5. ParamTable.csv is a list of the parameter names used for analysis and the value codes. It can be used to understand outputs from the scripts above (cleandata_[date].Rdata and trimdata_[date].Rdata).
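
The cleaning steps described for DataCleaning_20120105.R (trim spaces, substitute text for codes, drop duplicate schools, rename questions to simple parameter names) can be illustrated with a minimal Python sketch. This is not the authors' R script: the column names, the code table, and the rename table below are hypothetical stand-ins, "removing spaces" is assumed to mean trimming surrounding whitespace, and a single code table is assumed for all questions.

```python
def clean_rows(rows, renames, code_labels, school_column):
    """Clean survey rows given as a list of {question: answer} dicts.

    renames      -- hypothetical lookup from survey question to parameter name
                    (plays the role of a renaming table)
    code_labels  -- hypothetical lookup from answer code to text label
    school_column -- parameter name (after renaming) that identifies the school
    """
    seen_schools = set()
    cleaned = []
    for row in rows:
        out = {}
        for question, answer in row.items():
            answer = answer.strip()  # trim surrounding whitespace
            # rename the question, then replace a coded answer with its label
            out[renames.get(question, question)] = code_labels.get(answer, answer)
        school = out[school_column]
        if school in seen_schools:
            continue  # keep only the first response per school
        seen_schools.add(school)
        cleaned.append(out)
    return cleaned


# Invented example data, for illustration only.
raw = [
    {"Q1. School name": " Alpha College ", "Q2. Teach data mgmt?": "1"},
    {"Q1. School name": "Alpha College", "Q2. Teach data mgmt?": "2"},
    {"Q1. School name": "Beta University", "Q2. Teach data mgmt?": "2"},
]
renames = {"Q1. School name": "School", "Q2. Teach data mgmt?": "TeachDM"}
codes = {"1": "Yes", "2": "No"}

cleaned = clean_rows(raw, renames, codes, school_column="School")
for row in cleaned:
    print(row)
```

Whitespace is trimmed before the duplicate check so that " Alpha College " and "Alpha College" count as the same school; the real script may handle this differently.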

  18. LScDC (Leicester Scientific Dictionary-Core)

    • figshare.le.ac.uk
    docx
    Updated Apr 15, 2020
    + more versions
    Cite
    Neslihan Suzen (2020). LScDC (Leicester Scientific Dictionary-Core) [Dataset]. http://doi.org/10.25392/leicester.data.9896579.v3
    Explore at:
    docx (available download formats)
    Dataset updated
    Apr 15, 2020
    Dataset provided by
    University of Leicester
    Authors
    Neslihan Suzen
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Leicester
    Description

    The LScDC (Leicester Scientific Dictionary-Core Dictionary)
    April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com)
    Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes

    [Version 3] The third version of LScDC (Leicester Scientific Dictionary-Core) is formed using the updated LScD (Leicester Scientific Dictionary) - Version 3*. All steps applied to build the new version of the core dictionary are the same as in Version 2** and can be found in the description of Version 2 below. We did not repeat the explanation. The files provided with this description are also the same as described for LScDC Version 2. The numbers of words in the third versions of LScD and LScDC are summarized below.

    Number of words:
    LScD (v3): 972,060
    LScDC (v3): 103,998

    * Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v3
    ** Suzen, Neslihan (2019): LScDC (Leicester Scientific Dictionary-Core). figshare. Dataset. https://doi.org/10.25392/leicester.data.9896579.v2

    [Version 2] Getting Started

    This file describes a sorted and cleaned list of words from LScD (Leicester Scientific Dictionary), and explains the steps for sub-setting the LScD and basic statistics of words in the LSC (Leicester Scientific Corpus), to be found in [1, 2]. The LScDC (Leicester Scientific Dictionary-Core) is a list of words ordered by the number of documents containing the words, and is available in the published CSV file. There are 104,223 unique words (lemmas) in the LScDC. This dictionary is created to be used in future work on the quantification of the sense of research texts. The objective of sub-setting the LScD is to discard words which appear too rarely in the corpus. In text mining, using an enormous amount of text data challenges the performance and accuracy of data mining applications. The performance and accuracy of models depend heavily on the type of words (such as stop words and content words) and the number of words in the corpus. Rare words are not useful in discriminating texts in large corpora, as they are likely to be non-informative signals (or noise) and redundant in the collection of texts. The selection of relevant words also holds out the possibility of more effective and faster operation of text mining algorithms.

    To build the LScDC, we decided on the following process for LScD: removing words that appear in no more than 10 documents (

  19. Data and Code for "Urban socioeconomic variation influences the ecology and...

    • zenodo.org
    bin, csv, txt
    Updated Feb 9, 2024
    Cite
    Ella Martin; Samer El-Galmady; Marc T.J. Johnson (2024). Data and Code for "Urban socioeconomic variation influences the ecology and evolution of trophic interactions" [Dataset]. http://doi.org/10.5281/zenodo.10640975
    Explore at:
    csv, bin, txt (available download formats)
    Dataset updated
    Feb 9, 2024
    Dataset provided by
    Zenodo http://zenodo.org/
    Authors
    Ella Martin; Samer El-Galmady; Marc T.J. Johnson
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data and code required for all analyses in Urban socioeconomic variation influences the ecology and evolution of trophic interactions.

    Gall_Data_2022_new.csv contains data for gall predation and diameter measurements. Goldrod_Gall_Density.csv contains goldenrod and gall density measurements for each site. GallSitesFinal.csv contains location data for all study sites. DisseminationAreaCodes.csv contains codes for each site location needed to obtain census data.

    The script DataCleaning.R assembles the above four datasets with environmental and census data to produce the final dataset: MartinElGalmady&Johnson2023_cleandataset.csv and the supplemental dataset with galls with early larval death removed: NoELD_dataset.csv

    Analysis.R provides the code for conducting analyses and producing figures using the MartinElGalmady&Johnson2023_cleandataset.csv dataset (or the NoELD_dataset.csv for supplemental analyses with early larval death removed).

    Detailed descriptions of each dataset are included in metadata.xlsx (NoELD_dataset has the same rows and columns as the full dataset).

    All code was run in R version 4.2.2.

  20. Synthetic Data for an Imaginary Country, Sample, 2023 - World

    • microdata.worldbank.org
    Updated Jul 7, 2023
    + more versions
    Cite
    Development Data Group, Data Analytics Unit (2023). Synthetic Data for an Imaginary Country, Sample, 2023 - World [Dataset]. https://microdata.worldbank.org/index.php/catalog/5906
    Explore at:
    Dataset updated
    Jul 7, 2023
    Dataset authored and provided by
    Development Data Group, Data Analytics Unit
    Time period covered
    2023
    Area covered
    World
    Description

    Abstract

    The dataset is a relational dataset of 8,000 households, representing a sample of the population of an imaginary middle-income country. The dataset contains two data files: one with variables at the household level, the other with variables at the individual level. It includes variables that are typically collected in population censuses (demography, education, occupation, dwelling characteristics, fertility, mortality, and migration) and in household surveys (household expenditure, anthropometric data for children, assets ownership). The data only includes ordinary households (no community households). The dataset was created using REaLTabFormer, a model that leverages deep learning methods. The dataset was created for the purpose of training and simulation and is not intended to be representative of any specific country.

    The full-population dataset (with about 10 million individuals) is also distributed as open data.

    Geographic coverage

    The dataset is a synthetic dataset for an imaginary country. It was created to represent the population of this country by province (equivalent to admin1) and by urban/rural areas of residence.

    Analysis unit

    Household, Individual

    Universe

    The dataset is a fully-synthetic dataset representative of the resident population of ordinary households for an imaginary middle-income country.

    Kind of data

    ssd

    Sampling procedure

    The sample size was set to 8,000 households. The fixed number of households to be selected from each enumeration area was set to 25. In a first stage, the number of enumeration areas to be selected in each stratum was calculated, proportional to the size of each stratum (stratification by geo_1 and urban/rural). Then 25 households were randomly selected within each enumeration area. The R script used to draw the sample is provided as an external resource.
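
The two-stage design described above (8,000 households at 25 per enumeration area implies 8,000 / 25 = 320 enumeration areas) can be sketched in Python as follows. This is an illustration only, not the R script distributed with the dataset. It assumes that stratum size is measured in households, that enumeration areas are drawn by simple random sampling within each stratum, and that proportional counts are rounded with a largest-remainder rule; the toy sampling frame and its identifiers are invented.

```python
import random

HOUSEHOLDS_PER_EA = 25  # fixed take per enumeration area, as stated above


def allocate_eas(stratum_sizes, total_eas):
    """Split total_eas across strata in proportion to stratum size.

    Largest-remainder rounding (an assumption) makes the counts sum
    exactly to total_eas.
    """
    total_size = sum(stratum_sizes.values())
    quotas = {s: total_eas * n / total_size for s, n in stratum_sizes.items()}
    allocation = {s: int(q) for s, q in quotas.items()}
    leftover = total_eas - sum(allocation.values())
    by_remainder = sorted(quotas, key=lambda s: quotas[s] - allocation[s], reverse=True)
    for s in by_remainder[:leftover]:
        allocation[s] += 1
    return allocation


def draw_sample(frame, sample_size, rng):
    """Draw households from frame, which maps stratum -> {ea_id: [household_id, ...]}."""
    total_eas = sample_size // HOUSEHOLDS_PER_EA
    sizes = {s: sum(len(h) for h in eas.values()) for s, eas in frame.items()}
    allocation = allocate_eas(sizes, total_eas)
    sample = []
    for stratum, eas in frame.items():
        # Stage 1: pick enumeration areas; stage 2: pick 25 households in each.
        for ea_id in rng.sample(sorted(eas), allocation[stratum]):
            sample.extend(rng.sample(eas[ea_id], HOUSEHOLDS_PER_EA))
    return sample


def toy_frame():
    """Invented frame: four strata (region x urban/rural), 40 households per EA."""
    ea_counts = {("north", "urban"): 20, ("north", "rural"): 30,
                 ("south", "urban"): 10, ("south", "rural"): 20}
    frame = {}
    for (region, area), n_eas in ea_counts.items():
        frame[(region, area)] = {
            f"{region}-{area}-{i}": [f"{region}-{area}-{i}-hh{j}" for j in range(40)]
            for i in range(n_eas)
        }
    return frame


sample = draw_sample(toy_frame(), sample_size=500, rng=random.Random(0))
print(len(sample))
```

With the toy frame and a sample size of 500, the sketch selects 20 enumeration areas and 25 households in each. The real design may select enumeration areas differently (for example with probability proportional to size), which the description does not specify.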

    Mode of data collection

    other

    Research instrument

    The dataset is a synthetic dataset. Although the variables it contains are variables typically collected from sample surveys or population censuses, no questionnaire is available for this dataset. A "fake" questionnaire was however created for the sample dataset extracted from this dataset, to be used as training material.

    Cleaning operations

    The synthetic data generation process included a set of "validators" (consistency checks, based on which synthetic observations were assessed and rejected/replaced when needed). Some post-processing was also applied to the data to produce the distributed data files.

    Response rate

    This is a synthetic dataset; the "response rate" is 100%.

Cite
Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177

Data Cleaning Sample

Explore at:
153 scholarly articles cite this dataset (View in Google Scholar)
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Jul 13, 2023
Dataset provided by
Borealis
Authors
Rong Luo
License

CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically

Description

Sample data for exercises in Further Adventures in Data Cleaning.
