100+ datasets found
  1. BI intro to data cleaning eda and machine learning

    • kaggle.com
    zip
    Updated Feb 20, 2026
    Cite
    Walekhwa Tambiti Leo Philip (2026). BI intro to data cleaning eda and machine learning [Dataset]. https://www.kaggle.com/datasets/walekhwatlphilip/intro-to-data-cleaning-eda-and-machine-learning
    Explore at:
    Available download formats: zip (9961 bytes)
    Dataset updated
    Feb 20, 2026
    Authors
    Walekhwa Tambiti Leo Philip
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Real-World Data Science Challenge

    Business Intelligence Program Strategy — Student Success Optimization

    Hosted by: Walsoft Computer Institute

    Background

    Walsoft Computer Institute runs a Business Intelligence (BI) training program for students from diverse educational, geographical, and demographic backgrounds. The institute has collected detailed data on student attributes, entry exams, study effort, and final performance in two technical subjects: Python Programming and Database Systems.

    As part of an internal review, the leadership team has hired you — a Data Science Consultant — to analyze this dataset and provide clear, evidence-based recommendations on how to improve:

    • Admissions decision-making
    • Academic support strategies
    • Overall program impact and ROI

    Your Mission

    Answer this central question:

    “Using the BI program dataset, how can Walsoft strategically improve student success, optimize resources, and increase the effectiveness of its training program?”

    Key Strategic Areas

    You are required to analyze and provide actionable insights for the following three areas:

    1. Admissions Optimization

    Should entry exams remain the primary admissions filter?

    Your task is to evaluate the predictive power of entry exam scores compared to other features such as prior education, age, gender, and study hours.

    ✅ Deliverables:

    • Feature importance ranking for predicting Python and DB scores
    • Admission policy recommendation (e.g., retain exams, add screening tools, adjust thresholds)
    • Business rationale and risk analysis

    2. Curriculum Support Strategy

    Are there at-risk student groups who need extra support?

    Your task is to uncover whether certain backgrounds (e.g., prior education level, country, residence type) correlate with poor performance and recommend targeted interventions.

    ✅ Deliverables:

    • At-risk segment identification
    • Support program design (e.g., prep course, mentoring)
    • Expected outcomes, costs, and KPIs

    3. Resource Allocation & Program ROI

    How can we allocate resources for maximum student success?

    Your task is to segment students by success profiles and suggest differentiated teaching/facility strategies.

    ✅ Deliverables:

    • Performance drivers
    • Student segmentation
    • Resource allocation plan and ROI projection

    🛠️ Dataset Overview

    Column         Description
    fNAME, lNAME   Student first and last name
    Age            Student age (21–71 years)
    gender         Gender (standardized as "Male"/"Female")
    country        Student’s country of origin
    residence      Student housing/residence type
    entryEXAM      Entry test score (28–98)
    prevEducation  Prior education (High School, Diploma, etc.)
    studyHOURS     Total study hours logged
    Python         Final Python exam score
    DB             Final Database exam score

    📊 Dataset

    You are provided with a real-world messy dataset that reflects the types of issues data scientists face every day — from inconsistent formatting to missing values.

    Raw Dataset (Recommended for Full Project)

    Download: bi.csv

    This dataset includes common data quality challenges:

    • Country name inconsistencies
      e.g. Norge → Norway, RSA → South Africa, UK → United Kingdom

    • Residence type variations
      e.g. BI-Residence, BIResidence, BI_Residence → unify to BI Residence

    • Education level typos and casing issues
      e.g. Barrrchelors → Bachelor, DIPLOMA → Diploma, Diplomaaa → Diploma

    • Gender value noise
      e.g. M, F, female → standardize to Male / Female

    • Missing scores in Python subject
      Fill NaN values using column mean or suitable imputation strategy

    Participants using this dataset are expected to apply data cleaning techniques such as:

    • String standardization
    • Null value imputation
    • Type correction (e.g., scores as float)
    • Validation and visual verification
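    These cleaning steps can be sketched with pandas. A minimal sketch, assuming the column names from the dataset overview; the value mappings cover only the examples given in the description and would need extending for the full file:

```python
import pandas as pd
import numpy as np

# Toy stand-in for the raw bi.csv, showing only the issue types described above.
df = pd.DataFrame({
    "country": ["Norge", "RSA", "UK", "Norway"],
    "residence": ["BI-Residence", "BIResidence", "BI_Residence", "Private"],
    "gender": ["M", "F", "female", "Male"],
    "Python": [55.0, np.nan, 78.0, 90.0],
})

# String standardization: country names and residence type variants
df["country"] = df["country"].replace(
    {"Norge": "Norway", "RSA": "South Africa", "UK": "United Kingdom"})
df["residence"] = df["residence"].str.replace(
    r"^BI[-_]?Residence$", "BI Residence", regex=True)

# Gender value noise: collapse M/F/female/... to Male/Female
df["gender"] = (df["gender"].str.strip().str.lower()
                .map({"m": "Male", "male": "Male",
                      "f": "Female", "female": "Female"}))

# Type correction and null value imputation (column mean)
df["Python"] = df["Python"].astype(float).fillna(df["Python"].mean())

print(df)
```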

    Bonus: Submissions that use and clean this dataset will earn additional Technical Competency points.

    Cleaned Dataset (Optional Shortcut)

    Download: cleaned_bi.csv

    This version has been fully standardized and preprocessed: - All fields cleaned and renamed consistently - Missing Python scores filled with th...

  2. Dirty Dataset to practice Data Cleaning

    • kaggle.com
    zip
    Updated Nov 3, 2023
    Cite
    Amrutha yenikonda (2023). Dirty Dataset to practice Data Cleaning [Dataset]. https://www.kaggle.com/datasets/amruthayenikonda/dirty-dataset-to-practice-data-cleaning
    Explore at:
    Available download formats: zip (1241 bytes)
    Dataset updated
    Nov 3, 2023
    Authors
    Amrutha yenikonda
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The dataset has been obtained by web scraping a Wikipedia page for which code is linked below: https://www.kaggle.com/amruthayenikonda/simple-web-scraping-using-pandas

    This dataset can be used to practice data cleaning and manipulation: for example, dropping unwanted columns, handling null values, and removing symbols.
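    Those practice tasks look roughly like this in pandas. The frame below is a hypothetical stand-in (the column names and values are illustrative, not the actual scraped table):

```python
import pandas as pd

# Hypothetical scraped table with an unwanted column, a null value,
# and symbols/footnote markers embedded in a numeric column.
df = pd.DataFrame({
    "Rank": ["1.", "2.", "3."],
    "Population": ["1,234[a]", "987", None],
    "Notes": ["x", "y", "z"],
})

df = df.drop(columns=["Notes"])                # drop unwanted columns
df = df.dropna(subset=["Population"])          # drop null values
df["Population"] = (df["Population"]
                    .str.replace(r"\[.*?\]|,", "", regex=True)  # remove symbols
                    .astype(int))              # correct the dtype
print(df["Population"].tolist())  # [1234, 987]
```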

  3. Medical Clean Dataset

    • kaggle.com
    zip
    Updated Jul 6, 2025
    Cite
    Aamir Shahzad (2025). Medical Clean Dataset [Dataset]. https://www.kaggle.com/datasets/aamir5659/medical-clean-dataset
    Explore at:
    Available download formats: zip (1262 bytes)
    Dataset updated
    Jul 6, 2025
    Authors
    Aamir Shahzad
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This is the cleaned version of a real-world medical dataset that was originally noisy, incomplete, and contained various inconsistencies. The dataset was cleaned through a structured and well-documented data preprocessing pipeline using Python and Pandas. Key steps in the cleaning process included:

    • Handling missing values using statistical techniques such as median imputation and mode replacement
    • Converting categorical values to consistent formats (e.g., gender formatting, yes/no standardization)
    • Removing duplicate entries to ensure data accuracy
    • Parsing and standardizing date fields
    • Creating new derived features such as age groups
    • Detecting and reviewing outliers based on IQR
    • Removing irrelevant or redundant columns
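    Several of the steps above can be sketched with pandas. The column names here ("age", "gender", "visit_date") are assumptions for illustration, not the dataset's actual schema:

```python
import pandas as pd
import numpy as np

# Hypothetical medical-style records with the issue types listed above.
df = pd.DataFrame({
    "age": [25, np.nan, 40, 95],
    "gender": ["M", "female", "F", "MALE"],
    "visit_date": ["2024-01-05", "2024-02-05", "2024-03-01", "2024-04-09"],
})

# Missing values: median imputation
df["age"] = df["age"].fillna(df["age"].median())

# Categorical consistency: collapse gender variants
df["gender"] = df["gender"].str[0].str.upper().map({"M": "Male", "F": "Female"})

# Parse and standardize date fields
df["visit_date"] = pd.to_datetime(df["visit_date"])

# Derived feature: age groups
df["age_group"] = pd.cut(df["age"], bins=[0, 18, 40, 65, 120],
                         labels=["child", "young adult", "adult", "senior"])

# Outlier review based on IQR
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]
print(len(outliers))  # the 95-year-old row falls outside the IQR fences
```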

    The purpose of cleaning this dataset was to prepare it for further exploratory data analysis (EDA), data visualization, and machine learning modeling.

    This cleaned dataset is now ready for training predictive models, generating visual insights, or conducting healthcare-related research. It provides a high-quality foundation for anyone interested in medical analytics or data science practice.

  4. NoCORA - Northern Cameroon Observed Rainfall Archive

    • data.niaid.nih.gov
    Updated Jul 10, 2024
    Cite
    Lavarenne, Jérémy; Nenwala, Victor Hugo; Foulna Tcheobe, Carmel (2024). NoCORA - Northern Cameroon Observed Rainfall Archive [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10156437
    Explore at:
    Dataset updated
    Jul 10, 2024
    Dataset provided by
    Centre de Coopération Internationale en Recherche Agronomique pour le Développement
    Center for International Forestry Research
    Authors
    Lavarenne, Jérémy; Nenwala, Victor Hugo; Foulna Tcheobe, Carmel
    Area covered
    North Region, Cameroon
    Description

    Description: The NoCORA dataset represents a significant effort to compile and clean a comprehensive set of daily rainfall data for Northern Cameroon (North and Extreme North regions). Covering more than 1 million observations across 418 rainfall stations over a temporal range from 1927 to 2022, the dataset is instrumental for researchers, meteorologists, and policymakers working in climate research, agricultural planning, and water resource management in the region. It integrates data from diverse sources, including Sodecoton rain funnels, the archive of Robert Morel (IRD), Centrale de Lagdo, the GHCN daily service, and the TAHMO network. The construction of NoCORA involved meticulous processes, including manual assembly of data, extensive data cleaning, and standardization of station names and coordinates, intended to make it a robust and reliable resource for understanding climatic dynamics in Northern Cameroon.

    Data Sources: The dataset comprises eight primary rainfall data sources and a comprehensive coordinates dataset. The rainfall data sources include extensive historical and contemporary measurements, while the coordinates dataset was developed using reference data and an inference strategy for variant station names or missing coordinates.

    Dataset Preparation Methods: The preparation involved manual compilation, integration of machine-readable files, data cleaning with OpenRefine, and finalization using Python/Jupyter Notebook. This process ensured the accuracy and consistency of the dataset.

    Discussion: NoCORA, with its extensive data compilation, presents an invaluable resource for climate-related studies in Northern Cameroon. However, users must navigate its complexities, including missing data interpretations, potential biases, and data inconsistencies. The dataset's comprehensive nature and historical span require careful handling and validation in research applications.

    Access to Dataset: The NoCORA dataset, while a comprehensive resource for climatological and meteorological research in Northern Cameroon, is subject to specific access conditions due to its compilation from various partner sources. The original data sources vary in their openness and accessibility, and not all partners have confirmed the open-access status of their data. To ensure compliance with these varying conditions, access to the NoCORA dataset is granted on a request basis. Interested researchers and users are encouraged to contact us for permission to access the dataset. This process allows us to uphold the data sharing agreements with our partners while facilitating research and analysis within the scientific community.

    Authors' Contributions: Data treatment: Victor Hugo Nenwala, Carmel Foulna Tcheobe, Jérémy Lavarenne. Documentation: Jérémy Lavarenne.

    Funding: This project was funded by the DESIRA INNOVACC project.

    Changelog:
    • v1.0.2: corrected swapped column names in the coordinates dataset
    • v1.0.1: dataset specification file updated with complementary information regarding station locations
    • v1.0.0: initial submission

  5. Pandas Practice Dataset

    • kaggle.com
    zip
    Updated Jan 27, 2023
    Cite
    Mrityunjay Pathak (2023). Pandas Practice Dataset [Dataset]. https://www.kaggle.com/datasets/themrityunjaypathak/pandas-practice-dataset
    Explore at:
    Available download formats: zip (493 bytes)
    Dataset updated
    Jan 27, 2023
    Authors
    Mrityunjay Pathak
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    What is Pandas?

    Pandas is a Python library used for working with data sets.

    It has functions for analyzing, cleaning, exploring, and manipulating data.

    The name "Pandas" refers to both "Panel Data" and "Python Data Analysis"; the library was created by Wes McKinney in 2008.

    Why Use Pandas?

    Pandas allows us to analyze big data and make conclusions based on statistical theories.

    Pandas can clean messy data sets, and make them readable and relevant.

    Relevant data is very important in data science.

    What Can Pandas Do?

    Pandas gives you answers about the data, such as:

    • Is there a correlation between two or more columns?
    • What is the average value?
    • What is the max value?
    • What is the min value?
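    Each of those questions maps to a one-line pandas call. A toy frame for illustration (the column names are made up for the example):

```python
import pandas as pd

# Hypothetical two-column frame: study hours vs. exam score.
df = pd.DataFrame({"hours": [2, 4, 6, 8], "score": [50, 60, 75, 90]})

print(df["hours"].corr(df["score"]))  # correlation between two columns
print(df["score"].mean())             # average value -> 68.75
print(df["score"].max())              # max value -> 90
print(df["score"].min())              # min value -> 50
```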

  6. codeparrot-clean

    • huggingface.co
    Updated Dec 7, 2021
    Cite
    CodeParrot (2021). codeparrot-clean [Dataset]. https://huggingface.co/datasets/codeparrot/codeparrot-clean
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 7, 2021
    Dataset provided by
    Good Engineering, Inc
    Authors
    CodeParrot
    Description

    CodeParrot 🦜 Dataset Cleaned

      What is it?
    

    A dataset of Python files from Github. This is the deduplicated version of the codeparrot.

      Processing
    

    The original dataset contains a lot of duplicated and noisy data. Therefore, the dataset was cleaned with the following steps:

    Deduplication:
    • Remove exact matches

    Filtering:
    • Average line length < 100
    • Maximum line length < 1000
    • Alphanumeric character fraction > 0.25
    • Remove auto-generated files (keyword search)

    For… See the full description on the dataset page: https://huggingface.co/datasets/codeparrot/codeparrot-clean.
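    The filtering heuristics can be sketched as a simple predicate. This is an illustrative reimplementation of the thresholds listed above, not the project's actual filtering code:

```python
def passes_filters(source: str) -> bool:
    """Return True if a source file passes the length/alnum heuristics above."""
    lines = source.splitlines()
    if not lines:
        return False
    avg_len = sum(len(line) for line in lines) / len(lines)
    max_len = max(len(line) for line in lines)
    alnum_frac = sum(ch.isalnum() for ch in source) / max(len(source), 1)
    # Keyword-based removal of auto-generated files would be a separate check.
    return avg_len < 100 and max_len < 1000 and alnum_frac > 0.25

print(passes_filters("import os\nprint(os.getcwd())"))  # True: short, clean file
print(passes_filters("x" * 2000))                       # False: line too long
```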

  7. A Replication Dataset for Fundamental Frequency Estimation

    • live.european-language-grid.eu
    • data.niaid.nih.gov
    json
    Updated Oct 19, 2023
    Cite
    (2023). A Replication Dataset for Fundamental Frequency Estimation [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7808
    Explore at:
    Available download formats: json
    Dataset updated
    Oct 19, 2023
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Part of the dissertation "Pitch of Voiced Speech in the Short-Time Fourier Transform: Algorithms, Ground Truths, and Evaluation Methods". © 2020, Bastian Bechtold. All rights reserved.

    Estimating the fundamental frequency of speech remains an active area of research, with varied applications in speech recognition, speaker identification, and speech compression. A vast number of algorithms for estimating this quantity have been proposed over the years, and a number of speech and noise corpora have been developed for evaluating their performance. The present dataset contains estimated fundamental frequency tracks of 25 algorithms, six speech corpora, and two noise corpora, at nine signal-to-noise ratios between -20 and 20 dB SNR, as well as an additional evaluation of synthetic harmonic tone complexes in white noise.

    The dataset also contains pre-calculated performance measures, both novel and traditional, in reference to each speech corpus' ground truth, the algorithms' own clean-speech estimate, and our own consensus truth. It can thus serve as the basis for a comparison study, to replicate existing studies from a larger dataset, or as a reference for developing new fundamental frequency estimation algorithms. All source code and data are available to download and entirely reproducible, albeit requiring about one year of processor-time.

    Included Code and Data:

    ground truth data.zip is a JBOF dataset of fundamental frequency estimates and ground truths of all speech files in the following corpora:

    • CMU-ARCTIC (consensus truth) [1]
    • FDA (corpus truth and consensus truth) [2]
    • KEELE (corpus truth and consensus truth) [3]
    • MOCHA-TIMIT (consensus truth) [4]
    • PTDB-TUG (corpus truth and consensus truth) [5]
    • TIMIT (consensus truth) [6]

    noisy speech data.zip is a JBOF dataset of fundamental frequency estimates of speech files mixed with noise from the following corpora:
    • NOISEX [7]
    • QUT-NOISE [8]

    synthetic speech data.zip is a JBOF dataset of fundamental frequency estimates of synthetic harmonic tone complexes in white noise.

    noisy_speech.pkl and synthetic_speech.pkl are pickled Pandas dataframes of performance metrics derived from the above data for the following fundamental frequency estimation algorithms: AUTOC [9], AMDF [10], BANA [11], CEP [12], CREPE [13], DIO [14], DNN [15], KALDI [16], MBSC [17], NLS [18], PEFAC [19], PRAAT [20], RAPT [21], SACC [22], SAFE [23], SHR [24], SIFT [25], SRH [26], STRAIGHT [27], SWIPE [28], YAAPT [29], YIN [30].

    noisy speech evaluation.py and synthetic speech evaluation.py are Python programs to calculate the above Pandas dataframes from the above JBOF datasets. They calculate the following performance measures:
    • Gross Pitch Error (GPE): the percentage of pitches where the estimated pitch deviates from the true pitch by more than 20%.
    • Fine Pitch Error (FPE): the mean error of grossly correct estimates.
    • High/Low Octave Pitch Error (OPE): the percentage of pitches that are GPEs and happen to be at an integer multiple of the true pitch.
    • Gross Remaining Error (GRE): the percentage of pitches that are GPEs but not OPEs.
    • Fine Remaining Bias (FRB): the median error of GREs.
    • True Positive Rate (TPR): the percentage of true positive voicing estimates.
    • False Positive Rate (FPR): the percentage of false positive voicing estimates.
    • False Negative Rate (FNR): the percentage of false negative voicing estimates.
    • F₁: the harmonic mean of precision and recall of the voicing decision.
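    To make the first two definitions concrete, here is a small illustrative computation of GPE and FPE on hypothetical pitch tracks (voiced frames only); this mirrors the definitions above but is not the dataset's own evaluation code:

```python
import numpy as np

# Hypothetical true and estimated F0 tracks over four voiced frames;
# the second estimate is a gross (roughly octave) error.
true_f0 = np.array([100.0, 110.0, 120.0, 130.0])
est_f0 = np.array([101.0, 225.0, 119.0, 132.0])

rel_err = np.abs(est_f0 - true_f0) / true_f0
gross = rel_err > 0.20                 # GPE criterion: deviation > 20%
gpe = gross.mean() * 100               # % of frames with gross errors
fpe = rel_err[~gross].mean() * 100     # mean % error of grossly correct frames

print(gpe)  # 25.0 (1 of 4 frames is a gross error)
```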

    Pipfile is a pipenv-compatible pipfile for installing all prerequisites necessary for running the above Python programs.

    The Python programs take about an hour to compute on a fast 2019 computer, and require at least 32 GB of memory.

    References:

    [1] John Kominek and Alan W Black. CMU ARCTIC database for speech synthesis, 2003.
    [2] Paul C Bagshaw, Steven Hiller, and Mervyn A Jack. Enhanced Pitch Tracking and the Processing of F0 Contours for Computer Aided Intonation Teaching. In EUROSPEECH, 1993.
    [3] F Plante, Georg F Meyer, and William A Ainsworth. A Pitch Extraction Reference Database. In Fourth European Conference on Speech Communication and Technology, pages 837–840, Madrid, Spain, 1995.
    [4] Alan Wrench. MOCHA MultiCHannel Articulatory database: English, November 1999.
    [5] Gregor Pirker, Michael Wohlmayr, Stefan Petrik, and Franz Pernkopf. A Pitch Tracking Corpus with Evaluation on Multipitch Tracking Scenario. page 4, 2011.
    [6] John S. Garofolo, Lori F. Lamel, William M. Fisher, Jonathan G. Fiscus, David S. Pallett, Nancy L. Dahlgren, and Victor Zue. TIMIT Acoustic-Phonetic Continuous Speech Corpus, 1993.
    [7] Andrew Varga and Herman J.M. Steeneken. Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems. Speech Communication, 12(3):247–251, July 1993.
    [8] David B. Dean, Sridha Sridharan, Robert J. Vogt, and Michael W. Mason. The QUT-NOISE-TIMIT corpus for the evaluation of voice activity detection algorithms. Proceedings of Interspeech 2010, 2010.
    [9] Man Mohan Sondhi. New methods of pitch extraction. IEEE Transactions on Audio and Electroacoustics, 16(2):262–266, 1968.
    [10] Myron J. Ross, Harry L. Shaffer, Asaf Cohen, Richard Freudberg, and Harold J. Manley. Average magnitude difference function pitch extractor. IEEE Transactions on Acoustics, Speech and Signal Processing, 22(5):353–362, 1974.
    [11] Na Yang, He Ba, Weiyang Cai, Ilker Demirkol, and Wendi Heinzelman. BaNa: A Noise Resilient Fundamental Frequency Detection Algorithm for Speech and Music. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(12):1833–1848, December 2014.
    [12] Michael Noll. Cepstrum Pitch Determination. The Journal of the Acoustical Society of America, 41(2):293–309, 1967.
    [13] Jong Wook Kim, Justin Salamon, Peter Li, and Juan Pablo Bello. CREPE: A Convolutional Representation for Pitch Estimation. arXiv:1802.06182, February 2018.
    [14] Masanori Morise, Fumiya Yokomori, and Kenji Ozawa. WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications. IEICE Transactions on Information and Systems, E99.D(7):1877–1884, 2016.
    [15] Kun Han and DeLiang Wang. Neural Network Based Pitch Tracking in Very Noisy Speech. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(12):2158–2168, December 2014.
    [16] Pegah Ghahremani, Bagher BabaAli, Daniel Povey, Korbinian Riedhammer, Jan Trmal, and Sanjeev Khudanpur. A pitch extraction algorithm tuned for automatic speech recognition. In ICASSP 2014, pages 2494–2498. IEEE, 2014.
    [17] Lee Ngee Tan and Abeer Alwan. Multi-band summary correlogram-based pitch detection for noisy speech. Speech Communication, 55(7-8):841–856, September 2013.
    [18] Jesper Kjær Nielsen, Tobias Lindstrøm Jensen, Jesper Rindom Jensen, Mads Græsbøll Christensen, and Søren Holdt Jensen. Fast fundamental frequency estimation: Making a statistically efficient estimator computationally efficient. Signal Processing, 135:188–197, June 2017.
    [19] Sira Gonzalez and Mike Brookes. PEFAC - A Pitch Estimation Algorithm Robust to High Levels of Noise. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(2):518–530, February 2014.
    [20] Paul Boersma. Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound. In Proceedings of the Institute of Phonetic Sciences, volume 17, pages 97–110. Amsterdam, 1993.
    [21] David Talkin. A robust algorithm for pitch tracking (RAPT). Speech Coding and Synthesis, 495:518, 1995.
    [22] Byung Suk Lee and Daniel PW Ellis. Noise robust pitch tracking by subband autocorrelation classification. In Interspeech, pages 707–710, 2012.
    [23] Wei Chu and Abeer Alwan. SAFE: a statistical algorithm for F0 estimation for both clean and noisy speech. In INTERSPEECH, pages 2590–2593, 2010.
    [24] Xuejing Sun. Pitch determination and voice quality analysis using subharmonic-to-harmonic ratio. In ICASSP 2002, volume 1, pages I-333. IEEE, 2002.
    [25] Markel. The SIFT algorithm for fundamental frequency estimation. IEEE Transactions on Audio and Electroacoustics, 20(5):367–377, December 1972.
    [26] Thomas Drugman and Abeer Alwan. Joint Robust Voicing Detection and Pitch Estimation Based on Residual Harmonics. In Interspeech, pages 1973–1976, 2011.
    [27] Hideki Kawahara, Masanori Morise, Toru Takahashi, Ryuichi Nisimura, Toshio Irino, and Hideki Banno. TANDEM-STRAIGHT: A temporally stable power spectral representation for periodic signals and applications to interference-free spectrum, F0, and aperiodicity estimation. In ICASSP 2008, pages 3933–3936. IEEE, 2008.
    [28] Arturo Camacho. SWIPE: A sawtooth waveform inspired pitch estimator for speech and music. PhD thesis, University of Florida, 2007.
    [29] Kavita Kasi and Stephen A. Zahorian. Yet Another Algorithm for Pitch Tracking. In ICASSP 2002, pages I-361–I-364, Orlando, FL, USA, May 2002. IEEE.
    [30] Alain de Cheveigné and Hideki Kawahara. YIN, a fundamental frequency estimator for speech and music. The Journal of the Acoustical Society of America, 111(4):1917, 2002.

  8. Data from: ManyTypes4Py: A benchmark Python Dataset for Machine...

    • explore.openaire.eu
    • data.europa.eu
    Updated Apr 26, 2021
    + more versions
    Cite
    Amir M. Mir; Evaldas Latoskinas; Georgios Gousios (2021). ManyTypes4Py: A benchmark Python Dataset for Machine Learning-Based Type Inference [Dataset]. http://doi.org/10.5281/zenodo.4044635
    Explore at:
    Dataset updated
    Apr 26, 2021
    Authors
    Amir M. Mir; Evaldas Latoskinas; Georgios Gousios
    Description

    The dataset is gathered on Sep. 17th 2020 from GitHub. It has more than 5.2K Python repositories and 4.2M type annotations. The dataset is also de-duplicated using the CD4Py tool. Check out the README.MD file for the description of the dataset. Notable changes to each version of the dataset are documented in CHANGELOG.md. The dataset's scripts and utilities are available on its GitHub repository.

  9. Dirty Dataset to practice Data Cleaning

    • kaggle.com
    zip
    Updated May 20, 2024
    Cite
    Martin Kanju (2024). Dirty Dataset to practice Data Cleaning [Dataset]. https://www.kaggle.com/datasets/martinkanju/dirty-dataset-to-practice-data-cleaning
    Explore at:
    Available download formats: zip (1235 bytes)
    Dataset updated
    May 20, 2024
    Authors
    Martin Kanju
    Description

    Dataset

    This dataset was created by Martin Kanju

    Released under Other (specified in description)


  10. Python Codes for Data Analysis of The Impact of COVID-19 on Technical...

    • dataverse.harvard.edu
    Updated Mar 21, 2022
    Cite
    Elizabeth Szkirpan (2022). Python Codes for Data Analysis of The Impact of COVID-19 on Technical Services Units Survey Results [Dataset]. http://doi.org/10.7910/DVN/SXMSDZ
    Explore at:
    Croissant
    Dataset updated
    Mar 21, 2022
    Dataset provided by
    Harvard Dataverse
    Authors
    Elizabeth Szkirpan
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Copies of Anaconda 3 Jupyter Notebooks and Python script for holistic and clustered analysis of "The Impact of COVID-19 on Technical Services Units" survey results. Data was analyzed holistically using cleaned and standardized survey results and by library type clusters. To streamline data analysis in certain locations, an off-shoot CSV file was created so data could be standardized without compromising the integrity of the parent clean file. Three Jupyter Notebooks/Python scripts are available in relation to this project: COVID_Impact_TechnicalServices_HolisticAnalysis (a holistic analysis of all survey data) and COVID_Impact_TechnicalServices_LibraryTypeAnalysis (a clustered analysis of impact by library type, clustered files available as part of the Dataverse for this project).

  11. Data from: SBIR - STTR Data and Code for Collecting Wrangling and Using It

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Nov 5, 2018
    Cite
    Grant Allard (2018). SBIR - STTR Data and Code for Collecting Wrangling and Using It [Dataset]. http://doi.org/10.7910/DVN/CKTAZX
    Explore at:
    Croissant
    Dataset updated
    Nov 5, 2018
    Dataset provided by
    Harvard Dataverse
    Authors
    Grant Allard
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Data set consisting of data joined for analyzing the SBIR/STTR program. Data consists of individual awards and agency-level observations. The R and Python code required for pulling, cleaning, and creating useful data sets has been included.

    Allard_Get and Clean Data.R: This file provides the code for getting, cleaning, and joining the numerous data sets that this project combined. This code is written in the R language and can be used in any R environment running R 3.5.1 or higher. If the other files in this Dataverse are downloaded to the working directory, then this R code will be able to replicate the original study without needing the user to update any file paths.

    Allard SBIR STTR WebScraper.py: This is the code I deployed to multiple Amazon EC2 instances to scrape data on each individual award in my data set, including the contact info and DUNS data.

    Allard_Analysis_APPAM SBIR project: Forthcoming

    Allard_Spatial Analysis: Forthcoming

    Awards_SBIR_df.Rdata: This unique data set consists of 89,330 observations spanning the years 1983–2018 and accounting for all eleven SBIR/STTR agencies. This data set consists of data collected from the Small Business Administration's Awards API and also unique data collected through web scraping by the author.

    Budget_SBIR_df.Rdata: 246 observations for 20 agencies across 25 years of their budget performance in the SBIR/STTR program. Data was collected from the Small Business Administration using the Annual Reports Dashboard, the Awards API, and an author-designed web crawler of the websites of awards.

    Solicit_SBIR-df.Rdata: This data consists of observations of solicitations published by agencies for the SBIR program. This data was collected from the SBA Solicitations API.

    Primary Sources:
    Small Business Administration. "Annual Reports Dashboard," 2018. https://www.sbir.gov/awards/annual-reports.
    Small Business Administration. "SBIR Awards Data," 2018. https://www.sbir.gov/api.
    Small Business Administration. "SBIR Solicit Data," 2018. https://www.sbir.gov/api.

  12. Office Light Level

    • borealisdata.ca
    • search.dataone.org
    Updated Dec 7, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Y. Aussat; Costin Ograda-Bratu; S. Huo; S. Keshav (2020). Office Light Level [Dataset]. http://doi.org/10.5683/SP2/3BCADM
    Explore at:
    Croissant
    Dataset updated
    Dec 7, 2020
    Dataset provided by
    Borealis
    Authors
    Y. Aussat; Costin Ograda-Bratu; S. Huo; S. Keshav
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Time period covered
    Nov 16, 2020
    Description

    Data Owner: Y. Aussat, S. Keshav
    Data File: 32.8 MB zip file containing the data files and description

    Data Description: This dataset contains daylight signals collected over approximately 200 days in four unoccupied offices in the Davis Center (DC) building at the University of Waterloo; the signals therefore measure the daylight available in each room. Light levels were measured using custom-built light-sensing modules based on the Omega Onion microcomputer with a light sensor. An example of the module is shown in the file sensing-module.png in this directory. Each sensing module is named using four hex digits.

    We started all modules on August 30, 2018, which corresponds to minute 0 in the dataset. However, the modules were not deployed immediately. The times when we started collecting light data in each office, and the corresponding sensing module names, are:

    Office    Devices       Start time
    DC3526    af65, b02d    September 6, 2018, 11:00 am
    DC2518    afa7          September 6, 2018, 11:00 am
    DC2319    af67, f073    September 21, 2018, 11:00 am
    DC3502    afa5, b969    September 21, 2018, 11:00 am

    Moreover, due to some technical problems, the initial 6 days for offices 1 and 2 and the initial 21 days for offices 3 and 4 are dummy data and should be ignored. Finally, there were two known outages in DC during the data collection process:

    • from 00:00 am to 4:00 am on September 17, 2018
    • from 11:00 pm on October 9, 2018 until 7:45 am on October 10, 2018

    We stopped collecting the data around 2:45 pm on May 16, 2019. We therefore have 217 uninterrupted days of clean collected data, from October 11, 2018 to May 15, 2019. To take care of these problems, we have provided a Python script, process-lighting-data.ipynb, that extracts clean data from the raw data. Both raw and processed data are provided, as described next.

    Raw data: Raw-data folder names correspond to the device names. The light-sensing modules log (minute_count, visible_light, IR_light) every minute to a file; here, minute 0 corresponds to August 30, 2018. Every 1440 minutes (i.e., 1 day) we saved the current file, created a new one, and started writing to it. The filename format is {device_name}_{starting_minute}; for example, Omega-AF65_28800.csv is data collected by Omega-AF65, starting at minute 28800. A metadata file in each folder details the log file structure.

    Processed data: The folder named 'processed_data' contains the processed data, which results from running the Python script. Each file in this directory is named after the device ID; for example, af65.csv stores the processed data of the device Omega-AF65. The columns in this file are:

    Minutes: Consecutive minute of the experiment
    Illum: Illumination level (lux)
    Min_from_midnight: Minutes from midnight of the current day
    Day_of_exp: Count of the day number, starting from October 11, 2018
    Day_of_year: Day of the year

    Funding: The Natural Sciences and Engineering Research Council of Canada (NSERC)
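    A processed file such as af65.csv can be loaded with any CSV reader; below is a minimal pandas sketch based on the column list above (the sample rows are made up, not taken from the dataset):

```python
import io
import pandas as pd

# Tiny stand-in for a processed file (af65.csv), using the five columns
# documented above; the rows are illustrative values only.
sample = io.StringIO(
    "Minutes,Illum,Min_from_midnight,Day_of_exp,Day_of_year\n"
    "60480,120.5,0,1,284\n"
    "60481,118.2,1,1,284\n"
)
df = pd.read_csv(sample)

# Average illumination (lux) per experiment day.
daily_mean = df.groupby("Day_of_exp")["Illum"].mean()
print(daily_mean)
```

    The same groupby pattern extends naturally to Min_from_midnight for per-minute daylight profiles.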

  13. Data from: Data to Estimate Water Use Associated with Oil and Gas...

    • catalog.data.gov
    • data.usgs.gov
    Updated Jan 21, 2026
    + more versions
    Cite
    U.S. Geological Survey (2026). Data to Estimate Water Use Associated with Oil and Gas Development within the Bureau of Land Management Carlsbad Field Office Area, New Mexico [Dataset]. https://catalog.data.gov/dataset/data-to-estimate-water-use-associated-with-oil-and-gas-development-within-the-bureau-of-la-3b5f8
    Explore at:
    Dataset updated
    Jan 21, 2026
    Dataset provided by
    U.S. Geological Survey
    Area covered
    New Mexico
    Description

    The purpose of this data release is to provide data in support of the Bureau of Land Management's (BLM) Reasonably Foreseeable Development (RFD) Scenario by estimating water use associated with oil and gas extraction methods within the BLM Carlsbad Field Office (CFO) planning area, located in Eddy and Lea Counties as well as part of Chaves County, New Mexico. Three comma-separated value files and two Python scripts are included in this data release.

    It was determined that all reported oil and gas wells within Chaves County from the FracFocus and New Mexico Oil Conservation Division (NM OCD) databases were outside of the CFO administration area, so they were excluded from well_records.csv and modeled_estimates.csv. Data from Chaves County are included in the produced_water.csv file to be consistent with the BLM's water support document.

    Data were synthesized into comma-separated value files: produced_water.csv (volumes) from NM OCD, well_records.csv (including location and completion records) from NM OCD and FracFocus, and modeled_estimates.csv (using FracFocus as well as Ball and others (2020) as input data). The results in modeled_estimates.csv were obtained using a previously published regression model (McShane and McDowell, 2021) to estimate water use associated with unconventional oil and gas activities in the Permian Basin (Valder and others, 2021) for the period of interest (2010-2021). Additionally, Python scripts to process, clean, and categorize FracFocus data are provided in this data release.

  14. Data from: GLOBE Mosquito Habitat Mapper Citizen Science Data 2017-2020

    • zenodo.org
    Updated Oct 21, 2021
    + more versions
    Cite
    R. Low; R. Boger; P. Nelson; M. Kimura; R. Low; R. Boger; P. Nelson; M. Kimura (2021). GLOBE Mosquito Habitat Mapper Citizen Science Data 2017-2020 [Dataset]. http://doi.org/10.5281/zenodo.5106571
    Explore at:
    Dataset updated
    Oct 21, 2021
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    R. Low; R. Boger; P. Nelson; M. Kimura; R. Low; R. Boger; P. Nelson; M. Kimura
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Three Cases: Metadata and Procedures

    The data sets described here were used in an article submitted to the journal GeoHealth in 2021. The data files and further supplemental links (including general information about GLOBE data) can be accessed at https://observer.globe.gov/get-data/mosquito-habitat-data.

    Case 1: Removal of records with suspect geolocation data. A Python script was applied to remove records where the measured position (in decimal degrees) was identical to the GLOBE MGRS site position. Because GPS-obtained latitude and longitude coordinates are reported in decimal degrees, records whose coordinates were whole numbers were also removed. This procedure removed 5,704 (23%) of the 24,983 records in the Mosquito Habitat Mapper database, leaving 19,279 records. The secondary data sets cleaned only for geolocation anomalies were labeled Case 1.
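    In pandas terms, the Case 1 filter can be sketched roughly as follows (the column names are hypothetical stand-ins for the GLOBE export's actual field names):

```python
import pandas as pd

# Hypothetical columns for illustration; the GLOBE export uses its own schema.
records = pd.DataFrame({
    "measured_lat": [45.4215, 45.0, 43.6532],
    "measured_lon": [-75.6972, -75.0, -79.3832],
    "site_lat":     [45.5000, 45.0, 43.0000],
    "site_lon":     [-75.5000, -75.0, -79.0000],
})

# Drop records whose measured position matches the MGRS site position exactly,
# or whose coordinates are whole numbers (a sign of defaulted GPS values).
same_as_site = (records["measured_lat"] == records["site_lat"]) & \
               (records["measured_lon"] == records["site_lon"])
whole_degrees = (records["measured_lat"] % 1 == 0) | (records["measured_lon"] % 1 == 0)
clean = records[~(same_as_site | whole_degrees)]
print(len(clean))  # the middle row fails both checks and is removed
```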

    Case 2: Identifying suspected training events. For this test, we sought to identify groups of more than 10 records sharing the same characteristics. Another Python script was employed to extract the photos for ease of visual inspection. Because we needed to manually review the photo records, we set the threshold for groups at >10 so that the analysis could be completed in the time allotted. Groups identified through this procedure were output as Case 2: groups. The resulting data set, cleaned of groups >10, was labeled Case 2; it included 20,006 records and identified 2,447 records found in clusters we postulated were training events.

    Case 3: The Case 3 secondary dataset results from applying both Python scripts used to create Cases 1 and 2. We used the Case 3 data sets, with improved geolocation and large groups eliminated, in the subsequent analysis.

    Acknowledgments: These data were obtained from NASA and the GLOBE Program and are freely available for use in research, publications and commercial applications. When data from GLOBE are used in a publication, we request this acknowledgment be included: "These data were obtained from the GLOBE Program." Please include such statements, either where the use of the data or other resource is described, or within the acknowledgments section of the publication.

  15. GLOBE Observer Mosquito Habitat Mapper Citizen Science Data 2017-2020, v1

    • geospatial.strategies.org
    • hub.arcgis.com
    Updated Apr 2, 2021
    Cite
    Institute for Global Environmental Strategies (2021). GLOBE Observer Mosquito Habitat Mapper Citizen Science Data 2017-2020, v1 [Dataset]. https://geospatial.strategies.org/documents/8d2e69bc090a42f69fa5e522adef5cab
    Explore at:
    Dataset updated
    Apr 2, 2021
    Dataset authored and provided by
    Institute for Global Environmental Strategies
    Description

    Three Cases: Metadata and Procedures

    The data sets described here were used in an article submitted to the journal GeoHealth in 2021. The data files and further supplemental links (including general information about GLOBE data) can be accessed at https://observer.globe.gov/get-data/mosquito-habitat-data.

    Case 1: Removal of records with suspect geolocation data. A Python script was applied to remove records where the measured position (in decimal degrees) was identical to the GLOBE MGRS site position. Because GPS-obtained latitude and longitude coordinates are reported in decimal degrees, records whose coordinates were whole numbers were also removed. This procedure removed 5,704 (23%) of the 24,983 records in the Mosquito Habitat Mapper database, leaving 19,279 records. The secondary data sets cleaned only for geolocation anomalies were labeled Case 1.

    Case 2: Identifying suspected training events. For this test, we sought to identify groups of more than 10 records sharing the same characteristics. Another Python script was employed to extract the photos for ease of visual inspection. Because we needed to manually review the photo records, we set the threshold for groups at >10 so that the analysis could be completed in the time allotted. Groups identified through this procedure were output as Case 2: groups. The resulting data set, cleaned of groups >10, was labeled Case 2; it included 20,006 records and identified 2,447 records found in clusters we postulated were training events.

    Case 3: The Case 3 secondary dataset results from applying both Python scripts used to create Cases 1 and 2. We used the Case 3 data sets, with improved geolocation and large groups eliminated, in the subsequent analysis.

    The information in this description was last updated 2021-04-12.

  16. Pre-Processed Power Grid Frequency Time Series

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 15, 2021
    + more versions
    Cite
    Kruse, Johannes; Schäfer, Benjamin; Witthaut, Dirk (2021). Pre-Processed Power Grid Frequency Time Series [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3744120
    Explore at:
    Dataset updated
    Jul 15, 2021
    Dataset provided by
    Forschungszentrum Jülich GmbH, Institute for Energy and Climate Research - Systems Analysis and Technology Evaluation (IEK-STE), 52428 Jülich, Germany
    School of Mathematical Sciences, Queen Mary University of London, London E1 4NS, United Kingdom
    Authors
    Kruse, Johannes; Schäfer, Benjamin; Witthaut, Dirk
    Description

    Overview

    This repository contains ready-to-use frequency time series as well as the corresponding pre-processing scripts in Python. The data covers three synchronous areas of the European power grid:

    Continental Europe

    Great Britain

    Nordic

    This work is part of the paper "Predictability of Power Grid Frequency"[1]. Please cite this paper, when using the data and the code. For a detailed documentation of the pre-processing procedure we refer to the supplementary material of the paper.

    Data sources

    We downloaded the frequency recordings from publicly available repositories of three different Transmission System Operators (TSOs).

    Continental Europe [2]: We downloaded the data from the German TSO TransnetBW GmbH, which retains the copyright on the data but allows it to be re-published upon request [3].

    Great Britain [4]: The download was supported by National Grid ESO Open Data, which belongs to the British TSO National Grid. They publish the frequency recordings under the NGESO Open License [5].

    Nordic [6]: We obtained the data from the Finnish TSO Fingrid, which provides the data under the open license CC-BY 4.0 [7].

    Content of the repository

    A) Scripts

    In the "Download_scripts" folder you will find three scripts to automatically download frequency data from the TSO's websites.

    In "convert_data_format.py" we save the data with corrected timestamp formats. Missing data is marked as NaN (processing step (1) in the supplementary material of [1]).

    In "clean_corrupted_data.py" we load the converted data and identify corrupted recordings. We mark them as NaN and clean some of the resulting data holes (processing step (2) in the supplementary material of [1]).

    The Python scripts run with Python 3.7 and the packages listed in "requirements.txt".

    B) Yearly converted and cleansed data The folders "_converted" contain the output of "convert_data_format.py" and "_cleansed" contain the output of "clean_corrupted_data.py".

    File type: The files are zipped csv-files, where each file comprises one year.

    Data format: The files contain two columns. The second column contains the frequency values in Hz. The first one represents the time stamps in the format Year-Month-Day Hour-Minute-Second, which is given as naive local time. The local time refers to the following time zones and includes Daylight Saving Times (python time zone in brackets):

    TransnetBW: Continental European Time (CET)

    Nationalgrid: Great Britain (GB)

    Fingrid: Finland (Europe/Helsinki)

    NaN representation: We mark corrupted and missing data as "NaN" in the csv-files.
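    Given the two-column layout described above, a yearly CSV can be read along these lines (a sketch; the sample rows are made up and the lack of a header row is an assumption):

```python
import io
import pandas as pd

# Stand-in for one yearly file: naive local timestamps plus frequency in Hz,
# with corrupted/missing samples already marked as NaN.
sample = io.StringIO(
    "2018-01-01 00:00:00,50.01\n"
    "2018-01-01 00:00:01,NaN\n"
    "2018-01-01 00:00:02,49.98\n"
)
df = pd.read_csv(sample, header=None, names=["time", "frequency_hz"],
                 parse_dates=["time"])

# Count the gaps that remain from missing/corrupted recordings.
gaps = df["frequency_hz"].isna().sum()
print(f"{len(df)} samples, {gaps} marked as NaN")
```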

    Use cases

    We point out that this repository can be used in two different ways:

    Use pre-processed data: You can directly use the converted or the cleansed data. Note, however, that both data sets include segments of NaN values due to missing and corrupted recordings. Only a very small part of the NaN values was eliminated in the cleansed data, so as not to manipulate the data too much.

    Produce your own cleansed data: Depending on your application, you might want to cleanse the data in a custom way. You can easily add your custom cleansing procedure in "clean_corrupted_data.py" and then produce cleansed data from the raw data in "_converted".

    License

    This work is licensed under multiple licenses, which are located in the "LICENSES" folder.

    We release the code in the folder "Scripts" under the MIT license .

    The pre-processed data in the subfolders "**/Fingrid" and "**/Nationalgrid" are licensed under CC-BY 4.0.

    TransnetBW originally did not publish their data under an open license. We have explicitly received the permission to publish the pre-processed version from TransnetBW. However, we cannot publish our pre-processed version under an open license due to the missing license of the original TransnetBW data.

    Changelog

    Version 2:

    Add time zone information to description

    Include new frequency data

    Update references

    Change folder structure to yearly folders

    Version 3:

    Correct TransnetBW files for missing data in May 2016

  17. PUDL Data Release v1.0.0

    • explore.openaire.eu
    Updated Feb 7, 2020
    Cite
    Zane A. Selvans; Christina M. Gosnell (2020). PUDL Data Release v1.0.0 [Dataset]. http://doi.org/10.5281/zenodo.3653159
    Explore at:
    Dataset updated
    Feb 7, 2020
    Authors
    Zane A. Selvans; Christina M. Gosnell
    Description

    This is the first data release from the Public Utility Data Liberation (PUDL) project. It can be referenced & cited using https://doi.org/10.5281/zenodo.3653159. For more information about the free and open source software used to generate this data release, see Catalyst Cooperative's PUDL repository on GitHub and the associated documentation on Read The Docs. This data release was generated using v0.3.1 of the catalystcoop.pudl Python package.

    Included Data Packages

    This release consists of three tabular data packages, conforming to the standards published by Frictionless Data and the Open Knowledge Foundation. The data are stored in CSV files (some of which are compressed using gzip), and the associated metadata is stored as JSON. These tabular data can be used to populate a relational database.

    pudl-eia860-eia923: Data originally collected and published by the US Energy Information Administration (US EIA). The EIA Form 860 data covers the years 2011-2018; the Form 923 data covers 2009-2018. A large majority of the data published in the original data sources has been included, but some parts, like fuel stocks on hand and EIA 923 schedules 6, 7, & 8, have not yet been integrated.

    pudl-eia860-eia923-epacems: This data package contains all of the same data as the pudl-eia860-eia923 package above, as well as the Hourly Emissions data from the US Environmental Protection Agency's (EPA's) Continuous Emissions Monitoring System (CEMS) from 1995-2018. The EPA CEMS data covers thousands of power plants at hourly resolution for decades and contains close to a billion records.

    pudl-ferc1: Seven data tables from FERC Form 1 are included, primarily relating to individual power plants and covering the years 1994-2018 (the entire span of time for which FERC provides this data). These are the only tables which have been subjected to any cleaning or organization for programmatic use within PUDL. The complete, raw FERC Form 1 database contains 116 different tables with many thousands of columns of mostly financial data. We will archive a complete copy of the multi-year FERC Form 1 database as a file-based SQLite database at Zenodo, independent of this data release. It can also be re-generated using the catalystcoop.pudl Python package and the original source data files archived as part of this data release.

    Contact Us

    If you're using PUDL, we would love to hear from you, even if it's just a note to let us know that you exist and how you're using the software or data. You can also:

    • Subscribe to our announcements list for email updates.
    • Use the GitHub issue tracker to file bugs, suggest improvements, or ask for help.
    • Email the project team at pudl@catalyst.coop for private communications.
    • Follow @CatalystCoop on Twitter.

    Using the Data

    The data packages are just CSV (data) and JSON (metadata) files. They can be used with a variety of tools on many platforms. However, the data is organized primarily with the idea that it will be loaded into a relational database, and the PUDL Python package that was used to generate this data release can facilitate that process. Once the data is loaded into a database, you can access that DB however you like.

    Make sure conda is installed: None of these commands will work without the conda Python package manager, installed via either Anaconda or miniconda.

    Download the data: First download the files from the Zenodo archive into a new empty directory. A couple of them are very large (5-10 GB), and depending on what you're trying to do you may not need them. If you don't want to recreate the data release from scratch by re-running the entire ETL process yourself, and you don't want to create a full clone of the original FERC Form 1 database (including all of the data that has not yet been integrated into PUDL), then you don't need to download pudl-input-data.tgz. If you don't need the EPA CEMS Hourly Emissions data, you do not need to download pudl-eia860-eia923-epacems.tgz.

    Load All of PUDL in a Single Line: Use cd to get into your new directory at the terminal (on Linux or Mac OS), or open an Anaconda terminal in that directory if you're on Windows. If you have downloaded all of the files from the archive, and you want it all to be accessible locally, you can run a single shell script, load-pudl.sh, with: bash load-pudl.sh. This will do the following: load the FERC Form 1, EIA Form 860, and EIA Form 923 data packages into an SQLite database at sqlite/pudl.sqlite; convert the EPA CEMS data package into an Apache Parquet dataset at parquet/epacems; and clone all of the FERC Form 1 annual databases into a single SQLite database at sqlite/ferc1.sqlite.

    Selectively Load PUDL Data: If you don't want to download and load all of the PUDL data, you can load each of the above datasets separately.

    Create the PUDL conda Environment: This installs the PUDL software locally, and a couple of other u...

  18. Data from: USDA Dietary Guidelines Sentiment Analysis

    • agdatacommons.nal.usda.gov
    zip
    Updated Dec 14, 2023
    Cite
    Shivam Saith (2023). USDA Dietary Guidelines Sentiment Analysis [Dataset]. http://doi.org/10.15482/USDA.ADC/1438034
    Explore at:
    zipAvailable download formats
    Dataset updated
    Dec 14, 2023
    Dataset provided by
    Ag Data Commons
    Authors
    Shivam Saith
    License

    U.S. Government Workshttps://www.usa.gov/government-works
    License information was derived automatically

    Description

    This dataset contains all documents (the text and the PDF files) as well as the code used to carry out the sentiment analysis on the USDA Dietary Guidelines. The scope of the project, and of the dataset uploaded here, is sentiment analysis of the USDA dietary guidelines from 1980 through 2015 (released every 5 years). The motivation behind this project is that recommendations regarding different nutrients have changed over the years. For example, fats have usually been presented in a negative tone in the past, but over time it has also been noted that some types of fats are good for the body, unlike other fats. The goal was to create visualizations that convey complex information easily: analyzing all the statements dealing with a particular nutrient, and not just determining whether the sentiment is positive or negative but also calculating the extent to which a statement is positive or negative. The individual statement sentiments have been averaged over time to generate trendline visualizations. Three resources are included with this dataset:

    The Code file : This file contains the actual code used to carry out the analysis.

    The Corpus: This contains the official USDA dietary guidelines in both the original PDF as well as the converted text formats (analysis was carried out using the files in the text format)

    Sentiment Polarity Value CSVs: The polarity-value CSVs, added as a zipped file. This contains the individual statement polarities as calculated by the NLP package. The files are arranged so that each nutrient has a separate file for each of the 8 years in consideration (1980, 1985, 1990, 1995, 2000, 2005, 2010, 2015), and each such file has the statements and sentiment scores for individual statements in which that nutrient was present as a word. Each CSV file name has two parts: NutrientName_Year. For example, a file named 'Fat_2015' has all the statements and the corresponding polarity values for the nutrient Fat in the dietary guideline for the year 2015.

    Here are a few details about the process and the methodologies used to carry out this analysis. I started by converting the PDFs to text data. I had to use Optical Character Recognition (OCR) for that, and Google Docs did a fairly good job of converting everything to the textual data required for the analysis: one just needs to upload the original PDF file to Google Drive and then open the same file in Google Docs, and Google Docs does the rest.

    As far as data cleaning goes, I had to remove erroneous newline characters and special characters that played no part in the sentiment. I also developed regular expressions to identify the beginning of a new statement, which were later used to effectively separate the different statements. Finally, after separating at the individual-statement level, I used the relevant package methods to produce the sentiment scores.

    As for the technologies used, I used Python, a very popular language in the data science world. I have personally used both Python and R and went with Python in this case because it has a variety of packages for data manipulation as well as natural language processing. Jupyter notebooks were used, which allow us to create and share documents that contain live code, equations, and visualizations; they let us code right in the browser, eliminate the need to install any other integrated development environment, and make it very convenient to share code. I also used Anaconda, an open-source distribution that simplifies package management and deployment.

    The two Python packages used for sentiment analysis are TextBlob and VADER. Each is tuned to a specific type of data: VADER is more or less tuned to social media data, while TextBlob is a beginner-level package not tuned to any specific type of data. I have used both and given the end user the option to use either package to carry out the sentiment analysis; one package performs better for some kinds of statements and vice versa. For visualizations, I used Plotly because of the ease as well as the quality of the visualizations it produces with minimal code. It is important to note, though, that only a limited number of the visualizations generated are free, and it is not free for commercial use.

    Resources in this dataset:

    Resource Title: Nutrient Sentiment Polarities. File Name: NutrientSentimentStatementPolarities.zip. Resource Description: A zipped file which, once unzipped, has 6 directories corresponding to the 6 nutrients: Fat, Water, Protein, Mineral, Carbohydrate, and Vitamin. Within each directory are individual CSV files, 8 per nutrient, for the years from 1980 to 2015 (at 5-year intervals).

    Resource Title: Code. File Name: USDA_DietaryGuidelinesSentimentAnalysis.ipynb.html. Resource Description: Python notebook with the entire code, from cleaning the data to carrying out the analysis to generating the visualizations.

    Resource Title: Corpus. File Name: Corpus.zip. Resource Description: The corpus used to carry out the analysis, in both the original PDF and converted text formats.

    Resource Title: Data Dictionary. File Name: USDA_DietaryGuidelinesSentimentAnalysis_DataDictionary.csv. Resource Description: Defines variables and properties for sentiment analysis data for carbohydrates, fats, minerals, proteins, vitamins, and water.
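    The averaging step behind the trendlines can be sketched with the standard library alone (the polarity rows below are illustrative, not values from the dataset):

```python
from collections import defaultdict

# (nutrient, year, polarity) rows, mimicking the per-statement CSVs;
# the values are made up for illustration.
rows = [
    ("Fat", 1980, -0.40), ("Fat", 1980, -0.20),
    ("Fat", 2015, 0.10),  ("Fat", 2015, 0.30),
]

# Average statement polarity per (nutrient, year) -> one trendline point each.
totals = defaultdict(lambda: [0.0, 0])
for nutrient, year, polarity in rows:
    totals[(nutrient, year)][0] += polarity
    totals[(nutrient, year)][1] += 1

trend = {key: total / count for key, (total, count) in totals.items()}
print(trend)
```

    With these toy rows, Fat averages -0.30 in 1980 and 0.20 in 2015, i.e. the kind of negative-to-positive shift the description mentions.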

  19. Wi-Fi (CSI and RSSI) data of six basic knife activities for cooking...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Apr 19, 2023
    Cite
    Majid Ghosian Moghaddam; Ali Asghar Nazari Shirehjini; Shervin Shirmohammadi (2023). Wi-Fi (CSI and RSSI) data of six basic knife activities for cooking (chopping, cubing, French cutting, julienning, mincing, and slicing) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7843703
    Explore at:
    Dataset updated
    Apr 19, 2023
    Dataset provided by
    University of Ottawa
    Authors
    Majid Ghosian Moghaddam; Ali Asghar Nazari Shirehjini; Shervin Shirmohammadi
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    To gather the dataset, we asked two participants to perform six basic knife activities. The layout of the experiment is provided in Fig. 4. As it illustrates, we put the receiver on the right side and the ESP32 transceiver on the left side of the performing area, which in this experiment is a cutting board (30 x 46 cm). Each participant performed each activity five times in the performing area. The data were recorded using a customized version of the ESP32-CSI-tool [38] on a laptop, which lets us record and save each trial in a separate file. After recording all 60 data entries, we used Python code to extract the clean data from all the text files generated by the tool. The clean data are stored in a database and form the dataset.

  20. Business Analyst Test | Attendance Data | Python

    • kaggle.com
    zip
    Updated Dec 2, 2025
    Cite
    Mohamed_Gouda (2025). Business Analyst Test | Attendance Data | Python [Dataset]. https://www.kaggle.com/datasets/mgouda/business-analyst-test-attendance-data-python
    Explore at:
    zip(1336761 bytes)Available download formats
    Dataset updated
    Dec 2, 2025
    Authors
    Mohamed_Gouda
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Exam Instructions: BPO Shrinkage Dashboard

    Disclaimer: All data is fictitious and for educational purposes only.

    Objective

    Create a BPO Shrinkage Dashboard with structured data storage and visualization.

    Steps

    1. Data Preparation (Python)

    Use Python (e.g., Pandas) to clean, validate, and transform raw data. Standardize date formats, remove duplicates, and handle missing values.
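    A minimal pandas sketch of this preparation step (the file layout and column names are assumptions, not from the exam data):

```python
import io
import pandas as pd

# Toy raw extract with the issues named in the step: two date formats,
# a duplicated row, and a missing value. All names here are hypothetical.
raw = io.StringIO(
    "employee_id,date,absent_days\n"
    "101,2025-01-06,1\n"
    "101,2025-01-06,1\n"
    "102,06/01/2025,\n"
)
df = pd.read_csv(raw)

# Standardize date formats: try ISO first, then day-first, keeping whichever parses.
iso = pd.to_datetime(df["date"], format="%Y-%m-%d", errors="coerce")
dayfirst = pd.to_datetime(df["date"], format="%d/%m/%Y", errors="coerce")
df["date"] = iso.fillna(dayfirst)

df = df.drop_duplicates()                        # remove duplicates
df["absent_days"] = df["absent_days"].fillna(0)  # handle missing values
print(df)
```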

    2. Database Setup (PostgreSQL)

    Create a PostgreSQL database and design normalized tables for:

    Employees
    Team Leaders
    Company
    Country
    Language
    Work Location
    Shrinkage Metrics

    Load the cleaned data into the database.

    3. Dashboard Development

    Connect the dashboard to the PostgreSQL database. Provide drill-down views:

    Employee → Team Leader → Company → Country → Language → Work Location

    Include a clear date map for reporting periods.

    KPIs & Formulas

    Implement these calculations:

    Absenteeism % = Total Absent Days ÷ Total Scheduled Days

    Unplanned Shrinkage % = Total Unplanned Shrinkage Days ÷ Scheduled Days

    Planned Shrinkage % = Total Planned Shrinkage Days ÷ Total Days

    Sickness % = (Sick + Planned Sick) ÷ (Scheduled Days + Planned Sick)
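    As a quick sanity check, the formulas can be computed from plain day counts (the numbers below are illustrative, and reading "Total Days" as scheduled plus planned shrinkage days is an assumption):

```python
# Illustrative day counts for one reporting period.
scheduled_days = 200
absent_days = 10
unplanned_shrinkage_days = 14
planned_shrinkage_days = 20
total_days = 220            # assumed: scheduled + planned shrinkage days
sick_days = 6
planned_sick_days = 4

# The four KPI formulas from the brief.
absenteeism = absent_days / scheduled_days
unplanned_shrinkage = unplanned_shrinkage_days / scheduled_days
planned_shrinkage = planned_shrinkage_days / total_days
sickness = (sick_days + planned_sick_days) / (scheduled_days + planned_sick_days)

print(f"Absenteeism: {absenteeism:.1%}")                  # 5.0%
print(f"Unplanned shrinkage: {unplanned_shrinkage:.1%}")  # 7.0%
print(f"Planned shrinkage: {planned_shrinkage:.1%}")      # 9.1%
print(f"Sickness: {sickness:.1%}")                        # 4.9%
```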

    Deliverables

    Python scripts for data cleaning.
    PostgreSQL database with all tables and relationships.
    Dashboard connected to the database showing KPIs and drill-downs.


BI intro to data cleaning eda and machine learning

Student dataset for Data Cleaning, EDA and Predictive Modeling


Description

Real-World Data Science Challenge

Business Intelligence Program Strategy — Student Success Optimization

Hosted by: Walsoft Computer Institute 📁 Download dataset 👤 Kaggle profile

Background

Walsoft Computer Institute runs a Business Intelligence (BI) training program for students from diverse educational, geographical, and demographic backgrounds. The institute has collected detailed data on student attributes, entry exams, study effort, and final performance in two technical subjects: Python Programming and Database Systems.

As part of an internal review, the leadership team has hired you — a Data Science Consultant — to analyze this dataset and provide clear, evidence-based recommendations on how to improve:

  • Admissions decision-making
  • Academic support strategies
  • Overall program impact and ROI

Your Mission

Answer this central question:

“Using the BI program dataset, how can Walsoft strategically improve student success, optimize resources, and increase the effectiveness of its training program?”

Key Strategic Areas

You are required to analyze and provide actionable insights for the following three areas:

1. Admissions Optimization

Should entry exams remain the primary admissions filter?

Your task is to evaluate the predictive power of entry exam scores compared to other features such as prior education, age, gender, and study hours.

✅ Deliverables:

  • Feature importance ranking for predicting Python and DB scores
  • Admission policy recommendation (e.g., retain exams, add screening tools, adjust thresholds)
  • Business rationale and risk analysis

2. Curriculum Support Strategy

Are there at-risk student groups who need extra support?

Your task is to uncover whether certain backgrounds (e.g., prior education level, country, residence type) correlate with poor performance and recommend targeted interventions.

✅ Deliverables:

  • At-risk segment identification
  • Support program design (e.g., prep course, mentoring)
  • Expected outcomes, costs, and KPIs

3. Resource Allocation & Program ROI

How can we allocate resources for maximum student success?

Your task is to segment students by success profiles and suggest differentiated teaching/facility strategies.

✅ Deliverables:

  • Performance drivers
  • Student segmentation
  • Resource allocation plan and ROI projection
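For the segmentation deliverable, one common approach is k-means clustering on standardized effort and performance features. This is a sketch under assumptions (synthetic stand-in data, two clusters); the real analysis should choose `n_clusters` with an elbow plot or silhouette score.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in — replace with the cleaned dataset
df = pd.DataFrame({
    "studyHOURS": [90, 250, 260, 120, 240, 100, 255, 110],
    "entryEXAM":  [45, 85, 90, 50, 88, 48, 92, 52],
    "Python":     [47, 90, 93, 55, 88, 50, 91, 53],
})

# Standardize so no single feature dominates the distance metric
X = StandardScaler().fit_transform(df)
df["segment"] = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# Profile each segment to guide differentiated teaching/facility strategies
profile = df.groupby("segment").mean().round(1)
print(profile)
```

Each segment's profile (e.g., "high effort / high score" vs. "low effort / low score") maps directly onto a resourcing decision, such as self-paced material for the former and structured mentoring for the latter.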

🛠️ Dataset Overview

  • fNAME, lNAME: Student first and last name
  • Age: Student age (21–71 years)
  • gender: Gender (standardized as "Male"/"Female")
  • country: Student’s country of origin
  • residence: Student housing/residence type
  • entryEXAM: Entry test score (28–98)
  • prevEducation: Prior education (High School, Diploma, etc.)
  • studyHOURS: Total study hours logged
  • Python: Final Python exam score
  • DB: Final Database exam score

📊 Dataset

You are provided with a real-world messy dataset that reflects the types of issues data scientists face every day — from inconsistent formatting to missing values.

Raw Dataset (Recommended for Full Project)

Download: bi.csv

This dataset includes common data quality challenges:

  • Country name inconsistencies
    e.g. Norge → Norway, RSA → South Africa, UK → United Kingdom

  • Residence type variations
    e.g. BI-Residence, BIResidence, BI_Residence → unify to BI Residence

  • Education level typos and casing issues
    e.g. Barrrchelors → Bachelor, DIPLOMA / Diplomaaa → Diploma

  • Gender value noise
    e.g. M, F, female → standardize to Male / Female

  • Missing scores in Python subject
    Fill NaN values using column mean or suitable imputation strategy

Participants using this dataset are expected to apply data cleaning techniques such as:

  • String standardization
  • Null value imputation
  • Type correction (e.g., scores as float)
  • Validation and visual verification
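The cleaning steps above can be sketched with pandas. The inline DataFrame is a synthetic stand-in for the raw file (replace it with `pd.read_csv("bi.csv")`), and the mappings cover only the example inconsistencies listed above; the real dataset will need a fuller mapping built from `value_counts()`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the raw bi.csv — replace with pd.read_csv("bi.csv")
df = pd.DataFrame({
    "country": ["Norge", "RSA", "UK", "Norway"],
    "residence": ["BI-Residence", "BIResidence", "BI_Residence", "Private"],
    "prevEducation": ["Barrrchelors", "DIPLOMA", "Diplomaaa", "Bachelor"],
    "gender": ["M", "F", "female", "Male"],
    "Python": [67.0, np.nan, 82.0, 74.0],
})

# 1. String standardization via explicit mappings
df["country"] = df["country"].replace(
    {"Norge": "Norway", "RSA": "South Africa", "UK": "United Kingdom"})
df["residence"] = df["residence"].replace(
    {"BI-Residence": "BI Residence", "BIResidence": "BI Residence",
     "BI_Residence": "BI Residence"})
df["prevEducation"] = df["prevEducation"].replace(
    {"Barrrchelors": "Bachelor", "DIPLOMA": "Diploma", "Diplomaaa": "Diploma"})

# 2. Gender value noise: first letter, uppercased, mapped to Male/Female
df["gender"] = (df["gender"].str.strip().str[0].str.upper()
                  .map({"M": "Male", "F": "Female"}))

# 3. Null value imputation: fill missing Python scores with the column mean
df["Python"] = df["Python"].fillna(df["Python"].mean())

# 4. Type correction and verification
df["Python"] = df["Python"].astype(float)
print(df)
```

A final visual check (e.g., `df["country"].value_counts()` and a histogram of `Python`) confirms that no stray categories or implausible scores survive the cleaning.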

Bonus: Submissions that use and clean this dataset will earn additional Technical Competency points.

Cleaned Dataset (Optional Shortcut)

Download: cleaned_bi.csv

This version has been fully standardized and preprocessed:

  • All fields cleaned and renamed consistently
  • Missing Python scores filled with the column mean
