71 datasets found

Python Script for Cleaning Alum Dataset
search.dataone.org
hydroshare.org
Updated Oct 18, 2025
Cite
saikumar payyavula; Jeff Sadler (2025). Python Script for Cleaning Alum Dataset [Dataset]. https://search.dataone.org/view/sha256%3A9df1a010044e2d50d741d5671b755351813450f4331dd7b0cc2f0a527750b30e
Explore at:
Dataset updated
Oct 18, 2025
Dataset provided by
Hydroshare
Authors
saikumar payyavula; Jeff Sadler
Description
This resource contains a Python script used to clean and preprocess the alum dosage dataset from a small Oklahoma water treatment plant. The script handles missing values, removes outliers, merges historical water quality and weather data, and prepares the dataset for AI model training.
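The script itself is not reproduced in this listing. As a rough illustration only, the steps named in the description (filling missing values, dropping outliers, merging with weather data) might look like the standard-library sketch below. The field names (`date`, `alum_dose`, `rain_mm`), the median fill, and the 1.5 x IQR outlier rule are all assumptions, not details taken from the resource.

```python
from statistics import median, quantiles


def fill_missing(rows, key):
    """Replace None in rows[key] with the median of the observed values (assumed strategy)."""
    observed = [r[key] for r in rows if r[key] is not None]
    fill = median(observed)
    return [{**r, key: fill if r[key] is None else r[key]} for r in rows]


def drop_outliers(rows, key, k=1.5):
    """Keep rows whose value lies within k * IQR of the quartiles (assumed rule)."""
    q1, _, q3 = quantiles([r[key] for r in rows], n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [r for r in rows if low <= r[key] <= high]


def merge_on_date(left, right):
    """Inner-join two lists of dicts on a shared 'date' field (hypothetical key)."""
    by_date = {r["date"]: r for r in right}
    return [{**l, **by_date[l["date"]]} for l in left if l["date"] in by_date]
```

A real pipeline would more likely use pandas; the plain-Python version is only meant to make the three steps concrete.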
NoCORA - Northern Cameroon Observed Rainfall Archive
data.niaid.nih.gov
zenodo.org
Updated Jul 10, 2024
Cite
Lavarenne, Jérémy (2024). NoCORA - Northern Cameroon Observed Rainfall Archive [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10156437
Explore at:
Dataset updated
Jul 10, 2024
Dataset provided by
Foulna Tcheobe, Carmel
Nenwala, Victor Hugo
Lavarenne, Jérémy
Area covered
North Region, Cameroon
Description
Description: The NoCORA dataset represents a significant effort to compile and clean a comprehensive set of daily rainfall data for Northern Cameroon (North and Extreme North regions). This dataset, covering more than 1 million observations across 418 rainfall stations over a period running from 1927 to 2022, is instrumental for researchers, meteorologists, and policymakers working in climate research, agricultural planning, and water resource management in the region. It integrates data from diverse sources, including Sodecoton rain funnels, the archive of Robert Morel (IRD), Centrale de Lagdo, the GHCN daily service, and the TAHMO network. The construction of NoCORA involved meticulous processes, including manual assembly of data, extensive data cleaning, and standardization of station names and coordinates, making it a hopefully robust and reliable resource for understanding climatic dynamics in Northern Cameroon.

Data Sources: The dataset comprises eight primary rainfall data sources and a comprehensive coordinates dataset. The rainfall data sources include extensive historical and contemporary measurements, while the coordinates dataset was developed using reference data and an inference strategy for variant station names or missing coordinates.

Dataset Preparation Methods: The preparation involved manual compilation, integration of machine-readable files, data cleaning with OpenRefine, and finalization using Python/Jupyter Notebook. This process should ensure the accuracy and consistency of the dataset.

Discussion: NoCORA, with its extensive data compilation, presents an invaluable resource for climate-related studies in Northern Cameroon. However, users must navigate its complexities, including missing data interpretations, potential biases, and data inconsistencies. The dataset's comprehensive nature and historical span require careful handling and validation in research applications.

Access to Dataset: The NoCORA dataset, while a comprehensive resource for climatological and meteorological research in Northern Cameroon, is subject to specific access conditions due to its compilation from various partner sources. The original data sources vary in their openness and accessibility, and not all partners have confirmed the open-access status of their data. As such, to ensure compliance with these varying conditions, access to the NoCORA dataset is granted on a request basis. Interested researchers and users are encouraged to contact us for permission to access the dataset. This process allows us to uphold the data sharing agreements with our partners while facilitating research and analysis within the scientific community.

Authors Contributions: Data treatment: Victor Hugo Nenwala, Carmel Foulna Tcheobe, Jérémy Lavarenne. Documentation: Jérémy Lavarenne.

Funding: This project was funded by the DESIRA INNOVACC project.

Changelog:

v1.0.2: corrected an inversion of column names in the coordinates dataset

v1.0.1: dataset specification file has been updated with complementary information regarding station locations

v1.0.0: initial submission
codeparrot-clean
huggingface.co
Updated Dec 7, 2021
Cite
CodeParrot (2021). codeparrot-clean [Dataset]. https://huggingface.co/datasets/codeparrot/codeparrot-clean
Explore at:
Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Dec 7, 2021
Dataset authored and provided by
CodeParrot
Description
CodeParrot 🦜 Dataset Cleaned

What is it?

A dataset of Python files from GitHub. This is the deduplicated version of the codeparrot dataset.

Processing

The original dataset contains a lot of duplicated and noisy data. Therefore, the dataset was cleaned with the following steps:

Deduplication: remove exact matches

Filtering:

Average line length < 100
Maximum line length < 1000
Alphanumeric character fraction > 0.25
Remove auto-generated files (keyword search)
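The cleaning steps listed above can be sketched in a few lines of Python. This is not CodeParrot's actual preprocessing code: the keyword list, the choice to search only the start of each file, and the use of SHA-256 for exact-match detection are assumptions made for illustration. Only the three numeric thresholds come from the description.

```python
import hashlib

# Hypothetical keyword list; the description only says "keyword search".
AUTOGEN_KEYWORDS = ("auto-generated", "autogenerated", "automatically generated")


def passes_filters(source: str) -> bool:
    """Apply the line-length, alphanumeric-fraction, and keyword filters to one file."""
    lines = source.splitlines()
    if not lines:
        return False
    lengths = [len(line) for line in lines]
    if sum(lengths) / len(lengths) >= 100:   # average line length < 100
        return False
    if max(lengths) >= 1000:                 # maximum line length < 1000
        return False
    alnum_fraction = sum(c.isalnum() for c in source) / len(source)
    if alnum_fraction <= 0.25:               # alphanumeric fraction > 0.25
        return False
    head = source[:2000].lower()             # assumed: only scan the file header
    return not any(keyword in head for keyword in AUTOGEN_KEYWORDS)


def deduplicate(sources):
    """Drop exact duplicates, keeping the first occurrence of each file."""
    seen, unique = set(), []
    for source in sources:
        digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(source)
    return unique


def clean(sources):
    """Deduplicate first, then filter, mirroring the order given in the description."""
    return [s for s in deduplicate(sources) if passes_filters(s)]
```

Hashing keeps memory use bounded by the number of distinct files instead of their total size, which matters at the scale of a GitHub crawl.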

For… See the full description on the dataset page: https://huggingface.co/datasets/codeparrot/codeparrot-clean.
Python Codes for Data Analysis of The Impact of COVID-19 on Technical...
figshare.com
dataverse.harvard.edu
Updated Aug 1, 2022
Cite
Elizabeth Szkirpan (2022). Python Codes for Data Analysis of The Impact of COVID-19 on Technical Services Units Survey Results [Dataset]. http://doi.org/10.6084/m9.figshare.20416092.v1
Explore at:
Unique identifier
https://doi.org/10.6084/m9.figshare.20416092.v1
Dataset updated
Aug 1, 2022
Dataset provided by
Figshare (http://figshare.com/)
Authors
Elizabeth Szkirpan
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Copies of Anaconda 3 Jupyter Notebooks and Python script for holistic and clustered analysis of "The Impact of COVID-19 on Technical Services Units" survey results. Data was analyzed holistically using cleaned and standardized survey results and by library type clusters. To streamline data analysis in certain locations, an off-shoot CSV file was created so data could be standardized without compromising the integrity of the parent clean file. Three Jupyter Notebooks/Python scripts are available in relation to this project: COVID_Impact_TechnicalServices_HolisticAnalysis (a holistic analysis of all survey data) and COVID_Impact_TechnicalServices_LibraryTypeAnalysis (a clustered analysis of impact by library type, clustered files available as part of the Dataverse for this project).
London 'Data' Job Posts, Raw and Clean.
kaggle.com
Updated Dec 15, 2022
Cite
EdRenton (2022). London 'Data' Job Posts, Raw and Clean. [Dataset]. https://www.kaggle.com/datasets/edrenton/job-post-data-raw-cleaned-using-sql/discussion
Explore at:
Dataset updated
Dec 15, 2022
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
EdRenton
Area covered
London
Description
This is a dataset I extracted from UK job posts in London, from reed.co.uk. These jobs have the keyword 'data' in them. I extracted the data using Python. I created a loop to extract over 400 pages, allowing me to scrape over 10,000 job posts.
Saccade data cleaning
figshare.com
txt
Updated Mar 26, 2022
Cite
Annie Campbell (2022). Saccade data cleaning [Dataset]. http://doi.org/10.6084/m9.figshare.4810471.v1
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.4810471.v1
Dataset updated
Mar 26, 2022
Dataset provided by
Figshare (http://figshare.com/)
Authors
Annie Campbell
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
Python scripts and functions needed to view and clean saccade data.
Data from: ManyTypes4Py: A benchmark Python Dataset for Machine...
explore.openaire.eu
data.europa.eu
Updated Apr 26, 2021
Cite
Amir M. Mir; Evaldas Latoskinas; Georgios Gousios (2021). ManyTypes4Py: A benchmark Python Dataset for Machine Learning-Based Type Inference [Dataset]. http://doi.org/10.5281/zenodo.4044635
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.4044635
Dataset updated
Apr 26, 2021
Authors
Amir M. Mir; Evaldas Latoskinas; Georgios Gousios
Description
The dataset was gathered on Sep. 17th 2020 from GitHub. It has more than 5.2K Python repositories and 4.2M type annotations. The dataset is also de-duplicated using the CD4Py tool. Check out the README.MD file for the description of the dataset. Notable changes to each version of the dataset are documented in CHANGELOG.md. The dataset's scripts and utilities are available on its GitHub repository.
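To make "type annotations" concrete, the sketch below counts annotated arguments, return types, and annotated variables in a single Python source string using the standard `ast` module. It is only an illustration of what such a count involves; it is not the ManyTypes4Py extraction pipeline, and the function name `count_annotations` is hypothetical.

```python
import ast


def count_annotations(source: str) -> dict:
    """Count argument, return, and variable annotations in one source file."""
    counts = {"arguments": 0, "returns": 0, "variables": 0}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if node.returns is not None:       # def f(...) -> T
                counts["returns"] += 1
        elif isinstance(node, ast.arg):
            if node.annotation is not None:    # def f(x: T)
                counts["arguments"] += 1
        elif isinstance(node, ast.AnnAssign):  # x: T = value
            counts["variables"] += 1
    return counts
```

For example, `count_annotations("def f(a: int, b) -> bool: ...")` reports one annotated argument and one return annotation.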
Customer Sale Dataset for Data Visualization
kaggle.com
Updated Jun 6, 2025
Cite
Atul (2025). Customer Sale Dataset for Data Visualization [Dataset]. https://www.kaggle.com/datasets/atulkgoyl/customer-sale-dataset-for-visualization
Explore at:
Dataset updated
Jun 6, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Atul
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
This synthetic dataset is designed specifically for practicing data visualization and exploratory data analysis (EDA) using popular Python libraries like Seaborn, Matplotlib, and Pandas.

Unlike most public datasets, this one includes a diverse mix of column types:

📅 Date columns (for time series and trend plots)

🔢 Numerical columns (for histograms, boxplots, scatter plots)

🏷️ Categorical columns (for bar charts, group analysis)

Whether you are a beginner learning how to visualize data or an intermediate user testing new charting techniques, this dataset offers a versatile playground.

Feel free to:

Create EDA notebooks

Practice plotting techniques

Experiment with filtering, grouping, and aggregations

🛠️ No missing values, no data cleaning needed: just download and start exploring!

Hope you find this helpful. Looking forward to hearing from you all.
A Replication Dataset for Fundamental Frequency Estimation
zenodo.org
live.european-language-grid.eu
bin
Updated Apr 24, 2025
Cite
Bastian Bechtold (2025). A Replication Dataset for Fundamental Frequency Estimation [Dataset]. http://doi.org/10.5281/zenodo.3904389
Explore at:
Available download formats: bin
Unique identifier
https://doi.org/10.5281/zenodo.3904389
Dataset updated
Apr 24, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Bastian Bechtold
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Part of the dissertation Pitch of Voiced Speech in the Short-Time Fourier Transform: Algorithms, Ground Truths, and Evaluation Methods.
© 2020, Bastian Bechtold. All rights reserved.

Estimating the fundamental frequency of speech remains an active area of research, with varied applications in speech recognition, speaker identification, and speech compression. A vast number of algorithms for estimating this quantity have been proposed over the years, and a number of speech and noise corpora have been developed for evaluating their performance. The present dataset contains estimated fundamental frequency tracks of 25 algorithms, six speech corpora, two noise corpora, at nine signal-to-noise ratios between -20 and 20 dB SNR, as well as an additional evaluation of synthetic harmonic tone complexes in white noise.

The dataset also contains pre-calculated performance measures both novel and traditional, in reference to each speech corpus’ ground truth, the algorithms’ own clean-speech estimate, and our own consensus truth. It can thus serve as the basis for a comparison study, or to replicate existing studies from a larger dataset, or as a reference for developing new fundamental frequency estimation algorithms. All source code and data is available to download, and entirely reproducible, albeit requiring about one year of processor-time.

Included Code and Data

ground truth data.zip is a JBOF dataset of fundamental frequency estimates and ground truths of all speech files in the following corpora:

CMU-ARCTIC (consensus truth) [1]

FDA (corpus truth and consensus truth) [2]

KEELE (corpus truth and consensus truth) [3]

MOCHA-TIMIT (consensus truth) [4]

PTDB-TUG (corpus truth and consensus truth) [5]

TIMIT (consensus truth) [6]

noisy speech data.zip is a JBOF dataset of fundamental frequency estimates of speech files mixed with noise from the following corpora:

NOISEX [7]

QUT-NOISE [8]

synthetic speech data.zip is a JBOF dataset of fundamental frequency estimates of synthetic harmonic tone complexes in white noise.

noisy_speech.pkl and synthetic_speech.pkl are pickled Pandas dataframes of performance metrics derived from the above data for the following list of fundamental frequency estimation algorithms:

AUTOC [9]

AMDF [10]

BANA [11]

CEP [12]

CREPE [13]

DIO [14]

DNN [15]

KALDI [16]

MAPS

MBSC [17]

NLS [18]

PEFAC [19]

PRAAT [20]

RAPT [21]

SACC [22]

SAFE [23]

SHR [24]

SIFT [25]

SRH [26]

STRAIGHT [27]

SWIPE [28]

YAAPT [29]

YIN [30]

noisy speech evaluation.py and synthetic speech evaluation.py are Python programs to calculate the above Pandas dataframes from the above JBOF datasets. They calculate the following performance measures:

Gross Pitch Error (GPE), the percentage of pitches where the estimated pitch deviates from the true pitch by more than 20%.

Fine Pitch Error (FPE), the mean error of grossly correct estimates.

High/Low Octave Pitch Error (OPE), the percentage of pitches that are GPEs and happen to be at an integer multiple of the true pitch.

Gross Remaining Error (GRE), the percentage of pitches that are GPEs but not OPEs.

Fine Remaining Bias (FRB), the median error of GREs.

True Positive Rate (TPR), the percentage of true positive voicing estimates.

False Positive Rate (FPR), the percentage of false positive voicing estimates.

False Negative Rate (FNR), the percentage of false negative voicing estimates.

F₁, the harmonic mean of precision and recall of the voicing decision.
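Three of the measures above can be sketched directly from their definitions. This is a minimal illustration, not the dataset's evaluation code. It assumes that a pitch value of 0 marks an unvoiced frame, that pitch errors are only computed on frames where both truth and estimate are voiced, and that "error" means absolute deviation relative to the true pitch; none of these conventions is stated in the description. The inputs are also assumed to contain at least one frame of each kind, so no division-by-zero guards are included.

```python
def _voiced_pairs(true_f0, est_f0):
    """Frames where both truth and estimate are voiced (assumed: 0 means unvoiced)."""
    return [(t, e) for t, e in zip(true_f0, est_f0) if t > 0 and e > 0]


def gross_pitch_error(true_f0, est_f0, threshold=0.2):
    """GPE: percentage of compared frames deviating from truth by more than 20%."""
    pairs = _voiced_pairs(true_f0, est_f0)
    gross = sum(abs(e - t) / t > threshold for t, e in pairs)
    return 100.0 * gross / len(pairs)


def fine_pitch_error(true_f0, est_f0, threshold=0.2):
    """FPE: mean relative error over the grossly correct frames."""
    errors = [abs(e - t) / t for t, e in _voiced_pairs(true_f0, est_f0)]
    fine = [err for err in errors if err <= threshold]
    return sum(fine) / len(fine)


def voicing_f1(true_f0, est_f0):
    """F1: harmonic mean of precision and recall of the voiced/unvoiced decision."""
    frames = list(zip(true_f0, est_f0))
    tp = sum(t > 0 and e > 0 for t, e in frames)
    fp = sum(t == 0 and e > 0 for t, e in frames)
    fn = sum(t > 0 and e == 0 for t, e in frames)
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Separating gross from fine errors keeps a handful of octave jumps from dominating the mean error of otherwise accurate estimates.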

Pipfile is a pipenv-compatible pipfile for installing all prerequisites necessary for running the above Python programs.

The Python programs take about an hour to compute on a fast 2019 computer, and require at least 32 GB of memory.

References:

[1] John Kominek and Alan W Black. CMU ARCTIC database for speech synthesis, 2003.

[2] Paul C Bagshaw, Steven Hiller, and Mervyn A Jack. Enhanced Pitch Tracking and the Processing of F0 Contours for Computer Aided Intonation Teaching. In EUROSPEECH, 1993.

[3] F Plante, Georg F Meyer, and William A Ainsworth. A Pitch Extraction Reference Database. In Fourth European Conference on Speech Communication and Technology, pages 837–840, Madrid, Spain, 1995.

[4] Alan Wrench. MOCHA MultiCHannel Articulatory database: English, November 1999.

[5] Gregor Pirker, Michael Wohlmayr, Stefan Petrik, and Franz Pernkopf. A Pitch Tracking Corpus with Evaluation on Multipitch Tracking Scenario. page 4, 2011.

[6] John S. Garofolo, Lori F. Lamel, William M. Fisher, Jonathan G. Fiscus, David S. Pallett, Nancy L. Dahlgren, and Victor Zue. TIMIT Acoustic-Phonetic Continuous Speech Corpus, 1993.

[7] Andrew Varga and Herman J.M. Steeneken. Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems. Speech Communication, 12(3):247–251, July 1993.

[8] David B. Dean, Sridha Sridharan, Robert J. Vogt, and Michael W. Mason. The QUT-NOISE-TIMIT corpus for the evaluation of voice activity detection algorithms. Proceedings of Interspeech 2010, 2010.

[9] Man Mohan Sondhi. New methods of pitch extraction. Audio and Electroacoustics, IEEE Transactions on, 16(2):262–266, 1968.

[10] Myron J. Ross, Harry L. Shaffer, Asaf Cohen, Richard Freudberg, and Harold J. Manley. Average magnitude difference function pitch extractor. Acoustics, Speech and Signal Processing, IEEE Transactions on, 22(5):353–362, 1974.

[11] Na Yang, He Ba, Weiyang Cai, Ilker Demirkol, and Wendi Heinzelman. BaNa: A Noise Resilient Fundamental Frequency Detection Algorithm for Speech and Music. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(12):1833–1848, December 2014.

[12] Michael Noll. Cepstrum Pitch Determination. The Journal of the Acoustical Society of America, 41(2):293–309, 1967.

[13] Jong Wook Kim, Justin Salamon, Peter Li, and Juan Pablo Bello. CREPE: A Convolutional Representation for Pitch Estimation. arXiv:1802.06182 [cs, eess, stat], February 2018. arXiv: 1802.06182.

[14] Masanori Morise, Fumiya Yokomori, and Kenji Ozawa. WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications. IEICE Transactions on Information and Systems, E99.D(7):1877–1884, 2016.

[15] Kun Han and DeLiang Wang. Neural Network Based Pitch Tracking in Very Noisy Speech. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(12):2158–2168, December 2014.

[16] Pegah Ghahremani, Bagher BabaAli, Daniel Povey, Korbinian Riedhammer, Jan Trmal, and Sanjeev Khudanpur. A pitch extraction algorithm tuned for automatic speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 2494–2498. IEEE, 2014.

[17] Lee Ngee Tan and Abeer Alwan. Multi-band summary correlogram-based pitch detection for noisy speech. Speech Communication, 55(7-8):841–856, September 2013.

[18] Jesper Kjær Nielsen, Tobias Lindstrøm Jensen, Jesper Rindom Jensen, Mads Græsbøll Christensen, and Søren Holdt Jensen. Fast fundamental frequency estimation: Making a statistically
Netflix Movies and TV Shows Dataset Cleaned(excel)
kaggle.com
Updated Apr 8, 2025
Cite
Gaurav Tawri (2025). Netflix Movies and TV Shows Dataset Cleaned(excel) [Dataset]. https://www.kaggle.com/datasets/gauravtawri/netflix-movies-and-tv-shows-dataset-cleanedexcel
Explore at:
Dataset updated
Apr 8, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Gaurav Tawri
Description
This dataset is a cleaned and preprocessed version of the original Netflix Movies and TV Shows dataset available on Kaggle. All cleaning was done using Microsoft Excel — no programming involved.

🎯 What’s Included:

- Cleaned Excel file (standardized columns, proper date format, removed duplicates/missing values)
- A separate "formulas_used.txt" file listing all Excel formulas used during cleaning (e.g., TRIM, CLEAN, DATE, SUBSTITUTE, TEXTJOIN, etc.)
- Columns like 'date_added' have been properly formatted into DMY structure
- Multi-valued columns like 'listed_in' are split for better analysis
- Null values replaced with “Unknown” for clarity
- Duration field broken into numeric + unit components
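The dataset itself was cleaned entirely in Excel. For readers who want to reproduce similar steps in code, the following is a rough Python equivalent of the transformations listed above. It is a sketch under stated assumptions: the input date format (`September 25, 2021`), the duration format (`90 min`, `2 Seasons`), and the output column names `duration_value` and `duration_unit` are guesses for illustration, not taken from the dataset's documentation.

```python
import re
from datetime import datetime


def clean_row(row: dict) -> dict:
    """Apply TRIM-like stripping, null replacement, date reformatting, and field splits."""
    # Strip whitespace; replace empty or missing values with "Unknown".
    out = {
        key: (value.strip() if isinstance(value, str) and value.strip() else "Unknown")
        for key, value in row.items()
    }
    # Reformat the date into day-month-year order (assumed input: "Month D, YYYY").
    if out["date_added"] != "Unknown":
        parsed = datetime.strptime(out["date_added"], "%B %d, %Y")
        out["date_added"] = parsed.strftime("%d-%m-%Y")
    # Split the multi-valued genre column into a list.
    out["listed_in"] = [genre.strip() for genre in out["listed_in"].split(",")]
    # Break duration into a number and a unit (assumed format: "<number> <unit>").
    match = re.fullmatch(r"(\d+)\s+(\w+)", out["duration"])
    if match:
        out["duration_value"], out["duration_unit"] = int(match.group(1)), match.group(2)
    else:
        out["duration_value"], out["duration_unit"] = None, "Unknown"
    return out
```

Month-name parsing with `%B` depends on the process locale, so this assumes an English locale.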

🔍 Dataset Purpose: Ideal for beginners and analysts who want to:

- Practice data cleaning in Excel
- Explore Netflix content trends
- Analyze content by type, country, genre, or date added

📁 Original Dataset Credit: The base version was originally published by Shivam Bansal on Kaggle: https://www.kaggle.com/shivamb/netflix-shows

📌 Bonus: You can find a step-by-step cleaning guide and the same dataset on GitHub as well — along with screenshots and formulas documentation.
Data from: SBIR - STTR Data and Code for Collecting Wrangling and Using It
search.dataone.org
dataverse.harvard.edu
Updated Nov 22, 2023
Cite
Allard, Grant (2023). SBIR - STTR Data and Code for Collecting Wrangling and Using It [Dataset]. http://doi.org/10.7910/DVN/CKTAZX
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/CKTAZX
Dataset updated
Nov 22, 2023
Dataset provided by
Harvard Dataverse
Authors
Allard, Grant
Description
Data set consisting of data joined for analyzing the SBIR/STTR program. Data consists of individual awards and agency-level observations. The R and Python code required for pulling, cleaning, and creating useful data sets has been included.

Allard_Get and Clean Data.R: This file provides the code for getting, cleaning, and joining the numerous data sets that this project combined. This code is written in the R language and can be used in any R environment running R 3.5.1 or higher. If the other files in this Dataverse are downloaded to the working directory, then this R code will be able to replicate the original study without needing the user to update any file paths.

Allard SBIR STTR WebScraper.py: This is the code I deployed to multiple Amazon EC2 instances to scrape data on each individual award in my data set, including the contact info and DUNS data.

Allard_Analysis_APPAM SBIR project: Forthcoming

Allard_Spatial Analysis: Forthcoming

Awards_SBIR_df.Rdata: This unique data set consists of 89,330 observations spanning the years 1983 - 2018 and accounting for all eleven SBIR/STTR agencies. This data set consists of data collected from the Small Business Administration's Awards API and also unique data collected through web scraping by the author.

Budget_SBIR_df.Rdata: 246 observations for 20 agencies across 25 years of their budget-performance in the SBIR/STTR program. Data was collected from the Small Business Administration using the Annual Reports Dashboard, the Awards API, and an author-designed web crawler of the websites of awards.

Solicit_SBIR-df.Rdata: This data consists of observations of solicitations published by agencies for the SBIR program. This data was collected from the SBA Solicitations API.

Primary Sources

Small Business Administration. “Annual Reports Dashboard,” 2018. https://www.sbir.gov/awards/annual-reports.

Small Business Administration. “SBIR Awards Data,” 2018. https://www.sbir.gov/api.

Small Business Administration. “SBIR Solicit Data,” 2018. https://www.sbir.gov/api.
CompuCrawl: Full database and code
dataverse.nl
Updated Sep 23, 2025
Cite
Richard Haans (2025). CompuCrawl: Full database and code [Dataset]. http://doi.org/10.34894/OBVAOY
Explore at:
Unique identifier
https://doi.org/10.34894/OBVAOY
Dataset updated
Sep 23, 2025
Dataset provided by
DataverseNL
Authors
Richard Haans
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This folder contains the full set of code and data for the CompuCrawl database. The database contains the archived websites of publicly traded North American firms listed in the Compustat database between 1996 and 2020, representing 11,277 firms with 86,303 firm/year observations and 1,617,675 webpages in the final cleaned and selected set.

The files are ordered by moment of use in the work flow. For example, the first file in the list is the input file for code files 01 and 02, which create and update the two tracking files "scrapedURLs.csv" and "URLs_1_deeper.csv" and which write HTML files to its folder. "HTML.zip" is the resultant folder, converted to .zip for ease of sharing. Code file 03 then reads this .zip file and is therefore below it in the ordering.

The full set of files, in order of use, is as follows:

Compustat_2021.xlsx: The input file containing the URLs to be scraped and their date range.
01 Collect frontpages.py: Python script scraping the front pages of the list of URLs and generating a list of URLs one page deeper in the domains.
URLs_1_deeper.csv: List of URLs one page deeper on the main domains.
02 Collect further pages.py: Python script scraping the list of URLs one page deeper in the domains.
scrapedURLs.csv: Tracking file containing all URLs that were accessed and their scraping status.
HTML.zip: Archived version of the set of individual HTML files.
03 Convert HTML to plaintext.py: Python script converting the individual HTML pages to plaintext.
TXT_uncleaned.zip: Archived version of the converted yet uncleaned plaintext files.
input_categorization_allpages.csv: Input file for classification of pages using GPT according to their HTML title and URL.
04 GPT application.py: Python script using OpenAI’s API to classify selected pages according to their HTML title and URL.
categorization_applied.csv: Output file containing classification of selected pages.
exclusion_list.xlsx: File containing three sheets: 'gvkeys' containing the GVKEYs of duplicate observations (that need to be excluded), 'pages' containing page IDs for pages that should be removed, and 'sentences' containing (sub-)sentences to be removed.
05 Clean and select.py: Python script applying data selection and cleaning (including selection based on page category), with setting and decisions described at the top of the script. This script also combined individual pages into one combined observation per GVKEY/year.
metadata.csv: Metadata containing information on all processed HTML pages, including those not selected.
TXT_cleaned.zip: Archived version of the selected and cleaned plaintext page files. This file serves as input for the word embeddings application.
TXT_combined.zip: Archived version of the combined plaintext files at the GVKEY/year level. This file serves as input for the data description using topic modeling.
06 Topic model.R: R script that loads up the combined text data from the folder stored in "TXT_combined.zip", applies further cleaning, and estimates a 125-topic model.
TM_125.RData: RData file containing the results of the 125-topic model.
loadings125.csv: CSV file containing the loadings for all 125 topics for all GVKEY/year observations that were included in the topic model.
125_topprob.xlsx: Overview of top-loading terms for the 125 topic model.
07 Word2Vec train and align.py: Python script that loads the plaintext files in the "TXT_cleaned.zip" archive to train a series of Word2Vec models and subsequently align them in order to compare word embeddings across time periods.
Word2Vec_models.zip: Archived version of the saved Word2Vec models, both unaligned and aligned.
08 Word2Vec work with aligned models.py: Python script which loads the trained Word2Vec models to trace the development of the embeddings for the terms “sustainability” and “profitability” over time.
99 Scrape further levels down.py: Python script that can be used to generate a list of unscraped URLs from the pages that themselves were one level deeper than the front page.
URLs_2_deeper.csv: CSV file containing unscraped URLs from the pages that themselves were one level deeper than the front page.

For those only interested in downloading the final database of texts, the files "HTML.zip", "TXT_uncleaned.zip", "TXT_cleaned.zip", and "TXT_combined.zip" contain the full set of HTML pages, the processed but uncleaned texts, the selected and cleaned texts, and combined and cleaned texts at the GVKEY/year level, respectively.

The following webpage contains answers to frequently asked questions: https://haans-mertens.github.io/faq/. More information on the database and the underlying project can be found here: https://haans-mertens.github.io/ and the following article: “The Internet Never Forgets: A Four-Step Scraping Tutorial, Codebase, and Database for Longitudinal Organizational Website Data”, by Richard F.J. Haans and Marc J. Mertens in Organizational Research Methods. The full paper can be accessed here.
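As an illustration of the HTML-to-plaintext step in this workflow, the sketch below extracts visible text from an HTML string using only Python's standard `html.parser`. It is not the code from "03 Convert HTML to plaintext.py", whose approach is not described here; the class name `TextExtractor`, the decision to skip `script` and `style` content, and joining fragments with single spaces are all assumptions.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document as one space-joined string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

Archived pages are often malformed, and `html.parser` tolerates unclosed tags, which is one reason to prefer it over a strict XML parser for this kind of input.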
Comprehensive Formula 1 Dataset (2020-2025)
kaggle.com
Updated Jul 27, 2025
Cite
V SHREE KAMALESH (2025). Comprehensive Formula 1 Dataset (2020-2025) [Dataset]. https://www.kaggle.com/datasets/vshreekamalesh/comprehensive-formula-1-dataset-2020-2025
Explore at:
Dataset updated
Jul 27, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
V SHREE KAMALESH
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Formula 1 Comprehensive Dataset (2020-2025)

Dataset Description This comprehensive Formula 1 dataset contains detailed racing data spanning from 2020 to 2025, including race results, qualifying sessions, championship standings, circuit information, and historical driver statistics.

Perfect for:

📊 F1 performance analysis

🤖 Machine learning projects

📈 Data visualization

🏆 Championship predictions

📋 Racing statistics research

📁 Files Included

1. f1_race_results_2020_2025.csv (53 entries): Race winners and results from Grand Prix weekends

Date, Grand Prix name, race winner

Constructor, nationality, grid position

Race time, fastest lap time, points scored

2. f1_qualifying_results_2020_2024.csv (820 entries): Qualifying session results with timing data

Q1, Q2, Q3 session times

Grid positions, laps completed

Driver and constructor information

3. f1_driver_standings_progressive.csv (600 entries): Championship standings progression throughout seasons

Points accumulation over race weekends

Wins, podiums, pole positions tracking

Season-long championship battle data

4. f1_constructor_standings_progressive.csv (360 entries): Team championship standings evolution

Constructor points and wins

Team performance metrics

Manufacturer rivalry data

5. f1_circuits_technical_data.csv (24 entries): Technical specifications for all F1 circuits

Track length, number of turns

Lap records and record holders

Circuit designers and first F1 usage

6. f1_historical_driver_statistics.csv (30 entries): All-time career statistics for F1 drivers

Career wins, poles, podiums

Racing entries and achievements

Active and retired driver records

7. f1_comprehensive_dataset_2020_2025.csv (432 entries): MAIN DATASET, combined data from all sources

Multiple data types in one file

Ready for immediate analysis

Comprehensive F1 information hub

🔧 Data Features Clean & Structured: All data professionally format
Google Ads sales dataset
kaggle.com
Updated Jul 22, 2025
Cite
NayakGanesh007 (2025). Google Ads sales dataset [Dataset]. https://www.kaggle.com/datasets/nayakganesh007/google-ads-sales-dataset
Explore at:
Dataset updated
Jul 22, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
NayakGanesh007
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Google Ads Sales Dataset for Data Analytics Campaigns (Raw & Uncleaned)

📝 Dataset Overview

This dataset contains raw, uncleaned advertising data from a simulated Google Ads campaign promoting data analytics courses and services. It closely mimics what real digital marketers and analysts would encounter when working with exported campaign data, including typos, formatting issues, missing values, and inconsistencies.

It is ideal for practicing:

Data cleaning

Exploratory Data Analysis (EDA)

Marketing analytics

Campaign performance insights

Dashboard creation using tools like Excel, Python, or Power BI

📁 Columns in the Dataset

Column Name: Description

Ad_ID: Unique ID of the ad campaign
Campaign_Name: Name of the campaign (with typos and variations)
Clicks: Number of clicks received
Impressions: Number of ad impressions
Cost: Total cost of the ad (in ₹ or $ format with missing values)
Leads: Number of leads generated
Conversions: Number of actual conversions (signups, sales, etc.)
Conversion Rate: Calculated conversion rate (Conversions ÷ Clicks)
Sale_Amount: Revenue generated from the conversions
Ad_Date: Date of the ad activity (in inconsistent formats like YYYY/MM/DD, DD-MM-YY)
Location: City where the ad was served (includes spelling/case variations)
Device: Device type (Mobile, Desktop, Tablet with mixed casing)
Keyword: Keyword that triggered the ad (with typos)

⚠️ Data Quality Issues (Intentional)

This dataset was intentionally left raw and uncleaned to reflect real-world messiness, such as:

Inconsistent date formats

Spelling errors (e.g., "analitics", "anaytics")

Duplicate rows

Mixed units and symbols in cost/revenue columns

Missing values

Irregular casing in categorical fields (e.g., "mobile", "Mobile", "MOBILE")
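The issues listed above map onto a few small normalisation helpers. The following is a minimal sketch using only the Python standard library, not the dataset author's method. The function names are hypothetical, the two date layouts are the ones named in the column description, and stripping the currency symbol discards which currency a value was in, so mixed ₹/$ amounts would still need separate handling.

```python
import re
from datetime import date, datetime
from typing import Optional

# Date layouts named in the column description (YYYY/MM/DD and DD-MM-YY).
# Assumption: extend this tuple if other layouts turn up in the data.
DATE_FORMATS = ("%Y/%m/%d", "%d-%m-%y")


def clean_category(value: Optional[str]) -> Optional[str]:
    """Trim whitespace and unify casing, e.g. 'MOBILE' -> 'Mobile'."""
    if value is None or not value.strip():
        return None
    return value.strip().title()


def parse_cost(value: Optional[str]) -> Optional[float]:
    """Drop currency symbols and thousands separators; None if unusable."""
    if value is None:
        return None
    digits = re.sub(r"[^0-9.\-]", "", value)
    try:
        return float(digits)
    except ValueError:
        return None


def parse_date(value: Optional[str]) -> Optional[date]:
    """Try each known layout in turn; None if none of them matches."""
    if value is None:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date()
        except ValueError:
            continue
    return None
```

With pandas, each helper can be applied to a column through `Series.map`, for example `df["Device"].map(clean_category)`, followed by `df.drop_duplicates()` for the duplicate rows.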

🎯 Use Cases

Data cleaning exercises in Python (Pandas), R, Excel

Data preprocessing for machine learning

Campaign performance analysis

Conversion optimization tracking

Building dashboards in Power BI, Tableau, or Looker

💡 Sample Analysis Ideas

Track campaign cost vs. return (ROI)

Analyze click-through rates (CTR) by device or location

Clean and standardize campaign names and keywords

Investigate keyword performance vs. conversions
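The first two ideas reduce to simple ratios once the columns are numeric. The sketch below is illustrative only: the function names are hypothetical, the conversion rate follows the definition given in the column list (Conversions ÷ Clicks), CTR is taken as Clicks ÷ Impressions, and ROI is assumed to mean (Sale_Amount − Cost) ÷ Cost, which is one common definition rather than something the dataset specifies.

```python
from typing import Dict, Optional


def safe_ratio(numerator: float, denominator: float) -> Optional[float]:
    """Divide, returning None when the denominator is zero or missing."""
    if not denominator:
        return None
    return numerator / denominator


def campaign_metrics(clicks: float, impressions: float, conversions: float,
                     cost: float, sale_amount: float) -> Dict[str, Optional[float]]:
    """Compute per-row ratios from already-cleaned numeric values."""
    return {
        # Click-through rate: share of impressions that led to a click.
        "ctr": safe_ratio(clicks, impressions),
        # Conversion rate as defined in the column list.
        "conversion_rate": safe_ratio(conversions, clicks),
        # Assumed ROI definition: net return per unit of spend.
        "roi": safe_ratio(sale_amount - cost, cost),
    }
```

Grouping rows by Device or Location before summing clicks and impressions gives the per-segment CTR mentioned above; summing first and dividing afterwards avoids averaging ratios of unequal weight.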

🔖 Tags Digital Marketing · Google Ads · Marketing Analytics · Data Cleaning · Pandas Practice · Business Analytics · CRM Data
Data from: Data to Estimate Water Use Associated with Oil and Gas...
catalog.data.gov
data.usgs.gov
Updated Oct 1, 2025
U.S. Geological Survey (2025). Data to Estimate Water Use Associated with Oil and Gas Development within the Bureau of Land Management Carlsbad Field Office Area, New Mexico [Dataset]. https://catalog.data.gov/dataset/data-to-estimate-water-use-associated-with-oil-and-gas-development-within-the-bureau-of-la
Dataset updated
Oct 1, 2025
Dataset provided by
U.S. Geological Survey
Area covered
New Mexico
Description
The purpose of this data release is to support the Bureau of Land Management's (BLM) Reasonably Foreseeable Development (RFD) Scenario by estimating water use associated with oil and gas extraction methods within the BLM Carlsbad Field Office (CFO) planning area, located in Eddy and Lea Counties as well as part of Chaves County, New Mexico. The release includes three comma-separated value (CSV) files and two Python scripts. All reported oil and gas wells within Chaves County in the FracFocus and New Mexico Oil Conservation Division (NM OCD) databases were determined to be outside of the CFO administration area and were therefore excluded from well_records.csv and modeled_estimates.csv. Data from Chaves County are included in produced_water.csv to be consistent with the BLM's water support document. The three CSV files are produced_water.csv (volume) from NM OCD, well_records.csv (including location and completion) from NM OCD and FracFocus, and modeled_estimates.csv (using FracFocus as well as Ball and others (2020) as input data). The results in modeled_estimates.csv were obtained using a previously published regression model (McShane and McDowell, 2021) to estimate water use associated with unconventional oil and gas activities in the Permian Basin (Valder and others, 2021) for the period of interest (2010-2021). Additionally, Python scripts to process, clean, and categorize FracFocus data are provided in this data release.

Data from: Ecosystem-Level Determinants of Sustained Activity in Open-Source...

zenodo.org

application/gzip, bin +2

Updated Aug 2, 2024

Marat Valiev; Bogdan Vasilescu; James Herbsleb (2024). Ecosystem-Level Determinants of Sustained Activity in Open-Source Projects: A Case Study of the PyPI Ecosystem [Dataset]. http://doi.org/10.5281/zenodo.1419788

Available download formats: bin, application/gzip, zip, text/x-python

Unique identifier

https://doi.org/10.5281/zenodo.1419788

Dataset updated

Aug 2, 2024

Dataset provided by

Zenodo (http://zenodo.org/)

Authors

Marat Valiev; Bogdan Vasilescu; James Herbsleb

License

https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html

Description

Replication pack, FSE2018 submission #164:
------------------------------------------

**Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: 
A Case Study of the PyPI Ecosystem

**Note:** link to data artifacts is already included in the paper. 
Link to the code will be included in the Camera Ready version as well.


Content description
===================

- **ghd-0.1.0.zip** - the code archive. This code produces the dataset files 
 described below
- **settings.py** - settings template for the code archive.
- **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset.
 This dataset only includes stats aggregated by the ecosystem (PyPI)
- **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level
 statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages
 themselves, which take around 2TB.
- **build_model.r, helpers.r** - R files to process the survival data 
  (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, 
  `common.cache/survival_data.pypi_2008_2017-12_6.csv` in 
  **dataset_full_Jan_2018.tgz**)
- **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
- LICENSE - text of GPL v3, under which this dataset is published
- INSTALL.md - replication guide (~2 pages)

Replication guide
=================

Step 0 - prerequisites
----------------------

- Unix-compatible OS (Linux or OS X)
- Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
- R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)

Depending on the level of detail (see Step 2 for more details):
- up to 2 TB of disk space (see the levels of detail in Step 2)
- at least 16 GB of RAM (64 GB preferable)
- a few hours to a few months of processing time

Step 1 - software
----------------

- unpack **ghd-0.1.0.zip**, or clone from gitlab:

   git clone https://gitlab.com/user2589/ghd.git
   cd ghd
   git checkout 0.1.0
 
 `cd` into the extracted folder. 
 All commands below assume it as a current directory.
  
- copy `settings.py` into the extracted folder. Edit the file:
  * set `DATASET_PATH` to some newly created folder path
  * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS` 
- install docker. For Ubuntu Linux, the command is 
  `sudo apt-get install docker-compose`
- install libarchive and headers: `sudo apt-get install libarchive-dev`
- (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`
 Without this dependency, you might get an error on the next step, 
 but it's safe to ignore.
- install Python libraries: `pip install --user -r requirements.txt` . 
- disable all APIs except GitHub (Bitbucket and Gitlab support were
 not yet implemented when this study was in progress): edit
 `scraper/__init__.py`, comment out everything except GitHub support
 in `PROVIDERS`.

Step 2 - obtaining the dataset
-----------------------------

The ultimate goal of this step is to get output of the Python function 
`common.utils.survival_data()` and save it into a CSV file:

  # copy and paste into a Python console
  from common import utils
  survival_data = utils.survival_data('pypi', '2008', smoothing=6)
  survival_data.to_csv('survival_data.csv')

Since full replication will take several months, here are some ways to speed up
the process:

#### Option 2.a, difficulty level: easiest

Just use the precomputed data. Step 1 is not necessary under this scenario.

- extract **dataset_minimal_Jan_2018.zip**
- get `survival_data.csv`, go to the next step

#### Option 2.b, difficulty level: easy

Use precomputed longitudinal feature values to build the final table.
The whole process will take 15 to 30 minutes.

- create a folder `

#PraCegoVer dataset
data.niaid.nih.gov
Updated Jan 19, 2023
Gabriel Oliveira dos Santos; Esther Luna Colombini; Sandra Avila (2023). #PraCegoVer dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5710561
Dataset updated
Jan 19, 2023
Dataset provided by
Institute of Computing, University of Campinas
Authors
Gabriel Oliveira dos Santos; Esther Luna Colombini; Sandra Avila
Description
Automatically describing images in natural-language sentences is essential to the inclusion of visually impaired people on the Internet. Although there are many datasets in the literature, most of them contain only English captions, whereas datasets with captions in other languages are scarce.

PraCegoVer arose on the Internet, encouraging social media users to publish images, tag them #PraCegoVer, and add a short description of their content. Inspired by this movement, we propose #PraCegoVer, a multi-modal dataset with Portuguese captions based on posts from Instagram. It is the first large dataset for image captioning in Portuguese with freely annotated images.

PraCegoVer has 533,523 image-caption pairs, with captions in Portuguese, collected from more than 14 thousand different profiles. The average caption length in #PraCegoVer is 39.3 words, with a standard deviation of 29.7.

Dataset Structure

The PraCegoVer dataset is composed of the main file dataset.json and a collection of compressed files named images.tar.gz.partX containing the images. The file dataset.json contains a list of JSON objects with the attributes:

user: anonymized user that made the post;

filename: image file name;

raw_caption: raw caption;

caption: clean caption;

date: post date.

Each instance in dataset.json is associated with exactly one image in the images directory, whose file name is given by the filename attribute. We also provide a sample with five instances, so users can download the sample to get an overview of the dataset before downloading it completely.
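As a rough illustration of the structure described above, the following sketch loads dataset.json and checks each record against the five listed attributes. The function names and the images directory argument are hypothetical, and `json.load` reads the whole file into memory, which is an assumption about what is acceptable for a file of this size.

```python
import json
from pathlib import Path
from typing import Any, Dict, List, Set

# Attribute names as listed in the dataset description.
EXPECTED_FIELDS = {"user", "filename", "raw_caption", "caption", "date"}


def load_records(path: str) -> List[Dict[str, Any]]:
    """Read dataset.json, which is described as a list of JSON objects."""
    with open(path, encoding="utf-8") as handle:
        records = json.load(handle)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON list")
    return records


def missing_fields(record: Dict[str, Any]) -> Set[str]:
    """Return the expected attributes that a record lacks."""
    return EXPECTED_FIELDS - set(record)


def image_path(record: Dict[str, Any], images_dir: str) -> Path:
    """Resolve the image a record points to through its filename attribute."""
    return Path(images_dir) / record["filename"]
```

Running `missing_fields` over the five-instance sample first is a cheap way to confirm the layout before downloading the full image archive.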

Download Instructions

If you just want to have an overview of the dataset structure, you can download sample.tar.gz. But, if you want to use the dataset, or any of its subsets (63k and 173k), you must download all the files and run the following commands to uncompress and join the files:

cat images.tar.gz.part* > images.tar.gz
tar -xzvf images.tar.gz
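Where `cat` is not available, the join step can be approximated in Python. This is a sketch under assumptions, not part of the PraCegoVer tooling: the function names are hypothetical, and the parts are concatenated in sorted name order, which mirrors the shell glob only if the part suffixes sort correctly as text.

```python
import shutil
import tarfile
from pathlib import Path
from typing import List


def join_parts(parts_dir: str,
               prefix: str = "images.tar.gz.part",
               output_name: str = "images.tar.gz") -> Path:
    """Concatenate all files named <prefix>* into one archive, in name order."""
    directory = Path(parts_dir)
    parts = sorted(directory.glob(prefix + "*"))
    if not parts:
        raise FileNotFoundError(f"no files matching {prefix}* in {directory}")
    output = directory / output_name
    with open(output, "wb") as joined:
        for part in parts:
            with open(part, "rb") as chunk:
                # Stream each part so large files are never held in memory.
                shutil.copyfileobj(chunk, joined)
    return output


def list_members(archive: Path) -> List[str]:
    """List the archive contents as a quick check that the join worked."""
    with tarfile.open(archive, "r:gz") as tar:
        return tar.getnames()
```

Once `list_members` succeeds, the archive can be unpacked with the `tar -xzvf images.tar.gz` command shown above.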

Alternatively, you can download the entire dataset from the terminal using the python script download_dataset.py available in PraCegoVer repository. In this case, first, you have to download the script and create an access token here. Then, you can run the following command to download and uncompress the image files:

python download_dataset.py --access_token=
Data from: ManyTypes4Py: A Benchmark Python Dataset for Machine...
data.europa.eu
unknown
Updated Jul 3, 2025
Zenodo (2025). ManyTypes4Py: A Benchmark Python Dataset for Machine Learning-Based Type Inference [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-5244636?locale=pt
Available download formats: unknown (1052407809)
Dataset updated
Jul 3, 2025
Dataset authored and provided by
Zenodo (http://zenodo.org/)
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset was gathered from GitHub on Sep. 17th, 2020. It has clean and complete versions (from v0.7): the clean version has 5.1K type-checked Python repositories and 1.2M type annotations, and the complete version has 5.2K Python repositories and 3.3M type annotations. The source files of the clean version are type-checked using mypy. The dataset is also de-duplicated using the CD4Py tool. Check out the README.MD file for the description of the dataset. Notable changes to each version of the dataset are documented in CHANGELOG.md. The dataset's scripts and utilities are available on its GitHub repository.
Wi-Fi (CSI and RSSI) data of six basic knife activities for cooking...
zenodo.org
data.niaid.nih.gov
zip
Updated Apr 19, 2023
Majid Ghosian Moghaddam; Ali Asghar Nazari Shirehjini; Shervin Shirmohammadi (2023). Wi-Fi (CSI and RSSI) data of six basic knife activities for cooking (chopping, cubing, French cutting, julienning, mincing, and slicing) [Dataset]. http://doi.org/10.5281/zenodo.7843704
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.7843704
Dataset updated
Apr 19, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Majid Ghosian Moghaddam; Ali Asghar Nazari Shirehjini; Shervin Shirmohammadi
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
To gather the dataset, we asked two participants to perform six basic knife activities. The layout of the experimental setup is provided in Fig. 4. As it illustrates, we put the receiver on the right side and the ESP32 transceiver on the left side of the performing area. The performing area is a cutting board (30 x 46 cm) in this experiment. Each participant performed each activity five times in the performing area. The data were recorded on a laptop using a customized version of ESP32-CSI-tool [38], which let us record and save each recording in a separate file. After recording all 60 data entries, we used Python code to extract the clean data from the text generated by the tool. The clean data are stored in a database, which forms the dataset.
Experimental results for solar melting of zinc metal using multi-facet...
researchdata.up.ac.za
xlsx
Updated Sep 3, 2024
Pieter Bezuidenhout (2024). Experimental results for solar melting of zinc metal using multi-facet parabolic dish and a cavity receiver [Dataset]. http://doi.org/10.25403/UPresearchdata.26855203.v1
Available download formats: xlsx
Unique identifier
https://doi.org/10.25403/UPresearchdata.26855203.v1
Dataset updated
Sep 3, 2024
Dataset provided by
University of Pretoria
Authors
Pieter Bezuidenhout
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset contains the experimental data collected from thermocouples positioned inside a dual cavity solar receiver, used to demonstrate and evaluate the melting of zinc metal using only concentrated solar power as heat input. More information pertaining to the thermocouple positioning and the receiver design can be found in the thesis titled "Testing and development of a solar-dish cavity receiver for the melting of zinc metal". Experiments were conducted, each with a unique set of environmental conditions:

Experiment 1 – 26th of July 2022 = “Exp 1_26072022”
Experiment 2 – 4th of August 2022 = “Exp 2_04082022”
Experiment 3 – 16th of August 2022 = “Exp 3_16082022”
Experiment 4 – 21st of August 2022 = “Exp 4_21082022”
Experiment 5 – 5th of September 2022 = “Exp 5_05092022”

Also included in the dataset are the original weather data collected on the respective experimental test work days as well as the weather data in the processed form after correcting the weather data to serve as input for the numerical model developed in the Python coding language. Raw weather data:

Exp 1_Weather data_Original_26072022
Exp 2_Weather data_Original_04082022
Exp 3_Weather data_Original_16082022
Exp 4_Weather data_Original_21082022
Exp 5_Weather data_Original_05092022

Processed weather data:

Exp 1_Weather data_Post-process_26072022
Exp 2_Weather data_Post-process_04082022
Exp 3_Weather data_Post-process_16082022
Exp 4_Weather data_Post-process_21082022
Exp 5_Weather data_Post-process_05092022

In addition to the weather data and the experimental results collected on the five experimental runs, the dataset also contains the Python code used to predict the zinc temperature in the cavity receiver. The code was written in Jupyter Notebook, and the files consist of the heat loss calculations and zinc temperature prediction for each experimental run. The code has been validated against the experimental data and has a mean absolute percentage error (MAPE) of 2.7%. Given actual weather data as input, it can thus predict the zinc temperature inside a cavity receiver to within 2.7% on average. Python code for each experiment, with the heat transfer factor validated using the experimental data mentioned above:

Experiment 1.ipynb
Experiment 2.ipynb
Experiment 3.ipynb
Experiment 4.ipynb
Experiment 5.ipynb
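For readers unfamiliar with the validation metric quoted above, a mean absolute percentage error can be computed as in the sketch below. This is a generic definition, not code taken from the notebooks: the function name is hypothetical, and the description does not say which temperature scale was used, which matters because MAPE changes with the scale (for example °C versus K).

```python
from typing import Sequence


def mean_absolute_percentage_error(measured: Sequence[float],
                                   predicted: Sequence[float]) -> float:
    """Return the MAPE in percent: mean of |measured - predicted| / |measured|."""
    if len(measured) != len(predicted) or len(measured) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for m, p in zip(measured, predicted):
        if m == 0:
            # The relative error is undefined for a zero measurement.
            raise ValueError("measured values must be non-zero")
        total += abs((m - p) / m)
    return 100.0 * total / len(measured)
```

Here `measured` would hold the thermocouple readings and `predicted` the model output at the same time steps.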
