12 datasets found
  1. Prosper loan data.

    • kaggle.com
    zip
    Updated Jun 7, 2021
    Cite
    Shikhar Sharma (2021). Prosper loan data. [Dataset]. https://www.kaggle.com/shikhar07/prosper-loan-data
    Explore at:
    Available download formats: zip (23591647 bytes)
    Dataset updated
    Jun 7, 2021
    Authors
    Shikhar Sharma
    Description

    Context

    Loan Data from Prosper.

    Content

    This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, and many others. This data dictionary explains the variables in the data set.

  2. HR Records with Realistic Data Quality Issues

    • kaggle.com
    zip
    Updated Sep 15, 2025
    Cite
    Moamen Abdelkawy (2025). HR Records with Realistic Data Quality Issues [Dataset]. https://www.kaggle.com/datasets/moamenabdelkawy/employment-dataset-csv-data-cleaning-practice-set
    Explore at:
    Available download formats: zip (141231 bytes)
    Dataset updated
    Sep 15, 2025
    Authors
    Moamen Abdelkawy
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    employment_dataset.csv: Data Cleaning Practice Set

    This file contains anonymized employee records for hands-on data wrangling and cleaning tasks. The structure matches common corporate HR exports, with a mix of numerical, categorical, and date fields plus representative real-world issues.

    Contents

    Rows: ~10,990 (some duplicates present)
    Columns: 9

    Column            | Description
    Age               | Age of employee (float, may be missing)
    Salary            | Annual salary in USD (integer, no missing values)
    Experience        | Years of experience as text (e.g. '17 years', may be empty)
    Performance_Score | Last evaluation score (integer, no missing values)
    Gender            | M, F, Male, Female, or blank
    Department        | Department (HR, Engineering, Sales, Marketing, or blank)
    Hired             | Yes, No, Y, N, or blank
    Hiring_Date       | ISO date string (may be empty, various validity)
    Location          | Work city (Austin, Seattle, Boston, New York, or blank)

    Key Data Issues

    • Missing values in most columns
    • Mixed representations for Gender and Hired
    • Text in Experience column (needs to be numeric)
    • Hiring_Date in free text, some missing or invalid
    • Exact duplicate rows present
    • No outlier handling has been performed

    Sample Rows

    Age | Salary | Experience | Performance_Score | Gender | Department | Hired | Hiring_Date | Location
    39  | 46945  | 17 years   | 9                 |        | Marketing   | Yes   | 2021-02-24  | Austin
    27  | 27102  | 4 years    | 3                 | M      | Sales       | Y     | 2013-07-19  | Seattle
    29  | 50624  | 10 years   | 8                 | Male   | Engineering | Yes   | 2015-03-28  | Austin

    Suggestions for Users

    • Parse and convert Experience to integer
    • Standardize Gender and Hired fields to a single format
    • Impute missing numeric features using mean or median
    • Fill categorical NAs with a placeholder (e.g., Unknown)
    • Parse Hiring_Date to pandas datetime (handle errors)
    • Remove exact duplicate rows for clean analysis
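    The suggested steps can be sketched with pandas. The miniature frame below is illustrative (values adapted from the sample rows above, not read from the actual file), so the exact numbers are assumptions:

```python
import pandas as pd

# A few rows mimicking the described quirks (illustrative, not the real file).
df = pd.DataFrame({
    "Age": [39.0, None, 27.0, 27.0],
    "Salary": [46945, 50624, 27102, 27102],
    "Experience": ["17 years", "10 years", "4 years", "4 years"],
    "Performance_Score": [9, 8, 3, 3],
    "Gender": [None, "Male", "M", "M"],
    "Department": ["Marketing", "Engineering", "Sales", "Sales"],
    "Hired": ["Yes", "Yes", "Y", "Y"],
    "Hiring_Date": ["2021-02-24", "2015-03-28", "2013-07-19", "2013-07-19"],
    "Location": ["Austin", "Austin", "Seattle", "Seattle"],
})

# Experience text like "17 years" -> numeric years.
df["Experience"] = df["Experience"].str.extract(r"(\d+)", expand=False).astype(float)

# Standardize mixed Gender / Hired representations to one format.
df["Gender"] = df["Gender"].replace({"M": "Male", "F": "Female"}).fillna("Unknown")
df["Hired"] = df["Hired"].replace({"Y": "Yes", "N": "No"})

# Median imputation for missing numeric values.
df["Age"] = df["Age"].fillna(df["Age"].median())

# Dates of varying validity -> datetime; invalid strings become NaT.
df["Hiring_Date"] = pd.to_datetime(df["Hiring_Date"], errors="coerce")

# Drop exact duplicate rows.
df = df.drop_duplicates().reset_index(drop=True)
```

    For the real file, replace the inline frame with `pd.read_csv("employment_dataset.csv")`; the remaining steps apply unchanged.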

    Notebooks

    See the provided notebook (hr_records_data_cleaning) for typical cleaning techniques. The scripts demonstrate type conversion, missing-value imputation, categorical mapping, date parsing, and duplicate removal.

    License

    Synthetic data for skill-building and demonstration only. No real identities are included.

  3. Data Visualization Tools Market Size to Grow by USD 7.95 Billion from 2024...

    • technavio.com
    pdf
    Updated Feb 6, 2025
    Cite
    Technavio (2025). Data Visualization Tools Market Size to Grow by USD 7.95 Billion from 2024 to 2029 – Research Report | Technavio [Dataset]. https://www.technavio.com/report/data-visualization-tools-market-industry-analysis
    Explore at:
    Available download formats: pdf
    Dataset updated
    Feb 6, 2025
    Dataset provided by
    TechNavio
    Authors
    Technavio
    License

    https://www.technavio.com/content/privacy-notice

    Time period covered
    2025 - 2029
    Description

    Data Visualization Tools Market Size 2025-2029

    The data visualization tools market size is forecast to increase by USD 7.95 billion at a CAGR of 11.2% between 2024 and 2029. The market is experiencing significant growth due to the increasing demand for business intelligence and AI-powered insights. Companies are recognizing the value of transforming complex data into easily digestible visual representations to inform strategic decision-making. However, this market faces challenges as data complexity and massive data volumes continue to escalate. Organizations must invest in advanced data visualization tools to effectively manage and analyze their data to gain a competitive edge. The ability to automate data visualization processes and integrate AI capabilities will be crucial for companies to overcome the challenges posed by data complexity and volume. By doing so, they can streamline their business operations, enhance data-driven insights, and ultimately drive growth in their respective industries.

    What will be the size of the Data Visualization Tools Market during the forecast period?

    In today's data-driven business landscape, the market continues to evolve, integrating advanced capabilities to support various sectors in making informed decisions. Data storytelling and preparation are crucial elements, enabling organizations to effectively communicate complex data insights. Real-time data visualization ensures agility, while data security safeguards sensitive information. Data dashboards facilitate data exploration and discovery, supporting data-driven finance, strategy, and customer experience. Big data visualization tackles complex datasets, enabling data-driven decision making and innovation. Data blending and filtering streamline data integration and analysis. Data visualization software supports data transformation, cleaning, and aggregation, enhancing data-driven operations and healthcare. On-premises and cloud-based solutions cater to diverse business needs. Data governance, ethics, and literacy are integral components, ensuring data-driven product development, government, and education adhere to best practices. Natural language processing, machine learning, and visual analytics further enrich data-driven insights, enabling interactive charts and data reporting. Data connectivity and data-driven sales fuel business intelligence and marketing, while data discovery and data wrangling simplify data exploration and preparation. The market's continuous dynamism underscores the importance of data culture, data-driven innovation, and data-driven HR, as organizations strive to leverage data to gain a competitive edge.

    How is this Data Visualization Tools Industry segmented?

    The industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in USD million for the period 2025-2029, as well as historical data from 2019-2023, for the following segments:

    • Deployment: On-premises, Cloud
    • Customer Type: Large enterprises, SMEs
    • Component: Software, Services
    • Application: Human resources, Finance, Others
    • End-user: BFSI, IT and telecommunication, Healthcare, Retail, Others
    • Geography: North America (US, Mexico), Europe (France, Germany, UK), Middle East and Africa (UAE), APAC (Australia, China, India, Japan, South Korea), South America (Brazil), Rest of World (ROW)

    By Deployment Insights

    The on-premises segment is estimated to witness significant growth during the forecast period. The market has experienced notable expansion as businesses across diverse sectors acknowledge the significance of data analysis and representation to uncover valuable insights and inform strategic decisions. Data visualization plays a pivotal role in this domain. On-premises deployment, which involves implementing data visualization tools within an organization's physical infrastructure or dedicated data centers, is a popular choice. This approach offers organizations greater control over their data, ensuring data security, privacy, and adherence to data governance policies. It caters to industries dealing with sensitive data, subject to regulatory requirements, or with stringent security protocols that prohibit cloud-based solutions. Data storytelling, data preparation, data-driven product development, data-driven government, real-time data visualization, data security, data dashboards, data-driven finance, data-driven strategy, big data visualization, data-driven decision making, data blending, data filtering, data visualization software, data exploration, data-driven insights, data-driven customer experience, data mapping, data culture, data cleaning, data-driven operations, data aggregation, data transformation, data-driven healthcare, on-premises data visualization, data governance, data ethics, data discovery, natural language pr...

  4. Data from: Generalizable EHR-R-REDCap pipeline for a national...

    • datadryad.org
    • data.niaid.nih.gov
    • +2 more
    zip
    Updated Jan 9, 2022
    Cite
    Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller (2022). Generalizable EHR-R-REDCap pipeline for a national multi-institutional rare tumor patient registry [Dataset]. http://doi.org/10.5061/dryad.rjdfn2zcm
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 9, 2022
    Dataset provided by
    Dryad
    Authors
    Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller
    Time period covered
    Dec 28, 2021
    Description

    Objective: To develop a clinical informatics pipeline designed to capture large-scale structured EHR data for a national patient registry.

    Materials and Methods: The EHR-R-REDCap pipeline is implemented using R-statistical software to remap and import structured EHR data into the REDCap-based multi-institutional Merkel Cell Carcinoma (MCC) Patient Registry using an adaptable data dictionary.

    Results: Clinical laboratory data were extracted from EPIC Clarity across several participating institutions. Labs were transformed, remapped and imported into the MCC registry using the EHR labs abstraction (eLAB) pipeline. Forty-nine clinical tests encompassing 482,450 results were imported into the registry for 1,109 enrolled MCC patients. Data-quality assessment revealed highly accurate, valid labs. Univariate modeling was performed for labs at baseline on overall survival (N=176) using this clinical informatics pipeline.

    Conclusion: We demonstrate feasibility of the facile eLAB workflow. EHR...

  5. Rapid Creation of a Data Product for the World's Specimens of Horseshoe Bats...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 18, 2024
    Cite
    Mast, Austin R.; Paul, Deborah L.; Rios, Nelson; Bruhn, Robert; Dalton, Trevor; Krimmel, Erica R.; Pearson, Katelin D.; Sherman, Aja; Shorthouse, David P.; Simmons, Nancy B.; Soltis, Pam; Upham, Nathan; Abibou, Djihbrihou (2024). Rapid Creation of a Data Product for the World's Specimens of Horseshoe Bats and Relatives, a Known Reservoir for Coronaviruses [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3974999
    Explore at:
    Dataset updated
    Jul 18, 2024
    Dataset provided by
    Yale University Peabody Museum of Natural History
    Arizona State University
    Agriculture and Agri-Food Canada
    American Museum of Natural History
    Florida State University
    University of Florida
    Authors
    Mast, Austin R.; Paul, Deborah L.; Rios, Nelson; Bruhn, Robert; Dalton, Trevor; Krimmel, Erica R.; Pearson, Katelin D.; Sherman, Aja; Shorthouse, David P.; Simmons, Nancy B.; Soltis, Pam; Upham, Nathan; Abibou, Djihbrihou
    License

    https://creativecommons.org/licenses/publicdomain/

    Description

    This repository is associated with NSF DBI 2033973, RAPID Grant: Rapid Creation of a Data Product for the World's Specimens of Horseshoe Bats and Relatives, a Known Reservoir for Coronaviruses (https://www.nsf.gov/awardsearch/showAward?AWD_ID=2033973). Specifically, this repository contains (1) raw data from iDigBio (http://portal.idigbio.org) and GBIF (https://www.gbif.org), (2) R code for reproducible data wrangling and improvement, (3) protocols associated with data enhancements, and (4) enhanced versions of the dataset published at various project milestones. Additional code associated with this grant can be found in the BIOSPEX repository (https://github.com/iDigBio/Biospex). Long-term data management of the enhanced specimen data created by this project is expected to be accomplished by the natural history collections curating the physical specimens, a list of which can be found in this Zenodo resource.

    Grant abstract: "The award to Florida State University will support research contributing to the development of georeferenced, vetted, and versioned data products of the world's specimens of horseshoe bats and their relatives for use by researchers studying the origins and spread of SARS-like coronaviruses, including the causative agent of COVID-19. Horseshoe bats and other closely related species are reported to be reservoirs of several SARS-like coronaviruses. Species of these bats are primarily distributed in regions where these viruses have been introduced to populations of humans. Currently, data associated with specimens of these bats are housed in natural history collections that are widely distributed both nationally and globally. Additionally, information tying these specimens to localities are mostly vague, or in many instances missing. This decreases the utility of the specimens for understanding the source, emergence, and distribution of SARS-COV-2 and similar viruses. This project will provide quality georeferenced data products through the consolidation of ancillary information linked to each bat specimen, using the extended specimen model. The resulting product will serve as a model of how data in biodiversity collections might be used to address emerging diseases of zoonotic origin. Results from the project will be disseminated widely in opensource journals, at scientific meetings, and via websites associated with the participating organizations and institutions. Support of this project provides a quality resource optimized to inform research relevant to improving our understanding of the biology and spread of SARS-CoV-2. 
    The overall objectives are to deliver versioned data products, in formats used by the wider research and biodiversity collections communities, through an open-access repository; project protocols and code via GitHub, described in a peer-reviewed paper; and sustained engagement with biodiversity collections throughout the project for reintegration of improved data into their local specimen data management systems, improving long-term curation.

    This RAPID award will produce and deliver a georeferenced, vetted, and consolidated data product for horseshoe bats and related species to facilitate understanding of the sources, distribution, and spread of SARS-CoV-2 and related viruses: a timely response to the ongoing global pandemic caused by SARS-CoV-2 and an important contribution to the global effort to consolidate and provide quality data relevant to understanding emergent and other properties of the current pandemic. This RAPID award is made by the Division of Biological Infrastructure (DBI) using funds from the Coronavirus Aid, Relief, and Economic Security (CARES) Act.

    This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria."

    Files included in this resource

    9d4b9069-48c4-4212-90d8-4dd6f4b7f2a5.zip: Raw data from iDigBio, DwC-A format

    0067804-200613084148143.zip: Raw data from GBIF, DwC-A format

    0067806-200613084148143.zip: Raw data from GBIF, DwC-A format

    1623690110.zip: Full export of this project's data (enhanced and raw) from BIOSPEX, CSV format

    bionomia-datasets-attributions.zip: Directory containing 103 Frictionless Data packages for datasets that have attributions made containing Rhinolophids or Hipposiderids, each package also containing a CSV file for mismatches in person date of birth/death and specimen eventDate. File bionomia-datasets-attributions-key_2021-02-25.csv included in this directory provides a key between dataset identifier (how the Frictionless Data package files are named) and dataset name.

    bionomia-problem-dates-all-datasets_2021-02-25.csv: List of 21 Hipposiderid or Rhinolophid records whose eventDate or dateIdentified mismatches a wikidata recipient’s date of birth or death across all datasets.

    flagEventDate.txt: file containing term definition to reference in DwC-A

    flagExclude.txt: file containing term definition to reference in DwC-A

    flagGeoreference.txt: file containing term definition to reference in DwC-A

    flagTaxonomy.txt: file containing term definition to reference in DwC-A

    georeferencedByID.txt: file containing term definition to reference in DwC-A

    identifiedByNames.txt: file containing term definition to reference in DwC-A

    instructions-to-get-people-data-from-bionomia-via-datasetKey: instructions given to data providers

    RAPID-code_collection-date.R: code associated with enhancing collection dates

    RAPID-code_compile-deduplicate.R: code associated with compiling and deduplicating raw data

    RAPID-code_external-linkages-bold.R: code associated with enhancing external linkages

    RAPID-code_external-linkages-genbank.R: code associated with enhancing external linkages

    RAPID-code_external-linkages-standardize.R: code associated with enhancing external linkages

    RAPID-code_people.R: code associated with enhancing data about people

    RAPID-code_standardize-country.R: code associated with standardizing country data

    RAPID-data-dictionary.pdf: metadata about terms included in this project’s data, in PDF format

    RAPID-data-dictionary.xlsx: metadata about terms included in this project’s data, in spreadsheet format

    rapid-data-providers_2021-05-03.csv: list of data providers and number of records provided to rapid-joined-records_country-cleanup_2020-09-23.csv

    rapid-final-data-product_2021-06-29.zip: Enhanced data from BIOSPEX, DwC-A format

    rapid-final-gazetteer.zip: Gazetteer providing georeference data and metadata for 10,341 localities assessed as part of this project

    rapid-joined-records_country-cleanup_2020-09-23.csv: data product initial version where raw data has been compiled and deduplicated, and country data has been standardized

    RAPID-protocol_collection-date.pdf: protocol associated with enhancing collection dates

    RAPID-protocol_compile-deduplicate.pdf: protocol associated with compiling and deduplicating raw data

    RAPID-protocol_external-linkages.pdf: protocol associated with enhancing external linkages

    RAPID-protocol_georeference.pdf: protocol associated with georeferencing

    RAPID-protocol_people.pdf: protocol associated with enhancing data about people

    RAPID-protocol_standardize-country.pdf: protocol associated with standardizing country data

    RAPID-protocol_taxonomic-names.pdf: protocol associated with enhancing taxonomic name data

    RAPIDAgentStrings1_archivedCopy_30March2021.ods: resource used in conjunction with RAPID people protocol

    recordedByNames.txt: file containing term definition to reference in DwC-A

    Rhinolophid-HipposideridAgentStrings_and_People2_archivedCopy_30March2021.ods: resource used in conjunction with RAPID people protocol

    wikidata-notes-for-bat-collectors_leachman_2020: please see https://zenodo.org/record/4724139 for this resource

  6. Additional file 1 of Genomic data integration and user-defined sample-set...

    • springernature.figshare.com
    xlsx
    Updated Jun 4, 2023
    Cite
    Tommaso Alfonsi; Anna Bernasconi; Arif Canakoglu; Marco Masseroli (2023). Additional file 1 of Genomic data integration and user-defined sample-set extraction for population variant analysis [Dataset]. http://doi.org/10.6084/m9.figshare.21251612.v1
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Tommaso Alfonsi; Anna Bernasconi; Arif Canakoglu; Marco Masseroli
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 1. Example of translation from VCF into GDM format for genomic region data: This .xlsx (MS Excel) spreadsheet exemplifies the transformation of the original 1KGP mutations, expressed in VCF format, into GDM genomic regions. As a demonstrative example, some variants on chromosome X have been selected from the source data (in VCF format) and listed in the first table at the top of the file. The values of columns #CHROM, POS, REF and ALT appear as in the source. We removed from the column INFO the details that are unnecessary for the transformation. The column FORMAT contains exclusively the value "GT", meaning that the following columns contain only the genotype of the samples (this and other conventions are defined in the VCF specification document and in the header section of each VCF file). In multiallelic variants (examples e, f.1 and f.2), the genotype indicates with a number which of the alternative alleles in ALT is present in the corresponding samples (e.g., the number 2 means that the second variant is present); otherwise, it assumes only the values 0 (mutation absent) or 1 (mutation present). Additionally, the genotype indicates whether one or both chromosome copies contain the mutation and which one, i.e., the left one or the right one; the mutated alleles are normally separated by a pipe ("|"), if not otherwise specified in the header section. We do not know which chromosome copy is maternal or paternal, but as the 1KGP mutations are "phased", we know that the "left chromosome" is the same in every mutation located on the same chromosome of the same donor. As this example has only one column after FORMAT, the mutations described are relative to a single sample, called "HG123456". This sample does not actually exist in the source, but serves to demonstrate several mutation types found in the original data.
    The table reports six variants in VCF format, with the last one repeated twice to show how different genotype values lead to different translations (examples f.1 and f.2 differ only in the last column). Below in the same file, the same variants appear converted into GDM format. The transformation outputs the chr, left, right, strand, AL1, AL2, ref, alt, mut_type and length columns. The value of strand is positive in every mutation, as clarified by the 1KGP Consortium after the release of the data collections. The values of AL1 and AL2 express on which chromatid the mutation occurs and depend on the value of the original genotype (column HG123456). The values of the other columns, namely chr, left, right, ref, alt, mut_type and length, are obtained from the original variant values after splitting multi-allelic variants, transforming the original position into 0-based coordinates, and removing repeated nucleotide bases from the original REF and ALT columns. In 0-based coordinates, a nucleotide base occupies the space between coordinates x and x + 1. So SNPs (examples a and f.2) are encoded as the replacement of ref, at the position between left and right, with alt. Insertions (examples c and f.1) are described as the addition of the sequence of bases in alt at the position indicated by left and right, i.e., between two nucleotide bases. Deletions (example b) are represented as the substitution of ref between positions left and right with an empty value (alt is indeed empty in this case). Finally, structural variants (examples d and e), such as copy number variations and large deletions, have an empty ref because, according to the VCF specification document, the original column REF reports a nucleotide (called the padding base) that is located before the scope of the variant on the genome and is unnecessary in a 0-based representation.
    In this file we report only the columns relevant for understanding the transformation method regarding the mutation coordinates and the reference and alternative alleles. In addition to the columns reported in the second table, the transformation adds further columns, named after the attributes in the original INFO column, to capture a selection of the attributes present in the original file.
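    The coordinate arithmetic described above can be sketched in Python. This is an illustrative helper, not the authors' code, and it assumes a bi-allelic variant that has already been split, with VCF's shared leading (padding) base present in both REF and ALT:

```python
def vcf_to_gdm(pos, ref, alt):
    """Convert one bi-allelic VCF variant (1-based POS) to a 0-based,
    half-open GDM-style (left, right, ref, alt) tuple."""
    # Strip the shared leading base(s) that VCF repeats in REF and ALT
    # (e.g. the padding base of insertions and deletions).
    while ref and alt and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    left = pos - 1           # 1-based position -> 0-based coordinate
    right = left + len(ref)  # half-open end; equals left for insertions
    return left, right, ref, alt
```

    With this convention a SNP replaces one base between left and right, an insertion has left == right (it sits between two bases), and a deletion has an empty alt, matching the cases described in the file.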

  7. Data Science_Data Wrangling Project

    • kaggle.com
    zip
    Updated Dec 4, 2025
    Cite
    DIMAS IZZULHAQ (2025). Data Science_Data Wrangling Project [Dataset]. https://www.kaggle.com/datasets/rafizzul/data-science-data-wrangling-project
    Explore at:
    Available download formats: zip (404527717 bytes)
    Dataset updated
    Dec 4, 2025
    Authors
    DIMAS IZZULHAQ
    Description

    Integrative Analysis of Air Quality, Population Mobility, and Weather Conditions in Indonesia

    Dataset Overview

    This comprehensive dataset integrates three heterogeneous data sources to analyze the relationship between air quality, population mobility patterns, and weather conditions across major Indonesian cities from September 2024 to October 2025. The dataset provides valuable insights for environmental monitoring, urban planning, and public health research in Indonesia.

    Dataset Composition

    1. Mobility Pattern Dataset

    • Source: Meta's Data for Good Movement Distribution Initiative
    • Records: 124,212 observations
    • Coverage: Indonesian administrative regions (Level 2 - Kabupaten/Kota)
    • Time Period: September 2024 - September 2025
    • Key Features:
      • Geographic identifiers (GADM codes and district names)
      • Distance categories: Stay at home (0 km), Short-range (0-10 km), Medium-range (10-100 km), Long-range (100+ km)
      • Percentage distribution of population movement (ping fractions)
      • Temporal indicators (date, year, month)

    Key Findings: Over 95% of movements occur within 0-10 km from home, indicating predominantly local mobility patterns. Long-distance travel remains minimal (<0.4%).

    2. Air Quality Dataset (IQAir)

    • Source: IQAir 2024 World Air Quality Report
    • Coverage: 10 major Indonesian cities (Jakarta, Surabaya, Bandung, Palembang, Medan, Semarang, Mojokerto, Pekanbaru, Yogyakarta, Jambi)
    • Time Period: Monthly data for 2024
    • Key Features:
      • PM2.5 concentrations (μg/m³)
      • Air Quality Index (AQI) values
      • Annual and monthly averages per city
      • Comparisons with WHO guidelines

    Key Findings: PM2.5 levels consistently exceed WHO guidelines throughout 2024, with critical peaks during May (65-132 μg/m³) and significant improvement in December. Seasonal patterns show higher pollution during dry months (April-October) due to biomass burning and decreased precipitation.

    3. Weather Dataset (Open-Meteo)

    • Source: Open-Meteo Historical Weather API
    • Records: 744 daily observations
    • Coverage: Multiple Indonesian cities
    • Time Period: September 1, 2024 - October 1, 2025
    • Key Features:
      • Temperature metrics (actual and apparent, min/max/mean)
      • Solar radiation (sunrise, sunset, daylight duration, sunshine duration)
      • Wind parameters (speed, gusts, direction)
      • Precipitation variables (rainfall, precipitation hours)

    Key Findings: Consistent tropical monsoon characteristics with stable temperatures (23-30°C), erratic rainfall patterns, and high humidity levels. Temperature shows strong correlation with both AQI (0.39) and wind speed (0.57).

    Data Processing & Quality

    Preprocessing Techniques Applied:

    • Advanced Imputation: K-Nearest Neighbors (KNN) and Random Forest methods for missing data
    • Data Cleaning: Removal of duplicates, outliers, and inconsistent records
    • Temporal Alignment: Standardization across different time resolutions (daily to monthly aggregation)
    • Feature Engineering: Creation of derived variables including temporal indicators, weather indices, and mobility ratios
    • Spatial Standardization: City-level normalization and geographic identifier harmonization
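    The KNN imputation step can be sketched as follows. The description does not name a library, so scikit-learn's KNNImputer and the column values are assumptions for illustration:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy numeric matrix with gaps; columns could stand for
# PM2.5 (ug/m3), mean temperature (C), and wind speed (m/s).
X = np.array([
    [65.0,   27.0,   3.1],
    [np.nan, 28.0,   2.9],
    [70.0,   np.nan, 3.0],
    [68.0,   27.5,   np.nan],
])

# Each missing value is filled from the 2 nearest rows,
# using a NaN-aware Euclidean distance over the observed features.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
```

    Random Forest imputation (the other method mentioned) would instead fit a model per column with gaps, using the remaining columns as predictors.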

    Data Quality Indicators:

    • Comprehensive coverage across 9 consistently overlapping cities
    • Temporal consistency validated across observation periods
    • Multiple imputation methods tested for reliability
    • Cross-validation performed on integrated dataset

    Use Cases

    This dataset is ideal for:

    1. Environmental Policy Research: Understanding pollution patterns and seasonal variations
    2. Urban Planning: Analyzing mobility trends and transportation needs
    3. Public Health Studies: Investigating environmental health impacts and exposure patterns
    4. Climate Analysis: Examining tropical monsoon characteristics and weather-pollution interactions
    5. Machine Learning Applications: Time-series forecasting, correlation analysis, and predictive modeling
    6. Behavioral Studies: Understanding how environmental conditions influence human mobility
    7. Sustainability Research: Evaluating sustainable mobility and environmental monitoring strategies

    Key Research Findings

    • Weak Correlation: Air quality shows minimal direct correlation with mobility patterns (-0.35 to 0.33), suggesting low public awareness or economic necessity overrides health concerns
    • Mobility Trade-offs: Strong negative correlation (-0.94) between short-distance and medium-distance travel indicates distinct mobility profiles
    • Seasonal Patterns: Clear dry-wet season impact on both air quality and mobility
    • Temperature Effects: Significant positive correlation between temperature and both AQI and wind speed

    File Structure

    The integrated dataset contains the following columns: ...

  8. Party strength in each US state

    • kaggle.com
    zip
    Updated Jan 13, 2017
    Cite
    GeneBurin (2017). Party strength in each US state [Dataset]. https://www.kaggle.com/kiwiphrases/partystrengthbystate
    Explore at:
    Available download formats: zip (16377 bytes)
    Dataset updated
    Jan 13, 2017
    Authors
    GeneBurin
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    United States
    Description

    Data on party strength in each US state

    The repository contains data on party strength for each state as shown on each state's corresponding party strength Wikipedia page (for example, Virginia's).

    Each state has a Wikipedia table giving a detailed summary of the state of its governing and representing bodies, but there is no dataset that collates these entries. I scraped each state's Wikipedia table and collated the entries into a single dataset. The data are stored in state_party_strength.csv and state_party_strength_cleaned.csv. The code that generated these files can be found in the corresponding Python notebooks.

    Data contents:

    The data contain information from 1980 onward on each state's:
    1. governor and party
    2. state house and senate composition
    3. state representative composition in Congress
    4. electoral votes

    Clean Version

    Data in the clean version has been cleaned and processed substantially. Namely:
    - all columns now contain homogeneous data within the column
    - names and Wiki-citations have been removed
    - only the party counts and party identification have been left

    The notebook that created this file is here

    Uncleaned Data Version

    The data contained herein have not been altered from their Wikipedia tables except in two instances:
    - column names were forced to be in accord across states
    - any needed data modifications (i.e. concatenated string columns) were made to retain information when combining columns

    To use the data:

    Please note that the correct encoding for the dataset is "ISO-8859-1", not "utf-8"; in a future version I will try to fix that to make it more accessible.

    This means that you will likely have to perform further data wrangling prior to doing any substantive analysis. The notebook that has been used to create this data file is located here
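The encoding note above matters in practice: reading the file as UTF-8 will raise a decode error. A minimal sketch of the fix with pandas (a tiny Latin-1 file is generated inline so the example is self-contained; the governor name is illustrative, not from the dataset):

```python
# Sketch: loading a Latin-1 ("ISO-8859-1") encoded CSV, as required for
# state_party_strength.csv per the author's note. A small in-memory file
# stands in for the real download.
import pandas as pd
from io import BytesIO

raw = "state,governor\nVirginia,Martínez\n".encode("ISO-8859-1")
df = pd.read_csv(BytesIO(raw), encoding="ISO-8859-1")  # utf-8 would fail on í
print(df)
```

For the real file, the same keyword applies: `pd.read_csv("state_party_strength.csv", encoding="ISO-8859-1")`.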

    Raw scraped data

    The raw scraped data can be found in the pickle. This file contains a Python dictionary where each key is a US state name and each element is the raw scraped table in Pandas DataFrame format.
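Loading that pickle gives back the state-to-DataFrame dictionary directly. A hedged sketch (the pickle's filename is not specified in the listing, so a stand-in dict is pickled first to keep the example runnable):

```python
# Sketch: reading the raw scraped pickle described above, which holds a
# Python dict mapping each US state name to its scraped table as a
# pandas DataFrame. The filename here is an assumption.
import os
import pickle
import tempfile
import pandas as pd

# Stand-in for the dataset's pickle so the example runs on its own.
tables = {"Virginia": pd.DataFrame({"Year": [1980], "Governor": ["John N. Dalton"]})}
path = os.path.join(tempfile.gettempdir(), "state_party_strength_raw.pkl")
with open(path, "wb") as f:
    pickle.dump(tables, f)

with open(path, "rb") as f:
    raw = pickle.load(f)  # dict: state name -> raw scraped DataFrame

print(sorted(raw))        # state names
print(raw["Virginia"])
```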

    Hope it proves as useful to you in analyzing/using political patterns at the state level in the US for political and policy research.

  9. Machine Learning 😃 ❤️😃

    • kaggle.com
    zip
    Updated Mar 8, 2022
    Cite
    Qusay AL-Btoush (2022). Machine Learning 😃 ❤️😃 [Dataset]. https://www.kaggle.com/qusaybtoush1990/machine-learning
    Explore at:
    zip(533584 bytes)Available download formats
    Dataset updated
    Mar 8, 2022
    Authors
    Qusay AL-Btoush
    Description

    Machine Learning 😃 ❤️😃

    Predicting Bank Loan Defaults 🙄 😃🙄 ❤️😃🙄 😃

    DESCRIPTION❤️❤️

    A data science approach to predict and understand the applicant’s profile to minimize the risk of future loan defaults.

    About the project

    The dataset contains information about credit applicants. Banks, globally, use this kind of dataset and type of informative data to create models to help in deciding on who to accept/refuse for a loan. After all the exploratory data analysis, cleansing and dealing with all the anomalies we might (will) find along the way, the patterns of a good/bad applicant will be exposed to be learned by machine learning models.

    • Machine Learning issue and objectives: We're dealing with a supervised binary classification problem. The goal is to train the machine learning model that best predicts loan defaults from a deep understanding of past customers' profiles, minimizing the risk of future defaults.

    • Performance Metric: The metric used for model evaluation is ROC AUC, given that we're dealing with highly unbalanced data.

    • Project structure: The project divides into three parts:
      - EDA: Exploratory Data Analysis
      - Data Wrangling: Cleansing and Feature Selection
      - Machine Learning: Predictive Modelling

    • The dataset: You can download the data set here.

    Feature description:

    • id: Unique ID of the loan application.

    • grade: LC assigned loan grade.

    • annual_inc: The self-reported annual income provided by the borrower during registration.

    • short_emp: 1 when employed for 1 year or less.

    • emp_length_num: Employment length in years. Possible values are between 0 and 10, where 0 means less than one year and 10 means ten or more years.

    • home_ownership: Type of home ownership.

    • dti (Debt-To-Income Ratio): A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.

    • purpose: A category provided by the borrower for the loan request.

    • term: The number of payments on the loan. Values are in months and can be either 36 or 60.

    • last_delinq_none: 1 when the borrower had at least one event of delinquency.

    • last_major_derog_none: 1 when the borrower has had at least 90 days of a bad rating.

    • revol_util: Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.

    • total_rec_late_fee: Late fees received to date.

    • od_ratio: Overdraft ratio.

    • bad_loan: 1 when a loan was not paid.
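The setup described above (binary `bad_loan` target, unbalanced classes, ROC AUC as the metric) can be sketched with scikit-learn. Synthetic features stand in for the real ones here; the coefficients and class balance are assumptions:

```python
# Sketch: supervised binary classification evaluated with ROC AUC on
# unbalanced data, mirroring the project setup described above.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 5000
X = rng.normal(size=(n, 4))                  # stand-ins for dti, revol_util, ...
logit = 1.5 * X[:, 0] - 1.0 * X[:, 1] - 2.5  # negative intercept keeps defaults rare
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)  # bad_loan flag

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
model = LogisticRegression(class_weight="balanced").fit(X_tr, y_tr)
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(f"positive rate: {y.mean():.3f}, ROC AUC: {auc:.3f}")
```

Note that ROC AUC is computed from predicted probabilities, not hard labels, which is what makes it robust to the class imbalance that plain accuracy hides.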

    Note 😃: this dataset is intended for practicing data analysis 🤝🎉

    Please appreciate the effort with an upvote 👍 😃😃

    Thank You ❤️❤️❤️

  10. Dirty Dataset to practice Data Cleaning

    • kaggle.com
    zip
    Updated Nov 3, 2023
    Cite
    Amrutha yenikonda (2023). Dirty Dataset to practice Data Cleaning [Dataset]. https://www.kaggle.com/datasets/amruthayenikonda/dirty-dataset-to-practice-data-cleaning
    Explore at:
    zip(1241 bytes)Available download formats
    Dataset updated
    Nov 3, 2023
    Authors
    Amrutha yenikonda
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The dataset has been obtained by web-scraping a Wikipedia page; the code is linked below: https://www.kaggle.com/amruthayenikonda/simple-web-scraping-using-pandas

    This dataset can be used to practice data cleaning and manipulation, for example dropping unwanted columns, handling null values, removing symbols, etc.
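The cleaning tasks just listed map directly onto a few pandas calls. A minimal sketch, with illustrative column names rather than the dataset's actual schema:

```python
# Sketch of the cleaning tasks above: dropping an unwanted column,
# dropping rows with nulls, and stripping stray symbols from scraped text.
# The columns and values are assumptions for illustration.
import pandas as pd

df = pd.DataFrame({
    "Rank": ["1.", "2.", None],
    "Country": ["India*", "China†", "USA"],
    "Notes": ["a", "b", "c"],                 # unwanted scrape artifact column
})

df = df.drop(columns=["Notes"])               # drop unwanted column
df = df.dropna(subset=["Rank"])               # drop rows with null values
df["Country"] = df["Country"].str.replace(r"[*†]", "", regex=True)  # symbols
df["Rank"] = df["Rank"].str.rstrip(".").astype(int)  # "1." -> 1
print(df)
```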

  11. Bank dataset

    • kaggle.com
    zip
    Updated May 22, 2023
    Cite
    Santhosh (2023). Bank dataset [Dataset]. https://www.kaggle.com/datasets/santhoshs623/bank-dataset/code
    Explore at:
    zip(37847527 bytes)Available download formats
    Dataset updated
    May 22, 2023
    Authors
    Santhosh
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Description: The dataset is intentionally provided for data cleansing and applying EDA techniques, which makes it fun to explore and wrangle for data geeks. The data is very original, so dive in and happy exploring.

    Features: In total, the dataset contains 121 features. Details are given below.

    • SK_ID_CURR: ID of the loan in our sample
    • TARGET: Target variable (1: client with payment difficulties, i.e. a late payment of more than X days on at least one of the first Y installments of the loan in our sample; 0: all other cases)
    • NAME_CONTRACT_TYPE: Identification of whether the loan is cash or revolving
    • CODE_GENDER: Gender of the client
    • FLAG_OWN_CAR: Flag if the client owns a car
    • FLAG_OWN_REALTY: Flag if the client owns a house or flat
    • CNT_CHILDREN: Number of children the client has
    • AMT_INCOME_TOTAL: Income of the client
    • AMT_CREDIT: Credit amount of the loan
    • AMT_ANNUITY: Loan annuity
    • AMT_GOODS_PRICE: For consumer loans, the price of the goods for which the loan is given
    • NAME_TYPE_SUITE: Who was accompanying the client when applying for the loan
    • NAME_INCOME_TYPE: Client's income type (businessman, working, maternity leave, …)
    • NAME_EDUCATION_TYPE: Highest level of education the client achieved
    • NAME_FAMILY_STATUS: Family status of the client
    • NAME_HOUSING_TYPE: Housing situation of the client (renting, living with parents, ...)
    • REGION_POPULATION_RELATIVE: Normalized population of the region where the client lives (a higher number means a more populated region)
    • DAYS_BIRTH: Client's age in days at the time of application
    • DAYS_EMPLOYED: How many days before the application the person started current employment
    • DAYS_REGISTRATION: How many days before the application the client changed his registration
    • DAYS_ID_PUBLISH: How many days before the application the client changed the identity document with which he applied for the loan
    • OWN_CAR_AGE: Age of the client's car
    • FLAG_MOBIL: Did the client provide a mobile phone (1=YES, 0=NO)
    • FLAG_EMP_PHONE: Did the client provide a work phone (1=YES, 0=NO)
    • FLAG_WORK_PHONE: Did the client provide a home phone (1=YES, 0=NO)
    • FLAG_CONT_MOBILE: Was the mobile phone reachable (1=YES, 0=NO)
    • FLAG_PHONE: Did the client provide a home phone (1=YES, 0=NO)
    • FLAG_EMAIL: Did the client provide an email (1=YES, 0=NO)
    • OCCUPATION_TYPE: What kind of occupation the client has
    • CNT_FAM_MEMBERS: How many family members the client has
    • REGION_RATING_CLIENT: Our rating of the region where the client lives (1, 2, 3)
    • REGION_RATING_CLIENT_W_CITY: Our rating of the region where the client lives, taking the city into account (1, 2, 3)
    • WEEKDAY_APPR_PROCESS_START: On which day of the week the client applied for the loan
    • HOUR_APPR_PROCESS_START: Approximately at what hour the client applied for the loan
    • REG_REGION_NOT_LIVE_REGION: Flag if the client's permanent address does not match the contact address (1=different, 0=same, at region level)
    • REG_REGION_NOT_WORK_REGION: Flag if the client's permanent address does not match the work address (1=different, 0=same, at region level)
    • LIVE_REGION_NOT_WORK_REGION: Flag if the client's contact address does not match the work address (1=different, 0=same, at region level)
    • REG_CITY_NOT_LIVE_CITY: Flag if the client's permanent address does not match the contact address (1=different, 0=same, at city level)
    • REG_CITY_NOT_WORK_CITY: Flag if the client's permanent address does not match the work address (1=different, 0=same, at city level)
    • LIVE_CITY_NOT_WORK_CITY: Flag if the client's contact address does not match the work address (1=different, 0=same, at city level)
    • ORGANIZATION_TYPE: Type of organization where the client works
    • EXT_SOURCE_1: Normalized score from an external data source
    • EXT_SOURCE_2: Normalized score from an external data source
    • EXT_SOURCE_3: Normalized score from an external data source
    • APARTMENTS_AVG: Normalized information about the building where the client lives: average (_AVG suffix), modus (_MODE suffix), or median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floors
    • BASEMENTAREA_AVG: (as above)
    • YEARS_BEGINEXPLUATATION_AVG: (as above)
    • YEARS_BUILD_AVG: Normalized information about the building where the client lives: average (_AVG suffix), modus (_MODE suffix), or median (_MED...
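For a wide table like this, a useful first EDA pass is checking the TARGET class balance and per-column missingness. A minimal sketch (the file name is not given in the listing, so a small stand-in frame with a few of the columns above is built instead of reading it):

```python
# Sketch: first-pass EDA on a wide application table with a TARGET flag,
# as described above. The stand-in frame uses a few of the listed columns;
# with the real file you would start from pd.read_csv(...) instead.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3, 4],
    "TARGET": [0, 0, 1, 0],
    "AMT_INCOME_TOTAL": [200000.0, np.nan, 90000.0, 135000.0],
    "CODE_GENDER": ["M", "F", "F", "M"],
})

print(df["TARGET"].value_counts(normalize=True))      # class balance
print(df.isna().mean().sort_values(ascending=False))  # missing-value share per column
```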

  12. People_Dataset_Fields

    • kaggle.com
    zip
    Updated Aug 2, 2025
    Cite
    willian oliveira (2025). People_Dataset_Fields [Dataset]. https://www.kaggle.com/datasets/willianoliveiragibin/people-dataset-fields
    Explore at:
    zip(2989638 bytes)Available download formats
    Dataset updated
    Aug 2, 2025
    Authors
    willian oliveira
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    These graphs (graph1.jpg, graph2.jpg, graph3.png) were created in Looker Studio, Power BI, and Tableau.

    The People sample dataset on Datablist is a privacy-safe, synthetic census designed for demos, data-wrangling drills, and performance benchmarks. Each row carries an incremental Index that doubles as a primary key, a unique User Id token, the person's First Name and Last Name, a binary Sex flag ("Male" or "Female"), a well-formed Email, a phone number in varied international formats under Phone, a Date of birth in YYYY-MM-DD style for age maths, and a realistic Job Title ranging from clerk to C-suite.

    The same schema is cloned across seven file sizes so you can scale your experiments in a single keystroke: people-100.csv and its zipped twin give you 100 lines; the pattern repeats for 1,000, 10,000, and 100,000 rows; from half a million records (people-500000.csv) onward the files ship raw, followed by one- and two-million-record behemoths. Every download begins with a header line, all data are random, and the open-source generator script means you can fork your own flavour if you crave extra columns. Load it into pandas, Excel, BigQuery, or anything that speaks CSV; stress-test your ETL, train regexes on the emails, or time how long your SQL engine takes to count birthdays. No compliance team will raise an eyebrow.
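One of the "age maths" uses mentioned above can be sketched in a few lines of pandas. An inline one-row sample with the listed schema stands in for the people-100.csv download, and the reference date is an assumption:

```python
# Sketch: parsing the People schema and deriving age from Date of birth.
# The inline sample mimics the header of people-100.csv; the row values
# and the fixed reference date are assumptions.
import pandas as pd
from io import StringIO

sample = StringIO(
    "Index,User Id,First Name,Last Name,Sex,Email,Phone,Date of birth,Job Title\n"
    "1,aB12,Ada,Lovelace,Female,ada@example.com,+44 20 7946 0958,1990-12-10,Engineer\n"
)
df = pd.read_csv(sample, parse_dates=["Date of birth"])
# Approximate age in whole years relative to a fixed reference date
df["Age"] = (pd.Timestamp("2025-01-01") - df["Date of birth"]).dt.days // 365
print(df[["First Name", "Age"]])
```

Swapping `sample` for the real file path scales the same code to any of the seven sizes, since they all share one header.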

