Loan Data from Prosper.
This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, and many others. This data dictionary explains the variables in the data set.
License: MIT (https://opensource.org/licenses/MIT)
This file contains anonymized employee records for hands-on data wrangling and cleaning tasks. The structure matches common corporate HR exports, with a mix of numerical, categorical, and date fields plus representative real-world issues.
Rows: ~10,990 (some duplicates present); Columns: 9
| Column | Description |
|---|---|
| Age | Age of employee (float, may be missing) |
| Salary | Annual salary in USD (integer, no missing values) |
| Experience | Years of experience as text (e.g. '17 years', may be empty) |
| Performance_Score | Last evaluation score (integer, no missing values) |
| Gender | M, F, Male, Female, or blank |
| Department | Department (HR, Engineering, Sales, Marketing, or blank) |
| Hired | Yes, No, Y, N, or blank |
| Hiring_Date | ISO date string (may be empty, various validity) |
| Location | Work city (Austin, Seattle, Boston, New York, or blank) |
Hiring_Date is stored as free text, with some values missing or invalid. Sample rows:

| Age | Salary | Experience | Performance_Score | Gender | Department | Hired | Hiring_Date | Location |
|---|---|---|---|---|---|---|---|---|
| 39 | 46945 | 17 years | 9 |  | Marketing | Yes | 2021-02-24 | Austin |
| 27 | 27102 | 4 years | 3 | M | Sales | Y | 2013-07-19 | Seattle |
| 29 | 50624 | 10 years | 8 | Male | Engineering | Yes | 2015-03-28 | Austin |
Typical cleaning tasks (a pandas sketch follows below):
- Convert Experience to integer
- Normalize the Gender and Hired fields to a single format
- Parse Hiring_Date to pandas datetime (handling errors)

See the provided notebook (hr_records_data_cleaning) for typical cleaning techniques. The scripts demonstrate type conversion, missing-value imputation, categorical mapping, date parsing, and duplicate removal.
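As a rough illustration of those steps, here is a minimal pandas sketch; the file name is assumed, and the imputation choice (median age) is just one reasonable option:

```python
import pandas as pd

# File name assumed; adjust to your copy of the export.
df = pd.read_csv("hr_records.csv")

# Experience: '17 years' -> 17, as a nullable integer (empty -> <NA>).
df["Experience"] = pd.to_numeric(
    df["Experience"].str.extract(r"(\d+)", expand=False), errors="coerce"
).astype("Int64")

# Gender and Hired: collapse the mixed encodings onto a single format.
df["Gender"] = df["Gender"].replace({"Male": "M", "Female": "F"})
df["Hired"] = df["Hired"].replace({"Y": "Yes", "N": "No"})

# Hiring_Date: parse ISO strings, coercing invalid values to NaT.
df["Hiring_Date"] = pd.to_datetime(df["Hiring_Date"], errors="coerce")

# Remove exact duplicate rows and impute missing ages with the median.
df = df.drop_duplicates()
df["Age"] = df["Age"].fillna(df["Age"].median())
```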
Synthetic data for skill-building and demonstration only. No real identities are included.
Objective: To develop a clinical informatics pipeline designed to capture large-scale structured EHR data for a national patient registry.
Materials and Methods: The EHR-R-REDCap pipeline is implemented using R-statistical software to remap and import structured EHR data into the REDCap-based multi-institutional Merkel Cell Carcinoma (MCC) Patient Registry using an adaptable data dictionary.
Results: Clinical laboratory data were extracted from EPIC Clarity across several participating institutions. Labs were transformed, remapped and imported into the MCC registry using the EHR labs abstraction (eLAB) pipeline. Forty-nine clinical tests encompassing 482,450 results were imported into the registry for 1,109 enrolled MCC patients. Data-quality assessment revealed highly accurate, valid labs. Univariate modeling was performed for labs at baseline on overall survival (N=176) using this clinical informatics pipeline.
Conclusion: We demonstrate feasibility of the facile eLAB workflow. EHR...
License: Public Domain (https://creativecommons.org/licenses/publicdomain/)
This repository is associated with NSF DBI 2033973, RAPID Grant: Rapid Creation of a Data Product for the World's Specimens of Horseshoe Bats and Relatives, a Known Reservoir for Coronaviruses (https://www.nsf.gov/awardsearch/showAward?AWD_ID=2033973). Specifically, this repository contains (1) raw data from iDigBio (http://portal.idigbio.org) and GBIF (https://www.gbif.org), (2) R code for reproducible data wrangling and improvement, (3) protocols associated with data enhancements, and (4) enhanced versions of the dataset published at various project milestones. Additional code associated with this grant can be found in the BIOSPEX repository (https://github.com/iDigBio/Biospex). Long-term data management of the enhanced specimen data created by this project is expected to be accomplished by the natural history collections curating the physical specimens, a list of which can be found in this Zenodo resource.
Grant abstract: "The award to Florida State University will support research contributing to the development of georeferenced, vetted, and versioned data products of the world's specimens of horseshoe bats and their relatives for use by researchers studying the origins and spread of SARS-like coronaviruses, including the causative agent of COVID-19. Horseshoe bats and other closely related species are reported to be reservoirs of several SARS-like coronaviruses. Species of these bats are primarily distributed in regions where these viruses have been introduced to populations of humans. Currently, data associated with specimens of these bats are housed in natural history collections that are widely distributed both nationally and globally. Additionally, information tying these specimens to localities are mostly vague, or in many instances missing. This decreases the utility of the specimens for understanding the source, emergence, and distribution of SARS-COV-2 and similar viruses. This project will provide quality georeferenced data products through the consolidation of ancillary information linked to each bat specimen, using the extended specimen model. The resulting product will serve as a model of how data in biodiversity collections might be used to address emerging diseases of zoonotic origin. Results from the project will be disseminated widely in opensource journals, at scientific meetings, and via websites associated with the participating organizations and institutions. Support of this project provides a quality resource optimized to inform research relevant to improving our understanding of the biology and spread of SARS-CoV-2. The overall objectives are to deliver versioned data products, in formats used by the wider research and biodiversity collections communities, through an open-access repository; project protocols and code via GitHub and described in a peer-reviewed paper, and; sustained engagement with biodiversity collections throughout the project for reintegration of improved data into their local specimen data management systems improving long-term curation.
This RAPID award will produce and deliver a georeferenced, vetted and consolidated data product for horseshoe bats and related species to facilitate understanding of the sources, distribution, and spread of SARS-CoV-2 and related viruses, a timely response to the ongoing global pandemic caused by SARS-CoV-2 and an important contribution to the global effort to consolidate and provide quality data that are relevant to understanding emergent and other properties the current pandemic. This RAPID award is made by the Division of Biological Infrastructure (DBI) using funds from the Coronavirus Aid, Relief, and Economic Security (CARES) Act.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria."
Files included in this resource
9d4b9069-48c4-4212-90d8-4dd6f4b7f2a5.zip: Raw data from iDigBio, DwC-A format
0067804-200613084148143.zip: Raw data from GBIF, DwC-A format
0067806-200613084148143.zip: Raw data from GBIF, DwC-A format
1623690110.zip: Full export of this project's data (enhanced and raw) from BIOSPEX, CSV format
bionomia-datasets-attributions.zip: Directory containing 103 Frictionless Data packages for datasets with attributions involving Rhinolophids or Hipposiderids; each package also contains a CSV file of mismatches between a person's date of birth/death and the specimen eventDate. The file bionomia-datasets-attributions-key_2021-02-25.csv, included in this directory, provides a key between dataset identifier (how the Frictionless Data package files are named) and dataset name.
bionomia-problem-dates-all-datasets_2021-02-25.csv: List of 21 Hipposiderid or Rhinolophid records whose eventDate or dateIdentified mismatches a wikidata recipient’s date of birth or death across all datasets.
flagEventDate.txt: file containing term definition to reference in DwC-A
flagExclude.txt: file containing term definition to reference in DwC-A
flagGeoreference.txt: file containing term definition to reference in DwC-A
flagTaxonomy.txt: file containing term definition to reference in DwC-A
georeferencedByID.txt: file containing term definition to reference in DwC-A
identifiedByNames.txt: file containing term definition to reference in DwC-A
instructions-to-get-people-data-from-bionomia-via-datasetKey: instructions given to data providers
RAPID-code_collection-date.R: code associated with enhancing collection dates
RAPID-code_compile-deduplicate.R: code associated with compiling and deduplicating raw data
RAPID-code_external-linkages-bold.R: code associated with enhancing external linkages
RAPID-code_external-linkages-genbank.R: code associated with enhancing external linkages
RAPID-code_external-linkages-standardize.R: code associated with enhancing external linkages
RAPID-code_people.R: code associated with enhancing data about people
RAPID-code_standardize-country.R: code associated with standardizing country data
RAPID-data-dictionary.pdf: metadata about terms included in this project’s data, in PDF format
RAPID-data-dictionary.xlsx: metadata about terms included in this project’s data, in spreadsheet format
rapid-data-providers_2021-05-03.csv: list of data providers and number of records provided to rapid-joined-records_country-cleanup_2020-09-23.csv
rapid-final-data-product_2021-06-29.zip: Enhanced data from BIOSPEX, DwC-A format
rapid-final-gazetteer.zip: Gazetteer providing georeference data and metadata for 10,341 localities assessed as part of this project
rapid-joined-records_country-cleanup_2020-09-23.csv: data product initial version where raw data has been compiled and deduplicated, and country data has been standardized
RAPID-protocol_collection-date.pdf: protocol associated with enhancing collection dates
RAPID-protocol_compile-deduplicate.pdf: protocol associated with compiling and deduplicating raw data
RAPID-protocol_external-linkages.pdf: protocol associated with enhancing external linkages
RAPID-protocol_georeference.pdf: protocol associated with georeferencing
RAPID-protocol_people.pdf: protocol associated with enhancing data about people
RAPID-protocol_standardize-country.pdf: protocol associated with standardizing country data
RAPID-protocol_taxonomic-names.pdf: protocol associated with enhancing taxonomic name data
RAPIDAgentStrings1_archivedCopy_30March2021.ods: resource used in conjunction with RAPID people protocol
recordedByNames.txt: file containing term definition to reference in DwC-A
Rhinolophid-HipposideridAgentStrings_and_People2_archivedCopy_30March2021.ods: resource used in conjunction with RAPID people protocol
wikidata-notes-for-bat-collectors_leachman_2020: please see https://zenodo.org/record/4724139 for this resource
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
Additional file 1. Example of translation from VCF into GDM format for genomic region data: This .xlsx (MS Excel) spreadsheet exemplifies the transformation of the original 1KGP mutations, expressed in VCF format, into GDM genomic regions. As a demonstrative example, some variants on chromosome X have been selected from the source data (in VCF format) and listed in the first table at the top of the file. The values of the columns #CHROM, POS, REF and ALT appear as in the source. From the column INFO we removed the details that are unnecessary for the transformation. The column FORMAT contains exclusively the value "GT", meaning that the following columns contain only the genotype of the samples (this and other conventions are defined in the VCF specification document and in the header section of each VCF file).

In multiallelic variants (examples e, f.1 and f.2), the genotype indicates with a number which of the alternative alleles in ALT is present in the corresponding samples (e.g., the number 2 means that the second variant is present); otherwise, it only assumes the values 0 (mutation absent) or 1 (mutation present). Additionally, the genotype indicates whether one or both chromosome copies contain the mutation and which one, i.e., the left one or the right one; the mutated alleles are normally separated by a pipe ("|"), if not otherwise specified in the header section. We do not know which chromosome copy is maternal or paternal, but as the 1KGP mutations are "phased", we know that the "left chromosome" is the same in every mutation located on the same chromosome of the same donor. As in this example there is only one column after FORMAT, the mutations described are relative to only one sample, called "HG123456". This sample does not actually exist in the source, but serves to demonstrate several mutation types found in the original data. The table reports six variants in VCF format, with the last one repeated twice to show how different genotype values lead to different translations (indeed, examples f.1 and f.2 differ only in the last column).

Below in the same file, the same variants appear converted into GDM format. The transformation outputs the chr, left, right, strand, AL1, AL2, ref, alt, mut_type and length columns. The value of strand is positive in every mutation, as clarified by the 1KGP Consortium after the release of the data collections. The values of AL1 and AL2 express on which chromatid the mutation occurs and depend on the value of the original genotype (column HG123456). The values of the other columns, namely chr, left, right, ref, alt, mut_type and length, are obtained from the variant's original values after splitting multi-allelic variants, transforming the original position into 0-based coordinates, and removing repeated nucleotide bases from the original REF and ALT columns.

In 0-based coordinates, a nucleotide base occupies the space between the coordinates x and x + 1. So, SNPs (examples a and f.2) are encoded as the replacement of ref at the position between left and right with alt. Insertions (examples c and f.1) are described as the addition of the sequence of bases in alt at the position indicated by left and right, i.e., in between two nucleotide bases. Deletions (example b) are represented as the substitution of ref between positions left and right with an empty value (alt is indeed empty in this case). Finally, structural variants (examples d and e), such as copy number variations and large deletions, have an empty ref because, according to the VCF specification document, the original column REF reports a nucleotide (called the padding base) that is located before the scope of the variant on the genome and is unnecessary in a 0-based representation.

In this file, we report only the columns relevant to understanding the transformation method regarding the mutation coordinates and the reference and alternative alleles. In addition to those reported in the second table, the transformation adds further columns, named after the attributes in the original INFO column, to capture a selection of the attributes present in the original file.
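To make the coordinate logic concrete, here is a minimal Python sketch of the 1-based VCF to 0-based GDM conversion for simple variants. The helper name is hypothetical: multiallelic splitting is assumed to have happened already, and symbolic structural-variant ALT alleles are not handled.

```python
def vcf_to_gdm(pos: int, ref: str, alt: str):
    """Translate one bi-allelic VCF variant (1-based POS) into 0-based
    GDM coordinates, returning (left, right, ref, alt, mut_type)."""
    if len(ref) == 1 and len(alt) == 1:  # SNP: replace the base in [pos-1, pos)
        return pos - 1, pos, ref, alt, "SNP"
    if ref[0] == alt[0]:  # indels share a leading padding base; strip it
        ref, alt = ref[1:], alt[1:]
    if not ref:  # insertion: alt added between two bases (zero-length interval)
        return pos, pos, "", alt, "INS"
    if not alt:  # deletion: ref between left and right replaced by nothing
        return pos, pos + len(ref), ref, "", "DEL"
    return pos, pos + len(ref), ref, alt, "SUB"  # block substitution

# Deletion at POS=100 with REF="TAC", ALT="T": the padding base "T" is
# dropped and "AC" (1-based 101-102) maps to the 0-based interval [100, 102).
print(vcf_to_gdm(100, "TAC", "T"))  # -> (100, 102, 'AC', '', 'DEL')
```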
This comprehensive dataset integrates three heterogeneous data sources to analyze the relationship between air quality, population mobility patterns, and weather conditions across major Indonesian cities from September 2024 to October 2025. The dataset provides valuable insights for environmental monitoring, urban planning, and public health research in Indonesia.
Key Findings (mobility): Over 95% of movements occur within 0-10 km of home, indicating predominantly local mobility patterns. Long-distance travel remains minimal (<0.4%).
Key Findings (air quality): PM2.5 levels consistently exceed WHO guidelines throughout 2024, with critical peaks during May (65-132 μg/m³) and significant improvement in December. Seasonal patterns show higher pollution during the dry months (April-October) due to biomass burning and decreased precipitation.
Key Findings (weather): Consistent tropical monsoon characteristics, with stable temperatures (23-30°C), erratic rainfall patterns, and high humidity. Temperature correlates with both AQI (r = 0.39) and wind speed (r = 0.57).
This dataset is ideal for:
The integrated dataset contains the following columns: ...
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
The repository contains data on party strength for each state, as shown on each state's corresponding party-strength Wikipedia page (for example, Virginia's).
Each state's Wikipedia page has a table giving a detailed summary of the state of its governing and representing bodies, but no dataset collates these entries. I scraped each state's Wikipedia table and collated the entries into a single dataset. The data are stored in state_party_strength.csv and state_party_strength_cleaned.csv; the code that generated them can be found in the corresponding Python notebooks.
The data contain information from 1980 on each state's:
1. governor and party
2. state house and senate composition
3. state representative composition in Congress
4. electoral votes
Data in the cleaned version has been cleaned and processed substantially. Namely:
- all columns now contain homogeneous data within the column
- names and Wiki-citations have been removed
- only the party counts and party identification have been kept

The notebook that created this file is here.

Otherwise, the data contained herein have not been altered from their Wikipedia tables except in two instances:
- column names were forced to be in accord across states
- any needed data modifications (i.e., concatenated string columns) were made to retain information when combining columns
Please note that the right encoding for the dataset is "ISO-8859-1", not "utf-8"; in future versions I will try to fix that to make it more accessible.
This means that you will likely have to perform further data wrangling prior to doing any substantive analysis. The notebook that has been used to create this data file is located here
The raw scraped data can be found in the pickle. This file contains a Python dictionary where each key is a US state name and each element is the raw scraped table in Pandas DataFrame format.
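A minimal sketch of loading both artifacts with pandas, assuming a pickle file name (the description does not give it exactly):

```python
import pickle

import pandas as pd

# The CSVs are encoded as ISO-8859-1, not utf-8 (see the note above).
clean = pd.read_csv("state_party_strength_cleaned.csv", encoding="ISO-8859-1")

# The raw scrape is a dict mapping each US state name to its scraped
# Wikipedia table as a pandas DataFrame (pickle file name assumed).
with open("state_party_strength.pkl", "rb") as f:
    raw = pickle.load(f)

print(clean.head())
print(raw["Virginia"].head())
```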
I hope it proves useful to you for analyzing political patterns at the state level in the US, for political and policy research.
A data science approach to predict and understand the applicant's profile to minimize the risk of future loan defaults.
The dataset contains information about credit applicants. Banks around the world use this kind of dataset to build models that help decide whom to accept or refuse for a loan. After the exploratory data analysis, cleansing, and handling of all the anomalies we might (will) find along the way, the patterns of a good/bad applicant will be exposed for machine learning models to learn.
Machine learning problem and objectives: We're dealing with a supervised binary classification problem. The goal is to train the best machine learning model, maximizing its predictive capability by deeply understanding past customers' profiles, in order to minimize the risk of future loan defaults.
Performance metric: The metric used for model evaluation is ROC AUC, given that we're dealing with highly imbalanced data.
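As a hedged sketch of scoring a classifier with ROC AUC on imbalanced classes (synthetic data stands in for the prepared features; logistic regression is just a placeholder model):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared feature matrix and bad_loan labels,
# with a 9:1 class imbalance in the spirit of loan-default data.
X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000, class_weight="balanced")
model.fit(X_train, y_train)

# ROC AUC is computed from predicted probabilities, not hard class labels.
print(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```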
Project structure: The project is divided into three parts:
- EDA: exploratory data analysis
- Data wrangling: cleansing and feature selection
- Machine learning: predictive modelling
The dataset: You can download the data set here.
Feature description:
id: Unique ID of the loan application.
grade: LC assigned loan grade.
annual_inc: The self-reported annual income provided by the borrower during registration.
short_emp: 1 when employed for 1 year or less.
emp_length_num: Employment length in years. Possible values are - between 0 and 10 where 0 means less than one year and 10 means ten or more years.
home_ownership: Type of home ownership.
dti (Debt-To-Income Ratio): A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.
purpose: A category provided by the borrower for the loan request.
term: The number of payments on the loan. Values are in months and can be either 36 or 60.
last_delinq_none: 1 when the borrower had at least one event of delinquency.
last_major_derog_none: 1 when the borrower had at least 90 days of a bad rating.
revol_util: Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.
total_rec_late_fee: Late fees received to date.
od_ratio: Overdraft ratio.
bad_loan: 1 when a loan was not paid.
Note 😃: this dataset is intended for practicing data analysis. 🤝🎉
If you find it useful, please consider an upvote. 👍
Thank you! ❤️
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
The dataset has been obtained by web scraping a Wikipedia page, for which the code is linked below: https://www.kaggle.com/amruthayenikonda/simple-web-scraping-using-pandas
This dataset can be used to practice data cleaning and manipulation, for example dropping unwanted columns, handling null values, removing symbols, etc., as shown in the sketch below.
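A minimal pandas sketch of those operations; the file and column names here are hypothetical placeholders, not the dataset's actual schema:

```python
import pandas as pd

# Hypothetical file name; adjust to the downloaded CSV.
df = pd.read_csv("scraped_wikipedia_table.csv")

# Drop unwanted columns (ignore if a column is absent).
df = df.drop(columns=["Notes"], errors="ignore")

# Drop rows containing null values.
df = df.dropna(how="any")

# Strip symbols like commas or % from a numeric column, then cast.
df["Population"] = pd.to_numeric(
    df["Population"].astype(str).str.replace(r"[^\d.]", "", regex=True),
    errors="coerce",
)
```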
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
Description: The dataset is intentionally provided for data cleansing and for applying EDA techniques, which makes it fun for data geeks to explore and wrangle. The data is very close to its original form, so dive in and happy exploring.
Features: In total, the dataset contains 121 features. Details are given below.
SK_ID_CURR: ID of loan in our sample
TARGET: Target variable (1 = client with payment difficulties: he/she had late payment of more than X days on at least one of the first Y installments of the loan in our sample; 0 = all other cases)
NAME_CONTRACT_TYPE: Identification if loan is cash or revolving
CODE_GENDER: Gender of the client
FLAG_OWN_CAR: Flag if the client owns a car
FLAG_OWN_REALTY: Flag if the client owns a house or flat
CNT_CHILDREN: Number of children the client has
AMT_INCOME_TOTAL: Income of the client
AMT_CREDIT: Credit amount of the loan
AMT_ANNUITY: Loan annuity
AMT_GOODS_PRICE: For consumer loans, the price of the goods for which the loan is given
NAME_TYPE_SUITE: Who was accompanying the client when applying for the loan
NAME_INCOME_TYPE: Client's income type (businessman, working, maternity leave, …)
NAME_EDUCATION_TYPE: Level of highest education the client achieved
NAME_FAMILY_STATUS: Family status of the client
NAME_HOUSING_TYPE: Housing situation of the client (renting, living with parents, ...)
REGION_POPULATION_RELATIVE: Normalized population of the region where the client lives (a higher number means the client lives in a more populated region)
DAYS_BIRTH: Client's age in days at the time of application
DAYS_EMPLOYED: How many days before the application the person started current employment
DAYS_REGISTRATION: How many days before the application the client changed his registration
DAYS_ID_PUBLISH: How many days before the application the client changed the identity document with which he applied for the loan
OWN_CAR_AGE: Age of the client's car
FLAG_MOBIL: Did client provide mobile phone (1=YES, 0=NO)
FLAG_EMP_PHONE: Did client provide work phone (1=YES, 0=NO)
FLAG_WORK_PHONE: Did client provide home phone (1=YES, 0=NO)
FLAG_CONT_MOBILE: Was mobile phone reachable (1=YES, 0=NO)
FLAG_PHONE: Did client provide home phone (1=YES, 0=NO)
FLAG_EMAIL: Did client provide email (1=YES, 0=NO)
OCCUPATION_TYPE: What kind of occupation the client has
CNT_FAM_MEMBERS: How many family members the client has
REGION_RATING_CLIENT: Our rating of the region where the client lives (1, 2, 3)
REGION_RATING_CLIENT_W_CITY: Our rating of the region where the client lives, taking the city into account (1, 2, 3)
WEEKDAY_APPR_PROCESS_START: On which day of the week the client applied for the loan
HOUR_APPR_PROCESS_START: Approximately at what hour the client applied for the loan
REG_REGION_NOT_LIVE_REGION: Flag if the client's permanent address does not match contact address (1=different, 0=same, at region level)
REG_REGION_NOT_WORK_REGION: Flag if the client's permanent address does not match work address (1=different, 0=same, at region level)
LIVE_REGION_NOT_WORK_REGION: Flag if the client's contact address does not match work address (1=different, 0=same, at region level)
REG_CITY_NOT_LIVE_CITY: Flag if the client's permanent address does not match contact address (1=different, 0=same, at city level)
REG_CITY_NOT_WORK_CITY: Flag if the client's permanent address does not match work address (1=different, 0=same, at city level)
LIVE_CITY_NOT_WORK_CITY: Flag if the client's contact address does not match work address (1=different, 0=same, at city level)
ORGANIZATION_TYPE: Type of organization where the client works
EXT_SOURCE_1: Normalized score from external data source
EXT_SOURCE_2: Normalized score from external data source
EXT_SOURCE_3: Normalized score from external data source
APARTMENTS_AVG, BASEMENTAREA_AVG, YEARS_BEGINEXPLUATATION_AVG, YEARS_BUILD_AVG, ...: Normalized information about the building where the client lives: the average (_AVG suffix), modus (_MODE suffix), or median (_MEDI suffix) of apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, and number of floors
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
These graphs were created in Looker Studio, Power BI, and Tableau:
![graph1](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F16731800%2Fcc028c19723f992a06fafed25acb3c0a%2Fgraph1.jpg?generation=1754162133477099&alt=media)
![graph2](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F16731800%2F29b3afed376627bc7506b4e7168d50db%2Fgraph2.jpg?generation=1754162139850838&alt=media)
![graph3](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F16731800%2Fed520c5b3cb94c2c22973a0fce44185a%2Fgraph3.png?generation=1754162145812743&alt=media)
The People sample dataset on Datablist is a privacy-safe, synthetic census designed for demos, data-wrangling drills and performance benchmarks. Each row carries an incremental Index that doubles as a primary key, a unique User Id token, the person's First Name and Last Name, a binary Sex flag ("Male" or "Female"), a well-formed Email, a phone number in varied international formats under Phone, a Date of birth in YYYY-MM-DD style for age maths, and a realistic Job Title ranging from clerk to C-suite.

The same schema is cloned across seven file sizes so you can scale your experiments in a single keystroke: people-100.csv and its zipped twin give you 100 lines; the pattern repeats for 1,000, 10,000, and 100,000 rows; once you hit half a million (people-500000.csv) the files ship raw, followed by one- and two-million-record behemoths.

Every download begins with a header line, all data are random, and the open-source generator script means you can fork your own flavour if you crave extra columns. Load it into pandas, Excel, BigQuery or anything that speaks CSV, stress-test your ETL, train regexes on the emails, or time how long your SQL engine takes to count birthdays. No compliance team will raise an eyebrow.
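For instance, a small pandas sketch using the smallest file and the column names listed above, deriving an approximate age from Date of birth:

```python
import pandas as pd

# Load the 100-row sample (file name as listed in the description).
people = pd.read_csv("people-100.csv")

# Parse the ISO dates and derive a rough age column for sanity checks.
people["Date of birth"] = pd.to_datetime(people["Date of birth"])
people["Age"] = (pd.Timestamp.today() - people["Date of birth"]).dt.days // 365

print(people[["First Name", "Last Name", "Age", "Job Title"]].head())
```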