Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Summary
This record provides the full, time-stamped dataset and documentation for an urban air-quality forecasting task in Paris, France (calendar year 2023). The archive includes both the raw measurements and the processed version used for modeling, plus a Data Dictionary and a Processing Log to enable complete transparency and reproducibility.
Study area & coverage
· Location: Paris, Île-de-France, France
· Temporal coverage: 2023-01-01 00:00:00 – 2023-12-31 23:00:00 (local time)
· Time zone: CET/CEST (UTC+1 in winter, UTC+2 in summer)
· Frequency: Hourly observations (where available)
· Primary variables (units):
- Pollutants: NO₂ (µg/m³), PM₂.₅ (µg/m³), PM₁₀ (µg/m³), CO (mg/m³ or µg/m³ — see dictionary)
- Meteorology: Temperature (°C), Relative Humidity (%), Wind Speed (m/s), [others if present]
- Key field: timestamp (ISO 8601: YYYY-MM-DD HH:mm:ss)
What’s included
· data/Raw.csv — Raw time-series with a unified timestamp column and all measured variables.
· data/Processed.csv — Cleaned/chronologically sorted dataset used for modeling (original units retained unless noted).
· docs/Data_Dictionary.docx — Variable names, definitions, units, and sources.
· docs/Processing_Traceability.xlsx — Step-by-step preprocessing record (missing-data strategy, outlier policy, scaling, and temporal train/test split).
Methodological notes
The dataset is organized for time-series modeling. All preprocessing decisions are documented in docs/Processing_Log.docx. To prevent information leakage, feature selection and normalization are to be performed on the training partition only when reproducing the models. A one-click MATLAB pipeline (code/00_run_all.m) is available in the companion repository (see Related resources) to reproduce the splits and exports.
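A minimal Python/pandas sketch of the chronological split and train-only scaling described above; the file path and timestamp column come from this record, while the 80/20 ratio and the use of StandardScaler are illustrative assumptions (the split actually used is documented in the Processing Log).

```python
# Minimal sketch of a leakage-free chronological split (assumed 80/20 ratio).
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("data/Processed.csv", parse_dates=["timestamp"])
df = df.sort_values("timestamp").reset_index(drop=True)

cut = int(len(df) * 0.8)                   # test partition is strictly later in time
train, test = df.iloc[:cut], df.iloc[cut:]

# Fit scaling on the training partition only, then apply it to both partitions,
# so no information from the test period leaks into preprocessing.
feature_cols = train.select_dtypes(include="number").columns  # consult the Data Dictionary
scaler = StandardScaler().fit(train[feature_cols])
train_scaled = scaler.transform(train[feature_cols])
test_scaled = scaler.transform(test[feature_cols])
```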
Intended use
This dataset supports research and teaching in environmental data science, air-quality forecasting, time-series modeling, and reproducible ML. Users can:
· Recreate the chronological train/test setup for 2023.
· Benchmark alternative models and feature-engineering strategies.
· Explore pollutant–meteorology relationships in Paris during 2023.
Provenance & quality control
Data were compiled for the study from Paris monitoring sources for pollutants and standard meteorological observations. Basic QA steps (timestamp harmonization, duplicate checks, unit checks) are documented in the Processing Log. Please consult docs/Data_Dictionary.docx for variable-level details and known caveats.
Licensing & reuse
The dataset is released under CC BY 4.0. Please cite this record and the associated article (if applicable) when reusing the data.
Related resources
NOAA and OpenAQ
How to cite
Somia Asklany (2025). Time-Stamped Air Quality & Weather Data (Paris, 2023) — Reproducible Processing Logs (v1.0) [Data set]. Zenodo. https://doi.org/[Zenodo-DOI]
Contact
Somia Asklany, Northern Border University (somia.asklany@nbu.edu.sa), ORCID: [ ]
A loan is money you receive from a financial institution in exchange for future repayment of the principal plus interest. Financial institutions provide loans to industries, corporates, and individuals, and the interest received on these loans is one of their main sources of income.
A top-up loan, true to its name, is a facility for availing additional funds on an existing loan. If a loan has already been disbursed and is under repayment and you need more funds, you can simply avail additional funding on the same loan, minimizing the time, effort, and cost of applying again.
LTFS provides loan services to its customers and wants to sell more of its Top-up loan services to existing customers, so it has decided to identify when to pitch a Top-up during the original loan tenure. Correctly identifying the most suitable time to offer a Top-up will ultimately lead to more disbursals and can also help LTFS beat competing offerings from other institutions.
To understand this behaviour, LTFS has provided customer data indicating whether a given customer took the Top-up service and when they took it, represented by the target variable Top-up Month.
You are provided with two types of information:
Customer demographics: The demography table contains the target variable and demographic information, along with variables such as loan frequency, loan tenure, disbursal amount, and LTV.
Bureau data: Behavioural and transactional attributes of the customers, such as current balance, loan amount, and overdue amount, for the various tradelines of a given customer.
As a data scientist, you have been tasked by LTFS with building a model: given the Top-up loan bucket of 128,655 customers along with demographic and bureau data, predict the right bucket/period for the 14,745 customers in the test data.
Important Note
Note that the feasibility of implementing the top solutions in a real production scenario will be considered when adjudging the winners and can change the final standing for prize eligibility.
Train_Data.zip — This zip file contains the train files for demography data and bureau data. The data dictionary is also included here.
Test_Data.zip — This zip file contains demography data and bureau data for a different set of customers.
Sample Submission — This file contains the exact submission format for the predictions. Please submit a CSV file only.
| Variable | Definition |
|:---------|:-----------|
| ID | Unique identifier for a row |
| Top-up Month (Target) | Bucket/period for the Top-up loan |
All submissions are to be made on the Solution Checker tab.
The evaluation metric for this competition is macro_f1_score across all entries in the test set.
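Macro F1 computes the F1 score for each Top-up Month bucket independently and takes the unweighted mean, so rare buckets count as much as frequent ones. A minimal scikit-learn sketch with hypothetical bucket labels:

```python
from sklearn.metrics import f1_score

# Hypothetical actual vs. predicted Top-up Month buckets (label names are illustrative only).
y_true = ["No Top-up", "12-18 Months", "No Top-up", "18-24 Months"]
y_pred = ["No Top-up", "12-18 Months", "12-18 Months", "No Top-up"]

# average="macro": per-bucket F1 scores averaged with equal weight.
print(f1_score(y_true, y_pred, average="macro"))
```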
Public and Private Split: the test data is further divided into a Public split (40%) and a Private split (60%).
Your initial submissions will be checked and scored on the Public data. The final rankings will be based on your Private score, which will be published once the competition is over.
Please ensure that your final submission includes the following:
- Solution file containing the predicted Top-up Month bucket for the test dataset (the format is given in the sample submission CSV).
- Code file: note that it is mandatory to submit your code for a valid final submission.
- Approach: please share your approach to solving the problem (doc/ppt/pdf format). It should cover a brief on the approach you used to solve the problem, which data-preprocessing / feature-engineering ideas really worked and how you discovered them, and what your final model looks like and how you reached it.
Hackathon Rules
- The final standings will be based on the private leaderboard score and presentations made in an Online Interview round with LTFS & Analytics Vidhya, which will be held after the contest closes.
- Setting the final submission is recommended. Without a final submission, the submission corresponding to the best public score will be taken as the final submission.
- Use of external data is prohibited.
- You can only make 10 submissions per day.
- Entries submitted after the contest is closed will not be considered.
- The code file pertaining to your final submission is mandatory while setting the final submission.
- Throughout the hackathon, you are expected to respect fellow hackers and act with high integrity. Analytics Vidhya and LTFS hold the right to disqualify any participant at any stage of the compe...
Sheet_1.csv contains 80 user responses to a therapy chatbot, in the response_text column. The bot said: 'Describe a time when you have acted as a resource for someone else', and the user responded. If a response is 'not flagged', the user can continue talking to the bot. If it is 'flagged', the user is referred to help.
Sheet_2.csv contains 125 resumes, in the resume_text column. Resumes were queried from Indeed.com with keyword 'data scientist', location 'Vermont'. If a resume is 'not flagged', the applicant can submit a modified resume version at a later date. If it is 'flagged', the applicant is invited to interview.
Classify new resumes/responses as flagged or not flagged.
There are two sets of data here - resumes and responses. Split the data into a train set and a test set to test the accuracy of your classifier. Bonus points for using the same classifier for both problems.
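A minimal sketch of one classifier reused for both files, assuming the label column is named 'class' (check the actual column names in the CSVs):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

def train_and_score(csv_path, text_col, label_col="class"):
    # Hold out 20% of the rows to estimate accuracy on unseen responses/resumes.
    df = pd.read_csv(csv_path)
    X_train, X_test, y_train, y_test = train_test_split(
        df[text_col], df[label_col], test_size=0.2, random_state=0, stratify=df[label_col]
    )
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(X_train, y_train)
    return clf.score(X_test, y_test)

print(train_and_score("Sheet_1.csv", "response_text"))  # chatbot responses
print(train_and_score("Sheet_2.csv", "resume_text"))    # resumes
```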
Good luck.
Thank you to Parsa Ghaffari (Aylien), without whom these visuals would not exist (the cover photo appears in Parsa Ghaffari's excellent LinkedIn article on English, Spanish, and German positive v. negative sentiment analysis).
You can use any of the code in that kernel anywhere, on or off Kaggle. Ping me at @_samputnam for questions.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for MMLU
Dataset Summary
Measuring Massive Multitask Language Understanding by Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt (ICLR 2021). This is a massive multitask test consisting of multiple-choice questions from various branches of knowledge. The test spans subjects in the humanities, social sciences, hard sciences, and other areas that are important for some people to learn. This covers 57 tasks… See the full description on the dataset page: https://huggingface.co/datasets/cais/mmlu.
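A minimal sketch for loading the benchmark with the Hugging Face datasets library; the "all" configuration name and field names are assumptions based on the dataset page linked above (per-subject configurations such as "abstract_algebra" are also listed there):

```python
from datasets import load_dataset

# "all" bundles the 57 subjects into one configuration (assumed name; see the dataset page).
mmlu = load_dataset("cais/mmlu", "all")

example = mmlu["test"][0]
print(example["question"])   # question text
print(example["choices"])    # four answer options
print(example["answer"])     # index of the correct option
```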
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The Synthetic Employee Attrition Dataset is a simulated dataset designed for the analysis and prediction of employee attrition. It contains detailed information about various aspects of an employee's profile, including demographics, job-related features, and personal circumstances.
The dataset comprises 74,498 samples, split into training and testing sets to facilitate model development and evaluation. Each record includes a unique Employee ID and features that influence employee attrition. The goal is to understand the factors contributing to attrition and develop predictive models to identify at-risk employees.
This dataset is ideal for HR analytics, machine learning model development, and demonstrating advanced data analysis techniques. It provides a comprehensive and realistic view of the factors affecting employee retention, making it a valuable resource for researchers and practitioners in the field of human resources and organizational development.
FEATURES:
- Employee ID: A unique identifier assigned to each employee.
- Age: The age of the employee, ranging from 18 to 60 years.
- Gender: The gender of the employee.
- Years at Company: The number of years the employee has been working at the company.
- Monthly Income: The monthly salary of the employee, in dollars.
- Job Role: The department or role the employee works in, encoded into categories such as Finance, Healthcare, Technology, Education, and Media.
- Work-Life Balance: The employee's perceived balance between work and personal life (Poor, Below Average, Good, Excellent).
- Job Satisfaction: The employee's satisfaction with their job (Very Low, Low, Medium, High).
- Performance Rating: The employee's performance rating (Low, Below Average, Average, High).
- Number of Promotions: The total number of promotions the employee has received.
- Distance from Home: The distance between the employee's home and workplace, in miles.
- Education Level: The highest education level attained by the employee (High School, Associate Degree, Bachelor's Degree, Master's Degree, PhD).
- Marital Status: The marital status of the employee (Divorced, Married, Single).
- Job Level: The job level of the employee (Entry, Mid, Senior).
- Company Size: The size of the company the employee works for (Small, Medium, Large).
- Company Tenure: The total number of years the employee has been working in the industry.
- Remote Work: Whether the employee works remotely (Yes or No).
- Leadership Opportunities: Whether the employee has leadership opportunities (Yes or No).
- Innovation Opportunities: Whether the employee has opportunities for innovation (Yes or No).
- Company Reputation: The employee's perception of the company's reputation (Very Poor, Poor, Good, Excellent).
- Employee Recognition: The level of recognition the employee receives (Very Low, Low, Medium, High).
- Attrition: Whether the employee has left the company, encoded as 0 (stayed) and 1 (left).
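A minimal sketch of a baseline attrition model, assuming hypothetical train.csv / test.csv file names, that both files include the Attrition column, and that the column names match the list above:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

train = pd.read_csv("train.csv")   # hypothetical file names; adjust to the actual split files
test = pd.read_csv("test.csv")

target = "Attrition"
X_train = pd.get_dummies(train.drop(columns=["Employee ID", target]))
X_test = pd.get_dummies(test.drop(columns=["Employee ID", target])).reindex(
    columns=X_train.columns, fill_value=0   # align one-hot columns between splits
)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, train[target])
print(model.score(X_test, test[target]))    # held-out accuracy
```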
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Digitization of healthcare data, along with algorithmic breakthroughs in AI, will have a major impact on healthcare delivery in the coming years. It is interesting to see AI applied to assist clinicians during patient treatment in a privacy-preserving way. While scientific knowledge can help guide interventions, there remains a key need to quickly cut through the space of decision policies to find effective strategies to support patients during the care process.
Offline reinforcement learning (also referred to as safe or batch reinforcement learning) is a promising sub-field of RL that provides a mechanism for solving real-world sequential decision-making problems where access to a simulator is not available. Here we assume that a policy is learned from a fixed dataset of trajectories without further interaction with the environment (the agent does not receive a reward or punishment signal from the environment). It has been shown that such an approach can leverage vast amounts of existing logged data (in the form of previous interactions with the environment) and can outperform supervised learning approaches or heuristic-based policies for solving real-world decision-making problems. Offline RL algorithms trained on sufficiently large and diverse offline datasets can produce close-to-optimal policies, with the ability to generalize beyond the training data.
As part of my PhD research, I investigated the problem of developing a Clinical Decision Support System for Sepsis Management using Offline Deep Reinforcement Learning.
MIMIC-III ('Medical Information Mart for Intensive Care') is a large open-access, anonymized, single-center database consisting of comprehensive clinical data from 61,532 critical care admissions collected at a Boston teaching hospital between 2001 and 2012. The dataset used here consists of 47 features (including demographics, vitals, and lab test results) on a cohort of sepsis patients who meet the Sepsis-3 definition criteria.
We try to answer the following question:
Given a particular patient's characteristics and physiological information at each time step as input, can our deep RL approach learn an optimal treatment policy that prescribes the right intervention (e.g., use of a ventilator) at each stage of the treatment process, in order to improve the final outcome (e.g., patient mortality)?
We can use popular state-of-the-art algorithms such as Deep Q-Learning (DQN), Double Deep Q-Learning (DDQN), DDQN combined with BNC, Mixed Monte Carlo (MMC), and Persistent Advantage Learning (PAL). Using these methods, we can train an RL policy to recommend the optimum treatment path for a given patient.
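At the core of these value-based methods is a Bellman backup applied to batches of logged transitions rather than to fresh environment interaction. A minimal PyTorch sketch of one double-DQN-style update on such a batch; the network size, action count, and hyperparameters are illustrative assumptions, not the study's settings (the actual pipeline is in the repository linked below):

```python
import torch
import torch.nn as nn

n_features, n_actions, gamma = 47, 5, 0.99   # 47 patient features per the cohort; action count is illustrative

q_net = nn.Sequential(nn.Linear(n_features, 128), nn.ReLU(), nn.Linear(128, n_actions))
target_net = nn.Sequential(nn.Linear(n_features, 128), nn.ReLU(), nn.Linear(128, n_actions))
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)

def ddqn_update(s, a, r, s_next, done):
    """One update on a batch of logged (state, action, reward, next_state, done) transitions."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)          # Q-values of the actions clinicians took
    with torch.no_grad():
        a_next = q_net(s_next).argmax(dim=1, keepdim=True)        # online net selects the next action
        q_next = target_net(s_next).gather(1, a_next).squeeze(1)  # target net evaluates it (Double DQN)
        target = r + gamma * q_next * (1.0 - done)
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```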
Data acquisition, standard pre-processing, and modelling details can be found in the GitHub repo: https://github.com/asjad99/MIMIC_RL_COACH
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
DOI: https://doi.org/10.5281/zenodo.10892316
License: MIT
Welcome to CPAISD, a dataset featuring 112 non-contrast cranial CT scans from patients with hyperacute stroke. Each scan includes expertly segmented ischemic core and penumbra zones, making this a powerful resource for advancing medical image analysis, stroke lesion segmentation, and understanding acute ischemic stroke dynamics. 🩺✨
The Core-Penumbra Acute Ischemic Stroke Dataset (CPAISD) provides 112 anonymized CT scans from hyperacute stroke patients. Experts have manually delineated the ischemic core and penumbra on every relevant slice. Anonymized with Kitware DicomAnonymizer, it retains key DICOM fields for demographic and domain shift studies:
- (0x0010, 0x0040) – Patient's Sex
- (0x0010, 0x1010) – Patient's Age
- (0x0008, 0x0070) – CT Scanner Manufacturer
- (0x0008, 0x1090) – CT Scanner Model
The dataset is split into three folds for robust research:
- Training: 92 studies, 8,376 slices 📚
- Validation: 10 studies, 980 slices ✅
- Testing: 10 studies, 809 slices 🧪
Here’s the structure:
dataset/
├── metadata.json              # Dataset stats and split parameters
├── summary.csv                # Study metadata (name, split, etc.)
├── train/                     # Training fold
│   ├── study_id_1/
│   │   ├── StudySliceraw.dcm  # Raw DICOM slice
│   │   ├── image.npz          # Slice as Numpy array
│   │   ├── mask.npz           # Core & penumbra mask
│   │   ├── metadata.json      # Slice metadata
│   │   └── metadata.json      # Study metadata
│   └── ...
├── val/                       # Validation fold
│   └── ...
└── test/                      # Testing fold
    └── ...
File Breakdown:
- metadata.json (root): Dataset-wide info (split params, stats).
- summary.csv: Study-level metadata in table form.
- StudySliceraw.dcm: Original anonymized DICOM slice.
- image.npz: CT slice in Numpy format.
- mask.npz: Segmentation mask (core & penumbra).
- metadata.json (slice): Slice-specific details.
- metadata.json (study): Study details like manufacturer, model, age, sex, dsa, nihss, time, lethality.
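A minimal sketch for reading one slice and its mask; the study folder name is hypothetical (taken from the layout above), and the array keys inside the .npz files are not documented here, so the first stored array is used:

```python
import numpy as np
import pydicom

study_dir = "dataset/train/study_id_1"   # hypothetical study folder from the layout above

dcm = pydicom.dcmread(f"{study_dir}/StudySliceraw.dcm")
print(dcm.Manufacturer, dcm.PatientSex, dcm.PatientAge)   # retained DICOM fields

image_npz = np.load(f"{study_dir}/image.npz")
mask_npz = np.load(f"{study_dir}/mask.npz")
image = image_npz[image_npz.files[0]]     # CT slice as a 2-D array
mask = mask_npz[mask_npz.files[0]]        # core / penumbra labels
print(image.shape, np.unique(mask))
```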
Check out the GitHub repo for code and more:
github.com/sb-ai-lab/early_hyperacute_stroke_dataset. It’s Python-based and actively maintained! 🐍
CT · Penumbra · Core · Stroke · Medical Imaging · Segmentation
Released under the MIT License.
✅ Use, modify, share, or sell—just follow the terms!
Using CPAISD? Cite it as:
Umerenkov, D., Kudin, S., Peksheva, M., & Pavlov, D. (2024). CPAISD: Core-Penumbra Acute Ischemic Stroke Dataset [Data set]. Zenodo. https://doi.org/10.5281/zenodo.10892316
We hope CPAISD fuels your research in stroke detection and treatment. Happy exploring, and please upvote this dataset if it helps you—let’s drive progress together! 🙌
CC0 1.0 Universal (Public Domain Dedication) https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
A large crowd-sourced dataset for developing natural language interfaces for relational databases. WikiSQL is a dataset of 80654 hand-annotated examples of questions and SQL queries distributed across 24241 tables from Wikipedia.
This dataset can be used to develop natural language interfaces for relational databases. The data fields are the same among all splits, and each file contains the phase, question, table, and SQL for each example.
- This dataset can be used to develop natural language interfaces for relational databases.
- This dataset can be used to develop a knowledge base of common SQL queries.
- This dataset can be used to generate a training set for a neural network that translates natural language into SQL queries
If you use this dataset in your research, please credit the original authors.
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: validation.csv

| Column name | Description |
|:------------|:------------|
| phase | The phase of the data collection. (String) |
| question | The question asked by the user. (String) |
| table | The table containing the data for the question. (String) |
| sql | The SQL query corresponding to the question. (String) |

File: train.csv

| Column name | Description |
|:------------|:------------|
| phase | The phase of the data collection. (String) |
| question | The question asked by the user. (String) |
| table | The table containing the data for the question. (String) |
| sql | The SQL query corresponding to the question. (String) |

File: test.csv

| Column name | Description |
|:------------|:------------|
| phase | The phase of the data collection. (String) |
| question | The question asked by the user. (String) |
| table | The table containing the data for the question. (String) |
| sql | The SQL query corresponding to the question. (String) |
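A minimal sketch for loading one of the splits described above and pairing questions with their SQL targets (file paths assume the CSVs sit in the working directory):

```python
import pandas as pd

train = pd.read_csv("train.csv")
print(train.columns.tolist())            # ['phase', 'question', 'table', 'sql']

row = train.iloc[0]
print("Q:  ", row["question"])
print("SQL:", row["sql"])

# A natural starting point for a text-to-SQL model: treat each row as a
# sequence-to-sequence pair with the question as source and the SQL as target.
pairs = list(zip(train["question"], train["sql"]))
```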
If you use this dataset in your research, please credit the original authors and Huggingface Hub.