Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Summary
This record provides the full, time-stamped dataset and documentation for an urban air-quality forecasting task in Paris, France (calendar year 2023). The archive includes both the raw measurements and the processed version used for modeling, plus a Data Dictionary and a Processing Log to enable complete transparency and reproducibility.
Study area & coverage
· Location: Paris, Île-de-France, France
· Temporal coverage: 2023-01-01 00:00:00 – 2023-12-31 23:00:00 (local time)
· Time zone: CET/CEST (UTC+1 in winter, UTC+2 in summer)
· Frequency: Hourly observations (where available)
· Primary variables (units):
- Pollutants: NO₂ (µg/m³), PM₂.₅ (µg/m³), PM₁₀ (µg/m³), CO (mg/m³ or µg/m³ — see dictionary)
- Meteorology: Temperature (°C), Relative Humidity (%), Wind Speed (m/s), [others if present]
- Key field: timestamp (ISO 8601: YYYY-MM-DD HH:mm:ss)
What’s included
· data/Raw.csv — Raw time-series with a unified timestamp column and all measured variables.
· data/Processed.csv — Cleaned/chronologically sorted dataset used for modeling (original units retained unless noted).
· docs/Data_Dictionary.docx — Variable names, definitions, units, and sources.
· docs/Processing_Traceability.xlsx — Step-by-step preprocessing record (missing-data strategy, outlier policy, scaling, and temporal train/test split).
Methodological notes
The dataset is organized for time-series modeling. All preprocessing decisions are documented in docs/Processing_Log.docx. To prevent information leakage, feature selection and normalization are to be performed on the training partition only when reproducing the models. A one-click MATLAB pipeline (code/00_run_all.m) is available in the companion repository (see Related resources) to reproduce the splits and exports.
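A minimal Python/pandas sketch of the chronological split and train-only scaling described above; the file path and timestamp column come from this record, while the 80/20 ratio and the use of StandardScaler are illustrative assumptions (the split actually used is documented in the Processing Log).

```python
# Minimal sketch of a leakage-free chronological split (assumed 80/20 ratio).
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("data/Processed.csv", parse_dates=["timestamp"])
df = df.sort_values("timestamp").reset_index(drop=True)

cut = int(len(df) * 0.8)                   # test partition is strictly later in time
train, test = df.iloc[:cut], df.iloc[cut:]

# Fit scaling on the training partition only, then apply it to both partitions,
# so no information from the test period leaks into preprocessing.
feature_cols = train.select_dtypes(include="number").columns  # consult the Data Dictionary
scaler = StandardScaler().fit(train[feature_cols])
train_scaled = scaler.transform(train[feature_cols])
test_scaled = scaler.transform(test[feature_cols])
```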
Intended use
This dataset supports research and teaching in environmental data science, air-quality forecasting, time-series modeling, and reproducible ML. Users can:
· Recreate the chronological train/test setup for 2023.
· Benchmark alternative models and feature-engineering strategies.
· Explore pollutant–meteorology relationships in Paris during 2023.
Provenance & quality control
Data were compiled for the study from Paris monitoring sources for pollutants and standard meteorological observations. Basic QA steps (timestamp harmonization, duplicate checks, unit checks) are documented in the Processing Log. Please consult docs/Data_Dictionary.docx for variable-level details and known caveats.
Licensing & reuse
The dataset is released under CC BY 4.0. Please cite this record and the associated article (if applicable) when reusing the data.
Related resources
NOAA and OpenAQ
How to cite
Somia Asklany (2025). Time-Stamped Air Quality & Weather Data (Paris, 2023) — Reproducible Processing Logs (v1.0) [Data set]. Zenodo. https://doi.org/[Zenodo-DOI]
Contact
Somia Asklany, Northern Border University (somia.asklany@nbu.edu.sa), ORCID: [ ]
A loan is money you receive from a financial institution in exchange for future repayment of the principal plus interest. Financial institutions provide loans to industries, corporates, and individuals, and the interest received on these loans is one of their main sources of income.
A top-up loan, true to its name, is a facility for availing additional funds on an existing loan. If a loan has already been disbursed and is under repayment and you need more funds, you can simply avail additional funding on the same loan, minimizing the time, effort, and cost of applying again.
LTFS provides loan services to its customers and wants to sell more of its Top-up loan services to existing customers, so it has decided to identify when to pitch a Top-up during the original loan tenure. Correctly identifying the most suitable time to offer a Top-up will ultimately lead to more disbursals and can also help LTFS beat competing offerings from other institutions.
To understand this behaviour, LTFS has provided customer data indicating whether a given customer took the Top-up service and when they took it, represented by the target variable Top-up Month.
You are provided with two types of information:
Customer demographics: The demography table contains the target variable and demographic information, along with variables such as loan frequency, loan tenure, disbursal amount, and LTV.
Bureau data: Behavioural and transactional attributes of the customers, such as current balance, loan amount, and overdue amount, for the various tradelines of a given customer.
As a data scientist, you have been tasked by LTFS with building a model: given the Top-up loan bucket of 128,655 customers along with demographic and bureau data, predict the right bucket/period for the 14,745 customers in the test data.
Important Note
Note that the feasibility of implementing the top solutions in a real production scenario will be considered when adjudging the winners and can change the final standing for prize eligibility.
Train_Data.zip — This zip file contains the train files for demography data and bureau data. The data dictionary is also included here.
Test_Data.zip — This zip file contains demography data and bureau data for a different set of customers.
Sample Submission — This file contains the exact submission format for the predictions. Please submit a CSV file only.
| Variable | Definition |
|:---------|:-----------|
| ID | Unique identifier for a row |
| Top-up Month (Target) | Bucket/period for the Top-up loan |
All submissions are to be made on the Solution Checker tab.
The evaluation metric for this competition is macro_f1_score across all entries in the test set.
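Macro F1 computes the F1 score for each Top-up Month bucket independently and takes the unweighted mean, so rare buckets count as much as frequent ones. A minimal scikit-learn sketch with hypothetical bucket labels:

```python
from sklearn.metrics import f1_score

# Hypothetical actual vs. predicted Top-up Month buckets (label names are illustrative only).
y_true = ["No Top-up", "12-18 Months", "No Top-up", "18-24 Months"]
y_pred = ["No Top-up", "12-18 Months", "12-18 Months", "No Top-up"]

# average="macro": per-bucket F1 scores averaged with equal weight.
print(f1_score(y_true, y_pred, average="macro"))
```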
Public and Private Split: the test data is further divided into a Public split (40%) and a Private split (60%).
Your initial submissions will be checked and scored on the Public data. The final rankings will be based on your Private score, which will be published once the competition is over.
Please ensure that your final submission includes the following:
- Solution file containing the predicted Top-up Month bucket for the test dataset (the format is given in the sample submission CSV).
- Code file: note that it is mandatory to submit your code for a valid final submission.
- Approach: please share your approach to solving the problem (doc/ppt/pdf format). It should cover a brief on the approach you used to solve the problem, which data-preprocessing / feature-engineering ideas really worked and how you discovered them, and what your final model looks like and how you reached it.
Hackathon Rules
- The final standings will be based on the private leaderboard score and presentations made in an Online Interview round with LTFS & Analytics Vidhya, which will be held after the contest closes.
- Setting the final submission is recommended. Without a final submission, the submission corresponding to the best public score will be taken as the final submission.
- Use of external data is prohibited.
- You can only make 10 submissions per day.
- Entries submitted after the contest is closed will not be considered.
- The code file pertaining to your final submission is mandatory while setting the final submission.
- Throughout the hackathon, you are expected to respect fellow hackers and act with high integrity. Analytics Vidhya and LTFS hold the right to disqualify any participant at any stage of the compe...
Sheet_1.csv contains 80 user responses to a therapy chatbot, in the response_text column. The bot said: 'Describe a time when you have acted as a resource for someone else', and the user responded. If a response is 'not flagged', the user can continue talking to the bot. If it is 'flagged', the user is referred to help.
Sheet_2.csv contains 125 resumes, in the resume_text column. Resumes were queried from Indeed.com with keyword 'data scientist', location 'Vermont'. If a resume is 'not flagged', the applicant can submit a modified resume version at a later date. If it is 'flagged', the applicant is invited to interview.
Classify new resumes/responses as flagged or not flagged.
There are two sets of data here - resumes and responses. Split the data into a train set and a test set to test the accuracy of your classifier. Bonus points for using the same classifier for both problems.
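A minimal sketch of one classifier reused for both files, assuming the label column is named 'class' (check the actual column names in the CSVs):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

def train_and_score(csv_path, text_col, label_col="class"):
    # Hold out 20% of the rows to estimate accuracy on unseen responses/resumes.
    df = pd.read_csv(csv_path)
    X_train, X_test, y_train, y_test = train_test_split(
        df[text_col], df[label_col], test_size=0.2, random_state=0, stratify=df[label_col]
    )
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(X_train, y_train)
    return clf.score(X_test, y_test)

print(train_and_score("Sheet_1.csv", "response_text"))  # chatbot responses
print(train_and_score("Sheet_2.csv", "resume_text"))    # resumes
```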
Good luck.
Thank you to Parsa Ghaffari (Aylien), without whom these visuals would not exist (the cover photo appears in Parsa Ghaffari's excellent LinkedIn article on English, Spanish, and German positive v. negative sentiment analysis).
You can use any of the code in that kernel anywhere, on or off Kaggle. Ping me at @_samputnam for questions.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for MMLU
Dataset Summary
Measuring Massive Multitask Language Understanding by Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt (ICLR 2021). This is a massive multitask test consisting of multiple-choice questions from various branches of knowledge. The test spans subjects in the humanities, social sciences, hard sciences, and other areas that are important for some people to learn. This covers 57 tasks… See the full description on the dataset page: https://huggingface.co/datasets/cais/mmlu.
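A minimal sketch for loading the benchmark with the Hugging Face datasets library; the "all" configuration name and field names are assumptions based on the dataset page linked above (per-subject configurations such as "abstract_algebra" are also listed there):

```python
from datasets import load_dataset

# "all" bundles the 57 subjects into one configuration (assumed name; see the dataset page).
mmlu = load_dataset("cais/mmlu", "all")

example = mmlu["test"][0]
print(example["question"])   # question text
print(example["choices"])    # four answer options
print(example["answer"])     # index of the correct option
```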
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The Synthetic Employee Attrition Dataset is a simulated dataset designed for the analysis and prediction of employee attrition. It contains detailed information about various aspects of an employee's profile, including demographics, job-related features, and personal circumstances.
The dataset comprises 74,498 samples, split into training and testing sets to facilitate model development and evaluation. Each record includes a unique Employee ID and features that influence employee attrition. The goal is to understand the factors contributing to attrition and develop predictive models to identify at-risk employees.
This dataset is ideal for HR analytics, machine learning model development, and demonstrating advanced data analysis techniques. It provides a comprehensive and realistic view of the factors affecting employee retention, making it a valuable resource for researchers and practitioners in the field of human resources and organizational development.
FEATURES:
- Employee ID: A unique identifier assigned to each employee.
- Age: The age of the employee, ranging from 18 to 60 years.
- Gender: The gender of the employee.
- Years at Company: The number of years the employee has been working at the company.
- Monthly Income: The monthly salary of the employee, in dollars.
- Job Role: The department or role the employee works in, encoded into categories such as Finance, Healthcare, Technology, Education, and Media.
- Work-Life Balance: The employee's perceived balance between work and personal life (Poor, Below Average, Good, Excellent).
- Job Satisfaction: The employee's satisfaction with their job (Very Low, Low, Medium, High).
- Performance Rating: The employee's performance rating (Low, Below Average, Average, High).
- Number of Promotions: The total number of promotions the employee has received.
- Distance from Home: The distance between the employee's home and workplace, in miles.
- Education Level: The highest education level attained by the employee (High School, Associate Degree, Bachelor's Degree, Master's Degree, PhD).
- Marital Status: The marital status of the employee (Divorced, Married, Single).
- Job Level: The job level of the employee (Entry, Mid, Senior).
- Company Size: The size of the company the employee works for (Small, Medium, Large).
- Company Tenure: The total number of years the employee has been working in the industry.
- Remote Work: Whether the employee works remotely (Yes or No).
- Leadership Opportunities: Whether the employee has leadership opportunities (Yes or No).
- Innovation Opportunities: Whether the employee has opportunities for innovation (Yes or No).
- Company Reputation: The employee's perception of the company's reputation (Very Poor, Poor, Good, Excellent).
- Employee Recognition: The level of recognition the employee receives (Very Low, Low, Medium, High).
- Attrition: Whether the employee has left the company, encoded as 0 (stayed) and 1 (left).
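A minimal sketch of a baseline attrition model, assuming hypothetical train.csv / test.csv file names, that both files include the Attrition column, and that the column names match the list above:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

train = pd.read_csv("train.csv")   # hypothetical file names; adjust to the actual split files
test = pd.read_csv("test.csv")

target = "Attrition"
X_train = pd.get_dummies(train.drop(columns=["Employee ID", target]))
X_test = pd.get_dummies(test.drop(columns=["Employee ID", target])).reindex(
    columns=X_train.columns, fill_value=0   # align one-hot columns between splits
)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, train[target])
print(model.score(X_test, test[target]))    # held-out accuracy
```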
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Digitization of healthcare data, along with algorithmic breakthroughs in AI, will have a major impact on healthcare delivery in the coming years. It is interesting to see AI applied to assist clinicians during patient treatment in a privacy-preserving way. While scientific knowledge can help guide interventions, there remains a key need to quickly cut through the space of decision policies to find effective strategies to support patients during the care process.
Offline reinforcement learning (also referred to as safe or batch reinforcement learning) is a promising sub-field of RL that provides a mechanism for solving real-world sequential decision-making problems where access to a simulator is not available. Here we assume that a policy is learned from a fixed dataset of trajectories without further interaction with the environment (the agent does not receive a reward or punishment signal from the environment). It has been shown that such an approach can leverage vast amounts of existing logged data (in the form of previous interactions with the environment) and can outperform supervised learning approaches or heuristic-based policies for solving real-world decision-making problems. Offline RL algorithms trained on sufficiently large and diverse offline datasets can produce close-to-optimal policies, with the ability to generalize beyond the training data.
As part of my PhD research, I investigated the problem of developing a Clinical Decision Support System for Sepsis Management using Offline Deep Reinforcement Learning.
MIMIC-III ('Medical Information Mart for Intensive Care') is a large open-access, anonymized, single-center database consisting of comprehensive clinical data from 61,532 critical care admissions collected at a Boston teaching hospital between 2001 and 2012. The dataset used here consists of 47 features (including demographics, vitals, and lab test results) on a cohort of sepsis patients who meet the Sepsis-3 definition criteria.
We try to answer the following question:
Given a particular patient's characteristics and physiological information at each time step as input, can our deep RL approach learn an optimal treatment policy that prescribes the right intervention (e.g., use of a ventilator) at each stage of the treatment process, in order to improve the final outcome (e.g., patient mortality)?
We can use popular state-of-the-art algorithms such as Deep Q-Learning (DQN), Double Deep Q-Learning (DDQN), DDQN combined with BNC, Mixed Monte Carlo (MMC), and Persistent Advantage Learning (PAL). Using these methods, we can train an RL policy to recommend the optimum treatment path for a given patient.
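At the core of these value-based methods is a Bellman backup applied to batches of logged transitions rather than to fresh environment interaction. A minimal PyTorch sketch of one double-DQN-style update on such a batch; the network size, action count, and hyperparameters are illustrative assumptions, not the study's settings (the actual pipeline is in the repository linked below):

```python
import torch
import torch.nn as nn

n_features, n_actions, gamma = 47, 5, 0.99   # 47 patient features per the cohort; action count is illustrative

q_net = nn.Sequential(nn.Linear(n_features, 128), nn.ReLU(), nn.Linear(128, n_actions))
target_net = nn.Sequential(nn.Linear(n_features, 128), nn.ReLU(), nn.Linear(128, n_actions))
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)

def ddqn_update(s, a, r, s_next, done):
    """One update on a batch of logged (state, action, reward, next_state, done) transitions."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)          # Q-values of the actions clinicians took
    with torch.no_grad():
        a_next = q_net(s_next).argmax(dim=1, keepdim=True)        # online net selects the next action
        q_next = target_net(s_next).gather(1, a_next).squeeze(1)  # target net evaluates it (Double DQN)
        target = r + gamma * q_next * (1.0 - done)
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```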
Data acquisition, standard pre-processing, and modelling details can be found in the GitHub repo: https://github.com/asjad99/MIMIC_RL_COACH
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
DOI: https://doi.org/10.5281/zenodo.10892316
License: MIT
Welcome to CPAISD, a dataset featuring 112 non-contrast cranial CT scans from patients with hyperacute stroke. Each scan includes expertly segmented ischemic core and penumbra zones, making this a powerful resource for advancing medical image analysis, stroke lesion segmentation, and understanding acute ischemic stroke dynamics. 🩺✨
The Core-Penumbra Acute Ischemic Stroke Dataset (CPAISD) provides 112 anonymized CT scans from hyperacute stroke patients. Experts have manually delineated the ischemic core and penumbra on every relevant slice. Anonymized with Kitware DicomAnonymizer, it retains key DICOM fields for demographic and domain shift studies:
- (0x0010, 0x0040) – Patient's Sex
- (0x0010, 0x1010) – Patient's Age
- (0x0008, 0x0070) – CT Scanner Manufacturer
- (0x0008, 0x1090) – CT Scanner Model
The dataset is split into three folds for robust research:
- Training: 92 studies, 8,376 slices 📚
- Validation: 10 studies, 980 slices ✅
- Testing: 10 studies, 809 slices 🧪
Here’s the structure:
dataset/
├── metadata.json              # Dataset stats and split parameters
├── summary.csv                # Study metadata (name, split, etc.)
├── train/                     # Training fold
│   ├── study_id_1/
│   │   ├── StudySliceraw.dcm  # Raw DICOM slice
│   │   ├── image.npz          # Slice as Numpy array
│   │   ├── mask.npz           # Core & penumbra mask
│   │   ├── metadata.json      # Slice metadata
│   │   └── metadata.json      # Study metadata
│   └── ...
├── val/                       # Validation fold
│   └── ...
└── test/                      # Testing fold
    └── ...
File Breakdown:
- metadata.json (root): Dataset-wide info (split params, stats).
- summary.csv: Study-level metadata in table form.
- StudySliceraw.dcm: Original anonymized DICOM slice.
- image.npz: CT slice in Numpy format.
- mask.npz: Segmentation mask (core & penumbra).
- metadata.json (slice): Slice-specific details.
- metadata.json (study): Study details like manufacturer, model, age, sex, dsa, nihss, time, lethality.
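A minimal sketch for reading one slice and its mask; the study folder name is hypothetical (taken from the layout above), and the array keys inside the .npz files are not documented here, so the first stored array is used:

```python
import numpy as np
import pydicom

study_dir = "dataset/train/study_id_1"   # hypothetical study folder from the layout above

dcm = pydicom.dcmread(f"{study_dir}/StudySliceraw.dcm")
print(dcm.Manufacturer, dcm.PatientSex, dcm.PatientAge)   # retained DICOM fields

image_npz = np.load(f"{study_dir}/image.npz")
mask_npz = np.load(f"{study_dir}/mask.npz")
image = image_npz[image_npz.files[0]]     # CT slice as a 2-D array
mask = mask_npz[mask_npz.files[0]]        # core / penumbra labels
print(image.shape, np.unique(mask))
```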
Check out the GitHub repo for code and more:
github.com/sb-ai-lab/early_hyperacute_stroke_dataset. It’s Python-based and actively maintained! 🐍
CT · Penumbra · Core · Stroke · Medical Imaging · Segmentation
Released under the MIT License.
✅ Use, modify, share, or sell—just follow the terms!
Using CPAISD? Cite it as:
Umerenkov, D., Kudin, S., Peksheva, M., & Pavlov, D. (2024). CPAISD: Core-Penumbra Acute Ischemic Stroke Dataset [Data set]. Zenodo. https://doi.org/10.5281/zenodo.10892316
We hope CPAISD fuels your research in stroke detection and treatment. Happy exploring, and please upvote this dataset if it helps you—let’s drive progress together! 🙌
CC0 1.0 Universal (Public Domain Dedication) https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
A large crowd-sourced dataset for developing natural language interfaces for relational databases. WikiSQL is a dataset of 80654 hand-annotated examples of questions and SQL queries distributed across 24241 tables from Wikipedia.
This dataset can be used to develop natural language interfaces for relational databases. The data fields are the same among all splits, and each file contains the phase, question, table, and SQL for each example.
- This dataset can be used to develop natural language interfaces for relational databases.
- This dataset can be used to develop a knowledge base of common SQL queries.
- This dataset can be used to generate a training set for a neural network that translates natural language into SQL queries
If you use this dataset in your research, please credit the original authors.
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: validation.csv

| Column name | Description |
|:------------|:------------|
| phase | The phase of the data collection. (String) |
| question | The question asked by the user. (String) |
| table | The table containing the data for the question. (String) |
| sql | The SQL query corresponding to the question. (String) |

File: train.csv

| Column name | Description |
|:------------|:------------|
| phase | The phase of the data collection. (String) |
| question | The question asked by the user. (String) |
| table | The table containing the data for the question. (String) |
| sql | The SQL query corresponding to the question. (String) |

File: test.csv

| Column name | Description |
|:------------|:------------|
| phase | The phase of the data collection. (String) |
| question | The question asked by the user. (String) |
| table | The table containing the data for the question. (String) |
| sql | The SQL query corresponding to the question. (String) |
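A minimal sketch for loading one of the splits described above and pairing questions with their SQL targets (file paths assume the CSVs sit in the working directory):

```python
import pandas as pd

train = pd.read_csv("train.csv")
print(train.columns.tolist())            # ['phase', 'question', 'table', 'sql']

row = train.iloc[0]
print("Q:  ", row["question"])
print("SQL:", row["sql"])

# A natural starting point for a text-to-SQL model: treat each row as a
# sequence-to-sequence pair with the question as source and the SQL as target.
pairs = list(zip(train["question"], train["sql"]))
```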
If you use this dataset in your research, please credit the original authors and Huggingface Hub.