8 datasets found
  1. Time-Stamped Air Quality & Weather Data (Paris, 2023)

    • zenodo.org
    bin, csv
    Updated Sep 20, 2025
    Cite
    Somia Asklany; Doaa Mohammed (2025). Time-Stamped Air Quality & Weather Data (Paris, 2023) [Dataset]. http://doi.org/10.5281/zenodo.17167030
    Explore at:
    Available download formats: bin, csv
    Dataset updated
    Sep 20, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Somia Asklany; Doaa Mohammed
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Paris
    Description

    Dataset: Time-Stamped Air Quality & Weather Data (Paris, 2023) — Reproducible Processing Logs

    Summary

    This record provides the full, time-stamped dataset and documentation for an urban air-quality forecasting task in Paris, France (calendar year 2023). The archive includes both the raw measurements and the processed version used for modeling, plus a Data Dictionary and a Processing Log to enable complete transparency and reproducibility.

    Study area & coverage

    · Location: Paris, Île-de-France, France

    · Temporal coverage: 2023-01-01 00:00:00 – 2023-12-31 23:00:00 (local time)

    · Time zone: CET/CEST (UTC+1 in winter, UTC+2 in summer)

    · Frequency: Hourly observations (where available)

    · Primary variables (units):

    - Pollutants: NO₂ (µg/m³), PM₂.₅ (µg/m³), PM₁₀ (µg/m³), CO (mg/m³ or µg/m³ — see dictionary)

    - Meteorology: Temperature (°C), Relative Humidity (%), Wind Speed (m/s), [others if present]

    - Key field: timestamp (ISO 8601: YYYY-MM-DD HH:mm:ss)

    What’s included

    · data/Raw.csv — Raw time-series with a unified timestamp column and all measured variables.

    · data/Processed.csv — Cleaned/chronologically sorted dataset used for modeling (original units retained unless noted).

    · docs/Data_Dictionary.docx — Variable names, definitions, units, and sources.

    · docs/Processing_Traceability.xlsx — Step-by-step preprocessing record (missing-data strategy, outlier policy, scaling, and temporal train/test split).

    Methodological notes

    The dataset is organized for time-series modeling. All preprocessing decisions are documented in docs/Processing_Log.docx. To prevent information leakage, feature selection and normalization are to be performed on the training partition only when reproducing the models. A one-click MATLAB pipeline (code/00_run_all.m) is available in the companion repository (see Related resources) to reproduce the splits and exports.
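
    For users working outside MATLAB, here is a minimal Python sketch of the leakage-safe pattern described above. The file name data/Processed.csv comes from this record, but the cutoff date and the column names (timestamp, no2, pm25, pm10, temperature) are illustrative assumptions that should be checked against docs/Data_Dictionary.docx.

    ```python
    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    # Load the processed hourly series; column names here are assumptions --
    # consult docs/Data_Dictionary.docx for the actual names and units.
    df = pd.read_csv("data/Processed.csv", parse_dates=["timestamp"])
    df = df.sort_values("timestamp")

    # Chronological split (example cutoff only): first nine months for training.
    cutoff = pd.Timestamp("2023-10-01 00:00:00")
    train = df[df["timestamp"] < cutoff]
    test = df[df["timestamp"] >= cutoff]

    # Fit the scaler on the training partition only to avoid information leakage,
    # then apply the same transform to the test partition.
    features = ["no2", "pm25", "pm10", "temperature"]  # hypothetical names
    scaler = StandardScaler().fit(train[features])
    train_scaled = scaler.transform(train[features])
    test_scaled = scaler.transform(test[features])
    ```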

    Intended use

    This dataset supports research and teaching in environmental data science, air-quality forecasting, time-series modeling, and reproducible ML. Users can:

    · Recreate the chronological train/test setup for 2023.

    · Benchmark alternative models and feature-engineering strategies.

    · Explore pollutant–meteorology relationships in Paris during 2023.

    Provenance & quality control

    Data were compiled for the study from Paris monitoring sources for pollutants and standard meteorological observations. Basic QA steps (timestamp harmonization, duplicate checks, unit checks) are documented in the Processing Log. Please consult docs/Data_Dictionary.docx for variable-level details and known caveats.

    Licensing & reuse

    The dataset is released under CC BY 4.0. Please cite this record and the associated article (if applicable) when reusing the data.

    Related resources

    NOAA and OpenAQ

    How to cite

    Somia Asklany (2025). Time-Stamped Air Quality & Weather Data (Paris, 2023) — Reproducible Processing Logs (v1.0) [Data set]. Zenodo. https://doi.org/[Zenodo-DOI]

    Contact

    Somia Asklany, Northern Border University (somia.asklany@nbu.edu.sa)

  2. LTFS Data Science FinHack 3(Analytics Vidhya)

    • kaggle.com
    zip
    Updated Feb 1, 2021
    Cite
    Parv619 (2021). LTFS Data Science FinHack 3(Analytics Vidhya) [Dataset]. https://www.kaggle.com/datasets/parv619/ltfs-data-science-finhack-3analytics-vidhya/code
    Explore at:
    Available download formats: zip (109500058 bytes)
    Dataset updated
    Feb 1, 2021
    Authors
    Parv619
    Description

    This dataset contains extracted data from LTFS Data Science FinHack 3 (Analytics Vidhya)

    LTFS Top-up loan Up-sell prediction

    A loan is money received from a financial institution in exchange for future repayment of the principal plus interest. Financial institutions provide loans to industries, corporates, and individuals. The interest received on these loans is one of the main sources of income for financial institutions.

    A top-up loan, true to its name, is a facility for availing further funds on an existing loan. If a loan has already been disbursed and is under repayment, and you need more funds, you can simply avail additional funding on the same loan, minimizing the time, effort, and cost of applying again.

    LTFS provides loan services to its customers and is interested in selling more of its Top-up loan services to existing customers, so it has decided to identify when to pitch a Top-up during the original loan tenure. Correctly identifying the most suitable time to offer a top-up will ultimately lead to more disbursals and can also help LTFS beat competing offerings from other institutions.

    To understand this behaviour, LTFS has provided customer data indicating whether a particular customer took the Top-up service and when, represented by the target variable Top-up Month.

    You are provided with two types of information:

    1. Customer demographics: The demography table contains the target variable and demographic information, along with variables such as loan frequency, loan tenure, disbursal amount, and LTV.

    2. Bureau data: Behavioural and transactional attributes of the customers, such as current balance, loan amount, and overdue amount, for the various tradelines of a given customer.

    As a data scientist, you are tasked with building a model that, given the Top-up loan bucket of 128,655 customers along with their demographic and bureau data, predicts the right bucket/period for the 14,745 customers in the test data.

    Important Note

    Note that the feasibility of implementing top solutions in a real production scenario will be considered when adjudging winners and can change the final standings for prize eligibility.

    Data Dictionary

    Train_Data.zip This zip file contains the train files for demography data and bureau data. The data dictionary is also included here.

    Test_Data.zip This zip file contains information on demography data and bureau data for a different set of customers

    Sample Submission This file contains the exact submission format for the predictions. Please submit CSV file only.

    | Variable | Definition |
    |:---------|:-----------|
    | ID | Unique identifier for a row |
    | Top-up Month (Target) | Bucket/period for the Top-up loan |

    How to Make a Submission?

    All submissions are to be made at the solution checker tab. For a step-by-step view of how to make a submission, check the video on the competition page.

    Evaluation

    The evaluation metric for this competition is macro_f1_score across all entries in the test set.
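
    For local validation, the metric can be computed with scikit-learn; the bucket labels below are hypothetical placeholders, not the competition's actual bucket names.

    ```python
    from sklearn.metrics import f1_score

    # Hypothetical true and predicted Top-up Month buckets for a few rows.
    y_true = ["No Top-up", "12-18 Months", "18-24 Months", "No Top-up"]
    y_pred = ["No Top-up", "18-24 Months", "18-24 Months", "No Top-up"]

    # Macro F1 averages the per-class F1 scores, weighting every bucket equally.
    print(f1_score(y_true, y_pred, average="macro"))
    ```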

    Public and Private Split: Test data is further divided into a Public split (40%) and a Private split (60%).

    Your initial responses will be checked and scored on the Public data. The final rankings will be based on your Private score, which will be published once the competition is over.

    Guidelines for Final Submission

    Please ensure that your final submission includes the following:

    • Solution file containing the predicted Top-up Month bucket for the test dataset (the format is given in the sample submission CSV).
    • Code file: note that it is mandatory to submit your code for a valid final submission.
    • Approach: please share your approach to solving the problem (doc/ppt/pdf format). It should cover the following topics:
      - A brief on the approach you used to solve the problem.
      - Which data-preprocessing / feature-engineering ideas really worked? How did you discover them?
      - What does your final model look like? How did you reach it?

    How to Set Final Submission?

    Hackathon Rules

    • The final standings will be based on the private leaderboard score and presentations made in an online interview round with LTFS & Analytics Vidhya, which will be held after the contest closes.
    • Setting the final submission is recommended. Without a final submission, the submission corresponding to the best public score will be taken as the final submission.
    • Use of external data is prohibited.
    • You can only make 10 submissions per day.
    • Entries submitted after the contest is closed will not be considered.
    • The code file pertaining to your final submission is mandatory while setting the final submission.
    • Throughout the hackathon, you are expected to respect fellow hackers and act with high integrity. Analytics Vidhya and LTFS hold the right to disqualify any participant at any stage of the compe...

  3. Deep-NLP

    • kaggle.com
    zip
    Updated Mar 1, 2017
    Cite
    samdeeplearning (2017). Deep-NLP [Dataset]. https://www.kaggle.com/samdeeplearning/deepnlp
    Explore at:
    Available download formats: zip (239413 bytes)
    Dataset updated
    Mar 1, 2017
    Authors
    samdeeplearning
    Description

    What's In The Deep-NLP Dataset?

    Sheet_1.csv contains 80 user responses, in the response_text column, to a therapy chatbot. Bot said: 'Describe a time when you have acted as a resource for someone else'. User responded. If a response is 'not flagged', the user can continue talking to the bot. If it is 'flagged', the user is referred to help.

    Sheet_2.csv contains 125 resumes, in the resume_text column. Resumes were queried from Indeed.com with keyword 'data scientist', location 'Vermont'. If a resume is 'not flagged', the applicant can submit a modified resume version at a later date. If it is 'flagged', the applicant is invited to interview.

    What Do I Do With This?

    Classify new resumes/responses as flagged or not flagged.

    There are two sets of data here - resumes and responses. Split the data into a train set and a test set to test the accuracy of your classifier. Bonus points for using the same classifier for both problems.

    Good luck.
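
    A minimal sketch of one way to set up such a classifier, assuming the response_text column given above plus a hypothetical label column named "class" (verify both against Sheet_1.csv):

    ```python
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline

    # Column names are assumptions; verify them against Sheet_1.csv.
    df = pd.read_csv("Sheet_1.csv")
    X, y = df["response_text"], df["class"]  # "class" is a hypothetical label column

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0, stratify=y
    )

    # TF-IDF features plus logistic regression as a simple baseline classifier.
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(X_train, y_train)
    print("Test accuracy:", clf.score(X_test, y_test))
    ```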

    Acknowledgements

    Thank you to Parsa Ghaffari (Aylien), without whom these visuals (the cover photo is from Parsa Ghaffari's excellent LinkedIn article on English, Spanish, and German positive vs. negative sentiment analysis) would not exist.

    There is a 'deep natural language processing' kernel. I will update it. I hope you find it useful.

    You can use any of the code in that kernel anywhere, on or off Kaggle. Ping me at @_samputnam for questions.

  4. mmlu

    • huggingface.co
    Updated May 10, 2023
    + more versions
    Cite
    Center for AI Safety (2023). mmlu [Dataset]. https://huggingface.co/datasets/cais/mmlu
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 10, 2023
    Dataset authored and provided by
    Center for AI Safety (https://safe.ai/)
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset Card for MMLU

      Dataset Summary
    

    Measuring Massive Multitask Language Understanding by Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt (ICLR 2021). This is a massive multitask test consisting of multiple-choice questions from various branches of knowledge. The test spans subjects in the humanities, social sciences, hard sciences, and other areas that are important for some people to learn. This covers 57 tasks… See the full description on the dataset page: https://huggingface.co/datasets/cais/mmlu.
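
    A minimal sketch of loading the benchmark with the Hugging Face datasets library; the "all" configuration name and the field names follow the dataset page, and individual subject names (e.g. "abstract_algebra") can be passed instead.

    ```python
    from datasets import load_dataset

    # Load every subject at once; pass a subject name such as "abstract_algebra"
    # instead of "all" to pull a single task.
    mmlu = load_dataset("cais/mmlu", "all")

    example = mmlu["test"][0]
    print(example["question"])
    print(example["choices"], "-> correct index:", example["answer"])
    ```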

  5. Employee Attrition Classification Dataset

    • kaggle.com
    zip
    Updated Jun 11, 2024
    Cite
    Umair Zia (2024). Employee Attrition Classification Dataset [Dataset]. https://www.kaggle.com/datasets/stealthtechnologies/employee-attrition-dataset
    Explore at:
    Available download formats: zip (1802815 bytes)
    Dataset updated
    Jun 11, 2024
    Authors
    Umair Zia
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    The Synthetic Employee Attrition Dataset is a simulated dataset designed for the analysis and prediction of employee attrition. It contains detailed information about various aspects of an employee's profile, including demographics, job-related features, and personal circumstances.

    The dataset comprises 74,498 samples, split into training and testing sets to facilitate model development and evaluation. Each record includes a unique Employee ID and features that influence employee attrition. The goal is to understand the factors contributing to attrition and develop predictive models to identify at-risk employees.

    This dataset is ideal for HR analytics, machine learning model development, and demonstrating advanced data analysis techniques. It provides a comprehensive and realistic view of the factors affecting employee retention, making it a valuable resource for researchers and practitioners in the field of human resources and organizational development.

    FEATURES:

    • Employee ID: A unique identifier assigned to each employee.
    • Age: The age of the employee, ranging from 18 to 60 years.
    • Gender: The gender of the employee.
    • Years at Company: The number of years the employee has been working at the company.
    • Monthly Income: The monthly salary of the employee, in dollars.
    • Job Role: The department or role the employee works in, encoded into categories such as Finance, Healthcare, Technology, Education, and Media.
    • Work-Life Balance: The employee's perceived balance between work and personal life (Poor, Below Average, Good, Excellent).
    • Job Satisfaction: The employee's satisfaction with their job (Very Low, Low, Medium, High).
    • Performance Rating: The employee's performance rating (Low, Below Average, Average, High).
    • Number of Promotions: The total number of promotions the employee has received.
    • Distance from Home: The distance between the employee's home and workplace, in miles.
    • Education Level: The highest education level attained by the employee (High School, Associate Degree, Bachelor’s Degree, Master’s Degree, PhD).
    • Marital Status: The marital status of the employee (Divorced, Married, Single).
    • Job Level: The job level of the employee (Entry, Mid, Senior).
    • Company Size: The size of the company the employee works for (Small, Medium, Large).
    • Company Tenure: The total number of years the employee has been working in the industry.
    • Remote Work: Whether the employee works remotely (Yes or No).
    • Leadership Opportunities: Whether the employee has leadership opportunities (Yes or No).
    • Innovation Opportunities: Whether the employee has opportunities for innovation (Yes or No).
    • Company Reputation: The employee's perception of the company's reputation (Very Poor, Poor, Good, Excellent).
    • Employee Recognition: The level of recognition the employee receives (Very Low, Low, Medium, High).
    • Attrition: Whether the employee has left the company, encoded as 0 (stayed) and 1 (left).
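
    A minimal sketch of a baseline attrition model, assuming a hypothetical file name (train.csv) and that the CSV column headers match the feature and target names listed above:

    ```python
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder

    # File and column names are assumptions; adjust to the actual download.
    df = pd.read_csv("train.csv")
    y = df["Attrition"]
    X = df.drop(columns=["Attrition", "Employee ID"])

    # One-hot encode the categorical columns, pass numeric columns through.
    cat_cols = X.select_dtypes(include="object").columns.tolist()
    pre = ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols)],
        remainder="passthrough",
    )

    model = Pipeline([("pre", pre), ("rf", RandomForestClassifier(n_estimators=200))])
    model.fit(X, y)
    print("Training accuracy:", model.score(X, y))
    ```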

  6. MIMIC-III - Deep Reinforcement Learning

    • kaggle.com
    zip
    Updated Apr 7, 2022
    Cite
    Asjad K (2022). MIMIC-III - Deep Reinforcement Learning [Dataset]. https://www.kaggle.com/datasets/asjad99/mimiciii
    Explore at:
    Available download formats: zip (11100065 bytes)
    Dataset updated
    Apr 7, 2022
    Authors
    Asjad K
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Digitization of healthcare data, along with algorithmic breakthroughs in AI, will have a major impact on healthcare delivery in the coming years. It is interesting to see AI applied to assist clinicians during patient treatment in a privacy-preserving way. While scientific knowledge can help guide interventions, there remains a key need to quickly cut through the space of decision policies to find effective strategies to support patients during the care process.

    Offline reinforcement learning (also referred to as safe or batch reinforcement learning) is a promising sub-field of RL which provides a mechanism for solving real-world sequential decision-making problems where access to a simulator is not available. Here we assume that a policy is learned from a fixed dataset of trajectories without further interaction with the environment (the agent does not receive reward or punishment signals from the environment). It has been shown that such an approach can leverage vast amounts of existing logged data (in the form of previous interactions with the environment) and can outperform supervised learning approaches or heuristic-based policies for solving real-world decision-making problems. Offline RL algorithms, when trained on sufficiently large and diverse offline datasets, can produce close-to-optimal policies (with the ability to generalize beyond the training data).

    As part of my PhD research, I investigated the problem of developing a Clinical Decision Support System for Sepsis Management using Offline Deep Reinforcement Learning.

    MIMIC-III ('Medical Information Mart for Intensive Care') is a large, open-access, anonymized single-center database consisting of comprehensive clinical data on 61,532 critical care admissions from 2001–2012, collected at a Boston teaching hospital. The dataset consists of 47 features (including demographics, vitals, and lab test results) on a cohort of sepsis patients who meet the Sepsis-3 definition criteria.

    We try to answer the following question:

    Given a particular patient’s characteristics and physiological information at each time step as input, can our Deep RL approach learn an optimal treatment policy that prescribes the right intervention (e.g., use of a ventilator) at each stage of the treatment process, in order to improve the final outcome (e.g., patient mortality)?

    We can use popular state-of-the-art algorithms such as Deep Q-Learning (DQN), Double Deep Q-Learning (DDQN), DDQN combined with BNC, Mixed Monte Carlo (MMC), and Persistent Advantage Learning (PAL). Using these methods we can train an RL policy to recommend an optimal treatment path for a given patient.
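
    To make the offline setting concrete, here is a minimal sketch of a Double DQN loss computed purely from logged transitions (no environment interaction). The network sizes, the action-space size, and the hyperparameters are illustrative assumptions, not the values used in the study; only the 47 state features come from the description above.

    ```python
    import torch
    import torch.nn as nn

    # Toy networks: 47 state features -> Q-values over a small, assumed action set.
    n_features, n_actions, gamma = 47, 5, 0.99
    q_net = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, n_actions))
    target_net = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, n_actions))
    target_net.load_state_dict(q_net.state_dict())

    def ddqn_loss(state, action, reward, next_state, done):
        """Double DQN loss on a batch of logged (s, a, r, s', done) transitions."""
        q_sa = q_net(state).gather(1, action.unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            # Online net selects the next action; target net evaluates it.
            next_a = q_net(next_state).argmax(dim=1, keepdim=True)
            next_q = target_net(next_state).gather(1, next_a).squeeze(1)
            target = reward + gamma * (1.0 - done) * next_q
        return nn.functional.mse_loss(q_sa, target)

    # Illustrative synthetic batch (in practice these come from logged trajectories).
    s = torch.randn(32, n_features)
    a = torch.randint(0, n_actions, (32,))
    r = torch.randn(32)
    s2 = torch.randn(32, n_features)
    d = torch.zeros(32)
    print(ddqn_loss(s, a, r, s2, d))
    ```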

    Data acquisition, standard pre-processing, and modelling details can be found in the GitHub repo: https://github.com/asjad99/MIMIC_RL_COACH

  7. CPAISD - Acute Ischemic Stroke Dataset

    • kaggle.com
    zip
    Updated Mar 29, 2025
    Cite
    Orvile (2025). CPAISD - Acute Ischemic Stroke Dataset [Dataset]. https://www.kaggle.com/datasets/orvile/cpaisd-acute-ischemic-stroke-dataset
    Explore at:
    Available download formats: zip (5655172945 bytes)
    Dataset updated
    Mar 29, 2025
    Authors
    Orvile
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    🧠 CPAISD: Core-Penumbra Acute Ischemic Stroke Dataset 🩸

    CT Scans for Hyperacute Stroke Research

    DOI: https://doi.org/10.5281/zenodo.10892316
    License: MIT

    Welcome to CPAISD, a dataset featuring 112 non-contrast cranial CT scans from patients with hyperacute stroke. Each scan includes expertly segmented ischemic core and penumbra zones, making this a powerful resource for advancing medical image analysis, stroke lesion segmentation, and understanding acute ischemic stroke dynamics. 🩺✨

    📜 What’s This Dataset About?

    The Core-Penumbra Acute Ischemic Stroke Dataset (CPAISD) provides 112 anonymized CT scans from hyperacute stroke patients. Experts have manually delineated the ischemic core and penumbra on every relevant slice. Anonymized with Kitware DicomAnonymizer, it retains key DICOM fields for demographic and domain shift studies:
    - (0x0010, 0x0040) – Patient's Sex
    - (0x0010, 0x1010) – Patient's Age
    - (0x0008, 0x0070) – CT Scanner Manufacturer
    - (0x0008, 0x1090) – CT Scanner Model

    The dataset is split into three folds for robust research:
    - Training: 92 studies, 8,376 slices 📚
    - Validation: 10 studies, 980 slices ✅
    - Testing: 10 studies, 809 slices 🧪

    📁 How’s It Organized?

    Here’s the structure:

    dataset/
    ├── metadata.json              # Dataset stats and split parameters
    ├── summary.csv                # Study metadata (name, split, etc.)
    ├── train/                     # Training fold
    │   ├── study_id_1/
    │   │   ├── StudySliceraw.dcm  # Raw DICOM slice
    │   │   ├── image.npz          # Slice as Numpy array
    │   │   ├── mask.npz           # Core & penumbra mask
    │   │   ├── metadata.json      # Slice metadata
    │   │   └── metadata.json      # Study metadata
    │   └── ...
    ├── val/                       # Validation fold
    │   └── ...
    └── test/                      # Testing fold
        └── ...

    File Breakdown:
    - metadata.json (root): Dataset-wide info (split params, stats).
    - summary.csv: Study-level metadata in table form.
    - StudySliceraw.dcm: Original anonymized DICOM slice.
    - image.npz: CT slice in Numpy format.
    - mask.npz: Segmentation mask (core & penumbra).
    - metadata.json (slice): Slice-specific details.
    - metadata.json (study): Study details like manufacturer, model, age, sex, dsa, nihss, time, lethality.
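
    A minimal sketch of reading one slice from the layout above, assuming each .npz archive stores a single array (check data.files if the key names differ):

    ```python
    import numpy as np

    slice_dir = "dataset/train/study_id_1"  # path follows the layout above

    # Each .npz is assumed to hold one array; inspect .files if the key differs.
    image = np.load(f"{slice_dir}/image.npz")
    mask = np.load(f"{slice_dir}/mask.npz")

    ct = image[image.files[0]]   # CT slice as a 2-D array
    seg = mask[mask.files[0]]    # core/penumbra segmentation mask
    print(ct.shape, seg.shape, np.unique(seg))
    ```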

    💻 Tools & Code

    Check out the GitHub repo for code and more:
    github.com/sb-ai-lab/early_hyperacute_stroke_dataset. It’s Python-based and actively maintained! 🐍

    💡 What Can You Do With It?

    • Build deep learning models for stroke lesion segmentation (core & penumbra). 🖌️
    • Explore links between imaging and outcomes (e.g., NIHSS, lethality). 📈
    • Study scanner effects on lesion appearance (domain shift). 🔬
    • Develop tools for early stroke detection and analysis. ⏱️

    🏷️ Keywords

    CT · Penumbra · Core · Stroke · Medical Imaging · Segmentation

    📜 License

    Released under the MIT License.

    Use, modify, share, or sell—just follow the terms!

    ✍️ Citation

    Using CPAISD? Cite it as:

    Umerenkov, D., Kudin, S., Peksheva, M., & Pavlov, D. (2024). CPAISD: Core-Penumbra Acute Ischemic Stroke Dataset [Data set]. Zenodo. https://doi.org/10.5281/zenodo.10892316

    🧑‍🤝‍🧑 Creators

    • Dmitriy Umerenkov (Researcher)¹
    • Stepan Kudin (Researcher)¹
    • Marina Peksheva (Researcher)²
    • Denis Pavlov (Researcher)²

    🌟 Let’s Make an Impact!

    We hope CPAISD fuels your research in stroke detection and treatment. Happy exploring, and please upvote this dataset if it helps you—let’s drive progress together! 🙌

  8. WikiSQL (Questions and SQL Queries)

    • kaggle.com
    zip
    Updated Nov 25, 2022
    Cite
    The Devastator (2022). WikiSQL (Questions and SQL Queries) [Dataset]. https://www.kaggle.com/datasets/thedevastator/dataset-for-developing-natural-language-interfac
    Explore at:
    Available download formats: zip (21491264 bytes)
    Dataset updated
    Nov 25, 2022
    Authors
    The Devastator
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    WikiSQL (Questions and SQL Queries)

    80654 hand-annotated questions and SQL queries on 24241 Wikipedia tables

    By Huggingface Hub [source]

    About this dataset

    A large crowd-sourced dataset for developing natural language interfaces for relational databases. WikiSQL is a dataset of 80654 hand-annotated examples of questions and SQL queries distributed across 24241 tables from Wikipedia.


    How to use the dataset

    This dataset can be used to develop natural language interfaces for relational databases. The data fields are the same across all splits; each file contains the phase, question, table, and SQL query for each example.

    Research Ideas

    • This dataset can be used to develop natural language interfaces for relational databases.
    • This dataset can be used to develop a knowledge base of common SQL queries.
    • This dataset can be used to generate a training set for a neural network that translates natural language into SQL queries

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. No Copyright: you can copy, modify, distribute, and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: validation.csv

    | Column name | Description |
    |:------------|:------------|
    | phase | The phase of the data collection. (String) |
    | question | The question asked by the user. (String) |
    | table | The table containing the data for the question. (String) |
    | sql | The SQL query corresponding to the question. (String) |

    File: train.csv

    | Column name | Description |
    |:------------|:------------|
    | phase | The phase of the data collection. (String) |
    | question | The question asked by the user. (String) |
    | table | The table containing the data for the question. (String) |
    | sql | The SQL query corresponding to the question. (String) |

    File: test.csv

    | Column name | Description |
    |:------------|:------------|
    | phase | The phase of the data collection. (String) |
    | question | The question asked by the user. (String) |
    | table | The table containing the data for the question. (String) |
    | sql | The SQL query corresponding to the question. (String) |
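
    A minimal sketch of inspecting these files with pandas, assuming the CSVs sit in the working directory after unzipping:

    ```python
    import pandas as pd

    # Column names follow the tables above: phase, question, table, sql.
    train = pd.read_csv("train.csv")
    print(train[["phase", "question", "sql"]].head())

    # Example: how many distinct tables the training questions reference.
    print(train["table"].nunique(), "distinct tables")
    ```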

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and Huggingface Hub.
