31 datasets found
  1. Real State Website Data

    • kaggle.com
    Updated Jun 11, 2023
    Cite
    M. Mazhar (2023). Real State Website Data [Dataset]. https://www.kaggle.com/datasets/mazhar01/real-state-website-data/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 11, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    M. Mazhar
    License

    http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Description

    See also: End-to-End Regression Model Pipeline Development with FastAPI: From Data Scraping to Deployment with CI/CD Integration

    This CSV dataset provides comprehensive information about house prices. It consists of 9,819 entries and 54 columns, offering a wealth of features for analysis. The dataset includes various numerical and categorical variables, providing insights into factors that influence house prices.

    The key columns in the dataset are as follows:

    1. Location1: the location of the house (Location2 is an identical or shortened version of Location1).
    2. Year: the year of construction.
    3. Type: the type of the house.
    4. Bedrooms: the number of bedrooms in the house.
    5. Bathrooms: the number of bathrooms in the house.
    6. Size_in_SqYds: the size of the house in square yards.
    7. Price: the price of the house.
    8. Parking_Spaces: the number of parking spaces available.
    9. Floors_in_Building: the number of floors in the building.
    10. Elevators: the presence of elevators in the building.
    11. Lobby_in_Building: the presence of a lobby in the building.

    In addition to these, the dataset contains several other features related to various amenities and facilities available in the houses, such as double-glazed windows, central air conditioning, central heating, waste disposal, furnished status, service elevators, and more.

    By performing exploratory data analysis on this dataset using Python and the Pandas library, valuable insights can be gained regarding the relationships between different variables and the impact they have on house prices. Descriptive statistics, data visualization, and feature engineering techniques can be applied to uncover patterns and trends in the housing market.
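
    A minimal, illustrative starting point for such an analysis (the CSV filename is an assumption; the column names follow the list above):

    import pandas as pd

    # Assumed filename; download the CSV from the Kaggle page first
    df = pd.read_csv("real_state_website_data.csv")

    print(df.shape)                      # expected: (9819, 54)
    print(df.dtypes.value_counts())      # mix of numerical and categorical columns

    # Summary statistics for a few key numeric columns
    print(df[["Price", "Bedrooms", "Bathrooms", "Size_in_SqYds"]].describe())

    # Which numeric features correlate most strongly with price?
    numeric = df.select_dtypes("number")
    print(numeric.corr()["Price"].sort_values(ascending=False).head(10))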

    This dataset serves as a valuable resource for real estate professionals, analysts, and researchers interested in understanding the factors that contribute to house prices and making informed decisions in the real estate market.

  2. IMDb Top 4070: Explore the Cinema Data

    • kaggle.com
    Updated Aug 15, 2023
    Cite
    K.T.S. Prabhu (2023). IMDb Top 4070: Explore the Cinema Data [Dataset]. https://www.kaggle.com/datasets/ktsprabhu/imdb-top-4070-explore-the-cinema-data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 15, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    K.T.S. Prabhu
    Description

    Dive into the world of exceptional cinema with our meticulously curated dataset, "IMDb's Gems Unveiled." This dataset is a result of an extensive data collection effort based on two critical criteria: IMDb ratings exceeding 7 and a substantial number of votes, surpassing 10,000. The outcome? A treasure trove of 4070 movies meticulously selected from IMDb's vast repository.

    What sets this dataset apart is its richness and diversity. With more than 20 data points meticulously gathered for each movie, this collection offers a comprehensive insight into each cinematic masterpiece. Our data collection process leveraged the power of Selenium and Pandas modules, ensuring accuracy and reliability.

    Cleaning this vast dataset was a meticulous task, combining both Excel and Python for optimum precision. Analysis is powered by Pandas, Matplotlib, and NLTK, enabling us to uncover hidden patterns, trends, and themes within the realm of cinema.
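
    As a small, purely illustrative example of that kind of Pandas/Matplotlib exploration (the CSV filename and the Rating column name are assumptions about the scraped output, not part of the dataset documentation):

    import pandas as pd
    import matplotlib.pyplot as plt

    movies = pd.read_csv("imdb_top_4070.csv")   # hypothetical filename
    print(movies.shape)                          # expected: about 4070 rows, 20+ columns

    movies["Rating"].plot.hist(bins=20)          # distribution of ratings above 7
    plt.xlabel("IMDb rating")
    plt.show()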

    Note: The data was collected as of April 2023. Future versions of this analysis will include a movie recommendation system. Please do connect for any queries. All Love, No Hate.

  3. Replication Data for Exploring an extinct society through the lens of...

    • dataone.org
    Updated Dec 16, 2023
    Cite
    Wieczorek, Oliver; Malzahn, Melanie (2023). Replication Data for Exploring an extinct society through the lens of Habitus-Field theory and the Tocharian text corpus [Dataset]. http://doi.org/10.7910/DVN/UF8DHK
    Explore at:
    Dataset updated
    Dec 16, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Wieczorek, Oliver; Malzahn, Melanie
    Description

    The files and workflow will allow you to replicate the study titled "Exploring an extinct society through the lens of Habitus-Field theory and the Tocharian text corpus". This study aimed at utilizing the CEToM corpus (https://cetom.univie.ac.at/) (Tocharian) to analyze the life-world of the elites of an extinct society situated in modern eastern China. To acquire the raw data needed for steps 1 & 2, please contact Melanie Malzahn (melanie.malzahn@univie.ac.at). We conducted a mixed-methods study consisting of close reading, content analysis, and multiple correspondence analysis (MCA). The Excel file titled "fragments_architecture_combined.xlsx" allows for replication of the MCA and corresponds to the third step of the workflow outlined below.

    We used the following programming languages and packages to prepare the dataset and to analyze the data. Data preparation and merging procedures were carried out in Python (version 3.9.10) with the packages pandas (version 1.5.3), os (version 3.12.0), re (version 3.12.0), numpy (version 1.24.3), gensim (version 4.3.1), BeautifulSoup4 (version 4.12.2), pyasn1 (version 0.4.8), and langdetect (version 1.0.9). Multiple correspondence analyses were conducted in R (version 4.3.2) with the packages FactoMineR (version 2.9), factoextra (version 1.0.7), readxl (version 1.4.3), tidyverse (version 2.0.0), ggplot2 (version 3.4.4), and psych (version 2.3.9).

    After requesting the necessary files, please open the scripts in the order outlined below and execute the code files to replicate the analysis.

    Preparatory step: Create a folder for the Python and R scripts downloadable in this repository. Open the file 0_create folders.py and declare a root folder in line 19. This first script will generate the following folders:

    • "tarim-brahmi_database" = folder which contains Tocharian dictionaries and Tocharian text fragments.
    • "dictionaries" = contains Tocharian A and Tocharian B vocabularies, including linguistic features such as translations, meanings, part-of-speech tags, etc. A full overview of the words is provided on https://cetom.univie.ac.at/?words.
    • "fragments" = contains Tocharian text fragments as xml-files.
    • "word_corpus_data" = folder that will contain Excel files of the corpus data after the first step.
    • "Architectural_terms" = contains the data on the architectural terms used in the dataset (e.g. dwelling, house).
    • "regional_data" = contains the data on the findspots (Tocharian and modern Chinese equivalent, e.g. Duldur-Akhur & Kucha).
    • "mca_ready_data" = the folder in which the Excel file with the merged data will be saved. Note that the prepared file named "fragments_architecture_combined.xlsx" can be saved into this directory. This allows you to skip steps 1 & 2 and reproduce the MCA of the content analysis based on the third step of our workflow (R script titled 3_conduct_MCA.R).

    First step - run 1_read_xml-files.py: loops over the xml-files in the dictionaries folder and identifies word metadata, including language (Tocharian A or B), keywords, part of speech, lemmata, word etymology, and loan sources. Then it loops over the xml text files and extracts a text id number, language (Tocharian A or B), text title, text genre, text subgenre, prose type, verse type, material on which the text is written, medium, findspot, the source text in Tocharian, and the translation where available. After successful feature extraction, the resulting pandas dataframe object is exported to the word_corpus_data folder.

    Second step - run 2_merge_excel_files.py: merges all Excel files (corpus, data on findspots, word data) and reproduces the content analysis, which was based upon close reading in the first place.

    Third step - run 3_conduct_MCA.R: recodes, prepares, and selects the variables necessary to conduct the MCA. It then produces the descriptive values before conducting the MCA, identifying typical texts per dimension, and exporting the png files uploaded to this repository.
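
    For orientation, the XML-to-DataFrame extraction of the first step reduces to a loop of the following shape; this is only a schematic sketch, and the tag names used here are placeholders rather than the elements actually parsed by 1_read_xml-files.py:

    import os
    import pandas as pd
    from bs4 import BeautifulSoup

    fragments_dir = "tarim-brahmi_database/fragments"   # created by 0_create folders.py
    records = []
    for fname in os.listdir(fragments_dir):
        if not fname.endswith(".xml"):
            continue
        with open(os.path.join(fragments_dir, fname), encoding="utf-8") as fh:
            soup = BeautifulSoup(fh, "xml")              # XML parsing requires lxml
        # "language" and "title" are placeholder tag names for illustration only
        language = soup.find("language")
        title = soup.find("title")
        records.append({
            "file": fname,
            "language": language.get_text(strip=True) if language else None,
            "title": title.get_text(strip=True) if title else None,
        })

    df = pd.DataFrame(records)
    df.to_excel("word_corpus_data/fragments_overview.xlsx", index=False)  # requires openpyxl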

  4. Machine Learning Majorite barometer - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Feb 6, 2021
    Cite
    (2021). Machine Learning Majorite barometer - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/1a523db9-b8d3-508d-9d69-3efed2629d00
    Explore at:
    Dataset updated
    Feb 6, 2021
    Description

    A machine learning barometer (using Random Forest Regression) to calculate equilibration pressure for majoritic garnets. Updated 04/02/21 (21/01/21) (10/12/20).

    The barometer code
    The barometer is provided as Python scripts (.py) and Jupyter Notebook (.ipynb) files. These are completely equivalent to one another, and which is used depends on the user's preference. Separate instructions are provided for each.

    Data files included in this repository are:
    • "Majorite_database_04022021.xlsm" (Excel sheet of literature majoritic garnet compositions - inclusions (up to date as of 04/02/2021) and experiments (up to date as of 03/07/2020). This data includes all compositions that are close to majoritic, but some are borderline. Filtering as described in the paper accompanying this barometer is performed in the python script prior to any data analysis or fitting.)
    • "lit_maj_nat_030720.txt" (python script input file of experimental literature majoritic garnet compositions - taken from the dataset above)
    • "di_incs_040221.txt" (python script input file of a literature compilation of majoritic garnet inclusions observed in natural diamonds - taken from the dataset above)

    The barometer as Jupyter Notebooks - including integrated Caret validation (added 21/01/2021)
    For those less familiar with Python, running the barometer as a Notebook is somewhat more intuitive than running the scripts below. It also has the benefit of including the RFR validation using Caret within a single integrated notebook. The Jupyter Notebook requires a suitable Python3 environment (with pandas, numpy, matplotlib, sklearn, rpy2 and pickle packages + dependencies). We recommend installing the latest anaconda python distribution (found here: https://docs.anaconda.com/anaconda/install/) and creating a custom environment containing the required packages to run the Jupyter Notebook (as both python3 and R must be active in the environment). Instructions on this procedure can be found here (https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html); to assist in this process we have provided a copy of the environment used to produce the scripts (barom-spec-file.txt). An identical conda environment (called myenv) can be created and used by:
    1) copying the barometer-spec-file.txt to a suitable location (i.e. your home directory)
    2) running the command: conda create --name myenv --file barom-spec-file.txt
    3) entering this environment: conda activate myenv
    4) running an instance of Jupyter Notebook by typing: jupyter notebook

    Two Notebooks are provided:
    • calculate_pressures_notebook.ipynb (equivalent to calculate_pressures.py described below)
    • rfr_majbar_10122020_notebook.ipynb (equivalent to rfr_majbar_10122020.py described below, but also including integrated Caret validation performed using the rpy2 package in a single notebook environment)

    The barometer as scripts (10/12/2020)
    The scripts below need to be run in a suitable Python3 environment (with pandas, numpy, matplotlib, sklearn and pickle packages + dependencies). For inexperienced users we recommend installing the latest anaconda python distribution (found here: https://docs.anaconda.com/anaconda/install/) and running in Spyder (a GUI scripting environment provided with Anaconda). Note - if running python 3.7 (or earlier) you will need to install the pickle5 package to use the provided barometer files and comment/uncomment the appropriate lines in the "calculate_pressures.py" (lines 16/17) and "rfr_majbar_10122020.py" (lines 26/27) scripts. The user may additionally need to download and install the required packages if they are not provided with the anaconda distribution (pandas, numpy, matplotlib, scikit-learn and pickle). This will be obvious, as the script will return an error similar to "No module name XXXX" when run. Packages can either be installed using the anaconda package manager or on the command line / terminal via commands such as: conda install -c conda-forge pickle5. Appropriate command-line installation commands can be obtained by searching the anaconda cloud at anaconda.org for each required package.

    A python script (.py) is provided to calculate pressures for any majoritic garnet using the barometer calibrated in Thomson et al. (2021):
    • calculate_pressures.py takes an input file of any majoritic garnet compositions (an example input file is provided, "example_test_data.txt", which contains inclusion compositions reported by Zedgenizov et al., 2014, Chemical Geology, 363, pp 114-124).
    • It employs the published RFR model and scaler - both provided as pickle files (pickle_model_20201210.pkl, scaler_20201210.pkl).

    The user can simply edit the input file name in the provided .py script and then run the script in a suitable python3 environment (requires pandas, numpy, sklearn and pickle packages). The script initially filters data for majoritic compositions (according to the criteria used for barometer calibration) and predicts pressures for these compositions. It writes out pressures and 2 x std_dev in pressure estimates alongside the input data into "out_pressures_test.txt". If this script produces any errors or warnings, it is likely because the serialised pickle files provided are not compatible with the python build being used (this is a common issue with serialised ML models). Please first try installing the pickle5 package and commenting/uncommenting lines 16/17. If this is unsuccessful, run the full barometer calibration script below (using the same input files as in Thomson et al. (2021), which are provided) to produce pickle files compatible with the python build on the local machine (action 5 of the script below). Subsequently edit the filenames called in the "calculate_pressures.py" script (lines 22 & 27) to match the new barometer calibration files and re-run the calculate pressure script. The output (predicted pressures) for the test dataset provided (and using the published calibration) should be similar to the following results:

    P (GPa)  error (GPa)
    17.0     0.4
    16.6     0.3
    19.5     1.3
    21.8     1.3
    12.8     0.3
    14.3     0.4
    14.7     0.4
    14.4     0.6
    12.1     0.6
    14.6     0.5
    17.0     1.0
    14.6     0.6
    11.9     0.7
    14.0     0.5
    16.8     0.8

    Full RFR barometer calibration script - rfr_majbar_10122020.py
    The RFR barometer calibration script used and described in Thomson et al. (2021). This script performs the following actions:
    1) filters input data and outputs this filtered data as a .txt file (which is the input expected for the RFR validation script using the R package Caret)
    2) fits 1000 RFR models, each using a randomly selected training dataset (70% of the input data)
    3) performs leave-one-out validation
    4) plots figure 5 from Thomson et al. (2021)
    5) fits one single RFR barometer using all input data (saves this and the scaler as .pkl files with a datestamp for use in the calculate_pressures.py script)
    6) calculates the pressure for all literature inclusion compositions over 100 iterations with randomly distributed compositional uncertainties added - provides the mean pressure and 2 std deviations, written alongside the input inclusion compositions, as a .txt output file "diout.txt"
    7) plots the global distribution of majoritic inclusion pressures

    The RFR barometer can easily be updated to include (or exclude) additional experimental compositions by modification of the literature data input files provided.

    RFR validation using Caret in R (script titled "RFR_validation_03072020.R")
    Additional validation tests of the RFR barometer are completed using the Caret package in R. This requires the filtered experimental dataset file "data_filteredforvalidation.txt" (which is generated by the rfr_majbar_10122020.py script if required for a new dataset). It performs bootstrap, K-fold and leave-one-out validation and outputs validation stats for 5, 7 and 9 input variables (elements).

    Please email Andrew Thomson (a.r.thomson@ucl.ac.uk) if you have any questions or queries.
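
    For orientation, the pressure-calculation step reduces to loading the two pickle files and calling the scikit-learn model, roughly as sketched below; the input file's separator and column layout are assumptions, and the authors' calculate_pressures.py additionally filters out non-majoritic compositions first:

    import pickle
    import pandas as pd

    with open("pickle_model_20201210.pkl", "rb") as fh:
        model = pickle.load(fh)      # the published RandomForestRegressor
    with open("scaler_20201210.pkl", "rb") as fh:
        scaler = pickle.load(fh)     # the scaler fitted during calibration

    # Assumed whitespace-separated file containing only the numeric composition
    # columns, in the same order as the calibration data
    garnets = pd.read_csv("example_test_data.txt", sep=r"\s+")

    X = scaler.transform(garnets.values)
    pressures = model.predict(X)     # equilibration pressures in GPa
    print(pressures.round(1))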

  5. An Empirical Study on Energy Usage Patterns of Different Variants of Data...

    • figshare.com
    zip
    Updated Nov 5, 2024
    Cite
    Princy Chauhan (2024). An Empirical Study on Energy Usage Patterns of Different Variants of Data Processing Libraries [Dataset]. http://doi.org/10.6084/m9.figshare.27611421.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 5, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Princy Chauhan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    As computing power grows, so does the need for data processing, which uses a lot of energy in steps like cleaning and analyzing data. This study looks at the energy and time efficiency of four common Python libraries—Pandas, Vaex, Scikit-learn, and NumPy—tested on five datasets across 21 tasks. We compared the energy use of the newest and older versions of each library. Our findings show that no single library always saves the most energy. Instead, energy use varies by task type, how often tasks are done, and the library version. In some cases, newer versions use less energy, pointing to the need for more research on making data processing more energy-efficient.

    A zip file accompanying this study contains the scripts, datasets, and a README file for guidance. This setup allows for easy replication and testing of the experiments described, helping to further analyze energy efficiency across different libraries and tasks.

  6. Data from: CADDI: An in-Class Activity Detection Dataset using IMU data from...

    • observatorio-cientifico.ua.es
    • scidb.cn
    Updated 2025
    Cite
    Marquez-Carpintero, Luis; Suescun-Ferrandiz, Sergio; Pina-Navarro, Monica; Gomez-Donoso, Francisco; Cazorla, Miguel (2025). CADDI: An in-Class Activity Detection Dataset using IMU data from low-cost sensors [Dataset]. https://observatorio-cientifico.ua.es/documentos/668fc49bb9e7c03b01be251c
    Explore at:
    Dataset updated
    2025
    Authors
    Marquez-Carpintero, Luis; Suescun-Ferrandiz, Sergio; Pina-Navarro, Monica; Gomez-Donoso, Francisco; Cazorla, Miguel
    Description

    Data Description
    The CADDI dataset is designed to support research in in-class activity recognition using IMU data from low-cost sensors. It provides multimodal data capturing 19 different activities performed by 12 participants in a classroom environment, utilizing both IMU sensors from a Samsung Galaxy Watch 5 and synchronized stereo camera images. This dataset enables the development and validation of activity recognition models using sensor fusion techniques.

    Data Generation Procedures
    The data collection process involved recording both continuous and instantaneous activities that typically occur in a classroom setting. The activities were captured using a custom setup, which included:
    • A Samsung Galaxy Watch 5 to collect accelerometer, gyroscope, and rotation vector data at 100Hz.
    • A ZED stereo camera capturing 1080p images at 25-30 fps.
    • A synchronized computer acting as a data hub, receiving IMU data and storing images in real time.
    • A D-Link DSR-1000AC router for wireless communication between the smartwatch and the computer.
    Participants were instructed to arrange their workspace as they would in a real classroom, including a laptop, notebook, pens, and a backpack. Data collection was performed under realistic conditions, ensuring that activities were captured naturally.

    Temporal and Spatial Scope
    The dataset contains a total of 472.03 minutes of recorded data. The IMU sensors operate at 100Hz, while the stereo camera captures images at 25-30Hz. Data was collected from 12 participants, each performing all 19 activities multiple times. The geographical scope of data collection was Alicante, Spain, under controlled indoor conditions.

    Dataset Components
    The dataset is organized into JSON and PNG files, structured hierarchically:
    • IMU Data: stored in JSON files, containing:
      • Samsung Linear Acceleration Sensor (X, Y, Z values, 100Hz)
      • LSM6DSO Gyroscope (X, Y, Z values, 100Hz)
      • Samsung Rotation Vector (X, Y, Z, W quaternion values, 100Hz)
      • Samsung HR Sensor (heart rate, 1Hz)
      • OPT3007 Light Sensor (ambient light levels, 5Hz)
    • Stereo Camera Images: high-resolution 1920×1080 PNG files from left and right cameras.
    • Synchronization: each IMU data record and image is timestamped for precise alignment.

    Data Structure
    The dataset is divided into continuous and instantaneous activities:
    • Continuous activities (e.g., typing, writing, drawing) were recorded for 210 seconds, with the central 200 seconds retained.
    • Instantaneous activities (e.g., raising a hand, drinking) were repeated 20 times per participant, with data captured only during execution.
    The dataset is structured as:
    /continuous/subject_id/activity_name/
      /camera_a/ → Left camera images
      /camera_b/ → Right camera images
      /sensors/ → JSON files with IMU data

    /instantaneous/subject_id/activity_name/repetition_id/
      /camera_a/
      /camera_b/
      /sensors/

    Data Quality & Missing Data
    The smartwatch buffers 100 readings per second before sending them, ensuring minimal data loss. Synchronization latency between the smartwatch and the computer is negligible. Not all IMU samples have corresponding images due to the different recording rates. Outliers and anomalies were handled by discarding incomplete sequences at the start and end of continuous activities.

    Error Ranges & Limitations
    Sensor data may contain noise due to minor hand movements. The heart rate sensor operates at 1Hz, limiting its temporal resolution. Camera exposure settings were automatically adjusted, which may introduce slight variations in lighting.

    File Formats & Software Compatibility
    IMU data is stored in JSON format, readable with Python's json library. Images are in PNG format, compatible with all standard image processing tools. Recommended libraries for data analysis:
    • Python: numpy, pandas, scikit-learn, tensorflow, pytorch
    • Visualization: matplotlib, seaborn
    • Deep Learning: Keras, PyTorch

    Potential Applications
    • Development of activity recognition models in educational settings.
    • Study of student engagement based on movement patterns.
    • Investigation of sensor fusion techniques combining visual and IMU data.
    This dataset represents a unique contribution to activity recognition research, providing rich multimodal data for developing robust models in real-world educational environments.

    Citation
    If you find this project helpful for your research, please cite our work using the following bibtex entry:

    @misc{marquezcarpintero2025caddiinclassactivitydetection,
      title={CADDI: An in-Class Activity Detection Dataset using IMU data from low-cost sensors},
      author={Luis Marquez-Carpintero and Sergio Suescun-Ferrandiz and Monica Pina-Navarro and Miguel Cazorla and Francisco Gomez-Donoso},
      year={2025},
      eprint={2503.02853},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.02853},
    }
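
    A minimal loading sketch for one IMU file (the path and JSON field names are assumptions for illustration; inspect a file under /sensors/ to confirm the real schema):

    import json
    import pandas as pd

    # Hypothetical path -- adjust to an actual file under the /sensors/ folders
    with open("continuous/subject_01/writing/sensors/linear_acceleration.json") as fh:
        samples = json.load(fh)

    acc = pd.DataFrame(samples)          # one row per 100Hz sample
    print(acc.columns.tolist())          # confirm the real field names first
    print(acc.head())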

  7. Simulated Inventory Management Database and Object-Centric Event Logs for...

    • zenodo.org
    bin, csv +2
    Updated May 26, 2025
    Cite
    Alessandro Berti (2025). Simulated Inventory Management Database and Object-Centric Event Logs for Process Analysis [Dataset]. http://doi.org/10.5281/zenodo.15515788
    Explore at:
    Available download formats: xml, text/x-python, csv, bin
    Dataset updated
    May 26, 2025
    Dataset provided by
    Zenodo
    Authors
    Alessandro Berti
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract: This repository/dataset provides a suite of Python scripts to generate a simulated relational database for inventory management processes and transform this data into object-centric event logs (OCEL) suitable for advanced process mining analysis. The primary goal is to offer a synthetic yet realistic dataset that facilitates research, development, and application of object-centric process mining techniques in the domain of inventory control and supply chain management. The generated event logs capture common inventory operations, track stock level changes, and are enriched with key inventory management parameters (like EOQ, Safety Stock, Reorder Point) and status-based activity labels (e.g., indicating understock or overstock situations).

    Overview: Inventory management is a critical business process characterized by the interaction of various entities such as materials, purchase orders, sales orders, plants, suppliers, and customers. Traditional process mining often struggles to capture these complex interactions. Object-Centric Process Mining (OCPM) offers a more suitable paradigm. This project provides the tools to create and explore such data.

    The workflow involves:

    1. Database Simulation: Generating a SQLite database with tables for materials, sales orders, purchase orders, goods movements, stock levels, etc., populated with simulated data.
    2. Initial OCEL Generation: Extracting data from the SQLite database and structuring it as an object-centric event log (in CSV format). This log includes activities like "Create Purchase Order Item", "Goods Receipt", "Create Sales Order Item", "Goods Issue", and tracks running stock levels for materials.
    3. OCEL Post-processing and Enrichment:
      • Calculating standard inventory management metrics such as Economic Order Quantity (EOQ), Safety Stock (SS), and Reorder Point (ROP) for each material-plant combination based on the simulated historical data.
      • Merging these metrics into the event log.
      • Enhancing activity labels to include the current stock status (e.g., "Understock", "Overstock", "Normal") relative to calculated SS and Overstock (OS) levels (where OS = SS + EOQ).
      • Generating new, distinct events to explicitly mark the moments when stock statuses change (e.g., "START UNDERSTOCK", "ST CHANGE NORMAL to OVERSTOCK", "END NORMAL").
    4. Format Conversion: Converting the CSV-based OCELs into the standard OCEL XML/OCEL2 format using the pm4py library.

    Contents:

    The repository contains the following Python scripts:

    • 01_generate_simulation.py:

      • Creates a SQLite database named inventory_management.db.
      • Defines and populates tables including: Materials, SalesOrderDocuments, SalesOrderItems, PurchaseOrderDocuments, PurchaseOrderItems, PurchaseRequisitions, GoodsReceiptsAndIssues, MaterialStocks, MaterialDocuments, SalesDocumentFlows, and OrderSuggestions.
      • Simulates data for a configurable number of materials, customers, sales, purchases, etc., with randomized dates and quantities.
    • 02_database_to_ocel_csv.py:

      • Connects to the inventory_management.db.
      • Executes a SQL query to extract relevant events and their associated objects for inventory processes.
      • Constructs an initial object-centric event log, saved as ocel_inventory_management.csv.
      • Identified object types include: MAT (Material), PLA (Plant), PO_ITEM (Purchase Order Item), SO_ITEM (Sales Order Item), CUSTOMER, SUPPLIER.
      • Calculates "Stock Before" and "Stock After" for each event affecting material stock.
      • Standardizes column names to OCEL conventions (e.g., ocel:activity, ocel:timestamp, ocel:type:).
    • 03_ocel_csv_to_ocel.py:

      • Reads ocel_inventory_management.csv.
      • Uses pm4py to convert the CSV event log into the standard OCEL XML format (ocel_inventory_management.xml).
    • 04_postprocess_activities.py:

      • Reads data from inventory_management.db to calculate inventory parameters:
        • Annual Demand (Dm)
        • Average Daily Demand (dm)
        • Standard Deviation of Daily Demand (σm)
        • Average Lead Time (lm)
        • Economic Order Quantity (EOQ): √((2⋅Dm⋅S)/H) (where S is the fixed order cost and H the holding cost)
        • Safety Stock (SS): z⋅σm⋅√lm (where z is the z-score for the desired service level)
        • Reorder Point (ROP): (dm⋅lm)+SS
        (a short numeric sketch of these formulas follows the Contents list)
      • Merges these calculated parameters with ocel_inventory_management.csv.
      • Computes an Overstock level (OS) as SS+EOQ.
      • Derives a "Current Status" (Understock, Overstock, Normal) for each event based on "Stock After" relative to SS and OS.
      • Appends this status to the ocel:activity label (e.g., "Goods Issue (Understock)").
      • Generates new events for status changes (e.g., "START NORMAL", "ST CHANGE UNDERSTOCK to NORMAL", "END OVERSTOCK") with adjusted timestamps to precisely mark these transitions.
      • Creates a new object type MAT_PLA (Material-Plant combination) for easier status tracking.
      • Saves the enriched and transformed log as post_ocel_inventory_management.csv.
    • 05_ocel_csv_to_ocel.py:

      • Reads the post-processed post_ocel_inventory_management.csv.
      • Uses pm4py to convert this enriched CSV event log into the standard OCEL XML format (post_ocel_inventory_management.xml).
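
    As the numeric sketch promised above, here is a quick illustration of the inventory parameters computed by 04_postprocess_activities.py; all input values are made up for illustration and are not taken from the simulated database:

    import math

    D_m = 1200.0            # annual demand (made-up value)
    d_m = D_m / 365.0       # average daily demand
    sigma_m = 4.0           # std deviation of daily demand (made-up)
    l_m = 7.0               # average lead time in days (made-up)
    S, H = 50.0, 2.0        # fixed order cost and holding cost (made-up)
    z = 1.65                # z-score for the desired service level

    EOQ = math.sqrt(2 * D_m * S / H)
    SS = z * sigma_m * math.sqrt(l_m)
    ROP = d_m * l_m + SS
    OS = SS + EOQ           # overstock level used for the status labels

    print(f"EOQ={EOQ:.1f}, SS={SS:.1f}, ROP={ROP:.1f}, OS={OS:.1f}")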

    Generated Dataset Files (if included, or can be generated using the scripts):

    • inventory_management.db: The SQLite database containing the simulated raw data.
    • ocel_inventory_management.csv: The initial OCEL in CSV format.
    • ocel_inventory_management.xml: The initial OCEL in standard OCEL XML format.
    • post_ocel_inventory_management.csv: The post-processed and enriched OCEL in CSV format.
    • post_ocel_inventory_management.xml: The post-processed and enriched OCEL in standard OCEL XML format.

    How to Use:

    1. Ensure you have Python installed along with the following libraries: sqlite3 (standard library), pandas, numpy, pm4py.
    2. Run the scripts sequentially in a terminal or command prompt:
      • python 01_generate_simulation.py (generates inventory_management.db)
      • python 02_database_to_ocel_csv.py (generates ocel_inventory_management.csv from the database)
      • python 03_ocel_csv_to_ocel.py (generates ocel_inventory_management.xml)
      • python 04_postprocess_activities.py (generates post_ocel_inventory_management.csv using the database and the initial CSV OCEL)
      • python 05_ocel_csv_to_ocel.py (generates post_ocel_inventory_management.xml)

    Potential Applications and Research: This dataset and the accompanying scripts can be used for:

    • Applying and evaluating object-centric process mining algorithms on inventory management data.
    • Analyzing inventory dynamics, such as the causes and effects of understocking or overstocking.
    • Discovering and conformance checking process models that involve multiple interacting objects (materials, orders, plants).
    • Investigating the impact of different inventory control parameters (EOQ, SS, ROP) on process execution.
    • Developing educational materials for teaching OCPM in a supply chain context.
    • Serving as a benchmark for new OCEL-based analysis techniques.

    Keywords: Object-Centric Event Log, OCEL, Process Mining, Inventory Management, Supply Chain, Simulation, Synthetic Data, SQLite, Python, pandas, pm4py, Economic Order Quantity (EOQ), Safety Stock (SS), Reorder Point (ROP), Stock Status Analysis.

  8. Replication Kit: "Are Unit and Integration Test Definitions Still Valid for...

    • zenodo.org
    • explore.openaire.eu
    application/gzip, bin
    Updated Jan 24, 2020
    Cite
    Fabian Trautsch; Steffen Herbold; Jens Grabowski (2020). Replication Kit: "Are Unit and Integration Test Definitions Still Valid for Modern Java Projects? An Empirical Study on Open-Source Projects" [Dataset]. http://doi.org/10.5281/zenodo.1415334
    Explore at:
    Available download formats: application/gzip, bin
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Fabian Trautsch; Steffen Herbold; Jens Grabowski
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Replication Kit for the Paper "Are Unit and Integration Test Definitions Still Valid for Modern Java Projects? An Empirical Study on Open-Source Projects"
    This additional material shall provide other researchers with the ability to replicate our results. Furthermore, we want to facilitate further insights that might be generated based on our data sets.

    Structure
    The structure of the replication kit is as follows:

    • additional_visualizations: contains additional visualizations (Venn diagrams) for each project and each of the data sets that we used
    • data_analysis: contains two python scripts that we used to analyze our raw data (one for each research question)
    • data_collection_tools: contains all source code used for the data collection, including the used versions of the COMFORT framework, the BugFixClassifier, and the used tools of the SmartSHARK environment;
    • mongodb_no_authors: Archived dump of our MongoDB that we created by executing our data collection tools. The "comfort" database can be restored via the mongorestore command.


    Additional Visualizations
    We provide two additional visualizations for each project:

    1) For each of the data sets there exists one visualization per project that shows four Venn diagrams, one for each of the different defect types. These Venn diagrams show the number of defects that were detected by either unit or integration tests (or both).

    2) Furthermore, we added boxplots for each of the data sets (i.e., ALL and DISJ) showing the scores of unit and integration tests for each defect type.


    Analysis scripts
    Requirements:
    - python3.5
    - tabulate
    - scipy
    - seaborn
    - mongoengine
    - pycoshark
    - pandas
    - matplotlib

    Both python files contain all code for the statistical analysis we performed.

    Data Collection Tools
    We provide all data collection tools that we have implemented and used throughout our paper. Overall it contains six different projects and one python script:

    • BugFixClassifier: Used to classify our defects.
    • comfort-core: Core of the comfort framework. Used to classify our tests into unit and integration tests and calculate different metrics for these tests.
    • comfort-jacoco-listner: Used to intercept the coverage collection process as we were executing the tests of our case study projects.
    • issueSHARK: Used to collect data from the ITSs of the projects.
    • pycoSHARK: Library that contains the models for the ORM mapper used inside the SmartSHARK environment.
    • vcsSHARK: Used to collect data from the VCSs of the projects.

  9. Iris Species Dataset and Database

    • kaggle.com
    Updated May 15, 2025
    Cite
    Ghanshyam Saini (2025). Iris Species Dataset and Database [Dataset]. https://www.kaggle.com/datasets/ghnshymsaini/iris-species-dataset-and-database
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 15, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ghanshyam Saini
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Iris Flower Dataset

    This is a classic and very widely used dataset in machine learning and statistics, often serving as a first dataset for classification problems. Introduced by the British statistician and biologist Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems," it is a foundational resource for learning classification algorithms.

    Overview:

    The dataset contains measurements for 150 samples of iris flowers. Each sample belongs to one of three species of iris:

    • Iris setosa
    • Iris versicolor
    • Iris virginica

    For each flower, four features were measured:

    • Sepal length (in cm)
    • Sepal width (in cm)
    • Petal length (in cm)
    • Petal width (in cm)

    The goal is typically to build a model that can classify iris flowers into their correct species based on these four features.

    File Structure:

    The dataset is usually provided as a single CSV (Comma Separated Values) file, often named iris.csv or similar. This file typically contains the following columns:

    1. sepal_length (cm): Numerical. The length of the sepal of the iris flower.
    2. sepal_width (cm): Numerical. The width of the sepal of the iris flower.
    3. petal_length (cm): Numerical. The length of the petal of the iris flower.
    4. petal_width (cm): Numerical. The width of the petal of the iris flower.
    5. species: Categorical. The species of the iris flower (either 'setosa', 'versicolor', or 'virginica'). This is the target variable for classification.

    Content of the Data:

    The dataset contains an equal number of samples (50) for each of the three iris species. The measurements of the sepal and petal dimensions vary between the species, allowing for their differentiation using machine learning models.

    How to Use This Dataset:

    1. Download the iris.csv file.
    2. Load the data using libraries like Pandas in Python (a short sketch follows this list).
    3. Explore the data through visualization and statistical analysis to understand the relationships between the features and the different species.
    4. Build classification models (e.g., Logistic Regression, Support Vector Machines, Decision Trees, K-Nearest Neighbors) using the sepal and petal measurements as features and the 'species' column as the target variable.
    5. Evaluate the performance of your model using appropriate metrics (e.g., accuracy, precision, recall, F1-score).
    6. The dataset is small and well-behaved, making it excellent for learning and experimenting with various classification techniques.
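
    A compact, illustrative sketch of steps 2-5 (the filename iris.csv and the exact column names are assumptions based on the file structure described above):

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score

    df = pd.read_csv("iris.csv")
    X = df[["sepal_length", "sepal_width", "petal_length", "petal_width"]]
    y = df["species"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42, stratify=y)

    # K-Nearest Neighbors is one of the classifiers suggested in step 4
    model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
    print("accuracy:", accuracy_score(y_test, model.predict(X_test)))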

    Citation:

    When using the Iris dataset, it is common to cite Ronald Fisher's original work:

    Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2), 179-188.

    Data Contribution:

    Thank you for providing this classic and fundamental dataset to the Kaggle community. The Iris dataset remains an invaluable resource for both beginners learning the basics of classification and experienced practitioners testing new algorithms. Its simplicity and clear class separation make it an ideal starting point for many data science projects.

    If you find this dataset description helpful and the dataset itself useful for your learning or projects, please consider giving it an upvote after downloading. Your appreciation is valuable!

  10. gallica_literary_fictions

    • huggingface.co
    Updated Oct 18, 2022
    + more versions
    Cite
    BigLAM: BigScience Libraries, Archives and Museums (2022). gallica_literary_fictions [Dataset]. http://doi.org/10.5281/zenodo.4660197
    Explore at:
    Dataset updated
    Oct 18, 2022
    Dataset authored and provided by
    BigLAM: BigScience Libraries, Archives and Museums
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    The collection "Fiction littéraire de Gallica" includes 19,240 public domain documents from the digital platform of the French National Library that were originally classified as novels or, more broadly, as literary fiction in prose. It consists of 372 tables of data in tsv format, one for each year of publication from 1600 to 1996 (all the missing years are in the 17th and 20th centuries). Each table is structured at the page level of each novel (5,723,986 pages in all). It contains the complete text with the addition of some metadata. It can be opened in Excel or, preferably, with the new data analysis environments in R or Python (tidyverse, pandas
).
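
    For example, a single year's table can be loaded in Python roughly as follows (the per-year filename is illustrative):

    import pandas as pd

    pages = pd.read_csv("1830.tsv", sep="\t")   # hypothetical per-year file
    print(pages.shape)                          # one row per digitized page
    print(pages.columns.tolist())               # full text column plus metadata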

    This corpus can be used for large-scale quantitative analyses in computational humanities. The OCR text is presented in a raw format without any correction or enrichment in order to be directly processed for text mining purposes.

    The extraction is based on a historical categorization of the novels: the Y2 or Ybis classification. This classification, invented in 1730, is the only one that has been continuously applied to the BNF collections now available in the public domain (mainly before 1950). Consequently, the dataset is based on a definition of "novel" that is generally contemporary of the publication.

    A French data paper (in PDF and HTML) presents the construction process of the Y2 category and describes the structuring of the corpus. It also gives several examples of possible uses for computational humanities projects.

  11. Data from: Compromised through Compression: Python source code for DLMS...

    • phys-techsciences.datastations.nl
    text/markdown, txt +2
    Updated Dec 14, 2021
    Cite
    P.J.M. van Aubel; E. Poll (2021). Compromised through Compression: Python source code for DLMS compression privacy analysis & graphing [Dataset]. http://doi.org/10.17026/DANS-2BY-BNA3
    Explore at:
    Available download formats: xml (5795), zip (20542), text/markdown (792), txt (626), zip (12920)
    Dataset updated
    Dec 14, 2021
    Dataset provided by
    DANS Data Station Physical and Technical Sciences
    Authors
    P.J.M. van Aubel; E. Poll
    License

    http://opensource.org/licenses/BSD-2-Clause

    Description

    Python code (for Python 3.9 & Pandas 1.3.2) to generate the results used in "Compromised through Compression: Privacy Implications of Smart Meter Traffic Analysis".

    Smart metering comes with risks to privacy. One concern is the possibility of an attacker seeing the traffic that reports the energy use of a household and deriving private information from that. Encryption helps to mask the actual energy measurements, but is not sufficient to cover all risks. One aspect which has so far gone unexplored, and where encryption does not help, is traffic analysis, i.e. whether the length of messages communicating energy measurements can leak privacy-sensitive information to an observer. In this paper we examine whether using encodings or compression for smart metering data could potentially leak information about household energy use. Our analysis is based on the real-world energy use data of ±80 Dutch households.

    We find that traffic analysis could reveal information about the energy use of individual households if compression is used. As a result, when messages are sent daily, an attacker performing traffic analysis would be able to determine when all the members of a household are away or not using electricity for an entire day. We demonstrate this issue by recognizing when households from our dataset were on holiday. If messages are sent more often, more granular living patterns could likely be determined. We propose a method of encoding the data that is nearly as effective as compression at reducing message size, but does not leak the information that compression leaks. By not requiring compression to achieve the best possible data savings, the risk of traffic analysis is eliminated.

    This code operates on the relative energy measurements from the "Zonnedael dataset" from Liander N.V. This dataset needs to be obtained separately; see the instructions accompanying the code. The code transforms the dataset into absolute measurements such as would be taken by a smart meter. It then generates batch messages covering 24-hour periods starting at midnight, similar to how the Dutch infrastructure batches daily meter readings, in the different possible encodings with and without compression applied. For an explanation of the different encodings, see the paper. The code will then provide statistics on the efficiency of encoding and compression for the entire dataset, and attempt to find the periods of multi-day absences for each household. It will also generate the graphs in the style used in the paper and presentation.

  12. Crimp Force Curve Dataset

    • zenodo.org
    • dataverse.harvard.edu
    bin
    Updated May 15, 2025
    Cite
    Bernd Hofmann; Patrick Bründl; Jörg Franke (2025). Crimp Force Curve Dataset [Dataset]. http://doi.org/10.7910/dvn/wbdkn6
    Explore at:
    Available download formats: bin
    Dataset updated
    May 15, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Bernd Hofmann; Patrick Bründl; Jörg Franke
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The "Crimp Force Curve Dataset" is a comprehensive collection of univariate time series data representing crimp force curves recorded during the manufacturing process of crimp connections. This dataset has been designed to support a variety of applications, including anomaly detection, fault diagnosis, and research in data-driven quality assurance.

    A salient feature of this dataset is the presence of high-quality labels. Each crimp force curve is annotated both by a state-of-the-art crimp force monitoring system - capable of binary anomaly detection - and by domain experts who manually classified the curves into detailed quality classes. The expert annotations provide a valuable ground truth for training and benchmarking machine learning models beyond anomaly detection.

    The dataset is particularly well-suited for tasks involving time series analysis, such as the training and evaluation of machine learning algorithms for quality control and fault detection. It provides a substantial foundation for the development of generalisable, yet domain-specific (crimping), data-driven quality control systems.

    The data is stored in a Python pickle file crimp_force_curves.pkl, which is a binary format used to serialize and deserialize Python objects. It can be conveniently loaded into a pandas DataFrame for exploration and analysis using the following command:

    import pandas as pd

    # Load the pickled pandas DataFrame of crimp force curves
    df = pd.read_pickle("crimp_force_curves.pkl")

    This dataset is a valuable resource for researchers and practitioners in manufacturing engineering, computer science, and data science who are working at the intersection of quality control in manufacturing and machine learning.

  13. TrafficDator Madrid

    • data.niaid.nih.gov
    Updated Apr 6, 2024
    Cite
    Ilarri, Sergio (2024). TrafficDator Madrid [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10435153
    Explore at:
    Dataset updated
    Apr 6, 2024
    Dataset provided by
    Gómez, Iván
    Ilarri, Sergio
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Area covered
    Madrid
    Description

    Data Origin: This dataset was generated using information from the Community of Madrid, including traffic data collected by multiple sensors located throughout the city, as well as work calendar and meteorological data, all provided by the Community.

    Data Type: The data consists of traffic measurements in Madrid from June 1, 2022, to September 30, 2023. Each record includes information on the date, time, location (longitude and latitude), traffic intensity, and associated road and weather conditions (e.g., whether it is a working day, holiday, information on wind, temperature, precipitation, etc.).

    Technical Details:

    Data Preprocessing: We utilized advanced techniques for cleaning and normalizing traffic data collected from sensors across Madrid. This included handling outliers and missing values to ensure data quality.

    Geospatial Analysis: We used GeoPandas and OSMnx to map traffic data points onto Madrid's road network. This process involved processing spatial attributes such as street lanes and speed limits to add context to the traffic data.

    Meteorological Data Integration: We incorporated Madrid's weather data, including temperature, precipitation, and wind speed. Understanding the impact of weather conditions on traffic patterns was crucial in this step.

    Traffic Data Clustering: We implemented K-Means clustering to identify patterns in traffic data. This approach facilitated the selection of representative sensors from each cluster, focusing on the most relevant data points.
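
    A hedged sketch of that clustering step with scikit-learn follows; the aggregation into per-sensor hourly profiles, the filename, and the number of clusters are assumptions, while the column names (id, date, intensity) follow the data structure listed below:

    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    traffic = pd.read_csv("trafficdator_madrid.csv", parse_dates=["date"])  # assumed filename

    # One row per sensor: mean intensity for each hour of the day
    profiles = (traffic.assign(hour=traffic["date"].dt.hour)
                       .pivot_table(index="id", columns="hour",
                                    values="intensity", aggfunc="mean")
                       .fillna(0.0))

    X = StandardScaler().fit_transform(profiles)
    clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
    print(pd.Series(clusters, index=profiles.index).value_counts())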

    Calendar Integration: We combined the traffic data with the work calendar to distinguish between different types of days. This provided insights into traffic variations on working days and holidays.

    Comprehensive Analysis Approach: The analysis was conducted using Python libraries such as Pandas, NumPy, scikit-learn, and Shapely. It covered data from the years 2022 and 2023, focusing on the unique characteristics of the Madrid traffic dataset.

    Data Structure: Each row of the dataset represents an individual measurement from a traffic sensor, including:

    id: Unique sensor identifier.

    date: Date and time of the measurement.

    longitude and latitude: Geographical coordinates of the sensor.

    day type: Information about the day being a working day, holiday, or festive Sunday.

    intensity: Measured traffic intensity.

    Additional data like wind, temperature, precipitation, etc.

    Purpose of the Dataset: This dataset is useful for traffic analysis, urban mobility studies, infrastructure planning, and research related to traffic behavior under different environmental and temporal conditions.

    Acknowledgment and Funding:

    This dataset was obtained as part of the R&D project PID2020-113037RB-I00, funded by MCIN/AEI/10.13039/501100011033.

    In addition to the NEAT-AMBIENCE project, support from the Department of Science, University, and Knowledge Society of the Government of Aragon (Government of Aragon: group reference T64_23R, COSMOS research group) is also acknowledged.

    For academic and research purposes, please reference this dataset using its DOI for proper attribution and tracking.

  14. Metaverse Gait Authentication Dataset (MGAD)

    • figshare.com
    csv
    Updated Feb 11, 2025
    Cite
    sandeep ravikanti (2025). Metaverse Gait Authentication Dataset (MGAD) [Dataset]. http://doi.org/10.6084/m9.figshare.28387664.v1
    Explore at:
    Available download formats: csv
    Dataset updated
    Feb 11, 2025
    Dataset provided by
    figshare
    Authors
    sandeep ravikanti
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    1. Dataset Overview
    The Metaverse Gait Authentication Dataset (MGAD) is a large-scale dataset for gait-based biometric authentication in virtual environments. It consists of gait data from 5,000 simulated users, generated using Unity 3D and processed using OpenPose and MediaPipe. This dataset is ideal for researchers working on biometric authentication, gait analysis, and AI-driven identity verification systems.

    2. Data Structure & Format
    • File Format: CSV
    • Number of Samples: 5,000 users
    • Number of Features: 16 gait-based features
    • Columns: Each row represents a user with corresponding gait feature values
    • Size: Approximately (mention size in MB/GB after upload)

    3. Feature Descriptions
    The dataset includes 16 extracted gait features:
    • Stride Length (m): Average distance covered in one gait cycle.
    • Step Frequency (steps/min): Number of steps taken per minute.
    • Stance Phase Duration (s): Duration of the stance phase in a gait cycle.
    • Swing Phase Duration (s): Duration of the swing phase in a gait cycle.
    • Double Support Phase Duration (s): Time both feet are in contact with the ground.
    • Step Length (m): Distance between consecutive foot placements.
    • Cadence Variability (%): Variability in step rate.
    • Hip Joint Angle (°): Maximum angle variation in the hip joint.
    • Knee Joint Angle (°): Maximum flexion-extension knee angle.
    • Ankle Joint Angle (°): Angle variation at the ankle joint.
    • Avg. Vertical GRF (N): Average vertical ground reaction force.
    • Avg. Anterior-Posterior GRF (N): Ground reaction force in the forward-backward direction.
    • Avg. Medial-Lateral GRF (N): Ground reaction force in the side-to-side direction.
    • Avg. COP Excursion (mm): Center of pressure movement during the stance phase.
    • Foot Clearance during Swing Phase (mm): Minimum height of the foot during the swing phase.
    • Gait Symmetry Index (%): Measure of symmetry between left and right gait cycles.

    4. How to Use the Dataset
    • Load the dataset in Python using Pandas (a short sketch follows this description).
    • Use the features for machine learning models in biometric authentication.
    • Apply preprocessing techniques like normalization and feature scaling.
    • Train and evaluate deep learning or ensemble models for gait recognition.

    5. Citation & License
    If you use this dataset, please cite it as follows:
    Sandeep Ravikanti, "Metaverse Gait Authentication Dataset (MGAD)," IEEE DataPort, 2025. DOI: https://dx.doi.org/10.21227/rvh5-8842

    6. Contact Information
    For inquiries or collaborations, please contact: bitsrmit2023@gmail.com
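
    A minimal sketch of the loading and scaling steps from section 4 above (the CSV filename is an assumption):

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    df = pd.read_csv("MGAD.csv")                   # hypothetical filename
    features = df.select_dtypes("number")          # the 16 numeric gait features
    X = StandardScaler().fit_transform(features)   # normalization / feature scaling
    print(X.shape)                                 # expected: (5000, 16)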

  15. Data related to article "Advanced quantification of receptor–ligand...

    • b2find.eudat.eu
    • researchdata.tuwien.ac.at
    Updated Aug 8, 2024
    Cite
    (2024). Data related to article "Advanced quantification of receptor–ligand interaction lifetimes via single-molecule FRET microscopy" [Dataset]. https://b2find.eudat.eu/dataset/21de8f86-5f7c-50f8-9392-1027bc06532e
    Explore at:
    Dataset updated
    Aug 8, 2024
    Description

    Simulated data files
    Simulated single-molecule tracks for characterizing the algorithm described in the article. char_short_sim.h5, char_n_tracks_sim_1.0.h5, and char_long_sim.h5 were used to investigate the effect of changing recording intervals; char_n_tracks_sim_0.5.h5, char_n_tracks_sim_1.0.h5, and char_n_tracks_sim_2.0.h5 were used to examine the impact of the dataset size. The h5 files contain tables created using the DataFrame.to_hdf method from the pandas Python package. Each table is identified by a key encoding the simulated recording interval and an integer identifying a particular simulation execution.

    Raw data files
    FRET microscopy image sequences of TCR–pMHC interactions of 5c.c7 and AND TCR-transgenic T cells as described in the article. The zip archives' POPC subfolders contain the recorded image sequences, with the recording delay (in ms) and number of donor excitation frames indicated in the file names. The beads subfolders contain images of fiducial markers for image registration.

    Analysis files
    Save files generated by the smfret-bondtime analysis software described in the article for the 5c.c7 and AND T cell data. Note that these files were generated using a software version predating the version published as 1.0.0. They can nonetheless be loaded with the newer version. In order to load the experimental data:
    1. install the smfret-bondtime software
    2. extract the raw data
    3. extract the analysis files; the current folder should now contain 5cc7 and/or AND subfolders as well as 5cc7.yaml, 5cc7.h5, AND.yaml, and AND.h5 files.
    If the raw data is extracted to a different place, open the respective YAML files using a text editor and adjust the data_dir entry accordingly.
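
    Because the table keys encode the recording interval and a run id, a safe way to explore one of the simulated-track files is to list the available keys before reading (assumes pandas with PyTables installed):

    import pandas as pd

    with pd.HDFStore("char_long_sim.h5", mode="r") as store:
        keys = store.keys()                       # one key per interval/run combination
    print(keys[:5])

    tracks = pd.read_hdf("char_long_sim.h5", key=keys[0])   # one simulated-track table
    print(tracks.head())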

  16. Python Web Scraping and Data Analysis: Gorilla Specimens from Chicago’s...

    • dataverse.harvard.edu
    Updated Mar 24, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Woodger Faugas (2023). Python Web Scraping and Data Analysis: Gorilla Specimens from Chicago’s Field Museum [Dataset]. http://doi.org/10.7910/DVN/ELAZCU
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 24, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Woodger Faugas
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Chicago
    Description

    An in-depth analysis of millions of data entries from Chicago’s Field Museum was carried out, furnishing insights on 25 Gorilla specimens and spanning the realms of biogeography, zoology, primatology, and biological anthropology. Taxonomically, all specimens examined belong to the kingdom Animalia, phylum Chordata, class Mammalia, order Primates, and family Hominidae. They can be further categorized under the genus Gorilla and species gorilla, with most belonging to the subspecies Gorilla gorilla gorilla and some categorized simply as Gorilla gorilla. Biologically, the sex distribution comprises 16 specimens (64% of the total) identified as male, 5 (20%) identified as female, and 4 (16%) with sex unassigned. Collectors, all of whom are identified by name, culled most of these specimens from unidentified zoos, with a few specimens sourced from Ward’s Natural Science Establishment, a well-known supplier of natural science materials to North American museums. The specimens were collected between 1975 and 1993, with some entries lacking this information. Multiple preparations have been made from the specimens, including skulls, skeletons, skins, and endocrine organs that were mounted or alcohol-preserved. Despite the existence of these preparations, tissue samples and coordinates are largely unavailable for the 25 specimens on record, limiting further research or analysis; tissue is available for only a single specimen, identified by IRN 2661980. Only one specimen, identifiable as IRN 2514759, has a specified geographical location, indicated as “Africa, West Africa, West Indies,” while the rest have “Unknown/None” or “Zoo” locations, signaling that no entry is available. The Python code used to extract data from the Field Museum’s zoological collections records and online database is contained in the attached .py file. This code constitutes a web scraping algorithm that retrieves data from the above-mentioned website, processes it, and stores it in a structured format. It first imports the necessary libraries: requests for making HTTP requests, pandas for handling data, time for introducing delays, lxml for parsing HTML, and BeautifulSoup for web scraping. The algorithm then defines the main URL for searching for Gorilla gorilla specimens and sets up headers for the HTTP requests (e.g., User-Agent and other headers that mimic a browser request). Next, an HTTP GET request to the main URL is made and the response text is obtained, which is then parsed using BeautifulSoup and lxml. Information extracted from the search results page (e.g., Internal Record Number, Catalog Subset, Higher Classification, Catalog Number, Taxonomic Name, DwC Locality, Collector/field, Collection No., Coordinates Available, Tissue Available, and Sex) is stored in a list called basic_data. The algorithm subsequently iterates through each record in basic_data and accesses its detailed information page by making another HTTP GET request with the extracted URL.
    For each detailed information page, the code extracts additional data (e.g., FM Catalog, Scientific Name, Phylum, Class, Order, Family, Genus, Species, Field Number, Collector, Collection No., Geography, Date Collected, Preparations, Tissue Available, Co-ordinates Available, and Sex) and stores it in a list called main_data. Finally, the algorithm processes the main_data list and converts it into a structured format, i.e., a CSV file.
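    The following condensed sketch mirrors that flow; the search URL, CSS selectors, and field names are illustrative placeholders, since the actual ones are defined only in the attached .py file:

    # Condensed sketch of the scraping flow described above. The URL, selectors,
    # and field names are placeholders; the attached .py file defines the real ones.
    import time

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    SEARCH_URL = "https://example.org/fieldmuseum/search?taxon=Gorilla+gorilla"  # placeholder
    HEADERS = {"User-Agent": "Mozilla/5.0"}          # mimic a browser request

    response = requests.get(SEARCH_URL, headers=HEADERS, timeout=30)
    soup = BeautifulSoup(response.text, "lxml")

    basic_data = []
    for row in soup.select("div.search-result"):     # placeholder selector
        basic_data.append({
            "Catalog Number": row.select_one(".catalog").get_text(strip=True),
            "url": row.select_one("a")["href"],
        })

    main_data = []
    for record in basic_data:
        detail = requests.get(record["url"], headers=HEADERS, timeout=30)
        detail_soup = BeautifulSoup(detail.text, "lxml")
        record["Sex"] = detail_soup.select_one(".sex").get_text(strip=True)  # placeholder field
        main_data.append(record)
        time.sleep(1)                                # polite delay between requests

    pd.DataFrame(main_data).to_csv("gorilla_specimens.csv", index=False)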

  17. Data and codes from: Comparison of Solar Imaging Feature Extraction Methods...

    • entrepot.recherche.data.gouv.fr
    7z, application/x-h5 +2
    Updated Jun 4, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Maria Tahtouh; Maria Tahtouh; Guillerme Bernoux; Guillerme Bernoux; Antoine Brunet; Antoine Brunet; Denis Standarovski; Denis Standarovski; Gautier Nguyen; Gautier Nguyen; Angélica Sicard; Angélica Sicard (2025). Data and codes from: Comparison of Solar Imaging Feature Extraction Methods in the Context of Space Weather Prediction with Deep Learning-Based Models [Dataset]. http://doi.org/10.57745/DZT7DS
    Explore at:
    application/x-h5(2599174), 7z(4407015), bin(40653687), 7z(2335796), text/x-python(29618), text/x-python(2593), text/x-python(4013), text/x-python(11669), bin(42832463), text/x-python(2710), application/x-h5(5006388127), bin(1784082487), text/x-python(18773)Available download formats
    Dataset updated
    Jun 4, 2025
    Dataset provided by
    Recherche Data Gouv
    Authors
    Maria Tahtouh; Maria Tahtouh; Guillerme Bernoux; Guillerme Bernoux; Antoine Brunet; Antoine Brunet; Denis Standarovski; Denis Standarovski; Gautier Nguyen; Gautier Nguyen; Angélica Sicard; Angélica Sicard
    License

    https://spdx.org/licenses/etalab-2.0.htmlhttps://spdx.org/licenses/etalab-2.0.html

    Time period covered
    2010 - 2020
    Description

    This dataset contains replication data for the paper "Comparison of Solar Imaging Feature Extraction Methods in the Context of Space Weather Prediction with Deep Learning-Based Models". It includes files stored in HDF5 (Hierarchical Data Format) using HDFStore. One file, solar_extracted_features_v01_2010-2020.h5, contains the features extracted with the 6 different techniques for the 19.3 nm wavelength; the second, serenade_predictions_v01.h5, contains the SERENADE outputs. Both files contain several datasets labeled with ‘keys’, which correspond to the extraction methods. The key names are:
    gn_1024: the GoogLeNet extractor with 1024 components.
    pca_1024: Principal Component Analysis leaving 1024 components.
    ae_1024: the AutoEncoder with a latent space of 1024.
    gn_256 (only in solar_extracted_features_v01_2010-2020.h5): the GoogLeNet extractor with 256 components.
    pca_256: Principal Component Analysis leaving 256 components.
    ae_256: the AutoEncoder with a latent space of 256.
    vae_256 (only in solar_extracted_features_v01_2010-2020.h5): the Variational AutoEncoder with a latent space of 256.
    vae_256_old (only in serenade_predictions_v01.h5): the SERENADE output predictions using the VAE-extracted features with the hyperparameters optimized for GoogLeNet.
    vae_256_new (only in serenade_predictions_v01.h5): the SERENADE output predictions using the VAE-extracted features with the alternative architecture.
    All the above-mentioned models are explained and detailed in the paper. The files can be read with the pandas package for Python as follows:
    import pandas as pd
    df = pd.read_hdf('file_name.h5', key='model_name')
    replacing file_name with either solar_extracted_features_v01_2010-2020.h5 or serenade_predictions_v01.h5 and model_name with one of the keys listed above. The extracted-features file yields a pandas DataFrame indexed by datetime with either 1024 or 256 feature columns, plus an additional column indicating to which subset (train, validation, or test) each row belongs. The SERENADE-outputs file yields a DataFrame indexed by datetime with 4 columns:
    Observations: the true daily maximum of the Kp index.
    Predictions: the predicted mean of the daily maximum of the Kp index.
    Standard Deviation: the standard deviation of the prediction, as the predictions are probabilistic.
    Model: the feature extractor model whose inputs were used to generate the predictions.
    The AE and VAE feature-extractor class code is provided in AEs_class.py and VAE_class.py, with their weights in the best_AE_1024.ckpt, best_AE_256.ckpt, and best_VAE.ckpt checkpoints, respectively. The figures in the manuscript can be reproduced using the codes named after the corresponding figures. The files 6_mins_predictions and seed_variation contain the SERENADE predictions needed to reproduce figures 7, 8, 9 and 10.
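    As a small extension of the reading instructions above, the sketch below loads one key from each file and compares the SERENADE predictions to the observations; the RMSE computation is illustrative and not a metric prescribed by the paper:

    # Minimal sketch: read one key from each file and compare SERENADE
    # predictions to the observed daily maximum Kp. RMSE is illustrative only.
    import numpy as np
    import pandas as pd

    features = pd.read_hdf("solar_extracted_features_v01_2010-2020.h5", key="gn_1024")
    preds = pd.read_hdf("serenade_predictions_v01.h5", key="gn_1024")

    rmse = np.sqrt(((preds["Predictions"] - preds["Observations"]) ** 2).mean())
    print("Feature table shape:", features.shape)
    print(f"GoogLeNet-1024 RMSE on daily max Kp: {rmse:.2f}")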

  18. Analysis of references in the IPCC AR6 WG2 Report of 2022

    • explore.openaire.eu
    • zenodo.org
    Updated Mar 4, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Cameron Neylon; Bianca Kramer (2022). Analysis of references in the IPCC AR6 WG2 Report of 2022 [Dataset]. http://doi.org/10.5281/zenodo.6327207
    Explore at:
    Dataset updated
    Mar 4, 2022
    Authors
    Cameron Neylon; Bianca Kramer
    Description

    This repository contains data on 17,420 DOIs cited in the IPCC Working Group 2 contribution to the Sixth Assessment Report, and the code to link them to the dataset built at the Curtin Open Knowledge Initiative (COKI). References were extracted from the report's PDFs (downloaded 2022-03-01) via Scholarcy and exported as RIS and BibTeX files. DOI strings were identified in the RIS files by pattern matching and saved as a CSV file. The list of DOIs for each chapter and cross-chapter paper was processed using a custom Python script to generate a pandas DataFrame, which was saved as a CSV file and uploaded to Google BigQuery. We used the main object table of the Academic Observatory, which combines information from Crossref, Unpaywall, Microsoft Academic, Open Citations, the Research Organization Registry and Geonames, to enrich the DOIs with bibliographic information, affiliations, and open access status. A custom query was used to join and format the data, and the resulting table was visualised in a Google Data Studio dashboard. A brief descriptive analysis was provided as a blogpost on the COKI website.
    The repository contains the following content:
    Data:
    data/scholarcy/RIS/ - extracted references as RIS files
    data/scholarcy/BibTeX/ - extracted references as BibTeX files
    IPCC_AR6_WGII_dois.csv - list of DOIs
    Processing:
    preprocessing.txt - preprocessing steps for identifying and cleaning DOIs
    process.py - Python script for transforming data and linking to COKI data through Google BigQuery
    Outcomes:
    Dataset on BigQuery - requires a Google account for access and a BigQuery account for querying
    Data Studio Dashboard - interactive analysis of the generated data
    Zotero library of references extracted via Scholarcy
    PDF version of the blogpost
    Note on licenses: data are made available under CC0; code is made available under Apache License 2.0.
    Archived version of Release 2022-03-04 of the GitHub repository: https://github.com/Curtin-Open-Knowledge-Initiative/ipcc-ar6
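    A minimal sketch of the DOI-identification step (the regular expression and per-chapter grouping are assumptions; preprocessing.txt and process.py in the repository define the actual procedure):

    # Minimal sketch: scan the Scholarcy RIS exports for DOI strings and save
    # them as a CSV. The regex and the per-chapter grouping are assumptions.
    import pathlib
    import re

    import pandas as pd

    DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s\"<>]+", re.IGNORECASE)

    records = []
    for ris_file in pathlib.Path("data/scholarcy/RIS").glob("*.ris"):
        text = ris_file.read_text(encoding="utf-8", errors="ignore")
        for doi in sorted(set(DOI_PATTERN.findall(text))):
            records.append({"chapter": ris_file.stem, "doi": doi.rstrip(".,;")})

    pd.DataFrame(records).to_csv("IPCC_AR6_WGII_dois.csv", index=False)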

  19. Datasets and tools: Genotypes, Tannin Capacity, and Seasonality Influence...

    • zenodo.org
    bin, html, zip
    Updated Jan 16, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Abu Bakar Siddique; Abu Bakar Siddique (2025). Datasets and tools: Genotypes, Tannin Capacity, and Seasonality Influence the Structure and Function of Symptomless Fungal Communities in Aspen Leaves, Regardless of Historical Nitrogen Addition [Dataset]. http://doi.org/10.5281/zenodo.10839669
    Explore at:
    zip, bin, htmlAvailable download formats
    Dataset updated
    Jan 16, 2025
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Abu Bakar Siddique; Abu Bakar Siddique
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The repository associated with the following study:

    Genotype, Tannin Capacity, and Seasonality Influence the Structure and Function of Symptomless Fungal Communities in Aspen Leaves, Regardless of Historical Nitrogen Addition

    Abu Bakar Siddique1, Abu Bakar Siddique2,3, Benedicte Riber Albrectsen2*, Lovely Mahawar2*

    1. Department of Plant Biology, Swedish University of Agricultural Sciences, 75007, Uppsala, Sweden.

    2. Umeå Plant Science Centre (UPSC), Department of Plant Physiology, Umeå University, 90187 Umeå, Sweden.

    3. Tasmanian Institute of Agriculture (TIA), University of Tasmania, Prospect 7250, Tasmania, Australia.

    *Correspondence: benedicte.albrectsen@umu.se & lovely.mahawar@umu.se

    Data guidance:
    A reproducible, Nextflow-based 'nf-core/ampliseq' pipeline was used for analyzing the raw sequencing data, followed by guild analysis and R analysis. A full summary report of the bioinformatic analysis (step-by-step methods and descriptions) is provided as an HTML file named summary_report.html. The bioinformatic results and the entire R analysis can be found as sub-folders within a zip archive named bioinformatic_and_ranalysis_submission.zip (please extract the archive after downloading). The guild analysis is located in the 'guild' subfolder within the 'r_analysis' folder (inside the zip archive). The R and statistical analyses were visualized with a Quarto document; please refer to the file r_analysis_script_full_run_final.qmd. For the downsampled bioinformatic and R analysis, see the 'rarefy' subfolder.

    Bioinformatics:
    Data was processed using nf-core/ampliseq version 2.11.0dev, revision ce811bec9b (doi: 10.5281/zenodo.1493841) (Straub et al., 2020) of the nf-core collection of workflows (Ewels et al., 2020), utilising reproducible software environments from the Bioconda (Grüning et al., 2018) and Biocontainers (da Veiga Leprevost et al., 2017) projects.

    In brief, raw Illumina data (MiSeq v3 2 × 300 bp paired-end reads) were demultiplexed by SciLifeLab and delivered as sample-specific fastq files (submitted to SRA: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1090416), which were individually quality checked with FastQC (Andrews, 2010).

    Cutadapt (Martin, 2011) trimmed the primers, and all untrimmed sequences were discarded; sequences that did not contain the primer sequences were considered artifacts. Less than 100% of the sequences were discarded per sample, and a mean of 96.4% of the sequences per sample passed the filtering. Adapter- and primer-free sequences were processed as one pool (pooled) with DADA2 (Callahan et al., 2016) to eliminate PhiX contamination, trim reads (forward reads at 223 bp and reverse reads at 162 bp; shorter reads were discarded), discard reads with > 2 expected errors, correct errors, merge read pairs, and remove polymerase chain reaction (PCR) chimeras; ultimately, 2199 amplicon sequencing variants (ASVs) were obtained across all samples. Between 55.56% and 100% of reads per sample (average 82.3%) were retained. The ASV count table contained 32,632,582 counts in total, at least 1 and at most 964,860 per sample (average 87,020).

    VSEARCH (Rognes et al., 2016) clustered the 2199 ASVs into 770 centroids at a pairwise identity of 0.97. Barrnap (Seemann, 2013) filtered ASVs for bac, arc, mito, euk (bac: Bacteria, arc: Archaea, mito: Mitochondria, euk: Eukaryotes); 5 ASVs with less than 0.02% of counts per sample were removed (765 ASVs passed).

    Taxonomic classification was performed by DADA2 and the database ‘UNITE general FASTA release for Fungi - Version 9.0’ (Abarenkov, Kessy; Zirk, Allan; Piirmann, Timo; Pöhönen, Raivo; Ivanov, Filipp; Nilsson, R. Henrik; Kõljalg, Urmas (2023): UNITE general FASTA release for Fungi. Version 18.07.2023. UNITE Community. https://doi.org/10.15156/BIO/2938067).

    ASV sequences, abundance and DADA2 taxonomic assignments were loaded into QIIME2 (Bolyen et al., 2019). Of the 765 ASVs, 160 were removed because their taxonomic string contained any of (mitochondria, chloroplast, archaea, bacteria), they had fewer than 5 total read counts over all samples (Brown et al., 2015), or they were present in fewer than 2 samples (605 ASVs passed). Within QIIME2, the final microbial community data were visualized in a barplot.

    The bioinformatic code is saved in the GitHub repository, which contains step-by-step descriptions of the bioinformatic setup on the HPC (computer cluster) and of the 'nf-core/ampliseq' pipeline execution.

    Tools or software versions:

    ASSIGNSH:
    python: 3.9.1
    pandas: 1.1.5
    BARRNAP:
    barrnap: 0.9
    BARRNAPSUMMARY:
    python: Python 3.9.1
    COMBINE_TABLE_DADA2:
    R: 4.0.3
    CUTADAPT_BASIC:
    cutadapt: 4.6
    CUTADAPT_SUMMARY_STD:
    python: Python 3.8.3
    DADA2_DENOISING:
    R: 4.3.2
    dada2: 1.30.0
    DADA2_ERR:
    R: 4.3.2
    dada2: 1.30.0
    DADA2_FILTNTRIM:
    R: 4.3.2
    dada2: 1.30.0
    DADA2_MERGE:
    R: 4.1.1
    dada2: 1.22.0
    DADA2_RMCHIMERA:
    R: 4.3.2
    dada2: 1.30.0
    DADA2_STATS:
    R: 4.3.2
    dada2: 1.30.0
    DADA2_TAXONOMY:
    R: 4.3.2
    dada2: 1.30.0
    FILTER_CLUSTERS:
    python: 3.9.1
    pandas: 1.1.5
    FILTER_SSU:
    R: 4.0.3
    Biostrings: 2.58.0
    FILTER_STATS:
    python: 3.9.1
    pandas: 1.1.5
    FORMAT_TAXONOMY:
    bash: 5.0.16
    FORMAT_TAXRESULTS_STD:
    python: 3.9.1
    pandas: 1.1.5
    ITSX_CUTASV:
    ITSx: 1.1.3
    MERGE_STATS_FILTERSSU:
    R: 4.3.2
    MERGE_STATS_FILTERTAXA:
    R: 4.3.2
    MERGE_STATS_STD:
    R: 4.3.2
    PHYLOSEQ:
    R: 4.3.2
    phyloseq: 1.46.0
    QIIME2_BARPLOT:
    qiime2: 2023.7.0
    QIIME2_EXPORT_ABSOLUTE:
    qiime2: 2023.7.0
    QIIME2_EXPORT_RELASV:
    qiime2: 2023.7.0
    QIIME2_EXPORT_RELTAX:
    qiime2: 2023.7.0
    QIIME2_INASV:
    qiime2: 2023.7.0
    QIIME2_INSEQ:
    qiime2: 2023.7.0
    QIIME2_SEQFILTERTABLE:
    qiime2: 2023.7.0
    QIIME2_TABLEFILTERTAXA:
    qiime2: 2023.7.0
    RENAME_RAW_DATA_FILES:
    sed: 4.7
    VSEARCH_CLUSTER:
    vsearch: 2.21.1
    VSEARCH_USEARCHGLOBAL:
    vsearch: 2.21.1
    Workflow:
    nf-core/ampliseq: v2.11.0dev-g6549c5b
    Nextflow: 24.04.4

    List of references (Tools):

    Pipeline

    nf-core/ampliseq

    Straub D, Blackwell N, Langarica-Fuentes A, Peltzer A, Nahnsen S, Kleindienst S. Interpretations of Environmental Microbial Community Studies Are Biased by the Selected 16S rRNA (Gene) Amplicon Sequencing Pipeline. Front Microbiol. 2020 Oct 23;11:550420. doi: 10.3389/fmicb.2020.550420. PMID: 33193131; PMCID: PMC7645116.

    nf-core

    Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.

    Nextflow

    Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

    Pipeline tools

    Core tools

    FastQC

    Andrews S. FastQC: a quality control tool for high throughput sequence data. 2010. https://www.bioinformatics.babraham.ac.uk/projects/fastqc/

    Cutadapt

    Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17.1 (2011): pp. 10-12. doi: 10.14806/ej.17.1.200.

    Barrnap

    Seemann T. barrnap 0.9: rapid ribosomal RNA prediction.

    DADA2

    Callahan BJ, McMurdie PJ, Rosen MJ, Han AW, Johnson AJ, Holmes SP. DADA2: High-resolution sample inference from Illumina amplicon data. Nat Methods. 2016 Jul;13(7):581-3. doi: 10.1038/nmeth.3869. Epub 2016 May 23. PMID: 27214047; PMCID: PMC4927377.

    Taxonomic classification and database (only one database)

    Classification by QIIME2 classifier

    Bokulich NA, Kaehler BD, Rideout JR, Dillon M, Bolyen E, Knight R, Huttley GA, Gregory Caporaso J. Optimizing taxonomic classification of marker-gene amplicon sequences with QIIME 2's q2-feature-classifier plugin. Microbiome. 2018 May 17;6(1):90. doi: 10.1186/s40168-018-0470-z. PMID: 29773078; PMCID: PMC5956843.

    UNITE - eukaryotic nuclear ribosomal ITS region

    Kõljalg U, Larsson KH, Abarenkov K, Nilsson RH,

  20. Olympics game data analysis

    • kaggle.com
    Updated Mar 2, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    sarita (2025). Olympics game data analysis [Dataset]. https://www.kaggle.com/datasets/saritas95/olympics-game-data-analysis/suggestions
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 2, 2025
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    sarita
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    The Olympics Data Analysis project explores historical Olympic data using Exploratory Data Analysis (EDA) techniques. By leveraging Python libraries such as pandas, seaborn, and matplotlib, the project uncovers patterns in medal distribution, athlete demographics, and country-wise performance.

    Key findings reveal that most medalists are aged between 20 and 30 years, with the USA, China, and Russia leading in total medals. Over time, female participation has increased significantly, reflecting improved gender equality in sports. Additionally, athlete characteristics such as height and weight play a crucial role in certain sports, for example basketball (favoring taller players) and gymnastics (favoring younger athletes).

    The project includes interactive visualizations such as heatmaps, medal trends, and gender-wise participation charts to provide a comprehensive understanding of Olympic history and trends. The insights can help sports analysts, researchers, and enthusiasts better understand performance patterns in the Olympics.
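    A minimal sketch of this kind of EDA (the CSV name and the Age, Medal, Sex, and Year column names are placeholders and may differ from the actual dataset):

    # Minimal EDA sketch in the spirit of the analysis described above.
    # File and column names are placeholders.
    import matplotlib.pyplot as plt
    import pandas as pd
    import seaborn as sns

    df = pd.read_csv("olympics.csv")

    # Age distribution of medalists (most fall in the 20-30 range).
    medalists = df[df["Medal"].notna()]
    sns.histplot(medalists["Age"].dropna(), bins=30)
    plt.title("Age distribution of Olympic medalists")
    plt.show()

    # Female participation over time.
    df[df["Sex"] == "F"].groupby("Year").size().plot(
        kind="line", title="Female athlete entries per Games")
    plt.show()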
