http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
This CSV dataset provides comprehensive information about house prices. It consists of 9,819 entries and 54 columns, offering a wealth of features for analysis. The dataset includes various numerical and categorical variables, providing insights into factors that influence house prices.
The key columns in the dataset are as follows:
In addition to these, the dataset contains several other features related to various amenities and facilities available in the houses, such as double-glazed windows, central air conditioning, central heating, waste disposal, furnished status, service elevators, and more.
By performing exploratory data analysis on this dataset using Python and the Pandas library, valuable insights can be gained regarding the relationships between different variables and the impact they have on house prices. Descriptive statistics, data visualization, and feature engineering techniques can be applied to uncover patterns and trends in the housing market.
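By way of illustration, a minimal first-pass sketch of such an exploration is given below; the file name house_prices.csv and the price column name are assumptions, since the actual column names are listed separately.

import pandas as pd

# Load the dataset (file name is hypothetical; adjust to the actual CSV)
df = pd.read_csv("house_prices.csv")

# Basic shape and summary statistics for the numerical columns
print(df.shape)           # expected: (9819, 54)
print(df.describe())

# Correlation of numerical features with the target column
# ("price" is an assumed column name)
print(df.corr(numeric_only=True)["price"].sort_values(ascending=False).head(10))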
This dataset serves as a valuable resource for real estate professionals, analysts, and researchers interested in understanding the factors that contribute to house prices and making informed decisions in the real estate market.
Description: Dive into the world of exceptional cinema with our meticulously curated dataset, "IMDb's Gems Unveiled." This dataset is a result of an extensive data collection effort based on two critical criteria: IMDb ratings exceeding 7 and a substantial number of votes, surpassing 10,000. The outcome? A treasure trove of 4070 movies meticulously selected from IMDb's vast repository.
What sets this dataset apart is its richness and diversity. With more than 20 data points meticulously gathered for each movie, this collection offers a comprehensive insight into each cinematic masterpiece. Our data collection process leveraged the power of Selenium and Pandas modules, ensuring accuracy and reliability.
Cleaning this vast dataset was a meticulous task, combining both Excel and Python for optimum precision. Analysis is powered by Pandas, Matplotlib, and NLTK, enabling us to uncover hidden patterns, trends, and themes within the realm of cinema.
Note: The data is collected as of April 2023. Future versions of this analysis will include a movie recommendation system. Please do connect for any queries. All Love, No Hate.
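As an illustration of the kind of filtering that produced this collection, a minimal pandas sketch is shown below; the file name imdb_gems.csv and the column names rating and votes are assumptions, not the actual schema.

import pandas as pd

# Hypothetical file and column names; adjust to the published schema
df = pd.read_csv("imdb_gems.csv")

# Reproduce the selection criteria described above:
# IMDb rating above 7 and more than 10,000 votes
gems = df[(df["rating"] > 7) & (df["votes"] > 10_000)]
print(len(gems), "movies meet both criteria")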
The files and workflow will allow you to replicate the study titled "Exploring an extinct society through the lens of Habitus-Field theory and the Tocharian text corpus". This study aimed at utilizing the CEToM corpus (https://cetom.univie.ac.at/) (Tocharian) to analyze the life-world of the elites of an extinct society situated in modern eastern China. To acquire the raw data needed for steps 1 & 2, please contact Melanie Malzahn (melanie.malzahn@univie.ac.at). We conducted a mixed methods study consisting of close reading, content analysis, and multiple correspondence analysis (MCA). The Excel file titled "fragments_architecture_combined.xlsx" allows for replication of the MCA and corresponds to the third step of the workflow outlined below.
We used the following programming languages and packages to prepare the dataset and to analyze the data. Data preparation and merging procedures were carried out in Python (version 3.9.10) with the packages pandas (version 1.5.3), os (version 3.12.0), re (version 3.12.0), numpy (version 1.24.3), gensim (version 4.3.1), BeautifulSoup4 (version 4.12.2), pyasn1 (version 0.4.8), and langdetect (version 1.0.9). Multiple correspondence analyses were conducted in R (version 4.3.2) with the packages FactoMineR (version 2.9), factoextra (version 1.0.7), readxl (version 1.4.3), tidyverse (version 2.0.0), ggplot2 (version 3.4.4), and psych (version 2.3.9).
After requesting the necessary files, please open the scripts in the order outlined below and execute the code files to replicate the analysis:
Preparatory step: Create a folder for the Python and R scripts downloadable in this repository. Open the file 0_create folders.py and declare a root folder in line 19. This first script will generate the following folders:
"tarim-brahmi_database" = folder which contains Tocharian dictionaries and Tocharian text fragments.
"dictionaries" = contains Tocharian A and Tocharian B vocabularies, including linguistic features such as translations, meanings, part-of-speech tags, etc. A full overview of the words is provided at https://cetom.univie.ac.at/?words.
"fragments" = contains Tocharian text fragments as XML files.
"word_corpus_data" = folder that will contain Excel files of the corpus data after the first step.
"Architectural_terms" = contains the data on the architectural terms used in the dataset (e.g. dwelling, house).
"regional_data" = contains the data on the findspots (Tocharian and modern Chinese equivalent, e.g. Duldur-Akhur & Kucha).
"mca_ready_data" = the folder in which the Excel file with the merged data will be saved. Note that the prepared file named "fragments_architecture_combined.xlsx" can be saved into this directory. This allows you to skip steps 1 & 2 and reproduce the MCA of the content analysis based on the third step of our workflow (R script titled 3_conduct_MCA.R).
First step - run 1_read_xml-files.py: loops over the XML files in the folder dictionaries and identifies word metadata, including language (Tocharian A or B), keywords, part of speech, lemmata, word etymology, and loan sources. Then it loops over the XML text files and extracts a text ID number, language (Tocharian A or B), text title, text genre, text subgenre, prose type, verse type, material on which the text is written, medium, findspot, the source text in Tocharian, and the translation where available. After successful feature extraction, the resulting pandas DataFrame object is exported to the word_corpus_data folder.
Second step - run 2_merge_excel_files.py: merges all Excel files (corpus, data on findspots, word data) and reproduces the content analysis, which was based upon close reading in the first place.
Third step - run 3_conduct_MCA.R: recodes, prepares, and selects the variables necessary to conduct the MCA. It then produces the descriptive values before conducting the MCA, identifying typical texts per dimension, and exporting the PNG files uploaded to this repository.
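For orientation, a minimal sketch of the kind of XML-to-DataFrame extraction performed in the first step is shown below. The tag names such as title and language are illustrative assumptions, not the real CEToM XML schema; consult the actual fragment files for the correct elements.

import os
import pandas as pd
from bs4 import BeautifulSoup  # the "xml" parser below requires lxml to be installed

records = []
fragments_dir = "tarim-brahmi_database/fragments"  # folder created by 0_create folders.py

for filename in os.listdir(fragments_dir):
    if not filename.endswith(".xml"):
        continue
    with open(os.path.join(fragments_dir, filename), encoding="utf-8") as fh:
        soup = BeautifulSoup(fh, "xml")
    # Tag names below are placeholders for illustration only
    title = soup.find("title")
    language = soup.find("language")
    records.append({
        "text_id": filename.replace(".xml", ""),
        "title": title.get_text(strip=True) if title else None,
        "language": language.get_text(strip=True) if language else None,
    })

df = pd.DataFrame(records)
print(df.head())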
A machine learning barometer (using Random Forest Regression) to calculate equilibration pressure for majoritic garnets. Updated 04/02/21 (21/01/21) (10/12/20).
The barometer code
The barometer is provided as Python scripts (.py) and Jupyter Notebook (.ipynb) files. These are completely equivalent to one another, and which is used depends on the user's preference. Separate instructions are provided for each.
Data files included in this repository:
- "Majorite_database_04022021.xlsm" (Excel sheet of literature majoritic garnet compositions - inclusions (up to date as of 04/02/2021) and experiments (up to date as of 03/07/2020). This data includes all compositions that are close to majoritic, but some are borderline. Filtering as described in the paper accompanying this barometer is performed in the Python script prior to any data analysis or fitting.)
- "lit_maj_nat_030720.txt" (Python script input file of experimental literature majoritic garnet compositions - taken from the dataset above)
- "di_incs_040221.txt" (Python script input file of a literature compilation of majoritic garnet inclusions observed in natural diamonds - taken from the dataset above)
The barometer as Jupyter Notebooks - including integrated Caret validation (added 21/01/2021)
For those less familiar with Python, running the barometer as a Notebook is somewhat more intuitive than running the scripts below. It also has the benefit of including the RFR validation using Caret within a single integrated notebook. The Jupyter Notebook requires a suitable Python 3 environment (with pandas, numpy, matplotlib, sklearn, rpy2 and pickle packages + dependencies). We recommend installing the latest Anaconda Python distribution (found here: https://docs.anaconda.com/anaconda/install/) and creating a custom environment containing the required packages to run the Jupyter Notebook (as both Python 3 and R must be active in the environment). Instructions on this procedure can be found here (https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html), or to assist we have provided a copy of the environment used to produce the scripts (barom-spec-file.txt). An identical conda environment (called myenv) can be created and used by:
1) copying the barom-spec-file.txt to a suitable location (i.e. your home directory)
2) running the command: conda create --name myenv --file barom-spec-file.txt
3) entering this environment: conda activate myenv
4) running an instance of Jupyter Notebook by typing: jupyter notebook
Two Notebooks are provided:
- calculate_pressures_notebook.ipynb (equivalent to calculate_pressures.py described below)
- rfr_majbar_10122020_notebook.ipynb (equivalent to rfr_majbar_10122020.py described below, but also including integrated Caret validation performed using the rpy2 package in a single notebook environment)
The barometer as scripts (10/12/2020)
The scripts below need to be run in a suitable Python 3 environment (with pandas, numpy, matplotlib, sklearn and pickle packages + dependencies). For inexperienced users we recommend installing the latest Anaconda Python distribution (found here: https://docs.anaconda.com/anaconda/install/) and running in Spyder (a GUI scripting environment provided with Anaconda). Note - if running Python 3.7 (or earlier) you will need to install the pickle5 package to use the provided barometer files and comment/uncomment the appropriate lines in the "calculate_pressures.py" (lines 16/17) and "rfr_majbar_10122020.py" (lines 26/27) scripts. The user may additionally need to download and install the required packages if they are not provided with the Anaconda distribution (pandas, numpy, matplotlib, scikit-learn and pickle). This will be obvious as, when run, the script will return an error similar to "No module named XXXX". Packages can either be installed using the Anaconda package manager or on the command line / terminal via commands such as: conda install -c conda-forge pickle5. Appropriate command line installation commands can be obtained by searching the Anaconda cloud at anaconda.org for each required package.
A Python script (.py) is provided to calculate pressures for any majoritic garnet using the barometer calibrated in Thomson et al. (2021):
- calculate_pressures.py takes an input file of any majoritic garnet compositions (an example input file is provided, "example_test_data.txt", containing inclusion compositions reported by Zedgenizov et al., 2014, Chemical Geology, 363, pp 114-124).
- It employs the published RFR model and scaler, both provided as pickle files (pickle_model_20201210.pkl, scaler_20201210.pkl).
The user can simply edit the input file name in the provided .py script and then run the script in a suitable Python 3 environment (requires pandas, numpy, sklearn and pickle packages). The script initially filters data for majoritic compositions (according to the criteria used for barometer calibration) and predicts pressures for these compositions. It writes out pressures and 2 x std_dev in pressure estimates alongside the input data into "out_pressures_test.txt". If this script produces any errors or warnings, it is likely because the serialised pickle files provided are not compatible with the Python build being used (this is a common issue with serialised ML models). Please first try installing the pickle5 package and commenting/uncommenting lines 16/17. If this is unsuccessful, run the full barometer calibration script below (using the same input files as in Thomson et al. (2021), which are provided) to produce pickle files compatible with the Python build on the local machine (action 5 of the script below). Subsequently, edit the filenames called in the "calculate_pressures.py" script (lines 22 & 27) to match the new barometer calibration files and re-run the calculate pressure script. The output (predicted pressures) for the test dataset provided (and using the published calibration) given in the output file should be similar to the following results:
P (GPa)  error (GPa)
17.0  0.4
16.6  0.3
19.5  1.3
21.8  1.3
12.8  0.3
14.3  0.4
14.7  0.4
14.4  0.6
12.1  0.6
14.6  0.5
17.0  1.0
14.6  0.6
11.9  0.7
14.0  0.5
16.8  0.8
Full RFR barometer calibration script - rfr_majbar_10122020.py
The RFR barometer calibration script used and described in Thomson et al. (2021). This script performs the following actions:
1) filters input data - outputs this filtered data as a .txt file (which is the input expected for the RFR validation script using the R package Caret)
2) fits 1000 RFR models, each using a randomly selected training dataset (70% of the input data)
3) performs leave-one-out validation
4) plots figure 5 from Thomson et al. (2021)
5) fits one single RFR barometer using all input data (saves this and the scaler as .pkl files with a datestamp for use in the calculate_pressures.py script)
6) calculates the pressure for all literature inclusion compositions over 100 iterations with randomly distributed compositional uncertainties added - provides the mean pressure and 2 std deviations, written alongside the input inclusion compositions, as a .txt output file "diout.txt"
7) plots the global distribution of majoritic inclusion pressures
The RFR barometer can easily be updated to include (or exclude) additional experimental compositions by modification of the literature data input files provided.
RFR validation using Caret in R (script titled "RFR_validation_03072020.R")
Additional validation tests of the RFR barometer were completed using the Caret package in R. This requires the filtered experimental dataset file "data_filteredforvalidation.txt" (which is generated by the rfr_majbar_10122020.py script if required for a new dataset). It performs bootstrap, K-fold and leave-one-out validation and outputs validation stats for 5, 7 and 9 input variables (elements).
Please email Andrew Thomson (a.r.thomson@ucl.ac.uk) if you have any questions or queries.
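For orientation, the pressure-prediction step described above might look roughly like the following sketch. The feature column list is an assumption for illustration; the provided calculate_pressures.py script defines the actual filtering criteria and the column order expected by the published model.

import pickle
import pandas as pd

# Load the published RFR model and scaler (file names as provided in this repository)
with open("pickle_model_20201210.pkl", "rb") as f:
    model = pickle.load(f)
with open("scaler_20201210.pkl", "rb") as f:
    scaler = pickle.load(f)

# Hypothetical oxide columns; the real script defines the exact columns and order
feature_cols = ["SiO2", "Al2O3", "Cr2O3", "FeO", "MgO", "CaO", "Na2O"]

data = pd.read_csv("example_test_data.txt", sep=r"\s+")
X = scaler.transform(data[feature_cols])
pressures = model.predict(X)            # predicted pressures in GPa
print(pd.Series(pressures).round(1))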
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
As computing power grows, so does the need for data processing, which uses a lot of energy in steps like cleaning and analyzing data. This study looks at the energy and time efficiency of four common Python libraries (Pandas, Vaex, Scikit-learn, and NumPy) tested on five datasets across 21 tasks. We compared the energy use of the newest and older versions of each library. Our findings show that no single library always saves the most energy. Instead, energy use varies by task type, how often tasks are done, and the library version. In some cases, newer versions use less energy, pointing to the need for more research on making data processing more energy-efficient.
A zip file accompanying this study contains the scripts, datasets, and a README file for guidance. This setup allows for easy replication and testing of the experiments described, helping to further analyze energy efficiency across different libraries and tasks.
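The accompanying scripts define the actual measurement harness; purely as an illustration of the task-repetition idea, a runtime comparison of one task in Pandas versus NumPy could be sketched as follows. Energy measurement itself requires platform-specific tooling and is not shown here.

import time
import numpy as np
import pandas as pd

values = np.random.rand(1_000_000)
df = pd.DataFrame({"x": values})

def time_it(fn, repeats=21):
    # Repeat the task several times and report total wall-clock time
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return time.perf_counter() - start

print("pandas mean:", time_it(lambda: df["x"].mean()))
print("numpy  mean:", time_it(lambda: values.mean()))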
Data Description
The CADDI dataset is designed to support research in in-class activity recognition using IMU data from low-cost sensors. It provides multimodal data capturing 19 different activities performed by 12 participants in a classroom environment, utilizing both IMU sensors from a Samsung Galaxy Watch 5 and synchronized stereo camera images. This dataset enables the development and validation of activity recognition models using sensor fusion techniques.
Data Generation Procedures
The data collection process involved recording both continuous and instantaneous activities that typically occur in a classroom setting. The activities were captured using a custom setup, which included:
- A Samsung Galaxy Watch 5 to collect accelerometer, gyroscope, and rotation vector data at 100 Hz.
- A ZED stereo camera capturing 1080p images at 25-30 fps.
- A synchronized computer acting as a data hub, receiving IMU data and storing images in real time.
- A D-Link DSR-1000AC router for wireless communication between the smartwatch and the computer.
Participants were instructed to arrange their workspace as they would in a real classroom, including a laptop, notebook, pens, and a backpack. Data collection was performed under realistic conditions, ensuring that activities were captured naturally.
Temporal and Spatial Scope
- The dataset contains a total of 472.03 minutes of recorded data.
- The IMU sensors operate at 100 Hz, while the stereo camera captures images at 25-30 Hz.
- Data was collected from 12 participants, each performing all 19 activities multiple times.
- The geographical scope of data collection was Alicante, Spain, under controlled indoor conditions.
Dataset Components
The dataset is organized into JSON and PNG files, structured hierarchically:
- IMU data, stored in JSON files, containing: Samsung Linear Acceleration Sensor (X, Y, Z values, 100 Hz), LSM6DSO Gyroscope (X, Y, Z values, 100 Hz), Samsung Rotation Vector (X, Y, Z, W quaternion values, 100 Hz), Samsung HR Sensor (heart rate, 1 Hz), and OPT3007 Light Sensor (ambient light levels, 5 Hz).
- Stereo camera images: high-resolution 1920×1080 PNG files from left and right cameras.
- Synchronization: each IMU data record and image is timestamped for precise alignment.
Data Structure
The dataset is divided into continuous and instantaneous activities:
- Continuous activities (e.g., typing, writing, drawing) were recorded for 210 seconds, with the central 200 seconds retained.
- Instantaneous activities (e.g., raising a hand, drinking) were repeated 20 times per participant, with data captured only during execution.
The dataset is structured as:
/continuous/subject_id/activity_name/
  /camera_a/ -> left camera images
  /camera_b/ -> right camera images
  /sensors/  -> JSON files with IMU data
/instantaneous/subject_id/activity_name/repetition_id/
  /camera_a/
  /camera_b/
  /sensors/
Data Quality & Missing Data
- The smartwatch buffers 100 readings per second before sending them, ensuring minimal data loss.
- Synchronization latency between the smartwatch and the computer is negligible.
- Not all IMU samples have corresponding images due to the different recording rates.
- Outliers and anomalies were handled by discarding incomplete sequences at the start and end of continuous activities.
Error Ranges & Limitations
- Sensor data may contain noise due to minor hand movements.
- The heart rate sensor operates at 1 Hz, limiting its temporal resolution.
- Camera exposure settings were automatically adjusted, which may introduce slight variations in lighting.
File Formats & Software Compatibility
- IMU data is stored in JSON format, readable with Python's json library.
- Images are in PNG format, compatible with all standard image processing tools.
- Recommended libraries for data analysis: Python (numpy, pandas, scikit-learn, tensorflow, pytorch); visualization (matplotlib, seaborn); deep learning (Keras, PyTorch).
Potential Applications
- Development of activity recognition models in educational settings.
- Study of student engagement based on movement patterns.
- Investigation of sensor fusion techniques combining visual and IMU data.
This dataset represents a unique contribution to activity recognition research, providing rich multimodal data for developing robust models in real-world educational environments.
Citation
If you find this project helpful for your research, please cite our work using the following bibtex entry:
@misc{marquezcarpintero2025caddiinclassactivitydetection,
  title={CADDI: An in-Class Activity Detection Dataset using IMU data from low-cost sensors},
  author={Luis Marquez-Carpintero and Sergio Suescun-Ferrandiz and Monica Pina-Navarro and Miguel Cazorla and Francisco Gomez-Donoso},
  year={2025},
  eprint={2503.02853},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.02853},
}
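A minimal sketch of loading one IMU JSON file into pandas is shown below. The file path and the exact JSON layout are assumptions; adapt them to the actual folder structure and record format described above.

import json
import pandas as pd

# Hypothetical path following the documented directory layout
path = "continuous/subject_01/typing/sensors/linear_acceleration.json"

with open(path, encoding="utf-8") as fh:
    records = json.load(fh)   # assumed: a list of timestamped X/Y/Z readings

df = pd.DataFrame(records)
print(df.head())
print(df.describe())          # quick summary of the 100 Hz acceleration signal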
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract: This repository/dataset provides a suite of Python scripts to generate a simulated relational database for inventory management processes and transform this data into object-centric event logs (OCEL) suitable for advanced process mining analysis. The primary goal is to offer a synthetic yet realistic dataset that facilitates research, development, and application of object-centric process mining techniques in the domain of inventory control and supply chain management. The generated event logs capture common inventory operations, track stock level changes, and are enriched with key inventory management parameters (like EOQ, Safety Stock, Reorder Point) and status-based activity labels (e.g., indicating understock or overstock situations).
Overview: Inventory management is a critical business process characterized by the interaction of various entities such as materials, purchase orders, sales orders, plants, suppliers, and customers. Traditional process mining often struggles to capture these complex interactions. Object-Centric Process Mining (OCPM) offers a more suitable paradigm. This project provides the tools to create and explore such data.
The workflow involves simulating the relational database, converting it into an object-centric event log (OCEL) in CSV format, enriching it with inventory parameters and status-based activity labels, and exporting it to the standard OCEL XML format with the pm4py library.
Contents:
The repository contains the following Python scripts:
01_generate_simulation.py: generates the SQLite database inventory_management.db with the tables Materials, SalesOrderDocuments, SalesOrderItems, PurchaseOrderDocuments, PurchaseOrderItems, PurchaseRequisitions, GoodsReceiptsAndIssues, MaterialStocks, MaterialDocuments, SalesDocumentFlows, and OrderSuggestions.
02_database_to_ocel_csv.py: reads inventory_management.db and produces the event log ocel_inventory_management.csv. Object types are MAT (Material), PLA (Plant), PO_ITEM (Purchase Order Item), SO_ITEM (Sales Order Item), CUSTOMER, and SUPPLIER, alongside the standard OCEL columns (ocel:activity, ocel:timestamp, ocel:type:...).
03_ocel_csv_to_ocel.py: reads ocel_inventory_management.csv and uses pm4py to convert the CSV event log into the standard OCEL XML format (ocel_inventory_management.xml).
04_postprocess_activities.py: uses inventory_management.db to calculate inventory parameters (such as EOQ, Safety Stock, and Reorder Point), enriches the ocel_inventory_management.csv event log with a status-based ocel:activity label (e.g., "Goods Issue (Understock)"), adds MAT_PLA (Material-Plant combination) objects for easier status tracking, and writes post_ocel_inventory_management.csv.
05_ocel_csv_to_ocel.py: reads post_ocel_inventory_management.csv and uses pm4py to convert this enriched CSV event log into the standard OCEL XML format (post_ocel_inventory_management.xml).
Generated Dataset Files (if included, or can be generated using the scripts):
inventory_management.db: the SQLite database containing the simulated raw data.
ocel_inventory_management.csv: the initial OCEL in CSV format.
ocel_inventory_management.xml: the initial OCEL in standard OCEL XML format.
post_ocel_inventory_management.csv: the post-processed and enriched OCEL in CSV format.
post_ocel_inventory_management.xml: the post-processed and enriched OCEL in standard OCEL XML format.
How to Use:
Required Python packages: sqlite3 (standard library), pandas, numpy, pm4py. Run the scripts in order:
python 01_generate_simulation.py (generates inventory_management.db)
python 02_database_to_ocel_csv.py (generates ocel_inventory_management.csv from the database)
python 03_ocel_csv_to_ocel.py (generates ocel_inventory_management.xml)
python 04_postprocess_activities.py (generates post_ocel_inventory_management.csv using the database and the initial CSV OCEL)
python 05_ocel_csv_to_ocel.py (generates post_ocel_inventory_management.xml)
Potential Applications and Research: This dataset and the accompanying scripts can be used for research, development, and teaching in object-centric process mining, with a focus on inventory control and supply chain management.
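As a quick sanity check after running the scripts, the CSV event log can be inspected with pandas. A minimal sketch follows; the column names below follow the OCEL conventions listed above, and anything beyond those standard columns is an assumption.

import pandas as pd

log = pd.read_csv("ocel_inventory_management.csv")

# Standard OCEL columns: ocel:activity, ocel:timestamp, plus one ocel:type:... column per object type
print(log.columns.tolist())

# Frequency of activities in the simulated inventory process
print(log["ocel:activity"].value_counts())

# Events per day, based on the OCEL timestamp column
log["ocel:timestamp"] = pd.to_datetime(log["ocel:timestamp"])
print(log.set_index("ocel:timestamp").resample("D").size().head())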
Keywords: Object-Centric Event Log, OCEL, Process Mining, Inventory Management, Supply Chain, Simulation, Synthetic Data, SQLite, Python, pandas, pm4py, Economic Order Quantity (EOQ), Safety Stock (SS), Reorder Point (ROP), Stock Status Analysis.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Replication Kit for the Paper "Are Unit and Integration Test Definitions Still Valid for Modern Java Projects? An Empirical Study on Open-Source Projects"
This additional material shall provide other researchers with the ability to replicate our results. Furthermore, we want to facilitate further insights that might be generated based on our data sets.
Structure
The structure of the replication kit is as follows:
Additional Visualizations
We provide two additional visualizations for each project:
1) Venn diagrams: for each of the data sets (ALL and DISJ) there exists one visualization per project that shows four Venn diagrams, one for each of the different defect types. These Venn diagrams show the number of defects that were detected by either unit or integration tests (or both).
2) Boxplots: for each of the data sets (i.e., ALL and DISJ) we added boxplots showing the scores of unit and integration tests for each defect type.
Analysis scripts
Requirements:
- python3.5
- tabulate
- scipy
- seaborn
- mongoengine
- pycoshark
- pandas
- matplotlib
Both python files contain all code for the statistical analysis we performed.
Data Collection Tools
We provide all data collection tools that we have implemented and used throughout our paper. Overall, the kit contains six different projects and one Python script.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This is a classic and very widely used dataset in machine learning and statistics, often serving as a first dataset for classification problems. Introduced by the British statistician and biologist Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems," it is a foundational resource for learning classification algorithms.
Overview:
The dataset contains measurements for 150 samples of iris flowers. Each sample belongs to one of three species of iris:
For each flower, four features were measured:
The goal is typically to build a model that can classify iris flowers into their correct species based on these four features.
File Structure:
The dataset is usually provided as a single CSV (Comma Separated Values) file, often named iris.csv or similar. This file typically contains the following columns:
Content of the Data:
The dataset contains an equal number of samples (50) for each of the three iris species. The measurements of the sepal and petal dimensions vary between the species, allowing for their differentiation using machine learning models.
How to Use This Dataset:
Load the iris.csv file into your analysis environment (for example, with pandas), then train and evaluate a classification model on the four feature columns; a minimal sketch is given below.
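A minimal sketch along those lines, assuming the common column layout with four feature columns and a species column (adjust the names to the actual file):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("iris.csv")

# Assumed column names; the actual CSV may differ slightly
X = df[["sepal_length", "sepal_width", "petal_length", "petal_width"]]
y = df["species"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

clf = LogisticRegression(max_iter=200).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))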
Citation:
When using the Iris dataset, it is common to cite Ronald Fisher's original work:
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2), 179-188.
Data Contribution:
Thank you for providing this classic and fundamental dataset to the Kaggle community. The Iris dataset remains an invaluable resource for both beginners learning the basics of classification and experienced practitioners testing new algorithms. Its simplicity and clear class separation make it an ideal starting point for many data science projects.
If you find this dataset description helpful and the dataset itself useful for your learning or projects, please consider giving it an upvote after downloading. Your appreciation is valuable!
https://choosealicense.com/licenses/cc0-1.0/
The collection "Fiction littĂ©raire de Gallica" includes 19,240 public domain documents from the digital platform of the French National Library that were originally classified as novels or, more broadly, as literary fiction in prose. It consists of 372 tables of data in tsv format for each year of publication from 1600 to 1996 (all the missing years are in the 17th and 20th centuries). Each table is structured at the page-level of each novel (5,723,986 pages in all). It contains the complete text with the addition of some metadata. It can be opened in Excel or, preferably, with the new data analysis environments in R or Python (tidyverse, pandasâŠ)
This corpus can be used for large-scale quantitative analyses in computational humanities. The OCR text is presented in a raw format without any correction or enrichment in order to be directly processed for text mining purposes.
The extraction is based on a historical categorization of the novels: the Y2 or Ybis classification. This classification, invented in 1730, is the only one that has been continuously applied to the BNF collections now available in the public domain (mainly before 1950). Consequently, the dataset is based on a definition of "novel" that is generally contemporary of the publication.
A French data paper (in PDF and HTML) presents the construction process of the Y2 category and describes the structuring of the corpus. It also gives several examples of possible uses for computational humanities projects.
http://opensource.org/licenses/BSD-2-Clause
Python code (for Python 3.9 & Pandas 1.3.2) to generate the results used in "Compromised through Compression: Privacy Implications of Smart Meter Traffic Analysis".
Smart metering comes with risks to privacy. One concern is the possibility of an attacker seeing the traffic that reports the energy use of a household and deriving private information from that. Encryption helps to mask the actual energy measurements, but is not sufficient to cover all risks. One aspect which has so far gone unexplored, and where encryption does not help, is traffic analysis, i.e. whether the length of messages communicating energy measurements can leak privacy-sensitive information to an observer. In this paper we examine whether using encodings or compression for smart metering data could potentially leak information about household energy use. Our analysis is based on the real-world energy use data of ±80 Dutch households.
We find that traffic analysis could reveal information about the energy use of individual households if compression is used. As a result, when messages are sent daily, an attacker performing traffic analysis would be able to determine when all the members of a household are away or not using electricity for an entire day. We demonstrate this issue by recognizing when households from our dataset were on holiday. If messages are sent more often, more granular living patterns could likely be determined.
We propose a method of encoding the data that is nearly as effective as compression at reducing message size, but does not leak the information that compression leaks. By not requiring compression to achieve the best possible data savings, the risk of traffic analysis is eliminated.
This code operates on the relative energy measurements from the "Zonnedael dataset" from Liander N.V. This dataset needs to be obtained separately; see the instructions accompanying the code. The code transforms the dataset into absolute measurements such as would be taken by a smart meter. It then generates batch messages covering 24-hour periods starting at midnight, similar to how the Dutch infrastructure batches daily meter readings, in the different possible encodings with and without compression applied. For an explanation of the different encodings, see the paper. The code will then provide statistics on the efficiency of encoding and compression for the entire dataset, and attempt to find the periods of multi-day absences for each household. It will also generate the graphs in the style used in the paper and presentation.
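To illustrate the underlying effect (not the paper's actual encodings), the sketch below shows how compressing a day of readings makes the message length depend on the content: a flat, all-zero day compresses far better than a varied one. The message format here is an invented placeholder.

import json
import zlib
import random

# Hypothetical daily batch: 96 quarter-hourly meter readings (Wh)
active_day = [random.randint(50, 500) for _ in range(96)]
absent_day = [0] * 96  # household away, no consumption registered

def message_size(readings):
    # Serialize the batch and compress it, as a stand-in for a daily meter message
    payload = json.dumps(readings).encode("utf-8")
    return len(payload), len(zlib.compress(payload))

print("active day (raw, compressed):", message_size(active_day))
print("absent day (raw, compressed):", message_size(absent_day))
# The compressed absent-day message is much shorter, which is exactly the kind
# of size difference a traffic-analysis observer could exploit.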
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The "Crimp Force Curve Dataset" is a comprehensive collection of univariate time series data representing crimp force curves recorded during the manufacturing process of crimp connections. This dataset has been designed to support a variety of applications, including anomaly detection, fault diagnosis, and research in data-driven quality assurance.
A salient feature of this dataset is the presence of high-quality labels. Each crimp force curve is annotated both by a state-of-the-art crimp force monitoring system - capable of binary anomaly detection - and by domain experts who manually classified the curves into detailed quality classes. The expert annotations provide a valuable ground truth for training and benchmarking machine learning models beyond anomaly detection.
The dataset is particularly well-suited for tasks involving time series analysis, such as the training and evaluation of machine learning algorithms for quality control and fault detection. It provides a substantial foundation for the development of generalisable, yet domain-specific (crimping), data-driven quality control systems.
The data is stored in a Python pickle file, crimp_force_curves.pkl, which is a binary format used to serialize and deserialize Python objects. It can be conveniently loaded into a pandas DataFrame for exploration and analysis using the following command:
df = pd.read_pickle("crimp_force_curves.pkl")
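Building on that, a quick inspection might look like the following sketch; the column names used for the curve data and the quality label are assumptions, so check df.columns for the actual names.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_pickle("crimp_force_curves.pkl")
print(df.columns.tolist())      # inspect the actual column names first
print(df.head())

# Hypothetical column name for illustration only:
# "curve" = the univariate crimp force series of one crimp connection
first_curve = df.iloc[0]["curve"]
plt.plot(first_curve)
plt.xlabel("sample index")
plt.ylabel("crimp force")
plt.title("Example crimp force curve")
plt.show()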
This dataset is a valuable resource for researchers and practitioners in manufacturing engineering, computer science, and data science who are working at the intersection of quality control in manufacturing and machine learning.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Data Origin: This dataset was generated using information from the Community of Madrid, including traffic data collected by multiple sensors located throughout the city, as well as work calendar and meteorological data, all provided by the Community.
Data Type: The data consists of traffic measurements in Madrid from June 1, 2022, to September 30, 2023. Each record includes information on the date, time, location (longitude and latitude), traffic intensity, and associated road and weather conditions (e.g., whether it is a working day, holiday, information on wind, temperature, precipitation, etc.).
Technical Details:
Data Preprocessing: We utilized advanced techniques for cleaning and normalizing traffic data collected from sensors across Madrid. This included handling outliers and missing values to ensure data quality.
Geospatial Analysis: We used GeoPandas and OSMnx to map traffic data points onto Madrid's road network. This process involved processing spatial attributes such as street lanes and speed limits to add context to the traffic data.
Meteorological Data Integration: We incorporated Madrid's weather data, including temperature, precipitation, and wind speed. Understanding the impact of weather conditions on traffic patterns was crucial in this step.
Traffic Data Clustering: We implemented K-Means clustering to identify patterns in traffic data. This approach facilitated the selection of representative sensors from each cluster, focusing on the most relevant data points.
Calendar Integration: We combined the traffic data with the work calendar to distinguish between different types of days. This provided insights into traffic variations on working days and holidays.
Comprehensive Analysis Approach: The analysis was conducted using Python libraries such as Pandas, NumPy, scikit-learn, and Shapely. It covered data from the years 2022 and 2023, focusing on the unique characteristics of the Madrid traffic dataset.
Data Structure: Each row of the dataset represents an individual measurement from a traffic sensor, including:
id: Unique sensor identifier.
date: Date and time of the measurement.
longitude and latitude: Geographical coordinates of the sensor.
day type: Information about the day being a working day, holiday, or festive Sunday.
intensity: Measured traffic intensity.
Additional data like wind, temperature, precipitation, etc.
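Given the row structure listed above, a first look at the data with pandas might be sketched as follows; the file name and the exact column spellings are assumptions, so adapt them to the published files.

import pandas as pd

# Hypothetical file name; column names follow the field list above
df = pd.read_csv("madrid_traffic_2022_2023.csv", parse_dates=["date"])

# Average traffic intensity per day type (working day, holiday, festive Sunday)
print(df.groupby("day type")["intensity"].mean())

# Hourly intensity profile for one sensor (the sensor id is a placeholder)
sensor = df[df["id"] == 1001].set_index("date")
print(sensor["intensity"].groupby(sensor.index.hour).mean())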
Purpose of the Dataset: This dataset is useful for traffic analysis, urban mobility studies, infrastructure planning, and research related to traffic behavior under different environmental and temporal conditions.
Acknowledgment and Funding:
This dataset was obtained as part of the R&D project PID2020-113037RB-I00, funded by MCIN/AEI/10.13039/501100011033.
In addition to the NEAT-AMBIENCE project, support from the Department of Science, University, and Knowledge Society of the Government of Aragon (Government of Aragon: group reference T64_23R, COSMOS research group) is also acknowledged.
For academic and research purposes, please reference this dataset using its DOI for proper attribution and tracking.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Simulated data files
Simulated single-molecule tracks for characterizing the algorithm described in the article. char_short_sim.h5, char_n_tracks_sim_1.0.h5, and char_long_sim.h5 were used to investigate the effect of changing recording intervals; char_n_tracks_sim_0.5.h5, char_n_tracks_sim_1.0.h5, and char_n_tracks_sim_2.0.h5 were used to examine the impact of the dataset size. The h5 files contain tables created using the DataFrame.to_hdf method from the pandas Python package. Each table is identified by a key combining the simulated recording interval and an integer identifying a particular simulation execution.
Raw data files
FRET microscopy image sequences of TCR-pMHC interactions of 5c.c7 and AND TCR-transgenic T cells as described in the article. The zip archives' POPC subfolders contain the recorded image sequences, with the recording delay (in ms) and the number of donor excitation frames indicated in the file names. The beads subfolders contain images of fiducial markers for image registration.
Analysis files
Save files generated by the smfret-bondtime analysis software described in the article for the 5c.c7 and AND T cell data. Note that these files were generated using a software version predating the version published as 1.0.0. They can nonetheless be loaded with the newer version. In order to load the experimental data:
1) install the smfret-bondtime software
2) extract the raw data
3) extract the analysis files; the current folder should now contain 5cc7 and/or AND subfolders as well as 5cc7.yaml, 5cc7.h5, AND.yaml, and AND.h5 files.
If the raw data is extracted to a different place, open the respective YAML files using a text editor and adjust the data_dir entry accordingly.
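To see which simulation tables a given h5 file contains, the keys can be listed with pandas before loading one of them; a minimal sketch (reading HDF5 files this way requires the PyTables package):

import pandas as pd

# List the available table keys in one of the simulated-data files
with pd.HDFStore("char_long_sim.h5", mode="r") as store:
    keys = store.keys()
print(keys)

# Load the first table as a DataFrame of simulated single-molecule tracks
tracks = pd.read_hdf("char_long_sim.h5", key=keys[0])
print(tracks.head())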
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
An in-depth analysis of millions of data entries from Chicago's Field Museum underwent implementation, furnishing insights related to 25 Gorilla specimens and spanning the realms of biogeography, zoology, primatology, and biological anthropology. Taxonomically, and at first glance, all specimens examined belong to the kingdom Animalia, phylum Chordata, class Mammalia, order Primates, and family Hominidae. Furthermore, these specimens can be further categorized under the genus Gorilla and species gorilla, with most belonging to the subspecies Gorilla gorilla gorilla and some specimens being categorized as Gorilla gorilla. Biologically, the specimens' sex distribution entails 16 specimens (or 64% of the total) being identified as male and 5 (or 20%) identified as female, with 4 (or 16%) specimens having their sex unassigned. Furthermore, collectors, all of whom are identified by name, culled most of these specimens from unidentified zoos, with a few specimens having been sourced from Ward's Natural Science Establishment, a well-known natural science materials supplier to North American museums. In terms of historicity, the specimens underwent collection between 1975 and 1993, with some entries lacking this information. Additionally, multiple organ preparations have been performed on the specimens, encompassing skulls, skeletons, skins, and endocrine organs being mounted and alcohol-preserved. Disappointingly, despite the existence of these preparations, tissue samples and coordinates are largely unavailable for the 25 specimens on record, limiting further research or analysis. In fact, tissue sampling is available for a sole specimen, identified by IRN 2661980. Only one specimen, identifiable as IRN 2514759, has a specified geographical location indicated as "Africa, West Africa, West Indies," while the rest have either "Unknown/None, Zoo" locations, signaling that no entry is available.
Python code to extract data from the Field Museum's zoological collections records and online database is included in the attached .py file. This code constitutes a web scraping algorithm, retrieving data from the above-mentioned website, processing it, and storing it in a structured format. To achieve these tasks, it first imports the necessary libraries, drawing on requests for making HTTP requests, Pandas for handling data, time for introducing delays, lxml for parsing HTML, and BeautifulSoup for web scraping. Furthermore, this algorithm defines the main URL for searching for Gorilla gorilla specimens before setting up headers for making HTTP requests, e.g., User-Agent and other headers to mimic a browser request. Next, an HTTP GET request to the main URL is made, and the response text is obtained. The next step consists of parsing the response text using BeautifulSoup and lxml. Extracting information from the search results page (e.g., Internal Record Number, Catalog Subset, Higher Classification, Catalog Number, Taxonomic Name, DwC Locality, Collector/field, Collection No., Coordinates Available, Tissue Available, and Sex) comes next. This information is then stored in a list called basic_data. The algorithm subsequently iterates through each record in basic_data and accesses its detailed information page by making another HTTP GET request with the extracted URL. For each detailed information page, the code thereafter extracts additional data (e.g., FM Catalog, Scientific Name, Phylum, Class, Order, Family, Genus, Species, Field Number, Collector, Collection No., Geography, Date Collected, Preparations, Tissue Available, Co-ordinates Available, and Sex). Correspondingly, this information is stored in a list called main_data. The algorithm then processes the final main_data list and converts it into a structured format, i.e., a CSV file.
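In outline, and only as a hedged sketch of the approach described above (the search URL, headers, and HTML selectors below are placeholders, not the Field Museum's actual endpoints or markup), the scraping loop might look like this:

import time
import requests
import pandas as pd
from bs4 import BeautifulSoup

# Placeholder URL and headers; the real script defines the actual search URL
SEARCH_URL = "https://example.org/collections/search?taxon=Gorilla+gorilla"
HEADERS = {"User-Agent": "Mozilla/5.0 (research script)"}

response = requests.get(SEARCH_URL, headers=HEADERS, timeout=30)
soup = BeautifulSoup(response.text, "lxml")

basic_data = []
# Placeholder selector; the real results page uses its own markup
for row in soup.select("div.result"):
    link = row.find("a")
    basic_data.append({
        "catalog_number": row.get("data-catalog", ""),
        "detail_url": link["href"] if link else None,
    })

main_data = []
for record in basic_data:
    if not record["detail_url"]:
        continue
    detail = requests.get(record["detail_url"], headers=HEADERS, timeout=30)
    detail_soup = BeautifulSoup(detail.text, "lxml")
    # ... extract fields such as Scientific Name, Sex, Date Collected here ...
    main_data.append({**record,
                      "page_title": detail_soup.title.string if detail_soup.title else ""})
    time.sleep(1)  # polite delay between requests

pd.DataFrame(main_data).to_csv("gorilla_specimens.csv", index=False)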
https://spdx.org/licenses/etalab-2.0.html
This dataset contains replication data for the paper "Comparison of Solar Imaging Feature Extraction Methods in the Context of Space Weather Prediction with Deep Learning-Based Models". It includes files stored in HDF5 (Hierarchical Data Format) format using HDFStore. One file contains the features extracted using the 6 different techniques for the 19.3 nm wavelength, named solar_extracted_features_v01_2010-2020.h5, and the second contains the SERENADE outputs, named serenade_predictions_v01.h5. Both files contain several datasets labeled with "keys", which correspond to the extraction method. Here is a list of the key names:
gn_1024: the GoogLeNet extractor with 1024 components.
pca_1024: the Principal Component Analysis technique leaving 1024 components.
ae_1024: the AutoEncoder with a latent space of 1024.
gn_256 (only in solar_extracted_features_v01_2010-2020.h5): the GoogLeNet extractor with 256 components.
pca_256: the Principal Component Analysis technique leaving 256 components.
ae_256: the AutoEncoder technique with a latent space of 256.
vae_256 (only in solar_extracted_features_v01_2010-2020.h5): the Variational AutoEncoder technique with a latent space of 256.
vae_256_old (only in serenade_predictions_v01.h5): the output predictions of SERENADE using the VAE extracted features with the hyperparameters optimized for GoogLeNet.
vae_256_new (only in serenade_predictions_v01.h5): the output predictions of SERENADE using the VAE extracted features with the alternative architecture.
All the above-mentioned models are explained and detailed in the paper. The files can be read with the Pandas package for Python as follows:
import pandas as pd
df = pd.read_hdf('file_name.h5', key='model_name')
where file_name is either solar_extracted_features_v01_2010-2020.h5 or serenade_predictions_v01.h5 and model_name is one of the keys in the list above.
The extracted features dataset will output a pandas DataFrame indexed by datetime with either 1024 or 256 columns of features. An additional column indicates to which subset (train, validation or test) the corresponding row belongs. The SERENADE outputs dataset will output a DataFrame indexed by datetime with 4 columns:
Observations: the first column contains the true daily maximum of the Kp index.
Predictions: the second column contains the predicted mean of the daily maximum of the Kp index.
Standard Deviation: the third column contains the standard deviation, as the predictions are probabilistic.
Model: this column specifies from which feature extractor model the inputs were used to generate the predictions.
We add the feature extractor AE and VAE class codes as well as their weights in the AEs_class.py and VAE_class.py files and the best_AE_1024.ckpt, best_AE_256.ckpt and best_VAE.ckpt checkpoints, respectively. The figures in the manuscript can be reproduced using the codes named after the corresponding figure. The files 6_mins_predictions and seed_variation contain the SERENADE predictions needed to reproduce figures 7, 8, 9 and 10.
This repository contains data on 17,420 DOIs cited in the IPCC Working Group 2 contribution to the Sixth Assessment Report, and the code to link them to the dataset built at the Curtin Open Knowledge Initiative (COKI). References were extracted from the report's PDFs (downloaded 2022-03-01) via Scholarcy and exported as RIS and BibTeX files. DOI strings were identified from the RIS files by pattern matching and saved as a CSV file. The list of DOIs for each chapter and cross-chapter paper was processed using a custom Python script to generate a pandas DataFrame, which was saved as a CSV file and uploaded to Google BigQuery. We used the main object table of the Academic Observatory, which combines information from Crossref, Unpaywall, Microsoft Academic, Open Citations, the Research Organization Registry and Geonames, to enrich the DOIs with bibliographic information, affiliations, and open access status. A custom query was used to join and format the data, and the resulting table was visualised in a Google Data Studio dashboard. A brief descriptive analysis was provided as a blogpost on the COKI website.
The repository contains the following content:
Data:
data/scholarcy/RIS/ - extracted references as RIS files
data/scholarcy/BibTeX/ - extracted references as BibTeX files
IPCC_AR6_WGII_dois.csv - list of DOIs
Processing:
preprocessing.txt - preprocessing steps for identifying and cleaning DOIs
process.py - Python script for transforming data and linking to COKI data through Google BigQuery
Outcomes:
Dataset on BigQuery - requires a Google account for access and a BigQuery account for querying
Data Studio Dashboard - interactive analysis of the generated data
Zotero library of references extracted via Scholarcy
PDF version of the blogpost
Note on licenses: Data are made available under CC0. Code is made available under Apache License 2.0.
Archived version of Release 2022-03-04 of GitHub repository: https://github.com/Curtin-Open-Knowledge-Initiative/ipcc-ar6
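The DOI pattern-matching step could be sketched roughly as follows. This is an illustration, not the repository's process.py; the RIS path matches the layout listed above, and the regex is a common but not exhaustive DOI pattern.

import re
import pandas as pd
from pathlib import Path

# Common DOI pattern; real-world matching needs extra cleanup of trailing punctuation
DOI_RE = re.compile(r"10\.\d{4,9}/[^\s\"<>]+", re.IGNORECASE)

rows = []
for ris_file in Path("data/scholarcy/RIS").glob("*.ris"):
    text = ris_file.read_text(encoding="utf-8", errors="ignore")
    for doi in set(DOI_RE.findall(text)):
        rows.append({"chapter": ris_file.stem, "doi": doi.rstrip(".,;")})

dois = pd.DataFrame(rows).drop_duplicates()
dois.to_csv("IPCC_AR6_WGII_dois.csv", index=False)
print(len(dois), "unique chapter-DOI pairs")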
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The repository associated with the following study:
Genotype, Tannin Capacity, and Seasonality Influence the Structure and Function of Symptomless Fungal Communities in Aspen Leaves, Regardless of Historical Nitrogen Addition
Abu Bakar Siddique1, Abu Bakar Siddique2,3, Benedicte Riber Albrectsen2*, Lovely Mahawar2*
1. Department of Plant Biology, Swedish University of Agricultural Sciences, 75007, Uppsala, Sweden.
2. Umeå Plant Science Centre (UPSC), Department of Plant Physiology, Umeå University, 90187 Umeå, Sweden.
3. Tasmanian Institute of Agriculture (TIA), University of Tasmania, Prospect 7250, Tasmania, Australia.
*Correspondence: benedicte.albrectsen@umu.se & lovely.mahawar@umu.se
Data guidance:
A reproducible, Nextflow-based 'nf-core/ampliseq' pipeline was used for analyzing the raw sequencing data, followed by guild analysis and R analysis. A full summary report of the bioinformatic analysis (step-by-step methods and description) can be found in the HTML file named summary_report.html. Bioinformatic results and the entire R analysis can be found as sub-folders within a zip folder named bioinformatic_and_ranalysis_submission.zip (please extract the zip folder after downloading). The guild analysis can be found in the 'guild' subfolder within the 'r_analysis' folder (inside the zip folder). R and statistical analyses were visualized with a Quarto document; please refer to the file r_analysis_script_full_run_final.qmd. For the downsampled bioinformatic & R analysis, see the 'rarefy' subfolder.
Bioinformatics:
Data was processed using nf-core/ampliseq version 2.11.0dev, revision ce811bec9b (doi: 10.5281/zenodo.1493841) (Straub et al., 2020) of the nf-core collection of workflows (Ewels et al., 2020), utilising reproducible software environments from the Bioconda (Grüning et al., 2018) and Biocontainers (da Veiga Leprevost et al., 2017) projects.
In brief, raw Illumina data (MiSeq v3 2 × 300 bp paired-end reads) were demultiplexed by SciLifeLab and delivered as sample-specific fastq files (submitted on SRA: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1090416), which were individually quality checked with FastQC (Andrews, 2010).
Cutadapt (Martin, 2011) trimmed primers, and all untrimmed sequences were discarded. Sequences that did not contain primer sequences were considered artifacts. Less than 100% of the sequences were discarded per sample, and a mean of 96.4% of the sequences per sample passed the filtering. Adapter- and primer-free sequences were processed as one pool (pooled) with DADA2 (Callahan et al., 2016) to eliminate PhiX contamination, trim reads (forward reads at 223 bp and reverse reads at 162 bp; reads shorter than this were discarded), discard reads with > 2 expected errors, correct errors, merge read pairs, and remove polymerase chain reaction (PCR) chimeras; ultimately, 2199 amplicon sequencing variants (ASVs) were obtained across all samples. Between 55.56% and 100% of reads per sample (average 82.3%) were retained. The ASV count table contained 32,632,582 counts in total, at least 1 and at most 964,860 per sample (average 87,020).
VSEARCH (Rognes et al., 2016) clustered the 2199 ASVs into 770 centroids with a pairwise identity of 0.97. Barrnap (Seemann, 2013) filtered ASVs for bac, arc, mito, euk (bac: Bacteria, arc: Archaea, mito: Mitochondria, euk: Eukaryotes); 5 ASVs with less than 0.02% counts per sample were removed (765 ASVs passed).
Taxonomic classification was performed by DADA2 and the database "UNITE general FASTA release for Fungi - Version 9.0" (Abarenkov, Kessy; Zirk, Allan; Piirmann, Timo; Pöhönen, Raivo; Ivanov, Filipp; Nilsson, R. Henrik; Kõljalg, Urmas (2023): UNITE general FASTA release for Fungi. Version 18.07.2023. UNITE Community. https://doi.org/10.15156/BIO/2938067).
ASV sequences, abundances and DADA2 taxonomic assignments were loaded into QIIME2 (Bolyen et al., 2019). Of the 765 ASVs, 160 were removed because the taxonomic string contained any of (mitochondria, chloroplast, archaea, bacteria), had fewer than 5 total read counts over all samples (Brown et al., 2015), or were present in fewer than 2 samples (605 ASVs passed). Within QIIME2, the final microbial community data was visualized in a barplot.
Bioinformatic code is saved in the GitHub repository, which contains step-by-step descriptions of the bioinformatic setup on the HPC (computer cluster) and of the 'nf-core/ampliseq' pipeline execution.
Tools or software versions:
ASSIGNSH:
python: 3.9.1
pandas: 1.1.5
BARRNAP:
barrnap: 0.9
BARRNAPSUMMARY:
python: Python 3.9.1
COMBINE_TABLE_DADA2:
R: 4.0.3
CUTADAPT_BASIC:
cutadapt: 4.6
CUTADAPT_SUMMARY_STD:
python: Python 3.8.3
DADA2_DENOISING:
R: 4.3.2
dada2: 1.30.0
DADA2_ERR:
R: 4.3.2
dada2: 1.30.0
DADA2_FILTNTRIM:
R: 4.3.2
dada2: 1.30.0
DADA2_MERGE:
R: 4.1.1
dada2: 1.22.0
DADA2_RMCHIMERA:
R: 4.3.2
dada2: 1.30.0
DADA2_STATS:
R: 4.3.2
dada2: 1.30.0
DADA2_TAXONOMY:
R: 4.3.2
dada2: 1.30.0
FILTER_CLUSTERS:
python: 3.9.1
pandas: 1.1.5
FILTER_SSU:
R: 4.0.3
Biostrings: 2.58.0
FILTER_STATS:
python: 3.9.1
pandas: 1.1.5
FORMAT_TAXONOMY:
bash: 5.0.16
FORMAT_TAXRESULTS_STD:
python: 3.9.1
pandas: 1.1.5
ITSX_CUTASV:
ITSx: 1.1.3
MERGE_STATS_FILTERSSU:
R: 4.3.2
MERGE_STATS_FILTERTAXA:
R: 4.3.2
MERGE_STATS_STD:
R: 4.3.2
PHYLOSEQ:
R: 4.3.2
phyloseq: 1.46.0
QIIME2_BARPLOT:
qiime2: 2023.7.0
QIIME2_EXPORT_ABSOLUTE:
qiime2: 2023.7.0
QIIME2_EXPORT_RELASV:
qiime2: 2023.7.0
QIIME2_EXPORT_RELTAX:
qiime2: 2023.7.0
QIIME2_INASV:
qiime2: 2023.7.0
QIIME2_INSEQ:
qiime2: 2023.7.0
QIIME2_SEQFILTERTABLE:
qiime2: 2023.7.0
QIIME2_TABLEFILTERTAXA:
qiime2: 2023.7.0
RENAME_RAW_DATA_FILES:
sed: 4.7
VSEARCH_CLUSTER:
vsearch: 2.21.1
VSEARCH_USEARCHGLOBAL:
vsearch: 2.21.1
Workflow:
nf-core/ampliseq: v2.11.0dev-g6549c5b
Nextflow: 24.04.4
List of references (Tools):
Pipeline
nf-core/ampliseq
Straub D, Blackwell N, Langarica-Fuentes A, Peltzer A, Nahnsen S, Kleindienst S. Interpretations of Environmental Microbial Community Studies Are Biased by the Selected 16S rRNA (Gene) Amplicon Sequencing Pipeline. Front Microbiol. 2020 Oct 23;11:550420. doi: 10.3389/fmicb.2020.550420. PMID: 33193131; PMCID: PMC7645116.
nf-core
Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.
Nextflow
Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.
Pipeline tools
Core tools
FastQC
Andrews S. FastQC: a quality control tool for high throughput sequence data. 2010. https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Cutadapt
Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17(1) (2011): 10-12. doi: 10.14806/ej.17.1.200.
Barrnap
Seemann T. barrnap 0.9: rapid ribosomal RNA prediction.
DADA2
Callahan BJ, McMurdie PJ, Rosen MJ, Han AW, Johnson AJ, Holmes SP. DADA2: High-resolution sample inference from Illumina amplicon data. Nat Methods. 2016 Jul;13(7):581-3. doi: 10.1038/nmeth.3869. Epub 2016 May 23. PMID: 27214047; PMCID: PMC4927377.
Taxonomic classification and database (only one database)
Classification by QIIME2 classifier
Bokulich NA, Kaehler BD, Rideout JR, Dillon M, Bolyen E, Knight R, Huttley GA, Gregory Caporaso J. Optimizing taxonomic classification of marker-gene amplicon sequences with QIIME 2's q2-feature-classifier plugin. Microbiome. 2018 May 17;6(1):90. doi: 10.1186/s40168-018-0470-z. PMID: 29773078; PMCID: PMC5956843.
UNITE - eukaryotic nuclear ribosomal ITS region
Kõljalg U, Larsson KH, Abarenkov K, Nilsson RH,
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The Olympics Data Analysis project explores historical Olympic data using Exploratory Data Analysis (EDA) techniques. By leveraging Python libraries such as pandas, seaborn, and matplotlib, the project uncovers patterns in medal distribution, athlete demographics, and country-wise performance.
Key findings reveal that most medalists are aged between 20-30 years, with USA, China, and Russia leading in total medals. Over time, female participation has increased significantly, reflecting improved gender equality in sports. Additionally, athlete characteristics like height and weight play a crucial role in certain sports, such as basketball (favoring taller players) and gymnastics (favoring younger athletes).
The project includes interactive visualizations such as heatmaps, medal trends, and gender-wise participation charts to provide a comprehensive understanding of Olympic history and trends. The insights can help sports analysts, researchers, and enthusiasts better understand performance patterns in the Olympics.
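A small sketch of the kind of EDA described here is given below; the file name and column names such as Age, Team, Sex and Medal are assumptions based on common Olympics datasets, not the project's actual schema.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("olympics_athlete_events.csv")   # hypothetical file name

# Top countries by total medals won
medals = df.dropna(subset=["Medal"])
print(medals["Team"].value_counts().head(10))

# Age distribution of medalists
sns.histplot(medals["Age"].dropna(), bins=30)
plt.title("Age distribution of Olympic medalists")
plt.show()

# Female participation share over time
female_share = df.groupby("Year")["Sex"].apply(lambda s: (s == "F").mean())
print(female_share.tail())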