License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LifeSnaps Dataset Documentation
Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet, limited data exist on the association between in the wild large-scale physical activity patterns, sleep, stress, and overall health, and behavioral patterns and psychological measurements due to challenges in collecting and releasing such datasets, such as waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset, a multi-modal, longitudinal, and geographically-distributed dataset, containing a plethora of anthropological data, collected unobtrusively for the total course of more than 4 months by n=71 participants, under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types from second to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to make these data available openly to empower future research. We envision that releasing this large-scale dataset of multi-modal real-world data, will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction.
The following instructions will get you started with the LifeSnaps dataset and are complementary to the original publication.
Data Import: Reading CSV
For ease of use, we provide CSV files containing Fitbit, SEMA, and survey data at daily and/or hourly granularity. You can read the files via any programming language. For example, in Python, you can read the files into a Pandas DataFrame with the pandas.read_csv() command.
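As a minimal illustration (the file name below is a placeholder for whichever daily or hourly CSV you downloaded):
```
import pandas as pd

# "lifesnaps_daily.csv" is a placeholder name; point this at the daily or hourly
# CSV file you obtained from the dataset.
df = pd.read_csv("lifesnaps_daily.csv")
print(df.head())
```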
Data Import: Setting up a MongoDB (Recommended)
To take full advantage of the LifeSnaps dataset, we recommend that you use the raw, complete data via importing the LifeSnaps MongoDB database.
To do so, open the terminal/command prompt and run the following command for each collection in the DB. Ensure you have MongoDB Database Tools installed from here.
For the Fitbit data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c fitbit
For the SEMA data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c sema
For surveys data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c surveys
If you have access control enabled, then you will need to add the --username and --password parameters to the above commands.
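For example, a restore with authentication might look like the following (replace the placeholder credentials with your own):
mongorestore --host localhost:27017 --username your_username --password your_password -d rais_anonymized -c fitbit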
Data Availability
The MongoDB database contains three collections, fitbit, sema, and surveys, containing the Fitbit, SEMA3, and survey data, respectively. Similarly, the CSV files contain related information to these collections. Each document in any collection follows the format shown below:
{
_id:
License: Apache License 2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Project Description:
Title: Pandas Data Manipulation and File Conversion
Overview: This project aims to demonstrate the basic functionalities of Pandas, a powerful data manipulation library in Python. In this project, we will create a DataFrame, perform some data manipulation operations using Pandas, and then convert the DataFrame into both Excel and CSV formats.
Key Objectives:
Tools and Libraries Used:
Project Implementation:
DataFrame Creation:
Data Manipulation:
File Conversion: The DataFrame is exported to an Excel file using the to_excel() function and to a CSV file using the to_csv() function.
Expected Outcome:
Upon completion of this project, you will have gained a fundamental understanding of how to work with Pandas DataFrames, perform basic data manipulation tasks, and convert DataFrames into different file formats. This knowledge will be valuable for data analysis, preprocessing, and data export tasks in various data science and analytics projects.
Conclusion:
The Pandas library offers powerful tools for data manipulation and file conversion in Python. By completing this project, you will have acquired essential skills that are widely applicable in the field of data science and analytics. You can further extend this project by exploring more advanced Pandas functionalities or integrating it into larger data processing pipelines. In this project, we add several datasets, turn each into a DataFrame, save them to a single Excel file as separate sheets, and then convert that Excel file into CSV files.
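A rough sketch of the workflow described above (the data and file names are made up for illustration; writing Excel files requires an engine such as openpyxl):
```
import pandas as pd

# Build two small example DataFrames (illustrative data only).
students = pd.DataFrame({"name": ["Asha", "Ben"], "score": [91, 84]})
cities = pd.DataFrame({"city": ["Pune", "Delhi"], "population": [7.4, 32.9]})

# Save both DataFrames into a single Excel file as separate sheets.
with pd.ExcelWriter("output.xlsx") as writer:
    students.to_excel(writer, sheet_name="students", index=False)
    cities.to_excel(writer, sheet_name="cities", index=False)

# Convert each sheet of the Excel file into its own CSV file.
sheets = pd.read_excel("output.xlsx", sheet_name=None)  # dict of sheet_name -> DataFrame
for sheet_name, frame in sheets.items():
    frame.to_csv(f"{sheet_name}.csv", index=False)
```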
Overview
This dataset re-shares cartographic and demographic data from the U.S. Census Bureau to provide an obvious supplement to Open Environments Block Group publications. These results do not reflect any proprietary or predictive model. Rather, they extract from Census Bureau results with some proportions and aggregation rules applied. For additional support or more detail, please see the Census Bureau citations below.
Cartographics refer to shapefiles shared in the Census TIGER/Line publications. Block Group areas are updated annually, with major revisions accompanying the Decennial Census at the turn of each decade. These shapes are useful for visualizing estimates as a map and relating geographies based upon geo-operations like overlapping. This data is kept in a geodatabase file format and requires the geopandas package and its supporting fiona and GDAL software.
Demographics are taken from popular variables in the American Community Survey (ACS), including age, race, income, education and family structure. This data simply requires CSV reader software or Python's pandas package.
More details on the ACS variables selected and derivation rules applied can be found in the commentary docstrings in the source code found here: https://github.com/OpenEnvironments/blockgroupdemographics.
## Files
While the demographic data has many columns, the cartographic data has a single very large column called "geometry" storing the many-point boundaries of each shape. So, this process saves the data separately: the demographic columns go in a CSV file named YYYYblockgroupdemographics.csv, and the cartographic column, 'geometry', is shared as a file named YYYYblockgroupdemographics-geometry.pkl. This file needs an installation of geopandas, fiona and GDAL software.
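A minimal loading sketch, assuming the 2021 vintage of the files and that the geometry pickle was written by geopandas (so geopandas, fiona and GDAL must be installed before unpickling it):
```
import pandas as pd

# The YYYY prefix below (2021) is only an example; use the vintage you downloaded.
demographics = pd.read_csv("2021blockgroupdemographics.csv")

# The geometry column ships separately as a pickle; restoring the geometry objects
# requires geopandas (with fiona/GDAL) to be importable in the environment.
geometry = pd.read_pickle("2021blockgroupdemographics-geometry.pkl")
```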
The train data of the Riiid competition is a large dataset of over 100 million rows and 10 columns that does not fit into the Kaggle Notebook's RAM using the default pandas read_csv, which prompted a search for alternative approaches and formats.
This dataset provides the Riiid competition train data in different formats.
Reading the CSV file for the Riiid competition took a huge amount of time and memory. This inspired me to convert the CSV into different file formats so that they can be loaded easily in a Kaggle kernel.
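A rough sketch of such a conversion with pandas (the output formats here are choices for illustration, not part of the original dataset; feather/parquet writing requires pyarrow):
```
import pandas as pd

# Read the large CSV once, then persist it in faster binary formats
# so later sessions can reload it far more quickly.
train = pd.read_csv("train.csv", low_memory=False)
train.to_feather("train.feather")    # requires pyarrow
train.to_parquet("train.parquet")    # requires pyarrow or fastparquet

# Subsequent loads:
train = pd.read_feather("train.feather")
```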
This dataset contains the files taken from the Microsoft Malware Prediction competition dataset, converted from CSV to HDF5 format.
The purpose of this conversion from CSV to HDF5 is to work with VAEX tools rather than Pandas. Pandas is not effective at reading huge CSV files containing about a million rows or more, whereas VAEX handles them comfortably. Check this link for a better understanding of why VAEX is better than Pandas when handling huge datasets.
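A minimal sketch of opening one of the converted files with VAEX (the file name is an assumption):
```
import vaex

# vaex memory-maps the HDF5 file instead of loading every row into RAM,
# so even very large files open almost instantly.
df = vaex.open("train.hdf5")  # substitute the actual file name from this dataset
print(df.shape)
print(df.head(5))
```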
An upvote would be encouraging.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains supplementary data to the journal article 'Redocking the PDB' by Flachsenberg et al. (https://doi.org/10.1021/acs.jcim.3c01573)[1]. In this paper, we described two datasets: the PDBScan22 dataset, with a large set of 322,051 macromolecule–ligand binding sites generally suitable for redocking, and the PDBScan22-HQ dataset, with 21,355 binding sites passing different structure quality filters. These datasets were further characterized by calculating properties of the ligand (e.g., molecular weight), properties of the binding site (e.g., volume), and structure quality descriptors (e.g., crystal structure resolution). Additionally, we performed redocking experiments with our novel JAMDA structure preparation and docking workflow[1] and with AutoDock Vina[2,3]. Details for all these experiments and the dataset composition can be found in the journal article[1]. Here, we provide all the datasets, i.e., the PDBScan22 and PDBScan22-HQ datasets as well as the docking results and the additionally calculated properties (for the ligand, the binding sites, and structure quality descriptors). Furthermore, we give a detailed description of their content (i.e., the data types and a description of the column values). All datasets consist of CSV files with the actual data and associated metadata JSON files describing their content. The CSV/JSON files are compliant with the CSV on the Web standard (https://csvw.org/).
General hints
All docking experiment results consist of two CSV files, one with general information about the docking run (e.g., was it successful?) and one with individual pose results (i.e., score and RMSD to the crystal structure). All files (except for the docking pose tables) can be indexed uniquely by the column tuple '(pdb, name)' containing the PDB code of the complex (e.g., 1gm8) and the name of the ligand (in the format '_', e.g., 'SOX_B_1559'). All files (except for the docking pose tables) have exactly the same number of rows as the dataset they were calculated on (e.g., PDBScan22 or PDBScan22-HQ). However, some CSV files may have missing values (see also the JSON metadata files) in some or even all columns (except for 'pdb' and 'name'). The docking pose tables also contain the 'pdb' and 'name' columns. However, these alone are not unique but only together with the 'rank' column (i.e., there might be multiple poses for each docking run or none).
Example usage
Using the pandas library (https://pandas.pydata.org/) in Python, we can calculate the number of protein-ligand complexes in the PDBScan22-HQ dataset with a top-ranked pose RMSD to the crystal structure ≤ 2.0 Å in the JAMDA redocking experiment and a molecular weight between 100 Da and 200 Da:
import pandas as pd
df = pd.read_csv('PDBScan22-HQ.csv')
df_poses = pd.read_csv('PDBScan22-HQ_JAMDA_NL_NR_poses.csv')
df_properties = pd.read_csv('PDBScan22_ligand_properties.csv')
merged = df.merge(df_properties, how='left', on=['pdb', 'name'])
merged = merged[(merged['MW'] >= 100) & (merged['MW'] <= 200)].merge(df_poses[df_poses['rank'] == 1], how='left', on=['pdb', 'name'])
nof_successful_top_ranked = (merged['rmsd_ai'] <= 2.0).sum()
nof_no_top_ranked = merged['rmsd_ai'].isna().sum()
Datasets
PDBScan22.csv: This is the PDBScan22 dataset[1]. This dataset was derived from the PDB[4]. It contains macromolecule–ligand binding sites (defined by PDB code and ligand identifier) that can be read by the NAOMI library[5,6] and pass basic consistency filters.
PDBScan22-HQ.csv: This is the PDBScan22-HQ dataset[1]. It contains macromolecule–ligand binding sites from the PDBScan22 dataset that pass certain structure quality filters described in our publication[1].
PDBScan22-HQ-ADV-Success.csv: This is a subset of the PDBScan22-HQ dataset without 336 binding sites where AutoDock Vina[2,3] fails.
PDBScan22-HQ-Macrocycles.csv: This is a subset of the PDBScan22-HQ dataset without the 336 binding sites where AutoDock Vina[2,3] fails, containing only molecules with macrocycles of at least ten atoms.
Properties for PDBScan22
PDBScan22_ligand_properties.csv: Conformation-independent properties of all ligand molecules in the PDBScan22 dataset. Properties were calculated using an in-house tool developed with the NAOMI library[5,6].
PDBScan22_StructureProfiler_quality_descriptors.csv: Structure quality descriptors for the binding sites in the PDBScan22 dataset, calculated using the StructureProfiler tool[7].
PDBScan22_basic_complex_properties.csv: Simple properties of the binding sites in the PDBScan22 dataset. Properties were calculated using an in-house tool developed with the NAOMI library[5,6].
Properties for PDBScan22-HQ
PDBScan22-HQ_DoGSite3_pocket_descriptors.csv: Binding site descriptors calculated for the binding sites in the PDBScan22-HQ dataset using the DoGSite3 tool[8].
PDBScan22-HQ_molecule_types.csv: Assignment of ligands in the PDBScan22-HQ dataset (without the 336 binding sites where AutoDock Vina fails) to different molecular classes (i.e., drug-like, fragment-like, oligosaccharide, oligopeptide, cofactor, macrocyclic). A detailed description of the assignment can be found in our publication[1].
Docking results on PDBScan22
PDBScan22_JAMDA_NL_NR.csv: Docking results of JAMDA[1] on the PDBScan22 dataset. This is the general overview for the docking runs; the pose results are given in 'PDBScan22_JAMDA_NL_NR_poses.csv'. For this experiment, the ligand was not considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was disabled.
PDBScan22_JAMDA_NL_NR_poses.csv: Pose scores and RMSDs for the docking results of JAMDA[1] on the PDBScan22 dataset. For this experiment, the ligand was not considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was disabled.
Docking results on PDBScan22-HQ
PDBScan22-HQ_JAMDA_NL_NR.csv: Docking results of JAMDA[1] on the PDBScan22-HQ dataset. This is the general overview for the docking runs; the pose results are given in 'PDBScan22-HQ_JAMDA_NL_NR_poses.csv'. For this experiment, the ligand was not considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was disabled.
PDBScan22-HQ_JAMDA_NL_NR_poses.csv: Pose scores and RMSDs for the docking results of JAMDA[1] on the PDBScan22-HQ dataset. For this experiment, the ligand was not considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was disabled.
PDBScan22-HQ_JAMDA_NL_WR.csv: Docking results of JAMDA[1] on the PDBScan22-HQ dataset. This is the general overview for the docking runs; the pose results are given in 'PDBScan22-HQ_JAMDA_NL_WR_poses.csv'. For this experiment, the ligand was not considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was enabled.
PDBScan22-HQ_JAMDA_NL_WR_poses.csv: Pose scores and RMSDs for the docking results of JAMDA[1] on the PDBScan22-HQ dataset. For this experiment, the ligand was not considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was enabled.
PDBScan22-HQ_JAMDA_NW_NR.csv: Docking results of JAMDA[1] on the PDBScan22-HQ dataset. This is the general overview for the docking runs; the pose results are given in 'PDBScan22-HQ_JAMDA_NW_NR_poses.csv'. For this experiment, the ligand was not considered during preprocessing of the binding site, all water molecules were removed from the binding site during preprocessing, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was disabled.
PDBScan22-HQ_JAMDA_NW_NR_poses.csv: Pose scores and RMSDs for the docking results of JAMDA[1] on the PDBScan22-HQ dataset. For this experiment, the ligand was not considered during preprocessing of the binding site, all water molecules were removed from the binding site during preprocessing, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was disabled.
PDBScan22-HQ_JAMDA_NW_WR.csv: Docking results of JAMDA[1] on the PDBScan22-HQ dataset. This is the general overview for the docking runs; the pose results are given in 'PDBScan22-HQ_JAMDA_NW_WR_poses.csv'. For this experiment, the ligand was not considered during preprocessing of the binding site, all water molecules were removed from the binding site during preprocessing, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was enabled.
PDBScan22-HQ_JAMDA_NW_WR_poses.csv: Pose scores and RMSDs for the docking results of JAMDA[1] on the PDBScan22-HQ dataset. For this experiment, the ligand was not considered during preprocessing of the binding site, all water molecules were removed from the binding site during preprocessing, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was enabled.
PDBScan22-HQ_JAMDA_WL_NR.csv: Docking results of JAMDA[1] on the PDBScan22-HQ dataset. This is the general overview for the docking runs; the pose results are given in 'PDBScan22-HQ_JAMDA_WL_NR_poses.csv'. For this experiment, the ligand was considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was disabled.
PDBScan22-HQ_JAMDA_WL_NR_poses.csv: Pose scores and RMSDs for the docking results of JAMDA[1] on the PDBScan22-HQ dataset. For this experiment, the ligand was considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand
License: Database Contents License (DbCL) v1.0, http://opendatacommons.org/licenses/dbcl/1.0/
This dataset contains daily OHLCV data for ~2000 Indian stocks listed on the National Stock Exchange, covering their full trading history. The columns are multi-index columns, so this needs to be taken into account when reading and using the data.
Source: Yahoo Finance
Type: All files are in CSV format.
Currency: INR
All the tickers have been collected from here : https://www.nseindia.com/market-data/securities-available-for-trading
If using pandas, the following function is a utility to read any of the CSV files:
```
import pandas as pd

def read_ohlcv(filename):
    """Read a given OHLCV data file downloaded from yfinance."""
    return pd.read_csv(
        filename,
        skiprows=[0, 1, 2],  # remove the multi-index rows that cause trouble
        names=["Date", "Close", "High", "Low", "Open", "Volume"],
        index_col="Date",
        parse_dates=["Date"],
    )
```
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Objective
The primary objective of this study was to analyze CpG dinucleotide dynamics in coronaviruses by comparing Wuhan-Hu-1 with its closest and most distant relatives. Heatmaps were generated to visualize CpG counts and O/E ratios across intergenic regions, providing a clear depiction of conserved and divergent CpG patterns.
Methods
1. Data Collection
Source: The dataset includes CpG counts and O/E ratios for various coronaviruses, extracted from publicly available genomic sequences.
Format: Data was compiled into a CSV file containing columns for intergenic regions, CpG counts, and O/E ratios for each virus.
2. Preprocessing
Data Cleaning: Missing values (NaN), infinite values (inf, -inf), and blank entries were handled using Python's pandas library. Missing values were replaced with column means, and infinite values were capped at a large finite value (1e9).
Reshaping: The data was reshaped into matrices for CpG counts and O/E ratios using pandas' melt() and pivot() functions.
3. Distance Calculation
Euclidean Distance: Pairwise Euclidean distances were calculated between Wuhan-Hu-1 and other viruses using the scipy.spatial.distance.euclidean function. Distances were computed separately for CpG counts and O/E ratios, and the total distance was derived as the sum of both metrics.
4. Identification of Closest and Distant Relatives
The virus with the smallest total distance was identified as the closest relative. The virus with the largest total distance was identified as the most distant relative.
5. Heatmap Generation
Tools: Heatmaps were generated using Python's seaborn library (sns.heatmap) and matplotlib for visualization.
Parameters: Heatmaps were annotated with numerical values for clarity. A color gradient (coolwarm) was used to represent varying CpG counts and O/E ratios. Titles and axis labels were added to describe the comparison between Wuhan-Hu-1 and its relatives.
Results
Closest Relative: The closest relative to Wuhan-Hu-1 was identified based on the smallest Euclidean distance. Heatmaps for CpG counts and O/E ratios show high similarity in specific intergenic regions.
Most Distant Relative: The most distant relative was identified based on the largest Euclidean distance. Heatmaps reveal significant differences in CpG dynamics compared to Wuhan-Hu-1.
Tools and Libraries
Programming Language: Python 3.13
Libraries: pandas (data manipulation and cleaning), numpy (numerical operations and handling of missing/infinite values), scipy.spatial.distance (Euclidean distances), seaborn (heatmaps), matplotlib (additional visualization enhancements).
File Formats: Input: CSV files containing CpG counts and O/E ratios. Output: PNG images of heatmaps.
Files Included
CSV File: Contains the raw data of CpG counts and O/E ratios for all viruses.
Heatmap Images: Heatmaps for CpG counts and O/E ratios comparing Wuhan-Hu-1 with its closest and most distant relatives.
Python Script: Full Python code used for data processing, distance calculation, and heatmap generation.
Usage Notes
Researchers can use this dataset to further explore the evolutionary dynamics of CpG dinucleotides in coronaviruses. The Python script can be adapted to analyze other viral genomes or datasets. Heatmaps provide a visual summary of CpG dynamics, aiding in hypothesis generation and experimental design.
Acknowledgments
Special thanks to the open-source community for developing tools like pandas, numpy, seaborn, and matplotlib. This work was conducted as part of an independent research project in molecular biology and bioinformatics.
License
This dataset is shared under the CC BY 4.0 License, allowing others to share and adapt the material as long as proper attribution is given.
DOI: 10.6084/m9.figshare.28736501
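A condensed sketch of the distance and heatmap steps described above; the CSV file name and column names are assumptions about the data layout, not taken from the published script:
```
import pandas as pd
import seaborn as sns
from scipy.spatial.distance import euclidean

# Hypothetical layout: one row per (virus, intergenic_region) with a cpg_count value.
df = pd.read_csv("cpg_data.csv")
counts = df.pivot(index="virus", columns="intergenic_region", values="cpg_count")
counts = counts.fillna(counts.mean())  # replace missing values with column means

reference = counts.loc["Wuhan-Hu-1"]
distances = {virus: euclidean(reference, counts.loc[virus])
             for virus in counts.index if virus != "Wuhan-Hu-1"}
closest = min(distances, key=distances.get)

# Annotated heatmap comparing Wuhan-Hu-1 with its closest relative.
sns.heatmap(counts.loc[["Wuhan-Hu-1", closest]], annot=True, cmap="coolwarm")
```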
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains aggregated and sub-metered power consumption data from a two-person apartment in Germany. Data was collected from March 5 to September 4, 2025, spanning 6 months. It includes an aggregate reading from a main smart meter and individual readings from 40 smart plugs, smart relays, and smart power meters monitoring various appliances.
The dataset can be downloaded here: https://doi.org/10.5281/zenodo.17159850
Because the data contains long off periods of zeros, the CSV file compresses very well.
To extract it, use: xz -d DARCK.csv.xz.
The compression yields a 97% smaller file (from 4 GB to 90.9 MB).
To use the dataset in Python, you can, for example, load the CSV file into a pandas DataFrame:
import pandas as pd
df = pd.read_csv("DARCK.csv", parse_dates=["time"])
The main meter was monitored using an infrared reading head magnetically attached to the infrared interface of the meter. An ESP8266 flashed with Tasmota decodes the binary datagrams and forwards the Watt readings to the MQTT broker. Individual appliances were monitored using a combination of Shelly Plugs (for outlets), Shelly 1PM (for wired-in devices like ceiling lights), and Shelly PM Mini (for each of the three phases of the oven). All devices reported to a central InfluxDB database via Home Assistant running in docker on a Dell OptiPlex 3020M.
The dataset is provided as a single comma-separated value (CSV) file, DARCK.csv.
| Column Name | Data Type | Unit | Description |
|---|---|---|---|
| time | datetime | - | Timestamp for the reading in YYYY-MM-DD HH:MM:SS |
| main | float | Watt | Total aggregate power consumption for the apartment, measured at the main electrical panel. |
| [appliance_name] | float | Watt | Power consumption of an individual appliance (e.g., lightbathroom, fridge, sherlockpc). See Section 8 for a full list. |
| Aggregate Columns | | | |
| aggr_chargers | float | Watt | The sum of sherlockcharger, sherlocklaptop, watsoncharger, watsonlaptop, watsonipadcharger, kitchencharger. |
| aggr_stoveplates | float | Watt | The sum of stoveplatel1 and stoveplatel2. |
| aggr_lights | float | Watt | The sum of lightbathroom, lighthallway, lightsherlock, lightkitchen, lightlivingroom, lightwatson, lightstoreroom, fcob, sherlockalarmclocklight, sherlockfloorlamphue, sherlockledstrip, livingfloorlamphue, sherlockglobe, watsonfloorlamp, watsondesklamp and watsonledmap. |
| Analysis Columns | | | |
| inaccuracy | float | Watt | As no electrical device bypasses a power meter, the true inaccuracy can be assessed. It is the absolute error between the sum of individual measurements and the mains reading. A 30W offset is applied to the sum since the measurement devices themselves draw power which is otherwise unaccounted for. |
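A hedged sketch of re-deriving the inaccuracy column from its definition above; exactly which columns count as individual appliances is an assumption here (the aggr_* columns are excluded to avoid double counting):
```
import pandas as pd

df = pd.read_csv("DARCK.csv", parse_dates=["time"])

# Individual appliance columns: everything except time, main, inaccuracy and the
# aggregate columns. A 30 W offset accounts for the measurement devices' own draw.
appliance_cols = [c for c in df.columns
                  if c not in ("time", "main", "inaccuracy") and not c.startswith("aggr_")]
recomputed_inaccuracy = (df[appliance_cols].sum(axis=1) + 30 - df["main"]).abs()
```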
The final dataset was generated from two raw data sources (meter.csv and shellies.csv) using a comprehensive postprocessing pipeline.
Main meter (main) postprocessing
The aggregate power data required several cleaning steps to ensure accuracy.
Shelly devices (shellies) postprocessing
The Shelly devices are not prone to the same burst issue as the ESP8266. They push a new reading at every change in power drawn. If no power change is observed, or the change is too small (less than a few watts), the reading is pushed once a minute, together with a heartbeat. When a device turns on or off, intermediate power values are published, which leads to sub-second values that need to be handled.
Sub-second readings were resampled onto a regular one-second time index with .resample('1s').last().ffill(). NaN values (e.g., from before a device was installed) were filled with 0.0, assuming zero consumption. During analysis, two significant unmetered load events were identified and manually corrected to improve the accuracy of the aggregate reading. The error column (inaccuracy) was recalculated after these corrections.
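A small sketch of that resampling step for one hypothetical device series (the timestamps are made up):
```
import pandas as pd

readings = pd.Series(
    [12.5, 0.0, 3.1],
    index=pd.to_datetime([
        "2025-03-05 00:00:00.4",   # sub-second values around a switching event
        "2025-03-05 00:00:00.9",
        "2025-03-05 00:01:05.0",   # next heartbeat-driven reading
    ]),
)

# Keep the last value per second, forward-fill between readings,
# and treat periods before installation as zero consumption.
regular = readings.resample("1s").last().ffill().fillna(0.0)
```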
The following table lists the column names with an explanation where needed. As Watson moved at the beginning of June, some metering plugs changed their appliance.
License: Public Domain, https://creativecommons.org/licenses/publicdomain/
This repository contains data on 17,420 DOIs cited in the IPCC Working Group 2 contribution to the Sixth Assessment Report, and the code to link them to the dataset built at the Curtin Open Knowledge Initiative (COKI).
References were extracted from the report's PDFs (downloaded 2022-03-01) via Scholarcy and exported as RIS and BibTeX files. DOI strings were identified from the RIS files by pattern matching and saved as a CSV file. The list of DOIs for each chapter and cross-chapter paper was processed using a custom Python script to generate a pandas DataFrame, which was saved as a CSV file and uploaded to Google BigQuery.
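An illustrative version of that extraction step; the regular expression and file names here are assumptions, not the project's exact code:
```
import re
import pandas as pd

# Simple DOI matcher; the pattern used in the original pipeline may differ.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+", re.IGNORECASE)

def extract_dois(ris_text: str, chapter: str) -> pd.DataFrame:
    """Return a DataFrame of unique DOI strings found in one chapter's RIS export."""
    dois = sorted(set(DOI_PATTERN.findall(ris_text)))
    return pd.DataFrame({"chapter": chapter, "doi": dois})

with open("chapter02.ris") as handle:          # hypothetical chapter file
    chapter_dois = extract_dois(handle.read(), chapter="Chapter 2")
chapter_dois.to_csv("chapter02_dois.csv", index=False)
```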
We used the main object table of the Academic Observatory, which combines information from Crossref, Unpaywall, Microsoft Academic, Open Citations, the Research Organization Registry and Geonames to enrich the DOIs with bibliographic information, affiliations, and open access status. A custom query was used to join and format the data and the resulting table was visualised in a Google DataStudio dashboard.
A brief descriptive analysis was provided as a blogpost on the COKI website.
The repository contains the following content:
Data:
Processing:
Outcomes:
Note on licenses:
Data are made available under CC0
Code is made available under Apache License 2.0
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The New York City bikeshare, Citi Bike, has a real time, public API. This API conforms to the General Bikeshare Feed Specification. As such, this API contains information about the number of bikes and docks available at every station in NYC.
Since 2016, I have been pinging the public API every 2 minutes and storing the results. This dataset contains all of these results, from 8/15/2016 - 12/8/2021. The data unfortunately comes in the form of a bunch of CSVs. I recognize that this is not the best format to read large datasets like this, but a CSV is still a pretty universal format! My suggestion would be to convert these CSVs to parquet or something similar if you plan to do lots of analysis on lots of files.
Originally, I set up an EC2 instance and pinged a legacy API (code). In 2019, I switched to pinging this API via a Lambda function (code).
As part of this 2019 switch, I also started pinging the station information API once per week in order to collect information about each station, such as the name, latitude and longitude. While this dataset contains columns for all of the station information, these columns are missing data between 2016 and 8/2019. It would probably be reasonable to backfill that data with the earliest info available for each station, although be warned that this is not guaranteed to be accurate.
In order to reduce the individual file size, the full dataset has been bucketed by station_id into 50 separate files. All historical data for a given station_id are in the same file, and the stations are randomly distributed across the 50 files.
As previously mentioned, station information is missing for all data earlier than 8/2019. I have included a column, missing_station_information, to indicate when this information is missing. You may wonder why I don't just create a separate station information file which can be joined to the file containing the time series. The reason is that the station information can technically change over time. When station information is provided in a given row, that information is accurate as of some point within the 7 days prior. This is because I pinged the station information weekly and then had to join it to the time series.
The CSV files are the result of a CREATE TABLE AS AWS Athena query using the TEXTFILE format. Consequently, null values are demarcated as \N. The two timestamp columns, station_status_last_reported and station_information_last_updated, are in units of POSIX/UNIX time (i.e., seconds since 1970-01-01 00:00:00 UTC). The following code may be helpful to get you started loading the data as a pandas DataFrame.
import pandas as pd

def read_csv(filename: str) -> pd.DataFrame:
    """
    Read DataFrame from a CSV file ``filename`` and convert to a
    preferred schema.
    """
    df = pd.read_csv(
        filename,
        sep=",",
        na_values="\\N",
        dtype={
            "station_id": str,
            # Use pandas Int16 dtype to allow for nullable integers
            "num_bikes_available": "Int16",
            "num_ebikes_available": "Int16",
            "num_bikes_disabled": "Int16",
            "num_docks_available": "Int16",
            "num_docks_disabled": "Int16",
            "is_installed": "Int16",
            "is_renting": "Int16",
            "is_returning": "Int16",
            "station_status_last_reported": "Int64",
            "station_name": str,
            "lat": float,
            "lon": float,
            "region_id": str,
            "capacity": "Int16",
            # Use pandas boolean dtype to allow for nullable booleans
            "has_kiosk": "boolean",
            "station_information_last_updated": "Int64",
            "missing_station_information": "boolean",
        },
    )
    # Read in timestamps as UNIX/POSIX epochs but then convert to the local
    # bike share timezone.
    df["station_status_last_reported"] = pd.to_datetime(
        df["station_status_last_reported"], unit="s", origin="unix", utc=True
    ).dt.tz_convert("US/Eastern")
    df["station_information_last_updated"] = pd.to_datetime(
        df["station_information_last_updated"], unit="s", origin="unix", utc=True
    ).dt.tz_convert("US/Eastern")
    return df
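Building on the helper above, one hedged way to implement the backfill idea mentioned earlier (fill missing station information with the earliest values observed for each station; the bucket file name is a placeholder, and the result is not guaranteed to be historically accurate):
```
df = read_csv("citibike_station_bucket_00.csv")   # hypothetical bucket file name

info_cols = ["station_name", "lat", "lon", "region_id", "capacity", "has_kiosk"]
df = df.sort_values("station_status_last_reported")
# Within each station, pull the earliest available info backwards in time,
# then forward-fill any remaining gaps.
df[info_cols] = df.groupby("station_id")[info_cols].transform(lambda col: col.bfill().ffill())
```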
The column names almost come directly from the station_status and station_information APIs. See the [GBFS schema](https://github.com/MobilityData/gbfs...
License: GNU General Public License v3.0, https://www.gnu.org/licenses/gpl-3.0.html
PostgreSQL 15.
SQL optimized database with the data from the Russian Federal Customs Service on the Russian external economic activity during 2016-2021.
This is the SQL version of the dataset created by the people listed in the "authors.pdf" file. The CSV dataset was converted into an optimized relational database by Vadim Rudakov. I will do my best to provide a detailed "about" file in the near future; for now I can only say that all the attributes from the single original table have been split into separate tables linked by foreign keys. The original dataset was too large (2 GB) and had a huge impact on memory and CPU. The SQL version is only 1.2 GB and lets researchers investigate the data even on slow machines, select the data they need, and save it to CSV for further work in pandas or other tools.
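A hedged sketch of pulling a selection into pandas and saving it to CSV; the connection string, database name, and table name are placeholders (requires SQLAlchemy and psycopg2):
```
import pandas as pd
from sqlalchemy import create_engine

# Placeholders: adjust the credentials, database name, and query to your setup.
engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/customs_db")

query = "SELECT * FROM declarations LIMIT 100000"   # hypothetical table name
selection = pd.read_sql(query, engine)
selection.to_csv("selection.csv", index=False)
```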
License: Database Contents License (DbCL) v1.0, http://opendatacommons.org/licenses/dbcl/1.0/
Named Entity Recognition (NER) is the task of categorizing the entities in a text into categories such as names of persons, locations, organizations, etc.
Each row in the CSV file contains a complete sentence, a list of POS tags for each word in the sentence, and a list of NER tags for each word in the sentence.
You can use a pandas DataFrame to read and manipulate this dataset.
Since each row in the CSV file contains lists, if we read the file with pandas.read_csv() and try to get a tag list by indexing, the result will be a string. ```
data['tag'][0]
"['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-gpe', 'O', 'O', 'O', 'O', 'O']"
type(data['tag'][0])
str
You can use the following to convert it back to list type:
from ast import literal_eval
literal_eval(data['tag'][0])
['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-gpe', 'O', 'O', 'O', 'O', 'O']
type(literal_eval(data['tag'][0]))
list
```
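To convert the whole column at once (the CSV file name is a placeholder, and the second column name is assumed to be 'pos'):
```
import pandas as pd
from ast import literal_eval

data = pd.read_csv("ner.csv")                       # placeholder file name
data["tag"] = data["tag"].apply(literal_eval)       # each row's tag string becomes a list
data["pos"] = data["pos"].apply(literal_eval)       # same for the POS-tag column, if present
```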
This dataset is taken from the Annotated Corpus for Named Entity Recognition dataset by Abhinav Walia and then processed.
The Annotated Corpus for Named Entity Recognition is an annotated corpus built on the GMB (Groningen Meaning Bank) corpus for entity classification, with enhanced and popular features produced by natural language processing applied to the data set.
Essential info about entities:
License: CC0 1.0 Universal (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/
I'm just getting into data science and wanted to see if there's any kind of pattern in preference for male or female singers in the Billboard charts over time (I don't believe there is, it's just a toy problem to work on).
Some findings: https://github.com/rkibria/rkibria.github.com/blob/master/singers_gender_1.md
The CSV has 3 columns: artist, gender, category. The data comes from Wikipedia where I scraped categories using the code here. I then concatenated the data with pandas in Python and wrote it to a CSV file.
NOTE: to load this CSV file into pandas you will need to use the latin1 encoding, or there will be an exception:
df = pd.read_csv('singers_gender.csv', encoding='latin1')
I want to apply this data to the huge Billboard Hot 100 database generated by Dayne Batten.
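A hedged sketch of such a join; the Billboard file and its column names are assumptions about that external dataset:
```
import pandas as pd

singers = pd.read_csv("singers_gender.csv", encoding="latin1")
billboard = pd.read_csv("billboard_hot100.csv")     # hypothetical external file

# Normalise artist names before joining to improve match rates.
singers["artist_key"] = singers["artist"].str.strip().str.lower()
billboard["artist_key"] = billboard["artist"].str.strip().str.lower()
merged = billboard.merge(singers[["artist_key", "gender"]], on="artist_key", how="left")
```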
License: CC0 1.0 Universal (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/
This is the identical dataset to the Tabular Playground Series - October 2022 competition, but compressed to feather files instead of CSV to enable faster processing.
It can be parsed with pandas.read_feather(path).
The following description is copied from the competition:
The dataset consists of sequences of snapshots of the state of a Rocket League match, including position and velocity of all players and the ball, as well as extra information. The goal of the competition is to predict -- from a given snapshot in the game -- for each team, the probability that they will score within the next 10 seconds of game time.
The data was taken from professional Rocket League matches. Each event consists of a chronological series of frames recorded at 10 frames per second. All events begin with a kickoff, and most end in one team scoring a goal, but some are truncated and end with no goal scored due to circumstances which can cause gameplay strategies to shift, for example 1) nearing end of regulation (where the game continues until the ball touches the ground) or 2) becoming non-competitive, eg one team winning by 3+ goals with little time remaining.
For i in [0, 6): - boost{i}_timer: Time in seconds until big boost orb i respawns, or 0 if it's available. Big boost orbs grant a full 100 boost to a player driving over it. The orb (x, y) locations are roughly [ (-61.4, -81.9), (61.4, -81.9), (-71.7, 0), (71.7, 0), (-61.4, 81.9), (61.4, 81.9) ] with z = 0. (Players can also gain boost from small boost pads across the map, but we do not capture those pads in this dataset).
player_scoring_next (train only): Which player scores at the end of the current event, in [0, 6), or -1 if the event does not end in a goal.
team_scoring_next (train only): Which team scores at the end of the current event (A or B), or NaN if the event does not end in a goal.
team_[A|B]_scoring_within_10sec (train only): [Target columns] Value of 1 if team_scoring_next == [A|B] and time_before_event is in [-10, 0], otherwise 0.
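For reference, the targets can be re-derived from that definition; a minimal sketch, assuming the training data has been loaded from one of the provided feather files:
```
import pandas as pd

train = pd.read_feather("train.feather")   # file name is an assumption

in_window = train["time_before_event"].between(-10, 0)
train["team_A_scoring_within_10sec"] = ((train["team_scoring_next"] == "A") & in_window).astype(int)
train["team_B_scoring_within_10sec"] = ((train["team_scoring_next"] == "B") & in_window).astype(int)
```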
id (test and submission only): Unique identifier for each test row. Your submission should be a pair of team_A_scoring_within_10sec and team_B_scoring_within_10sec probability predictions for each id, where your predictions can range the real numbers from [0,
License: CC0 1.0 Universal (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
Discover the limitless power of paragraphs with the DROP dataset! This adversary-crafted, 96k-question benchmark is a text-based exploration into complex discrete reasoning tasks. With its wide range of natural language understanding tasks, DROP allows you to take your NLP models and techniques to the next level by building powerful systems that can tackle more intricate challenges. Unprecedented in its complexity, DROP is an invaluable tool for redefining what's possible with natural language processing and creating a brighter future for our connected world. Unlock the potential within your paragraphs with DROP today!
How to Use the DROP Dataset
The DROP dataset is an excellent resource for natural language understanding tasks, allowing users to explore the possibilities of discrete reasoning using text. This guide will provide an overview of how to get started and take full advantage of this powerful dataset.
Step 1: Explore the Dataset Structure
The DROP dataset contains two CSV files: train.csv and validation.csv. The train file contains 96k questions and answers related to natural language understanding tasks, while the validation file consists of questions and answers designed to evaluate a model's performance on the task at hand. Each row contains four columns: passage, passage_sourceidx, answers_texts, and answers_spans.
The 'passage' column holds the text that corresponds to a given question; 'passage_sourceidx' indicates the source document from which each passage was extracted; 'answers_texts' provides strings giving part or all of a given answer; and 'answers_spans' indicates which part (or parts) of each passage are relevant for answering the question correctly, as integer indices of starting positions within the passage, from the first position (0) through the last position (length - 1).
Step 2: Pre-Processing Your Data With pandas/python libraries
After exploring both CSV files to understand their context, it is time to pre-process your data. You can use an existing Python library such as pandas to cleanse noisy data, for example by deleting empty cells, rows, or values, depending on what your problem requires, before training your model on it. You can also decide whether it makes sense to split the passages into smaller chunks with fewer words so that they are easier to work with directly in code.
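A small sketch of that kind of clean-up with pandas (assuming the column names listed above):
```
import pandas as pd

train = pd.read_csv("train.csv")

# Drop fully empty rows, exact duplicates, and rows without a passage.
train = train.dropna(how="all").drop_duplicates()
train = train[train["passage"].notna()]
```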
Step 3: Utilize Natural Language Processing Toolsets For Efficiency
Once you have taken the necessary preprocessing steps above, you can further improve efficiency by using existing NLP toolkits such as spaCy, which cover many needs that arise when dealing with large amounts of real-world data.
spaCy offers quick implementations of complex tasks such as extracting relevant entities, ready-to-use trained tokenization models, part-of-speech tagging, and customizable pipelines, as well as word embeddings that capture underlying semantic meaning.
- Developing natural language processing algorithms with the ability to detect complex patterns in text and context.
- Applying advanced logical operators to understand the relationship between individual concepts in a text passage.
- Creating models and systems capable of understanding multiple tasks simultaneously using a single dataset
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: validation.csv | Column name | Description | |:------------------|:----------------------------------------------------------------| | passage ...
License: CC0 1.0 Universal (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/
By Ændrew Rininsland [source]
This dataset provides an eye-opening look into the characters and actors involved in the globally acclaimed TV series Game of Thrones. By examining the screen time and episode count for each character, as well as the IMDB URL of the actor or actress portraying them, one can gain remarkable insight into which characters have seized the spotlight in this epic saga. Compiled by ninewheels0 on IMDB, this dataset took a long time to amass and deserves appreciation. Each character is listed with their screen time measured in minutes with fractional parts (i.e., 1.5 minutes means one minute and thirty seconds). Since each character's screen time reflects how much of them viewers will remember, you can see how each actor has entered our lives across multiple seasons.
How to Use the GoT Characters Screen Time Dataset
This Kaggle dataset contains information about the screen time of characters in the Game of Thrones TV series, including their name, IMDB URL, screen time (in minutes with fractional parts), the number of episodes they appeared in, and the actor/actress portraying them. It is a helpful resource for those who want to further study character arcs and build theories around key points in the story.
First off, make sure you acknowledge ninewheels0 on IMDB, who created this list before it was uploaded to Kaggle. Read the “About this dataset” section carefully before getting started so that all sources and credits are given out properly.
To begin studying and analyzing this data set, you can use various software tools for analyzing and visualizing large data sets. Python's pandas library lets you easily study every aspect provided through columns such as name or portrayed_by_name, and Tableau lets you turn selected columns into charts or graphs so that patterns in large datasets are easily found. Tools such as Excel can also be used for similar purposes, but they are far less convenient unless only a few interactions between different elements of this CSV file are needed.
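For example, a quick pandas look at the most-featured characters (the column names follow the description above but should be checked against the actual file header):
```
import pandas as pd

got = pd.read_csv("GOT_screentimes.csv")

# Assumed column names: name, portrayed_by_name, screentime, episodes.
top10 = got.sort_values("screentime", ascending=False).head(10)
print(top10[["name", "portrayed_by_name", "screentime", "episodes"]])
```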
When analyzing these datasets it is important to note down the key questions you want to answer and to understand what information is already present: what correlations exist? What could affect a certain element? Is there something specific you want to uncover through the analysis? After deciding on those, take a look at the distinct elements present in each column, such as its highest (max) or lowest (min) values. Then check that the patterns you find hold up after accounting for outliers before making any assumptions; outliers can influence results more than expected, sometimes without providing much insight and instead confusing the story further. Knowing how each element interacts with the other variables in the dataset helps when looking into the relationship between two separate items in the same file.
Finally, after taking the steps described above, you can start drawing parallels between different parts of the dataset. Once a sufficient number of observations have been made, you may have enough evidence to finish most aspects of the research phase, such as confirming or rejecting a hypothesis.
- Create an interactive feature on a website/app to compare the screentime across all characters in Game of Thrones
- Analyze how the screentime of characters evolve over time and seasons
- Using ML algorithms, explore different patterns in the data to identify relationships between screen time and other factors such as character gender, type etc
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: GOT_screentimes.csv | Column...
License: CC0 1.0 Universal (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
This dataset contains the text of a remarkable collection of short stories known as the TinyStories Corpus. With over 2,000 annotated stories, it is populated with an array of diverse styles and genres from multiple sources. This corpus is enriched by intricate annotations across each narrative content, making it a valuable resource for narrative text classification. The text field in each row includes the entirety of each story that can be used to identify plots, characters and other features associated with story-telling techniques. Through this collection of stories, users will gain an extensive insight into a wide range of narratives which could be used to produce powerful machine learning models for Narrative Text Classification
In this dataset, each row contains a short story along with its associated labels for narrative text classification tasks. The data consists of the following:
- text: The story text itself (string)
- validation.csv: Contains a set of short stories for validation (dataframe)
- train.csv: Contains the text of short stories used for narrative text classification (dataframe)
The data contained in both files can be used for various types of machine learning tasks related to narrative text classification. These include but are not limited to experiments such as determining story genres, predicting user reactions, sentiment analysis etc.
To get started, download both the validation and train CSV files from the Kaggle datasets page and save them on your computer or local environment. Once downloaded, you may need to preprocess both datasets by cleaning up unnecessary or wrongly formatted values and removing duplicate entries, if any exist, before proceeding with your research or machine learning experiments, as these issues can greatly affect the accuracy of your results.
The next step is to load the two datasets into Python pandas DataFrames so that they can easily be manipulated and analyzed with common Natural Language Processing (NLP) tools. This requires only a few simple lines using pandas API functions such as read_csv(), .append(), or .concat(), depending on the kind of analysis or experiment you intend to conduct afterwards, whether in a Jupyter Notebook or with other machine learning frameworks popular among data scientists, such as scikit-learn, for tasks more complex than simple NLP operations.
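A minimal loading sketch (both files have a 'text' column, per the column descriptions below):
```
import pandas as pd

train = pd.read_csv("train.csv")
validation = pd.read_csv("validation.csv")

# Quick sanity check on story lengths.
print(train["text"].str.len().describe())
```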
With everything loaded correctly, you are ready to work on your desired applications, from exploring potential connections between different narratives or character traits via supervised machine learning models such as a Naive Bayes classifier, among many others, to extracting useful insights that reveal patterns underneath the texts. With all the necessary data loaded into your Python environment, feel free to make interesting discoveries and predictions from extensive analyses of this richly annotated TinyStories narrative dataset.
- Creating a text classification algorithm to automatically categorize short stories by genre.
- Developing an AI-based summarization tool to quickly summarize the main points in a story.
- Developing an AI-based story generator that can generate new stories based on existing ones in the dataset
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: validation.csv | Column name | Description | |:--------------|:--------------------------------| | text | The text of the story. (String) |
File: train.csv | Column name | Description | |:--------------|:----------------------------...