License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LifeSnaps Dataset Documentation
Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet, limited data exist on the association between in the wild large-scale physical activity patterns, sleep, stress, and overall health, and behavioral patterns and psychological measurements due to challenges in collecting and releasing such datasets, such as waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset, a multi-modal, longitudinal, and geographically-distributed dataset, containing a plethora of anthropological data, collected unobtrusively for the total course of more than 4 months by n=71 participants, under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types from second to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to make these data available openly to empower future research. We envision that releasing this large-scale dataset of multi-modal real-world data, will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction.
The following instructions will get you started with the LifeSnaps dataset and are complementary to the original publication.
Data Import: Reading CSV
For ease of use, we provide CSV files containing Fitbit, SEMA, and survey data at daily and/or hourly granularity. You can read the files via any programming language. For example, in Python, you can read the files into a Pandas DataFrame with the pandas.read_csv() command.
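As a minimal illustration (the file name below is a placeholder for whichever daily or hourly CSV you downloaded):
```
import pandas as pd

# "lifesnaps_daily.csv" is a placeholder name; point this at the daily or hourly
# CSV file you obtained from the dataset.
df = pd.read_csv("lifesnaps_daily.csv")
print(df.head())
```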
Data Import: Setting up a MongoDB (Recommended)
To take full advantage of the LifeSnaps dataset, we recommend that you use the raw, complete data via importing the LifeSnaps MongoDB database.
To do so, open the terminal/command prompt and run the following command for each collection in the DB. Ensure you have MongoDB Database Tools installed from here.
For the Fitbit data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c fitbit
For the SEMA data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c sema
For surveys data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c surveys
If you have access control enabled, then you will need to add the --username and --password parameters to the above commands.
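For example, a restore with authentication might look like the following (replace the placeholder credentials with your own):
mongorestore --host localhost:27017 --username your_username --password your_password -d rais_anonymized -c fitbit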
Data Availability
The MongoDB database contains three collections, fitbit, sema, and surveys, containing the Fitbit, SEMA3, and survey data, respectively. Similarly, the CSV files contain related information to these collections. Each document in any collection follows the format shown below:
{
_id:
License: Apache License 2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Project Description:
Title: Pandas Data Manipulation and File Conversion
Overview: This project aims to demonstrate the basic functionalities of Pandas, a powerful data manipulation library in Python. In this project, we will create a DataFrame, perform some data manipulation operations using Pandas, and then convert the DataFrame into both Excel and CSV formats.
Key Objectives:
Tools and Libraries Used:
Project Implementation:
DataFrame Creation:
Data Manipulation:
File Conversion: The DataFrame is exported to an Excel file using the to_excel() function and to a CSV file using the to_csv() function.
Expected Outcome:
Upon completion of this project, you will have gained a fundamental understanding of how to work with Pandas DataFrames, perform basic data manipulation tasks, and convert DataFrames into different file formats. This knowledge will be valuable for data analysis, preprocessing, and data export tasks in various data science and analytics projects.
Conclusion:
The Pandas library offers powerful tools for data manipulation and file conversion in Python. By completing this project, you will have acquired essential skills that are widely applicable in the field of data science and analytics. You can further extend this project by exploring more advanced Pandas functionalities or integrating it into larger data processing pipelines. In this project, we add several datasets, turn each into a DataFrame, save them to a single Excel file as separate sheets, and then convert that Excel file into CSV files.
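A rough sketch of the workflow described above (the data and file names are made up for illustration; writing Excel files requires an engine such as openpyxl):
```
import pandas as pd

# Build two small example DataFrames (illustrative data only).
students = pd.DataFrame({"name": ["Asha", "Ben"], "score": [91, 84]})
cities = pd.DataFrame({"city": ["Pune", "Delhi"], "population": [7.4, 32.9]})

# Save both DataFrames into a single Excel file as separate sheets.
with pd.ExcelWriter("output.xlsx") as writer:
    students.to_excel(writer, sheet_name="students", index=False)
    cities.to_excel(writer, sheet_name="cities", index=False)

# Convert each sheet of the Excel file into its own CSV file.
sheets = pd.read_excel("output.xlsx", sheet_name=None)  # dict of sheet_name -> DataFrame
for sheet_name, frame in sheets.items():
    frame.to_csv(f"{sheet_name}.csv", index=False)
```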
Overview
This dataset re-shares cartographic and demographic data from the U.S. Census Bureau to provide an obvious supplement to Open Environments Block Group publications. These results do not reflect any proprietary or predictive model. Rather, they extract from Census Bureau results with some proportions and aggregation rules applied. For additional support or more detail, please see the Census Bureau citations below.
Cartographics refer to shapefiles shared in the Census TIGER/Line publications. Block Group areas are updated annually, with major revisions accompanying the Decennial Census at the turn of each decade. These shapes are useful for visualizing estimates as a map and relating geographies based upon geo-operations like overlapping. This data is kept in a geodatabase file format and requires the geopandas package and its supporting fiona and GDAL software.
Demographics are taken from popular variables in the American Community Survey (ACS), including age, race, income, education and family structure. This data simply requires CSV reader software or Python's pandas package.
More details on the ACS variables selected and derivation rules applied can be found in the commentary docstrings in the source code found here: https://github.com/OpenEnvironments/blockgroupdemographics.
## Files
While the demographic data has many columns, the cartographic data has a single very large column called "geometry" storing the many-point boundaries of each shape. So, this process saves the data separately: the demographic columns go in a CSV file named YYYYblockgroupdemographics.csv, and the cartographic column, 'geometry', is shared as a file named YYYYblockgroupdemographics-geometry.pkl. This file needs an installation of geopandas, fiona and GDAL software.
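A minimal loading sketch, assuming the 2021 vintage of the files and that the geometry pickle was written by geopandas (so geopandas, fiona and GDAL must be installed before unpickling it):
```
import pandas as pd

# The YYYY prefix below (2021) is only an example; use the vintage you downloaded.
demographics = pd.read_csv("2021blockgroupdemographics.csv")

# The geometry column ships separately as a pickle; restoring the geometry objects
# requires geopandas (with fiona/GDAL) to be importable in the environment.
geometry = pd.read_pickle("2021blockgroupdemographics-geometry.pkl")
```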
The train data of the Riiid competition is a large dataset of over 100 million rows and 10 columns that does not fit into the Kaggle Notebook's RAM using the default pandas read_csv, which prompted a search for alternative approaches and formats.
This dataset provides the Riiid competition train data in different formats.
Reading the CSV file for the Riiid competition took a huge amount of time and memory. This inspired me to convert the CSV into different file formats so that they can be loaded easily in a Kaggle kernel.
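A rough sketch of such a conversion with pandas (the output formats here are choices for illustration, not part of the original dataset; feather/parquet writing requires pyarrow):
```
import pandas as pd

# Read the large CSV once, then persist it in faster binary formats
# so later sessions can reload it far more quickly.
train = pd.read_csv("train.csv", low_memory=False)
train.to_feather("train.feather")    # requires pyarrow
train.to_parquet("train.parquet")    # requires pyarrow or fastparquet

# Subsequent loads:
train = pd.read_feather("train.feather")
```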
This dataset contains the files taken from the Microsoft Malware Prediction competition dataset, converted from CSV to HDF5 format.
The purpose of this conversion from CSV to HDF5 is to work with VAEX tools rather than Pandas. Pandas is not effective at reading huge CSV files containing about a million rows or more, whereas VAEX handles them comfortably. Check this link for a better understanding of why VAEX is better than Pandas when handling huge datasets.
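A minimal sketch of opening one of the converted files with VAEX (the file name is an assumption):
```
import vaex

# vaex memory-maps the HDF5 file instead of loading every row into RAM,
# so even very large files open almost instantly.
df = vaex.open("train.hdf5")  # substitute the actual file name from this dataset
print(df.shape)
print(df.head(5))
```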
An upvote would be encouraging.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains supplementary data to the journal article 'Redocking the PDB' by Flachsenberg et al. (https://doi.org/10.1021/acs.jcim.3c01573)[1]. In this paper, we described two datasets: the PDBScan22 dataset, with a large set of 322,051 macromolecule–ligand binding sites generally suitable for redocking, and the PDBScan22-HQ dataset, with 21,355 binding sites passing different structure quality filters. These datasets were further characterized by calculating properties of the ligand (e.g., molecular weight), properties of the binding site (e.g., volume), and structure quality descriptors (e.g., crystal structure resolution). Additionally, we performed redocking experiments with our novel JAMDA structure preparation and docking workflow[1] and with AutoDock Vina[2,3]. Details for all these experiments and the dataset composition can be found in the journal article[1]. Here, we provide all the datasets, i.e., the PDBScan22 and PDBScan22-HQ datasets as well as the docking results and the additionally calculated properties (for the ligand, the binding sites, and structure quality descriptors). Furthermore, we give a detailed description of their content (i.e., the data types and a description of the column values). All datasets consist of CSV files with the actual data and associated metadata JSON files describing their content. The CSV/JSON files are compliant with the CSV on the Web standard (https://csvw.org/).
General hints
All docking experiment results consist of two CSV files, one with general information about the docking run (e.g., was it successful?) and one with individual pose results (i.e., score and RMSD to the crystal structure). All files (except for the docking pose tables) can be indexed uniquely by the column tuple '(pdb, name)' containing the PDB code of the complex (e.g., 1gm8) and the name of the ligand (in the format '_', e.g., 'SOX_B_1559'). All files (except for the docking pose tables) have exactly the same number of rows as the dataset they were calculated on (e.g., PDBScan22 or PDBScan22-HQ). However, some CSV files may have missing values (see also the JSON metadata files) in some or even all columns (except for 'pdb' and 'name'). The docking pose tables also contain the 'pdb' and 'name' columns. However, these alone are not unique but only together with the 'rank' column (i.e., there might be multiple poses for each docking run or none).
Example usage
Using the pandas library (https://pandas.pydata.org/) in Python, we can calculate the number of protein-ligand complexes in the PDBScan22-HQ dataset with a top-ranked pose RMSD to the crystal structure ≤ 2.0 Å in the JAMDA redocking experiment and a molecular weight between 100 Da and 200 Da:
import pandas as pd
df = pd.read_csv('PDBScan22-HQ.csv')
df_poses = pd.read_csv('PDBScan22-HQ_JAMDA_NL_NR_poses.csv')
df_properties = pd.read_csv('PDBScan22_ligand_properties.csv')
merged = df.merge(df_properties, how='left', on=['pdb', 'name'])
merged = merged[(merged['MW'] >= 100) & (merged['MW'] <= 200)].merge(df_poses[df_poses['rank'] == 1], how='left', on=['pdb', 'name'])
nof_successful_top_ranked = (merged['rmsd_ai'] <= 2.0).sum()
nof_no_top_ranked = merged['rmsd_ai'].isna().sum()
Datasets
PDBScan22.csv: This is the PDBScan22 dataset[1]. This dataset was derived from the PDB[4]. It contains macromolecule–ligand binding sites (defined by PDB code and ligand identifier) that can be read by the NAOMI library[5,6] and pass basic consistency filters.
PDBScan22-HQ.csv: This is the PDBScan22-HQ dataset[1]. It contains macromolecule–ligand binding sites from the PDBScan22 dataset that pass certain structure quality filters described in our publication[1].
PDBScan22-HQ-ADV-Success.csv: This is a subset of the PDBScan22-HQ dataset without 336 binding sites where AutoDock Vina[2,3] fails.
PDBScan22-HQ-Macrocycles.csv: This is a subset of the PDBScan22-HQ dataset without the 336 binding sites where AutoDock Vina[2,3] fails, containing only molecules with macrocycles of at least ten atoms.
Properties for PDBScan22
PDBScan22_ligand_properties.csv: Conformation-independent properties of all ligand molecules in the PDBScan22 dataset. Properties were calculated using an in-house tool developed with the NAOMI library[5,6].
PDBScan22_StructureProfiler_quality_descriptors.csv: Structure quality descriptors for the binding sites in the PDBScan22 dataset, calculated using the StructureProfiler tool[7].
PDBScan22_basic_complex_properties.csv: Simple properties of the binding sites in the PDBScan22 dataset. Properties were calculated using an in-house tool developed with the NAOMI library[5,6].
Properties for PDBScan22-HQ
PDBScan22-HQ_DoGSite3_pocket_descriptors.csv: Binding site descriptors calculated for the binding sites in the PDBScan22-HQ dataset using the DoGSite3 tool[8].
PDBScan22-HQ_molecule_types.csv: Assignment of ligands in the PDBScan22-HQ dataset (without the 336 binding sites where AutoDock Vina fails) to different molecular classes (i.e., drug-like, fragment-like, oligosaccharide, oligopeptide, cofactor, macrocyclic). A detailed description of the assignment can be found in our publication[1].
Docking results on PDBScan22
PDBScan22_JAMDA_NL_NR.csv: Docking results of JAMDA[1] on the PDBScan22 dataset. This is the general overview for the docking runs; the pose results are given in 'PDBScan22_JAMDA_NL_NR_poses.csv'. For this experiment, the ligand was not considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was disabled.
PDBScan22_JAMDA_NL_NR_poses.csv: Pose scores and RMSDs for the docking results of JAMDA[1] on the PDBScan22 dataset. For this experiment, the ligand was not considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was disabled.
Docking results on PDBScan22-HQ
PDBScan22-HQ_JAMDA_NL_NR.csv: Docking results of JAMDA[1] on the PDBScan22-HQ dataset. This is the general overview for the docking runs; the pose results are given in 'PDBScan22-HQ_JAMDA_NL_NR_poses.csv'. For this experiment, the ligand was not considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was disabled.
PDBScan22-HQ_JAMDA_NL_NR_poses.csv: Pose scores and RMSDs for the docking results of JAMDA[1] on the PDBScan22-HQ dataset. For this experiment, the ligand was not considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was disabled.
PDBScan22-HQ_JAMDA_NL_WR.csv: Docking results of JAMDA[1] on the PDBScan22-HQ dataset. This is the general overview for the docking runs; the pose results are given in 'PDBScan22-HQ_JAMDA_NL_WR_poses.csv'. For this experiment, the ligand was not considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was enabled.
PDBScan22-HQ_JAMDA_NL_WR_poses.csv: Pose scores and RMSDs for the docking results of JAMDA[1] on the PDBScan22-HQ dataset. For this experiment, the ligand was not considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was enabled.
PDBScan22-HQ_JAMDA_NW_NR.csv: Docking results of JAMDA[1] on the PDBScan22-HQ dataset. This is the general overview for the docking runs; the pose results are given in 'PDBScan22-HQ_JAMDA_NW_NR_poses.csv'. For this experiment, the ligand was not considered during preprocessing of the binding site, all water molecules were removed from the binding site during preprocessing, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was disabled.
PDBScan22-HQ_JAMDA_NW_NR_poses.csv: Pose scores and RMSDs for the docking results of JAMDA[1] on the PDBScan22-HQ dataset. For this experiment, the ligand was not considered during preprocessing of the binding site, all water molecules were removed from the binding site during preprocessing, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was disabled.
PDBScan22-HQ_JAMDA_NW_WR.csv: Docking results of JAMDA[1] on the PDBScan22-HQ dataset. This is the general overview for the docking runs; the pose results are given in 'PDBScan22-HQ_JAMDA_NW_WR_poses.csv'. For this experiment, the ligand was not considered during preprocessing of the binding site, all water molecules were removed from the binding site during preprocessing, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was enabled.
PDBScan22-HQ_JAMDA_NW_WR_poses.csv: Pose scores and RMSDs for the docking results of JAMDA[1] on the PDBScan22-HQ dataset. For this experiment, the ligand was not considered during preprocessing of the binding site, all water molecules were removed from the binding site during preprocessing, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was enabled.
PDBScan22-HQ_JAMDA_WL_NR.csv: Docking results of JAMDA[1] on the PDBScan22-HQ dataset. This is the general overview for the docking runs; the pose results are given in 'PDBScan22-HQ_JAMDA_WL_NR_poses.csv'. For this experiment, the ligand was considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was disabled.
PDBScan22-HQ_JAMDA_WL_NR_poses.csv: Pose scores and RMSDs for the docking results of JAMDA[1] on the PDBScan22-HQ dataset. For this experiment, the ligand was considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand
License: Database Contents License (DbCL) v1.0, http://opendatacommons.org/licenses/dbcl/1.0/
This dataset contains daily OHLCV data for ~2000 Indian stocks listed on the National Stock Exchange, covering their full trading history. The columns are multi-index columns, so this needs to be taken into account when reading and using the data.
Source: Yahoo Finance
Type: All files are in CSV format.
Currency: INR
All the tickers have been collected from here : https://www.nseindia.com/market-data/securities-available-for-trading
If using pandas, the following function is a utility to read any of the CSV files:
```
import pandas as pd

def read_ohlcv(filename):
    """Read a given OHLCV data file downloaded from yfinance."""
    return pd.read_csv(
        filename,
        skiprows=[0, 1, 2],  # remove the multi-index rows that cause trouble
        names=["Date", "Close", "High", "Low", "Open", "Volume"],
        index_col="Date",
        parse_dates=["Date"],
    )
```
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Objective
The primary objective of this study was to analyze CpG dinucleotide dynamics in coronaviruses by comparing Wuhan-Hu-1 with its closest and most distant relatives. Heatmaps were generated to visualize CpG counts and O/E ratios across intergenic regions, providing a clear depiction of conserved and divergent CpG patterns.
Methods
1. Data Collection
Source: The dataset includes CpG counts and O/E ratios for various coronaviruses, extracted from publicly available genomic sequences.
Format: Data was compiled into a CSV file containing columns for intergenic regions, CpG counts, and O/E ratios for each virus.
2. Preprocessing
Data Cleaning: Missing values (NaN), infinite values (inf, -inf), and blank entries were handled using Python's pandas library. Missing values were replaced with column means, and infinite values were capped at a large finite value (1e9).
Reshaping: The data was reshaped into matrices for CpG counts and O/E ratios using pandas' melt() and pivot() functions.
3. Distance Calculation
Euclidean Distance: Pairwise Euclidean distances were calculated between Wuhan-Hu-1 and other viruses using the scipy.spatial.distance.euclidean function. Distances were computed separately for CpG counts and O/E ratios, and the total distance was derived as the sum of both metrics.
4. Identification of Closest and Distant Relatives
The virus with the smallest total distance was identified as the closest relative. The virus with the largest total distance was identified as the most distant relative.
5. Heatmap Generation
Tools: Heatmaps were generated using Python's seaborn library (sns.heatmap) and matplotlib for visualization.
Parameters: Heatmaps were annotated with numerical values for clarity. A color gradient (coolwarm) was used to represent varying CpG counts and O/E ratios. Titles and axis labels were added to describe the comparison between Wuhan-Hu-1 and its relatives.
Results
Closest Relative: The closest relative to Wuhan-Hu-1 was identified based on the smallest Euclidean distance. Heatmaps for CpG counts and O/E ratios show high similarity in specific intergenic regions.
Most Distant Relative: The most distant relative was identified based on the largest Euclidean distance. Heatmaps reveal significant differences in CpG dynamics compared to Wuhan-Hu-1.
Tools and Libraries
Programming Language: Python 3.13
Libraries: pandas (data manipulation and cleaning), numpy (numerical operations and handling of missing/infinite values), scipy.spatial.distance (Euclidean distances), seaborn (heatmaps), matplotlib (additional visualization enhancements).
File Formats: Input: CSV files containing CpG counts and O/E ratios. Output: PNG images of heatmaps.
Files Included
CSV File: Contains the raw data of CpG counts and O/E ratios for all viruses.
Heatmap Images: Heatmaps for CpG counts and O/E ratios comparing Wuhan-Hu-1 with its closest and most distant relatives.
Python Script: Full Python code used for data processing, distance calculation, and heatmap generation.
Usage Notes
Researchers can use this dataset to further explore the evolutionary dynamics of CpG dinucleotides in coronaviruses. The Python script can be adapted to analyze other viral genomes or datasets. Heatmaps provide a visual summary of CpG dynamics, aiding in hypothesis generation and experimental design.
Acknowledgments
Special thanks to the open-source community for developing tools like pandas, numpy, seaborn, and matplotlib. This work was conducted as part of an independent research project in molecular biology and bioinformatics.
License
This dataset is shared under the CC BY 4.0 License, allowing others to share and adapt the material as long as proper attribution is given.
DOI: 10.6084/m9.figshare.28736501
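A condensed sketch of the distance and heatmap steps described above; the CSV file name and column names are assumptions about the data layout, not taken from the published script:
```
import pandas as pd
import seaborn as sns
from scipy.spatial.distance import euclidean

# Hypothetical layout: one row per (virus, intergenic_region) with a cpg_count value.
df = pd.read_csv("cpg_data.csv")
counts = df.pivot(index="virus", columns="intergenic_region", values="cpg_count")
counts = counts.fillna(counts.mean())  # replace missing values with column means

reference = counts.loc["Wuhan-Hu-1"]
distances = {virus: euclidean(reference, counts.loc[virus])
             for virus in counts.index if virus != "Wuhan-Hu-1"}
closest = min(distances, key=distances.get)

# Annotated heatmap comparing Wuhan-Hu-1 with its closest relative.
sns.heatmap(counts.loc[["Wuhan-Hu-1", closest]], annot=True, cmap="coolwarm")
```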
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains aggregated and sub-metered power consumption data from a two-person apartment in Germany. Data was collected from March 5 to September 4, 2025, spanning 6 months. It includes an aggregate reading from a main smart meter and individual readings from 40 smart plugs, smart relays, and smart power meters monitoring various appliances.
The dataset can be downloaded here: https://doi.org/10.5281/zenodo.17159850
Because the data contains long off periods of zeros, the CSV file compresses very well.
To extract it, use: xz -d DARCK.csv.xz.
The compression yields a 97% smaller file (from 4 GB to 90.9 MB).
To use the dataset in Python, you can, for example, load the CSV file into a pandas DataFrame:
import pandas as pd
df = pd.read_csv("DARCK.csv", parse_dates=["time"])
The main meter was monitored using an infrared reading head magnetically attached to the infrared interface of the meter. An ESP8266 flashed with Tasmota decodes the binary datagrams and forwards the Watt readings to the MQTT broker. Individual appliances were monitored using a combination of Shelly Plugs (for outlets), Shelly 1PM (for wired-in devices like ceiling lights), and Shelly PM Mini (for each of the three phases of the oven). All devices reported to a central InfluxDB database via Home Assistant running in docker on a Dell OptiPlex 3020M.
The dataset is provided as a single comma-separated value (CSV) file, DARCK.csv.
| Column Name | Data Type | Unit | Description |
|---|---|---|---|
| time | datetime | - | Timestamp for the reading in YYYY-MM-DD HH:MM:SS |
| main | float | Watt | Total aggregate power consumption for the apartment, measured at the main electrical panel. |
| [appliance_name] | float | Watt | Power consumption of an individual appliance (e.g., lightbathroom, fridge, sherlockpc). See Section 8 for a full list. |
| Aggregate Columns | | | |
| aggr_chargers | float | Watt | The sum of sherlockcharger, sherlocklaptop, watsoncharger, watsonlaptop, watsonipadcharger, kitchencharger. |
| aggr_stoveplates | float | Watt | The sum of stoveplatel1 and stoveplatel2. |
| aggr_lights | float | Watt | The sum of lightbathroom, lighthallway, lightsherlock, lightkitchen, lightlivingroom, lightwatson, lightstoreroom, fcob, sherlockalarmclocklight, sherlockfloorlamphue, sherlockledstrip, livingfloorlamphue, sherlockglobe, watsonfloorlamp, watsondesklamp and watsonledmap. |
| Analysis Columns | | | |
| inaccuracy | float | Watt | As no electrical device bypasses a power meter, the true inaccuracy can be assessed. It is the absolute error between the sum of individual measurements and the mains reading. A 30W offset is applied to the sum since the measurement devices themselves draw power which is otherwise unaccounted for. |
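A hedged sketch of re-deriving the inaccuracy column from its definition above; exactly which columns count as individual appliances is an assumption here (the aggr_* columns are excluded to avoid double counting):
```
import pandas as pd

df = pd.read_csv("DARCK.csv", parse_dates=["time"])

# Individual appliance columns: everything except time, main, inaccuracy and the
# aggregate columns. A 30 W offset accounts for the measurement devices' own draw.
appliance_cols = [c for c in df.columns
                  if c not in ("time", "main", "inaccuracy") and not c.startswith("aggr_")]
recomputed_inaccuracy = (df[appliance_cols].sum(axis=1) + 30 - df["main"]).abs()
```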
The final dataset was generated from two raw data sources (meter.csv and shellies.csv) using a comprehensive postprocessing pipeline.
Main meter (main) postprocessing
The aggregate power data required several cleaning steps to ensure accuracy.
Shelly devices (shellies) postprocessing
The Shelly devices are not prone to the same burst issue as the ESP8266. They push a new reading at every change in power drawn. If no power change is observed, or the change is too small (less than a few watts), the reading is pushed once a minute, together with a heartbeat. When a device turns on or off, intermediate power values are published, which leads to sub-second values that need to be handled.
Sub-second readings were resampled onto a regular one-second time index with .resample('1s').last().ffill(). NaN values (e.g., from before a device was installed) were filled with 0.0, assuming zero consumption. During analysis, two significant unmetered load events were identified and manually corrected to improve the accuracy of the aggregate reading. The error column (inaccuracy) was recalculated after these corrections.
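A small sketch of that resampling step for one hypothetical device series (the timestamps are made up):
```
import pandas as pd

readings = pd.Series(
    [12.5, 0.0, 3.1],
    index=pd.to_datetime([
        "2025-03-05 00:00:00.4",   # sub-second values around a switching event
        "2025-03-05 00:00:00.9",
        "2025-03-05 00:01:05.0",   # next heartbeat-driven reading
    ]),
)

# Keep the last value per second, forward-fill between readings,
# and treat periods before installation as zero consumption.
regular = readings.resample("1s").last().ffill().fillna(0.0)
```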
The following table lists the column names with an explanation where needed. As Watson moved at the beginning of June, some metering plugs changed their appliance.
License: Public Domain, https://creativecommons.org/licenses/publicdomain/
This repository contains data on 17,420 DOIs cited in the IPCC Working Group 2 contribution to the Sixth Assessment Report, and the code to link them to the dataset built at the Curtin Open Knowledge Initiative (COKI).
References were extracted from the report's PDFs (downloaded 2022-03-01) via Scholarcy and exported as RIS and BibTeX files. DOI strings were identified from the RIS files by pattern matching and saved as a CSV file. The list of DOIs for each chapter and cross-chapter paper was processed using a custom Python script to generate a pandas DataFrame, which was saved as a CSV file and uploaded to Google BigQuery.
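An illustrative version of that extraction step; the regular expression and file names here are assumptions, not the project's exact code:
```
import re
import pandas as pd

# Simple DOI matcher; the pattern used in the original pipeline may differ.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+", re.IGNORECASE)

def extract_dois(ris_text: str, chapter: str) -> pd.DataFrame:
    """Return a DataFrame of unique DOI strings found in one chapter's RIS export."""
    dois = sorted(set(DOI_PATTERN.findall(ris_text)))
    return pd.DataFrame({"chapter": chapter, "doi": dois})

with open("chapter02.ris") as handle:          # hypothetical chapter file
    chapter_dois = extract_dois(handle.read(), chapter="Chapter 2")
chapter_dois.to_csv("chapter02_dois.csv", index=False)
```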
We used the main object table of the Academic Observatory, which combines information from Crossref, Unpaywall, Microsoft Academic, Open Citations, the Research Organization Registry and Geonames to enrich the DOIs with bibliographic information, affiliations, and open access status. A custom query was used to join and format the data and the resulting table was visualised in a Google DataStudio dashboard.
A brief descriptive analysis was provided as a blogpost on the COKI website.
The repository contains the following content:
Data:
Processing:
Outcomes:
Note on licenses:
Data are made available under CC0
Code is made available under Apache License 2.0
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The New York City bikeshare, Citi Bike, has a real time, public API. This API conforms to the General Bikeshare Feed Specification. As such, this API contains information about the number of bikes and docks available at every station in NYC.
Since 2016, I have been pinging the public API every 2 minutes and storing the results. This dataset contains all of these results, from 8/15/2016 - 12/8/2021. The data unfortunately comes in the form of a bunch of CSVs. I recognize that this is not the best format to read large datasets like this, but a CSV is still a pretty universal format! My suggestion would be to convert these CSVs to parquet or something similar if you plan to do lots of analysis on lots of files.
Originally, I set up an EC2 instance and pinged a legacy API (code). In 2019, I switched to pinging this API via a Lambda function (code).
As part of this 2019 switch, I also started pinging the station information API once per week in order to collect information about each station, such as the name, latitude and longitude. While this dataset contains columns for all of the station information, these columns are missing data between 2016 and 8/2019. It would probably be reasonable to backfill that data with the earliest info available for each station, although be warned that this is not guaranteed to be accurate.
In order to reduce the individual file size, the full dataset has been bucketed by station_id into 50 separate files. All historical data for a given station_id are in the same file, and the stations are randomly distributed across the 50 files.
As previously mentioned, station information is missing for all data earlier than 8/2019. I have included a column, missing_station_information, to indicate when this information is missing. You may wonder why I don't just create a separate station information file which can be joined to the file containing the time series. The reason is that the station information can technically change over time. When station information is provided in a given row, that information is accurate as of some point within the 7 days prior. This is because I pinged the station information weekly and then had to join it to the time series.
The CSV files are the result of a CREATE TABLE AS AWS Athena query using the TEXTFILE format. Consequently, null values are demarcated as \N. The two timestamp columns, station_status_last_reported and station_information_last_updated, are in units of POSIX/UNIX time (i.e., seconds since 1970-01-01 00:00:00 UTC). The following code may be helpful to get you started loading the data as a pandas DataFrame.
import pandas as pd

def read_csv(filename: str) -> pd.DataFrame:
    """
    Read DataFrame from a CSV file ``filename`` and convert to a
    preferred schema.
    """
    df = pd.read_csv(
        filename,
        sep=",",
        na_values="\\N",
        dtype={
            "station_id": str,
            # Use pandas Int16 dtype to allow for nullable integers
            "num_bikes_available": "Int16",
            "num_ebikes_available": "Int16",
            "num_bikes_disabled": "Int16",
            "num_docks_available": "Int16",
            "num_docks_disabled": "Int16",
            "is_installed": "Int16",
            "is_renting": "Int16",
            "is_returning": "Int16",
            "station_status_last_reported": "Int64",
            "station_name": str,
            "lat": float,
            "lon": float,
            "region_id": str,
            "capacity": "Int16",
            # Use pandas boolean dtype to allow for nullable booleans
            "has_kiosk": "boolean",
            "station_information_last_updated": "Int64",
            "missing_station_information": "boolean",
        },
    )
    # Read in timestamps as UNIX/POSIX epochs but then convert to the local
    # bike share timezone.
    df["station_status_last_reported"] = pd.to_datetime(
        df["station_status_last_reported"], unit="s", origin="unix", utc=True
    ).dt.tz_convert("US/Eastern")
    df["station_information_last_updated"] = pd.to_datetime(
        df["station_information_last_updated"], unit="s", origin="unix", utc=True
    ).dt.tz_convert("US/Eastern")
    return df
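Building on the helper above, one hedged way to implement the backfill idea mentioned earlier (fill missing station information with the earliest values observed for each station; the bucket file name is a placeholder, and the result is not guaranteed to be historically accurate):
```
df = read_csv("citibike_station_bucket_00.csv")   # hypothetical bucket file name

info_cols = ["station_name", "lat", "lon", "region_id", "capacity", "has_kiosk"]
df = df.sort_values("station_status_last_reported")
# Within each station, pull the earliest available info backwards in time,
# then forward-fill any remaining gaps.
df[info_cols] = df.groupby("station_id")[info_cols].transform(lambda col: col.bfill().ffill())
```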
The column names almost come directly from the station_status and station_information APIs. See the [GBFS schema](https://github.com/MobilityData/gbfs...
License: GNU General Public License v3.0, https://www.gnu.org/licenses/gpl-3.0.html
PostgreSQL 15.
SQL optimized database with the data from the Russian Federal Customs Service on the Russian external economic activity during 2016-2021.
This is the SQL version of the dataset created by the people listed in the "authors.pdf" file. The CSV dataset was converted into an optimized relational database by Vadim Rudakov. I will do my best to provide a detailed "about" file in the near future; for now I can only say that all the attributes from the single original table have been split into separate tables linked by foreign keys. The original dataset was too large (2 GB) and had a huge impact on memory and CPU. The SQL version is only 1.2 GB and lets researchers investigate the data even on slow machines, select the data they need, and save it to CSV for further work in pandas or other tools.
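A hedged sketch of pulling a selection into pandas and saving it to CSV; the connection string, database name, and table name are placeholders (requires SQLAlchemy and psycopg2):
```
import pandas as pd
from sqlalchemy import create_engine

# Placeholders: adjust the credentials, database name, and query to your setup.
engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/customs_db")

query = "SELECT * FROM declarations LIMIT 100000"   # hypothetical table name
selection = pd.read_sql(query, engine)
selection.to_csv("selection.csv", index=False)
```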
License: Database Contents License (DbCL) v1.0, http://opendatacommons.org/licenses/dbcl/1.0/
Named Entity Recognition (NER) is the task of categorizing the entities in a text into categories such as names of persons, locations, organizations, etc.
Each row in the CSV file contains a complete sentence, a list of POS tags for each word in the sentence, and a list of NER tags for each word in the sentence.
You can use a pandas DataFrame to read and manipulate this dataset.
Since each row in the CSV file contains lists, if we read the file with pandas.read_csv() and try to get a tag list by indexing, the result will be a string. ```
data['tag'][0]
"['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-gpe', 'O', 'O', 'O', 'O', 'O']"
type(data['tag'][0])
str
You can use the following to convert it back to list type:
from ast import literal_eval
literal_eval(data['tag'][0])
['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-gpe', 'O', 'O', 'O', 'O', 'O']
type(literal_eval(data['tag'][0]))
list
```
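To convert the whole column at once (the CSV file name is a placeholder, and the second column name is assumed to be 'pos'):
```
import pandas as pd
from ast import literal_eval

data = pd.read_csv("ner.csv")                       # placeholder file name
data["tag"] = data["tag"].apply(literal_eval)       # each row's tag string becomes a list
data["pos"] = data["pos"].apply(literal_eval)       # same for the POS-tag column, if present
```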
This dataset is taken from the Annotated Corpus for Named Entity Recognition dataset by Abhinav Walia and then processed.
The Annotated Corpus for Named Entity Recognition is an annotated corpus built on the GMB (Groningen Meaning Bank) corpus for entity classification, with enhanced and popular features produced by natural language processing applied to the data set.
Essential info about entities:
License: CC0 1.0 Universal (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/
I'm just getting into data science and wanted to see if there's any kind of pattern in preference for male or female singers in the Billboard charts over time (I don't believe there is, it's just a toy problem to work on).
Some findings: https://github.com/rkibria/rkibria.github.com/blob/master/singers_gender_1.md
The CSV has 3 columns: artist, gender, category. The data comes from Wikipedia where I scraped categories using the code here. I then concatenated the data with pandas in Python and wrote it to a CSV file.
NOTE: to load this CSV file into pandas you will need to use the latin1 encoding, or there will be an exception:
df = pd.read_csv('singers_gender.csv', encoding='latin1')
I want to apply this data to the huge Billboard Hot 100 database generated by Dayne Batten.
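A hedged sketch of such a join; the Billboard file and its column names are assumptions about that external dataset:
```
import pandas as pd

singers = pd.read_csv("singers_gender.csv", encoding="latin1")
billboard = pd.read_csv("billboard_hot100.csv")     # hypothetical external file

# Normalise artist names before joining to improve match rates.
singers["artist_key"] = singers["artist"].str.strip().str.lower()
billboard["artist_key"] = billboard["artist"].str.strip().str.lower()
merged = billboard.merge(singers[["artist_key", "gender"]], on="artist_key", how="left")
```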
License: CC0 1.0 Universal (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/
This is the identical dataset to the Tabular Playground Series - October 2022 competition, but compressed to feather files instead of CSV to enable faster processing.
It can be parsed with pandas.read_feather(path).
The following description is copied from the competition:
The dataset consists of sequences of snapshots of the state of a Rocket League match, including position and velocity of all players and the ball, as well as extra information. The goal of the competition is to predict -- from a given snapshot in the game -- for each team, the probability that they will score within the next 10 seconds of game time.
The data was taken from professional Rocket League matches. Each event consists of a chronological series of frames recorded at 10 frames per second. All events begin with a kickoff, and most end in one team scoring a goal, but some are truncated and end with no goal scored due to circumstances which can cause gameplay strategies to shift, for example 1) nearing end of regulation (where the game continues until the ball touches the ground) or 2) becoming non-competitive, eg one team winning by 3+ goals with little time remaining.
For i in [0, 6): - boost{i}_timer: Time in seconds until big boost orb i respawns, or 0 if it's available. Big boost orbs grant a full 100 boost to a player driving over it. The orb (x, y) locations are roughly [ (-61.4, -81.9), (61.4, -81.9), (-71.7, 0), (71.7, 0), (-61.4, 81.9), (61.4, 81.9) ] with z = 0. (Players can also gain boost from small boost pads across the map, but we do not capture those pads in this dataset).
player_scoring_next (train only): Which player scores at the end of the current event, in [0, 6), or -1 if the event does not end in a goal.
team_scoring_next (train only): Which team scores at the end of the current event (A or B), or NaN if the event does not end in a goal.
team_[A|B]_scoring_within_10sec (train only): [Target columns] Value of 1 if team_scoring_next == [A|B] and time_before_event is in [-10, 0], otherwise 0.
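For reference, the targets can be re-derived from that definition; a minimal sketch, assuming the training data has been loaded from one of the provided feather files:
```
import pandas as pd

train = pd.read_feather("train.feather")   # file name is an assumption

in_window = train["time_before_event"].between(-10, 0)
train["team_A_scoring_within_10sec"] = ((train["team_scoring_next"] == "A") & in_window).astype(int)
train["team_B_scoring_within_10sec"] = ((train["team_scoring_next"] == "B") & in_window).astype(int)
```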
id (test and submission only): Unique identifier for each test row. Your submission should be a pair of team_A_scoring_within_10sec and team_B_scoring_within_10sec probability predictions for each id, where your predictions can range the real numbers from [0,
License: CC0 1.0 Universal (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
Discover the limitless power of paragraphs with the DROP dataset! This adversary-crafted, 96k-question benchmark is a text-based exploration into complex discrete reasoning tasks. With its wide range of natural language understanding tasks, DROP allows you to take your NLP models and techniques to the next level by building powerful systems that can tackle more intricate challenges. Unprecedented in its complexity, DROP is an invaluable tool for redefining what's possible with natural language processing and creating a brighter future for our connected world. Unlock the potential within your paragraphs with DROP today!
How to Use the DROP Dataset
The DROP dataset is an excellent resource for natural language understanding tasks, allowing users to explore the possibilities of discrete reasoning using text. This guide will provide an overview of how to get started and take full advantage of this powerful dataset.
Step 1: Explore the Dataset Structure
The DROP dataset contains two CSV files: train.csv and validation.csv. The train file contains 96k questions and answers related to natural language understanding tasks, while the validation file consists of questions and answers designed to evaluate a model's performance on the task at hand. Each row contains four columns: passage, passage_sourceidx, answers_texts, and answers_spans.
The 'passage' column holds the text that corresponds to a given question; 'passage_sourceidx' indicates the source document from which each passage was extracted; 'answers_texts' provides strings giving part or all of a given answer; and 'answers_spans' indicates which part (or parts) of each passage are relevant for answering the question correctly, as integer indices of starting positions within the passage, from the first position (0) through the last position (length - 1).
Step 2: Pre-Processing Your Data With pandas/python libraries
After exploring both CSV files to understand their context, it is time to pre-process your data. You can use an existing Python library such as pandas to cleanse noisy data, for example by deleting empty cells, rows, or values, depending on what your problem requires, before training your model on it. You can also decide whether it makes sense to split the passages into smaller chunks with fewer words so that they are easier to work with directly in code.
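A small sketch of that kind of clean-up with pandas (assuming the column names listed above):
```
import pandas as pd

train = pd.read_csv("train.csv")

# Drop fully empty rows, exact duplicates, and rows without a passage.
train = train.dropna(how="all").drop_duplicates()
train = train[train["passage"].notna()]
```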
Step 3: Utilize Natural Language Processing Toolsets For Efficiency
Once you have taken the necessary preprocessing steps above, you can further improve efficiency by using existing NLP toolkits such as spaCy, which cover many needs that arise when dealing with large amounts of real-world data.
spaCy offers quick implementations of complex tasks such as extracting relevant entities, ready-to-use trained tokenization models, part-of-speech tagging, and customizable pipelines, as well as word embeddings that capture underlying semantic meaning.
- Developing natural language processing algorithms with the ability to detect complex patterns in text and context.
- Applying advanced logical operators to understand the relationship between individual concepts in a text passage.
- Creating models and systems capable of understanding multiple tasks simultaneously using a single dataset
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: validation.csv | Column name | Description | |:------------------|:----------------------------------------------------------------| | passage ...
License: CC0 1.0 Universal (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/
By Ændrew Rininsland [source]
This dataset provides an eye-opening look into the characters and actors involved in the globally acclaimed TV series Game of Thrones. By examining the screen time and episode count for each character, as well as the IMDB URL of the actor or actress portraying them, one can gain remarkable insight into which characters have seized the spotlight in this epic saga. Compiled by ninewheels0 on IMDB, this dataset took a long time to amass and deserves appreciation. Each character is listed with their screen time measured in minutes with fractional parts (i.e., 1.5 minutes means one minute and thirty seconds). Since each character's screen time reflects how much of them viewers will remember, you can see how each actor has entered our lives across multiple seasons.
How to Use the GoT Characters Screen Time Dataset
This Kaggle dataset contains information about the screen time of characters in the Game of Thrones TV series, including their name, IMDB URL, screen time (in minutes with fractional parts), the number of episodes they appeared in, and the actor/actress portraying them. It is a helpful resource for those who want to further study character arcs and build theories around key points in the story.
First off, make sure you acknowledge ninewheels0 on IMDB, who created this list before it was uploaded to Kaggle. Read the “About this dataset” section carefully before getting started so that all sources and credits are given out properly.
To begin studying and analyzing this data set, you can use various software tools for analyzing and visualizing large data sets. Python's pandas library lets you easily study every aspect provided through columns such as name or portrayed_by_name, and Tableau lets you turn selected columns into charts or graphs so that patterns in large datasets are easily found. Tools such as Excel can also be used for similar purposes, but they are far less convenient unless only a few interactions between different elements of this CSV file are needed.
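For example, a quick pandas look at the most-featured characters (the column names follow the description above but should be checked against the actual file header):
```
import pandas as pd

got = pd.read_csv("GOT_screentimes.csv")

# Assumed column names: name, portrayed_by_name, screentime, episodes.
top10 = got.sort_values("screentime", ascending=False).head(10)
print(top10[["name", "portrayed_by_name", "screentime", "episodes"]])
```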
When analyzing these datasets it is important to note down the key questions you want to answer and to understand what information is already present: what correlations exist? What could affect a certain element? Is there something specific you want to uncover through the analysis? After deciding on those, take a look at the distinct elements present in each column, such as its highest (max) or lowest (min) values. Then check that the patterns you find hold up after accounting for outliers before making any assumptions; outliers can influence results more than expected, sometimes without providing much insight and instead confusing the story further. Knowing how each element interacts with the other variables in the dataset helps when looking into the relationship between two separate items in the same file.
Finally, after taking the steps described above, you can start drawing parallels between different parts of the dataset. Once a sufficient number of observations have been made, you may have enough evidence to finish most aspects of the research phase, such as confirming or rejecting a hypothesis.
- Create an interactive feature on a website/app to compare the screentime across all characters in Game of Thrones
- Analyze how the screentime of characters evolve over time and seasons
- Using ML algorithms, explore different patterns in the data to identify relationships between screen time and other factors such as character gender, type etc
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: GOT_screentimes.csv | Column...
License: CC0 1.0 Universal (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
This dataset contains the text of a remarkable collection of short stories known as the TinyStories Corpus. With over 2,000 annotated stories, it is populated with an array of diverse styles and genres from multiple sources. This corpus is enriched by intricate annotations across each narrative content, making it a valuable resource for narrative text classification. The text field in each row includes the entirety of each story that can be used to identify plots, characters and other features associated with story-telling techniques. Through this collection of stories, users will gain an extensive insight into a wide range of narratives which could be used to produce powerful machine learning models for Narrative Text Classification
In this dataset, each row contains a short story along with its associated labels for narrative text classification tasks. The data consists of the following:
- text: The story text itself (string)
- validation.csv: Contains a set of short stories for validation (dataframe)
- train.csv: Contains the text of short stories used for narrative text classification (dataframe)
The data contained in both files can be used for various types of machine learning tasks related to narrative text classification. These include but are not limited to experiments such as determining story genres, predicting user reactions, sentiment analysis etc.
To get started, download both the validation and train CSV files from the Kaggle datasets page and save them on your computer or local environment. Once downloaded, you may need to preprocess both datasets by cleaning up unnecessary or wrongly formatted values and removing duplicate entries, if any exist, before proceeding with your research or machine learning experiments, as these issues can greatly affect the accuracy of your results.
The next step is to load the two datasets into Python pandas DataFrames so that they can easily be manipulated and analyzed with common Natural Language Processing (NLP) tools. This requires only a few simple lines using pandas API functions such as read_csv(), .append(), or .concat(), depending on the kind of analysis or experiment you intend to conduct afterwards, whether in a Jupyter Notebook or with other machine learning frameworks popular among data scientists, such as scikit-learn, for tasks more complex than simple NLP operations.
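A minimal loading sketch (both files have a 'text' column, per the column descriptions below):
```
import pandas as pd

train = pd.read_csv("train.csv")
validation = pd.read_csv("validation.csv")

# Quick sanity check on story lengths.
print(train["text"].str.len().describe())
```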
With everything loaded correctly, you are ready to work on your desired applications, from exploring potential connections between different narratives or character traits via supervised machine learning models such as a Naive Bayes classifier, among many others, to extracting useful insights that reveal patterns underneath the texts. With all the necessary data loaded into your Python environment, feel free to make interesting discoveries and predictions from extensive analyses of this richly annotated TinyStories narrative dataset.
- Creating a text classification algorithm to automatically categorize short stories by genre.
- Developing an AI-based summarization tool to quickly summarize the main points in a story.
- Developing an AI-based story generator that can generate new stories based on existing ones in the dataset
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: validation.csv | Column name | Description | |:--------------|:--------------------------------| | text | The text of the story. (String) |
File: train.csv | Column name | Description | |:--------------|:----------------------------...