Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LifeSnaps Dataset Documentation
Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet, limited data exist on the association between in-the-wild large-scale physical activity patterns, sleep, stress, and overall health, and behavioral patterns and psychological measurements, due to challenges in collecting and releasing such datasets, such as waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset, a multi-modal, longitudinal, and geographically-distributed dataset containing a plethora of anthropological data, collected unobtrusively over the course of more than 4 months by n=71 participants, under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types from second to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to make these data available openly to empower future research. We envision that releasing this large-scale dataset of multi-modal real-world data will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction.
The following instructions will get you started with the LifeSnaps dataset and are complementary to the original publication.
Data Import: Reading CSV
For ease of use, we provide CSV files containing Fitbit, SEMA, and survey data at daily and/or hourly granularity. You can read the files via any programming language. For example, in Python, you can read the files into a Pandas DataFrame with the pandas.read_csv() command.
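For instance, a daily-granularity export can be loaded as follows (the file name below is illustrative, not the actual name of a released file):

```python
import pandas as pd

# Hypothetical file name -- substitute the name of the LifeSnaps CSV you downloaded
daily_df = pd.read_csv("lifesnaps_daily.csv")
print(daily_df.head())
```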
Data Import: Setting up a MongoDB (Recommended)
To take full advantage of the LifeSnaps dataset, we recommend that you use the raw, complete data via importing the LifeSnaps MongoDB database.
To do so, open the terminal/command prompt and run the following command for each collection in the DB. Ensure you have the MongoDB Database Tools installed.
For the Fitbit data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c fitbit
For the SEMA data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c sema
For surveys data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c surveys
If you have access control enabled, then you will need to add the --username and --password parameters to the above commands.
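Once the collections are restored, they can be queried directly from Python. A minimal sketch using pymongo, assuming a default local MongoDB instance (the database and collection names follow the commands above):

```python
from pymongo import MongoClient

# Assumes MongoDB is running locally and the dump was restored as shown above
client = MongoClient("mongodb://localhost:27017")
db = client["rais_anonymized"]

print(db["fitbit"].count_documents({}))  # number of Fitbit documents
print(db["fitbit"].find_one())           # inspect the structure of one document
```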
Data Availability
The MongoDB database contains three collections, fitbit, sema, and surveys, containing the Fitbit, SEMA3, and survey data, respectively. Similarly, the CSV files contain information related to these collections. Each document in any collection follows the format shown below:
{
_id:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle.
The data is organized in a table structure. Code4ML includes several main objects: competition information, raw code blocks collected from Kaggle, and manually marked-up snippets. Each table is in .csv format.
Each competition has a text description and metadata reflecting the characteristics of the competition and the dataset used, as well as the evaluation metrics (competitions.csv). The corresponding datasets can be loaded using the Kaggle API and data sources.
The code blocks themselves and their metadata are collected into data frames according to the publishing year of the initial kernels. The current version of the corpus includes two code block files: snippets from kernels up to the year 2020 (code_blocks_upto_20.csv) and those from 2021 (code_blocks_21.csv), with corresponding metadata. The corpus consists of 2,743,615 ML code blocks collected from 107,524 Jupyter notebooks.
Marked up code blocks have the following metadata: anonymized id, the format of the used data (for example, table or audio), the id of the semantic type, a flag for the code errors, the estimated relevance to the semantic class (from 1 to 5), the id of the parent notebook, and the name of the competition. The current version of the corpus has ~12 000 labeled snippets (markup_data_20220415.csv).
Since the marked-up code block data contains the numeric id of each block's semantic type, we also provide a mapping from this number to the semantic type and subclass (actual_graph_2022-06-01.csv).
The dataset can help solve various problems, including code synthesis from a prompt in natural language, code autocompletion, and semantic code classification.
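As a starting point, the tables can be loaded with pandas (file names as listed above; column layouts should be checked against the released files):

```python
import pandas as pd

competitions = pd.read_csv("competitions.csv")
code_blocks = pd.read_csv("code_blocks_upto_20.csv")
markup = pd.read_csv("markup_data_20220415.csv")

print(competitions.shape, code_blocks.shape, markup.shape)
```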
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Spectra data generated for "Color Classification of Earth-like Planets with Machine Learning" (https://academic.oup.com/mnras/advance-article-abstract/doi/10.1093/mnras/stab1144/6247611).
The flux (units: W/m^2) can be accessed in the flux.pk (pickle file) or flux.csv (comma-separated file). These files also contain the biota information and composition of various surfaces. There are 318,780 spectra generated in total. The spectra include a 6 km cloud layer and Rayleigh scattering. The surface compositions are: cloud, seawater, sand, snow, and biota (six kinds). Each composition is sampled at 5% resolution.
The wavelength (units: micrometer) can be accessed in the wavelength.pk (pickle file) or wavelength.csv (comma-separated file). The wavelength ranges from 0.36 micrometers to 1.1 micrometers, with 1000 sampling points.
To access the pickle file using Python:
import pickle
import pandas
wavelength_dataframe = pickle.load(open('wavelength.pk', 'rb'))
flux_dataframe = pickle.load(open('flux.pk', 'rb'))
The objects loaded by the pickle files will be Pandas dataframes.
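As a quick sanity check after loading, the dataframe shapes should reflect the totals stated above (318,780 spectra and 1000 wavelength samples); the exact orientation of rows and columns should be verified against your copy of the files:

print(wavelength_dataframe.shape)
print(flux_dataframe.shape)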
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a test dataset taken by a CERN@school Timepix detector in Comma Separated Value (CSV) format. It consists of the frame data for three 256x256 pixel frames, with each frame's data in a separate file. The original binary format data may be found at the figshare link below. The data themselves are the readings from the pixels (X, Y, number of counts) caused by particles incident on the Timepix detector's silicon sensor element when exposed to a potassium chloride source. Three frames were taken with an acquisition time of 60 seconds. Further information may be found on the CERN@school website. A simple frame display (written in Python, with matplotlib) may be found in the GitHub repository linked to below.
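A minimal sketch of such a frame display, assuming each frame's CSV holds three comma-separated columns (pixel X, pixel Y, counts) with no header; the actual file names and column layout should be checked against the release and the linked repository:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file name; columns assumed to be X, Y, counts
hits = pd.read_csv("frame_0.csv", names=["x", "y", "counts"])

frame = np.zeros((256, 256))
frame[hits["y"].to_numpy(), hits["x"].to_numpy()] = hits["counts"].to_numpy()

plt.imshow(frame, origin="lower", cmap="viridis")
plt.colorbar(label="counts")
plt.show()
```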
GNU General Public License 3.0: https://www.gnu.org/licenses/gpl-3.0-standalone.html
This repository provides access to five pre-computed reconstruction files as well as the static polygons and rotation files used to generate them. This set of palaeogeographic reconstruction files provides palaeocoordinates for three global grids at H3 resolutions 2, 3, and 4, which have an average cell spacing of ~316 km, ~119 km, and ~45 km, respectively. Grids were reconstructed at a temporal resolution of one million years throughout the entire Phanerozoic (540–0 Ma). The reconstruction files are stored as comma-separated-value (CSV) files which can be easily read by almost any spreadsheet program (e.g. Microsoft Excel and Google Sheets) or programming language (e.g. Python, Julia, and R). In addition, R Data Serialization (RDS) files—a common format for saving R objects—are also provided as lighter (and compressed) alternatives to the CSV files. The structure of the reconstruction files follows a wide-form data frame structure to ease indexing. Each file consists of three initial index columns relating to the H3 cell index (i.e. the 'H3 address'), present-day longitude of the cell centroid, and the present-day latitude of the cell centroid. The subsequent columns provide the reconstructed longitudinal and latitudinal coordinate pairs for their respective age of reconstruction in ascending order, indicated by a numerical suffix. Each row contains a unique spatial point on the Earth's continental surface reconstructed through time. NA values within the reconstruction files indicate points which are not defined in deeper time (i.e. either the static polygon does not exist at that time, or it is outside the temporal coverage as defined by the rotation file).
The following five Global Plate Models are provided (abbreviation, temporal coverage, reference) within the GPMs folder:
WR13, 0–550 Ma, (Wright et al., 2013)
MA16, 0–410 Ma, (Matthews et al., 2016)
TC16, 0–540 Ma, (Torsvik and Cocks, 2016)
SC16, 0–1100 Ma, (Scotese, 2016)
ME21, 0–1000 Ma, (Merdith et al., 2021)
In addition, the H3 grids for resolutions 2, 3, and 4 are provided within the grids folder. Finally, we also provide two scripts (Python and R) within the code folder which can be used to generate reconstructed coordinates for user data from the reconstruction files.
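A sketch of reading one reconstruction file and extracting the coordinates for a single reconstruction age; the file name and the exact column names/suffixes below are assumptions, so check the header of the file you use:

```python
import pandas as pd

# Hypothetical file name from one of the Global Plate Model folders
recon = pd.read_csv("GPMs/ME21/reconstruction_res3.csv")

# First three columns: H3 cell index, present-day lon, present-day lat;
# later columns hold reconstructed lon/lat pairs suffixed by age
# (here assumed to be 'lng_100'/'lat_100' for 100 Ma)
coords_100 = recon[[recon.columns[0], "lng_100", "lat_100"]].dropna()
print(coords_100.head())
```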
For access to the code used to generate these files:
https://github.com/LewisAJones/PhanGrids
For more information, please refer to the article describing the data:
Jones, L.A. and Domeier, M.M. (2024). A Phanerozoic gridded dataset for palaeogeographic reconstructions.
For any additional queries, contact:
Lewis A. Jones (lewisa.jones@outlook.com) or Mathew M. Domeier (mathewd@uio.no)
If you use these files, please cite:
Jones, L.A. and Domeier, M.M. 2024. A Phanerozoic gridded dataset for palaeogeographic reconstructions. DOI: 10.5281/zenodo.10069221
References
Matthews, K. J., Maloney, K. T., Zahirovic, S., Williams, S. E., Seton, M., & Müller, R. D. (2016). Global plate boundary evolution and kinematics since the late Paleozoic. Global and Planetary Change, 146, 226–250. https://doi.org/10.1016/j.gloplacha.2016.10.002.
Merdith, A. S., Williams, S. E., Collins, A. S., Tetley, M. G., Mulder, J. A., Blades, M. L., Young, A., Armistead, S. E., Cannon, J., Zahirovic, S., & Müller, R. D. (2021). Extending full-plate tectonic models into deep time: Linking the Neoproterozoic and the Phanerozoic. Earth-Science Reviews, 214, 103477. https://doi.org/10.1016/j.earscirev.2020.103477.
Scotese, C. R. (2016). Tutorial: PALEOMAP paleoAtlas for GPlates and the paleoData plotter program: PALEOMAP Project, Technical Report.
Torsvik, T. H., & Cocks, L. R. M. (2017). Earth history and palaeogeography. Cambridge University Press. https://doi.org/10.1017/9781316225523.
Wright, N., Zahirovic, S., Müller, R. D., & Seton, M. (2013). Towards community-driven paleogeographic reconstructions: Integrating open-access paleogeographic and paleobiology data with plate tectonics. Biogeosciences, 10, 1529–1541. https://doi.org/10.5194/bg-10-1529-2013.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The ARtracks Atmospheric River Catalogue is based on the ERA5 climate reanalysis dataset, specifically the output parameters "vertical integral of east-/northward water vapour flux". Most of the processing relies on IPART (Image-Processing based Atmospheric River (AR) Tracking, https://github.com/ihesp/IPART), a Python package for automated AR detection, axis finding and AR tracking. The catalogue is provided as a pickled pandas.DataFrame as well as a CSV file.
For detailed information, please see https://github.com/dominiktraxl/artracks.
The ARtracks catalogue covers the years from 1979 to the end of the year 2019.
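Either distribution of the catalogue can be loaded with pandas; the file names below are illustrative, so use the names from the actual download:

```python
import pandas as pd

ar_tracks = pd.read_pickle("artracks_catalogue.pkl")  # pickled DataFrame
# ar_tracks = pd.read_csv("artracks_catalogue.csv")   # CSV alternative
print(ar_tracks.head())
```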
Public Domain: https://creativecommons.org/licenses/publicdomain/
This repository contains data on 17,419 DOIs cited in the IPCC Working Group 2 contribution to the Sixth Assessment Report, and the code to link them to the dataset built at the Curtin Open Knowledge Initiative (COKI).
References were extracted from the report's PDFs (downloaded 2022-03-01) via Scholarcy and exported as RIS and BibTeX files. DOI strings were identified from the RIS files by pattern matching and saved as a CSV file. The list of DOIs for each chapter and cross-chapter paper was processed using a custom Python script to generate a pandas DataFrame, which was saved as a CSV file and uploaded to Google BigQuery.
We used the main object table of the Academic Observatory, which combines information from Crossref, Unpaywall, Microsoft Academic, Open Citations, the Research Organization Registry and Geonames, to enrich the DOIs with bibliographic information, affiliations, and open access status. A custom query was used to join and format the data, and the resulting table was visualised in a Google Data Studio dashboard.
This version of the repository also includes the set of DOIs from references in the IPCC Working Group 1 contribution to the Sixth Assessment Report as extracted by Alexis-Michel Mugabushaka and shared on Zenodo: https://doi.org/10.5281/zenodo.5475442 (CC-BY)
A brief descriptive analysis was provided as a blogpost on the COKI website.
The repository contains the following content:
Data:
data/scholarcy/RIS/ - extracted references as RIS files
data/scholarcy/BibTeX/ - extracted references as BibTeX files
IPCC_AR6_WGII_dois.csv - list of DOIs
data/10.5281_zenodo.5475442/ - references from IPCC AR6 WG1 report
Processing:
preprocessing.R - preprocessing steps for identifying and cleaning DOIs
process.py - Python script for transforming data and linking to COKI data through Google BigQuery
Outcomes:
Dataset on BigQuery - requires a Google account for access and a BigQuery account for querying
Data Studio Dashboard - interactive analysis of the generated data
Zotero library of references extracted via Scholarcy
PDF version of blogpost
Note on licenses: Data are made available under CC0 (with the exception of WG1 reference data, which have been shared under CC-BY 4.0). Code is made available under the Apache License 2.0.
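As a minimal starting point for re-using the DOI list (the file name is taken from the contents listing above; column names should be inspected before further processing):

```python
import pandas as pd

dois = pd.read_csv("IPCC_AR6_WGII_dois.csv")
print(len(dois), "rows")
print(dois.head())
```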
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains a collection of synthetically generated Question-Answer (Q&A) pairs on sustainable fashion and style, with an emphasis on timeless wardrobe pieces, sustainable choices, and capsule wardrobe principles. The data was created using a large language model with advanced reasoning, prompted with various grounded contexts and real-world examples. It can be used to train or evaluate models that specialize in sustainable fashion advice, styling recommendations, or instruction-following tasks.
Context: The data focuses on classic, long-lasting wardrobe recommendations. Topics include choosing neutral color palettes, selecting high-quality fabrics (like wool), finding universally flattering silhouettes, and embracing sustainability in fashion choices...
Structure: Each entry contains two primary fields:
- instruction – the user's question or prompt
- response – the corresponding answer or advice

Example Entry (truncated for clarity):
```csv
instruction,response
"What makes a neutral color palette so timeless?", "Neutral tones like black, navy, beige, and gray offer unmatched versatility..."
```
Synthetic Creation:
This dataset is synthetic—the questions and answers were generated by a large language model. The prompts used in creation were seeded with diverse real-world fashion contexts and examples to ensure groundedness and practical relevance.
Advanced Reasoning:
The large language model was employed to simulate more detailed and nuanced fashion advice, making each Q&A pair comprehensive yet concise. Despite the synthetic nature, the reasoning incorporates established fashion principles and best practices.
| Column Name | Description |
|---|---|
| instruction | A concise question related to fashion, style tips, capsule wardrobes, or sustainability. |
| response | A short, detailed answer offering timeless styling advice, illustrating best practices in fashion. |
Potential use cases:
- Sustainable Fashion Chatbot/Assistant
- Instruction-Following/QA Models
- Content Generation
- Sustainable Fashion Product Descriptions
Download the Dataset
The CSV file contains the two columns instruction and response.
Data Preprocessing
Sample Use
```python
import csv

data = []
with open('sustainable_fashion.csv', 'r', encoding='utf-8') as f:
    reader = csv.DictReader(f)
    for row in reader:
        data.append(row)

print("Question:", data[0]['instruction'])
print("Answer:", data[0]['response'])
```
A possible system prompt when using the data with a chat model: "You are a fashion advisor. Provide concise, accurate style guidance."
- Maintain Consistency: keep the roles of instruction and response consistent. Models often learn better with clearly defined roles.
- Supplementary Data:
- Evaluate Quality:
- Ethical and Inclusive Language:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Austin's data portal activity metrics’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/data-portal-activity-metricse on 13 February 2022.
--- Dataset description provided by original source is as follows ---
Background
Austin's open data portal provides lots of public data about the City of Austin. It also provides portal administrators with behind-the-scenes information about how the portal is used... but that data is mysterious, hard to handle in a spreadsheet, and not located all in one place.
Until now! Authorized city staff used admin credentials to grab this usage data and share it with the public. The City of Austin wants to use this data to inform the development of its open data initiative and manage the open data portal more effectively.
This project contains related datasets for anyone to explore. These include site-level metrics, dataset-level metrics, and department information for context. A detailed description of how the files were prepared (along with code) can be found on GitHub.
Example questions to answer about the data portal
- What parts of the open data portal do people seem to value most?
- What can we tell about who our users are?
- How are our data publishers doing?
- How much data is published programmatically vs manually?
- How much data is super fresh? Super stale?
- Whatever you think we should know...
About the files
all_views_20161003.csv
There is a resource available to portal administrators called "Dataset of datasets". This is the export of that resource, and it was captured on Oct 3, 2016. It contains a summary of the assets available on the data portal. While this file contains over 1400 resources (such as views, charts, and binary files), only 363 are actual tabular datasets.
table_metrics_ytd.csv
This file contains information about the 363 tabular datasets on the portal. Activity metrics for an individual dataset can be accessed by calling Socrata's views/metrics API and passing along the dataset's unique ID, a time frame, and admin credentials. The process of obtaining the 363 identifiers, calling the API, and staging the information can be reviewed in the accompanying Python notebook.
site_metrics.csv
This file is the export of site-level stats that Socrata generates using a given time frame and grouping preference. This file contains records about site usage each month from Nov 2011 through Sept 2016. By the way, it contains 285 columns... and we don't know what many of them mean. But we are determined to find out!! For a preliminary exploration of the columns and the portal-related business processes to which they might relate, check out the notes in the accompanying Python notebook.
city_departments_in_current_budget.csv
This file contains a list of all City of Austin departments according to how they're identified in the most recently approved budget documents. Could be helpful for getting to know more about who the publishers are.
crosswalk_to_budget_dept.csv
The City is in the process of standardizing how departments identify themselves on the data portal. In the meantime, here's a crosswalk from the department values observed in all_views_20161003.csv to the department names that appear in the City's budget.

This dataset was created by Hailey Pate and contains around 100 samples along with Di Sync Success, Browser Firefox 19, technical information and other features such as:
- Browser Firefox 33
- Di Sync Failed
- and more
- Analyze Sf Query Error User in relation to Js Page View Admin
- Study the influence of Browser Firefox 37 on Datasets Created
- More datasets
If you use this dataset in your research, please credit Hailey Pate
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Contains data from a single 1-photon widefield imaging experiment and a single Thorlabs mesoscope 2-photon imaging session from the same side mount mouse, corresponding to panels a-c and d of Figure 3, respectively. Included files contain imaging data, behavioral data, and python files with combined neurobehavioral data.

Note that session names have the following format: "mouse#_bigEndianDate_cage#_info_session#_attempt#". Raw mesoscope imaging data is included in ScanImage rendered format as single big tiffs with the following nomenclature: "filename_2D.tiff". Mouse face and body cam images are included as standalone or concatenated .avi movie files, and behavioral data is included both as Spike2 files (smrx) and in exported form as Matlab data files (.mat).

In all cases the first frame of the 2-photon movie, the right face/body movie, and Spike2 data are aligned to the first Labview-issued frameclock trigger (also recorded in Spike2, along with all other frameclock events). 2-photon triggers were sometimes incorrectly recorded in Spike2 (generally we recorded these as both events and waveforms), but were in all cases additionally exported from ScanImage tiff metadata as timestamps (csv files ending in header.csv). Session start-time timestamps, also exported from ScanImage tiff metadata, appear as .txt files ending in "_starttime.txt".

Preprocessed data (python) can be found in npy files with various names, each containing different subsets of variables relevant to the analysis. For each session, the npy file containing the string "standard_frames" contains the most complete, final-stage set of preprocessed neurobehavioral data (in combined DataFrame format, exportable to nwb), including CCF/MMM alignments. The file containing the string "nb_dump" contains a large set of auxiliary variables that may be needed for additional preprocessing.

Additional image files (tiff, png) and excel worksheets (xlsx, csv) containing high-level data summaries and records of intermediate analysis steps are also included. Please contact the authors for any additional clarifications as needed. See related materials in Collection at: https://doi.org/10.25452/figshare.plus.c.7052513
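The preprocessed .npy files can be opened from Python; because they appear to hold Python objects (e.g. DataFrames) rather than plain numeric arrays, allow_pickle is likely required. The file name below is illustrative:

```python
import numpy as np

# Hypothetical file name for a session's "standard_frames" preprocessed data
data = np.load("session_standard_frames.npy", allow_pickle=True)
print(type(data), getattr(data, "shape", None))
```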
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Contains data from a single 1-photon widefield imaging experiment and a single Thorlabs mesoscope 2-photon imaging session from the same side mount mouse as in Figures 5 and 6 (but a different session than in those Figures), corresponding to panels d-f of Figure S6. Included files contain imaging data, behavioral data, and python files with combined neurobehavioral data. Additional python files corresponding to this session can be found in the FigShare+ folder in this collection corresponding to Figure 6 (this session was used in the BSOiD model training for the session that was fit in Figure 6).

Note that session names have the following format: "mouse#_bigEndianDate_cage#_info_session#_attempt#". Raw mesoscope imaging data is included in ScanImage rendered format as single big tiffs with the following nomenclature: "filename_2D.tiff". Mouse face and body cam images are included as standalone or concatenated .avi movie files, and behavioral data is included both as Spike2 files (smrx) and in exported form as Matlab data files (.mat).

In all cases the first frame of the 2-photon movie, the right face/body movie, and Spike2 data are aligned to the first Labview-issued frameclock trigger (also recorded in Spike2, along with all other frameclock events). 2-photon triggers were sometimes incorrectly recorded in Spike2 (generally we recorded these as both events and waveforms), but were in all cases additionally exported from ScanImage tiff metadata as timestamps (csv files ending in header.csv). Session start-time timestamps, also exported from ScanImage tiff metadata, appear as .txt files ending in "_starttime.txt".

Preprocessed data (python) can be found in npy files with various names, each containing different subsets of variables relevant to the analysis. For each session, the npy file containing the string "standard_frames" contains the most complete, final-stage set of preprocessed neurobehavioral data (in combined DataFrame format, exportable to nwb), including CCF/MMM alignments. The file containing the string "nb_dump" contains a large set of auxiliary variables that may be needed for additional preprocessing.

Additional image files (tiff, png) and excel worksheets (xlsx, csv) containing high-level data summaries and records of intermediate analysis steps are also included. Please contact the authors for any additional clarifications as needed. See related materials in Collection at: https://doi.org/10.25452/figshare.plus.c.7052513
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A Benchmark Dataset for Deep Learning-based Methods for 3D Topology Optimization.
One can find a description of the provided dataset partitions in Section 3 of Dittmer, S., Erzmann, D., Harms, H., Maass, P., SELTO: Sample-Efficient Learned Topology Optimization (2022) https://arxiv.org/abs/2209.05098.
Every dataset container consists of multiple enumerated pairs of CSV files. Each pair describes a unique topology optimization problem and a corresponding binarized SIMP solution. Every file of the form {i}.csv contains all voxel-wise information about the sample i. Every file of the form {i}_info.csv contains scalar parameters of the topology optimization problem, such as material parameters.
This dataset represents topology optimization problems and solutions on the basis of voxels. We define all spatially varying quantities via the voxels' centers -- rather than via the vertices or surfaces of the voxels.
In {i}.csv files, each row corresponds to one voxel in the design space. The columns correspond to ['x', 'y', 'z', 'design_space', 'dirichlet_x', 'dirichlet_y', 'dirichlet_z', 'force_x', 'force_y', 'force_z', 'density'].
Any of these files with the index i can be imported using pandas by executing:
import pandas as pd
directory = ...
file_path = f'{directory}/{i}.csv'
column_names = ['x', 'y', 'z', 'design_space','dirichlet_x', 'dirichlet_y', 'dirichlet_z', 'force_x', 'force_y', 'force_z', 'density']
data = pd.read_csv(file_path, names=column_names)
From this pandas dataframe one can extract the torch tensors of forces F, Dirichlet conditions ωDirichlet, and design space information ωdesign using the following functions:
import torch
def get_shape_and_voxels(data):
shape = data[['x', 'y', 'z']].iloc[-1].values.astype(int) + 1
vox_x = data['x'].values
vox_y = data['y'].values
vox_z = data['z'].values
voxels = [vox_x, vox_y, vox_z]
return shape, voxels
def get_forces_boundary_conditions_and_design_space(data, shape, voxels):
F = torch.zeros(3, *shape, dtype=torch.float32)
F[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(data['force_x'].values, dtype=torch.float32)
F[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(data['force_y'].values, dtype=torch.float32)
F[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(data['force_z'].values, dtype=torch.float32)
ω_Dirichlet = torch.zeros(3, *shape, dtype=torch.float32)
ω_Dirichlet[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(data['dirichlet_x'].values, dtype=torch.float32)
ω_Dirichlet[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(data['dirichlet_y'].values, dtype=torch.float32)
ω_Dirichlet[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(data['dirichlet_z'].values, dtype=torch.float32)
ω_design = torch.zeros(1, *shape, dtype=int)
ω_design[:, voxels[0], voxels[1], voxels[2]] = torch.from_numpy(data['design_space'].values.astype(int))
return F, ω_Dirichlet, ω_design
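For example, applying these helpers to the dataframe loaded above:

shape, voxels = get_shape_and_voxels(data)
F, ω_Dirichlet, ω_design = get_forces_boundary_conditions_and_design_space(data, shape, voxels)
print(F.shape, ω_Dirichlet.shape, ω_design.shape)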
The corresponding {i}_info.csv files only have one row with column labels ['E', 'ν', 'σ_ys', 'vox_size', 'p_x', 'p_y', 'p_z'].
Analogously to above, one can import any {i}_info.csv file by executing:
file_path = f'{directory}/{i}_info.csv'
data_info_column_names = ['E', 'ν', 'σ_ys', 'vox_size', 'p_x', 'p_y', 'p_z']
data_info = pd.read_csv(file_path, names=data_info_column_names)
Algorithm for the detection of hate expressions in Spanish. This algorithm was developed within the framework of the Hatemedia project (PID2020-114584GB-I00), funded by MCIN/AEI/10.13039/501100011033, with the collaboration of Possible Inc.

The folder structure of the GitHub documentation is as follows:

02 Documentación Github
└── 00_Odio y no odio
    ├── DOCUMENTACIÓN GITHUB.docx
    ├── ejemplo (1).py
    ├── Modelo_binario_ (1) (1).ipynb
    ├── obtener_caracteristicas (1).py
    └── Recursos-20231027T110710Z-001 (1).zip

The content of each file is detailed below:
- DOCUMENTACIÓN GITHUB.docx: report describing how to use the scripts ejemplo (1).py and obtener_caracteristicas (1).py to apply the models.
- ejemplo (1).py: Python script showing how to use the models to make predictions.
- Modelo_binario_ (1) (1).ipynb: notebook with the code used to train the different models.
- obtener_caracteristicas (1).py: Python script with the preprocessing functions applied before the models are used to predict the entries of a dataframe.
- Recursos-20231027T110710Z-001 (1).zip: the resources folder contains 3 .csv files used in feature extraction.

The dataset used for training the models is dataset_completo_caracteristicas_ampliadas_todas_combinaciones_v1_textoProcesado.csv (https://acortar.link/diSV7o).

The algorithm was developed from the tests of the applied models shown below:

MODELOS
├── 70-30
│   ├── CART_binario_70-30.joblib
│   ├── GB_binario_70-30.joblib
│   ├── MLP_binario_70-30.joblib
│   ├── NB_binario_70-30.joblib
│   ├── RF_binario_70-30.joblib
│   └── SVM_binario_70-30.joblib
├── 80-20
│   ├── CART_binario_80-20.joblib
│   ├── GB_binario_80-20.joblib
│   ├── MLP_binario_80-20.joblib
│   ├── NB_binario_80-20.joblib
│   ├── RF_binario_80-20.joblib
│   └── SVM_binario_80-20.joblib
└── 90-10
    ├── CART_binario_90-10.joblib
    ├── GB_binario_90-10.joblib
    ├── MLP_binario_90-10.joblib
    ├── NB_binario_90-10.joblib
    ├── RF_binario_90-10.joblib
    └── SVM_binario_90-10.joblib

The folders 70-30, 80-20 and 90-10 contain the models already trained with the respective train/test split percentages. Results and comparisons generated during the training and validation of the final model used to develop the algorithm are shared in the MODELOS folder (uploaded to GitHub) and in the document Comparativa_V2.xlsx (uploaded to GitHub). The procedure followed to train the models is documented in the technical report on the development of the hate/non-hate classification algorithm for Spanish digital news media on X (Twitter), Facebook and web portals (https://doi.org/10.6084/m9.figshare.26085688.v1).

Authors: Elias Said-Hung, Julio Montero-Díaz, Oscar De Gregorio, Almudena Ruiz-Iniesta, Xiomara Blanco, Juan José Cubillas, Daniel Pérez Palau.
Funded by: Agencia Estatal de Investigación – Ministerio de Ciencia e Innovación. With the support of POSSIBLE S.L.
How to cite: Said-Hung, E., Montero-Diaz, J., De Gregorio Vicente, O., Ruiz-Iniesta, A., Blanco Valencia, X., José Cubillas, J., and Pérez Palau, D. (2023), "Algorithm for classifying hate expressions in Spanish", figshare. https://doi.org/10.6084/m9.figshare.24574906.
More information: https://www.hatemedia.es/ or contact: elias.said@unir.net
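A minimal sketch of applying one of the released models; the paths and the feature table below are assumptions, since in practice the features must first be built with the preprocessing in obtener_caracteristicas (1).py, as illustrated in ejemplo (1).py:

```python
import joblib
import pandas as pd

# Hypothetical: a table of already-extracted features for the texts to classify
features = pd.read_csv("features.csv")

# One of the released binary models (80/20 train/test split)
model = joblib.load("MODELOS/80-20/RF_binario_80-20.joblib")

# Binary hate / non-hate prediction, one value per row of the feature table
print(model.predict(features))
```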
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
3D skeletons UP-Fall Dataset
Difference between Fall and Impact detection
Overview
This dataset aims to facilitate research in fall detection, particularly focusing on the precise detection of impact moments within fall events. The accuracy and comprehensiveness of the 3D skeleton data make it a valuable resource for developing and benchmarking fall detection algorithms. The dataset contains 3D skeletal data extracted from fall events and daily activities of 5 subjects performing fall scenarios.
Data Collection
The skeletal data was extracted using a pose estimation algorithm, which processes image frames to determine the 3D coordinates of each joint. Sequences with fewer than 100 frames of extracted data were excluded to ensure the quality and reliability of the dataset. As a result, some subjects may have fewer CSV files.
CSV Structure
The data is organized by subject, and each subject folder contains CSV files named according to the pattern C1S1A1T1 (camera, subject, activity, trial):
- subject1/: contains CSV files for Subject 1.
- subject2/: contains CSV files for Subject 2.
- subject3/, subject4/, subject5/: same structure as above, but may contain fewer CSV files due to the data extraction criteria mentioned above.
Column Descriptions
Each CSV file contains the following columns representing different skeletal joints and their respective coordinates in 3D space:
| Column Name | Description |
|---|---|
| joint_1_x | X coordinate of joint 1 |
| joint_1_y | Y coordinate of joint 1 |
| joint_1_z | Z coordinate of joint 1 |
| joint_2_x | X coordinate of joint 2 |
| joint_2_y | Y coordinate of joint 2 |
| joint_2_z | Z coordinate of joint 2 |
| ... | ... |
| joint_n_x | X coordinate of joint n |
| joint_n_y | Y coordinate of joint n |
| joint_n_z | Z coordinate of joint n |
| LABEL | Label indicating impact (1) or non-impact (0) |
Example
Here is an example of what a row in one of the CSV files might look like:
| joint_1_x | joint_1_y | joint_1_z | joint_2_x | joint_2_y | joint_2_z | ... | joint_n_x | joint_n_y | joint_n_z | LABEL |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.123 | 0.456 | 0.789 | 0.234 | 0.567 | 0.890 | ... | 0.345 | 0.678 | 0.901 | 0 |
Usage
This data can be used for developing and benchmarking impact fall detection algorithms. It provides detailed information on human posture and movement during falls, making it suitable for machine learning and deep learning applications in impact fall detection and prevention.
Using GitHub
1. Clone the repository:
```bash
git clone https://github.com/Tresor-Koffi/3D_skeletons-UP-Fall-Dataset
```
2. Navigate to the directory:
```bash
cd 3D_skeletons-UP-Fall-Dataset
```
Here's a simple example of how to load and inspect a sample data file using Python:
```python
import pandas as pd

# Load a sample data file for Subject 1, Camera 1, Activity 1, Trial 1
data = pd.read_csv('subject1/C1S1A1T1.csv')
print(data.head())
```
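Building on the snippet above, the joint coordinates and the impact label can be separated for model training (a minimal sketch; column names follow the table above):

```python
# Features: all joint coordinate columns; target: the impact label
X = data.drop(columns=['LABEL'])
y = data['LABEL']
print(X.shape, y.value_counts())
```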
This dataset consists of technical-language annotations collected over four years from two paper machines in northern Sweden, structured as a Pandas dataframe. The same data is also available as a semicolon-separated .csv file. The data consists of two columns, where the first column corresponds to the text content of an annotation and the second to its title. The annotations are written in Swedish and have been processed so that all proper names are replaced by the text string 'egennamn'. Each row corresponds to one annotation with its title.
The data can be processed in Python with:
import pandas as pd
annotations_df = pd.read_pickle("Technical_Language_Annotations.pkl")
annotation_contents = annotations_df['noteComment']
annotation_titles = annotations_df['title']
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
1117 Russian cities with city name, region, geographic coordinates and 2020 population estimate.
How to use
from pathlib import Path
import requests
import pandas as pd

url = ("https://raw.githubusercontent.com/"
       "epogrebnyak/ru-cities/main/assets/towns.csv")

# save file locally
p = Path("towns.csv")
if not p.exists():
    content = requests.get(url).text
    p.write_text(content, encoding="utf-8")

# read as dataframe
df = pd.read_csv("towns.csv")
print(df.sample(5))
Files:
Columns (towns.csv):
Basic info:
- city - city name (several cities have alternative names, marked in alt_city_names.json)
- population - city population, thousand people, Rosstat estimate as of 1.1.2020
- lat, lon - city geographic coordinates
Region:
- region_name - subnational region (oblast, republic, krai or AO)
- region_iso_code - ISO 3166 code, e.g. RU-VLD
- federal_district, e.g. Центральный
City codes:
- okato
- oktmo
- fias_id
- kladr_id
Data sources
Comments
City groups
Ханты-Мансийский and Ямало-Ненецкий autonomous regions are excluded to avoid duplication, as they are parts of Тюменская область.
Several notable towns are classified as administrative parts of larger cities (Сестрорецк is a municipality within Saint Petersburg, Щербинка is part of Moscow). They are not reported in this dataset.
By individual city
Белоозерский is not found in the Rosstat publication, but should be considered a city as of 1.1.2020.
Alternative city names
We suppressed the letter "ё" in the city column of towns.csv - we have Орел, but not Орёл. This affected:
- Белоозёрский
- Королёв
- Ликино-Дулёво
- Озёры
- Щёлково
- Орёл
Дмитриев and Дмитриев-Льговский are the same city.
assets/alt_city_names.json contains these names.
Tests
poetry install
poetry run python -m pytest
How to replicate dataset
1. Base dataset
Run: (converts Саратовская область.doc to docx)
Creates:
- _towns.csv
- assets/regions.csv
2. API calls
Note: do not attempt if you do not have to - this runs a while and loads third-party API access. You have the resulting files in the repo, so you probably do not need to run these scripts.
Run:
cd geocoding
Creates:
3. Merge data
Run:
Creates:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
VIPERdb clean data. Contains structural data about the capsids of icosahedral viral genera, as taken from VIPERdb after merging together records of the same genus (see Methods). Rename this file to "viperdb_clean.csv" in order to load it through our Python framework. (CSV 6 kb)