100+ datasets found
  1. Light novel forum dataset

    • kaggle.com
    zip
    Updated May 12, 2020
    Cite
    Manusha (2020). Light novel forum dataset [Dataset]. https://www.kaggle.com/manushadilan/light-novel-forum-dataset
    Explore at:
    zip (44833 bytes). Available download formats
    Dataset updated
    May 12, 2020
    Authors
    Manusha
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Context

    This dataset was created by web scraping with Scrapy, a Python scraping framework.

    Content

    This dataset captures user interaction on a light novel sharing forum. The subject column contains the subject of each forum post, and almost every subject is the name of the light novel being shared. The second column records who created the post. The third column shows how many views that light novel received. The next column shows how many users replied to the post. The last two columns show who posted last in that thread and when that last post was made.
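    As a rough pandas sketch of working with such a table (the file name and column names below are hypothetical; check the actual CSV header):

    import pandas as pd

    # Hypothetical file and column names -- adjust to the actual dataset
    df = pd.read_csv("light_novel_forum.csv")

    # The ten most viewed forum threads, following the column layout described above
    top = df.sort_values("views", ascending=False).head(10)
    print(top[["subject", "created_by", "views", "replies"]])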

    Inspiration

    I hope this will unlock hidden secrets of light novel reading communities.

  2. Dataset metadata of known Dataverse installations

    • search.dataone.org
    • dataverse.harvard.edu
    • +1more
    Updated Nov 22, 2023
    + more versions
    Cite
    Gautier, Julian (2023). Dataset metadata of known Dataverse installations [Dataset]. http://doi.org/10.7910/DVN/DCDKZQ
    Explore at:
    Dataset updated
    Nov 22, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Gautier, Julian
    Description

    This dataset contains the metadata of the datasets published in 77 Dataverse installations, information about each installation's metadata blocks, and the list of standard licenses that dataset depositors can apply to the datasets they publish in the 36 installations running more recent versions of the Dataverse software. The data is useful for reporting on the quality of dataset and file-level metadata within and across Dataverse installations. Curators and other researchers can use this dataset to explore how well Dataverse software and the repositories using the software help depositors describe data.

    How the metadata was downloaded

    The dataset metadata and metadata block JSON files were downloaded from each installation on October 2 and October 3, 2022 using a Python script kept in a GitHub repo at https://github.com/jggautier/dataverse-scripts/blob/main/other_scripts/get_dataset_metadata_of_all_installations.py. In order to get the metadata from installations that require an installation account API token to use certain Dataverse software APIs, I created a CSV file with two columns: one column named "hostname" listing each installation URL in which I was able to create an account and another named "apikey" listing my accounts' API tokens. The Python script expects and uses the API tokens in this CSV file to get metadata and other information from installations that require API tokens.

    How the files are organized

    ├── csv_files_with_metadata_from_most_known_dataverse_installations
    │ ├── author(citation).csv
    │ ├── basic.csv
    │ ├── contributor(citation).csv
    │ ├── ...
    │ └── topic_classification(citation).csv
    ├── dataverse_json_metadata_from_each_known_dataverse_installation
    │ ├── Abacus_2022.10.02_17.11.19.zip
    │ ├── dataset_pids_Abacus_2022.10.02_17.11.19.csv
    │ ├── Dataverse_JSON_metadata_2022.10.02_17.11.19
    │ ├── hdl_11272.1_AB2_0AQZNT_v1.0.json
    │ ├── ...
    │ ├── metadatablocks_v5.6
    │ ├── astrophysics_v5.6.json
    │ ├── biomedical_v5.6.json
    │ ├── citation_v5.6.json
    │ ├── ...
    │ ├── socialscience_v5.6.json
    │ ├── ACSS_Dataverse_2022.10.02_17.26.19.zip
    │ ├── ADA_Dataverse_2022.10.02_17.26.57.zip
    │ ├── Arca_Dados_2022.10.02_17.44.35.zip
    │ ├── ...
    │ └── World_Agroforestry_-_Research_Data_Repository_2022.10.02_22.59.36.zip
    ├── dataset_pids_from_most_known_dataverse_installations.csv
    ├── licenses_used_by_dataverse_installations.csv
    └── metadatablocks_from_most_known_dataverse_installations.csv

    This dataset contains two directories and three CSV files not in a directory.

    One directory, "csv_files_with_metadata_from_most_known_dataverse_installations", contains 18 CSV files that contain the values from common metadata fields of all 77 Dataverse installations. For example, author(citation)_2022.10.02-2022.10.03.csv contains the "Author" metadata for all published, non-deaccessioned versions of all datasets in the 77 installations, where there's a row for each author name, affiliation, identifier type and identifier.

    The other directory, "dataverse_json_metadata_from_each_known_dataverse_installation", contains 77 zipped files, one for each of the 77 Dataverse installations whose dataset metadata I was able to download using Dataverse APIs. Each zip file contains a CSV file and two sub-directories. The CSV file contains the persistent IDs and URLs of each published dataset in the Dataverse installation as well as a column to indicate whether or not the Python script was able to download the Dataverse JSON metadata for each dataset. For Dataverse installations using Dataverse software versions whose Search APIs include each dataset's owning Dataverse collection name and alias, the CSV files also include which Dataverse collection (within the installation) that dataset was published in. One sub-directory contains a JSON file for each of the installation's published, non-deaccessioned dataset versions. The JSON files contain the metadata in the "Dataverse JSON" metadata schema. The other sub-directory contains information about the metadata models (the "metadata blocks" in JSON files) that the installation was using when the dataset metadata was downloaded. I saved them so that they can be used when extracting metadata from the Dataverse JSON files.

    The dataset_pids_from_most_known_dataverse_installations.csv file contains the dataset PIDs of all published datasets in the 77 Dataverse installations, with a column to indicate if the Python script was able to download the dataset's metadata. It's a union of all of the "dataset_pids_..." files in each of the 77 zip files.

    The licenses_used_by_dataverse_installations.csv file contains information about the licenses that a number of the installations let depositors choose when creating datasets. When I collected ...

    Visit https://dataone.org/datasets/sha256%3Ad27d528dae8cf01e3ea915f450426c38fd6320e8c11d3e901c43580f997a3146 for complete metadata about this dataset.
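    Returning to the download process described above, a minimal sketch of the two-column API-token CSV that the download script expects (the file name, URL and token below are placeholders, not real values):

    import csv

    # Placeholder values -- replace with real installation URLs and your own API tokens
    rows = [
        {"hostname": "https://dataverse.example.edu", "apikey": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"},
    ]

    with open("installation_api_tokens.csv", "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=["hostname", "apikey"])
        writer.writeheader()
        writer.writerows(rows)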

  3. gdp_per_capita from 2017 to 2022

    • kaggle.com
    zip
    Updated Jul 21, 2023
    Cite
    D0ktor (2023). gdp_per_capita from 2017 to 2022 [Dataset]. https://www.kaggle.com/datasets/strategos2/gdp-per-capita-from-2017-to-2022
    Explore at:
    zip (504 bytes). Available download formats
    Dataset updated
    Jul 21, 2023
    Authors
    D0ktor
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    I created this dataset for beginners such as myself, to make it easier to bring GDP per capita into the playground competition.

    Here is example code showing how to use it:

    import pandas as pd

    dataset = pd.read_csv('../train.csv')
    gdp = pd.read_csv('../GDP_Playground.csv')

    def get_gdp(dataset):
        # Rename the columns in the GDP df (one column per country)
        gdp.columns = ['Argentina', 'Canada', 'Estonia', 'Spain', 'Japan']

        # Create a dictionary keyed by (country, year); this assumes the GDP
        # dataframe is indexed by year
        GDP_dictionary = gdp.unstack().to_dict()

        # Create a GDP column by looking up each row's (country, year) pair
        dataset['GDP'] = dataset.set_index(['country', 'year']).index.map(GDP_dictionary.get)

        # Split GDP by country (for a linear model)
        dataset['GDP_Argentina'] = dataset['GDP'] * (dataset['country'] == 'Argentina')
        dataset['GDP_Canada'] = dataset['GDP'] * (dataset['country'] == 'Canada')
        dataset['GDP_Estonia'] = dataset['GDP'] * (dataset['country'] == 'Estonia')
        dataset['GDP_Spain'] = dataset['GDP'] * (dataset['country'] == 'Spain')
        dataset['GDP_Japan'] = dataset['GDP'] * (dataset['country'] == 'Japan')

        # Drop the combined GDP column and the year column
        dataset = dataset.drop(['GDP', 'year'], axis=1)

        return dataset
    

    The source of this function can be found in the older Playground series from January 2022.
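    For illustration, the function above would then be applied to the competition dataframe (assuming train.csv contains the 'country' and 'year' columns the function expects):

    dataset = get_gdp(dataset)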

  4. Data from: Data corresponding to the paper "Traveling Bubbles and Vortex Pairs within Symmetric 2D Quantum Droplets"

    • portalcientifico.uvigo.gal
    Updated 2025
    Cite
    Michinel, Humberto; Michinel, Humberto (2025). Data corresponding to the paper "Traveling Bubbles and Vortex Pairs within Symmetric 2D Quantum Droplets" [Dataset]. https://portalcientifico.uvigo.gal/documentos/682afb714c44bf76b287f3ae
    Explore at:
    Dataset updated
    2025
    Authors
    Michinel, Humberto; Michinel, Humberto
    Description

    Datasets generated for the Physical Review E article with title: "Traveling Bubbles and Vortex Pairs within Symmetric 2D Quantum Droplets" by Paredes, Guerra-Carmenate, Salgueiro, Tommasini and Michinel. In particular, we provide the data needed to generate the figures in the publication, which illustrate the numerical results found during this work.

    We also include Python code in the file "plot_from_data_for_repository.py" that generates a version of the figures of the paper from the .pt data sets. The data can be read and the plots reproduced with simple modifications of this Python code.

    Figure 1: Data are in fig1.csv

    The csv file has four columns separated by commas. The four columns correspond to values of r (first column) and the function psi(r) for the three cases depicted in the figure (columns 2-4).

    Figures 2 and 4: Data are in data_figs_2_and_4.pt

    This is a data file generated with the torch module of Python. It includes eight torch tensors: the spatial grids "x" and "y" and the complex values of psi for the six eigenstates depicted in figures 2 and 4 ("psia", "psib", "psic", "psid", "psie", "psif"). Notice that figure 2 shows the square of the modulus and figure 4 the argument; both are obtained from the same data sets.
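    As a rough illustration of reading these .pt files in Python (this assumes each file stores a dictionary of tensors under the keys listed above; the authors' own plotting code is in plot_from_data_for_repository.py):

    import torch
    import matplotlib.pyplot as plt

    data = torch.load("data_figs_2_and_4.pt")       # assumed to be a dict of tensors
    x, y, psi = data["x"], data["y"], data["psia"]  # spatial grids and first eigenstate

    # Figure 2 shows the squared modulus |psi|^2 of each eigenstate;
    # the orientation/transpose of the grid may need adjusting to the stored shapes
    plt.pcolormesh(x.numpy(), y.numpy(), psi.abs().pow(2).numpy(), shading="auto")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.colorbar(label="|psi|^2")
    plt.show()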

    Figure 3: Data are in fig3.csv

    The csv file has three columns separated by commas. The three columns correspond to values of momentum p (first column), energy E (second column) and velocity U (third column).

    Figure 5: Data are in fig5.csv

    The csv file has three columns separated by commas. The three columns correspond to values of momentum p (first column), the minimum value of |psi|^2 (second column) and the value of |psi|^2 at the center (third column).

    Figure 6: Data are in data_fig_6.pt

    This is a data file generated with the torch module of python. It includes six torch tensors for the spatial grid "x" and "y" and for the complex values of psi for the four instants of time depicted in figure 6 ("psia", "psib", "psic", "psid").

    Figure 7: Data are in data_fig_7.pt

    This is a data file generated with the torch module of python. It includes six torch tensors for the spatial grid "x" and "y" and for the complex values of psi for the four instants of time depicted in figure 7 ("psia", "psib", "psic", "psid").

    Figures 8 and 10: Data are in data_figs_8_and_10.pt

    This is a data file generated with the torch module of Python. It includes eight torch tensors: the spatial grids "x" and "y" and the complex values of psi for the six eigenstates depicted in figures 8 and 10 ("psia", "psib", "psic", "psid", "psie", "psif"). Notice that figure 8 shows the square of the modulus and figure 10 the argument; both are obtained from the same data sets.

    Figure 9: Data are in fig9.csv

    The csv file has two columns separated by commas. The two columns correspond to values of momentum p (first column) and energy (second column).

    Figure 11: Data are in data_fig_11.pt

    This is a data file generated with the torch module of python. It includes ten torch tensors for the spatial grid "x" and "y" and for the complex values of psi for the two cases, four instants of time for each case, depicted in figure 11 ("psia", "psib", "psic", "psid", "psie", "psif", "psig", "psih").

    Figure 12: Data are in data_fig_12.pt

    This is a data file generated with the torch module of python. It includes eight torch tensors for the spatial grid "x" and "y" and for the complex values of psi for the six instants of time depicted in figure 12 ("psia", "psib", "psic", "psid", "psie", "psif").

    Figure 13: Data are in data_fig_13.pt

    This is a data file generated with the torch module of python. It includes ten torch tensors for the spatial grid "x" and "y" and for the complex values of psi for the eight instants of time depicted in figure 13 ("psia", "psib", "psic", "psid", "psie", "psif", "psig", "psih").

  5. Data and tools for studying isograms

    • figshare.com
    Updated Jul 31, 2017
    Cite
    Florian Breit (2017). Data and tools for studying isograms [Dataset]. http://doi.org/10.6084/m9.figshare.5245810.v1
    Explore at:
    application/x-sqlite3. Available download formats
    Dataset updated
    Jul 31, 2017
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Florian Breit
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A collection of datasets and Python scripts for extraction and analysis of isograms (and some palindromes and tautonyms) from corpus-based word-lists, specifically Google Ngram and the British National Corpus (BNC). Below follows a brief description, first, of the included datasets and, second, of the included scripts.

    1. Datasets

    The data from English Google Ngrams and the BNC is available in two formats: as a plain text CSV file and as a SQLite3 database.

    1.1 CSV format

    The CSV files for each dataset actually come in two parts: one labelled ".csv" and one ".totals". The ".csv" contains the actual extracted data, and the ".totals" file contains some basic summary statistics about the ".csv" dataset with the same name. The CSV files contain one row per data point, with the columns separated by a single tab stop. There are no labels at the top of the files. Each line has the following columns, in this order (the labels below are what I use in the database, which has an identical structure, see section below):

    isogramy (int): The order of isogramy, e.g. "2" is a second order isogram
    length (int): The length of the word in letters
    word (text): The actual word/isogram in ASCII
    source_pos (text): The Part of Speech tag from the original corpus
    count (int): Token count (total number of occurrences)
    vol_count (int): Volume count (number of different sources which contain the word)
    count_per_million (int): Token count per million words
    vol_count_as_percent (int): Volume count as percentage of the total number of volumes
    is_palindrome (bool): Whether the word is a palindrome (1) or not (0)
    is_tautonym (bool): Whether the word is a tautonym (1) or not (0)

    The ".totals" files have a slightly different format, with one row per data point, where the first column is the label and the second column is the associated value. The ".totals" files contain the following data:

    !total_1grams (int): The total number of words in the corpus
    !total_volumes (int): The total number of volumes (individual sources) in the corpus
    !total_isograms (int): The total number of isograms found in the corpus (before compacting)
    !total_palindromes (int): How many of the isograms found are palindromes
    !total_tautonyms (int): How many of the isograms found are tautonyms

    The CSV files are mainly useful for further automated data processing. For working with the data set directly (e.g. to do statistics or cross-check entries), I would recommend using the database format described below.

    1.2 SQLite database format

    On the other hand, the SQLite database combines the data from all four of the plain text files, and adds various useful combinations of the two datasets, namely:

    • Compacted versions of each dataset, where identical headwords are combined into a single entry.
    • A combined compacted dataset, combining and compacting the data from both Ngrams and the BNC.
    • An intersected dataset, which contains only those words which are found in both the Ngrams and the BNC dataset.

    The intersected dataset is by far the least noisy, but is missing some real isograms, too. The columns/layout of each of the tables in the database is identical to that described for the CSV/.totals files above. To get an idea of the various ways the database can be queried for various bits of data, see the R script described below, which computes statistics based on the SQLite database.

    2. Scripts

    There are three scripts: one for tidying Ngram and BNC word lists and extracting isograms, one to create a neat SQLite database from the output, and one to compute some basic statistics from the data. The first script can be run using Python 3, the second script can be run using SQLite 3 from the command line, and the third script can be run in R/RStudio (R version 3).

    2.1 Source data

    The scripts were written to work with word lists from Google Ngram and the BNC, which can be obtained from http://storage.googleapis.com/books/ngrams/books/datasetsv2.html and https://www.kilgarriff.co.uk/bnc-readme.html (download all.al.gz). For Ngram the script expects the path to the directory containing the various files, for BNC the direct path to the *.gz file.

    2.2 Data preparation

    Before processing proper, the word lists need to be tidied to exclude superfluous material and some of the most obvious noise. This will also bring them into a uniform format. Tidying and reformatting can be done by running one of the following commands:

    python isograms.py --ngrams --indir=INDIR --outfile=OUTFILE
    python isograms.py --bnc --indir=INFILE --outfile=OUTFILE

    Replace INDIR/INFILE with the input directory or filename and OUTFILE with the filename for the tidied and reformatted output.

    2.3 Isogram Extraction

    After preparing the data as above, isograms can be extracted by running the following command on the reformatted and tidied files:

    python isograms.py --batch --infile=INFILE --outfile=OUTFILE

    Here INFILE should refer to the output from the previous data cleaning process. Please note that the script will actually write two output files, one named OUTFILE with a word list of all the isograms and their associated frequency data, and one named "OUTFILE.totals" with very basic summary statistics.

    2.4 Creating a SQLite3 database

    The output data from the above step can be easily collated into a SQLite3 database which allows for easy querying of the data directly for specific properties. The database can be created by following these steps:

    1. Make sure the files with the Ngrams and BNC data are named "ngrams-isograms.csv" and "bnc-isograms.csv" respectively. (The script assumes you have both of them; if you only want to load one, just create an empty file for the other one.)
    2. Copy the "create-database.sql" script into the same directory as the two data files.
    3. On the command line, go to the directory where the files and the SQL script are.
    4. Type: sqlite3 isograms.db
    5. This will create a database called "isograms.db".

    See section 1 for a basic description of the output data and how to work with the database.

    2.5 Statistical processing

    The repository includes an R script (R version 3) named "statistics.r" that computes a number of statistics about the distribution of isograms by length, frequency, contextual diversity, etc. This can be used as a starting point for running your own stats. It uses RSQLite to access the SQLite database version of the data described above.
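    As a minimal sketch of querying the resulting database from Python (the table name "ngrams_isograms" is a guess; the actual table names are defined in create-database.sql, and the column names are those listed in section 1.1):

    import sqlite3

    conn = sqlite3.connect("isograms.db")

    # "ngrams_isograms" is a hypothetical table name -- check create-database.sql
    query = """
        SELECT word, length, count
        FROM ngrams_isograms
        WHERE is_palindrome = 1
        ORDER BY count DESC
        LIMIT 10
    """
    for word, length, count in conn.execute(query):
        print(word, length, count)

    conn.close()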

  6. London Housing Data

    • kaggle.com
    zip
    Updated Sep 15, 2025
    Cite
    Data Science Lovers (2025). London Housing Data [Dataset]. https://www.kaggle.com/datasets/rohitgrewal/london-housing-data
    Explore at:
    zip (138862 bytes). Available download formats
    Dataset updated
    Sep 15, 2025
    Authors
    Data Science Lovers
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Area covered
    London
    Description

    📹Project Video available on YouTube - https://youtu.be/q-Omt6LgRLc

    🖇️Connect with me on LinkedIn - https://www.linkedin.com/in/rohit-grewal

    London Housing Price Dataset

    The dataset contains housing market information for different areas of London over time. It includes details such as average house prices, the number of houses sold, and crime statistics. The data spans multiple years and is organized by date and geographic area.

    This data is available as a CSV file. We are going to analyze this dataset using a Pandas DataFrame.

    Using this dataset, we answered multiple questions with Python in our Project.

    Q. 1) Convert the Datatype of 'Date' column to Date-Time format.

    Q. 2.A) Add a new column 'year' to the dataframe, which contains the year only.

    Q. 2.B) Add a new column 'month' as the 2nd column in the dataframe, which contains the month only.

    Q. 3) Remove the columns 'year' and 'month' from the dataframe.

    Q. 4) Show all the records where 'No. of Crimes' is 0, and count how many such records there are.

    Q. 5) What is the maximum & minimum 'average_price' per year in England?

    Q. 6) What is the maximum & minimum 'No. of Crimes' recorded per area?

    Q. 7) Show the total count of records of each area, where average price is less than 100000.
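    As a rough Pandas sketch of how the first few questions could be approached (the file name, the exact column labels and the 'england' area value are assumptions; the columns are listed further below):

    import pandas as pd

    df = pd.read_csv("London_Housing.csv")      # hypothetical file name

    # Q1: convert 'Date' to date-time
    df["Date"] = pd.to_datetime(df["Date"])

    # Q2: add 'year' and 'month' columns ('month' inserted as the 2nd column)
    df["year"] = df["Date"].dt.year
    df.insert(1, "month", df["Date"].dt.month)

    # Q4: records where the number of crimes is 0
    zero_crime = df[df["No_of_crimes"] == 0]
    print(len(zero_crime))

    # Q5: max & min average price per year in England
    england = df[df["Area"] == "england"]
    print(england.groupby("year")["Average_price"].agg(["max", "min"]))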

    Enrol in our Udemy courses:
    1. Python Data Analytics Projects - https://www.udemy.com/course/bigdata-analysis-python/?referralCode=F75B5F25D61BD4E5F161
    2. Python For Data Science - https://www.udemy.com/course/python-for-data-science-real-time-exercises/?referralCode=9C91F0B8A3F0EB67FE67
    3. Numpy For Data Science - https://www.udemy.com/course/python-numpy-exercises/?referralCode=FF9EDB87794FED46CBDF

    These are the main Features/Columns available in the dataset :

    1) Date – The month and year when the data was recorded.

    2) Area – The London borough or area for which the housing and crime data is reported.

    3) Average_price – The average house price in the given area during the specified month.

    4) Code – The unique area code (e.g., government statistical code) corresponding to each borough or region.

    5) Houses_sold – The number of houses sold in the given area during the specified month.

    6) No_of_crimes – The number of crimes recorded in the given area during the specified month.

  7. DS1000Retrieval

    • huggingface.co
    Updated Oct 24, 2025
    Cite
    Massive Text Embedding Benchmark (2025). DS1000Retrieval [Dataset]. https://huggingface.co/datasets/mteb/DS1000Retrieval
    Explore at:
    Dataset updated
    Oct 24, 2025
    Dataset authored and provided by
    Massive Text Embedding Benchmark
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    DS1000Retrieval, an MTEB dataset from the Massive Text Embedding Benchmark.

    A code retrieval task based on 1,000 data science programming problems from DS-1000. Each query is a natural language description of a data science task (e.g., 'Create a scatter plot of column A vs column B with matplotlib'), and the corpus contains Python code implementations using libraries like pandas, numpy, matplotlib, scikit-learn, and scipy. The task is to retrieve the correct code snippet that solves the… See the full description on the dataset page: https://huggingface.co/datasets/mteb/DS1000Retrieval.
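    As a rough sketch, the dataset can presumably be pulled with the Hugging Face datasets library; the subset names below follow the usual MTEB retrieval layout and are an assumption, so check them against the dataset page:

    from datasets import load_dataset

    # Subset names ("corpus", "queries") follow the common MTEB retrieval layout;
    # verify them on https://huggingface.co/datasets/mteb/DS1000Retrieval
    corpus = load_dataset("mteb/DS1000Retrieval", "corpus")
    queries = load_dataset("mteb/DS1000Retrieval", "queries")

    print(corpus)
    print(queries)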

  8. Wrist-mounted IMU data towards the investigation of free-living human eating behavior - the Free-living Food Intake Cycle (FreeFIC) dataset

    • data.niaid.nih.gov
    Updated Jun 20, 2022
    + more versions
    Cite
    Kyritsis, Konstantinos; Diou, Christos; Delopoulos, Anastasios (2022). Wrist-mounted IMU data towards the investigation of free-living human eating behavior - the Free-living Food Intake Cycle (FreeFIC) dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4420038
    Explore at:
    Dataset updated
    Jun 20, 2022
    Dataset provided by
    Aristotle University of Thessaloniki
    Harokopio University of Athens
    Authors
    Kyritsis, Konstantinos; Diou, Christos; Delopoulos, Anastasios
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction

    The Free-living Food Intake Cycle (FreeFIC) dataset was created by the Multimedia Understanding Group towards the investigation of in-the-wild eating behavior. This is achieved by recording the subjects' meals as a small part of their everyday, unscripted activities. The FreeFIC dataset contains the 3D acceleration and orientation velocity signals (6 DoF) from 22 in-the-wild sessions provided by 12 unique subjects. All sessions were recorded using a commercial smartwatch (the Huawei Watch 2™ for 6 sessions and the MobVoi TicWatch™ for the rest) while the participants performed their everyday activities. In addition, FreeFIC also contains the start and end moments of each meal session as reported by the participants.

    Description

    FreeFIC includes 22 in-the-wild sessions that belong to 12 unique subjects. Participants were instructed to wear the smartwatch on the hand of their preference well ahead of any meal and continue to wear it throughout the day until the battery was depleted. In addition, we followed a self-report labeling model, meaning that the ground truth is provided by the participants by documenting the start and end moments of their meals to the best of their abilities, as well as the hand they wear the smartwatch on. The total duration of the 22 recordings sums up to 112.71 hours, with a mean duration of 5.12 hours. Additional data statistics can be obtained by executing the provided Python script stats_dataset.py. Furthermore, the accompanying Python script viz_dataset.py will visualize the IMU signals and ground truth intervals for each of the recordings. Information on how to execute the Python scripts can be found below.

    The script(s) and the pickle file must be located in the same directory.

    Tested with Python 3.6.4

    Requirements: Numpy, Pickle and Matplotlib

    Calculate and echo dataset statistics

    $ python stats_dataset.py

    Visualize signals and ground truth

    $ python viz_dataset.py

    FreeFIC is also tightly related to Food Intake Cycle (FIC), a dataset we created in order to investigate the in-meal eating behavior. More information about FIC can be found here and here.

    Publications

    If you plan to use the FreeFIC dataset or any of the resources found in this page, please cite our work:

    @article{kyritsis2020data,
    title={A Data Driven End-to-end Approach for In-the-wild Monitoring of Eating Behavior Using Smartwatches},
    author={Kyritsis, Konstantinos and Diou, Christos and Delopoulos, Anastasios},
    journal={IEEE Journal of Biomedical and Health Informatics}, year={2020},
    publisher={IEEE}}

    @inproceedings{kyritsis2017automated, title={Detecting Meals In the Wild Using the Inertial Data of a Typical Smartwatch}, author={Kyritsis, Konstantinos and Diou, Christos and Delopoulos, Anastasios}, booktitle={2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC)},
    year={2019}, organization={IEEE}}

    Technical details

    We provide the FreeFIC dataset as a pickle. The file can be loaded using Python in the following way:

    import pickle as pkl
    import numpy as np

    with open('./FreeFIC_FreeFIC-heldout.pkl', 'rb') as fh:
        dataset = pkl.load(fh)

    The dataset variable in the snippet above is a dictionary with 5 keys. Namely:

    'subject_id'

    'session_id'

    'signals_raw'

    'signals_proc'

    'meal_gt'

    The contents under a specific key can be obtained by:

    sub = dataset['subject_id']     # for the subject id
    ses = dataset['session_id']     # for the session id
    raw = dataset['signals_raw']    # for the raw IMU signals
    proc = dataset['signals_proc']  # for the processed IMU signals
    gt = dataset['meal_gt']         # for the meal ground truth

    The sub, ses, raw, proc and gt variables in the snippet above are lists with a length equal to 22. Elements across all lists are aligned; e.g., the 3rd element of the list under the 'session_id' key corresponds to the 3rd element of the list under the 'signals_proc' key.

    sub: list Each element of the sub list is a scalar (integer) that corresponds to the unique identifier of the subject and can take the following values: [1, 2, 3, 4, 13, 14, 15, 16, 17, 18, 19, 20]. It should be emphasized that the subjects with ids 15, 16, 17, 18, 19 and 20 belong to the held-out part of the FreeFIC dataset (more information can be found in the publication titled "A Data Driven End-to-end Approach for In-the-wild Monitoring of Eating Behavior Using Smartwatches" by Kyritsis et al.). Moreover, the subject identifier in FreeFIC is in-line with the subject identifier in the FIC dataset (more info here and here); i.e., FIC's subject with id equal to 2 is the same person as FreeFIC's subject with id equal to 2.

    ses: list Each element of this list is a scalar (integer) that corresponds to the unique identifier of the session and can range between 1 and 5. It should be noted that not all subjects have the same number of sessions.

    raw: list Each element of this list is a dictionary with the 'acc' and 'gyr' keys. The data under the 'acc' key is an N_acc x 4 numpy.ndarray that contains the timestamps in seconds (first column) and the 3D raw accelerometer measurements in g (second, third and fourth columns, representing the x, y and z axes, respectively). The data under the 'gyr' key is an N_gyr x 4 numpy.ndarray that contains the timestamps in seconds (first column) and the 3D raw gyroscope measurements in degrees/second (second, third and fourth columns, representing the x, y and z axes, respectively). All sensor streams are transformed in such a way that reflects all participants wearing the smartwatch on the same hand with the same orientation, thus achieving data uniformity. This transformation is in line with the signals in the FIC dataset (more info here and here). Finally, the length of the raw accelerometer and gyroscope numpy.ndarrays is different (N_acc != N_gyr). This behavior is predictable and is caused by the Android platform.

    proc: list Each element of this list is an M x 7 numpy.ndarray that contains the timestamps and the 3D accelerometer and gyroscope measurements for each meal. Specifically, the first column contains the timestamps in seconds, the second, third and fourth columns contain the x, y and z accelerometer values in g, and the fifth, sixth and seventh columns contain the x, y and z gyroscope values in degrees/second. Unlike elements in the raw list, processed measurements (in the proc list) have a constant sampling rate of 100 Hz and the accelerometer/gyroscope measurements are aligned with each other. In addition, all sensor streams are transformed in such a way that reflects all participants wearing the smartwatch on the same hand with the same orientation, thus achieving data uniformity. This transformation is in line with the signals in the FIC dataset (more info here and here). No other preprocessing is performed on the data; e.g., the acceleration component due to the Earth's gravitational field is present in the processed acceleration measurements. The potential researcher can consult the article "A Data Driven End-to-end Approach for In-the-wild Monitoring of Eating Behavior Using Smartwatches" by Kyritsis et al. on how to further preprocess the IMU signals (i.e., smooth and remove the gravitational component).

    meal_gt: list Each element of this list is a K x 2 matrix. Each row represents a meal interval for the specific in-the-wild session. The first column contains the timestamps of the meal start moments, whereas the second one contains the timestamps of the meal end moments. All timestamps are in seconds. The number of meals K varies across recordings (e.g., a recording exists where a participant consumed two meals).
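    As a minimal sketch (assuming the pickle has been loaded and the lists unpacked as shown above), the raw accelerometer signal and the reported meal intervals of the first session could be visualized like this:

    import matplotlib.pyplot as plt

    acc = raw[0]['acc']   # N_acc x 4 array: timestamp, x, y, z (see the description above)
    t = acc[:, 0]

    for axis, label in zip(range(1, 4), ['x', 'y', 'z']):
        plt.plot(t, acc[:, axis], label=label)

    # Shade the self-reported meal intervals of the same session
    for start, end in gt[0]:
        plt.axvspan(start, end, alpha=0.2)

    plt.xlabel('time [s]')
    plt.ylabel('acceleration [g]')
    plt.legend()
    plt.show()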

    Ethics and funding

    Informed consent, including permission for third-party access to anonymised data, was obtained from all subjects prior to their engagement in the study. The work has received funding from the European Union's Horizon 2020 research and innovation programme under Grant Agreement No 727688 - BigO: Big data against childhood obesity.

    Contact

    Any inquiries regarding the FreeFIC dataset should be addressed to:

    Dr. Konstantinos KYRITSIS

    Multimedia Understanding Group (MUG) Department of Electrical & Computer Engineering Aristotle University of Thessaloniki University Campus, Building C, 3rd floor Thessaloniki, Greece, GR54124

    Tel: +30 2310 996359, 996365 Fax: +30 2310 996398 E-mail: kokirits [at] mug [dot] ee [dot] auth [dot] gr

  9. LIST

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 9, 2024
    Cite
    Vojtěch Kaše; Petra Heřmánková; Adéla Sobotková (2024). LIST [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6846656
    Explore at:
    Dataset updated
    Jan 9, 2024
    Dataset provided by
    University of West Bohemia
    Aarhus University
    Authors
    Vojtěch Kaše; Petra Heřmánková; Adéla Sobotková
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Latin Inscriptions in Space and Time (LIST) dataset is an aggregate of the Epigraphic Database Heidelberg (https://edh.ub.uni-heidelberg.de/; aggregated EDH on Zenodo) and the Epigraphic Database Clauss Slaby (http://www.manfredclauss.de/; aggregated EDCS on Zenodo), epigraphic datasets created by the Social Dynamics in the Ancient Mediterranean Project (SDAM), 2019-2023, funded by the Aarhus University Forskningsfond Starting grant no. AUFF-E-2018-7-2. The LIST dataset consists of 525,870 inscriptions, enriched by 65 attributes. 77,091 inscriptions overlap between the two source datasets (i.e. EDH and EDCS); 3,316 inscriptions are exclusively from EDH; 445,463 inscriptions are exclusively from EDCS. 511,973 inscriptions have valid geospatial coordinates (the geometry attribute). This information is also used to determine the urban context of each inscription, i.e. whether it is in the neighbourhood (within a 5000 m buffer) of a large, medium or small city, or rural (>5000 m to any type of city); see the attributes urban_context, urban_context_city, and urban_context_pop. 206,570 inscriptions have a numerical date of origin expressed by means of an interval or a singular year using the attributes not_before and not_after. The dataset also employs a machine learning model to classify the inscriptions covered exclusively by EDCS in terms of the 22 categories employed by EDH; see Kaše, Heřmánková, Sobotkova 2021.

    Formats

    We publish the dataset in the parquet and geojson file formats. A description of the individual attributes is available in the Metadata.csv. Using the geopandas library, you can load the data directly from Zenodo into your Python environment using the following command: LIST = gpd.read_parquet("https://zenodo.org/record/8431323/files/LIST_v1-0.parquet?download=1"). In R, the sfarrow and sf libraries provide tools (st_read_parquet(), read_sf()) to load a parquet and a geojson file respectively after you have downloaded the datasets locally. The scripts used to generate the dataset are available via GitHub: https://github.com/sdam-au/LI_ETL
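    As a short Python sketch building on the command above (the attribute names follow this record's description; whether every row has a non-null geometry should be checked against the data):

    import geopandas as gpd

    # Load LIST directly from Zenodo (command as given above)
    LIST = gpd.read_parquet("https://zenodo.org/record/8431323/files/LIST_v1-0.parquet?download=1")

    # Keep inscriptions that are dated and have valid coordinates
    dated = LIST[LIST["not_before"].notna() & LIST["not_after"].notna() & LIST.geometry.notna()]

    # Count inscriptions per urban context category
    print(dated["urban_context"].value_counts())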

    The origin of existing attributes is further described in columns ‘dataset_source’, ‘source’, and ‘description’ in the attached Metadata.csv.

    Further reading on the dataset creation and methodology:

    Heřmánková, Petra, Vojtěch Kaše, and Adéla Sobotkova. “Inscriptions as Data: Digital Epigraphy in Macro-Historical Perspective.” Journal of Digital History 1, no. 1 (2021): 99. https://doi.org/10.1515/jdh-2021-1004.

    Kaše, Vojtěch, Petra Heřmánková, and Adéla Sobotkova. “Classifying Latin Inscriptions of the Roman Empire: A Machine-Learning Approach.” Proceedings of the 2nd Workshop on Computational Humanities Research (CHR2021) 2989 (2021): 123–35.

    Reading on applications of the datasets in research:

    Glomb, Tomáš, Vojtěch Kaše, and Petra Heřmánková. “Popularity of the Cult of Asclepius in the Times of the Antonine Plague: Temporal Modeling of Epigraphic Evidence.” Journal of Archaeological Science: Reports 43 (2022): 103466. https://doi.org/10.1016/j.jasrep.2022.103466.

    Kaše, Vojtěch, Petra Heřmánková, and Adéla Sobotková. “Division of Labor, Specialization and Diversity in the Ancient Roman Cities: A Quantitative Approach to Latin Epigraphy.” Edited by Peter F. Biehl. PLOS ONE 17, no. 6 (June 16, 2022): e0269869. https://doi.org/10.1371/journal.pone.0269869.

    Notes on spatial attributes

    Machine-readable spatial point geometries are provided within the geojson and parquet formats, as well as ‘Latitude’ and ‘Longitude’ columns, which contain geospatial decimal coordinates where these are known. Additional attributes exist that contain textual references to original location at different scales. The most reliable attribute with textual information on place of origin is the urban_context_city. This contains the ancient toponym of the largest city within a 5 km distance from the inscription findspot, using cities from Hanson’s 2016 list. After these universal attributes, the remaining columns are source-dependent, and exist only for either EDH or EDCS subsets. ‘pleiades_id’ column, for example, cross references the inscription findspot to geospatial location in the Pleiades but only in the EDH subset. ‘place’ attribute exists for data from EDCS (Ort) and contains ancient as well as modern place names referring to the findspot or region of provenance separated by “/”. This column requires additional cleaning before computational analysis. Attributes with _clean affix indicate that the text string has been stripped of symbols (such as ?), and most refer to aspects of provenance in the EDH subset of inscriptions.

    List of all spatial attributes:

    ‘geometry’ spatial point coordinate pair, ready for computational use in R or Python

    ‘latitude’ and ‘longitude’ attributes contain geospatial coordinates

    ‘urban_context_city’ attribute contains a name (ancient toponym) of the city determining the urban context, based on Hanson 2016.

    ‘province’ attribute contains province names as they appear in EDCS. This attribute contains data only for inscriptions appearing in EDCS, for inscriptions appearing solely in EDH this attribute is empty.

    ‘pleiades_id’ provides a referent for the geographic location in Pleiades (https://pleiades.stoa.org/), provided by EDH. In EDCS this attribute is empty.

    ‘province_label_clean’ attribute contains province names as they appear in EDH. This attribute contains data only for inscriptions appearing in EDH, for inscriptions appearing solely in EDCS this attribute is empty.

    ‘findspot_ancient_clean’, ‘findspot_modern_clean’, ‘country_clean’, ‘modern_region_clean’, and ‘present_location’ are additional EDH metadata, for their description see the attached Metadata file.

    Disclaimer

    The original data is provided by the third party indicated as the data source (see the ‘data_source’ column in the Metadata.csv). SDAM did not create the original data, vouch for its accuracy, or guarantee that it is the most recent data available from the data provider. For many or all of the data, the data is by its nature approximate and will contain some inaccuracies or missing values. The data may contain errors introduced by the data provider(s) and/or by SDAM. We always recommend checking the accuracy directly in the primary source, i.e. the editio princeps of the inscription in question.

  10. LIRE

    • data.niaid.nih.gov
    Updated Oct 11, 2023
    Cite
    Kaše, Vojtěch; Heřmánková, Petra; Sobotková, Adéla (2023). LIRE [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5074773
    Explore at:
    Dataset updated
    Oct 11, 2023
    Dataset provided by
    Aarhus University
    Authors
    Kaše, Vojtěch; Heřmánková, Petra; Sobotková, Adéla
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Latin Inscriptions of the Roman Empire (LIRE) dataset is an aggregate of the Epigraphic Database Heidelberg (https://edh.ub.uni-heidelberg.de/; aggregated EDH on Zenodo) and the Epigraphic Database Clauss Slaby (http://www.manfredclauss.de/; aggregated EDCS on Zenodo), epigraphic datasets created by the Social Dynamics in the Ancient Mediterranean Project (SDAM), 2019-2023, funded by the Aarhus University Forskningsfond Starting grant no. AUFF-E-2018-7-2.

    The LIRE dataset is a filtered and spatiotemporally restricted version of the Latin Inscriptions in Space and Time (LIST) dataset, including only inscriptions which are (a) geolocated, (b) within the borders of the Roman Empire at its greatest extent, (c) dated, and (d) with a dating interval intersecting the period from 50 BC to 350 AD. In this version, the dataset consists of 182,852 records and 65 attributes. There are 59,374 inscriptions shared by EDH and EDCS, inheriting attributes from both parent collections. Further, there are 2,244 inscriptions recorded exclusively in EDH and 121,234 inscriptions originating solely from EDCS. In cases in which an inscription is available only in one dataset, it contains attributes only from that one dataset.

    Formats

    We publish the dataset as one parquet or geojson file. Using the geopandas library, you can load the data directly into your Python environment using the following command: LIRE = gpd.read_parquet("https://zenodo.org/record/8147298/files/LIRE_v2-3.parquet?download=1"). The scripts used to generate the dataset and their metadata are available via GitHub: https://github.com/sdam-au/LI_ETL. The origin of existing attributes is further described in the columns ‘dataset_source’, ‘source’, and ‘description’ in the attached Metadata.csv.

    Further reading on the dataset creation and methodology:

    Heřmánková, Petra, Vojtěch Kaše, and Adéla Sobotkova. “Inscriptions as Data: Digital Epigraphy in Macro-Historical Perspective.” Journal of Digital History 1, no. 1 (2021): 99. https://doi.org/10.1515/jdh-2021-1004.

    Kaše, Vojtěch, Petra Heřmánková, and Adéla Sobotkova. “Classifying Latin Inscriptions of the Roman Empire: A Machine-Learning Approach.” Proceedings of the 2nd Workshop on Computational Humanities Research (CHR2021) 2989 (2021): 123–35.

    Reading on applications of the datasets in research:

    Glomb, Tomáš, Vojtěch Kaše, and Petra Heřmánková. “Popularity of the Cult of Asclepius in the Times of the Antonine Plague: Temporal Modeling of Epigraphic Evidence.” Journal of Archaeological Science: Reports 43 (2022): 103466. https://doi.org/10.1016/j.jasrep.2022.103466.

    Kaše, Vojtěch, Petra Heřmánková, and Adéla Sobotková. “Division of Labor, Specialization and Diversity in the Ancient Roman Cities: A Quantitative Approach to Latin Epigraphy.” Edited by Peter F. Biehl. PLOS ONE 17, no. 6 (June 16, 2022): e0269869. https://doi.org/10.1371/journal.pone.0269869.

    Notes on spatial attributes

    Machine-readable spatial point geometries are provided within the geojson and parquet formats, as well as ‘Latitude’ and ‘Longitude’ columns, which contain geospatial decimal coordinates where these are known. Additional attributes exist that contain textual references to original location at different scales. The most reliable attribute with textual information on place of origin is the urban_context_city. This contains the ancient toponym of the largest city within a 5 km distance from the inscription findspot, using cities from Hanson’s 2016 list. After these universal attributes, the remaining columns are source-dependent, and exist only for either EDH or EDCS subsets. ‘pleiades_id’ column, for example, cross references the inscription findspot to geospatial location in the Pleiades but only in the EDH subset. ‘place’ attribute exists for data from EDCS (Ort) and contains ancient as well as modern place names referring to the findspot or region of provenance separated by “/”. This column requires additional cleaning before computational analysis. Attributes with _clean affix indicate that the text string has been stripped of symbols (such as ?), and most refer to aspects of provenance in the EDH subset of inscriptions.

    List of all spatial attributes:

    ‘geometry’ spatial point coordinate pair, ready for computational use in R or Python

    ‘latitude’ and ‘longitude’ attributes contain geospatial coordinates

    ‘urban_context_city’ attribute contains a name (ancient toponym) of the city determining the urban context, based on Hanson 2016.

    ‘province’ attribute contains province names as they appear in EDCS. This attribute contains data only for inscriptions appearing in EDCS, for inscriptions appearing solely in EDH this attribute is empty.

    ‘pleiades_id’ provides a referent for the geographic location in Pleiades (https://pleiades.stoa.org/), provided by EDH. In EDCS this attribute is empty.

    ‘province_label_clean’ attribute contains province names as they appear in EDH. This attribute contains data only for inscriptions appearing in EDH, for inscriptions appearing solely in EDCS this attribute is empty.

    ‘findspot_ancient_clean’, ‘findspot_modern_clean’, ‘country_clean’, ‘modern_region_clean’, ‘present_location’ are additional EDH metadata, for their description see the attached Metadata file.

    Disclaimer

    The original data is provided by the third party indicated as the data source (see the ‘data_source’ column in the Metadata.csv). SDAM did not create the original data, vouch for its accuracy, or guarantee that it is the most recent data available from the data provider. For many or all of the data, the data is by its nature approximate and will contain some inaccuracies or missing values. The data may contain errors introduced by the data provider(s) and/or by SDAM. We always recommend checking the accuracy directly in the primary source, i.e. the editio princeps of the inscription in question.

  11. BA ALL Assessment Units 1000m 'super set' 20160516_v01

    • researchdata.edu.au
    • data.wu.ac.at
    Updated Jun 18, 2018
    + more versions
    Cite
    Bioregional Assessment Program (2018). BA ALL Assessment Units 1000m 'super set' 20160516_v01 [Dataset]. https://researchdata.edu.au/ba-all-assessment-set-20160516v01/1435744
    Explore at:
    Dataset updated
    Jun 18, 2018
    Dataset provided by
    data.gov.au
    Authors
    Bioregional Assessment Program
    License

    Attribution 3.0 (CC BY 3.0), https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    Abstract

    The dataset was created by the Bioregional Assessment Programme. The History Field in this metadata statement describes how this dataset was created.

    A 1000 m x 1000 m vector grid over the entire Bioregional Assessment bioregions/Preliminary Areas of Extent (using the boundary that is largest), starting at the whole km to ensure grid lines fall exactly on the whole km. The grid is in Australian Albers (GDA94) (EPSG 3577). This grid is intended as the template for standardized assessment units for the following bioregional assessment regions:

    Hunter

    Namoi

    Clarence-Moreton

    Galilee

    Please note for the Gloucester subregion model a 500m grid ( GUID ) is proposed to be used as the standard assessment unit due to the finer resolution of the output models.

    To facilitate processing speed and efficiency each of the above Bioregional Assessments have their own grid and extent created from this master vector grid template, (please see Lineage).

    The unique ID field for each grid cell is AUID and starts from 1. The grid also has column and row ids for easy reference and processing.

    Purpose

    The GRID is an attempt to standardise (where possible) outputs of models from BA assessments and is a whole of BA template for the groundwater and potentially surface water teams of the above mentioned assessment areas.

    Dataset History

    The Vector grid was generated using the Fishnet tool in ArcGIS. The following fields were added:

    AUID - Assessment Unit Unique Id

    R001_C001 - A row and column id calculated using the following Python code in the ArcGIS field calculator, where 2685 is the number of rows in the grid and 2324 is the number of columns:

    'R' + str(( !OID!-1)/2685).rjust(3, '0') + '_C' + str(( !OID!-1)%2324).rjust(3, '0')
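    For reference, a standalone Python 3 sketch that mirrors the field-calculator expression above (2685 and 2324 are the row and column counts quoted above; the expression is reproduced as documented, not re-derived):

    def assessment_unit_label(oid, n_rows=2685, n_cols=2324):
        # Mirrors the field-calculator expression; the ArcGIS 10.1 field calculator
        # runs Python 2, where "/" on integers is integer division.
        row = (oid - 1) // n_rows
        col = (oid - 1) % n_cols
        return 'R' + str(row).rjust(3, '0') + '_C' + str(col).rjust(3, '0')

    print(assessment_unit_label(1))  # -> R000_C000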

    A spatial index was added in ArcGIS 10.1 to increase processing and rendering speed using the Spatial index tool from the ArcToolbox.

    The following parameters were used to generate the grid in the Create Fishnet tool in ArcGIS 10.1

    Left: -148000

    Bottom: -4485000

    Fishnet Origin Coordinate: X Coordinate = -148000, Y Coordinate = -4485000

    Y-Axis Coordinate: X Coordinate = -148000, Y Coordinate = -4484990

    Cell Height - 1000m

    Cell Width - 1000m

    Number of rows 0

    Number of columns 0

    Opposite corner: default

    Geometry type: Polygon

    Y

    Dataset Citation

    XXXX XXX (2016) BA ALL Assessment Units 1000m 'super set' 20160516_v01. Bioregional Assessment Source Dataset. Viewed 13 March 2019, http://data.bioregionalassessments.gov.au/dataset/6c1aa99e-c973-4472-b434-756e60667bfa.

  12. NYC Jobs Dataset (Filtered Columns)

    • kaggle.com
    zip
    Updated Oct 5, 2022
    Cite
    Jeffery Mandrake (2022). NYC Jobs Dataset (Filtered Columns) [Dataset]. https://www.kaggle.com/datasets/jefferymandrake/nyc-jobs-filtered-cols
    Explore at:
    zip (93408 bytes). Available download formats
    Dataset updated
    Oct 5, 2022
    Authors
    Jeffery Mandrake
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Area covered
    New York
    Description

    Use this dataset with Misra's Pandas tutorial: How to use the Pandas GroupBy function | Pandas tutorial

    The original dataset came from this site: https://data.cityofnewyork.us/City-Government/NYC-Jobs/kpav-sd4t/data

    I used Google Colab to filter the columns with the following Pandas commands. Here's a Colab Notebook you can use with the commands listed below: https://colab.research.google.com/drive/17Jpgeytc075CpqDnbQvVMfh9j-f4jM5l?usp=sharing

    Once the csv file is uploaded to Google Colab, use these commands to process the file.

    import pandas as pd

    # load the file and create a pandas dataframe
    df = pd.read_csv('/content/NYC_Jobs.csv')

    # keep only these columns
    df = df[['Job ID', 'Civil Service Title', 'Agency', 'Posting Type',
             'Job Category', 'Salary Range From', 'Salary Range To']]

    # save the csv file without the index column
    df.to_csv('/content/NYC_Jobs_filtered_cols.csv', index=False)

  13. Wrist-mounted IMU data towards the investigation of in-meal human eating behavior - the Food Intake Cycle (FIC) dataset

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Jun 20, 2022
    Cite
    Kyritsis, Konstantinos; Diou, Christos; Delopoulos, Anastasios (2022). Wrist-mounted IMU data towards the investigation of in-meal human eating behavior - the Food Intake Cycle (FIC) dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4421860
    Explore at:
    Dataset updated
    Jun 20, 2022
    Dataset provided by
    Aristotle University of Thessaloniki
    Harokopio University of Athens
    Authors
    Kyritsis, Konstantinos; Diou, Christos; Delopoulos, Anastasios
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction

    The Food Intake Cycle (FIC) dataset was created by the Multimedia Understanding Group towards the investigation of in-meal eating behavior. The FIC dataset contains the triaxial acceleration and orientation velocity signals (6 DoF) from 21 meal sessions provided by 12 unique subjects. All meals were recorded in the restaurant of the Aristotle University of Thessaloniki using a commercial smartwatch: the Microsoft Band 2™ for ten out of the twenty-one meals and the Sony Smartwatch 2™ for the remaining meals. In addition, the start and end moments of each food intake cycle as well as of each micromovement are annotated throughout the FIC dataset.

    Description

    A total of 12 subjects were recorded while eating their lunch at the university's cafeteria. The total duration of the 21 meals sums up to 246 minutes, with a mean duration of 11.7 minutes. Each participant was free to select the food of their preference, typically consisting of a starter soup, a salad, a main course and a dessert. Prior to the recording, the participant was asked to wear the smartwatch on the hand that they typically use in everyday life to manipulate the fork and/or the spoon. A GoPro™ Hero 5 camera was already set at the table of the participant using a small tripod, 23 cm in height, facing the participant and including both the food tray and the upper body in its field of view. The purpose of the video recording was to obtain ground truth data by manually annotating the IMU sequences based on the video stream. Participants were also asked to perform a clapping hand movement both at the start and end of the meal, for synchronization purposes (as this movement is distinctive in the accelerometer signal). No other instructions were given to the participants. It should be noted that the FIC dataset does not contain instances related to liquid consumption or eating without the fork, knife and spoon (e.g. eating directly with hands). The accompanying Python script viz_dataset.py will visualize the IMU signals and food intake cycle (i.e., bite) ground truth intervals for each of the recordings. Information on how to execute the Python scripts can be found below.

    The script(s) and the pickle file must be located in the same directory.

    Tested with Python 3.6.4

    Requirements: Numpy, Pickle and Matplotlib

    Visualize signals and ground truth

    $ python viz_dataset.py

    FIC is also tightly related to FreeFIC, a dataset we created in order to investigate the in-the-wild eating behavior. More information on FreeFIC can be found here and here.

    Annotation

    Micromovements

    For all recordings, the start and end points of all 6 micromovements of interest were manually labeled. The micromovements of interest include:

    pick food, wrist manipulates a fork to pick food from the plate

    upwards, wrist moves upwards, towards the mouth area

    downwards, wrist moves downwards, away from the mouth area

    mouth, wrist inserts food in mouth

    no movement, wrist exhibits no movement

    other movement, every other wrist movement

The annotation process was performed in such a way that the start and end times of the micromovements span the whole meal session without overlapping each other.

    Food intake cycles

For all recordings, we annotated the start and end points of each intake cycle (i.e., every bite). Each food intake cycle starts with a pick food (p) micromovement, ends with a downwards (d) micromovement and contains a mouth (m) micromovement.

    Publications

    If you plan to use the FIC dataset or any of the resources found in this page, please cite our work:

@article{kyritsis2019modeling,
  title={Modeling Wrist Micromovements to Measure In-Meal Eating Behavior from Inertial Sensor Data},
  author={Kyritsis, Konstantinos and Diou, Christos and Delopoulos, Anastasios},
  journal={IEEE journal of biomedical and health informatics},
  year={2019},
  publisher={IEEE}}

@inproceedings{kyritsis2017food,
  title={Food intake detection from inertial sensors using lstm networks},
  author={Kyritsis, Konstantinos and Diou, Christos and Delopoulos, Anastasios},
  booktitle={International Conference on Image Analysis and Processing},
  pages={411--418},
  year={2017},
  organization={Springer}}

@inproceedings{kyritsis2017automated,
  title={Automated analysis of in meal eating behavior using a commercial wristband IMU sensor},
  author={Kyritsis, Konstantinos and Tatli, Christina Lefkothea and Diou, Christos and Delopoulos, Anastasios},
  booktitle={2017 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC)},
  pages={2843--2846},
  year={2017},
  organization={IEEE}}

    Technical details

    We provide the FIC dataset as a pickle. The file can be loaded using Python in the following way:

import pickle as pkl
import numpy as np

with open('./FIC.pkl', 'rb') as fh:
    dataset = pkl.load(fh)

The dataset variable in the snippet above is a dictionary with 6 keys. Namely:

    'subject_id'

    'session_id'

    'signals_raw'

    'signals_proc'

    'meal_gt'

    'bite_gt'

    The contents under a specific key can be obtained by:

sub = dataset['subject_id']    # for the subject id
ses = dataset['session_id']    # for the session id
raw = dataset['signals_raw']   # for the raw IMU signals
proc = dataset['signals_proc'] # for the processed IMU signals
mm = dataset['mm_gt']          # for the micromovement ground truth
bite = dataset['bite_gt']      # for the bite ground truth

The sub, ses, raw, proc, mm and bite variables in the snippet above are lists with a length equal to 21. Elements across all lists are aligned; e.g., the 3rd element of the list under the 'session_id' key corresponds to the 3rd element of the list under the 'signals_proc' key.

sub: list Each element of the sub list is a scalar (integer) that corresponds to the unique identifier of the subject and can take values between 1 and 12. Moreover, the subject identifier in FIC is in line with the subject identifier in the FreeFIC dataset (information available here and here); i.e., FIC’s subject with id equal to 2 is the same person as FreeFIC’s subject with id equal to 2.

ses: list Each element of this list is a scalar (integer) that corresponds to the unique identifier of the session and can range between 1 and 3. It should be noted that not all subjects have the same number of sessions.

raw: list Each element of this list is a dictionary with the 'acc', 'gyr' and 'offset' keys. The data under the 'acc' key is an N_acc × 4 numpy.ndarray that contains the timestamps in seconds (first column) and the 3D raw accelerometer measurements in g (second, third and fourth columns, representing the x, y and z axes, respectively). The data under the 'gyr' key is an N_gyr × 4 numpy.ndarray that contains the timestamps in seconds (first column) and the 3D raw gyroscope measurements in degrees/second (second, third and fourth columns, representing the x, y and z axes, respectively). All sensor streams are transformed in such a way that reflects all participants wearing the smartwatch on the same hand with the same orientation, thus achieving data uniformity. This transformation is consistent with the signals in the FreeFIC dataset (information available here and here). Finally, the lengths of the raw accelerometer and gyroscope numpy.ndarrays differ (N_acc ≠ N_gyr). This behavior is expected and is caused by the Android/MS Band platforms. The 'offset' key contains a float that is used to align the IMU sensor streams with the videos that were used for annotation purposes (videos are not provided).

proc: list Each element of this list is an M × 7 numpy.ndarray that contains the timestamps and the 3D accelerometer and gyroscope measurements for each meal. Specifically, the first column contains the timestamps in seconds, the second, third and fourth columns contain the x, y and z accelerometer values in g, and the fifth, sixth and seventh columns contain the x, y and z gyroscope values in degrees/second. Unlike elements in the raw list, processed measurements (in the proc list) have a constant sampling rate of 100 Hz and the accelerometer/gyroscope measurements are aligned with each other. In addition, all sensor streams are transformed in such a way that reflects all participants wearing the smartwatch on the same hand with the same orientation, thus achieving data uniformity. This transformation is consistent with the signals in the FreeFIC dataset (information available here and here). No other preprocessing is performed on the data; e.g., the acceleration component due to the Earth's gravitational field is present in the processed acceleration measurements. Interested researchers can consult the article "Modeling Wrist Micromovements to Measure In-Meal Eating Behavior from Inertial Sensor Data" by Kyritsis et al. on how to further preprocess the IMU signals (i.e., smooth them and remove the gravitational component).

mm: list Each element of this list is a K × 3 numpy.ndarray. Each row represents a single micromovement interval. The first column contains the timestamps of the start moments in seconds, the second column the timestamps of the end moments in seconds and the third column a number representing the type of the micromovement. The identifier-to-micromovement mapping is: 1 → no movement, 2 → upwards, 3 → downwards, 4 → pick food, 5 → mouth, 6 → other movement.

bite: list Each element of this list is an L × 2 numpy.ndarray. Each row represents a single food intake event (i.e., a bite). The first column contains the start moments while the second column contains the end moments of each intake event. Both the start and end moments are provided in seconds.
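As an illustration of the layout described above, the following minimal sketch counts the bites in each recording and computes their mean duration. It assumes the key names and array layouts listed in this section; the print formatting is purely illustrative.

import pickle as pkl

# Load the dataset (assumes FIC.pkl sits in the current directory)
with open('./FIC.pkl', 'rb') as fh:
    dataset = pkl.load(fh)

# Summarize the bite ground truth for each of the 21 recordings
for sub_id, ses_id, bites in zip(dataset['subject_id'],
                                 dataset['session_id'],
                                 dataset['bite_gt']):
    durations = bites[:, 1] - bites[:, 0]  # end minus start, in seconds
    print(f"subject {sub_id}, session {ses_id}: "
          f"{len(bites)} bites, mean duration {durations.mean():.1f} s")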

    Ethics and funding

    Informed consent, including permission

  14. Z

    Data from: Russian Financial Statements Database: A firm-level collection of...

    • data.niaid.nih.gov
    Updated Mar 14, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Bondarkov, Sergey; Ledenev, Victor; Skougarevskiy, Dmitriy (2025). Russian Financial Statements Database: A firm-level collection of the universe of financial statements [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_14622208
    Explore at:
    Dataset updated
    Mar 14, 2025
    Dataset provided by
    European University at St. Petersburg
    European University at St Petersburg
    Authors
    Bondarkov, Sergey; Ledenev, Victor; Skougarevskiy, Dmitriy
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    The Russian Financial Statements Database (RFSD) is an open, harmonized collection of annual unconsolidated financial statements of the universe of Russian firms:

    • 🔓 First open data set with information on every active firm in Russia.

    • 🗂️ First open financial statements data set that includes non-filing firms.

    • 🏛️ Sourced from two official data providers: the Rosstat and the Federal Tax Service.

    • 📅 Covers 2011-2023 initially, will be continuously updated.

    • 🏗️ Restores as much data as possible through non-invasive data imputation, statement articulation, and harmonization.

    The RFSD is hosted on 🤗 Hugging Face and Zenodo and is stored in a structured, column-oriented, compressed binary format Apache Parquet with yearly partitioning scheme, enabling end-users to query only variables of interest at scale.

    The accompanying paper provides internal and external validation of the data: http://arxiv.org/abs/2501.05841.

    Here we present the instructions for importing the data in R or Python environment. Please consult with the project repository for more information: http://github.com/irlcode/RFSD.

    Importing The Data

    You have two options to ingest the data: download the .parquet files manually from Hugging Face or Zenodo or rely on 🤗 Hugging Face Datasets library.

    Python

    🤗 Hugging Face Datasets

    It is as easy as:

from datasets import load_dataset
import polars as pl

# This line will download 6.6GB+ of all RFSD data and store it in a 🤗 cache folder
RFSD = load_dataset('irlspbru/RFSD')

# Alternatively, this will download ~540MB with all financial statements for 2023
# to a Polars DataFrame (requires about 8GB of RAM)
RFSD_2023 = pl.read_parquet('hf://datasets/irlspbru/RFSD/RFSD/year=2023/*.parquet')

Please note that the data is not shuffled within a year, meaning that streaming the first n rows will not yield a random sample.
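If a random subsample is needed, one option is to sample rows explicitly after loading instead of streaming the head of the file. A minimal sketch, reusing the 2023 Polars frame from the snippet above (the sample size and seed are arbitrary):

import polars as pl

# Load the 2023 partition and draw a reproducible random sample of 10,000 rows
RFSD_2023 = pl.read_parquet('hf://datasets/irlspbru/RFSD/RFSD/year=2023/*.parquet')
RFSD_2023_sample = RFSD_2023.sample(n=10_000, seed=42)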

    Local File Import

Importing in Python requires the pyarrow package to be installed.

import pyarrow.dataset as ds
import polars as pl

# Read RFSD metadata from local file
RFSD = ds.dataset("local/path/to/RFSD")

# Use RFSD.schema to glimpse the data structure and columns' classes
print(RFSD.schema)

# Load full dataset into memory
RFSD_full = pl.from_arrow(RFSD.to_table())

# Load only 2019 data into memory
RFSD_2019 = pl.from_arrow(RFSD.to_table(filter=ds.field('year') == 2019))

# Load only revenue for firms in 2019, identified by taxpayer id
RFSD_2019_revenue = pl.from_arrow(
    RFSD.to_table(
        filter=ds.field('year') == 2019,
        columns=['inn', 'line_2110']
    )
)

# Give suggested descriptive names to variables
renaming_df = pl.read_csv('local/path/to/descriptive_names_dict.csv')
RFSD_full = RFSD_full.rename({item[0]: item[1] for item in zip(renaming_df['original'], renaming_df['descriptive'])})

    R

    Local File Import

Importing in R requires the arrow package to be installed.

library(arrow)
library(data.table)

# Read RFSD metadata from local file
RFSD <- open_dataset("local/path/to/RFSD")

# Use schema() to glimpse into the data structure and column classes
schema(RFSD)

# Load full dataset into memory
scanner <- Scanner$create(RFSD)
RFSD_full <- as.data.table(scanner$ToTable())

# Load only 2019 data into memory
scan_builder <- RFSD$NewScan()
scan_builder$Filter(Expression$field_ref("year") == 2019)
scanner <- scan_builder$Finish()
RFSD_2019 <- as.data.table(scanner$ToTable())

# Load only revenue for firms in 2019, identified by taxpayer id
scan_builder <- RFSD$NewScan()
scan_builder$Filter(Expression$field_ref("year") == 2019)
scan_builder$Project(cols = c("inn", "line_2110"))
scanner <- scan_builder$Finish()
RFSD_2019_revenue <- as.data.table(scanner$ToTable())

# Give suggested descriptive names to variables
renaming_dt <- fread("local/path/to/descriptive_names_dict.csv")
setnames(RFSD_full, old = renaming_dt$original, new = renaming_dt$descriptive)

    Use Cases

    🌍 For macroeconomists: Replication of a Bank of Russia study of the cost channel of monetary policy in Russia by Mogiliat et al. (2024) — interest_payments.md

    🏭 For IO: Replication of the total factor productivity estimation by Kaukin and Zhemkova (2023) — tfp.md

    🗺️ For economic geographers: A novel model-less house-level GDP spatialization that capitalizes on geocoding of firm addresses — spatialization.md

    FAQ

Why should I use this data instead of Interfax's SPARK, Moody's Ruslana, or Kontur's Focus?

    To the best of our knowledge, the RFSD is the only open data set with up-to-date financial statements of Russian companies published under a permissive licence. Apart from being free-to-use, the RFSD benefits from data harmonization and error detection procedures unavailable in commercial sources. Finally, the data can be easily ingested in any statistical package with minimal effort.

    What is the data period?

    We provide financials for Russian firms in 2011-2023. We will add the data for 2024 by July, 2025 (see Version and Update Policy below).

    Why are there no data for firm X in year Y?

    Although the RFSD strives to be an all-encompassing database of financial statements, end users will encounter data gaps:

    We do not include financials for firms that we considered ineligible to submit financial statements to the Rosstat/Federal Tax Service by law: financial, religious, or state organizations (state-owned commercial firms are still in the data).

Eligible firms may enjoy the right not to disclose under certain conditions. For instance, Gazprom did not file in 2022 and we had to impute its 2022 data from 2023 filings. Sibur filed only in 2023, and Novatek only in 2020 and 2021. Commercial data providers such as Interfax's SPARK enjoy dedicated access to the Federal Tax Service data and are therefore able to source this information elsewhere.

A firm may also have submitted its annual statement even though, according to the Uniform State Register of Legal Entities (EGRUL), it was not active in that year. We remove those filings.

    Why is the geolocation of firm X incorrect?

We use Nominatim to geocode the structured addresses of incorporation of legal entities from the EGRUL. There may be errors in the original addresses that prevent us from geocoding firms to a particular house. Gazprom, for instance, is geocoded to house level in 2014 and 2021-2023, but only to street level for 2015-2020 due to improper handling of the house number by Nominatim. In that case we have fallen back to street-level geocoding. Additionally, streets in different districts of one city may share identical names. We have ignored those problems in our geocoding and invite your submissions. Finally, the address of incorporation may not correspond to plant locations. For instance, Rosneft has 62 field offices in addition to the central office in Moscow. We ignore the location of such offices in our geocoding, but subsidiaries set up as separate legal entities are still geocoded.

    Why is the data for firm X different from https://bo.nalog.ru/?

Many firms submit correcting statements after the initial filing. While we downloaded the data well past the April 2024 deadline for 2023 filings, firms may have kept submitting correcting statements. We will capture them in future releases.

    Why is the data for firm X unrealistic?

    We provide the source data as is, with minimal changes. Consider a relatively unknown LLC Banknota. It reported 3.7 trillion rubles in revenue in 2023, or 2% of Russia's GDP. This is obviously an outlier firm with unrealistic financials. We manually reviewed the data and flagged such firms for user consideration (variable outlier), keeping the source data intact.

    Why is the data for groups of companies different from their IFRS statements?

We should stress that we provide unconsolidated financial statements filed according to the Russian accounting standards, meaning that it would be wrong to infer financials for corporate groups from this data. Gazprom, for instance, had over 800 affiliated entities, and to study this corporate group in its entirety it is not enough to consider the financials of the parent company alone.

    Why is the data not in CSV?

    The data is provided in Apache Parquet format. This is a structured, column-oriented, compressed binary format allowing for conditional subsetting of columns and rows. In other words, you can easily query financials of companies of interest, keeping only variables of interest in memory, greatly reducing data footprint.

    Version and Update Policy

    Version (SemVer): 1.0.0.

We intend to update the RFSD annually as the data becomes available, in other words when most of the firms have their statements filed with the Federal Tax Service. The official deadline for filing of previous-year statements is April 1. However, every year a portion of firms either fails to meet the deadline or submits corrections afterwards. Filing continues up to the very end of the year, but after the end of April this stream quickly thins out. Nevertheless, there is obviously a trade-off between data completeness and the timely availability of a new version. We find it a reasonable compromise to query new data in early June, since on average by the end of May 96.7% of statements are already filed, including 86.4% of all correcting filings. We plan to make a new version of the RFSD available by July.

    Licence

Creative Commons Attribution 4.0 International License (CC BY 4.0).

    Copyright © the respective contributors.

    Citation

    Please cite as:

@unpublished{bondarkov2025rfsd,
  title={{R}ussian {F}inancial {S}tatements {D}atabase},
  author={Bondarkov, Sergey and Ledenev, Victor and Skougarevskiy, Dmitriy},
  note={arXiv preprint arXiv:2501.05841},
  doi={https://doi.org/10.48550/arXiv.2501.05841},
  year={2025}}

    Acknowledgments and Contacts

    Data collection and processing: Sergey Bondarkov, sbondarkov@eu.spb.ru, Viktor Ledenev, vledenev@eu.spb.ru

    Project conception, data validation, and use cases: Dmitriy Skougarevskiy, Ph.D.,

  15. H

    Library Services Contributing to Institutional Success Python Code

    • dataverse.harvard.edu
    Updated Oct 24, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Elizabeth Szkirpan (2024). Library Services Contributing to Institutional Success Python Code [Dataset]. http://doi.org/10.7910/DVN/PAXRYR
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 24, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Elizabeth Szkirpan
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

Python code hosted in a Jupyter Notebook file (.ipynb) that investigates version 7 of Szkirpan's R1 Library Data Code from July 2024 using data visualizations, correlation matrices, and regression analytics. This is the fourth and final version of code used for this project. Version 1 of the code was abandoned due to a number formatting issue in the raw dataset (version 6 of the dataset). Version 6 of the dataset was revised as version 7 with correctly formatted numbers before the data was reimported into the second version of the Python code. The second version of the code used data visualizations, correlation matrices, and regression analysis to examine relationships within the dataset. However, due to a coding error, existing dummy variables were incorrectly re-encoded as dummy variables within Python, resulting in erroneous columns. The third version of the code created an additional correlation matrix that removed all dummy variables in order to analyze the relationships among numerical-only columns. The second version of the code was revised to fix the dummy variable issue and combined with the third version of the code, resulting in the fourth and final version of the code. Version 4 of the code is available within this dataverse as 20240809_Szkirpan_LibraryServicesCode.ipynb. Versions 1-3 are not available to researchers and are stored in a password-protected file as they are draft code and do not feed into final analyses.

  16. H

    New Zealand Hydrological Society Data Workshop 2024: A Python Package for...

    • beta.hydroshare.org
    • hydroshare.org
    • +1more
    zip
    Updated Apr 9, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Amber Spackman Jones (2024). New Zealand Hydrological Society Data Workshop 2024: A Python Package for Automating Aquatic Data QA/QC [Dataset]. https://beta.hydroshare.org/resource/5e942e193e494f3fab89dc317d8084fa/
    Explore at:
    zip(159.6 MB)Available download formats
    Dataset updated
    Apr 9, 2024
    Dataset provided by
    HydroShare
    Authors
    Amber Spackman Jones
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    New Zealand
    Description

This resource was created for the 2024 New Zealand Hydrological Society Data Workshop in Queenstown, NZ. It contains Jupyter Notebooks with examples for conducting quality control post-processing for in situ aquatic sensor data. The code uses the Python pyhydroqc package to detect anomalies. The resource consists of the example notebooks listed below and associated data files. For more information, see the original resource from which this was derived: http://www.hydroshare.org/resource/451c4f9697654b1682d87ee619cd7924.

Notebooks:
1. Example 1: Import and plot data
2. Example 2: Perform rules-based quality control
3. Example 3: Perform model-based quality control (ARIMA)
4. Example 4: Model-based quality control (ARIMA) with user data

Data files: Data files are available for 6 aquatic sites in the Logan River Observatory. Each file contains a single year of data for one site. The files are named according to monitoring site (FranklinBasin, TonyGrove, WaterLab, MainStreet, Mendon, BlackSmithFork) and year. The files were sourced by querying the Logan River Observatory relational database, and equivalent data could be obtained from the LRO website or on HydroShare. Additional information on sites, variables, and methods can be found on the LRO website (http://lrodata.usu.edu/tsa/) or HydroShare (https://www.hydroshare.org/search/?q=logan%20river%20observatory). Each file has the same structure, indexed with a datetime column (mountain standard time) and three columns per variable. Variable abbreviations and units are:
- temp: water temperature, degrees C
- cond: specific conductance, μS/cm
- ph: pH, standard units
- do: dissolved oxygen, mg/L
- turb: turbidity, NTU
- stage: stage height, cm

For each variable, there are 3 columns:
- Raw data value measured by the sensor (column header is the variable abbreviation).
- Technician quality controlled (corrected) value (column header is the variable abbreviation appended with '_cor').
- Technician labels/qualifiers (column header is the variable abbreviation appended with '_qual').

    There is also a file "data.csv" for use with Example 4. If any user wants to bring their own data file, they should structure it similarly to this file with a single column of datetime values and a single column of numeric observations labeled "raw".
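As a sketch of how one of these site files could be read in Python (the file name 'MainStreet_2019.csv' is a hypothetical placeholder; the column naming follows the description above):

import pandas as pd

# Load one site-year file, parsing the first column as the datetime index
df = pd.read_csv('MainStreet_2019.csv', index_col=0, parse_dates=True)

# Compare raw and technician-corrected water temperature
print(df[['temp', 'temp_cor', 'temp_qual']].head())

# Fraction of temperature observations changed during quality control
changed = (df['temp'] != df['temp_cor']).mean()
print(f"{changed:.1%} of temperature values were modified by the technician")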

  17. Z

    Data from: Long-Term Tracing of Indoor Solar Harvesting

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    • +1more
    Updated Jul 22, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Sigrist, Lukas; Gomez, Andres; Thiele, Lothar (2024). Long-Term Tracing of Indoor Solar Harvesting [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_3346975
    Explore at:
    Dataset updated
    Jul 22, 2024
    Dataset provided by
    ETH Zurich
    Authors
    Sigrist, Lukas; Gomez, Andres; Thiele, Lothar
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Information

This dataset presents long-term indoor solar harvesting traces, jointly monitored with the ambient conditions. The data is recorded at 6 indoor positions with diverse characteristics at our institute at ETH Zurich in Zurich, Switzerland.

    The data is collected with a measurement platform [3] consisting of a solar panel (AM-5412) connected to a bq25505 energy harvesting chip that stores the harvested energy in a virtual battery circuit. Two TSL45315 light sensors placed on opposite sides of the solar panel monitor the illuminance level and a BME280 sensor logs ambient conditions like temperature, humidity and air pressure.

    The dataset contains the measurement of the energy flow at the input and the output of the bq25505 harvesting circuit, as well as the illuminance, temperature, humidity and air pressure measurements of the ambient sensors. The following timestamped data columns are available in the raw measurement format, as well as preprocessed and filtered HDF5 datasets:

    V_in - Converter input/solar panel output voltage, in volt

    I_in - Converter input/solar panel output current, in ampere

    V_bat - Battery voltage (emulated through circuit), in volt

    I_bat - Net Battery current, in/out flowing current, in ampere

    Ev_left - Illuminance left of solar panel, in lux

Ev_right - Illuminance right of solar panel, in lux

    P_amb - Ambient air pressure, in pascal

    RH_amb - Ambient relative humidity, unit-less between 0 and 1

T_amb - Ambient temperature, in degrees Celsius
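Assuming the processed HDF5 files can be read into a pandas DataFrame with these column names (the file path and HDF5 key below are hypothetical placeholders; the notebooks described further down contain the exact loading code), the harvester input power can be derived from V_in and I_in:

import pandas as pd

# Load one processed position file (path and key are placeholders)
df = pd.read_hdf('processed/pos01_power.h5', key='data')

# Input power of the harvesting circuit, in watt
df['P_in'] = df['V_in'] * df['I_in']

print(f"mean harvested input power: {df['P_in'].mean() * 1e6:.1f} uW")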

The following publication presents an overview of the dataset and more details on the deployment used for data collection. A copy of the abstract is included in this dataset; see the file abstract.pdf.

    L. Sigrist, A. Gomez, and L. Thiele. "Dataset: Tracing Indoor Solar Harvesting." In Proceedings of the 2nd Workshop on Data Acquisition To Analysis (DATA '19), 2019.

    Folder Structure and Files

processed/ - This folder holds the imported, merged and filtered datasets of the power and sensor measurements. The datasets are stored in HDF5 format and split by measurement position posXX and by power and ambient sensor measurements. The files belonging to this folder are contained in archives named yyyy_mm_processed.tar, where yyyy and mm represent the year and month the data was published. A separate file lists the exact content of each archive (see below).

raw/ - This folder holds the raw measurement files recorded with the RocketLogger [1, 2] and using the measurement platform available at [3]. The files belonging to this folder are contained in archives named yyyy_mm_raw.tar, where yyyy and mm represent the year and month the data was published. A separate file lists the exact content of each archive (see below).

    LICENSE - License information for the dataset.

    README.md - The README file containing this information.

    abstract.pdf - A copy of the above mentioned abstract submitted to the DATA '19 Workshop, introducing this dataset and the deployment used to collect it.

raw_import.ipynb [open in nbviewer] - Jupyter Python notebook to import, merge, and filter the raw dataset from the raw/ folder. This is the exact code used to generate the processed dataset and store it in HDF5 format in the processed/ folder.

    raw_preview.ipynb [open in nbviewer] - This Jupyter Python notebook imports the raw dataset directly and plots a preview of the full power trace for all measurement positions.

    processing_python.ipynb [open in nbviewer] - Jupyter Python notebook demonstrating the import and use of the processed dataset in Python. Calculates column-wise statistics, includes more detailed power plots and the simple energy predictor performance comparison included in the abstract.

    processing_r.ipynb [open in nbviewer] - Jupyter R notebook demonstrating the import and use of the processed dataset in R. Calculates column-wise statistics and extracts and plots the energy harvesting conversion efficiency included in the abstract. Furthermore, the harvested power is analyzed as a function of the ambient light level.

    Dataset File Lists

    Processed Dataset Files

    The list of the processed datasets included in the yyyy_mm_processed.tar archive is provided in yyyy_mm_processed.files.md. The markdown formatted table lists the name of all files, their size in bytes, as well as the SHA-256 sums.

    Raw Dataset Files

    A list of the raw measurement files included in the yyyy_mm_raw.tar archive(s) is provided in yyyy_mm_raw.files.md. The markdown formatted table lists the name of all files, their size in bytes, as well as the SHA-256 sums.

    Dataset Revisions

    v1.0 (2019-08-03)

    Initial release. Includes the data collected from 2017-07-27 to 2019-08-01. The dataset archive files related to this revision are 2019_08_raw.tar and 2019_08_processed.tar. For position pos06, the measurements from 2018-01-06 00:00:00 to 2018-01-10 00:00:00 are filtered (data inconsistency in file indoor1_p27.rld).

    v1.1 (2019-09-09)

    Revision of the processed dataset v1.0 and addition of the final dataset abstract. Updated processing scripts reduce the timestamp drift in the processed dataset, the archive 2019_08_processed.tar has been replaced. For position pos06, the measurements from 2018-01-06 16:00:00 to 2018-01-10 00:00:00 are filtered (indoor1_p27.rld data inconsistency).

    v2.0 (2020-03-20)

    Addition of new data. Includes the raw data collected from 2019-08-01 to 2019-03-16. The processed data is updated with full coverage from 2017-07-27 to 2019-03-16. The dataset archive files related to this revision are 2020_03_raw.tar and 2020_03_processed.tar.

    Dataset Authors, Copyright and License

    Authors: Lukas Sigrist, Andres Gomez, and Lothar Thiele

    Contact: Lukas Sigrist (lukas.sigrist@tik.ee.ethz.ch)

    Copyright: (c) 2017-2019, ETH Zurich, Computer Engineering Group

    License: Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/)

    References

    [1] L. Sigrist, A. Gomez, R. Lim, S. Lippuner, M. Leubin, and L. Thiele. Measurement and validation of energy harvesting IoT devices. In Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017.

    [2] ETH Zurich, Computer Engineering Group. RocketLogger Project Website, https://rocketlogger.ethz.ch/.

    [3] L. Sigrist. Solar Harvesting and Ambient Tracing Platform, 2019. https://gitlab.ethz.ch/tec/public/employees/sigristl/harvesting_tracing

  18. D

    Data set for reproducing plots showing stable water isotopologue transport...

    • darus.uni-stuttgart.de
    Updated Oct 6, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Stefanie Kiemle; Katharina Heck (2022). Data set for reproducing plots showing stable water isotopologue transport and fractionation [Dataset]. http://doi.org/10.18419/DARUS-3108
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 6, 2022
    Dataset provided by
    DaRUS
    Authors
    Stefanie Kiemle; Katharina Heck
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Dataset funded by
    DFG
    Description

This data set includes the *.csv data and the scripts used to reproduce the plots of the three different scenarios presented in S. Kiemle, K. Heck, E. Coltman, R. Helmig (2022) Stable water isotopologue fractionation during soil-water evaporation: Analysis using a coupled soil-atmosphere model. (Under review) Water Resources Research.

*.csv files: The isotope distribution has been analyzed in the vertical and in the horizontal direction of a soil column for all scenarios. Therefore, we provide *.csv files generated using the ParaView tools "plot over line" or "plot over time". Each *.csv file contains information about the saturation, temperature, and component composition for each phase, in mole fraction or in the isotope-specific delta notation. Additionally, information about the evaporation rate is given in a separate *.txt file.

Python scripts: For each scenario, we provide scripts to reproduce the presented plots.

Scenarios: We used different free-flow conditions to analyze the fractionation processes inside the porous medium:
Scenario 1. laminar flow,
Scenario 2. laminar flow, but with isolation of parameters affecting the fractionation process,
Scenario 3. turbulent flow.

Please find below a detailed description of the data labeling and the scripts needed to reproduce a certain plot for each scenario.

Scenario 1: The spatial distribution of stable water isotopologues in the horizontal (-0.01 m depth) and vertical (at 0.05 m width) direction inside a soil column at selected days (DoE, Day of Experiment): Use the Python scripts plot_concentration_horizontal_all.py (horizontal direction) and plot_concentration_spatial_all.py (vertical direction) to create the specific plots. The corresponding *.csv files can be found in the folders IsotopeProfile_Horizontal and IsotopeProfile_Vertical. The *.csv files are named after the selected day (e.g. DoE_80 refers to day 80 of the virtual experiment). The influence of the evaporation rate on isotopic fractionation processes at various depths (-0.001, -0.005, -0.009, and -0.018 m) during the whole virtual experiment time: Use the Python script plot_evap_isotopes_v2.py to create the plots. The data for the isotopologue distribution and the saturation can be found in the folder PlotOverTime. All data is named PlotOverTime_xxxxm, with xxxx representing the respective depth (e.g. PlotOverTime_0.001m refers to -0.001 m depth). The data for the evaporation rate can be found in the folder EvaporationRate. Note: the evaporation rate data is available as *.txt because we extract the information about the evaporation directly during the simulation and do not derive it through any post-processing.

Scenario 2: Process behavior of isolated parameters that influence the isotopic fractionation: Use plot_concentration.py to reproduce the plots, represented either in the isotope-specific delta notation or in mole fraction. The corresponding data can be found in the folder IsotopeProfile_Vertical. The data labeling refers to the single cases (1 - no fractionation; 2 - only equilibrium fractionation; 3 - only kinetic fractionation; 4 - only liquid diffusion; 5 - reference).

Scenario 3: Evaporation rate during the virtual experiment for different flow cases: With plot_evap.py and the *.txt files which can be found in the folder EvaporationRate, the evaporation progression can be plotted. The labeling of the *.txt files refers to the different flow cases (1 - 0.1 m/s (laminar); 2 - 0.13 m/s (laminar); 3 - 0.5 m/s (turbulent); 4 - 1 m/s (turbulent); 5 - 3 m/s (turbulent)).
The isotope profiles in the vertical and horizontal direction of the soil column (similar to Scenario 1) for selected days: With plot_concentration_horizontal_all.py and plot_concentration_spatial_all.py the plots for the horizontal and vertical distribution of isotopologues can be generated. The corresponding data can be found in the folders IsotopeProfile_Horizontal and IsotopeProfile_Vertical. These folders are structured with subfolders containing the data of selected days of the virtual experiments (DoE - Day of Experiment), in this case days 2, 10, and 35. The data labeling remains similar to Scenario 3a).

  19. c

    Data for: An effective methodology to quantify cooling demand in the UK...

    • research-data.cardiff.ac.uk
    zip
    Updated Dec 20, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Lloyd Corcoran; Pranaynil Saikia; Carlos Ugalde Loo; Muditha Abeysekera (2024). Data for: An effective methodology to quantify cooling demand in the UK housing stock [Dataset]. http://doi.org/10.17035/cardiff.28017161.v1
    Explore at:
    zipAvailable download formats
    Dataset updated
    Dec 20, 2024
    Dataset provided by
    Cardiff University
    Authors
    Lloyd Corcoran; Pranaynil Saikia; Carlos Ugalde Loo; Muditha Abeysekera
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    United Kingdom
    Description

This repository contains the data shown in the figures in the paper ‘An effective methodology to quantify cooling demand in the UK housing stock’. The data is stored in CSV files, with the datetime index (created in Python) as the first column and the hourly thermal demand in the adjacent columns. Each header is the thermal efficiency dwelling code described in the paper. The compressed file (*.ZIP) of the data needs to be unzipped to obtain this folder. For more information on the datasets, please refer to the user manual provided along with the files. Please cite the following paper when using this data: L. Corcoran, P. Saikia, C. E. Ugalde-Loo, and M. Abeysekera, ‘An effective methodology to quantify cooling demand in the UK housing stock’, Appl. Energy, vol. 380, p. 125002, Feb. 2025, doi: 10.1016/j.apenergy.2024.125002.
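A minimal sketch of how one of these CSV files could be loaded in Python (the file name 'cooling_demand.csv' is a hypothetical placeholder; the dwelling-code column headers follow the description above):

import pandas as pd

# Read one CSV, using the first (datetime) column as the index
demand = pd.read_csv('cooling_demand.csv', index_col=0, parse_dates=True)

# Aggregate hourly thermal demand to annual totals per dwelling code
# (units as documented in the accompanying user manual)
annual_totals = demand.groupby(demand.index.year).sum()
print(annual_totals.head())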

  20. m

    Reddit r/AskScience Flair Dataset

    • data.mendeley.com
    Updated May 23, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Sumit Mishra (2022). Reddit r/AskScience Flair Dataset [Dataset]. http://doi.org/10.17632/k9r2d9z999.3
    Explore at:
    Dataset updated
    May 23, 2022
    Authors
    Sumit Mishra
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

Reddit is a social news, content rating and discussion website, and one of the most popular sites on the internet. Reddit has 52 million daily active users and approximately 430 million monthly users. Reddit is organized into different subreddits; here we'll use the r/AskScience subreddit.

The dataset is extracted from the subreddit r/AskScience on Reddit. The data was collected between 01-01-2016 and 20-05-2022, and contains 612,668 datapoints and 25 columns. The dataset holds a range of information about the questions asked on the subreddit: the description of the submission, the flair of the question, NSFW or SFW status, the year of the submission, and more. The data is extracted using Python and Pushshift's API, and a little bit of cleaning is done using NumPy and pandas as well (see the descriptions of individual columns below).

The dataset contains the following columns and descriptions:
author - Redditor name.
author_fullname - Redditor full name.
contest_mode - Contest mode [implement obscured scores and randomized sorting].
created_utc - Time the submission was created, represented in Unix time.
domain - Domain of submission.
edited - If the post is edited or not.
full_link - Link of the post on the subreddit.
id - ID of the submission.
is_self - Whether or not the submission is a self post (text-only).
link_flair_css_class - CSS class used to identify the flair.
link_flair_text - Flair on the post or the link flair's text content.
locked - Whether or not the submission has been locked.
num_comments - The number of comments on the submission.
over_18 - Whether or not the submission has been marked as NSFW.
permalink - A permalink for the submission.
retrieved_on - Time ingested.
score - The number of upvotes for the submission.
description - Description of the submission.
spoiler - Whether or not the submission has been marked as a spoiler.
stickied - Whether or not the submission is stickied.
thumbnail - Thumbnail of the submission.
question - Question asked in the submission.
url - The URL the submission links to, or the permalink if a self post.
year - Year of the submission.
banned - Banned by the moderator or not.

    This dataset can be used for Flair Prediction, NSFW Classification, and different Text Mining/NLP tasks. Exploratory Data Analysis can also be done to get the insights and see the trend and patterns over the years.
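As a sketch of the kind of exploratory analysis mentioned above (the file name 'askscience.csv' is a hypothetical placeholder; the column names follow the list above):

import pandas as pd

# Load the dataset (file name is a placeholder)
df = pd.read_csv('askscience.csv')

# Most common flairs - the natural target for a flair-prediction task
print(df['link_flair_text'].value_counts().head(10))

# Share of NSFW submissions per year
print(df.groupby('year')['over_18'].mean())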
