53 datasets found

US Consumer Complaints Against Businesses
kaggle.com
zip
Updated Oct 9, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jeffery Mandrake (2022). US Consumer Complaints Against Businesses [Dataset]. https://www.kaggle.com/jefferymandrake/us-consumer-complaints-dataset-through-2019
Explore at:
zip(343188956 bytes)Available download formats
Dataset updated
Oct 9, 2022
Authors
Jeffery Mandrake
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
2,121,458 records

I used Google Colab to check out this dataset and pull the column names using Pandas.

Sample code example: Python Pandas read csv file compressed with gzip and load into Pandas dataframe https://pastexy.com/106/python-pandas-read-csv-file-compressed-with-gzip-and-load-into-pandas-dataframe

Columns: ['Date received', 'Product', 'Sub-product', 'Issue', 'Sub-issue', 'Consumer complaint narrative', 'Company public response', 'Company', 'State', 'ZIP code', 'Tags', 'Consumer consent provided?', 'Submitted via', 'Date sent to company', 'Company response to consumer', 'Timely response?', 'Consumer disputed?', 'Complaint ID']

I did not modify the dataset.

Use it to practice with dataframes - Pandas or PySpark on Google Colab:

!unzip complaints.csv.zip

import pandas as pd df = pd.read_csv('complaints.csv') df.columns

df.head() etc.
Merge number of excel file,convert into csv file
kaggle.com
zip
Updated Mar 30, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Aashirvad pandey (2024). Merge number of excel file,convert into csv file [Dataset]. https://www.kaggle.com/datasets/aashirvadpandey/merge-number-of-excel-fileconvert-into-csv-file
Explore at:
zip(6731 bytes)Available download formats
Dataset updated
Mar 30, 2024
Authors
Aashirvad pandey
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Project Description:

Title: Pandas Data Manipulation and File Conversion

Overview: This project aims to demonstrate the basic functionalities of Pandas, a powerful data manipulation library in Python. In this project, we will create a DataFrame, perform some data manipulation operations using Pandas, and then convert the DataFrame into both Excel and CSV formats.

Key Objectives:

DataFrame Creation: Utilize Pandas to create a DataFrame with sample data.

Data Manipulation: Perform basic data manipulation tasks such as adding columns, filtering data, and performing calculations.

File Conversion: Convert the DataFrame into Excel (.xlsx) and CSV (.csv) file formats.

Tools and Libraries Used:

Python

Pandas

Project Implementation:

DataFrame Creation:

Import the Pandas library.

Create a DataFrame using either a dictionary, a list of dictionaries, or by reading data from an external source like a CSV file.

Populate the DataFrame with sample data representing various data types (e.g., integer, float, string, datetime).

Data Manipulation:

Add new columns to the DataFrame representing derived data or computations based on existing columns.

Filter the DataFrame to include only specific rows based on certain conditions.

Perform basic calculations or transformations on the data, such as aggregation functions or arithmetic operations.

File Conversion:

Utilize Pandas to convert the DataFrame into an Excel (.xlsx) file using the to_excel() function.

Convert the DataFrame into a CSV (.csv) file using the to_csv() function.

Save the generated files to the local file system for further analysis or sharing.

Expected Outcome:

Upon completion of this project, you will have gained a fundamental understanding of how to work with Pandas DataFrames, perform basic data manipulation tasks, and convert DataFrames into different file formats. This knowledge will be valuable for data analysis, preprocessing, and data export tasks in various data science and analytics projects.

Conclusion:

The Pandas library offers powerful tools for data manipulation and file conversion in Python. By completing this project, you will have acquired essential skills that are widely applicable in the field of data science and analytics. You can further extend this project by exploring more advanced Pandas functionalities or integrating it into larger data processing pipelines.in this data we add number of data and make that data a data frame.and save in single excel file as different sheet name and then convert that excel file in csv file .
Data from: LifeSnaps: a 4-month multi-modal dataset capturing unobtrusive...
zenodo.org
data.europa.eu
zip
Updated Oct 20, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sofia Yfantidou; Sofia Yfantidou; Christina Karagianni; Stefanos Efstathiou; Stefanos Efstathiou; Athena Vakali; Athena Vakali; Joao Palotti; Joao Palotti; Dimitrios Panteleimon Giakatos; Dimitrios Panteleimon Giakatos; Thomas Marchioro; Thomas Marchioro; Andrei Kazlouski; Elena Ferrari; Šarūnas Girdzijauskas; Šarūnas Girdzijauskas; Christina Karagianni; Andrei Kazlouski; Elena Ferrari (2022). LifeSnaps: a 4-month multi-modal dataset capturing unobtrusive snapshots of our lives in the wild [Dataset]. http://doi.org/10.5281/zenodo.6832242
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.6832242
Dataset updated
Oct 20, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Sofia Yfantidou; Sofia Yfantidou; Christina Karagianni; Stefanos Efstathiou; Stefanos Efstathiou; Athena Vakali; Athena Vakali; Joao Palotti; Joao Palotti; Dimitrios Panteleimon Giakatos; Dimitrios Panteleimon Giakatos; Thomas Marchioro; Thomas Marchioro; Andrei Kazlouski; Elena Ferrari; Šarūnas Girdzijauskas; Šarūnas Girdzijauskas; Christina Karagianni; Andrei Kazlouski; Elena Ferrari
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
LifeSnaps Dataset Documentation

Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet, limited data exist on the association between in the wild large-scale physical activity patterns, sleep, stress, and overall health, and behavioral patterns and psychological measurements due to challenges in collecting and releasing such datasets, such as waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset, a multi-modal, longitudinal, and geographically-distributed dataset, containing a plethora of anthropological data, collected unobtrusively for the total course of more than 4 months by n=71 participants, under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types from second to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to make these data available openly to empower future research. We envision that releasing this large-scale dataset of multi-modal real-world data, will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction.

The following instructions will get you started with the LifeSnaps dataset and are complementary to the original publication.

Data Import: Reading CSV

For ease of use, we provide CSV files containing Fitbit, SEMA, and survey data at daily and/or hourly granularity. You can read the files via any programming language. For example, in Python, you can read the files into a Pandas DataFrame with the pandas.read_csv() command.

Data Import: Setting up a MongoDB (Recommended)

To take full advantage of the LifeSnaps dataset, we recommend that you use the raw, complete data via importing the LifeSnaps MongoDB database.

To do so, open the terminal/command prompt and run the following command for each collection in the DB. Ensure you have MongoDB Database Tools installed from here.

For the Fitbit data, run the following:

mongorestore --host localhost:27017 -d rais_anonymized -c fitbit

For the SEMA data, run the following:

mongorestore --host localhost:27017 -d rais_anonymized -c sema

For surveys data, run the following:

mongorestore --host localhost:27017 -d rais_anonymized -c surveys

If you have access control enabled, then you will need to add the --username and --password parameters to the above commands.

Data Availability

The MongoDB database contains three collections, fitbit, sema, and surveys, containing the Fitbit, SEMA3, and survey data, respectively. Similarly, the CSV files contain related information to these collections. Each document in any collection follows the format shown below:

{ _id:
Dataset & code for "Using large language models to address the bottleneck of...
figshare.com
txt
Updated Nov 17, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yuyang Xie; xiao feng (2025). Dataset & code for "Using large language models to address the bottleneck of georeferencing natural history collections" [Dataset]. http://doi.org/10.6084/m9.figshare.28904936.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.28904936.v1
Dataset updated
Nov 17, 2025
Dataset provided by
figshare
Figsharehttp://figshare.com/
Authors
Yuyang Xie; xiao feng
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Datasets and codes, which are used in the paper "Using large language models to address the bottleneck of georeferencing natural history collections"1. System requirements: Windows 10; R language: v 4.2.2; Python: v 3.8.122. Instructions for use: The "data" folder contain the key sampling and intermediate data in the analysis process of this study. The initial specimen dataset included a total of 13,064,051 records from the Global Biodiversity Information Facility (GBIF) can be downloaded from GBIF DOI: https://doi.org/10.15468/dl.fj3sqk.Data file name and its meaning or purpose：occurrence_filter_clean.csv: The data before sampling 5,000 records based on continents, after cleaning the initial specimen datamain data frame 5000_only country state county locality.csv: The 5,000 sample data used for georeferencing, containing only basic information such as country, state/province, county, locality, and true latitude and longitude from GBIFmain data frame 100_only country state county locality.csv: The 100 sub-sample data used for humnan and reasoning-LLM georeferencing, containing only basic information such as country, state/province, county, locality, and true latitude and longitude from GBIFmain data frame 5000.csv: records all output data and required records from the analysis of 5,000 sample points, including coordinates and error distances from various georeferencing methods, locality text features, and readability metricsmain data frame 100.csv: records all output data and required records from the analysis of 100 sub-sample points, including coordinates and error distances from various georeferencing methods, locality text features, and readability metricsgeoref_errorDis.csv: used for Figure 1bsummary_error_time_cost.csv: time taken and cost records for various georeferencing methods, used for Figure 4for_human_completed.csv: results of manual georeferencing by the participantshf_v2geo.tif: Global Human Footprint Dataset (Geographic) (Version 2.00), from https://gis.earthdata.nasa.gov/portal/home/item.html?id=048c92f5ce50462a86b0837254924151, used for Figure 5acountry file folder: global country and county polygon vector data, used to extract centroid coordinates of counties in ArcGIS v10.8
Sample Dataset for DataFrame Styling
kaggle.com
zip
Updated Jun 11, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Leonie (2022). Sample Dataset for DataFrame Styling [Dataset]. https://www.kaggle.com/datasets/iamleonie/sample-dataset-for-dataframe-styling
Explore at:
zip(257 bytes)Available download formats
Dataset updated
Jun 11, 2022
Authors
Leonie
Description
Dataset

This dataset was created by Leonie

Contents
Exploring the Relationship between Lipid Profile Changes, Growth and...
zenodo.org
Updated Nov 3, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Saúl Fernandes; Saúl Fernandes; Diana Ilyaskina; Diana Ilyaskina (2023). Exploring the Relationship between Lipid Profile Changes, Growth and Reproduction in Folsomia candida Exposed to Teflubenzuron Over Time [Dataset]. http://doi.org/10.5281/zenodo.10069317
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.10069317
Dataset updated
Nov 3, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Saúl Fernandes; Saúl Fernandes; Diana Ilyaskina; Diana Ilyaskina
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This submission provides csv files with the data files from a comprehensive study aimed at investigating the effects of sublethal concentrations of the insecticide teflubenzuron on the survival, growth, reproduction, and lipid changes of theCollembola Folsomia candida over different exposure periods.
The dataset files are provided in CSV format with Comma Separated Values:
Survival_Growth_Reproduction_FolsomiaCandida_RawData.csv
Main_Lipid_Class_Log10_RawData.csv
Lipid_Categories_RawData.csv
Bioaccumulation_SoilQuantification_RawData.csv
Description of the files
The csv Survival_Growth_Reproduction_FolsomiaCandida_RawData.csv containes the dataframe in vertical format with the data for suvival of Folsomia candida, changes in biomass and reproduction.
The csv file Main_Lipid_Class_Log10_RawData provides the dataframe in horizontal format with the log10 transformed total lipid content of the main lipid classes in Folsomia candida. Data used to produce Figure 5 of the manuscript. Full name of lipids abbreviation are provided in supplementary information of the manuscript.
The csv file Lipid_Categories_RawData provides the dataframe with lipid categories in Folsomia candida. Data was not used in the data analysis described in the manuscript.
The csv file Bioaccumulation_SoilQuantification_RawData.csvontaines the dataframe in vertical format with the data for soil quantification of the insecticide in the soil, in the animals and the calculation for bioaccumulation factor.
Variables in the files:
File 1:
sample: sample unique ID
days: day of sampling
dose: dose of teflubenzuron (insecticide) in soil (mg a.s. kg soil -1)
solvent: solvent used in the soil (acetone or water)
age: age of the animals Folsomia candida at the day of sampling
survival: number of surviving adults of Folsomia candida
total.biomass(mg): total biomass in mg of the pool of animals in each sample
biomass.individual(ng): "total.biomass(mg)" devided by the "survival" and converted to ng.
offspring: number of offspring produced in each sample (NA for samples where this number is not possible to acess)
Files 2 and 3:
sample: sample unique ID
days: day of sampling
dose: dose of teflubenzuron (insecticide) in soil (mg a.s. kg soil -1)
File 4:
sample: sample unique ID
days: day of sampling
dose.nominal(ng/g): nominal dose of teflubenzuron (insecticide) in soil (mg a.s. kg soil -1)
dose.measured(ng/g): measured dose of teflubenzuron (insecticide) in soil (mg a.s. kg soil -1)
biomass.total.wet(g): "total.biomass(mg)" devided by the "survival" and converted to ng.
number.animals: umber of surviving adults of Folsomia candida in each sample
biomass.individual.dry(g): "total.biomass(mg)" devided by the "number.animals" and converted to ng.
measured.insecticide.animals(ng): measured amount of teflubenzuron (insecticide) in the pool of animals (mg a.s. kg soil -1)
accumulation.insecticide(ng/g of dry body weight): "measured.insecticide.animals(ng)" devided by "biomass.total.wet(g)"
baf: "accumulation.insecticide" devided by "dose.measured(ng/g)"
[NA stands for samples lost/ not measured]
This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 859891.
This publication reflects only the authors' view and the European Commission is not responsible for any use that may be made of the information it contains.
Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic...
zenodo.org
data.niaid.nih.gov
bin, csv, zip
Updated Dec 24, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Alexander R. Hartloper; Alexander R. Hartloper; Selimcan Ozden; Albano de Castro e Sousa; Dimitrios G. Lignos; Dimitrios G. Lignos; Selimcan Ozden; Albano de Castro e Sousa (2022). Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials [Dataset]. http://doi.org/10.5281/zenodo.6965147
Explore at:
bin, zip, csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.6965147
Dataset updated
Dec 24, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Alexander R. Hartloper; Alexander R. Hartloper; Selimcan Ozden; Albano de Castro e Sousa; Dimitrios G. Lignos; Dimitrios G. Lignos; Selimcan Ozden; Albano de Castro e Sousa
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials

Background

This dataset contains data from monotonic and cyclic loading experiments on structural metallic materials. The materials are primarily structural steels and one iron-based shape memory alloy is also included. Summary files are included that provide an overview of the database and data from the individual experiments is also included.

The files included in the database are outlined below and the format of the files is briefly described. Additional information regarding the formatting can be found through the post-processing library (https://github.com/ahartloper/rlmtp/tree/master/protocols).

Usage

The data is licensed through the Creative Commons Attribution 4.0 International.

If you have used our data and are publishing your work, we ask that you please reference both:

this database through its DOI, and

any publication that is associated with the experiments. See the Overall_Summary and Database_References files for the associated publication references.

Included Files

Overall_Summary_2022-08-25_v1-0-0.csv: summarises the specimen information for all experiments in the database.

Summarized_Mechanical_Props_Campaign_2022-08-25_v1-0-0.csv: summarises the average initial yield stress and average initial elastic modulus per campaign.

Unreduced_Data-#_v1-0-0.zip: contain the original (not downsampled) data

Where # is one of: 1, 2, 3, 4, 5, 6. The unreduced data is broken into separate archives because of upload limitations to Zenodo. Together they provide all the experimental data.

We recommend you un-zip all the folders and place them in one "Unreduced_Data" directory similar to the "Clean_Data"

The experimental data is provided through .csv files for each test that contain the processed data. The experiments are organised by experimental campaign and named by load protocol and specimen. A .pdf file accompanies each test showing the stress-strain graph.

There is a "db_tag_clean_data_map.csv" file that is used to map the database summary with the unreduced data.

The computed yield stresses and elastic moduli are stored in the "yield_stress" directory.

Clean_Data_v1-0-0.zip: contains all the downsampled data

The experimental data is provided through .csv files for each test that contain the processed data. The experiments are organised by experimental campaign and named by load protocol and specimen. A .pdf file accompanies each test showing the stress-strain graph.

There is a "db_tag_clean_data_map.csv" file that is used to map the database summary with the clean data.

The computed yield stresses and elastic moduli are stored in the "yield_stress" directory.

Database_References_v1-0-0.bib

Contains a bibtex reference for many of the experiments in the database. Corresponds to the "citekey" entry in the summary files.

File Format: Downsampled Data

These are the "LP_

The header of the first column is empty: the first column corresponds to the index of the sample point in the original (unreduced) data

Time[s]: time in seconds since the start of the test

e_true: true strain

Sigma_true: true stress in MPa

(optional) Temperature[C]: the surface temperature in degC

These data files can be easily loaded using the pandas library in Python through:

import pandas data = pandas.read_csv(data_file, index_col=0)

The data is formatted so it can be used directly in RESSPyLab (https://github.com/AlbanoCastroSousa/RESSPyLab). Note that the column names "e_true" and "Sigma_true" were kept for backwards compatibility reasons with RESSPyLab.

File Format: Unreduced Data

These are the "LP_

The first column is the index of each data point

S/No: sample number recorded by the DAQ

System Date: Date and time of sample

Time[s]: time in seconds since the start of the test

C_1_Force[kN]: load cell force

C_1_Déform1[mm]: extensometer displacement

C_1_Déplacement[mm]: cross-head displacement

Eng_Stress[MPa]: engineering stress

Eng_Strain[]: engineering strain

e_true: true strain

Sigma_true: true stress in MPa

(optional) Temperature[C]: specimen surface temperature in degC

The data can be loaded and used similarly to the downsampled data.

File Format: Overall_Summary

The overall summary file provides data on all the test specimens in the database. The columns include:

hidden_index: internal reference ID

grade: material grade

spec: specifications for the material

source: base material for the test specimen

id: internal name for the specimen

lp: load protocol

size: type of specimen (M8, M12, M20)

gage_length_mm_: unreduced section length in mm

avg_reduced_dia_mm_: average measured diameter for the reduced section in mm

avg_fractured_dia_top_mm_: average measured diameter of the top fracture surface in mm

avg_fractured_dia_bot_mm_: average measured diameter of the bottom fracture surface in mm

fy_n_mpa_: nominal yield stress

fu_n_mpa_: nominal ultimate stress

t_a_deg_c_: ambient temperature in degC

date: date of test

investigator: person(s) who conducted the test

location: laboratory where test was conducted

machine: setup used to conduct test

pid_force_k_p, pid_force_t_i, pid_force_t_d: PID parameters for force control

pid_disp_k_p, pid_disp_t_i, pid_disp_t_d: PID parameters for displacement control

pid_extenso_k_p, pid_extenso_t_i, pid_extenso_t_d: PID parameters for extensometer control

citekey: reference corresponding to the Database_References.bib file

yield_stress_mpa_: computed yield stress in MPa

elastic_modulus_mpa_: computed elastic modulus in MPa

fracture_strain: computed average true strain across the fracture surface

c,si,mn,p,s,n,cu,mo,ni,cr,v,nb,ti,al,b,zr,sn,ca,h,fe: chemical compositions in units of %mass

file: file name of corresponding clean (downsampled) stress-strain data

File Format: Summarized_Mechanical_Props_Campaign

Meant to be loaded in Python as a pandas DataFrame with multi-indexing, e.g.,

tab1 = pd.read_csv('Summarized_Mechanical_Props_Campaign_' + date + version + '.csv', index_col=[0, 1, 2, 3], skipinitialspace=True, header=[0, 1], keep_default_na=False, na_values='')

citekey: reference in "Campaign_References.bib".

Grade: material grade.

Spec.: specifications (e.g., J2+N).

Yield Stress [MPa]: initial yield stress in MPa

size, count, mean, coefvar: number of experiments in campaign, number of experiments in mean, mean value for campaign, coefficient of variation for campaign

Elastic Modulus [MPa]: initial elastic modulus in MPa

size, count, mean, coefvar: number of experiments in campaign, number of experiments in mean, mean value for campaign, coefficient of variation for campaign

Caveats

The files in the following directories were tested before the protocol was established. Therefore, only the true stress-strain is available for each:

A500

A992_Gr50

BCP325

BCR295

HYP400

S460NL

S690QL/25mm

S355J2_Plates/S355J2_N_25mm and S355J2_N_50mm
n
Data from: Generalizable EHR-R-REDCap pipeline for a national...
data.niaid.nih.gov
datadryad.org
zip
Updated Jan 9, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller (2022). Generalizable EHR-R-REDCap pipeline for a national multi-institutional rare tumor patient registry [Dataset]. http://doi.org/10.5061/dryad.rjdfn2zcm
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.rjdfn2zcm
Dataset updated
Jan 9, 2022
Dataset provided by
Harvard Medical School
Massachusetts General Hospital
Authors
Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller
License
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Description
Objective: To develop a clinical informatics pipeline designed to capture large-scale structured EHR data for a national patient registry.

Materials and Methods: The EHR-R-REDCap pipeline is implemented using R-statistical software to remap and import structured EHR data into the REDCap-based multi-institutional Merkel Cell Carcinoma (MCC) Patient Registry using an adaptable data dictionary.

Results: Clinical laboratory data were extracted from EPIC Clarity across several participating institutions. Labs were transformed, remapped and imported into the MCC registry using the EHR labs abstraction (eLAB) pipeline. Forty-nine clinical tests encompassing 482,450 results were imported into the registry for 1,109 enrolled MCC patients. Data-quality assessment revealed highly accurate, valid labs. Univariate modeling was performed for labs at baseline on overall survival (N=176) using this clinical informatics pipeline.

Conclusion: We demonstrate feasibility of the facile eLAB workflow. EHR data is successfully transformed, and bulk-loaded/imported into a REDCap-based national registry to execute real-world data analysis and interoperability.

Methods eLAB Development and Source Code (R statistical software):

eLAB is written in R (version 4.0.3), and utilizes the following packages for processing: DescTools, REDCapR, reshape2, splitstackshape, readxl, survival, survminer, and tidyverse. Source code for eLAB can be downloaded directly (https://github.com/TheMillerLab/eLAB).

eLAB reformats EHR data abstracted for an identified population of patients (e.g. medical record numbers (MRN)/name list) under an Institutional Review Board (IRB)-approved protocol. The MCCPR does not host MRNs/names and eLAB converts these to MCCPR assigned record identification numbers (record_id) before import for de-identification.

Functions were written to remap EHR bulk lab data pulls/queries from several sources including Clarity/Crystal reports or institutional EDW including Research Patient Data Registry (RPDR) at MGB. The input, a csv/delimited file of labs for user-defined patients, may vary. Thus, users may need to adapt the initial data wrangling script based on the data input format. However, the downstream transformation, code-lab lookup tables, outcomes analysis, and LOINC remapping are standard for use with the provided REDCap Data Dictionary, DataDictionary_eLAB.csv. The available R-markdown ((https://github.com/TheMillerLab/eLAB) provides suggestions and instructions on where or when upfront script modifications may be necessary to accommodate input variability.

The eLAB pipeline takes several inputs. For example, the input for use with the ‘ehr_format(dt)’ single-line command is non-tabular data assigned as R object ‘dt’ with 4 columns: 1) Patient Name (MRN), 2) Collection Date, 3) Collection Time, and 4) Lab Results wherein several lab panels are in one data frame cell. A mock dataset in this ‘untidy-format’ is provided for demonstration purposes (https://github.com/TheMillerLab/eLAB).

Bulk lab data pulls often result in subtypes of the same lab. For example, potassium labs are reported as “Potassium,” “Potassium-External,” “Potassium(POC),” “Potassium,whole-bld,” “Potassium-Level-External,” “Potassium,venous,” and “Potassium-whole-bld/plasma.” eLAB utilizes a key-value lookup table with ~300 lab subtypes for remapping labs to the Data Dictionary (DD) code. eLAB reformats/accepts only those lab units pre-defined by the registry DD. The lab lookup table is provided for direct use or may be re-configured/updated to meet end-user specifications. eLAB is designed to remap, transform, and filter/adjust value units of semi-structured/structured bulk laboratory values data pulls from the EHR to align with the pre-defined code of the DD.

Data Dictionary (DD)

EHR clinical laboratory data is captured in REDCap using the ‘Labs’ repeating instrument (Supplemental Figures 1-2). The DD is provided for use by researchers at REDCap-participating institutions and is optimized to accommodate the same lab-type captured more than once on the same day for the same patient. The instrument captures 35 clinical lab types. The DD serves several major purposes in the eLAB pipeline. First, it defines every lab type of interest and associated lab unit of interest with a set field/variable name. It also restricts/defines the type of data allowed for entry for each data field, such as a string or numerics. The DD is uploaded into REDCap by every participating site/collaborator and ensures each site collects and codes the data the same way. Automation pipelines, such as eLAB, are designed to remap/clean and reformat data/units utilizing key-value look-up tables that filter and select only the labs/units of interest. eLAB ensures the data pulled from the EHR contains the correct unit and format pre-configured by the DD. The use of the same DD at every participating site ensures that the data field code, format, and relationships in the database are uniform across each site to allow for the simple aggregation of the multi-site data. For example, since every site in the MCCPR uses the same DD, aggregation is efficient and different site csv files are simply combined.

Study Cohort

This study was approved by the MGB IRB. Search of the EHR was performed to identify patients diagnosed with MCC between 1975-2021 (N=1,109) for inclusion in the MCCPR. Subjects diagnosed with primary cutaneous MCC between 2016-2019 (N= 176) were included in the test cohort for exploratory studies of lab result associations with overall survival (OS) using eLAB.

Statistical Analysis

OS is defined as the time from date of MCC diagnosis to date of death. Data was censored at the date of the last follow-up visit if no death event occurred. Univariable Cox proportional hazard modeling was performed among all lab predictors. Due to the hypothesis-generating nature of the work, p-values were exploratory and Bonferroni corrections were not applied.
Z
Data from: ScientISST MOVE: Annotated Wearable Multimodal Biosignals...
data.niaid.nih.gov
Updated Nov 14, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Areias Saraiva, João; Abreu, Mariana; Carmo, Ana Sofia; Plácido da Silva, Hugo; Fred, Ana (2023). ScientISST MOVE: Annotated Wearable Multimodal Biosignals recorded during Everyday Life Activities in Naturalistic Environments [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7940438
Explore at:
Dataset updated
Nov 14, 2023
Dataset provided by
Instituto de Telecomunicações
Instituto Superior Técnico
Authors
Areias Saraiva, João; Abreu, Mariana; Carmo, Ana Sofia; Plácido da Silva, Hugo; Fred, Ana
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
A multi-modality, multi-activity, and multi-subject dataset of wearable biosignals. Modalities: ECG, EMG, EDA, PPG, ACC, TEMP Main Activities: Lift object, Greet people, Gesticulate while talking, Jumping, Walking, and Running Cohort: 17 subjects (10 male, 7 female); median age: 24 Devices: 2x ScientISST Core + 1x Empatica E4 Body Locations: Chest, Abdomen, Left bicep, wrist and index finger No filter has been applied to the signals, but the correct transfer functions were applied, so the data is given in relevant unis (mV, uS, g, ºC).

For more information on background, methods and the acquisition protocol, refer to https://doi.org/10.13026/0ppk-ha30.

In this repository, there are two formats available: a) LTBio's Biosignal files. Should be open like: x = Biosignal.load(path) LTBio Package: https://pypi.org/project/LongTermBiosignals/ Under the directory biosignal, the following tree structure is found: subject/x.biosignal, where subject is the subject's code, and x is any of the following { acc_chest, acc_wrist, ecg, eda, emg, ppg, temp }. Each file includes the signals recorded from every sensor that acquires the modality after which the file is named, independently of the device. Channels, activities and time intervals can be easily indexed with the index operator . A sneak peak of the signals can also be quickly plotted with: x.preview.plot() Any Biosignal can be easily converted to NumPy arrays or DataFrames, if needed. b) CSV files. Can be open like: x = pandas.read_csv(path) Pandas Package: https://pypi.org/project/pandas/ These files can be found under the directory csv, named as subject.csv, where subject is the subject's code. There is only one file per subject, containing their full session and all biosignal modalities. When read as tables, the time axis is in the first column, each sensor is in one of the middle columns, and the activity labels are in the last column. In each row are the samples of each sensor, if any, at each timestamp. At any given timestamp, if there is no sample for a sensor, it means the acquisition was interrupted for that sensor, which happens between activities, and sometimes for short periods during the running activity. Also in each row, on the last column, is one or more activity labels, if an activity was taking place at that timestamp. If there are multiple annotations, the labels are separated by vertical bars (e.g 'run | sprint'). If there are no annotations, the column is empty for that timestamp.

In order to provide a tabular format with sensors with different sampling frequencies, the sensors with sampling frequency lower than 500 Hz were upsampled to 500 Hz. This way, the tables are regularly sampled, i.e., there is a row every 2 ms. If a sensor was not acquiring at a given timestamp, the corresponding cell with be empty. So, not only the segments with samples are regularly sampled, but the interruptions are also discretised. This means that if, after an interruption, a sensor starts acquiring at a non regular timestamp, the first sample will be written on the previous or the following timestamp, by half-up rounding. Naturally, this process cumulatively introduces lags in the table, some of which cancel out. Each individual lag is no longer than half the sampling period (1 ms), hence negligible. The cumulative lags are no longer than 48 ms for all subjects, which is also negligible. Nevertheless, only the LBio's Biosignal format preserves the exact original timestamps (10E-6 precision) of all samples and the original sampling frequencies.

Both include annotations of the activities, however LTBio bio signal files have better time resolution and include clinical data and demographic data as well.
Datasets for the Carpentry-style RNA-seq lesson
zenodo.org
application/gzip, tsv +1
Updated Mar 22, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
tijs bliek; marc galland; tijs bliek; marc galland (2023). Datasets for the Carpentry-style RNA-seq lesson [Dataset]. http://doi.org/10.5281/zenodo.6205896
Explore at:
tsv, application/gzip, txtAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.6205896
Dataset updated
Mar 22, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
tijs bliek; marc galland; tijs bliek; marc galland
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Lesson files

For all compressed files, go to the Shell and uncompress using `tar -xzvf myarchive.tar.gz`.

1) Bioinformatic files: bioinformatic_tutorial_files.tar.gz

This archive contains the following datasets:

FASTQ files from Arabidopsis leaf RNA-seq:

Arabidopsis_sample3.fq.gz

Arabidopsis_sample1.fq.gz

Arabidopsis_sample4.fq.gz

Arabidopsis_sample2.fq.gz

Arabidopsis thaliana genome assembly and genome annotation:

AtChromosome1.fa.gz

ath_annotation.gff3.gz

The sequence of sequencing adapters in adapters.fasta.

2) Gene counts usable with DESeq2 and R: tutorial.tar.gz

This archive contains the following datasets:

raw_counts.csv: a dataframe of the sample raw counts. It is a comma-separate values file therefore data are separated by commas ','.

samples_to_conditions.csv: a dataframe that indicates the correspondence between samples and experimental conditions (e.g. control, treated).

differential_genes.csv: a dataframe that contains the result of the DESeq2 analysis specifying this contrast in `DESEq2::results()` function: `contrast = c("infected", "Pseudomonas_syringae_DC3000", "mock")

The raw_counts.csv file was obtained by running the `v0.1.1` version of a RNA-Seq bioinformatic pipeline on the mRNA-Seq sequencing files from Vogel et al. (2016): https://www.ebi.ac.uk/ena/data/view/PRJEB13938.

Please read the original study (Vogel et al. 2016): https://nph.onlinelibrary.wiley.com/doi/full/10.1111/nph.14036

====

Exercise files

1) NASA spaceflight

The NASA GeneLab experiment GLDS-38 performed transcriptomics and proteomics of Arabidopsis seedlings in microgravity by sending seedlings to the International Space Station (ISS).

The raw counts, scaled counts and sample to conditions files are available in the ZIP archive

2) Deforges 2019 hormone-treatments: deforges_2019.tar.gz

This archive contains:

arabidopsis_root_hormones_raw_counts.csv

arabidopsis_root_hormones_sample2condition.csv

dataset01_IAA_arabidopsis_root_raw_counts.csv

dataset02_ABA_arabidopsis_root_raw_counts.csv

dataset03_ACC_arabidopsis_root_raw_counts.csv

dataset04_MeJA_arabidopsis_root_raw_counts.csv

The arabidopsis_root_hormones_raw_counts.csv file contains all gene counts from all hormones. Separate datasets were made for each hormone for convenience.

The Device Activity Report with Complete Knowledge (DARCK) for NILM

zenodo.org

bin, xz

Updated Sep 19, 2025

Facebook

Twitter

Click to copy link

Link copied

Cite

Anonymous Anonymous; Anonymous Anonymous (2025). The Device Activity Report with Complete Knowledge (DARCK) for NILM [Dataset]. http://doi.org/10.5281/zenodo.17159850

Explore at:

bin, xzAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.17159850

Dataset updated

Sep 19, 2025

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Anonymous Anonymous; Anonymous Anonymous

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

1. Abstract

This dataset contains aggregated and sub-metered power consumption data from a two-person apartment in Germany. Data was collected from March 5 to September 4, 2025, spanning 6 months. It includes an aggregate reading from a main smart meter and individual readings from 40 smart plugs, smart relays, and smart power meters monitoring various appliances.

2. Dataset Overview

Apartment: Two-person apartment, approx. 58m², located in Aachen, Germany.
Aggregate Meter: eBZ DD3
Sub-meters: 31 Shelly Plus Plug S, 6 Shelly Plus 1PM, 3 Shelly Plus PM Mini Gen3
Sampling Rate: 1 Hz
Measured Quantity: Active Power
Unit of Measurement: Watt
Duration: 6 months
Format: Single CSV file (`DARCK.csv`)
Structure: Timestamped rows with columns for the aggregate meter and each sub-metered appliance.
Completeness: The main power meter has a completeness of 99.3%. Missing values were linearly interpolated.

3. Download and Usage

The dataset can be downloaded here: https://doi.org/10.5281/zenodo.17159850

As it contains longer off periods with zeros, the CSV file is nicely compressible.

To extract it use: xz -d DARCK.csv.xz.
The compression leads to a 97% smaller file size (From 4GB to 90.9MB).

To use the dataset in python, you can, e.g., load the csv file into a pandas dataframe.

python
import pandas as pd

df = pd.read_csv("DARCK.csv", parse_dates=["time"])

4. Measurement Setup

The main meter was monitored using an infrared reading head magnetically attached to the infrared interface of the meter. An ESP8266 flashed with Tasmota decodes the binary datagrams and forwards the Watt readings to the MQTT broker. Individual appliances were monitored using a combination of Shelly Plugs (for outlets), Shelly 1PM (for wired-in devices like ceiling lights), and Shelly PM Mini (for each of the three phases of the oven). All devices reported to a central InfluxDB database via Home Assistant running in docker on a Dell OptiPlex 3020M.

5. File Format (`DARCK.csv`)

The dataset is provided as a single comma-separated value (CSV) file.

The first row is a header containing the column names.
All power values are rounded to the first decimal place.
There are no missing values in the final dataset.
Each row represents 1 second, from start of measuring in March until the end in September.

Column Descriptions

Column Name	Data Type	Unit	Description
`time`	datetime	-	Timestamp for the reading in `YYYY-MM-DD HH:MM:SS`
`main`	float	Watt	Total aggregate power consumption for the apartment, measured at the main electrical panel.
`[appliance_name]`	float	Watt	Power consumption of an individual appliance (e.g., `lightbathroom`, `fridge`, `sherlockpc`). See Section 8 for a full list.
Aggregate Columns
`aggr_chargers`	float	Watt	The sum of `sherlockcharger`, `sherlocklaptop`, `watsoncharger`, `watsonlaptop`, `watsonipadcharger`, `kitchencharger`.
`aggr_stoveplates`	float	Watt	The sum of `stoveplatel1` and `stoveplatel2`.
`aggr_lights`	float	Watt	The sum of `lightbathroom`, `lighthallway`, `lightsherlock`, `lightkitchen`, `lightlivingroom`, `lightwatson`, `lightstoreroom`, `fcob`, `sherlockalarmclocklight`, `sherlockfloorlamphue`, `sherlockledstrip`, `livingfloorlamphue`, `sherlockglobe`, `watsonfloorlamp`, `watsondesklamp` and `watsonledmap`.
Analysis Columns
`inaccuracy`	float	Watt	As no electrical device bypasses a power meter, the true inaccuracy can be assessed. It is the absolute error between the sum of individual measurements and the mains reading. A 30W offset is applied to the sum since the measurement devices themselves draw power which is otherwise unaccounted for.

6. Data Postprocessing Pipeline

The final dataset was generated from two raw data sources (meter.csv and shellies.csv) using a comprehensive postprocessing pipeline.

6.1. Main Meter (`main`) Postprocessing

The aggregate power data required several cleaning steps to ensure accuracy.

Outlier Removal: Readings below 10W or above 10,000W were removed (merely 3 occurrences).
Timestamp Burst Correction: The source data contained bursts of delayed readings. A custom algorithm was used to identify these bursts (large time gap followed by rapid readings) and back-fill the timestamps to create an evenly spaced time series.
Alignment & Interpolation: The smart meter pushes a new value via infrared every second. To align those to the whole seconds, it was resampled to a 1-second frequency by taking the mean of all readings within each second (in 99.5% only 1 value). Any resulting gaps (0.7% outage ratio) were filled using linear interpolation.

6.2. Sub-metered Devices (`shellies`) Postprocessing

The Shelly devices are not prone to the same burst issue as the ESP8266 is. They push a new reading at every change in power drawn. If no power change is observed or the one observed is too small (less than a few Watt), the reading is pushed once a minute, together with a heartbeat. When a device turns on or off, intermediate power values are published, which leads to sub-second values that need to be handled.

Grouping: Data was grouped by the unique device identifier.
Resampling & Filling: The data for each device was resampled to a 1-second frequency using .resample('1s').last().ffill().
This method was chosen to firstly, capture the last known state of the device within each second, handling rapid on/off events. Secondly, to forward-fill the last state across periods of no new data, modeling that the device's consumption remained constant until a new reading was sent.

6.3. Merging and Finalization

Merge: The cleaned main meter and all sub-metered device dataframes were merged into a single dataframe on the time index.
Final Fill: Any remaining NaN values (e.g., from before a device was installed) were filled with 0.0, assuming zero consumption.

7. Manual Corrections and Known Data Issues

During analysis, two significant unmetered load events were identified and manually corrected to improve the accuracy of the aggregate reading. The error column (inaccuracy) was recalculated after these corrections.

March 10th - Unmetered Bulb: An unmetered 107W bulb was active. It was subtracted from the main reading as if it never happened.
May 31st - Unmetered Air Pump: An unmetered 101W pump for an air mattress was used directly in an outlet with no intermediary plug and hence manually added to the respective plug.

8. Appliance Details and Multipurpose Plugs

The following table lists the column names with an explanation where needed. As Watson moved at the beginning of June, some metering plugs changed their appliance.

Z
SELTO Dataset
data.niaid.nih.gov
Updated May 23, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dittmer, Sören; Erzmann, David; Harms, Henrik; Falck, Rielson; Gosch, Marco (2023). SELTO Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7034898
Explore at:
Dataset updated
May 23, 2023
Dataset provided by
University of Bremen, University of Cambridge
ArianeGroup GmbH
University of Bremen
Authors
Dittmer, Sören; Erzmann, David; Harms, Henrik; Falck, Rielson; Gosch, Marco
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A Benchmark Dataset for Deep Learning for 3D Topology Optimization

This dataset represents voxelized 3D topology optimization problems and solutions. The solutions have been generated in cooperation with the Ariane Group and Synera using the Altair OptiStruct implementation of SIMP within the Synera software. The SELTO dataset consists of four different 3D datasets for topology optimization, called disc simple, disc complex, sphere simple and sphere complex. Each of these datasets is further split into a training and a validation subset.

The following paper provides full documentation and examples:

Dittmer, S., Erzmann, D., Harms, H., Maass, P., SELTO: Sample-Efficient Learned Topology Optimization (2022) https://arxiv.org/abs/2209.05098.

The Python library DL4TO (https://github.com/dl4to/dl4to) can be used to download and access all SELTO dataset subsets. Each TAR.GZ file container consists of multiple enumerated pairs of CSV files. Each pair describes a unique topology optimization problem and contains an associated ground truth solution. Each problem-solution pair consists of two files, where one contains voxel-wise information and the other file contains scalar information. For example, the i-th sample is stored in the files i.csv and i_info.csv, where i.csv contains all voxel-wise information and i_info.csv contains all scalar information. We define all spatially varying quantities at the center of the voxels, rather than on the vertices or surfaces. This allows for a shape-consistent tensor representation.

For the i-th sample, the columns of i_info.csv correspond to the following scalar information:

E - Young's modulus [Pa]

ν - Poisson's ratio [-]

σ_ys - a yield stress [Pa]

h - discretization size of the voxel grid [m]

The columns of i.csv correspond to the following voxel-wise information:

x, y, z - the indices that state the location of the voxel within the voxel mesh

Ω_design - design space information for each voxel. This is a ternary variable that indicates the type of density constraint on the voxel. 0 and 1 indicate that the density is fixed at 0 or 1, respectively. -1 indicates the absence of constraints, i.e., the density in that voxel can be freely optimized

Ω_dirichlet_x, Ω_dirichlet_y, Ω_dirichlet_z - homogeneous Dirichlet boundary conditions for each voxel. These are binary variables that define whether the voxel is subject to homogeneous Dirichlet boundary constraints in the respective dimension

F_x, F_y, F_z - floating point variables that define the three spacial components of external forces applied to each voxel. All forces are body forces given in [N/m^3]

density - defines the binary voxel-wise density of the ground truth solution to the topology optimization problem

How to Import the Dataset

with DL4TO: With the Python library DL4TO (https://github.com/dl4to/dl4to) it is straightforward to download and access the dataset as a customized PyTorch torch.utils.data.Dataset object. As shown in the tutorial this can be done via:

from dl4to.datasets import SELTODataset

dataset = SELTODataset(root=root, name=name, train=train)

Here, root is the path where the dataset should be saved. name is the name of the SELTO subset and can be one of "disc_simple", "disc_complex", "sphere_simple" and "sphere_complex". train is a boolean that indicates whether the corresponding training or validation subset should be loaded. See here for further documentation on the SELTODataset class.

without DL4TO: After downloading and unzipping, any of the i.csv files can be manually imported into Python as a Pandas dataframe object:

import pandas as pd

root = ... file_path = f'{root}/{i}.csv' columns = ['x', 'y', 'z', 'Ω_design','Ω_dirichlet_x', 'Ω_dirichlet_y', 'Ω_dirichlet_z', 'F_x', 'F_y', 'F_z', 'density'] df = pd.read_csv(file_path, names=columns)

Similarly, we can import a i_info.csv file via:

file_path = f'{root}/{i}_info.csv' info_column_names = ['E', 'ν', 'σ_ys', 'h'] df_info = pd.read_csv(file_path, names=info_columns)

We can extract PyTorch tensors from the Pandas dataframe df using the following function:

import torch

def get_torch_tensors_from_dataframe(df, dtype=torch.float32): shape = df[['x', 'y', 'z']].iloc[-1].values.astype(int) + 1 voxels = [df['x'].values, df['y'].values, df['z'].values]

Ω_design = torch.zeros(1, *shape, dtype=int) Ω_design[:, voxels[0], voxels[1], voxels[2]] = torch.from_numpy(data['Ω_design'].values.astype(int)) Ω_Dirichlet = torch.zeros(3, *shape, dtype=dtype) Ω_Dirichlet[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_x'].values, dtype=dtype) Ω_Dirichlet[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_y'].values, dtype=dtype) Ω_Dirichlet[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_z'].values, dtype=dtype) F = torch.zeros(3, *shape, dtype=dtype) F[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_x'].values, dtype=dtype) F[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_y'].values, dtype=dtype) F[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_z'].values, dtype=dtype) density = torch.zeros(1, *shape, dtype=dtype) density[:, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['density'].values, dtype=dtype) return Ω_design, Ω_Dirichlet, F, density
r
Data for the Farewell and Herberg example of a two-phase experiment using a...
researchdata.edu.au
datasetcatalog.nlm.nih.gov
+1more
Updated Jul 1, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Chris Brien (2021). Data for the Farewell and Herberg example of a two-phase experiment using a plaid design [Dataset]. http://doi.org/10.25909/13122095
Explore at:
Unique identifier
https://doi.org/10.25909/13122095
Dataset updated
Jul 1, 2021
Dataset provided by
The University of Adelaide
Authors
Chris Brien
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The experiment that Farewell and Herzberg (2003) describe is pain-rating experiment that is a subset of the experiment reported by Solomon et al. (1997). It is a two-phase experiment. The first phase is a self-assessment phase in which patients self-assess for pain while moving a painful shoulder joint. The second phase of this experiment is an evaluation phase in which occupational and physical therapy students (the raters) are evaluated for rating patients in a set of videos for pain. The measured response is the difference between a student rating and the patient's rating.

The R data file plaid.dat.rda contains the data.frame plaid.dat that has a revised version of the data for the Farewell and Herzberg example downloaded from https://doi.org/10.17863/CAM.54494. The comma delimited text file plaid.dat.csv has the same information in this more commonly accepted format, but without the metadata associated with the data.frame<\CODE>. The data.frame contains the factors Raters, Viewings, Trainings, Expressiveness, Patients, Occasions, and Motions and a column for the response variable Y. The two factors Viewings and Occasions are additional to those in the downloaded file and the remaining factors have been converted from integers or characters to factors and renamed to the names given above. The column Y is unchanged from the column in the original file. To load the data in R use: load("plaid.dat.rda") or plaid.dat <- read.csv(file = "plaid.dat.csv").
References
Farewell, V. T.,& Herzberg, A. M. (2003). Plaid designs for the evaluation of training for medical practitioners. Journal of Applied Statistics, 30(9), 957-965. https://doi.org/10.1080/0266476032000076092
Solomon, P. E., Prkachin, K. M. & Farewell, V. (1997). Enhancing sensitivity to facial expression of pain. Pain, 71(3), 279-284. https://doi.org/10.1016/S0304-3959(97)03377-0
GENEActiv accelerometer file related to the #120 OxWearables / stepcount...
zenodo.org
data.niaid.nih.gov
Updated Nov 25, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Guillaume Wattelez; Guillaume Wattelez (2024). GENEActiv accelerometer file related to the #120 OxWearables / stepcount issue [Dataset]. http://doi.org/10.5281/zenodo.14213237
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.14213237
Dataset updated
Nov 25, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Guillaume Wattelez; Guillaume Wattelez
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Oct 2, 2018 - Oct 6, 2018
Description
An example of .bin file that have an IndexError when processing.

Consider https://github.com/OxWearables/stepcount/issues/120" target="_blank" rel="noopener">#120 OxWearables / stepcount issue for more details.

The .csv files are 1-second epoch conversions from the .bin file and contain time, x, y, z columns. The conversion was done by:

reading the .bin with the https://www.rdocumentation.org/packages/GENEAread/" target="_blank" rel="noopener">GENEAread R package.

keeping only the time, x, y and z columns.

saving the data.frame into a .csv file.

The only difference between the .csv files is the column format used for the time column before saving:

time column in XXXXXX_....csv had a string class

time column in XXXXXT....csv had a "POSIXct" "POSIXt" class
Market Basket Analysis
kaggle.com
zip
Updated Dec 9, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Aslan Ahmedov (2021). Market Basket Analysis [Dataset]. https://www.kaggle.com/datasets/aslanahmedov/market-basket-analysis
Explore at:
zip(23875170 bytes)Available download formats
Dataset updated
Dec 9, 2021
Authors
Aslan Ahmedov
Description
Market Basket Analysis

Market basket analysis with Apriori algorithm

The retailer wants to target customers with suggestions on itemset that a customer is most likely to purchase .I was given dataset contains data of a retailer; the transaction data provides data around all the transactions that have happened over a period of time. Retailer will use result to grove in his industry and provide for customer suggestions on itemset, we be able increase customer engagement and improve customer experience and identify customer behavior. I will solve this problem with use Association Rules type of unsupervised learning technique that checks for the dependency of one data item on another data item.

Introduction

Association Rule is most used when you are planning to build association in different objects in a set. It works when you are planning to find frequent patterns in a transaction database. It can tell you what items do customers frequently buy together and it allows retailer to identify relationships between the items.

An Example of Association Rules

Assume there are 100 customers, 10 of them bought Computer Mouth, 9 bought Mat for Mouse and 8 bought both of them. - bought Computer Mouth => bought Mat for Mouse - support = P(Mouth & Mat) = 8/100 = 0.08 - confidence = support/P(Mat for Mouse) = 0.08/0.09 = 0.89 - lift = confidence/P(Computer Mouth) = 0.89/0.10 = 8.9 This just simple example. In practice, a rule needs the support of several hundred transactions, before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.

Strategy

Data Import

Data Understanding and Exploration

Transformation of the data – so that is ready to be consumed by the association rules algorithm

Running association rules

Exploring the rules generated

Filtering the generated rules

Visualization of Rule

Dataset Description

File name: Assignment-1_Data

List name: retaildata

File format: . xlsx

Number of Row: 522065

Number of Attributes: 7

BillNo: 6-digit number assigned to each transaction. Nominal.

Itemname: Product name. Nominal.

Quantity: The quantities of each product per transaction. Numeric.

Date: The day and time when each transaction was generated. Numeric.

Price: Product price. Numeric.

CustomerID: 5-digit number assigned to each customer. Nominal.

Country: Name of the country where each customer resides. Nominal.

https://user-images.githubusercontent.com/91852182/145270162-fc53e5a3-4ad1-4d06-b0e0-228aabcf6b70.png">

Libraries in R

First, we need to load required libraries. Shortly I describe all libraries.

arules - Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).

arulesViz - Extends package 'arules' with various visualization. techniques for association rules and item-sets. The package also includes several interactive visualizations for rule exploration.

tidyverse - The tidyverse is an opinionated collection of R packages designed for data science.

readxl - Read Excel Files in R.

plyr - Tools for Splitting, Applying and Combining Data.

ggplot2 - A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.

knitr - Dynamic Report generation in R.

magrittr- Provides a mechanism for chaining commands with a new forward-pipe operator, %>%. This operator will forward a value, or the result of an expression, into the next function call/expression. There is flexible support for the type of right-hand side expressions.

dplyr - A fast, consistent tool for working with data frame like objects, both in memory and out of memory.

tidyverse - This package is designed to make it easy to install and load multiple 'tidyverse' packages in a single step.

https://user-images.githubusercontent.com/91852182/145270210-49c8e1aa-9753-431b-a8d5-99601bc76cb5.png">

Data Pre-processing

Next, we need to upload Assignment-1_Data. xlsx to R to read the dataset.Now we can see our data in R.

https://user-images.githubusercontent.com/91852182/145270229-514f0983-3bbb-4cd3-be64-980e92656a02.png"> https://user-images.githubusercontent.com/91852182/145270251-6f6f6472-8817-435c-a995-9bc4bfef10d1.png">

After we will clear our data frame, will remove missing values.

https://user-images.githubusercontent.com/91852182/145270286-05854e1a-2b6c-490e-ab30-9e99e731eacb.png">

To apply Association Rule mining, we need to convert dataframe into transaction data to make all items that are bought together in one invoice will be in ...
r
Myrstener et al. (2025) Downstream temperature effects of boreal forest...
researchdata.se
su.figshare.com
Updated Feb 17, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Caroline Greiser; Lenka Kuglerová; Maria Myrstener (2025). Myrstener et al. (2025) Downstream temperature effects of boreal forest clearcutting vary with riparian buffer width - Data and Code [Dataset]. http://doi.org/10.17045/STHLMUNI.27188004
Explore at:
Unique identifier
https://doi.org/10.17045/STHLMUNI.27188004
Dataset updated
Feb 17, 2025
Dataset provided by
Stockholm University
Authors
Caroline Greiser; Lenka Kuglerová; Maria Myrstener
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Please read the readme.txt !

This depository contains raw and clean data (.csv), as well as the R-scripts (.r) that process the data, create the plots and the models.

We recommend to go through the R-scripts in their chronological order.

Code was developed in the R software:

R version 4.4.1 (2024-06-14 ucrt) -- "Race for Your Life" Copyright (C) 2024 The R Foundation for Statistical Computing Platform: x86_64-w64-mingw32/x64

****** List of files ********************************

Data

---raw

72 files from 72 Hobo data loggers

names: site_position_medium.csv

example: "20_20_down_water.csv" (site = 20, position = 20 m downstream, medium = water)

---clean

site_logger_position_medium.csv list of all sites, their loggers, their position and medium in which they were placed

loggerdata_compiled.csv all raw logger data (see above) compiled into one dataframe, for column names see below

Daily_loggerdata.csv all data aggregated to daily mean, max and min values, for column names see below

CG_site_distance_pairs.csv all logger positions for each stream and their pairwise geographical distance in meters

Discharge_site7.csv Discharge data for the same season as logger data from a reference stream

buffer_width_eniro_CG.csv measured and averaged buffer widths for each of the studied streams (in m)

Scripts

01_compile_clean_loggerdata.r

02_aggregate_loggerdata.r

03_model_stream_temp_summer.r

03b_model_stream_temp_autumn.r

04_calculate_warming_cooling_rates_summer.r

04b_calculate_warming_cooling_rates_autumn.r

05_model_air_temp_summer.r

05b_model_air_temp_autumn.r

06_plot_representative_time_series_temp_discharge.r

****** Column names ********************************

Most column names are self explaining, and are also explained in the R code.

Below some detailed info on two dataframes (.csv) - the column names are similar in other csv files

File "loggerdata_compiled.csv" [in Data/clean/ ]

"Logger.SN" Logger serial number

"Timestamp" Datetime, YYYY-MM-DD HH:MM:SS

"Temp" temperature in °C

"Illum" light in lux

"Year" YYYY

"Month" MM

"Day" DD

"Hour" HH

"Minute" MM

"Second" SS

"tz" time zone

"path" file path

"site" stream/site ID

"file" file name

"medium" "water" or "air"

"position" one of 6 positions along the stream: up, mid, end, 20, 70, 150

"date" YYYY-MM-DD

File "Daily_loggerdata.csv" [in Data/clean/ ]

"date" ... (see above)

"Logger.SN" Logger serial number

"mean_temp" mean daily temperature

"min_temp" minimum daily temperature

"max_temp" maximum daily temperature

"path" ...

"site" ...

"file" ...

"medium" ...

"position" ...

"buffer" one of 3 buffer categories: no, thin, wide

"Temp.max.ref" maximum daily temperature of the upstream reference logger

"Temp.min.ref" minimum daily temperature of the upstream reference logger

"Temp.mean.ref" mean daily temperature of the upstream reference logger

"Temp.max.dev" max. temperature difference to upstream reference

"Temp.min.dev" min. temperature difference to upstream reference

"Temp.mean.dev" mean temperature difference to upstream reference

Paper abstract:

Clearcutting increases temperatures of forest streams, and in temperate zones, the effects can extend far downstream. Here, we studied whether similar patterns are found in colder, boreal zones and if riparian buffers can prevent stream water from heating up. We recorded temperature at 45 locations across nine streams with varying buffer widths. In these streams, we compared upstream (control) reaches with reaches in clearcuts and up to 150 m downstream. In summer, we found daily maximum water temperature increases on clearcuts up to 4.1 °C with the warmest week ranging from 12.0 to 18.6 °C. We further found that warming was sustained downstream of clearcuts to 150 m in three out of six streams with buffers < 10 m. Surprisingly, temperature patterns in autumn resembled those in summer, yet with lower absolute temperatures (maximum warming was 1.9 °C in autumn). Clearcuts in boreal forests can indeed warm streams, and because these temperature effects are propagated downstream, we risk catchment-scale effects and cumulative warming when streams pass through several clearcuts. In this study, riparian buffers wider than 15 m protected against water temperature increases; hence, we call for a general increase of riparian buffer width along small streams in boreal forests.
Z
Dataset for "ConfSolv: Prediction of solute conformer free energies across a...
data.niaid.nih.gov
Updated Oct 25, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Lagnajit Pattanaik; Angiras Menon; Volker Settels; Kevin A. Spiekermann; Zipei Tan; Florence Vermeire; Frederik Sandfort; Philipp Eiden; William H. Green (2023). Dataset for "ConfSolv: Prediction of solute conformer free energies across a range of solvents" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8292519
Explore at:
Dataset updated
Oct 25, 2023
Dataset provided by
Massachusetts Institute of Technology
BASF SE Scientific Modelling
Katholieke Universiteit Leuven
Authors
Lagnajit Pattanaik; Angiras Menon; Volker Settels; Kevin A. Spiekermann; Zipei Tan; Florence Vermeire; Frederik Sandfort; Philipp Eiden; William H. Green
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains three archives. The first archive, full_dataset.zip, contains geometries and free energies for nearly 44,000 solute molecules with almost 9 million conformers, in 42 different solvents. The geometries and gas phase free energies are computed using density functional theory (DFT). The solvation free energy for each conformer is computed using COSMO-RS and the solution free energies are computed using the sum of the gas phase free energies and the solvation free energies. The geometries for each solute conformer are provided as ASE_atoms_objects within a pandas DataFrame, found in the compressed file dft coords.pkl.gz within full_dataset.zip. The gas-phase energies, solvation free energies, and solution free energies are also provided as a pandas DataFrame in the compressed file free_energy.pkl.gz within full_dataset.zip. Ten example data splits for both random and scaffold split types are also provided in the ZIP archive for training models. Scaffold split index 0 is used to generate results in the corresponding publication. The second archive, refined_conf_search.zip, contains geometries and free energies for a representative sample of 28 solute molecules from the full dataset that were subject to a refined conformer search and thus had more conformers located. The format of the data is identical to full_dataset.zip. The third archive contains one folder for each solvent for which we have provided free energies in full_dataset.zip. Each folder contains the .cosmo file for every solvent conformer used in the COSMOtherm calculations, a dummy input file for the COSMOtherm calculations, and a CSV file that contains the electronic energy of each solvent conformer that needs to be substituted for "EH_Line" in the dummy input file.
The CORESIDENCE Database: National and Subnational Data on Household and...
data.europa.eu
zenodo.org
unknown
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zenodo, The CORESIDENCE Database: National and Subnational Data on Household and Living Arrangements Around the World, 1964-2021 [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-8142652?locale=hu
Explore at:
unknown(18275)Available download formats
Dataset authored and provided by
Zenodohttp://zenodo.org/
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Households are the fundamental units of co-residence and play a crucial role in social and economic reproduction worldwide. They are also widely used as units of enumeration for data collection purposes, with substantive implications for research on poverty, living conditions, family structure, and gender dynamics. However, reliable comparative data on households and changes and living arrangements around the world is still under development. The CORESIDENCE database (CoDB) aims to bridge the existing data gap by offering valuable insights not only into the documented disparities between countries but also into the often-elusive regional differences within countries. By providing comprehensive data, it facilitates a deeper understanding of the complex dynamics of co-residence around the world. This database is a significant contribution to research, as it sheds light on both macro-level variations across nations and micro-level variations within specific regions, facilitating more nuanced analyses and evidence-based policymaking. The CoDB is composed of three datasets covering 155 countries (National Dataset), 3563 regions (Subnational Dataset), and 1511 harmonized regions (Subnational-Harmonized Dataset) for the period 1960 to 2021, and it provides 146 indicators on household composition and family arrangements across the world. This repository is composed of the following elements: a RData file named CORESIDENDE_DATABASE containing the CoDB in the form of a List. The CORESIDENDE_DB list object is composed of six elements: NATIONAL: a data frame with the household composition and living arrangements indicators at the national level. SUBNATIONAL: a data frame with the household composition and living arrangements indicators at the subnational level computed over the original subnational division provided in each sample and data source. SUBNATIONAL_HARMONIZED: a data frame with the household composition and living arrangements indicators computed over the harmonized subnational regions. SUBNATIONAL_BOUNDARIES_CORESIDENCE: a spatial data frame (a sf object) with the boundary’s delimitation of the subnational harmonized regions created for this project. CODEBOOK: a data frame with the complete list of indicators, their code names and description. HARMONIZATION_TABLE: a data frame with the full list of individual country-year samples employed in this project and their state of inclusion in the 3 datasets composing the CoDB. Elements 1, 2, 3, 5 and 6 of the R list are also provided as csv files under the same names. Element 4, the harmonized boundaries, is at disposal as gpkg (Geopackage) file.
Data for "Topological grain boundary segregation transitions"
zenodo.org
zip
Updated Oct 25, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Vivek Devulapalli; Vivek Devulapalli; Chen Enze; Chen Enze; Tobias Brink; Tobias Brink; Frolov Timofey; Frolov Timofey; Liebscher Christian H; Liebscher Christian H (2024). Data for "Topological grain boundary segregation transitions" [Dataset]. http://doi.org/10.5281/zenodo.13903314
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.13903314
Dataset updated
Oct 25, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Vivek Devulapalli; Vivek Devulapalli; Chen Enze; Chen Enze; Tobias Brink; Tobias Brink; Frolov Timofey; Frolov Timofey; Liebscher Christian H; Liebscher Christian H
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Cite as: Vivek Devulapalli et al. ,Topological grain boundary segregation transitions.Science386,420-424(2024). DOI:10.1126/science.adq4147

This repository contains the raw data from STEM imaging, EDS, and EELS experiments, the code used for GB simulations and theoretical calculations presented in the paper.

=========================================================

MDMC-SGC directory contains the MD/MC simulation in the semi-grand-canonical
ensemble (Fig. 4 of the paper).

Fe-Ti-phase-diagram
===================

First, the bulk concentration of Fe in Ti is calculated as a function
of the chemical potential difference Δµ between Fe and Ti. This is
required to calculate the grain boundary excess over the bulk.

Here, it turns out that the bulk concentration is approximately zero
in the range of Δµ investigated.

MD/MC simulations of grain boundaries
=====================================

The following sample names map to the naming in the paper:

* ABC: Ti ground state structure
* large-1cage-2300000: isolated cage
* larger-2cages-3200000: double cage
* large-02-10000220: one layer of cages
* large-01-10000367: second layer of cages forming

Each directory contains subdirectories for all investigated Δµ. The
subdirectory `final-states` contains the final snapshots for each Δµ.

The script `prepare.py` was used to set up the simulations (template
for the LAMMPS input file is `lmp.in.template`). The script
`collect.py` was used to extract the thermodynamic excess properties
of the grain boundaries, stored in the file `T_0300K.excess.dat` in
each subdirectory.

The notebook `plot-excess.ipynb` can be used to plot the excess data.

=========================================================

# GRand canonical Interface Predictor (GRIP)

_Authors: [Enze Chen](https://enze-chen.github.io/) (Stanford University) and
[Timofey Frolov](https://people.llnl.gov/frolov2) (Lawrence Livermore National Laboratory)_
_Version: 0.1.2024.01.21_

An algorithm for performing grand canonical optimization (GCO) of interfacial
structure (e.g., grain boundaries) in crystalline materials.
It automates sampling of slab translations and reconstructions
along with vacancy generation and finite temperature molecular dynamics (MD).
The algorithm repeatedly samples different structures in two phases:
1. Structure generation and manipulation is largely handled using the
[Atomic Simulation Environment (ASE)](https://wiki.fysik.dtu.dk/ase/).
2. Molecular dynamics and static relaxations are currently performed using
[LAMMPS](https://www.lammps.org), although in principle other energy
evaluation methods (e.g., density functional theory in [VASP](https://www.vasp.at))
may be used.

------

## Dependencies
- [Python](https://www.python.org/) (3.6+)
- [NumPy](https://numpy.org/) (1.23.0)
- [ASE](https://wiki.fysik.dtu.dk/ase/) (3.22.1)
- [LAMMPS](https://www.lammps.org) (stable)

_Optional_
- [pandas](https://pandas.pydata.org/) (1.5.3)
- [Matplotlib](https://matplotlib.org/stable/index.html) (3.5.3)

## Usage

Assuming the above libraries are installed, clone the repo and make the
appropriate modifications in `params.yaml` (see file for detailed comments),
including the path to the LAMMPS binary on your system.
If you wish, you can supply your own slabs for the bicrystal configuration as
POSCAR_LOWER and POSCAR_UPPER (in the [POSCAR](https://www.vasp.at/wiki/index.php/POSCAR)
file format).
Then call:
```python
python main.py
```
If you don't have LAMMPS or just want to test the script, you can run it with the `-d` flag.
See the `.examples` folder for a SLURM submission script for parallel execution (preferred).

## File structure
- `main.py`: Script to launch everything.
- `params.yaml`: Simulation parameters; **you'll want to edit this.**
- `core`: Main classes (`Bicrystal`, `Simulation`, etc.)
- `utility`: Main helper functions (`utils.py`, `unique.py`, etc.)
- `simul_files`: Files for simulations (LAMMPS input files, etc.)
- `best`: All relaxed structures are stored here. The naming convention is:
`lammps_Egb_n_X-SHIFT_Y-SHIFT_X-REPS_Y-REPS_TEMP_STEPS`

Duplicate files are periodically deleted by calling `clear_best()` in `utils/unique.py`.
The default method cleans about 1-3% of files on average.
Use the `-e` flag for more aggressive cleaning (>50%).
Use the `-s` flag to save the processed results to CSV from a pandas DataFrame.

Results can be visualized by running `utils/plot_gco.py` and it generates a GCO plot
of $E_{\mathrm{gb}}$ vs. $n$.
The `.examples` folder has this plot for several boundaries.
By default executing this file will save both the results (CSV) and the figure (PNG)
to the same folder as the GRIP output files.

## Citation
If you use GRIP in your work, we would appreciate a citation to the original manuscript:

> Enze Chen, Tae Wook Heo, Brandon C. Wood, Mark Asta, and Timofey Frolov.
"Grand canonically optimized grain boundary phases in hexagonal close-packed titanium."
_arXiv:XXXX.YYYYY [cond-mat.mtrl-sci]_, 2024.

or in BibTeX format:

```
@article{chen_2024_grip,
author = {Chen, Enze and Heo, Tae Wook and Wood, Brandon C. and Asta, Mark and Frolov, Timofey},
title = {Grand canonically optimized grain boundary phases in hexagonal close-packed titanium},
year = {2024},
journal = {arXiv:XXXX.YYYYY [cond-mat.mtrl-sci]},
doi = {10.48550/arXiv.XXXX.YYYYY},
}
```

=========================================================
H
White Clay Creek - Stage, Streamflow / Discharge (1968-2014)
beta.hydroshare.org
hydroshare.org
+1more
zip
Updated Jan 21, 2018
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Denis Newbold; Sara Geleskie Damiano; Anthony Aufdenkampe; Charles Dow (2018). White Clay Creek - Stage, Streamflow / Discharge (1968-2014) [Dataset]. http://doi.org/10.4211/hs.0886014659d443d4a2b9b189c96bab68
Explore at:
zip(76.9 MB)Available download formats
Unique identifier
https://doi.org/10.4211/hs.0886014659d443d4a2b9b189c96bab68
Dataset updated
Jan 21, 2018
Dataset provided by
HydroShare
Authors
Denis Newbold; Sara Geleskie Damiano; Anthony Aufdenkampe; Charles Dow
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
May 21, 1968 - Dec 31, 2014
Area covered

Description
Continuous streamflow data collected by the Stroud Water Research Center within the 3rd-order research watershed, White Clay Creek above McCue Road.

Variables: Gage height, Discharge

Date Range: (1968-2014)

Dataset Creators/Authors: Stroud Water Research Center

Contact: Sara G. Damiano, Stroud Water Research Center, 970 Spencer Road, Avondale, PA 19311, sdamiano@stroudcenter.org Denis Newbold, Stroud Water Research Center, 970 Spencer Road, Avondale, PA 19311. newbold@stroudcenter.org Anthony Aufdenkampe, Stroud Water Research Center, 970 Spencer Road, Avondale, PA 1931.1 aufdenkampe@stroudcenter.org

Field Area: White Clay Creek @ SWRC | Christina River Basin

Copied from: Stroud Water Research Center (2014). "CZO Dataset: White Clay Creek - Stage, Streamflow / Discharge (1968-2014)." Retrieved 09 Nov 2017, from http://criticalzone.org/christina/data/dataset/2464/.

NOTE: does not include data in this CZO Data listing that was from this site: WCC2154: White Clay Creek, west branch at Rt. 926, downstream side.

In addition, Aufdenkampe added an example Jupyter Notebook in Python (CZODisplaytoDataFrame_WCC-Flow.ipynb), to create a single concatenated data frame and export to a single CSV file (CRB_WCC_STAGEFLOW_from_df.csv). The full example can be found at https://github.com/aufdenkampe/EnviroDataScripts/tree/master/CZODisplayParsePlot.

Facebook

Twitter

Click to copy link

Link copied

Cite

Jeffery Mandrake (2022). US Consumer Complaints Against Businesses [Dataset]. https://www.kaggle.com/jefferymandrake/us-consumer-complaints-dataset-through-2019

US Consumer Complaints Against Businesses

2 Million records including product, company name, issue details and response

Explore at:

zip(343188956 bytes)Available download formats

Dataset updated

Oct 9, 2022

Authors

Jeffery Mandrake

License

https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

Description

2,121,458 records

I used Google Colab to check out this dataset and pull the column names using Pandas.

Sample code example: Python Pandas read csv file compressed with gzip and load into Pandas dataframe https://pastexy.com/106/python-pandas-read-csv-file-compressed-with-gzip-and-load-into-pandas-dataframe

Columns: ['Date received', 'Product', 'Sub-product', 'Issue', 'Sub-issue', 'Consumer complaint narrative', 'Company public response', 'Company', 'State', 'ZIP code', 'Tags', 'Consumer consent provided?', 'Submitted via', 'Date sent to company', 'Company response to consumer', 'Timely response?', 'Consumer disputed?', 'Complaint ID']

I did not modify the dataset.

Use it to practice with dataframes - Pandas or PySpark on Google Colab:

!unzip complaints.csv.zip

import pandas as pd df = pd.read_csv('complaints.csv') df.columns

df.head() etc.

Clear search

Close search

Google apps

Main menu

US Consumer Complaints Against Businesses

Merge number of excel file,convert into csv file

Data from: LifeSnaps: a 4-month multi-modal dataset capturing unobtrusive...

Dataset & code for "Using large language models to address the bottleneck of...

Sample Dataset for DataFrame Styling

Dataset

Contents

Exploring the Relationship between Lipid Profile Changes, Growth and...

Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic...

Data from: Generalizable EHR-R-REDCap pipeline for a national...

Data from: ScientISST MOVE: Annotated Wearable Multimodal Biosignals...

For more information on background, methods and the acquisition protocol, refer to https://doi.org/10.13026/0ppk-ha30.

Datasets for the Carpentry-style RNA-seq lesson

The Device Activity Report with Complete Knowledge (DARCK) for NILM

1. Abstract

2. Dataset Overview

3. Download and Usage

4. Measurement Setup

5. File Format (DARCK.csv)

Column Descriptions

Column Name

Data Type

Unit

Description

6. Data Postprocessing Pipeline

6.1. Main Meter (main) Postprocessing

6.2. Sub-metered Devices (shellies) Postprocessing

6.3. Merging and Finalization

7. Manual Corrections and Known Data Issues

8. Appliance Details and Multipurpose Plugs

SELTO Dataset

Data for the Farewell and Herberg example of a two-phase experiment using a...

GENEActiv accelerometer file related to the #120 OxWearables / stepcount...

Market Basket Analysis

Market Basket Analysis

Introduction

An Example of Association Rules

Strategy

Dataset Description

Libraries in R

Data Pre-processing

Myrstener et al. (2025) Downstream temperature effects of boreal forest...

Dataset for "ConfSolv: Prediction of solute conformer free energies across a...

The CORESIDENCE Database: National and Subnational Data on Household and...

Data for "Topological grain boundary segregation transitions"

White Clay Creek - Stage, Streamflow / Discharge (1968-2014)

US Consumer Complaints Against Businesses

2 Million records including product, company name, issue details and response

5. File Format (`DARCK.csv`)

6.1. Main Meter (`main`) Postprocessing

6.2. Sub-metered Devices (`shellies`) Postprocessing