46 datasets found
  1. Merge number of excel file,convert into csv file

    • kaggle.com
    zip
    Updated Mar 30, 2024
    Cite
    Aashirvad pandey (2024). Merge number of excel file,convert into csv file [Dataset]. https://www.kaggle.com/datasets/aashirvadpandey/merge-number-of-excel-fileconvert-into-csv-file
    Explore at:
    Available download formats: zip (6731 bytes)
    Dataset updated
    Mar 30, 2024
    Authors
    Aashirvad pandey
    License

    Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Project Description:

    Title: Pandas Data Manipulation and File Conversion

    Overview: This project aims to demonstrate the basic functionalities of Pandas, a powerful data manipulation library in Python. In this project, we will create a DataFrame, perform some data manipulation operations using Pandas, and then convert the DataFrame into both Excel and CSV formats.

    Key Objectives:

    1. DataFrame Creation: Utilize Pandas to create a DataFrame with sample data.
    2. Data Manipulation: Perform basic data manipulation tasks such as adding columns, filtering data, and performing calculations.
    3. File Conversion: Convert the DataFrame into Excel (.xlsx) and CSV (.csv) file formats.

    Tools and Libraries Used:

    • Python
    • Pandas

    Project Implementation:

    1. DataFrame Creation:

      • Import the Pandas library.
      • Create a DataFrame using either a dictionary, a list of dictionaries, or by reading data from an external source like a CSV file.
      • Populate the DataFrame with sample data representing various data types (e.g., integer, float, string, datetime).
    2. Data Manipulation:

      • Add new columns to the DataFrame representing derived data or computations based on existing columns.
      • Filter the DataFrame to include only specific rows based on certain conditions.
      • Perform basic calculations or transformations on the data, such as aggregation functions or arithmetic operations.
    3. File Conversion:

      • Utilize Pandas to convert the DataFrame into an Excel (.xlsx) file using the to_excel() function.
      • Convert the DataFrame into a CSV (.csv) file using the to_csv() function.
      • Save the generated files to the local file system for further analysis or sharing.

    Expected Outcome:

    Upon completion of this project, you will have gained a fundamental understanding of how to work with Pandas DataFrames, perform basic data manipulation tasks, and convert DataFrames into different file formats. This knowledge will be valuable for data analysis, preprocessing, and data export tasks in various data science and analytics projects.

    Conclusion:

    The Pandas library offers powerful tools for data manipulation and file conversion in Python. By completing this project, you will have acquired essential skills that are widely applicable in the field of data science and analytics. You can further extend this project by exploring more advanced Pandas functionalities or integrating it into larger data processing pipelines. In this dataset, several tables are built as DataFrames, saved to a single Excel file as separately named sheets, and that Excel file is then converted into CSV.
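
    A minimal sketch of the workflow described above: two DataFrames are written to one Excel workbook as separate sheets, and each sheet is then converted to its own CSV file. The sample data, file names, and sheet names are hypothetical, and pandas' Excel writing assumes an engine such as openpyxl is installed.

    import pandas as pd

    # Hypothetical sample frames standing in for the merged data
    df1 = pd.DataFrame({"name": ["Alice", "Bob"], "score": [85, 92]})
    df2 = pd.DataFrame({"city": ["Pune", "Delhi"], "population": [3124458, 16787941]})

    # Save both frames into a single Excel file as differently named sheets
    with pd.ExcelWriter("merged.xlsx") as writer:
        df1.to_excel(writer, sheet_name="scores", index=False)
        df2.to_excel(writer, sheet_name="cities", index=False)

    # Convert each sheet of that Excel file into its own CSV file
    sheets = pd.read_excel("merged.xlsx", sheet_name=None)  # dict of sheet name -> DataFrame
    for name, frame in sheets.items():
        frame.to_csv(f"{name}.csv", index=False)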

  2. AI4Code Train Dataframe

    • kaggle.com
    zip
    Updated May 12, 2022
    Cite
    Darien Schettler (2022). AI4Code Train Dataframe [Dataset]. https://www.kaggle.com/datasets/dschettler8845/ai4code-train-dataframe
    Explore at:
    Available download formats: zip (622120487 bytes)
    Dataset updated
    May 12, 2022
    Authors
    Darien Schettler
    Description

    [EDIT/UPDATE]

    There are a few important updates.

    1. When SAVING the pd.DataFrame as a .csv, the following command should be used to avoid improper interpretation of newline character(s).
    import csv

    train_df.to_csv(
      "train.csv", index=False,
      encoding='utf-8',
      quoting=csv.QUOTE_NONNUMERIC  # <== THIS IS REQUIRED
    )
    
    2. When LOADING the .csv as a pd.DataFrame, the following command must be used to avoid misinterpretation of NaN-like strings (null, nan, ...) as pd.NaN values.
    import pandas as pd

    train_df = pd.read_csv(
      "/kaggle/input/ai4code-train-dataframe/train.csv",
      keep_default_na=False  # <== THIS IS REQUIRED
    )
    
  3. Data from: LifeSnaps: a 4-month multi-modal dataset capturing unobtrusive...

    • data.niaid.nih.gov
    Updated Oct 20, 2022
    + more versions
    Cite
    Yfantidou, Sofia; Karagianni, Christina; Efstathiou, Stefanos; Vakali, Athena; Palotti, Joao; Giakatos, Dimitrios Panteleimon; Marchioro, Thomas; Kazlouski, Andrei; Ferrari, Elena; Girdzijauskas, Šarūnas (2022). LifeSnaps: a 4-month multi-modal dataset capturing unobtrusive snapshots of our lives in the wild [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6826682
    Explore at:
    Dataset updated
    Oct 20, 2022
    Dataset provided by
    Earkick
    University of Insubria
    KTH Royal Institute of Technology
    Aristotle University of Thessaloniki
    Foundation for Research and Technology Hellas
    Authors
    Yfantidou, Sofia; Karagianni, Christina; Efstathiou, Stefanos; Vakali, Athena; Palotti, Joao; Giakatos, Dimitrios Panteleimon; Marchioro, Thomas; Kazlouski, Andrei; Ferrari, Elena; Girdzijauskas, Šarūnas
    Description

    LifeSnaps Dataset Documentation

    Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet, limited data exist on the association between in the wild large-scale physical activity patterns, sleep, stress, and overall health, and behavioral patterns and psychological measurements due to challenges in collecting and releasing such datasets, such as waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset, a multi-modal, longitudinal, and geographically-distributed dataset, containing a plethora of anthropological data, collected unobtrusively for the total course of more than 4 months by n=71 participants, under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types from second to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to make these data available openly to empower future research. We envision that releasing this large-scale dataset of multi-modal real-world data, will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction.

    The following instructions will get you started with the LifeSnaps dataset and are complementary to the original publication.

    Data Import: Reading CSV

    For ease of use, we provide CSV files containing Fitbit, SEMA, and survey data at daily and/or hourly granularity. You can read the files via any programming language. For example, in Python, you can read the files into a Pandas DataFrame with the pandas.read_csv() command.
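
    A one-line sketch of that read (the file name here is hypothetical; substitute the daily or hourly CSV of interest):

    import pandas as pd

    daily_df = pd.read_csv("daily_fitbit.csv")
    print(daily_df.head())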

    Data Import: Setting up a MongoDB (Recommended)

    To take full advantage of the LifeSnaps dataset, we recommend that you use the raw, complete data via importing the LifeSnaps MongoDB database.

    To do so, open the terminal/command prompt and run the following command for each collection in the DB. Ensure you have the MongoDB Database Tools installed.

    For the Fitbit data, run the following:

    mongorestore --host localhost:27017 -d rais_anonymized -c fitbit

    For the SEMA data, run the following:

    mongorestore --host localhost:27017 -d rais_anonymized -c sema

    For surveys data, run the following:

    mongorestore --host localhost:27017 -d rais_anonymized -c surveys

    If you have access control enabled, then you will need to add the --username and --password parameters to the above commands.

    Data Availability

    The MongoDB database contains three collections, fitbit, sema, and surveys, containing the Fitbit, SEMA3, and survey data, respectively. Similarly, the CSV files contain related information to these collections. Each document in any collection follows the format shown below:

    { _id: <ObjectId>, id (or user_id): <user id>, type: <data type>, data: <embedded object> }

    Each document consists of four fields: _id, id (also found as user_id in the sema and survey collections), type, and data. The _id field is the MongoDB-defined primary key and can be ignored. The id field refers to a user-specific ID used to uniquely identify each user across all collections. The type field refers to the specific data type within the collection, e.g., steps, heart rate, calories, etc. The data field contains the actual information about the document, e.g., the step count for a specific timestamp for the steps type, in the form of an embedded object. The contents of the data object are type-dependent, meaning that the fields within the data object differ between different types of data. As mentioned previously, all times are stored in local time, and user IDs are common across different collections. For more information on the available data types, see the related publication.
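
    As an illustration of the document format above, a hedged pymongo sketch for querying one data type from the restored database (the user id value is hypothetical; the database and collection names follow the mongorestore commands above):

    from pymongo import MongoClient

    client = MongoClient("localhost", 27017)
    db = client["rais_anonymized"]

    # All "steps" documents for one (hypothetical) user id in the fitbit collection
    for doc in db.fitbit.find({"type": "steps", "id": "user_id_example"}):
        print(doc["data"])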

    Surveys Encoding

    BREQ2

    Why do you engage in exercise?

    • engage[SQ001]: I exercise because other people say I should
    • engage[SQ002]: I feel guilty when I don’t exercise
    • engage[SQ003]: I value the benefits of exercise
    • engage[SQ004]: I exercise because it’s fun
    • engage[SQ005]: I don’t see why I should have to exercise
    • engage[SQ006]: I take part in exercise because my friends/family/partner say I should
    • engage[SQ007]: I feel ashamed when I miss an exercise session
    • engage[SQ008]: It’s important to me to exercise regularly
    • engage[SQ009]: I can’t see why I should bother exercising
    • engage[SQ010]: I enjoy my exercise sessions
    • engage[SQ011]: I exercise because others will not be pleased with me if I don’t
    • engage[SQ012]: I don’t see the point in exercising
    • engage[SQ013]: I feel like a failure when I haven’t exercised in a while
    • engage[SQ014]: I think it is important to make the effort to exercise regularly
    • engage[SQ015]: I find exercise a pleasurable activity
    • engage[SQ016]: I feel under pressure from my friends/family to exercise
    • engage[SQ017]: I get restless if I don’t exercise regularly
    • engage[SQ018]: I get pleasure and satisfaction from participating in exercise
    • engage[SQ019]: I think exercising is a waste of time

    PANAS

    Indicate the extent you have felt this way over the past week

    • P1[SQ001]: Interested
    • P1[SQ002]: Distressed
    • P1[SQ003]: Excited
    • P1[SQ004]: Upset
    • P1[SQ005]: Strong
    • P1[SQ006]: Guilty
    • P1[SQ007]: Scared
    • P1[SQ008]: Hostile
    • P1[SQ009]: Enthusiastic
    • P1[SQ010]: Proud
    • P1[SQ011]: Irritable
    • P1[SQ012]: Alert
    • P1[SQ013]: Ashamed
    • P1[SQ014]: Inspired
    • P1[SQ015]: Nervous
    • P1[SQ016]: Determined
    • P1[SQ017]: Attentive
    • P1[SQ018]: Jittery
    • P1[SQ019]: Active
    • P1[SQ020]: Afraid

    Personality

    How Accurately Can You Describe Yourself?

    • ipip[SQ001]: Am the life of the party.
    • ipip[SQ002]: Feel little concern for others.
    • ipip[SQ003]: Am always prepared.
    • ipip[SQ004]: Get stressed out easily.
    • ipip[SQ005]: Have a rich vocabulary.
    • ipip[SQ006]: Don't talk a lot.
    • ipip[SQ007]: Am interested in people.
    • ipip[SQ008]: Leave my belongings around.
    • ipip[SQ009]: Am relaxed most of the time.
    • ipip[SQ010]: Have difficulty understanding abstract ideas.
    • ipip[SQ011]: Feel comfortable around people.
    • ipip[SQ012]: Insult people.
    • ipip[SQ013]: Pay attention to details.
    • ipip[SQ014]: Worry about things.
    • ipip[SQ015]: Have a vivid imagination.
    • ipip[SQ016]: Keep in the background.
    • ipip[SQ017]: Sympathize with others' feelings.
    • ipip[SQ018]: Make a mess of things.
    • ipip[SQ019]: Seldom feel blue.
    • ipip[SQ020]: Am not interested in abstract ideas.
    • ipip[SQ021]: Start conversations.
    • ipip[SQ022]: Am not interested in other people's problems.
    • ipip[SQ023]: Get chores done right away.
    • ipip[SQ024]: Am easily disturbed.
    • ipip[SQ025]: Have excellent ideas.
    • ipip[SQ026]: Have little to say.
    • ipip[SQ027]: Have a soft heart.
    • ipip[SQ028]: Often forget to put things back in their proper place.
    • ipip[SQ029]: Get upset easily.
    • ipip[SQ030]: Do not have a good imagination.
    • ipip[SQ031]: Talk to a lot of different people at parties.
    • ipip[SQ032]: Am not really interested in others.
    • ipip[SQ033]: Like order.
    • ipip[SQ034]: Change my mood a lot.
    • ipip[SQ035]: Am quick to understand things.
    • ipip[SQ036]: Don't like to draw attention to myself.
    • ipip[SQ037]: Take time out for others.
    • ipip[SQ038]: Shirk my duties.
    • ipip[SQ039]: Have frequent mood swings.
    • ipip[SQ040]: Use difficult words.
    • ipip[SQ041]: Don't mind being the centre of attention.
    • ipip[SQ042]: Feel others' emotions.
    • ipip[SQ043]: Follow a schedule.
    • ipip[SQ044]: Get irritated easily.
    • ipip[SQ045]: Spend time reflecting on things.
    • ipip[SQ046]: Am quiet around strangers.
    • ipip[SQ047]: Make people feel at ease.
    • ipip[SQ048]: Am exacting in my work.
    • ipip[SQ049]: Often feel blue.
    • ipip[SQ050]: Am full of ideas.

    STAI

    Indicate how you feel right now

    • STAI[SQ001]: I feel calm
    • STAI[SQ002]: I feel secure
    • STAI[SQ003]: I am tense
    • STAI[SQ004]: I feel strained
    • STAI[SQ005]: I feel at ease
    • STAI[SQ006]: I feel upset
    • STAI[SQ007]: I am presently worrying over possible misfortunes
    • STAI[SQ008]: I feel satisfied
    • STAI[SQ009]: I feel frightened
    • STAI[SQ010]: I feel comfortable
    • STAI[SQ011]: I feel self-confident
    • STAI[SQ012]: I feel nervous
    • STAI[SQ013]: I am jittery
    • STAI[SQ014]: I feel indecisive
    • STAI[SQ015]: I am relaxed
    • STAI[SQ016]: I feel content
    • STAI[SQ017]: I am worried
    • STAI[SQ018]: I feel confused
    • STAI[SQ019]: I feel steady
    • STAI[SQ020]: I feel pleasant

    TTM

    Do you engage in regular physical activity according to the definition above? How frequently did each event or experience occur in the past month?

    • processes[SQ002]: I read articles to learn more about physical
    
  4. US Means of Transportation to Work Census Data

    • kaggle.com
    zip
    Updated Feb 23, 2022
    Cite
    Sagar G (2022). US Means of Transportation to Work Census Data [Dataset]. https://www.kaggle.com/goswamisagard/american-census-survey-b08301-cleaned-csv-data
    Explore at:
    Available download formats: zip (3388809 bytes)
    Dataset updated
    Feb 23, 2022
    Authors
    Sagar G
    Area covered
    United States
    Description

    The US Census Bureau conducts the American Census Survey 1-year and 5-year surveys, which record various demographics and provide public access through APIs. I have attempted to call the APIs from a Python environment using the requests library, then clean and organize the data into a usable format.

    Data Ingestion and Cleaning:

    ACS Subject data [2011-2019] was accessed using Python via the following API link: https://api.census.gov/data/2011/acs/acs1?get=group(B08301)&for=county:* The data was obtained in JSON format by calling the above API and then imported as a Python Pandas DataFrame. The 84 variables returned comprise 21 estimate values for various metrics, the 21 respective margin-of-error values, and the respective annotation values for the estimates and margins of error. This data then went through various cleaning processes in Python, where excess variables were removed and the column names were renamed. Web scraping was carried out to extract the variable names and replace the codes in the column names of the raw data.

    The above step was carried out for multiple ACS/ACS-1 datasets spanning 2011-2019, which were then merged into a single Python Pandas DataFrame. The columns were rearranged, and the "NAME" column was split into two columns, 'StateName' and 'CountyName.' Counties for which no data was available were removed from the DataFrame. Once the DataFrame was ready, it was separated into two new DataFrames for state and county data and exported in '.csv' format.
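
    A minimal sketch of the ingestion step described above, assuming the public ACS endpoint shown in the link and a "County, State" layout for the NAME column; the cleaning, renaming, and merging steps are not reproduced here:

    import requests
    import pandas as pd

    # Call the ACS endpoint; the first row of the JSON response holds the column codes
    url = "https://api.census.gov/data/2011/acs/acs1?get=group(B08301)&for=county:*"
    rows = requests.get(url).json()
    df = pd.DataFrame(rows[1:], columns=rows[0])

    # Split the "NAME" column ("<County>, <State>") into CountyName and StateName
    df[["CountyName", "StateName"]] = df["NAME"].str.split(", ", n=1, expand=True)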

    Data Source:

    More information about the source of Data can be found at the URL below: US Census Bureau. (n.d.). About: Census Bureau API. Retrieved from Census.gov https://www.census.gov/data/developers/about.html

    Final Word:

    I hope this data helps you to create something beautiful, and awesome. I will be posting a lot more databases shortly, if I get more time from assignments, submissions, and Semester Projects 🧙🏼‍♂️. Good Luck.

  5. Data for: Can government transfers make energy subsidy reform socially...

    • data.mendeley.com
    Updated Mar 31, 2020
    Cite
    Filip Schaffitzel (2020). Data for: Can government transfers make energy subsidy reform socially acceptable? A case study on Ecuador [Dataset]. http://doi.org/10.17632/z35m76mf9g.1
    Explore at:
    Dataset updated
    Mar 31, 2020
    Authors
    Filip Schaffitzel
    License

    Attribution-NonCommercial 3.0 (CC BY-NC 3.0) https://creativecommons.org/licenses/by-nc/3.0/
    License information was derived automatically

    Area covered
    Ecuador
    Description

    Estimating the distributional impacts of energy subsidy removal and compensation schemes in Ecuador based on input-output and household data.

    Import files:

    • Dictionary Categories.csv, Dictionary ENI-IOT.csv, and Dictionary Subcategories.csv based on [1]
    • Dictionary IOT.csv and IOT_2012.csv (cannot be redistributed) based on [2]
    • Dictionary Taxes.csv and Dictionary Transfers.csv based on [3]
    • ENIGHUR11_GASTOS_V.csv, ENIGHUR11_HOGARES_AGREGADOS.csv, and ENIGHUR11_PERSONAS_INGRESOS.csv based on [4]
    • Price increase scenarios.csv based on [5]

    Further basic files and documents:

    • [1] 4_M&D_Mapping ENIGHUR expenditures to IOT_180605.xlsm
    • [2] Input-output table 2012 (https://contenido.bce.fin.ec/documentos/PublicacionesNotas/Catalogo/CuentasNacionales/Anuales/Dolares/MIP2012Ampliada.xls). Save the sheet with the IOT 2012 (Matriz simétrica) as IOT_2012.csv and edit the format: first column and row: IOT labels
    • [3] 4_M&D_ENIGHUR income_180606.xlsx
    • [4] ENIGHUR data can be retrieved from http://www.ecuadorencifras.gob.ec/encuesta-nacional-de-ingresos-y-gastos-de-los-hogares-urbanos-y-rurales/ Household datasets are only available in SPSS file format, and the free software PSPP is used to convert .sav- to .csv-files, as this format can be read directly and efficiently into a Python Pandas DataFrame. See PSPP syntax below:
      save translate
        /outfile = filename
        /type = CSV
        /textoptions decimal = DOT
        /textoptions delimiter = ';'
        /fieldnames
        /cells=values
        /replace.
    • [5] 3_Ecuador_Energy subsidies and 4_M&D_Price scenarios_180610.xlsx
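
    Once the household files have been exported from PSPP, they can be read into pandas with the same ';' delimiter. A small sketch, using one of the ENIGHUR files listed above:

    import pandas as pd

    # ';' matches the PSPP export delimiter, '.' the decimal mark set above
    gastos = pd.read_csv("ENIGHUR11_GASTOS_V.csv", sep=";", decimal=".")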

  6. The Device Activity Report with Complete Knowledge (DARCK) for NILM

    • zenodo.org
    bin, xz
    Updated Sep 19, 2025
    Cite
    Anonymous Anonymous; Anonymous Anonymous (2025). The Device Activity Report with Complete Knowledge (DARCK) for NILM [Dataset]. http://doi.org/10.5281/zenodo.17159850
    Explore at:
    Available download formats: bin, xz
    Dataset updated
    Sep 19, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous Anonymous; Anonymous Anonymous
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    1. Abstract

    This dataset contains aggregated and sub-metered power consumption data from a two-person apartment in Germany. Data was collected from March 5 to September 4, 2025, spanning 6 months. It includes an aggregate reading from a main smart meter and individual readings from 40 smart plugs, smart relays, and smart power meters monitoring various appliances.

    2. Dataset Overview

    • Apartment: Two-person apartment, approx. 58m², located in Aachen, Germany.
    • Aggregate Meter: eBZ DD3
    • Sub-meters: 31 Shelly Plus Plug S, 6 Shelly Plus 1PM, 3 Shelly Plus PM Mini Gen3
    • Sampling Rate: 1 Hz
    • Measured Quantity: Active Power
    • Unit of Measurement: Watt
    • Duration: 6 months
    • Format: Single CSV file (`DARCK.csv`)
    • Structure: Timestamped rows with columns for the aggregate meter and each sub-metered appliance.
    • Completeness: The main power meter has a completeness of 99.3%. Missing values were linearly interpolated.

    3. Download and Usage

    The dataset can be downloaded here: https://doi.org/10.5281/zenodo.17159850

    As it contains long off periods with zeros, the CSV file compresses well.

    To extract it, use: xz -d DARCK.csv.xz
    The compression leads to a 97% smaller file size (from 4 GB to 90.9 MB).


    To use the dataset in Python, you can, e.g., load the CSV file into a pandas DataFrame:

    import pandas as pd

    df = pd.read_csv("DARCK.csv", parse_dates=["time"])

    4. Measurement Setup

    The main meter was monitored using an infrared reading head magnetically attached to the infrared interface of the meter. An ESP8266 flashed with Tasmota decodes the binary datagrams and forwards the Watt readings to the MQTT broker. Individual appliances were monitored using a combination of Shelly Plugs (for outlets), Shelly 1PM (for wired-in devices like ceiling lights), and Shelly PM Mini (for each of the three phases of the oven). All devices reported to a central InfluxDB database via Home Assistant running in docker on a Dell OptiPlex 3020M.

    5. File Format (DARCK.csv)

    The dataset is provided as a single comma-separated value (CSV) file.

    • The first row is a header containing the column names.
    • All power values are rounded to the first decimal place.
    • There are no missing values in the final dataset.
    • Each row represents 1 second, from start of measuring in March until the end in September.

    Column Descriptions

    Column Name

    Data Type

    Unit

    Description

    timedatetime-Timestamp for the reading in YYYY-MM-DD HH:MM:SS
    mainfloatWattTotal aggregate power consumption for the apartment, measured at the main electrical panel.
    [appliance_name]floatWattPower consumption of an individual appliance (e.g., lightbathroom, fridge, sherlockpc). See Section 8 for a full list.
    Aggregate Columns
    aggr_chargersfloatWattThe sum of sherlockcharger, sherlocklaptop, watsoncharger, watsonlaptop, watsonipadcharger, kitchencharger.
    aggr_stoveplatesfloatWattThe sum of stoveplatel1 and stoveplatel2.
    aggr_lightsfloatWattThe sum of lightbathroom, lighthallway, lightsherlock, lightkitchen, lightlivingroom, lightwatson, lightstoreroom, fcob, sherlockalarmclocklight, sherlockfloorlamphue, sherlockledstrip, livingfloorlamphue, sherlockglobe, watsonfloorlamp, watsondesklamp and watsonledmap.
    Analysis Columns
    inaccuracyfloatWattAs no electrical device bypasses a power meter, the true inaccuracy can be assessed. It is the absolute error between the sum of individual measurements and the mains reading. A 30W offset is applied to the sum since the measurement devices themselves draw power which is otherwise unaccounted for.
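
    A hedged sketch of recomputing the inaccuracy column as defined above; excluding the aggregate helper columns from the sum is an assumption:

    import pandas as pd

    df = pd.read_csv("DARCK.csv", parse_dates=["time"])

    # |sum of individual sub-meters + 30 W offset - mains reading|
    appliance_cols = [c for c in df.columns
                      if c not in ("time", "main", "inaccuracy") and not c.startswith("aggr_")]
    recomputed = (df[appliance_cols].sum(axis=1) + 30.0 - df["main"]).abs()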

    6. Data Postprocessing Pipeline

    The final dataset was generated from two raw data sources (meter.csv and shellies.csv) using a comprehensive postprocessing pipeline.

    6.1. Main Meter (main) Postprocessing

    The aggregate power data required several cleaning steps to ensure accuracy.

    1. Outlier Removal: Readings below 10W or above 10,000W were removed (merely 3 occurrences).
    2. Timestamp Burst Correction: The source data contained bursts of delayed readings. A custom algorithm was used to identify these bursts (large time gap followed by rapid readings) and back-fill the timestamps to create an evenly spaced time series.
    3. Alignment & Interpolation: The smart meter pushes a new value via infrared every second. To align these readings to whole seconds, the series was resampled to a 1-second frequency by taking the mean of all readings within each second (in 99.5% of seconds there is only one value). Any resulting gaps (0.7% outage ratio) were filled using linear interpolation.

    6.2. Sub-metered Devices (shellies) Postprocessing

    The Shelly devices are not prone to the same burst issue as the ESP8266. They push a new reading at every change in power drawn. If no power change is observed, or the observed change is too small (less than a few watts), a reading is pushed once a minute together with a heartbeat. When a device turns on or off, intermediate power values are published, which leads to sub-second values that need to be handled.

    1. Grouping: Data was grouped by the unique device identifier.
    2. Resampling & Filling: The data for each device was resampled to a 1-second frequency using .resample('1s').last().ffill().
      This method was chosen, firstly, to capture the last known state of the device within each second, handling rapid on/off events, and secondly, to forward-fill the last state across periods with no new data, modeling that the device's consumption remained constant until a new reading was sent (see the sketch below).
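
    A minimal, self-contained sketch of that resampling step for a single device, using toy readings in place of the raw shellies.csv export (the column layout is an assumption):

    import pandas as pd

    # Toy sub-second readings for one device
    raw = pd.DataFrame({
        "time": pd.to_datetime(["2025-03-05 00:00:00.4", "2025-03-05 00:00:00.9", "2025-03-05 00:00:03.2"]),
        "power": [1.2, 85.0, 1.3],
    })

    fridge = raw.set_index("time")["power"]
    fridge_1hz = fridge.resample("1s").last().ffill()  # last state per second, held until the next reading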

    6.3. Merging and Finalization

    1. Merge: The cleaned main meter and all sub-metered device dataframes were merged into a single dataframe on the time index.
    2. Final Fill: Any remaining NaN values (e.g., from before a device was installed) were filled with 0.0, assuming zero consumption.

    7. Manual Corrections and Known Data Issues

    During analysis, two significant unmetered load events were identified and manually corrected to improve the accuracy of the aggregate reading. The error column (inaccuracy) was recalculated after these corrections.

    1. March 10th - Unmetered Bulb: An unmetered 107W bulb was active. It was subtracted from the main reading as if it never happened.
    2. May 31st - Unmetered Air Pump: An unmetered 101W pump for an air mattress was used directly in an outlet with no intermediary plug and hence manually added to the respective plug.

    8. Appliance Details and Multipurpose Plugs

    The following table lists the column names with an explanation where needed. As Watson moved at the beginning of June, some metering plugs changed their appliance.

  7. US Consumer Complaints Against Businesses

    • kaggle.com
    zip
    Updated Oct 9, 2022
    Cite
    Jeffery Mandrake (2022). US Consumer Complaints Against Businesses [Dataset]. https://www.kaggle.com/jefferymandrake/us-consumer-complaints-dataset-through-2019
    Explore at:
    Available download formats: zip (343188956 bytes)
    Dataset updated
    Oct 9, 2022
    Authors
    Jeffery Mandrake
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    2,121,458 records

    I used Google Colab to check out this dataset and pull the column names using Pandas.

    Sample code example: Python Pandas read csv file compressed with gzip and load into Pandas dataframe https://pastexy.com/106/python-pandas-read-csv-file-compressed-with-gzip-and-load-into-pandas-dataframe

    Columns: ['Date received', 'Product', 'Sub-product', 'Issue', 'Sub-issue', 'Consumer complaint narrative', 'Company public response', 'Company', 'State', 'ZIP code', 'Tags', 'Consumer consent provided?', 'Submitted via', 'Date sent to company', 'Company response to consumer', 'Timely response?', 'Consumer disputed?', 'Complaint ID']

    I did not modify the dataset.

    Use it to practice with dataframes - Pandas or PySpark on Google Colab:

    !unzip complaints.csv.zip

    import pandas as pd
    df = pd.read_csv('complaints.csv')
    df.columns

    df.head(), etc.

  8. Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic...

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, zip
    Updated Dec 24, 2022
    Cite
    Alexander R. Hartloper; Alexander R. Hartloper; Selimcan Ozden; Albano de Castro e Sousa; Dimitrios G. Lignos; Dimitrios G. Lignos; Selimcan Ozden; Albano de Castro e Sousa (2022). Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials [Dataset]. http://doi.org/10.5281/zenodo.6965147
    Explore at:
    Available download formats: bin, zip, csv
    Dataset updated
    Dec 24, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alexander R. Hartloper; Alexander R. Hartloper; Selimcan Ozden; Albano de Castro e Sousa; Dimitrios G. Lignos; Dimitrios G. Lignos; Selimcan Ozden; Albano de Castro e Sousa
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials

    Background

    This dataset contains data from monotonic and cyclic loading experiments on structural metallic materials. The materials are primarily structural steels and one iron-based shape memory alloy is also included. Summary files are included that provide an overview of the database and data from the individual experiments is also included.

    The files included in the database are outlined below and the format of the files is briefly described. Additional information regarding the formatting can be found through the post-processing library (https://github.com/ahartloper/rlmtp/tree/master/protocols).

    Usage

    • The data is licensed through the Creative Commons Attribution 4.0 International.
    • If you have used our data and are publishing your work, we ask that you please reference both:
      1. this database through its DOI, and
      2. any publication that is associated with the experiments. See the Overall_Summary and Database_References files for the associated publication references.

    Included Files

    • Overall_Summary_2022-08-25_v1-0-0.csv: summarises the specimen information for all experiments in the database.
    • Summarized_Mechanical_Props_Campaign_2022-08-25_v1-0-0.csv: summarises the average initial yield stress and average initial elastic modulus per campaign.
    • Unreduced_Data-#_v1-0-0.zip: contain the original (not downsampled) data
      • Where # is one of: 1, 2, 3, 4, 5, 6. The unreduced data is broken into separate archives because of upload limitations to Zenodo. Together they provide all the experimental data.
      • We recommend you unzip all the folders and place them in one "Unreduced_Data" directory, similar to the "Clean_Data" directory.
      • The experimental data is provided through .csv files for each test that contain the processed data. The experiments are organised by experimental campaign and named by load protocol and specimen. A .pdf file accompanies each test showing the stress-strain graph.
      • There is a "db_tag_clean_data_map.csv" file that is used to map the database summary with the unreduced data.
      • The computed yield stresses and elastic moduli are stored in the "yield_stress" directory.
    • Clean_Data_v1-0-0.zip: contains all the downsampled data
      • The experimental data is provided through .csv files for each test that contain the processed data. The experiments are organised by experimental campaign and named by load protocol and specimen. A .pdf file accompanies each test showing the stress-strain graph.
      • There is a "db_tag_clean_data_map.csv" file that is used to map the database summary with the clean data.
      • The computed yield stresses and elastic moduli are stored in the "yield_stress" directory.
    • Database_References_v1-0-0.bib
      • Contains a bibtex reference for many of the experiments in the database. Corresponds to the "citekey" entry in the summary files.

    File Format: Downsampled Data

    These are the "LP_

    • The header of the first column is empty: the first column corresponds to the index of the sample point in the original (unreduced) data
    • Time[s]: time in seconds since the start of the test
    • e_true: true strain
    • Sigma_true: true stress in MPa
    • (optional) Temperature[C]: the surface temperature in degC

    These data files can be easily loaded using the pandas library in Python through:

    import pandas
    data = pandas.read_csv(data_file, index_col=0)

    The data is formatted so it can be used directly in RESSPyLab (https://github.com/AlbanoCastroSousa/RESSPyLab). Note that the column names "e_true" and "Sigma_true" were kept for backwards compatibility reasons with RESSPyLab.

    File Format: Unreduced Data

    These are the "LP_

    • The first column is the index of each data point
    • S/No: sample number recorded by the DAQ
    • System Date: Date and time of sample
    • Time[s]: time in seconds since the start of the test
    • C_1_Force[kN]: load cell force
    • C_1_Déform1[mm]: extensometer displacement
    • C_1_Déplacement[mm]: cross-head displacement
    • Eng_Stress[MPa]: engineering stress
    • Eng_Strain[]: engineering strain
    • e_true: true strain
    • Sigma_true: true stress in MPa
    • (optional) Temperature[C]: specimen surface temperature in degC

    The data can be loaded and used similarly to the downsampled data.

    File Format: Overall_Summary

    The overall summary file provides data on all the test specimens in the database. The columns include:

    • hidden_index: internal reference ID
    • grade: material grade
    • spec: specifications for the material
    • source: base material for the test specimen
    • id: internal name for the specimen
    • lp: load protocol
    • size: type of specimen (M8, M12, M20)
    • gage_length_mm_: unreduced section length in mm
    • avg_reduced_dia_mm_: average measured diameter for the reduced section in mm
    • avg_fractured_dia_top_mm_: average measured diameter of the top fracture surface in mm
    • avg_fractured_dia_bot_mm_: average measured diameter of the bottom fracture surface in mm
    • fy_n_mpa_: nominal yield stress
    • fu_n_mpa_: nominal ultimate stress
    • t_a_deg_c_: ambient temperature in degC
    • date: date of test
    • investigator: person(s) who conducted the test
    • location: laboratory where test was conducted
    • machine: setup used to conduct test
    • pid_force_k_p, pid_force_t_i, pid_force_t_d: PID parameters for force control
    • pid_disp_k_p, pid_disp_t_i, pid_disp_t_d: PID parameters for displacement control
    • pid_extenso_k_p, pid_extenso_t_i, pid_extenso_t_d: PID parameters for extensometer control
    • citekey: reference corresponding to the Database_References.bib file
    • yield_stress_mpa_: computed yield stress in MPa
    • elastic_modulus_mpa_: computed elastic modulus in MPa
    • fracture_strain: computed average true strain across the fracture surface
    • c,si,mn,p,s,n,cu,mo,ni,cr,v,nb,ti,al,b,zr,sn,ca,h,fe: chemical compositions in units of %mass
    • file: file name of corresponding clean (downsampled) stress-strain data

    File Format: Summarized_Mechanical_Props_Campaign

    Meant to be loaded in Python as a pandas DataFrame with multi-indexing, e.g.,

    import pandas as pd

    tab1 = pd.read_csv('Summarized_Mechanical_Props_Campaign_' + date + version + '.csv',
              index_col=[0, 1, 2, 3], skipinitialspace=True, header=[0, 1],
              keep_default_na=False, na_values='')
    • citekey: reference in "Campaign_References.bib".
    • Grade: material grade.
    • Spec.: specifications (e.g., J2+N).
    • Yield Stress [MPa]: initial yield stress in MPa
      • size, count, mean, coefvar: number of experiments in campaign, number of experiments in mean, mean value for campaign, coefficient of variation for campaign
    • Elastic Modulus [MPa]: initial elastic modulus in MPa
      • size, count, mean, coefvar: number of experiments in campaign, number of experiments in mean, mean value for campaign, coefficient of variation for campaign
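
    Continuing from the read_csv call above, the two header rows give two-level columns, so campaign-level statistics can be selected with a tuple. The exact column labels here are an assumption based on the description above:

    # Mean initial yield stress per campaign, and its coefficient of variation
    mean_yield = tab1[('Yield Stress [MPa]', 'mean')]
    cv_yield = tab1[('Yield Stress [MPa]', 'coefvar')]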

    Caveats

    • The files in the following directories were tested before the protocol was established. Therefore, only the true stress-strain is available for each:
      • A500
      • A992_Gr50
      • BCP325
      • BCR295
      • HYP400
      • S460NL
      • S690QL/25mm
      • S355J2_Plates/S355J2_N_25mm and S355J2_N_50mm
  9. Dataset with four years of condition monitoring technical language...

    • researchdata.se
    Updated Jun 17, 2025
    + more versions
    Cite
    Karl Löwenmark; Fredrik Sandin; Marcus Liwicki; Stephan Schnabel (2025). Dataset with four years of condition monitoring technical language annotations from paper machine industries in northern Sweden [Dataset]. http://doi.org/10.5878/hafd-ms27
    Explore at:
    Available download formats: (74859)
    Dataset updated
    Jun 17, 2025
    Dataset provided by
    Luleå University of Technology
    Authors
    Karl Löwenmark; Fredrik Sandin; Marcus Liwicki; Stephan Schnabel
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    2018 - 2022
    Area covered
    Sweden
    Description

    This dataset consists of four years of technical language annotations from two paper machines in northern Sweden, structured as a Pandas dataframe. The same data is also available as a semicolon-separated .csv file. The data consists of two columns, where the first column corresponds to annotation note contents, and the second column corresponds to annotation titles. The annotations are in Swedish, and processed so that all mentions of personal information are replaced with the string ‘egennamn’, meaning “personal name” in Swedish. Each row corresponds to one annotation with the corresponding title.

    Data can be accessed in Python with:

    import pandas as pd
    annotations_df = pd.read_pickle("Technical_Language_Annotations.pkl")
    annotation_contents = annotations_df['noteComment']
    annotation_titles = annotations_df['title']

  10. SELTO Dataset

    • data.niaid.nih.gov
    Updated May 23, 2023
    Cite
    Dittmer, Sören; Erzmann, David; Harms, Henrik; Falck, Rielson; Gosch, Marco (2023). SELTO Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7034898
    Explore at:
    Dataset updated
    May 23, 2023
    Dataset provided by
    ArianeGroup GmbH
    University of Bremen, University of Cambridge
    University of Bremen
    Authors
    Dittmer, Sören; Erzmann, David; Harms, Henrik; Falck, Rielson; Gosch, Marco
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A Benchmark Dataset for Deep Learning for 3D Topology Optimization

    This dataset represents voxelized 3D topology optimization problems and solutions. The solutions have been generated in cooperation with the Ariane Group and Synera using the Altair OptiStruct implementation of SIMP within the Synera software. The SELTO dataset consists of four different 3D datasets for topology optimization, called disc simple, disc complex, sphere simple and sphere complex. Each of these datasets is further split into a training and a validation subset.

    The following paper provides full documentation and examples:

    Dittmer, S., Erzmann, D., Harms, H., Maass, P., SELTO: Sample-Efficient Learned Topology Optimization (2022) https://arxiv.org/abs/2209.05098.

    The Python library DL4TO (https://github.com/dl4to/dl4to) can be used to download and access all SELTO dataset subsets. Each TAR.GZ file container consists of multiple enumerated pairs of CSV files. Each pair describes a unique topology optimization problem and contains an associated ground truth solution. Each problem-solution pair consists of two files, where one contains voxel-wise information and the other file contains scalar information. For example, the i-th sample is stored in the files i.csv and i_info.csv, where i.csv contains all voxel-wise information and i_info.csv contains all scalar information. We define all spatially varying quantities at the center of the voxels, rather than on the vertices or surfaces. This allows for a shape-consistent tensor representation.

    For the i-th sample, the columns of i_info.csv correspond to the following scalar information:

    E - Young's modulus [Pa]

    ν - Poisson's ratio [-]

    σ_ys - a yield stress [Pa]

    h - discretization size of the voxel grid [m]

    The columns of i.csv correspond to the following voxel-wise information:

    x, y, z - the indices that state the location of the voxel within the voxel mesh

    Ω_design - design space information for each voxel. This is a ternary variable that indicates the type of density constraint on the voxel. 0 and 1 indicate that the density is fixed at 0 or 1, respectively. -1 indicates the absence of constraints, i.e., the density in that voxel can be freely optimized

    Ω_dirichlet_x, Ω_dirichlet_y, Ω_dirichlet_z - homogeneous Dirichlet boundary conditions for each voxel. These are binary variables that define whether the voxel is subject to homogeneous Dirichlet boundary constraints in the respective dimension

    F_x, F_y, F_z - floating point variables that define the three spacial components of external forces applied to each voxel. All forces are body forces given in [N/m^3]

    density - defines the binary voxel-wise density of the ground truth solution to the topology optimization problem

    How to Import the Dataset

    with DL4TO: With the Python library DL4TO (https://github.com/dl4to/dl4to) it is straightforward to download and access the dataset as a customized PyTorch torch.utils.data.Dataset object. As shown in the tutorial this can be done via:

    from dl4to.datasets import SELTODataset

    dataset = SELTODataset(root=root, name=name, train=train)

    Here, root is the path where the dataset should be saved. name is the name of the SELTO subset and can be one of "disc_simple", "disc_complex", "sphere_simple" and "sphere_complex". train is a boolean that indicates whether the corresponding training or validation subset should be loaded. See here for further documentation on the SELTODataset class.

    without DL4TO: After downloading and unzipping, any of the i.csv files can be manually imported into Python as a Pandas dataframe object:

    import pandas as pd

    root = ...
    file_path = f'{root}/{i}.csv'
    columns = ['x', 'y', 'z', 'Ω_design', 'Ω_dirichlet_x', 'Ω_dirichlet_y', 'Ω_dirichlet_z', 'F_x', 'F_y', 'F_z', 'density']
    df = pd.read_csv(file_path, names=columns)

    Similarly, we can import a i_info.csv file via:

    file_path = f'{root}/{i}_info.csv'
    info_column_names = ['E', 'ν', 'σ_ys', 'h']
    df_info = pd.read_csv(file_path, names=info_column_names)

    We can extract PyTorch tensors from the Pandas dataframe df using the following function:

    import torch

    def get_torch_tensors_from_dataframe(df, dtype=torch.float32):
        shape = df[['x', 'y', 'z']].iloc[-1].values.astype(int) + 1
        voxels = [df['x'].values, df['y'].values, df['z'].values]

        Ω_design = torch.zeros(1, *shape, dtype=int)
        Ω_design[:, voxels[0], voxels[1], voxels[2]] = torch.from_numpy(df['Ω_design'].values.astype(int))

        Ω_Dirichlet = torch.zeros(3, *shape, dtype=dtype)
        Ω_Dirichlet[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_x'].values, dtype=dtype)
        Ω_Dirichlet[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_y'].values, dtype=dtype)
        Ω_Dirichlet[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_z'].values, dtype=dtype)

        F = torch.zeros(3, *shape, dtype=dtype)
        F[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_x'].values, dtype=dtype)
        F[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_y'].values, dtype=dtype)
        F[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_z'].values, dtype=dtype)

        density = torch.zeros(1, *shape, dtype=dtype)
        density[:, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['density'].values, dtype=dtype)

        return Ω_design, Ω_Dirichlet, F, density
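
    A short usage note for the function above, continuing from the df loaded from i.csv:

    Ω_design, Ω_Dirichlet, F, density = get_torch_tensors_from_dataframe(df)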
    
  11. OpenForecast results in 2020-2021

    • data.niaid.nih.gov
    Updated Dec 24, 2021
    Cite
    Ayzel, Georgy (2021). OpenForecast results in 2020-2021 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5801140
    Explore at:
    Dataset updated
    Dec 24, 2021
    Dataset provided by
    State Hydrological Institute, 199004 Saint Petersburg, Russia
    Authors
    Ayzel, Georgy
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    OpenForecast is the first national-scale operational runoff forecasting system in Russia. The presented data supports a research article on a long-term assessment of OpenForecast performance in 2020-2021.

    File listing:

    calibration_vs_hindcast.npy -- Python dictionary that provides results of efficiency assessment for calibration and evaluation (hindcast) periods in terms of NSE and KGE metrics for individual hydrological models (GR4J_NSE, GR4J_KGE, HBV_NSE, HBV_KGE).

    hindcast_vs_forecast.npy -- Python dictionary that provides results of efficiency assessment for hindcast, pre-operational hindcast, and forecast periods in terms of NSE and KGE metrics for individual hydrological models (GR4J_NSE, GR4J_KGE, HBV_NSE, HBV_KGE), as well as their ensemble mean (ENS).

    meteo_forecast.npy -- Python dictionary that reports correlation coefficients between ICON and ERA5 reanalysis for air temperature and precipitation forecasts.

    users.csv -- daily numbers of OpenForecast users.

    Sample code for data access:

    import numpy as np
    import pandas as pd

    calibration_hindcast = np.load("calibration_vs_hindcast.npy", allow_pickle=True).item()

    hindcast_forecast = np.load("hindcast_vs_forecast.npy", allow_pickle=True).item()

    meteo_forecasts = np.load("meteo_forecast.npy", allow_pickle=True).item()

    users = pd.read_csv("users.csv", index_col=0, parse_dates=True, dayfirst=True)

    # pandas dataframe for the GR4J_KGE model efficiency in terms of NSE for calibration and hindcast periods
    calibration_hindcast["GR4J_KGE"]["NSE"]

    # pandas dataframe for the ensemble mean efficiency in terms of NSE for hindcast and forecast periods
    hindcast_forecast["ENS"]["NSE"]

    # pandas dataframe for correlation coefficients between ICON and ERA5 for precipitation forecasts
    meteo_forecasts["P"]["Correlation"]

    # available keys of the Python dictionaries can be checked as follows
    calibration_hindcast.keys()

  12. Analysis of references in the IPCC AR6 WG2 Report of 2022

    • data.niaid.nih.gov
    Updated Mar 11, 2022
    + more versions
    Cite
    Cameron Neylon; Bianca Kramer (2022). Analysis of references in the IPCC AR6 WG2 Report of 2022 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6327206
    Explore at:
    Dataset updated
    Mar 11, 2022
    Dataset provided by
    Centre for Culture and Technology, Curtin University
    Utrecht University
    Authors
    Cameron Neylon; Bianca Kramer
    License

    https://creativecommons.org/licenses/publicdomain/

    Description

    This repository contains data on 17,419 DOIs cited in the IPCC Working Group 2 contribution to the Sixth Assessment Report, and the code to link them to the dataset built at the Curtin Open Knowledge Initiative (COKI).

    References were extracted from the report's PDFs (downloaded 2022-03-01) via Scholarcy and exported as RIS and BibTeX files. DOI strings were identified from the RIS files by pattern matching and saved as a CSV file. The list of DOIs for each chapter and cross-chapter paper was processed using a custom Python script to generate a pandas DataFrame, which was saved as a CSV file and uploaded to Google BigQuery.
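
    A hedged sketch of the DOI extraction step described above; the regular expression, RIS file name, and Python code here are illustrative assumptions rather than the repository's actual preprocessing.R / process.py code:

    import re
    import pandas as pd

    # Match DOI strings such as 10.xxxx/... anywhere in the exported references
    doi_pattern = re.compile(r"10\.\d{4,9}/[^\s\"<>]+", re.IGNORECASE)

    dois = set()
    with open("chapter01.ris", encoding="utf-8") as fh:  # hypothetical RIS export
        for line in fh:
            dois.update(doi_pattern.findall(line))

    pd.DataFrame({"doi": sorted(dois)}).to_csv("IPCC_AR6_WGII_dois.csv", index=False)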

    We used the main object table of the Academic Observatory, which combines information from Crossref, Unpaywall, Microsoft Academic, Open Citations, the Research Organization Registry and Geonames to enrich the DOIs with bibliographic information, affiliations, and open access status. A custom query was used to join and format the data and the resulting table was visualised in a Google DataStudio dashboard.

    This version of the repository also includes the set of DOIs from references in the IPCC Working Group 1 contribution to the Sixth Assessment Report as extracted by Alexis-Michel Mugabushaka and shared on Zenodo: https://doi.org/10.5281/zenodo.5475442 (CC-BY)

    A brief descriptive analysis was provided as a blogpost on the COKI website.

    The repository contains the following content:

    Data:

    data/scholarcy/RIS/ - extracted references as RIS files

    data/scholarcy/BibTeX/ - extracted references as BibTeX files

    IPCC_AR6_WGII_dois.csv - list of DOIs

    data/10.5281_zenodo.5475442/ - references from IPCC AR6 WG1 report

    Processing:

    preprocessing.R - preprocessing steps for identifying and cleaning DOIs

    process.py - Python script for transforming data and linking to COKI data through Google Big Query

    Outcomes:

    Dataset on BigQuery - requires a google account for access and bigquery account for querying

    Data Studio Dashboard - interactive analysis of the generated data

    Zotero library of references extracted via Scholarcy

    PDF version of blogpost

    Note on licenses: Data are made available under CC0 (with the exception of the WG1 reference data, which have been shared under CC-BY 4.0). Code is made available under Apache License 2.0.

  13. Data for Gradient boosted decision trees reveal nuances of auditory...

    • rdr.ucl.ac.uk
    txt
    Updated Mar 22, 2024
    Cite
    Carla Griffiths; Jennifer Bizley; Jules Lebert; Joseph Sollini (2024). Data for Gradient boosted decision trees reveal nuances of auditory discrimination behavior [Dataset]. http://doi.org/10.5522/04/25386565.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Mar 22, 2024
    Dataset provided by
    University College London
    Authors
    Carla Griffiths; Jennifer Bizley; Jules Lebert; Joseph Sollini
    License

    CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Raw data for the article: Gradient boosted decision trees reveal nuances of auditory discrimination behaviour (PLOS Computational Biology). This data repository contains the csv files after extraction of the raw MATLAB metadata files into pandas (Python) dataframes (helper function author: Jules Lebert). The csv files can easily be loaded back into dataframe objects using pandas before the subsampling steps (as documented in the paper, we used subsampling to ensure the number of F0-roved and control F0 trials were relatively equal) are completed. Link to GitHub repository to run the models on this data: https://github.com/carlacodes/boostmodels. A full description of each of the variables within the dataframe can be found in the data_description_instructions_for_datasets_plos_bio.pdf.

    Abstract: Animal psychophysics can generate rich behavioral datasets, often comprised of many 1000s of trials for an individual subject. Gradient-boosted models are a promising machine learning approach for analyzing such data, partly due to the tools that allow users to gain insight into how the model makes predictions. We trained ferrets to report a target word’s presence, timing, and lateralization within a stream of consecutively presented non-target words. To assess the animals’ ability to generalize across pitch, we manipulated the fundamental frequency (F0) of the speech stimuli across trials, and to assess the contribution of pitch to streaming, we roved the F0 from word token-to-token. We then implemented gradient-boosted regression and decision trees on the trial outcome and reaction time data to understand the behavioral factors behind the ferrets’ decision-making. We visualized model contributions by implementing SHAPs feature importance and partial dependency plots. While ferrets could accurately perform the task across all pitch-shifted conditions, our models reveal subtle effects of shifting F0 on performance, with within-trial pitch shifting elevating false alarms and extending reaction times. Our models identified a subset of non-target words that animals commonly false alarmed to. Follow-up analysis demonstrated that the spectrotemporal similarity of target and non-target words rather than similarity in duration or amplitude waveform was the strongest predictor of the likelihood of false alarming. Finally, we compared the results with those obtained with traditional mixed effects models, revealing equivalent or better performance for the gradient-boosted models over these approaches.

  14. Myocardial motion dataset (processed data)

    • figshare.com
    txt
    Updated Jun 25, 2018
    Cite
    Magnus Krogh (2018). Myocardial motion dataset (processed data) [Dataset]. http://doi.org/10.6084/m9.figshare.6631400.v1
    Explore at:
    txtAvailable download formats
    Dataset updated
    Jun 25, 2018
    Dataset provided by
    figshare
    Figsharehttp://figshare.com/
    Authors
    Magnus Krogh
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    • processed_data.pkl: Processed myocardial motion recordings organized into Python data structures. Pkl files can be loaded in Python using the pickle package (see the sketch after this list).
    • analysed_data.pkl: Measures extracted from the processed data, organised into a pandas DataFrame and saved in pickle format.
    • analysed_data_R.csv: Same data as analysed_data.pkl, but exported as .csv for statistical analysis in R.
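    A minimal sketch of loading the two pickle files in Python (file names as listed above; the exact contents of the loaded objects are described by the dataset author):

    ```python
    import pickle

    import pandas as pd

    # analysed_data.pkl stores a pandas DataFrame pickled to disk, so pandas
    # can read it back directly.
    analysed = pd.read_pickle("analysed_data.pkl")
    print(analysed.head())

    # processed_data.pkl holds generic Python data structures; load it with pickle.
    with open("processed_data.pkl", "rb") as f:
        processed = pickle.load(f)
    ```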

  15. NYC Jobs Dataset (Filtered Columns)

    • kaggle.com
    zip
    Updated Oct 5, 2022
    Cite
    Jeffery Mandrake (2022). NYC Jobs Dataset (Filtered Columns) [Dataset]. https://www.kaggle.com/datasets/jefferymandrake/nyc-jobs-filtered-cols
    Explore at:
    zip(93408 bytes)Available download formats
    Dataset updated
    Oct 5, 2022
    Authors
    Jeffery Mandrake
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Area covered
    New York
    Description

    Use this dataset with Misra's Pandas tutorial: How to use the Pandas GroupBy function | Pandas tutorial

    The original dataset came from this site: https://data.cityofnewyork.us/City-Government/NYC-Jobs/kpav-sd4t/data

    I used Google Colab to filter the columns with the following Pandas commands. Here's a Colab Notebook you can use with the commands listed below: https://colab.research.google.com/drive/17Jpgeytc075CpqDnbQvVMfh9j-f4jM5l?usp=sharing

    Once the csv file is uploaded to Google Colab, use these commands to process the file.

    ```python
    import pandas as pd

    # Load the file and create a pandas DataFrame
    df = pd.read_csv('/content/NYC_Jobs.csv')

    # Keep only these columns
    df = df[['Job ID', 'Civil Service Title', 'Agency', 'Posting Type',
             'Job Category', 'Salary Range From', 'Salary Range To']]

    # Save the csv file without the index column
    df.to_csv('/content/NYC_Jobs_filtered_cols.csv', index=False)
    ```
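    As a hedged sketch of the kind of GroupBy operation the linked tutorial covers (the aggregation choice here is illustrative and not taken from the tutorial):

    ```python
    import pandas as pd

    df = pd.read_csv('NYC_Jobs_filtered_cols.csv')

    # Illustrative GroupBy: average advertised starting salary per agency.
    # Coerce the salary column to numeric in case it was read as text.
    df['Salary Range From'] = pd.to_numeric(df['Salary Range From'], errors='coerce')
    avg_start = (
        df.groupby('Agency')['Salary Range From']
        .mean()
        .sort_values(ascending=False)
    )
    print(avg_start.head())
    ```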

  16. ARtracks - a Global Atmospheric River Catalogue Based on ERA5 and IPART

    • data-staging.niaid.nih.gov
    • zenodo.org
    Updated May 2, 2024
    Cite
    Traxl, Dominik (2024). ARtracks - a Global Atmospheric River Catalogue Based on ERA5 and IPART [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_7018724
    Explore at:
    Dataset updated
    May 2, 2024
    Dataset provided by
    Potsdam Institute for Climate Impact Research (PIK)
    Authors
    Traxl, Dominik
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The ARtracks Atmospheric River Catalogue is based on the ERA5 climate reanalysis dataset, specifically the output parameters "vertical integral of east-/northward water vapour flux". Most of the processing relies on IPART (Image-Processing based Atmospheric River (AR) Tracking, https://github.com/ihesp/IPART), a Python package for automated AR detection, axis finding and AR tracking. The catalogue is provided as a pickled pandas.DataFrame as well as a CSV file.
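    A minimal sketch of opening the catalogue with pandas (the file names below are placeholders for the actual release files):

    ```python
    import pandas as pd

    # The catalogue ships both as a pickled DataFrame and as CSV; replace the
    # placeholder names with the file names from the release you downloaded.
    tracks = pd.read_pickle("artracks.pkl")
    # tracks = pd.read_csv("artracks.csv")  # equivalent CSV route
    print(tracks.head())
    ```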

    For detailed information, please see https://github.com/dominiktraxl/artracks.

    The ARtracks catalogue covers the years from 1979 to the end of the year 2019.

  17. Data from: LifeSnaps: a 4-month multi-modal dataset capturing unobtrusive...

    • data.europa.eu
    • zenodo.org
    unknown
    Updated Jul 12, 2022
    Cite
    Zenodo (2022). LifeSnaps: a 4-month multi-modal dataset capturing unobtrusive snapshots of our lives in the wild [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-6832242?locale=fr
    Explore at:
    unknown(642961582)Available download formats
    Dataset updated
    Jul 12, 2022
    Dataset authored and provided by
    Zenodohttp://zenodo.org/
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    LifeSnaps Dataset Documentation

    Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet, limited data exist on the association between in-the-wild large-scale physical activity patterns, sleep, stress, and overall health, and behavioral patterns and psychological measurements, due to challenges in collecting and releasing such datasets, such as waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset, a multi-modal, longitudinal, and geographically-distributed dataset, containing a plethora of anthropological data, collected unobtrusively for the total course of more than 4 months by n=71 participants, under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types from second to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to make these data available openly to empower future research. We envision that releasing this large-scale dataset of multi-modal real-world data will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction.

    The following instructions will get you started with the LifeSnaps dataset and are complementary to the original publication.

    Data Import: Reading CSV. For ease of use, we provide CSV files containing Fitbit, SEMA, and survey data at daily and/or hourly granularity. You can read the files via any programming language. For example, in Python, you can read the files into a Pandas DataFrame with the pandas.read_csv() command (a short sketch is given after this description).

    Data Import: Setting up a MongoDB (Recommended). To take full advantage of the LifeSnaps dataset, we recommend that you use the raw, complete data by importing the LifeSnaps MongoDB database. To do so, open the terminal/command prompt and run the following command for each collection in the DB. Ensure you have MongoDB Database Tools installed from here. For the Fitbit data, run the following:

    mongorestore --host localhost:27017 -d rais_anonymized -c fitbit
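    A minimal sketch of the CSV route mentioned above (the file name is a placeholder for one of the daily or hourly CSV files in the release):

    ```python
    import pandas as pd

    # Read one of the released CSV files into a DataFrame.
    # "daily_fitbit.csv" is a placeholder; use an actual file from the dataset.
    daily = pd.read_csv("daily_fitbit.csv")
    print(daily.shape)
    print(daily.dtypes.head(10))
    ```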

  18. Data for "Topological grain boundary segregation transitions"

    • zenodo.org
    zip
    Updated Oct 25, 2024
    Cite
    Vivek Devulapalli; Chen Enze; Tobias Brink; Frolov Timofey; Liebscher Christian H (2024). Data for "Topological grain boundary segregation transitions" [Dataset]. http://doi.org/10.5281/zenodo.13903314
    Explore at:
    zipAvailable download formats
    Dataset updated
    Oct 25, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Vivek Devulapalli; Chen Enze; Tobias Brink; Frolov Timofey; Liebscher Christian H
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Cite as: Vivek Devulapalli et al., Topological grain boundary segregation transitions. Science 386, 420-424 (2024). DOI: 10.1126/science.adq4147

    This repository contains the raw data from STEM imaging, EDS, and EELS experiments, and the code used for GB simulations and theoretical calculations presented in the paper.

    =========================================================

    MDMC-SGC directory contains the MD/MC simulation in the semi-grand-canonical
    ensemble (Fig. 4 of the paper).


    Fe-Ti-phase-diagram
    ===================

    First, the bulk concentration of Fe in Ti is calculated as a function
    of the chemical potential difference Δµ between Fe and Ti. This is
    required to calculate the grain boundary excess over the bulk.

    Here, it turns out that the bulk concentration is approximately zero
    in the range of Δµ investigated.


    MD/MC simulations of grain boundaries
    =====================================

    The following sample names map to the naming in the paper:

    * ABC: Ti ground state structure
    * large-1cage-2300000: isolated cage
    * larger-2cages-3200000: double cage
    * large-02-10000220: one layer of cages
    * large-01-10000367: second layer of cages forming

    Each directory contains subdirectories for all investigated Δµ. The
    subdirectory `final-states` contains the final snapshots for each Δµ.

    The script `prepare.py` was used to set up the simulations (template
    for the LAMMPS input file is `lmp.in.template`). The script
    `collect.py` was used to extract the thermodynamic excess properties
    of the grain boundaries, stored in the file `T_0300K.excess.dat` in
    each subdirectory.

    The notebook `plot-excess.ipynb` can be used to plot the excess data.
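    For a quick look at the excess data outside the notebook, a minimal sketch is given below. It assumes the `.dat` files are whitespace-delimited text tables, which is not confirmed here; check the repository documentation for the actual column layout.

    ```python
    import pandas as pd

    # Assumption: whitespace-delimited columns, with '#' marking comment lines.
    excess = pd.read_csv("T_0300K.excess.dat", sep=r"\s+", comment="#")
    print(excess.head())
    ```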

    =========================================================

    # GRand canonical Interface Predictor (GRIP)

    _Authors: [Enze Chen](https://enze-chen.github.io/) (Stanford University) and
    [Timofey Frolov](https://people.llnl.gov/frolov2) (Lawrence Livermore National Laboratory)_
    _Version: 0.1.2024.01.21_

    An algorithm for performing grand canonical optimization (GCO) of interfacial
    structure (e.g., grain boundaries) in crystalline materials.
    It automates sampling of slab translations and reconstructions
    along with vacancy generation and finite temperature molecular dynamics (MD).
    The algorithm repeatedly samples different structures in two phases:
    1. Structure generation and manipulation is largely handled using the
    [Atomic Simulation Environment (ASE)](https://wiki.fysik.dtu.dk/ase/).
    2. Molecular dynamics and static relaxations are currently performed using
    [LAMMPS](https://www.lammps.org), although in principle other energy
    evaluation methods (e.g., density functional theory in [VASP](https://www.vasp.at))
    may be used.

    ------

    ## Dependencies
    - [Python](https://www.python.org/) (3.6+)
    - [NumPy](https://numpy.org/) (1.23.0)
    - [ASE](https://wiki.fysik.dtu.dk/ase/) (3.22.1)
    - [LAMMPS](https://www.lammps.org) (stable)

    _Optional_
    - [pandas](https://pandas.pydata.org/) (1.5.3)
    - [Matplotlib](https://matplotlib.org/stable/index.html) (3.5.3)


    ## Usage

    Assuming the above libraries are installed, clone the repo and make the
    appropriate modifications in `params.yaml` (see file for detailed comments),
    including the path to the LAMMPS binary on your system.
    If you wish, you can supply your own slabs for the bicrystal configuration as
    POSCAR_LOWER and POSCAR_UPPER (in the [POSCAR](https://www.vasp.at/wiki/index.php/POSCAR)
    file format).
    Then call:
    ```python
    python main.py
    ```
    If you don't have LAMMPS or just want to test the script, you can run it with the `-d` flag.
    See the `.examples` folder for a SLURM submission script for parallel execution (preferred).


    ## File structure
    - `main.py`: Script to launch everything.
    - `params.yaml`: Simulation parameters; **you'll want to edit this.**
    - `core`: Main classes (`Bicrystal`, `Simulation`, etc.)
    - `utility`: Main helper functions (`utils.py`, `unique.py`, etc.)
    - `simul_files`: Files for simulations (LAMMPS input files, etc.)
    - `best`: All relaxed structures are stored here. The naming convention is:
    `lammps_Egb_n_X-SHIFT_Y-SHIFT_X-REPS_Y-REPS_TEMP_STEPS`


    Duplicate files are periodically deleted by calling `clear_best()` in `utils/unique.py`.
    The default method cleans about 1-3% of files on average.
    Use the `-e` flag for more aggressive cleaning (>50%).
    Use the `-s` flag to save the processed results to CSV from a pandas DataFrame.

    Results can be visualized by running `utils/plot_gco.py` and it generates a GCO plot
    of $E_{\mathrm{gb}}$ vs. $n$.
    The `.examples` folder has this plot for several boundaries.
    By default executing this file will save both the results (CSV) and the figure (PNG)
    to the same folder as the GRIP output files.


    ## Citation
    If you use GRIP in your work, we would appreciate a citation to the original manuscript:

    > Enze Chen, Tae Wook Heo, Brandon C. Wood, Mark Asta, and Timofey Frolov.
    "Grand canonically optimized grain boundary phases in hexagonal close-packed titanium."
    _arXiv:XXXX.YYYYY [cond-mat.mtrl-sci]_, 2024.

    or in BibTeX format:

    ```
    @article{chen_2024_grip,
    author = {Chen, Enze and Heo, Tae Wook and Wood, Brandon C. and Asta, Mark and Frolov, Timofey},
    title = {Grand canonically optimized grain boundary phases in hexagonal close-packed titanium},
    year = {2024},
    journal = {arXiv:XXXX.YYYYY [cond-mat.mtrl-sci]},
    doi = {10.48550/arXiv.XXXX.YYYYY},
    }
    ```

    =========================================================

  19. Dataset of Leak Simulations in Experimental Testbed Water Distribution...

    • data.mendeley.com
    Updated Dec 12, 2022
    + more versions
    Cite
    Mohsen Aghashahi (2022). Dataset of Leak Simulations in Experimental Testbed Water Distribution System [Dataset]. http://doi.org/10.17632/tbrnp6vrnj.1
    Explore at:
    Dataset updated
    Dec 12, 2022
    Authors
    Mohsen Aghashahi
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the first fully labeled open dataset for leak detection and localization in water distribution systems. The dataset includes two hundred and eighty signals acquired from a laboratory-scale water distribution testbed with four types of induced leaks and no-leak. The testbed was 47 m long and built from 152.4 mm diameter PVC pipes. Two accelerometers (A1 and A2), two hydrophones (H1 and H2), and two dynamic pressure sensors (P1 and P2) were deployed to measure acceleration, acoustic, and dynamic pressure data.

    The data were recorded through controlled experiments in which the following were varied: network architecture, leak type, background flow condition, background noise condition, and sensor types and locations. Each signal was recorded for 30 seconds. Network architectures were looped (LO) and branched (BR). Leak types were Longitudinal Crack (LC), Circumferential Crack (CC), Gasket Leak (GL), Orifice Leak (OL), and No-leak (NL). Background flow conditions included 0 L/s (ND), 0.18 L/s, 0.47 L/s, and Transient (the background flow rate abruptly changed from 0.47 L/s to 0 L/s at second 20 of the 30-second measurements). Background noise conditions, with noise (N) and without noise (NN), determined whether background noise was present during acoustic data measurements.

    Accelerometer and dynamic pressure data are in ‘.csv’ format, and the hydrophone data are in ‘.raw’ format with an 8000 Hz sampling rate. The file “Python code to convert raw acoustic data to pandas DataFrame.py” converts the raw hydrophone data to a DataFrame in Python.
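    As a quick illustration, the .csv accelerometer and pressure recordings can be loaded directly with pandas; the file name below is hypothetical, and for the .raw hydrophone files you should use the converter script shipped with the dataset:

    ```python
    import pandas as pd

    # Load one 30-second accelerometer or pressure recording.
    # The file name is a placeholder; use an actual file from the dataset.
    signal = pd.read_csv("LO_LC_018LPS_N_A1.csv")
    print(signal.head())
    print(len(signal))
    ```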

  20. Gemma-Python Training Dataset

    • kaggle.com
    zip
    Updated Mar 17, 2024
    Cite
    David Mairs (2024). Gemma-Python Training Dataset [Dataset]. https://www.kaggle.com/datasets/dmcstllc/gemma-python-training-dataset
    Explore at:
    zip(102676250 bytes)Available download formats
    Dataset updated
    Mar 17, 2024
    Authors
    David Mairs
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Instruction/Response format of Python code questions and answers. Oasst files are split into 3,000 lines each to prevent OOM errors when loading. The format of the files can easily be changed with a simple Python script:

    ```python
    import json

    import pandas as pd

    input_file_path = 'output_file.jsonl'
    output_file_path = 'output_file2.csv'

    # Prepare an empty list to hold the processed records
    processed_records = []

    # Open the .jsonl file and process each line
    with open(input_file_path, 'r') as file:
        for line in file:
            # Parse the JSON object from the current line
            record = json.loads(line)

            # Rename, filter the desired keys, and replace newline characters
            processed_record = {
                "prompt": record.get("INSTRUCTION", "").replace('\n', ' ').strip(),
                "response": record.get("RESPONSE", "").replace('\n', ' ').strip()
            }

            # Add the processed record to the list
            processed_records.append(processed_record)

    # Convert the list of processed records to a DataFrame
    df = pd.DataFrame(processed_records)

    # Write the DataFrame to a .csv file (quoting=2 quotes all fields)
    df.to_csv(output_file_path, index=False, quoting=2)

    print(f"Conversion complete. The output is saved to '{output_file_path}'")
    ```
