CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
2,121,458 records
I used Google Colab to check out this dataset and pull the column names using Pandas.
Sample code example (reading a gzip-compressed CSV file into a Pandas DataFrame with Python): https://pastexy.com/106/python-pandas-read-csv-file-compressed-with-gzip-and-load-into-pandas-dataframe
Columns: ['Date received', 'Product', 'Sub-product', 'Issue', 'Sub-issue', 'Consumer complaint narrative', 'Company public response', 'Company', 'State', 'ZIP code', 'Tags', 'Consumer consent provided?', 'Submitted via', 'Date sent to company', 'Company response to consumer', 'Timely response?', 'Consumer disputed?', 'Complaint ID']
I did not modify the dataset.
Use it to practice with dataframes - Pandas or PySpark on Google Colab:
!unzip complaints.csv.zip
import pandas as pd
df = pd.read_csv('complaints.csv')
df.columns
df.head() etc.
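If you prefer PySpark on Colab, here is a minimal sketch (assuming pyspark is installed, e.g. via !pip install pyspark, and the same unzipped complaints.csv):

from pyspark.sql import SparkSession

# Start a local Spark session
spark = SparkSession.builder.appName("complaints").getOrCreate()

# Read the CSV with a header row and let Spark infer column types
sdf = spark.read.csv("complaints.csv", header=True, inferSchema=True)
sdf.printSchema()
sdf.show(5)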
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials
Background
This dataset contains data from monotonic and cyclic loading experiments on structural metallic materials. The materials are primarily structural steels; one iron-based shape memory alloy is also included. Summary files provide an overview of the database, and data from the individual experiments are included as well.
The files included in the database are outlined below and the format of the files is briefly described. Additional information regarding the formatting can be found through the post-processing library (https://github.com/ahartloper/rlmtp/tree/master/protocols).
Usage
Included Files
File Format: Downsampled Data
These are the "LP_
These data files can be easily loaded using the pandas library in Python through:
import pandas
data = pandas.read_csv(data_file, index_col=0)
The data is formatted so it can be used directly in RESSPyLab (https://github.com/AlbanoCastroSousa/RESSPyLab). Note that the column names "e_true" and "Sigma_true" were kept for backwards compatibility reasons with RESSPyLab.
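For a quick visual check of a loaded file, a minimal sketch (assuming matplotlib is available, and using the e_true and Sigma_true column names noted above):

import pandas
import matplotlib.pyplot as plt

# data_file is a placeholder for one of the downsampled CSV files, as in the snippet above
data = pandas.read_csv(data_file, index_col=0)
plt.plot(data['e_true'], data['Sigma_true'])  # true strain vs. true stress
plt.xlabel('True strain')
plt.ylabel('True stress')
plt.show()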
File Format: Unreduced Data
These are the "LP_
The data can be loaded and used similarly to the downsampled data.
File Format: Overall_Summary
The overall summary file provides data on all the test specimens in the database. The columns include:
File Format: Summarized_Mechanical_Props_Campaign
Meant to be loaded in Python as a pandas DataFrame with multi-indexing, e.g.,
tab1 = pd.read_csv('Summarized_Mechanical_Props_Campaign_' + date + version + '.csv',
index_col=[0, 1, 2, 3], skipinitialspace=True, header=[0, 1],
keep_default_na=False, na_values='')
Caveats
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset containing measurements of Linux kernel binary size after compilation. The reported size, in the column "perf", is the size in bytes of the vmlinux file. It also contains a column "active_options" reporting the number of activated options (set to "y"). All other columns, listed in the file "Linux_options.json", are Linux kernel options. The sampling was performed using randconfig. The version of Linux used is 4.13.3.
Not all available options are present. First, the dataset only contains options for the x86, 64-bit version. Then, all non-tristate options have been ignored. Finally, options that do not take more than one value across the whole dataset, due to insufficient variability in the sampling, are ignored. All options are encoded as 0 for the "n" and "m" option values, and 1 for "y".
In Python, importing the dataset with pandas will assign all columns the int64 dtype, which leads to very high memory consumption (~50 GB). The following snippet imports it using less than 1 GB of memory by setting the option columns to int8.
import pandas as pd
import json
import numpy

# Load the list of option columns so they can be read as int8 instead of int64
with open("Linux_options.json", "r") as f:
    linux_options = json.load(f)

df = pd.read_csv("Linux.csv", dtype={opt: numpy.int8 for opt in linux_options})
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Project Description:
Title: Pandas Data Manipulation and File Conversion
Overview: This project aims to demonstrate the basic functionalities of Pandas, a powerful data manipulation library in Python. In this project, we will create a DataFrame, perform some data manipulation operations using Pandas, and then convert the DataFrame into both Excel and CSV formats.
Key Objectives:
Tools and Libraries Used:
Project Implementation:
DataFrame Creation:
Data Manipulation:
File Conversion:
Convert the DataFrame to Excel format with the to_excel() function, and to CSV format with the to_csv() function.
Expected Outcome:
Upon completion of this project, you will have gained a fundamental understanding of how to work with Pandas DataFrames, perform basic data manipulation tasks, and convert DataFrames into different file formats. This knowledge will be valuable for data analysis, preprocessing, and data export tasks in various data science and analytics projects.
Conclusion:
The Pandas library offers powerful tools for data manipulation and file conversion in Python. By completing this project, you will have acquired essential skills that are widely applicable in the field of data science and analytics. You can further extend this project by exploring more advanced Pandas functionalities or integrating it into larger data processing pipelines. In this project, we add a number of records, build a DataFrame from them, save the data to a single Excel file with different sheet names, and then convert that Excel file to a CSV file.
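As a rough sketch of the workflow described above (file and sheet names are placeholders; writing Excel files requires an engine such as openpyxl):

import pandas as pd

# Create two small DataFrames
df1 = pd.DataFrame({'name': ['Alice', 'Bob'], 'score': [85, 92]})
df2 = pd.DataFrame({'city': ['Paris', 'Rome'], 'visits': [3, 5]})

# Save both DataFrames to one Excel file as separate sheets
with pd.ExcelWriter('output.xlsx') as writer:
    df1.to_excel(writer, sheet_name='scores', index=False)
    df2.to_excel(writer, sheet_name='visits', index=False)

# Read the sheets back and convert each one to its own CSV file
sheets = pd.read_excel('output.xlsx', sheet_name=None)
for name, sheet_df in sheets.items():
    sheet_df.to_csv(f'{name}.csv', index=False)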
Custom license: https://entrepot.recherche.data.gouv.fr/api/datasets/:persistentId/versions/1.1/customlicense?persistentId=doi:10.57745/VCALE0
Dataset Description
This dataset is associated with the publication titled "A Distraction Knee-Brace and a Robotic Testbed for Tibiofemoral Load Reduction during Squatting" in IEEE Transactions on Medical Robotics and Bionics. It provides comprehensive data supporting the development and evaluation of a knee distraction brace designed to reduce tibiofemoral contact forces during flexion.
Contents
Cam Profiles: STL files of the initial cam profiles designed based on averaged tibiofemoral contact force data collected from 5 squats of a patient with an instrumented prosthesis (K7L) from the CAMS Knee dataset (accessible via https://orthoload.com/). Optimized cam profiles, corrected based on experimental results, are also included. These profiles enable patient-specific adjustments to account for the non-linear evolution of tibiofemoral contact forces with flexion angles.
Experimental Results: CSV files containing raw results from robotic testbed experiments, testing the knee brace under various initial pneumatic pressures in the actuators. Data is provided for tests conducted without the brace, with the initial cam profiles, and with the optimized cam profiles. Each CSV file corresponds to a specific test condition, detailing forces and kinematics observed during squatting.
3D Models of Bones and Testbed Components: Geometries of the femur head and tibial plateau used in the robotic testbed experiments, provided in STEP, STL, and SLDPRT/SLDASM formats. A README file describes the biomechanical coordinate systems used for force and kinematic control of the robotic testbed, and for result interpretation and visualization.
How to Open and Read the Provided Files
The dataset includes files in CSV, SLDPRT, SLDASM, STL, and IGES formats. Below are recommended software solutions, with a preference for open-source options:
CSV (Comma-Separated Values): can be opened with Microsoft Excel, Google Sheets, or open-source software like LibreOffice Calc or Python (using pandas).
SLDPRT & SLDASM (SolidWorks Parts and Assemblies): these files are native to SolidWorks. For viewing without SolidWorks, use eDrawings Viewer (free) or FreeCAD (limited compatibility).
STL (3D Model Format): can be opened with MeshLab, FreeCAD, or Blender. Most 3D printing software (like Cura or PrusaSlicer) also supports STL.
IGES (3D CAD Exchange Format): can be read with FreeCAD, Fusion 360 (free for personal use), or OpenCascade-based software like CAD Assistant.
For full compatibility, commercial software like SolidWorks or CATIA may be required for SLDPRT and SLDASM files. However, FreeCAD and other open-source tools provide partial support. See the associated publication and the README files included in the dataset for more information.
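For the CSV files, a minimal pandas sketch (the file name below is a placeholder for one of the test-condition files):

import pandas as pd

results = pd.read_csv('experimental_results_example.csv')  # placeholder file name
print(results.columns)
print(results.head())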
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LifeSnaps Dataset Documentation
Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet, limited data exist on the association between in the wild large-scale physical activity patterns, sleep, stress, and overall health, and behavioral patterns and psychological measurements due to challenges in collecting and releasing such datasets, such as waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset, a multi-modal, longitudinal, and geographically-distributed dataset, containing a plethora of anthropological data, collected unobtrusively for the total course of more than 4 months by n=71 participants, under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types from second to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to make these data available openly to empower future research. We envision that releasing this large-scale dataset of multi-modal real-world data, will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction.
The following instructions will get you started with the LifeSnaps dataset and are complementary to the original publication.
Data Import: Reading CSV
For ease of use, we provide CSV files containing Fitbit, SEMA, and survey data at daily and/or hourly granularity. You can read the files via any programming language. For example, in Python, you can read the files into a Pandas DataFrame with the pandas.read_csv() command.
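For example, a minimal sketch (the file name below is a placeholder for one of the provided daily or hourly CSV files):

import pandas as pd

# Placeholder file name; substitute one of the provided daily/hourly CSV files
df = pd.read_csv('lifesnaps_daily.csv')
print(df.head())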
Data Import: Setting up a MongoDB (Recommended)
To take full advantage of the LifeSnaps dataset, we recommend that you use the raw, complete data via importing the LifeSnaps MongoDB database.
To do so, open the terminal/command prompt and run the following command for each collection in the DB. Ensure you have the MongoDB Database Tools installed.
For the Fitbit data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c fitbit
For the SEMA data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c sema
For surveys data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c surveys
If you have access control enabled, then you will need to add the --username and --password parameters to the above commands.
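Once restored, the collections can also be queried directly from Python, e.g. with pymongo (a minimal sketch, assuming the default local MongoDB instance and the database and collection names used above):

from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client['rais_anonymized']

# Look at the size and one example document of each collection
for name in ['fitbit', 'sema', 'surveys']:
    collection = db[name]
    print(name, collection.count_documents({}))
    print(collection.find_one())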
Data Availability
The MongoDB database contains three collections, fitbit, sema, and surveys, containing the Fitbit, SEMA3, and survey data, respectively. Similarly, the CSV files contain related information to these collections. Each document in any collection follows the format shown below:
{
_id:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains aggregated and sub-metered power consumption data from a two-person apartment in Germany. Data was collected from March 5 to September 4, 2025, spanning 6 months. It includes an aggregate reading from a main smart meter and individual readings from 40 smart plugs, smart relays, and smart power meters monitoring various appliances.
The dataset can be downloaded here: https://doi.org/10.5281/zenodo.17159850
Because it contains long off periods with zeros, the CSV file compresses well.
To extract it, use: xz -d DARCK.csv.xz
The compression leads to a 97% smaller file size (from 4 GB to 90.9 MB).
To use the dataset in Python, you can, e.g., load the CSV file into a pandas DataFrame:
import pandas as pd
df = pd.read_csv("DARCK.csv", parse_dates=["time"])
The main meter was monitored using an infrared reading head magnetically attached to the infrared interface of the meter. An ESP8266 flashed with Tasmota decodes the binary datagrams and forwards the Watt readings to the MQTT broker. Individual appliances were monitored using a combination of Shelly Plugs (for outlets), Shelly 1PM (for wired-in devices like ceiling lights), and Shelly PM Mini (for each of the three phases of the oven). All devices reported to a central InfluxDB database via Home Assistant running in docker on a Dell OptiPlex 3020M.
File Format (DARCK.csv): The dataset is provided as a single comma-separated values (CSV) file.
| Column Name | Data Type | Unit | Description |
|:--|:--|:--|:--|
| time | datetime | - | Timestamp for the reading in YYYY-MM-DD HH:MM:SS |
| main | float | Watt | Total aggregate power consumption for the apartment, measured at the main electrical panel. |
| [appliance_name] | float | Watt | Power consumption of an individual appliance (e.g., lightbathroom, fridge, sherlockpc). See Section 8 for a full list. |
| Aggregate Columns | | | |
| aggr_chargers | float | Watt | The sum of sherlockcharger, sherlocklaptop, watsoncharger, watsonlaptop, watsonipadcharger, kitchencharger. |
| aggr_stoveplates | float | Watt | The sum of stoveplatel1 and stoveplatel2. |
| aggr_lights | float | Watt | The sum of lightbathroom, lighthallway, lightsherlock, lightkitchen, lightlivingroom, lightwatson, lightstoreroom, fcob, sherlockalarmclocklight, sherlockfloorlamphue, sherlockledstrip, livingfloorlamphue, sherlockglobe, watsonfloorlamp, watsondesklamp and watsonledmap. |
| Analysis Columns | | | |
| inaccuracy | float | Watt | As no electrical device bypasses a power meter, the true inaccuracy can be assessed. It is the absolute error between the sum of individual measurements and the mains reading. A 30 W offset is applied to the sum since the measurement devices themselves draw power which is otherwise unaccounted for. |
The final dataset was generated from two raw data sources (meter.csv and shellies.csv) using a comprehensive postprocessing pipeline.
Aggregate (main) Postprocessing: The aggregate power data required several cleaning steps to ensure accuracy.
Shelly (shellies) Postprocessing: The Shelly devices are not prone to the same burst issue as the ESP8266. They push a new reading at every change in power drawn. If no power change is observed, or the observed change is too small (less than a few Watt), the reading is pushed once a minute, together with a heartbeat. When a device turns on or off, intermediate power values are published, which leads to sub-second values that need to be handled.
Readings were resampled to a regular 1-second time index using .resample('1s').last().ffill(). NaN values (e.g., from before a device was installed) were filled with 0.0, assuming zero consumption. During analysis, two significant unmetered load events were identified and manually corrected to improve the accuracy of the aggregate reading. The error column (inaccuracy) was recalculated after these corrections.
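A minimal sketch of this kind of resampling step in pandas (toy data only; this is not the exact postprocessing code used for the dataset):

import pandas as pd

# Illustrative irregular readings for one device
idx = pd.to_datetime(['2025-03-05 00:00:00.2', '2025-03-05 00:00:03.7', '2025-03-05 00:00:04.1'])
raw = pd.Series([12.0, 55.0, 3.0], index=idx, name='fridge')

# Resample to a regular 1-second grid, keep the last reading per second,
# forward-fill gaps, and treat remaining missing values as zero consumption
regular = raw.resample('1s').last().ffill().fillna(0.0)
print(regular)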
The following table lists the column names with an explanation where needed. As Watson moved at the beginning of June, some metering plugs changed their appliance.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
📦 Ecommerce Dataset (Products & Sizes Included)
🛍️ Essential Data for Building an Ecommerce Website & Analyzing Online Shopping Trends 📌 Overview This dataset contains 1,000+ ecommerce products, including detailed information on pricing, ratings, product specifications, seller details, and more. It is designed to help data scientists, developers, and analysts build product recommendation systems, price prediction models, and sentiment analysis tools.
🔹 Dataset Features
| Column Name | Description |
|:--|:--|
| product_id | Unique identifier for the product |
| title | Product name/title |
| product_description | Detailed product description |
| rating | Average customer rating (0-5) |
| ratings_count | Number of ratings received |
| initial_price | Original product price |
| discount | Discount percentage (%) |
| final_price | Discounted price |
| currency | Currency of the price (e.g., USD, INR) |
| images | URL(s) of product images |
| delivery_options | Available delivery methods (e.g., standard, express) |
| product_details | Additional product attributes |
| breadcrumbs | Category path (e.g., Electronics > Smartphones) |
| product_specifications | Technical specifications of the product |
| amount_of_stars | Distribution of star ratings (1-5 stars) |
| what_customers_said | Customer reviews (sentiments) |
| seller_name | Name of the product seller |
| sizes | Available sizes (for clothing, shoes, etc.) |
| videos | Product video links (if available) |
| seller_information | Seller details, such as location and rating |
| variations | Different variants of the product (e.g., color, size) |
| best_offer | Best available deal for the product |
| more_offers | Other available deals/offers |
| category | Product category |
📊 Potential Use Cases
📌 Build an Ecommerce Website: Use this dataset to design a functional online store with product listings, filtering, and sorting.
🔍 Price Prediction Models: Predict product prices based on features like ratings, category, and discount.
🎯 Recommendation Systems: Suggest products based on user preferences, rating trends, and customer feedback.
🗣 Sentiment Analysis: Analyze what_customers_said to understand customer satisfaction and product popularity.
📈 Market & Competitor Analysis: Track pricing trends, popular categories, and seller performance.
🔍 Why Use This Dataset?
✅ Rich Feature Set: Includes all necessary ecommerce attributes.
✅ Realistic Pricing & Rating Data: Useful for price analysis and recommendations.
✅ Multi-Purpose: Suitable for machine learning, web development, and data visualization.
✅ Structured Format: Easy-to-use CSV format for quick integration.
📂 Dataset Format
CSV file (ecommerce_dataset.csv)
1000+ samples
Multi-category coverage
🔗 How to Use?
Download the dataset from Kaggle.
Load it in Python using Pandas:
import pandas as pd
df = pd.read_csv("ecommerce_dataset.csv")
df.head()
Explore trends & patterns using visualization tools (Seaborn, Matplotlib).
Build models & applications based on the dataset!
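For the exploration step, a small sketch of the kind of analysis possible (column names as in the feature table above; it assumes rating, discount, and final_price are numeric):

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("ecommerce_dataset.csv")

# Distribution of average customer ratings
sns.histplot(df["rating"], bins=20)
plt.title("Distribution of product ratings")
plt.show()

# Relationship between discount and final price
sns.scatterplot(data=df, x="discount", y="final_price", alpha=0.5)
plt.title("Discount vs. final price")
plt.show()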
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
# Baghdad VANET BSM-Based Attack Dataset (F2MD Scenarios) — Raw CSVs (File-level by Attack ID)
## Overview
This package groups the original raw CSV files by attack type at the file level. No content changes were made to the CSVs—files are copied as-is into family folders.
## Dataset Summary
Total files: 17
Total records (all files): 35830975
Attack records (label=1): 2359250
Benign records (label=0): 33471725
Other/unknown labels: 0
## Attack Families
- ConstPos — Frozen constant position
- ConstPosOffset — Constant offset to coordinates
- RandomPos — Random fake positions
- RandomPosOffset — Random offset to the true position
- ConstSpeedOffset — Constant speed bias
- RandomSpeed — Random implausible speeds
- EventualStop — Gradual or sudden stop spoofing
- Disruptive — Protocol fields/values deliberately disruptive
- DataReplay — Replay of past data
- StaleMessages — Old or delayed messages
- DoS — High-rate flooding
- DoSRandom — Randomly fluctuating flooding
- DoSDisruptive — Intermittent aggressive flooding
- GridSybil — Coordinated fake identities (Sybil)
- DoSRandomSybil — Random DoS with Sybil identities
- DoSDisruptiveSybil — Aggressive DoS with Sybil identities
- Unknown — Files not mapped to a specific family
## Per-family Totals
| Family | Files | Records | Attack (label=1) | Benign (label=0) | Other |
|:--|--:|--:|--:|--:|--:|
| ConstPos | 1 | 1327217 | 63504 | 1263713 | 0 |
| ConstPosOffset | 1 | 1305206 | 62749 | 1242457 | 0 |
| ConstSpeedOffset | 1 | 2150356 | 102707 | 2047649 | 0 |
| DataReplay | 1 | 922481 | 43916 | 878565 | 0 |
| Disruptive | 1 | 1063416 | 52038 | 1011378 | 0 |
| DoS | 1 | 1241649 | 167490 | 1074159 | 0 |
| DoSDisruptive | 1 | 705817 | 97649 | 608168 | 0 |
| DoSDisruptiveSybil | 1 | 2113005 | 24365 | 2088640 | 0 |
| DoSRandom | 1 | 6440583 | 867983 | 5572600 | 0 |
| DoSRandomSybil | 1 | 2499578 | 30382 | 2469196 | 0 |
| EventualStop | 1 | 2124546 | 101617 | 2022929 | 0 |
| GridSybil | 1 | 622012 | 108728 | 513284 | 0 |
| RandomPos | 1 | 1087145 | 53253 | 1033892 | 0 |
| RandomPosOffset | 1 | 3258131 | 158686 | 3099445 | 0 |
| RandomSpeed | 1 | 3676823 | 176305 | 3500518 | 0 |
| StaleMessages | 1 | 770026 | 36645 | 733381 | 0 |
| Unknown | 1 | 4522984 | 211233 | 4311751 | 0 |
## Per-file Details
| Family | File | Records | Attack (1) | Benign (0) | Other | Label column present |
|:--|:--|--:|--:|--:|--:|:--:|
| ConstPos | 1.csv | 1327217 | 63504 | 1263713 | 0 | yes |
| ConstPosOffset | 2.csv | 1305206 | 62749 | 1242457 | 0 | yes |
| ConstSpeedOffset | 6.csv | 2150356 | 102707 | 2047649 | 0 | yes |
| DataReplay | 11.csv | 922481 | 43916 | 878565 | 0 | yes |
| Disruptive | 10.csv | 1063416 | 52038 | 1011378 | 0 | yes |
| DoS | 13.csv | 1241649 | 167490 | 1074159 | 0 | yes |
| DoSDisruptive | 15.csv | 705817 | 97649 | 608168 | 0 | yes |
| DoSDisruptiveSybil | 19.csv | 2113005 | 24365 | 2088640 | 0 | yes |
| DoSRandom | 14.csv | 6440583 | 867983 | 5572600 | 0 | yes |
| DoSRandomSybil | 18.csv | 2499578 | 30382 | 2469196 | 0 | yes |
| EventualStop | 9.csv | 2124546 | 101617 | 2022929 | 0 | yes |
| GridSybil | 16.csv | 622012 | 108728 | 513284 | 0 | yes |
| RandomPos | 3.csv | 1087145 | 53253 | 1033892 | 0 | yes |
| RandomPosOffset | 4.csv | 3258131 | 158686 | 3099445 | 0 | yes |
| RandomSpeed | 7.csv | 3676823 | 176305 | 3500518 | 0 | yes |
| StaleMessages | 12.csv | 770026 | 36645 | 733381 | 0 | yes |
| Unknown | 5.csv | 4522984 | 211233 | 4311751 | 0 | yes |
## How to Load (Python)
Use pandas to read any CSV under data/, for example:
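A minimal sketch (the exact label column name is an assumption and should be checked against the CSV header):

import pandas as pd

df = pd.read_csv("data/1.csv")  # ConstPos family, per the tables above
print(df.shape)
print(df.columns)

# If the label column is literally named "label", the class balance can be checked with:
# print(df["label"].value_counts())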
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
3D skeletons UP-Fall Dataset
Difference between Fall and Impact Detection
Overview
This dataset aims to facilitate research in fall detection, particularly focusing on the precise detection of impact moments within fall events. The accuracy and comprehensiveness of the 3D skeleton data make it a valuable resource for developing and benchmarking fall detection algorithms. The dataset contains 3D skeletal data extracted from fall events and daily activities of 5 subjects performing fall scenarios.
Data Collection
The skeletal data was extracted using a pose estimation algorithm, which processes image frames to determine the 3D coordinates of each joint. Sequences with fewer than 100 frames of extracted data were excluded to ensure the quality and reliability of the dataset. As a result, some subjects may have fewer CSV files.
CSV Structure
The data is organized by subjects, and each subject contains CSV files named according to the pattern C1S1A1T1, where:
C: Camera (1 or 2)
S: Subject (1 to 5)
A: Activity (1 to N, representing different activities)
T: Trial (1 to 3)
subject1/: Contains CSV files for Subject 1.
C1S1A1T1.csv: Data from Camera 1, Activity 1, Trial 1 for Subject 1
C1S1A2T1.csv: Data from Camera 1, Activity 2, Trial 1 for Subject 1
C1S1A3T1.csv: Data from Camera 1, Activity 3, Trial 1 for Subject 1
C2S1A1T1.csv: Data from Camera 2, Activity 1, Trial 1 for Subject 1
C2S1A2T1.csv: Data from Camera 2, Activity 2, Trial 1 for Subject 1
C2S1A3T1.csv: Data from Camera 2, Activity 3, Trial 1 for Subject 1
subject2/: Contains CSV files for Subject 2.
C1S2A1T1.csv: Data from Camera 1, Activity 1, Trial 1 for Subject 2
C1S2A2T1.csv: Data from Camera 1, Activity 2, Trial 1 for Subject 2
C1S2A3T1.csv: Data from Camera 1, Activity 3, Trial 1 for Subject 2
C2S2A1T1.csv: Data from Camera 2, Activity 1, Trial 1 for Subject 2
C2S2A2T1.csv: Data from Camera 2, Activity 2, Trial 1 for Subject 2
C2S2A3T1.csv: Data from Camera 2, Activity 3, Trial 1 for Subject 2
subject3/, subject4/, subject5/: Similar structure as above, but may contain fewer CSV files due to the data extraction criteria mentioned above.
Column Descriptions
Each CSV file contains the following columns representing different skeletal joints and their respective coordinates in 3D space:
| Column Name | Description |
|:--|:--|
| joint_1_x | X coordinate of joint 1 |
| joint_1_y | Y coordinate of joint 1 |
| joint_1_z | Z coordinate of joint 1 |
| joint_2_x | X coordinate of joint 2 |
| joint_2_y | Y coordinate of joint 2 |
| joint_2_z | Z coordinate of joint 2 |
| ... | ... |
| joint_n_x | X coordinate of joint n |
| joint_n_y | Y coordinate of joint n |
| joint_n_z | Z coordinate of joint n |
| LABEL | Label indicating impact (1) or non-impact (0) |
Example
Here is an example of what a row in one of the CSV files might look like:
| joint_1_x | joint_1_y | joint_1_z | joint_2_x | joint_2_y | joint_2_z | ... | joint_n_x | joint_n_y | joint_n_z | LABEL |
|--:|--:|--:|--:|--:|--:|:--|--:|--:|--:|--:|
| 0.123 | 0.456 | 0.789 | 0.234 | 0.567 | 0.890 | ... | 0.345 | 0.678 | 0.901 | 0 |
Usage
This data can be used for developing and benchmarking impact fall detection algorithms. It provides detailed information on human posture and movement during falls, making it suitable for machine learning and deep learning applications in impact fall detection and prevention.
Using GitHub
Clone the repository:
git clone https://github.com/Tresor-Koffi/3D_skeletons-UP-Fall-Dataset
Navigate to the directory:
cd 3D_skeletons-UP-Fall-Dataset
Examples
Here's a simple example of how to load and inspect a sample data file using Python:
import pandas as pd
data = pd.read_csv('subject1/C1S1A1T1.csv')
print(data.head())
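Building on that, a short sketch for checking the class balance of the LABEL column (impact vs. non-impact frames):

import pandas as pd

data = pd.read_csv('subject1/C1S1A1T1.csv')

# Count impact (1) vs. non-impact (0) frames and their ratio
print(data['LABEL'].value_counts())
print('Impact ratio:', data['LABEL'].mean())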
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Use this dataset with Misra's Pandas tutorial: How to use the Pandas GroupBy function | Pandas tutorial
The original dataset came from this site: https://data.cityofnewyork.us/City-Government/NYC-Jobs/kpav-sd4t/data
I used Google Colab to filter the columns with the following Pandas commands. Here's a Colab Notebook you can use with the commands listed below: https://colab.research.google.com/drive/17Jpgeytc075CpqDnbQvVMfh9j-f4jM5l?usp=sharing
Once the csv file is uploaded to Google Colab, use these commands to process the file.
import pandas as pd

# load the file and create a pandas dataframe
df = pd.read_csv('/content/NYC_Jobs.csv')

# keep only these columns
df = df[['Job ID', 'Civil Service Title', 'Agency', 'Posting Type', 'Job Category', 'Salary Range From', 'Salary Range To']]

# save the csv file without the index column
df.to_csv('/content/NYC_Jobs_filtered_cols.csv', index=False)
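From there, a small GroupBy sketch in the spirit of the tutorial (using the columns kept above and assuming the salary columns are numeric; the aggregation choice is just an example):

import pandas as pd

df = pd.read_csv('/content/NYC_Jobs_filtered_cols.csv')

# Average posted salary range per agency, highest first
salary_by_agency = df.groupby('Agency')[['Salary Range From', 'Salary Range To']].mean()
print(salary_by_agency.sort_values('Salary Range To', ascending=False).head(10))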
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
To compare baseball player statistics effectively using visualization, we can create some insightful plots. Below are the steps to accomplish this in Python using libraries like Pandas and Matplotlib or Seaborn.
First, we need to load the judge.csv file into a DataFrame. This will allow us to manipulate and analyze the data easily.
Before creating visualizations, it’s good to understand the data structure and identify the columns we want to compare. The relevant columns in your data include pitch_type, release_speed, game_date, and events.
We can create various visualizations, such as: - A bar chart to compare the average release speed of different pitch types. - A line plot to visualize trends over time based on game dates. - A scatter plot to analyze the relationship between release speed and the outcome of the pitches (e.g., strikeouts, home runs).
Here is a sample code to demonstrate how to create these visualizations using Matplotlib and Seaborn:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load the data
df = pd.read_csv('judge.csv')
# Display the first few rows of the dataframe
print(df.head())
# Set the style of seaborn
sns.set(style="whitegrid")
# 1. Average Release Speed by Pitch Type
plt.figure(figsize=(12, 6))
avg_speed = df.groupby('pitch_type')['release_speed'].mean().sort_values()
sns.barplot(x=avg_speed.values, y=avg_speed.index, palette="viridis")
plt.title('Average Release Speed by Pitch Type')
plt.xlabel('Average Release Speed (mph)')
plt.ylabel('Pitch Type')
plt.show()
# 2. Trends in Release Speed Over Time
# First, convert the 'game_date' to datetime
df['game_date'] = pd.to_datetime(df['game_date'])
plt.figure(figsize=(14, 7))
sns.lineplot(data=df, x='game_date', y='release_speed', estimator='mean', ci=None)
plt.title('Trends in Release Speed Over Time')
plt.xlabel('Game Date')
plt.ylabel('Average Release Speed (mph)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
# 3. Scatter Plot of Release Speed vs. Events
plt.figure(figsize=(12, 6))
sns.scatterplot(data=df, x='release_speed', y='events', hue='pitch_type', alpha=0.7)
plt.title('Release Speed vs. Events')
plt.xlabel('Release Speed (mph)')
plt.ylabel('Event Type')
plt.legend(title='Pitch Type', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()
These visualizations will help you compare player statistics in a meaningful way. You can customize the plots further based on your specific needs, such as filtering data for specific players or seasons. If you have any specific comparisons in mind or additional data to visualize, let me know!
Overview
This repository contains ready-to-use frequency time series as well as the corresponding pre-processing scripts in python. The data covers three synchronous areas of the European power grid:
This work is part of the paper "Predictability of Power Grid Frequency"[1]. Please cite this paper, when using the data and the code. For a detailed documentation of the pre-processing procedure we refer to the supplementary material of the paper.
Data sources
We downloaded the frequency recordings from publicly available repositories of three different Transmission System Operators (TSOs).
Content of the repository
A) Scripts
The python scripts run with Python 3.7 and with the packages found in "requirements.txt".
B) Data_converted and Data_cleansed
The folder "Data_converted" contains the output of "convert_data_format.py" and "Data_cleansed" contains the output of "clean_corrupted_data.py".
Use cases
We point out that this repository can be used in two different ways:
from helper_functions import *
import numpy as np
import pandas as pd

cleansed_data = pd.read_csv('/Path_to_cleansed_data/data.zip',
                            index_col=0, header=None, squeeze=True,
                            parse_dates=[0])
# true_intervals comes from helper_functions; it returns the bounds and sizes
# of the contiguous intervals where the condition is True
valid_bounds, valid_sizes = true_intervals(~cleansed_data.isnull())
start, end = valid_bounds[np.argmax(valid_sizes)]
data_without_nan = cleansed_data.iloc[start:end]
License
We release the code in the folder "Scripts" under the MIT license [8]. In the case of Nationalgrid and Fingrid, we further release the pre-processed data in the folder "Data_converted" and "Data_cleansed" under the CC-BY 4.0 license [7]. TransnetBW originally did not publish their data under an open license. We have explicitly received the permission to publish the pre-processed version from TransnetBW. However, we cannot publish our pre-processed version under an open license due to the missing license of the original TransnetBW data.
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
This dataset provides grayscale pixel values for brain tumor MRI images, stored in a CSV format for simplified access and ease of use. The goal is to create a "MNIST-like" dataset for brain tumors, where each row in the CSV file represents the pixel values of a single image in its original resolution. This format makes it convenient for researchers and developers to quickly load and analyze MRI data for brain tumor detection, classification, and segmentation tasks without needing to handle large image files directly.
Brain tumor classification and segmentation are critical tasks in medical imaging, and datasets like these are valuable for developing and testing machine learning and deep learning models. While there are several publicly available brain tumor image datasets, they often consist of large image files that can be challenging to process. This CSV-based dataset addresses that by providing a compact and accessible format. Potential use cases include: - Tumor Classification: Identifying different types of brain tumors, such as glioma, meningioma, and pituitary tumors, or distinguishing between tumor and non-tumor images. - Tumor Segmentation: Applying pixel-level classification and segmentation techniques for tumor boundary detection. - Educational and Rapid Prototyping: Ideal for educational purposes or quick experimentation without requiring large image processing capabilities.
This dataset is structured as a single CSV file where each row represents an image, and each column represents a grayscale pixel value. The pixel values are stored as integers ranging from 0 (black) to 255 (white).
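A minimal loading sketch (the file name is a placeholder; reshaping a row back into a 2-D image requires knowing that image's original width and height, and if the CSV includes a label column it should be excluded before scaling):

import pandas as pd

# Placeholder file name for the provided CSV
pixels = pd.read_csv('brain_tumor_pixels.csv')

# Scale grayscale values from 0-255 to 0-1
pixels_normalized = pixels / 255.0
print(pixels_normalized.shape)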
This dataset is intended for research and educational purposes only. Users are encouraged to cite and credit the original data sources if using this dataset in any publications or projects. This is a derived CSV version aimed to simplify access and usability for machine learning and data science applications.
This resource contains a draft Jupyter Notebook that has example code snippets showing how to access HydroShare resource files using HydroShare S3 buckets. The user_account.py is a utility to read user hydroshare cached account information in any of the JupyterHub instances that HydroShare has access to. The example notebook uses this utility so that you don't have to enter your hydroshare account information in order to access hydroshare buckets.
Here are the 3 notebooks in this resource:
The above notebook has examples showing how to upload/download resource files from the resource bucket. It also contains examples of how to list the files and folders of a resource in a bucket.
The above notebook has examples of reading a raster and a shapefile from a bucket using GDAL, without the need to download the file from the bucket to local disk.
The above notebook has examples of using h5netcdf and xarray for reading a netCDF file directly from a bucket. It also contains examples of using rioxarray to read a raster file, and pandas to read a CSV file from HydroShare buckets.
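For the CSV case, one generic pattern is pandas with s3fs (a sketch only; the bucket path, endpoint, and credentials below are placeholders, and the real values come from the notebooks and the user_account.py utility described above):

import pandas as pd

# Hypothetical bucket path and credentials
df = pd.read_csv(
    "s3://example-bucket/example-resource/data/contents/data.csv",
    storage_options={
        "key": "ACCESS_KEY",
        "secret": "SECRET_KEY",
        "client_kwargs": {"endpoint_url": "https://example-s3-endpoint"},
    },
)
print(df.head())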
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset consists of four years of technical language annotations from two paper machines in northern Sweden, structured as a Pandas dataframe. The same data is also available as a semicolon-separated .csv file. The data consists of two columns, where the first column corresponds to annotation note contents, and the second column corresponds to annotation titles. The annotations are in Swedish, and processed so that all mentions of personal information are replaced with the string ‘egennamn’, meaning “personal name” in Swedish. Each row corresponds to one annotation with the corresponding title.
Data can be accessed in Python with:
import pandas as pd
annotations_df = pd.read_pickle("Technical_Language_Annotations.pkl")
annotation_contents = annotations_df['noteComment']
annotation_titles = annotations_df['title']
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Raw data for the article: Gradient boosted decision trees reveal nuances of auditory discrimination behaviour (PLOS Computational Biology). This data repository contains the csv files after extraction of the raw MATLAB metadata files into pandas (Python) dataframes (helper function author: Jules Lebert). The csv files can easily be loaded back into dataframe objects using pandas before the subsampling steps (as documented in the paper, we used subsampling to ensure the number of F0-roved and control F0 trials were relatively equal) are completed.
Link to GitHub repository to run the models on this data: https://github.com/carlacodes/boostmodels
A full description of each of the variables within the dataframe can be found in the data_description_instructions_for_datasets_plos_bio.pdf.
Abstract: Animal psychophysics can generate rich behavioral datasets, often comprised of many 1000s of trials for an individual subject. Gradient-boosted models are a promising machine learning approach for analyzing such data, partly due to the tools that allow users to gain insight into how the model makes predictions. We trained ferrets to report a target word’s presence, timing, and lateralization within a stream of consecutively presented non-target words. To assess the animals’ ability to generalize across pitch, we manipulated the fundamental frequency (F0) of the speech stimuli across trials, and to assess the contribution of pitch to streaming, we roved the F0 from word token-to-token. We then implemented gradient-boosted regression and decision trees on the trial outcome and reaction time data to understand the behavioral factors behind the ferrets’ decision-making. We visualized model contributions by implementing SHAPs feature importance and partial dependency plots. While ferrets could accurately perform the task across all pitch-shifted conditions, our models reveal subtle effects of shifting F0 on performance, with within-trial pitch shifting elevating false alarms and extending reaction times. Our models identified a subset of non-target words that animals commonly false alarmed to. Follow-up analysis demonstrated that the spectrotemporal similarity of target and non-target words rather than similarity in duration or amplitude waveform was the strongest predictor of the likelihood of false alarming. Finally, we compared the results with those obtained with traditional mixed effects models, revealing equivalent or better performance for the gradient-boosted models over these approaches.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Water discharge and temperature of the springs monitored during the WABEsense project (UTF 642.21.20). The data correspond to all field measurements performed between Feb. 2021 and Dec. 2023. Data for each spring might not cover the whole period. For each spring there are two files: *.csv and *.meta. The .csv file contains the recorded data. The .meta file contains further information about the spring (e.g. location) and the data.
Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
Estimating the distributional impacts of energy subsidy removal and compensation schemes in Ecuador based on input-output and household data.
Import files:
- Dictionary Categories.csv, Dictionary ENI-IOT.csv, and Dictionary Subcategories.csv, based on [1]
- Dictionary IOT.csv and IOT_2012.csv (cannot be redistributed), based on [2]
- Dictionary Taxes.csv and Dictionary Transfers.csv, based on [3]
- ENIGHUR11_GASTOS_V.csv, ENIGHUR11_HOGARES_AGREGADOS.csv, and ENIGHUR11_PERSONAS_INGRESOS.csv, based on [4]
- Price increase scenarios.csv, based on [5]
Further basic files and documents:
[1] 4_M&D_Mapping ENIGHUR expenditures to IOT_180605.xlsm
[2] Input-output table 2012 (https://contenido.bce.fin.ec/documentos/PublicacionesNotas/Catalogo/CuentasNacionales/Anuales/Dolares/MIP2012Ampliada.xls). Save the sheet with the IOT 2012 (Matriz simétrica) as IOT_2012.csv and edit the format: first column and row: IOT labels.
[3] 4_M&D_ENIGHUR income_180606.xlsx
[4] ENIGHUR data can be retrieved from http://www.ecuadorencifras.gob.ec/encuesta-nacional-de-ingresos-y-gastos-de-los-hogares-urbanos-y-rurales/ Household datasets are only available in SPSS file format and the free software PSPP is used to convert .sav- to .csv-files, as this format can be read directly and efficiently into a Python Pandas DataFrame. See PSPP syntax below:
save translate
/outfile = filename
/type = CSV
/textoptions decimal = DOT
/textoptions delimiter = ';'
/fieldnames
/cells=values
/replace.
[5] 3_Ecuador_Energy subsidies and 4_M&D_Price scenarios_180610.xlsx
GNU General Public License v3.0: https://www.gnu.org/licenses/gpl-3.0.html
Data pulled from Traffy Fondue by accessing the Traffy Fondue Open API. Date range: January 2022 until January 2025.
The following code pulled the data:
import os
import json
import requests
from datetime import datetime, timedelta
import time
class TraffyDataFetcher:
    def __init__(self, start_date, subfolder='traffyfonduedata'):
self.url = "https://publicapi.traffy.in.th/share/teamchadchart/search"
self.query = {'offset': '0'}
self.payload = {}
self.headers = {}
self.start_date = datetime.strptime(start_date, '%Y-%m-%d')
self.end_date = datetime.now()
self.subfolder = subfolder
self.max_requests_per_minute = 99
if not os.path.exists(self.subfolder):
os.makedirs(self.subfolder)
def add_days_to_date(self, start_date_str, days_to_add):
start_date = datetime.strptime(start_date_str, '%Y-%m-%d')
new_date = start_date + timedelta(days=days_to_add)
return new_date.strftime('%Y-%m-%d')
def fetch_data(self):
current_date = self.start_date
index = 0
while current_date <= self.end_date:
start_time = datetime.now()
self.query['start'] = current_date.strftime('%Y-%m-%d')
new_date = self.add_days_to_date(self.query['start'], 10)
self.query['end'] = new_date
response = requests.request("GET", self.url, headers=self.headers, data=self.payload, params=self.query)
print(f"offset: {index} response: {response.status_code}")
filename = f"traffy_{current_date.strftime('%Y-%m-%d')}.json"
file_path = os.path.join(self.subfolder, filename)
with open(file_path, "w") as outfile:
json_object = json.dumps(response.json(), indent=4)
outfile.write(json_object)
end_time = datetime.now()
elapsed_time = (end_time - start_time).total_seconds()
print(f"Elapsed time: {elapsed_time} s")
index += 950
current_date = datetime.strptime(new_date, '%Y-%m-%d') + timedelta(days=1)
if index % self.max_requests_per_minute == 0:
time.sleep(60 - elapsed_time)
if __name__ == "__main__":
fetcher = TraffyDataFetcher(start_date='2022-01-01')
fetcher.fetch_data()
--
And the following code converted the json to CSV files
import os
import glob
import json
import pandas as pd
#import numpy as np
class TraffyJSONFixer:
    def __init__(self, path_to_json='*.json', subfolder='traffyfonduedata'):
self.path_to_json = path_to_json
self.subfolder = subfolder
self.outputfolder = 'fixedjson'
self.excelfolder = 'exceloutput'
self.file_path = os.path.join(self.subfolder, self.path_to_json)
self.json_files = glob.glob(self.file_path)
# Ensure the subfolder exists
if not os.path.exists(self.subfolder):
os.makedirs(self.subfolder)
# Ensure the outputfolder exists
if not os.path.exists(self.outputfolder):
os.makedirs(self.outputfolder)
# Ensure the excelfolder exists
if not os.path.exists(self.excelfolder):
os.makedirs(self.excelfolder)
# Debugging: Print the current working directory and the list of JSON files
print(f"Current working directory: {os.getcwd()}")
print(f"Found JSON files: {self.json_files}")
def fix_json_files(self):
for count, ele in enumerate(self.json_files):
new_file_name = os.path.join(self.outputfolder, f"data_{os.path.basename(ele)}")
try:
with open(ele, 'r', encoding='utf-8') as f:
data = json.load(f)
# Debugging: Print the type of data
print(f"Processing file: {ele}")
print(f"Type of data: {type(data)}")
# Handle different JSON structures
if isinstance(data, dict) and "results" in data:
results = data["results"]
elif isinstance(data, list):
results = data
else:
print(f"Unexpected JSON structure in file: {ele}")
continue
# Ensure results is a list or dict before writing
if isinstance(results, (list, dict)):
with open(new_file_name, 'w', encoding='utf-8') as f:
f.write(json.dumps(results, indent=4))
else:
print(f"Unexpected type for results in file: {ele}")
except (json.JSONDecodeError, KeyError) as e:
print(f"Error processing file {ele}: {e}")
def jsontoexcel(self):
jsonfile_path = os.path.join(self.out...