100+ datasets found

Merge number of excel file,convert into csv file
kaggle.com
zip
Updated Mar 30, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Aashirvad pandey (2024). Merge number of excel file,convert into csv file [Dataset]. https://www.kaggle.com/datasets/aashirvadpandey/merge-number-of-excel-fileconvert-into-csv-file
Explore at:
zip(6731 bytes)Available download formats
Dataset updated
Mar 30, 2024
Authors
Aashirvad pandey
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Project Description:

Title: Pandas Data Manipulation and File Conversion

Overview: This project aims to demonstrate the basic functionalities of Pandas, a powerful data manipulation library in Python. In this project, we will create a DataFrame, perform some data manipulation operations using Pandas, and then convert the DataFrame into both Excel and CSV formats.

Key Objectives:

DataFrame Creation: Utilize Pandas to create a DataFrame with sample data.

Data Manipulation: Perform basic data manipulation tasks such as adding columns, filtering data, and performing calculations.

File Conversion: Convert the DataFrame into Excel (.xlsx) and CSV (.csv) file formats.

Tools and Libraries Used:

Python

Pandas

Project Implementation:

DataFrame Creation:

Import the Pandas library.

Create a DataFrame using either a dictionary, a list of dictionaries, or by reading data from an external source like a CSV file.

Populate the DataFrame with sample data representing various data types (e.g., integer, float, string, datetime).

Data Manipulation:

Add new columns to the DataFrame representing derived data or computations based on existing columns.

Filter the DataFrame to include only specific rows based on certain conditions.

Perform basic calculations or transformations on the data, such as aggregation functions or arithmetic operations.

File Conversion:

Utilize Pandas to convert the DataFrame into an Excel (.xlsx) file using the to_excel() function.

Convert the DataFrame into a CSV (.csv) file using the to_csv() function.

Save the generated files to the local file system for further analysis or sharing.

Expected Outcome:

Upon completion of this project, you will have gained a fundamental understanding of how to work with Pandas DataFrames, perform basic data manipulation tasks, and convert DataFrames into different file formats. This knowledge will be valuable for data analysis, preprocessing, and data export tasks in various data science and analytics projects.

Conclusion:

The Pandas library offers powerful tools for data manipulation and file conversion in Python. By completing this project, you will have acquired essential skills that are widely applicable in the field of data science and analytics. You can further extend this project by exploring more advanced Pandas functionalities or integrating it into larger data processing pipelines.in this data we add number of data and make that data a data frame.and save in single excel file as different sheet name and then convert that excel file in csv file .
ENTSO-E Hydropower modelling data (PECD) in CSV format
zenodo.org
csv
Updated Aug 14, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Matteo De Felice; Matteo De Felice (2020). ENTSO-E Hydropower modelling data (PECD) in CSV format [Dataset]. http://doi.org/10.5281/zenodo.3949757
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.3949757
Dataset updated
Aug 14, 2020
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Matteo De Felice; Matteo De Felice
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
PECD Hydro modelling

This repository contains a more user-friendly version of the Hydro modelling data released by ENTSO-E with their latest Seasonal Outlook.

The original URLs:

The zipped file: https://eepublicdownloads.blob.core.windows.net/public-cdn-container/clean-documents/sdc-documents/seasonal/SOR2020/data/Hydro.zip

The documentation file (v 1.0): https://eepublicdownloads.blob.core.windows.net/public-cdn-container/clean-documents/sdc-documents/MAF/2019/Hydropower_Modelling_New_database_and_methodology.pdf

The original ENTSO-E hydropower dataset integrates the PECD (Pan-European Climate Database) released for the MAF 2019

As I did for the wind & solar data, the datasets released in this repository are only a more user- and machine-readable version of the original Excel files. As avid user of ENTSO-E data, with this repository I want to share my data wrangling efforts to make this dataset more accessible.

Data description

The zipped file contains 86 Excel files, two different files for each ENTSO-E zone.

In this repository you can find 6 CSV files:

PECD-hydro-capacities.csv: installed capacities

PECD-hydro-weekly-inflows.csv: weekly inflows for reservoir and open-loop pumping

PECD-hydro-daily-ror-generation.csv: daily run-of-river generation

PECD-hydro-weekly-reservoir-min-max-generation.csv: minimum and maximum weekly reservoir generation

PECD-hydro-weekly-reservoir-min-max-levels.csv: weekly minimum and maximum reservoir levels

Capacities

The file PECD-hydro-capacities.csv contains: run of river capacity (MW) and storage capacity (GWh), reservoir plants capacity (MW) and storage capacity (GWh), closed-loop pumping/turbining (MW) and storage capacity and open-loop pumping/turbining (MW) and storage capacity. The data is extracted from the Excel files with the name starting with PEMM from the following sections:

sheet Run-of-River and pondage, rows from 5 to 7, columns from 2 to 5

sheet Reservoir, rows from 5 to 7, columns from 1 to 3

sheet Pump storage - Open Loop, rows from 5 to 7, columns from 1 to 3

sheet Pump storage - Closed Loop, rows from 5 to 7, columns from 1 to 3

Inflows

The file PECD-hydro-weekly-inflows.csv contains the weekly inflow (GWh) for the climatic years 1982-2017 for reservoir plants and open-loop pumping. The data is extracted from the Excel files with the name starting with PEMM from the following sections:

sheet Reservoir, rows from 13 to 66, columns from 16 to 51

sheet Pump storage - Open Loop, rows from 13 to 66, columns from 16 to 51

Daily run-of-river

The file PECD-hydro-daily-ror-generation.csv contains the daily run-of-river generation (GWh). The data is extracted from the Excel files with the name starting with PEMM from the following sections:

sheet Run-of-River and pondage, rows from 13 to 378, columns from 15 to 51

Miminum and maximum reservoir generation

The file PECD-hydro-weekly-reservoir-min-max-generation.csv contains the minimum and maximum generation (MW, weekly) for reservoir-based plants for the climatic years 1982-2017. The data is extracted from the Excel files with the name starting with PEMM from the following sections:

sheet Reservoir, rows from 13 to 66, columns from 196 to 231

sheet Reservoir, rows from 13 to 66, columns from 232 to 267

Minimum/Maximum reservoir levels

The file PECD-hydro-weekly-reservoir-min-max-levels.csv contains the minimum/maximum reservoir levels at beginning of each week (scaled coefficient from 0 to 1). The data is extracted from the Excel files with the name starting with PEMM from the following sections:

sheet Reservoir, rows from 14 to 66, column 12

sheet Reservoir, rows from 14 to 66, column 13

CHANGELOG

[2020/07/17] Added maximum generation for the reservoir
d
WBPHS segment counts and segment effort, 1955-Present
catalog.data.gov
Updated Nov 25, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Fish and Wildlife Service (2025). WBPHS segment counts and segment effort, 1955-Present [Dataset]. https://catalog.data.gov/dataset/wbphs-segment-counts-and-segment-effort-1955-present
Explore at:
Dataset updated
Nov 25, 2025
Dataset provided by
U.S. Fish and Wildlife Service
Description
The segment counts by social group and species or species group for the Waterfowl Breeding Population and Habitat Survey and associated segment effort information. Three data files are included with their associated metadata (html and xml formats). Segment counts are summed counts of waterfowl per segment and are separated into two files, described below, along with the effort table needed to analyze recent segment count information. wbphs_segment_counts_1955to1999_forDistribution.csv, which represents the period prior the collection of geolocated data. There is no associated effort file for these counts and segments with zero birds are included in the segment counts table, so effort can be inferred; there is no information to determine the proportion of each segment surveyed for this period and it must be presumed they were surveyed completely. Number of rows in table = 1,988,290. wbphs_segment_counts_forDistribution.csv, which contains positive segment records only, by species or species group beginning with 2000. wbphs_segment_effort_forDistribution.csv file is important for this segment counts file and can be used to infer zero value segments, by species or species group. Number of rows in table = 381,402. wbphs_segment_effort_forDistribution.csv. The segment survey effort and location from the Waterfowl Breeding Population and Habitat Survey beginning with 2000. If a segment was not flown, it is absent from the table for the corresponding year. Number of rows in table = 67,874. Also included here is a small R code file, createSingleSegmentCountTable.R, which can be run to format the 2000+ data to match the 1955-1999 format and combine the data over the two time periods. Please consult the metadata for an explanation of the fields and other information to understand the limitations of the data.
Z
KGCW 2024 Challenge @ ESWC 2024
data.niaid.nih.gov
investigacion.usc.gal
+3more
Updated Jun 11, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Van Assche, Dylan; Chaves-Fraga, David; Dimou, Anastasia; Serles, Umutcan; Iglesias, Ana (2024). KGCW 2024 Challenge @ ESWC 2024 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10721874
Explore at:
Dataset updated
Jun 11, 2024
Dataset provided by
Universidad Politécnica de Madrid
KU Leuven
Universidade de Santiago de Compostela
IDLab
STI Insbruck
Authors
Van Assche, Dylan; Chaves-Fraga, David; Dimou, Anastasia; Serles, Umutcan; Iglesias, Ana
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Knowledge Graph Construction Workshop 2024: challenge

Knowledge graph construction of heterogeneous data has seen a lot of uptakein the last decade from compliance to performance optimizations with respectto execution time. Besides execution time as a metric for comparing knowledgegraph construction, other metrics e.g. CPU or memory usage are not considered.This challenge aims at benchmarking systems to find which RDF graphconstruction system optimizes for metrics e.g. execution time, CPU,memory usage, or a combination of these metrics.

Task description

The task is to reduce and report the execution time and computing resources(CPU and memory usage) for the parameters listed in this challenge, comparedto the state-of-the-art of the existing tools and the baseline results providedby this challenge. This challenge is not limited to execution times to createthe fastest pipeline, but also computing resources to achieve the most efficientpipeline.

We provide a tool which can execute such pipelines end-to-end. This tool alsocollects and aggregates the metrics such as execution time, CPU and memoryusage, necessary for this challenge as CSV files. Moreover, the informationabout the hardware used during the execution of the pipeline is available aswell to allow fairly comparing different pipelines. Your pipeline should consistof Docker images which can be executed on Linux to run the tool. The tool isalready tested with existing systems, relational databases e.g. MySQL andPostgreSQL, and triplestores e.g. Apache Jena Fuseki and OpenLink Virtuosowhich can be combined in any configuration. It is strongly encouraged to usethis tool for participating in this challenge. If you prefer to use a differenttool or our tool imposes technical requirements you cannot solve, please contactus directly.

Track 1: Conformance

The set of new specification for the RDF Mapping Language (RML) established by the W3C Community Group on Knowledge Graph Construction provide a set of test-cases for each module:

RML-Core

RML-IO

RML-CC

RML-FNML

RML-Star

These test-cases are evaluated in this Track of the Challenge to determine their feasibility, correctness, etc. by applying them in implementations. This Track is in Beta status because these new specifications have not seen any implementation yet, thus it may contain bugs and issues. If you find problems with the mappings, output, etc. please report them to the corresponding repository of each module.

Note: validating the output of the RML Star module automatically through the provided tooling is currently not possible, see https://github.com/kg-construct/challenge-tool/issues/1.

Through this Track we aim to spark development of implementations for the new specifications and improve the test-cases. Let us know your problems with the test-cases and we will try to find a solution.

Track 2: Performance

Part 1: Knowledge Graph Construction Parameters

These parameters are evaluated using synthetic generated data to have moreinsights of their influence on the pipeline.

Data

Number of data records: scaling the data size vertically by the number of records with a fixed number of data properties (10K, 100K, 1M, 10M records).

Number of data properties: scaling the data size horizontally by the number of data properties with a fixed number of data records (1, 10, 20, 30 columns).

Number of duplicate values: scaling the number of duplicate values in the dataset (0%, 25%, 50%, 75%, 100%).

Number of empty values: scaling the number of empty values in the dataset (0%, 25%, 50%, 75%, 100%).

Number of input files: scaling the number of datasets (1, 5, 10, 15).

Mappings

Number of subjects: scaling the number of subjects with a fixed number of predicates and objects (1, 10, 20, 30 TMs).

Number of predicates and objects: scaling the number of predicates and objects with a fixed number of subjects (1, 10, 20, 30 POMs).

Number of and type of joins: scaling the number of joins and type of joins (1-1, N-1, 1-N, N-M)

Part 2: GTFS-Madrid-Bench

The GTFS-Madrid-Bench provides insights in the pipeline with real data from thepublic transport domain in Madrid.

Scaling

GTFS-1 SQL

GTFS-10 SQL

GTFS-100 SQL

GTFS-1000 SQL

Heterogeneity

GTFS-100 XML + JSON

GTFS-100 CSV + XML

GTFS-100 CSV + JSON

GTFS-100 SQL + XML + JSON + CSV

Example pipeline

The ground truth dataset and baseline results are generated in different stepsfor each parameter:

The provided CSV files and SQL schema are loaded into a MySQL relational database.

Mappings are executed by accessing the MySQL relational database to construct a knowledge graph in N-Triples as RDF format

The pipeline is executed 5 times from which the median execution time of eachstep is calculated and reported. Each step with the median execution time isthen reported in the baseline results with all its measured metrics.Knowledge graph construction timeout is set to 24 hours. The execution is performed with the following tool: https://github.com/kg-construct/challenge-tool,you can adapt the execution plans for this example pipeline to your own needs.

Each parameter has its own directory in the ground truth dataset with thefollowing files:

Input dataset as CSV.

Mapping file as RML.

Execution plan for the pipeline in metadata.json.

Datasets

Knowledge Graph Construction Parameters

The dataset consists of:

Input dataset as CSV for each parameter.

Mapping file as RML for each parameter.

Baseline results for each parameter with the example pipeline.

Ground truth dataset for each parameter generated with the example pipeline.

Format

All input datasets are provided as CSV, depending on the parameter that is beingevaluated, the number of rows and columns may differ. The first row is alwaysthe header of the CSV.

GTFS-Madrid-Bench

The dataset consists of:

Input dataset as CSV with SQL schema for the scaling and a combination of XML,

CSV, and JSON is provided for the heterogeneity.

Mapping file as RML for both scaling and heterogeneity.

SPARQL queries to retrieve the results.

Baseline results with the example pipeline.

Ground truth dataset generated with the example pipeline.

Format

CSV datasets always have a header as their first row.JSON and XML datasets have their own schema.

Evaluation criteria

Submissions must evaluate the following metrics:

Execution time of all the steps in the pipeline. The execution time of a step is the difference between the begin and end time of a step.

CPU time as the time spent in the CPU for all steps of the pipeline. The CPU time of a step is the difference between the begin and end CPU time of a step.

Minimal and maximal memory consumption for each step of the pipeline. The minimal and maximal memory consumption of a step is the minimum and maximum calculated of the memory consumption during the execution of a step.

Expected output

Duplicate values

Scale Number of Triples

0 percent 2000000 triples

25 percent 1500020 triples

50 percent 1000020 triples

75 percent 500020 triples

100 percent 20 triples

Empty values

Scale Number of Triples

0 percent 2000000 triples

25 percent 1500000 triples

50 percent 1000000 triples

75 percent 500000 triples

100 percent 0 triples

Mappings

Scale Number of Triples

1TM + 15POM 1500000 triples

3TM + 5POM 1500000 triples

5TM + 3POM 1500000 triples

15TM + 1POM 1500000 triples

Properties

Scale Number of Triples

1M rows 1 column 1000000 triples

1M rows 10 columns 10000000 triples

1M rows 20 columns 20000000 triples

1M rows 30 columns 30000000 triples

Records

Scale Number of Triples

10K rows 20 columns 200000 triples

100K rows 20 columns 2000000 triples

1M rows 20 columns 20000000 triples

10M rows 20 columns 200000000 triples

Joins

1-1 joins

Scale Number of Triples

0 percent 0 triples

25 percent 125000 triples

50 percent 250000 triples

75 percent 375000 triples

100 percent 500000 triples

1-N joins

Scale Number of Triples

1-10 0 percent 0 triples

1-10 25 percent 125000 triples

1-10 50 percent 250000 triples

1-10 75 percent 375000 triples

1-10 100 percent 500000 triples

1-5 50 percent 250000 triples

1-10 50 percent 250000 triples

1-15 50 percent 250005 triples

1-20 50 percent 250000 triples

1-N joins

Scale Number of Triples

10-1 0 percent 0 triples

10-1 25 percent 125000 triples

10-1 50 percent 250000 triples

10-1 75 percent 375000 triples

10-1 100 percent 500000 triples

5-1 50 percent 250000 triples

10-1 50 percent 250000 triples

15-1 50 percent 250005 triples

20-1 50 percent 250000 triples

N-M joins

Scale Number of Triples

5-5 50 percent 1374085 triples

10-5 50 percent 1375185 triples

5-10 50 percent 1375290 triples

5-5 25 percent 718785 triples

5-5 50 percent 1374085 triples

5-5 75 percent 1968100 triples

5-5 100 percent 2500000 triples

5-10 25 percent 719310 triples

5-10 50 percent 1375290 triples

5-10 75 percent 1967660 triples

5-10 100 percent 2500000 triples

10-5 25 percent 719370 triples

10-5 50 percent 1375185 triples

10-5 75 percent 1968235 triples

10-5 100 percent 2500000 triples

GTFS Madrid Bench

Generated Knowledge Graph

Scale Number of Triples

1 395953 triples

10 3959530 triples

100 39595300 triples

1000 395953000 triples

Queries

Query Scale 1 Scale 10 Scale 100 Scale 1000

Q1 58540 results 585400 results No results available No results available

Q2 636 results 11998 results
125565 results 1261368 results

Q3 421 results 4207 results 42067 results 420667 results

Q4 13 results 130 results 1300 results 13000 results

Q5 35 results 350 results 3500 results 35000 results

Q6 1 result 1 result 1 result 1 result

Q7 68 results 67 results 67 results 53 results

Q8 35460 results 354600 results No results available No results available

Q9 130 results 1300
Z
Data from: KGCW 2023 Challenge @ ESWC 2023
data.niaid.nih.gov
data-staging.niaid.nih.gov
Updated May 17, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Van Assche, Dylan; Chaves-Fraga, David; Dimou, Anastasia; Şimşek, Umutcan; Iglesias, Ana (2023). KGCW 2023 Challenge @ ESWC 2023 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7689309
Explore at:
Dataset updated
May 17, 2023
Dataset provided by
IDLab - Ghent University - imec
Universidad Politécnica de Madrid
KU Leuven
STI Insbruck
Authors
Van Assche, Dylan; Chaves-Fraga, David; Dimou, Anastasia; Şimşek, Umutcan; Iglesias, Ana
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Knowledge Graph Construction Workshop 2023: challenge

Knowledge graph construction of heterogeneous data has seen a lot of uptake in the last decade from compliance to performance optimizations with respect to execution time. Besides execution time as a metric for comparing knowledge graph construction, other metrics e.g. CPU or memory usage are not considered. This challenge aims at benchmarking systems to find which RDF graph construction system optimizes for metrics e.g. execution time, CPU, memory usage, or a combination of these metrics.

Task description

The task is to reduce and report the execution time and computing resources (CPU and memory usage) for the parameters listed in this challenge, compared to the state-of-the-art of the existing tools and the baseline results provided by this challenge. This challenge is not limited to execution times to create the fastest pipeline, but also computing resources to achieve the most efficient pipeline.

We provide a tool which can execute such pipelines end-to-end. This tool also collects and aggregates the metrics such as execution time, CPU and memory usage, necessary for this challenge as CSV files. Moreover, the information about the hardware used during the execution of the pipeline is available as well to allow fairly comparing different pipelines. Your pipeline should consist of Docker images which can be executed on Linux to run the tool. The tool is already tested with existing systems, relational databases e.g. MySQL and PostgreSQL, and triplestores e.g. Apache Jena Fuseki and OpenLink Virtuoso which can be combined in any configuration. It is strongly encouraged to use this tool for participating in this challenge. If you prefer to use a different tool or our tool imposes technical requirements you cannot solve, please contact us directly.

Part 1: Knowledge Graph Construction Parameters

These parameters are evaluated using synthetic generated data to have more insights of their influence on the pipeline.

Data

Number of data records: scaling the data size vertically by the number of records with a fixed number of data properties (10K, 100K, 1M, 10M records).

Number of data properties: scaling the data size horizontally by the number of data properties with a fixed number of data records (1, 10, 20, 30 columns).

Number of duplicate values: scaling the number of duplicate values in the dataset (0%, 25%, 50%, 75%, 100%).

Number of empty values: scaling the number of empty values in the dataset (0%, 25%, 50%, 75%, 100%).

Number of input files: scaling the number of datasets (1, 5, 10, 15).

Mappings

Number of subjects: scaling the number of subjects with a fixed number of predicates and objects (1, 10, 20, 30 TMs).

Number of predicates and objects: scaling the number of predicates and objects with a fixed number of subjects (1, 10, 20, 30 POMs).

Number of and type of joins: scaling the number of joins and type of joins (1-1, N-1, 1-N, N-M)

Part 2: GTFS-Madrid-Bench

The GTFS-Madrid-Bench provides insights in the pipeline with real data from the public transport domain in Madrid.

Scaling

GTFS-1 SQL

GTFS-10 SQL

GTFS-100 SQL

GTFS-1000 SQL

Heterogeneity

GTFS-100 XML + JSON

GTFS-100 CSV + XML

GTFS-100 CSV + JSON

GTFS-100 SQL + XML + JSON + CSV

Example pipeline

The ground truth dataset and baseline results are generated in different steps for each parameter:

The provided CSV files and SQL schema are loaded into a MySQL relational database.

Mappings are executed by accessing the MySQL relational database to construct a knowledge graph in N-Triples as RDF format.

The constructed knowledge graph is loaded into a Virtuoso triplestore, tuned according to the Virtuoso documentation.

The provided SPARQL queries are executed on the SPARQL endpoint exposed by Virtuoso.

The pipeline is executed 5 times from which the median execution time of each step is calculated and reported. Each step with the median execution time is then reported in the baseline results with all its measured metrics. Query timeout is set to 1 hour and knowledge graph construction timeout to 24 hours. The execution is performed with the following tool , you can adapt the execution plans for this example pipeline to your own needs.

Each parameter has its own directory in the ground truth dataset with the following files:

Input dataset as CSV.

Mapping file as RML.

Queries as SPARQL.

Execution plan for the pipeline in metadata.json.

Datasets

Knowledge Graph Construction Parameters

The dataset consists of:

Input dataset as CSV for each parameter.

Mapping file as RML for each parameter.

SPARQL queries to retrieve the results for each parameter.

Baseline results for each parameter with the example pipeline.

Ground truth dataset for each parameter generated with the example pipeline.

Format

All input datasets are provided as CSV, depending on the parameter that is being evaluated, the number of rows and columns may differ. The first row is always the header of the CSV.

GTFS-Madrid-Bench

The dataset consists of:

Input dataset as CSV with SQL schema for the scaling and a combination of XML,

CSV, and JSON is provided for the heterogeneity.

Mapping file as RML for both scaling and heterogeneity.

SPARQL queries to retrieve the results.

Baseline results with the example pipeline.

Ground truth dataset generated with the example pipeline.

Format

CSV datasets always have a header as their first row. JSON and XML datasets have their own schema.

Evaluation criteria

Submissions must evaluate the following metrics:

Execution time of all the steps in the pipeline. The execution time of a step is the difference between the begin and end time of a step.

CPU time as the time spent in the CPU for all steps of the pipeline. The CPU time of a step is the difference between the begin and end CPU time of a step.

Minimal and maximal memory consumption for each step of the pipeline. The minimal and maximal memory consumption of a step is the minimum and maximum calculated of the memory consumption during the execution of a step.

Expected output

Duplicate values

Scale Number of Triples 0 percent 2000000 triples 25 percent 1500020 triples 50 percent 1000020 triples 75 percent 500020 triples 100 percent 20 triples

Empty values

Scale Number of Triples 0 percent 2000000 triples 25 percent 1500000 triples 50 percent 1000000 triples 75 percent 500000 triples 100 percent 0 triples

Mappings

Scale Number of Triples 1TM + 15POM 1500000 triples 3TM + 5POM 1500000 triples 5TM + 3POM 1500000 triples 15TM + 1POM 1500000 triples

Properties

Scale Number of Triples 1M rows 1 column 1000000 triples 1M rows 10 columns 10000000 triples 1M rows 20 columns 20000000 triples 1M rows 30 columns 30000000 triples

Records

Scale Number of Triples 10K rows 20 columns 200000 triples 100K rows 20 columns 2000000 triples 1M rows 20 columns 20000000 triples 10M rows 20 columns 200000000 triples

Joins

1-1 joins

Scale Number of Triples 0 percent 0 triples 25 percent 125000 triples 50 percent 250000 triples 75 percent 375000 triples 100 percent 500000 triples

1-N joins

Scale Number of Triples 1-10 0 percent 0 triples 1-10 25 percent 125000 triples 1-10 50 percent 250000 triples 1-10 75 percent 375000 triples 1-10 100 percent 500000 triples 1-5 50 percent 250000 triples 1-10 50 percent 250000 triples 1-15 50 percent 250005 triples 1-20 50 percent 250000 triples

1-N joins

Scale Number of Triples 10-1 0 percent 0 triples 10-1 25 percent 125000 triples 10-1 50 percent 250000 triples 10-1 75 percent 375000 triples 10-1 100 percent 500000 triples 5-1 50 percent 250000 triples 10-1 50 percent 250000 triples 15-1 50 percent 250005 triples 20-1 50 percent 250000 triples

N-M joins

Scale Number of Triples 5-5 50 percent 1374085 triples 10-5 50 percent 1375185 triples 5-10 50 percent 1375290 triples 5-5 25 percent 718785 triples 5-5 50 percent 1374085 triples 5-5 75 percent 1968100 triples 5-5 100 percent 2500000 triples 5-10 25 percent 719310 triples 5-10 50 percent 1375290 triples 5-10 75 percent 1967660 triples 5-10 100 percent 2500000 triples 10-5 25 percent 719370 triples 10-5 50 percent 1375185 triples 10-5 75 percent 1968235 triples 10-5 100 percent 2500000 triples

GTFS Madrid Bench

Generated Knowledge Graph

Scale Number of Triples 1 395953 triples 10 3959530 triples 100 39595300 triples 1000 395953000 triples

Queries

Query Scale 1 Scale 10 Scale 100 Scale 1000 Q1 58540 results 585400 results No results available No results available Q2 636 results 11998 results 125565 results 1261368 results Q3 421 results 4207 results 42067 results 420667 results Q4 13 results 130 results 1300 results 13000 results Q5 35 results 350 results 3500 results 35000
Parkinson_csv
kaggle.com
zip
Updated Sep 12, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sagar Bapodara (2021). Parkinson_csv [Dataset]. https://www.kaggle.com/datasets/sagarbapodara/parkinson-csv/discussion
Explore at:
zip(15986 bytes)Available download formats
Dataset updated
Sep 12, 2021
Authors
Sagar Bapodara
Description
Dataset

This dataset is the CSV Version of the Original Parkison Dataset found at https://www.kaggle.com/nidaguler/parkinsons-data-set

Content

Title: Parkinson's Disease Data Set

Abstract: Oxford Parkinson's Disease Detection Dataset

Source

The dataset was created by Max Little of the University of Oxford, in collaboration with the National Centre for Voice and Speech, Denver, Colorado, who recorded the speech signals. The original study published the feature extraction methods for general voice disorders.

Dataset Info

This dataset is composed of a range of biomedical voice measurements from 31 people, 23 with Parkinson's disease (PD). Each column in the table is a particular voice measure, and each row corresponds one of 195 voice recording from these individuals ("name" column). The main aim of the data is to discriminate healthy people from those with PD, according to "status" column which is set to 0 for healthy and 1 for PD.

The data is in ASCII CSV format. The rows of the CSV file contain an instance corresponding to one voice recording. There are around six recordings per patient, the name of the patient is identified in the first column.For further information or to pass on comments, please contact Max Little (littlem '@' robots.ox.ac.uk).

Further details are contained in the following reference -- if you use this dataset, please cite: Max A. Little, Patrick E. McSharry, Eric J. Hunter, Lorraine O. Ramig (2008), 'Suitability of dysphonia measurements for telemonitoring of Parkinson's disease', IEEE Transactions on Biomedical Engineering (to appear).

Attribute Information:

Matrix column entries (attributes): name - ASCII subject name and recording number MDVP:Fo(Hz) - Average vocal fundamental frequency MDVP:Fhi(Hz) - Maximum vocal fundamental frequency MDVP:Flo(Hz) - Minimum vocal fundamental frequency MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP - Several measures of variation in fundamental frequency MDVP:Shimmer,MDVP:Shimmer(dB),Shimmer:APQ3,Shimmer:APQ5,MDVP:APQ,Shimmer:DDA - Several measures of variation in amplitude NHR,HNR - Two measures of ratio of noise to tonal components in the voice status - Health status of the subject (one) - Parkinson's, (zero) - healthy RPDE,D2 - Two nonlinear dynamical complexity measures DFA - Signal fractal scaling exponent spread1,spread2,PPE - Three nonlinear measures of fundamental frequency variation

Citation Request:

If you use this dataset, please cite the following paper: 'Exploiting Nonlinear Recurrence and Fractal Scaling Properties for Voice Disorder Detection', Little MA, McSharry PE, Roberts SJ, Costello DAE, Moroz IM. BioMedical Engineering OnLine 2007, 6:23 (26 June 2007)
Clean Meta Kaggle
kaggle.com
Updated Sep 8, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yoni Kremer (2023). Clean Meta Kaggle [Dataset]. https://www.kaggle.com/datasets/yonikremer/clean-meta-kaggle
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Sep 8, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Yoni Kremer
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Cleaned Meta-Kaggle Dataset

The Original Dataset - Meta-Kaggle

Explore our public data on competitions, datasets, kernels (code / notebooks) and more Meta Kaggle may not be the Rosetta Stone of data science, but we do think there's a lot to learn (and plenty of fun to be had) from this collection of rich data about Kaggle’s community and activity.

Strategizing to become a Competitions Grandmaster? Wondering who, where, and what goes into a winning team? Choosing evaluation metrics for your next data science project? The kernels published using this data can help. We also hope they'll spark some lively Kaggler conversations and be a useful resource for the larger data science community.

https://i.imgur.com/2Egeb8R.png" alt="" title="a title">

This dataset is made available as CSV files through Kaggle Kernels. It contains tables on public activity from Competitions, Datasets, Kernels, Discussions, and more. The tables are updated daily.

Please note: This data is not a complete dump of our database. Rows, columns, and tables have been filtered out and transformed.

August 2023 update

In August 2023, we released Meta Kaggle for Code, a companion to Meta Kaggle containing public, Apache 2.0 licensed notebook data. View the dataset and instructions for how to join it with Meta Kaggle here

We also updated the license on Meta Kaggle from CC-BY-NC-SA to Apache 2.0.

The Problems with the Original Dataset

The original dataset is 32 CSV files, with 268 colums and 7GB of compressed data. Having so many tables and columns makes it hard to understand the data.

The data is not normalized, so when you join tables you get a lot of errors.

Some values refer to non-existing values in other tables. For example, the UserId column in the ForumMessages table has values that do not exist in the Users table.

There are missing values.

There are duplicate values.

There are values that are not valid. For example, Ids that are not positive integers.

The date and time columns are not in the right format.

Some columns only have the same value for all rows, so they are not useful.

The boolean columns have string values True or False.

Incorrect values for the Total columns. For example, the DatasetCount is not the total number of datasets with the Tag according to the DatasetTags table.

Users upvote their own messages.

The Solution

To handle so many tables and columns I use a relational database. I use MySQL, but you can use any relational database.

The steps to create the database are:

Creating the database tables with the right data types and constraints. I do that by running the db_abd_create_tables.sql script.

Downloading the CSV files from Kaggle using the Kaggle API.

Cleaning the data using pandas. I do that by running the clean_data.py script. The script does the following steps for each table:

Drops the columns that are not needed.

Converts each column to the right data type.

Replaces foreign keys that do not exist with NULL.

Replaces some of the missing values with default values.

Removes rows where there are missing values in the primary key/not null columns.

Removes duplicate rows.

Loading the data into the database using the LOAD DATA INFILE command.

Checks that the number of rows in the database tables is the same as the number of rows in the CSV files.

Adds foreign key constraints to the database tables. I do that by running the add_foreign_keys.sql script.

Update the Total columns in the database tables. I do that by running the update_totals.sql script.

Backup the database.
France, the evolution of the number of Covid cases
kaggle.com
zip
Updated Feb 27, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Muhammad Saad (2021). France, the evolution of the number of Covid cases [Dataset]. https://www.kaggle.com/datasets/saadsikander/france-the-evolution-of-the-number-of-covid-cases
Explore at:
zip(590783 bytes)Available download formats
Dataset updated
Feb 27, 2021
Authors
Muhammad Saad
Area covered
France
Description
Dataset

This dataset was created by Muhammad Saad

Contents
Z
HWRT database of handwritten symbols
data.niaid.nih.gov
zenodo.org
Updated Jan 24, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Thoma, Martin (2020). HWRT database of handwritten symbols [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_50022
Explore at:
Dataset updated
Jan 24, 2020
Dataset provided by
Karlsruhe Institute of Technology
Authors
Thoma, Martin
License
Open Database License (ODbL) v1.0https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Description
The HWRT database of handwritten symbols contains on-line data of handwritten symbols such as all alphanumeric characters, arrows, greek characters and mathematical symbols like the integral symbol.

The database can be downloaded in form of bzip2-compressed tar files. Each tar file contains:

symbols.csv: A CSV file with the rows symbol_id, latex, training_samples, test_samples. The symbol id is an integer, the row latex contains the latex code of the symbol, the rows training_samples and test_samples contain integers with the number of labeled data.

train-data.csv: A CSV file with the rows symbol_id, user_id, user_agent and data.

test-data.csv: A CSV file with the rows symbol_id, user_id, user_agent and data.

All CSV files use ";" as delimiter and "'" as quotechar. The data is given in YAML format as a list of lists of dictinaries. Each dictionary has the keys "x", "y" and "time". (x,y) are coordinates and time is the UNIX time.

About 90% of the data was made available by Daniel Kirsch via github.com/kirel/detexify-data. Thank you very much, Daniel!
h
solusdt_180_days
huggingface.co
Updated Oct 22, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
oxxroad (2024). solusdt_180_days [Dataset]. https://huggingface.co/datasets/roadz/solusdt_180_days
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Oct 22, 2024
Authors
oxxroad
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
SOLUSDT 180 Days Minute Data

This dataset contains historical SOL/USDT minute-by-minute data for the last 180 days, downloaded from the Binance API.

Contents

data_SOLUSDT.csv: Historical data with close_SOLUSDT column.

scaler_SOLUSDT.pkl: MinMaxScaler object to back-transform normalised data.

Number of rows in the CSV file: 259200

Start date of record: 2024-04-25 07:38:00

End record date: 2024-10-22 07:37:00

Data structure

open_time… See the full description on the dataset page: https://huggingface.co/datasets/roadz/solusdt_180_days.
2022 Bikeshare Data -Reduced File Size -All Months
kaggle.com
zip
Updated Mar 8, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kendall Marie (2023). 2022 Bikeshare Data -Reduced File Size -All Months [Dataset]. https://www.kaggle.com/datasets/kendallmarie/2022-bikeshare-data-all-months-combined
Explore at:
zip(98884 bytes)Available download formats
Dataset updated
Mar 8, 2023
Authors
Kendall Marie
License
http://opendatacommons.org/licenses/dbcl/1.0/http://opendatacommons.org/licenses/dbcl/1.0/
Description
This is a condensed version of the raw data obtained through the Google Data Analytics Course, made available by Lyft and the City of Chicago under this license (https://ride.divvybikes.com/data-license-agreement).

I originally did my study in another platform, and the original files were too large to upload to Posit Cloud in full. Each of the 12 monthly files contained anywhere from 100k to 800k rows. Therefore, I decided to reduce the number of rows drastically by performing grouping, summaries, and thoughtful omissions in Excel for each csv file. What I have uploaded here is the result of that process.

Data is grouped by: month, day, rider_type, bike_type, and time_of_day. total_rides represent the sum of the data in each grouping as well as the total number of rows that were combined to make the new summarized row, avg_ride_length is the calculated average of all data in each grouping.

Be sure that you use weighted averages if you want to calculate the mean of avg_ride_length for different subgroups as the values in this file are already averages of the summarized groups. You can include the total_rides value in your weighted average calculation to weigh properly.

9 Columns:

date - year, month, and day in date format - includes all days in 2022 day_of_week - Actual day of week as character. Set up a new sort order if needed. rider_type - values are either 'casual', those who pay per ride, or 'member', for riders who have annual memberships. bike_type - Values are 'classic' (non-electric, traditional bikes), or 'electric' (e-bikes). time_of_day - this divides the day into 6 equal time frames, 4 hours each, starting at 12AM. Each individual ride was placed into one of these time frames using the time they STARTED their rides, even if the ride was long enough to end in a later time frame. This column was added to help summarize the original dataset. total_rides - Count of all individual rides in each grouping (row). This column was added to help summarize the original dataset. avg_ride_length - The calculated average of all rides in each grouping (row). Look to total_rides to know how many original rides length values were included in this average. This column was added to help summarize the original dataset. min_ride_length - Minimum ride length of all rides in each grouping (row). This column was added to help summarize the original dataset. max_ride_length - Maximum ride length of all rides in each grouping (row). This column was added to help summarize the original dataset.

Please note: the time_of_day column has inconsistent spacing. Use mutate(time_of_day = gsub(" ", "", time_of _day)) to remove all spaces.

Revisions

Below is the list of revisions I made in Excel before uploading the final csv files to the R environment:

Deleted station location columns and lat/long as much of this data was already missing.

Deleted ride id column since each observation was unique and I would not be joining with another table on this variable.

Deleted rows pertaining to "docked bikes" since there were no member entries for this type and I could not compare member vs casual rider data. I also received no information in the project details about what constitutes a "docked" bike.

Used ride start time and end time to calculate a new column called ride_length (by subtracting), and deleted all rows with 0 and 1 minute results, which were explained in the project outline as being related to staff tasks rather than users. An example would be taking a bike out of rotation for maintenance.

Placed start time into a range of times (time_of_day) in order to group more observations while maintaining general time data. time_of_day now represents a time frame when the bike ride BEGAN. I created six 4-hour time frames, beginning at 12AM.

Added a Day of Week column, with Sunday = 1 and Saturday = 7, then changed from numbers to the actual day names.

Used pivot tables to group total_rides, avg_ride_length, min_ride_length, and max_ride_length by date, rider_type, bike_type, and time_of_day.

Combined into one csv file with all months, containing less than 9,000 rows (instead of several million)
C
Replication data for "High life satisfaction reported among small-scale...
dataverse.csuc.cat
csv, txt
Updated Feb 7, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Eric Galbraith; Eric Galbraith; Victoria Reyes Garcia; Victoria Reyes Garcia (2024). Replication data for "High life satisfaction reported among small-scale societies with low incomes" [Dataset]. http://doi.org/10.34810/data904
Explore at:
csv(1620), csv(7829), txt(7017), csv(227502)Available download formats
Unique identifier
https://doi.org/10.34810/data904
Dataset updated
Feb 7, 2024
Dataset provided by
CORA.Repositori de Dades de Recerca
Authors
Eric Galbraith; Eric Galbraith; Victoria Reyes Garcia; Victoria Reyes Garcia
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jan 1, 2021 - Oct 24, 2023
Area covered
Kumbungu, Ghana, China, Shangri-la, Argentina, Puna, Tanzania, Mafia Island, United Republic of, Mongolia, Bulgan soum, Fiji, Ba, Guatemala, Western highlands, India, Darjeeling, Laprak, Nepal, Bassari country, Senegal
Dataset funded by
European Commission
Description
This dataset was created in order to document self-reported life evaluations among small-scale societies that exist on the fringes of mainstream industrialized socieities. The data were produced as part of the LICCI project, through fieldwork carried out by LICCI partners. The data include individual responses to a life satisfaction question, and household asset values. Data from Gallup World Poll and the World Values Survey are also included, as used for comparison. TABULAR DATA-SPECIFIC INFORMATION --------------------------------- 1. File name: LICCI_individual.csv Number of rows and columns: 2814,7 Variable list: Variable names: User, Site, village Description: identification of investigator and location Variable name: Well.being.general Description: numerical score for life satisfaction question Variable names: HH_Assets_US, HH_Assets_USD_capita Description: estimated value of representative assets in the household of respondent, total and per capita (accounting for number of household inhabitants) 2. File name: LICCI_bySite.csv Number of rows and columns: 19,8 Variable list: Variable names: Site, N Description: site name and number of respondents at the site Variable names: SWB_mean, SWB_SD Description: mean and standard deviation of life satisfaction score Variable names: HHAssets_USD_mean, HHAssets_USD_sd Description: Site mean and standard deviation of household asset value Variable names: PerCapAssets_USD_mean, PerCapAssets_USD_sd Description: Site mean and standard deviation of per capita asset value 3. File name: gallup_WVS_GDP_pk.csv Number of rows and columns: 146,8 Variable list: Variable name: Happiness Score, Whisker-high, Whisker-low Description: from Gallup World Poll as documented in World Happiness Report 2022. Variable name: GDP-PPP2017 Description: Gross Domestic Product per capita for year 2020 at PPP (constant 2017 international $). Accessed May 2022. Variable name: pk Description: Produced capital per capita for year 2018 (in 2018 US$) for available countries, as estimated by the World Bank (accessed February 2022). Variable names: WVS7_mean, WVS7_std Description: Results of Question 49 in the World Values Survey, Wave 7.
New 1000 Sales Records Data 2
kaggle.com
zip
Updated Jan 12, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Calvin Oko Mensah (2023). New 1000 Sales Records Data 2 [Dataset]. https://www.kaggle.com/datasets/calvinokomensah/new-1000-sales-records-data-2
Explore at:
zip(49305 bytes)Available download formats
Dataset updated
Jan 12, 2023
Authors
Calvin Oko Mensah
Description
This is a dataset downloaded off excelbianalytics.com created off of random VBA logic. I recently performed an extensive exploratory data analysis on it and I included new columns to it, namely: Unit margin, Order year, Order month, Order weekday and Order_Ship_Days which I think can help with analysis on the data. I shared it because I thought it was a great dataset to practice analytical processes on for newbies like myself.
s
Data from: Abiotic legacies mediate plant-soil feedback during early...
repository.soilwise-he.eu
data.niaid.nih.gov
+2more
Updated Oct 19, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2023). Data from: Abiotic legacies mediate plant-soil feedback during early vegetation succession on rare earth element mine tailings [Dataset]. http://doi.org/10.5061/dryad.ht76hdrnm
Explore at:
Unique identifier
https://doi.org/10.5061/dryad.ht76hdrnm
Dataset updated
Oct 19, 2023
Area covered
Earth
Description
Open Access# Abiotic legacies mediate plant-soil feedback during early vegetation succession on rare earth element mine tailings These datasets are derived from a two-phase plant-soil feedback experiment based on the rare earth element (REE) mine tailing. We investigated biotic (changes in bacterial and fungal community) and abiotic legacies (changes in chemical properties) of three pioneer grass species, and examined feedback effects to three grasses, two legumes and two woody plants with different root traits. More details about the study please find in the paper of Zhu et al (the same name as this dataset) or contact the corresponding author. Dataset DOI link: ## Description of the data and file structure ### File list: * (1) SoilAbioticData Soil_Abiotic_Properties.csv * (2) SoilBioticData Bacterial_feature_table.txt * (3) SoilBioticData Bacterial_taxonomy.tsv * (4) SoilBioticData Fungal_feature_table.txt * (5) SoilBioticData Fungal_taxonomy.tsv * (6) PlantTraitData Plant_Info.csv * (7) PlantTraitData Plant-soil_Feedback.csv * (8) PlantTraitData Root_Functional_Traits.csv ### Additional related data collected that was not included in the current data package: * The raw sequences of the soil bacterial and fungal communities have been deposited in the Genome Sequence Archive in the publicly accessible Beijing Institute of Genomics Data Centre, Chinese Academy of Sciences, under accession number PRJCA009170 (). ######################################################################### ### DATA-SPECIFIC INFORMATION FOR: (1) SoilAbioticData Soil _Abiotic _Properties.csv 1 . Number of variables: 18 2 . Number of cases/rows: 20 3 . Variable List: * sample ID: soil identities, 4 Treatments (5 replicates for each) in conditioning phase. * pH_water: soil pH, analysed at a ratio of 1:2.5 (soil: deionized water, w/v). * CaCl2_Al: soil extractable aluminum content, extracted by 0.01 M CaCl2 solution. * CaCl2_K: soil extractable potassium content, extracted by 0.01 M CaCl2 solution. * CaCl2_Mg: soil extractable potassium content, extracted by 0.01 M CaCl2 solution. * CaCl2_Mn: soil extractable potassium content, extracted by 0.01 M CaCl2 solution. * CaCl2_Na: soil extractable potassium content, extracted by 0.01 M CaCl2 solution. * CaCl2_P: soil extractable potassium content, extracted by 0.01 M CaCl2 solution. * CaCl2_S: soil extractable potassium content, extracted by 0.01 M CaCl2 solution. * CaCl2_REEs: soil extractable potassium content, extracted by 0.01 M CaCl2 solution. * Total_C: total carbon content. * Total_N: total nitrogen content. * Total_Organic_C: total organic carbon content. * CN_ratio: the ratio of carbon:nitrogen content. * Total_P: total content of phosphorus. * Available_N: available nitrogen content. * Available_P: available phosphorus content. * Available_K: available potassium content. 4 . Missing data codes: None 5 . Specialized formats or other abbreviations used: None ######################################################################### ### DATA-SPECIFIC INFORMATION FOR: (2) SoilBioticData Bacterial _feature _table.txt 1 . Number of variables: 21 2 . Number of cases/rows: 17322 3 . Variable List: * #ASV ID: amplicon sequence variants of bacterial community, clustered by high-quality reads. * column 2~21: soil identities, 4 Treatments (5 replicates for each) in conditioning phase. 4 . Missing data codes: None 5 . Specialized formats or other abbreviations used: * (1) CK - control soil * (2) M.s. - Miscanthus sinensis * (3) P. t. - Paspalum thunbergii * (4) D.s. - Digitaria sanguinalis ######################################################################### ### DATA-SPECIFIC INFORMATION FOR: (3) SoilBioticData Bacterial _taxonomy.tsv 1 . Number of variables: 3 2 . Number of cases/rows: 17322 3 . Variable List: * Feature ID: the same as '#ASV ID', amplicon sequence variants of bacterial community, clustered by high-quality reads. * Taxon: taxonomy of each ASV. * Confidence: confidence of taxonomy classfication. 4 . Missing data codes: None 5 . Specialized formats or other abbreviations used: None ######################################################################### ### DATA-SPECIFIC INFORMATION FOR: (4) SoilBioticData Fungal _feature _table.txt 1 . Number of variables: 21 2 . Number of cases/rows: 2660 3 . Variable List: * #ASV ID: amplicon sequence variants of fungal community, clustered by high-quality reads. * column 2~21: soil identities, 4 Treatments (5 replicates for each) in conditioning phase. 4 . Missing data codes: None 5 . Specialized formats or other abbreviations used: * (1) CK - control soil * (2) M.s. - Miscanthus sinensis * (3) P. t. - Paspalum thunbergii * (4) D.s. - Digitaria sanguinalis ######################################################################### ### DATA-SPECIFIC INFORMATION FOR: (5) SoilBioticData Fungal _taxonomy.tsv 1 . Number of variables: 3 2 . Number of cases/rows: 2660 3 . Variable List: * Feature ID: the same as '#ASV ID', amplicon sequence variants of fungal community, clustered by high-quality reads. * Taxon: taxonomy of each ASV. * Confidence: confidence of taxonomy classfication. 4 . Missing data codes: None 5 . Specialized formats or other abbreviations used: None ######################################################################### ### DATA-SPECIFIC INFORMATION FOR: (6) PlantTraitData Plant _Info.csv 1 . Number of variables: 3 2 . Number of cases/rows: 11 3 . Variable List: * Cultivating_Phase: the experimental phase, including 'conditioning phase' and 'feedback phase'. * Plant_Species: plant species (control pots were planting nothing) in both cultivating phase. * Data_Processing_ID: the abbreviation ID for each treatment during data presenting and processing. 4 . Missing data codes: None 5 . Specialized formats or other abbreviations used: as it shown in the file. ######################################################################### ### DATA-SPECIFIC INFORMATION FOR: (7) PlantTraitData Plant-soil _Feedback.csv 1 . Number of variables: 4 2 . Number of cases/rows: 140 & 105 3 . Variable List: * Sample_ID_1: sample ID of plant samples in feedback phase. * Aver_total_biomass(g): Averaged individual biomass (including roots and shoots) of plant samples in feedback phase. * Sample_ID_2: sample ID of plant samples in feedback phase (for PSF calculating result). * PSF_value: Plant-soil feedback value (The feedback effect was calculated as the difference in biomass (BM) of a focal plant a grown on plant A conditioned and control soils divided by the maximum biomass of the focal plant a on either plant A conditioned or control soils: PSFA-*a* = (BMA-*a* – BMcontrol-*a*) / max(BMA-*a*, BMcontrol-*a*)). 4 . Missing data codes: None 5 . Specialized formats or other abbreviations used: as it shown in the file (6). ######################################################################### ### DATA-SPECIFIC INFORMATION FOR: (8) PlantTraitData Root _Functional _Traits.csv 1 . Number of variables: 13 2 . Number of cases/rows: 140 3 . Variable List: * Sample ID: sample ID of plant samples in feedback phase. * Shoot_Biomass(g): averaged individual shoot biomass of plant samples in feedback phase. * Root_Biomass(g): averaged individual root biomass of plant samples in feedback phase. * Root_Length(cm): total root length of plant samples in feedback phase. * Root_Volume(cm3): total root volume of plant samples in feedback phase. * Average_Diameter(mm): average root diameter of plant samples in feedback phase. * Root_Carbon_Content(mg/g): root carbon content of plant samples in feedback phase. * Root_Nitrogen_Content(mg/g): root nitrogen content of plant samples in feedback phase. * CN_Ratio: The ratio of root carbon to nitrogen content of plant samples in feedback phase. * Specific_Root_Length(cm/g): The root length divided by root mass of plant samples in feedback phase. * Root_Tissue_Density(g/cm3): The ratio of root mass to the volume of plant samples in feedback phase. * Fine_Root_Diameter(mm): The diameter of fine roots (diameter < 2mm) of plant samples in feedback phase. * Root_Shoot_Ratio: The ratio of root mass to shoot mass of plant samples in feedback phase. 4 . Missing data codes: None 5 . Specialized formats or other abbreviations used: as it shown in the file (6). ## Sharing/Access information Data was derived from the following source: Zhu, S.-C., Liu, W.-S., Chen, Z.-W., Liu, X.-R., Zheng, H.-X., Chen, B.-Y., Zhi, X.-Y., Chao, Y., Qiu, R.-L., Chu, C.-j., Liu, C., Morel, J.L., van der Ent, A. & Tang, Y.-T. Abiotic legacies mediate plant-soil feedback during early vegetation succession on Rare Earth Element mine tailings. Journal of Applied Ecology.
c
BF skip indexes for Ethereum - Dataset - CryptoData Hub
cryptodata.center
Updated Dec 4, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). BF skip indexes for Ethereum - Dataset - CryptoData Hub [Dataset]. https://cryptodata.center/dataset/bf-skip-indexes-for-ethereum
Explore at:
Dataset updated
Dec 4, 2024
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
General information This repository includes all data needed to reproduce the experiments presented in [1]. The paper describes the BF skip index, a data structure based on Bloom filters [2] that can be used for answering inter-block queries on blockchains efficiently. The article also includes a historical analysis of logsBloom filters included in the Ethereum block headers, as well as an experimental analysis of the proposed data structure. The latter was conducted using the data set of events generated by the CryptoKitties Core contract, a popular decentralized application launched in 2017 (and also one of the first applications based on NFTs). In this description, we use the following abbreviations (also adopted throughout the paper) to denote two different sets of Ethereum blocks. D1: set of all Ethereum blocks between height 0 and 14999999. D2: set of all Ethereum blocks between height 14000000 and 14999999. Moreover, in accordance with the terminology adopted in the paper, we define the set of keys of a block as the set of all contract addresses and log topics of the transactions in the block. As defined in [3], log topics comprise event signature digests and the indexed parameters associated with the event occurrence. Data set description File Description filters_ones_0-14999999.csv.xz Compressed CSV file containing the number of ones for each logsBloom filter in D1. receipt_stats_0-14999999.csv.xz Compressed CSV file containing statistics about all transaction receipts in D1. Approval.csv CSV file containing the Approval event occurrences for the CryptoKitties Core contract in D2. Birth.csv CSV file containing the Birth event occurrences for the CryptoKitties Core contract in D2. Pregnant.csv CSV file containing the Pregnant event occurrences for the CryptoKitties Core contract in D2. Transfer.csv CSV file containing the Transfer event occurrences for the CryptoKitties Core contract in D2. events.xz Compressed binary file containing information about all contract events in D2. keys.xz Compressed binary file containing information about all keys in D2. File structure We now describe the structure of the files included in this repository. filters_ones_0-14999999.csv.xz is a compressed CSV file with 15 million rows (one for each block in D1) and 3 columns. Note that it is not necessary to decompress this file, as the provided code is capable of processing it directly in its compressed form. The columns have the following meaning. blockId: the identifier of the block. timestamp: timestamp of the block. numOnes: number of bits set to 1 in the logsBloom filter of the block. receipt_stats_0-14999999.csv.xz is a compressed CSV file with 15 million rows (one for each block in D1) and 5 columns. As for the previous file, it is not necessary to decompress this file. blockId: the identifier of the block. txCount: number of transactions included in the block. numLogs: number of event logs included in the block. numKeys: number of keys included in the block. numUniqueKeys: number of distinct keys in the block (useful as the same key may appear multiple times). All CSV files related to the CryptoKitties Core events (i.e., Approval.csv, Birth.csv, Pregnant.csv, Transfer.csv) have the same structure. They consist of 1 million rows (one for each block in D2) and 2 columns, namely: blockId: identifier of the block. numOcc: number of event occurrences in the block. events.xz is a compressed binary file describing all unique event occurrences in the blocks of D2. The file contains 1 million data chunks (i.e., one for each Ethereum block). Each chunk includes the following information. Do note that this file only records unique event occurrences in each block, meaning that if an event from a contract is triggered more than once within the same block, there will be only one sequence within the corresponding chunk. blockId: identifier of the block (4 bytes). numEvents: number of event occurrences in the block (4 bytes). A list of numEvent sequences, each made up of 52 bytes. A sequence represents an event occurrence and is indeed the concatenation of two fields, namely: Address of the contract triggering the event (20 bytes). Event signature digest (32 bytes). keys.xz is a compressed binary file describing all unique keys in the blocks of D2. As for the previous file, duplicate keys only appear once. The file contains 1 million data chunks, each representing an Ethereum block and including the following information. blockId: identifier of the block (4 bytes) numAddr: number of unique contract addresses (4 bytes). numTopics: number of unique topics (4 bytes). A sequence of numAddr addresses, each represented using 20 bytes. A sequence of numTopics topics, each represented using 32 bytes. Notes For space reasons, some of the files in this repository have been compressed using the XZ compression utility. Unless otherwise specified, these files need to be decompressed before they can be read. Please make sure you have an application installed on your system that is capable of decompressing such files. References Loporchio, Matteo et al. "Skip index: supporting efficient inter-block queries and query authentication on the blockchain". (2023). Bloom, Burton H. "Space/time trade-offs in hash coding with allowable errors." Communications of the ACM 13.7 (1970): 422-426. Wood, Gavin. "Ethereum: A secure decentralised generalised transaction ledger." Ethereum project yellow paper 151.2014 (2014): 1-32.
d
Geochemical data supporting analysis of geochemical conditions and nitrogen...
catalog.data.gov
data.usgs.gov
+2more
Updated Nov 19, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2025). Geochemical data supporting analysis of geochemical conditions and nitrogen transport in nearshore groundwater and the subterranean estuary at a Cape Cod embayment, East Falmouth, Massachusetts, 2013 [Dataset]. https://catalog.data.gov/dataset/geochemical-data-supporting-analysis-of-geochemical-conditions-and-nitrogen-transport-in-n
Explore at:
Dataset updated
Nov 19, 2025
Dataset provided by
United States Geological Surveyhttp://www.usgs.gov/
Area covered
Cape Cod, East Falmouth, Falmouth, Massachusetts
Description
This data release provides analytical and other data in support of an analysis of nitrogen transport and transformation in groundwater and in a subterranean estuary in the Eel River and onshore locations on the Seacoast Shores peninsula, Falmouth, Massachusetts. The analysis is described in U.S. Geological Survey Scientific Investigations Report 2018-5095 by Colman and others (2018). This data release is structured as a set of comma-separated values (CSV) files, each of which contains data columns for laboratory (if applicable), USGS Site Name, date sampled, time sampled, and columns of specific analytical and(or) other data. The .csv data files have the same number of rows and each row in each .csv file corresponds to the same sample. Blank cells in a .csv file indicate that the sample was not analyzed for that constituent. The data release also provides a Data Dictionary (Data_Dictionary.csv) that provides the following information for each constituent (analyte): laboratory or data source, data type, description of units, method, minimum reporting limit, limit of quantitation if appropriate, method reference citations, minimum, maximum, median, and average values for each analyte. The data release also contains a file called Abbreviations in Data_Dictionary.pdf that contains all of the abbreviations in the Data Dictionary and in the well characteristics file in the companion report, Colman and others (2018).
d
WBPHS geolocated data and segment effort, 2000+
catalog.data.gov
Updated Nov 25, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Fish and Wildlife Service (2025). WBPHS geolocated data and segment effort, 2000+ [Dataset]. https://catalog.data.gov/dataset/wbphs-geolocated-data-and-segment-effort-2000
Explore at:
Dataset updated
Nov 25, 2025
Dataset provided by
U.S. Fish and Wildlife Service
Description
The geolocated counts for the Waterfowl Breeding Population and Habitat Survey and associated segment effort information from 2000 to present. The survey was not conducted in 2020-21 due to the COVID pandemic. Two data files are included with their associated metadata (html and xml formats). wbphs_geolocated_counts_forDistribution.csv includes the locations of the plane when survey species observations were made. For each observation, the social group and count is recorded along with a description of the location quality. Number of rows in table = 1,880,556. wbphs_segment_effort_forDistribution.csv. The survey effort file includes the midpoint latitude and longitude of each segment when known, which can differ by year (as indicated by the version number). If a segment was not flown, it is absent from the table for the corresponding year. Number of rows in table = 67,874. Not all geolocated records have locations. Please consult the metadata for an explanation of the fields and other information to understand the limitations of the data.
r
Data from: Cold winters drive consistent and spatially synchronous 8-year...
researchdata.se
demo.researchdata.se
Updated Dec 8, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sara Emery; Ola Lundin (2022). Data from: Cold winters drive consistent and spatially synchronous 8-year population cycles of cabbage stem flea beetle [Dataset]. https://researchdata.se/en/catalogue/dataset/2022-215-1
Explore at:
(8960)Available download formats
Dataset updated
Dec 8, 2022
Dataset provided by
Swedish University of Agricultural Sciences
Authors
Sara Emery; Ola Lundin
Time period covered
Jan 1, 1968 - Dec 31, 2020
Area covered
Sweden, Skåne County
Description
The data contain information on the number of cabbage stem flea beetle (Psylliodes chrysocephala) larva in winter oilseed rape plants in southern Sweden 1968-2018. A monitoring program for cabbage stem flea beetles in southern Sweden winter oilseed rape fields started in 1969. These data were collected over a 50-year period from commercial winter oilseed rape fields across Scania, the southernmost county in Sweden, by the Swedish University of Agricultural Sciences and its predecessors as well as the Swedish Board of Agriculture. The sampling region of Scania, Sweden was divided into five subregions: (1) southeast, (2) southwest, (3) west, (4) northeast, and (5) northwest. For each subregion we also include data on daily maximum and minimum temperature data in Celsius from 1968-2018. The total area planted to winter oilseed rape in Scania, the mean regional number of cold days and the North Atlantic Oscillation index are reported. Lastly, P. chrysocephala density in winter oilseed rape across five subregions in the UK from 2001-2020 are extracted from a plot in a public report and reported.

The data in the CSFB_Scania_NAs2010-11.csv file have information on crop planting date, the total number of P. chrysocephala larvae detected, number of plants sampled, the density of larva (total larvae/plants examined), sampling date, year, subregion, whether a seed coating or spray pesticide was used in the field and whether the sample was from a commercial field or from an experiment. 3118 rows.

Five files have daily maximum and minimum temperatures for each subregion 1968-2018 (NorthwestInsectYear.csv, NortheastInsectYear.csv, SoutheastInsectYear.csv, SouthwestInsectYear.csv, WestInsectYear.csv). All weather data comes from Swedish weather data website https://www.smhi.se/data. 18,566 rows in the NorthwestInsectYear.csv, 18,606 rows in the NortheastInsectYear.csv, 18,574 rows in the SoutheastInsectYear.csv, 18,574 rows in the SouthwestInsectYear.csv, 18,574 rows in the WestInsectYear.csv.

The file WOSRareaSkane1968-present.csv pulls data from the Swedish Board of Agriculture on the number of hectares planted to winter oilseed rape, turnip rape and the combined total in the Scania region in Sweden from 1968 (matching with harvest year 1969) to 2019. 53 rows.

The file NAO_coldDays.csv gives the annual Hurrell PC-Based North Atlantic Oscillation Index value from https://climatedataguide.ucar.edu/climate-data/hurrell-north-atlantic-oscillation-nao-index-pc-based as well as the raw and log-transformed regional mean number of cold days (below -10C) per year for Scania. 51 rows.

Finally, the data in the CSFBinUK2001-2020.csv is the data extracted from “historical comparisons” figure from the Crop Monitor report last accessed 7 November, 2022 at https://www.cropmonitor.co.uk/wosr/surveys/wosrPestAssLab.cfm?year=2006/2007&season=Autumn. These data include the regions in the UK that data was collected from, the harvest year and the mean number of cabbage stem flea beetle larvae counted per plant in each region and year. 101 rows.
s
Annual maps of cropland abandonment, land cover, and other derived data for...
repository.soilwise-he.eu
data.niaid.nih.gov
Updated Apr 2, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2022). Annual maps of cropland abandonment, land cover, and other derived data for time-series analysis of cropland abandonment [Dataset]. http://doi.org/10.5281/zenodo.5348287
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.5348287
Dataset updated
Apr 2, 2022
Description
Open AccessThis archive contains raw annual land cover maps, cropland abandonment maps, and accompanying derived data products to support: Crawford C.L., Yin, H., Radeloff, V.C., and Wilcove, D.S. 2022. Rural land abandonment is too ephemeral to provide major benefits for biodiversity and climate. Science Advances doi.org/10.1126/sciadv.abm8999. An archive of the analysis scripts developed for this project can be found at: https://github.com/chriscra/abandonment_trajectories (https://doi.org/10.5281/zenodo.6383127). Note that the label '_2022_02_07' in many file names refers to the date of the primary analysis. 'dts” or “dt” refer to “data.tables,' large .csv files that were manipulated using the data.table package in R (Dowle and Srinivasan 2021, http://r-datatable.com/). “Rasters” refer to “.tif” files that were processed using the raster and terra packages in R (Hijmans, 2022; https://rspatial.org/terra/; https://rspatial.org/raster). Data files fall into one of four categories of data derived during our analysis of abandonment: observed, potential, maximum, or recultivation. Derived datasets also follow the same naming convention, though are aggregated across sites. These four categories are as follows (using “age_dts” for our site in Shaanxi Province, China as an example): observed abandonment identified through our primary analysis, with a threshold of five years. These files do not have a specific label beyond the description of the file and the date of analysis (e.g., shaanxi_age_2022_02_07.csv); potential abandonment for a scenario without any recultivation, in which abandoned croplands are left abandoned from the year of initial abandonment through the end of the time series, with the label “_potential” (e.g., shaanxi_potential_age_2022_02_07.csv); maximum age of abandonment over the course of the time series, with the label “_max” (e.g., shaanxi_max_age_2022_02_07.csv); recultivation periods, corresponding to the lengths of recultivation periods following abandonment, given the label “_recult” (e.g., shaanxi_recult_age_2022_02_07.csv). This archive includes multiple .zip files, the contents of which are described below: age_dts.zip - Maps of abandonment age (i.e., how long each pixel has been abandoned for, as of that year, also referred to as length, duration, etc.), for each year between 1987-2017 for all 11 sites. These maps are stored as .csv files, where each row is a pixel, the first two columns refer to the x and y coordinates (in terms of longitude and latitude), and subsequent columns contain the abandonment age values for an individual year (where years are labeled with 'y' followed by the year, e.g., 'y1987'). Maps are given with a latitude and longitude coordinate reference system. Folder contains observed age, potential age (“_potential”), maximum age (“_max”), and recultivation lengths (“_recult”) for all sites. Maximum age .csv files include only three columns: x, y, and the maximum length (i.e., “max age”, in years) for each pixel throughout the entire time series (1987-2017). Files were produced using the custom functions 'cc_filter_abn_dt(),' “cc_calc_max_age(),' “cc_calc_potential_age(),” and “cc_calc_recult_age();” see '_util/_util_functions.R.' age_rasters.zip - Maps of abandonment age (i.e., how long each pixel has been abandoned for), for each year between 1987-2017 for all 11 sites. Maps are stored as .tif files, where each band corresponds to one of the 31 years in our analysis (1987-2017), in ascending order (i.e., the first layer is 1987 and the 31st layer is 2017). Folder contains observed age, potential age (“_potential”), and maximum age (“_max”) rasters for all sites. Maximum age rasters include just one band (“layer”). These rasters match the corresponding .csv files contained in 'age_dts.zip.” derived_data.zip - summary datasets created throughout this analysis, listed below. diff.zip - .csv files for each of our eleven sites containing the year-to-year lagged differences in abandonment age (i.e., length of time abandoned) for each pixel. The rows correspond to a single pixel of land, and the columns refer to the year the difference is in reference to. These rows do not have longitude or latitude values associated with them; however, rows correspond to the same rows in the .csv files in 'input_data.tables.zip' and 'age_dts.zip.' These files were produced using the custom function 'cc_diff_dt()' (much like the base R function 'diff()'), contained within the custom function 'cc_filter_abn_dt()' (see '_util/_util_functions.R'). Folder contains diff files for observed abandonment, potential abandonment (“_potential”), and recultivation lengths (“_recult”) for all sites. input_dts.zip - annual land cover maps for eleven sites with four land cover classes (see below), adapted from Yin et al. 2020 Remote Sensing of Environment (https://doi.org/10.1016/j.rse.2020.111873). Like “age_dts,” these maps are stored as .csv files, where each row is a pixel and the first two columns refer to x and y coordinates (in terms of longitude and latitude). Subsequent columns contain the land cover class for an individual year (e.g., 'y1987'). Note that these maps were recoded from Yin et al. 2020 so that land cover classification was consistent across sites (see below). This contains two files for each site: the raw land cover maps from Yin et al. 2020 (after recoding), and a “clean” version produced by applying 5- and 8-year temporal filters to the raw input (see custom function “cc_temporal_filter_lc(),” in “_util/_util_functions.R” and “1_prep_r_to_dt.R”). These files correspond to those in 'input_rasters.zip,' and serve as the primary inputs for the analysis. input_rasters.zip - annual land cover maps for eleven sites with four land cover classes (see below), adapted from Yin et al. 2020 Remote Sensing of Environment. Maps are stored as '.tif' files, where each band corresponds one of the 31 years in our analysis (1987-2017), in ascending order (i.e., the first layer is 1987 and the 31st layer is 2017). Maps are given with a latitude and longitude coordinate reference system. Note that these maps were recoded so that land cover classes matched across sites (see below). Contains two files for each site: the raw land cover maps (after recoding), and a “clean” version that has been processed with 5- and 8-year temporal filters (see above). These files match those in 'input_dts.zip.' length.zip - .csv files containing the length (i.e., age or duration, in years) of each distinct individual period of abandonment at each site. This folder contains length files for observed and potential abandonment, as well as recultivation lengths. Produced using the custom function 'cc_filter_abn_dt()' and “cc_extract_length();” see '_util/_util_functions.R.' derived_data.zip contains the following files: 'site_df.csv' - a simple .csv containing descriptive information for each of our eleven sites, along with the original land cover codes used by Yin et al. 2020 (updated so that all eleven sites in how land cover classes were coded; see below). Primary derived datasets for both observed abandonment (“area_dat”) and potential abandonment (“potential_area_dat”). area_dat - Shows the area (in ha) in each land cover class at each site in each year (1987-2017), along with the area of cropland abandoned in each year following a five-year abandonment threshold (abandoned for >=5 years) or no threshold (abandoned for >=1 years). Produced using custom functions 'cc_calc_area_per_lc_abn()' via 'cc_summarize_abn_dts()'. See scripts 'cluster/2_analyze_abn.R' and '_util/_util_functions.R.' persistence_dat - A .csv containing the area of cropland abandoned (ha) for a given 'cohort' of abandoned cropland (i.e., a group of cropland abandoned in the same year, also called 'year_abn') in a specific year. This area is also given as a proportion of the initial area abandoned in each cohort, or the area of each cohort when it was first classified as abandoned at year 5 ('initial_area_abn'). The 'age' is given as the number of years since a given cohort of abandoned cropland was last actively cultivated, and 'time' is marked relative to the 5th year, when our five-year definition first classifies that land as abandoned (and where the proportion of abandoned land remaining abandoned is 1). Produced using custom functions 'cc_calc_persistence()' via 'cc_summarize_abn_dts()'. See scripts 'cluster/2_analyze_abn.R' and '_util/_util_functions.R.' This serves as the main input for our linear models of recultivation (“decay”) trajectories. turnover_dat - A .csv showing the annual gross gain, annual gross loss, and annual net change in the area (in ha) of abandoned cropland at each site in each year of the time series. Produced using custom functions 'cc_calc_abn_diff()' via 'cc_summarize_abn_dts()' (see '_util/_util_functions.R'), implemented in 'cluster/2_analyze_abn.R.' This file is only produced for observed abandonment. Area summary files (for observed abandonment only) area_summary_df - Contains a range of summary values relating to the area of cropland abandonment for each of our eleven sites. All area values are given in hectares (ha) unless stated otherwise. It contains 16 variables as columns, including 1) 'site,' 2) 'total_site_area_ha_2017' - the total site area (ha) in 2017, 3) 'cropland_area_1987' - the area in cropland in 1987 (ha), 4) 'area_abn_ha_2017' -
Shopee CodeLeague 2020 Product Detection (Resized)
kaggle.com
zip
Updated Jun 20, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
YKFlash (2020). Shopee CodeLeague 2020 Product Detection (Resized) [Dataset]. https://www.kaggle.com/tanyongkeong/shopee-code-league-2020-product-detection
Explore at:
zip(3700955073 bytes)Available download formats
Dataset updated
Jun 20, 2020
Authors
YKFlash
Description
Context

This dataset is specifically created for Shopee Code League 2020 Product Detection competition. This competition lasts for 2 weeks which required all the teams and participants to come out with a image classification model. The purpose of creating this dataset is to resize the original dataset provided into 299x299 images that match to Kaggle Kernel limitation. The number of images is same as the number of rows provided in the train.csv and test.csv.

Please refer: https://www.kaggle.com/c/shopee-product-detection-open/overview

Content

This dataset consists for 1 folder and 2 csv files which are images folders, train.csv and test.csv

Acknowledgements

We would like to thank Shopee for hosting a series of great competitions and giving chances for us to work with real world problems.

Facebook

Twitter

Click to copy link

Link copied

Cite

Aashirvad pandey (2024). Merge number of excel file,convert into csv file [Dataset]. https://www.kaggle.com/datasets/aashirvadpandey/merge-number-of-excel-fileconvert-into-csv-file

Merge number of excel file,convert into csv file

merging the file and converting the file

Explore at:

zip(6731 bytes)Available download formats

Dataset updated

Mar 30, 2024

Authors

Aashirvad pandey

License

Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically

Description

Project Description:

Title: Pandas Data Manipulation and File Conversion

Overview: This project aims to demonstrate the basic functionalities of Pandas, a powerful data manipulation library in Python. In this project, we will create a DataFrame, perform some data manipulation operations using Pandas, and then convert the DataFrame into both Excel and CSV formats.

Key Objectives:

DataFrame Creation: Utilize Pandas to create a DataFrame with sample data.
Data Manipulation: Perform basic data manipulation tasks such as adding columns, filtering data, and performing calculations.
File Conversion: Convert the DataFrame into Excel (.xlsx) and CSV (.csv) file formats.

Tools and Libraries Used:

Python
Pandas

Project Implementation:

DataFrame Creation:
- Import the Pandas library.
- Create a DataFrame using either a dictionary, a list of dictionaries, or by reading data from an external source like a CSV file.
- Populate the DataFrame with sample data representing various data types (e.g., integer, float, string, datetime).
Data Manipulation:
- Add new columns to the DataFrame representing derived data or computations based on existing columns.
- Filter the DataFrame to include only specific rows based on certain conditions.
- Perform basic calculations or transformations on the data, such as aggregation functions or arithmetic operations.
File Conversion:
- Utilize Pandas to convert the DataFrame into an Excel (.xlsx) file using the to_excel() function.
- Convert the DataFrame into a CSV (.csv) file using the to_csv() function.
- Save the generated files to the local file system for further analysis or sharing.

Expected Outcome:

Upon completion of this project, you will have gained a fundamental understanding of how to work with Pandas DataFrames, perform basic data manipulation tasks, and convert DataFrames into different file formats. This knowledge will be valuable for data analysis, preprocessing, and data export tasks in various data science and analytics projects.

Conclusion:

The Pandas library offers powerful tools for data manipulation and file conversion in Python. By completing this project, you will have acquired essential skills that are widely applicable in the field of data science and analytics. You can further extend this project by exploring more advanced Pandas functionalities or integrating it into larger data processing pipelines.in this data we add number of data and make that data a data frame.and save in single excel file as different sheet name and then convert that excel file in csv file .

Clear search

Close search

Google apps

Main menu

Merge number of excel file,convert into csv file

ENTSO-E Hydropower modelling data (PECD) in CSV format

WBPHS segment counts and segment effort, 1955-Present

KGCW 2024 Challenge @ ESWC 2024

Data from: KGCW 2023 Challenge @ ESWC 2023

Parkinson_csv

Dataset

Content

Source

Dataset Info

Attribute Information:

Citation Request:

Clean Meta Kaggle

Cleaned Meta-Kaggle Dataset

The Original Dataset - Meta-Kaggle

August 2023 update

The Problems with the Original Dataset

The Solution

France, the evolution of the number of Covid cases

Dataset

Contents

HWRT database of handwritten symbols

solusdt_180_days

2022 Bikeshare Data -Reduced File Size -All Months

9 Columns:

Revisions

Replication data for "High life satisfaction reported among small-scale...

New 1000 Sales Records Data 2

Data from: Abiotic legacies mediate plant-soil feedback during early...

BF skip indexes for Ethereum - Dataset - CryptoData Hub

Geochemical data supporting analysis of geochemical conditions and nitrogen...

WBPHS geolocated data and segment effort, 2000+

Data from: Cold winters drive consistent and spatially synchronous 8-year...

Annual maps of cropland abandonment, land cover, and other derived data for...

Shopee CodeLeague 2020 Product Detection (Resized)

Context

Content

Acknowledgements

Merge number of excel file,convert into csv file

merging the file and converting the file