75 datasets found
  1. Dataset made from a Pandas Dataframe

    • peter.demo.socrata.com
    csv, xlsx, xml
    Updated Jul 5, 2017
    Cite
    (2017). Dataset made from a Pandas Dataframe [Dataset]. https://peter.demo.socrata.com/dataset/Dataset-made-from-a-Pandas-Dataframe/w2r9-3vfi
    Available download formats: xlsx, csv, xml
    Dataset updated
    Jul 5, 2017
    Description

    a description

  2. example-data-frame

    • huggingface.co
    + more versions
    Cite
    AI Robotics Ethics Society (PUCRS), example-data-frame [Dataset]. https://huggingface.co/datasets/AiresPucrs/example-data-frame
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset authored and provided by
    AI Robotics Ethics Society (PUCRS)
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Example DataFrame (Teeny-Tiny Castle)

    This dataset is part of a tutorial tied to the Teeny-Tiny Castle, an open-source repository containing educational tools for AI Ethics and Safety research.

      How to Use
    

    from datasets import load_dataset

    dataset = load_dataset("AiresPucrs/example-data-frame", split="train")
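    The loaded split can also be converted to a pandas DataFrame; a minimal follow-up sketch (no column names are assumed here, since the page does not list them):

    ```python
    df = dataset.to_pandas()  # convert the Hugging Face split into a pandas DataFrame
    print(df.head())
    ```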

  3. Learn Data Science Series Part 1

    • kaggle.com
    Updated Dec 30, 2022
    Cite
    Rupesh Kumar (2022). Learn Data Science Series Part 1 [Dataset]. https://www.kaggle.com/datasets/hunter0007/learn-data-science-part-1
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 30, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Rupesh Kumar
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Please feel free to share it with others and consider supporting me if you find it helpful ⭐️.

    Overview (a short pandas sketch illustrating a few of these chapters follows the list):

    • Chapter 1: Getting started with pandas
    • Chapter 2: Analysis: Bringing it all together and making decisions
    • Chapter 3: Appending to DataFrame
    • Chapter 4: Boolean indexing of dataframes
    • Chapter 5: Categorical data
    • Chapter 6: Computational Tools
    • Chapter 7: Creating DataFrames
    • Chapter 8: Cross sections of different axes with MultiIndex
    • Chapter 9: Data Types
    • Chapter 10: Dealing with categorical variables
    • Chapter 11: Duplicated data
    • Chapter 12: Getting information about DataFrames
    • Chapter 13: Gotchas of pandas
    • Chapter 14: Graphs and Visualizations
    • Chapter 15: Grouping Data
    • Chapter 16: Grouping Time Series Data
    • Chapter 17: Holiday Calendars
    • Chapter 18: Indexing and selecting data
    • Chapter 19: IO for Google BigQuery
    • Chapter 20: JSON
    • Chapter 21: Making Pandas Play Nice With Native Python Datatypes
    • Chapter 22: Map Values
    • Chapter 23: Merge, join, and concatenate
    • Chapter 24: Meta: Documentation Guidelines
    • Chapter 25: Missing Data
    • Chapter 26: MultiIndex
    • Chapter 27: Pandas Datareader
    • Chapter 28: Pandas IO tools (reading and saving data sets)
    • Chapter 29: pd.DataFrame.apply
    • Chapter 30: Read MySQL to DataFrame
    • Chapter 31: Read SQL Server to Dataframe
    • Chapter 32: Reading files into pandas DataFrame
    • Chapter 33: Resampling
    • Chapter 34: Reshaping and pivoting
    • Chapter 35: Save pandas dataframe to a csv file
    • Chapter 36: Series
    • Chapter 37: Shifting and Lagging Data
    • Chapter 38: Simple manipulation of DataFrames
    • Chapter 39: String manipulation
    • Chapter 40: Using .ix, .iloc, .loc, .at and .iat to access a DataFrame
    • Chapter 41: Working with Time Series
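
    A short pandas sketch touching a few of the chapter topics above; the toy data is invented purely for illustration:

    ```python
    import pandas as pd

    # Chapter 7: creating a DataFrame from a dict of columns
    df = pd.DataFrame({"city": ["Rome", "Paris", "Rome"], "sales": [10, 25, 7]})

    # Chapter 4: boolean indexing keeps only the rows matching a condition
    rome_rows = df[df["city"] == "Rome"]

    # Chapter 15: grouping data and aggregating
    totals = df.groupby("city")["sales"].sum()

    # Chapter 35: saving a DataFrame to a CSV file
    df.to_csv("sales.csv", index=False)
    print(rome_rows, totals, sep="\n")
    ```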
  4. dataset: Create interoperable and well-documented data frames

    • explore.openaire.eu
    Updated Jun 23, 2022
    Cite
    Daniel Antal (2022). dataset: Create interoperable and well-documented data frames [Dataset]. http://doi.org/10.5281/zenodo.6854273
    Dataset updated
    Jun 23, 2022
    Authors
    Daniel Antal
    Description

    See the package documentation website on dataset.dataobservatory.eu. Report bugs and suggestions on GitHub: https://github.com/dataobservatory-eu/dataset/issues

    The primary aim of dataset is to build well-documented data.frames, tibbles or data.tables that follow the W3C Data Cube Vocabulary based on the statistical SDMX data cube model. Such standard R objects (data.frame, data.table, tibble, or well-structured lists like JSON) become highly interoperable and can be placed into relational databases, semantic web applications, archives and repositories. They follow the FAIR principles: they are findable, accessible, interoperable and reusable. Our datasets:

    • Contain Dublin Core or DataCite (or both) metadata that makes them findable and more easily accessible via online libraries. See the vignette article Datasets With FAIR Metadata.
    • Have dimensions that can be easily and unambiguously reduced to triples for RDF applications; they can be easily serialized to, or synchronized with, semantic web applications. See the vignette article From dataset To RDF.
    • Contain processing metadata that greatly enhances the reproducibility of the results and the reviewability of the contents of the dataset, including metadata defined by the DDI Alliance, which is particularly helpful for not-yet-processed data.
    • Follow the datacube model of the Statistical Data and Metadata eXchange, therefore allowing easy refreshing with new data from the source of the analytical work; this is particularly useful for datasets containing results of statistical operations in R.
    • Export correctly with FAIR metadata to the most used file formats and publish straightforwardly to open science repositories with correct bibliographical and use metadata. See Export And Publish a dataset.
    • Are relatively lightweight in dependencies and work easily with data.frame, tibble or data.table R objects.

  5. Dataframe Detection Dataset

    • universe.roboflow.com
    zip
    Updated Apr 6, 2024
    Cite
    detection (2024). Dataframe Detection Dataset [Dataset]. https://universe.roboflow.com/detection-fvah2/dataframe-detection
    Available download formats: zip
    Dataset updated
    Apr 6, 2024
    Dataset authored and provided by
    detection
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Student Responses On Exams Bounding Boxes
    Description

    Dataframe Detection

    ## Overview
    
    Dataframe Detection is a dataset for object detection tasks - it contains Student Responses On Exams annotations for 1,052 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
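
    A minimal download sketch using the Roboflow Python package; the API key is a placeholder, and the version number and export format are assumptions, since the page only gives the workspace and project slugs:
    
    ```python
    from roboflow import Roboflow
    
    rf = Roboflow(api_key="YOUR_API_KEY")  # placeholder key
    project = rf.workspace("detection-fvah2").project("dataframe-detection")
    dataset = project.version(1).download("coco")  # version 1 and COCO format are assumptions
    ```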
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  6. MAT-Builder datasets

    • data.niaid.nih.gov
    Updated Apr 19, 2023
    Cite
    Chiara Renso (2023). MAT-Builder datasets [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7839805
    Dataset updated
    Apr 19, 2023
    Dataset provided by
    Chiara Pugliese
    Fabio Pinelli
    Chiara Renso
    Francesco Lettich
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The archive contains two datasets that have been used to empirically evaluate MAT-Builder, a system to generate multiple aspect trajectories.

    The first one is located in the "rome" folder and contains 26395 trajectories from 3181 individuals. The trajectories move over the city of Rome and were collected from OpenStreetMap. The folder also contains auxiliary datasets, i.e., the set of POIs within the province of Rome's boundaries (downloaded from OpenStreetMap) (see the "poi" subfolder), historical weather information (downloaded from Meteostat, https://meteostat.net/it/) (see the "weather" subfolder), and a dataset of social media posts from the individuals which was generated synthetically (see the "tweets" subfolder). All the datasets are pandas dataframes, except for the POI dataset, which is a geopandas DataFrame. All the datasets have been stored in the Parquet format.

    The second one is located in the "geolife" folder and contains the GeoLife dataset. The dataset contains 17621 trajectories from 178 users. The timestamps of the trajectory samples have been adjusted from the GMT to the GMT+8 timezone. As in the former dataset's case, this folder also contains a dataset of POIs, a dataset of historical weather information, and a dataset of social media posts that were generated synthetically.

    For more information on the MAT-Builder project (i.e., published papers, how to use the datasets, how the information within the datasets is structured, and so on) we refer to MAT-Builder's GitHub page: https://github.com/chiarap2/MAT_Builder.
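
    A minimal loading sketch, assuming the archive has been extracted locally; the exact parquet file names inside the subfolders are not listed on this page, so the paths below are placeholders:

    ```python
    import pandas as pd
    import geopandas as gpd

    # Placeholder paths: adjust to the actual parquet file names inside the "rome" folder.
    trajectories = pd.read_parquet("rome/trajectories.parquet")  # pandas DataFrame
    weather = pd.read_parquet("rome/weather/weather.parquet")    # pandas DataFrame
    pois = gpd.read_parquet("rome/poi/pois.parquet")             # geopandas GeoDataFrame

    print(trajectories.head())
    print(pois.head())
    ```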

  7. dataframe-with-removed-features

    • kaggle.com
    Updated Nov 13, 2022
    Cite
    Anton Kostin (2022). dataframe-with-removed-features [Dataset]. https://www.kaggle.com/datasets/visualcomments/dataframe-with-removed-features/data
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 13, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Anton Kostin
    Description

    Dataset

    This dataset was created by Anton Kostin

    Contents

  8. Datasets of the CIKM resource paper "A Semantically Enriched Mobility...

    • zenodo.org
    zip
    Updated Jun 16, 2025
    Cite
    Francesco Lettich; Chiara Pugliese; Guido Rocchietti; Chiara Renso; Fabio Pinelli (2025). Datasets of the CIKM resource paper "A Semantically Enriched Mobility Dataset with Contextual and Social Dimensions" [Dataset]. http://doi.org/10.5281/zenodo.15658129
    Available download formats: zip
    Dataset updated
    Jun 16, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Francesco Lettich; Chiara Pugliese; Guido Rocchietti; Chiara Renso; Fabio Pinelli
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains the two semantically enriched trajectory datasets introduced in the CIKM Resource Paper "A Semantically Enriched Mobility Dataset with Contextual and Social Dimensions", by Chiara Pugliese (CNR-IIT), Francesco Lettich (CNR-ISTI), Guido Rocchietti (CNR-ISTI), Chiara Renso (CNR-ISTI), and Fabio Pinelli (IMT Lucca, CNR-ISTI).

    The two datasets were generated with an open source pipeline based on the Jupyter notebooks published in the GitHub repository behind our resource paper, and our MAT-Builder system. Overall, our pipeline first generates the files that we provide in the [paris|nyc]_input_matbuilder.zip archives; the files are then passed as input to the MAT-Builder system, which ultimately generates the two semantically enriched trajectory datasets for Paris and New York City, both in tabular and RDF formats. For more details on the input and output data, please see the sections below.

    Input data

    The [paris|nyc]_input_matbuilder.zip archives contain the data sources we used with the MAT-Builder system to semantically enrich raw preprocessed trajectories. More specifically, the archives contain the following files:

    • raw_trajectories_[paris|nyc]_matbuilder.parquet: these are the datasets of raw preprocessed trajectories, ready for ingestion by the MAT-Builder system, as outputted by the notebook 5 - Ensure MAT-Builder compatibility.ipynb in our GitHub repository, saved in Parquet format. Each row in the dataframe represents the sample of some trajectory, and the dataframe has the following columns:
      • traj_id: trajectory identifier;
      • user: user identifier;
      • lat: latitude of a trajectory sample;
      • lon: longitude of a trajectory sample;
      • time: timestamp of a sample;

    • pois.parqet: these are the POI datasets, ready for ingestion by the MAT-Builder system, output by the notebook 6 - Generate dataset POI from OpenStreetMap.ipynb in our GitHub repository, saved in Parquet format. Each row in the dataframe represents a POI, and the dataframe has the following columns:
      • osmid: POI OSM identifier
      • element_type: POI OSM element type
      • name: POI native name;
      • name:en: POI English name;
      • wikidata: POI WikiData identifier;
      • geometry: geometry associated with the POI;
      • category: POI category.

    • social_[paris|ny].parquet: these are the social media post datasets, ready for ingestion by the MAT-Builder system, outputted by the notebook 9 - Prepare social media dataset for MAT-Builder.ipynb in our GitHub repository, saved in Parquet format. Each row in the dataframe represents a single social media post, and the dataframe has the following columns:
      • tweet_ID: identifier of the post;
      • text: post's text;
      • tweet_created: post's timestamp;
      • uid: identifier of the user who posted.

    • weather_conditions.parquet: these are the weather conditions datasets, ready for ingestion by the MAT-Builder system, output by the notebook 7 - Meteostat daily data downloader.ipynb in our GitHub repository, saved in Parquet format. Each row in the dataframe represents the weather conditions recorded on a single day, and the dataframe has the following columns:
      • DATE: date in which the weather observation was recorded;
      • TAVG_C: average temperature in celsius;
      • DESCRIPTION: weather conditions.

    Output data: the semantically enriched Paris and New York City datasets

    Tabular Representation

    The [paris|nyc]_output_tabular.zip zip archives contain the output files generated by MAT-Builder that express the semantically enriched Paris and New York City datasets in tabular format. More specifically, they contain the following files:

    • traj_cleaned.parquet: parquet file storing the dataframe containing the raw preprocessed trajectories after applying the MAT-Builder's preprocessing step on raw_trajectories_[paris|nyc]_matbuilder.parquet. The dataframe contains the same columns found in raw_trajectories_[paris|nyc]_matbuilder.parquet, except for time which in this dataframe has been renamed to datetime. The operations performed in the MAT-Builder's preprocessing step were:
      • (1) we filtered out trajectories having fewer than 2 samples;
      • (2) we filtered out noisy samples inducing velocities above 300 km/h;
      • (3) finally, we compressed the trajectories such that all points within a radius of 20 meters from a given initial point are compressed into a single point that has the median coordinates of all points and the time of the initial point.

    • stops.parquet: parquet file storing the dataframe containing the stop segments detected from the trajectories by the MAT-Builder's segmentation step. Each row in the dataframe represents a specific stop segment from some trajectory. The columns are:
      • datetime, which indicates when a stop segment starts;
      • leaving_datetime, which indicates when a stop segment ends;
      • uid, the trajectory user's identifier;
      • tid, the trajectory's identifier;
      • lat, the stop segment's centroid latitude;
      • lng, the stop segment's centroid longitude.
        NOTE: to uniquely identify a stop segment, you need the triple (stop segment's index in the dataframe, uid, tid).
    • moves.parquet: parquet file storing the dataframe containing the samples associated with the move segments detected from the trajectories by the MAT-Builder's segmentation step. Each row in the dataframe represents a specific sample belonging to some move segment of some trajectory. The columns are:
      • datetime, the sample's timestamp;
      • uid, the samples' trajectory user's identifier;
      • tid, the sample's trajectory's identifier;
      • lat, the sample's latitude;
      • lng, the sample's longitude;
      • move_id, the identifier of a move segment.
        NOTE: to uniquely identify a move segment, you need the triple (uid, tid, move_id).

    • enriched_occasional.parquet: parquet file storing the dataframe containing pairs representing associations between stop segments that have been deemed occasional and POIs found to be close to their centroids. As such, in this dataframe an occasional stop can appear multiple times, i.e., when there are multiple POIs located near a stop's centroid. The columns found in this dataframe are the same as in stops.parquet, plus two sets of columns (a minimal sketch of joining these tables follows this entry's column descriptions).

      The first set of columns concerns a stop's characteristics:
      • stop_id, which represents the unique identifier of a stop segment and corresponds to the index of said stop in stops.parquet;
      • geometry_stop, which is a Shapely Point representing a stop's centroid;
      • geometry, which is the aforementioned Shapely Point plus a 50 meters buffer around it.

    There is then a second set of columns which represents the characteristics of the POI that has been associated with a stop. The relevant ones are:

      • index_poi, which is the index of the associated POI in the pois.parqet file;
      • osmid, which is the identifier given by OpenStreetMap to the POI;
      • name, the POI's name;
      • wikidata, the POI identifier on wikidata;
      • category, the POI's category;
      • geometry_poi, a Shapely (multi)polygon describing the POI's geometry;
      • distance, the distance between the stop segment's centroid and the POI.

    • enriched_systematic.parquet: parquet file storing the dataframe containing pairs representing associations between stop segments that have been deemed systematic and POIs found to be close to their centroids. This dataframe has exactly the same characteristics as enriched_occasional.parquet, plus the following columns:
      • systematic_id, the identifier of the cluster of systematic stops a systematic stop belongs to;
      • frequency, the number of systematic stops within a systematic stop's cluster;
      • home, the probability that the systematic stop's cluster represents the home of the associated user;
      • work, the probability that the systematic stop's cluster represents the workplace of the associated user;
      • other,
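
    As referenced above, a minimal sketch, following these column descriptions, for attaching nearby POIs to occasional stop segments; the file names are as given above, and the selected POI columns are just a subset of those listed:

    ```python
    import pandas as pd

    stops = pd.read_parquet("stops.parquet")
    enriched = pd.read_parquet("enriched_occasional.parquet")

    # A stop segment is identified by the triple (index in stops.parquet, uid, tid),
    # so expose the index as an explicit stop_id column before joining.
    stops = stops.reset_index().rename(columns={"index": "stop_id"})

    # Each occasional stop may match several nearby POIs, so this is a one-to-many join.
    stops_with_pois = stops.merge(
        enriched[["stop_id", "uid", "tid", "osmid", "name", "category", "distance"]],
        on=["stop_id", "uid", "tid"],
        how="left",
    )
    print(stops_with_pois.head())
    ```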

  9. Data of The Balance-Scale task (paper-and-pencil and Math Garden)

    • figshare.com
    application/gzip
    Updated Jan 19, 2016
    Cite
    Abe Hofman; Han van der Maas (2016). Data of The Balance-Scale task (paper-and-pencil and Math Garden) [Dataset]. http://doi.org/10.6084/m9.figshare.1309897.v1
    Available download formats: application/gzip
    Dataset updated
    Jan 19, 2016
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Abe Hofman; Han van der Maas
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Paper: The Balance-Scale Task Revisited: A Comparison of Statistical Models for Rule-Based and Information-Integration Theories of Proportional Reasoning. Abe Hofman, Ingmar Visser, Brenda Jansen & Han van der Maas; 15-2-2015.

    The "dataBS.Rdata" file includes four dataframes based on two different datasets: a paper-and-pencil dataset collected by Jansen & van der Maas (1997), and an online dataset collected with the Math Garden. Description of the four dataframes:

    1) student_info_pp: student information for the paper-and-pencil dataset
       - id = student id
       - age = student age
    2) student_info_mg: student information for the Math Garden dataset
       - id = student id
       - age = student age
       - new = student has not played the task before data collection started
       - practise = number of items made by the student before data collection started
    3) responses_pp: response information for the paper-and-pencil dataset, in long format
    4) responses_mg: response information for the Math Garden dataset, in long format
       - id = student id
       - it = item id
       - item_type = item type as defined in the paper
       - product_difference = difference between the product of weights and distance on each side of the fulcrum
       - weight_difference = difference between the weights on each side of the fulcrum
       - distance_difference = difference between the distance of the weights on each side of the fulcrum
       - resp = response; left, balance, right
       - cor = 0 incorrect; 1 correct
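
    One way to read the .Rdata file from Python is the pyreadr package; this is a tooling suggestion, not something specified by the dataset page:

    ```python
    import pyreadr

    # read_r returns a dict-like mapping of R object names to pandas DataFrames
    result = pyreadr.read_r("dataBS.Rdata")
    print(list(result.keys()))  # expected: the four dataframes described above

    responses_mg = result["responses_mg"]
    print(responses_mg.head())
    ```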

  10. Flow map data of the single pendulum, double pendulum and 3-body problem

    • data.niaid.nih.gov
    • zenodo.org
    Updated Apr 23, 2024
    Cite
    Horn, Philipp (2024). Flow map data of the single pendulum, double pendulum and 3-body problem [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11032351
    Dataset updated
    Apr 23, 2024
    Dataset provided by
    Simon, Portegies Zwart
    Koren, Barry
    Horn, Philipp
    Veronica, Saz Ulibarrena
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset was constructed to compare the performance of various neural network architectures learning the flow maps of Hamiltonian systems. It was created for the paper: A Generalized Framework of Neural Networks for Hamiltonian Systems.

    The dataset consists of trajectory data from three different Hamiltonian systems. Namely, the single pendulum, double pendulum and 3-body problem. The data was generated using numerical integrators. For the single pendulum, the symplectic Euler method with a step size of 0.01 was used. The data of the double pendulum was also computed by the symplectic Euler method, however, with an adaptive step size. The trajectories of the 3-body problem were calculated by the arbitrarily high-precision code Brutus.

    For each Hamiltonian system, there is one file containing the entire trajectory information (*_all_runs.h5.1). In these files, the states along all trajectories are recorded with a step size of 0.01. These files are composed of several Pandas DataFrames. One DataFrame per trajectory, called "run0", "run1", ... and finally one large DataFrame in which all the trajectories are combined, called "all_runs". Additionally, one Pandas Series called "constants" is contained in these files, in which several parameters of the data are listed.

    Also, there is a second file per Hamiltonian system in which the data is prepared as features and labels ready for neural networks to be trained (*_training.h5.1). Similar to the first type of files, they contain a Series called "constants". The features and labels are then separated into 6 DataFrames called "features", "labels", "val_features", "val_labels", "test_features" and "test_labels". The data is split into 80% training data, 10% validation data and 10% test data.
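
    A minimal sketch for inspecting these HDF5 files with pandas; the file names below are placeholders following the *_all_runs.h5.1 / *_training.h5.1 pattern:

    ```python
    import pandas as pd

    runs_path = "single_pendulum_all_runs.h5.1"   # placeholder file name
    train_path = "single_pendulum_training.h5.1"  # placeholder file name

    # List the stored objects: "constants", "all_runs", "run0", "run1", ...
    with pd.HDFStore(runs_path, mode="r") as store:
        print(store.keys())

    all_runs = pd.read_hdf(runs_path, key="all_runs")    # all trajectories combined
    constants = pd.read_hdf(runs_path, key="constants")  # Series of data parameters

    # The training file holds the feature/label splits described above.
    features = pd.read_hdf(train_path, key="features")
    labels = pd.read_hdf(train_path, key="labels")
    print(all_runs.head(), features.shape, labels.shape)
    ```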

    The code used to train various neural network architectures on this data can be found on GitHub at: https://github.com/AELITTEN/GHNN.

    Already trained neural networks can be found on GitHub at: https://github.com/AELITTEN/NeuralNets_GHNN.

    |                             | Single pendulum                | Double pendulum | 3-body problem |
    | --------------------------- | ------------------------------ | --------------- | -------------- |
    | Number of trajectories      | 500                            | 2000            | 5000           |
    | Final time in all_runs      | T (one period of the pendulum) | 10              | 10             |
    | Final time in training data | 0.25*T                         | 5               | 5              |
    | Step size in training data  | 0.1                            | 0.1             | 0.5            |

  11. CODEX multiplexed imaging cell datasets used for using STELLAR to transfer...

    • data.niaid.nih.gov
    • search.dataone.org
    • +1more
    zip
    Updated Jul 19, 2022
    Cite
    John Hickey (2022). CODEX multiplexed imaging cell datasets used for using STELLAR to transfer cell type annotations to other tissues and donors [Dataset]. http://doi.org/10.5061/dryad.g4f4qrfrc
    Available download formats: zip
    Dataset updated
    Jul 19, 2022
    Dataset provided by
    Stanford University
    Authors
    John Hickey
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    We performed CODEX (co-detection by indexing) multiplexed imaging on 24 sections of the human intestine from 3 donors (B004, B005, B006) using a panel of 47 oligonucleotide-barcoded antibodies. We also performed CODEX imaging on both human tonsil and Barrett's esophagus (BE) using a panel of 57 oligonucleotide-barcoded antibodies. Subsequently, images underwent standard CODEX image processing (tile stitching, drift compensation, cycle concatenation, background subtraction, deconvolution, and determination of best focal plane), single-cell segmentation, and column marker z-normalization by tissue. The output of this process was dataframes of 870,000 cells and 220,000 cells, respectively, with fluorescence values quantified for each marker. Methods: see the README file.

  12. Shopping Mall

    • kaggle.com
    Updated Dec 15, 2023
    Cite
    Anshul Pachauri (2023). Shopping Mall [Dataset]. https://www.kaggle.com/datasets/anshulpachauri/shopping-mall
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 15, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Anshul Pachauri
    Description

    Libraries Import: importing necessary libraries such as pandas, seaborn, matplotlib, scikit-learn's KMeans, and warnings.

    Data Loading and Exploration: reading a dataset named "Mall_Customers.csv" into a pandas DataFrame (df); displaying the first few rows of the dataset using df.head(); conducting univariate analysis by calculating descriptive statistics with df.describe().

    Univariate Analysis: visualizing the distribution of the 'Annual Income (k$)' column using sns.distplot; looping through selected columns ('Age', 'Annual Income (k$)', 'Spending Score (1-100)') and plotting individual distribution plots.

    Bivariate Analysis: creating a scatter plot for 'Annual Income (k$)' vs 'Spending Score (1-100)' using sns.scatterplot; generating a pair plot for selected columns with gender differentiation using sns.pairplot.

    Gender-Based Analysis: grouping the data by 'Gender' and calculating the mean for selected columns; computing the correlation matrix for the grouped data and visualizing it using a heatmap.

    Univariate Clustering: applying KMeans clustering with 3 clusters based on 'Annual Income (k$)' and adding the 'Income Cluster' column to the DataFrame; plotting the elbow method to determine the optimal number of clusters.

    Bivariate Clustering: applying KMeans clustering with 5 clusters based on 'Annual Income (k$)' and 'Spending Score (1-100)' and adding the 'Spending and Income Cluster' column; plotting the elbow method for bivariate clustering and visualizing the cluster centers on a scatter plot; displaying a normalized cross-tabulation between 'Spending and Income Cluster' and 'Gender'.

    Multivariate Clustering: performing multivariate clustering by creating dummy variables, scaling selected columns, and applying KMeans clustering; plotting the elbow method for multivariate clustering.

    Result Saving: saving the modified DataFrame with cluster information to a CSV file named "Result.csv"; saving the multivariate clustering plot as an image file ("Multivariate_figure.png").
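
    A minimal sketch of the bivariate clustering step described above; it assumes the standard Mall_Customers.csv column names used in the description:

    ```python
    import pandas as pd
    from sklearn.cluster import KMeans

    df = pd.read_csv("Mall_Customers.csv")

    # Bivariate clustering on income and spending score, as described above.
    cols = ["Annual Income (k$)", "Spending Score (1-100)"]
    kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
    df["Spending and Income Cluster"] = kmeans.fit_predict(df[cols])

    # Per-cluster means, then save the augmented DataFrame as in the description.
    print(df.groupby("Spending and Income Cluster")[cols].mean())
    df.to_csv("Result.csv", index=False)
    ```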

  13. Raw data from datasets used in SIMON analysis

    • zenodo.org
    bin
    Updated Jan 24, 2020
    Cite
    Adriana Tomic; Ivan Tomic (2020). Raw data from datasets used in SIMON analysis [Dataset]. http://doi.org/10.5281/zenodo.2580414
    Available download formats: bin
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Adriana Tomic; Ivan Tomic
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Here you can find raw data and information about each of the 34 datasets generated by the mulset algorithm and used for further analysis in SIMON.
    Each dataset is stored in a separate folder which contains 4 files:

    json_info: the number of features (with their names) and the number of subjects available for this dataset
    data_testing: data frame with the data used to test the trained model
    data_training: data frame with the data used to train models
    results: direct, unfiltered data from the database

    Files are written in the feather format. Here is an example of the data structure for each file in the repository.

    File was compressed using 7-Zip available at https://www.7-zip.org/.
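
    A minimal sketch for reading one dataset's feather files after extracting the archive; the folder name and file extensions are placeholders, since only the logical file names are listed above:

    ```python
    import pandas as pd

    # Placeholder paths: each of the 34 dataset folders contains a training and a testing data frame.
    train = pd.read_feather("dataset_01/data_training.feather")
    test = pd.read_feather("dataset_01/data_testing.feather")
    print(train.shape, test.shape)
    print(train.head())
    ```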

  14. PandasPlotBench

    • huggingface.co
    Updated Nov 25, 2024
    Cite
    PandasPlotBench [Dataset]. https://huggingface.co/datasets/JetBrains-Research/PandasPlotBench
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 25, 2024
    Dataset provided by
    JetBrains (http://jetbrains.com/)
    Authors
    JetBrains Research
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    PandasPlotBench

    PandasPlotBench is a benchmark to assess the capability of models in writing the code for visualizations given the description of a Pandas DataFrame. 🛠️ Task: given the plotting task and the description of a Pandas DataFrame, write the code to build a plot. The dataset is based on the MatPlotLib gallery. The paper can be found on arXiv: https://arxiv.org/abs/2412.02764v1. To score your model on this dataset, you can use our GitHub repository. 📩 If you have… See the full description on the dataset page: https://huggingface.co/datasets/JetBrains-Research/PandasPlotBench.
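
    A minimal loading sketch; the split name is an assumption, so check the dataset card for the actual configuration:

    ```python
    from datasets import load_dataset

    bench = load_dataset("JetBrains-Research/PandasPlotBench", split="test")  # split name is an assumption
    print(bench.column_names)
    print(bench[0])
    ```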

  15. dfencoder - AutoEncoders for DataFrames

    • kaggle.com
    zip
    Updated Oct 30, 2020
    Cite
    KingOfDayDream (2020). dfencoder - AutoEncoders for DataFrames [Dataset]. https://www.kaggle.com/kingofdaydream/dfencoder-autoencoders-for-dataframes
    Available download formats: zip (755488 bytes)
    Dataset updated
    Oct 30, 2020
    Authors
    KingOfDayDream
    Description

    Dataset

    This dataset was created by KingOfDayDream

    Contents

    It contains the following files:

  16. Crop classification dataset for testing domain adaptation or distributional...

    • zenodo.org
    • explore.openaire.eu
    • +1more
    bin, csv
    Updated May 13, 2022
    Cite
    Dan M. Kluger; Sherrie Wang; David B. Lobell (2022). Crop classification dataset for testing domain adaptation or distributional shift methods [Dataset]. http://doi.org/10.5281/zenodo.6376160
    Available download formats: bin, csv
    Dataset updated
    May 13, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Dan M. Kluger; Sherrie Wang; David B. Lobell
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In this upload we share processed crop type datasets from both France and Kenya. These datasets can be helpful for testing and comparing various domain adaptation methods. The datasets are processed, used, and described in this paper: https://doi.org/10.1016/j.rse.2021.112488 (arXiv version: https://arxiv.org/pdf/2109.01246.pdf).

    In summary, each point in the uploaded datasets corresponds to a particular location. The label is the crop type grown at that location in 2017. The 70 processed features are based on Sentinel-2 satellite measurements at that location in 2017. The points in the France dataset come from 11 different departments (regions) in Occitanie, France, and the points in the Kenya dataset come from 3 different regions in Western Province, Kenya. Within each dataset there are notable shifts in the distribution of the labels and in the distribution of the features between regions. Therefore, these datasets can be helpful for testing and comparing methods that are designed to address such distributional shifts.

    More details on the dataset and processing steps can be found in Kluger et al. (2021). Many of the processing steps were taken to deal with Sentinel-2 measurements that were corrupted by cloud cover. For users interested in the raw multi-spectral time series data and dealing with cloud cover issues on their own (rather than using the 70 processed features provided here), the raw dataset from Kenya can be found in Yeh et al. (2021), and the raw dataset from France can be made available upon request from the authors of this Zenodo upload.

    All of the data uploaded here can be found in "CropTypeDatasetProcessed.RData". We also post the dataframes and tables within that .RData file as separate .csv files for users who do not have R. The contents of each R object (or .csv file) is described in the file "Metadata.rtf".

    Preferred Citation:

    -Kluger, D.M., Wang, S., Lobell, D.B., 2021. Two shifts for crop mapping: Leveraging aggregate crop statistics to improve satellite-based maps in new regions. Remote Sens. Environ. 262, 112488. https://doi.org/10.1016/j.rse.2021.112488.

    -URL to this Zenodo post https://zenodo.org/record/6376160

  17. pandas DataFrames of the DYToMuMu_M-20_CT10_TuneZ2star_v2_8TeV process

    • bonndata.uni-bonn.de
    bin, text/x-python +1
    Updated Jan 21, 2025
    Cite
    Timo Saala (2025). pandas DataFrames of the DYToMuMu_M-20_CT10_TuneZ2star_v2_8TeV process [Dataset]. http://doi.org/10.60507/FK2/1MTTRE
    Available download formats: bin(630694234), bin(595883050), text/x-python(2553), bin(642092194), txt(7203), bin(525465770), bin(637589794), bin(637555602), bin(515541514), bin(624730562), bin(635941242), bin(632160114)
    Dataset updated
    Jan 21, 2025
    Dataset provided by
    bonndata
    Authors
    Timo Saala
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains pandas DataFrames that represent filtered versions of CMS Open Data (in the form of ROOT files) available on the CERN OpenData Portal. This dataset specifically contains data from a DYToMuMu process (Drell-Yan process resulting in two Muons in the final state), which is a simulated process created during the 2012 LHC run. A total of 121 (99 for real collision data) relevant variables are contained in the filtered pandas DataFrames that can be found here. A list of variables can be found below, for a full explanation of them, please refer to the following paper (PLACEHOLDER, REFERENCE PAPER HERE): nEvent, runNum, lumisection, evtNum; nMuon, vecMuon_PT, vecMuon_Eta, vecMuon_Phi, vecMuon_PTErr, vecMuon_Q, vecMuon_StaPt, vecMuon_StaEta, vecMuon_StaPhi, vecMuon_TrkIso03, vecMuon_EcalIso03, vecMuon_HcalIso03; nVertex, vecVertex_nTracksfit, vecVertex_ndof, vecVertex_Chi2, vecVertex_X, vecVertex_Y, vecVertex_Z; nEle, vecEle_PT, vecEle_Eta, vecEle_Phi, vecEle_Q, vecEle_TrkIso03, vecEle_EcalIso03, vecEle_HcalIso03, vecEle_D0, vecEle_Dz; nTau, vecTau_PT, vecTau_Eta, vecTau_Phi, vecTau_Q, vecTau_RawIso3Hits, vecTau_RawIsoMVA3oldDMwoLT, vecTau_RawIsoMVA3oldDMwLT, vecTau_RawIsoMVA3newDMwoLT, vecTau_RawIsoMVA3newDMwLT; nPhoton, vecPhoton_PT, vecPhoton_Eta, vecPhoton_Phi, vecPhoton_Hovere, vecPhoton_Sthovere, vecPhoton_HasPixelSeed, vecPhoton_IsConv, vecPhoton_PassElectronVeto; nMctruth, vecMctruth_PT, vecMctruth_Eta, vecMctruth_Phi, vecMctruth_Id_1, vecMctruth_Id_2, vecMctruth_X_1, vecMctruth_X_2, vecMctruth_PdgId, vecMctruth_Status, vecMctruth_Y, vecMctruth_Mass, vecMctruth_Mothers.first, vecMctruth_Mothers.second; nJets, vecJet_PT, vecJet_Eta, vecJet_Phi, vecJet_D0, vecJet_Dz, vecJet_nCharged, vecJet_nNeutrals, vecJet_nParticles, vecJet_Beta, vecJet_BetaStar, vecJet_dR2Mean, vecJet_Q, vecJet_Mass, vecJet_Area, vecJet_Energy, vecJet_chEmEnergy, vecJet_neuEmEnergy, vecJet_chHadEnergy, vecJet_neuHadEnergy, vecJet_ID, vecJet_Num, vecJet_mcFlavor, vecJet_GenPT, vecJet_GenEta, vecJet_GenPhi, vecJet_GenMass, vecJet_flavorMatchPT, vecJet_JEC, vecJet_MatchIdx; nPF, vecPF_PT, vecPF_Eta, vecPF_Phi, vecPF_Mass, vecPF_E, vecPF_Q, vecPF_PfType, vecPF_EcalE, vecPF_HcalE, vecPF_ndof, vecPF_Chi2, vecPF_pvId, vecPF_X, vecPF_Y, vecPF_Z, vecPF_JetNum; fMET_PT, fMET_Eta, fMET_Phi; HLT_Mu17_Mu8, HLT_Mu24, HLT_MET120_v, HLT_Ele27, HLT_HT350. For the datasets containing data from real collisions at the LHC, the following variables are NOT contained: nMctruth, vecMctruth_PT, vecMctruth_Eta, vecMctruth_Phi, vecMctruth_Id_1, vecMctruth_Id_2, vecMctruth_X_1, vecMctruth_X_2, vecMctruth_PdgId, vecMctruth_Status, vecMctruth_Y, vecMctruth_Mass, vecMctruth_Mothers.first, vecMctruth_Mothers.second; vecJet_mcFlavor, vecJet_GenPT, vecJet_GenEta, vecJet_GenPhi, vecJet_GenMass, vecJet_flavorMatchPT, vecJet_JEC, vecJet_MatchIdx

  18. A Replication Dataset for Fundamental Frequency Estimation

    • live.european-language-grid.eu
    • data.niaid.nih.gov
    • +1more
    json
    Updated Oct 19, 2023
    Cite
    (2023). A Replication Dataset for Fundamental Frequency Estimation [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7808
    Available download formats: json
    Dataset updated
    Oct 19, 2023
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Part of the dissertation Pitch of Voiced Speech in the Short-Time Fourier Transform: Algorithms, Ground Truths, and Evaluation Methods. © 2020, Bastian Bechtold. All rights reserved.

    Estimating the fundamental frequency of speech remains an active area of research, with varied applications in speech recognition, speaker identification, and speech compression. A vast number of algorithms for estimating this quantity have been proposed over the years, and a number of speech and noise corpora have been developed for evaluating their performance. The present dataset contains estimated fundamental frequency tracks of 25 algorithms, six speech corpora, and two noise corpora, at nine signal-to-noise ratios between -20 and 20 dB SNR, as well as an additional evaluation of synthetic harmonic tone complexes in white noise.

    The dataset also contains pre-calculated performance measures, both novel and traditional, in reference to each speech corpus' ground truth, the algorithms' own clean-speech estimate, and our own consensus truth. It can thus serve as the basis for a comparison study, to replicate existing studies from a larger dataset, or as a reference for developing new fundamental frequency estimation algorithms. All source code and data is available to download, and entirely reproducible, albeit requiring about one year of processor-time.

    Included Code and Data

    ground truth data.zip is a JBOF dataset of fundamental frequency estimates and ground truths of all speech files in the following corpora:

    • CMU-ARCTIC (consensus truth) [1]
    • FDA (corpus truth and consensus truth) [2]
    • KEELE (corpus truth and consensus truth) [3]
    • MOCHA-TIMIT (consensus truth) [4]
    • PTDB-TUG (corpus truth and consensus truth) [5]
    • TIMIT (consensus truth) [6]

    noisy speech data.zip is a JBOF dataset of fundamental frequency estimates of speech files mixed with noise from the following corpora:
    • NOISEX [7]
    • QUT-NOISE [8]

    synthetic speech data.zip is a JBOF dataset of fundamental frequency estimates of synthetic harmonic tone complexes in white noise.

    noisy_speech.pkl and synthetic_speech.pkl are pickled Pandas dataframes of performance metrics derived from the above data for the following list of fundamental frequency estimation algorithms (a minimal loading sketch follows the performance-measure list below):
    • AUTOC [9]
    • AMDF [10]
    • BANA [11]
    • CEP [12]
    • CREPE [13]
    • DIO [14]
    • DNN [15]
    • KALDI [16]
    • MAPSMBSC [17]
    • NLS [18]
    • PEFAC [19]
    • PRAAT [20]
    • RAPT [21]
    • SACC [22]
    • SAFE [23]
    • SHR [24]
    • SIFT [25]
    • SRH [26]
    • STRAIGHT [27]
    • SWIPE [28]
    • YAAPT [29]
    • YIN [30]

    noisy speech evaluation.py and synthetic speech evaluation.py are Python programs to calculate the above Pandas dataframes from the above JBOF datasets. They calculate the following performance measures:
    • Gross Pitch Error (GPE), the percentage of pitches where the estimated pitch deviates from the true pitch by more than 20%.
    • Fine Pitch Error (FPE), the mean error of grossly correct estimates.
    • High/Low Octave Pitch Error (OPE), the percentage of pitches that are GPEs and happen to be at an integer multiple of the true pitch.
    • Gross Remaining Error (GRE), the percentage of pitches that are GPEs but not OPEs.
    • Fine Remaining Bias (FRB), the median error of GREs.
    • True Positive Rate (TPR), the percentage of true positive voicing estimates.
    • False Positive Rate (FPR), the percentage of false positive voicing estimates.
    • False Negative Rate (FNR), the percentage of false negative voicing estimates.
    • F₁, the harmonic mean of precision and recall of the voicing decision.
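
    A minimal sketch for loading the two pickled DataFrames of performance metrics (assumes an environment compatible with the pickles, e.g. the one described by the Pipfile):

    ```python
    import pandas as pd

    noisy = pd.read_pickle("noisy_speech.pkl")
    synthetic = pd.read_pickle("synthetic_speech.pkl")

    # Each DataFrame holds the performance measures listed above,
    # per algorithm, corpus, and signal-to-noise ratio.
    print(noisy.columns)
    print(noisy.head())
    ```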

    Pipfile is a pipenv-compatible pipfile for installing all prerequisites necessary for running the above Python programs.

    The Python programs take about an hour to compute on a fast 2019 computer, and require at least 32 GB of memory.

    References:

    [1] John Kominek and Alan W Black. CMU ARCTIC database for speech synthesis, 2003.
    [2] Paul C Bagshaw, Steven Hiller, and Mervyn A Jack. Enhanced Pitch Tracking and the Processing of F0 Contours for Computer Aided Intonation Teaching. In EUROSPEECH, 1993.
    [3] F Plante, Georg F Meyer, and William A Ainsworth. A Pitch Extraction Reference Database. In Fourth European Conference on Speech Communication and Technology, pages 837–840, Madrid, Spain, 1995.
    [4] Alan Wrench. MOCHA MultiCHannel Articulatory database: English, November 1999.
    [5] Gregor Pirker, Michael Wohlmayr, Stefan Petrik, and Franz Pernkopf. A Pitch Tracking Corpus with Evaluation on Multipitch Tracking Scenario. page 4, 2011.
    [6] John S. Garofolo, Lori F. Lamel, William M. Fisher, Jonathan G. Fiscus, David S. Pallett, Nancy L. Dahlgren, and Victor Zue. TIMIT Acoustic-Phonetic Continuous Speech Corpus, 1993.
    [7] Andrew Varga and Herman J.M. Steeneken. Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems. Speech Communication, 12(3):247–251, July 1993.
    [8] David B. Dean, Sridha Sridharan, Robert J. Vogt, and Michael W. Mason. The QUT-NOISE-TIMIT corpus for the evaluation of voice activity detection algorithms. Proceedings of Interspeech 2010, 2010.
    [9] Man Mohan Sondhi. New methods of pitch extraction. Audio and Electroacoustics, IEEE Transactions on, 16(2):262–266, 1968.
    [10] Myron J. Ross, Harry L. Shaffer, Asaf Cohen, Richard Freudberg, and Harold J. Manley. Average magnitude difference function pitch extractor. Acoustics, Speech and Signal Processing, IEEE Transactions on, 22(5):353–362, 1974.
    [11] Na Yang, He Ba, Weiyang Cai, Ilker Demirkol, and Wendi Heinzelman. BaNa: A Noise Resilient Fundamental Frequency Detection Algorithm for Speech and Music. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(12):1833–1848, December 2014.
    [12] Michael Noll. Cepstrum Pitch Determination. The Journal of the Acoustical Society of America, 41(2):293–309, 1967.
    [13] Jong Wook Kim, Justin Salamon, Peter Li, and Juan Pablo Bello. CREPE: A Convolutional Representation for Pitch Estimation. arXiv:1802.06182 [cs, eess, stat], February 2018.
    [14] Masanori Morise, Fumiya Yokomori, and Kenji Ozawa. WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications. IEICE Transactions on Information and Systems, E99.D(7):1877–1884, 2016.
    [15] Kun Han and DeLiang Wang. Neural Network Based Pitch Tracking in Very Noisy Speech. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(12):2158–2168, December 2014.
    [16] Pegah Ghahremani, Bagher BabaAli, Daniel Povey, Korbinian Riedhammer, Jan Trmal, and Sanjeev Khudanpur. A pitch extraction algorithm tuned for automatic speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 2494–2498. IEEE, 2014.
    [17] Lee Ngee Tan and Abeer Alwan. Multi-band summary correlogram-based pitch detection for noisy speech. Speech Communication, 55(7-8):841–856, September 2013.
    [18] Jesper Kjær Nielsen, Tobias Lindstrøm Jensen, Jesper Rindom Jensen, Mads Græsbøll Christensen, and Søren Holdt Jensen. Fast fundamental frequency estimation: Making a statistically efficient estimator computationally efficient. Signal Processing, 135:188–197, June 2017.
    [19] Sira Gonzalez and Mike Brookes. PEFAC - A Pitch Estimation Algorithm Robust to High Levels of Noise. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(2):518–530, February 2014.
    [20] Paul Boersma. Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound. In Proceedings of the institute of phonetic sciences, volume 17, pages 97–110. Amsterdam, 1993.
    [21] David Talkin. A robust algorithm for pitch tracking (RAPT). Speech coding and synthesis, 495:518, 1995.
    [22] Byung Suk Lee and Daniel PW Ellis. Noise robust pitch tracking by subband autocorrelation classification. In Interspeech, pages 707–710, 2012.
    [23] Wei Chu and Abeer Alwan. SAFE: a statistical algorithm for F0 estimation for both clean and noisy speech. In INTERSPEECH, pages 2590–2593, 2010.
    [24] Xuejing Sun. Pitch determination and voice quality analysis using subharmonic-to-harmonic ratio. In Acoustics, Speech, and Signal Processing (ICASSP), 2002 IEEE International Conference on, volume 1, page I-333. IEEE, 2002.
    [25] Markel. The SIFT algorithm for fundamental frequency estimation. IEEE Transactions on Audio and Electroacoustics, 20(5):367–377, December 1972.
    [26] Thomas Drugman and Abeer Alwan. Joint Robust Voicing Detection and Pitch Estimation Based on Residual Harmonics. In Interspeech, pages 1973–1976, 2011.
    [27] Hideki Kawahara, Masanori Morise, Toru Takahashi, Ryuichi Nisimura, Toshio Irino, and Hideki Banno. TANDEM-STRAIGHT: A temporally stable power spectral representation for periodic signals and applications to interference-free spectrum, F0, and aperiodicity estimation. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 3933–3936. IEEE, 2008.
    [28] Arturo Camacho. SWIPE: A sawtooth waveform inspired pitch estimator for speech and music. PhD thesis, University of Florida, 2007.
    [29] Kavita Kasi and Stephen A. Zahorian. Yet Another Algorithm for Pitch Tracking. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages I-361–I-364, Orlando, FL, USA, May 2002. IEEE.
    [30] Alain de Cheveigné and Hideki Kawahara. YIN, a fundamental frequency estimator for speech and music. The Journal of the Acoustical Society of America, 111(4):1917, 2002.

  19. flores200-eng-bem

    • huggingface.co
    Updated May 4, 2025
    Cite
    Kreasof AI (2025). flores200-eng-bem [Dataset]. https://huggingface.co/datasets/kreasof-ai/flores200-eng-bem
    Dataset updated
    May 4, 2025
    Dataset authored and provided by
    Kreasof AI
    Description

    Dataset Details

    This is a Bemba-to-English dataset for the machine translation task. This dataset is a customized version of FLORES-200. It includes parallel sentences between Bemba and English.

      Preprocessing Notes
    

    Drop some unused columns such as URL, domain, topic, has_image, has_hyperlink. Merge the Bemba and English DataFrames on the ID column. Rename the columns sentence_bem to text_bem and sentence_en to text_en. Convert the dataframe into a DatasetDict. … See the full description on the dataset page: https://huggingface.co/datasets/kreasof-ai/flores200-eng-bem.
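
    A minimal sketch of the preprocessing steps described above; the input DataFrames and their contents are hypothetical, with column names taken from the notes:

    ```python
    import pandas as pd
    from datasets import Dataset, DatasetDict

    # Hypothetical input DataFrames holding the Bemba and English sides,
    # each with an ID column plus a sentence_bem / sentence_en text column.
    bem_df = pd.DataFrame({"ID": [1, 2], "sentence_bem": ["...", "..."]})
    eng_df = pd.DataFrame({"ID": [1, 2], "sentence_en": ["...", "..."]})

    # Merge the two sides on ID and rename the text columns as described above.
    merged = bem_df.merge(eng_df, on="ID")
    merged = merged.rename(columns={"sentence_bem": "text_bem", "sentence_en": "text_en"})

    # Convert the pandas DataFrame into a DatasetDict with a single split.
    dataset = DatasetDict({"train": Dataset.from_pandas(merged, preserve_index=False)})
    print(dataset)
    ```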

  20. Fracture toughness of mixed-mode anticracks in highly porous materials...

    • zenodo.org
    bin, text/x-python +1
    Updated Sep 2, 2024
    Cite
    Valentin Adam; Bastian Bergfeld; Philipp Weißgraeber; Alec van Herwijnen; Philipp L. Rosendahl (2024). Fracture toughness of mixed-mode anticracks in highly porous materials dataset and data processing [Dataset]. http://doi.org/10.5281/zenodo.11443644
    Available download formats: text/x-python, txt, bin
    Dataset updated
    Sep 2, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Valentin Adam; Bastian Bergfeld; Philipp Weißgraeber; Alec van Herwijnen; Philipp L. Rosendahl
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    This repository contains the code and datasets used in the data analysis for "Fracture toughness of mixed-mode anticracks in highly porous materials". The analysis is implemented in Python, using Jupyter Notebooks.

    Contents

    • main.ipynb: Jupyter notebook with the main data analysis workflow.
    • energy.py: Methods for the calculation of energy release rates.
    • regression.py: Methods for the regression analyses.
    • visualization.py: Methods for generating visualizations.
    • df_mmft.pkl: Pickled DataFrame with experimental data gathered in the present work.
    • df_legacy.pkl: Pickled DataFrame with literature data.

    Prerequisites

    • To run the scripts and notebooks, you need:
    • Python 3.12 or higher
    • Jupyter Notebook or JupyterLab
    • Libraries: pandas, matplotlib, numpy, scipy, tqdm, uncertainties, weac

    Setup

    1. Download the zip file or clone this repository to your local machine.
    2. Ensure that Python and Jupyter are installed.
    3. Install required Python libraries using pip install -r requirements.txt.

    Running the Analysis

    1. Open the main.ipynb notebook in Jupyter Notebook or JupyterLab.
    2. Execute the cells in sequence to reproduce the analysis.

    Data Description

    The data included in this repository is encapsulated in two pickled DataFrame files, df_mmft.pkl and df_legacy.pkl, which contain experimental measurements and corresponding parameters. Below are the descriptions for each column in these DataFrames:

    df_mmft.pkl

    Includes data such as experiment identifiers, datetime, and physical measurements like slope inclination and critical cut lengths.
    • exp_id: Unique identifier for each experiment.
    • datestring: Date of the experiment as a string.
    • datetime: Timestamp of the experiment.
    • bunker: Field site of the experiment. Bunker IDs 1 and 2 correspond to field sites A and B, respectively.
    • slope_incl: Inclination of the slope in degrees.
    • h_sledge_top: Distance from sample top surface to the sled in mm.
    • h_wl_top: Distance from sample top surface to weak layer in mm.
    • h_wl_notch: Distance from the notch root to the weak layer in mm.
    • rc_right: Critical cut length in mm, measured on the front side of the sample.
    • rc_left: Critical cut length in mm, measured on the back side of the sample.
    • rc: Mean of rc_right and rc_left.
    • densities: List of density measurements in kg/m^3 for each distinct slab layer of each sample.
    • densities_mean: Daily mean of densities.
    • layers: 2D array with layer density (kg/m^3) and layer thickness (mm) pairs for each distinct slab layer.
    • layers_mean: Daily mean of layers.
    • surface_lineload: Surface line load of added surface weights in N/mm.
    • wl_thickness: Weak-layer thickness in mm.
    • notes: Additional notes regarding the experiment or observations.
    • L: Length of the slab–weak-layer assembly in mm.

    df_legacy.pkl

    Contains robustness data such as radii of curvature, slope inclination, and various geometrical measurements.
    • #: Record number.
    • rc: Critical cut length in mm.
    • slope_incl: Inclination of the slope in degrees.
    • h: Slab height in mm.
    • density: Mean slab density in kg/m^3.
    • L: Length of the slab–weak-layer assembly in mm.
    • collapse_height: Weak-layer height reduction through collapse.
    • layers_mean: 2D array with layer density (kg/m^3) and layer thickness (mm) pairs for each distinct slab layer.
    • wl_thickness: Weak-layer thickness in mm.
    • surface_lineload: Surface line load from added weights in N/mm.

    For more detailed information on the datasets, refer to the paper or the documentation provided within the Jupyter notebook.
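
    A minimal sketch for loading the two pickled DataFrames outside the notebook (column names follow the descriptions above; opening the pickles with a plain pandas installation, rather than the full requirements, is an assumption):

    ```python
    import pandas as pd

    df_mmft = pd.read_pickle("df_mmft.pkl")      # experimental data from the present work
    df_legacy = pd.read_pickle("df_legacy.pkl")  # literature data

    print(df_mmft[["exp_id", "slope_incl", "rc"]].head())
    print(df_legacy[["rc", "slope_incl", "h", "density"]].head())
    ```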

    License

    You are free to:
    • Share — copy and redistribute the material in any medium or format
    • Adapt — remix, transform, and build upon the material for any purpose, even commercially.
    Under the following terms:
    • Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

    Citation

    Please cite the following paper if you use this analysis or the accompanying datasets:
    • Adam, V., Bergfeld, B., Weißgraeber, P., van Herwijnen, A., Rosendahl, P.L., Fracture toughness of mixed-mode anticracks in highly porous materials. Nature Communications 15, 7379 (2024). https://doi.org/10.1038/s41467-024-51491-7