36 datasets found
  1. Dataset made from a Pandas Dataframe

    • peter.demo.socrata.com
    csv, xlsx, xml
    Updated Jul 5, 2017
    Cite
    (2017). Dataset made from a Pandas Dataframe [Dataset]. https://peter.demo.socrata.com/dataset/Dataset-made-from-a-Pandas-Dataframe/w2r9-3vfi
    Explore at:
    Available download formats: xlsx, csv, xml
    Dataset updated
    Jul 5, 2017
    Description

    a description

  2. PandasPlotBench

    • huggingface.co
    Updated Nov 25, 2024
    Cite
    JetBrains Research (2024). PandasPlotBench [Dataset]. https://huggingface.co/datasets/JetBrains-Research/PandasPlotBench
    Explore at:
    Croissant (a format for machine-learning datasets; learn more about this at mlcommons.org/croissant)
    Dataset updated
    Nov 25, 2024
    Dataset provided by
    JetBrains (http://jetbrains.com/)
    Authors
    JetBrains Research
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    PandasPlotBench

    PandasPlotBench is a benchmark for assessing the capability of models at writing the code for visualizations given the description of a Pandas DataFrame. 🛠️ Task: Given the plotting task and the description of a Pandas DataFrame, write the code to build a plot. The dataset is based on the MatPlotLib gallery. The paper can be found on arXiv: https://arxiv.org/abs/2412.02764v1. To score your model on this dataset, you can use our GitHub repository. 📩 If you have… See the full description on the dataset page: https://huggingface.co/datasets/JetBrains-Research/PandasPlotBench.
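
    As a hedged sketch, the benchmark can be loaded with the Hugging Face datasets library (the split name "test" is an assumption; check the dataset page for the actual configuration):

    from datasets import load_dataset

    # Split name is assumed; see the dataset card for the available splits.
    ds = load_dataset("JetBrains-Research/PandasPlotBench", split="test")
    print(ds.column_names)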

  3. polyOne Data Set - 100 million hypothetical polymers including 29 properties...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Mar 24, 2023
    Cite
    Rampi Ramprasad (2023). polyOne Data Set - 100 million hypothetical polymers including 29 properties [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7124187
    Explore at:
    Dataset updated
    Mar 24, 2023
    Dataset provided by
    Christopher Kuenneth
    Rampi Ramprasad
    Description

    polyOne Data Set

    The data set contains 100 million hypothetical polymers each with 29 predicted properties using machine learning models. We use PSMILES strings to represent polymer structures, see here and here. The polymers are generated by decomposing previously synthesized polymers into unique chemical fragments. Random and enumerative compositions of these fragments yield 100 million hypothetical PSMILES strings. All PSMILES strings are chemically valid polymers but, mostly, have never been synthesized before. More information can be found in the paper. Please note the license agreement in the LICENSE file.

    Full data set including the properties

    The data files are in Apache Parquet format. The files start with polyOne_*.parquet.

    I recommend using dask (pip install dask) to load and process the data set. Pandas also works but is slower.

    Load the sharded data set with dask:

    import dask.dataframe as dd
    ddf = dd.read_parquet("*.parquet", engine="pyarrow")

    For example, compute the description of the data set:

    df_describe = ddf.describe().compute()
    df_describe

    PSMILES strings only

    generated_polymer_smiles_train.txt - 80 million PSMILES strings for training polyBERT. One string per line.

    generated_polymer_smiles_dev.txt - 20 million PSMILES strings for testing polyBERT. One string per line.
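
    If only a single shard is needed, pandas can read one Parquet file directly (a minimal sketch; the shard name below is illustrative of the polyOne_*.parquet pattern):

    import pandas as pd

    # Illustrative shard name; substitute any of the polyOne_*.parquet files.
    df = pd.read_parquet("polyOne_000.parquet", engine="pyarrow")
    print(df.shape)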
    
  4. kaggle-entity-annotated-corpus-ner-dataset

    • huggingface.co
    Updated Jul 10, 2022
    Cite
    Rafael Arias Calles (2022). kaggle-entity-annotated-corpus-ner-dataset [Dataset]. https://huggingface.co/datasets/rjac/kaggle-entity-annotated-corpus-ner-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more about this at mlcommons.org/croissant)
    Dataset updated
    Jul 10, 2022
    Authors
    Rafael Arias Calles
    License

    https://choosealicense.com/licenses/odbl/

    Description

    Date: 2022-07-10
    Files: ner_dataset.csv
    Source: Kaggle entity annotated corpus
    Notes: The dataset only contains the tokens and NER tag labels. Labels are uppercase.

      About Dataset
    

    from Kaggle Datasets

      Context
    

    Annotated Corpus for Named Entity Recognition using the GMB (Groningen Meaning Bank) corpus for entity classification, with enhanced and popular features from Natural Language Processing applied to the data set. Tip: Use a Pandas DataFrame to load the dataset if using Python for… See the full description on the dataset page: https://huggingface.co/datasets/rjac/kaggle-entity-annotated-corpus-ner-dataset.
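
    As a hedged sketch, the dataset can also be pulled from the Hub with the datasets library and converted to a Pandas DataFrame (the split name "train" is an assumption; check the dataset page):

    from datasets import load_dataset

    # Split name is assumed; see the dataset card for the available splits.
    ds = load_dataset("rjac/kaggle-entity-annotated-corpus-ner-dataset", split="train")
    df = ds.to_pandas()
    print(df.head())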

  5. Tokopedia Product Reviews

    • kaggle.com
    zip
    Updated Jul 21, 2019
    + more versions
    Cite
    Farhan (2019). Tokopedia Product Reviews [Dataset]. https://www.kaggle.com/datasets/farhan999/tokopedia-product-reviews/versions/1/code
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Jul 21, 2019
    Authors
    Farhan
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Tokopedia Product Reviews 2019

    Dataset Description

    This dataset contains 40,607 product reviews from Tokopedia, one of Indonesia's largest e-commerce platforms, scraped in 2019. The dataset provides valuable insights into customer sentiment and shopping behavior in the Indonesian e-commerce market.

    Dataset Summary

    • Language: Indonesian (Bahasa Indonesia)
    • Task: Sentiment Analysis, Product Review Analysis, E-commerce Research
    • Size: 40,607 reviews
    • Categories: 5 product categories
    • Unique Products: 3,647
    • Collection Period: 2019

    Dataset Structure

    Column Description

    • text (string): The review text written by customers
    • rating (int): Rating given by the reviewer (typically 1-5 scale)
    • category (string): Product category, one of:
      • pertukangan (tools/hardware)
      • fashion (fashion)
      • elektronik (electronics)
      • handphone (mobile phones)
      • olahraga (sports)
    • product_name (string): Name of the product
    • product_id (string): Unique identifier for the product
    • sold (int): Number of items sold
    • shop_id (string): Unique identifier for the shop/seller
    • product_url (string): URL link to the product page

    Data Splits

    The dataset comes as a single split containing all 40,607 reviews.

    Dataset Statistics

    | Category           | Count      |
    |--------------------|------------|
    | Total Reviews      | 40,607     |
    | Unique Products    | 3,647      |
    | Product Categories | 5          |
    | Language           | Indonesian |

    Usage

    Loading the Dataset

    python
    
    # ------------------------------------------------------------------
    # Minimal example: download the "Tokopedia Product Reviews" dataset
    # from Kaggle and load it into a pandas DataFrame
    # ------------------------------------------------------------------
    
    # --- KaggleHub (no manual kaggle.json) ------------------
    # Install required packages
    !pip install -q --upgrade kagglehub pandas
    
    import kagglehub
    import os
    import zipfile
    import pandas as pd
    
    # Download the dataset (cached after the first run)
    dataset_path = kagglehub.dataset_download("farhan999/tokopedia-product-reviews")
    print("Dataset saved at:", dataset_path)
    
    # Locate the main CSV file inside the downloaded folder
    csv_file = None
    for root, _, files in os.walk(dataset_path):
      for f in files:
        if f.lower().endswith('.csv'):
          csv_file = os.path.join(root, f)
          break
    
    if csv_file:
      # Load CSV into a DataFrame and display the first few rows
      df = pd.read_csv(csv_file)
      display(df.head())
    else:
      print("No CSV file found in the dataset.")
    

    Potential Use Cases

    • Sentiment Analysis: Classify customer sentiment based on review text and ratings
    • Product Recommendation: Analyze product preferences across different categories
    • Market Research: Understand Indonesian e-commerce customer behavior
    • Natural Language Processing: Train Indonesian language models for e-commerce domain
    • Category Classification: Predict product categories from review text
    • Rating Prediction: Predict customer ratings from review text

    Data Collection

    The data was collected through web scraping of Tokopedia product pages in 2019. The scraping process captured genuine customer reviews across five major product categories, providing a representative sample of customer feedback on the platform.

    Ethical Considerations

    • This dataset contains public reviews that were posted on Tokopedia's platform
    • Personal information has been anonymized (shop_id and product_id are anonymized identifiers)
    • The data reflects genuine customer opinions and experiences
    • Users should be mindful of potential biases in the data (e.g., selection bias, temporal bias from 2019)

    Limitations

    • Temporal Limitation: Data is from 2019 and may not reflect current market trends
    • Platform Specific: Limited to Tokopedia platform, may not generalize to other Indonesian e-commerce platforms
    • Category Limitation: Only covers 5 product categories
    • Language: Primarily in Indonesian, limiting applicability to other languages

    Citation

    If you use this dataset in your research, please cite:

    @misc{tokopedia-product-reviews-2019,
      title={Tokopedia Product Reviews},
      url={https://www.kaggle.com/dsv/562904},
      DOI={10.34740/KAGGLE/DSV/562904},
      publisher={Kaggle},
      author={M. Farhan},
      year={2019}
    }
    

    Contact

    For questions or issues regarding this dataset, please open an issue in the dataset repository or contact kontak.farhan@gmail.com.

    Acknowledgments

    • Thanks to Tokopedia for providing a platform that enables customer reviews
  6. SELTO Dataset

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip
    Updated May 23, 2023
    Cite
    Sören Dittmer; David Erzmann; Henrik Harms; Rielson Falck; Marco Gosch (2023). SELTO Dataset [Dataset]. http://doi.org/10.5281/zenodo.7781392
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    May 23, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sören Dittmer; David Erzmann; Henrik Harms; Rielson Falck; Marco Gosch
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A Benchmark Dataset for Deep Learning for 3D Topology Optimization

    This dataset represents voxelized 3D topology optimization problems and solutions. The solutions have been generated in cooperation with the Ariane Group and Synera using the Altair OptiStruct implementation of SIMP within the Synera software. The SELTO dataset consists of four different 3D datasets for topology optimization, called disc simple, disc complex, sphere simple and sphere complex. Each of these datasets is further split into a training and a validation subset.

    The following paper provides full documentation and examples:

    Dittmer, S., Erzmann, D., Harms, H., Maass, P., SELTO: Sample-Efficient Learned Topology Optimization (2022) https://arxiv.org/abs/2209.05098.

    The Python library DL4TO (https://github.com/dl4to/dl4to) can be used to download and access all SELTO dataset subsets.
    Each TAR.GZ file container consists of multiple enumerated pairs of CSV files. Each pair describes a unique topology optimization problem and contains an associated ground truth solution. Each problem-solution pair consists of two files, where one contains voxel-wise information and the other file contains scalar information. For example, the i-th sample is stored in the files i.csv and i_info.csv, where i.csv contains all voxel-wise information and i_info.csv contains all scalar information. We define all spatially varying quantities at the center of the voxels, rather than on the vertices or surfaces. This allows for a shape-consistent tensor representation.

    For the i-th sample, the columns of i_info.csv correspond to the following scalar information:

    • E - Young's modulus [Pa]
    • ν - Poisson's ratio [-]
    • σ_ys - a yield stress [Pa]
    • h - discretization size of the voxel grid [m]

    The columns of i.csv correspond to the following voxel-wise information:

    • x, y, z - the indices that state the location of the voxel within the voxel mesh
    • Ω_design - design space information for each voxel. This is a ternary variable that indicates the type of density constraint on the voxel. 0 and 1 indicate that the density is fixed at 0 or 1, respectively. -1 indicates the absence of constraints, i.e., the density in that voxel can be freely optimized
    • Ω_dirichlet_x, Ω_dirichlet_y, Ω_dirichlet_z - homogeneous Dirichlet boundary conditions for each voxel. These are binary variables that define whether the voxel is subject to homogeneous Dirichlet boundary constraints in the respective dimension
    • F_x, F_y, F_z - floating point variables that define the three spacial components of external forces applied to each voxel. All forces are body forces given in [N/m^3]
    • density - defines the binary voxel-wise density of the ground truth solution to the topology optimization problem

    How to Import the Dataset

    with DL4TO: With the Python library DL4TO (https://github.com/dl4to/dl4to) it is straightforward to download and access the dataset as a customized PyTorch torch.utils.data.Dataset object. As shown in the tutorial this can be done via:

    from dl4to.datasets import SELTODataset
    
    dataset = SELTODataset(root=root, name=name, train=train)

    Here, root is the path where the dataset should be saved. name is the name of the SELTO subset and can be one of "disc_simple", "disc_complex", "sphere_simple" and "sphere_complex". train is a boolean that indicates whether the corresponding training or validation subset should be loaded. See here for further documentation on the SELTODataset class.

    without DL4TO: After downloading and unzipping, any of the i.csv files can be manually imported into Python as a Pandas dataframe object:

    import pandas as pd
    
    root = ...
    file_path = f'{root}/{i}.csv'
    columns = ['x', 'y', 'z', 'Ω_design','Ω_dirichlet_x', 'Ω_dirichlet_y', 'Ω_dirichlet_z', 'F_x', 'F_y', 'F_z', 'density']
    df = pd.read_csv(file_path, names=columns)

    Similarly, we can import an i_info.csv file via:

    file_path = f'{root}/{i}_info.csv'
    info_column_names = ['E', 'ν', 'σ_ys', 'h']
    df_info = pd.read_csv(file_path, names=info_column_names)

    We can extract PyTorch tensors from the Pandas dataframe df using the following function:

    import torch
    
    def get_torch_tensors_from_dataframe(df, dtype=torch.float32):
      shape = df[['x', 'y', 'z']].iloc[-1].values.astype(int) + 1
      voxels = [df['x'].values, df['y'].values, df['z'].values]
    
      Ω_design = torch.zeros(1, *shape, dtype=int)
      Ω_design[:, voxels[0], voxels[1], voxels[2]] = torch.from_numpy(df['Ω_design'].values.astype(int))
    
      Ω_Dirichlet = torch.zeros(3, *shape, dtype=dtype)
      Ω_Dirichlet[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_x'].values, dtype=dtype)
      Ω_Dirichlet[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_y'].values, dtype=dtype)
      Ω_Dirichlet[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_z'].values, dtype=dtype)
    
      F = torch.zeros(3, *shape, dtype=dtype)
      F[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_x'].values, dtype=dtype)
      F[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_y'].values, dtype=dtype)
      F[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_z'].values, dtype=dtype)
    
      density = torch.zeros(1, *shape, dtype=dtype)
      density[:, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['density'].values, dtype=dtype)
    
      return Ω_design, Ω_Dirichlet, F, density
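
    For example, after loading the i-th problem into df as above, the tensors can be extracted via:

    Ω_design, Ω_Dirichlet, F, density = get_torch_tensors_from_dataframe(df)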

  7. oldIT2modIT

    • huggingface.co
    Updated Jun 3, 2025
    Cite
    Massimo Romano (2025). oldIT2modIT [Dataset]. https://huggingface.co/datasets/cybernetic-m/oldIT2modIT
    Explore at:
    Dataset updated
    Jun 3, 2025
    Authors
    Massimo Romano
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Download the dataset

    At the moment, to download the dataset you should use a Pandas DataFrame:

    import pandas as pd
    df = pd.read_csv("https://huggingface.co/datasets/cybernetic-m/oldIT2modIT/resolve/main/oldIT2modIT_dataset.csv")

    You can visualize the dataset with:

    df.head()

    To convert it into a Hugging Face dataset:

    from datasets import Dataset
    dataset = Dataset.from_pandas(df)

      Dataset Description
    

    This is an Italian dataset formed by 200 old (ancient) Italian sentences and… See the full description on the dataset page: https://huggingface.co/datasets/cybernetic-m/oldIT2modIT.

  8. Zippi_Shvartsman_et_al_2023_bmi_manual_files

    • figshare.com
    bin
    Updated Aug 29, 2023
    Cite
    Gabrielle Shvartsman; Ellen Zippi; Nuria Vendrell-Llopis; Joni D. Wallis; Jose M. Carmena (2023). Zippi_Shvartsman_et_al_2023_bmi_manual_files [Dataset]. http://doi.org/10.6084/m9.figshare.23674200.v1
    Explore at:
    Available download formats: bin
    Dataset updated
    Aug 29, 2023
    Dataset provided by
    figshare
    Authors
    Gabrielle Shvartsman; Ellen Zippi; Nuria Vendrell-Llopis; Joni D. Wallis; Jose M. Carmena
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Included files: Each file includes LFP (local field potential) data for both animals (‘h’, ‘y’) during a particular type of task control (‘bmi’ or ‘manual’), time-locked to 500 ms before or after a particular event in the task (‘go_cue’ or ‘target’) for each rewarded trial in each day of the task (‘h’: [1-13], ‘y’: [1-22]).

    File description: Each file includes a Pandas DataFrame, saved as a .feather file. Data can be accessed using Python by calling:

    import pandas as pd
    pd.read_feather([file name])

    Each DataFrame has the following columns:

    • control_type: ‘bmi’, ‘manual’, or ‘baseline’
    • event: go cue (‘go_cue’) or target acquisition (‘target’)
    • subj: which animal, ‘h’ or ‘y’
    • day: which day of the session, ‘h’: [1-13], ‘y’: [1-22]
    • roi: region of interest; ‘direct’, ‘dlpfc’, or ‘cd’, where ‘direct’ includes most channels from m1 each day but is specific to channels which had sufficient spiking to be used as input to the BMI decoder
    • ch: electrode channel number; only low-noise channels were included (see Methods for details)
    • n_rewarded_trial: which trial number the data segment is from; only successfully completed (rewarded) trials are included
    • time_from_window_ms: for go_cue: 0-500 ms from go cue; for target: -500-0 ms from target acquisition
    • lfp: local field potential value (see Methods for details)
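
    A minimal sketch of loading one of these files and selecting a subset of interest (the file name below is illustrative; use the actual .feather file downloaded from figshare):

    import pandas as pd

    # Illustrative file name; replace with the downloaded .feather file.
    df = pd.read_feather("bmi_go_cue.feather")

    # Example: go-cue-aligned LFP segments from animal 'h', 'direct' channels.
    subset = df[(df['subj'] == 'h') & (df['roi'] == 'direct')]
    print(subset[['day', 'ch', 'n_rewarded_trial', 'time_from_window_ms', 'lfp']].head())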

  9. Arabic(Indian) digits MADBase

    • kaggle.com
    Updated Jul 26, 2023
    Cite
    HOSSAM_AHMED_SALAH (2023). Arabic(Indian) digits MADBase [Dataset]. https://www.kaggle.com/datasets/hossamahmedsalah/arabicindian-digits-madbase/data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more about this at mlcommons.org/croissant)
    Dataset updated
    Jul 26, 2023
    Dataset provided by
    Kaggle
    Authors
    HOSSAM_AHMED_SALAH
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    This dataset consists of flattened images, where each image is represented as a row.

    • Objective: Establish benchmark results for Arabic digit recognition using different classification techniques.
    • Objective: Compare performances of different classification techniques on Arabic and Latin digit recognition problems.
    • A valid comparison requires the Arabic and Latin digit databases to be in the same format.
    • A modified version of the ADBase (MADBase) with the same size and format as MNIST was created.
    • MADBase is derived from ADBase by size-normalizing each digit to a 20x20 box while preserving the aspect ratio.
    • The size-normalization procedure results in gray levels due to the anti-aliasing filter.
    • MADBase and MNIST have the same size and format.
    • MNIST is a modified version of the NIST digits database.
    • MNIST is available for download.

    I used this code to turn the 70k Arabic digit images into tabular data for ease of use and to spend less time on preprocessing:

    ```python
    import os

    import numpy as np
    import pandas as pd
    from PIL import Image

    # Define the root directory of the dataset
    root_dir = "MAHD"

    # Define the names of the folders containing the images
    folder_names = ['Part{:02d}'.format(i) for i in range(1, 13)]

    # Define the names of the subfolders containing the training and testing images
    train_test_folders = ['MAHDBase_TrainingSet', 'test']

    # Initialize empty lists to store the image data and labels
    data = []
    labels = []

    # Loop over the training and testing subfolders in each Part folder
    for tt in train_test_folders:
        for folder_name in folder_names:
            if tt == train_test_folders[1] and folder_name == 'Part03':
                break
            subfolder_path = os.path.join(root_dir, tt, folder_name)
            print(subfolder_path)
            for filename in os.listdir(subfolder_path):
                # Skip files that are not BMP images
                if os.path.splitext(filename)[1].lower() != '.bmp':
                    continue

                # Load the image
                img_path = os.path.join(subfolder_path, filename)
                img = Image.open(img_path)

                # Convert the image to grayscale and flatten it into a 1D array
                img_grey = img.convert('L')
                img_data = np.array(img_grey).flatten()

                # Extract the label from the filename and convert it to an integer
                label = int(filename.split('_')[2].replace('digit', '').split('.')[0])

                # Add the image data and label to the lists
                data.append(img_data)
                labels.append(label)

    # Convert the image data and labels to a pandas dataframe
    df = pd.DataFrame(data)
    df['label'] = labels
    ```

    This dataset was made by https://datacenter.aucegypt.edu/shazeem, which provides 2 datasets:

    • ADBase
    • MADBase (✅ the one this dataset is derived from; similar in form to MNIST)

  10. agnews

    • huggingface.co
    Updated Apr 5, 2025
    + more versions
    Cite
    Washington Cunha (2025). agnews [Dataset]. https://huggingface.co/datasets/waashk/agnews
    Explore at:
    Dataset updated
    Apr 5, 2025
    Authors
    Washington Cunha
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset used in the paper: A thorough benchmark of automatic text classification: From traditional approaches to large language models (https://github.com/waashk/atcBench). To guarantee the reproducibility of the obtained results, the dataset and its respective CV train-test partitions are available here. Each dataset contains the following files:

    data.parquet: pandas DataFrame with texts and associated encoded labels for each document. split_
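
    A minimal sketch of inspecting data.parquet with pandas, assuming the file has been downloaded from the dataset page:

    import pandas as pd

    # Texts and their encoded labels, as described above.
    df = pd.read_parquet("data.parquet")
    print(df.head())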

  11. Zippi_Shvartsman_et_al_2023_baseline_files

    • figshare.com
    bin
    Updated Aug 29, 2023
    Cite
    Gabrielle Shvartsman; Ellen Zippi; Nuria Vendrell-Llopis; Joni D. Wallis; Jose M. Carmena (2023). Zippi_Shvartsman_et_al_2023_baseline_files [Dataset]. http://doi.org/10.6084/m9.figshare.24052824.v1
    Explore at:
    Available download formats: bin
    Dataset updated
    Aug 29, 2023
    Dataset provided by
    figshare
    Authors
    Gabrielle Shvartsman; Ellen Zippi; Nuria Vendrell-Llopis; Joni D. Wallis; Jose M. Carmena
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Included files: Each file includes LFP (local field potential) data for both animals (‘h’, ‘y’) during rest periods for each day (‘baseline’) without any time-locking (500 ms segments were randomly selected from baseline in our analyses). Separate baseline files are included for each animal.

    File description: Each file includes a Pandas DataFrame, saved as a .feather file. Data can be accessed using Python by calling:

    import pandas as pd
    pd.read_feather([file name])

    Each DataFrame has the following columns:

    • control_type: ‘bmi’, ‘manual’, or ‘baseline’
    • subj: which animal, ‘h’ or ‘y’
    • day: which day of the session, ‘h’: [1-13], ‘y’: [1-22]
    • roi: region of interest; ‘direct’, ‘dlpfc’, or ‘cd’, where ‘direct’ includes most channels from m1 each day but is specific to channels which had sufficient spiking to be used as input to the BMI decoder
    • ch: electrode channel number; only low-noise channels were included (see Methods for details)
    • time_from_window_ms: represents every ms from start to end of the recorded rest period
    • lfp: local field potential value (see Methods for details)

  12. TinyStories

    • kaggle.com
    • opendatalab.com
    • +1more
    Updated Nov 24, 2023
    Cite
    The Devastator (2023). TinyStories [Dataset]. https://www.kaggle.com/datasets/thedevastator/tinystories-narrative-classification
    Explore at:
    Croissant (a format for machine-learning datasets; learn more about this at mlcommons.org/croissant)
    Dataset updated
    Nov 24, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    TinyStories

    A Diverse, Richly Annotated Corpus of Short-Form Stories

    By Huggingface Hub [source]

    About this dataset

    This dataset contains the text of a remarkable collection of short stories known as the TinyStories Corpus. With over 2,000 annotated stories, it is populated with an array of diverse styles and genres from multiple sources. This corpus is enriched by intricate annotations across each narrative content, making it a valuable resource for narrative text classification. The text field in each row includes the entirety of each story that can be used to identify plots, characters and other features associated with story-telling techniques. Through this collection of stories, users will gain an extensive insight into a wide range of narratives which could be used to produce powerful machine learning models for Narrative Text Classification

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    In this dataset, each row contains a short story along with its associated labels for narrative text classification tasks. The data consists of the following columns and files:

    • text: The story text itself (string)
    • validation.csv: Contains a set of short stories for validation (dataframe)
    • train.csv: Contains the text of short stories used for narrative text classification (dataframe)

    The data contained in both files can be used for various types of machine learning tasks related to narrative text classification. These include but are not limited to experiments such as determining story genres, predicting user reactions, sentiment analysis etc.

    To get started, download the validation and train CSV files from the Kaggle dataset page and save them in your local environment. Before going further, you may need to preprocess both files by cleaning up incorrectly formatted values or duplicate entries, if any exist, since these have a large impact on the accuracy of your results.

    The next step is to load the two files into Python pandas DataFrames so they can be manipulated and analyzed with common Natural Language Processing (NLP) tools. This requires only a few lines using pandas functions such as read_csv() and concat(), whether you work in a Jupyter Notebook or with machine learning frameworks such as scikit-learn for tasks more complex than simple NLP operations.

    With the data loaded, you can explore potential connections between different narratives or character traits using supervised machine learning models such as a Naive Bayes classifier, and make discoveries and predictions from this richly annotated TinyStories narrative dataset.
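
    A minimal sketch of the loading step described above, assuming train.csv and validation.csv have been downloaded from the Kaggle page:

    import pandas as pd

    # Load the two provided files into DataFrames.
    train_df = pd.read_csv("train.csv")
    valid_df = pd.read_csv("validation.csv")

    # Combine them if a single corpus is needed.
    all_stories = pd.concat([train_df, valid_df], ignore_index=True)
    print(all_stories['text'].head())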

    Research Ideas

    • Creating a text classification algorithm to automatically categorize short stories by genre.
    • Developing an AI-based summarization tool to quickly summarize the main points in a story.
    • Developing an AI-based story generator that can generate new stories based on existing ones in the dataset

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: validation.csv

    | Column name | Description |
    |:------------|:--------------------------------|
    | text        | The text of the story. (String) |

    File: train.csv

    | Column name | Description |
    |:------------|:----------------------------...

  13. Data for Gradient boosted decision trees reveal nuances of auditory...

    • b2find.eudat.eu
    Updated Jun 20, 2024
    + more versions
    Cite
    (2024). Data for Gradient boosted decision trees reveal nuances of auditory discrimination behavior - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/da3ad647-2df4-55a9-8ab7-001bfe18ec0e
    Explore at:
    Dataset updated
    Jun 20, 2024
    Description

    Raw data for the article: Gradient boosted decision trees reveal nuances of auditory discrimination behaviour (PLOS Computational Biology). This data repository contains the csv files after extraction of the raw MATLAB metadata files into pandas (Python) dataframes (helper function author: Jules Lebert). The csv files can easily be loaded back into dataframe objects using pandas before the subsampling steps are completed (as documented in the paper, we used subsampling to ensure the numbers of F0-roved and control F0 trials were relatively equal). Link to the GitHub repository to run the models on this data: https://github.com/carlacodes/boostmodels. A full description of each of the variables within the dataframe can be found in data_description_instructions_for_datasets_plos_bio.pdf.

    Abstract: Animal psychophysics can generate rich behavioral datasets, often comprised of many 1000s of trials for an individual subject. Gradient-boosted models are a promising machine learning approach for analyzing such data, partly due to the tools that allow users to gain insight into how the model makes predictions. We trained ferrets to report a target word's presence, timing, and lateralization within a stream of consecutively presented non-target words. To assess the animals' ability to generalize across pitch, we manipulated the fundamental frequency (F0) of the speech stimuli across trials, and to assess the contribution of pitch to streaming, we roved the F0 from word token to token. We then implemented gradient-boosted regression and decision trees on the trial outcome and reaction time data to understand the behavioral factors behind the ferrets' decision-making. We visualized model contributions by implementing SHAP feature importance and partial dependency plots. While ferrets could accurately perform the task across all pitch-shifted conditions, our models reveal subtle effects of shifting F0 on performance, with within-trial pitch shifting elevating false alarms and extending reaction times. Our models identified a subset of non-target words that animals commonly false alarmed to. Follow-up analysis demonstrated that the spectrotemporal similarity of target and non-target words, rather than similarity in duration or amplitude waveform, was the strongest predictor of the likelihood of false alarming. Finally, we compared the results with those obtained with traditional mixed effects models, revealing equivalent or better performance for the gradient-boosted models over these approaches.

  14. Covid-19 Czech Republic

    • kaggle.com
    zip
    Updated Jul 3, 2020
    Cite
    Michal Brezak (2020). Covid-19 Czech Republic [Dataset]. https://www.kaggle.com/michalbrezk/covid19-czech-republic
    Explore at:
    Available download formats: zip (116897 bytes)
    Dataset updated
    Jul 3, 2020
    Authors
    Michal Brezak
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    Czechia
    Description

    Context

    This dataset has been collected from multiple sources provided by MVCR on their websites and contains daily summarized statistics as well as detailed statistics down to age and sex level.

    Content

    What's inside is more than just rows and columns. Make it easy for others to get started by describing how you acquired the data and what time period it represents, too.

    Columns description

    • Date - Calendar date when data were collected
    • Daily tested - Sum of tests performed
    • Daily infected - Sum of confirmed cases that were positive
    • Daily cured - Sum of cured people that no longer have Covid-19
    • Daily deaths - Sum of people who died of Covid-19
    • Daily cum tested - Cumulative sum of tests performed
    • Daily cum infected - Cumulative sum of confirmed cases that were positive
    • Daily cum cured - Cumulative sum of cured people that no longer have Covid-19
    • Daily cum deaths - Cumulative sum of people who died of Covid-19
    • Region - Region of the Czech republic
    • Sub-Region - Sub-Region of the Czech republic
    • Region accessories qty - Quantity of health care accessories delivered to the region for all time
    • Age - Age of person
    • Sex - Sex of person
    • Infected - Sum of infected people for a specific date, region, sub-region, age and sex
    • Cured - Sum of cured people for a specific date, region, sub-region, age and sex
    • Death - Sum of people who died of Covid-19 for a specific date, region, sub-region, age and sex

    Data granularity

    The dataset contains data at different levels of granularity. Make sure you do not mix different granularities. Let's suppose you have loaded the data into a pandas dataframe called df.

    Day level

    df_daily = df.groupby(['date']).max()[['daily_tested','daily_infected','daily_cured','daily_deaths','daily_cum_tested','daily_cum_infected','daily_cum_cured','daily_cum_deaths']].reset_index()
    

    Region level

    df_region = df[df['region'] != ''].groupby(['region']).agg(
      region_accessories_qty=pd.NamedAgg(column='region_accessories_qty', aggfunc='max'), 
      infected=pd.NamedAgg(column='infected', aggfunc='sum'),
      cured=pd.NamedAgg(column='cured', aggfunc='sum'),
      death=pd.NamedAgg(column='death', aggfunc='sum')
    ).reset_index()
    

    Detail level

    df_detail = df[['date','region','sub_region','age','sex','infected','cured','death']].reset_index(drop=True)
    

    Acknowledgements

    Thanks to websites of MVCR for sharing such great information.

    Inspiration

    Can you see a relation between the health care accessories delivered to a region and the number of cured/infected in that region? Why does the Czech Republic rank among the relatively safe countries with regard to the Covid-19 pandemic? Can you find out how the evolution of the pandemic in the Czech Republic differs from that in surrounding countries, like Germany or Slovakia?

  15. Named Entity Recognition (NER) Corpus

    • kaggle.com
    Updated Jan 14, 2022
    Cite
    Naser Al-qaydeh (2022). Named Entity Recognition (NER) Corpus [Dataset]. https://www.kaggle.com/datasets/naseralqaydeh/named-entity-recognition-ner-corpus
    Explore at:
    Croissant (a format for machine-learning datasets; learn more about this at mlcommons.org/croissant)
    Dataset updated
    Jan 14, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Naser Al-qaydeh
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Task

    Named Entity Recognition(NER) is a task of categorizing the entities in a text into categories like names of persons, locations, organizations, etc.

    Dataset

    Each row in the CSV file contains a complete sentence, a list of POS tags for each word in the sentence, and a list of NER tags for each word in the sentence.

    You can use a Pandas DataFrame to read and manipulate this dataset.

    Since each row in the CSV file contains lists, if we read the file with pandas.read_csv() and try to get a tag list by indexing, the result will be a string:

    ```
    >>> data['tag'][0]
    "['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-gpe', 'O', 'O', 'O', 'O', 'O']"
    >>> type(data['tag'][0])
    <class 'str'>
    ```

    You can use the following to convert it back to a list:

    ```
    >>> from ast import literal_eval
    >>> literal_eval(data['tag'][0])
    ['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-gpe', 'O', 'O', 'O', 'O', 'O']
    >>> type(literal_eval(data['tag'][0]))
    <class 'list'>
    ```
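
    To convert the whole column at once, the same conversion can be applied element-wise (a small follow-up sketch using the data variable from above):

    from ast import literal_eval

    # Parse every stringified list in the 'tag' column back into a Python list.
    data['tag'] = data['tag'].apply(literal_eval)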

    Acknowledgements

    This dataset is taken from Annotated Corpus for Named Entity Recognition by Abhinav Walia dataset and then processed.

    Annotated Corpus for Named Entity Recognition is annotated Corpus for Named Entity Recognition using GMB(Groningen Meaning Bank) corpus for entity classification with enhanced and popular features by Natural Language Processing applied to the data set.

    Essential info about entities:

    • geo = Geographical Entity
    • org = Organization
    • per = Person
    • gpe = Geopolitical Entity
    • tim = Time indicator
    • art = Artifact
    • eve = Event
    • nat = Natural Phenomenon
  16. Online Retail Knowledge Graph Datasets

    • kaggle.com
    Updated May 9, 2025
    Cite
    Yunus Bilgiç (2025). Online Retail Knowledge Graph Datasets [Dataset]. https://www.kaggle.com/datasets/yunusbilgi/online-retail-knowledge-graph-datasets/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more about this at mlcommons.org/croissant)
    Dataset updated
    May 9, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Yunus Bilgiç
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset Description: Online Retail Transaction Data

    This dataset contains transactional data from an online retail store, including customer purchases, product details, invoice information, and country-specific data. The dataset is structured into four main files:

    Invoices.csv – Contains invoice-related details such as date and customer information.

    Products.csv – Includes product-specific data like stock codes, descriptions, and unit prices.

    Invoice_rel_product.csv – Represents the relationship between invoices and products, detailing quantities purchased.

    Customers.csv – Provides customer identifiers and their respective countries.

    Column Descriptions:

    InvoiceNo: Unique identifier for each order (invoices starting with "C" indicate refunds/cancellations).

    InvoiceDate: The date and time when the invoice was issued.

    StockCode: Unique code assigned to each product.

    Description: Name or description of the product.

    UnitPrice: Price per unit of the product (in GBP).

    Quantity: Number of units purchased per transaction.

    CustomerID: Unique identifier for each customer.

    Country: The country from which the order was placed.

    Preprocessing Notes:

    - Refund Flag: Invoices starting with "C" were marked with an additional feature {is_return: True/False} in the graph database to distinguish refunded transactions.

    - Data Cleaning: Rows with negative values in UnitPrice or Quantity were removed using a Pandas DataFrame for consistency.
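
    A hedged sketch of these preprocessing steps with pandas (the column-to-file mapping is assumed from the descriptions above; adjust it to the actual file layout):

    import pandas as pd

    invoices = pd.read_csv("Invoices.csv")
    products = pd.read_csv("Products.csv")
    rel = pd.read_csv("Invoice_rel_product.csv")

    # Refund flag: invoices starting with "C" indicate returns/cancellations.
    invoices['is_return'] = invoices['InvoiceNo'].astype(str).str.startswith('C')

    # Data cleaning: drop rows with negative UnitPrice or Quantity.
    products = products[products['UnitPrice'] >= 0]
    rel = rel[rel['Quantity'] >= 0]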

  17. TREC

    • zenodo.org
    bin, txt, zip
    Updated Jan 21, 2023
    Cite
    N/A; N/A (2023). TREC [Dataset]. http://doi.org/10.5281/zenodo.7555342
    Explore at:
    Available download formats: bin, txt, zip
    Dataset updated
    Jan 21, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    N/A; N/A
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    TREC, with 5,952 documents (i.e. questions), is a question classification dataset in which the task is to classify a question into 6 main subject categories: human, location, entity, abbreviation, description, and numeric value.

    The files:
    texts.txt: Document set (text). One per line.
    score.txt: Document class whose index is associated with texts.txt
    split_
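
    A minimal sketch of loading the documents and their classes into a pandas DataFrame, assuming texts.txt and score.txt have been extracted to the working directory:

    import pandas as pd

    # One document per line; class indices in score.txt align with texts.txt.
    with open("texts.txt", encoding="utf-8") as f:
        texts = [line.rstrip("\n") for line in f]
    with open("score.txt", encoding="utf-8") as f:
        labels = [line.strip() for line in f]

    df = pd.DataFrame({"text": texts, "label": labels})
    print(df.head())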

  18. glassDef dataset: metallic glass deformation

    • data.niaid.nih.gov
    • zenodo.org
    Updated Dec 24, 2023
    Cite
    Rene Alvarez-Donado (2023). glassDef dataset: metallic glass deformation [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7736625
    Explore at:
    Dataset updated
    Dec 24, 2023
    Dataset provided by
    Rene Alvarez-Donado
    Kamran Karimi
    Amin Esfandiarpour
    Stefanos Papanikolaou
    Mikko J. Alava
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    The glassDef dataset contains a set of text-based LAMMPS dump files corresponding to shear deformation tests on different bulk metallic glasses. This includes FeNi, CoNiFe, CoNiCrFe, CoCrFeMn, CoNiCrFeMn, and Co5Cr2Fe40Mn27Ni26 amorphous alloys with data files that exist in relevant subdirectories. Each dump file corresponds to multiple realizations and includes the dimensions of the simulation box as well as atom coordinates, the atom ID, and associated type of nearly 50,000 atoms.

    Load glassDef Dataset in Python

    The glassDef dataset may be loaded in Python into a Pandas DataFrame. To go into the relevant subdirectory, run cd glass{glass_name}/Run[0-3]/, where "glass_name" denotes the chemical composition. Each subdirectory contains at least three glass realizations within subfolders that are labeled as "Run[0-3]".

    cd glassFeNi/Run0; python

    import pandas

    df = pandas.read_csv("FeNi_glass.dump",skiprows=9)

    One may display an assigned DataFrame in the form of a table:

    df.head()

    To learn more about further analyses performed on the loaded data, please refer to the paper cited below.

    glassDef Dataset Structure

    glassDef Data Fields

    Dump files: “id”, “type”, “x”, “y”, “z”.

    glassDef Dataset Description

    Paper: Karimi, Kamran, Amin Esfandiarpour, René Alvarez-Donado, Mikko J. Alava, and Stefanos Papanikolaou. "Shear banding instability in multicomponent metallic glasses: Interplay of composition and short-range order." Physical Review B 105, no. 9 (2022): 094117.

    Contact: kamran.karimi@ncbj.gov.pl

  19. Dataset of Leak Simulations in Experimental Testbed Water Distribution...

    • data.mendeley.com
    Updated Dec 12, 2022
    + more versions
    Cite
    Mohsen Aghashahi (2022). Dataset of Leak Simulations in Experimental Testbed Water Distribution System [Dataset]. http://doi.org/10.17632/tbrnp6vrnj.1
    Explore at:
    Dataset updated
    Dec 12, 2022
    Authors
    Mohsen Aghashahi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the first fully labeled open dataset for leak detection and localization in water distribution systems. This dataset includes two hundred and eighty signals acquired from a laboratory-scale water distribution testbed with four types of induced leaks and no-leak. The testbed was 47 m long built from 152.4 mm diameter PVC pipes. Two accelerometers (A1 and A2), two hydrophones (H1 and H2), and two dynamic pressure sensors (P1 and P2) were deployed to measure acceleration, acoustic, and dynamic pressure data. The data were recorded through controlled experiments where the following were changed: network architecture, leak type, background flow condition, background noise condition, and sensor types and locations. Each signal was recorded for 30 seconds. Network architectures were looped (LO) and branched (BR). Leak types were Longitudinal Crack (LC), Circumferential Crack (CC), Gasket Leak (GL), Orifice Leak (OL), and No-leak (NL). Background flow conditions included 0 L/s (ND), 0.18 L/s, 0.47 L/s, and Transient (background flow rate abruptly changed from 0.47 L/s to 0 L/s at the second 20th of 30-second long measurements). Background noise conditions, with noise (N) and without noise (NN), determined whether a background noise was present during acoustic data measurements. Accelerometer and dynamic pressure data are in ‘.csv’ format, and the hydrophone data are in ‘.raw’ format with 8000 Hz frequency. The file “Python code to convert raw acoustic data to pandas DataFrame.py” converts the raw hydrophone data to DataFrame in Python.
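
    A hedged sketch of reading one of the .raw hydrophone files into a DataFrame; the sample encoding (16-bit integers) is an assumption, so check the provided converter script ("Python code to convert raw acoustic data to pandas DataFrame.py") for the actual format:

    import numpy as np
    import pandas as pd

    SAMPLE_RATE = 8000  # Hz, as stated in the dataset description

    # Assumption: 16-bit integer samples; verify against the provided converter script.
    samples = np.fromfile("example_hydrophone.raw", dtype=np.int16)
    df = pd.DataFrame({
        "time_s": np.arange(len(samples)) / SAMPLE_RATE,
        "amplitude": samples,
    })
    print(df.head())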

  20. Binary Stanford Sentiment Treebank 2 (SST-2)

    • zenodo.org
    • explore.openaire.eu
    bin, txt, zip
    Updated Jan 21, 2023
    Cite
    N/a; N/a (2023). Binary Stanford Sentiment Treebank 2 (SST-2) [Dataset]. http://doi.org/10.5281/zenodo.7555310
    Explore at:
    Available download formats: txt, bin, zip
    Dataset updated
    Jan 21, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    N/a; N/a
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Binary Stanford Sentiment Treebank (SST-2) is a binary version of the SST and Movie Review dataset (the neutral class was removed); that is, the data is classified only into positive and negative classes.

    The files:
    texts.txt: Document set (text). One per line.
    score.txt: Document class whose index is associated with texts.txt
    split_
