62 datasets found
  1. Dataset made from a Pandas Dataframe

    • peter.demo.socrata.com
    csv, xlsx, xml
    Updated Jul 5, 2017
    Cite
    (2017). Dataset made from a Pandas Dataframe [Dataset]. https://peter.demo.socrata.com/dataset/Dataset-made-from-a-Pandas-Dataframe/w2r9-3vfi
    Available download formats: xlsx, csv, xml
    Dataset updated
    Jul 5, 2017
    Description

    a description

  2. PandasPlotBench

    • huggingface.co
    Updated Nov 25, 2024
    Cite
    JetBrains Research (2024). PandasPlotBench [Dataset]. https://huggingface.co/datasets/JetBrains-Research/PandasPlotBench
    Dataset updated
    Nov 25, 2024
    Dataset provided by
    JetBrains (http://jetbrains.com/)
    Authors
    JetBrains Research
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    PandasPlotBench

    PandasPlotBench is a benchmark to assess the capability of models in writing the code for visualizations given the description of the Pandas DataFrame. 🛠️ Task. Given the plotting task and the description of a Pandas DataFrame, write the code to build a plot. The dataset is based on the Matplotlib gallery. The paper can be found on arXiv: https://arxiv.org/abs/2412.02764v1. To score your model on this dataset, you can use our GitHub repository. 📩 If you have… See the full description on the dataset page: https://huggingface.co/datasets/JetBrains-Research/PandasPlotBench.

  3. pandas-issues

    • huggingface.co
    Cite
    Clyde Cossey, pandas-issues [Dataset]. https://huggingface.co/datasets/cicboy/pandas-issues
    Authors
    Clyde Cossey
    License

    https://choosealicense.com/licenses/unknown/

    Description

    Pandas GitHub Issues

    This dataset contains 5,000 GitHub issues collected from the pandas-dev/pandas repository. It includes issue metadata, content, labels, user information, timestamps, and comments.
    The dataset is suitable for text classification, multi-label classification, and document retrieval tasks.

      Dataset Structure
    

    Columns:

    id — Internal ID of the issue (int64)
    number — GitHub issue number (int64)
    title — Title of the issue (string)
    state — Issue… See the full description on the dataset page: https://huggingface.co/datasets/cicboy/pandas-issues.

  4. Pandas

    • kaggle.com
    Updated Feb 27, 2024
    Cite
    Shail_2604 (2024). Pandas [Dataset]. https://www.kaggle.com/datasets/shail2604/pandas/code
    Dataset updated
    Feb 27, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Shail_2604
    Description

    Dataset

    This dataset was created by Shail_2604

    Released under Other (specified in description)

    Contents

  5. polyOne Data Set - 100 million hypothetical polymers including 29 properties...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Mar 24, 2023
    Cite
    Rampi Ramprasad (2023). polyOne Data Set - 100 million hypothetical polymers including 29 properties [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7124187
    Dataset updated
    Mar 24, 2023
    Dataset provided by
    Rampi Ramprasad
    Christopher Kuenneth
    Description

    polyOne Data Set

    The data set contains 100 million hypothetical polymers each with 29 predicted properties using machine learning models. We use PSMILES strings to represent polymer structures, see here and here. The polymers are generated by decomposing previously synthesized polymers into unique chemical fragments. Random and enumerative compositions of these fragments yield 100 million hypothetical PSMILES strings. All PSMILES strings are chemically valid polymers but, mostly, have never been synthesized before. More information can be found in the paper. Please note the license agreement in the LICENSE file.

    Full data set including the properties

    The data files are in Apache Parquet format. The files start with polyOne_*.parquet.

    I recommend using dask (pip install dask) to load and process the data set. Pandas also works but is slower.

    Load sharded data set with dask:

    import dask.dataframe as dd
    ddf = dd.read_parquet("*.parquet", engine="pyarrow")

    For example, compute the description of data set:

    df_describe = ddf.describe().compute()
    df_describe

    
    
    PSMILES strings only

    • generated_polymer_smiles_train.txt - 80 million PSMILES strings for training polyBERT. One string per line.
    • generated_polymer_smiles_dev.txt - 20 million PSMILES strings for testing polyBERT. One string per line.
    
  6. Table5_Whole genome bisulfite sequencing reveals DNA methylation roles in...

    • figshare.com
    xlsx
    Updated Jun 13, 2023
    + more versions
    Cite
    Xiaodie Jie; Honglin Wu; Miao Yang; Ming He; Guangqing Zhao; Shanshan Ling; Yan Huang; Bisong Yue; Nan Yang; Xiuyue Zhang (2023). Table5_Whole genome bisulfite sequencing reveals DNA methylation roles in the adaptive response of wildness training giant pandas to wild environment.XLSX [Dataset]. http://doi.org/10.3389/fgene.2022.995700.s005
    Available download formats: xlsx
    Dataset updated
    Jun 13, 2023
    Dataset provided by
    Frontiers
    Authors
    Xiaodie Jie; Honglin Wu; Miao Yang; Ming He; Guangqing Zhao; Shanshan Ling; Yan Huang; Bisong Yue; Nan Yang; Xiuyue Zhang
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    DNA methylation modification can regulate gene expression without changing the genome sequence, which helps organisms to rapidly adapt to new environments. However, few studies have been reported in non-model mammals. Giant panda (Ailuropoda melanoleuca) is a flagship species for global biodiversity conservation. Wildness and reintroduction of giant pandas are the important content of giant pandas’ protection. However, it is unclear how wildness training affects the epigenetics of giant pandas, and we lack the means to assess the adaptive capacity of wildness training giant pandas. We comparatively analyzed genome-level methylation differences in captive giant pandas with and without wildness training to determine whether methylation modification played a role in the adaptive response of wildness training pandas. The whole genome DNA methylation sequencing results showed that genomic cytosine methylation ratio of all samples was 5.35%–5.49%, and the methylation ratio of the CpG site was the highest. Differential methylation analysis identified 544 differentially methylated genes (DMGs). The results of KEGG pathway enrichment of DMGs showed that VAV3, PLCG2, TEC and PTPRC participated in multiple immune-related pathways, and may participate in the immune response of wildness training giant pandas by regulating adaptive immune cells. A large number of DMGs enriched in GO terms may also be related to the regulation of immune activation during wildness training of giant pandas. Promoter differentially methylation analysis identified 1,199 genes with differential methylation at promoter regions. Genes with low methylation level at promoter regions and high expression such as, CCL5, P2Y13, GZMA, ANP32A, VWF, MYOZ1, NME7, MRPS31 and TPM1 were important in environmental adaptation for wildness training giant pandas. 
The methylation and expression patterns of these genes indicated that wildness training giant pandas have strong immunity, blood coagulation, athletic abilities and disease resistance. The adaptive response of giant pandas undergoing wildness training may be regulated by their negatively related promoter methylation. We are the first to describe the DNA methylation profile of giant panda blood tissue and our results indicated methylation modification is involved in the adaptation of captive giant pandas when undergoing wildness training. Our study also provided potential monitoring indicators for the successful reintroduction of valuable and threatened animals to the wild.

  7. Data from: Compressed table of cloud field metrics computed for a dataset of...

    • figshare.com
    hdf
    Updated Sep 26, 2020
    Cite
    Martin Janssens (2020). Compressed table of cloud field metrics computed for a dataset of satellite observations [Dataset]. http://doi.org/10.6084/m9.figshare.12687302.v1
    Available download formats: hdf
    Dataset updated
    Sep 26, 2020
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Martin Janssens
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    This file comprises an hdf5-compressed table intended for use with the Python package Pandas. Its columns describe 42 metrics, or computational details on those metrics; its rows are scenes, indexed by a string according to the format "yyyy-mm-dd-s-n", where:

    • y: year
    • m: month
    • d: day
    • s: satellite (a - Aqua, t - Terra)
    • n: scene number on the date

    The file's metadata contains a dictionary that converts column headers into more legible descriptions. See e.g. https://stackoverflow.com/a/29130146 for instructions to load this data. Use keyword 'mydata' to access the data and metadata in the file.
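
    The row-label format described above can be split into its parts with a few lines of Python. This is only a sketch: the names `SceneKey` and `parse_scene_index` are hypothetical, and the example label is made up rather than taken from the file.

```python
from datetime import date
from typing import NamedTuple

# Satellite codes as given in the description: a = Aqua, t = Terra.
SATELLITES = {"a": "Aqua", "t": "Terra"}


class SceneKey(NamedTuple):
    day: date
    satellite: str
    scene_number: int


def parse_scene_index(label: str) -> SceneKey:
    """Split a 'yyyy-mm-dd-s-n' row label into date, satellite and scene number."""
    year, month, day, sat, number = label.split("-")
    return SceneKey(date(int(year), int(month), int(day)), SATELLITES[sat], int(number))


# Hypothetical label, for illustration only.
key = parse_scene_index("2010-03-07-a-2")
print(key.day.isoformat(), key.satellite, key.scene_number)
```

    Parsing the index this way makes it possible to group or filter scenes by date or satellite once the table is loaded.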

  8. SELTO Dataset

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip
    Updated May 23, 2023
    + more versions
    Cite
    Sören Dittmer; David Erzmann; Henrik Harms; Rielson Falck; Marco Gosch (2023). SELTO Dataset [Dataset]. http://doi.org/10.5281/zenodo.7034899
    Available download formats: application/gzip
    Dataset updated
    May 23, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sören Dittmer; David Erzmann; Henrik Harms; Rielson Falck; Marco Gosch
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    A Benchmark Dataset for Deep Learning-based Methods for 3D Topology Optimization.

    One can find a description of the provided dataset partitions in Section 3 of Dittmer, S., Erzmann, D., Harms, H., Maass, P., SELTO: Sample-Efficient Learned Topology Optimization (2022) https://arxiv.org/abs/2209.05098.


    Every dataset container consists of multiple enumerated pairs of CSV files. Each pair describes a unique topology optimization problem and a corresponding binarized SIMP solution. Every file of the form {i}.csv contains all voxel-wise information about the sample i. Every file of the form {i}_info.csv contains scalar parameters of the topology optimization problem, such as material parameters.


    This dataset represents topology optimization problems and solutions on the basis of voxels. We define all spatially varying quantities via the voxels' centers -- rather than via the vertices or surfaces of the voxels.
    In {i}.csv files, each row corresponds to one voxel in the design space. The columns correspond to ['x', 'y', 'z', 'design_space', 'dirichlet_x', 'dirichlet_y', 'dirichlet_z', 'force_x', 'force_y', 'force_z', 'density'].

    • x, y, z - These are three integer indices stating the index/location of the voxel within the voxel mesh.
    • design_space - This is one ternary variable indicating the type of material density constraint on the voxel within the TO problem formulation. "0" and "1" indicate a material density fixed at 0 or 1, respectively. "-1" indicates the absence of constraints.
    • dirichlet_x, dirichlet_y, dirichlet_z - These are three binary variables defining whether the voxel contains homogeneous Dirichlet constraints in the respective axis direction.
    • force_x, force_y, force_z - These are three floating point variables giving the three spatial components of the forces applied to each voxel. All forces are body forces given in [N/m^3].
    • density - This is a binary variable stating whether the voxel carries material in the solution of the topology optimization problem.

    Any of these files with the index i can be imported using pandas by executing:

    import pandas as pd
    
    directory = ...
    file_path = f'{directory}/{i}.csv'
    column_names = ['x', 'y', 'z', 'design_space','dirichlet_x', 'dirichlet_y', 'dirichlet_z', 'force_x', 'force_y', 'force_z', 'density']
    data = pd.read_csv(file_path, names=column_names)

    From this pandas dataframe one can extract the torch tensors of forces F, Dirichlet conditions ω_Dirichlet, and design space information ω_design using the following functions:

    import torch
    
    def get_shape_and_voxels(data):
      shape = data[['x', 'y', 'z']].iloc[-1].values.astype(int) + 1
      vox_x = data['x'].values
      vox_y = data['y'].values
      vox_z = data['z'].values
      voxels = [vox_x, vox_y, vox_z]
      return shape, voxels
    
    
    def get_forces_boundary_conditions_and_design_space(data, shape, voxels):
      F = torch.zeros(3, *shape, dtype=torch.float32)
      F[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(data['force_x'].values, dtype=torch.float32)
      F[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(data['force_y'].values, dtype=torch.float32)
      F[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(data['force_z'].values, dtype=torch.float32)
    
      ω_Dirichlet = torch.zeros(3, *shape, dtype=torch.float32)
      ω_Dirichlet[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(data['dirichlet_x'].values, dtype=torch.float32)
      ω_Dirichlet[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(data['dirichlet_y'].values, dtype=torch.float32)
      ω_Dirichlet[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(data['dirichlet_z'].values, dtype=torch.float32)
    
      ω_design = torch.zeros(1, *shape, dtype=int)
      ω_design[:, voxels[0], voxels[1], voxels[2]] = torch.from_numpy(data['design_space'].values.astype(int))
      return F, ω_Dirichlet, ω_design

    The corresponding {i}_info.csv files only have one row with column labels ['E', 'ν', 'σ_ys', 'vox_size', 'p_x', 'p_y', 'p_z'].

    • E - Young's modulus [Pa]
    • ν - Poisson's ratio [-]
    • σ_ys - Yield stress [Pa]
    • vox_size - Length of the edge of a (cube-shaped) voxel [m]
    • p_x, p_y, p_z - Location of the root of the design space [m]

    Analogously to above, one can import any {i}_info.csv file by executing:

    file_path = f'{directory}/{i}_info.csv'
    data_info_column_names = ['E', 'ν', 'σ_ys', 'vox_size', 'p_x', 'p_y', 'p_z']
    data_info = pd.read_csv(file_path, names=data_info_column_names)
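
    Because density is binary and every voxel is a cube with edge length vox_size, the two files can be combined to estimate how much material a solution uses. The following is a minimal sketch: the helper name `material_volume` and the sample values are hypothetical, and it works on plain Python sequences so it does not depend on how the CSV files were parsed.

```python
def material_volume(density, vox_size):
    """Return (volume fraction, material volume in m^3) for one sample.

    density  -- iterable of 0/1 values, one per voxel (the 'density' column)
    vox_size -- edge length of a cube-shaped voxel in metres
    """
    density = list(density)
    filled = sum(1 for d in density if d == 1)
    fraction = filled / len(density)
    return fraction, filled * vox_size ** 3


# Hypothetical values for illustration; with the dataframes loaded above one
# would pass the 'density' column and the 'vox_size' entry instead.
fraction, volume = material_volume([1, 0, 1, 1], vox_size=0.5)
print(fraction, volume)
```

    The fraction here is taken over all voxels listed in the {i}.csv file, which is an assumption about what the relevant reference volume is.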

  9. Data from: First Steps toward the Giant Panda Metabolome Database:...

    • acs.figshare.com
    • datasetcatalog.nlm.nih.gov
    xlsx
    Updated May 31, 2023
    Cite
    Chenglin Zhu; Luca Laghi; Zhizhong Zhang; Yongguo He; Daifu Wu; Hemin Zhang; Yan Huang; Caiwu Li; Likou Zou (2023). First Steps toward the Giant Panda Metabolome Database: Untargeted Metabolomics of Feces, Urine, Serum, and Saliva by 1H NMR [Dataset]. http://doi.org/10.1021/acs.jproteome.9b00564.s002
    Available download formats: xlsx
    Dataset updated
    May 31, 2023
    Dataset provided by
    ACS Publications
    Authors
    Chenglin Zhu; Luca Laghi; Zhizhong Zhang; Yongguo He; Daifu Wu; Hemin Zhang; Yan Huang; Caiwu Li; Likou Zou
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
    License information was derived automatically

    Description

    Differences in the concentration of metabolites in the biofluids of animals closely reflect their physiological diversities. In order to set the basis for a metabolomic atlas for giant panda (Ailuropoda melanoleuca), we characterized the metabolome of healthy giant panda feces (23), urine (16), serum (6), and saliva (4) samples by means of 1H NMR. A total of 107 metabolites and a core metabolome of 12 metabolites was quantified across the four biological matrices. Through univariate analysis followed by robust principal component analysis, we were able to describe how the molecular profile observed in giant panda urine and feces was affected by gender and age. Among the molecules modified by age in feces, fucose plays a peculiar role because it is related to the digestion of bamboo’s hemicellulose, which is considered as the main source of energy for giant panda. A metagenomic investigation directed toward this molecule showed that its concentration was indeed positively related to the two-component system pathway and negatively related to the amino sugar and nucleotide sugar metabolism pathway. Such work is meant to provide a robust framework for further -omics research studies on giant panda to accelerate our understanding of the interaction of giant panda with its natural environment.

  10. oldIT2modIT

    • huggingface.co
    Updated Jun 3, 2025
    Cite
    Massimo Romano (2025). oldIT2modIT [Dataset]. https://huggingface.co/datasets/cybernetic-m/oldIT2modIT
    Dataset updated
    Jun 3, 2025
    Authors
    Massimo Romano
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Download the dataset

    At the moment to download the dataset you should use Pandas DataFrame:

    import pandas as pd
    df = pd.read_csv("https://huggingface.co/datasets/cybernetic-m/oldIT2modIT/resolve/main/oldIT2modIT_dataset.csv")

    You can visualize the dataset with:

    df.head()

    To convert into Huggingface dataset:

    from datasets import Dataset
    dataset = Dataset.from_pandas(df)

      Dataset Description
    

    This is an Italian dataset formed by 200 old (ancient) Italian sentence and… See the full description on the dataset page: https://huggingface.co/datasets/cybernetic-m/oldIT2modIT.

  11. Data from: Ecological and anthropogenic drivers of local extinction and...

    • search.dataone.org
    • datasetcatalog.nlm.nih.gov
    • +1more
    Updated Dec 10, 2024
    Cite
    Junfeng Tang; Ronald R. Swaisgood; Megan A. Owen; Xuzhe Zhao; Wei Wei; Mingsheng Hong; Hong Zhou; Jindong Zhang; Zenjun Zhang (2024). Ecological and anthropogenic drivers of local extinction and colonization of giant pandas over the past 30 years [Dataset]. http://doi.org/10.5061/dryad.2280gb60d
    Dataset updated
    Dec 10, 2024
    Dataset provided by
    Dryad Digital Repository
    Authors
    Junfeng Tang; Ronald R. Swaisgood; Megan A. Owen; Xuzhe Zhao; Wei Wei; Mingsheng Hong; Hong Zhou; Jindong Zhang; Zenjun Zhang
    Time period covered
    Jan 1, 2023
    Description

    Understanding the patterns and drivers of species range shifts is essential to disentangle mechanisms driving species’ responses to global change. Here, we quantified local extinction and colonization dynamics of giant pandas (Ailuropoda melanoleuca) using occurrence data collected by harnessing the labor of >1,000 workers and >60,000 worker days for each of the three periods (TP1: 1985-1988, TP2: 1998-2002, and TP3: 2011-2014), and evaluated how these patterns were associated with (1) protected area, (2) local rarity/abundance, and (3) abiotic factors (i.e., climate, land-use and topography). We documented a decreased rate (from 0.433 during TP1-TP2 to 0.317 during TP2-TP3) of local extinction and a relatively stable rate (from 0.060 during TP1-TP2 to 0.056 during TP2-TP3) of local colonization through time. Furthermore, the occupancy gains have exceeded losses by a ratio of approximately 1.5 to 1, illustrating an expansion of the panda’s range at a rate of 1408.3 km²/decade. We also...

    Data from: Ecological and anthropogenic drivers of local extinction and colonization of giant pandas over the past 30 years

    https://doi.org/10.5061/dryad.2280gb60d

    Description of the data and file structure

    Data from: Ecological and anthropogenic drivers of local extinction and colonization of giant pandas over the past 30 years

    Datasets used to identify ecological and anthropogenic drivers of local extinction and colonization of giant pandas over the past 30 years

    Files and variables:

    File:

    R script—Script to run spatial generalized additive models in the programming language R

    TP12_5km_ext.csv — local extinction (loss [1] and persistence [0]), local rarity, local abundance, protected area status, 19 future bioclimatic variables and 10 land use variables during TP1-TP2 at 5 km X 5 km grid cell

    TP12_5km_col.csv — local coloniz...

  12. Stack Overflow tags

    • kaggle.com
    Updated Jan 8, 2021
    Cite
    Abid Ali Awan (2021). Stack Overflow tags [Dataset]. https://www.kaggle.com/kingabzpro/stack-overflow-tags/activity
    Dataset updated
    Jan 8, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Abid Ali Awan
    License

    http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Description

    Context

    How can we tell what programming languages and technologies are used by the most people? How about what languages are growing and which are shrinking, so that we can tell which are most worth investing time in?

    One excellent source of data is Stack Overflow, a programming question and answer site with more than 16 million questions on programming topics. By measuring the number of questions about each technology, we can get an approximate sense of how many people are using it. We're going to use open data from the Stack Exchange Data Explorer to examine how the relative popularity of languages like R, Python, Java and JavaScript has changed over time.

    Content

    Each Stack Overflow question has a tag, which marks a question to describe its topic or technology. For instance, there's a tag for languages like R or Python, and for packages like ggplot2 or pandas.

    We'll be working with a dataset with one observation for each tag in each year. The dataset includes both the number of questions asked in that tag in that year, and the total number of questions asked in that year.
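
    With both counts available per tag and year, relative popularity is simply the tag's share of all questions asked that year. A minimal sketch follows; the rows and the helper name `tag_fraction` are hypothetical and the numbers are made up for illustration.

```python
# Hypothetical rows: (tag, year, questions with that tag, total questions that year).
rows = [
    ("pandas", 2015, 200, 10_000),
    ("pandas", 2016, 450, 15_000),
]


def tag_fraction(number, year_total):
    """Share of a year's questions that carry the given tag."""
    return number / year_total


for tag, year, number, year_total in rows:
    print(tag, year, f"{tag_fraction(number, year_total):.1%}")
```

    Comparing fractions rather than raw counts keeps the trend meaningful even as the total number of questions on the site changes from year to year.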

    Acknowledgements

    DataCamp

  13. Forecasting model of an apartment interior quality assessment

    • dataverse.harvard.edu
    Updated Dec 7, 2020
    Cite
    Oleksii Filimonchuk (2020). Forecasting model of an apartment interior quality assessment [Dataset]. http://doi.org/10.7910/DVN/OGYPHO
    Dataset updated
    Dec 7, 2020
    Dataset provided by
    Harvard Dataverse
    Authors
    Oleksii Filimonchuk
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    The entire project is written in the Python programming language and imports libraries such as NumPy, CV2, PyTorch, Albumentations, Pandas, and atexit. Using pandas, we configure the display output by setting the maximum number of rows and columns. Using time, datetime, and atexit, we create a function that measures when the program started, when it completed, and how long it ran. We read the training data set, which holds the photo names and their assessments, into the variable train, and then use OneHotEncoder to convert the ratings into a one-hot representation. Next, we create a class TestDataset to process the photos stored in a folder on the server: we specify the folder path, describe the transformation used for augmentation, open each photo with CV2, and resize it to 224x224. With Albumentations, we transform the photo and store it in a tensor. We then read the test data set on which the model will be tested and load it with torch.utils.data.DataLoader. After that, we load our pre-trained model (based on ResNet50) and convert its output through a tangential function. The score is kept in a separate column and each class in its own column. Finally, each class is translated into a conditional coefficient for a better understanding of the model's results.

  14. Auction Data Set

    • kaggle.com
    Updated Aug 8, 2024
    Cite
    Steve Shreedhar (2024). Auction Data Set [Dataset]. https://www.kaggle.com/datasets/noob2511/auction-data-set
    Dataset updated
    Aug 8, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Steve Shreedhar
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Columns Definition and Information of the data set

    The auction dataset is a really small data set (19 items) created for the sole purpose of learning the pandas library.

    The auction data set contains 5 columns :

    1. Item: Gives the description of what items are being sold.
    2. Bidding Price: Gives the price at which the item will start being sold.
    3. Selling Price: The selling price tells us at which amount the item was sold.
    4. Calls: Calls indicate the number of times the item's value was raised or decreased by the customer.
    5. Bought By: Gives us the idea which customer bought the item.

    Note: There are missing values, which we will try to fill. And yes some values might not make sense once we make those imputations, but this notebook is for the sole purpose of learning.
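
    One way to fill missing values of the kind mentioned in the note is pandas' fillna. The sketch below uses the five column names listed above, but the rows are made up and the fill choices (median, zero, "Unknown") are assumptions for illustration, not the approach taken in the dataset's notebook.

```python
import pandas as pd

# Hypothetical rows using the five column names from the description.
auction = pd.DataFrame({
    "Item": ["Vase", "Clock", "Lamp"],
    "Bidding Price": [100.0, 250.0, 80.0],
    "Selling Price": [150.0, None, 120.0],
    "Calls": [3, 5, None],
    "Bought By": ["A", None, "B"],
})

filled = auction.copy()
# Numeric gap: use the median of the observed selling prices.
filled["Selling Price"] = filled["Selling Price"].fillna(filled["Selling Price"].median())
# Count-like gap: assume no calls were recorded.
filled["Calls"] = filled["Calls"].fillna(0)
# Categorical gap: mark the buyer as unknown.
filled["Bought By"] = filled["Bought By"].fillna("Unknown")

print(filled.isna().sum().sum())
```

    As the note says, imputed values may not always make sense for a given item; the point is only to practise the mechanics.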

  15. panda-bench

    • huggingface.co
    Updated May 22, 2025
    Cite
    Beijing Institute of AI Safety and Governance (2025). panda-bench [Dataset]. https://huggingface.co/datasets/Beijing-AISI/panda-bench
    Dataset updated
    May 22, 2025
    Dataset authored and provided by
    Beijing Institute of AI Safety and Governance
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    PandaBench

    PandaBench is a comprehensive benchmark for evaluating Large Language Model (LLM) safety, focusing on jailbreak attacks, defense mechanisms, and evaluation methodologies.

    The PandaGuard framework architecture illustrating the end-to-end pipeline for LLM safety evaluation. The system connects three key components: Attackers, Defenders, and Judges.

      Dataset Description
    

    This repository contains the benchmark results from extensive evaluations of various LLMs… See the full description on the dataset page: https://huggingface.co/datasets/Beijing-AISI/panda-bench.

  16. IMDb Top 4070: Explore the Cinema Data

    • kaggle.com
    Updated Aug 15, 2023
    Cite
    K.T.S. Prabhu (2023). IMDb Top 4070: Explore the Cinema Data [Dataset]. https://www.kaggle.com/datasets/ktsprabhu/imdb-top-4070-explore-the-cinema-data
    Dataset updated
    Aug 15, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    K.T.S. Prabhu
    Description

    Description: Dive into the world of exceptional cinema with our meticulously curated dataset, "IMDb's Gems Unveiled." This dataset is a result of an extensive data collection effort based on two critical criteria: IMDb ratings exceeding 7 and a substantial number of votes, surpassing 10,000. The outcome? A treasure trove of 4070 movies meticulously selected from IMDb's vast repository.
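
    The two selection criteria stated above (rating above 7 and more than 10,000 votes) amount to a simple filter. A minimal sketch, with made-up records and hypothetical field names `rating` and `votes`; the dataset's actual column names may differ.

```python
# Hypothetical records; titles and numbers are made up for illustration.
movies = [
    {"title": "Film A", "rating": 8.1, "votes": 250_000},
    {"title": "Film B", "rating": 7.4, "votes": 9_500},
    {"title": "Film C", "rating": 6.9, "votes": 480_000},
]


def meets_criteria(movie, min_rating=7.0, min_votes=10_000):
    """True when the rating exceeds min_rating and the votes exceed min_votes."""
    return movie["rating"] > min_rating and movie["votes"] > min_votes


selected = [m["title"] for m in movies if meets_criteria(m)]
print(selected)
```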

    What sets this dataset apart is its richness and diversity. With more than 20 data points meticulously gathered for each movie, this collection offers a comprehensive insight into each cinematic masterpiece. Our data collection process leveraged the power of Selenium and Pandas modules, ensuring accuracy and reliability.

    Cleaning this vast dataset was a meticulous task, combining both Excel and Python for optimum precision. Analysis is powered by Pandas, Matplotlib, and NLTK, enabling us to uncover hidden patterns, trends, and themes within the realm of cinema.

    Note: The data is collected as of April 2023. Future versions of this analysis include a movie recommendation system. Please do connect for any queries. All Love, No Hate.

  17. Reddit r/AskScience Flair Dataset

    • data.mendeley.com
    Updated May 23, 2022
    + more versions
    Cite
    Sumit Mishra (2022). Reddit r/AskScience Flair Dataset [Dataset]. http://doi.org/10.17632/k9r2d9z999.3
    Dataset updated
    May 23, 2022
    Authors
    Sumit Mishra
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Reddit is a social news, content rating and discussion website. It's one of the most popular sites on the internet. Reddit has 52 million daily active users and approximately 430 million users who use it once a month. Reddit has many subreddits; here we use the r/AskScience subreddit.

    The dataset is extracted from the subreddit /r/AskScience from Reddit. The data was collected between 01-01-2016 and 20-05-2022. It contains 612,668 data points and 25 columns. The dataset includes information about the questions asked on the subreddit, the description of the submission, the flair of the question, NSFW or SFW status, the year of the submission, and more (see the descriptions of individual columns below). The data is extracted using Python and Pushshift's API. A little bit of cleaning is done using NumPy and pandas as well.

    The dataset contains the following columns and descriptions:

    • author - Redditor name.
    • author_fullname - Redditor full name.
    • contest_mode - Contest mode [implement obscured scores and randomized sorting].
    • created_utc - Time the submission was created, represented in Unix time.
    • domain - Domain of the submission.
    • edited - Whether or not the post has been edited.
    • full_link - Link to the post on the subreddit.
    • id - ID of the submission.
    • is_self - Whether or not the submission is a self post (text-only).
    • link_flair_css_class - CSS class used to identify the flair.
    • link_flair_text - Flair on the post, i.e. the link flair's text content.
    • locked - Whether or not the submission has been locked.
    • num_comments - The number of comments on the submission.
    • over_18 - Whether or not the submission has been marked as NSFW.
    • permalink - A permalink for the submission.
    • retrieved_on - Time the submission was ingested.
    • score - The number of upvotes for the submission.
    • description - Description of the submission.
    • spoiler - Whether or not the submission has been marked as a spoiler.
    • stickied - Whether or not the submission is stickied.
    • thumbnail - Thumbnail of the submission.
    • question - Question asked in the submission.
    • url - The URL the submission links to, or the permalink if a self post.
    • year - Year of the submission.
    • banned - Whether or not the submission was banned by a moderator.

    This dataset can be used for flair prediction, NSFW classification, and other text mining and NLP tasks. Exploratory data analysis can also be used to find trends and patterns over the years.
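
    As a rough illustration of the exploratory analysis mentioned above, the sketch below converts created_utc from Unix time to a calendar year and counts submissions per year and flair. It uses only the Python standard library so it runs without the dataset; the three sample rows are invented, and only the column names (created_utc, link_flair_text, over_18) come from the column list.

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical sample rows: the column names follow the dataset description,
# but the values are invented for illustration only.
rows = [
    {"created_utc": 1451606400, "link_flair_text": "Physics", "over_18": False},
    {"created_utc": 1483228800, "link_flair_text": "Biology", "over_18": False},
    {"created_utc": 1483315200, "link_flair_text": "Physics", "over_18": True},
]


def submission_year(created_utc: int) -> int:
    """Convert a Unix timestamp in seconds to its UTC calendar year."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc).year


# Count submissions per (year, flair) pair.
flair_by_year = Counter(
    (submission_year(row["created_utc"]), row["link_flair_text"]) for row in rows
)

for (year, flair), count in sorted(flair_by_year.items()):
    print(year, flair, count)
```

    With the real data, the same grouping could be done in pandas after loading the file; the standard-library version is shown only to keep the sketch self-contained.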

  18. f

    Table1_Immunological characterization of an Italian PANDAS cohort.docx

    • figshare.com
    • datasetcatalog.nlm.nih.gov
    docx
    Updated Jan 4, 2024
    Cite
    Lucia Leonardi; Giulia Lorenzetti; Rita Carsetti; Eva Piano Mortari; Cristiana Alessia Guido; Anna Maria Zicari; Elisabeth Förster-Waldl; Lorenzo Loffredo; Marzia Duse; Alberto Spalice (2024). Table1_Immunological characterization of an Italian PANDAS cohort.docx [Dataset]. http://doi.org/10.3389/fped.2023.1216282.s001
    Explore at:
    Available download formats: docx
    Dataset updated
    Jan 4, 2024
    Dataset provided by
    Frontiers
    Authors
    Lucia Leonardi; Giulia Lorenzetti; Rita Carsetti; Eva Piano Mortari; Cristiana Alessia Guido; Anna Maria Zicari; Elisabeth Förster-Waldl; Lorenzo Loffredo; Marzia Duse; Alberto Spalice
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This cross-sectional study aimed to contribute to the definition of Pediatric Autoimmune Neuropsychiatric Disorders Associated with Streptococcal Infections (PANDAS) pathophysiology. An extensive immunological assessment was conducted to investigate both immune defects, potentially leading to recurrent Group A β-hemolytic Streptococcus (GABHS) infections, and immune dysregulation responsible for a systemic inflammatory state. Twenty-six PANDAS patients with a relapsing-remitting course of disease and 11 controls with recurrent pharyngotonsillitis were enrolled. Each subject underwent a detailed phenotypic and immunological assessment, including a cytokine profile. A possible correlation of immunological parameters with clinical-anamnestic data was analyzed. No inborn errors of immunity were detected in either group using first-level immunological assessments. However, a trend toward higher TNF-α and IL-17 levels, and lower C3 levels, was detected in the PANDAS patients compared to the control group. Maternal autoimmune diseases were described in 53.3% of PANDAS patients, and neuropsychiatric symptoms other than OCD and tics were detected in 76.9% of patients. ASO titer did not differ significantly between the two groups. A possible correlation between enduring inflammation (elevated serum TNF-α and IL-17) and the persistence of neuropsychiatric symptoms in PANDAS patients beyond infectious episodes needs to be addressed. Further studies with larger cohorts would be pivotal to better define the role of TNF-α and IL-17 in PANDAS pathophysiology.

  19. h

    panda

    • huggingface.co
    Updated Jan 17, 2017
    Cite
    AI at Meta (2017). panda [Dataset]. https://huggingface.co/datasets/facebook/panda
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 17, 2017
    Dataset authored and provided by
    AI at Meta
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for PANDA

    Dataset Summary

    PANDA (Perturbation Augmentation NLP DAtaset) consists of approximately 100K pairs of crowdsourced human-perturbed text snippets (original, perturbed). Annotators were given selected terms and target demographic attributes, and instructed to rewrite text snippets along three demographic axes: gender, race and age, while preserving semantic meaning. Text snippets were sourced from a range of text corpora (BookCorpus, Wikipedia, ANLI… See the full description on the dataset page: https://huggingface.co/datasets/facebook/panda.

  20. Tokopedia Product Reviews

    • kaggle.com
    zip
    Updated Jul 21, 2019
    + more versions
    Cite
    Farhan (2019). Tokopedia Product Reviews [Dataset]. https://www.kaggle.com/datasets/farhan999/tokopedia-product-reviews/versions/1/code
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Jul 21, 2019
    Authors
    Farhan
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Tokopedia Product Reviews 2019

    Dataset Description

    This dataset contains 40,607 product reviews from Tokopedia, one of Indonesia's largest e-commerce platforms, scraped in 2019. The dataset provides valuable insights into customer sentiment and shopping behavior in the Indonesian e-commerce market.

    Dataset Summary

    • Language: Indonesian (Bahasa Indonesia)
    • Task: Sentiment Analysis, Product Review Analysis, E-commerce Research
    • Size: 40,607 reviews
    • Categories: 5 product categories
    • Unique Products: 3,647
    • Collection Period: 2019

    Dataset Structure

    Column Description

    • text (string): The review text written by customers
    • rating (int): Rating given by the reviewer (typically 1-5 scale)
    • category (string): Product category, one of:
      • pertukangan (tools/hardware)
      • fashion (fashion)
      • elektronik (electronics)
      • handphone (mobile phones)
      • olahraga (sports)
    • product_name (string): Name of the product
    • product_id (string): Unique identifier for the product
    • sold (int): Number of items sold
    • shop_id (string): Unique identifier for the shop/seller
    • product_url (string): URL link to the product page
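
    The column layout above can be checked before any analysis. The sketch below is one possible sanity check, assuming the CSV header uses exactly the column names listed; the function name check_rows and the sample rows are hypothetical and not part of the dataset. It uses only the standard library so it runs without downloading anything.

```python
import csv
import io

# Column names and category values as listed in the column description.
EXPECTED_COLUMNS = [
    "text", "rating", "category", "product_name",
    "product_id", "sold", "shop_id", "product_url",
]
KNOWN_CATEGORIES = {"pertukangan", "fashion", "elektronik", "handphone", "olahraga"}


def check_rows(csv_text: str) -> list:
    """Return a list of human-readable problems found in the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    problems = []
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        # Without the expected columns the row checks below cannot run.
        problems.append(f"missing columns: {missing}")
        return problems
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        if row["category"] not in KNOWN_CATEGORIES:
            problems.append(f"line {line_no}: unknown category {row['category']!r}")
        if not row["rating"].isdigit():
            problems.append(f"line {line_no}: non-integer rating {row['rating']!r}")
    return problems


# Invented sample data: one valid row and one row with an unlisted category.
sample = (
    "text,rating,category,product_name,product_id,sold,shop_id,product_url\n"
    "bagus,5,fashion,Kaos,p1,10,s1,https://example.com/p1\n"
    "ok,4,mainan,Bola,p2,3,s2,https://example.com/p2\n"
)

for problem in check_rows(sample):
    print(problem)
```

    An empty result means every row matched the documented columns and categories; any other output points at the first rows worth inspecting by hand.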

    Data Splits

    The dataset comes as a single split containing all 40,607 reviews.

    Dataset Statistics

    Category              Count
    Total Reviews         40,607
    Unique Products       3,647
    Product Categories    5
    Language              Indonesian

    Usage

    Loading the Dataset

    python
    
    # ------------------------------------------------------------------
    # Minimal example: download the "Tokopedia Product Reviews" dataset
    # from Kaggle and load it into a pandas DataFrame
    # ------------------------------------------------------------------
    
    # --- KaggleHub (no manual kaggle.json) ------------------
    # Install the required packages first (prefix with "!" in a notebook):
    #   pip install -q --upgrade kagglehub pandas
    
    import os
    
    import kagglehub
    import pandas as pd
    
    # Download the dataset (cached after the first run)
    dataset_path = kagglehub.dataset_download("farhan999/tokopedia-product-reviews")
    print("Dataset saved at:", dataset_path)
    
    # Locate the first CSV file inside the downloaded folder.
    # next() stops at the first match, so later folders cannot overwrite it.
    csv_file = next(
      (
        os.path.join(root, f)
        for root, _, files in os.walk(dataset_path)
        for f in sorted(files)
        if f.lower().endswith(".csv")
      ),
      None,
    )
    
    if csv_file:
      # Load CSV into a DataFrame and show the first few rows
      df = pd.read_csv(csv_file)
      print(df.head())
    else:
      print("No CSV file found in the dataset.")
    

    Potential Use Cases

    • Sentiment Analysis: Classify customer sentiment based on review text and ratings
    • Product Recommendation: Analyze product preferences across different categories
    • Market Research: Understand Indonesian e-commerce customer behavior
    • Natural Language Processing: Train Indonesian language models for e-commerce domain
    • Category Classification: Predict product categories from review text
    • Rating Prediction: Predict customer ratings from review text
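
    For the sentiment analysis use case, ratings are often turned into coarse labels before any model is trained. The sketch below shows one way to do that; the thresholds (4 and above positive, 3 neutral, below 3 negative) are an assumption rather than anything defined by the dataset, and the sample rows are invented.

```python
import csv
import io
from collections import defaultdict

# Invented sample in the documented column layout (subset of columns only).
sample_csv = """text,rating,category
barang bagus,5,elektronik
pengiriman lambat,2,fashion
sesuai deskripsi,4,elektronik
"""


def rating_to_sentiment(rating: int) -> str:
    """Map a 1-5 rating to a coarse label (the thresholds are an assumption)."""
    if rating >= 4:
        return "positive"
    if rating == 3:
        return "neutral"
    return "negative"


# Count sentiment labels per product category.
counts = defaultdict(lambda: defaultdict(int))
for row in csv.DictReader(io.StringIO(sample_csv)):
    label = rating_to_sentiment(int(row["rating"]))
    counts[row["category"]][label] += 1

for category, labels in sorted(counts.items()):
    print(category, dict(labels))
```

    Labels derived this way inherit any skew in the ratings, so it is worth checking the class balance per category before using them as training targets.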

    Data Collection

    The data was collected through web scraping of Tokopedia product pages in 2019. The scraping process captured genuine customer reviews across five major product categories, providing a representative sample of customer feedback on the platform.

    Ethical Considerations

    • This dataset contains public reviews that were posted on Tokopedia's platform
    • Personal information has been anonymized (shop_id and product_id are anonymized identifiers)
    • The data reflects genuine customer opinions and experiences
    • Users should be mindful of potential biases in the data (e.g., selection bias, temporal bias from 2019)

    Limitations

    • Temporal Limitation: Data is from 2019 and may not reflect current market trends
    • Platform Specific: Limited to Tokopedia platform, may not generalize to other Indonesian e-commerce platforms
    • Category Limitation: Only covers 5 product categories
    • Language: Primarily in Indonesian, limiting applicability to other languages

    Citation

    If you use this dataset in your research, please cite:

    @misc{tokopedia-product-reviews-2019,
      title={Tokopedia Product Reviews},
      url={https://www.kaggle.com/dsv/562904},
      DOI={10.34740/KAGGLE/DSV/562904},
      publisher={Kaggle},
      author={M. Farhan},
      year={2019}
    }
    

    Contact

    For questions or issues regarding this dataset, please open an issue in the dataset repository or contact kontak.farhan@gmail.com.

    Acknowledgments

    • Thanks to Tokopedia for providing a platform that enables customer reviews