100+ datasets found
  1. Google Patent Phrase Similarity Dataset

    • kaggle.com
    zip
    Updated Jul 19, 2022
    Cite
    Google (2022). Google Patent Phrase Similarity Dataset [Dataset]. https://www.kaggle.com/datasets/google/google-patent-phrase-similarity-dataset
    Available download formats: zip (364234 bytes)
    Dataset updated
    Jul 19, 2022
    Dataset authored and provided by
    Google (http://google.com/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a human-rated, contextual phrase-to-phrase matching dataset focused on technical terms from patents. In addition to the similarity scores typically included in other benchmark datasets, we include granular rating classes similar to WordNet relations, such as synonym, antonym, hypernym, hyponym, holonym, meronym, and domain-related. The dataset was used in the U.S. Patent Phrase to Phrase Matching competition.

    The dataset was generated with a focus on the following:
    • Phrase disambiguation: certain keywords and phrases can have multiple meanings. For example, the phrase "mouse" may refer to an animal or a computer input device. To help disambiguate the phrases, we include a Cooperative Patent Classification (CPC) class with each pair of phrases.
    • Adversarial keyword match: some phrases share keywords but are otherwise unrelated (e.g. “container section” → “kitchen container”, “offset table” → “table fan”). Many models (e.g. bag-of-words models) do not do well on such data, so our dataset is designed to include many such examples.
    • Hard negatives: we created our dataset with the aim of improving upon current state-of-the-art language models. Specifically, we used the BERT model to generate some of the target phrases, so the dataset contains many human-rated phrase pairs that BERT may identify as very similar but that are in fact not.

    Each entry of the dataset contains two phrases (anchor and target), a context CPC class, a rating class, and a similarity score. The rating classes have the following meanings:
    • 4 - Very high.
    • 3 - High.
    • 2 - Medium.
    • 2a - Hyponym (broad-narrow match).
    • 2b - Hypernym (narrow-broad match).
    • 2c - Structural match.
    • 1 - Low.
    • 1a - Antonym.
    • 1b - Meronym (a part of).
    • 1c - Holonym (a whole of).
    • 1d - Other high-level domain match.
    • 0 - Not related.

    The dataset is split into training (75%), validation (5%), and test (20%) sets. When splitting the data, all entries with the same anchor are kept in the same set. There are 106 different context CPC classes, and all of them are represented in the training set.
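
    As a quick orientation, here is a minimal pandas sketch for loading the competition-style CSV and reproducing the anchor-level split property described above. The file name train.csv and the column names anchor, target, context, and score are assumptions based on this description and may differ in the actual download.

    import pandas as pd

    # Assumed file and column names; adjust to the actual download.
    df = pd.read_csv("train.csv")

    # Each row pairs an anchor phrase with a target phrase in a CPC context.
    print(df[["anchor", "target", "context", "score"]].head())

    # All rows sharing an anchor belong to the same split, so splitting by
    # unique anchors reproduces the scheme described above.
    anchors = df["anchor"].drop_duplicates().sample(frac=0.75, random_state=0)
    train_df = df[df["anchor"].isin(anchors)]
    holdout_df = df[~df["anchor"].isin(anchors)]
    print(len(train_df), len(holdout_df))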

    More details about the dataset are available in the corresponding paper. Please cite the paper if you use the dataset.

  2. Maryland Counties Match Tool for Data Quality

    • catalog.data.gov
    • opendata.maryland.gov
    • +1more
    Updated Oct 25, 2025
    Cite
    opendata.maryland.gov (2025). Maryland Counties Match Tool for Data Quality [Dataset]. https://catalog.data.gov/dataset/maryland-counties-match-tool-for-data-quality
    Dataset updated
    Oct 25, 2025
    Dataset provided by
    opendata.maryland.gov
    Area covered
    Maryland
    Description

    Data standardization is an important part of effective data management, but data from different sources often doesn't match. This dataset lists the different ways county names might be written by different people. It can be used as a lookup table when County needs to be your unique identifier. For example, it lets you match St. Mary's, St Marys, and Saint Mary's so that you can work with disparate data from other datasets.
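
    A minimal sketch of how such a lookup table can be used with pandas; the file name and the column names variant and county are assumptions, since the exact schema is not described here.

    import pandas as pd

    # Hypothetical file and column names: 'variant' (as written) -> 'county' (canonical).
    lookup = pd.read_csv("maryland_counties_match_tool.csv")

    records = pd.DataFrame({"county_raw": ["St Marys", "Saint Mary's", "Baltimore"]})

    # Left-join raw spellings onto the canonical county name.
    cleaned = records.merge(lookup, how="left", left_on="county_raw", right_on="variant")
    print(cleaned[["county_raw", "county"]])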

  3. Job Dataset

    • kaggle.com
    zip
    Updated Sep 17, 2023
    Cite
    Ravender Singh Rana (2023). Job Dataset [Dataset]. https://www.kaggle.com/datasets/ravindrasinghrana/job-description-dataset
    Available download formats: zip (479575920 bytes)
    Dataset updated
    Sep 17, 2023
    Authors
    Ravender Singh Rana
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Job Dataset

    This dataset provides a comprehensive collection of synthetic job postings to facilitate research and analysis in the field of job market trends, natural language processing (NLP), and machine learning. Created for educational and research purposes, this dataset offers a diverse set of job listings across various industries and job types.

    Descriptions for each of the columns in the dataset (a short loading sketch follows the list):

    1. Job Id: A unique identifier for each job posting.
    2. Experience: The required or preferred years of experience for the job.
    3. Qualifications: The educational qualifications needed for the job.
    4. Salary Range: The range of salaries or compensation offered for the position.
    5. Location: The city or area where the job is located.
    6. Country: The country where the job is located.
    7. Latitude: The latitude coordinate of the job location.
    8. Longitude: The longitude coordinate of the job location.
    9. Work Type: The type of employment (e.g., full-time, part-time, contract).
    10. Company Size: The approximate size or scale of the hiring company.
    11. Job Posting Date: The date when the job posting was made public.
    12. Preference: Special preferences or requirements for applicants (e.g., Only Male or Only Female, or Both)
    13. Contact Person: The name of the contact person or recruiter for the job.
    14. Contact: Contact information for job inquiries.
    15. Job Title: The job title or position being advertised.
    16. Role: The role or category of the job (e.g., software developer, marketing manager).
    17. Job Portal: The platform or website where the job was posted.
    18. Job Description: A detailed description of the job responsibilities and requirements.
    19. Benefits: Information about benefits offered with the job (e.g., health insurance, retirement plans).
    20. Skills: The skills or qualifications required for the job.
    21. Responsibilities: Specific responsibilities and duties associated with the job.
    22. Company Name: The name of the hiring company.
    23. Company Profile: A brief overview of the company's background and mission.
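
    As referenced above, a small, hedged example of loading the file and using a few of the documented columns; the file name job_descriptions.csv and the salary-range format are assumptions.

    import pandas as pd

    # Assumed file name for the Kaggle download.
    jobs = pd.read_csv("job_descriptions.csv")

    # A handful of the documented columns for a quick look.
    cols = ["Job Title", "Experience", "Qualifications", "Salary Range", "Work Type", "Country"]
    print(jobs[cols].head())

    # Rough parse of a hypothetical "$59K-$99K" style salary range into a midpoint.
    salary = jobs["Salary Range"].str.extract(r"\$?(\d+)K?-\$?(\d+)K?").astype(float)
    jobs["salary_mid"] = salary.mean(axis=1)
    print(jobs[["Job Title", "salary_mid"]].head())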

    Potential Use Cases:

    • Building predictive models to forecast job market trends.
    • Enhancing job recommendation systems for job seekers.
    • Developing NLP models for resume parsing and job matching.
    • Analyzing regional job market disparities and opportunities.
    • Exploring salary prediction models for various job roles.

    Acknowledgements:

    We would like to express our gratitude to the Python Faker library for its invaluable contribution to the dataset generation process. Additionally, we appreciate the guidance provided by ChatGPT in fine-tuning the dataset, ensuring its quality, and adhering to ethical standards.

    Note:

    Please note that the examples provided are fictional and for illustrative purposes; you can tailor the descriptions and examples to match the specifics of your own use case. The dataset is not suitable for real-world applications and should only be used within the scope of research and experimentation. You can also reach me via email at rrana157@gmail.com

  4. FLM Train Samples

    • kaggle.com
    zip
    Updated Apr 19, 2022
    Cite
    Murat Cihan Sorkun (2022). FLM Train Samples [Dataset]. https://www.kaggle.com/datasets/sorkun/flm-train-samples
    Available download formats: zip (261186264 bytes)
    Dataset updated
    Apr 19, 2022
    Authors
    Murat Cihan Sorkun
    Description

    Random sampling of between 100K and 600K instances from the training data.
    • train-df -> sampled training data
    • match-df -> all matches from the sample
    • sub -> perfect submission from the sampled data
    • sub-naive -> naive submission (only same IDs) from the sampled data

  5. Data from: Graded Matching for Large Observational Studies

    • datasetcatalog.nlm.nih.gov
    • tandf.figshare.com
    Updated Mar 28, 2022
    Cite
    Yu, Ruoqi; Rosenbaum, Paul R. (2022). Graded Matching for Large Observational Studies [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000420071
    Dataset updated
    Mar 28, 2022
    Authors
    Yu, Ruoqi; Rosenbaum, Paul R.
    Description

    Observational studies of causal effects often use multivariate matching to control imbalances in measured covariates. For instance, using network optimization, one may seek the closest possible pairing for key covariates among all matches that balance a propensity score and finely balance a nominal covariate, perhaps one with many categories. This is all straightforward when matching thousands of individuals, but requires some adjustments when matching tens or hundreds of thousands of individuals. In various senses, a sparser network—one with fewer edges—permits optimization in larger samples. The question is: What is the best way to make the network sparse for matching? A network that is too sparse will eliminate from consideration possible pairings that it should consider. A network that is not sparse enough will waste computation considering pairings that do not deserve serious consideration. We propose a new graded strategy in which potential pairings are graded, with a preference for higher grade pairings. We try to match with pairs of the best grade, incorporating progressively lower grade pairs only to the degree they are needed. In effect, only sparse networks are built, stored and optimized. Two examples are discussed, a small example with 1567 matched pairs from clinical medicine, and a slightly larger example with 22,111 matched pairs from economics. The method is implemented in an R package RBestMatch available at https://github.com/ruoqiyu/RBestMatch. Supplementary materials for this article are available online.

  6. Dating App Behavior Dataset 2025

    • kaggle.com
    zip
    Updated Apr 11, 2025
    Cite
    Keyush nisar (2025). Dating App Behavior Dataset 2025 [Dataset]. https://www.kaggle.com/datasets/keyushnisar/dating-app-behavior-dataset
    Available download formats: zip (3558623 bytes)
    Dataset updated
    Apr 11, 2025
    Authors
    Keyush nisar
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset provides a synthetic representation of user behavior on a fictional dating app. It contains 50,000 records with 19 features capturing demographic details, app usage patterns, swipe tendencies, and match outcomes. The data was generated programmatically to simulate realistic user interactions, making it ideal for exploratory data analysis (EDA), machine learning modeling (e.g., predicting match outcomes), or studying user behavior trends in online dating platforms.

    Key features include gender, sexual orientation, location type, income bracket, education level, user interests, app usage time, swipe ratios, likes received, mutual matches, and match outcomes (e.g., "Mutual Match," "Ghosted," "Catfished"). The dataset is designed to be diverse and balanced, with categorical, numerical, and labeled variables for various analytical purposes.

    Usage

    This dataset can be used for:

    • Exploratory Data Analysis (EDA): investigate correlations between demographics, app usage, and match success.
    • Machine Learning: build models to predict match outcomes or user engagement levels.
    • Social Studies: analyze trends in dating app behavior across different demographics.
    • Feature Engineering Practice: experiment with transforming categorical and numerical data.
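
    As a starting point for the match-outcome prediction use case above, a minimal scikit-learn sketch; the file name and the exact label column name (match_outcome) are assumptions based on the description.

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Assumed file and label column names; adjust to the actual CSV schema.
    df = pd.read_csv("dating_app_behavior_dataset.csv")
    target = "match_outcome"  # e.g., "Mutual Match", "Ghosted", "Catfished"

    # One-hot encode the categorical features.
    features = pd.get_dummies(df.drop(columns=[target]))

    X_train, X_test, y_train, y_test = train_test_split(
        features, df[target], test_size=0.2, random_state=0
    )

    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X_train, y_train)
    print("held-out accuracy:", clf.score(X_test, y_test))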

  7. Data from: EyeFi: Fast Human Identification Through Vision and WiFi-based...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Dec 5, 2022
    Cite
    Shiwei Fang; Tamzeed Islam; Sirajum Munir; Shahriar Nirjon; Shiwei Fang; Tamzeed Islam; Sirajum Munir; Shahriar Nirjon (2022). EyeFi: Fast Human Identification Through Vision and WiFi-based Trajectory Matching [Dataset]. http://doi.org/10.5281/zenodo.7396485
    Available download formats: zip
    Dataset updated
    Dec 5, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Shiwei Fang; Tamzeed Islam; Sirajum Munir; Shahriar Nirjon; Shiwei Fang; Tamzeed Islam; Sirajum Munir; Shahriar Nirjon
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    EyeFi Dataset

    This dataset was collected as part of the EyeFi project at the Bosch Research and Technology Center, Pittsburgh, PA, USA. The dataset contains WiFi CSI values of human motion trajectories along with ground-truth location information captured through a camera. This dataset is used in the paper "EyeFi: Fast Human Identification Through Vision and WiFi-based Trajectory Matching", published in the IEEE International Conference on Distributed Computing in Sensor Systems 2020 (DCOSS '20). We also published a dataset paper titled "Dataset: Person Tracking and Identification using Cameras and Wi-Fi Channel State Information (CSI) from Smartphones" in the Data: Acquisition to Analysis 2020 (DATA '20) workshop describing the details of data collection. Please check it out for more information on the dataset.

    Data Collection Setup

    In our experiments, we used an Intel 5300 WiFi Network Interface Card (NIC) installed in an Intel NUC and the Linux CSI tools [1] to extract the WiFi CSI packets. The (x,y) coordinates of the subjects are collected from a Bosch Flexidome IP Panoramic 7000 panoramic camera mounted on the ceiling, and Angles of Arrival (AoAs) are derived from the (x,y) coordinates. Both the WiFi card and the camera are located at the same origin coordinates but at different heights: the camera is located around 2.85 m above the ground and the WiFi antennas are around 1.12 m above the ground.

    The data collection environment consists of two areas: the first is a rectangular space measuring 11.8 m x 8.74 m, and the second is an irregularly shaped kitchen area with maximum distances of 19.74 m and 14.24 m between walls. The kitchen also has numerous obstacles and different materials that pose different RF reflection characteristics, including strong reflectors such as metal refrigerators and dishwashers.

    To collect the WiFi data, we used a Google Pixel 2 XL smartphone as an access point and connected the Intel 5300 NIC to it for WiFi communication. The transmission rate is about 20-25 packets per second. The same WiFi card and phone are used in both the lab and kitchen areas.

    List of Files
    Here is a list of files included in the dataset:

    |- 1_person
      |- 1_person_1.h5
      |- 1_person_2.h5
    |- 2_people
      |- 2_people_1.h5
      |- 2_people_2.h5
      |- 2_people_3.h5
    |- 3_people
      |- 3_people_1.h5
      |- 3_people_2.h5
      |- 3_people_3.h5
    |- 5_people
      |- 5_people_1.h5
      |- 5_people_2.h5
      |- 5_people_3.h5
      |- 5_people_4.h5
    |- 10_people
      |- 10_people_1.h5
      |- 10_people_2.h5
      |- 10_people_3.h5
    |- Kitchen
      |- 1_person
        |- kitchen_1_person_1.h5
        |- kitchen_1_person_2.h5
        |- kitchen_1_person_3.h5
      |- 3_people
        |- kitchen_3_people_1.h5
    |- training
      |- shuffuled_train.h5
      |- shuffuled_valid.h5
      |- shuffuled_test.h5
    View-Dataset-Example.ipynb
    README.md
    
    

    In this dataset, the folders `1_person/`, `2_people/`, `3_people/`, `5_people/`, and `10_people/` contain data collected from the lab area, whereas the `Kitchen/` folder contains data collected from the kitchen area. To see how each file is structured, please see the "Access the data" section below.

    The training folder contains the training dataset we used to train the neural network discussed in our paper. They are generated by shuffling all the data from `1_person/` folder collected in the lab area (`1_person_1.h5` and `1_person_2.h5`).

    Why multiple files in one folder?

    Each folder contains multiple files. For example, the `1_person` folder has two files: `1_person_1.h5` and `1_person_2.h5`. Files in the same folder always have the same number of human subjects present simultaneously in the scene. However, the person holding the phone can differ. Also, the data may have been collected on different days, and/or the data collection system may have needed to be rebooted due to stability issues. As a result, we provide different files (like `1_person_1.h5`, `1_person_2.h5`) to distinguish different phone holders and possible system reboots that introduce different phase offsets (see below).

    Special note:

    `1_person_1.h5` was generated with the same person holding the phone throughout, whereas `1_person_2.h5` contains different people holding the phone, but only one person is present in the area at a time. Both files were also collected on different days.


    Access the data
    To access the data, an HDF5 library is needed to open the dataset. A free HDF5 viewer is available on the official website: https://www.hdfgroup.org/downloads/hdfview/. We also provide an example Python notebook, View-Dataset-Example.ipynb, to demonstrate how to access the data.

    Each file is structured as (except the files under *"training/"* folder):

    |- csi_imag
    |- csi_real
    |- nPaths_1
      |- offset_00
        |- spotfi_aoa
      |- offset_11
        |- spotfi_aoa
      |- offset_12
        |- spotfi_aoa
      |- offset_21
        |- spotfi_aoa
      |- offset_22
        |- spotfi_aoa
    |- nPaths_2
      |- offset_00
        |- spotfi_aoa
      |- offset_11
        |- spotfi_aoa
      |- offset_12
        |- spotfi_aoa
      |- offset_21
        |- spotfi_aoa
      |- offset_22
        |- spotfi_aoa
    |- nPaths_3
      |- offset_00
        |- spotfi_aoa
      |- offset_11
        |- spotfi_aoa
      |- offset_12
        |- spotfi_aoa
      |- offset_21
        |- spotfi_aoa
      |- offset_22
        |- spotfi_aoa
    |- nPaths_4
      |- offset_00
        |- spotfi_aoa
      |- offset_11
        |- spotfi_aoa
      |- offset_12
        |- spotfi_aoa
      |- offset_21
        |- spotfi_aoa
      |- offset_22
        |- spotfi_aoa
    |- num_obj
    |- obj_0
      |- cam_aoa
      |- coordinates
    |- obj_1
      |- cam_aoa
      |- coordinates
    ...
    |- timestamp
    

    The `csi_real` and `csi_imag` are the real and imaginary parts of the CSI measurements. The order of antennas and subcarriers for the 90 `csi_real` and `csi_imag` values is as follows: [subcarrier1-antenna1, subcarrier1-antenna2, subcarrier1-antenna3, subcarrier2-antenna1, subcarrier2-antenna2, subcarrier2-antenna3, … subcarrier30-antenna1, subcarrier30-antenna2, subcarrier30-antenna3]. The `nPaths_x` groups contain SpotFi [2] calculated WiFi Angle of Arrival (AoA) values with `x` multipath components specified during calculation. Under each `nPaths_x` group are `offset_xx` subgroups, where `xx` stands for the offset combination used to correct the phase offset during the SpotFi calculation. We measured the offsets as:

    | Antennas | Offset 1 (rad) | Offset 2 (rad) |
    |:--------:|:--------------:|:--------------:|
    |  1 & 2   |     1.1899     |    -2.0071     |
    |  1 & 3   |     1.3883     |    -1.8129     |
    
    

    The measurement is based on the work in [3], where the authors state that there are two possible offsets between two antennas, which we measured by rebooting the device multiple times. The combination of the offsets is used for the `offset_xx` naming. For example, `offset_12` means that offset 1 between antennas 1 & 2 and offset 2 between antennas 1 & 3 were used in the SpotFi calculation.

    The `num_obj` field stores the number of human subjects present in the scene. `obj_0` is always the subject holding the phone. In each file, there are `num_obj` instances of `obj_x`. For each `obj_x`, we have the `coordinates` reported from the camera and `cam_aoa`, the AoA estimated from the camera-reported coordinates. The (x,y) coordinates and AoA listed here are chronologically ordered (except in the files in the `training` folder). They reflect the way the person carrying the phone moved in the space (for `obj_0`) and the way everyone else walked (for the other `obj_y`, where `y` > 0).

    The `timestamp` is provided as a time reference for each WiFi packet.

    To access the data (Python):

    import h5py

    # Open one of the capture files in read-only mode.
    data = h5py.File('3_people_3.h5', 'r')

    # Raw CSI measurements (real and imaginary parts), one row per WiFi packet.
    csi_real = data['csi_real'][()]
    csi_imag = data['csi_imag'][()]

    # Camera-derived AoA and (x,y) coordinates for the phone holder (obj_0).
    cam_aoa = data['obj_0/cam_aoa'][()]
    cam_loc = data['obj_0/coordinates'][()]
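
    Given the antenna/subcarrier ordering described above, a brief sketch (assuming each row of csi_real/csi_imag holds the 90 ordered values) of reshaping the flat vectors into a complex array indexed by packet, subcarrier, and antenna:

    import h5py
    import numpy as np

    with h5py.File('3_people_3.h5', 'r') as f:
        csi_real = f['csi_real'][()]
        csi_imag = f['csi_imag'][()]

    # The 90 values per packet are ordered subcarrier-major:
    # (sc1,a1), (sc1,a2), (sc1,a3), (sc2,a1), ..., (sc30,a3).
    csi = (csi_real + 1j * csi_imag).reshape(-1, 30, 3)  # (packets, subcarriers, antennas)
    print(csi.shape)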
    

    Files inside the `training/` folder have a different data structure:

    
    |- nPath-1
      |- aoa
      |- csi_imag
      |- csi_real
      |- spotfi
    |- nPath-2
      |- aoa
      |- csi_imag
      |- csi_real
      |- spotfi
    |- nPath-3
      |- aoa
      |- csi_imag
      |- csi_real
      |- spotfi
    |- nPath-4
      |- aoa
      |- csi_imag
      |- csi_real
      |- spotfi
    


    The group `nPath-x` indicates the number of multipath components specified during the SpotFi calculation. `aoa` is the camera-generated angle of arrival (AoA), which can be considered ground truth; `csi_imag` and `csi_real` are the imaginary and real components of the CSI values; and `spotfi` contains the SpotFi-calculated AoA values. The SpotFi values are chosen based on the lowest median and mean error across `1_person_1.h5` and `1_person_2.h5`. All the rows under the same `nPath-x` group are aligned (i.e., the first row of `aoa` corresponds to the first row of `csi_imag`, `csi_real`, and `spotfi`). There is no timestamp recorded, and the sequence of the data is not chronological, as the rows are randomly shuffled from the `1_person_1.h5` and `1_person_2.h5` files.

    Citation
    If you use the dataset, please cite our paper:

    @inproceedings{eyefi2020,
     title={EyeFi: Fast Human Identification Through Vision and WiFi-based Trajectory Matching},
     author={Fang, Shiwei and Islam, Tamzeed and Munir, Sirajum and Nirjon, Shahriar},
     booktitle={2020 IEEE International Conference on Distributed Computing in Sensor Systems (DCOSS)},
     year={2020},
    }

  8. Data from: Double-Matched Matrix Decomposition for Multi-View Data

    • datasetcatalog.nlm.nih.gov
    • tandf.figshare.com
    Updated May 24, 2022
    Cite
    Yuan, Dongbang; Gaynanova, Irina (2022). Double-Matched Matrix Decomposition for Multi-View Data [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000404690
    Dataset updated
    May 24, 2022
    Authors
    Yuan, Dongbang; Gaynanova, Irina
    Description

    We consider the problem of extracting joint and individual signals from multi-view data, that is, data collected from different sources on matched samples. While existing methods for multi-view data decomposition explore single matching of data by samples, we focus on double-matched multi-view data (matched by both samples and source features). Our motivating example is the miRNA data collected from both primary tumor and normal tissues of the same subjects; the measurements from two tissues are thus matched both by subjects and by miRNAs. Our proposed double-matched matrix decomposition allows us to simultaneously extract joint and individual signals across subjects, as well as joint and individual signals across miRNAs. Our estimation approach takes advantage of double-matching by formulating a new type of optimization problem with explicit row space and column space constraints, for which we develop an efficient iterative algorithm. Numerical studies indicate that taking advantage of double-matching leads to superior signal estimation performance compared to existing multi-view data decomposition based on single-matching. We apply our method to miRNA data as well as data from the English Premier League soccer matches and find joint and individual multi-view signals that align with domain-specific knowledge. Supplementary materials for this article are available online.

  9. Premier League Match Data and Season Stats

    • kaggle.com
    zip
    Updated May 29, 2024
    Cite
    Evan Gora (2024). Premier League Match Data and Season Stats [Dataset]. https://www.kaggle.com/datasets/evangora/premier-league-data
    Available download formats: zip (559848 bytes)
    Dataset updated
    May 29, 2024
    Authors
    Evan Gora
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset covers Premier League seasons from the 1888/1889 season through the 2023/2024 season. It has files for every unique team, every season, all the season stats, and all matches played since the 1888/1889 season.

    Some data is missing, as it was not available on the website the data was scraped from. For example, most seasons are missing passing data (attempts, completions, percentage), and a majority of games are missing fields such as attendance or expected goals. For the games that do have expected goals, xG is for the home team and xG.1 is for the away team.
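
    A small pandas sketch for making the xG columns explicit; the file name matches.csv is an assumption, while the xG/xG.1 column names come from the description above.

    import pandas as pd

    # Assumed file name for the match-level table in this dataset.
    matches = pd.read_csv("matches.csv")

    # Per the description, xG is the home team's expected goals and xG.1 the away team's.
    matches = matches.rename(columns={"xG": "xG_home", "xG.1": "xG_away"})
    print(matches[["xG_home", "xG_away"]].dropna().head())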

  10. Club Football Match Data (2000 - 2025) ⚽

    • kaggle.com
    zip
    Updated Jun 27, 2025
    Cite
    Adam Gábor (2025). Club Football Match Data (2000 - 2025) ⚽ [Dataset]. https://www.kaggle.com/datasets/adamgbor/club-football-match-data-2000-2025/code
    Available download formats: zip (15085125 bytes)
    Dataset updated
    Jun 27, 2025
    Authors
    Adam Gábor
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Club Football Match Data (2000 - 2025)

    This dataset offers a simple entry point into football match data analysis. It provides match data from 27 countries and 42 leagues worldwide, including top leagues such as the English Premier League, German Bundesliga, and Spanish La Liga. The data spans from the 2000/01 season to the most recent results from the 2024/25 season. The dataset also includes Elo ratings for the given period, with snapshots of ~500 of the best teams in Europe taken twice a month, on the 1st and 15th.

    Match results and statistics provided in the table are taken from Football-Data.co.uk. Elo data are taken from ClubElo.

    📅 DATASET OVERVIEW

    📂 Files number: 2

    🔗 Files type: .csv

    ⌨️ Total rows: ~475 000 as of 07/2025

    💾 Total size: ~51 MB as of 07/2025

    The dataset is a great starting point for football match prediction, both pre-match and in-play, with huge potential lying in the amount of data and its accuracy. It contains information about teams' strength and form prior to the match, as well as general market expectations via pre-match odds.
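
    For the pre-match use case, a minimal sketch of the standard Elo win expectancy computed from the HomeElo and AwayElo columns described below; the file name Matches.csv is taken from the table name used later, and the home-advantage constant is an assumption, not part of the dataset.

    import pandas as pd

    matches = pd.read_csv("Matches.csv")  # Table 2 of the dataset

    HOME_ADV = 65  # assumed Elo home-advantage bonus; tune on the data

    # Standard Elo expectancy: P(home) = 1 / (1 + 10^(-(elo_diff)/400)).
    diff = matches["HomeElo"] + HOME_ADV - matches["AwayElo"]
    matches["home_expectancy"] = 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

    print(matches[["HomeTeam", "AwayTeam", "home_expectancy"]].head())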

    🔑 KEY FEATURES

    1️⃣ SIZE - This is the biggest open and free dataset on the internet, keeping uniform information about tens of thousands of football matches, including match statistics, odds, and Elo and form information.

    2️⃣ READABILITY - The whole dataset is tabular, and all of the data are clear to navigate and explain. Both tables in the dataset correspond to each other via remapped club names, and all of the formats within the table (such as odds) are uniform.

    3️⃣ RECENCY - This is the most up-to-date open football dataset, containing data from matches as recent as July 2025. The plan is to update this dataset monthly or bi-monthly via a custom-made Python pipeline.

    📋 COLUMNS AND DESCRIPTIONS

    💪 TABLE 1 - ELO RATINGS.csv

    This table is a collection of Elo ratings taken from ClubElo. Snapshots are taken twice a month, on the 1st and 15th day of the month, saving the whole ClubElo database. Some club names are remapped to correspond with the Matches table (for example, "Bayern" to "Bayern Munich").

    | Column | Data Type | Description |
    |--------|-----------|-------------|
    | 📅 Date | date | Date of the snapshot. |
    | 🛡️ Club | string | Club name in English, corresponding to the Matches table. |
    | 🌍️ Country | enum | Club country three-letter code. |
    | 📈 Elo | float | Club's current Elo rating, rounded to two decimal places. |

    💪 TABLE 2 - MATCHES.csv

    | Column | Data Type | Description |
    |--------|-----------|-------------|
    | 🏆 Division | enum | League that the match was played in: country code + division number (I1 for the Italian first division). For countries where only one league is covered, a 3-letter country code is used (ARG for Argentina). |
    | 📆 MatchDate | date | Match date in the classic YYYY-MM-DD format. |
    | 🕘 MatchTime | time | Match time in the HH:MM:SS format, CET-1 timezone. |
    | 🏠 HomeTeam | string | Home team's club name in English, abbreviated if needed. |
    | 🚗 AwayTeam | string | Away team's club name in English, abbreviated if needed. |
    | 📊 HomeElo | float | Home team's most recent Elo rating. |
    | 📊 AwayElo | float | Away team's most recent Elo rating. |
    | 📉 Form3Home | int | Points gathered by the home team in the last 3 matches (Win = 3, Draw = 1, Loss = 0; value between 0 and 9). |
    | 📈 Form5Home | int | Points gathered by the home team in the last 5 matches (Win = 3, Draw = 1, Loss = 0; value between 0 and 15). |
    | 📉 Form3Away | int | Points gathered by the away team in the last 3 matches (Win = 3, Draw = 1, Loss = 0; value between 0 and 9). |
    | 📈 Form5Away | int | Points gathered by the away team in the last 5 matches (Win = 3, Draw = 1, Loss = 0; value between 0 and 15). |
    | FTHome | int | Full-time goals scored by the home team. |
    | FTAway | int | Full-time goals scored by the away team. |
    | 🏁 FTResult | enum | Full-time result (H for home win, D for draw, A for away win). |
    | HTHome | int | Half-time goals scored by the home team. |
    | HTAway | int | Half-time goals scored by the away team. |
    | ⏱️ HTResult | enum | Half-time result (H for home win, D for draw, A for away win). |
    | 🏹 HomeShots | int | Total shots (goal, saved, blocked, off-target) by the home team. |
    | 🏹 AwayShots | int | Total shots (goal, saved, blocked, off-target) by the away team. |
    | 🎯 HomeTarget | int | Total shots on target (goal, saved) by the home team. |
    | 🎯 AwayTarget | int | Total sh... |
  11. Global Football Results: (1872–2024)

    • kaggle.com
    zip
    Updated Sep 4, 2024
    Cite
    Muhammad Ehsan (2024). Global Football Results: (1872–2024) [Dataset]. https://www.kaggle.com/datasets/muhammadehsan02/global-football-results-18722024
    Available download formats: zip (1193155 bytes)
    Dataset updated
    Sep 4, 2024
    Authors
    Muhammad Ehsan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset offers a comprehensive record of international football matches from the very first game in 1872 to the present day in 2024. It covers a broad spectrum of football matches, including major tournaments like the FIFA World Cup and various friendly matches. With a total of 47,126 match records, this dataset is a valuable resource for analyzing historical trends, team performances, and match outcomes over more than a century of international football.

    Key Features

    • Extensive Coverage: The dataset includes detailed information on 47,126 international football matches, providing a rich historical archive of over 150 years.
    • Diverse Match Types: Records include results from high-profile tournaments such as the FIFA World Cup and lesser-known events like the FIFI Wild Cup, as well as numerous friendly matches.
    • In-depth Details: Each record features comprehensive match details, including scores, dates, locations, and tournament names.
    • Player and Team Insights: The dataset includes information on goal scorers, penalty shootouts, and own goals, offering a deeper understanding of match events and player performances.

    File Breakdown

    1) Match_Results.csv
    • Date: The date when the match was played.
    • Home Team: The team playing at home.
    • Away Team: The team playing away.
    • Home Score: The score of the home team, including extra time but not penalty shootouts.
    • Away Score: The score of the away team, including extra time but not penalty shootouts.
    • Tournament: The name of the tournament or competition in which the match was played.
    • City: The city where the match was held.
    • Country: The country where the match took place.
    • Neutral: Indicates if the match was played at a neutral venue (TRUE/FALSE).

    2) Penalty_Shootouts.csv
    • Date: The date of the match.
    • Home Team: The name of the home team.
    • Away Team: The name of the away team.
    • Winner: The team that won the penalty shootout.
    • First Shooter: The team that took the first shot in the penalty shootout.

    3) Goal_Scorers.csv
    • Date: The date of the match.
    • Home Team: The name of the home team.
    • Away Team: The name of the away team.
    • Team: The team that scored the goal.
    • Scorer: The player who scored the goal.
    • Minute: The minute when the goal was scored.
    • Own Goal: Indicates if the goal was an own goal (TRUE/FALSE).
    • Penalty: Indicates if the goal was scored from a penalty (TRUE/FALSE).
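
    A brief pandas sketch that joins the files listed above on their shared date/team keys; the column names follow the breakdown above, though the spellings in the actual CSVs may differ.

    import pandas as pd

    results = pd.read_csv("Match_Results.csv")
    scorers = pd.read_csv("Goal_Scorers.csv")

    # Match-level and goal-level records share the Date / Home Team / Away Team keys.
    keys = ["Date", "Home Team", "Away Team"]
    goals_per_match = scorers.groupby(keys).size().rename("goals_recorded").reset_index()

    merged = results.merge(goals_per_match, on=keys, how="left")
    print(merged.head())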

    Data Usage

    • Historical Analysis: This dataset is ideal for exploring the history of international football, including changes in team performances and the evolution of the sport over the decades.
    • Performance Trends: Analyze trends in team performance across different eras, including how teams have fared in various tournaments and friendlies.
    • Geopolitical Insights: Investigate how geopolitical changes have impacted international football, including changes in team names and country borders over time.
    • Match Statistics: Detailed statistics on scores, penalty shootouts, and goal scorers provide insights into match dynamics and player contributions.

    Additional Notes

    • Team and Country Names: The dataset uses current names for teams to simplify historical tracking. For example, teams that were known by different names in the past are referred to by their current names to maintain consistency.
    • Geographic and Temporal Coverage: Covering matches played worldwide from 1872 to 2024, this dataset provides a global perspective on international football history.

    Acknowledgement

    Full credit goes to Mart Jürisoo for the original work on international football results. The dataset titled International Football Results from 1872 to 2017 provided the foundational data and inspiration for this comprehensive historical archive.

    The purpose of sharing this dataset is to foster collaborative research and analysis within the football community. By making this extensive historical data available, we aim to support studies on historical trends, team performances, and the evolution of international football over more than 150 years. This dataset is intended to be a valuable resource for researchers, analysts, and enthusiasts who wish to explore the rich history of international football and gain deeper insights into the sport's development.

  12. Data from: Robust Post-Matching Inference

    • tandf.figshare.com
    • figshare.com
    zip
    Updated Jun 1, 2023
    Cite
    Alberto Abadie; Jann Spiess (2023). Robust Post-Matching Inference [Dataset]. http://doi.org/10.6084/m9.figshare.13135798.v1
    Available download formats: zip
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Alberto Abadie; Jann Spiess
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Nearest-neighbor matching is a popular nonparametric tool to create balance between treatment and control groups in observational studies. As a preprocessing step before regression, matching reduces the dependence on parametric modeling assumptions. In current empirical practice, however, the matching step is often ignored in the calculation of standard errors and confidence intervals. In this article, we show that ignoring the matching step results in asymptotically valid standard errors if matching is done without replacement and the regression model is correctly specified relative to the population regression function of the outcome variable on the treatment variable and all the covariates used for matching. However, standard errors that ignore the matching step are not valid if matching is conducted with replacement or, more crucially, if the second step regression model is misspecified in the sense indicated above. Moreover, correct specification of the regression model is not required for consistent estimation of treatment effects with matched data. We show that two easily implementable alternatives produce approximations to the distribution of the post-matching estimator that are robust to misspecification. A simulation study and an empirical example demonstrate the empirical relevance of our results. Supplementary materials for this article are available online.

  13. Modern China Geospatial Database - Main Dataset

    • data.niaid.nih.gov
    • zenodo.org
    • +1more
    Updated Feb 28, 2025
    Cite
    Christian Henriot (2025). Modern China Geospatial Database - Main Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5735393
    Dataset updated
    Feb 28, 2025
    Dataset provided by
    Aix-Marseille University
    Authors
    Christian Henriot
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    China
    Description

    MCGD_Data_V2.2 contains all the data that we have collected on locations in modern China, plus a number of locations outside of China that we encounter frequently in historical sources on China. All further updates will appear under the name "MCGD_Data" with a time stamp (e.g., MCGD_Data2023-06-21)

    You can also have access to this dataset and all the datasets that the ENP-China makes available on GitLab: https://gitlab.com/enpchina/IndexesEnp

    Altogether there are 464,970 entries. The data include the names of locations and their variants in Chinese, pinyin, and any recorded transliteration; the name of the province in Chinese and in pinyin; the Province ID; the latitude and longitude; and the Name ID, Location ID, and NameID_Legacy. The Name IDs all start with H followed by seven digits; this is the internal ID system of MCGD (the NameID_Legacy column records the Name IDs in their original format, depending on the source). Location IDs that start with "DH" are data points extracted from China Historical GIS (Harvard University); those that start with "D" are locations extracted from the data points in Geonames; and those that have only digits (8 digits) are data points we have added from various map sources.
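
    A short sketch of classifying ID provenance by the prefix convention described above; the file and column names are assumptions about the exact headers.

    import pandas as pd

    def id_source(location_id: str) -> str:
        """Classify an MCGD Location ID by its prefix."""
        if location_id.startswith("DH"):
            return "China Historical GIS"
        if location_id.startswith("D"):
            return "GeoNames"
        if location_id.isdigit() and len(location_id) == 8:
            return "MCGD map sources"
        return "unknown"

    # Assumed file and column names for illustration.
    mcgd = pd.read_csv("MCGD_Data.csv", dtype=str)
    print(mcgd["LocationID"].dropna().map(id_source).value_counts())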

    One of the main features of the MCGD Main Dataset is the systematic collection and compilation of place names from non-Chinese language historical sources. Locations were designated in transliteration systems that are hardly comprehensible today, which makes it very difficult to find the actual locations they correspond to. This dataset allows for the conversion from these obsolete transliterations to the current names and geocoordinates.

    From June 2021 onward, we have adopted a different file naming system to keep track of versions. From MCGD_Data_V1 we have moved to MCGD_Data_V2. In June 2022, we introduced time stamps, which result in the following naming convention: MCGD_Data_YYYY.MM.DD.

    UPDATES

    MCGD_Data2025_02_28 includes a major change: all the locations listed under Beijing, Shanghai, Tianjin, and Chongqing (北京, 上海, 天津, 重慶) are duplicated and also listed under the names of the provinces to which they belonged originally, before the creation of the four special municipalities after 1949. This is meant to facilitate the matching of data from historical sources. Each location has a unique Name ID. Altogether there are 472,818 entries.

    MCGD_Data2025_02_27 includes an update on locations extracted from Minguo zhengfu ge yuanhui keyuan yishang zhiyuanlu 國民政府各院部會科員以上職員錄 (Directory of staff members and above in the ministries and committees of the National Government; Nanjing: Guomin zhengfu wenguanchu yinzhuju 國民政府文官處印鑄局, 1944). We also made corrections in the Prov_Py and Prov_Zh columns, as there were some misalignments between the pinyin name and the name in Chinese characters. The file now includes 465,128 entries.

    MCGD_Data2024_03_23 includes an update on locations in Taiwan from the Asia Directories. Altogether there are 465,603 entries (of which 187 are place names without geocoordinates, labelled in the Lat and Long columns as "Unknown").

    MCGD_Data2023.12.22 contains all the data that we have collected on locations in China, whatever the period. Altogether there are 465,603 entries (of which 187 are place names without geocoordinates, labelled in the Lat and Long columns as "Unknown"). The dataset also includes locations outside of China for the purpose of matching such locations to place names extracted from historical sources. For example, one may need to locate individuals born outside of China. Rather than maintaining two separate files, we decided to incorporate all the place names found in historical sources into the gazetteer. Such place names can easily be removed by selecting all the entries where the 'Province' data is missing.

  14. Data from: A consensus compound/bioactivity dataset for data-driven drug...

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    zip
    Updated May 13, 2022
    Cite
    Laura Isigkeit; Laura Isigkeit; Apirat Chaikuad; Apirat Chaikuad; Daniel Merk; Daniel Merk (2022). A consensus compound/bioactivity dataset for data-driven drug design and chemogenomics [Dataset]. http://doi.org/10.5281/zenodo.6320761
    Available download formats: zip
    Dataset updated
    May 13, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Laura Isigkeit; Laura Isigkeit; Apirat Chaikuad; Apirat Chaikuad; Daniel Merk; Daniel Merk
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Information

    The diverse publicly available compound/bioactivity databases constitute a key resource for data-driven applications in chemogenomics and drug design. Analysis of their coverage of compound entries and biological targets revealed considerable differences, however, suggesting the benefit of a consensus dataset. Therefore, we have combined and curated information from five esteemed databases (ChEMBL, PubChem, BindingDB, IUPHAR/BPS and Probes&Drugs) to assemble a consensus compound/bioactivity dataset comprising 1144803 compounds with 10915362 bioactivities on 5613 targets (including defined macromolecular targets as well as cell-lines and phenotypic readouts). It also provides simplified information on the assay types underlying the bioactivity data and on bioactivity confidence by comparing data from different sources. We have unified the source databases, brought them into a common format, and combined them, enabling ease of use in multiple applications such as chemogenomics and data-driven drug design.

    The consensus dataset provides increased target coverage and contains a higher number of molecules compared to the source databases which is also evident from a larger number of scaffolds. These features render the consensus dataset a valuable tool for machine learning and other data-driven applications in (de novo) drug design and bioactivity prediction. The increased chemical and bioactivity coverage of the consensus dataset may improve robustness of such models compared to the single source databases. In addition, semi-automated structure and bioactivity annotation checks with flags for divergent data from different sources may help data selection and further accurate curation.

    Structure and content of the dataset

    Dataset structure (column headers): ChEMBL ID | PubChem ID | IUPHAR ID | Target | Activity type | Assay type | Unit | Mean C (0)... | Mean PC (0)... | Mean B (0)... | Mean I (0)... | Mean PD (0)... | Activity check annotation | Ligand names | Canonical SMILES C... | Structure check | Source

    The dataset was created using the Konstanz Information Miner (KNIME) (https://www.knime.com/) and was exported as a CSV-file and a compressed CSV-file.

    Except for the canonical SMILES columns, all columns are filled with the datatype ‘string’. The datatype for the canonical SMILES columns is the smiles-format. We recommend the File Reader node for using the dataset in KNIME. With the help of this node the data types of the columns can be adjusted exactly. In addition, only this node can read the compressed format.

    Column content (a filtering sketch follows the list):

    • ChEMBL ID, PubChem ID, IUPHAR ID: chemical identifier of the databases
    • Target: biological target of the molecule expressed as the HGNC gene symbol
    • Activity type: for example, pIC50
    • Assay type: Simplification/Classification of the assay into cell-free, cellular, functional and unspecified
    • Unit: unit of bioactivity measurement
    • Mean columns of the databases: mean of bioactivity values or activity comments denoted with the frequency of their occurrence in the database, e.g. Mean C = 7.5 *(15) -> the value for this compound-target pair occurs 15 times in ChEMBL database
    • Activity check annotation: a bioactivity check was performed by comparing values from the different sources and adding an activity check annotation to provide automated activity validation for additional confidence
      • no comment: bioactivity values are within one log unit;
      • check activity data: bioactivity values are not within one log unit;
      • only one data point: only one value was available, no comparison and no range calculated;
      • no activity value: no precise numeric activity value was available;
      • no log-value could be calculated: no negative decadic logarithm could be calculated, e.g., because the reported unit was not a compound concentration
    • Ligand names: all unique names contained in the five source databases are listed
    • Canonical SMILES columns: Molecular structure of the compound from each database
    • Structure check: To denote matching or differing compound structures in different source databases
      • match: molecule structures are the same between different sources;
      • no match: the structures differ;
      • 1 source: no structure comparison is possible, because the molecule comes from only one source database.
    • Source: From which databases the data come from
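
    A hedged pandas sketch for reading the exported CSV and keeping only entries whose structures and bioactivities agree across sources, using the annotation values listed above; the file name is an assumption, and the exact annotation strings may differ in the export.

    import pandas as pd

    # Assumed file name for the exported consensus dataset.
    df = pd.read_csv("consensus_compound_bioactivity.csv.gz", dtype=str)

    # Keep pairs whose structures match across sources (or come from a single
    # source) and whose bioactivity values agree within one log unit.
    ok_structure = df["Structure check"].isin(["match", "1 source"])
    ok_activity = df["Activity check annotation"].isin(["no comment", "only one data point"])

    curated = df[ok_structure & ok_activity]
    print(len(df), "->", len(curated), "rows after consistency filtering")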

  15. Film Circulation dataset

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, png
    Updated Jul 12, 2024
    Cite
    Skadi Loist; Skadi Loist; Evgenia (Zhenya) Samoilova; Evgenia (Zhenya) Samoilova (2024). Film Circulation dataset [Dataset]. http://doi.org/10.5281/zenodo.7887672
    Available download formats: csv, png, bin
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Skadi Loist; Skadi Loist; Evgenia (Zhenya) Samoilova; Evgenia (Zhenya) Samoilova
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”

    A peer-reviewed data paper for this dataset is in review to be published in NECSUS_European Journal of Media Studies - an open access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org

    Please cite this when using the dataset.


    Detailed description of the dataset:

    1 Film Dataset: Festival Programs

    The Film Dataset consists of a data scheme image file, a codebook, and two dataset tables in csv format.

    The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.

    The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.

    The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.


    2 Survey Dataset

    The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.

    The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.

    The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.


    3 IMDb & Scripts

    The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.

    The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.

    The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.

    The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.

    The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.

    The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to each crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.

    The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.

    The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.

    The csv file “3_imdb-dataset_release-info_long” contains data about non-festival release (e.g., theatrical, digital, tv, dvd/blueray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.

    The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.

    The dataset includes 8 text files containing the script for webscraping. They were written using the R-3.6.3 version for Windows.

    The R script “r_1_unite_data” demonstrates the structure of the dataset, that we use in the following steps to identify, scrape, and match the film data.

    The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then, if no matches are found, using an alternative title and a basic search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records on the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, with a fuzzy matching approach using two methods, “cosine” and “osa”: cosine similarity is used to match titles with a high degree of similarity, and the OSA algorithm is used to match titles that may have typos or minor variations.
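
    This is not the authors' R code, but a rough Python illustration of the same idea of scoring candidate IMDb titles against a core-dataset title; the standard-library difflib ratio stands in for the cosine/OSA measures used in the scripts, and the example titles are made up.

    import difflib

    def title_score(query: str, candidate: str) -> float:
        """Similarity in [0, 1] between a core-dataset title and an IMDb candidate."""
        return difflib.SequenceMatcher(None, query.lower(), candidate.lower()).ratio()

    query = "The Hours"
    candidates = ["The Hours", "The Hour", "After Hours", "Hours"]

    # Rank suggested records by similarity, as the matching loop does with its scores.
    for c in sorted(candidates, key=lambda c: title_score(query, c), reverse=True):
        print(f"{c!r}: {title_score(query, c):.2f}")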

    The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was assigned to one of five categories: a) 100% match (perfect match on title, year, and director); b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible duplicates in the dataset and flags them for a manual check.

    The script “r_4_scraping_functions” creates the functions for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the following scripts.

    The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. It does so for the first 100 films only, to check that everything works. Scraping the entire dataset took a few hours, so a test on a subsample of 100 films is advisable.

    The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.

    The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract their data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other temporary technical issues.
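    A retry pass of this kind can be expressed very compactly. The following Python sketch is not the authors' code; the URL, attempt count, and wait time are illustrative assumptions showing how transient network failures can be separated from genuinely missing records.

        import time
        import requests

        def fetch_with_retry(url, attempts=3, wait_seconds=5):
            """Re-request a film page a few times so transient network errors
            are not recorded as missing data (illustrative helper only)."""
            for attempt in range(1, attempts + 1):
                try:
                    response = requests.get(url, timeout=30)
                    response.raise_for_status()
                    return response.text
                except requests.RequestException:
                    if attempt == attempts:
                        return None           # give up: treat as genuinely missing
                    time.sleep(wait_seconds)  # back off before the next try

        html = fetch_with_retry("https://www.imdb.com/title/tt0111161/")  # example title URL
        print(html is not None)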

    The script “r_check_logs” is used for troubleshooting and for tracking the progress of all of the R scripts. It reports the number of missing values and errors.


    4 Festival Library Dataset

    The Festival Library Dataset consists of a data scheme image file, one codebook, and one dataset (the latter two in csv format).

    The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset, listing the definitions of variables such as location, festival name, and festival categories.

  16. resumes

    • huggingface.co
    Updated Feb 22, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Oks (2025). resumes [Dataset]. https://huggingface.co/datasets/datasetmaster/resumes
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 22, 2025
    Authors
    Oks
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for Advanced Resume Parser & Job Matcher Resumes

    This dataset contains a merged collection of real and synthetic resume data in JSON format. The resumes have been normalized to a common schema to facilitate the development of NLP models for candidate-job matching in the technical recruitment domain.

      Dataset Details

      Dataset Description
    

    This dataset is a combined collection of real resumes and synthetically generated CVs.

    Curated by: datasetmaster… See the full description on the dataset page: https://huggingface.co/datasets/datasetmaster/resumes.
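    For a first look at the records, a minimal Python sketch using the Hugging Face `datasets` library is shown below; the presence of a `train` split and the field layout are assumptions, since the card above does not list them.

        from datasets import load_dataset

        # Load the resume collection from the Hugging Face Hub.
        ds = load_dataset("datasetmaster/resumes")
        print(ds)  # shows the available splits and their sizes

        # Inspect one normalized resume record (assuming a "train" split exists).
        example = ds["train"][0]
        print(list(example.keys()))  # field names of the common schema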

  17. La Liga Match Data

    • kaggle.com
    zip
    Updated Jan 2, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Sanjeet Singh Naik (2022). La Liga Match Data [Dataset]. https://www.kaggle.com/datasets/sanjeetsinghnaik/la-liga-match-data
    Explore at:
    zip(231150 bytes)Available download formats
    Dataset updated
    Jan 2, 2022
    Authors
    Sanjeet Singh Naik
    License

    CC0 1.0 Universal (Public Domain Dedication)https://creativecommons.org/publicdomain/zero/1.0/

    Description

    La Liga is one of the world’s most entertaining leagues. It has some of the best managers, players and fans! But what makes it truly entertaining is the sheer unpredictability. There are 6 equally strong teams, with a different team lifting the trophy from season to season. Not only that, the league has also witnessed victories from teams outside of the top 6. So, let us analyze some of these instances.

    So far, the adoption of statistics in soccer has been positive. Teams can easily gather information about their opponents and their tactics. This convenience allows managers to create well-thought-out game plans that suit their team, exploit opponents' weaknesses, and increase their chances of winning.

    A goal is scored when the whole of the ball passes over the goal line, between the goalposts and under the crossbar, provided that no offence has been committed by the team scoring the goal. If the goalkeeper throws the ball directly into the opponents' goal, a goal kick is awarded.

    THE TIME OF SEASON/MOTIVATION: While a club battling for a league title is going to be hungry for a win, as is a side that is fighting to stay up, a club that has already won the title or has already been relegated is unlikely to work as hard, and will often rest players as well.

    THE REFEREE: When a referee sends a player off it makes a massive impact on a match, but even a yellow card can affect the outcome of the game, as the booked player is less likely to go in as hard for the rest of the match.

    SUBSTITUTES: The whole point of substitutes is for them to be able to come on and impact a match. Subs not only bring on a fresh pair of legs that are less tired than starters and more likely to track back and push forward, but can also play crucial roles in the formation of a team.

    MIND GAMES/MANAGERS: Playing mind games has almost become a regular routine for top-level managers, and rightly so. A simple mind game can do a lot to influence a match; a good example comes from Sir Alex Ferguson.

    In his autobiography, Ferguson recounts that when Manchester United were losing late in a match, he would tap his watch and make sure the opposition knew he was signalling this to his players. United's opponents already knew that United have a tendency to come back from behind, and upon seeing the gesture they would expect a comeback. And because living creatures tend to accept outcomes that have happened before - horses, for example, are more likely to lose to a horse they have already lost to, even on an even playing field - they would often succumb to a loss.

    FORM/INJURIES/FIXTURES: A team in good form is more likely to win a match than one on a poor run of form, while a team in the middle of a condensed run of fixtures is less likely to win than a well-rested side. These are just some of the things that affect matches - if you know of any others, mention them in the comment section below and I'll try to add them in!

  18. Spatial Multimodal Analysis (SMA) - Spatial Transcriptomics

    • figshare.scilifelab.se
    • demo.researchdata.se
    • +1more
    json
    Updated Jan 15, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Marco Vicari; Reza Mirzazadeh; Anna Nilsson; Patrik Bjärterot; Ludvig Larsson; Hower Lee; Mats Nilsson; Julia Foyer; Markus Ekvall; Paulo Czarnewski; Xiaoqun Zhang; Per Svenningsson; Per Andrén; Lukas Käll; Joakim Lundeberg (2025). Spatial Multimodal Analysis (SMA) - Spatial Transcriptomics [Dataset]. http://doi.org/10.17044/scilifelab.22778920.v1
    Explore at:
    jsonAvailable download formats
    Dataset updated
    Jan 15, 2025
    Dataset provided by
    KTH Royal Institute of Technology, Science for Life Laboratory
    Authors
    Marco Vicari; Reza Mirzazadeh; Anna Nilsson; Patrik Bjärterot; Ludvig Larsson; Hower Lee; Mats Nilsson; Julia Foyer; Markus Ekvall; Paulo Czarnewski; Xiaoqun Zhang; Per Svenningsson; Per Andrén; Lukas Käll; Joakim Lundeberg
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains Spatial Transcriptomics (ST) data matching with Matrix Assisted Laser Desorption/Ionization - Mass Spectrometry Imaging (MALDI-MSI). This data is complementary to data contained in the same project. Files with the same identifiers in the two datasets originated from the very same tissue section and can be combined into a multimodal ST-MSI object. For more information about the dataset please see our manuscript posted on bioRxiv (doi: https://doi.org/10.1101/2023.01.26.525195).

    This dataset includes ST data from 19 tissue sections, including human post-mortem and mouse samples. The spatial transcriptomics data was generated using the Visium protocol (10x Genomics). The murine tissue sections come from three different mice unilaterally injected with 6-OHDA, a neurotoxin that, when injected in the brain, can selectively destroy dopaminergic neurons. We used this mouse model to show the applicability of the technology that we developed, named Spatial Multimodal Analysis (SMA). Using our technology on these mouse brain tissue sections we were able to detect both dopamine with MALDI-MSI and the corresponding gene expression with ST. The dataset also includes one human post-mortem striatum sample that was placed on one Visium slide across the four capture areas. This sample was analyzed with a different ST protocol named RRST (Mirzazadeh, R., Andrusivova, Z., Larsson, L. et al. Spatially resolved transcriptomic profiling of degraded and challenging fresh frozen samples. Nat Commun 14, 509 (2023). https://doi.org/10.1038/s41467-023-36071-5), where probes capturing the whole transcriptome are first hybridized in the tissue section and then spatially detected.

    Each tissue section in the dataset has been given a unique identifier composed of the Visium array ID and the capture area ID of the Visium slide that the tissue section was placed on. This unique identifier is included in the file names of all files relative to the same tissue section, including the MALDI-MSI files published in the other dataset of this project. In this dataset you will find the following files for each tissue section:
    - raw files: the read one fastq files (containing the pattern *R1*fastq.gz in the file name), the read two fastq files (containing the pattern *R2*fastq.gz in the file name), and the raw microscope images (containing the pattern Spot.jpg in the file name). These are the only files needed to run the Space Ranger pipeline, which is freely available to any user (please see the 10x Genomics website for information on how to install and run Space Ranger);
    - processed data files, of three types: a) Space Ranger outputs that were used to produce the figures in our publication; b) manual annotation tables in csv format produced using Loupe Browser 6 (csv tables with file names ending in _RegionLoupe.csv, _filter.csv, _dopamine.csv, _lesion.csv, _region.csv); c) json files that we used as input for Space Ranger in the cases where the automatic tissue detection included in the pipeline failed to recognize the tissue or the fiducials.
    Using these processed files the user can reproduce the figures of our publication without having to restart from the raw data files.

    The MALDI-MSI analysis preceding ST was performed with different matrices on different tissue sections. We used 1) 9-aminoacridine (9-AA) for detection of metabolites in negative ionization mode, 2) 2,5-dihydroxybenzoic acid (DHB) for detection of metabolites in positive ionization mode, and 3) 4-(anthracen-9-yl)-2-fluoro-1-ethylpyridin-1-ium iodide (FMP-10), which charge-tags molecules with phenolic hydroxyls and/or primary amines, including neurotransmitters. The information about which matrix was sprayed on each tissue section, together with other sample information, is included in the metadata table. We also used three types of control samples:
    - standard Visium: samples processed with standard Visium (i.e. no matrix spraying, no MALDI-MSI, protocol as recommended by 10x Genomics with no exceptions);
    - internal controls (iCTRL): samples not sprayed with any matrix nor processed with MALDI-MSI, but located on the same Visium slide where other samples were processed with MALDI-MSI;
    - FMP-10-iCTRL: a sample sprayed with FMP-10 and then processed as an iCTRL.
    This and other information is provided in the metadata table.
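    As an illustration of how the processed Space Ranger outputs can be combined with the Loupe Browser annotations, here is a minimal Python sketch using `scanpy`. The directory name, the annotation file name, and the presence of a Barcode column are assumptions for illustration, not part of the deposited files' documentation.

        import pandas as pd
        import scanpy as sc

        # One tissue section's Space Ranger output directory (assumed to contain
        # filtered_feature_bc_matrix.h5 plus a spatial/ subfolder, as Space Ranger produces).
        adata = sc.read_visium("V10A27-004_A1_spaceranger_out")  # hypothetical identifier

        # Attach a Loupe Browser annotation table (e.g. a *_dopamine.csv file),
        # assuming it has a Barcode column matching the Visium spot barcodes.
        annot = pd.read_csv("V10A27-004_A1_dopamine.csv", index_col="Barcode")
        adata.obs = adata.obs.join(annot, how="left")

        print(adata)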

  19. Data from: Optimal background matching camouflage

    • datadryad.org
    • datasetcatalog.nlm.nih.gov
    • +2more
    zip
    Updated Jun 7, 2017
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Constantine Michalis; Nicholas E. Scott-Samuel; David P. Gibson; Innes C. Cuthill (2017). Optimal background matching camouflage [Dataset]. http://doi.org/10.5061/dryad.q4b78
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jun 7, 2017
    Dataset provided by
    Dryad
    Authors
    Constantine Michalis; Nicholas E. Scott-Samuel; David P. Gibson; Innes C. Cuthill
    Time period covered
    Apr 3, 2017
    Description

    Day 3 003_srgb_1_to_day2 083_srgb_5_human contains the L*a*b* and Gabor filter outputs of all 505 samples of bark, as seen by human visual systems. Filename+photo = name of image; rep = number of the sample from each image (5 for each image); L = L value of L*a*b*; a = a value of L*a*b*; b = b value of L*a*b*; fxoy: f = the Gabor filter’s spatial frequency (x = 1-4, from fine to coarse), o = the Gabor filter’s orientation (y = 1-6, from 0 to 150° in 30° increments).

    Day 3 003_srgb_1_to_day2 083_srgb_5_bird contains the L*a*b* and Gabor filter outputs of all 505 samples of bark, as seen by bird visual systems, with the same column structure as the human file.

    Survival Experiment contains all the data from the field experiment. Columns: Block = number of block, Trea...
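    A short Python sketch for reading one of these tables and summarizing the Gabor filter columns by spatial frequency; the file name and the exact column labels (f1o1 ... f4o6) are assumptions inferred from the naming scheme described above.

        import pandas as pd

        # Hypothetical file name -- adjust to the actual csv in the Dryad archive.
        df = pd.read_csv("Day 3 003_srgb_1_to_day2 083_srgb_5_human.csv")

        # Columns f1o1 ... f4o6: spatial frequency x = 1-4, orientation y = 1-6 (assumed labels).
        freq_means = {x: df[[f"f{x}o{y}" for y in range(1, 7)]].mean(axis=1).mean()
                      for x in range(1, 5)}
        print(freq_means)  # mean Gabor response per spatial frequency, averaged over orientations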

  20. FAIRmat, nomad-remote-tools-hub, nomad-parser-nexus, development, collection...

    • meta4cat.fokus.fraunhofer.de
    • zenodo.org
    unknown, zip
    Updated Jan 18, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Zenodo (2022). FAIRmat, nomad-remote-tools-hub, nomad-parser-nexus, development, collection 1, example datasets for atom probe microscopy and electron microscopy [Dataset]. https://meta4cat.fokus.fraunhofer.de/datasets/oai-zenodo-org-7050774?locale=en
    Explore at:
    unknown(66364816), zip(1920)Available download formats
    Dataset updated
    Jan 18, 2022
    Dataset authored and provided by
    Zenodohttp://zenodo.org/
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The following repository contains a collection of data and metadata files in different vendor formats which were collected in the fields of atom probe microscopy (LEAP instruments) and electron microscopy (Nion instruments). These files are meant for development and testing purposes of the nomad north-remote-tools-hub and the related nomad-nexus-parser software tools within the FAIRmat project. FAIRmat is a consortium led by the Humboldt-Universität zu Berlin and a member of the German Research Data Infrastructure (NFDI) initiative. A detailed description of the background and content of the individual files follows:

    EM.STEM.Nion.Dataset.1.zip: a dataset for testing the nx_em_nion reader, which handles files from Nion microscopes and NionSwift software. The data were collected by Benedikt Haas and Sherjeel Shabih from Humboldt-Universität zu Berlin, who worked (at the point of publication) in the group of Prof. Christoph Koch.

    APM.LEAP.Datasets.*.zip: a collection of two datasets for testing the generic nx_apm reader, which handles commercial and community file formats for reconstructed ion position and ranging data from atom probe microscopy experiments. The datasets were collected by different authors.

    APM.LEAP.Datasets.1.zip: R31_06365-v02.pos was shared by Jing Wang and Daniel Schreiber (both at PNNL); details on the dataset are available under the following DOIs: https://doi.org/10.1017/S1431927618015386 and https://doi.org/10.1017/S1431927621012241. 70_50_50.apt was shared by Xuyang Zhou during his time with the Max-Planck-Institut für Eisenforschung GmbH as open-source test data for the publication he led on machine-learning-based techniques for composition profiling; the dataset and publication are available via https://doi.org/10.1016/j.actamat.2022.117633, and the dataset specifically is also available here: https://github.com/RhettZhou/APT_GB/tree/main/example/Cropped_70_50_50. The range files (*.rng and *.rrng) serve as examples for developing tools that parse them and handle the formatting of range files. The scientific content of the range files was inspired by experiments but is not related to the above-mentioned atom probe datasets and should not be used to analyze these test data for more than pure development purposes; use your own data and matching range files for scientific analyses.

    APM.LEAP.Datasets.2.zip: R18_53222_W_18K-v01.epos was shared with Markus Kühbach by Andrew Breen during their time at the Max-Planck-Institut für Eisenforschung GmbH. We would like to invite the community to use the nomad infrastructure and support us by sharing data and datasets which we can then use to improve the file format parsing, the reading capabilities, and the analysis services of the nomad infrastructure, so that the community can profit again from these developments.

    aut_leoben_leitner.tar.gz: the dataset associated with the grain boundary solute segregation case study discussed in https://arxiv.org/abs/2205.13510.

    usa_portland_wang.tar.gz: the dataset associated with the ODS steel specimen, a good example for testing and learning iso-surface based analyses with the paraprobe-toolbox. This dataset was mentioned as one of the test cases in https://arxiv.org/abs/2205.13510.

    ger_berlin_kuehbach_fairmat_is_usa_portland_wang.tar.gz: content-wise the same scientific dataset as the one in usa_portland_wang, but stored in an HDF5 file formatted compliant with the NXapm NeXus application definition as of the NeXus code camp 2022: https://manual.nexusformat.org/classes/contributed_definitions/NXapm.html#nxapm
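    To get a first look at the NXapm-formatted HDF5 file, a minimal Python sketch with `h5py` is shown below; the local file name and the commented dataset path are illustrative assumptions, so treat the sketch as a starting point for exploring the hierarchy rather than a documented layout.

        import h5py

        # Hypothetical local copy of the NXapm-compliant HDF5 file from the *.tar.gz archive.
        with h5py.File("ger_berlin_kuehbach_fairmat_is_usa_portland_wang.h5", "r") as f:
            # Walk the full group/dataset hierarchy and print every path.
            f.visit(print)

            # Once the hierarchy is known, datasets can be read by path, e.g. (path assumed):
            # positions = f["/entry/atom_probe/reconstruction/reconstructed_positions"][...]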
