Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a human-rated contextual phrase-to-phrase matching dataset focused on technical terms from patents. In addition to the similarity scores typically included in other benchmark datasets, we include granular rating classes similar to WordNet, such as synonym, antonym, hypernym, hyponym, holonym, meronym, and domain related. The dataset was used in the U.S. Patent Phrase to Phrase Matching competition.
The dataset was generated with a focus on the following:
- Phrase disambiguation: certain keywords and phrases can have multiple different meanings. For example, the phrase "mouse" may refer to an animal or a computer input device. To help disambiguate the phrases, we include Cooperative Patent Classification (CPC) classes with each pair of phrases.
- Adversarial keyword match: there are phrases that have matching keywords but are otherwise unrelated (e.g. “container section” → “kitchen container”, “offset table” → “table fan”). Many models (e.g. bag-of-words models) will not do well on such data, so our dataset is designed to include many such examples.
- Hard negatives: we created our dataset with the aim of improving upon current state-of-the-art language models. Specifically, we used the BERT model to generate some of the target phrases, so our dataset contains many human-rated examples of phrase pairs that BERT may identify as very similar but that in fact may not be.
Each entry of the dataset contains two phrases (anchor and target), a context CPC class, a rating class, and a similarity score. The rating classes have the following meanings:
- 4 - Very high.
- 3 - High.
- 2 - Medium.
- 2a - Hyponym (broad-narrow match).
- 2b - Hypernym (narrow-broad match).
- 2c - Structural match.
- 1 - Low.
- 1a - Antonym.
- 1b - Meronym (a part of).
- 1c - Holonym (a whole of).
- 1d - Other high-level domain match.
- 0 - Not related.
The dataset is split into training (75%), validation (5%), and test (20%) sets. When splitting the data, all of the entries with the same anchor are kept together in the same set. There are 106 different context CPC classes, and all of them are represented in the training set.
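Because entries sharing an anchor must stay in the same split, a grouped split is needed. Below is a minimal sketch, assuming the data sits in a CSV with an "anchor" column (the file and column names here are illustrative, not the official ones):

import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.read_csv("phrase_matching.csv")  # hypothetical file name

# Carve out the test set (20%), keeping all rows with the same anchor together.
gss_test = GroupShuffleSplit(n_splits=1, test_size=0.20, random_state=0)
train_val_idx, test_idx = next(gss_test.split(df, groups=df["anchor"]))
train_val, test = df.iloc[train_val_idx], df.iloc[test_idx]

# Split the remainder into train (75% of the total) and validation (5% of the total).
gss_val = GroupShuffleSplit(n_splits=1, test_size=0.05 / 0.80, random_state=0)
train_idx, val_idx = next(gss_val.split(train_val, groups=train_val["anchor"]))
train, val = train_val.iloc[train_idx], train_val.iloc[val_idx]

# No anchor appears in more than one split.
assert set(train["anchor"]).isdisjoint(set(test["anchor"]))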
More details about the dataset are available in the corresponding paper. Please cite the paper if you use the dataset.
Data standardization is an important part of effective data management. However, sometimes people have data that doesn't match. This dataset includes the different ways that county names might be written by different people. It can be used as a lookup table when you need County to be your unique identifier. For example, it allows you to match St. Mary's, St Marys, and Saint Mary's so that you can work with disparate data from other data sets.
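As an illustration, a minimal sketch of using such a variant table as a lookup (the column names "variant" and "canonical_county" and the file name are assumptions, not the dataset's actual schema):

import pandas as pd

lookup = pd.read_csv("county_variants.csv")  # hypothetical file name
variant_to_canonical = dict(zip(lookup["variant"].str.casefold(), lookup["canonical_county"]))

def canonical_county(name: str) -> str:
    # Map a raw county spelling (e.g. "St Marys") to its canonical form,
    # falling back to the input if the spelling is not in the table.
    return variant_to_canonical.get(name.casefold(), name)

print(canonical_county("St Marys"))  # -> "St. Mary's", if that variant is in the table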
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
This dataset provides a comprehensive collection of synthetic job postings to facilitate research and analysis in the field of job market trends, natural language processing (NLP), and machine learning. Created for educational and research purposes, this dataset offers a diverse set of job listings across various industries and job types.
We would like to express our gratitude to the Python Faker library for its invaluable contribution to the dataset generation process. Additionally, we appreciate the guidance provided by ChatGPT in fine-tuning the dataset, ensuring its quality, and adhering to ethical standards.
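Purely as an illustration of the generation approach, a minimal sketch of producing synthetic postings with the Faker library (the field names are assumptions and do not reflect this dataset's actual schema):

from faker import Faker

fake = Faker()

def fake_job_posting() -> dict:
    # Each field is filled with a plausible but entirely fictional value.
    return {
        "job_title": fake.job(),
        "company": fake.company(),
        "location": fake.city(),
        "description": fake.paragraph(nb_sentences=5),
        "posted_date": fake.date_between(start_date="-1y", end_date="today").isoformat(),
    }

print(fake_job_posting())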
Please note that the examples provided are fictional and for illustrative purposes, and you can tailor the descriptions and examples to match the specifics of your dataset. The dataset is not suitable for real-world applications and should only be used within the scope of research and experimentation. You can also reach me via email at: rrana157@gmail.com
Random sampling of between 100K and 600K instances from the training data.
- train-df -> sampled training data
- match-df -> all matches from the sample
- sub -> perfect submission from the sampled data
- sub-naive -> naive submission (only same IDs) from the sampled data
Observational studies of causal effects often use multivariate matching to control imbalances in measured covariates. For instance, using network optimization, one may seek the closest possible pairing for key covariates among all matches that balance a propensity score and finely balance a nominal covariate, perhaps one with many categories. This is all straightforward when matching thousands of individuals, but requires some adjustments when matching tens or hundreds of thousands of individuals. In various senses, a sparser network—one with fewer edges—permits optimization in larger samples. The question is: What is the best way to make the network sparse for matching? A network that is too sparse will eliminate from consideration possible pairings that it should consider. A network that is not sparse enough will waste computation considering pairings that do not deserve serious consideration. We propose a new graded strategy in which potential pairings are graded, with a preference for higher grade pairings. We try to match with pairs of the best grade, incorporating progressively lower grade pairs only to the degree they are needed. In effect, only sparse networks are built, stored and optimized. Two examples are discussed, a small example with 1567 matched pairs from clinical medicine, and a slightly larger example with 22,111 matched pairs from economics. The method is implemented in an R package RBestMatch available at https://github.com/ruoqiyu/RBestMatch. Supplementary materials for this article are available online.
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
This dataset provides a synthetic representation of user behavior on a fictional dating app. It contains 50,000 records with 19 features capturing demographic details, app usage patterns, swipe tendencies, and match outcomes. The data was generated programmatically to simulate realistic user interactions, making it ideal for exploratory data analysis (EDA), machine learning modeling (e.g., predicting match outcomes), or studying user behavior trends in online dating platforms.
Key features include gender, sexual orientation, location type, income bracket, education level, user interests, app usage time, swipe ratios, likes received, mutual matches, and match outcomes (e.g., "Mutual Match," "Ghosted," "Catfished"). The dataset is designed to be diverse and balanced, with categorical, numerical, and labeled variables for various analytical purposes.
This dataset can be used for:
- Exploratory Data Analysis (EDA): investigate correlations between demographics, app usage, and match success.
- Machine Learning: build models to predict match outcomes or user engagement levels (a baseline sketch follows this list).
- Social Studies: analyze trends in dating app behavior across different demographics.
- Feature Engineering Practice: experiment with transforming categorical and numerical data.
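For the machine learning use case, a minimal baseline sketch; the file name and the target column name "match_outcome" are assumptions to adjust to the actual schema:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv("dating_app_behavior.csv")  # hypothetical file name

target = "match_outcome"                     # assumed target column name
X, y = df.drop(columns=[target]), df[target]
categorical = X.select_dtypes(include="object").columns.tolist()

model = Pipeline([
    ("prep", ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), categorical)],
        remainder="passthrough")),
    ("clf", RandomForestClassifier(n_estimators=200, random_state=0)),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))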
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
EyeFi Dataset
This dataset was collected as a part of the EyeFi project at the Bosch Research and Technology Center, Pittsburgh, PA, USA. The dataset contains WiFi CSI values of human motion trajectories along with ground truth location information captured through a camera. This dataset was used in the paper "EyeFi: Fast Human Identification Through Vision and WiFi-based Trajectory Matching", published at the IEEE International Conference on Distributed Computing in Sensor Systems 2020 (DCOSS '20). We also published a dataset paper titled "Dataset: Person Tracking and Identification using Cameras and Wi-Fi Channel State Information (CSI) from Smartphones" in the Data: Acquisition to Analysis 2020 (DATA '20) workshop, describing the details of the data collection. Please check it out for more information on the dataset.
Data Collection Setup
In our experiments, we used an Intel 5300 WiFi Network Interface Card (NIC) installed in an Intel NUC and the Linux CSI tools [1] to extract the WiFi CSI packets. The (x,y) coordinates of the subjects are collected from a Bosch Flexidome IP Panoramic 7000 panoramic camera mounted on the ceiling, and Angles of Arrival (AoAs) are derived from the (x,y) coordinates. Both the WiFi card and the camera are located at the same origin coordinates but at different heights: the camera is located around 2.85 m above the ground and the WiFi antennas are around 1.12 m above the ground.
The data collection environment consists of two areas. The first is a rectangular space measuring 11.8 m x 8.74 m, and the second is an irregularly shaped kitchen area with maximum distances of 19.74 m and 14.24 m between two walls. The kitchen also has numerous obstacles and different materials that exhibit different RF reflection characteristics, including strong reflectors such as metal refrigerators and dishwashers.
To collect the WiFi data, we used a Google Pixel 2 XL smartphone as an access point and connected the Intel 5300 NIC to it for WiFi communication. The transmission rate is about 20-25 packets per second. The same WiFi card and phone were used in both the lab and kitchen areas.
List of Files
Here is a list of files included in the dataset:
|- 1_person
|  |- 1_person_1.h5
|  |- 1_person_2.h5
|- 2_people
|  |- 2_people_1.h5
|  |- 2_people_2.h5
|  |- 2_people_3.h5
|- 3_people
|  |- 3_people_1.h5
|  |- 3_people_2.h5
|  |- 3_people_3.h5
|- 5_people
|  |- 5_people_1.h5
|  |- 5_people_2.h5
|  |- 5_people_3.h5
|  |- 5_people_4.h5
|- 10_people
|  |- 10_people_1.h5
|  |- 10_people_2.h5
|  |- 10_people_3.h5
|- Kitchen
|  |- 1_person
|  |  |- kitchen_1_person_1.h5
|  |  |- kitchen_1_person_2.h5
|  |  |- kitchen_1_person_3.h5
|  |- 3_people
|  |  |- kitchen_3_people_1.h5
|- training
|  |- shuffuled_train.h5
|  |- shuffuled_valid.h5
|  |- shuffuled_test.h5
|- View-Dataset-Example.ipynb
|- README.md
In this dataset, the folders `1_person/`, `2_people/`, `3_people/`, `5_people/`, and `10_people/` contain data collected from the lab area, whereas the `Kitchen/` folder contains data collected from the kitchen area. To see how each file is structured, please see the Access the data section below.
The `training/` folder contains the training dataset we used to train the neural network discussed in our paper. It was generated by shuffling all the data from the `1_person/` folder collected in the lab area (`1_person_1.h5` and `1_person_2.h5`).
Why multiple files in one folder?
Each folder contains multiple files. For example, the `1_person` folder has two files: `1_person_1.h5` and `1_person_2.h5`. Files in the same folder always have the same number of human subjects present simultaneously in the scene. However, the person holding the phone can be different. Also, the data may have been collected on different days, and/or the data collection system may have needed to be rebooted due to stability issues. As a result, we provide different files (such as `1_person_1.h5` and `1_person_2.h5`) to distinguish between different phone holders and possible system reboots that introduce different phase offsets (see below) into the system.
Special note:
The file `1_person_1.h5` was generated with the same person holding the phone throughout, while `1_person_2.h5` contains different people holding the phone, with only one person present in the area at a time. The two files were also collected on different days.
Access the data
To access the data, an HDF5 library is needed to open the files. A free HDF5 viewer is available on the official website: https://www.hdfgroup.org/downloads/hdfview/. We also provide an example Python notebook, View-Dataset-Example.ipynb, to demonstrate how to access the data.
Each file (except the files under the `training/` folder) is structured as follows:
|- csi_imag
|- csi_real
|- nPaths_1
|  |- offset_00
|  |  |- spotfi_aoa
|  |- offset_11
|  |  |- spotfi_aoa
|  |- offset_12
|  |  |- spotfi_aoa
|  |- offset_21
|  |  |- spotfi_aoa
|  |- offset_22
|  |  |- spotfi_aoa
|- nPaths_2
|  |- offset_00
|  |  |- spotfi_aoa
|  |- offset_11
|  |  |- spotfi_aoa
|  |- offset_12
|  |  |- spotfi_aoa
|  |- offset_21
|  |  |- spotfi_aoa
|  |- offset_22
|  |  |- spotfi_aoa
|- nPaths_3
|  |- offset_00
|  |  |- spotfi_aoa
|  |- offset_11
|  |  |- spotfi_aoa
|  |- offset_12
|  |  |- spotfi_aoa
|  |- offset_21
|  |  |- spotfi_aoa
|  |- offset_22
|  |  |- spotfi_aoa
|- nPaths_4
|  |- offset_00
|  |  |- spotfi_aoa
|  |- offset_11
|  |  |- spotfi_aoa
|  |- offset_12
|  |  |- spotfi_aoa
|  |- offset_21
|  |  |- spotfi_aoa
|  |- offset_22
|  |  |- spotfi_aoa
|- num_obj
|- obj_0
|  |- cam_aoa
|  |- coordinates
|- obj_1
|  |- cam_aoa
|  |- coordinates
...
|- timestamp
The `csi_real` and `csi_imag` fields are the real and imaginary parts of the CSI measurements. The order of antennas and subcarriers for the 90 `csi_real` and `csi_imag` values is as follows: [subcarrier1-antenna1, subcarrier1-antenna2, subcarrier1-antenna3, subcarrier2-antenna1, subcarrier2-antenna2, subcarrier2-antenna3, … subcarrier30-antenna1, subcarrier30-antenna2, subcarrier30-antenna3] (a reshaping sketch is given after the access example below). The `nPaths_x` groups contain the SpotFi [2]-calculated WiFi Angle of Arrival (AoA), where `x` is the number of multipath components specified during the calculation. Under each `nPaths_x` group are `offset_xx` subgroups, where `xx` stands for the offset combination used to correct the phase offset during the SpotFi calculation. We measured the offsets as:
| Antennas | Offset 1 (rad) | Offset 2 (rad) |
|:--------:|:--------------:|:--------------:|
| 1 & 2    | 1.1899         | -2.0071        |
| 1 & 3    | 1.3883         | -1.8129        |
The measurement is based on the work in [3], where the authors state that there are two possible offsets between two antennas; we measured them by booting the device multiple times. The combination of offsets is used for the `offset_xx` naming. For example, `offset_12` means that offset 1 between antennas 1 & 2 and offset 2 between antennas 1 & 3 are used in the SpotFi calculation.
The `num_obj` field stores the number of human subjects present in the scene. `obj_0` is always the subject who is holding the phone. In each file, there are `num_obj` `obj_x` groups. For each `obj_x`, we have the `coordinates` reported by the camera and `cam_aoa`, which is the AoA estimated from the camera-reported coordinates. The (x,y) coordinates and AoA listed here are chronologically ordered (except in the files in the `training` folder). They reflect the way the person carrying the phone moved in the space (for `obj_0`) and how everyone else walked (for the other `obj_y`, where `y` > 0).
The `timestamp` field provides a time reference for each WiFi packet.
To access the data (Python):
import h5py

# Open one of the lab-area files in read-only mode
data = h5py.File('3_people_3.h5', 'r')

# Real and imaginary parts of the CSI measurements (90 values per packet)
csi_real = data['csi_real'][()]
csi_imag = data['csi_imag'][()]

# Camera-derived AoA and (x, y) coordinates of the phone holder (obj_0)
cam_aoa = data['obj_0/cam_aoa'][()]
cam_loc = data['obj_0/coordinates'][()]
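As a minimal sketch (not part of the official example notebook, and assuming `csi_real`/`csi_imag` hold one row of 90 values per packet as described above), the values can be combined into complex CSI and regrouped by subcarrier and antenna:

import h5py
import numpy as np

with h5py.File('3_people_3.h5', 'r') as data:
    # Combine real and imaginary parts into complex CSI values
    csi = data['csi_real'][()] + 1j * data['csi_imag'][()]

# Each packet has 90 values ordered subcarrier-major:
# [sc1-ant1, sc1-ant2, sc1-ant3, sc2-ant1, ..., sc30-ant3]
csi = csi.reshape(-1, 30, 3)  # (packets, subcarriers, antennas)
print(csi.shape)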
Files inside the `training/` folder have a different data structure:
|- nPath-1
|  |- aoa
|  |- csi_imag
|  |- csi_real
|  |- spotfi
|- nPath-2
|  |- aoa
|  |- csi_imag
|  |- csi_real
|  |- spotfi
|- nPath-3
|  |- aoa
|  |- csi_imag
|  |- csi_real
|  |- spotfi
|- nPath-4
|  |- aoa
|  |- csi_imag
|  |- csi_real
|  |- spotfi
The group `nPath-x` corresponds to the number of multipath components specified during the SpotFi calculation. `aoa` is the camera-generated angle of arrival (AoA), which can be considered ground truth; `csi_imag` and `csi_real` are the imaginary and real components of the CSI values; and `spotfi` contains the SpotFi-calculated AoA values. The SpotFi values are chosen based on the lowest median and mean error across `1_person_1.h5` and `1_person_2.h5`. All the rows under the same `nPath-x` group are aligned (i.e., the first row of `aoa` corresponds to the first row of `csi_imag`, `csi_real`, and `spotfi`). There is no timestamp recorded, and the sequence of the data is not chronological, as the rows were randomly shuffled from the `1_person_1.h5` and `1_person_2.h5` files.
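A minimal sketch (assuming the group and dataset names shown above) of loading one multipath setting from a training file:

import h5py
import numpy as np

with h5py.File('training/shuffuled_train.h5', 'r') as f:
    aoa = f['nPath-1/aoa'][()]        # camera-derived AoA (ground truth)
    csi_real = f['nPath-1/csi_real'][()]
    csi_imag = f['nPath-1/csi_imag'][()]
    spotfi = f['nPath-1/spotfi'][()]  # SpotFi-estimated AoA

# Rows are aligned across the four datasets (assumed one row per packet),
# so features and labels can be paired directly.
X = np.concatenate([csi_real, csi_imag], axis=-1)
y = aoa
print(X.shape, y.shape)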
Citation
If you use the dataset, please cite our paper:
@inproceedings{eyefi2020,
title={EyeFi: Fast Human Identification Through Vision and WiFi-based Trajectory Matching},
author={Fang, Shiwei and Islam, Tamzeed and Munir, Sirajum and Nirjon, Shahriar},
booktitle={2020 IEEE International Conference on Distributed Computing in Sensor Systems (DCOSS)},
year={2020},
}
We consider the problem of extracting joint and individual signals from multi-view data, that is, data collected from different sources on matched samples. While existing methods for multi-view data decomposition explore single matching of data by samples, we focus on double-matched multi-view data (matched by both samples and source features). Our motivating example is the miRNA data collected from both primary tumor and normal tissues of the same subjects; the measurements from two tissues are thus matched both by subjects and by miRNAs. Our proposed double-matched matrix decomposition allows us to simultaneously extract joint and individual signals across subjects, as well as joint and individual signals across miRNAs. Our estimation approach takes advantage of double-matching by formulating a new type of optimization problem with explicit row space and column space constraints, for which we develop an efficient iterative algorithm. Numerical studies indicate that taking advantage of double-matching leads to superior signal estimation performance compared to existing multi-view data decomposition based on single-matching. We apply our method to miRNA data as well as data from the English Premier League soccer matches and find joint and individual multi-view signals that align with domain-specific knowledge. Supplementary materials for this article are available online.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset contains data on Premier League seasons from the 1888/1889 season through the 2023/2024 season. It has files for every unique team, every season, all season stats, and all matches played since the 1888/1889 season.
There is some data missing, as it was not available on the website that the data was scraped from. For example, most of the seasons are missing passing data (attempts, completions, percentage), and a majority of the games are missing things such as attendance or expected goals. For the games that do have expected goals, xG is for the home team and xG.1 is for the away team.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset offers a simple entry point into the world of football match data analysis. It contains football match data from 27 countries and 42 leagues worldwide, including some of the best leagues such as the English Premier League, German Bundesliga, and Spanish La Liga. The data spans from the 2000/01 season to the most recent results from the 2024/25 season. The dataset also includes Elo ratings for the given time period, with snapshots of ~500 of the best teams in Europe taken twice a month, on the 1st and 15th.
Match results and statistics provided in the table are taken from Football-Data.co.uk. Elo data are taken from ClubElo.
📂 Files number: 2
🔗 Files type: .csv
⌨️ Total rows: ~475 000 as of 07/2025
💾 Total size: ~51 MB as of 07/2025
The dataset is a great starting point for football match prediction, both pre-match and in-play, with huge potential lying in the amount of data and their accuracy. The dataset contains information about teams' strength and form prior to the match, as well as general market predictions via pre-match odds.
1️⃣ SIZE - This is the biggest open and free dataset on the internet, keeping uniform information about tens of thousands of football matches, including match statistics, odds, and Elo and form information.
2️⃣ READABILITY - The whole dataset is tabular, and all of the data are clear to navigate and explain. Both tables in the dataset correspond to each other via remapped club names, and all of the formats within the table (such as odds) are uniform.
3️⃣ RECENCY - This is the most up-to-date open football dataset, containing data from matches as recent as July 2025. The plan is to update this dataset monthly or bi-monthly via a custom-made Python pipeline.
This table is a collection of Elo ratings taken from ClubElo. Snapshots are taken twice a month, on the 1st and 15th day of the month, saving the whole Club Elo database. Some clubs' names are remapped to correspond with the Matches table (for example "Bayern" to "Bayern Munich").
| Column | Data Type | Description |
|---|---|---|
| 📅 Date | date | Date of the snapshot. |
| 🛡️ Club | string | Club name in English, corresponding to the Matches table. |
| 🌍️ Country | enum | Club country three-letter code. |
| 📈 Elo | float | Club's current Elo rating, rounded to two decimal places. |
| Column | Data Type | Description |
|---|---|---|
| 🏆 Division | enum | League that the match was played in: country code + division number (I1 for Italian First Division). For countries where we only have one league, we use the 3-letter country code (ARG for Argentina). |
| 📆 MatchDate | date | Match date in the classic YYYY-MM-DD format. |
| 🕘 MatchTime | time | Match time in the HH:MM:SS format. CET-1 timezone. |
| 🏠 HomeTeam | string | Home team's club name in English, abbreviated if needed. |
| 🚗 AwayTeam | string | Away team's club name in English, abbreviated if needed. |
| 📊 HomeElo | float | Home team's most recent Elo rating. |
| 📊 AwayElo | float | Away team's most recent Elo rating. |
| 📉 Form3Home | int | Number of points gathered by the home team in the last 3 matches (Win = 3 points, Draw = 1 point, Loss = 0 points, so this value is between 0 and 9). |
| 📈 Form5Home | int | Number of points gathered by the home team in the last 5 matches (Win = 3 points, Draw = 1 point, Loss = 0 points, so this value is between 0 and 15). |
| 📉 Form3Away | int | Number of points gathered by the away team in the last 3 matches (Win = 3 points, Draw = 1 point, Loss = 0 points, so this value is between 0 and 9). |
| 📈 Form5Away | int | Number of points gathered by the away team in the last 5 matches (Win = 3 points, Draw = 1 point, Loss = 0 points, so this value is between 0 and 15). |
| ⚽ FTHome | int | Full-time goals scored by the home team. |
| ⚽ FTAway | int | Full-time goals scored by the away team. |
| 🏁 FTResult | enum | Full-time result (H for Home win, D for Draw, A for Away win). |
| ⚽ HTHome | int | Half-time goals scored by the home team. |
| ⚽ HTAway | int | Half-time goals scored by the away team. |
| ⏱️ HTResult | enum | Half-time result (H for Home win, D for Draw, A for Away win). |
| 🏹 HomeShots | int | Total shots (goal, saved, blocked, off-target) by the home team. |
| 🏹 AwayShots | int | Total shots (goal, saved, blocked, off-target) by the away team. |
| 🎯 HomeTarget | int | Total shots on target (goal, saved) by the home team. |
| 🎯 AwayTarget | int | Total shots on target (goal, saved) by the away team. |
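A minimal usage sketch built on the columns listed above (the CSV file name is an assumption, not the dataset's actual file name):

import pandas as pd

matches = pd.read_csv("Matches.csv", parse_dates=["MatchDate"])  # hypothetical file name

# Simple pre-match features: Elo difference and five-match form difference.
matches["EloDiff"] = matches["HomeElo"] - matches["AwayElo"]
matches["Form5Diff"] = matches["Form5Home"] - matches["Form5Away"]

# How often does the home side win when its Elo advantage exceeds 100 points?
strong_home = matches[matches["EloDiff"] > 100]
print((strong_home["FTResult"] == "H").mean())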
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset offers a comprehensive record of international football matches from the very first game in 1872 to the present day in 2024. It covers a broad spectrum of football matches, including major tournaments like the FIFA World Cup and various friendly matches. With a total of 47,126 match records, this dataset is a valuable resource for analyzing historical trends, team performances, and match outcomes over more than a century of international football.
1) Match_Results.csv
- Date: The date when the match was played.
- Home Team: The team playing at home.
- Away Team: The team playing away.
- Home Score: The score of the home team, including extra time but not penalty shootouts.
- Away Score: The score of the away team, including extra time but not penalty shootouts.
- Tournament: The name of the tournament or competition in which the match was played.
- City: The city where the match was held.
- Country: The country where the match took place.
- Neutral: Indicates if the match was played at a neutral venue (TRUE/FALSE).
2) Penalty_Shootouts.csv
- Date: The date of the match.
- Home Team: The name of the home team.
- Away Team: The name of the away team.
- Winner: The team that won the penalty shootout.
- First Shooter: The team that took the first shot in the penalty shootout.
3) Goal_Scorers.csv
- Date: The date of the match.
- Home Team: The name of the home team.
- Away Team: The name of the away team.
- Team: The team that scored the goal.
- Scorer: The player who scored the goal.
- Minute: The minute when the goal was scored.
- Own Goal: Indicates if the goal was an own goal (TRUE/FALSE).
- Penalty: Indicates if the goal was scored from a penalty (TRUE/FALSE).
Full credit goes to Mart Jürisoo for the original work on international football results. The dataset titled International Football Results from 1872 to 2017 provided the foundational data and inspiration for this comprehensive historical archive.
The purpose of sharing this dataset is to foster collaborative research and analysis within the football community. By making this extensive historical data available, we aim to support studies on historical trends, team performances, and the evolution of international football over more than 150 years. This dataset is intended to be a valuable resource for researchers, analysts, and enthusiasts who wish to explore the rich history of international football and gain deeper insights into the sport's development.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Nearest-neighbor matching is a popular nonparametric tool to create balance between treatment and control groups in observational studies. As a preprocessing step before regression, matching reduces the dependence on parametric modeling assumptions. In current empirical practice, however, the matching step is often ignored in the calculation of standard errors and confidence intervals. In this article, we show that ignoring the matching step results in asymptotically valid standard errors if matching is done without replacement and the regression model is correctly specified relative to the population regression function of the outcome variable on the treatment variable and all the covariates used for matching. However, standard errors that ignore the matching step are not valid if matching is conducted with replacement or, more crucially, if the second step regression model is misspecified in the sense indicated above. Moreover, correct specification of the regression model is not required for consistent estimation of treatment effects with matched data. We show that two easily implementable alternatives produce approximations to the distribution of the post-matching estimator that are robust to misspecification. A simulation study and an empirical example demonstrate the empirical relevance of our results. Supplementary materials for this article are available online.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MCGD_Data_V2.2 contains all the data that we have collected on locations in modern China, plus a number of locations outside of China that we encounter frequently in historical sources on China. All further updates will appear under the name "MCGD_Data" with a time stamp (e.g., MCGD_Data2023-06-21)
You can also have access to this dataset and all the datasets that the ENP-China makes available on GitLab: https://gitlab.com/enpchina/IndexesEnp
Altogether there are 464,970 entries. The data include the names of locations and their variants in Chinese, pinyin, and any recorded transliteration; the name of the province in Chinese and in pinyin; the Province ID; the latitude and longitude; and the Name ID, Location ID, and NameID_Legacy. The Name IDs all start with H followed by seven digits. This is the internal ID system of MCGD (the NameID_Legacy column records the Name IDs in their original format depending on the source). Location IDs that start with "DH" are data points extracted from China Historical GIS (Harvard University); those that start with "D" are locations extracted from the data points in Geonames; those that have only digits (8 digits) are data points we have added from various map sources.
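A minimal sketch of inferring the source of each entry from the Location ID prefix convention described above (the file name and the exact "Location ID" column header are assumptions):

import pandas as pd

mcgd = pd.read_csv("MCGD_Data.csv", dtype=str)  # hypothetical file name

def id_source(location_id: str) -> str:
    # Classify a Location ID according to the prefix convention described above.
    if location_id.startswith("DH"):
        return "China Historical GIS (Harvard University)"
    if location_id.startswith("D"):
        return "Geonames"
    if location_id.isdigit() and len(location_id) == 8:
        return "added from map sources"
    return "other/unknown"

mcgd["source"] = mcgd["Location ID"].map(id_source)
print(mcgd["source"].value_counts())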
One of the main features of the MCGD Main Dataset is the systematic collection and compilation of place names from non-Chinese language historical sources. Locations were designated in transliteration systems that are hardly comprehensible today, which makes it very difficult to find the actual locations they correspond to. This dataset allows for the conversion from these obsolete transliterations to the current names and geocoordinates.
From June 2021 onward, we have adopted a different file naming system to keep track of versions. From MCGD_Data_V1 we have moved to MCGD_Data_V2. In June 2022, we introduced time stamps, which result in the following naming convention: MCGD_Data_YYYY.MM.DD.
UPDATES
MCGD_Data2025_02_28 includes a major change with the duplication of all the locations listed under Beijing, Shanghai, Tianjin, and Chongqing (北京, 上海, 天津, 重慶) and their listing under the name of the provinces to which they belonged originally, before the creation of the four special municipalities after 1949. This is meant to facilitate the matching of data from historical sources. Each location has a unique NameID. Altogether there are 472,818 entries.
MCGD_Data2025_02_27 includes an update on locations extracted from Minguo zhengfu ge yuanhui keyuan yishang zhiyuanlu 國民政府各院部會科員以上職員錄 (Directory of staff members and above in the ministries and committees of the National Government; Nanjing: Guomin zhengfu wenguanchu yinzhuju 國民政府文官處印鑄局, 1944). We also made corrections in the Prov_Py and Prov_Zh columns, as there were some misalignments between the pinyin name and the name in Chinese characters. The file now includes 465,128 entries.
MCGD_Data2024_03_23 includes an update on locations in Taiwan from the Asia Directories. Altogether there are 465,603 entries (of which 187 are place names without geocoordinates, labelled "Unknown" in the Lat and Long columns).
MCGD_Data2023.12.22 contains all the data that we have collected on locations in China, whatever the period. Altogether there are 465,603 entries (of which 187 are place names without geocoordinates, labelled "Unknown" in the Lat and Long columns). The dataset also includes locations outside of China for the purpose of matching such locations to the place names extracted from historical sources. For example, one may need to locate individuals born outside of China. Rather than maintaining two separate files, we made the decision to incorporate all the place names found in historical sources in the gazetteer. Such place names can easily be removed by selecting all the entries where the 'Province' data is missing.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Information
The diverse publicly available compound/bioactivity databases constitute a key resource for data-driven applications in chemogenomics and drug design. Analysis of their coverage of compound entries and biological targets revealed considerable differences, however, suggesting the benefit of a consensus dataset. Therefore, we have combined and curated information from five esteemed databases (ChEMBL, PubChem, BindingDB, IUPHAR/BPS and Probes&Drugs) to assemble a consensus compound/bioactivity dataset comprising 1,144,803 compounds with 10,915,362 bioactivities on 5,613 targets (including defined macromolecular targets as well as cell-lines and phenotypic readouts). It also provides simplified information on the assay types underlying the bioactivity data and on bioactivity confidence by comparing data from different sources. We have unified the source databases, brought them into a common format, and combined them, enabling ease of use in multiple applications such as chemogenomics and data-driven drug design.
The consensus dataset provides increased target coverage and contains a higher number of molecules compared to the source databases, which is also evident from a larger number of scaffolds. These features render the consensus dataset a valuable tool for machine learning and other data-driven applications in (de novo) drug design and bioactivity prediction. The increased chemical and bioactivity coverage of the consensus dataset may improve the robustness of such models compared to the single source databases. In addition, semi-automated structure and bioactivity annotation checks, with flags for divergent data from different sources, may help data selection and further accurate curation.
Structure and content of the dataset
| ChEMBL ID | PubChem ID | IUPHAR ID | Target | Activity type | Assay type | Unit | Mean C (0) | ... | Mean PC (0) | ... | Mean B (0) | ... | Mean I (0) | ... | Mean PD (0) | ... | Activity check annotation | Ligand names | Canonical SMILES C | ... | Structure check | Source |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
The dataset was created using the Konstanz Information Miner (KNIME) (https://www.knime.com/) and was exported as a CSV-file and a compressed CSV-file.
Except for the canonical SMILES columns, all columns are filled with the datatype 'string'; the datatype for the canonical SMILES columns is the SMILES format. We recommend the File Reader node for using the dataset in KNIME: with the help of this node the data types of the columns can be adjusted exactly, and only this node can read the compressed format.
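Outside of KNIME, a minimal sketch of reading the CSV export with pandas, keeping every column as a string as recommended (the file name is a placeholder, not the dataset's actual file name):

import pandas as pd

# pandas infers gzip compression from the .gz extension.
df = pd.read_csv("consensus_dataset.csv.gz", dtype=str)
print(df.shape)
print(df.columns.tolist()[:10])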
Column content:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”
A peer-reviewed data paper for this dataset is under review for publication in NECSUS_European Journal of Media Studies - an open access journal aiming at enhancing data transparency and reusability - and will be available from https://necsus-ejms.org/ and https://mediarep.org
Please cite this when using the dataset.
Detailed description of the dataset:
1 Film Dataset: Festival Programs
The Film Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.
The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.
The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.
The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information on whether there is festival run information available through the IMDb data.
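A minimal sketch of deriving a wide, one-row-per-film table from the long table (the column names "film_id" and "fest" are assumptions rather than the codebook's actual variable names, and the sketch assumes the long table lists a film's first sample festival first):

import pandas as pd

long_df = pd.read_csv("1_film-dataset_festival-program_long.csv")

# Keep one row per unique film; the retained 'fest' value then mirrors the
# wide file, where fest is the first sample festival the film appeared at.
wide_df = long_df.drop_duplicates(subset="film_id", keep="first")
print(wide_df.shape[0])  # should be close to the 9,348 unique films described above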
2 Survey Dataset
The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.
The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.
The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.
The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.
3 IMDb & Scripts
The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.
The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.
The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.
The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.
The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.
The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to each crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.
The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.
The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.
The csv file “3_imdb-dataset_release-info_long” contains data about non-festival releases (e.g., theatrical, digital, TV, DVD/Blu-ray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.
The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.
The dataset includes 8 text files containing the scripts for web scraping. They were written using R version 3.6.3 for Windows.
The R script “r_1_unite_data” demonstrates the structure of the dataset, that we use in the following steps to identify, scrape, and match the film data.
The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then, if no matches are found, using an alternative title and a basic search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records on the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with the suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, with a fuzzy matching approach using two methods, “cosine” and “osa”: cosine similarity is used to match titles with a high degree of similarity, and the OSA algorithm is used to match titles that may have typos or minor variations.
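The original matching scripts are written in R; purely as an illustration of the two string-distance ideas they rely on (cosine similarity over character n-grams and optimal string alignment), here is a minimal Python sketch with made-up title pairs:

from collections import Counter

def trigrams(s: str):
    s = s.lower()
    return [s[i:i + 3] for i in range(max(len(s) - 2, 1))]

def cosine_similarity(a: str, b: str) -> float:
    # Cosine similarity over character trigram counts.
    ca, cb = Counter(trigrams(a)), Counter(trigrams(b))
    dot = sum(ca[g] * cb[g] for g in ca)
    norm = (sum(v * v for v in ca.values()) ** 0.5) * (sum(v * v for v in cb.values()) ** 0.5)
    return dot / norm if norm else 0.0

def osa_distance(a: str, b: str) -> int:
    # Optimal string alignment (restricted Damerau-Levenshtein) distance.
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

print(cosine_similarity("the hours", "hours, the"))  # fairly high despite reordered words
print(osa_distance("amelie", "amlie"))               # small distance: a single typo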
The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of the following five categories: a) 100% match: perfect match on title, year, and director; b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible doubles in the dataset and identifies them for a manual check.
The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.
The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. This script does that for the first 100 films only, to check that everything works. Scraping the entire dataset took a few hours, so a test with a subsample of 100 films is advisable.
The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.
The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.
The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the number of missing values and errors.
4 Festival Library Dataset
The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.
The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, and festival categories,
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for Advanced Resume Parser & Job Matcher Resumes
This dataset contains a merged collection of real and synthetic resume data in JSON format. The resumes have been normalized to a common schema to facilitate the development of NLP models for candidate-job matching in the technical recruitment domain.
Dataset Details
Dataset Description
This dataset is a combined collection of real resumes and synthetically generated CVs.
Curated by: datasetmaster… See the full description on the dataset page: https://huggingface.co/datasets/datasetmaster/resumes.
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
La Liga is by far one of the world’s most entertaining leagues. They have some of the best managers, players and fans! But, what makes it truly entertaining is the sheer unpredictability. There are 6 equally amazing teams with a different team lifting the trophy every season. Not only that, the league has also witnessed victories from teams outside of the top 6. So, let us analyze some of these instances.
So far, the implementation of statistics into soccer has been positive. Teams can easily gather information about their opponents and their tactics. This convenience allows managers to create well-thought out game plans that suit their team, maximize opponents' weaknesses, and increase their chances of winning.
A goal is scored when the whole of the ball passes over the goal line, between the goalposts and under the crossbar, provided that no offence has been committed by the team scoring the goal. If the goalkeeper throws the ball directly into the opponents' goal, a goal kick is awarded.
THE TIME OF SEASON/MOTIVATION: While a club battling for a league title is going to be hungry for a win, as is a side that is fighting to stay up, a club that has already won the title or has already been relegated is unlikely to work as hard, and will often rest players as well. THE REFEREE: Of course, when referees send players off it makes a massive impact on a match, but even if the referee is just awarding a yellow card it can affect the outcome of the game, as the player booked is less likely to go in as hard for the rest of the match.
SUBSTITUTES: The whole point of substitutes is for them to be able to come on and impact a match. Subs not only bring on a fresh pair of legs that are less tired than starters and more likely to track back and push forward, but can also play crucial roles in the formation of a team.
MIND GAMES/MANAGERS: Playing mind games has almost become a regular routine for top level managers, and rightly so. Just a simple mind game can do so much to impact a match, a good example coming from Sir Alex Ferguson.
Per his autobiography, when Manchester United were losing late on in a match, at a certain point he would tap his watch and make sure to let the opposition know he was signalling this to his players. United's opposition already know that United have a tendency to come back from behind, and upon seeing this gesture they will think that United are going to come back. And because scientific studies prove that living creatures are more likely to accept things that have happened before than not - horses are more likely to lose to a horse they have already lost to in a race even if they are on an even playing field - they often succumb to a loss.
FORM/INJURIES/FIXTURES: A team in better form is more likely to win a match than one on a poor run of form, while a team in the middle of a condensed run of fixtures is less likely to win than a well-rested team. These are just some of the things that affect matches - if you have any others, just mention them in the comment section below and I'll try to add them in!
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains Spatial Transcriptomics (ST) data matched with Matrix-Assisted Laser Desorption/Ionization Mass Spectrometry Imaging (MALDI-MSI). These data are complementary to the data contained in the same project. Files with the same identifiers in the two datasets originated from the very same tissue section and can be combined into a multimodal ST-MSI object. For more information about the dataset please see our manuscript posted on BioRxiv (doi: https://doi.org/10.1101/2023.01.26.525195). This dataset includes ST data from 19 tissue sections, including human post-mortem and mouse samples. The spatial transcriptomics data was generated using the Visium protocol (10x Genomics). The murine tissue sections come from three different mice unilaterally injected with 6-OHDA, a neurotoxin that, when injected in the brain, can selectively destroy dopaminergic neurons. We used this mouse model to show the applicability of the technology that we developed, named Spatial Multimodal Analysis (SMA). Using our technology on these mouse brain tissue sections, we were able to detect both dopamine with MALDI-MSI and the corresponding gene expression with ST. This dataset also includes one human post-mortem striatum sample that was placed on one Visium slide across the four capture areas. This sample was analyzed with a different ST protocol named RRST (Mirzazadeh, R., Andrusivova, Z., Larsson, L. et al. Spatially resolved transcriptomic profiling of degraded and challenging fresh frozen samples. Nat Commun 14, 509 (2023). https://doi.org/10.1038/s41467-023-36071-5), where probes capturing the whole transcriptome are first hybridized in the tissue section and then spatially detected. Each tissue section contained in the dataset has been given a unique identifier composed of the Visium array ID and the capture area ID of the Visium slide that the tissue section was placed on. This unique identifier is included in the file names of all the files relative to the same tissue section, including the MALDI-MSI files published in the other dataset included in this project. In this dataset you will find the following files for each tissue section:
- raw files: these are the read one fastq files (containing the pattern *R1*fastq.gz in the file name), the read two fastq files (containing the pattern *R2*fastq.gz in the file name) and the raw microscope images (containing the pattern Spot.jpg in the file name). These are the only files needed to run the Space Ranger pipeline, which is freely available to any user (please see the 10x Genomics website for information on how to install and run Space Ranger);
- processed data files: we provide processed data files of three types: a) Space Ranger outputs that were used to produce the figures in our publication; b) manual annotation tables in csv format produced using Loupe Browser 6 (csv tables with file names ending in _RegionLoupe.csv, _filter.csv, _dopamine.csv, _lesion.csv, or _region.csv); c) json files that we used as input for Space Ranger in the cases where the automatic tissue detection included in the pipeline failed to recognize the tissue or the fiducials. Using these processed files the user can reproduce the figures of our publication without having to restart from the raw data files.
The MALDI-MSI analysis preceding ST was performed with different matrices on different tissue sections.
We used: 1) 9-aminoacridine (9-AA) for detection of metabolites in negative ionization mode; 2) 2,5-dihydroxybenzoic acid (DHB) for detection of metabolites in positive ionization mode; 3) 4-(anthracen-9-yl)-2-fluoro-1-ethylpyridin-1-ium iodide (FMP-10), which charge-tags molecules with phenolic hydroxyls and/or primary amines, including neurotransmitters. The information about which matrix was sprayed on each tissue section, together with other information about the samples, is included in the metadata table. We also used three types of control samples:
- standard Visium: samples processed with standard Visium (i.e. no matrix spraying, no MALDI-MSI, protocol as recommended by 10x Genomics with no exceptions);
- internal controls (iCTRL): samples not sprayed with any matrix nor processed with MALDI-MSI, but located on the same Visium slide where other samples were processed with MALDI-MSI;
- FMP-10-iCTRL: a sample sprayed with FMP-10 and then processed as an iCTRL.
This and other information is provided in the metadata table.
Day 3 003_srgb_1_to_day2 083_srgb_5_human contains the L*a*b* and Gabor filter outputs of all 505 samples of bark, as seen by human visual systems. Filename+photo = name of image; rep = number of the sample from each image (5 for each image); L = L value of L*a*b*; a = a value of L*a*b*; b = b value of L*a*b*; fxoy = f: the Gabor filter's spatial frequency (x = 1-4, from fine to coarse), o: the Gabor filter's orientation (y = 1-6, from 0 to 150° in 30° increments).
Day 3 003_srgb_1_to_day2 083_srgb_5_bird contains the L*a*b* and Gabor filter outputs of all 505 samples of bark, as seen by bird visual systems, with the same column definitions as above.
Survival Experiment contains all the data from the field experiment. Columns: Block = number of block, Trea...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The following repository contains a collection of data and metadata files in different vendor formats which were collected in the fields of atom probe microscopy (LEAP instruments) and electron microscopy (Nion instruments). These files are meant for development and testing purposes of the nomad north-remote-tools-hub and the related nomad-nexus-parser software tools within the FAIRmat project. FAIRmat is a consortium led by the Humboldt-Universität zu Berlin and a member of the German Research Data Infrastructure (NFDI) initiative. A detailed description of the background and content of the individual files follows:
EM.STEM.Nion.Dataset.1.zip: This is a dataset for testing the nx_em_nion reader, which handles files from Nion microscopes and NionSwift software. The data were collected by Benedikt Haas and Sherjeel Shabih from Humboldt-Universität zu Berlin, who worked (at the point of publication) in the group of Prof. Christoph Koch.
APM.LEAP.Datasets.*.zip: This is a collection of two datasets for testing the generic nx_apm reader, which handles commercial and community file formats for reconstructed ion position and ranging data from atom probe microscopy experiments. The datasets were collected by different authors.
APM.LEAP.Datasets.1.zip: R31_06365-v02.pos was shared by Jing Wang and Daniel Schreiber (both at PNNL). Details on the dataset are available under the following DOIs: https://doi.org/10.1017/S1431927618015386 and https://doi.org/10.1017/S1431927621012241. 70_50_50.apt was shared by Xuyang Zhou, during his time with the Max-Planck-Institut für Eisenforschung GmbH, as open-source test data for the publication he led on machine-learning-based techniques for composition profiling. The dataset and publication are available via the following DOI and resources: https://doi.org/10.1016/j.actamat.2022.117633. The dataset specifically is also available here: https://github.com/RhettZhou/APT_GB/tree/main/example/Cropped_70_50_50. The *.rng and *.rrng range files serve as examples for developing tools that parse them and handle the formatting of range files. The scientific content of the range files was inspired by experiments but is not related to the above-mentioned atom probe datasets and should not be used to analyze these test data for more than pure development purposes. Use instead your own data and matching range files for scientific analyses.
APM.LEAP.Datasets.2.zip: R18_53222_W_18K-v01.epos was shared with Markus Kühbach by Andrew Breen during their time at the Max-Planck-Institut für Eisenforschung GmbH. We would like to invite the community to use the nomad infrastructure and support us by sharing data and datasets, which we can then use to improve the file format parsing, the reading capabilities, and the analysis services of the nomad infrastructure, so that the community can again profit from these developments.
aut_leoben_leitner.tar.gz is the dataset associated with the grain boundary solute segregation case study discussed in https://arxiv.org/abs/2205.13510.
usa_portland_wang.tar.gz is the ODS steel specimen dataset, which is a good example for testing and learning iso-surface-based analyses with the paraprobe-toolbox.
This dataset was mentioned as one of the test cases in https://arxiv.org/abs/2205.13510. ger_berlin_kuehbach_fairmat_is_usa_portland_wang.tar.gz is content-wise the same scientific dataset as the one in usa_portland_wang.tar.gz, but stored in an HDF5 file that is formatted in compliance with the NXapm NeXus application definition as of the NeXus code camp 2022: https://manual.nexusformat.org/classes/contributed_definitions/NXapm.html#nxapm