CC0 1.0 Universal Public Domain Dedication
https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Abstract: Scholars studying organizations often work with multiple datasets lacking shared unique identifiers or covariates. In such situations, researchers usually use approximate string ("fuzzy") matching methods to combine datasets. String matching, although useful, faces fundamental challenges. Even when two strings appear similar to humans, fuzzy matching often does not work because it fails to adapt to the informativeness of the character combinations. In response, a number of machine-learning methods have been developed to refine string matching. Yet, the effectiveness of these methods is limited by the size and diversity of training data. This paper introduces data from a prominent employment networking site (LinkedIn) as a massive training corpus to address these limitations. We show how, by leveraging information from LinkedIn regarding organizational name-to-name links, we can improve upon existing matching benchmarks, incorporating the trillions of name pair examples from LinkedIn into various methods to improve performance by explicitly maximizing match probabilities inferred from the LinkedIn corpus. We also show how relationships between organization names can be modeled using a network representation of the LinkedIn data. In illustrative merging tasks involving lobbying firms, we document improvements when using the LinkedIn corpus in matching calibration and make all data and methods open source.
Keywords: Record linkage; Interest groups; Text as data; Unstructured data
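As the abstract notes, character-level fuzzy matching ignores how informative particular tokens are. A minimal illustration in Python (the company names are made up, and `difflib` stands in for the more elaborate string-distance methods the paper discusses):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Plain character-level similarity in [0, 1], with no token weighting."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Uninformative shared tokens like "Group Inc" inflate the score:
# two different firms can outscore two variants of the same firm.
print(similarity("Apple Inc", "Apple Incorporated"))
print(similarity("Acme Group Inc", "Ajax Group Inc"))
```

This is exactly the failure mode the paper targets: the score rewards shared boilerplate characters rather than the informative part of the name.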
Salutary Data is a boutique, B2B contact and company data provider that's committed to delivering high quality data for sales intelligence, lead generation, marketing, recruiting / HR, identity resolution, and ML / AI. Our database currently consists of 148MM+ highly curated B2B contacts (US only), along with over 4M+ companies, and is updated regularly to ensure we have the most up-to-date information.
We can enrich your in-house data (CRM enrichment, lead enrichment, etc.) and provide you with a custom dataset (such as a lead list) tailored to your target audience specifications and data use-case. We also support large-scale data licensing to software providers and agencies that intend to redistribute our data to their customers and end-users.
What makes Salutary unique?
- We offer our clients a truly unique, one-stop aggregation of best-of-breed quality data sources. Our supplier network consists of numerous established, high-quality suppliers that are rigorously vetted.
- We leverage third-party verification vendors to ensure phone numbers and emails are accurate and connect to the right person. Additionally, we deploy automated and manual verification techniques to ensure we have the latest job information for contacts.
- We're reasonably priced and easy to work with.
Products: API Suite, Web UI, Full and Custom Data Feeds
Services:
- Data Enrichment: We assess fill-rate gaps and profile your customer file for the purpose of appending fields, updating information, and/or rendering net-new "look alike" prospects for your campaigns.
- ABM Match & Append: Send us your domain or other company-related files, and we'll match your Account Based Marketing targets and provide you with B2B contacts to campaign. Optionally include your suppression file to avoid any redundant records.
- Verification ("Cleaning/Hygiene") Services: Address the 2%-per-month aging issue on contact records! We will identify duplicate records, flag contacts no longer at the company, remove email hard bounces, and update or replace titles and phone numbers. This is right up our alley and leverages our existing internal and external processes and systems.
Success.ai’s Company Data Solutions provide businesses with powerful, enterprise-ready B2B company datasets, enabling you to unlock insights on over 28 million verified company profiles. Our solution is ideal for organizations seeking accurate and detailed B2B contact data, whether you’re targeting large enterprises, mid-sized businesses, or small business contact data.
Success.ai offers B2B marketing data across industries and geographies, tailored to fit your specific business needs. With our white-glove service, you’ll receive curated, ready-to-use company datasets without the hassle of managing data platforms yourself. Whether you’re looking for UK B2B data or global datasets, Success.ai ensures a seamless experience with the most accurate and up-to-date information in the market.
API Features:
Why Choose Success.ai’s Company Data Solution? At Success.ai, we prioritize quality and relevancy. Every company profile is AI-validated for a 99% accuracy rate and manually reviewed to ensure you're accessing actionable and GDPR-compliant data. Our price match guarantee ensures you receive the best deal on the market, while our white-glove service provides personalized assistance in sourcing and delivering the data you need.
Why Choose Success.ai?
Our database spans 195 countries and covers 28 million public and private company profiles, with detailed insights into each company’s structure, size, funding history, and key technologies. We provide B2B company data for businesses of all sizes, from small business contact data to large corporations, with extensive coverage in regions such as North America, Europe, Asia-Pacific, and Latin America.
Comprehensive Data Points: Success.ai delivers in-depth information on each company, with over 15 data points, including:
- Company Name: the full legal name of the company.
- LinkedIn URL: direct link to the company's LinkedIn profile.
- Company Domain: website URL for more detailed research.
- Company Description: overview of the company's services and products.
- Company Location: geographic location down to the city, state, and country.
- Company Industry: the sector or industry the company operates in.
- Employee Count: number of employees, to help identify company size.
- Technologies Used: insights into key technologies employed by the company, valuable for tech-based outreach.
- Funding Information: total funding and the most recent funding dates, for tracking investment opportunities.
Maximize Your Sales Potential: With Success.ai's B2B contact data and company datasets, sales teams can build tailored lists of target accounts, identify decision-makers, and access real-time company intelligence. Our curated datasets ensure you're always focused on high-value leads: those most likely to convert into clients. Whether you're conducting account-based marketing (ABM), expanding your sales pipeline, or looking to improve your lead-generation strategies, Success.ai offers the resources you need to scale your business efficiently.
Tailored for Your Industry: Success.ai serves multiple industries, including technology, healthcare, finance, manufacturing, and more. Our B2B marketing data solutions are particularly valuable for businesses looking to reach professionals in key sectors. You’ll also have access to small business contact data, perfect for reaching new markets or uncovering high-growth startups.
From UK B2B data to contacts across Europe and Asia, our datasets provide global coverage to expand your business reach and identify new...
https://creativecommons.org/publicdomain/zero/1.0/
Like many others, I have asked myself whether it is possible to use machine learning to create valid predictions for football (soccer) match outcomes. Hence I created a dataset consisting of historic match data for the German Bundesliga (1st and 2nd division) as well as the English Premier League, reaching back as far as 1993 and up to 2016. Besides the basic information on goals scored and home/draw/away wins, the dataset also includes per-team data such as transfer value (pre-season), squad strength, etc. Unfortunately I was only able to find sources for these advanced attributes going back to the 2005 season.
I have used this dataset with different machine learning algorithms, including random forests, XGBoost, and different recurrent neural network architectures (to potentially identify recurring patterns in winning streaks, etc.). I'd like to share the approaches I used as separate kernels here as well. So far I have not managed to consistently exceed an accuracy of 53% on a validation set using the 2016 season of Bundesliga 1 (no-information rate = 49%).
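The no-information rate mentioned above is simply the accuracy of always predicting the most frequent outcome; a minimal sketch on toy outcomes (not the actual Bundesliga data):

```python
from collections import Counter

def no_information_rate(outcomes):
    """Accuracy of a trivial classifier that always predicts the
    most frequent class in the observed outcomes."""
    counts = Counter(outcomes)
    return max(counts.values()) / len(outcomes)

# Hypothetical toy season: H = home win, D = draw, A = away win.
season = ["H", "H", "D", "A", "H", "A", "D", "H", "A", "H"]
print(no_information_rate(season))  # 0.5, since "H" occurs 5 of 10 times
```

Any model worth keeping must beat this baseline on held-out data, which is why the 53% vs. 49% comparison above matters.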
Although I did some visual exploration in Tableau before implementing the different machine learning approaches, I think a visual exploration kernel would be very beneficial.
The data comes as an SQLite file containing the following tables and fields:
Table: Matches
Table: Teams
Table: Unique Teams
Table: Teams_in_Matches
Based on these tables I created a couple of views which I used as input for my machine learning models:
View: FlatView
Combination of all matches with the respective additional data from Teams table for both home and away team.
View: FlatView_Advanced
Same as Flatview but also includes Unique_Team_ID and Unique_Team in order to easily retrieve all matches played by a team in chronological order.
View: FlatView_Chrono_TeamOrder_Reduced
Similar to FlatView_Advanced, but without the additional team attributes, in order to have a longer history covering the years 1993 - 2004. Especially interesting if one is only interested in analyzing winning/losing streaks.
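As a rough illustration of how such flattened views join match and team data, here is a sketch against a hypothetical miniature schema (column names are illustrative, not the dataset's actual schema):

```python
import sqlite3

# Hypothetical miniature schema; the real file has more tables and columns.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Matches (Match_ID INTEGER, HomeTeam TEXT, AwayTeam TEXT,
                      FTHG INTEGER, FTAG INTEGER);
CREATE TABLE Teams (TeamName TEXT, Season INTEGER, MarketValue REAL);
-- A FlatView-style view: each match row enriched with both teams' attributes.
CREATE VIEW FlatView AS
  SELECT m.*, h.MarketValue AS HomeValue, a.MarketValue AS AwayValue
  FROM Matches m
  JOIN Teams h ON h.TeamName = m.HomeTeam
  JOIN Teams a ON a.TeamName = m.AwayTeam;
""")
con.execute("INSERT INTO Teams VALUES ('FC Koeln', 2016, 80.5), ('HSV', 2016, 60.0)")
con.execute("INSERT INTO Matches VALUES (1, 'FC Koeln', 'HSV', 2, 0)")
for row in con.execute("SELECT HomeTeam, AwayTeam, HomeValue, AwayValue FROM FlatView"):
    print(row)
```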
Thanks to football-data.co.uk and transfermarkt.de for providing the raw data used in this dataset.
Please feel free to use the humble dataset provided here for any purpose you want. To me it would be most interesting whether others think that recurrent neural networks could in fact be of help (and maybe even outperform classical feature engineering) in identifying streaks of losses and wins. In the literature I mostly found examples of RNN applications where the data were time series in a very narrow sense (e.g. temperature measurements over time), hence it would be interesting to get your input on this question.
Maybe someone also finds additional attributes per team or match which have substantial impact on match outcome. So far I have found the "Market Value" of a team to be by far the best predictor when two teams face each other, which makes sense, as the market value usually correlates closely with the strength of a team and its prospects of winning.
Attribution 4.0 (CC BY 4.0)
https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Currently, the repository provides code for two such methods:
- The ABE fully automated approach: This is a fully automated method for linking historical datasets (e.g. complete-count censuses) by first name, last name, and age. The approach was first developed by Ferrie (1996) and adapted and scaled for the computer by Abramitzky, Boustan and Eriksson (2012, 2014, 2017). Because names are often misspelled or mistranscribed, the approach suggests testing robustness to alternative name matching (using raw names, NYSIIS standardization, and Jaro-Winkler distance). To reduce the chance of false positives, it also suggests testing robustness by requiring names to be unique within a five-year window and/or requiring the match on age to be exact.
- A fully automated probabilistic approach (EM): This approach (Abramitzky, Mill, and Perez 2019) suggests a fully automated probabilistic method for linking historical datasets. We combine distances in reported names and ages between any two potential records into a single score, roughly corresponding to the probability that both records belong to the same individual. We estimate these probabilities using the Expectation-Maximization (EM) algorithm, a standard technique in the statistical literature. We suggest a number of decision rules that use these estimated probabilities to determine which records to use in the analysis.
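A toy sketch of the ABE-style uniqueness requirement (the records are illustrative, and the real method also matches on NYSIIS/Jaro-Winkler name variants rather than exact strings):

```python
from collections import Counter

def abe_style_match(records_a, records_b, age_band=2):
    """Match (name, age) records across two lists; keep only matches that
    are unique within the age band, to limit false positives."""
    matches = []
    for i, (name_a, age_a) in enumerate(records_a):
        cands = [j for j, (name_b, age_b) in enumerate(records_b)
                 if name_b == name_a and abs(age_b - age_a) <= age_band]
        if len(cands) == 1:           # ambiguous A-side records are dropped
            matches.append((i, cands[0]))
    # Drop matches where one B record was claimed by multiple A records.
    claimed = Counter(j for _, j in matches)
    return [(i, j) for i, j in matches if claimed[j] == 1]

a = [("john smith", 30), ("mary jones", 25), ("john smith", 31)]
b = [("john smith", 29), ("mary jones", 26)]
print(abe_style_match(a, b))  # [(1, 1)]: both John Smiths are ambiguous, so dropped
```

The two John Smith records both fall within the age band of the single John Smith in the second list, so neither is kept, which is the point of the uniqueness rule.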
By data.world's Admin [source]
This dataset contains aggregated spellings and misspellings of the names of 15 famous celebrities. Ever wonder if people can actually spell someone's name correctly? Now you can see for yourself with this compiled data from The Pudding's interactive spelling experiment called The Gyllenhaal Experiment! It is interesting to see which names get misspelled more than others - some are easy to guess, some are surprising! With the data provided here, you can start uncovering trends in name-spelling habits. Visualize the data and start analyzing how unique or common each celebrity is with respect to spelling - who stands out? Who blends in? Check it out today and explore a side of celebrity life that hasn't been seen before!
This dataset contains misspellings of the names of 15 famous celebrities. It can be used for a variety of research and analysis purposes, including exploring human language, understanding how names are misspelled, or generating data visualizations.
In order to get the most out of this dataset, you will need to familiarize yourself with its columns. The dataset consists of two columns: “data” and “updated”. The “data” column contains the misspellings associated with each celebrity name. The “updated” column is automatically updated with the date on which the data was last changed or modified.
To use this dataset for your own research and analysis purposes, you may find it useful to filter out certain types of responses or patterns in order to focus more closely on particular trends or topics of interest; for example, if you are interested in exploring how spellings vary by region then you might wish to group together similar responses regardless of whether they exactly match one celebrity name over another (i.e., categorizing all spellings that follow a certain phonetic pattern). You can also separate different types of responses into separate groups in order to explore different aspects such as popularity (i.e., looking at which misspellings occurred most frequently).
- Creating an interactive quiz for users to test their spelling ability by challenging them to spell names correctly from the celebrity dataset.
- Building a dictionary database of the misspellings, fans’ nicknames and phonetic spellings of each celebrity so that people can find more information about them more easily and accurately.
- Measuring the popularity of individual celebrities by tracking the frequency in which their name is misspelled
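The phonetic-grouping idea above can be sketched with a deliberately crude key; the key function is an assumption for illustration (a real analysis would use Soundex, Metaphone, or similar):

```python
from collections import Counter

def naive_phonetic_key(name: str) -> str:
    """Crude phonetic key (NOT Soundex): lowercase, keep the first letter,
    then drop vowels and collapse repeated consonants."""
    s = name.lower().replace(" ", "")
    if not s:
        return ""
    key = s[0]
    for ch in s[1:]:
        if ch in "aeiouy":
            continue
        if ch == key[-1]:
            continue
        key += ch
    return key

spellings = ["Gyllenhaal", "Gyllenhall", "Gillenhall", "Jillenhall"]
groups = Counter(naive_phonetic_key(s) for s in spellings)
print(groups)  # the first three collapse to the same key
```

Grouping by such a key lets you count how often a phonetically similar family of misspellings occurs, regardless of the exact letters used.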
If you use this dataset in your research, please credit the original authors. Data Source
See the dataset description for more information.
File: data-all.csv

| Column name | Description |
|:------------|:---------------------------------------------------|
| data        | Misspellings of celebrity names. (String)          |
| updated     | Date when the misspelling was last updated. (Date) |
If you use this dataset in your research, please credit the original authors and data.world's Admin.
https://cubig.ai/store/terms-of-service
1) Data Introduction • The FRC Match Dataset is based on FIRST Robotics Competition (FRC) match records from 2018 to 2025, and includes, for each match, information such as EPA (expected score contribution), match win rate, team composition, and match results.
2) Data Utilization (1) FRC Match Data has the following characteristics: • Each row contains numerical and categorical variables such as year, event, playoff status, match stage, winning team, EPA-based win probability, team names and composition, and match results, which together provide team/match performance and forecasting indicators. (2) FRC Match Data can be used for: • Prediction and assessment of match results: using EPA and past match data, machine learning models can predict match wins and losses, and prediction models can be evaluated for reliability with indicators such as the Brier score. • Team strategy and performance analysis: by analyzing EPA, win rate, and matchup data for each team, you can assess strategic contribution, cooperation effects, seasonal trends, and the characteristics of strong and weak teams.
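The Brier score mentioned above is the mean squared error between predicted win probabilities and binary outcomes (lower is better); a minimal sketch with made-up numbers:

```python
def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities (of the
    favored team winning) and binary outcomes (1 = it won)."""
    assert len(probs) == len(outcomes)
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

# Hypothetical EPA-based win probabilities vs. actual results.
print(brier_score([0.9, 0.6, 0.3], [1, 0, 0]))  # (0.01 + 0.36 + 0.09) / 3
```

A score of 0 means perfect, perfectly confident predictions; always guessing 0.5 yields 0.25, which is a useful reference point.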
Attribution 4.0 (CC BY 4.0)
https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Intellectual Property Government Open Data (IPGOD) includes over 100 years of registry data on all intellectual property (IP) rights administered by IP Australia. It also has derived information about the applicants who filed these IP rights, to allow for research and analysis at the regional, business and individual level. This is the 2019 release of IPGOD.
IPGOD is large, with millions of data points across up to 40 tables, making it too large to open with Microsoft Excel. Furthermore, analysis often requires information from separate tables, which needs specialised software for merging. We recommend that advanced users interact with the IPGOD data using the right tools, with enough memory and compute power. This includes a wide range of programming and statistical software such as Tableau, Power BI, Stata, SAS, R, Python, and Scala.
IP Australia is also providing free trials to a cloud-based analytics platform with the capabilities to enable working with large intellectual property datasets, such as the IPGOD, through the web browser, without any installation of software. IP Data Platform
The following pages can help you gain an understanding of intellectual property administration and processes in Australia to support your analysis of the dataset.
Due to the changes in our systems, some tables have been affected.
Data quality has been improved across all tables.
https://spdx.org/licenses/CC0-1.0.html
The need for a names-based cyber-infrastructure for digital biology is based on the argument that scientific names serve as a standardized metadata system that has been used consistently and near universally for 250 years. As we move towards data-centric biology, name-strings can be called on to discover, index, manage, and analyze accessible digital biodiversity information from multiple sources. Known impediments to the use of scientific names as metadata include synonyms, homonyms, mis-spellings, and the use of other strings as identifiers. We here compare the name-strings in GenBank, Catalogue of Life (CoL), and the Dryad Digital Repository (DRYAD) to assess the effectiveness of the current names-management toolkit developed by Global Names to achieve interoperability among distributed data sources. New tools that have been used here include Parser (to break name-strings into component parts and to promote the use of canonical versions of the names), a modified TaxaMatch fuzzy-matcher (to help manage typographical, transliteration, and OCR errors), and Cross-Mapper (to make comparisons among data sets). The data sources include scientific names at multiple ranks; vernacular (common) names; acronyms; strain identifiers and other surrogates including idiosyncratic abbreviations and concatenations. About 40% of the name-strings in GenBank are scientific names representing about 400,000 species or infraspecies and their synonyms. Of the formally-named terminal taxa (species and lower taxa) represented, about 82% have a match in CoL. Using a subset of content in DRYAD, about 45% of the identifiers are names of species and infraspecies, and of these only about a third have a match in CoL. With simple processing, the extent of matching between DRYAD and CoL can be improved to over 90%. 
The findings confirm the necessity for name-processing tools and the value of scientific names as a mechanism to interconnect distributed data, and identify specific areas of improvement for taxonomic data sources. Some areas of diversity (bacteria and viruses) are not well represented by conventional scientific names, and they and other forms of strings (acronyms, identifiers, and other surrogates) that are used instead of names need to be managed in reconciliation services (mapping alternative name-strings for the same taxon together). On-line resolution services will bring older scientific names up to date or convert surrogate name-strings to scientific names should such names exist. Examples are given of many of the aberrant forms of ‘names’ that make their way into these databases. The occurrence of scientific names with incorrect authors, such as chresonyms within synonymy lists, is a quality-control issue in need of attention. We propose a future-proofing solution that will empower stakeholders to take advantage of the name-based infrastructure at little cost. This proposed infrastructure includes a standardized system that adopts or creates UUIDs for name-strings, software that can identify name-strings in sources and apply the UUIDs, reconciliation and resolution services to manage the name-strings, and an annotation environment for quality control by users of name-strings.
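As a toy illustration of the kind of canonicalization the Parser performs, a deliberately naive sketch (the real Global Names Parser handles authorship, ranks, hybrids, and many edge cases this regex cannot):

```python
import re

def canonical(name_string: str) -> str:
    """Very rough canonical form: keep the capitalized genus and the first
    lowercase epithet, dropping authorship and annotations. Illustrative only."""
    m = re.match(r"\s*([A-Z][a-z]+)\s+([a-z][a-z-]+)", name_string)
    return f"{m.group(1)} {m.group(2)}" if m else name_string.strip()

print(canonical("Homo sapiens Linnaeus, 1758"))          # Homo sapiens
print(canonical("Pomatomus saltatrix (Linnaeus, 1766)")) # Pomatomus saltatrix
```

Reducing name-strings to canonical forms like this is what allows the cross-mapping described above to raise match rates between sources.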
Attribution 4.0 (CC BY 4.0)
https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We argue that algorithmic modeling is a powerful approach to understanding the collective dynamics of human behavior. We consider the task of pairing up individuals connected over a network, according to the following model: each individual is able to propose to match with and accept a proposal from a neighbor in the network; if a matched individual proposes to another neighbor or accepts another proposal, the current match will be broken; individuals can only observe whether their neighbors are currently matched but have no knowledge of the network topology or the status of other individuals; and all individuals have the common goal of maximizing the total number of matches. By examining the experimental data, we identify a behavioral principle called prudence, develop an algorithmic model, analyze its properties mathematically and by simulations, and validate the model with human subject experiments for various network sizes and topologies. Our results include i) a 1/2-approximate maximum matching is obtained in logarithmic time in the network size for bounded-degree networks; ii) for any constant ε > 0, a (1 − ε)-approximate maximum matching is obtained in polynomial time, while obtaining a maximum matching can require an exponential time; and iii) convergence to a maximum matching is slower on preferential attachment networks than on small-world networks. These results allow us to predict that while humans can find a “good quality” matching quickly, they may be unable to find a maximum matching in feasible time. We show that the human subjects largely abide by prudence, and their collective behavior is closely tracked by the above predictions.
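A hedged simulation sketch of a decentralized matching process in this spirit (the propose/accept details are simplified; "prudence" here is reduced to never breaking an existing match, which yields a maximal matching of at least half the maximum size):

```python
import random

def decentralized_matching(edges, n, seed=0):
    """Toy process: unmatched nodes repeatedly pair with a random unmatched
    neighbor; matched nodes never break a match. Returns a maximal matching."""
    rng = random.Random(seed)
    neighbors = {v: set() for v in range(n)}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    partner = {}
    changed = True
    while changed:
        changed = False
        for u in range(n):
            if u in partner:
                continue
            free = [v for v in sorted(neighbors[u]) if v not in partner]
            if free:
                v = rng.choice(free)
                partner[u], partner[v] = v, u
                changed = True
    return {tuple(sorted((u, v))) for u, v in partner.items()}

# A 6-cycle: the maximum matching has 3 edges; the process finds at least 2.
m = decentralized_matching([(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)], 6)
print(len(m))
```

Because the result is maximal (no two adjacent nodes are both unmatched), it is guaranteed to contain at least half as many edges as a maximum matching, matching the flavor of result (i) above.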
ODC Public Domain Dedication and Licence (PDDL) v1.0
http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
A. SUMMARY This dataset comes from the San Francisco Emergency Medical Services Agency and includes all opioid overdose-related 911 calls responded to by emergency medical services (ambulances). The purpose of this dataset is to show how many opioid overdose-related 911 calls the San Francisco Fire Department and other ambulance companies respond to each week. This dataset is based on ambulance patient care records and not 911 calls for service data.
B. HOW THE DATASET IS CREATED The San Francisco Fire Department and other ambulance companies send electronic patient care reports to the California Emergency Medical Services Agency for all 911 calls they respond to. The San Francisco Emergency Medical Services Agency (SF EMSA) has access to the state database that includes all reports for 911 calls in San Francisco County. In order to identify overdose-related calls that resulted in an emergency medical service (or ambulance) response, SF EMSA filters the patient care reports based on set criteria used in other jurisdictions called The Rhode Island Criteria. These criteria filter calls to only include those calls where EMS documented that an opioid overdose was involved and/or naloxone (Narcan) was administered. Calls that do not involve an opioid overdose are filtered out of the dataset. Calls that result in a patient death on scene are also filtered out of the dataset.
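A sketch of that inclusion/exclusion logic on hypothetical report records (field names are illustrative, not the state database's schema):

```python
# Hypothetical patient-care-report records; field names are made up.
reports = [
    {"id": 1, "opioid_documented": True,  "naloxone_given": False, "death_on_scene": False},
    {"id": 2, "opioid_documented": False, "naloxone_given": True,  "death_on_scene": False},
    {"id": 3, "opioid_documented": False, "naloxone_given": False, "death_on_scene": False},
    {"id": 4, "opioid_documented": True,  "naloxone_given": True,  "death_on_scene": True},
]

# Keep reports where an opioid overdose was documented and/or naloxone was
# administered; drop reports where the patient died on scene.
included = [r for r in reports
            if (r["opioid_documented"] or r["naloxone_given"])
            and not r["death_on_scene"]]
print([r["id"] for r in included])  # [1, 2]
```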
This dataset is created by copying the total number of calls each week when the state makes this data available.
C. UPDATE PROCESS Data is generally available with a 24-hour lag on a weekly frequency, but the exact lag and update frequency are based on when the State makes this data available.
D. HOW TO USE THIS DATASET This dataset includes the total number of calls a week. The week starts on a Sunday and ends on the following Saturday.
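Bucketing dates into those Sunday-to-Saturday reporting weeks can be sketched as:

```python
from datetime import date, timedelta

def week_start(d: date) -> date:
    """Sunday that starts the reporting week containing d
    (weeks run Sunday through Saturday)."""
    # date.weekday(): Monday=0 ... Sunday=6
    return d - timedelta(days=(d.weekday() + 1) % 7)

print(week_start(date(2024, 1, 10)))  # a Wednesday -> 2024-01-07 (Sunday)
print(week_start(date(2024, 1, 7)))   # already Sunday -> 2024-01-07
```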
This dataset will not match the Fire Department Calls for Service dataset, as this dataset has been filtered to include only opioid overdose-related 911 calls based on electronic patient care report data. Additionally, the Fire Department Calls for Service data are primarily based on 911 call data (i.e. calls triaged and recorded by San Francisco’s 911 call center) and not the finalized electronic patient care reports recorded by Fire Department paramedics.
E. RELATED DATASETS
- Fire Department Calls for Service
- San Francisco Department of Public Health Substance Use Services
- Unintentional Overdose Death Rates by Race/Ethnicity
- Preliminary Unintentional Drug Overdose Deaths
F. CHANGE LOG
Attribution 4.0 (CC BY 4.0)
https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”
A peer-reviewed data paper for this dataset is under review for publication in NECSUS_European Journal of Media Studies, an open-access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org.
Please cite this when using the dataset.
Detailed description of the dataset:
1 Film Dataset: Festival Programs
The Film Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.
The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.
The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.
The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.
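The long-to-wide logic described above, where the first sample festival wins, can be sketched on toy rows (the column names here are illustrative, not the dataset's exact variable names):

```python
# Hypothetical long-format rows: one row per (film, festival) appearance.
long_rows = [
    {"film_id": "f1", "title": "Film A", "festival": "Berlinale"},
    {"film_id": "f1", "title": "Film A", "festival": "Frameline"},
    {"film_id": "f2", "title": "Film B", "festival": "Frameline"},
]

# Wide format: one row per unique film; 'fest' keeps only the first festival.
wide = {}
for row in long_rows:
    if row["film_id"] not in wide:
        wide[row["film_id"]] = {"title": row["title"], "fest": row["festival"]}

print(wide["f1"]["fest"])  # Berlinale: the first sample festival wins
```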
2 Survey Dataset
The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.
The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.
The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.
The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.
3 IMDb & Scripts
The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.
The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.
The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.
The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.
The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.
The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to one crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.
The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.
The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.
The csv file “3_imdb-dataset_release-info_long” contains data about non-festival release (e.g., theatrical, digital, tv, dvd/blueray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.
The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.
The dataset includes 8 text files containing the scripts for web scraping. They were written using R version 3.6.3 for Windows.
The R script “r_1_unite_data” demonstrates the structure of the dataset, that we use in the following steps to identify, scrape, and match the film data.
The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records, first conducting an advanced search based on the movie title and year, and then, if no matches are found, falling back to an alternative title and a basic search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of suggested records on the IMDb website. The script then defines a loop that matches (and scores) each film in the core dataset against the suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, with a fuzzy matching approach based on two methods, “cosine” and “osa”: cosine similarity is used to match titles with a high degree of similarity, while the OSA (optimal string alignment) algorithm is used to match titles that may contain typos or minor variations.
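The two distance methods can be sketched in Python (the original scripts use R's stringdist package; the functions below are an illustrative re-implementation, not the project's code):

```python
from collections import Counter
from math import sqrt

def cosine_distance(a: str, b: str, q: int = 3) -> float:
    """Cosine distance over character q-grams, analogous to stringdist's 'cosine' method."""
    grams = lambda s: Counter(s[i:i + q] for i in range(len(s) - q + 1))
    ga, gb = grams(a.lower()), grams(b.lower())
    dot = sum(ga[g] * gb[g] for g in ga)
    norm = sqrt(sum(v * v for v in ga.values())) * sqrt(sum(v * v for v in gb.values()))
    return 1.0 - dot / norm if norm else 1.0

def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: insertions, deletions, substitutions,
    plus transpositions of adjacent characters (catches simple typos)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent transposition
    return d[m][n]
```

Thresholding on both distances is what lets the pipeline accept near-duplicates while tolerating small spelling variations.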
The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was sorted into one of five categories: a) 100% match (perfect match on title, year, and director); b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible duplicates in the dataset and flags them for a manual check.
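The five-way sort might look like the following Python sketch (the thresholds and input fields here are hypothetical; the actual cut-offs used in the R script are not documented in this description):

```python
def categorize(title_sim: float, year_diff: int, director_match: bool) -> str:
    """Bucket a candidate film pair for manual review.
    title_sim: title similarity in [0, 1]; thresholds below are illustrative only."""
    if title_sim == 1.0 and year_diff == 0 and director_match:
        return "100% match"
    if title_sim >= 0.9 and year_diff <= 1:
        return "likely good match"
    if title_sim >= 0.7:
        return "maybe match"
    if title_sim >= 0.5:
        return "unlikely match"
    return "no match"
```

The point of the buckets is to concentrate manual checking effort on the middle categories, where automated scores are least decisive.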
The script “r_4_scraping_functions” defines functions for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the subsequent scripts.
The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. It does so for the first 100 films only, as a check that everything works. Scraping the entire dataset took a few hours, so a test run on a subsample of 100 films is advisable.
The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.
The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.
The script “r_check_logs” is used for troubleshooting and for tracking the progress of all of the R scripts described above. It reports the number of missing values and errors.
4 Festival Library Dataset
The Festival Library Dataset consists of a data scheme image file, plus one codebook and one dataset, both in csv format.
The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset, listing the definitions of variables such as location, festival name, and festival categories.
I wanted to do some analysis on higher-Elo matches (whether there is a matchmaking system in place at all is one question that might be answered), so I collected my own dataset inspired by https://www.kaggle.com/skihikingkevin/pubg-match-deaths/data. I seeded my data with JPFog and collected 100 first-person squad matches of his, and of those that he ran into, until I had a reasonable amount of data (read: I got tired of leaving my machine on overnight scraping and didn't want to set up AWS).
Included is that data, along with a names-list of every name that I ran into, a names-list of every name I searched on, and a flattened pandas-friendlier version of that data.
I would have made this data friendlier to work with but I wanted to push it out before PUBG releases their own official seed data, which should be coming soon(tm). This way we all get to play around with this in Tableau, and maybe make comparisons to other datasets released up to 8 months ago.
pubg.op.gg; if this is against their TOS, let me know (their TOS is in Korean and I can't read it) and I will take this down.
https://www.kaggle.com/skihikingkevin/pubg-match-deaths/data
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This database links innovation data to Compustat firms. When using the data, please cite "Knowledge Spillovers and Corporate Investment in Scientific Research" (Arora, Belenzon and Sheer), NBER WP 23187. A special thanks and appreciation go to Bernardo Dionisi, Honggi Lee, Dror Shvadron, and JK Suh for their diligent work and dedication to this effort over the past several years.
This project introduces a major extension and improvement of the historical NBER patent dataset, which should be valuable for all researchers working with patent data linked to firms. In updating the data to match Compustat firms to patents through 2015, we address two major challenges: name changes and ownership changes. These challenges are central to how patents are assigned to firms over time. To be consistent over the sample period, we reconstruct the complete historical data covered in the NBER data files.
About 30% of the Compustat firms in our sample change their name at least once. Accounting for name changes improves the accuracy and scope of matches to patents (and other assets), ownership structure, and dynamic reassignments of GVKEY codes to companies. Dynamic reassignment means that, for instance, if a sample firm merges with another firm, the patents of the merged firm are included in the stock of patents linked to the Compustat record from that point onward, but not before.
For ownership and subsidiary data we rely on a wide range of M&A data, including SDC, historical snapshots of ORBIS files for 2002-2015, 10-K SEC filings, and NBER2006, and we perform extensive manual checks that help us uncover firms’ structures and ownership changes before proceeding to the patent match. Thus, we have extended and improved the NBER patent data. In the enclosed "Data Appendix", we document our data construction work, present several examples (“case studies”), and outline the improvements we made to the existing NBER historical patent data.
Looking back at the 2018-2019 season and looking to delve deeper for insights, using the data to see how clubs are similar stylistically in the way they pass, attack, and score goals.
This data set is wide-ranging in the sense that it encompasses stats seen on a regular league table but goes beyond them, looking at how teams pass and keep possession, how they defend and tackle, as well as the market value of each team and how much money each team was allotted from the TV rights deal. This data was gathered from 1) BBC Sport Football, 2) Premierleague.com, and 3) Transfermarkt.co.uk. The data was not scraped in a conventional sense and appears in a rather haphazard manner. To counter this, I included category descriptors at the start of each variable name; this should help provide a more cohesive understanding of the data set as well as aid in subsetting.
I've done some rather elementary data analysis and exploration. I would love to see the community wrangle with this and explore further, create more complex models, apply some ML, and see what insights can be gathered from this data.
This is a layer of water service boundaries for 44,919 community water systems that deliver tap water to 306.88 million people in the US. This amounts to 97.22% of the population reportedly served by active community water systems and 90.85% of active community water systems. The layer is based on multiple data sources and a methodology developed by SimpleLab and collaborators called a Tiered, Explicit, Match, and Model approach (TEMM for short). The name of the approach reflects exactly how the nationwide data layer was developed. The TEMM is composed of three hierarchical tiers, arranged by data and model fidelity. First, we use explicit water service boundaries provided by states. These are spatial polygon data, typically provided at the state level. We call systems with explicit boundaries Tier 1. In the absence of explicit water service boundary data, we use a matching algorithm to match water systems to the boundary of a town or city (Census Place TIGER polygons). When a water system and TIGER place match one-to-one, we label this Tier 2a. When multiple water systems match to the same TIGER place, we label this Tier 2b. Tier 2b reflects overlapping boundaries for multiple systems. Finally, in the absence of an explicit water service boundary (Tier 1) or a TIGER place polygon match (Tier 2a or Tier 2b), a statistical model trained on explicit water service boundary data (Tier 1) is used to estimate a reasonable radius at provided water system centroids, and model a spherical water system boundary (Tier 3).
Several limitations to this data exist, and the layer should be used with these in mind. First, the case of assigning a Census Place TIGER polygon to multiple systems results in an inaccurate assignment of the same exact area to multiple systems; we hope to resolve Tier 2b systems into Tier 2a or Tier 3 in a future iteration. Second, matching algorithms to assign Census Place boundaries require additional validation and iteration. Third, Tier 3 boundaries have modeled radii stemming from a lat/long centroid of a water system facility, but the underlying lat/long centroids for water system facilities are of variable quality. It is critical to evaluate the "geometry quality" column (included from the EPA ECHO data source) when looking at Tier 3 boundaries; fidelity is very low when the geometry quality is a county or state centroid, but we did not exclude these data from the layer. Fourth, missing water systems are typically those without a centroid, in a U.S. territory, or missing population and connection data. Finally, Tier 1 systems are assumed to be high fidelity, but rely on the accuracy of state data collection and maintenance.
All data, methods, documentation, and contributions are open-source and available here: https://github.com/SimpleLab-Inc/wsb.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset describes all the matches made available. Each match is a document consisting of the following fields:

- competitionId: the identifier of the competition to which the match belongs. It is an integer and refers to the field "wyId" of the competition document;
- date and dateutc: the former specifies the date and time when the match starts in explicit format (e.g., May 20, 2018 at 8:45:00 PM GMT+2); the latter contains the same information in the compact format YYYY-MM-DD hh:mm:ss;
- duration: the duration of the match. It can be "Regular" (matches of regular duration of 90 minutes + stoppage time), "ExtraTime" (matches with supplementary time, as may happen in continental or international competitions), or "Penalities" (matches that end with penalty kicks, as may happen in continental or international competitions);
- gameweek: the week of the league, starting from the beginning of the league;
- label: contains the names of the two clubs and the result of the match (e.g., "Lazio - Internazionale, 2 - 3");
- roundID: indicates the match-day of the competition to which the match belongs. During a competition for soccer clubs, each of the participating clubs plays against each of the other clubs twice, once at home and once away. The matches are organized in match-days: all the matches in match-day i are played before the matches in match-day i + 1, even though some matches can be brought forward or postponed to accommodate players and clubs participating in continental or intercontinental competitions. During a competition for national teams, "roundID" indicates the stage of the competition (eliminatory round, round of 16, quarter finals, semifinals, final);
- seasonId: indicates the season of the match;
- status: it can be "Played" (the match has officially finished), "Cancelled" (the match has been canceled for some reason), "Postponed" (the match has been postponed and no new date and time is available yet), or "Suspended" (the match has been suspended and no new date and time is available yet);
- venue: the stadium where the match was held (e.g., "Stadio Olimpico");
- winner: the identifier of the team which won the game, or 0 if the match ended with a draw;
- wyId: the identifier of the match, assigned by Wyscout;
- teamsData: contains several subfields describing each team playing the match, such as lineup, bench composition, list of substitutions, coach, and scores:
  - hasFormation: it has value 0 if no formation (lineups and benches) is present, and 1 otherwise;
  - score: the number of goals scored by the team during the match (not counting penalties);
  - scoreET: the number of goals scored by the team during the match, including extra time (not counting penalties);
  - scoreHT: the number of goals scored by the team during the first half of the match;
  - scoreP: the total number of goals scored by the team after the penalties;
  - side: the team side in the match (it can be "home" or "away");
  - teamId: the identifier of the team;
  - coachId: the identifier of the team's coach;
  - bench: the list of the team's players that started the match on the bench and some basic statistics about their performance during the match (goals, own goals, cards);
  - lineup: the list of the team's players in the starting lineup and some basic statistics about their performance during the match (goals, own goals, cards);
  - substitutions: the list of the team's substitutions during the match, describing the players involved and the minute of the substitution.
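A minimal Python sketch of working with these fields (the values below are invented; only the field names come from the description above):

```python
# A minimal match document using the fields described above (values illustrative).
match = {
    "wyId": 2576335,
    "label": "Lazio - Internazionale, 2 - 3",
    "duration": "Regular",
    "winner": 3161,  # team identifier; 0 would mean a draw
    "teamsData": {
        "3162": {"side": "home", "score": 2, "hasFormation": 1},
        "3161": {"side": "away", "score": 3, "hasFormation": 1},
    },
}

def result(match: dict) -> str:
    """Summarize a match outcome from the `winner` and `teamsData` fields."""
    if match["winner"] == 0:
        return "draw"
    side = match["teamsData"][str(match["winner"])]["side"]
    return f"{side} win"
```

Note that the `teamsData` keys are team identifiers stored as strings, so the `winner` identifier has to be converted before lookup.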
Effective communication requires a match among signal characteristics, environmental conditions, and receptor tuning and decoding. The degree of matching, however, can vary, among other reasons because of different selective pressures affecting the communication components. For evolutionary novelties, strong selective pressures are likely to act upon the signal and receptor to promote a tight match between them. We test this prediction by exploring the coupling between acoustic signals and auditory sensitivity in Liolaemus chiliensis, the Weeping lizard, the only one of more than 285 Liolaemus species that vocalizes. Individuals emit distress calls that convey information about predation risk to conspecifics, which may respond with antipredator behaviors upon hearing the calls. Specifically, we explored the match between spectral characteristics of the distress calls and the tympanic sensitivities of two populations separated by more than 700 km, for which previous data suggested variation in their dis...
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
This repository contains reference datasets for heroes and cards used in Magic Chess: Go Go. These datasets are intended to complement the Magic Chess: Go Go Matches Dataset by providing structured information about hero attributes and card effects, enabling more meaningful and accurate analysis.
The dataset consists of two csv files:
heroes
A dataset containing information about all heroes available in the game.
Column Name | Description |
---|---|
Hero | Name of the hero |
Faction | Hero's faction (e.g., Doomsworm, Faeborn) |
Role | Hero's role (e.g., Marksman, Defender) |
Row | Recommended position on the board (Front or Back) |
Cost | Hero's gold cost in the shop (1 to 5) |
cards
A dataset describing the attributes and effects of various in-game cards.
Each card is encoded with binary features: 1 indicates that the card has that attribute, and 0 means it does not.
Column Name | Description |
---|---|
Card | Name of the card |
Color | Card color type (e.g., Orange, Purple) |
Magic_Boost | Boosts magic stat |
Physical_Boost | Boosts physical stat |
ATKSpeed_Boost | Increases attack speed |
Defense_Boost | Increases defense or durability |
Synergy | Enhances or interacts with a synergy |
Magic_Equipment | Provides magic-type equipment |
Physical_Equipment | Provides physical-type equipment |
ATKSpeed_Equipment | Provides attack speed equipment |
Defense_Equipment | Provides defensive equipment |
Hero_Recruitment | Summons a new hero |
Capacity+ | Increases max team capacity |
Economy | Boosts economy or gold gain |
Commander_EXP | Grants Commander experience |
Commander_Life | Restores Commander HP |
Synergy_Effect | Modifies synergy effects or adds synergy bonuses |
These reference datasets are designed to support match-level analysis in the Magic Chess: Go Go Matches Dataset. They provide additional context, such as hero factions, roles, and costs, and the attributes and effects of in-game cards. By linking match data with this reference information, we can uncover deeper patterns and more accurate insights.
All data was entered manually into a spreadsheet.
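A short Python sketch of how the binary card encoding can be queried (the card names and rows below are invented, not actual game data; only the column names come from the codebook above):

```python
import csv
import io

# Illustrative rows in the cards file's binary encoding (card names invented).
cards_csv = """Card,Color,Magic_Boost,Physical_Boost,Economy
Arcane Focus,Purple,1,0,0
War Banner,Orange,0,1,0
Gold Cache,Orange,0,0,1
"""

def cards_with(attribute: str, text: str = cards_csv) -> list[str]:
    """Return names of cards whose binary flag for `attribute` is 1."""
    rows = csv.DictReader(io.StringIO(text))
    return [r["Card"] for r in rows if r[attribute] == "1"]
```

The same pattern joins naturally against match records, e.g. counting how often economy cards appear in winning lineups.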
Data from: American Community Survey, 5-year Series