100+ datasets found
  1. Data for: "Linking Datasets on Organizations Using Half a Billion Open-Collaborated Records"

    • dataverse.harvard.edu
    Updated Jan 13, 2025
    Cite
    Connor Jerzak (2025). Data for: "Linking Datasets on Organizations Using Half a Billion Open-Collaborated Records" [Dataset]. http://doi.org/10.7910/DVN/EHRQQL
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Jan 13, 2025
    Dataset provided by
    Harvard Dataverse
    Authors
    Connor Jerzak
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Abstract: Scholars studying organizations often work with multiple datasets lacking shared unique identifiers or covariates. In such situations, researchers usually use approximate string (``fuzzy'') matching methods to combine datasets. String matching, although useful, faces fundamental challenges. Even when two strings appear similar to humans, fuzzy matching often does not work because it fails to adapt to the informativeness of the character combinations. In response, a number of machine-learning methods have been developed to refine string matching. Yet, the effectiveness of these methods is limited by the size and diversity of training data. This paper introduces data from a prominent employment networking site (LinkedIn) as a massive training corpus to address these limitations. We show how, by leveraging information from LinkedIn regarding organizational name-to-name links, we can improve upon existing matching benchmarks, incorporating the trillions of name pair examples from LinkedIn into various methods to improve performance by explicitly maximizing match probabilities inferred from the LinkedIn corpus. We also show how relationships between organization names can be modeled using a network representation of the LinkedIn data. In illustrative merging tasks involving lobbying firms, we document improvements when using the LinkedIn corpus in matching calibration and make all data and methods open source. Keywords: Record linkage; Interest groups; Text as data; Unstructured data
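    As a minimal illustration (not the paper's method) of the kind of approximate string comparison the abstract refers to, the following Python sketch scores two organization name pairs with a character-level similarity from the standard library; the names and the 0.85 threshold are illustrative only.

    from difflib import SequenceMatcher

    def name_similarity(a, b):
        """Return a similarity score in [0, 1] for two organization name strings."""
        return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

    pairs = [
        ("International Business Machines", "IBM Corporation"),
        ("Alphabet Inc.", "Alphabet Incorporated"),
    ]
    for a, b in pairs:
        score = name_similarity(a, b)
        print(a, "|", b, "|", round(score, 2), "match" if score >= 0.85 else "no match")

    Note that the first pair scores poorly even though both strings refer to the same organization, which is exactly the failure mode of character-level matching that motivates the corpus-based approach described above.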

  2. Firmographic Data | 4MM + US Private and Public Companies | Employees, Revenue, Website, Industry + More Firmographics

    • datarade.ai
    .json, .csv, .xls
    Updated Oct 16, 2023
    + more versions
    Cite
    Salutary Data (2023). Firmographic Data | 4MM + US Private and Public Companies | Employees, Revenue, Website, Industry + More Firmographics [Dataset]. https://datarade.ai/data-products/salutary-data-firmographic-data-4m-us-private-and-publi-salutary-data
    Explore at:
    Available download formats: .json, .csv, .xls
    Dataset updated
    Oct 16, 2023
    Dataset authored and provided by
    Salutary Data
    Area covered
    United States
    Description

    Salutary Data is a boutique B2B contact and company data provider that's committed to delivering high-quality data for sales intelligence, lead generation, marketing, recruiting / HR, identity resolution, and ML / AI. Our database currently consists of 148MM+ highly curated B2B contacts (US only), along with over 4M+ companies, and is updated regularly to ensure we have the most up-to-date information.

    We can enrich your in-house data (CRM enrichment, lead enrichment, etc.) and provide you with a custom dataset (such as a lead list) tailored to your target audience specifications and data use case. We also support large-scale data licensing to software providers and agencies that intend to redistribute our data to their customers and end users.

    What makes Salutary unique?
    • We offer our clients a truly unique, one-stop aggregation of the best-of-breed quality data sources. Our supplier network consists of numerous established, high-quality suppliers that are rigorously vetted.
    • We leverage third-party verification vendors to ensure phone numbers and emails are accurate and connect to the right person. Additionally, we deploy automated and manual verification techniques to ensure we have the latest job information for contacts.
    • We're reasonably priced and easy to work with.

    Products: API Suite, Web UI, Full and Custom Data Feeds

    Services:
    • Data Enrichment - We assess the fill-rate gaps and profile your customer file for the purpose of appending fields, updating information, and/or rendering net-new "look alike" prospects for your campaigns.
    • ABM Match & Append - Send us your domain or other company-related files, and we'll match your Account Based Marketing targets and provide you with B2B contacts to campaign. Optionally throw in your suppression file to avoid any redundant records.
    • Verification ("Cleaning/Hygiene") Services - Address the 2% per month aging issue on contact records! We will identify duplicate records and contacts no longer at the company, remove your email hard bounces, and update/replace titles or phones. This is right up our alley and leverages our existing internal and external processes and systems.

  3. Success.ai | EU Company Data | APIs | 28M+ Full Company Profiles & Contact Data – Best Price & Quality Guarantee

    • datarade.ai
    + more versions
    Cite
    Success.ai, Success.ai | EU Company Data | APIs | 28M+ Full Company Profiles & Contact Data – Best Price & Quality Guarantee [Dataset]. https://datarade.ai/data-products/success-ai-eu-company-data-apis-28m-full-company-profi-success-ai
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset provided by
    Area covered
    Korea (Democratic People's Republic of), Ascension and Tristan da Cunha, Timor-Leste, Lebanon, Belarus, Isle of Man, Lithuania, Nigeria, Kyrgyzstan, Saint Vincent and the Grenadines
    Description

    Success.ai’s Company Data Solutions provide businesses with powerful, enterprise-ready B2B company datasets, enabling you to unlock insights on over 28 million verified company profiles. Our solution is ideal for organizations seeking accurate and detailed B2B contact data, whether you’re targeting large enterprises, mid-sized businesses, or small business contact data.

    Success.ai offers B2B marketing data across industries and geographies, tailored to fit your specific business needs. With our white-glove service, you’ll receive curated, ready-to-use company datasets without the hassle of managing data platforms yourself. Whether you’re looking for UK B2B data or global datasets, Success.ai ensures a seamless experience with the most accurate and up-to-date information in the market.

    API Features:

    • Real-Time Data Access: Our APIs ensure you can integrate and access the latest company data directly into your systems, providing real-time updates and seamless data flow.
    • Scalable Integration: Designed to handle high-volume requests efficiently, our APIs can support extensive data operations, perfect for businesses of all sizes.
    • Customizable Data Retrieval: Tailor your data queries to match specific needs, selecting data points that align with your business goals for more targeted insights.

    Why Choose Success.ai’s Company Data Solution? At Success.ai, we prioritize quality and relevancy. Every company profile is AI-validated for a 99% accuracy rate and manually reviewed to ensure you're accessing actionable and GDPR-compliant data. Our price match guarantee ensures you receive the best deal on the market, while our white-glove service provides personalized assistance in sourcing and delivering the data you need.

    Why Choose Success.ai?

    • Best Price Guarantee: We offer industry-leading pricing and beat any competitor.
    • Global Reach: Access over 28 million verified company profiles across 195 countries.
    • Comprehensive Data: Over 15 data points, including company size, industry, funding, and technologies used.
    • Accurate & Verified: AI-validated with a 99% accuracy rate, ensuring high-quality data.
    • API Access: Our robust APIs and customizable data solutions provide the flexibility and scalability needed to adapt to changing market conditions and business needs.
    • Real-Time Updates: Stay ahead with continuously updated company information.
    • Ethically Sourced Data: Our B2B data is compliant with global privacy laws, ensuring responsible use.
    • Dedicated Service: Receive personalized, curated data without the hassle of managing platforms.
    • Tailored Solutions: Custom datasets are built to fit your unique business needs and industries.

    Our database spans 195 countries and covers 28 million public and private company profiles, with detailed insights into each company’s structure, size, funding history, and key technologies. We provide B2B company data for businesses of all sizes, from small business contact data to large corporations, with extensive coverage in regions such as North America, Europe, Asia-Pacific, and Latin America.

    Comprehensive Data Points: Success.ai delivers in-depth information on each company, with over 15 data points, including:

    • Company Name: the full legal name of the company.
    • LinkedIn URL: direct link to the company's LinkedIn profile.
    • Company Domain: website URL for more detailed research.
    • Company Description: overview of the company's services and products.
    • Company Location: geographic location down to the city, state, and country.
    • Company Industry: the sector or industry the company operates in.
    • Employee Count: number of employees to help identify company size.
    • Technologies Used: insights into key technologies employed by the company, valuable for tech-based outreach.
    • Funding Information: total funding and the most recent funding dates for investment opportunities.

    Maximize Your Sales Potential: With Success.ai's B2B contact data and company datasets, sales teams can build tailored lists of target accounts, identify decision-makers, and access real-time company intelligence. Our curated datasets ensure you're always focused on high-value leads, those most likely to convert into clients. Whether you're conducting account-based marketing (ABM), expanding your sales pipeline, or looking to improve your lead generation strategies, Success.ai offers the resources you need to scale your business efficiently.

    Tailored for Your Industry: Success.ai serves multiple industries, including technology, healthcare, finance, manufacturing, and more. Our B2B marketing data solutions are particularly valuable for businesses looking to reach professionals in key sectors. You’ll also have access to small business contact data, perfect for reaching new markets or uncovering high-growth startups.

    From UK B2B data to contacts across Europe and Asia, our datasets provide global coverage to expand your business reach and identify new...

  4. Football Delphi

    • kaggle.com
    Updated Aug 16, 2017
    Cite
    Jörg Eitner (2017). Football Delphi [Dataset]. https://www.kaggle.com/datasets/laudanum/footballdelphi/versions/2
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Aug 16, 2017
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Jörg Eitner
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Like many others, I have asked myself whether it is possible to use machine learning to create valid predictions for football (soccer) match outcomes. Hence I created a dataset consisting of historic match data for the German Bundesliga (1st and 2nd division) as well as the English Premier League, reaching back as far as 1993 and up to 2016. Besides the basic information concerning goals scored and home/draw/away results, the dataset also includes per-side (team) data such as transfer value per team (pre-season), squad strength, etc. Unfortunately, I was only able to find sources for these advanced attributes going back to the 2005 season.
    I have used this dataset with different machine learning algorithms, including random forests and XGBoost, as well as different recurrent neural network architectures (in order to potentially identify recurring patterns in winning streaks, etc.). I'd like to share the approaches I used as separate kernels here as well. So far I have not managed to consistently exceed an accuracy of 53% on a validation set using the 2016 season of Bundesliga 1 (no-information rate = 49%).

    Although I did some visual exploration in Tableau before implementing the different machine learning approaches, I think a visual exploration kernel would be very beneficial.

    Content

    The data comes as an SQLite file containing the following tables and fields:

    Table: Matches

    • Match_ID (int): unique ID per match
    • Div (str): identifies the division the match was played in (D1 = Bundesliga, D2 = Bundesliga 2, E0 = English Premier League)
    • Season (int): Season the match took place in (usually covering the period of August till May of the following year)
    • Date (str): Date of the match
    • HomeTeam (str): Name of the home team
    • AwayTeam (str): Name of the away team
    • FTHG (int) (Full Time Home Goals): Number of goals scored by the home team
    • FTAG (int) (Full Time Away Goals): Number of goals scored by the away team
    • FTR (str) (Full Time Result): 3-way result of the match (H = Home Win, D = Draw, A = Away Win)

    Table: Teams

    • Season (str): Football season for which the data is valid
    • TeamName (str): Name of the team the data concerns
    • KaderHome (str): Number of Players in the squad
    • AvgAgeHome (str): Average age of players
    • ForeignPlayersHome (str): Number of foreign players (non-German, non-English respectively) playing for the team
    • OverallMarketValueHome (str): Overall market value of the team pre-season in EUR (based on data from transfermarkt.de)
    • AvgMarketValueHome (str): Average market value (per player) of the team pre-season in EUR (based on data from transfermarkt.de)
    • StadiumCapacity (str): Maximum stadium capacity of the team's home stadium

    Table: Unique Teams

    • TeamName (str): Name of a team
    • Unique_Team_ID (int): Unique identifier for each team

    Table: Teams_in_Matches

    • Match_ID (int): Unique match ID
    • Unique_Team_ID (int): Unique team ID (This table is used to easily retrieve each match a given team has played in)

    Based on these tables I created a couple of views which I used as input for my machine learning models:

    View: FlatView

    Combination of all matches with the respective additional data from Teams table for both home and away team.

    View: FlatView_Advanced

    Same as Flatview but also includes Unique_Team_ID and Unique_Team in order to easily retrieve all matches played by a team in chronological order.

    View: FlatView_Chrono_TeamOrder_Reduced

    Similar to FlatView_Advanced, but missing the additional attributes from the Teams table in order to have a longer history covering the years 1993 - 2004. Especially interesting if one only wants to analyze winning/losing streaks.
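    As a rough orientation, the following Python sketch (assuming the SQLite file is saved locally as "database.sqlite"; adjust the path to your download) joins Matches with the home side's pre-season attributes from Teams, in the spirit of the FlatView described above. Note that Season is stored as an integer in Matches but as a string in Teams, so it is cast in the join.

    import sqlite3

    conn = sqlite3.connect("database.sqlite")  # path to the downloaded SQLite file
    query = """
    SELECT m.Match_ID, m.Season, m.HomeTeam, m.AwayTeam, m.FTHG, m.FTAG, m.FTR,
           t.OverallMarketValueHome, t.AvgMarketValueHome
    FROM Matches AS m
    LEFT JOIN Teams AS t
      ON t.TeamName = m.HomeTeam
     AND CAST(t.Season AS INTEGER) = m.Season
    WHERE m.Div = 'D1'
    LIMIT 5;
    """
    for row in conn.execute(query):
        print(row)
    conn.close()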

    Acknowledgements

    Thanks to football-data.co.uk and transfermarkt.de for providing the raw data used in this dataset.

    Inspiration

    Please feel free to use the humble dataset provided here for any purpose you want. To me it would be most interesting to see whether others think that recurrent neural networks could in fact help (and maybe even outperform classical feature engineering) in identifying streaks of losses and wins. In the literature I mostly found examples of RNN applications where the data were time series in a very narrow sense (e.g. temperature measurements over time), hence it would be interesting to get your input on this question.

    Maybe someone also finds additional attributes per team or match which have a substantial impact on match outcome. So far I have found the "Market Value" of a team to be by far the best predictor when two teams face each other, which makes sense as the market value usually correlates closely with the strength of a team and its prospects of winning.

  5. Data from: Automated Linking of Historical Data

    • linkagelibrary.icpsr.umich.edu
    Updated Aug 20, 2020
    Cite
    Ran Abramitzky; Leah Boustan; Katherine Eriksson; James Feigenbaum; Santiago Perez (2020). Automated Linking of Historical Data [Dataset]. http://doi.org/10.3886/E120703V1
    Explore at:
    Dataset updated
    Aug 20, 2020
    Authors
    Ran Abramitzky; Leah Boustan; Katherine Eriksson; James Feigenbaum; Santiago Perez
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    1850 - 1940
    Area covered
    United States
    Description

    Currently, the repository provides code for two such methods:

    The ABE fully automated approach: a fully automated method for linking historical datasets (e.g. complete-count censuses) by first name, last name, and age. The approach was first developed by Ferrie (1996) and adapted and scaled for the computer by Abramitzky, Boustan and Eriksson (2012, 2014, 2017). Because names are often misspelled or mistranscribed, our approach suggests testing robustness to alternative name matching (using raw names, NYSIIS standardization, and Jaro-Winkler distance). To reduce the chances of false positives, our approach suggests testing robustness by requiring names to be unique within a five-year window and/or requiring the match on age to be exact.

    A fully automated probabilistic approach (EM): this approach (Abramitzky, Mill, and Perez 2019) suggests a fully automated probabilistic method for linking historical datasets. We combine distances in reported names and ages between each pair of potential records into a single score, roughly corresponding to the probability that both records belong to the same individual. We estimate these probabilities using the Expectation-Maximization (EM) algorithm, a standard technique in the statistical literature. We suggest a number of decision rules that use these estimated probabilities to determine which records to use in the analysis.
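    The following Python sketch (not the repository's code) illustrates the name-comparison building blocks mentioned above, NYSIIS standardization and Jaro-Winkler distance, using the third-party jellyfish library; the example names, the 0.9 threshold, and the two-year age band are illustrative.

    import jellyfish

    def candidate_match(name_a, age_a, name_b, age_b, jw_threshold=0.9, age_band=2):
        """Flag a pair of records as a candidate link on name and age."""
        same_phonetic = jellyfish.nysiis(name_a) == jellyfish.nysiis(name_b)
        # note: older jellyfish versions expose this as jellyfish.jaro_winkler
        jw = jellyfish.jaro_winkler_similarity(name_a.lower(), name_b.lower())
        return (same_phonetic or jw >= jw_threshold) and abs(age_a - age_b) <= age_band

    print(candidate_match("Jon Smith", 34, "John Smyth", 35))      # similar name, close age
    print(candidate_match("Jon Smith", 34, "Robert Johnson", 35))  # dissimilar name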

  6. Famous Celebrity Name Misspellings

    • kaggle.com
    Updated Jan 22, 2023
    Cite
    The Devastator (2023). Famous Celebrity Name Misspellings [Dataset]. https://www.kaggle.com/datasets/thedevastator/famous-celebrity-name-misspellings
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Jan 22, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    Description

    Famous Celebrity Name Misspellings

    Aggregated data from The Gyllenhaal Experiment

    By data.world's Admin [source]

    About this dataset

    This dataset contains aggregated spellings and misspellings of the names of 15 famous celebrities. Ever wonder if people can actually spell someone's name correctly? Now you can see for yourself with this compiled data from The Pudding's interactive spelling experiment, The Gyllenhaal Experiment! It is interesting to see which names get misspelled more than others - some are easy to guess, some are surprising! With the data provided here, you can start uncovering trends in name-spelling habits. Visualize the data and start analyzing how unique or common each celebrity is with respect to spelling - who stands out? Who blends in? Check it out today and explore a side of celebrity life that hasn't been seen before!


    How to use the dataset

    This dataset contains misspellings of the names of 15 famous celebrities. It can be used for a variety of research and analysis purposes, including exploring human language, understanding how names are misspelled, or generating data visualizations.

    In order to get the most out of this dataset, you will need to familiarize yourself with its columns. The dataset consists of two columns: “data” and “updated”. The “data” column contains the misspellings associated with each celebrity name. The “updated” column is automatically updated with the date on which the data was last changed or modified.

    To use this dataset for your own research and analysis purposes, you may find it useful to filter out certain types of responses or patterns in order to focus more closely on particular trends or topics of interest; for example, if you are interested in exploring how spellings vary by region then you might wish to group together similar responses regardless of whether they exactly match one celebrity name over another (i.e., categorizing all spellings that follow a certain phonetic pattern). You can also separate different types of responses into separate groups in order to explore different aspects such as popularity (i.e., looking at which misspellings occurred most frequently).
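    As a small usage sketch (assuming the file has been downloaded as "data-all.csv" with the two columns described above), the following Python snippet counts which spellings appear most often:

    import pandas as pd

    df = pd.read_csv("data-all.csv")
    top = df["data"].astype(str).str.strip().str.lower().value_counts().head(10)
    print(top)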

    Research Ideas

    • Creating an interactive quiz for users to test their spelling ability by challenging them to spell names correctly from the celebrity dataset.
    • Building a dictionary database of the misspellings, fans’ nicknames and phonetic spellings of each celebrity so that people can find more information about them more easily and accurately.
    • Measuring the popularity of individual celebrities by tracking the frequency with which their name is misspelled.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    License

    See the dataset description for more information.

    Columns

    File: data-all.csv
    • data: Misspellings of celebrity names. (String)
    • updated: Date when the misspelling was last updated. (Date)


  7. FRC Match Dataset

    • cubig.ai
    Updated Jun 5, 2025
    Cite
    CUBIG (2025). FRC Match Dataset [Dataset]. https://cubig.ai/store/products/397/frc-match-dataset
    Explore at:
    Dataset updated
    Jun 5, 2025
    Dataset authored and provided by
    CUBIG
    License

    https://cubig.ai/store/terms-of-service

    Measurement technique
    Privacy-preserving data transformation via differential privacy, Synthetic data generation using AI techniques for model training
    Description

    1) Data Introduction
    • The FRC Match Dataset is based on FIRST Robotics Competition (FRC) records from 2018 to 2025. It is robot competition match data that includes information such as EPA (Expected Score Contribution), match win rate, team composition, and match results for each match.

    2) Data Utilization
    (1) FRC Match Data has the following characteristics:
    • Each row contains numerical and categorical variables such as year, event, playoff status, match stage, winning team, EPA-based probability of victory, team names and composition, and match results, which together provide team/match performance and forecasting indicators.
    (2) FRC Match Data can be used for:
    • Prediction and assessment of match results: using EPA and past match data, machine learning models can predict match wins and losses, and prediction models can be evaluated for reliability with indicators such as the Brier score (a minimal sketch follows below).
    • Team strategy and performance analysis: by analyzing EPA, win rate, and matchup data for each team, you can understand strategic contributions, cooperation effects, seasonal trends, and the characteristics of strong and weak teams.
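    Below is a minimal sketch of the Brier-score evaluation mentioned above: it compares forecast win probabilities (such as EPA-based probabilities of victory) against observed outcomes. The values are made up for illustration.

    def brier_score(probs, outcomes):
        """Mean squared difference between forecast probabilities and 0/1 outcomes; lower is better."""
        return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

    win_probs = [0.81, 0.55, 0.30, 0.92]  # forecast probability that the favored alliance wins
    results   = [1,    1,    0,    0]     # observed outcomes (1 = won, 0 = lost)
    print(round(brier_score(win_probs, results), 3))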

  8. [Superseded] Intellectual Property Government Open Data 2019

    • data.gov.au
    • researchdata.edu.au
    csv-geo-au, pdf
    Updated Jan 26, 2022
    + more versions
    Cite
    IP Australia (2022). [Superseded] Intellectual Property Government Open Data 2019 [Dataset]. https://data.gov.au/data/dataset/activity/intellectual-property-government-open-data-2019
    Explore at:
    Available download formats: csv-geo-au (39 files) and pdf (1 file)
    Dataset updated
    Jan 26, 2022
    Dataset authored and provided by
    IP Australia (http://ipaustralia.gov.au/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    What is IPGOD?

    The Intellectual Property Government Open Data (IPGOD) includes over 100 years of registry data on all intellectual property (IP) rights administered by IP Australia. It also has derived information about the applicants who filed these IP rights, to allow for research and analysis at the regional, business and individual level. This is the 2019 release of IPGOD.

    How do I use IPGOD?

    IPGOD is large, with millions of data points across up to 40 tables, making it too large to open with Microsoft Excel. Furthermore, analysis often requires information from separate tables, which would need specialised software for merging. We recommend that advanced users interact with the IPGOD data using the right tools, with enough memory and compute power. This includes a wide range of programming and statistical software such as Tableau, Power BI, Stata, SAS, R, Python, and Scala.

    IP Data Platform

    IP Australia is also providing free trials of a cloud-based analytics platform, the IP Data Platform, with the capability to work with large intellectual property datasets such as IPGOD through the web browser, without installing any software.

    References

    The following pages can help you gain an understanding of intellectual property administration and processes in Australia to support your analysis of the dataset.

    Updates

    Tables and columns

    Due to the changes in our systems, some tables have been affected.

    • We have added IPGOD 225 and IPGOD 325 to the dataset!
    • The IPGOD 206 table is not available this year.
    • Many tables have been re-built, and as a result may have different columns or different possible values. Please check the data dictionary for each table before use.

    Data quality improvements

    Data quality has been improved across all tables (a loading sketch follows the list below).

    • Null values are simply empty rather than '31/12/9999'.
    • All date columns are now in ISO format 'yyyy-mm-dd'.
    • All indicator columns have been converted to Boolean data type (True/False) rather than Yes/No, Y/N, or 1/0.
    • All tables are encoded in UTF-8.
    • All tables use the backslash \ as the escape character.
    • The applicant name cleaning and matching algorithms have been updated. We believe that this year's method improves the accuracy of the matches. Please note that the "ipa_id" generated in IPGOD 2019 will not match those in previous releases of IPGOD.
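    A minimal loading sketch under the conventions listed above (UTF-8 encoding, ISO dates, backslash escape character, empty nulls); the file and column names are placeholders for whichever IPGOD table you downloaded.

    import pandas as pd

    df = pd.read_csv(
        "ipgod_table.csv",   # substitute the actual IPGOD table file name
        encoding="utf-8",
        escapechar="\\",
    )
    # Date columns are ISO 'yyyy-mm-dd', so they parse directly, e.g.:
    # df["some_date_column"] = pd.to_datetime(df["some_date_column"], format="%Y-%m-%d")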
  9. Data from: Challenges with using names to link digital biodiversity information

    • data.niaid.nih.gov
    • datadryad.org
    • +1more
    zip
    Updated May 20, 2017
    + more versions
    Cite
    David J. Patterson; Dmitry Mozzherin; David Peter Shorthouse; Anne Thessen (2017). Challenges with using names to link digital biodiversity information [Dataset]. http://doi.org/10.5061/dryad.3160r
    Explore at:
    Available download formats: zip
    Dataset updated
    May 20, 2017
    Authors
    David J. Patterson; Dmitry Mozzherin; David Peter Shorthouse; Anne Thessen
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    The need for a names-based cyber-infrastructure for digital biology is based on the argument that scientific names serve as a standardized metadata system that has been used consistently and near universally for 250 years. As we move towards data-centric biology, name-strings can be called on to discover, index, manage, and analyze accessible digital biodiversity information from multiple sources. Known impediments to the use of scientific names as metadata include synonyms, homonyms, mis-spellings, and the use of other strings as identifiers. We here compare the name-strings in GenBank, Catalogue of Life (CoL), and the Dryad Digital Repository (DRYAD) to assess the effectiveness of the current names-management toolkit developed by Global Names to achieve interoperability among distributed data sources. New tools that have been used here include Parser (to break name-strings into component parts and to promote the use of canonical versions of the names), a modified TaxaMatch fuzzy-matcher (to help manage typographical, transliteration, and OCR errors), and Cross-Mapper (to make comparisons among data sets). The data sources include scientific names at multiple ranks; vernacular (common) names; acronyms; strain identifiers and other surrogates including idiosyncratic abbreviations and concatenations. About 40% of the name-strings in GenBank are scientific names representing about 400,000 species or infraspecies and their synonyms. Of the formally-named terminal taxa (species and lower taxa) represented, about 82% have a match in CoL. Using a subset of content in DRYAD, about 45% of the identifiers are names of species and infraspecies, and of these only about a third have a match in CoL. With simple processing, the extent of matching between DRYAD and CoL can be improved to over 90%. The findings confirm the necessity for name-processing tools and the value of scientific names as a mechanism to interconnect distributed data, and identify specific areas of improvement for taxonomic data sources. Some areas of diversity (bacteria and viruses) are not well represented by conventional scientific names, and they and other forms of strings (acronyms, identifiers, and other surrogates) that are used instead of names need to be managed in reconciliation services (mapping alternative name-strings for the same taxon together). On-line resolution services will bring older scientific names up to date or convert surrogate name-strings to scientific names should such names exist. Examples are given of many of the aberrant forms of ‘names’ that make their way into these databases. The occurrence of scientific names with incorrect authors, such as chresonyms within synonymy lists, is a quality-control issue in need of attention. We propose a future-proofing solution that will empower stakeholders to take advantage of the name-based infrastructure at little cost. This proposed infrastructure includes a standardized system that adopts or creates UUIDs for name-strings, software that can identify name-strings in sources and apply the UUIDs, reconciliation and resolution services to manage the name-strings, and an annotation environment for quality control by users of name-strings.
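    As a toy illustration (not the Global Names Parser) of why canonical name forms help the matching described above, the following Python sketch strips authorship and year from a name-string before comparison:

    import re

    def canonical(name_string):
        """Keep the leading genus and species epithet, dropping authorship and year."""
        m = re.match(r"^([A-Z][a-z]+)\s+([a-z\-]+)", name_string.strip())
        return f"{m.group(1)} {m.group(2)}" if m else name_string.strip()

    print(canonical("Homo sapiens Linnaeus, 1758") == canonical("Homo sapiens"))  # True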

  10. Human Matching Behavior in Social Networks: An Algorithmic Perspective

    • plos.figshare.com
    tiff
    Updated Jun 8, 2023
    Cite
    Lorenzo Coviello; Massimo Franceschetti; Mathew D. McCubbins; Ramamohan Paturi; Andrea Vattani (2023). Human Matching Behavior in Social Networks: An Algorithmic Perspective [Dataset]. http://doi.org/10.1371/journal.pone.0041900
    Explore at:
    Available download formats: tiff
    Dataset updated
    Jun 8, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Lorenzo Coviello; Massimo Franceschetti; Mathew D. McCubbins; Ramamohan Paturi; Andrea Vattani
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We argue that algorithmic modeling is a powerful approach to understanding the collective dynamics of human behavior. We consider the task of pairing up individuals connected over a network, according to the following model: each individual is able to propose to match with and accept a proposal from a neighbor in the network; if a matched individual proposes to another neighbor or accepts another proposal, the current match will be broken; individuals can only observe whether their neighbors are currently matched but have no knowledge of the network topology or the status of other individuals; and all individuals have the common goal of maximizing the total number of matches. By examining the experimental data, we identify a behavioral principle called prudence, develop an algorithmic model, analyze its properties mathematically and by simulations, and validate the model with human subject experiments for various network sizes and topologies. Our results include i) a 1/2-approximate maximum matching is obtained in logarithmic time in the network size for bounded-degree networks; ii) for any constant ε > 0, a (1 − ε)-approximate maximum matching is obtained in polynomial time, while obtaining a maximum matching can require exponential time; and iii) convergence to a maximum matching is slower on preferential attachment networks than on small-world networks. These results allow us to predict that while humans can find a “good quality” matching quickly, they may be unable to find a maximum matching in feasible time. We show that the human subjects largely abide by prudence, and their collective behavior is closely tracked by the above predictions.
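    To make the matching terminology concrete, the following Python sketch (using the third-party networkx library; graph sizes and parameters are illustrative) contrasts a greedy maximal matching, which a local uncoordinated process can reach quickly, with a maximum matching computed centrally, on the two network families mentioned above.

    import networkx as nx

    graphs = {
        "small-world": nx.watts_strogatz_graph(100, 4, 0.1, seed=1),
        "preferential attachment": nx.barabasi_albert_graph(100, 2, seed=1),
    }
    for label, g in graphs.items():
        greedy = nx.maximal_matching(g)                            # greedy, no coordination needed
        maximum = nx.max_weight_matching(g, maxcardinality=True)   # optimal (maximum cardinality)
        print(label, "- greedy:", len(greedy), "pairs; maximum:", len(maximum), "pairs")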

  11. Overdose-Related 911 Responses by Emergency Medical Services

    • data.sfgov.org
    • catalog.data.gov
    application/rdfxml +5
    Updated Jul 21, 2025
    Cite
    (2025). Overdose-Related 911 Responses by Emergency Medical Services [Dataset]. https://data.sfgov.org/widgets/ed3a-sn39?mobile_redirect=true
    Explore at:
    Available download formats: json, application/rssxml, tsv, xml, csv, application/rdfxml
    Dataset updated
    Jul 21, 2025
    License

    ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
    License information was derived automatically

    Description

    A. SUMMARY This dataset comes from the San Francisco Emergency Medical Services Agency and includes all opioid overdose-related 911 calls responded to by emergency medical services (ambulances). The purpose of this dataset is to show how many opioid overdose-related 911 calls the San Francisco Fire Department and other ambulance companies respond to each week. This dataset is based on ambulance patient care records and not 911 calls for service data.

    B. HOW THE DATASET IS CREATED The San Francisco Fire Department and other ambulance companies send electronic patient care reports to the California Emergency Medical Services Agency for all 911 calls they respond to. The San Francisco Emergency Medical Services Agency (SF EMSA) has access to the state database that includes all reports for 911 calls in San Francisco County. In order to identify overdose-related calls that resulted in an emergency medical service (or ambulance) response, SF EMSA filters the patient care reports based on set criteria used in other jurisdictions called The Rhode Island Criteria. These criteria filter calls to only include those calls where EMS documented that an opioid overdose was involved and/or naloxone (Narcan) was administered. Calls that do not involve an opioid overdose are filtered out of the dataset. Calls that result in a patient death on scene are also filtered out of the dataset.

    This dataset is created by copying the total number of calls each week when the state makes this data available.

    C. UPDATE PROCESS Data is generally available with a 24-hour lag on a weekly frequency but the exact lag and update frequency is based on when the State makes this data available.

    D. HOW TO USE THIS DATASET This dataset includes the total number of calls a week. The week starts on a Sunday and ends on the following Saturday.

    This dataset will not match the Fire Department Calls for Service dataset, as this dataset has been filtered to include only opioid overdose-related 911 calls based on electronic patient care report data. Additionally, the Fire Department Calls for Service data are primarily based on 911 call data (i.e. calls triaged and recorded by San Francisco’s 911 call center) and not the finalized electronic patient care reports recorded by Fire Department paramedics.

    E. RELATED DATASETS Fire Department Calls for Service San Francisco Department of Public Health Substance Use Services Unintentional Overdose Death Rates by Race/Ethnicity Preliminary Unintentional Drug Overdose Deaths

    F. CHANGE LOG

    • 1/17/2024 - updated date/time fields from Coordinated Universal Time (UTC) to Pacific Time (PT) which caused a slight change in historic case counts by week.

  12. Film Circulation dataset

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, png
    Updated Jul 12, 2024
    Cite
    Skadi Loist; Evgenia (Zhenya) Samoilova (2024). Film Circulation dataset [Dataset]. http://doi.org/10.5281/zenodo.7887672
    Explore at:
    Available download formats: csv, png, bin
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Skadi Loist; Evgenia (Zhenya) Samoilova
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”

    A peer-reviewed data paper for this dataset is under review for publication in NECSUS_European Journal of Media Studies, an open access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org.

    Please cite this when using the dataset.


    Detailed description of the dataset:

    1 Film Dataset: Festival Programs

    The Film Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.

    The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.

    The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.


    2 Survey Dataset

    The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.

    The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.

    The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.


    3 IMDb & Scripts

    The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.

    The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.

    The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.

    The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.

    The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.

    The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to each crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.

    The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.

    The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.

    The csv file “3_imdb-dataset_release-info_long” contains data about non-festival release (e.g., theatrical, digital, tv, dvd/blueray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.

    The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.

    The dataset includes 8 text files containing the script for webscraping. They were written using the R-3.6.3 version for Windows.

    The R script “r_1_unite_data” demonstrates the structure of the dataset, that we use in the following steps to identify, scrape, and match the film data.

    The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then potentially using an alternative title and a basic search if no matches are found in the advanced search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records from the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, with a fuzzy matching approach using two methods: “cosine” and “osa”, where the cosine similarity is used to match titles with a high degree of similarity, and the OSA algorithm is used to match titles that may have typos or minor variations.
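    As a rough Python analogue (not the project's R code) of the two title-matching methods, the sketch below computes a character-trigram cosine similarity for near-identical titles and uses difflib's edit-based ratio as a stand-in for the OSA-style comparison that tolerates typos; the titles are invented examples.

    from collections import Counter
    from difflib import SequenceMatcher
    from math import sqrt

    def trigram_cosine(a, b):
        """Cosine similarity between character-trigram count vectors of two strings."""
        def grams(s):
            s = "  " + s.lower() + " "
            return Counter(s[i:i + 3] for i in range(len(s) - 2))
        ga, gb = grams(a), grams(b)
        dot = sum(ga[t] * gb[t] for t in ga)
        return dot / (sqrt(sum(v * v for v in ga.values())) * sqrt(sum(v * v for v in gb.values())))

    core_title, imdb_title = "Portrait of a Lady on Fire", "Portait of a Lady on Fire"
    print(round(trigram_cosine(core_title, imdb_title), 2))                                 # n-gram view
    print(round(SequenceMatcher(None, core_title.lower(), imdb_title.lower()).ratio(), 2))  # edit-based view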

    The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of the following five categories: a) 100% match: perfect match on title, year, and director; b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible doubles in the dataset and identifies them for a manual check.

    The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.

    The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. This script does that for the first 100 films, to check if everything works. Scraping the entire dataset took a few hours; therefore, a test with a subsample of 100 films is advisable.

    The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.

    The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.

    The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the amount of missing values and errors.


    4 Festival Library Dataset

    The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.

    The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, and festival categories,

  13. PUBG 1PP Squads, 31147 Matches

    • kaggle.com
    zip
    Updated Apr 6, 2018
    Cite
    MichaelApers (2018). PUBG 1PP Squads, 31147 Matches [Dataset]. https://www.kaggle.com/datasets/michaelapers/pubg-1pp-squads-31147-matches
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Apr 6, 2018
    Authors
    MichaelApers
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context/Content

    I wanted to do some analysis on higher-elo matches (if there is a matchmaking system in place at all is one question that might be answered) so I collected my own dataset inspired by https://www.kaggle.com/skihikingkevin/pubg-match-deaths/data. I seeded my data with JPFog and collected 100 matches of first person squads matches of his and of those that he ran into until I had a reasonable amount of data (read: I got tired of leaving my machine on overnight scraping and didn't want to set up AWS).

    Included is that data, along with a names-list of every name that I ran into, a names-list of every name I searched on, and a flattened pandas-friendlier version of that data.

    I would have made this data friendlier to work with, but I wanted to push it out before PUBG releases their own official seed data, which should be coming soon(tm). This way we all get to play around with this in Tableau, and maybe make comparisons to other datasets released up to 8 months ago.

    Acknowledgements

    pubg.op.gg ; If this is against their TOS let me know (their TOS is in korean and I can't read it) and I will take this down.

    Inspiration

    https://www.kaggle.com/skihikingkevin/pubg-match-deaths/data

  14. DISCERN: Duke Innovation & SCientific Enterprises Research Network

    • zenodo.org
    bin, zip
    Updated Aug 1, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Arora Ashish; Belenzon Sharon; Sheer Lia (2024). DISCERN: Duke Innovation & SCientific Enterprises Research Network [Dataset]. http://doi.org/10.5281/zenodo.3709084
    Explore at:
    Available download formats: bin, zip
    Dataset updated
    Aug 1, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Arora Ashish; Belenzon Sharon; Sheer Lia
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This database links innovation data to Compustat firms. When using the data, please cite "Knowledge Spillovers and Corporate Investment in Scientific Research" (Arora, Belenzon and Sheer), NBER WP 23187. A special thanks and appreciation go to Bernardo Dionisi, Honggi Lee, Dror Shvadron and JK Suh for their diligent work and dedication to this effort over the past several years.

    This project introduces major data extension and improvement to the historical NBER patent dataset, which should be valuable for all researchers working with patent data linked to firms. In updating the data to match between Compustat and patents to 2015, we address two major challenges: name changes and ownership changes. These challenges are central to how patents are assigned to firms over time. To be consistent over the sample period, we reconstruct the complete historical data covered in the NBER data files.

    About 30% of the Compustat firms in our sample change their name at least once. Accounting for name changes improves the accuracy and scope of matches to patents (and other assets), ownership structure, and dynamic reassignments of GVKEY codes to companies. Dynamic reassignment means that, for instance, if a sample firm merges with another firm, the patents of the merged firm are included in the stock of patents linked to the Compustat record from that point onward, but not before.
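    A schematic sketch of the dynamic reassignment idea (column names are hypothetical, not the DISCERN schema): a patent is attributed to a GVKEY only for grant years falling inside the period in which its assignee firm is linked to that GVKEY.

    import pandas as pd

    ownership = pd.DataFrame({              # firm-to-GVKEY links with validity years
        "firm_id": ["A", "A", "B"],
        "gvkey":   [1001, 2002, 2002],      # e.g., firm A is acquired by GVKEY 2002 in 2010
        "from_yr": [1990, 2010, 1990],
        "to_yr":   [2009, 2015, 2015],
    })
    patents = pd.DataFrame({
        "patent_id": [1, 2, 3],
        "firm_id":   ["A", "A", "B"],
        "grant_yr":  [2005, 2012, 2000],
    })
    linked = patents.merge(ownership, on="firm_id")
    linked = linked[(linked.grant_yr >= linked.from_yr) & (linked.grant_yr <= linked.to_yr)]
    print(linked[["patent_id", "gvkey", "grant_yr"]])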

    For ownership and subsidiary data we rely on a wide range of M&A data, including SDC, historical snapshots of ORBIS files for 2002-2015, 10-K SEC filings, and NBER2006, and we perform extensive manual checks that help us uncover firms' structure and ownership changes before proceeding to the patent match. Thus, we have extended and improved the NBER patent data. In the enclosed "Data Appendix", we document our data construction work, present several examples (“case studies”), and outline the improvements we made to existing NBER historical patent data.

  15. 2018-2019 Premier League Data

    • kaggle.com
    Updated Jun 1, 2019
    Cite
    JohnDoe (2019). 2018-2019 Premier League Data [Dataset]. https://www.kaggle.com/thesiff/premierleague1819/code
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Jun 1, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    JohnDoe
    Description

    Context

    Looking back at the season that was 2018-2019, and looking to delve into deeper insights. Using the data to see how clubs are similar stylistically, in the way they pass, attack and score goals.

    Content

    This data set is wide-ranging in the sense that it encompasses stats seen on a regular league table but goes beyond them, looking at how teams pass and keep possession, how they defend and tackle, as well as looking at market values of a team and how much money each team was allotted from the TV rights deal. This data was gathered from 1) BBC Sports Football, 2) Premierleague.com, and 3) Transfermarkt.co.uk. This data was not scraped in a conventional sense and appears in a rather haphazard manner. To counter this, I included category descriptors at the start of each variable name; this should help to provide a more cohesive understanding of the data set as well as aid in subsetting.


    Inspiration

    I've done some rather elementary data analysis and exploration. I would love to see the community wrangle with this and explore further, create more complex models, apply some ML and see what insight can be gathered from this data.

  16. U.S. Community Water Systems Service Boundaries, v1.0.0

    • search.dataone.org
    • hydroshare.org
    • +1more
    Updated Dec 30, 2023
    + more versions
    Cite
    SimpleLab; EPIC (2023). U.S. Community Water Systems Service Boundaries, v1.0.0 [Dataset]. https://search.dataone.org/view/sha256%3A59229305d23a6ab6336be773a3ed2c75ac3586a69c775bba3a8e8101834dcc98
    Explore at:
    Dataset updated
    Dec 30, 2023
    Dataset provided by
    Hydroshare
    Authors
    SimpleLab; EPIC
    Area covered
    Description

    This is a layer of water service boundaries for 44,919 community water systems that deliver tap water to 306.88 million people in the US. This amounts to 97.22% of the population reportedly served by active community water systems and 90.85% of active community water systems. The layer is based on multiple data sources and a methodology developed by SimpleLab and collaborators called a Tiered, Explicit, Match, and Model approach–or TEMM, for short. The name of the approach reflects exactly how the nationwide data layer was developed. The TEMM is composed of three hierarchical tiers, arranged by data and model fidelity. First, we use explicit water service boundaries provided by states. These are spatial polygon data, typically provided at the state-level. We call systems with explicit boundaries Tier 1. In the absence of explicit water service boundary data, we use a matching algorithm to match water systems to the boundary of a town or city (Census Place TIGER polygons). When a water system and TIGER place match one-to-one, we label this Tier 2a. When multiple water systems match to the same TIGER place, we label this Tier 2b. Tier 2b reflects overlapping boundaries for multiple systems. Finally, in the absence of an explicit water service boundary (Tier 1) or a TIGER place polygon match (Tier 2a or Tier 2b), a statistical model trained on explicit water service boundary data (Tier 1) is used to estimate a reasonable radius at provided water system centroids, and model a spherical water system boundary (Tier 3).
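    The tier logic described above can be summarized in a short Python sketch (field names are hypothetical, not the layer's actual schema):

    def assign_tier(has_state_boundary, n_systems_matched_to_place=None):
        """Return the TEMM tier for one water system."""
        if has_state_boundary:
            return "Tier 1"   # explicit state-provided service boundary
        if n_systems_matched_to_place == 1:
            return "Tier 2a"  # one-to-one match to a Census Place (TIGER) polygon
        if n_systems_matched_to_place and n_systems_matched_to_place > 1:
            return "Tier 2b"  # several systems share the same TIGER place polygon
        return "Tier 3"       # modeled boundary of estimated radius around the system centroid

    print(assign_tier(False, 1))  # Tier 2a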

    Several limitations to this data exist, and the layer should be used with them in mind:

    • First, assigning the same Census Place TIGER polygon to multiple systems results in an inaccurate assignment of the same exact area to multiple systems; we hope to resolve Tier 2b systems into Tier 2a or Tier 3 in a future iteration.
    • Second, the matching algorithms that assign Census Place boundaries require additional validation and iteration.
    • Third, Tier 3 boundaries have modeled radii stemming from a lat/long centroid of a water system facility, but the underlying lat/long centroids are of variable quality. It is critical to evaluate the "geometry quality" column (included from the EPA ECHO data source) when looking at Tier 3 boundaries; fidelity is very low when the geometry quality is a county or state centroid, but we did not exclude the data from the layer (a filtering sketch follows this list).
    • Fourth, missing water systems are typically those without a centroid, in a U.S. territory, or missing population and connection data.
    • Finally, Tier 1 systems are assumed to be high fidelity, but rely on the accuracy of state data collection and maintenance.
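
    As a hedged example of the quality check recommended above, one might drop Tier 3 boundaries whose geometry quality is only a county or state centroid before analysis; the file name, column names, and label values here are assumptions:

        import geopandas as gpd

        # Assumed file name, column names, and label values.
        wsb = gpd.read_file("temm_water_service_boundaries.gpkg")
        low_fidelity = {"COUNTY CENTROID", "STATE CENTROID"}
        tier3 = wsb[wsb["tier"] == "Tier 3"]
        tier3_usable = tier3[~tier3["geometry_quality"].isin(low_fidelity)]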

    All data, methods, documentation, and contributions are open-source and available here: https://github.com/SimpleLab-Inc/wsb.

  17. f

    Matches

    • figshare.com
    zip
    Updated Feb 26, 2019
    Cite
    Luca Pappalardo; Emanuele Massucco (2019). Matches [Dataset]. http://doi.org/10.6084/m9.figshare.7770422.v1
    Explore at:
    zip (available download formats)
    Dataset updated
    Feb 26, 2019
    Dataset provided by
    figshare
    Authors
    Luca Pappalardo; Emanuele Massucco
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset describes all the matches made available. Each match is a document consisting of the following fields:

    • competitionId: the identifier of the competition to which the match belongs. It is an integer and refers to the field "wyId" of the competition document;
    • date and dateutc: the former specifies the date and time when the match starts in explicit format (e.g., May 20, 2018 at 8:45:00 PM GMT+2), the latter contains the same information in the compact format YYYY-MM-DD hh:mm:ss;
    • duration: the duration of the match. It can be "Regular" (matches of regular duration of 90 minutes + stoppage time), "ExtraTime" (matches with supplementary times, as may happen for matches in continental or international competitions), or "Penalities" (matches which end at penalty kicks, as may happen for continental or international competitions);
    • gameweek: the week of the league, starting from the beginning of the league;
    • label: contains the names of the two clubs and the result of the match (e.g., "Lazio - Internazionale, 2 - 3");
    • roundID: indicates the match-day of the competition to which the match belongs. During a competition for soccer clubs, each participating club plays against each of the other clubs twice, once at home and once away. The matches are organized in match-days: all the matches in match-day i are played before the matches in match-day i + 1, even though some matches can be anticipated or postponed to accommodate players and clubs participating in continental or intercontinental competitions. During a competition for national teams, the roundID indicates the stage of the competition (eliminatory round, round of 16, quarter finals, semifinals, final);
    • seasonId: indicates the season of the match;
    • status: it can be "Played" (the match has officially finished), "Cancelled" (the match has been cancelled for some reason), "Postponed" (the match has been postponed and no new date and time is available yet) or "Suspended" (the match has been suspended and no new date and time is available yet);
    • venue: the stadium where the match was held (e.g., "Stadio Olimpico");
    • winner: the identifier of the team which won the game, or 0 if the match ended in a draw;
    • wyId: the identifier of the match, assigned by Wyscout;
    • teamsData: contains several subfields describing each team playing the match, such as lineup, bench composition, list of substitutions, coach and scores:
      - hasFormation: has value 0 if no formation (lineups and benches) is present, and 1 otherwise;
      - score: the number of goals scored by the team during the match (not counting penalties);
      - scoreET: the number of goals scored by the team during the match, including extra time (not counting penalties);
      - scoreHT: the number of goals scored by the team during the first half of the match;
      - scoreP: the total number of goals scored by the team after the penalties;
      - side: the team's side in the match ("home" or "away");
      - teamId: the identifier of the team;
      - coachId: the identifier of the team's coach;
      - bench: the list of the team's players that started the match on the bench and some basic statistics about their performance during the match (goals, own goals, cards);
      - lineup: the list of the team's players in the starting lineup and some basic statistics about their performance during the match (goals, own goals, cards);
      - substitutions: the list of the team's substitutions during the match, describing the players involved and the minute of the substitution.
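
    A minimal sketch of reading one of these match documents with Python's standard library, assuming the matches are stored as a JSON array in a file (the file name is hypothetical; the field names follow the description above):

        import json

        # Hypothetical file name containing an array of match documents.
        with open("matches.json") as f:
            matches = json.load(f)

        match = matches[0]
        print(match["label"], "at", match["venue"], "-", match["status"])
        print("winner teamId:", match["winner"])      # 0 means the match ended in a draw
        for team_id, team in match["teamsData"].items():
            print(team_id, team["side"], "scored", team["score"])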

  18. d

    Data from: Geographic variation in the matching between call characteristics...

    • dataone.org
    • datadryad.org
    Updated May 23, 2025
    Cite
    Antonieta Labra; Claudio Reyes-Olivares; Felipe N. Moreno-Gómez; Nelson A. Velásquez; Mario Penna; Paul H. Delano; Peter M. Narins (2025). Geographic variation in the matching between call characteristics and tympanic sensitivity in the Weeping lizard [Dataset]. http://doi.org/10.5061/dryad.mw6m905z2
    Explore at:
    Dataset updated
    May 23, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Antonieta Labra; Claudio Reyes-Olivares; Felipe N. Moreno-Gómez; Nelson A. Velásquez; Mario Penna; Paul H. Delano; Peter M. Narins
    Time period covered
    Feb 21, 2023
    Description

    Effective communication requires a match among signal characteristics, environmental conditions, and receptor tuning and decoding. The degree of matching, however, can vary, among others due to different selective pressures affecting the communication components. For evolutionary novelties, strong selective pressures are likely to act upon the signal and receptor to promote a tight match among them. We test this prediction by exploring the coupling between the acoustic signals and auditory sensitivity in Liolaemus chiliensis, the Weeping lizard, the only one of more than 285 Liolaemus species that vocalizes. Individuals emit distress calls that convey information of predation risk to conspecifics, which may respond with antipredator behaviors upon hearing calls. Specifically, we explored the match between spectral characteristics of the distress calls and the tympanic sensitivities of two populations separated by more than 700 km, for which previous data suggested variation in their dis...

  19. Magic Chess: Go Go

    • kaggle.com
    Updated Jul 1, 2025
    Cite
    Dina Nabila (2025). Magic Chess: Go Go [Dataset]. https://www.kaggle.com/datasets/dinanabb/magic-chess-go-go
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 1, 2025
    Dataset provided by
    Kaggle
    Authors
    Dina Nabila
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Magic Chess: Go Go Hero & Card Reference Dataset

    This repository contains reference datasets for heroes and cards used in Magic Chess: Go Go. These datasets are intended to complement the Magic Chess: Go Go Matches Dataset by providing structured information about hero attributes and card effects, enabling more meaningful and accurate analysis.

    📂 Dataset Structure

    The dataset consists of two CSV files:

    1. heroes

    A dataset containing information about all heroes available in the game.

    Columns:
    • Hero: Name of the hero
    • Faction: Hero's faction (e.g., Doomsworm, Faeborn)
    • Role: Hero's role (e.g., Marksman, Defender)
    • Row: Recommended position on the board (Front or Back)
    • Cost: Hero's gold cost in the shop (1 to 5)

    2. cards

    A dataset describing the attributes and effects of various in-game cards.
    Each card is encoded with binary features: 1 indicates that the card has that attribute, and 0 means it does not.

    Columns:
    • Card: Name of the card
    • Color: Card color type (e.g., Orange, Purple)
    • Magic_Boost: Boosts magic stat
    • Physical_Boost: Boosts physical stat
    • ATKSpeed_Boost: Increases attack speed
    • Defense_Boost: Increases defense or durability
    • Synergy: Enhances or interacts with a synergy
    • Magic_Equipment: Provides magic-type equipment
    • Physical_Equipment: Provides physical-type equipment
    • ATKSpeed_Equipment: Provides attack speed equipment
    • Defense_Equipment: Provides defensive equipment
    • Hero_Recruitment: Summons a new hero
    • Capacity+: Increases max team capacity
    • Economy: Boosts economy or gold gain
    • Commander_EXP: Grants Commander experience
    • Commander_Life: Restores Commander HP
    • Synergy_Effect: Modifies synergy effects or adds synergy bonuses

    🎯 Purpose

    These reference datasets are designed to support match-level analysis in the Magic Chess: Go Go Matches Dataset. They provide additional context such as:

    • Hero characteristics for synergy or frontline/backline composition analysis
    • Card effects for evaluating card impact on win rate or synergy boosts

    By linking match data with this reference information, we can uncover deeper patterns and more accurate insights.
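
    A hedged sketch of that linkage with pandas: the heroes file follows the columns listed above, while the match-level file, its "Hero" column, and the "Placement" column are assumptions about the companion Matches dataset:

        import pandas as pd

        heroes = pd.read_csv("heroes.csv")
        matches = pd.read_csv("matches.csv")      # hypothetical match-level file

        # Attach faction, role and cost to each hero appearing in a match record.
        enriched = matches.merge(heroes, on="Hero", how="left")

        # Example: average final placement by faction (assumes a "Placement" column).
        print(enriched.groupby("Faction")["Placement"].mean().sort_values())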

    🧠 Use Cases

    • Enhance feature engineering for match analysis
    • Map hero usage to roles/factions for synergy analysis
    • Analyze the strategic impact of specific card attributes
    • Cluster or group cards/heroes by characteristics (e.g., all economy-boosting cards)

    🏗️ Collection Methodology

    All data was entered manually into a spreadsheet.

  20. D

    Census Tract Top 50 American Community Survey Data

    • data.seattle.gov
    • hub.arcgis.com
    • +1more
    application/rdfxml +5
    Updated Feb 3, 2025
    + more versions
    Cite
    (2025). Census Tract Top 50 American Community Survey Data [Dataset]. https://data.seattle.gov/dataset/Census-Tract-Top-50-American-Community-Survey-Data/jya9-y5bv/data
    Explore at:
    application/rdfxml, csv, json, application/rssxml, tsv, xml (available download formats)
    Dataset updated
    Feb 3, 2025
    Description

    Data from: American Community Survey, 5-year Series


    King County, Washington census tracts with nonoverlapping vintages of the 5-year American Community Survey (ACS) estimates, starting in 2010, covering over 50 of the most requested attributes derived from the U.S. Census Bureau's demographic profiles (DP02-DP05). The layer also includes the most recent annual release, with the vintage identified in the "ACS Vintage" field.

    The census tract boundaries match the vintage of the ACS data (currently 2010 and 2020) so please note the geographic changes between the decades.

    Tracts have been coded as being within the City of Seattle as well as assigned to neighborhood groups called "Community Reporting Areas". These areas were created after the 2000 census to provide geographically consistent neighborhoods through time for reporting U.S. Census Bureau data. This is not an attempt to identify neighborhood boundaries as defined by neighborhoods themselves.

    Vintages: 2010, 2015, 2020, 2021, 2022, 2023
    ACS Table(s): DP02, DP03, DP04, DP05


    The United States Census Bureau's American Community Survey (ACS):
    This ready-to-use layer can be used within ArcGIS Pro, ArcGIS Online, its configurable apps, dashboards, Story Maps, custom apps, and mobile apps. Data can also be exported for offline workflows. Please cite the Census and ACS when using this data.
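
    For offline work, a hedged sketch of selecting a single nonoverlapping vintage from an exported copy of the layer; the file name is hypothetical, and while the "ACS Vintage" field name comes from the description above, the exact format of its values may differ:

        import pandas as pd

        tracts = pd.read_csv("census_tract_top50_acs.csv")   # hypothetical export
        acs_2020 = tracts[tracts["ACS Vintage"] == 2020]      # one nonoverlapping vintage
        print(len(acs_2020), "tract records in the 2020 vintage")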

    Data Note from the Census:
    Data are based on a sample and are subject to sampling variability. The degree of uncertainty for an estimate arising from sampling variability is represented through the use of a margin of error. The value shown here is the 90 percent margin of error. The margin of error can be interpreted as providing a 90 percent probability that the interval defined by the estimate minus the margin of error and the estimate plus the margin of error (the lower and upper confidence bounds) contains the true value. In addition to sampling variability, the ACS estimates are subject to nonsampling error (for a discussion of nonsampling variability, see Accuracy of the Data). The effect of nonsampling error is not represented in these tables.
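
    As a worked example of the interval described above, with made-up numbers: a tract estimate of 4,200 with a 90 percent margin of error of 310 corresponds to a 90 percent confidence interval of 4,200 ± 310.

        # Made-up numbers, purely to illustrate the estimate-plus-or-minus-MOE interval.
        estimate, moe = 4200, 310
        lower, upper = estimate - moe, estimate + moe
        print(f"90% confidence interval: [{lower}, {upper}]")   # [3890, 4510]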

    Data Processing Notes: