76 datasets found
  1. Supplementary material from "Visual comparison of two data sets: Do people...

    • figshare.com
    xlsx
    Updated Mar 14, 2017
    Cite
    Robin Kramer; Caitlin Telfer; Alice Towler (2017). Supplementary material from "Visual comparison of two data sets: Do people use the means and the variability?" [Dataset]. http://doi.org/10.6084/m9.figshare.4751095.v1
    Available download formats: xlsx
    Dataset updated
    Mar 14, 2017
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Robin Kramer; Caitlin Telfer; Alice Towler
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In our everyday lives, we are required to make decisions based upon our statistical intuitions. Often, these involve the comparison of two groups, such as luxury versus family cars and their suitability. Research has shown that the mean difference affects judgements where two sets of data are compared, but the variability of the data has only a minor influence, if any at all. However, prior research has tended to present raw data as simple lists of values. Here, we investigated whether displaying data visually, in the form of parallel dot plots, would lead viewers to incorporate variability information. In Experiment 1, we asked a large sample of people to compare two fictional groups (children who drank ‘Brain Juice’ versus water) in a one-shot design, where only a single comparison was made. Our results confirmed that only the mean difference between the groups predicted subsequent judgements of how much they differed, in line with previous work using lists of numbers. In Experiment 2, we asked each participant to make multiple comparisons, with both the mean difference and the pooled standard deviation varying across data sets they were shown. Here, we found that both sources of information were correctly incorporated when making responses. Taken together, we suggest that increasing the salience of variability information, through manipulating this factor across items seen, encourages viewers to consider this in their judgements. Such findings may have useful applications for best practices when teaching difficult concepts like sampling variation.
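
    The two manipulated quantities, the mean difference and the pooled standard deviation, can be computed directly from two samples. A minimal sketch in Python; the group values below are invented for illustration and are not from the study:

```python
import numpy as np

def pooled_sd(a, b):
    """Pooled standard deviation of two independent samples (using ddof=1)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    na, nb = len(a), len(b)
    return np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2))

# Hypothetical 'Brain Juice' vs. water scores, for illustration only
juice = np.array([104, 110, 98, 107, 101])
water = np.array([99, 95, 102, 97, 100])

print("mean difference:", juice.mean() - water.mean())
print("pooled SD:", pooled_sd(juice, water))
```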

  2. Film Circulation dataset

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, png
    Updated Jul 12, 2024
    Cite
    Skadi Loist; Evgenia (Zhenya) Samoilova (2024). Film Circulation dataset [Dataset]. http://doi.org/10.5281/zenodo.7887672
    Available download formats: csv, png, bin
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Skadi Loist; Evgenia (Zhenya) Samoilova
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”

    A peer-reviewed data paper for this dataset is under review at NECSUS_European Journal of Media Studies, an open access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org

    Please cite this when using the dataset.


    Detailed description of the dataset:

    1 Film Dataset: Festival Programs

    The Film Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.

    The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.

    The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.
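
    As a rough illustration of how the long table relates to the wide one, a film that appears in several rows of the long file can be collapsed to a single row per unique ID. The snippet below is a sketch only; the column names ("film_id", "year") are assumptions and should be checked against the codebook.

```python
import pandas as pd

# Column names here ("film_id", "year") are assumptions; the actual variable
# names are defined in "1_codebook_film-dataset_festival-program".
long_df = pd.read_csv("1_film-dataset_festival-program_long.csv")

# In the long table the same film appears once per sampled festival; collapsing
# on the unique film ID approximates the wide table's one-row-per-film layout.
# (Picking the *first* festival within a year would additionally need the edition date.)
wide_like = (
    long_df.sort_values(["film_id", "year"])
           .drop_duplicates(subset="film_id", keep="first")
)
print(len(wide_like))   # should be close to the n=9,348 unique films described above
```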


    2 Survey Dataset

    The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.

    The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.

    The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is in the wide format, i.e. all information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.


    3 IMDb & Scripts

    The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.

    The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.

    The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.

    The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.

    The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.

    The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to one crew member of a given film. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.

    The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include every film screening, but only the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.

    The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.

    The csv file “3_imdb-dataset_release-info_long” contains data about non-festival releases (e.g., theatrical, digital, TV, DVD/Blu-ray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.

    The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.

    The dataset includes eight text files containing the scripts used for web scraping. They were written in R 3.6.3 for Windows.

    The R script “r_1_unite_data” demonstrates the structure of the dataset that we use in the following steps to identify, scrape, and match the film data.

    The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then, if no matches are found, using an alternative title and a basic search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of suggested records on the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with the suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, with a fuzzy matching approach based on two methods: “cosine” and “osa”. Cosine similarity is used to match titles with a high degree of similarity, and the OSA (optimal string alignment) algorithm is used to match titles that may have typos or minor variations.
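
    The two string metrics named above can be sketched without any of the scraping machinery. The snippet below is a plain-Python illustration written for this description (the original pipeline is in R), not the authors' code:

```python
from collections import Counter
import math

def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment (restricted Damerau-Levenshtein) distance."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def cosine_similarity(a: str, b: str, q: int = 2) -> float:
    """Cosine similarity over character q-grams (bigrams by default)."""
    grams = lambda s: Counter(s[i:i + q] for i in range(len(s) - q + 1))
    ga, gb = grams(a.lower()), grams(b.lower())
    dot = sum(ga[g] * gb[g] for g in ga)
    norm = math.sqrt(sum(v * v for v in ga.values())) * math.sqrt(sum(v * v for v in gb.values()))
    return dot / norm if norm else 0.0

# Hypothetical titles, for illustration only
print(osa_distance("Portrait of a Lady on Fire", "Portait of a Lady on Fire"))  # 1 (one missing letter)
print(round(cosine_similarity("Parasite", "Parasite (Gisaengchung)"), 2))
```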

    The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of five categories: a) 100% match (perfect match on title, year, and director); b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible doubles in the dataset and identifies them for a manual check.

    The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.

    The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. This script does so for the first 100 films only, to check that everything works. Scraping the entire dataset took a few hours, so a test with a subsample of 100 films is advisable.

    The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.

    The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.

    The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the number of missing values and errors.


    4 Festival Library Dataset

    The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.

    The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, and festival categories,

  3. A gridded database of the modern distributions of climate, woody plant taxa,...

    • catalog.data.gov
    • data.usgs.gov
    Updated Jul 6, 2024
    Cite
    U.S. Geological Survey (2024). A gridded database of the modern distributions of climate, woody plant taxa, and ecoregions for the continental United States and Canada [Dataset]. https://catalog.data.gov/dataset/a-gridded-database-of-the-modern-distributions-of-climate-woody-plant-taxa-and-ecoregions-
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Contiguous United States, Canada, United States
    Description

    On the continental scale, climate is an important determinant of the distributions of plant taxa and ecoregions. To quantify and depict the relations between specific climate variables and these distributions, we placed modern climate and plant taxa distribution data on an approximately 25-kilometer (km) equal-area grid with 27,984 points that cover Canada and the continental United States (Thompson and others, 2015). The gridded climatic data include annual and monthly temperature and precipitation, as well as bioclimatic variables (growing degree days, mean temperatures of the coldest and warmest months, and a moisture index) based on 1961-1990 30-year mean values from the University of East Anglia (UK) Climatic Research Unit (CRU) CL 2.0 dataset (New and others, 2002), and absolute minimum and maximum temperatures for 1951-1980 interpolated from climate-station data (WeatherDisc Associates, 1989). As described below, these data were used to produce portions of the "Atlas of relations between climatic parameters and distributions of important trees and shrubs in North America" (hereafter referred to as "the Atlas"; Thompson and others, 1999a, 1999b, 2000, 2006, 2007, 2012a, 2015).

    Evolution of the Atlas over the 16 years between Volumes A & B and G: The Atlas evolved through time as technology improved and our knowledge expanded. The climate data employed in the first five Atlas volumes were replaced by more standard and better documented data in the last two volumes (Volumes F and G; Thompson and others, 2012a, 2015). Similarly, the plant distribution data used in Volumes A through D (Thompson and others, 1999a, 1999b, 2000, 2006) were improved for the latter volumes. However, the digitized ecoregion boundaries used in Volume E (Thompson and others, 2007) remain unchanged. Also, as we and others used the data in Atlas Volumes A through E, we came to realize that the plant distribution and climate data for areas south of the US-Mexico border were not of sufficient quality or resolution for our needs and these data are not included in this data release.

    The data in this data release are provided in comma-separated values (.csv) files. We also provide netCDF (.nc) files containing the climate and bioclimatic data, grouped taxa and species presence-absence data, and ecoregion assignment data for each grid point (but not the country, state, province, and county assignment data for each grid point, which are available in the .csv files). The netCDF files contain updated Albers conical equal-area projection details and more precise grid-point locations. When the original approximately 25-km equal-area grid was created (ca. 1990), it was designed to be registered with existing data sets, and only 3 decimal places were recorded for the grid-point latitude and longitude values (these original 3-decimal place latitude and longitude values are in the .csv files). In addition, the Albers conical equal-area projection used for the grid was modified to match projection irregularities of the U.S. Forest Service atlases (e.g., Little, 1971, 1976, 1977) from which plant taxa distribution data were digitized. For the netCDF files, we have updated the Albers conical equal-area projection parameters and recalculated the grid-point latitudes and longitudes to 6 decimal places.

    The additional precision in the location data produces maximum differences between the 6-decimal place and the original 3-decimal place values of up to 0.00266 degrees longitude (approximately 143.8 m along the projection x-axis of the grid) and up to 0.00123 degrees latitude (approximately 84.2 m along the projection y-axis of the grid). The maximum straight-line distance between a three-decimal-place and six-decimal-place grid-point location is 144.2 m. Note that we have not regridded the elevation, climate, grouped taxa and species presence-absence data, or ecoregion data to the locations defined by the new 6-decimal place latitude and longitude data. For example, the climate data described in the Atlas publications were interpolated to the grid-point locations defined by the original 3-decimal place latitude and longitude values. Interpolating the data to the 6-decimal place latitude and longitude values would in many cases not result in changes to the reported values and for other grid points the changes would be small and insignificant. Similarly, if the digitized Little (1971, 1976, 1977) taxa distribution maps were regridded using the 6-decimal place latitude and longitude values, the changes to the gridded distributions would be minor, with a small number of grid points along the edge of a taxa's digitized distribution potentially changing value from taxa "present" to taxa "absent" (or vice versa). These changes should be considered within the spatial margin of error for the taxa distributions, which are based on hand-drawn maps with the distributions evidently generalized, or represented by a small, filled circle, and these distributions were subsequently hand digitized. Users wanting to use data that exactly match the data in the Atlas volumes should use the 3-decimal place latitude and longitude data provided in the .csv files in this data release to represent the center point of each grid cell. Users for whom an offset of up to 144.2 m from the original grid-point location is acceptable (e.g., users investigating continental-scale questions) or who want to easily visualize the data may want to use the data associated with the 6-decimal place latitude and longitude values in the netCDF files.

    The variable names in the netCDF files generally match those in the data release .csv files, except where the .csv file variable name contains a forward slash, colon, period, or comma (i.e., "/", ":", ".", or ","). In the netCDF file variable short names, the forward slashes are replaced with an underscore symbol (i.e., "_") and the colons, periods, and commas are deleted. In the netCDF file variable long names, the punctuation in the name matches that in the .csv file variable names. The "country", "state, province, or territory", and "county" data in the .csv files are not included in the netCDF files.

    Data included in this release:

    • Geographic scope. The gridded data cover an area that we labelled as "CANUSA", which includes Canada and the USA (excluding Hawaii, Puerto Rico, and other oceanic islands). Note that the maps displayed in the Atlas volumes are cropped at their northern edge and do not display the full northern extent of the data included in this data release.
    • Elevation. The elevation data were regridded from the ETOPO5 data set (National Geophysical Data Center, 1993). There were 35 coastal grid points in our CANUSA study area grid for which the regridded elevations were below sea level and these grid points were assigned missing elevation values (i.e., elevation = 9999). The grid points with missing elevation values occur in five coastal areas: (1) near San Diego (California, USA; 1 grid point), (2) Vancouver Island (British Columbia, Canada) and the Olympic Peninsula (Washington, USA; 2 grid points), (3) the Haida Gwaii (formerly Queen Charlotte Islands, British Columbia, Canada) and southeast Alaska (USA, 9 grid points), (4) the Canadian Arctic Archipelago (22 grid points), and (5) Newfoundland (Canada; 1 grid point).
    • Climate. The gridded climatic data provided here are based on the 1961-1990 30-year mean values from the University of East Anglia (UK) Climatic Research Unit (CRU) CL 2.0 dataset (New and others, 2002), and include annual and monthly temperature and precipitation. The CRU CL 2.0 data were interpolated onto the approximately 25-km grid using geographically-weighted regression, incorporating local lapse-rate estimation and correction. Additional bioclimatic variables (growing degree days on a 5 degrees Celsius base, mean temperatures of the coldest and warmest months, and a moisture index calculated as actual evapotranspiration divided by potential evapotranspiration) were calculated using the interpolated CRU CL 2.0 data. Also included are absolute minimum and maximum temperatures for 1951-1980 interpolated in a similar fashion from climate-station data (WeatherDisc Associates, 1989). These climate and bioclimate data were used in Atlas volumes F and G (see Thompson and others, 2015, for a description of the methods used to create the gridded climate data). Note that for grid points with missing elevation values (i.e., elevation values equal to 9999), climate data were created using an elevation value of -120 meters. Users may want to exclude these climate data from their analyses (see the Usage Notes section in the data release readme file).
    • Plant distributions. The gridded plant distribution data align with Atlas volume G (Thompson and others, 2015). Plant distribution data on the grid include 690 species, as well as 67 groups of related species and genera, and are based on U.S. Forest Service atlases (e.g., Little, 1971, 1976, 1977), regional atlases (e.g., Benson and Darrow, 1981), and new maps based on information available from herbaria and other online and published sources (for a list of sources, see Tables 3 and 4 in Thompson and others, 2015). See the "Notes" column in Table 1 (https://pubs.usgs.gov/pp/p1650-g/table1.html) and Table 2 (https://pubs.usgs.gov/pp/p1650-g/table2.html) in Thompson and others (2015) for important details regarding the species and grouped taxa distributions.
    • Ecoregions. The ecoregion gridded data are the same as in Atlas volumes D and E (Thompson and others, 2006, 2007), and include three different systems, Bailey's ecoregions (Bailey, 1997, 1998), WWF's ecoregions (Ricketts and others, 1999), and Kuchler's potential natural vegetation regions (Kuchler, 1985), that are each based on distinctive approaches to categorizing ecoregions. For the Bailey and WWF ecoregions for North America and the Kuchler potential natural vegetation regions for the contiguous United States (i.e.,
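
    The netCDF short-name rule described above (forward slashes become underscores; colons, periods, and commas are dropped) is mechanical and easy to reproduce. A minimal sketch; the example variable name is invented:

```python
def netcdf_short_name(csv_name: str) -> str:
    """Convert a .csv variable name to the netCDF short-name form described above:
    '/' becomes '_', and ':', '.', ',' are deleted."""
    out = csv_name.replace("/", "_")
    for ch in ":.,":
        out = out.replace(ch, "")
    return out

# Hypothetical variable name, for illustration only
print(netcdf_short_name("Temp., Jan./Feb.: mean"))  # -> "Temp Jan_Feb mean"
```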

  4. Bumble, Match, Tinder Dating App Data | Consumer Transaction Data | US, EU,...

    • datarade.ai
    .json, .xml, .csv
    Updated Jun 26, 2024
    Cite
    Measurable AI (2024). Bumble, Match, Tinder Dating App Data | Consumer Transaction Data | US, EU, Asia, EMEA, LATAM, MENA, India | Granular & Aggregate Data available [Dataset]. https://datarade.ai/data-products/bumble-match-tinder-dating-app-data-consumer-transaction-measurable-ai
    Available download formats: .json, .xml, .csv
    Dataset updated
    Jun 26, 2024
    Dataset authored and provided by
    Measurable AI
    Area covered
    United States
    Description

    The Measurable AI Dating App Consumer Transaction Dataset is a leading source of in-app purchase data, offering data collected directly from users via proprietary consumer apps with millions of opt-in users.

    We source our in-app and email receipt consumer data panel via two consumer apps which garner the express consent of our end-users (GDPR compliant). We then aggregate and anonymize all the transactional data to produce raw and aggregate datasets for our clients.

    Use Cases

    Our clients leverage our datasets to produce actionable consumer insights such as:

    • Market share analysis
    • User behavioral traits (e.g. retention rates)
    • Average order values
    • User overlap between competitors
    • Promotional strategies used by the key players

    Several of our clients also use our datasets for forecasting and understanding industry trends better.

    Coverage

    • Asia
    • EMEA (Spain, United Arab Emirates)
    • USA
    • Europe

    Granular Data

    Itemized, high-definition data per transaction level with metrics such as:

    • Order value
    • Features/subscription plans purchased
    • No. of orders per user
    • Promotions used
    • Geolocation data and more

    Aggregate Data

    • Weekly/monthly order volume
    • Revenue delivered in aggregate form, with historical data dating back to 2018

    All the transactional e-receipts are sent from app to users’ registered accounts.

    Most of our clients are fast-growing Tech Companies, Financial Institutions, Buyside Firms, Market Research Agencies, Consultancies and Academia.

    Our dataset is GDPR compliant, contains no PII, and is aggregated & anonymized with user consent. Contact michelle@measurable.ai for a data dictionary and to find out our volume in each country.

  5. World Soccer live data feed

    • kaggle.com
    Updated Jan 28, 2019
    Cite
    Mohammad Ghahramani (2019). World Soccer live data feed [Dataset]. https://www.kaggle.com/datasets/analystmasters/world-soccer-live-data-feed/discussion
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 28, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Mohammad Ghahramani
    Description

    Context

    This is the first live data stream on Kaggle providing a simple yet rich source of all soccer matches around the world 24/7 in real-time.

    What makes it unique compared to other datasets?

    • It is the first live data feed on Kaggle and it is totally free
    • Unlike “Churn rate” datasets, you do not have to wait months to evaluate your predictions; simply check the match’s outcome in a couple of hours
    • You can use your predictions/analysis for your own benefit instead of spending your time and resources on helping a company maximize its profit
    • A five-year-old laptop can do the calculations, and you do not need high-end GPUs
    • Couldn’t make it to the top 3 submissions? Never mind, you still have the chance to get your prize on your own
    • Can’t get accurate results on all samples? Do not worry, just filter out the hard ones (e.g. ignore international friendlies) and simply choose the ones you are sure of.
    • Need help from human experts for each sample? Every sample comes with at least two opinions from experts
    • You wish you could add your complementary data? Just contact us and we will try to facilitate it.
    • Couldn’t win “Warren Buffett's 2018 March Madness Bracket Contest”? Here is your chance to make your accumulative profit.

    Simply train your algorithm on the first version of the training dataset of approximately 11.5k matches and predict the data provided in the following data feed.

    Fetch the data stream

    The CSV file is updated every 30 minutes, at minutes 20’ and 50’ of every hour. Please do not download it more than twice per hour, as each download incurs additional cost.

    You may download the csv data file from the following link from Amazon S3 server by changing the FOLDER_NAME as below,

    https://s3.amazonaws.com/FOLDER_NAME/amasters.csv

    *. Substitute the FOLDER_NAME with "analyst-masters"
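
    A minimal sketch of polling the feed within the requested limit of two downloads per hour (the URL follows from the substitution above):

```python
import time
import urllib.request

URL = "https://s3.amazonaws.com/analyst-masters/amasters.csv"

def fetch_feed(path="amasters.csv"):
    """Download the latest CSV snapshot once."""
    urllib.request.urlretrieve(URL, path)
    return path

# The file is refreshed around minutes 20' and 50' of every hour, so polling
# every 30 minutes stays within the requested two-downloads-per-hour limit.
while True:
    fetch_feed()
    time.sleep(30 * 60)
```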

    Content

    Our goal is to identify the outcome of a match as Home, Draw or Away. The variety of sources and the nature of the information provided in this data stream make it a unique database. Currently, FIVE servers are collecting data from soccer matches around the world, communicating with each other and finally aggregating the data based on the dominant features learned from 400,000 matches over 7 years. I describe every column and the data collection below in two categories, Category I – Current situation and Category II – Head-to-Head History. Hence, we divide the type of data we have for each team into 4 modes:

    • Mode 1: we have both Category I and Category II available
    • Mode 2: we only have Category I available
    • Mode 3: we only have Category II available
    • Mode 4: none of Category I and II are available

    Below you can find a full illustration of each category.

    I. Current situation

    Col 1 to 3:

    Votes_for_Home Votes_for_Draw Votes_for_Away
    

    The most distinctive parts of the database are these 3 columns. We are releasing the opinions of over 100 professional soccer analysts predicting the outcome of a match. Their votes are the result of every piece of information they receive on players, team line-ups, injuries and the urge of a team to win a match to stay in the league. They are spread around the world in various time zones and are experts on soccer teams from various regions. Our servers aggregate their opinions to update the CSV file until kickoff. Therefore, even if 40 users predict that Real Madrid will win against Real Sociedad in Santiago Bernabeu on January 6th, 2019 but 5 users predict that Real Sociedad (the away team) will be the winner, you should doubt the home win. Here, the “majority of votes” works in conjunction with other features.
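
    One simple way to fold these three columns into a model is to turn the raw counts into vote shares. A sketch, assuming the CSV header row carries the column names listed above:

```python
import pandas as pd

df = pd.read_csv("amasters.csv")

votes = df[["Votes_for_Home", "Votes_for_Draw", "Votes_for_Away"]]
vote_share = votes.div(votes.sum(axis=1), axis=0)   # fraction of experts per outcome

# As in the example above, a large majority for the home side can still leave room
# for doubt, so the vote share is best treated as one feature among many.
df["expert_favourite"] = vote_share.idxmax(axis=1).str.replace("Votes_for_", "")
print(df["expert_favourite"].value_counts())
```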

    Col 4 to 9:

    Weekday Day Month  Year  Hour  Minute
    

    There are over 60,000 matches during a year, and approximately 400 are usually held per day on weekends. More critical and exciting matches, which are usually less predictable, are held toward the evening in Europe. We are currently providing time in Central European Time (CET), equivalent to GMT+01:00.

    *. Please note that the 2nd row of the CSV file represents the time at which data values were saved from all servers to the file.

    Col 10 to 13:

    Total_Bettors   Bet_Perc_on_Home    Bet_Perc_on_Draw   Bet_Perc_on_Away
    

    These data are recorded a few hours before the match, as people place bets emotionally when kickoff approaches. The overall number of people placing bets is given in “Total_Bettors”, and the percentage of those bettors is indicated in each column for the “Home,” “Draw” and “Away” outcomes.

    Col 14 to 15:

    Team_1 Team_2   
    

    The team playing “Home” is “Team_1” and the opponent playing “Away” is “Team_2”.

    Col 16 to 36:

    League_Rank_1  League_Rank_2  Total_teams     Points_1  Points_2  Max_points Min_points Won_1  Draw_1 Lost_1 Won_2  Draw_2 Lost_2 Goals_Scored_1 Goals_Scored_2 Goals_Rec_1 Goal_Rec_2 Goals_Diff_1  Goals_Diff_2
    

    If the match is betw...

  6. Replication Data for: Matching Methods for Causal Inference with Time-Series...

    • dataverse.harvard.edu
    Updated Oct 13, 2021
    Cite
    Kosuke Imai; In Song Kim; Erik Wang (2021). Replication Data for: Matching Methods for Causal Inference with Time-Series Cross-Section Data [Dataset]. http://doi.org/10.7910/DVN/ZTDHVE
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 13, 2021
    Dataset provided by
    Harvard Dataverse
    Authors
    Kosuke Imai; In Song Kim; Erik Wang
    License

    https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7910/DVN/ZTDHVE

    Description

    Matching methods improve the validity of causal inference by reducing model dependence and offering intuitive diagnostics. While they have become a part of the standard tool kit across disciplines, matching methods are rarely used when analyzing time-series cross-sectional data. We fill this methodological gap. In the proposed approach, we first match each treated observation with control observations from other units in the same time period that have an identical treatment history up to the pre-specified number of lags. We use standard matching and weighting methods to further refine this matched set so that the treated and matched control observations have similar covariate values. Assessing the quality of matches is done by examining covariate balance. Finally, we estimate both short-term and long-term average treatment effects using the difference-in-differences estimator, accounting for a time trend. We illustrate the proposed methodology through simulation and empirical studies. An open-source software package is available for implementing the proposed methods.
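
    A heavily simplified sketch of the first step described above (collecting control units whose treatment history over the last L lags matches that of a treated observation). The replication materials use the authors' own open-source package; the toy panel and column names below are assumptions made for illustration only:

```python
import pandas as pd

L = 3  # number of lags of treatment history to match on

def treatment_history(df, unit, t, lags):
    """Treatment history of `unit` over periods t-lags .. t-1, or None if incomplete."""
    hist = df[(df.unit == unit) & (df.time.between(t - lags, t - 1))].sort_values("time")
    return tuple(hist.treat) if len(hist) == lags else None

def matched_set(df, treated_unit, t, lags=L):
    """Control units in period t whose treatment history matches the treated unit's.
    (Simplified: refinement by covariate matching/weighting is omitted.)"""
    target = treatment_history(df, treated_unit, t, lags)
    controls = df[(df.time == t) & (df.treat == 0) & (df.unit != treated_unit)]
    return [u for u in controls.unit
            if treatment_history(df, u, t, lags) == target]

# Hypothetical panel: columns unit, time, treat (0/1), outcome
panel = pd.DataFrame({
    "unit": ["A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C"],
    "time": [1, 2, 3, 4] * 3,
    "treat": [0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1],
    "outcome": [1.0, 1.2, 1.1, 2.0, 0.9, 1.0, 1.1, 1.2, 1.5, 2.2, 2.4, 2.6],
})
print(matched_set(panel, treated_unit="A", t=4))  # -> ['B']: same (0, 0, 0) history, untreated at t=4
```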

  7. [Superseded] Intellectual Property Government Open Data 2019

    • data.gov.au
    • researchdata.edu.au
    csv-geo-au, pdf
    Updated Jan 26, 2022
    Cite
    IP Australia (2022). [Superseded] Intellectual Property Government Open Data 2019 [Dataset]. https://data.gov.au/data/dataset/activity/intellectual-property-government-open-data-2019
    Available download formats: csv-geo-au(59281977), csv-geo-au(680030), csv-geo-au(39873883), csv-geo-au(37247273), csv-geo-au(25433945), csv-geo-au(92768371), pdf(702054), csv-geo-au(208449), csv-geo-au(166844), csv-geo-au(517357734), csv-geo-au(32100526), csv-geo-au(33981694), csv-geo-au(21315), csv-geo-au(6828919), csv-geo-au(86824299), csv-geo-au(359763), csv-geo-au(567412), csv-geo-au(153175), csv-geo-au(165051861), csv-geo-au(115749297), csv-geo-au(79743393), csv-geo-au(55504675), csv-geo-au(221026), csv-geo-au(50760305), csv-geo-au(2867571), csv-geo-au(212907250), csv-geo-au(4352457), csv-geo-au(4843670), csv-geo-au(1032589), csv-geo-au(1163830), csv-geo-au(278689420), csv-geo-au(28585330), csv-geo-au(130674), csv-geo-au(13968748), csv-geo-au(11926959), csv-geo-au(4802733), csv-geo-au(243729054), csv-geo-au(64511181), csv-geo-au(592774239), csv-geo-au(149948862)
    Dataset updated
    Jan 26, 2022
    Dataset authored and provided by
    IP Australia (http://ipaustralia.gov.au/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    What is IPGOD?

    The Intellectual Property Government Open Data (IPGOD) includes over 100 years of registry data on all intellectual property (IP) rights administered by IP Australia. It also has derived information about the applicants who filed these IP rights, to allow for research and analysis at the regional, business and individual level. This is the 2019 release of IPGOD.

    How do I use IPGOD?

    IPGOD is large, with millions of data points across up to 40 tables, making it too large to open in Microsoft Excel. Furthermore, analysis often requires information from separate tables, which would need specialised software for merging. We recommend that advanced users interact with the IPGOD data using the right tools, with enough memory and compute power. This includes a wide range of programming and statistical software such as Tableau, Power BI, Stata, SAS, R, Python, and Scala.

    IP Data Platform

    IP Australia is also providing free trials of a cloud-based analytics platform, the IP Data Platform, with the capability to work with large intellectual property datasets, such as the IPGOD, through the web browser, without any installation of software.

    References

    The following pages can help you gain an understanding of intellectual property administration and processes in Australia to support your analysis of the dataset.

    Updates

    Tables and columns

    Due to the changes in our systems, some tables have been affected.

    • We have added IPGOD 225 and IPGOD 325 to the dataset!
    • The IPGOD 206 table is not available this year.
    • Many tables have been re-built, and as a result may have different columns or different possible values. Please check the data dictionary for each table before use.

    Data quality improvements

    Data quality has been improved across all tables.

    • Null values are simply empty rather than '31/12/9999'.
    • All date columns are now in ISO format 'yyyy-mm-dd'.
    • All indicator columns have been converted to Boolean data type (True/False) rather than Yes/No, Y/N, or 1/0.
    • All tables are encoded in UTF-8.
    • All tables use the backslash \ as the escape character.
    • The applicant name cleaning and matching algorithms have been updated. We believe that this year's method improves the accuracy of the matches. Please note that the "ipa_id" generated in IPGOD 2019 will not match with those in previous releases of IPGOD.
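
    A sketch of how one of the tables could be read with the conventions listed above; the file name and the parsed date column are placeholders, so check the relevant data dictionary first:

```python
import pandas as pd

# Placeholder file and column names -- consult the data dictionary for each IPGOD table.
df = pd.read_csv(
    "ipgod_table.csv",
    encoding="utf-8",                  # all tables are encoded in UTF-8
    escapechar="\\",                   # backslash is the escape character
    keep_default_na=True,              # null values are simply empty cells
    parse_dates=["application_date"],  # ISO 'yyyy-mm-dd' dates parse directly
)

# Indicator columns are already Boolean (True/False) rather than Y/N or 1/0.
print(df.dtypes)
```
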
  8. Dataset of knee joint contact force peaks and corresponding subject...

    • data.niaid.nih.gov
    • explore.openaire.eu
    Updated Oct 9, 2023
    Cite
    Stenroth, Lauri (2023). Dataset of knee joint contact force peaks and corresponding subject characteristics from 4 open datasets [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7253457
    Dataset updated
    Oct 9, 2023
    Dataset provided by
    Lavikainen, Jere Joonatan
    Stenroth, Lauri
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains data from overground walking trials of 166 subjects with several trials per subject (approximately 2900 trials total).

    DATA ORIGINS & LICENSE INFORMATION

    The data comes from four existing open datasets collected by others:

    Schreiber & Moissenet, A multimodal dataset of human gait at different walking speeds established on injury-free adult participants

    article: https://www.nature.com/articles/s41597-019-0124-4

    dataset: https://figshare.com/articles/dataset/A_multimodal_dataset_of_human_gait_at_different_walking_speeds/7734767

    Fukuchi et al., A public dataset of overground and treadmill walking kinematics and kinetics in healthy individuals

    article: https://peerj.com/articles/4640/

    dataset: https://figshare.com/articles/dataset/A_public_data_set_of_overground_and_treadmill_walking_kinematics_and_kinetics_of_healthy_individuals/5722711

    Horst et al., A public dataset of overground walking kinetics and full-body kinematics in healthy adult individuals

    article: https://www.nature.com/articles/s41598-019-38748-8

    dataset: https://data.mendeley.com/datasets/svx74xcrjr/3

    Camargo et al., A comprehensive, open-source dataset of lower limb biomechanics in multiple conditions of stairs, ramps, and level-ground ambulation and transitions

    article: https://www.sciencedirect.com/science/article/pii/S0021929021001007

    dataset (3 links): https://data.mendeley.com/datasets/fcgm3chfff/1 https://data.mendeley.com/datasets/k9kvm5tn3f/1 https://data.mendeley.com/datasets/jj3r5f9pnf/1

    In this dataset, those datasets are referred to as the Schreiber, Fukuchi, Horst, and Camargo datasets, respectively. The Schreiber, Fukuchi, Horst, and Camargo datasets are licensed under the CC BY 4.0 license (https://creativecommons.org/licenses/by/4.0/).

    We have modified the datasets by analyzing the data with musculoskeletal simulations & analysis software (OpenSim). In this dataset, we publish modified data as well as some of the original data.

    STRUCTURE OF THE DATASET

    The dataset contains two kinds of text files: those starting with "predictors_" and those starting with "response_".

    Predictors comprise 12 text files, each describing the input (predictor) variables we used to train artificial neural networks to predict knee joint loading peaks. Responses similarly comprise 12 text files, each describing the response (outcome) variables that we trained and evaluated the network on. The file names are of the form "predictors_X" for predictors and "response_X" for responses, where X describes which response (outcome) variable is predicted with them. X can be:

    • loading_response_both: the maximum of the first peak of stance for the sum of the loading of the medial and lateral compartments
    • loading_response_lateral: the maximum of the first peak of stance for the loading of the lateral compartment
    • loading_response_medial: the maximum of the first peak of stance for the loading of the medial compartment
    • terminal_extension_both: the maximum of the second peak of stance for the sum of the loading of the medial and lateral compartments
    • terminal_extension_lateral: the maximum of the second peak of stance for the loading of the lateral compartment
    • terminal_extension_medial: the maximum of the second peak of stance for the loading of the medial compartment
    • max_peak_both: the maximum of the entire stance phase for the sum of the loading of the medial and lateral compartments
    • max_peak_lateral: the maximum of the entire stance phase for the loading of the lateral compartment
    • max_peak_medial: the maximum of the entire stance phase for the loading of the medial compartment
    • MFR_common: the medial force ratio for the entire stance phase
    • MFR_LR: the medial force ratio for the first peak of stance
    • MFR_TE: the medial force ratio for the second peak of stance

    The predictor text files are organized as comma-separated values. Each row corresponds to one walking trial. A single subject typically has several trials. The column labels are DATASET_INDEX,SUBJECT_INDEX,KNEE_ADDUCTION,MASS,HEIGHT,BMI,WALKING_SPEED,HEEL_STRIKE_VELOCITY,AGE,GENDER.

    DATASET_INDEX describes which original dataset the trial is from, where {1=Schreiber, 2=Fukuchi, 3=Horst, 4=Camargo}

    SUBJECT_INDEX is the index of the subject in the original dataset. If you use this column, you will have to rewrite these to avoid duplicates (e.g., several datasets probably have subject "3").

    KNEE_ADDUCTION is the knee adduction-abduction angle (positive for adduction, negative for abduction) of the subject in static pose, estimated from motion capture markers.

    MASS is the mass of the subject in kilograms

    HEIGHT is the height of the subject in millimeters

    BMI is the body mass index of the subject

    WALKING_SPEED is the mean walking speed of the subject during the trial

    HEEL_STRIKE_VELOCITY is the mean of the velocities of the subject's pelvis markers at the instant of heel strike

    AGE is the age of the subject in years

    GENDER is an integer/boolean where {1=male, 0=female}

    The response text files contain one floating-point value per row, describing the knee joint contact force peak for the trial in newtons (or the medial force ratio). Each row corresponds to one walking trial. The rows in predictor and response text files match each other (e.g., row 7 describes the same trial in both predictors_max_peak_medial.txt and response_max_peak_medial.txt).
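
    Since row i of a predictor file and row i of the matching response file describe the same trial, the two can be joined positionally. A minimal sketch, assuming the predictor files carry the header row listed above and the response files have none:

```python
import pandas as pd

X = pd.read_csv("predictors_max_peak_medial.txt")   # comma-separated, labelled columns
# "KCF_peak_N" is a placeholder label for the single unnamed response column.
y = pd.read_csv("response_max_peak_medial.txt", header=None, names=["KCF_peak_N"])

data = pd.concat([X, y], axis=1)   # row i of both files describes the same trial
data["GENDER"] = data["GENDER"].map({1: "male", 0: "female"})
print(data[["DATASET_INDEX", "SUBJECT_INDEX", "WALKING_SPEED", "KCF_peak_N"]].head())
```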

    See our journal article "Prediction of Knee Joint Compartmental Loading Maxima Utilizing Simple Subject Characteristics and Neural Networks" (https://doi.org/10.1007/s10439-023-03278-y) for more information.

    Questions & other contacts: jere.lavikainen@uef.fi

  9. Adult Arrests

    • catalog.data.gov
    • datasets.ai
    Updated Mar 11, 2025
    Cite
    City of Washington, DC (2025). Adult Arrests [Dataset]. https://catalog.data.gov/dataset/adult-arrests-28903
    Dataset updated
    Mar 11, 2025
    Dataset provided by
    City of Washington, DC
    Description

    The Metropolitan Police Department collects race and ethnicity data according to the United States Census Bureau standards (https://www.census.gov/topics/population/race/about.html). Hispanic, which was previously categorized under the Race field prior to August 2015, is now captured under Ethnicity. All records prior to August 2015 have been updated to “Unknown (Race), Hispanic (Ethnicity)”. Race, ethnicity and gender data are based on officer observation, which may or may not be accurate.

    MPD cannot release exact addresses to the general public unless proof of ownership or subpoena is submitted. The GeoX and GeoY values represent the block location (approximately 232 ft. radius) as of the date of the arrest and offense. Arrest and offense addresses that could not be geocoded are included as an “unknown” value.

    Arrestee age is calculated based on the number of days between the self-reported or verified date of birth (DOB) of the arrestee and the date of the arrest; DOB data may not be accurate if self-reported, and an arrestee may refuse to provide his or her date of birth. Due to the sensitive nature of juvenile data and to protect the arrestee’s confidentiality, any arrest records for defendants under the age of 18 or with missing age are excluded in this dataset.

    The Criminal Complaint Number (CCN) and arrest number have also been anonymized.

    This data may not match other arrest data requests that may have included all law enforcement agencies in the District or all arrest charges. Arrest totals are subject to change and may be different than MPD Annual Report totals or other publications due to inclusion of juvenile arrest summary, expungements, investigation updates, data quality audits, etc.

  10. BLM Natl WesternUS GRSG Sagebrush Focal Areas

    • s.cnmilf.com
    • catalog.data.gov
    Updated Nov 20, 2024
    Cite
    Bureau of Land Management (2024). BLM Natl WesternUS GRSG Sagebrush Focal Areas [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/blm-natl-westernus-grsg-sagebrush-focal-areas
    Dataset updated
    Nov 20, 2024
    Dataset provided by
    Bureau of Land Management (http://www.blm.gov/)
    Description

    This dataset is a modified version of the FWS developed data depicting “Highly Important Landscapes”, as outlined in Memorandum FWS/AES/058711 and provided to the Wildlife Habitat Spatial Analysis Lab on October 29th 2014. Other names and acronyms used to refer to this dataset have included: Areas of Significance (AoSs - name of GIS data set provided by FWS), Strongholds (FWS), and Sagebrush Focal Areas (SFAs - BLM). The BLM will refer to these data as Sagebrush Focal Areas (SFAs). Data were provided as a series of ArcGIS map packages which, when extracted, contained several datasets each. Based on the recommendation of the FWS Geographer/Ecologist (email communication, see data originator for contact information) the dataset called “Outiline_AreasofSignificance” was utilized as the source for subsequent analysis and refinement. Metadata was not provided by the FWS for this dataset. For detailed information regarding the dataset’s creation refer to Memorandum FWS/AES/058711 or contact the FWS directly. Several operations and modifications were made to this source data, as outlined in the “Description” and “Process Step” sections of this metadata file.

    Generally: The source data was named by the Wildlife Habitat Spatial Analysis Lab to identify polygons as described (but not identified in the GIS) in the FWS memorandum. The Nevada/California EIS modified portions within their decision space in concert with local FWS personnel and provided the modified data back to the Wildlife Habitat Spatial Analysis Lab. Gaps around Nevada State borders, introduced by the NVCA edits, were then closed, as was a large gap between the southern Idaho & southeast Oregon present in the original dataset. Features with an area below 40 acres were then identified and, based on FWS guidance, either removed or retained. Finally, guidance from BLM WO resulted in the removal of additional areas, primarily non-habitat with BLM surface or subsurface management authority. Data were then provided to each EIS for use in FEIS development. Based on guidance from WO, SFAs were to be limited to BLM decision space (surface/sub-surface management areas) within PHMA. Each EIS was asked to provide the limited SFA dataset back to the National Operations Center to ensure consistent representation and analysis. Returned SFA data, modified by each individual EIS, was then consolidated at the BLM’s National Operations Center retaining the three standardized fields contained in this dataset.

    Several modifications from the original FWS dataset have been made. Below is a summary of each modification.

    1. The data as received from FWS: 16,514,163 acres & 1 record.
    2. Edited to name SFAs by the Wildlife Habitat Spatial Analysis Lab: Upon receipt of the “Outiline_AreasofSignificance” dataset from the FWS, a copy was made and the one existing & unnamed record was exploded in an edit session within ArcMap. A text field, “AoS_Name”, was added. Using the maps provided with Memorandum FWS/AES/058711, polygons were manually selected and the “AoS_Name” field was calculated to match the names as illustrated. Once all polygons in the exploded dataset were appropriately named, the dataset was dissolved, resulting in one record representing each of the seven SFAs identified in the memorandum.
    3. The NVCA EIS made modifications in concert with local FWS staff. Metadata and detailed change descriptions were not returned with the modified data. Contact Leisa Wesch, GIS Specialist, BLM Nevada State Office, 775-861-6421, lwesch@blm.gov, for details.
    4. Once the data was returned to the Wildlife Habitat Spatial Analysis Lab from the NVCA EIS, gaps surrounding the State of NV were closed. These gaps were introduced by the NVCA edits, exacerbated by them, or existed in the data as provided by the FWS. The gap closing was performed in an edit session by either extending each polygon towards each other or by creating a new polygon, which covered the gap, and merging it with the existing features. In addition to the gaps around state boundaries, a large area between the S. Idaho and S.E. Oregon SFAs was filled in. To accomplish this, ADPP habitat (current as of January 2015) and BLM GSSP SMA data were used to create a new polygon representing PHMA and BLM management that connected the two existing SFAs.
    5. In an effort to simplify the FWS dataset, features whose areas were less than 40 acres were identified and FWS was consulted for guidance on possible removal. To do so, features from #4 above were exploded once again in an ArcMap edit session. Features whose areas were less than forty acres were selected and exported (770 total features). This dataset was provided to the FWS and then returned with specific guidance on inclusion/exclusion via email by Lara Juliusson (lara_juliusson@fws.gov). The specific guidance was:
       a. Remove all features whose area is less than 10 acres.
       b. Remove features identified as slivers (the thinness ratio was calculated and slivers identified by Lara Juliusson according to https://tereshenkov.wordpress.com/2014/04/08/fighting-sliver-polygons-in-arcgis-thinness-ratio/) and whose area was less than 20 acres.
       c. Remove features with areas less than 20 acres NOT identified as slivers and NOT adjacent to other features.
       d. Keep the remainder of features identified as less than 40 acres.
       To accomplish “a” and “b” above, a simple selection was applied to the dataset representing features less than 40 acres. The Select By Location tool was used, set to select identical, to select these features from the dataset created in step 4 above. The records count was confirmed as matching between the two data sets and then these features were deleted. To accomplish “c” above, a field (“AdjacentSH”, added by FWS but not calculated) was calculated to identify features touching or intersecting other features. A series of selections was used: first to select records
    6. Based on direction from the BLM Washington Office, the portion of the Upper Missouri River Breaks National Monument (UMRBNM) that was included in the FWS SFA dataset was removed. The BLM NOC GSSP NLCS dataset was used to erase these areas from #5 above. Resulting sliver polygons were also removed and geometry was repaired.
    7. In addition to removing UMRBNM, the BLM Washington Office also directed the removal of Non-ADPP habitat within the SFAs, on BLM managed lands, falling outside of Designated Wilderness’ & Wilderness Study Areas. An exception was the retention of the Donkey Hills ACEC and adjacent BLM lands. The BLM NOC GSSP NLCS datasets were used in conjunction with a dataset containing all ADPP habitat, BLM SMA and BLM sub-surface management unioned into one file to identify and delete these areas.
    8. The resulting dataset, after steps 2 – 8 above were completed, was dissolved to the SFA name field, yielding this feature class with one record per SFA area.
    9. Data were provided to each EIS for use in FEIS allocation decision data development.
    10. Data were subset to BLM decision space (surface/sub-surface) within PHMA by each EIS and returned to the NOC.
    11. Due to variations in field names and values, three standardized fields were created and calculated by the NOC:
       a. SFA Name – the name of the SFA.
       b. Subsurface – binary “Yes” or “No” to indicate federal subsurface estate.
       c. SMA – represents BLM, USFS, other federal and non-federal surface management.
    12. The consolidated data (with standardized field names and values) were dissolved on the three fields illustrated above and geometry was repaired, resulting in this dataset.
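
    The sliver screening in step 5 relies on a thinness ratio (4π·area/perimeter², per the reference linked above). Purely as an illustration of that filter, a sketch with shapely follows; the 0.3 sliver threshold is an assumption for demonstration, not a value from the FWS guidance:

```python
import math
from shapely.geometry import Polygon

def thinness_ratio(poly: Polygon) -> float:
    """4*pi*area / perimeter**2: 1.0 for a circle, near 0 for a sliver."""
    return 4 * math.pi * poly.area / (poly.length ** 2)

def keep_feature(poly: Polygon, acres: float, is_adjacent: bool,
                 sliver_threshold: float = 0.3) -> bool:
    """Apply the removal rules a-d quoted above; the threshold value is an assumption."""
    if acres < 10:
        return False                                   # rule a
    sliver = thinness_ratio(poly) < sliver_threshold
    if sliver and acres < 20:
        return False                                   # rule b
    if not sliver and acres < 20 and not is_adjacent:
        return False                                   # rule c
    return True                                        # rule d

square = Polygon([(0, 0), (100, 0), (100, 100), (0, 100)])
print(round(thinness_ratio(square), 3))   # ~0.785 for a square
```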

  11. Global monthly catch of tuna, tuna-like and shark species (1950-2023) by 1°...

    • data.europa.eu
    unknown
    Updated May 16, 2025
    Cite
    Zenodo (2025). Global monthly catch of tuna, tuna-like and shark species (1950-2023) by 1° or 5° squares (IRD level 2) - and efforts level 0 (1950-2023) [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-15405414?locale=es
    Available download formats: unknown(2677816)
    Dataset updated
    May 16, 2025
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Major differences from v1: For level 2 catch: Catches and number raised to nominal are only raised to exactly matching stratas or if not existing, to a strata corresponding with UNK/NEI or 99.9. (new feature in v4) When nominal strata lack specific dimensions (e.g., fishing_mode always UNK) but georeferenced strata include them, the nominal data are “upgraded” to match—preventing loss of detail. Currently this adjustment aligns nominal values to georeferenced totals; future versions may apply proportional scaling. This does not create a direct raising but rather allows more precise reallocation. (new feature in v4) IATTC Purse seine catch-and-effort are available in 3 separate files according to the group of species: tuna, billfishes, sharks. This is due to the fact that PS data is collected from 2 sources: observer and fishing vessel logbooks. Observer records are used when available, and for unobserved trips logbooks are used. Both sources collect tuna data but only observers collect shark and billfish data. As an example, a strata may have observer effort and the number of sets from the observed trips would be counted for tuna and shark and billfish. But there may have also been logbook data for unobserved sets in the same strata so the tuna catch and number of sets for a cell would be added. This would make a higher total number of sets for tuna catch than shark or billfish. Efforts in the billfish and shark datasets might hence represent only a proportion of the total effort allocated in some strata since it is the observed effort, i.e. for which there was an observer onboard. As a result, catch in the billfish and shark datasets might represent only a proportion of the total catch allocated in some strata. Hence, shark and billfish catch were raised to the fishing effort reported in the tuna dataset. (new feature in v4, was done in Firms Level 0 before) Data with resolution of 10degx10deg is removed, it is considered to disaggregate it in next versions. Catches in tons, raised to match nominal values, now consider the geographic area of the nominal data for improved accuracy. (as v3) Captures in "Number of fish" are converted to weight based on nominal data. The conversion factors used in the previous version are no longer used, as they did not adequately represent the diversity of captures. (as v3) Number of fish without corresponding data in nominal are not removed as they were before, creating a huge difference for this measurement_unit between the two datasets. (as v3) Strata for which catches in tons are raised to match nominal data have had their numbers removed. (as v3) Raising only applies to complete years to avoid overrepresenting specific months, particularly in the early years of georeferenced reporting. (as v3) Strata where georeferenced data exceed nominal data have not been adjusted downward, as it is unclear if these discrepancies arise from missing nominal data or different aggregation methods in both datasets. (as v3) The data is not aggregated to 5-degree squares and thus remains unharmonized spatially. Aggregation can be performed using CWP codes for geographic identifiers. For example, an R function is available: source("https://raw.githubusercontent.com/firms-gta/geoflow-tunaatlas/master/sardara_functions/transform_cwp_code_from_1deg_to_5deg.R") (as v3) This results in a raising of the data compared to v3 for IOTC, ICCAT, IATTC and WCPFC. However as the raising is more specific for CCSBT, the raising is of 22% less than in the previous version. 
Level 0 dataset has been modified, creating differences in this new version, notably:

• The species retained are different; only 32 major species are kept.
• Mappings have been somewhat modified based on new standards implemented by FIRMS.
• New rules have been applied for overlapping areas.
• Data are only displayed in 1-degree and 5-degree square areas.
• The data are enriched with "Species group" and "Gear labels" using the fdiwg standards.

These main differences are recapped in the Differences_v2018_v2024.zip file.

Recommendations: To avoid converting data from numbers using nominal strata, we recommend the use of conversion factors, which could be provided by tRFMOs. In some strata, nominal data appear higher than georeferenced data, as observed during level 2 processing. These discrepancies may result from errors or from differences in aggregation methods. Further analysis will examine these differences in detail to refine treatments accordingly. A summary of differences by tRFMO, based on the number of strata, is included in the appendix.

For level 0 effort: In some datasets, namely those from ICCAT and the purse seine (PS) data from WCPFC, the same effort data has been reported multiple times using different units; these have been kept as is, since no official mapping allows conversion between these units. As a result, users should be reminded that some ICCAT and WCPFC effort data are deliberately duplicated: in the case of ICCAT data, lines with identical strata but different effort units are duplicates reporting the same fishing activity with different measurement units. It is indeed not possible to infer strict equivalence between units, as some contain information about others (e.g., Hours.FAD and Hours.FSC may inform Hours.STD). In the case of WCPFC data, effort records were also kept in all originally reported units; here, duplicates do not necessarily share the same "fishing_mode", as SETS for purse seiners are reported with an explicit association to fishing_mode, while DAYS are not.

  12. Dataset: A Systematic Literature Review on the topic of High-value datasets

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 23, 2023
    Cite
    Anastasija Nikiforova (2023). Dataset: A Systematic Literature Review on the topic of High-value datasets [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7944424
    Explore at:
    Dataset updated
    Jun 23, 2023
    Dataset provided by
    Nina Rizun
    Magdalena Ciesielska
    Anastasija Nikiforova
    Charalampos Alexopoulos
    Andrea Miletič
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

This dataset contains data collected during a study ("Towards High-Value Datasets determination for data-driven development: a systematic literature review") conducted by Anastasija Nikiforova (University of Tartu), Nina Rizun, Magdalena Ciesielska (Gdańsk University of Technology), Charalampos Alexopoulos (University of the Aegean) and Andrea Miletič (University of Zagreb). It is being made public both to act as supplementary data for the "Towards High-Value Datasets determination for data-driven development: a systematic literature review" paper (a pre-print is available in Open Access here -> https://arxiv.org/abs/2305.10234) and to allow other researchers to use these data in their own work.

The protocol is intended for the Systematic Literature Review on the topic of High-value Datasets, with the aim of gathering information on how the topic of High-value datasets (HVD) and their determination has been reflected in the literature over the years and what has been found by these studies to date, incl. the indicators used in them, involved stakeholders, data-related aspects, and frameworks. The data in this dataset were collected as a result of the SLR over Scopus, Web of Science, and the Digital Government Research library (DGRL) in 2023.

    Methodology

To understand how HVD determination has been reflected in the literature over the years and what has been found by these studies to date, all relevant literature covering this topic has been studied. To this end, the SLR was carried out by searching the digital libraries covered by Scopus, Web of Science (WoS), and the Digital Government Research library (DGRL).

These databases were queried for the keywords ("open data" OR "open government data") AND ("high-value data*" OR "high value data*"), which were applied to the article title, keywords, and abstract to limit the number of papers to those where these objects were primary research objects rather than merely mentioned in the body, e.g., as future work. After deduplication, 11 articles were found to be unique and were further checked for relevance. As a result, a total of 9 articles were further examined. Each study was independently examined by at least two authors.

    To attain the objective of our study, we developed the protocol, where the information on each selected study was collected in four categories: (1) descriptive information, (2) approach- and research design- related information, (3) quality-related information, (4) HVD determination-related information.

Test procedure: Each study was independently examined by at least two authors; after an in-depth examination of the full text of the article, the structured protocol was filled in for each study. The structure of the survey is available in the supplementary files (see Protocol_HVD_SLR.odt, Protocol_HVD_SLR.docx). The data collected for each study by two researchers were then synthesized into one final version by a third researcher.

    Description of the data in this data set

Protocol_HVD_SLR provides the structure of the protocol. Spreadsheet #1 provides the filled protocol for relevant studies. Spreadsheet #2 provides the list of results after the search over the three indexing databases, i.e. before filtering out irrelevant studies.

    The information on each selected study was collected in four categories: (1) descriptive information, (2) approach- and research design- related information, (3) quality-related information, (4) HVD determination-related information

    Descriptive information
1) Article number - a study number, corresponding to the study number assigned in an Excel worksheet
2) Complete reference - the complete source information to refer to the study
3) Year of publication - the year in which the study was published
4) Journal article / conference paper / book chapter - the type of the paper - {journal article, conference paper, book chapter}
5) DOI / Website - a link to the website where the study can be found
6) Number of citations - the number of citations of the article in Google Scholar, Scopus, Web of Science
7) Availability in OA - availability of an article in the Open Access
8) Keywords - keywords of the paper as indicated by the authors
9) Relevance for this study - what is the relevance level of the article for this study? {high / medium / low}

Approach- and research design-related information
10) Objective / RQ - the research objective / aim, established research questions
11) Research method (including unit of analysis) - the methods used to collect data, including the unit of analysis (country, organisation, specific unit that has been analysed, e.g., the number of use-cases, scope of the SLR etc.)
12) Contributions - the contributions of the study
13) Method - whether the study uses a qualitative, quantitative, or mixed methods approach?
14) Availability of the underlying research data - whether there is a reference to the publicly available underlying research data, e.g., transcriptions of interviews, collected data, or an explanation why these data are not shared?
15) Period under investigation - period (or moment) in which the study was conducted
16) Use of theory / theoretical concepts / approaches - does the study mention any theory / theoretical concepts / approaches? If any theory is mentioned, how is theory used in the study?

Quality- and relevance-related information
17) Quality concerns - whether there are any quality concerns (e.g., limited information about the research methods used)?
18) Primary research object - is the HVD a primary research object in the study? (primary - the paper is focused around the HVD determination, secondary - mentioned but not studied (e.g., as part of discussion, future work etc.))

    HVD determination-related information
19) HVD definition and type of value - how is the HVD defined in the article and / or any other equivalent term?
20) HVD indicators - what are the indicators to identify HVD? How were they identified? (components & relationships, "input -> output")
21) A framework for HVD determination - is there a framework presented for HVD identification? What components does it consist of and what are the relationships between these components? (detailed description)
22) Stakeholders and their roles - what stakeholders or actors does HVD determination involve? What are their roles?
23) Data - what data do HVD cover?
24) Level (if relevant) - what is the level of the HVD determination covered in the article? (e.g., city, regional, national, international)

Format of the files: .xls, .csv (for the first spreadsheet only), .odt, .docx

Licenses or restrictions: CC-BY

    For more info, see README.txt

  13. Global monthly catch of tuna, tuna-like and shark species (1950-2021) by 1°...

    • data.europa.eu
    unknown
    Updated Dec 1, 2024
    Cite
    Zenodo (2024). Global monthly catch of tuna, tuna-like and shark species (1950-2021) by 1° or 5° squares (IRD level 2) - and efforts level 0 (1950-2023) [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-15221705?locale=cs
    Explore at:
    unknown(21391)Available download formats
    Dataset updated
    Dec 1, 2024
    Dataset authored and provided by
    Zenodohttp://zenodo.org/
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

Major differences from previous work, for level 2 catch:

• Catches in tons, raised to match nominal values, now consider the geographic area of the nominal data for improved accuracy.
• Captures in "Number of fish" are converted to weight based on nominal data. The conversion factors used in the previous version are no longer used, as they did not adequately represent the diversity of captures.
• Numbers of fish without corresponding nominal data are not removed as they were before, which had created a huge difference for this measurement_unit between the two datasets.
• Nominal data from WCPFC include fishing fleet information, and georeferenced data have been raised based on this rather than solely on the triplet year/gear/species, to avoid random reallocations.
• Strata for which catches in tons are raised to match nominal data have had their numbers removed.
• Raising only applies to complete years, to avoid overrepresenting specific months, particularly in the early years of georeferenced reporting.
• Strata where georeferenced data exceed nominal data have not been adjusted downward, as it is unclear whether these discrepancies arise from missing nominal data or from different aggregation methods in the two datasets.
• The data are not aggregated to 5-degree squares and thus remain spatially unharmonized. Aggregation can be performed using CWP codes for geographic identifiers; for example, an R function is available: source("https://raw.githubusercontent.com/firms-gta/geoflow-tunaatlas/master/sardara_functions/transform_cwp_code_from_1deg_to_5deg.R")

Level 0 dataset has been modified, creating differences in this new version, notably:

• The species retained are different; only 32 major species are kept.
• Mappings have been somewhat modified based on new standards implemented by FIRMS.
• New rules have been applied for overlapping areas.
• Data are only displayed in 1-degree and 5-degree square areas.
• The data are enriched with "Species group" and "Gear labels" using the fdiwg standards.

These main differences are recapped in the Differences_v2018_v2024.zip file.

Recommendations: To avoid converting data from numbers using nominal strata, we recommend the use of conversion factors, which could be provided by tRFMOs. In some strata, nominal data appear higher than georeferenced data, as observed during level 2 processing. These discrepancies may result from errors or from differences in aggregation methods. Further analysis will examine these differences in detail to refine treatments accordingly. A summary of differences by tRFMO, based on the number of strata, is included in the appendix. Some nominal data have no equivalent in georeferenced data and therefore cannot be disaggregated. What could be done is to check, for each nominal datum without an equivalent, whether georeferenced data exist within different buffers, and to average the distribution of this footprint; the nominal data would then be disaggregated based on the georeferenced data. This would lead to the creation of data (approximately 3%), and would necessitate reducing or removing all georeferenced data without a nominal equivalent or with a lesser equivalent. Tests are currently being conducted with and without this. It would help improve the biomass-captured footprint but could lead to unexpected discrepancies with current datasets.

For level 0 effort: In some datasets, namely those from ICCAT and the purse seine (PS) data from WCPFC, the same effort data has been reported multiple times using different units; these have been kept as is, since no official mapping allows conversion between these units. As a result, users should be reminded that some ICCAT and WCPFC effort data are deliberately duplicated: in the case of ICCAT data, lines with identical strata but different effort units are duplicates reporting the same fishing activity with different measurement units. It is indeed not possible to infer strict equivalence between units, as some contain information about others (e.g., Hours.FAD and Hours.FSC may inform Hours.STD). In the case of WCPFC data, effort records were also kept in all originally reported units. Here, duplicates do not necessarily share the same "fishing_mode", as SETS for purse seiners are reported with an explicit association to fishing_mode, while DAYS are not. This distinction allows SETS records to be separated by fishing mode, whereas DAYS records remain aggregated. Some limited harmonization, particularly between units such as NET-days and Nets, has not been implemented in the current version of the dataset, but may be considered in future releases if a consistent relationship can be established.

  14. Data from: Hybrid LCA database generated using ecoinvent and EXIOBASE

    • data.niaid.nih.gov
    • explore.openaire.eu
    Updated Oct 9, 2021
    Cite
    Agez Maxime (2021). Hybrid LCA database generated using ecoinvent and EXIOBASE [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3890378
    Explore at:
    Dataset updated
    Oct 9, 2021
    Dataset authored and provided by
    Agez Maxime
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

Hybrid LCA database generated using ecoinvent and EXIOBASE, i.e., each process of the original ecoinvent database receives new direct inputs (coming from EXIOBASE) that are deemed missing (e.g., services). Each process of the resulting hybrid database is thus not (or at least less) truncated, and the calculated life-cycle emissions/impacts should therefore be closer to reality.

    For license reasons, only the added inputs for each process of ecoinvent are provided (and not all the inputs).

    Why are there two versions for hybrid-ecoinvent3.5?

One of the versions corresponds to ecoinvent hybridized with the normal version of EXIOBASE, and the other is hybridized with a capital-endogenized version of EXIOBASE.

    What does capital endogenization do?

It matches capital goods formation to the value chains of the products where they are required. In a more LCA way of speaking, EXIOBASE in its normal version does not allocate capital use to value chains. It is as if ecoinvent processes had no inputs of buildings, etc., in their unit process inventories. For more detail on this, refer to (Södersten et al., 2019) or (Miller et al., 2019).

    So which version do I use?

    Using the version "with capitals" gives a more comprehensive coverage. Using the "without capitals" version means that if a process of ecoinvent misses inputs of capital goods (e.g., a process does not include the company laptops of the employees), it won't be added. It comes with its fair share of assumptions and uncertainties however.

    Why is it only available for hybrid-ecoinvent3.5?

    The work used for capital endogenization is not available for exiobase3.8.1.

    How do I use the dataset?

First, to use it, you will need both the corresponding ecoinvent [cut-off] and EXIOBASE [product x product] versions. For the EXIOBASE reference year to be used, take 2011 if using hybrid-ecoinvent3.5 and 2019 for hybrid-ecoinvent3.6 and 3.7.1.

    In the four datasets of this package, only added inputs are given (i.e. inputs from EXIOBASE added to ecoinvent processes). Ecoinvent and EXIOBASE processes/sectors are not included, for copyright issues. You thus need both ecoinvent and EXIOBASE to calculate life cycle emissions/impacts.

    Module to get ecoinvent in a Python format: https://github.com/majeau-bettez/ecospold2matrix (make sure to take the most up-to-date branch)

    Module to get EXIOBASE in a Python format: https://github.com/konstantinstadler/pymrio (can also be installed with pip)

    If you want to use the "with capitals" version of the hybrid database, you also need to use the capital endogenized version of EXIOBASE, available here: https://zenodo.org/record/3874309. Choose the pxp version of the year you plan to study (which should match with the year of the EXIOBASE version). You then need to normalize the capital matrix (i.e., divide by the total output x of EXIOBASE). Then, you simply add the normalized capital matrix (K) to the technology matrix (A) of EXIOBASE (see equation below).
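A minimal sketch of this normalization step (assuming K, A_io and x are dense numpy arrays aligned to the same EXIOBASE sector ordering; the function and variable names are illustrative, not taken from pymrio or pylcaio):

import numpy as np

# A_io: EXIOBASE technology (input-coefficient) matrix, K: raw capital-formation flows,
# x: EXIOBASE total output vector.
def endogenize_capitals(A_io, K, x):
    x_safe = np.where(x == 0, 1.0, x)   # guard against zero-output sectors
    K_norm = K / x_safe                 # divide capital flows by total output (column-wise here, an assumed orientation)
    return A_io + K_norm                # add the normalized capital matrix to the technology matrix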

    Once you have all the data needed, you just need to apply a slightly modified version of the Leontief equation:

\begin{equation}
\mathbf{q}^{hyb} =
\begin{bmatrix} \mathbf{C}^{lca}\cdot\mathbf{S}^{lca} & \mathbf{C}^{io}\cdot\mathbf{S}^{io} \end{bmatrix}
\cdot
\left( \mathbf{I} -
\begin{bmatrix} \mathbf{A}^{lca} & \mathbf{C}^{d} \\ \mathbf{C}^{u} & \mathbf{A}^{io}+\mathbf{K}^{io} \end{bmatrix}
\right)^{-1}
\cdot
\begin{bmatrix} \mathbf{y}^{lca} \\ 0 \end{bmatrix}
\end{equation}

q^hyb gives the hybridized impacts, i.e., the impacts of each process including the impacts generated by their new inputs.

C^lca and C^io are the respective characterization matrices for ecoinvent and EXIOBASE.

S^lca and S^io are the respective environmental extension matrices (or elementary flows in LCA terms) for ecoinvent and EXIOBASE.

I is the identity matrix.

A^lca and A^io are the respective technology matrices for ecoinvent and EXIOBASE (the ones loaded with ecospold2matrix and pymrio).

K^io is the capital matrix. If you do not use the endogenized version, do not include this matrix in the calculation.

C^u (or upstream cut-offs) is the matrix that you get in this dataset.

C^d (or downstream cut-offs) is simply a matrix of zeros in the case of this application.

Finally, you define your final demand (or functional unit/set of functional units for LCA) as y^lca.
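Putting the pieces together, the modified Leontief equation above can be sketched as follows (a toy dense-numpy illustration with hypothetical variable names; real ecoinvent/EXIOBASE matrices are large and sparse, so scipy.sparse or pylcaio is the practical route, and both characterization matrices are assumed to map to the same impact categories):

import numpy as np

def hybrid_impacts(C_lca, S_lca, C_io, S_io, A_lca, A_io, K_io, C_u, y_lca):
    n_lca, n_io = A_lca.shape[0], A_io.shape[0]
    # Characterized interventions, block row [C_lca.S_lca, C_io.S_io]
    CS = np.hstack([C_lca @ S_lca, C_io @ S_io])
    # Block technology matrix; C_d (downstream cut-offs) is a zero block here
    C_d = np.zeros((n_lca, n_io))
    A_block = np.block([[A_lca, C_d],
                        [C_u,   A_io + K_io]])   # drop K_io if not using the endogenized version
    # Final demand: the LCA functional unit, zeros on the IO side
    y = np.concatenate([y_lca, np.zeros(n_io)])
    x = np.linalg.solve(np.eye(n_lca + n_io) - A_block, y)   # (I - A)^-1 y
    return CS @ x                                            # q_hyb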

    Can I use it with different versions/reference years of EXIOBASE?

    Technically speaking, yes it will work, because the temporal aspect does not intervene in the determination of the hybrid database presented here. However, keep in mind that there might be some inconsistencies. For example, you would need to multiply each of the inputs of the datasets by a factor to account for inflation. Prices of ecoinvent (which were used to compile the hybrid databases, for all versions presented here) are defined in €2005.

What are the weird strings of numbers in the columns?

Ecoinvent processes are identified through unique identifiers (uuids), to which metadata (i.e., name, location, price, etc.) can be traced back with the appropriate metadata files in each dataset package.

Why is the equation (I - A)^-1 and not A^-1 like in LCA?

IO and LCA have the same computational background. In LCA, however, the convention is to represent outputs and inputs in the technology matrix. That's why there is a diagonal of 1s (the outputs, i.e. functional units) and negative values elsewhere (the inputs). In IO, the technology matrix does not include outputs and only registers inputs as positive values. In the end, it is just a convention difference. If we call T the technology matrix of LCA and A the technology matrix of IO, we have T = I - A. When you load ecoinvent using ecospold2matrix, the resulting version of ecoinvent will already be in the IO convention, so you won't have to bother with it.
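A tiny toy example of that convention difference (illustrative numbers only):

import numpy as np

# IO convention: A holds only positive input coefficients.
A = np.array([[0.0, 0.2],
              [0.3, 0.0]])
# LCA convention: 1s on the diagonal (outputs), negative values for inputs.
T = np.eye(2) - A

y = np.array([1.0, 0.0])                   # final demand / functional unit
x_io = np.linalg.solve(np.eye(2) - A, y)   # IO: (I - A)^-1 y
x_lca = np.linalg.solve(T, y)              # LCA: T^-1 y
assert np.allclose(x_io, x_lca)            # same scaling vector either way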

    Pymrio does not provide a characterization matrix for EXIOBASE, what do I do?

    You can find an up-to-date characterization matrix (with Impact World+) for environmental extensions of EXIOBASE here: https://zenodo.org/record/3890339

    If you want to match characterization across both EXIOBASE and ecoinvent (which you should do), here you can find a characterization matrix with Impact World+ for ecoinvent: https://zenodo.org/record/3890367

    It's too complicated...

The custom software that was used to develop these datasets already deals with some of the steps described. Go check it out: https://github.com/MaximeAgez/pylcaio. You can also generate your own hybrid version of ecoinvent using this software (you can play with some parameters such as the correction for double counting, the inflation rate, the price data to be used, etc.). As of pylcaio v2.1, the resulting hybrid database (generated directly by pylcaio) can be exported to and manipulated in brightway2.

    Where can I get more information?

    The whole methodology is detailed in (Agez et al., 2021).

  15. Inferring and comparing metabolisms across heterogeneous sets of annotated...

    • zenodo.org
    zip
    Updated Mar 27, 2023
    + more versions
    Cite
Arnaud Belcour; Jeanne Got; Méziane Aite; Ludovic Delage; Jonas Collen; Clémence Frioux; Catherine Leblanc; Simon M. Dittami; Samuel Blanquart; Gabriel V. Markov; Anne Siegel (2023). Inferring and comparing metabolisms across heterogeneous sets of annotated genomes using AuCoMe [Dataset]. http://doi.org/10.5281/zenodo.7387234
    Explore at:
    zipAvailable download formats
    Dataset updated
    Mar 27, 2023
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
Arnaud Belcour; Jeanne Got; Méziane Aite; Ludovic Delage; Jonas Collen; Clémence Frioux; Catherine Leblanc; Simon M. Dittami; Samuel Blanquart; Gabriel V. Markov; Anne Siegel
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    CONTENT OF THIS ARCHIVE

The Zenodo archive is composed of one file and five main directories:
    * analyses this directory contains all tabulated files used to create the figures and results of the paper.
    * analyses this directory contains all tabulated files used to create the figures and results of the paper.

    * aucome_v0.5.1 this directory contains the code of AuCoMe used to run the three datasets.

    * datasets this directory gathers all datasets on which AuCoMe was run: the bacterial, fungal, and algal datasets, and the 32 synthetic datasets, which contain an E. coli K–12 MG1655 genome to which various degradations were applied, together with 28 other bacterial genomes.

* metacyc_23.5.padmet the version 23.5 of the MetaCyc database (https://metacyc.org/) in the PADMET format. It was used by AuCoMe to reconstruct all the metabolic networks. Hence metacyc_23.5.padmet is required to reproduce the article results.

    * padmet_v5.0.1 this directory contains the code of PADMET used to run AuCoMe.

    * scripts this directory contains several scripts to generate figures and a script to degrade the E. coli K–12 MG1655 genome.

    1/ Content of the analyses subdirectory
    * figure_2_bacterial_nb_reactions.tsv for each species of the bacterial dataset, this file gives the number of reactions at each AuCoMe step. It was used to create figure 2B.

    * figure_2_fungal_nb_reactions.tsv for each species of the fungal dataset, this file gives the number of reactions at each AuCoMe step. It was used to create figure 2C.

    * figure_2_algal_nb_reactions.tsv for each species of the algal dataset, this file gives the number of reactions at each AuCoMe step. It was used to create figure 2D.

    * figure_3_nb_reactions_step.tsv for each dataset of the 32 synthetic bacterial datasets, this file enumerates the number of reactions at each AuCoMe step. It was used to create figure 3A.

    * figure_3_fmeasure_steps.tsv for each dataset of the 32 synthetic bacterial datasets, this file indicates the values of the F-measures resulting of the comparison of the GSMNs recovered for each E. coli K–12 MG1655 genome replicate with the gold-standard network EcoCyc. It was used to create figure 3B.

* figure_S4_Deepec_fungal.tsv for each species of the fungal dataset, at each AuCoMe step (robust orthology, non-robust orthology, and annotation or orthology), several measures were computed, i.e.: the number of reactions, the number of ECs, the number of ECs validated by DeepEC, and the ratio of the number of ECs validated by DeepEC to the number of ECs. It was used to design figure S4(a).

    * figure_S4_Deepec_algal.tsv for each species of the algal dataset, at each AuCoMe step (robust orthology, non-robust orthology, and annotation or orthology), several measures were computed, i.e.: the number of reactions, the number of ECs, the number of ECs validated by DeepEC, and the ratio of the number of ECs validated by DeepEC to the number of ECs. It was used to design figure S4(b).

* SuplFile_o-Aminophenol_reactions_tables_S10_S11_S12.ods comprises three tables: S10, S11, and S12 (S12 being S10 with more detail, such as the amino acid sequences).

    2/ Content of the aucome v0.5.1 subdirectory
This directory contains a copy of the AuCoMe project from the GitHub site: https://github.com/AuReMe/aucome (downloaded on 15/11/2022). It is composed of two subdirectories and five files:
    * LICENCE licence of the AuCoMe software.
    * LICENCE licence of the AuCoMe software.

    * README.rst README of the AuCoMe software.

* requirements.txt contains the list of required Python packages.

* setup.cfg contains metadata about the AuCoMe package and is used with setup.py to distribute AuCoMe.

    * setup.py contains various information relevant to the AuCoMe package, including options and metadata. It is used to distribute AuCoMe with PyPI. It is also used to create an entry point when installing it with pip.

    * recipes this subdirectory contains two files:
    Dockerfile contains instructions to run AuCoMe in a Docker environment.

    Singularity contains instructions to run AuCoMe in a Singularity container.

    * aucome this directory contains 11 Python files:
__init__.py marks the directory as a Python module.

    __main__.py contains the functions implementing the command-line interface of AuCoMe.

    analysis.py contains the functions to analyse the AuCoMe results.

    check.py contains the functions to check the input files.

    compare.py contains the functions to compare the AuCoMe results between two distinct subgroups.

orthology.py contains the functions to propagate reactions through orthology.

    reconstruction.py contains the functions to perform the reconstruction of draft GSMNs by using Pathway Tools in a parallel implementation.

spontaneous.py contains the functions to add spontaneous reactions to some GSMNs if they complete a MetaCyc metabolic pathway.

    structural.py contains the functions to check that no reactions are missing due to missing gene structures. A genomic search is performed for all reactions present in one organism but not in another.

    utils.py contains a function to analyse the configuration file.

    workflow.py contains functions to run all the steps of AuCoMe.

    3/ Content of the datasets subdirectory
    3.1/ Content of the algal, bacterial, and fungal directories
    These three directories are composed of 8 subdirectories:
    * FASTA contains the proteome of each species as a FASTA file.

    * cleaned_GBKs for each species, it contains the annotated genome, with the protein sequences in a GenBank format file.

    * dictionaries for some species, genes needed to be renamed for compatibility reasons. This folder contains CSV files with the mapping between the old names of genes and the new ones.

    * annotated_DATs contains a subdirectory per species with all the output files from Pathway Tools v23.5, without any post-treatment, in the DAT format.

    * annotated_PADMETs for each species, it contains a metabolic network of the draft reconstruction step of AuCoMe, in the PADMET format.

* final_PADMETs for each species, it contains a metabolic network generated by the AuCoMe workflow, in the PADMET format.

    * final_SBMLs for each species, it contains a metabolic network generated by the AuCoMe workflow, in the SBML format.

    * panmetabolism is composed of 7 files describing the final metabolic networks:
    genes.tsv contains, for each organism, the list of genes and the associated reactions.

    metabolites.tsv contains the list of metabolites present in the panmetabolism. Then, for each metabolite and for each organism, it lists the reactions that produced this compound and the reactions that consumed it.

    pathways.tsv contains the list of pathways present in the panmetabolism. For each pathway and for each organism, it indicates the number of reactions present in this pathway, and the names of these reactions.

    reactions.tsv contains the list of reactions present in the panmetabolism. Then for each reaction, it indicates whether or not it belongs to an organism. If a reaction is found in a species, the genes associated with the reaction are also listed.

pvclust_reaction_dendrogram.png based on the presence/absence matrix of reactions in the different species of the dataset, it computes the Jaccard distances between these species and applies hierarchical clustering with complete linkage to create a dendrogram. The R package pvclust is used to create the dendrogram, with bootstrap resampling. For each node, a p-value indicates how strongly the cluster is supported by the data. This dendrogram is provided as a PNG picture.
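The dendrogram itself is built in R with pvclust (including bootstrap p-values); purely as an illustration of the distance and clustering choices described above, a minimal Python sketch without the bootstrap step could look like this (toy presence/absence matrix and hypothetical names):

import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram

species = ["sp1", "sp2", "sp3"]
# Rows = species, columns = reactions (presence/absence)
presence_absence = np.array([[1, 1, 0, 1],
                             [1, 0, 0, 1],
                             [0, 1, 1, 0]], dtype=bool)

dist = pdist(presence_absence, metric="jaccard")   # Jaccard distances between species
Z = linkage(dist, method="complete")               # complete-linkage hierarchical clustering
dendrogram(Z, labels=species)
plt.savefig("reaction_dendrogram.png")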


3.2/ Content of the synthetic_bacterial directory
    The synthetic_bacterial directory contains 32 subdirectories named Run_00, Run_01, ..., Run_31. Each subdirectory is composed of 9 files:
    * K_12_MG1655.gbk the annotated genome of E. coli K–12 MG1655 to which degradation of the functional and/or structural annotations was applied.

    * annotated_K_12_MG1655.sbml the metabolic network of E. coli K–12 MG1655 output of the draft reconstruction step of AuCoMe in the SBML format.

    * annotated_K_12_MG1655.padmet the metabolic network of E. coli K–12 MG1655 output of the draft reconstruction step of AuCoMe in the PADMET format.

    * orthology_K_12_MG1655.sbml the metabolic network of E. coli K–12 MG1655 output of the orthology propagation step of AuCoMe in the SBML format.

    * orthology_K_12_MG1655.padmet the metabolic network of E.

  16. ‘Winter Olympics Prediction - Fantasy Draft Picks’ analyzed by Analyst-2

    • analyst-2.ai
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com), ‘Winter Olympics Prediction - Fantasy Draft Picks’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-winter-olympics-prediction-fantasy-draft-picks-2684/07d15ca8/?iid=004-753&v=presentation
    Explore at:
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Winter Olympics Prediction - Fantasy Draft Picks’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/ericsbrown/winter-olympics-prediction-fantasy-draft-picks on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    Olympic Draft Predictive Model

    Our family runs an Olympic Draft - similar to fantasy football or baseball - for each Olympic cycle. The purpose of this case study is to identify trends in medal count / point value to create a predictive analysis of which teams should be selected in which order.

There are a few assumptions that will impact the final analysis: Point Value - each medal is worth the following: Gold - 6 points, Silver - 4 points, Bronze - 3 points. The analysis reviews the last 10 Olympic cycles, Winter Olympics only.

    All GDP numbers are in USD

My initial hypothesis is that a larger GDP per capita and a larger contingent size are correlated with higher point values in the Olympic draft.

    All Data pulled from the following Datasets:

Winter Olympics Medal Count - https://www.kaggle.com/ramontanoeiro/winter-olympic-medals-1924-2018
Worldwide GDP History - https://data.worldbank.org/indicator/NY.GDP.MKTP.CD?end=2020&start=1984&view=chart

    GDP data was a wide format when downloaded from the World Bank. Opened file in Excel, removed irrelevant years, and saved as .csv.

    Process

    In RStudio utilized the following code to convert wide data to long:

install.packages("tidyverse")
library(tidyverse)
library(tidyr)

    Converting to long data from wide

    long <- newgdpdata %>% gather(year, value, -c("Country Name","Country Code"))

    Completed these same steps for GDP per capita.

    Primary Key Creation

There are differing types of data between these two databases and no good primary key to utilize. Used CONCAT to create a new key column in both, combining the year and country code to create a unique identifier that matches between the datasets.

    SELECT *, CONCAT(year,country_code) AS "Primary" FROM medal_count

    Saved as new table "medals_w_primary"

    Utilized Excel to concatenate the primary key for GDP and GDP per capita utilizing:

    =CONCAT()

    Saved as new csv files.

    Uploaded all to SSMS.

    Contingent Size

    Next need to add contingent size.

    No existing database had this information. Pulled data from Wikipedia.

2018 - No problem, pulled the existing table. 2014 - Table was not created. Pulled information into Excel; needed to convert the country NAMES into the country CODES.

    Created excel document with all ISO Country Codes. Items were broken down between both formats, either 2 or 3 letters. Example:

    AF/AFG

    Used =RIGHT(C1,3) to extract only the country codes.

    For the country participants list in 2014, copied source data from Wikipedia and pasted as plain text (not HTML).

    Items then showed as: Albania (2)

Broke cells using "(" as the delimiter to separate country names and numbers, then used find and replace to remove all parentheses from this data.

    We were left with: Albania 2

    Used VLOOKUP to create correct country code: =VLOOKUP(A1,'Country Codes'!A:D,4,FALSE)

This worked for almost all items, with a few exceptions that didn't match. Based on the nature and size of the items, manually checked which items were incorrect.

    Chinese Taipei 3 #N/A Great Britain 56 #N/A Virgin Islands 1 #N/A

    This was relatively easy to fix by adding corresponding line items to the Country Codes sheet to account for future variability in the country code names.

    Copied over to main sheet.

    Repeated this process for additional years.

    Once complete created sheet with all 10 cycles of data. In total there are 731 items.

    Data Cleaning

    Filtered by Country Code since this was an issue early on.

    Found a number of N/A Country Codes:

    Serbia and Montenegro FR Yugoslavia FR Yugoslavia Czechoslovakia Unified Team Yugoslavia Czechoslovakia East Germany West Germany Soviet Union Yugoslavia Czechoslovakia East Germany West Germany Soviet Union Yugoslavia

Appears to be issues with older codes, Soviet bloc countries especially. Referred to historical data and filled in these country codes manually. Codes found on iso.org.

Filled all in; one issue that was more difficult is the Unified Team of 1992 and the Soviet Union. For simplicity, used the code for Russia - the GDP data does not recognize the Soviet Union and breaks the union down into its constituent countries. Using Russia gives a reasonable figure for approximations and for analysis attempting to find trends.

    From here created a filter and scanned through the country names to ensure there were no obvious outliers. Found the following:

    Olympic Athletes from Russia[b] -- This is a one-off due to the recent PED controversy for Russia. Amended the Country Code to RUS to more accurately reflect the trends.

    Korea[a] and South Korea -- both were listed in 2018. This is due to the unified Korean team that competed. This is an outlier and does not warrant standing on its own as the 2022 Olympics will not have this team (as of this writing on 01/14/2022). Removed the COR country code item.

    Confirmed Primary Key was created for all entries.

    Ran minimum and maximum years, no unexpected values. Ran minimum and maximum Athlete numbers, no unexpected values. Confirmed length of columns for Country Code and Primary Key.

    No NULL values in any columns. Ready to import to SSMS.

    SQL work

    We now have 4 tables, joined together to create the master table:

SELECT [OlympicDraft].[dbo].[medals_w_primary].[year],
    host_country,
    host_city,
    [OlympicDraft].[dbo].[medals_w_primary].[country_name],
    [OlympicDraft].[dbo].[medals_w_primary].[country_code],
    Gold,
    Silver,
    Bronze,
    [OlympicDraft].[dbo].[gdp_w_primary].[value] AS GDP,
    [OlympicDraft].[dbo].[convertedgdpdatapercapita].[gdp_per_capita],
    Atheletes
FROM medals_w_primary
INNER JOIN gdp_w_primary
    ON [OlympicDraft].[dbo].[medals_w_primary].[primary] = [OlympicDraft].[dbo].[gdp_w_primary].[year_country]
INNER JOIN contingency_cleaned
    ON [OlympicDraft].[dbo].[medals_w_primary].[primary] = [OlympicDraft].[dbo].[contingency_cleaned].[Year_Country]
INNER JOIN convertedgdpdatapercapita
    ON [OlympicDraft].[dbo].[medals_w_primary].[primary] = [OlympicDraft].[dbo].[convertedgdpdatapercapita].[Year_Country]
ORDER BY year DESC

    This left us with the following table:

[Screenshot of the resulting table: https://i.imgur.com/tpNhiNs.png]

    Performed some basic cleaning tasks to ensure no outliers:

    Checked GDP numbers: 1992 North Korea shows as null. Updated this row with information from countryeconomy.com - $12,458,000,000

    Checked GDP per capita:

    1992 North Korea again missing. Updated this to $595, utilized same source.

    UPDATE [OlympicDraft].[dbo].[gdp_w_primary] SET [OlympicDraft].[dbo].[gdp_w_primary].[value] = 12458000000 WHERE [OlympicDraft].[dbo].[gdp_w_primary].[year_country] = '1992PRK'

    UPDATE [OlympicDraft].[dbo].[convertedgdpdatapercapita] SET [OlympicDraft].[dbo].[convertedgdpdatapercapita].[gdp_per_capita] = 595 WHERE [OlympicDraft].[dbo].[convertedgdpdatapercapita].[year_country] = '1992PRK'

Liechtenstein showed as an outlier with a GDP per capita of 180,366 in 2018. Confirmed this number is correct per the World Bank; it appears Liechtenstein does not often have athletes in the Winter Olympics. A quick SQL search to verify this shows that they fielded 3 athletes in 2018, with a Bronze medal being won. Initially this appears to be a good win/loss ratio.

    Finally, need to create a column that shows the total point value for each of these rows based on the above formula (6 points for Gold, 4 points for Silver, 3 points for Bronze).

    Updated query as follows:

SELECT [OlympicDraft].[dbo].[medals_w_primary].[year],
    host_country,
    host_city,
    [OlympicDraft].[dbo].[medals_w_primary].[country_name],
    [OlympicDraft].[dbo].[medals_w_primary].[country_code],
    Gold,
    Silver,
    Bronze,
    [OlympicDraft].[dbo].[gdp_w_primary].[value] AS GDP,
    [OlympicDraft].[dbo].[convertedgdpdatapercapita].[gdp_per_capita],
    Atheletes,
    (Gold*6) + (Silver*4) + (Bronze*3) AS 'Total_Points'
FROM [OlympicDraft].[dbo].[medals_w_primary]
INNER JOIN gdp_w_primary
    ON [OlympicDraft].[dbo].[medals_w_primary].[primary] = [OlympicDraft].[dbo].[gdp_w_primary].[year_country]
INNER JOIN contingency_cleaned
    ON [OlympicDraft].[dbo].[medals_w_primary].[primary] = [OlympicDraft].[dbo].[contingency_cleaned].[Year_Country]
INNER JOIN convertedgdpdatapercapita
    ON [OlympicDraft].[dbo].[medals_w_primary].[primary] = [OlympicDraft].[dbo].[convertedgdpdatapercapita].[Year_Country]
ORDER BY [OlympicDraft].[dbo].[convertedgdpdatapercapita].[year]

    Spot checked, calculating correctly.

    Saved result as winter_olympics_study.csv.

    We can now see that all relevant information is in this table:

[Screenshot of the final table: https://i.imgur.com/ceZvqCA.png]

    RStudio Work

    To continue our analysis, opened this CSV in RStudio.

install.packages("tidyverse")
library(tidyverse)
library(ggplot2)
install.packages("forecast")
library(forecast)
install.packages("GGally")
library(GGally)
install.packages("modelr")
library(modelr)

    View(winter_olympic_study)

    Finding correlation between gdp_per_capita and Total_Points

    ggplot(data = winter_olympic_study) + geom_point(aes(x=gdp_per_capita,y=Total_Points,color=country_name)) + facet_wrap(~country_name)

    cor(winter_olympic_study$gdp_per_capita, winter_olympic_study$Total_Points, method = c("pearson"))

    Result is .347, showing a moderate correlation between these two figures.

Looked next at GDP vs. Total_Points:

ggplot(data = winter_olympic_study) + geom_point(aes(x=GDP,y=Total_Points,color=country_name)) + facet_wrap(~country_name)

cor(winter_olympic_study$GDP, winter_olympic_study$Total_Points, method = c("pearson"))

This resulted in 0.35, a statistically insignificant difference from the GDP per capita result.

Next looked at contingent size vs. total points:

ggplot(data = winter_olympic_study) + geom_point(aes(x=Atheletes,y=Total_Points,color=country_name)) +

  17. Football Delphi

    • kaggle.com
    Updated Aug 16, 2017
    Cite
    Jörg Eitner (2017). Football Delphi [Dataset]. https://www.kaggle.com/datasets/laudanum/footballdelphi/versions/2
    Explore at:
Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 16, 2017
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Jörg Eitner
    License

https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

As many others, I have asked myself whether it is possible to use machine learning to create valid predictions for football (soccer) match outcomes. Hence I created a dataset consisting of historic match data for the German Bundesliga (1st and 2nd division) as well as the English Premier League, reaching back as far as 1993 and up to 2016. Besides the mere information concerning goals scored and home/draw/away wins, the dataset also includes per-side (team) data such as transfer value per team (pre-season), squad strength, etc. Unfortunately I was only able to find sources for these advanced attributes going back to the 2005 season.
    I have used this dataset with different machine learning algorithms, including random forests, XGBoost, as well as different recurrent neural network architectures (in order to potentially identify recurring patterns in winning streaks, etc.). I'd like to share the approaches I used as separate Kernels here as well. So far I did not manage to consistently exceed an accuracy of 53% on a validation set using the 2016 season of Bundesliga 1 (no information rate = 49%).

    Although I have done some visual exploration before implementing the different machine learning approaches using Tableau, I think a visual exploration kernel would be very beneficial.

    Content

The data comes as an SQLite file containing the following tables and fields:

    Table: Matches

    • Match_ID (int): unique ID per match
    • Div (str): identifies the division the match was played in (D1 = Bundesliga, D2 = Bundesliga 2, E0 = English Premier League)
    • Season (int): Season the match took place in (usually covering the period of August till May of the following year)
    • Date (str): Date of the match
    • HomeTeam (str): Name of the home team
    • AwayTeam (str): Name of the away team
    • FTHG (int) (Full Time Home Goals): Number of goals scored by the home team
    • FTAG (int) (Full Time Away Goals): Number of goals scored by the away team
    • FTR (str) (Full Time Result): 3-way result of the match (H = Home Win, D = Draw, A = Away Win)

    Table: Teams

    • Season (str): Football season for which the data is valid
    • TeamName (str): Name of the team the data concerns
    • KaderHome (str): Number of Players in the squad
    • AvgAgeHome (str): Average age of players
    • ForeignPlayersHome (str): Number of foreign players (non-German, non-English respectively) playing for the team
    • OverallMarketValueHome (str): Overall market value of the team pre-season in EUR (based on data from transfermarkt.de)
    • AvgMarketValueHome (str): Average market value (per player) of the team pre-season in EUR (based on data from transfermarkt.de)
    • StadiumCapacity (str): Maximum stadium capacity of the team's home stadium

    Table: Unique Teams

    • TeamName (str): Name of a team
    • Unique_Team_ID (int): Unique identifier for each team

    Table: Teams_in_Matches

    • Match_ID (int): Unique match ID
    • Unique_Team_ID (int): Unique team ID (This table is used to easily retrieve each match a given team has played in)

    Based on these tables I created a couple of views which I used as input for my machine learning models:

    View: FlatView

    Combination of all matches with the respective additional data from Teams table for both home and away team.

    View: FlatView_Advanced

    Same as Flatview but also includes Unique_Team_ID and Unique_Team in order to easily retrieve all matches played by a team in chronological order.

    View: FlatView_Chrono_TeamOrder_Reduced

Similar to FlatView_Advanced, however missing the additional attributes from the Teams table, in order to have a longer history including the years 1993 - 2004. Especially interesting if one is only interested in analyzing winning/losing streaks.
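As an illustration of how the Teams_in_Matches mapping can be used to pull a single team's match history, here is a minimal Python sketch (the SQLite filename, the underscored spelling of the Unique_Teams table and the team name are assumptions):

import sqlite3

conn = sqlite3.connect("database.sqlite")  # filename assumed

# All matches a given team played in, in chronological order.
query = """
SELECT m.Date, m.Div, m.Season, m.HomeTeam, m.AwayTeam, m.FTHG, m.FTAG, m.FTR
FROM Matches AS m
JOIN Teams_in_Matches AS tim ON tim.Match_ID = m.Match_ID
JOIN Unique_Teams AS ut ON ut.Unique_Team_ID = tim.Unique_Team_ID
WHERE ut.TeamName = ?
ORDER BY m.Date;
"""
for row in conn.execute(query, ("Bayern Munich",)):  # team name assumed
    print(row)
conn.close()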

    Acknowledgements

    Thanks to football-data.co.uk and transfermarkt.de for providing the raw data used in this dataset.

    Inspiration

Please feel free to use the humble dataset provided here for any purpose you want. To me it would be most interesting if others think that recurrent neural networks could in fact be of help (and maybe even outperform classical feature engineering) in identifying streaks of losses and wins. In the literature I mostly found examples of RNN applications where the data were time series in a very narrow sense (e.g. temperature measurements over time), hence it would be interesting to get your input on this question.

Maybe someone also finds additional attributes per team or match which have a substantial impact on match outcome. So far I have found the "Market Value" of a team to be by far the best predictor when two teams face each other, which makes sense as the market value usually tends to correlate closely with the strength of a team and its prospects of winning.

  18. Success.ai | LinkedIn Full Dataset – 700M Public Profiles & 70M Companies –...

    • datarade.ai
    Updated Jan 1, 2022
    + more versions
    Cite
    Success.ai (2022). Success.ai | LinkedIn Full Dataset – 700M Public Profiles & 70M Companies – Global Dataset – Best Price and Quality Guarantee [Dataset]. https://datarade.ai/data-products/success-ai-linkedin-full-dataset-700m-public-profiles-7-success-ai
    Explore at:
    .json, .csv, .bin, .xml, .xls, .sql, .txtAvailable download formats
    Dataset updated
    Jan 1, 2022
    Dataset provided by
    Area covered
    Mali, Equatorial Guinea, Anguilla, Bulgaria, Jamaica, Holy See, Cambodia, Norfolk Island, Samoa, Tajikistan
    Description

    Success.ai’s LinkedIn Data Solutions offer unparalleled access to a vast dataset of 700 million public LinkedIn profiles and 70 million LinkedIn company records, making it one of the most comprehensive and reliable LinkedIn datasets available on the market today. Our employee data and LinkedIn data are ideal for businesses looking to streamline recruitment efforts, build highly targeted lead lists, or develop personalized B2B marketing campaigns.

    Whether you’re looking for recruiting data, conducting investment research, or seeking to enrich your CRM systems with accurate and up-to-date LinkedIn profile data, Success.ai provides everything you need with pinpoint precision. By tapping into LinkedIn company data, you’ll have access to over 40 critical data points per profile, including education, professional history, and skills.

    Key Benefits of Success.ai’s LinkedIn Data: Our LinkedIn data solution offers more than just a dataset. With GDPR-compliant data, AI-enhanced accuracy, and a price match guarantee, Success.ai ensures you receive the highest-quality data at the best price in the market. Our datasets are delivered in Parquet format for easy integration into your systems, and with millions of profiles updated daily, you can trust that you’re always working with fresh, relevant data.

    Global Reach and Industry Coverage: Our LinkedIn data covers professionals across all industries and sectors, providing you with detailed insights into businesses around the world. Our geographic coverage spans 259M profiles in the United States, 22M in the United Kingdom, 27M in India, and thousands of profiles in regions such as Europe, Latin America, and Asia Pacific. With LinkedIn company data, you can access profiles of top companies from the United States (6M+), United Kingdom (2M+), and beyond, helping you scale your outreach globally.

    Why Choose Success.ai’s LinkedIn Data: Success.ai stands out for its tailored approach and white-glove service, making it easy for businesses to receive exactly the data they need without managing complex data platforms. Our dedicated Success Managers will curate and deliver your dataset based on your specific requirements, so you can focus on what matters most—reaching the right audience. Whether you’re sourcing employee data, LinkedIn profile data, or recruiting data, our service ensures a seamless experience with 99% data accuracy.

    • Best Price Guarantee: We offer unbeatable pricing on LinkedIn data, and we’ll match any competitor.
    • Global Scale: Access 700 million LinkedIn profiles and 70 million company records globally.
    • AI-Verified Accuracy: Enjoy 99% data accuracy through our advanced AI and manual validation processes.
    • Real-Time Data: Profiles are updated daily, ensuring you always have the most relevant insights.
    • Tailored Solutions: Get custom-curated LinkedIn data delivered directly, without managing platforms.
    • Ethically Sourced Data: Compliant with global privacy laws, ensuring responsible data usage.
    • Comprehensive Profiles: Over 40 data points per profile, including job titles, skills, and company details.
    • Wide Industry Coverage: Covering sectors from tech to finance across regions like the US, UK, Europe, and Asia.

    Key Use Cases:

    • Sales Prospecting and Lead Generation: Build targeted lead lists using LinkedIn company data and professional profiles, helping sales teams engage decision-makers at high-value accounts.
    • Recruitment and Talent Sourcing: Use LinkedIn profile data to identify and reach top candidates globally. Our employee data includes work history, skills, and education, providing all the details you need for successful recruitment.
    • Account-Based Marketing (ABM): Use our LinkedIn company data to tailor marketing campaigns to key accounts, making your outreach efforts more personalized and effective.
    • Investment Research & Due Diligence: Identify companies with strong growth potential using LinkedIn company data. Access key data points such as funding history, employee count, and company trends to fuel investment decisions.
    • Competitor Analysis: Stay ahead of your competition by tracking hiring trends, employee movement, and company growth through LinkedIn data. Use these insights to adjust your market strategy and improve your competitive positioning.
    • CRM Data Enrichment: Enhance your CRM systems with real-time updates from Success.ai’s LinkedIn data, ensuring that your sales and marketing teams are always working with accurate and up-to-date information.
    • Comprehensive Data Points for LinkedIn Profiles: Our LinkedIn profile data includes over 40 key data points for every individual and company, ensuring a complete understanding of each contact:

    LinkedIn URL: Access direct links to LinkedIn profiles for immediate insights. Full Name: Verified first and last names. Job Title: Current job titles, and prior experience. Company Information: Company name, LinkedIn URL, domain, and location. Work and Per...

  19. Data from: A consensus compound/bioactivity dataset for data-driven drug...

    • zenodo.org
    • explore.openaire.eu
    • +1more
    zip
    Updated May 13, 2022
    Cite
    Laura Isigkeit; Laura Isigkeit; Apirat Chaikuad; Apirat Chaikuad; Daniel Merk; Daniel Merk (2022). A consensus compound/bioactivity dataset for data-driven drug design and chemogenomics [Dataset]. http://doi.org/10.5281/zenodo.6320761
    Explore at:
    zipAvailable download formats
    Dataset updated
    May 13, 2022
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Laura Isigkeit; Laura Isigkeit; Apirat Chaikuad; Apirat Chaikuad; Daniel Merk; Daniel Merk
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Information

The diverse publicly available compound/bioactivity databases constitute a key resource for data-driven applications in chemogenomics and drug design. Analysis of their coverage of compound entries and biological targets revealed considerable differences, however, suggesting the benefit of a consensus dataset. Therefore, we have combined and curated information from five esteemed databases (ChEMBL, PubChem, BindingDB, IUPHAR/BPS and Probes&Drugs) to assemble a consensus compound/bioactivity dataset comprising 1144803 compounds with 10915362 bioactivities on 5613 targets (including defined macromolecular targets as well as cell-lines and phenotypic readouts). It also provides simplified information on the assay types underlying the bioactivity data and on bioactivity confidence, obtained by comparing data from different sources. We have unified the source databases, brought them into a common format and combined them, enabling easy generic use in multiple applications such as chemogenomics and data-driven drug design.

    The consensus dataset provides increased target coverage and contains a higher number of molecules compared to the source databases which is also evident from a larger number of scaffolds. These features render the consensus dataset a valuable tool for machine learning and other data-driven applications in (de novo) drug design and bioactivity prediction. The increased chemical and bioactivity coverage of the consensus dataset may improve robustness of such models compared to the single source databases. In addition, semi-automated structure and bioactivity annotation checks with flags for divergent data from different sources may help data selection and further accurate curation.

    Structure and content of the dataset

    Dataset structure

    Column headers (in the order given):

    ChEMBL ID | PubChem ID | IUPHAR ID | Target | Activity type | Assay type | Unit | Mean C (0)... | Mean PC (0)... | Mean B (0)... | Mean I (0)... | Mean PD (0)... | Activity check annotation | Ligand names | Canonical SMILES C... | Structure check | Source

    The dataset was created using the Konstanz Information Miner (KNIME) (https://www.knime.com/) and was exported as a CSV file and as a compressed CSV file.

    All columns are stored with the datatype ‘string’, except for the canonical SMILES columns, which use the SMILES type. For loading the dataset in KNIME, we recommend the File Reader node: it allows the data types of the columns to be adjusted exactly, and it is the only node that can read the compressed format.
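
    Outside of KNIME, the plain CSV export can also be loaded with standard data-science tooling. The snippet below is a minimal sketch and not part of the original documentation: the file name is an assumed placeholder, and all columns are read as strings to mirror the column types described above.

    import pandas as pd

    # Load the consensus dataset export; the file name is an assumed placeholder.
    # dtype=str keeps identifiers, annotation flags and SMILES as plain strings,
    # matching the 'string' column types described above.
    df = pd.read_csv(
        "consensus_compound_bioactivity_dataset.csv",  # assumed file name
        dtype=str,
        low_memory=False,
    )

    print(df.shape)             # number of rows and columns
    print(df.columns.tolist())  # column headers as listed under 'Dataset structure'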

    Column content:

    • ChEMBL ID, PubChem ID, IUPHAR ID: chemical identifier of the databases
    • Target: biological target of the molecule expressed as the HGNC gene symbol
    • Activity type: for example, pIC50
    • Assay type: Simplification/Classification of the assay into cell-free, cellular, functional and unspecified
    • Unit: unit of bioactivity measurement
    • Mean columns of the databases: mean of the bioactivity values (or an activity comment), together with the frequency of its occurrence in that database (C = ChEMBL, PC = PubChem, B = BindingDB, I = IUPHAR/BPS, PD = Probes&Drugs); e.g. Mean C = 7.5 (15) means that this value for the compound-target pair occurs 15 times in the ChEMBL database
    • Activity check annotation: bioactivity values from the different sources were compared and an annotation was added, providing automated activity validation for additional confidence (see the filtering sketch after this list):
      • no comment: bioactivity values are within one log unit;
      • check activity data: bioactivity values are not within one log unit;
      • only one data point: only one value was available, no comparison and no range calculated;
      • no activity value: no precise numeric activity value was available;
      • no log-value could be calculated: no negative decadic logarithm could be calculated, e.g., because the reported unit was not a compound concentration
    • Ligand names: all unique names contained in the five source databases are listed
    • Canonical SMILES columns: Molecular structure of the compound from each database
    • Structure check: denotes whether the compound structures reported by the different source databases match:
      • match: molecule structures are the same between different sources;
      • no match: the structures differ;
      • 1 source: no structure comparison is possible, because the molecule comes from only one source database.
    • Source: the databases from which the data originate
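
    As a usage illustration of the flags above (referenced in the ‘Activity check annotation’ item), the sketch below keeps only rows whose bioactivity values agree across sources and whose structures match, and then parses the mean cells. It is an assumption-laden example rather than part of the dataset documentation: the column names (e.g. 'Mean C') and the textual format of the mean cells (e.g. '7.5 (15)') follow the description above but may differ in the actual file.

    import re
    import pandas as pd

    df = pd.read_csv("consensus_compound_bioactivity_dataset.csv", dtype=str)  # assumed file name

    # Keep compound-target pairs whose sources agree within one log unit
    # (or that have only a single data point) and whose structures match
    # across sources (or come from a single source database).
    consistent = df[
        df["Activity check annotation"].isin(["no comment", "only one data point"])
        & df["Structure check"].isin(["match", "1 source"])
    ]

    def parse_mean(cell):
        """Parse cells like '7.5 (15)' into (mean value, occurrence count).

        The exact cell format is an assumption based on the example above;
        activity comments and empty cells yield None.
        """
        if not isinstance(cell, str):
            return None
        m = re.match(r"\s*([-+]?\d+(?:\.\d+)?)\s*\*?\s*\((\d+)\)", cell)
        return (float(m.group(1)), int(m.group(2))) if m else None

    # Example: parsed ChEMBL means for the consistent subset
    # (column name assumed; cf. the truncated 'Mean C (0)...' header above).
    chembl_means = consistent["Mean C"].map(parse_mean)
    print(chembl_means.dropna().head())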

  20. Tech Install Data | Tech Stack Data for 30M Verified Company Data Profiles |...

    • datarade.ai
    Updated Feb 12, 2018
    Cite
    Success.ai (2018). Tech Install Data | Tech Stack Data for 30M Verified Company Data Profiles | Best Price Guarantee [Dataset]. https://datarade.ai/data-products/tech-install-data-tech-stack-data-for-30m-verified-company-success-ai
    Explore at:
    .bin, .json, .xml, .csv, .xls, .sql, .txt (available download formats)
    Dataset updated
    Feb 12, 2018
    Dataset provided by
    Success.ai
    Area covered
    Norway, Macedonia (the former Yugoslav Republic of), Greece, Poland, Liechtenstein, Latvia, Romania, Andorra, Austria, Estonia
    Description

    Success.ai presents our Tech Install Data offering, a comprehensive dataset drawn from 28 million verified company profiles worldwide. Our meticulously curated Tech Install Data is designed to empower your sales and marketing strategies by providing in-depth insights into the technology stacks used by companies across various industries. Whether you're targeting small businesses or large enterprises, our data encompasses a diverse range of sectors, ensuring you have the necessary tools to refine your outreach and engagement efforts.

    Comprehensive Coverage: Our Tech Install Data includes crucial information on technology installations used by companies. This encompasses software solutions, SaaS products, hardware configurations, and other technological setups critical for businesses. With data spanning industries such as finance, technology, healthcare, manufacturing, education, and more, our database offers unparalleled insights into corporate tech ecosystems.

    Data Accuracy and Compliance: At Success.ai, we prioritize data integrity and compliance. Our datasets are not only GDPR-compliant but also adhere to various international data protection regulations, making them safe for use across geographic boundaries. Each profile is AI-validated to ensure the accuracy and timeliness of the information provided, with regular updates to reflect any changes in company tech stacks.

    Tailored for Business Development: Leverage our Tech Install Data to enhance your account-based marketing (ABM) campaigns, improve sales prospecting, and execute targeted advertising strategies. Understanding a company's tech stack can help you tailor your messaging, align your product offerings, and address potential needs more effectively. Our data enables you to:

    • Identify prospects using competing or complementary products.
    • Customize pitches based on the prospect’s existing technology environment.
    • Enhance product recommendations with insights into potential tech gaps in target companies.

    Data Points and Accessibility: Our Tech Install Data offers detailed fields such as:

    • Company name and contact information.
    • Detailed descriptions of installed technologies.
    • Usage metrics for software and hardware.
    • Decision-makers’ contact details related to tech purchases.

    This data is delivered in easily accessible formats, including CSV, Excel, or directly through our API, allowing seamless integration with your CRM or any other marketing automation tools.

    Guaranteed Best Price and Service: Success.ai is committed to providing high-quality data at the most competitive prices in the market. Our best price guarantee ensures that you receive the most value from your investment in our data solutions. Additionally, our customer support team is always ready to assist with any queries or custom data requests, ensuring you maximize the utility of your purchased data.

    Sample Dataset and Custom Requests: To demonstrate the quality and depth of our Tech Install Data, we offer a sample dataset for preliminary review upon request. For specific needs or custom data solutions, our team is adept at creating tailored datasets that precisely match your business requirements.

    Engage with Success.ai Today: Connect with us to discover how our Tech Install Data can transform your business strategy and operational efficiency. Our experts are ready to assist you in navigating the data landscape and unlocking actionable insights to drive your company's growth.

    Start exploring the potential of detailed tech stack insights with Success.ai and gain the competitive edge necessary to thrive in today’s fast-paced business environment.
