Data Access: The data in this research collection may only be used for research purposes. Portions of the data are copyrighted and have commercial value as data, so care must be taken to use them strictly for research. Due to these restrictions, the collection is not open data. Please fill out the form and upload the signed Data Sharing Agreement via the Google Form.
Citation
Please cite our work as
@article{shahi2021overview,
  title   = {Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection},
  author  = {Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas},
  journal = {Working Notes of CLEF},
  year    = {2021}
}
Problem Definition: Given the text of a news article, determine whether the main claim made in the article is true, partially false, false, or other (e.g., claims in dispute), and detect the topical domain of the article. This task will run in English and German.
Subtask 3A: Multi-class fake news detection of news articles (English). Subtask 3A is designed as a four-class classification problem for detecting fake news. The training data will be released in batches and comprises roughly 900 articles with their respective labels. Given the text of a news article, determine whether the main claim made in the article is true, partially false, false, or other. Our definitions of the categories are as follows:
False - The main claim made in an article is untrue.
Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information but cannot be considered 100% true. It includes all articles in categories like partially false, partially true, mostly true, miscaptioned, misleading etc., as defined by different fact-checking services.
True - This rating indicates that the primary elements of the main claim are demonstrably true.
Other - An article that cannot be categorised as true, false, or partially false due to a lack of evidence about its claims. This category includes articles in dispute and unproven articles.
Input Data
The data will be provided as CSV files with the columns public_id, title, text, rating, and domain; the description of the columns is as follows:
Output data format
Sample file (predicted rating):
public_id, predicted_rating
1, false
2, true
Sample file (predicted domain):
public_id, predicted_domain
1, health
2, crime
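A minimal sketch of producing a submission in the format above is given below; the test file name, the column names (taken from the samples above), and the constant placeholder prediction are assumptions, not part of the official kit.

```python
import csv

def predict_rating(text):
    # Placeholder: replace with a real classifier trained on the released batches.
    return "false"

# Read the test articles and write the two-column submission file
# (public_id, predicted_rating) shown in the sample above.
with open("test.csv", newline="", encoding="utf-8") as fin, \
     open("subtask3a_predictions.csv", "w", newline="", encoding="utf-8") as fout:
    reader = csv.DictReader(fin)
    writer = csv.writer(fout)
    writer.writerow(["public_id", "predicted_rating"])
    for row in reader:
        writer.writerow([row["public_id"], predict_rating(row["text"])])
```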
Additional data for Training
To train your model, participants may use additional data in a similar format; some such datasets are available on the web. We do not provide the ground truth for those datasets. For testing, we will not use any articles from other datasets. Some possible sources:
IMPORTANT!
Evaluation Metrics
This task is evaluated as a classification task. We will use the macro-averaged F1 score (F1-macro) to rank the teams. There is a limit of 5 runs in total (not per day), and only one person from each team is allowed to submit runs.
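For reference, the ranking measure can be computed with scikit-learn's macro-averaged F1; the label strings below are purely illustrative.

```python
from sklearn.metrics import f1_score

# Toy illustration of the ranking metric: macro-averaged F1 over the
# four classes (true, partially false, false, other).
gold = ["false", "true", "partially false", "other", "false"]
pred = ["false", "true", "false", "other", "partially false"]
print(f1_score(gold, pred, average="macro"))
```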
Baseline: For this task, we have created a baseline system. The baseline system can be found at https://zenodo.org/record/6362498
Submission Link: Coming soon
Related Work
This dataset reflects reported incidents of crime (with the exception of murders where data exists for each victim) that have occurred in the City of Chicago over the past year, minus the most recent seven days of data. Data is extracted from the Chicago Police Department's CLEAR (Citizen Law Enforcement Analysis and Reporting) system. In order to protect the privacy of crime victims, addresses are shown at the block level only and specific locations are not identified. Should you have questions about this dataset, you may contact the Research & Development Division of the Chicago Police Department at 312.745.6071 or RandD@chicagopolice.org. Disclaimer: These crimes may be based upon preliminary information supplied to the Police Department by the reporting parties that have not been verified. The preliminary crime classifications may be changed at a later date based upon additional investigation and there is always the possibility of mechanical or human error. Therefore, the Chicago Police Department does not guarantee (either expressed or implied) the accuracy, completeness, timeliness, or correct sequencing of the information and the information should not be used for comparison purposes over time. The Chicago Police Department will not be responsible for any error or omission, or for the use of, or the results obtained from the use of this information. All data visualizations on maps should be considered approximate and attempts to derive specific addresses are strictly prohibited.
The Chicago Police Department is not responsible for the content of any off-site pages that are referenced by or that reference this web page other than an official City of Chicago or Chicago Police Department web page. The user specifically acknowledges that the Chicago Police Department is not responsible for any defamatory, offensive, misleading, or illegal conduct of other users, links, or third parties and that the risk of injury from the foregoing rests entirely with the user. The unauthorized use of the words "Chicago Police Department," "Chicago Police," or any colorable imitation of these words or the unauthorized use of the Chicago Police Department logo is unlawful. This web page does not, in any way, authorize such use. Data is updated daily Tuesday through Sunday. The dataset contains more than 65,000 records/rows of data and cannot be viewed in full in Microsoft Excel. Therefore, when downloading the file, select CSV from the Export menu. Open the file in an ASCII text editor, such as Wordpad, to view and search. To access a list of Chicago Police Department - Illinois Uniform Crime Reporting (IUCR) codes, go to http://bit.ly/rk5Tpc.
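Since the exported CSV is too large to open fully in Excel, it can instead be processed programmatically in chunks; the file name and the IUCR column name are assumptions about the export.

```python
import pandas as pd

# Count incidents per IUCR code without loading the whole export into memory.
# "Crimes.csv" is a placeholder for the downloaded CSV export.
counts = {}
for chunk in pd.read_csv("Crimes.csv", chunksize=50_000):
    for iucr, n in chunk["IUCR"].value_counts().items():
        counts[iucr] = counts.get(iucr, 0) + n

print(sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:10])
```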
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘London bike sharing dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/hmavrodiev/london-bike-sharing-dataset on 12 November 2021.
--- Dataset description provided by original source is as follows ---
These licence terms and conditions apply to TfL's free transport data service and are based on version 2.0 of the Open Government Licence with specific amendments for Transport for London (the "Licence"). TfL may at any time revise this Licence without notice. It is up to you ("You") to regularly review the Licence, which will be available on this website, in case there are any changes. Your continued use of the transport data feeds You have opted to receive ("Information") after a change has been made to the Licence will be treated as Your acceptance of that change.
Using Information under this Licence TfL grants You a worldwide, royalty-free, perpetual, non-exclusive Licence to use the Information subject to the conditions below (as varied from time to time).
This Licence does not affect Your freedom under fair dealing or fair use or any other copyright or database right exceptions and limitations.
This Licence shall apply from the date of registration and shall continue for the period the Information is provided to You or You breach the Licence.
Rights You are free to:
- Copy, publish, distribute and transmit the Information
- Adapt the Information and
- Exploit the Information commercially and non-commercially, for example, by combining it with other Information, or by including it in Your own product or application
Requirements You must, where You do any of the above:
- Acknowledge TfL as the source of the Information by including the following attribution statement: 'Powered by TfL Open Data'
- Acknowledge that this Information contains Ordnance Survey derived data by including the following attribution statement: 'Contains OS data © Crown copyright and database rights 2016' and Geomni UK Map data © and database rights [2019]
- Ensure our intellectual property rights, including all logos, design rights, patents and trademarks, are protected by following our design and branding guidelines
- Limit traffic requests up to a maximum of 300 calls per minute per data feed. TfL reserves the right to throttle or limit access to feeds when it is believed the overall service is being degraded by excessive use and
- Ensure the information You provide on registration is accurate
These are important conditions of this Licence and if You fail to comply with them the rights granted to You under this Licence, or any similar licence granted by TfL, will end automatically.
Exemptions This Licence does not:
- Transfer any intellectual property rights in the Information to You or any third party
- Include personal data in the Information
- Provide any rights to use the Information after this Licence has ended
- Provide any rights to use any other intellectual property rights, including patents, trade marks, and design rights
or permit You to:
- Use data from the Oyster, Congestion Charging and Santander Cycles websites to populate or update any other software or database or
- Use any automated system, software or process to extract content and/or data, including trawling, data mining and screen scraping in relation to the Oyster, Congestion Charging and Santander Cycles websites, except where expressly permitted under a written licence agreement with TfL.
These are important conditions of this Licence and, if You fail to comply with them, the rights granted to You under this Licence, or any similar licence granted by TfL, will end automatically.
Non-endorsement This Licence does not grant You any right to use the Information in a way that suggests any official status or that TfL endorses You or Your use of the Information.
The purpose is to try to predict future bike shares.
The data is acquired from 3 sources:
- https://cycling.data.tfl.gov.uk/ 'Contains OS data © Crown copyright and database rights 2016' and Geomni UK Map data © and database rights [2019] 'Powered by TfL Open Data'
- freemeteo.com - weather data
- https://www.gov.uk/bank-holidays
From 1/1/2015 to 31/12/2016
The data from the cycling dataset are grouped by "Start time"; this represents the count of new bike shares per hour. Long-duration shares are not included in the count. (A short loading sketch follows the weather-code list below.)
"timestamp" - timestamp field for grouping the data
"cnt" - the count of a new bike shares
"t1" - real temperature in C
"t2" - temperature in C "feels like"
"hum" - humidity in percentage
"wind_speed" - wind speed in km/h
"weather_code" - category of the weather
"is_holiday" - boolean field - 1 holiday / 0 non holiday
"is_weekend" - boolean field - 1 if the day is weekend
"season" - category field meteorological seasons: 0-spring ; 1-summer; 2-fall; 3-winter.
"weathe_code" category description:
1 = Clear; mostly clear but may include haze / fog / patches of fog / fog in vicinity
2 = scattered clouds / few clouds
3 = Broken clouds
4 = Cloudy
7 = Rain/ light Rain shower/ Light rain
10 = rain with thunderstorm
26 = snowfall
94 = Freezing Fog
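As a rough illustration of how these fields fit together, here is a minimal pandas sketch; the file name ("london_merged.csv") and the integer casting are assumptions about the Kaggle copy of this dataset.

```python
import pandas as pd

# Decode the categorical fields described above and compute a simple summary.
weather = {1: "clear", 2: "scattered clouds", 3: "broken clouds", 4: "cloudy",
           7: "rain", 10: "rain with thunderstorm", 26: "snowfall", 94: "freezing fog"}
seasons = {0: "spring", 1: "summer", 2: "fall", 3: "winter"}

df = pd.read_csv("london_merged.csv", parse_dates=["timestamp"])
df["weather"] = df["weather_code"].astype(int).map(weather)      # codes stored numerically
df["season_name"] = df["season"].astype(int).map(seasons)

# Example use: average hourly shares on holidays vs. non-holidays.
print(df.groupby("is_holiday")["cnt"].mean())
```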
--- Original source retains full ownership of the source dataset ---
One of the lines of research of Sustaining the Knowledge Commons (SKC) is a longitudinal study of the minority (about a third) of fully open access journals that use the article processing charges (APC) business model. The original idea was to gather data during an annual two-week census period. The volume of data and growth in this area makes this an impractical goal. For this reason, we are posting this preliminary dataset in case it might be helpful to others working in this area. Future data gathering and analysis will be conducted on an ongoing basis. Major sources of data for this dataset include:
• the Directory of Open Access Journals (DOAJ) downloadable metadata; the base set is from May 2014, with some additional data from the 2015 dataset
• data on publisher article processing charges and related information gathered from publisher websites by the SKC team in 2015, 2014 (Morrison, Salhab, Calvé-Genest & Horava, 2015) and a 2013 pilot
• DOAJ article content data screen scraped from DOAJ (caution: this data can be quite misleading due to limitations with article-level metadata)
• subject analysis based on DOAJ subject metadata in 2014 for selected journals
• data on APCs gathered in 2010 by Solomon and Björk (supplied by the authors). Note that Solomon and Björk use a different method of calculating APCs, so the numbers are not directly comparable.
• Note that this full dataset includes some working columns which are meaningful only by means of explaining very specific calculations which are not necessarily evident in the dataset per se. Details below.
Significant limitations:
• This dataset does not include new journals added to DOAJ in 2015. A recent publisher size analysis indicates some significant changes. For example, DeGruyter, not listed in the 2014 survey, is now the third largest DOAJ publisher with over 200 titles. Elsevier is now the 7th largest DOAJ publisher. In both cases, gathering data from the publisher websites will be time-consuming as it is necessary to conduct individual title look-ups.
• Some OA APC data for newly added journals was gathered in May 2015 but has not yet been added to this dataset. One of the reasons for gathering this data is a comparison of the DOAJ "one price listed" approach with potentially richer data on the publisher's own website.
For full details see the documentation.
Exploring new ways to share information with each other is a cornerstone of improving the planning process. To do this, it is essential to have city-wide data in accessible formats. A variety of 3D digital information and models exist, but currently the data is not readily available to the general public. Providing a consistent city-wide 3D data source will link these digital city planning models and materials together and will allow us to imagine our city from different perspectives. The Open Data site will enable access to application developers, designers, urban planners and architects, and the public. Ideally this will enable the creation of a visual portal and access to a large collection of city building ideas. Further to the Open Government Licence, the Context Massing Model is being provided by City Planning on the Open Data website for information and illustrative purposes only. City Planning does not warranty the completeness, accuracy, content, or fitness for any precision purpose or use of the Context Massing Model for such purposes, nor are any such warranties to be implied or inferred with respect to the Context Massing Model as furnished on the website. City Planning and the City are not liable for any deficiencies in the completeness, accuracy, content, or fitness for any particular purpose or use of the Context Massing Model, or applications utilizing the Context Massing Model, provided by any third party. The Context Massing Model MUST BE VERIFIED BY THE USER FOR LEGAL OR OFFICIAL USE. Please use this Interactive Map to locate the 3D Massing tiles in SketchUp and AutoCAD formats. For further information, visit the Urban Design web site. A note on property assessments: MPAC (Municipal Property Assessment Corporation) holds copyright on many aspects of data around properties. The City of Toronto is unable to provide this data. From the MPAC website: MPAC's range of services includes preparing annual Assessment Rolls for use by municipalities and the Province to calculate property and education taxes. Assessment Maps and Ontario Parcel(TM): In 2005, MPAC, the Ontario Government and Teranet Enterprises Inc. completed the Ontario Parcel(TM) - an ambitious project that brings assessment, ownership and land parcel data for almost 4.6 million properties into a standardized digital database. ... The Ontario Parcel(TM) is available to Ontario municipalities, public organizations and private businesses. Among other things, the Ontario Parcel(TM) data can be applied to property assessment and taxation, land registration, land use planning, land management and business planning. With the implementation of the Ontario Parcel(TM) and the digital mapping environment, MPAC no longer produces paper assessment maps. If you would like more information about the products and services available under the Ontario Parcel(TM), please visit the Ontario Parcel(TM) website at www.ontarioparcel.ca. You will need to contact MPAC directly for data that you may perceive as missing. See the MPAC website.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”
A peer-reviewed data paper for this dataset is under review for publication in NECSUS_European Journal of Media Studies, an open access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org
Please cite this when using the dataset.
Detailed description of the dataset:
1 Film Dataset: Festival Programs
The Film Dataset consists of a data scheme image file, a codebook, and two dataset tables in csv format.
The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.
The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.
The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.
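To make the relationship between the long and wide tables concrete, the sketch below keeps only the first sampled festival per film; the column name "film_id" and the assumption that rows appear in sampling order are illustrative guesses, since the actual column names are documented in the codebook.

```python
import pandas as pd

# Collapsing the long table to one row per unique film approximates the wide file:
# the first retained row corresponds to the first sample festival ("fest").
long_df = pd.read_csv("1_film-dataset_festival-program_long.csv")
wide_like = long_df.drop_duplicates(subset="film_id", keep="first")  # assumed ID column
print(len(long_df), "festival appearances ->", len(wide_like), "unique films")
```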
2 Survey Dataset
The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.
The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.
The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.
The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.
3 IMDb & Scripts
The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.
The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.
The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.
The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.
The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.
The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to one crew member of a given film. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.
The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.
The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.
The csv file “3_imdb-dataset_release-info_long” contains data about non-festival releases (e.g., theatrical, digital, TV, DVD/Blu-ray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.
The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.
The dataset includes eight text files containing the scripts for web scraping. They were written using R version 3.6.3 for Windows.
The R script “r_1_unite_data” demonstrates the structure of the dataset that we use in the following steps to identify, scrape, and match the film data.
The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then, if no matches are found in the advanced search, using an alternative title and a basic search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of suggested records on the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, with a fuzzy matching approach using two methods, "cosine" and "osa": cosine similarity is used to match titles with a high degree of similarity, and the OSA algorithm is used to match titles that may have typos or minor variations.
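The original matching is implemented in R; as a language-neutral illustration of the two string measures named above, here is a minimal hand-rolled Python sketch (q-gram size and example titles are arbitrary).

```python
from collections import Counter
from math import sqrt

def cosine_sim(a, b, q=2):
    """Cosine similarity over character q-grams (the 'cosine' method)."""
    ga = Counter(a[i:i + q] for i in range(len(a) - q + 1))
    gb = Counter(b[i:i + q] for i in range(len(b) - q + 1))
    dot = sum(ga[g] * gb[g] for g in ga)
    norm = sqrt(sum(v * v for v in ga.values())) * sqrt(sum(v * v for v in gb.values()))
    return dot / norm if norm else 0.0

def osa_distance(a, b):
    """Optimal string alignment distance (the 'osa' method): edit distance
    allowing single adjacent transpositions."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

print(cosine_sim("the godfather", "the godfarther"))
print(osa_distance("godfather", "godfahter"))  # one transposition -> 1
```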
The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of the following five categories: a) 100% match: perfect match on title, year, and director; b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible duplicates in the dataset and identifies them for a manual check.
The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.
The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. This script does so for the first 100 films only, to check whether everything works. Scraping the entire dataset took a few hours, so a test with a subsample of 100 films is advisable.
The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.
The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.
The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the amount of missing values and errors.
4 Festival Library Dataset
The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.
The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location, festival name, and festival categories.
The NIST Extensible Resource Data Model (NERDm) is a set of schemas for encoding, in JSON format, metadata that describe digital resources. The variety of digital resources it can describe includes not only digital data sets and collections, but also software, digital services, web sites and portals, and digital twins. It was created to serve as the internal metadata format used by the NIST Public Data Repository and Science Portal to drive rich presentations on the web and to enable discovery; however, it was also designed to enable programmatic access to resources and their metadata by external users. Interoperability was also a key design aim: the schemas are defined using the JSON Schema standard, metadata are encoded as JSON-LD, and their semantics are tied to community ontologies, with an emphasis on DCAT and the US federal Project Open Data (POD) models. Finally, extensibility is also central to its design: the schemas are composed of a central core schema and various extension schemas. New extensions to support richer metadata concepts can be added over time without breaking existing applications.
Validation is central to NERDm's extensibility model. Consuming applications should be able to choose which metadata extensions they care to support and ignore terms and extensions they don't support. Furthermore, they should not fail when a NERDm document leverages extensions they don't recognize, even when on-the-fly validation is required. To support this flexibility, the NERDm framework allows documents to declare what extensions are being used and where. We have developed an optional extension to the standard JSON Schema validation (see ejsonschema below) to support flexible validation: while a standard JSON Schema validator can validate a NERDm document against the NERDm core schema, our extension will validate a NERDm document against any recognized extensions and ignore those that are not recognized.
The NERDm data model is based around the concept of a resource, semantically equivalent to a schema.org Resource, and as in schema.org, there can be different types of resources, such as data sets and software. A NERDm document indicates what types the resource qualifies as via the JSON-LD "@type" property. All NERDm Resources are described by metadata terms from the core NERDm schema; however, different resource types can be described by additional metadata properties (often drawing on particular NERDm extension schemas). A Resource contains Components of various types (including DCAT-defined Distributions) that are considered part of the Resource; specifically, these can include downloadable data files, hierarchical data collections, links to web sites (like software repositories), software tools, or other NERDm Resources. Through the NERDm extension system, domain-specific metadata can be included at either the resource or component level. The direct semantic and syntactic connections to the DCAT, POD, and schema.org schemas are intended to ensure unambiguous conversion of NERDm documents into those schemas.
As of this writing, the Core NERDm schema and its framework stand at version 0.7 and are compatible with the "draft-04" version of JSON Schema. Version 1.0 is projected to be released in 2025. In that release, the NERDm schemas will be updated to the "draft2020" version of JSON Schema. Other improvements will include stronger support for RDF and the Linked Data Platform through its support of JSON-LD.
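As an illustration of the JSON Schema validation workflow described above (using a toy schema, not the actual NERDm core schema), a standard validator such as the Python jsonschema package can be used as follows.

```python
import jsonschema

# Toy stand-in for a resource schema; the real NERDm core schema is far richer
# and is distributed by NIST. This only illustrates the validation step.
core_schema = {
    "$schema": "http://json-schema.org/draft-04/schema#",
    "type": "object",
    "required": ["@type", "title"],
    "properties": {
        "@type": {"type": "array", "items": {"type": "string"}},
        "title": {"type": "string"},
        "components": {"type": "array"},
    },
}

record = {
    "@type": ["nrdp:DataPublication", "dcat:Dataset"],   # illustrative type labels
    "title": "Example resource",
    "components": [{"downloadURL": "https://example.org/data.csv"}],
}

jsonschema.validate(record, core_schema)  # raises ValidationError on failure
print("document is valid against the toy core schema")
```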
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
United States agricultural researchers have many options for making their data available online. This dataset aggregates the primary sources of ag-related data and determines where researchers are likely to deposit their agricultural data. These data serve as both a current landscape analysis and also as a baseline for future studies of ag research data.
Purpose: As sources of agricultural data become more numerous and disparate, and collaboration and open data become more expected if not required, this research provides a landscape inventory of online sources of open agricultural data. An inventory of current agricultural data sharing options will help assess how the Ag Data Commons, a platform for USDA-funded data cataloging and publication, can best support data-intensive and multi-disciplinary research. It will also help agricultural librarians assist their researchers in data management and publication. The goals of this study were to:
- establish where agricultural researchers in the United States -- land grant and USDA researchers, primarily ARS, NRCS, USFS and other agencies -- currently publish their data, including general research data repositories, domain-specific databases, and the top journals
- compare how much data is in institutional vs. domain-specific vs. federal platforms
- determine which repositories are recommended by top journals that require or recommend the publication of supporting data
- ascertain where researchers not affiliated with funding or initiatives possessing a designated open data repository can publish data
Approach
The National Agricultural Library team focused on Agricultural Research Service (ARS), Natural Resources Conservation Service (NRCS), and United States Forest Service (USFS) style research data, rather than ag economics, statistics, and social sciences data. To find domain-specific, general, institutional, and federal agency repositories and databases that are open to US research submissions and have some amount of ag data, resources including re3data, libguides, and ARS lists were analysed. Primarily environmental or public health databases were not included, but places where ag grantees would publish data were considered.
Search methods
We first compiled a list of known domain specific USDA / ARS datasets / databases that are represented in the Ag Data Commons, including ARS Image Gallery, ARS Nutrition Databases (sub-components), SoyBase, PeanutBase, National Fungus Collection, i5K Workspace @ NAL, and GRIN. We then searched using search engines such as Bing and Google for non-USDA / federal ag databases, using Boolean variations of “agricultural data” /“ag data” / “scientific data” + NOT + USDA (to filter out the federal / USDA results). Most of these results were domain specific, though some contained a mix of data subjects.
We then used search engines such as Bing and Google to find top agricultural university repositories using variations of “agriculture”, “ag data” and “university” to find schools with agriculture programs. Using that list of universities, we searched each university web site to see if their institution had a repository for their unique, independent research data if not apparent in the initial web browser search. We found both ag specific university repositories and general university repositories that housed a portion of agricultural data. Ag specific university repositories are included in the list of domain-specific repositories. Results included Columbia University – International Research Institute for Climate and Society, UC Davis – Cover Crops Database, etc. If a general university repository existed, we determined whether that repository could filter to include only data results after our chosen ag search terms were applied. General university databases that contain ag data included Colorado State University Digital Collections, University of Michigan ICPSR (Inter-university Consortium for Political and Social Research), and University of Minnesota DRUM (Digital Repository of the University of Minnesota). We then split out NCBI (National Center for Biotechnology Information) repositories.
Next we searched the internet for open general data repositories using a variety of search engines, and repositories containing a mix of data, journals, books, and other types of records were tested to determine whether that repository could filter for data results after search terms were applied. General subject data repositories include Figshare, Open Science Framework, PANGEA, Protein Data Bank, and Zenodo.
Finally, we compared scholarly journal suggestions for data repositories against our list to fill in any missing repositories that might contain agricultural data. We compiled extensive lists of journals in which USDA published in 2012 and 2016 by combining search results from ARIS, Scopus, and the Forest Service's TreeSearch, plus the USDA web sites of the Economic Research Service (ERS), National Agricultural Statistics Service (NASS), Natural Resources and Conservation Service (NRCS), Food and Nutrition Service (FNS), Rural Development (RD), and Agricultural Marketing Service (AMS). The author instructions of the top 50 journals were consulted to see if they (a) ask or require submitters to provide supplemental data, or (b) require submitters to submit data to open repositories.
Data are provided for journals based on a 2012 and 2016 study of where USDA employees publish their research studies, ranked by number of articles, including 2015/2016 Impact Factor, author guidelines, Supplemental Data?, Supplemental Data reviewed?, Open Data (Supplemental or in Repository) Required?, and recommended data repositories, as provided in the online author guidelines for each of the top 50 journals.
Evaluation
We ran a series of searches on all resulting general subject databases with the designated search terms. From the results, we noted the total number of datasets in the repository, type of resource searched (datasets, data, images, components, etc.), percentage of the total database that each term comprised, any dataset with a search term that comprised at least 1% and 5% of the total collection, and any search term that returned greater than 100 and greater than 500 results.
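The screening arithmetic amounts to simple shares and thresholds; the sketch below uses invented numbers purely to illustrate the computation.

```python
# For each search term, compute the share of the repository's collection it returns
# and flag the 1%/5% and 100/500-result thresholds used above. Numbers are invented.
total_datasets = 12_000
hits = {"agriculture": 640, "crop": 130, "soil": 95}

for term, n in hits.items():
    share = n / total_datasets * 100
    print(f"{term}: {n} results, {share:.1f}% of collection, "
          f">=1%: {share >= 1}, >=5%: {share >= 5}, >100: {n > 100}, >500: {n > 500}")
```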
We compared domain-specific databases and repositories based on parent organization, type of institution, and whether data submissions were dependent on conditions such as funding or affiliation of some kind.
Results
A summary of the major findings from our data review:
Over half of the top 50 ag-related journals from our profile require or encourage open data for their published authors.
There are few general repositories that are both large AND contain a significant portion of ag data in their collection. GBIF (Global Biodiversity Information Facility), ICPSR, and ORNL DAAC were among those that had over 500 datasets returned with at least one ag search term and had that result comprise at least 5% of the total collection.
Not even one quarter of the domain-specific repositories and datasets reviewed allow open submission by any researcher regardless of funding or affiliation.
See the included README file for descriptions of each individual data file in this dataset. Resources in this dataset:
- Resource Title: Journals. File Name: Journals.csv
- Resource Title: Journals - Recommended repositories. File Name: Repos_from_journals.csv
- Resource Title: TDWG presentation. File Name: TDWG_Presentation.pptx
- Resource Title: Domain Specific ag data sources. File Name: domain_specific_ag_databases.csv
- Resource Title: Data Dictionary for Ag Data Repository Inventory. File Name: Ag_Data_Repo_DD.csv
- Resource Title: General repositories containing ag data. File Name: general_repos_1.csv
- Resource Title: README and file inventory. File Name: README_InventoryPublicDBandREepAgData.txt
Where exactly was that elementary school again that's closest to your home and that your children can easily reach without having to cross many streets? Can you reach your workplace entirely via bike paths? Will you have to wait at the construction site again next Sunday on the way to the sports field? There is a lot of data on the internet that can answer these and similar questions – but finding it is not always easy. OpenData.HRO is a web application that serves as a catalog for many useful datasets. The application is operated by the Hanseatic and University City of Rostock, which is also the owner and publisher of the data. You can use the application to search for, view, and download data for yourself and/or others. Depending on the type of dataset, OpenData.HRO also offers it as database content, providing you with some useful statistical and/or visualization tools. The present web application is based on the powerful open-source software CKAN, maintained and further developed by the Open Knowledge Foundation. Each dataset in CKAN consists of a description of the contained data as well as the data itself. The description includes important information such as the type of file formats in which the data is offered, the license under which it is provided, and the categories and subject areas to which it is assigned. The data and their descriptions can be updated or supplemented, with CKAN always recording all changes by means of automatic versioning. CKAN is used by a large number of data catalogs on the internet. The Data Hub, for example, is a publicly editable data catalog in the Wikipedia style. The British government uses CKAN to operate data.gov.uk – currently with approximately 8,000 government datasets. The official public data of most European countries are listed in the CKAN catalog on europeandataportal.eu. You can find a complete list of catalogs like this on dataportals.org, a page that is also operated with CKAN. Unless otherwise stated, the data on OpenData.HRO are subject to a free license. This means that you can freely use and exploit the data in compliance with the conditions set out in the terms of use (and they are anything but restrictive). Perhaps you would like to use the data on art in public spaces to build a smartphone app that helps to make a tour of Rostock culturally sophisticated? Go for it! Open Data promotes entrepreneurship, collaborative science, and transparent administration. You can learn more about Open Data in the Open Data Handbook. The Open Knowledge Foundation is a non-profit organization for the promotion of open knowledge: developing and improving CKAN is one of the ways to achieve this. If you would like to contribute to CKAN with design or code, you can join the developer mailing lists or visit the OKFN pages to learn more about CKAN and other projects. (Translated from the German original.)
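Because OpenData.HRO is a CKAN instance, its contents can also be queried programmatically through the standard CKAN action API; the catalogue base URL below is an assumption, and the same call works against any CKAN site such as data.gov.uk.

```python
import requests

# Query the standard CKAN package_search action for datasets matching a keyword.
BASE = "https://www.opendata-hro.de"  # assumed catalogue URL

resp = requests.get(f"{BASE}/api/3/action/package_search",
                    params={"q": "kunst", "rows": 5}, timeout=30)
resp.raise_for_status()
for pkg in resp.json()["result"]["results"]:
    print(pkg["name"], "-", pkg.get("title"))
```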
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains a collection of around 2,000 HTML pages: these web pages contain the search results returned for queries for different products, searched by a set of synthetic users surfing Google Shopping (US version) from different locations in July 2016.
Each file in the collection has a name that indicates the location from which the search was done, the user ID, and the searched product: no_email_LOCATION_USERID.PRODUCT.shopping_testing.#.html
The locations are Philippines (PHI), United States (US), India (IN). The userIDs: 26 to 30 for users searching from Philippines, 1 to 5 from US, 11 to 15 from India.
Products were chosen following 130 keywords (e.g., MP3 player, MP4 Watch, Personal organizer, Television, etc.).
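A small sketch for recovering the location, user ID and product from these file names follows; the folder name is a placeholder, and the trailing "#" is assumed to be a numeric run counter.

```python
import re
from pathlib import Path

# Parse names of the form no_email_LOCATION_USERID.PRODUCT.shopping_testing.#.html
pattern = re.compile(
    r"no_email_(?P<location>[A-Z]+)_(?P<user>\d+)\.(?P<product>.+?)"
    r"\.shopping_testing\.(?P<run>\d+)\.html$")

for path in Path("html_pages").glob("*.html"):  # folder name is a placeholder
    m = pattern.match(path.name)
    if m:
        print(m.group("location"), m.group("user"), m.group("product"))
```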
In the following, we describe how the search results have been collected.
Each user has a fresh profile. The creation of a new profile corresponds to launching a new, isolated web browser client instance and opening the Google Shopping US web page.
To mimic real users, the synthetic users can browse, scroll pages, stay on a page, and click on links.
A fully-fledged web browser is used to get the correct desktop version of the website under investigation. This is because websites could be designed to behave according to user agents, as witnessed by the differences between the mobile and desktop versions of the same website.
The prices are the retail ones displayed by Google Shopping in US dollars (thus, excluding shipping fees).
Several frameworks have been proposed for interacting with web browsers and analysing results from search engines. This research adopts OpenWPM. OpenWPM is automated with Selenium to efficiently create and manage different users with isolated Firefox and Chrome client instances, each of them with their own associated cookies.
The experiments ran for 24 hours on average. In each of them, the software runs on our local server, but the browser's traffic is redirected to the designated remote servers (e.g., to India) via tunneling through SOCKS proxies. This way, all commands are simultaneously distributed over all proxies. The experiments adopt the Mozilla Firefox browser (version 45.0) for the web browsing tasks and run under Ubuntu 14.04. Also, for each query, we consider the first page of results, counting 40 products. Among them, the focus of the experiments is mostly on the top 10 and top 3 results.
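The exact OpenWPM configuration is not reproduced here; the following stand-alone Selenium sketch only illustrates how a Firefox instance can be pointed at a SOCKS proxy, as in the tunnelling setup described above (host, port and target URL are placeholders).

```python
from selenium import webdriver

# Route a Firefox instance through a SOCKS proxy via browser preferences.
opts = webdriver.FirefoxOptions()
opts.set_preference("network.proxy.type", 1)             # manual proxy configuration
opts.set_preference("network.proxy.socks", "127.0.0.1")  # placeholder proxy host
opts.set_preference("network.proxy.socks_port", 9050)    # placeholder proxy port
opts.set_preference("network.proxy.socks_remote_dns", True)

driver = webdriver.Firefox(options=opts)
driver.get("https://www.google.com/shopping")  # placeholder target page
print(driver.title)
driver.quit()
```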
Due to connection errors, one of the Philippine profiles has no associated results. Also, for the Philippines, a few keywords did not lead to any results: videocassette recorders, totes, umbrellas. Similarly, for the US, there were no results for totes and umbrellas.
The search results have been analyzed in order to check whether there was evidence of price steering based on users' location.
One term of usage applies:
In any research product whose findings are based on this dataset, please cite
@inproceedings{DBLP:conf/ircdl/CozzaHPN19,
  author    = {Vittoria Cozza and Van Tien Hoang and Marinella Petrocchi and Rocco {De Nicola}},
  title     = {Transparency in Keyword Faceted Search: An Investigation on Google Shopping},
  booktitle = {Digital Libraries: Supporting Open Science - 15th Italian Research Conference on Digital Libraries, {IRCDL} 2019, Pisa, Italy, January 31 - February 1, 2019, Proceedings},
  pages     = {29--43},
  year      = {2019},
  crossref  = {DBLP:conf/ircdl/2019},
  url       = {https://doi.org/10.1007/978-3-030-11226-4_3},
  doi       = {10.1007/978-3-030-11226-4_3},
  timestamp = {Fri, 18 Jan 2019 23:22:50 +0100},
  biburl    = {https://dblp.org/rec/bib/conf/ircdl/CozzaHPN19},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
National Monuments Service - Archaeological Survey of Ireland. Published by the Department of Housing, Local Government and Heritage. Available under the licence Creative Commons Attribution 4.0 (CC-BY-4.0). This Archaeological Survey of Ireland dataset is published from the database of the National Monuments Service Sites and Monuments Record (SMR). This dataset can also be viewed and interrogated through the online Historic Environment Viewer: https://heritagedata.maps.arcgis.com/apps/webappviewer/index.html?id=0c9eb9575b544081b0d296436d8f60f8
A Sites and Monuments Record (SMR) was issued for all counties in the State between 1984 and 1992. The SMR is a manual containing a numbered list of certain and possible monuments accompanied by 6-inch Ordnance Survey maps (at a reduced scale). The SMR formed the basis for issuing the Record of Monuments and Places (RMP) - the statutory list of recorded monuments established under Section 12 of the National Monuments (Amendment) Act 1994. The RMP was issued for each county between 1995 and 1998 in a similar format to the existing SMR. The RMP differs from the earlier lists in that, as defined in the Act, only monuments with known locations or places where there are believed to be monuments are included.
The large Archaeological Survey of Ireland archive and supporting database are managed by the National Monuments Service and the records are continually updated and supplemented as additional monuments are discovered. On the Historic Environment viewer an area around each monument has been shaded, the scale of which varies with the class of monument. This area does not define the extent of the monument, nor does it define a buffer area beyond which ground disturbance should not take place – it merely identifies an area of land within which it is expected that the monument will be located. It is not a constraint area for screening – such must be set by the relevant authority who requires screening for their own purposes. This data has been released for download as Open Data under the DPER Open Data Strategy and is licensed for re-use under the Creative Commons Attribution 4.0 International licence. http://creativecommons.org/licenses/by/4.0
Please note that the centre point of each record is not indicative of the geographic extent of the monument. The existing point centroids were digitised relative to the OSI 6-inch mapping and the move from this older IG-referenced series to the larger-scale ITM mapping will necessitate revisions. The accuracy of the derived ITM co-ordinates is limited to the OS 6-inch scale and errors may ensue should the user apply the co-ordinates to larger scale maps. Records that do not refer to 'monuments' are designated 'Redundant record' and are retained in the archive as they may relate to features that were once considered to be monuments but which on investigation proved otherwise. Redundant records may also refer to duplicate records or errors in the data structure of the Archaeological Survey of Ireland.
This dataset is provided for re-use in a number of ways and the technical options are outlined below. For a live and current view of the data, please use the web services or the data extract tool in the Historic Environment Viewer. The National Monuments Service also provide an Open Data snapshot of its national dataset in CSV as a bulk data download. Users should consult the National Monument Service website https://www.archaeology.ie/ for further information and guidance on the National Monument Act(s) and the legal significance of this dataset.
Open Data Bulk Data Downloads (version date: 23/08/2023)
The Sites and Monuments Record (SMR) is provided as a national download in Comma Separated Value (CSV) format. This format can be easily integrated into a number of software clients for re-use and analysis. The Longitude and Latitude coordinates are also provided to aid its re-use in web mapping systems; however, the ITM easting/northing coordinates should be quoted for official purposes. ESRI Shapefiles of the SMR points and SMRZone polygons are also available. The SMRZones represent an area around each monument, the scale of which varies with the class of monument. This area does not define the extent of the monument, nor does it define a buffer area beyond which ground disturbance should not take place – it merely identifies an area of land within which it is expected that the monument will be located. It is not a constraint area for screening – such must be set by the relevant authority which requires screening for its own purposes.
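As a sketch of working with the bulk CSV, the supplied Longitude/Latitude values can be cross-checked against the official ITM coordinates with pyproj (EPSG:2157 is Irish Transverse Mercator, EPSG:4326 is WGS84); the file and column names below are assumptions about the download.

```python
import pandas as pd
from pyproj import Transformer

# Cross-check supplied Longitude/Latitude against the ITM easting/northing columns.
smr = pd.read_csv("SMR_national_download.csv")  # placeholder file name
to_wgs84 = Transformer.from_crs("EPSG:2157", "EPSG:4326", always_xy=True)
lon, lat = to_wgs84.transform(smr["ITM_E"].values, smr["ITM_N"].values)  # assumed columns
print(abs(lon - smr["LONGITUDE"]).max(), abs(lat - smr["LATITUDE"]).max())
```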
GIS Web Service APIs (live views):
For users with access to GIS software, please note that the Archaeological Survey of Ireland data is also available as spatial data web services. By accessing and consuming the web services, users are deemed to have accepted the Terms and Conditions. The web services are available at the URL endpoints advertised below:
SMR; https://services-eu1.arcgis.com/HyjXgkV6KGMSF3jt/arcgis/rest/services/SMROpenData/FeatureServer
SMRZone; https://services-eu1.arcgis.com/HyjXgkV6KGMSF3jt/arcgis/rest/services/SMRZoneOpenData/FeatureServer
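A live query against the advertised SMR endpoint can be issued with standard ArcGIS REST query parameters; the layer index (0) and the returned fields are assumptions about the service.

```python
import requests

# Request a handful of SMR features as JSON from the feature service above.
url = ("https://services-eu1.arcgis.com/HyjXgkV6KGMSF3jt/arcgis/rest/services/"
       "SMROpenData/FeatureServer/0/query")
params = {"where": "1=1", "outFields": "*", "resultRecordCount": 5, "f": "json"}
resp = requests.get(url, params=params, timeout=30)
resp.raise_for_status()
for feature in resp.json().get("features", []):
    print(feature["attributes"])
```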
Historic Environment Viewer - Query Tool
The "Query" tool can alternatively be used to selectively filter and download the data represented in the Historic Environment Viewer. The instructions for using this tool in the Historic Environment Viewer are detailed in the associated Help file: https://www.archaeology.ie/sites/default/files/media/pdf/HEV_UserGuide_v01.pdf...
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Entity alignment seeks to find entities in different knowledge graphs (KGs) that refer to the same real-world object. Recent advancement in KG embedding impels the advent of embedding-based entity alignment, which encodes entities in a continuous embedding space and measures entity similarities based on the learned embeddings. In this paper, we conduct a comprehensive experimental study of this emerging field. This study surveys 23 recent embedding-based entity alignment approaches and categorizes them based on their techniques and characteristics. We further observe that current approaches use different datasets in evaluation, and the degree distributions of entities in these datasets are inconsistent with real KGs. Hence, we propose a new KG sampling algorithm, with which we generate a set of dedicated benchmark datasets with various heterogeneity and distributions for a realistic evaluation. This study also produces an open-source library, which includes 12 representative embedding-based entity alignment approaches. We extensively evaluate these approaches on the generated datasets, to understand their strengths and limitations. Additionally, for several directions that have not been explored in current approaches, we perform exploratory experiments and report our preliminary findings for future studies. The benchmark datasets, open-source library and experimental results are all accessible online and will be duly maintained.
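To make the core idea concrete, the following toy sketch (not taken from the paper's open-source library) aligns entities by nearest neighbour under cosine similarity over randomly generated embeddings; real approaches learn these embeddings from the KGs.

```python
import numpy as np

# Toy embedding-based alignment: entities from two KGs embedded in the same
# space are aligned by nearest neighbour under cosine similarity.
rng = np.random.default_rng(0)
kg1 = rng.normal(size=(5, 16))   # 5 entities from KG1, 16-dim embeddings
kg2 = rng.normal(size=(7, 16))   # 7 entities from KG2

kg1 /= np.linalg.norm(kg1, axis=1, keepdims=True)
kg2 /= np.linalg.norm(kg2, axis=1, keepdims=True)
sim = kg1 @ kg2.T                # pairwise cosine similarities
alignment = sim.argmax(axis=1)   # best KG2 candidate for each KG1 entity
print(alignment, sim.max(axis=1))
```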
Welcome to Lincolnshire Open Data! Lincolnshire Open Data has been created to support Lincolnshire County Council's commitment to freeing up Lincolnshire's data. It's a place for the public, researchers and developers to access and analyse information about the county. We want citizens to be able to use the data that is held within this site - free of charge - in innovative ways. Lincolnshire Open Data is the place where Lincolnshire County Council's internal data will be made readily accessible. The council's datasets will be supplemented by datasets owned by other organisations that are thought to provide valuable data about Lincolnshire. This site contains a catalogue of data. You can search for, preview and download a wide range of data across a number of themes shown as Groups (see the Groups link at the top). A list of all datasets on the site is shown in the Open Datasets register. For more information on the council's approach to Open Data, please see this document: LCC's Approach to Open Data - Lincolnshire County Council.
Feedback: We are continually looking for ways to improve the content and experience on the site and would welcome your feedback. Please tell us what you think of this website using the Feedback form.
Suggest a New Dataset: We are committed to trying to publish the data you want to see! Suggest a new dataset you would like us to try to publish using the Request form. While not all data requests we receive can be published, we listen to people's feedback and suggestions for new datasets. If you would like an update on the progress of any datasets you requested, please contact us at opendata@lincolnshire.gov.uk.
Important note: The two forms above are not for Freedom of Information requests; those should instead be sent to CustomerInformationService@lincolnshire.gov.uk. More information about Freedom of Information requests is available on the Lincolnshire County Council website.
This dataset is the result of a full-population crawl of the .gov.uk web domain, aiming to capture a full picture of the scope of public-facing government activity online and the links between different government bodies. Local governments have been developing online services, aiming to better serve the public and reduce administrative costs. However, the impact of this work, and the links between governments' online and offline activities, remain uncertain. The overall research question examines whether local e-government has met these expectations, both of Digital Era Governance and of its practitioners. The aim was to directly analyse the structure and content of government online. The research shows that recent digital-centric public administration theories, typified by the Digital Era Governance quasi-paradigm, are not empirically supported by the UK local government experience. The data consist of a file of individual Uniform Resource Locators (URLs) fetched during the crawl, and a further file containing pairs of URLs reflecting the Hypertext Markup Language (HTML) links between them. In addition, a GraphML format file is presented for a version of the data reduced to third-level domains, with accompanying attribute data for the publishing government organisations and calculated webometric statistics based on the third-level-domain link network.
This project engages with the Digital Era Governance (DEG) work of Dunleavy et al. and draws upon new empirical methods to explore local government and its use of Internet-related technology. It challenges the existing literature, arguing that e-government benefits have been oversold, particularly for transactional services; it updates DEG with insights from local government. The distinctive methodological approach is to use full-population datasets and large-scale web data to provide an empirical foundation for theoretical development, and to test existing theorists' claims. A new full-population web crawl of .gov.uk is used to analyse the shape and structure of online government using webometrics. Tools from computer science, such as automated classification, are used to enrich our understanding of the dataset. A new full-population panel dataset is constructed covering council performance, cost, web quality, and satisfaction. The local government web shows a wide scope of provision but only limited evidence in support of the existing rhetorics of Internet-enabled service delivery. In addition, no evidence is found of a link between web development and performance, cost, or satisfaction. DEG is challenged and developed in light of these findings. The project adds value by developing new methods for the use of big data in public administration, by empirically challenging long-held assumptions on the value of the web for government, and by building a foundation of knowledge about local government online to be built on by further research. This is an ESRC-funded DPhil research project.
A web crawl was carried out with Heritrix, the Internet Archive's web crawler. A list of all registered domains in .gov.uk (and their www.x.gov.uk equivalents) was used as a set of start seeds. Sites outside .gov.uk were excluded; robots.txt files were respected, with the consequence that some .gov.uk sites (and some parts of other .gov.uk sites) were not fetched. Certain other areas were manually excluded, particularly crawling traps (e.g. calendars that serve infinite numbers of pages in the past and future, and websites returning different URLs for each browser session) and the contents of certain large peripheral databases such as online local authority library catalogues. A full set of regular expressions used to filter the URLs fetched is included in the archive. On completion of the crawl, the page URLs and link data were extracted from the output WARC files. The page URLs were manually examined and re-filtered to handle various broken web servers and to reduce duplication of content where multiple views were presented onto the same content (for example, where a site was presented at both http://organisation.gov.uk/ and http://www.organisation.gov.uk/ without HTTP redirection between the two). Finally, the link list was filtered against the URL list to remove bogus links, and both lists were map/reduced to a single set of files.
Also included in this data release is a derived dataset more useful for high-level work. This is a GraphML file containing all the link and page information reduced to third-level-domain level (so darlington.gov.uk is considered a single node, not a large set of pages) and with the links binarised to present/not present between each node. Each graph node also has various attributes, including the name of the registering organisation and various webometric measures including PageRank, indegree and betweenness centrality.
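The derived GraphML file lends itself to standard network analysis. A minimal sketch is shown below; the file name is an assumption, and the webometric measures mentioned above (PageRank, indegree, betweenness centrality) can be recomputed directly with networkx rather than read from the node attributes.

import networkx as nx

# Minimal sketch of exploring the third-level-domain GraphML file.
# The file name is assumed; use the name of the GraphML file in the archive.
G = nx.read_graphml("govuk_third_level_domains.graphml")
print(G.number_of_nodes(), "domains,", G.number_of_edges(), "links")

# Recompute some of the webometric measures described above.
pagerank = nx.pagerank(G)
betweenness = nx.betweenness_centrality(G)
indegree = dict(G.in_degree()) if G.is_directed() else dict(G.degree())

top = sorted(pagerank, key=pagerank.get, reverse=True)[:10]
for node in top:
    print(node, f"pagerank={pagerank[node]:.4f}", f"indegree={indegree[node]}")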
Phishing is a form of identity theft that occurs when a malicious website impersonates a legitimate one in order to acquire sensitive information such as passwords, account details, or credit card numbers. People tend to fall prey to this very easily, as attackers often craft these sites convincingly enough that visitors believe they are legitimate. There is therefore a need to identify potential phishing websites and differentiate them from legitimate ones. This dataset identifies the prominent features of phishing websites; 10 such features have been identified.
Generally, the open-source datasets available on the internet do not come with the code and the logic behind them, which gives rise to certain problems, i.e.:
On the contrary, we are trying to overcome all of the above-mentioned problems.
1. Real-Time Data: Before applying a machine learning algorithm, we can run the script and fetch real-time URLs from PhishTank (for phishing URLs) and from Moz (for legitimate URLs).
2. Scalable Data: We can also specify the number of URLs we want to feed to the model, and the web scraper will fetch that amount of data from the websites. Presently we are using 1401 URLs in this project, i.e. 901 phishing URLs and 500 legitimate URLs.
3. New Features: We have tried to implement the prominent new features that appear in current phishing URLs, and since we own the code, new features can also be added.
4. Source Code on GitHub: The source code is published on GitHub for public use and can be used for further improvements. This way there is transparency in the logic, and more creators can add their meaningful additions to the code.
https://github.com/akshaya1508/detection_of_phishing_websites.git
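As an illustration of the kind of URL-level features such a dataset can capture, here is a small sketch that derives a handful of simple lexical features from a URL. These particular features (URL length, presence of '@', use of a raw IP address, subdomain count, HTTPS flag) are chosen for illustration only and are not necessarily the 10 features used in this dataset or in the repository above.

import re
from urllib.parse import urlparse

# Illustrative sketch of extracting simple lexical features from a URL for
# phishing detection. The chosen features are examples, not the dataset's set.
def url_features(url: str) -> dict:
    parsed = urlparse(url)
    host = parsed.netloc.split(":")[0]  # drop any port
    return {
        "url_length": len(url),
        "has_at_symbol": "@" in url,
        "uses_ip_address": bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host)),
        "num_subdomains": max(host.count(".") - 1, 0),
        "uses_https": parsed.scheme == "https",
    }

print(url_features("http://192.168.0.1/login"))
print(url_features("https://accounts.example.com/signin"))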
The idea to develop the dataset and its accompanying code was inspired by various other creators who have worked along similar lines.
https://qdr.syr.edu/policies/qdr-standard-access-conditions
This is an Annotation for Transparent Inquiry (ATI) data project. The annotated article can be viewed on the Publisher's Website.
Data Generation
The research project engages a story about perceptions of fairness in criminal justice decisions. The specific focus involves a debate between ProPublica, a news organization, and Northpointe, the owner of a popular risk tool called COMPAS. ProPublica wrote that COMPAS was racist against blacks, while Northpointe posted online a reply rejecting such a finding. These two documents were the obvious foci of the qualitative analysis because of the further media attention they attracted, the confusion their competing conclusions caused readers, and the power both companies wield in public circles. There were no barriers to retrieval, as both documents have been publicly available on their corporate websites. This public access was one of the motivators for choosing them, as it meant that they were also easily attainable by the general public, thus extending the documents' reach and impact. Additional materials from ProPublica relating to the main debate were also freely downloadable from its website and a third-party, open-source platform. Access to secondary source materials comprising additional writings from Northpointe representatives that could assist in understanding Northpointe's main document, though, was more limited. Because of a claim of trade secrets on its tool and the underlying algorithm, it was more difficult to reach Northpointe's other reports. Nonetheless, largely because its clients are governmental bodies with transparency and accountability obligations, some Northpointe-associated reports were retrievable from third parties who had obtained them, largely through Freedom of Information Act queries. Together, the primary and (retrievable) secondary sources allowed for a triangulation of themes, arguments, and conclusions. The quantitative component uses a dataset of over 7,000 individuals with information that was collected and compiled by ProPublica and made available to the public on GitHub. ProPublica's gathering of the data directly from criminal justice officials via Freedom of Information Act requests placed the dataset in the public domain, and thus no confidentiality issues are present. The dataset was loaded into SPSS v. 25 for data analysis.
Data Analysis
The qualitative enquiry used critical discourse analysis, which investigates ways in which parties in their communications attempt to create, legitimate, rationalize, and control mutual understandings of important issues. Each of the two main discourse documents was parsed on its own merit. Yet the project was also intertextual in studying how the discourses correspond with each other and to other relevant writings by the same authors.
Several more specific types of discursive strategies were of interest in attracting further critical examination:
- Testing claims and rationalizations that appear to serve the speaker's self-interest
- Examining conclusions and determining whether sufficient evidence supported them
- Revealing contradictions and/or inconsistencies within the same text and intertextually
- Assessing strategies underlying justifications and rationalizations used to promote a party's assertions and arguments
- Noticing strategic deployment of lexical phrasings, syntax, and rhetoric
- Judging sincerity of voice and the objective consideration of alternative perspectives
Of equal importance in a critical discourse analysis is consideration of what is not addressed, that is, uncovering facts and/or topics missing from the communication. For this project, this included parsing issues that were either briefly mentioned and then neglected, asserted yet with their significance left unstated, or not suggested at all. This task required understanding common practices in the algorithmic data science literature. The paper could have been completed with just the critical discourse analysis. However, because one of its salient findings highlighted that the discourses overlooked numerous definitions of algorithmic fairness, the call to fill this gap seemed obvious. The availability of the same dataset used by the parties in conflict then made the opportunity more appealing: calculating additional algorithmic equity equations would not be troubled by irregularities arising from diverse sample sets. New variables were created as relevant to calculate algorithmic fairness equations. In addition to using various SPSS Analyze functions (e.g., regression, crosstabs, means), online statistical calculators were useful to compute z-test comparisons of proportions and t-test comparisons of means.
Logic of Annotation
Annotations were employed to fulfil a variety of functions, including supplementing the main text with context, observations, counter-points, analysis, and source attributions. These fall under a few categories. Space considerations: Critical discourse analysis offers a rich method...
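To make the quantitative step concrete, the sketch below shows one way an equity calculation and a z-test comparison of proportions could be reproduced in Python rather than SPSS. The column names and the choice of false positive rate parity are assumptions for illustration, not the project's exact variables or equations.

import math
import pandas as pd

# Illustrative sketch: compare false positive rates between two groups in a
# COMPAS-style dataset and run a two-proportion z-test.
# Column names ("group", "high_risk", "reoffended") are assumed for illustration.
def false_positives(df: pd.DataFrame, group: str):
    """Return (false positives, number of non-reoffenders) for one group."""
    sub = df[(df["group"] == group) & (df["reoffended"] == 0)]
    return (sub["high_risk"] == 1).sum(), len(sub)

def two_proportion_z(x1, n1, x2, n2):
    """Two-proportion z statistic using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

df = pd.read_csv("compas_scores.csv")  # hypothetical file name
x_a, n_a = false_positives(df, "group_a")
x_b, n_b = false_positives(df, "group_b")
print("FPR A:", x_a / n_a, "FPR B:", x_b / n_b)
print("z =", two_proportion_z(x_a, n_a, x_b, n_b))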
The New York State Energy Research and Development Authority (NYSERDA) hosts a web-based Distributed Energy Resources (DER) integrated data system at https://der.nyserda.ny.gov/. This site provides information on DERs that are funded by and report performance data to NYSERDA. Information on more diverse DER technologies is incorporated as it becomes available. Distributed energy resources (DER) are technologies that generate electricity or manage electricity demand at different points of the grid, such as at homes and businesses, instead of exclusively at power plants; they include Combined Heat and Power (CHP) Systems, Anaerobic Digester Gas (ADG)-to-Electricity Systems, Fuel Cell Systems, Energy Storage Systems, and Large Photovoltaic (PV) Solar Electric Systems (larger than 50 kW). Historical databases with hourly readings for each system are updated each night to include data from the previous day. The web interface allows users to view, plot, analyze, and download performance data from one or several different DER sites. Energy storage systems include all operational systems in New York, including projects not funded by NYSERDA; only NYSERDA-funded energy storage systems will have performance data available. The database is intended to provide detailed, accurate performance data that can be used by potential users, developers, and other stakeholders to understand the real-world performance of these technologies. For NYSERDA's performance-based programs, these data provide the basis for incentive payments to these sites. How does your organization use this dataset? What other NYSERDA or energy-related datasets would you like to see on Open NY? Let us know by emailing OpenNY@nyserda.ny.gov. The New York State Energy Research and Development Authority (NYSERDA) offers objective information and analysis, innovative programs, technical expertise, and support to help New Yorkers increase energy efficiency, save money, use renewable energy, and reduce reliance on fossil fuels. To learn more about NYSERDA's programs, visit https://nyserda.ny.gov or follow us on Twitter, Facebook, YouTube, or Instagram.
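Once hourly performance data have been downloaded from the site, they can be analysed with ordinary data tooling. The sketch below is a minimal example assuming a CSV export with a timestamp column and a power reading in kW; the file name and column names are assumptions, so use the headers of the actual export.

import pandas as pd
import matplotlib.pyplot as plt

# Minimal sketch of plotting hourly DER performance data from a downloaded CSV.
# File name and column names ("timestamp", "kw") are assumed for illustration.
readings = pd.read_csv("der_site_hourly.csv", parse_dates=["timestamp"])
readings = readings.set_index("timestamp").sort_index()

# Summing hourly average kW readings over a day approximates daily kWh.
daily_energy_kwh = readings["kw"].resample("D").sum()
daily_energy_kwh.plot(title="Estimated daily energy (kWh)")
plt.tight_layout()
plt.show()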
Introduction
The GiGL Open Space Friends Group subset provides locations and boundaries for selected open space sites in Greater London.
The chosen sites are those with established Friends Groups in Greater London and are therefore important to local communities, even if they are not accessible open spaces or don't typically function as destinations for leisure, activities and community engagement*.
Friends Groups are groups of interested local people who come together to protect, enhance and improve their local open space or spaces.
The dataset has been created by Greenspace Information for Greater London CIC (GiGL). As London’s Environmental Records Centre, GiGL mobilises, curates and shares data that underpin our knowledge of London’s natural environment. We provide impartial evidence to support informed discussion and decision making in policy and practice.
GiGL maps under licence from the Greater London Authority.
*Publicly accessible sites for leisure, activities and community engagement can be found in GiGL's Spaces to Visit dataset
Description
This dataset is a subset of the GiGL Open Space dataset, the most comprehensive dataset available of open spaces in London. Sites are selected for inclusion in the Friends Group subset based on whether there is a friends group recorded for the site in the Open Space dataset.
The dataset is a mapped Geographic Information System (GIS) polygon dataset where one polygon (or multi-polygon) represents one space. As well as site boundaries, the dataset includes information about a site’s name, size, access and type (e.g. park, playing field etc.) and the name and/or web address of the site’s friends group.
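For users working programmatically, the polygon dataset can be inspected with standard GIS tooling. Below is a minimal sketch assuming the data have been obtained as a GeoPackage or Shapefile; the file name is hypothetical and the attribute field names should be taken from the supplied metadata.

import geopandas as gpd

# Minimal sketch of loading the GiGL Friends Group polygon data.
# File name is assumed; field names should be read from the dataset itself.
sites = gpd.read_file("GiGL_OpenSpace_FriendsGroups.gpkg")
print(len(sites), "sites")
print(sites.columns.tolist())

# Area in hectares, computed from the polygon geometry (assumes a projected
# CRS such as British National Grid, EPSG:27700).
sites_bng = sites.to_crs(epsg=27700)
sites_bng["area_ha"] = sites_bng.geometry.area / 10_000
print(sites_bng[["area_ha"]].describe())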
GiGL developed the dataset to support anyone who is interested in identifying sites in London with friends groups - including friends groups and other community groups, web and app developers, policy makers and researchers - with an open licence data source. More detailed and extensive data are available under GiGL data use licences for GiGL partners, researchers and students. Information services are also available for ecological consultants, biological recorders, community groups and members of the public – please see www.gigl.org.uk for more information.
The dataset is updated on a quarterly basis. If you have questions about this dataset please contact GiGL’s GIS and Data Officer.
Data sources
The boundaries and information in this dataset are a combination of data collected during the London Survey Method habitat and open space survey programme (1986 – 2008) and information provided to GiGL from other sources since. These sources include London borough surveys, land use datasets, volunteer surveys, feedback from the public, park friends’ groups, and updates made as part of GiGL’s on-going data validation and verification process.
This is a preliminary version of the dataset as there is currently low coverage of friends groups in GiGL’s Open Space database. We are continually working on updating and improving this dataset. If you have any additional information or corrections for sites included in GiGL’s Friends Group subset please contact GiGL’s GIS and Data Officer.
NOTE: The dataset contains OS data © Crown copyright and database rights 2025. The site boundaries are based on Ordnance Survey mapping, and the data are published under Ordnance Survey's 'presumption to publish'. When using these data please acknowledge GiGL and Ordnance Survey as the source of the information using the following citation:
‘Dataset created by Greenspace Information for Greater London CIC (GiGL), 2025 – Contains Ordnance Survey and public sector information licensed under the Open Government Licence v3.0 ’
This data set contains financial assistance values, including the number of approved applications, as well as individual, public assistance, and hazard mitigation grant amounts.
This is raw, unedited data from FEMA's National Emergency Management Information System (NEMIS) and as such is subject to a small percentage of human error. The financial information is derived from NEMIS and not FEMA's official financial systems. Due to differences in reporting periods, status of obligations and how business rules are applied, this financial information may differ slightly from official publication on public websites such as usaspending.gov; this dataset is not intended to be used for any official federal financial reporting.
If you have media inquiries about this dataset, please email the FEMA News Desk at FEMA-News-Desk@dhs.gov or call (202) 646-3272. For inquiries about FEMA's data and Open Government program, please contact the OpenFEMA team via email at OpenFEMA@fema.dhs.gov.