23 datasets found
  1. Website Traffic Dataset

    • gts.ai
    json
    Updated Aug 23, 2024
    Cite
    GTS (2024). Website Traffic Dataset [Dataset]. https://gts.ai/dataset-download/website-traffic-dataset/
    Explore at:
    Available download formats: json
    Dataset updated
    Aug 23, 2024
    Dataset provided by
    GLOBOSE TECHNOLOGY SOLUTIONS PRIVATE LIMITED
    Authors
    GTS
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Explore our detailed website traffic dataset featuring key metrics like page views, session duration, bounce rate, traffic source, and conversion rates.

  2. Most popular paid Google Play Store apps India June 2021, by usage rank

    • statista.com
    Updated Dec 22, 2022
    Cite
    Statista (2022). Most popular paid Google Play Store apps India June 2021, by usage rank [Dataset]. https://www.statista.com/statistics/1247302/india-trending-paid-google-play-store-apps-by-usage-rank/
    Explore at:
    Dataset updated
    Dec 22, 2022
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    India
    Description

    According to research from SimilarWeb, 1DM+, a download manager app, led the list of trending paid apps on the Indian Google Play Store as of June 2021. uTorrent Pro followed at rank six during the same time period.

  3. falcon-refinedweb

    • huggingface.co
    • opendatalab.com
    Cite
    Technology Innovation Institute, falcon-refinedweb [Dataset]. http://doi.org/10.57967/hf/0737
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset authored and provided by
    Technology Innovation Institute
    License

    ODC-By 1.0 (https://choosealicense.com/licenses/odc-by/)

    Description

    📀 Falcon RefinedWeb

    Falcon RefinedWeb is a massive English web dataset built by TII and released under an ODC-By 1.0 license. See the 📓 paper on arXiv for more details. RefinedWeb is built through stringent filtering and large-scale deduplication of CommonCrawl; we found models trained on RefinedWeb to achieve performance in line with or better than models trained on curated datasets, while relying only on web data. RefinedWeb is also "multimodal-friendly": it contains links and alt… See the full description on the dataset page: https://huggingface.co/datasets/tiiuae/falcon-refinedweb.
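
    As a minimal usage sketch (not part of the dataset card), the corpus can be streamed from the Hugging Face Hub with the datasets library rather than downloaded in full; the record schema is inspected at runtime instead of being assumed:

      # Stream a single record from tiiuae/falcon-refinedweb and print its fields.
      from datasets import load_dataset

      ds = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)
      first = next(iter(ds))
      print(sorted(first.keys()))  # inspect the available columns before relying on them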

  4. Netflix

    • ieee-dataport.org
    Updated Oct 1, 2021
    Cite
    Danil Shamsimukhametov (2021). Netflix [Dataset]. https://ieee-dataport.org/documents/youtube-netflix-web-dataset-encrypted-traffic-classification
    Explore at:
    Dataset updated
    Oct 1, 2021
    Authors
    Danil Shamsimukhametov
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    YouTube
    Description

    YouTube flows

  5. Network traffic for machine learning classification

    • data.mendeley.com
    Updated Feb 12, 2020
    + more versions
    Cite
    Víctor Labayen Guembe (2020). Network traffic for machine learning classification [Dataset]. http://doi.org/10.17632/5pmnkshffm.1
    Explore at:
    Dataset updated
    Feb 12, 2020
    Authors
    Víctor Labayen Guembe
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset is a set of network traffic traces in pcap/csv format captured from a single user. The traffic is classified into 5 different activities (Video, Bulk, Idle, Web, and Interactive), and the label is indicated in the filename. There is also a file (mapping.csv) with the mapping between the host's IP address, the csv/pcap filename, and the activity label.

    Activities:

    Interactive: applications that perform real-time interactions to provide a suitable user experience, such as editing a file in Google Docs or remote CLI sessions over SSH.
    Bulk data transfer: applications that transfer large files over the network, for example SCP/FTP clients and direct downloads of large files from web servers such as Mediafire, Dropbox, or the university repository.
    Web browsing: all traffic generated while searching and consuming different web pages, for example several blogs, news sites, and the university's Moodle.
    Video playback: traffic from applications that consume video via streaming or pseudo-streaming. The best-known services used are Twitch and YouTube, but the university's online classroom has also been used.
    Idle behaviour: background traffic generated by the user's computer when the user is idle. This traffic was captured with every application closed and with some pages open (e.g., Google Docs, YouTube, and several other web pages), but always without user interaction.

    The capture is performed on a network probe attached, via a SPAN port, to the router that forwards the user's network traffic. The traffic is stored in pcap format with the full packet payload. In the csv files, every non-TCP/UDP packet is filtered out, as well as every packet with no payload. The fields in the csv files are the following (one line per packet): timestamp, protocol, payload size, source and destination IP addresses, and source and destination TCP/UDP ports. The fields are also included as a header in every csv file.

    The amount of data is stated as follows:

    Bulk: 19 traces, 3599 s of total duration, 8704 MBytes of pcap files
    Video: 23 traces, 4496 s, 1405 MBytes
    Web: 23 traces, 4203 s, 148 MBytes
    Interactive: 42 traces, 8934 s, 30.5 MBytes
    Idle: 52 traces, 6341 s, 0.69 MBytes
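
    A minimal loading sketch follows; the mapping.csv column names ("filename", "label") and the per-packet column name ("payload size") are assumptions for illustration, since the exact header strings are defined by the files themselves:

      # Sum payload bytes per activity across the csv traces listed in mapping.csv.
      import pandas as pd

      mapping = pd.read_csv("mapping.csv")  # host IP, csv/pcap filename, activity label

      totals = {}
      for _, row in mapping.iterrows():
          trace = pd.read_csv(row["filename"])  # one line per packet, header row included
          totals[row["label"]] = totals.get(row["label"], 0) + trace["payload size"].sum()

      print(totals)  # total payload bytes per activity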

  6. fineweb

    • huggingface.co
    Cite
    FineData, fineweb [Dataset]. http://doi.org/10.57967/hf/2493
    Explore at:
    Dataset authored and provided by
    FineData
    License

    ODC-By 1.0 (https://choosealicense.com/licenses/odc-by/)

    Description

    🍷 FineWeb

    15 trillion tokens of the finest data the 🌐 web has to offer

      What is it?
    

    The 🍷 FineWeb dataset consists of more than 18.5T tokens (originally 15T tokens) of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove library, our large-scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.

  7. Our set of 17 public web archives.

    • figshare.com
    xls
    Updated Jun 9, 2023
    Cite
    Mohamed Aturban; Martin Klein; Herbert Van de Sompel; Sawood Alam; Michael L. Nelson; Michele C. Weigle (2023). Our set of 17 public web archives. [Dataset]. http://doi.org/10.1371/journal.pone.0286879.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 9, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Mohamed Aturban; Martin Klein; Herbert Van de Sompel; Sawood Alam; Michael L. Nelson; Michele C. Weigle
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Web archives, such as the Internet Archive, preserve the web and allow access to prior states of web pages. We implicitly trust their versions of archived pages, but as their role moves from preserving curios of the past to facilitating present day adjudication, we are concerned with verifying the fixity of archived web pages, or mementos, to ensure they have always remained unaltered. A widely used technique in digital preservation to verify the fixity of an archived resource is to periodically compute a cryptographic hash value on a resource and then compare it with a previous hash value. If the hash values generated on the same resource are identical, then the fixity of the resource is verified. We tested this process by conducting a study on 16,627 mementos from 17 public web archives. We replayed and downloaded the mementos 39 times using a headless browser over a period of 442 days and generated a hash for each memento after each download, resulting in 39 hashes per memento. The hash is calculated by including not only the content of the base HTML of a memento but also all embedded resources, such as images and style sheets. We expected to always observe the same hash for a memento regardless of the number of downloads. However, our results indicate that 88.45% of mementos produce more than one unique hash value, and about 16% (or one in six) of those mementos always produce different hash values. We identify and quantify the types of changes that cause the same memento to produce different hashes. These results point to the need for defining an archive-aware hashing function, as conventional hashing functions are not suitable for replayed archived web pages.
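
    The fixity check described above can be illustrated with a short sketch (not the authors' code): an aggregate hash over the base HTML of a memento and its embedded resources, where the URLs are supplied by the caller.

      import hashlib
      import requests

      def memento_hash(base_url, embedded_urls):
          """Aggregate SHA-256 over a memento's base HTML plus its embedded
          resources (images, style sheets, ...)."""
          digest = hashlib.sha256()
          for url in [base_url, *embedded_urls]:
              resp = requests.get(url, timeout=30)
              digest.update(resp.content)
          return digest.hexdigest()

      # Replaying the same memento twice should yield identical hashes; the study
      # found that for 88.45% of mementos it does not.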

  8. ckanext-ga-report - Extensions - CKAN Ecosystem Catalog

    • catalog.civicdataecosystem.org
    Updated Jun 4, 2025
    Cite
    (2025). ckanext-ga-report - Extensions - CKAN Ecosystem Catalog [Dataset]. https://catalog.civicdataecosystem.org/dataset/ckanext-ga-report
    Explore at:
    Dataset updated
    Jun 4, 2025
    Description

    The ga-report extension for CKAN is designed to provide detailed Google Analytics reports, including totals per group, to site managers. Unlike other extensions that focus on providing real-time page view statistics for end-users, this extension focuses on building regular, periodic reports to monitor site usage and performance. It enables site administrators to download Google Analytics data into CKAN's database tables and view the data as web page reports, thereby facilitating informed decision-making and resource allocation.

    Key Features:
    - Google Analytics Data Download: Enables users to download Google Analytics data for specified time periods using a CLI tool, storing this data within the extension's database tables.
    - Web Page Reports: Allows users to view the downloaded Google Analytics data as web page reports accessible through the CKAN interface.
    - Periodic Report Generation: Emphasizes the creation of regular, periodic reports instead of focusing solely on real-time analytics, offering a historical perspective on site traffic and engagement.
    - Bounce Rate Tracking: Specifies a particular URL (typically the homepage) to record and analyze bounce rates for, enabling optimization of landing pages.
    - Data Retrieval Customization: Supports retrieving data for all time, the latest available data, or data from a specific date, providing flexibility in data analysis.

    Technical Integration: Requires setting up Google Analytics and obtaining API credentials that are then used to access Google Analytics data. This involves enabling the "Analytics API" in the Google APIs Console and creating an OAuth 2.0 client ID and secret. The extension utilizes a credentials.json file to store authentication details, allowing the CKAN instance to securely access Google Analytics. The location of the generated token.dat authentication token is specified in the CKAN configuration file (development.ini or similar). The extension's database tables are initialized using a paster command, which ensures that the required data structures are set up within CKAN's database to store the Google Analytics data.

    Benefits & Impact: By providing a mechanism for regular Google Analytics reporting, the ga-report extension assists CKAN site managers in monitoring trends, identifying areas for improvement, and making data-driven decisions to optimize site performance. The ability to download and store Google Analytics data within CKAN also allows for more in-depth analysis and integration with other data sources.

  9. Iowa BMP Mapping Project Data Download Website

    • data.amerigeoss.org
    html
    Updated Oct 18, 2024
    + more versions
    Cite
    AmericaView (2024). Iowa BMP Mapping Project Data Download Website [Dataset]. https://data.amerigeoss.org/tl/dataset/groups/iowa-bmp-mapping-project-data-download-website
    Explore at:
    Available download formats: html
    Dataset updated
    Oct 18, 2024
    Dataset provided by
    AmericaView
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This web app allows users to search a map of Iowa for conservation practice data by HUC 12 watershed and then download it as a pdf or geodatabase.

  10. Evaluating Web Table Annotation Methods: From Entity Lookups to Entity...

    • springernature.figshare.com
    application/gzip
    Updated May 30, 2023
    Cite
    Vasilis Efthymiou; Oktie Hassanzadeh; Mariano Rodríguez-Muro; Vassilis Christophides (2023). Evaluating Web Table Annotation Methods: From Entity Lookups to Entity Embeddings [Dataset]. http://doi.org/10.6084/m9.figshare.5229847.v1
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Vasilis Efthymiou; Oktie Hassanzadeh; Mariano Rodríguez-Muro; Vassilis Christophides
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data sets used for experimental evaluation in the related publication: Evaluating Web Table Annotation Methods: From Entity Lookups to Entity Embeddings. The data sets are contained within archive folders corresponding to the three gold standard data sets used in the related publication. Each is presented in both .csv and .json formats.

    The gold standard data sets are collections of web tables:
    T2D consists of a schema-level gold standard of 1,748 Web tables, manually annotated with class- and property-mappings, as well as an entity-level gold standard of 233 Web tables.
    Limaye consists of 400 manually annotated Web tables with entity-, class-, and property-level correspondences, where single cells (not rows) are mapped to entities. The corrected version of this gold standard is adapted to annotate rows with entities, from the annotations of the label column cells.
    WikipediaGS is an instance-level gold standard developed from 485K Wikipedia tables, in which links in the label column are used to infer the annotation of a row to a DBpedia entity.

    Data format

    CSV: The .csv files are formatted as double-quoted ('"') fields, separated by commas (','). In the tables files, each file corresponds to one table, each field represents a column, and each line represents a different row. In the entities files, there are only three fields: "DBpedia uri","cell string","row number", representing the correct annotation, the string of the label column cell, and the row (starting from 0) in which this mapping is found, respectively. Tables and entities files that correspond to the same table have the same filename. The same formatting and naming convention is used in the T2D gold standard (http://webdatacommons.org/webtables/goldstandard.html).

    JSON: Each line in a .json file corresponds to a table, written as a JSONObject. T2D and Limaye tables files contain only one line (table) per file, while the Wikipedia gold standard contains multiple lines (tables) per .json file. In T2D and Limaye, the entity mappings of those tables can be found in the entities files with the same filename, while in Wikipedia, the entity mappings of each table can be found in the line of the entities files that has the same "tableId" field as the corresponding table. The contents of a table in .json are given as a two-dimensional array (a JSONArray of JSONArrays), called "contents". Each JSONArray in the contents represents a table row. Each element of this array is a JSONObject, representing one cell of the row. The field "data" of each cell contains the cell's string contents, while there may also be a field "isHeader" to denote whether the current cell is in a header row. In the Wikipedia gold standard there may also be a "wikiPageId" field, denoting the existing hyperlink of this cell to a Wikipedia page. It only contains the suffix of a Wikipedia URL, omitting the first part "https://en.wikipedia.org/wiki/". The entity mappings files are in the same format as in csv: ["DBpedia uri","cell string",row number] inside the "mappings" field of a json file.

    Note on license: please refer to the README.txt. Data is derived from Wikipedia and other sources that may have different licenses.
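
    A minimal sketch of reading the JSON format described above (file names are placeholders; T2D/Limaye-style files with one table per line are assumed):

      # Print each annotated row of one table next to its DBpedia entity.
      import json

      with open("table_0001.json") as f:       # tables file: one JSONObject per line
          table = json.loads(f.readline())

      with open("entities_0001.json") as f:    # entities file with the same filename
          entities = json.loads(f.readline())

      rows = table["contents"]                 # JSONArray of rows; each cell is an object
      for uri, cell_string, row_number in entities["mappings"]:
          cells = [cell["data"] for cell in rows[row_number]]
          print(uri, "<-", cell_string, "| row:", cells)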

    Wikipedia contents can be shared under the terms of Creative Commons Attribution-ShareAlike License as outlined on Wikipedia: https://en.wikipedia.org/wiki/Wikipedia:Reusing_Wikipedia_content

    The correspondences of the T2D gold standard are provided under the terms of the Apache license. The Web tables are provided according to the same terms of use, disclaimer of warranties, and limitation of liabilities that apply to the Common Crawl corpus. The DBpedia subset is licensed under the terms of the Creative Commons Attribution-ShareAlike License and the GNU Free Documentation License that applies to DBpedia. The Limaye gold standard was downloaded from http://websail-fe.cs.northwestern.edu/TabEL/ (download date: August 25, 2016). Please refer to the original website and the following paper for more details and citation information: G. Limaye, S. Sarawagi, and S. Chakrabarti. Annotating and Searching Web Tables Using Entities, Types and Relationships. PVLDB, 3(1):1338–1347, 2010.

    Also: THIS DATA IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

  11. Download module

    • ohiostatefreightplan-ohiodot.hub.arcgis.com
    Updated Jun 17, 2022
    Cite
    Ohio Department of Transportation (2022). Download module [Dataset]. https://ohiostatefreightplan-ohiodot.hub.arcgis.com/datasets/ohiodot::download-module
    Explore at:
    Dataset updated
    Jun 17, 2022
    Dataset authored and provided by
    Ohio Department of Transportation
    Description

    A set of bar graphs representing the Tonnage by Industry, Value by Industry, Tonnage by Freight Mode, and Value of Freight Mode. There are two different sets of data: ODOT Districts and JobsOhio Regions. The graphs can be filtered further by: The direction of import, export or within the state; ODOT District/JobsOhio Region; and Second ODOT District/JobsOhio Region or State. Each of the datasets can be downloaded in table form.

  12. Octo Browser: Your Ultimate Web Browsing Solution

    • kaggle.com
    Updated Apr 15, 2025
    Cite
    tahir tabassum (2025). Octo Browser: Your Ultimate Web Browsing Solution [Dataset]. https://www.kaggle.com/datasets/tahirtabassum/octo-browser-your-ultimate-web-browsing-solution/code
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Apr 15, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    tahir tabassum
    Description


    Octo Browser: Your Ultimate Web Browsing Solution
    In today's digital world, having a good web browser is key. The Octo Browser is here to help. It offers a top-notch browsing experience unlike any other.

    This browser has cool features and is easy to use. It's perfect for anyone, whether you're just browsing or need it for work. It's made to make your online time better.
    The Octo Browser uses the latest tech. It loads pages quickly, keeps you safe, and is easy to get around. It's the best choice for anyone looking for a great browser.

    Key Takeaways
    - Advanced features for a seamless browsing experience
    - Robust security to protect your online activities
    - Fast page loading and intuitive navigation
    - User-centric design for enhanced usability
    - Ideal for both casual users and professionals

    Introducing Octo Browser
    Octo Browser is changing how we browse the web. It's a top-notch web browser that makes browsing fast. It's perfect for those who want quick and reliable results.
    Octo Browser has cool features that make browsing better. It's easy to use and works great.

    Key Features at a Glance
    Octo Browser has some key features:
    - High-speed page loading
    - Advanced security protocols
    - Intuitive interface design

    These features make browsing smooth and safe. Experts say it's a game-changer:
    "Octo Browser's blend of speed and security sets a new standard in the world of web browsers."

    How Octo Browser Stands Out
    Octo Browser is different because it focuses on speed and security. It offers a better browsing experience than others.

    Blazing-Fast Performance
    Octo Browser brings you the future of web browsing with its lightning-fast speed. It uses a top-notch rendering engine and smart resource management.

    Optimized Rendering Engine
    Octo Browser has an optimized rendering engine that makes pages load much faster. This means you can quickly move through your favorite websites.

    Efficient Resource Management
    The browser's efficient resource management makes sure your system runs smoothly. It prevents slowdowns and crashes. Key features include:
    - Intelligent memory allocation
    - Background process optimization
    - Prioritization of active tabs

    Speed Comparison with Leading Browsers
    Octo Browser is the fastest among leading browsers. Here's why:
    - Loads pages up to 30% faster than the average browser
    - Maintains speed even with multiple tabs open
    - Outperforms competitors in both JavaScript and page rendering tests

    Uncompromising Security and Privacy
    In today's digital world, security and privacy are key. Octo Browser is built with these in mind. It's a secure browser that protects your data from cyber threats.
    Octo Browser is all about keeping your online activities safe. It has strong features to do just that.

    Built-in Privacy Protection
    Octo Browser has privacy features to keep your browsing private. It stops tracking and profiling, so your habits stay hidden.
    It uses advanced anti-tracking tech. This blocks third-party cookies and other tracking tools.

    Advanced Data Encryption
    Data encryption is vital for online safety. Octo Browser uses advanced encryption protocols to secure your data.
    This means your data is safe from unauthorized access. It's protected when you send or store it.

    Automatic Security Updates
    Octo Browser also has automatic security updates. This keeps your browser current with the latest security fixes.
    This way, you're always safe from new threats. You don't have to manually update the browser.

    Seamless User Experience
    Octo Browser is designed with the user in mind. It offers a seamless user experience. This means users can easily explore their favorite websites.

    Intuitive Interface Design
    The Octo Browser has an intuitive interface design. It's easy to use and navigate. The layout is clean and simple, focusing on your browsing experience.

    Extensive Customization Options
    Octo Browser gives you extensive customization options. You can personalize your browsing experience. Choose from various themes, customize toolbar layouts, and more.
    - Choose from multiple theme options
    - Customize toolbar layouts
    - Personalize your browsing experience

    Cross-Device Synchronization
    Octo Browser's cross-device synchronization lets you access your data on different devices. This means you ...

  13. 3D Maps

    • dataone.org
    Updated Aug 9, 2016
    Cite
    Campbell, Karen (https://www.linkedin.com/in/karen-campbell-1336965); Morin, Paul (2016). 3D Maps [Dataset]. https://dataone.org/datasets/seadva-20ef8e4e-12fd-4244-be19-7a79c827e85f
    Explore at:
    Dataset updated
    Aug 9, 2016
    Dataset provided by
    SEAD Virtual Archive
    Authors
    Campbell, Karen (https://www.linkedin.com/in/karen-campbell-1336965); Morin, Paul
    Description

    NCED is currently researching the effectiveness of anaglyph maps in the classroom and is working with educators and scientists to interpret various Earth-surface processes. Based on the findings of the research, various activities and interpretive information will be developed and made available for educators to use in their classrooms. Keep checking back with this website because activities and maps are always being updated. We believe that anaglyph maps are an important tool in helping students see the world and are working to further develop materials and activities to support educators in their use of the maps.

    This website has various 3-D maps and supporting materials that are available for download. Maps can be printed, viewed on computer monitors, or projected on to screens for larger audiences. Keep an eye on our website for more maps, activities and new information. Let us know how you use anaglyph maps in your classroom. Email any ideas or activities you have to ncedmaps@umn.edu

    Anaglyph paper maps are a cost-effective offshoot of the GeoWall Project. GeoWall is a high-end visualization tool developed for use in the University of Minnesota's Geology and Geophysics Department. Because of its effectiveness, it has been adopted by 300 institutions across the United States. GeoWall projects 3-D images and allows students to see 3-D representations, but its use is limited by the technology required. Paper maps are a cost-effective solution that allows anaglyph technology to be used in classroom and field-based applications.

    Maps are best when viewed with RED/CYAN anaglyph glasses!

    A note on downloading: "viewable" maps are .jpg files; "high-quality downloads" are .tif files. While it is possible to view the latter in a web-browser in most cases, the download may be slow. As an alternative, try right-clicking on the link to the high-quality download and choosing "save" from the pop-up menu that results. Save the file to your own machine, then try opening the saved copy. This may be faster than clicking directly on the link to open it in the browser.

    World Map: 3-D map that highlights oceanic bathymetry and plate boundaries.

    Continental United States: 3-D grayscale map of the Lower 48.

    Western United States: 3-D grayscale map of the Western United States with state boundaries.

    Regional Map: 3-D greyscale map stretching from Hudson Bay to the Central Great Plains. This map includes the Western Great Lakes and the Canadian Shield.

    Minnesota Map: 3-D greyscale map of Minnesota with county and state boundaries.

    Twin Cities: 3-D map extending beyond Minneapolis and St. Paul.

    Twin Cities Confluence Map: 3-D map highlighting the confluence of the Mississippi and Minnesota Rivers. This map includes most of Minneapolis and St. Paul.

    Minneapolis, MN: 3-D topographical map of South Minneapolis.

    Bassets Creek, Minneapolis: 3-D topographical map of the Bassets Creek watershed.

    North Minneapolis: 3-D topographical map highlighting North Minneapolis and the Mississippi River.

    St. Paul, MN: 3-D topographical map of St. Paul.

    Western Suburbs, Twin Cities: 3-D topographical map of St. Louis Park, Hopkins and Minnetonka area.

    Minnesota River Valley Suburbs, Twin Cities: 3-D topographical map of Bloomington, Eden Prairie and Edina area.

    Southern Suburbs, Twin Cities: 3-D topographical map of Burnsville, Lakeville and Prior Lake area.

    Southeast Suburbs, Twin Cities: 3-D topographical map of South St. Paul, Mendota Heights, Apple Valley and Eagan area.

    Northeast Suburbs, Twin Cities: 3-D topographical map of White Bear Lake, Maplewood and Roseville area.

    Northwest Suburbs, Mississippi River, Twin Cities: 3-D topographical map of North Minneapolis, Brooklyn Center and Maple Grove area.

    Blaine, MN: 3-D map of Blaine and the Mississippi River.

    White Bear Lake, MN: 3-D topographical map of White Bear Lake and the surrounding area.

    Maple Grove, MN: 3-D topographical map of the NW suburbs of the Twin Cities.

  14. Web Graphs

    • networkrepository.com
    csv
    Updated Dec 31, 2016
    Cite
    Network Data Repository (2016). Web Graphs [Dataset]. https://networkrepository.com/web.php
    Explore at:
    Available download formats: csv
    Dataset updated
    Dec 31, 2016
    Dataset authored and provided by
    Network Data Repository
    License

    https://networkrepository.com/policy.php

    Description

    Keywords: web graphs, Google graphs, hyperlink graphs, webpage graphs, web networks (available for download).

  15. WSDL file download failure status.

    • plos.figshare.com
    xls
    Updated Jun 1, 2023
    Cite
    Yang Song (2023). WSDL file download failure status. [Dataset]. http://doi.org/10.1371/journal.pone.0242089.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Yang Song
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    WSDL file download failure status.

  16. Data from: Repository Analytics and Metrics Portal (RAMP) 2021 data

    • data.niaid.nih.gov
    • zenodo.org
    • +1more
    zip
    Updated May 23, 2023
    Cite
    Jonathan Wheeler; Kenning Arlitsch (2023). Repository Analytics and Metrics Portal (RAMP) 2021 data [Dataset]. http://doi.org/10.5061/dryad.1rn8pk0tz
    Explore at:
    Available download formats: zip
    Dataset updated
    May 23, 2023
    Dataset provided by
    Montana State University
    University of New Mexico
    Authors
    Jonathan Wheeler; Kenning Arlitsch
    License

    CC0 1.0 (https://spdx.org/licenses/CC0-1.0.html)

    Description

    The Repository Analytics and Metrics Portal (RAMP) is a web service that aggregates use and performance data for institutional repositories. The data presented here are a subset of the data from RAMP (http://rampanalytics.org), consisting of data from all participating repositories for the calendar year 2021. For a description of the data collection, processing, and output methods, please see the "methods" section below.

    The record will be revised periodically to make new data available through the remainder of 2021.

    Methods

    Data Collection

    RAMP data are downloaded for participating IR from Google Search Console (GSC) via the Search Console API. The data consist of aggregated information about IR pages which appeared in search result pages (SERP) within Google properties (including web search and Google Scholar).

    Data are downloaded in two sets per participating IR. The first set includes page level statistics about URLs pointing to IR pages and content files. The following fields are downloaded for each URL, with one row per URL:

    url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property.
    impressions: The number of times the URL appears within the SERP.
    clicks: The number of clicks on a URL which took users to a page outside of the SERP.
    clickThrough: Calculated as the number of clicks divided by the number of impressions.
    position: The position of the URL within the SERP.
    date: The date of the search.
    

    Following the data processing described below, an additional field, citableContent, is added to the page level data on ingest into RAMP.

    The second set includes similar information, but instead of being aggregated at the page level, the data are grouped based on the country from which the user submitted the corresponding search and the type of device used. The following fields are downloaded for each combination of country and device, with one row per country/device combination:

    country: The country from which the corresponding search originated.
    device: The device used for the search.
    impressions: The number of times the URL appears within the SERP.
    clicks: The number of clicks on a URL which took users to a page outside of the SERP.
    clickThrough: Calculated as the number of clicks divided by the number of impressions.
    position: The position of the URL within the SERP.
    date: The date of the search.
    

    Note that no personally identifiable information is downloaded by RAMP. Google does not make such information available.

    More information about click-through rates, impressions, and position is available from Google's Search Console API documentation: https://developers.google.com/webmaster-tools/search-console-api-original/v3/searchanalytics/query and https://support.google.com/webmasters/answer/7042828?hl=en
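
    For orientation, a page-level query of the kind described above can be sketched with google-api-python-client; the site URL and credentials are placeholders, and RAMP's own harvester may structure its requests differently.

      # Illustrative Search Console query returning clicks, impressions, ctr, and position per page.
      from googleapiclient.discovery import build

      def page_level_stats(credentials, site_url, start_date, end_date):
          service = build("webmasters", "v3", credentials=credentials)
          body = {
              "startDate": start_date,         # e.g. "2021-01-01"
              "endDate": end_date,             # e.g. "2021-01-31"
              "dimensions": ["page", "date"],  # one row per URL per day
              "rowLimit": 25000,
          }
          response = service.searchanalytics().query(siteUrl=site_url, body=body).execute()
          return response.get("rows", [])      # each row: keys (page, date), clicks, impressions, ctr, position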

    Data Processing

    Upon download from GSC, the page level data described above are processed to identify URLs that point to citable content. Citable content is defined within RAMP as any URL which points to any type of non-HTML content file (PDF, CSV, etc.). As part of the daily download of page level statistics from Google Search Console (GSC), URLs are analyzed to determine whether they point to HTML pages or actual content files. URLs that point to content files are flagged as "citable content." In addition to the fields downloaded from GSC described above, following this brief analysis one more field, citableContent, is added to the page level data which records whether each page/URL in the GSC data points to citable content. Possible values for the citableContent field are "Yes" and "No."
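
    A simplified version of that flagging step might look like the following (illustrative only; RAMP's actual implementation and extension list may differ):

      from urllib.parse import urlparse

      # Extensions treated here as non-HTML content files; the real list may differ.
      CONTENT_EXTENSIONS = {".pdf", ".csv", ".zip", ".docx", ".xlsx", ".pptx", ".txt", ".xml"}

      def citable_content(url: str) -> str:
          """Return "Yes" if the URL appears to point to a content file rather than
          an HTML wrapper page, "No" otherwise."""
          path = urlparse(url).path.lower()
          return "Yes" if any(path.endswith(ext) for ext in CONTENT_EXTENSIONS) else "No"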

    The data aggregated by the search country of origin and device type do not include URLs. No additional processing is done on these data. Harvested data are passed directly into Elasticsearch.

    Processed data are then saved in a series of Elasticsearch indices. Currently, RAMP stores data in two indices per participating IR. One index includes the page level data, the second index includes the country of origin and device type data.

    About Citable Content Downloads

    Data visualizations and aggregations in RAMP dashboards present information about citable content downloads, or CCD. As a measure of use of institutional repository content, CCD represent click activity on IR content that may correspond to research use.

    CCD information is summary data calculated on the fly within the RAMP web application. As noted above, data provided by GSC include whether and how many times a URL was clicked by users. Within RAMP, a "click" is counted as a potential download, so a CCD is calculated as the sum of clicks on pages/URLs that are determined to point to citable content (as defined above).

    For any specified date range, the steps to calculate CCD are:

    Filter data to only include rows where "citableContent" is set to "Yes."
    Sum the value of the "clicks" field on these rows.
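
    Using the published page-clicks CSV files described under "Output to CSV" below, the same two steps can be sketched with pandas (the filename follows the convention given later in this record):

      # Compute citable content downloads (CCD) for one month of page-click data.
      import pandas as pd

      df = pd.read_csv("2021-01_RAMP_all_page-clicks.csv")

      ccd = df.loc[df["citableContent"] == "Yes", "clicks"].sum()
      print("CCD, January 2021, all repositories:", int(ccd))

      # The same filter-and-sum, broken down by participating repository:
      per_repo = df[df["citableContent"] == "Yes"].groupby("repository_id")["clicks"].sum()
      print(per_repo.head())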
    

    Output to CSV

    Published RAMP data are exported from the production Elasticsearch instance and converted to CSV format. The CSV data consist of one "row" for each page or URL from a specific IR which appeared in search result pages (SERP) within Google properties as described above. Also as noted above, daily data are downloaded for each IR in two sets which cannot be combined. One dataset includes the URLs of items that appear in SERP. The second dataset is aggregated by combination of the country from which a search was conducted and the device used.

    As a result, two CSV datasets are provided for each month of published data:

    page-clicks:

    The data in these CSV files correspond to the page-level data, and include the following fields:

    url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property.
    impressions: The number of times the URL appears within the SERP.
    clicks: The number of clicks on a URL which took users to a page outside of the SERP.
    clickThrough: Calculated as the number of clicks divided by the number of impressions.
    position: The position of the URL within the SERP.
    date: The date of the search.
    citableContent: Whether or not the URL points to a content file (ending with pdf, csv, etc.) rather than HTML wrapper pages. Possible values are Yes or No.
    index: The Elasticsearch index corresponding to page click data for a single IR.
    repository_id: This is a human readable alias for the index and identifies the participating repository corresponding to each row. As RAMP has undergone platform and version migrations over time, index names as defined for the previous field have not remained consistent. That is, a single participating repository may have multiple corresponding Elasticsearch index names over time. The repository_id is a canonical identifier that has been added to the data to provide an identifier that can be used to reference a single participating repository across all datasets. Filtering and aggregation for individual repositories or groups of repositories should be done using this field.
    

    Filenames for files containing these data end with “page-clicks”. For example, the file named 2021-01_RAMP_all_page-clicks.csv contains page level click data for all RAMP participating IR for the month of January, 2021.

    country-device-info:

    The data in these CSV files correspond to the data aggregated by country from which a search was conducted and the device used. These include the following fields:

    country: The country from which the corresponding search originated.
    device: The device used for the search.
    impressions: The number of times the URL appears within the SERP.
    clicks: The number of clicks on a URL which took users to a page outside of the SERP.
    clickThrough: Calculated as the number of clicks divided by the number of impressions.
    position: The position of the URL within the SERP.
    date: The date of the search.
    index: The Elasticsearch index corresponding to country and device access information data for a single IR.
    repository_id: This is a human readable alias for the index and identifies the participating repository corresponding to each row. As RAMP has undergone platform and version migrations over time, index names as defined for the previous field have not remained consistent. That is, a single participating repository may have multiple corresponding Elasticsearch index names over time. The repository_id is a canonical identifier that has been added to the data to provide an identifier that can be used to reference a single participating repository across all datasets. Filtering and aggregation for individual repositories or groups of repositories should be done using this field.
    

    Filenames for files containing these data end with “country-device-info”. For example, the file named 2021-01_RAMP_all_country-device-info.csv contains country and device data for all participating IR for the month of January, 2021.

    References

    Google, Inc. (2021). Search Console APIs. Retrieved from https://developers.google.com/webmaster-tools/search-console-api-original.

  17. Mind2Web

    • huggingface.co
    Updated Jun 12, 2023
    + more versions
    Cite
    OSU NLP Group (2023). Mind2Web [Dataset]. https://huggingface.co/datasets/osunlp/Mind2Web
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jun 12, 2023
    Dataset authored and provided by
    OSU NLP Group
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Card for Dataset Name

      Dataset Summary
    

    Mind2Web is a dataset for developing and evaluating generalist agents for the web that can follow language instructions to complete complex tasks on any website. Existing datasets for web agents either use simulated websites or only cover a limited set of websites and tasks, thus not suitable for generalist web agents. With over 2,000 open-ended tasks collected from 137 websites spanning 31 domains and crowdsourced action… See the full description on the dataset page: https://huggingface.co/datasets/osunlp/Mind2Web.
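
    As a minimal usage sketch (not from the dataset card), the training split can be loaded from the Hub and one task record inspected without assuming its schema:

      from datasets import load_dataset

      ds = load_dataset("osunlp/Mind2Web", split="train")
      print(len(ds), "tasks")
      print(sorted(ds[0].keys()))  # inspect a task record's fields before use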

  18. Discover Feed

    • web-highlights.com
    • previewbox.link
    Updated Dec 10, 2019
    Cite
    (2019). Discover Feed [Dataset]. https://web-highlights.com/
    Explore at:
    Dataset updated
    Dec 10, 2019
    Description

    A feed of shared highlights from Web Highlighter users.

  19. IMQ07 - Container Traffic (Lift On/Lift Off) (TEU Twenty-foot Equivalent) by...

    • data.wu.ac.at
    json-stat, px
    Updated Mar 5, 2018
    + more versions
    Cite
    Irish Maritime Development Office (2018). IMQ07 - Container Traffic (Lift On/Lift Off) (TEU Twenty-foot Equivalent) by Country, Quarter and Type of Traffic [Dataset]. https://data.wu.ac.at/schema/data_gov_ie/MGJlZDdhOWQtZmMwYy00YmI2LTk1ZGItYzlhMTNlMGVmYzdh
    Explore at:
    Available download formats: px, json-stat
    Dataset updated
    Mar 5, 2018
    Dataset provided by
    Irish Maritime Development Office
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Container Traffic (Lift On/Lift Off) (TEU Twenty-foot Equivalent) by Country, Quarter and Type of Traffic

    View data using web pages

    Download .px file (Software required)

  20. openwebtext

    • huggingface.co
    • paperswithcode.com
    • +4more
    Updated Jul 17, 2023
    + more versions
    Cite
    Aaron Gokaslan (2023). openwebtext [Dataset]. https://huggingface.co/datasets/Skylion007/openwebtext
    Explore at:
    Dataset updated
    Jul 17, 2023
    Authors
    Aaron Gokaslan
    License

    CC0 1.0 (https://choosealicense.com/licenses/cc0-1.0/)

    Description

    An open-source replication of the WebText dataset from OpenAI.
