
5 posts tagged with "Algorithms"

Algorithms and computational methods


Practical Data Science: How to Easily Rank in Kaggle Beginner NLP Competition Using sklearn

· 6 min read
Marvin Zhang
Software Engineer & Open Source Enthusiast

Introduction

Kaggle is an online community and data science competition platform for data scientists, machine learning engineers, and data analysts, featuring many prized competitions and public datasets. The Kaggle community is well known in the data science field, and major internet companies publish competitions there with prizes ranging from tens of thousands to millions of dollars. This article describes my recent participation in a Kaggle beginner NLP competition, which offers no cash reward but is a good way to learn NLP-related machine learning.

Kaggle Competition

Competition Overview

This data science competition asks participants to determine whether a given tweet is about a real disaster. The image below shows one such tweet containing the keyword "ABLAZE," indicating that it is about a house catching fire.

(Image: a disaster tweet)
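As a taste of the approach, below is a minimal baseline sketch in Python with sklearn: TF-IDF features feeding a logistic regression. The file path and the "text"/"target" column names are assumptions based on the competition's data description, and F1 as the scoring metric follows the Kaggle evaluation page; treat this as a starting point, not the full solution from the post.

# A minimal baseline sketch; file path and column names are assumptions.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

train = pd.read_csv("train.csv")  # assumed: competition data with "text" and "target"

# TF-IDF features + a linear classifier: a common first baseline for text.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),
    LogisticRegression(max_iter=1000),
)

scores = cross_val_score(model, train["text"], train["target"], cv=5, scoring="f1")
print(f"CV F1: {scores.mean():.3f}")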

Web Crawler in Action: How to use Webspot to implement automatic recognition and data extraction of list web pages

· 7 min read
Marvin Zhang
Software Engineer & Open Source Enthusiast

Introduction

Extracting data from list web pages is one of the most common web data extraction tasks. For engineers who write web crawlers, generating extraction rules efficiently is essential; otherwise, much of their time is wasted writing CSS selectors and XPath rules. In light of this issue, this article introduces an example of using the open source tool Webspot to automatically recognize and extract data from list web pages.

Webspot

Webspot is an open source project aimed at automating web page data extraction. Currently, it supports recognition and crawling-rule extraction for list pages and pagination. In addition, it provides a web UI for users to visually inspect the identified results, and it allows developers to use APIs to obtain the recognition results.

Installing Webspot is quite easy: you can refer to the official documentation for the installation tutorial with Docker and Docker Compose. Execute the commands below to install and start Webspot.

# clone git repo
git clone https://github.com/crawlab-team/webspot

# enter the project directory
cd webspot

# start docker containers
docker-compose up -d
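Once the containers are up, recognition results can also be fetched over HTTP, per the API mentioned above. The sketch below is hypothetical: the port, endpoint path, and payload fields are assumptions on my part, so check the official Webspot documentation for the actual interface.

# Hypothetical sketch of querying a local Webspot instance over HTTP.
# The port, endpoint path, and payload fields are assumptions, not the
# documented API; consult the official docs for the real interface.
import requests

resp = requests.post(
    "http://localhost:9999/api/requests",  # assumed default host/port and path
    json={"url": "https://quotes.toscrape.com"},  # list page to recognize
)
resp.raise_for_status()
print(resp.json())  # recognized list/pagination extraction rules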

Talking Algorithm: Exploration of Intelligent Web Crawlers

· 8 min read
Marvin Zhang
Software Engineer & Open Source Enthusiast

Introduction

"If I had asked people what they wanted, they would have said faster horses" -- Henry Ford

Today is the era of artificial intelligence. Whether through ChatGPT or the wave of intelligent applications that followed it, many people now see a coming sci-fi world that was almost unimaginable a few years ago. In the field of web crawlers, however, artificial intelligence does not seem to be involved much yet. It is true that crawlers, as an "ancient" technology, have enabled entire industries such as search engines, news aggregation, and data analysis over the past 20 years, but we have not seen an obvious technological breakthrough: crawler engineers still rely mainly on techniques such as XPath and reverse engineering to obtain web data automatically. With the development of artificial intelligence and machine learning, though, crawler technology could theoretically achieve "autonomous driving". This article introduces the current status and possible future directions of so-called intelligent crawlers (intelligent, automated data extraction technology) from multiple perspectives.

Current Web Crawling Technology

A web crawler is an automated program used to obtain data from the Internet or other computer networks. It typically uses automated scraping techniques to visit websites and to collect, parse, and store their information, which can be structured or unstructured data.

Crawler technology in the traditional sense mainly includes the following modules or systems (a minimal sketch follows below):

  1. Network request: initiate an HTTP request to a website or web page to obtain data such as HTML;
  2. Web page parsing: parse the HTML into a structured tree and extract target data through XPath or CSS Selector;
  3. Data storage: store the parsed structured data in a database or in files;
  4. URL management: maintain the lists of URLs to be crawled and already crawled, e.g. resolving and requesting URLs for pagination or list pages.

(Image: web crawling system)
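Here is a minimal sketch of those four modules in Python, using the well-known requests and BeautifulSoup libraries; the target site (a public scraping sandbox) and the CSS selectors are illustrative choices, not part of the original post.

# A minimal sketch of the four crawler modules, against a public sandbox site.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

to_crawl = ["https://quotes.toscrape.com/"]  # URL management: queue of pages
seen = set()                                 # URL management: already crawled
results = []                                 # data storage: in-memory for brevity

while to_crawl:
    url = to_crawl.pop()
    if url in seen:
        continue
    seen.add(url)

    html = requests.get(url, timeout=10).text    # network request
    soup = BeautifulSoup(html, "html.parser")    # web page parsing

    for quote in soup.select(".quote .text"):    # CSS Selector extraction
        results.append(quote.get_text())

    next_link = soup.select_one("li.next a")     # pagination handling
    if next_link:
        to_crawl.append(urljoin(url, next_link["href"]))

print(f"collected {len(results)} quotes")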

On Theory: Why Graph Theory is Essential Knowledge for All Industries Today

· 5 min read
Marvin Zhang
Software Engineer & Open Source Enthusiast

Introduction

"Entities should not be multiplied without necessity" -- Ockham's Razor Principle

Graph theory is a foundational mathematical theory that has been severely underestimated by the public. It does not study images, pictures, or charts, but rather an abstract and simple mathematical object. The graph in graph theory is an abstract concept, much like a relationship network, with nodes (or vertices) and edges representing the relationships between them. The concepts of graph theory are very simple: graphs, nodes, and edges. This article briefly introduces the basic concepts of graph theory and its applications in the real world. (Note: this is not a scientific paper, so there won't be boring mathematical formulas; please enjoy reading.)

(Image: a graph)

Graph Theory Overview

In graph theory, there are three important concepts:

  1. Node: Can be understood as an entity, such as Zhang San, Li Si, Wang Wu in a relationship network;
  2. Edge: Can be understood as relationships between entities, for example, Zhang San and Li Si are husband and wife, Wang Wu is their son;
  3. Graph: Can be understood as the collection of all nodes and edges, such as the happy family composed of Zhang San, Li Si, and Wang Wu.

From these three basic concepts, we can infer relationships between nodes. For example, Li Si's older brother Li Yi would be Wang Wu's uncle, and Wang Wu his nephew.
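To make this concrete, here is a small sketch of that family network in Python using networkx (an illustrative library choice on my part; a plain adjacency dict would work just as well):

# The family relationship network as a graph: nodes are people,
# edges carry the relationship as an attribute.
import networkx as nx

G = nx.Graph()
G.add_edge("Zhang San", "Li Si", relation="married to")
G.add_edge("Li Si", "Wang Wu", relation="parent of")
G.add_edge("Zhang San", "Wang Wu", relation="parent of")
G.add_edge("Li Yi", "Li Si", relation="sibling of")

# Inferring the indirect uncle-nephew relationship: Li Yi reaches
# Wang Wu through Li Si, i.e. a two-hop path in the graph.
print(nx.shortest_path(G, "Li Yi", "Wang Wu"))  # ['Li Yi', 'Li Si', 'Wang Wu']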

Talking Algorithm: The hidden secret of nature in the divide-and-conquer algorithm

· 3 min read
Marvin Zhang
Software Engineer & Open Source Enthusiast

Introduction

"The empire, long divided, must unite; long united, must divide. " -- The Romance of the Three Kingdoms

It is hard to deny the importance of algorithms in the modern IT industry. A well-designed algorithm lets software run with minimal resources in the most efficient way. Algorithms matter so much that IT companies set high standards for them in the hiring process; think of the algorithm tests in technical interviews. Many people might feel that algorithms are distant from us, but I think their efficiency gains come naturally, as the reason behind them can be found in nature.

From a snowflake

(Image: a snowflake)

As we all know, a snowflake is beautiful, not only because of its shining appearance, but also because of its shape. It looks like a polished hexagon, and each branch is a snowflake-like sub-hexagon. The term for this recursively self-similar structure is fractal. Fractals are common in nature: tree roots and leaf branches, pulmonary capillaries, natural coastlines, even broken glass.

So, why? Why are fractals so common in nature? Are they a design of the gods, or are there fundamental mathematical laws behind them? Researchers believe the latter. According to thermodynamics, snowflakes form when water vapor encounters an abrupt drop in temperature; according to fluid mechanics, the tree-like structure of pulmonary capillaries allows oxygen to be absorbed efficiently by red blood cells. In short, fractals exist for efficiency.
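The same recursive self-similarity is what drives divide-and-conquer algorithms: split a problem into fractal-like sub-problems until they become trivial, then combine the results. As a quick illustration (my choice of example, not taken from the post), here is the classic merge sort in Python:

# Merge sort: a classic divide-and-conquer algorithm whose call tree
# has the same recursive self-similarity as a fractal.
def merge_sort(items):
    if len(items) <= 1:               # trivially small piece: nothing to divide
        return items
    mid = len(items) // 2
    left = merge_sort(items[:mid])    # conquer each half recursively
    right = merge_sort(items[mid:])
    # merge: combine two sorted halves in linear time
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]

print(merge_sort([5, 2, 4, 7, 1, 3, 2, 6]))  # [1, 2, 2, 3, 4, 5, 6, 7]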