TY  - JOUR
AU  - Chandran, Deepa
AU  - Vijendran, Anna Saro
PY  - 2015
TI  - A Layout Based Detachment Approach for Extracting Content from Webpages
JF  - American Journal of Applied Sciences
VL  - 12
IS  - 6
DO  - 10.3844/ajassp.2015.411.420
UR  - https://thescipub.com/abstract/ajassp.2015.411.420
AB  - An enormous amount of useful information on the Internet is formatted for web users, which makes extracting the relevant data from diverse web sources a complex task. Various approaches for extracting data from webpages have recently been proposed. This study presents a simple but effective approach, named the Layout Based Detachment Approach (LBDA). The proposed approach extracts the main content from a webpage by removing irrelevant information such as header and footer content, navigation bars, advertisements and other noisy images. The methodology uses the following techniques: tag tree parsing to obtain the structure for analysis, a block-acquiring page segmentation method to remove unwanted tags, and data extraction to retrieve the necessary content. The approach eliminates noise, effectively extracts the main content blocks from the webpage and displays the essential content to users. Its performance is evaluated using accuracy, precision, recall, execution time and memory usage. The implementation results show that the proposed LBDA performs better than the existing heuristic approach.