
10.2019 - Internship topic proposal for master 2 students

INTERNSHIP TOPIC PROPOSAL FOR MASTER 2 STUDENTS
(3-4 months, March-June 2020)
 
 
 
SUPERVISOR INFORMATION
Lê Thị Hoài An
Email: hoai-an.le-thi@univ-lorraine.fr
Tel: (33) [0]3 72 74 79 51
Lê Hoài Minh
Email: minh.le@univ-lorraine.fr
Tel: (33) [0]3 72 74 79 54
Host organization: Informatics & Applications Dept, LGIPM - University of Lorraine.
 
Topic: Text classification
Context:
Automated classification of text into predefined categories has long been considered a vital method for managing and processing the vast and continuously growing amount of documents in digital form. Text classification is commonly used in domains such as document organization and retrieval, opinion mining, email classification, and spam filtering.
Since text may be modeled as quantitative data with frequencies on the word attributes, most methods for quantitative data can be applied directly to text. However, text is a particular kind of data in which the word attributes are sparse and high-dimensional, with low frequencies on most words. It is therefore critical to design classification methods that effectively account for these characteristics of text.
Generally, a text classification system's pipeline can be illustrated as in the figure below.
In the first step, Feature Extraction, text sequences are converted into a structured feature space. Common feature extraction techniques include Term Frequency (TF), Term Frequency-Inverse Document Frequency (TF-IDF), Word2Vec, and Global Vectors for Word Representation (GloVe).
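As a concrete illustration of this step, TF-IDF can be sketched in a few lines of pure Python (a toy version for intuition only; in practice one would use a library implementation such as scikit-learn's TfidfVectorizer, and the smoothing variant below is one of several in use):

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Toy TF-IDF on whitespace-tokenized documents.
    TF  = term count / document length
    IDF = log(N / df) + 1  (the +1 keeps words occurring in every
    document from vanishing entirely)."""
    docs = [doc.split() for doc in corpus]
    vocab = sorted({w for d in docs for w in d})
    n = len(docs)
    df = {w: sum(1 for d in docs if w in d) for w in vocab}
    idf = {w: math.log(n / df[w]) + 1.0 for w in vocab}
    matrix = []
    for d in docs:
        counts = Counter(d)
        matrix.append([counts[w] / len(d) * idf[w] for w in vocab])
    return vocab, matrix
```

Note how the resulting document-term matrix exhibits exactly the characteristics discussed above: most entries are zero, and the number of columns grows with the vocabulary.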
As text or document data sets often contain many unique words, data pre-processing steps can suffer from high time and memory complexity. Hence, one can apply a Dimensionality Reduction step to reduce the data dimension. The most common dimensionality reduction techniques include Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), non-negative matrix factorization (NMF), random projection, autoencoders, and t-distributed stochastic neighbor embedding (t-SNE).
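Of these, random projection is the simplest to sketch. The following illustrative pure-Python version uses a random sign matrix (a Johnson-Lindenstrauss-style variant; library implementations such as scikit-learn's random projection transformers would be used in practice):

```python
import random

def random_projection(X, k, seed=0):
    """Project n x d row vectors X down to k dimensions by multiplying
    with a d x k matrix of random +/-1 entries scaled by 1/sqrt(k),
    which approximately preserves pairwise distances
    (Johnson-Lindenstrauss lemma)."""
    rng = random.Random(seed)
    d = len(X[0])
    R = [[rng.choice((-1.0, 1.0)) / k ** 0.5 for _ in range(k)]
         for _ in range(d)]
    return [[sum(x[i] * R[i][j] for i in range(d)) for j in range(k)]
            for x in X]
```

Unlike PCA or NMF, random projection is data-independent: the matrix R is drawn without looking at X, which makes it cheap even for very high-dimensional text data.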
Finally, the most important step of the text classification pipeline is building the best classifier. Several classification methods have been studied for text classification, such as SVM, logistic regression, and the Naïve Bayes classifier. In recent years, deep learning approaches have also been developed and have achieved very good results.
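To make the classification step concrete, here is a minimal multinomial Naïve Bayes classifier on word counts (a toy sketch with Laplace smoothing; the spam/ham data in the usage below is purely illustrative):

```python
import math
from collections import Counter

class MultinomialNB:
    """Tiny multinomial Naive Bayes for whitespace-tokenized text,
    with Laplace (add-one) smoothing."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.vocab = sorted({w for d in docs for w in d.split()})
        self.prior = {c: math.log(labels.count(c) / len(labels))
                      for c in self.classes}
        counts = {c: Counter() for c in self.classes}
        for d, y in zip(docs, labels):
            counts[y].update(d.split())
        V = len(self.vocab)
        # log P(word | class) with add-one smoothing
        self.loglik = {
            c: {w: math.log((counts[c][w] + 1)
                            / (sum(counts[c].values()) + V))
                for w in self.vocab}
            for c in self.classes
        }
        return self

    def predict(self, doc):
        def score(c):  # log prior + sum of log likelihoods (unseen words skipped)
            return self.prior[c] + sum(self.loglik[c].get(w, 0.0)
                                       for w in doc.split())
        return max(self.classes, key=score)
```

For example, trained on four toy documents labeled "spam"/"ham", the classifier assigns "cheap pills" to spam and "meeting today" to ham, because those words dominate the per-class word counts.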
Objective:
The objective of this Master project is to develop novel optimization techniques (models and methods) for the text classification problem.
The project includes the following main tasks:
  • Building new optimization models for text classification and developing optimization methods based on DC programming and DCA [4,5,6,7] to solve the proposed models. For instance, a joint model that combines the two steps of dimensionality reduction (e.g., t-SNE) and classification; this combination leads to a non-convex optimization problem. Another research direction could be developing DCA for solving deep learning models such as conditional random fields (CRFs).
  • Evaluating and comparing the performance of the proposed models and methods against existing works.
Used Methods and Techniques:
The methodological basis of the new approach is DC (Difference of Convex functions) programming and DCA (DC Algorithms), which are internationally recognized as the state of the art in nonconvex programming and global optimization, and which have been successfully applied by researchers and practitioners around the world to model and solve large-scale nonconvex programs in different fields of applied science. In particular, DC programming and DCA are well known to the Data Mining and Machine Learning community for their high performance and their ability to handle large-scale settings in these areas (see, for example, the references on DCA in [4]).
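The basic DCA iteration can be illustrated on a one-dimensional toy DC program (a sketch for intuition only; the models in this project would be far more involved). For f(x) = g(x) - h(x) with g(x) = x^4 and h(x) = x^2, both convex but f nonconvex, each DCA iteration linearizes h at the current point and solves the resulting convex subproblem exactly:

```python
def dca_toy(x0, iters=50):
    """DCA on min f(x) = g(x) - h(x), with g(x) = x**4 and h(x) = x**2
    (both convex; f is nonconvex with minimizers at x = +/- 1/sqrt(2)).
    Step 1: y_k = h'(x_k) = 2*x_k               (a subgradient of h)
    Step 2: x_{k+1} = argmin_x g(x) - y_k * x   (convex; 4x^3 = y_k)
    Use a positive x0 so the real cube root below is well-defined."""
    x = x0
    for _ in range(iters):
        y = 2.0 * x                   # linearize the concave part -h at x_k
        x = (y / 4.0) ** (1.0 / 3.0)  # closed-form convex subproblem solution
    return x
```

Starting from x0 = 1.0, the iterates converge to 1/sqrt(2), where f attains its minimum value -1/4; each step solves only a convex problem, which is the key feature DCA exploits on large-scale models.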
Some references:
  1. Kamran Kowsari, Kiana Jafari Meimandi, Mojtaba Heidarysafa, Sanjana Mendu, Laura Barnes and Donald Brown, Text Classification Algorithms: A Survey. Information, Vol. 10, pp. 150-218, 2019.
  2. Aggarwal C.C., Zhai C., A Survey of Text Classification Algorithms. In: Aggarwal C., Zhai C. (eds) Mining Text Data. Springer, 2012.
  3. M. Thangaraj, M. Sivakami, Text Classification Techniques: A Literature Review. Interdisciplinary Journal of Information, Knowledge, and Management, Vol. 13, pp. 117-135, 2018.
  4. http://www.lita.univ-lorraine.fr/~lethi/index.php/dca.html
  5. Pham Dinh Tao, Le Thi Hoai An, Convex analysis approach to d.c. programming: Theory, Algorithm and Applications. Acta Mathematica Vietnamica, Vol. 22, No. 1, pp. 289-355, dedicated to Professor Hoang Tuy on the occasion of his 70th birthday, 1997.
  6. Pham Dinh Tao, Le Thi Hoai An, Recent advances in DC programming and DCA. Transactions on Computational Collective Intelligence, Vol. 8342, pp. 1-37, 2014.
  7. Le Thi Hoai An, Pham Dinh Tao, DC programming and DCA: thirty years of developments. Mathematical Programming, Vol. 169, No. 1, pp. 5-68, May 2018.