- Data Science and Analysis
- Provost and Fawcett (2013) Data Science for Business
- Grus (2015) Data Science from Scratch: First Principles with Python
- Python for Data Analysis
- Jupyter notebooks
- How to collect the right data?
- Savoia (2019) The Right It: Why so many ideas fail and how to make sure yours succeed
- How to collect data in early-stage product ideation?
- Savoia (2019) The Right It: Why so many ideas fail and how to make sure yours succeed
Recommended Books on “How technology is changing the industry and society?”
- Read one of the following books during the course.
- Write a book review with the following questions:
- Why did you select this book?
- Write a brief summary of the book.
- What did you learn from this book? Did you get a new idea from this?
- Andrew McAfee and Erik Brynjolfsson (2017) Machine, Platform, Crowd: Harnessing Our Digital Future. Link: Norton
- Kartik Hosanagar (2019). A Human’s Guide to Machine Intelligence: How algorithms are shaping our lives and how we can stay in control. Link: Penguin Random House
- Cathy O’Neil (2016). Weapons of Math Destruction: How Big Data increases inequality and threatens democracy. Link: Penguin Random House
- Michael D. Smith and Rahul Telang (2016) Streaming, Sharing, Stealing: Big Data and the Future of Entertainment. Link: MIT Press.
- Ajay Agrawal, Joshua Gans, and Avi Goldfarb (2018) Prediction Machines: The Simple Economics of Artificial Intelligence. Link: Book website
- Anindya Ghose (2017). Tap: Unlocking the Mobile Economy. Link: MIT Press.
- Arun Sundararajan (2016) The Sharing Economy: The end of employment and the rise of crowd-based capitalism. Link: MIT Press
- Eric Topol (2019) Deep Medicine: How AI can make healthcare human again. Link: Basic Books.
Discussion: Your Tech/Analytics Story
Your Tech/Analytics Story
- Objective: To understand students’ prior experience and expectations for the course
- Ask students to describe their experiences with technology and analytics
- What prior work/school project experience have you had that required data analysis?
- Programming experience?
- R, Stata, Excel, Tableau, SQL, Python, SPSS/SAS, MATLAB
- Scale: 0 (none), 1 (some familiarity), 2 (used in a project), 3 (strong)
- What do you want to learn about tech/analytics in this course?
- What is the most interesting thing you heard about tech/analytics in the past year?
- Debrief
- Collect text data
- Show word cloud, sentiment analysis, LDA topics
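The debrief steps above can be sketched in a few lines of Python. This is a minimal illustration, not the course's actual tooling: the stopword list, the positive/negative word lists, and the sample responses are all made-up assumptions. It only computes the word counts that would feed a word cloud and a simple lexicon-based sentiment score; LDA topic modeling is left out because it needs an external library.

```python
import re
from collections import Counter

# Tiny illustrative word lists (assumptions); a real debrief would use fuller lexicons.
STOPWORDS = {"i", "a", "an", "the", "to", "and", "is", "but", "in", "of", "want", "about"}
POSITIVE = {"exciting", "useful", "interesting", "fun"}
NEGATIVE = {"confusing", "boring", "hard", "worried"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_frequencies(responses, top_n=5):
    """Count non-stopword tokens across all responses (the input to a word cloud)."""
    counts = Counter(
        token
        for response in responses
        for token in tokenize(response)
        if token not in STOPWORDS
    )
    return counts.most_common(top_n)


def sentiment_score(text):
    """Positive-word count minus negative-word count for one response."""
    tokens = tokenize(text)
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)


# Made-up responses for illustration only.
responses = [
    "I want to learn Python and SQL",
    "Python is exciting but statistics is confusing",
]
print(word_frequencies(responses, top_n=3))
print([sentiment_score(r) for r in responses])
```

A word cloud library would take the frequency counts as input; the sentiment scores can be averaged across the class to summarize the overall tone.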
IT Risk and Stock Price Crash Risk (Working Paper)
Song, Victor, Hasan Cavusoglu, Mary L. Z. Ma, Gene Moo Lee (2023) “IT Risk and Stock Price Crash Risk,” Under review.
- Presented at UBC Cybersecurity 2018 (Vancouver, BC) and HICSS 2020 (Maui, HI)
- Based on Victor Song’s Master’s Thesis and HICSS 2020 Paper
IT risk, especially cybersecurity risk, has rapidly increased and become a top concern for researchers, regulators, firm managers, and investors. This study creates a novel firm-level IT risk measure applicable to all US-listed firms by applying BERTopic topic modeling to risk factors reported in Item 1A of the 10-K annual reports. We validate the measure with multiple approaches including cross-validations, presenting illustrative excerpts of IT risk factors, conducting cross-sectional and over-time distribution analyses, and analyzing firm characteristics associated with IT risk. The measure is found to be heightened in IT-intensive industries and for firms with larger sizes, higher profits, and better growth potential, and it can predict future data breaches. Using this ex-ante IT risk measure, we examine the relation between IT risk and stock price crash risk, which reflects a firm’s propensity for stock price crashes. Our findings suggest that IT risk is positively associated with crash risk, and we identify downward operating risk and the predictability of data breaches as two mechanisms for the crash risk effect of IT risk. By decomposing IT risk into cybersecurity risk and non-cybersecurity IT risk, we find that both types of IT risk increase crash risk, but the effect of cybersecurity risk is stronger than that of non-cybersecurity IT risk, consistent with their different risk natures. We further observe that the novelty and readability of IT risk factors strengthen the crash risk effects of IT risk, consistent with the notion that novelty represents updated and increased IT risk, and readability improves the understanding of IT risk. Lastly, difference-in-differences analyses reveal that IT risk increases stock price crash risk, not the other way around. We conclude the paper by discussing academic contributions and practical implications in the context of the SEC’s directives on reporting and managing IT risk and cybersecurity risk.
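To make "stock price crash risk" concrete, here is a minimal sketch of one measure widely used in the crash risk literature: the negative coefficient of skewness (NCSKEW) of firm-specific weekly returns. The abstract does not say which crash risk measure the paper uses, so treat this as an illustrative assumption. The function name, the demeaning step, and the sample returns are hypothetical.

```python
def ncskew(weekly_returns):
    """Negative coefficient of skewness of a series of weekly returns.

    Higher values mean a more left-skewed return distribution,
    which is read as higher crash risk.
    """
    n = len(weekly_returns)
    if n < 3:
        raise ValueError("need at least 3 observations")
    # Demean the series; firm-specific returns are usually close to mean zero
    # already, so this step is an assumption made for raw inputs.
    mean = sum(weekly_returns) / n
    deviations = [r - mean for r in weekly_returns]
    sum_sq = sum(d ** 2 for d in deviations)
    sum_cu = sum(d ** 3 for d in deviations)
    if sum_sq == 0:
        raise ValueError("returns have zero variance")
    return -(n * (n - 1) ** 1.5 * sum_cu) / ((n - 1) * (n - 2) * sum_sq ** 1.5)


# Made-up return series for illustration only.
symmetric = [-0.02, -0.01, 0.0, 0.01, 0.02]
with_crash = [0.01] * 9 + [-0.20]
print(ncskew(symmetric))
print(ncskew(with_crash))
```

A symmetric series has a score of zero, while a series with one large negative week has a positive score.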
Enhancing Social Media Analysis with Visual Data Analytics: A Deep Learning Approach (MISQ 2020)
Shin, Donghyuk, Shu He, Gene Moo Lee, Andrew B. Whinston, Suleyman Cetintas, Kuang-Chih Lee (2020) Enhancing Social Media Analysis with Visual Data Analytics: A Deep Learning Approach, MIS Quarterly, 44(4), pp. 1459-1492. [SSRN]
- Based on an industry collaboration with Yahoo! Research
- The first MISQ methods article based on machine learning
- Presented at WeB (Fort Worth, TX 2015), WITS (Dallas, TX 2015), UT Arlington (2016), Texas FreshAIR (San Antonio, TX 2016), SKKU (2016), Korea Univ. (2016), Hanyang (2016), Kyung Hee (2016), Chung-Ang (2016), Yonsei (2016), Seoul National Univ. (2016), Kyungpook National Univ. (2016), UKC (Dallas, TX 2016), UBC (2016), INFORMS CIST (Nashville, TN 2016), DSI (Austin, TX 2016), Univ. of North Texas (2017), Arizona State (2018), Simon Fraser (2019), Saarland (2021), Kyung Hee (2021), Tennessee Chattanooga (2021), Rochester (2021), KAIST (2021), Yonsei (2021), UBC (2022), Temple (2023)
This research methods article proposes a visual data analytics framework to enhance social media research using deep learning models. Drawing on the literature of information systems and marketing, complemented with data-driven methods, we propose a number of visual and textual content features including complexity, similarity, and consistency measures that can play important roles in the persuasiveness of social media content. We then employ state-of-the-art machine learning approaches such as deep learning and text mining to operationalize these new content features in a scalable and systematic manner. For the newly developed features, we validate them against human coders on Amazon Mechanical Turk. Furthermore, we conduct two case studies with a large social media dataset from Tumblr to show the effectiveness of the proposed content features. The first case study demonstrates that both theoretically motivated and data-driven features significantly improve the model’s power to predict the popularity of a post, and the second one highlights the relationships between content features and consumer evaluations of the corresponding posts. The proposed research framework illustrates how deep learning methods can enhance the analysis of unstructured visual and textual data for social media research.
Understanding Security Vulnerability Awareness, Firm Incentives, and ICT Development in Pan-Asia (JMIS 2020)
Zhuang, Yunhui, Yunsik Choi, Shu He, Alvin Chung Man Leung, Gene Moo Lee, Andrew B. Whinston (2020) Understanding Security Vulnerability Awareness, Firm Incentives, and ICT Development in Pan-Asia. Journal of Management Information Systems, 37(3): 668-693.
- Funded by the US National Science Foundation (Award #1718600) and the Hong Kong Policy Innovation and Coordination Office (2015.A1.030.16A)
- Best WIP Runner-Up Award at WITS 2017
- Presented at WITS 2017 (Seoul, Korea), WEIS 2018 (Innsbruck, Austria), BIGS 2018 (San Francisco, CA), and HICSS 2020 (Maui, HI)
- Media coverage: [UBC Sauder] [Phys.org] [Science Daily] [UConn Today] [Security.com]
- Research assistants: Markus Iivonen, Mark Varga
- Previous titles:
- Information Disclosure and Security Vulnerability Awareness: A Large-Scale Randomized Field Experiment in Pan-Asia
- Information Disclosure and Security Policy Design: A Large-Scale Randomization Experiment in Pan-Asia
This paper investigates how the awareness of a security vulnerability index affects firms’ security protection strategy and how the information awareness effect interacts with firm incentives and country-wide IT development level. The security index is constructed based on outgoing spam and phishing website hosting, which may serve as an indicator of a firm’s security controls. To study whether security vulnerability awareness causes firms to improve their security, we conducted a randomized field experiment on 1,262 firms in six Pan-Asian countries and regions. We alerted 631 randomly selected treated firms to their security vulnerability index and their rankings relative to their peers via advisory emails and websites. Difference-in-differences analyses show that, compared with the controls, the treated firms improve their security over time, with a statistically significant reduction of outgoing spam volume according to one of the data sources but not of phishing website hosting. However, a statistically significant reduction in phishing website hosting was observed among non-web hosting firms, suggesting that firms’ underlying incentives play an important role in the treatment effect. Lastly, exploiting the multi-country nature of the data, we found that firms in countries with high information and communications technology (ICT) development are more responsive to our intervention because they have higher IT capabilities and more resources to resolve security issues. Our study provides cybersecurity policymakers with useful insights on how firm incentives and ICT environments play roles in firms’ security measure adoption.
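The difference-in-differences logic mentioned in the abstract can be illustrated with a minimal two-group, two-period sketch. This is not the paper's estimation, which would typically use a regression with controls and standard errors. The function name and the spam volumes below are made-up assumptions.

```python
from statistics import mean


def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Basic 2x2 difference-in-differences estimate.

    Change in the treated group minus change in the control group.
    A negative value means the outcome (e.g., outgoing spam volume)
    fell more for treated firms than for control firms.
    """
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical outgoing spam volumes per firm, before and after the advisory emails.
treated_pre = [100, 120]
treated_post = [80, 90]
control_pre = [110, 130]
control_post = [105, 125]

effect = did_estimate(treated_pre, treated_post, control_pre, control_post)
print(effect)
```

Subtracting the control group's change removes time trends that affect all firms, so the remainder is attributed to the treatment under the usual parallel-trends assumption.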
Public data sources
- Papers with code: https://paperswithcode.com/datasets
- Statistics Canada: https://www150.statcan.gc.ca/n1/en/type/data
- Kaggle: https://www.kaggle.com/datasets
- UCI Machine Learning: https://archive.ics.uci.edu/ml/datasets.php
- US Data.gov: https://www.data.gov/
- Europe: https://data.europa.eu/
- Yelp: https://www.yelp.com/dataset
- YouTube 8M Video Understanding: https://www.kaggle.com/c/youtube8m
- DataBC: https://data.gov.bc.ca/
- Spotify: https://research.atspotify.com/datasets/
- SEC EDGAR: https://www.sec.gov/os/accessing-edgar-data, https://www.sec.gov/dera/data/edgar-log-file-data-set.html
- Wharton Customer Analytics: http://wcai.wharton.upenn.edu/for-researchers/research-opportunities/
- US Patent: https://www.uspto.gov/patents/search
- Finance Data: https://pandas-datareader.readthedocs.io/en/latest/remote_data.html
- Job Classification Data: https://www.onetonline.org/help/onet/database
- Stanford SNAP Datasets: http://snap.stanford.edu/data/index.html
- UCSD Amazon Product Data: https://jmcauley.ucsd.edu/data/amazon/
- UCSD RecSys / Personalization Datasets: https://cseweb.ucsd.edu/~jmcauley/datasets.html
Matching Mobile Applications for Cross Promotion (ISR 2020)
Lee, Gene Moo, Shu He, Joowon Lee, Andrew B. Whinston (2020) Matching Mobile Applications for Cross-Promotion. Information Systems Research 31(3), pp. 865-891.
- Based on an industry collaboration with IGAWorks
- Presented at Chicago Marketing Analytics (Chicago, IL 2013), WeB (Auckland, New Zealand 2014), Notre Dame (2015), Temple (2015), UC Irvine (2015), Indiana (2015), UT Dallas (2015), Minnesota (2015), UT Arlington (2015), Michigan State (2016), Korea Univ (2021)
- Dissertation Paper #3
- Research assistant: Raymond Situ
The mobile applications (apps) market is one of the most successful software markets. As the platform grows rapidly, with millions of apps and billions of users, search costs are increasing tremendously. The challenge is how app developers can target the right users with their apps and how consumers can find the apps that fit their needs. Cross-promotion, advertising a mobile app (target app) in another app (source app), is introduced as a new app-promotion framework to alleviate the issue of search costs. In this paper, we model source app user behaviors (downloads and postdownload usages) with respect to different target apps in cross-promotion campaigns. We construct a novel app similarity measure using latent Dirichlet allocation topic modeling on apps’ product descriptions and then analyze how the similarity between the source and target apps influences users’ app download and usage decisions. To estimate the model, we use a unique data set from a large-scale random matching experiment conducted by a major mobile advertising company in Korea. The empirical results show that consumers prefer more diversified apps when they are making download decisions compared with their usage decisions, which is supported by the psychology literature on people’s variety-seeking behavior. Lastly, we propose an app-matching system based on machine-learning models (on app download and usage prediction) and generalized deferred acceptance algorithms. The simulation results show that app analytics capability is essential in building accurate prediction models and in increasing ad effectiveness of cross-promotion campaigns and that, at the expense of privacy, individual user data can further improve the matching performance. This paper has implications for the trade-off between utility and privacy in the growing mobile economy.
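For readers unfamiliar with deferred acceptance, here is a minimal sketch of the basic one-to-one version (Gale-Shapley), with source apps proposing to target apps. The paper uses generalized deferred acceptance algorithms driven by machine-learning predictions; this simplified variant, the function name, the app names, and the preference lists are all illustrative assumptions.

```python
def deferred_acceptance(proposer_prefs, receiver_prefs):
    """Basic one-to-one deferred acceptance.

    proposer_prefs: dict mapping each proposer to an ordered list of receivers.
    receiver_prefs: dict mapping each receiver to an ordered list of proposers.
    Returns a dict mapping matched proposers to receivers.
    """
    # Lower rank value means the receiver likes that proposer more.
    rank = {
        r: {p: i for i, p in enumerate(prefs)}
        for r, prefs in receiver_prefs.items()
    }
    next_choice = {p: 0 for p in proposer_prefs}  # index of the next receiver to try
    held = {}  # receiver -> proposer it is tentatively holding
    free = list(proposer_prefs)

    while free:
        p = free.pop()
        prefs = proposer_prefs[p]
        if next_choice[p] >= len(prefs):
            continue  # p has exhausted its list and stays unmatched
        r = prefs[next_choice[p]]
        next_choice[p] += 1
        r_rank = rank.get(r, {})
        if p not in r_rank:
            free.append(p)  # r finds p unacceptable; p tries its next choice
        elif r not in held:
            held[r] = p  # r tentatively accepts its first proposal
        elif r_rank[p] < r_rank[held[r]]:
            free.append(held[r])  # r trades up and releases its previous proposer
            held[r] = p
        else:
            free.append(p)  # r rejects p and keeps its current proposer

    return {p: r for r, p in held.items()}


# Hypothetical preferences, e.g., derived from predicted download or usage scores.
source_prefs = {
    "game_a": ["puzzle_x", "news_y"],
    "game_b": ["puzzle_x", "news_y"],
}
target_prefs = {
    "puzzle_x": ["game_b", "game_a"],
    "news_y": ["game_a", "game_b"],
}
matching = deferred_acceptance(source_prefs, target_prefs)
print(matching)
```

The resulting matching is stable: no source app and target app would both prefer each other over their assigned partners.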
Development of Topic Trend Analysis Model for Industrial Intelligence using Public Data (J. Technology Innovation 2018)
Park, S., Lee, G. M., Kim, Y.-E., Seo, J. (2018). Development of Topic Trend Analysis Model for Industrial Intelligence using Public Data (in Korean), Journal of Technology Innovation, 26(4), 199-232.
- Funded by the Korea Institute of Science and Technology Information (KISTI)
- Demo website: https://misr.sauder.ubc.ca/edgar_dashboard/
- Presented at UKC (2017), KISTI (2017), WITS (2017), Rutgers Business School (2018)
There is a growing need to understand the business management environment through big data analysis at the industry and firm levels. Research using company disclosure information, which comprehensively covers a company’s business performance and future plans, is attracting attention. However, there is limited research on developing applicable analytical models leveraging such corporate disclosure data due to its unstructured nature. This study proposes a text-mining-based analytical model for industry- and firm-level analyses using publicly available company disclosure data. Specifically, we apply the LDA topic model and the word2vec word embedding model to U.S. SEC data from publicly listed firms and analyze the trends of business topics at the industry and firm levels.
Using LDA topic modeling on SEC EDGAR 10-K documents, we identify management topics across industries. To compare topic trends across industries, we contrast the software and hardware industries over the most recent 20 years. We also observe firm-level changes in management topics by comparing two companies in the software industry. These changes in topic trends provide a lens for identifying declining and growing management topics at the industry and firm levels. In addition, we map software companies and their products (or services) by applying the word2vec word embedding model to firm-level 10-K documents and reducing dimensionality with principal component analysis. This mapping identifies companies and products (or services) with similar management topics and shows how they change over decades.
By suggesting a methodology for building analytical models from public management data at the industry and firm levels, this study may contribute a practical foundation for identifying changes in management topics. However, further research is required to provide a finer-grained analytical model of the relationship between technology management strategy and management performance across different patterns of management topics, such as frequent changes in topics or their momentum. More studies are also needed to develop a competitive context analysis model based on product (service) portfolios across firms.
Developing Cyber Risk Assessment Framework for Cyber Insurance: A Big Data Approach (KIRI Research Report 2018)
Lee, G. M. (2018). Developing Cyber Risk Assessment Framework for Cyber Insurance: A Big Data Approach (in Korean). KIRI Research Report 2018-15.
- Funded by Korea Insurance Research Institute (KIRI)
- Presented at KIRI (2018)
- Demo website: https://misr.sauder.ubc.ca/cyberrisk/
- Research assistant: Austin Cho
As our society is heavily dependent on information and communication technology, the associated risk has also significantly increased. Cyber insurance has emerged as a possible means to better manage such cyber risk. However, the cyber insurance market is still at an early stage due to the lack of data sharing and standards on cyber risk and cyber insurance. To address this issue, this research proposes a data-driven framework to assess cyber risk using externally observable cyber attack data sources such as outbound spam and phishing websites. We show the feasibility of such an approach by building cyber risk assessment reports for Korean organizations. Then, by conducting a large-scale randomized field experiment, we measure the causal effect of cyber risk disclosure on organizational security levels. Finally, we develop machine-learning models to predict data breach incidents, as a case of cyber incidents, using the developed cyber risk assessment data. We believe that the proposed data-driven methods can be a stepping-stone to enable information transparency in the cyber insurance market.